# Predicting House Prices with Logistic Regression 📈

**Authors:** [Melissa Perez](https://github.com/MelissaPerez09), [Adrian Flores](https://github.com/adrianRFlores), [Andrea Ramirez](https://github.com/Andrea-gt)

**Description:**

## Import Libraries

In [2]:
import numpy as np
import pandas as pd
import warnings
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import (mean_squared_error, mean_absolute_error, r2_score, confusion_matrix, accuracy_score,
precision_score, recall_score, ConfusionMatrixDisplay)
import seaborn as sns
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegressionCV
from scipy.stats import pointbiserialr
import statsmodels.api as sm

random_state = 42
warnings.filterwarnings("ignore")

## Data Upload 📄

In [3]:
df = pd.read_csv('data/train.csv')
df

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125


## Feature Engineering 🗂️

### Handling Missing Values

In [4]:
# Fill missing values in low NaN count columns
df['Electrical'] = df['Electrical'].fillna('None')
df['MasVnrType'] = df['MasVnrType'].fillna('None')
df['MasVnrArea'] = df['MasVnrArea'].fillna(0)

In [5]:
# Impute missing LotFrontage values based on the median LotFrontage within each neighborhood.
df['LotFrontage'] = df.groupby('Neighborhood')['LotFrontage'].transform(lambda x: x.fillna(x.median()))
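
The group-wise imputation above can be checked on a small example. The sketch below uses a hypothetical toy frame (the values are not taken from `train.csv`) and adds a global-median fallback, which is an assumption on our part: a neighborhood with no observed `LotFrontage` at all would otherwise remain NaN.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from train.csv)
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "LotFrontage": [60.0, np.nan, 80.0, 40.0, 50.0, np.nan],
})

# Same pattern as the notebook cell: fill NaN with the median of the row's own group
toy["LotFrontage"] = (
    toy.groupby("Neighborhood")["LotFrontage"]
       .transform(lambda x: x.fillna(x.median()))
)

# Optional safeguard (an assumption, not part of the notebook):
# fall back to the global median for groups that had no observed value
toy["LotFrontage"] = toy["LotFrontage"].fillna(toy["LotFrontage"].median())

print(toy)
```

Here the missing value in neighborhood A is filled with 70.0 (the median of 60 and 80) and the one in B with 45.0.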

In [6]:
# Fill missing values in FireplaceQu with 'None'.
df['FireplaceQu'] = df['FireplaceQu'].fillna('None')

In [7]:
# Fill missing values in garage-related variables with 'None' (0 for GarageYrBlt).
df['GarageType'] = df['GarageType'].fillna('None')
df['GarageYrBlt'] = df['GarageYrBlt'].fillna(0)
df['GarageFinish'] = df['GarageFinish'].fillna('None')
df['GarageQual'] = df['GarageQual'].fillna('None')
df['GarageCond'] = df['GarageCond'].fillna('None')

In [8]:
# Fill missing values in basement-related variables with 'None'.
df['BsmtQual'] = df['BsmtQual'].fillna('None')
df['BsmtCond'] = df['BsmtCond'].fillna('None')
df['BsmtExposure'] = df['BsmtExposure'].fillna('None')
df['BsmtFinType1'] = df['BsmtFinType1'].fillna('None')
df['BsmtFinType2'] = df['BsmtFinType2'].fillna('None')

In [9]:
# Drop columns with a high count of missing values
df.drop(['Alley', 'PoolQC', 'Fence', 'MiscFeature'], axis=1, inplace=True)
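
One way to decide which columns count as having "a high count of missing values" is to compute the share of NaNs per column. The sketch below does this on a hypothetical toy frame; the 0.5 threshold is an illustrative assumption, not a value used in the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one mostly-empty column
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "Alley": [np.nan, np.nan, np.nan, "Grvl", np.nan],
    "MasVnrArea": [196.0, 0.0, np.nan, 0.0, 350.0],
})

# Fraction of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)

# Illustrative cut-off (an assumption): drop columns that are more than half empty
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index.tolist()
toy = toy.drop(columns=to_drop)

print(to_drop)
```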

### Feature Creation

In [10]:
# Define conditions for categorizing SalePrice
conditions = [
    (df['SalePrice'] < 150000),
    (df['SalePrice'] >= 150000) & (df['SalePrice'] <= 250000),
    (df['SalePrice'] > 250000)
]

# Define labels for the categories
labels = ['economical', 'intermediate', 'expensive']

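# Create a new column 'SalePriceCategory' based on the conditions and labels.
# An explicit string default avoids mixing np.select's integer default (0) with the string labels.
df['SalePriceCategory'] = np.select(conditions, labels, default='unknown')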

df

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,...,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,SalePriceCategory
0,1,60,RL,65.0,8450,Pave,Reg,Lvl,AllPub,Inside,...,0,0,0,0,2,2008,WD,Normal,208500,intermediate
1,2,20,RL,80.0,9600,Pave,Reg,Lvl,AllPub,FR2,...,0,0,0,0,5,2007,WD,Normal,181500,intermediate
2,3,60,RL,68.0,11250,Pave,IR1,Lvl,AllPub,Inside,...,0,0,0,0,9,2008,WD,Normal,223500,intermediate
3,4,70,RL,60.0,9550,Pave,IR1,Lvl,AllPub,Corner,...,0,0,0,0,2,2006,WD,Abnorml,140000,economical
4,5,60,RL,84.0,14260,Pave,IR1,Lvl,AllPub,FR2,...,0,0,0,0,12,2008,WD,Normal,250000,intermediate
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,Reg,Lvl,AllPub,Inside,...,0,0,0,0,8,2007,WD,Normal,175000,intermediate
1456,1457,20,RL,85.0,13175,Pave,Reg,Lvl,AllPub,Inside,...,0,0,0,0,2,2010,WD,Normal,210000,intermediate
1457,1458,70,RL,66.0,9042,Pave,Reg,Lvl,AllPub,Inside,...,0,0,0,2500,5,2010,WD,Normal,266500,expensive
1458,1459,20,RL,68.0,9717,Pave,Reg,Lvl,AllPub,Inside,...,0,0,0,0,4,2010,WD,Normal,142125,economical
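
The same three price bands can also be expressed with `pd.cut`. This is an alternative sketch, not the notebook's method. It assumes integer prices, so the first bin edge is set to 149,999 to keep exactly 150,000 in the intermediate band, as in the conditions above. The sample prices are taken from the rows displayed in the output.

```python
import numpy as np
import pandas as pd

# A few prices from the displayed rows, including both band boundaries
prices = pd.Series([140000, 150000, 208500, 250000, 266500], name="SalePrice")

# Right-closed bins: (-inf, 149999], (149999, 250000], (250000, inf)
categories = pd.cut(
    prices,
    bins=[-np.inf, 149_999, 250_000, np.inf],
    labels=["economical", "intermediate", "expensive"],
)

print(categories.tolist())
```

Unlike `np.select`, `pd.cut` returns an ordered categorical, which keeps the band order available for later sorting or plotting.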


### Feature Encoding

In [11]:
# Columns to encode, separated by feature category
nominalFeatures = ['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
                    'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
                    'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'Foundation',
                    'Heating', 'CentralAir', 'Electrical', 'GarageType', 'SaleType',
                    'SaleCondition','PavedDrive', 'SalePriceCategory']

ordinalFeatures = ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'HeatingQC', 'KitchenQual','FireplaceQu', 'GarageQual', 'GarageCond']

otherOrdinalFeatures = ['BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Functional', 'GarageFinish']

In [12]:
# Convert nominal features into dummy (one-hot) variables
dummies = pd.get_dummies(df[nominalFeatures])
dummies.head()

Unnamed: 0,MSZoning_C (all),MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,Street_Grvl,Street_Pave,LotShape_IR1,LotShape_IR2,LotShape_IR3,...,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial,PavedDrive_N,PavedDrive_P,PavedDrive_Y,SalePriceCategory_economical,SalePriceCategory_expensive,SalePriceCategory_intermediate
0,False,False,False,True,False,False,True,False,False,False,...,False,False,True,False,False,False,True,False,False,True
1,False,False,False,True,False,False,True,False,False,False,...,False,False,True,False,False,False,True,False,False,True
2,False,False,False,True,False,False,True,True,False,False,...,False,False,True,False,False,False,True,False,False,True
3,False,False,False,True,False,False,True,True,False,False,...,False,False,False,False,False,False,True,True,False,False
4,False,False,False,True,False,False,True,True,False,False,...,False,False,True,False,False,False,True,False,False,True


In [13]:
# Drop the original nominal features columns
df = df.drop(nominalFeatures, axis=1)

# Concatenate dummies with original DataFrame
df = pd.concat([df, dummies], axis=1)
df.head()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,ExterQual,...,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial,PavedDrive_N,PavedDrive_P,PavedDrive_Y,SalePriceCategory_economical,SalePriceCategory_expensive,SalePriceCategory_intermediate
0,1,60,65.0,8450,7,5,2003,2003,196.0,Gd,...,False,False,True,False,False,False,True,False,False,True
1,2,20,80.0,9600,6,8,1976,1976,0.0,TA,...,False,False,True,False,False,False,True,False,False,True
2,3,60,68.0,11250,7,5,2001,2002,162.0,Gd,...,False,False,True,False,False,False,True,False,False,True
3,4,70,60.0,9550,7,5,1915,1970,0.0,TA,...,False,False,False,False,False,False,True,True,False,False
4,5,60,84.0,14260,8,5,2000,2000,350.0,Gd,...,False,False,True,False,False,False,True,False,False,True


In [14]:
# Initialize the OrdinalEncoder
encoder = OrdinalEncoder()

# Quality levels, listed from worst to best
ordinalCategories = ['None', 'Po', 'Fa', 'TA', 'Gd', 'Ex']

# Reshape into a single-column 2D array, as expected by fit()
ordinalCategories = np.array(ordinalCategories).reshape(-1, 1)

# Fit the encoder to the quality levels.
# Note: fit() sorts the levels alphabetically (Ex=0, Fa=1, Gd=2, None=3, Po=4, TA=5),
# so the resulting codes do not follow the worst-to-best order listed above.
encoder.fit(ordinalCategories)

# Encode the columns in ordinalFeatures
for feature in ordinalFeatures:
    df[[feature]] = encoder.transform(df[[feature]])
    
df.head()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,ExterQual,...,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial,PavedDrive_N,PavedDrive_P,PavedDrive_Y,SalePriceCategory_economical,SalePriceCategory_expensive,SalePriceCategory_intermediate
0,1,60,65.0,8450,7,5,2003,2003,196.0,2.0,...,False,False,True,False,False,False,True,False,False,True
1,2,20,80.0,9600,6,8,1976,1976,0.0,5.0,...,False,False,True,False,False,False,True,False,False,True
2,3,60,68.0,11250,7,5,2001,2002,162.0,2.0,...,False,False,True,False,False,False,True,False,False,True
3,4,70,60.0,9550,7,5,1915,1970,0.0,5.0,...,False,False,False,False,False,False,True,True,False,False
4,5,60,84.0,14260,8,5,2000,2000,350.0,2.0,...,False,False,True,False,False,False,True,False,False,True
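
In the output above, `Gd` is encoded as 2.0 and `TA` as 5.0. When `OrdinalEncoder` learns its categories from data, it sorts them, so the codes are alphabetical rather than ordered by quality. The sketch below, on a hypothetical toy frame, contrasts that behavior with passing the order explicitly through the `categories` parameter. It illustrates one possible adjustment and is not what the notebook runs.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Quality levels from worst to best, as listed in the notebook
quality_order = ["None", "Po", "Fa", "TA", "Gd", "Ex"]

# Toy data (hypothetical rows)
toy = pd.DataFrame({
    "ExterQual": ["Gd", "TA", "Ex"],
    "BsmtQual": ["None", "Gd", "Fa"],
})

# Fitting on the list itself: categories are learned and sorted alphabetically
alphabetical = OrdinalEncoder().fit(np.array(quality_order).reshape(-1, 1))
print(alphabetical.categories_[0])
alphabetical_codes = alphabetical.transform(toy[["ExterQual"]].to_numpy())

# Passing the order explicitly (one list per column) preserves worst-to-best codes
ordered_encoder = OrdinalEncoder(categories=[quality_order] * toy.shape[1])
ordered_codes = ordered_encoder.fit_transform(toy)

print(alphabetical_codes.ravel())
print(ordered_codes)
```

With the explicit order, `None` maps to 0 and `Ex` to 5, so larger codes mean better quality.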


In [15]:
# Define encoding categories for each other ordinal feature
encodingCategories = [
    ['None', 'No', 'Mn', 'Av', 'Gd'],  # BsmtExposure
    ['None', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],  # BsmtFinType1
    ['None', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],  # BsmtFinType2
    ['Sev', 'Maj2', 'Maj1', 'Mod', 'Min2', 'Min1', 'Typ'], # Functional
    ['None', 'Unf', 'RFn', 'Fin']  # GarageFinish
]

# Reshape each category list into a single-column 2D array
reshapedEncodingCategories = [np.array(categories).reshape(-1, 1) for categories in encodingCategories]

# Encode columns in otherOrdinalFeatures
for feature, categories in zip(otherOrdinalFeatures, reshapedEncodingCategories):
    # Refit the encoder for this feature.
    # Note: fit() sorts the levels alphabetically, so the codes do not follow the listed order.
    encoder.fit(categories)
    df[[feature]] = encoder.transform(df[[feature]])

df.head()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,ExterQual,...,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial,PavedDrive_N,PavedDrive_P,PavedDrive_Y,SalePriceCategory_economical,SalePriceCategory_expensive,SalePriceCategory_intermediate
0,1,60,65.0,8450,7,5,2003,2003,196.0,2.0,...,False,False,True,False,False,False,True,False,False,True
1,2,20,80.0,9600,6,8,1976,1976,0.0,5.0,...,False,False,True,False,False,False,True,False,False,True
2,3,60,68.0,11250,7,5,2001,2002,162.0,2.0,...,False,False,True,False,False,False,True,False,False,True
3,4,70,60.0,9550,7,5,1915,1970,0.0,5.0,...,False,False,False,False,False,False,True,True,False,False
4,5,60,84.0,14260,8,5,2000,2000,350.0,2.0,...,False,False,True,False,False,False,True,False,False,True


## Logistic Regression Models

### Initial Iteration - Identifying Expensive Homes (Using All Variables)

#### Splitting DataSet for First Iteration

In [16]:
# Make a copy of the dataframe
df_cp = df.copy()

# Separate the target variable 'SalePriceCategory_expensive' from the features
y = df_cp.pop('SalePriceCategory_expensive')

# Exclude 'Id', 'SalePrice' and the remaining SalePriceCategory dummies from the features
X = df_cp.loc[:, ~df_cp.columns.isin(['Id', 'SalePrice', 'SalePriceCategory_economical', 'SalePriceCategory_intermediate'])]

In [17]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, train_size=0.8, random_state=random_state)

In [18]:
print(X_train.shape)
print(X_test.shape)

(1168, 224)
(292, 224)


#### Logistic Regression Model

In [19]:
# Initialize Logistic Regression model with cross-validation
clf = LogisticRegressionCV(random_state=random_state, solver='liblinear', max_iter=1000)

# Fit the model to the training data
clf = clf.fit(X_train, y_train)

# Predict the target variable using the trained model on the test data
y_pred = clf.predict(X_test)

In [20]:
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))

Accuracy of logistic regression classifier on test set: 0.95
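
Accuracy alone can hide how well the positive class is detected when one class is much rarer than the other. The notebook already imports `confusion_matrix`, `precision_score` and `recall_score`. The sketch below shows how they complement accuracy, using hypothetical labels rather than the notebook's actual predictions.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical labels: 4 positives ("expensive") out of 20 observations
y_true = np.array([1, 1, 1, 1] + [0] * 16)
# Hypothetical predictions: 2 positives found, 2 missed, 1 false alarm
y_hat = np.array([1, 1, 0, 0] + [1] + [0] * 15)

acc = accuracy_score(y_true, y_hat)    # share of correct predictions
prec = precision_score(y_true, y_hat)  # TP / (TP + FP)
rec = recall_score(y_true, y_hat)      # TP / (TP + FN)
cm = confusion_matrix(y_true, y_hat)   # rows: true class, columns: predicted class

print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f}")
print(cm)
```

In this toy case accuracy is 0.85 even though only half of the positives are recovered (recall 0.50). With the notebook's variables, the analogous calls would take `y_test` and `y_pred`.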


In [21]:
scores = cross_val_score(clf, X_test, y_test, cv=5)

print("Cross-validation scores: ", scores)
print("Average cross-validation score: ", scores.mean())

Cross-validation scores:  [0.94915254 0.93220339 0.94827586 0.94827586 0.9137931 ]
Average cross-validation score:  0.9383401519579193


The cross-validation results show that the logistic regression model performs well on the test data. All five fold scores are above 90%, so the model correctly classifies most observations in every fold. Note that `cross_val_score` is applied to the test split here, so in each fold the model is refit on four fifths of the 292 test rows rather than evaluated with the parameters learned on the training set.

The mean cross-validation score is approximately 0.938: on average, the model correctly classifies 93.8% of the observations in the held-out folds of the test data.
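
A common arrangement is to run cross-validation on the training split and keep the test split for a single final check. The sketch below illustrates that arrangement on synthetic data from `make_classification`. The dataset, the plain `LogisticRegression` estimator and its settings are assumptions for illustration, not the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic binary classification data (stand-in for the housing features)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)

# 5-fold cross-validation on the training split only
cv_scores = cross_val_score(model, X_tr, y_tr, cv=5)

# Final fit on the full training split, then one evaluation on the untouched test split
model.fit(X_tr, y_tr)
test_accuracy = model.score(X_te, y_te)

print("CV scores:", cv_scores)
print("Mean CV score:", cv_scores.mean())
print("Test accuracy:", test_accuracy)
```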

### Multicollinearity Among the Variables

In [22]:
correlation_economical_expensive = pointbiserialr(df['SalePriceCategory_economical'], df['SalePriceCategory_expensive']).correlation
correlation_economical_intermediate = pointbiserialr(df['SalePriceCategory_economical'], df['SalePriceCategory_intermediate']).correlation
correlation_expensive_intermediate = pointbiserialr(df['SalePriceCategory_expensive'], df['SalePriceCategory_intermediate']).correlation

print("Correlation between economical and expensive: ", correlation_economical_expensive)
print("Correlation between economical and intermediate: ", correlation_economical_intermediate)
print("Correlation between expensive and intermediate: ", correlation_expensive_intermediate)

Correlation between economical and expensive:  -0.3564540110325301
Correlation between economical and intermediate:  -0.7411862639265637
Correlation between expensive and intermediate:  -0.3630048782284887


The correlation between SalePriceCategory_economical and SalePriceCategory_intermediate is particularly strong (-0.741), which indicates high multicollinearity. These two variables carry similar information and may be redundant in the model.

The correlations between SalePriceCategory_economical and SalePriceCategory_expensive (-0.356) and between SalePriceCategory_expensive and SalePriceCategory_intermediate (-0.363) are weaker, but still indicate some multicollinearity. Because the three columns are mutually exclusive dummies of the same variable, negative correlations between them are expected by construction.
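
Dummy columns derived from a single categorical variable always sum to 1 in each row, which is why they correlate negatively with each other. The sketch below shows this on a hypothetical toy series and uses `drop_first=True` in `pd.get_dummies` as one common way to remove the redundant column. The notebook does not use this option.

```python
import pandas as pd

# Toy category column (hypothetical values)
category = pd.Series(
    ["economical", "intermediate", "expensive",
     "intermediate", "economical", "intermediate"],
    name="SalePriceCategory",
)

# Full one-hot encoding: exactly one column is 1 in every row
dummies = pd.get_dummies(category, dtype=int)
row_sums = dummies.sum(axis=1)

# Pairwise correlations between mutually exclusive dummies are negative
corr = dummies.corr()

# Dropping the first level removes the exact linear dependence
reduced = pd.get_dummies(category, drop_first=True, dtype=int)

print(row_sums.tolist())
print(corr.round(3))
print(list(reduced.columns))
```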