<a href="https://colab.research.google.com/github/FrancisKurian/CS530/blob/main/CS530_hw8_Midterm_House_Price_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center> CS530 Midterm </center>

### Notes:
Download two data sets house_SalePrice.csv and house_SalePrice_predict.csv from Canvas and answer the following questions. We want to build a model to predict the sales price of homes. The target variable is 'SalePrice'.

The file house_SalePrice.csv is for training. Each row represents one house and contains both the features of the house and its sale price.

The file house_SalePrice_predict.csv contains additional houses but does not include the sale price. Your goal is to predict the price for these houses.

### Questions

#### 1. Pre-processing the data
1). There are missing values in the data. Instead of dropping them, fill them in by setting each missing value to the mean/median/mode of the column.  Here are some references if you need them: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html and https://scikit-learn.org/stable/modules/impute.html

Note:  If there's a more sophisticated method you prefer, you can use that instead.  Just note it.
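As a minimal sketch of the mean/median/mode idea described above, the snippet below fills numeric columns with their median and text columns with their mode. The toy frame and its values are made up for illustration; only the column names are borrowed from the housing data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are invented).
toy = pd.DataFrame({
    "Lot Frontage": [60.0, np.nan, 80.0, 70.0],
    "Mas Vnr Type": ["BrkFace", "Stone", np.nan, "BrkFace"],
})

for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        fill_value = toy[col].median()        # median of the observed values
    else:
        fill_value = toy[col].mode().iloc[0]  # most frequent observed value
    # Assign the filled column back instead of filling in place.
    toy[col] = toy[col].fillna(fill_value)

print(toy)
```

Looping over every column this way also covers columns that have only a single missing value.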

In [None]:
from google.colab import files
uploaded = files.upload()

Saving house_SalePrice.csv to house_SalePrice.csv
Saving house_SalePrice_predict.csv to house_SalePrice_predict.csv


In [None]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

In [None]:
df= pd.read_csv('house_SalePrice.csv')
df_p = pd.read_csv('house_SalePrice_predict.csv')

In [None]:
nan_cols = [i for i in df.columns if df[i].isnull().sum()>=1]
print(nan_cols)

['Lot Frontage', 'Mas Vnr Type', 'Mas Vnr Area', 'Bsmt Exposure', 'Bsmt Exposure.1', 'BsmtFin SF 2']


In [None]:
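# Numeric columns with more than one missing value. A column with exactly one NaN
# (here 'BsmtFin SF 2') is not caught by `> 1` and is filled separately later on.
numerical_with_nan=[feature for feature in df.columns if df[feature].isnull().sum()>1 and df[feature].dtypes!='O']
for feature in numerical_with_nan:
    median_value=df[feature].median()
    # Assign the result back; fillna(..., inplace=True) on a column selection
    # does not reliably modify df in recent pandas versions.
    df[feature]=df[feature].fillna(median_value)
df[numerical_with_nan].isnull().sum()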

Lot Frontage    0
Mas Vnr Area    0
dtype: int64

In [None]:
categorical_features=[feature for feature in df.columns if df[feature].dtype=='O']
for feature in categorical_features:
    print((df[feature].value_counts()))

In [None]:
for feature in categorical_features:
    temp=df.groupby(feature)['SalePrice'].count()/len(df)
    temp_df=temp[temp>0.01].index
    df[feature]=np.where(df[feature].isin(temp_df),df[feature],'Rare_var')

In [None]:
categorical_features=[feature for feature in df.columns if df[feature].dtype=='O']
for feature in categorical_features:
    print((df[feature].value_counts()))

2). Use one-hot encoding to convert the categorical variables into dummies.

In [None]:
df = pd.get_dummies(df, drop_first=True)
df.shape

(1601, 62)
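To make the effect of `drop_first=True` concrete, here is a small sketch on an invented three-row frame. The values are illustrative only; `dtype=int` is used so the dummies print as 0/1.

```python
import pandas as pd

# Invented rows; 'Bsmt Exposure' has three levels: Av, Gd, No.
toy = pd.DataFrame({
    "Area": [1000, 1500, 1200],
    "Bsmt Exposure": ["No", "Gd", "Av"],
})

# Levels are sorted alphabetically and the first one ('Av') is dropped,
# so it becomes the baseline encoded as all zeros.
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded)
```

Numeric columns pass through unchanged, and each categorical column with k levels contributes k-1 dummy columns.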

#### 2. Regression model building
In this part, you need to use the data to build a linear model by using OLS first and then build another linear model by using Lasso. Make sure to split the data into training and test sets. Report the performance on the test set. Use k-fold cross validation to tune the hyperparameters in Lasso.

In [None]:
df.isnull().sum()

In [None]:
# Fill the remaining missing value with 0 and assign the result back to the column.
df['BsmtFin SF 2']=df['BsmtFin SF 2'].fillna(0)
df['SalePrice_binary'] = (df['SalePrice'] >df['SalePrice'].median())*1
df2=df.copy()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.drop(['SalePrice','SalePrice_binary'], axis=1), 
                                                    df['SalePrice'],
                                                    test_size=0.3,
                                                    random_state=1234)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((1120, 61), (1120,), (481, 61), (481,))

### OLS Regression

In [None]:
from sklearn import metrics
from sklearn.metrics import mean_squared_error,mean_absolute_error,explained_variance_score, r2_score
md=linear_model.LinearRegression().fit(X_train, y_train)
y_pred=md.predict(X_test)
y_pred_bi_ols =(y_pred> df['SalePrice'].median())*1

In [None]:
print('Mean Absolute Error(MAE):', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error(MSE):', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error (RMSE):', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R2:',metrics.r2_score(y_test,  y_pred))

Mean Absolute Error(MAE): 26796.515871982487
Mean Squared Error(MSE): 1800510670.6908
Root Mean Squared Error (RMSE): 42432.42475620266
R2: 0.7511846618656031


#### Grid Search for Lasso

In [None]:
from sklearn.model_selection import GridSearchCV
#lasso
params = {'alpha': [0.001, 0.01,.1,1,10,50,100,110, 120,130,140,150,200,500]}
lasso = linear_model.Lasso()

# cross validation
model_cv_l = GridSearchCV(estimator = lasso, 
                        param_grid = params, 
                        scoring= 'neg_mean_absolute_error', 
                        cv = 5, 
                        return_train_score=True,
                        verbose = 1)            

model_cv_l.fit(X_train, y_train)

Fitting 5 folds for each of 14 candidates, totalling 70 fits


GridSearchCV(cv=5, estimator=Lasso(),
             param_grid={'alpha': [0.001, 0.01, 0.1, 1, 10, 50, 100, 110, 120,
                                   130, 140, 150, 200, 500]},
             return_train_score=True, scoring='neg_mean_absolute_error',
             verbose=1)

In [None]:
print(model_cv_l.best_params_)
print(model_cv_l.best_score_)

{'alpha': 140}
-27181.546446400735
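The grid search above fits Lasso on unscaled features. Because the L1 penalty depends on feature scale, one possible variation (not what this notebook does) is to standardize inside a `Pipeline` so that scaling is refit within each fold. The sketch below uses synthetic data from `make_regression` and a hypothetical alpha grid.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features and target.
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1234)

# Scaling is part of the estimator, so each CV fold scales on its own training part.
pipe = Pipeline([("scale", StandardScaler()), ("lasso", Lasso(max_iter=10000))])
grid = GridSearchCV(
    pipe,
    param_grid={"lasso__alpha": [0.01, 0.1, 1, 10]},  # hypothetical grid
    scoring="neg_mean_absolute_error",
    cv=5,
)
grid.fit(X_tr, y_tr)

best_alpha = grid.best_params_["lasso__alpha"]
# Pipeline.score delegates to Lasso.score, which returns R^2.
test_r2 = grid.best_estimator_.score(X_te, y_te)
print(best_alpha, test_r2)
```

With standardized features the useful range of `alpha` will generally differ from the unscaled case, so the grid would need to be re-chosen for the real data.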


### Lasso Regression

In [None]:
md=linear_model.Lasso(alpha =140).fit(X_train, y_train)
y_pred=md.predict(X_test)
y_pred_bi_lasso =(y_pred> df['SalePrice'].median())*1
print('Mean Absolute Error(MAE):', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error(MSE):', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error (RMSE):', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('R2:',metrics.r2_score(y_test,  y_pred))

Mean Absolute Error(MAE): 26673.08972007473
Mean Squared Error(MSE): 1747719572.737833
Root Mean Squared Error (RMSE): 41805.73612242503
R2: 0.7584799448658499


#### 3. Classification model building

Create a binary variable to indicate whether the sale price is greater than median sale price (=1 if it's higher than median, and 0 otherwise). Build and compare two classification models to predict whether or not the house sells above the median price: a logistic model and a random forest model.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.drop(['SalePrice','SalePrice_binary'], axis=1), 
                                                    df['SalePrice_binary'],
                                                    test_size=0.3,
                                                    random_state=1234)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((1120, 61), (1120,), (481, 61), (481,))

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc,matthews_corrcoef
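# Unpenalized logistic regression. Note: newer scikit-learn releases expect
# penalty=None in place of the string "none".
logr = LogisticRegression(random_state=0,penalty="none",max_iter=5000).fit(X_train, y_train)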
y_pred=logr.predict(X_test) 
ac_lr=logr.score(X_test, y_test)
mc_lr=matthews_corrcoef(y_test, y_pred)
print(f'The accuracy of Logistic Regression : {ac_lr:.5}')
print(f'Matthews Correlation Coefficient is: {mc_lr:.5}')

The accuracy of Logistic Regression : 0.89397
Matthews Correlation Coefficient is: 0.78814


### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(max_depth=30,n_estimators=100, min_samples_leaf=2,min_samples_split=30,criterion='entropy', oob_score=True,random_state=42)
rf.fit(X_train, y_train)
y_predict = rf.predict(X_test)
mc_sk_income_rf=matthews_corrcoef(y_test, y_predict)
print(f'Accuracy of SKLearn Random Forest: {rf.score(X_test, y_test):.5}')
print(f'Matthews Coefficient for SKLearn Random Forest: {mc_sk_income_rf:.5}')

Accuracy of SKLearn Random Forest: 0.89813
Matthews Coefficient for SKLearn Random Forest: 0.79687


#### 4. Model comparisons
Now use your OLS and Lasso regressions from part 2 and binarize the output so that they predict 1 if the house is predicted to sell for more than the median, and 0 otherwise.

How does the result compare to the Logistic and Random Forest models? Which one is best and why do you think it performs better?

In [None]:
# Caution: logr.score(X_test, y_pred_bi_ols) scores the logistic model's predictions
# against the binarized OLS labels, so the second number on each line is the agreement
# between the two models, not the accuracy of OLS (or Lasso) against y_test.
# metrics.accuracy_score(y_test, y_pred_bi_ols) would measure the latter.
print(f'Accuracy: Logistic Regression vs OLS: {logr.score(X_test, y_test):.2} vs  {logr.score(X_test, y_pred_bi_ols):.2}')
print(f'Accuracy: Random Forest vs OLS: {rf.score(X_test, y_test):.2} vs  {logr.score(X_test, y_pred_bi_ols):.2}')

print(f'Accuracy: Logistic Regression vs Lasso: {logr.score(X_test, y_test):.2} vs  {logr.score(X_test, y_pred_bi_lasso):.2}')
print(f'Accuracy: Random Forest vs Lasso: {rf.score(X_test, y_test):.2} vs  {logr.score(X_test, y_pred_bi_lasso):.2}')


Accuracy: Logistic Regression vs OLS: 0.89 vs  0.91
Accuracy: Random Forest vs OLS: 0.9 vs  0.91
Accuracy: Logistic Regression vs Lasso: 0.89 vs  0.92
Accuracy: Random Forest vs Lasso: 0.9 vs  0.92


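### Taken at face value, these results suggest that Lasso is the best model for predicting whether a house sells above the median price. When the dependent variable (Y) is continuous, regression can use the full price information and may therefore outperform classifiers trained only on the binary label. The OLS and Lasso figures should still be confirmed by scoring the binarized predictions directly against y_test.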
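For reference, the accuracy of a binarized regression output can be computed by thresholding both the true and the predicted prices at the same median and comparing the resulting labels. The numbers below are invented purely to show the mechanics.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Invented true and predicted prices (in thousands).
prices_true = np.array([100, 200, 300, 400])
prices_pred = np.array([150, 260, 240, 390])

threshold = np.median(prices_true)  # 250.0 for these toy values

labels_true = (prices_true > threshold).astype(int)  # [0, 0, 1, 1]
labels_pred = (prices_pred > threshold).astype(int)  # [0, 1, 0, 1]

# Fraction of houses whose predicted side of the median matches the true side.
acc = accuracy_score(labels_true, labels_pred)
print(acc)  # 0.5
```

In the notebook the analogous call would compare `y_test` from the classification split with `y_pred_bi_ols` or `y_pred_bi_lasso`, which line up row for row because both splits use the same `random_state`.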

#### 5. Test your model (Bonus)

Create a csv file with two columns by using house_SalePrice_predict.csv:
  1. Your best salePrice prediction.
  2. Your best aboveMedian prediction.

Name the file as '\<Your Last Name\>_prediction.csv'
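A minimal sketch of the requested output file, assuming the price predictions and the training median are already available. The prices, the median, and the file name `Lastname_prediction.csv` are placeholders.

```python
import pandas as pd

# Placeholder values; in practice these come from the fitted model and training data.
price_pred = [150000.0, 210000.0]
median_price = 160000.0
out_path = "Lastname_prediction.csv"  # hypothetical file name

out = pd.DataFrame({"salePrice": price_pred})
# 1 if the predicted price is above the median, 0 otherwise.
out["aboveMedian"] = (out["salePrice"] > median_price).astype(int)

# index=False keeps the file to exactly the two requested columns.
out.to_csv(out_path, index=False)
```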

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df.drop(['SalePrice','SalePrice_binary'], axis=1), 
                                                    df['SalePrice'],
                                                    test_size=0.3,
                                                    random_state=1234)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((1120, 61), (1120,), (481, 61), (481,))

In [None]:
md=linear_model.Lasso(alpha =140).fit(X_train, y_train)

In [None]:
nan_cols = [i for i in df_p.columns if df_p[i].isnull().sum()>=1]
print(nan_cols)

['Lot Frontage', 'Mas Vnr Type', 'Mas Vnr Area', 'Bsmt Exposure', 'Bsmt Exposure.1', 'Garage Cars']


In [None]:
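# Numeric columns with more than one missing value. Note: `> 1` skips columns with
# exactly one NaN, so a column such as 'Garage Cars' can remain unfilled.
numerical_with_nan=[feature for feature in df_p.columns if df_p[feature].isnull().sum()>1 and df_p[feature].dtypes!='O']
for feature in numerical_with_nan:
    median_value=df_p[feature].median()
    # Assign the result back; fillna(..., inplace=True) on a column selection
    # does not reliably modify df_p in recent pandas versions.
    df_p[feature]=df_p[feature].fillna(median_value)
df_p[numerical_with_nan].isnull().sum()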

Lot Frontage    0
Mas Vnr Area    0
dtype: int64

In [None]:
categorical_features=[feature for feature in df_p.columns if df_p[feature].dtype=='O']
for feature in categorical_features:
    print((df_p[feature].value_counts()))

## More data cleaning is required before the prediction file can be fed to the Lasso model. In particular, regrouping rare categories into a single level requires additional data treatment before SalePrice and the above-median indicator can be predicted. This step was not completed before the 12 am deadline.
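One way the remaining step might be approached is to learn the frequent levels from the training data only, apply that same mapping to the prediction rows, and then align the dummy columns with the training columns. This is a sketch on invented data: the `group_rare` helper and the 30% cutoff are hypothetical (the notebook uses 1%).

```python
import pandas as pd

train = pd.DataFrame({"Area": [1000, 1500, 1200, 900],
                      "Type": ["A", "A", "B", "C"]})
new = pd.DataFrame({"Area": [1100, 1300],
                    "Type": ["B", "D"]})  # 'D' never appears in training

# Frequent levels are decided on the training data alone.
freq = train["Type"].value_counts(normalize=True)
keep = freq[freq > 0.3].index  # only 'A' (0.5) passes the toy cutoff

def group_rare(series, keep_levels):
    """Replace every level not in keep_levels with 'Rare_var'."""
    return series.where(series.isin(keep_levels), "Rare_var")

train["Type"] = group_rare(train["Type"], keep)
new["Type"] = group_rare(new["Type"], keep)

train_d = pd.get_dummies(train, drop_first=True, dtype=int)
# No drop_first here: the baseline is removed by reindexing to the training
# columns, and any dummy missing from the new data is filled with 0.
new_d = pd.get_dummies(new, dtype=int).reindex(columns=train_d.columns, fill_value=0)
print(new_d)
```

After this alignment `new_d` has the same columns in the same order as the training matrix, which is what a fitted model's `predict` expects.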

@ Jeomoan Francis Kurian