## Exercise 6: Choosing the best performing model on a dataset

Instructions:

- Use the Dataset File to train your models
- Use the Test File to generate your predictions
- Use the Sample Submission File as the template for your submission format
- Use all of the regression models listed below

Submit your results to:
https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview



In [18]:
import pandas as pd
import seaborn as sns

from matplotlib import pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

## Dataset File

In [19]:
train_data = 'https://github.com/robitussin/CCMACLRL_EXERCISES/blob/3fd7d51ffd17863598ac3f44eeefc558171a5b73/dataset/house-prices-advanced-regression-techniques/train.csv?raw=true'
df = pd.read_csv(train_data)

## Test File

In [20]:
test_url = 'https://github.com/robitussin/CCMACLRL_EXERCISES/blob/3fd7d51ffd17863598ac3f44eeefc558171a5b73/dataset/house-prices-advanced-regression-techniques/test.csv?raw=true'
dt = pd.read_csv(test_url)

In [21]:
dt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1459 non-null   int64  
 1   MSSubClass     1459 non-null   int64  
 2   MSZoning       1455 non-null   object 
 3   LotFrontage    1232 non-null   float64
 4   LotArea        1459 non-null   int64  
 5   Street         1459 non-null   object 
 6   Alley          107 non-null    object 
 7   LotShape       1459 non-null   object 
 8   LandContour    1459 non-null   object 
 9   Utilities      1457 non-null   object 
 10  LotConfig      1459 non-null   object 
 11  LandSlope      1459 non-null   object 
 12  Neighborhood   1459 non-null   object 
 13  Condition1     1459 non-null   object 
 14  Condition2     1459 non-null   object 
 15  BldgType       1459 non-null   object 
 16  HouseStyle     1459 non-null   object 
 17  OverallQual    1459 non-null   int64  
 18  OverallC

## Sample Submission File

In [22]:
sample_submission_url = 'https://github.com/robitussin/CCMACLRL_EXERCISES/blob/3fd7d51ffd17863598ac3f44eeefc558171a5b73/dataset/house-prices-advanced-regression-techniques/sample_submission.csv?raw=true'
sf = pd.read_csv(sample_submission_url)

In [23]:
sf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Id         1459 non-null   int64  
 1   SalePrice  1459 non-null   float64
dtypes: float64(1), int64(1)
memory usage: 22.9 KB


In [24]:
dt.head()

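|   | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|----|------------|----------|-------------|---------|--------|-------|----------|-------------|-----------|-----|-------------|----------|--------|-------|-------------|---------|--------|--------|----------|---------------|
| 0 | 1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | ... | 120 | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal |
| 1 | 1462 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal |
| 2 | 1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal |
| 3 | 1464 | 60 | RL | 78.0 | 9978 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal |
| 4 | 1465 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | ... | 144 | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Normal |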


In [25]:
import numpy as np

# Separate the features from the target column.
x = df.drop("SalePrice", axis=1)
y = df["SalePrice"]

# Hold out 30% of the rows for validation.
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1, test_size=0.3)

# Keep only the numeric columns; the categorical columns are not used by these models.
numerical_cols = x_train.select_dtypes(include=np.number).columns.tolist()

# Fill missing values in each frame with that frame's column means.
X_train_numerical = x_train[numerical_cols].fillna(x_train[numerical_cols].mean())
X_test_numerical = x_test[numerical_cols].fillna(x_test[numerical_cols].mean())
X_numerical = x[numerical_cols].fillna(x[numerical_cols].mean())

# Initialize the four regressors; the tree-based models get a fixed seed for reproducibility.
knn = KNeighborsRegressor(n_neighbors=5)
svm_regressor = SVR()
decision_tree_regressor = DecisionTreeRegressor(random_state=1)
random_forest_regressor = RandomForestRegressor(random_state=1)
print("Data splitting and numerical feature preprocessing complete. Regression models initialized.")

Data splitting and numerical feature preprocessing complete. Regression models initialized.
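
The cell above fills missing values in each split with that split's own column means. A common alternative, sketched here on toy data rather than the housing set, is to compute the means on the training rows only and reuse them for the held-out rows, so that no statistic from the validation data reaches the model. The two small frames and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the training and held-out feature frames (values are invented).
train = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0], "LotArea": [8000, 9000, 10000]})
test = pd.DataFrame({"LotFrontage": [np.nan, 100.0], "LotArea": [7000, 12000]})

# Column means are computed from the training rows only.
train_means = train.mean()

# The same training means fill the gaps in both frames.
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)

print(test_filled)
```

In the notebook this would amount to passing `x_train[numerical_cols].mean()` to every `fillna` call. The scores reported below were produced with the per-split means, so they may shift slightly if this variant is used.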


In [None]:
score_list = {}

## 1. Train a KNN Regressor

In [56]:
x_train_numerical = x_train[numerical_cols].fillna(x_train[numerical_cols].mean())
x_test_numerical = x_test[numerical_cols].fillna(x_test[numerical_cols].mean())

knn.fit(x_train_numerical, y_train)
knn_score = knn.score(x_test_numerical, y_test)

score_list["KNN"] = knn_score
print("KNN score:", score_list['KNN'])

KNN score: 0.6065835753985184


- Perform cross validation

In [35]:
from sklearn.model_selection import cross_val_score
import numpy as np
scores = cross_val_score(knn, X_numerical, y, cv=5, scoring='neg_mean_squared_error')

knn_rmse_scores = np.sqrt(-scores)

print("KNN Cross-validation RMSE scores:", knn_rmse_scores)
print("Average KNN Cross-validation RMSE:", knn_rmse_scores.mean())

KNN Cross-validation RMSE scores: [43701.77809279 47816.3646527  52930.50404426 40356.16489241
 55910.21659883]
Average KNN Cross-validation RMSE: 48143.0056561973
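
With `scoring='neg_mean_squared_error'`, `cross_val_score` returns the negated mean squared error for each fold, so that larger values are always better. Negating and taking the square root recovers the RMSE in the units of the target (here, dollars). The fold values in this small sketch are invented to keep the arithmetic easy to follow.

```python
import numpy as np

# Hypothetical per-fold scores as cross_val_score would return them (negated MSE).
neg_mse = np.array([-4.0, -9.0, -16.0])

# Flip the sign to get the MSE, then take the square root to get the RMSE per fold.
rmse = np.sqrt(-neg_mse)

print(rmse)         # [2. 3. 4.]
print(rmse.mean())  # 3.0
```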


## 2. Train an SVM Regressor

In [63]:
svm_regressor.fit(x_train_numerical, y_train)
svm_score = svm_regressor.score(x_test_numerical, y_test)

if 'score_list' not in locals() and 'score_list' not in globals():
    score_list = {}

score_list["SVM"] = svm_score
print("SVM score:", score_list['SVM'])

SVM score: -0.037695563817771616
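
The `score` method of a scikit-learn regressor returns the coefficient of determination, R². A negative value, like the SVM result above, means the model did worse on the held-out rows than a baseline that always predicts the mean of the true values. The sketch below computes R² by hand on made-up numbers to show how that can happen; the helper name `r2` is only for illustration.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([100.0, 200.0, 300.0])

# Always predicting the mean gives exactly 0.
mean_baseline = np.full(3, y_true.mean())
print(r2(y_true, mean_baseline))  # 0.0

# Predictions that are further off than the mean give a negative score.
bad_predictions = np.array([300.0, 200.0, 100.0])
print(r2(y_true, bad_predictions))  # -3.0
```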


- Perform cross validation

In [37]:
scores = cross_val_score(svm_regressor, X_numerical, y, cv=5, scoring='neg_mean_squared_error')
svm_rmse_scores = np.sqrt(-scores)

print("SVM Cross-validation RMSE scores:", svm_rmse_scores)
print("Average SVM Cross-validation RMSE:", svm_rmse_scores.mean())

SVM Cross-validation RMSE scores: [76943.35892461 83478.8998053  89448.38272659 71736.05116203
 84110.48215108]
Average SVM Cross-validation RMSE: 81143.43495392159


## 3. Train a Decision Tree Regression

In [58]:
decision_tree_regressor.fit(X_train_numerical, y_train)
decision_tree_score = decision_tree_regressor.score(X_test_numerical, y_test)

if 'score_list' not in locals() and 'score_list' not in globals():
    score_list = {}

score_list["DecisionTree"] = decision_tree_score
print("Decision Tree Regressor score:", score_list['DecisionTree'])

Decision Tree Regressor score: 0.7491680424148208


- Perform cross validation

In [42]:
from sklearn.model_selection import cross_val_score
import numpy as np

scores = cross_val_score(decision_tree_regressor, X_numerical, y, cv=5, scoring='neg_mean_squared_error')
decision_tree_rmse_scores = np.sqrt(-scores)

print("Decision Tree Regressor Cross-validation RMSE scores:", decision_tree_rmse_scores)
print("Average Decision Tree Regressor Cross-validation RMSE:", decision_tree_rmse_scores.mean())

Decision Tree Regressor Cross-validation RMSE scores: [41705.03783188 42423.237294   37827.04498917 36283.16339902
 50764.27799948]
Average Decision Tree Regressor Cross-validation RMSE: 41800.55230270976


## 4. Train a Random Forest Regression

In [59]:
random_forest_regressor.fit(X_train_numerical, y_train)
random_forest_score = random_forest_regressor.score(X_test_numerical, y_test)

if 'score_list' not in locals() and 'score_list' not in globals():
    score_list = {}

score_list["RandomForest"] = random_forest_score
print("Random Forest Regressor score:", score_list["RandomForest"])

Random Forest Regressor score: 0.9028120658977223


- Perform cross validation


In [60]:
from sklearn.model_selection import cross_val_score
import numpy as np

scores = cross_val_score(random_forest_regressor, X_numerical, y, cv=5, scoring='neg_mean_squared_error')
random_forest_rmse_scores = np.sqrt(-scores)

print("Random Forest Regressor Cross-validation RMSE scores:", random_forest_rmse_scores)
print("Average Random Forest Regressor Cross-validation RMSE:", random_forest_rmse_scores.mean())


## 5. Compare the performance of all regression models

In [61]:
score_list = {}
score_list["RandomForest"] = random_forest_score
score_list["DecisionTree"] = decision_tree_score
score_list["SVM"] = svm_score
score_list["KNN"] = knn_score

for alg, score in score_list.items():
    print(f"{alg} Score is {score:.2f}")


RandomForest Score is 0.90
DecisionTree Score is 0.75
SVM Score is -0.04
KNN Score is 0.61
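
The comparison above ranks the models by R² on a single 70/30 split. A ranking by mean cross-validated RMSE is another option and tends to depend less on one particular split. The sketch below shows the idea on synthetic data from `make_regression`, which stands in for the numeric housing features; the dataset, the reduced `n_estimators`, and the names `candidates`, `cv_rmse`, and `best_by_rmse` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data, used only so the sketch runs without the housing files.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)

candidates = {
    "KNN": KNeighborsRegressor(n_neighbors=5),
    "DecisionTree": DecisionTreeRegressor(random_state=1),
    "RandomForest": RandomForestRegressor(n_estimators=20, random_state=1),
}

# Mean 5-fold RMSE per model; lower is better.
cv_rmse = {}
for name, estimator in candidates.items():
    neg_mse = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="neg_mean_squared_error")
    cv_rmse[name] = float(np.sqrt(-neg_mse).mean())

# Pick the model with the smallest mean RMSE (min, not max, because RMSE is an error).
best_by_rmse = min(cv_rmse, key=cv_rmse.get)

for name, value in cv_rmse.items():
    print(f"{name}: mean CV RMSE = {value:.2f}")
print("Lowest mean CV RMSE:", best_by_rmse)
```

Note the direction of the comparison: R² is maximized, while RMSE is minimized.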


## 6. Generate Submission File

Choose the model that has the best performance to generate a submission file.

In [62]:
best_model_name = max(score_list, key=score_list.get)
print(f"The best performing model is: {best_model_name}")

if best_model_name == 'KNN':
    model = knn
elif best_model_name == 'SVM':
    model = svm_regressor
elif best_model_name == 'DecisionTree':
    model = decision_tree_regressor
elif best_model_name == 'RandomForest':
    model = random_forest_regressor

dt_numerical = dt[numerical_cols].fillna(dt[numerical_cols].mean())
y_pred = model.predict(dt_numerical)

submission_df = pd.DataFrame({
    'Id': dt['Id'],        # house Ids taken from the test file
    'SalePrice': y_pred    # predictions from the selected model
})

submission_df.to_csv('submission_file.csv', index=False)
print("Submission file created: submission_file.csv")

The best performing model is: RandomForest
Submission file created: submission_file.csv
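
Before uploading, it can help to confirm that the generated file has the same layout as the sample submission. The helper below is a minimal sketch of such a check; `check_submission` is a hypothetical function written for this example, and the small frames are invented so the block runs on its own.

```python
import pandas as pd

def check_submission(submission, sample):
    """Return a list of format problems; an empty list means the layout matches."""
    problems = []
    if list(submission.columns) != list(sample.columns):
        problems.append("column names or order differ from the sample")
    if len(submission) != len(sample):
        problems.append("row count differs from the sample")
    elif ("Id" in submission.columns and "Id" in sample.columns
          and submission["Id"].tolist() != sample["Id"].tolist()):
        problems.append("Id values differ from the sample")
    if "SalePrice" in submission.columns and submission["SalePrice"].isna().any():
        problems.append("SalePrice contains missing values")
    return problems

# Invented frames that mimic the two-column Id/SalePrice layout.
sample = pd.DataFrame({"Id": [1461, 1462, 1463], "SalePrice": [0.0, 0.0, 0.0]})
good = pd.DataFrame({"Id": [1461, 1462, 1463], "SalePrice": [120000.0, 155000.5, 98000.0]})
bad = pd.DataFrame({"Id": [1461, 1462], "SalePrice": [120000.0, None]})

print(check_submission(good, sample))
print(check_submission(bad, sample))
```

With the notebook's own variables, the call would be `check_submission(submission_df, sf)`.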
