<a href="https://colab.research.google.com/github/robitussin/CCMACLRL_EXERCISES/blob/main/Exercise6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Exercise 6: Choosing the best-performing model on a dataset

Instructions:

- Use the Dataset File to train your models
- Use the Test File to generate your predictions
- Use the Sample Submission File as the template for your submission's format
- Train and compare all four regression models (KNN, SVR, Decision Tree, Random Forest)

Submit your results to:
https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview



In [1]:
import pandas as pd
import seaborn as sns

from matplotlib import pyplot as plt
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

## Dataset File

In [2]:
train_data = 'https://github.com/robitussin/CCMACLRL_EXERCISES/blob/3fd7d51ffd17863598ac3f44eeefc558171a5b73/dataset/house-prices-advanced-regression-techniques/train.csv?raw=true'
df = pd.read_csv(train_data)

## Test File

In [3]:
test_url = 'https://github.com/robitussin/CCMACLRL_EXERCISES/blob/3fd7d51ffd17863598ac3f44eeefc558171a5b73/dataset/house-prices-advanced-regression-techniques/test.csv?raw=true'
dt=pd.read_csv(test_url)

In [4]:
dt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1459 non-null   int64  
 1   MSSubClass     1459 non-null   int64  
 2   MSZoning       1455 non-null   object 
 3   LotFrontage    1232 non-null   float64
 4   LotArea        1459 non-null   int64  
 5   Street         1459 non-null   object 
 6   Alley          107 non-null    object 
 7   LotShape       1459 non-null   object 
 8   LandContour    1459 non-null   object 
 9   Utilities      1457 non-null   object 
 10  LotConfig      1459 non-null   object 
 11  LandSlope      1459 non-null   object 
 12  Neighborhood   1459 non-null   object 
 13  Condition1     1459 non-null   object 
 14  Condition2     1459 non-null   object 
 15  BldgType       1459 non-null   object 
 16  HouseStyle     1459 non-null   object 
 17  OverallQual    1459 non-null   int64  
 18  OverallC

## Sample Submission File

In [5]:
sample_submission_url ='https://github.com/robitussin/CCMACLRL_EXERCISES/blob/3fd7d51ffd17863598ac3f44eeefc558171a5b73/dataset/house-prices-advanced-regression-techniques/sample_submission.csv?raw=true'

sf=pd.read_csv(sample_submission_url)

In [6]:
sf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Id         1459 non-null   int64  
 1   SalePrice  1459 non-null   float64
dtypes: float64(1), int64(1)
memory usage: 22.9 KB


## 1. Train a KNN Regressor

In [15]:
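import numpy as np  # required by select_dtypes(include=[np.number]) below
from sklearn.neighbors import KNeighborsRegressor

# Keep only the numeric features and fill missing values with 0 (a simple baseline).
# Note: this keeps the numeric Id column as a feature.
X = df.drop("SalePrice", axis=1).select_dtypes(include=[np.number]).fillna(0)
y = df["SalePrice"]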

knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X, y)


- Perform cross validation

In [16]:
from sklearn.model_selection import cross_val_score
import numpy as np

knn_scores = cross_val_score(knn, X, y, cv=5, scoring="neg_root_mean_squared_error")
print("KNN CV RMSE scores:", -knn_scores)
print("KNN Mean RMSE:", -knn_scores.mean())


KNN CV RMSE scores: [43753.76838484 47618.61520561 53087.38899875 40304.41962317
 55429.52334989]
KNN Mean RMSE: 48038.74311245359
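
For reference, `cross_val_score` with `scoring="neg_root_mean_squared_error"` returns negated RMSE values (so that higher is always better), which is why the cell above flips the sign before printing. The sketch below reproduces that computation by hand with an explicit `KFold` loop. It runs on a small synthetic dataset: `X_demo`, `y_demo`, and the coefficients are illustrative assumptions, not the house-price data.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (an assumption for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

model = KNeighborsRegressor(n_neighbors=5)
cv = KFold(n_splits=5)  # unshuffled folds, as cv=5 uses for a regressor

# Manual loop: fit on the training folds, score RMSE on the held-out fold.
manual_rmse = []
for train_idx, val_idx in cv.split(X_demo):
    model.fit(X_demo[train_idx], y_demo[train_idx])
    pred = model.predict(X_demo[val_idx])
    manual_rmse.append(np.sqrt(np.mean((y_demo[val_idx] - pred) ** 2)))
manual_rmse = np.array(manual_rmse)

# cross_val_score returns the same values with the sign flipped.
auto_scores = cross_val_score(model, X_demo, y_demo, cv=cv,
                              scoring="neg_root_mean_squared_error")
print("Manual RMSE per fold:", manual_rmse)
print("Negated cross_val_score:", -auto_scores)
```

Both printed arrays should agree, since the loop and `cross_val_score` use the same folds and the same metric.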


## 2. Train an SVM Regressor

In [17]:
from sklearn.svm import SVR

svr = SVR(kernel="rbf", C=100, gamma=0.1, epsilon=0.1)
svr.fit(X, y)


- Perform cross validation

In [18]:
svr_scores = cross_val_score(svr, X, y, cv=5, scoring="neg_root_mean_squared_error")
print("SVR CV RMSE scores:", -svr_scores)
print("SVR Mean RMSE:", -svr_scores.mean())

SVR CV RMSE scores: [76968.71703319 83503.90657919 89467.23948884 71750.89101623
 84132.6705403 ]
SVR Mean RMSE: 81164.6849315482
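
Both KNN and an RBF-kernel SVR depend on distances between feature vectors, so features with large numeric ranges can dominate the others. That may partly explain the high SVR error above, since the features were used unscaled. The sketch below compares an unscaled SVR with one wrapped in a `Pipeline` that standardizes the features first; placing the scaler inside the pipeline means it is fit on the training folds only. The data is synthetic and the two feature ranges are assumptions chosen to exaggerate the effect.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Two synthetic features on very different scales (illustrative only).
X_demo = np.column_stack([
    rng.uniform(0, 10_000, size=200),  # large-range feature
    rng.uniform(0, 1, size=200),       # small-range feature
])
y_demo = 0.01 * X_demo[:, 0] + 50 * X_demo[:, 1] + rng.normal(scale=1.0, size=200)

# Same hyperparameters as the notebook's SVR, with and without scaling.
raw_svr = SVR(kernel="rbf", C=100, gamma=0.1, epsilon=0.1)
scaled_svr = make_pipeline(
    StandardScaler(),
    SVR(kernel="rbf", C=100, gamma=0.1, epsilon=0.1),
)

raw_rmse = -cross_val_score(raw_svr, X_demo, y_demo, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
scaled_rmse = -cross_val_score(scaled_svr, X_demo, y_demo, cv=5,
                               scoring="neg_root_mean_squared_error").mean()
print(f"Unscaled SVR mean RMSE: {raw_rmse:.2f}")
print(f"Scaled SVR mean RMSE:   {scaled_rmse:.2f}")
```

The same `make_pipeline(StandardScaler(), ...)` wrapper could be tried with `KNeighborsRegressor` on `X` and `y`; whether it improves the scores on the real data would need to be checked by rerunning the cross validation.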


## 3. Train a Decision Tree Regressor

In [19]:
from sklearn.tree import DecisionTreeRegressor

dtree = DecisionTreeRegressor(max_depth=10, random_state=42)
dtree.fit(X, y)

- Perform cross validation

In [20]:
dtree_scores = cross_val_score(dtree, X, y, cv=5, scoring="neg_root_mean_squared_error")
print("Decision Tree CV RMSE scores:", -dtree_scores)
print("Decision Tree Mean RMSE:", -dtree_scores.mean())

Decision Tree CV RMSE scores: [41637.98635846 39346.63655378 33233.68608253 34280.69014246
 54271.1440601 ]
Decision Tree Mean RMSE: 40554.02863946451


## 4. Train a Random Forest Regressor

In [21]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, max_depth=15, random_state=42)

rf_scores = cross_val_score(rf, X, y, cv=5, scoring="neg_root_mean_squared_error")
print("Random Forest CV scores:", -rf_scores)
print("Random Forest Mean RMSE:", -rf_scores.mean())

Random Forest CV scores: [27324.1453714  32942.29967234 31466.7261206  25023.16373576
 35993.35787775]
Random Forest Mean RMSE: 30549.9385555689


## 5. Compare the performance of all regression models

In [11]:
results = {
    "KNN": -knn_scores.mean(),
    "SVR": -svr_scores.mean(),
    "Decision Tree": -dtree_scores.mean(),
    "Random Forest": -rf_scores.mean()
}

for model, score in results.items():
    print(f"{model}: {score:.2f}")

best_model_name = min(results, key=results.get)
print("\nBest model is:", best_model_name)

KNN: 48038.74
SVR: 81164.68
Decision Tree: 40554.03
Random Forest: 30549.94

Best model is: Random Forest
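
The comparison above uses RMSE in raw price units. The Kaggle competition page states that submissions are scored on the RMSE between the logarithm of the predicted price and the logarithm of the observed price, so leaderboard scores will be on a different scale, and the ranking of models could in principle differ. A minimal sketch of that metric follows; the helper name `rmse_log` and the example prices are made up for illustration.

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE between log(y_true) and log(y_pred); all values must be positive."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_true) - np.log(y_pred)) ** 2)))

# Made-up prices, for illustration only.
actual = [100_000, 200_000, 400_000]
predicted = [110_000, 180_000, 400_000]
score = rmse_log(actual, predicted)
print(f"Log-scale RMSE: {score:.4f}")
```

Because the error is taken on logarithms, over- or under-predicting by the same ratio costs the same for a cheap house as for an expensive one.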


## 6. Generate Submission File

Choose the model with the best performance (here, the lowest mean cross-validated RMSE) to generate a submission file.

In [24]:
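# Refit the best model (Random Forest) on the full training set.
model = RandomForestRegressor(n_estimators=100, max_depth=15, random_state=42)
model.fit(X, y)

# Apply the same preprocessing to the test set as to the training set.
X_test = dt.select_dtypes(include=[np.number]).fillna(0)
y_pred = model.predict(X_test)

# Use test_ids rather than id to avoid shadowing the built-in id().
test_ids = dt['Id']
submission_df = pd.DataFrame({
    'Id': test_ids,
    'SalePrice': y_pred
})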

submission_df.to_csv('submission_file.csv', index=False)
print("Submission file created: submission_file.csv")


Submission file created: submission_file.csv
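
Before uploading, it can help to confirm that the generated file matches the Sample Submission File's format. The sketch below shows one way to do that; `check_submission` is a hypothetical helper, and the tiny frames stand in for `sf` and `submission_df`.

```python
import pandas as pd

def check_submission(submission, sample):
    """Return a list of format problems; an empty list means the formats match."""
    problems = []
    if list(submission.columns) != list(sample.columns):
        problems.append("column names or order differ")
    if len(submission) != len(sample):
        problems.append("row count differs")
    elif "Id" in submission.columns and "Id" in sample.columns:
        # Only compare Ids when both frames have the same number of rows.
        if not submission["Id"].equals(sample["Id"]):
            problems.append("Id values differ")
    if submission.isna().any().any():
        problems.append("missing values present")
    return problems

# Tiny made-up frames standing in for sf and submission_df.
sample_demo = pd.DataFrame({"Id": [1461, 1462], "SalePrice": [0.0, 0.0]})
good_demo = pd.DataFrame({"Id": [1461, 1462], "SalePrice": [125000.0, 180000.0]})
bad_demo = pd.DataFrame({"Id": [1461], "Price": [125000.0]})

print(check_submission(good_demo, sample_demo))
print(check_submission(bad_demo, sample_demo))
```

In the notebook, the equivalent call would be `check_submission(submission_df, sf)`.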
