## Exercise 6: Choosing the best performing model on a dataset

Instructions:

- Use the Dataset File to train your models
- Use the Test File to generate your predictions
- Use the Sample Submission file as the format template for your submission
- Train and compare all four regression models (KNN, SVM, Decision Tree, Random Forest)

Submit your results to:
https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview



In [101]:
import pandas as pd
import seaborn as sns

from matplotlib import pyplot as plt
from sklearn.neighbors import KNeighborsRegressor

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

from sklearn.model_selection import cross_val_score


## Dataset File

In [102]:
train_data = 'https://github.com/robitussin/CCMACLRL_EXERCISES/blob/3fd7d51ffd17863598ac3f44eeefc558171a5b73/dataset/house-prices-advanced-regression-techniques/train.csv?raw=true'
df = pd.read_csv(train_data)

## Test File

In [103]:
test_url = 'https://github.com/robitussin/CCMACLRL_EXERCISES/blob/3fd7d51ffd17863598ac3f44eeefc558171a5b73/dataset/house-prices-advanced-regression-techniques/test.csv?raw=true'
dt = pd.read_csv(test_url)

In [104]:
dt.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1459 non-null   int64  
 1   MSSubClass     1459 non-null   int64  
 2   MSZoning       1455 non-null   object 
 3   LotFrontage    1232 non-null   float64
 4   LotArea        1459 non-null   int64  
 5   Street         1459 non-null   object 
 6   Alley          107 non-null    object 
 7   LotShape       1459 non-null   object 
 8   LandContour    1459 non-null   object 
 9   Utilities      1457 non-null   object 
 10  LotConfig      1459 non-null   object 
 11  LandSlope      1459 non-null   object 
 12  Neighborhood   1459 non-null   object 
 13  Condition1     1459 non-null   object 
 14  Condition2     1459 non-null   object 
 15  BldgType       1459 non-null   object 
 16  HouseStyle     1459 non-null   object 
 17  OverallQual    1459 non-null   int64  
 18  OverallC

## Sample Submission File

In [105]:
sample_submission_url = 'https://github.com/robitussin/CCMACLRL_EXERCISES/blob/3fd7d51ffd17863598ac3f44eeefc558171a5b73/dataset/house-prices-advanced-regression-techniques/sample_submission.csv?raw=true'

sf = pd.read_csv(sample_submission_url)

In [106]:
sf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Id         1459 non-null   int64  
 1   SalePrice  1459 non-null   float64
dtypes: float64(1), int64(1)
memory usage: 22.9 KB


## Managing Data for Reuse

In [107]:
# Do not call df.dropna() here: in this dataset it removes every row,
# because each row has at least one missing value.

# Instead, fill missing values in the numeric columns with the column means
numeric_columns = ['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
                   'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea',
                   'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF',
                   '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea',
                   'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
                   'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces',
                   'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF',
                   'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch',
                   'PoolArea', 'MiscVal', 'MoSold', 'YrSold', 'SalePrice']

# Fill nulls in numeric columns with their means
df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].mean())

# Fill nulls in the test data with the training-set means,
# so no statistics are computed from the test set itself
for col in numeric_columns:
    if col in dt.columns:
        dt[col] = dt[col].fillna(df[col].mean())

X = df[numeric_columns].drop(columns=['SalePrice']).values
y = df['SalePrice'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=5)

score_list={}


## 1. Train a KNN Regressor

In [108]:
knn = KNeighborsRegressor(n_neighbors=33)
knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

- Perform cross validation

In [109]:
scores = cross_val_score(knn, X, y, cv=5)
scores

array([0.62940442, 0.63584939, 0.49383577, 0.58986937, 0.55266083])
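
The scores above are R² values: when no `scoring` argument is given, `cross_val_score` falls back to the estimator's own `score` method, which is R² for regressors. The Kaggle competition instead ranks submissions by the root-mean-squared error between the logarithm of the predicted price and the logarithm of the observed price. A minimal sketch of that metric follows; the helper name `rmse_log` and the toy prices are illustrative assumptions, not part of the original notebook.

```python
import math

def rmse_log(y_true, y_pred):
    """RMSE between log(actual) and log(predicted) prices (hypothetical helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [
        (math.log(pred) - math.log(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy prices for illustration only (not taken from the dataset)
actual = [200000.0, 150000.0, 320000.0]
predicted = [210000.0, 140000.0, 300000.0]
error = rmse_log(actual, predicted)
print(f"RMSE of log prices: {error:.4f}")
```

Because the error is computed on logarithms, a 10% miss on a cheap house counts about as much as a 10% miss on an expensive one.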

## 2. Train an SVM Regressor

In [110]:
from sklearn.svm import SVR

# RBF-kernel support vector regressor with explicit C and gamma
svr = SVR(kernel='rbf', C=100, gamma=0.001)
svr.fit(X_train, y_train)


- Perform cross validation

In [111]:
# 5-fold cross-validation of the SVR (scores are R^2)
scores = cross_val_score(svr, X, y, cv=5)
scores

array([-0.07075124, -0.06121468, -0.05627161, -0.01623287, -0.0556056 ])
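
An R² below zero means the model does worse than always predicting the mean price. One likely cause (an assumption, not something this notebook verifies) is that SVR is sensitive to scale: the features here range from single digits to tens of thousands, and the target is in the hundreds of thousands. The sketch below uses synthetic data to compare the unscaled SVR with one whose features are standardized in a pipeline and whose target is standardized through `TransformedTargetRegressor`. The data and the variable names are illustrative only.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the housing data: two features on very different scales
rng = np.random.default_rng(0)
n = 300
area = rng.uniform(500, 4000, n)
quality = rng.integers(1, 11, n).astype(float)
X_demo = np.column_stack([area, quality])
y_demo = 50_000 + 40 * area + 15_000 * quality + rng.normal(0, 5_000, n)

# Same settings as the notebook's SVR, fitted on raw features and raw target
raw_svr = SVR(kernel='rbf', C=100, gamma=0.001)

# Features standardized inside the pipeline, target standardized by the wrapper
scaled_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel='rbf', C=100)),
    transformer=StandardScaler(),
)

raw_scores = cross_val_score(raw_svr, X_demo, y_demo, cv=5)
scaled_scores = cross_val_score(scaled_svr, X_demo, y_demo, cv=5)
print("Raw SVR mean R^2:   ", raw_scores.mean())
print("Scaled SVR mean R^2:", scaled_scores.mean())
```

Putting the scaler inside the pipeline means it is refitted on each training fold, so the validation fold never influences the scaling statistics.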

## 3. Train a Decision Tree Regressor

In [112]:
# Decision tree regressor with depth capped at 10 to limit overfitting
dtr = DecisionTreeRegressor(max_depth=10, random_state=1)
dtr.fit(X_train, y_train)


- Perform cross validation

In [113]:
# 5-fold cross-validation of the decision tree (scores are R^2)
scores = cross_val_score(dtr, X, y, cv=5)
scores

array([0.68758993, 0.69566024, 0.82651656, 0.74084503, 0.57206797])

## 4. Train a Random Forest Regressor

In [114]:
# Random forest with 50 trees (RandomForestRegressor is already imported in the first cell)
rfr = RandomForestRegressor(n_estimators=50, random_state=1)
rfr.fit(X_train, y_train)


- Perform cross validation

In [115]:
# 5-fold cross-validation of the random forest (scores are R^2)
scores = cross_val_score(rfr, X, y, cv=5)
scores

array([0.86291798, 0.83017482, 0.87532715, 0.87727582, 0.79669623])

## 5. Compare the performance of all regression models

In [116]:
scores1 = cross_val_score(knn, X, y, cv=5)
scores2 = cross_val_score(svr, X, y, cv=5)
scores3 = cross_val_score(dtr, X, y, cv=5)
scores4 = cross_val_score(rfr, X, y, cv=5)

print("KNN Scores:", scores1)
print("SVR Scores:", scores2)
print("DTR Scores:", scores3)
print("RFR Scores:", scores4)

KNN Scores: [0.62940442 0.63584939 0.49383577 0.58986937 0.55266083]
SVR Scores: [-0.07075124 -0.06121468 -0.05627161 -0.01623287 -0.0556056 ]
DTR Scores: [0.68758993 0.69566024 0.82651656 0.74084503 0.57206797]
RFR Scores: [0.86291798 0.83017482 0.87532715 0.87727582 0.79669623]
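
Five numbers per model are hard to compare at a glance, so it can help to reduce each model to a mean and a standard deviation across folds. The sketch below does this with the scores printed above; the names `fold_scores`, `summary`, and `best_model` are illustrative and do not appear in the original notebook.

```python
from statistics import mean, stdev

# Per-fold R^2 scores copied from the output above
fold_scores = {
    "KNN": [0.62940442, 0.63584939, 0.49383577, 0.58986937, 0.55266083],
    "SVR": [-0.07075124, -0.06121468, -0.05627161, -0.01623287, -0.0556056],
    "DTR": [0.68758993, 0.69566024, 0.82651656, 0.74084503, 0.57206797],
    "RFR": [0.86291798, 0.83017482, 0.87532715, 0.87727582, 0.79669623],
}

# Mean and sample standard deviation of the fold scores for each model
summary = {name: (mean(s), stdev(s)) for name, s in fold_scores.items()}

# Print models from best to worst mean R^2
for name, (avg, spread) in sorted(summary.items(), key=lambda kv: kv[1][0], reverse=True):
    print(f"{name}: mean R^2 = {avg:.3f}, std = {spread:.3f}")

# Model with the highest mean R^2
best_model = max(summary, key=lambda name: summary[name][0])
print("Best model:", best_model)
```

A high mean with a low standard deviation suggests the model performs consistently across folds, which is a more reliable basis for choosing a model than any single fold.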


## 6. Generate Submission File

Choose the model that has the best performance to generate a submission file.

In [117]:
# Random Forest had the highest cross-validation scores, so use it as the final model
model = rfr

# Train on ALL training data (not just X_train)
model.fit(X, y)

# Prepare test data - extract same numeric columns (excluding SalePrice which doesn't exist in test data)
numeric_columns_test = [col for col in numeric_columns if col != 'SalePrice']
X_test_final = dt[numeric_columns_test].values

# Make predictions on the properly preprocessed test data
y_pred = model.predict(X_test_final)

# Create a submission DataFrame; take Id from the test data so each
# prediction is paired with the row it was computed from
submission_df = pd.DataFrame({
    'Id': dt['Id'],
    'SalePrice': y_pred
})

# Save the submission DataFrame to a CSV file
submission_df.to_csv('submission_file.csv', index=False)
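
Before uploading, it may be worth checking that the file matches the layout of the sample submission. The sketch below is one way to do that; `check_submission` is a hypothetical helper, and the small frames stand in for `sf` and `submission_df`.

```python
import pandas as pd

def check_submission(submission, sample):
    """Return a list of format problems; an empty list means the layouts match.

    Hypothetical helper, not part of the original notebook.
    """
    problems = []
    if list(submission.columns) != list(sample.columns):
        problems.append("column names or order differ")
    if len(submission) != len(sample):
        problems.append("row count differs")
    elif "Id" in submission.columns and "Id" in sample.columns:
        # Only compare Ids when both frames have the same number of rows
        if not (submission["Id"].to_numpy() == sample["Id"].to_numpy()).all():
            problems.append("Id values differ")
    if submission.isna().any().any():
        problems.append("submission contains missing values")
    return problems

# Toy frames for illustration only
sample_demo = pd.DataFrame({"Id": [1461, 1462, 1463], "SalePrice": [0.0, 0.0, 0.0]})
good_demo = pd.DataFrame({"Id": [1461, 1462, 1463], "SalePrice": [120000.0, 150000.0, 180000.0]})
bad_demo = pd.DataFrame({"Id": [1461, 1462], "SalePrice": [120000.0, None]})

print(check_submission(good_demo, sample_demo))
print(check_submission(bad_demo, sample_demo))
```

In this notebook the call would be `check_submission(submission_df, sf)`.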
