The goal of this exercise is to build a machine learning model that predicts the price of a house in Melbourne.

This also includes the process of selecting variables, since our dataset has over 20 of them.

Dataset link: https://www.kaggle.com/code/dansbecker/your-first-machine-learning-model/data

In [60]:
import pandas as pd

df = pd.read_csv("melb_data.csv")
df.head()

Unnamed: 0,Suburb,Address,Rooms,Type,Price,Method,SellerG,Date,Distance,Postcode,...,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,1480000.0,S,Biggin,3/12/2016,2.5,3067.0,...,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,1035000.0,S,Biggin,4/02/2016,2.5,3067.0,...,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0
2,Abbotsford,5 Charles St,3,h,1465000.0,SP,Biggin,4/03/2017,2.5,3067.0,...,2.0,0.0,134.0,150.0,1900.0,Yarra,-37.8093,144.9944,Northern Metropolitan,4019.0
3,Abbotsford,40 Federation La,3,h,850000.0,PI,Biggin,4/03/2017,2.5,3067.0,...,2.0,1.0,94.0,,,Yarra,-37.7969,144.9969,Northern Metropolitan,4019.0
4,Abbotsford,55a Park St,4,h,1600000.0,VB,Nelson,4/06/2016,2.5,3067.0,...,1.0,2.0,120.0,142.0,2014.0,Yarra,-37.8072,144.9941,Northern Metropolitan,4019.0


In [61]:
# Remove unnecessary columns

columns_to_drop = ["Address","SellerG","CouncilArea","Lattitude","Longtitude","Date","Postcode","Propertycount"]

df.drop(columns_to_drop,axis=1,inplace=True)

df.head()

Unnamed: 0,Suburb,Rooms,Type,Price,Method,Distance,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Regionname
0,Abbotsford,2,h,1480000.0,S,2.5,2.0,1.0,1.0,202.0,,,Northern Metropolitan
1,Abbotsford,2,h,1035000.0,S,2.5,2.0,1.0,0.0,156.0,79.0,1900.0,Northern Metropolitan
2,Abbotsford,3,h,1465000.0,SP,2.5,3.0,2.0,0.0,134.0,150.0,1900.0,Northern Metropolitan
3,Abbotsford,3,h,850000.0,PI,2.5,3.0,2.0,1.0,94.0,,,Northern Metropolitan
4,Abbotsford,4,h,1600000.0,VB,2.5,3.0,1.0,2.0,120.0,142.0,2014.0,Northern Metropolitan


In [62]:
df["Suburb"].value_counts()

Suburb
Reservoir         359
Richmond          260
Bentleigh East    249
Preston           239
Brunswick         222
                 ... 
Sandhurst           1
Bullengarook        1
Croydon South       1
Montrose            1
Monbulk             1
Name: count, Length: 314, dtype: int64

In [63]:
df["Regionname"].value_counts()

Regionname
Southern Metropolitan         4695
Northern Metropolitan         3890
Western Metropolitan          2948
Eastern Metropolitan          1471
South-Eastern Metropolitan     450
Eastern Victoria                53
Northern Victoria               41
Western Victoria                32
Name: count, dtype: int64

I wanted to use either Suburb or Regionname as the location feature for this model.

There are 314 different suburbs and only 8 different region names. To keep the number of one-hot encoded columns small, let's use Regionname for this analysis.
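
As a rough sketch of how the two candidate columns can be compared, `nunique()` reports how many distinct categories each one has, which is also the number of indicator columns one-hot encoding would create before dropping one. The toy frame below is made up for illustration and is not the real dataset.

```python
import pandas as pd

# Toy stand-in for the housing data; the rows are illustrative only.
toy = pd.DataFrame({
    "Suburb": ["Reservoir", "Richmond", "Preston", "Brunswick", "Richmond", "Montrose"],
    "Regionname": ["Region A", "Region A", "Region A", "Region A", "Region A", "Region B"],
})

# Distinct categories per column: a high count means many one-hot columns.
cardinality = toy[["Suburb", "Regionname"]].nunique()
print(cardinality)
```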

In [64]:
df.drop("Suburb",axis=1,inplace=True)

df.head()

Unnamed: 0,Rooms,Type,Price,Method,Distance,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Regionname
0,2,h,1480000.0,S,2.5,2.0,1.0,1.0,202.0,,,Northern Metropolitan
1,2,h,1035000.0,S,2.5,2.0,1.0,0.0,156.0,79.0,1900.0,Northern Metropolitan
2,3,h,1465000.0,SP,2.5,3.0,2.0,0.0,134.0,150.0,1900.0,Northern Metropolitan
3,3,h,850000.0,PI,2.5,3.0,2.0,1.0,94.0,,,Northern Metropolitan
4,4,h,1600000.0,VB,2.5,3.0,1.0,2.0,120.0,142.0,2014.0,Northern Metropolitan


In [65]:
df.describe()

Unnamed: 0,Rooms,Price,Distance,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt
count,13580.0,13580.0,13580.0,13580.0,13580.0,13518.0,13580.0,7130.0,8205.0
mean,2.937997,1075684.0,10.137776,2.914728,1.534242,1.610075,558.416127,151.96765,1964.684217
std,0.955748,639310.7,5.868725,0.965921,0.691712,0.962634,3990.669241,541.014538,37.273762
min,1.0,85000.0,0.0,0.0,0.0,0.0,0.0,0.0,1196.0
25%,2.0,650000.0,6.1,2.0,1.0,1.0,177.0,93.0,1940.0
50%,3.0,903000.0,9.2,3.0,1.0,2.0,440.0,126.0,1970.0
75%,3.0,1330000.0,13.0,3.0,2.0,2.0,651.0,174.0,1999.0
max,10.0,9000000.0,48.1,20.0,8.0,10.0,433014.0,44515.0,2018.0


In [66]:
df.isnull().sum()

Rooms              0
Type               0
Price              0
Method             0
Distance           0
Bedroom2           0
Bathroom           0
Car               62
Landsize           0
BuildingArea    6450
YearBuilt       5375
Regionname         0
dtype: int64

We have 3 columns with N/A values: Car (62), BuildingArea (6450), and YearBuilt (5375).
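
Raw counts are easier to judge as a share of all rows. A minimal sketch, on a made-up frame rather than the real data: `isna().mean()` gives the fraction of missing values per column, which helps decide whether filling a column is reasonable.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps; the values are illustrative only.
toy = pd.DataFrame({
    "Car": [1.0, np.nan, 2.0, 2.0],
    "BuildingArea": [np.nan, 79.0, np.nan, 142.0],
    "YearBuilt": [np.nan, 1900.0, 1900.0, 2014.0],
})

# The mean of a boolean mask is the fraction of True values, i.e. the missing share.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```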

In [67]:
df["Car"].median(), df["BuildingArea"].median(), df["YearBuilt"].median()

(2.0, 126.0, 1970.0)

In [68]:
df_car_null = df[df["Car"].isna()]

df_car_null.head()

Unnamed: 0,Rooms,Type,Price,Method,Distance,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Regionname
12221,3,h,985000.0,S,4.3,3.0,1.0,,245.0,91.0,1945.0,Western Metropolitan
12247,2,h,1023000.0,S,4.0,2.0,1.0,,154.0,76.0,1890.0,Northern Metropolitan
12259,3,h,1436000.0,S,3.6,3.0,2.0,,123.0,128.0,1990.0,Northern Metropolitan
12320,3,h,1370000.0,S,16.7,3.0,1.0,,652.0,,,Eastern Metropolitan
12362,4,h,1180000.0,PI,6.2,4.0,1.0,,545.0,,,Western Metropolitan


In [69]:
# Assign the result back to the column; calling fillna(..., inplace=True) on a
# column selection raises a FutureWarning and will stop working in pandas 3.0
df["Car"] = df["Car"].fillna(df["Car"].median())


In [70]:
df_car_null = df[df["Car"].isna()]

df_car_null.head()

Unnamed: 0,Rooms,Type,Price,Method,Distance,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Regionname


In [71]:
# Same pattern as for Car: assign the filled column back instead of using inplace=True
df["BuildingArea"] = df["BuildingArea"].fillna(df["BuildingArea"].median())
df["YearBuilt"] = df["YearBuilt"].fillna(df["YearBuilt"].median())


In [72]:
df.isnull().sum()

Rooms           0
Type            0
Price           0
Method          0
Distance        0
Bedroom2        0
Bathroom        0
Car             0
Landsize        0
BuildingArea    0
YearBuilt       0
Regionname      0
dtype: int64

Yay, no more N/A values!
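
For reference, here is a compact sketch of the same median fill done in one call. It relies on `DataFrame.fillna` accepting a Series keyed by column name; the toy values are made up, and columns not named in the Series are left untouched.

```python
import numpy as np
import pandas as pd

# Toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "Car": [1.0, np.nan, 2.0, 3.0],
    "BuildingArea": [np.nan, 79.0, 150.0, 142.0],
    "Rooms": [2, 2, 3, 4],
})

cols = ["Car", "BuildingArea"]
medians = toy[cols].median()   # Series indexed by column name
toy = toy.fillna(medians)      # each listed column gets its own median
print(toy)
```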

In [73]:
df.describe()

Unnamed: 0,Rooms,Price,Distance,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt
count,13580.0,13580.0,13580.0,13580.0,13580.0,13580.0,13580.0,13580.0,13580.0
mean,2.937997,1075684.0,10.137776,2.914728,1.534242,1.611856,558.416127,139.633972,1966.788218
std,0.955748,639310.7,5.868725,0.965921,0.691712,0.960793,3990.669241,392.217403,29.088642
min,1.0,85000.0,0.0,0.0,0.0,0.0,0.0,0.0,1196.0
25%,2.0,650000.0,6.1,2.0,1.0,1.0,177.0,122.0,1960.0
50%,3.0,903000.0,9.2,3.0,1.0,2.0,440.0,126.0,1970.0
75%,3.0,1330000.0,13.0,3.0,2.0,2.0,651.0,129.94,1975.0
max,10.0,9000000.0,48.1,20.0,8.0,10.0,433014.0,44515.0,2018.0


Based on this, there seem to be three initial groups of possible outliers:
1. Where there are 0 bedrooms
2. Where there are 0 bathrooms
3. Where the YearBuilt is <1900
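
Before dropping anything, it can help to count how many rows each rule would flag. The following is a small sketch on made-up rows; the dictionary keys are hypothetical labels, not names from the notebook.

```python
import pandas as pd

# Toy rows; the values are illustrative only.
toy = pd.DataFrame({
    "Bedroom2": [2.0, 0.0, 3.0, 3.0],
    "Bathroom": [1.0, 0.0, 0.0, 2.0],
    "YearBuilt": [1970.0, 1970.0, 1196.0, 2014.0],
})

# One boolean mask per rule from the list above.
checks = {
    "zero_bedrooms": toy["Bedroom2"] == 0,
    "zero_bathrooms": toy["Bathroom"] == 0,
    "built_before_1900": toy["YearBuilt"] < 1900,
}

# Summing a boolean mask counts the rows that match the rule.
counts = {name: int(mask.sum()) for name, mask in checks.items()}
print(counts)
```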

In [74]:
df[df["Bathroom"]==0]

Unnamed: 0,Rooms,Type,Price,Method,Distance,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Regionname
139,3,h,1485000.0,SP,6.4,3.0,0.0,0.0,597.0,126.0,1970.0,Northern Metropolitan
281,2,u,546000.0,SP,6.3,2.0,0.0,0.0,0.0,126.0,1970.0,Southern Metropolitan
505,2,u,497500.0,PI,6.6,2.0,0.0,0.0,0.0,126.0,1970.0,Southern Metropolitan
584,2,h,1010000.0,PI,9.7,2.0,0.0,0.0,1611.0,126.0,1970.0,Southern Metropolitan
913,3,h,700000.0,S,13.9,0.0,0.0,0.0,456.0,126.0,1970.0,Southern Metropolitan
1063,3,h,1900000.0,S,11.2,3.0,0.0,0.0,0.0,126.0,1970.0,Southern Metropolitan
1070,3,t,1067000.0,S,11.2,3.0,0.0,1.0,0.0,126.0,1970.0,Southern Metropolitan
1593,4,h,1400000.0,PI,7.8,3.0,0.0,0.0,693.0,126.0,1935.0,Southern Metropolitan
2253,2,u,410000.0,VB,8.5,0.0,0.0,0.0,0.0,126.0,1970.0,Southern Metropolitan
2777,2,h,845000.0,S,9.2,2.0,0.0,0.0,207.0,126.0,1970.0,Southern Metropolitan


In [75]:
df[df["Bedroom2"]==0]

Unnamed: 0,Rooms,Type,Price,Method,Distance,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Regionname
135,2,t,650000.0,SP,6.4,0.0,1.0,0.0,0.0,126.0,1970.0,Northern Metropolitan
804,3,t,830000.0,PI,13.9,0.0,2.0,2.0,292.0,141.0,2012.0,Southern Metropolitan
827,3,h,1355000.0,S,13.9,0.0,1.0,2.0,818.0,126.0,1970.0,Southern Metropolitan
913,3,h,700000.0,S,13.9,0.0,0.0,0.0,456.0,126.0,1970.0,Southern Metropolitan
2253,2,u,410000.0,VB,8.5,0.0,0.0,0.0,0.0,126.0,1970.0,Southern Metropolitan
3360,4,h,2400000.0,S,7.9,0.0,2.0,2.0,1252.0,201.0,1920.0,Eastern Metropolitan
6170,3,h,1560000.0,S,11.2,0.0,2.0,1.0,335.0,209.0,2013.0,Southern Metropolitan
6866,2,u,872000.0,S,1.5,0.0,0.0,0.0,0.0,126.0,1970.0,Northern Metropolitan
6893,3,h,585000.0,S,12.4,0.0,1.0,1.0,605.0,103.0,1960.0,Northern Metropolitan
7385,3,h,1030000.0,SP,9.1,0.0,1.0,1.0,224.0,126.0,1970.0,Western Metropolitan


In [76]:
df[df["YearBuilt"]<1850]

Unnamed: 0,Rooms,Type,Price,Method,Distance,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Regionname
2079,2,u,855000.0,S,1.6,2.0,1.0,1.0,2886.0,122.0,1830.0,Northern Metropolitan
9968,3,h,1200000.0,VB,14.2,3.0,1.0,4.0,807.0,117.0,1196.0,Eastern Metropolitan


On closer inspection, 1850 looks like a better lower bound for YearBuilt than 1900: only two rows fall below it.

In [77]:
df.shape

(13580, 12)

In [78]:
df_no_outliers_v1 = df[(df["Bedroom2"]>0) & (df["Bathroom"] > 0) & (df["YearBuilt"]>1850)]

df_no_outliers_v1.shape

(13530, 12)

We got rid of 50 outliers!

In [79]:
df_no_outliers_v1.describe()

Unnamed: 0,Rooms,Price,Distance,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt
count,13530.0,13530.0,13530.0,13530.0,13530.0,13530.0,13530.0,13530.0,13530.0
mean,2.939246,1076124.0,10.14694,2.91929,1.538137,1.615965,559.208869,139.688495,1966.879379
std,0.956287,639910.4,5.87354,0.961828,0.688373,0.958699,3997.893198,392.937342,28.272733
min,1.0,85000.0,0.0,1.0,1.0,0.0,0.0,0.0,1854.0
25%,2.0,650000.0,6.2,2.0,1.0,1.0,178.0,122.0,1960.0
50%,3.0,903500.0,9.2,3.0,1.0,2.0,441.5,126.0,1970.0
75%,3.0,1330000.0,13.0,3.0,2.0,2.0,651.0,130.0,1975.0
max,10.0,9000000.0,48.1,20.0,8.0,10.0,433014.0,44515.0,2018.0


In [80]:
df_no_outliers_v1.head()

Unnamed: 0,Rooms,Type,Price,Method,Distance,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Regionname
0,2,h,1480000.0,S,2.5,2.0,1.0,1.0,202.0,126.0,1970.0,Northern Metropolitan
1,2,h,1035000.0,S,2.5,2.0,1.0,0.0,156.0,79.0,1900.0,Northern Metropolitan
2,3,h,1465000.0,SP,2.5,3.0,2.0,0.0,134.0,150.0,1900.0,Northern Metropolitan
3,3,h,850000.0,PI,2.5,3.0,2.0,1.0,94.0,126.0,1970.0,Northern Metropolitan
4,4,h,1600000.0,VB,2.5,3.0,1.0,2.0,120.0,142.0,2014.0,Northern Metropolitan


Now for Price, Car, Bedroom2, Landsize, and BuildingArea, let's go through and remove outliers that are more than 4 standard deviations from the mean.

In [81]:
for column in ['Price', 'Car', 'Bedroom2', 'Landsize', 'BuildingArea']:
    mean = df_no_outliers_v1[column].mean()
    std = df_no_outliers_v1[column].std()
    lower_limit = mean - (4 * std)
    upper_limit = mean + (4 * std)
    
    # show us the outliers
    df_outliers_v2 = df_no_outliers_v1[(df_no_outliers_v1[column] < lower_limit) | (df_no_outliers_v1[column] > upper_limit)]


    # Keep only the rows where column values are within the mean +/- 4 standard deviations

    df_no_outliers_v2 = df_no_outliers_v1[(df_no_outliers_v1[column] >= lower_limit) & (df_no_outliers_v1[column] <= upper_limit)]

In [82]:
df_outliers_v2

Unnamed: 0,Rooms,Type,Price,Method,Distance,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Regionname
1484,4,h,1280000.0,S,11.8,4.0,1.0,2.0,732.0,6791.0,1970.0,Eastern Metropolitan
1588,5,h,2608000.0,S,7.8,5.0,2.0,4.0,730.0,3112.0,1920.0,Southern Metropolitan
2560,2,t,930000.0,S,3.5,2.0,3.0,0.0,2778.0,3558.0,1970.0,Northern Metropolitan
13245,5,h,1355000.0,S,48.1,5.0,3.0,5.0,44500.0,44515.0,1970.0,Northern Victoria


In [56]:
df_no_outliers_v2.shape

(13526, 12)

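We removed another 4 outliers! Note that all four come from BuildingArea: the loop rebuilds df_no_outliers_v2 from df_no_outliers_v1 on every pass, so only the filter for the last column in the list is kept.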
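
The loop in In [81] reassigns df_no_outliers_v2 on each pass. A possible variant, sketched below, accumulates one mask so that a row is dropped when any listed column falls outside mean +/- k standard deviations. The helper name `within_k_std` and the synthetic data are assumptions for illustration; running this on the real data would change the row counts shown above.

```python
import numpy as np
import pandas as pd

def within_k_std(frame, columns, k=4.0):
    """Return the rows that lie within mean +/- k*std for every listed column."""
    keep = pd.Series(True, index=frame.index)
    for column in columns:
        mean, std = frame[column].mean(), frame[column].std()
        # between() is inclusive on both ends, matching the >= / <= filter in the notebook
        keep &= frame[column].between(mean - k * std, mean + k * std)
    return frame[keep].copy()

# Synthetic data with one planted outlier per column.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Price": rng.normal(1_000_000, 50_000, 200),
    "Landsize": rng.normal(500, 20, 200),
})
toy.loc[0, "Price"] = 9_000_000
toy.loc[1, "Landsize"] = 40_000

filtered = within_k_std(toy, ["Price", "Landsize"])
print(len(toy), len(filtered))
```

Returning a copy also avoids the SettingWithCopyWarning when columns of the filtered frame are modified later.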

The last steps in preparing our data are to scale the numerical columns with StandardScaler and one-hot encode the categorical columns.

In [83]:
from sklearn.preprocessing import StandardScaler

# Work on an explicit copy so the column assignment below does not write to a
# slice of df_no_outliers_v1 (this avoids the SettingWithCopyWarning)
df_no_outliers_v2 = df_no_outliers_v2.copy()

# Scale numerical values using StandardScaler
numerical_columns = ['Rooms', 'Distance', 'Bedroom2', 'Bathroom', 'Car', 'Landsize', 'BuildingArea', 'YearBuilt']
scaler = StandardScaler()
df_no_outliers_v2[numerical_columns] = scaler.fit_transform(df_no_outliers_v2[numerical_columns])

# One-hot encode categorical values using pandas get_dummies
categorical_columns = ['Type', 'Method', 'Regionname']
df_no_outliers_v2 = pd.get_dummies(df_no_outliers_v2, columns=categorical_columns, drop_first=True)  # drop_first=True to avoid the dummy variable trap


In [84]:
df_no_outliers_v2.shape

(13526, 22)

In [85]:
df_no_outliers_v2.head()

Unnamed: 0,Rooms,Price,Distance,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,Type_t,...,Method_SA,Method_SP,Method_VB,Regionname_Eastern Victoria,Regionname_Northern Metropolitan,Regionname_Northern Victoria,Regionname_South-Eastern Metropolitan,Regionname_Southern Metropolitan,Regionname_Western Metropolitan,Regionname_Western Victoria
0,-0.98216,1480000.0,-1.303487,-0.955743,-0.781651,-0.642594,-0.088878,-0.148496,0.110277,False,...,False,False,False,False,True,False,False,False,False,False
1,-0.98216,1035000.0,-1.303487,-0.955743,-0.781651,-1.686394,-0.100435,-0.887552,-2.365588,False,...,False,False,False,False,True,False,False,False,False,False
2,0.063879,1465000.0,-1.303487,0.084271,0.671429,-1.686394,-0.105962,0.228895,-2.365588,False,...,False,True,False,False,True,False,False,False,False,False
3,0.063879,850000.0,-1.303487,0.084271,0.671429,-0.642594,-0.116011,-0.148496,0.110277,False,...,False,False,False,False,True,False,False,False,False,False
4,1.109919,1600000.0,-1.303487,0.084271,-0.781651,0.401206,-0.109479,0.103098,1.666534,False,...,False,False,True,False,True,False,False,False,False,False
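
Scaling the whole table up front means the scaler sees rows that later end up in validation folds. One alternative, sketched here on synthetic data, wraps the preprocessing and the model in a scikit-learn `Pipeline` so the scaler and encoder are refit on each training fold. The toy columns, the price formula, and the choice of `Ridge` are assumptions for illustration, not the notebook's setup.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the housing data.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Rooms": rng.integers(1, 6, n),
    "Distance": rng.uniform(0, 30, n),
    "Type": rng.choice(["h", "u", "t"], n),
})
toy["Price"] = 300_000 * toy["Rooms"] - 10_000 * toy["Distance"] + rng.normal(0, 50_000, n)

# Scale numeric columns and one-hot encode the categorical one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Rooms", "Distance"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Type"]),
])
pipeline = Pipeline([("prep", preprocess), ("model", Ridge())])

# cross_val_score refits the whole pipeline, preprocessing included, on each training fold.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, toy.drop(columns="Price"), toy["Price"], cv=kf, scoring="r2")
print(scores.mean())
```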


Now, it's time to train our model!

We will use K-fold cross-validation to compare candidate models and GridSearchCV to fine-tune the parameters of the best one.

In [88]:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
import numpy as np

X = df_no_outliers_v2.drop('Price', axis=1)
y = df_no_outliers_v2['Price'] 

# Initialize models for a regression task
models = {
    'LinearRegression': LinearRegression(),
    'RidgeRegression': Ridge(),
    'SVR': SVR(),
    'DecisionTreeRegressor': DecisionTreeRegressor(),
    'RandomForestRegressor': RandomForestRegressor(),
    'KNeighborsRegressor': KNeighborsRegressor()
}

# Number of folds
n_splits = 5  # Or another number of folds you want to use
kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)

# Perform cross-validation and store results
results = {}
for name, model in models.items():
    cv_scores = cross_val_score(model, X, y, cv=kf, scoring='r2')  # Use 'r2' for regression
    results[name] = cv_scores
    print(f'{name}: Mean R-squared={np.mean(cv_scores):.4f}, Standard Deviation={np.std(cv_scores):.4f}')


LinearRegression: Mean R-squared=0.6054, Standard Deviation=0.0187
RidgeRegression: Mean R-squared=0.6054, Standard Deviation=0.0188
SVR: Mean R-squared=-0.0721, Standard Deviation=0.0047
DecisionTreeRegressor: Mean R-squared=0.5472, Standard Deviation=0.0236
RandomForestRegressor: Mean R-squared=0.7626, Standard Deviation=0.0231
KNeighborsRegressor: Mean R-squared=0.6527, Standard Deviation=0.0181
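
The negative score for SVR may look odd. R-squared is 1 minus the ratio of the model's squared error to the squared error of always predicting the mean, so a model that does worse than that constant baseline scores below zero. The sketch below works this through on four made-up prices; the function name `r_squared` is a local helper, not a library call. A likely contributor in this case is that SVR's defaults (C=1.0, epsilon=0.1) are very small relative to an unscaled target in the hundreds of thousands, though that is not verified here.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # error of the mean baseline
    return 1.0 - ss_res / ss_tot

# Made-up prices for illustration.
y = np.array([650_000.0, 903_000.0, 1_330_000.0, 1_600_000.0])

mean_baseline = np.full_like(y, y.mean())            # always predict the mean
off_baseline = np.full_like(y, y.mean() - 200_000)   # constant guess that misses the mean

r2_mean = r_squared(y, mean_baseline)   # zero by construction
r2_off = r_squared(y, off_baseline)     # below zero: worse than the mean
print(r2_mean, r2_off)
```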


Based on this, the RandomForestRegressor did best. Let's use GridSearchCV to fine-tune its parameters.

In [90]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Define a parameter grid to search over
param_grid = {
    'n_estimators': [100, 200, 300],  # Number of trees in the forest
    'max_depth': [None, 10, 20, 30],  # Maximum depth of the tree
}

# Initialize the grid search with cross-validation
grid_search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=param_grid,
    cv=5,  # Number of folds in cross-validation
    n_jobs=-1,  # Use all cores
    verbose=2,
    scoring='r2'  # R-squared as the performance metric
)

# Fit the grid search to the data
X = df_no_outliers_v2.drop('Price', axis=1)
y = df_no_outliers_v2['Price'] 

grid_search.fit(X, y)

# View the best parameters from the grid search
best_params = grid_search.best_params_
print(f"Best parameters: {best_params}")

# Use the best estimator for further predictions or analysis
best_model = grid_search.best_estimator_

Fitting 5 folds for each of 12 candidates, totalling 60 fits
[CV] END ...................max_depth=None, n_estimators=100; total time=   3.4s
[CV] END ...................max_depth=None, n_estimators=100; total time=   3.4s
[CV] END ...................max_depth=None, n_estimators=100; total time=   3.4s
[CV] END ...................max_depth=None, n_estimators=100; total time=   3.5s
[CV] END ...................max_depth=None, n_estimators=100; total time=   3.4s
[CV] END ...................max_depth=None, n_estimators=200; total time=   6.5s
[CV] END ...................max_depth=None, n_estimators=200; total time=   6.7s
[CV] END ...................max_depth=None, n_estimators=200; total time=   6.9s
[CV] END ...................max_depth=None, n_estimators=200; total time=   6.9s
[CV] END ...................max_depth=None, n_estimators=200; total time=   6.9s
[CV] END .....................max_depth=10, n_estimators=100; total time=   2.0s
[CV] END .....................max_depth=10, n_es

Our best parameters are {'max_depth': 20, 'n_estimators': 300}
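
To see where the "12 candidates, totalling 60 fits" in the log comes from, here is a small sketch that enumerates the same grid with `itertools.product`. It only illustrates the counting and is not how GridSearchCV is implemented internally.

```python
from itertools import product

# Same grid as in the search above.
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20, 30],
}

# Every combination of one value per parameter is one candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 5
total_fits = len(candidates) * n_folds   # each candidate is trained once per fold
print(len(candidates), total_fits)       # 12 60
```

Adding a third parameter with, say, three values would triple both numbers, so the grid is worth keeping small.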

In [91]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
import joblib  # Used for saving the model

# Best parameters from GridSearchCV
best_params = {'max_depth': 20, 'n_estimators': 300}

# Create the model with the best parameters
model = RandomForestRegressor(**best_params, random_state=42)

# Setup KFold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# Lists to store each fold's training and validation scores
train_scores = []
valid_scores = []

# Perform K-fold cross-validation
for train_index, test_index in kf.split(X):
    # Split data into training and testing sets for this fold
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    # Train the model
    model.fit(X_train, y_train)
    
    # Evaluate the model
    train_score = model.score(X_train, y_train)
    valid_score = model.score(X_test, y_test)
    
    # Store the scores
    train_scores.append(train_score)
    valid_scores.append(valid_score)
    
    print(f"Train Score: {train_score}, Validation Score: {valid_score}")

# Average scores across all folds
average_train_score = sum(train_scores) / len(train_scores)
average_valid_score = sum(valid_scores) / len(valid_scores)
print(f"Average Train Score: {average_train_score}, Average Validation Score: {average_valid_score}")

# Since we've trained our model on multiple folds, we now need to retrain on the entire dataset
model.fit(X, y)

Train Score: 0.9647626187867511, Validation Score: 0.765443036138708
Train Score: 0.9639793170337031, Validation Score: 0.7888377752443156
Train Score: 0.965135676738798, Validation Score: 0.7626469595769781
Train Score: 0.9669447659616327, Validation Score: 0.7230545446810706
Train Score: 0.9651683656608749, Validation Score: 0.7771791392364069
Average Train Score: 0.9651981488363519, Average Validation Score: 0.7634322909754958
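
The gap between training and validation scores is worth a quick look. The sketch below plugs in the per-fold numbers printed above: the model explains roughly 97% of the variance on the data it was trained on but about 76% on held-out folds, which suggests it fits the training data more closely than it generalizes.

```python
import numpy as np

# Per-fold scores copied from the output above.
train_scores = np.array([0.9647626187867511, 0.9639793170337031, 0.965135676738798,
                         0.9669447659616327, 0.9651683656608749])
valid_scores = np.array([0.765443036138708, 0.7888377752443156, 0.7626469595769781,
                         0.7230545446810706, 0.7771791392364069])

# Difference between average training and average validation R-squared.
gap = train_scores.mean() - valid_scores.mean()
print(f"train={train_scores.mean():.3f}, validation={valid_scores.mean():.3f}, gap={gap:.3f}")
```

The validation average is the number to quote as the expected performance on unseen houses.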


Our model is done :)
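
The last cell imports joblib for saving the model but never calls it. A minimal sketch of saving and reloading, assuming a small stand-in model trained on synthetic data: the file name `melbourne_rf.joblib` is hypothetical, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic problem standing in for the real X and y.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5])

toy_model = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "melbourne_rf.joblib")  # hypothetical file name
    joblib.dump(toy_model, path)    # write the fitted model to disk
    restored = joblib.load(path)    # read it back into a new object

# The restored model should give the same predictions as the original.
same_predictions = np.allclose(toy_model.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

Any preprocessing objects, such as the fitted scaler, would need to be saved too so that new data can be transformed in the same way before prediction.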