<a href="https://colab.research.google.com/github/LochanaBandara03/ML_tutorial/blob/main/Introduction_to_scikit_learn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [78]:
#Standard imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import sklearn
print(f"Scikit learn version: {sklearn.__version__}")

Scikit learn version: 1.5.2


##An end-to-end Scikit-Learn workflow

* Getting data ready (split into features and labels, prepare train and test steps)
* Choosing a model for our problem
* Fit the model to the data and use it to make a prediction
* Evaluate the model
* Experiment to improve
* Save a model for someone else to use

##Random Forest Classifier Workflow for Classifying Heart Disease

###1.Get the data ready

In [79]:
import pandas as pd

In [80]:
heart_disease = pd.read_csv("https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/heart-disease.csv")
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [81]:
#Create X (all the feature columns)
X = heart_disease.drop("target",axis=1)

#Create y (the target column)
y = heart_disease["target"]

#Check the head of DataFrame
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [82]:
#Check the head and value counts
y.head(), y.value_counts()

(0    1
 1    1
 2    1
 3    1
 4    1
 Name: target, dtype: int64,
 target
 1    165
 0    138
 Name: count, dtype: int64)

In [83]:
#Split the data into training
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.25)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((227, 13), (76, 13), (227,), (76,))

###2. Choose the model and hyperparameters

In [84]:
#This is a classification problem - RandomForestClassifier (ML algorithm for classification)
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()

In [85]:
#Current hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

###3. Fit the model to the data and use it to make a prediction

In [86]:
clf.fit(X=X_train, y=y_train)

In [87]:
#Predict a label, data should be the same shape as X_train
X_test.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
203,68,1,2,180,274,1,0,150,1,1.6,1,0,3
173,58,1,2,132,224,0,0,173,0,3.2,2,2,3
237,60,1,0,140,293,0,0,170,0,1.2,1,2,3
242,64,1,0,145,212,0,0,132,0,2.0,1,2,1
151,71,0,0,112,149,0,1,125,0,1.6,1,0,2


In [88]:
#Use thhe model to make a prediction on test data
y_preds = clf.predict(X=X_test)

###4. Evaluate the model

In [89]:
#Evaluate the model on training set
train_acc = clf.score(X=X_train, y=y_train)
print(f"The model's accuracy on the training dataset: {train_acc*100}%")

The model's accuracy on the training dataset: 100.0%


In [90]:
#Evaluate the mmodel on test dataset
test_acc = clf.score(X=X_test, y=y_test)
print(f"The model's accuracy on the testing dataset is: {test_acc*100:.2f}%")

The model's accuracy on the testing dataset is: 81.58%


All of the following classification metrics come from the sklearn.metrics module:

* classification_report(y_true, y_true) - Builds a text report showing various classification metrics such as precision, recall and F1-score.
* confusion_matrix(y_true, y_pred) - Create a confusion matrix to compare predictions to truth labels.
* accuracy_score(y_true, y_pred) - Find the accuracy score (the default metric) for a classifier.

In [91]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

#Create a classification report
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.80      0.75      0.77        32
           1       0.83      0.86      0.84        44

    accuracy                           0.82        76
   macro avg       0.81      0.81      0.81        76
weighted avg       0.82      0.82      0.81        76



In [92]:
#Create a confusion matrix
conf_mat = confusion_matrix(y_test, y_preds)
conf_mat

array([[24,  8],
       [ 6, 38]])

In [93]:
#Compute the accuray score
accuracy_score(y_test, y_preds)

0.8157894736842105

###5. Experiment to improve

But let's break it into two.

1. From a model perspective.
2. From a data perspective.

In [94]:
#Try different numbers of estimatros
np.random.seed(42)
for i in range(100, 200, 10):
  print(f"Trying model with {i} estimators...")
  model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
  print(f"Model accuracy on test set: {model.score(X_test, y_test)* 100:.2f}%")
  print("")

Trying model with 100 estimators...
Model accuracy on test set: 82.89%

Trying model with 110 estimators...
Model accuracy on test set: 81.58%

Trying model with 120 estimators...
Model accuracy on test set: 81.58%

Trying model with 130 estimators...
Model accuracy on test set: 82.89%

Trying model with 140 estimators...
Model accuracy on test set: 80.26%

Trying model with 150 estimators...
Model accuracy on test set: 78.95%

Trying model with 160 estimators...
Model accuracy on test set: 80.26%

Trying model with 170 estimators...
Model accuracy on test set: 82.89%

Trying model with 180 estimators...
Model accuracy on test set: 80.26%

Trying model with 190 estimators...
Model accuracy on test set: 81.58%



In [95]:
from sklearn.model_selection import cross_val_score

#with cross validation
np.random.seed(42)
for i in range(100, 200, 10):
  print(f"Trying model with {i} estimators.....")
  model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)

  #Measure the model score on a single train/test split
  model_score = model.score(X_test, y_test)
  print(f"Model accuracy on single test set split: {model_score*100:.2f}%")

  #Measure the mean cross-validation score across 5 different train and test splits
  cross_val_mean = np.mean(cross_val_score(model, X, y, cv=5))
  print(f"5-fold cross-validation score: {cross_val_mean*100 :.2f}%")

  print("")



Trying model with 100 estimators.....
Model accuracy on single test set split: 82.89%
5-fold cross-validation score: 82.15%

Trying model with 110 estimators.....
Model accuracy on single test set split: 78.95%
5-fold cross-validation score: 81.17%

Trying model with 120 estimators.....
Model accuracy on single test set split: 81.58%
5-fold cross-validation score: 83.16%

Trying model with 130 estimators.....
Model accuracy on single test set split: 81.58%
5-fold cross-validation score: 83.14%

Trying model with 140 estimators.....
Model accuracy on single test set split: 81.58%
5-fold cross-validation score: 82.48%

Trying model with 150 estimators.....
Model accuracy on single test set split: 80.26%
5-fold cross-validation score: 80.17%

Trying model with 160 estimators.....
Model accuracy on single test set split: 78.95%
5-fold cross-validation score: 80.83%

Trying model with 170 estimators.....
Model accuracy on single test set split: 81.58%
5-fold cross-validation score: 81.83%



In [96]:
#Another way to do it GridSearchCV
np.random.seed(42)
from sklearn.model_selection import GridSearchCV

#Define the parameters to search over in dictiionary
#any of target model's hyperparameters
param_grid = {'n_estimators' : [i for i in range(100,200,10)]}

#Setup the grid search
grid = GridSearchCV(estimator=RandomForestClassifier(),
                    param_grid=param_grid,
                    cv=5,
                    verbose=1)

#Fit the grid  search to the data
grid.fit(X, y)

#Find the best parameters
print(f"The best parameter values are: {grid.best_params_}")
print(f"With a score of: {grid.best_score_*100:.2f}%")

Fitting 5 folds for each of 10 candidates, totalling 50 fits
The best parameter values are: {'n_estimators': 120}
With a score of: 82.82%


In [97]:

#Set the model to be the best estimator
clf = grid.best_estimator_
clf

In [98]:
#Fit the best model
clf = clf.fit(X_train, y_train)

#Find the best model scores on our single test split
print(f"best model score on single split of the data: {clf.score(X_test, y_test)*100:.2f}%")

best model score on single split of the data: 81.58%


###6. Save a model

In [99]:
import pickle

#Save an existing model to file
pickle.dump(model, open("random_forest_model_1.pkl", "wb"))

In [100]:
#Load a pickle model and evaluate it
loaded_pickle_model = pickle.load(open("random_forest_model_1.pkl","rb"))
print(f"Loaded pickel model prediction score: {loaded_pickle_model.score(X_test, y_test)*100:.2f}%")

Loaded pickel model prediction score: 80.26%


In [101]:
from joblib import dump, load

# Save a model using joblib
dump(model, "random_forest_model_1.joblib")

['random_forest_model_1.joblib']

In [102]:
# Load a saved joblib model and evaluate it
loaded_joblib_model = load("random_forest_model_1.joblib")
print(f"Loaded joblib model prediction score: {loaded_joblib_model.score(X_test, y_test) * 100:.2f}%")

Loaded joblib model prediction score: 80.26%


##1. Getting the data ready

Three of the main steps you'll often have to take are:

* Splitting the data into features (usually X) and labels (usually y).
* Splitting the data into training and testing sets (and possibly a validation set).
* Filling (also called imputing) or disregarding missing values.
* Converting non-numerical values to numerical values (also call feature encoding).

In [103]:
#spitting data into X and y
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [104]:
#Spitting data into features(X) and labels(y)
X = heart_disease.drop('target',axis=1)
X

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3


In [105]:
y= heart_disease['target']
y

Unnamed: 0,target
0,1
1,1
2,1
3,1
4,1
...,...
298,0
299,0
300,0
301,0


In [106]:
#Spitting data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2)

#Check the shapes of different data splits
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))

In [107]:
#80% of data is used for training set - learn patterns
X.shape[0]*0.8

242.4

In [108]:
#20% of data is used for testing test - model will evaluated on these examples
X.shape[0]*0.2

60.6

###Make sure all numerical

In [109]:
#Import car-sales-extended.csv
car_sales = pd.read_csv("https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/car-sales-extended.csv")
car_sales

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043
...,...,...,...,...,...
995,Toyota,Black,35820,4,32042
996,Nissan,White,155144,3,5716
997,Nissan,Blue,66604,4,31570
998,Honda,White,215883,4,4001


In [110]:
car_sales.dtypes

Unnamed: 0,0
Make,object
Colour,object
Odometer (KM),int64
Doors,int64
Price,int64


In [111]:
#Split into X  & y train/test
X = car_sales.drop("Price",axis=1)
y = car_sales["Price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [112]:
#Try to predict with randoms forest on price column - doesn't work
from sklearn.ensemble import RandomForestRegressor

model = RandomForestClassifier()
# model.fit(X_train, y_train)
# model.score(X_test, y_test)

In [113]:
#1. Import OneHotEncoder and ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

#2. Define the categorical features to transform
categorical_features = ["Make", "Colour", "Doors"]

#3. Create an instance of OneHotEncoder
one_hot = OneHotEncoder()

#4. Create an instance of ColumnTransFormer
transformer = ColumnTransformer([("one_hot", #name
                                 one_hot, #transformer
                                 categorical_features)], #Columns to transform
                                remainder="passthrough") #Rest of the columns - (passthrough - leave the columns

#5. Turn the categorial features into numbers
transformed_X = transformer.fit_transform(X)
transformed_X


array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

In [114]:
X.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431,4
1,BMW,Blue,192714,5
2,Honda,White,84714,4
3,Toyota,White,154365,4
4,Nissan,Blue,181577,3


In [115]:
#First transformed sample
transformed_X[0]

array([0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
       0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00,
       1.0000e+00, 0.0000e+00, 3.5431e+04])

In [116]:
#Original first sample
X.iloc[0]

Unnamed: 0,0
Make,Honda
Colour,White
Odometer (KM),35431
Doors,4


###Nuemrically encoding data with pandas

In [117]:
#View head of original DataFrame
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [118]:
# One-hot encode categorial variables
categorical_variables = ["Make", "Colour", "Doors"]
dummies = pd.get_dummies(data=car_sales[categorical_variables])
dummies

Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,4,False,True,False,False,False,False,False,False,True
1,5,True,False,False,False,False,True,False,False,False
2,4,False,True,False,False,False,False,False,False,True
3,4,False,False,False,True,False,False,False,False,True
4,3,False,False,True,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...
995,4,False,False,False,True,True,False,False,False,False
996,3,False,False,True,False,False,False,False,False,True
997,4,False,False,True,False,False,True,False,False,False
998,4,False,True,False,False,False,False,False,False,True


In [119]:
np.random.seed(42)

#Create train and test splits with transformed_X
X_train, X_test, y_train, y_test =  train_test_split(transformed_X,
                                                     y,
                                                     test_size=0.2)

#Create the model instance
model = RandomForestRegressor()

#Fit the model on the numerical data
model.fit(X_train, y_train)

#Score the model
model.score(X_test, y_test)

0.3235867221569877

###If there were missing values

In [120]:
#Car sales DataFrame with missing values
car_sales_missing = pd.read_csv("https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/car-sales-extended-missing-data.csv")
car_sales_missing.head(10)

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
5,Honda,Red,42652.0,4.0,23883.0
6,Toyota,Blue,163453.0,4.0,8473.0
7,Honda,White,,4.0,20306.0
8,,White,130538.0,4.0,9374.0
9,Honda,Blue,51029.0,4.0,26683.0


In [121]:
#Sum of all missing values
car_sales_missing.isna().sum()  #isna - able to detect missing values

Unnamed: 0,0
Make,49
Colour,50
Odometer (KM),50
Doors,50
Price,50


In [122]:
#Create features
X_missing =  car_sales_missing.drop("Price", axis=1)
print(f"Number of missing values: \n {X_missing.isna().sum()}")

Number of missing values: 
 Make             49
Colour           50
Odometer (KM)    50
Doors            50
dtype: int64


In [123]:
#Create lables
y_missing = car_sales_missing["Price"]
print(f"Number of missing y values: {y_missing.isna().sum()}")

Number of missing y values: 50


In [124]:
#Turn categories (Make and Colour) into Numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

one_hot = OneHotEncoder()

transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                remainder="passthrough",
                                sparse_threshold=0) #return a sparse matrix or not

transformed_X_missing =  transformer.fit_transform(X_missing)
transformed_X_missing

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 0.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        0.00000e+00, 2.48360e+05]])

In [125]:
car_sales_missing.isna().sum()

Unnamed: 0,0
Make,49
Colour,50
Odometer (KM),50
Doors,50
Price,50


###Fill missing data with pandas

In [126]:
#Fill the make column
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna(value="missing")

#Fill the colour column
car_sales_missing["Make"] = car_sales_missing["Make"].fillna(value="missing")

In [127]:
car_sales_missing.isna().sum()

Unnamed: 0,0
Make,0
Colour,0
Odometer (KM),50
Doors,50
Price,50


In [128]:
#Find the most common values of doors column
car_sales_missing["Doors"].value_counts()

Unnamed: 0_level_0,count
Doors,Unnamed: 1_level_1
4.0,811
5.0,75
3.0,64


In [129]:
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(value=4)

In [130]:
#Fill the Odometer column with mean values
car_sales_missing["Odometer (KM)"] = car_sales["Odometer (KM)"].fillna(value =car_sales_missing["Odometer (KM)"].mean() )

In [131]:
car_sales_missing.isna().sum()

Unnamed: 0,0
Make,0
Colour,0
Odometer (KM),0
Doors,0
Price,50


In [132]:
#Remove the rows with missing prices
car_sales_missing.dropna(inplace=True)

In [133]:
car_sales_missing.isna().sum()

Unnamed: 0,0
Make,0
Colour,0
Odometer (KM),0
Doors,0
Price,0


In [134]:
#Create features
X_missing = car_sales_missing.drop("Price", axis=1)
print(f"Number of missing X values: \n{X_missing.isna().sum()}")

#Create lables
y_missing = car_sales_missing["Price"]
print(f"Number of missing y values: {y_missing.isna().sum()}")

Number of missing X values: 
Make             0
Colour           0
Odometer (KM)    0
Doors            0
dtype: int64
Number of missing y values: 0


In [135]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]

one_hot = OneHotEncoder()

transformer = ColumnTransformer([("One_hot",
                                  one_hot,
                                  categorical_features)],
                                remainder="passthrough",
                                sparse_threshold=0)

transformed_X_missing = transformer.fit_transform(X_missing)
transformed_X_missing

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

In [136]:
#Split data into training and test sets
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(transformed_X_missing,
                                                    y_missing,
                                                    test_size=0.2)

#Fit and score a model
model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.2983358663638609

###Filling missing data and transforming categorical data with scikit-learn

In [137]:
car_sales_missing.isna().sum()

Unnamed: 0,0
Make,0
Colour,0
Odometer (KM),0
Doors,0
Price,0


In [138]:
#Reimport with missing values to fill with scikit-learn
car_sales_missing = pd.read_csv("https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/car-sales-extended-missing-data.csv")
car_sales_missing.isna().sum()

Unnamed: 0,0
Make,49
Colour,50
Odometer (KM),50
Doors,50
Price,50


In [139]:
#Drop the rows with missing price values
car_sales_missing.dropna(subset=["Price"], inplace=True)

In [140]:
car_sales_missing.isna().sum()

Unnamed: 0,0
Make,47
Colour,46
Odometer (KM),48
Doors,47
Price,0


In [141]:
#Split into X and y
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

#Split data into train and test
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2)

In [142]:
from sklearn.impute import SimpleImputer

#Create categorical variable imputer
cat_imputer =  SimpleImputer(strategy="constant", fill_value="missing")

#Create Door column imputer
door_imputer = SimpleImputer(strategy="constant", fill_value=4)

#Create Odometer (KM) column imputer
num_imputer = SimpleImputer(strategy="mean")

In [143]:
#Define different column features
categorical_features = ["Make", "Colour"]
door_feature =["Doors"]
numerical_feature = ["Odometer (KM)"]

In [144]:
from sklearn.compose import ColumnTransformer

#Create series of column transforms to perform
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, categorical_features),
    ("door_imputer", door_imputer, door_feature),
    ("num_imputer", num_imputer, numerical_feature)
])

In [145]:
#Find values to fill and transform training data
filled_X_train = imputer.fit_transform(X_train)

#Fill values in to the test set with values learned from the training set
filled_X_test = imputer.transform(X_test)

#Check filled X_train
filled_X_train

array([['Honda', 'White', 4.0, 71934.0],
       ['Toyota', 'Red', 4.0, 162665.0],
       ['Honda', 'White', 4.0, 42844.0],
       ...,
       ['Toyota', 'White', 4.0, 196225.0],
       ['Honda', 'Blue', 4.0, 133117.0],
       ['Honda', 'missing', 4.0, 150582.0]], dtype=object)

In [146]:
#Transformed data array's back into DataFrame
filled_X_train_df = pd.DataFrame(filled_X_train,
                                 columns=["Make", "Colour", "Doors", "Odometer (KM)"])

filled_X_test_df = pd.DataFrame(filled_X_test,
                                columns=["Make", "Colour", "Doors", "Odometer (KM)"])

#Check missing data in training set
filled_X_train_df.isna().sum()

Unnamed: 0,0
Make,0
Colour,0
Doors,0
Odometer (KM),0


In [147]:
#Missing data in test set
filled_X_test_df.isna().sum()

Unnamed: 0,0
Make,0
Colour,0
Doors,0
Odometer (KM),0


In [148]:
#Check all data numerical or not
filled_X_train_df.head()

Unnamed: 0,Make,Colour,Doors,Odometer (KM)
0,Honda,White,4.0,71934.0
1,Toyota,Red,4.0,162665.0
2,Honda,White,4.0,42844.0
3,Honda,White,4.0,195829.0
4,Honda,Blue,4.0,219217.0


In [151]:
#One hot encoded the features because doors and colours are still strings
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]

one_hot = OneHotEncoder()

transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                remainder="passthrough",
                                sparse_threshold=0)

#Fill train and test values separately
transformed_X_train = transformer.fit_transform(filled_X_train_df)
transformed_X_test = transformer.transform(filled_X_test_df)

#Check transformed and filled X_train
transformed_X_train

array([[0.0, 1.0, 0.0, ..., 1.0, 0.0, 71934.0],
       [0.0, 0.0, 0.0, ..., 1.0, 0.0, 162665.0],
       [0.0, 1.0, 0.0, ..., 1.0, 0.0, 42844.0],
       ...,
       [0.0, 0.0, 0.0, ..., 1.0, 0.0, 196225.0],
       [0.0, 1.0, 0.0, ..., 1.0, 0.0, 133117.0],
       [0.0, 1.0, 0.0, ..., 1.0, 0.0, 150582.0]], dtype=object)

In [152]:
#Fit a model
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()

#Use transformed data
model.fit(transformed_X_train, y_train)
model.score(transformed_X_test, y_test)

0.21229043336119102