# Introduction to Scikit-Learn

What to learn:

    1. An end-to-end Scikit-Learn workflow
    2. Getting the data ready
    3. Choosing the right estimator/algorithm for our problem
    4. Fitting the model/algorithm and using it to make predictions on our data
    5. Evaluating the model
    6. Improving the model
    7. Saving and loading a trained model
    8. Putting it all together

In [116]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [117]:
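# Later cells refer to this DataFrame as both `heart_disease` and `df`
heart_disease = pd.read_csv('data/heart-disease.csv')
df = heart_disease
df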

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [16]:
# Create X (features matrix)
X = df.drop("target", axis=1)

# Create Y (labels)
Y = df["target"]

In [13]:
# 2. Choose the right model and hyperparameters
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

# Keep the default hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [28]:
# 3. Fit the model to the training data
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.25)

In [29]:
clf.fit(x_train, y_train);

In [30]:
#make a prediction
y_preds = clf.predict(x_test)
y_preds

array([1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1,
       0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0,
       1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0,
       1, 0, 0, 1, 1, 1, 1, 1, 1, 1])

In [31]:
y_test

68     1
43     1
69     1
180    0
241    0
      ..
157    1
143    1
74     1
98     1
27     1
Name: target, Length: 76, dtype: int64

In [32]:
#evaluate the model
clf.score(x_train, y_train)

1.0

In [33]:
clf.score(x_test, y_test)

0.8289473684210527

In [35]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.86      0.74      0.79        34
           1       0.81      0.90      0.85        42

    accuracy                           0.83        76
   macro avg       0.84      0.82      0.82        76
weighted avg       0.83      0.83      0.83        76



In [36]:
confusion_matrix(y_test, y_preds)

array([[25,  9],
       [ 4, 38]])
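
The numbers in the classification report can be recovered from the confusion matrix by hand. The sketch below copies the matrix from the output above and assumes scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are only illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows: true class (0, 1). Columns: predicted class (0, 1).
cm = np.array([[25, 9],
               [4, 38]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()       # share of all predictions that were correct
precision_1 = tp / (tp + fp)          # of the rows predicted 1, how many were truly 1
recall_1 = tp / (tp + fn)             # of the rows that are truly 1, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy:            {accuracy:.2f}")
print(f"precision (class 1): {precision_1:.2f}")
print(f"recall (class 1):    {recall_1:.2f}")
print(f"f1 (class 1):        {f1_1:.2f}")
```

These values line up with the `accuracy` row and the class `1` row of the report, and the accuracy is the same quantity that `clf.score(x_test, y_test)` returned.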

In [42]:
# Try different numbers of estimators and report accuracy as a percentage
np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    model = RandomForestClassifier(n_estimators=i).fit(x_train, y_train)
    print(f"Model accuracy on test set: {model.score(x_test, y_test) * 100:.2f}%")
    print("")

In [45]:
# Try different numbers of estimators (n_estimators is a hyperparameter you can change)
np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    model = RandomForestClassifier(n_estimators=i).fit(x_train, y_train)
    print(f"Model accuracy on test set: {model.score(x_test, y_test)}")
    print("")

Trying model with 10 estimators...
Model accuracy on test set: 0.7894736842105263

Trying model with 20 estimators...
Model accuracy on test set: 0.8026315789473685

Trying model with 30 estimators...
Model accuracy on test set: 0.9078947368421053

Trying model with 40 estimators...
Model accuracy on test set: 0.8947368421052632

Trying model with 50 estimators...
Model accuracy on test set: 0.8947368421052632

Trying model with 60 estimators...
Model accuracy on test set: 0.8947368421052632

Trying model with 70 estimators...
Model accuracy on test set: 0.868421052631579

Trying model with 80 estimators...
Model accuracy on test set: 0.8552631578947368

Trying model with 90 estimators...
Model accuracy on test set: 0.881578947368421
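
With 76 test rows, a single prediction shifts the accuracy by about 1.3 percentage points, so some of the differences between neighbouring `n_estimators` values may be noise. One common way to get a steadier comparison is cross-validation. The sketch below is an illustration on synthetic data from `make_classification` (a stand-in, not the heart-disease dataset), and the names `X_demo`, `y_demo` and `cv_results` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: roughly the size of the heart-disease table)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)

cv_results = {}
for n in (10, 50, 90):
    clf_demo = RandomForestClassifier(n_estimators=n, random_state=42)
    # 5-fold cross-validation: five train/validation splits, one accuracy per split
    scores = cross_val_score(clf_demo, X_demo, y_demo, cv=5)
    cv_results[n] = scores.mean()
    print(f"{n} estimators: mean accuracy {scores.mean():.3f} (std {scores.std():.3f})")
```

Averaging over several splits also keeps the test set out of the tuning loop, so it can still serve as a final check.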



In [46]:

import pickle

# Save the trained model to a file (the context manager closes the file afterwards)
with open("random_forest_model_1.pkl", "wb") as f:
    pickle.dump(model, f)

In [49]:

# Load the saved model and check that it scores the same on the test set
with open("random_forest_model_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model.score(x_test, y_test)

0.881578947368421
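
As an alternative to `pickle`, `joblib` (installed alongside scikit-learn) offers the same dump/load round trip. The sketch below is only an illustration: it trains a small forest on synthetic data and writes to a temporary directory, and the file name and variable names are hypothetical.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic problem so the example runs on its own
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
model_demo = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "random_forest_demo.joblib"  # hypothetical file name
    joblib.dump(model_demo, model_path)   # save the fitted model
    reloaded = joblib.load(model_path)    # load it back

# The reloaded model should behave exactly like the original
same_predictions = np.array_equal(model_demo.predict(X_demo), reloaded.predict(X_demo))
print(same_predictions)
```

Either way, only load model files from sources you trust, since unpickling can execute arbitrary code.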

# 1. Getting our data ready to be used with machine learning

Three main things we have to do:

1. Split the data into features and labels (usually 'x' and 'y')
2. Fill (also called imputing) or disregard missing values
3. Convert non-numerical values to numerical values (also called feature encoding)
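
The three steps can be sketched with pandas alone. The tiny DataFrame below is made up for illustration (its column names mirror the car-sales data used later), and the fill values are one possible choice rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing category and one missing number
toy = pd.DataFrame({
    "Make": ["Honda", "BMW", None, "Toyota"],
    "Odometer (KM)": [35431.0, np.nan, 84714.0, 154365.0],
    "Price": [15323, 19943, 28343, 13434],
})

# 1. Split into features (x) and labels (y)
x_toy = toy.drop("Price", axis=1)
y_toy = toy["Price"]

# 2. Fill missing values: a placeholder for text, the column mean for numbers
x_toy["Make"] = x_toy["Make"].fillna("missing")
x_toy["Odometer (KM)"] = x_toy["Odometer (KM)"].fillna(x_toy["Odometer (KM)"].mean())

# 3. Convert the text column to numeric indicator columns (one-hot encoding)
x_encoded = pd.get_dummies(x_toy, columns=["Make"])
print(x_encoded)
```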

In [50]:
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [51]:
x = heart_disease.drop('target', axis=1)
x.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [52]:
y = heart_disease['target']
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [61]:
#split the data into training and test sets
from sklearn.model_selection import train_test_split 
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3)

In [62]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((212, 13), (91, 13), (212,), (91,))

In [63]:
x.shape


(303, 13)

In [64]:
len(heart_disease)

303
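
The 212/91 split above follows from the row count: with `test_size = 0.3`, 30% of 303 rows is 90.9, which ends up as 91 test rows, leaving 212 for training. The sketch below reproduces this on a plain array of row numbers; it assumes the test count is rounded up, which matches the shapes shown in the output, and the variable names are illustrative.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 303        # number of rows in the heart-disease data
test_size = 0.3

# Assumed rounding rule: the test count is rounded up, the rest goes to training
expected_test = math.ceil(test_size * n_rows)
expected_train = n_rows - expected_test

# Split a stand-in array of row numbers and compare the sizes
rows = np.arange(n_rows)
train_rows, test_rows = train_test_split(rows, test_size=test_size, random_state=42)

print(expected_train, expected_test)
print(len(train_rows), len(test_rows))
```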

# Make sure it's all numerical

In [92]:
car_sales = pd.read_csv("data/car-sales-extended.csv")
car_sales.head(102)

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043
...,...,...,...,...,...
97,Toyota,Black,133433,4,16519
98,Toyota,Red,147455,4,22296
99,Honda,Blue,48069,4,12196
100,Honda,Blue,146233,4,14202


In [68]:
car_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Make           1000 non-null   object
 1   Colour         1000 non-null   object
 2   Odometer (KM)  1000 non-null   int64 
 3   Doors          1000 non-null   int64 
 4   Price          1000 non-null   int64 
dtypes: int64(3), object(2)
memory usage: 39.2+ KB


In [69]:
x = car_sales.drop("Price", axis=1)
x.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431,4
1,BMW,Blue,192714,5
2,Honda,White,84714,4
3,Toyota,White,154365,4
4,Nissan,Blue,181577,3


In [71]:
y = car_sales['Price']
y.head()

0    15323
1    19943
2    28343
3    13434
4    14043
Name: Price, dtype: int64

In [72]:
x = car_sales.drop("Price", axis=1)
y = car_sales['Price']


x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3)

In [78]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                  remainder = "passthrough")
transformed_x = transformer.fit_transform(x)
transformed_x

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

In [79]:
pd.DataFrame(transformed_x)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,35820.0
996,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,155144.0
997,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,66604.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,215883.0


In [91]:
dummies = pd.get_dummies(car_sales[["Make", "Colour", "Doors"]])
dummies.head(102)

Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,4,0,1,0,0,0,0,0,0,1
1,5,1,0,0,0,0,1,0,0,0
2,4,0,1,0,0,0,0,0,0,1
3,4,0,0,0,1,0,0,0,0,1
4,3,0,0,1,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...
97,4,0,0,0,1,1,0,0,0,0
98,4,0,0,0,1,0,0,0,1,0
99,4,0,1,0,0,0,1,0,0,0
100,4,0,1,0,0,0,1,0,0,0


In [84]:
np.random.seed(42)
x_train, x_test, y_train, y_test = train_test_split(transformed_x,
                                                    y,
                                                    test_size=0.2)

In [86]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(x_train, y_train)
model.score(x_test, y_test)

0.336546945503971
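
For regressors, `score()` returns the coefficient of determination (R²) rather than accuracy: 1.0 is a perfect fit, and 0.0 is what a model that always predicts the mean would get. The sketch below computes R² by hand and compares it with `sklearn.metrics.r2_score`; the true prices are the first five rows of the table above, while the predictions are invented for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# True prices from the first five rows; predictions are hypothetical
y_true = np.array([15323, 19943, 28343, 13434, 14043], dtype=float)
y_pred = np.array([16000, 18500, 24000, 15000, 13000], dtype=float)

ss_res = np.sum((y_true - y_pred) ** 2)           # squared error left by the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # squared error of predicting the mean
r2_manual = 1 - ss_res / ss_tot

# A baseline that always predicts the mean scores exactly 0
r2_baseline = r2_score(y_true, np.full_like(y_true, y_true.mean()))

print(r2_manual, r2_score(y_true, y_pred), r2_baseline)
```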

## 1.2 What if there is missing data?

1. Fill the missing values with something like the column mean
2. Delete the rows with missing values altogether
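
The first option can also be done inside scikit-learn with `SimpleImputer`, a different technique from filling with pandas. The sketch below is an illustration on a made-up four-row table; the fill choices (a `"missing"` label, 4 doors, the mean odometer reading) and all variable names are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical sample with one gap per column
toy_missing = pd.DataFrame({
    "Make": ["Honda", np.nan, "Toyota", "BMW"],
    "Doors": [4.0, 5.0, np.nan, 4.0],
    "Odometer (KM)": [35431.0, 192714.0, np.nan, 154365.0],
})

# One imputer per column, each with its own fill strategy
imputer = ColumnTransformer([
    ("cat_imputer", SimpleImputer(strategy="constant", fill_value="missing"), ["Make"]),
    ("door_imputer", SimpleImputer(strategy="constant", fill_value=4), ["Doors"]),
    ("num_imputer", SimpleImputer(strategy="mean"), ["Odometer (KM)"]),
])

# The output columns follow the order of the transformers listed above
filled = pd.DataFrame(
    imputer.fit_transform(toy_missing),
    columns=["Make", "Doors", "Odometer (KM)"],
)
print(filled)
```

Because the imputer learns its fill values in `fit`, it can be fitted on the training split and then applied unchanged to the test split.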

In [94]:
car_sales_missing = pd.read_csv("data/car-sales-extended-missing-data.csv")
car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [95]:
car_sales_missing.isnull().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [100]:
x = car_sales_missing.drop("Price", axis = 1)
y = car_sales_missing['Price']

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                  remainder = "passthrough")
transformed_x = transformer.fit_transform(x)
transformed_x

<1000x16 sparse matrix of type '<class 'numpy.float64'>'
	with 4000 stored elements in Compressed Sparse Row format>

In [102]:
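# Fill missing values column by column; assigning the result back avoids
# chained inplace calls, which newer pandas versions warn about
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")

car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")

car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean())

car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)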

In [103]:
car_sales_missing.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

In [104]:
car_sales_missing.dropna(inplace = True)

In [106]:
car_sales_missing.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [108]:
x = car_sales_missing.drop("Price", axis = 1)
y = car_sales_missing['Price']

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                  remainder = "passthrough")
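# Note: fitting on the full DataFrame passes Price through as the last column
# (visible in the output below); use transformer.fit_transform(x) instead when
# the result will be used as model features, so the target does not leak in
transformed_x = transformer.fit_transform(car_sales_missing)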
transformed_x

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        3.54310e+04, 1.53230e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        1.92714e+05, 1.99430e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        8.47140e+04, 2.83430e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 0.00000e+00,
        6.66040e+04, 3.15700e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.15883e+05, 4.00100e+03],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.48360e+05, 1.27320e+04]])

# Choosing the right estimator/algorithm for the problem

Some things to note:

 * Scikit-Learn refers to machine learning models and algorithms as estimators
 * Classification problem - predicting a category (heart disease or not)
 * Sometimes we see clf (short for classifier) used as the name of a classification estimator
 * Regression problem - predicting a number (selling price of a car from car data)
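
Whichever kind of problem it is, estimators share the same `fit` / `predict` / `score` interface. The sketch below illustrates this with one classifier and one regressor on synthetic data from `make_classification` and `make_regression`; the data and the names (`demo_problems`, `results`) are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in problems: one with category labels, one with numeric targets
demo_problems = {
    "classifier": (RandomForestClassifier(random_state=42),
                   make_classification(n_samples=200, n_features=8, random_state=42)),
    "regressor": (RandomForestRegressor(random_state=42),
                  make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)),
}

results = {}
for name, (estimator, (X_demo, y_demo)) in demo_problems.items():
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
    estimator.fit(X_tr, y_tr)              # same call for both kinds of estimator
    preds = estimator.predict(X_te)        # categories for clf, numbers for the regressor
    results[name] = (preds, estimator.score(X_te, y_te))
    print(name, preds[:3], results[name][1])
```

The classifier predicts class labels and its score is accuracy, while the regressor predicts continuous numbers and its score is R².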

#### California housing dataset for regression model

In [121]:
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)


URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)>

In [122]:
car_sales

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043
...,...,...,...,...,...
995,Toyota,Black,35820,4,32042
996,Nissan,White,155144,3,5716
997,Nissan,Blue,66604,4,31570
998,Honda,White,215883,4,4001


In [143]:
cars = pd.read_csv("data/car.csv")
cars

Unnamed: 0,ID,Price,Levy,Manufacturer,Model,Prod. year,Category,Leather interior,Fuel type,Engine volume,Mileage,Cylinders,Gear box type,Drive wheels,Doors,Wheel,Color,Airbags
0,45654403,13328,1399,LEXUS,RX 450,2010,Jeep,Yes,Hybrid,3.5,186005 km,6.0,Automatic,4x4,04-May,Left wheel,Silver,12
1,44731507,16621,1018,CHEVROLET,Equinox,2011,Jeep,No,Petrol,3,192000 km,6.0,Tiptronic,4x4,04-May,Left wheel,Black,8
2,45774419,8467,-,HONDA,FIT,2006,Hatchback,No,Petrol,1.3,200000 km,4.0,Variator,Front,04-May,Right-hand drive,Black,2
3,45769185,3607,862,FORD,Escape,2011,Jeep,Yes,Hybrid,2.5,168966 km,4.0,Automatic,4x4,04-May,Left wheel,White,0
4,45809263,11726,446,HONDA,FIT,2014,Hatchback,Yes,Petrol,1.3,91901 km,4.0,Automatic,Front,04-May,Left wheel,Silver,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19232,45798355,8467,-,MERCEDES-BENZ,CLK 200,1999,Coupe,Yes,CNG,2.0 Turbo,300000 km,4.0,Manual,Rear,02-Mar,Left wheel,Silver,5
19233,45778856,15681,831,HYUNDAI,Sonata,2011,Sedan,Yes,Petrol,2.4,161600 km,4.0,Tiptronic,Front,04-May,Left wheel,Red,8
19234,45804997,26108,836,HYUNDAI,Tucson,2010,Jeep,Yes,Diesel,2,116365 km,4.0,Automatic,Front,04-May,Left wheel,Grey,4
19235,45793526,5331,1288,CHEVROLET,Captiva,2007,Jeep,Yes,Diesel,2,51258 km,4.0,Automatic,Front,04-May,Left wheel,Black,4


In [145]:
cars.drop(['ID', 'Levy', 'Engine volume', 'Drive wheels', 'Doors', 'Wheel'], axis = 1, inplace = True)

In [146]:
cars

Unnamed: 0,Price,Manufacturer,Model,Prod. year,Category,Leather interior,Fuel type,Mileage,Cylinders,Gear box type,Color,Airbags
0,13328,LEXUS,RX 450,2010,Jeep,Yes,Hybrid,186005 km,6.0,Automatic,Silver,12
1,16621,CHEVROLET,Equinox,2011,Jeep,No,Petrol,192000 km,6.0,Tiptronic,Black,8
2,8467,HONDA,FIT,2006,Hatchback,No,Petrol,200000 km,4.0,Variator,Black,2
3,3607,FORD,Escape,2011,Jeep,Yes,Hybrid,168966 km,4.0,Automatic,White,0
4,11726,HONDA,FIT,2014,Hatchback,Yes,Petrol,91901 km,4.0,Automatic,Silver,4
...,...,...,...,...,...,...,...,...,...,...,...,...
19232,8467,MERCEDES-BENZ,CLK 200,1999,Coupe,Yes,CNG,300000 km,4.0,Manual,Silver,5
19233,15681,HYUNDAI,Sonata,2011,Sedan,Yes,Petrol,161600 km,4.0,Tiptronic,Red,8
19234,26108,HYUNDAI,Tucson,2010,Jeep,Yes,Diesel,116365 km,4.0,Automatic,Grey,4
19235,5331,CHEVROLET,Captiva,2007,Jeep,Yes,Diesel,51258 km,4.0,Automatic,Black,4


In [148]:
# Convert each text column to a categorical, then keep its integer codes
cars['Manufacturer'] = pd.Categorical(cars['Manufacturer'])
cars['Manufacturer'] = cars['Manufacturer'].cat.codes

In [149]:
# The model codes go into a new column named 'Mode' (as shown in the output below),
# so they survive when the original 'Model' column is dropped
cars['Model'] = pd.Categorical(cars['Model'])
cars['Mode'] = cars['Model'].cat.codes

cars['Category'] = pd.Categorical(cars['Category'])
cars['Category'] = cars['Category'].cat.codes

cars['Leather interior'] = pd.Categorical(cars['Leather interior'])
cars['Leather interior'] = cars['Leather interior'].cat.codes

In [150]:
cars['Gear box type'] = pd.Categorical(cars['Gear box type'])
cars['Gear box type'] = cars['Gear box type'].cat.codes

cars['Color'] = pd.Categorical(cars['Color'])
cars['Color'] = cars['Color'].cat.codes

cars['Fuel type'] = pd.Categorical(cars['Fuel type'])
cars['Fuel type'] = cars['Fuel type'].cat.codes

In [151]:
# Drop the raw 'Model' column (its codes are stored in 'Mode') along with 'Cylinders'
cars.drop(['Model', 'Cylinders'], axis=1, inplace=True)
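
To see what the category codes mean, it helps to look at a small case. `pd.Categorical` sorts the distinct values and numbers them from 0, so each code is the position of the value in that sorted list. The sketch below uses a handful of manufacturer names as a made-up sample; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical sample of manufacturer names
makes = pd.Series(["LEXUS", "CHEVROLET", "HONDA", "FORD", "HONDA"])

as_category = pd.Categorical(makes)
codes = as_category.codes                                 # one integer per row
code_to_label = dict(enumerate(as_category.categories))   # lookup back to the names

print(list(codes))       # → [3, 0, 2, 1, 2]
print(code_to_label)     # → {0: 'CHEVROLET', 1: 'FORD', 2: 'HONDA', 3: 'LEXUS'}
```

Keeping a lookup like `code_to_label` makes it possible to translate the numbers back later. The codes also imply an ordering (FORD < HONDA) that has no real meaning for car makes, which is the situation the one-hot encoding used earlier avoids.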

In [152]:
cars

Unnamed: 0,Price,Manufacturer,Prod. year,Category,Leather interior,Fuel type,Mileage,Gear box type,Color,Airbags,Mode
0,13328,32,2010,4,1,2,186005 km,0,12,12,1242
1,16621,8,2011,4,0,5,192000 km,2,1,8,658
2,8467,21,2006,3,0,5,200000 km,3,1,2,684
3,3607,16,2011,4,1,2,168966 km,0,14,0,661
4,11726,21,2014,3,1,5,91901 km,0,12,4,684
...,...,...,...,...,...,...,...,...,...,...,...
19232,8467,36,1999,1,1,0,300000 km,1,12,5,385
19233,15681,23,2011,9,1,5,161600 km,2,11,8,1334
19234,26108,23,2010,4,1,1,116365 km,0,7,4,1442
19235,5331,8,2007,4,1,1,51258 km,0,1,4,456


In [155]:
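# Strip the "km" suffix and surrounding spaces, then convert the digits to numbers
cars['Mileage'] = pd.to_numeric(cars['Mileage'].str.replace("km", "", regex=False).str.strip())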

In [156]:
cars

Unnamed: 0,Price,Manufacturer,Prod. year,Category,Leather interior,Fuel type,Mileage,Gear box type,Color,Airbags,Mode
0,13328,32,2010,4,1,2,186005,0,12,12,1242
1,16621,8,2011,4,0,5,192000,2,1,8,658
2,8467,21,2006,3,0,5,200000,3,1,2,684
3,3607,16,2011,4,1,2,168966,0,14,0,661
4,11726,21,2014,3,1,5,91901,0,12,4,684
...,...,...,...,...,...,...,...,...,...,...,...
19232,8467,36,1999,1,1,0,300000,1,12,5,385
19233,15681,23,2011,9,1,5,161600,2,11,8,1334
19234,26108,23,2010,4,1,1,116365,0,7,4,1442
19235,5331,8,2007,4,1,1,51258,0,1,4,456
