## Introduction to Scikit learn

### Practice of functions inside Scikit learning

##### We are going to cover
##### 1. End-to-end workflow
##### 2. Chosing the right algorithm for the problem
##### 3. Fit the model/Algorithm to make predictions on the data
##### 4. Evaluate the model
##### 5. Improve the model 
##### 6. Save and load the trainned model
##### 7. Put it all together

## 1.Getting the data ready

In [308]:
import pandas as pd
import numpy as np
from sklearn.linear_model import Ridge
heart_disease = pd.read_csv("Data/heart-disease.csv")
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [249]:
# Creates X (Features)

X = heart_disease.drop("target", axis=1)

# Creates Y (Labels)

Y = heart_disease["target"]

### 2. Choose the right model and hyperparameters

In [250]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=100)

# We'll keep the default hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

### 3. Fit the model to the trainning data 

In [251]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y , test_size=0.2)

In [252]:
clf.fit(X_train, Y_train);

### 4. Evaluate the model

In [253]:
# Make a prediction
y_label = clf.predict(np.array([0,2,3,4]))



ValueError: Expected 2D array, got 1D array instead:
array=[0. 2. 3. 4.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

In [None]:
Y_preds = clf.predict(X_test)
Y_preds

### 4. Evaluate the model on the trainning data

In [254]:
clf.score(X_train, Y_train)

1.0

In [255]:
clf.score(X_test, Y_test)

0.8524590163934426

In [256]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score; 

In [257]:
print(classification_report(Y_test, Y_preds))

              precision    recall  f1-score   support

           0       0.44      0.38      0.41        29
           1       0.50      0.56      0.53        32

    accuracy                           0.48        61
   macro avg       0.47      0.47      0.47        61
weighted avg       0.47      0.48      0.47        61



### 5. Improve the model

In [258]:
# Try different ammounts of n_stimators
np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying models with {i} stimators...")
    clf = RandomForestClassifier(n_estimators=i).fit(X_train, Y_train)
    print(f"Model accuracy on test set : {clf.score(X_test, Y_test) * 100:.2f}%")

Trying models with 10 stimators...
Model accuracy on test set : 85.25%
Trying models with 20 stimators...
Model accuracy on test set : 80.33%
Trying models with 30 stimators...
Model accuracy on test set : 83.61%
Trying models with 40 stimators...
Model accuracy on test set : 80.33%
Trying models with 50 stimators...
Model accuracy on test set : 86.89%
Trying models with 60 stimators...
Model accuracy on test set : 83.61%
Trying models with 70 stimators...
Model accuracy on test set : 83.61%
Trying models with 80 stimators...
Model accuracy on test set : 83.61%
Trying models with 90 stimators...
Model accuracy on test set : 81.97%


### 6. Save the model and save it

In [259]:
import pickle 

pickle.dump(clf, open("random_forest_model_1.pkl", "wb"))

In [260]:
loaded_model = pickle.load(open("random_forest_model_1.pkl", "rb"))

In [261]:
loaded_model.score(X_test, Y_test)

0.819672131147541

#1. Getting the data ready 

Three things to do :
    
    1. Split the data into features and labels ("X" & "Y")
    
    2. Imputting or disregarding missing values
    
    3. Converting non-numerical values to numerical values



In [262]:
X

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3


In [263]:
Y

0      1
1      1
2      1
3      1
4      1
      ..
298    0
299    0
300    0
301    0
302    0
Name: target, Length: 303, dtype: int64

In [264]:
# Machine learning principle : Never test models that haven't been splited into trainning and test data

In [265]:
# Split the data in trainning and testing
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, 
                                                    Y, 
                                                    test_size=0.2)

In [266]:
X.shape

(303, 13)

In [267]:
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

((242, 13), (61, 13), (242,), (61,))

### 1.1 Make sure all data is numerical 

In [268]:
car_sales = pd.read_csv("Data/car-sales-extended.csv")

In [269]:
car_sales

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043
...,...,...,...,...,...
995,Toyota,Black,35820,4,32042
996,Nissan,White,155144,3,5716
997,Nissan,Blue,66604,4,31570
998,Honda,White,215883,4,4001


In [270]:
len(car_sales)

1000

In [271]:
# X to predict price
X = car_sales.drop("Price", axis=1)

# Y Label
Y = car_sales["Price"]

X_train, X_test, Y_train, Y_test = train_test_split(X,
                                                   Y,
                                                   test_size = 0.2)

In [272]:
from sklearn.ensemble import RandomForestRegressor 

model = RandomForestRegressor()
model.fit(X_train, Y_train)
model.score(X_test, Y_test)

ValueError: could not convert string to float: 'Toyota'

In [273]:
# Value error : Sklearn cannot deal with strings 

In [274]:
# Turn categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", 
                                 one_hot, 
                                 categorical_features)],
                                 remainder="passthrough")
transformed_X = transformer.fit_transform(car_sales_missing)
transformed_X

array([[0.0, 1.0, 0.0, ..., 0.0, 35431.0, 15323.0],
       [1.0, 0.0, 0.0, ..., 1.0, 192714.0, 19943.0],
       [0.0, 1.0, 0.0, ..., 0.0, 84714.0, 28343.0],
       ...,
       [0.0, 0.0, 1.0, ..., 0.0, 66604.0, 31570.0],
       [0.0, 1.0, 0.0, ..., 0.0, 215883.0, 4001.0],
       [0.0, 0.0, 0.0, ..., 0.0, 248360.0, 12732.0]], dtype=object)

In [275]:
pd.DataFrame(X)

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431,4
1,BMW,Blue,192714,5
2,Honda,White,84714,4
3,Toyota,White,154365,4
4,Nissan,Blue,181577,3
...,...,...,...,...
995,Toyota,Black,35820,4
996,Nissan,White,155144,3
997,Nissan,Blue,66604,4
998,Honda,White,215883,4


In [276]:
pd.DataFrame(transformed_X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,35431.0,15323.0
1,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0,19943.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,84714.0,28343.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,154365.0,13434.0
4,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0,14043.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,35820.0,32042.0
996,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,155144.0,5716.0
997,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,66604.0,31570.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,215883.0,4001.0


In [277]:
dummies = pd.get_dummies(car_sales[["Colour", "Make", "Doors"]])

In [278]:
dummies

Unnamed: 0,Doors,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota
0,4,0,0,0,0,1,0,1,0,0
1,5,0,1,0,0,0,1,0,0,0
2,4,0,0,0,0,1,0,1,0,0
3,4,0,0,0,0,1,0,0,0,1
4,3,0,1,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...
995,4,1,0,0,0,0,0,0,0,1
996,3,0,0,0,0,1,0,0,1,0
997,4,0,1,0,0,0,0,0,1,0
998,4,0,0,0,0,1,0,1,0,0


In [279]:
# ALl the data is already numerical

In [280]:
# Refit the model

In [281]:
np.random.seed(42)
X_train, X_test, Y_train, Y_test = train_test_split(X,
                                                    Y,
                                                    test_size=0.2)

In [282]:
model.score(X_test, Y_test)

ValueError: could not convert string to float: 'Nissan'

### 1.2 What if there were missing values ?

1. Fill them with some value
2. Remove the sample with missing values


In [283]:
car_sales_missing = pd.read_csv("Data/car-sales-extended-missing-data.csv")
car_sales_missing

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [284]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [285]:
# Convert data to numbers

### 1. Filling missing values with pd - INCOMPLETE SECTION

In [286]:
# FIll the Make column
car_sales_missing["Make"].fillna("missing", inplace=True)


# Fill the Colour column
car_sales_missing["Colour"].fillna("missing", inplace=True)

# Fill the Odometer column
car_sales_missing["Odometer (KM)"].fillna("missing", inplace=True)

# FIll the Doors column
car_sales_missing["Doors"].fillna(4, inplace=True)

In [287]:
car_sales_missing.head(10)

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
5,Honda,Red,42652.0,4.0,23883.0
6,Toyota,Blue,163453.0,4.0,8473.0
7,Honda,White,missing,4.0,20306.0
8,missing,White,130538.0,4.0,9374.0
9,Honda,Blue,51029.0,4.0,26683.0


In [288]:
car_sales_missing.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

In [289]:
len(car_sales_missing)

1000

### Option 2: Fill missing values with Scikit-learn

In [290]:
car_missing_values = pd.read_csv("Data/car-sales-extended-missing-data.csv")
car_missing_values

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [291]:
car_missing_values.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [292]:
X = car_sales_missing.drop("Price", axis=1)
Y = car_sales_missing["Price"]

In [293]:
X.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
dtype: int64

In [294]:
# Fill missing values with Scikit-learn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Fill categorical values with missing & and numerical values with mean

cat_imputer = SimpleImputer(strategy="constant")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
num_imputer = SimpleImputer(strategy="constant")

# Define columns

cat_features = ["Make", "Colour"]
door_features = ["Doors"]
num_features = ["Odometer (KM)"]

# Creates an imputer (Something that fills missing data)

imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, cat_features),
    ("door_features", door_imputer, door_features),
    ("num_features", num_imputer, num_features)
])

filled_X = imputer.fit_transform(X)
filled_X

array([['Honda', 'White', 4.0, 35431.0],
       ['BMW', 'Blue', 5.0, 192714.0],
       ['Honda', 'White', 4.0, 84714.0],
       ...,
       ['Nissan', 'Blue', 4.0, 66604.0],
       ['Honda', 'White', 4.0, 215883.0],
       ['Toyota', 'Blue', 4.0, 248360.0]], dtype=object)

In [295]:
# Get our transformed data array's back into DataFrame's
car_sales_filled = pd.DataFrame(filled_X, 
                                      columns=["Make", "Colour", "Doors", "Odometer (KM)"])

# Check missing data in training set
car_sales_filled.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

In [296]:
# Now we only have numeric data 

np.random.seed(41)
from sklearn.ensemble import RandomForestRegressor 
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(transformed_X,
                                                   Y,
                                                   test_size=0.2)

model = RandomForestRegressor()
model.fit(X_train, Y_train)
model.score(X_test, Y_test)

ValueError: could not convert string to float: 'missing'

### 2. Choosing the right algorithm/estimator

Once you've got your data ready, the next step is to choose an appropriate machine learning algorithm or model to find patterns in your data.

Some things to note:

Sklearn refers to machine learning models and algorithms as estimators.
Classification problem - predicting a category (heart disease or not).
Sometimes you'll see clf (short for classifier) used as a classification estimator instance's variable name.
Regression problem - predicting a number (selling price of a car).
Unsupervised problem - clustering (grouping unlabelled samples with other similar unlabelled samples).
If you know what kind of problem you're working with, one of the next places you should look at is the Scikit-Learn algorithm cheatsheet.

2.1 Picking a machine learning model for regression

In [297]:
# Get california housing dataset

In [298]:
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
housing

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n

In [299]:
housing_df = pd.DataFrame(housing["data"])
housing_df

Unnamed: 0,0,1,2,3,4,5,6,7
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25
...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32


In [300]:
# Problem : Try to predict MedHouseVal

housing_df["target"] = housing["target"]
housing_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,target
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [309]:
# Import algorithm 

# Setup random seed
np.random.seed(42)

# Create the data
X = housing_df.drop("target", axis=1)
Y = housing_df["target"]

# Split into train and test sets 
X_train, X_test, Y_train, Y_test = train_test_split(X,
                                                   Y,
                                                   test_size=0.2)

# Instantiate and fit model (on the trainning set)

model = Ridge()
model.fit(X_train, Y_train)

# Check the score of the model

model.score(X_test, Y_test)

0.575854961144012

In [310]:
# Try the above method in the following models
# 1. Lasso
# 2. ElasticNet
# 3. SDGRegressor