### Introduction to Scikit-Learn

This notebook illustrates the most useful functions of the Scikit-Learn library.

What we cover:
0. An end-to-end Scikit-Learn workflow
1. Get the data ready
2. Choose the right model
3. Fit model and make predictions
4. Evaluate model
5. Improve model
6. Save/Load model
7. Put it together


### 0. An end-to-end Scikit-Learn workflow

In [2]:
#1. Get the data ready
import pandas as pd
import numpy as np
heart_disease = pd.read_csv("data/heart-disease.csv")
heart_disease


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [9]:
#Create x (feature matrix)
x = heart_disease.drop("target", axis=1)

#Create y (labels)
y = heart_disease["target"]

In [2]:
#2. Choose the right model and hyperparameters
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100)

#Use default hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [10]:
#Fit the model to the training data
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [13]:
clf.fit(x_train, y_train)

RandomForestClassifier()

In [17]:
y_preds = clf.predict(x_test)

In [18]:
y_preds

array([1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0,
       1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1], dtype=int64)

In [19]:
y_test

92     1
9      1
174    0
237    0
30     1
      ..
16     1
136    1
93     1
276    0
147    1
Name: target, Length: 61, dtype: int64

In [20]:
#4. Evaluate the model on the training data and the test data
clf.score(x_train, y_train)

1.0

In [21]:
clf.score(x_test, y_test)

0.9016393442622951

In [22]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.91      0.83      0.87        24
           1       0.90      0.95      0.92        37

    accuracy                           0.90        61
   macro avg       0.90      0.89      0.90        61
weighted avg       0.90      0.90      0.90        61



In [23]:
confusion_matrix(y_test, y_preds)

array([[20,  4],
       [ 2, 35]], dtype=int64)

In [24]:
accuracy_score(y_test, y_preds)

0.9016393442622951
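
To see where the numbers in the classification report come from, the sketch below recomputes accuracy, precision and recall directly from the confusion matrix printed above. It relies on Scikit-Learn's convention that rows are true labels and columns are predicted labels; the variable names are illustrative only.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows = true class, columns = predicted class).
cm = np.array([[20, 4],
               [2, 35]])

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = np.trace(cm) / cm.sum()

# Precision per class: correct predictions over everything predicted as that class.
precision = np.diag(cm) / cm.sum(axis=0)

# Recall per class: correct predictions over everything that truly is that class.
recall = np.diag(cm) / cm.sum(axis=1)

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {np.round(precision, 2)}")
print(f"recall:    {np.round(recall, 2)}")
```

The rounded values should line up with the `precision` and `recall` columns of the report, and the accuracy with `accuracy_score`.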


In [36]:
#5. Improve a model
#Try different values of n_estimators

np.random.seed(42)

for i in range(10,1000,10):
    print(f"Trying model {i} estimators...")
    clf = RandomForestClassifier(n_estimators=i).fit(x_train, y_train)
    print(f"Model accuracy on testset: {clf.score(x_test, y_test)*100:.2f}")

Trying model 10 estimators...
Model accuracy on testset: 86.89
Trying model 20 estimators...
Model accuracy on testset: 85.25
Trying model 30 estimators...
Model accuracy on testset: 85.25
Trying model 40 estimators...
Model accuracy on testset: 86.89
Trying model 50 estimators...
Model accuracy on testset: 85.25
Trying model 60 estimators...
Model accuracy on testset: 90.16
Trying model 70 estimators...
Model accuracy on testset: 86.89
Trying model 80 estimators...
Model accuracy on testset: 90.16
Trying model 90 estimators...
Model accuracy on testset: 88.52
Trying model 100 estimators...
Model accuracy on testset: 86.89
Trying model 110 estimators...
Model accuracy on testset: 90.16
Trying model 120 estimators...
Model accuracy on testset: 88.52
Trying model 130 estimators...
Model accuracy on testset: 90.16
Trying model 140 estimators...
Model accuracy on testset: 90.16
Trying model 150 estimators...
Model accuracy on testset: 90.16
Trying model 160 estimators...
Model accuracy on 

In [37]:
#6. Save a model and load it
import pickle

#Use a with block so the file is closed once the model is written
with open("random_forest_model_1.pkl", "wb") as f:
    pickle.dump(clf, f)

In [39]:
with open("random_forest_model_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model.score(x_test, y_test)

0.9016393442622951
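
A quick way to check that saving and loading worked is to confirm the reloaded model predicts exactly what the original did. The sketch below does this on synthetic data from `make_classification`, which is an assumption standing in for `heart-disease.csv`; the file name and the temporary directory are also made up for the example.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; the notebook uses heart-disease.csv).
X, y = make_classification(n_samples=200, n_features=13, random_state=42)
clf = RandomForestClassifier(n_estimators=20, random_state=42).fit(X, y)

# Write the model inside a `with` block so the file is closed even on error.
model_path = os.path.join(tempfile.mkdtemp(), "random_forest_demo.pkl")
with open(model_path, "wb") as f:
    pickle.dump(clf, f)

# Read it back the same way.
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

# The reloaded model should reproduce the original predictions exactly.
same_predictions = bool((loaded_model.predict(X) == clf.predict(X)).all())
print(same_predictions)
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.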

## 1. Getting our data ready to be used with machine learning

Three main things we have to do:
1. Split the data into features and labels (usually `x` and `y`).
2. Fill (also called imputing) or disregard missing values.
3. Convert non-numerical values to numerical values (also called feature encoding).

In [7]:
x = heart_disease.drop("target", axis=1)
x.tail()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3
302,57,0,1,130,236,0,0,174,0,0.0,1,1,2


In [8]:
y = heart_disease["target"]
y.tail()

298    0
299    0
300    0
301    0
302    0
Name: target, dtype: int64

In [10]:
#Split the data into test and train set

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2)

In [13]:
np.shape(x_train)

(242, 13)

In [14]:
len(heart_disease)

303
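
The shape `(242, 13)` follows from the 80/20 split: with 303 rows and `test_size=0.2`, the test set is rounded up to 61 rows and the remaining 242 go to training. The sketch below reproduces this with placeholder arrays of the same shape as the heart-disease features; the zero-filled data and the fixed `random_state` are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows and feature columns
# as the heart-disease data (303 rows, 13 features).
X = np.zeros((303, 13))
y = np.zeros(303)

# random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```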

#### 1.1 Make sure it's all numerical

In [25]:
car_sales = pd.read_csv("data/car-sales-extended.csv")
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [28]:
#Split the data into features and target
x = car_sales.drop("Price", axis=1)
y = car_sales["Price"]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [None]:
#Convert String to Numerical data.


In [31]:
from sklearn.ensemble import RandomForestRegressor

model  = RandomForestRegressor()
model.fit(x_train, y_train)
model.score(x_test, y_test)


ValueError: could not convert string to float: 'Honda'

In [37]:
#Turn the categories into numbers

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                  remainder="passthrough")

transformed_x = transformer.fit_transform(x)
pd.DataFrame(transformed_x)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,35820.0
996,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,155144.0
997,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,66604.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,215883.0


In [40]:
dummies = pd.get_dummies(car_sales[["Make", "Colour", "Doors"]])
dummies.tail()

Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
995,4,0,0,0,1,1,0,0,0,0
996,3,0,0,1,0,0,0,0,0,1
997,4,0,0,1,0,0,1,0,0,0
998,4,0,1,0,0,0,0,0,0,1
999,4,0,0,0,1,0,1,0,0,0
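
The two encoders treat `Doors` differently: `OneHotEncoder` expands every column it is given, while `pd.get_dummies` only expands string columns and leaves the numeric `Doors` column as is. The sketch below shows this on four made-up rows laid out like `car-sales-extended.csv`; the row values are illustrative, not taken from the real file beyond what the table above shows.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Four illustrative rows in the same layout as car-sales-extended.csv.
cars = pd.DataFrame({
    "Make": ["Honda", "BMW", "Honda", "Toyota"],
    "Colour": ["White", "Blue", "White", "White"],
    "Doors": [4, 5, 4, 4],
    "Odometer (KM)": [35431, 192714, 84714, 154365],
})

# One-hot encode the three categorical columns, pass the odometer through.
transformer = ColumnTransformer(
    [("one_hot", OneHotEncoder(), ["Make", "Colour", "Doors"])],
    remainder="passthrough",
)
encoded = transformer.fit_transform(cars)

# Depending on how many zeros there are, the result may come back sparse.
if hasattr(encoded, "toarray"):
    encoded = encoded.toarray()

# get_dummies only expands the string columns, so Doors stays one numeric column.
dummies = pd.get_dummies(cars[["Make", "Colour", "Doors"]])

print(encoded.shape)
print(list(dummies.columns))
```

If `Doors` should be treated as a category with `get_dummies` as well, one option is to convert it to a string column first.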


In [46]:
#Let's refit the model on the encoded data

np.random.seed(42)
x_train, x_test, y_train, y_test= train_test_split(transformed_x, y, test_size = 0.2)

model.fit(x_train, y_train)

RandomForestRegressor()

In [47]:
model.score(x_test, y_test)

0.3235867221569877

#### 1.2 What if there are missing values?
1. Fill them with some value (also known as imputation).
2. Remove the samples with missing data altogether.


In [49]:
#Import car sales missing data

car_sales_missing = pd.read_csv("data/car-sales-extended-missing-data.csv")

In [51]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [52]:
#Let's try to convert our data to numbers

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                  remainder="passthrough")

transformed_x = transformer.fit_transform(x)
pd.DataFrame(transformed_x)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,35820.0
996,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,155144.0
997,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,66604.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,215883.0


### Option 1: Fill missing data with Pandas

In [54]:
#Fill the "Make" column
#It is categorical, so fill the gaps with the placeholder string "missing"
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")

#Fill the "Colour" column
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")

#Fill the "Odometer (KM)" column with its mean
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean())

#Fill the "Doors" column with 4 (taken here as the typical number of doors)
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)

In [55]:
#Check our dataframe again

car_sales_missing.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

In [56]:
#Remove rows with missing Price value

car_sales_missing.dropna(inplace=True)

In [58]:
x = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

In [59]:
#Let's try to convert our data to numbers

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                  remainder="passthrough")

transformed_x = transformer.fit_transform(x)
pd.DataFrame(transformed_x)

Unnamed: 0,0
0,"(0, 1)\t1.0\n (0, 9)\t1.0\n (0, 12)\t1.0\n..."
1,"(0, 0)\t1.0\n (0, 6)\t1.0\n (0, 13)\t1.0\n..."
2,"(0, 1)\t1.0\n (0, 9)\t1.0\n (0, 12)\t1.0\n..."
3,"(0, 3)\t1.0\n (0, 9)\t1.0\n (0, 12)\t1.0\n..."
4,"(0, 2)\t1.0\n (0, 6)\t1.0\n (0, 11)\t1.0\n..."
...,...
945,"(0, 3)\t1.0\n (0, 5)\t1.0\n (0, 12)\t1.0\n..."
946,"(0, 4)\t1.0\n (0, 9)\t1.0\n (0, 11)\t1.0\n..."
947,"(0, 2)\t1.0\n (0, 6)\t1.0\n (0, 12)\t1.0\n..."
948,"(0, 1)\t1.0\n (0, 9)\t1.0\n (0, 12)\t1.0\n..."


### Option 2: Fill missing data with Scikit-Learn

With Scikit-Learn, the data is split into training and test sets first, and each set is then filled and transformed separately:

1. Split your data first (into train/test).
2. Fill/transform the training set and the test set separately.
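
One way to keep the two sets separate is to learn the fill value from the training set only and then reuse it on the test set, so nothing about the test data leaks into preprocessing. The sketch below does this with `SimpleImputer` on a small made-up odometer column; the values and the `random_state` are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# A made-up single feature column with two missing readings.
odometer = np.array([[10.0], [20.0], [np.nan], [40.0], [50.0],
                     [np.nan], [70.0], [80.0], [90.0], [100.0]])
price = np.arange(10)

# Split first.
x_train, x_test, y_train, y_test = train_test_split(
    odometer, price, test_size=0.2, random_state=42
)

# fit_transform learns the mean from the training rows only...
imputer = SimpleImputer(strategy="mean")
x_train_filled = imputer.fit_transform(x_train)

# ...and transform reuses that same training mean on the test rows.
x_test_filled = imputer.transform(x_test)

print(imputer.statistics_)
```

Calling `fit_transform` on the test set instead would compute a second mean from the test rows, which is the leakage this workflow is meant to avoid.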

In [67]:
car_sale_missing = pd.read_csv("data/car-sales-extended-missing-data.csv")

In [68]:
car_sale_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [69]:
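#Drop the rows with no Price label, then split into x and y
#(uses the freshly loaded car_sale_missing frame from above)
car_sale_missing = car_sale_missing.dropna(subset=["Price"])

x = car_sale_missing.drop("Price", axis=1)
y = car_sale_missing["Price"]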

In [70]:
#Fill missing values with SKL
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

In [74]:
#Fill categorical values with "missing" and numerical values with the mean
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
numerical_imputer = SimpleImputer(strategy="mean")

#Define columns

cat_features = ["Make", "Colour"]
door_features = ["Doors"]
numerical_features = ["Odometer (KM)"]

#Create an imputer (something that fills missing data)
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, cat_features),
    ("door_imputer", door_imputer, door_features),
    ("numerical_imputer", numerical_imputer, numerical_features)
])


#Transform the data
filled_x = imputer.fit_transform(x)
filled_x

array([['Honda', 'White', 4.0, 35431.0],
       ['BMW', 'Blue', 5.0, 192714.0],
       ['Honda', 'White', 4.0, 84714.0],
       ...,
       ['Nissan', 'Blue', 4.0, 66604.0],
       ['Honda', 'White', 4.0, 215883.0],
       ['Toyota', 'Blue', 4.0, 248360.0]], dtype=object)

In [75]:
car_sales_filled = pd.DataFrame( filled_x, columns=["Make", "Colour", "Doors", "Odometer (KM)"])

In [77]:
transformed_x = transformer.fit_transform(car_sales_filled)
transformed_x

<950x15 sparse matrix of type '<class 'numpy.float64'>'
	with 3800 stored elements in Compressed Sparse Row format>

In [7]:
#Now we've got our data as numbers, with the missing values filled
#Let's fit a model

np.random.seed(42)

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(transformed_x, y, test_size=0.2)

model = RandomForestRegressor(n_estimators=100)
model.fit(x_train, y_train)
model.score(x_test, y_test)

NameError: name 'transformed_x' is not defined

## 2. Choosing the right estimator/algorithm for your problem

Some things to note:
1. Sklearn refers to machine learning models and algorithms as estimators.
2. Classification problem: predicting a category (`clf` is short for classifier).
3. Regression problem: predicting a number.

### 2.1 Picking a learning model for a regression problem

In [82]:
#Get California housing dataset
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
housing

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n

In [86]:
housing_df = pd.DataFrame(housing["data"], columns=housing["feature_names"])
housing_df

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25
...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32


In [92]:
housing_df["target"] = housing["target"]
housing_df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [93]:
#Import algorithm
#Setup random seed

np.random.seed(42)

#Create the data

x = housing_df.drop("target", axis=1)
y = housing_df["target"]

In [94]:
#Split into train and test sets

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [97]:
from sklearn.linear_model import Ridge
model = Ridge()
model.fit(x_train, y_train)
model.score(x_test, y_test)

0.5758549611440128

In [101]:
from sklearn.linear_model import Lasso
model_2 = Lasso()
model_2.fit(x_train, y_train)
model_2.score(x_test, y_test)

0.2841671821008398

In [None]:
#Let's try another regression model.

#Ensemble model:
    #A combination of other models to improve predictions.
    #RandomForest is an ensemble of decision trees.

In [106]:
#Import the RandomForestRegressor model class from the ensemble module

from sklearn.ensemble import RandomForestRegressor

#Setup a seed
np.random.seed(42)

#Create the data
x  = housing_df.drop("target", axis=1)
y = housing_df["target"]

#Split the data

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

model = RandomForestRegressor()
model.fit(x_train, y_train)

model.score(x_test, y_test)

0.8057655811971304
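
For regressors, `score` returns the coefficient of determination, $R^2$: 1.0 is a perfect fit, and a model that always predicts the mean of the test targets scores 0. The sketch below computes it by hand and compares it with `model.score`. It uses synthetic data from `make_regression` as a stand-in for the housing data, which is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption standing in for the housing data).
X, y = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = Ridge().fit(X_train, y_train)
y_pred = model.predict(X_test)

# R^2 = 1 - (sum of squared residuals) / (total sum of squares around the mean)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print(np.isclose(manual_r2, model.score(X_test, y_test)))
```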

### 2.2 Choosing an estimator for a classification problem


In [4]:
heart_disease = pd.read_csv("data/heart-disease.csv")
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [5]:
#Try LinearSVC on this classification problem (the target has two classes, 0 and 1)

In [11]:
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

#Setup random seed
np.random.seed(42)

#Make the data
x = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

#Split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

#Instantiate LinearSVC
clf = LinearSVC(max_iter=100000)
clf.fit(x_train, y_train)

clf.score(x_test, y_test)



0.8852459016393442

In [10]:
heart_disease["target"].value_counts()

1    165
0    138
Name: target, dtype: int64

In [12]:
#Next, try an ensemble classifier (RandomForestClassifier) for comparison

In [15]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

#Setup random seed
np.random.seed(42)

#Make the data
x = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

#Split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

#Instantiate RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(x_train, y_train)

clf.score(x_test, y_test)

0.8524590163934426
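
The two test scores above differ by only two predictions out of 61 (54 correct versus 52), so a single split is a thin basis for choosing between the models. One hedged alternative is to compare them with cross-validation, which averages accuracy over several splits. The sketch below does this on synthetic data from `make_classification`, an assumption standing in for the heart-disease data; the hyperparameters are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Synthetic binary classification data (an assumption for illustration).
X, y = make_classification(n_samples=300, n_features=13, random_state=42)

models = {
    "LinearSVC": LinearSVC(max_iter=10000, random_state=42),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Score each model on 5 different train/test splits and average the accuracy.
mean_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```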

Tidbit:
1) **If you have structured data, use ensemble methods.**

2) **If you have unstructured data, use deep learning or transfer learning.**