## 0. Standard Libraries Needed

What will be covered:
0. Scikit-Learn workflow
1. Getting the data ready
2. Choose the right estimator/algorithm for our problems
3. Fit the model/algorithm and use it to make predictions on our data
4. Evaluate the model
5. Improve the model
6. Save and load a trained model
7. Put it all together

In [2]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

## 1. Get the data ready

In [3]:
heart_disease = pd.read_csv("../matplotlib/heart-disease (1).csv")
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [4]:
# Create X (features matrix)
X = heart_disease.drop("target",axis=1)

# Create Y (labels)
Y = heart_disease["target"]


## 2. Choose the right model and hyperparameters

In [5]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
# We'll keep the default hyperparameters
clf.get_params()


{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

## 3. Fit the model to the data

In [6]:
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)  # 80% of the data will be used for training, 20% for testing


In [7]:
clf.fit(X_train, Y_train);

In [8]:
# make a prediction

y_preds = clf.predict(X_test)
y_preds

array([1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1,
       0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0], dtype=int64)

## 4. Evaluate the model on the training data and the test data

In [9]:

clf.score(X_train, Y_train)

1.0

In [10]:
clf.score(X_test, Y_test)

0.7868852459016393

In [11]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(classification_report(Y_test,y_preds))

              precision    recall  f1-score   support

           0       0.76      0.73      0.75        26
           1       0.81      0.83      0.82        35

    accuracy                           0.79        61
   macro avg       0.78      0.78      0.78        61
weighted avg       0.79      0.79      0.79        61



In [12]:
confusion_matrix(Y_test,y_preds)

array([[19,  7],
       [ 6, 29]], dtype=int64)

In [13]:
accuracy_score(Y_test,y_preds)

0.7868852459016393

## 5. Improve a model

In [14]:

# Try different numbers of estimators (n_estimators)

np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators")
    clf = RandomForestClassifier(n_estimators=i).fit(X_train, Y_train)
    print(f"Model accuracy on the test set: {clf.score(X_test, Y_test) * 100:.2f}")


Trying model with 10 estimators
Model accuracy on the test set: 81.97
Trying model with 20 estimators
Model accuracy on the test set: 78.69
Trying model with 30 estimators
Model accuracy on the test set: 77.05
Trying model with 40 estimators
Model accuracy on the test set: 78.69
Trying model with 50 estimators
Model accuracy on the test set: 77.05
Trying model with 60 estimators
Model accuracy on the test set: 78.69
Trying model with 70 estimators
Model accuracy on the test set: 77.05
Trying model with 80 estimators
Model accuracy on the test set: 73.77
Trying model with 90 estimators
Model accuracy on the test set: 77.05
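
The accuracies above move around by several points from one setting to the next, which is expected when each model is scored on a single split of only 61 test rows. One hedged way to compare settings more steadily is k-fold cross-validation. The sketch below is not part of the original notebook: it uses synthetic data from `make_classification` as a stand-in for the heart disease table, and the names `X_demo`, `y_demo`, `clf_demo` and `cv_results` are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the heart disease data (assumption: ~300 rows, 13 features)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)

cv_results = {}
for n in range(10, 100, 20):
    clf_demo = RandomForestClassifier(n_estimators=n, random_state=42)
    # 5-fold cross-validation: every row is used for testing exactly once
    scores = cross_val_score(clf_demo, X_demo, y_demo, cv=5)
    cv_results[n] = scores.mean()
    print(f"{n} estimators: mean accuracy {scores.mean() * 100:.2f} (std {scores.std() * 100:.2f})")
```

Averaging over five folds does not remove the noise, but it makes differences between settings easier to trust than a single split does.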


## 6. Save a model and load it

In [15]:
import pickle

# Use a context manager so the file is closed once the model is written
with open("Random_forest_model_1.pkl", "wb") as f:
    pickle.dump(clf, f)

In [16]:
with open("Random_forest_model_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)


In [17]:
loaded_model.score(X_test,Y_test)

0.7704918032786885

# Detailed Step by Step

## 1. Getting our data ready to be used with ML

There are three main things we need to do:
    1. Split the data into features and labels (usually X & y)
    2. Fill (also called imputing) or disregard missing values
    3. Convert non-numerical values to numerical values (also called feature encoding)

In [18]:
heart_disease


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [19]:
X = heart_disease.drop("target", axis=1)
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [21]:
y = heart_disease["target"]

In [22]:
# Split data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [24]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))

### Cleaning data
Clean data -> Transform data -> Reduce data

1. Clean data: remove outliers and missing (NaN) values
2. Transform data: convert non-numerical values to numbers (strings and colours to numbers, yes/no to booleans) and make sure all values use the same units of measurement
3. Reduce data: more data needs more compute, so cut the data down for as long as the results stay the same
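
The three stages can be sketched on a toy table. This is only an illustration under stated assumptions: the `raw` DataFrame and its column names are hypothetical, the 1.5 × IQR rule is one common outlier heuristic among several, and halving the data is an arbitrary choice for the reduce step.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value and one obvious outlier
raw = pd.DataFrame({
    "odometer_km": [35000.0, 42000.0, np.nan, 39000.0, 5_000_000.0, 41000.0],
    "colour": ["White", "Blue", "White", "Red", "Blue", "White"],
    "insured": ["yes", "no", "yes", "yes", "no", "no"],
})

# 1. Clean: drop rows with NaN, then drop rows outside 1.5 * IQR
clean = raw.dropna()
q1, q3 = clean["odometer_km"].quantile([0.25, 0.75])
iqr = q3 - q1
clean = clean[clean["odometer_km"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# 2. Transform: yes/no to booleans, colour names to integer codes
transformed = clean.assign(
    insured=clean["insured"].map({"yes": True, "no": False}),
    colour=clean["colour"].astype("category").cat.codes,
)

# 3. Reduce: keep a random half of the rows
#    (only worthwhile if the model's results hold up on the smaller set)
reduced = transformed.sample(frac=0.5, random_state=42)
print(reduced)
```

Integer codes imply an ordering between colours that does not exist, which is why the notebook uses one-hot encoding for categories instead.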

### 1.1 Make sure it's all numerical

In [26]:
car_sales = pd.read_csv("car-sales-extended.csv")
car_sales

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043
...,...,...,...,...,...
995,Toyota,Black,35820,4,32042
996,Nissan,White,155144,3,5716
997,Nissan,Blue,66604,4,31570
998,Honda,White,215883,4,4001


In [27]:
len(car_sales)

1000

In [28]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

In [30]:
# Split data into X and y
X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]

# Split into training and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [31]:
# Build machine learning model
# Use a regressor because we are predicting a number (Price)
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X_train, y_train)  # fit on training features and training labels


ValueError: could not convert string to float: 'Toyota'

In [35]:
# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Why Doors?
# It's numerical, but it only takes a few distinct values, so we treat it as categorical
categorical_features = ["Make","Colour","Doors"]
# One-hot encoding: 1 for the category a row belongs to, 0 for the rest
one_hot = OneHotEncoder()

# transform the given cat. columns, and pass through - don't touch the remaining columns
transformer = ColumnTransformer([("one_hot",one_hot,categorical_features)],remainder="passthrough")

# transform the X values to numbers
transformed_X = transformer.fit_transform(X)
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

In [36]:
pd.DataFrame(transformed_X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,35820.0
996,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,155144.0
997,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,66604.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,215883.0


In [40]:
# Let's refit the model
np.random.seed(42)
# Note: train_size=0.2 trains on only 20% of the rows; the earlier splits used test_size=0.2
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, train_size=0.2)

model.fit(X_train,y_train);

In [42]:
# model hasn't really been able to find good patterns
model.score(X_test,y_test)

0.26160452087764263

### What if there were missing values?
1. Fill them with some value (also known as imputation)
2. Remove the samples with missing data altogether

In [44]:
car_sales_missing = pd.read_csv("car-sales-extended-missing-data.csv")
car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [45]:
car_sales_missing

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [46]:
## Make data
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

## Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.2)

## pick model
model = RandomForestRegressor()
## get the model to learn and find patterns
model.fit(X_train,y_train)
## test how well it has learnt
model.score(X_test,y_test)

ValueError: could not convert string to float: 'BMW'

In [48]:
## determine the categorical features
categorical_features = ["Make","Colour","Doors"]

## create one hot encoder
one_hot = OneHotEncoder()

## create column transformer
transformer = ColumnTransformer([("one_hot",one_hot,categorical_features)], remainder="passthrough")

## fit the transformer to X
transformed_X = transformer.fit_transform(X)

## refit for the model

np.random.seed(52)
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, train_size=0.2)
model.fit(X_train,y_train)
model.score(X_test,y_test)



ValueError: Input X contains NaN.
RandomForestRegressor does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values

### Option 1: Fill missing data with Pandas

In [52]:
## Fill the Make column
## (assign the result back; calling fillna with inplace=True on a selected column is unreliable in recent pandas)
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")

## Fill the Colour column
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")

## Fill the Odometer (KM) column with its mean
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean())

## Fill the Doors column (treated as a category) with 4
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)

## Remove rows with a missing Price value; it's bad to train on missing labels
car_sales_missing.dropna(inplace=True)


len(car_sales_missing)


950

In [53]:
## Make data
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

## determine the categorical features
categorical_features = ["Make","Colour","Doors"]

## create one hot encoder
one_hot = OneHotEncoder()

## create column transformer
transformer = ColumnTransformer([("one_hot",one_hot,categorical_features)], remainder="passthrough")

## fit the transformer to X
transformed_X = transformer.fit_transform(X)

## refit for the model

np.random.seed(52)
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, train_size=0.2)
model.fit(X_train,y_train)
model.score(X_test,y_test)

0.09221717582096511

In [54]:
len(car_sales_missing)

950

### Option 2: Fill missing values with Scikit-Learn

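#### Important
Fill the training and test data separately, whichever technique is used: compute the fill values (such as the mean) from the training data only, then apply them to the test data, so that no information leaks from the test set.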
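
A minimal sketch of that rule, assuming a single numeric column with gaps: split first, fit the imputer on the training rows, then reuse the fitted imputer on the test rows. The data and the names `X_demo`, `y_demo` and `imputer_demo` are hypothetical. The cells in this section fill the whole of `X` before splitting, so this is a split-first variant and not a copy of them.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical numeric feature with gaps
rng = np.random.default_rng(0)
odometer = rng.integers(10_000, 250_000, size=50).astype(float)
odometer[::7] = np.nan  # knock out every 7th value
X_demo = pd.DataFrame({"Odometer (KM)": odometer})
y_demo = pd.Series(rng.integers(5_000, 30_000, size=50), name="Price")

# Split first, so the test rows play no part in computing the fill value
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

imputer_demo = SimpleImputer(strategy="mean")
X_tr_filled = imputer_demo.fit_transform(X_tr)  # learn the mean from the training rows only
X_te_filled = imputer_demo.transform(X_te)      # reuse that same mean on the test rows

print("Training mean used for filling:", imputer_demo.statistics_[0])
```

Calling `fit_transform` on the test rows instead would compute a second mean from the test data, which is exactly the leak the rule is meant to prevent.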

In [56]:
car_sales_missing = pd.read_csv("car-sales-extended-missing-data.csv")
car_sales_missing

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
...,...,...,...,...,...
995,Toyota,Black,35820.0,4.0,32042.0
996,,White,155144.0,3.0,5716.0
997,Nissan,Blue,66604.0,4.0,31570.0
998,Honda,White,215883.0,4.0,4001.0


In [58]:
# Drop the rows with no labels
car_sales_missing.dropna(subset=["Price"],inplace=True)
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [59]:
# Split into X and y

X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

In [61]:
# Fill missing values with Scikit-Learn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Fill categorical values with "missing" and numerical values with the mean
# If the strategy is "constant", we need to supply the fill value
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
num_imputer = SimpleImputer(strategy="mean")

# Define Columns
cat_features = ["Make","Colour"]
door_feature = ["Doors"]
num_feature = ["Odometer (KM)"]

# Create an imputer (something that goes in and fills the missing data accordingly)
# Each tuple is one transformation: (name of the imputer, the imputer, the columns it fills)
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer,cat_features),
    ("door_imputer", door_imputer,door_feature),
    ("num_imputer",num_imputer,num_feature)
])

# Transform the data

filled_X = imputer.fit_transform(X)
filled_X

array([['Honda', 'White', 4.0, 35431.0],
       ['BMW', 'Blue', 5.0, 192714.0],
       ['Honda', 'White', 4.0, 84714.0],
       ...,
       ['Nissan', 'Blue', 4.0, 66604.0],
       ['Honda', 'White', 4.0, 215883.0],
       ['Toyota', 'Blue', 4.0, 248360.0]], dtype=object)

In [62]:
car_sales_filled = pd.DataFrame(filled_X,columns=["Make","Colour","Doors","Odometer (KM)"])
car_sales_filled.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

In [63]:
# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Why Doors?
# It's numerical, but it only takes a few distinct values, so we treat it as categorical
categorical_features = ["Make","Colour","Doors"]
# One-hot encoding: 1 for the category a row belongs to, 0 for the rest
one_hot = OneHotEncoder()

# transform the given cat. columns, and pass through - don't touch the remaining columns
transformer = ColumnTransformer([("one_hot",one_hot,categorical_features)],remainder="passthrough")

# transform the X values to numbers
transformed_X = transformer.fit_transform(car_sales_filled)
transformed_X

<950x15 sparse matrix of type '<class 'numpy.float64'>'
	with 3800 stored elements in Compressed Sparse Row format>

In [65]:
# Now we've got the data as numbers and filled the missing values
# Lets fit the model

np.random.seed(42)

X_train, X_test, y_train, y_test = train_test_split(transformed_X,y,train_size=0.2)

model = RandomForestRegressor(n_estimators=1000)
model.fit(X_train,y_train)
model.score(X_test,y_test)

0.19202325589865898
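
The imputing, encoding and modelling steps above can also be chained so that they are always fitted on the training rows only. The following is a hedged sketch, not the notebook's own code: it runs on hypothetical synthetic data shaped like the car sales table, and the names `cars`, `price`, `preprocess` and `model_demo` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data shaped like the car sales table
rng = np.random.default_rng(42)
n = 200
cars = pd.DataFrame({
    "Make": pd.Series(rng.choice(["Honda", "BMW", "Toyota", "Nissan"], size=n), dtype=object),
    "Colour": rng.choice(["White", "Blue", "Black"], size=n),
    "Doors": rng.choice([3.0, 4.0, 5.0], size=n),
    "Odometer (KM)": rng.integers(10_000, 250_000, size=n).astype(float),
})
# Made-up price rule: falls with distance driven, plus noise
price = pd.Series(30_000 - 0.08 * cars["Odometer (KM)"] + rng.normal(0, 1_000, size=n), name="Price")
# Introduce gaps in the features after the labels are computed
cars.loc[::15, "Make"] = np.nan
cars.loc[::20, "Odometer (KM)"] = np.nan

# Same fill rules as above: "missing" for categories, 4 for Doors, mean for numbers
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
doors = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value=4)),
    ("one_hot", OneHotEncoder(handle_unknown="ignore")),
])
numeric = Pipeline([("impute", SimpleImputer(strategy="mean"))])

preprocess = ColumnTransformer([
    ("cat", categorical, ["Make", "Colour"]),
    ("door", doors, ["Doors"]),
    ("num", numeric, ["Odometer (KM)"]),
])

model_demo = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestRegressor(n_estimators=100, random_state=42)),
])

X_tr, X_te, y_tr, y_te = train_test_split(cars, price, test_size=0.2, random_state=42)
model_demo.fit(X_tr, y_tr)  # imputers and encoders are fitted on the training rows only
r2 = model_demo.score(X_te, y_te)
print(f"R^2 on the held-out rows: {r2:.3f}")
```

Because the whole chain is one estimator, calling `fit` on the training split and `score` on the test split keeps the fill values and the encoder categories from ever seeing the test rows.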

## 2. Choosing the right