## What will be covered

0. End-to-end scikit-learn workflow
1. Getting data ready
2. Choose the right estimator/algorithm for our problems
3. Fit the model/algorithm to training data and use it to make predictions on our data
4. Evaluate a model on training data and test data
5. Improve a model
6. Save and load a trained model
7. Putting these all together

---

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## 0. End-to-end scikit-learn workflow



#### 0.1 Getting data ready

In [2]:
# Load the heart disease dataset (pandas was already imported as pd above)
heart_disease = pd.read_csv("https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/heart-disease.csv")
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [3]:
# Create X (features matrix)
X = heart_disease.drop("target", axis=1)

# Create y (labels)
y = heart_disease["target"]

#### 0.2 Choose the right model and hyperparameters

In [4]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()

# keep default hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

#### 0.3 Fit the model to training data

In [5]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # 80% of the data is used for training, 20% for testing

In [6]:
clf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [7]:
# Make predictions on the test set
y_preds = clf.predict(X_test)
y_preds

array([1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1,
       0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0])

#### 0.4 Evaluate the model

In [8]:
clf.score(X_train, y_train)

1.0

In [9]:
clf.score(X_test, y_test)

0.8852459016393442

In [10]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [11]:
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.84      0.93      0.89        29
           1       0.93      0.84      0.89        32

    accuracy                           0.89        61
   macro avg       0.89      0.89      0.89        61
weighted avg       0.89      0.89      0.89        61



In [12]:
confusion_matrix(y_test, y_preds)

array([[27,  2],
       [ 5, 27]])

In [13]:
accuracy_score(y_test, y_preds)

0.8852459016393442
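
The numbers in the classification report and the accuracy score can all be derived from the confusion matrix. The following is a minimal sketch that recomputes them by hand, assuming the matrix printed above (rows are true labels, columns are predicted labels):

```python
import numpy as np

# Confusion matrix copied from the output above: rows = true labels, columns = predictions
cm = np.array([[27, 2],
               [5, 27]])

# Accuracy: correct predictions (the diagonal) divided by all predictions
accuracy = np.trace(cm) / cm.sum()

# Precision per class: correct predictions of a class / everything predicted as that class
precision = np.diag(cm) / cm.sum(axis=0)

# Recall per class: correct predictions of a class / all true samples of that class
recall = np.diag(cm) / cm.sum(axis=1)

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {np.round(precision, 2)}")
print(f"recall:    {np.round(recall, 2)}")
```

The accuracy is 54 correct predictions out of 61 samples, which matches the value returned by `accuracy_score` and `clf.score`.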

#### 0.5 Improve the model

In [14]:
# Try different values of n_estimators

# Set up a random seed
np.random.seed(42)

for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    clf = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {clf.score(X_test, y_test) * 100:.2f}%")
    print(" ")

Trying model with 10 estimators...
Model accuracy on test set: 78.69%
 
Trying model with 20 estimators...
Model accuracy on test set: 85.25%
 
Trying model with 30 estimators...
Model accuracy on test set: 90.16%
 
Trying model with 40 estimators...
Model accuracy on test set: 86.89%
 
Trying model with 50 estimators...
Model accuracy on test set: 88.52%
 
Trying model with 60 estimators...
Model accuracy on test set: 91.80%
 
Trying model with 70 estimators...
Model accuracy on test set: 88.52%
 
Trying model with 80 estimators...
Model accuracy on test set: 90.16%
 
Trying model with 90 estimators...
Model accuracy on test set: 90.16%
 


#### 0.6 Save and load a model

In [15]:
import pickle

# Use a context manager so the file is closed after writing
with open("random_forest_model_1.pkl", "wb") as f:
    pickle.dump(clf, f)

In [16]:
with open("random_forest_model_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model.score(X_test, y_test)

0.9016393442622951

---

## 1. Getting data ready to be used with machine learning

Three main things to do:

1. Split the data into features and labels (usually 'X' and 'y')
2. Convert non-numerical values to numerical values (also called feature encoding)
3. Fill (impute) or remove missing values
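
The three steps can be sketched on a tiny table before applying them to the real datasets. This is only an illustration: the `toy` DataFrame and its column names are made up, and `pd.get_dummies` is used here as one simple way to encode text columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one text column and one missing value
toy = pd.DataFrame({
    "colour": ["red", "blue", "red", "green"],
    "km": [1000.0, np.nan, 3000.0, 5000.0],
    "price": [10, 20, 30, 40],
})

# 1. Split into features (X) and labels (y)
X = toy.drop("price", axis=1)
y = toy["price"]

# 2. Convert the text column to numbers (one column per category)
X = pd.get_dummies(X, columns=["colour"])

# 3. Fill the missing numerical value with the column mean (imputation)
X["km"] = X["km"].fillna(X["km"].mean())

print(X)
```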

#### 1.1.1 Split data into features and labels

In [17]:
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [18]:
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

#### 1.1.2 Split data into training and test dataset

In [19]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [20]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))
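
The shapes above follow from `test_size=0.2`: with a fractional `test_size`, the test set size is rounded up and the remaining rows go to training. A small sketch with placeholder arrays of the same size (303 rows, 13 columns) reproduces the split sizes; the array contents are arbitrary.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 303  # number of rows in the heart disease data (242 + 61)
X_demo = np.arange(n_samples * 13).reshape(n_samples, 13)  # placeholder features
y_demo = np.arange(n_samples)                              # placeholder labels

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Expected sizes: test set rounded up, the rest used for training
expected_test = math.ceil(0.2 * n_samples)
print(expected_test, n_samples - expected_test)
print(X_tr.shape, X_te.shape)
```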

#### 1.2 Make sure everything is numerical (Feature Encoding)

In [21]:
car_sales = pd.read_csv("https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/car-sales-extended.csv")
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [22]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

In [23]:
# Split into X/y (features and labels)
X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]

#### 1.2.1 Convert to Numerical Value

In [24]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"] # columns to one-hot encode
# Doors is an integer, but it is treated as a category because it only takes a few distinct values (3, 4 or 5 doors)

one_hot = OneHotEncoder() # One Hot Encoding

transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder="passthrough")

transformed_X = transformer.fit_transform(X)

In [25]:
pd.DataFrame(transformed_X).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0
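
The transformed columns above no longer have names, which makes them hard to read. The sketch below shows how one-hot encoding maps values to columns, using a made-up three-row sample rather than the real `car_sales` data:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical three-row sample with the same kind of columns as car_sales
toy = pd.DataFrame({
    "Make": ["Honda", "BMW", "Honda"],
    "Doors": [4, 5, 4],
})

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default, so convert it for printing
encoded = encoder.fit_transform(toy).toarray()

# categories_ lists the values found in each input column, in output-column order
print(encoder.categories_)
print(encoded)
```

Each input column becomes one output column per distinct value, and every row has exactly one 1 in each group.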


#### Optional: what one-hot encoded data looks like (using pandas)

In [26]:
dummies = pd.get_dummies(car_sales[categorical_features])

dummies.head()

Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,4,0,1,0,0,0,0,0,0,1
1,5,1,0,0,0,0,1,0,0,0
2,4,0,1,0,0,0,0,0,0,1
3,4,0,0,0,1,0,0,0,0,1
4,3,0,0,1,0,0,1,0,0,0


In [27]:
# Fit Model

# setup random seed
np.random.seed(42)

# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)

# Build machine learning model
from sklearn.ensemble import RandomForestRegressor # Regression, as it will predict price (numeric value)

model = RandomForestRegressor()

model.fit(X_train, y_train)

model.score(X_test, y_test)

0.3235867221569877

#### 1.3 Handle Missing Values

1. Fill with some values (imputation)
2. Remove the samples with missing values
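
Both approaches can be sketched with pandas on a tiny made-up table before applying them to the real data. The column names mirror the car sales dataset, but the values are invented:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Odometer (KM)": [10000.0, np.nan, 30000.0],
    "Price": [5000.0, 7000.0, np.nan],
})

# 1. Imputation: replace the missing odometer reading with the column mean
filled = toy.copy()
filled["Odometer (KM)"] = filled["Odometer (KM)"].fillna(filled["Odometer (KM)"].mean())

# 2. Removal: drop every row that has no Price
dropped = toy.dropna(subset=["Price"])

print(filled)
print(len(toy), "->", len(dropped))
```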

In [28]:
car_sales_missing = pd.read_csv("https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/car-sales-extended-missing-data.csv")

In [29]:
# Count the missing values in each column
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

#### Option 1: Handle missing data with pandas

In [30]:
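# Fill missing data in the "Make" column
# (assign the result back instead of using inplace=True on a column, which newer pandas versions warn about)
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")

# Fill missing data in the "Colour" column
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")

# Fill missing data in the "Odometer (KM)" column with the column mean
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean())

# Fill missing data in the "Doors" column with 4
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)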

In [31]:
car_sales_missing.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

In [32]:
# Price is the label we want to predict, so it was not filled

# Instead, remove the rows with a missing Price value
car_sales_missing.dropna(inplace=True)

In [33]:
len(car_sales_missing) # 50 rows that were missing Price value are dropped

950

In [34]:
# Create X/y (features and labels)
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

In [35]:
# Convert non-numerical values to numerical values 
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"] # columns to one-hot encode
# Doors is an integer, but it is treated as a category because it only takes a few distinct values (3, 4 or 5 doors)

one_hot = OneHotEncoder() # One Hot Encoding

transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder="passthrough")

transformed_X = transformer.fit_transform(X)

#### Option 2: Handle missing data with scikit-learn

In [36]:
car_sales_missing = pd.read_csv("https://raw.githubusercontent.com/mrdbourke/zero-to-mastery-ml/master/data/car-sales-extended-missing-data.csv")
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [37]:
# Drop rows that don't have Price values
car_sales_missing.dropna(subset=["Price"], inplace=True)
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [38]:
# Fill missing values with scikit-learn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer


# Fill categorical values with "missing", door values with 4 and numerical values with mean
category_imputer = SimpleImputer(strategy="constant", fill_value="missing") # strategy="constant" fills every missing cell with fill_value
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
num_imputer = SimpleImputer(strategy="mean")

# Define columns
category_features = ["Make", "Colour"]
door_features = ["Doors"]
num_features = ["Odometer (KM)"]

# Create an imputer that applies each strategy to its own columns
imputer = ColumnTransformer([
    ("category_imputer", category_imputer, category_features),
    ("door_imputer", door_imputer, door_features),
    ("num_imputer", num_imputer, num_features)
])

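# Use the freshly loaded features here; the X left over from Option 1 has already been filled
X = car_sales_missing.drop("Price", axis=1)
filled_X = imputer.fit_transform(X)
filled_X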

array([['Honda', 'White', 4.0, 35431.0],
       ['BMW', 'Blue', 5.0, 192714.0],
       ['Honda', 'White', 4.0, 84714.0],
       ...,
       ['Nissan', 'Blue', 4.0, 66604.0],
       ['Honda', 'White', 4.0, 215883.0],
       ['Toyota', 'Blue', 4.0, 248360.0]], dtype=object)

In [39]:
car_sales_filled = pd.DataFrame(filled_X, columns=["Make", "Colour", "Doors", "Odometer (KM)"])
car_sales_filled.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

In [40]:
# Convert non-numerical values to numerical values 
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"] # columns to one-hot encode
# Doors is an integer, but it is treated as a category because it only takes a few distinct values (3, 4 or 5 doors)

one_hot = OneHotEncoder() # One Hot Encoding

transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder="passthrough")

transformed_X = transformer.fit_transform(car_sales_filled)

In [41]:
# Split into X/y
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

#### Fit model

Both options (Option 1 and Option 2) produce essentially the same filled data, so the model should perform about the same with either.
That's why we didn't fit a model for Option 1.
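
As a rough check of that equivalence, the sketch below fills the same made-up column with pandas and with `SimpleImputer` and compares the results. It only covers the mean strategy on toy values, not the full car sales pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({"Odometer (KM)": [10000.0, np.nan, 30000.0, 50000.0]})

# pandas: fill with the column mean
filled_pandas = toy["Odometer (KM)"].fillna(toy["Odometer (KM)"].mean()).to_numpy()

# scikit-learn: SimpleImputer(strategy="mean") learns the mean in fit and applies it in transform
filled_sklearn = SimpleImputer(strategy="mean").fit_transform(toy).ravel()

same = np.allclose(filled_pandas, filled_sklearn)
print(same)
```

One difference to keep in mind: in Option 1 the odometer mean is computed before the rows without a price are dropped, while in Option 2 it is computed afterwards, so the filled values on the real data can differ slightly.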

In [42]:
# Fit model

# setup random seed
np.random.seed(42)

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)

model = RandomForestRegressor()

model.fit(X_train, y_train)

model.score(X_test, y_test)

0.22011714008302485

In this case, the model scores lower (0.22) than the one trained on the complete car sales data in section 1.2 (0.32).

A likely reason is that the first dataset had 1000 samples, while this one has only 950 samples left after dropping the rows without a price, and some of its feature values are filled in rather than observed.

The takeaway is that most of the time a dataset won't be ready to use in an ML model straight away: the data usually has to be in numerical form, and it can't have missing values.
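
A quick way to test both conditions before fitting is a small check like the one below. `is_model_ready` is a hypothetical helper written for this note, not part of pandas or scikit-learn, and the two DataFrames are made up.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

def is_model_ready(df):
    """Return True if every column is numeric and no value is missing (hypothetical helper)."""
    all_numeric = all(is_numeric_dtype(df[col]) for col in df.columns)
    no_missing = not df.isna().any().any()
    return all_numeric and no_missing

# Raw data: a text column and missing values
raw = pd.DataFrame({"Make": ["Honda", None], "Doors": [4.0, np.nan]})

# Prepared data: encoded and filled
clean = pd.DataFrame({"Make_Honda": [1, 0], "Doors": [4.0, 4.0]})

print(is_model_ready(raw))
print(is_model_ready(clean))
```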

---

## 2. Choose The Right Estimator/Model

In scikit-learn, the term estimator is another way of saying machine learning model or algorithm.

- Classification: Predicting whether a sample is one thing or another
- Regression: Predicting a numerical value (number)
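
The difference shows up in what the estimator predicts. A minimal sketch on synthetic data (generated with `make_classification` and `make_regression`, so the values themselves mean nothing):

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Synthetic data, used only to show the difference in output type
X_clf, y_clf = make_classification(n_samples=100, n_features=5, random_state=42)
X_reg, y_reg = make_regression(n_samples=100, n_features=5, random_state=42)

clf_demo = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_clf, y_clf)
reg_demo = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_reg, y_reg)

clf_preds = clf_demo.predict(X_clf[:5])
reg_preds = reg_demo.predict(X_reg[:5])

print(clf_preds)  # class labels (0 or 1)
print(reg_preds)  # continuous numbers
```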

Sklearn Machine Learning Map: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

#### 2.1 Picking a machine learning model for a regression problem

In [50]:
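# Import the Boston Housing dataset
# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2, so this cell needs an older version
from sklearn.datasets import load_boston
boston = load_boston()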

boston_df = pd.DataFrame(boston["data"], columns=boston["feature_names"])

boston_df["target"] = pd.Series(boston["target"])

boston_df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2


#### Attribute Information:

- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT: % lower status of the population
- MEDV: median value of owner-occupied homes in $1000s (this is the `target` column in the DataFrame above)

In [53]:
# Let's try Ridge Regression model
from sklearn.linear_model import Ridge

# Setup a random seed
np.random.seed(42)

# Split X/y
X = boston_df.drop("target", axis=1)
y = boston_df["target"]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate Ridge model
model = Ridge()
model.fit(X_train, y_train)

# Check the score of the model
model.score(X_test, y_test)

0.6662221670168522

- What if Ridge does not work well?
- How can the score be improved?

In [55]:
# Let's try Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor

# Setup random seed
np.random.seed(42)

# Split into X/y
X = boston_df.drop("target", axis=1)
y = boston_df["target"]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate RandomForestRegressor Model
rf = RandomForestRegressor()
rf.fit(X_train, y_train)

# Check the score of the model
rf.score(X_test, y_test)

0.873969014117403

So, just by changing the model (following the machine learning map: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html), the score (R^2) on the test set improved from 0.666 to 0.874.
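
The same comparison can be written as a loop over candidate estimators, so that every model is fit and scored on the same split. This is a sketch on synthetic data from `make_friedman1` (a stand-in, not the Boston data), so the scores will not match the ones above.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for a real dataset
X_demo, y_demo = make_friedman1(n_samples=500, noise=1.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

candidates = {
    "Ridge": Ridge(),
    "RandomForestRegressor": RandomForestRegressor(random_state=42),
}

# Fit each candidate on the same training split and score it on the same test split
scores = {name: est.fit(X_tr, y_tr).score(X_te, y_te) for name, est in candidates.items()}

for name, score in scores.items():
    print(f"{name}: R^2 = {score:.3f}")
```

Keeping the split fixed means any difference in score comes from the model, not from which rows ended up in the test set.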