# Introduction to Scikit-Learn (sklearn)

This notebook demonstrates some of the most useful functions of the beautiful Scikit-Learn library.

Flow of the work:

0. An end-to-end Scikit-Learn workflow
1. Getting the data ready
2. Choosing the right estimator/algorithm for our problem
3. Fitting the model/algorithm/estimator and using it to make predictions on our data
4. Evaluating the model
5. Improving the model
6. Save and load a trained model
7. Putting it all together

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## 0. An End-to-end Scikit Learn Workflow

In [2]:
# 1. Get the data ready 
import pandas as pd
heart_disease = pd.read_csv("data/heart-disease.csv")
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [3]:
# Create X (Features Matrix)
X = heart_disease.drop("target", axis = 1)

# Create y (labels)
y = heart_disease["target"]

In [4]:
# 2. Choose the right model and hyperparameters

from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
 
#  We will keep the default hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [5]:
# 3. Fit the model to the data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [6]:
clf.fit(X_train,y_train);

In [7]:
X_train

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
265,66,1,0,112,212,0,0,132,1,0.1,2,1,2
280,42,1,0,136,315,0,1,125,1,1.8,1,0,1
34,51,1,3,125,213,0,0,125,1,1.4,2,1,2
237,60,1,0,140,293,0,0,170,0,1.2,1,2,3
13,64,1,3,110,211,0,0,144,1,1.8,1,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
164,38,1,2,138,175,0,1,173,0,0.0,2,4,2
122,41,0,2,112,268,0,0,172,1,0.0,2,0,2
74,43,0,2,122,213,0,1,165,0,0.2,1,0,2
257,50,1,0,144,200,0,0,126,1,0.9,1,0,3


In [8]:
X_test

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
79,58,1,2,105,240,0,0,154,1,0.6,1,0,3
154,39,0,2,138,220,0,1,152,0,0.0,1,0,2
49,53,0,0,138,234,0,0,160,0,0.0,2,0,2
214,56,1,0,125,249,1,0,144,1,1.2,1,1,2
162,41,1,1,120,157,0,1,182,0,0.0,2,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
235,51,1,0,140,299,0,1,173,1,1.6,2,0,3
11,48,0,2,130,275,0,1,139,0,0.2,2,0,2
125,34,0,1,118,210,0,1,192,0,0.7,2,0,2
174,60,1,0,130,206,0,0,132,1,2.4,1,2,3


In [9]:
y_preds = clf.predict(X_test)
y_preds

array([1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1,
       1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0])

In [10]:
y_test

79     1
154    1
49     1
214    0
162    1
      ..
235    0
11     1
125    1
174    0
97     1
Name: target, Length: 61, dtype: int64

In [11]:
# 4. Evaluate the model on the training data and the test data

clf.score(X_train, y_train)  # score() returns the mean accuracy of the model, a float between 0.0 and 1.0
# accuracy = correct predictions / total predictions

1.0

In [12]:
clf.score(X_test, y_test)

0.8360655737704918
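
The two scores above differ a lot: 1.0 on the training set versus roughly 0.84 on the test set, which suggests the forest has largely memorised the training data. As a minimal sketch of what `score()` computes for a classifier, the snippet below works out accuracy by hand on made-up labels (the arrays `y_true` and `y_pred` are hypothetical, not taken from the heart-disease data) and compares it with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions, for illustration only
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# accuracy = correct predictions / total predictions
manual_accuracy = (y_true == y_pred).sum() / len(y_true)
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, library_accuracy)
```

Six of the eight made-up predictions are correct, so both numbers come out as 0.75.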

In [13]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.85      0.79      0.81        28
           1       0.83      0.88      0.85        33

    accuracy                           0.84        61
   macro avg       0.84      0.83      0.83        61
weighted avg       0.84      0.84      0.84        61



In [14]:
confusion_matrix(y_test, y_preds)

array([[22,  6],
       [ 4, 29]])
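
The numbers in the classification report can be recovered from this confusion matrix. The sketch below hard-codes the matrix printed above (rows are the true class, columns the predicted class) and derives precision, recall, and accuracy for class 1; the variable names are only illustrative.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = np.array([[22, 6],
               [4, 29]])
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)     # of everything predicted as 1, how much really is 1
recall_1 = tp / (tp + fn)        # of everything that really is 1, how much was found
accuracy = (tp + tn) / cm.sum()  # correct predictions / total predictions

print(round(precision_1, 2), round(recall_1, 2), round(accuracy, 2))
```

These should round to the 0.83, 0.88 and 0.84 shown in the report for class 1 and overall accuracy.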

In [15]:
accuracy_score(y_test,y_preds)

0.8360655737704918

### What is n_estimators?
Think of a Random Forest like a jury in a courtroom.

n_estimators is simply the number of people in that jury.

1. The Single Tree vs. The Forest
* If you have 1 tree, it's like asking one person for their opinion. They might be biased or miss a tiny detail.

* If you have n_estimators=100, you are asking 100 people.  

2. Why more trees help

   Each "person" (tree) in the forest looks at the data a little bit differently. When it’s time to make a decision:

* Every single tree "votes" for a result.

* The forest looks at all the votes.

* The majority wins.

Because you are averaging the votes of many trees, the mistakes of one tree are cancelled out by the others. This makes the model much more reliable and "stable."

3. Practical Rules of Thumb
* 10 trees: Fast to train, but predictions can be a bit "shaky" (they vary more from run to run and are often less accurate).

* 100 trees: This is the default. It’s usually strong and accurate enough for most projects.

* 1,000 trees: Usually a little more stable, with diminishing returns, but training and prediction need noticeably more compute time and memory.
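
The jury analogy can be sketched in a few lines of NumPy. The votes below are made up for illustration (five hypothetical trees voting on four samples). Note that this is a simplification: scikit-learn's `RandomForestClassifier` averages each tree's predicted class probabilities rather than counting hard votes, although the idea is the same.

```python
import numpy as np

# Hypothetical votes: each row is one tree, each column is one sample
tree_votes = np.array([
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 1],
])

# Fraction of trees voting for class 1, per sample
vote_share = tree_votes.mean(axis=0)

# Majority wins: predict 1 when more than half of the trees say 1
forest_prediction = (vote_share > 0.5).astype(int)
print(forest_prediction)
```

Every tree in this toy example is wrong on at least one sample, yet the combined vote is the same as the most common answer in each column.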

In [16]:
# 5. Improve the model
# Try different values of n_estimators

np.random.seed(42)

for i in range(10,100,10):
    print(f"Trying model with {i} estimators...")
    clf = RandomForestClassifier(n_estimators = i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {clf.score(X_test, y_test) * 100:.2f}%")
    print("")

Trying model with 10 estimators...
Model accuracy on test set: 78.69%

Trying model with 20 estimators...
Model accuracy on test set: 80.33%

Trying model with 30 estimators...
Model accuracy on test set: 78.69%

Trying model with 40 estimators...
Model accuracy on test set: 73.77%

Trying model with 50 estimators...
Model accuracy on test set: 77.05%

Trying model with 60 estimators...
Model accuracy on test set: 83.61%

Trying model with 70 estimators...
Model accuracy on test set: 78.69%

Trying model with 80 estimators...
Model accuracy on test set: 81.97%

Trying model with 90 estimators...
Model accuracy on test set: 81.97%
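
The scores above jump around rather than improving steadily. With only 61 test samples, a single changed prediction moves accuracy by about 1.6 percentage points, so a single train/test split is a noisy yardstick. One common way to smooth this out is cross-validation. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, not the heart-disease dataset), and the names `X_demo`, `y_demo` and `mean_scores` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: not the heart-disease dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)

mean_scores = {}
for n in (10, 50, 100):
    clf_demo = RandomForestClassifier(n_estimators=n, random_state=42)
    # 5-fold cross-validation: five different train/validation splits per setting
    scores = cross_val_score(clf_demo, X_demo, y_demo, cv=5)
    mean_scores[n] = scores.mean()
    print(f"{n} estimators: mean accuracy {scores.mean():.3f}")
```

Averaging over five folds gives each setting five chances instead of one, which makes the comparison between values of `n_estimators` less dependent on a lucky split.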



In [17]:
# 6. Save a model and load it

import pickle

# Use a context manager so the file is closed once the model is written
with open("random_forest_model_1.pkl", "wb") as f:
    pickle.dump(clf, f)

In [18]:
# Reload the model we saved

with open("random_forest_model_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model.score(X_test, y_test)

0.819672131147541
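
An alternative to `pickle` that is often used for scikit-learn models is `joblib`, which is installed alongside scikit-learn. The sketch below is a minimal round trip on synthetic data; the file name and the `*_demo` variables are hypothetical, and a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic model, only to have something to save
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
clf_demo = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "random_forest_demo.joblib")
    joblib.dump(clf_demo, model_path)    # save
    restored = joblib.load(model_path)   # load

# The restored model should behave exactly like the original
same_predictions = np.array_equal(clf_demo.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

As with `pickle`, only load model files from sources you trust, and expect a saved model to be tied to the scikit-learn version that created it.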

## 1. Getting the data ready to be used with machine learning

Three main things we have to do:
   1. Split the data into features and labels (usually `X` and `y`)
   2. Fill (also called imputing) or drop missing values
   3. Convert non-numerical values to numerical values (also called feature encoding)

In [19]:
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [20]:
X = heart_disease.drop("target", axis  = 1)
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [21]:
y = heart_disease["target"]
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [22]:
# Split the data into training and test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [23]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

# This line is used to check the dimensions (the size and structure) of your datasets after you have split them into training and testing sets.
# In Python, .shape tells you how many rows (samples) and columns (features) are in each piece of your data.

((242, 13), (61, 13), (242,), (61,))

In [24]:
X.shape # this is the data before splitting, original data.

(303, 13)

In [25]:
# Of these samples, 20% (test_size = 0.2) are held out as the test set

len(heart_disease) # total samples = 303

303

In [26]:
X.shape[0] * 0.8

242.4

In [27]:
# 303 - 242 = 61
# 242 + 61 = 303
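
The 242/61 split is not exactly 80/20 because 303 does not divide evenly. The sketch below reproduces the arithmetic under the assumption that the test size is rounded up and the training set gets the remainder, which is consistent with the shapes printed above; it then checks the result against `train_test_split` on a dummy index array.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 303
test_size = 0.2

# 0.2 * 303 = 60.6, rounded up to 61 test samples; the rest go to training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)

# Cross-check with scikit-learn on a dummy array of row indices
train_idx, test_idx = train_test_split(np.arange(n_samples), test_size=test_size, random_state=0)
print(len(train_idx), len(test_idx))
```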

### 1.1 Make sure all the data is numerical

In [28]:
car_sales = pd.read_csv("data/scikit-learn-data/car-sales-extended.csv")
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [29]:
len(car_sales)

1000

In [30]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

In [31]:
# Split into X/y

X = car_sales.drop("Price", axis = 1)
y = car_sales["Price"]

# Split into training and test set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [32]:
# Build machine learning model

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test) 


# This cell raises a ValueError because "Make" and "Colour" contain strings

ValueError: could not convert string to float: 'Toyota'

1. What is One-Hot Encoding?
Imagine you have a column called "Colour" with three options: Red, Blue, and Green. You can't just give them numbers like 1, 2, and 3, because the computer might think Green (3) is "bigger" or "better" than Red (1).

One-Hot Encoding solves this by creating a new column for every category:

* Is it Red? (1 for Yes, 0 for No)

* Is it Blue? (1 for Yes, 0 for No)

* Is it Green? (1 for Yes, 0 for No)

2. Breaking down your code
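* categorical_features: You are telling the computer, "Look at the 'Make', 'Colour', and 'Doors' columns. These are the categories that need changing." 'Make' and 'Colour' contain words; 'Doors' is a number, but it only takes a few distinct values (3, 4 or 5), so we treat it as a category too.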

* OneHotEncoder(): This is the tool that creates those 1s and 0s.

* ColumnTransformer: This is like an assembly line. You are telling it: "Apply the One-Hot tool to my categorical columns, but for everything else (the remainder), just let it pass through without changing it."

* fit_transform(X): This actually runs the assembly line on your data X.

<img src="images\OneHot Encoding.png" width="800" height="300">

In [None]:
# To solve the errors we need to preprocess the data

# Turn the categories into numbers

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                  remainder="passthrough")

transformed_X = transformer.fit_transform(X)
transformed_X


array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]], shape=(1000, 13))

The array you see is simply your original data converted into a format a machine learning model can read.

The 1.0s and 0.0s: These are the "Make", "Colour", and "Doors" categories. If a car was "Blue," the "Blue" column gets a 1 and the "Red/Green" columns get a 0.

The 13 columns: You have 13 columns now because the encoder took your 3 categorical features and expanded them into 12 indicator columns (4 makes + 5 colours + 3 door counts), while "Odometer (KM)" was passed through unchanged as the last column.

The Scientific Notation: NumPy displays 1.00000e+00 instead of just 1 because there are large numbers (like 3.54310e+04, which is 35,431) in the same array. It uses one format for the whole array to keep the columns aligned.

Essentially, transformer.fit_transform(X) took your spreadsheet of words and numbers and turned it into a purely numerical matrix.

In [None]:
pd.DataFrame(transformed_X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,35820.0
996,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,155144.0
997,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,66604.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,215883.0


In [None]:
dummies = pd.get_dummies(car_sales[["Make","Colour","Doors"]])
dummies

Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,4,False,True,False,False,False,False,False,False,True
1,5,True,False,False,False,False,True,False,False,False
2,4,False,True,False,False,False,False,False,False,True
3,4,False,False,False,True,False,False,False,False,True
4,3,False,False,True,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...
995,4,False,False,False,True,True,False,False,False,False
996,3,False,False,True,False,False,False,False,False,True
997,4,False,False,True,False,False,True,False,False,False
998,4,False,True,False,False,False,False,False,False,True
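
Notice that `pd.get_dummies` left "Doors" as a single numeric column, because by default it only encodes object, string and categorical columns. A minimal sketch of one way to encode it as well is to list the columns explicitly with the `columns` argument; the `cars` table below is made up for illustration.

```python
import pandas as pd

# Made-up rows in the same shape as the car sales data
cars = pd.DataFrame({
    "Make": ["Honda", "BMW", "Nissan"],
    "Colour": ["White", "Blue", "Blue"],
    "Doors": [4, 5, 3],
})

# Listing the columns explicitly encodes numeric ones such as "Doors" too
dummies_demo = pd.get_dummies(cars, columns=["Make", "Colour", "Doors"])
print(dummies_demo.columns.tolist())
```

With this change the result has one column per door count, which is closer to what `OneHotEncoder` produced above.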


In [None]:
# Refit the model

np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size = 0.2)
model.fit(X_train, y_train)

RandomForestRegressor()


In [None]:
model.score(X_test, y_test)

0.3235867221569877

### 1.2 What if there were missing values ??

1. Fill them with some value (also known as imputation).
2. Remove the samples with the missing data altogether.


In [None]:
# Import car sales missing data

car_sales_missing = pd.read_csv("data/scikit-learn-data/car-sales-extended-missing-data.csv")
car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [None]:
car_sales_missing.isna().sum()  # count the missing values in each column of the dataframe

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [None]:
# Create X and y

X = car_sales_missing.drop("Price", axis = 1)
y = car_sales_missing["Price"]

In [None]:
# The columns contain NaN values. Older Scikit-Learn versions raise a ValueError when
# one-hot encoding them, and newer versions treat NaN as a category of its own.
# Either way, we want to fill the NaN values first.

In [None]:
car_sales_missing["Doors"].value_counts()

Doors
4.0    811
5.0     75
3.0     64
Name: count, dtype: int64

### Option 1. Filling missing data with Pandas

In [None]:
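# Assign the result back instead of calling fillna(inplace=True) on a single column,
# which is deprecated in recent pandas versions and may leave the DataFrame unchanged

# Fill the "Make" column
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")

# Fill the "Colour" column
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")

# Fill the "Odometer (KM)" column with the column mean
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean())

# Fill the "Doors" column with the most common value
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)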

In [None]:
# Check our dataframe again
car_sales_missing.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

In [None]:
# Remove rows with missing Price value

car_sales_missing.dropna(inplace = True)

In [None]:
car_sales_missing.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [None]:
len(car_sales_missing)

950

In [None]:
X = car_sales_missing.drop("Price", axis = 1)
y = car_sales_missing["Price"]

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                  remainder="passthrough")

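# Note: car_sales_missing still contains "Price", so the label ends up as the last
# column of the output below. Pass X instead to encode the features only.
transformed_X = transformer.fit_transform(car_sales_missing)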
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        3.54310e+04, 1.53230e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        1.92714e+05, 1.99430e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        8.47140e+04, 2.83430e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 0.00000e+00,
        6.66040e+04, 3.15700e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.15883e+05, 4.00100e+03],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.48360e+05, 1.27320e+04]], shape=(1000, 16))

### Option 2 : Fill the missing values with Scikit-Learn


In [None]:
car_sales_missing = pd.read_csv("data/scikit-learn-data/car-sales-extended-missing-data.csv")
car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [None]:
car_sales_missing.isna().sum() 

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [None]:
# Drop the rows with no labels
car_sales_missing.dropna(subset=["Price"], inplace=True)
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [None]:
# Split into X & y
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

# Split data into train and test
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2)

In [None]:
# Check missing values
X.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
dtype: int64

### What's an Imputer ?

🌟 First — What problem does an imputer solve?

In real-world data, sometimes values are missing.
For example:
| Age | Salary      |
| --- | ----------- |
| 25  | 30,000      |
| 28  | — missing — |
| 32  | 45,000      |

Most machine-learning models cannot work with missing values and raise an error when they meet one.

So we need a way to fill in (replace) the missing values with something sensible.

👉 This process of filling missing values is called “imputation.”

🛠️ What is an imputer in scikit-learn?

In scikit-learn, an imputer is a tool that automatically fills missing data for you.

The most commonly used one is:  **SimpleImputer**

**🧹 Imputer = smart cleaner that fills empty boxes in your dataset**

------------------------------------------------------------------------------------------------------------------------------------------

Your dataset has missing values in:

text columns → 'Make', 'Colour'

number column → 'Odometer (KM)'

door count column → 'Doors'

Most machine-learning models cannot handle missing values, so this code fills them (imputes them).

It learns the fill values from the training data only and then reuses them on the test data, so no information from the test set leaks into training.
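
Here is a minimal sketch of that train/test behaviour with `SimpleImputer` on a single made-up numeric column. The arrays `train` and `test` are hypothetical odometer readings; the point is that the mean is learned in `fit_transform` on the training data and merely reused in `transform` on the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical odometer readings with gaps
train = np.array([[10_000.0], [np.nan], [30_000.0]])
test = np.array([[np.nan], [50_000.0]])

num_imputer_demo = SimpleImputer(strategy="mean")
filled_train = num_imputer_demo.fit_transform(train)  # learns the mean (20,000) from train
filled_test = num_imputer_demo.transform(test)        # reuses the training mean

print(num_imputer_demo.statistics_)  # the learned fill value per column
print(filled_test.ravel())
```

The missing test value is filled with 20,000, the training mean, and the test value of 50,000 has no influence on it.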

In [None]:
# Fill missing values with Scikit-Learn
from sklearn.impute import SimpleImputer # This is the tool that fills missing values
from sklearn.compose import ColumnTransformer # Apply different imputers to different columns.

# Fill categorical values with 'missing' and numerical values with mean

# 🔹 Creating imputers
catg_imputer = SimpleImputer(strategy = 'constant', fill_value = "missing")
door_imputer = SimpleImputer(strategy='constant', fill_value = 4)
num_imputer = SimpleImputer(strategy="mean")


# 🔹 Telling scikit-learn which columns use which imputer
# Define columns
catg_features = ['Make', 'Colour']
door_feature = ['Doors']
num_features = ['Odometer (KM)']


# 🔹 Creating the ColumnTransformer
# create an imputer (something that fills missing data)
imputer = ColumnTransformer([
    ('catg_imputer', catg_imputer, catg_features),
    ('door_imputer', door_imputer, door_feature),
    ('num_imputer', num_imputer, num_features)
])


# 🧠 Think of this as a cleaning machine with 3 pipes:

# 1️⃣ First pipe
# Uses catg_imputer on → ['Make','Colour']

# 2️⃣ Second pipe
# Uses door_imputer on → ['Doors']

# 3️⃣ Third pipe
# Uses num_imputer on → ['Odometer (KM)']

# So each type of column gets the correct filling method.


# Fitting & transforming
# Fill train and test values separately: fit on the training set, then reuse those fill values on the test set
filled_X_train = imputer.fit_transform(X_train)
filled_X_test = imputer.transform(X_test)

# check filled X_train
filled_X_train


array([['Toyota', 'White', 4.0, 112004.0],
       ['Toyota', 'White', 4.0, 35673.0],
       ['Toyota', 'White', 4.0, 146824.0],
       ...,
       ['missing', 'missing', 4.0, 61876.0],
       ['Honda', 'missing', 4.0, 28625.0],
       ['Honda', 'missing', 4.0, 150582.0]], shape=(800, 4), dtype=object)

Now we convert the imputed (cleaned) NumPy arrays back into pandas DataFrames so they are easier to work with 🙂

In [None]:
car_sales_filled_train = pd.DataFrame(filled_X_train,
                                columns = ["Make", "Colour", "Doors", "Odometer (KM)"])
car_sales_filled_test = pd.DataFrame(filled_X_test,
                                     columns = ["Make", "Colour", "Doors", "Odometer (KM)"])

# Check missing data in training set
car_sales_filled_train.isna().sum()


Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

In [None]:
# Now let's one-hot encode the features the same way we did before

categorical_features = ["Make", "Colour", "Doors"]  # Select categorical columns
one_hot = OneHotEncoder()  # Create the encoder
# Apply encoder to specific columns
transformer = ColumnTransformer([("one_hot", 
                                 one_hot, 
                                 categorical_features)],
                                 remainder="passthrough")

# Fill train and test values separately (Fit & transform training data)
transformed_X_train = transformer.fit_transform(car_sales_filled_train)
transformed_X_test = transformer.transform(car_sales_filled_test)

# Check transformed and filled X_train
transformed_X_train.toarray() # Because the result is stored as a sparse matrix (to save memory), .toarray() converts it into a normal NumPy array so you can see it.

array([[0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.12004e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.56730e+04],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.46824e+05],
       ...,
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.18760e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.86250e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.50582e+05]], shape=(800, 15))

With the data imputed and one-hot encoded, the preprocessing is done and we can move on to fitting a model 🎉

In [None]:
# Now we've transformed X, let's see if we can fit a model
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()

# Make sure to use transformed (filled and one-hot encoded X data)
model.fit(transformed_X_train, y_train)
model.score(transformed_X_test, y_test)


-0.41189213009480174

In [None]:
len(transformed_X_train.toarray())+len(transformed_X_test.toarray()), len(car_sales)

(1000, 1000)

## 2. Choosing the right estimator/algorithm for your problem

Some things to note:
* Scikit-Learn refers to machine learning models and algorithms as *estimators*
* Classification problem - predicting categories
   * Sometimes you'll see *clf* (short for classifier) used as the name of a classification estimator
* Regression problem - predicting a number (such as the selling price of a car)

<img src="data/scikit-learn-data/ml_map.svg" width="1000" height="500">

### 2.1 Picking a machine learning model for a regression problem

Let's use the California Housing dataset - https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html

In [None]:
# # Get California Housing dataset

# from sklearn.datasets import fetch_california_housing
# housing = fetch_california_housing()
# housing


The code above did not work in this environment, so we will download the dataset and load it using pandas instead.

In [None]:
import pandas as pd

housing = pd.read_csv("data/scikit-learn-data/housing.csv")
housing.head()


Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


In [None]:
housing.rename(columns={
    "median_house_value": "target",
    "median_income": "MedInc",
    "housing_median_age": "HouseAge",
    "total_rooms": "AveRooms",
    "total_bedrooms": "AveBedrms",
    "households": "AveOccup",
    "longitude": "Longitude",
    "latitude": "Latitude"
}, inplace=True)

housing = housing.drop("ocean_proximity", axis=1)

housing


Unnamed: 0,Longitude,Latitude,HouseAge,AveRooms,AveBedrms,population,AveOccup,MedInc,target
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0
...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0


X = housing.drop("target", axis=1)

* 👉 Make a new table called X that contains ALL columns EXCEPT the target column.
* So X = features / inputs

y = housing["target"]

* 👉 Create y which contains ONLY the target column (median house value).
* So y = correct answers (the value the model is trying to predict)

In [None]:
# Import the algorithm
# Ridge is linear regression with L2 regularization, which keeps the coefficients small and makes the fit more stable
from sklearn.linear_model import Ridge

# Setup random seed
np.random.seed(42)

# Remove rows with missing values
housing = housing.dropna() # we did this to prevent any error of NaN values

# Create the data
X = housing.drop("target", axis = 1)
y = housing["target"] # median house prices in $10,000s

# Split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate and fit the model (on the training set)  
model = Ridge()
model.fit(X_train, y_train)

# check the score of the model on the test set (this will return the value of R2)
model.score(X_test, y_test) # this line will give error if any column contains NaN values, Ridge does not accept the NaN values.



0.6400858817984225

**⭐ Coefficient of Determination -> R-squared (R²)**

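* 👉 It tells you how well your model’s predictions match the real values.
* ✅ It measures how much of the variation in house prices your model can explain.
* The best possible R² value is 1.0. A model that always predicts the mean of the targets scores 0.0, and a model that does worse than that gets a negative score.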
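
To make the definition concrete, here is a minimal sketch that computes R² by hand and checks it against `r2_score`. The toy arrays and the helper name `r2_by_hand` are made up for illustration and are not part of the housing data.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy values (hypothetical, not taken from the housing data)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_good = np.array([2.5, 5.5, 7.0, 9.5])        # close to the truth
y_mean = np.full_like(y_true, y_true.mean())   # always predicts the mean
y_bad = np.array([9.0, 7.0, 5.0, 3.0])         # worse than predicting the mean


def r2_by_hand(y_true, y_pred):
    """R² = 1 - (unexplained variation / total variation)."""
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1 - ss_res / ss_tot


for name, y_pred in [("good", y_good), ("mean", y_mean), ("bad", y_bad)]:
    print(name, r2_by_hand(y_true, y_pred), r2_score(y_true, y_pred))
# good -> 0.9625, mean -> 0.0, bad -> -3.0
```

The third case shows that the score is not bounded below by zero.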


<img src="data/scikit-learn-data/R-squared.png" width="500" height="800">

What if **Ridge** doesn't work or the score doesn't fit our needs?

* We could always try a different model.
* We can use an ensemble model.
* Sklearn's ensemble models can be found here: https://scikit-learn.org/stable/modules/ensemble.html

**⭐ What is an Ensemble Model?**
* 👉 An ensemble model is when we combine more than one machine-learning model to make a single, better model.

**🎯 Why do we use ensemble models?**
* ✔ One model may make mistakes
* ✔ Different models see patterns differently
* ✔ Combining them reduces error

**🧩 Two main types**

* 1️⃣ Bagging (Bootstrap Aggregating)

  * 👉 Train many models of the same type on different subsets of the data.

  * Example: *Random Forest* = many Decision Trees combined

* 2️⃣ Boosting

  * 👉 Train models one after another — each new model fixes the mistakes of the previous one.
  * Examples: *XGBoost, AdaBoost, Gradient Boosting*
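
As a rough sketch of the two families, the snippet below fits a single decision tree, a bagging-style ensemble (`RandomForestRegressor`) and a boosting ensemble (`GradientBoostingRegressor`) on the same data. The data is synthetic (`make_regression`) and the variable names are placeholders, so the scores say nothing about the housing dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for illustration only)
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=20.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

models = {
    "single tree": DecisionTreeRegressor(random_state=42),
    "bagging (random forest)": RandomForestRegressor(n_estimators=100, random_state=42),
    "boosting (gradient boosting)": GradientBoostingRegressor(random_state=42),
}

scores = {}
for name, estimator in models.items():
    estimator.fit(X_tr, y_tr)                    # train on the training split
    scores[name] = estimator.score(X_te, y_te)   # R² on the held-out split
    print(f"{name}: R^2 = {scores[name]:.3f}")
```

On most datasets the two ensembles score higher than the single tree, but that is a tendency rather than a guarantee.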

<img src="data/scikit-learn-data/random_forest.png" width="500" height="500">

<img src="data/scikit-learn-data/random_forest_regressor.png" width="400" height="500">

In [None]:
# Import the RandomForestRegressor model class from the ensemble module
from sklearn.ensemble import RandomForestRegressor 

# Set up random seed
np.random.seed(42)

# Create the data
X = housing.drop("target", axis = 1)
y = housing["target"]

# Split into train and test sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create the random forest model
model = RandomForestRegressor()
model.fit(X_train, y_train)

# Check the score of the model on the test set
model.score(X_test, y_test)

0.8212532427977578

**⭐ What is L2 Regularization?**

* 👉 L2 regularization is a technique used to stop a machine-learning model from overfitting by keeping the model’s weights small.
* It is also called:
    * 🔹 Ridge Regularization
    * 🔹 Weight Decay (in deep learning)

**🛑 How L2 helps**

* L2 regularization penalizes large weights in the model.

* Simple idea:
    * 👉 Models with huge weights are punished
    * so the model prefers smaller, smoother weights.
* Smaller weights → simpler model → less overfitting 💙

**📌 The L2 penalty term (not scary)**

When training a model, the loss function becomes:

$$\text{Loss} = \text{Prediction Error} + \lambda \sum_i w_i^2$$

| Symbol | Meaning |
| --- | --- |
| $w_i$ | model weights |
| $\sum_i w_i^2$ | sum of the squared weights (this is the L2 penalty) |
| $\lambda$ | regularization strength (a tuning parameter) |
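
A small sketch of the penalty in action: in scikit-learn's `Ridge`, the `alpha` parameter plays the role of λ. The data below is synthetic and the variable names are placeholders; the point is only that a larger `alpha` leaves a smaller sum of squared weights.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic data (an assumption for illustration only)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

penalties = {}
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    ridge = Ridge(alpha=alpha).fit(X_demo, y_demo)
    # The L2 penalty term without the lambda factor: sum of squared weights
    penalties[alpha] = float(np.sum(ridge.coef_ ** 2))
    print(f"alpha={alpha:>8}: sum of squared weights = {penalties[alpha]:.2f}")
```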


**🔥 L2 vs L1**

| Feature     | L2 Regularization      | L1 Regularization                         |
| ----------- | ---------------------- | ----------------------------------------- |
| Also called | Ridge                  | Lasso                                     |
| Penalty     | Sum of squared weights | Sum of absolute weights                   |
| Effect      | Shrinks weights        | Can make weights zero (feature selection) |
| Smooth?     | Yes                    | Less smooth                               |
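
The "can make weights zero" row can be checked with a quick sketch. It assumes synthetic data in which only 3 of 10 features are informative; `Lasso` (L1) typically sets several of the remaining weights to exactly zero, while `Ridge` (L2) only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 10 features, only 3 carry signal (an assumption for illustration)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

ridge = Ridge(alpha=1.0).fit(X_demo, y_demo)
lasso = Lasso(alpha=1.0).fit(X_demo, y_demo)

# Count the weights that are exactly zero in each model
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print("Ridge weights:", np.round(ridge.coef_, 2))
print("Lasso weights:", np.round(lasso.coef_, 2))
print("Exact zeros -> Ridge:", n_zero_ridge, "| Lasso:", n_zero_lasso)
```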


🟢 Models that normally use L2

* Ridge Regression
* Logistic Regression (default in sklearn)
* SVM (L2-style)
* Neural Networks (if enabled)

🟡 Models that use L1

* Lasso Regression
* (sometimes) Logistic Regression
* Elastic Net (L1 + L2)

🔵 Models that don’t use L1/L2 penalties on feature weights

* Decision Trees
* Random Forest
* Gradient Boosting
* XGBoost (its optional L1/L2 penalties apply to leaf weights, not to feature coefficients)
* KNN
* Naive Bayes
* Clustering models

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

**Regression evaluation metrics:**

- MAE → average of the absolute errors

- MSE → average of the squared errors

- RMSE → square root of MSE; same units as the target, and punishes big errors more than MAE

- R² → how much of the variance in the target the model explains
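
The four metrics can be worked through on a tiny example. The numbers below are invented for illustration; each hand-computed value is compared with the matching function in `sklearn.metrics`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy values (hypothetical)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 300.0, 360.0])

errors = y_true - y_pred         # [-10, 10, 0, 40]
mae = np.mean(np.abs(errors))    # (10 + 10 + 0 + 40) / 4 = 15.0
mse = np.mean(errors ** 2)       # (100 + 100 + 0 + 1600) / 4 = 450.0
rmse = np.sqrt(mse)              # about 21.21, same units as the target
r2 = r2_score(y_true, y_pred)    # 1 - 1800 / 50000 = 0.964

print("MAE :", mae, "| sklearn:", mean_absolute_error(y_true, y_pred))
print("MSE :", mse, "| sklearn:", mean_squared_error(y_true, y_pred))
print("RMSE:", rmse)
print("R^2 :", r2)
```

The single large error (40) moves RMSE well above MAE, which is the "punishes big errors" effect.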

---------------------------------------------------------------------------------------------------------------------

**🌳 RandomForestRegressor vs 📈 Ridge Regression**

🟢 First — what type of models are they?

* 📈 Ridge Regression

    * A linear model
    * Assumes the relationship between features and target is a straight line
    * Uses L2 regularization to prevent overfitting
    * Think of it like: “Draw the best straight (or flat) line/plane through the data.”

* 🌳 Random Forest Regressor
    - A tree-based ensemble model
    - Builds many decision trees
    - Combines their predictions (usually averaging)
    * Think of it like: “Ask many decision-trees for an answer and average their opinions.”
    * This makes it flexible and powerful.

| Question                                | Ridge                   | Random Forest       |
| --------------------------------------- | ----------------------- | ------------------- |
| Is it linear?                           | ✅ Yes                   | ❌ No                |
| Handles complex / curved relationships? | ❌ Poorly                | ✅ Very well         |
| Handles outliers?                       | ⚠️ Somewhat             | ✅ Better            |
| Handles missing values?                 | ❌ No                    | ⚠️ Sometimes yes    |
| Needs feature scaling?                  | ✅ Yes                   | ❌ No                |
| Tends to overfit?                       | ⚠️ Less (because of L2) | ⚠️ Can if not tuned |
| Speed                                   | ⚡ Very fast             | 🐢 Slower           |
| Interpretability                        | ⭐ Easy                  | ❌ Harder            |
| Works well on small data?               | 😊 Yes                  | 😊 Yes              |
| Works well on big messy data?           | 😐 Sometimes            | 😀 Often great      |


-------

**⭐ 1️⃣ What is the C parameter?**

➤ C controls how much the model is allowed to make mistakes on the training data vs keeping the model simple.

Think of C as: “How strict should the model be?”

🟢 If C is LARGE (e.g., 1000)

* 👉 The model becomes very strict
* 👉 It tries VERY HARD to classify every training example correctly
* 👉 It allows bigger weights
* 👉 Less regularization

* Result:
    * ✔ Lower bias
    * ❌ Higher risk of overfitting

“Fit the data as perfectly as possible!”

🔵 If C is SMALL (e.g., 0.01)

* 👉 The model becomes more relaxed
* 👉 It allows some mistakes on the training data
* 👉 It forces weights to stay small
* 👉 More regularization

* Result:
    * ✔ Lower risk of overfitting
    * ❌ Might underfit if too small

“It’s okay to be a little wrong — just keep the model simple.”




**⭐ 2️⃣ Formulas for L1 and L2 regularization**

When training a model, we minimize a loss function, but with regularization we add a penalty term.

<img src="data/scikit-learn-data/L1-L2.png" width="600" height="500">

λ = regularization strength,
𝑤 = model weights

**🔍 How C relates to λ (lambda)**
C = 1 / λ

| C           | λ       | Effect                |
| ----------- | ------- | --------------------- |
| **Large C** | Small λ | Weak regularization   |
| **Small C** | Large λ | Strong regularization |
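
A minimal sketch of the table above, assuming synthetic data and `LogisticRegression` (which uses an L2 penalty by default): as `C` grows, regularization weakens and the size of the weight vector grows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (an assumption for illustration only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0, random_state=0
)

weight_norms = {}
for C in [0.01, 1.0, 100.0]:
    clf_demo = LogisticRegression(C=C, max_iter=5000).fit(X_demo, y_demo)
    # Euclidean length of the weight vector: small when regularization is strong
    weight_norms[C] = float(np.linalg.norm(clf_demo.coef_))
    print(f"C={C:>6}: ||w|| = {weight_norms[C]:.3f}")
```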



------

## 2.2 Picking a machine learning model for the classification problem
Let's go to the map... https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

In [None]:
import pandas as pd
heart_disease = pd.read_csv("data/heart-disease.csv")
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [None]:
len(heart_disease)

303

Following the ML map, we arrive at the **LinearSVC** estimator.

In [None]:
# Import the LinearSVC estimator class
from sklearn.svm import LinearSVC

# setup random seed
np.random.seed(42)

# Make the data
X = heart_disease.drop("target", axis = 1)
y = heart_disease["target"]

# Split the data 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate LinearSVC
clf = LinearSVC()
clf.fit(X_train, y_train)

# Evaluate the LinearSVC
clf.score(X_test, y_test)


# Note: depending on the data and the split, LinearSVC can score lower or higher than RandomForestClassifier

0.8688524590163934

**⭐ What is LinearSVC?**

* LinearSVC is a machine learning model for classification (NOT regression).
* It belongs to the Support Vector Machine (SVM) family.
* 👉 It finds the best straight line (or plane) that separates classes.
* So it is a linear classifier — meaning it works best when the data can be separated by a straight boundary.

**🏁 What problems can it solve?**

LinearSVC is used for classification, like:

* ✔ Spam vs not spam
* ✔ Positive vs negative review
* ✔ Fraud vs not fraud
* ✔ Disease vs no disease

It also supports multi-class classification.

✔ It uses L2 regularization by default

✔ It has a parameter `C`, which controls regularization strength:

* small `C` → stronger regularization → simpler model
* large `C` → weaker regularization → more complex model

**⚖️ LinearSVC vs SVC(kernel="linear")**

| Feature              | `LinearSVC`    | `SVC(kernel="linear")`                |
| -------------------- | -------------- | ------------------------------------- |
| Speed                | ⚡ Faster       | 🐢 Slower                             |
| Handles big datasets | ✅ Yes          | ❌ Not ideal                           |
| Probability output   | ❌ No           | ⚠️ Optional (with `probability=True`) |
| Implementation       | Uses liblinear | Uses libsvm                           |
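
The "Probability output" row can be checked directly. This is a sketch on synthetic data, with a `StandardScaler` step added because linear SVMs usually converge more easily on scaled features; the pipeline and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Synthetic binary classification data (an assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

linear_svc = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10000))
svc_linear = make_pipeline(
    StandardScaler(), SVC(kernel="linear", C=1.0, probability=True, random_state=0)
)
linear_svc.fit(X_demo, y_demo)
svc_linear.fit(X_demo, y_demo)

# LinearSVC exposes decision_function() but no predict_proba()
has_proba_linear_svc = hasattr(linear_svc[-1], "predict_proba")
has_proba_svc = hasattr(svc_linear[-1], "predict_proba")
print("LinearSVC has predict_proba:", has_proba_linear_svc)
print("SVC(probability=True) has predict_proba:", has_proba_svc)

# Both are linear classifiers, so their predictions usually agree closely
agreement = float(np.mean(linear_svc.predict(X_demo) == svc_linear.predict(X_demo)))
print(f"Fraction of identical predictions: {agreement:.3f}")
```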




In [None]:
# Import the RandomForestClassifier estimator class
from sklearn.ensemble import RandomForestClassifier

# setup random seed
np.random.seed(42)

# Make the data
X = heart_disease.drop("target", axis = 1)
y = heart_disease["target"]

# Split the data 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)

# Evaluate the RandomForestClassifier
clf.score(X_test, y_test)

0.8524590163934426

Tidbit (an interesting piece of information):

1. If you have structured data, use ensemble methods.
2. If you have unstructured data, use deep learning or transfer learning.


In [None]:
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


## 3. Fit the model/algorithm and use it to make predictions on our data

### 3.1 Fitting the model to the data

Different names for:
* `X` = features, feature variables, data
* `y` = labels, target, target variables 

In [None]:
# Import the RandomForestClassifier estimator class
from sklearn.ensemble import RandomForestClassifier

# setup random seed
np.random.seed(42)

# Make the data
X = heart_disease.drop("target", axis = 1)
y = heart_disease["target"]

# Split the data 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate RandomForestClassifier
clf = RandomForestClassifier()

# Fit the model to the data
clf.fit(X_train, y_train)

# Evaluate the RandomForestClassifier
clf.score(X_test, y_test)

0.8524590163934426

In [None]:
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [None]:
y.tail()

298    0
299    0
300    0
301    0
302    0
Name: target, dtype: int64

### Random forest model deep dive

These resources will help you understand what's happening inside the Random Forest models we've been using.

* [Random Forest Wikipedia](https://en.wikipedia.org/wiki/Random_forest)
* [Random Forest Wikipedia (simple version)](https://simple.wikipedia.org/wiki/Random_forest)
* [Random Forests in Python](https://simple.wikipedia.org/wiki/Random_forest) by yhat
* [An Implementation and Explanation of the Random Forest in Python](https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76) by Will Koehrsen

### 3.2 Make predictions using machine learning models
There are two ways to make predictions:

1. `predict()`
2. `predict_proba()`

In [None]:
# Use a trained model to make the predictions

clf.predict(np.array([1, 7, 8, 4]))  # Raises a ValueError: predict() expects a 2D array of shape (n_samples, n_features)



ValueError: Expected 2D array, got 1D array instead:
array=[1. 7. 8. 4.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
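
The error message already suggests the fix. Below is a minimal sketch on a throwaway model trained on 4 synthetic features (the names `clf_demo` and `one_sample` are placeholders). The heart-disease `clf` was trained on 13 features, so a real single-patient input would need 13 values in the same column order.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Throwaway model with 4 features (an assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=42)
clf_demo = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

one_sample = np.array([1, 7, 8, 4])  # shape (4,): a 1D array

# Passing the 1D array reproduces the error
try:
    clf_demo.predict(one_sample)
    raised = False
except ValueError as err:
    raised = True
    print("ValueError:", str(err).splitlines()[0])

# reshape(1, -1) turns it into one row with four columns: shape (1, 4)
single_pred = clf_demo.predict(one_sample.reshape(1, -1))
print("Prediction for the single sample:", single_pred)
```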

In [None]:
X_test.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
179,57,1,0,150,276,0,0,112,1,0.6,1,1,1
228,59,1,3,170,288,0,0,159,0,0.2,1,0,3
111,57,1,2,150,126,1,1,173,0,0.2,2,1,3
246,56,0,0,134,409,0,0,150,1,1.9,1,2,3
60,71,0,2,110,265,1,0,130,0,0.0,2,1,2


In [None]:
# Now the model sees patient data WITHOUT the answers and predicts: array([0, 1, 0, 1, 1, 0, ...]) -> Each number = predicted heart disease status.
clf.predict(X_test)

array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0])

In [None]:
np.array(y_test) # These are the real answers (the correct diagnosis from the dataset)

array([0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0,
       0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0])

In [None]:
# Compare Predictions vs Truth 
y_preds = clf.predict(X_test)
np.mean(y_preds == y_test) 
# y_preds == y_test produces booleans such as [True, True, False, True, ...]
# True = 1, False = 0
# np.mean() of those values is the accuracy (the same value score() returns)


np.float64(0.8524590163934426)

In [None]:
clf.score(X_test, y_test)

0.8524590163934426

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_preds)

0.8524590163934426

Make predictions with `predict_proba()`

- Use this if someone asks you, "What probability is your model assigning to each prediction?"

In [None]:
# predict_proba() returns probabilities of a classification label 
clf.predict_proba(X_test[:5])

array([[0.89, 0.11],
       [0.49, 0.51],
       [0.43, 0.57],
       [0.84, 0.16],
       [0.18, 0.82]])

**⭐ What does predict_proba() do?**

* 👉 Instead of just giving the final class (0 or 1), it gives the probability of each class.

* So for each patient, the model says something like:
    - “I am X% sure this person is class-0
    - and Y% sure this person is class-1.”

In [None]:
# Let's call predict() on the same data
clf.predict(X_test[:5])

array([0, 1, 1, 0, 1])

**predict_proba() → gives probabilities**   

**predict() → gives the final class (0 or 1)**


🧪 Imagine a simple real-life example

- Let’s say we want to predict: Does a person have heart disease?
    - 1 = Yes
    - 0 = No

And we test 5 patients.

**📌 Output of predict_proba(X_test[:5])**
array([
 [0.80, 0.20],  
 [0.10, 0.90],  
 [0.45, 0.55],  
 [0.60, 0.40],  
 [0.02, 0.98]  
])  

Each row = one patient

| Row | P(class 0) = No disease | P(class 1) = Disease |
| --- | ----------------------- | -------------------- |
| 1   | 0.80                    | 0.20                 |
| 2   | 0.10                    | 0.90                 |
| 3   | 0.45                    | 0.55                 |
| 4   | 0.60                    | 0.40                 |
| 5   | 0.02                    | 0.98                 |  

**📌 Output of predict(X_test[:5])**  
array([0, 1, 1, 0, 1])  

This simply picks the class with the HIGHER probability    

| Patient | Prob Disease | Prob No Disease | Higher prob | Predicted Class (`predict`) |
| ------- | ------------ | --------------- | ----------- | --------------------------- |
| 1       | 0.20         | 0.80            | No disease  | 0                           |
| 2       | 0.90         | 0.10            | Disease     | 1                           |
| 3       | 0.55         | 0.45            | Disease     | 1                           |
| 4       | 0.40         | 0.60            | No disease  | 0                           |
| 5       | 0.98         | 0.02            | Disease     | 1                           |


For a binary problem, `predict()` chooses the class with the higher probability, which is equivalent to a default threshold of 0.5 on the class-1 probability.
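
A short sketch of that relationship on synthetic data (all names are placeholders): taking the most probable class from `predict_proba()` reproduces `predict()`, and a stricter custom threshold can be applied to the class-1 column when false positives are costly. The 0.7 threshold is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption for illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
clf_demo = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

proba = clf_demo.predict_proba(X_te)  # shape (n_samples, 2); each row sums to 1

# Picking the most probable class reproduces predict()
from_argmax = clf_demo.classes_[np.argmax(proba, axis=1)]
same_as_predict = bool(np.array_equal(from_argmax, clf_demo.predict(X_te)))
print("argmax of predict_proba matches predict():", same_as_predict)

# Custom, stricter threshold on the class-1 probability
strict_preds = (proba[:, 1] >= 0.7).astype(int)
print("Positives at the default threshold:", int(from_argmax.sum()))
print("Positives at threshold 0.7:", int(strict_preds.sum()))
```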



In [None]:
heart_disease["target"].value_counts()

target
1    165
0    138
Name: count, dtype: int64

`predict()` can also be used for regression models.

The regression example needs the `housing` DataFrame in memory; without it, Python raises `NameError: name 'housing' is not defined`.

In [None]:
# Quick way to check which variables currently exist in memory
%whos


Variable                 Type                      Data/Info
------------------------------------------------------------
RandomForestClassifier   ABCMeta                   <class 'sklearn.ensemble.<...>.RandomForestClassifier'>
RandomForestRegressor    ABCMeta                   <class 'sklearn.ensemble.<...>t.RandomForestRegressor'>
X                        DataFrame                 Shape: (1000, 4)
X_test                   DataFrame                 Shape: (200, 4)
X_train                  DataFrame                 Shape: (800, 4)
accuracy_score           function                  <function accuracy_score at 0x00000113FF5C1DD0>
car_sales                DataFrame                 Shape: (1000, 5)
classification_report    function                  <function classification_<...>rt at 0x00000113FF5C3320>
clf                      RandomForestClassifier    RandomForestClassifier(n_estimators=90)
confusion_matrix         function                  <function confusion_matrix at 0x00000113FF5C1F

In [34]:
import pandas as pd

housing = pd.read_csv("data/scikit-learn-data/housing.csv")


In [35]:
housing.rename(columns={
    "median_house_value": "target",
    "median_income": "MedInc",
    "housing_median_age": "HouseAge",
    "total_rooms": "AveRooms",
    "total_bedrooms": "AveBedrms",
    "households": "AveOccup",
    "longitude": "Longitude",
    "latitude": "Latitude"
}, inplace=True)

housing = housing.drop("ocean_proximity", axis=1)

housing

Unnamed: 0,Longitude,Latitude,HouseAge,AveRooms,AveBedrms,population,AveOccup,MedInc,target
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0
...,...,...,...,...,...,...,...,...,...
20635,-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0
20636,-121.21,39.49,18.0,697.0,150.0,356.0,114.0,2.5568,77100.0
20637,-121.22,39.43,17.0,2254.0,485.0,1007.0,433.0,1.7000,92300.0
20638,-121.32,39.43,18.0,1860.0,409.0,741.0,349.0,1.8672,84700.0


In [37]:
housing.head()

Unnamed: 0,Longitude,Latitude,HouseAge,AveRooms,AveBedrms,population,AveOccup,MedInc,target
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0


In [38]:
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

# Create the data 
X = housing.drop("target", axis = 1)
y = housing['target']

# Split into train and test set 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create the model instance 
model = RandomForestRegressor()

# Fit the data to the model
model.fit(X_train, y_train)

# Make predictions
y_preds = model.predict(X_test)

In [39]:
y_preds[:10]

array([ 49639.  ,  68489.  , 471253.52, 258076.  , 272915.  , 163097.  ,
       240405.01, 168769.  , 282527.05, 477871.76])

In [40]:
np.array(y_test[:10])

array([ 47700.,  45800., 500001., 218600., 278000., 158700., 198200.,
       157500., 340000., 446600.])

In [None]:
len(y_preds)

4128

In [None]:
len(y_test)

4128

In [41]:
# Compare the predictions to the truth

from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_preds)

32251.46173691861
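
To see what `mean_absolute_error` does, the sketch below recomputes it by hand on the ten predictions and ten true values printed above. Because it uses only those ten rows, the result differs from the full test-set MAE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# The first ten predictions and true values shown in the outputs above
preds_10 = np.array([49639.0, 68489.0, 471253.52, 258076.0, 272915.0,
                     163097.0, 240405.01, 168769.0, 282527.05, 477871.76])
truth_10 = np.array([47700.0, 45800.0, 500001.0, 218600.0, 278000.0,
                     158700.0, 198200.0, 157500.0, 340000.0, 446600.0])

differences = preds_10 - truth_10           # signed error per house
mae_by_hand = np.mean(np.abs(differences))  # average size of the error, in dollars

print("Signed differences:", differences)
print("MAE by hand:", mae_by_hand)
print("MAE sklearn:", mean_absolute_error(truth_10, preds_10))
```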

In [None]:
housing['target']

0        452600.0
1        358500.0
2        352100.0
3        341300.0
4        342200.0
           ...   
20635     78100.0
20636     77100.0
20637     92300.0
20638     84700.0
20639     89400.0
Name: target, Length: 20640, dtype: float64

## 4. Evaluate a machine learning model

Three ways to evaluate scikit-learn models/estimators:
  1. Estimator's built-in `score()` method
  2. The `scoring` parameter
  3. Problem-specific metric functions

You can read more about this here: https://scikit-learn.org/stable/modules/model_evaluation.html

### 4.1 Evaluating the model with the `score` method

In [42]:
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

# Create X and y
X = heart_disease.drop('target', axis = 1)
y = heart_disease['target']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create classifier model instance
clf = RandomForestClassifier()

# Fit classifier to training data 
clf.fit(X_train, y_train)

RandomForestClassifier()


In [43]:
# The highest value of the score() method is 1.0, the lowest is 0.0
clf.score(X_train, y_train)

1.0

In [44]:
clf.score(X_test, y_test)

0.8524590163934426

Let's use the `score()` method on our regression problem

In [53]:
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

# Create the data 
X = housing.drop("target", axis = 1)
y = housing['target']

# Split into train and test set 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create the model instance 
model = RandomForestRegressor(n_estimators = 100)

# Fit the data to the model
model.fit(X_train, y_train)



RandomForestRegressor()


The more estimators (trees) there are, the longer the code takes to run.

The model score also tends to improve as the number of estimators grows, although the gains usually level off.
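
One way to see this trade-off is to time a few values of `n_estimators`. The sketch below uses synthetic data and placeholder names, so the exact scores and timings will differ from the housing example and from machine to machine.

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption for illustration only)
X_demo, y_demo = make_regression(n_samples=1000, n_features=8, noise=15.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

results = []
for n in [1, 10, 50, 100]:
    start = time.perf_counter()
    forest = RandomForestRegressor(n_estimators=n, random_state=42).fit(X_tr, y_tr)
    elapsed = time.perf_counter() - start        # training time in seconds
    results.append((n, forest.score(X_te, y_te), elapsed))
    print(f"n_estimators={n:>3}: R^2 = {results[-1][1]:.3f}, fit time = {elapsed:.2f}s")
```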

In [54]:
model.score(X_test, y_test)

0.8104992594035967

**Regression problems -> the default `score()` metric is R²**

*but in*

**Classification problems -> the default `score()` metric is accuracy**

If 85 out of 100 samples are classified correctly, the accuracy is *0.85*, i.e. *85%*.
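
These defaults can be confirmed by comparing `score()` with the matching metric function. The sketch uses small synthetic datasets and placeholder names.

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split

# Classification: score() should equal accuracy_score()
Xc, yc = make_classification(n_samples=300, n_features=8, random_state=42)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(Xc, yc, test_size=0.2, random_state=42)
clf_demo = RandomForestClassifier(n_estimators=50, random_state=42).fit(Xc_tr, yc_tr)
clf_score = clf_demo.score(Xc_te, yc_te)
clf_accuracy = accuracy_score(yc_te, clf_demo.predict(Xc_te))

# Regression: score() should equal r2_score()
Xr, yr = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(Xr, yr, test_size=0.2, random_state=42)
reg_demo = RandomForestRegressor(n_estimators=50, random_state=42).fit(Xr_tr, yr_tr)
reg_score = reg_demo.score(Xr_te, yr_te)
reg_r2 = r2_score(yr_te, reg_demo.predict(Xr_te))

print("Classifier score() vs accuracy_score():", clf_score, clf_accuracy)
print("Regressor  score() vs r2_score():      ", reg_score, reg_r2)
```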




### 4.2 Evaluating a model using the `scoring` parameter