# Introduction to Scikit-Learn (sklearn)

This notebook demonstrates some of the most useful functions of the Scikit-Learn library.

What we're covering:

0. An end-to-end Scikit-Learn workflow
1. Getting the data ready
2. Choosing the right estimator/algorithm for our problems
3. Fit the model/algorithm and use it to make predictions on our data
4. Evaluating a model
5. Improve a model
6. Save and load a model
7. Putting it all together

## Method to ignore Future warnings
`
import warnings
warnings.filterwarnings("ignore")
# Change "ignore" to "default" to see the warnings again
`
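
Silencing every warning can hide real problems. A narrower option, sketched below, is to ignore only `FutureWarning` and leave the other categories visible. The `noisy` function is a hypothetical stand-in for a library call that emits warnings; it is not part of sklearn.

```python
import warnings


def noisy():
    """Hypothetical stand-in for a library call that emits two warnings."""
    warnings.warn("this argument will be renamed", FutureWarning)
    warnings.warn("input looks suspicious", UserWarning)


# catch_warnings(record=True) collects the warnings that pass the filters
# and restores the previous filter state when the block exits.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("default")                            # show everything...
    warnings.filterwarnings("ignore", category=FutureWarning)   # ...except FutureWarning
    noisy()

caught_categories = [w.category.__name__ for w in caught]
print(caught_categories)
```

Only the `UserWarning` is recorded here, so unrelated problems still surface.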

## 0. An end-to-end Scikit-Learn workflow

In [1]:
import numpy as np

In [2]:
# 1. Get the data ready
import pandas as pd
heart_disease = pd.read_csv("../data/heart-disease.csv")
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [3]:
# Create X (Feature matrix/variables)
X = heart_disease.drop("target", axis = 1)

# Create Y (labels)
Y = heart_disease["target"]

### Problem definition: predict whether or not a person has heart disease. This is a classification problem (supervised learning).


### A random forest classifier.

### A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. Trees in the forest use the best split strategy, i.e. equivalent to passing splitter="best" to the underlying DecisionTreeClassifier. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.
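
The "averaging" in the description above can be checked directly. The sketch below uses synthetic data from `make_classification` (an assumption for illustration, not the heart-disease dataset) and compares the forest's predicted probabilities with the mean of its individual trees' probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the heart-disease dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=10, random_state=0)
forest.fit(X_demo, y_demo)

# Each fitted decision tree is available in `estimators_`
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Averaging the per-tree class probabilities reproduces the forest's output
averaged = tree_probas.mean(axis=0)
matches_forest = np.allclose(averaged, forest.predict_proba(X_demo))
print(len(forest.estimators_), matches_forest)
```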

In [4]:
# 2 Choose the right model and hyperparameters
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

# We'll keep the default hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

### By default, RandomForestClassifier() is equivalent to RandomForestClassifier(n_estimators = 100)
### n_estimators = number of trees in the forest

In [5]:
# 3. Fit the model to the training data
from sklearn.model_selection import train_test_split

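# Note: test_size = 0.8 holds out 80% of the rows for testing and trains on only 20%
# (the outputs below reflect this); use test_size = 0.2 for the usual 80/20 split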
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.8)

In [6]:
import sklearn
sklearn.show_versions()


System:
    python: 3.10.13 (main, Sep 11 2023, 08:16:02) [Clang 14.0.6 ]
executable: /Users/shubhamkumar/MLProject/env/bin/python
   machine: macOS-14.2.1-arm64-arm-64bit

Python dependencies:
      sklearn: 1.3.0
          pip: 23.3.1
   setuptools: 68.2.2
        numpy: 1.26.2
        scipy: 1.11.4
       Cython: None
       pandas: 2.1.4
   matplotlib: 3.8.0
       joblib: 1.2.0
threadpoolctl: 2.2.0

Built with OpenMP: True

threadpoolctl info:
       filepath: /Users/shubhamkumar/MLProject/env/lib/libopenblasp-r0.3.21.dylib
         prefix: libopenblas
       user_api: blas
   internal_api: openblas
        version: 0.3.21
    num_threads: 8
threading_layer: pthreads
   architecture: armv8

       filepath: /Users/shubhamkumar/MLProject/env/lib/libomp.dylib
         prefix: libomp
       user_api: openmp
   internal_api: openmp
        version: None
    num_threads: 8


In [7]:
clf.fit(X_train, Y_train);

In [8]:
X_train

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
197,67,1,0,125,254,1,1,163,0,0.2,1,2,3
161,55,0,1,132,342,0,1,166,0,1.2,2,0,2
166,67,1,0,120,229,0,0,129,1,2.6,1,2,3
246,56,0,0,134,409,0,0,150,1,1.9,1,2,3
21,44,1,2,130,233,0,1,179,1,0.4,2,0,2
154,39,0,2,138,220,0,1,152,0,0.0,1,0,2
220,63,0,0,150,407,0,0,154,0,4.0,1,3,3
231,57,1,0,165,289,1,0,124,0,1.0,1,3,3
97,52,1,0,108,233,1,1,147,0,0.1,2,3,3
157,35,1,1,122,192,0,1,174,0,0.0,2,0,2


# Make a prediction
# This call fails: predict() expects a 2D array with the same 13 feature columns as X_train
# y_label = clf.predict(np.array([0, 2, 3, 4]))

In [9]:
Y_preds = clf.predict(X_test)
Y_preds

array([0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1,
       1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0,
       1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0,
       1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0,
       1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1,
       1])

In [10]:
Y_test

232    0
25     1
256    0
226    0
262    0
      ..
250    0
64     1
8      1
274    0
119    1
Name: target, Length: 243, dtype: int64

In [11]:
# 4. Evaluate the model
# training data performance
clf.score(X_train, Y_train) 

1.0

In [12]:
# test data performance
clf.score(X_test, Y_test)

0.7860082304526749

In [13]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(classification_report(Y_test, Y_preds))

              precision    recall  f1-score   support

           0       0.86      0.66      0.75       116
           1       0.75      0.90      0.81       127

    accuracy                           0.79       243
   macro avg       0.80      0.78      0.78       243
weighted avg       0.80      0.79      0.78       243



TP = true positive, TN = true negative,
FP = false positive, FN = false negative

Precision = TP/(TP + FP)

Recall = TP/(TP + FN)

F1-Score = 2 * (Precision * Recall) / (Precision + Recall)

Accuracy = (TP + TN)/(TP + TN + FP + FN)

Confusion Matrix : ((TN, FP),
                    (FN, TP))

In [14]:
confusion_matrix(Y_test, Y_preds)

array([[ 77,  39],
       [ 13, 114]])
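
As a quick check, the formulas above can be applied by hand to the four counts in this confusion matrix. The sketch below treats class `1` as the positive class, which is an assumption that matches the `((TN, FP), (FN, TP))` layout.

```python
# Counts taken from the confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp = 77, 39
fn, tp = 13, 114

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
```

The rounded values agree with the row for class `1` in the classification report and with the accuracy returned by `accuracy_score`.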

In [15]:
accuracy_score(Y_test, Y_preds) 

0.7860082304526749

In [16]:
# 5. Improve the model
# Try different values of n_estimators

np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    rfc = RandomForestClassifier(n_estimators = i).fit(X_train, Y_train)
    print(f"Model Accuracy on test set: {rfc.score(X_test, Y_test) * 100:.2f}") 
    print("")

Trying model with 10 estimators...
Model Accuracy on test set: 76.95

Trying model with 20 estimators...
Model Accuracy on test set: 76.54

Trying model with 30 estimators...
Model Accuracy on test set: 77.37

Trying model with 40 estimators...
Model Accuracy on test set: 77.37

Trying model with 50 estimators...
Model Accuracy on test set: 78.60

Trying model with 60 estimators...
Model Accuracy on test set: 78.60

Trying model with 70 estimators...
Model Accuracy on test set: 80.25

Trying model with 80 estimators...
Model Accuracy on test set: 81.07

Trying model with 90 estimators...
Model Accuracy on test set: 81.89



In [17]:
# 6. Save a model and load it
import pickle 

# Use a context manager so the file is closed after writing
with open("random_forest_model_1.pkl", "wb") as f:
    pickle.dump(clf, f)

In [18]:
with open("random_forest_model_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model.score(X_test, Y_test)

0.7860082304526749

## 1. Getting our data ready to be used with machine learning

#### Three main things we have to do:
1. Split the data into features and labels (usually `X` and `Y`)
2. Fill (aka impute) or disregard missing values
3. Convert non-numerical values to numerical values (aka feature encoding)

In [19]:
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [20]:
X = heart_disease.drop("target", axis = 1)
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [21]:
Y = heart_disease["target"]
Y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [22]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25)

In [23]:
X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

((227, 13), (76, 13), (227,), (76,))

In [24]:
X.shape, Y.shape

((303, 13), (303,))
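
The shapes above follow from `test_size`: with 303 rows and `test_size = 0.25`, 76 rows go to the test set and 227 to the training set. The sketch below reproduces those counts on placeholder arrays (made-up values with the same number of rows) and shows that passing `random_state` makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as heart_disease (303)
X_demo = np.arange(303 * 2).reshape(303, 2)
y_demo = np.arange(303) % 2

# random_state fixes the shuffle so the split is reproducible
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
print(X_tr.shape, X_te.shape)

# Same seed, same split
_, X_te_again, _, _ = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
same_split = np.array_equal(X_te, X_te_again)
print(same_split)
```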

### 1.1 Make sure data is all numerical

In [25]:
car_sales = pd.read_csv("../data/car-sales-extended.csv")
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [26]:
car_sales["Doors"].value_counts()

Doors
4    856
5     79
3     65
Name: count, dtype: int64

In [27]:
len(car_sales)

1000

In [28]:
# Split into X/Y
X = car_sales.drop("Price", axis = 1)
Y = car_sales["Price"]

# Split into training and test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25)

In [29]:
# Build Machine Learning Model
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()

# fitting the training data into model
model.fit(X_train, Y_train)

# Model Evaluation with test data
model.score(X_test, Y_test)

ValueError: could not convert string to float: 'Honda'

In [30]:
X.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431,4
1,BMW,Blue,192714,5
2,Honda,White,84714,4
3,Toyota,White,154365,4
4,Nissan,Blue,181577,3


In [31]:
# turn categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                   one_hot, 
                                   categorical_features)],
                                   remainder="passthrough")

# Fit all transformers, transform the data and concatenate results.
transformed_X = transformer.fit_transform(X)
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

In [32]:
pd.DataFrame(transformed_X).head(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0


In [33]:
X.head(5)

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431,4
1,BMW,Blue,192714,5
2,Honda,White,84714,4
3,Toyota,White,154365,4
4,Nissan,Blue,181577,3


In [34]:
car_sales["Colour"].value_counts()

Colour
White    407
Blue     321
Black     99
Red       94
Green     79
Name: count, dtype: int64

In [35]:
car_sales["Make"].value_counts()

Make
Toyota    398
Honda     304
Nissan    198
BMW       100
Name: count, dtype: int64

In [36]:
car_sales["Doors"].value_counts()

Doors
4    856
5     79
3     65
Name: count, dtype: int64

0 -> BMW
1 -> Honda
2 -> Nissan
3 -> Toyota
4 -> Black
5 -> Blue
6 -> Green
7 -> Red
8 -> White
9 -> 3 doors
10 -> 4 doors
11 -> 5 doors
12 -> Odometer

In [37]:
dummies = pd.get_dummies(car_sales[["Make", "Colour", "Doors"]])
dummies.head()

Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,4,False,True,False,False,False,False,False,False,True
1,5,True,False,False,False,False,True,False,False,False
2,4,False,True,False,False,False,False,False,False,True
3,4,False,False,False,True,False,False,False,False,True
4,3,False,False,True,False,False,True,False,False,False
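
Note that `Doors` stays a single numeric column in the output above, because `pd.get_dummies` only encodes object and category columns by default. One way to encode it as well, sketched below on a three-row stand-in DataFrame, is to list the columns explicitly with the `columns` parameter.

```python
import pandas as pd

# Tiny stand-in for car_sales (illustrative values)
cars = pd.DataFrame({
    "Make": ["Honda", "BMW", "Nissan"],
    "Colour": ["White", "Blue", "Blue"],
    "Doors": [4, 5, 3],
})

# Without `columns`, the numeric Doors column is left as it is
default_dummies = pd.get_dummies(cars)

# Listing the columns explicitly encodes Doors too
all_dummies = pd.get_dummies(cars, columns=["Make", "Colour", "Doors"])

print(list(default_dummies.columns))
print(list(all_dummies.columns))
```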


In [38]:
# Let's refit the model on the encoded data
np.random.seed(42)
X_train, X_test, Y_train, Y_test = train_test_split(transformed_X,
                                                    Y,
                                                    test_size = 0.20)
model.fit(X_train, Y_train)
model.score(X_test, Y_test)

0.3235867221569877

In [39]:
X.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431,4
1,BMW,Blue,192714,5
2,Honda,White,84714,4
3,Toyota,White,154365,4
4,Nissan,Blue,181577,3


### 1.2 What if there are missing values?

1. Fill them with some value (also known as imputation).
2. Remove the sample with missing data altogether.

In [40]:
# Import car_sales missing data
car_sales_missing = pd.read_csv("../data/car-sales-extended-missing-data.csv")
car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [41]:
# Create X and Y (features and labels)

X = car_sales_missing.drop("Price", axis = 1)
Y = car_sales_missing["Price"]

In [42]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [43]:
len(car_sales_missing)

1000

In [44]:
# Let's try and convert our data to numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]

one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                 one_hot,
                                 categorical_features)],
                                 remainder = "passthrough")

# Fit all transformers, transform the data and concatenate results.
transformed_X = transformer.fit_transform(X)
transformed_X

<1000x16 sparse matrix of type '<class 'numpy.float64'>'
	with 4000 stored elements in Compressed Sparse Row format>

### Option 1: Fill Missing data with pandas

In [45]:
# Fill the "Make" column
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")

# Fill the "Colour" column
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")

# Fill the "Odometer (KM)" column with the mean of the same column
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean())

# Fill the "Doors" column with the most common value
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)

In [46]:
car_sales_missing.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

### Since we are trying to predict the price of a car, we leave the "Price" column as it is. Rather than filling it in, we drop the rows that have a missing price.

In [47]:
# Remove rows with a missing Price value
car_sales_missing.dropna(inplace = True)
car_sales_missing.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [48]:
# Split the data into X and Y
X = car_sales_missing.drop("Price", axis = 1)
Y = car_sales_missing["Price"]

In [49]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]

one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                   one_hot,
                                   categorical_features)],
                                   remainder = "passthrough")
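# Note: the whole DataFrame is passed here, so "Price" is passed through as the last column;
# pass X instead to encode the features only
transformed_X = transformer.fit_transform(car_sales_missing)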
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        3.54310e+04, 1.53230e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        1.92714e+05, 1.99430e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        8.47140e+04, 2.83430e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 0.00000e+00,
        6.66040e+04, 3.15700e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.15883e+05, 4.00100e+03],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.48360e+05, 1.27320e+04]])

In [50]:
pd.DataFrame(transformed_X)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,35431.0,15323.0
1,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0,19943.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,84714.0,28343.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,154365.0,13434.0
4,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0,14043.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
945,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,35820.0,32042.0
946,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,155144.0,5716.0
947,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,66604.0,31570.0
948,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,215883.0,4001.0


In [51]:
pd.get_dummies(car_sales_missing)

Unnamed: 0,Odometer (KM),Doors,Price,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Make_missing,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White,Colour_missing
0,35431.0,4.0,15323.0,False,True,False,False,False,False,False,False,False,True,False
1,192714.0,5.0,19943.0,True,False,False,False,False,False,True,False,False,False,False
2,84714.0,4.0,28343.0,False,True,False,False,False,False,False,False,False,True,False
3,154365.0,4.0,13434.0,False,False,False,True,False,False,False,False,False,True,False
4,181577.0,3.0,14043.0,False,False,True,False,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,35820.0,4.0,32042.0,False,False,False,True,False,True,False,False,False,False,False
996,155144.0,3.0,5716.0,False,False,False,False,True,False,False,False,False,True,False
997,66604.0,4.0,31570.0,False,False,True,False,False,False,True,False,False,False,False
998,215883.0,4.0,4001.0,False,True,False,False,False,False,False,False,False,True,False


### Feature Scaling.

In other words, making sure all of your numerical data is on the same scale.

### Why do Feature Scaling

- All values in a particular column should be on the same scale
- 5 km and 5000 m are the same distance, yet the raw numbers differ by a factor of 1000
- `5km and 5m might be treated same by some ML algorithms, and we don't want that`
- So, the scale used for measuring magnitude should be the same

### Algorithms in which Feature Scaling Matters
- KNN, PCA, models trained with gradient descent, Linear Discriminant Analysis (LDA),
- Naive Bayes, SVM, etc.
- In general, feature scaling matters a lot for any algorithm that uses the distance between two data points `(euclidean distance)`.
- Tree-based models (decision trees, random forests) are a notable exception: they split on one feature at a time, so rescaling a feature does not change the splits.


## Feature Scaling Methods in Sklearn

- Two common techniques for feature scaling are normalization (min-max scaling) and standardization.
- Normalization (min-max scaling) rescales each feature to a fixed range, usually 0 to 1, using the `MinMaxScaler` class
- Standardization subtracts the mean from each feature and divides by its standard deviation, so that the resulting features have zero mean and unit variance, using the `StandardScaler` class

It is a good practice to fit the scaler on the training data and then use it to transform the testing data. Also, the scaling of target values is generally not required.
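
A minimal sketch of both scalers, assuming a single made-up odometer feature. Each scaler is fitted on the training rows only and then reused to transform the test rows.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative odometer readings in km (made-up values)
train = np.array([[10_000.0], [50_000.0], [90_000.0]])
test = np.array([[30_000.0], [130_000.0]])

# Normalization: rescale to [0, 1] using the training min and max
min_max = MinMaxScaler().fit(train)
train_mm = min_max.transform(train)
test_mm = min_max.transform(test)   # test values can fall outside [0, 1]

# Standardization: subtract the training mean, divide by the training std
standard = StandardScaler().fit(train)
train_std = standard.transform(train)
test_std = standard.transform(test)

print(train_mm.ravel(), test_mm.ravel())
print(train_std.ravel(), test_std.ravel())
```

The second test reading is larger than anything seen in training, so its min-max value exceeds 1. This is expected when the scaler is fitted on the training data alone.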

### Option 2: Fill missing values with Sklearn

In [52]:
car_sales_missing = pd.read_csv("../data/car-sales-extended-missing-data.csv")
car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [53]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [54]:
# drop the rows with no labels
car_sales_missing.dropna(subset = ["Price"], inplace = True)
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [55]:
# Split into X/Y
X = car_sales_missing.drop("Price", axis = 1)
Y = car_sales_missing["Price"]

In [56]:
X.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431.0,4.0
1,BMW,Blue,192714.0,5.0
2,Honda,White,84714.0,4.0
3,Toyota,White,154365.0,4.0
4,Nissan,Blue,181577.0,3.0


In [57]:
Y.head()

0    15323.0
1    19943.0
2    28343.0
3    13434.0
4    14043.0
Name: Price, dtype: float64

In [58]:
X.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
dtype: int64

In [59]:
# fill missing values with SKLearn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Fill categorical values with "missing" and numerical values with "mean"
cat_imputer = SimpleImputer(strategy = "constant", fill_value = "missing")
door_imputer = SimpleImputer(strategy = "constant", fill_value = 4)
num_imputer = SimpleImputer(strategy = "mean")

# Define columns
cat_features = ["Make", "Colour"]
door_features = ["Doors"]
num_features = ["Odometer (KM)"]

# Create an imputer (something that fills missing data)
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, cat_features),
    ("door_imputer", door_imputer, door_features),
    ("num_imputer", num_imputer, num_features)
])

# Transform the data
filled_X = imputer.fit_transform(X)
filled_X

array([['Honda', 'White', 4.0, 35431.0],
       ['BMW', 'Blue', 5.0, 192714.0],
       ['Honda', 'White', 4.0, 84714.0],
       ...,
       ['Nissan', 'Blue', 4.0, 66604.0],
       ['Honda', 'White', 4.0, 215883.0],
       ['Toyota', 'Blue', 4.0, 248360.0]], dtype=object)

In [60]:
car_sales_filled = pd.DataFrame(filled_X,
                                columns = ["Make", "Colour", "Doors", "Odometer (KM)"])
car_sales_filled.head()

Unnamed: 0,Make,Colour,Doors,Odometer (KM)
0,Honda,White,4.0,35431.0
1,BMW,Blue,5.0,192714.0
2,Honda,White,4.0,84714.0
3,Toyota,White,4.0,154365.0
4,Nissan,Blue,3.0,181577.0


In [61]:
car_sales_filled.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

In [62]:
# Feature Encoding
# Let's try and convert our data to numbers
# Turn the categories to numbers

from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]

one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", 
                                   one_hot,
                                   categorical_features)],
                                   remainder = "passthrough")

transformed_X = transformer.fit_transform(car_sales_filled)
transformed_X

<950x15 sparse matrix of type '<class 'numpy.float64'>'
	with 3800 stored elements in Compressed Sparse Row format>

In [63]:
car_sales_filled.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

In [64]:
# Now we've got our data as numbers and filled (no missing values)
# Let's fit a model

np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(transformed_X,
                                                    Y, 
                                                    test_size = 0.2)

model = RandomForestRegressor(n_estimators = 100)

# fit training data into model
model.fit(X_train, Y_train)
model.score(X_test, Y_test)

0.21990196728583944

## BEST WAY TO BUILD A MODEL

In [65]:
# Splitting the data into training and test dataset
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)

In [66]:
# fill missing values with SKLearn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Fill categorical values with "missing" and numerical values with "mean"
cat_imputer = SimpleImputer(strategy = "constant", fill_value = "missing")
door_imputer = SimpleImputer(strategy = "constant", fill_value = 4)
num_imputer = SimpleImputer(strategy = "mean")

# Define columns
cat_features = ["Make", "Colour"]
door_features = ["Doors"]
num_features = ["Odometer (KM)"]

# Create an imputer (something that fills missing data)
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, cat_features),
    ("door_imputer", door_imputer, door_features),
    ("num_imputer", num_imputer, num_features)
])

# Transform the data
filled_X_train = imputer.fit_transform(X_train)
filled_X_test = imputer.transform(X_test)
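
The `fit_transform` / `transform` pair matters here: the imputer learns its fill values from the training rows only and reuses them on the test rows, so nothing about the test set leaks into preprocessing. A minimal sketch with made-up numbers (`demo_imputer` is a name used only for this illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up odometer readings; np.nan marks a missing value
train = np.array([[100.0], [300.0], [np.nan]])
test = np.array([[np.nan], [1000.0]])

demo_imputer = SimpleImputer(strategy="mean")
filled_train = demo_imputer.fit_transform(train)  # learns the mean from the training rows only
filled_test = demo_imputer.transform(test)        # reuses that mean; test values are not used

print(demo_imputer.statistics_)
print(filled_test.ravel())
```

The missing test value is filled with the training mean (200), not with a mean that includes the test reading of 1000.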

In [67]:
filled_X_train

array([['Nissan', 'White', 4.0, 64362.0],
       ['Nissan', 'missing', 4.0, 206073.0],
       ['Toyota', 'Blue', 4.0, 156478.0],
       ...,
       ['Honda', 'Blue', 4.0, 230570.0],
       ['Toyota', 'Red', 4.0, 193006.0],
       ['Nissan', 'Red', 4.0, 61892.0]], dtype=object)

In [68]:
filled_X_test

array([['Honda', 'White', 4.0, 166028.0],
       ['Nissan', 'Green', 3.0, 190794.0],
       ['Nissan', 'Red', 4.0, 128235.0],
       ['Toyota', 'Blue', 4.0, 223390.0],
       ['Honda', 'White', 4.0, 130783.0],
       ['Nissan', 'White', 4.0, 82726.0],
       ['Toyota', 'White', 4.0, 189194.0],
       ['Toyota', 'White', 4.0, 188338.0],
       ['Toyota', 'White', 4.0, 26191.0],
       ['Honda', 'Blue', 4.0, 63825.0],
       ['BMW', 'White', 5.0, 143651.0],
       ['Honda', 'Blue', 4.0, 73869.0],
       ['Toyota', 'White', 4.0, 129815.26104972376],
       ['Toyota', 'White', 4.0, 72118.0],
       ['Honda', 'White', 4.0, 42844.0],
       ['Nissan', 'missing', 4.0, 219137.0],
       ['Honda', 'Blue', 4.0, 68223.0],
       ['Toyota', 'Blue', 4.0, 225286.0],
       ['Honda', 'White', 4.0, 82039.0],
       ['Honda', 'White', 4.0, 95481.0],
       ['Toyota', 'Blue', 4.0, 129815.26104972376],
       ['Nissan', 'White', 4.0, 192747.0],
       ['BMW', 'Red', 5.0, 131587.0],
       ['Honda', 'Blue

In [69]:
car_sales_train_X = pd.DataFrame(filled_X_train,
                                 columns = ["Make", "Colour", "Doors", "Odometer (KM)"])
car_sales_test_X = pd.DataFrame(filled_X_test,
                                columns  = ["Make", "Colour", "Doors", "Odometer (KM)"])


In [70]:
car_sales_train_X.head()

Unnamed: 0,Make,Colour,Doors,Odometer (KM)
0,Nissan,White,4.0,64362.0
1,Nissan,missing,4.0,206073.0
2,Toyota,Blue,4.0,156478.0
3,Nissan,White,4.0,87997.0
4,Toyota,missing,4.0,228619.0


In [71]:
car_sales_test_X.head()

Unnamed: 0,Make,Colour,Doors,Odometer (KM)
0,Honda,White,4.0,166028.0
1,Nissan,Green,3.0,190794.0
2,Nissan,Red,4.0,128235.0
3,Toyota,Blue,4.0,223390.0
4,Honda,White,4.0,130783.0


In [72]:
# Feature Encoding (convert non-numerical data to numerical data)
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]

one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                   one_hot,
                                   categorical_features)],
                                   remainder = "passthrough")

transformed_X_train = transformer.fit_transform(car_sales_train_X)
transformed_X_test = transformer.transform(car_sales_test_X)

transformed_X_train.toarray()

array([[0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.43620e+04],
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.06073e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.56478e+05],
       ...,
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.30570e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.93006e+05],
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.18920e+04]])

In [73]:
# Model Fitting and Evaluation
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor

# Model Setup
model = RandomForestRegressor()

# Fitting the training data
model.fit(transformed_X_train, Y_train)

# Model Evaluation
model.score(transformed_X_test, Y_test)

0.3260131242495673

## 2. Choosing the right estimator/algorithm for the problem

Some things to note:

* Sklearn refers to machine learning models and algorithms as estimators.
* Classification problem = predicting a category (heart disease or not)
    * `clf` is often used as a short form of classifier
* Regression problem = predicting a number (selling price of a car)
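
Both kinds of estimator share the same `fit` / `predict` / `score` interface, but `score` means different things: mean accuracy for a classifier and the coefficient of determination (R^2) for a regressor. The sketch below checks this on synthetic data (generated with `make_classification` and `make_regression`, an assumption for illustration only).

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge
from sklearn.metrics import accuracy_score, r2_score

# Synthetic data, only to illustrate the shared API
X_c, y_c = make_classification(n_samples=200, n_features=5, random_state=0)
X_r, y_r = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

clf_demo = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_c, y_c)
reg_demo = Ridge().fit(X_r, y_r)

# Classifier: score() is mean accuracy
clf_score = clf_demo.score(X_c, y_c)
clf_accuracy = accuracy_score(y_c, clf_demo.predict(X_c))

# Regressor: score() is the coefficient of determination (R^2)
reg_score = reg_demo.score(X_r, y_r)
reg_r2 = r2_score(y_r, reg_demo.predict(X_r))

print(clf_score == clf_accuracy, abs(reg_score - reg_r2) < 1e-12)
```

This is why the regression scores in this notebook are not percentages of correct predictions and can be much lower than the classification accuracies.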

<img src = "https://scikit-learn.org/stable/_static/ml_map.png">

### 2.1 Picking a machine learning model for a regression problem
Let's use the California Housing Dataset. 
Read about dataset: https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset

In [74]:
# Get the California housing dataset
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
housing

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n

In [75]:
housing_df = pd.DataFrame(housing["data"], columns = housing["feature_names"])
housing_df

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25
...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32


In [76]:
housing_df["MedHouseVal"] = housing["target"]
housing_df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [77]:
housing_df["target"] = housing["target"]
housing_df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal,target
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422,3.422


In [78]:
housing_df = housing_df.drop("MedHouseVal", axis = 1)
housing_df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [79]:
# import algorithm/estimator
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# setup random seed
np.random.seed(42)

# create the data
X = housing_df.drop("target", axis = 1)
Y = housing_df["target"] # median house price in $100,000s

# Split the data into train and test set
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)

# Instantiate and fit the model (on the training set)
model = Ridge()
model.fit(X_train, Y_train)

# Check the score of model on test set
model.score(X_test, Y_test)

0.5758549611440126
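For regressors such as `Ridge`, `score()` returns the coefficient of determination (R²), not an accuracy. The sketch below computes R² by hand in plain Python; the helper name `r2_score_manual` and the numbers are made up for illustration and are not part of the housing data.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1 - ss_res / ss_tot

# Made-up values, for illustration only
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.5, 5.0, 3.0, 6.5]

r2 = r2_score_manual(y_true, y_pred)

# A model that always predicts the mean of y_true scores exactly 0
mean_value = sum(y_true) / len(y_true)
r2_mean_baseline = r2_score_manual(y_true, [mean_value] * len(y_true))
```

A perfect model scores 1.0, and a model that is worse than always predicting the mean gets a negative score.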

In [80]:
## let's try this with LinearRegression Model
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# setup random seed
np.random.seed(42)

# create the data
X = housing_df.drop("target", axis = 1)
Y = housing_df["target"]

# split the data into train and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)

# Instantiate and fit the model (on training set)
model = LinearRegression()
model.fit(X_train, Y_train)

# Model Evaluation
model.score(X_test, Y_test)

0.5757877060324508

In [81]:
## let's try out another model (lasso regression)
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# setup random seed
np.random.seed(42)

# create the data
X = housing_df.drop("target", axis = 1)
Y = housing_df["target"]

# split the data into train and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)

# Instantiate and fit the model (on training set)
model = Lasso(alpha = 0.1)
model.fit(X_train, Y_train)

# Model Evaluation
model.score(X_test, Y_test)

0.5318167610318159

In [82]:
## let's try out Ridge-Cross-Validation model
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

# setup random seed
np.random.seed(42)

# create the data
X = housing_df.drop("target", axis = 1)
Y = housing_df["target"]

# split the data into train and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)

# Instantiate and fit the model (on training set)
model = RidgeCV()
model.fit(X_train, Y_train)

# Model Evaluation
model.score(X_test, Y_test)

0.5764371560147499

In [83]:
## let's try out BayesianRidge model
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import train_test_split

# setup random seed
np.random.seed(42)

# create the data
X = housing_df.drop("target", axis = 1)
Y = housing_df["target"]

# split the data into train and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)

# Instantiate and fit the model (on training set)
model = BayesianRidge()
model.fit(X_train, Y_train)

# Model Evaluation
model.score(X_test, Y_test)

0.576020635350551

In [84]:
# Using SGDRegressor as the model and StandardScaler for feature scaling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import SGDRegressor

# setup random seed
np.random.seed(42)

# split the data into train and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)

# The pipeline fits the scaler on the training data and reuses it on the test data,
# so a separate manual scaling step is not needed
est = make_pipeline(StandardScaler(), SGDRegressor(loss = 'squared_error'))
est.fit(X_train, Y_train)
est.score(X_test, Y_test)

0.5846517276218621

What if `Ridge` and the other linear models didn't work, or the score didn't fit our needs?

Well, we could always try a different model...

How about we try an ensemble model (an ensemble is a combination of smaller models that make predictions together, rather than relying on just a single model)?

Sklearn's ensemble models can be found here: https://scikit-learn.org/stable/modules/ensemble.html
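To make the idea of an ensemble concrete, here is a minimal sketch in plain Python. The three "models" are hypothetical hand-written functions, each slightly off from an assumed true rule `y = 2x`; this is not how a random forest is implemented, only an illustration of why averaging can help.

```python
from statistics import mean

def ensemble_predict(models, x):
    """Average the predictions of several models for a single input."""
    return mean(model(x) for model in models)

# Three hypothetical weak models, each biased in a different direction
models = [
    lambda x: 2 * x + 1.0,   # over-predicts by 1.0
    lambda x: 2 * x - 1.5,   # under-predicts by 1.5
    lambda x: 2 * x + 0.5,   # over-predicts by 0.5
]

individual = [model(3.0) for model in models]   # each model alone is off
prediction = ensemble_predict(models, 3.0)      # the errors cancel in the average
```

In this toy case the individual errors happen to sum to zero, so the average is exact; with real models the errors only partly cancel.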

In [85]:
# Import the RandomForestRegressor model class from the ensemble module
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# setup random seed
np.random.seed(42)

# create the data
X = housing_df.drop("target", axis = 1)
Y = housing_df["target"]

# split the data into train and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)

# instantiate and fit the model (on training set)
model = RandomForestRegressor()
model.fit(X_train, Y_train)

# Check the score of the model (on the test set)
model.score(X_test, Y_test)

0.8059809073051385

In [86]:
model.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 1.0,
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [87]:
# Let's try tuning hyperparameters

# Import the RandomForestRegressor model class from the ensemble module
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# setup random seed
np.random.seed(42)

# create the data
X = housing_df.drop("target", axis = 1)
Y = housing_df["target"]

# split the data into train and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)

for i in range(10, 101, 10):
    # instantiate and fit the model (on training set)
    model = RandomForestRegressor(n_estimators = i)
    model.fit(X_train, Y_train)
    
    # Check the score of the model (on the test set)
    print(model.score(X_test, Y_test))

0.7851752292818259
0.7957333006393108
0.8013955778016778
0.8022259225750528
0.8023903663982276
0.8041790304668666
0.8065599186561039
0.806860270657419
0.8062806967223503
0.8075750850731847
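One way to read the loop output is to store each score next to its `n_estimators` value and pick the best one. The sketch below does this in plain Python with the scores copied from the output above; the variable names are illustrative. Note that choosing a hyperparameter by its test-set score can overfit to the test set, so a validation set or cross-validation is usually the safer choice.

```python
# Test-set scores copied from the loop output above, keyed by n_estimators
scores = {
    10: 0.7851752292818259,
    20: 0.7957333006393108,
    30: 0.8013955778016778,
    40: 0.8022259225750528,
    50: 0.8023903663982276,
    60: 0.8041790304668666,
    70: 0.8065599186561039,
    80: 0.806860270657419,
    90: 0.8062806967223503,
    100: 0.8075750850731847,
}

# Pick the setting with the highest score
best_n = max(scores, key=scores.get)
best_score = scores[best_n]

# Gain from going from 10 to 100 trees
improvement = scores[100] - scores[10]
```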


In [88]:
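from sklearn.metrics import mean_squared_error
Y_predict = model.predict(X_test)

# RMSE is the square root of the MSE; mean_squared_error already averages over
# the samples, so no extra division by len(Y_predict) is needed
mean_squared_error(Y_test, Y_predict) ** 0.5

0.50215071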
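As a reference for what these metrics compute, here is a small plain-Python sketch of mean squared error (MSE) and root mean squared error (RMSE). The helper `mse` and the numbers are made up for illustration.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up values, for illustration only
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]

mse_value = mse(y_true, y_pred)    # squared errors are 1, 0, 4, 0
rmse_value = math.sqrt(mse_value)  # back in the same units as the target
```

Because the mean is already taken inside the MSE, the RMSE is simply its square root.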

In [89]:
Y_predict

array([0.48981  , 0.72094  , 4.8560661, ..., 4.856659 , 0.71992  ,
       1.62578  ])

In [90]:
Y_test

20046    0.47700
3024     0.45800
15663    5.00001
20484    2.18600
9814     2.78000
          ...   
15362    2.63300
16623    2.66800
18086    5.00001
2144     0.72300
3665     1.51500
Name: target, Length: 4128, dtype: float64

# Classification Model Building Challenges

In [91]:
from sklearn.datasets import load_breast_cancer
breast_cancer = load_breast_cancer()
breast_cancer

{'data': array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
         1.189e-01],
        [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
         8.902e-02],
        [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
         8.758e-02],
        ...,
        [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
         7.820e-02],
        [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
         1.240e-01],
        [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
         7.039e-02]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
        1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
        1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
        1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0

In [92]:
# adding features to data frame
cancer_df = pd.DataFrame(breast_cancer["data"], columns = breast_cancer["feature_names"])
cancer_df.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


In [93]:
# adding target column to dataframe
# target: 0 => malignant (life-threatening), 1 => benign (harmless)
cancer_df["target"] = breast_cancer["target"]
cancer_df.head(25)

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,target
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189,0
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902,0
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173,0
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678,0
5,12.45,15.7,82.57,477.1,0.1278,0.17,0.1578,0.08089,0.2087,0.07613,...,23.75,103.4,741.6,0.1791,0.5249,0.5355,0.1741,0.3985,0.1244,0
6,18.25,19.98,119.6,1040.0,0.09463,0.109,0.1127,0.074,0.1794,0.05742,...,27.66,153.2,1606.0,0.1442,0.2576,0.3784,0.1932,0.3063,0.08368,0
7,13.71,20.83,90.2,577.9,0.1189,0.1645,0.09366,0.05985,0.2196,0.07451,...,28.14,110.6,897.0,0.1654,0.3682,0.2678,0.1556,0.3196,0.1151,0
8,13.0,21.82,87.5,519.8,0.1273,0.1932,0.1859,0.09353,0.235,0.07389,...,30.73,106.2,739.3,0.1703,0.5401,0.539,0.206,0.4378,0.1072,0
9,12.46,24.04,83.97,475.9,0.1186,0.2396,0.2273,0.08543,0.203,0.08243,...,40.68,97.65,711.4,0.1853,1.058,1.105,0.221,0.4366,0.2075,0


In [94]:
# Split the data into X and y
X = cancer_df.drop("target", axis = 1)
y = cancer_df["target"]

In [95]:
X.head()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension
0,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,...,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,...,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,...,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,...,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,...,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


### SVC Model

In [96]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# setup random seed
np.random.seed(42)

# split the data into train and test set
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.2)

# feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# setup model
model = SVC()
model.fit(X_train, Y_train)
model.score(X_test, Y_test)

0.9824561403508771

### NaiveBayes Model

In [97]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# setup random seed
np.random.seed(42)

# split the data into train and test set
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.2)

# feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# setup model
model = GaussianNB()
model.fit(X_train, Y_train)
model.score(X_test, Y_test)

0.9649122807017544

### KNN Classifier Model

In [98]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# setup random seed
np.random.seed(42)

# split the data into train and test set
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.2)

# feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# setup model
model = KNeighborsClassifier(n_neighbors = 2, metric = "minkowski", p = 2)
model.fit(X_train, Y_train)
model.score(X_test, Y_test)

0.9385964912280702
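The `metric = "minkowski"` and `p = 2` arguments above select the Euclidean distance between points. A minimal plain-Python sketch of the Minkowski distance follows; the function name and the sample points are illustrative only.

```python
def minkowski_distance(a, b, p=2):
    """Minkowski distance between two points; p=2 is Euclidean, p=1 is Manhattan."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

point_a = (0.0, 0.0)
point_b = (3.0, 4.0)

euclidean = minkowski_distance(point_a, point_b, p=2)  # straight-line distance
manhattan = minkowski_distance(point_a, point_b, p=1)  # sum of absolute differences
```

Because the distance sums over all features, features with large numeric ranges dominate unless the data is scaled first, which is one reason the cell above standardizes before fitting.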

In [99]:
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [100]:
len(heart_disease)

303

Consulting the model map suggests trying `LinearSVC`.

In [107]:
# import the LinearSVC
from sklearn.svm import LinearSVC

# setup random seed
np.random.seed(42)

# Make the data into X/y
X = heart_disease.drop("target", axis = 1)
y = heart_disease["target"]

# Split the data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.2)

# Model setup and training
clf = LinearSVC(dual = "auto")
clf.fit(X_train, Y_train)

# Model evaluation
clf.score(X_test, Y_test)

0.8688524590163934
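With 303 rows and `test_size = 0.2`, the split does not divide evenly. The sketch below assumes the test count is rounded up and the remainder goes to training, which matches the 61 test predictions printed later in this notebook; `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test count up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

# 303 rows in the heart disease data, 20% held out for testing
n_train, n_test = split_sizes(303, 0.2)
```

With so few test rows, a single extra correct prediction moves the accuracy by about 1.6 percentage points (1/61).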

In [113]:
# import LogisticRegression for classification
from sklearn.linear_model import LogisticRegression

# setup random seed
np.random.seed(42)


# Make the data into X/y
X = heart_disease.drop("target", axis = 1)
y = heart_disease["target"]


# Split the data into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.2)

# Scale the features (recommended for logistic regression)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

# Fit the scaler on the training set only, then apply the same transformation
# to both sets so that no test-set statistics leak into training
X_train_scaled = sc.fit_transform(X_train)
X_test_scaled = sc.transform(X_test)

# Model setup and training
clf = LogisticRegression()
clf.fit(X_train_scaled, Y_train)

# Model evaluation
clf.score(X_test_scaled, Y_test)

0.8524590163934426
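Standardization learns a mean and standard deviation per feature and then rescales with them. The plain-Python sketch below shows the usual pattern for one feature: learn the statistics from the training values only and reuse them for the test values. The helper names and the ages are made up; the population standard deviation is used, as `StandardScaler` does.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    return mean(column), pstdev(column)

def transform(column, mu, sigma):
    """Standardize values with previously learned statistics."""
    return [(x - mu) / sigma for x in column]

# Made-up values, for illustration only
train_ages = [40.0, 50.0, 60.0]
test_ages = [45.0, 70.0]

mu, sigma = fit_scaler(train_ages)                # statistics from training data only
train_scaled = transform(train_ages, mu, sigma)   # mean 0, standard deviation 1
test_scaled = transform(test_ages, mu, sigma)     # same mu and sigma reused
```

Fitting a second time on the test values would put the two sets on different scales and leak test-set information into the evaluation.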

In [114]:
heart_disease["target"].value_counts()

target
1    165
0    138
Name: count, dtype: int64
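The class counts give a quick baseline for judging the accuracies in this section: a model that always predicts the majority class. The sketch below computes that baseline in plain Python from the counts above. It uses the full dataset, so the exact figure on a particular test split will differ somewhat.

```python
# Class counts copied from value_counts() above
counts = {1: 165, 0: 138}

# A trivial model that always predicts the most common class
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / sum(counts.values())
```

The classes are fairly balanced here, so the baseline is only a little above 0.5 and accuracy is a reasonable first metric.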

In [118]:
# import the RandomForestClassifier estimator class
from sklearn.ensemble import RandomForestClassifier

# setup random seed
np.random.seed(42)

# Make the data into X/y
X = heart_disease.drop("target", axis = 1)
y = heart_disease["target"]

# Split the data into train and test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.2)

# Model setup and training
clf = RandomForestClassifier()
clf.fit(X_train, Y_train)

# Model evaluation
clf.score(X_test, Y_test)

0.8524590163934426

## Trying LogisticRegression with StandardScaler using pipeline

In [116]:
# import what we need
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# setup random seed
np.random.seed(42)

# make the data into X/y
X = heart_disease.drop("target", axis = 1)
y = heart_disease["target"]

# split the data into train and test sets (default test_size = 0.25)
X_train, X_test, Y_train, Y_test = train_test_split(X, y)

# Make pipeline to serve scaled data to model
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, Y_train) # fits the scaler on the training data, then fits the model on the scaled features

# Model Evaluation
pipe.score(X_test, Y_test)

0.881578947368421
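Conceptually, a pipeline chains a transformer and an estimator so that the scaling statistics learned during `fit` are reused during `predict` and `score`. The toy classes below are hypothetical stand-ins written in plain Python to show that flow; they are not scikit-learn's implementation.

```python
from statistics import mean, pstdev

class TinyScaler:
    """Hypothetical stand-in for a scaler: standardizes a single feature."""
    def fit(self, xs):
        self.mu, self.sigma = mean(xs), pstdev(xs)
        return self

    def transform(self, xs):
        return [(x - self.mu) / self.sigma for x in xs]

class TinyThresholdClassifier:
    """Hypothetical stand-in for an estimator: predicts 1 when the scaled value is above 0."""
    def fit(self, xs, ys):
        return self  # nothing to learn in this toy model

    def predict(self, xs):
        return [1 if x > 0 else 0 for x in xs]

class TinyPipeline:
    """Chains one scaler and one model, mirroring the fit/predict flow."""
    def __init__(self, scaler, model):
        self.scaler, self.model = scaler, model

    def fit(self, xs, ys):
        scaled = self.scaler.fit(xs).transform(xs)  # statistics come from training data
        self.model.fit(scaled, ys)
        return self

    def predict(self, xs):
        # New data is scaled with the statistics learned during fit
        return self.model.predict(self.scaler.transform(xs))

tiny_pipe = TinyPipeline(TinyScaler(), TinyThresholdClassifier())
tiny_pipe.fit([40.0, 50.0, 60.0], [0, 0, 1])
predictions = tiny_pipe.predict([45.0, 70.0])
```

Keeping both steps in one object makes it hard to scale the test data with the wrong statistics by accident.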

## Tidbit:
1. If you have structured data, use ensemble methods
2. If you have unstructured data, use deep learning or transfer learning

## 3. Fit the model/algorithm on our data and use it to make predictions

### 3.1 Fitting the model to the data

* X = features, feature variables, data
* y = labels, target, target variables

In [120]:
# import the RandomForestClassifier estimator class
from sklearn.ensemble import RandomForestClassifier

# setup random seed
np.random.seed(42)

# Make the data into X/y
X = heart_disease.drop("target", axis = 1)
y = heart_disease["target"]

# Split the data into train and test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size = 0.2)

# Model setup and training
clf = RandomForestClassifier()

# Model Fitting
clf.fit(X_train, Y_train)

# Model evaluation
clf.score(X_test, Y_test)

0.8524590163934426

In [121]:
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [122]:
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

### 3.2 Make predictions using a machine learning model

Two ways to make predictions:

* `predict()`
* `predict_proba()`

In [124]:
X_test.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
179,57,1,0,150,276,0,0,112,1,0.6,1,1,1
228,59,1,3,170,288,0,0,159,0,0.2,1,0,3
111,57,1,2,150,126,1,1,173,0,0.2,2,1,3
246,56,0,0,134,409,0,0,150,1,1.9,1,2,3
60,71,0,2,110,265,1,0,130,0,0.0,2,1,2


In [125]:
# Use a trained model to make predictions
clf.predict(X_test)

array([0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0])

In [127]:
np.array(Y_test)

array([0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0,
       0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0])

In [134]:
# Compare predictions to truth labels to evaluate the model
Y_preds = clf.predict(X_test)
np.mean(Y_preds == Y_test)

0.8524590163934426

In [135]:
clf.score(X_test, Y_test)

0.8524590163934426

In [136]:
from sklearn.metrics import accuracy_score
accuracy_score(Y_test, Y_preds)

0.8524590163934426
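All three approaches above compute the same quantity: the fraction of predictions that match the true labels. The plain-Python sketch below reproduces it from the two arrays printed earlier; only the variable names are new.

```python
# Predictions and true labels copied from the two arrays printed above
predicted = [
    0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0,
    1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,
]
actual = [
    0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0,
    0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,
]

# Count the positions where prediction and truth agree
n_correct = sum(p == t for p, t in zip(predicted, actual))
accuracy = n_correct / len(actual)
```

Here 52 of the 61 test samples are classified correctly, which gives the 0.852 reported above.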

Make predictions with `predict_proba()`

In [138]:
# predict_proba() returns the estimated probability of each class label for every sample
clf.predict_proba(X_test[:5])

array([[0.89, 0.11],
       [0.49, 0.51],
       [0.43, 0.57],
       [0.84, 0.16],
       [0.18, 0.82]])

In [139]:
# Let's predict() on the same data...
clf.predict(X_test[:5])

array([0, 1, 1, 0, 1])
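The two outputs are linked: for this classifier, the label from `predict()` is the class whose probability from `predict_proba()` is highest. The sketch below takes the argmax of each row in plain Python, using the probabilities printed above and assuming the columns are ordered as class 0, then class 1 (the order given by `clf.classes_`).

```python
# Probabilities copied from the predict_proba() output above: [P(class 0), P(class 1)]
probabilities = [
    [0.89, 0.11],
    [0.49, 0.51],
    [0.43, 0.57],
    [0.84, 0.16],
    [0.18, 0.82],
]

# For each row, pick the index (class) with the highest probability
labels = [max(range(len(row)), key=row.__getitem__) for row in probabilities]

# The confidence behind each label is the winning probability
confidences = [max(row) for row in probabilities]
```

The second sample is labelled 1 with a probability of only 0.51, which `predict()` alone would hide.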

In [140]:
X_test[:5]

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
179,57,1,0,150,276,0,0,112,1,0.6,1,1,1
228,59,1,3,170,288,0,0,159,0,0.2,1,0,3
111,57,1,2,150,126,1,1,173,0,0.2,2,1,3
246,56,0,0,134,409,0,0,150,1,1.9,1,2,3
60,71,0,2,110,265,1,0,130,0,0.0,2,1,2


In [141]:
heart_disease["target"].value_counts()

target
1    165
0    138
Name: count, dtype: int64

In [142]:
housing_df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [144]:
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

# create the data
X = housing_df.drop("target", axis = 1)
y = housing_df["target"]

# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# create model instance
model = RandomForestRegressor()

# Model fitting
model.fit(X_train, y_train)

# Make predictions
y_preds = model.predict(X_test)

In [146]:
y_preds[:10]

array([0.49058  , 0.75989  , 4.9350165, 2.55864  , 2.33461  , 1.6580801,
       2.34237  , 1.66708  , 2.5609601, 4.8519781])

In [147]:
np.array(y_test[:10])

array([0.477  , 0.458  , 5.00001, 2.186  , 2.78   , 1.587  , 1.982  ,
       1.575  , 3.4    , 4.466  ])

In [148]:
len(y_preds), len(y_test)

(4128, 4128)

In [150]:
# Compare the prediction to the truth
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_preds)

0.3270458119670544
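Mean absolute error (MAE) is the average of the absolute differences between predictions and true values. The plain-Python sketch below computes it for the first ten predictions and targets printed above. Since it covers only ten of the 4128 test samples, the result differs from the full-test-set value.

```python
# First ten predictions and true values, copied from the outputs above
first_ten_preds = [0.49058, 0.75989, 4.9350165, 2.55864, 2.33461,
                   1.6580801, 2.34237, 1.66708, 2.5609601, 4.8519781]
first_ten_true = [0.477, 0.458, 5.00001, 2.186, 2.78,
                  1.587, 1.982, 1.575, 3.4, 4.466]

# Absolute error per sample, then the average
abs_errors = [abs(t - p) for t, p in zip(first_ten_true, first_ten_preds)]
mae_first_ten = sum(abs_errors) / len(abs_errors)
```

Because the target is in units of $100,000, an MAE of about 0.33 on the full test set means the predictions are off by roughly $33,000 on average.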