# Introduction to Scikit-Learn

This notebook demonstrates some of the most useful functions of the Scikit-Learn library.

What we're going to cover:
    0. An end-to-end SK-Learn workflow
    1. Getting the data ready
    2. Choose the right estimator/algorithm for our problems
    3. Fit the model/algorithm and use it to make predictions on our data 
    4. Evaluating a model
    5. Improve a model
    6. Save and load a trained model
    7. Putting it all together!

## 0. An end-to-end Scikit-Learn workflow

In [2]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt

In [3]:
heart_disease = pd.read_csv("heart-disease.csv")
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [5]:
# Create x (feature matrix)
x = heart_disease.drop("target",axis=1)

# Create y (labels)
y = heart_disease["target"]

In [6]:
x.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [7]:
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [9]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [10]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [11]:
# Fit the model to the training data
clf.fit(x_train, y_train)

In [12]:
# Make a prediction
# (this call fails: predict expects a 2D array shaped (n_samples, n_features))
y_label = clf.predict(np.array([0,2,3,4]))



ValueError: Expected 2D array, got 1D array instead:
array=[0. 2. 3. 4.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
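
The error above arises because `predict` expects a 2D array of shape `(n_samples, n_features)`. The following is a minimal sketch of the fix the message suggests. It uses synthetic data from `make_classification` with 4 features as a stand-in (an assumption, not the heart-disease data), and the names `X_demo`, `demo_clf` and `single_sample` are hypothetical. The model in this notebook was fit on 13 feature columns, so a real input would need 13 values per sample.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 100 samples, 4 features (not the heart-disease set)
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=42)

demo_clf = RandomForestClassifier(n_estimators=10, random_state=42)
demo_clf.fit(X_demo, y_demo)

# A single sample is 1D; reshape(1, -1) turns it into one row with 4 columns
single_sample = np.array([0, 2, 3, 4])
single_pred = demo_clf.predict(single_sample.reshape(1, -1))

print(single_pred.shape)  # one prediction for the one sample
```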

In [13]:
y_preds = clf.predict(x_test)
y_preds

array([1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0,
       1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1,
       1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0])

In [14]:
# Evaluate the model
clf.score(x_train,y_train)

1.0

In [15]:
clf.score(x_test,y_test)

0.8524590163934426

In [16]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score  

In [17]:
print(classification_report(y_test,y_preds))

              precision    recall  f1-score   support

           0       0.81      0.77      0.79        22
           1       0.88      0.90      0.89        39

    accuracy                           0.85        61
   macro avg       0.84      0.84      0.84        61
weighted avg       0.85      0.85      0.85        61



In [19]:
confusion_matrix(y_test,y_preds)

array([[17,  5],
       [ 4, 35]])
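
As a worked check, the counts in this confusion matrix can be turned back into the scores reported above. This sketch assumes the usual Scikit-Learn layout for binary labels (rows are true labels, columns are predicted labels); the variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows are true labels, columns are predictions
cm = np.array([[17, 5],
               [4, 35]])

# For binary labels 0/1, ravel() yields the counts in this order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_pos = tp / (tp + fp)  # precision for class 1
recall_pos = tp / (tp + fn)     # recall for class 1

print(f"accuracy={accuracy:.4f} precision={precision_pos:.4f} recall={recall_pos:.4f}")
```

The accuracy agrees with `accuracy_score`, and the class-1 precision and recall agree with the rounded values in the classification report.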

In [20]:
accuracy_score(y_test, y_preds)

0.8524590163934426

In [23]:
# Improve a model
# Try different numbers of n_estimators
np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    clf = RandomForestClassifier(n_estimators=i).fit(x_train,y_train)
    print(f"Model accuracy on test set: {clf.score(x_test, y_test)*100:.2f}%")

Trying model with 10 estimators...
Model accuracy on test set: 81.97%
Trying model with 20 estimators...
Model accuracy on test set: 78.69%
Trying model with 30 estimators...
Model accuracy on test set: 78.69%
Trying model with 40 estimators...
Model accuracy on test set: 83.61%
Trying model with 50 estimators...
Model accuracy on test set: 85.25%
Trying model with 60 estimators...
Model accuracy on test set: 85.25%
Trying model with 70 estimators...
Model accuracy on test set: 83.61%
Trying model with 80 estimators...
Model accuracy on test set: 83.61%
Trying model with 90 estimators...
Model accuracy on test set: 85.25%
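
With only 61 test samples, one sample is worth about 1.6 percentage points, so the differences above are within a few predictions of each other. One hedged way to get a steadier comparison is cross-validation. The sketch below uses `cross_val_score` on synthetic stand-in data (an assumption, not the heart-disease data) and fixes `random_state` so each forest is reproducible; `cv_means` is a hypothetical name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with the same shape as the heart-disease features
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)

cv_means = {}
for n in range(10, 100, 30):
    # random_state makes each forest reproducible regardless of cell order
    forest = RandomForestClassifier(n_estimators=n, random_state=42)
    scores = cross_val_score(forest, X_demo, y_demo, cv=5)
    cv_means[n] = scores.mean()
    print(f"{n} estimators: mean CV accuracy {cv_means[n] * 100:.2f}%")
```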


In [24]:
# Save a model and load it
import pickle

# The with block closes the file handle once the model is written
with open("random_forst_model_1.pkl", "wb") as f:
    pickle.dump(clf, f)

In [25]:
with open("random_forst_model_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)

In [26]:
loaded_model.score(x_test,y_test)

0.8524590163934426
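
An alternative to `pickle` is `joblib`, which is installed as a Scikit-Learn dependency. This is a minimal sketch on synthetic data, not the notebook's model; the file name `demo_model.joblib` and the variable names are hypothetical, and a temporary directory is used so nothing is left on disk.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and a small model to persist
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=42)
demo_clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_model.joblib"  # hypothetical file name
    joblib.dump(demo_clf, model_path)
    restored_clf = joblib.load(model_path)

# The restored model should reproduce the original predictions
same_predictions = np.array_equal(demo_clf.predict(X_demo), restored_clf.predict(X_demo))
print(same_predictions)
```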

## 1. Getting our data ready to be used with machine learning
The three main things we have to do:
   1. Split the data into features and labels (usually 'x' and 'y')
   2. Fill (also called imputing) or disregard missing values
   3. Convert non-numerical values to numerical values (also called feature encoding)


In [27]:
heart_disease.head()


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [28]:
x = heart_disease.drop("target",axis=1)

In [29]:
x.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [30]:
y = heart_disease["target"]

In [31]:
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [32]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

In [33]:
x_train,x_test, y_train, y_test = train_test_split(x,y,test_size=0.2)


In [35]:
x_train.shape, y_train.shape

((242, 13), (242,))

### 1.1 Make sure it's all numerical

In [64]:
car_sales = pd.read_csv("car-sales-extended.csv")
len(car_sales)

1000

In [65]:
# Split into X/y
X = car_sales.drop("Price",axis=1)
y = car_sales["Price"]

In [66]:
# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make","Colour","Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                 categorical_features)],
                               remainder="passthrough")
transform_X = transformer.fit_transform(X)
# transform_X

In [67]:

X_train,X_test,y_train,y_test = train_test_split(transform_X,y,test_size=0.2)

In [70]:
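# Build machine learning model
# NOTE: Price is a continuous target, so this is a regression problem.
# RandomForestRegressor is imported here, but RandomForestClassifier is used below,
# so score() reports classification accuracy, which likely explains the 0.0 result.
from sklearn.ensemble import RandomForestRegressor
model  = RandomForestClassifier()
model.fit(X_train,y_train)
model.score(X_test,y_test)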

0.0
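
Since `Price` is continuous, a regressor is the more natural fit for this data. The sketch below shows the same encode, split, fit and score steps with `RandomForestRegressor`. It runs on a made-up stand-in frame (the column names are copied from the notebook, but the values and the price rule are invented), so its score says nothing about the real `car-sales-extended.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(42)
n_rows = 200

# Made-up stand-in for car-sales-extended.csv (values are invented)
demo_cars = pd.DataFrame({
    "Make": rng.choice(["Honda", "BMW", "Toyota"], size=n_rows),
    "Colour": rng.choice(["White", "Blue", "Red"], size=n_rows),
    "Odometer (KM)": rng.integers(10_000, 250_000, size=n_rows),
    "Doors": rng.choice([3, 4, 5], size=n_rows),
})
# Invented price rule so the regressor has a pattern to learn
demo_cars["Price"] = 30_000 - 0.08 * demo_cars["Odometer (KM)"] + rng.normal(0, 500, n_rows)

X_cars = demo_cars.drop("Price", axis=1)
y_cars = demo_cars["Price"]

# One-hot encode the categorical columns, pass the odometer through unchanged
encoder = ColumnTransformer(
    [("one_hot", OneHotEncoder(), ["Make", "Colour", "Doors"])],
    remainder="passthrough",
)
X_encoded = encoder.fit_transform(X_cars)

X_tr, X_te, y_tr, y_te = train_test_split(X_encoded, y_cars, test_size=0.2, random_state=42)

reg = RandomForestRegressor(n_estimators=50, random_state=42)
reg.fit(X_tr, y_tr)
r2 = reg.score(X_te, y_te)  # R^2 for regressors, not accuracy
print(f"R^2 on the test split: {r2:.2f}")
```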

### 1.2 Missing data
There are two main ways to deal with missing data:
1. Fill the missing values with some value (also known as imputation)
2. Remove the samples with missing data altogether
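
The two options can be sketched on a tiny made-up frame. The column names follow the notebook's car data, but the values are invented. Filling the numeric column with `SimpleImputer` is one possible approach and an assumption here; the notebook itself uses pandas `fillna`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Tiny made-up frame with gaps
demo = pd.DataFrame({
    "Make": ["Honda", np.nan, "BMW", "Toyota"],
    "Odometer (KM)": [35_000.0, 120_000.0, np.nan, 85_000.0],
})

# Option 1: imputation (a constant for the category, the mean for the number)
filled = demo.copy()
filled["Make"] = filled["Make"].fillna("missing")
num_imputer = SimpleImputer(strategy="mean")
filled[["Odometer (KM)"]] = num_imputer.fit_transform(filled[["Odometer (KM)"]])

# Option 2: drop every row that has at least one missing value
dropped = demo.dropna()

print(filled)
print(len(dropped))
```

Imputation keeps all four rows, while dropping keeps only the rows that were complete to begin with.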

In [76]:
car_sales_missing = pd.read_csv("car-sales-extended-missing-data.csv")

In [77]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [78]:
len(car_sales_missing)

1000

In [92]:
X = car_sales_missing.drop("Price",axis=1)
y = car_sales_missing["Price"]

In [93]:
categorical_features = ["Make","Colour","Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                 one_hot,
                                 categorical_features)],
                               remainder="passthrough")
transformed_X = transformer.fit_transform(X)
transformed_X.getnnz()

3800

In [87]:
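# Fill the "Make" column
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")

# Fill the "Colour" column
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")

# Fill the "Odometer (KM)" column with the column mean
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean())

# Fill the "Doors" column with 4
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)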

In [90]:
car_sales_missing.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [89]:
# Remove rows with missing Price value
car_sales_missing.dropna(inplace=True)

In [91]:
len(car_sales_missing)

950

## 3. Fit the model/algorithm on our data and use it to make predictions

In [97]:
from sklearn.ensemble import RandomForestClassifier

heart_disease = pd.read_csv("heart-disease.csv")

np.random.seed(42)

X = heart_disease.drop("target",axis=1)
y = heart_disease["target"]

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)

clf = RandomForestClassifier()

clf.fit(X_train,y_train)

clf.score(X_test,y_test)



0.8524590163934426

### 3.2 Make predictions using a machine learning model

In [103]:
# Use a trained model to make predictions
y_preds= clf.predict(X_test)

In [104]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_preds)

0.8524590163934426

In [105]:
# predict_proba() returns probabilities of a classification label 
clf.predict_proba(X_test[:5])


array([[0.89, 0.11],
       [0.49, 0.51],
       [0.43, 0.57],
       [0.84, 0.16],
       [0.18, 0.82]])

In [106]:
clf.predict_proba(X_train[:5])

array([[0.01, 0.99],
       [0.88, 0.12],
       [0.81, 0.19],
       [0.03, 0.97],
       [0.93, 0.07]])
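
Each row of `predict_proba` holds one probability per class, in the order given by the fitted model's `classes_` attribute. The sketch below, on synthetic stand-in data (an assumption, not the heart-disease data), checks that picking the most probable class per row reproduces `predict` for a random forest.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 13 features
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=42)
demo_clf = RandomForestClassifier(n_estimators=25, random_state=42).fit(X_demo, y_demo)

proba = demo_clf.predict_proba(X_demo[:5])

# Column i of proba corresponds to demo_clf.classes_[i]
labels_from_proba = demo_clf.classes_[np.argmax(proba, axis=1)]

print(proba.sum(axis=1))  # each row sums to 1
print(np.array_equal(labels_from_proba, demo_clf.predict(X_demo[:5])))
```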

In [107]:
# Install a package from within a Jupyter notebook

import sys
!conda install --yes --prefix {sys.prefix} seaborn

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 4.12.0
  latest version: 4.14.0

Please update conda by running

    $ conda update -n base -c defaults conda



## Package Plan ##

  environment location: /home/bbozga/Projects/ML/ZeroToMasterML/env

  added / updated specs:
    - seaborn


The following NEW packages will be INSTALLED:

  seaborn            pkgs/main/noarch::seaborn-0.11.2-pyhd3eb1b0_0


Preparing transaction: done
Verifying transaction: done
Executing transaction: done


### 4.3 Using different evaluation metrics as Scikit-Learn functions
These functions live in the `sklearn.metrics` module.

In [6]:
import pandas as pd
import numpy as np

In [9]:
heart_disease = pd.read_csv("heart-disease.csv")
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [53]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
np.random.seed(42)
# Create X & y
X = heart_disease.drop("target",axis=1)
y = heart_disease["target"]

# Split data
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

# Create model
model = RandomForestClassifier()

# Fit model
model.fit(X_train,y_train)

# Make predictions to evaluate with the metric functions
y_pred = model.predict(X_test)

In [54]:
accuracy_score(y_test,y_pred)

0.8524590163934426

In [55]:
precision_score(y_test,y_pred)

0.8484848484848485

In [56]:
recall_score(y_test,y_pred)

0.875

In [57]:
f1_score(y_test,y_pred)

0.8615384615384615

In [58]:
# F1 score computed by hand from its formula
# F1 = 2 * (precision * recall) / (precision + recall)
2*(precision_score(y_test,y_pred)*recall_score(y_test,y_pred))/(precision_score(y_test,y_pred)+recall_score(y_test,y_pred))

0.8615384615384615
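
The same arithmetic can be followed from raw counts. The counts below are not printed anywhere in this notebook; they are inferred as the values consistent with the scores above on 61 test samples (28 true positives, 5 false positives, 4 false negatives), so treat them as an assumption.

```python
# Counts inferred from the scores above (61 test samples); not printed by the notebook
tp, fp, fn = 28, 5, 4

precision = tp / (tp + fp)   # 28 / 33
recall = tp / (tp + fn)      # 28 / 32
f1_from_rates = 2 * (precision * recall) / (precision + recall)
f1_from_counts = 2 * tp / (2 * tp + fp + fn)  # algebraically equivalent form

print(precision, recall, f1_from_rates, f1_from_counts)
```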

In [36]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
np.random.seed(42)

# Create data
housing_df = fetch_california_housing()

# Create X & y
X = pd.DataFrame(housing_df.data, columns=housing_df.feature_names)
y = housing_df.target

# Split data
X_train, X_test, y_train,y_test = train_test_split(X,y,test_size=0.2)

# Create model
model = RandomForestRegressor()

# Fit the model
model.fit(X_train,y_train)

# Use the model to make predictions
y_pred = model.predict(X_test)

model.score(X_test,y_test),len(y_test),len(y_pred)


(0.8066196804802649, 4128, 4128)

In [37]:
r2_score(y_test,y_pred)

0.8066196804802649

In [38]:
mean_absolute_error(y_test,y_pred)

0.3265721842781009

In [39]:
mean_squared_error(y_test,y_pred)

0.2534073069137548
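
To make the three regression metrics concrete, the sketch below computes them by hand with NumPy and compares the results with the `sklearn.metrics` functions. The target and prediction values are small made-up numbers chosen only to illustrate the formulas.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Small made-up targets and predictions, only to illustrate the formulas
y_true = np.array([3.0, 1.5, 2.0, 4.5])
y_hat = np.array([2.5, 1.0, 2.5, 4.0])

errors = y_true - y_hat
mae = np.abs(errors).mean()                       # mean absolute error
mse = (errors ** 2).mean()                        # mean squared error
ss_res = (errors ** 2).sum()                      # residual sum of squares
ss_tot = ((y_true - y_true.mean()) ** 2).sum()    # total sum of squares
r2 = 1 - ss_res / ss_tot                          # coefficient of determination

print(mae, mse, r2)
```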

## 5. Improving a model
First predictions = baseline predictions.
First model = baseline model.

From a data perspective:
   * Could we collect more data? (generally, the more data, the better)
   * Could we improve our data?

From a model perspective:
   * Is there a better model we could use?
   * Could we improve the current model?

Hyperparameters vs. Parameters:
   * Parameters = patterns the model finds in the data
   * Hyperparameters = settings on a model you can adjust to (potentially) improve its ability to find patterns
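
The distinction can be seen on a fitted estimator. In this sketch (synthetic stand-in data, hypothetical variable names), hyperparameters are passed to the constructor and read back with `get_params()`, while what the model learned is stored in attributes ending in an underscore, which only exist after `fit`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 13 features
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=42)

# Hyperparameters: chosen by us before fitting
demo_clf = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=42)
print(demo_clf.get_params()["max_depth"])

# Learned from the data during fit (trailing underscore by convention)
demo_clf.fit(X_demo, y_demo)
print(len(demo_clf.estimators_))            # the fitted trees
print(demo_clf.feature_importances_.shape)  # one importance per feature
```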

In [41]:
clf = RandomForestClassifier()

In [42]:
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

We're going to try and adjust:
   * `max_depth`
   * `max_features`
   * `min_samples_leaf`
   * `min_samples_split`
   * `n_estimators`
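
One way to explore these five settings together is a randomized search with cross-validation. The sketch below uses `RandomizedSearchCV` on synthetic stand-in data; the candidate values are illustrative guesses rather than choices taken from this notebook, and `n_iter` and `cv` are kept small so it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in with the same shape as the heart-disease features
X_demo, y_demo = make_classification(n_samples=303, n_features=13, random_state=42)

# Candidate values are illustrative guesses, not recommendations
param_distributions = {
    "max_depth": [None, 5, 10, 20],
    "max_features": ["sqrt", "log2"],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 4, 6],
    "n_estimators": [10, 50, 100],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,          # number of random combinations to try
    cv=3,              # 3-fold cross-validation per combination
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.2f}")
```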

In [77]:
def show_metrics(metric_dict):
    """
    Show the metrics in a table form
    """
    metric = pd.DataFrame.from_dict(metric_dict)
    print(metric)
#     metric = pd.DataFrame(data=metric_dict.values(),columns=metric_dict.keys())

In [101]:
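def combine_dict_valalus(dict1, dict2):
    """Add dict2's values to the matching lists in dict1 (modifies dict1 in place)."""
    for key in dict2:
        # extend keeps each list flat; append would nest dict2's list inside dict1's
        dict1[key].extend(dict2[key])
    return dict1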
            

In [78]:
def evaluate_preds(y_true_local, y_pred_local):
    """
    Performs evaluation comparison on y_true labels vs. y_pred labels
    on a classification model.
    
    Returns a dict that contains the accuracy, precision, recall and f1 scores
    """
    accuracy = accuracy_score(y_true_local, y_pred_local)
    precision = precision_score(y_true_local, y_pred_local)
    recall = recall_score(y_true_local,y_pred_local)
    f1 = f1_score(y_true_local, y_pred_local)
    metric_dict = {"accuracy":[accuracy],
                  "precision":[precision],
                   "recall":[recall],
                   "f1":[f1]}
    show_metrics(metric_dict)
    return metric_dict
    
    

In [81]:
evaluate_preds(y_test,y_pred);

   accuracy  precision  recall        f1
0  0.852459   0.848485   0.875  0.861538


In [97]:
from sklearn.ensemble import RandomForestClassifier
 
np.random.seed(42)
 
X = heart_disease.drop('target', axis = 1)
y = heart_disease['target']
 
# I first split the data into training and test/validation set. Test_size at 0.3 to obtain 70% as training data
 
X_train, X_val_test, y_train, y_val_test = train_test_split(X, y, test_size = 0.3)
 
# Then I split the val_test (validation/test) set into two using test_size of 0.5 to obtain about 15% for each
 
X_val, X_test, y_val, y_test = train_test_split(X_val_test, y_val_test, test_size = 0.5)
 
len(X_train), len(X_val), len(X_test)

clf = RandomForestClassifier()

clf.fit(X_train,y_train)
y_preds = clf.predict(X_val)

d1 = evaluate_preds(y_val,y_preds)


   accuracy  precision  recall        f1
0       0.8   0.777778   0.875  0.823529
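
The sizes produced by the two-step split can be checked without the real data. This sketch uses zero-filled stand-in arrays with 303 rows, a row count inferred from the 242 training and 61 test rows seen earlier in the notebook (an assumption); the variable names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the inferred number of rows and 13 feature columns
X_demo = np.zeros((303, 13))
y_demo = np.zeros(303)

# 70% train, 30% held out
X_tr, X_hold, y_tr, y_hold = train_test_split(X_demo, y_demo, test_size=0.3, random_state=42)
# Split the held-out part in half: roughly 15% validation, 15% test
X_va, X_te, y_va, y_te = train_test_split(X_hold, y_hold, test_size=0.5, random_state=42)

print(len(X_tr), len(X_va), len(X_te))
```

Under that assumption the validation set has 45 rows, so one validation sample is worth a little over 2 percentage points of accuracy.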


In [98]:
np.random.seed(42)

clf_2 = RandomForestClassifier(max_depth=10)

clf_2.fit(X_train,y_train)

y_preds_2 = clf_2.predict(X_val)
d2 = evaluate_preds(y_val,y_preds_2);

   accuracy  precision    recall        f1
0       0.8        0.8  0.833333  0.816327


In [103]:
dcomb = combine_dict_valalus(d1,d2)

accuracy
precision
recall
f1


In [104]:
show_metrics(dcomb)

Empty DataFrame
Columns: []
Index: []
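
The empty table above does not show the side-by-side comparison the two runs were meant to produce. A minimal sketch of one way to build it follows, using the rounded metric values printed for the two validation runs; the dictionary names and row labels are hypothetical.

```python
import pandas as pd

# Metric values copied from the two validation runs above (rounded as printed)
baseline = {"accuracy": [0.8], "precision": [0.777778], "recall": [0.875], "f1": [0.823529]}
max_depth_10 = {"accuracy": [0.8], "precision": [0.8], "recall": [0.833333], "f1": [0.816327]}

# Build a new dict so neither input is modified; list + list concatenates
combined = {key: baseline[key] + max_depth_10[key] for key in baseline}

comparison = pd.DataFrame(combined, index=["baseline", "max_depth=10"])
print(comparison)
```

Building a new dictionary rather than appending in place means the cell can be re-run without the lists growing each time.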
