## Introduction to scikit learn

What we going to cover

0. [An end-to-end Scikit-Learn workflow](#0.-An-end-to-end-Scikit-Learn-workflow)
1. [Getting the data ready](#1.-Getting-the-data-ready-to-be-used-with-machine-learning)
2. [Choose the right estimator/algorithm for our problems](#2.-Chosing-right-estimator-or-algorithm-for-your-problem)
3. [Fit the model/algorithm and use it to make predictions on our data](#3.-Fit-the-model/algorithm-and-use-it-to-make-predictions-on-our-data)
4. [Evaluating a model](#4.-Evaluating-a-model)
    1. [Notes](#Notes)
5. [Improve a model](#5.-Improving-a-Model)
6. [Save and load a trained model](#6.-Saving-and-loading-trained-Machine-Learning-model)
7. [Putting it all together!](#7.-Putting-it-all-together!)



**Links for Feature Scaling**
1. __[https://rahul-saini.medium.com/feature-scaling-why-it-is-required-8a93df1af310](https://rahul-saini.medium.com/feature-scaling-why-it-is-required-8a93df1af310)__
2. __[https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/](https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/)__
3. __[https://benalexkeen.com/feature-scaling-with-scikit-learn/](https://benalexkeen.com/feature-scaling-with-scikit-learn/)__

**Links for ROC and AUC**
1. __[https://www.youtube.com/watch?v=4jRBRDbJemM](https://www.youtube.com/watch?v=4jRBRDbJemM)__
2. __[https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc)__

**Which model to choose - Scikit learn Cheat sheet**
* __[https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)__

**Check this for Google's Courses**
1. __[https://developers.google.com/machine-learning/foundational-courses](https://developers.google.com/machine-learning/foundational-courses)__

## 0. An end-to-end Scikit-Learn workflow


In [1]:
# Get the data ready
import pandas as pd
import numpy as np
heart_disease = pd.read_csv('../data/heart-disease.csv')
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [2]:
# create X (feature matrix)
X = heart_disease.drop('target', axis=1)

# create y (labels)
y = heart_disease['target']

In [3]:
# 2. choose right model and hyperparameters
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

# We'll keep default hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

### 3 . fit the model to training data


In [4]:

from sklearn.model_selection import train_test_split

X_train, x_test, y_train, y_test = train_test_split(X,y, test_size=0.2)

In [5]:
clf.fit(X_train, y_train)

In [6]:
y_label = clf.predict(np.array([1,2,3])) # it wont work bcz the model is trained on different data format



ValueError: Expected 2D array, got 1D array instead:
array=[1. 2. 3.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

In [None]:

y_preds = clf.predict(x_test)

In [None]:

y_preds

In [None]:
y_test

### 4 .Evaluate the model on training and test data

In [None]:

clf.score(X_train, y_train)


In [None]:
clf.score(x_test, y_test)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix,accuracy_score

print(classification_report(y_preds, y_test))

In [None]:
confusion_matrix(y_preds, y_test)

In [None]:
accuracy_score(y_preds, y_test)

### 5. Improve the model

In [None]:
np.random.seed(43)
for i in range(10,100,10):
    print(f"Training model with {i} estimator..")
    clf = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set : {clf.score(x_test, y_test)*100:.2f}%")
    print('')
    

### 6 . Save the model

In [None]:
clf = RandomForestClassifier(n_estimators=50).fit(X_train, y_train)
clf.score(x_test, y_test)

In [None]:
import pickle
pickle.dump(clf, open("randm_forest.pkl","wb"))

In [None]:
loaded_model = pickle.load(open("randm_forest.pkl", 'rb'))
loaded_model.score(x_test, y_test)

# DETAILED OVERVIEW

## 1. Getting the data ready to be used with machine learning


Three main things:
1. Split the data into features and labels(usually `X` and `y`
2. Filling (also called imputing) or disregarding missing value
3. Non-numeric value to numeric (also called feature encoding)

In [None]:
heart_disease.head()

In [None]:
X= heart_disease.drop('target', axis=1)
y = heart_disease['target']

In [None]:
X.head()

In [None]:

y.head()

In [None]:
#splitting the dataset to training and test
from sklearn.model_selection import train_test_split

X_train, X_test, y_train , y_test = train_test_split(X,y,test_size=0.2 )

In [None]:
X_train.shape


### 1.1 Make it all numerical (Feature encoding)

In [None]:
car_sales = pd.read_csv('../data/car-sales-extended.csv')

In [None]:
car_sales.head()

In [None]:
car_sales.info()

In [None]:
# split the data
X = car_sales.drop('Price', axis = 1)
y = car_sales['Price']

# Splitting to Train and Test set
X_train, x_test, y_train, y_test = train_test_split(X,y, test_size=0.2)

In [None]:
#Building machine learning model
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()

model.fit(X_train,y_train)


In [None]:
car_sales['Doors'].value_counts()

In [None]:
car_sales.head()

In [None]:
#Turn the categories to numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

#here we can also convert doors column since it is categorical
categorical_features  = ['Make', 'Colour','Doors']
one_hot = OneHotEncoder()
transformer = ColumnTransformer([('one_hot',
                                  one_hot,
                                  categorical_features)],
                               remainder='passthrough')

In [None]:
transformed_X = transformer.fit_transform(X)

In [None]:
transformed_X

In [None]:
 pd.DataFrame(transformed_X)

In [None]:
# Let's refit the model

np.random.seed(22)
X_train, x_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)
model.fit(X_train, y_train)

In [None]:
model.score(x_test,y_test)

### =================================================================================

### 1.2 What if there are missing values?
1. Fill them with some value(also known as imputation)
2. Remove the samples with missing data


In [None]:
# import car sales missing data
car_sales_missing = pd.read_csv('../data/car-sales-extended-missing-data.csv')


In [None]:
car_sales_missing.isna().sum()

In [None]:
# Lets convert text to numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

X = car_sales_missing.drop('Price', axis=1)
y = car_sales_missing['Price']


#here we can also convert doors column since it is categorical
categorical_features  = ['Make', 'Colour','Doors']
one_hot = OneHotEncoder()
transformer = ColumnTransformer([('one_hot',
                                  one_hot,
                                  categorical_features)],
                               remainder='passthrough')
transformed_X = transformer.fit_transform(X)

### Option 1 : Filling missing values with panda

In [None]:
car_sales_missing['Doors'].value_counts()

In [None]:
car_sales_missing['Make'].fillna("missing", inplace=True)
car_sales_missing['Colour'].fillna("missing", inplace=True)
car_sales_missing['Odometer (KM)'].fillna(car_sales_missing['Odometer (KM)'].mean(), inplace=True)
car_sales_missing['Doors'].fillna(4, inplace=True)


In [None]:
car_sales_missing.isna().sum()

In [None]:
#We do remove the missing values for price since without it, it is harder to predict
car_sales_missing.dropna(inplace=True)

In [None]:
car_sales_missing.isna().sum()

In [None]:
len(car_sales_missing)

In [None]:
# Lets convert text to numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

X = car_sales_missing.drop('Price', axis=1)
y = car_sales_missing['Price']


#here we can also convert doors column since it is categorical
categorical_features  = ['Make', 'Colour','Doors']
one_hot = OneHotEncoder()
transformer = ColumnTransformer([('one_hot',
                                  one_hot,
                                  categorical_features)],
                               remainder='passthrough')
transformed_X = transformer.fit_transform(X)

In [None]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()

In [None]:
X_train, x_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)
model.fit(X_train, y_train)
model.score(x_test, y_test)

### Option 2 : Filling missing value with Sklearn

In [19]:
car_sales_missing = pd.read_csv('../data/car-sales-extended-missing-data.csv')

In [20]:
car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [21]:
car_sales_missing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Make           951 non-null    object 
 1   Colour         950 non-null    object 
 2   Odometer (KM)  950 non-null    float64
 3   Doors          950 non-null    float64
 4   Price          950 non-null    float64
dtypes: float64(3), object(2)
memory usage: 39.2+ KB


In [22]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [23]:

car_sales_missing.dropna(subset=['Price'],inplace=True)

In [24]:
car_sales_missing.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 950 entries, 0 to 999
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Make           903 non-null    object 
 1   Colour         904 non-null    object 
 2   Odometer (KM)  902 non-null    float64
 3   Doors          903 non-null    float64
 4   Price          950 non-null    float64
dtypes: float64(3), object(2)
memory usage: 44.5+ KB


In [25]:
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [26]:
# X = car_sales_missing.drop('Price', axis=1)
y = car_sales_missing['Price']

In [27]:
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer


# Filling categorical values with 'missing' & numerical values with mean
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value = 4)
num_imputer = SimpleImputer(strategy="mean")

#Define columns
cat_features = ['Make', 'Colour']
door_features = ['Doors']
num_features = ['Odometer (KM)']


#Create an imputer (something tht fills up the data)
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer,cat_features),
    ("door_imputer", door_imputer, door_features),
    ("num_imputer", num_imputer, num_features)
])

filled_X = imputer.fit_transform(X)

In [28]:
filled_X

array([['Honda', 'White', 4.0, 35431.0],
       ['BMW', 'Blue', 5.0, 192714.0],
       ['Honda', 'White', 4.0, 84714.0],
       ...,
       ['Nissan', 'Blue', 4.0, 66604.0],
       ['Honda', 'White', 4.0, 215883.0],
       ['Toyota', 'Blue', 4.0, 248360.0]], dtype=object)

In [29]:
car_sale_filled = pd.DataFrame(filled_X, columns=['Make', 'Colour', 'Odometer (KM)', 'Doors'])
car_sale_filled

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,4.0,35431.0
1,BMW,Blue,5.0,192714.0
2,Honda,White,4.0,84714.0
3,Toyota,White,4.0,154365.0
4,Nissan,Blue,3.0,181577.0
...,...,...,...,...
945,Toyota,Black,4.0,35820.0
946,missing,White,3.0,155144.0
947,Nissan,Blue,4.0,66604.0
948,Honda,White,4.0,215883.0


In [30]:
# Lets convert text to numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# X = car_sales_missing.drop('Price', axis=1)
# y = car_sales_missing['Price']


#here we can also convert doors column since it is categorical
categorical_features  = ['Make', 'Colour','Doors']
one_hot = OneHotEncoder()
transformer = ColumnTransformer([('one_hot',
                                  one_hot,
                                  categorical_features)],
                               remainder='passthrough')
transformed_X = transformer.fit_transform(car_sale_filled)


In [33]:
transformed_X.shape

(950, 913)

In [34]:
a = pd.DataFrame(transformed_X)
a

Unnamed: 0,0
0,"(0, 1)\t1.0\n (0, 9)\t1.0\n (0, 96)\t1.0\n..."
1,"(0, 0)\t1.0\n (0, 6)\t1.0\n (0, 690)\t1.0\..."
2,"(0, 1)\t1.0\n (0, 9)\t1.0\n (0, 284)\t1.0\..."
3,"(0, 3)\t1.0\n (0, 9)\t1.0\n (0, 550)\t1.0\..."
4,"(0, 2)\t1.0\n (0, 6)\t1.0\n (0, 651)\t1.0\..."
...,...
945,"(0, 3)\t1.0\n (0, 5)\t1.0\n (0, 99)\t1.0\n..."
946,"(0, 4)\t1.0\n (0, 9)\t1.0\n (0, 552)\t1.0\..."
947,"(0, 2)\t1.0\n (0, 6)\t1.0\n (0, 220)\t1.0\..."
948,"(0, 1)\t1.0\n (0, 9)\t1.0\n (0, 786)\t1.0\..."


In [35]:
#Lets Train the values
np.random.seed(32)

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_train, x_test, y_train,y_test = train_test_split(transformed_X, y, test_size=0.2)

model = RandomForestRegressor()

In [36]:
model.fit(X_train, y_train)
model.score(x_test, y_test)

-0.04833360827295885

## 2. Chosing right estimator or algorithm for your problem

Sklearn refer machine learning algorithm and models as ***Estimators***

Cheat_sheet - __[https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)_

In [None]:
from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np
housing = fetch_california_housing()
housing

In [None]:
print(housing['DESCR'])

In [None]:
housing_df = pd.DataFrame(housing['data'], columns=housing['feature_names'])
housing_df.head()

In [None]:
housing_df['Target'] = housing['target']

In [None]:
housing_df.head()

In [None]:
housing_df.info()

In [None]:
#splitting the data
np.random.seed(42)

X= housing_df.drop('Target', axis=1)
y = housing_df['Target']

X_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


In [None]:
housing_df.isna().sum()

In [None]:
X_train

In [None]:
from sklearn.linear_model import Ridge

np.random.seed(42)
model = Ridge()
model.fit(X_train, y_train)
model.score(x_test, y_test)

In [None]:
from sklearn.svm import SVR

np.random.seed(42)
model = SVR()
model.fit(X_train, y_train)
model.score(x_test,y_test)

In [None]:
#using a Randomforest
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)
X= housing_df.drop('Target', axis=1)
y = housing_df['Target']

X_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(x_test,y_test)

## 2.2 Choosing an estimator for classification


**Let's go to the map...** https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

In [None]:
heart_disease = pd.read_csv('../data/heart-disease.csv')
heart_disease.head()

In [None]:
len(heart_disease)


**Consulting the map and it says to use `LinearSVC`**

In [None]:
from sklearn.svm import LinearSVC

# setting up the random seed
np.random.seed(42)

# Make the Data
X = heart_disease.drop('target', axis =1)
y = heart_disease['target']

# Split the data
X_train, x_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

# Instantiate LinearSVC
clf = LinearSVC(max_iter=10000)
clf.fit(X_train, y_train)

#Evaluate the LinearSVC
clf.score(x_test, y_test)

In [None]:
heart_disease['target'].value_counts()

In [None]:
from sklearn.ensemble import RandomForestClassifier

# setting up the random seed
np.random.seed(42)

# Make the Data
X = heart_disease.drop('target', axis =1)
y = heart_disease['target']

# Split the data
X_train, x_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

# Instantiate LinearSVC
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)

#Evaluate the LinearSVC
clf.score(x_test, y_test)

## 3. Fit the model/algorithm and use it to make predictions on our data

### 3.1 Fitting the model to the data

Different names for
* `X` = features, features variables, data
* `y` = labels, target variables, target 

In [None]:
from sklearn.ensemble import RandomForestClassifier

# setting up the random seed
np.random.seed(42)

# Make the Data
X = heart_disease.drop('target', axis =1)
y = heart_disease['target']

# Split the data
X_train, x_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

# Instantiate LinearSVC
clf = RandomForestClassifier(n_estimators=100)



In [None]:
# Fit the model to the data (Training the machine learning model)
clf.fit(X_train, y_train)


In [None]:

#Evaluate the LinearSVC( Use the patterns model has learnt)
clf.score(x_test, y_test)

### 3.2 Make predictions using machine learning model
2 ways to make predictions
1. `predict()`
2. `predict_proba()`

In [None]:
clf.predict(x_test)

In [None]:
np.array([y_test])

In [None]:
y_pred  = clf.predict(x_test)
np.mean(y_pred==y_test)

In [None]:
clf.score(x_test, y_test)

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)

**Making prediction using `predict_proba()`**
* It gives the probability of the class (confidence in which class the answer will be `0` or `1`)


In [None]:
clf.predict_proba(x_test[:5])
#Below values gives the confidence that model has over the data. 
#Let say if there are less difference between 2 classes, then there is a confusion for the model to predict
# i.e it is hard to predict.

In [None]:
y_test[:5]

In [None]:
housing_df.head()

In [None]:
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

#Create a data
X = housing_df.drop('Target', axis=1)
y = housing_df['Target']

# Split to train and test data
X_train, x_test, y_train, y_test = train_test_split(X,y, test_size=0.2)

#Create model instance
model = RandomForestRegressor()

#fit the model
model.fit(X_train, y_train)

# Make Predction
y_pred = model.predict(x_test)

In [None]:
y_pred[:10]

In [None]:
np.array(y_test[:10])

**Keywords - Regression metric evaluation**

In [None]:
# Compare the prediction with truth
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_pred)

## 4. Evaluating a model

Three ways to evaluate a `Models` /`Estimator`:
1. [Estimator's builtin `score()` method](#4.1-Evaluating-model-with-score()-method)
2. [The `Scoring` parameter](#4.2-Evaluating-model-using-scoring-parameter)
    1.  [4.2.1 Classification model evaluataion metrics](#4.2.1-Classification-model-evaluataion-metrics)
    2.  [4.2.2 Regression model evaluation metrics](#4.2.2-Regression-model-evaluation-metrics)
    3.  [Notes](#Notes-Machine-Learning-Model-Evaluation)
    4.  [4.2.3 Finally evaluating using Scoring parameter](#4.2.3-Finally-evaluating-using-Scoring-parameter)
3. [Problem specific metric funtions](#4.3-Problem-specific-metric-funtions---Using-different-metrics-as-Scikit-Learn-functions)

Links - 
1. https://scikit-learn.org/stable/modules/model_evaluation.html
2. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html


### 4.1 Evaluating model with score() method

In [None]:
heart_disease.head()

In [None]:
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

#Create X and y
X = heart_disease.drop('target', axis=1)
y = heart_disease['target']

#Create train and test set
X_train, x_test, y_train, y_test = train_test_split(X,y, test_size=0.2)

#Create Classifier model instance
clf = RandomForestClassifier()

#Fit the model
clf.fit(X_train, y_train)

In [None]:
clf.score(X_train, y_train)

In [None]:
clf.score(x_test, y_test)

**Lets use `score()` for Regression problem**

In [None]:
housing_df

In [None]:
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

#Create X and y
X = housing_df.drop('Target', axis=1)
y = housing_df['Target']

#Create train and test set
X_train, x_test, y_train, y_test = train_test_split(X,y, test_size=0.2)

#Create Classifier model instance
model = RandomForestRegressor(n_estimators=100)

#Fit the model
model.fit(X_train, y_train)

In [None]:
model.score(X_train, y_train)

In [None]:
# Default score() evaluation metric is R_squared for regression algorithms
model.score(x_test, y_test)

### 4.2 Evaluating model using scoring parameter

In [None]:
from sklearn.model_selection import cross_val_score

from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

#Create X and y
X = heart_disease.drop('target', axis=1)
y = heart_disease['target']

#Create train and test set
X_train, x_test, y_train, y_test = train_test_split(X,y, test_size=0.2)

#Create Classifier model instance
clf = RandomForestClassifier()

#Fit the model
clf.fit(X_train, y_train)

In [None]:
clf.score(x_test, y_test)

In [None]:
cross_val_score(clf, X, y, cv=5)

![](../images/sklearn-cross-validation.png)

In [None]:
#default scoring parameter of classifier = mean accuracy
clf.score(x_test, y_test)

In [None]:
cross_val_score(clf, X, y, cv=5) # scoring paramete set to None by default

### 4.2.1 Classification model evaluataion metrics

1. Accuracy
2. Area under ROC curve
3. Confusion matrix
4. Classification report

**1. Accuracy**

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

X= heart_disease.drop('target', axis = 1)
y = heart_disease['target']

clf = RandomForestClassifier()
cross_val_score =cross_val_score(clf, X, y, cv = 5)

In [None]:
np.mean(cross_val_score)

In [None]:
print(f'Heart Disease Classifier Cross_Validated Accuracy : {np.mean(cross_val_score) * 100:2f}%')

**2 . Area Under the Receiver Operating Characteristic Curve (AUC / ROC)**

* Area Under Curve (AUC)
* ROC curve

ROC curve are a comparision of a model's true positive rate(tpr) versus a model's false positive rate(fpr)

    * True positive =  model predicts 1 when truth is 1
    * False positive =  model predicts 1 when truth is 0    
    * True negative =  model predicts 0 when truth is 0    
    * False negative =  model predicts 0 when truth is 1    
    
ROC curves and AUC metrics are evaluation metrics for binary classification models (a model which predicts one thing or another, such as heart disease or not).

The ROC curve compares the true positive rate (tpr) versus the false positive rate (fpr) at different classification thresholds.

The AUC metric tells you how well your model is at choosing between classes (for example, how well it is at deciding whether someone has heart disease or not). A perfect model will get an AUC score of 1.

In [None]:
X_train, x_test, y_train, y_test = train_test_split(X,y, test_size=0.2)

In [None]:
from sklearn.metrics import roc_curve

clf.fit(X_train, y_train)

y_prob = clf.predict_proba(x_test)

y_prob[:10]


In [None]:
y_prob_positive = y_prob[:,1] # take all the column 1 of every row

In [None]:
# Calculate tpr and fpr and thresholds

fpr, tpr, thresholds = roc_curve(y_test, y_prob_positive)

In [None]:
# Check false positive rate
fpr

In [None]:
thresholds


In [None]:
import matplotlib.pyplot as plt

def plot_roc_curve(fpr, tpr):
    """
    Plot ROC curve from given false positive rate (fpr)
    and true positive rate(tpr)
    """
    
    #plot ROC curve
    plt.plot(fpr, tpr, color= "orange", label= "ROC")
    
    #plot the baseline (no predicting power)
    plt.plot([0,1], [0,1], color= "darkblue", linestyle="--", label="Guessing")
    
    #customize the plot
    plt.xlabel("False poitive rate (fpr)")
    plt.ylabel("True positive rate (tpr)")
    plt.title("Receiver Operating Characterisitcs (ROC) curve")
    plt.legend()
    plt.show()
    
    
plot_roc_curve(fpr, tpr)

In [None]:
len(thresholds)

In [None]:
#ploting threshold
plt.plot(thresholds)

In [None]:
from sklearn.metrics import roc_auc_score #plotting for Area under curve
roc_auc_score(y_test, y_prob_positive)

In [None]:
#Lets try for perfect score that roc gives. WHere there is no false positive
fpr, tpr, thresholds = roc_curve(y_test, y_test)
plot_roc_curve(fpr, tpr)

In [None]:
roc_auc_score(y_test, y_test)

**3. Confusion matrix**

In [None]:
from sklearn.metrics import confusion_matrix

y_preds = clf.predict(x_test)

confusion_matrix(y_test, y_preds)

In [None]:
pd.crosstab(y_test, y_preds, rownames=["Actual_label"], colnames=["predicted_lables"])

![](../images/sklearn-confusion-matrix-anatomy.png)

In [None]:
#Make our confusion matrix more visual with seaborn's heatmap()

import seaborn as sns

#set font scale
sns.set(font_scale=1.5)

conf_mat = confusion_matrix(y_test, y_preds)

sns.heatmap(conf_mat);

**Create confusion matx using scikit library**

In [None]:
# Using SKlearn's inbuilt method

from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(estimator=clf, X=X, y=y) # using itself create a prediction from X and y

In [None]:
ConfusionMatrixDisplay.from_predictions(y_true= y_test, y_pred=y_preds)

**4. Classification report**

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_preds))

![](../images/ClassificationReportAnatomy.png)

### 4.2.2 Regression model evaluation metrics

1. R^2 (R-squared) or Coeff of determination
2. Mean Absolute Error
3. Mean Squared Error

**R-squared**

What R-squared does: Compares your model prediction to the mean of the targets, Values can range from negative infinity(a very poor model) to 1. 

For example,
* If all your model does is predict the mean of the target, Its R^2 value be 0.  
* If your model perfectly predicts a range of numbers it's R^2 value would be 1.

In [None]:
housing_df.head()

In [None]:
from sklearn.ensemble import RandomForestRegressor

from sklearn.datasets import fetch_california_housing
import pandas as pd
import numpy as np
housing = fetch_california_housing()
housing_df = pd.DataFrame(housing['data'], columns=housing['feature_names'])
housing_df['target'] = housing['target']
np.random.seed(42)


X = housing_df.drop('target', axis=1)
y = housing_df['target']

model = RandomForestRegressor()

X_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model.fit(X_train, y_train)



In [None]:
model.score(x_test, y_test)

In [None]:
from sklearn.metrics import r2_score

#Fill array with y_test_mean

y_test_mean = np.full(len(y_test), y_test.mean())

r2_score(y_test, y_test_mean)

In [None]:
r2_score(y_test, y_test)

**Mean Absolute Error**

It is a absolute mean of differences predicted and actual value

In [None]:
from sklearn.metrics import mean_absolute_error

y_pred = model.predict(x_test) 
y_pred
mean_absolute_error(y_test, y_pred)

In [None]:
df = pd.DataFrame({'Actual':y_test, 'Pred': y_pred})
df['Diff'] = np.abs(df['Pred']-df['Actual'])
df.head()

In [None]:
np.mean(df['Diff'])

**Mean Square Error** 


It is a mean of the square of the errors between actual and predicted value

In [None]:
from sklearn.metrics import mean_squared_error

y_preds = model.predict(x_test)

mse = mean_squared_error(y_test, y_preds)
mse

In [None]:
df['Sqrd_diff'] = np.square(df['Diff'])

In [None]:
df.head()

### Notes-Machine Learning Model Evaluation

**Machine Learning Model Evaluation**

Evaluating the results of a machine learning model is as important as building one.

But just like how different problems have different machine learning models, different machine learning models have different evaluation metrics.

Below are some of the most important evaluation metrics you'll want to look into for classification and regression models.

**1. Classification Model Evaluation Metrics/Techniques**

* Accuracy - The accuracy of the model in decimal form. Perfect accuracy is equal to 1.0.

* Precision - Indicates the proportion of positive identifications (model predicted class 1) which were actually correct. A model which produces no false positives has a precision of 1.0.

* Recall - Indicates the proportion of actual positives which were correctly classified. A model which produces no false negatives has a recall of 1.0.

* F1 score - A combination of precision and recall. A perfect model achieves an F1 score of 1.0.

* Confusion matrix - Compares the predicted values with the true values in a tabular way, if 100% correct, all values in the matrix will be top left to bottom right (diagonal line).

* Cross-validation - Splits your dataset into multiple parts and train and tests your model on each part then evaluates performance as an average.

* Classification report - Sklearn has a built-in function called classification_report() which returns some of the main classification metrics such as precision, recall and f1-score.

* ROC Curve - Also known as receiver operating characteristic is a plot of true positive rate versus false-positive rate.

* Area Under Curve (AUC) Score - The area underneath the ROC curve. A perfect model achieves an AUC score of 1.0.

**Which classification metric should you use?**

* Accuracy is a good measure to start with if all classes are balanced (e.g. same amount of samples which are labelled with 0 or 1).

* Precision and recall become more important when classes are imbalanced.

* If false-positive predictions are worse than false-negatives, aim for higher precision.

* If false-negative predictions are worse than false-positives, aim for higher recall.

* F1-score is a combination of precision and recall.

* A confusion matrix is always a good way to visualize how a classification model is going.

**2. Regression Model Evaluation Metrics/Techniques**

* R^2 (pronounced r-squared) or the coefficient of determination - Compares your model's predictions to the mean of the targets. Values can range from negative infinity (a very poor model) to 1. For example, if all your model does is predict the mean of the targets, its R^2 value would be 0. And if your model perfectly predicts a range of numbers it's R^2 value would be 1.

* Mean absolute error (MAE) - The average of the absolute differences between predictions and actual values. It gives you an idea of how wrong your predictions were.

* Mean squared error (MSE) - The average squared differences between predictions and actual values. Squaring the errors removes negative errors. It also amplifies outliers (samples which have larger errors).

**Which regression metric should you use?**

* R2 is similar to accuracy. It gives you a quick indication of how well your model might be doing. Generally, the closer your R2 value is to 1.0, the better the model. But it doesn't really tell exactly how wrong your model is in terms of how far off each prediction is.

* MAE gives a better indication of how far off each of your model's predictions are on average.

* As for MAE or MSE, because of the way MSE is calculated, squaring the differences between predicted values and actual values, it amplifies larger differences. Let's say we're predicting the value of houses (which we are).

* Pay more attention to MAE: When being `10,000` off is twice as bad as being `5,000` off.

* Pay more attention to MSE: When being 10,000 off is more than twice as bad as being 5,000 off.

* For more resources on evaluating a machine learning model, be sure to check out the following resources:

Scikit-Learn documentation for metrics and scoring (quantifying the quality of predictions)
* https://scikit-learn.org/stable/modules/model_evaluation.html

Beyond Accuracy: Precision and Recall by Will Koehrsen 
* https://towardsdatascience.com/beyond-accuracy-precision-and-recall-3da06bea9f6c

Stack Overflow answer describing MSE (mean squared error) and RSME (root mean squared error)
* https://stackoverflow.com/questions/17197492/is-there-a-library-function-for-root-mean-square-error-rmse-in-python/37861832#37861832

### 4.2.3 Finally evaluating using `Scoring` parameter

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np

np.random.seed(24)

heart_disease = pd.read_csv('../data/heart-disease.csv')

X = heart_disease.drop("target", axis =1)
y = heart_disease['target']

clf = RandomForestClassifier()

link - https://scikit-learn.org/stable/modules/model_evaluation.html

**for Classification**

In [None]:
np.random.seed(42)

#cross_validation accuracy
cv_acc = cross_val_score(clf, X, y, scoring=None) #default scoring is accuracy

print(f'Cross validation accuracy :{np.mean(cv_acc)*100:.2f}%')


In [None]:
#precision
np.random.seed(42)
cv_acc = cross_val_score(clf, X, y, scoring="precision")
cv_acc

In [None]:
print(f'Cross validation precision :{np.mean(cv_acc)*100:.2f}%')

In [None]:
#Recall
cv_acc = cross_val_score(clf, X, y, scoring="recall")
cv_acc


In [None]:
print(f'Cross validation precision :{np.mean(cv_acc)*100:.2f}%')

**for Regression**

In [None]:
housing_df.head()

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

X = housing_df.drop('Target', axis=1)
y = housing_df['Target']

model= RandomForestRegressor()

cv_r2 = cross_val_score(model, X, y, cv=3, scoring=None)
cv_r2

In [None]:
np.mean(cv_r2)

In [None]:
#Mean Squared Error

cv_MSE = cross_val_score(model, X, y, cv =3 , scoring="neg_mean_squared_error")
cv_MSE


In [None]:
np.mean(cv_MSE)

In [None]:
#Mean Absolute Error

cv_MAE = cross_val_score(model, X, y, cv =3 , scoring="neg_mean_absolute_error")
cv_MAE

In [None]:
np.mean(cv_MAE)

### 4.3 Problem specific metric funtions - Using different metrics as Scikit-Learn functions


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, precision_recall_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

#create X & y
X = heart_disease.drop('target', axis=1)
y = heart_disease['target']

#split Train, test set
X_train, x_test, y_train, y_test = train_test_split(X,y, test_size=0.2)


#Initialise Model
clf = RandomForestClassifier()


#Fit the model
clf.fit(X_train, y_train)

#Evaluate
y_pred = clf.predict(x_test)

print('Classifier metrics on test set')
print(f'Accuracy : {accuracy_score(y_test, y_pred)*100:.2f}')
print(f'Precision Score : {precision_score(y_test, y_pred)}')
print(f'Recall : {recall_score(y_test, y_pred)}')
print(f'F1 : {f1_score(y_test, y_pred)}')



In [None]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing

import pandas as pd
import numpy as np
housing = fetch_california_housing()
# housing
housing_df = pd.DataFrame(housing['data'], columns=housing['feature_names'])
# housing_df.head()
housing_df['Target'] = housing['target']


#Create X & y
X= housing_df.drop('Target', axis=1)
y = housing_df['Target']



#Split train and test set
X_train, x_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

#Initialize the model
model = RandomForestRegressor()

#Fit the model
model.fit(X_train, y_train)


#Evaluate
y_pred = model.predict(x_test)

print(f'R2: {r2_score(y_test, y_pred)}')
print(f'MAE : {mean_absolute_error(y_test, y_pred)}')
print(f'MSE : {mean_squared_error(y_test,y_pred)}')

## 5. Improving a Model

First Prediciton = Baseline Predictions

First Model = Baseline Model

**From Data Perspective**
* Could we collect more data? (More data.. more pattern findings)
* Could we imporove our data? 

**From Model Perspective**
* Is there a better model we could use?
* Could we improve our model?

**Parameter Vs HyperParameter**
* **Parameter** - model finds pattern in the data
* **HyperParameter** - Settings on a model that we can adjust to potentially improve by **3 ways**
    * By Hand
    * Randomly with RandomSearchCV
    * Exhaustively with GridsearchCV

In [None]:
clf = RandomForestClassifier()
clf.get_params()

![](../images/sklearn-train-valid-test-sets.png)

### 5.1 Hyperparameter Tuning by Hand

In [None]:
def evaluate_preds(y_true, y_preds):
    '''
    Performs evaluation comparision on y_true and y_preds labels on classification
    '''
    
    accuracy = accuracy_score(y_true, y_preds)
    precision = precision_score(y_true, y_preds)
    recall = recall_score(y_true, y_preds)
    f1 = f1_score(y_true, y_preds)
    metric = {"accuracy":round(accuracy,2),
              "precision": round(precision,2),
              " recall": round(recall,2),
              "f1":round(f1,2)}
    print(f'Accuracy : {accuracy*100:.2f}')
    print(f'Precision : {precision:.2f}')    
    print(f'Recall : {recall:.2f}')    
    print(f'F1 : {f1:.2f}')    
    return metric

In [None]:
heart_disease


In [None]:
from sklearn.ensemble import RandomForestClassifier
np.random.seed(43)

#Shuffle the data
heart_disease = heart_disease.sample(frac=1)
X = heart_disease.drop('target', axis=1)
y= heart_disease['target']

train_split = round(0.7*len(heart_disease)) #70% of data
val_split = round(train_split+ .15*len(heart_disease)) #15% of data

X_train, y_train = X[:train_split], y[:train_split]
X_valid, y_valid = X[train_split:val_split], y[train_split:val_split]
X_test, y_test = X[val_split:], y[val_split:]



In [None]:
len(X_train), len(X_valid), len(X_test)

In [None]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = RandomForestClassifier()
clf.fit(X_train, y_train)

In [None]:
#Make baseline predictions
y_preds = clf.predict(X_valid)

#Evaluate the classifier
baseline_metrics = evaluate_preds(y_valid, y_preds)

In [None]:
clf.get_params()

Lets change:
   * `max_depth`
   * `max_features`
   * `min_samples_leaf`
   * `min_samples_split`

In [None]:
clf2 = RandomForestClassifier(max_depth=50)
clf2.fit(X_train, y_train)
#Tuned predictions
y_preds = clf2.predict(X_valid)

#Evaluate the classifier
clf2_metrics = evaluate_preds(y_valid, y_preds)

### 5.2 Hyperparameter tuning by RandomSearchCV

In [None]:
from sklearn.model_selection import RandomizedSearchCV

grid = {"n_estimators": [10, 100, 200, 500, 1000, 1200],
        "max_depth": [None, 5, 10, 20, 30],
        "max_features": ["auto", "sqrt"],
        "min_samples_split": [2, 4, 6],
        "min_samples_leaf": [1, 2, 4]}

np.random.seed(42)

# Split into X & y
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Split into train and test sets
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate RandomForestClassifier
clf = RandomForestClassifier(n_jobs=1)

# Setup RandomizedSearchCV
rs_clf = RandomizedSearchCV(estimator=clf,
                            param_distributions=grid, 
                            n_iter=10, # number of models to try
                            cv=5,
                            verbose=2)

# Fit the RandomizedSearchCV version of clf
rs_clf.fit(X_train, y_train);

In [None]:
rs_clf.best_params_


In [None]:
# Make predictions with the best hyperparameters
rs_y_preds = rs_clf.predict(X_test)

# Evaluate the predictions
rs_metrics = evaluate_preds(y_test, rs_y_preds)

### 5.3 Hyperparameter tuning with GridSearchCV

In [None]:
grid

In [None]:
grid_2 = {'n_estimators': [100, 200, 500],
          'max_depth': [None],
          'max_features': ['auto', 'sqrt'],
          'min_samples_split': [6],
          'min_samples_leaf': [1, 2]}

In [None]:
from sklearn.model_selection import GridSearchCV, train_test_split

np.random.seed(42)

# Split into X & y
X = heart_disease.drop("target", axis=1)
y = heart_disease["target"]

# Split into train and test sets
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate RandomForestClassifier
clf = RandomForestClassifier(n_jobs=1)

# Setup GridSearchCV
gs_clf = GridSearchCV(estimator=clf,
                      param_grid=grid_2, 
                      cv=5,
                      verbose=2)

# Fit the GridSearchCV version of clf
gs_clf.fit(X_train, y_train);

In [None]:
gs_clf.best_params_

In [None]:

gs_y_preds = gs_clf.predict(X_test)

# evaluate the predictions
gs_metrics = evaluate_preds(y_test, gs_y_preds)

In [None]:
compare_metrics = pd.DataFrame({"baseline": baseline_metrics,
                                "clf_2": clf2_metrics,
                                "random search": rs_metrics,
                                "grid search": gs_metrics})

compare_metrics.plot.bar(figsize=(10, 8));

## 6. Saving and loading trained Machine Learning model

Two ways to save and load machine learning models:
1. With python's `pickle` module
2. With the `joblib` module

In [None]:
import pickle

#save the existing model to the file
pickle.dump(gs_clf, open('gs_random_search.pkl', 'wb'))

In [None]:
#Load the saved model
loaded_pickle_model = pickle.load(open("gs_random_search.pkl", 'rb'))


In [None]:
#Make some predictions
pickle_y_preds = loaded_pickle_model.predict(X_test)
evaluate_preds(y_test, pickle_y_preds)

**Using Joblib**

In [None]:
from joblib import dump, load

#Save the model
dump(gs_clf, "gs_random_search.joblib")

In [None]:
#Import the saved joblib model

loaded_joblib_model = load("gs_random_search.joblib")


In [None]:
joblib_y_preds = loaded_joblib_model.predict(X_test)
evaluate_preds(y_test, joblib_y_preds)


## 7. Putting it all together!

![](../images/Things_to_remember.png)

In [7]:
data = pd.read_csv('../data/car-sales-extended-missing-data.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Make           951 non-null    object 
 1   Colour         950 non-null    object 
 2   Odometer (KM)  950 non-null    float64
 3   Doors          950 non-null    float64
 4   Price          950 non-null    float64
dtypes: float64(3), object(2)
memory usage: 39.2+ KB


In [8]:
data.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [9]:
data.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


### Steps we want to do:
1. Fill the missing data
2. Convert data to numbers
3. Build a model on the data


In [38]:
#Getting the data ready
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

#Modelling
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV

# setup random seed
import numpy as np
np.random.seed(42)


# Import data and drop missing labels
data = pd.read_csv('../data/car-sales-extended-missing-data.csv')
data.dropna(subset=['Price'], inplace=True)

#Define different feature and transformer pipeline
categorical_features = ['Make', 'Colour']
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])

door_feature = ["Doors"]
door_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value=4))])

numeric_feature = ['Odometer (KM)']
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean"))])

#Set Preprocessing steps(fill missing values , then convert to numbers)
preprocessor = ColumnTransformer(
                        transformers = [
                            ("cat", categorical_transformer, categorical_features),
                            ("door", door_transformer, door_feature),
                            ('numeric', numeric_transformer, numeric_feature)
                       ])


#Creating preprocessing and modelling pipeline
model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ("model", RandomForestRegressor())])

#Split data
X = data.drop('Price', axis=1)
y = data['Price']
X_train, x_test, y_train, y_test = train_test_split(X,y, test_size=0.2)


#Fit and Score the model
model.fit(X_train, y_train)
model.score(x_test, y_test)



0.22188417408787875

In [17]:
preprocessor

It is possible to use `GridsearchCV` and `RandomSearchCV` with our `Pipeline` 

In [39]:
#Using GridsearchCV with our regression pipeline
from sklearn.model_selection import GridSearchCV
pipe_grid = {
    "preprocessor__numeric__imputer__strategy":["mean", "median"],
    "model__n_estimators": [100,1000],
    "model__max_depth": [None, 5],
    "model__max_features":["auto"],
    "model__min_samples_split":[2,4]
}

gs_model = GridSearchCV(model, pipe_grid, cv=5, verbose = 2)
gs_model.fit(X_train, y_train)

Fitting 5 folds for each of 16 candidates, totalling 80 fits
[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=2, model__n_estimators=100, preprocessor__numeric__imputer__strategy=mean; total time=   0.2s


  warn(
  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=2, model__n_estimators=100, preprocessor__numeric__imputer__strategy=mean; total time=   0.2s
[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=2, model__n_estimators=100, preprocessor__numeric__imputer__strategy=mean; total time=   0.2s


  warn(
  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=2, model__n_estimators=100, preprocessor__numeric__imputer__strategy=mean; total time=   0.2s
[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=2, model__n_estimators=100, preprocessor__numeric__imputer__strategy=mean; total time=   0.2s


  warn(
  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=2, model__n_estimators=100, preprocessor__numeric__imputer__strategy=median; total time=   0.3s
[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=2, model__n_estimators=100, preprocessor__numeric__imputer__strategy=median; total time=   0.2s


  warn(
  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=2, model__n_estimators=100, preprocessor__numeric__imputer__strategy=median; total time=   0.2s
[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=2, model__n_estimators=100, preprocessor__numeric__imputer__strategy=median; total time=   0.2s


  warn(
  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=2, model__n_estimators=100, preprocessor__numeric__imputer__strategy=median; total time=   0.2s


  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=2, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=mean; total time=   1.6s


  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=2, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=mean; total time=   1.5s


  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=2, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=mean; total time=   1.6s


  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=2, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=mean; total time=   1.5s


  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=2, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=mean; total time=   1.6s


  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=2, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=median; total time=   1.7s


  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=2, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=median; total time=   1.5s


  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=2, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=median; total time=   1.5s


  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=2, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=median; total time=   1.5s


  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=2, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=median; total time=   1.7s
[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=4, model__n_estimators=100, preprocessor__numeric__imputer__strategy=mean; total time=   0.1s


  warn(
  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=4, model__n_estimators=100, preprocessor__numeric__imputer__strategy=mean; total time=   0.2s
[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=4, model__n_estimators=100, preprocessor__numeric__imputer__strategy=mean; total time=   0.2s


  warn(
  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=4, model__n_estimators=100, preprocessor__numeric__imputer__strategy=mean; total time=   0.2s
[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=4, model__n_estimators=100, preprocessor__numeric__imputer__strategy=mean; total time=   0.1s


  warn(
  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=4, model__n_estimators=100, preprocessor__numeric__imputer__strategy=median; total time=   0.1s
[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=4, model__n_estimators=100, preprocessor__numeric__imputer__strategy=median; total time=   0.1s


  warn(
  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=4, model__n_estimators=100, preprocessor__numeric__imputer__strategy=median; total time=   0.2s
[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=4, model__n_estimators=100, preprocessor__numeric__imputer__strategy=median; total time=   0.1s


  warn(
  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=4, model__n_estimators=100, preprocessor__numeric__imputer__strategy=median; total time=   0.2s


  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=4, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=mean; total time=   1.4s


  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=4, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=mean; total time=   1.4s


  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=4, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=mean; total time=   1.4s


  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=4, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=mean; total time=   1.5s


  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=4, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=mean; total time=   1.4s


  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=4, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=median; total time=   1.4s


  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=4, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=median; total time=   1.5s


  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=4, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=median; total time=   1.8s


  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=4, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=median; total time=   1.5s


  warn(


[CV] END model__max_depth=None, model__max_features=auto, model__min_samples_split=4, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=median; total time=   1.6s
[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=2, model__n_estimators=100, preprocessor__numeric__imputer__strategy=mean; total time=   0.1s


  warn(
  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=2, model__n_estimators=100, preprocessor__numeric__imputer__strategy=mean; total time=   0.1s
[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=2, model__n_estimators=100, preprocessor__numeric__imputer__strategy=mean; total time=   0.1s


  warn(
  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=2, model__n_estimators=100, preprocessor__numeric__imputer__strategy=mean; total time=   0.1s
[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=2, model__n_estimators=100, preprocessor__numeric__imputer__strategy=mean; total time=   0.1s


  warn(
  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=2, model__n_estimators=100, preprocessor__numeric__imputer__strategy=median; total time=   0.1s
[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=2, model__n_estimators=100, preprocessor__numeric__imputer__strategy=median; total time=   0.1s


  warn(
  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=2, model__n_estimators=100, preprocessor__numeric__imputer__strategy=median; total time=   0.1s
[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=2, model__n_estimators=100, preprocessor__numeric__imputer__strategy=median; total time=   0.1s


  warn(
  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=2, model__n_estimators=100, preprocessor__numeric__imputer__strategy=median; total time=   0.1s


  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=2, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=mean; total time=   1.1s


  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=2, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=mean; total time=   1.4s


  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=2, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=mean; total time=   1.2s


  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=2, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=mean; total time=   1.1s


  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=2, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=mean; total time=   1.1s


  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=2, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=median; total time=   1.1s


  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=2, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=median; total time=   1.0s


  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=2, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=median; total time=   1.1s


  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=2, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=median; total time=   1.2s


  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=2, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=median; total time=   1.1s
[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=4, model__n_estimators=100, preprocessor__numeric__imputer__strategy=mean; total time=   0.1s


  warn(
  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=4, model__n_estimators=100, preprocessor__numeric__imputer__strategy=mean; total time=   0.1s
[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=4, model__n_estimators=100, preprocessor__numeric__imputer__strategy=mean; total time=   0.1s


  warn(
  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=4, model__n_estimators=100, preprocessor__numeric__imputer__strategy=mean; total time=   0.1s
[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=4, model__n_estimators=100, preprocessor__numeric__imputer__strategy=mean; total time=   0.1s


  warn(
  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=4, model__n_estimators=100, preprocessor__numeric__imputer__strategy=median; total time=   0.1s
[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=4, model__n_estimators=100, preprocessor__numeric__imputer__strategy=median; total time=   0.1s


  warn(
  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=4, model__n_estimators=100, preprocessor__numeric__imputer__strategy=median; total time=   0.1s
[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=4, model__n_estimators=100, preprocessor__numeric__imputer__strategy=median; total time=   0.1s


  warn(
  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=4, model__n_estimators=100, preprocessor__numeric__imputer__strategy=median; total time=   0.1s


  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=4, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=mean; total time=   1.1s


  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=4, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=mean; total time=   1.3s


  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=4, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=mean; total time=   1.2s


  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=4, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=mean; total time=   1.1s


  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=4, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=mean; total time=   1.3s


  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=4, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=median; total time=   1.2s


  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=4, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=median; total time=   1.2s


  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=4, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=median; total time=   1.1s


  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=4, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=median; total time=   1.1s


  warn(


[CV] END model__max_depth=5, model__max_features=auto, model__min_samples_split=4, model__n_estimators=1000, preprocessor__numeric__imputer__strategy=median; total time=   1.2s


  warn(


In [40]:
gs_model.score(x_test, y_test)

0.3339554263158365