# Introduction to Scikit-Learn (sklearn)

This notebook demonstrates some of the most useful functions of the beautiful Scikit-Learn library.

What we're going to cover:

0. An end-to-end scikit-learn workflow
1. Get the data ready to be used in machine learning models
2. Choose the right estimator/algorithm for our problems
3. Fit the model/algorithm and use it to make predictions on our data
4. Evaluating the model 
5. Improve the model
6. Save and reload your trained model
7. Putting it all together!

Useful links
* [scikit-learn class video notes](https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn-video.ipynb)
* [scikit-learn condensed class notes](https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/scikit-learn-what-were-covering.ipynb)
* [Quick Machine Learning Overview](https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/introduction-to-scikit-learn.ipynb)
* [scikit-learn workflow example](https://github.com/mrdbourke/zero-to-mastery-ml/blob/master/section-2-data-science-and-ml-tools/scikit-learn-workflow-example.ipynb)
* [Scikit-Learn documentation](https://scikit-learn.org/stable/user_guide.html)
* [Algorithm Cheat Sheet](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)

In [None]:
what_were_covering = ["0. An end-to-end scikit-learn workflow",
"1. Get the data ready to be used in machine learning models",
"2. Choose the right estimator/algorithm for our problems",
"3. Fit the model/algorithm and use it to make predictions on our data",
"4. Evaluating the model",
"5. Improve the model",
"6. Save and reload your trained model",
"7. Putting it all together!"]

In [None]:
what_were_covering

In [None]:
# Standard Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## 0. An end-to-end Scikit-Learn Workflow

In [None]:
what_were_covering[0] 

In [None]:
# 1. Get the data ready
heart_disease = pd.read_csv("../data/heart-disease.csv")
heart_disease

In [None]:
# Create X, the feature data
X = heart_disease.drop("target", axis=1)

# Create y (labels)
y = heart_disease['target']

In [None]:
# 2. Choose the right model and hyperparameters
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()

# we'll keep the default hyperparameters
clf.get_params()

In [None]:
# 3. Fit the model to the training data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
clf.fit(X_train, y_train);

In [None]:
# Use the model to make a prediction

# y_label = clf.predict(np.array([0,2,3,4]))

# The above doesn't work because it is not the correct shape.


y_preds = clf.predict(X_test)
y_preds

In [None]:
y_test

In [None]:
# 4. Evaluate the model on the training data and test data
clf.score(X_train, y_train)

In [None]:
clf.score(X_test, y_test)

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(classification_report(y_test, y_preds))

In [None]:
confusion_matrix(y_test, y_preds)

In [None]:
accuracy_score(y_test, y_preds)

In [None]:
# 5. Improve a model
# Try different values of n_estimators

np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    clf = RandomForestClassifier(n_estimators = i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {clf.score(X_test, y_test)* 100:.2f}%")
    print(" ")

In [None]:
# 6. Save a model and load it
import pickle

# Use a context manager so the file is closed after writing
with open("random_forest_model_1.pkl", "wb") as f:
    pickle.dump(clf, f)

In [None]:
# Use a context manager so the file is closed after reading
with open("random_forest_model_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model.score(X_test, y_test)

## Optional: Debugging Warnings in Jupyter

In [None]:
import warnings

In [None]:
warnings.filterwarnings("ignore")
warnings.filterwarnings("default")

## 1. Getting our data ready to be used with machine learning

###### Three main things we need to do to get the data ready:

1. Split the data into features and labels (usually `X` and `y`)
2. Fill (also called imputing) or disregard missing values
3. Convert non-numerical values to numerical values (also called feature encoding)

In [None]:
heart_disease.head()

In [None]:
X = heart_disease.drop('target', axis=1)
X.head()

In [None]:
y = heart_disease['target']
y.head()

In [None]:
print("len(X):", len(X))
print("len(y):", len(y))

In [None]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
X.shape[0] *0.8

In [None]:
# Final notes
# Getting your Data Ready:
# Clean Data -> Transform Data -> Reduce Data


### 1.1 Make sure it's all numerical

In [None]:
car_sales = pd.read_csv("../data/car-sales-extended.csv")
car_sales.head()

In [None]:
len(car_sales)

In [None]:
car_sales.dtypes

In [None]:
# split the data into X and y
X = car_sales.drop('Price', axis=1)
y = car_sales['Price']

# Split into training and test.
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

In [None]:
# Build machine learning model
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
"""
This section isn't going to work because the strings aren't able to go into the model
I have commented it out so that I can rerun the entire notebook.


model.fit(X_train, y_train)
model.score(X_test, y_test)
"""

In [None]:
# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ['Make', "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder='passthrough')

transformed_X = transformer.fit_transform(X)
transformed_X

In [None]:
pd.DataFrame(transformed_X)

In [None]:
X.head()

In [None]:
# Another option to encode variables is pd.get_dummies()
dummies = pd.get_dummies(car_sales[["Make", "Colour", "Doors"]])
dummies

In [None]:
# Let's refit the model
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)

model.fit(X_train, y_train)

In [None]:
model.score(X_test, y_test)

### 1.2 What if there were missing values?

1. Fill them with some value (imputation)
2. Remove the samples with missing data altogether
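
The two options above can be sketched on a tiny example. This is a minimal illustration that assumes a made-up DataFrame called `toy`; its values are hypothetical and are not taken from the car sales data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "Colour": ["Red", None, "Blue", "Red"],
    "Odometer (KM)": [10000.0, 20000.0, np.nan, 30000.0],
})

# Option 1: fill (impute) the missing values
filled = toy.copy()
filled["Colour"] = filled["Colour"].fillna("missing")
filled["Odometer (KM)"] = filled["Odometer (KM)"].fillna(filled["Odometer (KM)"].mean())

# Option 2: drop every row that contains a missing value
dropped = toy.dropna()

print(filled)
print(len(toy), len(dropped))
```

Filling keeps every row but introduces values the data never contained, while dropping keeps only observed values at the cost of fewer samples.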

In [None]:
# Import car sales missing data
# this data set has some missing values
car_sales_missing = pd.read_csv("../data/car-sales-extended-missing-data.csv")
car_sales_missing.head()

In [None]:
car_sales_missing.isna().sum()

In [None]:
#Create X and y
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing['Price']

In [None]:
# Let's try and convert our data to numbers
# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ['Make', "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder='passthrough')

transformed_X = transformer.fit_transform(X)
transformed_X

In [None]:
car_sales_missing

#### Option 1: Fill missing data with pandas

In [None]:
# Fix all of the columns except for the missing price, which we will delete later.
# Assign the result back instead of calling fillna(inplace=True) on a column,
# which may not update the DataFrame in recent pandas versions.

# Fill the "Make" column
car_sales_missing["Make"] = car_sales_missing["Make"].fillna("missing")

# Fill the "Colour" column
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna("missing")

# Fill the "Odometer (KM)" column
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(car_sales_missing["Odometer (KM)"].mean())

# Fill the "Doors" column
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(4)

In [None]:
# Check our dataframe again
car_sales_missing.isna().sum()

In [None]:
# Remove rows with missing price value
car_sales_missing.dropna(inplace=True)

In [None]:
car_sales_missing.isna().sum()

In [None]:
len(car_sales_missing)

In [None]:
X = car_sales_missing.drop("Price", axis=1)
y = car_sales_missing["Price"]

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ['Make', "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot", one_hot, categorical_features)], remainder='passthrough')

# Transform the features only (X), so the "Price" label is not passed through as a feature
transformed_X = transformer.fit_transform(X)
transformed_X

#### Option 2: Fill missing values with Scikit-Learn

In [None]:
car_sales_missing = pd.read_csv("../data/car-sales-extended-missing-data.csv")
car_sales_missing.head()

In [None]:
car_sales_missing.isna().sum()

In [None]:
len(car_sales_missing)

In [None]:
# Drop the rows with no labels
car_sales_missing.dropna(subset=["Price"], inplace=True)
car_sales_missing.isna().sum()

In [None]:
# Split into X and y
X = car_sales_missing.drop("Price", axis = 1)
y = car_sales_missing["Price"]

In [None]:
X.isna().sum()

In [None]:
# Fill missing values with Scikit-Learn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Fill categorical with 'missing', and numerical values with the column mean...

"""
Normally, we would first need to split our data. We don't want to include the mean values of the test data. 
We want the replacement mean value to include only training data. Including the test data will
influence the results of testing the test data.
"""

cat_imputer = SimpleImputer(strategy="constant", fill_value = "missing" )
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
num_imputer= SimpleImputer(strategy='mean')

# Define Columns
cat_features = ["Make", "Colour"]
door_features = ["Doors"]
num_features = ["Odometer (KM)"]

# Create an imputer (something that fills missing data)
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, cat_features), 
    ("door_imputer", door_imputer, door_features),
    ("num_imputer", num_imputer, num_features)
])

# Transform the data
filled_X = imputer.fit_transform(X)
filled_X

In [None]:
car_sales_filled = pd.DataFrame(filled_X, columns = ["Make", "Colour", "Doors", "Odometer (KM)"])
car_sales_filled.head()

In [None]:
car_sales_filled.isna().sum()

In [None]:
# Turn the categories into numbers
# using OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]
one_hot = OneHotEncoder()
transformer = ColumnTransformer([ ("one_hot", one_hot, categorical_features) ] ,remainder="passthrough")
transformed_X = transformer.fit_transform(car_sales_filled)
transformed_X

In [None]:
# Now we've got our data as numbers and filled (no missing values)
# Let's fit a model
np.random.seed(42)
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)

model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

In [None]:
# this model has done slightly worse.
len(car_sales_filled), len(car_sales)

In [None]:
"""Most datasets won't be in a form that's ready to use with models.
Some take more prep than others.
Usually your data:
    - must be numerical
        * categorical data needs to be encoded
        * numerical data should also be scaled
    - cannot have missing values
        * missing values need to be filled with imputation
        * the type of data determines the type of replacement value
    """;

## 2. Choosing the Right Model for Your Data

Some things to note:

* Sklearn refers to machine learning models/algorithms as estimators
* Classification problem = predicting a category (heart disease or not)
    * Sometimes you'll see `clf` (classifier) used as a classification estimator
* Regression problem = predicting a number (selling price of a car)


* [algorithm cheat-sheet](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)
* [sklearn built in datasets](https://scikit-learn.org/stable/datasets/toy_dataset.html)

In [None]:
### 2.1 Picking a machine learning model for a regression problem.

# Let's use the California Housing dataset
# Get California Housing dataset

from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing()

housing

In [None]:
housing_df = pd.DataFrame(housing["data"], columns=housing["feature_names"])
housing_df

In [None]:
housing_df['target'] = housing['target']

In [None]:
housing_df

In [None]:
# housing_df = housing_df.drop("MedHouseVal", inplace=True, axis = 1)

In [None]:
housing_df.head()

In [None]:
# Import Algorithm
from sklearn.linear_model import Ridge

# Set up random seed
np.random.seed(42)

# Create the data
X = housing_df.drop("target", axis=1)
y = housing_df['target']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate and fit the model (on the training set)
model = Ridge()
model.fit(X_train, y_train)

# Check the score of the model
model.score(X_test, y_test)

In [None]:
# model.score returns R^2 which is to say "How well do these features predict the target variable?"
# R^2 is known as the coefficient of determination

In [None]:
# HW Pick a different model and try it out
from sklearn import linear_model
reg = linear_model.Lasso(alpha=0.1)
reg.fit(X_train, y_train)
        
reg.score(X_test, y_test)

What if 'Ridge' didn't work or the score didn't fit our needs?

We could always try a different model...

How about we try an ensemble model? An ensemble is a combination of smaller models that tries to make better predictions than a single model.

Sklearn's ensemble models can be found here: [Ensemble Models](https://scikit-learn.org/stable/modules/ensemble.html)
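
To make the idea of combining models concrete, here is a toy sketch in which the ensemble prediction is the average of several individual predictions (a random forest regressor, for example, averages the predictions of its trees). The true values and the three sets of predictions are made up for illustration, and they were chosen so that the individual errors partly cancel; averaging is not guaranteed to help this much in general.

```python
import numpy as np

# Made-up true values and predictions from three hypothetical models
y_true = np.array([1.0, 2.0, 3.0])
model_preds = np.array([
    [1.3, 1.8, 3.2],
    [0.8, 2.3, 2.7],
    [1.1, 1.9, 3.3],
])

# The ensemble prediction is the average of the individual predictions
ensemble_pred = model_preds.mean(axis=0)

# Mean absolute error of each model, and of the averaged prediction
individual_mae = np.abs(model_preds - y_true).mean(axis=1)
ensemble_mae = np.abs(ensemble_pred - y_true).mean()

print("Individual MAE:", individual_mae)
print("Ensemble MAE:", ensemble_mae)
```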

In [None]:
# Import the RandomForestRegressor model class from the ensemble module
from sklearn.ensemble import RandomForestRegressor

# setup random seed
np.random.seed(42)

# Create the data
X = housing_df.drop("target", axis=1)
y = housing_df['target']

# Split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)


# Create the random forest model
model = RandomForestRegressor()
model.fit(X_train,y_train)

# Score the Model
model.score(X_test,y_test)

### 2.2 Picking a machine learning model for a classification problem

Let's go to the map: [machine learning map](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)

In [None]:
# Let's get some data for a classification problem.
heart_disease = pd.read_csv("../data/heart-disease.csv")
heart_disease.head()

In [None]:
len(heart_disease)

Consulting the map, it says to try `LinearSVC`.

In [None]:
# Import the LinearSVC estimator class
from sklearn.svm import LinearSVC

# setup random seed
np.random.seed(42)

# Make the data
X = heart_disease.drop("target", axis = 1)
y = heart_disease['target']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# instantiate linearSVC
clf = LinearSVC(max_iter=10000)
clf.fit(X_train, y_train)

# Evaluate the LinearSVC
clf.score(X_test, y_test)

In [None]:
heart_disease['target'].value_counts()

In [None]:
# Import the random Forest classifier
from sklearn.ensemble import RandomForestClassifier

# setup random seed
np.random.seed(42)

# Make the data
X = heart_disease.drop("target", axis = 1)
y = heart_disease['target']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# instantiate Random Forest Classifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)

# Evaluate the Random Forest Classifier
clf.score(X_test, y_test)

Tidbit:

1. If you have structured data, use ensemble methods.
2. If you have unstructured data, use deep learning or transfer learning.

In [None]:
heart_disease

In [None]:
what_were_covering

## 3. Fit the model/algorithm on our data and use it to make predictions

### 3.1 Fitting the model to the data

Different names for:
* `X` = features, feature variables, data
* `y` = labels, targets, target variables

In [None]:
# Import the random Forest classifier
from sklearn.ensemble import RandomForestClassifier

# setup random seed
np.random.seed(42)

# Make the data
X = heart_disease.drop("target", axis = 1)
y = heart_disease['target']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# instantiate Random Forest Classifier
clf = RandomForestClassifier()

# Fit the model to the data (training the machine learning model)
clf.fit(X_train, y_train)

# Evaluate the Random Forest Classifier (use the patterns the model has learned)
clf.score(X_test, y_test)

In [None]:
X.head()

In [None]:
y.head()

### 3.2 Make predictions using a machine learning model

2 ways to make predictions:
1. `predict()`
2. `predict_proba()`
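
The relationship between the two can be sketched with plain NumPy. For a random forest classifier, the label returned by `predict()` is the class with the highest probability from `predict_proba()`. The probabilities below are illustrative values, and `classes` is a stand-in for the fitted model's `classes_` attribute.

```python
import numpy as np

# Illustrative probabilities in the shape predict_proba() returns:
# one row per sample, one column per class (column 0 = label 0, column 1 = label 1)
probs = np.array([
    [0.89, 0.11],
    [0.49, 0.51],
    [0.43, 0.57],
    [0.84, 0.16],
    [0.18, 0.82],
])
classes = np.array([0, 1])  # stands in for clf.classes_

# Picking the most probable class in each row gives the hard labels
labels = classes[np.argmax(probs, axis=1)]
print(labels)
```

Probabilities close to 0.5 (such as the second row) show where the model is least certain, which the hard labels alone do not reveal.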

In [None]:
# Use a trained model to make predictions
# clf.predict(np.array([1,7,8,3,4])) # This doesn't work because it doesn't have the shape of our data


In [None]:
clf.predict(X_test)

In [None]:
np.array(y_test)

In [None]:
# Compare predictions to truth labels to evaluate the model
y_preds = clf.predict(X_test)
np.mean(y_preds == y_test)

In [None]:
clf.score(X_test, y_test)

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_preds)

Make predictions with `predict_proba()`


In [None]:
# predict_proba() returns probabilities of a classification label
clf.predict_proba(X_test[:5])

In [None]:
# Example output of predict_proba() and the label each row maps to:
# [[0.89, 0.11] -> 0,
#  [0.49, 0.51] -> 1,
#  [0.43, 0.57] -> 1,
#  [0.84, 0.16] -> 0,
#  [0.18, 0.82] -> 1]
# Each row has two values: the first is the probability of label 0,
# the second is the probability of label 1.
# For example, [0.89, 0.11] maps to label 0, so the five rows map to [0, 1, 1, 0, 1].

In [None]:
# Let's predict() on the same data
clf.predict(X_test[:5])

In [None]:
heart_disease['target'].value_counts()

`predict()` can also be used for regression models.

In [None]:
housing_df.head()

In [None]:
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

# Create the data
X = housing_df.drop("target", axis = 1)
y = housing_df['target']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

# Create model instance
model = RandomForestRegressor()

#Fit the model to the data
model.fit(X_train, y_train)

# Make our prediction
y_preds = model.predict(X_test)

In [None]:
y_preds[:10]

In [None]:
np.array(y_test[:10])

In [None]:
# Compare the predictions to the truth.
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_test, y_preds)

In [None]:
housing_df['target']

## 4. Evaluating A Machine Learning Model

There are 3 ways to evaluate Scikit-Learn models/estimators:
1. Estimator's built-in `score()` method
2. The `scoring` parameter
3. Problem-specific metric functions

You can read more about these here: https://scikit-learn.org/stable/modules/model_evaluation.html

### 4.1 Evaluating a model with the `score` method

In [None]:
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

# Create X & y
X = heart_disease.drop("target", axis = 1)
y = heart_disease['target']

# Create train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create classifier model instance
clf = RandomForestClassifier()

# Fit classifier to training data
clf.fit(X_train, y_train)

In [None]:
# The highest value for the score method is 1.0
# The lowest is 0.0
clf.score(X_train, y_train)

In [None]:
clf.score(X_test, y_test)

Let's use the `score` method on our regression problem

In [None]:
housing_df.head()

In [None]:
# import the model
from sklearn.ensemble import RandomForestRegressor

# split the data into X & y
X = housing_df.drop("target", axis = 1)
y = housing_df['target']

# split the data into test and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# instantiate the model
reg = RandomForestRegressor(n_estimators=500)

# fit the model
reg.fit(X_train, y_train)



In [None]:
# score the model
reg.score(X_train, y_train), reg.score(X_test, y_test)

In [None]:
y_test.mean()

### 4.2 Evaluating a Model using the `scoring` Parameter

In [None]:
from sklearn.model_selection import cross_val_score

from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

# Create X & y
X = heart_disease.drop("target", axis = 1)
y = heart_disease['target']

# Create train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create classifier model instance
clf = RandomForestClassifier()

# Fit classifier to training data
clf.fit(X_train, y_train);

In [None]:
clf.score(X_test, y_test)

In [None]:
cross_val_score(clf, X, y)

In [None]:
np.random.seed(42)

#single training and test split score
clf_single_score = clf.score(X_test, y_test)

clf_cross_val_score = np.mean(cross_val_score(clf, X, y))

# Compare the two
clf_single_score, clf_cross_val_score

In [None]:
# Default scoring parameter of classifier = mean accuracy
# clf.score()

In [None]:
# Scoring parameter set to None by default
cross_val_score(clf, X, y, scoring = None)

What is the classifier default score?
 - Mean accuracy (number of correct predictions)/(total number of predictions) from 0 -> 1
 
What is the regression default score?
- Coefficient of determination, aka R^2: how well the features explain the target. It is at most 1 and usually falls between 0 and 1, but it can be negative for a very poor model.
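
As a rough check of these defaults, the sketch below compares `score()` with the matching metric functions. The synthetic data and the `clf_demo`/`reg_demo` models are assumptions made for illustration and are not part of this notebook's workflow.

```python
import numpy as np
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score, r2_score

# Synthetic data, used only for illustration
X_clf, y_clf = make_classification(n_samples=200, n_features=5, random_state=42)
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

clf_demo = LogisticRegression(max_iter=1000).fit(X_clf, y_clf)
reg_demo = LinearRegression().fit(X_reg, y_reg)

# Classifier: score() matches accuracy_score()
clf_score = clf_demo.score(X_clf, y_clf)
clf_accuracy = accuracy_score(y_clf, clf_demo.predict(X_clf))

# Regressor: score() matches r2_score()
reg_score = reg_demo.score(X_reg, y_reg)
reg_r2 = r2_score(y_reg, reg_demo.predict(X_reg))

print(clf_score, clf_accuracy)
print(reg_score, reg_r2)
```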

#### 4.2.1 Classification model evaluation metrics

1. Accuracy
2. Area under ROC curve
3. Confusion matrix
4. Classification Report


**Accuracy**

In [None]:
heart_disease.head()

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

X = heart_disease.drop('target', axis = 1)
y = heart_disease['target']

clf = RandomForestClassifier()
# Store the scores under a new name so the cross_val_score function isn't overwritten
cv_scores = cross_val_score(clf, X, y, cv=5)

In [None]:
np.mean(cv_scores)

In [None]:
print(f"Heart Disease Classifier Cross-Validated Accuracy: {np.mean(cv_scores) * 100:.2f}%")

**Area under the receiver operating characteristic curve (AUC/ROC)**

* Area under curve (AUC)
* ROC curve

ROC curves are a comparison of a model's true positive rate (tpr) versus its false positive rate (fpr) at different classification thresholds.



* True positive = model predicts 1 when truth is 1
* False positive = model predicts 1 when truth is 0
* True negative = model predicts 0 when truth is 0
* False negative = model predicts 0 when truth is 1
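
The definitions above can be turned into rates by hand. The sketch below assumes made-up labels and probabilities (`y_true_demo`, `y_prob_demo`) and a hypothetical helper `rates_at()`; each threshold yields one (fpr, tpr) point of the kind an ROC curve plots.

```python
import numpy as np

# Made-up truth labels and predicted probabilities of class 1
y_true_demo = np.array([0, 0, 1, 1, 0, 1])
y_prob_demo = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.9])

def rates_at(threshold):
    """Return (tpr, fpr) when predicting 1 for probabilities >= threshold."""
    y_hat = (y_prob_demo >= threshold).astype(int)
    tp = np.sum((y_hat == 1) & (y_true_demo == 1))
    fp = np.sum((y_hat == 1) & (y_true_demo == 0))
    fn = np.sum((y_hat == 0) & (y_true_demo == 1))
    tn = np.sum((y_hat == 0) & (y_true_demo == 0))
    # tpr = share of actual positives caught, fpr = share of actual negatives flagged
    return tp / (tp + fn), fp / (fp + tn)

for threshold in (0.3, 0.5, 0.75):
    tpr_demo, fpr_demo = rates_at(threshold)
    print(f"threshold={threshold}: tpr={tpr_demo:.2f}, fpr={fpr_demo:.2f}")
```

Lowering the threshold catches more true positives but also lets in more false positives, which is the trade-off the ROC curve visualizes.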

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
from sklearn.metrics import roc_curve


clf.fit(X_train, y_train)

# Make predictions with probabilities
y_probs = clf.predict_proba(X_test)

y_probs[:10]

In [None]:
y_probs_positive = y_probs[:,1]
y_probs_positive

In [None]:
# calculate fpr, tpr and thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_probs_positive)

# Check the false positive rates
fpr

In [None]:
# Create a function for plotting ROC curves
import matplotlib.pyplot as plt

def plot_roc_curve(fpr,tpr):
    """
    Plot the ROC curve given the false positive rate (fpr)
    and the true positive rate (tpr) of a model.
    """
    # Plot roc curve
    plt.plot(fpr, tpr, color="orange", label = "ROC")
    # Plot line with no predictive power (baseline)
    plt.plot([0,1], [0,1],color = "darkblue", linestyle="--", label="Guessing")
    
    # Customize the plot
    plt.xlabel("False positive rate (fpr)")
    plt.ylabel("True positive rate (tpr)")
    plt.title("Receiver Operating Characteristic (ROC) Curve")
    plt.legend()
    plt.show()

plot_roc_curve(fpr, tpr)

In [None]:
from sklearn.metrics import roc_auc_score

roc_auc_score(y_test, y_probs_positive)

In [None]:
# Plot perfect ROC curve and AUC score
fpr, tpr, thresholds = roc_curve(y_test, y_test)
plot_roc_curve(fpr, tpr)

In [None]:
roc_auc_score(y_test, y_test)

**Confusion Matrix**

A confusion matrix is a quick way to compare the labels a model predicts and the actual labels it was supposed to predict.

In essence, it gives you an idea of where the model is getting confused.

In [None]:
from sklearn.metrics import confusion_matrix

y_preds = clf.predict(X_test)
confusion_matrix(y_test, y_preds)

In [None]:
# Visualize confusion matrix with pd.crosstab()
# pd.crosstab(y_test, y_preds, rownames= ["Actual Label"], colnames=["Predicted"])
pd.crosstab(y_test,
           y_preds,
           rownames=["Actual"],
           colnames=['Predicted'])

|  |Predicted 0 | Predicted 1 |
|----------- | ----------- | ----------- |
|**Actual 0** | True Negatives | False Positives  |
|**Actual 1** | False Negatives | True Positives |


True Positive = model predicts 1 when truth is 1

False Positive = model predicts 1 when truth is 0

True Negative = model predicts 0 when truth is 0

False Negative = model predicts 0 when truth is 1
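
As a small sketch of how the four cells are counted, the loop below builds a 2x2 matrix by hand with rows as actual labels and columns as predicted labels, the same layout as the table above. The `actual` and `predicted` arrays are made up for illustration.

```python
import numpy as np

# Made-up labels for illustration
actual = np.array([0, 0, 1, 1, 1, 0, 1, 0])
predicted = np.array([0, 1, 1, 0, 1, 0, 1, 0])

# Rows are actual labels, columns are predicted labels
matrix = np.zeros((2, 2), dtype=int)
for a, p in zip(actual, predicted):
    matrix[a, p] += 1

# Flattening row by row gives TN, FP, FN, TP
tn, fp, fn, tp = matrix.ravel()
print(matrix)
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

The four counts always add up to the number of samples, which is a quick sanity check on any confusion matrix.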

In [None]:
23 + 6 + 6 + 26

In [None]:
len(X_test)

In [None]:
# How to install a conda package into the current environment from a Jupyter notebook
"""import sys
! conda install --yes --prefix {sys.prefix} seaborn"""

In [None]:
# Make our confusion matrix more visual with seaborn's heatmap()
import seaborn as sns

# Set the font scale
sns.set(font_scale = 1.5)

#Create a confusion matrix
conf_mat = confusion_matrix(y_test, y_preds)

# Plot it using Seaborn
sns.heatmap(conf_mat);

### Creating a confusion matrix using scikit-learn

To use the new methods of creating a confusion matrix with Scikit-Learn, you will need sklearn version 1.0 or later.

In [None]:
import sklearn
sklearn.__version__

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(estimator=clf, X=X, y=y)

In [None]:
ConfusionMatrixDisplay.from_predictions(y_true=y_test, y_pred=y_preds)

**Classification report**

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_preds))

**Precision**
: Indicates the proportion of positive identifications (model predicted class 1) which were actually correct. A model which produces no false positives has a precision of 1.0.

**Recall**
: Indicates the proportion of actual positives which were correctly classified. A model which produces no false negatives has a recall of 1.0.

**F1 score** 
: A combination of precision and recall. A perfect model achieves an F1 score of 1.0.

**Support** 
: The number of samples each metric was calculated on.

**Accuracy** 
: The accuracy of the model in decimal form. Perfect accuracy is equal to 1.0.

**Macro avg** 
: Short for macro average, the average precision, recall and F1 score between classes. Macro avg doesn't take class imbalance into account, so if you do have class imbalances, pay attention to this metric.

**Weighted avg** 
: Short for weighted average, the weighted average precision, recall and F1 score between classes. Weighted means each metric is calculated with respect to how many samples there are in each class. This metric will favour the majority class (e.g. will give a high value when one class out performs another due to having more samples).
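
The definitions above reduce to a few lines of arithmetic. This sketch assumes hypothetical confusion-matrix counts for the positive class; the numbers are made up and do not come from the heart disease model.

```python
# Hypothetical confusion-matrix counts for the positive class (label 1)
tp, fp, fn, tn = 30, 10, 20, 40

precision = tp / (tp + fp)  # of everything predicted 1, how much was truly 1
recall = tp / (tp + fn)     # of everything truly 1, how much was predicted 1
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}, accuracy={accuracy:.2f}")
```

Because F1 is a harmonic mean, it sits closer to the lower of precision and recall than a simple average would.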

In [None]:
# Where precision and recall become valuable.
disease_true = np.zeros(10000)
disease_true[0] = 1 #only one positive case

disease_preds = np.zeros(10000) # model predicts every case as 0

pd.DataFrame(classification_report(disease_true, disease_preds, output_dict=True))

To summarize classification metrics:

* **Accuracy** is a good measure to start with if all classes are balanced (e.g. the same number of samples in each class).
* **Precision** and **Recall** become more important when classes are imbalanced.
* If false positive predictions are worse than false negatives, aim for higher **precision**.
* If false negative predictions are worse than false positives, aim for higher **recall**.

#### 4.2.2 Regression model evaluation metrics

[Regression Model evaluation metrics documentation](https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics)

The ones we're going to cover are:
1. R^2 (coefficient of determination): the proportion of the variation in the dependent variable that is predictable from the independent variables.
2. Mean Absolute Error (MAE)
3. Mean Squared Error (MSE)

**R^2**

What R-squared does: compares your model's predictions to the mean of the targets. Values can range from negative infinity (a very poor model) to 1. For example, if all your model does is predict the mean of the targets, its R^2 value would be 0. And if your model perfectly predicts a range of numbers, its R^2 value would be 1.
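
The three cases described above can be checked with a small by-hand calculation. This is a sketch that assumes a hypothetical helper `r2_by_hand()` and made-up targets; it uses the standard formula of one minus the ratio of squared residuals to total squared deviation from the mean.

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - (sum of squared residuals) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

targets = np.array([1.0, 2.0, 3.0, 4.0])

mean_r2 = r2_by_hand(targets, np.full(4, targets.mean()))  # always predicting the mean
perfect_r2 = r2_by_hand(targets, targets)                  # perfect predictions
reversed_r2 = r2_by_hand(targets, targets[::-1])           # predictions in reverse order

print(mean_r2, perfect_r2, reversed_r2)
```

The reversed predictions are worse than simply predicting the mean, which is why their R^2 comes out negative.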

In [None]:
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

X = housing_df.drop("target", axis=1)
y = housing_df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model=RandomForestRegressor(n_estimators=100)
model.fit(X_train, y_train)


In [None]:
model.score(X_test, y_test)

In [None]:
housing_df.head()

In [None]:
y_test

In [None]:
y_test.mean()

In [None]:
from sklearn.metrics import r2_score

# fill an array with y_test mean
y_test_mean = np.full(len(y_test), y_test.mean())



In [None]:
y_test_mean[:10]

In [None]:
r2_score(y_true=y_test, y_pred=y_test_mean)

In [None]:
r2_score(y_true=y_test, y_pred=y_test)

In [None]:
### Practice MSE and MAE

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error


X = housing_df.drop("target", axis=1)
y = housing_df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = RandomForestRegressor()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

print(f"MSE: {mean_squared_error(y_true=y_test, y_pred = y_pred)}")
print(f"MAE: {mean_absolute_error(y_true=y_test, y_pred = y_pred)}")

**Mean Absolute Error (MAE)**

MAE is the average of the absolute differences between predictions and actual values.

It gives you an idea of how wrong your model's predictions are.

In [None]:
# MAE
from sklearn.metrics import mean_absolute_error

y_preds = model.predict(X_test)
mae = mean_absolute_error(y_test, y_preds)
print(mae)

In [None]:
y_preds

In [None]:
y_test

**MAE**: on average, each predicted value (`y_preds`) is plus or minus the MAE away from the actual value (`y_test`).

In [None]:
df = pd.DataFrame(data={"actual values": y_test,
                       "predicted values": y_preds})
df["differences"] = df["predicted values"] - df["actual values"]

In [None]:
df.head()

In [None]:
# MAE using formulas and differences
np.abs(df["differences"]).mean()

In [None]:
# Trying out the Mean Squared Error on my own.

(df["differences"]**2).mean()

**Mean Squared Error (MSE)**

MSE is the mean of the square of the errors between the predicted values and the actual values.

In [None]:
from sklearn.metrics import mean_squared_error

y_preds = model.predict(X_test)
mse = mean_squared_error(y_test, y_preds)
mse

In [None]:
df["squared_differences"] = np.square(df['differences'])
df.head()

In [None]:
# Calculate MSE by Hand
squared = np.square(df['differences'])
squared.mean()

In [None]:
df_large_error = df.copy()
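# Use a single .iloc call; chained indexing such as .iloc[0]["col"] = ... may not modify the DataFrame
df_large_error.iloc[0, df_large_error.columns.get_loc("squared_differences")] = 16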

In [None]:
df_large_error["squared_differences"].mean()

In [None]:
df_large_error.iloc[1:100] = 20
df_large_error.head()

In [None]:
df_large_error["squared_differences"].mean()

## Which regression metric should you use?

* **R^2** is similar to accuracy. It gives you a quick indication of how well your model might be doing. Generrally, the closer your **R^2** value is to 1.0 the better the model. But it doesn't really tell exactly how wrong your model is in terms of how far off each prediction is.
* **MAE** gives a better indication of how far off each of your model's predictions are on average.
* As for **MAE** or **MSE**, because of the way MSE is calculated, squaring the differences between predicted values and actual values, it amplifies larger differences. Let's say we're predicting the value of houses (which we are)
    * Pay more attention to **MAE**: When being 10,000 off is _twice_ as bad as being 5,000 off.
    * Pay more attention to **MSE**: When being 10,000 off is _more_ than twice as bad as being 5,000 off.
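
To make the "amplifies larger differences" point concrete, here is a small sketch with made-up house price errors (the numbers are purely illustrative, not taken from the housing dataset). Doubling every error doubles the MAE but quadruples the MSE.

```python
import numpy as np

# Hypothetical errors (prediction minus actual), in dollars.
small_errors = np.array([5_000.0, -5_000.0, 5_000.0, -5_000.0])
large_errors = small_errors * 2  # every prediction is now 10,000 off


def mae(errors):
    """Mean of the absolute errors."""
    return np.abs(errors).mean()


def mse(errors):
    """Mean of the squared errors."""
    return np.square(errors).mean()


mae_ratio = mae(large_errors) / mae(small_errors)
mse_ratio = mse(large_errors) / mse(small_errors)

print(f"MAE grows by a factor of {mae_ratio:.1f}")
print(f"MSE grows by a factor of {mse_ratio:.1f}")
```

So if large misses should hurt disproportionately, MSE reflects that, while MAE treats every dollar of error the same.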

# Machine Learning Model Evaluation

Evaluating the results of a machine learning model is as important as building one.

But just like how different problems have different machine learning models, different machine learning models have different evaluation metrics.

Below are some of the most important evaluation metrics you'll want to look into for classification and regression models.

## Classification Model Evaluation Metrics/Techniques

* **Accuracy** - The accuracy of the model in decimal form. Perfect accuracy is equal to 1.0.
* **Precision** - Indicates the proportion of positive identifications (model predicted class 1) which were actually correct. A model which produces no false positives has a precision of 1.0.
* **Recall** - Indicates the proportion of actual positives which were correctly classified. A model which produces no false negatives has a recall of 1.0.
* **F1 score** - A combination of precision and recall. A perfect model achieves an F1 score of 1.0.
* **Confusion matrix** - Compares the predicted values with the true values in a table. If the model is 100% correct, all values in the matrix fall on the diagonal from top left to bottom right.
* **Cross-validation** - Splits your dataset into multiple parts, trains and tests your model on each part, then evaluates performance as an average.
* **Classification report** - Sklearn has a built-in function called `classification_report()` which returns some of the main classification metrics such as precision, recall and f1-score.
* **ROC Curve** - Also known as the [receiver operating characteristic](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) curve, it is a plot of true positive rate versus false positive rate.
* **Area Under Curve (AUC) Score** - The area underneath the ROC curve. A perfect model achieves an AUC score of 1.0.
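
As a rough sketch of how several of these metrics relate, the example below uses made-up labels (not the heart disease data) to rebuild precision, recall and F1 by hand from the four cells of a confusion matrix, then compares them with Scikit-Learn's own functions.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels for illustration only (1 = positive class).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() gives the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"By hand: precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
print(f"sklearn: precision={precision_score(y_true, y_pred):.2f}, "
      f"recall={recall_score(y_true, y_pred):.2f}, "
      f"f1={f1_score(y_true, y_pred):.2f}")
```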

### Which Classification metric should you use?

* **Accuracy** is a good measure to start with if all classes are balanced (e.g. same amount of samples which are labelled with 0 or 1).

* **Precision** and **recall** become more important when classes are imbalanced.

* If false-positive predictions are worse than false-negatives, aim for higher precision.

* If false-negative predictions are worse than false-positives, aim for higher recall.

* **F1-score** is a combination of precision and recall.

* A confusion matrix is always a good way to visualize how a classification model is going.
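
Here is a small sketch of why accuracy alone can mislead on imbalanced classes. The labels are invented for illustration: 95 negatives and 5 positives, scored against a "model" that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up imbalanced labels: 95 negatives (0) and 5 positives (1).
y_true = np.array([0] * 95 + [1] * 5)

# A stand-in "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f"Accuracy: {accuracy:.2f}")  # looks great...
print(f"Recall: {recall:.2f}")      # ...but no positive sample was found
```

High accuracy here hides the fact that every positive sample was missed, which is exactly the case where recall (or precision) matters more.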

## Regression Model Evaluation Metrics/Techniques

* [R^2 (pronounced r-squared) or coefficient of determination](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html) - Compares your model's predictions to the mean of the targets. Values can range from negative infinity (a very poor model) to 1. For example, if all your model does is predict the mean of the targets, its R^2 value would be 0. And if your model perfectly predicts a range of numbers, its R^2 value would be 1.

* [Mean Absolute Error (MAE)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_error.html) - The average of the absolute differences between predictions and actual values. It gives you an idea of how wrong your predictions were.

* [Mean squared error (MSE)](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html) - The average squared differences between predictions and actual values. Squaring the errors removes negative errors. It also amplifies outliers (samples which have larger errors).
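
The R^2 reference points above can be checked with a tiny sketch on made-up targets: predicting the mean everywhere scores 0, perfect predictions score 1, and predictions worse than the mean go negative.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets for illustration only.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# A "model" that only ever predicts the mean of the targets.
mean_preds = np.full_like(y_true, y_true.mean())

r2_mean = r2_score(y_true, mean_preds)    # predicting the mean
r2_perfect = r2_score(y_true, y_true)     # perfect predictions
r2_worse = r2_score(y_true, y_true[::-1]) # worse than predicting the mean

print(f"Mean predictions: {r2_mean:.1f}")
print(f"Perfect predictions: {r2_perfect:.1f}")
print(f"Reversed predictions: {r2_worse:.1f}")
```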

### Which regression metric should you use?

* **R2** is similar to accuracy. It gives you a quick indication of how well your model might be doing. Generally, the closer your **R2** value is to 1.0, the better the model. But it doesn't really tell you exactly how wrong your model is in terms of how far off each prediction is.

* **MAE** gives a better indication of how far off each of your model's predictions are on average.

* As for **MAE** or **MSE**, because of the way MSE is calculated, squaring the differences between predicted values and actual values, it amplifies larger differences. Let's say we're predicting the value of houses (which we are).

    * Pay more attention to MAE: When being 10,000 off is **twice** as bad as being 5,000 off.

    * Pay more attention to MSE: When being 10,000 off is **more than twice** as bad as being 5,000 off.

For more on evaluating a machine learning model, be sure to check out the following resources:

* [Scikit-Learn documentation for metrics and scoring (quantifying the quality of predictions)](https://scikit-learn.org/stable/modules/model_evaluation.html)

* [Beyond Accuracy: Precision and Recall by Will Koehrsen](https://towardsdatascience.com/beyond-accuracy-precision-and-recall-3da06bea9f6c)

* [Stack Overflow answer describing MSE (mean squared error) and RMSE (root mean squared error)](https://stackoverflow.com/questions/17197492/is-there-a-library-function-for-root-mean-square-error-rmse-in-python/37861832#37861832)

### 4.2.3 Finally using the `scoring` parameter

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

X = heart_disease.drop("target", axis = 1)
y = heart_disease["target"]

clf = RandomForestClassifier()

In [None]:
np.random.seed(42)

# Cross-validation accuracy
cv_acc = cross_val_score(clf, X, y, cv=5, scoring=None) # scoring=None uses the estimator's default scorer (mean accuracy for classifiers)
print(cv_acc)

In [None]:
# cross-validated accuracy
print(f"The cross-validated accuracy is: {np.mean(cv_acc)* 100:.2f}%")

In [None]:
np.random.seed(42)

cv_acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
cv_acc

In [None]:
# cross-validated accuracy
print(f"The cross-validated accuracy is: {np.mean(cv_acc)* 100:.2f}%")

In [None]:
# Precision
np.random.seed(42)

cv_precision = cross_val_score(clf, X, y, cv=5, scoring="precision")
cv_precision

In [None]:
# cross-validated precision
print(f"The cross-validated precision is: {np.mean(cv_precision)}")

In [None]:
# Recall
np.random.seed(42)

cv_recall = cross_val_score(clf, X, y, cv=5, scoring="recall")
cv_recall

In [None]:
# cross-validated recall
print(f"The cross-validated recall is: {np.mean(cv_recall)}")

In [None]:
# F1
np.random.seed(42)

cv_f1 = cross_val_score(clf, X, y, cv=5, scoring="f1")
cv_f1

In [None]:
# cross-validated f1 score
print(f"The cross-validated f1 is: {np.mean(cv_f1)}")

In [None]:
# Let's see the scoring parameter being used for a regression problem.
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor

np.random.seed(42)

X = housing_df.drop("target", axis=1)
y = housing_df["target"]

model = RandomForestRegressor()

In [None]:
np.random.seed(42)
cv_r2 = cross_val_score(model, X, y, cv=3, scoring=None)
np.mean(cv_r2)

In [None]:
cv_r2

In [None]:
# Cross validated default score (R^2)
print(f"The average R^2 score is: {cv_r2.mean():.3f}")

In [None]:
# Mean Absolute Error or MAE
# Note: "neg_mean_absolute_error" returns the MAE negated, so that higher scores are always better

np.random.seed(42)

cv_mae = cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
cv_mae

In [None]:
# The scores are negated MAE values, so flip the sign to report the MAE itself
print(f"The average MAE: {-np.mean(cv_mae)}")

In [None]:
cv_mae

In [None]:
# Mean Squared Error or MSE
# Note: "neg_mean_squared_error" returns the MSE negated, so that higher scores are always better

np.random.seed(42)

cv_mse = cross_val_score(model, X, y, cv=5, scoring = 'neg_mean_squared_error')
cv_mse

In [None]:
# The scores are negated MSE values, so flip the sign to report the MSE itself
print(f"The average MSE: {-np.mean(cv_mse)}")

In [None]:
cv_mse

## 4.3 Using different evaluation metrics with scikit-learn functions

The third way to evaluate scikit-learn machine learning models/estimators is to use the `sklearn.metrics` module:
[sklearn.metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics)

In [None]:
# For a Classification Model
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

np.random.seed(42)
# create X & y
X = heart_disease.drop("target", axis=1)
y = heart_disease['target']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create model
clf = RandomForestClassifier()

# Fit model
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
# Evaluate model using evaluation functions
print(f"""--  Classifier metrics on the test set --
accuracy score: {accuracy_score(y_true=y_test, y_pred=y_pred)*100:.2f}%
precision score: {precision_score(y_true=y_test, y_pred=y_pred):.3f}
recall score: {recall_score(y_true=y_test, y_pred=y_pred):.3f}
f1 score: {f1_score(y_true=y_test, y_pred=y_pred):.3f}""")

In [None]:
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

np.random.seed(42)
# create X & y
X = housing_df.drop("target", axis=1)
y = housing_df["target"]

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Create model
reg = RandomForestRegressor()

# Fit model
reg.fit(X_train, y_train)

# Evaluate model using evaluation functions
y_pred = reg.predict(X_test)

print(f""" ---  Regression Metrics  --- 
R2 score: {r2_score(y_test, y_pred):.3f}
MAE: {mean_absolute_error(y_test, y_pred):.3f}
MSE: {mean_squared_error(y_test, y_pred):.3f}
""")

In [None]:
what_were_covering

## 5. Improving a model

The **first predictions** a model makes are called the **baseline predictions**.

The **first model** created, before any improvements, is called the **baseline model**.


How do we improve our **baseline predictions**?

* From a data perspective:
 * Could we collect more data? (generally, the more data, the better)
 * Could we improve our data?
        
* From a model Perspective:
 * Is there a better model we could use?
 * Could we improve the current model?
    
    
Parameters vs hyperparameters
   * Parameters = patterns in the data that the model finds on its own
   * Hyperparameters = settings on a model you can adjust to (potentially) improve its ability to find patterns
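
A minimal sketch of the difference, using `Ridge` regression on synthetic data as a stand-in (both the model and the data are assumptions for illustration, not part of this notebook's workflow). `alpha` is a hyperparameter we set before fitting, while `coef_` holds the parameters the model learns from the data.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data for illustration: y depends on two features plus a little noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Hyperparameter: chosen by us before training.
model = Ridge(alpha=1.0)
hyperparameters = model.get_params()

# Parameters: found by the model during fit().
model.fit(X, y)
learned_parameters = model.coef_

print("Hyperparameters:", hyperparameters)
print("Learned parameters:", learned_parameters)
```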

Three ways to adjust hyperparameters:
1. By hand
2. Randomly with RandomizedSearchCV
3. Exhaustively with GridSearchCV

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()

In [None]:
clf.get_params()

### 5.1 Tuning hyperparameters by hand

For tuning by hand, we split our dataset into three sets:

- **training split** (70%-80%) We train the model on this set
- **validation split** (10%-15%) We tune the hyperparameters on this set
- **test split** (10%-15%) We test the model on this set

EX: 100 patient records
- 70% training
- 15% hyperparameter tuning
- 15% testing
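
One possible way to get a 70/15/15 split is to call `train_test_split` twice. The sketch below uses 100 placeholder rows standing in for the patient records (the arrays are made up for illustration): first hold out 30%, then cut that holdout in half.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for 100 patient records.
X = np.arange(100).reshape(-1, 1)
y = np.arange(100) % 2

# Step 1: keep 70% for training, hold out 30%.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Step 2: split the 30% holdout evenly into validation and test sets.
X_valid, X_test, y_valid, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42)

print(len(X_train), len(X_valid), len(X_test))
```

Slicing a shuffled DataFrame by index positions gives the same kind of split and is just as valid.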

In [None]:
def evaluate_preds(y_true, y_preds):
    """Performs evaluation comparison on y_true labels vs y_preds labels
    on a classification model and returns the metrics as a dictionary.
    """
    
    accuracy = accuracy_score(y_true, y_preds)
    precision = precision_score(y_true, y_preds)
    recall = recall_score(y_true, y_preds)
    f1 = f1_score(y_true, y_preds)
    metric_dict = {"accuracy": round(accuracy, 2),
                  "precision": round(precision, 2),
                  "recall": round(recall, 2),
                  "f1": round(f1, 2)}
    print(f"Accuracy: {accuracy * 100:.2f}%")
    print(f"Precision: {precision:.2f}")
    print(f"Recall: {recall:.2f}")
    print(f"F1 score: {f1:.2f}")
    
    return metric_dict



In [None]:
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)

# shuffle the data
heart_disease_shuffled = heart_disease.sample(frac=1)

# split into X & y
X = heart_disease_shuffled.drop("target", axis=1)
y = heart_disease_shuffled["target"]

# split the data into train, validation, and test set
train_split = round(0.7 * len(heart_disease_shuffled)) # 70 % of data
valid_split = round(train_split + 0.15 * len(heart_disease_shuffled))
X_train, y_train = X[:train_split] , y[:train_split]
X_valid, y_valid = X[train_split: valid_split], y[train_split: valid_split]
X_test, y_test = X[valid_split:], y[valid_split:]

clf = RandomForestClassifier()
clf.fit(X_train, y_train)

# Make baseline Predictions
y_preds = clf.predict(X_valid)

# Evaluate the classifier on validation set
baseline_metrics = evaluate_preds(y_valid, y_preds)
baseline_metrics

In [None]:
clf_2 = RandomForestClassifier(n_estimators=100)
clf_2.fit(X_train, y_train)

# Make predictions
y_preds_2 = clf_2.predict(X_valid)

#Evaluate the 2nd classifier
clf_2_metrics = evaluate_preds(y_valid, y_preds_2)
clf_2_metrics

In [None]:
clf_3 = RandomForestClassifier(n_estimators=100, max_depth=10)
clf_3.fit(X_train, y_train)

# Make predictions
y_preds_3 = clf_3.predict(X_valid)

# Evaluate the 3rd classifier
clf_3_metrics = evaluate_preds(y_valid, y_preds_3)
clf_3_metrics

In [None]:
clf.get_params()

### 5.2 Hyperparameter tuning with RandomizedSearchCV

In [None]:
from sklearn.model_selection import RandomizedSearchCV

grid = {"n_estimators": [10, 100, 200, 500, 1000, 1200],
       "max_depth": [None, 5, 10, 20, 30],
       "max_features": ["sqrt", "log2"],  # "auto" was removed in scikit-learn 1.3
       "min_samples_split": [2,4,6],
       "min_samples_leaf": [1,2,4]}

np.random.seed(42)

# Split into X and y
X = heart_disease_shuffled.drop("target", axis =1)
y = heart_disease_shuffled["target"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate RandomForestClassifier
clf = RandomForestClassifier(n_jobs=1)

# Setup RandomizedSearchCV
rs_clf = RandomizedSearchCV(estimator = clf, 
                            param_distributions=grid, 
                            n_iter=10, # number of models to try
                            cv = 5,
                            verbose=2)

# Fit the RandomizedSearchCV version of clf
rs_clf.fit(X_train, y_train);


In [None]:
rs_clf.best_params_

In [None]:
# Make predictions with the best hyperparameters
rs_y_preds = rs_clf.predict(X_test)

# Evaluate the predictions
rs_metrics = evaluate_preds(y_test, rs_y_preds)

### 5.3 Hyperparameter Tuning with GridSearchCV

In [None]:
grid

In [None]:
# GridSearchCV tries every possible combination of hyperparameters
# in the grid and fits each combination once per cross-validation fold.
# For the first grid (`grid`), that is:
cv=5
6 * 5 * 2 * 3 * 3 * cv

In [None]:
# That's too many models so we're going to make the grid smaller.
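
The count above can also be double-checked with `ParameterGrid` from `sklearn.model_selection`, which expands a grid dictionary into every combination. The sketch below uses an `example_grid` of the same shape as the first grid (6 x 5 x 2 x 3 x 3 values); the variable name is only for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same shape as the first grid: 6 x 5 x 2 x 3 x 3 candidate values.
example_grid = {"n_estimators": [10, 100, 200, 500, 1000, 1200],
                "max_depth": [None, 5, 10, 20, 30],
                "max_features": ["sqrt", "log2"],
                "min_samples_split": [2, 4, 6],
                "min_samples_leaf": [1, 2, 4]}

n_folds = 5
n_combinations = len(ParameterGrid(example_grid))  # every hyperparameter combination
n_fits = n_combinations * n_folds                  # one fit per combination per fold

print(f"{n_combinations} combinations x {n_folds} folds = {n_fits} model fits")
```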

In [None]:
grid_2 = {'n_estimators': [100, 200, 500],
          'max_depth': [None],
          'max_features': ['sqrt', 'log2'],  # 'auto' was removed in scikit-learn 1.3
          'min_samples_split': [6],
          'min_samples_leaf': [1, 2]
         }

# now we have
3 * 1 * 2* 1 * 2 * cv

In [None]:
from sklearn.model_selection import GridSearchCV, train_test_split

np.random.seed(42)

# Split into X and y
X = heart_disease_shuffled.drop("target", axis =1)
y = heart_disease_shuffled["target"]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Instantiate RandomForestClassifier
clf = RandomForestClassifier(n_jobs=1)

# Setup GridSearchCV
gs_clf = GridSearchCV(estimator = clf, 
                            param_grid=grid_2, 
                            cv = 5,
                            verbose=2)

# Fit the GridSearchCV version of clf
gs_clf.fit(X_train, y_train);

In [None]:
gs_clf.best_params_

In [None]:
gs_y_preds = gs_clf.predict(X_test)

# evaluate the predictions
gs_metrics = evaluate_preds(y_test, gs_y_preds)

Let's compare our different models' metrics.

In [None]:
compare_metrics = pd.DataFrame({"baseline": baseline_metrics,
                              "clf_2": clf_2_metrics,
                              "random search": rs_metrics,
                              "grid search": gs_metrics})

compare_metrics.plot.bar(figsize=(10,8))

In [None]:
what_were_covering

## 6.0 Saving and loading trained machine learning models

Two ways to save and load machine learning models:
1. With Python's `pickle` module
2. With the `joblib` module

**Pickle**

In [None]:
import pickle

# Save an existing model to file (the with-block closes the file for us)
with open("gs_random_forest_model_1.pkl", "wb") as f:
    pickle.dump(gs_clf, f)

In [None]:
# Load a saved model (the with-block closes the file for us)
with open("gs_random_forest_model_1.pkl", "rb") as f:
    loaded_pickle_model = pickle.load(f)

In [None]:
y_preds = loaded_pickle_model.predict(X_test)
evaluate_preds(y_test, y_preds)

In [None]:
gs_y_preds = gs_clf.predict(X_test)

gs_metrics = evaluate_preds(y_test, gs_y_preds)

**Joblib**

In [None]:
from joblib import dump, load

# Save model to file
dump(gs_clf, filename="gs_random_forest_model_1.joblib")

In [None]:
# Import a saved joblib model
loaded_jobmodel = load(filename="gs_random_forest_model_1.joblib")

In [None]:
# Make and evaluate joblib predictions
joblib_y_preds = loaded_jobmodel.predict(X_test)
evaluate_preds(y_test, joblib_y_preds)

## 7. Putting it all together

In [None]:
data = pd.read_csv("../data/car-sales-extended-missing-data.csv")
data

In [None]:
data.dtypes

In [None]:
data.isna().sum()

Steps we want to do (all in one cell):
1. Fill the missing data
2. Convert data to numbers
3. Build a model on the data

In [None]:
# Getting data ready
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Modelling
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV

# set up random seed
import numpy as np
np.random.seed(42)

# Import data and drop rows with missing labels
data = pd.read_csv("../data/car-sales-extended-missing-data.csv")
data.dropna(subset=["Price"], inplace=True)

# Define different features and transformer pipeline
categorical_features = ['Make', "Colour"]
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])

door_features = ["Doors"]
door_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value=4))])

numeric_features = ["Odometer (KM)"]
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean"))])

# Setup the preprocessing steps (fill missing values, then convert to numbers)
preprocessor = ColumnTransformer(
                    transformers=[
                        ("cat", categorical_transformer, categorical_features),
                        ("door", door_transformer, door_features),
                        ("num", numeric_transformer, numeric_features)])

# Creating a preprocessing and modelling pipeline
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestRegressor())
])

# Split the data
X = data.drop("Price", axis=1)
y = data["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit and score the model
model.fit(X_train, y_train)
model.score(X_test, y_test)

It's also possible to use `GridSearchCV` or `RandomizedSearchCV` with our `Pipeline`.

In [None]:
# Use GridSearchCV with our regression Pipeline
from sklearn.model_selection import GridSearchCV
pipe_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "model__n_estimators": [100, 1000], 
    "model__max_depth": [None, 5],
    "model__max_features": [1.0],  # "auto" was removed in scikit-learn 1.3; 1.0 uses all features
    "model__min_samples_split": [2,4]
}

gs_model = GridSearchCV(model, pipe_grid, cv=5, verbose=2)
gs_model.fit(X_train, y_train)

In [None]:
gs_model.score(X_test, y_test)
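
The double-underscore keys in `pipe_grid` follow the `step_name__parameter_name` pattern that `Pipeline` uses for nested parameters. As a small sketch on a toy two-step pipeline (`toy_pipeline` is a made-up name, not the full pipeline above), `get_params()` lists every name you could tune, and `set_params()` accepts the same names.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# A toy pipeline for illustration only.
toy_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", RandomForestRegressor()),
])

# Nested parameters are addressed as <step name>__<parameter name>.
tunable = sorted(name for name in toy_pipeline.get_params() if "__" in name)
print(tunable[:5])

# The same names work with set_params (and as keys in a search grid).
toy_pipeline.set_params(model__n_estimators=10, imputer__strategy="median")
print(toy_pipeline.named_steps["model"].n_estimators)
```

Printing the available names first is a handy way to avoid typos in a grid's keys.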