# Introduction to Scikit-Learn (sklearn)

This notebook demonstrates some of the most useful functions of the scikit-learn library.

Topics covered:

0. An end-to-end Scikit-Learn workflow
1. Getting the data ready
2. Choose the right estimator/algorithm for our problems
3. Fit the model/algorithm and use it to make predictions on our data
4. Evaluating a model
5. Improve a model
6. Save and load a trained model
7. Putting it all together

In [1]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
import sklearn
sklearn.show_versions()


System:
    python: 3.9.12 (main, Apr  4 2022, 05:22:27) [MSC v.1916 64 bit (AMD64)]
executable: C:\Users\migue\Documents\Machine_learning_Learning\env\python.exe
   machine: Windows-10-10.0.22000-SP0

Python dependencies:
          pip: 21.2.4
   setuptools: 61.2.0
      sklearn: 1.0.2
        numpy: 1.22.3
        scipy: 1.7.3
       Cython: None
       pandas: 1.4.2
   matplotlib: 3.5.1
       joblib: 1.1.0
threadpoolctl: 2.2.0

Built with OpenMP: True




## 0. An end-to-end scikit-learn workflow

In [3]:
# 1. Get the data ready
import pandas as pd
heart_disease = pd.DataFrame(pd.read_csv('./data/heart-disease.csv'))
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


In [4]:
# Features matrix is usually named x 
x = heart_disease.drop('target', axis=1)

# Create y (labels matrix)
y = heart_disease['target']

In [5]:
import warnings
# Just use this in case warnings are irrelevant in other case
# DO NOT USE THIS FILTER
# warnings.filterwarnings('ignore')

In [6]:
# 2.Choose the right model and hyperparameters
# Random forest
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=80)

# We'll keep the deafult hyperparametes
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 80,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [7]:
# 3. Fit the model to the training data
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [8]:
clf.fit(x_train, y_train);

In [9]:
x_train

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
98,43,1,2,130,315,0,1,162,0,1.9,2,1,2
128,52,0,2,136,196,0,0,169,0,0.1,1,0,2
151,71,0,0,112,149,0,1,125,0,1.6,1,0,2
139,64,1,0,128,263,0,1,105,1,0.2,1,1,3
177,64,1,2,140,335,0,1,158,0,0.0,2,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
92,52,1,2,138,223,0,1,169,0,0.0,2,4,2
84,42,0,0,102,265,0,0,122,0,0.6,1,0,2
182,61,0,0,130,330,0,0,169,0,0.0,2,0,2
204,62,0,0,160,164,0,0,145,0,6.2,0,3,3


In [10]:
# Make some predictions
y_preds = clf.predict(x_test)
y_preds

array([1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1,
       1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0], dtype=int64)

In [11]:
y_test

155    1
117    1
58     1
222    0
199    0
      ..
261    0
292    0
298    0
87     1
252    0
Name: target, Length: 61, dtype: int64

In [12]:
# 4. Evaluate the model on the training data and test data
clf.score(x_train, y_train)

1.0

In [13]:
clf.score(x_test, y_test)

0.8360655737704918

## F1 score
F1 score is a metric to evaluate a model in ML, there are many metrics for classification models.

The F1 score is a propose imporvement of two simpler performance metrics.

### Accuracy
This is a metric for clasification models that measures the number of predictions that are correct as percentage of the total number of prediction. If 90% of your predictions are correct then your accuracy is simply a 90%

<img src='./images/accuracy_formula.png'>

Accuracy might not be useful if your data is imbalanced, let's imagine data sales from a website, 99% of the people are just lookers and 1% of the visitors buy something.

If your model is not very good, then it predicts that 100% of the visitors are lookers, while this is clearly wrong and it's an useless model.

## Precision and recall: The foundations
Precision and recall are two of the most common metrics that take into account class imbalance. They're also the foundation of the F1 score.

### Precision
Precision is the first part of the F1 score. It can also be used as an idividual machine learning metric.
<img src='./images/precision-score.png'>

You can interpreta the formula as it follows. Within everythin that has predicted as a positive, precision counts the percentage that is correct:
1. A not precise model may find a lot of the positves, but its selection method is nosy: it also wrongly detects many positives that aren't actually positives also known as al (false-positive)

2. A precise model is very 'pure'. maybe it does not find all the postives, but the ones that the model does class as positive are very like to be correct (true-positives)

### Recall
The second component of the F1 score, although recall can also be used as an individual machine learning metric.
<img src='./images/recall-metric.png'>

This formula is interpretaded as: Within everything that actually is positive, how may did the model succed to find:

1. A model with high recall succed well in finding all the positive cases in the data, even though they may also wrongly identify some negative cases as positive cases.

2. A model with low recall is not able to find all (or a large part) of the positive cases in data.

### Precision vs Recall
Think of a supermarket that has sold a product with a problem and they need to recall it, they are only intrested in making sure that they find all the problematic products back. It does not really matter to them if clients send back some non-problematic products asw welll, so the precision is no interest of this supermarket.

### Precision-Recall Trade-off
In an ideal world we'd want a model that identifies all of our positive cases and at the same time indentifies only positive cases, sadly the world is a cruel world that's not possible. In many cases, you can tweak a model to increase precision at the cost of lower recall, or increase recall at the cost of lower precision.

### F1 Score
The goal of F1 is to combine the precision and recall metric into a single metric. F1 score has also been designed to work well on imbalanced data.
<img src='./images/f1-score.png'>
In the F1 score, we compute the average of precision and recall. They are both rates, which makes it logical choice to use the harmonic mean (an alternative metric for the more common arithmetic mean). F1 score gives equal weight to precision and recall:
1. A model will obtain a high F1 score if both precision and recall are high
2. A model will obtain a low F1 score if both precision and recall are low
3. A model will obtain a medium F1 score if one of precision and recall is low and the other is high

### Conclusion
All the metrics for your model are useful except for accracy that is not reliable if you're working with an imbalaced data set

## Important note
Precision is appropiate whe minimizing false positives.\
Recall is appropiate when minimizing false negatives.

In [15]:
len(y_test)

61

### Support is the len of the test dataset

In [14]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.85      0.79      0.81        28
           1       0.83      0.88      0.85        33

    accuracy                           0.84        61
   macro avg       0.84      0.83      0.83        61
weighted avg       0.84      0.84      0.84        61



# Micro, Macro and weighted F1 Score

## Macro average
A macro averaged F1 Score is computed using the arithmetic mean (aka unweighted mean) of all per-class F1 Score

### In this method all classes are threated equally

## Weighted Average
This score is calculated by taking the mean of all per-class F1 scorews while considering each class support.
### Support
This refers to the number of actual ocurrences of the class in the dataset. In a data set with 10 samples, we have a class A which has 3 samples, the support of this class would be 3, in a proportion of 0.3.

With this metric the score that is gonna be show will be the avergae would have accounted for the contribution of each class.

## Micro average
Computes a global average of the F1 score by counting the sums of the true positives, false negatives and false positives\
Micro average essentially computes the proportion of correctly classified observations out of all observations. this is why is called accuracy, if we apply Micro average to Recall and precison we would get the same output

## Which average use
If we are working with an imbalanced dataset where all classes are equally important, using macro average woulf be a good choice.

If the classes with more data are more relevant then we should use the weighted average.

In a balanced dataset that you need an easily understandable metric for overall performance regardless of the class. You can use accuracy or micro F1 score

# Confusion Matrix

This is a table that helps us to analyze the performance of a classification model. It breaks down each class and the number of correct and incorrect predictions that the model makes.\

Let's imagine a scenario whwere we have 100 samples, 59 samples belong to class A, and the model correctly predicts 52 samples. We also have a class B with 41 sapmples, and our imaginary model predicted 28 correctly.\

## Sample confusion matrix view
x-axis belongs to the predictions and the y-axis belongs to the expected results.\
  A____B\
A 52 | 7\
B 13 | 28

The model predicted correctly 52 samples from class A, and made 7 mistakes.\
The model predicted correectly 28 samples from class B, and maded 13 mistakes.

## Positive and negative outcomes
In a problem of binnary classification models solve is identifying specific class instances from normal ones. An example could be identifying regular email vs spam email.\

## Representation of each cell of the confusion matrix
  A________B\
A 52 TP | 7  FN\
B 13 FP | 28 TN

#### TP = True Positive
#### FP = False Positive
#### FN = False Negative
#### TN = True Negative

The confusion matrix scales depending the number of classes. And it works the same way as in a binary classification problem.\

## Importance of a confusion matrix
It is a visualization tool that surfaces essential information about our model predictions, without a confusion matrix it would be hard to see, as learned in previous leassons knowing what's happening in our model or data is really important if we want to get a good ML model.\
A confusion matrix helps to communicate our results which is an essential task.\

Also the information that a confusion matrix gives us is critical in order to improve our model. Makes it really clear where are the problems and where to focus our eforts.

In [15]:
confusion_matrix(y_test, y_preds)

array([[26,  7],
       [ 3, 25]], dtype=int64)

In [16]:
accuracy_score(y_test, y_preds)

0.8360655737704918

In [17]:
# 5. Improve a model
# Try a different amount of n_estimators

# Set seed to replicate
np.random.seed(42)
for i in range(10, 100, 10):
    print(f'Trying model with {i} estimators...')
    clf = RandomForestClassifier(n_estimators=i).fit(x_train, y_train)
    print(f'Model accuracy on test set: {clf.score(x_test, y_test) * 100:.2f} %')
    print('')

Trying model with 10 estimators...
Model accuracy on test set: 77.05 %

Trying model with 20 estimators...
Model accuracy on test set: 83.61 %

Trying model with 30 estimators...
Model accuracy on test set: 81.97 %

Trying model with 40 estimators...
Model accuracy on test set: 85.25 %

Trying model with 50 estimators...
Model accuracy on test set: 81.97 %

Trying model with 60 estimators...
Model accuracy on test set: 83.61 %

Trying model with 70 estimators...
Model accuracy on test set: 83.61 %

Trying model with 80 estimators...
Model accuracy on test set: 81.97 %

Trying model with 90 estimators...
Model accuracy on test set: 83.61 %



In [18]:
import pickle

In [19]:
# 6. Save a model and load it
import pickle

# wb = write binary rb = read binarty
pickle.dump(clf, open('random_forest_model_1.pkl', 'wb'))

In [20]:
loaded_model = pickle.load(open('./models/random_forest_model_1.pkl', 'rb'))
loaded_model.score(x_test, y_test)

0.9836065573770492

## Getting the data ready to be used with ML

Three mains things to do:
1. Split the data into features and labels (usually 'x' and 'y')
2. Filling (also called imputing) or disregarding missing values
3. Converting non=numerical values (also called feature encoding)

In [21]:
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [22]:
x = heart_disease.drop('target', axis=1)
x.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [23]:
y = heart_disease['target']
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [24]:
# Split the data into training and test sets
# We split the training and test data so we can see how effective our model is
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [25]:
x_train.shape, x_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))

### Quick tips of ML
More data is not always good, we only want useful data, we always want to clean data then transform the data and finally reduce the data to useful data.\

Data with missing labels or missiing values is not gonna help us in the process of training a model. We usually remove rows or columns that have empty spaces.\

Since computers only understand numbers (specifically 0 & 1), we converted all the data that we can into numbers, we have to make sure that all the data is in the same metric system.\

Reduce the data is important in order to make the model cheaper to run, when we have a lot of data this might help to save money, this can also be called as dimensionality reduction or column reduction, if you see that the data that you have is useless it might be a good idea to remove it.

### 1.1 Make sure it's all numerical

In [26]:
car_sales = pd.DataFrame(pd.read_csv('./data/car-sales-extended.csv'))

In [27]:
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [28]:
len(car_sales)

1000

In [29]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

### This model is gonna fail because it has dtypes of objects

In [30]:
x = car_sales.drop('Price', axis=1)

y = car_sales['Price']

# Split the data into training and test
x_train, x_test, y_brain, y_test = train_test_split(x, y, test_size=0.2)

In [31]:
# Build ML model
# same as clasifier but used to predict numbers
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(x_train, y_train)
model.score(x_test, y_test)

ValueError: could not convert string to float: 'Toyota'

<img src='./images/encoders.jpg'>

In [None]:
# Turn the categories into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ['Make', 'Colour', 'Doors']

one_hot = OneHotEncoder()
transformer = ColumnTransformer([('one_hot',
                                   one_hot,
                                   categorical_features)], remainder='passthrough')

transformed_x = transformer.fit_transform(x)
transformed_x

In [None]:
pd.DataFrame(transformed_x)

In [None]:
# OTher way to do it
dummies = pd.get_dummies(car_sales[['Make','Colour', 'Doors']])
dummies

All data is transformed into 0 and 1

In [None]:
# Refit the model
np.random.seed(42)
x_train, x_test, y_train, y_test = train_test_split(transformed_x, y, test_size=0.2)
model.fit(x_train, y_train)

In [None]:
model.score(x_test, y_test)

In [None]:
model.score(x_train, y_train)

### Quick Notes
In newer versions of scikit-learn (0.23+) `OneHotEncoder` class was upgraded so now is able to handle `None` and `NaN` values.\

In the leasson we're going to see errors due to the older version of sklearn. If our version of sklearn is 0.23+ no error will appear. As always for more info is good practice to check the documentation :)

### 1.2 What if there where missing values?

1. Fill them with some value (known as imputation).
2. Remove the sampes with missing data altogether.

In [32]:
# import car sales missing data
car_sales_missing = pd.DataFrame(pd.read_csv('./data/car-sales-extended-missing-data.csv'))

In [33]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [34]:
# Create x and y 
x = car_sales_missing.drop('Price', axis=1)
y = car_sales_missing['Price']

In [35]:
# Let's try to convert our data to numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ['Make', 'Colour', 'Doors']

one_hot = OneHotEncoder()
transformer = ColumnTransformer([('one_hot',
                                   one_hot,
                                   categorical_features)], remainder='passthrough')

transformed_x = transformer.fit_transform(x)
transformed_x

<1000x16 sparse matrix of type '<class 'numpy.float64'>'
	with 4000 stored elements in Compressed Sparse Row format>

In [36]:
import sklearn
print(sklearn.__version__)

1.0.2


#### Fill missing data with pandas

In [37]:
# Fill the 'Make' column
car_sales_missing['Make'].fillna('missing', inplace=True)

# Fill 'Colour' column
car_sales_missing['Colour'].fillna('missing', inplace=True)

# Fill the 'Odometer (KM)' column
car_sales_missing['Odometer (KM)'].fillna('missing', inplace=True)

# Fill 'Doors' column
car_sales_missing['Doors'].fillna(4, inplace=True)

In [38]:
car_sales_missing.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

In [39]:
# Remove rows with missing price value
car_sales_missing.dropna(inplace=True)

In [40]:
car_sales_missing.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [41]:
len(car_sales_missing)

950

In [42]:
car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [43]:
x = car_sales_missing.drop('Price', axis=1)
y = car_sales_missing['Price']

In [44]:
# Let's try to convert our data to numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ['Make', 'Colour', 'Doors']

one_hot = OneHotEncoder()
transformer = ColumnTransformer([('one_hot',
                                   one_hot,
                                   categorical_features)], remainder='passthrough')

transformed_x = transformer.fit_transform(car_sales_missing)
transformed_x

array([[0.0, 1.0, 0.0, ..., 0.0, 35431.0, 15323.0],
       [1.0, 0.0, 0.0, ..., 1.0, 192714.0, 19943.0],
       [0.0, 1.0, 0.0, ..., 0.0, 84714.0, 28343.0],
       ...,
       [0.0, 0.0, 1.0, ..., 0.0, 66604.0, 31570.0],
       [0.0, 1.0, 0.0, ..., 0.0, 215883.0, 4001.0],
       [0.0, 0.0, 0.0, ..., 0.0, 248360.0, 12732.0]], dtype=object)

Pandas is very versatile, and helps us to manipulate the data used to train

### Feature scaling
Once your data is all in a numerical format, there's one more transformation you want to do.\

This is Feature Scaling.\
This is being sure that all the numerical data is on the same scale.\


### Two main types of feature scaling

#### Normalization
This rescales all the numerical vaules to between - and 1, with the lowest value being close to 0, and the highest previous value, being close to 1, scikit-lrean provides the `StandardScalar` class.

#### Standardization
This substracts the mean value from all of the features. It then scales the features to unit varince (dividing the feature by the std), this is also provided in `StandardScalar` class.

### Considerations
1. Feature scaling is not required for your target variable
2. Feature scaling is usually not required with tree-based models since the can hande varying features

### Useful links
<https://medium.com/@rahul77349/feature-scaling-why-it-is-required-8a93df1af310>

<https://benalexkeen.com/feature-scaling-with-scikit-learn/>

<https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/>

### Fill missing values with scikit-learn

In [45]:
# Import and view the DataFrame
car_missing_sales = pd.DataFrame(pd.read_csv('data/car-sales-extended-missing-data.csv'))
car_missing_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [46]:
# Check missing values
car_missing_sales.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [47]:
# Drop the rows with no labels
car_missing_sales.dropna(subset=['Price'], inplace=True)
car_missing_sales.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [48]:
from sklearn.model_selection import train_test_split

# Split the data into x and y
x = car_missing_sales.drop('Price', axis=1)
y = car_missing_sales['Price']


# Split the data into train and test
np.random.seed(42)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

In [49]:
# Check missing values
x.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
dtype: int64

We will fill the training and test data separately to ensure training data stays with the training data and test data stays with the test data.

We use `fit_transform()` on the training data and `transform()` on the testing data. In essence we learn the patterns in the training set and transform it via imputation (fit, then transform). We take those same patters and fill the test set (transform only).

In [50]:
# Fill missing values with sklearn
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

# Fill categorical values with 'missing' & numerical values with mean
cat_imputer = SimpleImputer(strategy='constant', fill_value='missing')
door_imputer = SimpleImputer(strategy='constant', fill_value=4)
num_imputer = SimpleImputer(strategy='mean')

# Define columns
cat_features = ['Make', 'Colour']
door_features = ['Doors']
num_features = ['Odometer (KM)']

# Create an imputer (something that fills missing data)
imputer = ColumnTransformer([('cat_imputer', cat_imputer, cat_features),
                            ('door_imputer', door_imputer, door_features),
                            ('num_imputer', num_imputer, num_features)])

# Fill train and test vaules separately
filled_x_train = imputer.fit_transform(x_train)
# fit_transfrom() imputes the missing valuies from the training set and fills
# them simultanueosly
filled_x_test = imputer.transform(x_test)
# Transform takes the imputing missing values from the training set and fills
# the test set with them

# Check filled x_train
filled_x_train

array([['Honda', 'White', 4.0, 71934.0],
       ['Toyota', 'Red', 4.0, 162665.0],
       ['Honda', 'White', 4.0, 42844.0],
       ...,
       ['Toyota', 'White', 4.0, 196225.0],
       ['Honda', 'Blue', 4.0, 133117.0],
       ['Honda', 'missing', 4.0, 150582.0]], dtype=object)

In [51]:
# Get our transformed data array;s back intro DataFrame's
car_sales_filled_train = pd.DataFrame(filled_x_train,
                                     columns=['Make', 'Colour', 'Doors', 'Odometer (KM)'])

car_sales_filled_test = pd.DataFrame(filled_x_test,
                                     columns=['Make', 'Colour', 'Doors', 'Odometer (KM)'])

# Check missing data in training set
car_sales_filled_train.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

In [52]:
car_sales_filled_test.isna().sum()

Make             0
Colour           0
Doors            0
Odometer (KM)    0
dtype: int64

In [53]:
car_missing_sales.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [54]:
# Let's one hot encode the features with the same code as before
categorical_features = ['Make', 'Colour', 'Doors']
one_hot = OneHotEncoder()
transformer = ColumnTransformer([('one hot', one_hot, categorical_features
                                 )], remainder='passthrough')

# Fill train and test values separately
transformed_x_train = transformer.fit_transform(car_sales_filled_train) # Fit and transform training data
transformed_x_test = transformer.transform(car_sales_filled_test) # Transform test data

# Check transformed and filled x_train
transformed_x_train.toarray()

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 7.19340e+04],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.62665e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 4.28440e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.96225e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.33117e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.50582e+05]])

## Fit a model

In [55]:
np.random.seed(42)

# Setup model
model = RandomForestRegressor()

# Make sure to use the transformed data
model.fit(transformed_x_train, y_train)
model.score(transformed_x_test, y_test)

0.21229043336119102

## 2. Choose the right model for our problems
Some things to note:

* Sklearn refers to machine learning models, algorithms as estimators.
* Classification problem - predicting a category (heart disease or not)
* Sometimes you'll see `clf` (short for classifier) used as a classification estimator
* Regression problem - predicting a number (selling price of a car)
<img src='./images/ml_map.png'>

There are a lot of algorithms, each one of the used for a purpose, if you're ever in doubt of which model to use it's always a good practice to check the scikit documentation.

URL to sklearn cheat-sheet: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

### 2.1 Picking a machine learning model for regression problem
Scikit learn has dataset to play around

On this lecture we're going to use California Housing dataset.

In [56]:
# Get California Housing dataset
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
housing

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n

In [57]:
# A more realistic dataset
housing_df = pd.DataFrame(housing['data'], columns=housing['feature_names'])
housing_df

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25
...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32


In [58]:
# You always want to transform your dataset into something that is manipulative
# in order to transform your data. And be able to use it with Pandas NumPy sklearn
# target is used for educational purposes a more appropiate way to call it
# would be MedHouseVal
housing_df['target'] = housing['target']
housing_df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


Here we're predicting an estimated house value, with some data from a block house in Califronia, to undertand what each row is, you can check sklearn documentation: https://scikit-learn.org/stable/datasets/real_world.html#california-housing-dataset

In [59]:
# housing_df = housing_df.drop('MedHouseVal', axis=1)


In [60]:
housing_df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In order to choose the right algorithm we use the map placed before to see what can be a right algorithm. Data science comes with experimentation, for this to happen is also important have abilities on data analysis.

In order to get a better understanding of these models is important to learn more mathematics, and read documentation.

We're going to try Ridge Regression

In [61]:
# Import algorithm/ estimator
from sklearn.linear_model import  Ridge

# Setup random seed
np.random.seed(42)

# Create the data
x = housing_df.drop('target', axis=1)
y = housing_df['target'] # median house price in $100,000s

# Split data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Instatiante and fit the model (on the training set)
model = Ridge()
model.fit(x_train, y_train)

# Check the score of the model (on the test set)
model.score(x_test, y_test)

0.5758549611440125

sklearn default score model is the squered coefficient of determination, all it does is check the linear relation between the first variable and the second variable, we have more than one variable, but everything what the model is taking in consideration is how `MedInc`, `HouseAge`, how all that data is affecting the target column, it makes an average and that's the model's score

In [62]:
y_preds = model.predict(x_test)
y_preds

array([0.71923978, 1.76395141, 2.70909238, ..., 4.46864495, 1.18785499,
       2.00912494])

## Improving a model
Well the previous model wasn't very acuarate with the metric of coefficient of determination so we're gonna try different models to improve accuracy

In [63]:
# Import algorithm
from sklearn.linear_model import Lasso

# Set random seed (replicability purposes)
np.random.seed(42)

# Data is already splitted above (we write code for learning purposes)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Instantiate the model and fit it
model = Lasso()
model.fit(x_train, y_train)

# Score the model
model.score(x_test, y_test)

0.2841671821008396

Well that was even worse, so now we try with other algorithm which is an EnsembleRegressor

In [64]:
# Import the algorithm
from sklearn import svm

# Set random seed
np.random.seed(42)

# Splitting the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Instantiate the model and fit it
regr = svm.SVR()
regr.fit(x_train, y_train)

# Score the model
regr.score(x_test, y_test)

-0.01648536010717372

In [65]:
regr.score(x_train, y_train)

-0.023731482780207536

Let's try another one

In [66]:
from sklearn.linear_model import ElasticNet

# Set random seed
np.random.seed(42)

# Split training data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Instantiate the model and fit it
model = ElasticNet()
model.fit(x_train, y_train)

# Score the model
model.score(x_test, y_test)

0.41655189098028234

In [67]:
model.score(x_train, y_train)

0.42697503980879004

Just for fun let's try an SGD regressor


In [68]:
# Import algorithm
from sklearn.linear_model import SGDRegressor

# Set random seed
np.random.seed(42)

# Split the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Instantiate the model and fit it
model = SGDRegressor()
model.fit(x_train, y_train)

# Score the model
model.score(x_test, y_test)

-5.367216443051947e+27

In [69]:
# Set random seed
np.random.seed(42)

# Splitting the data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Instantiate the model and fit it
regr = svm.SVR()
regr.fit(x_train, y_train)

# Score the model
regr.score(x_test, y_test)

-0.01648536010717372

In [70]:
# Import RFR from ensemble module
#from sklearn.ensemble import RandomForestRegressor

# Set random seed
#np.random.seed(42)

# Splitting the data
#x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Instantiate the model and fit it
# model = RandomForestRegressor(n_estimators= 20000)
# model.fit(x_train, y_train)

# Score the model
# model.score(x_test, y_test)

0.808425794260139

### Save the model

In [71]:
pickle.dump(clf, open('./models/housing-regressor.pkl', 'wb'))

Well an ensemble works very well, TODO: learn random forest classifiers, but i think I'm using a lot of parameters

## 2.2 Machine learning algorithms for classification.