# Scikit-Learn course 3
## II. Choosing the right estimator/algorithm for your problem

The key takeaways to remember are:

1. Most datasets you come across won't be ready to use with machine learning models straight away, and some take more preparation than others.
2. For most machine learning models, your data has to be numerical. This will involve converting whatever you're working with into numbers. This process is often referred to as feature engineering or feature encoding.
3. Some machine learning models aren't compatible with missing data. The process of filling missing data is referred to as data imputation.
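
As a reminder of takeaways 2 and 3, here is a minimal sketch of encoding and imputation. The `cars` table and its column names are hypothetical toy data made up for this illustration, and `pd.get_dummies` plus `SimpleImputer` are just one of several ways to do this.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one categorical column and one numeric column with a gap
cars = pd.DataFrame({
    "colour": ["red", "blue", "red", "green"],
    "odometer_km": [35000.0, np.nan, 120000.0, 54000.0],
})

# Encoding: turn the categorical column into numeric indicator columns
encoded = pd.get_dummies(cars, columns=["colour"], dtype=float)

# Imputation: fill the missing odometer value with the column mean
imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(encoded), columns=encoded.columns)
print(filled)
```

After these two steps every column is numerical and nothing is missing, which is what most estimators expect.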

## Choosing the right estimator/algorithm for your problem

Once you've got your data ready, the next step is to choose an appropriate machine learning algorithm or model to find patterns in your data.
<br><br>
Some things to note:

* Scikit-Learn (`sklearn`) refers to machine learning models and algorithms as estimators.
* Classification problem - predicting a category (heart disease or not).
  * Sometimes you'll see clf (short for classifier) used as a classification estimator instance's variable name.
* Regression problem - predicting a number (selling price of a car).
* Unsupervised problem - clustering (grouping unlabelled samples with other similar unlabelled samples).
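
To make the notes above concrete, here is a small sketch showing that a classification estimator and a regression estimator share the same `fit()` / `score()` interface. The data is synthetic (generated with `make_classification` and `make_regression`), and `LogisticRegression` and `Ridge` are only stand-ins chosen for this illustration.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import train_test_split

# Classification: the target is a category (here 0 or 1)
X_c, y_c = make_classification(n_samples=200, n_features=5, random_state=0)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_c, y_c, test_size=0.2, random_state=0
)
clf = LogisticRegression()
clf.fit(Xc_train, yc_train)
clf_accuracy = clf.score(Xc_test, yc_test)  # mean accuracy for classifiers

# Regression: the target is a number
X_r, y_r = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_r, y_r, test_size=0.2, random_state=0
)
reg = Ridge()
reg.fit(Xr_train, yr_train)
reg_r2 = reg.score(Xr_test, yr_test)  # R^2 for regressors

print(f"classifier accuracy: {clf_accuracy:.3f}")
print(f"regressor R^2: {reg_r2:.3f}")
```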

https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

![](images/cheat-sheet-map/ml_map.png)

Let's start with a regression problem (trying to predict a number). We'll use the California Housing dataset built into Scikit-Learn's `datasets` module.
 <br> <br>
The goal of the California Housing dataset is to predict a given district's median house value (in hundreds of thousands of dollars) from features such as the median age of the houses, the average number of rooms and bedrooms, the average number of occupants per household and more.

**To practise**, use the toy datasets : https://scikit-learn.org/stable/datasets/toy_dataset.html

![.](images/choose_Ml.jpg)

## 0. Standard imports / getting data ready

In [1]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
%matplotlib inline 

In [2]:
# ---> y (axis=1)
# |
# |
# x (axis=0)

Let's use the California Housing dataset - https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html

In [20]:
# Get California Housing dataset
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
housing

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n

Since it's in a dictionary, let's turn it into a DataFrame so we can inspect it better.

In [8]:
housing_df = pd.DataFrame(housing["data"], columns=housing["feature_names"])
# Reminder: the features are the data (inputs)
# and the label is the target (what we predict)
housing_df["target"] = pd.Series(housing["target"])
# The target's real name is MedHouseVal, but we call it "target" to keep things simple
housing_df.head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
0,8.3252,41.0,6.984127,1.02381,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.97188,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.80226,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422


In [5]:
# How many samples?
len(housing_df)

20640
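
As a possible shortcut, many Scikit-Learn dataset loaders accept `as_frame=True` and return a ready-made DataFrame. The sketch below uses `load_diabetes`, a small dataset bundled with Scikit-Learn, purely as a stand-in so that nothing needs to be downloaded; it is not the dataset used in this notebook.

```python
from sklearn.datasets import load_diabetes

# as_frame=True asks the loader for pandas objects instead of NumPy arrays
diabetes = load_diabetes(as_frame=True)

# .frame holds the features and the target together in one DataFrame
diabetes_df = diabetes.frame
print(diabetes_df.shape)
print(diabetes_df.columns.tolist())
```

The resulting DataFrame already contains a `target` column, so the manual `pd.DataFrame(...)` step is not needed in that case.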

## 1. Picking a machine learning model for a regression problem

![.](images/cheat-sheet-map/sklearn-ml-map-cheatsheet-california-housing-ridge.png)

1. We have 20640 samples > 50 ==> YES
2. We don't want to predict a category (such as whether a picture shows a cat or a dog); we want to predict a price (i.e. a quantity) ==> NO
3. We want to predict a price, i.e. a quantity (how much does it cost?) ==> YES
4. We have 20640 samples < 100 000 ==> YES
5. Should only a few features be important? For now we don't know...

So it's either RidgeRegression or Lasso.
<br>
For now we're going to take RidgeRegression.
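
The question "should only a few features be important?" matters because Lasso can set coefficients to exactly zero, while Ridge only shrinks them. The sketch below illustrates this on synthetic data where, by construction, only 3 of 20 features are informative. The dataset and its parameters are assumptions made for this illustration, not the California Housing data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, only 3 of which actually influence the target
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=3, noise=5.0, random_state=0
)

ridge = Ridge().fit(X, y)
lasso = Lasso().fit(X, y)

# Count coefficients that are exactly zero in each model
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print(f"Ridge zero coefficients: {n_zero_ridge} / 20")
print(f"Lasso zero coefficients: {n_zero_lasso} / 20")
```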

## 2. Build a machine learning model : regression problem

### 2.1 First try :
Click on "RidgeRegression" on the sklearn algorithm cheat-sheet : https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression
<br>
We can see these 2 lines of code : 
`from sklearn import linear_model`
`reg = linear_model.Ridge(alpha=.5)`.
<br>
So for our first model we're going to use the `Ridge` class; others exist, such as `RidgeCV`.


#### `Ridge()`

In [11]:
# Import the Ridge model class from the linear_model module
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X = housing_df.drop("target", axis=1)
y = housing_df.target # target = MedHouseVal : median house price in $100k

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

model = Ridge()
model.fit(X_train,y_train)
model.score(X_test,y_test)

0.6107860106271783

`model.score()` returns the coefficient of determination $R^2$ for regression estimators such as `Ridge`.
<br>
In statistics, the coefficient of determination, denoted $R^2$, is the proportion of the variation in the dependent variable that is predictable from the independent variable(s).
<br>
https://en.wikipedia.org/wiki/Coefficient_of_determination

A data set has $n$ observed values $y_1, \dots, y_n$, each associated with a fitted (or modelled, or predicted) value $f_1, \dots, f_n$ (sometimes written $\hat{y}_i$).
<br>
Define the residuals as $e_i = y_i - f_i$.

If ${\displaystyle {\bar {y}}}$ is the mean of the observed data :
$${\displaystyle {\bar {y}}={\frac {1}{n}}\sum _{i=1}^{n}y_{i}}$$

The sum of squares of residuals, also called the residual sum of squares (proportional to the mean squared error):
$${\displaystyle SS_{\text{res}}=\sum _{i}(y_{i}-f_{i})^{2}=\sum _{i}e_{i}^{2}\,}$$

The total sum of squares (proportional to the variance of the data ($V_{data}= {1 \over n} \cdot SS_{tot}$)) :
<br><br>
$${\displaystyle SS_{\text{tot}}=\sum _{i}(y_{i}-{\bar {y}})^{2}}$$

The most general definition of the coefficient of determination is :
$${\displaystyle R^{2}=1-{SS_{\rm {res}} \over SS_{\rm {tot}}}}$$

![.](images/Coefficient_of_Determination.svg.png)

$$R^{2}=1-{\frac {\color {blue}{SS_{\text{res}}}}{\color {red}{SS_{\text{tot}}}}}$$

<strong>
<font color=red style="font-size: x-large">
If R^2 is close to 1, it means the error made by the model is much smaller than the variance of the data
</font>
    <br>
If R^2 = 0.8, it means your model explains 80% of the variation in your data
</strong>
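
To check the definition, here is a small sketch that computes $SS_{res}$, $SS_{tot}$ and $R^2$ by hand and compares the result with Scikit-Learn's `r2_score`. The four values in `y_true` and `y_pred` are made-up numbers chosen only for this illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up observed values (y_i) and predicted values (f_i)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Residual sum of squares: sum of (y_i - f_i)^2
ss_res = np.sum((y_true - y_pred) ** 2)

# Total sum of squares: sum of (y_i - mean(y))^2
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_pred)

print(f"SS_res = {ss_res}, SS_tot = {ss_tot}")
print(f"manual R^2 = {r2_manual:.4f}, r2_score = {r2_sklearn:.4f}")
```

Here $SS_{res} = 1.5$ and $SS_{tot} = 29.1875$, so $R^2 \approx 0.949$: the model's squared error is small compared with the spread of the data.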



### 2.2 Second try :
After RidgeRegression we try Lasso.
<br>
sklearn algorithm cheat-sheet : https://scikit-learn.org/stable/modules/linear_model.html#lasso
<br>
=> `sklearn.linear_model.Lasso()`

#### `Lasso()`

In [14]:
from sklearn.linear_model import Lasso

X = housing_df.drop("target", axis=1)
y = housing_df.target # median house price in $100k

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

model = Lasso()
model.fit(X_train,y_train)
model.score(X_test,y_test)

0.28433887368566624

What if RidgeRegression didn't work? Or what if we wanted to improve our results? The same goes for Lasso...
<br>
Following the diagram, the next step would be to try EnsembleRegressors. An ensemble (or ensemble model) is a combination of smaller models that tries to make better predictions than a single model.
<br>
<br>
sklearn algorithm cheat-sheet : https://scikit-learn.org/stable/modules/ensemble.html
<br>
One of the most common and useful ensemble methods is the Random Forest, known for its fast training and prediction times and its adaptability to different problems.
<br>
The basic premise of the Random Forest is to combine a number of different decision trees, each one randomly different from the others, and to make a prediction on a sample by averaging the results of all the decision trees.
<br>
https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76 
<br>
(go to the end of this notebook for more resources)
<br>
<br>
Since we're working with regression, we'll use Scikit-Learn's `RandomForestRegressor`.
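
The averaging idea can be checked directly: a fitted `RandomForestRegressor` exposes its individual trees in the `estimators_` attribute. The sketch below uses synthetic data from `make_regression` and a small forest of 20 trees, both chosen only to keep the illustration quick.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (an assumption for this illustration)
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=20, random_state=0)
forest.fit(X, y)

# Predict the first 5 samples with every individual tree...
tree_predictions = np.array([tree.predict(X[:5]) for tree in forest.estimators_])

# ...then average across the trees
averaged = tree_predictions.mean(axis=0)

# The forest's own prediction for the same samples
forest_predictions = forest.predict(X[:5])

print(averaged)
print(forest_predictions)
```

The two printed arrays should agree up to floating point rounding.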

### 2.3 Third try :
#### `RandomForestRegressor()`

![](images/cheat-sheet-map/sklearn-ml-map-cheatsheet-california-housing-ensemble.png)

In [17]:
from sklearn.ensemble import RandomForestRegressor

X = housing_df.drop("target", axis=1)
y = housing_df.target # median house price in $100k

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

model = RandomForestRegressor(n_estimators=100)
# n_estimators is the number of decision trees the RandomForestRegressor builds

model.fit(X_train,y_train)
model.score(X_test,y_test)

0.8222371966851888
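
Each of the three tries above calls `train_test_split` without a `random_state`, so every model is scored on a different random split and the scores change from run to run. A possible way to make the comparison fairer is to fit all the models on one fixed split. The sketch below does this on synthetic data from `make_friedman1`, which is an assumption for this illustration, so its scores will not match the California Housing results.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import train_test_split

# Synthetic non-linear regression data (not the housing dataset)
X, y = make_friedman1(n_samples=1000, n_features=10, noise=1.0, random_state=42)

# One fixed split shared by every model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "RandomForestRegressor": RandomForestRegressor(n_estimators=100, random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R^2 on the shared test set
    print(f"{name}: R^2 = {scores[name]:.3f}")
```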


## 3. Choosing an estimator : classification problem

In [19]:
hear_disease = pd.read_csv("data/heart-disease.csv")
hear_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [21]:
len(hear_disease)

303

1. We have 303 samples > 50 ==> YES
2. We want to predict a category (heart disease or not) ==> YES
3. We have labelled data ==> YES
4. We have 303 samples < 100 000 ==> YES

So we try `LinearSVC`.
<br>
sklearn algorithm cheat-sheet : https://scikit-learn.org/stable/modules/svm.html#classification
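
The two walks through the cheat-sheet (regression in section 1, classification here) can be summarised as a tiny function. This is a hypothetical helper written for this notebook, not part of Scikit-Learn, and it is deliberately simplified: it assumes labelled data and covers only the branches discussed so far.

```python
def first_estimator_to_try(n_samples, predicting_category, few_features_important=False):
    """Hypothetical helper: a simplified walk through part of the cheat-sheet.

    Assumes the data is labelled and, when not predicting a category,
    that we are predicting a quantity.
    """
    if n_samples <= 50:
        return "get more data"
    if n_samples >= 100_000:
        # Large-data branch of the map
        return "SGDClassifier" if predicting_category else "SGDRegressor"
    if predicting_category:
        return "LinearSVC"
    return "Lasso" if few_features_important else "Ridge"


# California Housing: 20640 samples, predicting a quantity
print(first_estimator_to_try(20640, predicting_category=False))

# Heart disease: 303 samples, predicting a category
print(first_estimator_to_try(303, predicting_category=True))
```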

### 3.1 `LinearSVC()`

![.](images/cheat-sheet-map/sklearn-ml-map-cheatsheet-heart-disease-linear-svc.png)

In [25]:
from sklearn.svm import LinearSVC

X = hear_disease.drop("target", axis=1)
y = hear_disease.target 

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

model = LinearSVC()
model.fit(X_train,y_train)
model.score(X_test,y_test)



0.7049180327868853
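
Linear support vector machines are generally sensitive to the scale of the features, so one thing that may help before moving on is to standardise the data first. The sketch below shows the idea with a `StandardScaler` + `LinearSVC` pipeline. It uses `load_breast_cancer`, a dataset bundled with Scikit-Learn, as a stand-in for the heart disease CSV, so the score is not comparable with the one above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Bundled binary classification dataset (a stand-in, not heart-disease.csv)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is fitted on the training data only, then applied before LinearSVC
scaled_svc = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))
scaled_svc.fit(X_train, y_train)
scaled_accuracy = scaled_svc.score(X_test, y_test)

print(f"accuracy with scaling: {scaled_accuracy:.3f}")
```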

Following the diagram until we reach Ensemble Classifiers :
<br>
sklearn algorithm cheat-sheet : https://scikit-learn.org/stable/modules/ensemble.html

### 3.2 `RandomForestClassifier()`

![](images/cheat-sheet-map/sklearn-ml-map-cheatsheet-heart-disease-ensemble.png)

In [27]:
from sklearn.ensemble import RandomForestClassifier 

X = hear_disease.drop("target", axis=1)
y = hear_disease.target 

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

model = RandomForestClassifier()
model.fit(X_train,y_train)
model.score(X_test,y_test)

0.8852459016393442

**Tips :** 
1. If you have structured data (i.e. a table or DataFrame) ==> use ensemble methods (such as Random Forest)
2. If you have unstructured data (audio, images, text) ==> use deep learning methods or transfer learning 

Here we focus on structured data.

#### Random Forest model deep dive
These resources will help you understand what's happening inside the Random Forest models we've been using :
* https://en.wikipedia.org/wiki/Random_forest
* https://simple.wikipedia.org/wiki/Random_forest
* http://blog.yhat.com/posts/random-forests-in-python.html
* https://towardsdatascience.com/an-implementation-and-explanation-of-the-random-forest-in-python-77bf308a9b76