# Python Machine Learning Tutorial
## Iris Flower Edition

### Import essential libraries
We'll import :
- **pandas** for manipulating and analyzing data
- **numpy** for use DataFrame object

In [2]:
import pandas as pd
import numpy as np

### Import the data
We need to import the data that contains different information about the iris flower.

In [3]:
url = "https://gist.githubusercontent.com/netj/8836201/raw/6f9306ad21398ea43cba4f7d537619d0e07d5ae3/iris.csv"
dataset = pd.read_csv(url)

### Analyzing the data

#### Features enumeration
Let's see what features the data set contains


In [50]:
print('\n'.join(col for col in dataset.columns))

sepal.length
sepal.width
petal.length
petal.width
variety


#### Data informations
- Data shape

In [53]:
print(dataset.shape)

(150, 5)


The dataset contains 150 rows (samples) and 5 columns (features)

- Summary statistics

In [55]:
print(dataset.describe())

       sepal.length  sepal.width  petal.length  petal.width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.057333      3.758000     1.199333
std        0.828066     0.435866      1.765298     0.762238
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000


- First five samples

In [58]:
print(dataset.head())

   sepal.length  sepal.width  petal.length  petal.width variety
0           5.1          3.5           1.4          0.2  Setosa
1           4.9          3.0           1.4          0.2  Setosa
2           4.7          3.2           1.3          0.2  Setosa
3           4.6          3.1           1.5          0.2  Setosa
4           5.0          3.6           1.4          0.2  Setosa


As we see four of the five features are **numerical**  
The fifth feature is **categorical**, which means it takes only certain values

### Split the data

We need to split the dataset into a *train set* and a *test set*.  

- The **train set** is used to build our model.  
- The **test set** is used to watch the accuracy of the model.

In [4]:
from sklearn.model_selection import train_test_split

#### Define the variables

We have to define which features we'll use as the *dependent variable* and as *the independent variable*.  

- The **Dependent variable** (or target) represent the output or outcome whose variation is being studied.  


In [88]:
## We choose here the variety feature
y = dataset.iloc[:, 4]

- The **Independent variable** (or regressor) represent inputs or causes, that is, potential reasons for variation.  

In [101]:
## We choose here the lenght and width of the petal and sepal
X = dataset.iloc[:, 1:4]

We need to **encode** the dependent variable, more explanation below

In [34]:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
y = labelencoder.fit_transform(y)

#### Splitting the data

In [38]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.2,
                                                    random_state = 1, 
                                                    stratify = y)

### Data preprocessing

In [11]:
from sklearn import preprocessing

#### Transformation
We need to transform text data into some kind of readable value.  
For this, we can use the **One-Hot Encoder** or the **Label Encoder**.

##### One-Hot Encoder
Takes a column which has categorical data, which has been label encoded, and then splits the column into multiple columns

##### Label Encoder
Converts categorical text data into model-understandable numeric data

In [45]:
print(y)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


#### Standardization
Standardization is the process of subtracting the means from each feature and then dividing by the feature standard deviations.

##### Scaling
Scaling the data brings all your values onto one scale eliminating the sparsity.

##### Transformer API
Transformer API allows you to *fit* a preprocessing step using the training set the same way you'd fit a model and then use the same transformation on future data sets.


1- Fit the transformer on the training set (saving the means and standard deviations)  
2- Apply the transformer to the training set (scaling the training data)  
3- Apply the transformer to the test set (using the same means and standard deviations)

This makes your *final estimate of model performance more realistic*, and it allows to *insert your preprocessing steps* into a **cross-validation pipeline**.

The **scaler** object contain the mean and stardard deviation for each features in the training set

In [46]:
scaler = preprocessing.StandardScaler().fit(X_train)

Now let's apply the transformer to the train set

In [48]:
X_train_scaled = scaler.transform(X_train)

Applying the transformer to the test set

In [49]:
X_test_scaled = scaler.transform(X_test)

##### Pipeline
They operate by enabling a sequence of data to be transformed and correlated together in a model that can be tested and evaluated to achieve an outcome, whether positive or negative.

In [55]:
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestRegressor

pipeline = make_pipeline(preprocessing.StandardScaler(), 
                        RandomForestRegressor(n_estimators=100))

### Hyperparameters
Hyperparameters express *higher-level* structural information about the model, and they are typically set before training the model.

We can list the tunable hyperparameters

In [57]:
print(pipeline.get_params())

{'memory': None, 'steps': [('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('randomforestregressor', RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False))], 'standardscaler': StandardScaler(copy=True, with_mean=True, with_std=True), 'randomforestregressor': RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_star

Now, let's declare the hyperparameters we want to tune through cross-validation.

In [58]:
hyperparameters = { 'randomforestregressor__max_features' : ['auto', 'sqrt', 'log2'],
                  'randomforestregressor__max_depth': [None, 5, 3, 1]}

### Cross-Validation Pipeline
#### Cross-validation
Process for reliably estimating the performance of a method for building a model by training and evaluating your model multiple times using the same method.

1) **Split** your data into k equal parts, or "folds" (typically k=10).  
2) **Train** your model on k-1 folds (e.g. the first 9 folds).  
3) **Evaluate** it on the remaining "hold-out" fold (e.g. the 10th fold).  
4) **Perform** steps (2) and (3) k times, each time holding out a different fold.  
5) **Aggregate** the performance across all k folds. This is your performance metric.  

Use Cross-validation to evaluate different hyperparameters and estimate their effectiveness.

#### Cross-validation pipeline
Include data preprocessing steps inside the cross-validation loop.

1) **Split** your data into k equal parts, or "folds" (typically k=10).  
2) **Preprocess** k-1 training folds.  
3) **Train** your model on the same k-1 folds.  
4) **Preprocess** the hold-out fold using the same transformations from step (2).  
5) **Evaluate** your model on the same hold-out fold.  
6) **Perform** steps (2) - (5) k times, each time holding out a different fold.  
7) **Aggregate** the performance across all k folds. This is your performance metric.  

In [63]:
from sklearn.model_selection import GridSearchCV

clf = GridSearchCV(pipeline, hyperparameters, cv=10)
 
# Fit and tune model
clf.fit(X_train, y_train)

GridSearchCV(cv=10, error_score='raise-deprecating',
       estimator=Pipeline(memory=None,
     steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('randomforestregressor', RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decr...ors=100, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False))]),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'randomforestregressor__max_features': ['auto', 'sqrt', 'log2'], 'randomforestregressor__max_depth': [None, 5, 3, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

Refit the model with the best set of hyperparameters using the entire training set.

In [66]:
print(clf.refit)

True


### Evaluate the model
Predict a new set of data

In [75]:
y_pred = clf.predict(X_test)

150


In [69]:
from sklearn.metrics import mean_squared_error, r2_score

An **R-squared** of 1 means that all movements of dependent variable are completely explained by movements in the independent variable(s) you are interested in.



In [70]:
print(r2_score(y_test, y_pred))

0.9510831119877574


The **Adjusted R-Squared** attempt to take account of the phenomenon of the R2 automatically and spuriously increasing when extra explanatory variables are added to the model.

![Adjusted R-Squared](https://i.stack.imgur.com/fLrDw.png)

- n : number of observations
- p : number of independent variable 

In [102]:
adjusted_r2 = ((1 - (1 - r2_score(y_test, y_pred))) * 
               ((dataset.shape[0] - 1) / 
                (dataset.shape[0] - X.shape[1] - 1)))
print(adjusted_r2)

0.9706259156587388


The **Mean Squared Error** measures the average of the squares of the errors, an *MSE* of zero means that the estimator predicts observations of the parameter with perfect accuracy.

In [72]:
print(mean_squared_error(y_test, y_pred))

0.03261125867482842
