# Python Machine Learning Tutorial, Scikit-Learn: Wine Snob Edition
[Tips for Kaggle](https://elitedatascience.com/beginner-kaggle)

This tutorial is based on the
[Python machine learning tutorial](https://elitedatascience.com/python-machine-learning-tutorial-scikit-learn) from EliteDataScience.

# Step 1. Set up your environment.

In [1]:
# check the Scikit-Learn version
import sklearn
print(sklearn.__version__)

0.18.1


# Step 2. Import libraries and modules.

In [2]:
# NumPy: efficient numerical computation
import numpy as np
# pandas: labeled tabular data (DataFrames)
import pandas as pd
# helper for splitting data into training and test sets
from sklearn.model_selection import train_test_split
# entire preprocessing module, including scaling, transforming, and wrangling data
from sklearn import preprocessing
# import the random forest regressor
from sklearn.ensemble import RandomForestRegressor
# import the tools for building a pipeline and performing cross-validation
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV
# import evaluation metrics
from sklearn.metrics import mean_squared_error, r2_score
# import module for saving scikit-learn models
# (on scikit-learn 0.23 and later, use `import joblib` instead)
from sklearn.externals import joblib

# Step 3: Load red wine data.

We are now ready to load the dataset with pandas.
For other input formats, see the [list of all the pandas IO tools](http://pandas.pydata.org/pandas-docs/stable/io.html).

In [3]:
# Load wine data from remote URL
dataset_url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(dataset_url)

In [4]:
# output the first 5 rows of data
print(data.head())

  fixed acidity;"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
0   7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5                                                                                                                     
1   7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5                                                                                                                     
2  7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;...                                                                                                                     
3  11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58...                                                                                                                     
4   7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5                                                                                                  

The CSV file separates values with semicolons, so we need to read it with a semicolon separator.

In [5]:
# Read CSV with semicolon separator
data = pd.read_csv(dataset_url, sep=';')
print(data.head())

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54.0   0.9970  3.26       0.65   
3                 17.0                  60.0   0.9980  3.16       0.58   
4                 11.0                  34.0   0.9978  3.51       0.56   

   alcohol  quality  
0      9.4        5  
1      9.8        5  
2      9.8        5 
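
The effect of the separator can also be reproduced offline. The sketch below is only an illustration: it parses a small in-memory string (two rows taken from the output above, reduced to four of the twelve columns) instead of the remote file, and the names `raw`, `wrong`, and `right` are made up for this example.

```python
import io

import pandas as pd

# Two rows from the wine data, reduced to four columns for brevity (illustrative subset).
raw = (
    'fixed acidity;"volatile acidity";"citric acid";"quality"\n'
    '7.4;0.7;0;5\n'
    '7.8;0.88;0;5\n'
)

# The default separator is a comma, so each line ends up in a single column.
wrong = pd.read_csv(io.StringIO(raw))

# With sep=';' every field gets its own column.
right = pd.read_csv(io.StringIO(raw), sep=';')

print(wrong.shape)
print(right.shape)
print(list(right.columns))
```

Comparing the two shapes is a quick sanity check to run whenever `head()` shows everything squeezed into one column.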

In [6]:
# print the shape of the data
print(data.shape)
# number of samples (rows)
print('Sample size is', data.shape[0])
# number of columns: 11 input features plus the 'quality' target
print('feature size is', data.shape[1])

(1599, 12)
Sample size is 1599
feature size is 12


In [7]:
# Print summary statistics for each column
print(data.describe())

       fixed acidity  volatile acidity  citric acid  residual sugar  \
count    1599.000000       1599.000000  1599.000000     1599.000000   
mean        8.319637          0.527821     0.270976        2.538806   
std         1.741096          0.179060     0.194801        1.409928   
min         4.600000          0.120000     0.000000        0.900000   
25%         7.100000          0.390000     0.090000        1.900000   
50%         7.900000          0.520000     0.260000        2.200000   
75%         9.200000          0.640000     0.420000        2.600000   
max        15.900000          1.580000     1.000000       15.500000   

         chlorides  free sulfur dioxide  total sulfur dioxide      density  \
count  1599.000000          1599.000000           1599.000000  1599.000000   
mean      0.087467            15.874922             46.467792     0.996747   
std       0.047065            10.460157             32.895324     0.001887   
min       0.012000             1.000000         

# Step 4: Split data into training and test sets.

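Splitting the data into training and test sets at the beginning of your modeling workflow gives you a realistic estimate of the model's performance, because the test set stays unseen until the final evaluation.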

In [8]:
# separate target from training features
y = data.quality
X = data.drop('quality', axis=1)

In [9]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=123, 
                                                    stratify=y)
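
To see what `stratify=y` does, here is a minimal sketch on made-up labels rather than the wine data. The arrays `X_demo` and `y_demo` are hypothetical, with an 80/20 class imbalance chosen so the effect is easy to read off.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 samples of quality 5 and 20 samples of quality 6.
y_demo = np.array([5] * 80 + [6] * 20)
X_demo = np.arange(100).reshape(-1, 1)

# Same arguments as in the cell above, applied to the toy data.
_, _, _, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo)

# Count how often each label appears in the 20-sample test set.
values, counts = np.unique(y_test_demo, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))
```

With stratification the test set keeps the 80/20 ratio of the full label vector, so rare quality scores are not left out of the evaluation by chance.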

# Step 5: Declare data preprocessing steps.

In [10]:
# Quick way of scaling data (does not remember the training means and standard deviations)
X_train_scaled = preprocessing.scale(X_train)
print(X_train_scaled)

[[ 0.51358886  2.19680282 -0.164433   ...,  1.08415147 -0.69866131
  -0.58608178]
 [-1.73698885 -0.31792985 -0.82867679 ...,  1.46964764  1.2491516
   2.97009781]
 [-0.35201795  0.46443143 -0.47100705 ..., -0.13658641 -0.35492962
  -0.20843439]
 ..., 
 [-0.98679628  1.10708533 -0.93086814 ...,  0.24890976 -0.98510439
   0.35803669]
 [-0.69826067  0.46443143 -1.28853787 ...,  1.08415147 -0.35492962
  -0.68049363]
 [ 3.1104093  -0.62528606  2.08377675 ..., -1.61432173  0.79084268
  -0.39725809]]


In [11]:
print('Average of each axis in train data')
print(X_train_scaled.mean(axis=0)) 

print('Standard deviation of each axis in train data')
print(X_train_scaled.std(axis=0))

Average of each axis in train data
[  1.16664562e-16  -3.05550043e-17  -8.47206937e-17  -2.22218213e-17
   2.22218213e-17  -6.38877362e-17  -4.16659149e-18  -2.54439854e-15
  -8.70817622e-16  -4.08325966e-16  -1.17220107e-15]
Standard deviation of each axis in train data
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]


## 1. Fit the transformer on the training set (saving the means and standard deviations)
## 2. Apply the transformer to the training set (scaling the training data)
## 3. Apply the transformer to the test set (using the same means and standard deviations)
## 4. Insert the same preprocessing steps into a cross-validation pipeline

In [12]:
# Fit the scaler on the training data (stores per-feature means and standard deviations)
scaler = preprocessing.StandardScaler().fit(X_train)

In [13]:
# Applying transformer to training data
X_train_scaled = scaler.transform(X_train)

In [14]:
print('Average')
print(X_train_scaled.mean(axis=0))
print('Standard deviation')
print(X_train_scaled.std(axis=0))

Average
[  1.16664562e-16  -3.05550043e-17  -8.47206937e-17  -2.22218213e-17
   2.22218213e-17  -6.38877362e-17  -4.16659149e-18  -2.54439854e-15
  -8.70817622e-16  -4.08325966e-16  -1.17220107e-15]
Standard deviation
[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]


In [15]:
# Applying transformer to test data
X_test_scaled = scaler.transform(X_test)

In [16]:
print('Average')
print(X_test_scaled.mean(axis=0))
print('Standard deviation')
print(X_test_scaled.std(axis=0))

Average
[ 0.02776704  0.02592492 -0.03078587 -0.03137977 -0.00471876 -0.04413827
 -0.02414174 -0.00293273 -0.00467444 -0.10894663  0.01043391]
Standard deviation
[ 1.02160495  1.00135689  0.97456598  0.91099054  0.86716698  0.94193125
  1.03673213  1.03145119  0.95734849  0.83829505  1.0286218 ]
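
The test-set means and standard deviations are not exactly 0 and 1 because the scaler reuses the statistics saved from the training set. A small sketch on synthetic data (the arrays `X_tr` and `X_te` are invented for this example) makes that explicit by reproducing `transform` from the fitted `mean_` and `scale_` attributes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Hypothetical data: 200 training rows and 50 test rows with 3 features each.
X_tr = rng.normal(loc=10.0, scale=2.0, size=(200, 3))
X_te = rng.normal(loc=10.0, scale=2.0, size=(50, 3))

scaler_demo = StandardScaler().fit(X_tr)

# The fitted scaler stores the training means and standard deviations.
print(np.allclose(scaler_demo.mean_, X_tr.mean(axis=0)))
print(np.allclose(scaler_demo.scale_, X_tr.std(axis=0)))

# transform() on the test set applies exactly those stored statistics.
manual = (X_te - scaler_demo.mean_) / scaler_demo.scale_
print(np.allclose(manual, scaler_demo.transform(X_te)))
```

All three checks should print `True`, which is the behaviour that `preprocessing.scale` cannot offer, since it keeps no state between calls.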


In [17]:
# modeling pipeline: preprocessing (scaling) followed by the model
pipeline = make_pipeline(preprocessing.StandardScaler(), 
                         RandomForestRegressor(n_estimators=100))

# Step 6: Declare hyperparameters to tune.

In [18]:
# List tunable hyperparameters
print(pipeline.get_params())

{'steps': [('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('randomforestregressor', RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False))], 'standardscaler': StandardScaler(copy=True, with_mean=True, with_std=True), 'randomforestregressor': RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False), 'standardscaler__copy': True, 'standardscaler__with_mean': True,

[Random Forest](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)

In [19]:
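# Declare the hyperparameters to tune through cross-validation.
# Keys follow the '<step name>__<parameter>' pattern listed by pipeline.get_params().
# Note: 'auto' is accepted by scikit-learn 0.18.1; it was removed from max_features in 1.3.
hyperparameters = {'randomforestregressor__max_features': ['auto', 'sqrt', 'log2'],
                   'randomforestregressor__max_depth': [None, 5, 3, 1]}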
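
To get a feel for how large this grid is, the sketch below expands it with `ParameterGrid` from `sklearn.model_selection`. It only enumerates the combinations and does not fit anything; the variable names `grid` and `candidates` are chosen for this example.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above; ParameterGrid only enumerates values, it does not validate them.
grid = {
    'randomforestregressor__max_features': ['auto', 'sqrt', 'log2'],
    'randomforestregressor__max_depth': [None, 5, 3, 1],
}

candidates = list(ParameterGrid(grid))
print(len(candidates))  # 3 values of max_features times 4 values of max_depth

# Each key splits into the pipeline step name and the parameter of that step.
for name in grid:
    step, param = name.split('__')
    print(step, '->', param)
```

A grid search with 10-fold cross-validation therefore trains each of these 12 candidates 10 times, which is worth keeping in mind before adding more values.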

# Step 7: Tune model using a cross-validation pipeline.

The cross-validation (CV) pipeline looks like this:
- 1 Split your data into k equal parts, or "folds" (typically k=10).
- 2 Preprocess k-1 training folds.
- 3 Train your model on the same k-1 folds.
- 4 Preprocess the hold-out fold using the same transformations from step (2).
- 5 Evaluate your model on the same hold-out fold.
- 6 Perform steps (2) - (5) k times, each time holding out a different fold.
- 7 Aggregate the performance across all k folds. This is your performance metric.

You can use GridSearchCV, which performs this cross-validation for every combination in the hyperparameter grid.
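
The seven steps above can be sketched as a manual loop. This is an illustration under stated assumptions, not what `GridSearchCV` does internally line by line: it uses synthetic data from `make_regression`, k=5 instead of 10 to keep it quick, a single fixed model instead of a grid, and made-up names such as `fold_scaler` and `scores`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the wine features (assumption).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Step 1: split the data into k folds (k=5 here for speed).
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
# Step 6: repeat steps 2-5, holding out a different fold each time.
for train_idx, holdout_idx in kfold.split(X_demo):
    # Step 2: fit the preprocessing on the k-1 training folds only.
    fold_scaler = StandardScaler().fit(X_demo[train_idx])
    # Step 3: train the model on the same k-1 folds.
    model = RandomForestRegressor(n_estimators=20, random_state=0)
    model.fit(fold_scaler.transform(X_demo[train_idx]), y_demo[train_idx])
    # Step 4: preprocess the hold-out fold with the transformation from step 2.
    X_holdout = fold_scaler.transform(X_demo[holdout_idx])
    # Step 5: evaluate on the hold-out fold.
    scores.append(r2_score(y_demo[holdout_idx], model.predict(X_holdout)))

# Step 7: aggregate the performance across all folds.
print(np.mean(scores))
```

Wrapping the scaler and the model in a pipeline lets scikit-learn perform steps 2 to 5 for you, which removes the risk of accidentally fitting the scaler on the hold-out fold.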

In [20]:
# Cross-validated grid search over the pipeline (10 folds)
clf = GridSearchCV(pipeline, hyperparameters, cv=10)
# Fit and tune the model
clf.fit(X_train, y_train)

GridSearchCV(cv=10, error_score='raise',
       estimator=Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('randomforestregressor', RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'randomforestregressor__max_features': ['auto', 'sqrt', 'log2'], 'randomforestregressor__max_depth': [None, 5, 3, 1]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

In [21]:
# print best parameters
print(clf.best_params_)

{'randomforestregressor__max_depth': None, 'randomforestregressor__max_features': 'sqrt'}


# Step 8: Refit on the entire training set.

In [22]:
# Confirm model will be retrained.
print(clf.refit)

True
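
With `refit=True`, the search object retrains the best pipeline on all the data passed to `fit` and exposes it as `best_estimator_`. The sketch below checks this on synthetic data with a deliberately tiny grid; the names `pipe` and `search` and the grid values are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic problem so the search runs in a moment (assumption).
X_demo, y_demo = make_regression(n_samples=120, n_features=4, noise=5.0, random_state=0)

pipe = make_pipeline(StandardScaler(),
                     RandomForestRegressor(n_estimators=10, random_state=0))
search = GridSearchCV(pipe, {'randomforestregressor__max_depth': [None, 3]}, cv=3)
search.fit(X_demo, y_demo)

# refit defaults to True, so a refitted pipeline is available after the search.
print(search.refit)
print(search.best_params_)

# Predicting through the search object delegates to that refitted pipeline.
same = np.allclose(search.predict(X_demo), search.best_estimator_.predict(X_demo))
print(same)
```

This is why `clf` can be used directly for prediction in the next step, without building and fitting the winning pipeline by hand.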


# Step 9: Evaluate model pipeline on test data.

In [23]:
# predict a new set of data (test data)
y_pred = clf.predict(X_test)

In [24]:
# R^2 score and mean squared error on the test set
print(r2_score(y_test, y_pred))
print(mean_squared_error(y_test, y_pred))

0.466173012894
0.344464375
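
For readers who want to see what the two metrics measure, the following sketch recomputes them with NumPy on a handful of made-up values (`y_true` and `y_hat` are hypothetical, not taken from the wine predictions) and compares the results with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true quality scores and predictions.
y_true = np.array([5.0, 6.0, 5.0, 7.0, 4.0])
y_hat = np.array([5.2, 5.7, 5.1, 6.4, 4.6])

# Mean squared error: average of the squared residuals.
mse_manual = np.mean((y_true - y_hat) ** 2)

# R^2: one minus the ratio of residual to total sum of squares.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(np.isclose(mse_manual, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

An R^2 of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of the test targets.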


# Step 10: Save model for future use.

In [25]:
# save model to a .pkl file
joblib.dump(clf, 'rf_regressor.pkl')

['rf_regressor.pkl']

In [26]:
# load model from .pkl file
clf2 = joblib.load('rf_regressor.pkl')
# Predict data set using loaded model
clf2.predict(X_test)

array([ 6.44,  5.79,  4.99,  5.37,  6.29,  5.59,  4.91,  4.83,  5.02,
        5.93,  5.39,  5.7 ,  5.78,  5.06,  5.78,  5.7 ,  6.59,  5.8 ,
        5.64,  6.96,  5.39,  5.64,  5.02,  6.03,  6.  ,  5.02,  5.48,
        5.15,  5.83,  5.96,  5.86,  6.6 ,  5.98,  5.06,  5.01,  5.89,
        5.1 ,  6.08,  5.07,  5.91,  4.96,  6.05,  6.7 ,  5.13,  6.13,
        5.49,  5.51,  5.6 ,  5.11,  6.49,  6.08,  5.22,  5.89,  5.05,
        5.56,  5.75,  5.31,  5.33,  4.99,  5.3 ,  5.21,  5.21,  5.04,
        5.77,  5.97,  5.16,  6.54,  5.  ,  5.21,  6.79,  5.72,  5.81,
        5.08,  5.06,  5.32,  6.03,  5.38,  5.11,  5.23,  5.21,  6.37,
        5.76,  6.16,  6.31,  5.06,  5.91,  6.38,  6.45,  5.75,  5.77,
        5.94,  5.28,  6.4 ,  5.69,  5.67,  5.78,  6.67,  6.84,  5.6 ,
        6.7 ,  5.12,  5.47,  5.1 ,  6.59,  5.06,  4.79,  5.7 ,  4.89,
        5.55,  5.9 ,  5.81,  5.48,  5.94,  5.43,  5.24,  5.21,  5.98,
        5.14,  4.92,  5.9 ,  5.75,  5.06,  5.73,  6.1 ,  5.21,  5.43,
        5.22,  6.05,