- Use scikit-learn to build and tune a supervised learning model.
- We'll train and tune a random forest to predict wine quality from traits like acidity, residual sugar, and alcohol concentration.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing #contains utilities for scaling, transforming, and wrangling data

Next, import the model "family" (the generic form of a model). In this case, a random forest.

In [2]:
from sklearn.ensemble import RandomForestRegressor

Cross-validation is similar to a train/test split, but it is applied to more subsets. We split the data into k subsets (folds), train on k-1 of them, and hold out the remaining fold for testing. This is repeated k times so that each fold serves as the test set exactly once.
- It is a resampling procedure used to evaluate machine learning models on a limited data sample.
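
To make the k-fold idea concrete, here is a minimal sketch that builds the splits by hand with NumPy. The helper name `k_fold_indices` and the toy sizes (10 samples, 5 folds) are illustrative assumptions, not part of the tutorial; in practice scikit-learn handles the splitting for you.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)   # shuffle the sample indices once
    folds = np.array_split(indices, k)     # k roughly equal-sized folds
    for i in range(k):
        test_idx = folds[i]                # the i-th fold is held out
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print("train:", np.sort(train_idx), "test:", np.sort(test_idx))
```

Each sample lands in a test fold exactly once, so every data point is used for both training and evaluation across the k rounds.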

Importing tools for cross-validation:

In [3]:
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

Importing evaluation metrics

In [5]:
from sklearn.metrics import mean_squared_error, r2_score

Importing library to save model for later loading

In [6]:
import joblib  # sklearn.externals.joblib was removed in newer scikit-learn releases

Load wine data using pandas from a url:

In [12]:
dataset_url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
data = pd.read_csv(dataset_url, sep=';')  # values in the csv file are separated by semicolons
print(data.head())
print(data.shape)

   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54.0   0.9970  3.26       0.65   
3                 17.0                  60.0   0.9980  3.16       0.58   
4                 11.0                  34.0   0.9978  3.51       0.56   

   alcohol  quality  
0      9.4        5  
1      9.8        5  
2      9.8        5 

```print(data.shape)``` shows ```(1599, 12)```. This means the wine csv has 1599 samples and 12 columns: 11 input features plus ```quality```, the target we want to predict.  
Using ```.describe()``` on the DataFrame pandas returned, we can easily get some summary statistics:

In [13]:
print (data.describe())

       fixed acidity  volatile acidity  citric acid  residual sugar  \
count    1599.000000       1599.000000  1599.000000     1599.000000   
mean        8.319637          0.527821     0.270976        2.538806   
std         1.741096          0.179060     0.194801        1.409928   
min         4.600000          0.120000     0.000000        0.900000   
25%         7.100000          0.390000     0.090000        1.900000   
50%         7.900000          0.520000     0.260000        2.200000   
75%         9.200000          0.640000     0.420000        2.600000   
max        15.900000          1.580000     1.000000       15.500000   

         chlorides  free sulfur dioxide  total sulfur dioxide      density  \
count  1599.000000          1599.000000           1599.000000  1599.000000   
mean      0.087467            15.874922             46.467792     0.996747   
std       0.047065            10.460157             32.895324     0.001887   
min       0.012000             1.000000         

Notice that all features are numeric. This makes things easy, since most ML algorithms require numeric input. However, the features are on very different scales. Mental note: **standardize the data later**.

Now, we separate the target (y) from our input features (X):

In [15]:
y = data.quality
X = data.drop('quality', axis=1)

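Then split the data into training and test sets.  
Note: it is good practice to stratify the split on the target variable.

**Stratifying data**: ensures the training set looks similar to the test set. Each value of the target (here, each quality score) appears in roughly the same proportion in both sets.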

In [18]:
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, # set aside 20% of data for testing
                                                    random_state=123, # seed so the split is reproducible on every run
                                                    stratify=y) # keep the distribution of quality scores similar in both sets
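
One way to check what ```stratify``` does is to compare class proportions in the two splits. The sketch below uses synthetic, imbalanced labels as a stand-in for the quality scores (an assumption, so it runs without downloading the dataset); the names `y_demo` and `X_demo` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels standing in for wine quality scores (assumption).
y_demo = pd.Series(np.repeat([5, 6, 7], [500, 400, 100]))
X_demo = pd.DataFrame({"feature": np.arange(len(y_demo))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123, stratify=y_demo
)

# Share of each label in each split; with stratify these should nearly match.
train_props = y_tr.value_counts(normalize=True).sort_index()
test_props = y_te.value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_props, "test": test_props}))
```

Without ```stratify```, a rare label could by chance end up over- or under-represented in the test set.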

Standardization is the process of subtracting each feature's mean and then dividing by that feature's standard deviation.

Standardization is a common requirement for machine learning tasks. Many algorithms assume that all features are centered around zero and have approximately the same variance.
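
As a quick illustration of the formula, the sketch below standardizes a small toy matrix by hand with NumPy. The array `X_toy` is invented for the example; the standard deviation used is the population one (`ddof=0`), which is NumPy's default.

```python
import numpy as np

# Toy feature matrix: 4 samples, 2 features on very different scales.
X_toy = np.array([[1.0, 100.0],
                  [2.0, 200.0],
                  [3.0, 300.0],
                  [4.0, 400.0]])

means = X_toy.mean(axis=0)  # per-feature mean
stds = X_toy.std(axis=0)    # per-feature population standard deviation

# Subtract the mean, then divide by the standard deviation, column by column.
X_standardized = (X_toy - means) / stds

print(X_standardized.mean(axis=0))  # approximately 0 for each feature
print(X_standardized.std(axis=0))   # approximately 1 for each feature
```

After standardization the two columns are directly comparable, even though the raw values differed by a factor of 100.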

To scale the train and test data the same way, we use the Transformer API in sklearn, which allows you to "fit" a preprocessing step on the training data the same way you'd fit a model.

Doing so also allows us to insert preprocessing steps into a cross-validation pipeline.
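
A rough sketch of what that looks like: a scaler and a random forest chained with ```make_pipeline```, then scored with 5-fold cross-validation. The synthetic data from `make_regression`, the use of `cross_val_score`, and the small `n_estimators` value are assumptions chosen to keep the example short; they are not taken from the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data as a stand-in for the wine features (assumption).
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# The scaler is refit on the training folds of every split, so the
# held-out fold never influences the means and standard deviations.
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestRegressor(n_estimators=20, random_state=0),
)

# One score per fold (R^2 is the default score for regressors).
scores = cross_val_score(pipeline, X_reg, y_reg, cv=5)
print(scores)
```

Keeping the scaler inside the pipeline is what prevents information from the held-out fold leaking into preprocessing.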

In [19]:
scaler = preprocessing.StandardScaler().fit(X_train)

The scaler object now has the saved means and standard deviations for each feature in the training set.

In [20]:
X_train_scaled = scaler.transform(X_train)
 
print (X_train_scaled.mean(axis=0))
print (X_train_scaled.std(axis=0))

[ 1.16664562e-16 -3.05550043e-17 -8.47206937e-17 -2.22218213e-17
  2.22218213e-17 -6.38877362e-17 -4.16659149e-18 -2.54439854e-15
 -8.70817622e-16 -4.08325966e-16 -1.17220107e-15]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
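
The saved statistics can be inspected and reused on held-out data. The sketch below assumes synthetic arrays (`X_train_demo`, `X_test_demo`) in place of the wine data; ```mean_``` and ```scale_``` are the fitted attributes of ```StandardScaler```.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic train/test arrays standing in for the wine features (assumption).
rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=10.0, scale=3.0, size=(80, 2))
X_test_demo = rng.normal(loc=10.0, scale=3.0, size=(20, 2))

scaler_demo = StandardScaler().fit(X_train_demo)
print(scaler_demo.mean_)   # per-feature means learned from the training data
print(scaler_demo.scale_)  # per-feature standard deviations from the training data

# Transform the test data with the training statistics, not its own.
X_test_demo_scaled = scaler_demo.transform(X_test_demo)
print(X_test_demo_scaled.mean(axis=0))
print(X_test_demo_scaled.std(axis=0))
```

Because the test data is scaled with the training set's statistics, its means and standard deviations will generally be close to, but not exactly, 0 and 1.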


Source: https://elitedatascience.com/python-machine-learning-tutorial-scikit-learn