# "Wine Predictor with Random Forest Regressor"

### By Joe Mendez

# In this project we will use machine learning to build a predictive model that estimates the quality of a wine. Given a wine's chemical measurements, the model predicts a quality score that tells us whether the wine is great, just OK, or not very good at all.

## OK, let's get started! First we will import all the libraries and modules needed for this project. We will be using Python 2.7 for this model, and modules from three libraries: scikit-learn, NumPy, and pandas.

In [1]:
#first we will import scikit-learn, which is a great machine learning library for Python
import sklearn

#next we will import the NumPy library. NumPy is great for numerical computations and statistics in Python
import numpy as np

#next we will import pandas. pandas is a Python library that makes DataFrames easy to work with
import pandas as pd

#import the train_test_split function from scikit-learn's model_selection module; it splits the data into training and test sets
from sklearn.model_selection import train_test_split

#import the preprocessing module from scikit-learn for scaling, wrangling and transforming data during preprocessing
from sklearn import preprocessing

## Next we will start to put together the tools we will use to train our model and evaluate the model's performance.

In [117]:
#import Random Forest model from scikit-learn
from sklearn.ensemble import RandomForestRegressor

#to help with cross-validation
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

#to evaluate the model's performance
from sklearn.metrics import mean_squared_error, r2_score


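#import module to save scikit-learn models. Joblib is an alternative to pickle, Python's built-in serialization module
#note: sklearn.externals.joblib was removed in scikit-learn 0.23; on newer versions use "import joblib" instead
from sklearn.externals import joblib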

## Next we will load and view the wine dataset:

In [26]:
#load the wine dataset from a remote url
dataset_url = 'http://mlr.cs.umass.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
#read in the dataset; the file uses ";" as its delimiter
data = pd.read_csv(dataset_url, sep=';')

In [27]:
#let's view the samples and features
print data.shape

(1599, 12)


## There are 1599 samples and 12 columns in our dataset: 11 input features plus the target, quality.

## Let's print out a summary of the data to see what we're working with:

In [28]:
#summary statistics
print data.describe()

       fixed acidity  volatile acidity  citric acid  residual sugar  \
count    1599.000000       1599.000000  1599.000000     1599.000000   
mean        8.319637          0.527821     0.270976        2.538806   
std         1.741096          0.179060     0.194801        1.409928   
min         4.600000          0.120000     0.000000        0.900000   
25%         7.100000          0.390000     0.090000        1.900000   
50%         7.900000          0.520000     0.260000        2.200000   
75%         9.200000          0.640000     0.420000        2.600000   
max        15.900000          1.580000     1.000000       15.500000   

         chlorides  free sulfur dioxide  total sulfur dioxide      density  \
count  1599.000000          1599.000000           1599.000000  1599.000000   
mean      0.087467            15.874922             46.467792     0.996747   
std       0.047065            10.460157             32.895324     0.001887   
min       0.012000             1.000000         

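## Here is a list of the features we will be working with in this dataset:

- fixed acidity
- volatile acidity
- citric acid
- residual sugar
- chlorides
- free sulfur dioxide
- total sulfur dioxide
- density
- pH
- sulphates
- alcohol
- quality (the target)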

## Here we separate our target, "quality", from the input features:

In [95]:
y = data.quality
X = data.drop('quality', axis=1)

## Now we split the data into a training set and a test set. Splitting before any preprocessing or modeling is crucial to getting a realistic estimate of the model's performance, because the test set then plays no part in fitting.

In [96]:
#using train_test_split() from scikit-learn's model_selection module; we hold out 20% of the data as the test set
#stratify=y keeps the distribution of quality scores similar in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.2, 
                                                    random_state=123, 
                                                    stratify=y)
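
The effect of `stratify=y` can be illustrated without scikit-learn. The sketch below is a simplified stand-in, not scikit-learn's actual implementation: the `stratified_split` helper and the label counts are made up for illustration. It shuffles the samples of each quality score separately and sends 20% of each group to the test set, so both sets keep roughly the same mix of scores.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=123):
    """Hypothetical helper: return (train_idx, test_idx) with similar label mixes."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    train_idx, test_idx = [], []
    for label in sorted(by_label):
        idx = by_label[label]
        rng.shuffle(idx)                          # shuffle within one label group
        n_test = int(round(len(idx) * test_size))
        test_idx.extend(idx[:n_test])             # 20% of this group goes to test
        train_idx.extend(idx[n_test:])            # the rest goes to train
    return sorted(train_idx), sorted(test_idx)


# Made-up quality labels: 50 fives, 40 sixes, 10 sevens
labels = [5] * 50 + [6] * 40 + [7] * 10
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(sorted(train_counts.items()))
print(sorted(test_counts.items()))
```

Without stratification, a rare score could by chance end up almost entirely in one of the two sets, which would distort the performance estimate.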

## We will be using a feature in scikit-learn called the Transformer API. The Transformer API lets you "fit" a preprocessing step on the training data the same way you'd fit a model, then apply the identical transformation to new data. This makes your final estimate of model performance more realistic, and it allows you to insert your preprocessing steps into a cross-validation pipeline.

In [97]:
#Fit the scaler on the training set (saving each feature's mean and standard deviation)
scaler = preprocessing.StandardScaler().fit(X_train)

In [98]:
#apply the fitted scaler to the training set
X_train_scaled = scaler.transform(X_train)

In [99]:
#print out mean of the training set 
print X_train_scaled.mean(axis=0)                                           

[  1.16664562e-16  -3.05550043e-17  -8.47206937e-17  -2.22218213e-17
   2.22218213e-17  -6.38877362e-17  -4.16659149e-18  -2.54439854e-15
  -8.70817622e-16  -4.08325966e-16  -1.17220107e-15]


In [100]:
#print out standard deviation
print X_train_scaled.std(axis=0)

[ 1.  1.  1.  1.  1.  1.  1.  1.  1.  1.  1.]


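## Now we apply the same transformation to the test set. Note that we reuse the means and standard deviations saved from the training set, so the scaled test features are close to, but not exactly, mean 0 and standard deviation 1.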

In [101]:
#apply the same fitted scaler to the test set
X_test_scaled = scaler.transform(X_test)

In [102]:
#print out the mean of the test set
print X_test_scaled.mean(axis=0)

[ 0.02776704  0.02592492 -0.03078587 -0.03137977 -0.00471876 -0.04413827
 -0.02414174 -0.00293273 -0.00467444 -0.10894663  0.01043391]


In [103]:
#print out the standard deviation of the test set
print X_test_scaled.std(axis=0)

[ 1.02160495  1.00135689  0.97456598  0.91099054  0.86716698  0.94193125
  1.03673213  1.03145119  0.95734849  0.83829505  1.0286218 ]
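
The test-set means and standard deviations printed above are not exactly 0 and 1 because the scaler only ever sees the training data when it is fitted. The plain-Python sketch below shows the same arithmetic for a single feature. The `fit_scaler` and `transform` helpers and the numbers are made up for illustration; like `StandardScaler`, the sketch uses the population standard deviation (dividing by n).

```python
import math


def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    n = float(len(column))
    mean = sum(column) / n
    variance = sum((x - mean) ** 2 for x in column) / n
    return mean, math.sqrt(variance)


def transform(column, mean, std):
    """Standardize values using statistics learned elsewhere."""
    return [(x - mean) / std for x in column]


train = [7.0, 8.0, 9.0, 10.0]   # made-up training values for one feature
test = [8.0, 11.0]              # made-up test values for the same feature

mean, std = fit_scaler(train)                 # statistics come from train only
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)      # reuse the training statistics

print(sum(train_scaled) / len(train_scaled))  # zero, up to rounding error
print(sum(test_scaled) / len(test_scaled))    # generally not zero
```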


In [104]:
#pipeline with preprocessing and modeling
pipeline = make_pipeline(preprocessing.StandardScaler(), 
                         RandomForestRegressor(n_estimators=100))
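
To see what a pipeline does when it is fitted and used for prediction, here is a minimal plain-Python sketch. It is not scikit-learn's implementation: `MeanCenterer`, `LineFitter`, and `TinyPipeline` are hypothetical classes working on a single made-up feature. The point is that `fit` learns the preprocessing statistics and the model from the same training data, and `predict` reapplies those saved statistics to new data before calling the model.

```python
class MeanCenterer(object):
    """Hypothetical transformer: subtracts the training mean."""

    def fit(self, X, y=None):
        self.mean_ = sum(X) / float(len(X))
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


class LineFitter(object):
    """Hypothetical model: least-squares line for one feature."""

    def fit(self, X, y):
        mx = sum(X) / float(len(X))
        my = sum(y) / float(len(y))
        sxy = sum((x - mx) * (t - my) for x, t in zip(X, y))
        sxx = sum((x - mx) ** 2 for x in X)
        self.slope_ = sxy / sxx
        self.intercept_ = my - self.slope_ * mx
        return self

    def predict(self, X):
        return [self.intercept_ + self.slope_ * x for x in X]


class TinyPipeline(object):
    """Hypothetical two-step pipeline: transformer, then model."""

    def __init__(self, transformer, model):
        self.transformer = transformer
        self.model = model

    def fit(self, X, y):
        X_t = self.transformer.fit(X).transform(X)   # fit the transformer on training data only
        self.model.fit(X_t, y)
        return self

    def predict(self, X):
        # new data is transformed with the statistics saved during fit
        return self.model.predict(self.transformer.transform(X))


pipe = TinyPipeline(MeanCenterer(), LineFitter()).fit([1.0, 2.0, 3.0, 4.0],
                                                      [2.0, 4.0, 6.0, 8.0])
prediction = pipe.predict([5.0])
print(prediction)
```

Because the whole pipeline is refitted on each training fold during cross-validation, the held-out fold never influences the scaling statistics.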

## Hyperparameters vs. model parameters. Within each decision tree, the algorithm decides where to create branches by minimizing either mean squared error (MSE) or mean absolute error (MAE). The branch locations are learned from the data, so they are model parameters. Which of the two criteria to use, and how many trees to include in the forest, cannot be learned from the data; these are hyperparameters that we must set ourselves. The pipeline's tunable hyperparameters are listed below:

In [105]:
#List the pipeline's tunable hyperparameters
print pipeline.get_params()

{'randomforestregressor__random_state': None, 'randomforestregressor__min_weight_fraction_leaf': 0.0, 'standardscaler__with_mean': True, 'randomforestregressor__n_estimators': 100, 'randomforestregressor__min_samples_leaf': 1, 'standardscaler__copy': True, 'randomforestregressor__warm_start': False, 'randomforestregressor__criterion': 'mse', 'randomforestregressor__n_jobs': 1, 'randomforestregressor__max_leaf_nodes': None, 'randomforestregressor__oob_score': False, 'randomforestregressor__verbose': 0, 'randomforestregressor__min_impurity_split': 1e-07, 'steps': [('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('randomforestregressor', RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
         

## Now we set the hyperparameters

In [106]:
#Setting hyperparameters
hyperparameters = { 'randomforestregressor__max_features' : ['auto', 'sqrt', 'log2'],
                  'randomforestregressor__max_depth': [None, 5, 3, 1]}
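
The dictionary above defines a grid: every value of `max_features` is combined with every value of `max_depth`. As a rough illustration of how much work that implies, the sketch below enumerates the combinations with `itertools.product` and multiplies by the 10 folds used in the grid search of this notebook. The variable names are chosen for illustration only.

```python
from itertools import product

hyperparameters = {
    'randomforestregressor__max_features': ['auto', 'sqrt', 'log2'],
    'randomforestregressor__max_depth': [None, 5, 3, 1],
}

# Build every combination of the listed values (3 x 4 candidates)
names = sorted(hyperparameters)
candidates = [dict(zip(names, values))
              for values in product(*(hyperparameters[n] for n in names))]

n_folds = 10                                 # matches cv=10 in the grid search
n_cv_fits = len(candidates) * n_folds        # one fit per candidate per fold

print(len(candidates))
print(n_cv_fits)
```

With `refit=True`, one additional fit on the full training set follows the cross-validation fits.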

## GridSearchCV performs cross-validation across the entire "grid" of hyperparameters. It takes in your model (here, the pipeline), the hyperparameters you want to tune, and the number of "folds" to create.

In [107]:
#scikit-learn cross-validation with pipeline
clf = GridSearchCV(pipeline, hyperparameters, cv=10)
 
# Fit and tune model
clf.fit(X_train, y_train)

GridSearchCV(cv=10, error_score='raise',
       estimator=Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('randomforestregressor', RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_split=1e-07, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=100, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'randomforestregressor__max_depth': [None, 5, 3, 1], 'randomforestregressor__max_features': ['auto', 'sqrt', 'log2']},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)
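
The "folds" in `cv=10` can be pictured as follows. This is a plain-Python sketch of unshuffled k-fold splitting, with a hypothetical `k_fold_indices` helper and a made-up sample count; it is meant to show the idea, not to reproduce scikit-learn's code. The training data is cut into 10 contiguous pieces, and each piece takes one turn as the validation set while the other nine are used for fitting.

```python
def k_fold_indices(n_samples, n_folds):
    """Hypothetical helper: yield (train_idx, val_idx) for each fold."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)   # early folds absorb the remainder
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


# 23 made-up samples split into 10 folds
folds = list(k_fold_indices(n_samples=23, n_folds=10))
for train_idx, val_idx in folds:
    print(len(train_idx), val_idx)
```

Each candidate set of hyperparameters is scored on every validation fold, and the scores are averaged to pick the best candidate.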

## Now you can see the best set of hyperparameters found using cross-validation:

In [113]:
#printing out the best hyperparameters
print clf.best_params_

{'randomforestregressor__max_depth': None, 'randomforestregressor__max_features': 'sqrt'}


## It is time to refit the model on the entire training set using the best hyperparameters. GridSearchCV does this automatically by default. To check that refit is on:

In [109]:
#check if auto-refit through GridSearchCV is on. "True" = on
print clf.refit

True


## Here is how we predict a new set of data:

In [110]:
#Predict quality for the test set (stored as y_pred, which the evaluation cells below use)
y_pred = clf.predict(X_test)

## With earlier imported metrics we can now evaluate our model's performance. Let's see how we did:

In [118]:
print r2_score(y_test, y_pred)

0.468295659544


In [116]:
#print mean sq. error
print mean_squared_error(y_test, y_pred)

0.3430946875
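
For readers who want to see what these two metrics compute, here is a plain-Python sketch. The helper names and the small example arrays are made up for illustration. MSE is the average squared difference between true and predicted values. R² compares the model's squared error with that of always predicting the mean of the true values, so 1 is a perfect fit and 0 is no better than the mean. The square root of the MSE reported above (about 0.59) gives a typical error in quality points.

```python
import math


def mean_squared_error_manual(y_true, y_hat):
    """Average of the squared prediction errors."""
    n = float(len(y_true))
    return sum((t - p) ** 2 for t, p in zip(y_true, y_hat)) / n


def r2_score_manual(y_true, y_hat):
    """1 minus (model squared error / squared error of predicting the mean)."""
    mean_true = sum(y_true) / float(len(y_true))
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up true scores and predictions
y_true_demo = [5, 6, 7, 5, 6]
y_hat_demo = [5.2, 5.9, 6.1, 5.4, 6.0]

mse_demo = mean_squared_error_manual(y_true_demo, y_hat_demo)
r2_demo = r2_score_manual(y_true_demo, y_hat_demo)
print(mse_demo)
print(r2_demo)

# Root mean squared error for the MSE reported in this notebook
rmse_reported = math.sqrt(0.3430946875)
print(rmse_reported)
```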


## Now you have a tuned pipeline model for predicting the quality of red wines! Let's make sure we save this model for future use:

In [119]:
joblib.dump(clf, 'rf_regressor.pkl')

['rf_regressor.pkl']

## Next time you want to use the model, just load it with joblib.load:

In [None]:
clf2 = joblib.load('rf_regressor.pkl')
 
# Predict data set using loaded model
clf2.predict(X_test)