<h1><center> Predict Quality of Red Wine - Linear Regression Method </center> </h1>

<img src="coverimage.jpg">

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Exploratory-Data-Analysis" data-toc-modified-id="Exploratory-Data-Analysis-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Exploratory Data Analysis</a></span><ul class="toc-item"><li><span><a href="#Loading-Wine-Data" data-toc-modified-id="Loading-Wine-Data-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Loading Wine Data</a></span></li><li><span><a href="#Data-Exploration" data-toc-modified-id="Data-Exploration-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Data Exploration</a></span></li><li><span><a href="#Preparing-Data" data-toc-modified-id="Preparing-Data-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Preparing Data</a></span><ul class="toc-item"><li><span><a href="#Create-X-and-y" data-toc-modified-id="Create-X-and-y-2.3.1"><span class="toc-item-num">2.3.1&nbsp;&nbsp;</span>Create X and y</a></span></li><li><span><a href="#Train-Test-Split" data-toc-modified-id="Train-Test-Split-2.3.2"><span class="toc-item-num">2.3.2&nbsp;&nbsp;</span>Train Test Split</a></span></li></ul></li></ul></li><li><span><a href="#Analysis" data-toc-modified-id="Analysis-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Analysis</a></span><ul class="toc-item"><li><span><a href="#Linear-Regression" data-toc-modified-id="Linear-Regression-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Linear Regression</a></span></li><li><span><a href="#Linear-Regression-With-Polynomial-Features" data-toc-modified-id="Linear-Regression-With-Polynomial-Features-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Linear Regression With Polynomial Features</a></span></li><li><span><a href="#RidgeCV-Regression" data-toc-modified-id="RidgeCV-Regression-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>RidgeCV Regression</a></span></li><li><span><a 
href="#LassoCV-Regression" data-toc-modified-id="LassoCV-Regression-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>LassoCV Regression</a></span></li><li><span><a href="#ElasticNetCV" data-toc-modified-id="ElasticNetCV-3.5"><span class="toc-item-num">3.5&nbsp;&nbsp;</span>ElasticNetCV</a></span></li></ul></li><li><span><a href="#Next-Steps" data-toc-modified-id="Next-Steps-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Next Steps</a></span></li></ul></div>

## Introduction

The dataset for this project was collected from <a href="https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009">Kaggle - Red Wine Quality</a>. The data consist of 11 input variables (based on physicochemical tests) and a quality score for each red wine (between 0 and 10).

The main objective of the analysis is prediction. In this project, we will employ linear regression algorithms to find the relationship between the quality of the red wine and the other input parameters. We will then choose the best candidate algorithm from the preliminary results. The goal of this implementation is to construct a model that accurately predicts the quality of the red wine. Here the predictand <i>(y-variable)</i> is categorical, but for the regression we treat it as discrete numerical values.
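
Because a regression model returns continuous values, its outputs have to be mapped back to whole-number quality scores before they can be read as grades. A minimal sketch of that step is shown here. The helper name `to_quality_score` and the clipping to the 0 to 10 range are assumptions made for illustration; the cells later in this notebook only round.

```python
import numpy as np


def to_quality_score(predictions, low=0, high=10):
    """Round continuous outputs to the nearest integer and clip to [low, high].

    Hypothetical helper: the clipping bounds are an assumption, not part of
    the original notebook, which only applies np.round.
    """
    rounded = np.round(np.asarray(predictions, dtype=float))
    return np.clip(rounded, low, high).astype(int)


raw_predictions = [4.6, 5.2, 6.7, 10.9, -0.3]
scores = to_quality_score(raw_predictions)
print(scores)
```

Clipping only matters when a model extrapolates outside the score range; for predictions that are already inside the range, the result is the same as plain rounding.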

## Exploratory Data Analysis

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.linear_model import LinearRegression, Lasso, Ridge, RidgeCV, LassoCV, ElasticNetCV 
from sklearn.pipeline import Pipeline

# Mute the sklearn warning about regularization
import warnings
warnings.filterwarnings('ignore', module='sklearn')

### Loading Wine Data

In [None]:
data = pd.read_csv('../input/red-wine-quality-cortez-et-al-2009/winequality-red.csv')
data.head()

### Data Exploration

In [None]:
print('Data shape: ', data.shape)
print("Data types: ")
print(data.dtypes.value_counts())

In [None]:
print("Data Info:")
data.info()  # prints the summary directly; the method returns None

In [None]:
print('The total number of records: ', str(len(data.index)))
print('Column names: ', str(data.columns.tolist()))

There is no missing data in our data set. Quality is a categorical variable on a scale from 0 to 10; it is our `y-variable` in this project.
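
One way to confirm the absence of missing values is to count nulls per column. The sketch below uses a small hypothetical frame (`toy`, with one value deliberately set to NaN) rather than the real file, so that the check has something to find; on the wine data the same call would be `data.isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the wine data.
# One pH value is deliberately missing to show what the check reports.
toy = pd.DataFrame({
    "pH": [3.51, 3.20, np.nan],
    "alcohol": [9.4, 9.8, 9.8],
    "quality": [5, 5, 6],
})

# Number of missing entries in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)
print("Any missing values:", bool(missing_per_column.any()))
```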

In [None]:
data.describe()

<b> Target </b>

* Quality: the quality of the wine on a scale from 0 to 10.

<b> Features </b>

* fixed acidity
* volatile acidity
* citric acid
* residual sugar
* chlorides
* free sulfur dioxide
* total sulfur dioxide
* density
* pH
* sulphates
* alcohol

### Preparing Data

Let's first look at the distribution of each variable.

Plotting a set of histograms:

In [None]:
data.hist(figsize=(10, 10));

The fixed acidity, density, and pH values are approximately normally distributed. The other variables, except quality, are positively skewed.
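
Skewness can also be put into numbers instead of being read off the histograms. The sketch below computes the plain (not bias-corrected) moment-based skewness on synthetic samples; the function name `sample_skewness` and the synthetic data are assumptions for illustration. On the wine data, `data.skew()` would give a comparable, slightly adjusted figure per column.

```python
import numpy as np


def sample_skewness(values):
    """Third central moment divided by the cubed (population) standard deviation."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5


rng = np.random.default_rng(0)
symmetric = rng.normal(size=5000)          # roughly bell shaped
right_skewed = rng.exponential(size=5000)  # long tail to the right

print("symmetric sample:   ", sample_skewness(symmetric))
print("right-skewed sample:", sample_skewness(right_skewed))
```

Values near zero indicate a roughly symmetric distribution, while clearly positive values indicate a long right tail.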

Let's look at the correlation coefficients. A coefficient close to 1 means that there’s a very strong positive correlation between the two variables, and a coefficient close to -1 means a very strong negative correlation. The diagonal shows the correlation of each variable with itself, which is why those entries are 1.

In [None]:
corr = data.corr(method='pearson')
fig, ax = plt.subplots(figsize=(10, 10))  # subplots returns a (figure, axes) pair
sns.heatmap(corr,
           xticklabels=corr.columns,
           yticklabels=corr.columns,
           cmap='YlOrBr',
           annot=True,
           );

* Fixed acidity is positively correlated with density and citric acid and negatively correlated with pH, with absolute correlation coefficients of around 0.67.
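
For reference, the Pearson coefficient shown in the heatmap can be computed by hand as the covariance of two variables divided by the product of their standard deviations. The sketch below uses made-up arrays, not the wine data, and checks the result against `np.corrcoef`.

```python
import numpy as np


def pearson_r(a, b):
    """Pearson correlation: centered cross-product over the product of norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()
    b_c = b - b.mean()
    return np.sum(a_c * b_c) / np.sqrt(np.sum(a_c ** 2) * np.sum(b_c ** 2))


x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
increasing = 2 * x + 1                          # perfectly linear, positive slope
decreasing = -x                                 # perfectly linear, negative slope
noisy = np.array([2.0, 1.0, 4.0, 3.0, 5.0])     # positive but imperfect

for name, other in [("increasing", increasing), ("decreasing", decreasing), ("noisy", noisy)]:
    print(name, pearson_r(x, other), np.corrcoef(x, other)[0, 1])
```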

    

#### Create X and y

In [None]:
y_col = "quality"

X = data.drop(y_col, axis=1)
y = data[y_col]

In [None]:
print('X:')
X.head()

In [None]:
print('y: ')
y.head()

#### Train Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=72018)

In [None]:
# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))

Apply a min-max scaler to normalize the data. This ensures that each feature is treated equally when applying supervised learners.

In [None]:
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
pd.DataFrame(X_train_s)
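
Min-max scaling maps each feature to `(x - min) / (max - min)`, with the minimum and maximum taken from the training data only. The sketch below reproduces that arithmetic in NumPy on made-up arrays (the array names are illustrative, not from the notebook) and shows that test values can land outside the 0 to 1 range when they exceed what was seen in training.

```python
import numpy as np

# Made-up training and test arrays with two features
train = np.array([[1.0, 10.0], [3.0, 20.0], [5.0, 40.0]])
test = np.array([[2.0, 50.0]])

# Statistics come from the training data only
col_min = train.min(axis=0)
col_range = train.max(axis=0) - col_min

train_scaled = (train - col_min) / col_range
test_scaled = (test - col_min) / col_range  # reuses the training statistics

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics on the test set is what `scaler.transform(X_test)` does, and it keeps information about the test set out of the fitted scaler.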

## Analysis

We'll now:

- Train the following models: Vanilla Linear, RidgeCV, LassoCV, ElasticNetCV
- Compare accuracy scores
- Compare root-mean square errors
- Plot the results: prediction vs actual

### Linear Regression

In [None]:
LR = LinearRegression()
LR = LR.fit(X_train_s, y_train)
y_train_pred = np.round(LR.predict(X_train_s))
X_test_s = scaler.transform(X_test)
y_test_pred = np.round(LR.predict(X_test_s))

print('r2 score for train data: ', r2_score(y_train.values, y_train_pred))
print('r2 score for test data: ', r2_score(y_test.values, y_test_pred))

In [None]:
pd.DataFrame({'Model coeff': LR.coef_})
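
The r2 score used above is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on made-up numbers and shows that rounding predictions before scoring changes the result, which is why all models should be scored the same way when they are compared. The function name `r2` and the sample values are assumptions for illustration.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Made-up actual scores and continuous predictions
y_true = np.array([5, 6, 5, 7, 4])
y_raw = np.array([5.4, 5.6, 5.1, 6.4, 4.6])

r2_raw = r2(y_true, y_raw)
r2_rounded = r2(y_true, np.round(y_raw))
print("r2 on raw predictions:    ", r2_raw)
print("r2 on rounded predictions:", r2_rounded)
```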

### Linear Regression With Polynomial Features

In [None]:
pf = PolynomialFeatures(degree=2, include_bias=False)
X_pf = pf.fit_transform(X)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_pf, y, test_size=0.3, 
                                                    random_state=72018)

In [None]:
X_train_s = scaler.fit_transform(X_train)

In [None]:
LR = LR.fit(X_train_s, y_train)
y_train_pred = np.round(LR.predict(X_train_s))
X_test_s = scaler.transform(X_test)
y_test_pred = np.round(LR.predict(X_test_s))

print('r2 score for train data: ', r2_score(y_train.values, y_train_pred))
print('r2 score for test data: ', r2_score(y_test.values, y_test_pred))

Adding polynomial features improves the training r2 score compared to the simple linear regression, but lowers the test r2 score, which suggests overfitting.
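
Part of the reason is the size of the expanded feature set. A degree-2 expansion without a bias column keeps the original features and adds every square and pairwise product. The sketch below counts those terms with the standard library; the helper name `count_polynomial_terms` is hypothetical, and the count should match the number of columns in `X_pf`.

```python
from itertools import combinations_with_replacement
from math import comb


def count_polynomial_terms(n_features, degree):
    """Count monomials of total degree 1..degree (no constant term)."""
    return sum(
        1
        for d in range(1, degree + 1)
        for _ in combinations_with_replacement(range(n_features), d)
    )


n_terms = count_polynomial_terms(11, 2)
print("Features after degree-2 expansion:", n_terms)

# Closed form for the same count: C(n + degree, degree) - 1
print("Closed form:", comb(11 + 2, 2) - 1)
```

Going from 11 to 77 inputs gives an unregularized model far more freedom to fit noise in the training split, which motivates the regularized models that follow.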

In [None]:
pd.DataFrame({'Model coeff': LR.coef_}).T

In [None]:
from sklearn.metrics import mean_squared_error


def rmse(ytrue, ypredicted):
    return np.sqrt(mean_squared_error(ytrue, ypredicted))

In [None]:
from sklearn.linear_model import LinearRegression

linearRegression = LinearRegression().fit(X_train, y_train)

linearRegression_rmse = rmse(y_test, np.round(linearRegression.predict(X_test)))

print('linearRegression_rmse: ', linearRegression_rmse)

In [None]:
f = plt.figure(figsize=(6,6))
ax = plt.axes()

ax.plot(y_test, np.round(linearRegression.predict(X_test)), 
         marker='o', ls='', ms=3.0)

lim = (0, y_test.max())

ax.set(xlabel='Actual Quality', 
       ylabel='Predicted Quality', 
       xlim=lim,
       ylim=lim,
       title='Linear Regression Results');

### RidgeCV Regression

In [None]:
from sklearn.linear_model import RidgeCV

alphas = [0.005, 0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 80]

ridgeCV = RidgeCV(alphas=alphas, 
                  cv=4).fit(X_train, y_train)

ridgeCV_rmse = rmse(y_test, np.round(ridgeCV.predict(X_test)))

print('ridgeCV.alpha:', ridgeCV.alpha_, 'ridgeCV_rmse: ' ,ridgeCV_rmse)
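
To see what the `alpha` values control, the sketch below solves ridge regression in closed form on synthetic data and prints how the coefficient norm shrinks as `alpha` grows. It is an illustration under stated assumptions (synthetic data, centered inputs, unpenalized intercept), not the solver that `RidgeCV` uses internally.

```python
import numpy as np


def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution on centered data (intercept not penalized)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_features = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)


# Synthetic regression problem with known coefficients
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

demo_alphas = [0.005, 1, 80, 1000]
norms = []
for alpha in demo_alphas:
    coef = ridge_coefficients(X_demo, y_demo, alpha)
    norms.append(float(np.linalg.norm(coef)))
    print(alpha, np.round(coef, 3), round(norms[-1], 3))
```

Cross-validation, as in `RidgeCV`, picks the `alpha` from the grid that balances this shrinkage against fit on held-out folds.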

### LassoCV Regression

In [None]:
from sklearn.linear_model import LassoCV

alphas2 = np.array([1e-5, 5e-5, 0.0001, 0.0005])

lassoCV = LassoCV(alphas=alphas2,
                  max_iter=50000,  # must be an integer
                  cv=3).fit(X_train, y_train)

lassoCV_rmse = rmse(y_test, np.round(lassoCV.predict(X_test)))

print('lassoCV.alpha',lassoCV.alpha_, 'lassoCV_rmse',lassoCV_rmse)  # Lasso is slower

We can determine how many of these features remain non-zero.

In [None]:
print('Of {} coefficients, {} are non-zero with Lasso.'.format(len(lassoCV.coef_), 
                                                               len(lassoCV.coef_.nonzero()[0])))
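
Lasso can set coefficients exactly to zero because its L1 penalty leads to a soft-thresholding step: values smaller in magnitude than the threshold become zero, and larger ones are pulled toward zero by the threshold. The sketch below shows that operator on made-up coefficients. It illustrates the mechanism only and is not scikit-learn's implementation.

```python
import numpy as np


def soft_threshold(values, threshold):
    """Shrink each value toward zero by `threshold`; small values become zero."""
    values = np.asarray(values, dtype=float)
    return np.sign(values) * np.maximum(np.abs(values) - threshold, 0.0)


# Made-up coefficients: two large, two small
coefs = np.array([0.8, -0.05, 0.02, -1.2])
shrunk = soft_threshold(coefs, 0.1)

print(shrunk)
print("Non-zero after thresholding:", np.count_nonzero(shrunk), "of", len(coefs))
```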

### ElasticNetCV

Now try the elastic net, with the same alphas as in Lasso and l1_ratios between 0.1 and 0.9.

In [None]:
from sklearn.linear_model import ElasticNetCV

l1_ratios = np.linspace(0.1, 0.9, 9)

elasticNetCV = ElasticNetCV(alphas=alphas2, 
                            l1_ratio=l1_ratios,
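                            max_iter=10000).fit(X_train, y_train)  # max_iter must be an integer
# Note: unlike the models above, these predictions are not rounded before scoring
elasticNetCV_rmse = rmse(y_test, elasticNetCV.predict(X_test))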

print('elasticNetCV.alpha',elasticNetCV.alpha_, 'elasticNetCV.l1_ratio',elasticNetCV.l1_ratio_, 'elasticNetCV_rmse',elasticNetCV_rmse)
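
The elastic net penalty mixes the L1 and L2 terms, and `l1_ratio` sets the mix: a value of 1 gives a pure lasso penalty and a value of 0 a pure ridge-style penalty. The sketch below evaluates the penalty term as it appears in scikit-learn's documented objective, on a made-up coefficient vector; the function name `elastic_net_penalty` is an assumption for illustration.

```python
import numpy as np


def elastic_net_penalty(coef, alpha, l1_ratio):
    """alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2"""
    coef = np.asarray(coef, dtype=float)
    l1_term = np.sum(np.abs(coef))
    l2_term = np.sum(coef ** 2)
    return alpha * l1_ratio * l1_term + 0.5 * alpha * (1.0 - l1_ratio) * l2_term


# Made-up coefficient vector
demo_coef = np.array([0.5, -2.0, 0.0])

for ratio in np.linspace(0.1, 0.9, 9):
    print(round(float(ratio), 1), elastic_net_penalty(demo_coef, alpha=0.1, l1_ratio=ratio))
```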

Comparing the RMSE calculation from all models is easiest in a table.

In [None]:
rmse_vals = [linearRegression_rmse, ridgeCV_rmse, lassoCV_rmse, elasticNetCV_rmse]

labels = ['Linear', 'Ridge', 'Lasso', 'ElasticNet']

rmse_df = pd.Series(rmse_vals, index=labels).to_frame()
rmse_df.rename(columns={0: 'RMSE'}, inplace=True)
rmse_df

We can also make a plot of actual vs predicted wine quality as before.

In [None]:
f = plt.figure(figsize=(6,6))
ax = plt.axes()

labels = ['Ridge', 'Lasso', 'ElasticNet']

models = [ridgeCV, lassoCV, elasticNetCV]

for mod, lab in zip(models, labels):
    ax.plot(y_test, mod.predict(X_test), 
             marker='o', ls='', ms=3.0, label=lab)


leg = plt.legend(frameon=True)
leg.get_frame().set_edgecolor('black')
leg.get_frame().set_linewidth(1.0)

ax.set(xlabel='Actual Quality', 
       ylabel='Predicted Quality', 
       title='Linear Regression Results');

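Conclusion: ElasticNet gives the smallest root-mean-square error (RMSE) of the models compared. Based on the RMSE and score results, we therefore recommend ElasticNet as the final model that best fits the data in terms of accuracy. Note that the ElasticNet RMSE above is computed on unrounded predictions, while the other models' predictions are rounded first, so the comparison is not strictly like for like.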

## Next Steps

We could further try to optimize ElasticNet using stochastic gradient descent.
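
As a starting point, scikit-learn's `SGDRegressor` accepts `penalty="elasticnet"` together with `alpha` and `l1_ratio`. The sketch below fits it on synthetic data inside a pipeline with standardization, since SGD is sensitive to feature scale. The data, the variable names, and the hyperparameter values are all assumptions for illustration and have not been tuned on the wine data.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine features: 11 inputs, 3 of them informative
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 11))
true_coef = np.zeros(11)
true_coef[:3] = [1.0, -0.5, 0.25]
y_demo = 5.5 + X_demo @ true_coef + rng.normal(scale=0.1, size=300)

# Elastic net penalty optimized with stochastic gradient descent
sgd_enet = make_pipeline(
    StandardScaler(),
    SGDRegressor(penalty="elasticnet", alpha=1e-4, l1_ratio=0.5,
                 max_iter=1000, tol=1e-3, random_state=0),
)
sgd_enet.fit(X_demo, y_demo)

train_r2 = sgd_enet.score(X_demo, y_demo)
print("Training r2 on synthetic data:", train_r2)
```

On the real data, `alpha` and `l1_ratio` would need to be selected by cross-validation, for example starting from the values chosen by `ElasticNetCV` above.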

Linear regression has low prediction accuracy on this data. To predict the quality of wine more accurately, we could employ classification methods.