## Wine Quality Data Set

Dataset URL: https://archive.ics.uci.edu/ml/datasets/wine+quality

The task is to predict the quality of red wine on a scale of 0–10 from a set of physicochemical features. I treat it as a regression problem, first with Linear Regression and then with a Random Forest.

**Attribute Information:**

For more information, read [Cortez et al., 2009].<br>
Input variables (based on physicochemical tests):<br>
1 - fixed acidity<br>
2 - volatile acidity<br>
3 - citric acid<br>
4 - residual sugar<br>
5 - chlorides<br>
6 - free sulfur dioxide<br>
7 - total sulfur dioxide<br>
8 - density<br>
9 - pH<br>
10 - sulphates<br>
11 - alcohol<br>
Output variable (based on sensory data):<br>
12 - quality (score between 0 and 10)<br>

In [1]:
# import necessary packages 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression 
from sklearn import metrics
import numpy as np

In [2]:
# Read the dataset
# The file is not comma-separated; the delimiter is a semicolon.
data = pd.read_csv('Data/winequality-red.csv',sep=';')
data.head()

FileNotFoundError: [Errno 2] File b'Data/winequality-red.csv' does not exist: b'Data/winequality-red.csv'
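The `FileNotFoundError` above means the notebook could not find `Data/winequality-red.csv` relative to its working directory. A minimal sketch of one way to make the load step more forgiving is shown here. The helper name `load_wine` and the fallback URL are assumptions, not part of the original notebook; check the URL against the dataset page before relying on it.

```python
from pathlib import Path

import pandas as pd

# Assumed direct link to the red-wine file on the UCI server (not from the notebook).
UCI_RED_WINE_URL = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/"
    "wine-quality/winequality-red.csv"
)


def load_wine(local_path="Data/winequality-red.csv", fallback_url=UCI_RED_WINE_URL):
    """Read the semicolon-delimited wine file, preferring the local copy.

    Hypothetical helper: falls back to `fallback_url` only when the
    local file does not exist.
    """
    source = local_path if Path(local_path).exists() else fallback_url
    return pd.read_csv(source, sep=";")
```

With this in place, `data = load_wine()` would stand in for the `pd.read_csv` call in the cell above.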

## Missing values in dataset

In [None]:
# find missing values in data 
data.isnull().sum()

**Fortunately, this dataset has no missing values.**

## Descriptive Statistics 

In [None]:
data.describe()

In [None]:
# Percentiles and outliers using a box plot
f, ax = plt.subplots(figsize=(20, 9))
data.boxplot()
plt.show()
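The box plot marks as outliers the points that fall beyond the whiskers, which by default sit 1.5 times the interquartile range (IQR) outside the box. As a rough sketch, the same rule can be applied numerically to count outliers per column. The helper `count_iqr_outliers` and the toy frame are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd


def count_iqr_outliers(frame, whisker=1.5):
    """Count, per numeric column, the values outside the box-plot whiskers."""
    numeric = frame.select_dtypes("number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # A value is flagged when it lies more than `whisker` IQRs outside the box.
    outside = (numeric < q1 - whisker * iqr) | (numeric > q3 + whisker * iqr)
    return outside.sum()


# Toy data: column "a" has one obvious outlier (100), column "b" has none.
toy = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [5, 5, 6, 6, 7]})
print(count_iqr_outliers(toy))
```

On the wine data, `count_iqr_outliers(data)` would give one count per feature to read alongside the plot.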

## Correlation 

In [None]:
pd.plotting.scatter_matrix(data, alpha = 0.3, figsize = (40,40), diagonal = 'kde');
plt.show()

In [None]:
# There are no categorical variables: every feature is numeric, so this is a regression problem.
# Given the feature values, we have to predict the quality of the wine.
# Compute the correlation of each feature with the target variable, quality.
correlations = data.corr()['quality'].drop('quality')
print(correlations)

In [None]:
# correlation matrix shown as a heatmap
plt.figure(figsize=(14, 12))
heatmap = sns.heatmap(data.corr(), annot=True, linewidths=0, vmin=-1, cmap="RdBu_r")
plt.show()

In [None]:
for i, column in enumerate(data.columns):
    print(i,column)

In [None]:
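# one histogram with a KDE overlay per column (the 12 columns fill the 4x3 grid)
fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(15, 16))
for i, column in enumerate(data.columns):
    # sns.distplot is deprecated; histplot with kde=True is its replacement
    sns.histplot(data[column], kde=True, stat='density', ax=axes[i//3, i%3])
plt.tight_layout()
plt.show()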

In [None]:
# sns.pairplot visualizes the pairwise relationships among the samples in a single call
# (the `size` argument was renamed to `height` in seaborn 0.9)
sns.pairplot(data, hue='quality', height=2.5)
plt.show()

In [None]:
def get_features(data,correlation_threshold):
    """Returns the features whose absolute correlation with quality exceeds the threshold"""
    correlations = data.corr()['quality'].drop('quality')
    abs_corrs = correlations.abs()
    high_correlations = abs_corrs[abs_corrs > correlation_threshold].index.values.tolist()
    return high_correlations

In [None]:
get_features(data,0.05)
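Because `get_features` compares the absolute value of each correlation with the threshold, strongly negative features are kept along with strongly positive ones. The sketch below repeats the function on a small made-up frame to show this; the column names `up`, `down`, and `noise` are hypothetical.

```python
import pandas as pd


def get_features(data, correlation_threshold):
    """Return the features whose absolute correlation with quality exceeds the threshold."""
    correlations = data.corr()['quality'].drop('quality')
    abs_corrs = correlations.abs()
    return abs_corrs[abs_corrs > correlation_threshold].index.values.tolist()


# Toy data: "up" rises with quality, "down" falls with it, "noise" is uncorrelated.
toy = pd.DataFrame({
    "up": [1, 2, 3, 4, 5, 6],
    "down": [6, 5, 4, 3, 2, 1],
    "noise": [1, -1, 0, 0, -1, 1],
    "quality": [3, 4, 5, 6, 7, 8],
})

print(get_features(toy, 0.05))  # keeps "up" and "down", drops "noise"
```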

In [None]:
features_select = get_features(data,0.05)

In [None]:
# selected features
print(features_select)

In [None]:
x = data[features_select] 
y = data['quality']

In [None]:
# Train test split 
# 30% of the data is used for testing and 70% for training.
# Checking the size of the dataset using x_train.shape
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.30,random_state=3)
print('Total dataset size:',x.shape)
print('Train dataset size:',x_train.shape)
print('Test dataset size:',x_test.shape)
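Since the data file was not available when this notebook ran, the shapes are not shown. As a sketch, the same split can be applied to a placeholder array to see what sizes to expect. It assumes the red-wine file has 1,599 rows, as documented on the dataset page, and that 10 features were selected; both numbers are assumptions here, and the `_demo` names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumed dimensions: 1,599 red-wine samples and 10 selected features.
n_rows, n_features = 1599, 10
x_demo = np.zeros((n_rows, n_features))
y_demo = np.zeros(n_rows)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.30, random_state=3
)
# scikit-learn rounds the test size up, so 30% of 1,599 becomes 480 rows.
print('Train dataset size:', x_tr.shape)
print('Test dataset size:', x_te.shape)
```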

In [None]:
# fitting linear regression to training data
regressor = LinearRegression()
regressor.fit(x_train,y_train)
  
# this gives the coefficients of the 10 features selected above.  
print(regressor.coef_)

In [None]:
# predict on train and test set
train_pred = regressor.predict(x_train)
test_pred = regressor.predict(x_test)

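# round predictions to the nearest integer quality score
# (np.round_ was removed in NumPy 2.0; np.round is the supported name)
train_pred = np.round(train_pred)
test_pred = np.round(test_pred)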

In [None]:
# calculating RMSE (scikit-learn's argument order is y_true, y_pred)
train_rmse = metrics.mean_squared_error(y_train, train_pred) ** 0.5
print('Train RMSE:',train_rmse)
test_rmse = metrics.mean_squared_error(y_test, test_pred) ** 0.5
print('Test RMSE:',test_rmse)

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, test_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, test_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, test_pred)))

In [None]:
# displaying the coefficient of each feature
coefficients = pd.DataFrame(regressor.coef_, index=features_select, columns=['Coefficient'])
print(coefficients)

**Observation**

* Holding all other features fixed, the model predicts that a 1-unit increase in sulphates raises wine quality by about 0.8.

* Likewise, holding all other features fixed, a 1-unit increase in volatile acidity lowers predicted quality by about 0.99. The remaining coefficients are read the same way.
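The "holding all other features fixed" reading can be checked directly: raise one feature by 1 for a single row and compare the two predictions. The sketch below does this on synthetic data built from the coefficients quoted above (0.8 and -0.99). The intercept of 5.0, the value ranges, and the noise-free relationship are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "sulphates": rng.uniform(0.3, 2.0, 200),
    "volatile acidity": rng.uniform(0.1, 1.6, 200),
})
# Hypothetical noise-free target using the coefficients from the observation.
toy_quality = 5.0 + 0.8 * toy["sulphates"] - 0.99 * toy["volatile acidity"]

model = LinearRegression().fit(toy, toy_quality)

# Raise sulphates by one unit for a single row, leaving the other feature unchanged.
base = toy.iloc[[0]]
bumped = base.copy()
bumped["sulphates"] += 1.0
change = model.predict(bumped)[0] - model.predict(base)[0]

print('Change in prediction:', change)
print('Sulphates coefficient:', model.coef_[0])
```

The change in the prediction equals the sulphates coefficient, which is all a linear-regression coefficient claims. It describes the fitted model and does not by itself show that adding sulphates causes better wine.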

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor 

rf_regressor = RandomForestRegressor(random_state=3)  # fixed seed so the results are reproducible
rf_regressor.fit(x_train,y_train)

# predict on train and test set
train_pred = rf_regressor.predict(x_train)
test_pred = rf_regressor.predict(x_test)

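# round predictions to the nearest integer quality score
# (np.round_ was removed in NumPy 2.0; np.round is the supported name)
train_pred = np.round(train_pred)
test_pred = np.round(test_pred)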

# calculating RMSE (scikit-learn's argument order is y_true, y_pred)
train_rmse = metrics.mean_squared_error(y_train, train_pred) ** 0.5
print('Train RMSE:',train_rmse)
test_rmse = metrics.mean_squared_error(y_test, test_pred) ** 0.5
print('Test RMSE:',test_rmse)

In [None]:
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, test_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, test_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, test_pred)))

In [None]:

# displaying feature importance of each feature
feat_imp = pd.DataFrame(rf_regressor.feature_importances_,features_select) 
feat_imp.columns = ['Feature_Importance'] 
print(feat_imp)

## Try Other Algorithms

In [None]:
# decision tree regressor 
# gradient boosting regressor
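One possible way to try the two algorithms listed above is sketched here. The helper `compare_regressors` is hypothetical, and the data comes from `make_regression` as a synthetic stand-in, so the printed scores say nothing about the wine dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def compare_regressors(x_train, x_test, y_train, y_test, seed=3):
    """Fit each candidate model and return its test RMSE keyed by model name."""
    models = {
        "decision_tree": DecisionTreeRegressor(random_state=seed),
        "gradient_boosting": GradientBoostingRegressor(random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(x_train, y_train)
        mse = mean_squared_error(y_test, model.predict(x_test))
        scores[name] = mse ** 0.5
    return scores


# Synthetic stand-in data with the same 70/30 split used earlier.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=3)
split = train_test_split(X, y, test_size=0.30, random_state=3)
print(compare_regressors(*split))
```

With the wine data loaded, the call would be `compare_regressors(x_train, x_test, y_train, y_test)`, reusing the split from the linear-regression section.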