# Wine Time
By: Data Scientists Corey Baughman & DeAdrien Hill

## Goal:
* Discover drivers of wine quality scores in the wine quality dataset
* Identify if clustering has a benefit in modeling
* Use drivers to develop a machine learning model that predicts wine quality better than baseline

# Imports

In [1]:
#Modules for data processing
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import acquire
import prepare
import seaborn as sns
from scipy.stats import pearsonr
import scipy.stats as stats
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression, LassoLars, TweedieRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.svm import SVR


# Acquire

* Data acquired from Data.World Wine Quality Dataset
* Dataset contained 6497 rows and 12 columns before cleaning
* The column is_red was added to identify the wine type (1 = red, 0 = white)
* Each row represents a red or white wine
* Each column represents a feature of the wine

# Prepare

* Dataset was clean with no missing values 
* Removed white space in column names
* Checked for and removed outliers
* Split the data for modeling
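
The `prepare` module is not shown in this notebook, so the following is only a minimal sketch of what the outlier removal and three-way split could look like. The IQR rule, the split proportions, the helper names `remove_outliers_iqr` and `split_three_ways`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def remove_outliers_iqr(df, cols, k=1.5):
    """Drop rows outside [Q1 - k*IQR, Q3 + k*IQR] in any listed column (hypothetical helper)."""
    keep = pd.Series(True, index=df.index)
    for col in cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        keep &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[keep]


def split_three_ways(df, target, seed=42):
    """Split into train/validate/test (about 56/24/20) and separate X from y (hypothetical helper)."""
    train_validate, test = train_test_split(df, test_size=0.2, random_state=seed)
    train, validate = train_test_split(train_validate, test_size=0.3, random_state=seed)
    parts = {}
    for name, part in (("train", train), ("validate", validate), ("test", test)):
        parts[name] = (part.drop(columns=[target]), part[target])
    return parts


# Synthetic stand-in for the wine data (not the real dataset)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.0, 500),
    "chlorides": rng.normal(0.05, 0.01, 500),
    "quality": rng.integers(3, 9, 500),
})
demo.loc[0, "alcohol"] = 40.0  # plant an obvious outlier

cleaned = remove_outliers_iqr(demo, ["alcohol", "chlorides"])
parts = split_three_ways(cleaned, target="quality")
X_train_demo, y_train_demo = parts["train"]
```

Splitting before any scaling or exploration keeps the validate and test sets untouched until they are needed.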

# Data Dictionary

### Feature: Description

**fixed acidity**: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

**volatile acidity**: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

**citric acid**: found in small quantities, citric acid can add freshness and flavor to wines

**residual sugar**: the amount of sugar remaining after fermentation stops; it's rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

**chlorides**: the amount of salt in the wine

**free sulfur dioxide**: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

**total sulfur dioxide**: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

**density**: the density of wine is close to that of water depending on the percent alcohol and sugar content

**pH**: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

**sulphates**: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

**alcohol**: the percent alcohol content of the wine

**quality**: score between 0 and 10

**is_red**: indicates red or white with 1 or 0

In [None]:
df = acquire.new_wine_data()

In [None]:
#Removed duplicated index from import
df.index.is_unique
df.index.duplicated()
df = df.loc[~df.index.duplicated(), :]

# Data Summary

In [None]:
df.describe()

In [None]:
for col in df.columns:
    sns.boxplot(x=df[col])
    plt.title(col)
    plt.show()

In [None]:
#Removing whitespace, outliers, and splitting the data
partitions = prepare.prepare(df, target_var='quality')

In [None]:
#Labeling variables for modeling
train = partitions[0]
X_train = partitions[1]
X_validate = partitions[2]
X_test = partitions[3]
y_train = partitions[4]
y_validate = partitions[5]
y_test = partitions[6]

# A look at the data

In [None]:
train.head()

# Exploration

# What is the distribution of our data?

In [None]:
#Hist plot to show distribution of each column in the dataframe
fig = plt.figure(figsize = (15,20))
ax = fig.gca()
train.hist(ax = ax)
plt.show()

# Do certain drivers affect quality more than others?

In [None]:
#Correlation, p-value between feature and quality
prepare.pearson_r(train)
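
Since `prepare.pearson_r` is not shown here, this is a rough sketch of how a per-feature correlation table against the target could be built. The function name `correlations_with_target` and the synthetic columns are assumptions, not the notebook's actual code.

```python
import numpy as np
import pandas as pd
from scipy import stats


def correlations_with_target(df, target="quality"):
    """Pearson r and p-value of every column against the target, strongest first (hypothetical helper)."""
    rows = []
    for col in df.columns.drop(target):
        r, p = stats.pearsonr(df[col], df[target])
        rows.append({"feature": col, "r": r, "p_value": p})
    table = pd.DataFrame(rows)
    # Sort by absolute correlation so strong negative drivers rank high too
    order = table["r"].abs().sort_values(ascending=False).index
    return table.loc[order].reset_index(drop=True)


# Synthetic stand-in data: quality rises with alcohol and falls with chlorides
rng = np.random.default_rng(0)
alcohol = rng.normal(0, 1, 300)
chlorides = rng.normal(0, 1, 300)
demo = pd.DataFrame({
    "alcohol": alcohol,
    "chlorides": chlorides,
    "quality": 0.8 * alcohol - 0.3 * chlorides + rng.normal(0, 0.5, 300),
})

table = correlations_with_target(demo)
print(table)
```

Ranking by absolute value makes it easy to read off both the strongest positive and the strongest negative relationships in one table.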

In [None]:
#plot of feature relationships to target
prepare.relations_features(train)

# Does alcohol affect wine quality?

In [None]:
print('White = 0 Red = 1')
sns.scatterplot(x="alcohol", y="is_red", data=train, hue='quality')
plt.xlabel("Alcohol")
plt.ylabel("Wine Type")
plt.show()

# Does density affect wine quality?

In [None]:
print('White = 0 Red = 1')
sns.scatterplot(x="density", y="is_red", data=train, hue='quality')
plt.xlabel("Density")
plt.ylabel("Wine Type")
plt.show()

# Do chlorides affect wine quality?

In [None]:
print('White = 0 Red = 1')
sns.scatterplot(x="chlorides", y="is_red", data=train, hue='quality')
plt.xlabel("Chlorides")
plt.ylabel("Wine Type")
plt.show()

# Is there a difference in quality for red or white wine?

In [None]:
sns.boxplot(x=train.is_red, y=train.quality)
plt.title("Is there a difference in quality for\nred vs white wine?")
plt.show()


**Test for equal variance**

* $H_0$: the variances of the two groups are equal
* Test used: Levene
* The two groups are (1) white wines, where train.is_red == 0, and (2) red wines, where train.is_red == 1

In [None]:
#Levene test
stats.levene(train[train.is_red==0].quality, 
             train[train.is_red==1].quality)

In [None]:
#Independent two-sample t-test
stats.ttest_ind(train[train.is_red==0].quality, 
                train[train.is_red==1].quality, 
                equal_var=True)
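
The two cells above run Levene's test and then a t-test with `equal_var=True`. One way to tie the two together, sketched here under assumptions, is to let the Levene result decide whether to use the pooled-variance t-test or Welch's version. The helper name `compare_group_means`, the 0.05 threshold, and the synthetic samples are illustrative only.

```python
import numpy as np
from scipy import stats


def compare_group_means(a, b, alpha=0.05):
    """Run Levene's test, then a t-test whose equal_var flag follows the Levene result."""
    _, p_levene = stats.levene(a, b)
    equal_var = bool(p_levene >= alpha)  # fail to reject H0 -> treat variances as equal
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=equal_var)
    return {
        "equal_var": equal_var,
        "t": float(t_stat),
        "p": float(p_value),
        "reject_h0": bool(p_value < alpha),
    }


# Synthetic stand-ins for white and red quality scores (not the real data)
rng = np.random.default_rng(0)
white = rng.normal(5.9, 0.9, 400)
red = rng.normal(5.6, 0.8, 200)

result = compare_group_means(white, red)
print(result)
```

When Levene's test rejects equal variances, `equal_var=False` switches `ttest_ind` to Welch's t-test, which does not assume a shared variance.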

# Is there a correlation between volatile acidity and quality?

In [None]:
plt.scatter(x=train.volatileacidity, y=train.quality)
plt.title('Does Volatile Acidity Affect Quality?')
plt.xlabel('volatile acidity')
plt.ylabel('quality')
plt.show()

In [None]:
# Visually there appears to be a negative linear relationship.

$H_0:$ There is no significant linear relationship between volatile acidity and quality.

$H_a:$ There is a significant linear relationship between volatile acidity and quality.

In [None]:
# volatile acidity is heavily right-skewed, so I'll use
# Spearman's rank correlation test
α = 0.05
stats.spearmanr(train.volatileacidity, train.quality)

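p is less than alpha, so we reject the null hypothesis that there is no significant relationship between volatile acidity and quality. Because Spearman's test is rank-based, this strictly supports a monotonic relationship rather than a specifically linear one.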
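
As a small illustration of the decision rule, the sketch below compares Spearman's and Pearson's tests on a right-skewed synthetic feature and checks each p-value against alpha. The data, the variable names, and the conventional 0.05 threshold are assumptions for the example, not values from the wine dataset.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed feature with a negative effect on the outcome
rng = np.random.default_rng(1)
x = rng.lognormal(mean=-1.0, sigma=0.5, size=500)
y = 6.0 - 2.0 * x + rng.normal(0, 0.5, 500)

alpha = 0.05  # conventional significance level (assumed)

rho, p_spearman = stats.spearmanr(x, y)  # rank-based, robust to skew
r, p_pearson = stats.pearsonr(x, y)      # measures linear association

for name, stat, p in (("Spearman", rho, p_spearman), ("Pearson", r, p_pearson)):
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{name}: statistic={stat:.3f}, p={p:.3g} -> {decision}")
```

Storing alpha in a variable and comparing it to the p-value in code keeps the written conclusion tied to the numbers the cell actually produced.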

# Is there a linear correlation between free sulfur dioxide and quality?

In [None]:
plt.scatter(x=train.freesulfurdioxide, y=train.quality)
plt.title('Does Free Sulfur Dioxide Affect Quality?')
plt.xlabel('free sulfur dioxide')
plt.ylabel('quality')
plt.show()

In [None]:
# Visually there appears to be a very slight positive linear relationship.

$H_0:$ There is no significant linear relationship between free sulphur dioxide and quality.

$H_a:$ There is a significant linear relationship between free sulphur dioxide and quality.

In [None]:
# free sulfur dioxide is slightly right-skewed, but with a sample this
# large (CLT) I'll use Pearson's r test
α = 0.05
stats.pearsonr(train.freesulfurdioxide, train.quality)

p is less than alpha, so we reject the null hypothesis that there is no significant linear relationship between free sulfur dioxide and quality.

# Does Clustering provide a better insight on the data?
* Clustering was performed on several of the features selected for modeling.
* No clear clusters were identified in this dataset.
* Sample work is in the dummy notebook.
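
The clustering work itself lives in a separate notebook, so the following is only a sketch of one common way to check whether clusters are present: fit KMeans for several values of k and compare inertia and silhouette scores. The helper name `score_cluster_counts`, the range of k, and the single-blob synthetic data are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler


def score_cluster_counts(X, k_values=range(2, 7), seed=42):
    """Fit KMeans for each k on min-max scaled data and record inertia and silhouette."""
    X_scaled = MinMaxScaler().fit_transform(X)
    results = {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_scaled)
        results[k] = {
            "inertia": float(km.inertia_),
            "silhouette": float(silhouette_score(X_scaled, km.labels_)),
        }
    return results


# Synthetic data drawn from one Gaussian blob, i.e. no real cluster structure
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))

results = score_cluster_counts(X_demo)
for k, scores in results.items():
    print(k, round(scores["inertia"], 2), round(scores["silhouette"], 3))
```

Inertia always tends to fall as k grows, so a smooth decline with no elbow, together with modest silhouette scores, is consistent with a finding of no clear clusters.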

# Exploratory Summary
* No significant difference in quality variance between red and white wines
* Alcohol has the highest positive correlation with wine quality, indicating that higher alcohol levels are associated with higher quality
* Density and chlorides had similar negative correlations with wine quality, indicating that lower levels are associated with higher quality
* Clustering features provided no clear insight into the data

# Modeling

* I will use Root Mean Square Error (RMSE) as my evaluation metric
* Predicting the mean quality score gives an RMSE of .76; this will be the baseline for this project
* I will evaluate four different models
* Models will be evaluated on train and validate data first, and the best-performing model will be evaluated on the test data

# Modeling Features

**fixedacidity**: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

**scaled_volatileacidity**: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

**citricacid**: found in small quantities, citric acid can add freshness and flavor to wines

**residualsugar**: the amount of sugar remaining after fermentation stops; it's rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet

**scaled_chlorides**: the amount of salt in the wine

**scaled_freesulfurdioxide**: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

**totalsulfurdioxide**: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

**scaled_density**: the density of wine is close to that of water depending on the percent alcohol and sugar content

**pH**: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

**scaled_sulphates**: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

**scaled_alcohol**: the percent alcohol content of the wine

**is_red**: indicates red or white with 1 or 0

In [None]:
#Prep data for modeling
X = list(partitions[1:4])  # copy into a list so each item can be reassigned below
for i in range(len(X)):
    X[i] = prepare.scale_and_concat(X[i], partitions)
    
X_train = X[0].iloc[:,0:27]
X_validate = X[1].iloc[:,0:27]
X_test = X[2].iloc[:,0:27]

In [None]:
#Prep data for modeling
prepare.modeling_feats(X_train,X_validate,X_test)
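
`prepare.scale_and_concat` is not shown, so this is a minimal sketch of the usual pattern behind `scaled_` columns: fit a `MinMaxScaler` on the training split only, then apply it to all three splits. The helper name `add_scaled_columns` and the tiny example frames are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def add_scaled_columns(X_train, X_validate, X_test, cols):
    """Fit MinMaxScaler on train only and append scaled_<col> columns to every split."""
    scaler = MinMaxScaler().fit(X_train[cols])
    scaled_names = [f"scaled_{c}" for c in cols]
    out = []
    for X in (X_train, X_validate, X_test):
        scaled = pd.DataFrame(scaler.transform(X[cols]), columns=scaled_names, index=X.index)
        out.append(pd.concat([X, scaled], axis=1))
    return out


# Tiny illustrative frames (not the real wine data)
X_train_demo = pd.DataFrame({"alcohol": [9.0, 10.0, 12.0], "pH": [3.0, 3.2, 3.4]})
X_validate_demo = pd.DataFrame({"alcohol": [11.0], "pH": [3.1]})
X_test_demo = pd.DataFrame({"alcohol": [13.0], "pH": [3.3]})

train_s, validate_s, test_s = add_scaled_columns(
    X_train_demo, X_validate_demo, X_test_demo, cols=["alcohol"]
)
print(test_s)
```

Because the scaler only sees the training data, values in validate or test can land outside the 0 to 1 range, as the test row does here; that is expected and avoids leaking information from the held-out splits.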

# Baseline
* Will use the mean as the baseline because it gives a lower RMSE than the median
* Plotted visual of baseline vs actual quality

In [None]:
#got RMSE using Mean and Median
prepare.baseline(y_train, y_validate)
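
For reference, a baseline RMSE of this kind can be computed by predicting one constant (the training mean or median) for every row. This is a sketch under that assumption; the function name `baseline_rmse` and the toy scores are illustrative and not the contents of `prepare.baseline`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error


def baseline_rmse(y_train, y_eval):
    """RMSE on y_eval when every prediction is the training mean or the training median."""
    y_train = np.asarray(y_train, dtype=float)
    y_eval = np.asarray(y_eval, dtype=float)
    scores = {}
    for name, value in (("mean", y_train.mean()), ("median", np.median(y_train))):
        preds = np.full_like(y_eval, value)  # constant prediction for every row
        scores[name] = float(np.sqrt(mean_squared_error(y_eval, preds)))
    return scores


# Toy quality scores (not the real data)
scores = baseline_rmse(y_train=[5, 6, 6, 7, 5, 6], y_eval=[5, 6, 7, 6])
print(scores)
```

Computing the constant from the training split and scoring it on another split mirrors how the fitted models are evaluated, so the comparison stays fair.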

In [None]:
#baseline quality predictions vs actual quality
prepare.actualvs_pred(y_train)

# OLS Model

In [None]:
#OLS Linear Regression Model Results
prepare.lm_model(X_train,y_train,X_validate,y_validate)

* OLS Model outperforms baseline

# Lasso Lars

In [None]:
#Lasso Lars Model Results
prepare.lars_model(X_train,y_train,y_validate,X_validate)

* Lasso Lars is about the same as baseline 

# Tweedie Regressor

In [None]:
#Tweedie Regressor
prepare.tweedie_model(X_train,y_train,y_validate,X_validate)

* Tweedie Regressor beat baseline and was more in line with the OLS model

# Polynomial Model

In [None]:
#Polynomial Model Results
prepare.poly_model(X_train,y_train,y_validate,X_validate,X_test)

* Polynomial performed better than baseline and all other models. We will use this model on our test data.
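
As an illustration of why a polynomial model can beat plain OLS, the sketch below fits both on synthetic data with a quadratic and an interaction term. The degree of 2, the pipeline layout, and the data are assumptions; `prepare.poly_model` may be set up differently.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data whose target depends on a squared term and an interaction
rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(400, 2))
y = 1.0 + X[:, 0] ** 2 - X[:, 0] * X[:, 1] + rng.normal(0, 0.1, 400)
X_tr, X_val, y_tr, y_val = X[:300], X[300:], y[:300], y[300:]


def rmse(model, X_eval, y_eval):
    """Root mean squared error of a fitted model on the given split."""
    return float(np.sqrt(mean_squared_error(y_eval, model.predict(X_eval))))


linear = LinearRegression().fit(X_tr, y_tr)
poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # adds squares and pairwise products
    LinearRegression(),
).fit(X_tr, y_tr)

rmse_linear = rmse(linear, X_val, y_val)
rmse_poly = rmse(poly, X_val, y_val)
print(f"linear: {rmse_linear:.3f}  polynomial: {rmse_poly:.3f}")
```

Wrapping the feature expansion and the regression in one pipeline ensures the same transformation is applied to train, validate, and test without refitting.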

# Evaluate

* Plotted model predictions vs actual quality to see where each model performed
* Plotted value changes with error change to see what quality scores gave models the best and worst predictions
* Plotted distributions of the top two models against actual quality for a comparison of error

In [None]:
#Model Predictions vs Actual
prepare.plot_model_pred(y_validate)

In [None]:
#view how errors change based on actual value change
prepare.plot_errors(y_validate)

In [None]:
#Plot the top two models vs actual
prepare.top_model(y_validate)

# Polynomial on Test Data

* The Polynomial Model returned an RMSE of .58 on test data
* The Polynomial Model decreased error by 23.44% relative to baseline

In [None]:
#Evaluate best model on test data
prepare.best_model(X_train,y_test,y_train,X_validate,X_test)

In [None]:
#verify decreased error percentage
prepare.final_model(y_test,y_validate)
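
The percent decrease in error is a simple ratio, sketched below with a hypothetical helper `percent_reduction`. Plugging in the rounded figures quoted in this notebook (.76 baseline, .58 on test) gives roughly 23.7%; the reported 23.44% presumably reflects unrounded RMSE values or a slightly different comparison, which cannot be confirmed from what is shown here.

```python
def percent_reduction(baseline_rmse, model_rmse):
    """Percent decrease in RMSE relative to the baseline."""
    return (baseline_rmse - model_rmse) / baseline_rmse * 100.0


# Rounded values as quoted in the text; the unrounded values are not shown
rounded_estimate = percent_reduction(0.76, 0.58)
print(f"{rounded_estimate:.2f}%")
```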

# Conclusion 

* The majority of wine quality scores fell between 5 and 7
* None of the models tested were very successful at predicting quality below 5 or above 7
* No clear clusters were visible in this dataset; clustering could be helpful at a later time
* No single feature clearly drives wine quality; using multiple features proved to be the most successful approach when modeling

# Recommendations 

* Try more combinations of features to achieve a lower RMSE
* Try different regression algorithms that haven't been used on the dataset
* Collect more data on high-quality wines to help improve predictions

# Next Steps
* Research Ridge Regression and Support Vector Regression (SVR) to see if either model could improve on what we currently have
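
As a starting point for that research, here is a rough sketch that scores `Ridge` and `SVR` (both already imported at the top of this notebook) against a mean baseline using RMSE. The hyperparameters are scikit-learn defaults or arbitrary picks, and the data is synthetic, so the numbers say nothing about how these models would do on the wine data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR

# Synthetic stand-in for scaled features and a quality-like target
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 5))
y = 5.8 + 0.4 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(0, 0.5, 400)
X_tr, X_val, y_tr, y_val = X[:300], X[300:], y[:300], y[300:]

candidates = {
    "ridge": Ridge(alpha=1.0),                      # L2-regularized linear regression
    "svr": SVR(kernel="rbf", C=1.0, epsilon=0.1),   # kernel-based support vector regression
}

scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    preds = model.predict(X_val)
    scores[name] = float(np.sqrt(mean_squared_error(y_val, preds)))

# Mean baseline computed from the training split, scored on validate
baseline = float(np.sqrt(np.mean((y_val - y_tr.mean()) ** 2)))
print({"baseline": round(baseline, 3), **{k: round(v, 3) for k, v in scores.items()}})
```

SVR is sensitive to feature scale, so on the real data it would make sense to feed it the `scaled_` columns and tune `C` and `epsilon` on the validate split.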