# Wine Quality Prediction Project

## Goal:
- Discover features that affect wine quality
- Use features to develop a machine learning model to predict the quality of wine on a scale of 1-10

# Imports

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import wrangle as w
import explore as e
# import model as m
import evaluate as ev

from scipy.stats import pearsonr, spearmanr

import warnings
warnings.filterwarnings("ignore")

np.random.seed(123)

# Acquire
- Data acquired from [data.world](https://data.world/food/wine-quality)
- Downloaded two separate .csv files: 1 for red wine, 1 for white wine
- Merged the .csv files into a single dataframe
- Added one column, wine_type, after combining csv files to indicate red or white wine
- It contained 6,497 rows and 13 columns before cleaning
    - 1599 rows were red wines
    - 4898 rows were white wines
- Each row represents a single vintage of wine
- Each column represents a chemical quality of the wines
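
The merge described above can be sketched roughly as follows. This is an illustration only, not the code in `wrangle.py`: the helper name `combine_wines` is hypothetical, and the in-memory CSV text (with made-up values) stands in for the two downloaded files.

```python
import io

import pandas as pd


def combine_wines(red_source, white_source):
    """Read the red and white CSVs and stack them with a wine_type column."""
    red = pd.read_csv(red_source)
    white = pd.read_csv(white_source)
    # label each frame before stacking so the origin of every row is kept
    red["wine_type"] = "red"
    white["wine_type"] = "white"
    return pd.concat([red, white], ignore_index=True)


# tiny stand-ins for the two downloaded files (made-up values)
red_csv = io.StringIO("alcohol,quality\n9.4,5\n9.8,6\n")
white_csv = io.StringIO("alcohol,quality\n8.8,6\n10.1,7\n11.0,5\n")

wines = combine_wines(red_csv, white_csv)
print(wines.shape)
print(wines["wine_type"].value_counts())
```

Passing file paths instead of the `StringIO` objects would work the same way, since `pd.read_csv` accepts either.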

# Prepare
- Did not remove any columns
- Did not rename any columns
- Checked for nulls - no null values found
- Checked that column data types were appropriate
- Outliers: used the IQR method with a multiplier of 2:
    - IQR = Q3 - Q1; for each column, removed rows with values below (Q1 - 2 * IQR) or above (Q3 + 2 * IQR)
    - 867 rows removed
    - Began with 6,497
    - Ended with 5,630
- Encoded categorical variables
- Split data into train, validate, and test (60/20/20)
- Scaled continuous variables
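
The outlier rule above might look something like the sketch below. It assumes the bounds for every column are computed on the original data before any rows are dropped; the actual `wrangle` code may instead filter column by column. The function name `remove_outliers_iqr` and the demo values are hypothetical.

```python
import pandas as pd


def remove_outliers_iqr(df, columns, k=2.0):
    """Drop rows with a value outside [Q1 - k*IQR, Q3 + k*IQR] in any listed column."""
    keep = pd.Series(True, index=df.index)
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        # bounds come from the unfiltered data, so column order does not matter
        keep &= df[col].between(lower, upper)
    return df[keep]


# made-up values; the last chlorides reading is an obvious outlier
demo = pd.DataFrame({
    "chlorides": [0.04, 0.05, 0.05, 0.06, 0.07, 0.60],
    "alcohol": [9.5, 10.0, 10.5, 11.0, 11.5, 12.0],
})
cleaned = remove_outliers_iqr(demo, ["chlorides", "alcohol"], k=2)
print(len(demo), "->", len(cleaned))
```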

# Data Dictionary

| Feature | Definition (measurement)|
|:--------|:-----------|
|Fixed Acidity| The fixed amount of tartaric acid. (g/L)|
|Volatile Acidity| A wine's acetic acid; (High Volatility = High Vinegar-like smell). (g/L)|
|Citric Acid| The amount of citric acid; (Raises acidity, Lowers shelf-life). (g/L)|
|Residual Sugar| Leftover sugars after fermentation. (g/L)|
|Chlorides| Increases sodium levels; (Affects color, clarity, flavor, aroma). (g/L)|
|Free Sulfur Dioxide| Related to pH. Determines how much SO2 is available. (Increases shelf-life, decreases palatability). (mg/L)|
|Total Sulfur Dioxide| Summation of free and bound SO2. (Limited to 350ppm: 0-150, low-processed, 150+ highly processed). (mg/L)|
|Density| Between 1.08 and 1.09. (Insight into fermentation process of yeast growth). (g/L)|
|pH| 2.5: more acidic - 4.5: less acidic (range)|
|Sulphates| Added to stop fermentation (Preservative) (g/L)|
|Alcohol| Related to Residual Sugars. By-product of fermentation process (vol%)|
|Quality (Target)| Score assigned between 0 and 10; 0=low, 10=best|
|Wine Type| Red or White|

In [None]:
# acquire and prepare wine data
df = w.wrangle_wine()

In [None]:
df.head()

In [None]:
# split into train/validate/test datasets (tr/val/ts)
tr, val, ts = w.get_split(df)
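
The body of `w.get_split` is not shown in this notebook. A 60/20/20 split could be sketched as below, assuming a single random shuffle with no stratification; the function name `split_60_20_20` and the demo frame are hypothetical.

```python
import numpy as np
import pandas as pd


def split_60_20_20(df, seed=123):
    """Shuffle the rows once, then cut them into train/validate/test."""
    shuffled = df.sample(frac=1, random_state=seed)
    # integer arithmetic avoids float rounding in the cut points
    n_train = len(df) * 60 // 100
    n_val = len(df) * 20 // 100
    train = shuffled.iloc[:n_train]
    validate = shuffled.iloc[n_train:n_train + n_val]
    test = shuffled.iloc[n_train + n_val:]
    return train, validate, test


# made-up data, for illustration only
demo = pd.DataFrame({"alcohol": np.linspace(8, 14, 100),
                     "quality": np.arange(100) % 7 + 3})
tr_demo, val_demo, ts_demo = split_60_20_20(demo)
print(len(tr_demo), len(val_demo), len(ts_demo))
```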

# A brief look at the data

In [None]:
tr.head()

## A summary of the data

In [None]:
tr.info()

In [None]:
tr.describe()

# Explore

## What is the distribution of the target?
- NOTE: this question is more of a look at the data, not a stats test question

In [None]:
# get a histplot of quality
sns.histplot(df.quality)
plt.show()

## Question 1: Is alcohol associated with quality?

In [None]:
# Visualize: get a regplot of alcohol vs quality on train
e.get_alcohol_qual_plot(tr, 'alcohol', 'quality')

### Analyze alcohol vs quality with stats
- $H_0$: There is NO relationship between alcohol and quality
- $H_a$: There IS a relationship
- $\alpha$ = .05
    - Utilize spearmanr - we are comparing continuous variables with UNequal variance, and a rank-based test does not assume normality or equal variance

In [None]:
# get the stats from a spearmanr test on alcohol vs quality
spearmanr(tr.alcohol, tr.quality)

### Summarize
- p is < $\alpha$, so we can reject the $H_0$, which supports the $H_a$, i.e. there IS a relationship between alcohol and quality: as alcohol by volume increases, quality increases in the dataset
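
The decision rule used in each of these tests can be sketched as follows. The data here is synthetic (it is not the wine data), so the printed numbers will not match the notebook's results.

```python
import numpy as np
from scipy.stats import spearmanr

alpha = 0.05

# synthetic stand-in data: y tends to rise with x
rng = np.random.default_rng(123)
x = rng.normal(size=200)
y = x + rng.normal(scale=1.0, size=200)

# spearmanr returns the rank correlation coefficient and a two-sided p-value
r, p = spearmanr(x, y)

if p < alpha:
    decision = "reject H0"
else:
    decision = "fail to reject H0"
print(f"r = {r:.2f}, p = {p:.3g}: {decision}")
```

The sign of `r` gives the direction of the relationship, and its magnitude gives the strength; the p-value only says whether the relationship is distinguishable from none.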

## Question 2: Are chloride levels associated with quality?

In [None]:
# Visualize: get a regplot of chlorides vs quality
e.get_chloride_qual_plot(tr, 'chlorides', 'quality')

### Analyze chlorides vs quality with stats
- $H_0$: There is NO relationship between chlorides and quality
- $H_a$: There IS a relationship
- $\alpha$ = .05
    - Utilize spearmanr - we are comparing continuous variables with UNequal variance, and a rank-based test does not assume normality or equal variance

In [None]:
# get the stats from a spearmanr test on chlorides vs quality
spearmanr(tr.chlorides, tr.quality)

### Summarize
- p is < $\alpha$, so we can reject the $H_0$, which supports the $H_a$, i.e. there IS a relationship between chlorides and quality: a lower value for chlorides correlates with an increase in quality

## Question 3: Is residual_sugar associated with quality?

In [None]:
# Visualize: get a regplot of residual_sugar vs quality on train
e.get_res_sugar_quality_plot(tr, 'residual_sugar', 'quality')

### Analyze residual_sugar vs quality with stats
- $H_0$: There is NO relationship between residual_sugar and quality
- $H_a$: There IS a relationship
- $\alpha$ = .05
    - Utilize spearmanr - we are comparing continuous variables with UNequal variance, and a rank-based test does not assume normality or equal variance

In [None]:
# get the stats from a spearmanr test on residual_sugar vs quality
spearmanr(tr.residual_sugar, tr.quality)

### Summarize
- p is > $\alpha$, so we CANNOT reject the $H_0$, i.e. we found no evidence of a relationship between residual_sugar and quality

## Question 4: Is alcohol associated with density?
- Chemistry knowledge suggests these two features are very closely related, and therefore we should, perhaps, only send one of them in to our models

In [None]:
# Visualize: get a regplot of alcohol vs density on train
e.get_alcohol_density_plot(tr, 'alcohol', 'density')

### Analyze alcohol vs density with stats
- $H_0$: There is NO relationship between alcohol and density
- $H_a$: There IS a relationship
- $\alpha$ = .05
    - Utilize spearmanr - we are comparing continuous variables with UNequal variance, and a rank-based test does not assume normality or equal variance

In [None]:
# get the stats from a spearmanr test on alcohol vs density
spearmanr(tr.alcohol, tr.density)

### Summarize
- p is < $\alpha$, so we can reject the $H_0$, which supports the $H_a$, i.e. there IS a relationship between alcohol and density; furthermore, the correlation coefficient is -0.77, a strong negative correlation (the closer to -1 or 1, the stronger the correlation)

## Exploration Summary
* No feature had a strong correlation with quality by itself; alcohol was the strongest with a .46 correlation coefficient
* However, most features had some relationship with quality by themselves (stats tests for other features completed in separate working notebook)
* density was closely correlated with alcohol
* Of the three clusterings explored, only density-chlorides showed a visible grouping of higher-quality wines (see Clustering)


### Features we are moving to modeling with
* fixed_acidity
* volatile_acidity
* citric_acid
* chlorides
* free_sulfur_dioxide
* total_sulfur_dioxide
* ph
* sulphates
* alcohol
* wine_type

### Features we are, initially, not moving to modeling with
* residual_sugar
* density

# Modeling
* Evaluation metrics: R^2 and Root Mean Square Error (RMSE)
    * for R^2, closer to 1.0 is better; the baseline is 0.0, and a model that does worse than baseline can score below 0
    * for RMSE, the lower the value the better; the baseline to beat on test is 0.871
* The baseline prediction is the average of the target (mean quality is 5.87)
* We will evaluate several model types and various hyperparameter configurations
    * The model types include Ordinary Least Squares (OLS), LassoLars, Polynomial Regression, and Generalized Linear Model (GLM)
* Models will be evaluated on train and validate data
* The model that performs the best will then be evaluated on test
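
To make the baseline concrete, here is a minimal sketch of how the two metrics behave when every prediction is the mean of the target. The helper names `rmse` and `r2` and the scores are made up for illustration; they are not the functions in `wrangle.py`.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean square error between two arrays."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    ss_res = np.sum((y_true - np.asarray(y_pred, dtype=float)) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)


# made-up quality scores, for illustration only
y_demo = np.array([5, 6, 6, 7, 5, 6, 4, 7])

# baseline: predict the mean of the target for every row
baseline_pred = np.full(len(y_demo), y_demo.mean())

print("baseline RMSE:", round(rmse(y_demo, baseline_pred), 3))
print("baseline R^2:", r2(y_demo, baseline_pred))
```

Predicting the mean makes the residual and total sums of squares equal, which is why the baseline R^2 is 0 on the data the mean was computed from.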

In [None]:
## prep data for modeling

# get X y splits for modeling
target = 'quality'
X_tr, X_val, X_ts, y_tr, y_val, y_ts, to_scale, baseline = w.get_Xs_ys_to_scale_baseline(tr, val, ts, target)

# scaling continuous variable columns for use in modeling
X_tr_sc, X_val_sc, X_ts_sc = w.scale_data(X_tr,X_val,X_ts,to_scale)
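
The scaler used inside `w.scale_data` is not shown here. As one possibility, a min-max scaler fit on train only could be sketched like this; the function name `min_max_scale`, the column `wine_type_white`, and the demo values are assumptions for illustration.

```python
import pandas as pd


def min_max_scale(train, validate, test, columns):
    """Scale the listed columns using the train min and max only."""
    lo = train[columns].min()
    span = train[columns].max() - lo
    scaled = []
    for part in (train, validate, test):
        part = part.copy()
        # validate/test reuse the train statistics to avoid leaking information
        part[columns] = (part[columns] - lo) / span
        scaled.append(part)
    return scaled


# made-up values; only "alcohol" is treated as continuous
train = pd.DataFrame({"alcohol": [9.0, 10.0, 13.0], "wine_type_white": [1, 0, 1]})
validate = pd.DataFrame({"alcohol": [11.0, 14.0], "wine_type_white": [0, 1]})
test = pd.DataFrame({"alcohol": [8.0], "wine_type_white": [1]})

tr_sc, val_sc, ts_sc = min_max_scale(train, validate, test, ["alcohol"])
print(tr_sc["alcohol"].tolist(), val_sc["alcohol"].tolist(), ts_sc["alcohol"].tolist())
```

Values in validate or test can fall outside 0-1 when they lie beyond the range seen in train; that is expected with this approach.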

## Comparing models - All features EXCEPT residual_sugar and density
* Also running OLS on top 3 RFE features

In [None]:
# getting machine learning model metrics for features deemed useful in explore
to_model_cols = list(X_tr_sc.drop(columns=['residual_sugar', 'density']).columns)
w.display_model_metrics(baseline, tr[to_model_cols], y_tr, y_val, y_ts, X_tr_sc[to_model_cols], X_val_sc[to_model_cols], X_ts_sc[to_model_cols])

* All models, except for Lars, beat baseline

## Comparing models - All features
* Also running OLS on top 3 RFE features

In [None]:
# get model metrics for all features
w.display_model_metrics(baseline, tr, y_tr, y_val, y_ts, X_tr_sc, X_val_sc, X_ts_sc)

* Using all features led to slightly better models
* Polynomial Regression, with degrees=2, is the best model
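
For readers unfamiliar with degree-2 polynomial regression, the idea is to expand the features with squares and pairwise products, then fit ordinary least squares on the expanded set. The NumPy-only sketch below illustrates this on synthetic data; it is not the notebook's model code, and all function names are hypothetical.

```python
from itertools import combinations_with_replacement

import numpy as np


def poly2_features(X):
    """Expand X into bias, linear, and all degree-2 terms (squares and products)."""
    X = np.asarray(X, dtype=float)
    n_cols = X.shape[1]
    cols = [np.ones(len(X))]
    cols += [X[:, i] for i in range(n_cols)]
    cols += [X[:, i] * X[:, j]
             for i, j in combinations_with_replacement(range(n_cols), 2)]
    return np.column_stack(cols)


def fit_poly2(X, y):
    """Ordinary least squares on the expanded features."""
    coef, *_ = np.linalg.lstsq(poly2_features(X), y, rcond=None)
    return coef


def predict_poly2(coef, X):
    return poly2_features(X) @ coef


# synthetic data with a curved relationship (not the wine data)
rng = np.random.default_rng(123)
X_demo = rng.uniform(-1, 1, size=(200, 2))
y_demo = 1 + 2 * X_demo[:, 0] - X_demo[:, 1] ** 2 + rng.normal(scale=0.05, size=200)

coef = fit_poly2(X_demo, y_demo)
pred = predict_poly2(coef, X_demo)
train_rmse = float(np.sqrt(np.mean((y_demo - pred) ** 2)))
print("train RMSE:", round(train_rmse, 3))
```

With 12 input features the expansion grows quickly, which is one reason to compare train and validate scores for signs of overfitting.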

# Clustering
* Before we move to test, we will see if clustering may improve these models

## Cluster 1 - Density-Chlorides

In [None]:
tr, val, ts, X_tr, X_val, X_ts, y_tr, y_val, y_ts, to_scale, baseline, X_tr_sc, X_val_sc, X_ts_sc, X_tr2, y_tr2 = ev.density_chlorides_cluster(df)

In [None]:
# Display cluster #1
ev.density_chlorides_cluster_plot(X_tr2, y_tr2)

* The density-chlorides cluster plot splits into 3 sections along the bottom when hued on wine quality. The bottom-right section shows a higher concentration of better-quality wines, so we will attempt to run these clusters through the models.
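
The body of `ev.density_chlorides_cluster` is not shown in this notebook. One way to turn two features into a cluster feature for the models is sketched below, assuming k-means with three clusters (an assumption, not confirmed by the source) and synthetic stand-in data.

```python
import numpy as np
import pandas as pd
from scipy.cluster.vq import kmeans2

# synthetic stand-in for scaled density and chlorides: three separated blobs
rng = np.random.default_rng(123)
centers = np.array([[0.2, 0.2], [0.5, 0.8], [0.8, 0.3]])
points = np.vstack([c + rng.normal(scale=0.01, size=(50, 2)) for c in centers])
demo = pd.DataFrame(points, columns=["density", "chlorides"])

# k-means with k=3; kmeans2 returns the centroids and one label per row
_, labels = kmeans2(demo[["density", "chlorides"]].to_numpy(), 3,
                    minit="++", seed=123)

# attach the cluster id, then one-hot encode it so a linear model can use it
demo["cluster"] = labels
X_with_cluster = pd.get_dummies(demo, columns=["cluster"], prefix="cluster")
print(X_with_cluster.columns.tolist())
```

On real data the clusters would be fit on train only, and validate and test rows would be assigned to the nearest train centroid.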

## Cluster 2 - Alcohol-Residual Sugar

In [None]:
# Display cluster #2
ev.plot_cluster_2(X_tr, y_tr)

* The alcohol-residual sugar cluster plot splits into 3 sections along the bottom when hued on wine quality. No noticeable clusters.

## Cluster 3 - Sulfur Dioxide-Residual Sugar

In [None]:
# Display cluster #3
ev.plot_cluster_3(X_tr, y_tr)

* The total sulfur dioxide-residual sugar cluster plot splits into 3 sections along the bottom when hued on wine quality. No noticeable clusters.

## Comparing models with best cluster

In [None]:
# get model results
ev.get_metrics_with_cluster(X_tr2, X_tr_sc, y_tr, X_val_sc, y_val, alpha=1, power=2, degrees=2)

* Original model RMSE is 0.695 and cluster model 0.739
* Original model R2 is 0.334 and cluster model 0.246

## Comparing Models
* The Polynomial Regression model without the density-chlorides clusters performed better than the model with the clusters, so we will move forward to test with that model

## Best Model, Polynomial Regression, on Test

In [None]:
# get test results for final model
w.test_best_model(X_ts_sc, y_ts, X_tr_sc, y_tr)

## Modeling Summary
* The RMSE for Polynomial Regression on test was 0.713, which beat the baseline of 0.871.

# Conclusions

## Exploration
* The mean wine quality score is 5.87
    * all bottles had a score from 1 to 10
* No feature had a correlation coefficient above .5 with quality
    * alcohol had the highest correlation at .46
* Features affecting wine quality in KBest rank order:

     1) alcohol             :   higher value   -> higher quality
     2) chlorides           :   lower value    -> higher quality
     3) density             :   lower value    -> higher quality
     4) volatile_acidity    :   lower value    -> higher quality
     5) wine_type           :   white wine     -> higher quality
     6) citric_acid         :   higher value   -> higher quality
     7) fixed_acidity       :   lower value    -> higher quality
     8) free_sulfur_dioxide :   higher value   -> higher quality
     9) sulphates           :   higher value   -> higher quality
    10) total_sulfur_dioxide:   lower value    -> higher quality
    11) ph                  :   higher value   -> higher quality
    12) residual_sugar      :   no correlation (but it did help the model slightly)

## Modeling
* Interestingly, sending in all features, even those not correlated with the target, led to the best performing model
* Clustering did not help.

## Recommendations
* Create wines that optimize the chemical properties as described above 
    * i.e. the higher/lower values that correspond with higher quality scores
    * This will not guarantee a higher quality wine, but it should increase the probability of creating a higher quality wine

## Next steps
* Try classification models such as KNN, etc.
* Build separate models for the two wine types (red & white) to see if that would improve model accuracy
