# Predicting Wine Quality

* Rich Alcabes
* Garrett Arnett


## Our Goal

* Create a regression model that predicts a quality score for an unseen set of wines from specific drivers such as alcohol content, volatile_acidity, chlorides, and sulphates.

## Our Plan

* Plan: Questions and Hypotheses
* Acquire: Obtain two distinct datasets (red wines/white wines) from https://data.world/food/wine-quality and combine them into a single dataset with a wine_color attribute that identifies each wine's color.
* Prepare: Outliers were kept. Missing values were a non-issue, as there were zero entries containing NULL values for the predictors. The data was split into ML subsets (Train/Validate/Test).
* Explore: Univariate and multivariate analysis, correlation matrix, 2D and 3D visualization, correlation significance testing, one-sample t-tests, and K-Means clustering (on MinMax-scaled data) to discover meaningful sample segments.
* Model: With the dataset scaled via MinMaxScaler, first create an MVP model: a vanilla multiple linear regression using ALCOHOL/DENSITY/VOLATILE_ACIDITY. Then experiment with OLS model hyperparameters.
* Deliver: This document and the Final_Report.ipynb file contain the finished version of the report, accompanied by a brief 5-minute presentation to the DS Team.
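The Acquire and Prepare steps can be sketched roughly as follows. This is a minimal illustration, not the project's actual wrangling code: the tiny in-memory frames stand in for the two downloaded files, and the 60/20/20 split proportions are an assumption, since the plan does not state them.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny stand-in frames (made-up values); the real project loads the red and white files instead.
red = pd.DataFrame({"alcohol": [9.4, 9.8, 10.0, 11.2, 9.9],
                    "quality": [5, 5, 6, 6, 5]})
white = pd.DataFrame({"alcohol": [8.8, 9.5, 10.1, 9.9, 12.0],
                      "quality": [6, 6, 6, 7, 5]})

# Tag each source before stacking so the color survives the merge (1 = red, 0 = white).
red["wine_color"] = 1
white["wine_color"] = 0
wines = pd.concat([red, white], ignore_index=True)

# Two-step split: hold out the test set first, then carve validate out of the remainder.
# The proportions are assumed for illustration.
seed = 1349
train_validate, test = train_test_split(wines, test_size=0.2, random_state=seed)
train, validate = train_test_split(train_validate, test_size=0.25, random_state=seed)
```

Splitting before any scaling or clustering keeps the validate and test subsets unseen by the fitted transformers.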

## Imports

In [7]:
# data handling and visualization
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# scikit-learn: splitting, feature selection, preprocessing, models, metrics
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, RFE, f_regression, SequentialFeatureSelector
from sklearn.preprocessing import MinMaxScaler, StandardScaler, PolynomialFeatures, QuantileTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression, TweedieRegressor, LassoLars
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score

# python files
import final_wrangle as wr
import final_modeling as mo
from importlib import reload

# ignore warnings
import warnings
warnings.filterwarnings("ignore")

# shared settings
sns.color_palette("tab10")
seed = 1349
target = 'quality'



## Acquire and Data Preparation

* 
*

In [6]:
wrange()

NameError: name 'wrange' is not defined

#### Steps taken to Prepare

* 
* 
* 

#### Results and Notes

* 
* 

| Feature | Data type | Definition |
|---|---|---|
| alcohol | float | measured as a percentage |
| volatile_acidity | float | greater values indicate a vinegar-like taste |
| sulphates | float | additive with antimicrobial/antioxidant properties |
| citric_acid | float | a preservative with capacity to add flavor |
| total_SO2 | float | includes free_SO2 |
| density | float | an indication of sugar/alcohol content |
| chlorides | float | amount of salts present |
| fixed_acidity | float | non-volatile acids not subject to evaporation |
| ph | float | acid-base measurement |
| free_SO2 | float | antimicrobial/antioxidant properties |
| residual_sugar | float | measured to indicate sweetness |
| wine_color | int | 1 = Red / 0 = White |
| quality | int | TARGET: rating given by a wine-tasting professional |

## Explore

### 1.


In [2]:
# graph

explanation

### 2. 


In [3]:
# graph


We see that LA counties are below the averages for median and mean. I want to confirm the price difference is significant.

$H_0$: There is no significant difference between home_value in different county_name counties.

$H_a$: There is a significant difference between home_value in different county_name counties.

In [6]:
# running stat test
ex.counties_test(train)

Variances are different. Use a non-parametric Kruskal-Wallis test.

We reject the null hypothesis.
There is a significant difference in home prices in different counties.
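The decision logic reported above (check for equal variances, then fall back to a non-parametric test) can be sketched as follows. This is a hedged illustration, not the body of `ex.counties_test`, which is not shown in this document: the three groups are synthetic, and the use of Levene's test and the 0.05 significance level are assumptions.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the per-county home values (assumed data, not project results).
rng = np.random.default_rng(1349)
group_a = rng.normal(loc=300_000, scale=50_000, size=200)
group_b = rng.normal(loc=450_000, scale=150_000, size=200)
group_c = rng.normal(loc=500_000, scale=100_000, size=200)

alpha = 0.05  # assumed significance level

# Levene's test: the null hypothesis is that all groups share the same variance.
_, p_levene = stats.levene(group_a, group_b, group_c)

if p_levene < alpha:
    # Variances differ, so use the rank-based Kruskal-Wallis test.
    stat, p_value = stats.kruskal(group_a, group_b, group_c)
    test_used = "Kruskal-Wallis"
else:
    # Variances look equal, so one-way ANOVA is a reasonable choice.
    stat, p_value = stats.f_oneway(group_a, group_b, group_c)
    test_used = "one-way ANOVA"

reject_null = p_value < alpha
print(test_used, "reject H0:", reject_null)
```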


### 3.

In [4]:
# graph

There is a strong positive correlation between the price of the house and its square footage. Let's confirm:

$H_0$: There is no linear correlation between home_value and sq_feet

$H_a$: There is a linear correlation between home_value and sq_feet

In [8]:
# run the test
ex.sqfeet_test(train) 

We reject the null hypothesis.
There is a linear correlation between home price and its size (square feet).
The correlation coefficient is 0.5 overall. By county:

* LA: 0.43
* Ventura: 0.64
* Orange: 0.57
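A correlation test of this kind, overall and per county, might look like the sketch below. It is not the code behind `ex.sqfeet_test`, which this document does not show: the data is synthetic, the choice of Pearson's r is an assumption, and the 0.05 significance level is assumed as well.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data with a built-in linear relationship (assumed, not project data).
rng = np.random.default_rng(1349)
n = 600
df = pd.DataFrame({
    "county_name": rng.choice(["LA", "Orange", "Ventura"], size=n),
    "sq_feet": rng.uniform(800, 4000, size=n),
})
df["home_value"] = 150 * df["sq_feet"] + rng.normal(0, 150_000, size=n)

alpha = 0.05  # assumed significance level

# Overall test: H0 says there is no linear correlation.
r, p_value = stats.pearsonr(df["sq_feet"], df["home_value"])
reject_null = p_value < alpha

# Same coefficient computed separately for each county.
per_county = {}
for county, group in df.groupby("county_name"):
    r_county, _ = stats.pearsonr(group["sq_feet"], group["home_value"])
    per_county[county] = round(float(r_county), 2)

print(f"overall r = {r:.2f}, reject H0: {reject_null}")
print(per_county)
```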



## Explore Takeaways

*
*
*
*

Based on my results, I could not find many strong correlations with price in LA county compared with the other counties. I expect the price predictions for LA county to pull all of my prediction results down.

## Modeling

## How I create models

I use six different regressors:

* Single and Multiple Linear Regressions with default hyperparameters
* Generalized Linear Model (TweedieRegressor) with alpha=0.5 and power=2
* Lasso Lars algorithm with alpha=0.1
* Gradient Boosting Regressor with default hyperparameters
* Decision Tree Regressor with max_depth=4
* Random Forest Regressor with max_depth=4
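The regressors listed above could be instantiated in scikit-learn roughly as follows. The hyperparameters come from the list; the dictionary layout, the display names, and passing `random_state` to the tree-based models are assumptions made for this sketch. Single and multiple linear regression share the same `LinearRegression` class and differ only in the number of features passed to `fit`.

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LassoLars, LinearRegression, TweedieRegressor
from sklearn.tree import DecisionTreeRegressor

seed = 1349  # assumed: reuse the notebook seed for reproducible tree-based models

models = {
    # Used for both single and multiple linear regression.
    "Linear Regression": LinearRegression(),
    # power=2 selects the Gamma distribution, which requires a strictly positive target.
    "Generalized Linear Model": TweedieRegressor(alpha=0.5, power=2),
    "Lasso Lars": LassoLars(alpha=0.1),
    "Gradient Boosting Regression": GradientBoostingRegressor(random_state=seed),
    "Decision Tree Regression": DecisionTreeRegressor(max_depth=4, random_state=seed),
    "Random Forest Regression": RandomForestRegressor(max_depth=4, random_state=seed),
}

for name, model in models.items():
    print(name, "->", type(model).__name__)
```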

I manually created five different feature combinations, added two combinations chosen by the SelectKBest algorithm, and used all columns in the last feature combination. In different iterations I used StandardScaler, QuantileTransformer, and PolynomialFeatures, and I made one iteration with RFE (recursive feature elimination). All of these tools come from the sklearn library. Overall, my code generated 128 different models.
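For readers unfamiliar with the two automated selectors mentioned above, here is a minimal sketch on synthetic data. The feature names `f1` to `f5`, the target construction, and `k=2` are illustrative assumptions and do not correspond to the project's feature sets.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE, SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data: only f1 and f3 actually drive the target.
rng = np.random.default_rng(1349)
X = pd.DataFrame(rng.normal(size=(300, 5)), columns=["f1", "f2", "f3", "f4", "f5"])
y = 3 * X["f1"] - 2 * X["f3"] + rng.normal(scale=0.5, size=300)

# Scale first so that coefficient magnitudes are comparable for RFE.
X_scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

# SelectKBest ranks each feature on its own with a univariate F-test.
kbest = SelectKBest(score_func=f_regression, k=2).fit(X_scaled, y)
kbest_features = list(X.columns[kbest.get_support()])

# RFE repeatedly fits the estimator and drops the weakest feature.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2).fit(X_scaled, y)
rfe_features = list(X.columns[rfe.get_support()])

print("SelectKBest:", kbest_features)
print("RFE:", rfe_features)
```

SelectKBest scores features independently, while RFE accounts for how features behave together in the fitted model, so the two can disagree on real data.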

#### Root mean squared error
To evaluate the performance of the regression models I'm going to use the RMSE score.

Our data contains outliers. The difference between the mean and median values is around $80K. That's why I decided to use the median value as my baseline model.
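The RMSE of a constant baseline can be computed as in the sketch below. The five target values are made up to include one outlier, and the `rmse` helper is a hypothetical name, not a function from the project modules.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up target values with a single large outlier.
y_train = np.array([100.0, 120.0, 130.0, 150.0, 500.0])

# A constant baseline predicts the same value for every row.
baseline_median = np.full_like(y_train, np.median(y_train))
baseline_mean = np.full_like(y_train, np.mean(y_train))

rmse_median = rmse(y_train, baseline_median)
rmse_mean = rmse(y_train, baseline_mean)
print(f"median baseline RMSE: {rmse_median:.1f}")
print(f"mean baseline RMSE:   {rmse_mean:.1f}")
```

Note that the mean minimizes squared error on the data it is computed from, so a median baseline never scores a lower training RMSE than a mean baseline. Choosing the median is a trade for robustness to outliers, not for a lower baseline score.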

In [9]:
# run all models
# the results are stored in the md.scores data frame.
# this function also writes a *.csv file with the scores
md.run_all_models()

In [10]:
# top 5 models
best_model = md.select_best_model(md.scores)

| | model_name | features | scaling | RMSE_train | R2_train | RMSE_validate | R2_validate | RMSE_difference |
|---|---|---|---|---|---|---|---|---|
| 104 | Gradient Boosting Regression | poly1 | standard | 169112 | 0.67 | 202127 | 0.58 | 33015 |
| 92 | Gradient Boosting Regression | f8 | quantile | 169469 | 0.67 | 202933 | 0.58 | 33464 |
| 44 | Gradient Boosting Regression | f8 | standard | 169469 | 0.67 | 203041 | 0.58 | 33572 |
| 116 | Gradient Boosting Regression | poly3 | standard | 166907 | 0.68 | 203109 | 0.58 | 36202 |
| 98 | Gradient Boosting Regression | rfe | standard | 169033 | 0.67 | 204201 | 0.57 | 35168 |


In [11]:
best_model

| | model_name | features | scaling | RMSE_train | R2_train | RMSE_validate | R2_validate | RMSE_difference |
|---|---|---|---|---|---|---|---|---|
| 104 | Gradient Boosting Regression | poly1 | standard | 169112 | 0.67 | 202127 | 0.58 | 33015 |


Our best model is Gradient Boosting Regression.

In [13]:
# show the model performance scores
md.run_best_model()

['bedrooms', 'bathrooms', 'sq_feet', 'lot_sqft', 'house_age', 'pools']


| | Gradient Boosting Regression | Scores |
|---|---|---|
| 0 | Features | ['bedrooms', 'bathrooms', 'sq_feet', 'lot_sqft... |
| 1 | RMSE baseline | 292815 |
| 2 | RMSE Train Set | 169469 |
| 3 | RMSE Validation Set | 203062 |
| 4 | RMSE Test Set | 195952 |
| 5 | R2 Train Set | 0.670 |
| 6 | R2 Validation Set | 0.580 |
| 7 | R2 Test | 0.540 |
| 8 | Beats a baseline by: | 33.1% |


Our winning model beat our baseline by 33.1%.
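The 33.1% figure is consistent with comparing the baseline RMSE to the test-set RMSE reported above. The report does not state which RMSE was used, so treating it as the test-set value is an inference from the numbers:

```python
# Values copied from the score table above.
rmse_baseline = 292_815
rmse_test = 195_952

# Relative reduction in RMSE compared with the baseline.
improvement = (rmse_baseline - rmse_test) / rmse_baseline
print(f"{improvement:.1%}")  # 33.1%
```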

## Modeling summary

Observation of prediction results led me to the following conclusions:

* Gradient Boosting Regressor performed the best with our data set
* Gradient Boosting Regressor predicts well but doesn't return stable results. Its RMSE varies considerably across the three sets (169,469 on train, 203,062 on validation, 195,952 on test).
* Scaling type didn't affect the regression model's performance.



# Conclusions

* Overall, my regression model performs well. Its predictions beat the baseline model by 33.1%.
* For more stable results I would pick Random Forest Regressor or Lasso Lars Regressor.

# Next Steps

* To improve prediction results, I would recommend pulling more features from the database and looking for ones that have a strong correlation with the price in LA county.

* I would also focus on building a model that is more stable than the current one.