### Luis Arce https://github.com/LuisVArce
### Tim Keriazes https://github.com/tim-keriazes
### Joshua Mayes https://github.com/MrEnigmamgine

#### Sep 22, 2022

# Red Wine From the Vinho Verde Region in Portugal - Predicting the Quality Score Using Machine Learning

#### Scenario

#### Project Description:
##### Our project examines 11 quantitative features of red and white wine data sets from the Vinho Verde region of Portugal. Using the physicochemical properties of each wine, we built a machine learning model that predicts the quality score, our target variable. Our insights, discoveries, and modeling give wine producers, distributors, and other stakeholders a way to estimate a wine's quality score from its chemical composition.


#### Project Planning/Outline:
1. Intro
2. Acquire
3. Prepare/Wrangle
4. Split
5. Exploration Highlights
6. Stats Tests?
7. Scale
8. Clusters
9. Modeling
10. Conclusion
11. Next Steps
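
Steps 4, 7, and 8 of the outline (split, scale, cluster) can be sketched roughly as follows. This is a minimal illustration on randomly generated data, not the project's actual pipeline: the 60/20/20 split, the two feature columns, and `n_clusters=3` are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the wine data; values are random, not real measurements.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "alcohol": rng.uniform(8, 14, 200),
    "density": rng.uniform(0.99, 1.00, 200),
    "quality": rng.integers(3, 9, 200),
})

# Split: hold out the test set first, then carve validate out of the remainder.
train_validate, test = train_test_split(df, test_size=0.2, random_state=42)
train, validate = train_test_split(train_validate, test_size=0.25, random_state=42)

# Scale: fit on train only, then apply the same transform to validate.
features = ["alcohol", "density"]
scaler = MinMaxScaler().fit(train[features])
train_scaled = scaler.transform(train[features])
validate_scaled = scaler.transform(validate[features])

# Cluster: fit KMeans on the scaled training features (k=3 is an assumption).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(train_scaled)
train_clusters = kmeans.labels_
validate_clusters = kmeans.predict(validate_scaled)
```

Fitting the scaler and the clusterer on the training split only keeps information from validate and test from leaking into the model.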

#### Initial Hypotheses
1. Sugar and alcohol content correlate directly with wine density
2. For white wines, the higher the acid content, the higher the quality
3. For red wines, residual sugar content lowers the quality score
4. Sulfates will have a negative impact on quality for both
5. High volatile acid content lowers quality for both
6. White and red wines may need to be predicted separately
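
A hypothesis such as the first one could be checked with a Pearson correlation test. The sketch below uses synthetic data in which density is made to rise with residual sugar, so it only demonstrates the mechanics; the slope, noise level, and `alpha = 0.05` are assumptions, and the result says nothing about the real wines.

```python
import numpy as np
from scipy import stats

# Synthetic data: density is constructed to rise loosely with residual sugar.
rng = np.random.default_rng(0)
residual_sugar = rng.uniform(1, 20, 300)
density = 0.99 + 0.0004 * residual_sugar + rng.normal(0, 0.001, 300)

# H0: there is no linear correlation between residual sugar and density.
alpha = 0.05
r, p = stats.pearsonr(residual_sugar, density)
reject_null = p < alpha

print(f"r = {r:.3f}, p = {p:.3g}, reject H0: {reject_null}")
```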

### Target variable
#### quality
- The quality score is the median of the scores given to a wine by three industry experts

### Exploration Key Findings/Results:

#### 1. Key Finding 
#### 2. Key Finding
#### 3. Key Finding
#### 4. Key Finding
#### 5. Key Finding



# Best Model: 
### 2nd Degree Polynomial Regressor:
    - Established a baseline using the mean quality score. The baseline model had an RMSE of 0.815 on the train set and 0.862 on the validate set. Our final model scored an RMSE of 0.567 on the test set.
    
   #### Baseline:

    RMSE using Mean:
    Train/In-Sample: 0.815
    Validate/Out-of-Sample: 0.862
          
   #### RMSE for Polynomial Regressor:
    
    Test/Out-of-Sample Performance: 0.567
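
The comparison above, a mean baseline against a 2nd degree polynomial regressor, can be illustrated with the sketch below. It runs on synthetic data with a made-up curved relationship, so the numbers it produces are unrelated to the scores reported here; `make_target`, `rmse`, and `poly_reg` are hypothetical names introduced for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X_train = rng.uniform(0, 1, (300, 2))
X_validate = rng.uniform(0, 1, (100, 2))

def make_target(X):
    """Hypothetical curved relationship used only to generate toy targets."""
    return 5 + 2 * X[:, 0] ** 2 - X[:, 0] * X[:, 1]

y_train = make_target(X_train) + rng.normal(0, 0.1, 300)
y_validate = make_target(X_validate) + rng.normal(0, 0.1, 100)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Baseline: predict the training mean for every observation.
baseline = y_train.mean()
baseline_rmse_train = rmse(y_train, np.full_like(y_train, baseline))
baseline_rmse_validate = rmse(y_validate, np.full_like(y_validate, baseline))

# 2nd degree polynomial regression: expand the features, then fit a linear model.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_validate_poly = poly.transform(X_validate)
poly_reg = LinearRegression().fit(X_train_poly, y_train)
poly_rmse_validate = rmse(y_validate, poly_reg.predict(X_validate_poly))
```

A model is only worth keeping if its out-of-sample RMSE beats the baseline's, which is the comparison the figures above report.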

### Key takeaways
    -  
    -  
### Next Steps
    - 

### Imports/Dependencies

In [33]:
#imports
import warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.cluster import KMeans

#functions (project helper modules)
import model
import acquire
import wrangle as wr

# silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

#evaluate
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
from statsmodels.formula.api import ols
import sklearn.preprocessing

#feature engineering
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.preprocessing import MinMaxScaler, PolynomialFeatures

# modeling methods
from sklearn.linear_model import LinearRegression, LassoLars, TweedieRegressor


### Scope
The dataset includes 6,497 observations, none of which contain nulls and 1,177 of which are duplicates.

##### 5,320 observations remain after cleaning: 1,359 red wines and 3,961 white wines.

### Utilize helper files to acquire and wrangle data sets
- drops duplicates
- formats titles
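
The two cleaning steps listed above might look something like the sketch below. `clean_wine_frame` is a hypothetical stand-in, not the actual `wr.wrangle_data` implementation, and the three rows are toy values chosen so that one of them is a duplicate.

```python
import pandas as pd

def clean_wine_frame(df: pd.DataFrame, wine_type: str) -> pd.DataFrame:
    """Hypothetical helper: format column titles, drop duplicates, tag the wine type."""
    out = df.copy()
    # Format titles: lowercase with underscores, e.g. "fixed acidity" -> "fixed_acidity".
    out.columns = [c.strip().lower().replace(" ", "_") for c in out.columns]
    # Drop exact duplicate rows, keeping the first occurrence and its original index.
    out = out.drop_duplicates()
    out["type"] = wine_type
    return out

# Toy rows for illustration; the third row duplicates the first.
raw = pd.DataFrame({
    "fixed acidity": [7.4, 7.8, 7.4],
    "pH": [3.51, 3.20, 3.51],
    "quality": [5, 5, 5],
})
cleaned = clean_wine_frame(raw, "red")
print(cleaned)
```

Because `drop_duplicates` keeps the original index labels, the index of the cleaned frame has gaps wherever a duplicate was removed.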

### Analyze/view the data 
- check for nulls
- learn about size/attributes/components
- identify outliers

In [30]:
red = wr.wrangle_data("red")
red

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,ph,sulphates,alcohol,quality,type
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5,red
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5,red
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5,red
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6,red
5,7.4,0.660,0.00,1.8,0.075,13.0,40.0,0.99780,3.51,0.56,9.4,5,red
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1593,6.8,0.620,0.08,1.9,0.068,28.0,38.0,0.99651,3.42,0.82,9.5,6,red
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5,red
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6,red
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5,red


In [31]:
white = wr.wrangle_data("white")
white

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,ph,sulphates,alcohol,quality,type
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.00100,3.00,0.45,8.8,6,white
1,6.3,0.30,0.34,1.6,0.049,14.0,132.0,0.99400,3.30,0.49,9.5,6,white
2,8.1,0.28,0.40,6.9,0.050,30.0,97.0,0.99510,3.26,0.44,10.1,6,white
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6,white
6,6.2,0.32,0.16,7.0,0.045,30.0,136.0,0.99490,3.18,0.47,9.6,6,white
...,...,...,...,...,...,...,...,...,...,...,...,...,...
4893,6.2,0.21,0.29,1.6,0.039,24.0,92.0,0.99114,3.27,0.50,11.2,6,white
4894,6.6,0.32,0.36,8.0,0.047,57.0,168.0,0.99490,3.15,0.46,9.6,5,white
4895,6.5,0.24,0.19,1.2,0.041,30.0,111.0,0.99254,2.99,0.46,9.4,6,white
4896,5.5,0.29,0.30,1.1,0.022,20.0,110.0,0.98869,3.34,0.38,12.8,7,white


### Prepare the data

In [18]:
#null check
white.isna().sum()

fixed_acidity           0
volatile_acidity        0
citric_acid             0
residual_sugar          0
chlorides               0
free_sulfur_dioxide     0
total_sulfur_dioxide    0
density                 0
ph                      0
sulphates               0
alcohol                 0
quality                 0
type                    0
dtype: int64

In [29]:
#ensuring duplicates were dropped
duplicatered = red[red.duplicated()]
duplicatered.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 0 entries
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed_acidity         0 non-null      float64
 1   volatile_acidity      0 non-null      float64
 2   citric_acid           0 non-null      float64
 3   residual_sugar        0 non-null      float64
 4   chlorides             0 non-null      float64
 5   free_sulfur_dioxide   0 non-null      float64
 6   total_sulfur_dioxide  0 non-null      float64
 7   density               0 non-null      float64
 8   ph                    0 non-null      float64
 9   sulphates             0 non-null      float64
 10  alcohol               0 non-null      float64
 11  quality               0 non-null      int64  
 12  type                  0 non-null      object 
dtypes: float64(11), int64(1), object(1)
memory usage: 0.0+ bytes


### Outliers identified manually on the red data set through analysis and exploration
- used boxplots for each feature
- manually examined the impact of outlier removal on the data set
- defined the outliers and set them to be removed
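
The cutoffs in this project were chosen by hand, and `model.drop_outliers` is not reproduced here. As a point of comparison, the sketch below shows the 1.5 × IQR fence rule, which is the default whisker rule in matplotlib boxplots and a common starting point for picking candidate cutoffs. `iqr_bounds` is a hypothetical helper and the values are toy data.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences at k * IQR beyond the first and third quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy values for illustration; 15.5 is a deliberately extreme point.
toy = pd.DataFrame({"residual_sugar": [1.8, 1.9, 2.0, 2.2, 2.3, 2.6, 15.5]})

low, high = iqr_bounds(toy["residual_sugar"])
kept = toy[toy["residual_sugar"].between(low, high)]
print(f"fences: ({low:.2f}, {high:.2f}), rows kept: {len(kept)} of {len(toy)}")
```

Comparing the row count before and after filtering is one way to examine the impact of removal, as described in the list above.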

In [40]:
red = model.drop_outliers(red, "red", method='manual')


In [41]:
white = model.drop_outliers(white, "white", method='manual')

### Call in feature engineering
- Ions = 'ions' = sum of chlorides and sulphates
- Additives = 'additives' = sum of chlorides, sulphates, residual sugar, and the difference between total and free sulfur dioxide
- Hydronium concentration = 'hydronium' = [H3O+], derived from pH, where pH = -log10[H3O+]
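
The pH relationship above can be checked numerically. This short sketch converts a few pH values to hydronium concentration and back; the three pH values are illustrative inputs.

```python
import numpy as np

# Illustrative pH values.
ph = np.array([3.51, 3.20, 3.00])

# Invert pH = -log10[H3O+] to get the concentration in mol/L.
hydronium = 10.0 ** (-ph)

# Round trip: taking -log10 of the concentration recovers the pH.
recovered_ph = -np.log10(hydronium)

print(hydronium)
print(recovered_ph)
```

Since pH is a logarithmic scale, a drop of one pH unit corresponds to a tenfold increase in hydronium concentration, so the engineered feature puts acidity on a linear scale.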

In [42]:
def add_features():
    """Add engineered columns to the notebook-level `red` and `white` frames in place."""
    for df in (red, white):
        # 'ions' = chlorides + sulphates
        df['ions'] = df['chlorides'] + df['sulphates']

        # 'hydronium' = [H3O+], recovered from pH = -log10[H3O+]
        df['hydronium'] = 10 ** (-df['ph'])

        # 'additives' = chlorides + sulphates + residual sugar
        #               + (total - free) sulfur dioxide
        df['additives'] = (
            df['chlorides'] + df['sulphates'] + df['residual_sugar']
            + df['total_sulfur_dioxide'] - df['free_sulfur_dioxide']
        )

    return red, white
    

In [None]:
red, white = add_features()