# Fine Wine

# Goal:

- Construct an ML Regression model that predicts wine quality using features of white and red wines.

- Find the key drivers of wine quality. 

- Deliver a report that explains what steps were taken, why and what the outcome was.

- Make recommendations on what works or doesn't work in predicting wine quality.

In [1]:
#standard DS imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from scipy import stats
import math
from math import sqrt
import random

#sklearn imports
from sklearn.model_selection import train_test_split
import sklearn.preprocessing
from sklearn.feature_selection import SelectKBest, f_regression, RFE
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LassoLars
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import TweedieRegressor

#custom imports

import acquire
import prepare

#filter out any noisy warning flags
import warnings
warnings.filterwarnings('ignore')

# Set the seeds for reproducibility (random.seed returns None, so there is nothing to assign)
random.seed(123)
np.random.seed(123)

# Acquire

- Data acquired from P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
    Modeling wine preferences by data mining from physicochemical properties.
    In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
- The raw data contained 6,497 rows and 13 columns before cleaning.
- Each row represents one red or white wine.
- Each column represents a recorded property of that wine.

In [2]:
# Acquire Step
wine_df = acquire.get_wine_data()
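
The `acquire` module is not shown in this notebook. As a minimal sketch of what a loader like `get_wine_data` might do, assuming the red and white wines arrive as two separate tables that are stacked with a label column. The helper name `combine_wines`, the `wine_type` column, and the tiny inline sample are all hypothetical:

```python
import pandas as pd


def combine_wines(red: pd.DataFrame, white: pd.DataFrame) -> pd.DataFrame:
    """Stack red and white wine tables and label each row with its type."""
    red = red.assign(wine_type="red")
    white = white.assign(wine_type="white")
    return pd.concat([red, white], ignore_index=True)


# Tiny made-up sample, for illustration only (not real rows from the dataset)
red_sample = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 5]})
white_sample = pd.DataFrame({"alcohol": [8.8, 9.5, 10.1], "quality": [6, 6, 6]})

combined = combine_wines(red_sample, white_sample)
```

Keeping the wine type as its own column lets later steps compare reds and whites without reloading the source tables.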

# Prepare

Prepare Actions:

- Dropped unnecessary columns
- Renamed confusing columns
- Dropped duplicate columns
- Replaced null values
- Eliminated outliers
- Dropped null values
- Split the data
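
The `prepare` module is not shown here, so the following is only a sketch of how the outlier and split steps could look. It assumes an IQR rule for outliers and a roughly 56/24/20 train/validate/test split; the function names, the `k=1.5` threshold, the split sizes, and the synthetic demo frame are all assumptions rather than the project's actual choices:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def remove_outliers_iqr(df, columns, k=1.5):
    """Keep rows that fall within k * IQR of the quartiles for every listed column."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]


def split_data(df, seed=123):
    """Split into roughly 56% train, 24% validate, 20% test."""
    train_validate, test = train_test_split(df, test_size=0.2, random_state=seed)
    train, validate = train_test_split(train_validate, test_size=0.3, random_state=seed)
    return train, validate, test


# Synthetic demo data, for illustration only
rng = np.random.default_rng(123)
demo = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.0, 200),
    "quality": rng.integers(3, 10, 200),
})
demo.loc[0, "alcohol"] = 40.0  # plant one obvious outlier

cleaned = remove_outliers_iqr(demo.drop_duplicates().dropna(), ["alcohol"])
demo_train, demo_validate, demo_test = split_data(cleaned)
```

Splitting after cleaning keeps the three subsets consistent with one another, and fixing `random_state` makes the split repeatable.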

In [None]:
train, validate, test = prepare.prep_wine(wine_df)

# Data Dictionary

| Feature | Definition |
|:--------|:-----------|
|'airconditioningtypeid'| Type of cooling system present in the home (if any)|
|'architecturalstyletypeid'| Architectural style of the home (e.g., ranch, colonial, split-level, etc.)|
|'basementsqft'| Finished living area below or partially below ground level|
|'bathroomcnt'| Number of bathrooms in home, including fractional bathrooms|
|'bedroomcnt'| Number of bedrooms in home|
|'buildingqualitytypeid'| Overall assessment of condition of the building from best (lowest) to worst (highest)|
|'buildingclasstypeid'| The building framing type (steel frame, wood frame, concrete/brick)|
|'calculatedbathnbr'| Number of bathrooms in home, including fractional bathrooms|
|'decktypeid'| Type of deck (if any) present on parcel|
|'threequarterbathnbr'| Number of 3/4 bathrooms in house (shower + sink + toilet)|
|'finishedfloor1squarefeet'| Size of the finished living area on the first (entry) floor of the home|
|'calculatedfinishedsquarefeet'| Calculated total finished living area of the home|
|'finishedsquarefeet6'| Base unfinished and finished area|
|'finishedsquarefeet12'| Finished living area|
|'finishedsquarefeet13'| Perimeter living area|
|'finishedsquarefeet15'| Total area|

## A brief look at the data

In [None]:
# Looking at the cleaned up columns
train.head()

# Explore

Questions asked:

    1) Does...
    
    2) What...
    
    3) Does...
    
    4) Is...

In [None]:
hm_visual = prepare.visual_correlations(train)
hm_visual

***Takeaway:***



## 1) Does...

In [None]:
# Visualization for square footage
square_foot_relplot = prepare.sq_ft_visual(train)
square_foot_relplot

- As...


H_0:   

H_a: 

In [None]:
# Run a Spearman rank correlation test (spearmanr) to check whether the correlation is statistically significant
sq_footage_stat_test = prepare.eval_result(train)
sq_footage_stat_test
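
The body of `prepare.eval_result` is not shown in this notebook. A minimal sketch of a Spearman rank correlation test with `scipy.stats.spearmanr`, assuming a significance level of 0.05; the helper name `spearman_report`, the alpha value, and the synthetic `alcohol`/`quality` arrays are assumptions for illustration:

```python
import numpy as np
from scipy import stats

ALPHA = 0.05  # assumed significance level


def spearman_report(x, y, alpha=ALPHA):
    """Return the Spearman coefficient, p-value, and whether H_0 (no monotonic relationship) is rejected."""
    r, p = stats.spearmanr(x, y)
    return {"r": r, "p": p, "reject_null": p < alpha}


# Synthetic data built to correlate, for illustration only
rng = np.random.default_rng(123)
alcohol = rng.normal(10.5, 1.0, 300)
quality = 0.5 * alcohol + rng.normal(0, 1.0, 300)

result = spearman_report(alcohol, quality)
```

Spearman works on ranks, so it suits an ordinal target such as a quality score and does not assume a linear relationship.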

***Takeaway:***

## 2) What...

In [None]:
# Visualization to show optimal number of bedrooms.
bed_barplot = prepare.bed_visual(train)
bed_barplot

- As...
- Although...

H_0:   

H_a: 

In [None]:
# Run a Spearman rank correlation test (spearmanr) to check whether the correlation is statistically significant
bedrooms_stat_test = prepare.eval_result2(train)
bedrooms_stat_test

***Takeaway:***

## 3) Does...

In [None]:
# Visualization to show number of bathrooms that affect property value.
bath_barplot = prepare.bath_visual(train)
bath_barplot

- The...
- The...
- From...

H_0: 

H_a: 

In [None]:
# Run a Spearman rank correlation test (spearmanr) to check whether the correlation is statistically significant
bathrooms_stat_test = prepare.eval_result3(train)
bathrooms_stat_test

***Takeaway:***

## 4) Is...

In [None]:
# Visualization to show the optimal square footage to maximize property value
opt_sq_barplot = prepare.opt_sf_visual(train)
opt_sq_barplot

- After...
- The...
- The...
- The...

H_0:  

H_a: 

In [None]:
# Run an independent two-sample t-test to compare the group means and check whether the difference is statistically significant
optimal_sf_stat_test = prepare.eval_result4(train)
optimal_sf_stat_test
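
The body of `prepare.eval_result4` is not shown. As a sketch, an independent two-sample t-test on group means could look like the following. It assumes Welch's variant (`equal_var=False`) and a 0.05 significance level; the helper name and the two synthetic groups are hypothetical:

```python
import numpy as np
from scipy import stats

ALPHA = 0.05  # assumed significance level


def compare_group_means(group_a, group_b, alpha=ALPHA):
    """Welch's t-test: H_0 says the two groups have the same mean."""
    t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
    return {"t": t_stat, "p": p_value, "reject_null": p_value < alpha}


# Synthetic quality scores for two groups, for illustration only
rng = np.random.default_rng(123)
quality_in_range = rng.normal(6.2, 0.8, 200)
quality_out_of_range = rng.normal(5.6, 0.8, 200)

ttest_result = compare_group_means(quality_in_range, quality_out_of_range)
```

Welch's variant is used here because it does not require the two groups to share a variance; if the variances are known to be equal, the pooled test is also an option.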

***Takeaway:***

# Exploration Summary

- While...
- There is...
- The...
- The...

# Features I am moving to modeling with

- "" (moderate correlation, had the strongest correlation)
- "" (slight correlation with, but enough to move forward)
- "" (moderate correlation with)
- "" (moderate negative correlation with)
- moving forward with all other features listed in the train dataframe as well. Further testing might eliminate some of these features.

# Features I'm not moving to modeling with

- The optimal...

# Modeling

- The baseline for this project is the mean of the target, scored on train and validate.
- I will evaluate three model types (Linear Regression, Lasso-Lars, and Polynomial) across various hyperparameter configurations.
- Models will be evaluated on train and validate data 
- The model that performs the best will then be evaluated on test data. 
- The _______ model produced the best results.

In [None]:
# Function for creating the X_train, y_train, X_validate, y_validate, X_test, y_test, and
# checking the shape.
X_train, y_train, X_validate, y_validate, X_test, y_test = prepare.X_y_split(train, 'property_value')

In [None]:
# Scale the data
train_scaled, validate_scaled, test_scaled = prepare.scale_data(train, 
               validate, 
               test, 
               columns_to_scale=['bedrooms', 'bathrooms', 'square_footage', 'property_value', 'year_built', 'fire_place', 'garage', 'hottub_spa', 'lot_size', 'pools', 'zip_code', 'stories', 'optimal_sf'],
               return_scaler=False)

train_scaled.shape, validate_scaled.shape, test_scaled.shape
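
The body of `prepare.scale_data` is not shown. A minimal sketch of min-max scaling, assuming the scaler is fit on train only and then applied to all three splits; the helper name `scale_splits` and the one-column demo frames are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def scale_splits(train, validate, test, columns):
    """Fit a MinMaxScaler on train only, then transform every split with it."""
    scaler = MinMaxScaler().fit(train[columns])
    scaled = []
    for split in (train, validate, test):
        out = split.copy()
        out[columns] = scaler.transform(split[columns])
        scaled.append(out)
    return scaled


# Made-up values, for illustration only
demo_train = pd.DataFrame({"alcohol": [8.0, 10.0, 12.0]})
demo_validate = pd.DataFrame({"alcohol": [9.0]})
demo_test = pd.DataFrame({"alcohol": [14.0]})

s_train, s_validate, s_test = scale_splits(
    demo_train, demo_validate, demo_test, columns=["alcohol"]
)
```

Fitting on train alone keeps information from validate and test out of the scaler, which is why values outside the train range can land outside the 0 to 1 interval after scaling.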

In [None]:
# Feature engineering using RFE to confirm the best features.
feature_ranks = prepare.rfe(X_train, y_train, 4)
feature_ranks
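
The body of `prepare.rfe` is not shown. As a sketch, recursive feature elimination with a linear model could be wrapped like this. The helper name `rank_features`, the choice of `LinearRegression` as the estimator, and the synthetic columns are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression


def rank_features(X, y, n_features):
    """Rank features with RFE; rank 1 marks the features that were kept."""
    selector = RFE(LinearRegression(), n_features_to_select=n_features).fit(X, y)
    ranks = pd.DataFrame({"feature": X.columns, "rank": selector.ranking_})
    return ranks.sort_values("rank").reset_index(drop=True)


# Synthetic data: the target depends on two columns and ignores the third
rng = np.random.default_rng(123)
X_demo = pd.DataFrame({
    "alcohol": rng.normal(size=200),
    "density": rng.normal(size=200),
    "noise": rng.normal(size=200),
})
y_demo = 3 * X_demo["alcohol"] - 2 * X_demo["density"] + rng.normal(0, 0.1, 200)

demo_ranks = rank_features(X_demo, y_demo, n_features=2)
```

RFE drops the feature with the smallest coefficient magnitude at each step, so the inputs should be on comparable scales before ranking.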

In [None]:
# Start with the baseline
baseline = prepare.baseline(y_train, y_validate)
baseline
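
The body of `prepare.baseline` is not shown. A minimal sketch of a mean baseline scored with RMSE, assuming the train mean is used as the prediction for both splits; the helper names and the toy target values are hypothetical:

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mean_baseline_rmse(y_train, y_validate):
    """Predict the train mean for every row and score train and validate."""
    mean_pred = float(np.mean(y_train))
    return {
        "train": rmse(y_train, np.full(len(y_train), mean_pred)),
        "validate": rmse(y_validate, np.full(len(y_validate), mean_pred)),
    }


# Toy target values, for illustration only
baseline_scores = mean_baseline_rmse([5, 6, 7], [6, 8])
```

Any model worth keeping should produce a lower RMSE than this baseline on both splits.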

# Linear Regression Model

In [None]:
# calling the function for the linear regression model
lr_model = prepare.linear_reg_model(X_train, y_train, y_validate, X_validate)
lr_model

- The Linear Regression model performed better than the baseline on train and validate.

# Lasso-Lars Model

In [None]:
# Calling the function for the lasso-lars model
ll_model = prepare.lasso_lars_model(X_train, y_train, y_validate, X_validate)
ll_model

- Lasso Lars Model performed better than the baseline on train and validate.

# Polynomial Model

In [None]:
# Calling the function for the polynomial model
pf_model = prepare.poly_model(X_train, y_train, y_validate, X_validate)
pf_model
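
The body of `prepare.poly_model` is not shown. As a sketch, a polynomial regression can be built by expanding the features with `PolynomialFeatures` and fitting a `LinearRegression` on the result. The helper name `fit_poly`, the default `degree=2`, and the synthetic data are assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def fit_poly(X_train, y_train, X_validate, y_validate, degree=2):
    """Fit polynomial regression on train and return the model with train/validate RMSE."""
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    rmse_train = float(np.sqrt(mean_squared_error(y_train, model.predict(X_train))))
    rmse_validate = float(np.sqrt(mean_squared_error(y_validate, model.predict(X_validate))))
    return model, rmse_train, rmse_validate


# Synthetic data with a quadratic relationship, for illustration only
rng = np.random.default_rng(123)
X_all = rng.uniform(-3, 3, size=(300, 1))
y_all = X_all[:, 0] ** 2 + rng.normal(0, 0.1, 300)
X_tr, y_tr, X_val, y_val = X_all[:200], y_all[:200], X_all[200:], y_all[200:]

_, poly_rmse_train, poly_rmse_validate = fit_poly(X_tr, y_tr, X_val, y_val, degree=2)
_, lin_rmse_train, lin_rmse_validate = fit_poly(X_tr, y_tr, X_val, y_val, degree=1)
```

Putting the expansion and the regression in one pipeline ensures the polynomial terms are learned from train and reused unchanged on validate and test.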

- Polynomial Model performed better than the baseline on train and validate.

# Comparing Models

- The Linear Regression, Lasso-Lars, and Polynomial models all performed better than the baseline.
- The Polynomial model performed the best on train and validate of all the models.
- I have chosen to move forward with the Polynomial model because it performed better on the validate data.

# Polynomial on Test

In [None]:
# Calling the function for the Polynomial test model
p_test_model = prepare.poly_test_model(X_test, y_test)
p_test_model

Model RMSE:  
Baseline RMSE:  

My Polynomial model performs better than baseline.


## Modeling Summary

- All three of the models performed better than the baseline on train and validate.
- The Polynomial model was selected as the final model and had a lower prediction error (RMSE) than the baseline.

# Conclusions

## Exploration

- Top three features with strongest correlation...
    - _______ has the strongest correlation of all the features.
- There is a moderate correlation between...
    - Further investigation into _________ is necessary in determining....
- There is a slight correlation between...
- The optimal...
- The optimal...

## Modeling

***The final model outperformed the baseline. Possible reasons include:***

- Adding...
- The features that were included...

# Recommendations

- The...
- The...
- The...

# Next Steps

- Target...
- Analyze other features with visualizations and statistical tests, such as...
- A deeper investigation into...