# Regression Homework

All of the subquestions (i.e., a, b, c, etc.) are worth 5 points unless noted otherwise.

__Data Source__:

https://archive.ics.uci.edu/ml/datasets/wine+quality

Paulo Cortez, University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez
A. Cerdeira, F. Almeida, T. Matos and J. Reis, Viticulture Commission of the Vinho Verde Region (CVRVV), Porto, Portugal
@2009


__Data Description__:

The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

These datasets can be viewed as classification or regression tasks. The classes are ordered and not balanced (e.g. there are many more normal wines than excellent or poor ones). Outlier detection algorithms could be used to detect the few excellent or poor wines. Also, we are not sure if all input variables are relevant. So it could be interesting to test feature selection methods.


__Attribute Information__:

For more information, read [Cortez et al., 2009].

Input variables (based on physicochemical tests):

1 - fixed acidity

2 - volatile acidity

3 - citric acid

4 - residual sugar

5 - chlorides

6 - free sulfur dioxide

7 - total sulfur dioxide

8 - density

9 - pH

10 - sulphates

11 - alcohol

Output variable (based on sensory data):

12 - quality (score between 0 and 10)

__Additional Citation__:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

----------------------------------------------------------------------------------------------------------------------

<b>1. (5 Points) Packages and Prebuilt Functions </b>

    a) (2 points) Run the code block below to load in needed packages

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from scipy.stats import norm
import os
import warnings

warnings.filterwarnings('ignore')

    b) (3 points) Copy over the get_download_path() function we have used elsewhere in the course and run this cell to load in its functionality.

<b>2. (40 Points) Data </b>

    a) Place the "winequality-white.csv" and the "winequality-red.csv" files in your Downloads folder.  Then read the data files into Python. Check whether there are any NaN values using the ".info()" method and, in a markdown cell, comment in a sentence or two on why you think there are or are not null values.

In [None]:
# Your code below
whitewine = ...
whitewine.info()

In [None]:
redwine = ...
redwine.info()
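
A minimal sketch of the read-and-check pattern on toy data, not the assignment's answer. The three-row CSV text, the column subset, and the `;` delimiter are assumptions made for illustration; open the real files to confirm which delimiter they use before passing `sep`.

```python
import io
import pandas as pd

# Toy stand-in for a wine CSV (hypothetical values); the ';' delimiter
# is an assumption that should be checked against the real file.
csv_text = "fixed acidity;pH;quality\n7.0;3.00;6\n6.3;;6\n8.1;3.26;5\n"

toy = pd.read_csv(io.StringIO(csv_text), sep=";")
toy.info()                      # dtypes and non-null counts per column
null_counts = toy.isna().sum()  # explicit count of missing values per column
print(null_counts)
```

`.info()` reports non-null counts, so a column whose count is lower than the number of rows contains missing values; `isna().sum()` states the same thing directly.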

    b) Create a new variable in both whitewine and redwine called "winecolor".  Set whitewine's "winecolor" to 0 and redwine's "winecolor" to 1.  Then append the two dataframes together in a new dataframe called "all_wines".

In [None]:

all_wines = ...
all_wines
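
As a hedged illustration of the labeling-and-stacking step, the sketch below uses two tiny made-up frames (`white_toy` and `red_toy` are hypothetical names). It relies on `pd.concat`, since `DataFrame.append` was removed in pandas 2.0.

```python
import pandas as pd

# Two tiny made-up frames standing in for the real wine tables
white_toy = pd.DataFrame({"pH": [3.0, 3.3], "quality": [6, 6]})
red_toy = pd.DataFrame({"pH": [3.5], "quality": [5]})

# Label each frame before combining so the source stays recoverable
white_toy["winecolor"] = 0
red_toy["winecolor"] = 1

# Stack the rows; ignore_index=True rebuilds a clean 0..n-1 index
combined = pd.concat([white_toy, red_toy], ignore_index=True)
print(combined)
```

Without `ignore_index=True`, the combined frame keeps each source frame's original row labels, so index values repeat.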

    c) Describe data features in terms of distribution range (max and min) and mean values.
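
One possible way to tabulate the minimum, maximum, and mean of each feature is sketched below on a made-up frame; the column values and the extra `range` row are assumptions for illustration.

```python
import pandas as pd

# Made-up values for two of the physicochemical columns
toy = pd.DataFrame({"alcohol": [9.0, 10.5, 12.0], "pH": [3.0, 3.2, 3.4]})

# One row per statistic, one column per feature
summary = toy.agg(["min", "max", "mean"])

# Width of each feature's range, added as an extra row
summary.loc["range"] = summary.loc["max"] - summary.loc["min"]
print(summary)
```

`DataFrame.describe()` gives a similar table with quartiles and standard deviations included.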

    d) Run the code block below and, in the cell after it, describe whether you think the "quality" variable fits a normal distribution.

In [None]:
#https://www.geeksforgeeks.org/how-to-plot-normal-distribution-over-histogram-in-python/
  
# Use the "quality" column as the data to fit.
data = all_wines['quality']
  
# Fit a normal distribution to the data:
# mean and standard deviation
mu, std = norm.fit(data) 
  
# Plot the histogram.
plt.hist(data, bins=len(data.unique()), density=True, color='y')
  
# Plot the PDF.
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax)
p = norm.pdf(x, mu, std)

print('\n')
print('Normal Distribution plotted over the "quality" variable for the dataset "all_wines":')
plt.plot(x, p, linewidth=2, color='r')
title = "Mean of {:.2f} with a Standard Deviation of {:.2f}".format(mu, std)
plt.title(title)
  
plt.show()

    e) (10 points) Describe what the code block below is doing.

In [None]:
all_wines['fixed_acidity']=pd.cut(all_wines['fixed acidity'],4, 
                                  labels=["low_fixed_acidity", "medium_fixed_acidity", "high_fixed_acidity","very_high_fixed_acidity"])
fixed_acid=pd.get_dummies(all_wines['fixed_acidity'])
all_wines = pd.concat([all_wines,fixed_acid],axis=1)
all_wines=all_wines.drop(['fixed acidity','fixed_acidity'], axis=1)
all_wines

    f) Create a new variable called "citric_squared" which is "citric acid" multiplied by itself.

In [None]:
all_wines['citric_squared'] = ...

    g) Interpret the output of the code block below and note any variables of concern.

In [None]:
corr = all_wines.corr()
plt.subplots(figsize=(15,10))
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, annot=True, 
            cmap=sns.diverging_palette(220, 20, as_cmap=True))
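
When a heatmap has many cells, listing the strongly correlated pairs can make them easier to spot. The sketch below does this on synthetic data; the column names and the 0.8 cutoff are arbitrary assumptions, not a rule from the course.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    # Nearly a rescaled copy of "a", so the two are almost collinear
    "a_near_copy": 2.0 * a + rng.normal(scale=0.1, size=200),
    "noise": rng.normal(size=200),
})

corr = toy.corr()

# Keep the strict upper triangle so each pair appears only once
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Flag pairs whose absolute correlation exceeds the (assumed) cutoff
flagged = pairs[pairs.abs() > 0.8]
print(flagged)
```

Strongly correlated predictors inflate coefficient standard errors in a linear regression, which is one reason such pairs are worth noting.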

<b>3. (25 Points) First Regression Test </b>

    a) (3 points) Create Y using the variable "quality" and all other variables as the X.

In [None]:
Y = ...
X = ...

    b) (2 points) Split the data into a training set and a test set for both your X variables and your Y variable, using a 16% test size.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(...)
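
For reference, a small sketch of how `train_test_split` behaves on synthetic arrays. The 25% test size and `random_state=0` are illustrative choices only and are not the values this question asks for.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 40 rows, 2 features
X_toy = np.arange(80).reshape(40, 2)
y_toy = np.arange(40)

# test_size is the held-out fraction; random_state makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)
print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

Fixing `random_state` means the same rows land in the test set on every run, which keeps reported scores comparable between runs.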

    c) (20 points) Interpret the regression results below.  Namely, describe what the R-squared, the testing accuracy, and the difference between the two indicate.  Note whether any variables aren't statistically significant, and interpret the coefficient and confidence interval of at least 3 of the variables in the markdown cell below the output.

In [None]:
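def regression(x_train, x_test, y_train, y_test):

    print("Number of training records:", len(y_train))
    print("Number of testing records:", len(y_test))

    print('\nLinear Regression Results')
    # Add an intercept column; fitting and prediction must use the same design
    x_train_const = sm.add_constant(x_train, has_constant='add')
    x_test_const = sm.add_constant(x_test, has_constant='add')
    est = sm.OLS(y_train, x_train_const)
    regr = est.fit()
    print(regr.summary())

    # Out-of-sample R-squared on the held-out test set
    y_pred = regr.predict(x_test_const)
    test_acc = r2_score(y_test, y_pred)
    print('\nTest accuracy =', test_acc)

regression(X_train, X_test, y_train, y_test)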
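
To make the comparison between training R-squared and test R-squared concrete, here is a NumPy-only sketch on synthetic data. The data-generating line, the 160/40 split, and the helper name `r_squared` are assumptions for illustration; the helper computes 1 - SS_res / SS_tot, the same quantity `r2_score` reports.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic data: y = 3 + 2x + noise
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5, size=200)

# Design matrix with an explicit intercept column
X = np.column_stack([np.ones_like(x), x])
X_tr, X_te, y_tr, y_te = X[:160], X[160:], y[:160], y[160:]

# Ordinary least squares fit on the training rows only
beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)

r2_train = r_squared(y_tr, X_tr @ beta)
r2_test = r_squared(y_te, X_te @ beta)
print("coefficients:", beta)
print("train R^2:", r2_train, "test R^2:", r2_test)
```

A test R-squared well below the training value suggests the model fits noise in the training rows, while similar values suggest it generalizes about as well as it fits.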

<b>4. (30 Points) Your Own Regression </b>

    For either the full wine dataset or just the red or white wine dataset, come up with your own regression model that differs from the one above, interpret the results, and document why you made the changes you did. Use as many code blocks as you like and be creative - there are infinitely many ways to do this!