# Assignment 1: Visualizing the Wine Dataset
by:
- Sam Yao
- Michael Amberg
- Rebecca Kuhlman

# NOTE: DO NOT PUSH .ipynb_checkpoints/ TO THE GITHUB

### Data Sources
- [1] http://www3.dsi.uminho.pt/pcortez/wine5.pdf
- [2] https://archive.ics.uci.edu/ml/datasets/Wine+Quality

## Business Understanding

- Describe the purpose of the data set you selected (i.e., why and how was this data collected in the first place?). 
- What is the prediction task for your data and why are other third parties interested in the result? 
- Once you begin modeling, how well would your prediction algorithm need to perform to be considered useful to these third parties?
- Be specific and use your own words to describe the aspects of the data.

### In your own words, give an overview of the dataset.
This dataset describes several physicochemical properties of red and white wines from Portugal [1]. These properties, such as pH, citric acid, alcohol content, and residual sugar, were measured in laboratory tests by the CVRVV, the official wine testing entity of the region of Portugal in which these wines were made. The output variable was determined by a minimum of three human judges, who graded the wine quality on a scale of 0-10 (0 being bad and 10 being excellent).

### What is the prediction task for your data and why are other third parties interested in the result? 
We are to predict the quality rating of a wine, given its physicochemical properties, on a scale of 0 to 10. Third-party entities, such as winemakers and related businesses, are interested in these results so that they can determine which chemical attributes characterize a highly ranked wine. Wine can then be produced with more emphasis on such an attribute (or less emphasis, if there is a strong negative correlation between the ranking and that attribute). A high ranking can be used to market those wines, bringing in profits for the company.

### Once you begin modeling, how well would your prediction algorithm need to perform to be considered useful to these third parties?
Wine is often viewed as a very subjective experience, and its chemical composition varies from year to year depending on the overall climate of the growing season. Given that quality is so subjective and variable, there is not a strong need for an extremely precise algorithm. With these considerations, an accuracy of 60-70% will be deemed acceptable.
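
One way to make the 60-70% target concrete is to compare a model's exact accuracy against a majority-class baseline, and to also report how often a prediction lands within one quality point of the judges' score. The sketch below uses made-up labels, and `within_one_accuracy` and `majority_baseline_accuracy` are hypothetical helpers introduced only for illustration.

```python
import numpy as np

# Made-up quality scores for illustration only (not taken from the dataset)
y_true = np.array([5, 6, 5, 7, 6, 5, 4, 6, 6, 5])
y_pred = np.array([5, 6, 6, 6, 6, 5, 5, 5, 6, 5])

def exact_accuracy(y_true, y_pred):
    """Fraction of predictions that match the true score exactly."""
    return float(np.mean(y_true == y_pred))

def within_one_accuracy(y_true, y_pred):
    """Hypothetical helper: fraction of predictions at most one point off."""
    return float(np.mean(np.abs(y_true - y_pred) <= 1))

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most common score."""
    _, counts = np.unique(y_true, return_counts=True)
    return float(counts.max() / counts.sum())

print(exact_accuracy(y_true, y_pred))
print(within_one_accuracy(y_true, y_pred))
print(majority_baseline_accuracy(y_true))
```

A model is only useful to a third party if its exact accuracy clearly beats the baseline; the within-one figure shows how forgiving the subjective scale is.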

### Be specific and use your own words to describe the aspects of the data.

## Data Understanding

In [None]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import warnings
import seaborn as sns
warnings.simplefilter('ignore', DeprecationWarning)

df_white_raw = pd.read_csv("winequality-white.csv")
df_red_raw = pd.read_csv("winequality-red.csv")

# The original CSV files use ";" instead of "," as the separator, so we had to convert the files before loading them
print(df_white_raw.info())

In [None]:
print(df_red_raw.info())

In [None]:
df_red_raw.isna().sum()

In [None]:
df_white_raw.isna().sum()

There are no missing data points in this dataset. The majority of the data is numerical. Whether a wine contains sulphates could be turned into a categorical feature, and whether the wine is red or white is categorical.
The numerical data can be broken down into groups, such as not acidic, light acidity, medium acidity, etc.
Red wine pH ranges from 4.0 down to 3.3 at the very lowest.
White wine pH ranges from 3.6 down to 2.8, with the sweetest white wines being the most acidic.
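
As a minimal sketch of the grouping idea above, `pd.cut` can turn a numeric column into labeled bins. The pH values, bin edges, and labels below are assumptions chosen for the example, not values from the dataset.

```python
import pandas as pd

# Made-up pH values for illustration only
ph = pd.Series([2.9, 3.1, 3.25, 3.4, 3.8, 4.0], name="pH")

# Bin edges and labels here are assumptions chosen for the example
bins = [0, 3.0, 3.3, 3.6, 14]
labels = ["high acidity", "medium acidity", "light acidity", "not acidic"]

# Each interval includes its right edge, e.g. (3.0, 3.3]
acidity = pd.cut(ph, bins=bins, labels=labels)
print(acidity.value_counts().sort_index())
```

Because the labels are ordered, the resulting categorical column can also be compared or sorted by acidity level.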

In [None]:
df_groupedRed = df_red_raw.groupby(['quality'])
df_groupedRed.count()

In [None]:
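plt.style.use('ggplot')

# Bin pH into acidity levels (a lower pH means a higher acidity)
df_red_raw['pH_range'] = pd.cut(df_red_raw['pH'],
                                 [0,3.075,3.3,3.7,3.9,4.5],
                                 labels=['Very High','High','Standard','Low','Very Low'])
                                #labels=['0','1','2','3','4'])
# Regroup so that the groupby object includes the new column
df_groupedRed = df_red_raw.groupby(['quality'])
df_groupedRed.pH_range.describe()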

#plt.subplot(1,3,1)
#df_groupedRed.pH_range.plot.hist(5)
#plt.subplot(1,3,2)
#df_groupedRed.pH_range.plot.kde(0.2)

#plt.subplot(1,3,3)
#df_groupedRed.pH_range.plot.hist(50)
#df_groupedRed.pH_range.plot.kde(0.05, True)

# remember that visualization is interpreted, it supports evidence.
# plt.ylim([0, 0.06])

#plt.show()
#df_groupedRed.pH_range.plot.hist()

In [None]:
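# Bin alcohol by volume (ABV) into four ordinal levels
df_red_raw['ABV_range'] = pd.cut(df_red_raw['alcohol'],
                                 [0,12.5,13.5,14.5,100],
                                 labels=['0','1','2','3'])
# Regroup so that the groupby object includes the new column
df_groupedRed = df_red_raw.groupby(['quality'])
df_groupedRed.ABV_range.describe()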

Initially, there doesn't seem to be a correlation between ABV and quality. Popular Portuguese wines often have a very low ABV, which may affect the algorithm.
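
A quick way to check an impression like this numerically is to compute a correlation coefficient between the two columns. The sketch below runs on synthetic data, with a relationship between alcohol and quality deliberately built in, so its numbers say nothing about the real wines. Spearman correlation is computed as the Pearson correlation of the ranks, which suits an ordinal score such as quality.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the wine table (not real measurements)
alcohol = rng.uniform(8.0, 14.0, size=500)
noise = rng.normal(0.0, 1.0, size=500)
quality = np.clip(np.round(0.4 * alcohol + noise + 1.0), 0, 10)
df = pd.DataFrame({"alcohol": alcohol, "quality": quality})

# Pearson correlation measures linear association
pearson = df["alcohol"].corr(df["quality"])

# Spearman correlation is the Pearson correlation of the ranks
spearman = df["alcohol"].rank().corr(df["quality"].rank())

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```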

In [None]:
dfRed = df_red_raw.copy()
dfRed['type'] = 'red'
dfRed['type#'] = 1
dfWhite = df_white_raw.copy()
dfWhite['type'] = 'white'
dfWhite['type#'] = 0

# In the resulting dataset, 0 represents white wine and 1 represents red wine
df_merged = pd.concat([dfWhite,dfRed], axis=0, ignore_index=True)
df_merged

## Data Visualization

In [None]:
%matplotlib inline
df_merged.boxplot(column = ['total sulfur dioxide'], by=['type#','quality'])
# 0 represents white wine, 1 represents red wines

In [None]:
df_merged.hist(column=['total sulfur dioxide'], by=['type'])

In [None]:
df_merged.boxplot(column = ['pH'], by=['type#','quality'])

In [None]:
df_merged.boxplot(column = ['alcohol'], by=['type#','quality'])

In [None]:
df_merged.groupby('type#').plot.scatter(x='alcohol', y='sulphates', c='quality', colormap='plasma', alpha = .5)

In [None]:
df_merged.groupby('type#').plot.scatter(x='residual sugar', y='chlorides', c='quality', colormap='plasma', alpha = .5)

## Extra Work 😡

In [None]:
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
import umap

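cats = df_merged[['type#','quality']]
# Keep only the numeric physicochemical features; the binned helper columns
# (pH_range, ABV_range) exist only for red wines and cannot be scaled
df_merged_nums = df_merged.drop(columns=['type','type#','quality','pH_range','ABV_range'],
                                errors='ignore')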

scaler = StandardScaler()
scaler.fit(df_merged_nums)
df_scaled = pd.DataFrame(scaler.transform(df_merged_nums), columns=list(df_merged_nums.columns))

df_merged_scaled = pd.concat([df_scaled,cats['type#']],axis = 1)

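# Note: df_full_scaled keeps the unscaled 'type#' and 'quality' columns, so the
# per-type frames below still contain the target when passed to UMAP or KernelPCA
df_full_scaled = pd.concat([df_scaled,cats],axis = 1)
df_scaled_white = df_full_scaled[df_full_scaled['type#']==0]
df_scaled_red = df_full_scaled[df_full_scaled['type#']==1]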

# Encode quality as consecutive integer class labels for use as the target variable
le = LabelEncoder()
cats = le.fit_transform(cats['quality'])

# Note: the scaler above is fit on the full dataset before splitting, which leaks
# test-set statistics. The train/test sets should be scaled separately for a real
# model; it is left as-is here because no final model is being built.
X_train, X_test, y_train, y_test = train_test_split(df_merged_scaled, cats, test_size = 0.2)

X_train.head(3)
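
For reference, a leakage-free version of the preprocessing above would split first and fit the scaler on the training rows only. This is a minimal sketch on synthetic data; the array shapes and the `random_state` value are assumptions for the example, not settings used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features and labels standing in for the wine data
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = rng.integers(0, 2, size=200)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_tr)   # statistics come from the training split only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuse the training mean and variance

print(X_tr_scaled.mean(axis=0).round(3))  # approximately zero by construction
print(X_te_scaled.mean(axis=0).round(3))  # close to, but not exactly, zero
```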

In [None]:
reducer = umap.UMAP(min_dist=0.1, n_components=2, n_epochs=None,
     n_neighbors=15)
embedding = reducer.fit_transform(df_merged_scaled)
plt.scatter(
    embedding[:, 0],
    embedding[:, 1],
    c=[sns.color_palette()[x] for x in df_merged['type#']],
    alpha=.6)
plt.gca().set_aspect('equal', 'datalim')
plt.title('Default UMAP projection of the Portuguese wine dataset by type', fontsize=24)

In [None]:
reducer = umap.UMAP(min_dist=0.3, n_components=2, n_epochs=None,
     n_neighbors=50)
embedding = reducer.fit_transform(df_merged_scaled)
plt.scatter(
    embedding[:, 0],
    embedding[:, 1],
    c=[sns.color_palette()[x] for x in df_merged['quality']],
    alpha=.6)
plt.gca().set_aspect('equal', 'datalim')
plt.title('UMAP projection of the Portuguese wine dataset by quality', fontsize=24)

In [None]:
reducer = umap.UMAP(min_dist=0.3, n_components=2, n_epochs=None,
     n_neighbors=50)
embedding = reducer.fit_transform(df_scaled_white)
plt.scatter(
    embedding[:, 0],
    embedding[:, 1],
    c=[sns.color_palette()[x] for x in df_scaled_white['quality']],
    alpha=.6)
plt.gca().set_aspect('equal', 'datalim')
plt.title('UMAP projection of the Portuguese white wine dataset', fontsize=24)

In [None]:
reducer = umap.UMAP(min_dist=0.3, n_components=2, n_epochs=None,
     n_neighbors=50)
embedding = reducer.fit_transform(df_scaled_red)
plt.scatter(
    embedding[:, 0],
    embedding[:, 1],
    c=[sns.color_palette()[x] for x in df_scaled_red['quality']],
    alpha=.6)
plt.gca().set_aspect('equal', 'datalim')
plt.title('UMAP projection of the Portuguese red wine dataset', fontsize=24)

In [None]:
reducer = umap.UMAP(min_dist=0.6, n_components=2, n_epochs=None,
     n_neighbors=120)
embedding = reducer.fit_transform(df_scaled_white)
plt.scatter(
    embedding[:, 0],
    embedding[:, 1],
    c=[sns.color_palette()[x] for x in df_scaled_white['quality']],
    alpha=.6)
plt.gca().set_aspect('equal', 'datalim')
plt.title('UMAP projection of the Portuguese white wine dataset, dist=.6, neighbors=120', fontsize=24)

In [None]:
reducer = umap.UMAP(min_dist=.9, n_components=3, n_epochs=None,
     n_neighbors=5)
embedding = reducer.fit_transform(df_scaled_white)
# Use a 3D axes; plt.scatter would treat a third positional array as marker sizes
ax = plt.figure().add_subplot(projection='3d')
ax.scatter(
    embedding[:, 0],
    embedding[:, 1],
    embedding[:, 2],
    c=[sns.color_palette()[x] for x in df_scaled_white['quality']],
    alpha=.6)
ax.set_title('UMAP projection of the Portuguese white wine dataset in 3D', fontsize=24)

In [None]:
reducer = umap.UMAP(min_dist=.2, n_components=3, n_epochs=None,
     n_neighbors=160)
embedding = reducer.fit_transform(df_scaled_red)
# Use a 3D axes; plt.scatter would treat a third positional array as marker sizes
ax = plt.figure().add_subplot(projection='3d')
ax.scatter(
    embedding[:, 0],
    embedding[:, 1],
    embedding[:, 2],
    c=[sns.color_palette()[x] for x in df_scaled_red['quality']],
    alpha=.6)
ax.set_title('UMAP projection of the Portuguese red wine dataset in 3D', fontsize=24)

In [None]:
reducer = umap.UMAP(min_dist=0.3, n_components=2, metric = 'manhattan',
     n_neighbors=50)
embedding = reducer.fit_transform(df_scaled_red)
plt.scatter(
    embedding[:, 0],
    embedding[:, 1],
    c=[sns.color_palette()[x] for x in df_scaled_red['quality']],
    alpha=.6)
plt.gca().set_aspect('equal', 'datalim')
plt.title('UMAP projection of the Portuguese red wine dataset using Manhattan Distance', fontsize=24)

In [None]:
reducer = umap.UMAP(min_dist=0.4, n_components=2, metric = 'canberra',
     n_neighbors=50)
embedding = reducer.fit_transform(df_scaled_red)
plt.scatter(
    embedding[:, 0],
    embedding[:, 1],
    c=[sns.color_palette()[x] for x in df_scaled_red['quality']],
    alpha=.6)
plt.gca().set_aspect('equal', 'datalim')
plt.title('UMAP projection of the Portuguese red wine dataset using Canberra metric', fontsize=24)


## UMAP Interpretation
From the different UMAP projections, we can see groups form within the data points across multiple parameter changes. While it is hard to determine which projection is the 'best', the relatively distinct groupings of points suggest that UMAP dimensionality reduction would be appropriate for the dataset.

A clear example is seen in the "UMAP projection of the Portuguese red wine dataset using Manhattan Distance" plot. In it, each point's color represents the quality of the wine. While there is some overlap, there are fairly distinct boundaries between the different wine qualities that are easy to follow.

Moving forward, the different wine types should have different parameters for dimensionality reduction. Because white wine has more entries and results in a more compact graph, selecting parameters that space these points further from each other would allow clearer boundaries to form.

In both cases, we believe that the dimensions should be reduced to three rather than two in order to better represent the original 12 features and form clearer groups.
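
Since it is hard to pick the 'best' projection by eye, one option is to score how well the quality labels separate in an embedding, for example with scikit-learn's `silhouette_score`. The sketch below uses synthetic 2D blobs in place of a real UMAP embedding, so the numbers are illustrative only. Comparing against shuffled labels gives a reference point for "no structure".

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic 2D points standing in for a UMAP embedding and its quality labels
embedding_2d, labels = make_blobs(n_samples=300, centers=3,
                                  cluster_std=0.5, random_state=0)

# Silhouette ranges from -1 to 1; higher means tighter, better separated groups
score = silhouette_score(embedding_2d, labels)

# Shuffled labels show what the score looks like without real structure
rng = np.random.default_rng(0)
shuffled_score = silhouette_score(embedding_2d, rng.permutation(labels))

print(f"true labels:     {score:.2f}")
print(f"shuffled labels: {shuffled_score:.2f}")
```

The same call could be applied to each `embedding` array produced above, with the matching `quality` column as labels, to compare parameter settings on a common scale.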



## Applying KernelPCA dimensionality reduction


In [None]:
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components=2, kernel='rbf',
                 gamma=15)
white = kpca.fit_transform(df_scaled_white)
red = kpca.fit_transform(df_scaled_red)

In [None]:
plt.scatter(
    red[:, 0],
    red[:, 1],
    c=[sns.color_palette()[x] for x in df_scaled_red['quality']],
    alpha=.6)
plt.gca().set_aspect('equal', 'datalim')
plt.title('KernelPCA projection of the Portuguese red wine dataset with rbf kernel', fontsize=24)

In [None]:
kpca = KernelPCA(n_components=2, kernel='poly',
                 gamma=10, degree = 4,  )
red = kpca.fit_transform(df_scaled_red)

plt.scatter(
    red[:, 0],
    red[:, 1],
    c=[sns.color_palette()[x] for x in df_scaled_red['quality']],
    alpha=.6)
plt.gca().set_aspect('equal', 'datalim')
plt.title('KernelPCA projection of the Portuguese red wine dataset with poly kernel', fontsize=24)

## KernelPCA analysis
KernelPCA seems to perform a similar job to UMAP, but the resulting projection is far less detailed. There are far fewer data points discernible on the graph, so a clustering algorithm would struggle to perform accurately on the resulting data. For this reason, UMAP would be the recommended dimensionality reduction algorithm.
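
One possible contributor to the lack of detail is the kernel width. With an RBF kernel, the similarity of two points is exp(-gamma * squared distance), so a large gamma on standardized features can push nearly all pairwise similarities to zero. The sketch below uses random standardized data with 12 features, not the wine data, and the smaller gamma of 1 / 12 is only an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random standardized data with 12 features (a stand-in, not the wine data)
X = rng.normal(size=(200, 12))

# Pairwise squared Euclidean distances
sq_norms = (X ** 2).sum(axis=1)
sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off

def mean_offdiag_rbf(sq_dists, gamma):
    """Mean RBF similarity exp(-gamma * d^2) over all distinct pairs."""
    K = np.exp(-gamma * sq_dists)
    off_diag = ~np.eye(K.shape[0], dtype=bool)
    return float(K[off_diag].mean())

sim_large_gamma = mean_offdiag_rbf(sq_dists, gamma=15)
sim_small_gamma = mean_offdiag_rbf(sq_dists, gamma=1 / 12)

print(f"gamma=15:   {sim_large_gamma:.3e}")
print(f"gamma=1/12: {sim_small_gamma:.3f}")
```

If the same pattern holds for the wine features, trying smaller gamma values before ruling out KernelPCA may be worthwhile.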

## Looking at the SHAP values for the data
After fitting a simple model (XGBoost in this example), we can use the SHAP library to determine SHAP values.
These values represent the sway each individual feature value has on the final prediction. By summarizing all of these positive and negative contributions across all features, we are able to see which features have the largest effect on the final prediction.
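
For intuition, a linear model with independent features has a closed-form SHAP value for each feature: the weight times the feature's deviation from its mean. The sketch below uses made-up weights and random data (the feature names are only labels for the example), and checks the additivity property: the contributions for one sample sum to the difference between its prediction and the average prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up linear model f(x) = w . x + b over three hypothetical features
feature_names = ["alcohol", "sulphates", "pH"]
w = np.array([0.8, 0.3, -0.1])
b = 2.0
X = rng.normal(size=(100, 3))

def predict(X):
    """Prediction of the made-up linear model."""
    return X @ w + b

# SHAP values of a linear model (features treated as independent):
# contribution_i = w_i * (x_i - mean(x_i))
shap_like = w * (X - X.mean(axis=0))

# Additivity: per-sample contributions sum to f(x) minus the mean prediction
lhs = shap_like.sum(axis=1)
rhs = predict(X) - predict(X).mean()
print(np.allclose(lhs, rhs))
```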

In [None]:
from xgboost import XGBClassifier
model = XGBClassifier(n_estimators=5, max_depth=19, learning_rate=.7)
model.fit(X_train,y_train)
preds = model.predict(X_test)
print(accuracy_score(y_test, preds)) # low accuracy, but no cleaning or tuning has been done

In [None]:
import shap
# do not run this if you want to avoid waiting a minute or two
explainer = shap.Explainer(model.predict, X_test)
shap_values = explainer(X_test)

In [None]:
shap.summary_plot(shap_values)

From the above graph, ordered from most to least effect, we see that alcohol is the largest contributing factor in quality prediction. If we wanted to reduce dimensions simply by removing the less important features, this could be a viable method for removing features such as fixed acidity and pH to simplify the dataset.
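
The feature-removal idea can be sketched by ranking features on their mean absolute SHAP value and keeping the top few. The matrix below is made up, with per-feature scales chosen so that the ranking is obvious, and the cutoff `k` is a hypothetical choice that would need tuning.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up SHAP matrix (rows = samples, columns = features); scales are assumptions
feature_names = ["alcohol", "volatile acidity", "pH", "fixed acidity"]
scales = np.array([1.0, 0.5, 0.05, 0.02])
shap_matrix = rng.normal(size=(300, 4)) * scales

# Global importance: mean absolute contribution per feature
importance = pd.Series(np.abs(shap_matrix).mean(axis=0), index=feature_names)
importance = importance.sort_values(ascending=False)
print(importance)

# Keep the top-k features; k is a hypothetical choice to tune
k = 2
kept = list(importance.index[:k])
dropped = list(importance.index[k:])
print("kept:", kept)
print("dropped:", dropped)
```

With the `shap_values` object computed earlier, the matrix would presumably come from `shap_values.values`, and the kept names could then be used to select columns from `X_train` and `X_test`.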