# Project 1 - EDA 
#### Mason Reyher, Jamison Cleveland, Kade Aldrich, Mitch Froelich, Ryley Ourada

Initial setup:

In [None]:
%pip install --upgrade pip -q
%pip install pandas -q
%pip install numpy -q 
%pip install matplotlib -q

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math

## Part 1

### In a well-written paragraph, answer the following questions about the data:

What was the data used for? </br>
**Two datasets are included, covering red and white vinho verde wine samples from the north of Portugal (we used only the red dataset for our project). The goal is to model wine quality based on physicochemical tests. The data was proposed as a way to predict human wine taste preferences from analytical tests that are easily available at the certification step.** </br>
**Additional info: the two datasets cover the red and white variants of the Portuguese "Vinho Verde" wine (we used only the red variant in our study). For more details, consult [Web Link] or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).**

Who (or what organization) uploaded the data?</br>
**Paulo Cortez, University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez**
**A. Cerdeira, F. Almeida, T. Matos and J. Reis, Viticulture Commission of the Vinho Verde Region (CVRVV), Porto, Portugal**
**@2009**


How many attributes and how many entities are represented in the data? </br>
**12 attributes** </br>
**The red wine dataset, which we used, has 1599 instances (the white wine dataset has 4898).**


How many numerical attributes? </br>
**12**


How many categorical attributes? </br>
**0 (grape types, wine brand, etc. were not present in the data received from the website)**


Would you suggest that each categorical attribute be label-encoded or one-hot-
encoded? Why? </br>
**Any categorical attribute in this data should be label-encoded rather than one-hot-encoded. The only attribute that could be treated as categorical is 'quality', an ordinal score between 0 and 10, and label encoding preserves that ordering while one-hot encoding would discard it. The remaining attributes, such as 'pH' (a number between 0 and 14), are already numerical and need no encoding.**

Are there missing values in the data? If so, what proportion of the data is missing
overall? What proportion of data is missing per attribute (you may use a plot or table to
summarize this information)?
</br>
**There is no missing data.**
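
A claim like this can be checked by computing the fraction of missing entries overall and per attribute. The following is a minimal sketch on a small made-up DataFrame (the `toy` values are hypothetical and are not taken from the wine data); in the notebook, the same two calls would be applied to the loaded DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value, for illustration only
toy = pd.DataFrame({
    "pH": [3.5, np.nan, 3.2, 3.4],
    "alcohol": [9.4, 9.8, 9.8, 10.0],
})

# Proportion of missing values in each attribute (column)
missing_per_attribute = toy.isna().mean()

# Proportion of missing values over every cell in the table
missing_overall = toy.isna().to_numpy().mean()

print(missing_per_attribute)
print(missing_overall)
```

If both results are zero for every attribute, the data has no missing values.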

Why is this data set interesting to you?
</br>
**The data set is interesting to us because the wine industry is one of the largest in the world. Almost every country has some type or variation of wine that is unique in flavour. The goal of this data was to make it possible for companies to predict the preferences of their buyers. Trying to predict human preference is an incredibly interesting idea because every person is different, but data mining can show whether more people prefer one aspect over another.**


Of the attributes used to describe this data, which do you think are the most
descriptive of the data and why (before doing any data analysis) ?
</br>
**The 'alcohol' and 'residual sugar' attributes seem to be the most important. Before doing any analysis, it seems reasonable to assume that a wine's alcohol content strongly affects whether it is liked or disliked, since some people do not like the taste of alcohol. Residual sugar falls into this realm as well, with some people preferring a sugary wine over a bitter one. Most attributes can be fairly important for a wine, so the least important is likely 'pH' (an acidity gauge), because fixed acidity and volatile acidity carry similar information.**

## Part 2

Use Python to write the following functions, without using any functions with the same purpose
in sklearn, pandas, numpy, or any other library (though you may want to use these libraries to
check your answers):

A function that will compute the mean of a numerical, multidimensional data set
input as a 2-dimensional numpy array

In [None]:
# Input needs to be a 2d numpy array
def get_vector_mean(arr):
    return arr.sum(axis=0) / arr.shape[0]
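
Since the assignment allows libraries for checking answers, a quick sanity check might look like the sketch below. The `toy` array is made up for illustration, and the function is restated only so the snippet runs on its own.

```python
import numpy as np

def get_vector_mean(arr):
    # Column-wise mean: sum over rows divided by the number of rows
    return arr.sum(axis=0) / arr.shape[0]

# Hypothetical 3x2 data set: three entities, two attributes
toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])

ours = get_vector_mean(toy)
reference = np.mean(toy, axis=0)  # library result, used only as a check

print(ours)                         # [ 2. 30.]
print(np.allclose(ours, reference))
```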

A function that will compute the sample covariance between two attributes that are
input as one-dimensional numpy vectors

In [None]:
def get_cov(attr_1, attr_2):
    attr_1_mean = float(sum(attr_1)) / len(attr_1) 
    attr_2_mean = float(sum(attr_2)) / len(attr_2) 
    sum_of = 0
    for i in range(len(attr_1)):
        sum_of += (float(attr_1[i]) - attr_1_mean) * (float(attr_2[i]) - attr_2_mean)
    sum_of /= len(attr_1) - 1
    return sum_of

def get_var(arr):
    return np.apply_along_axis(lambda x: get_cov(x, x), 0, arr)
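
The sample covariance divides by n - 1, which matches the default behaviour of `np.cov`. A minimal check on made-up vectors (the `toy_x` and `toy_y` values are hypothetical; the function is restated so the snippet is self-contained):

```python
import numpy as np

def get_cov(attr_1, attr_2):
    # Sample covariance: sum of products of deviations, divided by n - 1
    attr_1_mean = float(sum(attr_1)) / len(attr_1)
    attr_2_mean = float(sum(attr_2)) / len(attr_2)
    sum_of = 0
    for i in range(len(attr_1)):
        sum_of += (float(attr_1[i]) - attr_1_mean) * (float(attr_2[i]) - attr_2_mean)
    return sum_of / (len(attr_1) - 1)

toy_x = np.array([1.0, 2.0, 3.0, 4.0])
toy_y = np.array([2.0, 4.0, 5.0, 9.0])

ours = get_cov(toy_x, toy_y)
reference = np.cov(toy_x, toy_y)[0, 1]  # off-diagonal entry of the 2x2 matrix

print(ours, reference)
```

Here the deviation products sum to 11 and n - 1 is 3, so both values should be 11/3.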

A function that will compute the correlation between two attributes that are input as
two numpy vectors.

In [None]:
def get_corr(attr_1, attr_2):
    # Pearson correlation; the inputs are only read, never modified in place
    n = len(attr_1)
    attr_1_mean = float(sum(attr_1)) / n
    attr_2_mean = float(sum(attr_2)) / n

    num = 0
    den_x = 0
    den_y = 0
    # calculate numerator and denominators
    for i in range(n):
        dev_x = float(attr_1[i]) - attr_1_mean
        dev_y = float(attr_2[i]) - attr_2_mean
        num += dev_x * dev_y
        den_x += dev_x**2
        den_y += dev_y**2
    # calculate full denominator: square root of the product of squared deviations
    den = math.sqrt(den_x * den_y)
    return num / den
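
Correlation is the covariance divided by the product of the two standard deviations, so it always lies between -1 and 1. The sketch below is one way to check a hand-written version against `np.corrcoef`; the helper name `pearson_corr` and the toy vectors are hypothetical and exist only for this illustration.

```python
import math
import numpy as np

def pearson_corr(x, y):
    # Hypothetical helper: Pearson correlation without touching the inputs
    n = len(x)
    mean_x = float(sum(x)) / n
    mean_y = float(sum(y)) / n
    num = sum((float(a) - mean_x) * (float(b) - mean_y) for a, b in zip(x, y))
    den_x = sum((float(a) - mean_x) ** 2 for a in x)
    den_y = sum((float(b) - mean_y) ** 2 for b in y)
    return num / math.sqrt(den_x * den_y)

toy_x = np.array([1.0, 2.0, 3.0, 4.0])
toy_y = np.array([2.0, 4.0, 5.0, 9.0])

ours = pearson_corr(toy_x, toy_y)
reference = np.corrcoef(toy_x, toy_y)[0, 1]

print(ours, reference)
```

For these vectors the numerator is 11 and the squared deviations sum to 5 and 26, so the result is 11 / sqrt(130).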

A function that will normalize the attributes in a two-dimensional numpy array using
range normalization.

In [None]:
def get_range_norm(arr):
    max_ = arr.max(0)
    min_ = arr.min(0)
    return (arr - min_) / (max_ - min_)
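
Range normalization maps each attribute so that its minimum becomes 0 and its maximum becomes 1. A small illustration on made-up values (the `toy` array is hypothetical; note that a constant column would make the denominator zero, which this sketch does not handle):

```python
import numpy as np

def get_range_norm(arr):
    # Column-wise (x - min) / (max - min)
    max_ = arr.max(0)
    min_ = arr.min(0)
    return (arr - min_) / (max_ - min_)

toy = np.array([[1.0, 10.0],
                [2.0, 20.0],
                [3.0, 60.0]])

normalized = get_range_norm(toy)
print(normalized)
# [[0.  0. ]
#  [0.5 0.2]
#  [1.  1. ]]
```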

A function that will normalize the attributes in a two-dimensional numpy array using
standard normalization.

In [None]:
# standard normalization is the z-score normalization
# https://en.wikipedia.org/wiki/Normalization_(statistics)#Examples
# https://en.wikipedia.org/wiki/Standard_score
def get_standard_norm(arr):
    mu = get_vector_mean(arr)
    sigma = np.sqrt(get_var(arr))
    return (arr - mu) / sigma

A function that will compute the covariance matrix of a data set.


In [None]:
def get_cov_matrix(df):
    return np.stack([np.array([get_cov(df[attr1], df[attr2]) for attr2 in df]) for attr1 in df])

A function that will label-encode a two-dimensional categorical data array that is
passed in as input.

In [None]:
def label_encode(attr):
    attr = list(attr)
    # sort the unique values so the codes are deterministic and keep
    # the natural order of ordinal values such as quality scores
    key_list = sorted(set(attr))
    keys = {key: code for code, key in enumerate(key_list)}
    return np.array([keys[val] for val in attr])
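
The assignment text describes a two-dimensional input, while `label_encode` handles a single column. A minimal sketch of a column-wise extension follows; the name `label_encode_2d`, the toy categories, and the choice to assign codes in sorted order are all assumptions made for this illustration.

```python
import numpy as np

def label_encode_2d(arr):
    # Hypothetical helper: encode each column of a 2-D array independently
    arr = np.asarray(arr)
    encoded = np.empty(arr.shape, dtype=int)
    for col in range(arr.shape[1]):
        # Sorting the unique values makes the code assignment deterministic
        unique_values = sorted(set(arr[:, col]))
        keys = {key: code for code, key in enumerate(unique_values)}
        encoded[:, col] = [keys[val] for val in arr[:, col]]
    return encoded

# Made-up categorical data: two attributes, three entities
toy = np.array([["red", "dry"],
                ["white", "sweet"],
                ["red", "sweet"]])

encoded = label_encode_2d(toy)
print(encoded)
# [[0 0]
#  [1 1]
#  [0 1]]
```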

## Part 3

## Questions to Answer:

In [None]:
df_orig = pd.read_csv('wines_red.csv', sep=";")
df_orig.columns = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide','density', 'pH', 'sulphates', 'alcohol', 'quality']

df_orig['quality'] = label_encode(df_orig['quality'])
df_orig.head()

What is the multivariate mean of the numerical data matrix (where categorical data
have been converted to numerical values)?

In [None]:
df_copy = df_orig.copy()
get_vector_mean(df_copy.to_numpy())

What is the covariance matrix of the numerical data matrix (where categorical data
have been converted to numerical values)?

In [None]:
df_copy2 = df_orig.copy()
get_cov_matrix(df_copy2)

Choose 5 pairs of attributes that you think could be related. Create scatter plots of
all 5 pairs and include these in your report, along with a description and analysis that
summarizes why these pairs of attributes might be related, and how the scatter plots do or
do not support this intuition.

Which range-normalized numerical attributes have the greatest sample covariance?

In [None]:
df_range_normalized = pd.DataFrame(get_range_norm(df_orig.to_numpy()), columns=df_orig.columns)
covar_matrix = get_cov_matrix(df_range_normalized)

# mask used later to get the upper-triangle values from our matrices
triu_mask = np.zeros_like(covar_matrix, dtype=np.bool_)
triu_mask[np.triu_indices_from(covar_matrix, k=1)] = True

# set the diagonal and lower triangle to -inf so we only get the max over
# distinct off-diagonal pairs, then take the flat index into the matrix,
# get the indices for each attribute,
# and return the labels for each attribute
covar_flat_idx = np.where(triu_mask, covar_matrix, -math.inf).argmax()
covar_shape_idx = np.unravel_index(covar_flat_idx, covar_matrix.shape)
[covar_label1, covar_label2] = [df_range_normalized.columns[i] for i in covar_shape_idx]
covar_label1, covar_label2

What is their sample covariance? Create a scatter plot of these range-normalized attributes.

In [None]:
covar_matrix[covar_shape_idx]

In [None]:
x = df_range_normalized[covar_label1]
y = df_range_normalized[covar_label2]
plt.xlabel(covar_label1)
plt.ylabel(covar_label2)
plt.scatter(x, y)

Which Z-score-normalized numerical attributes have the greatest correlation? What
is their correlation? Create a scatter plot of these Z-score-normalized attributes.

In [None]:
df_standard_normalized = pd.DataFrame(data=get_standard_norm(df_orig.to_numpy()), columns=df_orig.columns)

# same thing as the covariance, just with correlation
# correlation matrix is just covariance matrix of the z-score-normalized data
corr_matrix = get_cov_matrix(df_standard_normalized)

corr_flat_idx = np.where(triu_mask, corr_matrix, -math.inf).argmax()
corr_shape_idx = np.unravel_index(corr_flat_idx, corr_matrix.shape)
[corr_label1, corr_label2] = [df_standard_normalized.columns[i] for i in corr_shape_idx]

# show the pair together with its correlation
corr_label1, corr_label2, corr_matrix[corr_shape_idx]
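
The cell above relies on the fact that the sample covariance matrix of z-score-normalized data equals the correlation matrix of the original data, provided the z-scores use the sample standard deviation (dividing by n - 1). The sketch below checks this numerically on randomly generated data; the `toy` array is synthetic, and the NumPy calls stand in for the notebook's own functions only to keep the example self-contained.

```python
import numpy as np

# Synthetic data: 50 entities, 3 attributes, with some induced correlation
rng = np.random.default_rng(0)
toy = rng.normal(size=(50, 3))
toy[:, 1] += 0.8 * toy[:, 0]

# Z-score normalization using the sample standard deviation (ddof=1)
z = (toy - toy.mean(axis=0)) / toy.std(axis=0, ddof=1)

cov_of_z = np.cov(z, rowvar=False)      # covariance of normalized data
corr = np.corrcoef(toy, rowvar=False)   # correlation of original data

print(np.allclose(cov_of_z, corr))
```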

In [None]:
x = df_standard_normalized[corr_label1]
y = df_standard_normalized[corr_label2]
plt.xlabel(corr_label1)
plt.ylabel(corr_label2)
plt.scatter(x, y)

Which Z-score-normalized numerical attributes have the smallest correlation? What
is their correlation? Create a scatter plot of these Z-score-normalized attributes.

In [None]:
corr_flat_idx = np.where(triu_mask, corr_matrix, math.inf).argmin()
corr_shape_idx = np.unravel_index(corr_flat_idx, corr_matrix.shape)
[corr_label3, corr_label4] = [df_standard_normalized.columns[i] for i in corr_shape_idx]

# show the pair together with its correlation
(corr_label3, corr_label4, corr_matrix[corr_shape_idx])

In [None]:
x = df_standard_normalized[corr_label3]
y = df_standard_normalized[corr_label4]
plt.xlabel(corr_label3)
plt.ylabel(corr_label4)
plt.scatter(x, y)

How many pairs of features have correlation greater than or equal to 0.5?


In [None]:
df_copy = df_orig.copy()
columns = df_copy.columns
column_check = {x : False for x in columns}
for column_1 in columns:
    for column_2 in columns:
        if column_1 == column_2 or not column_check[column_2]:
            continue
        # use a fresh copy each time so get_corr never touches df_orig's columns
        df_copy = df_orig.copy()
        print(f'Columns -> {column_1} - {column_2} -> correlation: {get_corr(df_copy[column_1], df_copy[column_2])}')
    column_check[column_1] = True

**No pairs of features have a correlation of >= 0.5.**

How many pairs of features have negative sample covariance?
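
One way to count such pairs is to look only at the upper triangle of the covariance matrix, so each pair is counted once and the variances on the diagonal are skipped. The sketch below uses a made-up `toy` array and `np.cov` as a stand-in for `get_cov_matrix`, purely so it runs on its own.

```python
import numpy as np

# Hypothetical data: 4 entities, 3 attributes
toy = np.array([[1.0, 9.0, 2.0],
                [2.0, 7.0, 1.0],
                [3.0, 4.0, 4.0],
                [4.0, 2.0, 3.0]])

cov = np.cov(toy, rowvar=False)  # sample covariance matrix (divides by n - 1)

# Indices strictly above the diagonal: one entry per distinct pair
upper = np.triu_indices_from(cov, k=1)
negative_pairs = int((cov[upper] < 0).sum())

print(negative_pairs)  # 2
```

In this toy data the second attribute decreases as the first increases, and it also moves against the third, which gives two negative pairs.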


What is the total variance of the data?


In [None]:
def get_total_var(arr):
    vec_var = get_var(arr)
    return vec_var.sum()
df_copy3 = df_orig.copy()
get_total_var(df_copy3.to_numpy())

What is the total variance of the data, restricted to the five features that have the
greatest sample variance?
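
A possible approach is to compute the per-attribute sample variances, keep the largest ones, and sum them. The sketch below is illustrative only: the helper name `total_var_top_k` and the `toy` array are hypothetical, and `arr.var(ddof=1)` stands in for `get_var` so the snippet is self-contained. For the wine data, `k` would be 5.

```python
import numpy as np

def total_var_top_k(arr, k):
    # Hypothetical helper: sum of the k largest per-column sample variances
    variances = arr.var(axis=0, ddof=1)
    top_k = np.sort(variances)[-k:]
    return top_k.sum()

# Made-up data: 3 entities, 3 attributes
toy = np.array([[1.0, 10.0, 0.0],
                [2.0, 20.0, 0.5],
                [3.0, 60.0, 1.0]])

result = total_var_top_k(toy, 2)
print(result)  # 701.0
```

The three sample variances here are 1, 700, and 0.25, so the two largest sum to 701.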