## Question 1 

To verify whether the formulations differ from one another in a statistically significant way.

#### a. A descriptive analysis of the additives (columns named “a” to “i”), which must include summaries of findings (parametric/non-parametric). Correlation and ANOVA, if applicable, are a must.

#### b. A graphical analysis of the additives, including a distribution study.

#### c. A clustering test of your choice (unsupervised learning), to determine the distinctive number of formulations present in the dataset.

In [None]:
import os
import pandas as pd
import numpy as np
from ipynb.fs.full.Functions import *
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import statsmodels.api as sm
from statsmodels.formula.api import ols
warnings.filterwarnings("ignore")
%matplotlib inline

In [None]:
#importing the data
ingredient = pd.read_csv('data/ingredient.csv')
ingredient.head()

## Exploratory data analysis

### After importing the dataset with the pandas library, a statistical EDA is conducted.

In [None]:
#looking into ingredient data and finding the datatypes & number of non-null rows
ingredient.info()

In [None]:
#Finding the total number of missing values in each feature
ingredient.isnull().sum()

In [None]:
ingredient.describe()

##### The `describe` function gives general information about the dataset. From the table above, additive "i" has the least usage (consumption) while "e" has the highest. In addition, it reports statistics such as the mean, standard deviation and minimum, which are useful for further analysis.

In [None]:
ingredient.corr()

#### The table above shows the pairwise correlations among the features.
#### A heatmap was constructed for easier interpretation.

In [None]:
plt.figure(figsize=(16,10))
sns.heatmap(ingredient.corr(),annot=True)

The heatmap above shows that "a" has strong negative correlations with "e" and "d".
Similarly, "c" has strong negative correlations with "h", "g" and "d", meaning that when one increases the other will most likely decrease. On the other hand, "a" and "g" have a strong positive correlation, meaning that when "a" increases "g" will most likely increase as well. The same holds for "d" and "h".
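
Reading the strongest relationships off a heatmap by eye is error-prone, so the pairs can also be ranked programmatically. The sketch below is illustrative only: `demo` is a hypothetical stand-in with made-up values, and in the notebook it would be replaced by `ingredient`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `ingredient` (synthetic values, not the real data).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=list("abcd"))
demo["d"] = -0.8 * demo["a"] + 0.2 * demo["d"]  # force one strong negative link

corr = demo.corr()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Rank the pairs by absolute correlation, strongest first.
ranked = pairs.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked.head())
```

Sorting by absolute value keeps strong negative and strong positive pairs together at the top of the list.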

In [None]:
df = ingredient.copy()

The box plot below shows the outliers in each additive. Several outliers can be seen in this figure; however, "e" is on a different scale from the rest, so the additives are also plotted separately for better visualization. Based on the box plots, the central values of the additives differ (a box plot marks the median rather than the mean). A t-test (for two features) or an ANOVA (for more than two features) can further determine whether there is a significant difference between the additives.

In [None]:
plt.figure(figsize=(16, 4))
sns.boxplot(data = ingredient)

In [None]:
# cols = ['a','b','c','d','e','f','g','h','i']
box_plots(ingredient,ingredient.columns,3,'y',title=None,figsize=(16,20))

The outliers shown in the box plots above could be removed from the dataset using a Z-score threshold. However, since the dataset is small, it was decided to continue with all the data.
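
For reference, a Z-score filter could look like the sketch below. It is not applied in this notebook. The data in `demo` is synthetic, and the threshold of 3 is a common convention assumed here rather than a value taken from the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `ingredient` (synthetic values, not the real data).
rng = np.random.default_rng(1)
demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=list("abc"))
demo.loc[0, "a"] = 15.0  # inject one obvious outlier

# Column-wise Z-scores: distance from the column mean in standard deviations.
z = (demo - demo.mean()) / demo.std(ddof=0)

# Keep a row only if every column is within the assumed threshold of 3.
keep = (z.abs() < 3).all(axis=1)
filtered = demo[keep]

print(f"rows before: {len(demo)}, rows after: {len(filtered)}")
```

Because a row is dropped when any single column exceeds the threshold, this filter can remove a noticeable share of a small dataset, which is the reason given above for keeping all rows.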

# ANOVA Testing

### Null hypothesis (H0) = There is no significant difference among the average consumptions of additives in formulations
### Alternative hypothesis (H1) = There is a significant difference among the average consumptions of at least two of the additives.

In [None]:
#Reshape the dataset to prepare it for Analysis of variance or ANOVA
df = pd.melt(ingredient,value_vars=ingredient.columns,var_name='additives')
df.head()

In [None]:
#Check that all the additives have the same number of rows
df['additives'].value_counts().sort_index()

In [None]:
mod = ols('value~additives',data=df).fit()
#anova_lm takes the keyword `typ`; `type=2` is silently ignored and Type 1 is used
aov = sm.stats.anova_lm(mod,typ=2)
aov

### Interpretation: The p-value obtained from the ANOVA is significant (P<0.05). Therefore, we reject the null hypothesis and conclude that there are significant differences among the additives.
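
ANOVA assumes roughly normal residuals and similar variances across groups, which may not hold when one additive is on a very different scale. As a cross-check, a non-parametric test can be run alongside it. The sketch below uses `scipy.stats` on a synthetic stand-in named `demo`; none of the numbers it prints come from the real dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for `ingredient`: three columns, one with a shifted mean.
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "a": rng.normal(0.0, 1.0, 100),
    "b": rng.normal(0.0, 1.0, 100),
    "c": rng.normal(5.0, 1.0, 100),
})

# One array per additive, as expected by the scipy tests below.
groups = [demo[col].to_numpy() for col in demo.columns]

f_stat, f_p = stats.f_oneway(*groups)     # parametric one-way ANOVA
h_stat, kw_p = stats.kruskal(*groups)     # rank-based Kruskal-Wallis test
lev_stat, lev_p = stats.levene(*groups)   # check of the equal-variance assumption

print(f"ANOVA p = {f_p:.3g}, Kruskal-Wallis p = {kw_p:.3g}, Levene p = {lev_p:.3g}")
```

If the Kruskal-Wallis test agrees with the ANOVA, the conclusion does not hinge on the normality assumption.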

From the ANOVA, we know that the differences between additives are statistically significant, but ANOVA does not tell us which additives differ significantly from each other. To identify the significantly different pairs of additives, we perform a multiple pairwise comparison (post-hoc comparison) using the Tukey HSD test.

In [None]:
# load packages
from pingouin import pairwise_tukey
# perform multiple pairwise comparison (Tukey HSD)
m_comp = pairwise_tukey(data=df, dv='value', between='additives')
m_comp

The Tukey HSD results above suggest that, except for a-d, f-h and h-i, all pairwise comparisons between additives reject the null hypothesis (p-tukey < 0.05), indicating statistically significant differences.

In [None]:
sns.pairplot(ingredient)

The pairplot above shows some correlations among the additives. For example, "a" and "g" have a positive relationship, which was also seen in the heatmap above.

In [None]:
# cols = ['a','b','c','d','e','f','g','h','i']
dis_plot(ingredient,ingredient.columns,3)

From the figure above it can be seen that additives "f", "h" and "i" have lower consumption than the others, while "a", "b", "d", "e" and "g" have approximately normal distributions. In addition, "c" has a different distribution with two peaks. This could be because not all formulations use "c", or because some formulations use more of it than others.
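
The visual impression of normality can be backed by a formal test. The sketch below applies the Shapiro-Wilk test and a skewness measure per column. The `demo` columns are synthetic and hypothetical, chosen only to contrast a single-peaked shape with a two-peaked one like the one described for "c".

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in columns (synthetic values, not the real additives).
rng = np.random.default_rng(3)
demo = pd.DataFrame({
    "unimodal": rng.normal(0.0, 1.0, 200),
    "bimodal": np.concatenate([rng.normal(0.0, 1.0, 100),
                               rng.normal(6.0, 1.0, 100)]),
})

rows = []
for col in demo.columns:
    # Shapiro-Wilk: a small p-value is evidence against normality.
    w_stat, p_value = stats.shapiro(demo[col])
    rows.append({"additive": col, "W": w_stat, "p_value": p_value,
                 "skew": stats.skew(demo[col])})

results = pd.DataFrame(rows).set_index("additive")
print(results)
```

Columns that fail this check are candidates for the non-parametric summaries requested in part (a).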

In [None]:
#Transform features by scaling each feature to a given range
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
mms.fit(ingredient)
Tingredient = mms.transform(ingredient)

In [None]:
#Find Optimal K 
from sklearn.cluster import KMeans
squared_distances = []
K = range(1,10)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(Tingredient)
    squared_distances.append(km.inertia_)

In [None]:
#Visualize the optimum number of clusters (elbow method)
plt.plot(K, squared_distances, 'bx-')
plt.xlabel('K')
plt.ylabel('Sum(d^2)')
plt.title('K Optimization')
plt.show()
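
The elbow in an inertia curve can be ambiguous, so a second criterion is often useful. The sketch below scores each candidate K with the silhouette coefficient from scikit-learn. The blobs are synthetic with three centers chosen for illustration; in the notebook `X_scaled` would be replaced by `Tingredient`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for `Tingredient`: three well-separated synthetic blobs.
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [6, 6], [0, 6]],
                  cluster_std=0.5, random_state=0)
X_scaled = MinMaxScaler().fit_transform(X)

scores = {}
for k in range(2, 7):  # the silhouette is undefined for a single cluster
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

# The K with the highest mean silhouette is the preferred candidate.
best_k = max(scores, key=scores.get)
print(scores, "best K:", best_k)
```

Fixing `random_state` makes the comparison repeatable between runs, which the elbow loop above does not guarantee.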

In [None]:
from numpy import unique
from numpy import where
from sklearn.cluster import MiniBatchKMeans
from matplotlib import pyplot

model = MiniBatchKMeans(n_clusters=3)
# fit the model
model.fit(Tingredient)
# assign a cluster to each example
yhat = model.predict(Tingredient)
# retrieve unique clusters
clusters = unique(yhat)
# create scatter plot for samples from each cluster
for cluster in clusters:
    # get row indexes for samples with this cluster
    row_ix = where(yhat == cluster)
    # create scatter of these samples
    pyplot.scatter(Tingredient[row_ix, 0], Tingredient[row_ix,3])
# show the plot
pyplot.show()

In [None]:
ingredient2 = ingredient.copy()

Since this is a multi-dimensional problem, the 2-D graph above shows many points in the same vicinity that are allocated to different clusters.
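
One way to reduce this overlap is to project all features onto two principal components instead of picking two raw columns. The sketch below is a minimal illustration on synthetic nine-feature data. It uses `KMeans` rather than `MiniBatchKMeans` for simplicity, and `X_scaled` stands in for `Tingredient`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for `Tingredient`: synthetic data with nine features.
X, _ = make_blobs(n_samples=150, n_features=9, centers=3, random_state=0)
X_scaled = MinMaxScaler().fit_transform(X)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# Project the nine scaled features onto the two directions of largest variance.
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_.sum()

print(f"variance captured by two components: {explained:.1%}")
```

The two columns of `coords` can then be passed to a scatter plot colored by `labels`. The captured variance indicates how much of the structure the 2-D view preserves.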

In [None]:
ingredient2['cluster'] = yhat
ingredient2

In [None]:
sns.pairplot(ingredient2,hue='cluster')

In [None]:
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

for i, cluster in enumerate(clusters):
    # get row indexes for samples with this cluster
    row_ix = where(yhat == cluster)
    
    x = np.array(Tingredient[row_ix,1])
    y = np.array(Tingredient[row_ix,2])
    z = np.array(Tingredient[row_ix,3])
    # create scatter of these samples
    ax.scatter(x,y,z, marker="s", s=40)
# show the plot
pyplot.show()

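The 3-D graph above shows the three clusters, plotted on columns 1 to 3 of the scaled data (additives "b", "c" and "d" if the columns are in alphabetical order).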
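
To interpret the clusters as formulations, it can help to look at the cluster sizes and at the cluster centers in the original units. The sketch below does this with `inverse_transform` on the fitted scaler. The data is synthetic, the column names are placeholders, and `KMeans` is used in place of `MiniBatchKMeans`.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import MinMaxScaler

# Hypothetical stand-in for `ingredient` (synthetic values, placeholder columns).
X, _ = make_blobs(n_samples=150, n_features=4, centers=3, random_state=0)
demo = pd.DataFrame(X, columns=list("abcd"))

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(demo)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)

# Map the centers from the scaled [0, 1] range back to the original units.
centers = pd.DataFrame(scaler.inverse_transform(model.cluster_centers_),
                       columns=demo.columns)
# Number of rows assigned to each cluster.
sizes = pd.Series(model.labels_).value_counts().sort_index()

print(centers.round(2))
print(sizes)
```

Each row of `centers` can be read as the typical additive profile of one candidate formulation.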