## Preliminary Steps.

**Import Libraries.**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from IPython.display import display

**Some Notebook Settings.**

In [None]:
warnings.filterwarnings('ignore') # ignore warnings.
%config IPCompleter.greedy = True # autocomplete feature.
pd.options.display.max_rows = None # set maximum rows that can be displayed in notebook.
pd.options.display.max_columns = None # set maximum columns that can be displayed in notebook.
pd.options.display.precision = 2 # set the precision of floating point numbers.

**Check Encoding of Data.**

In [None]:
# # Check the encoding of data. Use ctrl+/ to comment/un-comment.

# import chardet

# rawdata = open('candy-data.csv', 'rb').read()
# result = chardet.detect(rawdata)
# charenc = result['encoding']
# print(charenc)
# print(result) # It's utf-8 with 99% confidence.

**Read Data.**

In [2]:
df = pd.read_csv('../input/candy-data.csv', encoding='utf-8')
df.drop_duplicates(inplace=True) # drop duplicates if any.
df.shape # num rows x num columns.

(85, 13)

<hr>

## Data Preparation.

**Check for missing values.**

In [None]:
(df.isnull().sum()/len(df)*100).sort_values(ascending=False)

No missing values.

In [None]:
df.head()

We have a total of 12 variables that describe a candy: 9 are categorical (binary) and the remaining 3 are numerical.

1. chocolate: Does it contain chocolate?
2. fruity: Is it fruit flavored?
3. caramel: Is there caramel in the candy?
4. peanutyalmondy: Does it contain peanuts, peanut butter or almonds?
5. nougat: Does it contain nougat?
6. crispedricewafer: Does it contain crisped rice, wafers, or a cookie component?
7. hard: Is it a hard candy?
8. bar: Is it a candy bar?
9. pluribus: Is it one of many candies in a bag or box?
10. sugarpercent: The percentile of sugar it falls under within the data set.
11. pricepercent: The unit price percentile compared to the rest of the set.
12. winpercent: The overall win percentage according to 269,000 matchups.

In [None]:
df['winpercent'] = df['winpercent']/100

**Deriving new features.**

In [None]:
df['sugarbyprice'] = df['sugarpercent'].div(df['pricepercent']) # higher value means the candy is sweet as well as cheap.
df['winbyprice'] = df['winpercent'].div(df['pricepercent']) # higher value means the candy is more liked as well as cheap.
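
As a small illustration of these ratio features, here is a minimal sketch on a made-up frame (`toy` and its values are hypothetical, not taken from the candy data). It also shows one possible way to handle a price percentile of zero, which would otherwise yield `inf`; whether that case occurs in the real data is not checked here.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the candy data; the values are made up for illustration.
toy = pd.DataFrame({
    'sugarpercent': [0.50, 0.90, 0.10],
    'pricepercent': [0.25, 0.90, 0.00],
    'winpercent':   [0.60, 0.45, 0.30],
})

# Same ratio features as in the cell above.
toy['sugarbyprice'] = toy['sugarpercent'].div(toy['pricepercent'])
toy['winbyprice'] = toy['winpercent'].div(toy['pricepercent'])

# A zero price percentile produces inf; replace it with NaN so it is easy to spot.
ratio_cols = ['sugarbyprice', 'winbyprice']
toy[ratio_cols] = toy[ratio_cols].replace([np.inf, -np.inf], np.nan)

print(toy)
```

Both ratios grow as the price percentile shrinks, so very cheap candies can dominate these features.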

In [None]:
categorical_vars = ['chocolate', 'fruity', 'caramel', 'peanutyalmondy', 'nougat', 'crispedricewafer', 'hard', 'bar',
                    'pluribus']
numerical_vars = ['sugarpercent', 'pricepercent', 'winpercent', 'sugarbyprice', 'winbyprice']

<hr>

## Data Understanding.

**Some Questions one might ask.**

1. Top 10 winner candies.

In [None]:
df['competitorname'] = df['competitorname'].str.replace('Õ', "'") # Special character was appearing in name of candy.
df.sort_values(by=['winpercent', 'sugarpercent'], ascending=False).head(10)

Reese's seems to be a favourite. Note that all the top competitors are chocolaty as well. Also, Reese's Miniatures is very cheap compared with the other top competitors, and compared with the data set overall.

2. Competitors which are not chocolaty but winners.

In [None]:
df[df['chocolate']==0].sort_values(by=['winpercent', 'sugarpercent'], ascending=False).head(10)

Sour Patch Kids has a high `winbyprice`. They are cheap as well as a favourite.

3. Top `winbyprice` competitors.

In [None]:
df.sort_values(by=['winbyprice', 'winpercent'], ascending=False).head(10)

Tootsie Roll Midgies seem to give the most bang for the buck.

4. Top 10 sugary candies.

In [None]:
df.sort_values(by=['sugarpercent', 'winpercent'], ascending=False).head(10)

5. Which candies are both chocolaty and fruity?

In [None]:
df[(df['chocolate']==1)&(df['fruity']==1)]

<hr>

**Correlation Heatmap.**

In [None]:
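plt.figure(figsize = (20,8))
# drop the name column so that only numeric features enter the correlation matrix.
sns.heatmap(df.drop('competitorname', axis=1).corr(), annot=True, cmap = 'coolwarm')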

We do have some correlation between features. We can use PCA to handle the correlation as well as to reduce dimensionality.

Should we drop correlated variables before performing K-means? -> https://stats.stackexchange.com/questions/62253/do-i-need-to-drop-variables-that-are-correlated-collinear-before-running-kmeans

<hr>

## Principal Components Analysis.

**Perform PCA.**

scikit-learn follows 4 steps: import, instantiate, fit, transform.

In [None]:
# Importing the PCA module.

from sklearn.decomposition import PCA # import.
pca = PCA(svd_solver='randomized', random_state=123) #instantiate.
pca.fit(df.drop('competitorname', axis=1)) # fit.
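
The four steps can be sketched end to end on synthetic data. This is only an illustration: `X_toy`, `pca_toy` and the random values are assumptions, not part of the candy analysis.

```python
import numpy as np
from sklearn.decomposition import PCA  # import

rng = np.random.default_rng(0)
# Synthetic data: 100 rows, 3 columns, with the third column almost a copy of the first.
X_toy = rng.normal(size=(100, 3))
X_toy[:, 2] = X_toy[:, 0] + 0.01 * rng.normal(size=100)

pca_toy = PCA(n_components=2, random_state=0)  # instantiate
pca_toy.fit(X_toy)                             # fit
X_toy_2d = pca_toy.transform(X_toy)            # transform

print(X_toy_2d.shape)
print(pca_toy.explained_variance_ratio_.sum())
```

Because one column is nearly redundant, two components should retain almost all of the variance in this toy case.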

**Scree Plot.**

In [None]:
# Making the scree plot: explained variance per component (left) and cumulative explained variance (right).

fig = plt.figure(figsize = (20,5))
ax = plt.subplot(121)
plt.plot(pca.explained_variance_ratio_)
plt.xlabel('principal components')
plt.ylabel('explained variance')

ax2 = plt.subplot(122)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')

plt.show()

The elbow method suggests keeping the first 2 or 3 components.

**Percentage of Variance retained.**

In [None]:
# what percentage of variance in data can be explained by first 2,3 and 4 principal components respectively?
(pca.explained_variance_ratio_[0:2].sum().round(3),
pca.explained_variance_ratio_[0:3].sum().round(3),
pca.explained_variance_ratio_[0:4].sum().round(3))
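
One possible way to turn the cumulative explained variance into a component count is to pick the smallest number of components that reaches a chosen threshold. The ratios below are hypothetical stand-ins for `pca.explained_variance_ratio_`, and the 0.95 threshold is an assumption for the sketch.

```python
import numpy as np

# Hypothetical explained-variance ratios, standing in for pca.explained_variance_ratio_.
ratios = np.array([0.72, 0.24, 0.03, 0.01])

cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative ratio reaches the threshold.
threshold = 0.95
n_components = int(np.argmax(cumulative >= threshold) + 1)

print(cumulative)
print(n_components)
```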

**Visualize Principal Components Loadings.**

In [None]:
# we'll use the first 2 principal components, as they retain 95% of the variance.

df_pca_2_comp = pd.DataFrame({'PC1':pca.components_[0],'PC2':pca.components_[1], 'Feature':df.drop(
                              'competitorname', axis=1).columns})
# df_pca_2_comp

In [None]:
# we can visualize what the principal components seem to capture.

fig = plt.figure(figsize = (6,6))
plt.scatter(df_pca_2_comp.PC1, df_pca_2_comp.PC2)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
for i, txt in enumerate(df_pca_2_comp.Feature):
    plt.annotate(txt, (df_pca_2_comp.PC1[i],df_pca_2_comp.PC2[i]))
plt.tight_layout()
plt.show()

Except `sugarbyprice` and `winbyprice`, all the other features seem to be clustered.

<hr>

**Transform Data.**

In [None]:
df_pca = pca.transform(df.drop('competitorname', axis=1)) # our data transformed with new features as principal components.
df_pca = df_pca[:, 0:2] # Since we require first two principal components only.

**Scale Data.**

In [None]:
from sklearn.preprocessing import StandardScaler

standard_scaler = StandardScaler()
df_s = standard_scaler.fit_transform(df_pca) # s in df_s stands for scaled.
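
To see what the scaler does, here is a minimal sketch on made-up numbers (`scores` is a hypothetical array, not the actual principal components). After standardization each column should have mean close to 0 and standard deviation close to 1, so neither component dominates the distance computations in K-means.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two made-up columns on very different scales.
scores = np.array([[10.0, 0.1],
                   [20.0, 0.2],
                   [30.0, 0.6]])

scaled = StandardScaler().fit_transform(scores)

print(scaled.mean(axis=0))  # close to 0 for each column
print(scaled.std(axis=0))   # close to 1 for each column
```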

**Visualize Principal Components.**

In [None]:
sns.pairplot(pd.DataFrame(df_s)) # Try to get some intuition of data.

One cluster is very clearly visible. Seems to me that the second cluster will contain the data points not in the first cluster. Two clusters might suffice.

<hr>

## Clustering of Data.

**Is the data clusterable?**

The Hopkins statistic tells us whether the data is clusterable. If it is less than 0.5, the clusters are not statistically significant.
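
For reference, the ratio computed by the `hopkins` function below can be written as follows, where $u_j$ is the distance from the $j$-th uniformly drawn point to its nearest data point, $w_j$ is the distance from the $j$-th sampled data point to its nearest neighbour in the data, and $m$ is the sample size. Some formulations raise the distances to the power of the dimension $d$; the version here follows the code and uses plain distances.

$$
H = \frac{\sum_{j=1}^{m} u_j}{\sum_{j=1}^{m} u_j + \sum_{j=1}^{m} w_j}
$$

Under this convention, uniformly scattered data gives $u_j \approx w_j$ and hence $H \approx 0.5$, while strongly clustered data gives $w_j \ll u_j$ and $H$ close to 1.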

In [None]:
from sklearn.neighbors import NearestNeighbors
from random import sample
from numpy.random import uniform
import numpy as np
from math import isnan
 
def hopkins(X):
    d = X.shape[1]
    #d = len(vars) # columns
    n = len(X) # rows
    m = int(0.1 * n) 
    nbrs = NearestNeighbors(n_neighbors=1).fit(X.values)
 
    rand_X = sample(range(0, n, 1), m)
 
    ujd = []
    wjd = []
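    for j in range(0, m):
        # distance from a uniformly drawn point to its nearest data point.
        u_dist, _ = nbrs.kneighbors(uniform(np.amin(X,axis=0),np.amax(X,axis=0),d).reshape(1, -1), 1, return_distance=True)
        ujd.append(u_dist[0][0])
        # distance from a sampled data point to its nearest neighbour (index 0 is the point itself).
        w_dist, _ = nbrs.kneighbors(X.iloc[rand_X[j]].values.reshape(1, -1), 2, return_distance=True)
        wjd.append(w_dist[0][1])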
 
    H = sum(ujd) / (sum(ujd) + sum(wjd))
    if isnan(H):
        print(ujd, wjd)
        H = 0
 
    return H

In [None]:
hopkins(pd.DataFrame(df_s))

Yes, the Hopkins statistic indicates that this data is highly clusterable.

<hr>

**Clustering.**

In [None]:
from sklearn.cluster import KMeans # import.

# silhouette scores to choose number of clusters.
from sklearn.metrics import silhouette_score
def sil_score(df):
    scores = []
    for k in range(2, 15):
        kmeans = KMeans(n_clusters=k, random_state=123).fit(df) # fit on the data passed in.
        scores.append([k, silhouette_score(df, kmeans.labels_)])
    scores = pd.DataFrame(scores, columns=['k', 'silhouette'])
    plt.plot(scores['k'], scores['silhouette'])

sil_score(df_s)

Maximum silhouette score at k=2.
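
The selection rule can also be made explicit rather than read off the plot. The following is a sketch on synthetic data: the two blobs, the range of `k`, and the names `points` and `best_k` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs in 2-D.
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])

# Silhouette score for each candidate number of clusters.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# Pick the k with the highest silhouette score.
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

The silhouette score is only defined for at least 2 clusters, which is why the loop starts at `k=2`.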

In [None]:
# sum of squared distances.

def plot_ssd(df):
    ssd = []
    for num_clusters in list(range(1,19)):
        model_clus = KMeans(n_clusters = num_clusters, max_iter=50, random_state=123)
        model_clus.fit(df)
        ssd.append(model_clus.inertia_)
    plt.plot(ssd)

plot_ssd(df_s)

Elbow seems to form at 2.

<hr>

**K-Means with 2 clusters.**

In [None]:
# K-means with K=2.
km2c = KMeans(n_clusters=2, max_iter=50, random_state=93)
km2c.fit(df_s)

In [None]:
# creation of data frame with original features for analysis of clusters formed.

df_dummy = pd.DataFrame.copy(df)
dfkm2c = pd.concat([df_dummy, pd.Series(km2c.labels_)], axis=1)
dfkm2c.rename(columns={0:'Cluster ID'}, inplace=True)
# dfkm2c.head()

In [None]:
# creation of data frame with features as principal components for analysis of clusters formed.

df_dummy = pd.DataFrame.copy(pd.DataFrame(df_s))
dfpcakm2c = pd.concat([df_dummy, pd.Series(km2c.labels_)], axis=1)
dfpcakm2c.columns = ['PC1', 'PC2', 'Cluster ID']

In [None]:
sns.pairplot(data=dfpcakm2c, vars=['PC1', 'PC2'], hue='Cluster ID')

<hr>

**K-means with 5 clusters.**

In [None]:
# K-means with K=5.
km5c = KMeans(n_clusters=5, max_iter=50, random_state=123)
km5c.fit(df_s)

In [None]:
# creation of data frame with original features for analysis of clusters formed.

df_dummy = pd.DataFrame.copy(df)
dfkm5c = pd.concat([df_dummy, pd.Series(km5c.labels_)], axis=1) # df-dataframe, km-kmeans, 5c-5clusters.
dfkm5c.rename(columns={0:'Cluster ID'}, inplace=True)
# dfkm5c.head()

In [None]:
# creation of data frame with features as principal components for analysis of clusters formed.

df_dummy = pd.DataFrame.copy(pd.DataFrame(df_s))
dfpcakm5c = pd.concat([df_dummy, pd.Series(km5c.labels_)], axis=1)
dfpcakm5c.columns = ['PC1', 'PC2', 'Cluster ID']

In [None]:
sns.pairplot(data = dfpcakm5c, vars=['PC1', 'PC2'], hue='Cluster ID')

## Analysis of Clusters.

Let's see how cluster 0 differs from the rest.

In [None]:
dfkm5c.groupby('Cluster ID').mean(numeric_only=True) # skip the competitor name column.

In [None]:
dfkm5c[dfkm5c['Cluster ID']!=0]

1. Note that only Cluster ID 4 (Dum Dums) and Cluster ID 1 (Tootsie Roll Midgies) are far away from Cluster ID 0.<br>
2. 'Dum Dums' and 'Tootsie Roll Midgies' are roughly opposites of each other: the first is fruity and the second is chocolaty.<br>
3. Cluster ID 0 contains competitors which are mostly chocolaty, sugary and more favoured. Cluster ID 1, although chocolaty, has a low sugar percentile.<br>
4. All the candies which don't belong to Cluster ID 0 have made the top 10 list of `winbyprice`. They are all cheap.

Let's put clusters other than 0 into one cluster and then analyze again.

In [None]:
dfkm5c['Cluster ID'] = dfkm5c['Cluster ID'].map(lambda x: 1 if (x!=0) else 0)

In [None]:
dfkm5c.groupby('Cluster ID').mean(numeric_only=True) # skip the competitor name column.

So, Cluster ID 0 contains competitors which are more chocolaty and more pricey.
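
The merge-and-compare step can be illustrated on a tiny made-up frame. All names and values in `toy` are hypothetical; the sketch only shows the mechanics of collapsing every non-zero cluster label into a single label and comparing per-cluster means.

```python
import pandas as pd

# Made-up rows with cluster labels, standing in for dfkm5c.
toy = pd.DataFrame({
    'competitorname': ['A', 'B', 'C', 'D'],
    'chocolate': [1, 1, 0, 0],
    'pricepercent': [0.8, 0.6, 0.1, 0.3],
    'Cluster ID': [0, 0, 3, 4],
})

# Collapse every non-zero label into cluster 1, keeping cluster 0 as it is.
toy['Cluster ID'] = (toy['Cluster ID'] != 0).astype(int)

# Per-cluster means of the numeric columns only.
summary = toy.groupby('Cluster ID').mean(numeric_only=True)
print(summary)
```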

<hr>

## Predicting the win percentage.

**Scaling.**

In [None]:
X = df.drop(['competitorname', 'winpercent', 'sugarpercent', 'pricepercent', 'sugarbyprice', 'winbyprice'], axis=1)
y = df['winpercent']

from sklearn.preprocessing import MinMaxScaler
minmax_scaler = MinMaxScaler() # not applied below: the remaining features are binary and already lie in [0, 1].

**Cross-Validation.**

In [None]:
from sklearn import linear_model # import.
lr_rdg = linear_model.Ridge(random_state=123) # instantiate.

# Perform cross-validation.
from sklearn.model_selection import GridSearchCV
hyperparameters = {'alpha': [0.01, 0.1, 1, 10, 100, 1000]}
model_cv = GridSearchCV(estimator = lr_rdg, param_grid = hyperparameters, cv=10, scoring= 'neg_mean_absolute_error',
                        return_train_score=True) # train scores are needed for the plot below.
#lr_rdg.get_params().keys() # hyperparameters that we can set.

model_cv.fit(X, y) # fit.

In [None]:
cv_results = pd.DataFrame(model_cv.cv_results_)
# cv_results.head()

# Plotting mean test and train scores with alpha.
cv_results['param_alpha'] = cv_results['param_alpha'].astype('float32')

# Plotting.
plt.plot(cv_results['param_alpha'], cv_results['mean_train_score'])
plt.plot(cv_results['param_alpha'], cv_results['mean_test_score'])
plt.xlabel('alpha')
plt.ylabel('Negative Mean Absolute Error')

plt.title("Negative Mean Absolute Error and alpha")
plt.legend(['train score', 'test score'], loc='upper left')
plt.show()

<hr>

**Ridge Linear Regression.**

In [None]:
model_cv.best_params_
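
The cross-validated search can be sketched on synthetic data as follows. The features, target and coefficients are made up for illustration, and `search`, `X_toy` and `y_toy` are hypothetical names. Note that `mean_train_score` is only present in `cv_results_` when `return_train_score=True` is passed.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Synthetic binary features and a noisy linear target (made up for illustration).
X_toy = rng.integers(0, 2, size=(80, 4)).astype(float)
y_toy = X_toy @ np.array([0.2, -0.1, 0.0, 0.05]) + 0.5 + 0.05 * rng.normal(size=80)

search = GridSearchCV(
    estimator=Ridge(),
    param_grid={'alpha': [0.01, 0.1, 1, 10, 100]},
    cv=5,
    scoring='neg_mean_absolute_error',
    return_train_score=True,  # needed for mean_train_score in cv_results_
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.cv_results_['mean_train_score'])
```

Scores are negated errors, so values closer to zero are better.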

In [None]:
alpha = 1
ridge = linear_model.Ridge(alpha=alpha)
ridge.fit(X, y)

**Results.**

In [None]:
ridge.intercept_ # constant term.

In [None]:
for feature, coef in zip(X.columns, ridge.coef_): # coefficients of features; avoid overwriting the target y.
    print(feature, coef*100)

These coefficients roughly match the analysis done by FiveThirtyEight -> https://fivethirtyeight.com/features/the-ultimate-halloween-candy-power-ranking/

<hr>