<h1 style="color:blue;">Scenario 4 - DATA 6310</h1> 

- C2.S4.Py01	Digging deeper in the normalized results  
- C2.S4.Py02	Creating a dendrogram
- C2.S4.Py03	Hierarchical cluster analysis
- C2.S4.Py04	Hierarchical cluster analysis with Ward linkage and comparisons to other clusters
- C2.S4.Py05	Hierarchical cluster analysis with larger sample
- C2.S4.Py06	Normalize the data and run hierarchical cluster analysis with larger sample
- C2.S4.Py07	K-Means clustering with the original dataset
- C2.S4.Py08	Hierarchical clustering with the original dataset
- C2.S4.Py09	Comparing the cluster results

---
#  BUSINESS UNDERSTANDING 
---

## Business Objective
 - Can we find distinct groups within the data to better understand it? 
 - Do we know the number of clusters?
     - If **NO**, use hierarchical clustering 
     - If **YES**, use K-Means clustering
 - Both are unsupervised learning techniques
 
## Technical Objective
- Create a dataset that can be used for clustering
- Normalize the data to see if it provides better results
- Determine if you would like to use a set number of groups
- Use hierarchical clustering to get a good idea of the number of clusters
- Use K-Means clustering if you know the number of clusters expected (often a business decision)

- https://www.geeksforgeeks.org/implementing-agglomerative-clustering-using-sklearn/
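
The decision rule above can be sketched on toy data. This is only an illustration under stated assumptions: the data is synthetic (not the course dataset), and the "largest gap in merge distance" rule is one simple heuristic for suggesting a cluster count, not the method used later in this notebook.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

# Synthetic stand-in data (an assumption, not the course dataset):
# three well-separated 2-D groups of 30 points each
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Number of clusters unknown: build the hierarchy and look for the largest
# jump between consecutive merge distances (a simple heuristic)
Z = linkage(X, method='ward')
gaps = np.diff(Z[:, 2])
k_suggested = int(len(X) - (np.argmax(gaps) + 1))
hier_labels = fcluster(Z, t=k_suggested, criterion='maxclust')

# Number of clusters known (here taken from the step above): use K-Means
km = KMeans(n_clusters=k_suggested, n_init=10, random_state=42)
km_labels = km.fit_predict(X)

print(k_suggested, np.unique(hier_labels).size, np.unique(km_labels).size)
```

On real data the gaps are rarely this clear, so the dendrogram is usually inspected visually before a cut is chosen.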

<h2 style="color:blue;">Digging deeper in the normalized results</h2>

In [None]:
#Code Block 1
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', 500) # display up to 500 columns when viewing a DataFrame

# Plot style: Matplotlib 3.6+ renamed the seaborn styles to 'seaborn-v0_8-*'
style = 'seaborn-v0_8-colorblind'
if style not in plt.style.available:
    style = 'seaborn-colorblind' # name used by older Matplotlib versions
plt.style.use(style)

# Render plots inline automatically, without calling plt.show()
%matplotlib inline

In [None]:
%%time

#Code Block 2
url = 'https://data6300.file.core.windows.net/data6300/Scenario4.csv?st=2020-09-30T22%3A22%3A07Z&se=2022-10-01T22%3A22%3A00Z&sp=rl&sv=2018-03-28&sr=f&sig=pyseRYm2q51tAewi8DmFjIBC7Zd60FaP%2BSJb77y2y1A%3D'
url2 = 'https://data6300.file.core.windows.net/data6300/Scenario4_melt.csv?st=2020-09-30T22%3A22%3A26Z&se=2022-10-01T22%3A22%3A00Z&sp=rl&sv=2018-03-28&sr=f&sig=%2FC1oPHyVSwHjVESFXSEmd0U6UVRk2bdv3CbKxlnd%2F18%3D'

df_ap = pd.read_csv(url, index_col=0, header=0)
df_ap_melt = pd.read_csv(url2, index_col=0, header=0)
df_ap.info()

In [None]:
#Code Block 3
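# numeric_only=True skips text columns, which raise an error in mean() on pandas 2.x
round(df_ap.groupby('Predict_4').mean(numeric_only=True).T, 2)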

In [None]:
#Code Block 4
df_ap['Predict_4'].value_counts()

In [None]:
#Code Block 5
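# numeric_only=True skips text columns, which raise an error in mean() on pandas 2.x
round(df_ap.groupby('Predict_n').mean(numeric_only=True).T, 2)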

In [None]:
#Code Block 6

df_ap['Predict_n'].value_counts()

In [None]:
#Code Block 7

sns.set_style("whitegrid")
plt.figure(figsize=(20,20))

plt.subplot(411)
plt.title('Income for Normalizer Data', fontweight='bold', color = 'blue', fontsize='24', horizontalalignment='center')
sns.boxplot(y='Income_Dollars', x='Predict_n', data=df_ap, palette='Blues')
plt.xticks([])
plt.ylabel('Income Dollars')
plt.ylim(0, 200000)


plt.subplot(412)
plt.title('Discretionary for Normalizer Data', fontweight='bold', color = 'orange', fontsize='24', horizontalalignment='center')
sns.boxplot(y='Discretionary_Spending_Dollars', x='Predict_n', data=df_ap, palette='Oranges')
plt.xticks([])
plt.ylabel('Discretionary')
plt.ylim(0, 25000)

plt.subplot(413)
plt.title('Age for Normalizer Data', fontweight='bold', color = 'green', fontsize='24', horizontalalignment='center')
sns.boxplot(y='Age', x='Predict_n', data=df_ap, palette='Greens')
plt.xticks([])
plt.ylabel('Age')
#plt.ylim(0, 25000)

plt.subplot(414)
plt.title('Household Size for Normalizer Data', fontweight='bold', color = 'purple', fontsize='24', horizontalalignment='center')
sns.boxplot(y='Househoold_Size', x='Predict_n', data=df_ap, palette='Purples') # column name is spelled this way in the data
#plt.xticks([])
plt.ylabel('Household Size')


### Compare income dollars to TV Show rank

In [None]:
#Code Block 8

df_ap_chart = df_ap.sample(1000, random_state=42)

In [None]:
#Code Block 9

df_ap_cols = df_ap_chart.columns
# Drop the demographic, ID, and cluster-label columns so only the TV show columns remain
df_ap_cols = df_ap_cols.drop(['Income_Dollars', 'HomeOwner_Renter','Marital_Status', 'Age', 
                              'Adults_in_Household', 'Househoold_Size', 
                              'Discretionary_Spending_Dollars', 'PolicyNumber', 'DriverNumber', 
                              'DriverCount', 'APID', 'Policy', 'Predict', 'Predict_4',
                              'Predict_3n', 'Predict_n'])
# Reshape from wide to long: one row per (APID, TV show) pair
df_ap_melt = pd.melt(df_ap_chart, id_vars=['APID'], value_vars=df_ap_cols)
df_ap_melt = df_ap_melt.rename(columns={'variable': 'TV_Show', 'value': 'Rank'})
df_ap_melt

In [None]:
#Code Block 10

df_ap_melt = pd.merge(df_ap_melt, df_ap_chart[['APID', 'Income_Dollars', 'Age' ,'Househoold_Size',  
                                         'Discretionary_Spending_Dollars' ,
                                         'Predict_4', 'Predict_n']], how='left', on='APID')
df_ap_melt.info() # info() prints its summary and returns None, so display() is not needed
df_ap_melt.head()



In [None]:
#Code Block 11

df_ap_melt.info()

In [None]:
#Code Block 12

# lmplot is a figure-level function, so size it with height/aspect rather than plt.figure
g = sns.lmplot(y='Rank', x='Income_Dollars', hue='Predict_n', fit_reg=False, col="TV_Show", col_wrap=2,
               data=df_ap_melt, palette="Set1", height=5, aspect=2,
               scatter_kws={"alpha":0.35,"s":150,"linewidth":2,"edgecolor":"white"})
g.set(xlim=(0, 200000)) # apply the same x-axis limits to every panel

<h2 style="color:blue;">Creating the Dendrogram</h2> 

### Approach to using Hierarchical clustering

- Create a dendrogram to visually depict the clusters
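
Before plotting, it can help to see what a dendrogram is drawn from. The sketch below uses four made-up 1-D points (an assumption for illustration only) and prints the linkage matrix that `dendrogram` reads; `no_plot=True` returns the tree layout without drawing it.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Four made-up 1-D observations: two tight pairs that are far apart
points = np.array([[0.0], [1.0], [10.0], [12.0]])

# Each row of Z describes one merge:
# [cluster_a, cluster_b, merge distance, size of the new cluster]
Z = linkage(points, method='complete')
print(Z)

# no_plot=True computes the dendrogram layout without drawing a figure
tree = dendrogram(Z, no_plot=True)
print(tree['ivl'])  # leaf labels in left-to-right plotting order
```

The third column of `Z` is the height at which each merge appears in the plot, which is why tall vertical gaps suggest natural places to cut.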

### Dendrogram 1

In [None]:
#Code Block 13

# Sample 100 rows and renumber them 0-99, discarding the old index
df_ap_100_demo = df_ap.sample(100, random_state=42).reset_index(drop=True)
df_ap_100 = df_ap_100_demo.iloc[:, 1:17] # the 16 columns passed to the clustering
df_ap_100.info()

In [None]:
#Code Block 14

from scipy.cluster.hierarchy import linkage, dendrogram

In [None]:
#Code Block 15

mergings = linkage(df_ap_100, method='complete')

In [None]:
#Code Block 16

plt.figure(figsize=(20,10))
dendrogram(mergings, leaf_rotation=0, leaf_font_size=8)
plt.show()

### Dendrogram 2

### Orient to the left

In [None]:
#Code Block 17

plt.figure(figsize=(20,20))
dendrogram(mergings, leaf_rotation=0, leaf_font_size=10, orientation='left')
plt.show()

### Dendrogram 3

### Dendrogram with more features

#### Setting `truncate_mode` and `p` gives an easier-to-read dendrogram (p = 20)

In [None]:
#Code Block 18

plt.figure(figsize=(20,4))
dendrogram(mergings, leaf_rotation=0, leaf_font_size=8)
plt.show()

In [None]:
#Code Block 19

plt.figure(figsize=(20,10))
plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
    mergings,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=20,  # number of merged clusters to show
    show_leaf_counts=True,  # numbers in brackets are the observation counts in each truncated leaf
    leaf_rotation=0.,
    leaf_font_size=12.,
    show_contracted=True,  # to get a distribution impression in truncated branches
    orientation='top' # root at the top, leaves along the x-axis
)
plt.show()

### Dendrogram 4
### Setting `truncate_mode` and `p` gives an easier-to-read dendrogram (p = 30)

In [None]:
#Code Block 20

plt.figure(figsize=(20,10))
plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.ylabel('sample index')
plt.xlabel('distance')
dendrogram(
    mergings,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=30,  # number of merged clusters to show
    show_leaf_counts=False,  # otherwise numbers in brackets are counts
    leaf_rotation=0.,
    leaf_font_size=12.,
    show_contracted=True,  # to get a distribution impression in truncated branches
    orientation='left' #sets orientation to horizontal instead of vertical
)
plt.show()


<h2 style="color:blue;">Hierarchical clustering</h2> 

In [None]:
#Code Block 20b

from scipy.cluster.hierarchy import fcluster

In [None]:
#Code Block 21

df_ap_pred = fcluster(mergings, 45, criterion='distance')
df_ap_pred = pd.DataFrame(df_ap_pred)
df_ap_pred.columns = ['Pred_45']
df_ap_pred.head()

In [None]:
#Code Block 22

df_ap_pred['Pred_45'].value_counts()
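
The cell above cuts the tree at a fixed height with `criterion='distance'`. As a hedged sketch on made-up 1-D points (not the course data), the same `fcluster` call can instead ask for a target number of clusters with `criterion='maxclust'`:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Made-up 1-D observations: two tight pairs plus one distant point
points = np.array([[0.0], [1.0], [10.0], [12.0], [30.0]])
Z = linkage(points, method='complete')  # merge heights: 1, 2, 12, 30

# Cut at a height: observations joined below distance 5 share a label
by_distance = fcluster(Z, t=5, criterion='distance')

# Ask for at most 2 clusters and let SciPy pick the cut height
by_count = fcluster(Z, t=2, criterion='maxclust')

print(by_distance)
print(by_count)
```

Cutting by distance is convenient when reading a threshold off the dendrogram; cutting by count is convenient when the number of groups is fixed in advance.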

In [None]:
#Code Block 23

sns.set(style='whitegrid')

plt.figure(figsize=(20,14))
plt.title('Hierarchical Clustering Dendrogram')
plt.ylabel('sample index')
plt.xlabel('distance')
dendrogram(
    mergings,
    show_leaf_counts=False,  # otherwise numbers in brackets are counts
    leaf_rotation=0.,
    leaf_font_size=9.,
    show_contracted=True,  # to get a distribution impression in truncated branches
    orientation='left' #sets orientation to horizontal instead of vertical
)
plt.show()

In [None]:
#Code Block 24

df_ap_pred_42 = fcluster(mergings, 42, criterion='distance')
df_ap_pred_42 = pd.DataFrame(df_ap_pred_42)
df_ap_pred_42.columns = ['Predict_h_42']
df_ap_pred_42.head()

In [None]:
#Code Block 25

df_ap_pred_42['Predict_h_42'].value_counts()

In [None]:
#Code Block 26

df_ap_100_demo = pd.concat([df_ap_100_demo, df_ap_pred_42], axis=1)
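# numeric_only=True skips text columns, which raise an error in mean() on pandas 2.x
round(df_ap_100_demo.groupby('Predict_h_42').mean(numeric_only=True).T,2)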

In [None]:
#Code Block 27

round(df_ap_100_demo[df_ap_100_demo['Predict_h_42']==2][['Age', 'Income_Dollars', 'Adults_in_Household', 'Househoold_Size', 'Discretionary_Spending_Dollars', 'Policy']].describe().T, 2)

<h2 style="color:blue;">Hierarchical cluster analysis with Ward and comparisons to other clusters</h2> 

- **Complete** linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the largest value (i.e., maximum value) of these dissimilarities as the distance between the two clusters. It tends to produce more compact clusters.
- **Single** linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the smallest of these dissimilarities as a linkage criterion. It tends to produce long, “loose” clusters.
- **Average** linkage clustering: It computes all pairwise dissimilarities between the elements in cluster 1 and the elements in cluster 2, and considers the average of these dissimilarities as the distance between the two clusters.
- **Centroid** linkage clustering: It computes the dissimilarity between the centroid for cluster 1 (a mean vector of length p, one element per variable) and the centroid for cluster 2.
- **Ward’s** minimum variance method: It minimizes the total within-cluster variance. At each step it merges the pair of clusters whose merger produces the smallest increase in total within-cluster variance. *This is most similar to K-Means.*
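
To make the differences concrete, the sketch below runs each linkage method on the same four made-up 1-D points (an assumption for illustration) and compares the height of the final merge, where the cluster {0, 1} joins the cluster {10, 12}.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Made-up 1-D observations forming two clusters: {0, 1} and {10, 12}
points = np.array([[0.0], [1.0], [10.0], [12.0]])

# Height of the last merge under each linkage method
final_height = {}
for method in ['single', 'complete', 'average', 'centroid', 'ward']:
    Z = linkage(points, method=method)
    final_height[method] = Z[-1, 2]

for method, height in final_height.items():
    print(f'{method:>8}: {height:.3f}')
# single   = 9.0   (closest pair: 1 and 10)
# complete = 12.0  (farthest pair: 0 and 12)
# average  = 10.5  (mean of the four pairwise distances 10, 12, 9, 11)
# centroid = 10.5  (distance between the centroids 0.5 and 11)
# ward     = 14.849 (centroid distance scaled by the cluster sizes)
```

Because the heights differ by method, a cut threshold chosen for one linkage (such as 42 for complete linkage) does not carry over to another.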

In [None]:
#Code Block 28

mergings_w = linkage(df_ap_100, method='ward')

In [None]:
#Code Block 29

plt.figure(figsize=(20,10))
plt.title('Hierarchical Clustering Dendrogram (Ward)')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
    mergings_w,
    truncate_mode='lastp',  # show only the last p merged clusters
    p=30,  # number of merged clusters to show
    show_leaf_counts=True,  # numbers in brackets are the observation counts in each truncated leaf
    leaf_rotation=0.,
    leaf_font_size=12.,
    show_contracted=True,  # to get a distribution impression in truncated branches
    orientation='top' # root at the top, leaves along the x-axis
)
plt.show()

In [None]:
#Code Block 30

df_ap_pred_w = fcluster(mergings_w, 78, criterion='distance')
df_ap_pred_w = pd.DataFrame(df_ap_pred_w)
df_ap_pred_w.columns = ['Predict_ward']
df_ap_pred_w.head()

In [None]:
#Code Block 31

df_ap_pred_w['Predict_ward'].value_counts()

### Crosstab analysis of groups 

In [None]:
#Code Block 32

df_ap_100_demo.info()

In [None]:
#Code Block 33

df_ap_100_demo = pd.concat([df_ap_100_demo, df_ap_pred_w], axis=1)

In [None]:
#Code Block 34

pd.crosstab(df_ap_100_demo['Predict_4'], df_ap_100_demo['Predict_h_42'])

In [None]:
#Code Block 35

pd.crosstab(df_ap_100_demo['Predict_4'], df_ap_100_demo['Predict_ward'])

In [None]:
#Code Block 36

pd.crosstab(df_ap_100_demo['Predict'], df_ap_100_demo['Predict_n'])
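
A crosstab shows where two clusterings agree, but cluster numbers are arbitrary, so agreement does not have to fall on the diagonal. One possible single-number summary, offered here as a suggestion rather than part of the original analysis, is scikit-learn's adjusted Rand index. The labels below are made up for illustration.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Made-up labels for six observations from three hypothetical clusterings
labels_a = [1, 1, 2, 2, 3, 3]
labels_b = [2, 2, 1, 1, 3, 3]  # same grouping as labels_a, different label names
labels_c = [1, 2, 1, 2, 1, 2]  # a different grouping

# The crosstab has one non-zero cell per row: the groupings match
print(pd.crosstab(pd.Series(labels_a, name='a'), pd.Series(labels_b, name='b')))

# The adjusted Rand index ignores label names: 1.0 means identical groupings,
# values near 0 (or below) mean agreement no better than chance
ari_same = adjusted_rand_score(labels_a, labels_b)
ari_diff = adjusted_rand_score(labels_a, labels_c)
print(ari_same, ari_diff)
```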

<h2 style="color:blue;">Hierarchical cluster analysis with larger sample</h2> 

In [None]:
#Code Block 37

df_ap.info()

In [None]:
#Code Block 38

df_ap_cluster = df_ap.iloc[:, 1:17]
df_ap_cluster.info()

In [None]:
%%time

#Code Block 39

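# linkage works from all pairwise distances, so time and memory grow roughly
# with the square of the number of rows; expect this cell to be slow on the full data
mergings_all = linkage(df_ap_cluster, method='complete')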

In [None]:
%%time

#Code Block 40

plt.figure(figsize=(20,10))
dendrogram(mergings_all, leaf_rotation=0, leaf_font_size=8)
plt.show()

In [None]:
%%time

#Code Block 41

df_ap_pred_all = fcluster(mergings_all, 55, criterion='distance')
# Reuse df_ap's index so the labels line up row-for-row when concatenated later
df_ap_pred_all = pd.DataFrame(df_ap_pred_all, columns=['Pred_h_55'], index=df_ap_cluster.index)
df_ap_pred_all.head()

In [None]:
#Code Block 42

df_ap_pred_all['Pred_h_55'].value_counts()

<h2 style="color:blue;">Normalize the data and conduct hierarchical clustering</h2> 

In [None]:
#Code Block 43

from sklearn import preprocessing

# Normalizer rescales each row (observation) to unit L2 norm; it does not scale columns
n_scaler = preprocessing.Normalizer()

In [None]:
#Code Block 44

df_ap_n = n_scaler.fit_transform(df_ap_cluster)
# Take column names from the frame that was transformed, so they always match
df_ap_n = pd.DataFrame(df_ap_n, columns=df_ap_cluster.columns)
df_ap_n.head()
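
It is worth being clear about what `Normalizer` does, since the name is easy to confuse with column-wise scaling. The sketch below uses a made-up 2x2 array to contrast it with `StandardScaler`; the comparison is an illustration, not part of the original analysis.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Two made-up observations with the same proportions but different magnitudes
X = np.array([[3.0, 4.0],
              [30.0, 40.0]])

# Normalizer works row by row: each observation is divided by its own L2 norm,
# so both rows become [0.6, 0.8] and the difference in magnitude disappears
row_scaled = Normalizer().fit_transform(X)

# StandardScaler works column by column: each feature gets mean 0 and unit variance,
# so the two observations stay distinct ([-1, -1] and [1, 1])
col_scaled = StandardScaler().fit_transform(X)

print(row_scaled)
print(col_scaled)
```

After row-wise normalization, clusters reflect the relative pattern of values within each observation rather than their overall size, which is why the distance threshold used for the normalized data is so much smaller.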

In [None]:
%%time

#Code Block 45

mergings_n = linkage(df_ap_n, method='complete')

In [None]:
%%time

#Code Block 46

plt.figure(figsize=(20,10))
dendrogram(mergings_n, leaf_rotation=0, leaf_font_size=8)
plt.show()

In [None]:
%%time

#Code Block 47

df_ap_pred_n = fcluster(mergings_n, 1.2, criterion='distance')
# Reuse df_ap's index so the labels line up row-for-row when concatenated later
df_ap_pred_n = pd.DataFrame(df_ap_pred_n, columns=['Pred_h_n'], index=df_ap_cluster.index)
display(df_ap_pred_n.head())
df_ap_pred_n['Pred_h_n'].value_counts()

In [None]:
#Code Block 48

df_ap = pd.concat([df_ap, df_ap_pred_n, df_ap_pred_all], axis=1)
df_ap.head()

In [None]:
#Code Block 49

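# numeric_only=True skips text columns, which raise an error in mean() on pandas 2.x
round(df_ap.groupby('Pred_h_n').mean(numeric_only=True).T, 2)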

In [None]:
#Code Block 50

df_ap.info()

In [None]:
#Code Block 51

pd.crosstab(df_ap['Pred_h_n'], df_ap['Pred_h_55'])

In [None]:
#Code Block 52

pd.crosstab(df_ap['Predict_4'], df_ap['Pred_h_55'])

In [None]:
#Code Block 53

pd.crosstab(df_ap['Househoold_Size'], df_ap['Pred_h_55'])