# The Data

The data has 777 observations on the following 18 variables.
* Private: Whether the university is private or public. A factor with levels No and Yes in the original data; in this CSV it appears as 0 and 1
* Apps: Number of applications received
* Accept: Number of applications accepted
* Enroll: Number of new students enrolled
* Top10perc: Percentage of new students from the top 10% of their high school class
* Top25perc: Percentage of new students from the top 25% of their high school class
* F.Undergrad: Number of full-time undergraduates
* P.Undergrad: Number of part-time undergraduates
* Outstate: Out-of-state tuition
* Room.Board: Room and board costs
* Books: Estimated book costs
* Personal: Estimated personal spending
* PhD: Percentage of faculty with Ph.D.s
* Terminal: Percentage of faculty with a terminal degree
* S.F.Ratio: Student/faculty ratio
* perc.alumni: Percentage of alumni who donate
* Expend: Instructional expenditure per student
* Grad.Rate: Graduation rate

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.cluster.hierarchy import dendrogram, linkage
%matplotlib inline


## Read in the data

In [2]:
data = pd.read_csv('College_Data.csv')

In [3]:
data.head()

| | School | Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abilene Christian University | 1 | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | 3300 | 450 | 2200 | 70 | 78 | 18.1 | 12 | 7041 | 60 |
| 1 | Adelphi University | 1 | 2186 | 1924 | 512 | 16 | 29 | 2683 | 1227 | 12280 | 6450 | 750 | 1500 | 29 | 30 | 12.2 | 16 | 10527 | 56 |
| 2 | Adrian College | 1 | 1428 | 1097 | 336 | 22 | 50 | 1036 | 99 | 11250 | 3750 | 400 | 1165 | 53 | 66 | 12.9 | 30 | 8735 | 54 |
| 3 | Agnes Scott College | 1 | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 | 5450 | 450 | 875 | 92 | 97 | 7.7 | 37 | 19016 | 59 |
| 4 | Alaska Pacific University | 1 | 193 | 146 | 55 | 16 | 44 | 249 | 869 | 7560 | 4120 | 800 | 1500 | 76 | 72 | 11.9 | 2 | 10922 | 15 |


In [4]:
data.shape

(777, 19)

In [5]:
data.describe()

| | Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 777.0 | 777.0 | 777.0 | 777.0 | 777.0 | 777.0 | 777.0 | 777.0 | 777.0 | 777.0 | 777.0 | 777.0 | 777.0 | 777.0 | 777.0 | 777.0 | 777.0 | 777.0 |
| mean | 0.727156 | 3001.638353 | 2018.804376 | 779.972973 | 27.558559 | 55.796654 | 3699.907336 | 855.298584 | 10440.669241 | 4357.526384 | 549.380952 | 1340.642214 | 72.660232 | 79.702703 | 14.089704 | 22.743887 | 9660.171171 | 65.46332 |
| std | 0.445708 | 3870.201484 | 2451.113971 | 929.17619 | 17.640364 | 19.804778 | 4850.420531 | 1522.431887 | 4023.016484 | 1096.696416 | 165.10536 | 677.071454 | 16.328155 | 14.722359 | 3.958349 | 12.391801 | 5221.76844 | 17.17771 |
| min | 0.0 | 81.0 | 72.0 | 35.0 | 1.0 | 9.0 | 139.0 | 1.0 | 2340.0 | 1780.0 | 96.0 | 250.0 | 8.0 | 24.0 | 2.5 | 0.0 | 3186.0 | 10.0 |
| 25% | 0.0 | 776.0 | 604.0 | 242.0 | 15.0 | 41.0 | 992.0 | 95.0 | 7320.0 | 3597.0 | 470.0 | 850.0 | 62.0 | 71.0 | 11.5 | 13.0 | 6751.0 | 53.0 |
| 50% | 1.0 | 1558.0 | 1110.0 | 434.0 | 23.0 | 54.0 | 1707.0 | 353.0 | 9990.0 | 4200.0 | 500.0 | 1200.0 | 75.0 | 82.0 | 13.6 | 21.0 | 8377.0 | 65.0 |
| 75% | 1.0 | 3624.0 | 2424.0 | 902.0 | 35.0 | 69.0 | 4005.0 | 967.0 | 12925.0 | 5050.0 | 600.0 | 1700.0 | 85.0 | 92.0 | 16.5 | 31.0 | 10830.0 | 78.0 |
| max | 1.0 | 48094.0 | 26330.0 | 6392.0 | 96.0 | 100.0 | 31643.0 | 21836.0 | 21700.0 | 8124.0 | 2340.0 | 6800.0 | 103.0 | 100.0 | 39.8 | 64.0 | 56233.0 | 118.0 |


## Hierarchical Clustering 

sklearn does not provide a built-in function for plotting dendrograms, so we use scipy instead: its linkage function generates the linkage matrix, and its dendrogram function plots the dendrogram from that matrix.

linkage(y, method): 
- Parameters:
    - y: first argument, the observations to cluster (here, the dataframe of feature variables)
    - method: linkage method, which can be 'single', 'complete', 'average', 'centroid', 'ward'...
    
dendrogram(Z, color_threshold, truncate_mode, p): 
- Parameters: 
    - Z: first argument (linkage matrix created by linkage())
    - color_threshold: set to 0 to have the same color for all branches; omit it to color-code branches using scipy's default distance threshold of 0.7 * max(Z[:, 2]), so that each subtree merging below the threshold gets its own color and links above the threshold share the default color
    - truncate_mode: can use 'lastp' to see only the last p clusters merged
    - p: number of last merged clusters to show when truncate_mode='lastp' (e.g. p=50 shows the last 50 clusters merged)

Documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.linkage.html
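
To make the two calls described above concrete, here is a minimal sketch on a small synthetic dataset. The array `toy` and the names `toy_linkage` and `toy_tree` are hypothetical and not part of the notebook. It passes `no_plot=True` so that `dendrogram` returns its layout dictionary without drawing anything.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical toy data: two tight groups of five 2-D points each.
rng = np.random.default_rng(0)
toy = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(5, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(5, 2)),
])

# Each row of the linkage matrix records one merge:
# [index of cluster a, index of cluster b, merge distance, size of new cluster].
toy_linkage = linkage(toy, method='ward')

# no_plot=True skips drawing and only returns the dendrogram layout as a dict.
toy_tree = dendrogram(toy_linkage, color_threshold=0, no_plot=True)

print(toy_linkage.shape)     # n - 1 merges, 4 columns
print(len(toy_tree['ivl']))  # one leaf label per observation
```

With n observations the linkage matrix has n - 1 rows, one per merge, and the last row describes the final merge that joins all points into a single cluster.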

In [6]:
X = data.drop(columns=['School', 'Private', 'S.F.Ratio', 'Grad.Rate'])

X.shape

X


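| | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | perc.alumni | Expend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | 3300 | 450 | 2200 | 70 | 78 | 12 | 7041 |
| 1 | 2186 | 1924 | 512 | 16 | 29 | 2683 | 1227 | 12280 | 6450 | 750 | 1500 | 29 | 30 | 16 | 10527 |
| 2 | 1428 | 1097 | 336 | 22 | 50 | 1036 | 99 | 11250 | 3750 | 400 | 1165 | 53 | 66 | 30 | 8735 |
| 3 | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 | 5450 | 450 | 875 | 92 | 97 | 37 | 19016 |
| 4 | 193 | 146 | 55 | 16 | 44 | 249 | 869 | 7560 | 4120 | 800 | 1500 | 76 | 72 | 2 | 10922 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 772 | 2197 | 1515 | 543 | 4 | 26 | 3089 | 2029 | 6797 | 3900 | 500 | 1200 | 60 | 60 | 14 | 4469 |
| 773 | 1959 | 1805 | 695 | 24 | 47 | 2849 | 1107 | 11520 | 4960 | 600 | 1250 | 73 | 75 | 31 | 9189 |
| 774 | 2097 | 1915 | 695 | 34 | 61 | 2793 | 166 | 6900 | 4200 | 617 | 781 | 67 | 75 | 20 | 8323 |
| 775 | 10705 | 2453 | 1317 | 95 | 99 | 5217 | 83 | 19840 | 6510 | 630 | 2115 | 96 | 96 | 49 | 40386 |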
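
Ward linkage in scipy works on Euclidean distances, so columns with large values (such as Apps or Expend) weigh far more heavily than percentage columns (such as Top10perc). The notebook clusters the raw values. The sketch below shows one optional alternative, z-scoring each column first, using three columns of the first five rows shown above. The names `toy`, `toy_scaled`, `linkage_raw` and `linkage_scaled` are hypothetical, and whether standardizing is appropriate depends on the analysis.

```python
import pandas as pd
from scipy.cluster.hierarchy import linkage

# Three columns from the first five rows of the data shown above.
toy = pd.DataFrame({
    'Apps': [1660, 2186, 1428, 417, 193],
    'Top10perc': [23, 16, 22, 60, 16],
    'Expend': [7041, 10527, 8735, 19016, 10922],
})

# z-score each column: subtract the column mean, divide by the column
# standard deviation (pandas uses the sample standard deviation, ddof=1).
toy_scaled = (toy - toy.mean()) / toy.std()

# Ward linkage on raw versus standardized features; the merge order can differ
# because raw distances are dominated by the large-valued columns.
linkage_raw = linkage(toy.to_numpy(), method='ward')
linkage_scaled = linkage(toy_scaled.to_numpy(), method='ward')

print(toy_scaled.round(2))
```

After scaling, every column has mean 0 and standard deviation 1, so each feature contributes on a comparable scale to the distances that Ward's method uses.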


In [None]:
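# Build the linkage matrix using Ward's method
linkage_matrix = linkage(X, method='ward')

figure = plt.figure(figsize=(15,5))

# color_threshold=0 draws all branches in the same color
dendrogram(linkage_matrix, color_threshold=0)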

plt.title('Hierarchical Clustering Dendrogram (Ward)')
plt.xlabel('Data Points')
plt.ylabel('Clustering Criterion')
plt.tight_layout()
plt.savefig('dendrogram1.png') 
plt.show()


In [None]:
linkage_matrix = linkage(X, method='ward')

figure = plt.figure(figsize=(15,5))

# Show only the last 50 merged clusters to keep the plot readable
dendrogram(linkage_matrix, color_threshold=0,
           truncate_mode='lastp', p=50)

plt.title('Hierarchical Clustering Dendrogram (Ward)')
plt.xlabel('Data Points')
plt.ylabel('Clustering Criterion')
plt.tight_layout()
plt.savefig('dendrogram2.png')
plt.show()
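
To see what truncate_mode='lastp' does to the leaves, the following sketch runs dendrogram with no_plot=True on hypothetical random data and inspects the returned leaf labels. The names `points`, `Z`, `truncated` and `leaf_size` are illustrative and not part of the notebook. With scipy's default settings, a leaf that stands for a collapsed cluster is labeled with its observation count in parentheses.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical data: 20 random observations with 3 features.
rng = np.random.default_rng(1)
points = rng.normal(size=(20, 3))
Z = linkage(points, method='ward')

# Keep only the last p merges; everything below them is collapsed into leaves.
p = 5
truncated = dendrogram(Z, truncate_mode='lastp', p=p, no_plot=True)
leaf_labels = truncated['ivl']


def leaf_size(label):
    """Number of observations behind a leaf label: '(k)' means k, else 1."""
    return int(label.strip('()')) if label.startswith('(') else 1


sizes = [leaf_size(label) for label in leaf_labels]
print(leaf_labels)  # p labels, one per remaining leaf
print(sum(sizes))   # the counts add up to the number of observations
```

In the truncated plot above, each of the 50 leaves on the x-axis therefore represents either a single school or a whole cluster of schools.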
