## Python Clustering

This notebook runs k-means clustering on a set of student survey responses. It uses `clustering_code.py`, a file of functions taken from Joel Grus's _Data Science from Scratch_. I recommend reading chapter 19 of that book as you work through this.

In [1]:
import clustering_code
from collections import defaultdict
from pprint import pprint

input_file = "survey_responses.txt"

Clustering Code Loaded


Now let's read in the data and store it in a `defaultdict` keyed by student.

In [6]:
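student_data = defaultdict(list)
with open(input_file,'r') as ifile :
    next(ifile) # skip the first line of the file
    for row in ifile : # a file object can be iterated line by line
        row = row.strip().split("\t")
        this_student = row[1] # the student's name is the dictionary key
        student_data[this_student] = row[2:] # the remaining columns are the data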


Clustering needs numerical data, so we'll convert the Yes/No responses to 1/0.

In [7]:
# Let's change No to 0 and Yes to 1, so everything is numerical
for student in student_data :
    this_data = student_data[student] # get the list of data

    for idx, item in enumerate(this_data) : # iterate over the list (and its index)
        if item == "No" :
            this_data[idx] = 0 # change the "No" spot to 0
        elif item == "Yes" :
            this_data[idx] = 1 # change the "Yes" spot to 1

    # overwrite the old list with the new one, converting every entry to a float
    student_data[student] = [float(item) for item in this_data]
            

In [8]:
# Let's just print the data so it's easier to see
pprint(student_data)

defaultdict(<class 'list'>,
            {'Arave': [249.0, 249.0, 7.0, 0.0, 0.0, 2.0],
             'Berens': [929.0, 5.0, 7.0, 0.0, 0.0, 2.0],
             'Chandler': [2169.0, 2169.0, 10.0, 0.0, 0.0, 2.0],
             'Dezihan': [1166.0, 210.0, 5.0, 1.0, 1.0, 4.0],
             'Diehl': [4568.0, 6.0, 4.0, 1.0, 1.0, 3.0],
             'Flesch': [114.0, 222.0, 5.0, 1.0, 1.0, 5.0],
             'Freyn': [1600.0, 1600.0, 4.0, 0.0, 1.0, 4.0],
             'Grant': [271.0, 268.0, 8.0, 0.0, 1.0, 2.0],
             'Hansen': [625.0, 625.0, 10.0, 0.0, 0.0, 2.0],
             'Harper': [115.0, 115.0, 5.0, 1.0, 1.0, 2.0],
             'Jambor': [391.0, 92.0, 5.0, 1.0, 1.0, 1.0],
             'Kassner': [1743.0, 5.0, 5.0, 0.0, 0.0, 3.0],
             'Khormali': [6600.0, 6600.0, 10.0, 0.0, 0.0, 2.0],
             'Kolberg': [2132.0, 2.0, 5.0, 0.0, 0.0, 3.0],
             'Layton': [128.0, 147.0, 7.0, 1.0, 1.0, 3.0],
             'Makris': [187.0, 191.0, 5.0, 1.0, 1.0, 4.0],
             'Marbut'

In [9]:
# Now, let's explore some clusters. Try different values of
# k and see what emerges

k = 3
assignments, means = clustering_code.train_dict(student_data, k)


# Sort the (student, cluster) pairs by cluster number.
# The sort is stable, so students keep their original order within a cluster.
s_assign = sorted(assignments.items(), key=lambda pair : pair[1])
print( str(k) + "-means:")
for student, cluster in s_assign :
    print(str(cluster) + " : " + student)

print(means)
    

3-means:
0 : Chandler
0 : Diehl
0 : Freyn
0 : Kolberg
0 : Wiener
0 : Spoja
1 : Nakajima
1 : Zor
1 : Khormali
1 : Yang
2 : Hansen
2 : Persico
2 : Sliwinski
2 : Harper
2 : Milligan
2 : Kassner
2 : curnow
2 : Primm
2 : Flesch
2 : Murphy
2 : Jambor
2 : Makris
2 : Arave
2 : Grant
2 : Sicheri
2 : Norman
2 : Dezihan
2 : Berens
2 : Marbut
2 : Ray
2 : Murray
2 : Layton
[[2878.333333333333, 974.3333333333333, 5.833333333333333, 0.16666666666666666, 0.3333333333333333, 3.0], [6005.75, 6005.75, 6.5, 0.25, 0.5, 2.0], [625.0454545454546, 384.04545454545456, 5.954545454545455, 0.4545454545454546, 0.7727272727272727, 2.8181818181818183]]
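
For readers curious about what a call like `clustering_code.train_dict(student_data, k)` might be doing, here is a minimal k-means sketch. It is not the code in `clustering_code.py`: the name `kmeans_dict`, its parameters, and the toy data are hypothetical, and it assumes squared Euclidean distance with randomly chosen starting points.

```python
import random

def squared_distance(v, w):
    """Sum of squared coordinate differences between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(v, w))

def vector_mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def kmeans_dict(data, k, seed=0, max_iter=100):
    """Hypothetical stand-in for a dict-based k-means trainer.

    data maps a label to a numeric vector. Returns (assignments, means),
    where assignments maps each label to a cluster index.
    """
    rng = random.Random(seed)
    labels = list(data)
    # Start from k distinct data points chosen at random
    means = [list(data[label]) for label in rng.sample(labels, k)]
    assignments = {}

    for _ in range(max_iter):
        # Assign each point to its nearest mean
        new_assignments = {
            label: min(range(k),
                       key=lambda i: squared_distance(data[label], means[i]))
            for label in labels
        }
        if new_assignments == assignments:
            break  # no point changed cluster, so we have converged
        assignments = new_assignments

        # Move each mean to the centre of the points assigned to it
        for i in range(k):
            members = [data[label] for label in labels if assignments[label] == i]
            if members:  # leave the mean in place if its cluster is empty
                means[i] = vector_mean(members)

    return assignments, means

# Toy data (made up for illustration): two obvious groups
toy = {"a": [0.0, 0.0], "b": [0.0, 1.0], "c": [10.0, 10.0], "d": [10.0, 11.0]}
toy_assignments, toy_means = kmeans_dict(toy, 2)
print(toy_assignments)
print(toy_means)
```

Which group ends up labelled 0 and which 1 depends on the random starting points, so cluster numbers are only labels and can change from run to run.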


In [10]:
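# The raw distances run into the thousands and would swamp the other columns,
# so let's re-scale the two mileage columns to the range 0 - 1. One shared
# min and max is used so both columns stay on the same scale.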
miles = []
for student, vec in student_data.items() :
    miles.append(vec[0])
    miles.append(vec[1])

max_miles = max(miles)
min_miles = min(miles)

for student, vec in student_data.items() :
    vec[0] = (vec[0] - min_miles)/(max_miles - min_miles)    
    vec[1] = (vec[1] - min_miles)/(max_miles - min_miles)    
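
The rescaling above is min-max scaling: each value becomes `(x - min) / (max - min)`. As a sketch, the same formula can be wrapped in a small helper and applied to any single column using that column's own minimum and maximum (unlike the cell above, which shares one min and max across both mileage columns). The names `min_max_scale` and `rescale_column` and the example vectors are hypothetical.

```python
def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical; avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def rescale_column(data, col):
    """Rescale column `col` of every vector in a {label: vector} dict, in place."""
    labels = list(data)
    scaled = min_max_scale([data[label][col] for label in labels])
    for label, value in zip(labels, scaled):
        data[label][col] = value

# Made-up example vectors with two columns on very different scales
toy = {"a": [100.0, 4.0], "b": [600.0, 7.0], "c": [1100.0, 10.0]}
rescale_column(toy, 0)
rescale_column(toy, 1)
print(toy)
```

After both calls every column of `toy` lies between 0 and 1, so no single column dominates a distance calculation purely because of its units.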



In [11]:
# Let's make a function that prints the means in a nice way.

def pprint_means(the_means) :
    var_labels = ["Birth Dist","Age 15 Dist",
                  "Post-Secondary","Mkt Major",
                  "Biz Major","HH Size"]
    for idx, cluster_mean in enumerate(the_means) :
        print("--- Printing Cluster " + str(idx) + " ---")
        
        for idx2, item in enumerate(cluster_mean) :
            print(": ".join([var_labels[idx2],str(round(item,2))]))

        print("----------------------\n")
            

In [12]:
k = 5
assignments, means = clustering_code.train_dict(student_data, k)

# Sort the (student, cluster) pairs by cluster number.
# The sort is stable, so students keep their original order within a cluster.
s_assign = sorted(assignments.items(), key=lambda pair : pair[1])
print( str(k) + "-means:")
for student, cluster in s_assign :
    print(str(cluster) + " : " + student)



5-means:
0 : Harper
0 : curnow
0 : Primm
0 : Jambor
0 : Norman
0 : Yang
1 : Diehl
1 : Nakajima
2 : Sliwinski
2 : Freyn
2 : Milligan
2 : Kassner
2 : Kolberg
2 : Flesch
2 : Murphy
2 : Makris
2 : Sicheri
2 : Dezihan
2 : Ray
2 : Murray
2 : Spoja
3 : Persico
3 : Zor
3 : Wiener
3 : Arave
3 : Grant
3 : Berens
3 : Marbut
3 : Layton
4 : Chandler
4 : Hansen
4 : Khormali
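
A long flat listing like this can be hard to scan. One possible alternative, sketched here, is to invert the assignments into a dictionary of cluster number to member list. The helper name `group_by_cluster` is hypothetical, and the example dictionary copies only clusters 1 and 4 from the output above.

```python
from collections import defaultdict

def group_by_cluster(assignments):
    """Turn {student: cluster} into {cluster: [students]}, sorted both ways."""
    clusters = defaultdict(list)
    for student, cluster in assignments.items():
        clusters[cluster].append(student)
    return {cluster: sorted(members)
            for cluster, members in sorted(clusters.items())}

# A small slice of the 5-means assignments shown above
example = {"Diehl": 1, "Nakajima": 1, "Chandler": 4, "Hansen": 4, "Khormali": 4}
grouped = group_by_cluster(example)
for cluster, members in grouped.items():
    print(str(cluster) + " (" + str(len(members)) + "): " + ", ".join(members))
```

Printing the size of each cluster next to its members makes very small clusters, such as the two-student cluster 1, easy to spot.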


In [13]:
pprint_means(means)

--- Printing Cluster 0 ---
Birth Dist: 0.2
Age 15 Dist: 0.18
Post-Secondary: 5.17
Mkt Major: 0.67
Biz Major: 1.0
HH Size: 1.67
----------------------

--- Printing Cluster 1 ---
Birth Dist: 0.74
Age 15 Dist: 0.39
Post-Secondary: 4.0
Mkt Major: 1.0
Biz Major: 1.0
HH Size: 2.5
----------------------

--- Printing Cluster 2 ---
Birth Dist: 0.18
Age 15 Dist: 0.08
Post-Secondary: 5.0
Mkt Major: 0.38
Biz Major: 0.77
HH Size: 3.77
----------------------

--- Printing Cluster 3 ---
Birth Dist: 0.23
Age 15 Dist: 0.19
Post-Secondary: 7.25
Mkt Major: 0.12
Biz Major: 0.38
HH Size: 2.25
----------------------

--- Printing Cluster 4 ---
Birth Dist: 0.47
Age 15 Dist: 0.47
Post-Secondary: 10.0
Mkt Major: 0.0
Biz Major: 0.0
HH Size: 2.0
----------------------
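
When trying different values of k, it can help to have a single number to compare runs. One common choice is the total squared error: the sum, over all points, of the squared distance from each point to its cluster mean. The sketch below assumes `assignments` maps each label to a cluster index and `means` is a list indexed by cluster, matching the shapes printed earlier in this notebook. The function name and the toy data are hypothetical.

```python
def squared_distance(v, w):
    """Sum of squared coordinate differences between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(v, w))

def total_squared_error(data, assignments, means):
    """Sum of squared distances from each point to its assigned cluster mean."""
    return sum(squared_distance(vec, means[assignments[label]])
               for label, vec in data.items())

# Toy data (made up for illustration) with hand-computed clusterings
toy = {"a": [0.0, 0.0], "b": [0.0, 1.0], "c": [10.0, 10.0], "d": [10.0, 11.0]}

# k = 1: every point belongs to one cluster centred on the overall mean
error_k1 = total_squared_error(toy, {label: 0 for label in toy}, [[5.0, 5.5]])

# k = 2: the two obvious groups, each with its own mean
error_k2 = total_squared_error(toy,
                               {"a": 0, "b": 0, "c": 1, "d": 1},
                               [[0.0, 0.5], [10.0, 10.5]])

print(error_k1, error_k2)
```

The error always shrinks as k grows, so the useful signal is where the improvement levels off, not the smallest value.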

