## Python Clustering

This notebook clusters student survey responses using `clustering_code.py`, a file of functions taken from Joel Grus's _Data Science from Scratch_. I recommend reading chapter 19 of that book as you work through this.

In [1]:
import clustering_code
from collections import defaultdict
from pprint import pprint

input_file = "survey_responses.txt"

Clustering Code Loaded


Now let's read in the data and store it in a `defaultdict` keyed by student name.

In [3]:
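student_data = defaultdict(list)
with open(input_file,'r') as ifile :
    next(ifile) # skip the first line (assumed to be the header row)
    for row in ifile : # a file object yields one line at a time
        row = row.strip().split("\t")
        this_student = row[1] # the second column identifies the student
        student_data[this_student] = row[2:] # the remaining columns are the responses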


Clustering needs numerical data, so we'll convert the Yes/No responses to 1/0 and make every value a float.

In [4]:
# Let's change No to 0 and Yes to 1, so everything is numerical
for student in student_data :
    this_data = student_data[student] # get the list of data 
    
    for idx, item in enumerate(this_data) : # iterate over the list (and its index)
        if item == "No" :
            this_data[idx] = 0 # change the "No" spot to 0
        elif item == "Yes" :
            this_data[idx] = 1 # change the "Yes" spot to 1 
            
    # overwrite the old list with the new one. Also make everything numeric
    student_data[student] = [float(item) for item in this_data]
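
The same conversion can also be sketched as a small helper that sends each response through a lookup table and returns a new list instead of editing the old one in place. This is only an illustration: `to_numeric` and `YES_NO` are hypothetical names that do not come from `clustering_code.py`, and the sample row is made up in the same shape as the survey data.

```python
# Hypothetical helper for illustration; not part of clustering_code.py.
YES_NO = {"No": 0.0, "Yes": 1.0}

def to_numeric(responses):
    """Return a new list of floats, mapping "Yes"/"No" to 1.0/0.0."""
    return [YES_NO[item] if item in YES_NO else float(item)
            for item in responses]

# Made-up row in the same shape as one student's responses
example = to_numeric(["700", "700", "5", "No", "Yes", "2"])
print(example)  # [700.0, 700.0, 5.0, 0.0, 1.0, 2.0]
```

Building a new list avoids the intermediate state where one list holds both strings and integers.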
            

In [5]:
# Let's just print the data so it's easier to see
pprint(student_data)

defaultdict(<class 'list'>,
            {'Albee': [1012.0, 1023.0, 6.0, 0.0, 0.0, 3.0],
             'Anderson': [700.0, 700.0, 5.0, 0.0, 1.0, 2.0],
             'Bankston': [1273.0, 45.0, 6.0, 1.0, 1.0, 2.0],
             'Barr': [1982.0, 5.0, 7.0, 0.0, 0.0, 1.0],
             'Bernhard': [4505.0, 4514.0, 5.0, 0.0, 0.0, 2.0],
             'Bosco Louis': [8279.0, 8279.0, 5.0, 0.0, 1.0, 1.0],
             'Bu': [6756.0, 6756.0, 12.0, 0.0, 0.0, 2.0],
             'Campestre': [8402.0, 2322.0, 6.0, 0.0, 0.0, 1.0],
             'Cerkovnik': [272.0, 272.0, 5.0, 0.0, 1.0, 2.0],
             'Connor': [1616.0, 1606.0, 5.0, 0.0, 0.0, 2.0],
             'Daly': [731.0, 2993.0, 7.0, 0.0, 0.0, 1.0],
             'Danicich': [478.0, 5.0, 5.0, 1.0, 1.0, 2.0],
             'Dauenhauer': [3.0, 1.0, 5.0, 0.0, 1.0, 3.0],
             'Engellant': [4814.0, 92.8, 4.5, 0.0, 0.0, 2.0],
             'Fanok': [2303.0, 2303.0, 5.0, 0.0, 1.0, 2.0],
             'Fowler': [1500.0, 1500.0, 8.0, 0.0, 0.0, 2.0],
 

In [6]:
# Now, let's explore some clusters. Try different values of
# k and see what emerges

k = 3
assignments, means = clustering_code.train_dict(student_data, k)


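# Sorted version: order students by cluster label so each cluster prints together.
# The loop variable is named `student` so it does not shadow the cluster count k.
s_assign = ( (student, assignments[student]) for student in sorted(assignments, key=assignments.get, reverse=False))
print( str(k) + "-means:")
for student, cluster in s_assign :
    print(str(cluster) + " : " + student)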

print(means)
    

3-means:
0 : Schwartz
0 : Thompson
0 : Bankston
0 : Barr
0 : Gabrielsen
0 : Leonard
0 : Martin
0 : Cerkovnik
0 : McNea
0 : Runkel
0 : Toepke
0 : Stokes
0 : Danicich
0 : Hendricks
0 : Anderson
0 : Moore
0 : Howell
0 : Hauer
0 : Scheibel
0 : Stahlberg
0 : Hoffman
0 : Phillips
0 : Joyner
0 : Hettinger
0 : Albee
0 : Dauenhauer
0 : Wagers
0 : Whattam 
0 : Keith
0 : Paul
0 : Woods
0 : Niekamp
0 : Parrent
0 : Haefele
1 : Campestre
1 : Bernhard
1 : Bu
1 : Bosco Louis
1 : Ghazouani
2 : Williams
2 : Fowler
2 : Gilbert
2 : Connor
2 : Nelson
2 : Knowlton
2 : Fanok
2 : Halderman
2 : Robertson
2 : Engellant
2 : Daly
[[603.1970588235295, 329.3382352941176, 6.764705882352941, 0.20588235294117646, 0.6470588235294118, 2.441176470588235], [6728.400000000001, 5514.200000000001, 7.0, 0.0, 0.2, 2.0], [2141.818181818182, 1970.3454545454545, 7.772727272727273, 0.09090909090909091, 0.2727272727272727, 2.272727272727273]]
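
Since `clustering_code.py` is not shown here, the following is a minimal sketch of what a function like `train_dict` might be doing, assuming it runs standard k-means on a dictionary of vectors. `kmeans_dict`, `squared_distance`, and `vector_mean` are hypothetical names, and the toy data is made up. If the real function also compares points by squared Euclidean distance, columns with large values (here the two distance columns) will contribute most to each comparison.

```python
import random

def squared_distance(v, w):
    """Sum of squared coordinate differences between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(v, w))

def vector_mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans_dict(data, k, seed=0, max_iter=100):
    """Hypothetical stand-in for train_dict.

    data maps a name to a list of floats. Returns (assignments, means),
    where assignments maps each name to a cluster index in range(k).
    """
    rng = random.Random(seed)
    names = list(data)
    # Start from k distinct data points chosen at random
    means = [list(data[name]) for name in rng.sample(names, k)]
    assignments = {}
    for _ in range(max_iter):
        # Assign every point to its closest current mean
        new_assignments = {
            name: min(range(k),
                      key=lambda i: squared_distance(data[name], means[i]))
            for name in names
        }
        if new_assignments == assignments:
            break  # nothing moved, so the clustering has converged
        assignments = new_assignments
        # Recompute each mean from its members (empty clusters keep their mean)
        for i in range(k):
            members = [data[n] for n in names if assignments[n] == i]
            if members:
                means[i] = vector_mean(members)
    return assignments, means

# Made-up data: two obvious groups
toy = {"a": [0.0, 0.0], "b": [0.0, 1.0], "c": [10.0, 10.0], "d": [10.0, 11.0]}
toy_assignments, toy_means = kmeans_dict(toy, 2)
print(toy_assignments)
print(toy_means)
```

Which group gets label 0 depends on the random starting points, so only the grouping itself is meaningful, not the label numbers.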


In [7]:
# Let's re-scale the two mileage columns so that they're in the range of 0 - 1.
# Both columns share one min and one max, so they stay comparable to each other.
miles = []
for student, vec in student_data.items() :
    miles.append(vec[0])
    miles.append(vec[1])

max_miles = max(miles)
min_miles = min(miles)

for student, vec in student_data.items() :
    vec[0] = (vec[0] - min_miles)/(max_miles - min_miles)    
    vec[1] = (vec[1] - min_miles)/(max_miles - min_miles)    
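
The cell above applies min-max scaling, x' = (x - min) / (max - min), with one shared min and max for both distance columns. As a sketch of a more general version, the hypothetical helper below (`rescale_columns` is not part of `clustering_code.py`) scales each listed column using that column's own min and max. The same idea could be applied to the remaining columns if you wanted every feature on a 0 - 1 scale; the toy data is made up.

```python
def rescale_columns(data, columns):
    """Min-max scale the given column indexes in place, each on its own.

    data maps a name to a list of floats; columns is a list of indexes.
    """
    for col in columns:
        values = [vec[col] for vec in data.values()]
        lo, hi = min(values), max(values)
        span = hi - lo
        for vec in data.values():
            # A constant column has span 0; map it to 0.0 to avoid dividing by zero
            vec[col] = (vec[col] - lo) / span if span else 0.0

# Made-up data: column 0 varies, column 1 is constant
toy = {"a": [0.0, 5.0], "b": [50.0, 5.0], "c": [100.0, 5.0]}
rescale_columns(toy, [0, 1])
print(toy)  # {'a': [0.0, 0.0], 'b': [0.5, 0.0], 'c': [1.0, 0.0]}
```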



In [8]:
# Let's make a function that prints the means in a nice way.

def pprint_means(the_means) :
    var_labels = ["Birth Dist","Age 15 Dist",
                  "Post-Secondary","Mkt Major",
                  "Biz Major","HH Size"]
    for idx, cluster_mean in enumerate(the_means) :
        print("--- Printing Cluster " + str(idx) + " ---")
        
        for idx2, item in enumerate(cluster_mean) :
            print(": ".join([var_labels[idx2],str(round(item,2))]))

        print("----------------------\n")
            

In [9]:
k = 5
assignments, means = clustering_code.train_dict(student_data, k)

# Order students by cluster label so each cluster prints together.
# The loop variable is named `student` so it does not shadow the cluster count k.
s_assign = ( (student, assignments[student]) for student in sorted(assignments, key=assignments.get, reverse=False))
print( str(k) + "-means:")
for student, cluster in s_assign :
    print(str(cluster) + " : " + student)



5-means:
0 : Barr
0 : Campestre
0 : McNea
0 : Toepke
0 : Bernhard
0 : Phillips
0 : Connor
0 : Whattam 
0 : Bosco Louis
0 : Paul
0 : Engellant
0 : Woods
0 : Daly
1 : Howell
1 : Knowlton
2 : Gilbert
3 : Schwartz
3 : Leonard
3 : Stokes
3 : Williams
3 : Fowler
3 : Moore
3 : Hauer
3 : Stahlberg
3 : Bu
3 : Keith
3 : Niekamp
4 : Thompson
4 : Bankston
4 : Gabrielsen
4 : Martin
4 : Cerkovnik
4 : Runkel
4 : Danicich
4 : Hendricks
4 : Anderson
4 : Scheibel
4 : Hoffman
4 : Joyner
4 : Hettinger
4 : Nelson
4 : Fanok
4 : Albee
4 : Dauenhauer
4 : Wagers
4 : Halderman
4 : Robertson
4 : Ghazouani
4 : Parrent
4 : Haefele
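
The listing above has clusters of 13, 2, 1, 11, and 23 students. A quick way to get those sizes without counting lines is to tally the cluster labels. This sketch assumes `assignments` maps each student to a cluster index, as the printing code implies; the toy dictionary below is made up.

```python
from collections import Counter

# Made-up assignments in the same shape: student -> cluster index
toy_assignments = {"Ann": 0, "Bo": 1, "Cy": 0, "Di": 2, "Ed": 0}

# Count how many students landed in each cluster
sizes = Counter(toy_assignments.values())
for cluster, count in sorted(sizes.items()):
    print(str(cluster) + " : " + str(count) + " students")
```

Very small clusters, such as the one- and two-student clusters above, are often worth inspecting by hand to see which values set those students apart.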


In [10]:
pprint_means(means)

--- Printing Cluster 0 ---
Birth Dist: 0.31
Age 15 Dist: 0.21
Post-Secondary: 5.65
Mkt Major: 0.0
Biz Major: 0.23
HH Size: 1.46
----------------------

--- Printing Cluster 1 ---
Birth Dist: 0.15
Age 15 Dist: 0.16
Post-Secondary: 22.0
Mkt Major: 0.0
Biz Major: 0.0
HH Size: 1.0
----------------------

--- Printing Cluster 2 ---
Birth Dist: 0.19
Age 15 Dist: 0.19
Post-Secondary: 2.0
Mkt Major: 0.0
Biz Major: 1.0
HH Size: 4.0
----------------------

--- Printing Cluster 3 ---
Birth Dist: 0.18
Age 15 Dist: 0.15
Post-Secondary: 9.36
Mkt Major: 0.09
Biz Major: 0.36
HH Size: 3.0
----------------------

--- Printing Cluster 4 ---
Birth Dist: 0.12
Age 15 Dist: 0.1
Post-Secondary: 5.57
Mkt Major: 0.3
Biz Major: 0.78
HH Size: 2.61
----------------------
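
When trying different values of k, one common way to compare the runs is the total squared distance from each point to its cluster mean: lower is tighter, though the total tends to shrink as k grows, so look for the point where it stops improving much. The helper below is a hypothetical sketch (`total_squared_error` is not taken from `clustering_code.py`) that assumes the same `assignments` and `means` shapes used above. The toy values are made up.

```python
def total_squared_error(data, assignments, means):
    """Sum of squared distances from every point to its assigned mean."""
    total = 0.0
    for name, vec in data.items():
        mean = means[assignments[name]]
        total += sum((a - b) ** 2 for a, b in zip(vec, mean))
    return total

# Made-up example: two points around one mean, one point exactly on the other
toy_data = {"a": [0.0, 0.0], "b": [0.0, 2.0], "c": [10.0, 10.0]}
toy_assignments = {"a": 0, "b": 0, "c": 1}
toy_means = [[0.0, 1.0], [10.0, 10.0]]
print(total_squared_error(toy_data, toy_assignments, toy_means))  # 2.0
```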

