In [1]:
%reload_ext nb_black


## Day 50 Lecture 1 Assignment

In this assignment, we will run affinity propagation clustering on responses to a survey about student life at a university.

In [3]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import AffinityPropagation
from scipy.spatial.distance import pdist, squareform


We will load a student life survey dataset. It consists of 35 binary features, each corresponding to a yes/no question that characterizes the student taking the survey. Each feature name summarizes its survey question, so we will not list them all here.

In [4]:
# answer goes here

stud = pd.read_csv('data/student_life_survey.csv')
stud



Unnamed: 0,Q2-Participated in Societies and Interest Groups,Q2-Participated in Clubs,"Q2-Participated in Halls, JCRCs and/or Residential College CSCs",Q2-Participated in University organised events,Q3-Interested in Arts & Culture,Q3-Interested in Science & Technology,Q3-Interested in Research and independent study,Q3-Interested in Sports,"Q3-Interested in Other competitions (eg case, debates)",Q3-Interested in Entrepreneurship,...,Q5-Stressed about Academic issues,Q5-Stressed about Financial issues,Q5-Stressed about Family issues,Q5-Stressed about Friendships,Q5-Stressed about Romantic relationships,Q5-Stressed about Health related issues,Q5-Stressed about Career related issues,"Q5-Stressed about My involvement in hostel, clubs, societies, interest groups, etc.",Q5-Stressed about Others,response_id
0,0,1,0,0,0,1,1,0,1,0,...,1,0,0,0,0,0,0,0,0,1
1,0,1,0,0,1,0,0,1,0,0,...,1,0,0,0,0,0,0,0,0,2
2,0,0,1,0,0,0,0,1,0,0,...,1,0,0,0,0,0,1,0,0,3
3,1,1,1,1,0,1,1,0,0,0,...,1,0,1,1,1,1,0,1,0,4
4,1,0,1,1,0,1,1,0,0,1,...,1,1,0,1,0,0,0,1,0,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2953,1,0,0,0,1,1,1,1,0,0,...,1,1,0,0,1,0,0,0,0,2954
2954,1,0,0,0,0,1,1,1,1,0,...,1,0,0,0,0,0,0,0,0,2955
2955,1,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,2956
2956,0,1,0,1,1,1,1,1,0,1,...,1,0,0,0,0,1,1,0,0,2957



For our analysis, we will focus on the subset of the survey that deals with stress. These questions all begin with the string 'Q5'. Filter the columns that meet this criterion (there should be 10 in total).

In addition, we will cluster only a random subset of the data, because affinity propagation is fairly slow and can take impractically long to converge on even medium-sized datasets. Select a random sample of 500 rows from the dataset.

In [6]:
# answer goes here
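# Keep only the stress-related columns (names containing 'Q5')
q5 = stud.filter(like='Q5')
# Draw a reproducible random sample of 500 respondents
q5_sample = q5.sample(n=500, random_state=42)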




The sklearn implementation of affinity propagation supports only two affinity options, euclidean and precomputed, so we will need to precompute the matrix ourselves. Furthermore, it expects similarities rather than distances (larger values mean more alike), which is why the values must be negative; the default affinity is the negative squared euclidean distance.

Compute the full dissimilarity matrix between all pairs of students using the negative matching/hamming distance and store it in a dataframe. 

Note: Be sure to convert the values to negative to match what the algorithm expects.
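As a small illustration of what the negated hamming distance looks like, here is a sketch on three made-up respondents (the `toy` array is hypothetical and not taken from the survey). The hamming distance is the fraction of answers on which two respondents disagree, so negating it gives values between -1 and 0.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three hypothetical respondents answering four yes/no questions.
toy = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
])

# pdist returns the condensed pairwise distances; squareform expands them
# into a symmetric 3x3 matrix with zeros on the diagonal.
hamming = squareform(pdist(toy, metric="hamming"))

# Negate so that larger values mean "more similar".
similarity = -hamming

# Respondents 0 and 1 disagree on 1 of 4 answers, so their similarity is -0.25.
# Respondents 0 and 2 disagree on all 4 answers, so their similarity is -1.0.
print(similarity)
```

The same two calls scale to the full sample; only the input changes.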

In [10]:
# answer goes here

dissim_mat = squareform(pdist(q5_sample, metric="hamming")) * -1

dissim_mat = pd.DataFrame(dissim_mat, index=q5_sample.index, columns=q5_sample.index)
dissim_mat

Unnamed: 0,2012,1688,764,2057,2025,2099,807,2414,1582,332,...,1775,940,612,408,1953,2322,135,2437,2708,2899
2012,-0.0,-0.1,-0.5,-0.5,-0.3,-0.7,-0.7,-0.3,-0.6,-0.8,...,-0.6,-0.6,-0.4,-0.5,-0.5,-0.4,-0.5,-0.4,-0.5,-0.9
1688,-0.1,-0.0,-0.4,-0.4,-0.2,-0.6,-0.6,-0.2,-0.5,-0.7,...,-0.5,-0.5,-0.3,-0.4,-0.4,-0.3,-0.4,-0.3,-0.4,-0.8
764,-0.5,-0.4,-0.0,-0.6,-0.4,-0.2,-0.4,-0.2,-0.5,-0.5,...,-0.7,-0.7,-0.5,-0.6,-0.6,-0.3,-0.6,-0.5,-0.4,-0.4
2057,-0.5,-0.4,-0.6,-0.0,-0.2,-0.4,-0.6,-0.4,-0.3,-0.3,...,-0.3,-0.1,-0.1,-0.0,-0.2,-0.3,-0.2,-0.1,-0.2,-0.6
2025,-0.3,-0.2,-0.4,-0.2,-0.0,-0.4,-0.6,-0.2,-0.3,-0.5,...,-0.5,-0.3,-0.1,-0.2,-0.4,-0.1,-0.4,-0.1,-0.4,-0.8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2322,-0.4,-0.3,-0.3,-0.3,-0.1,-0.3,-0.5,-0.3,-0.4,-0.4,...,-0.6,-0.4,-0.2,-0.3,-0.5,-0.0,-0.5,-0.2,-0.3,-0.7
135,-0.5,-0.4,-0.6,-0.2,-0.4,-0.6,-0.8,-0.4,-0.5,-0.5,...,-0.5,-0.3,-0.3,-0.2,-0.2,-0.5,-0.0,-0.3,-0.2,-0.4
2437,-0.4,-0.3,-0.5,-0.1,-0.1,-0.3,-0.5,-0.3,-0.4,-0.4,...,-0.4,-0.2,-0.0,-0.1,-0.3,-0.2,-0.3,-0.0,-0.3,-0.7
2708,-0.5,-0.4,-0.4,-0.2,-0.4,-0.4,-0.6,-0.4,-0.5,-0.3,...,-0.5,-0.3,-0.3,-0.2,-0.2,-0.3,-0.2,-0.3,-0.0,-0.4



Using the dissimilarity matrix, run affinity propagation on the survey results with the preference set to its default value, which is the median of the input similarities, and a damping parameter of 0.8. How many exemplars did it identify? If there are too many exemplars, what changes would we want to make?

In [19]:
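# Column-wise medians of the similarity matrix, inspected to help pick a preference.
# Note: sklearn's own default is a single scalar, the median of all input similarities.
def_pref = dissim_mat.median()
def_pref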

2012   -0.40
1688   -0.30
764    -0.40
2057   -0.20
2025   -0.20
        ... 
2322   -0.30
135    -0.40
2437   -0.20
2708   -0.30
2899   -0.65
Length: 500, dtype: float64
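The output above holds one median per column, whereas scikit-learn's default preference is a single scalar taken over every entry of the similarity matrix. A minimal sketch of the scalar version follows, using synthetic binary data (`toy_answers` is randomly generated, not the survey) and assuming a scikit-learn version recent enough to accept `random_state`.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import AffinityPropagation

# Synthetic stand-in for the survey: 60 respondents, 10 yes/no answers.
rng = np.random.default_rng(0)
toy_answers = rng.integers(0, 2, size=(60, 10))

# Negative hamming distance as the similarity matrix.
toy_sim = -squareform(pdist(toy_answers, metric="hamming"))

# One scalar over all entries of the matrix, not one value per column.
scalar_pref = float(np.median(toy_sim))

toy_ap = AffinityPropagation(
    affinity="precomputed",
    preference=scalar_pref,
    damping=0.8,
    max_iter=1000,
    random_state=0,
)
toy_ap.fit(toy_sim)

# An empty index array here would mean the run did not converge.
n_exemplars = len(toy_ap.cluster_centers_indices_)
print(f"preference={scalar_pref:.2f}, exemplars={n_exemplars}")
```

On the notebook's data, the equivalent scalar would presumably be `np.median(dissim_mat.values)`; leaving `preference` unset should give the same behaviour.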


In [21]:
# answer goes here
# preference is hard-coded to -0.3, a value within the range of the column medians above

af_prop = AffinityPropagation(
    affinity="precomputed", preference=-0.3, max_iter=1000, damping=0.8
)
af_prop.fit(dissim_mat)
af_prop.cluster_centers_indices_

array([ 13,  17,  18,  47,  54,  61,  63,  71,  80,  94,  99, 108, 114,
       117, 124, 125, 133, 146, 149, 153, 154, 158, 176, 184, 191, 197,
       203, 216, 219, 237, 239, 245, 249, 259, 266, 271, 286, 290, 304,
       305, 314, 355, 383, 385, 413, 432, 436, 447, 451, 457, 494],
      dtype=int64)


Try adjusting the value of the preference based on the result you saw in the previous step until you have a reasonable number of exemplars. Print out the data for each of these exemplars, as well as the number of surveys assigned to each exemplar. How do these clusters compare to what we saw previously with k-medoids?

Tip: large preferences can lead to numerical instability and issues with convergence. The "damping" parameter helps control this by blending each new message with its previous value, which dampens oscillations; check the documentation for AffinityPropagation for more details.
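One way to approach the tuning is to sweep a few preference values and record how many exemplars each produces. The sketch below does this on synthetic data; the candidate values and the `toy_answers` array are illustrative assumptions, not recommendations for the survey.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import AffinityPropagation

# Synthetic stand-in for the survey sample.
rng = np.random.default_rng(1)
toy_answers = rng.integers(0, 2, size=(60, 10))
toy_sim = -squareform(pdist(toy_answers, metric="hamming"))

# Illustrative candidates; more negative values generally yield fewer exemplars.
candidate_prefs = [-0.5, -1.0, -2.0, -4.0]

exemplar_counts = {}
for pref in candidate_prefs:
    model = AffinityPropagation(
        affinity="precomputed",
        preference=pref,
        damping=0.9,
        max_iter=1000,
        random_state=0,
    )
    model.fit(toy_sim)
    # A count of 0 signals that this run did not converge.
    exemplar_counts[pref] = len(model.cluster_centers_indices_)

for pref, count in exemplar_counts.items():
    print(f"preference={pref:>5}: {count} exemplars")
```

If a run reports zero exemplars, raising `damping` or `max_iter` is a reasonable next thing to try.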

In [32]:
# answer goes here

af_prop = AffinityPropagation(
    affinity="precomputed", preference=-4.5, max_iter=1000, damping=0.9
)
af_prop.fit(dissim_mat)
af_prop.cluster_centers_indices_



array([  3, 346, 420, 468, 469], dtype=int64)


In [33]:
q5_sample["cluster"] = af_prop.labels_
q5_sample.groupby("cluster").mean().style.background_gradient()

cluster,Q5-Stressed about Adjustment issues,Q5-Stressed about Academic issues,Q5-Stressed about Financial issues,Q5-Stressed about Family issues,Q5-Stressed about Friendships,Q5-Stressed about Romantic relationships,Q5-Stressed about Health related issues,Q5-Stressed about Career related issues,"Q5-Stressed about My involvement in hostel, clubs, societies, interest groups, etc.",Q5-Stressed about Others
0,0.154762,0.916667,0.702381,0.190476,0.119048,0.119048,0.107143,1.0,0.071429,0.011905
1,0.104895,0.86014,0.559441,0.146853,0.188811,0.125874,0.132867,0.0,0.097902,0.027972
2,0.833333,1.0,0.714286,0.357143,0.880952,0.357143,0.404762,0.833333,0.809524,0.071429
3,0.709677,0.870968,0.096774,0.072581,0.209677,0.145161,0.153226,0.145161,0.08871,0.016129
4,0.056075,0.953271,0.028037,0.074766,0.214953,0.140187,0.168224,0.728972,0.663551,0.0



In [34]:
pd.Series(af_prop.labels_).value_counts()

1    143
3    124
4    107
0     84
2     42
dtype: int64
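The cells above summarize each cluster by its column means. To print the exemplar rows themselves, note that `cluster_centers_indices_` holds positions into the fitted matrix rather than index labels, so a positional lookup is needed. A sketch on synthetic data follows (the `toy_df` frame, its column names, and the preference value are hypothetical).

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import AffinityPropagation

# Synthetic frame with a non-default index, mimicking a sampled DataFrame.
rng = np.random.default_rng(2)
toy_df = pd.DataFrame(
    rng.integers(0, 2, size=(60, 5)),
    columns=[f"q{i}" for i in range(5)],
    index=np.arange(100, 160),
)
toy_sim = -squareform(pdist(toy_df, metric="hamming"))

toy_ap = AffinityPropagation(
    affinity="precomputed",
    preference=-2.0,
    damping=0.9,
    max_iter=1000,
    random_state=0,
).fit(toy_sim)

centers = toy_ap.cluster_centers_indices_
labels = toy_ap.labels_

# Positions, not labels, so use iloc rather than loc.
exemplar_rows = toy_df.iloc[centers]

# Label k refers to the k-th exemplar; ignore -1 labels from non-converged runs.
sizes = np.bincount(labels[labels >= 0], minlength=len(centers))

summary = exemplar_rows.assign(cluster=np.arange(len(centers)), n_members=sizes)
print(summary)
```

Applied to this notebook, the analogous lookup would presumably be `q5_sample.iloc[af_prop.cluster_centers_indices_]`.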
