In [4]:
%reload_ext nb_black


## Day 50 Lecture 1 Assignment

In this assignment, we will apply affinity propagation clustering to responses from a survey about student life at a university.

In [5]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import AffinityPropagation
from scipy.spatial.distance import pdist, squareform


We will load a student life survey dataset. This dataset consists of 35 binary features, each corresponding to a yes/no survey question that characterizes the student taking the survey. Each feature name summarizes its question, so we will not list them all here.

In [7]:
# answer goes here
student_df = pd.read_csv('data/student_life_survey.csv')
student_df.info()




<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2958 entries, 0 to 2957
Data columns (total 36 columns):
 #   Column                                                                                                                      Non-Null Count  Dtype
---  ------                                                                                                                      --------------  -----
 0   Q2-Participated in Societies and Interest Groups                                                                            2958 non-null   int64
 1   Q2-Participated in Clubs                                                                                                    2958 non-null   int64
 2   Q2-Participated in Halls, JCRCs and/or Residential College CSCs                                                             2958 non-null   int64
 3   Q2-Participated in University organised events                                                                              2958 non-


For our analysis, we will focus on the subset of survey questions about stress. The names of these questions all begin with the string 'Q5'. Select the columns that meet this criterion (there should be 10 in total).

In addition, we will cluster only a random subset of this data. Affinity propagation is a fairly slow algorithm: it works on the full matrix of pairwise similarities, so its time and memory costs grow quadratically with the number of samples, and it can take infeasibly long to converge even on medium-sized datasets. Select a random sample of 500 rows from the dataset.
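The column selection and sampling described above can be sketched on a toy frame. This is only an illustration under stated assumptions: the toy column names (including the hypothetical "Q7-Mentions Q5 elsewhere") and the seed 42 are made up for the example. It uses `filter(regex=...)` with an anchored pattern, since `filter(like="Q5")` would also match names that merely contain "Q5", and passes `random_state` so the sample can be reproduced.

```python
import pandas as pd

# Toy stand-in for the survey; the column names here are hypothetical.
toy = pd.DataFrame(
    {
        "Q2-Participated in Clubs": [1, 0, 1, 0, 1, 0],
        "Q5-Stressed about Academic issues": [1, 1, 0, 0, 1, 0],
        "Q5-Stressed about Friendships": [0, 1, 0, 1, 0, 0],
        "Q7-Mentions Q5 elsewhere": [0, 0, 1, 1, 0, 1],
    }
)

# Keep only columns whose names start with "Q5" (the regex is anchored with ^).
stress = toy.filter(regex=r"^Q5")

# Draw a reproducible random sample; the seed value is an arbitrary choice.
sample = stress.sample(n=4, random_state=42)

print(list(stress.columns))
print(sample.shape)
```

Reusing the same `random_state` returns the same rows on every run, which makes later cluster counts comparable between runs.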

In [9]:
# answer goes here
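# Keep only the stress-related questions (column names containing "Q5").
q5_df = student_df.filter(like="Q5")
# Draw 500 random rows; no seed is set, so the sample differs between runs.
q5_samp = q5_df.sample(500)
q5_samp.info()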



<class 'pandas.core.frame.DataFrame'>
Int64Index: 500 entries, 2526 to 1768
Data columns (total 10 columns):
 #   Column                                                                               Non-Null Count  Dtype
---  ------                                                                               --------------  -----
 0   Q5-Stressed about Adjustment issues                                                  500 non-null    int64
 1   Q5-Stressed about Academic issues                                                    500 non-null    int64
 2   Q5-Stressed about Financial issues                                                   500 non-null    int64
 3   Q5-Stressed about Family issues                                                      500 non-null    int64
 4   Q5-Stressed about Friendships                                                        500 non-null    int64
 5   Q5-Stressed about Romantic relationships                                             500 non-null    i


The sklearn implementation of affinity propagation only supports two affinity options, 'euclidean' and 'precomputed', so we will need to precompute the matrix ourselves. Furthermore, the algorithm expects similarities, where larger values mean more alike, so distances must be negated; the default affinity is the negative squared euclidean distance.

Compute the full dissimilarity matrix between all pairs of students using the negative matching/Hamming distance and store it in a DataFrame.

Note: Be sure to convert the values to negative to match what the algorithm expects.
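To make the negative Hamming distance concrete, here is a small worked sketch on made-up answers (the `toy` array is an assumption for illustration, not survey data). It uses plain NumPy broadcasting instead of `pdist`; the Hamming distance here is the fraction of questions on which two students disagree, and negating it turns it into a similarity.

```python
import numpy as np

# Three hypothetical students answering four yes/no questions.
toy = np.array(
    [
        [1, 0, 1, 1],
        [1, 0, 0, 1],
        [0, 1, 0, 0],
    ]
)

# Compare every pair of rows; the mean over questions is the fraction of mismatches.
distance = (toy[:, None, :] != toy[None, :, :]).mean(axis=2)

# Negate so that more alike pairs get larger (less negative) values.
similarity = -distance

print(similarity)
# Students 0 and 1 differ on 1 of 4 questions, so similarity[0, 1] is -0.25.
# Students 0 and 2 differ on all 4 questions, so similarity[0, 2] is -1.0.
```

Identical answer sheets get a similarity of zero, the largest possible value, which is why the diagonal of the matrix is all zeros.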

In [14]:
# answer goes here
q5_affin = pd.DataFrame(squareform(pdist(q5_samp, metric="hamming") * -1))
q5_affin

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,490,491,492,493,494,495,496,497,498,499
0,0.0,-0.2,-0.0,-0.2,-0.2,-0.5,-0.2,-0.1,-0.2,-0.2,...,-0.1,-0.4,-0.3,-0.1,-0.4,-0.4,-0.3,-0.1,-0.7,-0.5
1,-0.2,0.0,-0.2,-0.2,-0.2,-0.3,-0.2,-0.3,-0.2,-0.4,...,-0.3,-0.2,-0.5,-0.3,-0.4,-0.4,-0.3,-0.3,-0.5,-0.3
2,-0.0,-0.2,0.0,-0.2,-0.2,-0.5,-0.2,-0.1,-0.2,-0.2,...,-0.1,-0.4,-0.3,-0.1,-0.4,-0.4,-0.3,-0.1,-0.7,-0.5
3,-0.2,-0.2,-0.2,0.0,-0.4,-0.5,-0.0,-0.3,-0.2,-0.4,...,-0.3,-0.2,-0.5,-0.3,-0.4,-0.6,-0.3,-0.3,-0.5,-0.3
4,-0.2,-0.2,-0.2,-0.4,0.0,-0.5,-0.4,-0.1,-0.2,-0.2,...,-0.1,-0.2,-0.3,-0.3,-0.4,-0.2,-0.3,-0.1,-0.5,-0.5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
495,-0.4,-0.4,-0.4,-0.6,-0.2,-0.7,-0.6,-0.3,-0.4,-0.4,...,-0.3,-0.4,-0.5,-0.3,-0.6,0.0,-0.5,-0.3,-0.5,-0.7
496,-0.3,-0.3,-0.3,-0.3,-0.3,-0.4,-0.3,-0.2,-0.1,-0.1,...,-0.2,-0.3,-0.2,-0.4,-0.1,-0.5,0.0,-0.2,-0.6,-0.4
497,-0.1,-0.3,-0.1,-0.3,-0.1,-0.6,-0.3,-0.0,-0.1,-0.1,...,-0.0,-0.3,-0.2,-0.2,-0.3,-0.3,-0.2,0.0,-0.6,-0.6
498,-0.7,-0.5,-0.7,-0.5,-0.5,-0.4,-0.5,-0.6,-0.5,-0.7,...,-0.6,-0.3,-0.6,-0.6,-0.5,-0.5,-0.6,-0.6,0.0,-0.4



Using the dissimilarity matrix, run affinity propagation on the survey results with a damping parameter of 0.8 and the preference set to the median of the matrix, which is also the value sklearn uses by default when no preference is given. How many exemplars did it identify? If there are too many exemplars, what changes would we want to make?

In [18]:
# answer goes here
ap_clstr = AffinityPropagation(
    damping=0.8,
    affinity="precomputed",
    preference=np.median(q5_affin),  # scalar median over all pairwise values
    max_iter=500,
)
ap_clstr.fit(q5_affin)




AffinityPropagation(affinity='precomputed', convergence_iter=15, copy=True,
                    damping=0.8, max_iter=500, preference=-0.3, verbose=False)


In [20]:
len(ap_clstr.cluster_centers_indices_)

51


Try adjusting the value of the preference based on the result you saw in the previous step until you have a reasonable number of exemplars. Print out the data for each of these exemplars, as well as the number of surveys assigned to each exemplar. How do these clusters compare to what we saw previously with k-medoids?

Tip: large preferences can lead to numerical instability and issues with convergence. The "damping" parameter can help control this by downscaling the impact of incoming messages; check the documentation for AffinityPropagation for more details.
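One way to explore the effect of the preference is to fit the model at a few candidate values and compare the exemplar counts. The sketch below is a rough illustration under stated assumptions: the data is synthetic, `count_exemplars` is a hypothetical helper rather than part of sklearn, and `random_state=0` assumes a recent sklearn release. Points with larger preferences are more likely to be chosen as exemplars, so a lower (more negative) value such as the minimum similarity usually yields fewer clusters than the median.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation


def count_exemplars(similarity, preference, damping=0.8):
    """Fit on a precomputed similarity matrix and return the number of exemplars."""
    model = AffinityPropagation(
        damping=damping,
        affinity="precomputed",
        preference=preference,
        max_iter=1000,
        random_state=0,
    )
    model.fit(similarity)
    # An empty result means the run did not settle on any exemplars.
    return len(model.cluster_centers_indices_)


# Synthetic yes/no answers: 60 hypothetical students, 10 questions.
rng = np.random.default_rng(0)
toy = rng.integers(0, 2, size=(60, 10))

# Negative Hamming distance, as in the assignment.
similarity = -(toy[:, None, :] != toy[None, :, :]).mean(axis=2)

candidates = {"median": np.median(similarity), "minimum": similarity.min()}
for name, preference in candidates.items():
    print(name, preference, count_exemplars(similarity, preference))
```

Values below the minimum similarity are also valid preferences, so the sweep can be extended downward if the minimum still produces too many exemplars.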

In [29]:
# answer goes here
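# Note: in this run np.mean(q5_affin) returned one mean per column (see the
# output below), so each point received its own preference instead of a single
# scalar. Use q5_affin.to_numpy().mean() to get one scalar value.
ap_clstr = AffinityPropagation(damping=0.8, affinity='precomputed', preference=np.mean(q5_affin), max_iter=1000)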
ap_clstr.fit(q5_affin)




AffinityPropagation(affinity='precomputed', convergence_iter=15, copy=True,
                    damping=0.8, max_iter=1000,
                    preference=0     -0.2700
1     -0.3032
2     -0.2700
3     -0.3440
4     -0.2524
        ...  
495   -0.4112
496   -0.2672
497   -0.2256
498   -0.5156
499   -0.4740
Length: 500, dtype: float64,
                    verbose=False)


In [26]:
len(ap_clstr.cluster_centers_indices_)

47
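The exercise also asks for the exemplar rows and the number of surveys assigned to each. A minimal sketch of that step follows, assuming a fitted model whose `labels_` index into `cluster_centers_indices_` and whose run produced at least one exemplar; `summarize_exemplars` is a hypothetical helper and the toy inputs are made up for illustration.

```python
import numpy as np
import pandas as pd


def summarize_exemplars(data, labels, center_indices):
    """Return the exemplar rows with the number of points assigned to each."""
    # labels[i] is the position of point i's exemplar within center_indices.
    sizes = np.bincount(labels, minlength=len(center_indices))
    # Use positional indexing, because a sampled frame keeps its original index.
    summary = data.iloc[center_indices].copy()
    summary["n_assigned"] = sizes
    return summary


# Toy inputs standing in for the sampled survey and the fitted model's attributes.
toy = pd.DataFrame({"Q5-a": [1, 1, 0, 0, 0], "Q5-b": [0, 0, 1, 1, 1]})
toy_labels = np.array([0, 0, 1, 1, 1])
toy_centers = np.array([0, 3])

summary = summarize_exemplars(toy, toy_labels, toy_centers)
print(summary)
```

With the notebook's own objects, the call would presumably be `summarize_exemplars(q5_samp, ap_clstr.labels_, ap_clstr.cluster_centers_indices_)`.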
