## Day 47 Lecture 2 Assignment

In this assignment, we will perform K-Medoids clustering using responses to a survey about student life at a university.

In [1]:
%matplotlib inline

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform
from pyclustering.cluster.kmedoids import kmedoids
import random

This dataset consists of 35 binary features, each corresponding to a yes/no survey question that characterizes the student taking the survey. Each feature name summarizes its question, so we will not list them all here.

Load the dataset.

In [2]:
# answer goes here
df = pd.read_csv('data/student_life_survey.csv')






For our analysis, we will focus on the subset of survey questions about stress. The names of these columns all begin with the string 'Q5'. Select the columns that meet this criterion (there should be 10 in total).
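One detail worth keeping in mind: `DataFrame.filter(like=...)` keeps every column whose name contains the substring, not only those that begin with it. The sketch below contrasts the two on a toy frame; the column names are made up for illustration and the real survey columns may differ.

```python
import pandas as pd

# Hypothetical column names, for illustration only.
toy = pd.DataFrame(
    [[1, 0, 1, 0]],
    columns=[
        "Q4-Sleep",
        "Q5-Stressed about Academic issues",
        "Q5-Stressed about Others",
        "Follow-up to Q5",
    ],
)

# Substring match: also keeps "Follow-up to Q5".
by_substring = toy.filter(like="Q5", axis=1)

# Prefix match: keeps only names that start with "Q5".
by_prefix = toy.loc[:, toy.columns.str.startswith("Q5")]

print(list(by_substring.columns))
print(list(by_prefix.columns))
```

If both selections return the expected 10 columns on the survey data, the simpler `filter(like='Q5')` is sufficient.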

In [3]:
# answer goes here
q5_df = df.filter(like='Q5', axis=1)

# Keep an untouched copy of the Q5 responses for later reference.
q5_og = q5_df.copy()





The pyclustering implementation of kmedoids supports a variety of distance metrics, but they are primarily intended for numeric data. Because our features are binary, we will use SMC/Hamming dissimilarity and precompute the dissimilarity matrix (an alternative is to specify a user-defined distance function, which you are welcome to try in addition).

We'll assume for the next step that a pair of negative values (i.e. both responses are "no") is as informative as a pair of positive values. Compute the full distance/dissimilarity matrix for the survey data using matching/hamming distance.
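To make the assumption about negative pairs concrete, here is a small sketch comparing Hamming (simple matching) dissimilarity with Jaccard dissimilarity, which ignores shared "no" answers. The two respondents are made up for illustration. The values should match what `pdist` returns for `metric='hamming'` and `metric='jaccard'`.

```python
import numpy as np

# Two made-up respondents answering 10 yes/no questions (True = yes).
a = np.array([0, 1, 0, 0, 0, 0, 0, 1, 0, 0], dtype=bool)
b = np.array([0, 1, 0, 0, 0, 0, 0, 0, 0, 0], dtype=bool)

mismatches = np.sum(a != b)   # questions answered differently
either_yes = np.sum(a | b)    # questions where at least one answered yes

# Hamming / SMC dissimilarity: mismatches over all questions,
# so a shared "no" counts as agreement.
hamming = mismatches / a.size

# Jaccard dissimilarity: mismatches over questions with at least one "yes",
# so shared "no" answers are ignored.
jaccard = mismatches / either_yes

print(hamming, jaccard)
```

The respondents differ on one question out of ten, so Hamming treats them as nearly identical, while Jaccard only sees two informative questions and reports a much larger dissimilarity.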

In [4]:
# answer goes here
dist_df = pd.DataFrame(squareform(pdist(q5_df, metric='hamming')))
dist_mat = np.array(dist_df)



Using the dissimilarity matrix, perform kmedoids clustering using k=2. Set the initial medoids randomly. Note that pyclustering expects the distance matrix to be a numpy array; a pandas dataframe may cause errors. 

Which survey responses are chosen as the cluster representatives? Print out the details of these responses.
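For intuition about what a k-medoids run does with a precomputed matrix, the sketch below implements a simple alternating scheme in NumPy: assign each point to its nearest medoid, then move each medoid to the cluster member with the smallest total distance to the others. This is an illustrative stand-in, not pyclustering's implementation; the function name `simple_kmedoids` and the toy data are assumptions made for this example.

```python
import numpy as np

def simple_kmedoids(dist, init_medoids, max_iter=100):
    """Alternating k-medoids on a square distance matrix (illustrative only)."""
    medoids = np.array(init_medoids, dtype=int)
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(len(medoids)):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old medoid if a cluster is empty
            # Update step: pick the member with the smallest total
            # distance to the other members of its cluster.
            within = dist[np.ix_(members, members)].sum(axis=0)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break  # converged: medoids did not change
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels

# Toy data: two well-separated groups on a line.
points = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
toy_dist = np.abs(points[:, None] - points[None, :])

# Deliberately poor start: both initial medoids in the same group.
medoids, labels = simple_kmedoids(toy_dist, init_medoids=[0, 1])
print(medoids, labels)
```

Note that the medoids are always actual rows of the data, which is why the assignment can ask which survey responses represent each cluster.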

In [5]:
k = 2

In [21]:
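# answer goes here
# np.random.seed(42)  # uncomment for reproducible initial medoids

nrows = dist_mat.shape[0]
# Sample k distinct row indices; randint could draw the same index twice.
init_medoids = np.random.choice(nrows, size=k, replace=False)
init_medoids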



array([1131, 2309])

In [22]:
kmed = kmedoids(
    dist_mat, initial_index_medoids=init_medoids, data_type="distance_matrix"
)

kmed.process()


<pyclustering.cluster.kmedoids.kmedoids at 0x2b91d809948>

In [23]:
medoid_idxs = kmed.get_medoids()

q5_df.iloc[medoid_idxs]

Unnamed: 0,Q5-Stressed about Adjustment issues,Q5-Stressed about Academic issues,Q5-Stressed about Financial issues,Q5-Stressed about Family issues,Q5-Stressed about Friendships,Q5-Stressed about Romantic relationships,Q5-Stressed about Health related issues,Q5-Stressed about Career related issues,"Q5-Stressed about My involvement in hostel, clubs, societies, interest groups, etc.",Q5-Stressed about Others,labels
2,0,1,0,0,0,0,0,1,0,0,0
2309,0,1,0,0,0,0,0,0,0,0,0


In [24]:
labels = kmed.predict(dist_mat)
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
q5_df = q5_df.assign(labels=labels)


In [25]:
q5_df.groupby('labels').mean().T.style.background_gradient(axis=1)

| labels | 0 | 1 |
| --- | --- | --- |
| Q5-Stressed about Adjustment issues | 0.316716 | 0.297992 |
| Q5-Stressed about Academic issues | 0.949413 | 0.898996 |
| Q5-Stressed about Financial issues | 0.442082 | 0.338143 |
| Q5-Stressed about Family issues | 0.180352 | 0.119197 |
| Q5-Stressed about Friendships | 0.316716 | 0.229611 |
| Q5-Stressed about Romantic relationships | 0.19868 | 0.13739 |
| Q5-Stressed about Health related issues | 0.183284 | 0.123588 |
| Q5-Stressed about Career related issues | 0.986804 | 0.0 |
| Q5-Stressed about My involvement in hostel, clubs, societies, interest groups, etc. | 0.316716 | 0.237767 |
| Q5-Stressed about Others | 0.006598 | 0.020075 |


If you run the previous cells a few times, you'll probably notice that the medoids are very sensitive to initialization. A common approach to producing well-separated clusters is to choose initial medoids that are far apart. Re-run the previous process, this time with a random pair of initial medoids whose dissimilarity is 0.8 or higher. Are the results more stable now? How would you describe the typical clusters you see?
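One possible way to draw such a pair at random is sketched below. The helper name `sample_far_pair` and the toy responses are assumptions made for this example, and the threshold is assumed to be positive.

```python
import numpy as np

def sample_far_pair(dist, threshold, rng):
    """Return a random pair of row indices whose dissimilarity is >= threshold."""
    # Look only above the diagonal so each unordered pair appears once.
    rows, cols = np.nonzero(np.triu(dist, k=1) >= threshold)
    if rows.size == 0:
        raise ValueError("no pair reaches the requested dissimilarity")
    pick = rng.integers(rows.size)
    return [int(rows[pick]), int(cols[pick])]

# Toy binary responses (made up): only rows 0 and 1 differ on 8 of 10 answers.
X = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
])
# Hamming dissimilarity: fraction of answers that differ for each pair of rows.
toy_dist = (X[:, None, :] != X[None, :, :]).mean(axis=2)

rng = np.random.default_rng(0)
pair = sample_far_pair(toy_dist, 0.8, rng)
print(pair)
```

With the assignment's variables this could be called as `sample_far_pair(dist_mat, 0.8, np.random.default_rng())`, and the result passed as `initial_index_medoids`.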

In [33]:
dist_df

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2948,2949,2950,2951,2952,2953,2954,2955,2956,2957
0,0.0,0.0,0.1,0.6,0.3,0.1,0.2,0.3,0.4,0.4,...,0.1,0.3,0.2,0.3,0.3,0.2,0.0,0.0,0.2,0.2
1,0.0,0.0,0.1,0.6,0.3,0.1,0.2,0.3,0.4,0.4,...,0.1,0.3,0.2,0.3,0.3,0.2,0.0,0.0,0.2,0.2
2,0.1,0.1,0.0,0.7,0.4,0.2,0.1,0.2,0.5,0.3,...,0.2,0.4,0.3,0.4,0.4,0.3,0.1,0.1,0.1,0.1
3,0.6,0.6,0.7,0.0,0.5,0.5,0.8,0.7,0.2,0.6,...,0.7,0.5,0.4,0.5,0.3,0.6,0.6,0.6,0.6,0.8
4,0.3,0.3,0.4,0.5,0.0,0.2,0.3,0.6,0.3,0.3,...,0.2,0.2,0.3,0.4,0.2,0.3,0.3,0.3,0.5,0.3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2953,0.2,0.2,0.3,0.6,0.3,0.3,0.2,0.5,0.4,0.4,...,0.1,0.1,0.4,0.3,0.5,0.0,0.2,0.2,0.4,0.2
2954,0.0,0.0,0.1,0.6,0.3,0.1,0.2,0.3,0.4,0.4,...,0.1,0.3,0.2,0.3,0.3,0.2,0.0,0.0,0.2,0.2
2955,0.0,0.0,0.1,0.6,0.3,0.1,0.2,0.3,0.4,0.4,...,0.1,0.3,0.2,0.3,0.3,0.2,0.0,0.0,0.2,0.2
2956,0.2,0.2,0.1,0.6,0.5,0.3,0.2,0.3,0.6,0.4,...,0.3,0.5,0.4,0.5,0.5,0.4,0.2,0.2,0.0,0.2


In [29]:
dist_df[(dist_df >= 0.8).any(axis=1)].sample(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2948,2949,2950,2951,2952,2953,2954,2955,2956,2957
2383,0.2,0.2,0.3,0.8,0.5,0.3,0.4,0.3,0.6,0.6,...,0.3,0.5,0.4,0.5,0.5,0.4,0.2,0.2,0.4,0.4
2678,0.1,0.1,0.2,0.7,0.4,0.2,0.3,0.2,0.5,0.5,...,0.2,0.4,0.3,0.4,0.4,0.3,0.1,0.1,0.3,0.3


In [37]:
dist_df[1].value_counts()

0.2    804
0.1    717
0.3    525
0.0    342
0.4    294
0.5    155
0.6     71
0.7     29
0.8     21
Name: 1, dtype: int64

In [40]:
# Rows whose dissimilarity to response 1 is at least 0.8
dist_df[dist_df.iloc[:, 1] >= 0.8]

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2948,2949,2950,2951,2952,2953,2954,2955,2956,2957
245,0.8,0.8,0.7,0.2,0.5,0.7,0.6,0.7,0.4,0.4,...,0.7,0.5,0.6,0.5,0.5,0.6,0.8,0.8,0.6,0.6
263,0.8,0.8,0.7,0.2,0.5,0.7,0.6,0.7,0.4,0.4,...,0.7,0.5,0.6,0.5,0.5,0.6,0.8,0.8,0.6,0.6
496,0.8,0.8,0.7,0.2,0.5,0.7,0.6,0.7,0.4,0.4,...,0.7,0.5,0.6,0.5,0.5,0.6,0.8,0.8,0.6,0.6
752,0.8,0.8,0.7,0.2,0.5,0.7,0.6,0.7,0.4,0.4,...,0.7,0.5,0.6,0.5,0.5,0.6,0.8,0.8,0.6,0.6
812,0.8,0.8,0.7,0.2,0.5,0.7,0.6,0.7,0.4,0.4,...,0.7,0.5,0.6,0.5,0.5,0.6,0.8,0.8,0.6,0.6
866,0.8,0.8,0.7,0.2,0.5,0.7,0.6,0.7,0.4,0.4,...,0.7,0.5,0.6,0.5,0.5,0.6,0.8,0.8,0.6,0.6
878,0.8,0.8,0.7,0.2,0.5,0.7,0.6,0.7,0.4,0.4,...,0.7,0.5,0.6,0.5,0.5,0.6,0.8,0.8,0.6,0.6
923,0.8,0.8,0.7,0.2,0.5,0.7,0.6,0.7,0.4,0.4,...,0.7,0.5,0.6,0.5,0.5,0.6,0.8,0.8,0.6,0.6
941,0.8,0.8,0.7,0.2,0.5,0.7,0.6,0.7,0.4,0.4,...,0.7,0.5,0.6,0.5,0.5,0.6,0.8,0.8,0.6,0.6
1125,0.8,0.8,0.7,0.2,0.5,0.7,0.6,0.7,0.4,0.4,...,0.7,0.5,0.6,0.5,0.5,0.6,0.8,0.8,0.6,0.6


In [44]:
# answer goes here
# Rows 245 and 0 have a Hamming dissimilarity of 0.8 (see the filtered rows above),
# so they serve as a far-apart pair of initial medoids.
init_medoids = np.array([245, 0])
init_medoids




array([245,   0])

In [45]:
kmed = kmedoids(
    dist_mat, initial_index_medoids=init_medoids, data_type="distance_matrix"
)

kmed.process()

<pyclustering.cluster.kmedoids.kmedoids at 0x2b92682bbc8>

In [46]:
medoid_idxs = kmed.get_medoids()

q5_df.iloc[medoid_idxs]

Unnamed: 0,Q5-Stressed about Adjustment issues,Q5-Stressed about Academic issues,Q5-Stressed about Financial issues,Q5-Stressed about Family issues,Q5-Stressed about Friendships,Q5-Stressed about Romantic relationships,Q5-Stressed about Health related issues,Q5-Stressed about Career related issues,"Q5-Stressed about My involvement in hostel, clubs, societies, interest groups, etc.",Q5-Stressed about Others,labels
134,1,1,1,0,1,0,0,1,1,0,0
0,0,1,0,0,0,0,0,0,0,0,1


In [47]:
labels = kmed.predict(dist_mat)
# assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
q5_df = q5_df.assign(labels=labels)


In [48]:
q5_df.groupby('labels').mean().T.style.background_gradient(axis=1)

| labels | 0 | 1 |
| --- | --- | --- |
| Q5-Stressed about Adjustment issues | 0.529575 | 0.178038 |
| Q5-Stressed about Academic issues | 0.918669 | 0.924307 |
| Q5-Stressed about Financial issues | 0.620148 | 0.251066 |
| Q5-Stressed about Family issues | 0.336414 | 0.03838 |
| Q5-Stressed about Friendships | 0.531423 | 0.11887 |
| Q5-Stressed about Romantic relationships | 0.342884 | 0.063433 |
| Q5-Stressed about Health related issues | 0.322551 | 0.052239 |
| Q5-Stressed about Career related issues | 0.663586 | 0.334755 |
| Q5-Stressed about My involvement in hostel, clubs, societies, interest groups, etc. | 0.515712 | 0.134861 |
| Q5-Stressed about Others | 0.015712 | 0.012793 |
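
To put a number on how stable the clustering is across initializations, one option is to compare the label vectors from two runs. Cluster ids are arbitrary, so the comparison has to allow for the two labels being swapped. The helper below is a sketch for the k=2 case only; the name `two_cluster_agreement` and the example labels are made up for illustration.

```python
import numpy as np

def two_cluster_agreement(labels_a, labels_b):
    """Fraction of points grouped the same way by two 2-cluster labelings.

    Assumes both inputs contain only the labels 0 and 1.
    """
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    same = float(np.mean(a == b))
    # If the cluster ids are swapped between runs, 1 - same is the true agreement.
    return max(same, 1.0 - same)

run_1 = [0, 0, 1, 1, 1]
run_2 = [1, 1, 0, 0, 0]   # same grouping, cluster ids swapped
run_3 = [0, 1, 1, 1, 1]   # one point moved to the other cluster

print(two_cluster_agreement(run_1, run_2))
print(two_cluster_agreement(run_1, run_3))
```

Values close to 1 across repeated runs suggest the clusters are stable. Values near 0.5 suggest the two runs split the data almost independently.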
