In [1]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans #The k-means algorithm

First, let's start by creating a user-item matrix, as explained in the other Notebook.

In [2]:
movie_file = pd.read_csv('movies.csv')
ratings_file = pd.read_csv('ratings.csv')
df = pd.merge(movie_file, ratings_file)

ratings = pd.pivot_table(df, index='userId', columns='title', values='rating')
ratings.head(3)

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,,,,,,,,,,,,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,4.0,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
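As a small illustration of what the merge and the pivot do, here is a sketch on made-up data. The toy frames and their values are hypothetical; only the column names (movieId, userId, title, rating) follow the files used above.

```python
import pandas as pd

# Hypothetical toy data that mimics the structure of movies.csv and ratings.csv
toy_movies = pd.DataFrame({'movieId': [1, 2], 'title': ['Movie A', 'Movie B']})
toy_ratings = pd.DataFrame({
    'userId': [1, 1, 2],
    'movieId': [1, 2, 1],
    'rating': [4.0, 3.0, 5.0],
})

# Merge on the shared movieId column, then spread the titles over the columns
toy_df = pd.merge(toy_movies, toy_ratings, on='movieId')
toy_matrix = pd.pivot_table(toy_df, index='userId', columns='title', values='rating')
print(toy_matrix)
```

User 2 never rated 'Movie B', so that cell is NaN, just like the many empty cells in the real matrix.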


Let's put the most popular movies at the front.

In [4]:
#This piece of code is a bit complex. Here it is, step by step:
#1. reindex reorders a dataframe according to a new list of labels
#2. ratings.count() gets the number of non-NaN values per column/movie
#3. sort_values() sorts those counts, descending (because ascending=False)
#4. finally, .index gets the names of the columns/movies in that order
#axis=1 tells Pandas we want to reorder the columns (not the rows)
ratings = ratings.reindex(ratings.count().sort_values(ascending=False).index, axis=1)
ratings.head(3)

title,Forrest Gump (1994),"Shawshank Redemption, The (1994)",Pulp Fiction (1994),"Silence of the Lambs, The (1991)","Matrix, The (1999)",Star Wars: Episode IV - A New Hope (1977),Jurassic Park (1993),Braveheart (1995),Terminator 2: Judgment Day (1991),Schindler's List (1993),...,Shrooms (2007),Siam Sunset (1999),Side by Side (2012),Sightseers (2012),"Signal, The (2007)",Shot Caller (2017),"Signal, The (2014)",Silent Hill: Revelation 3D (2012),Silent Movie (1976),'71 (2014)
userId,,,,,,,,,,,,,,,,,,,,,
1,4.0,,3.0,4.0,5.0,5.0,4.0,4.0,,5.0,...,,,,,,,,,,
2,,3.0,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,0.5,...,,,,,,,,,,
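The reordering step above is easier to follow on a tiny example. This is only a sketch: the three-column frame is hypothetical, and the intermediate names counts and order are introduced here for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical toy user-item matrix: 'B' has 3 ratings, 'C' has 2, 'A' has 1
toy = pd.DataFrame({
    'A': [np.nan, 2.0, np.nan],
    'B': [5.0, 4.0, 3.0],
    'C': [1.0, np.nan, 2.0],
})

counts = toy.count()                               # non-NaN values per column
order = counts.sort_values(ascending=False).index  # column names, most rated first
toy_sorted = toy.reindex(order, axis=1)            # reorder the columns
print(list(toy_sorted.columns))
```

The values themselves do not change; only the column order does.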


We will now find clusters. Unfortunately, the k-means algorithm won't work with NaN values, so we will put a 0 in the empty cells. This is not ideal, because k-means will then treat an unrated movie as if it had received a rating below the lowest possible score, but it is the best we can do for now without getting really complex.

In [5]:
ratings_full = ratings.fillna(0) #fill the NaN with 0
ratings_full.head(3)

title,Forrest Gump (1994),"Shawshank Redemption, The (1994)",Pulp Fiction (1994),"Silence of the Lambs, The (1991)","Matrix, The (1999)",Star Wars: Episode IV - A New Hope (1977),Jurassic Park (1993),Braveheart (1995),Terminator 2: Judgment Day (1991),Schindler's List (1993),...,Shrooms (2007),Siam Sunset (1999),Side by Side (2012),Sightseers (2012),"Signal, The (2007)",Shot Caller (2017),"Signal, The (2014)",Silent Hill: Revelation 3D (2012),Silent Movie (1976),'71 (2014)
userId,,,,,,,,,,,,,,,,,,,,,
1,4.0,0.0,3.0,4.0,5.0,5.0,4.0,4.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
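To see why filling with 0 is not ideal, the sketch below compares it with one possible alternative: filling each empty cell with the movie's mean rating. This is not what the notebook does, and the toy data is hypothetical; it only shows how the two choices shift the averages.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings with NaN for unrated movies
toy = pd.DataFrame({
    'Movie A': [5.0, 4.0, np.nan],
    'Movie B': [np.nan, 2.0, 4.0],
})

zero_filled = toy.fillna(0)           # unrated counts as a score of 0
mean_filled = toy.fillna(toy.mean())  # unrated counts as the movie's average rating

print(zero_filled.mean())  # pulled down by the zeros
print(mean_filled.mean())  # equal to the mean of the observed ratings
```

With zeros, a movie's average mixes how well it was rated with how many users rated it. Mean filling avoids that, but it has its own drawback: a movie that nobody rated has no mean, so its cells would stay NaN.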



Finish the code below. You need to...

1. Pick a suitable number of clusters (somewhere between 4 and 10 will work)
2. Apply the k-means algorithm to the MovieLens user-item matrix that is in the code. Store the cluster predictions in the original ratings dataframe and continue working with that dataframe.
3. Print the number of users per cluster (do you remember the relevant Pandas function?).
4. Calculate the mean rating by user cluster using the Pandas pivot_table function. Pandas will sort alphabetically after making the pivot table, so you will need to reorder your pivot table with my_pivot.reindex(ratings.count().sort_values(ascending=False).index, axis=1). Replace my_pivot with the name of your pivot table.
5. Examine the mean ratings of the top rated movies by user cluster. Can you describe the user clusters in plain language (e.g., ‘simple-minded action movie lover’)? This may be hard…

In [13]:
km = KMeans(n_clusters=5)
X = ratings_full.loc[:,'Forrest Gump (1994)':"'71 (2014)"] #get the X variables from the dataframe
km = km.fit(X) #calculate the cluster centers
ratings_full['cluster'] = km.predict(X) #predict the clusters of each observation and store in the dataframe
ratings_full.head()

title,Forrest Gump (1994),"Shawshank Redemption, The (1994)",Pulp Fiction (1994),"Silence of the Lambs, The (1991)","Matrix, The (1999)",Star Wars: Episode IV - A New Hope (1977),Jurassic Park (1993),Braveheart (1995),Terminator 2: Judgment Day (1991),Schindler's List (1993),...,Siam Sunset (1999),Side by Side (2012),Sightseers (2012),"Signal, The (2007)",Shot Caller (2017),"Signal, The (2014)",Silent Hill: Revelation 3D (2012),Silent Movie (1976),'71 (2014),cluster
userId,,,,,,,,,,,,,,,,,,,,,
1,4.0,0.0,3.0,4.0,5.0,5.0,4.0,4.0,0.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
2,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4
4,0.0,0.0,1.0,5.0,1.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0
5,0.0,3.0,5.0,0.0,0.0,0.0,0.0,4.0,3.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3
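KMeans starts from random cluster centers, so the cluster numbers and sizes in this notebook can differ between runs. The sketch below, on hypothetical random data rather than the real matrix, shows two things: fixing random_state makes the result repeatable, and looping over k while recording inertia_ is one common heuristic (the "elbow" method) for picking the number of clusters. It is an illustration under those assumptions, not the approach the exercise prescribes.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for the user-item matrix: 60 "users", 5 "movies"
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 5, size=(60, 5))

inertias = {}
for k in range(4, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    inertias[k] = model.inertia_  # within-cluster sum of squared distances

for k, value in inertias.items():
    print(k, round(value, 1))

# With a fixed random_state, refitting gives the same cluster labels
labels_a = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_toy)
labels_b = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_toy)
print((labels_a == labels_b).all())
```

A k after which the inertia stops dropping sharply is a reasonable candidate, although on real rating data the bend is often not very clear.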


In [18]:
ratings_full['cluster'].value_counts()

4    387
3     99
0     66
1     48
2     10
Name: cluster, dtype: int64

In [31]:
#pivot_table uses the mean by default, so this gives the mean rating per movie for each cluster
pivot = pd.pivot_table(data=ratings_full, index='cluster')
pivot.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
cluster,,,,,,,,,,,,,,,,,,,,,
0,0.0,0.060606,0.05303,0.0,0.0,0.0,0.318182,0.0,0.022727,0.05303,...,0.0,0.05303,0.0,0.0,0.0,0.515152,0.090909,0.0,0.477273,0.015152
1,0.083333,0.0,0.0,0.0,0.0,0.03125,0.239583,0.0,1.239583,0.145833,...,0.0,0.15625,0.072917,0.0625,0.0,0.416667,0.5,0.083333,0.354167,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.9,0.3,1.3,0.55,...,0.15,0.85,0.3,0.3,0.0,1.5,1.6,0.2,1.55,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.009044,0.01292,0.020672,0.0,0.0323,0.0,0.206718,0.018088,...,0.0,0.033592,0.011628,0.0,0.007752,0.041344,0.052972,0.010336,0.04522,0.0


In [33]:
pivot = pivot.reindex(ratings.count().sort_values(ascending=False).index, axis=1)
pivot

title,Forrest Gump (1994),"Shawshank Redemption, The (1994)",Pulp Fiction (1994),"Silence of the Lambs, The (1991)","Matrix, The (1999)",Star Wars: Episode IV - A New Hope (1977),Jurassic Park (1993),Braveheart (1995),Terminator 2: Judgment Day (1991),Schindler's List (1993),...,Last Train Home (2009),"Last Waltz, The (1978)","Last Wave, The (1977)","Last Wedding, The (Kivenpyörittäjän kylä) (1995)","Last Winter, The (2006)",Last Year's Snow Was Falling (1983),Last of the Dogmen (1995),Late Marriage (Hatuna Meuheret) (2001),Late Night Shopping (2001),'71 (2014)
cluster,,,,,,,,,,,,,,,,,,,,,
0,3.257576,2.825758,3.287879,3.272727,3.287879,3.795455,2.590909,2.227273,2.893939,2.507576,...,0.0,0.068182,0.045455,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,4.03125,3.90625,3.833333,3.114583,4.15625,3.895833,3.114583,2.322917,2.864583,2.8125,...,0.0,0.0,0.0,0.0625,0.083333,0.104167,0.052083,0.072917,0.09375,0.083333
2,4.25,3.85,4.55,4.2,4.15,4.55,3.7,3.5,4.0,3.1,...,0.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,3.247475,3.090909,3.161616,2.712121,0.171717,0.39899,2.782828,3.237374,2.323232,1.909091,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.543928,1.771318,1.366925,1.25323,1.784238,1.394057,0.670543,0.882429,0.751938,1.056848,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Cluster 0: The mean ratings in cluster 0 are average. For the most popular movies they lie between roughly 2 and 4, so around the middle of the rating scale (0.5-5).<br>
Cluster 1: Rates the most popular movies mostly higher than cluster 0. The means for 'Forrest Gump' and 'The Matrix' are above 4, and the others lie between roughly 2.3 and 3.9.<br>
Cluster 2: Loves action and horror movies. They rated 'Terminator 2: Judgment Day', 'Star Wars' and 'The Silence of the Lambs' high (means of 4.0 or more). This is a small cluster of only 10 users.<br>
Cluster 3: Does not seem to care for action movies such as 'Star Wars' and 'The Matrix'. The means for these two movies are close to 0, which, because unrated movies count as 0, probably means that most users in this cluster have not rated them.<br>
Cluster 4: People in cluster 4 seem picky about which movies they like. Their mean ratings are much lower than those of the other clusters for the same movies. Because unrated movies count as 0, this may also mean that these users simply rated fewer movies.<br>
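The cluster means above include a 0 for every movie a user did not rate, so a low mean can mean "disliked" or "rarely rated". One way to separate the two, sketched here on hypothetical toy data, is to group the NaN version of the ratings by cluster: the mean then skips unrated movies, and the share of users who rated each movie can be computed separately. The names toy_ratings and toy_clusters are made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings with NaN for unrated movies
toy_ratings = pd.DataFrame({
    'Movie A': [5.0, np.nan, 4.0, np.nan],
    'Movie B': [np.nan, 2.0, 1.0, 3.0],
}, index=pd.Index([1, 2, 3, 4], name='userId'))

# Hypothetical cluster labels, e.g. as produced by km.predict
toy_clusters = pd.Series([0, 0, 1, 1], index=toy_ratings.index, name='cluster')

# Mean rating per cluster, skipping NaN (only users who rated the movie count)
mean_observed = toy_ratings.groupby(toy_clusters).mean()

# Share of users in each cluster who rated each movie
share_rated = toy_ratings.notna().groupby(toy_clusters).mean()

print(mean_observed)
print(share_rated)
```

In the real notebook the same idea would use ratings (with NaN) together with ratings_full['cluster'], assuming both still share the same userId index.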