# Clustering Netflix Titles

Markdown practice warm-up:

There's a file named `me_hoy_medoid.png` in this directory. Display the image in this notebook using a markdown cell.

![me_hoy_medoid.png](me_hoy_medoid.png)

## Load

In [1]:
%reload_ext nb_black


In [3]:
import pandas as pd
import numpy as np

from scipy.spatial.distance import pdist, squareform


from pyclustering.cluster.kmedoids import kmedoids

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline


url = "https://raw.githubusercontent.com/AdamSpannbauer/flixable_ml_dsi/master/data/movies_2020_01_23_13_15_04.csv"
movie = pd.read_csv(url)

# Drop rows where genre is na
movie = movie.dropna(subset=["Genre"])

# Proceed with sample of rows to make things run faster for class time
movie = movie.sample(2000, random_state=42)

# Subset down to a small feature set
# fmt: off
drop_columns = ['Poster', 'flixable_url', 'Response', 
                'Awards', 'Rated', 'imdbID', 'DVD', 'Website',
                'BoxOffice', 'Released', 'added_to_netflix',
                'Writer', 'Actors', 'Plot',
                'Metascore', 'Production',
                'totalSeasons', 'Runtime', 'Director',
                'Title', 'Ratings', 'Year', 'imdbRating',
                'imdbVotes']
# fmt: on
movie = movie.drop(columns=drop_columns)


In [4]:
movie.head()

Unnamed: 0,Country,Genre,Language,Type,mpaa_rating
3136,Hong Kong,"Action, Comedy","Cantonese, Mandarin",movie,TV-14
1648,Egypt,"Action, Comedy, Drama",Arabic,movie,TV-14
3641,USA,Drama,English,movie,TV-14
4221,India,Comedy,,movie,TV-PG
158,South Korea,"Comedy, Drama, Family",Korean,series,TV-14



## Preprocess

Create a copy of the dataframe to preserve this original structure for cluster analysis later.

In [5]:
og_movie = movie.copy()


Use [`pd.Series.str.get_dummies()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.get_dummies.html) to dummy encode `'Genre'`, `'Language'`, and `'Country'`.

In [12]:
genre_dummies = movie["Genre"].str.get_dummies(sep=", ")
language_dummies = movie["Language"].str.get_dummies(sep=", ")
country_dummies = movie["Country"].str.get_dummies(sep=", ")
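To see what `str.get_dummies()` does with multi-valued cells, here is a small sketch on made-up data (the `genres` series is hypothetical, not taken from the movie file). Each distinct token becomes its own 0/1 column, and a missing value should come out as a row of all zeros.

```python
import pandas as pd

# Toy stand-in for a column like 'Genre'; the last entry is missing
genres = pd.Series(["Action, Comedy", "Drama", "Comedy, Drama, Family", None])

# Split each cell on ", " and create one indicator column per distinct token
dummies = genres.str.get_dummies(sep=", ")

print(dummies)
```

Note that the separator includes the space. With `sep=","` the tokens would keep their leading space and `" Comedy"` would end up in a different column from `"Comedy"`.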



Combine all 3 dummy dataframes into a single (very wide) dataframe.

In [15]:
str_dummies = pd.concat((genre_dummies, language_dummies, country_dummies), axis=1)
str_dummies.head()

Unnamed: 0,Action,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,Family,Fantasy,...,Thailand,Tunisia,Turkey,UK,USA,Uganda,Ukraine,United Arab Emirates,Uruguay,Zimbabwe
3136,1,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1648,1,0,0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
3641,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,0
4221,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
158,0,0,0,0,1,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
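A quick sketch of why `axis=1` is safe here: `pd.concat` lines rows up by index label, not by position. The two tiny frames below are hypothetical and deliberately list their rows in different orders.

```python
import pandas as pd

# Same index labels, different row order
genre = pd.DataFrame({"Comedy": [1, 0]}, index=[3136, 158])
country = pd.DataFrame({"USA": [0, 1]}, index=[158, 3136])

# axis=1 places the columns side by side, matching rows on their index labels
wide = pd.concat((genre, country), axis=1)

print(wide)
```

All three dummy dataframes in this notebook come from the same `movie` dataframe, so their indexes already match and no rows are introduced or lost.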



* Drop the original `'Genre'`, `'Language'`, and `'Country'` columns from the `movie` dataframe.
* Add the data from `str_dummies` to the `movie` dataframe.

In [16]:
movie = movie.drop(columns=["Genre", "Language", "Country"])
movie = pd.concat((movie, str_dummies), axis=1)
movie.head()

Unnamed: 0,Type,mpaa_rating,Action,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,...,Thailand,Tunisia,Turkey,UK,USA,Uganda,Ukraine,United Arab Emirates,Uruguay,Zimbabwe
3136,movie,TV-14,1,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1648,movie,TV-14,1,0,0,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0
3641,movie,TV-14,0,0,0,0,0,0,0,1,...,0,0,0,0,1,0,0,0,0,0
4221,movie,TV-PG,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
158,series,TV-14,0,0,0,0,1,0,0,1,...,0,0,0,0,0,0,0,0,0,0



Use [`pd.get_dummies()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) to dummy encode `'Type'` and `'mpaa_rating'`.

In [17]:
movie = pd.get_dummies(movie)
movie.head()

Unnamed: 0,Action,Adventure,Animation,Biography,Comedy,Crime,Documentary,Drama,Family,Fantasy,...,mpaa_rating_PG,mpaa_rating_PG-13,mpaa_rating_R,mpaa_rating_TV-14,mpaa_rating_TV-G,mpaa_rating_TV-MA,mpaa_rating_TV-PG,mpaa_rating_TV-Y,mpaa_rating_TV-Y7,mpaa_rating_TV-Y7-FV
3136,1,0,0,0,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1648,1,0,0,0,1,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
3641,0,0,0,0,0,0,0,1,0,0,...,0,0,0,1,0,0,0,0,0,0
4221,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
158,0,0,0,0,1,0,0,1,1,0,...,0,0,0,1,0,0,0,0,0,0
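A small sketch of how `pd.get_dummies()` treats a mixed dataframe, using a made-up `toy` frame. String columns are expanded into `column_value` indicators, while numeric columns pass through untouched. The `dtype=int` argument is an addition for this example: pandas 2.0 and later return boolean dummies by default, and integer 0/1 columns match the output shown above.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Type": ["movie", "series", "movie"],
        "mpaa_rating": ["R", "TV-14", "R"],
        "Drama": [1, 0, 1],  # already numeric, left as-is
    }
)

# Only the string columns are encoded; dtype=int keeps the dummies as 0/1
encoded = pd.get_dummies(toy, dtype=int)

print(encoded)
```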



## Calculate distances

* Use `pdist` and `squareform` to calculate the distance between each pair of rows
    * What distance metric makes the most sense here?

In [18]:
dist = pdist(movie, metric="dice")
dist_mat = squareform(dist)
dist_mat.shape

(2000, 2000)
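One argument for the Dice metric on this data: every column is a 0/1 flag, and Dice dissimilarity only counts features that at least one of the two rows has. Shared absences (for example, two titles that are both *not* in Zulu) do not make rows look similar. The worked example below uses three hypothetical boolean rows and checks one value by hand.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three toy rows of 0/1 flags (hypothetical data)
X = np.array(
    [
        [1, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 0, 1, 1],
    ],
    dtype=bool,
)

# Condensed distances, in the order (0,1), (0,2), (1,2)
d = pdist(X, metric="dice")

# Hand check for rows 0 and 1:
#   both True: 1, only row 0: 1, only row 1: 1
#   dice = (1 + 1) / (2 * 1 + 1 + 1) = 0.5
both = np.sum(X[0] & X[1])
only_one = np.sum(X[0] ^ X[1])
by_hand = only_one / (2 * both + only_one)

# Square, symmetric matrix with zeros on the diagonal
sq = squareform(d)
print(d, by_hand)
print(sq)
```

Rows 0 and 2 share no `True` features, so their distance is the maximum value of 1.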


In [19]:
pd.DataFrame(dist_mat)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999
0,0.000000,0.428571,0.666667,0.636364,0.714286,0.875000,0.714286,0.714286,0.789474,0.833333,...,0.500000,0.764706,0.666667,0.857143,0.833333,0.666667,0.500000,1.000000,0.714286,0.666667
1,0.428571,0.000000,0.500000,0.636364,0.571429,0.875000,0.571429,0.714286,0.684211,0.833333,...,0.500000,0.764706,0.666667,0.714286,0.666667,0.666667,0.500000,1.000000,0.571429,0.666667
2,0.666667,0.500000,0.000000,0.777778,0.666667,0.571429,0.333333,0.833333,0.529412,0.800000,...,0.200000,0.600000,0.800000,0.333333,0.600000,0.400000,0.600000,1.000000,0.333333,0.400000
3,0.636364,0.636364,0.777778,0.000000,0.818182,0.692308,0.818182,1.000000,0.750000,1.000000,...,0.555556,0.714286,0.555556,0.818182,0.777778,0.777778,0.555556,0.714286,0.454545,0.555556
4,0.714286,0.571429,0.666667,0.818182,0.000000,1.000000,0.714286,0.714286,0.578947,0.666667,...,0.666667,0.764706,0.833333,0.857143,0.500000,0.833333,0.666667,0.800000,0.714286,0.833333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,0.666667,0.666667,0.400000,0.777778,0.833333,0.714286,0.500000,0.833333,0.764706,0.800000,...,0.400000,0.733333,0.800000,0.500000,0.800000,0.000000,0.600000,1.000000,0.666667,0.600000
1996,0.500000,0.500000,0.600000,0.555556,0.666667,0.857143,0.666667,0.833333,0.764706,0.800000,...,0.400000,0.733333,0.600000,0.833333,0.800000,0.600000,0.000000,1.000000,0.666667,0.600000
1997,1.000000,1.000000,1.000000,0.714286,0.800000,0.833333,1.000000,0.800000,1.000000,0.500000,...,1.000000,1.000000,1.000000,1.000000,0.750000,1.000000,1.000000,0.000000,1.000000,1.000000
1998,0.714286,0.571429,0.333333,0.454545,0.714286,0.625000,0.571429,1.000000,0.368421,1.000000,...,0.333333,0.529412,0.500000,0.428571,0.666667,0.666667,0.666667,1.000000,0.000000,0.166667



## Cluster with K-medoids

We need to initialize the starting 'medoids' for our clusters.  To do this, `pyclustering` wants us to provide the indices of our starting points.

* Generate `k` random indices from our distance matrix

In [20]:
k = 5


In [21]:
np.random.seed(42)

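nrows = dist_mat.shape[0]
# The upper bound of randint is exclusive, so nrows keeps every index in range.
# randint can repeat a value; the starting medoids should be distinct rows.
init_medoids = np.random.randint(0, nrows, k)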
init_medoids

array([1126, 1459,  860, 1294, 1130])
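An alternative sketch for picking the starting points, assuming we want `k` distinct row positions: sampling without replacement guarantees no index is chosen twice. The values of `nrows` and `k` below are stand-ins matching this notebook, and the indices drawn will differ from the ones shown above.

```python
import numpy as np

nrows = 2000  # stand-in for dist_mat.shape[0]
k = 5

rng = np.random.default_rng(42)

# Draw k distinct positions from 0..nrows-1
init_medoids = rng.choice(nrows, size=k, replace=False)

print(init_medoids)
```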


In [22]:
kmed = kmedoids(
    dist_mat, initial_index_medoids=init_medoids, data_type="distance_matrix"
)

kmed.process()

<pyclustering.cluster.kmedoids.kmedoids at 0x1e4d574b088>
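To get a feel for what k-medoids does with a precomputed distance matrix, here is a minimal NumPy sketch of the alternating "assign, then re-pick the medoid" idea. It is not pyclustering's implementation, whose update rule may differ. The function name `simple_kmedoids` and the toy data are hypothetical.

```python
import numpy as np


def simple_kmedoids(dist_mat, init_medoids, max_iter=100):
    """Alternate between assigning points and re-picking medoids."""
    medoids = np.array(init_medoids)
    for _ in range(max_iter):
        # Assign every point to its nearest current medoid
        labels = np.argmin(dist_mat[:, medoids], axis=1)

        new_medoids = medoids.copy()
        for c in range(len(medoids)):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                continue  # keep the old medoid if a cluster is empty
            # New medoid: the member with the smallest total distance
            # to the other members of its cluster
            within = dist_mat[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]

        if np.array_equal(new_medoids, medoids):
            break  # converged
        medoids = new_medoids

    labels = np.argmin(dist_mat[:, medoids], axis=1)
    return medoids, labels


# Toy 1-D data with two obvious groups
points = np.array([0, 1, 2, 10, 11, 12], dtype=float)
toy_dist = np.abs(points[:, None] - points[None, :])

toy_medoids, toy_labels = simple_kmedoids(toy_dist, init_medoids=[0, 3])
print(toy_medoids, toy_labels)
```

Unlike a k-means centroid, each medoid is an actual row of the data, which is why the cluster centers can later be looked up directly in the original dataframe.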


Use the `.get_medoids()` method to find the index for each cluster center.

In [23]:
medoid_idxs = kmed.get_medoids()
medoid_idxs

[1377, 1459, 860, 992, 1483]


Use the `.predict()` method to output the cluster label for each record in a dataset.

In [24]:
labels = kmed.predict(dist_mat)
labels

array([3, 3, 2, ..., 3, 1, 1], dtype=int64)


Put these labels into both the `og_movie` and `movie` dataframes.

In [25]:
og_movie["label"] = labels
movie["label"] = labels


## Explore Clusters

Use the `medoid_idxs` to pull out our cluster centers from `og_movie`.

In [26]:
medoid_idxs

[1377, 1459, 860, 992, 1483]


In [27]:
og_movie.iloc[medoid_idxs, :]

Unnamed: 0,Country,Genre,Language,Type,mpaa_rating,label
962,USA,"Action, Crime, Drama, Thriller",English,movie,R,0
5774,USA,Comedy,English,movie,TV-MA,1
2771,USA,Documentary,English,movie,TV-14,2
4447,"France, Belgium","Drama, Romance","English, French",movie,TV-MA,3
2919,India,"Drama, Thriller",Hindi,movie,TV-MA,4
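Why `.iloc` rather than `.loc`: the medoid indices are row *positions* in the distance matrix, while the dataframe index still holds the original (shuffled) labels from before sampling. A small sketch with a hypothetical three-row frame shows the difference.

```python
import pandas as pd

# Index labels are not 0..n-1, as happens after sampling rows
df = pd.DataFrame(
    {"Genre": ["Action", "Comedy", "Drama"]}, index=[3136, 1648, 3641]
)

positions = [2, 0]  # positions, as a distance matrix would report them

# .iloc looks rows up by position
by_position = df.iloc[positions]
print(by_position)

# .loc looks rows up by label, and labels 2 and 0 do not exist here
try:
    df.loc[positions]
    loc_failed = False
except KeyError:
    loc_failed = True
print(loc_failed)
```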



Analyze clusters

In [28]:
og_movie["label"].value_counts()

3    877
4    324
0    281
1    273
2    245
Name: label, dtype: int64


In [30]:
clst_avg = movie.groupby("label").mean().T
clst_avg.style.background_gradient(axis=1)

label,0,1,2,3,4
Action,0.338078,0.025641,0.016327,0.142531,0.209877
Adventure,0.156584,0.007326,0.028571,0.111745,0.030864
Animation,0.064057,0.010989,0.040816,0.115165,0.0
Biography,0.046263,0.010989,0.106122,0.051311,0.037037
Comedy,0.192171,0.681319,0.244898,0.296465,0.280864
Crime,0.270463,0.014652,0.036735,0.096921,0.179012
Documentary,0.003559,0.25641,0.653061,0.111745,0.003086
Drama,0.594306,0.208791,0.077551,0.417332,0.679012
Family,0.106762,0.043956,0.081633,0.084379,0.083333
Fantasy,0.117438,0.010989,0.044898,0.114025,0.015432
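How to read this table: because every feature is a 0/1 flag, the mean of a column within a cluster is the share of that cluster's titles that have the feature. A toy sketch with made-up values:

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Comedy": [1, 1, 0, 0],
        "Drama": [0, 1, 1, 1],
        "label": [0, 0, 1, 1],
    }
)

# Mean of a 0/1 column per cluster = proportion of the cluster with that flag.
# Transposing puts features on the rows and clusters on the columns.
avg = toy.groupby("label").mean().T

print(avg)
```

Here every title in cluster 0 is a comedy (1.0) and half of them are dramas (0.5).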



In [33]:
clst_avg.sort_values(0, ascending=False)

label,0,1,2,3,4
English,0.996441,0.981685,0.971429,0.451539,0.117284
Type_movie,0.978648,0.978022,0.832653,0.614595,0.916667
USA,0.932384,0.941392,0.853061,0.179019,0.000000
Drama,0.594306,0.208791,0.077551,0.417332,0.679012
mpaa_rating_R,0.505338,0.047619,0.044898,0.034208,0.000000
...,...,...,...,...,...
Mapudungun,0.000000,0.000000,0.000000,0.001140,0.000000
Bengali,0.000000,0.000000,0.000000,0.005701,0.027778
Cambodia,0.000000,0.000000,0.000000,0.002281,0.000000
Dutch,0.000000,0.000000,0.000000,0.009122,0.000000



In [34]:
clst_avg.sort_values(1, ascending=False)

label,0,1,2,3,4
English,0.996441,0.981685,0.971429,0.451539,0.117284
Type_movie,0.978648,0.978022,0.832653,0.614595,0.916667
USA,0.932384,0.941392,0.853061,0.179019,0.000000
mpaa_rating_TV-MA,0.113879,0.794872,0.081633,0.451539,0.293210
Comedy,0.192171,0.681319,0.244898,0.296465,0.280864
...,...,...,...,...,...
Slovenian,0.003559,0.000000,0.000000,0.001140,0.000000
Southern Sotho,0.003559,0.000000,0.000000,0.000000,0.000000
Chinese,0.007117,0.000000,0.000000,0.011403,0.000000
Swahili,0.003559,0.000000,0.004082,0.000000,0.000000



In [35]:
clst_avg.sort_values(2, ascending=False)

label,0,1,2,3,4
English,0.996441,0.981685,0.971429,0.451539,0.117284
USA,0.932384,0.941392,0.853061,0.179019,0.000000
Type_movie,0.978648,0.978022,0.832653,0.614595,0.916667
Documentary,0.003559,0.256410,0.653061,0.111745,0.003086
mpaa_rating_TV-14,0.067616,0.000000,0.546939,0.250855,0.543210
...,...,...,...,...,...
Romanian,0.003559,0.000000,0.000000,0.007982,0.000000
Romany,0.000000,0.000000,0.000000,0.001140,0.000000
Saami,0.000000,0.000000,0.000000,0.001140,0.000000
Sanskrit,0.000000,0.000000,0.000000,0.000000,0.003086



In [36]:
clst_avg.sort_values(3, ascending=False)

label,0,1,2,3,4
Type_movie,0.978648,0.978022,0.832653,0.614595,0.916667
English,0.996441,0.981685,0.971429,0.451539,0.117284
mpaa_rating_TV-MA,0.113879,0.794872,0.081633,0.451539,0.293210
Drama,0.594306,0.208791,0.077551,0.417332,0.679012
Type_series,0.021352,0.021978,0.167347,0.385405,0.083333
...,...,...,...,...,...
Bosnian,0.000000,0.000000,0.004082,0.000000,0.000000
Mongolian,0.000000,0.000000,0.004082,0.000000,0.000000
Samoa,0.000000,0.000000,0.004082,0.000000,0.000000
Nama,0.003559,0.000000,0.000000,0.000000,0.000000



In [37]:
clst_avg.sort_values(4, ascending=False)

label,0,1,2,3,4
India,0.010676,0.003663,0.012245,0.028506,0.922840
Type_movie,0.978648,0.978022,0.832653,0.614595,0.916667
Drama,0.594306,0.208791,0.077551,0.417332,0.679012
Hindi,0.007117,0.000000,0.012245,0.021665,0.669753
mpaa_rating_TV-14,0.067616,0.000000,0.546939,0.250855,0.543210
...,...,...,...,...,...
Yiddish,0.003559,0.000000,0.000000,0.002281,0.000000
Yoruba,0.000000,0.000000,0.004082,0.004561,0.000000
Zulu,0.003559,0.000000,0.000000,0.001140,0.000000
Gallegan,0.000000,0.000000,0.000000,0.001140,0.000000

