# Clustering Netflix Titles

Markdown practice warm-up:

There's a file named `me_hoy_medoid.png` in this directory. Display the image in this notebook using a markdown cell.

## Load

In [1]:
%reload_ext nb_black


In [2]:
import pandas as pd
import numpy as np

from scipy.spatial.distance import pdist, squareform

!pip install pyclustering
from pyclustering.cluster.kmedoids import kmedoids

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline


url = "https://raw.githubusercontent.com/AdamSpannbauer/flixable_ml_dsi/master/data/movies_2020_01_23_13_15_04.csv"
movie = pd.read_csv(url)

# Drop rows where genre is na
movie = movie.dropna(subset=["Genre"])

# Proceed with sample of rows to make things run faster for class time
movie = movie.sample(2000, random_state=42)

# Subset down to a small feature set
# fmt: off
drop_columns = ['Poster', 'flixable_url', 'Response', 
                'Awards', 'Rated', 'imdbID', 'DVD', 'Website',
                'BoxOffice', 'Released', 'added_to_netflix',
                'Writer', 'Actors', 'Plot',
                'Metascore', 'Production',
                'totalSeasons', 'Runtime', 'Director',
                'Title', 'Ratings', 'Year', 'imdbRating',
                'imdbVotes']
# fmt: on
movie = movie.drop(columns=drop_columns)


In [3]:
movie.head()

| Unnamed: 0 | Country | Genre | Language | Type | mpaa_rating |
|---|---|---|---|---|---|
| 3136 | Hong Kong | Action, Comedy | Cantonese, Mandarin | movie | TV-14 |
| 1648 | Egypt | Action, Comedy, Drama | Arabic | movie | TV-14 |
| 3641 | USA | Drama | English | movie | TV-14 |
| 4221 | India | Comedy | | movie | TV-PG |
| 158 | South Korea | Comedy, Drama, Family | Korean | series | TV-14 |



## Preprocess

Create a copy of the dataframe to preserve this original structure for cluster analysis later.

In [4]:
og_movie = movie.copy()


Use [`pd.Series.str.get_dummies()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.get_dummies.html) to dummy encode `'Genre'`, `'Language'`, and `'Country'`.

In [5]:
genre_dummies = ____


In [None]:
language_dummies = ____

In [None]:
country_dummies = ____
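
As a rough illustration of what `str.get_dummies()` produces, here is a minimal sketch on a toy series rather than the real `movie` columns. The series contents and the name `toy_genre_dummies` are made up for the example. The separator `", "` is an assumption based on how the values look in the `movie.head()` output above.

```python
import pandas as pd

# Toy stand-in for a column like 'Genre': several labels packed into one string
genre = pd.Series(["Action, Comedy", "Drama", "Comedy, Drama, Family"])

# sep=", " splits on comma-plus-space so labels don't keep a leading space
toy_genre_dummies = genre.str.get_dummies(sep=", ")
print(toy_genre_dummies)
```

Each distinct label becomes its own 0/1 column, and a row can have several columns set to 1.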

Combine all 3 dummy dataframes into a single (very wide) dataframe.

In [None]:
str_dummies = ____
str_dummies.head()
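
One possible way to combine several dummy dataframes is `pd.concat` with `axis=1`. The sketch below uses small hypothetical frames (`toy_genre`, `toy_language`, `toy_country`) that share a row index; it is not the exercise solution itself.

```python
import pandas as pd

# Hypothetical toy dummy frames sharing the same row index
idx = [3136, 1648]
toy_genre = pd.DataFrame({"Comedy": [1, 0], "Drama": [0, 1]}, index=idx)
toy_language = pd.DataFrame({"Arabic": [0, 1], "Cantonese": [1, 0]}, index=idx)
toy_country = pd.DataFrame({"Egypt": [0, 1], "Hong Kong": [1, 0]}, index=idx)

# axis=1 places the frames side by side, aligning rows on the index
toy_str_dummies = pd.concat([toy_genre, toy_language, toy_country], axis=1)
print(toy_str_dummies.shape)
```

Because alignment happens on the index, the frames should all come from the same rows of the same source dataframe.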

* Drop the original `'Genre'`, `'Language'`, and `'Country'` columns from the `movie` dataframe.
* Add the data from `str_dummies` to the `movie` dataframe

In [None]:
movie = movie.drop(columns=["Genre", "Language", "Country"])
movie = pd.concat((movie, str_dummies), axis=1)
movie.head()

Use [`pd.get_dummies()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) to dummy encode `'Type'` and `'mpaa_rating'`.

In [None]:
movie = pd.get_dummies(movie)
movie.head()
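
To see what `pd.get_dummies()` does when called on a whole dataframe, here is a small sketch on made-up data (the `toy` frame is hypothetical). String columns are expanded into one indicator column per category, while numeric columns pass through unchanged.

```python
import pandas as pd

toy = pd.DataFrame({
    "Type": ["movie", "movie", "series"],
    "mpaa_rating": ["TV-14", "TV-PG", "TV-14"],
    "Comedy": [1, 0, 1],  # already numeric, so it is left untouched
})

# Only the string columns are expanded; new names are '<column>_<category>'
toy_encoded = pd.get_dummies(toy)
print(toy_encoded.columns.tolist())
```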

## Calculate distances

* Use `pdist` and `squareform` to calculate the distance between each pair of rows
    * What distance metric makes the most sense here?

In [None]:
dist = pdist(movie, metric="____")
dist_mat = squareform(dist)
dist_mat.shape
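
For 0/1 indicator data, Jaccard distance is one candidate metric (Hamming and Dice are others). The sketch below uses Jaccard on a tiny hypothetical boolean array purely to show how `pdist` and `squareform` fit together. It is not meant as the definitive answer to the question above.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical 0/1 dummy rows: 3 titles x 4 indicator columns
toy = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
], dtype=bool)

# Jaccard: share of mismatched columns among columns where either row is 1
toy_dist = pdist(toy, metric="jaccard")

# squareform expands the condensed vector into a symmetric n x n matrix
toy_dist_mat = squareform(toy_dist)
print(toy_dist_mat)
```

`pdist` returns only the `n * (n - 1) / 2` pairwise distances. `squareform` rebuilds the full matrix with zeros on the diagonal.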

## Cluster with K-medoids

We need to initialize the starting 'medoids' for our clusters.  To do this, `pyclustering` wants us to provide the indices of our starting points.

* Generate `k` random indices from our distance matrix

In [None]:
k = 5

In [None]:
np.random.seed(42)

nrows = dist_mat.shape[0]
# Upper bound is exclusive, so valid row positions are 0..nrows-1
init_medoids = np.random.randint(0, nrows, k)
init_medoids
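
Drawing indices independently can, in principle, pick the same row twice. A possible alternative is to sample without replacement, as in the sketch below. The names `toy_nrows` and `toy_init_medoids` are placeholders, with `toy_nrows` standing in for `dist_mat.shape[0]`.

```python
import numpy as np

k = 5
toy_nrows = 2000  # stand-in for dist_mat.shape[0]

rng = np.random.default_rng(42)

# replace=False guarantees k distinct row positions in [0, toy_nrows)
toy_init_medoids = rng.choice(toy_nrows, size=k, replace=False)
print(toy_init_medoids)
```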

In [None]:
kmed = kmedoids(
    dist_mat, initial_index_medoids=init_medoids, data_type="distance_matrix"
)

kmed.process()

Use the `.get_medoids()` method to find the index for each cluster center.

In [None]:
medoid_idxs = kmed.get_medoids()
medoid_idxs

Use the `.predict()` method to output the cluster label for each record in a dataset.

In [None]:
labels = kmed.predict(dist_mat)
labels
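
As a cross-check on the labels, one option with a precomputed distance matrix is to assign each row to its nearest medoid directly with NumPy. This is a sketch that does not use `pyclustering`, and it runs on a made-up 4 x 4 matrix. Whether it reproduces the `.predict()` output exactly is not verified here.

```python
import numpy as np

# Hypothetical 4x4 symmetric distance matrix and two medoid row positions
toy_dist_mat = np.array([
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.1],
    [0.8, 0.9, 0.1, 0.0],
])
toy_medoid_idxs = [0, 3]

# Column j holds every row's distance to medoid j; argmin picks the closest one
toy_labels = toy_dist_mat[:, toy_medoid_idxs].argmin(axis=1)
print(toy_labels)
```

The resulting label `j` refers to the `j`-th entry of the medoid list, not to the medoid's row position.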

Put these labels into both the `og_movie` and `movie` dataframes.

In [None]:
og_movie["label"] = labels
movie["label"] = labels

## Explore Clusters

Use the `medoid_idxs` to pull out our cluster centers from `og_movie`.

In [None]:
medoid_idxs

In [None]:
og_movie.iloc[medoid_idxs, :]
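
The medoid indices are row positions in the distance matrix, not index labels. After sampling, the dataframe index is no longer `0..n-1`, so positional lookup with `.iloc` is the appropriate choice. A small sketch with made-up data (`toy_og` and `toy_medoid_idxs` are hypothetical):

```python
import pandas as pd

# Index labels mimic a sampled dataframe: they are not 0, 1, 2
toy_og = pd.DataFrame(
    {"Country": ["Hong Kong", "Egypt", "USA"]},
    index=[3136, 1648, 3641],
)
toy_medoid_idxs = [0, 2]  # positions, as a distance matrix would report them

# .iloc selects by position; .loc[[0, 2]] would raise a KeyError here
toy_centers = toy_og.iloc[toy_medoid_idxs, :]
print(toy_centers)
```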

Analyze clusters
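
One possible starting point for this analysis is to look at cluster sizes and at how a categorical column is distributed within each cluster. The sketch below uses a small made-up frame (`toy_og`) in place of `og_movie`.

```python
import pandas as pd

toy_og = pd.DataFrame({
    "Type": ["movie", "movie", "series", "series", "movie"],
    "mpaa_rating": ["TV-14", "TV-PG", "TV-14", "TV-14", "TV-14"],
    "label": [0, 0, 1, 1, 0],
})

# Number of rows assigned to each cluster
sizes = toy_og["label"].value_counts().sort_index()

# Row-normalized share of each Type within each cluster
type_share = pd.crosstab(toy_og["label"], toy_og["Type"], normalize="index")

print(sizes)
print(type_share)
```

The same `pd.crosstab` call can be repeated for other columns, such as `mpaa_rating`, to see which features separate the clusters.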