# Weekend movie trip

## Blockbuster or art film?

#### Author: Ruturaj Kiran Vaidya

#### References:

<ul>
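<li>Understanding K-means Clustering in Machine Learning: <a href="https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1">https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1</a></li>
<li>Sklearn kmeans: <a href="https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html">https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html</a></li>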
</ul>

In [1]:
# imports
import random
import pandas as pd
import numpy as np

import sys
sys.path.append('/usr/local/lib/python3.7/site-packages')

from plotly import graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

# sklearn
from sklearn.cluster import KMeans

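<p>Datasets can be downloaded here: <a href="https://grouplens.org/datasets/movielens">https://grouplens.org/datasets/movielens</a></p>
<p>Particular dataset link: <a href="http://files.grouplens.org/datasets/movielens/ml-latest-small.zip">http://files.grouplens.org/datasets/movielens/ml-latest-small.zip</a></p>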

The dataset is divided into movies, ratings, tags and links. We don't need links: it only maps each movieId to
the corresponding IMDb and TMDb ids.

### Importing Datasets

In [2]:
# importing movies, ratings and tags
movies = pd.read_csv("../data/external/ml-latest-small/movies.csv")
ratings = pd.read_csv("../data/external/ml-latest-small/ratings.csv")
tags = pd.read_csv("../data/external/ml-latest-small/tags.csv")

### Feature Analysis

In [3]:
movies.tail()

Unnamed: 0,movieId,title,genres
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation
9741,193609,Andrew Dice Clay: Dice Rules (1991),Comedy


In [4]:
ratings.tail()

Unnamed: 0,userId,movieId,rating,timestamp
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352
100835,610,170875,3.0,1493846415


In [5]:
# We don't need userid and timestamp
del ratings['userId']
del ratings['timestamp']
ratings = ratings.groupby("movieId").mean().reset_index()
ratings.tail()

Unnamed: 0,movieId,rating
9719,193581,4.0
9720,193583,3.5
9721,193585,3.5
9722,193587,3.5
9723,193609,4.0
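A mean rating on its own hides how many ratings it is based on; a movie with a single 5.0 looks better than one with hundreds of 4.5s. The following is a minimal sketch, on made-up toy data in the same layout as ratings.csv, of computing the mean together with the rating count. The names `toy_ratings`, `per_movie` and `n_ratings` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy ratings in the same layout as ratings.csv (values are made up for illustration)
toy_ratings = pd.DataFrame({
    "userId":    [1, 2, 3, 1, 2],
    "movieId":   [10, 10, 10, 20, 30],
    "rating":    [4.0, 5.0, 3.0, 2.5, 4.5],
    "timestamp": [0, 0, 0, 0, 0],
})

# Selecting only the rating column before aggregating means userId and
# timestamp never enter the mean and do not have to be deleted first
per_movie = (
    toy_ratings.groupby("movieId")["rating"]
    .agg(rating="mean", n_ratings="count")
    .reset_index()
)
print(per_movie)
```

The count could later be used to filter out movies with very few ratings, although the analysis here does not do that.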


In [6]:
tags.tail()

Unnamed: 0,userId,movieId,tag,timestamp
3678,606,7382,for katie,1171234019
3679,606,7936,austere,1173392334
3680,610,3265,gun fu,1493843984
3681,610,3265,heroic bloodshed,1493843978
3682,610,168248,Heroic Bloodshed,1493844270


In [7]:
# timestamp and userId are not needed
del tags['timestamp']
del tags['userId']

# Also converting all tags into lowercase
tags['tag'] = tags['tag'].str.lower()
print(tags.tail())
print(f"tags shape: {tags.shape}")
print(f"Unique movies in tags dataset: {len(tags['movieId'].unique())}")
print(f"Unique movies in movies dataset: {len(movies['movieId'].unique())}")

      movieId               tag
3678     7382         for katie
3679     7936           austere
3680     3265            gun fu
3681     3265  heroic bloodshed
3682   168248  heroic bloodshed
tags shape: (3683, 2)
Unique movies in tags dataset: 1572
Unique movies in movies dataset: 9742


Notice that only 1572 of the 9742 movies (about 16%) have tags. We therefore leave tags out of the analysis: merging them in would produce NaN values for most movies.
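The tag coverage can also be computed directly instead of comparing the two printed counts. This is a small sketch on toy data; `toy_movies`, `toy_tags` and `coverage` are illustrative names, and the values are made up.

```python
import pandas as pd

# Toy stand-ins for the movies and tags dataframes (made-up values)
toy_movies = pd.DataFrame({"movieId": [1, 2, 3, 4, 5]})
toy_tags = pd.DataFrame({"movieId": [1, 1, 3], "tag": ["pixar", "fun", "moldy"]})

# True for every movie that appears at least once in the tags table
tagged = toy_movies["movieId"].isin(toy_tags["movieId"])

# The mean of a boolean Series is the fraction of True values
coverage = tagged.mean()
print(f"{coverage:.0%} of movies have at least one tag")
```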

In [8]:
# We will merge the dataframes on movieId, because a single table is easier to work with
# combined = pd.merge(movies, tags, on='movieId', how='left')
combined = pd.merge(movies, ratings, on='movieId', how='left')
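A left merge keeps every movie, so any movie without ratings ends up with NaN in the rating column. The sketch below, on toy data, uses the `indicator` parameter of `pd.merge` to list such rows; `toy_movies`, `toy_ratings` and `unrated` are hypothetical names.

```python
import pandas as pd

# Toy stand-ins for movies and the per-movie average ratings (made-up values)
toy_movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
toy_ratings = pd.DataFrame({"movieId": [1, 3], "rating": [4.0, 2.5]})

# indicator=True adds a "_merge" column telling where each row came from
merged = pd.merge(toy_movies, toy_ratings, on="movieId", how="left", indicator=True)

# "left_only" rows are movies that have no matching rating
unrated = merged[merged["_merge"] == "left_only"]
print(unrated[["movieId", "title"]])
```

These are the rows that a later `dropna()` removes, so it helps to know how many there are before clustering.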

In [9]:
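# Note: the result is not assigned, so `combined` itself is unchanged here;
# rows with a missing rating are dropped later, just before clustering
combined.drop_duplicates().dropna()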
combined.head()

Unnamed: 0,movieId,title,genres,rating
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.92093
1,2,Jumanji (1995),Adventure|Children|Fantasy,3.431818
2,3,Grumpier Old Men (1995),Comedy|Romance,3.259615
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,2.357143
4,5,Father of the Bride Part II (1995),Comedy,3.071429


In [10]:
# Let's see how many features and records are in this dataset
print(combined.shape)

(9742, 4)


In [11]:
# First we will split the genres strings on "|"
# and collect the unique genres
genres = {g for s in combined['genres'] for g in s.split("|")}
print(genres)
print(len(genres))

{'Romance', 'Sci-Fi', 'Drama', 'Documentary', 'IMAX', 'Thriller', 'Horror', 'Western', 'Comedy', 'Fantasy', 'Mystery', 'Animation', 'Crime', 'Musical', 'War', 'Film-Noir', 'Adventure', 'Action', 'Children', '(no genres listed)'}
20


In [12]:
# Preparing to add one indicator column per genre
# Split on "|" so a genre matches only as a whole entry, not as a substring
genres_dic = {j: [1 if j in i.split("|") else 0 for i in combined["genres"]] for j in genres}

# Since Python 3.7, dictionaries preserve insertion order
# ref: https://docs.python.org/3.7/library/stdtypes.html#typesmapping
data = [go.Bar(
            x=list(genres_dic.keys()),
            y=[sum(i) for i in genres_dic.values()]
    )]
fig = go.Figure(data=data)
fig.update_layout(
    title="Movies per Genre",
    xaxis_title="Genres",
    yaxis_title="Number of Movies",
)

iplot(fig)

# Observed that there is a "(no genres listed)" type - removing it as it won't be helpful
# in our analysis
del genres_dic['(no genres listed)']

# Adding the new genres to the dataframe
for k,v in genres_dic.items():
    combined[k]=v
# Dropping rows with no genres listed
combined = combined[combined.genres != "(no genres listed)"]
# We don't need genres column anymore
del combined['genres']
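The same indicator columns can also be produced with pandas' built-in `Series.str.get_dummies`, which splits on a separator and returns one 0/1 column per value. This is a sketch on toy data, not the code used above; `toy`, `indicators` and `encoded` are illustrative names.

```python
import pandas as pd

# Toy frame with a pipe-separated genres column (made-up rows)
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": ["Action|Comedy", "Drama", "(no genres listed)"],
})

# Drop rows without genres first, so no "(no genres listed)" column is created
toy = toy[toy["genres"] != "(no genres listed)"]

# One 0/1 column per genre; column names come out sorted alphabetically
indicators = toy["genres"].str.get_dummies(sep="|")

# Replace the original genres column with the indicator columns
encoded = pd.concat([toy.drop(columns="genres"), indicators], axis=1)
print(encoded)
```

Note that the column order differs from the dictionary approach above, where it follows the iteration order of the `genres` set.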

In [13]:
print(combined.shape)

(9708, 22)


In [14]:
combined.tail()

Unnamed: 0,movieId,title,rating,Mystery,Comedy,Action,Sci-Fi,Documentary,IMAX,War,...,Fantasy,Horror,Adventure,Children,Crime,Musical,Film-Noir,Animation,Western,Drama
9737,193581,Black Butler: Book of the Atlantic (2017),4.0,0,1,1,0,0,0,0,...,1,0,0,0,0,0,0,1,0,0
9738,193583,No Game No Life: Zero (2017),3.5,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,1,0,0
9739,193585,Flint (2017),3.5,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
9740,193587,Bungo Stray Dogs: Dead Apple (2018),3.5,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
9741,193609,Andrew Dice Clay: Dice Rules (1991),4.0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Now using a clustering algorithm

In [15]:
# Select the rating column and the genre indicator columns computed above
# (iloc[:, 2:] skips movieId and title)
# random_state fixes the seed so that the clustering is reproducible, see:
# https://scikit-learn.org/stable/glossary.html#term-random-state
X = combined.replace([np.inf, -np.inf], np.nan).dropna().iloc[:,2:].to_numpy()
kmeans = KMeans(n_clusters=20, random_state=4).fit(X)
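The number of clusters is a free parameter of k-means. One common heuristic for picking it is the elbow method: fit the model for several values of k and look for the point where the inertia (the within-cluster sum of squared distances, exposed by scikit-learn as `inertia_`) stops dropping quickly. The sketch below runs on synthetic data, not on the movie features; `X_demo`, the three blob centers and the range of k are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Three well-separated synthetic blobs standing in for a real feature matrix
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit k-means for several k and record the inertia of each fit
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=4).fit(X_demo)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

On data like this the inertia falls sharply up to k=3 and only slightly afterwards. On real data the elbow is often less clear, so the heuristic is a guide rather than a rule.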

In [16]:
# Centroid values

kmeans.cluster_centers_[5]

array([ 3.87112946e+00,  3.84254920e-02,  4.32986980e-15,  3.93626992e-02,
        4.77975633e-02,  5.82867088e-16,  4.68603561e-03,  2.84494650e-16,
        1.80411242e-15,  9.46579194e-02,  3.37394564e-02,  1.68697282e-02,
        6.27928772e-02,  2.06185567e-02, -6.80011603e-16,  2.43673852e-02,
        1.03092784e-02,  7.49765698e-03,  2.24929709e-02,  1.00000000e+00])

In [17]:
kmeans.labels_

array([ 8,  8,  1, ...,  5, 10, 18], dtype=int32)

In [18]:
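# kmeans.labels_ follows the rows of X (missing ratings dropped), so titles come from the same filtered frame
titles = combined.replace([np.inf, -np.inf], np.nan).dropna()["title"]
comb_dic = {i: [] for i in kmeans.labels_}
for title, label in zip(titles, kmeans.labels_):
    comb_dic[label].append(title)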

In [19]:
print([len(i) for i in comb_dic.values()])

[407, 468, 338, 908, 258, 360, 512, 729, 1067, 430, 505, 271, 659, 405, 244, 493, 389, 410, 430, 407]
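Cluster labels are returned in the row order of the matrix passed to `fit`, so they only line up with titles taken from the same filtered rows. One way to make that alignment hard to break is to store the label as a column of the filtered dataframe. This is a sketch on toy data; `toy`, `clean` and `clusters` are hypothetical names, and the two-cluster setup is chosen only to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy frame in the same column layout: ids, title, rating, genre indicators
toy = pd.DataFrame({
    "movieId": [1, 2, 3, 4, 5],
    "title": ["A", "B", "C", "D", "E"],
    "rating": [4.5, np.nan, 4.4, 1.0, 1.2],
    "Comedy": [1, 1, 1, 0, 0],
    "Drama": [0, 0, 0, 1, 1],
})

# Filter once, then take both the features and the titles from the same frame
clean = toy.dropna().copy()
features = clean.iloc[:, 2:].to_numpy()

# Storing the label next to the title keeps the two aligned row by row
clean["cluster"] = KMeans(n_clusters=2, n_init=10, random_state=4).fit_predict(features)

# Map each cluster label to the list of titles in it
clusters = clean.groupby("cluster")["title"].apply(list).to_dict()
print(clusters)
```

Movie "B" has no rating, so it is dropped before fitting and appears in no cluster.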


In [20]:
# For a recommendation, pick a random movie from the same cluster as the given title
# Note that the movie must be present in our dataset
# This may return the same movie as a recommendation lol, but try again if it does;
# the chance of that is very low :-)
def recommend(x):
    # Look the cluster up through comb_dic so titles and labels stay consistent
    return random.choice(next(v for v in comb_dic.values() if x in v))

In [21]:
recommend("Toy Story (1995)")

'Snow White and the Seven Dwarfs (1937)'
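If getting the query movie back is undesirable, it can be excluded before sampling. The following is a sketch of that variant on a toy cluster dictionary; `recommend_other` and `toy_clusters` are hypothetical names, not part of the notebook.

```python
import random

# Toy cluster dictionary in the same shape as comb_dic: label -> list of titles
toy_clusters = {
    0: ["Toy Story (1995)", "Snow White and the Seven Dwarfs (1937)", "Bambi (1942)"],
    1: ["Heat (1995)"],
}

def recommend_other(title, clusters):
    """Return a random title from the same cluster, never the query itself."""
    for members in clusters.values():
        if title in members:
            # Remove the query title from the candidate pool
            candidates = [m for m in members if m != title]
            # A cluster with a single movie has nothing else to offer
            return random.choice(candidates) if candidates else None
    raise KeyError(f"{title!r} is not in any cluster")

print(recommend_other("Toy Story (1995)", toy_clusters))
```

Returning `None` for a single-movie cluster and raising `KeyError` for an unknown title are design choices of this sketch; other conventions would work equally well.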