# Weekend Movie Trip

In [1]:
import numpy as np
import pandas as pd

## Exploratory Data Analysis and Feature Engineering
This dataset come in a couple variants. I will use the smaller dataset to make exploration and modeling more tractable.

In [2]:
movies_df  = pd.read_csv('../data/ml-latest-small/movies.csv')
ratings_df = pd.read_csv('../data/ml-latest-small/ratings.csv')
tags_df    = pd.read_csv('../data/ml-latest-small/tags.csv')

movies_df

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


In terms of feature engineering, I see two immediate opportunities. The "title" column contains years that can be parsed out and added as a separate feature. Furthermore, the genres can be parsed, and each genre given its own binary column.

In [3]:
# Check that years are always in the expected format
def check_year(s):
    year = s.strip()[-5:-1]
    if year.isdecimal():
        return int(year)
    else:
        print(s)

movies_df['title'].apply(check_year)

Babylon 5
Ready Player One
Hyena Road
The Adventures of Sherlock Holmes and Doctor Watson
Nocturnal Animals
Paterson
Moonlight
The OA
Cosmos
Maria Bamford: Old Baby
Generation Iron 2
Black Mirror


0       1995.0
1       1995.0
2       1995.0
3       1995.0
4       1995.0
         ...  
9737    2017.0
9738    2017.0
9739    2017.0
9740    2018.0
9741    1991.0
Name: title, Length: 9742, dtype: float64

In [4]:
# So not every title has an associated date. I'll just enter that into the new "year" column as missing data.
def parse_title_year(s):
    s = s.strip()
    year = s[-5:-1]
    return (s[:-7], int(year)) if year.isdecimal() else (s, np.nan)

title_year = movies_df['title'].apply(parse_title_year)
title_year

0                                (Toy Story, 1995)
1                                  (Jumanji, 1995)
2                         (Grumpier Old Men, 1995)
3                        (Waiting to Exhale, 1995)
4              (Father of the Bride Part II, 1995)
                           ...                    
9737    (Black Butler: Book of the Atlantic, 2017)
9738                 (No Game No Life: Zero, 2017)
9739                                 (Flint, 2017)
9740          (Bungo Stray Dogs: Dead Apple, 2018)
9741          (Andrew Dice Clay: Dice Rules, 1991)
Name: title, Length: 9742, dtype: object

In [5]:
movies_df[['title','year']] = pd.DataFrame(title_year.tolist(), columns=['title','year'])
movies_df

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995.0
1,2,Jumanji,Adventure|Children|Fantasy,1995.0
2,3,Grumpier Old Men,Comedy|Romance,1995.0
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995.0
4,5,Father of the Bride Part II,Comedy,1995.0
...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic,Action|Animation|Comedy|Fantasy,2017.0
9738,193583,No Game No Life: Zero,Animation|Comedy|Fantasy,2017.0
9739,193585,Flint,Drama,2017.0
9740,193587,Bungo Stray Dogs: Dead Apple,Action|Animation,2018.0


In [6]:
# From the dataset's README:
genres = ['Action', 'Adventure', 'Animation', 'Children\'s', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
          'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

parsed_genres = movies_df['genres'].apply(lambda l : list(map(lambda g: 1 if g in l else 0, genres)))
movies_df[genres] = pd.DataFrame(parsed_genres.tolist(), columns=genres)
movies_df

Unnamed: 0,movieId,title,genres,year,Action,Adventure,Animation,Children's,Comedy,Crime,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995.0,0,1,1,0,1,0,...,1,0,0,0,0,0,0,0,0,0
1,2,Jumanji,Adventure|Children|Fantasy,1995.0,0,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men,Comedy|Romance,1995.0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995.0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
4,5,Father of the Bride Part II,Comedy,1995.0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic,Action|Animation|Comedy|Fantasy,2017.0,1,0,1,0,1,0,...,1,0,0,0,0,0,0,0,0,0
9738,193583,No Game No Life: Zero,Animation|Comedy|Fantasy,2017.0,0,0,1,0,1,0,...,1,0,0,0,0,0,0,0,0,0
9739,193585,Flint,Drama,2017.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9740,193587,Bungo Stray Dogs: Dead Apple,Action|Animation,2018.0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now I turn my attention to the next segment of the dataset.

In [7]:
ratings_df

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


The timestamp is interesting. Looking at the `README.md` for the dataset, this is in standard unix time, which enables us to add all sorts of features. But what could we add that actually has a bearing on our end goal: clustering similar movies? We could add seasons and months to capture the notion of holiday movies. If two movies see a surge of reviews near Christmas, perhaps they are both Christmas movies. But more likely, both just received a december release date.

I choose not to add any such features, because I believe that clustering over this type of data does not indicate movie similarity nor forms a proper basis for movie recommendation.

Finally we have the tag data.

In [8]:
tags_df

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200
...,...,...,...,...
3678,606,7382,for katie,1171234019
3679,606,7936,austere,1173392334
3680,610,3265,gun fu,1493843984
3681,610,3265,heroic bloodshed,1493843978


In [9]:
tags_df['tag'].describe()

count                 3683
unique                1589
top       In Netflix queue
freq                   131
Name: tag, dtype: object

In [10]:
tags_df['tag'] = tags_df['tag'].apply(lambda s : s.lower() if type(s) is str else s)
tags_df['tag'].describe()

count                 3683
unique                1475
top       in netflix queue
freq                   131
Name: tag, dtype: object

These tags are interesting because they are completely user generated. There is no predetermined bank of tags to choose from. However, they aren't as diverse as I'd expect. Less than 10% of them are unique (modulo capitalization). Perhaps users see other tags, and have the ability to add the same tag.

### Combining the Data

How can these three datasets be sensibly combined before clustering?

It is immediately apparent how everything in the movies dataframe can be used. But how about the tags? We can group these by movie. I'll add a column to the movie dataframe for each potential tag, the value of which will be the number of users who assigned that tag. The only information lost in this transformation is which users applied which tags. Tags appear to be largely objective and descriptive (e.g. genre, actor, mood), rather than opinion based (e.g. good/bad), so knowing which users apply which tags is not particularly relevant to us.

In [11]:
all_tags = set()
movie_tags = {}
for (_, r) in tags_df.iterrows():
    all_tags.add(r['tag'])
    if not r['movieId'] in movie_tags:
        movie_tags.update({r['movieId']: {}})
    d = movie_tags[r['movieId']]
    k = r['tag']
    if k in d:
        d[k] += 1
    else:
        d.update({k: 1})

In [12]:
for t in all_tags:
    movies_df[t] = 0
    
for m in movie_tags:
    if m in movies_df['movieId']:
        for t in movie_tags[m]:
            row = movies_df[movies_df['movieId'] == m].index
            col = movies_df.columns.get_loc(t)
            movies_df.iloc[row, col] = movie_tags[m][t]
            
movies_df

Unnamed: 0,movieId,title,genres,year,Action,Adventure,Animation,Children's,Comedy,Crime,...,bruce willis,nonlinear,macbeth,italy,archaeology,darth vader,humor,julianne moore,mental illness,south park
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995.0,0,1,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,2,Jumanji,Adventure|Children|Fantasy,1995.0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men,Comedy|Romance,1995.0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995.0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
4,5,Father of the Bride Part II,Comedy,1995.0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic,Action|Animation|Comedy|Fantasy,2017.0,1,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
9738,193583,No Game No Life: Zero,Animation|Comedy|Fantasy,2017.0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0
9739,193585,Flint,Drama,2017.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9740,193587,Bungo Stray Dogs: Dead Apple,Action|Animation,2018.0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0


Now, to incorporate the ratings. In order to associate them with movies rather than users, I will simply add a column to the movie dataframe signifying the average rating.

In [13]:
ratings = {}
for (_, r) in ratings_df.iterrows():
    if r['movieId'] not in ratings:
        ratings.update({r['movieId']: (0,0)})
    (total, divisor) = ratings[r['movieId']]
    ratings[r['movieId']] = (total + r['rating'], divisor + 1)

movies_df['rating'] = 2.5
rating_col = movies_df.columns.get_loc('rating')
for m in ratings:
    if m in movies_df['movieId']:
        (total, divisor) = ratings[m]
        row = movies_df[movies_df['movieId'] == m].index
        if divisor != 0:
            movies_df.iloc[row, rating_col] = (total / divisor)
        
movies_df

Unnamed: 0,movieId,title,genres,year,Action,Adventure,Animation,Children's,Comedy,Crime,...,nonlinear,macbeth,italy,archaeology,darth vader,humor,julianne moore,mental illness,south park,rating
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995.0,0,1,1,0,1,0,...,0,0,0,0,0,0,0,0,0,3.920930
1,2,Jumanji,Adventure|Children|Fantasy,1995.0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3.431818
2,3,Grumpier Old Men,Comedy|Romance,1995.0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,3.259615
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995.0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,2.357143
4,5,Father of the Bride Part II,Comedy,1995.0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,3.071429
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic,Action|Animation|Comedy|Fantasy,2017.0,1,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,2.500000
9738,193583,No Game No Life: Zero,Animation|Comedy|Fantasy,2017.0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,0,2.500000
9739,193585,Flint,Drama,2017.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,2.500000
9740,193587,Bungo Stray Dogs: Dead Apple,Action|Animation,2018.0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,2.500000


## Modeling

In [14]:
import sklearn.cluster

Before modelling, we must make the final moves in preparing the data. We remove all of the string fields, as well as the movieId, which does not have any meaning beyond unique identification.

In [15]:
del movies_df['movieId']
del movies_df['title']
del movies_df['genres']
movies_df = movies_df.dropna()

data = movies_df.to_numpy()

I elected to use the Mean Shift clustering model, which I selected by following scikit-learn's flowchart.
 - https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

In [19]:
bandwidth = sklearn.cluster.estimate_bandwidth(data, quantile=.1)
meanshift = sklearn.cluster.MeanShift(bandwidth, bin_seeding=True)

meanshift.fit(data)

MeanShift(bandwidth=5.251646139956242, bin_seeding=True, cluster_all=True,
          min_bin_freq=1, n_jobs=None, seeds=None)

In [20]:
labels = meanshift.labels_
labels_unique = np.unique(labels)
clusters = len(labels_unique)

print("Number of clusters: %d" % clusters)

Number of clusters: 8


It is difficult to judge the success of our clustering. The dimensions are much too many to visualize. Observing the number of clusters determined by the algorithm, 8 seems reasonable.