# Weekend Movie Trip

In [36]:
import numpy as np
import pandas as pd

## Exploratory Data Analysis and Feature Engineering
This dataset comes in a couple of variants. I will start with the smaller one to familiarize myself with the data before applying models to the full dataset.

In [6]:
movies_df  = pd.read_csv('../data/ml-latest-small/movies.csv')
ratings_df = pd.read_csv('../data/ml-latest-small/ratings.csv')
tags_df    = pd.read_csv('../data/ml-latest-small/tags.csv')
links_df   = pd.read_csv('../data/ml-latest-small/links.csv')

movies_df

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy
...,...,...,...
9737,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy
9738,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy
9739,193585,Flint (2017),Drama
9740,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation


It seems that the "movieId" field not only starts at 1 rather than 0, but at some point also departs sharply from the row number. I would conjecture this to be an artifact of the sampling strategy used to reduce this dataset from the original, except that the ID reaches nearly 200,000 while even the full dataset contains only roughly 58,000 movies. Regardless, it shouldn't affect our analysis so long as the IDs are unique and consistent across the CSV files.
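The "unique and consistent" assumption can be checked directly. The following is a minimal sketch on hypothetical miniature stand-ins for `movies_df` and `ratings_df` (the frames and their values are made up for illustration); the same two expressions should apply to the real frames.

```python
import pandas as pd

# Hypothetical miniature stand-ins for movies_df and ratings_df.
movies = pd.DataFrame({"movieId": [1, 2, 193581], "title": ["A", "B", "C"]})
ratings = pd.DataFrame(
    {"userId": [1, 1, 2], "movieId": [1, 193581, 2], "rating": [4.0, 5.0, 3.5]}
)

# Unique: no movieId appears twice in the movies table.
ids_unique = movies["movieId"].is_unique

# Consistent: every movieId referenced by a rating exists in the movies table.
ids_consistent = ratings["movieId"].isin(movies["movieId"]).all()

print(ids_unique, ids_consistent)
```

Gaps in the ID sequence do not matter for either check; only duplicates or dangling references would.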

In terms of feature engineering, I see two immediate opportunities. The "title" column contains years that can be parsed out and added as a separate feature. Furthermore, the genres can be parsed, and each genre given its own binary column.

In [44]:
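# Check that every title ends with a four-digit year in parentheses, e.g. "Toy Story (1995)".
# Titles that do not match are printed and yield None (shown as NaN in the result).
def check_year(s):
    year = s.strip()[-5:-1]
    if year.isdecimal():
        return int(year)
    print(s)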

movies_df['title'].apply(check_year)

Babylon 5
Ready Player One
Hyena Road
The Adventures of Sherlock Holmes and Doctor Watson
Nocturnal Animals
Paterson
Moonlight
The OA
Cosmos
Maria Bamford: Old Baby
Generation Iron 2
Black Mirror


0       1995.0
1       1995.0
2       1995.0
3       1995.0
4       1995.0
         ...  
9737    2017.0
9738    2017.0
9739    2017.0
9740    2018.0
9741    1991.0
Name: title, Length: 9742, dtype: float64

In [52]:
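# Not every title ends with a year, so the new "year" column gets NaN for those titles.
# Note: this cell expects the raw titles. The output shown here (all NaN) comes from
# re-running it after the next cell had already stripped the years from "title".
def parse_title_year(s):
    s = s.strip()
    year = s[-5:-1]
    # s[:-7] drops the trailing " (YYYY)" suffix
    return (s[:-7], int(year)) if year.isdecimal() else (s, np.nan)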

title_year = movies_df['title'].apply(parse_title_year)
title_year

0                                (Toy Story, nan)
1                                  (Jumanji, nan)
2                         (Grumpier Old Men, nan)
3                        (Waiting to Exhale, nan)
4              (Father of the Bride Part II, nan)
                          ...                    
9737    (Black Butler: Book of the Atlantic, nan)
9738                 (No Game No Life: Zero, nan)
9739                                 (Flint, nan)
9740          (Bungo Stray Dogs: Dead Apple, nan)
9741          (Andrew Dice Clay: Dice Rules, nan)
Name: title, Length: 9742, dtype: object

In [51]:
movies_df[['title','year']] = pd.DataFrame(title_year.tolist(), columns=['title','year'])
movies_df

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995.0
1,2,Jumanji,Adventure|Children|Fantasy,1995.0
2,3,Grumpier Old Men,Comedy|Romance,1995.0
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995.0
4,5,Father of the Bride Part II,Comedy,1995.0
...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic,Action|Animation|Comedy|Fantasy,2017.0
9738,193583,No Game No Life: Zero,Animation|Comedy|Fantasy,2017.0
9739,193585,Flint,Drama,2017.0
9740,193587,Bungo Stray Dogs: Dead Apple,Action|Animation,2018.0
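
The slicing approach above relies on the year occupying exactly the last six characters. As a hedged alternative, the sketch below uses a regular expression with `Series.str.extract`, which leaves titles without a trailing `(YYYY)` untouched. The sample titles are a hypothetical subset chosen for illustration, not the full column.

```python
import pandas as pd

# Hypothetical sample of raw titles, with and without a trailing year.
titles = pd.Series(["Toy Story (1995)", "Babylon 5", "Black Mirror", " Flint (2017) "])

# Capture the title lazily, then an optional " (YYYY)" suffix at the end.
pattern = r"^(?P<title>.*?)(?:\s*\((?P<year>\d{4})\))?$"
parts = titles.str.strip().str.extract(pattern)

# Years come back as strings (or missing); convert them to numbers with NaN for gaps.
parts["year"] = pd.to_numeric(parts["year"])
print(parts)
```

Unlike the slicing version, running this twice on already-stripped titles would still keep the title text intact and simply report the year as missing.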


In [66]:
# Genre list from the dataset's README. The list as copied spells one genre "Children's",
# but the genres column uses "Children", so the name below matches the data.
genres = ['Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
          'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

# Split on "|" so each genre is matched as a whole token rather than as a substring.
parsed_genres = movies_df['genres'].apply(lambda s: [int(g in s.split('|')) for g in genres])
movies_df[genres] = pd.DataFrame(parsed_genres.tolist(), columns=genres, index=movies_df.index)
movies_df

Unnamed: 0,movieId,title,genres,year,Action,Adventure,Animation,Children,Comedy,Crime,...,Fantasy,Film-Noir,Horror,Musical,Mystery,Romance,Sci-Fi,Thriller,War,Western
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995.0,0,1,1,1,1,0,...,1,0,0,0,0,0,0,0,0,0
1,2,Jumanji,Adventure|Children|Fantasy,1995.0,0,1,0,1,0,0,...,1,0,0,0,0,0,0,0,0,0
2,3,Grumpier Old Men,Comedy|Romance,1995.0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995.0,0,0,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0
4,5,Father of the Bride Part II,Comedy,1995.0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9737,193581,Black Butler: Book of the Atlantic,Action|Animation|Comedy|Fantasy,2017.0,1,0,1,0,1,0,...,1,0,0,0,0,0,0,0,0,0
9738,193583,No Game No Life: Zero,Animation|Comedy|Fantasy,2017.0,0,0,1,0,1,0,...,1,0,0,0,0,0,0,0,0,0
9739,193585,Flint,Drama,2017.0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9740,193587,Bungo Stray Dogs: Dead Apple,Action|Animation,2018.0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
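
As an alternative to listing genre names by hand, pandas can derive the binary columns from the data itself with `Series.str.get_dummies`. The sketch below runs on a hypothetical three-row sample of the `genres` column; note that the resulting columns are whatever tokens occur in the data, sorted alphabetically, rather than a fixed list from the README.

```python
import pandas as pd

# Hypothetical sample of the pipe-separated genres column.
genres_col = pd.Series(
    ["Adventure|Animation|Children|Comedy|Fantasy", "Comedy|Romance", "Drama"]
)

# One 0/1 indicator column per distinct token found between the "|" separators.
indicators = genres_col.str.get_dummies(sep="|")
print(indicators)
```

Because the column names come from the data, a spelling mismatch between the documentation and the CSV cannot silently produce an all-zero column.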


Now I turn my attention to the next part of the dataset: the ratings.

In [16]:
ratings_df

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931
...,...,...,...,...
100831,610,166534,4.0,1493848402
100832,610,168248,5.0,1493850091
100833,610,168250,5.0,1494273047
100834,610,168252,5.0,1493846352


The timestamp is interesting. According to the dataset's `README.md`, it is in standard Unix time (seconds since the epoch), which enables us to add all sorts of features. But what could we add that actually has a bearing on our end goal of clustering similar movies? We could add seasons and months to capture the notion of holiday movies: if two movies see a surge of reviews near Christmas, perhaps they are both Christmas movies. More likely, though, both simply had a December release date.
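
For reference, deriving such calendar features would be straightforward. The following is a minimal sketch, assuming the timestamps are seconds since the Unix epoch in UTC; the three timestamps are taken from the ratings shown above, and the column names `rated_at` and `month` are hypothetical.

```python
import pandas as pd

# Three timestamps copied from the ratings table above.
ratings = pd.DataFrame(
    {"movieId": [1, 3, 166534], "timestamp": [964982703, 964981247, 1493848402]}
)

# Interpret the integers as seconds since the epoch and attach the UTC timezone.
ratings["rated_at"] = pd.to_datetime(ratings["timestamp"], unit="s", utc=True)

# Calendar month of each rating (1 = January, 12 = December).
ratings["month"] = ratings["rated_at"].dt.month
print(ratings)
```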

I choose not to add any such features, because I believe that clustering over this type of data neither indicates movie similarity nor forms a proper basis for movie recommendation.

Finally, we have the tag data.

In [69]:
tags_df

Unnamed: 0,userId,movieId,tag,timestamp
0,2,60756,funny,1445714994
1,2,60756,Highly quotable,1445714996
2,2,60756,will ferrell,1445714992
3,2,89774,Boxing story,1445715207
4,2,89774,MMA,1445715200
...,...,...,...,...
3678,606,7382,for katie,1171234019
3679,606,7936,austere,1173392334
3680,610,3265,gun fu,1493843984
3681,610,3265,heroic bloodshed,1493843978


In [70]:
tags_df['tag'].describe()

count                 3683
unique                1589
top       In Netflix queue
freq                   131
Name: tag, dtype: object

In [73]:
tags_df['tag'].str.lower().describe()

count                 3683
unique                1475
top       in netflix queue
freq                   131
Name: tag, dtype: object

In [74]:
tags_df['tag'] = tags_df['tag'].str.lower()

These tags are interesting because they are completely user-generated; there is no predetermined bank of tags to choose from. However, they aren't as diverse as I'd expect: ignoring capitalization, only 1,475 of the 3,683 tag entries are distinct, fewer than half. Perhaps users can see existing tags and reuse them.
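
The effect of case normalization on tag diversity can be measured in a few lines. This is a small sketch on a hypothetical handful of tags (not the real column), comparing distinct counts before and after lowercasing.

```python
import pandas as pd

# Hypothetical tag sample containing case variants of the same tag.
tags = pd.Series(["funny", "Funny", "In Netflix queue", "in netflix queue", "MMA"])

raw_unique = tags.nunique()            # distinct tags as entered
normalized = tags.str.lower()          # fold case so variants collapse
norm_unique = normalized.nunique()     # distinct tags ignoring capitalization

# Share of entries that are distinct after normalization.
share = norm_unique / len(tags)
print(raw_unique, norm_unique, share)
```

Applied to `tags_df['tag']`, the same ratio corresponds to the 1,475 out of 3,683 figure reported by `describe()` above.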

## Modeling