# Data Exploration
- No missing data in the original MovieLens data, apart from a few `tmdbId` values. Lots of missing data in the IMDB set
- There are "rare" movies at the tail
    - 17% of movies have only 1 rating
    - 43% of movies have 5 or fewer ratings
- Merge loses 17410 movies that aren't in the IMDB dataset

In [1]:
import pandas as pd
import os

In [2]:
links = pd.read_csv('ml-25m/links.csv')
movies = pd.read_csv('ml-25m/movies.csv')
ratings = pd.read_csv('ml-25m/ratings.csv')

In [3]:
# Distribution of the number of ratings per movie. Most movies have only a few ratings.
x = ratings.groupby('movieId').count()
x.userId.describe()

count    59047.000000
mean       423.393144
std       2477.885821
min          1.000000
25%          2.000000
50%          6.000000
75%         36.000000
max      81491.000000
Name: userId, dtype: float64

In [95]:
len(x[x.userId == 1]) / len(x)

0.174403441326401

In [96]:
len(x[x.userId <= 2]) / len(x)

0.3036733449624875

In [93]:
len(x[x.userId <= 5]) / len(x)

0.42730294507333627
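
The three fractions above can be computed in one pass. The following is a minimal sketch on a toy ratings frame, not the real `ml-25m/ratings.csv`; `toy_ratings` and the helper `share_with_at_most` are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Toy stand-in for the ratings table (assumption: same column names as ml-25m).
toy_ratings = pd.DataFrame({
    "userId":  [1, 2, 3, 1, 2, 1, 4, 5, 6, 7],
    "movieId": [10, 10, 10, 20, 20, 30, 40, 40, 40, 40],
})

def share_with_at_most(ratings, k):
    """Fraction of rated movies that have k or fewer ratings."""
    counts = ratings["movieId"].value_counts()  # number of ratings per movie
    return (counts <= k).mean()

rare_shares = {k: share_with_at_most(toy_ratings, k) for k in (1, 2, 5)}
print(rare_shares)
```

Note that the denominator is the number of movies with at least one rating, as in the cells above, not the number of rows in `movies.csv`.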

## Merge All Data (Including IMDB Dataset)
Searched Kaggle for a larger dataset but did not find one that also had IMDB IDs.

In [5]:
# IMDB dataset
imdb_movies = pd.read_csv('imdb/IMDb movies.csv')

# standardize IMDB IDs
imdb_movies['imdbId'] = imdb_movies.imdb_title_id.str.split('tt').str[1]
imdb_movies.imdbId = pd.to_numeric(imdb_movies.imdbId)
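
IMDB title IDs look like `tt0114709`, while pandas reads the `imdbId` column of `links.csv` as plain integers, so the prefix and leading zeros have to go before the two can be joined. As an alternative to splitting on `'tt'`, here is a small sketch that extracts the digits with a regular expression; `toy_imdb` is made-up data, and the pattern assumes every ID is `tt` followed only by digits.

```python
import pandas as pd

# Toy IMDB IDs (assumption: all IDs have the form "tt" + digits).
toy_imdb = pd.DataFrame({"imdb_title_id": ["tt0114709", "tt0000009", "tt10872600"]})

# Pull out the digits after the "tt" prefix and convert them to integers.
# IDs that do not match the pattern become NaN instead of being mangled.
toy_imdb["imdbId"] = pd.to_numeric(
    toy_imdb["imdb_title_id"].str.extract(r"^tt(\d+)$", expand=False)
)
print(toy_imdb["imdbId"].tolist())
```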

In [38]:
# merge all datasets together
num_movies = len(movies)
# links (has IMDB IDs) + movies
df = pd.merge(links, movies, on = 'movieId')
# merge with imdb data
df = pd.merge(df, imdb_movies, on = 'imdbId')
# titles differ between MovieLens and IMDB: IMDB titles are in the original language, whereas MovieLens titles are all in English
df = df.rename(columns = {'title_x':'title_eng', 'title_y':'title_orig'})

new_num_movies = len(df)

## Missing Data Post Merge

In [12]:
# number of movies lost in merge
num_movies - new_num_movies

17410

In [13]:
new_num_movies

45013
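
To see which movies an inner merge drops, rather than only how many, one option is a left merge with `indicator=True`. This is a sketch on toy frames; `toy_links`, `toy_imdb`, and the IDs in them are invented for illustration.

```python
import pandas as pd

# Toy stand-ins: movieId 3 has an imdbId that is absent from the IMDB table.
toy_links = pd.DataFrame({"movieId": [1, 2, 3], "imdbId": [114709, 113497, 999999]})
toy_imdb = pd.DataFrame({"imdbId": [114709, 113497], "year": [1995, 1995]})

# A left merge keeps every MovieLens row; `_merge` records where each row matched.
audit = pd.merge(toy_links, toy_imdb, on="imdbId", how="left", indicator=True)
unmatched = audit[audit["_merge"] == "left_only"]
print(len(unmatched), unmatched["movieId"].tolist())
```

On the real data, `len(unmatched)` should equal `num_movies - new_num_movies`, provided `imdbId` is unique on both sides.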

In [14]:
df.isnull().sum()

userId                         0
movieId                        0
rating                         0
timestamp                      0
imdbId                         0
tmdbId                        30
title_eng                      0
genres                         0
imdb_title_id                  0
title_orig                     0
original_title                 0
year                           0
date_published                 0
genre                          0
duration                       0
country                       46
language                    5241
director                     235
writer                      7783
production_company         12419
actors                       588
description                12438
avg_vote                       0
votes                          0
budget                   2149711
usa_gross_income         1476438
worlwide_gross_income     975608
metascore                2212228
reviews_from_users          5451
reviews_from_critics       10354
dtype: int
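
Raw null counts are hard to compare across columns, so it can help to express them as a share of rows. A minimal sketch on an invented frame (`toy` and its values are assumptions, not rows from the merged data):

```python
import numpy as np
import pandas as pd

# Toy frame with the same kind of gaps as the IMDB columns.
toy = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "budget": [np.nan, "$ 1000", np.nan, np.nan],
    "language": ["English", None, "French", "English"],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```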

## Separate Genres -> Dummy Variables
IMDB and MovieLens sometimes list different genres for the same movie. Take the union of both lists to keep as much genre information as possible.

In [39]:
# convert into sets and take union 
df.genre = df.genre.str.split(', ')
df.genres = df.genres.str.split('|')

df.genre = df.genre.apply(set)
df.genres = df.genres.apply(set)

df['genres_all'] = df.apply(lambda x: x['genre'].union(x['genres']), axis=1)
df.genres_all = df.genres_all.apply(list)

In [40]:
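# dummy variables: one row per (movie, genre), then sum the indicators back to one row per movie
# (groupby(level=0).sum() replaces sum(level=0), which was removed in pandas 2.0)
genre_dummies = pd.get_dummies(df.genres_all.apply(pd.Series).stack(), prefix = 'genre').groupby(level = 0).sum()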

# merge back in
df = pd.merge(df, genre_dummies, left_index = True, right_index = True)
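
The union and dummy-variable steps are easier to check on a tiny example. The sketch below uses two invented movies and `Series.explode` in place of `apply(pd.Series).stack()`; it is an illustration under those assumptions, not the notebook's exact code path.

```python
import pandas as pd

toy = pd.DataFrame({
    "genre":  ["Comedy, Drama", "Action"],            # IMDB style: ", " separated
    "genres": ["Comedy|Romance", "Action|Thriller"],  # MovieLens style: "|" separated
})

# Union of the two genre lists per movie, stored as a sorted list.
genres_all = pd.Series(
    [sorted(set(a.split(", ")) | set(b.split("|")))
     for a, b in zip(toy["genre"], toy["genres"])],
    index=toy.index,
)

# One row per (movie, genre), then one indicator column per genre,
# summed back to a single row per movie.
dummies = pd.get_dummies(genres_all.explode(), prefix="genre").groupby(level=0).sum()
print(dummies)
```

Each row of `dummies` has a 1 for every genre that either source assigns to the movie, so "Comedy" is counted once for the first movie even though both sources list it.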

## Merge with Ratings Data

In [None]:
df = pd.merge(ratings, df, on = 'movieId')
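
Because this merge attaches one movie row to many rating rows, a duplicated `movieId` on the movie side would silently multiply ratings. One way to guard against that is the `validate` argument of `pd.merge`; the frames below are toy data invented for the sketch.

```python
import pandas as pd

toy_ratings = pd.DataFrame({
    "userId": [1, 1, 2],
    "movieId": [10, 20, 10],
    "rating": [4.0, 3.5, 5.0],
})
toy_movies = pd.DataFrame({"movieId": [10, 20], "title_eng": ["A", "B"]})

# validate="many_to_one" raises pandas.errors.MergeError if movieId is
# duplicated on the movie side, which would otherwise duplicate rating rows.
merged = pd.merge(toy_ratings, toy_movies, on="movieId", validate="many_to_one")
print(len(merged) == len(toy_ratings))
```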