# Highly Scalable Efficient Recommendation System

## Download datasets from open data sources
The two selected data sources contain many data sets, but we only need some of them to build our recommendation system. We select the files below according to the data our system requires.

#### IMDb database
* title.basics.tsv
* title.crew.tsv
* title.ratings.tsv

#### MovieLens Data Sets
* links.csv
* tags.csv
* ratings.csv
* genome-scores.csv
* genome-tags.csv

In [2]:
import pandas as pd
import numpy as np

basics = pd.read_csv("Original_Data_Sets/IMDb/title.basics.tsv", sep='\t', header=0)
crew = pd.read_csv("Original_Data_Sets/IMDb/title.crew.tsv", sep='\t', header=0)
ratings = pd.read_csv("Original_Data_Sets/IMDb/title.ratings.tsv", sep='\t', header=0)

basics.head(5)



Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,\N,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,Short


In [3]:
links = pd.read_csv("Original_Data_Sets/MovieLens/links.csv", sep=",", header=0)
tags = pd.read_csv("Original_Data_Sets/MovieLens/tags.csv", sep=",", header=0)
user_ratings = pd.read_csv("Original_Data_Sets/MovieLens/ratings.csv", sep=",", header=0)
genome_scores = pd.read_csv("Original_Data_Sets/MovieLens/genome-scores.csv", sep=",", header=0)
genome_tag = pd.read_csv("Original_Data_Sets/MovieLens/genome-tags.csv", sep=",", header=0)

links.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27278 entries, 0 to 27277
Data columns (total 3 columns):
movieId    27278 non-null int64
imdbId     27278 non-null int64
tmdbId     27026 non-null float64
dtypes: float64(1), int64(2)
memory usage: 639.4 KB


In [4]:
basics.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4941637 entries, 0 to 4941636
Data columns (total 9 columns):
tconst            object
titleType         object
primaryTitle      object
originalTitle     object
isAdult           int64
startYear         object
endYear           object
runtimeMinutes    object
genres            object
dtypes: int64(1), object(8)
memory usage: 339.3+ MB


There are 27278 movies in the MovieLens dataset and 4941637 records in the IMDb dataset.
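The IMDb files mark missing values with the literal string `\N`, which is why columns such as startYear and runtimeMinutes load as `object`. As a minimal sketch (not the loading code used in this notebook), the marker can be declared at read time through the `na_values` parameter of `pd.read_csv`. The two rows below are copied from the `basics.head(5)` output above with a reduced set of columns, and `sample_tsv` is a name introduced only for this illustration.

```python
import io
import pandas as pd

# Two rows from the title.basics preview above; "\N" marks a missing value.
sample_tsv = (
    "tconst\ttitleType\tprimaryTitle\tstartYear\tendYear\truntimeMinutes\n"
    "tt0000001\tshort\tCarmencita\t1894\t\\N\t1\n"
    "tt0000004\tshort\tUn bon bock\t1892\t\\N\t\\N\n"
)

sample = pd.read_csv(
    io.StringIO(sample_tsv),
    sep="\t",
    na_values="\\N",   # parse the IMDb missing-value marker as NaN
    low_memory=False,  # infer each column's dtype from the whole input
)

print(sample.dtypes)
print(sample.isnull().sum())
```

With this option the numeric columns arrive as numbers rather than strings, so a later `replace('\\N', np.nan)` and `astype('float')` would no longer be needed for them.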

## Data Integration
In the IMDb database, the information about each title is split across three files: basic information, crew information, and audience ratings. We combine the three data sets into one according to the "tconst" identifier of each title.

* Form a large data set of 11420 movies with 12 attributes

In [101]:
extra_info = pd.merge(left=crew, right=ratings, how='outer', on='tconst')
movie = pd.merge(left=basics, right=extra_info, how='outer', on='tconst')
movie.to_csv("Data_Set/movie.csv")
movie.head(5)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,directors,writers,averageRating,numVotes
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short",nm0005690,\N,5.8,1364.0
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short",nm0721526,\N,6.5,160.0
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance",nm0721526,\N,6.6,948.0
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,\N,"Animation,Short",nm0721526,\N,6.4,96.0
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,Short,nm0005690,\N,6.2,1643.0


IMDb and MovieLens use different identifiers for movies. links.csv maps one to the other, so we can join the two datasets through it.

* Transform the attribute "tconst" in IMDb to the form of "imdbId" in MovieLens

In [102]:
# Strip the "tt" prefix and convert the identifier to an integer, column-wise.
movie['tconst'] = movie['tconst'].str.replace("tt", "", regex=False).astype(int)
movie = movie.rename(columns={'tconst': 'imdbId'})
movie.head(5)

Unnamed: 0,imdbId,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,directors,writers,averageRating,numVotes
0,1,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short",nm0005690,\N,5.8,1364.0
1,2,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short",nm0721526,\N,6.5,160.0
2,3,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance",nm0721526,\N,6.6,948.0
3,4,short,Un bon bock,Un bon bock,0,1892,\N,\N,"Animation,Short",nm0721526,\N,6.4,96.0
4,5,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,Short,nm0005690,\N,6.2,1643.0
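Once the identifier is numeric, the IMDb table can be matched against links.csv on "imdbId". The join itself is not shown in this notebook, so the following is only a sketch under assumptions: the inner join is one possible choice, `movie_demo` and `links_demo` are toy stand-ins for the real tables, and the movieId and tmdbId values are invented.

```python
import pandas as pd

# Toy stand-ins for the real tables; movieId and tmdbId values are invented.
movie_demo = pd.DataFrame({
    "imdbId": [1, 2, 3],
    "titleType": ["short", "short", "short"],
})
links_demo = pd.DataFrame({
    "movieId": [10, 20],
    "imdbId": [2, 3],
    "tmdbId": [100.0, None],
})

# An inner join keeps only the titles that appear in both sources.
linked = pd.merge(
    left=links_demo[["movieId", "imdbId"]],
    right=movie_demo,
    how="inner",
    on="imdbId",
)
print(linked)
```

An inner join drops IMDb titles that MovieLens does not know about. A left join from `links_demo` would instead keep every MovieLens movie, with NaN where IMDb has no record.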


## Data Reduction
### Feature Selection

In [104]:
movie = movie.replace('\\N', np.nan)
pd.isnull(movie).sum()

imdbId                  0
titleType               0
primaryTitle            4
originalTitle         180
isAdult                 0
startYear          276921
endYear           4902775
runtimeMinutes    3404085
genres             390533
directors         2120881
writers           2505777
averageRating     4117413
numVotes          4117413
dtype: int64

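"endYear" is missing in almost every record (4902775 of 4941637), and we do not use the two title columns as features, so we drop the following attributes: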

* primaryTitle
* originalTitle
* endYear

In [105]:
movie = movie.drop(columns=['primaryTitle', 'originalTitle', 'endYear'])
movie.head(5)

Unnamed: 0,imdbId,titleType,isAdult,startYear,runtimeMinutes,genres,directors,writers,averageRating,numVotes
0,1,short,0,1894,1.0,"Documentary,Short",nm0005690,,5.8,1364.0
1,2,short,0,1892,5.0,"Animation,Short",nm0721526,,6.5,160.0
2,3,short,0,1892,4.0,"Animation,Comedy,Romance",nm0721526,,6.6,948.0
3,4,short,0,1892,,"Animation,Short",nm0721526,,6.4,96.0
4,5,short,0,1893,1.0,Short,nm0005690,,6.2,1643.0
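The null counts above are easier to compare as fractions of the table size. The sketch below ranks columns by their missing ratio and flags the ones above a cut-off. The 0.5 threshold and the `demo` frame are assumptions made for illustration and are not values used in this notebook.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the values are invented.
demo = pd.DataFrame({
    "startYear": [1894, 1892, np.nan, 1893],
    "endYear": [np.nan, np.nan, np.nan, 1990],
    "genres": ["Short", "Short", "Comedy", "Short"],
})

# Fraction of missing values per column, largest first.
missing_ratio = demo.isnull().mean().sort_values(ascending=False)

# Hypothetical cut-off: flag columns that are more than half empty.
threshold = 0.5
mostly_empty = missing_ratio[missing_ratio > threshold].index.tolist()

print(missing_ratio)
print(mostly_empty)
```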


### Instance Selection

Here, we drop all records whose titleType is "video" or "videoGame", since they are unrelated to our recommendation task.

In [114]:
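# Keep every row whose titleType is neither 'video' nor 'videoGame'.
# A boolean mask does not rely on the index being a default RangeIndex.
movie = movie[~movie.titleType.isin(['video', 'videoGame'])]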
movie.to_csv("Data_Set/movie_l1.csv", index=False)
movie.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4722816 entries, 0 to 4941636
Data columns (total 10 columns):
imdbId            int64
titleType         object
isAdult           int64
startYear         float64
runtimeMinutes    object
genres            object
directors         object
writers           object
averageRating     float64
numVotes          float64
dtypes: float64(3), int64(2), object(5)
memory usage: 396.4+ MB


## Data Cleaning
Here, we handle missing data in the IMDb dataset.

In [108]:
movie = pd.read_csv("Data_Set/movie_l1.csv", sep=",", header=0)



In [115]:
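# numeric_only=True skips the string columns, which recent pandas versions refuse to average.
movie.groupby('titleType').mean(numeric_only=True)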

titleType,imdbId,isAdult,startYear,averageRating,numVotes
movie,2207283.0,0.017511,1986.252302,6.257489,3137.623713
short,3175621.0,0.001691,1989.339339,6.875474,61.364852
tvEpisode,3768220.0,0.022563,2003.045081,7.449162,124.768424
tvMiniSeries,3917497.0,0.003608,2004.946546,7.267308,567.484533
tvMovie,2400234.0,0.000562,1995.310781,6.712848,176.661691
tvSeries,2973238.0,0.003968,2002.112311,6.836243,1078.392503
tvShort,2532572.0,0.00024,1993.065176,6.924822,179.432171
tvSpecial,2056499.0,0.00056,2001.810498,6.724496,155.683419


We impute the missing values in "runtimeMinutes" with the mean value of the same "titleType".

In [116]:
# Manually set the fields of five specific records before imputing.
manual_fixes = {
    4805574: (2015, 'Animation,Sci-Fi,Short'),
    4730468: (2018, 'Animation,Family'),
    4525055: (2016, 'Action,Adventure,Animation'),
    2301207: (2011, 'Comedy'),
    811687: (1993, 'Action,Animation,Comedy'),
}
for idx, (year, genres) in manual_fixes.items():
    movie.loc[idx, 'isAdult'] = 0
    movie.loc[idx, 'startYear'] = year
    movie.loc[idx, 'runtimeMinutes'] = np.nan
    movie.loc[idx, 'genres'] = genres

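# Cast to float so that the column takes part in the numeric group means.
movie['runtimeMinutes'] = movie['runtimeMinutes'].astype('float')
# numeric_only=True skips the string columns (genres, directors, writers).
means = movie.groupby('titleType').mean(numeric_only=True)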

movie['runtimeMinutes'] = movie.apply(
    lambda x: means.loc[x['titleType'], 'runtimeMinutes'] if np.isnan(x['runtimeMinutes']) else x['runtimeMinutes'],
    axis=1)
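The row-wise `apply` above calls a Python function once per record, which is slow on a table with several million rows. As a sketch of an equivalent vectorized form, `groupby(...).transform("mean")` broadcasts each group's mean back to its rows, and `fillna` then uses it only where a value is missing. The `demo` frame and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Small illustrative frame; the values are invented.
demo = pd.DataFrame({
    "titleType": ["short", "short", "short", "movie", "movie"],
    "runtimeMinutes": [1.0, 5.0, np.nan, 90.0, np.nan],
})

# Mean of each titleType group, aligned row by row with the original frame.
group_mean = demo.groupby("titleType")["runtimeMinutes"].transform("mean")

# Only the missing entries are replaced; observed values stay untouched.
demo["runtimeMinutes"] = demo["runtimeMinutes"].fillna(group_mean)
print(demo)
```

The same pattern would apply to "numVotes" by changing the column name.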

We impute the missing values in "averageRating" with the overall mean value of the column.

In [117]:
movie['averageRating'] = movie['averageRating'].fillna(movie['averageRating'].mean())

We impute the missing values in "numVotes" with the mean value of the same "titleType".

In [118]:
movie['numVotes'] = movie.apply(
    lambda x: means.loc[x['titleType'], 'numVotes'] if np.isnan(x['numVotes']) else x['numVotes'],
    axis=1)

For the other, categorical attributes, we leave the missing values as they are.

In [119]:
movie.head(10)

Unnamed: 0,imdbId,titleType,isAdult,startYear,runtimeMinutes,genres,directors,writers,averageRating,numVotes
0,1,short,0,1894.0,1.0,"Documentary,Short",nm0005690,,5.8,1364.0
1,2,short,0,1892.0,5.0,"Animation,Short",nm0721526,,6.5,160.0
2,3,short,0,1892.0,4.0,"Animation,Comedy,Romance",nm0721526,,6.6,948.0
3,4,short,0,1892.0,13.213336,"Animation,Short",nm0721526,,6.4,96.0
4,5,short,0,1893.0,1.0,Short,nm0005690,,6.2,1643.0
5,6,short,0,1894.0,1.0,Short,nm0005690,,5.6,83.0
6,7,short,0,1894.0,1.0,"Short,Sport","nm0374658,nm0005690",,5.5,553.0
7,8,short,0,1894.0,1.0,"Documentary,Short",nm0005690,,5.6,1465.0
8,9,movie,0,1894.0,45.0,Romance,nm0085156,nm0085156,5.5,65.0
9,10,short,0,1895.0,1.0,"Documentary,Short",nm0525910,,6.9,4910.0


## Data Transformation

For titleType, we replace the values with the numbers 0-3.
* 0 if the titleType is movie or tvMovie
* 1 if the titleType is short or tvShort
* 2 if the titleType is tvEpisode or tvSeries
* 3 if the titleType is tvMiniSeries or tvSpecial

In [120]:
movie['titleType'] = np.where((movie.titleType == 'movie') | (movie.titleType == 'tvMovie'), 0, movie.titleType)
movie['titleType'] = np.where((movie.titleType == 'short') | (movie.titleType == 'tvShort'), 1, movie.titleType)
movie['titleType'] = np.where((movie.titleType == 'tvEpisode') | (movie.titleType == 'tvSeries'), 2, movie.titleType)
movie['titleType'] = np.where((movie.titleType == 'tvMiniSeries') | (movie.titleType == 'tvSpecial'), 3, movie.titleType)
movie.head(5)

Unnamed: 0,imdbId,titleType,isAdult,startYear,runtimeMinutes,genres,directors,writers,averageRating,numVotes
0,1,1,0,1894.0,1.0,"Documentary,Short",nm0005690,,5.8,1364.0
1,2,1,0,1892.0,5.0,"Animation,Short",nm0721526,,6.5,160.0
2,3,1,0,1892.0,4.0,"Animation,Comedy,Romance",nm0721526,,6.6,948.0
3,4,1,0,1892.0,13.213336,"Animation,Short",nm0721526,,6.4,96.0
4,5,1,0,1893.0,1.0,Short,nm0005690,,6.2,1643.0
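The four groups can also be written as a single lookup table, which keeps the mapping in one place and produces an integer column directly. This is a sketch of an alternative, not the code used above. `TITLE_TYPE_CODES` is a name introduced here for illustration.

```python
import pandas as pd

# Lookup table mirroring the four groups listed above (name is illustrative).
TITLE_TYPE_CODES = {
    "movie": 0, "tvMovie": 0,
    "short": 1, "tvShort": 1,
    "tvEpisode": 2, "tvSeries": 2,
    "tvMiniSeries": 3, "tvSpecial": 3,
}

demo = pd.Series(["short", "tvMovie", "tvEpisode", "tvSpecial", "movie"])

# Series.map looks up each value in the dictionary.
codes = demo.map(TITLE_TYPE_CODES)
print(codes.tolist())
```

Unlike the chained `np.where` calls, `Series.map` turns any value that is missing from the dictionary into NaN, which makes unexpected title types easy to spot.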


For startYear, we divide the values into 7 eras.
* 0 if the startYear is 1930 or earlier
* 1 if the startYear is later than 1930 but no later than 1950
* 2 if the startYear is later than 1950 but no later than 1970
* 3 if the startYear is later than 1970 but no later than 1990
* 4 if the startYear is later than 1990 but no later than 2000
* 5 if the startYear is later than 2000 but no later than 2010
* 6 if the startYear is later than 2010

In [123]:
movie['startYear'] = np.where(movie.startYear <= 1930, 0, movie.startYear)
movie['startYear'] = np.where((movie.startYear > 1930) & (movie.startYear <= 1950), 1, movie.startYear)
movie['startYear'] = np.where((movie.startYear > 1950) & (movie.startYear <= 1970), 2, movie.startYear)
movie['startYear'] = np.where((movie.startYear > 1970) & (movie.startYear <= 1990), 3, movie.startYear)
movie['startYear'] = np.where((movie.startYear > 1990) & (movie.startYear <= 2000), 4, movie.startYear)
movie['startYear'] = np.where((movie.startYear > 2000) & (movie.startYear <= 2010), 5, movie.startYear)
movie['startYear'] = np.where(movie.startYear > 2010, 6, movie.startYear)
movie.head(5)

Unnamed: 0,imdbId,titleType,isAdult,startYear,runtimeMinutes,genres,directors,writers,averageRating,numVotes
0,1,1,0,0.0,1.0,"Documentary,Short",nm0005690,,5.8,1364.0
1,2,1,0,0.0,5.0,"Animation,Short",nm0721526,,6.5,160.0
2,3,1,0,0.0,4.0,"Animation,Comedy,Romance",nm0721526,,6.6,948.0
3,4,1,0,0.0,13.213336,"Animation,Short",nm0721526,,6.4,96.0
4,5,1,0,0.0,1.0,Short,nm0005690,,6.2,1643.0
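The seven intervals can be expressed in one call with `pd.cut`, shown here as a sketch rather than the method used in this notebook. The bins are right-closed by default, which matches the `<=` comparisons above, and `labels=False` returns the bin number. The sample `years` are invented.

```python
import numpy as np
import pandas as pd

# Invented sample, including boundary years and one missing value.
years = pd.Series([1894, 1930, 1931, 1950, 1985, 2000, 2005, 2018, np.nan])

# Right-closed bins: (-inf, 1930], (1930, 1950], ..., (2010, inf)
edges = [-np.inf, 1930, 1950, 1970, 1990, 2000, 2010, np.inf]

# labels=False returns the integer position of the bin instead of an interval.
era = pd.cut(years, bins=edges, labels=False)
print(era.tolist())
```

A missing startYear stays NaN, as it does with the chained `np.where` calls. Because the result is computed from the raw years in a single step, rerunning the cell on fresh data gives the same codes.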


For runtimeMinutes, we divide the values into 5 types.
* 0 if the runtimeMinutes is at most 10
* 1 if the runtimeMinutes is larger than 10 but at most 30
* 2 if the runtimeMinutes is larger than 30 but at most 60
* 3 if the runtimeMinutes is larger than 60 but at most 120
* 4 if the runtimeMinutes is larger than 120

In [125]:
movie['runtimeMinutes'] = np.where(movie.runtimeMinutes <= 10, 0, movie.runtimeMinutes)
movie['runtimeMinutes'] = np.where((movie.runtimeMinutes > 10) & (movie.runtimeMinutes <= 30), 1, movie.runtimeMinutes)
movie['runtimeMinutes'] = np.where((movie.runtimeMinutes > 30) & (movie.runtimeMinutes <= 60), 2, movie.runtimeMinutes)
movie['runtimeMinutes'] = np.where((movie.runtimeMinutes > 60) & (movie.runtimeMinutes <= 120), 3, movie.runtimeMinutes)
movie['runtimeMinutes'] = np.where(movie.runtimeMinutes > 120, 4, movie.runtimeMinutes)
movie.head(5)

Unnamed: 0,imdbId,titleType,isAdult,startYear,runtimeMinutes,genres,directors,writers,averageRating,numVotes
0,1,1,0,0.0,0.0,"Documentary,Short",nm0005690,,5.8,1364.0
1,2,1,0,0.0,0.0,"Animation,Short",nm0721526,,6.5,160.0
2,3,1,0,0.0,0.0,"Animation,Comedy,Romance",nm0721526,,6.6,948.0
3,4,1,0,0.0,1.0,"Animation,Short",nm0721526,,6.4,96.0
4,5,1,0,0.0,0.0,Short,nm0005690,,6.2,1643.0


For averageRating, we divide the values into 5 types.
* 0 if the averageRating is at most 5.0
* 1 if the averageRating is larger than 5.0 but at most 6.0
* 2 if the averageRating is larger than 6.0 but at most 7.0
* 3 if the averageRating is larger than 7.0 but at most 8.0
* 4 if the averageRating is larger than 8.0

In [127]:
movie['averageRating'] = np.where(movie.averageRating <= 5.0, 0, movie.averageRating)
movie['averageRating'] = np.where((movie.averageRating > 5.0) & (movie.averageRating <= 6.0), 1, movie.averageRating)
movie['averageRating'] = np.where((movie.averageRating > 6.0) & (movie.averageRating <= 7.0), 2, movie.averageRating)
movie['averageRating'] = np.where((movie.averageRating > 7.0) & (movie.averageRating <= 8.0), 3, movie.averageRating)
movie['averageRating'] = np.where(movie.averageRating > 8.0, 4, movie.averageRating)
movie.head(5)

Unnamed: 0,imdbId,titleType,isAdult,startYear,runtimeMinutes,genres,directors,writers,averageRating,numVotes
0,1,1,0,0.0,0.0,"Documentary,Short",nm0005690,,1.0,1364.0
1,2,1,0,0.0,0.0,"Animation,Short",nm0721526,,2.0,160.0
2,3,1,0,0.0,0.0,"Animation,Comedy,Romance",nm0721526,,2.0,948.0
3,4,1,0,0.0,1.0,"Animation,Short",nm0721526,,2.0,96.0
4,5,1,0,0.0,0.0,Short,nm0005690,,2.0,1643.0


For numVotes, we divide the values into 5 types.
* 0 if the numVotes is at most 100
* 1 if the numVotes is larger than 100 but at most 1000
* 2 if the numVotes is larger than 1000 but at most 10000
* 3 if the numVotes is larger than 10000 but at most 100000
* 4 if the numVotes is larger than 100000

In [137]:
movie['numVotes'] = np.where(movie.numVotes <= 100.0, 0, movie.numVotes)
movie['numVotes'] = np.where((movie.numVotes > 100.0) & (movie.numVotes <= 1000.0), 1, movie.numVotes)
movie['numVotes'] = np.where((movie.numVotes > 1000.0) & (movie.numVotes <= 10000.0), 2, movie.numVotes)
movie['numVotes'] = np.where((movie.numVotes > 10000.0) & (movie.numVotes <= 100000.0), 3, movie.numVotes)
movie['numVotes'] = np.where(movie.numVotes > 100000.0, 4, movie.numVotes)
movie.head(5)

Unnamed: 0,imdbId,titleType,isAdult,startYear,runtimeMinutes,genres,directors,writers,averageRating,numVotes
0,1,1,0,0.0,0.0,"Documentary,Short",nm0005690,,1.0,2.0
1,2,1,0,0.0,0.0,"Animation,Short",nm0721526,,2.0,1.0
2,3,1,0,0.0,0.0,"Animation,Comedy,Romance",nm0721526,,2.0,1.0
3,4,1,0,0.0,1.0,"Animation,Short",nm0721526,,2.0,0.0
4,5,1,0,0.0,0.0,Short,nm0005690,,2.0,2.0
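Since the numVotes thresholds form a sorted list, the level of each value is simply the number of thresholds that lie strictly below it. The sketch below uses `np.searchsorted` for this and assumes the column has no missing values left, which holds after the imputation step. The sample `votes` are partly taken from the tables above and partly invented.

```python
import numpy as np

# Sample vote counts, including values that sit exactly on a threshold.
votes = np.array([96.0, 100.0, 160.0, 1000.0, 1364.0, 50000.0, 250000.0])

# Upper bounds of levels 0 to 3; anything above the last edge is level 4.
edges = np.array([100.0, 1000.0, 10000.0, 100000.0])

# side="left" counts the edges strictly below each value,
# so a value equal to an edge stays in the lower level.
vote_level = np.searchsorted(edges, votes, side="left")
print(vote_level.tolist())
```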


In [138]:
movie.to_csv("Data_Set/movie_l2.csv", index=False)

In [None]:
## Build