# Highly Scalable Efficient Recommendation System

## Dowload dataset from open data source
There are many data sets in two selected data sources, but we only need some of them to build our recommendation system. So we select a few of them in consideration of the demand for data in our system.

#### IMDb database
* title.basics.tsv
* title.crew.tsv
* title.ratings.tsv

#### MovieLens Data Sets
* tags.csv
* links.csv
* ratings.csv
* genome-scores.csv
* genome-tags.csv

In [32]:
import pandas as pd

basics = pd.read_csv("Original_Data_Sets/IMDb/title.basics.tsv", sep='\t', header=0)
crew = pd.read_csv("Original_Data_Sets/IMDb/title.crew.tsv", sep='\t', header=0)
ratings = pd.read_csv("Original_Data_Sets/IMDb/title.ratings.tsv", sep='\t', header=0)

links = pd.read_csv("Original_Data_Sets/MovieLens/links.csv", sep=",", header=0)
tags = pd.read_csv("Original_Data_Sets/MovieLens/tags.csv", sep=",", header=0)
user_ratings = pd.read_csv("Original_Data_Sets/MovieLens/ratings.csv", sep=",", header=0)
genome_scores = pd.read_csv("Original_Data_Sets/MovieLens/genome-scores.csv", sep=",", header=0)
genome_tag = pd.read_csv("Original_Data_Sets/MovieLens/genome-tags.csv", sep=",", header=0)

basics.head(10)

  interactivity=interactivity, compiler=compiler, result=result)


Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,\N,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,Short
5,tt0000006,short,Chinese Opium Den,Chinese Opium Den,0,1894,\N,1,Short
6,tt0000007,short,Corbett and Courtney Before the Kinetograph,Corbett and Courtney Before the Kinetograph,0,1894,\N,1,"Short,Sport"
7,tt0000008,short,Edison Kinetoscopic Record of a Sneeze,Edison Kinetoscopic Record of a Sneeze,0,1894,\N,1,"Documentary,Short"
8,tt0000009,movie,Miss Jerry,Miss Jerry,0,1894,\N,45,Romance
9,tt0000010,short,Employees Leaving the Lumière Factory,La sortie de l'usine Lumière à Lyon,0,1895,\N,1,"Documentary,Short"


## Data Integration
In IMDb database, the information of one movie is divided into three part, such as basic information of movie, information of crew and audience evaluation. So we combine three data sets into one data set according to the titleId of movies.  

* Form a large data set of 11420 movies with 12 attributes

In [19]:
extra_info = pd.merge(left=crew, right=ratings, how='outer', on='tconst')
movie = pd.merge(left=basics, right=extra_info, how='outer', on='tconst')
movie.to_csv("Data_Set/movie.csv")
movie.head(10)

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,directors,writers,averageRating,numVotes
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short",nm0005690,\N,5.8,1364.0
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short",nm0721526,\N,6.5,160.0
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance",nm0721526,\N,6.6,948.0
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,\N,"Animation,Short",nm0721526,\N,6.4,96.0
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,Short,nm0005690,\N,6.2,1643.0
5,tt0000006,short,Chinese Opium Den,Chinese Opium Den,0,1894,\N,1,Short,nm0005690,\N,5.6,83.0
6,tt0000007,short,Corbett and Courtney Before the Kinetograph,Corbett and Courtney Before the Kinetograph,0,1894,\N,1,"Short,Sport","nm0374658,nm0005690",\N,5.5,553.0
7,tt0000008,short,Edison Kinetoscopic Record of a Sneeze,Edison Kinetoscopic Record of a Sneeze,0,1894,\N,1,"Documentary,Short",nm0005690,\N,5.6,1465.0
8,tt0000009,movie,Miss Jerry,Miss Jerry,0,1894,\N,45,Romance,nm0085156,nm0085156,5.5,65.0
9,tt0000010,short,Employees Leaving the Lumière Factory,La sortie de l'usine Lumière à Lyon,0,1895,\N,1,"Documentary,Short",nm0525910,\N,6.9,4910.0


IMDb and MovieLens use different identity number to represent movies, so we should combine these to data sets using links.csv to one data set. 

* Replace titleIds in IMDb with movieIds in MovieLens

In [31]:
links['imdbId'] = links.apply(lambda x: 'tt' + str(int(x["imdbId"])), axis=1)
links.rename(columns=lambda x:x.replace('imdbId','tconst'), inplace=True)
movie = pd.merge(left=links, right=movie, how='inner', on='tconst')
movie.head(10)

Unnamed: 0,movieId,tconst,tmdbId,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres,directors,writers,averageRating,numVotes
0,1.0,tt114709,862.0,,,,,,,,,,,,
1,2.0,tt113497,8844.0,,,,,,,,,,,,
2,3.0,tt113228,15602.0,,,,,,,,,,,,
3,4.0,tt114885,31357.0,,,,,,,,,,,,
4,5.0,tt113041,11862.0,,,,,,,,,,,,
5,6.0,tt113277,949.0,,,,,,,,,,,,
6,7.0,tt114319,11860.0,,,,,,,,,,,,
7,8.0,tt112302,45325.0,,,,,,,,,,,,
8,9.0,tt114576,9091.0,,,,,,,,,,,,
9,10.0,tt113189,710.0,,,,,,,,,,,,


## Data Reduction
Here, we do feature Selection to drop useless attributes as below:

* primaryTitle
* originalTitle
* endYear
* writers
* 

In [None]:
movie = movie.drop(columns=['primaryTitle', 'originalTitle', 'endYear', 'writers'])

## Data Cleaning
Here, we handle missing data in


## Data Transformation