# <center>EDSA Movie Recommendation Challenge</center>

## Introduction

In today’s technology-driven world, recommender systems are socially and economically critical for helping individuals make appropriate choices about the content they engage with daily. Movie recommendation is one application where this is especially true: intelligent algorithms can help viewers find great titles among tens of thousands of options.
<br>

Providing an accurate and robust solution to this challenge has immense economic potential: users are exposed to content they would like to view or purchase, which generates revenue and platform affinity.

## Problem Statement

The aim of this project is to **construct a recommendation algorithm based on `content or collaborative filtering`, capable of accurately predicting how a user will rate a movie they have not yet viewed based on their historical preferences.**
<br>

The evaluation metric is the `Root Mean Square Error (RMSE)`. The lower the error (the closer to 0), the better.
<br>
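
For reference, RMSE over $N$ predictions is $\sqrt{\frac{1}{N}\sum_{i=1}^{N}(r_i - \hat{r}_i)^2}$, where $r_i$ is the actual rating and $\hat{r}_i$ the predicted one. A minimal sketch of the computation follows; the function name `rmse` and the sample values are illustrative assumptions, not part of the competition code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error between actual and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative values only (not taken from the dataset)
actual = [4.0, 3.5, 5.0, 2.0]
predicted = [3.5, 3.5, 4.0, 3.0]
score = rmse(actual, predicted)
print(score)
```

Because the errors are squared before averaging, a few large misses raise the score more than many small ones.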

To view more information about the competition, [click here](https://www.kaggle.com/c/edsa-recommender-system-predict/overview).

## Table of Contents

* [Data Overview](#the_over)
* [Connect to Comet](#the_connection)
* [Import libraries](#the_library)
* [Load datasets](#the_data)
* [Inspect Datasets](#the_inspect)
* [Data Cleaning](#the_clean)
* [Exploratory Data Analysis](#the_eda)
* [Feature Engineering](#the_engineer)
* [Implementation of Recommender Systems](#the_recommender)
* [Assessing Accuracy](#the_assess)
* [Conclusion](#the_conclusion)


<a id='the_over'></a>
## Data Overview

This dataset consists of several million ratings on a 5-star scale, obtained from users of the online [MovieLens](http://movielens.org/) movie recommendation service. The MovieLens dataset is maintained by the [GroupLens](https://grouplens.org/) research group in the Department of Computer Science and Engineering at the University of Minnesota. Additional movie content data was legally scraped from [IMDB](https://www.imdb.com/).

Below is a table of supplied files that will be used in this notebook:



| **File Name** | **Description** |
|:---------:|:----------------|
|   **genome_scores.csv**   |A score mapping the strength between movies and tag-related properties. Read more [here](http://files.grouplens.org/papers/tag_genome.pdf) |
|   **genome_tags.csv**   |User-assigned tags for genome-related scores. |
|   **imdb_data.csv**   |Additional movie metadata scraped from IMDB using the links.csv file. |
|  **links.csv**   |File providing a mapping between a MovieLens ID and associated IMDB and TMDB IDs. |  
|  **sample_submission.csv**   |Sample of the submission format for the hackathon. | 
|  **tags.csv**   |User-assigned tags for the movies within the dataset. |
|  **test.csv**   |The test split of the dataset. Contains user and movie IDs with no rating data. | 
|  **train.csv**   |The training split of the dataset. Contains user and movie IDs with associated rating data. | 


**Ratings Data File Structure (train.csv)**

All ratings are contained in the file train.csv. Each line of this file after the header row represents one rating of one movie by one user, and has the following features:

| userId | movieId | rating | timestamp |
|:---------:|:----------------|:---------:|:----------------|


The lines within this file are ordered first by userId, then, within user, by movieId. Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars). Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.
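
Since the timestamps are plain epoch seconds, they can be converted to readable dates with `pd.to_datetime`. The sketch below uses two timestamp values from the train.csv preview in this notebook; the frame name `sample` and the derived column names are illustrative.

```python
import pandas as pd

# Two timestamp values taken from the train.csv preview in this notebook
sample = pd.DataFrame({"timestamp": [1518349992, 833375837]})

# unit="s" tells pandas the integers are seconds since the Unix epoch
sample["rated_at"] = pd.to_datetime(sample["timestamp"], unit="s", utc=True)
sample["year"] = sample["rated_at"].dt.year
print(sample)
```

The same call could be applied to the `timestamp` column of the full ratings or tags data if time-based features are needed.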

**Tags Data File Structure (tags.csv)**

All tags are contained in the file tags.csv. Each line of this file after the header row represents one tag applied to one movie by one user, and has the following features:

| userId | movieId | tag | timestamp |
|:---------:|:----------------|:---------:|:----------------|


The lines within this file are ordered first by userId, then, within user, by movieId. Tags are user-generated metadata about movies. Each tag is typically a single word or short phrase. The meaning, value, and purpose of a particular tag is determined by each user. Timestamps represent seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970.

**Movies Data File Structure (movies.csv)**

Movie information is contained in the file movies.csv. Each line of this file after the header row represents one movie, and has the following features:


| movieId | title | genres |
|:---------:|:----------------|:---------:|


Movie titles are entered manually or imported from [here](https://www.themoviedb.org/), and include the year of release in parentheses. Errors and inconsistencies may exist in these titles. Genres are a pipe-separated list, and are selected from the following:

Action,
Adventure,
Animation,
Children,
Comedy,
Crime,
Documentary,
Drama,
Fantasy,
Film-Noir,
Horror,
Musical,
Mystery,
Romance,
Sci-Fi,
Thriller,
War,
Western,
and (no genres listed).
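
Because the genres are stored as one pipe-separated string, they usually need to be split before analysis. A minimal sketch, using rows copied from the movies.csv preview in this notebook (the names `movies`, `genre_list`, and `genre_flags` are illustrative):

```python
import pandas as pd

# Rows copied from the movies.csv preview in this notebook
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy",
               "Comedy|Romance"],
})

# One list of genre names per movie
movies["genre_list"] = movies["genres"].str.split("|")

# One indicator column per genre (1 if the movie has that genre, else 0)
genre_flags = movies["genres"].str.get_dummies(sep="|")

# Number of movies per genre in this small sample
print(genre_flags.sum().sort_values(ascending=False))
```

The indicator columns are one possible input for content-based similarity, while the list form is convenient for counting or plotting.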

**Links Data File Structure (links.csv)**

Identifiers that can be used to link to other sources of movie data are contained in the file links.csv. Each line of this file after the header row represents one movie, and has the following features:

| movieId | imdbId | tmdbId |
|:---------:|:----------------|:---------:|



- movieId is an identifier for movies used by [movielens site](https://movielens.org). E.g., the movie Toy Story has this [link](https://movielens.org/movies/1).

- imdbId is an identifier for movies used by [imdb site](http://www.imdb.com). E.g., the movie Toy Story has this [link](http://www.imdb.com/title/tt0114709/).

- tmdbId is an identifier for movies used by [themoviedb site](https://www.themoviedb.org). E.g., the movie Toy Story has this [link](https://www.themoviedb.org/movie/862).

- Use of the resources listed above is subject to the terms of each provider.

**Tag Genome (genome_scores.csv and genome_tags.csv)**

The tag genome encodes how strongly movies exhibit particular properties represented by tags (atmospheric, thought-provoking, realistic, etc.). The tag genome was computed using a machine learning algorithm on user-contributed content including tags, ratings, and textual reviews. The genome is split into two files. The file genome_scores.csv contains movie-tag relevance data with the following features:


| movieId | tagId | relevance |
|:---------:|:----------------|:---------:|


The second file, genome_tags.csv, provides the tag descriptions for the tag IDs in the genome file, with the following features:

| tagId | tag |
|:---------:|:----------------|
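
The two genome files are meant to be joined on `tagId`. The sketch below uses toy stand-ins for both files; the relevance values and tag names are invented for illustration, and only the column names come from the file descriptions.

```python
import pandas as pd

# Toy stand-ins for genome_scores.csv and genome_tags.csv (values invented)
genome_scores = pd.DataFrame({
    "movieId": [1, 1, 2, 2],
    "tagId": [1, 2, 1, 2],
    "relevance": [0.90, 0.10, 0.20, 0.75],
})
genome_tags = pd.DataFrame({"tagId": [1, 2], "tag": ["animation", "adventure"]})

# Attach the tag text to each (movieId, tagId, relevance) row
scored_tags = genome_scores.merge(genome_tags, on="tagId", how="left")

# Most relevant tag per movie
top_rows = scored_tags.groupby("movieId")["relevance"].idxmax()
top_tag = scored_tags.loc[top_rows, ["movieId", "tag", "relevance"]]
print(top_tag)
```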


<a id='the_connection'></a>
## Connect to Comet

In [None]:
#!pip install comet_ml

<a id='the_library'></a>
## Importing the libraries

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from wordcloud import WordCloud

#visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('dark')
%matplotlib inline

<a id ='the_data'></a>
## Load datasets

In [2]:
train_df = pd.read_csv('train.csv')
genome_scores_df = pd.read_csv('genome_scores.csv')
genome_tags_df = pd.read_csv('genome_tags.csv')
imdb_df = pd.read_csv('imdb_data.csv')
links_df = pd.read_csv('links.csv')
movies_df = pd.read_csv('movies.csv')
sample_submission_df = pd.read_csv('sample_submission.csv')
tags_df = pd.read_csv('tags.csv')
test_df = pd.read_csv('test.csv')

<a id='the_inspect'></a>
## Inspect datasets

In [3]:
train_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,5163,57669,4.0,1518349992
1,106343,5,4.5,1206238739
2,146790,5459,5.0,1076215539
3,106362,32296,2.0,1423042565
4,9041,366,3.0,833375837


The userId column contains the ID of the user who left the rating, the movieId column contains the ID of the movie, and the rating column contains the rating left by the user. Ratings range from 0.5 to 5.0 in half-star increments. Finally, the timestamp column records the time at which the user left the rating.

Let's add in some movie titles

In [4]:
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
movies_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62423 entries, 0 to 62422
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   movieId  62423 non-null  int64 
 1   title    62423 non-null  object
 2   genres   62423 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.4+ MB


The movies dataset appears to have no null values, but further inspection is required.

In [6]:
movies_df.loc[movies_df["genres"] == "(no genres listed)"] #to see which movies have missing genres

Unnamed: 0,movieId,title,genres
15881,83773,Away with Words (San tiao ren) (1999),(no genres listed)
16060,84768,Glitterbug (1994),(no genres listed)
16351,86493,"Age of the Earth, The (A Idade da Terra) (1980)",(no genres listed)
16491,87061,Trails (Veredas) (1978),(no genres listed)
17404,91246,Milky Way (Tejút) (2007),(no genres listed)
...,...,...,...
62400,209101,Hua yang de nian hua (2001),(no genres listed)
62401,209103,Tsar Ivan the Terrible (1991),(no genres listed)
62407,209133,The Riot and the Dance (2018),(no genres listed)
62415,209151,Mao Zedong 1949 (2019),(no genres listed)


5062 movies in the data have no genre listed 

In [7]:
movies_df.loc[movies_df["title"].duplicated()]
# Returns every duplicate row except the first occurrence,
# because the default value of the keep argument is "first".

Unnamed: 0,movieId,title,genres
9065,26982,Men with Guns (1997),Drama
12909,64997,War of the Worlds (2005),Action|Sci-Fi
12984,65665,Hamlet (2000),Drama
13177,67459,Chaos (2005),Crime|Drama|Horror
16120,85070,Blackout (2007),Drama
...,...,...,...
61521,206117,The Lonely Island Presents: The Unauthorized B...,Comedy
61525,206125,Lost & Found (2018),Comedy|Drama
61697,206674,Camino (2016),Comedy
61800,206925,The Plague (2006),Documentary


98 movie titles are duplicated in the movies dataset.
However, the duplicates have different movie IDs and genre lists.
Because each pair refers to the same movie, the duplicates need to be removed.
Keep only the entry with the most genres, as it carries more information.
Inspecting a few duplicates at random shows that the first occurrence of a title has more genres, so keep the first occurrence.

In [8]:
#Randomly selected to check differences between first and second
movies_df.loc[movies_df["title"] == "The Plague (2006)"] 


Unnamed: 0,movieId,title,genres
27081,128255,The Plague (2006),Documentary|Horror|Thriller
61800,206925,The Plague (2006),Documentary


In [None]:
movies_df.loc[movies_df["title"] == "Men with Guns (1997)"]

In [None]:
movies_df.loc[movies_df["title"] == "War of the Worlds (2005)"]

In [None]:
movies_df.loc[movies_df["title"] == "Chaos (2005)"]

All second occurrences of a movie title need to be dropped, since the first occurrence has more genres.
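
One way to carry out this step without relying on row order is to count the genres per row and keep the richest entry for each title. This is a minimal sketch on rows copied from the duplicate check above plus one unique title; the helper column `n_genres` and the frame names are assumptions for illustration, not the notebook's final cleaning code.

```python
import pandas as pd

# Rows copied from the duplicate check above, plus one unique title
movies = pd.DataFrame({
    "movieId": [128255, 206925, 1],
    "title": ["The Plague (2006)", "The Plague (2006)", "Toy Story (1995)"],
    "genres": ["Documentary|Horror|Thriller", "Documentary",
               "Adventure|Animation|Children|Comedy|Fantasy"],
})

# Count genres per row; "(no genres listed)" counts as zero
n_genres = movies["genres"].str.count(r"\|") + 1
n_genres = n_genres.where(movies["genres"] != "(no genres listed)", 0)

# Sort so the richest genre list comes first within each title, then keep it
deduped = (
    movies.assign(n_genres=n_genres)
    .sort_values(["title", "n_genres"], ascending=[True, False])
    .drop_duplicates(subset="title", keep="first")
    .drop(columns="n_genres")
    .sort_index()
)
print(deduped)
```

If the first occurrence always has the most genres, as the spot checks suggest, this gives the same result as `drop_duplicates(subset="title", keep="first")` on its own. Note that ratings in train.csv that refer to a dropped movieId would no longer match after a merge, so a mapping from dropped to kept IDs may be needed.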

#### Merge Datasets

In [13]:
movies_info_df = pd.merge(train_df,movies_df, on='movieId')
movies_info_df.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,5163,57669,4.0,1518349992,In Bruges (2008),Comedy|Crime|Drama|Thriller
1,87388,57669,3.5,1237455297,In Bruges (2008),Comedy|Crime|Drama|Thriller
2,137050,57669,4.0,1425631854,In Bruges (2008),Comedy|Crime|Drama|Thriller
3,120490,57669,4.5,1408228517,In Bruges (2008),Comedy|Crime|Drama|Thriller
4,50616,57669,4.5,1446941640,In Bruges (2008),Comedy|Crime|Drama|Thriller


In [9]:
imdb_df.head()

Unnamed: 0,movieId,title_cast,director,runtime,budget,plot_keywords
0,1,Tom Hanks|Tim Allen|Don Rickles|Jim Varney|Wal...,John Lasseter,81.0,"$30,000,000",toy|rivalry|cowboy|cgi animation
1,2,Robin Williams|Jonathan Hyde|Kirsten Dunst|Bra...,Jonathan Hensleigh,104.0,"$65,000,000",board game|adventurer|fight|game
2,3,Walter Matthau|Jack Lemmon|Sophia Loren|Ann-Ma...,Mark Steven Johnson,101.0,"$25,000,000",boat|lake|neighbor|rivalry
3,4,Whitney Houston|Angela Bassett|Loretta Devine|...,Terry McMillan,124.0,"$16,000,000",black american|husband wife relationship|betra...
4,5,Steve Martin|Diane Keaton|Martin Short|Kimberl...,Albert Hackett,106.0,"$30,000,000",fatherhood|doberman|dog|mansion


In [21]:
movies_info_df = pd.merge(movies_info_df,imdb_df, on='movieId')

In [22]:
movies_info_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9633031 entries, 0 to 9633030
Data columns (total 11 columns):
 #   Column         Dtype  
---  ------         -----  
 0   userId         int64  
 1   movieId        int64  
 2   rating         float64
 3   timestamp      int64  
 4   title          object 
 5   genres         object 
 6   title_cast     object 
 7   director       object 
 8   runtime        float64
 9   budget         object 
 10  plot_keywords  object 
dtypes: float64(2), int64(3), object(6)
memory usage: 881.9+ MB


In [23]:
movies_info_df = movies_info_df.dropna()

In [24]:
movies_info_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6454993 entries, 0 to 9633029
Data columns (total 11 columns):
 #   Column         Dtype  
---  ------         -----  
 0   userId         int64  
 1   movieId        int64  
 2   rating         float64
 3   timestamp      int64  
 4   title          object 
 5   genres         object 
 6   title_cast     object 
 7   director       object 
 8   runtime        float64
 9   budget         object 
 10  plot_keywords  object 
dtypes: float64(2), int64(3), object(6)
memory usage: 591.0+ MB
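
The two `info()` outputs show the row count falling from 9,633,031 to 6,454,993 after `dropna()`, so roughly a third of the merged ratings are discarded. Before dropping, it may be worth checking which columns are responsible. A minimal sketch on a toy frame follows; the values are invented and only the column names come from the merged frame.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; column names follow the merged frame
toy = pd.DataFrame({
    "movieId": [1, 2, 3, 4],
    "rating": [4.0, 3.5, 5.0, 2.0],
    "runtime": [81.0, np.nan, 101.0, np.nan],
    "budget": ["$30,000,000", None, None, None],
})

# Missing values per column, as counts and as a share of all rows
null_counts = toy.isna().sum()
null_share = toy.isna().mean().round(2)
print(pd.DataFrame({"missing": null_counts, "share": null_share}))

# Dropping only on the columns that are actually needed keeps more rows
kept_all = toy.dropna()
kept_runtime = toy.dropna(subset=["runtime"])
print(len(kept_all), len(kept_runtime))
```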


<a id='the_clean'></a>
## Data Cleaning

<a id='the_eda'></a>
## Exploratory Data Analysis

Now let's take a look at the average rating of each movie. To do so, we can group the dataset by the title of the movie and then calculate the mean of the rating for each movie.

In [None]:
movies_info_df.groupby('title')['rating'].mean().sort_values(ascending=False).head()  # top 5 movies by mean rating

A movie can make it to the top of the above list even if only a single user has given it five stars. Therefore, the above stats can be misleading. Normally, a movie which is really a good one gets a higher rating by a large number of users.

In [None]:
movies_info_df.groupby('title')['rating'].count().sort_values(ascending=False).head()

Average ratings

In [None]:
ratings_avg = pd.DataFrame(movies_info_df.groupby('title')['rating'].mean())

In [None]:
ratings_avg['rating_counts'] = movies_info_df.groupby('title')['rating'].count()

In [None]:
ratings_avg.head()
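
As noted earlier, a mean based on very few ratings can be misleading. One simple way to account for this is to rank only titles that have at least a minimum number of ratings. The sketch below uses invented toy ratings, and the threshold `min_ratings` is an arbitrary assumption that would need tuning on the real data.

```python
import pandas as pd

# Toy ratings (invented): "Obscure Film" has a single 5-star rating
ratings = pd.DataFrame({
    "title": ["Obscure Film"] + ["Popular Film"] * 4,
    "rating": [5.0, 4.0, 4.5, 3.5, 4.0],
})

# Mean rating and number of ratings per title in one pass
summary = ratings.groupby("title")["rating"].agg(rating="mean", rating_counts="count")

# Only rank titles with at least `min_ratings` ratings (threshold is an assumption)
min_ratings = 3
ranked = (
    summary[summary["rating_counts"] >= min_ratings]
    .sort_values("rating", ascending=False)
)
print(ranked)
```

Here the single-rating title is excluded from the ranking even though its mean is the highest.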

Let's plot these

In [None]:
plt.figure(figsize=(8,6))
plt.rcParams['patch.force_edgecolor'] = True
ratings_avg['rating_counts'].hist(bins=50)

Most movies received fewer than 5,000 ratings.

In [None]:
plt.figure(figsize=(8,6))
plt.rcParams['patch.force_edgecolor'] = True
ratings_avg['rating'].hist(bins=50)

In [None]:
plt.rcParams['patch.force_edgecolor'] = True
# jointplot creates its own figure, so its size is set with `height` rather than plt.figure
sns.jointplot(x='rating', y='rating_counts', data=ratings_avg, alpha=0.4, height=8)

# Tshegofatjo

In [19]:
%matplotlib inline
from IPython.display import Image, HTML
import json
import datetime
import ast
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy import stats
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.model_selection import train_test_split
from wordcloud import WordCloud, STOPWORDS
import warnings
warnings.filterwarnings('ignore')

# Set style and font scale in one call; a separate sns.set() would reset the style to its default
sns.set(style='whitegrid', font_scale=1.25)
pd.set_option('display.max_colwidth', 50)

In [25]:
movies_info_df.head(5)

Unnamed: 0,userId,movieId,rating,timestamp,title,genres,title_cast,director,runtime,budget,plot_keywords
0,5163,57669,4.0,1518349992,In Bruges (2008),Comedy|Crime|Drama|Thriller,Elizabeth Berrington|Rudy Blomme|Olivier Bonjo...,Martin McDonagh,107.0,"$15,000,000",dwarf|bruges|irish|hitman
1,87388,57669,3.5,1237455297,In Bruges (2008),Comedy|Crime|Drama|Thriller,Elizabeth Berrington|Rudy Blomme|Olivier Bonjo...,Martin McDonagh,107.0,"$15,000,000",dwarf|bruges|irish|hitman
2,137050,57669,4.0,1425631854,In Bruges (2008),Comedy|Crime|Drama|Thriller,Elizabeth Berrington|Rudy Blomme|Olivier Bonjo...,Martin McDonagh,107.0,"$15,000,000",dwarf|bruges|irish|hitman
3,120490,57669,4.5,1408228517,In Bruges (2008),Comedy|Crime|Drama|Thriller,Elizabeth Berrington|Rudy Blomme|Olivier Bonjo...,Martin McDonagh,107.0,"$15,000,000",dwarf|bruges|irish|hitman
4,50616,57669,4.5,1446941640,In Bruges (2008),Comedy|Crime|Drama|Thriller,Elizabeth Berrington|Rudy Blomme|Olivier Bonjo...,Martin McDonagh,107.0,"$15,000,000",dwarf|bruges|irish|hitman


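<a id='the_engineer'></a>
## Feature Engineering

<a id='the_recommender'></a>
## Implementation of Recommender Systems

### Implementation of Content-Based Filtering

### Implementation of Collaborative Filtering

<a id='the_assess'></a>
## Assessing Accuracy

<a id='the_conclusion'></a>
## Conclusion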