# Introduction

As on-demand entertainment becomes more prevalent, content is becoming less exclusive to a particular network and more widely available across content providers. As a result, entertainment providers compete to deliver the right content to the right user, keeping their users engaged and reducing their incentive to move to another provider. One key factor in pushing the right content to the right users is a robust recommendation system.

A Singapore start-up that envisions becoming the Netflix of Singapore has hired me as a data scientist to help their team build a recommendation system. They have kindly provided movie rating data from their system. Due to the PDPA (Personal Data Protection Act), no user profiles are provided.

## Problem Statement
1. To create a recommendation system based on collaborative filtering that recommends movies to users based on similar movies watched and their ratings.
2. The recommendation system should also recommend new content (without ratings) to users based on their user profile.
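
To make the first goal concrete, here is a minimal sketch of item-based collaborative filtering using cosine similarity between movies. It is an illustration only and not necessarily the method used later: the toy ratings, the `most_similar` helper, and the choice to fill unrated movies with 0 are all assumptions.

```python
import numpy as np
import pandas as pd

# Toy ratings in the same long format as the rating data (values are made up)
toy = pd.DataFrame({
    'customer_id': [1, 1, 1, 2, 2, 3, 3, 3],
    'movie_id':    [10, 20, 30, 10, 20, 10, 20, 30],
    'rating':      [5, 4, 1, 4, 5, 1, 2, 5],
})

# User x movie matrix; unrated movies are filled with 0 (a simplifying assumption)
matrix = toy.pivot_table(index='customer_id', columns='movie_id', values='rating').fillna(0)

# Cosine similarity between movie columns
vectors = matrix.to_numpy().T
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
similarity = pd.DataFrame(
    (vectors @ vectors.T) / (norms @ norms.T),
    index=matrix.columns, columns=matrix.columns,
)

def most_similar(movie_id, n=1):
    """Return the n movies most similar to movie_id, excluding itself (hypothetical helper)."""
    return similarity[movie_id].drop(movie_id).nlargest(n).index.tolist()

print(most_similar(10))
```

Movies rated alike by the same users end up with a high similarity score, which is the signal a collaborative filter would use to recommend one to viewers of the other.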

## Import Libraries

In [2]:
import numpy as np
import pandas as pd
import glob

import matplotlib.pyplot as plt
import seaborn as sns

# API Call
import requests

## Import Data Sets

### Import the Movie Titles File

In [3]:
# Import Movie titles file
titles = pd.read_csv('datasets/movie_titles.txt', sep = ',', names=['movie_id','year_of_release','title'], encoding='Latin_1')

In [4]:
# Convert all movie titles to lower case
titles['title'] = titles['title'].str.lower()

In [5]:
print(f'shape of titles {titles.shape}')
titles.head()

shape of titles (17770, 3)


In [6]:
# For testing purpose
test = titles.head(1_000)
test.head()

Unnamed: 0,movie_id,year_of_release,title
0,1,2003.0,dinosaur planet
1,2,2004.0,isle of man tt 2004 review
2,3,1997.0,character
3,4,1994.0,paula abdul's get up & dance
4,5,2004.0,the rise and fall of ecw


### Connect to TheMovieDB.org

In [7]:
# API key
REQUEST_TOKEN = 'd8f46f139abc47c1f048f3efc486fe53'

# Target web page, used to authenticate the API call session
url = "https://www.themoviedb.org/authenticate/"+ REQUEST_TOKEN

# Establishing the connection to the web page:
response = requests.get(url)

# You can use status codes to understand how the target server responds to your request.
# Ex., 200 = OK, 400 = Bad Request, 403 = Forbidden, 404 = Not Found.
print(response.status_code)

200
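
A key written directly into a notebook is easy to leak when the notebook is shared. One common alternative, sketched below, is to read it from an environment variable. The variable name `TMDB_API_KEY` and the `load_api_key` helper are hypothetical and not part of the original notebook.

```python
import os

def load_api_key(var_name='TMDB_API_KEY'):
    """Read the API key from an environment variable (the variable name is an assumption)."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f'{var_name} is not set')
    return key
```

With the variable set in the shell before starting Jupyter, the cell above could then use `REQUEST_TOKEN = load_api_key()`.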


In [8]:
# Query TheMovieDB for a title: try the movie search first, then the TV search

def title_pull(item_name, item_release_year):

    print(f'currently working on {item_name}')

    # year_of_release is stored as a float (e.g. 2003.0) and may be missing
    year = None if pd.isna(item_release_year) else int(item_release_year)

    # (endpoint, name of its year filter) pairs, tried in order
    searches = [('movie', 'primary_release_year'), ('tv', 'first_air_date_year')]

    for endpoint, year_param in searches:
        # Let requests URL-encode the query so characters such as '&' and '#' survive
        params = {'api_key': REQUEST_TOKEN, 'query': item_name, 'page': 1}
        if year is not None:
            params[year_param] = year
        try:
            req = requests.get(f"https://api.themoviedb.org/3/search/{endpoint}", params=params, timeout=10)
            # return the 1st item of the results list, if there is one
            results = req.json().get('results', [])
            if results:
                return results[0]
        except (requests.RequestException, ValueError):
            # network error or non-JSON response: fall through to the next endpoint
            continue

    # Return 'NA' if not able to search title
    return "NA"
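
Several titles in this data set contain characters such as `&` and `#`, which have special meaning in a URL query string, so they need percent-encoding just as spaces do. The sketch below uses the standard library's `urlencode` to show what a fully encoded query looks like. The `build_query` helper is hypothetical and only illustrates the encoding step.

```python
from urllib.parse import urlencode

def build_query(title, year):
    """Encode search parameters into a URL query string (illustrative helper)."""
    return urlencode({'query': title, 'primary_release_year': int(year)})

encoded = build_query("paula abdul's get up & dance", 1994.0)
print(encoded)
```

The only literal `&` left in the encoded string is the separator between the two parameters. The one inside the title is escaped, so the server receives the whole title as a single value.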

In [9]:
%%time

# Create a column of pulled result based on title & release_year
test['moviedb_result'] = test.apply(lambda x: title_pull(x['title'], x['year_of_release']),axis=1)

currently working on dinosaur planet
currently working on isle of man tt 2004 review
currently working on character
currently working on paula abdul's get up & dance
currently working on the rise and fall of ecw
currently working on sick
currently working on 8 man
currently working on what the #$*! do we know!?
currently working on class of nuke 'em high 2
currently working on fighter
currently working on full frame: documentary shorts
currently working on my favorite brunette
currently working on lord of the rings: the return of the king: extended edition: bonus material
currently working on nature: antarctica
currently working on neil diamond: greatest hits live
currently working on screamers
currently working on 7 seconds
currently working on immortal beloved
currently working on by dawn's early light
currently working on seeta aur geeta
currently working on strange relations
currently working on chump change
currently working on clifford: clifford saves the day! / clifford's fluffi

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
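
The warning above appears because `test` was created as a slice of `titles`, so pandas cannot tell whether the new column is meant for the slice or for the original frame. A minimal sketch of the usual remedy, assuming an explicit copy of the subset is acceptable (the small frame here is a stand-in with illustrative values):

```python
import pandas as pd

# Small stand-in for the titles DataFrame (values are illustrative)
titles_demo = pd.DataFrame({'movie_id': [1, 2, 3], 'title': ['a', 'b', 'c']})

# .copy() makes the subset an independent DataFrame, so adding a column is unambiguous
subset = titles_demo.head(2).copy()
subset['moviedb_result'] = ['x', 'y']

print(subset)
```

Applied to this notebook, that would mean creating the test frame with `titles.head(1_000).copy()`.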


In [13]:
test[test['moviedb_result']=='NA']

Unnamed: 0,movie_id,year_of_release,title,moviedb_result
1,2,2004.0,isle of man tt 2004 review,
10,11,1999.0,full frame: documentary shorts,
12,13,2003.0,lord of the rings: the return of the king: ext...,
13,14,1982.0,nature: antarctica,
15,16,1996.0,screamers,
...,...,...,...,...
964,965,1985.0,metropolitan opera: puccini: tosca,
975,976,2003.0,tom and jerry: paws for a holiday,
982,983,1997.0,wishful thinking,
990,991,1997.0,burn up excess: vol. 2: crimes and missed deme...,
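
To judge how well the lookup works overall, it can help to compute the share of titles that returned a result. The sketch below runs on a made-up column: it assumes, as in `title_pull`, that a successful lookup is a dict and a failed one is the string `'NA'`.

```python
import pandas as pd

# Made-up lookup results: dicts for matches, 'NA' for misses
results = pd.Series([{'id': 1}, 'NA', {'id': 2}, 'NA'], name='moviedb_result')

# True where the lookup returned a result dict
matched = results.map(lambda r: isinstance(r, dict))
match_rate = matched.mean()

print(f'{match_rate:.0%} of titles matched')
```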


In [16]:
# Show the full title of one unmatched row (index 12)
test.loc[12, 'title']

'lord of the rings: the return of the king: extended edition: bonus material'

In [20]:
test[test['title']=='gto: great teacher onizuka: set 2']

Unnamed: 0,movie_id,year_of_release,title,moviedb_result
134,135,1998.0,gto: great teacher onizuka: set 2,


### Import Movie Rating File by Users

In [5]:
%%time
# Import rating files into dataframe
ls_of_ratings = []

# Specify the pattern matching for glob
rating_files = glob.glob('datasets/training_set/*')

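# Read each rating file into a DataFrame and collect the DataFrames in a list
for filename in rating_files:
    # skiprows=1 skips the first line of each file, which is not a rating row
    df = pd.read_csv(filename, sep=',', names=['customer_id','rating','date'],skiprows=1)
    # The movie id is encoded in the file name after 'mv_'
    df['movie_id'] = int(filename.split('mv_')[1].split('.')[0])
    ls_of_ratings.append(df)

# Concatenate the DataFrames into one
ratings = pd.concat(ls_of_ratings,ignore_index=True)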

print('glob completed')

glob completed
Wall time: 2min 34s
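
The movie id is taken from each file name by splitting on `mv_`. A slightly more defensive variant, sketched below, uses a regular expression and fails loudly on an unexpected name. The `movie_id_from_path` helper and the `.txt` extension in the usage line are assumptions.

```python
import re
from pathlib import Path

def movie_id_from_path(path):
    """Extract the integer movie id from a file name of the form mv_<digits>.<ext>."""
    match = re.match(r'mv_(\d+)\.', Path(path).name)
    if match is None:
        raise ValueError(f'unexpected rating file name: {path}')
    return int(match.group(1))

print(movie_id_from_path('datasets/training_set/mv_0000001.txt'))
```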


In [7]:
print(ratings.shape)
ratings.head()

(100480507, 4)


Unnamed: 0,customer_id,rating,date,movie_id
0,1488844,3,2005-09-06,1
1,822109,5,2005-05-13,1
2,885013,4,2005-10-19,1
3,30878,4,2005-12-26,1
4,823519,3,2004-05-03,1


### Merging the Movie Title and Rating DataFrames

In [8]:
ratings_set = set(ratings['movie_id'])
titles_set = set(titles['movie_id'])
print(f" elements present in title df but not in rating df : {titles_set.difference(ratings_set)}")
print(f" elements present in rating df but not in title df : {ratings_set.difference(titles_set)}")

 elements present in title df but not in rating df : set()
 elements present in rating df but not in title df : set()
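
The same key check can also be done with the merge itself by using the `indicator` option of `pd.merge`, which labels each row by the side it came from. A small sketch on made-up frames:

```python
import pandas as pd

# Made-up frames: movie 4 has ratings but no title, movie 3 has a title but no ratings
ratings_demo = pd.DataFrame({'movie_id': [1, 1, 2, 4]})
titles_demo = pd.DataFrame({'movie_id': [1, 2, 3], 'title': ['a', 'b', 'c']})

# An outer merge keeps unmatched keys from both sides; '_merge' records where each came from
check = pd.merge(
    ratings_demo.drop_duplicates('movie_id'), titles_demo,
    on='movie_id', how='outer', indicator=True,
)
unmatched = check[check['_merge'] != 'both']

print(unmatched)
```

An empty `unmatched` frame would correspond to the two empty sets printed above.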


In [9]:
# Merge movie title with rating
df = pd.merge(ratings, titles, on='movie_id', how='inner')
df.shape

(100480507, 6)

In [10]:
df.head()

Unnamed: 0,customer_id,rating,date,movie_id,year_of_release,title
0,1488844,3,2005-09-06,1,2003.0,Dinosaur Planet
1,822109,5,2005-05-13,1,2003.0,Dinosaur Planet
2,885013,4,2005-10-19,1,2003.0,Dinosaur Planet
3,30878,4,2005-12-26,1,2003.0,Dinosaur Planet
4,823519,3,2004-05-03,1,2003.0,Dinosaur Planet


In [11]:
df.isnull().sum()

customer_id          0
rating               0
date                 0
movie_id             0
year_of_release    965
title                0
dtype: int64

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 100480507 entries, 0 to 100480506
Data columns (total 6 columns):
 #   Column           Dtype  
---  ------           -----  
 0   customer_id      int64  
 1   rating           int64  
 2   date             object 
 3   movie_id         int64  
 4   year_of_release  float64
 5   title            object 
dtypes: float64(1), int64(3), object(2)
memory usage: 5.2+ GB
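
At over 5 GB, the merged frame is heavy to work with. Every integer column is stored as `int64` and the dates are stored as Python objects. A sketch of narrowing the dtypes follows, on a small stand-in frame. It assumes the values fit the smaller types, which should be checked with `.max()` on the real data before converting.

```python
import pandas as pd

# Small stand-in using the first rows shown above
demo = pd.DataFrame({
    'customer_id': [1488844, 822109, 885013],
    'rating': [3, 5, 4],
    'date': ['2005-09-06', '2005-05-13', '2005-10-19'],
    'movie_id': [1, 1, 1],
})
before = demo.memory_usage(deep=True).sum()

# Narrower integer types (assumes the value ranges fit) and real datetimes
compact = demo.astype({'customer_id': 'int32', 'rating': 'int8', 'movie_id': 'int16'})
compact['date'] = pd.to_datetime(compact['date'])
after = compact.memory_usage(deep=True).sum()

print(before, after)
```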


In [28]:
df['title'].nunique()

17297
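
There are 17,297 unique titles against 17,770 rows in the titles file, which suggests that some titles are shared by more than one `movie_id`. If so, `movie_id` rather than `title` should be the key for any grouping. A sketch for listing repeated titles, on made-up data:

```python
import pandas as pd

# Made-up titles: 'title a' appears under two different movie ids
titles_demo = pd.DataFrame({
    'movie_id': [1, 2, 3, 4],
    'title': ['title a', 'title b', 'title a', 'title c'],
})

# keep=False marks every occurrence of a repeated title, not just the later ones
repeated = titles_demo[titles_demo.duplicated('title', keep=False)].sort_values(['title', 'movie_id'])

print(repeated)
```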

Unnamed: 0,customer_id,rating,date,movie_id,year_of_release,title
0,1488844,3,2005-09-06,1,2003.0,Dinosaur Planet
1,822109,5,2005-05-13,1,2003.0,Dinosaur Planet
2,885013,4,2005-10-19,1,2003.0,Dinosaur Planet
3,30878,4,2005-12-26,1,2003.0,Dinosaur Planet
4,823519,3,2004-05-03,1,2003.0,Dinosaur Planet
5,893988,3,2005-11-17,1,2003.0,Dinosaur Planet
6,124105,4,2004-08-05,1,2003.0,Dinosaur Planet
7,1248029,3,2004-04-22,1,2003.0,Dinosaur Planet
8,1842128,4,2004-05-09,1,2003.0,Dinosaur Planet
9,2238063,3,2005-05-11,1,2003.0,Dinosaur Planet
