# Introduction

As on-demand entertainment becomes more prevalent, content is becoming less exclusive to a particular network and more widely available across content providers. As a result, entertainment providers compete to deliver the right content to the right user, keeping their users engaged and reducing the incentive to move to another provider. One key factor in pushing the right content to the right users is a robust recommendation system.

A Singapore start-up that envisions becoming the Netflix of Singapore has hired me as a data scientist to help their team build a recommendation system. They have kindly provided movie rating data from their system. Because of the PDPA (Personal Data Protection Act), no user profiles are provided.

## Problem Statement
1. Create a recommendation system based on collaborative filtering that recommends movies to users based on similar movies watched and their ratings.
2. The recommendation system should also recommend new content (without ratings) to users based on each user's profile.
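
The problem statement does not fix a particular collaborative-filtering variant. As a minimal sketch of one common option (item-based filtering with cosine similarity), the toy example below predicts a rating as a similarity-weighted average of the user's other ratings. The data in `toy_ratings` and all helper names are illustrative assumptions, not part of the start-up's dataset or the final model.

```python
from math import sqrt

# Hypothetical toy data: user -> {movie: rating}
toy_ratings = {
    "u1": {"A": 5, "B": 4, "C": 1},
    "u2": {"A": 4, "B": 5},
    "u3": {"B": 2, "C": 5},
}

def item_vector(movie):
    """Ratings given to one movie, keyed by user."""
    return {user: r[movie] for user, r in toy_ratings.items() if movie in r}

def cosine(v1, v2):
    """Cosine similarity computed over the users who rated both movies."""
    common = v1.keys() & v2.keys()
    if not common:
        return 0.0
    dot = sum(v1[u] * v2[u] for u in common)
    norm1 = sqrt(sum(v1[u] ** 2 for u in common))
    norm2 = sqrt(sum(v2[u] ** 2 for u in common))
    return dot / (norm1 * norm2)

def predict(user, movie):
    """Similarity-weighted average of the ratings the user has already given."""
    target = item_vector(movie)
    num = den = 0.0
    for other, rating in toy_ratings[user].items():
        sim = cosine(target, item_vector(other))
        num += sim * rating
        den += abs(sim)
    return num / den if den else None

# u2 has not rated movie C; estimate it from u2's ratings of A and B
print(predict("u2", "C"))
```

With only a handful of co-ratings the similarities are unreliable, which is one reason real systems usually require a minimum number of common raters before trusting a similarity score.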

## Import Library

In [1]:
import pickle
pickle.HIGHEST_PROTOCOL = 4
import numpy as np
import pandas as pd
import glob
import csv

import matplotlib.pyplot as plt
import seaborn as sns

# API Call
import requests

from datetime import datetime


# Additional Setting
pd.set_option("display.max_rows", 201)

## Import Data sets 

### Import Movie Rating File by Users

In [19]:
%%time
# Import rating files into dataframe
ls_of_ratings = []

# Specify the pattern matching for glob
rating_files = glob.glob('datasets/training_set/*')

# Loop to get dataframe into list
for filename in rating_files:
    df = pd.read_csv(filename, sep=',', names=['customer_id','rating','date'],skiprows=1)
    df['movie_id'] = int(filename.split('mv_')[1].split('.')[0])
    ls_of_ratings.append(df)

# Concat dataframe together
ratings = pd.concat(ls_of_ratings,ignore_index=True)

print('glob completed')

glob completed
Wall time: 2min 56s
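
The loop above recovers `movie_id` by chaining two `split` calls on the file path. A slightly more defensive variant is sketched below; it assumes file names of the form `mv_<digits>.<extension>` (for example `mv_0000123.txt`, which is an assumption about the training set, not something shown in this notebook) and fails loudly on anything else. `movie_id_from_path` is a hypothetical helper name.

```python
import re

def movie_id_from_path(path):
    """Return the integer movie id embedded in a rating file name."""
    match = re.search(r"mv_(\d+)\.", str(path))
    if match is None:
        raise ValueError(f"no movie id found in file name: {path!r}")
    return int(match.group(1))

print(movie_id_from_path("datasets/training_set/mv_0000123.txt"))  # 123
```

Raising on an unexpected file name makes stray files in the folder visible immediately, instead of surfacing later as an `IndexError` or a wrong id.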


In [16]:
%%time
ratings.to_csv('datasets/movie.csv')

Wall time: 7min 11s


In [20]:
%%time
ratings.to_hdf('datasets/movie.h5', key='ratings', mode='w')

Wall time: 34.4 s


In [2]:
%%time
test_df = pd.read_hdf('datasets/movie.h5', 'ratings')

Wall time: 27.8 s


In [14]:
%%time
test_df2 = pd.read_csv('datasets/ratings_combined.csv')

Wall time: 1min 15s


In [4]:
%%time
ratings.to_pickle('datasets/movie.pkl',protocol=4)

Wall time: 27 s


In [8]:
%%time
test_df3= pd.read_pickle('datasets/movie.pkl')

Wall time: 15.4 s


In [5]:
# print(ratings.shape)
# ratings.head()

### Importing of Movie title file

In [4]:
# Import Movie titles file
titles = pd.read_csv('datasets/movie_titles.txt', sep = ',', names=['movie_id','year_of_release','title'], encoding='Latin_1')

In [5]:
# Convert all movie titles to lower case
titles['title'] = titles['title'].str.lower()

In [6]:
print(f'shape of titles {titles.shape}')
titles.head()

shape of titles (17770, 3)


Unnamed: 0,movie_id,year_of_release,title
0,1,2003.0,dinosaur planet
1,2,2004.0,isle of man tt 2004 review
2,3,1997.0,character
3,4,1994.0,paula abdul's get up & dance
4,5,2004.0,the rise and fall of ecw


#### Cleaning of titles dataframe

In [7]:
# Split title into a list of segments - delimiter ":"
titles['split_title'] = titles['title'].str.split(':')
titles.head()

Unnamed: 0,movie_id,year_of_release,title,split_title
0,1,2003.0,dinosaur planet,[dinosaur planet]
1,2,2004.0,isle of man tt 2004 review,[isle of man tt 2004 review]
2,3,1997.0,character,[character]
3,4,1994.0,paula abdul's get up & dance,[paula abdul's get up & dance]
4,5,2004.0,the rise and fall of ecw,[the rise and fall of ecw]


In [8]:
# For testing
# titles = titles.loc[2200:2300,:]
# titles.shape

In [9]:
print(titles.info())
print('-' * 50)
print(titles.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17770 entries, 0 to 17769
Data columns (total 4 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   movie_id         17770 non-null  int64  
 1   year_of_release  17763 non-null  float64
 2   title            17770 non-null  object 
 3   split_title      17770 non-null  object 
dtypes: float64(1), int64(1), object(2)
memory usage: 555.4+ KB
None
--------------------------------------------------
(17770, 4)


In [10]:
titles

Unnamed: 0,movie_id,year_of_release,title,split_title
0,1,2003.0,dinosaur planet,[dinosaur planet]
1,2,2004.0,isle of man tt 2004 review,[isle of man tt 2004 review]
2,3,1997.0,character,[character]
3,4,1994.0,paula abdul's get up & dance,[paula abdul's get up & dance]
4,5,2004.0,the rise and fall of ecw,[the rise and fall of ecw]
...,...,...,...,...
17765,17766,2002.0,where the wild things are and other maurice se...,[where the wild things are and other maurice s...
17766,17767,2004.0,fidel castro: american experience,"[fidel castro, american experience]"
17767,17768,2000.0,epoch,[epoch]
17768,17769,2003.0,the company,[the company]


In [11]:
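# Split titles into chunks of 500 rows so partial results can be saved as we go.
# iloc slicing is half-open, so the chunks do not overlap and the final
# (shorter) chunk is included. copy() avoids SettingWithCopyWarning when a
# column is added to a chunk later.
chunk_size = 500
chunk_list = []
for start_point in range(0, len(titles), chunk_size):
    df = titles.iloc[start_point:start_point + chunk_size].copy()
    chunk_list.append(df)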
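
When chunking by position, it helps to check that the chunks cover every row exactly once. Label-based `.loc` slices include both end points, while `.iloc` slices are half-open, so the boundaries are easy to get wrong. The sketch below computes half-open `(start, stop)` bounds for the 17,770 titles; `chunk_bounds` is a hypothetical helper, not part of the notebook.

```python
def chunk_bounds(n_rows, chunk_size):
    """Return half-open (start, stop) pairs that cover range(n_rows) once."""
    return [(start, min(start + chunk_size, n_rows))
            for start in range(0, n_rows, chunk_size)]

bounds = chunk_bounds(17770, 500)
print(len(bounds))   # 36
print(bounds[-1])    # (17500, 17770)
```

Each pair can be passed straight to `titles.iloc[start:stop]`.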

## API Call

### Connect to TheMovieDB.org

In [12]:
# API key
REQUEST_TOKEN = 'd8f46f139abc47c1f048f3efc486fe53'

# Target web page:

# To authenticate api call session
url = "https://www.themoviedb.org/authenticate/"+ REQUEST_TOKEN

# Establishing the connection to the web page:
response = requests.get(url)

# You can use status codes to understand how the target server responds to your request.
# Ex., 200 = OK, 400 = Bad Request, 403 = Forbidden, 404 = Not Found.
print(response.status_code)

200
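
Two cautions apply to the cell above. A 200 status from a web page may only show that the page is reachable, so it is worth confirming the key against an actual API call as well. Hard-coding a key in a notebook also makes it easy to leak when the notebook is shared. A minimal sketch of reading the key from an environment variable follows; the variable name `TMDB_API_KEY` and the helper `load_api_key` are assumptions chosen for illustration.

```python
import os

def load_api_key(env_var="TMDB_API_KEY"):
    """Read the API key from an environment variable (name is an assumption)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"set the {env_var} environment variable first")
    return key
```

In the notebook this would replace the literal assignment, for example `REQUEST_TOKEN = load_api_key()`.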


Note: the code below still needs to be reworked.

In [13]:
def url_request(mode, item, item_release_year=None):
    '''
    Query the TheMovieDB search endpoint and return the parsed JSON response.

    mode              : 'movie' or 'tv'
    item              : search string, already URL-encoded by the caller
    item_release_year : optional release year; None or NaN means no year filter
    '''
    # The year filter has a different parameter name for movies and TV shows
    year_param = {'movie': 'primary_release_year', 'tv': 'first_air_date_year'}
    if mode not in year_param:
        raise ValueError(f"mode must be 'movie' or 'tv', got {mode!r}")

    url = f"https://api.themoviedb.org/3/search/{mode}?api_key={REQUEST_TOKEN}&query={item}&page=1"

    # year_of_release is stored as a float and may be NaN, so add it only when valid
    if item_release_year is not None and not pd.isna(item_release_year):
        url += f"&{year_param[mode]}={int(item_release_year)}"

    req = requests.get(url)
    sq_dict = req.json()

    return sq_dict
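
Building the query string by hand is fragile when a title contains characters such as `&`, which would otherwise be read as a parameter separator. As a sketch of an alternative, the standard library's `urlencode` can escape every parameter in one step. `build_search_url` is a hypothetical helper; the endpoint and parameter names are the ones used in `url_request` above.

```python
from urllib.parse import urlencode

def build_search_url(mode, query, api_key, year=None):
    """Build a search URL with all parameters escaped (hypothetical helper)."""
    year_param = {"movie": "primary_release_year", "tv": "first_air_date_year"}
    params = {"api_key": api_key, "query": query, "page": 1}
    # NaN is the only value that is not equal to itself, so this skips NaN years
    if year is not None and year == year:
        params[year_param[mode]] = int(year)
    return f"https://api.themoviedb.org/3/search/{mode}?{urlencode(params)}"

print(build_search_url("movie", "get up & dance", "KEY", 1994.0))
```

Note that `urlencode` writes spaces as `+` rather than `%20`; both are valid in a query string.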

In [14]:
# Create a function for API call for different movie titles
from urllib.parse import quote

def title_pull(item_name, item_release_year):
    '''
    Search TheMovieDB for a title and return (result, search_title, type).

    The search falls back in this order: movie with release year, movie
    without year, TV show with year, TV show without year.
    search_title is the title that was searched, not URL-encoded.
    ('NA', 'NA', 'NA') is returned when nothing matches.
    '''
    # URL-encode the title (spaces become %20; characters such as '&' are escaped too)
    item = quote(item_name)

    # (mode, year, name of the title field in the API response)
    attempts = [
        ('movie', item_release_year, 'title'),
        ('movie', None, 'title'),
        ('tv', item_release_year, 'name'),
        ('tv', None, 'name'),
    ]

    for mode, year, name_key in attempts:
        results = url_request(mode, item, year)
        hits = results.get('results') or []
        if hits:
            # Keep only the id and the title/name of the top hit
            res = {key: hits[0][key] for key in hits[0].keys() & {'id', name_key}}
            return (res, item_name, mode)

    return ('NA', 'NA', 'NA')
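
The fallback order in `title_pull` (movie with year, movie without year, TV with year, TV without year) can be exercised without network access by injecting the search function. The sketch below does this with a stand-in `fake_search`; both function names and the one-entry catalogue are invented for illustration and do not reflect real API responses.

```python
def first_match(search, title, year):
    """Try each (mode, year) combination in order and return the first hit."""
    attempts = [("movie", year), ("movie", None), ("tv", year), ("tv", None)]
    for mode, yr in attempts:
        hits = search(mode, title, yr)
        if hits:
            return hits[0], mode
    return None, None

def fake_search(mode, title, year):
    """Stand-in for the API: knows a single TV entry, searchable without a year."""
    catalogue = {
        ("tv", "dinosaur planet", None): [{"id": 1, "name": "Dinosaur Planet"}],
    }
    return catalogue.get((mode, title, year), [])

print(first_match(fake_search, "dinosaur planet", 2003))
print(first_match(fake_search, "unknown title", None))
```

Separating the fallback order from the HTTP call makes the order itself easy to unit-test.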

In [15]:
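def loop_title_pull(num_ls, item_list, item_release_year):
    '''
    Search TheMovieDB for one row of the titles dataframe.

    num_ls            : movie_id of the row, used only for progress logging
    item_list         : title split on ':' (the split_title column)
    item_release_year : release year of the title
    '''
    # Print a progress message every 100 titles
    if num_ls % 100 == 0:
        now = datetime.now()
        current_time = now.strftime("%H:%M:%S")
        print(f"Current run: {num_ls} @ {current_time}")

    # Search with the first segment of the title, then with progressively
    # longer prefixes (re-joined with ':') until a match is found
    result = ('NA', 'NA', 'NA')
    for i in range(len(item_list)):
        item_name = ':'.join(item_list[:i + 1])
        result = title_pull(item_name, item_release_year)
        if result[0] != 'NA':
            break

    return result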

In [None]:
%%time
for index, df in enumerate(chunk_list):
    df['moviedb_result'] = df.apply(lambda x: loop_title_pull(x['movie_id'], x['split_title'], x['year_of_release']),axis=1)
    df.to_csv(f'datasets/titles_added/ta_{index}.csv')
    print(index, type(df))

Current run: 100 @ 17:22:26
Current run: 200 @ 17:23:16
Current run: 300 @ 17:24:03


In [None]:
%%time

start = datetime.now()

current_time = start.strftime("%H:%M:%S")
print("Start Time =", current_time)


# Create a column of pulled result based on title & release_year
titles['moviedb_result'] = titles.apply(lambda x: loop_title_pull(x['movie_id'], x['split_title'], x['year_of_release']),axis=1)

# Create a column of pulled result based on title & release_year
# test['moviedb_result'] = test.apply(lambda x: loop_title_pull(x['split_title'], x['year_of_release']),axis=1)

In [19]:
titles.head()

Unnamed: 0,movie_id,year_of_release,title,split_title
0,1,2003.0,dinosaur planet,[dinosaur planet]
1,2,2004.0,isle of man tt 2004 review,[isle of man tt 2004 review]
2,3,1997.0,character,[character]
3,4,1994.0,paula abdul's get up & dance,[paula abdul's get up & dance]
4,5,2004.0,the rise and fall of ecw,[the rise and fall of ecw]


In [None]:
%%time
titles[['search_result', 'search_title', 'type']] = pd.DataFrame(titles['moviedb_result'].tolist(), index=titles.index)


## test[['search_result', 'search_title', 'type']] = pd.DataFrame(test['moviedb_result'].tolist(), index=test.index)

# Slower method but may be more readable
#test['search_result'], test['search_title'], test['type'] = zip(*test['moviedb_result'])
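
The cell above expands a column of 3-tuples into three columns. A small self-contained sketch of the same pattern on made-up data is shown below; the frame `toy` and its values are illustrative only. Passing the original index to the new DataFrame matters whenever the index is not a plain 0..n-1 range, because the assignment aligns on index labels.

```python
import pandas as pd

# Made-up results in the same (result, search_title, type) shape
toy = pd.DataFrame(
    {"moviedb_result": [({"id": 1, "title": "a"}, "a", "movie"),
                        ("NA", "NA", "NA")]},
    index=[10, 11],
)

expanded = pd.DataFrame(
    toy["moviedb_result"].tolist(),
    index=toy.index,  # keep the original labels so rows line up
    columns=["search_result", "search_title", "type"],
)
toy = toy.join(expanded)
print(toy[["search_title", "type"]])
```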

In [None]:
# Change URL-encoded whitespace (%20) back to a normal space
titles['search_title'] = titles['search_title'].map(lambda x: x.replace('%20', " "))

# test['search_title'] = test['search_title'].map(lambda x: x.replace('%20', " "))

In [None]:
titles.head()

In [None]:
titles[titles['search_result']=='NA']['type'].count()

### Merging of Movie title & Rating DataFrame

In [None]:
ratings_set = set(ratings['movie_id'])
titles_set = set(titles['movie_id'])
print(f" elements present in title df but not in rating df : {titles_set.difference(ratings_set)}")
print(f" elements present in rating df but not in title df : {ratings_set.difference(titles_set)}")

In [None]:
# Merge movie title with rating
df = pd.merge(ratings, titles, on='movie_id', how='inner')
df.shape
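
An inner merge silently drops rows whose `movie_id` appears in only one of the two frames. As a hedged sketch on made-up data, an outer merge with `indicator=True` shows how many rows fall into each case before committing to the inner join. The frames `ratings_toy` and `titles_toy` are illustrative only.

```python
import pandas as pd

ratings_toy = pd.DataFrame({"movie_id": [1, 1, 2, 4], "rating": [3, 5, 4, 2]})
titles_toy = pd.DataFrame({"movie_id": [1, 2, 3], "title": ["a", "b", "c"]})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
check = ratings_toy.merge(titles_toy, on="movie_id", how="outer", indicator=True)
counts = check["_merge"].value_counts()
print(counts)
```

Here three rating rows match a title, one rating has no title (`left_only`), and one title has no ratings (`right_only`).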

In [None]:
df.head()

In [None]:
df.isnull().sum()

In [None]:
df.info()

In [None]:
df['title'].nunique()