## Connection to Google Drive

In [1]:
# Importing datasets from shared folder
# from google.colab import drive
# drive.mount("/content/drive/")

In [1]:
# File location variable 
# fileloc = '/content/drive/My Drive/DSI_16/Capstone/datasets/'

# File @ local machine
fileloc = 'datasets/'

## Introduction
----
As technology progresses and more of our lifestyle revolves around tech gadgets, many companies have accumulated large amounts of data about their customers. The key question has always been: "How do we tap this immense wealth of data, not just to gain insight into our customers and our own strengths, but also to monetize it?"

The simplest, or at least the most practical, solution is to hire a team of data scientists, hand over a data dump, and ask them to solve the problem. Yet very often, whether intentionally or not, stakeholders leave out the most crucial piece of information: full access to domain knowledge. Because of this missing piece, data scientists often produce analyses or products that fall short of stakeholder expectations. More often than not, data scientists spend so much time gathering domain knowledge from various stakeholders that little time is left to produce quality, succinct analyses and products.

This project serves as a simulation in which I am hired as a data scientist and given access to a limited but substantial amount of data to quickly (approximately 2 weeks) find ways to help business stakeholders make better sales decisions, with little or no interaction with any of the stakeholders. Beyond this scope and these restrictions, I am given limited time to tap the knowledge of other tech teams in the course of this project.

The simulated dataset is the Netflix Prize dataset from Kaggle. The data is sufficiently large (~4GB) but limited in information: there are no user profiles apart from their ratings, and title details are missing except for the release year.

## Problem Statement

Using the Netflix dataset, develop a recommendation system that lets new customers find titles matching their taste among all the titles in the repository. The recommendation system needs to fulfill the following criteria:

- Design and create a robust recommendation system in which titles are pushed to users based on their preferences/tastes, thereby eliminating the cold-start problem that has plagued most early recommendation systems.
- The recommendation system needs to highlight why the selections are recommended to customers, and persuade them of it.


## Executive Summary

This project uses Netflix as an example to simulate a situation where a company or department holds immense 'data wealth' but lacks the ability to monetize it (depending on context, either by producing profit or by gaining further insights/efficiencies).

Often insights, or in this case recommendations, are pushed to customers based on prior knowledge or widely accepted facts, e.g. titles that are popular, or knowing the customer personally. This creates a knowledge silo or key-person risk, where a single person or system may hold the company hostage simply because it is indispensable. In addition, this means that recommendations are 'wholesale'; as trends shift from a company-push to an on-demand model, customers may be left unsatisfied if a company does not provide a personalized touch to their experience.

Using multiple recommendation systems that target users' rating similarity, title similarity, and relationships between titles based on their metadata seems like a good way to start personalizing products for customers.

Based on the models created, we can see that each recommendation system provides different titles, based on a different aspect of understanding the user. An ensemble model was created to pick the best title recommendations from each of the sub-recommendation systems. This gives the user a more generalized set of title recommendations.

Compared with the baseline recommendations (popular titles), our recommendation systems push titles that relate more closely to the user's preferences. In addition, the system as a whole eliminates the cold-start problem, where new users may be recommended something that is not to their taste.

### Key Observations:

Limitation of data available. Initially, API calls to MovieDB were supposed to be the panacea for the limited information available. However, MovieDB does not have an exact-match function and instead returns the best-matching title, hence I had to discard some of my data. In total, I had to drop about 16% of my data.

The traditional machine-learning-based recommendation systems (Collaborative Filtering and Title-Based) are computationally expensive. This limits me to using a subset of the data to generate the user matrix, hence the recommendations may not be as comprehensive in terms of the titles pushed to users. In addition, the user matrix needs to be recalculated for every user, which degrades the user experience because users wait longer to find titles that suit their taste.

*The Neural Network (metadata recommendation system) eliminates the problems experienced by traditional ML, but it suffers from longer training time as we run more epochs and add more embedded layers. However, it does not suffer from the long calculation time required to generate recommendations for users.

Each system has its pros and cons, and the bottlenecks occur at different stages. However, my strong preference is for Neural Networks because:
- The heavy lifting happens during the model training phase, hence on the user front there is little to no lag in pushing titles to users.
- Additional inputs/variables are simply additional layers in the Neural Network. It is relatively easier to "top up" features compared to traditional ML.

### Recommendations & Further Research
Given the short turnaround time, evaluation of the recommendation systems is limited to comparison against the popular titles in the database. The next step is to explore evaluation metrics in more depth, using confusion-matrix metrics to find how correctly each system recommends titles, based on the training datasets created as part of training the neural network.
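
As a rough illustration of what a confusion-matrix evaluation could look like, the sketch below treats each recommendation as a binary prediction ("this user will like this title") and compares it against held-out titles the user actually liked. The function names, the toy title labels, and the idea of a held-out "liked" set are all assumptions made for illustration; none of them come from this project's code.

```python
def confusion_counts(recommended, liked, catalogue):
    """Count TP/FP/FN/TN for one user's recommendations (hypothetical helper)."""
    recommended, liked, catalogue = set(recommended), set(liked), set(catalogue)
    tp = len(recommended & liked)              # recommended and liked
    fp = len(recommended - liked)              # recommended but not liked
    fn = len(liked - recommended)              # liked but never recommended
    tn = len(catalogue - recommended - liked)  # neither recommended nor liked
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Derive precision and recall from the confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data: six titles, three recommended, three actually liked
catalogue = ["a", "b", "c", "d", "e", "f"]
recommended = ["a", "b", "c"]
liked = ["a", "c", "e"]

tp, fp, fn, tn = confusion_counts(recommended, liked, catalogue)
precision, recall = precision_recall(tp, fp, fn)
print(tp, fp, fn, tn)  # 2 1 1 2
```

Averaging precision and recall over many users would give one number per recommendation system, which could then be compared against the popularity baseline.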

Another direction is to look into TF-IDF of the metadata. This would give an aggregated weight matrix of metadata across all the titles. Using this matrix in the Neural Network might give better recommendations, as it takes into account the weight of each keyword a title contains. Currently, each keyword is assumed to have the same weight.
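
To make the TF-IDF idea concrete, here is a minimal sketch using only the standard library. It uses the plain textbook weighting (term frequency times the log of inverse document frequency); the function name `keyword_tfidf` and the toy keyword lists are assumptions, and a library implementation may apply different smoothing.

```python
import math
from collections import Counter


def keyword_tfidf(title_keywords):
    """Return {title: {keyword: tf-idf weight}} for a dict of keyword lists."""
    n_titles = len(title_keywords)
    # Document frequency: number of titles that carry each keyword
    doc_freq = Counter(kw for kws in title_keywords.values() for kw in set(kws))
    weights = {}
    for title, kws in title_keywords.items():
        counts = Counter(kws)
        total = sum(counts.values())
        weights[title] = {
            kw: (count / total) * math.log(n_titles / doc_freq[kw])
            for kw, count in counts.items()
        }
    return weights


# Toy metadata: "documentary" appears on every title, so it carries no signal
toy = {
    "title_a": ["dinosaur", "documentary"],
    "title_b": ["documentary", "wrestling"],
    "title_c": ["documentary", "dance"],
}
weights = keyword_tfidf(toy)
print(weights["title_a"])
```

A keyword shared by every title gets a weight of zero, while a keyword unique to one title gets the largest weight, which is the behaviour the equal-weight assumption currently lacks.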

* The Neural Network concept and its use are heavily inspired by the following article:
<a href="https://towardsdatascience.com/building-a-recommendation-system-using-neural-network-embeddings-1ef92e5c80c9">Building a Recommendation System Using Neural Network Embeddings</a>

# Code Book

The code in this notebook relates to data collection, API calls & data transformations

## Import Library

In [2]:
import numpy as np
import pandas as pd

# Library for importing and saving large datasets
import glob
import json
import h5py

# For dealing with iterables
from itertools import chain
from collections import ChainMap

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# API Call
import requests

from datetime import datetime
import time
import random
import ast

# Additional Setting
pd.set_option("display.max_rows", 201)

## User Defined Functions

### Time Function

In [3]:
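# Time function

def t_now(string):
    '''
    Function: Return a label followed by the current time (HH:MM:SS)

    arg1 (str): Label to print in front of the timestamp

    return (str): "<label> = <HH:MM:SS>"
    '''
    # Named `now` so the imported `time` module is not shadowed
    now = datetime.now()
    current_time = now.strftime("%H:%M:%S")
    return f"{string} = {current_time}"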

### Title Pull

In [4]:
# Create a function for API call for different movie titles

def title_pull(item_ls, item_release_year):
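    '''
    Function: Search MovieDB for a title and return the top result.
    Tries movie + release year, TV + first air year, movie title only,
    then TV name only, and returns the first search that yields a result.

    arg1 (list): Title split on ":"; only the first part is searched because
                 every pass through the loop below ends in a return
    arg2 (float): Year of release of the title

    return (dict or str): Dict with 'id', 'title'/'name', 'search_title' and 'type',
                          or 'NA' if none of the searches return a result
    '''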
    
    global REQUEST_TOKEN
    
    item_name = ''
    
    # Loop to create a string to search movie database
    for i in range(len(item_ls)):
        if i != 0:
            item_name += ":" + item_ls[i]
        else:
            item_name = item_ls[i]
        
        # replace whitespace with %20
        search_item = item_name.replace(" ", "%20")
            
        # Try to search based on movie & release year
        try:
            url = f"https://api.themoviedb.org/3/search/movie?api_key={REQUEST_TOKEN}&query={search_item}&primary_release_year={item_release_year}&page=1" 
            req = requests.get(url)
            sq_dict = req.json()
            res = {key: sq_dict['results'][0][key] for key in sq_dict['results'][0].keys() & {'id', 'title'}}
            #res = {key: sq_dict['results'][0][key] for key in sq_dict['results'][0].keys() & {'id', 'title','genre_ids'}}
            res['search_title'] = search_item.replace("%20", ' ')
            res['type'] = 'movie'
            return res
        except:
            pass

        # Try to search based on TV name & first air date
        try:
            url = f"https://api.themoviedb.org/3/search/tv?api_key={REQUEST_TOKEN}&query={search_item}&first_air_date_year={item_release_year}&page=1"
            req = requests.get(url)
            sq_dict = req.json()
            res = {key: sq_dict['results'][0][key] for key in sq_dict['results'][0].keys() & {'id', 'name'}}
            #res = {key: sq_dict['results'][0][key] for key in sq_dict['results'][0].keys() & {'id', 'name','genre_ids'}}
            res['search_title'] = search_item.replace("%20", ' ')
            res['type'] = 'tv'
            return res
        except:
            pass

        # Try to search based on Movie title only
        try:
            url = f"https://api.themoviedb.org/3/search/movie?api_key={REQUEST_TOKEN}&query={search_item}&page=1" 
            req = requests.get(url)
            sq_dict = req.json()
            res = {key: sq_dict['results'][0][key] for key in sq_dict['results'][0].keys() & {'id', 'title'}}
            #res = {key: sq_dict['results'][0][key] for key in sq_dict['results'][0].keys() & {'id', 'title','genre_ids'}}
            res['search_title'] = search_item.replace("%20", ' ')
            res['type'] = 'movie'
            return res
        except:
            pass

        # Try to search based on TV name only
        try:
            url = f"https://api.themoviedb.org/3/search/tv?api_key={REQUEST_TOKEN}&query={search_item}&page=1"
            req = requests.get(url)
            sq_dict = req.json()
            res = {key: sq_dict['results'][0][key] for key in sq_dict['results'][0].keys() & {'id', 'name'}}
            #res = {key: sq_dict['results'][0][key] for key in sq_dict['results'][0].keys() & {'id', 'name','genre_ids'}}
            res['search_title'] = search_item.replace("%20", ' ')
            res['type'] = 'tv'
            return res
        except:
            return 'NA'
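
`title_pull` builds its query string by replacing spaces with `%20`, which leaves other characters such as `&` or `#` unescaped, so they would be read as URL syntax rather than as part of the title. The sketch below shows one way to build the same search URLs with the standard library doing the escaping. The helper name `build_search_url` and the `YOUR_KEY` placeholder are hypothetical; the endpoints and parameter names are the ones used in the cell above.

```python
from urllib.parse import quote, urlencode

TMDB_SEARCH = "https://api.themoviedb.org/3/search"


def build_search_url(kind, query, api_key, year=None):
    """Build a search URL for kind 'movie' or 'tv'; the year filter is optional."""
    params = {"api_key": api_key, "query": query, "page": 1}
    if year is not None:
        # The two endpoints use different names for the year filter
        year_key = "primary_release_year" if kind == "movie" else "first_air_date_year"
        # Cast so the URL carries 1994 rather than 1994.0
        params[year_key] = int(year)
    # quote_via=quote encodes spaces as %20, as the original cell does
    return f"{TMDB_SEARCH}/{kind}?{urlencode(params, quote_via=quote)}"


url = build_search_url("movie", "paula abdul's get up & dance", "YOUR_KEY", 1994.0)
print(url)
```

Passing a `params` dict to `requests.get` would achieve the same escaping; the explicit version is shown only so the resulting URL can be inspected.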

### Metadata API

In [5]:
# Create a function for API call for different movie titles

def movieid_pull(index, item_id, title_type):
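    '''
    Function: Perform API calls to MovieDB and extract relevant information based on the MovieDB id

    arg1 (int): Row index of the title (only used by the commented-out progress print)
    arg2 (int): MovieDB id of the title
    arg3 (str): Whether the title is a movie or a TV show ('movie' or 'tv')

    return (dict): Top-billed cast name, keywords, genres, popularity,
                   vote_count and vote_average, where available
    '''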
    
    global REQUEST_TOKEN, chunk_size

    #Create empty dict
    res = dict()
    sq_dict = dict()

    # # Loop to create a string to search movie database
    # if index % chunk_size == 0:
    #     print(f"{chunk_size}{t_now(' rows processed')}")

    #Check title type
    if title_type == 'movie':
        # Find the top actor/actress of the movie - api order = 0
        url = f"https://api.themoviedb.org/3/movie/{item_id}/credits?api_key={REQUEST_TOKEN}" 
        req = requests.get(url)
        sq_dict = req.json()

        # Find actor/actress
        if sq_dict['cast']:
            res = {key: sq_dict['cast'][0][key] for key in sq_dict['cast'][0].keys() & {'name'}}
        
        # Find Associated Keywords with titles
        url = f"https://api.themoviedb.org/3/movie/{item_id}/keywords?api_key={REQUEST_TOKEN}"
        req = requests.get(url)
        sq_dict = req.json()
        if sq_dict['keywords']:
            res['keywords'] = sq_dict['keywords']

        # Find titles details
        url = f"https://api.themoviedb.org/3/movie/{item_id}?api_key={REQUEST_TOKEN}"
        req = requests.get(url)
        sq_dict = req.json()
        for i in ['genres', 'popularity','vote_count','vote_average']:
            if sq_dict[i]:
                res[i] = sq_dict[i]
        return res
    
    elif title_type == 'tv':
        # Find the top actor/actress/host of the TV show - api order = 0
        url = f"https://api.themoviedb.org/3/tv/{item_id}/credits?api_key={REQUEST_TOKEN}" 
        req = requests.get(url)
        sq_dict = req.json()
        if sq_dict['cast']:
            res = {key: sq_dict['cast'][0][key] for key in sq_dict['cast'][0].keys() & {'name'}}

        # Find Associated Keywords with titles
        url = f"https://api.themoviedb.org/3/tv/{item_id}/keywords?api_key={REQUEST_TOKEN}"
        req = requests.get(url)
        sq_dict = req.json()
        if sq_dict['results']:
            res['keywords'] = sq_dict['results']

        # Find titles details
        url = f"https://api.themoviedb.org/3/tv/{item_id}?api_key={REQUEST_TOKEN}"
        req = requests.get(url)
        sq_dict = req.json()
        for i in ['genres', 'popularity','vote_count','vote_average']:
            if sq_dict[i]:
                res[i] = sq_dict[i]
        return res

### make dict

In [6]:
def make_dict(x):
    '''
    Function: Return a dict of 2 vals within a dict = {val1: val2}
    
    arg1 (list): A list of dictionaries

    '''
    
    final_dict = dict()
    try:
        for item in x:
            final_dict.update({item['name'] : item['id']})
        return final_dict
    except:
        return final_dict

### Chunking

In [7]:
def chunking(chunk_size, df):
    '''
    Function: Return a list of chunked dataframes
    
    arg1 (int): Chunk size of each chunk
    arg2 (df) : Dataframe to chunk
    
    return (list): A list of chunked dataframes
    '''
    
    chunk_list = []
    # Slice by position so the function works on the dataframe passed in,
    # not on the global `titles`
    for start_point in range(0, len(df), chunk_size):
        chunk_list.append(df.iloc[start_point: start_point + chunk_size, :])
        
    return chunk_list
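
To see how the chunk boundaries fall, the short sketch below computes the start and stop positions that this kind of stepping produces. The helper name `chunk_bounds` is hypothetical; the row count (17770) and chunk size (3000) are the values used later in this notebook.

```python
def chunk_bounds(n_rows, chunk_size):
    """Return (start, stop) row positions for each chunk; stop is exclusive."""
    return [(start, min(start + chunk_size, n_rows))
            for start in range(0, n_rows, chunk_size)]


bounds = chunk_bounds(17770, 3000)
print(len(bounds))   # 6
print(bounds[-1])    # (15000, 17770)
```

The last chunk is simply shorter than the others, so no rows are lost when the row count is not a multiple of the chunk size.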

## Import Dataset

### Movie Title Dataset

<span style='color:magenta'>Remarks:</span>
As mentioned in the introduction, the dataset is brief. There are 17,770 titles; the word "titles" is used because a quick scan of the names in the dataset shows a mixture of movies and short series. No additional information is provided except for year_of_release. Additional information such as genre, keywords & featured actors needs to be extracted from an external database, MovieDB, via an API call.

In [8]:
# Import Movie titles file
titles = pd.read_csv(fileloc+ 'movie_titles.txt', sep = ',', names=['movie_id','year_of_release','title'], encoding='Latin_1')

In [9]:
# Convert all movie titles to lower case
titles['title'] = titles['title'].str.lower()

In [10]:
print(f'shape of titles {titles.shape}')
titles.head()

shape of titles (17770, 3)


Unnamed: 0,movie_id,year_of_release,title
0,1,2003.0,dinosaur planet
1,2,2004.0,isle of man tt 2004 review
2,3,1997.0,character
3,4,1994.0,paula abdul's get up & dance
4,5,2004.0,the rise and fall of ecw


#### Delimit Title for search in MovieDB

In [11]:
# Split title into a list of words - delimiter ":"
titles['split_title'] = titles['title'].str.split(':')
titles.head(10)

Unnamed: 0,movie_id,year_of_release,title,split_title
0,1,2003.0,dinosaur planet,[dinosaur planet]
1,2,2004.0,isle of man tt 2004 review,[isle of man tt 2004 review]
2,3,1997.0,character,[character]
3,4,1994.0,paula abdul's get up & dance,[paula abdul's get up & dance]
4,5,2004.0,the rise and fall of ecw,[the rise and fall of ecw]
5,6,1997.0,sick,[sick]
6,7,1992.0,8 man,[8 man]
7,8,2004.0,what the #$*! do we know!?,[what the #$*! do we know!?]
8,9,1991.0,class of nuke 'em high 2,[class of nuke 'em high 2]
9,10,2001.0,fighter,[fighter]


In [12]:
# Create Chunk Size
chunk_size = 3000

In [13]:
chunk_list = chunking(chunk_size, titles)

#### Invoke API call to extract title id from MovieDB

In [14]:
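# API key: read from an environment variable instead of hard-coding the secret
# (the variable name TMDB_API_KEY is a placeholder, not part of the original notebook)
import os
REQUEST_TOKEN = os.environ['TMDB_API_KEY']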

# Target web page:

# To authenticate api call session
url = "https://www.themoviedb.org/authenticate/"+ REQUEST_TOKEN

# Establishing the connection to the web page:
response = requests.get(url)

# You can use status codes to understand how the target server responds to your request.
# Ex., 200 = OK, 400 = Bad Request, 403 = Forbidden, 404 = Not Found.
print(response.status_code)

200


In [16]:
%%time

print(t_now('Start Time'))
print(f"There are {len(chunk_list)} batches to process")

# Saving of titles files
for index, df in enumerate(chunk_list):
    df['moviedb_result'] = df.apply(lambda x: title_pull(x['split_title'], x['year_of_release']),axis=1)
    print(f"{index} batch: {t_now('run time')}")
    df.to_csv(f'{fileloc}titles_added/ta_{index}.csv', index=False)
        
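    # Re-check the connection after each batch; `response` must be refreshed,
    # otherwise the status code from the initial request is tested every time
    response = requests.get("https://www.themoviedb.org/authenticate/" + REQUEST_TOKEN)
    if response.status_code != 200:
        print(f'Connection Error, error type: {response.status_code}')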
        
    #generate a random sleep duration to look more 'natural'
    sleep_duration = random.randint(0,2)
    print(f'Sleep Duration: {sleep_duration}')
    print('-' * 50)
    time.sleep(sleep_duration) 

Start Time = 10:45:32
There are 6 batches to process


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys


0 batch: run time = 10:56:58
Sleep Duration: 2
--------------------------------------------------
1 batch: run time = 11:08:21
Sleep Duration: 0
--------------------------------------------------
2 batch: run time = 11:20:14
Sleep Duration: 0
--------------------------------------------------
3 batch: run time = 11:31:38
Sleep Duration: 1
--------------------------------------------------
4 batch: run time = 11:42:48
Sleep Duration: 2
--------------------------------------------------
5 batch: run time = 11:53:21
Sleep Duration: 1
--------------------------------------------------
CPU times: user 4min 48s, sys: 15.9 s, total: 5min 4s
Wall time: 1h 7min 49s


In [15]:
%%time
# Globbing of saved Chunks
# Import titles chunks into dataframe
ls_of_titles = []

# Specify the pattern matching for glob
rating_files = glob.glob(fileloc + 'titles_added/*')

# Loop to get dataframe into list
for filename in rating_files:
    df = pd.read_csv(filename, sep=',')
    ls_of_titles.append(df)

# Concat dataframe together
titles = pd.concat(ls_of_titles,ignore_index=True)

print('glob completed')

glob completed
Wall time: 81.8 ms


In [16]:
print(titles.shape)

(17770, 5)


In [17]:
titles.head()

Unnamed: 0,movie_id,year_of_release,title,split_title,moviedb_result
0,1,2003.0,dinosaur planet,['dinosaur planet'],"{'id': 11710, 'name': 'Dinosaur Planet', 'sear..."
1,2,2004.0,isle of man tt 2004 review,['isle of man tt 2004 review'],
2,3,1997.0,character,['character'],"{'id': 17139, 'title': 'Character', 'search_ti..."
3,4,1994.0,paula abdul's get up & dance,"[""paula abdul's get up & dance""]","{'id': 274766, 'title': ""Paula Abdul's Get Up ..."
4,5,2004.0,the rise and fall of ecw,['the rise and fall of ecw'],"{'id': 33209, 'title': 'The Rise & Fall of ECW..."


#### Cleaning of Title Dataset

##### Dropping of Null Values

In [18]:
titles.isnull().sum()[titles.isnull().sum()>0]

year_of_release      7
moviedb_result     803
dtype: int64

In [19]:
# Drop null values in moviedb_result and reassigned back to titles
print(f'Shape of titles before dropping: {titles.shape}')
titles = titles.dropna(axis=0, how='all', subset=['moviedb_result'])
print(f'Shape of titles after dropping: {titles.shape}')

Shape of titles before dropping: (17770, 5)
Shape of titles after dropping: (16967, 5)


In [20]:
# Change the wordings from API function call
titles['moviedb_result'] = titles['moviedb_result'].map(lambda x : x.replace('name', 'title'))
titles.reset_index(drop=True,inplace=True)
print(titles.shape)
titles.head()

(16967, 5)


Unnamed: 0,movie_id,year_of_release,title,split_title,moviedb_result
0,1,2003.0,dinosaur planet,['dinosaur planet'],"{'id': 11710, 'title': 'Dinosaur Planet', 'sea..."
1,3,1997.0,character,['character'],"{'id': 17139, 'title': 'Character', 'search_ti..."
2,4,1994.0,paula abdul's get up & dance,"[""paula abdul's get up & dance""]","{'id': 274766, 'title': ""Paula Abdul's Get Up ..."
3,5,2004.0,the rise and fall of ecw,['the rise and fall of ecw'],"{'id': 33209, 'title': 'The Rise & Fall of ECW..."
4,6,1997.0,sick,['sick'],"{'id': 35638, 'title': 'Sick: The Life and Dea..."


In [21]:
titles[titles['year_of_release'].isnull()]

Unnamed: 0,movie_id,year_of_release,title,split_title,moviedb_result
4191,4388,,ancient civilizations: rome and pompeii,"['ancient civilizations', ' rome and pompeii']","{'id': 503681, 'title': 'Ancient Civilizations..."
4577,4794,,ancient civilizations: land of the pharaohs,"['ancient civilizations', ' land of the pharao...","{'id': 503681, 'title': 'Ancient Civilizations..."
6898,7241,,ancient civilizations: athens and greece,"['ancient civilizations', ' athens and greece']","{'id': 503681, 'title': 'Ancient Civilizations..."
10300,10782,,roti kapada aur makaan,['roti kapada aur makaan'],"{'id': 125292, 'title': 'Roti Kapada Aur Makaa..."
15932,16678,,jimmy hollywood,['jimmy hollywood'],"{'id': 31643, 'title': 'Jimmy Hollywood', 'sea..."


In [22]:
# Filter for years of releases that are missing
print(f'Shape of titles before dropping: {titles.shape}')
titles = titles[~(titles['year_of_release'].isnull())]
print(f'Shape of titles after dropping: {titles.shape}')

# Reset titles index after dropping of rows
titles.reset_index(drop=True,inplace=True)

Shape of titles before dropping: (16967, 5)
Shape of titles after dropping: (16962, 5)


<span style='color:magenta'>Remarks:</span> I have removed 808 titles from the Netflix dataset. This amounts to a loss of about 4.5% of the original data. This is acceptable, as a clean dataset is important for modeling a good recommendation system.

##### Change column datatype using ast.literal_eval

In [23]:
# Check the data type of the 2 columns
print(type(titles.loc[0,'split_title']))
print(type(titles.loc[0,'moviedb_result']))

<class 'str'>
<class 'str'>


In [24]:
# Convert string representations into their actual Python types
titles['split_title']  = titles['split_title'].apply(ast.literal_eval)
titles['moviedb_result']  = titles['moviedb_result'].apply(ast.literal_eval)
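
The conversion is needed because lists and dicts come back from a CSV file as plain strings. The sketch below shows what `ast.literal_eval` does to such strings and why it is a safer choice than `eval`. The sample strings are modeled on rows shown in this notebook; the variable names are illustrative.

```python
import ast

# Strings as they would look after a round trip through CSV
raw_list = "['ancient civilizations', ' rome and pompeii']"
raw_dict = "{'id': 11710, 'title': 'Dinosaur Planet', 'type': 'tv'}"

parsed_list = ast.literal_eval(raw_list)
parsed_dict = ast.literal_eval(raw_dict)
print(type(parsed_list), type(parsed_dict))

# literal_eval only accepts Python literals, so arbitrary expressions are rejected
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```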

In [25]:
print(type(titles.loc[0,'split_title']))
print(type(titles.loc[0,'moviedb_result']))

<class 'list'>
<class 'dict'>


In [26]:
# Sample of an element
titles.loc[0,'moviedb_result']

{'id': 11710,
 'title': 'Dinosaur Planet',
 'search_title': 'dinosaur planet',
 'type': 'tv'}

In [27]:
# Unpack dictionary into dataframe columns
titles[['moviedb_id','moviedb_search_title', 'result_title','type']] =  pd.DataFrame(titles['moviedb_result'].tolist(), index=titles.index)
titles.head()

Unnamed: 0,movie_id,year_of_release,title,split_title,moviedb_result,moviedb_id,moviedb_search_title,result_title,type
0,1,2003.0,dinosaur planet,[dinosaur planet],"{'id': 11710, 'title': 'Dinosaur Planet', 'sea...",11710,Dinosaur Planet,dinosaur planet,tv
1,3,1997.0,character,[character],"{'id': 17139, 'title': 'Character', 'search_ti...",17139,Character,character,movie
2,4,1994.0,paula abdul's get up & dance,[paula abdul's get up & dance],"{'id': 274766, 'title': 'Paula Abdul's Get Up ...",274766,Paula Abdul's Get Up & Dance,paula abdul's get up & dance,movie
3,5,2004.0,the rise and fall of ecw,[the rise and fall of ecw],"{'id': 33209, 'title': 'The Rise & Fall of ECW...",33209,The Rise & Fall of ECW,the rise and fall of ecw,movie
4,6,1997.0,sick,[sick],"{'id': 35638, 'title': 'Sick: The Life and Dea...",35638,"Sick: The Life and Death of Bob Flanagan, Supe...",sick,movie


##### Filter Netflix title = MovieDB titles

In [28]:
%%time
# 1 if the first element of split_title is contained in the lower-cased moviedb_search_title, else 0
titles['match'] = [[1 if element in titles.loc[i, 'moviedb_search_title'].lower() else 0 for element in titles.loc[i,'split_title']][0] for i in range(len(titles))]
titles.head()

Wall time: 300 ms


Unnamed: 0,movie_id,year_of_release,title,split_title,moviedb_result,moviedb_id,moviedb_search_title,result_title,type,match
0,1,2003.0,dinosaur planet,[dinosaur planet],"{'id': 11710, 'title': 'Dinosaur Planet', 'sea...",11710,Dinosaur Planet,dinosaur planet,tv,1
1,3,1997.0,character,[character],"{'id': 17139, 'title': 'Character', 'search_ti...",17139,Character,character,movie,1
2,4,1994.0,paula abdul's get up & dance,[paula abdul's get up & dance],"{'id': 274766, 'title': 'Paula Abdul's Get Up ...",274766,Paula Abdul's Get Up & Dance,paula abdul's get up & dance,movie,1
3,5,2004.0,the rise and fall of ecw,[the rise and fall of ecw],"{'id': 33209, 'title': 'The Rise & Fall of ECW...",33209,The Rise & Fall of ECW,the rise and fall of ecw,movie,0
4,6,1997.0,sick,[sick],"{'id': 35638, 'title': 'Sick: The Life and Dea...",35638,"Sick: The Life and Death of Bob Flanagan, Supe...",sick,movie,1


<span style='color:magenta'>Remarks:</span>
A match flag is needed because there are instances where MovieDB returns the wrong movie. This happens because the API call does not have an exact-match function and only the top result is returned.
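
The check used for the `match` column is a substring test on the first part of the Netflix title. The sketch below restates that logic on three title pairs taken from the tables in this notebook; the helper name `first_part_matches` is hypothetical.

```python
def first_part_matches(split_title, moviedb_title):
    """Return 1 if the first part of the Netflix title occurs in the MovieDB title."""
    return 1 if split_title[0] in moviedb_title.lower() else 0


examples = [
    (["dinosaur planet"], "Dinosaur Planet"),
    (["the rise and fall of ecw"], "The Rise & Fall of ECW"),
    (["8 man"], "8 Man - For All Lonely Nights"),
]
flags = [first_part_matches(parts, title) for parts, title in examples]
print(flags)  # [1, 0, 1]
```

The second pair shows that a correct result can be rejected over a small spelling difference ("and" versus "&"), and the third shows that a longer MovieDB title is accepted as long as it contains the search text, so the flag is a heuristic rather than an exact-match test.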

In [29]:
# Count the titles that are wrongly matched
titles[titles['match']==0].shape[0]

1590

In [30]:
# Filter for those that are match
titles = titles[titles['match']!=0]
titles.reset_index(drop=True, inplace=True)
print(titles.shape)
titles.head()

(15372, 10)


Unnamed: 0,movie_id,year_of_release,title,split_title,moviedb_result,moviedb_id,moviedb_search_title,result_title,type,match
0,1,2003.0,dinosaur planet,[dinosaur planet],"{'id': 11710, 'title': 'Dinosaur Planet', 'sea...",11710,Dinosaur Planet,dinosaur planet,tv,1
1,3,1997.0,character,[character],"{'id': 17139, 'title': 'Character', 'search_ti...",17139,Character,character,movie,1
2,4,1994.0,paula abdul's get up & dance,[paula abdul's get up & dance],"{'id': 274766, 'title': 'Paula Abdul's Get Up ...",274766,Paula Abdul's Get Up & Dance,paula abdul's get up & dance,movie,1
3,6,1997.0,sick,[sick],"{'id': 35638, 'title': 'Sick: The Life and Dea...",35638,"Sick: The Life and Death of Bob Flanagan, Supe...",sick,movie,1
4,7,1992.0,8 man,[8 man],"{'id': 196685, 'title': '8 Man - For All Lonel...",196685,8 Man - For All Lonely Nights,8 man,movie,1


##### Check if there are duplicated movies

In [31]:
titles[titles.duplicated(subset='title', keep='first')]

Unnamed: 0,movie_id,year_of_release,title,split_title,moviedb_result,moviedb_id,moviedb_search_title,result_title,type,match
267,305,1996.0,jack,[jack],"{'id': 7095, 'title': 'Jack', 'search_title': ...",7095,Jack,jack,movie,1
327,379,1996.0,crash dive,[crash dive],"{'id': 160019, 'title': 'Crash Dive', 'search_...",160019,Crash Dive,crash dive,movie,1
888,1015,1996.0,dr. quinn,[dr. quinn],"{'id': 25101, 'title': 'Dr. Quinn Medicine Wom...",25101,Dr. Quinn Medicine Woman: The Movie,dr. quinn,movie,1
1107,1260,1999.0,journey to the center of the earth,[journey to the center of the earth],"{'id': 732427, 'title': 'Journey to the Center...",732427,Journey to the Center of the Earth,journey to the center of the earth,movie,1
1322,1505,1964.0,hamlet,[hamlet],"{'id': 261439, 'title': 'Hamlet at Elsinore', ...",261439,Hamlet at Elsinore,hamlet,movie,1
...,...,...,...,...,...,...,...,...,...,...
15274,17658,1994.0,the chase,[the chase],"{'id': 10694, 'title': 'The Chase', 'search_ti...",10694,The Chase,the chase,movie,1
15297,17685,1988.0,alice,[alice],"{'id': 114364, 'title': 'Alice in Wonderland',...",114364,Alice in Wonderland,alice,movie,1
15315,17704,1999.0,taboo,[taboo],"{'id': 20617, 'title': 'Taboo', 'search_title'...",20617,Taboo,taboo,movie,1
15330,17721,1998.0,the love letter,[the love letter],"{'id': 57943, 'title': 'The Love Letter', 'sea...",57943,The Love Letter,the love letter,movie,1


In [32]:
# Check Sample 1
titles[titles['title']=='journey to the center of the earth']

Unnamed: 0,movie_id,year_of_release,title,split_title,moviedb_result,moviedb_id,moviedb_search_title,result_title,type,match
415,486,1959.0,journey to the center of the earth,[journey to the center of the earth],"{'id': 11571, 'title': 'Journey to the Center ...",11571,Journey to the Center of the Earth,journey to the center of the earth,movie,1
1107,1260,1999.0,journey to the center of the earth,[journey to the center of the earth],"{'id': 732427, 'title': 'Journey to the Center...",732427,Journey to the Center of the Earth,journey to the center of the earth,movie,1


In [33]:
# Check Sample 2
titles[titles['title']=='crash dive']

Unnamed: 0,movie_id,year_of_release,title,split_title,moviedb_result,moviedb_id,moviedb_search_title,result_title,type,match
51,63,1943.0,crash dive,[crash dive],"{'id': 74012, 'title': 'Crash Dive', 'search_t...",74012,Crash Dive,crash dive,movie,1
327,379,1996.0,crash dive,[crash dive],"{'id': 160019, 'title': 'Crash Dive', 'search_...",160019,Crash Dive,crash dive,movie,1


<span style='color:magenta'>Remarks:</span>
It seems that some movies are remakes of an original. This may 'pollute' our model when we pull additional data from MovieDB, especially when we create a title-based recommendation system later on. There are 2 treatments for this:
1. Create a unique name that is a combination of title & year_of_release. This helps us retain these old movies and may improve our recommendation system.
2. Remove them, as the improvement may be insignificant.

I will choose option (2) for the following reasons:
- These old titles may have missing information when I perform an API call to MovieDB, so they would not add value to the features in my recommendation system.
- Users tend to watch and rate more recent movies; hence, even if we ignore the materiality of having fewer features, the low ratings of these movies may mean that they are very unlikely to be recommended.
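
The two treatments can be sketched in plain Python as follows. The rows are the two duplicated samples inspected above; the variable names are illustrative, and the notebook itself performs option (2) with pandas sorting and duplicate filtering rather than with this loop.

```python
rows = [
    {"movie_id": 486, "year_of_release": 1959, "title": "journey to the center of the earth"},
    {"movie_id": 1260, "year_of_release": 1999, "title": "journey to the center of the earth"},
    {"movie_id": 63, "year_of_release": 1943, "title": "crash dive"},
    {"movie_id": 379, "year_of_release": 1996, "title": "crash dive"},
]

# Option 1: a composite key keeps both versions distinguishable
unique_names = [f"{r['title']} ({r['year_of_release']})" for r in rows]

# Option 2: keep only the most recent release per title
latest = {}
for r in sorted(rows, key=lambda r: r["year_of_release"], reverse=True):
    latest.setdefault(r["title"], r)  # first seen is the newest
kept_ids = sorted(r["movie_id"] for r in latest.values())
print(kept_ids)  # [379, 1260]
```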

In [34]:
# Sort titles based on latest year_of_release
titles.sort_values(by='year_of_release',ascending=False, inplace=True)
titles.head()

Unnamed: 0,movie_id,year_of_release,title,split_title,moviedb_result,moviedb_id,moviedb_search_title,result_title,type,match
6071,7003,2005.0,national geographic: arlington: field of honor,"[national geographic, arlington, field of ho...","{'id': 312215, 'title': 'National Geographic E...",312215,National Geographic Explorer: King Tut's Final...,national geographic,movie,1
8336,9603,2005.0,pink floyd: atom heart mother,"[pink floyd, atom heart mother]","{'id': 61913, 'title': 'Pink Floyd - Comfortab...",61913,Pink Floyd - Comfortably Numb,pink floyd,movie,1
12769,14762,2005.0,devour,[devour],"{'id': 14147, 'title': 'DeVour', 'search_title...",14147,DeVour,devour,movie,1
12764,14757,2005.0,palindromes,[palindromes],"{'id': 13073, 'title': 'Palindromes', 'search_...",13073,Palindromes,palindromes,movie,1
546,631,2005.0,unleashed,[unleashed],"{'id': 10027, 'title': 'Unleashed', 'search_ti...",10027,Unleashed,unleashed,movie,1


In [35]:
# Drop all duplicate titles
print(f'Shape of titles before dropping: {titles.shape}')
titles = titles[~(titles.duplicated(subset='title', keep='first'))]
print(f'Shape of titles after dropping: {titles.shape}')

Shape of titles before dropping: (15372, 10)
Shape of titles after dropping: (14923, 10)


In [36]:
# Sample 1
titles[titles['title']=='journey to the center of the earth']

Unnamed: 0,movie_id,year_of_release,title,split_title,moviedb_result,moviedb_id,moviedb_search_title,result_title,type,match
1107,1260,1999.0,journey to the center of the earth,[journey to the center of the earth],"{'id': 732427, 'title': 'Journey to the Center...",732427,Journey to the Center of the Earth,journey to the center of the earth,movie,1


In [37]:
# Check Sample 2
titles[titles['title']=='crash dive']

Unnamed: 0,movie_id,year_of_release,title,split_title,moviedb_result,moviedb_id,moviedb_search_title,result_title,type,match
327,379,1996.0,crash dive,[crash dive],"{'id': 160019, 'title': 'Crash Dive', 'search_...",160019,Crash Dive,crash dive,movie,1


<span style='color:magenta'>Remarks:</span>
From the samples checked after dropping duplicates, the older titles are dropped while the latest titles are retained.

In total, I have dropped 2,847 titles. This amounts to a 16% reduction in the original data. However, this is still acceptable, as it leaves our dataset as clean as possible for modeling.
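
The keep-the-latest behaviour described above follows from sorting by `year_of_release` in descending order before calling `duplicated(keep='first')`. A minimal sketch on a toy frame (the rows are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Toy data: two releases share the same title (values are illustrative only)
toy = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'year_of_release': [1959.0, 1999.0, 2005.0],
    'title': ['journey', 'journey', 'unleashed'],
})

# Sort newest first so that keep='first' retains the most recent release
toy = toy.sort_values(by='year_of_release', ascending=False)
deduped = toy[~toy.duplicated(subset='title', keep='first')]

print(deduped)
```

Without the sort, `keep='first'` would simply retain whichever duplicate happens to appear first in the frame.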

In [38]:
# Remove non-relevant columns
titles.drop(columns=['split_title', 'moviedb_result', 'moviedb_search_title','result_title','match'], inplace=True, errors='ignore')
titles.reset_index(drop=True,inplace=True)
print(titles.shape)
titles.head(20)

(14923, 5)


Unnamed: 0,movie_id,year_of_release,title,moviedb_id,type
0,7003,2005.0,national geographic: arlington: field of honor,312215,movie
1,9603,2005.0,pink floyd: atom heart mother,61913,movie
2,14762,2005.0,devour,14147,movie
3,14757,2005.0,palindromes,13073,movie
4,631,2005.0,unleashed,10027,movie
5,14739,2005.0,martha stewart holidays: classic thanksgiving,531699,movie
6,9652,2005.0,wwe: the self-destruction of the ultimate warrior,58975,movie
7,9747,2005.0,warm springs,46661,movie
8,5892,2005.0,wwe: wrestlemania 21,58975,movie
9,9771,2005.0,black,31977,movie


In [39]:
# Save dataframe to HDF5 
hf = pd.HDFStore(fileloc + 'movie_raw.hdf5')
hf['titles_search_mvid'] = titles
hf.close()

#### Invoke Metadata API Call

In [40]:
# API key
REQUEST_TOKEN = 'd8f46f139abc47c1f048f3efc486fe53'

# Target web page:

# To authenticate api call session
url = "https://www.themoviedb.org/authenticate/"+ REQUEST_TOKEN

# Establishing the connection to the web page:
response = requests.get(url)

# You can use status codes to understand how the target server responds to your request.
# Ex., 200 = OK, 400 = Bad Request, 403 = Forbidden, 404 = Not Found.
print(response.status_code)

200


In [41]:
# Chunking the files
chunk_list = chunking(chunk_size,titles)
print(len(chunk_list))

5


In [61]:
# Perform API call on chunked dataframe & save as csv file
print(t_now('Start Time'))
print(f"There are {len(chunk_list)} batches to process")
print('-' * 50)

for index, df in enumerate(chunk_list):
    df['search_result'] = df.apply(lambda x: movieid_pull(x.name, x['moviedb_id'], x['type']), axis=1)
    df.to_csv(f'{fileloc}metadata_added/md_{index}.csv', index=False)
    print(f"{index} batch: {t_now('run time')}")
    print('-' * 50)
print('API Call Complete')

Start Time = 12:27:47
There are 5 batches to process
--------------------------------------------------


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys


0 batch: run time = 12:34:47
--------------------------------------------------
1 batch: run time = 12:56:07
--------------------------------------------------
2 batch: run time = 13:17:28
--------------------------------------------------
3 batch: run time = 13:38:55
--------------------------------------------------
4 batch: run time = 13:58:45
--------------------------------------------------
API Call Complete
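
The `SettingWithCopyWarning` printed above typically appears when a new column is assigned on a slice of another DataFrame. The `chunking` helper is not shown in this section, so the following is only a sketch of one way to avoid the warning: a hypothetical `chunk_frame` helper that returns independent copies, with a lambda standing in for the API call.

```python
import pandas as pd

def chunk_frame(df, chunk_size):
    """Hypothetical helper: split df into independent copies of at most chunk_size rows."""
    return [df.iloc[start:start + chunk_size].copy()
            for start in range(0, len(df), chunk_size)]

# Toy ids, for illustration only
toy = pd.DataFrame({'moviedb_id': [10, 20, 30, 40, 50]})
chunks = chunk_frame(toy, chunk_size=2)

for chunk in chunks:
    # Stand-in for the real API call; each chunk is a copy, so this
    # assignment does not touch (or warn about) the original frame
    chunk['search_result'] = chunk['moviedb_id'].apply(lambda i: {'id': i})

print([len(chunk) for chunk in chunks])
```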


#### Globbing of Chunked Files

In [42]:
%%time
# Globbing of saved Chunks
# Import titles chunks into dataframe
ls_of_titles = []

# Specify the pattern matching for glob
title_files = glob.glob(fileloc + 'metadata_added/*')

# Loop to get dataframe into list
for filename in title_files:
    df = pd.read_csv(filename, sep=',')
    ls_of_titles.append(df)

# Concat dataframe together
titles = pd.concat(ls_of_titles,ignore_index=True)

print('glob completed')

glob completed
Wall time: 97.9 ms


In [43]:
print(titles.shape)
titles.head()

(14923, 6)


Unnamed: 0,movie_id,year_of_release,title,moviedb_id,type,search_result
0,7003,2005.0,national geographic: arlington: field of honor,312215,movie,"{'name': 'Zahi Hawass', 'keywords': [{'id': 11..."
1,9603,2005.0,pink floyd: atom heart mother,61913,movie,"{'name': 'Roger Waters', 'keywords': [{'id': 5..."
2,14762,2005.0,devour,14147,movie,"{'name': 'Jensen Ackles', 'keywords': [{'id': ..."
3,14757,2005.0,palindromes,13073,movie,"{'name': 'Ellen Barkin', 'keywords': [{'id': 1..."
4,631,2005.0,unleashed,10027,movie,"{'name': 'Jet Li', 'keywords': [{'id': 779, 'n..."


<span style='color:magenta'>Remarks:</span>
Additional information such as genre, keywords and featured actors is extracted from MovieDB. As the API call takes more than 1 hour, I have broken my dataset into chunks to prevent loss of effort should the API call fail half-way. Because each chunk is saved as a csv file, the search result, which should be a dictionary, is read back as a string after globbing. Hence I need to convert it back to a dictionary using the Abstract Syntax Tree library (ast).
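
The chunk-and-save approach described above can also be made resumable, so that a rerun only processes the chunks that are not yet on disk. This is a sketch under assumptions rather than the code used in this notebook: `process_chunks` is a hypothetical helper, and `fetch` is a stand-in for the real API call.

```python
import os
import tempfile
import pandas as pd

def process_chunks(chunks, out_dir, fetch):
    """Hypothetical helper: save each chunk to CSV and skip chunks already on disk."""
    processed = []
    for index, chunk in enumerate(chunks):
        path = os.path.join(out_dir, f'md_{index}.csv')
        if os.path.exists(path):
            continue  # finished in an earlier run, so do not call the API again
        chunk = chunk.copy()
        chunk['search_result'] = chunk['moviedb_id'].apply(fetch)
        chunk.to_csv(path, index=False)
        processed.append(index)
    return processed

# Toy ids, for illustration only
toy = pd.DataFrame({'moviedb_id': [10, 20, 30, 40]})
chunks = [toy.iloc[0:2], toy.iloc[2:4]]

with tempfile.TemporaryDirectory() as out_dir:
    first_run = process_chunks(chunks, out_dir, fetch=lambda i: {'id': i})
    second_run = process_chunks(chunks, out_dir, fetch=lambda i: {'id': i})

print(first_run, second_run)
```

On the second call every chunk file already exists, so nothing is fetched again.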

In [44]:
# Sample of search result
titles.loc[0,'search_result']

'{\'name\': \'Zahi Hawass\', \'keywords\': [{\'id\': 1160, \'name\': \'egypt\'}, {\'id\': 2598, \'name\': \'museum\'}, {\'id\': 2904, \'name\': \'mummy\'}, {\'id\': 9032, \'name\': \'excavation\'}, {\'id\': 157894, \'name\': \'ancient egypt\'}, {\'id\': 163347, \'name\': \'egyptology\'}, {\'id\': 180259, \'name\': \'reconstruction\'}, {\'id\': 183205, \'name\': \'secret burial\'}, {\'id\': 192125, \'name\': \'egyptologist\'}, {\'id\': 208435, \'name\': \'archaeology\'}, {\'id\': 210598, \'name\': \'forensic archaeology\'}, {\'id\': 215218, \'name\': \'king tut\'}, {\'id\': 234743, \'name\': \'egyptian tomb\'}, {\'id\': 234744, \'name\': \'sarcophagus\'}, {\'id\': 243413, \'name\': "mummy\'s curse"}], \'genres\': [{\'id\': 99, \'name\': \'Documentary\'}, {\'id\': 36, \'name\': \'History\'}, {\'id\': 10770, \'name\': \'TV Movie\'}], \'popularity\': 4.338, \'vote_count\': 2, \'vote_average\': 7.0}'

#### Conversion of search_result type

In [45]:
# Check the data type of the element
type(titles.loc[0,'search_result'])

str

In [46]:
# Convert the string representation back into its actual type (dict)
titles['search_result']  = titles['search_result'].apply(ast.literal_eval)

In [47]:
# Re-check data type of the element
type(titles.loc[0,'search_result'])

dict

In [48]:
# Example of returned result
titles.loc[0,'search_result']

{'name': 'Zahi Hawass',
 'keywords': [{'id': 1160, 'name': 'egypt'},
  {'id': 2598, 'name': 'museum'},
  {'id': 2904, 'name': 'mummy'},
  {'id': 9032, 'name': 'excavation'},
  {'id': 157894, 'name': 'ancient egypt'},
  {'id': 163347, 'name': 'egyptology'},
  {'id': 180259, 'name': 'reconstruction'},
  {'id': 183205, 'name': 'secret burial'},
  {'id': 192125, 'name': 'egyptologist'},
  {'id': 208435, 'name': 'archaeology'},
  {'id': 210598, 'name': 'forensic archaeology'},
  {'id': 215218, 'name': 'king tut'},
  {'id': 234743, 'name': 'egyptian tomb'},
  {'id': 234744, 'name': 'sarcophagus'},
  {'id': 243413, 'name': "mummy's curse"}],
 'genres': [{'id': 99, 'name': 'Documentary'},
  {'id': 36, 'name': 'History'},
  {'id': 10770, 'name': 'TV Movie'}],
 'popularity': 4.338,
 'vote_count': 2,
 'vote_average': 7.0}

In [49]:
# Convert search_result columns into multiple columns based on dictionary
titles[['featured_actor', 'keywords', 'genre','popularity', 'vote_count', 'vote_average']] =  pd.DataFrame(titles['search_result'].tolist(), index=titles.index)
titles.head()

Unnamed: 0,movie_id,year_of_release,title,moviedb_id,type,search_result,featured_actor,keywords,genre,popularity,vote_count,vote_average
0,7003,2005.0,national geographic: arlington: field of honor,312215,movie,"{'name': 'Zahi Hawass', 'keywords': [{'id': 11...",Zahi Hawass,"[{'id': 1160, 'name': 'egypt'}, {'id': 2598, '...","[{'id': 99, 'name': 'Documentary'}, {'id': 36,...",4.338,2.0,7.0
1,9603,2005.0,pink floyd: atom heart mother,61913,movie,"{'name': 'Roger Waters', 'keywords': [{'id': 5...",Roger Waters,"[{'id': 5565, 'name': 'biography'}, {'id': 233...","[{'id': 99, 'name': 'Documentary'}]",1.863,1.0,7.0
2,14762,2005.0,devour,14147,movie,"{'name': 'Jensen Ackles', 'keywords': [{'id': ...",Jensen Ackles,"[{'id': 282, 'name': 'video game'}, {'id': 331...","[{'id': 27, 'name': 'Horror'}]",8.569,53.0,5.1
3,14757,2005.0,palindromes,13073,movie,"{'name': 'Ellen Barkin', 'keywords': [{'id': 1...",Ellen Barkin,"[{'id': 173186, 'name': 'teenage pregnancy'}, ...","[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",12.121,102.0,6.2
4,631,2005.0,unleashed,10027,movie,"{'name': 'Jet Li', 'keywords': [{'id': 779, 'n...",Jet Li,"[{'id': 779, 'name': 'martial arts'}, {'id': 2...","[{'id': 28, 'name': 'Action'}, {'id': 80, 'nam...",17.592,1225.0,6.8
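
The cell above assigns the new columns by position, which appears to rely on every dictionary listing its keys in the same order as the new column names. A hedged alternative is to expand the dictionaries and then rename by key. The toy records below are trimmed, reordered versions of the results shown above and are for illustration only.

```python
import pandas as pd

# Two records with the same keys in a different order
records = pd.Series([
    {'name': 'Jet Li', 'genres': [{'id': 28, 'name': 'Action'}], 'popularity': 17.592},
    {'popularity': 1.863, 'name': 'Roger Waters', 'genres': [{'id': 99, 'name': 'Documentary'}]},
])

# Expand by key, then rename explicitly instead of relying on column position
expanded = pd.DataFrame(records.tolist(), index=records.index).rename(
    columns={'name': 'featured_actor', 'genres': 'genre'}
)

print(expanded[['featured_actor', 'popularity']])
```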


#### Fillna

In [50]:
# Check for null values
titles.isnull().sum()[titles.isnull().sum()!=0]

featured_actor    1167
keywords          4237
genre             1108
vote_count        1517
vote_average      1519
dtype: int64

In [51]:
# Fillna for missing values
titles['featured_actor'].fillna('NA', inplace=True)
titles['keywords'].fillna('NA', inplace=True)
titles['genre'].fillna('NA', inplace=True)
titles['vote_count'].fillna(0, inplace=True)
titles['vote_average'].fillna(0, inplace=True)

In [52]:
# Recheck if there are any null values
titles.isnull().sum()[titles.isnull().sum()!=0]

Series([], dtype: int64)

#### Creation of dictionary

<span style='color:magenta'>Remarks:</span>
The idea behind creating the genre and keyword dictionaries is to build key-value pairs that allow me to quickly perform mapping at the modeling stage, especially for the Neural Network, as it accepts a numpy array. The genres and keywords come with their own ids, hence I do not need to create an identification key for them.
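
As a rough illustration of the mapping idea, the sketch below flattens the API's list of `{'id', 'name'}` entries into a name-to-id lookup. `make_dict_sketch` is a hypothetical stand-in; the notebook's own `make_dict` is defined elsewhere and may differ.

```python
def make_dict_sketch(items):
    """Hypothetical stand-in for make_dict: turn [{'id': .., 'name': ..}, ...] into {name: id}."""
    if not isinstance(items, list):
        return {}  # covers the 'NA' placeholder used for missing values
    return {item['name']: item['id'] for item in items}

# Rows shaped like the genre column before conversion (ids as shown in the results above)
rows = [
    [{'id': 28, 'name': 'Action'}, {'id': 80, 'name': 'Crime'}],
    'NA',
    [{'id': 80, 'name': 'Crime'}, {'id': 18, 'name': 'Drama'}],
]

# Build one global lookup across all rows
lookup = {}
for row in rows:
    lookup.update(make_dict_sketch(row))

# Map names back to their ids at modeling time
genre_ids = [lookup[name] for name in ['Action', 'Drama']]
print(lookup, genre_ids)
```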

In [53]:
# Return a dictionary from a list of dictionary
titles['keywords'] = titles['keywords'].apply(make_dict)
titles['genre'] = titles['genre'].apply(make_dict)

##### Keyword Dictionary

In [54]:
# Generate a keyword dictionary
keywords_dict = dict()
titles['keywords'].apply(keywords_dict.update);

# Convert dict into a series to be saved
keywords_series = pd.Series(keywords_dict)

# Change keyword column into list of keywords only
titles['keywords'] = titles['keywords'].apply(lambda x: list(chain(x.keys())))

In [55]:
titles.head()

Unnamed: 0,movie_id,year_of_release,title,moviedb_id,type,search_result,featured_actor,keywords,genre,popularity,vote_count,vote_average
0,7003,2005.0,national geographic: arlington: field of honor,312215,movie,"{'name': 'Zahi Hawass', 'keywords': [{'id': 11...",Zahi Hawass,"[egypt, museum, mummy, excavation, ancient egy...","{'Documentary': 99, 'History': 36, 'TV Movie':...",4.338,2.0,7.0
1,9603,2005.0,pink floyd: atom heart mother,61913,movie,"{'name': 'Roger Waters', 'keywords': [{'id': 5...",Roger Waters,"[biography, classic rock]",{'Documentary': 99},1.863,1.0,7.0
2,14762,2005.0,devour,14147,movie,"{'name': 'Jensen Ackles', 'keywords': [{'id': ...",Jensen Ackles,"[video game, tattoo, nightmare, daydream, murd...",{'Horror': 27},8.569,53.0,5.1
3,14757,2005.0,palindromes,13073,movie,"{'name': 'Ellen Barkin', 'keywords': [{'id': 1...",Ellen Barkin,"[teenage pregnancy, runaway teen]","{'Comedy': 35, 'Drama': 18}",12.121,102.0,6.2
4,631,2005.0,unleashed,10027,movie,"{'name': 'Jet Li', 'keywords': [{'id': 779, 'n...",Jet Li,"[martial arts, hitman, serial killer]","{'Action': 28, 'Crime': 80}",17.592,1225.0,6.8


In [56]:
# Save file to HDF5
hf = pd.HDFStore('datasets/movie.hdf5')
hf['keywords_series'] = keywords_series
hf.close()

##### Genre Dictionary

In [57]:
def genre_dict_update(x,key_1,key_dict):
    '''
    Function will generate a dictionary of dictionary based on title type
    
    arg1 (list of dict): Element of the data series
    arg2 (str)         : title type
    arg3 (dict)        : return dictionary
    
    '''
    if x != dict():
        if key_1 == 'tv':
            key_dict['tv'].update(x)
        if key_1 == 'movie':
            key_dict['movie'].update(x)

In [58]:
# Create a genre dictionary. The genre is split into two (tv and movie) because of the way the API is structured, and I am not able to confirm whether the same id is used for both.
g_dict = {'tv':{}, 'movie':{}}

In [59]:
# update genre_dict based on titles type
for i in range(len(titles)):
    genre_dict_update(titles.loc[i,'genre'],titles.loc[i,'type'],g_dict)

In [59]:
g_dict

{'tv': {'Drama': 18,
  'Mystery': 9648,
  'Comedy': 35,
  'Western': 37,
  'Action & Adventure': 10759,
  'Animation': 16,
  'Crime': 80,
  'Sci-Fi & Fantasy': 10765,
  'Documentary': 99,
  'Reality': 10764,
  'Family': 10751,
  'Kids': 10762,
  'War & Politics': 10768,
  'Talk': 10767,
  'Adventure': 12,
  'War': 10752,
  'Horror': 27,
  'Soap': 10766,
  'Music': 10402,
  'News': 10763,
  'Science Fiction': 878},
 'movie': {'Documentary': 99,
  'History': 36,
  'TV Movie': 10770,
  'Horror': 27,
  'Comedy': 35,
  'Drama': 18,
  'Action': 28,
  'Crime': 80,
  'Family': 10751,
  'Romance': 10749,
  'Music': 10402,
  'Adventure': 12,
  'Animation': 16,
  'Fantasy': 14,
  'Science Fiction': 878,
  'War': 10752,
  'Thriller': 53,
  'Mystery': 9648,
  'Western': 37}}

<span style='color:magenta'>Remarks:</span>
From the result, it seems that both TV and movies use the same id for the same genre. I will combine them.
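
The visual check above can also be done programmatically before merging. A minimal sketch, using a small subset of the ids printed above (the variable names are illustrative):

```python
# Subset of the tv and movie genre ids shown in g_dict above
tv = {'Drama': 18, 'Comedy': 35, 'Sci-Fi & Fantasy': 10765}
movie = {'Drama': 18, 'Comedy': 35, 'Fantasy': 14}

# Genres present in both dictionaries whose ids disagree
shared = tv.keys() & movie.keys()
conflicts = {name: (tv[name], movie[name])
             for name in shared if tv[name] != movie[name]}

# Merging is only safe when there are no conflicting ids
merged = {**tv, **movie} if not conflicts else None
print(conflicts, merged)
```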

In [60]:
genre_dict = dict()
genre_dict.update(g_dict['tv'])
genre_dict.update(g_dict['movie'])

# Convert dict into a series to be saved
genre_series = pd.Series(genre_dict)

In [61]:
# Save file to HDF5
hf = pd.HDFStore('datasets/movie.hdf5')
hf['genre_series'] = genre_series
hf.close()

In [62]:
# Change genre column into a list of genre names only
titles['genre'] = titles['genre'].apply(lambda x: list(chain(x.keys())))

In [63]:
titles.head()

Unnamed: 0,movie_id,year_of_release,title,moviedb_id,type,search_result,featured_actor,keywords,genre,popularity,vote_count,vote_average
0,7003,2005.0,national geographic: arlington: field of honor,312215,movie,"{'name': 'Zahi Hawass', 'keywords': [{'id': 11...",Zahi Hawass,"[egypt, museum, mummy, excavation, ancient egy...","[Documentary, History, TV Movie]",4.338,2.0,7.0
1,9603,2005.0,pink floyd: atom heart mother,61913,movie,"{'name': 'Roger Waters', 'keywords': [{'id': 5...",Roger Waters,"[biography, classic rock]",[Documentary],1.863,1.0,7.0
2,14762,2005.0,devour,14147,movie,"{'name': 'Jensen Ackles', 'keywords': [{'id': ...",Jensen Ackles,"[video game, tattoo, nightmare, daydream, murd...",[Horror],8.569,53.0,5.1
3,14757,2005.0,palindromes,13073,movie,"{'name': 'Ellen Barkin', 'keywords': [{'id': 1...",Ellen Barkin,"[teenage pregnancy, runaway teen]","[Comedy, Drama]",12.121,102.0,6.2
4,631,2005.0,unleashed,10027,movie,"{'name': 'Jet Li', 'keywords': [{'id': 779, 'n...",Jet Li,"[martial arts, hitman, serial killer]","[Action, Crime]",17.592,1225.0,6.8


In [64]:
# Remove search_result and moviedb_id columns
titles.drop(columns=['search_result','moviedb_id'], inplace=True, errors='ignore')
print(titles.shape)
titles.head()

(14923, 10)


Unnamed: 0,movie_id,year_of_release,title,type,featured_actor,keywords,genre,popularity,vote_count,vote_average
0,7003,2005.0,national geographic: arlington: field of honor,movie,Zahi Hawass,"[egypt, museum, mummy, excavation, ancient egy...","[Documentary, History, TV Movie]",4.338,2.0,7.0
1,9603,2005.0,pink floyd: atom heart mother,movie,Roger Waters,"[biography, classic rock]",[Documentary],1.863,1.0,7.0
2,14762,2005.0,devour,movie,Jensen Ackles,"[video game, tattoo, nightmare, daydream, murd...",[Horror],8.569,53.0,5.1
3,14757,2005.0,palindromes,movie,Ellen Barkin,"[teenage pregnancy, runaway teen]","[Comedy, Drama]",12.121,102.0,6.2
4,631,2005.0,unleashed,movie,Jet Li,"[martial arts, hitman, serial killer]","[Action, Crime]",17.592,1225.0,6.8


In [65]:
# Create a dataframe indexed by movie_id to serve as a title mapping
title_series= titles[['movie_id','title','genre']].set_index(keys='movie_id')

In [66]:
# Save a copy of the cleaned title dataset
hf = pd.HDFStore('datasets/movie_raw.hdf5')
hf['titles_cleaned'] = titles
hf.close()

hf = pd.HDFStore('datasets/movie.hdf5')
hf['titles_series'] = title_series
hf.close()

your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block2_values] [items->Index(['title', 'type', 'featured_actor', 'keywords', 'genre'], dtype='object')]

  exec(code_obj, self.user_global_ns, self.user_ns)
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->Index(['title', 'genre'], dtype='object')]

  exec(code_obj, self.user_global_ns, self.user_ns)


In [67]:
# Check if the dataframe is saved.
hf = h5py.File('datasets/movie_raw.hdf5','r')
print(f'Dataframe in movie_raw.hdf5 {hf.keys()}')
hf.close()

Dataframe in movie_raw.hdf5 <KeysViewHDF5 ['full_ratings', 'titles_cleaned', 'titles_search_mvid']>


### Users Rating

<span style='color:magenta'>Remarks:</span>
The user rating data is a massive multi-file dataset. Each title's ratings are saved as a separate text file, and each file contains the ratings from every customer who rated that title, including the date of each rating. In total, there are over 100 million rows of data. Due to its size, I have attempted to save resources (RAM) by downcasting each column to the smallest suitable data type.
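
To give a sense of the saving, the sketch below downcasts a toy frame with `pd.to_numeric(..., downcast='integer')` and compares memory usage. The values are invented for illustration, and the columns start as `int64` by explicit choice.

```python
import pandas as pd

# Toy ratings (illustrative values), stored as 64-bit integers to begin with
toy = pd.DataFrame({
    'customer_id': [100000, 200000, 300000],
    'rating': [3, 5, 4],
    'movie_id': [1, 1, 17770],
}, dtype='int64')

before = toy.memory_usage(index=False).sum()

# Downcast every column to the smallest integer type that fits its values
small = toy.apply(pd.to_numeric, downcast='integer')
after = small.memory_usage(index=False).sum()

print(small.dtypes)
print(before, after)
```

Each column shrinks to the narrowest integer type that can hold its largest value, which matters once the frame reaches 100 million rows.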

#### Glob User Rating Files

In [68]:
# Import rating files into dataframe
ls_of_ratings = []

# Specify the pattern matching for glob
rating_files = glob.glob('datasets/training_set/*')

# Loop to get dataframe into list
for filename in rating_files:
    df = pd.read_csv(filename, sep=',', names=['customer_id','rating','date'],skiprows=1)
    df['movie_id'] = int(filename.split('mv_')[1].split('.')[0])
    ls_of_ratings.append(df)

# Concat dataframe together
ratings = pd.concat(ls_of_ratings,ignore_index=True)

print('glob completed, saving....')
print("-" * 50)

# Convert dataframe into memory-saving datatypes
ratings['date'] = pd.to_datetime(ratings['date'])
ratings[['customer_id', 'rating', 'movie_id']] = ratings[['customer_id', 'rating', 'movie_id']].apply(pd.to_numeric, downcast='integer')

# Save file to HDF5
hf = pd.HDFStore('datasets/movie_raw.hdf5')
hf['full_ratings'] = ratings
hf.close()

print(ratings.info())
print("-" * 50)
print('Saving completed')

# Check if the dataframe is saved.
hf = h5py.File('datasets/movie_raw.hdf5','r')
print(f'Dataframe in movie_raw.hdf5 {hf.keys()}')
hf.close()

glob completed, saving....
--------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100480507 entries, 0 to 100480506
Data columns (total 4 columns):
 #   Column       Dtype         
---  ------       -----         
 0   customer_id  int32         
 1   rating       int8          
 2   date         datetime64[ns]
 3   movie_id     int16         
dtypes: datetime64[ns](1), int16(1), int32(1), int8(1)
memory usage: 1.4 GB
None
--------------------------------------------------
Saving completed
Dataframe in movie_raw.hdf5 <KeysViewHDF5 ['full_ratings', 'titles_cleaned', 'titles_search_mvid']>


#### Filter user-ratings of titles dropped

In [69]:
# Find unique movie_id from titles
titles_set = set(titles['movie_id'])

In [70]:
# Filtered ratings based on cleaned titles
ratings = ratings[ratings['movie_id'].isin(titles_set)]

In [71]:
#save rating file
hf = pd.HDFStore('datasets/movie_raw.hdf5')
hf['full_ratings'] = ratings
hf.close()

In [72]:
hf = h5py.File('datasets/movie_raw.hdf5','r')
print(f'Dataframe in movie_raw.hdf5 {hf.keys()}')
hf.close()

Dataframe in movie_raw.hdf5 <KeysViewHDF5 ['full_ratings', 'titles_cleaned', 'titles_search_mvid']>


In [73]:
hf = h5py.File('datasets/movie.hdf5','r')
print(f'Dataframe in movie.hdf5 {hf.keys()}')
hf.close()

Dataframe in movie.hdf5 <KeysViewHDF5 ['genre_series', 'keywords_series', 'titles_series']>
