#### Creating a richer dataset with actor information

To enrich our dataset, we will fetch information about the actors who played the characters, along with movie-level information such as poster, runtime, and revenue.

To do that, we will query [The Movie DB API](https://www.themoviedb.org/).

We start by retrieving the movie titles from the database we have already built. With these titles we can search the API and get the relevant information about each movie.
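
Movie titles contain spaces and punctuation, so they have to be URL-encoded before they are placed in a query string. The snippet below is a minimal sketch of that step. `build_search_url` is a hypothetical helper that is not part of this notebook, and `'YOUR_API_KEY'` is a placeholder.

```python
from urllib.parse import quote_plus


def build_search_url(title, api_key):
    """Hypothetical helper: build the TMDB search URL for one movie title."""
    # quote_plus turns spaces into '+' and percent-encodes other reserved characters
    query = quote_plus(title)
    return f'https://api.themoviedb.org/3/search/movie?api_key={api_key}&query={query}'


print(quote_plus("a hard day's night"))
print(build_search_url('48 hrs.', 'YOUR_API_KEY'))
```

Keeping the encoding in one place makes it easier to check how an unusual title (apostrophes, colons) ends up in the request.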

In [1]:
import requests
import json
import pandas as pd
import sqlite3

# url encoding library
import urllib.parse

api_key = 'c84d2eacda018a9f013d10a6880631ac'

# CONNECT TO THE DATABASE (final.db)
conn = sqlite3.connect('final.db')
cursor = conn.cursor()

# GET ALL (CHARACTER, MOVIE TITLE) PAIRS ORDERED BY TITLE
characterFilm = cursor.execute('SELECT name, title FROM character JOIN movie ON movie.id = character.movie_id ORDER BY title').fetchall()

titles = cursor.execute('SELECT title FROM movie ORDER BY title').fetchall()

#### Fetching the API

We will organize our data in dictionaries where the key is the movie title as stored in the database and the value is a list of character and actor names. These data structures let us build the bridge between the characters and the actors who played them.
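
To make the shape of these dictionaries concrete, here is a small sketch with made-up values. The title, character names, and actor names are toy data for illustration only and do not come from the database or the API. The `actor_by_character` lookup is also an illustrative addition.

```python
# Toy values for illustration only
title = 'example movie'

# Character names as stored in our database
movieCharactersDatabase = {title: ['ALICE', 'BOB']}

# (character, gender, order) tuples as returned in the API cast list
movieCharactersAPI = {title: [('Alice Smith', 1, 0), ('Bob', 2, 1)]}

# (actor, character) tuples from the same cast list
movieActorsAPI = {title: [('Actor One', 'Alice Smith'), ('Actor Two', 'Bob')]}

# (runtime, revenue, poster_path) for the movie
movieInfoAPI = {title: [(120, 1000000, None)]}

# Bridge from an API character name to the actor who played it
actor_by_character = {character: actor for actor, character in movieActorsAPI[title]}
print(actor_by_character['Bob'])
```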

In [4]:
# CREATE MOVIE - CHARACTERS DICTIONARY
movieCharactersDatabase = {}
movieCharactersAPI = {}
movieActorsAPI = {}
movieInfoAPI = {}

for title in titles:
    
    title = title[0]
    print(f'Title - {title}')
    
    movieCharactersDatabase[title] = []
    movieCharactersAPI[title] = []
    movieActorsAPI[title] = []
    movieInfoAPI[title] = []
    
    # convert to url encoding
    titleParsed = urllib.parse.quote_plus(title)
    
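    url = f'https://api.themoviedb.org/3/search/movie?api_key={api_key}&query={titleParsed}'
    resp = requests.get(url).json()
    
    # skip titles with no search results (their lists stay empty)
    if not resp.get('results'):
        continue
    
    # use the first search result; avoid shadowing the built-in id()
    movie_id = resp['results'][0]['id']
    url = f'https://api.themoviedb.org/3/movie/{movie_id}?api_key={api_key}&append_to_response=credits'
    resp = requests.get(url).json()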
    
    # GET CHARACTERS FROM CAST
    cast = resp['credits']['cast']
    poster_path_prefix = 'https://image.tmdb.org/t/p/original'
    poster_path = poster_path_prefix + resp['poster_path'] if resp['poster_path'] else None
    movieInfoAPI[title].append((resp['runtime'], resp['revenue'], poster_path))
    for character in cast:
        movieCharactersAPI[title].append((character['character'], character['gender'], character['order']))
        movieActorsAPI[title].append((character['name'], character['character']))

Title - 10 things i hate about you
Title - 1492: conquest of paradise
Title - 15 minutes
Title - 2001: a space odyssey
Title - 48 hrs.
Title - 8mm
Title - a bucket of blood
Title - a clockwork orange
Title - a hard day's night
Title - a nightmare on elm street
Title - a nightmare on elm street 3: dream warriors
Title - a nightmare on elm street 4: the dream master
Title - a nightmare on elm street part 2: freddy's revenge
Title - a nightmare on elm street: the dream child
Title - a walk to remember
Title - affliction
Title - agnes of god
Title - air force one
Title - airplane ii: the sequel
Title - airplane!
Title - alien
Title - alien nation
Title - alien vs. predator
Title - aliens
Title - all about eve
Title - all the president's men
Title - amadeus
Title - american madness
Title - american outlaws
Title - american pie
Title - american psycho
Title - an american werewolf in london
Title - an american werewolf in paris
Title - an officer and a gentleman
Title - anastasia
Title - anni

#### Creating and populating actor table

Since we want to store the actors who played the characters, we need a table holding their information that the character table can reference.
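
The sketch below illustrates the idea on a throwaway in-memory database, so it does not touch final.db. The table layout is simplified and the names are toy values. It uses `INSERT OR IGNORE` as an alternative way of skipping duplicate actor names, which is an assumption of this example rather than what the notebook does.

```python
import sqlite3

# Throwaway in-memory database, separate from the notebook's connection
conn_demo = sqlite3.connect(':memory:')
cur = conn_demo.cursor()

cur.execute('CREATE TABLE actor (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT UNIQUE)')
cur.execute('CREATE TABLE character (id INTEGER PRIMARY KEY, name TEXT, '
            'actor_id INTEGER REFERENCES actor (id))')

# The UNIQUE constraint plus OR IGNORE stores each actor only once
for name in ['Actor One', 'Actor Two', 'Actor One']:
    cur.execute('INSERT OR IGNORE INTO actor (name) VALUES (?)', (name,))
actor_count = cur.execute('SELECT COUNT(*) FROM actor').fetchone()[0]

# A character row points to its actor through actor_id
actor_id = cur.execute('SELECT id FROM actor WHERE name = ?', ('Actor Two',)).fetchone()[0]
cur.execute('INSERT INTO character (name, actor_id) VALUES (?, ?)', ('BOB', actor_id))
conn_demo.commit()

row = cur.execute('SELECT character.name, actor.name FROM character '
                  'JOIN actor ON actor.id = character.actor_id').fetchone()
print(actor_count, row)
```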

In [6]:
# CREATE ACTORS TABLE
cursor.execute('DROP TABLE IF EXISTS actor')
cursor.execute('CREATE TABLE actor (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT UNIQUE);')
# cursor.execute('ALTER TABLE character DROP COLUMN actor_id;')
cursor.execute('ALTER TABLE movie ADD COLUMN runtime INTEGER;')
cursor.execute('ALTER TABLE movie ADD COLUMN revenue INTEGER;')
cursor.execute('ALTER TABLE movie ADD COLUMN poster_path TEXT;')
conn.commit()

# POPULATE ACTORS TABLE

for title in movieActorsAPI:
    print(title)
    for actor, character in movieActorsAPI[title]:
        try:
            cursor.execute('INSERT INTO actor (name) VALUES (?)', (actor,))
        except sqlite3.IntegrityError:
            # actor already inserted (name is UNIQUE)
            pass
    for runtime, revenue, poster_path in movieInfoAPI[title]:
        try:
            cursor.execute('UPDATE movie SET runtime = ?, revenue = ?, poster_path = ? WHERE title = ?', (runtime, revenue, poster_path, title))
        except sqlite3.Error:
            # skip movies whose update fails
            pass
        
conn.commit()

# ADD ACTOR_ID TO CHARACTERS TABLE
cursor.execute('ALTER TABLE character ADD COLUMN actor_id INTEGER REFERENCES actor (id) ON DELETE CASCADE;')
conn.commit()

10 things i hate about you
1492: conquest of paradise
15 minutes
2001: a space odyssey
48 hrs.
8mm
a bucket of blood
a clockwork orange
a hard day's night
a nightmare on elm street
a nightmare on elm street 3: dream warriors
a nightmare on elm street 4: the dream master
a nightmare on elm street part 2: freddy's revenge
a nightmare on elm street: the dream child
a walk to remember
affliction
agnes of god
air force one
airplane ii: the sequel
airplane!
alien
alien nation
alien vs. predator
aliens
all about eve
all the president's men
amadeus
american madness
american outlaws
american pie
american psycho
an american werewolf in london
an american werewolf in paris
an officer and a gentleman
anastasia
annie hall
antitrust
antz
apocalypse now
arcade
arctic blue
as good as it gets
assassins
asylum
austin powers: international man of mystery
bachelor party
back to the future
backdraft
bad lieutenant
badlands
bamboozled
barry lyndon
barton fink
basic
basic instinct
basquiat
batman
batman and 

We also need to populate the movieCharactersDatabase dictionary before starting the matching process.

In [7]:
# ITERATE THROUGH ALL MOVIES
for character in characterFilm:
     
    # GET MOVIE TITLE
    title = character[1]
    
    # GET CHARACTERS
    movieCharactersDatabase[title].append(character[0])

#### Matching characters and actors from the two data sources

As expected, the character names in the two data sources are not an exact match. To deal with this, we use fuzzy matching and substring search, and take the best match for each character from the database to obtain the actor who played it.
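
The matching rule can be sketched in isolation as follows. This example uses `difflib.SequenceMatcher` from the standard library as a stand-in for `fuzz.ratio`, so the scores may differ slightly from those of thefuzz. The helper names (`normalize`, `similarity`, `best_match`) and the sample character names are hypothetical.

```python
from difflib import SequenceMatcher


def normalize(name):
    # Same normalisation idea as in the notebook: lowercase, drop '.' and '-'
    return name.lower().replace('.', '').replace('-', '')


def similarity(a, b):
    # Stand-in for fuzz.ratio: similarity on a 0-100 scale
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_match(character, candidates, threshold=90):
    """Return the candidate that best matches character, or None."""
    target = normalize(character)
    best, best_score = None, 0
    for candidate in candidates:
        cand = normalize(candidate)
        if cand == '':
            continue
        score = similarity(target, cand)
        # Accept a high ratio or a substring relation in either direction
        accepted = score > threshold or target in cand or cand in target
        if accepted and score > best_score:
            best, best_score = candidate, score
    return best


print(best_match('RIPLEY', ['Ellen Ripley', 'Dallas', '']))
print(best_match('Dad', ['George McFly']))
```

The first call is accepted through the substring rule even though its ratio is below the threshold, while the second finds no acceptable candidate and returns None.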

In [9]:
from thefuzz import fuzz
from thefuzz import process
countEmptyString = 0

for title in movieCharactersDatabase:
    # GET CHARACTERS
    charactersDatabase = movieCharactersDatabase[title]
    charactersAPI = movieCharactersAPI[title]
    
    for character in charactersDatabase:
        
        # skip missing names; calling .lower() on None would raise an error
        if character is None:
            print(f'{title} - {character}')
            continue
            
        characterCopy = character
        character = character.lower().replace('.', '').replace('-', '')
        
        maxFuzzRatio = 0
        bestCharAPI = ''
        
        for characterAPI, gender, order in charactersAPI:

            if gender == 1:
                gender = 'f'
            elif gender == 2:
                gender = 'm'
            order = order + 1
            
            characterAPICopy = characterAPI
            characterAPI = characterAPI.lower().replace('.', '').replace('-', '')
                        
            if characterAPI == '':
                continue
            
            fuzzRatio = fuzz.ratio(character, characterAPI)
            
            if fuzzRatio > 90 or character in characterAPI or characterAPI in character:
                
                if fuzzRatio > maxFuzzRatio:
                    maxFuzzRatio = fuzzRatio
                    bestCharAPI = characterAPICopy
                    
                    for actor, char in movieActorsAPI[title]:
                        if char == bestCharAPI:
                            actor_id = cursor.execute('SELECT id FROM actor WHERE name = ?', (actor,)).fetchone()[0]
                            movie_id = cursor.execute('SELECT id FROM movie WHERE title = ?', (title,)).fetchone()[0]
                            character_id = cursor.execute('SELECT id FROM character WHERE name = ? AND movie_id = ?', (characterCopy, movie_id)).fetchone()[0]
                            cursor.execute('UPDATE character SET actor_id = ?, gender = ?, credit_pos = ? WHERE id = ?', (actor_id, gender, order, character_id))
                            conn.commit()
                    
                
    
        
        if bestCharAPI == '':
            countEmptyString += 1
                
print(f'Empty Strings - {countEmptyString}')


Empty Strings - 2849


To keep the matching reliable, we required a minimum fuzz ratio of 90, with substring matches also accepted. Although not every match is found, for example when one side has 'Dad' and the other has the character's actual name, we were able to find the actor for 6192 characters, which represents 68.5% of our dataset.
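
As a quick check of that percentage, the arithmetic can be reproduced from the two counts. This assumes the dataset total is the sum of matched and unmatched characters, with each character counted exactly once.

```python
matched = 6192    # characters for which an actor was found
unmatched = 2849  # the 'Empty Strings' count printed above

# Assumption: every character is either matched or unmatched
total = matched + unmatched
share = 100 * matched / total

print(f'{matched} of {total} characters matched ({share:.1f}%)')
```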

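Finally, some movies have a year value longer than four characters, so we keep only the first four characters.

In [8]: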
movies = cursor.execute('SELECT * from movie where length(year) > 4').fetchall()

# iterate through movies
for movie in movies:
    
    year = movie[2]
    title = movie[1]
    
    # update year
    year = year[:4]
    print(year)
    cursor.execute('UPDATE movie SET year = ? WHERE title = ?', (year, title))
    conn.commit()

1989
1990
1995
1998
2004
2007
1992
2005
2002
1998
1968
1996
1998
2000
2009
2003
