# Overview

<hr>

To add more movies to our dataset, we plan to use The Movie Database (TMDB) API.  
TMDB publishes a daily file containing all valid movie IDs; we use the file posted on 09/23/2024.  
This file contains almost 1 million movies, which is beyond the scope of this project.  
This script cuts that list down to a couple thousand.

In [1]:
# Dependencies.
# Data.
import pandas as pd
import numpy as np

# Misc.
import re
import json
import pprint

# Package used to detect whether or not a title is English.
from langdetect import detect

In [2]:
# Here, we need to cut down the number of movies to look up with the TMDB API.
# The main criteria to filter on are:
# - English title.
# - Popularity.
# After playing with the numbers a bit, it seems that a popularity of 20 or higher reduces the list to just above 2000.
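
The threshold experiment described above can be sketched as a small counting helper. This is only an illustration: `count_at_thresholds` is a hypothetical name, and the last two sample lines are made-up entries in the same shape as the TMDB export (the first two are taken from this notebook's output).

```python
import json

# Sample lines in the shape of the TMDB daily export.
# The last two entries are hypothetical and exist only for illustration.
sample_lines = [
    '{"adult": false, "id": 1271, "original_title": "300", "popularity": 52.868, "video": false}',
    '{"adult": false, "id": 3021, "original_title": "1408", "popularity": 21.524, "video": false}',
    '{"adult": false, "id": 1, "original_title": "Obscure Short", "popularity": 0.6, "video": false}',
    '{"adult": false, "id": 2, "original_title": "Mid Tier", "popularity": 12.0, "video": false}',
]

def count_at_thresholds(lines, thresholds):
    """Return {threshold: number of movies with popularity >= threshold}."""
    counts = {t: 0 for t in thresholds}
    for line in lines:
        popularity = json.loads(line)["popularity"]
        for t in thresholds:
            if popularity >= t:
                counts[t] += 1
    return counts

threshold_counts = count_at_thresholds(sample_lines, [1, 10, 20, 50])
print(threshold_counts)
```

Running the same helper over the real export file (one JSON object per line) would show how quickly the list shrinks as the cutoff rises, which is how a value such as 20 can be chosen.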

In [3]:
# Load the list of valid movies for 09/23/2024.
# Get paths for input and output file.
input_path = "raw/movie_ids_09_23_2024.json"
output_path = "clean/tmdb_movie_list.txt"

# This list will hold all entries that cannot be detected by langdetect.
# It encounters errors when there are only numerical characters as a title.
# However, we don't want to lose those movies.
failed = list()

# langdetect raises this when a title has no usable characters (e.g. digits only).
from langdetect.lang_detect_exception import LangDetectException

with open(input_path, 'r', encoding = "utf8") as infile:
    with open(output_path, 'w', encoding = "utf8") as output:
        for line in infile:
            # Keep only lines made up entirely of ASCII characters.
            if re.search('[^\x00-\x7F]+', line) is None:
                movie = json.loads(line)

                # Filtering for popularity.
                # Note: this is TMDB's popularity score, which is separate from its user ratings.
                if movie['popularity'] >= 20:
                    try:
                        # Is the title in English?
                        if detect(movie['original_title']) == 'en':
                            output.write(f"{json.dumps(movie)}\n")

                    except LangDetectException:
                        print(f"Failed on {movie['original_title']} | {movie['id']}")
                        failed.append(json.dumps(movie))

Failed on 300 | 1271
Failed on 1408 | 3021
Failed on 21 | 8065
Failed on 9 | 12244
Failed on 2012 | 14161
Failed on 42 | 109410
Failed on 1992 | 413846
Failed on 1917 | 530915
Failed on 65 | 700391
Failed on 74 | 1303869
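
Every title that failed above consists only of digits. A possible alternative to catching the exception is to separate such titles before calling the detector at all. The sketch below is a minimal illustration under the assumption that the detector only fails on titles without any letters; `has_letters` and `split_by_detectability` are hypothetical helper names, and the code does not call langdetect itself.

```python
def has_letters(title):
    """True if the title contains at least one alphabetic character."""
    return any(ch.isalpha() for ch in title)

def split_by_detectability(titles):
    """Separate titles a language detector can score from those it cannot."""
    detectable, undetectable = [], []
    for title in titles:
        if has_letters(title):
            detectable.append(title)
        else:
            undetectable.append(title)
    return detectable, undetectable

detectable, undetectable = split_by_detectability(["300", "1917", "Finding Nemo", "Se7en"])
print(detectable)
print(undetectable)
```

Titles in the second list could be written straight to the output file, while only the first list is passed to `detect`.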


In [4]:
# Check the list of 'failed' movies that were collected.
for item in failed:
    print(item)

{"adult": false, "id": 1271, "original_title": "300", "popularity": 52.868, "video": false}
{"adult": false, "id": 3021, "original_title": "1408", "popularity": 21.524, "video": false}
{"adult": false, "id": 8065, "original_title": "21", "popularity": 28.519, "video": false}
{"adult": false, "id": 12244, "original_title": "9", "popularity": 39.961, "video": false}
{"adult": false, "id": 14161, "original_title": "2012", "popularity": 71.815, "video": false}
{"adult": false, "id": 109410, "original_title": "42", "popularity": 22.232, "video": false}
{"adult": false, "id": 413846, "original_title": "1992", "popularity": 133.531, "video": false}
{"adult": false, "id": 530915, "original_title": "1917", "popularity": 32.413, "video": false}
{"adult": false, "id": 700391, "original_title": "65", "popularity": 59.434, "video": false}
{"adult": false, "id": 1303869, "original_title": "74", "popularity": 22.716, "video": false}


In [5]:
# Those seem to be great movies to add back into the list.
output_path = "clean/tmdb_movie_list.txt"

with open(output_path, 'a', encoding = "utf8") as output:
    for item in failed:
        output.write(f"{item}\n")

In [6]:
# Now, let's cross-reference the IMDB dataset that we got from Kaggle.
# We do not want duplicates when we join the two datasets together.
filepath = "raw/imdb_top_1000.csv"
imdb_df = pd.read_csv(filepath)
imdb_df.head()

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000


In [7]:
# There are slight differences in spelling and punctuation with the movie names... that must be addressed before we can proceed.
# Ideal changes: lowercase, get rid of punctuation.
# We will make a list of the changed IMDB titles and then reference that later.
imdb_titles = list()

for i, title in imdb_df.Series_Title.items():
    imdb_title = title.lower().replace('.', '').replace(',', '').replace(':', '').replace('\'', '').replace('-', '')
    imdb_titles.append(imdb_title)

# Check the changes.
for title in imdb_titles:
    print(title)

the shawshank redemption
the godfather
the dark knight
the godfather part ii
12 angry men
the lord of the rings the return of the king
pulp fiction
schindlers list
inception
fight club
the lord of the rings the fellowship of the ring
forrest gump
il buono il brutto il cattivo
the lord of the rings the two towers
the matrix
goodfellas
star wars episode v  the empire strikes back
one flew over the cuckoos nest
hamilton
gisaengchung
soorarai pottru
interstellar
cidade de deus
sen to chihiro no kamikakushi
saving private ryan
the green mile
la vita è bella
se7en
the silence of the lambs
star wars
seppuku
shichinin no samurai
its a wonderful life
joker
whiplash
the intouchables
the prestige
the departed
the pianist
gladiator
american history x
the usual suspects
léon
the lion king
terminator 2 judgment day
nuovo cinema paradiso
hotaru no haka
back to the future
once upon a time in the west
psycho
casablanca
modern times
city lights
capharnaüm
ayla the daughter of war
vikram vedha
kimi no na
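
The same normalization is needed again when comparing TMDB titles, so it may be worth putting it in one place. The sketch below is one way to do that, assuming the punctuation set stays exactly as in the cell above (`.`, `,`, `:`, `'`, `-`); `normalize_title` and `imdb_title_set` are hypothetical names. Using a set instead of a list also makes each membership check faster.

```python
import re

# The characters stripped by the chained .replace() calls: . , : ' -
_PUNCTUATION = re.compile(r"[.,:'\-]")

def normalize_title(title):
    """Lowercase a title and drop the punctuation listed above."""
    return _PUNCTUATION.sub("", title.lower())

# A small stand-in for the full IMDB title column.
imdb_title_set = {normalize_title(t) for t in ["The Godfather: Part II", "Schindler's List"]}

print(normalize_title("Star Wars: Episode V - The Empire Strikes Back"))
print(normalize_title("Schindler's List") in imdb_title_set)
```

Note that removing the hyphen leaves a double space in titles such as the Star Wars one, exactly as in the output above; this is harmless as long as both datasets go through the same function.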

In [8]:
# Now, we will load up the list and find duplicates.
# Path to the file and a list to hold non-duplicate movies.
filepath = "clean/tmdb_movie_list.txt"
tmdb_movies = list()

# Find non-duplicate movies and add them to list.
with open(filepath, 'r', encoding = "utf8") as file:
    for line in file:
        movie = json.loads(line)
        title = movie['original_title']

        if title.lower().replace('.', '').replace(',', '').replace(':', '').replace('\'', '').replace('-', '') not in imdb_titles:
            tmdb_movies.append(movie)

        else:
            # Show the duplicate movies that are ignored.
            print(f"Skipping: {title}")

# Now, write them back as JSON, one movie per line.
# Opening the file in 'w' mode already empties it, so no truncate() call is needed.
with open(filepath, 'w', encoding = "utf8") as file:
    for movie in tmdb_movies:
        file.write(f"{json.dumps(movie)}\n")

Skipping: Finding Nemo
Skipping: American Beauty
Skipping: Dancer in the Dark
Skipping: The Fifth Element
Skipping: Pirates of the Caribbean: The Curse of the Black Pearl
Skipping: Apocalypse Now
Skipping: Eternal Sunshine of the Spotless Mind
Skipping: 2001: A Space Odyssey
Skipping: Twelve Monkeys
Skipping: Million Dollar Baby
Skipping: American History X
Skipping: Raiders of the Lost Ark
Skipping: Indiana Jones and the Last Crusade
Skipping: Taxi Driver
Skipping: Back to the Future
Skipping: Snatch
Skipping: Match Point
Skipping: The Untouchables
Skipping: The Lord of the Rings: The Fellowship of the Ring
Skipping: The Lord of the Rings: The Two Towers
Skipping: The Lord of the Rings: The Return of the King
Skipping: O Brother, Where Art Thou?
Skipping: Groundhog Day
Skipping: Lost in Translation
Skipping: Star Trek II: The Wrath of Khan
Skipping: The Dark Knight
Skipping: Ocean's Eleven
Skipping: Edward Scissorhands
Skipping: Breakfast at Tiffany's
Skipping: Back to the Future Part

<hr>

Our list of movies is now down to about 2000 entries, and we can use it to request information from the TMDB API.
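
As a rough illustration of that next step, the sketch below turns one entry from the list into a request URL. It assumes the TMDB v3 movie details endpoint and `api_key` query parameter; `build_detail_url` is a hypothetical helper, `YOUR_API_KEY` is a placeholder, and no request is actually sent.

```python
import json
from urllib.parse import urlencode

# Assumed TMDB v3 "movie details" endpoint; check the official docs before relying on it.
TMDB_MOVIE_URL = "https://api.themoviedb.org/3/movie/{movie_id}"

def build_detail_url(movie_id, api_key):
    """Build the details URL for one movie ID (no network call is made)."""
    query = urlencode({"api_key": api_key})
    return f"{TMDB_MOVIE_URL.format(movie_id=movie_id)}?{query}"

# One line from the cleaned list, as written by the cells above.
line = '{"adult": false, "id": 1271, "original_title": "300", "popularity": 52.868, "video": false}'
movie = json.loads(line)

url = build_detail_url(movie["id"], "YOUR_API_KEY")
print(url)
```

Looping over every line of `clean/tmdb_movie_list.txt` in the same way would give one URL per movie to fetch.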