### Authentication to use TMDB API

TMDB requires users to create an account to access its API. After creating an account, you can request a bearer token, which is then used to authenticate HTTPS requests. To run this code, the provided _config.json_ file contains the necessary credentials used in the request header. The header is then passed with each request.

# Set up API for HTTPS requests

In [3]:
CONFIG_PATH = os.path.join(os.path.abspath(""), "config.json")
BASE_URL = "https://api.themoviedb.org/3/movie/"
BASE_URL_TOP_RATED = BASE_URL + "top_rated?"

In [4]:
# Load the config file with API credentials
if exists(CONFIG_PATH):
    with open(CONFIG_PATH) as config_file:
        config = json.load(config_file)
        APP_NAME = config["TMDB_APPLICATION_NAME"]
        AUTH_USER = config["TMDB_EMAIL"]
        AUTH_TOKEN = config["TMDB_BEARER_KEY"]
    
    # Define the headers to include the authentication token
    HEADERS = {
        "accept": "application/json",
        "Authorization": f"Bearer {AUTH_TOKEN}",
    }

else:
    print("Config not found!")
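
If the config file is missing, the cell above only prints a message, and later cells then fail with a `NameError` because `HEADERS` was never defined. A minimal sketch of a stricter variant is shown here, assuming a hypothetical helper `build_headers` (not part of the original notebook) and the same `TMDB_BEARER_KEY` key that the notebook reads from _config.json_:

```python
def build_headers(config: dict) -> dict:
    """Build TMDB request headers from a loaded config dict.

    `build_headers` is a hypothetical helper; the key name
    TMDB_BEARER_KEY is the one the notebook reads from config.json.
    """
    token = config.get("TMDB_BEARER_KEY")
    if not token:
        # Fail early instead of continuing without credentials
        raise KeyError("TMDB_BEARER_KEY is missing or empty in the config")
    return {
        "accept": "application/json",
        "Authorization": f"Bearer {token}",
    }


# Usage with a placeholder token (not a real credential)
headers = build_headers({"TMDB_BEARER_KEY": "placeholder-token"})
print(headers["Authorization"])
```

Raising an exception stops the notebook at the point where the problem occurs, which is usually easier to debug than an undefined name several cells later.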

To check that the authentication is valid, send a test request; the response should return status code 200:

In [5]:
response = requests.get(BASE_URL_TOP_RATED+"authentication", headers=HEADERS)
print(response)

<Response [200]>
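
Printing the response shows the status code, but the following cells run regardless of the result. A small sketch of an explicit check is given here, assuming a hypothetical helper `check_auth_status` that takes the integer status code (for example `response.status_code`). The 401 branch reflects the standard HTTP meaning of that code and is an assumption about how a rejected token would show up, not something stated in the notebook:

```python
from http import HTTPStatus


def check_auth_status(status_code: int) -> bool:
    """Return True for 200, raise for 401, False otherwise (hypothetical helper)."""
    if status_code == HTTPStatus.OK:
        return True
    if status_code == HTTPStatus.UNAUTHORIZED:
        # Assumed to indicate a missing or invalid bearer token
        raise PermissionError("401 Unauthorized: check the bearer token")
    return False


print(check_auth_status(200))
```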


# Scraping TMDB

## Scraping the top 10,000 features on TMDB

The TMDB web structure is page-based, meaning any search performed on their database returns results one page at a time. It is the user's responsibility to specify which page to request. Therefore, to retrieve all search results, it is necessary to determine the total number of pages.

### Finding the number of pages

The total number of pages can be found by requesting the default base URL and reading the _"total\_pages"_ field of the response.

In [6]:
response = requests.get(BASE_URL_TOP_RATED, headers=HEADERS)

In [7]:
TOTAL_PAGES = response.json()["total_pages"]
print(f"{TOTAL_PAGES=}")

TOTAL_PAGES=490
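
The page-by-page retrieval can be sketched independently of the network. The example assumes a hypothetical `fetch_page(page)` callable that returns one decoded JSON page with `total_pages` and `results` fields, as the notebook's responses do; here it is replaced by a small in-memory stand-in with made-up IDs.

```python
from typing import Callable, Iterator


def iter_results(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield every result across all pages, reading total_pages from page 1."""
    first = fetch_page(1)
    total_pages = first["total_pages"]
    yield from first["results"]
    for page in range(2, total_pages + 1):
        yield from fetch_page(page)["results"]


# In-memory stand-in for the API: 3 pages with 2 results each
def fake_fetch(page: int) -> dict:
    return {
        "page": page,
        "total_pages": 3,
        "results": [{"id": page * 10 + i} for i in range(2)],
    }


ids = [item["id"] for item in iter_results(fake_fetch)]
print(ids)
```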


### Extracting features from pages

Using the total number of pages, all features can be extracted incrementally for each page and appended to a pandas DataFrame. The DataFrame can then be saved as a .csv file. 
Initially, the DataFrame is set up as follows:

In [8]:
features_df = pd.DataFrame()

Then, using a for-loop, the incrementing page number is passed together with the other parameters to the HTTPS request sent to the TMDB API.
The parameters used in the code return all the highest rated movies from TMDB and sort them by their average rating.
These are then stored as rows in the DataFrame.

In [9]:
# Create the tqdm progress bar
progress_bar = tqdm(range(1, TOTAL_PAGES+1), desc="Scraping TMDB")

for PAGE in progress_bar:

    params = {
        "language": "en-US",
        "page": PAGE,
        "sort_by": "vote_average.desc"
    }

    response = requests.get(
        BASE_URL_TOP_RATED, 
        headers=HEADERS, 
        params=params
    )
    
    response_json = response.json()

    current_df = pd.json_normalize(
        response_json,
        record_path = "results",
        meta = [
            "page"
        ]
    )

    features_df = pd.concat([features_df, current_df])

Scraping TMDB: 100%|███████████████████████████████████████████| 490/490 [01:23<00:00,  5.84it/s]
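
`pd.json_normalize` with `record_path="results"` and `meta=["page"]` turns each entry of `results` into one row and copies the page number onto it. A plain-Python sketch of that reshaping for flat records, using made-up sample data and a hypothetical helper `rows_from_page`, shows the resulting row layout:

```python
def rows_from_page(page_json: dict) -> list[dict]:
    """Return one flat row per entry in "results", tagged with the page number."""
    return [{**record, "page": page_json["page"]} for record in page_json["results"]]


# Made-up sample pages, shaped like the notebook's responses
pages = [
    {"page": 1, "results": [{"id": 278, "title": "A"}, {"id": 238, "title": "B"}]},
    {"page": 2, "results": [{"id": 240, "title": "C"}]},
]

# Collect all rows first; a DataFrame could be built from the list once at the end
all_rows = []
for page_json in pages:
    all_rows.extend(rows_from_page(page_json))

print(all_rows)
```

Collecting rows in a list and building the DataFrame once is a common alternative to calling `pd.concat` inside the loop, which copies the growing DataFrame on every iteration.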


### Clean the feature data and reindex the rows

In [35]:
features_clean_df = features_df.drop_duplicates(subset = "id", keep = "first")

# Reindex the rows and discard the old index
features_clean_df = features_clean_df.reset_index(drop=True)

In [36]:
features_clean_df = features_clean_df[[
    "id", 
    "title", 
    "original_language",
    "overview",
    "popularity",
    "vote_count",
    "vote_average",
    "release_date",
    "genre_ids",
    "poster_path",
    "backdrop_path",
    "adult",
    "page"
]]

# Reassign instead of renaming in place, since the DataFrame is a column selection
features_clean_df = features_clean_df.rename(columns={
    "id": "feature_id",
    "popularity": "feature_popularity"
})

In [37]:
features_clean_df

Unnamed: 0,feature_id,title,original_language,overview,feature_popularity,vote_count,vote_average,release_date,genre_ids,poster_path,backdrop_path,adult,page
0,278,The Shawshank Redemption,en,Imprisoned in the 1940s for the double murder ...,210.704,27247,8.708,1994-09-23,"[18, 80]",/9cqNxx0GxF0bflZmeSMuL5tnGzr.jpg,/zfbjgQE1uSd9wiPTX4VzsLi0rGG.jpg,False,1
1,238,The Godfather,en,"Spanning the years 1945 to 1955, a chronicle o...",182.054,20693,8.689,1972-03-14,"[18, 80]",/3bhkrj58Vtu7enYsRolD1fZdja1.jpg,/tmU7GeKVybMWFButWEGl2M4GeiP.jpg,False,1
2,240,The Godfather Part II,en,In the continuing saga of the Corleone crime f...,113.029,12482,8.572,1974-12-20,"[18, 80]",/hek3koDUyRQk7FIhPXsa6mT2Zc3.jpg,/kGzFbGhp99zva6oZODW5atUtnqi.jpg,False,1
3,424,Schindler's List,en,The true story of how businessman Oskar Schind...,88.182,15897,8.567,1993-12-15,"[18, 36, 10752]",/sF1U4EUQS8YHUYjNl3pMGNIQyr0.jpg,/zb6fM1CX41D9rF9hdgclu0peUmy.jpg,False,1
4,389,12 Angry Men,en,The defense and the prosecution have rested an...,48.620,8693,8.547,1957-04-10,[18],/ow3wq89wM8qd5X7hWKxiRfsFf9C.jpg,/qqHQsStV6exghCM7zbObuYBiYxw.jpg,False,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
9785,12142,Alone in the Dark,en,Edward Carnby is a private investigator specia...,13.258,606,3.246,2005-01-28,"[28, 14, 27]",/bSxrbVCyWW077zhtpuYlo3zgyug.jpg,/lcLyZzhB1ctfdH0hGBsTFrbflqP.jpg,False,490
9786,13805,Disaster Movie,en,"Over the course of one evening, an unsuspectin...",23.797,1024,3.200,2008-08-29,[35],/3J8XKUfhJiNzwobUZVtizXYPe8b.jpg,/5V6jAFS0Q49SI07qjyFRMYlbfR9.jpg,False,490
9787,11059,House of the Dead,en,"Set on an island off the coast, a techno rave ...",11.998,386,3.100,2003-04-11,"[27, 28, 53]",/z2mDGbV4pLtsvSMNnmnSgoVZSWK.jpg,/aNUEHLNsNMprLZt6fjf5nqDq6er.jpg,False,490
9788,14164,Dragonball Evolution,en,"On his 18th birthday, Goku receives a mystical...",16.741,2037,2.891,2009-03-12,"[28, 12, 14, 878, 53]",/sunS9xhPnFNP5wlOWrvbpBteAB.jpg,/oHrrgAPEKpz0S1ofQntiZNrmGrM.jpg,False,490


## Scraping the cast of features on TMDB 

Using the feature IDs, the cast can be scraped and added to the feature DataFrame. Each cast entry is stored as a list of actors.

In [14]:
cast_df = pd.DataFrame()

In [16]:
# Create the tqdm progress bar
progress_bar = tqdm(features_clean_df.iterrows(), total=len(features_clean_df), desc="Scraping TMDB")

for idx, row in progress_bar:

    params = {
        "language": "en-US",
    }

    movie_id = row["feature_id"]

    BASE_URL_ACTORS = BASE_URL + f"{movie_id}" + "/credits"

    response = requests.get(
        BASE_URL_ACTORS, 
        headers=HEADERS, 
        params=params
    )
    
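    response_json = response.json()

    current_df = pd.json_normalize(
        response_json,
        record_path = "cast",
    )
    # assign() returns a new DataFrame, so the result must be kept;
    # the ID stays numeric so it matches the feature_id column later on
    current_df = current_df.assign(feature_id = movie_id)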

    cast_df = pd.concat([cast_df, current_df])

Scraping TMDB: 100%|███████████████████████████████████████| 9790/9790 [1:04:00<00:00,  2.55it/s]
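
For each feature, the loop requests `BASE_URL + "<feature_id>/credits"` and reads the `cast` list from the response. A minimal sketch of those two steps without network access, assuming hypothetical helpers `credits_url` and `tag_cast` and a shortened sample response:

```python
BASE_URL = "https://api.themoviedb.org/3/movie/"


def credits_url(feature_id: int) -> str:
    """Build the credits endpoint URL for one feature."""
    return f"{BASE_URL}{feature_id}/credits"


def tag_cast(credits_json: dict, feature_id: int) -> list[dict]:
    """Return the cast entries, each tagged with the feature they belong to."""
    return [
        {**member, "feature_id": feature_id}
        for member in credits_json.get("cast", [])
    ]


# Shortened sample shaped like a credits response
sample = {"id": 278, "cast": [{"id": 504, "original_name": "Tim Robbins"}]}

print(credits_url(278))
print(tag_cast(sample, 278))
```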


## Update feature data set with actor IDs

After scraping each feature's cast list, the starring actors can be added to the feature DataFrame as a list of their TMDB IDs.

In [49]:
features_clean_df = features_clean_df.assign(cast="")

In [None]:
# Create the tqdm progress bar
progress_bar = tqdm(features_clean_df.iterrows(), total=len(features_clean_df), desc="Adding cast to feature")

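cast_lists = []

for idx, row in progress_bar:
    selected_rows = cast_clean_df[cast_clean_df['feature_id'] == row["feature_id"]]
    selected_rows = selected_rows.sort_values(by='actor_id',ascending = True)
    cast_lists.append(list(selected_rows["actor_id"]))

# Assigning to `row` inside the loop would only modify a copy,
# so the collected lists are written back as a whole column
features_clean_df["cast"] = pd.Series(cast_lists, index=features_clean_df.index)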
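
The loop filters the full cast table once per feature, which becomes slow as both tables grow. One alternative is to group the actor IDs by feature in a single pass and then look each feature up. A plain-Python sketch of that grouping, with a hypothetical helper `group_cast` and partly made-up IDs:

```python
from collections import defaultdict


def group_cast(cast_rows: list[dict]) -> dict[int, list[int]]:
    """Map each feature_id to the sorted list of its actor IDs."""
    grouped = defaultdict(list)
    for row in cast_rows:
        grouped[row["feature_id"]].append(row["actor_id"])
    return {feature_id: sorted(actor_ids) for feature_id, actor_ids in grouped.items()}


# Sample rows; actor ID 1001 is made up for illustration
cast_rows = [
    {"actor_id": 504, "feature_id": 278},
    {"actor_id": 192, "feature_id": 278},
    {"actor_id": 1001, "feature_id": 238},
]
cast_by_feature = group_cast(cast_rows)

# Features without any cast rows fall back to an empty list
print(cast_by_feature.get(278, []))
print(cast_by_feature.get(999, []))
```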

In [50]:
features_clean_df

Unnamed: 0,feature_id,title,original_language,overview,feature_popularity,vote_count,vote_average,release_date,genre_ids,poster_path,backdrop_path,adult,page,cast
0,278,The Shawshank Redemption,en,Imprisoned in the 1940s for the double murder ...,210.704,27247,8.708,1994-09-23,"[18, 80]",/9cqNxx0GxF0bflZmeSMuL5tnGzr.jpg,/zfbjgQE1uSd9wiPTX4VzsLi0rGG.jpg,False,1,
1,238,The Godfather,en,"Spanning the years 1945 to 1955, a chronicle o...",182.054,20693,8.689,1972-03-14,"[18, 80]",/3bhkrj58Vtu7enYsRolD1fZdja1.jpg,/tmU7GeKVybMWFButWEGl2M4GeiP.jpg,False,1,
2,240,The Godfather Part II,en,In the continuing saga of the Corleone crime f...,113.029,12482,8.572,1974-12-20,"[18, 80]",/hek3koDUyRQk7FIhPXsa6mT2Zc3.jpg,/kGzFbGhp99zva6oZODW5atUtnqi.jpg,False,1,
3,424,Schindler's List,en,The true story of how businessman Oskar Schind...,88.182,15897,8.567,1993-12-15,"[18, 36, 10752]",/sF1U4EUQS8YHUYjNl3pMGNIQyr0.jpg,/zb6fM1CX41D9rF9hdgclu0peUmy.jpg,False,1,
4,389,12 Angry Men,en,The defense and the prosecution have rested an...,48.620,8693,8.547,1957-04-10,[18],/ow3wq89wM8qd5X7hWKxiRfsFf9C.jpg,/qqHQsStV6exghCM7zbObuYBiYxw.jpg,False,1,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9785,12142,Alone in the Dark,en,Edward Carnby is a private investigator specia...,13.258,606,3.246,2005-01-28,"[28, 14, 27]",/bSxrbVCyWW077zhtpuYlo3zgyug.jpg,/lcLyZzhB1ctfdH0hGBsTFrbflqP.jpg,False,490,
9786,13805,Disaster Movie,en,"Over the course of one evening, an unsuspectin...",23.797,1024,3.200,2008-08-29,[35],/3J8XKUfhJiNzwobUZVtizXYPe8b.jpg,/5V6jAFS0Q49SI07qjyFRMYlbfR9.jpg,False,490,
9787,11059,House of the Dead,en,"Set on an island off the coast, a techno rave ...",11.998,386,3.100,2003-04-11,"[27, 28, 53]",/z2mDGbV4pLtsvSMNnmnSgoVZSWK.jpg,/aNUEHLNsNMprLZt6fjf5nqDq6er.jpg,False,490,
9788,14164,Dragonball Evolution,en,"On his 18th birthday, Goku receives a mystical...",16.741,2037,2.891,2009-03-12,"[28, 12, 14, 878, 53]",/sunS9xhPnFNP5wlOWrvbpBteAB.jpg,/oHrrgAPEKpz0S1ofQntiZNrmGrM.jpg,False,490,


### Clean the actors data and reindex the rows

In [30]:
cast_clean_df = cast_df.drop_duplicates(subset = "id", keep = "first")

# Reindex the rows and discard the old index
cast_clean_df = cast_clean_df.reset_index(drop=True)

In [31]:
cast_clean_df = cast_clean_df[[
    "id",
    "original_name",
    "popularity",
    "gender",
    "adult",
    "profile_path",
    "feature_id"
]]

# Reassign instead of renaming in place, since the DataFrame is a column selection
cast_clean_df = cast_clean_df.rename(columns={
    "id": "actor_id",
    "popularity": "actor_popularity",
    "profile_path": "profile_image_path"
})

In [40]:
cast_clean_df

Unnamed: 0,actor_id,original_name,actor_popularity,gender,adult,profile_image_path
0,504,Tim Robbins,23.975,2,False,/djLVFETFTvPyVUdrd7aLVykobof.jpg
1,192,Morgan Freeman,55.691,2,False,/jPsLqiYGSofU4s6BjrxnefMfabb.jpg
2,4029,Bob Gunton,17.977,2,False,/ulbVvuBToBN3aCGcV028hwO0MOP.jpg
3,6573,William Sadler,16.873,2,False,/rWeb2kjYCA7V9MC9kRwRpm57YoY.jpg
4,6574,Clancy Brown,33.025,2,False,/1JeBRNG7VS7r64V9lOvej9bZXW5.jpg
...,...,...,...,...,...,...
188417,4350238,Jaime Soria,0.001,0,False,
188418,4350239,Jerry Madison,0.001,0,False,
188419,4350243,George Canahauti,0.001,0,False,
188420,2360306,Daniel Mai,0.001,0,False,


## Saving the scraped feature data to a .csv file

In [None]:
SCRAPED_FEATURES_PATH = os.path.join(os.path.abspath(""), "TMDB_scraped_features.csv")

In [None]:
SCRAPED_FEATURES_PATH

In [None]:
features_clean_df.to_csv(SCRAPED_FEATURES_PATH, index = False)

## Saving the scraped actors data to a .csv file

In [None]:
SCRAPED_ACTORS_PATH = os.path.join(os.path.abspath(""), "TMDB_scraped_actors.csv")

In [None]:
SCRAPED_ACTORS_PATH

In [None]:
cast_clean_df.to_csv(SCRAPED_ACTORS_PATH, index = False)

# Scraping Wikipedia via TMDB

In [None]:
response_wiki = requests.get("https://en.wikipedia.org/w/api.php?action=query&prop=pageprops&titles=The_Shawshank_Redemption&format=json")

page = list((response_wiki.json()["query"]["pages"]).keys())[0]

response_wiki.json()["query"]["pages"][page]["pageprops"]["wikibase_item"]

movie_id = 278

params = {
    "language": "en-US",
}

# Endpoint listing a feature's external IDs
BASE_URL_EXTERNAL_IDS = BASE_URL + f"{movie_id}" + "/external_ids"

response = requests.get(
    BASE_URL_EXTERNAL_IDS,
    headers=HEADERS,
    params=params
)

response_json = response.json()

response_json

# Downloading feature posters from TMDB 

### Create folder for images

To automate the process of fetching and storing the feature images from TMDB, a folder _"images"_ is created if it doesn't already exist.

In [None]:
IMAGE_FOLDER_PATH = os.path.join(os.path.abspath(""), "images")

In [None]:
if os.path.exists(IMAGE_FOLDER_PATH):
    print(f"Found folder:\n{IMAGE_FOLDER_PATH}")
else:
    os.makedirs(IMAGE_FOLDER_PATH)
    print(f"Created folder:\n{IMAGE_FOLDER_PATH}")
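
The check-then-create pattern can also be written as a single call, since `os.makedirs` accepts `exist_ok=True` and then does nothing if the folder is already there. A short sketch using a temporary parent directory so that it can be run anywhere (the folder name `images` matches the notebook; the temporary parent is only for illustration):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as parent:
    image_folder = os.path.join(parent, "images")

    # The first call creates the folder, the second call is a no-op
    os.makedirs(image_folder, exist_ok=True)
    os.makedirs(image_folder, exist_ok=True)

    created = os.path.isdir(image_folder)

print(created)
```

The explicit `if`/`else` in the notebook remains useful when the "Found" and "Created" messages are wanted.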

### Iterate over the DataFrame to request the images
 
With the DataFrame complete, the _"poster\_path"_ column contains the endpoint for each feature's poster image.

By appending these endpoints to the image base URL, the corresponding .jpg files can be retrieved and stored locally.
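
A minimal sketch of how the image URL and the local file name are put together, assuming hypothetical helpers `poster_url` and `poster_filename`. The base URL and the `<feature_id>_poster.jpg` naming follow the notebook. The check for a missing path is an added assumption and only covers `None` and empty strings, not pandas `NaN` values:

```python
import os
from typing import Optional

BASE_URL_IMAGE = "https://image.tmdb.org/t/p/original"


def poster_url(poster_path: Optional[str]) -> Optional[str]:
    """Join the image base URL and a poster path; None if no path is given."""
    if not poster_path:
        return None
    return BASE_URL_IMAGE + poster_path


def poster_filename(folder_path: str, feature_id: int) -> str:
    """Local file path for a feature's poster, matching save_feature_image."""
    return os.path.join(folder_path, f"{feature_id}_poster.jpg")


print(poster_url("/9cqNxx0GxF0bflZmeSMuL5tnGzr.jpg"))
print(poster_filename("images", 278))
```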

In [None]:
def save_feature_image(img, folder_path, feature_id):
    path = os.path.join(folder_path, f"{feature_id}_poster.jpg")
    with open(path, "wb") as f:
        f.write(img.content)

In [None]:
BASE_URL_IMAGE = "https://image.tmdb.org/t/p/original"
HEADERS_IMG = {
    # "image/jpeg" is the registered media type for .jpg files
    "accept": "image/jpeg",
    "Authorization": f"Bearer {AUTH_TOKEN}",
}

In [None]:
# Simply for showing it works.
# Should be updated later!
sample_df = features_clean_df.iloc[0:5]
sample_df

In [None]:
# Create the tqdm progress bar
progress_bar = tqdm(sample_df.iterrows(), total=len(sample_df), desc="Saving posters")

for idx, row in progress_bar:

    # Get the feature ID and the poster endpoint
    feature_id = row["feature_id"]
    feature_poster_path = row["poster_path"]

    # Update progress bar with current id
    progress_bar.set_postfix(current_id=feature_id)

    # Send HTTPS GET request to retrieve the image and then save it to folder
    img = requests.get(BASE_URL_IMAGE + feature_poster_path, headers = HEADERS_IMG)
    save_feature_image(img, IMAGE_FOLDER_PATH, feature_id)