# Project: Analyzing movies in 2023 on The Movie Database (TMDb)

- Name: Lê Đức Cường
- Student code: 21120213
- Data source: https://www.themoviedb.org/

**The Movie Database (TMDb)** is a collaborative film database. The project was founded in 2008 by Travis Bell to collect movie posters. The initial database was a donation from the free project Open Media Database (OMDb). The database now holds more than 913,000 movies (including adult content). In this project, I analyze data about movies released in 2023.

## I. COLLECTING DATA

### 1. Import Packages

In [1]:
import pandas as pd
import numpy as np
import time
import requests
import json
from bs4 import BeautifulSoup

### 2. Check data size (using API key)
I registered for a TMDb API key and use the 'API Read Access Token' to request data about movies released in 2023 (primary_release_year=2023), excluding adult content (include_adult=false).

In [2]:
#Request the data using the API Read Access Token
url = "https://api.themoviedb.org/3/discover/movie?primary_release_year=2023&include_adult=false"
headers = {
    "accept": "application/json",
    "Authorization": "Bearer eyJhbGciOiJIUzI1NiJ9.eyJhdWQiOiJjYzY0NTA3YmYzNTA4ZWRmYmM1NGUwNTllNDQ3YWM4ZCIsInN1YiI6IjY1OTNmOTVmMDY5ZjBlNDY0YzIxMWUxOSIsInNjb3BlcyI6WyJhcGlfcmVhZCJdLCJ2ZXJzaW9uIjoxfQ.tSMF0nkjf5qQLZ3Wn1zehrdJ9-BAA6m6mHkhtezvhs4"
}
response = requests.get(url, headers = headers)

#How many results are there for primary_release_year=2023?
j = response.json()
totalPages = j['total_pages']
totalResults = j['total_results']
print(totalPages, 'pages')
print(totalResults, 'results')

1752 pages
35031 results
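
The request above writes the token and the filters directly into the notebook. A minimal sketch of an alternative, assuming the token is kept in an environment variable and the filters are passed through the `params` argument of `requests.get`. The variable name `TMDB_READ_TOKEN` and the helper `build_discover_request` are hypothetical and not part of the original notebook.

```python
import os

DISCOVER_URL = "https://api.themoviedb.org/3/discover/movie"

def build_discover_request(token, page=1, min_votes=None, include_adult=False):
    """Return (url, headers, params) for a discover request (hypothetical helper)."""
    headers = {
        "accept": "application/json",
        "Authorization": f"Bearer {token}",
    }
    params = {
        "primary_release_year": 2023,
        # The notebook's URLs use lowercase 'true'/'false' for this flag.
        "include_adult": str(include_adult).lower(),
        "page": page,
    }
    if min_votes is not None:
        params["vote_count.gte"] = min_votes
    return DISCOVER_URL, headers, params

# Hypothetical variable name; a placeholder is used if it is not set.
token = os.environ.get("TMDB_READ_TOKEN", "<your-token>")
url, headers, params = build_discover_request(token, min_votes=10)

# The actual call would then be:
# response = requests.get(url, headers=headers, params=params)
print(params)
```

Keeping the token out of the notebook means the file can be shared without exposing the credential, and the `params` dictionary makes it easier to see which filters each request uses.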


There are **1752** pages and **35031** results in total, which is a very large amount of data. I therefore reduce it by adding the filter 'vote_count.gte=10', which keeps only movies with at least 10 votes. Note that the request below also sets include_adult=true. (Last API request: January 5th, 2024.)

In [3]:
#Request the data using the API Read Access Token
url = "https://api.themoviedb.org/3/discover/movie?primary_release_year=2023&include_adult=true&vote_count.gte=10"
headers = {
    "accept": "application/json",
    "Authorization": "Bearer eyJhbGciOiJIUzI1NiJ9.eyJhdWQiOiJjYzY0NTA3YmYzNTA4ZWRmYmM1NGUwNTllNDQ3YWM4ZCIsInN1YiI6IjY1OTNmOTVmMDY5ZjBlNDY0YzIxMWUxOSIsInNjb3BlcyI6WyJhcGlfcmVhZCJdLCJ2ZXJzaW9uIjoxfQ.tSMF0nkjf5qQLZ3Wn1zehrdJ9-BAA6m6mHkhtezvhs4"
}
response = requests.get(url, headers = headers)

#How many results are there for primary_release_year=2023 and vote_count.gte=10 (vote count greater than or equal to 10)?
j = response.json()
totalPages = j['total_pages']
totalResults = j['total_results']
print(totalPages, 'pages')
print(totalResults, 'results')

70 pages
1400 results


With **70 pages** and **1400 results**, the data is a suitable size to collect and analyze.
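
Both printed page counts are consistent with 20 results per page. That page size is an assumption inferred from the two outputs above, not something the notebook states. A short check of the arithmetic:

```python
import math

RESULTS_PER_PAGE = 20  # assumption inferred from the printed outputs

def expected_pages(total_results, per_page=RESULTS_PER_PAGE):
    """Number of pages needed to hold total_results items."""
    return math.ceil(total_results / per_page)

print(expected_pages(35031))  # 1752
print(expected_pages(1400))   # 70
```

Under that assumption, looping over 70 pages should retrieve all 1400 filtered movies.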

### 3. Create a function to collect data

In [4]:
#Create an empty list 'urls_list' to hold all URLs
urls_list = []
   
#Define 'base_url'
base_url = "https://api.themoviedb.org/3/discover/movie?primary_release_year=2023&include_adult=true&vote_count.gte=10"

In [5]:
#Define lists based on the fields in each record of the raw data
GENRE_IDS = []
ID = []
ORIGINAL_LANGUAGE = []
ORIGINAL_TITLE = []
OVERVIEW = []
POPULARITY = []
RELEASE_DATE = []
TITLE = []
VOTE_AVERAGE = []
VOTE_COUNT = []

In [6]:
#Create a function to collect data from every results page
def collect_data():
    for index in range(1, 71):
        #Create url and get raw data
        url = base_url + '&page=' + str(index)
        headers = {
            "accept": "application/json",
            "Authorization": "Bearer eyJhbGciOiJIUzI1NiJ9.eyJhdWQiOiJjYzY0NTA3YmYzNTA4ZWRmYmM1NGUwNTllNDQ3YWM4ZCIsInN1YiI6IjY1OTNmOTVmMDY5ZjBlNDY0YzIxMWUxOSIsInNjb3BlcyI6WyJhcGlfcmVhZCJdLCJ2ZXJzaW9uIjoxfQ.tSMF0nkjf5qQLZ3Wn1zehrdJ9-BAA6m6mHkhtezvhs4"
        }
        response = requests.get(url, headers = headers)
        raw_data = response.json()['results']
        
        #Process each movie record in the raw data
        for a_movie_data in raw_data:
            GENRE_IDS.append(a_movie_data['genre_ids'])
            ID.append(a_movie_data['id'])
            ORIGINAL_LANGUAGE.append(a_movie_data['original_language'])
            ORIGINAL_TITLE.append(a_movie_data['original_title'])
            OVERVIEW.append(a_movie_data['overview'])
            POPULARITY.append(a_movie_data['popularity'])
            RELEASE_DATE.append(a_movie_data['release_date'])
            TITLE.append(a_movie_data['title'])
            VOTE_AVERAGE.append(a_movie_data['vote_average'])
            VOTE_COUNT.append(a_movie_data['vote_count'])
   
    #Combine all data into a dataframe
    data = pd.DataFrame({"Title": TITLE,
                         "Original title": ORIGINAL_TITLE,
                         "ID": ID,
                         "Language": ORIGINAL_LANGUAGE,
                         "Details": OVERVIEW,
                         "Genre IDs": GENRE_IDS,                         
                         "Popularity": POPULARITY,
                         "Release date": RELEASE_DATE,
                         "Review score": VOTE_AVERAGE,
                         "Number of reviews": VOTE_COUNT})
    return data
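
Because `collect_data` appends to lists defined outside the function, running the cell a second time would add every movie again. A minimal sketch of one way to avoid that, assuming the page request is wrapped in a callable: `collect_rows`, `fetch_page`, and `fake_fetch_page` are hypothetical names used only for illustration.

```python
# Maps TMDb field names to the column names used in this notebook.
FIELD_TO_COLUMN = {
    "title": "Title",
    "original_title": "Original title",
    "id": "ID",
    "original_language": "Language",
    "overview": "Details",
    "genre_ids": "Genre IDs",
    "popularity": "Popularity",
    "release_date": "Release date",
    "vote_average": "Review score",
    "vote_count": "Number of reviews",
}

def collect_rows(fetch_page, total_pages):
    """Collect one dict per movie from pages 1..total_pages.

    fetch_page is a hypothetical callable returning the 'results' list
    of one page; in the notebook it would wrap requests.get.
    """
    rows = []  # local, so repeated calls do not accumulate duplicates
    for page in range(1, total_pages + 1):
        for movie in fetch_page(page):
            # .get() yields None for a missing field instead of raising KeyError
            rows.append({col: movie.get(field)
                         for field, col in FIELD_TO_COLUMN.items()})
    return rows

# Stand-in for the API so the sketch runs offline.
def fake_fetch_page(page):
    return [{"id": page, "title": f"Movie {page}", "vote_count": 10}]

rows = collect_rows(fake_fetch_page, total_pages=3)
print(len(rows))         # 3
print(rows[0]["Title"])  # Movie 1
```

Passing the resulting list to `pd.DataFrame(rows)` would give a table with the same ten columns as the function above.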

After setting up the *collect_data* function, I collect the data and save it.

In [7]:
#Collect data and check this data:
data_movies = collect_data()
print(data_movies.shape)

(1400, 10)


data_movies has *1400 rows*, which matches the **1400 results** above.
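
A matching row count does not by itself show that every movie appears exactly once. If the ordering of results shifted between page requests, a movie could be repeated while another is skipped; this is a possibility to rule out, not something observed here. A small sketch of such a check, using toy IDs and a hypothetical helper `check_collection`:

```python
from collections import Counter

def check_collection(ids, expected_total):
    """Return (row count matches, list of IDs seen more than once)."""
    duplicates = [movie_id for movie_id, n in Counter(ids).items() if n > 1]
    return len(ids) == expected_total, duplicates

# Toy IDs for illustration only; in the notebook the inputs would be
# the ID list and totalResults.
ok, dupes = check_collection([1029575, 848326, 695721, 848326], expected_total=4)
print(ok, dupes)  # True [848326]
```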

In [8]:
#Create the movies_data_raw.csv file
data_movies.to_csv('../data/raw_data/movies_data_raw.csv', index = False)

Now, recheck the file **'movies_data_raw.csv'**:

In [9]:
df = pd.read_csv("../data/raw_data/movies_data_raw.csv")
df

|  | Title | Original title | ID | Language | Details | Genre IDs | Popularity | Release date | Review score | Number of reviews |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | The Family Plan | The Family Plan | 1029575 | en | Dan Morgan is many things: a devoted husband, ... | [28, 35] | 3443.376 | 2023-12-14 | 7.4 | 577 |
| 1 | Rebel Moon - Part One: A Child of Fire | Rebel Moon - Part One: A Child of Fire | 848326 | en | When a peaceful colony on the edge of the gala... | [878] | 2288.636 | 2023-12-15 | 6.5 | 1026 |
| 2 | The Hunger Games: The Ballad of Songbirds & Sn... | The Hunger Games: The Ballad of Songbirds & Sn... | 695721 | en | 64 years before he becomes the tyrannical pres... | [18, 878, 28] | 2182.886 | 2023-11-15 | 7.2 | 1323 |
| 3 | Silent Night | Silent Night | 891699 | en | A tormented father witnesses his young son die... | [28, 80] | 1441.196 | 2023-11-30 | 5.9 | 234 |
| 4 | Aquaman and the Lost Kingdom | Aquaman and the Lost Kingdom | 572802 | en | Black Manta, still driven by the need to aveng... | [28, 12, 14] | 1283.474 | 2023-12-20 | 6.5 | 379 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1395 | Write Me A Letter When You Return Home | Write Me A Letter When You Return Home | 1113693 | en | 75-year-old Enola Niaga finds comfort in writi... | [18] | 1.400 | 2023-05-12 | 6.3 | 16 |
| 1396 | Gli attassati | Gli attassati | 1168807 | it |  | [35] | 0.622 | 2023-08-31 | 5.0 | 11 |
| 1397 | Return | Regreso | 1140754 | es | Gerardo returns home with a pack of dogs barki... | [18] | 0.609 | 2023-07-30 | 4.9 | 13 |
| 1398 | Hombres hay muchos | Hombres hay muchos | 1168735 | es |  | [35, 10749] | 0.600 | 2023-08-01 | 7.0 | 12 |
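
One point to keep in mind when the file is read back: a CSV stores every value as text, so the `Genre IDs` lists come back as strings such as `"[28, 35]"` rather than Python lists. A minimal sketch of converting them with `ast.literal_eval`, using the standard `csv` module and a two-row sample in the same layout (a subset of the columns, for illustration only):

```python
import ast
import csv
import io

# Two rows in the same layout as movies_data_raw.csv (subset of columns).
sample_csv = io.StringIO(
    'Title,ID,Genre IDs\n'
    'The Family Plan,1029575,"[28, 35]"\n'
    'Gli attassati,1168807,[35]\n'
)

rows = list(csv.DictReader(sample_csv))
print(type(rows[0]["Genre IDs"]).__name__)  # str

# literal_eval safely parses the text back into a list of integers.
for row in rows:
    row["Genre IDs"] = ast.literal_eval(row["Genre IDs"])

print(rows[0]["Genre IDs"])  # [28, 35]
```

With pandas, the same conversion can be requested at load time through the `converters` argument, for example `pd.read_csv(path, converters={"Genre IDs": ast.literal_eval})`.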


**The data collection is now complete.**