# Data collection

This notebook details the complete process used to construct our dataset. Note that if the kernel is restarted and the code is rerun, the results may differ, because the pages returned by The Movie Database website are generated randomly on each run.

### The Movie Database

We first build the movie database used throughout this project by retrieving data through the TMDB API. To obtain an API token, create an account on https://www.themoviedb.org/ and submit the request form. The token can then be used freely, subject to a rate limit of 50 requests per second.

Now we will detail the steps that we will follow. Firstly, we will import the necessary packages:

In [1]:
import os
import pandas as pd
from tmdbv3api import TMDb, Movie
from concurrent.futures import ThreadPoolExecutor, as_completed
from collections import deque
import time

Here we use the token together with tmdbv3api (https://pypi.org/project/tmdbv3api/), a Python library that simplifies the retrieval process, to create our dataset.

Our code also uses a function that limits the number of requests per second and a ThreadPoolExecutor that speeds up the process by issuing requests concurrently.

The variables we keep are the title, genres, popularity score (computed by the website, see https://developer.themoviedb.org/docs/popularity-and-trending), release date, budget, revenue, vote count, vote average, status (released, in production, ...), runtime (duration of the movie) and IMDb ID (needed later to merge several databases).

<font color='red'>WARNING WARNING WARNING</font>

We advise importing the CSV file that we provide rather than running the retrieval process, because the movies retrieved may differ from ours: pages are generated randomly and movies are not tied to a specific page (for example, The Grinch may be on the 1st page or just as well on the 3000th). You can still execute the code, but if you use your newly created database, the subsequent results may differ.

<font color='red'>WARNING WARNING WARNING</font>
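The advice above can be sketched as a small "load if present, otherwise rebuild" helper. This is only an illustration: `load_or_build` and its `load` / `build` parameters are hypothetical names, not part of the notebook, and the demonstration uses stand-in callables so that it needs neither pandas nor network access.

```python
import tempfile
from pathlib import Path


def load_or_build(csv_path, load, build):
    """Reuse the saved dataset when it exists; otherwise rebuild it.

    `load` and `build` are hypothetical callables supplied by the caller.
    Returns the dataset together with a tag saying where it came from.
    """
    path = Path(csv_path)
    if path.is_file():
        return load(path), "loaded"
    return build(), "built"


# Tiny demonstration with stand-in callables (no network access, no pandas).
with tempfile.TemporaryDirectory() as tmp:
    csv_file = Path(tmp) / "movieDB.csv"

    # First call: no file yet, so the (fake) retrieval runs.
    first_call = load_or_build(csv_file, Path.read_text, lambda: "fresh")

    # Second call: the file now exists, so it is read instead.
    csv_file.write_text("title\nThe Grinch\n")
    second_call = load_or_build(csv_file, Path.read_text, lambda: "fresh")
```

In this notebook, `load` would presumably be `pd.read_csv` and `build` a function wrapping the TMDB retrieval.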

API key: <font color='green'>99da5e8841e2a2a68423045546594d96</font>

In [3]:
tmdb = TMDb()
tmdb.api_key = '99da5e8841e2a2a68423045546594d96'
movie = Movie()

import threading

# Queue of recent request timestamps, shared by all worker threads
request_times = deque()
# Lock protecting request_times against concurrent access
request_lock = threading.Lock()

# Rate limiting function
def rate_limited_request(func, *args, max_requests_per_second=50, **kwargs):
    """
    Call func(*args, **kwargs) once a slot is free under the rate limit.
    """
    while True:
        with request_lock:
            now = time.time()
            # Remove timestamps older than 1 second
            while request_times and (now - request_times[0]) > 1:
                request_times.popleft()

            if len(request_times) < max_requests_per_second:
                # Reserve the slot before releasing the lock, so that
                # concurrent threads cannot exceed the limit together
                request_times.append(now)
                break
            sleep_time = 1 - (now - request_times[0])

        # At the limit: sleep outside the lock, then check again
        time.sleep(max(0, sleep_time))

    # Make the request
    return func(*args, **kwargs)

# Fetch movie details
def fetch_page(page_num):
    """
    Fetch a page of popular movies.
    """
    try:
        return rate_limited_request(movie.popular, page=page_num)
    except Exception as e:
        print(f"Error fetching page {page_num}: {e}")
        return []

def fetch_details(m_id):
    """
    Fetch details for a specific movie ID.
    """
    try:
        return rate_limited_request(movie.details, m_id)
    except Exception as e:
        print(f"Error fetching details for movie {m_id}: {e}")
        return None

# Step 1: Fetch IDs from all pages
page_numbers = range(1, 300)  # Pages 1-299 (the end of range is exclusive); max page = 500 (maybe)
all_movies_data = []

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(fetch_page, p) for p in page_numbers]
    for f in as_completed(futures):
        try:
            page_data = f.result()
            if page_data:
                all_movies_data.extend(page_data)
        except Exception as e:
            print(f"Error fetching page data: {e}")

# Extract movie IDs
movie_ids = [m['id'] for m in all_movies_data if 'id' in m]

# Step 2: Fetch details for each movie
titles, genres, popularity_scores, release_dates, budgets, revenues, vote_counts, vote_averages, status,runtime, imdb_ids = [], [], [], [], [], [], [], [], [], [], []

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(fetch_details, m_id) for m_id in movie_ids]
    for f in as_completed(futures):
        try:
            details = f.result()
            if details:
                # Extract other movie details
                genres_list = [g['name'] for g in details.genres] if 'genres' in details and details.genres else []
                titles.append(details.title)
                genres.append(genres_list)
                popularity_scores.append(details.popularity)
                release_dates.append(details.release_date)
                budgets.append(details.budget)
                revenues.append(details.revenue)
                vote_counts.append(details.vote_count)
                vote_averages.append(details.vote_average)
                status.append(details.status)
                runtime.append(details.runtime)
                imdb_ids.append(details.imdb_id if 'imdb_id' in details else None)
        except Exception as e:
            print(f"Error processing movie details: {e}")

# Construct the DataFrame
mv = pd.DataFrame({
    'title': titles,
    'genres': genres,
    'popularity': popularity_scores,
    'release_date': release_dates,
    'budget': budgets,
    'revenue': revenues,
    'vote_count': vote_counts,
    'vote_average': vote_averages,
    'status': status,
    'runtime': runtime,
    'imdb_id': imdb_ids,
})

mv.head(10000)


Unnamed: 0,title,genres,popularity,release_date,budget,revenue,vote_count,vote_average,status,runtime,imdb_id
0,Sikandar Ka Muqaddar,"[Thriller, Crime, Mystery, Action]",375.316,2024-11-29,0,0,12,6.100,Released,143,tt31522415
1,Japanese Mom,"[Romance, Drama]",353.716,2017-02-09,0,0,30,6.200,Released,91,
2,Spellbound,"[Animation, Fantasy, Family, Adventure, Comedy]",326.983,2024-11-22,0,0,159,6.900,Released,111,tt7215232
3,The Grinch,"[Animation, Family, Comedy, Fantasy]",360.162,2018-11-08,75000000,508600000,3984,6.900,Released,85,tt2709692
4,We Live in Time,"[Romance, Drama]",370.593,2024-10-10,20000000,31818409,227,7.597,Released,108,tt27131358
...,...,...,...,...,...,...,...,...,...,...,...
5975,Rogue,"[Action, Horror, Thriller]",25.390,2007-11-08,20000000,4600000,684,6.400,Released,99,tt0479528
5976,"The Adventures of Priscilla, Queen of the Desert","[Drama, Comedy]",24.388,1994-05-31,2000000,29700000,795,7.281,Released,103,tt0109045
5977,Detective Conan: Sunflowers of Inferno,"[Animation, Crime, Adventure, Mystery]",17.892,2015-04-18,0,52920296,115,6.100,Released,113,tt3737650
5978,Breakdown,"[Crime, Mystery, Thriller]",28.138,1997-05-02,36000000,50159144,923,6.900,Released,93,tt0118771
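Because movies are not tied to a fixed page, it is plausible (an assumption, not something verified here) that the same movie shows up on two different pages during one run. A minimal sketch of removing duplicate IDs while preserving their first-seen order follows; `unique_ids` is a hypothetical helper and the sample data is made up.

```python
def unique_ids(movies):
    """Return movie IDs in first-seen order, skipping entries without an 'id'."""
    ids = (m["id"] for m in movies if "id" in m)
    # dict.fromkeys keeps insertion order and drops repeated keys
    return list(dict.fromkeys(ids))


# Made-up sample: ID 3 appears twice and one entry has no ID at all.
sample = [{"id": 3}, {"id": 7}, {"title": "no id"}, {"id": 3}]
deduped = unique_ids(sample)
```

Applied to the retrieval code, this would replace the plain list comprehension that builds `movie_ids`.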


Saving the pandas DataFrame to a CSV file to avoid repeating the retrieval

After creating our dataset with the API, we save it to a CSV file so that the long process above does not have to be repeated. (We scraped 299 pages, since `range(1, 300)` excludes 300, which yields 5980 entries.)
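To see why saving the result is worthwhile, here is a rough lower bound on the retrieval time. It assumes one request per page plus one request per movie, 20 movies per page (inferred from 5980 / 299, since `range(1, 300)` covers pages 1 to 299) and the 50 requests per second limit. `min_collection_seconds` is a hypothetical helper; real runs are slower because of network latency.

```python
def min_collection_seconds(n_pages, per_page, max_requests_per_second):
    """Return (total requests, minimum seconds) under a pure rate-limit bound."""
    # One request per page, then one details request per movie on each page
    total_requests = n_pages + n_pages * per_page
    return total_requests, total_requests / max_requests_per_second


total, seconds = min_collection_seconds(n_pages=299, per_page=20,
                                        max_requests_per_second=50)
print(f"{total} requests, at least {seconds:.1f} s")
```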

In [5]:
# Specify the directory and filename for saving the CSV
output_folder = '/Users/joni/Documents/Master 1/Data management/data'
output_filename = 'movieDB.csv'

# Ensure the directory exists (create if it doesn't)
os.makedirs(output_folder, exist_ok=True)

# Construct the full path
full_path = os.path.join(output_folder, output_filename)

# Save the DataFrame as CSV
mv.to_csv(full_path, index=False)

print(f"CSV file saved to {full_path}")

CSV file saved to /Users/joni/Documents/Master 1/Data management/data/movieDB.csv


We now reimport the CSV file.

In [7]:
mv = pd.read_csv('/Users/joni/Documents/Master 1/Data management/data/movieDB.csv')

In [9]:
mv.head(10000)

Unnamed: 0,title,genres,popularity,release_date,budget,revenue,vote_count,vote_average,status,runtime,imdb_id
0,Sikandar Ka Muqaddar,"['Thriller', 'Crime', 'Mystery', 'Action']",375.316,2024-11-29,0,0,12,6.100,Released,143,tt31522415
1,Japanese Mom,"['Romance', 'Drama']",353.716,2017-02-09,0,0,30,6.200,Released,91,
2,Spellbound,"['Animation', 'Fantasy', 'Family', 'Adventure'...",326.983,2024-11-22,0,0,159,6.900,Released,111,tt7215232
3,The Grinch,"['Animation', 'Family', 'Comedy', 'Fantasy']",360.162,2018-11-08,75000000,508600000,3984,6.900,Released,85,tt2709692
4,We Live in Time,"['Romance', 'Drama']",370.593,2024-10-10,20000000,31818409,227,7.597,Released,108,tt27131358
...,...,...,...,...,...,...,...,...,...,...,...
5975,Rogue,"['Action', 'Horror', 'Thriller']",25.390,2007-11-08,20000000,4600000,684,6.400,Released,99,tt0479528
5976,"The Adventures of Priscilla, Queen of the Desert","['Drama', 'Comedy']",24.388,1994-05-31,2000000,29700000,795,7.281,Released,103,tt0109045
5977,Detective Conan: Sunflowers of Inferno,"['Animation', 'Crime', 'Adventure', 'Mystery']",17.892,2015-04-18,0,52920296,115,6.100,Released,113,tt3737650
5978,Breakdown,"['Crime', 'Mystery', 'Thriller']",28.138,1997-05-02,36000000,50159144,923,6.900,Released,93,tt0118771


In [11]:
mv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5980 entries, 0 to 5979
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   title         5980 non-null   object 
 1   genres        5980 non-null   object 
 2   popularity    5980 non-null   float64
 3   release_date  5971 non-null   object 
 4   budget        5980 non-null   int64  
 5   revenue       5980 non-null   int64  
 6   vote_count    5980 non-null   int64  
 7   vote_average  5980 non-null   float64
 8   status        5980 non-null   object 
 9   runtime       5980 non-null   int64  
 10  imdb_id       5866 non-null   object 
dtypes: float64(2), int64(4), object(5)
memory usage: 514.0+ KB
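One side effect of the CSV round trip is visible in the output above: `genres` is now stored as text such as `"['Romance', 'Drama']"` rather than as a Python list. A minimal sketch of converting it back, assuming the column only contains list literals or missing values (`parse_genres` is a hypothetical helper):

```python
import ast


def parse_genres(value):
    """Turn "['Drama', 'Comedy']" back into ['Drama', 'Comedy']."""
    if isinstance(value, list):
        return value              # already a list, nothing to do
    if not isinstance(value, str):
        return []                 # e.g. NaN for a missing value
    try:
        # literal_eval only evaluates Python literals, unlike eval
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []
    return list(parsed) if isinstance(parsed, (list, tuple)) else []


restored = parse_genres("['Romance', 'Drama']")
```

With the DataFrame loaded, this could be applied as `mv['genres'] = mv['genres'].apply(parse_genres)`.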


This step belongs to data cleaning, but we handle it here: we need to merge the databases and the second API allows only a limited number of requests, so we remove from the database the movies that we will not use. This reduces the number of requests.

In [108]:
# Drop rows with NaN values in critical columns
mv = mv.dropna(subset=['release_date', 'budget', 'revenue','imdb_id'])

# Keep only rows with strictly positive budget, revenue and runtime
mv = mv[(mv['budget'] > 0) & (mv['revenue'] > 0) & (mv['runtime'] > 0)]

# Check the dataset after cleaning
print("Dataset after removing missing or zero values:")
mv.head(10000)


Dataset after removing missing or zero values:


Unnamed: 0,title,genres,popularity,release_date,budget,revenue,vote_count,vote_average,status,runtime,imdb_id,Metascore,imdbRating
3,The Grinch,"['Animation', 'Family', 'Comedy', 'Fantasy']",360.162,2018-11-08,75000000,508600000,3984,6.900,Released,85,tt2709692,51.0,6.4
4,We Live in Time,"['Romance', 'Drama']",370.593,2024-10-10,20000000,31818409,227,7.597,Released,108,tt27131358,59.0,7.0
5,Azrael,"['Action', 'Horror', 'Thriller']",329.154,2024-09-27,12000000,631272,148,5.980,Released,86,tt22173666,52.0,5.4
8,Home Alone 2: Lost in New York,"['Comedy', 'Family', 'Adventure']",361.064,1992-11-15,18000000,358994850,9631,6.750,Released,120,tt0104431,46.0,6.9
11,Anora,"['Romance', 'Comedy', 'Drama']",437.105,2024-10-14,6000000,25467529,404,7.300,Released,139,tt28607951,91.0,8.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5973,The Player,"['Mystery', 'Drama', 'Thriller', 'Comedy']",27.139,1992-04-03,8000000,21706101,811,7.200,Released,124,tt0105151,86.0,7.5
5975,Rogue,"['Action', 'Horror', 'Thriller']",25.390,2007-11-08,20000000,4600000,684,6.400,Released,99,tt0479528,,6.2
5976,"The Adventures of Priscilla, Queen of the Desert","['Drama', 'Comedy']",24.388,1994-05-31,2000000,29700000,795,7.281,Released,103,tt0109045,70.0,7.5
5978,Breakdown,"['Crime', 'Mystery', 'Thriller']",28.138,1997-05-02,36000000,50159144,923,6.900,Released,93,tt0118771,73.0,7.0


### Watchmode (failed due to api request limit)

We wanted to merge the Watchmode database with our current database, but we faced two problems. First, the free tier is limited to 1000 requests per month (the first paid subscription starts at 249 USD, see https://api.watchmode.com/#pricing). Second, the variable "accessibility" (the platforms on which a movie is available) returns too much information to fit in the table.

Given those problems, we gave up on using this database.
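The impact of the monthly quota can be made concrete with a short calculation. It assumes one request per movie and uses the 5980 entries of the unfiltered dataset; `periods_needed` is a hypothetical helper.

```python
import math


def periods_needed(n_titles, quota_per_period):
    """Number of quota periods required when each title costs one request."""
    return math.ceil(n_titles / quota_per_period)


# 5980 movies against a free quota of 1000 requests per month
months = periods_needed(5980, 1000)
print(f"About {months} months on the free tier")
```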

To verify whether the token is still valid, we request a movie from the API using its imdb_id: if the API returns a movie, the token works; if it returns an error, we need to fix the request.

- API key used for the Watchmode website ("https://api.watchmode.com/"): <font color='green'>Lh4qAAJGZtR3MAFzPkycxwsHjx5bVsgpUDESJdjt</font>

In [14]:
import urllib.request
import json

# Replace with your actual IMDb ID and API key
imdb_id = "tt2709692"  # Example IMDb ID
api_key = "Lh4qAAJGZtR3MAFzPkycxwsHjx5bVsgpUDESJdjt"  # Replace with your actual API key

# Construct the URL with the IMDb ID and API key
url = f"https://api.watchmode.com/v1/title/{imdb_id}/details/?apiKey={api_key}"

try:
    # Make the API request
    with urllib.request.urlopen(url) as response:
        data = json.loads(response.read().decode())  # Parse the JSON response
        print("API Response:")
        print(json.dumps(data, indent=4))  # Pretty-print the JSON response
except urllib.error.HTTPError as e:
    print(f"HTTPError: {e.code} - {e.reason}")
    if e.code == 400:
        print("Bad Request: Check the IMDb ID or API key.")
    elif e.code == 401:
        print("Unauthorized: Verify your API key.")
    elif e.code == 404:
        print("Not Found: The IMDb ID might not exist in Watchmode's database.")
except urllib.error.URLError as e:
    print(f"URLError: {e.reason}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")


HTTPError: 429 - Too Many Requests


In [None]:
# Load your movie CSV file
movies_csv_path = '/Users/joni/Documents/Master 1/Data management/data/movieDB.csv'
movie_data = pd.read_csv(movies_csv_path)

# API key
api_key = "Lh4qAAJGZtR3MAFzPkycxwsHjx5bVsgpUDESJdjt"  # Replace with your actual Watchmode API key

# Function to fetch user_rating and critic_score
def fetch_watchmode_ratings(imdb_id):
    if pd.isna(imdb_id):  # Skip rows without a valid IMDb ID
        return None, None
    url = f"https://api.watchmode.com/v1/title/{imdb_id}/details/?apiKey={api_key}"
    try:
        # Make the API request
        with urllib.request.urlopen(url) as response:
            data = json.loads(response.read().decode())  # Parse the JSON response
            # Extract user_rating and critic_score
            user_rating = data.get('user_rating', None)
            critic_score = data.get('critic_score', None)
            return user_rating, critic_score
    except urllib.error.HTTPError as e:
        print(f"HTTPError for IMDb ID {imdb_id}: {e.code} - {e.reason}")
    except urllib.error.URLError as e:
        print(f"URLError for IMDb ID {imdb_id}: {e.reason}")
    except Exception as e:
        print(f"Unexpected error for IMDb ID {imdb_id}: {e}")
    return None, None

# Create new columns for ratings
movie_data['user_rating'], movie_data['critic_score'] = zip(
    *movie_data['imdb_id'].apply(fetch_watchmode_ratings)
)

# Save the updated dataset to a new CSV file
updated_csv_path = '/Users/joni/Documents/Master 1/Data management/data/movieDB_with_ratings.csv'
movie_data.to_csv(updated_csv_path, index=False)

print(f"Updated dataset saved to {updated_csv_path}")


In [23]:
mv2 = pd.read_csv('/Users/joni/Documents/Master 1/Data management/data/movieDB_with_ratings.csv')

In [87]:
mv2.head(100)

Unnamed: 0,title,genres,popularity,release_date,budget,revenue,vote_count,vote_average,status,runtime,imdb_id
0,Sikandar Ka Muqaddar,"['Thriller', 'Crime', 'Mystery', 'Action']",375.316,2024-11-29,0,0,12,6.100,Released,143,tt31522415
1,Japanese Mom,"['Romance', 'Drama']",353.716,2017-02-09,0,0,30,6.200,Released,91,
2,Spellbound,"['Animation', 'Fantasy', 'Family', 'Adventure'...",326.983,2024-11-22,0,0,159,6.900,Released,111,tt7215232
3,The Grinch,"['Animation', 'Family', 'Comedy', 'Fantasy']",360.162,2018-11-08,75000000,508600000,3984,6.900,Released,85,tt2709692
4,We Live in Time,"['Romance', 'Drama']",370.593,2024-10-10,20000000,31818409,227,7.597,Released,108,tt27131358
...,...,...,...,...,...,...,...,...,...,...,...
95,Saving Bikini Bottom: The Sandy Cheeks Movie,"['Animation', 'Comedy', 'Adventure', 'Family']",138.635,2024-10-18,100000000,0,48,6.300,Released,87,tt23063732
96,Longlegs,"['Horror', 'Thriller', 'Crime']",124.268,2024-07-10,10000000,126388179,1500,6.600,Released,101,tt23468450
97,The Convert,"['Action', 'Drama']",123.494,2024-03-14,0,692018,99,6.200,Released,119,tt20113412
98,Toy Story,"['Animation', 'Adventure', 'Family', 'Comedy']",116.732,1995-10-30,30000000,394436586,18416,8.000,Released,81,tt0114709


### The Open Movie Database

We use the Open Movie Database (OMDb) to obtain additional movie ratings. The API key is requested on the website: https://www.omdbapi.com/

The key is obtained by registering an email address. We can either choose the free option (1000 free requests per day) or pay 1€ on Patreon to get more requests (200 000 requests). We chose the premium subscription, since we have more than a thousand movies.

API key: <font color='green'>6d96e983</font>

We use the urllib.request package to perform the API requests and the json package, since the data returned by the Open Movie Database is formatted as JSON.

In [49]:
import urllib.request
import json

We first make a test request to OMDb to check whether the API works. We use the API key and identify a movie from our dataset by its imdb_id (we verified beforehand on the website that it exists in their database; the site provides a test bar for searching movies by ID). In this request we retrieve the Metascore (out of 100) and the IMDb rating (out of 10):

In [94]:
# Test API request with OMDb
api_key = "809be6a6"  # Replace with your OMDb API key
imdb_id = "tt2709692"  # IMDb ID for testing

# Construct the URL
url = f"http://www.omdbapi.com/?apikey={api_key}&i={imdb_id}"

try:
    # Make the API request
    with urllib.request.urlopen(url) as response:
        data = json.loads(response.read().decode())  # Parse the JSON response
        
        # Extract only the required fields
        title = data.get("Title", "N/A")
        metascore = data.get("Metascore", "N/A")
        imdb_rating = data.get("imdbRating", "N/A")
        
        # Print the extracted data
        print("Extracted Data:")
        print(f"Title: {title}")
        print(f"Metascore: {metascore}")
        print(f"IMDb Rating: {imdb_rating}")
except urllib.error.HTTPError as e:
    print(f"HTTPError: {e.code} - {e.reason}")
except urllib.error.URLError as e:
    print(f"URLError: {e.reason}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")


Extracted Data:
Title: The Grinch
Metascore: 51
IMDb Rating: 6.4
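OMDb returns its values as strings, uses the placeholder "N/A" for unavailable fields, and signals a failed lookup with `"Response": "False"`. A minimal sketch of turning a parsed response into numeric ratings follows; `parse_omdb_ratings` is a hypothetical helper and the sample payloads are hand-written rather than captured from the API.

```python
def parse_omdb_ratings(payload):
    """Return (metascore, imdb_rating) as floats, or None where unavailable."""
    # A failed lookup carries no usable ratings
    if payload.get("Response") == "False":
        return None, None

    def to_float(raw):
        try:
            return float(raw)
        except (TypeError, ValueError):
            return None  # covers missing keys and the "N/A" placeholder

    return to_float(payload.get("Metascore")), to_float(payload.get("imdbRating"))


# Hand-written payloads mimicking the shape of OMDb responses
grinch = parse_omdb_ratings({"Title": "The Grinch", "Metascore": "51", "imdbRating": "6.4"})
missing = parse_omdb_ratings({"Response": "False", "Error": "Incorrect IMDb ID."})
partial = parse_omdb_ratings({"Metascore": "N/A", "imdbRating": "6.2"})
```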


After this test request, we build a new database from the current one, "movieDB.csv", adding columns for the Metascore and the IMDb rating. Rather than requesting every movie in the Open Movie Database and then merging on the ID, we request only the movies present in movieDB.csv.

We then save the new database to a CSV file so that the process does not have to be repeated (final CSV file: movieDB_complete.csv).

In [98]:
# Load your movie CSV file
movies_csv_path = '/Users/joni/Documents/Master 1/Data management/data/movieDB.csv'
movie_data = pd.read_csv(movies_csv_path)

# OMDb API key
api_key = "809be6a6"  # Replace with your OMDb API key

# Function to fetch Metascore and IMDb Rating
def fetch_omdb_ratings(imdb_id):
    if pd.isna(imdb_id):  # Skip rows without a valid IMDb ID
        return None, None
    url = f"http://www.omdbapi.com/?apikey={api_key}&i={imdb_id}"
    try:
        # Make the API request
        with urllib.request.urlopen(url) as response:
            data = json.loads(response.read().decode())  # Parse the JSON response
            # Extract Metascore and imdbRating
            metascore = data.get('Metascore', None)
            imdb_rating = data.get('imdbRating', None)
            return metascore, imdb_rating
    except urllib.error.HTTPError as e:
        print(f"HTTPError for IMDb ID {imdb_id}: {e.code} - {e.reason}")
    except urllib.error.URLError as e:
        print(f"URLError for IMDb ID {imdb_id}: {e.reason}")
    except Exception as e:
        print(f"Unexpected error for IMDb ID {imdb_id}: {e}")
    return None, None

# Create new columns for Metascore and IMDb Rating
movie_data['Metascore'], movie_data['imdbRating'] = zip(
    *movie_data['imdb_id'].apply(fetch_omdb_ratings)
)

# Save the updated dataset to a new CSV file
updated_csv_path = '/Users/joni/Documents/Master 1/Data management/data/movieDB_complete.csv'
movie_data.to_csv(updated_csv_path, index=False)

print(f"Updated dataset saved to {updated_csv_path}")


Updated dataset saved to /Users/joni/Documents/Master 1/Data management/data/movieDB_complete.csv
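Since the whole enrichment runs as one long loop, an interruption (for example a quota error) would lose everything fetched so far. One possible safeguard, sketched here under the assumption that results are JSON-serializable, is to cache each response on disk and skip IDs that are already cached. `fetch_with_cache`, `fake_fetch` and the cache filename are hypothetical; the demonstration does not call OMDb.

```python
import json
import tempfile
from pathlib import Path


def fetch_with_cache(ids, fetch, cache_path):
    """Fetch each ID at most once, persisting results so a rerun can resume."""
    cache_path = Path(cache_path)
    cache = json.loads(cache_path.read_text()) if cache_path.is_file() else {}
    for imdb_id in ids:
        if imdb_id in cache:
            continue  # already fetched in an earlier run
        cache[imdb_id] = fetch(imdb_id)
        # Rewrite the file after every fetch: simple, if not the fastest option
        cache_path.write_text(json.dumps(cache))
    return cache


calls = []

def fake_fetch(imdb_id):
    """Stand-in for the real API call; records which IDs were requested."""
    calls.append(imdb_id)
    return {"Metascore": "51"}


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "omdb_cache.json"
    fetch_with_cache(["tt2709692"], fake_fetch, cache_file)
    # Second run: only the new ID triggers a request
    result = fetch_with_cache(["tt2709692", "tt0118771"], fake_fetch, cache_file)
```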


In [75]:
mv = pd.read_csv('/Users/joni/Documents/Master 1/Data management/data/movieDB_complete.csv')

In [77]:
mv.head(10000)

Unnamed: 0,title,genres,popularity,release_date,budget,revenue,vote_count,vote_average,status,runtime,imdb_id,Metascore,imdbRating
0,Sikandar Ka Muqaddar,"['Thriller', 'Crime', 'Mystery', 'Action']",375.316,2024-11-29,0,0,12,6.100,Released,143,tt31522415,,
1,Japanese Mom,"['Romance', 'Drama']",353.716,2017-02-09,0,0,30,6.200,Released,91,,,
2,Spellbound,"['Animation', 'Fantasy', 'Family', 'Adventure'...",326.983,2024-11-22,0,0,159,6.900,Released,111,tt7215232,,6.3
3,The Grinch,"['Animation', 'Family', 'Comedy', 'Fantasy']",360.162,2018-11-08,75000000,508600000,3984,6.900,Released,85,tt2709692,51.0,6.4
4,We Live in Time,"['Romance', 'Drama']",370.593,2024-10-10,20000000,31818409,227,7.597,Released,108,tt27131358,59.0,7.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5975,Rogue,"['Action', 'Horror', 'Thriller']",25.390,2007-11-08,20000000,4600000,684,6.400,Released,99,tt0479528,,6.2
5976,"The Adventures of Priscilla, Queen of the Desert","['Drama', 'Comedy']",24.388,1994-05-31,2000000,29700000,795,7.281,Released,103,tt0109045,70.0,7.5
5977,Detective Conan: Sunflowers of Inferno,"['Animation', 'Crime', 'Adventure', 'Mystery']",17.892,2015-04-18,0,52920296,115,6.100,Released,113,tt3737650,,6.2
5978,Breakdown,"['Crime', 'Mystery', 'Thriller']",28.138,1997-05-02,36000000,50159144,923,6.900,Released,93,tt0118771,73.0,7.0


In [None]:
from sklearn.preprocessing import MinMaxScaler

# Step 5.1: Select numeric columns to normalize
numeric_columns = ['w_popularity', 'budget', 'revenue', 'vote_count', 'vote_average']

# Step 5.2: Initialize the scaler
scaler = MinMaxScaler()

# Step 5.3: Apply normalization
mv[numeric_columns] = scaler.fit_transform(mv[numeric_columns])

# Check the normalized dataset
print("After normalization:")
mv.head(1000)
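For reference, min-max normalization maps each value x to (x - min) / (max - min), so every column ends up in the range [0, 1]. The plain-Python sketch below illustrates the formula on a few budgets taken from the table above; `min_max_scale` is a hypothetical helper, not the scikit-learn implementation.

```python
def min_max_scale(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Budgets of four movies from the dataset preview
scaled = min_max_scale([2000000, 20000000, 36000000, 75000000])
```

Note that fitting the scaler on the full dataset before the train/test split lets the test rows influence the minimum and maximum; fitting on the training split only would avoid this.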


In [None]:
# Drop columns that are no longer required for prediction
columns_to_drop = ['release_year', 'release_quarter', 'title', 'year_quarter','release_date', 'popularity']
mv = mv.drop(columns=columns_to_drop, errors='ignore')

# Display the updated dataset
print("Dataset after dropping unnecessary columns:")
mv.head(1000)


In [None]:
# Separate the target variable and features
X = mv.drop(columns=['popularity_class'])  # Features (independent variables)
y = mv['popularity_class']                # Target (dependent variable)

# Display the shapes of the resulting datasets
print(f"Features (X) shape: {X.shape}")
print(f"Target (y) shape: {y.shape}")

# Display a preview of the features
print("Preview of features (X):")
X.head(10)

# Display a preview of the target variable
print("Preview of target (y):")
y.head(10)


In [None]:
# Import necessary libraries for Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Step 1: Define your features (X) and target variable (y)
# Assuming that 'popularity_class' is the target variable (binary: 1 or 0)
X = mv.drop(columns=['popularity_class'])  # Features: all columns except the target
y = mv['popularity_class']  # Target: popularity_class (binary)

# Step 2: Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Initialize and train the Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Step 4: Predict the target variable on the test set
y_pred = rf_model.predict(X_test)

# Step 5: Evaluate the model's performance using classification metrics
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)

# Output the results
print(f"Accuracy (Random Forest): {accuracy}")
print(f"Confusion Matrix (Random Forest):\n{conf_matrix}")
print(f"Classification Report (Random Forest):\n{class_report}")


In [None]:
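# Plotting libraries used for the bar plot in Step 6
import matplotlib.pyplot as plt
import seaborn as sns

# Step 1: Get the feature importance from the trained Random Forest model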
importance = rf_model.feature_importances_

# Step 2: Get the feature names
feature_names = X_train.columns

# Step 3: Create a DataFrame to display feature importance
feature_importance = pd.DataFrame({'Feature': feature_names, 'Importance': importance})

# Step 4: Sort the DataFrame by importance in descending order
feature_importance = feature_importance.sort_values(by='Importance', ascending=False)

# Step 5: Display the feature importance
print(feature_importance)

# Step 6: Display the feature importance as a bar plot
plt.figure(figsize=(12, 8))

# Updated bar plot with `hue` assigned
sns.barplot(x='Importance', y='Feature', hue='Feature', dodge=False, data=feature_importance, palette='viridis')
plt.legend([], [], frameon=False)  # Remove legend for clarity

# Add labels and title
plt.title('Feature Importance in Random Forest Model', fontsize=16)
plt.xlabel('Importance', fontsize=14)
plt.ylabel('Feature', fontsize=14)

# Show the plot
plt.show()


In [None]:
# Import necessary libraries for additional metrics
from sklearn.metrics import roc_curve, auc, roc_auc_score, precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt

# Step 1: Make predictions for the test set
# For Random Forest, we need to obtain predicted probabilities for the positive class (1)
# y_pred_proba contains the probability of the positive class (1) for each sample
y_pred_proba = rf_model.predict_proba(X_test)[:, 1]  # Get the predicted probabilities for the positive class (1)

# Step 2: Calculate the AUC-ROC
# roc_curve function returns False Positive Rate (FPR), True Positive Rate (TPR), and thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)  # Compute ROC curve points
roc_auc = auc(fpr, tpr)  # Compute the area under the ROC curve (AUC)

# Step 3: Plot the ROC curve
# This plots the ROC curve to visualize the trade-off between True Positive Rate (TPR) and False Positive Rate (FPR)
plt.figure(figsize=(8, 6))  # Set figure size for the plot
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')  # ROC curve line
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')  # Diagonal line representing random model
plt.xlim([0.0, 1.0])  # Set x-axis limits (FPR from 0 to 1)
plt.ylim([0.0, 1.05])  # Set y-axis limits (slightly above 1 so the curve is not clipped)
plt.xlabel('False Positive Rate')  # Label for x-axis
plt.ylabel('True Positive Rate')  # Label for y-axis
plt.title('Receiver Operating Characteristic (ROC)')  # Title for the plot
plt.legend(loc='lower right')  # Display legend at lower right
plt.show()  # Display the plot

# Step 4: Precision-Recall Curve
# precision_recall_curve returns precision, recall, and thresholds for different classification thresholds
precision, recall, pr_thresholds = precision_recall_curve(y_test, y_pred_proba)
pr_auc = average_precision_score(y_test, y_pred_proba)  # Calculate the Average Precision score (AP)

# Step 5: Plot the Precision-Recall curve
# This curve shows the relationship between precision and recall at different thresholds
plt.figure(figsize=(8, 6))  # Set figure size for the plot
plt.plot(recall, precision, color='green', lw=2, label=f'Precision-Recall curve (AP = {pr_auc:.2f})')  # Precision-Recall curve line
plt.xlabel('Recall')  # Label for x-axis
plt.ylabel('Precision')  # Label for y-axis
plt.title('Precision-Recall Curve')  # Title for the plot
plt.legend(loc='lower left')  # Display legend at lower left
plt.show()  # Display the plot

# Additional Step: Print AUC-ROC and Precision-Recall metrics
print(f"AUC-ROC score: {roc_auc:.4f}")  # Print the AUC score, which is a performance metric for the ROC curve
print(f"Average Precision (AP) score: {pr_auc:.4f}")  # Print the Average Precision score, useful for imbalanced datasets


In [None]:
# Import the necessary libraries for Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Step 1: Initialize the Gradient Boosting Classifier
gb_model = GradientBoostingClassifier(n_estimators=100, random_state=42)

# Step 2: Train the model
gb_model.fit(X_train, y_train)

# Step 3: Make predictions on the test data
y_pred_gb = gb_model.predict(X_test)

# Step 4: Evaluate the model
gb_accuracy = accuracy_score(y_test, y_pred_gb)
gb_confusion_matrix = confusion_matrix(y_test, y_pred_gb)
gb_classification_report = classification_report(y_test, y_pred_gb)

# Output results
print(f"Accuracy (Gradient Boosting): {gb_accuracy}")
print(f"Confusion Matrix (Gradient Boosting):\n{gb_confusion_matrix}")
print(f"Classification Report (Gradient Boosting):\n{gb_classification_report}")


In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler

# Assuming 'mv' is the cleaned dataframe

# Step 1: Prepare your data
# We will use 'popularity_class' as the target variable
X = mv.drop(columns=['popularity_class'])
y = mv['popularity_class']

# Step 2: Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Standardize the features (optional: tree-based models such as Gradient Boosting are insensitive to feature scaling)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Step 4: Define the Gradient Boosting model
gb_model = GradientBoostingClassifier(random_state=42)

# Step 5: Hyperparameter tuning using GridSearchCV
param_grid_gb = {
    'n_estimators': [50, 100, 200],            # Number of boosting stages
    'learning_rate': [0.01, 0.05, 0.1],        # Step size shrinking
    'max_depth': [3, 5, 7],                     # Depth of the trees
    'min_samples_split': [2, 5, 10],            # Minimum samples required to split a node
    'min_samples_leaf': [1, 2, 4]               # Minimum samples required to be a leaf node
}

# Initialize GridSearchCV
grid_search_gb = GridSearchCV(estimator=gb_model, param_grid=param_grid_gb, cv=5, n_jobs=-1, verbose=2)

# Step 6: Fit GridSearchCV to the training data
grid_search_gb.fit(X_train, y_train)

# Step 7: Get the best parameters
print("Best Parameters for Gradient Boosting:", grid_search_gb.best_params_)

# Step 8: Retrieve the best model (GridSearchCV already refit it on the full training set)
best_gb_model = grid_search_gb.best_estimator_

# Step 9: Predict on the test set
y_pred_gb = best_gb_model.predict(X_test)

# Step 10: Evaluate the model
# Calculate accuracy
accuracy_gb = accuracy_score(y_test, y_pred_gb)
print(f"Accuracy (Gradient Boosting): {accuracy_gb:.4f}")

# Display confusion matrix
print("Confusion Matrix (Gradient Boosting):")
print(confusion_matrix(y_test, y_pred_gb))

# Display classification report
print("Classification Report (Gradient Boosting):")
print(classification_report(y_test, y_pred_gb))
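The cost of this grid search can be estimated before running it. The sketch below copies the parameter grid from the cell above and counts the combinations; the fold count of 5 matches `cv=5`.

```python
from itertools import product

# Same grid as in the cell above
param_grid_gb = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

# Every combination of one value per parameter is a candidate model
n_candidates = len(list(product(*param_grid_gb.values())))
# Each candidate is trained once per cross-validation fold
n_fits = n_candidates * 5
print(f"{n_candidates} candidates, {n_fits} cross-validation fits")
```

With the default `refit=True`, one additional fit of the best candidate on the whole training set follows.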


Conclusion: the Random Forest classifier is the best model.