<a href="https://colab.research.google.com/github/Elbx88/ML-Model-Perdiction/blob/main/Project_ML_Models_Erez_Levy_TMDB_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## ML Model Prediction - Project by Erez Levy - Part I: Dataset Preparation

# The TMDB Dataset

The TMDB (The Movie Database) is a widely-used resource for movie and TV show data, providing valuable information such as ratings, plot summaries, and more.

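This dataset contains a collection of roughly 150,000 TV shows from the TMDB database, collected and cleaned. The file loaded in this notebook has 168,639 rows before rows with the same name are consolidated.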

# The Project Overview:


This dataset opens up a wide range of possibilities for data analysts and data scientists. Here are some ideas to get you started:

- Explore trends in TV show popularity based on vote count and vote average.
- Analyze TV show genres to identify the most popular genres or combinations of genres.
- Investigate the relationship between TV show ratings and the number of seasons and episodes.
- Build a recommendation system that suggests TV shows based on a user's favorite genres or languages.
- Predict the success of a TV show based on features like vote count, vote average, and popularity.
- Identify the most prolific TV show creators or production companies based on the number of shows they have created.
- Explore the distribution of TV show run times and investigate whether episode duration affects the overall ratings.
- Investigate TV show production trends across different countries and networks.
- Analyze the relationship between TV show language and popularity, and investigate the popularity of non-English shows.
- Track the status of TV shows (in production or not) and analyze their popularity over time.
- Develop a language analysis model to identify sentiment or themes from TV show overviews.

The goal of this project is to build a predictive model for the success of a TV show based on features like vote count and vote average. We'll approach this as a regression problem, where we predict a continuous success metric, and we'll compare multiple regression models.



**Approach**

Feature Selection: Choose the features to use for prediction (vote_count, vote_average, and potentially others such as number_of_episodes and number_of_seasons). Popularity is the target variable below, so it is not used as an input feature.

Target Variable: Define the target variable. In this case, we'll use popularity as a measure of success.

Data Splitting: Split the data into training and testing sets.

Model Selection: Try several regression models:

- Linear Regression: A good starting point for regression problems.
- Random Forest Regressor: A more complex model that can capture non-linear relationships.
- Gradient Boosting Regressor: Another advanced model known for good performance.

Model Training: Train the models on the training set.

Model Evaluation: Evaluate the models on the testing set using appropriate metrics (e.g., R-squared, Mean Absolute Error (MAE), Root Mean Squared Error (RMSE)).

Cross-Validation: Use cross-validation to check how well each model generalizes and to detect overfitting.

Hyperparameter Tuning: Use hyperparameter tuning to optimize the model parameters.
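The steps above can be sketched end to end as follows. This is a minimal illustration on synthetic data, not the notebook's modelling code: the column names mirror the TMDB fields, but the values, the made-up relationship between features and `popularity`, the 80/20 split, the model settings, and the 5-fold cross-validation are all assumptions chosen for the example. Hyperparameter tuning is left out to keep the sketch short.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the TMDB features (values are made up for illustration)
rng = np.random.default_rng(42)
n = 300
demo = pd.DataFrame({
    "vote_count": rng.integers(0, 5000, n),
    "vote_average": rng.uniform(0, 10, n),
    "number_of_seasons": rng.integers(1, 10, n),
    "number_of_episodes": rng.integers(1, 200, n),
})
# Hypothetical target: popularity loosely driven by the votes, plus noise
demo["popularity"] = (
    0.01 * demo["vote_count"] + 2 * demo["vote_average"] + rng.normal(0, 5, n)
)

# Features exclude the target; hold out 20% of the rows for testing
X = demo.drop(columns="popularity")
y = demo["popularity"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "R2": r2_score(y_test, pred),
        "MAE": mean_absolute_error(y_test, pred),
        "RMSE": np.sqrt(mean_squared_error(y_test, pred)),
        # Mean R2 over 5 folds of the training set
        "CV_R2": cross_val_score(model, X_train, y_train, cv=5, scoring="r2").mean(),
    }

print(pd.DataFrame(results).T.round(3))
```

Comparing the held-out R2 with the cross-validated R2 for each model gives a first indication of whether a model fits the training data much better than unseen data.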

# Target Value prediction
Based on the potential insights and business value, I would suggest focusing on predicting either:

- Popularity: It is a complex and dynamic metric that reflects overall success.
- Vote Average (Rating): It captures audience satisfaction and critical acclaim.

Both of these targets have valuable real-world implications and can be approached with a variety of machine learning models.

Important Considerations:

- Feature Engineering: Carefully select and engineer features from the TMDB data that you think will be most relevant to your chosen target variable.
- Model Selection: Experiment with different machine learning models (regression for popularity or ratings, classification for status/renewal) to find the best performer.
- Evaluation: Use appropriate metrics (like RMSE for regression or accuracy for classification) to assess the performance of your predictive model.

# 1. Data Preparation - TMDB Database

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
# Ignore warnings
import warnings
warnings.filterwarnings('ignore')

# Uploading the TMDB Original DataSet

In [None]:
from google.colab import files

uploaded = files.upload()  # This will prompt you to select a file

import io  # Needed to wrap the uploaded bytes in a file-like object
import pandas as pd

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance

# Load dataset

file_name = 'TMDB_tv_dataset_v3.csv'  # Change this to your actual file name
# Check if the file was uploaded successfully
if file_name in uploaded:
  # Read the uploaded CSV file into a Pandas DataFrame
  df = pd.read_csv(io.BytesIO(uploaded[file_name]))
  print("Data loaded successfully!")
  display(df.head(100)) # Display first 100 rows
else:
  # Report the missing file by name (the variable, not the literal file name)
  print(f"Error: File '{file_name}' not found in uploaded files.")

Saving TMDB_tv_dataset_v3.csv to TMDB_tv_dataset_v3.csv
Data loaded successfully!


Unnamed: 0,id,name,number_of_seasons,number_of_episodes,original_language,vote_count,vote_average,overview,adult,backdrop_path,...,tagline,genres,created_by,languages,networks,origin_country,spoken_languages,production_companies,production_countries,episode_run_time
0,1399,Game of Thrones,8,73,en,21857,8.442,Seven noble families fight for control of the ...,False,/2OMB0ynKlyIenMJWI2Dy9IWT4c.jpg,...,Winter Is Coming,"Sci-Fi & Fantasy, Drama, Action & Adventure","David Benioff, D.B. Weiss",en,HBO,US,English,"Revolution Sun Studios, Television 360, Genera...","United Kingdom, United States of America",0
1,71446,Money Heist,3,41,es,17836,8.257,"To carry out the biggest heist in history, a m...",False,/gFZriCkpJYsApPZEF3jhxL4yLzG.jpg,...,The perfect robbery.,"Crime, Drama",Álex Pina,es,"Netflix, Antena 3",ES,Español,Vancouver Media,Spain,70
2,66732,Stranger Things,4,34,en,16161,8.624,"When a young boy vanishes, a small town uncove...",False,/2MaumbgBlW1NoPo3ZJO38A6v7OS.jpg,...,Every ending has a beginning.,"Drama, Sci-Fi & Fantasy, Mystery","Matt Duffer, Ross Duffer",en,Netflix,US,English,"21 Laps Entertainment, Monkey Massacre Product...",United States of America,0
3,1402,The Walking Dead,11,177,en,15432,8.121,Sheriff's deputy Rick Grimes awakens from a co...,False,/x4salpjB11umlUOltfNvSSrjSXm.jpg,...,Fight the dead. Fear the living.,"Action & Adventure, Drama, Sci-Fi & Fantasy",Frank Darabont,en,AMC,US,English,"AMC Studios, Circle of Confusion, Valhalla Mot...",United States of America,42
4,63174,Lucifer,6,93,en,13870,8.486,"Bored and unhappy as the Lord of Hell, Lucifer...",False,/aDBRtunw49UF4XmqfyNuD9nlYIu.jpg,...,It's good to be bad.,"Crime, Sci-Fi & Fantasy",Tom Kapinos,en,"FOX, Netflix",US,English,"Warner Bros. Television, DC Entertainment, Jer...",United States of America,45
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,4613,Band of Brothers,1,10,en,3107,8.488,Drawn from interviews with survivors of Easy C...,False,/lngME02RNCzjUIkzS9eJ4b7sjeQ.jpg,...,There was a time when the world asked ordinary...,"Drama, War & Politics",,"de, fr, lt, nl, en",HBO,US,"Deutsch, Français, Lietuvių, Nederlands, English","HBO Films, DreamWorks Television, DreamWorks P...","Switzerland, United Kingdom, United States of ...",60
96,104877,Love Is In The Air,2,52,tr,2920,8.200,"Eda, who ties all her hopes to her education, ...",False,/ic7YfRS52N3m4oalxoOIT665C33.jpg,...,,"Drama, Comedy",Ayse Uber Kutlu,tr,FOX,TR,Türkçe,MF Yapım,Turkey,130
97,1911,Bones,12,246,en,2891,8.264,Dr. Temperance Brennan and her colleagues at t...,False,/e9n87p3Ax67spq3eUgLB6rjIEow.jpg,...,Every body has secrets.,"Crime, Drama",Hart Hanson,en,FOX,US,English,"20th Century Fox Television, Far Field Product...",United States of America,40
98,88329,Hawkeye,1,6,en,2887,7.960,Former Avenger Clint Barton has a seemingly si...,False,/1R68vl3d5s86JsS2NPjl8UoMqIS.jpg,...,"This holiday season, the best gifts come with ...","Drama, Comedy, Action & Adventure",Jonathan Igla,en,Disney+,US,English,"Marvel Studios, Kevin Feige Productions",United States of America,0


# Reduce Large Categories: Dropping Unnecessary Dataset Columns

In [None]:
# Columns to drop
columns_to_drop = ['overview', 'tagline', 'backdrop_path','homepage','original_name','poster_path','production_countries','original_language','spoken_languages']

# Drop the specified columns ('columns=' already selects the axis, so 'axis=1' is not needed)
tmdb = df.drop(columns=columns_to_drop)
tmdb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 168639 entries, 0 to 168638
Data columns (total 20 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   id                    168639 non-null  int64  
 1   name                  168634 non-null  object 
 2   number_of_seasons     168639 non-null  int64  
 3   number_of_episodes    168639 non-null  int64  
 4   vote_count            168639 non-null  int64  
 5   vote_average          168639 non-null  float64
 6   adult                 168639 non-null  bool   
 7   first_air_date        136903 non-null  object 
 8   last_air_date         138735 non-null  object 
 9   in_production         168639 non-null  bool   
 10  popularity            168639 non-null  float64
 11  type                  168639 non-null  object 
 12  status                168639 non-null  object 
 13  genres                99713 non-null   object 
 14  created_by            36496 non-null   object 
 15  

In [None]:
tmdb.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,168639.0,111307.074704,76451.662352,1.0,45936.5,97734.0,196923.5,251213.0
number_of_seasons,168639.0,1.548497,2.942872,0.0,1.0,1.0,1.0,240.0
number_of_episodes,168639.0,24.465082,134.799622,0.0,1.0,6.0,20.0,20839.0
vote_count,168639.0,13.305054,190.809059,0.0,0.0,0.0,1.0,21857.0
vote_average,168639.0,2.333843,3.454334,0.0,0.0,0.0,6.0,10.0
popularity,168639.0,5.882644,42.023216,0.0,0.6,0.857,2.4315,3707.008
episode_run_time,168639.0,22.603348,47.950427,0.0,0.0,0.0,42.0,6032.0


# Cleaning Common Words & Punctuation Marks from the tmdb Dataset

**Stop Words:**

These are common words (like "the," "a," "is," "and") that often don't add much meaning to text analysis. Removing them can reduce noise and improve processing efficiency.


**Punctuation:**

Punctuation marks (like commas, periods, question marks) can interfere with text analysis. Removing or standardizing them helps in focusing on the actual words.

**Special Characters:**

These include symbols and other non-alphanumeric characters that might add noise to text analysis. Numbers are handled in a separate step later in this notebook.
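As a small illustration of these three clean-up steps, the sketch below uses only the standard library. The stop-word set here is a tiny hand-picked list used purely for the example (the notebook itself uses NLTK's English stop words), and `demo_clean` is a hypothetical helper name.

```python
import re

# Tiny hand-picked stop-word list, used only for this illustration
DEMO_STOP_WORDS = {"the", "a", "is", "and", "of"}

def demo_clean(text):
    """Lowercase, replace non-word characters with spaces, drop stop words."""
    text = re.sub(r"\W+", " ", text.lower())
    return " ".join(w for w in text.split() if w not in DEMO_STOP_WORDS)

print(demo_clean("Game of Thrones"))    # game thrones
print(demo_clean("The Walking Dead!"))  # walking dead
print(repr(demo_clean("The A")))        # '' (only stop words remain)
```

The last call shows a side effect worth keeping in mind: a title made up entirely of stop words is reduced to an empty string.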


In [None]:
!pip install --upgrade nltk
import re
import ssl

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Workaround for environments where certificate verification blocks the NLTK download.
# Note: this disables HTTPS certificate verification for the whole session.
try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

# Download the necessary NLTK resources ('punkt_tab' is used instead of 'punkt').
# nltk.download() reports failure through its return value rather than by raising,
# so check the result for each resource.
for resource in ('stopwords', 'punkt_tab'):
    if not nltk.download(resource):
        print(f"Error downloading NLTK resource: {resource}")
        print("Please ensure you have a stable internet connection and try again.")



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


# Cleaning the 'name' Column and Converting It to Lowercase

In [None]:
# Work on a copy of 'tmdb'; the text to clean is in its 'name' column
tmdb_cleaned = tmdb.copy()

# Build the stop-word set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

# Function for text cleaning
def clean_text(text):
    """
    Cleans the input text by removing punctuation, special characters,
    stop words, and converting it to lowercase.

    Args:
        text (str): The input text string.

    Returns:
        str: The cleaned text string.
    """
    if not isinstance(text, str):
        return ""  # Missing or non-string values become an empty string

    # Lowercasing
    text = text.lower()

    # Replace runs of non-word characters (punctuation, symbols, whitespace) with one space
    text = re.sub(r'\W+', ' ', text)

    # Tokenization
    words = word_tokenize(text)

    # Remove stop words
    words = [word for word in words if word not in STOP_WORDS]

    # Join words back into a string
    return ' '.join(words)

# Replace the 'name' column with its cleaned version
tmdb_cleaned['name'] = tmdb_cleaned['name'].apply(clean_text)

# Display the first few rows with the cleaned text
print(tmdb_cleaned[['name']].head())
tmdb_cleaned.info()

              name
0     game thrones
1      money heist
2  stranger things
3     walking dead
4          lucifer
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 168639 entries, 0 to 168638
Data columns (total 20 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   id                    168639 non-null  int64  
 1   name                  168639 non-null  object 
 2   number_of_seasons     168639 non-null  int64  
 3   number_of_episodes    168639 non-null  int64  
 4   vote_count            168639 non-null  int64  
 5   vote_average          168639 non-null  float64
 6   adult                 168639 non-null  bool   
 7   first_air_date        136903 non-null  object 
 8   last_air_date         138735 non-null  object 
 9   in_production         168639 non-null  bool   
 10  popularity            168639 non-null  float64
 11  type                  168639 non-null  object 
 12  status                168639 non-null  obj

In [None]:
tmdb_cleaned.head(10)

Unnamed: 0,id,name,number_of_seasons,number_of_episodes,vote_count,vote_average,adult,first_air_date,last_air_date,in_production,popularity,type,status,genres,created_by,languages,networks,origin_country,production_companies,episode_run_time
0,1399,game thrones,8,73,21857,8.442,False,2011-04-17,2019-05-19,False,1083.917,Scripted,Ended,"Sci-Fi & Fantasy, Drama, Action & Adventure","David Benioff, D.B. Weiss",en,HBO,US,"Revolution Sun Studios, Television 360, Genera...",0
1,71446,money heist,3,41,17836,8.257,False,2017-05-02,2021-12-03,False,96.354,Scripted,Ended,"Crime, Drama",Álex Pina,es,"Netflix, Antena 3",ES,Vancouver Media,70
2,66732,stranger things,4,34,16161,8.624,False,2016-07-15,2022-07-01,True,185.711,Scripted,Returning Series,"Drama, Sci-Fi & Fantasy, Mystery","Matt Duffer, Ross Duffer",en,Netflix,US,"21 Laps Entertainment, Monkey Massacre Product...",0
3,1402,walking dead,11,177,15432,8.121,False,2010-10-31,2022-11-20,False,489.746,Scripted,Ended,"Action & Adventure, Drama, Sci-Fi & Fantasy",Frank Darabont,en,AMC,US,"AMC Studios, Circle of Confusion, Valhalla Mot...",42
4,63174,lucifer,6,93,13870,8.486,False,2016-01-25,2021-09-10,False,416.668,Scripted,Ended,"Crime, Sci-Fi & Fantasy",Tom Kapinos,en,"FOX, Netflix",US,"Warner Bros. Television, DC Entertainment, Jer...",45
5,69050,riverdale,7,137,13180,8.479,False,2017-01-26,2023-08-23,False,143.75,Scripted,Ended,"Crime, Drama, Mystery",Roberto Aguirre-Sacasa,en,The CW,US,"Warner Bros. Television, Berlanti Productions,...",45
6,93405,squid game,2,9,13053,7.831,False,2021-09-17,2021-09-17,True,115.587,Scripted,Returning Series,"Action & Adventure, Mystery, Drama",Hwang Dong-hyuk,"en, ko, ur",Netflix,KR,"Siren Pictures, Firstman Studio",0
7,1396,breaking bad,5,62,12398,8.89,False,2008-01-20,2013-09-29,False,247.632,Scripted,Ended,"Drama, Crime",Vince Gilligan,"en, de, es",AMC,US,"Sony Pictures Television Studios, High Bridge ...",0
8,71712,good doctor,6,116,11768,8.503,False,2017-09-25,2023-05-01,True,681.614,Scripted,Returning Series,Drama,David Shore,en,ABC,US,"ABC Studios, 3AD, Sony Pictures Television Stu...",43
9,85271,wandavision,1,9,11308,8.3,False,2021-01-15,2021-03-05,False,62.893,Miniseries,Ended,"Sci-Fi & Fantasy, Mystery, Drama",Jac Schaeffer,en,Disney+,US,Marvel Studios,0


# Transform/Manipulate data
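Consolidating rows that share the same cleaned 'name': each group is reduced to one row, and every column takes the first non-null value found in that group. Note that different shows with the same cleaned name are merged by this step. For example, titles made up entirely of stop words or punctuation all collapse to an empty name and end up in a single row, which can combine values from unrelated shows.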
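To make the effect of a "first non-null value per group" consolidation concrete, here is a minimal sketch on made-up rows. The data and the `drop_duplicates` alternative are assumptions for illustration only; they are not part of the notebook's pipeline.

```python
import pandas as pd

# Made-up rows: two different shows share the cleaned name 'lost'
demo = pd.DataFrame({
    "name": ["lost", "lost", "bones"],
    "id": [1, 2, 3],
    "created_by": [None, "Someone Else", "Hart Hanson"],
})

# GroupBy.first() takes the first non-null value of each column per group
merged = demo.groupby("name").first().reset_index()
print(merged)

# A stricter alternative: only drop rows that are duplicated on 'id'
deduped = demo.drop_duplicates(subset="id")
print(len(deduped))
```

In `merged`, the 'lost' row keeps `id` 1 from the first show but takes `created_by` from the second one, because the first show's value was missing. Deduplicating on `id` avoids that blending but keeps both shows.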

In [None]:
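import pandas as pd

# Group by 'name' and keep the first non-null value of each column.
# GroupBy.first() skips missing values, which matches the intent of
# taking x.dropna().iloc[0] per column, and runs much faster.
tmdb_consolidate = tmdb_cleaned.groupby('name').first().reset_index()

print(tmdb_consolidate)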

                     name      id  number_of_seasons  number_of_episodes  \
0                           91363                  1                   9   
1          0 5 30 minutes   39280                  0                   0   
2                 0 5 man  222976                  1                   5   
3        0 michael shchur  111731                  6                 137   
4       00 erne tur retur  231560                  1                  10   
...                   ...     ...                ...                 ...   
149835    ｏｚｕ 小津安二郎が描いた物語  233374                  1                   1   
149836     ｒｏｍｅｓ 空港防御システム  158374                  1                   9   
149837    ｓｌ山口線貴婦人号トンネル殺人  202120                  1                   1   
149838          ｕ字工事の旅 発見  195208                  1                   1   
149839   𝐁𝐮𝐢𝐥𝐝𝐢𝐧𝐠 𝐍𝐨 𝐗𝐕𝐈𝐈  231549                  1                  12   

        vote_count  vote_average  adult first_air_date last_air_date  \
0             3

In [None]:
tmdb_consolidate.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149840 entries, 0 to 149839
Data columns (total 20 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   name                  149840 non-null  object 
 1   id                    149840 non-null  int64  
 2   number_of_seasons     149840 non-null  int64  
 3   number_of_episodes    149840 non-null  int64  
 4   vote_count            149840 non-null  int64  
 5   vote_average          149840 non-null  float64
 6   adult                 149840 non-null  bool   
 7   first_air_date        123584 non-null  object 
 8   last_air_date         125296 non-null  object 
 9   in_production         149840 non-null  bool   
 10  popularity            149840 non-null  float64
 11  type                  149840 non-null  object 
 12  status                149840 non-null  object 
 13  genres                90035 non-null   object 
 14  created_by            33556 non-null   object 
 15  

# Removing Numbers from 'name' Using Regular Expressions

In [None]:
import re
tmdb_NP = tmdb_consolidate.copy()
def remove_numbers(text):
    """
    Removes numbers from the input text using regular expressions.

    Args:
        text (str): The input text string.

    Returns:
        str: The text string with numbers removed.
    """
    if isinstance(text, str):  # Check if text is a string
        return re.sub(r'\d+', '', text)  # Remove digits (numbers)
    else:
        return text  # Return the original value if it's not a string

# Apply the function to the 'name' column
tmdb_NP['name'] = tmdb_NP['name'].apply(remove_numbers)

# Display the first few rows with the cleaned text
print(tmdb_NP[['name']].head())
tmdb_NP.head()

              name
0                 
1          minutes
2              man
3   michael shchur
4   erne tur retur


Unnamed: 0,name,id,number_of_seasons,number_of_episodes,vote_count,vote_average,adult,first_air_date,last_air_date,in_production,popularity,type,status,genres,created_by,languages,networks,origin_country,production_companies,episode_run_time
0,,91363,1,9,3676,8.208,False,2021-08-11,2021-10-06,True,53.046,Scripted,Returning Series,"Animation, Action & Adventure, Sci-Fi & Fantasy",Stephen King,en,Disney+,US,"Marvel Studios, Blue Spirit, Squeeze, Flying B...",34
1,minutes,39280,0,0,0,0.0,False,,,False,0.6,Scripted,Ended,,,en,,US,,30
2,man,222976,1,5,3,7.7,False,2023-05-28,2023-06-25,False,4.606,Scripted,Ended,Drama,Shuichi Okita,ja,WOWOW Prime,JP,,50
3,michael shchur,111731,6,137,2,5.5,False,2016-11-06,2021-12-19,True,0.946,Scripted,Returning Series,"News, Talk",,uk,"UA: PERSHYI, Hromadske, 24 Kanal, YouTube",UA,,25
4,erne tur retur,231560,1,10,0,0.0,False,2014-05-15,2014-07-17,False,0.6,Scripted,Ended,,,,,DK,,0
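Removing digits leaves the surrounding spaces in place, and a title made only of digits becomes an empty string. A possible follow-up, shown here as a hypothetical helper that is not part of the original notebook, is to collapse whitespace and trim the result:

```python
import re

def remove_numbers_and_tidy(text):
    """Remove digits, then collapse repeated whitespace and trim the ends."""
    if not isinstance(text, str):
        return text  # Leave non-string values unchanged
    without_digits = re.sub(r"\d+", "", text)
    return re.sub(r"\s+", " ", without_digits).strip()

# Digit removal alone keeps the three separating spaces
print(repr(re.sub(r"\d+", "", "0 5 30 minutes")))       # '   minutes'
print(repr(remove_numbers_and_tidy("0 5 30 minutes")))  # 'minutes'
```

Because this step runs after the consolidation by name, names that become identical only after digit removal stay as separate rows.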


# Split tmdb_consolidate into 2 Sub-Datasets: tmdb_IP for Shows in Production & tmdb_NP for Shows Not in Production

In [None]:
import pandas as pd
tmdb_consolidate = tmdb_NP.copy()  # Names with numbers removed
tmdb_act = tmdb_consolidate.copy()  # Create a copy of tmdb_consolidate
# 'in_production' is already boolean, so cast it directly to 0/1.
# (Mapping the strings 'True'/'False' matches nothing on a boolean column
# and, after fillna(False), would set every row to 0.)
tmdb_act['in_production'] = tmdb_act['in_production'].astype(int)
tmdb_act.info()
# Split on the boolean flag, keeping all columns
tmdb_IP = tmdb_consolidate[tmdb_consolidate['in_production']].copy()   # In production
tmdb_NP = tmdb_consolidate[~tmdb_consolidate['in_production']].copy()  # Not in production
tmdb_NP.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 149840 entries, 0 to 149839
Data columns (total 20 columns):
 #   Column                Non-Null Count   Dtype  
---  ------                --------------   -----  
 0   name                  149840 non-null  object 
 1   id                    149840 non-null  int64  
 2   number_of_seasons     149840 non-null  int64  
 3   number_of_episodes    149840 non-null  int64  
 4   vote_count            149840 non-null  int64  
 5   vote_average          149840 non-null  float64
 6   adult                 149840 non-null  bool   
 7   first_air_date        123584 non-null  object 
 8   last_air_date         125296 non-null  object 
 9   in_production         149840 non-null  int64  
 10  popularity            149840 non-null  float64
 11  type                  149840 non-null  object 
 12  status                149840 non-null  object 
 13  genres                90035 non-null   object 
 14  created_by            33556 non-null   object 
 15  

# Identifying & Replacing rare 'languages' with 'other'

In [None]:
import pandas as pd

# Work on a copy of tmdb_NP
tmdb_test = tmdb_NP.copy()

# Function to sort and canonicalize languages
def sort_languages(languages_str):
    if isinstance(languages_str, str):
        languages = [g.strip() for g in languages_str.split(',')]
        languages.sort()
        return ','.join(languages)
    return languages_str  # Return the original value if it's not a string

# Apply the function to create a new 'sorted_languages' column
tmdb_test['sorted_languages'] = tmdb_test['languages'].apply(sort_languages)

# Now count the frequencies of the sorted languages
languages_counts = tmdb_test['sorted_languages'].value_counts()

# Identify rare language combinations (75 occurrences or fewer)
rare_languages = languages_counts[languages_counts <= 75].index

# Replace rare languages with 'other'
tmdb_test.loc[tmdb_test['sorted_languages'].isin(rare_languages), 'sorted_languages'] = 'other'

# Write the result back to tmdb_NP and verify the changes
tmdb_NP['languages'] = tmdb_test['sorted_languages']
print(tmdb_NP['languages'].value_counts())

languages
en       23899
ja        8055
zh        4011
de        3783
fr        2803
other     2702
ko        2379
es        2293
pt        1629
cn        1502
th         926
ru         918
nl         864
tr         821
cs         770
it         646
ar         626
da         577
el         552
sv         517
hi         499
en,tl      376
no         330
pl         296
en,fr      276
en,es      254
fi         211
tl         190
sk         183
de,en      182
id         176
ur         176
en,ja      156
he         146
hu         129
ms         101
Name: count, dtype: int64
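The same sort, count, and replace pattern is applied to several columns in this notebook. As a sketch only, it could be captured in one reusable function; `collapse_rare` and its parameter names are hypothetical and not part of the original code.

```python
import pandas as pd

def collapse_rare(series, min_count, other_label="other"):
    """Sort the comma-separated values inside each cell, then relabel
    combinations that occur min_count times or fewer as other_label.
    Missing values stay missing."""
    def canonical(value):
        if isinstance(value, str):
            return ",".join(sorted(part.strip() for part in value.split(",")))
        return value  # Leave non-string values (e.g. NaN) unchanged

    canon = series.apply(canonical)
    counts = canon.value_counts()  # Missing values are not counted
    rare = counts[counts <= min_count].index
    # Keep values that are not rare; replace the rest with other_label
    return canon.where(~canon.isin(rare), other_label)

# Made-up example: 'en, fr' and 'fr, en' are treated as the same combination
demo = pd.Series(["en", "en", "en", "en, fr", "fr, en", "ja", None])
result = collapse_rare(demo, min_count=1)
print(result.tolist()[:6])  # ['en', 'en', 'en', 'en,fr', 'en,fr', 'other']
```

With a helper like this, a call such as `collapse_rare(tmdb_NP['languages'], 75)` would express the cell above in one line, and the threshold becomes the only thing that varies between columns.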


In [None]:
tmdb_NP['status'].value_counts()

Unnamed: 0_level_0,count
status,Unnamed: 1_level_1
Ended,82872
Canceled,4060


# Dropping rows where 'status' is 'Canceled' from the tmdb_NP Dataset

In [None]:
# Using tmdb_NP as my DataFrame

# Dropping rows where 'status' is 'Canceled'
tmdb_NP = tmdb_NP[tmdb_NP['status'] != 'Canceled']

# Reset the index (optional but recommended)
tmdb_NP = tmdb_NP.reset_index(drop=True)

# Verify the changes
print(tmdb_NP['status'].value_counts())

status
Ended    82872
Name: count, dtype: int64


In [None]:
tmdb_NP.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82872 entries, 0 to 82871
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   name                  82872 non-null  object 
 1   id                    82872 non-null  int64  
 2   number_of_seasons     82872 non-null  int64  
 3   number_of_episodes    82872 non-null  int64  
 4   vote_count            82872 non-null  int64  
 5   vote_average          82872 non-null  float64
 6   adult                 82872 non-null  bool   
 7   first_air_date        62883 non-null  object 
 8   last_air_date         63444 non-null  object 
 9   in_production         82872 non-null  bool   
 10  popularity            82872 non-null  float64
 11  type                  82872 non-null  object 
 12  status                82872 non-null  object 
 13  genres                53402 non-null  object 
 14  created_by            21699 non-null  object 
 15  languages          

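# Dropping the 'status' column from the tmdb_NP Dataset

After the filter above, every remaining row has the status 'Ended', so the column no longer carries any information.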

In [None]:
# Drop the 'status' column
tmdb_NP = tmdb_NP.drop('status', axis=1)

# Verify the change
tmdb_NP.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82872 entries, 0 to 82871
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   name                  82872 non-null  object 
 1   id                    82872 non-null  int64  
 2   number_of_seasons     82872 non-null  int64  
 3   number_of_episodes    82872 non-null  int64  
 4   vote_count            82872 non-null  int64  
 5   vote_average          82872 non-null  float64
 6   adult                 82872 non-null  bool   
 7   first_air_date        62883 non-null  object 
 8   last_air_date         63444 non-null  object 
 9   in_production         82872 non-null  bool   
 10  popularity            82872 non-null  float64
 11  type                  82872 non-null  object 
 12  genres                53402 non-null  object 
 13  created_by            21699 non-null  object 
 14  languages             60084 non-null  object 
 15  networks           

In [None]:
tmdb_NP['type'].value_counts()

Unnamed: 0_level_0,count
type,Unnamed: 1_level_1
Scripted,63907
Miniseries,8305
Documentary,6318
Reality,2465
Video,1002
Talk Show,801
News,74


# Networks Column Preparation

In [None]:
tmdb_NP['networks'].value_counts()

Unnamed: 0_level_0,count
networks,Unnamed: 1_level_1
BBC One,1444
ITV1,1222
BBC Two,1161
TVB Jade,1136
NBC,978
...,...
Sari-Sari Channel,1
"ARY Digital, MBC + Drama",1
"Discovery Science, Prime Video",1
"HBO Europe, TNT Serie",1


# Identifying & Replacing rare 'networks' with 'other'

In [None]:
import pandas as pd

# Work on a copy of tmdb_NP
tmdb_test = tmdb_NP.copy()

# Function to sort and canonicalize networks
def sort_networks(networks_str):
    if isinstance(networks_str, str):
        networks = [g.strip() for g in networks_str.split(',')]
        networks.sort()
        return ','.join(networks)
    return networks_str  # Return the original value if it's not a string

# Apply the function to create a new 'sorted_networks' column
tmdb_test['sorted_networks'] = tmdb_test['networks'].apply(sort_networks)

# Now count the frequencies of the sorted networks
networks_counts = tmdb_test['sorted_networks'].value_counts()

# Identify rare network combinations (75 occurrences or fewer)
rare_networks = networks_counts[networks_counts <= 75].index

# Replace rare networks with 'other'
tmdb_test.loc[tmdb_test['sorted_networks'].isin(rare_networks), 'sorted_networks'] = 'other'

# Write the result back to tmdb_NP and verify the changes
tmdb_test['networks'] = tmdb_test['sorted_networks']
tmdb_test.drop('sorted_networks', axis=1, inplace=True)
tmdb_NP['networks'] = tmdb_test['networks']
print(tmdb_NP['networks'].value_counts())

networks
other                      19690
BBC One                     1444
ITV1                        1222
BBC Two                     1161
TVB Jade                    1136
                           ...  
Nat Geo Wild                  78
ERT1                          78
ProSieben                     77
Investigation Discovery       76
DFF 1                         76
Name: count, Length: 135, dtype: int64


# Identifying & Replacing rare 'genres' with 'other'

In [None]:
tmdb_NP['genres'].value_counts()

Unnamed: 0_level_0,count
genres,Unnamed: 1_level_1
Drama,9877
Documentary,8605
Comedy,5974
Animation,2289
Reality,1908
...,...
"Sci-Fi & Fantasy, Family, Family",1
"Action & Adventure, Crime, Mystery, Comedy",1
"Comedy, Action & Adventure, Mystery, Crime",1
"Drama, History, Crime",1


In [None]:
import pandas as pd

# Work on a copy of tmdb_NP
tmdb_test = tmdb_NP.copy()

# Function to sort and canonicalize genres
def sort_genres(genre_str):
    if isinstance(genre_str, str):
        genres = [g.strip() for g in genre_str.split(',')]
        genres.sort()
        return ','.join(genres)
    return genre_str  # Return the original value if it's not a string

# Apply the function to create a new 'sorted_genres' column
tmdb_test['sorted_genres'] = tmdb_test['genres'].apply(sort_genres)

# Now count the frequencies of the sorted genres
genres_counts = tmdb_test['sorted_genres'].value_counts()

# Identify rare genre combinations (100 occurrences or fewer)
rare_genres = genres_counts[genres_counts <= 100].index

# Replace rare genres with 'other'
tmdb_test.loc[tmdb_test['sorted_genres'].isin(rare_genres), 'sorted_genres'] = 'other'

# Write the result back to tmdb_NP and verify the changes
tmdb_test['genres'] = tmdb_test['sorted_genres']
tmdb_test.drop('sorted_genres', axis=1, inplace=True)
tmdb_NP['genres'] = tmdb_test['genres']
print(tmdb_NP['genres'].value_counts())

genres
Drama                                                   9877
Documentary                                             8605
Comedy                                                  5974
other                                                   4939
Animation                                               2289
Comedy,Drama                                            2272
Reality                                                 1908
Animation,Comedy                                        1066
Crime,Drama                                             1064
Drama,Soap                                               814
Drama,Family                                             778
Drama,Mystery                                            593
Crime                                                    578
News                                                     574
Action & Adventure,Animation,Sci-Fi & Fantasy            567
Action & Adventure,Drama                                 557
Family           

# Identifying & Replacing rare 'production_companies' with 'other'

In [None]:
tmdb_NP['production_companies'].value_counts()

Unnamed: 0_level_0,count
production_companies,Unnamed: 1_level_1
TVB,1152
Estúdios Globo,401
BBC,392
NHK,270
Televisa,234
...,...
"台灣漢梁公司, 西岸傳媒股份有限公司, 上海劇蕊有限公司",1
"Sharefun Studio, 啊哈娱乐",1
NHK World-Japan,1
"Hallmark Entertainment, Muse Entertainment",1


In [None]:
import pandas as pd

# Work on a copy of tmdb_NP so the original is only changed at the final write-back
tmdb_test = tmdb_NP.copy()

# Function to sort and canonicalize production_companies
def sort_production_companies(production_companies_str):
    if isinstance(production_companies_str, str):
        production_companies = [g.strip() for g in production_companies_str.split(',')]
        production_companies.sort()
        return ','.join(production_companies)
    return production_companies_str  # Return the original value if it's not a string

# Apply the function to create a new 'sorted_production_companies' column
tmdb_test['sorted_production_companies'] = tmdb_test['production_companies'].apply(sort_production_companies)

# Count the frequencies of the canonicalized 'sorted_production_companies' values
production_companies_counts = tmdb_test['sorted_production_companies'].value_counts()

# Identify Rare production_companies
rare_production_companies = production_companies_counts[production_companies_counts <= 25].index

# Replace rare production_companies with 'other'
tmdb_test.loc[tmdb_test['sorted_production_companies'].isin(rare_production_companies), 'sorted_production_companies'] = 'other'

# Write the grouped values back to both DataFrames and verify the changes
tmdb_test['production_companies']=tmdb_test['sorted_production_companies']
tmdb_test.drop('sorted_production_companies', axis=1, inplace=True)
tmdb_NP['production_companies']=tmdb_test['production_companies']
print(tmdb_NP['production_companies'].value_counts())

production_companies
other             25510
TVB                1152
Estúdios Globo      401
BBC                 392
NHK                 270
                  ...  
iQIYI                26
Southern Star        26
Broadway.com         26
ITV Studios          26
Suzuki Mirano        26
Name: count, Length: 93, dtype: int64
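
The genres and production_companies steps follow the same pattern: count the values, pick those at or below a threshold, and replace them with 'other'. A possible way to avoid repeating that code is a small helper like the one sketched here. The name `collapse_rare`, its parameters, and the toy data are assumptions for illustration and are not part of the original notebook.

```python
import pandas as pd

def collapse_rare(series, max_rare_count, other_label="other"):
    """Replace values that occur max_rare_count times or fewer with other_label."""
    counts = series.value_counts()  # NaN is excluded from the counts by default
    rare = counts[counts <= max_rare_count].index
    # Keep every value that is not rare; NaN is never in `rare`, so it stays NaN
    return series.where(~series.isin(rare), other_label)

# Toy data (hypothetical): "B" and "C" are rare with a threshold of 2
toy = pd.Series(["A"] * 5 + ["B"] * 2 + ["C"] + [None])
collapsed = collapse_rare(toy, max_rare_count=2)
print(collapsed.value_counts(dropna=False))
```

With such a helper, the thresholds used above (100 for genres, 25 for production_companies) would become a single argument per column.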


In [None]:
tmdb_NP.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82872 entries, 0 to 82871
Data columns (total 19 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   name                  82872 non-null  object 
 1   id                    82872 non-null  int64  
 2   number_of_seasons     82872 non-null  int64  
 3   number_of_episodes    82872 non-null  int64  
 4   vote_count            82872 non-null  int64  
 5   vote_average          82872 non-null  float64
 6   adult                 82872 non-null  bool   
 7   first_air_date        62883 non-null  object 
 8   last_air_date         63444 non-null  object 
 9   in_production         82872 non-null  bool   
 10  popularity            82872 non-null  float64
 11  type                  82872 non-null  object 
 12  genres                53402 non-null  object 
 13  created_by            21699 non-null  object 
 14  languages             60084 non-null  object 
 15  networks           

In [None]:
tmdb_NP['in_production'].value_counts()

in_production,count
False,82872
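
A column with a single value, as `in_production` has here, carries no information for a model. Instead of checking columns one at a time with `value_counts()`, such columns could be found in one pass. This is a minimal sketch on toy data; the helper name `constant_columns` is hypothetical.

```python
import pandas as pd

def constant_columns(df):
    """Return names of columns holding at most one distinct value (NaN counts as a value)."""
    return [col for col in df.columns if df[col].nunique(dropna=False) <= 1]

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    "in_production": [False, False, False],
    "vote_count": [0, 3, 10],
})

to_drop = constant_columns(toy)
print(to_drop)
toy = toy.drop(columns=to_drop)
```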


# Drop the 'in_production' column from the tmdb_NP dataset

In [None]:
# Drop 'in_production' from the tmdb_NP DataFrame (every value is False, so it adds no information)

tmdb_NP = tmdb_NP.drop('in_production', axis=1)

# Verify the change
tmdb_NP.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 82872 entries, 0 to 82871
Data columns (total 18 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   name                  82872 non-null  object 
 1   id                    82872 non-null  int64  
 2   number_of_seasons     82872 non-null  int64  
 3   number_of_episodes    82872 non-null  int64  
 4   vote_count            82872 non-null  int64  
 5   vote_average          82872 non-null  float64
 6   adult                 82872 non-null  bool   
 7   first_air_date        62883 non-null  object 
 8   last_air_date         63444 non-null  object 
 9   popularity            82872 non-null  float64
 10  type                  82872 non-null  object 
 11  genres                53402 non-null  object 
 12  created_by            21699 non-null  object 
 13  languages             60084 non-null  object 
 14  networks              54035 non-null  object 
 15  origin_country     

In [None]:
tmdb_NP.head()

,name,id,number_of_seasons,number_of_episodes,vote_count,vote_average,adult,first_air_date,last_air_date,popularity,type,genres,created_by,languages,networks,origin_country,production_companies,episode_run_time
0,minutes,39280,0,0,0,0.0,False,,,0.6,Scripted,,,en,,US,,30
1,man,222976,1,5,3,7.7,False,2023-05-28,2023-06-25,4.606,Scripted,Drama,Shuichi Okita,ja,WOWOW Prime,JP,,50
2,erne tur retur,231560,1,10,0,0.0,False,2014-05-15,2014-07-17,0.6,Scripted,,,,,DK,,0
3,,34835,1,12,10,6.4,False,2006-10-06,2006-12-22,9.761,Scripted,other,Shotaro Ishinomori,ja,other,JP,other,30
4,bama,42598,0,0,0,0.0,False,,,0.6,Scripted,Comedy,,,,,,0


# Data Preparation - Save tmdb_NP as a pickle file to my Google Drive

In [None]:
# Import necessary libraries
import os
import pickle
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

# File path for the pickle file in Google Drive.
# The folder name below is specific to this project; change it to a folder in your own Drive.
file_path = '/content/drive/My Drive/1pBDwupkA4bAYBLsQilnu8YrNra72QfKz/tmdb_NP.pkl'

# Create the target directory if it does not exist yet
directory = os.path.dirname(file_path)
if not os.path.exists(directory):
    os.makedirs(directory)
    print(f"Created directory: {directory}")

# Save the tmdb_NP DataFrame to a pickle file
try:
    with open(file_path, 'wb') as file:
        pickle.dump(tmdb_NP, file)
    print(f"tmdb_NP saved to {file_path}")
except NameError:
    # tmdb_NP is not defined yet in this session
    print('Run all the code in the notebook until the DataFrame tmdb_NP is created')

Mounted at /content/drive
tmdb_NP saved to /content/drive/My Drive/1pBDwupkA4bAYBLsQilnu8YrNra72QfKz/tmdb_NP.pkl
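
As an alternative to calling `pickle` directly, pandas offers `DataFrame.to_pickle` and `pd.read_pickle`, which cover the save step above and the matching load step. The sketch below shows a round trip in a temporary directory so it runs without Google Drive; the toy DataFrame and the file name `toy.pkl` are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for tmdb_NP (hypothetical values)
toy = pd.DataFrame({"name": ["a", "b"], "vote_average": [7.7, 6.4]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")
    toy.to_pickle(path)              # write the DataFrame as a pickle file
    restored = pd.read_pickle(path)  # only unpickle files from a trusted source

# The restored frame has the same values and dtypes as the original
print(restored.equals(toy))
```

In the notebook itself, `path` would be the Google Drive `file_path` defined above, after `drive.mount` has been called.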


# Data Preparation - Load tmdb_NP from the pickle file on my Google Drive

In [None]:
import pickle
from google.colab import drive

# Mount Google Drive (if you haven't already)
drive.mount('/content/drive')

# Path of the pickle file in Google Drive (the same location used when saving)
file_path = '/content/drive/My Drive/1pBDwupkA4bAYBLsQilnu8YrNra72QfKz/tmdb_NP.pkl'

# Load the data from the pickle file
try:
    with open(file_path, 'rb') as file:
        tmdb_NP = pickle.load(file)
    print(f"tmdb_NP loaded from {file_path} successfully.")

    # You can now use the loaded DataFrame (tmdb_NP)
    # For example, you can display the first few rows:
    print(tmdb_NP.head())
except FileNotFoundError:
    print(f"Error: File not found at {file_path}. Please check the file path and ensure the file exists in your Google Drive.")
except Exception as e:
    print(f"An error occurred: {e}")


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
tmdb_NP loaded from /content/drive/My Drive/1pBDwupkA4bAYBLsQilnu8YrNra72QfKz/tmdb_NP.pkl successfully.
              name      id  number_of_seasons  number_of_episodes  vote_count  \
0          minutes   39280                  0                   0           0   
1              man  222976                  1                   5           3   
2   erne tur retur  231560                  1                  10           0   
3                    34835                  1                  12          10   
4             bama   42598                  0                   0           0   

   vote_average  adult first_air_date last_air_date  popularity      type  \
0           0.0  False           None          None       0.600  Scripted   
1           7.7  False     2023-05-28    2023-06-25       4.606  Scripted   
2           0.0  False     2014-05-15    2014-07-