# Movie Success Predictor

## 1. Installing & Importing Required Libraries

```pip install numpy``` <br>
```pip install pandas``` <br>
```pip install seaborn``` <br>
```pip install matplotlib``` <br>

```pip install kagglehub``` <br>
```pip install kagglehub[pandas-datasets]```

In [2]:
import json
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

In [None]:
import kagglehub
from kagglehub import KaggleDatasetAdapter

## 2. Loading the dataset

In [4]:
file_path_credits = "tmdb_5000_credits.csv"
file_path_movies = "tmdb_5000_movies.csv"

### 2.1. Loading the Dataset with KaggleHub

In [None]:
df_credits = kagglehub.load_dataset(
  KaggleDatasetAdapter.PANDAS,
  "tmdb/tmdb-movie-metadata",
  file_path_credits,
  # Provide any additional arguments like 
  # sql_query or pandas_kwargs. See the 
  # documentation for more information:
  # https://github.com/Kaggle/kagglehub/blob/main/README.md#kaggledatasetadapterpandas
)

df_movies = kagglehub.load_dataset(
  KaggleDatasetAdapter.PANDAS,
  "tmdb/tmdb-movie-metadata",
  file_path_movies,
  # Provide any additional arguments like 
  # sql_query or pandas_kwargs. See the 
  # documentation for more information:
  # https://github.com/Kaggle/kagglehub/blob/main/README.md#kaggledatasetadapterpandas
)

### 2.2. Loading the CSV Locally

If the KaggleHub import doesn't work, we can also import the dataset from a local CSV file.

In [5]:
df_credits = pd.read_csv(f"./data/{file_path_credits}")
df_movies = pd.read_csv(f"./data/{file_path_movies}")
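
The two loading paths above can also be combined so that the local CSV is only read when the remote load fails. The following is a minimal sketch, not part of the original notebook: `load_with_fallback` and its parameters are hypothetical names, and the remote loader is passed in as a callable so the helper does not depend on KaggleHub itself.

```python
import pandas as pd


def load_with_fallback(remote_loader, local_path):
    """Return remote_loader(); if it raises, read local_path as a CSV instead."""
    try:
        return remote_loader()
    except Exception as exc:  # e.g. network, authentication, or missing-package errors
        print(f"Remote load failed ({exc!r}); falling back to {local_path}")
        return pd.read_csv(local_path)


# Hypothetical usage with the names defined earlier in this notebook:
# df_movies = load_with_fallback(
#     lambda: kagglehub.load_dataset(
#         KaggleDatasetAdapter.PANDAS, "tmdb/tmdb-movie-metadata", file_path_movies
#     ),
#     f"./data/{file_path_movies}",
# )
```

Catching the broad `Exception` is a deliberate simplification here; a stricter version would catch only the errors the remote loader is known to raise.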

## 3. Cleaning the data

### 3.1. Merging the two datasets

In [6]:
df_credits.rename(columns={"movie_id": "id"}, inplace=True)

In [23]:
df = df_movies.merge(df_credits, on='id')
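
Before relying on the merged frame, it can help to check that every `id` appears exactly once on each side and to see which rows fail to match. The sketch below uses small made-up frames (not the TMDB data) with pandas' `validate` and `indicator` options, and an outer join instead of the default inner join so that unmatched rows stay visible.

```python
import pandas as pd

# Toy stand-ins for df_movies and df_credits (illustrative values only)
movies = pd.DataFrame({"id": [1, 2, 3], "budget": [10, 20, 30]})
credits = pd.DataFrame({"id": [1, 2, 4], "cast": ["a", "b", "d"]})

# validate="one_to_one" raises pandas.errors.MergeError if a key is duplicated on either side;
# indicator=True adds a "_merge" column telling where each row came from
merged = movies.merge(credits, on="id", how="outer", validate="one_to_one", indicator=True)

# Rows present in only one of the two frames
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["id", "_merge"]])
```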

### 3.2. Remove the unnecessary columns

In [24]:
df = df.drop(columns=[
    'homepage', 'original_title', 'overview', 'tagline', 'status',
    'title_x', 'title_y', 'spoken_languages', 'production_countries',
    'production_companies', 'crew', 'keywords', 'id',
])

> It would be great to analyse the cast and its popularity, but for simplicity's sake, we'll just drop it.

In [25]:
df = df.drop('cast', axis = 1)

### 3.3. Remove missing values

> I would normally impute the missing values (recorded as 0) in the **budget** and **revenue** columns, but too many are missing to extrapolate them from the rest of the dataset.
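
One way to put a number on that judgment is to measure the share of rows where a value is recorded as 0. This is a minimal sketch on a made-up frame, assuming (as the cell below does) that 0 stands for "missing"; the same expressions could be run on `df` directly.

```python
import pandas as pd

# Illustrative values only, not taken from the dataset
sample = pd.DataFrame({"budget": [0, 100, 0, 250], "revenue": [0, 300, 50, 0]})

# Fraction of zeros per column
zero_share = (sample[["budget", "revenue"]] == 0).mean()

# Fraction of rows that would be dropped because either value is zero
either_zero = ((sample["budget"] == 0) | (sample["revenue"] == 0)).mean()

print(zero_share)
print(either_zero)
```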

In [26]:
df = df.drop(df[df['genres'] == '[]'].index)
df = df.drop(df[df['budget'] == 0].index)
df = df.drop(df[df['revenue'] == 0].index)
df = df.drop(df[df['runtime'].isnull()].index)
df = df.drop(df[df['release_date'].isnull()].index)

### 3.4. Convert Strings to numerical format

#### 3.4.1. Map movie languages to a language_id

In [None]:
print(df['original_language'].unique())

In [27]:
languages_dict = {}
language_id = 0
for lang in df['original_language'].unique():
    languages_dict[lang] = language_id
    language_id += 1
df['original_language'] = df['original_language'].map(languages_dict)
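
The same mapping can likely be obtained with `pd.factorize`, which assigns integer codes in order of first appearance, as the loop above does. A minimal sketch on a made-up series (`langs` and `languages_lookup` are illustrative names):

```python
import pandas as pd

langs = pd.Series(["en", "fr", "en", "ja", "fr"])

# codes: one integer per row; uniques: the distinct values in order of first appearance
codes, uniques = pd.factorize(langs)

# Reverse lookup from id back to language code
languages_lookup = dict(enumerate(uniques))

print(list(codes))
print(languages_lookup)
```

Keeping the reverse lookup around makes it possible to translate ids back to language codes when inspecting results later.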

#### 3.4.2. Decomposing the release date

In [28]:
date_order = ["year", "month", "day"]

# Parse the dates once, then extract each part as a number
# (splitting the string would leave "year", "month" and "day" as text)
release_date = pd.to_datetime(df["release_date"])
for part in date_order:
    df[part] = getattr(release_date.dt, part)

df = df.drop("release_date", axis=1)

#### 3.4.3 Separate each genre and actor into a column

##### 3.4.3.1 Convert the genres JSON string to a list

In [None]:
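def extract_genres(json_str):
    """Return the genre names from a JSON-encoded list of genre objects."""
    try:
        items = json.loads(json_str.replace("'", '"'))
        return [g["name"] for g in items]
    except (AttributeError, TypeError, ValueError, KeyError):
        # Non-string input, invalid JSON, or an entry without a "name" key
        return []
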
df["genres"] = df["genres"].apply(extract_genres)
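
To make the conversion concrete, here is a small standalone sketch of the same idea applied to one string. The sample value is an assumption about how a `genres` cell looks (a JSON list of objects with `id` and `name` keys), and `genre_names` is a hypothetical helper, not the function used above.

```python
import json


def genre_names(json_str):
    """Parse a JSON list of genre objects and return their names (empty list on failure)."""
    try:
        return [g["name"] for g in json.loads(json_str)]
    except (TypeError, ValueError, KeyError):
        return []


# Assumed shape of a single cell in the genres column
sample = '[{"id": 28, "name": "Action"}, {"id": 878, "name": "Science Fiction"}]'

print(genre_names(sample))
print(genre_names("not json"))
```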

##### 3.4.3.2 Create a column for each genre

In [30]:
genres_set = set()
for film_genres in df["genres"]:
    for elem in film_genres:
        genres_set.add(f"genre_{elem.lower()}")

In [31]:
genre_order = sorted(genres_set)
for genre in genre_order:
    # Match in lowercase: rebuilding the name with .title() would turn
    # "tv movie" into "Tv Movie", which never equals the original "TV Movie"
    name = genre[len("genre_"):]
    df[genre] = df["genres"].apply(lambda names: int(name in (n.lower() for n in names)))

df = df.drop("genres", axis=1)
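
As an alternative to building the indicator columns one by one, the genre lists can be expanded with `explode` and encoded with `pd.get_dummies`. This is a minimal sketch on a made-up frame; `films` and `indicators` are illustrative names, and the column naming (`genre_` plus the lowercased genre) mirrors the convention used above.

```python
import pandas as pd

# Illustrative data: each row holds a list of genre names
films = pd.DataFrame({"genres": [["Action", "TV Movie"], ["Drama"], ["Action"]]})

# One row per (film, genre) pair; the original row label is repeated in the index
exploded = films["genres"].explode()

# One indicator column per genre, then collapse back to one row per film
indicators = pd.get_dummies(exploded).groupby(level=0).max().astype(int)
indicators.columns = [f"genre_{name.lower()}" for name in indicators.columns]

print(indicators)
```

The result shares its index with `films`, so it could be attached with `films.join(indicators)` before dropping the list column.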