# Machine Learning Project: Predicting the Success of a Movie


> ## Description:
> In the contemporary era, the film industry continues to evolve rapidly, investing increasingly vast resources into production and marketing. However, despite these advancements, predicting the commercial success of a feature film remains a complex challenge fraught with financial risk.

> This project aims to reduce that risk by leveraging the power of Machine Learning. By analyzing historical data (such as budget, casting, genre, and release timing), we aim to build a predictive model capable of forecasting a movie's success. This tool seeks to provide data-driven insights to mitigate risks and optimize decision-making within the entertainment sector.

## Data Importation 

In [1]:
import pandas as pd

First, we import the CSV file containing the film metadata.

In [2]:
raw_data = pd.read_csv(r"C:\Users\maeva\OneDrive - De Vinci Higher Education (DVHE)\ESILV\A4\Machine Learning\Projet\movies_metadata.csv") # replace with your own path


In [3]:
raw_data.head(2)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0


In [4]:
raw_data.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

In [5]:
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

### Raw Data Overview

Here, we display the first few rows, the column names, and the dataset info to perform an initial visual inspection of the dataset. This step helps us identify key structural characteristics:

> Data Format: We can observe that columns like genres and belongs_to_collection contain complex nested structures (dictionaries/JSON) that will need parsing.

> Missing Values: NaNs are already visible in columns such as homepage or belongs_to_collection.

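> Variable Types: A mix of numerical (budget) and categorical (original_language) data is present. Note that info() reports most columns, including budget, with dtype object, so the numerical ones will need converting before use.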
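The nested structures mentioned above appear to be Python-style literals stored as strings. A minimal sketch of one way to parse them, assuming that format holds for every row, is shown below. The helper name `extract_names` and the toy values are hypothetical and only mimic the shape seen in the `genres` column.

```python
import ast

import pandas as pd


def extract_names(cell):
    """Return the 'name' values from a stringified list of dicts, or [] if unparsable."""
    if not isinstance(cell, str):
        return []  # NaN / None cells
    try:
        parsed = ast.literal_eval(cell)  # safe evaluation of Python literals only
    except (ValueError, SyntaxError):
        return []
    if not isinstance(parsed, list):
        return []
    return [item["name"] for item in parsed if isinstance(item, dict) and "name" in item]


# Toy frame (made-up rows) in the same format as the `genres` column
sample = pd.DataFrame({
    "genres": [
        "[{'id': 16, 'name': 'Animation'}, {'id': 12, 'name': 'Adventure'}]",
        "[]",
        None,
    ]
})
sample["genre_names"] = sample["genres"].apply(extract_names)
print(sample["genre_names"].tolist())
```

`ast.literal_eval` is used here rather than `json.loads` because the strings use single quotes, which is not valid JSON.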

### Preliminary Data Cleaning

We start by cleaning the raw dataset to keep only usable samples: we drop duplicates and handle unknown values (NaN or 0). This reduces the dataset size but ensures that every remaining entry is valid for training.

In [6]:
# Remove duplicate rows
print(f"raw_data shape before removing duplicates : {raw_data.shape}")
raw_data = raw_data.drop_duplicates()
print(f"raw_data shape after : {raw_data.shape}")

raw_data shape before removing duplicates : (45466, 24)
raw_data shape after : (45453, 24)


In [7]:
# Missing values
print("Missing values per column:")
raw_data.isna().sum()

Missing values per column:


adult                        0
belongs_to_collection    40959
budget                       0
genres                       0
homepage                 37673
id                           0
imdb_id                     17
original_language           11
original_title               0
overview                   954
popularity                   5
poster_path                386
production_companies         3
production_countries         3
release_date                87
revenue                      6
runtime                    263
spoken_languages             6
status                      87
tagline                  25045
title                        6
video                        6
vote_average                 6
vote_count                   6
dtype: int64

The following columns either have many missing values or are not relevant to the study, so we remove them from the dataset:

In [8]:
raw_data = raw_data.drop(columns=["belongs_to_collection", "homepage", "tagline", "poster_path", "overview"])
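The columns above were chosen by hand, partly on relevance. As a complement, here is a small sketch of how the share of missing values per column could be computed to support that kind of decision. The toy frame, its values, and the 0.5 threshold are all assumptions made for illustration, not the criterion used in this project.

```python
import numpy as np
import pandas as pd

# Toy frame: column names borrowed from the dataset, values made up
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "homepage": [np.nan, np.nan, np.nan, "http://example.com"],
    "runtime": [90.0, np.nan, 120.0, 100.0],
})

missing_share = toy.isna().mean()  # fraction of NaN per column
threshold = 0.5                    # hypothetical cut-off
sparse_cols = missing_share[missing_share > threshold].index.tolist()
print(sparse_cols)
```

Columns listed in `sparse_cols` would be candidates for removal; columns that are complete but irrelevant still need to be picked manually.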

In [9]:
cols_to_check = ["imdb_id", "original_language", "popularity", "production_companies", "production_countries",
                 "release_date", "revenue", "runtime", "spoken_languages", "status", "title", "video", "vote_average", "vote_count"]
raw_data = raw_data.dropna(subset=cols_to_check)

# Verification: no missing values should remain
raw_data.isna().sum()

adult                   0
budget                  0
genres                  0
id                      0
imdb_id                 0
original_language       0
original_title          0
popularity              0
production_companies    0
production_countries    0
release_date            0
revenue                 0
runtime                 0
spoken_languages        0
status                  0
title                   0
video                   0
vote_average            0
vote_count              0
dtype: int64

Now that we have dealt with the missing values, we check for 0 values in the numerical columns.


In [10]:
# Count the number of zeros per column
zeros_count = (raw_data == 0).sum()

# Keep only the columns that contain at least one zero
zeros_count = zeros_count[zeros_count > 0]

print("Number of zeros per column:")
print(zeros_count)

Number of zeros per column:
popularity         19
revenue         37620
runtime          1514
video           44930
vote_average     2823
vote_count       2725
dtype: int64


### Data Cleaning: Removing Financials & Missing Metadata

Since our focus is on critical success, we remove the financial column (revenue), in which a high proportion of values (37,620 rows) are zeros, i.e. effectively missing.

We also drop the technical column (video), which is irrelevant to quality prediction, and remove the rows with missing essential metadata such as runtime, popularity, vote_average, and vote_count.

In [11]:
# 1. Drop financial and technical columns
cols_to_drop = ["revenue", "video"]
raw_data.drop(columns=cols_to_drop, inplace=True, errors='ignore')

In [None]:
# 2. Drop rows with invalid values

# Convert popularity to numeric (entries that cannot be parsed become NaN)
raw_data['popularity'] = pd.to_numeric(raw_data['popularity'], errors='coerce')

# Keep only rows with strictly positive popularity, runtime and vote_count
mask = (raw_data["popularity"] > 0) & (raw_data["runtime"] > 0) & (raw_data["vote_count"] > 0)
raw_data = raw_data[mask]
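To make the effect of `errors='coerce'` concrete, the sketch below applies the same conversion and mask to a small toy frame. The values are made up; only the column names come from the dataset.

```python
import pandas as pd

# Toy frame with a mixed-type popularity column, as in the raw data
toy = pd.DataFrame({
    "popularity": ["21.9", 3.5, "not a number", 0],
    "runtime": [81.0, 104.0, 95.0, 90.0],
    "vote_count": [5415.0, 2413.0, 10.0, 3.0],
})

# Unparsable strings become NaN instead of raising an error
toy["popularity"] = pd.to_numeric(toy["popularity"], errors="coerce")

# NaN > 0 evaluates to False, so coerced rows are dropped along with the zeros
mask = (toy["popularity"] > 0) & (toy["runtime"] > 0) & (toy["vote_count"] > 0)
kept = toy[mask]
print(kept)
```

Only the first two rows survive: the third is removed because its popularity could not be parsed, the fourth because its popularity is zero.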

### Data Enrichment: Merging Cast and Crew

To improve the predictive power of our model, we are enriching the dataset with information regarding the movie's Cast (actors) and Crew (directors, producers, writers).

This data is stored in a separate CSV file. We will merge it with our main dataset using each movie's unique id as the key. This will allow us to analyze whether "Star Power" or specific directors correlate with a movie's success.
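A minimal sketch of such a merge is given below on toy data. Since `info()` reports `id` with dtype object in the main dataset, the sketch first aligns the key types. The `credits` frame, its `cast` and `crew` column names, its integer `id`, and all values are assumptions made for illustration, not the actual file layout.

```python
import pandas as pd

# Toy stand-in for the main dataset: id stored as strings, one malformed entry
movies = pd.DataFrame({
    "id": ["862", "8844", "bad-id"],
    "title": ["Toy Story", "Jumanji", "Unknown"],
})

# Hypothetical stand-in for the cast/crew file
credits = pd.DataFrame({
    "id": [862, 8844, 999],
    "cast": ["cast A", "cast B", "cast C"],
    "crew": ["crew A", "crew B", "crew C"],
})

# Align the key type: coerce to numeric, drop unparsable ids, cast to int
movies = (
    movies.assign(id=pd.to_numeric(movies["id"], errors="coerce"))
    .dropna(subset=["id"])
    .astype({"id": "int64"})
)

# Inner join keeps only movies present in both tables;
# validate raises if an id is duplicated on either side
merged = movies.merge(credits, on="id", how="inner", validate="one_to_one")
print(merged)
```

An inner join is only one option: `how="left"` would keep movies without credits, at the cost of new missing values in the cast and crew columns.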