# The Next Big Thing: Forecasting Hollywood's Future with Film Analytics
Author: [Francisco de Paula Lemos](https://linkedin.com/in/chicos0)  
Language: English

## About this project

Dear PProductions Team,  

This project will be led by us, a Data & Analytics team from Indicium, a data-driven consulting firm specialized in transforming business challenges into strategic, actionable insights.

We'll guide this analysis using the CRISP-DM framework, a systematic process that ensures a comprehensive approach. It begins with understanding the business objectives, moves to data exploration and preparation, and culminates in modeling and the delivery of a clear and compelling narrative that empowers our clients to succeed.

With a proven track record of helping over 50 companies make informed decisions, our team brings a unique blend of technical expertise and business acumen to maximize your next film's potential for success.



## 1 - Business Understanding

### 1.1 Introduction

Our proposal outlines a strategic project to leverage your cinematic database for a data-driven film development strategy. Our primary goal is to provide actionable insights that will guide your next production decision beyond intuition, maximizing its potential for both **critical acclaim and financial success**.

### 1.2 Business Objectives & Key Questions

As requested, our analysis will focus on answering a set of critical business questions, while also addressing the technical aspects of our analysis. We believe this full-circle approach will not only deliver the key business recommendations you seek but also provide a transparent view into the data science process.

#### Business Objectives:

* **Audience Resonance (Q2a):** We'll examine which film attributes drive high IMDb ratings and audience engagement, supporting the creation of films that are not only profitable but also critically acclaimed.

* **Maximizing ROI (Q2b):** We'll identify the core factors that correlate with high box-office gross, providing a clear data-driven understanding of what makes a film financially successful.

* **Future-Proofing:** We'll explore the potential for predicting a film's success based on key variables, a crucial step for de-risking future investments.

#### Analytical & Technical Approach:

* **Exploratory Data Analysis (Q1):** We will provide a comprehensive EDA, highlighting the main characteristics and relationships between variables.

* **Overview Column Insights (Q2c):** We will investigate whether the film's "Overview" (synopsis) contains linguistic patterns or keywords that can predict its genre and its potential for success.

* **Predictive Modeling (Q3):** We will outline the methodology for predicting a film's potential IMDb score, including:
    * The **type of problem** being solved (Regression vs. Classification).
    * The **key variables** and **transformations** to be used.
    * The most suitable **models** and their respective pros and cons.
    * The **performance metrics** chosen to evaluate the model's accuracy.

By addressing all these points, we aim to provide a clear and actionable recommendation on the type of film PProductions should develop next to ensure both financial and critical success.

## 2 - Data Understanding

This section details our initial exploration of the dataset. We'll focus on the dataset's structure, quality, and key characteristics to build the knowledge needed for hypothesis formation.

* 2.1 **Initial Assessment**: We will begin by loading the raw data, importing libraries and conducting a review of its shape and columns.

* 2.2 **Quality Check**: We will identify and handle any missing, duplicated, or incorrect data entries, as well as inappropriate data types, that could impact our analysis. Additionally, we'll perform a series of data transformations to make complex categorical data more usable. This structural modeling of the data is essential for both our exploratory analysis and the predictive modeling to follow.

* 2.3 **Statistical Summary**: For numerical columns, we will generate descriptive statistics to understand their central tendency, spread, and overall distribution. For categorical columns, we will analyze their frequency distributions.

* 2.4 **Hypothesis Formation**: Based on our initial findings, we will formulate preliminary hypotheses about the relationships between variables, which will be tested in our Exploratory Data Analysis.

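* 2.5 **Exploratory Data Analysis**: We will use visualizations and correlation analysis to explore the distributions of, and relationships between, variables and to test our hypotheses.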


### 2.1 Initial Assessment

#### Importing libraries

In [1]:
import pandas as pd


#### Loading dataset

In [2]:
films = pd.read_csv("desafio_indicium_imdb.csv")
films.head()

Unnamed: 0.1,Unnamed: 0,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,1,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
1,2,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
2,3,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
3,4,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000
4,5,The Lord of the Rings: The Return of the King,2003,U,201 min,"Action, Adventure, Drama",8.9,Gandalf and Aragorn lead the World of Men agai...,94.0,Peter Jackson,Elijah Wood,Viggo Mortensen,Ian McKellen,Orlando Bloom,1642758,377845905


#### Dataset's size

In [3]:
print(f'Films data-set has {films.shape[0]} rows and {films.shape[1]} columns.')

Films data-set has 999 rows and 16 columns.


#### Data dictionary

| Variable | Description |
| :--- | :--- |
| **Series_Title** | Film title |
| **Released_Year** | Release year |
| **Certificate** | Age rating |
| **Runtime** | Running time in minutes |
| **Genre** | Genre |
| **IMDB_Rating** | IMDB rating |
| **Overview** | Film synopsis |
| **Meta_score** | Weighted average of all critics' scores |
| **Director** | Director |
| **Star1** | Actor/actress #1 |
| **Star2** | Actor/actress #2 |
| **Star3** | Actor/actress #3 |
| **Star4** | Actor/actress #4 |
| **No_of_Votes** | Number of votes |
| **Gross** | Gross revenue |

### 2.2 Quality Check

#### Missing values

In [4]:
films.isnull().sum()

Unnamed: 0         0
Series_Title       0
Released_Year      0
Certificate      101
Runtime            0
Genre              0
IMDB_Rating        0
Overview           0
Meta_score       157
Director           0
Star1              0
Star2              0
Star3              0
Star4              0
No_of_Votes        0
Gross            169
dtype: int64

Our initial analysis has highlighted that `Certificate`, `Meta_score`, and `Gross` have missing values. 

`Meta_score` is already stored as a float, and we will convert `Gross` to a float in the next section. A float column natively represents missing data as `NaN`, so we can defer handling these values for now.

However, since `Certificate` is a categorical variable, the absence of a rating is valuable information in itself. That's why we will create a new category, 'Unrated', to fill the missing values, which will allow us to analyze whether the lack of an age rating impacts other variables.

In [5]:
films['Certificate'] = films['Certificate'].fillna('Unrated')
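
As a complementary check, the share of missing values per column is often easier to read than raw counts. The sketch below is only illustrative: `toy` is a small made-up DataFrame, not the real dataset, and it repeats the same 'Unrated' fill used above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "Certificate": ["A", None, "U", None],
    "Meta_score": [100.0, 84.0, np.nan, 96.0],
})

# Share of missing values per column, as a percentage
missing_pct = toy.isnull().mean().mul(100).round(1)
print(missing_pct)

# Treat a missing certificate as its own category
toy["Certificate"] = toy["Certificate"].fillna("Unrated")
print(toy["Certificate"].value_counts())
```

On the real data, the same `isnull().mean()` call would show at a glance how much of `Certificate`, `Meta_score`, and `Gross` is missing.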

#### Data type transformation

In [6]:
films.dtypes

Unnamed: 0         int64
Series_Title      object
Released_Year     object
Certificate       object
Runtime           object
Genre             object
IMDB_Rating      float64
Overview          object
Meta_score       float64
Director          object
Star1             object
Star2             object
Star3             object
Star4             object
No_of_Votes        int64
Gross             object
dtype: object

Based on the analysis above, we've identified several columns that require data type adjustments to ensure the integrity and usability of our dataset. Here's how we'll approach these transformations:

* **`Released_Year`:** This column contains year values and is intended to be a numeric type. Since our initial check indicated no missing values, we will convert the column to an **integer** type to support more efficient data storage and numerical analysis.

* **`Runtime`:** To prepare this column for analysis, we'll first remove the string suffix ('min'). As our initial assessment showed no missing values, we can then safely convert the column to an **integer** type.

* **`Gross`:** This column, representing a film's revenue, contains both commas and missing values (`NaN`). We'll first remove the commas to prepare the data for conversion. We'll then convert the column to a **float** type, which can natively handle `NaN` values, allowing us to defer imputation until the modeling phase.

In [7]:
# Trying to convert 'Released_Year' to integer
try:
    films['Released_Year'] = films['Released_Year'].astype('int64')
except ValueError:
    print("Conversion error: 'Released_Year' contains non-numeric values.")

Conversion error: 'Released_Year' contains non-numeric values.


This initial conversion attempt revealed an underlying data quality issue that needs to be addressed. Before we can successfully convert the column to an integer, we must first locate and correct the incorrect value.

In [8]:
# Converting 'Released_Year' to numeric: errors='coerce' turns non-convertible values into NaN, so the column becomes float
films['Released_Year'] = pd.to_numeric(films['Released_Year'], errors='coerce')
# Finding out which films have no release year
films_with_null_year = films[films['Released_Year'].isnull()]
# Double quotes outside, single quotes inside: nested same-type quotes only work on Python 3.12+
print(f"The movies with null year are: {films_with_null_year['Series_Title']}")

The movies with null year are: 965    Apollo 13
Name: Series_Title, dtype: object


In [9]:
# A web search shows that 'Apollo 13' was released in 1995, so we correct the value manually
films.loc[films['Series_Title'] == 'Apollo 13', 'Released_Year'] = 1995
# Checking if it was correctly replaced by '1995'
print(films.loc[films['Series_Title'] == 'Apollo 13', 'Released_Year'])

965    1995.0
Name: Released_Year, dtype: float64


In [10]:
# Converting 'Released_Year' to integer
films['Released_Year'] = films['Released_Year'].astype('int64')

In [11]:
# Converting Runtime to integer
films['Runtime'] = films['Runtime'].str.replace(' min', '').astype('int64')

In [12]:
# Converting Gross to float
films['Gross'] = films['Gross'].str.replace(',', '').astype('float64')
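
Converting to float is one way to keep missing values in a numeric column. As an alternative sketch (not what this notebook uses), pandas also offers the nullable `Int64` dtype, which stores whole numbers alongside `<NA>`. The `raw` Series below is made-up data in the same shape as the `Gross` column.

```python
import pandas as pd

# Hypothetical raw values in the same shape as the 'Gross' column
raw = pd.Series(["134,966,411", None, "4,360,000"])

# Strip thousands separators, then parse; unparseable values become NaN
as_float = pd.to_numeric(raw.str.replace(",", "", regex=False), errors="coerce")

# Nullable integer dtype keeps whole numbers and represents gaps as <NA>
as_int = as_float.astype("Int64")
print(as_int)
```

Float remains the simpler choice when the column feeds directly into modeling libraries, which is the path followed here.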

#### Duplicated entries

In [13]:
# Finding if there are any duplicated films
films["Series_Title"].duplicated().sum()

np.int64(1)

In [14]:
# Discovering which films are duplicated
films[films.duplicated(['Series_Title'], False)]['Series_Title'].unique()

array(['Drishyam'], dtype=object)

In [15]:
# Finding out if both entries are identical
print(films.loc[films['Series_Title'] == 'Drishyam', 'Overview'])

86     A man goes to extreme lengths to save his fami...
135    Desperate measures are taken by a man who trie...
Name: Overview, dtype: object


This check was particularly interesting: what appeared to be a duplicated film turned out to be two different films sharing the same title, so both entries are kept.
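
A title-only check can flag distinct films as duplicates, as happened here. One way to separate the two cases is to test for duplicates on a combination of columns. The sketch below uses made-up rows (`Film A`, `Film B`, and the director names are placeholders), and the choice of key columns is an assumption.

```python
import pandas as pd

# Hypothetical rows: two different films sharing a title, plus one true duplicate
toy = pd.DataFrame({
    "Series_Title": ["Film A", "Film A", "Film B", "Film B"],
    "Released_Year": [2013, 2015, 1999, 1999],
    "Director": ["Director X", "Director Y", "Director Z", "Director Z"],
})

# Title only: flags both pairs
by_title = toy.duplicated(subset=["Series_Title"], keep=False)

# Title + year + director: flags only the true duplicate
by_key = toy.duplicated(
    subset=["Series_Title", "Released_Year", "Director"], keep=False
)

print(toy.assign(dup_title=by_title, dup_key=by_key))
```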

#### Data transformations

In order to make our complex categorical data more usable for analysis and further modeling, we will perform the following transformations:

* The `Genre` column, which currently holds multiple values per entry, will be split into individual columns.
* To unify the diverse rating systems found in the `Certificate` column, we will also create a `Standardized_Rating` column with three possible entries: 'All Ages', 'Parental Guidance', and 'Adults Only'.

Once these transformations are complete, we will drop the `Genre`, `Certificate`, and `Unnamed: 0` columns, as they will no longer serve an analytical purpose in our study.

In [16]:
# Step 1: Split the 'Genre' column into separate columns
genres = films['Genre'].str.split(', ', expand=True)

# Step 2: Rename the new columns to 'Genre1', 'Genre2', etc.
genres.columns = [f'Genre{i+1}' for i in range(len(genres.columns))]

# Step 3: Add the new columns to the original DataFrame
films = pd.concat([films, genres], axis=1)

# Step 4: Report the number of columns created
print(f"Number of new 'Genre' columns created: {len(genres.columns)}")

Number of new 'Genre' columns created: 3
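
Positional columns (`Genre1`, `Genre2`, `Genre3`) preserve the listed order, but for modeling it can be convenient to have one indicator column per genre instead. The following is a minimal sketch of that alternative using `Series.str.get_dummies` on made-up rows; it is not part of the pipeline above.

```python
import pandas as pd

# Hypothetical genre strings in the same "A, B, C" format as the 'Genre' column
genre = pd.Series(["Crime, Drama", "Action, Crime, Drama", "Comedy"])

# One 0/1 indicator column per distinct genre
genre_flags = genre.str.get_dummies(sep=", ")
print(genre_flags)

# Column sums give the number of films per genre
print(genre_flags.sum().sort_values(ascending=False))
```

With this layout, a film's genres no longer depend on the order in which they were listed.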


In [17]:
# Analyzing unique values in 'Certificate' column to further standardize it
print(films['Certificate'].unique())


['A' 'UA' 'U' 'PG-13' 'R' 'Unrated' 'PG' 'G' 'Passed' 'TV-14' '16' 'TV-MA'
 'GP' 'Approved' 'TV-PG' 'U/A']


In [18]:
# Defining the mapping function to convert old certificate values to new categories
def standardize_rating(certificate):
    if certificate in ['U', 'G', 'Passed', 'Approved', 'TV-PG', 'Unrated']:
        return 'All Ages'
    elif certificate in ['PG', 'PG-13', 'UA', 'TV-14', '16', 'GP', 'U/A']:
        return 'Parental Guidance' 
    elif certificate in ['A', 'R', 'TV-MA']:
        return 'Adults Only'
    else:
        return 'Not classified'

# Applying the function to create a 'Standardized_Rating' column
films['Standardized_Rating'] = films['Certificate'].apply(standardize_rating)

# Checking the new category distribution to confirm if the mapping worked
print(films['Standardized_Rating'].value_counts())
print(f"Total of entries: {films['Standardized_Rating'].value_counts().sum()}")

Standardized_Rating
All Ages             396
Adults Only          343
Parental Guidance    260
Name: count, dtype: int64
Total of entries: 999
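
The same grouping can also be written as a lookup table, which keeps the rules in one place and makes them easy to audit. This is a sketch of an equivalent approach, not a replacement for the function above; `RATING_GROUPS`, `CERT_TO_GROUP`, and the sample values are illustrative names and data.

```python
import pandas as pd

# Same grouping as standardize_rating, expressed as a lookup table
RATING_GROUPS = {
    "All Ages": ["U", "G", "Passed", "Approved", "TV-PG", "Unrated"],
    "Parental Guidance": ["PG", "PG-13", "UA", "TV-14", "16", "GP", "U/A"],
    "Adults Only": ["A", "R", "TV-MA"],
}

# Invert to {certificate: group}
CERT_TO_GROUP = {
    cert: group for group, certs in RATING_GROUPS.items() for cert in certs
}

# Hypothetical sample; "X" is a made-up code that is absent from the table
sample = pd.Series(["A", "UA", "U", "Unrated", "X"])
standardized = sample.map(CERT_TO_GROUP).fillna("Not classified")
print(standardized.tolist())
```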


In [19]:
# Dropping the now unnecessary columns
films = films.drop(columns=['Genre', 'Certificate', 'Unnamed: 0'])

In [20]:
films.head()

Unnamed: 0,Series_Title,Released_Year,Runtime,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross,Genre1,Genre2,Genre3,Standardized_Rating
0,The Godfather,1972,175,9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411.0,Crime,Drama,,Adults Only
1,The Dark Knight,2008,152,9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444.0,Action,Crime,Drama,Parental Guidance
2,The Godfather: Part II,1974,202,9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000.0,Crime,Drama,,Adults Only
3,12 Angry Men,1957,96,9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000.0,Crime,Drama,,All Ages
4,The Lord of the Rings: The Return of the King,2003,201,8.9,Gandalf and Aragorn lead the World of Men agai...,94.0,Peter Jackson,Elijah Wood,Viggo Mortensen,Ian McKellen,Orlando Bloom,1642758,377845905.0,Action,Adventure,Drama,All Ages


### 2.3 Statistical Summary

This section provides an overview of the dataset's key characteristics through descriptive statistics. 

By using methods like `describe()` and `value_counts()` from the pandas library, we can easily compute metrics that summarize the data's central tendencies, spread, and distribution.

* For our **numerical variables**, the analysis will give us a clear picture of the average `IMDB_Rating`, `Meta_score`, and the typical `Runtime`. We will also see the overall range of `Released_Year`, the spread of `No_of_Votes`, and the overall range of `Gross` revenue. These statistics are essential for understanding the distribution of our data, identifying potential outliers, and spotting patterns that will guide our analysis.

* For our **categorical variables**, we will focus on analyzing the frequency distribution. This will allow us to identify the most common `Director`, the prevalent `Standardized_Rating`, and the most frequent `Genre` categories. We will also analyze the frequency of our main cast members (`Star1`, `Star2`, etc.), providing valuable context on the composition of our dataset and informing our hypotheses about which attributes may drive a film's success. The `Overview` column, a text variable, will be analyzed separately using specific linguistic techniques.

#### Numerical Variables Summary

Here is a concise overview of the metrics we'll be using.

* **Mean & Standard Deviation:** The **mean** is the simple arithmetic average of a dataset. It is most informative when the data is roughly symmetric, since extreme values can pull it away from the typical observation. The **standard deviation (std)** complements the mean by measuring the amount of variation or dispersion in a set of values, showing how widely the data points are spread out.

* **Extreme Values:** The **minimum (min)** and **maximum (max)** values define the full range of the dataset, indicating the lowest and highest points observed for a given variable.

* **Quartiles (25%, 50%, 75%):** Quartiles divide a dataset into four equal parts.
    * The **first quartile (Q1)** marks the point below which 25% of the data falls.
    * The **second quartile (Q2)**, also known as the **median**, is the middle value of the dataset, with 50% of the data points below it and 50% above.
    * The **third quartile (Q3)** is the value below which 75% of the data is located.
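
To make these definitions concrete, the sketch below computes the quartiles on a small made-up series and applies the common 1.5 × IQR rule of thumb to flag outliers. The values and the 1.5 multiplier are illustrative assumptions, not results from the film dataset.

```python
import pandas as pd

# Hypothetical values with one obvious outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

q1, median, q3 = values.quantile([0.25, 0.50, 0.75])
iqr = q3 - q1  # interquartile range: spread of the middle 50%

# Rule of thumb: flag points beyond 1.5 * IQR from the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(f"mean={values.mean():.1f}, median={median:.1f}")
print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(outliers.tolist())
```

Here a single extreme value pulls the mean well above the median, which is why both are worth reading together.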

In [22]:
num_variables = ['IMDB_Rating', 'Meta_score', 'Released_Year', 'Runtime', 'No_of_Votes', 'Gross']
films[num_variables].describe().round(2)

Unnamed: 0,IMDB_Rating,Meta_score,Released_Year,Runtime,No_of_Votes,Gross
count,999.0,842.0,999.0,999.0,999.0,830.0
mean,7.95,77.97,1991.22,122.87,271621.42,68082570.0
std,0.27,12.38,23.3,28.1,320912.62,109807600.0
min,7.6,28.0,1920.0,45.0,25088.0,1305.0
25%,7.7,70.0,1976.0,103.0,55471.5,3245338.0
50%,7.9,79.0,1999.0,119.0,138356.0,23457440.0
75%,8.1,87.0,2009.0,137.0,373167.5,80876340.0
max,9.2,100.0,2020.0,321.0,2303232.0,936662200.0


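Reading the table above:

* **`IMDB_Rating`** ranges only from 7.6 to 9.2, with a mean of 7.95 and a standard deviation of 0.27, so the dataset contains exclusively well-rated films with little spread.
* **`Meta_score`** is available for 842 films and averages 77.97 (median 79), but it varies far more than the IMDb rating, from 28 to 100.
* **`Released_Year`** spans 1920 to 2020, with a median of 1999, meaning half of the films were released in 1999 or later.
* **`Runtime`** has a median of 119 minutes and a mean of 122.87, with extremes of 45 and 321 minutes.
* **`No_of_Votes`** is right-skewed: the mean (271,621) is almost twice the median (138,356), and the maximum reaches 2,303,232.
* **`Gross`** is available for 830 films and is strongly right-skewed: the mean (about 68.1 million) is nearly three times the median (about 23.5 million), with values ranging from 1,305 to about 936.7 million.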

#### Categorical Variables Summary

In [23]:
# Distribution of Directors and Standardized_Rating
cat_variables = ['Director', 'Standardized_Rating']
films[cat_variables].describe().round(2)

Unnamed: 0,Director,Standardized_Rating
count,999,999
unique,548,3
top,Alfred Hitchcock,All Ages
freq,14,396


In [24]:
# Distribution of Genres and Stars: creating a combined DataFrame for better visualization 
genre_series = films[['Genre1', 'Genre2', 'Genre3']].stack()
star_series = films[['Star1', 'Star2', 'Star3', 'Star4']].stack()
combined_df = pd.DataFrame({
    'Genre': genre_series,
    'Star': star_series
})
combined_df.describe().round(2)

Unnamed: 0,Genre,Star
count,2540,3996
unique,21,2707
top,Drama,Robert De Niro
freq,723,17


In [25]:
# Snippet to show the top 10 most frequent Directors, Stars and Genres
print("Director", films["Director"].value_counts().head(10),"\n")
print("Star:","\n",star_series.value_counts().head(10),"\n")
print("Genre:","\n",genre_series.value_counts().head(10))

Director Director
Alfred Hitchcock    14
Steven Spielberg    13
Hayao Miyazaki      11
Martin Scorsese     10
Akira Kurosawa      10
Billy Wilder         9
Stanley Kubrick      9
Woody Allen          9
Clint Eastwood       8
David Fincher        8
Name: count, dtype: int64 

Star: 
 Robert De Niro       17
Tom Hanks            14
Al Pacino            13
Brad Pitt            12
Clint Eastwood       12
Christian Bale       11
Matt Damon           11
Leonardo DiCaprio    11
James Stewart        10
Ethan Hawke           9
Name: count, dtype: int64 

Genre: 
 Drama        723
Comedy       233
Crime        209
Adventure    196
Action       189
Thriller     137
Romance      125
Biography    109
Mystery       99
Animation     82
Name: count, dtype: int64


### 2.5 Exploratory Data Analysis (EDA)


Our EDA will be a deep dive into the dataset's characteristics, providing a visual and statistical foundation for our insights. Through this in-depth exploration, we will:

* Generate visualizations like **histograms** and **boxplots** to understand the distribution of key numerical variables, such as `IMDB_Rating`, `Meta_score` and `Gross`, helping us identify central tendencies and outliers.
* Perform a **correlation analysis** to uncover relationships between variables that are tied to a film's financial and critical success. We will also use **bar charts** to visually explore the distribution and trends within categorical variables such as genre (`Genre1`, `Genre2`, `Genre3`) and `Standardized_Rating`.

This comprehensive exploration will allow us to validate our hypotheses, address core business questions, and inform the subsequent stages of the project, including our text analysis of the `Overview` column and our predictive modeling efforts.
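
As a rough sketch of the correlation step, the example below builds a Pearson correlation matrix and a rank-based (Spearman-style) correlation for one pair of columns. The column names mirror the dataset, but every value in `toy` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names mirror the dataset, values are made up
toy = pd.DataFrame({
    "IMDB_Rating": [7.6, 7.9, 8.1, 8.5, 9.0],
    "No_of_Votes": [30_000, 120_000, 400_000, 900_000, 2_000_000],
    "Gross": [1e6, 2e7, np.nan, 1.5e8, 5e8],
})

# Pearson correlation matrix; missing values are excluded pairwise
corr = toy.corr()
print(corr.round(2))

# Correlating the ranks gives a Spearman-style measure,
# which is less sensitive to skew and outliers
rank_corr = toy["IMDB_Rating"].rank().corr(toy["No_of_Votes"].rank())
print(round(rank_corr, 2))
```

Because `No_of_Votes` and `Gross` are heavily skewed in the real data, a rank-based measure may be a useful companion to the Pearson matrix.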

## 3 - Data Preparation

Consolidate the categorical variables into a small number of categories, and apply transformations (log, etc.) to the numerical variables as needed.
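
A minimal sketch of both ideas on made-up data follows. The `min_count` threshold, the "Other" label, and the new column names are assumptions to be tuned on the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Gross": [1e3, 3e6, 2e7, 9e8],
    "Genre1": ["Drama", "Drama", "Western", "Film-Noir"],
})

# log1p compresses the long right tail and is defined at zero
toy["Gross_log"] = np.log1p(toy["Gross"])

# Collapse categories seen fewer than `min_count` times into "Other"
min_count = 2  # assumed threshold
counts = toy["Genre1"].value_counts()
rare = counts[counts < min_count].index
toy["Genre1_grouped"] = toy["Genre1"].where(~toy["Genre1"].isin(rare), "Other")

print(toy)
```

The log transform is reversible with `np.expm1`, so predictions made on the log scale can be mapped back to the original units.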

- Now we have to do something about the missing values. `Runtime` is a numerical variable, so we can fill missing values with the appropriate measure of central tendency. To dec

In [26]:
#films['Runtime'] = films['Runtime'].fillna(films['Runtime'].mean()) 
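
In this dataset, the missing-value check in section 2.2 shows gaps in `Meta_score` and `Gross` rather than in `Runtime`. For a skewed column such as `Gross`, the median is usually a safer fill value than the mean because it is not pulled toward extreme values. The sketch below illustrates this on made-up data and keeps a flag recording which rows were imputed; the `Gross_missing` column name is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing values
toy = pd.DataFrame({"Gross": [1e6, 2e7, np.nan, 9e8, np.nan]})

# Record which rows were missing before filling them
toy["Gross_missing"] = toy["Gross"].isna()

# Median is robust to the long right tail
median_gross = toy["Gross"].median()
toy["Gross"] = toy["Gross"].fillna(median_gross)

print(toy)
```

Keeping the indicator column lets a later model learn whether missingness itself carries information.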