# Project: Investigate a Dataset - TMDb Movie Dataset

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction

### Dataset Description

This project centers on a cleaned version of the **TMDb movie dataset** originally sourced from [Kaggle](https://www.kaggle.com/tmdb/tmdb-movie-metadata). The dataset covers roughly 10,000 movies and includes fields such as:

- **title**: The official title of the movie.
- **id**: A unique identifier for each movie.
- **genres**: A pipe-separated list of genres associated with the film.
- **cast**: A pipe-separated list of main cast members.
- **director**: The name of the primary director for the film.
- **budget_adj** and **revenue_adj**: The budget and revenue in terms of 2010 dollars, adjusted for inflation.
- **popularity**: A TMDb-defined metric reflecting user activity and interest.
- **vote_average**: The average user rating on TMDb, which will be a key focus in our analysis.
- **vote_count**: The total number of user votes.
- **release_date**: The film’s release date.

For this project, we will look specifically at **director–actor collaborations** to explore how certain factors may correlate with audience reception. After some data wrangling, we plan to derive new variables such as:

- **`collaboration_count`**: How many times a given director–actor pair has worked together.
- **`dominant_genre`**: The most frequently appearing genre for each director–actor pair.
- **`avg_vote_average`**: The average rating (on a scale of 0–10) for all films created by that pair.

Our goal is to investigate relationships between these variables (and potentially others) to gain insight into which director–actor teams may achieve higher audience ratings and whether certain genres appear more strongly correlated with positive viewer feedback. This exploration will be primarily **descriptive** and **exploratory**—we are looking for interesting patterns in the data rather than drawing definitive causal conclusions.



### Question(s) for Analysis

Our main goal in this project is to explore **how certain characteristics of director–actor pairs relate to audience reception** (as measured by **`avg_vote_average`**). To guide our investigation, we will focus on the following questions:

1. **Does the number of collaborations between a director and an actor (`collaboration_count`) correspond to higher or lower average audience ratings (`avg_vote_average`)?**  
2. **Does the primary genre in which a director–actor pair collaborates (`dominant_genre`) appear to influence their films’ average ratings?**  
3. **Is there a relationship between additional factors (e.g., `popularity`, `avg_revenue_adj`, or release years) and audience ratings for these pairs?**

We chose to use **`avg_vote_average`** as our **dependent variable**, given that it directly measures audience reception. Our **independent variables** of interest will be:

- **`collaboration_count`** (numeric)  
- **`dominant_genre`** (categorical)  
- **A third variable** (e.g., `popularity`, `avg_revenue_adj`, or another feature of interest)
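
The pair-level `avg_revenue_adj` mentioned above is not one of the original columns, so it would have to be derived. The sketch below shows one possible way to do that on a small made-up table; the toy data, the `toy` and `avg_revenue` names, and the decision to treat a revenue of 0 as "not reported" are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for an exploded movie table (one row per director, actor, movie).
toy = pd.DataFrame({
    "director": ["D1", "D1", "D1", "D2"],
    "actor": ["A1", "A1", "A1", "A2"],
    "revenue_adj": [100.0, 0.0, 300.0, 0.0],
})

# Assumption: a revenue of 0 means "not reported", so mask it before averaging.
toy["revenue_adj"] = toy["revenue_adj"].replace(0, np.nan)

# mean() skips NaN by default; a pair with no reported revenue stays NaN.
avg_revenue = (
    toy.groupby(["director", "actor"])["revenue_adj"]
    .mean()
    .reset_index(name="avg_revenue_adj")
)
print(avg_revenue)
```

If zeros were averaged in as real values, pairs with unreported revenue would be pulled toward 0, which is why the mask is applied first in this sketch.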

## <a id='wrangling'>Data Wrangling</a>

In this section, we will **load**, **inspect**, and **clean** our TMDb dataset. Our aim is to transform the data so that it is suitable for analyzing the relationships between director–actor pairs and their average audience ratings.

### General Properties

Below are the steps we will follow:
<ol type='1'>
<li><a href='#load_dataset'>Load the dataset</a> from a CSV file into a Pandas DataFrame.</li>
<li><a href='#explore_dataset'>Explore the DataFrame</a>: check dimensions, column datatypes, and sample rows.</li>
<li><a href='#assess_quality'>Assess and clean the data</a>: look for missing values, invalid entries, and duplicates; drop or filter rows as needed; and reshape the <code>cast</code> column so that each row holds one actor.</li>
<li><a href='#clean_data'>Create relationships</a> between <b>director–actor</b> pairs and their audience ratings (<code>collaboration_count</code>, <code>avg_vote_average</code>).</li>
<li><a href='#post_clean'>Extract the dominant genre</a> for each pair (<code>dominant_genre</code>).</li>
</ol>

#### <a id='load_dataset'>1. Loading the dataset</a>

In [7]:
import pandas as pd
import numpy as np

# Load the dataset
file_path = 'tmdb-movies.csv'  # Replace with the actual file path
tmdb_data = pd.read_csv(file_path)

# Preview the data
tmdb_data.head()

Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,overview,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,Twenty-two years after the events of Jurassic ...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0
1,76341,tt1392190,28.419936,150000000,378436354,Mad Max: Fury Road,Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic...,http://www.madmaxmovie.com/,George Miller,What a Lovely Day.,...,An apocalyptic story set in the furthest reach...,120,Action|Adventure|Science Fiction|Thriller,Village Roadshow Pictures|Kennedy Miller Produ...,5/13/15,6185,7.1,2015,137999900.0,348161300.0
2,262500,tt2908446,13.112507,110000000,295238201,Insurgent,Shailene Woodley|Theo James|Kate Winslet|Ansel...,http://www.thedivergentseries.movie/#insurgent,Robert Schwentke,One Choice Can Destroy You,...,Beatrice Prior must confront her inner demons ...,119,Adventure|Science Fiction|Thriller,Summit Entertainment|Mandeville Films|Red Wago...,3/18/15,2480,6.3,2015,101200000.0,271619000.0
3,140607,tt2488496,11.173104,200000000,2068178225,Star Wars: The Force Awakens,Harrison Ford|Mark Hamill|Carrie Fisher|Adam D...,http://www.starwars.com/films/star-wars-episod...,J.J. Abrams,Every generation has a story.,...,Thirty years after defeating the Galactic Empi...,136,Action|Adventure|Science Fiction|Fantasy,Lucasfilm|Truenorth Productions|Bad Robot,12/15/15,5292,7.5,2015,183999900.0,1902723000.0
4,168259,tt2820852,9.335014,190000000,1506249360,Furious 7,Vin Diesel|Paul Walker|Jason Statham|Michelle ...,http://www.furious7.com/,James Wan,Vengeance Hits Home,...,Deckard Shaw seeks revenge against Dominic Tor...,137,Action|Crime|Thriller,Universal Pictures|Original Film|Media Rights ...,4/1/15,2947,7.3,2015,174799900.0,1385749000.0


- We use **Pandas** to read the CSV file and store the data in a DataFrame named `tmdb_data`.
- A quick look at `.head()` helps us verify that the data loaded correctly.

#### <a id='explore_dataset'>2. Exploring Data Structure</a>


In [8]:
# Check the number of rows and columns
print("Data Shape:", tmdb_data.shape)

# Check datatypes and non-null counts
tmdb_data.info()

# Summary statistics for numeric columns
tmdb_data.describe()

Data Shape: (10866, 21)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10866 entries, 0 to 10865
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10866 non-null  int64  
 1   imdb_id               10856 non-null  object 
 2   popularity            10866 non-null  float64
 3   budget                10866 non-null  int64  
 4   revenue               10866 non-null  int64  
 5   original_title        10866 non-null  object 
 6   cast                  10790 non-null  object 
 7   homepage              2936 non-null   object 
 8   director              10822 non-null  object 
 9   tagline               8042 non-null   object 
 10  keywords              9373 non-null   object 
 11  overview              10862 non-null  object 
 12  runtime               10866 non-null  int64  
 13  genres                10843 non-null  object 
 14  production_companies  9836 non-null   object 


Unnamed: 0,id,popularity,budget,revenue,runtime,vote_count,vote_average,release_year,budget_adj,revenue_adj
count,10866.0,10866.0,10866.0,10866.0,10866.0,10866.0,10866.0,10866.0,10866.0,10866.0
mean,66064.177434,0.646441,14625700.0,39823320.0,102.070863,217.389748,5.974922,2001.322658,17551040.0,51364360.0
std,92130.136561,1.000185,30913210.0,117003500.0,31.381405,575.619058,0.935142,12.812941,34306160.0,144632500.0
min,5.0,6.5e-05,0.0,0.0,0.0,10.0,1.5,1960.0,0.0,0.0
25%,10596.25,0.207583,0.0,0.0,90.0,17.0,5.4,1995.0,0.0,0.0
50%,20669.0,0.383856,0.0,0.0,99.0,38.0,6.0,2006.0,0.0,0.0
75%,75610.0,0.713817,15000000.0,24000000.0,111.0,145.75,6.6,2011.0,20853250.0,33697100.0
max,417859.0,32.985763,425000000.0,2781506000.0,900.0,9767.0,9.2,2015.0,425000000.0,2827124000.0


**Initial Observations**:

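 - The dataset has 10,866 rows and 21 columns. Duplicates have not been ruled out yet, so the number of unique movies may be slightly lower.
 - Key columns for our analysis include: `director`, `cast`, `genres`, `vote_average`, `popularity`, and potentially `revenue_adj`.
 - Some of these columns contain missing values (`cast`, `director`, and `genres` each have fewer than 10,866 non-null entries), and the medians of `budget_adj` and `revenue_adj` are 0, which suggests that zeros stand in for unreported figures. We address these issues in the cleaning steps below.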
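
A quick way to check the key columns for missing or zero values is sketched below. It runs on a tiny made-up frame so that it is self-contained; the `toy`, `key_cols`, `null_counts`, and `zero_counts` names are illustrative and not part of the notebook.

```python
import pandas as pd

# Tiny made-up frame with the same key columns as the movie data.
toy = pd.DataFrame({
    "director": ["D1", None, "D3"],
    "cast": ["A1|A2", "A3", None],
    "genres": ["Drama", "Action|Comedy", None],
    "vote_average": [6.5, 7.1, 5.0],
    "revenue_adj": [0.0, 1.5e6, 0.0],
})

key_cols = ["director", "cast", "genres", "vote_average", "revenue_adj"]

# Number of missing values in each key column.
null_counts = toy[key_cols].isnull().sum()

# Number of exact zeros in the numeric key columns.
numeric_cols = toy[key_cols].select_dtypes(include="number").columns
zero_counts = (toy[numeric_cols] == 0).sum()

print(null_counts)
print(zero_counts)
```

The same two checks could be run on `tmdb_data` by swapping it in for `toy`.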

#### <a id='assess_quality'>3. Assessing Data Quality & Cleaning Steps</a>

**1) Initial Data Cleaning**
 - Remove duplicates, handle missing values, and filter out invalid rows.

In [9]:
# Drop duplicate rows based on the unique movie identifier
tmdb_data = tmdb_data.drop_duplicates(subset=['id'])

# Drop rows where crucial columns are missing
tmdb_data = tmdb_data.dropna(subset=['director', 'cast', 'vote_average'])

# Ensure only meaningful ratings (vote_average > 0) are considered
tmdb_data = tmdb_data[tmdb_data['vote_average'] > 0]

# Reset the index after cleaning
tmdb_data.reset_index(drop=True, inplace=True)

# Check the cleaned data
tmdb_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10751 entries, 0 to 10750
Data columns (total 21 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   id                    10751 non-null  int64  
 1   imdb_id               10745 non-null  object 
 2   popularity            10751 non-null  float64
 3   budget                10751 non-null  int64  
 4   revenue               10751 non-null  int64  
 5   original_title        10751 non-null  object 
 6   cast                  10751 non-null  object 
 7   homepage              2898 non-null   object 
 8   director              10751 non-null  object 
 9   tagline               8006 non-null   object 
 10  keywords              9311 non-null   object 
 11  overview              10748 non-null  object 
 12  runtime               10751 non-null  int64  
 13  genres                10731 non-null  object 
 14  production_companies  9779 non-null   object 
 15  release_date       

**2) Explode the `cast` Column**

 - Transform the `cast` column to create one row per `(director, actor)` relationship.

In [10]:
# Split the 'cast' column into a list of actors
tmdb_data['cast_list'] = tmdb_data['cast'].str.split('|')

# Explode the list to create one row per actor
tmdb_exploded = tmdb_data.explode('cast_list')

# Rename the column for clarity
tmdb_exploded.rename(columns={'cast_list': 'actor'}, inplace=True)

# Check the exploded data
tmdb_exploded.head()


Unnamed: 0,id,imdb_id,popularity,budget,revenue,original_title,cast,homepage,director,tagline,...,runtime,genres,production_companies,release_date,vote_count,vote_average,release_year,budget_adj,revenue_adj,actor
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0,Chris Pratt
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0,Bryce Dallas Howard
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0,Irrfan Khan
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0,Vincent D'Onofrio
0,135397,tt0369610,32.985763,150000000,1513528810,Jurassic World,Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi...,http://www.jurassicworld.com/,Colin Trevorrow,The park is open.,...,124,Action|Adventure|Science Fiction|Thriller,Universal Studios|Amblin Entertainment|Legenda...,6/9/15,5562,6.5,2015,137999900.0,1392446000.0,Nick Robinson


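**Decisions on Cleaning**
 - We drop rows where `director` or `cast` is missing, since no director–actor pair can be formed without both.
 - We also filter out movies with a `vote_average` of 0 or less, which would indicate a placeholder rather than a real rating. The summary statistics above show a minimum of 1.5, so this filter acts as a safeguard and removes no rows here.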
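
The cleaning decisions and the explode step can be sanity-checked on a small example. The sketch below uses made-up rows; the whitespace stripping is an extra precaution that assumes stray spaces around names carry no meaning, and it is not something the cells above do.

```python
import pandas as pd

# Made-up rows: one movie has no director, one has a stray space in its cast.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "director": ["D1", None, "D2"],
    "cast": ["A1|A2|A3", "A4", "A5| A6"],
    "vote_average": [6.5, 7.0, 5.5],
})

# A pair needs both a director and a cast list.
clean = toy.dropna(subset=["director", "cast"]).copy()

# One row per actor; the other columns are repeated for each actor.
clean["actor"] = clean["cast"].str.split("|")
exploded = clean.explode("actor", ignore_index=True)

# Assumption: whitespace around a name is noise, so strip it.
exploded["actor"] = exploded["actor"].str.strip()

# The exploded frame should have one row per cast entry.
expected_rows = clean["cast"].str.count(r"\|").add(1).sum()
assert len(exploded) == expected_rows
print(exploded[["id", "director", "actor"]])
```

Comparing the row count with the number of pipe-separated entries is a cheap check that the explode did not silently drop or duplicate anything.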

#### <a id='clean_data'>4. Creating Director–Actor Relationships</a>

 - Group by (director, actor) to calculate collaboration counts and average ratings.
 - Each movie typically has multiple cast members, so exploding creates one row per actor. This allows us to group by `(director, actor)` later.

In [11]:
# Group by director and actor to calculate collaboration count and average ratings
pair_stats = tmdb_exploded.groupby(['director', 'actor']).agg({
    'id': 'count',  # Count of movies they worked on together
    'vote_average': 'mean'  # Average audience rating across their collaborations
}).reset_index()

# Rename columns for clarity
pair_stats.rename(columns={
    'id': 'collaboration_count',
    'vote_average': 'avg_vote_average'
}, inplace=True)

# Check the pair_stats DataFrame
pair_stats.head()


Unnamed: 0,director,actor,collaboration_count,avg_vote_average
0,Frédéric Jardin,JoeyStarr,1,5.9
1,Frédéric Jardin,Julien Boisselier,1,5.9
2,Frédéric Jardin,Laurent Stocker,1,5.9
3,Frédéric Jardin,Serge Riaboukine,1,5.9
4,Frédéric Jardin,Tomer Sisley,1,5.9
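
As an alternative to the dict-plus-rename pattern, pandas' named aggregation produces the final column names in one step. The sketch below is a variant on made-up data, not the cell used in this notebook: it also removes repeated `(id, director, actor)` rows and counts distinct movie ids. Whether the real data ever lists an actor twice for one movie has not been checked here, so treat that part as a precaution. The `pair_stats_toy` name is hypothetical.

```python
import pandas as pd

# Made-up exploded rows; A1 is listed twice for movie 1.
exploded = pd.DataFrame({
    "id": [1, 1, 2, 3],
    "director": ["D1", "D1", "D1", "D2"],
    "actor": ["A1", "A1", "A1", "A2"],
    "vote_average": [6.0, 6.0, 8.0, 5.0],
})

# Drop repeated listings so each movie contributes once to the mean.
pairs = exploded.drop_duplicates(subset=["id", "director", "actor"])

# Named aggregation: output_name=(source_column, function).
pair_stats_toy = (
    pairs.groupby(["director", "actor"], as_index=False)
    .agg(
        collaboration_count=("id", "nunique"),
        avg_vote_average=("vote_average", "mean"),
    )
)
print(pair_stats_toy)
```

Without the de-duplication, the repeated row for movie 1 would both inflate the count and weight that movie's rating twice in the average.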


#### <a id='post_clean'>5. Extract Dominant Genre</a>

 - Identify the most frequent genre for each `(director, actor)` pair.

In [None]:
# Explode the 'genres' column to analyze individual genres
tmdb_exploded['genres_list'] = tmdb_exploded['genres'].str.split('|')
genres_exploded = tmdb_exploded.explode('genres_list')

# Group by (director, actor, genre) to count occurrences of each genre
genre_counts = genres_exploded.groupby(['director', 'actor', 'genres_list']).size().reset_index(name='genre_count')

# Sort by director, actor, and genre count in descending order
genre_counts.sort_values(by=['director', 'actor', 'genre_count'], ascending=[True, True, False], inplace=True)

# Select the most frequent genre for each director-actor pair
dominant_genres = genre_counts.drop_duplicates(subset=['director', 'actor'], keep='first')

# Merge the dominant genre into the pair_stats DataFrame
pair_stats = pd.merge(pair_stats, dominant_genres[['director', 'actor', 'genres_list']], on=['director', 'actor'], how='left')

# Rename for clarity
pair_stats.rename(columns={'genres_list': 'dominant_genre'}, inplace=True)

# Final check
pair_stats.head()
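
The sort in the cell above does not say which genre wins when two genres are equally frequent for a pair. The sketch below, on made-up data, adds the genre name as a final sort key so that ties are resolved alphabetically. That rule is an arbitrary assumption chosen only to make the result reproducible, and the `genre_rows` and `dominant` names are illustrative.

```python
import pandas as pd

# Made-up exploded rows: D1/A1 has Drama and Thriller tied at two films each.
exploded = pd.DataFrame({
    "director": ["D1", "D1", "D1", "D2"],
    "actor": ["A1", "A1", "A1", "A2"],
    "genres": ["Drama|Thriller", "Drama", "Thriller|Comedy", "Comedy|Action"],
})

# One row per (director, actor, movie, genre).
genre_rows = exploded.assign(genre=exploded["genres"].str.split("|")).explode("genre")

genre_counts = (
    genre_rows.groupby(["director", "actor", "genre"])
    .size()
    .reset_index(name="genre_count")
)

# Highest count first; on equal counts, the alphabetically first genre wins.
dominant = (
    genre_counts.sort_values(
        ["director", "actor", "genre_count", "genre"],
        ascending=[True, True, False, True],
    )
    .drop_duplicates(subset=["director", "actor"], keep="first")
    .rename(columns={"genre": "dominant_genre"})
    .reset_index(drop=True)
)
print(dominant)
```

Since most pairs have worked together only a handful of times, ties are likely to be common, so the tie-breaking rule can noticeably affect which genre is labeled dominant.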


#### Next Steps:
 - `pair_stats` now holds one row per `(director, actor)` pair, with `collaboration_count`, `avg_vote_average`, and `dominant_genre`.
 - In the next section, we use these columns to study whether collaboration frequency or genre is associated with ratings.



<a id='eda'></a>
## Exploratory Data Analysis

With our data now cleaned and restructured, we can begin to explore how director–actor pairs relate to audience ratings. Specifically, we’ll examine:

1. **Single-variable (1D) analyses** to understand distributions of key variables.
2. **Two-variable (2D) analyses** to see how our independent variables (e.g., `collaboration_count`, `dominant_genre`) might relate to the dependent variable (`avg_vote_average`).

Below, we address the **three primary questions** we posed earlier, focusing on how certain factors might be associated with higher or lower average audience ratings.
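
As a rough sketch of what the two-variable analyses could look like, the example below computes a rank (Spearman) correlation between `collaboration_count` and `avg_vote_average`, and a per-genre summary of ratings. It runs on made-up numbers, so the values it prints say nothing about the real data. Spearman is computed as a Pearson correlation on ranks to avoid extra dependencies, and choosing Spearman over Pearson is an assumption made here because collaboration counts are discrete and skewed.

```python
import pandas as pd

# Made-up pair-level table with the three variables of interest.
toy_pairs = pd.DataFrame({
    "collaboration_count": [1, 1, 2, 3, 5],
    "avg_vote_average": [5.0, 6.0, 6.5, 7.0, 7.5],
    "dominant_genre": ["Action", "Drama", "Drama", "Action", "Drama"],
})

# Question 1: Spearman correlation, computed as Pearson on the ranks.
spearman_rho = (
    toy_pairs["collaboration_count"].rank()
    .corr(toy_pairs["avg_vote_average"].rank())
)
print(f"Spearman rho: {spearman_rho:.3f}")

# Question 2: number of pairs and mean rating per dominant genre.
genre_summary = (
    toy_pairs.groupby("dominant_genre")["avg_vote_average"]
    .agg(["count", "mean"])
    .sort_values("mean", ascending=False)
)
print(genre_summary)
```

Reporting the pair count next to each genre mean makes it easier to spot genres whose average rests on only a few pairs.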

---

### 4.1 Single-Variable Exploration

#### 4.1.1 Collaboration Count

After grouping by `(director, actor)`, each pair has a `collaboration_count`, reflecting how many movies they've worked on together:


In [None]:
import matplotlib.pyplot as plt

# Reuse `pair_stats` from the wrangling section instead of rebuilding it here;
# regrouping would overwrite the frame and drop the merged `dominant_genre` column.
print(pair_stats.columns)

In [None]:
plt.figure(figsize=(8, 5))
plt.hist(pair_stats['collaboration_count'], bins=range(1, pair_stats['collaboration_count'].max() + 2), edgecolor='black')
plt.title('Distribution of Collaboration Counts for Director–Actor Pairs')
plt.xlabel('Number of Collaborations')
plt.ylabel('Number of Pairs')
plt.show()

# Basic statistics
mean_collab = pair_stats['collaboration_count'].mean()
median_collab = pair_stats['collaboration_count'].median()
max_collab = pair_stats['collaboration_count'].max()

print(f"Mean Collaboration Count: {mean_collab:.2f}")
print(f"Median Collaboration Count: {median_collab:.2f}")
print(f"Max Collaboration Count: {max_collab}")
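
Alongside the histogram, a frequency table and the share of one-off pairs can make a skewed distribution easier to read. The sketch below uses made-up numbers; the `MIN_COLLABS` threshold and the `repeat_pairs` name are hypothetical and only illustrate how pairs with very few films might be set aside before comparing ratings.

```python
import pandas as pd

# Made-up pair-level table.
toy_pairs = pd.DataFrame({
    "collaboration_count": [1, 1, 1, 2, 4],
    "avg_vote_average": [4.0, 9.0, 6.0, 6.5, 7.0],
})

# How many pairs have each collaboration count.
count_table = toy_pairs["collaboration_count"].value_counts().sort_index()

# Fraction of pairs that worked together exactly once.
share_single = (toy_pairs["collaboration_count"] == 1).mean()

# Hypothetical threshold: keep only pairs with repeated collaborations,
# whose average rating is based on more than one film.
MIN_COLLABS = 2
repeat_pairs = toy_pairs[toy_pairs["collaboration_count"] >= MIN_COLLABS]

print(count_table)
print(f"Share of one-off pairs: {share_single:.0%}")
print(f"Pairs with at least {MIN_COLLABS} collaborations: {len(repeat_pairs)}")
```

An average over a single film is just that film's rating, so one-off pairs add a lot of spread; comparing results with and without them is one way to see how much they drive any pattern.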


<a id='conclusions'></a>
## Conclusions

