
# TMDB Movie Analysis: Uncovering Insights on Popularity, Box Office Success, and More!

## Introduction

What factors determine whether a movie becomes a box office hit or a commercial failure? 

Is it the star-studded cast featuring Hollywood icons like Tom Hanks and Chris Pratt? Perhaps it’s the influence of a visionary director such as J.J. Abrams or James Wan? Or could it be the genre that dictates its appeal—whether it be Action, Adventure, or Thriller?

In this analysis, we’ll dive into the TMDB movie dataset to answer some of these burning questions and uncover insights into what drives a film's popularity and profitability.

The dataset, sourced from Kaggle’s **TMDB 5000 Movie Dataset**, includes information on **4,803 movies** collected from The Movie Database (TMDb), which matches the row count reported by `df.info()` below. Our objective is to explore key indicators of success, from popularity to box office performance, genre trends, and more.

## Questions of Interest

The following questions will guide our analysis as we explore the dataset:

1. **Does a Higher Budget Lead to Greater Revenue and Profit?**
   - We’ll investigate whether there’s a strong correlation between a movie's budget and its financial returns, and if so, how significant that relationship is.

2. **Which Genres are Most Associated with High Popularity?**
   - By examining genre data, we aim to pinpoint the types of movies that consistently draw large audiences and maintain a high popularity rating on TMDb.

3. **Which Director Has Produced the Highest-Revenue Movies?**
   - Analyzing revenue by director can help us identify if certain directors repeatedly produce high-grossing films, and whether directorial style might contribute to box office success.

4. **Which Actor is Associated with Higher Popularity and Profit?**
   - We’ll look at actors' involvement in films with high popularity and revenue, searching for any actors whose presence is consistently linked to profitable projects.

5. **Which Year Produced the Highest-Grossing Movies?**
   - A closer look at revenue over the years will allow us to spot trends and identify any particular year that stood out in terms of producing box office hits.
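
To make questions 1 and 5 concrete, here is a minimal sketch of the kind of computation they imply. It runs on a handful of made-up rows, not on the TMDb data, and the column names `budget`, `revenue`, and `release_year` are assumptions about how the financial data would be laid out once it is loaded.

```python
import pandas as pd

# Toy values for illustration only; they are NOT taken from the TMDb dataset.
# The column names are assumptions about the eventual analysis table.
movies = pd.DataFrame({
    "budget": [10, 20, 30, 40, 50],
    "revenue": [15, 50, 45, 120, 100],
    "release_year": [2010, 2010, 2011, 2011, 2012],
})

# Profit is derived as revenue minus budget.
movies["profit"] = movies["revenue"] - movies["budget"]

# Pearson correlation of budget with revenue and with profit (question 1).
budget_corr = movies[["budget", "revenue", "profit"]].corr()["budget"]

# Total revenue per release year, highest first (question 5).
revenue_by_year = (
    movies.groupby("release_year")["revenue"].sum().sort_values(ascending=False)
)

print(budget_corr)
print(revenue_by_year)
```

A correlation coefficient only measures linear association, so a high value here would suggest, but not prove, that larger budgets drive larger returns.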

## Analysis Plan

Using these questions as a foundation, we’ll apply data analysis techniques to identify trends, correlations, and other insights within the dataset. Through this approach, we hope to gain a better understanding of the factors that influence a movie’s success, both in terms of popularity and financial performance.

Join us as we explore and analyze the data, discovering what it takes for a movie to succeed in the competitive landscape of the film industry!

### Dataset
[tmdb-movie-metadata](https://www.kaggle.com/tmdb/tmdb-movie-metadata)

## Data Preparation and Wrangling
We will begin by preparing the environment and loading the dataset for initial assessment. Following this, we will focus on cleaning and processing the data to ensure it is ready for thorough analysis.

In [10]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Set pandas display option to show all columns in a single line
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.expand_frame_repr', False)  # Prevent line wrapping

# Import CSV
df = pd.read_csv('data/tmdb_5000_credits.csv')

# View the first 5 rows (the default for head())
df.head()

|   | movie_id | title | cast | crew |
|---|----------|-------|------|------|
| 0 | 19995 | Avatar | `[{"cast_id": 242, "character": "Jake Sully", "...` | `[{"credit_id": "52fe48009251416c750aca23", "de...` |
| 1 | 285 | Pirates of the Caribbean: At World's End | `[{"cast_id": 4, "character": "Captain Jack Spa...` | `[{"credit_id": "52fe4232c3a36847f800b579", "de...` |
| 2 | 206647 | Spectre | `[{"cast_id": 1, "character": "James Bond", "cr...` | `[{"credit_id": "54805967c3a36829b5002c41", "de...` |
| 3 | 49026 | The Dark Knight Rises | `[{"cast_id": 2, "character": "Bruce Wayne / Ba...` | `[{"credit_id": "52fe4781c3a36847f81398c3", "de...` |
| 4 | 49529 | John Carter | `[{"cast_id": 5, "character": "John Carter", "c...` | `[{"credit_id": "52fe479ac3a36847f813eaa3", "de...` |
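
The `cast` and `crew` columns hold JSON-encoded lists stored as plain strings, so they need decoding before actors or directors can be analyzed. The sketch below shows one possible way to do that with the standard `json` module. The sample row is invented, and the `"name"` key is an assumption, since the truncated output above only shows the `cast_id` and `character` keys in full.

```python
import json

import pandas as pd

# Toy row mimicking the JSON-encoded `cast` column.
# The "name" key is an assumption; it is not visible in the truncated output.
sample = pd.DataFrame({
    "movie_id": [1],
    "cast": [
        '[{"cast_id": 1, "character": "Hero", "name": "Actor A"}, '
        '{"cast_id": 2, "character": "Villain", "name": "Actor B"}]'
    ],
})


def extract_names(json_text):
    """Decode a JSON list of objects and return each object's "name" value."""
    return [entry["name"] for entry in json.loads(json_text)]


# Apply the decoder to every row, producing a list of names per movie.
sample["cast_names"] = sample["cast"].apply(extract_names)

print(sample["cast_names"].iloc[0])
```

The same helper could be pointed at `crew`, although picking out directors would require an additional filter on whichever key records each crew member's role.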


In [8]:
# Basic information of dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   movie_id  4803 non-null   int64 
 1   title     4803 non-null   object
 2   cast      4803 non-null   object
 3   crew      4803 non-null   object
dtypes: int64(1), object(3)
memory usage: 150.2+ KB
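
Note that this file contains only `movie_id`, `title`, `cast`, and `crew`, so budget, revenue, popularity, and genre information would have to come from a second table. A minimal sketch of joining the two follows. It uses toy frames, and the metadata table with its `id`, `budget`, and `revenue` columns is an assumption about the companion file rather than something shown in this notebook.

```python
import pandas as pd

# Toy stand-in for the credits DataFrame loaded above.
credits = pd.DataFrame({"movie_id": [1, 2, 3], "title": ["A", "B", "C"]})

# Hypothetical metadata table keyed by `id`; the column names are assumed.
movies = pd.DataFrame({
    "id": [1, 2, 4],
    "budget": [100, 200, 400],
    "revenue": [300, 150, 900],
})

# A left join keeps every credits row; indicator=True adds a `_merge` column
# recording whether a matching metadata row was found.
merged = credits.merge(
    movies, left_on="movie_id", right_on="id", how="left", indicator=True
)

# Rows flagged "left_only" have credits but no matching metadata.
unmatched = merged[merged["_merge"] == "left_only"]

print(merged)
print(len(unmatched))
```

Checking the unmatched rows before dropping them is a cheap way to catch key mismatches between the two files.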
