# TMDB DATA ANALYSIS REPORT

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#trend">Trend Analysis</a></li>
<li><a href="#comparative">Comparative Analysis</a></li>
<li><a href="#conclusion">Conclusion</a></li>
<li><a href="#recommendations">Recommendations</a></li>
<li><a href="#ref">References</a></li>
</ul>

<a id='intro'></a>
# Introduction

## Dataset Description

The **TMDB** movie dataset contains 10,866 rows (one per movie) and 21 columns, covering aspects such as popularity, budget, revenue, genres, and release year. The objective of this analysis is to explore the dataset, uncover insights, and provide actionable recommendations for stakeholders in the movie industry.
The columns are:

- **id**: a unique identifier for each row in the dataset.

- **tmdb_id**: a unique code identifying the movie; the raw file names this column `imdb_id`, and it is renamed during wrangling.

- **popularity**: a numeric popularity score for the movie.

- **budget**: overall amount spent on the movie.

- **revenue**: overall amount received after the release of the movie.

- **original_title**: the original title of the movie.

- **cast**: the actors and actresses featured in the movie.

- **homepage**: homepage of the movie website.

- **director**: the individual who directed the movie.

- **tagline**: the advertising slogan.

- **keywords**: keywords that describe the movie.

- **overview**: a brief summary of the movie.

- **runtime**: the length of the movie from start to finish, in minutes.

- **genres**: the genres the movie is classified under, separated by `|`.

- **production_companies**: the companies that produced the movie.

- **release_date**: date the movie was released.

- **vote_count**: the number of people who voted for the movie.

- **vote_average**: the average rating of the movie, out of ten.

- **release_year**: the year the movie was released.

- **budget_adj**: the movie's budget expressed in 2010 dollars, i.e. adjusted for inflation.

- **revenue_adj**: the movie's revenue expressed in 2010 dollars, i.e. adjusted for inflation.

<a id='wrangling'></a>
# Data Wrangling

- Loaded the TMDB movie dataset into a Pandas DataFrame.
- Reviewed the structure of the dataset and the first few rows to understand its contents.
- Handled missing values by imputation or removal.
- Converted data types to appropriate formats (e.g., dates, categorical variables).
- Checked for and addressed any duplicates or inconsistencies in the data.
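
The missing-value step listed above can be handled in more than one way. The following is a minimal sketch on a small made-up frame rather than the real CSV; the helper name `drop_sparse_then_rows` and the 50% threshold are hypothetical choices, not part of the original analysis. It checks how many values are missing per column, drops columns that are mostly empty, and only then removes incomplete rows, so that one sparsely filled column does not discard otherwise complete records.

```python
import numpy as np
import pandas as pd


def drop_sparse_then_rows(df, max_missing_share=0.5):
    """Drop columns that are mostly empty, then drop rows with any remaining NaN.

    `max_missing_share` is an illustrative threshold, not a value from the report.
    """
    missing_share = df.isna().mean()  # fraction of NaN per column
    sparse_cols = missing_share[missing_share > max_missing_share].index
    return df.drop(columns=sparse_cols).dropna()


# Toy data standing in for the real file (all values are made up)
toy = pd.DataFrame({
    'original_title': ['A', 'B', 'C', 'D'],
    'homepage': [np.nan, np.nan, np.nan, 'http://example.com'],
    'genres': ['Drama', 'Action|Drama', np.nan, 'Comedy'],
})

print(toy.isna().sum())               # missing values per column
cleaned = drop_sparse_then_rows(toy)  # sparse column removed before row-wise dropna
print(cleaned.shape)
```

On this toy frame a plain `dropna()` would keep a single row, while dropping the sparse column first keeps three. Whether that trade-off is appropriate depends on which columns the later analysis actually needs.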


```python
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns


# Load the dataset
Tmdb = pd.read_csv('tmdb-movies.csv')

# Display the first few rows
print(Tmdb.head())

# Information about the dataset
Tmdb.info()

# Shape of the dataset (rows, columns)
print(Tmdb.shape)

# Handle missing values: drops every row that has a missing value in any column
Tmdb.dropna(inplace=True)

# Convert data types
Tmdb['release_date'] = pd.to_datetime(Tmdb['release_date'])

# Convert budget_adj column from float to integer
Tmdb['budget_adj'] = Tmdb['budget_adj'].astype('int64')

# Convert revenue_adj column from float to integer
Tmdb['revenue_adj'] = Tmdb['revenue_adj'].astype('int64')

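# Extract year from 'release_date' and store it in the 'release_year' column.
# Note: if the dates use two-digit years, a value such as '66' can be parsed
# as 2066, which is why future years are checked for further down.
Tmdb['release_year'] = Tmdb['release_date'].dt.year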

# Rename column 'imdb_id' to 'tmdb_id'
Tmdb.rename(columns={'imdb_id': 'tmdb_id'}, inplace=True)

# Check for duplicates
duplicate_rows = Tmdb[Tmdb.duplicated()]
print("Duplicate rows:", duplicate_rows)

# Drop duplicates
Tmdb.drop_duplicates(inplace=True)

# Check for any unrealistic future years
future_years = Tmdb[Tmdb['release_year'] > pd.Timestamp.now().year]['release_year']
print("Future years:", future_years)

# Remove unrealistic future years
Tmdb = Tmdb[Tmdb['release_year'] <= pd.Timestamp.now().year]

# Drop 'homepage' column
Tmdb = Tmdb.drop(columns=['homepage'])
```

<a id='eda'></a>
# Exploratory Data Analysis (EDA)

## Descriptive Statistics
- Plot distributions of numerical variables.
- Explore how popularity varies with release_year or genres.
- Investigate the correlation between budget and revenue.
- Analyze the most common genres.
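
To put a number on the budget-revenue relationship, a scatter plot can be paired with a correlation coefficient. This is a minimal sketch on made-up data; the helper name `budget_revenue_corr` is hypothetical, and treating zero budgets or revenues as missing is an assumption about the dataset rather than something the report states.

```python
import numpy as np
import pandas as pd


def budget_revenue_corr(df, method='pearson'):
    """Correlation between budget and revenue, ignoring zero (assumed missing) values."""
    pair = df[['budget', 'revenue']].replace(0, np.nan).dropna()
    return pair['budget'].corr(pair['revenue'], method=method)


# Toy data (values are made up); rows with a zero are excluded from the estimate
toy = pd.DataFrame({
    'budget':  [10, 20, 0, 40, 50],
    'revenue': [25, 45, 30, 0, 105],
})

r = budget_revenue_corr(toy)
print(f"Pearson r = {r:.3f}")
```

Passing `method='spearman'` gives a rank-based alternative, which may be more robust when a few blockbusters dominate the scale.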

```python
# Plot distributions of numerical variables
numerical_variables = ['popularity', 'budget', 'revenue', 'runtime',
                       'vote_count', 'vote_average', 'budget_adj', 'revenue_adj']
Tmdb[numerical_variables].hist(figsize=(15, 10), color='sienna')
plt.tight_layout()
plt.show()

# Explore how popularity varies with release_year

sns.lineplot(x='release_year', y='popularity', data=Tmdb, color='sandybrown')
plt.title('Popularity Variation Over Years')
plt.xlabel('Release Year')
plt.ylabel('Popularity')
plt.show()

# Explore how popularity varies with genres
genres_data = Tmdb['genres'].str.split('|', expand=True).stack().reset_index(level=1, drop=True).rename('genre')
genres_popularity = Tmdb.merge(genres_data, left_index=True, right_index=True)[['popularity', 'genre']]
sns.boxplot(x='genre', y='popularity', data=genres_popularity, color='peru')
plt.title('Popularity Distribution Across Genres')
plt.xlabel('Genre')
plt.ylabel('Popularity')
plt.xticks(rotation=45)
plt.show()

# Investigate the correlation between budget and revenue
sns.scatterplot(x='budget', y='revenue', data=Tmdb, color='rosybrown')
plt.title('Budget vs Revenue')
plt.xlabel('Budget')
plt.ylabel('Revenue')
plt.show()

# Analyze the most common genres
common_genres = genres_data.value_counts().head(10)
common_genres.plot(kind='bar', figsize=(10, 6), color='tan')
plt.title('Top 10 Most Common Genres')
plt.xlabel('Genre')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()
```

<a id='trend'></a>
# Trend Analysis

## Popularity Trends
- Analyzed trends in movie popularity over release years.
- Visualized popularity trends using line plots and examined any notable fluctuations or patterns.

## Financial Trends
- Investigated trends in budget and revenue over release years.
- Calculated growth rates and visualized financial trends using line plots.
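
One way to compute the growth rates mentioned above is to aggregate per year and apply `pct_change`. This is a minimal sketch on made-up numbers; the helper name `yearly_growth` and the choice of the yearly mean as the aggregate are assumptions, not the report's confirmed method.

```python
import pandas as pd


def yearly_growth(df, column):
    """Year-over-year percentage change of the yearly mean of `column`."""
    yearly_mean = df.groupby('release_year')[column].mean().sort_index()
    return yearly_mean.pct_change() * 100


# Toy data (values are made up)
toy = pd.DataFrame({
    'release_year': [2000, 2000, 2001, 2002],
    'revenue':      [100, 300, 250, 500],
})

growth = yearly_growth(toy, 'revenue')
print(growth)
```

The first year has no predecessor, so its growth rate is `NaN`. Note that `pct_change` compares consecutive rows, so if a year is absent from the data the change is measured against the last year that is present.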

```python

# Trend analysis for popularity over release_year
plt.figure(figsize=(12, 6))
sns.lineplot(x='release_year', y='popularity', data=Tmdb, errorbar=None, color='sienna')
plt.title('Popularity Trend Over Years')
plt.xlabel('Release Year')
plt.ylabel('Popularity')
plt.grid(False)
plt.show()

# Trend analysis for budget over release_year
plt.figure(figsize=(12, 6))
sns.lineplot(x='release_year', y='budget', data=Tmdb, errorbar=None, color='peru')
plt.title('Budget Trend Over Years')
plt.xlabel('Release Year')
plt.ylabel('Budget')
plt.grid(False)
plt.show()

# Trend analysis for revenue over release_year
plt.figure(figsize=(12, 6))
sns.lineplot(x='release_year', y='revenue', data=Tmdb, errorbar=None, color='saddlebrown')
plt.title('Revenue Trend Over Years')
plt.xlabel('Release Year')
plt.ylabel('Revenue')
plt.grid(False)
plt.show()
```

<a id='comparative'></a>
# Comparative Analysis

## Budget vs. Revenue
- Compared budget and revenue to understand financial performance.
- Calculated key metrics such as profit margins and return on investment (ROI).
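
A minimal sketch of the profit margin and ROI metrics, assuming the usual definitions (profit divided by revenue, and profit divided by budget). The helper name `add_financial_metrics` and the `roi` column are hypothetical, and zero denominators are treated as missing so the ratios come out as `NaN` rather than infinite.

```python
import numpy as np
import pandas as pd


def add_financial_metrics(df):
    """Return a copy with profit, profit margin (%) and ROI (%) columns."""
    out = df.copy()
    out['profit'] = out['revenue'] - out['budget']
    # Replace zero denominators with NaN so the ratios are NaN, not inf
    out['profit_margin'] = out['profit'] / out['revenue'].replace(0, np.nan) * 100
    out['roi'] = out['profit'] / out['budget'].replace(0, np.nan) * 100
    return out


# Toy data (values are made up)
toy = pd.DataFrame({'budget': [100, 0, 50], 'revenue': [150, 80, 0]})
metrics = add_financial_metrics(toy)
print(metrics)
```

Working on a copy keeps the input frame unchanged, which makes it easier to compare the metrics on nominal and inflation-adjusted amounts side by side.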

## Budget Adjusted for Inflation
- Analyzed the percentage difference between budget and budget adjusted for inflation.

## Revenue Adjusted for Inflation
- Analyzed the percentage difference between revenue and revenue adjusted for inflation.
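
The percentage difference between a nominal amount and its inflation-adjusted value is `(adjusted - nominal) / nominal * 100`. As a sketch, this can be computed column-wise instead of row by row; the helper name `pct_difference` and the output column name are hypothetical, and the numbers are made up.

```python
import numpy as np
import pandas as pd


def pct_difference(nominal, adjusted):
    """Percentage difference of `adjusted` relative to `nominal` (NaN where nominal is 0)."""
    safe_nominal = nominal.replace(0, np.nan)
    return (adjusted - nominal) / safe_nominal * 100


# Toy data (values are made up)
toy = pd.DataFrame({
    'budget':     [100.0, 0.0, 200.0],
    'budget_adj': [110.0, 0.0, 150.0],
})

toy['budget_pct_diff'] = pct_difference(toy['budget'], toy['budget_adj'])
print(toy)
```

The same helper applies unchanged to `revenue` and `revenue_adj`; storing each result in its own column keeps the two comparisons separate.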

```python
# Calculate profit
Tmdb['profit'] = Tmdb['revenue'] - Tmdb['budget']

# Calculate profit margin (profit as a percentage of revenue);
# zero revenue is treated as missing to avoid division by zero
Tmdb['profit_margin'] = (Tmdb['profit'] / Tmdb['revenue'].replace(0, np.nan)) * 100

# Display descriptive statistics
print(Tmdb['profit'].describe())


# Percentage difference between a nominal amount and its inflation-adjusted value
def calculate_percentage_difference(row, nominal_col, adjusted_col):
    if row[nominal_col] != 0:
        return ((row[adjusted_col] - row[nominal_col]) / row[nominal_col]) * 100
    else:
        return None

# Budget vs. inflation-adjusted budget
Tmdb['budget_percentage_difference'] = Tmdb.apply(
    calculate_percentage_difference, axis=1, args=('budget', 'budget_adj'))

# Revenue vs. inflation-adjusted revenue (stored in its own column so it
# does not overwrite the budget result)
Tmdb['revenue_percentage_difference'] = Tmdb.apply(
    calculate_percentage_difference, axis=1, args=('revenue', 'revenue_adj'))
```

<a id='conclusion'></a>
# Conclusion

1. **Exploratory Data Analysis (EDA)**:
- **Descriptive Statistics**: Summarized the dataset with descriptive statistics, covering the distributions and central tendencies of the key numerical attributes.

- **Popularity Variation**: Explored how popularity varies with release year and genres, identifying trends and potential insights into audience preferences and market trends.

- **Budget-Revenue Correlation**: Investigated the correlation between budget and revenue, understanding how investments in movie production relate to financial returns.

- **Common Genres Analysis**: Analyzed the most common genres to understand audience preferences and market demands better.

2. **Trend Analysis**:
- **Popularity Trends**: Examined trends in movie popularity over release years, identifying any significant fluctuations or patterns that could inform marketing and production strategies.

- **Financial Trends**: Investigated trends in budget and revenue over release years, calculating growth rates and visualizing financial trends to guide financial planning and decision-making.

3. **Comparative Analysis**:
- **Budget vs. Revenue**: Compared budget and revenue to assess financial performance, and computed descriptive statistics on key metrics such as profit to evaluate the effectiveness of investments.

- **Budget and Revenue Adjusted for Inflation**: Analyzed the impact of inflation on budgeting decisions and financial planning, understanding how real financial growth of movies evolves over time.



<a id='recommendations'></a>
# Recommendations

- **Diversification of Genres**: Based on the analysis of common genres and financial performance across genres, consider diversifying movie offerings to cater to a broader audience and mitigate risks associated with genre-specific fluctuations.

- **Long-Term Popularity Strategies**: Identify and capitalize on long-term popularity trends by investing in genres or themes that show consistent audience interest over the years.

- **Optimization of Budget Allocation**: Use insights from the budget-revenue correlation analysis to optimize budget allocation, ensuring efficient resource utilization and maximizing financial returns.

- **Inflation-Adjusted Financial Planning**: Incorporate inflation-adjusted financial planning to accurately assess budgeting needs and revenue expectations, enabling better long-term financial sustainability.

- **Continuous Monitoring and Adaptation**: Implement a robust monitoring system to track evolving audience preferences, market trends, and financial performance metrics, enabling timely adjustments to production and marketing strategies.

- **Strategic Partnerships**: Explore strategic partnerships with production companies or distribution platforms specializing in high-performing genres identified through the analysis, leveraging synergies to enhance financial outcomes.

- **Audience Engagement Strategies**: Develop targeted audience engagement strategies based on genre preferences and popularity trends, fostering deeper connections with viewers and driving box office success.

By leveraging insights derived from comprehensive data analysis, stakeholders in the movie industry can make informed decisions to optimize resources, maximize revenues, and enhance overall competitiveness in the market.






<a id='ref'></a>
# References

- **Dataset**: TMDB movie dataset, obtained from Kaggle
- **Python libraries**: Pandas, NumPy, Matplotlib, Seaborn