
# Project: Investigate a Dataset (movies data)
## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
# Introduction
In this notebook, we will analyze a dataset of movies to uncover interesting insights. Our primary focus will be on understanding the relationship between budget and popularity, as well as identifying the most prolific directors. 

We aim to answer the following key questions:
1. Does an increase in budget lead to a higher popularity for movies?
2. Which directors have directed the most films?

In [None]:
# Import the packages used throughout this notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#### importing necessary libraries

<a id='wrangling'></a>
# Data Wrangling

### General Properties

In [None]:
df = pd.read_csv("tmdb-movies.csv")

### reading data from a CSV file into a DataFrame

In [None]:
df.head()

### displaying the first five rows of the data

In [None]:
df.info()

### showing DataFrame info (data types, non-null counts)

In [None]:
df.shape

### getting the number of rows and columns in the data

In [None]:
df.describe()

### showing count, mean, min, max, std, and other stats for numeric columns

In [None]:
df.duplicated().sum()

### counting the number of duplicate rows
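
As a small illustration of what `duplicated().sum()` counts, the sketch below uses a made-up three-row DataFrame (the titles and budgets are hypothetical, not taken from `tmdb-movies.csv`). By default only the second and later copies of a row are flagged, so one repeated row gives a count of one.

```python
import pandas as pd

# Hypothetical toy data: the second and third rows are identical
toy = pd.DataFrame({
    "original_title": ["Film A", "Film B", "Film B"],
    "budget": [100, 200, 200],
})

# duplicated() marks each repeat of an earlier row as True
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduped = toy.drop_duplicates()

print(n_duplicates)
print(len(deduped))
```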

# Data Cleaning 

In [None]:
df.drop(columns=["homepage", "production_companies", "keywords", "tagline", "cast", "overview", "imdb_id"], inplace=True)

###  Removing unnecessary columns from the DataFrame to clean the data

In [None]:
df = df.dropna(subset=["director","genres"])

### Removing rows where the "director" or "genres" columns have missing values
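
The following sketch shows how `dropna(subset=...)` behaves on a tiny, hypothetical DataFrame: only missing values in the listed columns cause a row to be dropped, while missing values in other columns (here an invented `runtime` column) are ignored.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in different columns
toy = pd.DataFrame({
    "director": ["Director A", None, "Director C", "Director D"],
    "genres": ["Drama", "Comedy", np.nan, "Action"],
    "runtime": [100.0, 95.0, 120.0, np.nan],
})

# Rows are dropped only when "director" or "genres" is missing
cleaned = toy.dropna(subset=["director", "genres"])

print(cleaned.index.tolist())
```

The last row survives even though its `runtime` is missing, because `runtime` is not in `subset`.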

In [None]:
df['release_date'] = pd.to_datetime(df['release_date'])

### Converting the 'release_date' column to datetime format  
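
A minimal sketch of the conversion on hypothetical date strings. The month/day/year layout with a four-digit year is an assumption made for this example; the format actually used in `tmdb-movies.csv` should be checked before relying on it.

```python
import pandas as pd

# Hypothetical date strings in an assumed month/day/year layout
toy = pd.DataFrame({"release_date": ["6/9/2015", "12/25/1999"]})

# An explicit format avoids ambiguity between day-first and month-first dates
toy["release_date"] = pd.to_datetime(toy["release_date"], format="%m/%d/%Y")

# Once converted, date parts are available through the .dt accessor
years = toy["release_date"].dt.year.tolist()
months = toy["release_date"].dt.month.tolist()

print(years, months)
```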

In [None]:
df = df.drop_duplicates()

### Removing duplicate rows from the DataFrame

In [None]:
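# Treat zeros as missing values (this applies to every column, not only budget and revenue)
df.replace(0, np.nan, inplace=True)
# Fill missing numeric values with each column's mean
df.fillna(df.select_dtypes(include=["number"]).mean(), inplace=True)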
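
The cell above treats every 0 as a missing value and then fills missing numeric values with the column mean. The sketch below reproduces that idea on a hypothetical two-column DataFrame so the effect is easy to check by hand; the numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data where 0 stands for "unknown"
toy = pd.DataFrame({
    "budget": [0, 10, 30],
    "revenue": [50, 0, 150],
})

# Step 1: turn zeros into NaN so they are excluded from the mean
with_nan = toy.replace(0, np.nan)

# Step 2: column means ignore NaN, e.g. budget mean is (10 + 30) / 2
means = with_nan.mean()

# Step 3: fill each column's gaps with that column's mean
filled = with_nan.fillna(means)

print(filled)
```

Mean imputation keeps every row, but it pulls the filled values toward the center of each column, which can change the correlations computed later.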

<a id='eda'></a>
# Exploratory Data Analysis

In [None]:
def plot_correlation(df, x_col, y_col, plot_type, title, xlabel, ylabel, hmv):
    """Draw a scatter plot, bar plot, or correlation heatmap, chosen by plot_type."""
    plt.figure(figsize=(8, 6))
    if plot_type == "scatter":
        # Only the y values are log-transformed; log1p also handles zeros
        log_y_col = np.log1p(df[y_col])
        sns.scatterplot(x=df[x_col], y=log_y_col, alpha=0.7)
        sns.regplot(x=df[x_col], y=log_y_col, scatter=False, color="red")
        plt.title(title, fontsize=14)
        plt.xlabel(xlabel, fontsize=12)
        plt.ylabel(f"log(1 + {ylabel})", fontsize=12)
    elif plot_type == "barplot":
        # x_col and y_col are Series here: values give bar lengths, index gives bar names
        sns.barplot(x=x_col.values, y=y_col.index, palette="viridis")
        plt.title(title, fontsize=14)
        plt.xlabel(xlabel, fontsize=12)
        plt.ylabel(ylabel, fontsize=12)
    elif plot_type == "heatmap":
        # hmv must contain only numeric columns
        sns.heatmap(hmv.corr(), annot=True, cmap="Blues", fmt=".2f", linewidths=0.5)
        plt.title(title)
    else:
        print(f"Unknown plot_type: {plot_type!r}")
    plt.show()


# 📊 Plot Correlation Function  

This function `plot_correlation()` is designed to **visualize relationships between variables** in a dataset using different types of plots:  
- **Scatter Plot** (with regression line)  
- **Bar Plot**  
- **Heatmap**  

## 📌 Parameters:  
- `df` → The DataFrame containing the data.  
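- `x_col` → The column name for the x-axis (for `"barplot"`, a Series whose values set the bar lengths).  
- `y_col` → The column name for the y-axis, plotted as `log1p` for `"scatter"` (for `"barplot"`, a Series whose index provides the bar names).  
- `plot_type` → The type of plot (`"scatter"`, `"barplot"`, or `"heatmap"`).  
- `title` → Title of the plot.  
- `xlabel` → Label for the x-axis.  
- `ylabel` → Label for the y-axis.  
- `hmv` → Numeric DataFrame whose correlation matrix is drawn for `"heatmap"`; unused by the other plot types.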

### Research Question 1 (Does a film's popularity increase as its budget increases?)

In [None]:
plot_correlation(df, "popularity", "budget", "scatter", "Correlation between popularity and budget", "popularity", "budget", "")

# 📊 Correlation Between Popularity and Budget  

This visualization explores the **relationship between movie popularity and budget** using a scatter plot with a regression line.  

## 📌 Steps:  
1. **Apply a `log1p` transformation** to `budget` (the y-axis) to compress its wide range; `popularity` is plotted unchanged.  
2. **Create a scatter plot** to visualize the data distribution.  
3. **Overlay a regression line** to highlight the trend.  
4. **Add labels and a title** for better readability. 
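
To put a number next to the regression line, the sketch below applies `np.log1p` (the transform used in the scatter branch of `plot_correlation()`) to hypothetical budget values and computes the Pearson correlation with popularity. The data are invented for illustration and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: budgets span several orders of magnitude
toy = pd.DataFrame({
    "popularity": [0.5, 1.0, 2.0, 4.0],
    "budget": [1e5, 1e6, 1e7, 1e8],
})

# log1p(x) = log(1 + x) compresses large values and is defined at x = 0
log_budget = np.log1p(toy["budget"])

# Pearson correlation between popularity and the transformed budget
r = toy["popularity"].corr(log_budget)

# expm1 undoes log1p, so the original values can be recovered
recovered = np.expm1(log_budget)

print(round(r, 3))
```

On the real data, the same two lines (`np.log1p` followed by `.corr`) would give the strength of the trend that the red line only shows visually.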

### Research Question 2 (Which directors have directed the most films?)

In [None]:
top_directors = df["director"].value_counts().head(10) 
# The bars are horizontal: counts run along the x-axis and director names along the y-axis
plot_correlation(df, top_directors, top_directors, "barplot", "Top Contributing Directors in Movies", "number of movies", "directors", "")

# 🎬 Top Contributing Directors in Movies  

In this analysis, we will extract the **top 10 directors** with the highest number of movies in the dataset and visualize the data using a bar chart.  

### 📌 Steps:  
1. **Count the number of movies per director** using `value_counts()`.  
2. **Select the top 10 directors** based on the number of movies.  
3. **Create a horizontal bar chart** using **Seaborn**.  
4. **Add labels and titles** to improve readability. 
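
The counting step can be sketched on a short, hypothetical list of directors. `value_counts()` returns the counts sorted from most to least frequent, so `head(n)` is enough to pick the top `n`.

```python
import pandas as pd

# Hypothetical toy data: one row per movie
toy = pd.DataFrame({
    "director": [
        "Director A", "Director B", "Director A",
        "Director C", "Director A", "Director B",
    ]
})

# Count movies per director, most frequent first, and keep the top 2
top = toy["director"].value_counts().head(2)

# The index holds the names and the values hold the counts
print(top.index.tolist())
print(top.tolist())
```

Note that a movie with several credited directors stored in one string would be counted under that combined string, not under each individual name.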

In [None]:
data_numeric = df.select_dtypes(include=["number"])
plot_correlation(df, "", "", "heatmap", "Correlation heatmap", "", "", data_numeric)

# 🔥 Correlation Heatmap of Numeric Data  

In this analysis, we generate a **heatmap** to visualize the correlation between numerical variables in the dataset.  

## 📌 Steps:  
1. **Select numeric columns** using `select_dtypes(include=["number"])`.  
2. **Calculate correlations** between these numerical features.  
3. **Visualize the correlation matrix** using a **Seaborn heatmap**.  
4. **Customize the heatmap** with annotations, colors, and formatting for better readability.  
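
The first two steps can be checked without plotting. The sketch below builds a hypothetical DataFrame with one text column and three numeric columns, keeps the numeric ones, and computes the correlation matrix that a heatmap would display; all values are invented.

```python
import pandas as pd

# Hypothetical toy data: "title" is text and must be excluded before corr()
toy = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "vote_count": [10, 20, 30, 40],
    "revenue": [100, 200, 300, 400],
    "runtime": [120, 90, 110, 95],
})

# Keep only numeric columns
numeric = toy.select_dtypes(include=["number"])

# Pairwise Pearson correlations; the matrix is symmetric with 1.0 on the diagonal
corr = numeric.corr()

print(corr.round(2))
```

Here `vote_count` and `revenue` are perfectly correlated by construction, which is what a cell value of 1.00 in the heatmap would mean.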

<a id='conclusions'></a>
#  Conclusions

From our analysis, we found a slight positive correlation between budget and popularity: movies with higher budgets tend to be somewhat more popular, although correlation alone does not show that a larger budget causes higher popularity. We also identified the 10 directors with the most films in the dataset. The heatmap helped us examine correlations between the other numeric variables, such as vote count and revenue.

# Limitations:

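1. Limited time: The analysis was conducted in a short time frame, which restricted deeper exploration.

2. Data loss: A significant amount of data was removed during cleaning: seven columns, rows with a missing `director` or `genres` value, and duplicate rows.

3. Missing values: The dataset contained a large number of missing values, including zeros that were treated as missing. These were filled with column means, which can distort the distributions and the correlations reported above.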

