# **1. Data Exploration and Understanding**

## **1.1 Loading the dataset and performing a preliminary examination**

**This code imports the necessary libraries, loads the horror movies dataset, and displays the first few rows to give us an initial look at the data.**

In [1]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('/kaggle/input/horror-movies-dataset/horror_movies.csv')

# Display the first few rows of the dataset
df.head()

| | id | original_title | title | original_language | overview | tagline | release_date | poster_path | popularity | vote_count | vote_average | budget | revenue | runtime | status | adult | backdrop_path | genre_names | collection | collection_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 760161 | Orphan: First Kill | Orphan: First Kill | en | After escaping from an Estonian psychiatric fa... | There's always been something wrong with Esther. | 2022-07-27 | /pHkKbIRoCe7zIFvqan9LFSaQAde.jpg | 5088.584 | 902 | 6.9 | 0 | 9572765 | 99 | Released | False | /5GA3vV1aWWHTSDO5eno8V5zDo8r.jpg | Horror, Thriller | 760193.0 | Orphan Collection |
| 1 | 760741 | Beast | Beast | en | A recently widowed man and his two teenage dau... | Fight for family. | 2022-08-11 | /xIGr7UHsKf0URWmyyd5qFMAq4d8.jpg | 2172.338 | 584 | 7.1 | 0 | 56000000 | 93 | Released | False | /2k9tBql5GYH328Krj66tDT9LtFZ.jpg | Adventure, Drama, Horror | NaN | NaN |
| 2 | 882598 | Smile | Smile | en | After witnessing a bizarre, traumatic incident... | Once you see it, it’s too late. | 2022-09-23 | /hiaeZKzwsk4y4atFhmncO5KRxeT.jpg | 1863.628 | 114 | 6.8 | 17000000 | 45000000 | 115 | Released | False | /mVNPfpydornVe4H4UCIk7WevWjf.jpg | Horror, Mystery, Thriller | NaN | NaN |
| 3 | 756999 | The Black Phone | The Black Phone | en | Finney Blake, a shy but clever 13-year-old boy... | Never talk to strangers. | 2022-06-22 | /lr11mCT85T1JanlgjMuhs9nMht4.jpg | 1071.398 | 2736 | 7.9 | 18800000 | 161000000 | 103 | Released | False | /AfvIjhDu9p64jKcmohS4hsPG95Q.jpg | Horror, Thriller | NaN | NaN |
| 4 | 772450 | Presencias | Presences | es | A man who loses his wife and goes to seclude h... | NaN | 2022-09-07 | /dgDT3uol3mdvwEg0jt1ble3l9hw.jpg | 1020.995 | 83 | 7.0 | 0 | 0 | 0 | Released | False | /ojfzhdwRemcDt1I6pao6vVLw9AA.jpg | Horror | NaN | NaN |
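
Section 4.2 later converts `release_date` with `pd.to_datetime`. As an alternative, the column can be parsed while loading. A minimal sketch, using an inline CSV with a few of the columns and values shown above as a stand-in for the real file (the name `sample` is illustrative):

```python
import io
import pandas as pd

# Small stand-in for horror_movies.csv with only a few of its columns
csv_text = """id,title,release_date,budget,revenue
760161,Orphan: First Kill,2022-07-27,0,9572765
882598,Smile,2022-09-23,17000000,45000000
"""

# parse_dates converts release_date to datetime64 during loading
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["release_date"])

print(sample.dtypes)
print(sample["release_date"].dt.year.tolist())
```

With the real file, the same `parse_dates` argument could be passed to the `pd.read_csv` call above.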


## **1.2 Create descriptive statistics for the columns 'budget', 'revenue' and 'popularity'**

**This code provides a summary of the central tendency, dispersion, and shape of the distribution of the specified columns. It helps in understanding the range, mean, median, and potential outliers in the data.**

In [2]:
# Generate descriptive statistics for the specified columns
df[['budget', 'revenue', 'popularity']].describe()

| | budget | revenue | popularity |
|---|---|---|---|
| count | 32540.0 | 32540.0 | 32540.0 |
| mean | 543126.6 | 1349747.0 | 4.013456 |
| std | 4542668.0 | 14430480.0 | 37.513472 |
| min | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.6 |
| 50% | 0.0 | 0.0 | 0.84 |
| 75% | 0.0 | 0.0 | 2.24325 |
| max | 200000000.0 | 701842600.0 | 5088.584 |


## **1.3 Visualize the distribution of 'budget', 'revenue', and 'popularity' columns**

**This code produces interactive histograms of 'budget', 'revenue', and 'popularity' using Plotly, stacked in a single-column layout.**

In [3]:
# Import necessary libraries
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Create a subplot layout
fig = make_subplots(rows=3, cols=1, subplot_titles=('Distribution of Budget', 'Distribution of Revenue', 'Distribution of Popularity'))

# Plot distribution for 'budget'
fig.add_trace(go.Histogram(x=df['budget'], name='Budget', marker_color='blue', nbinsx=50), row=1, col=1)

# Plot distribution for 'revenue'
fig.add_trace(go.Histogram(x=df['revenue'], name='Revenue', marker_color='green', nbinsx=50), row=2, col=1)

# Plot distribution for 'popularity'
fig.add_trace(go.Histogram(x=df['popularity'], name='Popularity', marker_color='red', nbinsx=50), row=3, col=1)

# Update layout for better appearance
fig.update_layout(height=800, width=800, title_text="Distributions of Budget, Revenue, and Popularity", showlegend=False)
fig.update_xaxes(title_text="Budget", row=1, col=1)
fig.update_xaxes(title_text="Revenue", row=2, col=1)
fig.update_xaxes(title_text="Popularity", row=3, col=1)
fig.update_yaxes(title_text="Frequency", row=1, col=1)
fig.update_yaxes(title_text="Frequency", row=2, col=1)
fig.update_yaxes(title_text="Frequency", row=3, col=1)

# Display the plots
fig.show()
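
Because most values sit at or near zero (see the quartiles in 1.2), linear-scale histograms of these columns tend to collapse into a single tall bar. One possible workaround, sketched here on made-up numbers rather than the real dataset, is to bin `log1p` of the values instead:

```python
import numpy as np
import pandas as pd

# Hypothetical budgets: several zeros plus a few very large values
budget = pd.Series([0, 0, 0, 50_000, 1_000_000, 17_000_000, 200_000_000])

# log1p(x) = log(1 + x) keeps zeros at 0 and compresses the long right tail
log_budget = np.log1p(budget)

# Bin counts on the log scale; the edges are in log units
counts, edges = np.histogram(log_budget, bins=5)
print(counts)
print(edges.round(2))
```

The transformed series could equally be passed as `x` to `go.Histogram`, with the axis title adjusted to say that the scale is logarithmic.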

# **2. Data Preprocessing**

## **2.1 Handle missing values**

In [4]:
# Check for missing values in the specified columns
missing_values = df[['budget', 'revenue', 'popularity']].isnull().sum()
print(missing_values)

budget        0
revenue       0
popularity    0
dtype: int64
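
The check reports no nulls, yet the summary in 1.2 shows that at least 75% of the 'budget' and 'revenue' values are 0. It is plausible, though not confirmed by the source, that 0 stands for "not reported" in these columns. Under that assumption, a sketch of counting such placeholders as missing values (sample rows only):

```python
import numpy as np
import pandas as pd

# Hypothetical sample; 0 is ASSUMED to mean "not reported"
sample = pd.DataFrame({
    "budget": [0, 17_000_000, 0, 18_800_000],
    "revenue": [9_572_765, 45_000_000, 0, 161_000_000],
})

# Replace the placeholder with NaN so pandas treats it as missing
as_missing = sample.replace(0, np.nan)

print(as_missing.isnull().sum())
```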


## **2.2 Check for duplicates**

**This code checks for duplicate rows in the dataset.**

In [5]:
# Check for duplicate rows
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

Number of duplicate rows: 0
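
`df.duplicated()` only flags rows that match in every column, so two records for the same movie that differ in a single field would not be counted. A sketch of a narrower check, assuming `id` is meant to identify a movie uniquely; the rows are invented:

```python
import pandas as pd

# Hypothetical rows: the same id appears twice with a different popularity
movies = pd.DataFrame({
    "id": [101, 102, 101],
    "title": ["A", "B", "A"],
    "popularity": [1.5, 0.8, 1.7],
})

full_row_dupes = movies.duplicated().sum()          # compares every column
id_dupes = movies.duplicated(subset=["id"]).sum()   # compares only the id

print(full_row_dupes, id_dupes)

# Keep the first occurrence of each id
deduped = movies.drop_duplicates(subset=["id"], keep="first")
print(len(deduped))
```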


**This code removes any duplicate rows from the dataset.**

In [6]:
# Remove duplicate rows (if any)
df.drop_duplicates(inplace=True)

## **2.3 Feature Engineering: Calculate the ROI for each movie**

**This code calculates the Return on Investment (ROI) for each movie as a percentage: (revenue - budget) / budget × 100.**

In [7]:
# Calculate ROI for each movie
df['ROI'] = ((df['revenue'] - df['budget']) / df['budget']) * 100
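
Dividing by a budget of 0 yields `inf` (or `NaN` when revenue is also 0), which is what later pushes zero-budget titles to the top of the ROI ranking. A minimal sketch of one way to guard against this, assuming a zero budget should be treated as unknown; `roi_percent` is a hypothetical helper, not part of the notebook:

```python
import pandas as pd

def roi_percent(revenue: pd.Series, budget: pd.Series) -> pd.Series:
    """ROI in percent; rows with a non-positive budget get NaN."""
    valid_budget = budget.where(budget > 0)  # values <= 0 become NaN
    return (revenue - valid_budget) / valid_budget * 100

sample = pd.DataFrame({
    "budget": [0, 17_000_000, 18_800_000],
    "revenue": [9_572_765, 45_000_000, 161_000_000],
})
sample["ROI"] = roi_percent(sample["revenue"], sample["budget"])
print(sample)
```

Since `sort_values` places `NaN` last by default, rows without a usable budget no longer dominate a descending ROI ranking.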

## **2.4 Normalize or standardize**

**This code standardizes the 'budget', 'revenue', and 'popularity' columns. Standardization is often required when using machine learning models to ensure that all features have the same scale.**

In [8]:
# Standardize 'budget', 'revenue', and 'popularity' columns
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df[['budget', 'revenue', 'popularity']] = scaler.fit_transform(df[['budget', 'revenue', 'popularity']])
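
Overwriting the raw columns means that every later table and chart shows z-scores instead of the original amounts. A sketch of the same transformation written to separate columns, using z = (x - mean) / std with the population standard deviation, which is what `StandardScaler` computes by default. The `_z` suffix and the sample rows are illustrative choices:

```python
import pandas as pd

cols = ["budget", "revenue", "popularity"]
sample = pd.DataFrame({
    "budget": [0.0, 17_000_000.0, 18_800_000.0],
    "revenue": [9_572_765.0, 45_000_000.0, 161_000_000.0],
    "popularity": [5088.584, 1863.628, 1071.398],
})

# ddof=0 gives the population standard deviation
z = (sample[cols] - sample[cols].mean()) / sample[cols].std(ddof=0)

# Store the scaled values next to the raw ones instead of replacing them
sample = sample.join(z.add_suffix("_z"))
print(sample.round(3))
```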

# **3. Data Analysis**

## **3.1 Using scatter plots to visualize relationships between 'budget', 'revenue', and 'popularity'**

**This code visualizes the relationship between 'budget' and 'revenue', with the color intensity representing 'popularity'.**

In [9]:
# Import necessary libraries
import plotly.express as px

# Scatter plot for Budget vs Revenue
fig1 = px.scatter(df, x='budget', y='revenue', color='popularity', title='Budget vs Revenue colored by Popularity')
fig1.show()

## **3.2 Calculate correlation coefficients to quantify the strength and direction of the relationships**

**This code calculates the correlation coefficients between 'budget', 'revenue', and 'popularity' to understand the strength and direction of their relationships.**

In [10]:
# Calculate correlation coefficients
correlation_matrix = df[['budget', 'revenue', 'popularity']].corr()
print(correlation_matrix)

              budget   revenue  popularity
budget      1.000000  0.630493    0.115664
revenue     0.630493  1.000000    0.155160
popularity  0.115664  0.155160    1.000000
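
`DataFrame.corr()` defaults to the Pearson coefficient, which measures linear association and can be dominated by a few extreme values in skewed data such as this. As a complementary check, a sketch of a rank-based (Spearman) correlation on made-up numbers:

```python
import pandas as pd

# Hypothetical data: monotonic but non-linear, with one extreme value
sample = pd.DataFrame({
    "budget": [1, 2, 3, 4, 5, 6],
    "revenue": [1, 4, 9, 16, 25, 10_000],
})

pearson = sample.corr(method="pearson").loc["budget", "revenue"]
spearman = sample.corr(method="spearman").loc["budget", "revenue"]

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Spearman only uses the ordering of the values, so it is unaffected by the size of the outlier.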


## **3.3 Identify movies with the highest ROI and analyze their characteristics**

**This code identifies the top 5 movies with the highest ROI and displays their titles, ROI, budget, revenue, and popularity.**

In [11]:
# Sort movies by ROI and display top 5
top_roi_movies = df.sort_values(by='ROI', ascending=False).head(5)
print(top_roi_movies[['original_title', 'ROI', 'budget', 'revenue', 'popularity']])

                              original_title  ROI    budget   revenue  \
0                         Orphan: First Kill  inf -0.119563  0.569846   
1929                       The ABCs of Death  inf -0.119563 -0.091901   
1944                              The Hunger  inf -0.119563  0.320822   
1971    Sadomanía (El infierno de la pasión)  inf -0.119563 -0.079798   
1997  The House Next Door: Meet the Blacks 2  inf -0.119563  0.106991   

      popularity  
0     135.541938  
1929    0.161932  
1944    0.159986  
1971    0.156974  
1997    0.153775  


## **3.4 Using clustering algorithms to group movies based on their financial metrics and popularity**

**This code uses the KMeans clustering algorithm to group movies into 3 clusters based on 'budget', 'revenue', and 'popularity'. The clusters are then visualized using a scatter plot.**

In [12]:
# Import necessary libraries
from sklearn.cluster import KMeans

# Extract features for clustering
features = df[['budget', 'revenue', 'popularity']]

# Use KMeans clustering
kmeans = KMeans(n_clusters=3, random_state=42)
df['cluster'] = kmeans.fit_predict(features)

# Visualize clusters using scatter plot
fig2 = px.scatter(df, x='budget', y='revenue', color='cluster', title='Clusters of Movies based on Budget, Revenue, and Popularity')
fig2.show()
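
To interpret the clusters, it can help to look at their sizes and average feature values. A sketch with invented labels and values standing in for the `cluster` column produced above:

```python
import pandas as pd

# Hypothetical standardized features with cluster labels already assigned
sample = pd.DataFrame({
    "budget": [-0.1, -0.1, 3.2, 4.0, 0.5],
    "revenue": [-0.1, 0.0, 5.1, 6.3, 0.4],
    "popularity": [0.0, 0.1, 0.3, 0.2, 120.0],
    "cluster": [0, 0, 1, 1, 2],
})

# Per-cluster mean of each feature plus the number of movies in the cluster
profile = sample.groupby("cluster")[["budget", "revenue", "popularity"]].mean()
profile["n_movies"] = sample.groupby("cluster").size()
print(profile)
```

A cluster that contains only a handful of movies usually reflects outliers rather than a broad group.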





# **4. Interpretation and Communication**

## **4.1 Summarize key findings from the analysis**

From my analysis, I observed the following:

* 'budget' and 'revenue' are positively correlated (r ≈ 0.63), indicating that movies with higher budgets tend to generate higher revenues.

* 'Popularity' has only a weak positive correlation with 'revenue' (r ≈ 0.16), so more popular movies tend to earn slightly more.

* The five movies ranked highest by ROI all show an infinite ROI because their recorded budget is zero, which is most likely due to missing or incorrect data rather than a genuine return.

* The clustering analysis grouped movies into three clusters based on their budget, revenue, and popularity metrics.

## **4.2 Creating visual dashboards or presentations highlighting the most profitable movies, trends in ROI over time, and the relationship between budget, revenue, and popularity**

**This code visualizes the top 10 movies by revenue, used here as a proxy for profitability.**

In [13]:
# Visualizing the top 10 most profitable movies
top_profitable_movies = df.sort_values(by='revenue', ascending=False).head(10)
fig3 = px.bar(top_profitable_movies, x='original_title', y='revenue', title='Top 10 Most Profitable Movies')
fig3.show()
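
Revenue alone does not account for production cost. If profit is taken as revenue minus budget (an assumption, since the dataset has no profit column) and the raw rather than standardized values are used, a ranking could be sketched as follows on three rows from the preview in 1.1:

```python
import pandas as pd

sample = pd.DataFrame({
    "original_title": ["Smile", "The Black Phone", "Beast"],
    "budget": [17_000_000, 18_800_000, 0],
    "revenue": [45_000_000, 161_000_000, 56_000_000],
})

# Only rows with a reported (non-zero) budget have a meaningful profit
known = sample[sample["budget"] > 0].copy()
known["profit"] = known["revenue"] - known["budget"]

top_by_profit = known.nlargest(2, "profit")
print(top_by_profit[["original_title", "profit"]])
```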

**This code visualizes the trend of average ROI over the years.**

In [14]:
# Trends in ROI over time
df['release_year'] = pd.to_datetime(df['release_date']).dt.year
avg_roi_per_year = df.groupby('release_year')['ROI'].mean().reset_index()
fig4 = px.line(avg_roi_per_year, x='release_year', y='ROI', title='Trends in ROI Over Time')
fig4.show()
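
Because ROI is `inf` wherever the budget is 0, a yearly mean can itself become `inf` or `NaN`. A sketch of a more robust yearly summary that drops non-finite values and uses the median; the rows are invented:

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "release_date": ["2021-03-01", "2021-10-15", "2022-06-22", "2022-09-23", "2022-07-27"],
    "ROI": [50.0, 150.0, 756.4, 164.7, np.inf],
})
sample["release_year"] = pd.to_datetime(sample["release_date"]).dt.year

# Turn +/-inf into NaN, then drop rows without a usable ROI
finite = sample.replace([np.inf, -np.inf], np.nan).dropna(subset=["ROI"])

median_roi_per_year = finite.groupby("release_year")["ROI"].median().reset_index()
print(median_roi_per_year)
```

The resulting frame has the same `release_year` and `ROI` columns as `avg_roi_per_year`, so it could be passed to `px.line` in the same way.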

**This code visualizes the relationship between budget, revenue, and popularity in a 3D scatter plot.**

In [15]:
# Relationship between budget, revenue, and popularity
fig5 = px.scatter_3d(df, x='budget', y='revenue', z='popularity', color='popularity', title='3D Scatter plot of Budget, Revenue, and Popularity')
fig5.show()

## **4.3 Recommendations based on the analysis**

Based on my analysis, I recommend the following:

* Invest in High Budget Movies: There's a moderate positive correlation between budget and revenue (about 0.63). Investing in high-quality content can potentially yield higher returns.

* Focus on Popularity Metrics: Popularity has a positive but weak correlation with revenue (about 0.16). Marketing and promotional activities that boost a movie's popularity may therefore contribute to increased revenue.

* Analyze High ROI Movies: While high ROI movies can be very profitable, it's essential to understand the factors contributing to their success. This can provide insights for future projects.

* Data Integrity: Ensure that the data, especially financial metrics, are accurate and up-to-date. Incorrect data can lead to misleading analyses.