Image By: Casie
- README.md: Project documentation
- student.ipynb: Jupyter notebook for analysis
- Images/: Saved plots and figures
- Data/: Dataset files
- Presentation/: Project summary slides (PDF)
- zippedData/: Zipped project data
- Business Understanding
- Data Understanding
- Exploratory Data Analysis (EDA)
- Conclusion and Recommendations
- Links and Resources (Data Source)
The entertainment industry is undergoing rapid transformation, with original content emerging as a key driver of audience engagement and revenue growth. Major players such as Netflix, Amazon, and others are investing heavily in original films, reaping substantial financial returns and strengthening their brand presence. Recognizing this trend, Flix company has made the strategic decision to launch a new movie studio. However, it currently lacks the data-driven insights needed to understand which factors contribute to a film’s box office success. As data scientists, our role is to explore publicly available movie performance data to uncover patterns that indicate what makes a movie financially successful. The goal is to provide clear, data-driven, actionable recommendations that will help guide decisions about genre, budget size, release timing, and other production choices.
Primary Stakeholder: Head of the New Movie Studio
Use Case: Leverage data-driven insights to inform strategic decisions on film production. This includes identifying high-performing genres, determining optimal budget ranges, selecting ideal release windows, and shaping casting strategies—all aimed at maximizing box office success and return on investment.
- Identify which genres perform best at the box office, considering revenue and profitability.
- Analyze the impact of budget, runtime, cast, and release month on a film’s success.
- Provide actionable recommendations for the types of films the company should produce.
This project provides a data-driven foundation to support the successful launch of Flix company’s new movie studio while reducing financial risk. By uncovering the key factors that correlate with box office success, the Head of Studio is equipped to make informed, strategic decisions, including:
- Genre Selection: Focus on genres with a strong track record of performance.
- Budget Planning: Allocate production budgets based on historically successful investment ranges.
- Release Strategy: Optimize release timing to align with peak audience engagement periods.
- Talent Strategy: Identify the cast and crew characteristics commonly linked to high-grossing films.
This project uses data from four high-quality, complementary sources of movie data: Box Office Mojo, TheMovieDB, IMDb, and The Numbers.
Box Office Mojo provides domestic box office revenue data.
Key features: title, studio, domestic_gross, release_date, and year.
Purpose: Used to determine the financial performance of films.
TheMovieDB provides user-generated popularity and voting data.
Key features: title, popularity, vote_average, vote_count, release_date, genres, budget, revenue
Purpose: Complements Box Office Mojo and IMDb with:
- Popularity metrics: show which films gain audience traction before and after release
- Vote data: allows cross-comparison with IMDb ratings
IMDb contains detailed metadata about films and user ratings.
Key tables used:
- movie_basics: Includes primary_title, original_title, genres, runtime_minutes, and start_year.
- movie_ratings: Contains user rating data (average_rating, num_votes).
Purpose: Used for movie characteristics and audience quality perceptions.
The Numbers consists of film production budgets and worldwide gross figures.
Key features:
- release_date, movie, production_budget, domestic_gross, worldwide_gross
Why it matters:
- Gives a complete financial picture by providing both the cost of making the film (production budget) and the revenue generated globally.
- Allows calculation of Return on Investment (ROI), one of the most important metrics when deciding which types of films to produce.
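As a rough illustration of how profit and ROI can be derived from the production_budget and worldwide_gross columns, here is a minimal sketch. It assumes the common definition ROI = profit / production budget, which this README does not spell out, and that the dollar values have already been converted to numbers. The function name `add_profit_and_roi` and the sample films are hypothetical.

```python
def add_profit_and_roi(films):
    """Return copies of the film records with 'profit' and 'roi' added.

    Assumed definitions (not confirmed by the notebook):
      profit = worldwide_gross - production_budget
      roi    = profit / production_budget
    """
    enriched = []
    for film in films:
        budget = film["production_budget"]
        gross = film["worldwide_gross"]
        profit = gross - budget
        # ROI is undefined for a zero budget, so store None in that case.
        roi = profit / budget if budget > 0 else None
        enriched.append({**film, "profit": profit, "roi": roi})
    return enriched


# Hypothetical records, for illustration only.
sample_films = [
    {"movie": "Film A", "production_budget": 100_000_000, "worldwide_gross": 350_000_000},
    {"movie": "Film B", "production_budget": 20_000_000, "worldwide_gross": 15_000_000},
]

enriched_films = add_profit_and_roi(sample_films)
for film in enriched_films:
    print(film["movie"], film["profit"], film["roi"])
```

An ROI of 2.5 means the film earned back its budget plus 2.5 times that amount; a negative ROI means the film did not recover its production budget.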
This plot helps us identify the highest-earning genres in terms of raw profit.
- Action, Adventure, Sci-Fi, and Fantasy genres generally show the highest total profit. These genres attract large global audiences and usually receive bigger production and marketing budgets, resulting in high box office returns.
- Documentary, Music, and Experimental genres tend to have the lowest total profits, reflecting limited theatrical release, smaller audiences, and lower budgets.
- Some genres may have high movie counts but low profit totals (e.g., Drama), meaning many such films are made but few are huge moneymakers.
- Others, like Fantasy or Sci-Fi, may have fewer movies but outsized profit contributions.
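The per-genre totals behind a plot like this can be sketched as follows. This is an illustration only, not the notebook's code: it assumes each film has a comma-separated genres string and a precomputed profit value, and the function name `profit_by_genre` and the sample data are hypothetical. A film with several genres is counted once in each, so genre totals overlap.

```python
from collections import defaultdict


def profit_by_genre(films):
    """Sum profit and count films per genre, sorted by total profit (descending)."""
    totals = defaultdict(lambda: {"total_profit": 0, "movie_count": 0})
    for film in films:
        # Assumption: genres are stored as one comma-separated string.
        for genre in film["genres"].split(","):
            genre = genre.strip()
            totals[genre]["total_profit"] += film["profit"]
            totals[genre]["movie_count"] += 1
    return sorted(totals.items(), key=lambda item: item[1]["total_profit"], reverse=True)


# Hypothetical films with profit in millions of dollars.
sample = [
    {"genres": "Action,Sci-Fi", "profit": 300},
    {"genres": "Drama", "profit": 20},
    {"genres": "Drama", "profit": -10},
    {"genres": "Action", "profit": 150},
]

ranking = profit_by_genre(sample)
for genre, stats in ranking:
    print(genre, stats["total_profit"], stats["movie_count"])
```

Keeping the movie count next to the total makes it easier to spot genres with many films but a small combined profit, as well as genres with few films and a large one.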
There is a positive correlation:
- Generally, higher production budgets are associated with higher worldwide gross.
- The scatter points tend to trend upward to the right, showing that spending more usually results in greater revenue, though not always proportionally.
- A sharp peak near $0 or negative values suggests that many films either break even or lose money.
- A long tail toward the right shows that while only a few films make huge profits, they skew the average upward.
- This is known as a right-skewed distribution, common in high-risk, high-reward industries such as film.
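One simple way to check for right skew is to compare the mean with the median: a few very large values pull the mean up while the median stays near the bulk of the data. The sketch below uses made-up profit figures (in millions of dollars) purely for illustration; they are not taken from the project data.

```python
from statistics import mean, median

# Hypothetical profits in millions of dollars: most films sit near zero,
# while a couple of hits form the long right tail.
profits_musd = [-20, -5, 0, 2, 5, 8, 12, 30, 150, 900]

avg_profit = mean(profits_musd)
median_profit = median(profits_musd)
share_not_profitable = sum(p <= 0 for p in profits_musd) / len(profits_musd)

print(f"mean: {avg_profit:.1f}")
print(f"median: {median_profit:.1f}")
print(f"share at or below break-even: {share_not_profitable:.0%}")
```

When the mean sits far above the median, as it does here, the median is usually the more representative summary of a typical film's profit.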
- Scatter points are spread widely across all runtime values.
- A film’s length alone rarely predicts financial success.
- Many high-grossing films cluster around 100–140 minutes. This is the sweet spot where blockbusters, action, and adventure films typically fall.
- Films with more IMDb votes tend to have higher profits.
- More audience engagement (votes) often reflects wider viewership and box office success.
- The points trend upward as the number of votes increases.
- Films with fewer votes (low popularity) show mixed profit outcomes: some lose money, while others make a profit.
- Films with very high IMDb vote counts (more than 100,000) almost always have positive profits.
There is a strong positive correlation:
- production_budget vs worldwide_gross: high positive correlation (~0.6 to 0.8). Bigger-budget films tend to earn higher worldwide gross; large-scale marketing, global releases, and big-name casts drive this.
- production_budget vs profit: moderate positive correlation. Higher budgets may lead to higher profit, but this is not guaranteed; profits also depend on cost control and reception.
There is a very strong positive correlation (close to 1.0):
- worldwide_gross vs profit: profit is largely driven by worldwide gross, as expected, since profit = worldwide gross - production budget.
There is weak or no correlation:
- runtime_minutes shows very low or negligible correlation with both profit and gross. This confirms the earlier finding that runtime does not significantly drive revenue or profit on its own.
- release_year has low correlation with the other variables. This implies no strong trend over the years in budgets or profits in this dataset, unless the time window includes massive shifts (e.g., pandemic years).
There is a possible weak positive trend:
- release_year vs production_budget sometimes shows a slight positive correlation, indicating that film budgets have gradually risen over the years.
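For reference, the correlation values discussed here are Pearson coefficients, which can be computed as in the following minimal sketch. The helper `pearson_r` and the toy numbers are hypothetical. The example also shows why worldwide_gross and profit correlate so strongly: profit is computed directly from gross.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical budgets and worldwide gross, in millions of dollars.
budgets = [10, 20, 50, 100, 200]
gross = [5, 60, 90, 400, 500]
profit = [g - b for g, b in zip(gross, budgets)]

r_budget_gross = pearson_r(budgets, gross)
r_gross_profit = pearson_r(gross, profit)
print(f"budget vs gross:  {r_budget_gross:.2f}")
print(f"gross vs profit:  {r_gross_profit:.2f}")
```

A coefficient near 1 indicates a strong linear association, but it does not show that one variable causes the other.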
The goal is to predict movie profit using linear regression, testing multiple numeric features: production budget, runtime, average rating, and number of votes.
- Intercept: $4,925,123 (4925122.745559871). This is the base profit when all inputs are zero.
- R-squared: 0.546. This is how much of the variation in profit the model can explain. It is decent but not perfect, indicating a moderate level of predictive power.
- RMSE: 129332520.437. This is the root mean squared error, the prediction error in dollars. It is high, meaning movie profits are influenced by many other factors.
The runtime coefficient is negative: holding the other features fixed, every extra minute of runtime reduces predicted profit by about $724k. Rating has a large coefficient: about $7.26M for every 1-point increase in average rating.
The model suggests that bigger budgets, more votes, and higher ratings help boost profit, while longer runtime slightly reduces it. The model is decent, explaining about 55% of profit variation, but it still has large prediction errors, which is typical in the movie industry due to its complexity.
Most points cluster around the red line, showing the model captures the general profit trend.
Spread increases at higher profits, indicating less accuracy for big hits.
R² ≈ 0.546: the model explains about 54.6% of profit variation.
There are some clear outliers where predictions are far off.
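To make the two reported metrics concrete, the sketch below computes R² and RMSE from a list of actual and predicted values. It is a generic illustration with made-up numbers, not the notebook's model; the function names `r_squared` and `rmse` are hypothetical.

```python
from math import sqrt


def r_squared(actual, predicted):
    """Share of variance in `actual` explained by `predicted`: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(actual))


# Hypothetical actual vs predicted profits, in millions of dollars.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

r2_value = r_squared(actual, predicted)
rmse_value = rmse(actual, predicted)
print(f"R^2:  {r2_value:.3f}")
print(f"RMSE: {rmse_value:.3f}")
```

Because RMSE is expressed in the units of the target, an RMSE of roughly $129 million on profit means individual predictions can miss by a very large margin even when R² is moderate.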
This is a one-tailed independent-samples t-test comparing the average ratings of Action vs. non-Action movies.
Null hypothesis (H0): Action movies do not have higher average ratings than non-Action movies.
Alternative hypothesis (H1): Action movies have higher average ratings than non-Action movies.
T-statistic: -0.7174
One-tailed p-value: 0.2367
If the p-value is less than alpha and the t-statistic is positive (i.e., Action movies have a higher mean), the test rejects the null hypothesis, indicating statistical evidence that Action movies are rated higher.
Otherwise, it fails to reject the null, meaning there is no strong evidence for a difference in the expected direction.
Here the t-statistic is negative, so the sample mean rating for Action movies is lower than for non-Action movies.
Result: Fail to reject the null hypothesis. There is no significant evidence that Action movies have higher ratings.
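The decision rule can be sketched as below. Two things here are assumptions rather than facts from the notebook: alpha = 0.05 (no alpha is stated), and that the reported 0.2367 was obtained by halving a two-sided p-value of about 0.4734. If that is the case, the p-value for the "Action is higher" direction with a negative t-statistic would be 1 - 0.2367 ≈ 0.763; the conclusion is the same either way. The function names are hypothetical.

```python
def one_tailed_p_greater(t_stat, p_two_sided):
    """Convert a two-sided p-value to a one-tailed p-value for H1: mean_A > mean_B.

    Half the two-sided p-value applies only when the t-statistic points in the
    hypothesized (positive) direction; otherwise the upper-tail p-value is 1 - half.
    """
    half = p_two_sided / 2
    return half if t_stat > 0 else 1 - half


def decide(t_stat, p_two_sided, alpha=0.05):
    """Apply the rule: reject H0 only if t is positive and the one-tailed p < alpha."""
    p_one = one_tailed_p_greater(t_stat, p_two_sided)
    if t_stat > 0 and p_one < alpha:
        return "reject H0"
    return "fail to reject H0"


# Reported t-statistic; the two-sided p-value is ASSUMED to be 2 * 0.2367.
t_stat = -0.7174
p_two_sided = 0.4734

p_upper_tail = one_tailed_p_greater(t_stat, p_two_sided)
decision = decide(t_stat, p_two_sided)
print(f"one-tailed p (Action > non-Action): {p_upper_tail:.4f}")
print(decision)
```

Checking the sign of the t-statistic before halving the p-value avoids reporting a small one-tailed p-value for an effect that points in the opposite direction.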
Recommendations:
- Focus on Action, Adventure, and Sci-Fi genres which have the highest gross and profit.
- Films with higher budgets generally earn more, but profit should be considered carefully.
- Optimal runtimes appear around 120-150 minutes for blockbuster success.
- Build hype early to raise IMDb vote counts. Films with more votes tend to earn higher profits, although this is an association and does not show that votes themselves drive profit.
- Dataset sources: Box Office Mojo, IMDb, TheMovieDB, The Numbers








