## VIEWERS PRODUCTION STUDIO ##

## Problem Statement

The corporation is entering the movie business by opening a new studio, but it has no prior filmmaking experience.  To make profitable decisions, the studio needs a thorough understanding of current market trends.  This study examines historical film data to identify the main factors behind box office performance, with particular emphasis on genres, production budgets, and release dates.  The results will help the studio decide which kinds of films to make and when to release them.

## Business Objectives

**1. Maximize box office revenue by focusing on high-performing genres.**

Determine the most lucrative film genres (for example adventure, action, or animation) and give priority to making films in these categories.  Use the analysis to concentrate on genres that have historically brought in the most money at the box office.  For example, the studio can devote more resources to action films if the data indicates that they routinely generate higher gross receipts than other genres.

**2. Timing film releases to maximize audience reach and revenue.**

Schedule movie releases for peak seasons (summer and holidays) to take advantage of increased attendance and boost profits.  Past performance data by month of release can inform release planning.  If the analysis shows that films released in the summer regularly bring in more money, the studio can schedule most of its important releases around that time.

**3. Optimize film budgets to achieve high return on investment.**

Set the budget for new films in a way that maximises return on investment without appreciably raising risk.  Drawing on the relationship between production budgets and box office performance, the studio can adopt a strategy that raises the probability of profitability while avoiding excessive expenditure on films with diminishing returns.  The aim is to identify the budget range in which films have routinely yielded the best returns on investment.

# DATA CLEANING

In [1]:
#First Notebook

In [2]:
import pandas as pd
import sqlite3
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import f_oneway

In [3]:
df1=pd.read_csv('bom.movie_gross.csv')#merge
df1.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3382 non-null   object 
 2   domestic_gross  3359 non-null   float64
 3   foreign_gross   2037 non-null   object 
 4   year            3387 non-null   int64  
dtypes: float64(1), int64(1), object(3)
memory usage: 132.4+ KB


In [4]:
# Summary statistics for the numeric columns
df1.describe()

In [5]:
df1.isnull().sum()


title                0
studio               5
domestic_gross      28
foreign_gross     1350
year                 0
dtype: int64

In [6]:
df1.duplicated().describe()

count      3387
unique        1
top       False
freq       3387
dtype: object

In [7]:
df1['foreign_gross'] = df1['foreign_gross'].str.replace(',', '', regex=False)
df1['foreign_gross'] = pd.to_numeric(df1['foreign_gross'])
df1['foreign_gross'].head()

0    652000000.0
1    691300000.0
2    664300000.0
3    535700000.0
4    513900000.0
Name: foreign_gross, dtype: float64

In [8]:
df1['foreign_gross'].max()

960500000.0

In [9]:
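# Assign the filled columns back instead of calling fillna(inplace=True) on a column selection
df1['foreign_gross'] = df1['foreign_gross'].fillna(0)
df1['studio'] = df1['studio'].fillna('Unknown')
df1['domestic_gross'] = df1['domestic_gross'].fillna(0)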
df1.isnull().sum()

title             0
studio            0
domestic_gross    0
foreign_gross     0
year              0
dtype: int64
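
Filling missing gross figures with 0 treats "unknown" as "earned nothing", which can pull averages down. A minimal sketch on toy numbers (hypothetical values, not taken from `bom.movie_gross.csv`) of how the two choices differ; the `was_missing` flag is an illustrative addition, not part of the cleaning above:

```python
import numpy as np
import pandas as pd

# Toy data: one of four gross figures is missing
gross = pd.Series([100.0, 200.0, np.nan, 300.0], name="foreign_gross")

# Series.mean() skips NaN by default: (100 + 200 + 300) / 3
mean_skipping_nan = gross.mean()

# After fillna(0) the missing film counts as a zero earner: 600 / 4
mean_with_zeros = gross.fillna(0).mean()

# A boolean flag keeps the information that the value was originally missing
was_missing = gross.isna()

print(mean_skipping_nan, mean_with_zeros, int(was_missing.sum()))
```

Whether zero is an acceptable stand-in depends on the later analysis; for averages per genre or per season, leaving the values as NaN may be the safer choice.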

In [10]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3387 entries, 0 to 3386
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           3387 non-null   object 
 1   studio          3387 non-null   object 
 2   domestic_gross  3387 non-null   float64
 3   foreign_gross   3387 non-null   float64
 4   year            3387 non-null   int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 132.4+ KB


In [11]:
df1.head()

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000.0,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000.0,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000.0,2010
3,Inception,WB,292600000.0,535700000.0,2010
4,Shrek Forever After,P/DW,238700000.0,513900000.0,2010


In [12]:
df2=pd.read_csv('tmdb.movies.csv')#merge
df2.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26517 entries, 0 to 26516
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         26517 non-null  int64  
 1   genre_ids          26517 non-null  object 
 2   id                 26517 non-null  int64  
 3   original_language  26517 non-null  object 
 4   original_title     26517 non-null  object 
 5   popularity         26517 non-null  float64
 6   release_date       26517 non-null  object 
 7   title              26517 non-null  object 
 8   vote_average       26517 non-null  float64
 9   vote_count         26517 non-null  int64  
dtypes: float64(2), int64(3), object(5)
memory usage: 2.0+ MB


In [13]:
df2 = df2.drop(columns=['Unnamed: 0'])
df2['release_date'] = pd.to_datetime(df2['release_date'], errors='coerce')
df2['release_date'] = df2['release_date'].dt.date
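
One side effect of `.dt.date` worth keeping in mind: it turns the column into plain Python `date` objects with `object` dtype, so the `.dt` accessor is no longer available afterwards. A small sketch with made-up date strings, assuming the month is needed later:

```python
import pandas as pd

dates = pd.Series(["2010-11-19", "2010-03-26", "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising
as_datetime = pd.to_datetime(dates, errors="coerce")

# .dt.date yields datetime.date objects stored in an object-dtype Series
as_date_objects = as_datetime.dt.date

# Month extraction works on the datetime64 column (NaT becomes NaN)
release_month = as_datetime.dt.month

print(as_datetime.dtype, as_date_objects.dtype)
print(release_month.tolist())
```

Keeping the datetime64 column and formatting it only for display is one way to avoid converting back later.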

In [14]:
df2.duplicated().describe()

count     26517
unique        2
top       False
freq      25497
dtype: object

In [15]:
df2.drop_duplicates(inplace=True)

In [16]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 25497 entries, 0 to 26516
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   genre_ids          25497 non-null  object 
 1   id                 25497 non-null  int64  
 2   original_language  25497 non-null  object 
 3   original_title     25497 non-null  object 
 4   popularity         25497 non-null  float64
 5   release_date       25497 non-null  object 
 6   title              25497 non-null  object 
 7   vote_average       25497 non-null  float64
 8   vote_count         25497 non-null  int64  
dtypes: float64(2), int64(2), object(5)
memory usage: 1.9+ MB


In [17]:
df2.head()

Unnamed: 0,genre_ids,id,original_language,original_title,popularity,release_date,title,vote_average,vote_count
0,"[12, 14, 10751]",12444,en,Harry Potter and the Deathly Hallows: Part 1,33.533,2010-11-19,Harry Potter and the Deathly Hallows: Part 1,7.7,10788
1,"[14, 12, 16, 10751]",10191,en,How to Train Your Dragon,28.734,2010-03-26,How to Train Your Dragon,7.7,7610
2,"[12, 28, 878]",10138,en,Iron Man 2,28.515,2010-05-07,Iron Man 2,6.8,12368
3,"[16, 35, 10751]",862,en,Toy Story,28.005,1995-11-22,Toy Story,7.9,10174
4,"[28, 878, 12]",27205,en,Inception,27.92,2010-07-16,Inception,8.3,22186


In [18]:
df3=pd.read_csv('tn.movie_budgets.csv', index_col=0)#merge
df3.head()


Unnamed: 0_level_0,release_date,movie,production_budget,domestic_gross,worldwide_gross
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,"Dec 18, 2009",Avatar,"$425,000,000","$760,507,625","$2,776,345,279"
2,"May 20, 2011",Pirates of the Caribbean: On Stranger Tides,"$410,600,000","$241,063,875","$1,045,663,875"
3,"Jun 7, 2019",Dark Phoenix,"$350,000,000","$42,762,350","$149,762,350"
4,"May 1, 2015",Avengers: Age of Ultron,"$330,600,000","$459,005,868","$1,403,013,963"
5,"Dec 15, 2017",Star Wars Ep. VIII: The Last Jedi,"$317,000,000","$620,181,382","$1,316,721,747"


In [19]:
def clean_conv(df, columns):
    """Strip '$' and ',' from the given columns and convert them to numbers."""
    for column in columns:
        df[column] = df[column].str.replace(',', '', regex=False)
        df[column] = df[column].str.replace('$', '', regex=False)
        df[column] = pd.to_numeric(df[column])
    return df

In [20]:
columns=['production_budget','domestic_gross','worldwide_gross']
df3=clean_conv(df3,columns)

In [21]:
df3['release_date'] = pd.to_datetime(df3['release_date'], errors='coerce')
df3['release_date'] = df3['release_date'].dt.date

In [22]:
df3.head()

Unnamed: 0_level_0,release_date,movie,production_budget,domestic_gross,worldwide_gross
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1,2009-12-18,Avatar,425000000,760507625,2776345279
2,2011-05-20,Pirates of the Caribbean: On Stranger Tides,410600000,241063875,1045663875
3,2019-06-07,Dark Phoenix,350000000,42762350,149762350
4,2015-05-01,Avengers: Age of Ultron,330600000,459005868,1403013963
5,2017-12-15,Star Wars Ep. VIII: The Last Jedi,317000000,620181382,1316721747


In [23]:
df3.duplicated().describe()

count      5782
unique        1
top       False
freq       5782
dtype: object

In [24]:
df4=pd.read_csv('rt.reviews.tsv', index_col=0,sep='\t', encoding='Latin-1')#merge
df4.info()

FileNotFoundError: [Errno 2] No such file or directory: 'rt.reviews.tsv'

In [None]:
df4['date'] = pd.to_datetime(df4['date'], errors='coerce')
df4['date'] = df4['date'].dt.date

In [None]:
df4.duplicated().describe()

In [None]:
df4.drop_duplicates(inplace=True)

In [None]:
df4.isnull().sum()

In [None]:
# Assign the filled columns back instead of using fillna(inplace=True) on a column selection
for column in ['review', 'rating', 'critic', 'publisher']:
    df4[column] = df4[column].fillna('Unknown')
df4.isnull().sum()

In [None]:
df4.info()

In [None]:
df5=pd.read_csv('rt.movie_info.tsv', index_col=0, sep='\t'  )
df5.head()
df5.info()

In [None]:
df5.isnull().sum()

In [None]:
df5['dvd_date'] = pd.to_datetime(df5['dvd_date'], errors='coerce')
df5['dvd_date'] = df5['dvd_date'].dt.date
df5['theater_date'] = pd.to_datetime(df5['theater_date'], errors='coerce')
df5['theater_date'] = df5['theater_date'].dt.date

In [None]:
def fill_vals(df, columns, value='unknown'):
    """Fill missing values in the given columns with a placeholder."""
    for column in columns:
        df[column] = df[column].fillna(value)
    return df

In [None]:
columns=['synopsis','rating','genre','director','writer','theater_date','dvd_date','runtime']
df5=fill_vals(df5,columns)


In [None]:
df5.drop('currency',axis=1,inplace=True)
df5.drop('box_office',axis=1,inplace=True)
df5.drop('studio',axis=1,inplace=True)
df5.head(10)

In [None]:
df2.duplicated().describe()

In [None]:
import sqlite3
import pandas as pd

In [None]:
conn = sqlite3.connect('im.db')

cursor = conn.cursor()

In [None]:
table_name_query = """SELECT * 
                      FROM sqlite_master 
                      WHERE type='table';"""

pd.read_sql(table_name_query, conn)

In [None]:
table_0_name_query = """SELECT * 
                      FROM movie_basics;"""
movie_basics = pd.read_sql(table_0_name_query, conn)
movie_basics

## Analyzing Movie Genres: Popularity and Ratings

The aim of this analysis is to identify which movie genres are the most popular and how their popularity correlates with their average viewer ratings. This can help determine whether popularity aligns with quality as perceived by viewers.

### 1. Merge Datasets
We are working with two main data sources:
- `tmdb.movies.csv`: Contains metadata such as `title`, `popularity`, and `vote_average`.
- `movie_basics` table from the `im.db` SQLite database: Includes fields like `primary_title`, `genres`, and additional movie identifiers.

**Goal:**  
Join these two datasets based on a common movie identifier (e.g., `title` or `original_title`) to enrich the dataset with genre information.
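
Titles are not unique identifiers, so a title-based join can both miss films and duplicate them. A minimal sketch of how the match quality could be inspected, using tiny hypothetical stand-ins for the two sources (the `title_key` column and the toy rows are assumptions for illustration):

```python
import pandas as pd

# Hypothetical miniature versions of the two sources
tmdb = pd.DataFrame({
    "original_title": ["Inception ", "Toy Story", "El Pacto"],
    "popularity": [27.92, 28.005, 5.0],
})
basics = pd.DataFrame({
    "original_title": ["inception", "TOY STORY", "Toy Story"],
    "genres": ["Action,Adventure,Sci-Fi", "Animation", "Documentary"],
})

# Normalise titles the same way on both sides
for frame in (tmdb, basics):
    frame["title_key"] = frame["original_title"].str.lower().str.strip()

# A left join with indicator=True records whether each row found a partner
check = tmdb.merge(basics, on="title_key", how="left",
                   indicator=True, suffixes=("_tmdb", "_imdb"))

match_counts = check["_merge"].value_counts()      # matched vs unmatched rows
rows_per_title = check.groupby("title_key").size() # >1 means a title matched several films

print(match_counts)
print(rows_per_title)
```

In this toy case "toy story" matches two different records, which would count the same film twice in any later average.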

In [None]:
def clean_title(title):
    if isinstance(title, str):
        return title.lower().strip()
    return title

df2['original_title_clean'] = df2['original_title'].apply(clean_title)
movie_basics['original_title_clean'] = movie_basics['original_title'].apply(clean_title)

# inner join on the normalised title shared by both tables
merged_movies_1 = pd.merge(
    df2,
    movie_basics,
    how='inner',
    on='original_title_clean'
)

In [None]:
merged_movies_1

### 2. Group and Aggregate by Genre
Once the data is merged:
- Group the data by the `genres` column.
- Calculate the **mean popularity** and **mean rating (vote_average)** for each genre.

In [None]:
grouped_genre_stats = merged_movies_1.groupby('genres').agg({
    'vote_average': 'mean',
    'popularity': 'mean'
}).reset_index()
grouped_genre_stats.columns = ['genre_name', 'avg_vote', 'avg_popularity']
# round these to 2 decimal places
grouped_genre_stats['avg_vote'] = grouped_genre_stats['avg_vote'].round(2)
grouped_genre_stats['avg_popularity'] = grouped_genre_stats['avg_popularity'].round(2)

grouped_genre_stats
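
Grouping on the raw `genres` string treats every combination (for example "Action,Adventure") as its own genre. If per-genre figures are wanted instead, one possible variant is to split the string and explode it so that a film counts once for each of its genres. A sketch on made-up rows, assuming the genres are comma-separated:

```python
import pandas as pd

movies = pd.DataFrame({
    "genres": ["Action,Adventure", "Action", "Drama,Adventure", None],
    "vote_average": [6.0, 8.0, 7.0, 5.0],
})

per_genre = (
    movies.dropna(subset=["genres"])                          # skip films without genre info
          .assign(genre=lambda d: d["genres"].str.split(","))  # "A,B" -> ["A", "B"]
          .explode("genre")                                    # one row per (film, genre)
          .groupby("genre")["vote_average"]
          .agg(["mean", "count"])
)

print(per_genre)
```

The `count` column shows how many films support each average, which helps when judging how much weight a figure deserves.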

### 3. Evaluate Genre Performance
With the grouped data, we will:
- Rank genres by their average popularity.
- Compare popularity scores against average ratings.
- Identify which genres are both popular and highly rated.

In [None]:
sorted_genre_stats = grouped_genre_stats.sort_values(
    by=[ 'avg_popularity', 'avg_vote'], ascending=[False, False]
).reset_index(drop=True)
sorted_genre_stats.head(10)

##  Conclusion

From our analysis, we observed that genres like **Family, Fantasy, Musical** and **Adventure, Fantasy, Mystery** have the highest average popularity scores, with values around 29. However, this doesn't always translate to quality in terms of viewer satisfaction. For example, **Action, Fantasy, Thriller** and **Action, Adventure, Fantasy** have relatively low average ratings despite decent popularity.

Notably, some of the highest-rated genres, such as **Adventure, Biography, Crime** (7.40) and **Drama, Horror, Music** (7.30), don't lead in popularity — suggesting that critical acclaim and audience appreciation are not solely driven by hype or mass appeal.

 **Key Insight:**  
While popularity can indicate trends and audience interest, **it is not the sole measure of success**. In the film industry, what ultimately matters most is **profitability** — the financial return a genre brings in. A genre with moderate popularity but high profitability can be more valuable than one with high visibility but low returns.

Therefore, the next analysis factors in **profit margins** alongside popularity and ratings to gain a more complete picture of a genre's performance and true value in the market.


## Analyzing Movie Genres by Profitability



## Step-by-Step Profitability Analysis

### 1. Calculate Profit
We begin by using the `tn.movie_budgets.csv` file, which includes financial details for various movies:

- **Columns used**:  
  - `production_budget`  
  - `worldwide_gross`  

To get the profit for each movie, we use the formula:

profit = worldwide_gross - production_budget

In [None]:
df3['profit'] = df3['worldwide_gross'] - df3['production_budget']

In [None]:
df3

This gives us a clear picture of the net gain for each title.
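
Profit in absolute dollars tends to favour big-budget titles. As a complementary view, return on investment (ROI) scales profit by the budget. The sketch below uses made-up films; the `roi` column is an assumption for illustration and is not part of the original analysis:

```python
import pandas as pd

films = pd.DataFrame({
    "movie": ["A", "B", "C"],
    "production_budget": [100_000_000, 5_000_000, 50_000_000],
    "worldwide_gross": [250_000_000, 30_000_000, 40_000_000],
})

# Same profit definition as above
films["profit"] = films["worldwide_gross"] - films["production_budget"]

# ROI: profit earned per dollar of budget
films["roi"] = films["profit"] / films["production_budget"]

top_by_profit = films.loc[films["profit"].idxmax(), "movie"]
top_by_roi = films.loc[films["roi"].idxmax(), "movie"]

print(films[["movie", "profit", "roi"]])
print(top_by_profit, top_by_roi)
```

Here the largest profit and the best ROI belong to different films, which is why the two measures can lead to different genre rankings.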

---

### 2. Merge Datasets
Next, we merge the newly created profit data with the `movie_basics` table from the `im.db` SQLite database. This merge helps us map each movie's profit to its corresponding **genre**.

We ensure that both datasets share a common key, a lower-cased and stripped title (`movie` in the budgets data, `primary_title` in `movie_basics`), to facilitate the merge.

---

In [None]:
df3['movie_clean'] = df3['movie'].str.lower().str.strip()
movie_basics['primary_title_clean'] = movie_basics['primary_title'].str.lower().str.strip()


merged_movies_2 = pd.merge(
    df3,
    movie_basics,
    how='inner',
    left_on='movie_clean',
    right_on='primary_title_clean'
)

In [None]:
merged_movies_2.head()

### 3. Group by Genre and Calculate Average Profit
After merging:
- We group the dataset by the `genres` column.
- For each genre, we calculate the **average profit**.

The result is a concise table showing:
- `genre_name`
- `avg_profit`

In [None]:
grouped_genre_stats = merged_movies_2.groupby('genres').agg({
    'profit': 'mean',
}).reset_index()

grouped_genre_stats.columns = ['genre_name', 'avg_profit']
grouped_genre_stats
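
An average taken over a genre combination that contains only one or two films can be dominated by a single blockbuster. A minimal sketch, on made-up numbers, of reporting the group size and median next to the mean and keeping only groups above a threshold (`MIN_FILMS` is a hypothetical cut-off, not a value from this analysis):

```python
import pandas as pd

merged = pd.DataFrame({
    "genres": ["Fantasy,Romance", "Drama", "Drama", "Drama", "Comedy", "Comedy"],
    "profit": [1_000, 10, 20, 30, 40, 60],
})

stats_by_genre = (
    merged.groupby("genres")["profit"]
          .agg(avg_profit="mean", median_profit="median", n_films="count")
          .reset_index()
)

MIN_FILMS = 2  # hypothetical threshold for a "reliable" average
reliable = stats_by_genre[stats_by_genre["n_films"] >= MIN_FILMS]

print(stats_by_genre)
print(reliable)
```

In the toy data the single "Fantasy,Romance" film has by far the highest average, yet it rests on one observation and is filtered out.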

### 4. Combine with Previous Results (Ratings & Popularity)
To get a more holistic view, we now combine this new **profitability table** with the one we previously created, which includes:
- `avg_vote` (average rating)
- `avg_popularity` (audience interest)

This combined table allows us to compare:
- **Popularity**
- **Viewer Ratings**
- **Financial Performance**  
…all in one place.



In [None]:
merged_movies_3 = pd.merge(
    grouped_genre_stats,
    sorted_genre_stats,
    how='inner',
    left_on='genre_name',
    right_on='genre_name'
)

In [None]:
merged_movies_3

### 5. Sort and Visualize
- We sort the final combined table by `avg_profit` in descending order to identify the most profitable genres.
- Then, we generate a **bar graph** to visualize the **Top 10 genres by profitability**.

This chart will help us see at a glance which genres dominate financially.



In [None]:
df4=merged_movies_3.sort_values(
    by=[ 'avg_profit','avg_popularity'], ascending=[False, False]
).reset_index(drop=True).head(10)
df4

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
sns.barplot(x=df4['genre_name'], y=df4['avg_profit'], color='deepskyblue')

plt.title('Top 10 Genres by Average Profit', fontsize=16)
plt.xlabel('Genre', fontsize=12)
plt.ylabel('Average Profit (USD)', fontsize=12)
plt.xticks(rotation=45, ha='right', fontsize=10)  # Rotate labels
plt.yticks(fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.8)

plt.tight_layout()
plt.show()

##  Conclusion: Profitability Over Popularity

The bar graph of the top 10 most profitable genres highlights a compelling trend: genres like **Fantasy, Romance** and **Adventure, Drama, Sport** significantly outperform others, each generating over **$1 billion** in average profit. These are followed closely by **Family, Fantasy, Musical** and **Adventure, Fantasy**, reinforcing the dominance of hybrid genres that blend escapism, emotion, and broad audience appeal.

This analysis confirms that **profitability does not always align with popularity or ratings**. While some genres may receive high ratings or viral attention, they don't always yield high financial returns.


##  Final Thought
Profitability is the most critical metric in determining a film’s success. Studios should focus not only on what’s trending or critically acclaimed, but on what **consistently brings returns**. Data-driven strategies using genre-based profitability can provide a major edge in film production planning.


In [None]:


plt.figure(figsize=(8, 5))


sns.regplot(
    data=merged_movies_3,
    x='avg_popularity',
    y='avg_profit',
    scatter=True,        
    color='dodgerblue',  
    line_kws={'linewidth': 2}
)

plt.title('Popularity vs Profit with Line of Best Fit', fontsize=14)
plt.xlabel('Average Popularity')
plt.ylabel('Average Profit')
plt.grid(True)
plt.tight_layout()
plt.show()

In [None]:

correlation = merged_movies_3[['avg_popularity', 'avg_profit', 'avg_vote']].corr()
print(correlation)

In [None]:

merged_movies_2['profit'] = merged_movies_2['worldwide_gross'] - merged_movies_2['production_budget']
top_genres = merged_movies_2['genres'].value_counts().head(3).index 
df_anova = merged_movies_2[merged_movies_2['genres'].isin(top_genres)]
group1 = df_anova[df_anova['genres'] == top_genres[0]]['profit']
group2 = df_anova[df_anova['genres'] == top_genres[1]]['profit']
group3 = df_anova[df_anova['genres'] == top_genres[2]]['profit']

from scipy.stats import f_oneway

f_stat, p_value = f_oneway(group1, group2, group3)
print("F-statistic:", f_stat)
print("P-value:", p_value)
print(len(group1), len(group2), len(group3))

## Popularity vs Profit Analysis


The scatter plot visualizes the relationship between **average popularity** and **average profit** for different movie genres. A line of best fit is included to reveal the overall trend.

### Key Observations:
- **Positive Correlation**: The line slopes upward, suggesting that as popularity increases, so does profit.
- **Variance**: The data is widely scattered, indicating that popularity alone does not guarantee high profits.
- A few outlier genres are **highly profitable** despite **average or low popularity**, suggesting hidden factors like budget efficiency or global market appeal.


##  Correlation Matrix

### Interpretation:
- **Popularity and Profit** have a **moderate positive correlation** (**r = 0.40**). This means popular movies tend to be more profitable, but it's not a strong guarantee.
- **Votes (ratings)** have **very weak correlation** with both popularity and profit, suggesting that **rating is not a strong predictor of financial success**.
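
`DataFrame.corr()` computes Pearson's r by default, which is sensitive to a few extreme values. Since average profit is heavily skewed, a rank-based Spearman correlation may be a useful cross-check. A sketch on invented numbers showing how one outlier separates the two measures:

```python
import pandas as pd

genre_table = pd.DataFrame({
    "avg_popularity": [1.0, 2.0, 3.0, 4.0, 5.0],
    "avg_profit": [1.0, 2.0, 3.0, 4.0, 100.0],  # last value is an outlier
})

# Pearson measures linear association on the raw values
pearson_r = genre_table.corr(method="pearson").loc["avg_popularity", "avg_profit"]

# Spearman applies the same idea to the ranks, so it only needs a monotonic trend
spearman_r = genre_table.corr(method="spearman").loc["avg_popularity", "avg_profit"]

print(round(pearson_r, 3), round(spearman_r, 3))
```

If the two coefficients differ a lot on the real genre table, the relationship is probably driven by a handful of extreme genres.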

###  ANOVA Test Summary: 

The **ANOVA (Analysis of Variance)** test was used to determine whether there are **statistically significant differences in average profit** among the top 3 most frequent movie genres in the dataset.
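
One-way ANOVA assumes roughly normal groups with similar variances, which skewed profit data may not satisfy. A rank-based alternative is the Kruskal-Wallis test (`scipy.stats.kruskal`). The sketch below runs both on simulated right-skewed samples; the groups are synthetic and only illustrate the calls, not the results of this analysis:

```python
import numpy as np
from scipy.stats import f_oneway, kruskal

rng = np.random.default_rng(0)

# Synthetic right-skewed "profit" samples; group_c is shifted upwards
group_a = rng.lognormal(mean=1.0, sigma=1.0, size=200)
group_b = rng.lognormal(mean=1.0, sigma=1.0, size=200)
group_c = rng.lognormal(mean=2.0, sigma=1.0, size=200)

# Classical one-way ANOVA compares group means
f_stat, p_anova = f_oneway(group_a, group_b, group_c)

# Kruskal-Wallis compares the groups through their ranks
h_stat, p_kruskal = kruskal(group_a, group_b, group_c)

print("ANOVA p-value:", p_anova)
print("Kruskal-Wallis p-value:", p_kruskal)
```

Reporting both tests side by side makes it easier to see whether a conclusion depends on the normality assumption.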

## Final Conclusions (Based on ANOVA, Correlation, and Scatter Plot)

1. **Popularity Is Only Moderately Linked to Profit**
   - The correlation coefficient between popularity and profit is **0.403**, showing a weak-to-moderate positive relationship.
   - Not all popular genres are profitable.

2. **Ratings (Votes) Do Not Predict Profit**
   - The correlation between average vote and profit is extremely weak (**r = 0.10**).
   - High-rated films do not guarantee high profits.

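3. **No Significant Profit Difference Among the Top Genres**
   - The ANOVA test shows **no significant difference in profit** between the 3 most frequent genres. This means a difference could not be detected, not that the genres are proven to be equally profitable.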
   - Studios shouldn’t assume that just producing more films in a popular genre will generate more profit.

---

##  Strategic Recommendations

### 1. Focus on **Profitable Genres**, Not Just Popular Ones
- Use genre-specific profit data instead of relying on trends or volume of production.
- Prioritize genres with **historically high average profit**, like *Fantasy-Romance* or *Adventure-Drama-Sport*.


<B>WE WILL MERGE THE DATASETS FIRST AND THEN CLEAN THEM TOGETHER, BECAUSE IT IS EASIER TO DO THE CLEANING ALL AT ONCE</B>

In [None]:
df=pd.read_csv("tn.movie_budgets.csv")
df


In [None]:
df2=pd.read_csv("tmdb.movies.csv")
df2

 I MERGED BEFORE CLEANING TO MAKE IT EASIER FOR ME TO CLEAN

In [None]:
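# `id` means different things in the two files (a row id in tn.movie_budgets.csv,
# a TMDB identifier in tmdb.movies.csv), so the movies are matched on their titles instead
df3 = pd.merge(df, df2, left_on="movie", right_on="title")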
df3

In [None]:
df=df3
df

WE WILL FIRST CHECK IF THERE ARE NULL VALUES

In [None]:
df.isna().sum()

HERE WE WILL CHECK FOR ANY DUPLICATED ROWS

In [None]:
df.duplicated().sum()  # number of fully duplicated rows

In [None]:
df['release_date'] = df['release_date_x']


In [None]:
# drop columns that repeat the same content under different names, plus columns not needed below
df.drop(columns=['Unnamed: 0','title','release_date_x','release_date_y','production_budget','genre_ids','original_language','original_title'], inplace=True)
df

REMOVE ANY LEADING OR TRAILING WHITESPACE IN THE MOVIE COLUMN

In [None]:
df['movie'] = df['movie'].str.strip()  # assign back, otherwise the stripped values are discarded

CONVERT THE RELEASE DATE TO A DATETIME TYPE

In [None]:
df['release_date'] = pd.to_datetime(df['release_date'], errors='coerce')


In [None]:
df

In [None]:
# Remove dollar signs and commas from 'domestic_gross' and 'worldwide_gross'

money_columns = ['domestic_gross', 'worldwide_gross']

In [None]:
for col in money_columns:
    # raw string avoids an invalid-escape warning; int64 holds grosses above 2**31
    df[col] = df[col].replace(r'[\$,]', '', regex=True).astype('int64')

CHECK WHETHER THERE ARE ANY OUTLIERS

In [None]:
fig, axes = plt.subplots(1, 5, figsize=(20, 5))

# Histogram for domestic gross
axes[0].hist(df['domestic_gross'], bins=30)
axes[0].set_title('domestic_gross Distribution')
axes[0].set_xlabel('domestic_gross')
axes[0].set_ylabel('Frequency')

# Histogram for worldwide_gross
axes[1].hist(df['worldwide_gross'])
axes[1].set_title('worldwide_gross Distribution')
axes[1].set_xlabel('worldwide_gross')
axes[1].set_ylabel('Frequency')

# Histogram for popularity
axes[2].hist(df['popularity'])
axes[2].set_title('popularity Distribution')
axes[2].set_xlabel('popularity')
axes[2].set_ylabel('Frequency')

# Histogram for vote average
axes[3].hist(df['vote_average'])
axes[3].set_title('vote_average Distribution')
axes[3].set_xlabel('vote_average')
axes[3].set_ylabel('Frequency')

# Histogram for vote_count
axes[4].hist(df['vote_count'])
axes[4].set_title('vote_count Distribution')
axes[4].set_xlabel('vote_count')
axes[4].set_ylabel('Frequency')
# Adjust spacing
plt.tight_layout()
plt.show()


AS YOU CAN OBSERVE:
1) DOMESTIC_GROSS, WORLDWIDE_GROSS, VOTE_COUNT AND POPULARITY ARE POSITIVELY SKEWED, AS THE TAIL IS TOWARDS THE RIGHT; VOTE_AVERAGE LEANS TOWARDS THE LEFT
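
The visual impression of skewness can be backed by a number. A minimal sketch, on made-up values, that measures skewness with `Series.skew()` and shows how a `log1p` transform reduces a long right tail (the transform is a suggestion, not a step taken in this notebook):

```python
import numpy as np
import pandas as pd

# Made-up right-skewed values with one very large observation
gross = pd.Series([1, 2, 2, 3, 3, 4, 5, 8, 20, 200], dtype="float64")

raw_skew = gross.skew()            # positive value: tail to the right
log_skew = np.log1p(gross).skew()  # log(1 + x) compresses the large values

print(round(raw_skew, 2), round(log_skew, 2))
```

A skewness near 0 indicates a roughly symmetric distribution; positive values correspond to the right tails seen in the histograms.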

In [None]:
df_clean = df.dropna(subset=['domestic_gross', 'worldwide_gross'])
# drop rows with missing gross values before looking at the suspected outliers

In [None]:
columns=['domestic_gross', 'worldwide_gross','popularity','vote_average','vote_count']



In [None]:
def outliers(df, col):
    """Return the index labels of rows more than 3 standard deviations from the mean of `col`."""
    upper_limit = df[col].mean() + 3 * df[col].std()
    lower_limit = df[col].mean() - 3 * df[col].std()
    # collect the index of every row outside the limits
    ls = df[(df[col] < lower_limit) | (df[col] > upper_limit)].index.tolist()
    return ls


In [None]:
columns = ['domestic_gross', 'worldwide_gross', 'popularity', 'vote_average', 'vote_count']

# collect the outlier rows across all columns, then drop repeated index labels
index_list = []
for col in columns:
    index_list.extend(outliers(df, col))
index_list = list(set(index_list))
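
The three-sigma rule used in `outliers` relies on the mean and standard deviation, which extreme values themselves inflate. On skewed columns an interquartile-range (IQR) rule is a common alternative. A sketch on toy data; the helper name `iqr_outlier_index` and the factor `k=1.5` are assumptions for illustration:

```python
import pandas as pd

def iqr_outlier_index(df, col, k=1.5):
    """Return index labels of rows outside [Q1 - k*IQR, Q3 + k*IQR] for `col`."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (df[col] < q1 - k * iqr) | (df[col] > q3 + k * iqr)
    return df.index[mask].tolist()

toy = pd.DataFrame({"vote_count": [10, 12, 11, 13, 12, 11, 10, 500]})

# IQR rule
flagged = iqr_outlier_index(toy, "vote_count")

# Three-sigma rule on the same data, for comparison
mean, std = toy["vote_count"].mean(), toy["vote_count"].std()
three_sigma_flagged = toy.index[(toy["vote_count"] - mean).abs() > 3 * std].tolist()

print(flagged, three_sigma_flagged)
```

In this example the value 500 inflates the standard deviation so much that the three-sigma rule no longer flags it, while the IQR rule does.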

In [None]:
df_cleaned = df  # the flagged outliers are kept; the analysis below continues with the full dataset

In [None]:
df

<H1><B>Season Grouping</B></H1>
THE SEASONS WERE GROUPED INTO TWO:
<OL>
    <LI>HOLIDAY SEASONS: WINTER/CHRISTMAS HOLIDAY (NOV-DEC) AND SUMMER HOLIDAY (MAY-AUG)</LI>
    <LI>NON-HOLIDAY/IRREGULAR SEASONS (SPRING, FALL AND EARLY IN THE YEAR): JANUARY-FEBRUARY, MARCH-APRIL, SEPTEMBER-OCTOBER</LI>
</OL>
Then we will use domestic gross, worldwide gross and popularity to test whether releasing a movie in a holiday season rather than a non-holiday season really does increase these three variables.

<H2>Function to divide the release date into the targeted seasons</H2>

In [None]:
df['release_month']=df['release_date'].dt.month_name()


In [None]:
def category_season(release_date):
    month = release_date.month
    if month in [11, 12]:
        return 'Christmas Holiday Season'
    elif month in [5, 6, 7, 8]:
        return 'Summer Season'
    else:
        return 'Non-Holiday/Irregular Period'



In [None]:
df['season_group'] = df['release_date'].apply(category_season)
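
With `category_season`, a row whose date failed to parse (`NaT`) ends up in the `else` branch and is labelled as non-holiday. A vectorised variant with `numpy.select` can keep such rows apart. This is a sketch on made-up dates, and the "Unknown" label is an assumption:

```python
import numpy as np
import pandas as pd

release_dates = pd.to_datetime(
    pd.Series(["2009-12-18", "2011-05-20", "2019-02-01", None])
)
month = release_dates.dt.month  # NaN where the date is missing

# Conditions are checked in order; the first match wins
season = np.select(
    [month.isin([11, 12]), month.isin([5, 6, 7, 8]), month.notna()],
    ["Christmas Holiday Season", "Summer Season", "Non-Holiday/Irregular Period"],
    default="Unknown",
)

print(list(season))
```

The labels match the ones returned by `category_season`, so the later holiday/non-holiday split would work unchanged apart from the extra "Unknown" group.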




In [None]:
df

In [None]:
#formulating the hypothesis 

<h5>Null Hypothesis (H0): There is no difference in popularity, worldwide gross, or domestic gross between movies released in holiday seasons and movies released in non-holiday seasons</h5>
<h5>Alternative Hypothesis (H1): There is a difference in popularity, worldwide gross, or domestic gross between movies released in holiday seasons and movies released in non-holiday seasons</h5>


<H3>CONDUCTING THE TEST TO REJECT/NOT REJECT THE NULL HYPOTHESIS</H3>

WE WILL CONDUCT OUR TEST USING THE MANN-WHITNEY U TEST BECAUSE WE HAVE TWO INDEPENDENT GROUPS AND THE DATA ARE NOT NORMALLY DISTRIBUTED

The tests are performed on three variables: popularity, worldwide gross and domestic gross.
<h3><b>Worldwide Gross</b></h3>

WE WILL SPLIT THE DATA INTO TWO GROUPS THE HOLIDAY AND NON HOLIDAY SEASONS

In [None]:
holiday_movies = df[df['season_group'].str.contains('Christmas Holiday Season|Summer Season')]


In [None]:
#for non holiday group
non_holiday_movies = df[df['season_group'].str.contains('Non-Holiday/Irregular Period')]
non_holiday_movies

In [None]:
# visualize the distribution to see whether it looks normal
sns.histplot(holiday_movies['worldwide_gross'], kde=True)
plt.title("worldwide Gross - Holiday Movies")
plt.show()

In [None]:
# then visualize the distribution for the non-holiday group
sns.histplot(non_holiday_movies['worldwide_gross'], kde=True)
plt.title("worldwide Gross - Non-Holiday Movies")
plt.show()

In [None]:
# we will perform a mann whitney u test on the first variable
holiday_worldwide_gross = holiday_movies['worldwide_gross']
non_holiday_worldwide_gross = non_holiday_movies['worldwide_gross']

In [None]:
stat_worldwide_gross, p_value_worldwide_gross = stats.mannwhitneyu(holiday_worldwide_gross, non_holiday_worldwide_gross)


In [None]:
print(f"Mann-Whitney U Test for Worldwide Gross:")
print(f"U-statistic: {stat_worldwide_gross}, p-value: {p_value_worldwide_gross}\n")

if p_value_worldwide_gross < 0.05:
    print("Result: Significant difference in Worldwide Gross between holiday and non-holiday movies.\n")
else:
    print("Result: No significant difference in Worldwide Gross between holiday and non-holiday movies.\n")

<B>CONCLUSION FOR THE FIRST VARIABLE</B>

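The p-value is lower than the 0.05 significance level, so we reject the null hypothesis: worldwide gross differs significantly between holiday and non-holiday releases. Because the test is two-sided, it does not by itself say which group earns more; the claim that worldwide income is higher during the holiday season should be confirmed by comparing the medians of the two groups.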
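
To see which group tends to be higher, the medians and the share of pairs in which a holiday film out-earns a non-holiday film can be computed directly. The share is the U statistic divided by the number of pairs. A sketch on invented gross figures (all values hypothetical):

```python
import numpy as np

holiday = np.array([120.0, 300.0, 80.0, 500.0, 260.0])
non_holiday = np.array([40.0, 90.0, 150.0, 60.0, 30.0])

# Compare every holiday value with every non-holiday value
diff = holiday[:, None] - non_holiday[None, :]

# U for the holiday group: pairs won by the holiday film, ties count half
u_holiday = (diff > 0).sum() + 0.5 * (diff == 0).sum()

# Probability that a random holiday film out-earns a random non-holiday film
prob_superiority = u_holiday / (len(holiday) * len(non_holiday))

median_gap = np.median(holiday) - np.median(non_holiday)

print(u_holiday, prob_superiority, median_gap)
```

A share well above 0.5 together with a positive median gap would support the statement that holiday releases earn more, rather than merely differ.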

<h3>DOMESTIC_GROSS</h3>

In [None]:

sns.histplot(holiday_movies['domestic_gross'], kde=True)
plt.title("domestic Gross - Holiday Movies")
plt.show()

In [None]:

sns.histplot(non_holiday_movies['domestic_gross'], kde=True)
plt.title("domestic Gross - Non-Holiday Movies")
plt.show()

In [None]:
# we will perform a mann whitney u test on the second variable
holiday_domestic_gross = holiday_movies['domestic_gross']
non_holiday_domestic_gross = non_holiday_movies['domestic_gross']

In [None]:
stat_domestic_gross, p_value_domestic_gross = stats.mannwhitneyu(holiday_domestic_gross, non_holiday_domestic_gross)


In [None]:
print(f"Mann-Whitney U Test for Domestic Gross:")
print(f"U-statistic: {stat_domestic_gross}, p-value: {p_value_domestic_gross}\n")

if p_value_domestic_gross < 0.05:
    print("Result: Significant difference in domestic Gross between holiday and non-holiday movies.\n")
else:
    print("Result: No significant difference in domestic Gross between holiday and non-holiday movies.\n")

<B>CONCLUSION FOR THE SECOND VARIABLE</B>

As you can see, the p-value is lower than the 0.05 significance level, so domestic gross differs significantly between holiday and non-holiday releases. As with worldwide gross, the two-sided test shows a difference but not its direction, so the claim that domestic income is higher during the holiday season should be confirmed by comparing the medians of the two groups.

<h3>POPULARITY</h3>

In [None]:
#WE THEN CHECK THE DISTRIBUTION FOR POPULARITY 
sns.histplot(holiday_movies['popularity'], kde=True)
plt.title("popularity - Holiday Movies")
plt.show()

In [None]:
sns.histplot(non_holiday_movies['popularity'], kde=True)
plt.title("popularity - Non-Holiday Movies")
plt.show()

In [None]:
# we will perform a mann whitney u test on the third variable
holiday_popularity = holiday_movies['popularity']
non_holiday_popularity = non_holiday_movies['popularity']

In [None]:
stat_popularity, p_value_popularity = stats.mannwhitneyu(holiday_popularity, non_holiday_popularity)


In [None]:
print(f"Mann-Whitney U Test for popularity:")
print(f"U-statistic: {stat_popularity}, p-value: {p_value_popularity}\n")

if p_value_popularity < 0.05:
    print("Result: Significant difference in popularity between holiday and non-holiday movies.\n")
else:
    print("Result: No significant difference in popularity between holiday and non-holiday movies.\n")

<B> IN CONCLUSION </B>

Releasing a movie in the holiday season rather than the non-holiday season does not significantly affect its popularity, but it does affect domestic and worldwide income. When releasing movies, it is therefore better to target the holiday seasons.
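
Three tests were run on the same split of the data, which raises the chance of at least one false positive. One simple safeguard is a Bonferroni correction, where each p-value is compared with the significance level divided by the number of tests. The p-values below are placeholders, not the results of this notebook:

```python
# Hypothetical p-values standing in for the three Mann-Whitney results
p_values = {
    "worldwide_gross": 0.001,
    "domestic_gross": 0.02,
    "popularity": 0.30,
}

alpha = 0.05
adjusted_alpha = alpha / len(p_values)  # Bonferroni: 0.05 / 3

# True means the result stays significant after the correction
decisions = {name: p < adjusted_alpha for name, p in p_values.items()}

print(round(adjusted_alpha, 4), decisions)
```

With these placeholder values only the first test would remain significant, which shows how a borderline result can change under the correction.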

<b>Recommendations</b>

When releasing a movie, we should plan the release for the holiday season, as this is expected to increase our revenue.