### HYPOTHESIS TESTING 
Step 1: Verify the Cleaned Data
Make sure:
- The dataset has no missing or duplicate values.
- Numeric columns like Revenue and Budget are stored as numbers (e.g., no text strings or special characters).
- Categories like Genre are labeled consistently (e.g., no case mismatches like “Action” vs. “action”).
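
The three checks above can be sketched as a small helper. This is only an illustration: the function name `check_cleaned`, the toy DataFrame, and its column names are assumptions, not part of the project data.

```python
import pandas as pd


def check_cleaned(df, numeric_cols, category_cols):
    """Return a small report covering the three checks listed above."""
    report = {
        # Total count of missing cells across the whole frame
        "missing_values": int(df.isna().sum().sum()),
        # Number of rows that exactly repeat an earlier row
        "duplicate_rows": int(df.duplicated().sum()),
        # Columns expected to be numeric that are stored as another dtype
        "non_numeric_cols": [
            col for col in numeric_cols
            if not pd.api.types.is_numeric_dtype(df[col])
        ],
        # Category columns whose labels collide once case/whitespace is ignored
        "inconsistent_categories": [
            col for col in category_cols
            if df[col].nunique() != df[col].str.strip().str.lower().nunique()
        ],
    }
    return report


# Hypothetical toy data, used only to illustrate the checks
toy_df = pd.DataFrame({
    "Genre": ["Action", "action", "Drama"],
    "Revenue": ["1,000", "2000", "3000"],   # stored as text on purpose
    "Budget": [500.0, 700.0, None],
})
report = check_cleaned(toy_df, ["Revenue", "Budget"], ["Genre"])
print(report)
```

An empty list for `non_numeric_cols` and `inconsistent_categories`, together with zero counts for the other two entries, would suggest the data is ready for testing.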

Step 2: Define Your Groups
Decide which two groups to compare in each t-test. The two hypothesis tests below use:
- Genre comparison: Action/Adventure/Sci-Fi vs. Drama/Romance, with these hypotheses:
  - Null Hypothesis (H₀): The average box office revenue for Action/Adventure/Sci-Fi movies is equal to that of Drama/Romance movies.
  - Alternative Hypothesis (H₁): The average box office revenue for Action/Adventure/Sci-Fi movies is different from that of Drama/Romance movies (two-sided), or higher if a one-sided test is used.
- Budget comparison: High-budget (> \$10M) vs. low-budget (≤ \$10M).
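
As a rough sketch, the budget comparison could be set up by splitting on the \$10M threshold. The column names `production_budget` and `revenue`, the helper `split_by_budget`, and the toy rows are all assumptions for illustration; the real column names may differ.

```python
import pandas as pd

BUDGET_THRESHOLD = 10_000_000  # the $10M cut-off from the list above


def split_by_budget(df, budget_col="production_budget", value_col="revenue"):
    """Return (high, low) value Series split at BUDGET_THRESHOLD."""
    high = df.loc[df[budget_col] > BUDGET_THRESHOLD, value_col]
    low = df.loc[df[budget_col] <= BUDGET_THRESHOLD, value_col]
    return high, low


# Hypothetical toy rows; the column names are assumptions
toy_df = pd.DataFrame({
    "production_budget": [5_000_000, 10_000_000, 10_000_001, 150_000_000],
    "revenue": [12_000_000, 8_000_000, 30_000_000, 600_000_000],
})
high_budget, low_budget = split_by_budget(toy_df)
print(len(high_budget), len(low_budget))
```

A movie with a budget of exactly \$10M falls in the low-budget group, matching the ≤ in the definition.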




In [1]:
# Create a download link for the cleaned dataset

from IPython.display import FileLink  
FileLink("cleaned_dataset.csv") 

In [2]:
import pandas as pd 
import numpy as np

In [9]:
film_df = pd.read_csv("zippedData/all_combined.csv")
film_df



Unnamed: 0.1,movie_id,ordering,person_id,category,ordering.1,title,region,is_original_title,primary_title,original_title_x,...,Unnamed: 0,genre_ids,id_y,original_language,original_title_y,popularity,release_date_y,vote_average,vote_count,profits
0,tt0475290,10,nm0005683,cinematographer,24,"Hail, Caesar!",GB,0.0,"Hail, Caesar!","Hail, Caesar!",...,17497,"[35, 18, 9648]",270487,en,"Hail, Caesar!",12.312,2016-02-05,5.9,2328,42160680.0
1,tt0475290,1,nm0000982,actor,24,"Hail, Caesar!",GB,0.0,"Hail, Caesar!","Hail, Caesar!",...,17497,"[35, 18, 9648]",270487,en,"Hail, Caesar!",12.312,2016-02-05,5.9,2328,42160680.0
2,tt0475290,2,nm0000123,actor,24,"Hail, Caesar!",GB,0.0,"Hail, Caesar!","Hail, Caesar!",...,17497,"[35, 18, 9648]",270487,en,"Hail, Caesar!",12.312,2016-02-05,5.9,2328,42160680.0
3,tt0475290,3,nm2403277,actor,24,"Hail, Caesar!",GB,0.0,"Hail, Caesar!","Hail, Caesar!",...,17497,"[35, 18, 9648]",270487,en,"Hail, Caesar!",12.312,2016-02-05,5.9,2328,42160680.0
4,tt0475290,4,nm0000146,actor,24,"Hail, Caesar!",GB,0.0,"Hail, Caesar!","Hail, Caesar!",...,17497,"[35, 18, 9648]",270487,en,"Hail, Caesar!",12.312,2016-02-05,5.9,2328,42160680.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
50962,tt7153766,8,nm2722502,producer,8,Unsane,IT,0.0,Unsane,Unsane,...,23949,"[27, 53]",467660,en,Unsane,16.316,2018-03-23,6.2,667,12744931.0
50963,tt7153766,8,nm2722502,producer,15,Unsane,US,0.0,Unsane,Unsane,...,23949,"[27, 53]",467660,en,Unsane,16.316,2018-03-23,6.2,667,12744931.0
50964,tt7153766,9,nm10426133,composer,7,Unsane,SE,0.0,Unsane,Unsane,...,23949,"[27, 53]",467660,en,Unsane,16.316,2018-03-23,6.2,667,12744931.0
50965,tt7153766,9,nm10426133,composer,8,Unsane,IT,0.0,Unsane,Unsane,...,23949,"[27, 53]",467660,en,Unsane,16.316,2018-03-23,6.2,667,12744931.0


In [21]:
film_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50967 entries, 0 to 50966
Data columns (total 37 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   movie_id            50967 non-null  object 
 1   ordering            50967 non-null  int64  
 2   person_id           50967 non-null  object 
 3   category            50967 non-null  object 
 4   ordering.1          50967 non-null  int64  
 5   title               50967 non-null  object 
 6   region              50967 non-null  object 
 7   is_original_title   50967 non-null  float64
 8   primary_title       50967 non-null  object 
 9   original_title_x    50967 non-null  object 
 10  start_year          50967 non-null  int64  
 11  runtime_minutes     50967 non-null  float64
 12  genres              50967 non-null  object 
 13  averagerating       50967 non-null  float64
 14  numvotes            50967 non-null  int64  
 15  primary_name        50967 non-null  object 
 16  prim

In [22]:
box_office_df = pd.read_csv('zippedData/bom.movie_gross.csv.gz')
box_office_df

Unnamed: 0,title,studio,domestic_gross,foreign_gross,year
0,Toy Story 3,BV,415000000.0,652000000,2010
1,Alice in Wonderland (2010),BV,334200000.0,691300000,2010
2,Harry Potter and the Deathly Hallows Part 1,WB,296000000.0,664300000,2010
3,Inception,WB,292600000.0,535700000,2010
4,Shrek Forever After,P/DW,238700000.0,513900000,2010
...,...,...,...,...,...
3382,The Quake,Magn.,6200.0,,2018
3383,Edward II (2018 re-release),FM,4800.0,,2018
3384,El Pacto,Sony,2500.0,,2018
3385,The Swan,Synergetic,2400.0,,2018


In [20]:
film_df['foreign_gross'].head()

0    33100000
1    33100000
2    33100000
3    33100000
4    33100000
Name: foreign_gross, dtype: object

In [24]:
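# Convert foreign_gross from text to numbers; entries that cannot be parsed become NaN
box_office_df['foreign_gross'] = pd.to_numeric(box_office_df['foreign_gross'], errors='coerce')
# Total revenue is NaN wherever either gross figure is missing
box_office_df['revenue'] = box_office_df['domestic_gross'] + box_office_df['foreign_gross']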
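
One thing to watch with `errors='coerce'`: if any text entries contain thousands separators, they are silently turned into NaN. The sketch below shows the difference on hypothetical values; the helper name `to_number` and the sample strings are assumptions, not values taken from the dataset.

```python
import pandas as pd


def to_number(series):
    """Convert a text column to floats, removing thousands separators first."""
    cleaned = series.astype(str).str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")


# Hypothetical values; "1,131.6" stands in for a comma-formatted entry
raw = pd.Series(["652000000", "1,131.6", None, "n/a"])

direct = pd.to_numeric(raw, errors="coerce")   # comma entry becomes NaN
cleaned = to_number(raw)                       # comma entry is preserved

print(direct.isna().sum(), cleaned.isna().sum())
```

Comparing the NaN counts before and after conversion is a quick way to see how many rows the coercion dropped.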



In [28]:
# Merge the two DataFrames on their shared 'title' column
merged_df = pd.merge(film_df, box_office_df, on='title')

# Select revenue by genre. == matches the whole genres string, so 'Action'
# excludes combined labels such as 'Action,Adventure,Sci-Fi'.
action_movies_revenue = merged_df[merged_df['genres'] == 'Action']['revenue']
drama_movies_revenue = merged_df[merged_df['genres'] == 'Drama']['revenue']
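
Because `film_df` holds several rows per movie (one per person and region), merging it directly repeats each movie's revenue many times. A possible alternative, sketched here on hypothetical miniature frames, keeps one row per `movie_id` before merging. The helper name `merge_one_row_per_movie` and the toy data are assumptions.

```python
import pandas as pd


def merge_one_row_per_movie(films, box_office):
    """Keep one row per movie_id, then attach box office figures by title."""
    unique_films = films.drop_duplicates(subset="movie_id")
    # validate="many_to_one" raises if a title repeats in box_office
    return unique_films.merge(
        box_office[["title", "revenue"]],
        on="title", how="inner", validate="many_to_one",
    )


# Hypothetical miniature versions of film_df and box_office_df
films = pd.DataFrame({
    "movie_id": ["tt1", "tt1", "tt1", "tt2"],
    "title": ["Film A", "Film A", "Film A", "Film B"],
    "category": ["actor", "writer", "producer", "actor"],
    "genres": ["Drama,Romance"] * 3 + ["Action,Adventure,Sci-Fi"],
})
box_office = pd.DataFrame({
    "title": ["Film A", "Film B"],
    "revenue": [1_000_000.0, 5_000_000.0],
})

merged = merge_one_row_per_movie(films, box_office)
print(merged[["movie_id", "genres", "revenue"]])
```

With one row per movie, each revenue figure enters the t-test once, which is closer to the independence the test assumes.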


In [29]:
film_df['genres'].value_counts()

Action,Adventure,Sci-Fi        3020
Adventure,Animation,Comedy     2920
Drama,Romance                  1905
Action,Adventure,Drama         1860
Action,Adventure,Fantasy       1770
                               ... 
Fantasy,Horror                    7
Crime,Documentary,History         6
Action,Comedy,Drama               6
Musical                           5
Adventure,Documentary,Drama       4
Name: genres, Length: 204, dtype: int64

In [30]:


# Filter for movies with the genre 'Action,Adventure,Sci-Fi'
filtered_df = film_df[film_df['genres'] == 'Action,Adventure,Sci-Fi']

# Get 30% of the filtered dataset as a random sample
sample_df = filtered_df.sample(frac=0.3, random_state=42)  # Use random_state for reproducibility

print(sample_df)

        movie_id  ordering  person_id  category  ordering.1  \
24862  tt1411250         6  nm0923646    writer           5   
28389  tt2250912         9  nm0571344    writer          10   
4063   tt0369610        10  nm0189777  producer          16   
24857  tt1411250         5  nm0878638  director          12   
941    tt1228705         3  nm0000569   actress          17   
...          ...       ...        ...       ...         ...   
8228   tt1825683        10  nm3234869  composer           8   
19635  tt1483013         8  nm1858656  producer          21   
8434   tt1825683         9  nm0270559  producer          16   
35634  tt1300854         8  nm1411347    writer          39   
49363  tt4701182         9  nm0225146  producer           6   

                        title region  is_original_title  \
24862                 Riddick     AR                0.0   
28389  Spider-Man: Homecoming     FR                0.0   
4063           Jurassic World     IT                0.0   
24857  

In [31]:

# Rows whose genres contain at least one of Action, Adventure, or Sci-Fi
action_adventure_sci_fi = film_df[film_df['genres'].str.contains('Action|Adventure|Sci-Fi', na=False)]

# Rows whose genres contain at least one of Drama or Romance
# (a label such as 'Action,Adventure,Drama' lands in both groups)
drama_romance = film_df[film_df['genres'].str.contains('Drama|Romance', na=False)]

# Take a 30% random sample from each group
sample_action = action_adventure_sci_fi.sample(frac=0.3, random_state=42)
sample_drama = drama_romance.sample(frac=0.3, random_state=42)

print("Action/Adventure/Sci-Fi Sample Size:", len(sample_action))
print("Drama/Romance Sample Size:", len(sample_drama))


Action/Adventure/Sci-Fi Sample Size: 8109
Drama/Romance Sample Size: 7654
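
The two `str.contains` patterns can match the same row, so the groups are not guaranteed to be disjoint. The sketch below, on hypothetical genre strings, counts that overlap and shows one possible way to avoid it by matching the full genre label instead.

```python
import pandas as pd

# Hypothetical genre strings in the same comma-separated style as film_df['genres']
genres = pd.Series([
    "Action,Adventure,Sci-Fi",
    "Action,Adventure,Drama",
    "Drama,Romance",
    "Comedy",
])

in_action_group = genres.str.contains("Action|Adventure|Sci-Fi", na=False)
in_drama_group = genres.str.contains("Drama|Romance", na=False)

# Rows matched by both patterns would be counted in both samples
overlap = in_action_group & in_drama_group
print("Rows in both groups:", int(overlap.sum()))

# One way to make the groups disjoint: match the full genre string instead
action_only = genres == "Action,Adventure,Sci-Fi"
drama_only = genres == "Drama,Romance"
```

Exact matching gives smaller but cleanly separated groups; which definition is right depends on how the genre comparison is meant to be read.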




Step 3: Perform T-Test
A two-sample t-test is appropriate since we’re comparing the means of two independent groups.
Code for T-Test:
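
A minimal sketch of such a test, using `scipy.stats.ttest_ind`. Two choices here are assumptions rather than something the text specifies: the test is two-sided, and it uses Welch's variant (`equal_var=False`), which does not assume equal variances. The helper name `welch_t_test` and the synthetic revenue figures are for illustration only and are not results from the dataset.

```python
import numpy as np
from scipy import stats


def welch_t_test(group_a, group_b, alpha=0.05):
    """Two-sided Welch t-test on two independent samples, ignoring NaNs."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    a, b = a[~np.isnan(a)], b[~np.isnan(b)]
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)
    return {
        "t_stat": float(t_stat),
        "p_value": float(p_value),
        "reject_null": bool(p_value < alpha),
    }


# Synthetic revenue figures (in $M) for illustration only, not real results
rng = np.random.default_rng(42)
revenue_a = rng.normal(loc=300, scale=50, size=200)
revenue_b = rng.normal(loc=100, scale=50, size=200)

result = welch_t_test(revenue_a, revenue_b)
print(result)
```

With the notebook's data, the same helper could be called on the revenue values of the two samples, for example `welch_t_test(sample_action['revenue'], sample_drama['revenue'])`, assuming each sample carries a `revenue` column. If `reject_null` is `True`, H₀ is rejected at the chosen significance level.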
