# Week 2 Homework 

## Movie Data

This week in class we discussed high-revenue movies and whether they are more likely to win an Oscar.

Additional data on these movies was obtained from the [TMDB](https://www.themoviedb.org/) [API](https://developer.themoviedb.org/docs/getting-started). This data is in the `Movie_data` directory.

Complete the following tasks and answer the questions.

## Purpose of Homework

This homework assignment will help you practice working with movie data from Wikipedia and TMDB, including data merging, filtering, and visualization techniques.

You are encouraged to refer to lecture content and liberally use course resources such as the discussion board and office hours.

## Logistics

Due date: The homework is due at 12:00 pm on Thursday, January 22, 2026.

You will submit your homework on [MarkUs](https://markus.teach.cs.toronto.edu/markus/). 

1. Download this file (`sta272_hw2_student.ipynb`) from JupyterHub. (See [our JupyterHub Guide](../guides/jupyterhub_guide.ipynb) for detailed instructions.)
2. Submit this file to MarkUs under the hw2 assignment. (See [our MarkUs Guide](../guides/markus_guide.ipynb) for detailed instructions.)

All homework assignments are completed in a Jupyter notebook (like this one). When you are done, download the notebook and submit it to MarkUs.

## Task #1

`Movie_data/movies_social_data.json` contains social media data from TMDB for movies from:

- A table of highest-grossing films at <https://en.wikipedia.org/wiki/List_of_highest-grossing_films>
- A table of Academy Award-winning films at <https://en.wikipedia.org/wiki/List_of_Academy_Award%E2%80%93winning_films>

Store `Movie_data/movies_social_data.json` in a pandas dataframe called `social_df`.
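
As a rough illustration of loading JSON into a dataframe, the sketch below parses a small in-memory JSON string. The field names (`title`, `likes`) and the record layout are made up for this example; the structure of the homework file may differ.

```python
import io

import pandas as pd

# Toy JSON text: hypothetical fields, not the contents of the homework file
json_text = '[{"title": "A", "likes": 10}, {"title": "B", "likes": 25}]'

# pd.read_json accepts a file path or a file-like object
toy_df = pd.read_json(io.StringIO(json_text))
print(toy_df.head())
```

Each JSON record becomes one row, and each key becomes a column.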

In [None]:
# Import necessary library
import ... as ...

# Read the JSON file into a dataframe
social_df = pd...('...')

# Display the first few rows
social_df.head()

## Task #2 

Read `Movie_data/movies.csv` into a pandas dataframe called `movies`.

In [None]:
# Read the CSV file into a dataframe
movies = pd...('Movie_data/...')

# Display information about the dataframe
print(f"Loaded {len(...)} movies")
movies.head()

## Task #3

Merge the TMDB data in `social_df` with the Wikipedia data in `movies`, keeping all rows from `movies`. Store the result in a dataframe called `movies_merged`.
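
To see what "keeping all rows" from one table looks like, here is a minimal sketch on two toy dataframes. The column names (`title`, `year`, `gross`, `votes`) and values are invented for illustration and are not the columns in the homework files.

```python
import pandas as pd

# Toy data: hypothetical columns, not the homework's actual columns
left = pd.DataFrame({
    "title": ["A", "B", "C"],
    "year": [2001, 2002, 2003],
    "gross": [10, 20, 30],
})
right = pd.DataFrame({
    "title": ["A", "C"],
    "year": [2001, 2003],
    "votes": [7.5, 8.1],
})

# how="left" keeps every row of `left`; rows with no match get NaN in `votes`
toy_merged = left.merge(right, on=["title", "year"], how="left")
print(toy_merged)
```

Movie "B" has no match in `right`, so it stays in the result with a missing value in `votes`.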

In [None]:
# Merge the two dataframes
movies_merged = movies....(social_df, on=[..., ...], how='...')

# Display the first few rows
movies_merged.head()

## Task #4

In this task you will create a boolean variable called `action_adventure` that will be `True` if the `Genres` column in `movies_merged` contains both of the words `Action` and `Adventure`.
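
The idea of combining two string checks can be sketched on a toy Series. The genre strings below are made up, and their format (comma-separated) is an assumption that may not match the real `Genres` column.

```python
import pandas as pd

# Toy data: hypothetical genre strings
toy_genres = pd.Series([
    "Action, Adventure",
    "Action, Drama",
    "Comedy",
    "Adventure, Action, Fantasy",
])

# Each .str.contains() call returns a boolean Series; & combines them element-wise
has_both = toy_genres.str.contains("Action") & toy_genres.str.contains("Adventure")
print(has_both.tolist())  # [True, False, False, True]
```

If a column has missing values, `str.contains` returns a missing value for those rows; passing `na=False` treats them as non-matches.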


In [None]:
# Create a boolean variable that checks if Genres contains both "Action" and "Adventure"
action_adventure = (movies_merged['...'].str....("...") & movies_merged['...'].str....("..."))

## Task #5

Create a boolean variable called `comedy` that will be `True` if the `Genres` column in `movies_merged` contains the word `Comedy`.

In [None]:
# Create a boolean variable that checks if Genres contains "Comedy"
comedy = movies_merged['...'].str.contains("...")

## Task #6

Create a new dataframe called `movies_merged2` that is a copy of `movies_merged` with the columns `action_adventure` and `comedy` added.

In [None]:
# Create a copy of movies_merged
movies_merged2 = movies_merged...()

# Add the action_adventure column
movies_merged2['...'] = ...

# Add the comedy column
movies_merged2['...'] = ...

# Display the first few rows
movies_merged2.head()

## Task #7

Use the `value_counts()` method to count how many movies are action adventure and how many are comedy. The `value_counts()` method returns a Series containing the count of each unique value in a column. Store the results in variables called `action_adventure_counts` and `comedy_counts`.

In [None]:
# Count action adventure movies
action_adventure_counts = movies_merged2['...']....()
print("Action Adventure counts:")
print(action_adventure_counts)
print()

# Count comedy movies
comedy_counts = movies_merged2['...']....()
print("Comedy counts:")
print(comedy_counts)

## Task #8

Compute the proportion of movies that are action adventure and the proportion that are comedy. Store the results in variables called `action_adventure_prop` and `comedy_prop`.

**Hint:** You can use `value_counts()` with the `normalize=True` parameter to get proportions instead of counts.
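
The difference between counts and proportions can be seen on a small toy Series of booleans (the values are invented for illustration):

```python
import pandas as pd

# Toy data: five hypothetical True/False flags
toy_flags = pd.Series([True, False, False, True, False])

# Counts of each unique value
toy_counts = toy_flags.value_counts()
print(toy_counts)

# normalize=True divides each count by the total, so the results sum to 1
toy_props = toy_flags.value_counts(normalize=True)
print(toy_props)
```

Here `False` appears 3 times out of 5 and `True` 2 times out of 5, so the proportions are 0.6 and 0.4.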

In [None]:
# Compute proportion of action adventure movies
action_adventure_prop = movies_merged2['...']....()
print("Action Adventure proportions:")
print(action_adventure_prop)
print()

# Compute proportion of comedy movies
comedy_prop = movies_merged2['...']....()
print("Comedy proportions:")
print(comedy_prop)

## Task #9

Compare the `gross_nums_clean` (worldwide gross revenue) for action adventure movies versus comedy movies. Calculate the mean and median gross revenue for each category and store them in variables called `action_adventure_mean_gross`, `action_adventure_median_gross`, `comedy_mean_gross`, and `comedy_median_gross`.
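
As a sketch of filtering rows with a boolean column and then summarizing a numeric column, consider the toy dataframe below. The column names (`is_sequel`, `revenue`) and values are hypothetical.

```python
import pandas as pd

# Toy data: hypothetical columns and values
toy = pd.DataFrame({
    "is_sequel": [True, True, True, False, False],
    "revenue": [100, 200, 900, 50, 60],
})

# Select the revenue values for rows where the boolean column is True
sequel_revenue = toy.loc[toy["is_sequel"], "revenue"]

toy_mean = sequel_revenue.mean()      # (100 + 200 + 900) / 3 = 400
toy_median = sequel_revenue.median()  # middle value = 200
print(f"Mean: {toy_mean:,.0f}, Median: {toy_median:,.0f}")
```

In this toy data one large value (900) pulls the mean well above the median, which is the kind of gap worth checking for when comparing revenue between groups.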

In [None]:
# Calculate mean and median gross revenue for action adventure movies
action_adventure_mean_gross = movies_merged2[movies_merged2['...'] == ...]['...'].mean()
action_adventure_median_gross = movies_merged2[movies_merged2['...'] == ...]['...'].median()
print(f"Action Adventure - Mean gross revenue: ${action_adventure_mean_gross:,.0f}")
print(f"Action Adventure - Median gross revenue: ${action_adventure_median_gross:,.0f}")
print()

# Calculate mean and median gross revenue for comedy movies
comedy_mean_gross = movies_merged2[movies_merged2['...'] == ...]['...'].mean()
comedy_median_gross = movies_merged2[movies_merged2['...'] == ...]['...'].median()
print(f"Comedy - Mean gross revenue: ${comedy_mean_gross:,.0f}")
print(f"Comedy - Median gross revenue: ${comedy_median_gross:,.0f}")

## Task #10

Create a scatter plot with `Vote_Average` on the x-axis and `gross_nums_clean` on the y-axis. Include only action adventure and comedy movies in the plot. Use different colors for each category and ensure the plot is colorblind friendly.

Label the x-axis as "Vote Average", the y-axis as "Worldwide Gross Revenue", and the title as "Vote Average vs Worldwide Gross Revenue by Genre".

**Store the resulting matplotlib figure in a variable called `genre_plot_fig`.**

**Hint:** You can use matplotlib's scatter plot and filter the data for each category separately. Use `plt.gcf()` to get the current figure after creating your plot.
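
The pattern in the hint (one `plt.scatter` call per group, then `plt.gcf()`) can be sketched with made-up points. The group names, labels, and data here are placeholders; the two hex colors are the ones already used in the cell for this task.

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(6, 4))

# One scatter call per group, each with its own color and legend label
plt.scatter([1, 2, 3], [4, 5, 6], color="#0173B2", label="Group A", alpha=0.6)
plt.scatter([1, 2, 3], [6, 5, 4], color="#DE8F05", label="Group B", alpha=0.6)

plt.xlabel("x value")
plt.ylabel("y value")
plt.title("Toy scatter plot")
plt.legend()

# plt.gcf() returns the figure that the calls above drew on
toy_fig = plt.gcf()
```

Call `plt.gcf()` in the same cell as the plotting code, since a notebook typically closes the figure once the cell finishes.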

In [None]:
import matplotlib.pyplot as plt

# Filter data for action adventure movies
action_adventure_df = movies_merged2[movies_merged2['...'] == ...]

# Filter data for comedy movies
comedy_df = movies_merged2[movies_merged2['...'] == ...]

# Create the scatter plot
plt.figure(figsize=(10, 6))

# Plot action adventure movies
plt.scatter(action_adventure_df['...'], action_adventure_df['...'], 
           color='#0173B2', label='...', alpha=0.6, s=100)

# Plot comedy movies
plt.scatter(comedy_df['...'], comedy_df['...'], 
           color='#DE8F05', label='...', alpha=0.6, s=100)

# Add labels and title
plt.xlabel('...')
plt.ylabel('...')
plt.title('...')
plt.legend()
plt.grid(True, alpha=0.3)

# Store the figure in a variable
genre_plot_fig = plt.gcf()

## Question #1

A movie company wants to make a film with the highest revenue and is considering making either an action adventure or a comedy film. Based on your analyses above, which genre should they choose? Explain your reasoning using the analyses above. You may also add additional analyses to support your answer.

**To obtain full marks (4 points):**
- Clearly state which genre you recommend (1 point)
- Reference and correctly interpret the mean gross revenue values for both genres (1 point)
- Reference and correctly interpret the median gross revenue values for both genres (1 point)
- Provide clear reasoning for your recommendation, discuss why mean or median might be more appropriate for this decision, and acknowledge any limitations or additional considerations (1 point)

### Student Answer:

[Write your answer here]