# Assignment 3

# Test Video and Instructions

This is a video of my cat from Sept of 2024, playing with a new toy. He's very precious to me.

<video controls src="VID_20240904_200202.mp4" title="Cat Video" width = "640" height = "360"></video>

# Movies, Gender, and Awards

A long, long time ago, in a far away land, I dreamed of making movies. That long, long time ago was 2007 and that far away land was Boston, Massachusetts. (I went to Boston University for an MFA in Film Production.) I've made a few short films and I've dabbled in a few departments as a PA. Rarely did I look around and think "look at all the women here!" It's easy to say they're not there because they don't want to be, but I suspect there is something deeper than that. 

Movies are typically pretty, shiny things but the film industry can be a bit brutal. My interest in data science started from the idea of allowing more people to participate in the industry because it tends to "naturally" weed people out based on the arduous schedules and physical demands. I found a couple of data sets that I believe can show very basic trends in who is making movies, who is starring in them, and who is being rewarded by the governing body of the industry -- the &copy;Academy of Motion Picture Arts and Sciences.

Filmmakers say they don't do it for the accolades, that the awards don't matter, it's about the art, blah blah blah. But who among us hasn't practiced an Oscar speech or two?

- In this exploration, I am looking at datasets from https://www.kaggle.com/datasets/vinifm/female-representation-in-cinema. The datasets include a list of films with their Bechdel Test scores [^1] and a list of Academy Award winners and nominees from 1972 to 2021. The Bechdel Test is a measure of women in film. A film earns one point for each criterion it meets: it has at least two named women (1), who speak to each other (2), about something other than a man (3). It seems simple, but we're going to look at how many movies have a Bechdel Test score of 0.

- I thought it would be interesting to see how Bechdel Test scores change over time, and if there was an increase in higher Bechdel Test scores among award-winning films over time.

    - There are a lot of things to explore within this dataset; I chose to focus on award winning films first, however, I think comparing what wins awards to what doesn't could be quite insightful. Additionally, I'm interested in exploring budgets and revenue and seeing if higher BT score films get higher budgets.
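To make the scoring concrete, here is a minimal sketch of how the three criteria add up to a score from 0 to 3. The function name `bechdel_score` and its boolean parameters are hypothetical and not part of the dataset. The sketch also assumes the criteria are cumulative, meaning a criterion only counts if the one before it was met.

```python
def bechdel_score(two_named_women: bool, talk_to_each_other: bool, not_about_a_man: bool) -> int:
    """Return a 0-3 score; each criterion only counts if the one before it passed."""
    score = 0
    for passed in (two_named_women, talk_to_each_other, not_about_a_man):
        if not passed:
            break  # later criteria cannot count once one fails (assumption)
        score += 1
    return score


# Two named women talk to each other, but only about a man
print(bechdel_score(True, True, False))
```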

[^1]: For more information, check out [this Wikipedia page](https://en.wikipedia.org/wiki/Bechdel_test)

### Accessing the Data

- To access the datasets, you can download them from the kaggle link above using "Download dataset as zip," and then extract the files. I downloaded them and placed them in the 'data' folder within the folder of this Jupyter notebook!

     - the file names are `movies.csv` and `oscar.csv`

- If you have the Kaggle API, you can download the files with kagglehub:


         ```
         import kagglehub
 
         # Download latest version
         path = kagglehub.dataset_download("vinifm/female-representation-in-cinema")
 
         print("Path to dataset files:", path)
         ```
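Since kagglehub saves the files to its own cache directory and the cells below read from a `data` folder, a small helper can copy the CSVs over. This is only a sketch: the function name `copy_csvs_to_data` is hypothetical, and it assumes the download directory contains the `.csv` files somewhere beneath it.

```python
import shutil
from pathlib import Path


def copy_csvs_to_data(download_dir, data_dir="data"):
    """Copy every CSV found under download_dir into data_dir and return the copied file names."""
    target = Path(data_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for csv_file in sorted(Path(download_dir).rglob("*.csv")):
        shutil.copy(csv_file, target / csv_file.name)
        copied.append(csv_file.name)
    return copied


# Hypothetical usage: copy_csvs_to_data(path), where `path` is the value
# returned by kagglehub.dataset_download(...)
```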

### Installing libraries

- We'll be using a few fun libraries in exploring this data. Pandas and numpy are fairly standard, but if you're not used to them, it's best to install them now!

    `$ pip install pandas` <br>
    `$ pip install numpy`

- Additionally we'll be using plotly for amazing graphs. 

    - To install plotly using pip:

         `$ pip install plotly`

    - To install using conda:

         `$ conda install -c conda-forge plotly`

- More information about plotly can be found on its [website](https://plotly.com/graphing-libraries/).

- For interactivity, we'll be using ipywidgets for dropdown menus and sliders.

     - To install with pip:
         
         `pip install ipywidgets`

    - To install using conda:

         `conda install -c conda-forge ipywidgets`

- ipywidgets documentation can be found on its [website](https://ipywidgets.readthedocs.io/en/stable/index.html).


In [620]:
# Import necessary libraries
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
from plotly.subplots import make_subplots
from ipywidgets import interact, interactive_output, interactive, Dropdown, IntRangeSlider, HBox, VBox, Layout
from IPython.display import display, clear_output


## Loading the datasets and cleaning the data

- First we'll load the two datasets and then merge them to create a dataframe that has movies that have won awards with their Bechdel Test scores
- The dataframe also has cast and crew percentage representation, budget, revenue, the category of Oscar for which they were nominated, the name and gender of who was nominated.

- There are columns in the `movies.csv` file that are interesting, but out of scope for this specific analysis.
     - Cast gender and crew gender are lists of numbers that are actually used to calculate the percentage representation mentioned above. 

In [621]:
# Load the movies dataset; the dataset is a tad large for Vocareum, so I have removed some columns that are not necessary for this analysis
movies_columns_to_drop = ['dubious', 'imdbid', 'tmdbId', 'production_companies', 'production_countries','release_date', 'vote_average', 'vote_count', 'cast', 'crew', 'cast_gender', 'crew_gender']
movies_df = pd.read_csv('data/movies.csv', index_col=0).drop(columns=movies_columns_to_drop)

# Load the Oscars dataset and rename the 'film' column to 'title' for merging
oscars_df = pd.read_csv('data/oscar.csv')
oscars_df.rename(columns={'film': 'title'}, inplace=True)

# Merge the datasets on both 'title' and 'year' columns to ensure accurate matching
# Use a left join to keep all movies and add Oscar info where available
movies_and_oscars_df = pd.merge(movies_df, oscars_df, how='left', left_on=['title', 'year'], right_on=['title', 'year']).fillna({'category': 'not nominated', 'name': 'not nominated', 'status': 'not nominated'})

# View a small sample of the merged dataframe to verify the merge worked correctly
movies_and_oscars_df.sample(5)


Unnamed: 0,title,year,bt_score,genres,popularity,revenue,budget,cast_female_representation,crew_female_representation,category,name,status,gender
6888,The Green Mile,1999,1,"['Fantasy', 'Drama', 'Crime']",61.028,286801374.0,60000000.0,23.255814,19.230769,WRITING,Frank Darabont,nominated,male
4533,Meadowland,2015,3,['Drama'],5.691,0.0,0.0,40.0,17.307692,not nominated,not nominated,not nominated,
2362,Ju-on: The Grudge 2,2003,3,"['Horror', 'Thriller']",10.938,70700000.0,20000000.0,21.428571,0.0,not nominated,not nominated,not nominated,
1773,Wilde,1997,3,"['Drama', 'History']",7.847,0.0,0.0,20.0,16.216216,not nominated,not nominated,not nominated,
8298,Sleuth,1972,0,"['Mystery', 'Thriller', 'Crime']",8.039,0.0,0.0,0.0,0.0,DIRECTING,Joseph L. Mankiewicz,nominated,male
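To see what the left join and `fillna` are doing, here is a toy version of the same merge. The rows are made up and are not from the real files; `indicator=True` is an extra I've added for illustration and is not used in the notebook's merge.

```python
import pandas as pd

# Toy stand-ins for movies.csv and oscar.csv (made-up rows, not real data)
toy_movies = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "year": [2000, 2001, 2002],
    "bt_score": [3, 1, 0],
})
toy_oscars = pd.DataFrame({
    "title": ["Film A", "Film A"],
    "year": [2000, 2000],
    "category": ["WRITING", "DIRECTING"],
    "status": ["winner", "nominated"],
})

# indicator=True adds a '_merge' column showing whether a row matched an Oscar entry
toy_merged = pd.merge(
    toy_movies, toy_oscars, how="left", on=["title", "year"], indicator=True
).fillna({"category": "not nominated", "status": "not nominated"})

print(toy_merged)
```

Note that a film with two nominations shows up as two rows after the merge, which is likely why the merged dataframe has more rows than unique titles.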


### A quick data summary of the merged dataframe

- Make a nice dictionary to show how many unique values exist, if any values are missing, and the datatype of each.

In [622]:
data_summary = {
    col: {
        'num_unique': movies_and_oscars_df[col].nunique(),
        'num_missing': movies_and_oscars_df[col].isnull().sum(),
        'num_non_missing': movies_and_oscars_df[col].notnull().sum(),
        'data_type': str(movies_and_oscars_df[col].dtype)
    } for col in movies_and_oscars_df.columns
}

print("📽️ Data Summary:")
display(pd.DataFrame(data_summary).T)

📽️ Data Summary:


Unnamed: 0,num_unique,num_missing,num_non_missing,data_type
title,7101,0,8936,object
year,125,0,8936,int64
bt_score,4,0,8936,int64
genres,1647,0,8936,object
popularity,6255,0,8936,float64
revenue,4166,0,8936,float64
budget,651,0,8936,float64
cast_female_representation,902,0,8936,float64
crew_female_representation,1212,0,8936,float64
category,9,0,8936,object


Gender is missing from 6570 entries, which may seem alarming at first, but gender is only listed where a person was nominated for certain work on a film. Therefore, all of the films that were not nominated will have missing data, as will any film nominated in a category such as Best Picture. In a future analysis, I think it would be interesting to dig into the gender split among award winners.
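One way to confirm where the missing gender values come from is to cross-tabulate them against Oscar status. The sketch below uses made-up rows that mimic the `status`, `category`, and `gender` columns; the labels "missing" and "present" are my own.

```python
import pandas as pd

# Made-up rows that mimic the merged dataframe's columns
toy = pd.DataFrame({
    "status": ["not nominated", "not nominated", "nominated", "winner", "winner"],
    "category": ["not nominated", "not nominated", "WRITING", "BEST PICTURE", "DIRECTING"],
    "gender": [None, None, "male", None, "female"],
})

# Label each row by whether gender is missing, then count per status
gender_flag = toy["gender"].isna().map({True: "missing", False: "present"})
missing_by_status = pd.crosstab(toy["status"], gender_flag)

print(missing_by_status)
```

Run against `movies_and_oscars_df`, the same two lines should show whether the gaps sit almost entirely in the 'not nominated' rows.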

BT Score is an integer, and I'm wondering if it should be a category instead. The value can only be 0, 1, 2, or 3, so it is more categorical. However, keeping it numeric lets a 3 naturally rank above a 1 or 2, so I will leave it as a number for now and make it categorical for at least one plot in the near future. See if you can find it! Does it make a huge difference in the understanding of the plots?
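For what it's worth, pandas offers a middle ground: an ordered categorical treats the scores as discrete labels while still knowing that 3 ranks above 1. A minimal sketch on a made-up series (the notebook itself does not use this):

```python
import pandas as pd

scores = pd.Series([3, 0, 2, 1, 3, 0], name="bt_score")

# An ordered categorical keeps 0 < 1 < 2 < 3 while treating values as discrete labels
bt_dtype = pd.CategoricalDtype(categories=[0, 1, 2, 3], ordered=True)
scores_cat = scores.astype(bt_dtype)

print(scores_cat.min(), scores_cat.max())  # min and max still respect the order
print((scores_cat >= 2).sum())             # how many scores are 2 or higher
```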

In general, this data is fairly clean and I feel confident moving forward with this DataFrame.

#### Two small pieces that will come in handy later
- A dictionary to map column names to more descriptive labels for better readability in plots and a list of colors to ensure consistency across plots

In [623]:
# A dictionary for column names
column_names = {
    'title': 'Movie Title',
    'year': 'Release Year',
    'genres': 'Genres',
    'popularity': 'Popularity Score',
    'budget': 'Budget (USD)',
    'revenue': 'Revenue (USD)',
    'cast_female_representation': 'Cast Female Representation (%)',
    'crew_female_representation': 'Crew Female Representation (%)',
    'category': 'Oscar Category',
    'name': 'Oscar Winner Name',
    'status': 'Oscar Status',
    'bt_score': 'Bechdel Test Score',
    'num_oscars': 'Number of Oscars Won'
}

In [624]:
# A list of colors to use in the plots
color_list = ["#F2EE22","#EC7854", "#B52F8C", "#6502A6", "#16078B"]

## Initial Exploration

- We'll start with a SPLOM to hopefully guide our instincts into what data is interesting to compare. 


### Visualization Techniques // Library Choices
- I will be using Plotly Express for most of the visualizations because it is easy to use and produces interactive plots.
- For more complex visualizations, I will use Plotly Graph Objects. 
- Plotly is open source, which appeals to me. 
- A major bonus to the Plotly library is the automatic hover effects. I'd like to give a huge shout-out to [the documentation](https://plotly.com/python/hover-text-and-formatting/) and the community for those. Understanding the data is so much faster with these hover effects.
    - The hover effects contributed to 85% of my decision!

##### We'll also make a list of column names because I, for one, can never remember all the column names!

In [625]:
movies_and_oscars_df.columns

Index(['title', 'year', 'bt_score', 'genres', 'popularity', 'revenue',
       'budget', 'cast_female_representation', 'crew_female_representation',
       'category', 'name', 'status', 'gender'],
      dtype='object')

In [626]:
# For initial exploration, a SPLOM!
# I'll skip title because there is no reason to look at 7101 data points for that.
fig = px.scatter_matrix(movies_and_oscars_df,
                        dimensions=['year', 
                                    'bt_score', 
                                    'genres', 
                                    'popularity', 
                                    'revenue',
                                    'budget', 
                                    'cast_female_representation', 
                                    'crew_female_representation',
                                    'category', 
                                    'name', 
                                    'status', 
                                    'gender'], 
                        color='bt_score',
                        color_discrete_sequence=color_list,
                        title='Scatter Matrix of Movies with Oscars and Bechdel Test Scores', 
                        labels=column_names,
                        width=1000, 
                        height=800)

fig.update_traces(diagonal_visible=False)  # Hide diagonal plots for clarity
fig.show()

Ooh that's a little messy with the labels all over! But I get the gist. 

The genre column adds some oddness to the SPLOM because it is a list of values. To really explore by genre, it would make sense to explode that column and group by genres. That is beyond the scope of this exploration. 
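For anyone who wants to try the explode-and-group idea, here is a rough sketch on made-up rows. It assumes the `genres` column comes out of the CSV as list-like strings (for example `"['Drama', 'Crime']"`), so each value is parsed with `ast.literal_eval` first.

```python
import ast
import pandas as pd

# Made-up rows; 'genres' holds strings that look like Python lists (assumption)
toy = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "bt_score": [3, 1, 3],
    "genres": ["['Drama', 'Crime']", "['Drama']", "['Comedy', 'Drama']"],
})

# Parse each string into a real list, then give every genre its own row
toy["genres"] = toy["genres"].apply(ast.literal_eval)
by_genre = toy.explode("genres")

# Average Bechdel Test score per genre
mean_bt_by_genre = by_genre.groupby("genres")["bt_score"].mean()
print(mean_bt_by_genre)
```

A film with several genres is counted once in each of them, so the per-genre groups overlap.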

I'm really interested in those scatter plots above and to the left of 'status.' They seem to offer the most variety and potentially insights. There is a lot of blue on the screen, indicating overall low Bechdel Test scores. 

#### A curious exploration

- I'm curious which movies have been nominated, have won an Oscar, or have won multiple Oscars, and what their respective Bechdel Test scores are.
- My first stop will be to compare award winners and non-award winners.
- Is there a difference between *all* the movies and the award nominated and winning movies? Without looking at the data, I would assume that Oscar nominated films have higher Bechdel Test scores because that's the sort of thing the Academy just loves. 

In [627]:
# A function to filter movies based on Oscar status
# This function can be used to create DataFrames for movies that won Oscars, were nominated, or were not nominated at all

def movies_win_or_lose_awards(df=movies_and_oscars_df, status='not nominated', include_extra=False):
    """
    Create a DataFrame of movies that either won Oscars, were nominated, or were not nominated at all 
    to be used in visualizations.

    Args:
        df (DataFrame, optional): the merged movies and Oscars dataframe. Defaults to movies_and_oscars_df.
        status (str, optional): Either 'winner', 'nominated', or 'not nominated'. Note that 'nominated'
            includes winners as well. Defaults to 'not nominated'.
        include_extra (bool): if True, includes extra columns for richer analysis.
        
    Returns:
        pd.DataFrame: DataFrame with columns ['title', 'year', 'bt_score', 'num_oscars'] (+ extras if requested).
    """
    
    valid_status = ['winner', 'nominated', 'not nominated']
    if status not in valid_status:
        raise ValueError(f"Status must be one of {valid_status}")

    if status == 'winner':
        filtered_df = df[df['status'] == 'winner']
        result_df = (
            filtered_df.groupby(['title', 'year', 'bt_score'])
            .size().reset_index(name='num_oscars')
        )
    elif status == 'nominated':
        filtered_df = df[df['status'].isin(['winner', 'nominated'])]
        result_df = (
            filtered_df.groupby(['title', 'year', 'bt_score'])
            .size().reset_index(name='num_oscars')
        )
    else:  # not nominated
        filtered_df = df[df['status'] == 'not nominated']
        result_df = filtered_df[['title', 'year', 'bt_score']].drop_duplicates().copy()
        result_df['num_oscars'] = 0

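    if include_extra:
        extra_cols = ['category', 'gender', 'cast_female_representation', 'crew_female_representation']
        extra_cols = [col for col in extra_cols if col in filtered_df.columns]
        if extra_cols:
            # Merge on the grouping keys so the extras line up with result_df whatever its row order
            extras = filtered_df.groupby(['title', 'year', 'bt_score'], as_index=False)[extra_cols].first()
            result_df = result_df.merge(extras, on=['title', 'year', 'bt_score'], how='left')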

    return result_df


#### Three data frames for different statuses: multiple wins, won and nominated, and not nominated

- Initially, I'm curious if there is any discernible difference in Bechdel Test scores between award-nominated films and non-nominated films

In [628]:
multiple_oscars_df = movies_win_or_lose_awards(df=movies_and_oscars_df, status='winner')
noms_and_wins_df = movies_win_or_lose_awards(df=movies_and_oscars_df, status='nominated')
no_noms_df = movies_win_or_lose_awards(df=movies_and_oscars_df, status='not nominated')
multiple_status_dfs = {'winner': multiple_oscars_df, 'nominated': noms_and_wins_df, 'not nominated': no_noms_df}
print(f"Total number of movies that won at least one Oscar: {multiple_oscars_df.shape[0]}")
print(f"Total number of movies not nominated: {no_noms_df.shape[0]}")

Total number of movies that won at least one Oscar: 215
Total number of movies not nominated: 6564


#### Sidebar: which movies have won the most Oscars?

In [629]:

multiple_oscars_df[multiple_oscars_df['num_oscars'] > 1].sort_values('num_oscars', ascending=False).head()


Unnamed: 0,title,year,bt_score,num_oscars
18,Birdman or (The Unexpected Virtue of Ignorance),2014,3,9
9,An American in Paris,1951,1,9
52,Gandhi,1982,2,9
132,Shakespeare in Love,1998,3,8
139,Spotlight,2015,1,7


I *knew* it was Birdman, but I still don't *believe* it's Birdman. 

In [630]:
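# Note: status='nominated' includes winners too, so num_oscars here counts nominations and wins combined
multiple_oscars_df_extra = movies_win_or_lose_awards(df=movies_and_oscars_df, status='nominated', include_extra=True)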
multiple_oscars_df_extra.head()

Unnamed: 0,title,year,bt_score,num_oscars,category,gender,cast_female_representation,crew_female_representation
0,102 Dalmatians,2000,3,1,COSTUME DESIGN,male,21.428571,21.428571
1,12 Years a Slave,2013,3,9,FILM EDITING,male,14.606742,17.346939
2,127 Hours,2010,3,6,WRITING,male,34.482759,7.352941
3,1776,1972,1,1,CINEMATOGRAPHY,male,14.285714,0.0
4,20 Feet from Stardom,2013,3,4,DOCUMENTARY (Feature),male,44.444444,0.0


#### Another sidebar: 

- What categories did the winners win?

In [631]:
film_list = multiple_oscars_df_extra.loc[multiple_oscars_df_extra['num_oscars'] > 1, 'title'].unique().tolist()

categories_by_film = (
    movies_and_oscars_df[
        (movies_and_oscars_df['title'].isin(film_list)) & 
        (movies_and_oscars_df['status'] == 'winner')
    ]
    .groupby('title')['category']
    .unique()
)

categories_df = pd.DataFrame({
    'title': categories_by_film.index,
    'categories': categories_by_film.values
}
)
categories_df.head()

Unnamed: 0,title,categories
0,12 Years a Slave,"[BEST PICTURE, WRITING]"
1,20 Feet from Stardom,[DOCUMENTARY (Feature)]
2,A Letter to Three Wives,"[WRITING, DIRECTING]"
3,A Man for All Seasons,"[COSTUME DESIGN, DIRECTING, CINEMATOGRAPHY, WR..."
4,All That Jazz,"[FILM EDITING, COSTUME DESIGN]"


### More Visualization Techniques and Choices

Side note about the category column: The 'not nominated' category shifts so much of the data when used, that I've included a filter in each data visualization function to not use it if category is chosen as a way to present more normalized data.

#### Bar Charts!

- Let's look at bar charts to explore the data. The categories of things, like Bechdel Test Score, pop out as clumps of color in these stacked bar charts. 
- We'll use the year as the x-axis to look at how things change over time. 


In [633]:
# A function to create a bar chart that can be reused when we make the interactive dashboard later
def create_bar_chart(df, x_col='year', y_col='num_oscars', color_by='bt_score', column_names=None):
    """
    Create a bar chart showing the number of Oscars won by each movie, colored by Bechdel Test score.

    Args:
        df (pandas DataFrame): input dataframe
        x_col, y_col, color_by (str): column names
        column_names (dict): dictionary mapping column names to display names
    """
    if x_col == 'category' or y_col == 'category' or color_by == 'category':
        df = df[df['category'] != 'not nominated']
    labels = column_names if column_names else {}
    fig = px.bar(df, 
                 x=x_col, 
                 y=y_col, 
                 color=color_by,
                 color_discrete_sequence=color_list,
                 barmode='relative',
                 hover_data=[x_col, y_col, color_by, 'title'],
                 title=f"{labels.get(y_col, y_col)} by {labels.get(x_col, x_col)} colored by {labels.get(color_by, color_by)}",
                 labels=labels,
                 height=600)

    fig.update_layout(xaxis_tickangle=-45)
    return fig

In [634]:
create_bar_chart(multiple_oscars_df, column_names=column_names).show()

This is not as clumped as I would have guessed -- I thought the BT scores would slowly increase over time. Though 1965 to 1982 looks a little rough.

In [635]:
create_bar_chart(multiple_oscars_df_extra, x_col='year', y_col='cast_female_representation', column_names=column_names).show()

In [636]:
# What does a bar chart of Cast Representation versus Number of Oscars look like?
create_bar_chart(multiple_oscars_df_extra, x_col='cast_female_representation', y_col='num_oscars', column_names=column_names).show()

It may appear blank, but the hover effect shows us there are actually so many bars they appear super thin. To me, this means the bar chart is not a good choice for this data. A scatter plot might be better, which we'll look at soon. 
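Another option, if a bar chart is still wanted here, is to bin the percentages first so each bar covers a range instead of a single value. A minimal sketch with made-up numbers; the 20-point bin width and the `cast_bin` column name are arbitrary choices of mine:

```python
import pandas as pd

# Made-up rows standing in for multiple_oscars_df_extra
toy = pd.DataFrame({
    "cast_female_representation": [0.0, 14.3, 21.4, 34.5, 44.4, 100.0],
    "num_oscars": [1, 1, 2, 6, 4, 1],
})

# Group the continuous percentage into 20-point-wide bins
bin_edges = [0, 20, 40, 60, 80, 100]
toy["cast_bin"] = pd.cut(toy["cast_female_representation"], bins=bin_edges, include_lowest=True)

# observed=False keeps empty bins in the result with a total of 0
oscars_per_bin = toy.groupby("cast_bin", observed=False)["num_oscars"].sum()
print(oscars_per_bin)
```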

For now, I will stick to using year as the x-axis for bar charts.

##### I'd really like to look at specific ranges of years, so I'll add a slider to filter the data.

- To really explore the yearly data, I'll add a year slider for the charts and look at a couple of different y-axis variables.
    


In [637]:
# Function to create an interactive bar chart with a year slider
def bar_chart_with_year_slider(df,  x_col='year', y_col='num_oscars', color_by='bt_score', column_names=column_names):
    year_min = int(df['year'].min())
    year_max = int(df['year'].max())
    year_slider = IntRangeSlider(
        value=[year_min, year_max],
        min=year_min,
        max=year_max,
        step=1,
        description='Year Range:',
        style={'description_width': 'initial', 'bar_color': '#F2EE22'},
        continuous_update=False,
        layout={'width': '75%', 'margin': '0 auto', 'height': '40px'}
    )

    def update_chart(year_range):
        filtered_df = df[(df['year'] >= year_range[0]) & (df['year'] <= year_range[1])]
        create_bar_chart(filtered_df,  x_col=x_col, y_col=y_col, color_by=color_by, column_names=column_names).show()

    interact(update_chart, year_range=year_slider)


In [None]:
# Oscar nominations and wins (films nominated at least once), with a year slider
bar_chart_with_year_slider(multiple_oscars_df_extra, column_names=column_names)

interactive(children=(IntRangeSlider(value=(1927, 2017), continuous_update=False, description='Year Range:', l…

Being able to filter by year is super helpful! I can see that the 1960s and 1970s were pretty rough for Bechdel Test scores. The 2000s and 2010s look much better. I can also see the division of each film within the bars and the hover effect gives me all the info I want right off the bat!

In [639]:
# A look at cast female representation with a year slider
bar_chart_with_year_slider(multiple_oscars_df_extra, 'year', 'cast_female_representation', 'bt_score', column_names=column_names)

interactive(children=(IntRangeSlider(value=(1927, 2017), continuous_update=False, description='Year Range:', l…

I need the hover effect to get to the information I want; I want to see if there is a correlation between cast representation and Bechdel Test score. There is probably a better chart for this! (Heat map, I'm looking at you!)
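Before reaching for another chart, a quick numeric check can hint at whether the two move together: group by Bechdel Test score and compare the average cast representation. The rows below are made up purely to show the shape of the result.

```python
import pandas as pd

# Made-up rows with the two columns of interest
toy = pd.DataFrame({
    "bt_score": [0, 0, 1, 3, 3, 3],
    "cast_female_representation": [5.0, 15.0, 20.0, 30.0, 40.0, 50.0],
})

# Mean cast representation and number of films for each Bechdel Test score
summary = toy.groupby("bt_score")["cast_female_representation"].agg(["mean", "count"])
print(summary)
```

The `count` column is there as a reminder that a mean built from only a handful of films shouldn't be trusted too much.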

In [640]:
# Change the Bechdel Test Score to categorical to fix the color bar for this chart only
# Changing it in the original dataframe may have unintended consequences for other analyses
multiple_oscars_df_extra['bt_score'] = multiple_oscars_df_extra['bt_score'].astype('category')

In [641]:
bar_chart_with_year_slider(multiple_oscars_df_extra, column_names=column_names)

interactive(children=(IntRangeSlider(value=(1927, 2017), continuous_update=False, description='Year Range:', l…

##### Back to the full Movies and Oscars DataFrame with the number of Oscars added

- The bar chart exploration was nice for just the title, year, Bechdel Test score and the number of Oscars won, but we lost so much amazing data. 
- Let's bring it back and keep playing with visualizations!

In [642]:
# Create a new column to indicate how many Oscars a movie won (0 if none)
movies_and_oscars_df = pd.merge(
    movies_and_oscars_df,
    multiple_oscars_df[['title', 'year', 'bt_score', 'num_oscars']],
    on=['title', 'year', 'bt_score'],
    how='left'
)

# Remove duplicates (if any)
movies_and_oscars_df = movies_and_oscars_df.drop_duplicates().fillna({'num_oscars': 0})
movies_and_oscars_df.tail()

Unnamed: 0,title,year,bt_score,genres,popularity,revenue,budget,cast_female_representation,crew_female_representation,category,name,status,gender,num_oscars
8931,Sand Castle,2017,0,"['War', 'Action', 'Drama']",31.004,0.0,0.0,4.545455,11.111111,not nominated,not nominated,not nominated,,0.0
8932,Diary of a Wimpy Kid: The Long Haul,2017,0,"['Comedy', 'Family']",17.122,40120144.0,22000000.0,20.289855,12.0,not nominated,not nominated,not nominated,,0.0
8933,God's Own Country,2017,0,"['Romance', 'Drama']",16.294,2559939.0,0.0,28.571429,10.769231,not nominated,not nominated,not nominated,,0.0
8934,MFKZ,2017,0,"['Science Fiction', 'Animation', 'Action', 'Cr...",7.712,461724.0,0.0,6.666667,16.666667,not nominated,not nominated,not nominated,,0.0
8935,War Machine,2017,0,"['Comedy', 'Drama', 'War']",12.251,0.0,60000000.0,12.280702,16.129032,not nominated,not nominated,not nominated,,0.0


##### 🤔 Hm, there are so many different ways to look at this data

- I have a general theory that Bechdel Test scores will increase over time. 
    - That wasn't the case when looking at award winning films, but let's use other plots to look at more films and other factors.
    - I have other ideas about the budgets, revenue, all of those other fun details for films with a higher percentage of female cast and crew.
    - To the charts!

### Scatter plot! 
- The SPLOM offered a lot of inspiration to look at more scatter plots. It is still common to use time as the x-axis, but we can expand beyond that to learn more about the data. 

In [643]:
# A function to create scatter plots that can be reused when we make the interactive dashboard later
def create_scatter_plot(df, x_col, y_col, color_by=None, column_names=None):
    """Create a scatter plot of two columns in a DataFrame.

    Args:
        df (DataFrame): The DataFrame containing the data.
        x_col (str): The column for the x-axis.
        y_col (str): The column for the y-axis.
        color_by (str, optional): The column to color the points by. Defaults to None.
        column_names (dict, optional): A dictionary mapping column names to display names. Defaults to None.

    Returns:
        Figure: A Plotly Figure object containing the scatter plot.
    """

    labels = column_names if column_names else {}
    if color_by == 'category' or y_col == 'category' or x_col == 'category':
        df = df[df['category'] != 'not nominated']
    fig = px.scatter(df, 
                     x=x_col, 
                     y=y_col, 
                     color=color_by, 
                     color_discrete_sequence=color_list,
                     title=f"Scatter plot of {labels.get(y_col, y_col)} by {labels.get(x_col, x_col)} colored by {labels.get(color_by, color_by)}",
                     labels=labels,
                     hover_data=['title']
                     )

    return fig

Let's look at budget versus revenue, colored by Bechdel Test score. Hollywood, like most things, runs on money. I wonder how budget versus revenue fares.

In [644]:
create_scatter_plot(movies_and_oscars_df, 'budget', 'revenue', 'bt_score', column_names=column_names).show()

There are a few outlying yellow dots that I am actually surprised by!

- Avatar was a high budget and high revenue film with a Bechdel Test score of 3.
- Avengers: Age of Ultron was as well.

The super high budget of some films squishes the data of lower budget films, which is unfortunate. Someone should make a slider for that. Or use the built-in Plotly tool to autoscale!
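A log scale is another way to un-squish the low-budget films (Plotly Express exposes this through the `log_x` and `log_y` arguments of `px.scatter`). Log scales cannot show zeros, so the sketch below first drops rows with a budget or revenue of 0. It assumes a 0 in this dataset means "unknown" rather than a truly free film, and the rows are made up.

```python
import numpy as np
import pandas as pd

# Made-up rows; 0.0 stands in for an unknown budget or revenue (assumption)
toy = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C", "Film D"],
    "budget": [0.0, 2_000_000.0, 60_000_000.0, 250_000_000.0],
    "revenue": [0.0, 10_000_000.0, 0.0, 1_000_000_000.0],
})

# Keep only rows where both values are known and positive
known = toy[(toy["budget"] > 0) & (toy["revenue"] > 0)].copy()

# On a log10 scale, each step of 1 is a factor of ten
known["log_budget"] = np.log10(known["budget"])
known["log_revenue"] = np.log10(known["revenue"])
print(known)
```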
 

In [647]:
### Let's add a dropdown to select the x, y, and color columns and a slider for the x-value!

def scatter_plot_with_dropdowns(df, column_names=None):
    """
    Generate interactive dropdowns to select x, y, and color columns for scatter plot.

    Args:
        df (pandas DataFrame): the input dataframe
    """
    labels = column_names if column_names else {}
        
    numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
    
    x_options = [(labels.get(col, col), col) for col in numerical_cols]
    y_options = [(labels.get(col, col), col) for col in numerical_cols]
    color_options = [(labels.get(col, col), col) for col in categorical_cols + numerical_cols]
    color_options = [('None', None)] + color_options
    
    x_dropdown = Dropdown(options=x_options, 
                          value='year' if 'year' in numerical_cols else numerical_cols[0], 
                          description='X-axis:',
                          layout={'width': '50%'})
    
    y_dropdown = Dropdown(options=y_options, 
                          value='num_oscars' if 'num_oscars' in numerical_cols else numerical_cols[0], 
                          description='Y-axis:',
                          layout={'width': '50%'})
    
    color_dropdown = Dropdown(options=color_options, 
                              value='bt_score' if 'bt_score' in categorical_cols + numerical_cols else None, 
                              description='Color by:',
                              layout={'width': '50%'})
    
    # Initial slider for the default x_col
    x_min = int(df[x_dropdown.value].min())
    x_max = int(df[x_dropdown.value].max())
    x_slider = IntRangeSlider(
        value=[x_min, x_max],
        min=x_min,
        max=x_max,
        step=1,
        description=f'{labels.get(x_dropdown.value, x_dropdown.value)} Range:',
        style={'description_width': 'initial'},
        continuous_update=False,
        layout={'width': '75%', 'margin': '0 auto', 'height': '40px'}
    )

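    def update_slider(*args):
        col = x_dropdown.value
        new_min = int(df[col].min())
        new_max = int(df[col].max())
        # Widen the upper bound first so min never exceeds max mid-update,
        # which ipywidgets rejects with a TraitError
        x_slider.max = max(x_slider.max, new_max)
        x_slider.min = new_min
        x_slider.max = new_max
        x_slider.value = [new_min, new_max]
        x_slider.description = f'{labels.get(col, col)} Range:'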

    x_dropdown.observe(update_slider, names='value')

    def update_plot(x_col, y_col, color_by, x_range):
        filtered_df = df[(df[x_col] >= x_range[0]) & (df[x_col] <= x_range[1])]
        clear_output(wait=True)
        fig = create_scatter_plot(filtered_df, x_col, y_col, color_by, column_names=labels)
        fig.show()

    controls = VBox([
        HBox([x_dropdown, y_dropdown, color_dropdown], layout=Layout(justify_content='center')),
        x_slider
    ])
    out = interactive_output(update_plot, {
        'x_col': x_dropdown,
        'y_col': y_dropdown,
        'color_by': color_dropdown,
        'x_range': x_slider
    })
    
    display(controls, out)

In [648]:
scatter_plot_with_dropdowns(movies_and_oscars_df, column_names=column_names)

VBox(children=(HBox(children=(Dropdown(description='X-axis:', layout=Layout(width='50%'), options=(('Release Y…

Output()

- I like this a lot because I am not always only concerned with time, even though that is the most frequently used x-axis.
- Drilling in more on the "smaller budget" films shows a nice cluster of Bechdel Test score 3 films, including 'Minions,' which had a relatively high revenue.

### Histograms!

- We all love histograms for how they show off distributions. Seeing how this data is distributed means a lot. Sometimes a few highly rated films can throw off the summary stats, leading us to believe we've made more progress than we have.
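Here is a tiny illustration of how a couple of outliers can pull a summary statistic around. The percentages are made up: mostly low values with two very high ones.

```python
import pandas as pd

# Made-up representation percentages: mostly low, with two outliers
crew_pct = pd.Series([5, 8, 10, 10, 12, 15, 80, 90])

print("mean:", crew_pct.mean())
print("median:", crew_pct.median())
```

The mean lands well above where most of the values sit, while the median stays with the bulk of the data. A histogram makes that kind of skew visible at a glance.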


In [649]:
# A function to create histograms that can be reused when we make the interactive dashboard later
def create_histogram(df, column='bt_score', color='category', bins=10, column_names=column_names):
    """
    Create a histogram of a specified column in the dataframe.

    Args:
        df (pandas DataFrame): the input dataframe
        column (str, optional): the column to look at the distribution for; defaults to 'bt_score'.
        color (str, optional): the column to color the bars by; defaults to 'category'.
        bins (int, optional): number of bins for the histogram; defaults to 10.
        column_names (dict, optional): dictionary of column names to make the title and labels look decent.
    """
    labels = column_names if column_names else {}
    if color == 'category' or column == 'category':
        df = df[df['category'] != 'not nominated']
    fig = px.histogram(df, 
                       x=column, 
                       color=color,
                       barmode='group',
                       nbins=bins,
                       hover_data=[column, color, 'title', 'year'],
                       title=f"Distribution of {labels.get(column, column)}",
                       labels=labels,
                       height=500)
        
    fig.update_layout(bargap=0.2)
    #fig.show()
    return fig

What is the distribution of Bechdel Test scores for movies that won at least one Oscar?

In [651]:
create_histogram(movies_and_oscars_df, column='bt_score', color='category',bins=4, column_names=column_names)

Nice. At least we're winning for writing more women talking to each other about things other than men!

What does the reverse of that look like?

In [652]:
create_histogram(movies_and_oscars_df, column='category', color='bt_score', column_names=column_names)

What is the distribution of Crew Female Representation and then Cast Female Representation?

In [653]:
create_histogram(movies_and_oscars_df, column='crew_female_representation', color='bt_score', column_names=column_names)

In [579]:
create_histogram(movies_and_oscars_df, column='cast_female_representation', color='bt_score', column_names=column_names)

At least cast is improving, but that crew distribution is a bummer. 

### Density Heatmaps!

- Oh I enjoy a heatmap to show off how things correlate!
- I really want to look at things other than time-dependent things!

In [654]:
# A function to create heatmaps that can be reused when we make the interactive dashboard later
def create_heatmap(df, x_col, y_col, nbinsx=20, nbinsy=10, column_names=column_names):
    """Create a heatmap of two columns in a DataFrame.

    Args:
        df (DataFrame): The DataFrame containing the data.
        x_col (str): The column for the x-axis.
        y_col (str): The column for the y-axis.
        nbinsx (int, optional): Number of bins for the x-axis. Defaults to 20.
        nbinsy (int, optional): Number of bins for the y-axis. Defaults to 10.
        column_names (dict, optional): Dictionary of column names for pretty labels.

    Returns:
        Figure: A Plotly Figure object containing the heatmap.
    """
    labels = column_names if column_names else {}
    if y_col == 'category' or x_col == 'category':
        df = df[df['category'] != 'not nominated']
    fig = px.density_heatmap(df, 
                             x=x_col, 
                             y=y_col, 
                             nbinsx=nbinsx, 
                             nbinsy=nbinsy, 
                             title=f'Heatmap of {labels.get(x_col, x_col)} v {labels.get(y_col, y_col)}', 
                             width=800, 
                             height=500,
                             labels=labels,
                             marginal_x='histogram',
                             marginal_y='histogram', 
                             text_auto=True)
    
    fig.update_layout(coloraxis_colorbar=dict(
        title=dict(text=f'{labels.get(x_col, x_col)} count'),
        thicknessmode="pixels", thickness=25,
        lenmode="pixels", len = 350,
        yanchor="top", y = 1,
        ticks="", dtick=500
    ))
    
    return fig

I want to look at the correlation between crew female representation and Bechdel Test score.

In [656]:
create_heatmap(movies_and_oscars_df, 'crew_female_representation', 'bt_score').show()

The yellow block really jumps out - the count is highest where the crew female representation is between 5% and 15% and the Bechdel Test score is 3.
The bin size for Bechdel Test score throws off the chart in that area, but we can still see the pattern pretty well.
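A density heatmap shows where movies pile up rather than a correlation coefficient. To put a number on the relationship, one option (not used elsewhere in this notebook) is a rank correlation, which suits an ordinal score like 0 to 3. The sketch below uses invented values under the notebook's column names, and computes Spearman's coefficient as the Pearson correlation of the ranks.

```python
import pandas as pd

# Made-up values (NOT the real dataset); column names mirror the notebook's
toy_df = pd.DataFrame({
    "crew_female_representation": [5, 8, 12, 20, 30, 45],
    "bt_score": [0, 1, 1, 3, 3, 3],
})

# Spearman correlation = Pearson correlation of the ranks (ties get average ranks)
ranks = toy_df.rank()
spearman = ranks["crew_female_representation"].corr(ranks["bt_score"])
print(f"Spearman rank correlation: {spearman:.2f}")
```

A value near +1 or -1 means the two columns rise or fall together; a value near 0 means the heatmap's bright spot is mostly about where the data is dense.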

What about the correlation between crew female representation and budget?

In [657]:
create_heatmap(movies_and_oscars_df, 'crew_female_representation', 'budget').show()

I really like having the counts printed on the heatmap to show how concentrated the movies are. The yellow brick is super high, and all those blue blocks are low -- the higher budgets have lower crew female representation.
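One way to double-check a reading like this is to split films into budget tiers and compare the typical crew representation in each. This is a sketch under assumptions: the numbers are invented, and the two-tier split with `pd.qcut` is an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up values (NOT the real dataset); column names mirror the notebook's
toy_df = pd.DataFrame({
    "budget": [5, 10, 20, 40, 80, 160],                      # millions, invented
    "crew_female_representation": [30, 25, 20, 15, 10, 5],  # percent, invented
})

# Split films into two equally sized budget tiers
toy_df["budget_tier"] = pd.qcut(toy_df["budget"], q=2, labels=["lower", "higher"])

# Median crew female representation within each tier
median_by_tier = toy_df.groupby("budget_tier", observed=True)[
    "crew_female_representation"
].median()
print(median_by_tier)
```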

## Time for the Dashboard!

We've explored the individual charts, and now we'd like to bring them together to see how each chart gives us a unique perspective on the data. 
For the dashboard, we'll put the bar chart, scatter plot, histogram, and density heatmap in one figure. The bar chart and scatter plot will share the same x and y columns to show how these two charts give us different insights. The histogram will use the y-axis column for its distribution.

The density heatmap will have its own x and y axes, separate from the other graphs, due to the nature of the insights we're trying to glean.

In [679]:
# A function to create a dashboard with histogram, scatter plot, heatmap and bar chart

def create_dashboard(
    data,
    x_col='year',
    y_col='num_oscars',
    heatmap_x='crew_female_representation',
    heatmap_y='budget',
    color='bt_score',
    column_names=None
):
    """Create a dashboard with histogram, scatter plot, heatmap and bar chart.

    Args:
        data (DataFrame): The DataFrame containing the data.
        x_col (str): X column for scatter and bar chart.
        y_col (str): Y column for scatter and bar chart.
        heatmap_x (str): X column for heatmap.
        heatmap_y (str): Y column for heatmap.
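        color (str): Color column for the histogram, scatter plot, and bar chart.
        column_names (dict, optional): Dictionary for pretty labels.

    Returns:
        Figure: A Plotly Figure object containing the four subplots.
    """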
    
    labels = column_names if column_names else {}
    subplot_titles = [
        f"Histogram of {labels.get(y_col, y_col)}",
        f"Scatter: {labels.get(y_col, y_col)} vs {labels.get(x_col, x_col)}",
        f"Heatmap: {labels.get(heatmap_x, heatmap_x)} vs {labels.get(heatmap_y, heatmap_y)}",
        f"Bar: {labels.get(y_col, y_col)} by {labels.get(x_col, x_col)}"
    ]
    
    fig = make_subplots(rows=2, cols=2, subplot_titles=subplot_titles)

  
    # Add all traces for each subplot
    # Histogram
    for trace in create_histogram(data, column=y_col, color=color, bins=20, column_names=column_names).data:
        fig.add_trace(trace, row=1, col=1)
        
    # Scatterplot
    for trace in create_scatter_plot(data, x_col=x_col, y_col=y_col, color_by=color, column_names=column_names).data:
        fig.add_trace(trace, row=1, col=2)
    
    # Heatmap
    heatmap_fig = create_heatmap(data, x_col=heatmap_x, y_col=heatmap_y)
    fig.add_trace(heatmap_fig.data[0], row=2, col=1)
    
    # Bar chart
    for trace in create_bar_chart(data, x_col=x_col, y_col=y_col, color_by=color, column_names=column_names).data:
        fig.add_trace(trace, row=2, col=2)
    
    # Add axis labels for each subplot
    fig.update_xaxes(title_text=labels.get(y_col, y_col), row=1, col=1)
    fig.update_yaxes(title_text="Count", row=1, col=1)

    fig.update_xaxes(title_text=labels.get(x_col, x_col), row=1, col=2)
    fig.update_yaxes(title_text=labels.get(y_col, y_col), row=1, col=2)

    fig.update_xaxes(title_text=labels.get(heatmap_x, heatmap_x), row=2, col=1)
    fig.update_yaxes(title_text=labels.get(heatmap_y, heatmap_y), row=2, col=1)

    fig.update_xaxes(title_text=labels.get(x_col, x_col), row=2, col=2)
    fig.update_yaxes(title_text=labels.get(y_col, y_col), row=2, col=2)
 
    fig.update_layout(height=800, width=1000, title_text='Movies, Oscars, and Female Representation Dashboard')
    return fig

In [680]:
print("Testing the dashboard function...")
create_dashboard(movies_and_oscars_df, column_names=column_names).show()

Testing the dashboard function...


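Not terrible, though I am displeased that the heatmap color bar sits beside the entire dashboard and not just the heatmap. Additionally, the histograms on the top and side of the heatmap went away, and I like those. (Only the first trace of the heatmap figure, `heatmap_fig.data[0]`, is copied into the dashboard, so the marginal histograms get left behind.) This is a problem for another day!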

Here is a picture of my cat, Moon, to help ease that pain: <br>
<img src="data/Moon2_onAlpha.png" width = "320">

##### Now let's add some interactivity to the dashboard so that when we change the x, y, or color columns, all the charts update!

In [681]:
# A function to create an interactive dashboard with dropdowns to select columns for each plot
def create_interactive_dashboard(data, column_names=None):
    """Create an interactive dashboard with dropdowns to select columns for each plot.

    Args:
        data (DataFrame): The DataFrame containing the data.
        column_names (dict, optional): Dictionary for pretty labels.
    """
    labels = column_names if column_names else {}
    
    numerical_cols = data.select_dtypes(include=[np.number]).columns.tolist()
    categorical_cols = data.select_dtypes(include=['object', 'category']).columns.tolist()
    
    x_options = [(labels.get(col, col), col) for col in numerical_cols]
    y_options = [(labels.get(col, col), col) for col in numerical_cols]
    color_options = [(labels.get(col, col), col) for col in categorical_cols + numerical_cols]
    color_options = [('None', None)] + color_options
    
    # Dropdowns
    x_dropdown = Dropdown(options=x_options, 
                          value='year' if 'year' in numerical_cols else numerical_cols[0], 
                          description='X-axis:',
                          layout={'width': '50%'})
    
    y_dropdown = Dropdown(options=y_options, 
                          value='num_oscars' if 'num_oscars' in numerical_cols else numerical_cols[0], 
                          description='Y-axis:',
                          layout={'width': '50%'})
    
    color_dropdown = Dropdown(options=color_options, 
                              value='bt_score' if 'bt_score' in categorical_cols + numerical_cols else None, 
                              description='Color by:',
                              layout={'width': '50%'})

      
    heatmap_x_dropdown = Dropdown(options=x_options,
                             value='crew_female_representation',
                             description='Heatmap X Column:',
                             layout={'width': '50%'})
    
    heatmap_y_dropdown = Dropdown(options=y_options,
                             value='budget',
                             description='Heatmap Y Column:',
                             layout={'width': '50%'})
   
    # Initial slider for the default x_col
    x_min = int(data[x_dropdown.value].min())
    x_max = int(data[x_dropdown.value].max())
    x_slider = IntRangeSlider(
        value=[x_min, x_max],
        min=x_min,
        max=x_max,
        step=1,
        description=f'{labels.get(x_dropdown.value, x_dropdown.value)} Range:',
        style={'description_width': 'initial'},
        continuous_update=False,
        layout={'width': '55%', 'margin': '0 auto', 'height': '40px'}
    )

    def update_slider(*args):
        """Reset the range slider to span the newly selected x column."""
        column = data[x_dropdown.value]
        # Thanks Co-Pilot for this snippet
        if not pd.api.types.is_numeric_dtype(column):
            return  # nothing sensible to show for non-numeric columns
        col_min, col_max = column.min(), column.max()
        # Handle missing values
        if pd.isnull(col_min) or pd.isnull(col_max):
            col_min, col_max = 0, 1
        new_min, new_max = int(col_min), int(col_max)
        # Widen the bounds first so min never exceeds max mid-update
        x_slider.min = min(x_slider.min, new_min)
        x_slider.max = max(x_slider.max, new_max)
        x_slider.value = [new_min, new_max]
        x_slider.min = new_min
        x_slider.max = new_max
        x_slider.description = f'{labels.get(x_dropdown.value, x_dropdown.value)} Range:'

    x_dropdown.observe(update_slider, names='value')
    
    # Update the dashboard when any widget changes
    def update_dashboard(x_col,
                         y_col, 
                         heatmap_x,
                         heatmap_y,
                         color, 
                         year_range):
        """
        Update function for our interactive dashboard.

        Parameter names match the keyword arguments passed to interact() below.
        Each parameter receives the current value of its corresponding widget:
        - x_col: string from x_dropdown
        - y_col: string from y_dropdown
        - heatmap_x: string from heatmap_x_dropdown
        - heatmap_y: string from heatmap_y_dropdown
        - color: string (or None) from color_dropdown
        - year_range: tuple of (min, max) from x_slider
        """
        # Filter data by the slider range on the selected x column
        # year_range is a tuple of (min, max) from the IntRangeSlider

        filtered_data = data[(data[x_col] >= year_range[0]) & (data[x_col] <= year_range[1])]
        
        clear_output(wait=True)
        
        fig = create_dashboard(filtered_data,
                               x_col,
                               y_col,
                               heatmap_x,
                               heatmap_y,
                               color,
                               column_names=column_names)
        
        fig.show()
        
    interact(
        update_dashboard,                     # Function to call when widgets change
        x_col=x_dropdown,                     # Maps x_col parameter to x_dropdown widget
        y_col=y_dropdown,                     # Maps y_col parameter to y_dropdown widget
        heatmap_x=heatmap_x_dropdown,         # Maps heatmap_x parameter to heatmap_x_dropdown widget
        heatmap_y = heatmap_y_dropdown,       # Maps heatmap_y parameter to heatmap_y_dropdown widget
        color=color_dropdown,                 # Maps color parameter to color_dropdown widget
        year_range=x_slider                   # Maps year_range parameter to x_slider widget
    )
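
Because the range slider relabels itself to follow the X-axis dropdown, the filter it drives has to work on whichever column is selected. Here is a minimal sketch of that idea on invented data; `filter_by_range` is a hypothetical helper, not a function defined in this notebook.

```python
import pandas as pd

def filter_by_range(df, column, value_range):
    """Keep rows whose `column` value lies within value_range (inclusive)."""
    low, high = value_range
    return df[df[column].between(low, high)]

# Made-up rows (NOT the real dataset); column names mirror the notebook's
toy_df = pd.DataFrame({
    "year": [1990, 2000, 2010, 2020],
    "num_oscars": [0, 2, 1, 4],
})

by_year = filter_by_range(toy_df, "year", (2000, 2010))
by_oscars = filter_by_range(toy_df, "num_oscars", (1, 4))
print(by_year)
print(by_oscars)
```

The same helper handles any numeric column, so the slider's `(min, max)` tuple can be passed straight through.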


In [685]:
create_interactive_dashboard(movies_and_oscars_df, column_names=column_names)


#### Video demo:

<video controls src="data/assignment3-nancymelchert.mp4" title="Dashboard Video Demo" width=720></video>
