## Women in Film & The Bechdel Test
* Emily J. Cain
* Capstone Project

In [1]:
import numpy as np
import os
import pandas as pd
import plotly.graph_objs as go

# import get, requests - whatever was used for TMDb API calls, plus additional wrappers/libraries
# beautiful soup
# scipy.stats

from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

In [None]:
# Notes
# Include any custom functions used with docstrings
# External and Internal Hyperlinks (for table of contents)
# Add probability calculations to dataframe to be used for Tab Two on app?

## I. Data Collection & Cleaning

### Sources
#### Primary
* Bechdel Test Website - scraped
* TMDb API
* Wikidata SPARQL Queries

#### Supplemental
* Kaggle Datasets
* U.S. Census Bureau
* Motion Picture Association of America (MPAA)
* Women's Media Center (WMC)

### A. Bechdel Test Website
Scraped using Beautiful Soup to obtain:
* Name of movie
* Year
* Total points
    * 1 point for at least two named female characters ("Woman in Cafe #1", for example, would not qualify)
    * 2 points for named female characters who talk to each other
    * 3 points for named female characters who talk to each other about something other than a man
* Passing or Non-Passing (3 points = Passing)
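
The scoring rule above can be sketched as a small helper. This is an illustration only, not the scraping code used for the project: the names `BECHDEL_CRITERIA` and `bechdel_result` are hypothetical, and a score of 0 (no criteria met) is assumed to be a valid value.

```python
# Hypothetical helper illustrating the scoring rule; not the project's scraping code.
BECHDEL_CRITERIA = {
    0: "Fewer than two named female characters",  # assumed: no criteria met
    1: "At least two named female characters",
    2: "Named female characters talk to each other",
    3: "Named female characters talk to each other about something other than a man",
}


def bechdel_result(points):
    """Return 'Passing' for a score of 3 and 'Non-Passing' for scores 0-2."""
    if points not in BECHDEL_CRITERIA:
        raise ValueError(f"Bechdel score must be 0-3, got {points!r}")
    return "Passing" if points == 3 else "Non-Passing"


for score in range(4):
    print(score, bechdel_result(score), "-", BECHDEL_CRITERIA[score])
```

Rejecting out-of-range scores early makes scraping errors visible before they reach the analysis.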

### B. TMDb (The Movie Database) API
API calls made with (library name to be confirmed)

### C. Wikidata SPARQL Queries
* Used to find movies with a Bechdel Test ID and obtain (if available):
    * Director(s) Name & Gender
    * Screenwriter(s) Name & Gender
    * Producer(s) Name & Gender
    * Film Budget
    * Box Office Revenue
* Used for Academy Awards queries to obtain (if available):
    * Nominees and Winners for all current and defunct non-acting categories
    * Names
    * Genders
    * Year of Awards show
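
A query of the kind described above might be assembled as follows. This is a rough sketch, not the query used for the project: the function name `build_crew_query` is hypothetical, and the Wikidata property IDs (P4632 for the Bechdel Test Movie List ID, P57/P58/P162 for director, screenwriter, and producer, P21 for sex or gender) are assumptions that should be checked on wikidata.org before use. The sketch only builds the query string; it does not call the endpoint.

```python
# Sketch only: the property IDs below are assumptions to verify on wikidata.org.
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

# Role name -> Wikidata property ID (assumed)
CREW_PROPERTIES = {
    "director": "P57",
    "screenwriter": "P58",
    "producer": "P162",
}


def build_crew_query(role, limit=100):
    """Build a SPARQL query for films with a Bechdel Test ID and one crew role."""
    prop = CREW_PROPERTIES[role]
    return f"""
SELECT ?film ?filmLabel ?bechdelId ?human ?humanLabel ?genderLabel WHERE {{
  ?film wdt:P4632 ?bechdelId .            # Bechdel Test Movie List ID (assumed)
  ?film wdt:{prop} ?human .               # crew member in the requested role
  OPTIONAL {{ ?human wdt:P21 ?gender . }} # sex or gender, if recorded
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
LIMIT {int(limit)}
""".strip()


print(build_crew_query("director", limit=10))
```

Wrapping the gender lookup in `OPTIONAL` keeps people without a recorded gender in the results, which would account for missing values in a `genderLabel` column.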

#### Academy Awards Data - Nominees

In [8]:
oscar_nominees = pd.read_csv('my_data/oscarnomineesnobestpic.csv')
oscar_nominees.head()

Unnamed: 0,humanLabel,genderLabel,nominationLabel,year
0,Charlie Chaplin,male,"Academy Award for Best Writing, Original Scree...",1948
1,Charlie Chaplin,male,"Academy Award for Best Writing, Original Scree...",1941
2,Peter Jackson,male,"Academy Award for Best Writing, Original Scree...",1995
3,Michel Hazanavicius,male,"Academy Award for Best Writing, Original Scree...",2012
4,Sylvester Stallone,male,"Academy Award for Best Writing, Original Scree...",1977


#### Check unique gender values

In [9]:
oscar_nominees.genderLabel.value_counts()

male          8543
female        1025
non-binary       1
Name: genderLabel, dtype: int64

In [10]:
oscar_nominees.query('genderLabel == "non-binary"')

Unnamed: 0,humanLabel,genderLabel,nominationLabel,year
3058,Sam Smith,non-binary,Academy Award for Best Original Song,2016


In [11]:
oscar_nominees.shape

(9600, 4)
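
`value_counts()` excludes missing values by default, so comparing the gender counts with the row count shows how many rows have no gender label. A quick arithmetic check, using only the numbers printed above (the variable names are illustrative):

```python
# Counts copied from the value_counts() and shape outputs above.
gender_counts = {"male": 8543, "female": 1025, "non-binary": 1}
total_rows = 9600

labeled = sum(gender_counts.values())
unlabeled = total_rows - labeled

print(f"Rows with a gender label: {labeled}")
print(f"Rows without a gender label: {unlabeled}")
for gender, count in gender_counts.items():
    print(f"{gender}: {count / labeled:.1%} of labeled rows")
```

Passing `dropna=False` to `value_counts()` would list the missing values directly alongside the other categories.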

In [8]:
oscar_nominees.nominationLabel.value_counts()

Academy Award for Best Sound Mixing                         1049
Academy Award for Best Production Design                     731
Academy Award for Best Writing, Adapted Screenplay           661
Academy Award for Best Writing, Original Screenplay          623
Academy Award for Best Film Editing                          542
Academy Award for Best Visual Effects                        504
Academy Award for Best Director                              442
Academy Award for Best Animated Short Film                   425
Academy Award for Best Documentary (Short Subject)           411
Academy Award for Best Art Direction, Black and White        380
Academy Award for Best Live Action Short Film                379
Academy Award for Best Art Direction, Color                  371
Academy Award for Best Cinematography                        320
Academy Award for Best Costume Design                        289
Academy Award for Best Original Score                        283
Academy Award for Best Ma

#### Custom Functions Used

In [12]:
def condense_categories(category_string):
    """
    Take an Academy Award category string from a Wikidata query and return the category name alone.

    Drops the first four words ('Academy Award for Best'), so 'Academy Award for Best Original Song'
    returns 'Original Song'. Assumes the string starts with that four-word prefix.
    """
    # Split on spaces and keep everything after 'Academy Award for Best'
    split_category_string = category_string.split(' ')
    condensed_category_string = split_category_string[4:]
    return ' '.join(condensed_category_string)
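
Because `condense_categories` drops the first four words by position, applying it to a label that is already condensed, or to one that does not begin with "Academy Award for Best", would remove meaningful words. A minimal sketch of a prefix-based variant follows; the name `condense_category_by_prefix` is hypothetical, and the prefix is inferred from the labels shown in the output above.

```python
# Hypothetical alternative to condense_categories; the prefix is inferred from the labels above.
AWARD_PREFIX = "Academy Award for Best "


def condense_category_by_prefix(category_string):
    """Strip a leading 'Academy Award for Best ' if present; otherwise return the label unchanged."""
    if category_string.startswith(AWARD_PREFIX):
        return category_string[len(AWARD_PREFIX):]
    return category_string


examples = [
    "Academy Award for Best Original Song",
    "Academy Award for Best Writing, Original Screenplay",
    "Original Song",  # already condensed: left as-is
]
for label in examples:
    print(repr(condense_category_by_prefix(label)))
```

This version is safe to apply more than once to the same column, since condensed labels pass through unchanged.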

In [13]:
def clean_oscar_sparql_query(df, award_column, new_award_column):
    """
    Clean a DataFrame generated from a Wikidata Academy Awards query.

    Takes the DataFrame, the name of its award column, and the new name for that column. Applies
    condense_categories to the award column, then renames humanLabel, genderLabel, and award_column
    to name, gender, and new_award_column, respectively.

    Note: the DataFrame is modified in place and the same object is returned, so call this only
    once per DataFrame.
    """
    # Shorten each award label, e.g. 'Academy Award for Best Director' -> 'Director'
    df[award_column] = df[award_column].map(condense_categories)
    # Rename the Wikidata label columns (in place)
    df.rename({'humanLabel': 'name', 'genderLabel': 'gender', award_column: new_award_column}, axis=1, inplace=True)
    return df

In [14]:
clean_oscar_sparql_query(oscar_nominees, award_column='nominationLabel', new_award_column='nom_category')

Unnamed: 0,name,gender,nom_category,year
0,Charlie Chaplin,male,"Writing, Original Screenplay",1948
1,Charlie Chaplin,male,"Writing, Original Screenplay",1941
2,Peter Jackson,male,"Writing, Original Screenplay",1995
3,Michel Hazanavicius,male,"Writing, Original Screenplay",2012
4,Sylvester Stallone,male,"Writing, Original Screenplay",1977
5,Melvin Frank,male,"Writing, Original Screenplay",1974
6,Melvin Frank,male,"Writing, Original Screenplay",1955
7,Melvin Frank,male,"Writing, Original Screenplay",1961
8,Melvin Frank,male,"Writing, Original Screenplay",1947
9,Arthur C. Clarke,male,"Writing, Original Screenplay",1969


In [15]:
oscar_nominees.head()

Unnamed: 0,name,gender,nom_category,year
0,Charlie Chaplin,male,"Writing, Original Screenplay",1948
1,Charlie Chaplin,male,"Writing, Original Screenplay",1941
2,Peter Jackson,male,"Writing, Original Screenplay",1995
3,Michel Hazanavicius,male,"Writing, Original Screenplay",2012
4,Sylvester Stallone,male,"Writing, Original Screenplay",1977


In [16]:
oscar_nominees.tail()

Unnamed: 0,name,gender,nom_category,year
9595,Thomas Mead,male,"Live Action Short Film, Two-Reel",1949
9596,Ben K. Blake,male,"Live Action Short Film, Two-Reel",1948
9597,William Lasky,male,"Live Action Short Film, Two-Reel",1950
9598,Louis Harris,male,"Live Action Short Film, Two-Reel",1945
9599,John Healy,male,"Live Action Short Film, Two-Reel",1957


#### Check for null values

In [43]:
oscar_nominees.isnull().sum()

name             0
gender          30
nom_category     0
year             0
dtype: int64

In [44]:
null_genders = oscar_nominees.loc[oscar_nominees.gender.isnull()]
null_genders

Unnamed: 0,name,gender,nom_category,year
2434,Jocelyn Glatzer,,Documentary Feature,2007
3755,Ariel Velasco-Shaw,,Visual Effects,1994
4043,Lyle Conway,,Visual Effects,1987
4117,Thaine Morris,,Visual Effects,1989
4118,Kent Houston,,Visual Effects,1990
6526,Eda Godel Hallinan,,Animated Short Film,1984
6640,Jan Saunders,,Live Action Short Film,1983
6646,Thom Colwell,,Live Action Short Film,1996
6649,Gabriele Lins,,Live Action Short Film,2000
6712,T.R. Conroy,,Live Action Short Film,1992


In [None]:
# look up genders to add later or drop if unknown
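
One possible way to carry out that step is sketched below: fill missing genders from a manually researched lookup, then drop the rows that remain unknown. The rows and the `researched_genders` mapping are placeholders for illustration, not real project data.

```python
import pandas as pd

# Placeholder rows and lookup values for illustration only; not real project data.
nominees = pd.DataFrame({
    "name": ["Person A", "Person B", "Person C"],
    "gender": ["female", None, None],
    "nom_category": ["Visual Effects", "Animated Short Film", "Live Action Short Film"],
    "year": [1990, 1984, 1996],
})

# Hypothetical result of a manual lookup: name -> gender.
researched_genders = {"Person B": "male"}

# Fill only the missing genders from the lookup; existing values are kept.
nominees["gender"] = nominees["gender"].fillna(nominees["name"].map(researched_genders))

# Drop rows whose gender is still unknown.
nominees = nominees.dropna(subset=["gender"]).reset_index(drop=True)
print(nominees)
```

Keeping the lookup in a dictionary leaves a record of which values were added by hand rather than returned by the Wikidata query.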

#### Save cleaned dataset to new csv

In [42]:
oscar_nominees.to_csv('my_data/cleaned_oscar_nominees.csv', index=False)

#### Academy Awards Data - Winners

In [18]:
oscar_winners = pd.read_csv('my_data/oscarwinnersquerynobestpicture.csv')
oscar_winners.head()

Unnamed: 0,humanLabel,genderLabel,awardLabel,year
0,Peter Jackson,male,Academy Award for Best Director,2004
1,Steven Spielberg,male,Academy Award for Best Director,1999
2,Steven Spielberg,male,Academy Award for Best Director,1994
3,Kevin Costner,male,Academy Award for Best Director,1991
4,Woody Allen,male,Academy Award for Best Director,1978


In [19]:
oscar_winners.genderLabel.unique()

array(['male', 'female', nan], dtype=object)

#### Check for null values

In [23]:
oscar_winners.isnull().sum()

humanLabel     0
genderLabel    2
awardLabel     0
year           0
dtype: int64

In [24]:
oscar_winners.loc[oscar_winners.genderLabel.isnull()]

Unnamed: 0,humanLabel,genderLabel,awardLabel,year
720,Robie Robinson,,Academy Award for Best Visual Effects,1970
1507,Gerardine Wurzburg,,Academy Award for Best Documentary (Short Subj...,1993


No information about Robie Robinson could be found online, so rows with that name are dropped from both datasets.

In [25]:
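# Drop Robie Robinson from both datasets; .copy() avoids SettingWithCopyWarning on later edits
oscar_nominees = oscar_nominees.loc[oscar_nominees.name != 'Robie Robinson'].copy()
oscar_winners = oscar_winners.loc[oscar_winners.humanLabel != 'Robie Robinson'].copy()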

In [34]:
# Row 1507 is Gerardine Wurzburg; gender filled in manually
oscar_winners.at[1507, 'genderLabel'] = 'female'

In [37]:
oscar_winners.isnull().sum()

humanLabel     0
genderLabel    0
awardLabel     0
year           0
dtype: int64

#### Apply cleaning functions

In [38]:
clean_oscar_sparql_query(oscar_winners, award_column='awardLabel', new_award_column='win_category')

Unnamed: 0,name,gender,win_category,year
0,Peter Jackson,male,Director,2004
1,Steven Spielberg,male,Director,1999
2,Steven Spielberg,male,Director,1994
3,Kevin Costner,male,Director,1991
4,Woody Allen,male,Director,1978
5,Alfonso Cuarón,male,Director,2014
6,Michel Hazanavicius,male,Director,2012
7,Kathryn Bigelow,female,Director,2010
8,Mel Gibson,male,Director,1996
9,Clint Eastwood,male,Director,1993


In [39]:
oscar_winners.shape

(2056, 4)

In [40]:
oscar_winners.tail()

Unnamed: 0,name,gender,win_category,year
2052,Jerry Fairbanks,male,"Live Action Short Film, One-Reel",1945
2053,Sam Coslow,male,"Live Action Short Film, Two-Reel",1944
2054,Boris Vermont,male,"Live Action Short Film, One-Reel",1953
2055,Konstantin Kalser,male,"Live Action Short Film, One-Reel",1957
2056,Wilbur T. Blume,male,"Live Action Short Film, Two-Reel",1956


#### Save new cleaned dataset to csv

In [41]:
oscar_winners.to_csv('my_data/cleaned_oscar_winners.csv', index=False)

## II. Exploratory Data Analysis & Visualizations

## III. Conditional Probability with Bayes' Theorem

## IV. Hypothesis Testing

## V. Dashboard
* Visualize important insights & metrics
* Allow user to explore, customize, and extract the data that interests them

## VI. Conclusion

### A. Crowdsourced Data

## VII. Recommendations

### A. Existing Projects & Initiatives
* Annenberg Inclusion Initiative - University of Southern California-based think tank focused on diversity and inclusion in entertainment
* Women's Media Center - Nonpartisan, non-profit organization founded by Jane Fonda, Robin Morgan, and Gloria Steinem to raise awareness and take action on a wide range of issues affecting women and girls
* Geena Davis Institute on Gender & Media - "If she can see it, she can be it" - promotes greater visibility of women in media, especially in roles where girls may not have seen much female representation
* Women in Media - Non-profit organization promoting gender balance in media by offering networking for female and female-identifying crew members

## VIII. Further Research