## Calculating Gender Composition from List of Writers

Author: Oliver Gladfelter

Date: 4/25/18 (updated 6/8/18)

In [87]:
import pandas as pd
import numpy as np

episodesData = pd.read_csv("top100ComediesWithWriters.csv", encoding='latin-1')
episodesData = episodesData.replace(np.nan, '', regex=True)

In [88]:
episodesData.head(3)

Unnamed: 0.1,Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,parentTconst,seasonNumber,episodeNumber,series,writers
0,0,tt0098286,"Good News, Bad News",1989,23,tt0098904,1,0,Seinfeld,"Larry David,Jerry Seinfeld"
1,1,tt0177842,Treehouse of Horror VI,1995,30,tt0096697,7,6,The Simpsons,"James L. Brooks,Sam Simon,John Swartzwelder,St..."
2,2,tt0213826,"Goodbye, Farewell, and Amen",1983,120,tt0068098,11,16,M*A*S*H,"Karen Hall,Alan Alda,Burt Metcalfe,John Rappap..."


In [56]:
import json

from urllib.request import urlopen

# to determine writer gender based on name, we use the Gender API found at https://gender-api.com/

# key obtained from Gender API
myKey = ""

def writerGender(writerFirstName):
    """
    Given a first name (ex: "Oliver"), call the Gender API and return 
    a list containing the predicted gender of name and confidence level
    """
    url = "https://gender-api.com/get?key=" + myKey + "&name=" + writerFirstName.strip(" ")
    response = urlopen(url)
    decoded = response.read().decode('utf-8')
    data = json.loads(decoded)

    return [data['gender'], data['accuracy']]
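
A minimal sketch of two optional safeguards around the call above, assuming the same endpoint and the `key` and `name` query parameters used in `writerGender`: percent-encoding the name so accented characters do not break the URL, and caching so a repeated name is only requested once. `build_gender_url`, `make_cached_lookup`, and `fake_fetch` are hypothetical helpers, and the stand-in fetcher returns placeholder values instead of contacting the API.

```python
from functools import lru_cache
from urllib.parse import urlencode

API_ENDPOINT = "https://gender-api.com/get"  # same endpoint as writerGender()

def build_gender_url(first_name, key):
    """Return the request URL with the key and name percent-encoded."""
    query = urlencode({"key": key, "name": first_name.strip()})
    return API_ENDPOINT + "?" + query

def make_cached_lookup(fetch):
    """Wrap fetch(name) so that each distinct name is fetched only once."""
    @lru_cache(maxsize=None)
    def lookup(first_name):
        return fetch(first_name)
    return lookup

# Stand-in for the real network call; it only records which names were requested.
calls = []
def fake_fetch(name):
    calls.append(name)
    return ["unknown", 0]  # placeholder values, not real API output

lookup = make_cached_lookup(fake_fetch)
for name in ["John", "John", "Karen"]:
    lookup(name)

print(build_gender_url(" Zoë ", "abc"))
print(calls)
```

In practice `fake_fetch` would be replaced by a function that opens the URL and parses the JSON response, as `writerGender` does.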

Although we now have an API and an accompanying function to derive gender from a writer's name, we do not want to apply them to the episodesData dataframe directly. Because most writers write multiple episodes, the same names repeat across rows, and calling the API once per episode would mean over 11,000 requests. Instead, we create a new dataframe in which each writer and their corresponding gender appears once. This smaller dataframe will act as a key.

In [6]:
writerDict = {}

# because each episode's writers are stored as a single string, 
# separated by commas, create a list of writers by using str.split()
for episode in range(0, len(episodesData)): 
    writersNoSpace = episodesData['writers'][episode].replace(', ', ',')
    episodesWriters = writersNoSpace.split(',')
    
    # for each writer in the list, add to the dictionary of writers
    for writer in episodesWriters:
        if writer not in writerDict:
            writerDict[writer] = 1
        else: 
            writerDict[writer] = writerDict[writer] + 1    
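
The counting loop above could also be sketched with `collections.Counter` from the standard library. This is an alternative under stated assumptions, not the code that produced the outputs in this notebook: `count_writers` and the sample strings are hypothetical, and unlike the loop above it strips whitespace from every name and skips empty names.

```python
from collections import Counter

def count_writers(writer_strings):
    """Count how many episodes each writer appears in.

    Each element of writer_strings is one episode's comma-separated writer list.
    """
    counts = Counter()
    for writers in writer_strings:
        names = [name.strip() for name in writers.split(",")]
        counts.update(name for name in names if name)  # ignore empty names
    return counts

# Hypothetical sample in the same format as the 'writers' column
sample = ["Larry David,Jerry Seinfeld", "Larry David, Peter Mehlman", ""]
counts = count_writers(sample)
print(counts.most_common(1))
```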

In [7]:
# convert dictionary into a dataframe
writerDictDF = pd.DataFrame(list(writerDict.items()), columns=['Writer', 'Episodes Written'])
writerDictDF = writerDictDF.sort_values(by = 'Episodes Written', ascending = False)
writerDictDF = writerDictDF.reset_index()
del writerDictDF['index']

def firstName(name):
    return name.split(' ')[0]

writerDictDF['first'] = writerDictDF['Writer'].apply(firstName)

writerDictDF.head()

Unnamed: 0,Writer,Episodes Written
0,Matt Groening,764
1,James L. Brooks,640
2,Sam Simon,640
3,Seth MacFarlane,559
4,Chuck Lorre,518


In [12]:
# Create a dataframe including only unique first names, i.e. all the 'John' writers take up just one row
firstNameList = writerDictDF['first'].unique()
writersFirstNames = pd.DataFrame(firstNameList, columns = ['first'])

This new dataframe contains a little over 500 observations, one for each unique first name found among the writers in the episodesData dataframe. We can now apply the writerGender() function defined above to estimate the gender associated with each name. Once calculated, two supporting functions separate the returned gender and prediction accuracy list into separate values, attaching each as a column to the dataframe.

In [None]:
writersFirstNames['genderAndAccuracy'] = ''

writersFirstNames['genderAndAccuracy'] = writersFirstNames['first'].apply(writerGender)

def getGender(value):
    return value[0]

def getAccuracy(value):
    return value[1]

writersFirstNames['gender'] = ''
writersFirstNames['gender'] = writersFirstNames['genderAndAccuracy'].apply(getGender)
writersFirstNames['accuracy'] = ''
writersFirstNames['accuracy'] = writersFirstNames['genderAndAccuracy'].apply(getAccuracy)
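
As a sketch of an alternative to the two helper functions, the `[gender, accuracy]` lists can be expanded into two columns in one step by building a small dataframe from them. The frame below is a hypothetical two-row stand-in for `writersFirstNames`, with values copied from the output shown further down.

```python
import pandas as pd

# Hypothetical stand-in for writersFirstNames after the API call
names = pd.DataFrame({
    "first": ["Scotty", "Shana"],
    "genderAndAccuracy": [["male", 97], ["female", 96]],
})

# Turn each [gender, accuracy] pair into its own row of a two-column frame,
# keeping the original index so the join lines up row by row
split = pd.DataFrame(
    names["genderAndAccuracy"].tolist(),
    columns=["gender", "accuracy"],
    index=names.index,
)
names = names.join(split)
print(names[["first", "gender", "accuracy"]])
```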

In [21]:
writersFirstNames.tail()

Unnamed: 0,first,genderAndAccuracy,gender,accuracy
268,Scotty,"[male, 97]",male,97
269,Shana,"[female, 96]",female,96
270,Kirker,"[male, 67]",male,67
271,Bobby,"[male, 97]",male,97
272,Lauren,"[female, 97]",female,97


In [135]:
# Now that we have an estimated gender for each first name, 
# merge back with the original writer's dataframe to attribute gender to each unique writer
del writersFirstNames['genderAndAccuracy']
writerDictDF2 = writersFirstNames.merge(writerDictDF, how = "right")

del writerDictDF2['first']
writerDictDF2.head()

Unnamed: 0,gender,accuracy,Writer,Episodes Written
2158,male,99.0,,1
2059,male,89.0,A.J. Poulin,1
703,male,99.0,Aaron Blitzstein,16
704,male,99.0,Aaron Ehasz,5
706,male,99.0,Aaron Harberts,3


## Quality Control - Updating Errors Made by the Gender API
#### (we double checked every name that had less than 97% accuracy)

In [None]:
#loop to correct writers' genders that the API mislabeled
writersToUpdate = ['Tuck Tucker','Ally Musika','M.J. Bassett','Taylor Elmore','Joan Binder Weiss','Joan Brooker','Dominique Morisseau','Janis Hirsch','Casey Maxwell Clair','Jody Hill','Kell Cahoon','Shion Takeuchi','Jamie Gorenberg','Jaydi Samuels','Morgan Murphy','Merrill Markoe','Channing Powell','Dava Savel','Kim Weiskopf','Doty Abrams','Niki Schwartz-Wright','Gigi McCreery','Gigi Vorgan','P. Karen Raper','P. Sharon','Kris Mukai','Michele J. Wolff','Kerry Lenhart','Robin Sayers','Ain Gordon','Taika Waititi',"Len O'Neill",'Alexis Wilkinson','Lang Fisher','Sharl Scharfer-Rollins','Sandy Frank','Joni Lefkowitz','Mel Sherer','Donelle Buck','Nell Scovell','Brown Mandell','Leslie Eberhard','Ako Castuera','Lee H. Grant','Seo Jung Kim','Marcy Lynn Dewey','Ira Ungerleider','Jessie Miller','Ira Fritz','Sam Brenner','Charlie Covell','Daley Haggar','Jules Dennis','Fran Kaufer','J.J. Philbin','Jordan Hawley','Mort Scrivner','Laurence Andries','Laurence Marks','Laurence Walsh','Dana Gould','Dana Scanlon','Tracy Gamble','Art Everett','Ollie Levy','Aleks Sennwald','Alex Ayers','Alex Borstein','Alex Cooley','Alex Yonks','Courtney Lilly','Jan Citron','Parker Hull','Chris Manheim','Shawn Schepps','Chris Atwood','Ali Adler','Ali Waller','Corey Nickerson']
indexOfWriters = []

for writer in writersToUpdate:
    indexOfWriters.append(writerDictDF2.loc[writerDictDF2['Writer'] == writer].index.values[0])

for writer in indexOfWriters:
    # use elif so that a label flipped to 'male' is not immediately flipped back to 'female'
    if writerDictDF2.loc[writer,'gender'] == 'female':
        writerDictDF2.loc[writer,'gender'] = 'male'
    elif writerDictDF2.loc[writer,'gender'] == 'male':
        writerDictDF2.loc[writer,'gender'] = 'female' 
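
A minimal sketch of the same correction written as a single vectorized swap, which avoids checking the two labels one after the other. The writer names and the `swap` dictionary below are hypothetical illustrations, not rows from the real data.

```python
import pandas as pd

# Hypothetical stand-in for writerDictDF2
writers = pd.DataFrame({
    "Writer": ["A Example", "B Example", "C Example"],
    "gender": ["male", "female", "male"],
})

to_flip = ["A Example", "B Example"]          # names whose label should be swapped
swap = {"male": "female", "female": "male"}   # each label is looked up exactly once

mask = writers["Writer"].isin(to_flip)
# Labels outside the dictionary (for example 'unknown') are left unchanged
writers.loc[mask, "gender"] = writers.loc[mask, "gender"].map(lambda g: swap.get(g, g))
print(writers)
```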

In [225]:
#unknown genders loop - for writers whom we could not find a definite gender for
writersToUpdate = ['Sjoux Doanham', 'Doty Abrams', 'P. Karen Raper', 'P. Sharon', 'Michele J. Wolff', 'Brown Mandell', 'Ira Fritz', 'Sam Brenner', 'Fran Kaufer', 'Jordan Hawley', 'Mort Scrivner', 'Art Everett', 'Ollie Levy', 'Parker Hull', 'Chris Atwood']
indexOfWriters = []

for writer in writersToUpdate:
    indexOfWriters.append(writerDictDF2.loc[writerDictDF2['Writer'] == writer].index.values[0])

for writer in indexOfWriters:
    writerDictDF2.loc[writer,'gender'] = 'N/A'

In [185]:
# Updating all the writers who received an 'unknown' gender from the API
maleIDs = [2115,1759,2025,1759,2041,1531,1034,1535,2160]
femaleIDs = [1385,1155,2089,1907,883,1608]

for writer in maleIDs:
    writerDictDF2.loc[writer,'gender'] = 'male'
    
for writer in femaleIDs:
    writerDictDF2.loc[writer,'gender'] = 'female'

In [233]:
writerDictDF2.to_csv("C:\\Users\\Oliver\\Documents\\imbd episodes\\genderOfWriters.csv")

Now that we have obtained the gender for each writer, the next step is to add genders to the episodesData dataframe. A direct merge between the two dataframes would not work, because each episode's writers are recorded as a single string in the episodesData dataframe. Instead, we create a third dataframe in which each writer of each episode is listed individually in a row. This allows for a merge with the writerDictDF2 dataframe.

In [234]:
writingStaff = []

for index, row in episodesData.iterrows():
    
    if row['writers'] != r'\N':
        episodeID = row['tconst']
        writersString = row['writers']
        
        # for each writer included in the writers column of episodesData, create a new row for that writer
        for writer in writersString.split(','):
            newRow = [episodeID, writer]
            writingStaff.append(newRow)

writingStaffDF = pd.DataFrame(writingStaff, columns = ['tconst', 'Writer'])
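
The one-row-per-writer table can also be sketched without an explicit loop, assuming a pandas version that provides `DataFrame.explode` (0.25 or later). The episode IDs and names below are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for episodesData
episodes = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002"],
    "writers": ["Ann Example, Bob Example", "Cy Example"],
})

# Split each writers string into a list, then give every list element its own row
staff = (
    episodes.assign(Writer=episodes["writers"].str.split(","))
    .explode("Writer")[["tconst", "Writer"]]
    .reset_index(drop=True)
)
staff["Writer"] = staff["Writer"].str.strip()  # remove stray spaces around names
print(staff)
```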

In [235]:
writingStaffDF2 = writingStaffDF.sort_values(by = 'Writer')
writerDictDF2 = writerDictDF2.sort_values(by = 'Writer')

def stripSpace(value):
    return value.strip(' ')
    
writingStaffDF2['Writer'] = writingStaffDF2['Writer'].apply(stripSpace)
writerDictDF2['Writer'] = writerDictDF2['Writer'].apply(stripSpace)

In [236]:
writingStaffDF2.head(10)

Unnamed: 0,tconst,Writer
47237,tt8428648,
8156,tt0640326,A.J. Poulin
25878,tt1610751,Aaron Blitzstein
28537,tt1819898,Aaron Blitzstein
28385,tt1795212,Aaron Blitzstein
28382,tt1795207,Aaron Blitzstein
28372,tt1795206,Aaron Blitzstein
25883,tt1610752,Aaron Blitzstein
28368,tt1795205,Aaron Blitzstein
28353,tt1795203,Aaron Blitzstein


In [238]:
writerDictDF2.head(10)

Unnamed: 0,gender,accuracy,Writer,Episodes Written
2059,male,89.0,A.J. Poulin,1
703,male,99.0,Aaron Blitzstein,16
704,male,99.0,Aaron Ehasz,5
706,male,99.0,Aaron Harberts,3
700,male,99.0,Aaron Korsh,109
707,male,99.0,Aaron Lam,2
705,male,99.0,Aaron Lee,4
702,male,99.0,Aaron Shure,28
701,male,99.0,Aaron Springer,66
2004,female,89.0,Abby Gewanter,2


In [239]:
writingStaffWithGender = writingStaffDF2.merge(writerDictDF2, how = 'left')
writingStaffWithGender = writingStaffWithGender.sort_values(by = 'tconst')
writingStaffWithGender = writingStaffWithGender.reset_index()
del writingStaffWithGender['index']

In [240]:
writingStaffWithGender.head()

Unnamed: 0,tconst,Writer,gender,accuracy,Episodes Written
0,tt0098286,Jerry Seinfeld,male,98.0,173.0
1,tt0098286,Larry David,male,98.0,263.0
2,tt0177842,Matt Groening,male,100.0,764.0
3,tt0177842,John Swartzwelder,male,99.0,59.0
4,tt0177842,Steve Tompkins,male,99.0,30.0


In [None]:
# add a new column of blank strings
writingStaffWithGender['writerGenders'] = ''

# iterate over the rows of the data frame in order to build the full list of writer genders
# for each episode, rather than leaving each writing staff separated over multiple rows
for writer in range(24425, len(writingStaffWithGender) - 1):
    
    # create a variable holding the gender string for the current row
    genders = writingStaffWithGender['gender'][writer]
    count = 1
    
    # while subsequent rows describe the same episode as the current row, append each
    # writer's gender to the 'genders' variable
    while writingStaffWithGender['tconst'][writer] == writingStaffWithGender['tconst'][writer + count]:
        genders = genders + "," + writingStaffWithGender['gender'][writer + count]
        count = count + 1
        
    # once the last row for the given episode is reached, replace the empty string in the
    # 'writerGenders' column with the string held by 'genders'
    # (.loc avoids the chained assignment writingStaffWithGender['writerGenders'][writer])
    writingStaffWithGender.loc[writer, 'writerGenders'] = genders

We now have a dataframe of episode title IDs and corresponding gender compositions, in string list form. However, each episode is listed multiple times, once for each writer. Additionally, the full gender compositions (including the gender of every writer for a given episode) are only included in the first instance of each episode. Thus, the final step is to drop duplicate episode IDs, keeping only the first instance, and delete the unhelpful columns.

In [258]:
# drop duplicate episodes in the data frame, always keeping the first instance of each repeated episode
# because the full gender compositions are only included in the first instance
writingStaffWithGender2 = writingStaffWithGender.drop_duplicates(subset = 'tconst', keep = 'first')

writingStaffWithGender2 = writingStaffWithGender2.drop(['Episodes Written', 'Writer', 'gender', 'accuracy'], axis = 1)

writingStaffWithGender2 = writingStaffWithGender2.reset_index()
del writingStaffWithGender2['index']

writingStaffWithGender2.head()

Unnamed: 0,tconst,writerGenders
0,tt0098286,"male,male"
1,tt0177842,"male,male,male,male,male,male,male"
2,tt0213826,"male,male,male,male,male,male,male,male,male,f..."
3,tt0238966,"male,male,male,female,male"
4,tt0291751,"male,male,male,male"
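
For comparison, the same one-row-per-episode table could be sketched with a `groupby` that joins each episode's gender labels, replacing both the row-by-row loop and the duplicate drop. This is an alternative under stated assumptions, not the code that produced the output above: the sample rows are hypothetical, and every gender value is assumed to be a string (missing values would need to be filled first).

```python
import pandas as pd

# Hypothetical stand-in for writingStaffWithGender (one row per writer per episode)
staff = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000001", "tt0000002"],
    "gender": ["male", "female", "male"],
})

# Join the gender labels of each episode into one comma-separated string
genderComp = (
    staff.groupby("tconst")["gender"]
    .agg(",".join)
    .reset_index(name="writerGenders")
)
print(genderComp)
```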


In [261]:
# Confirming both dataframes are of equal length...
print(len(writingStaffWithGender2))
print(len(episodesData))

11473
11473


Finally, we have a proper dataframe containing each TV episode and the genders of each writer for that episode. This is merged with the episodesData dataframe, pulling all necessary data together.

In [260]:
# attach each episode's writing staff to our main dataframe, 'episodesData'
episodesDataWithGenderComp = episodesData.merge(writingStaffWithGender2, how = "inner", on = "tconst")

episodesDataWithGenderComp.head(3)

Unnamed: 0.1,Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,parentTconst,seasonNumber,episodeNumber,series,writers,writerGenders
0,0,tt0098286,"Good News, Bad News",1989,23,tt0098904,1,0,Seinfeld,"Larry David,Jerry Seinfeld","male,male"
1,1,tt0177842,Treehouse of Horror VI,1995,30,tt0096697,7,6,The Simpsons,"James L. Brooks,Sam Simon,John Swartzwelder,St...","male,male,male,male,male,male,male"
2,2,tt0213826,"Goodbye, Farewell, and Amen",1983,120,tt0068098,11,16,M*A*S*H,"Karen Hall,Alan Alda,Burt Metcalfe,John Rappap...","male,male,male,male,male,male,male,male,male,f..."
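
Beyond comparing lengths, the merge itself can be asked to check its assumptions. The sketch below uses the `validate` and `indicator` parameters of `DataFrame.merge` on hypothetical two-row frames: `validate="one_to_one"` raises an error if `tconst` repeats on either side, and the `_merge` column shows which rows found a partner.

```python
import pandas as pd

# Hypothetical stand-ins for episodesData and writingStaffWithGender2
episodes = pd.DataFrame({"tconst": ["tt0000001", "tt0000002"], "series": ["A", "B"]})
genders = pd.DataFrame({"tconst": ["tt0000001", "tt0000002"],
                        "writerGenders": ["male", "male,female"]})

merged = episodes.merge(
    genders,
    how="left",             # keep every episode, even one with no match
    on="tconst",
    validate="one_to_one",  # raises MergeError if a tconst is duplicated
    indicator=True,         # adds a '_merge' column: 'both' or 'left_only'
)

unmatched = merged[merged["_merge"] != "both"]
print(len(merged), len(unmatched))
```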


In [262]:
def calculateGenderComp(value):
    """
    Given a list of individuals' genders in comma-separated string format, 
    return the proportion of males included in the list
    """
    
    listOfGender = value.split(',')
    
    numMaleWriters = 0
    
    # named 'gender' to avoid shadowing the writerGender() function defined earlier
    for gender in listOfGender:
        if gender == 'male':
            numMaleWriters = numMaleWriters + 1
            
    return numMaleWriters / len(listOfGender)


episodesDataWithGenderComp['perMaleWrite'] = 0
episodesDataWithGenderComp['perMaleWrite'] = episodesDataWithGenderComp['writerGenders'].apply(calculateGenderComp)
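
As a worked example of this calculation, the string `"male,male,male,female,male"` contains four `male` labels out of five, so the proportion is 4/5 = 0.8. The hypothetical helper `maleShare` below is a compact sketch of the same arithmetic. Note that any label other than `male`, including `N/A`, still counts toward the denominator, as it does in `calculateGenderComp`.

```python
def maleShare(genders):
    """Return the fraction of 'male' labels in a comma-separated gender string."""
    labels = genders.split(",")
    return labels.count("male") / len(labels)

print(maleShare("male,male,male,female,male"))  # 4 of 5 labels are 'male'
print(maleShare("male,N/A"))                    # 'N/A' stays in the denominator
```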

In [265]:
def calculateNumWriters(value):
    """
    Given a list of individuals' genders in comma-separated string format, 
    return the number of individuals (writers) included in the list
    """
    
    listOfGender = value.split(',')
            
    return len(listOfGender)


episodesDataWithGenderComp['numWriters'] = 0
episodesDataWithGenderComp['numWriters'] = episodesDataWithGenderComp['writerGenders'].apply(calculateNumWriters)

In [272]:
# whoops...had to do this one by hand, due to an earlier indexing error
episodesDataWithGenderComp.loc[11472, 'writers'] = 'Christopher Lloyd,Steven Levitan'
episodesDataWithGenderComp.loc[11472, 'writerGenders'] = 'male,male'
episodesDataWithGenderComp.loc[11472, 'perMaleWrite'] = 1.0
episodesDataWithGenderComp.loc[11472, 'numWriters'] = 2

In [291]:
del episodesDataWithGenderComp['Unnamed: 0']
episodesDataWithGenderComp.head()

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,parentTconst,seasonNumber,episodeNumber,series,writers,writerGenders,perMaleWrite,numWriters
0,tt0098286,"Good News, Bad News",1989,23,tt0098904,1.0,0.0,Seinfeld,"Larry David,Jerry Seinfeld","male,male",1.0,2
1,tt0177842,Treehouse of Horror VI,1995,30,tt0096697,7.0,6.0,The Simpsons,"James L. Brooks,Sam Simon,John Swartzwelder,St...","male,male,male,male,male,male,male",1.0,7
2,tt0213826,"Goodbye, Farewell, and Amen",1983,120,tt0068098,11.0,16.0,M*A*S*H,"Karen Hall,Alan Alda,Burt Metcalfe,John Rappap...","male,male,male,male,male,male,male,male,male,f...",0.9,10
3,tt0238966,Enemies,1996,22,tt0092400,10.0,23.0,Married with Children,"Stacie Lipp,Russell Marcus,Michael G. Moye,Ron...","male,male,male,female,male",0.8,5
4,tt0291751,The Best Bits of Mr. Bean,1995,72,tt0096657,1.0,15.0,Mr. Bean,"Ben Elton,Richard Curtis,Robin Driscoll,Rowan ...","male,male,male,male",1.0,4


In [289]:
episodesDataWithGenderComp.to_csv("C:\\Users\\Oliver\\Documents\\imbd episodes\\episodesDataWithGenderComp-6-8-18.csv")