## Calculating Gender Composition from Writers List

Author: Oliver Gladfelter

Date: 4/25/18

In [207]:
import pandas as pd
import numpy as np

episodesData = pd.read_csv("top10ComediesWithWriters.csv", encoding='latin-1')
episodesData = episodesData.replace(np.nan, '', regex=True)

In [386]:
episodesData.head(3)

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,averageRating,numVotes,parentTconst,series,episodeNumberInt,seasonNumberInt,totalNum,writers
0,tt0515236,Pilot,2003,21,8.3,2290,tt0367279,Arrested Development,1,1,1,Mitchell Hurwitz
1,tt0515256,Top Banana,2003,22,8.6,1986,tt0367279,Arrested Development,2,1,2,"Abraham Higginbotham, Mitchell Hurwitz, John L..."
2,tt0515212,Bringing Up Buster,2003,22,8.2,1826,tt0367279,Arrested Development,3,1,3,"Richard Rosenstock, Mitchell Hurwitz, Abraham ..."


In [240]:
import json

from urllib.request import urlopen

# to determine writer gender based on name, we use the Gender API found at https://gender-api.com/

# key obtained from Gender API
myKey = ""

def writerGender(writer):
    """
    Given a full writer name (ex: "Oliver Gladfelter"), call the Gender API on the first name
    and return a list containing the predicted gender of the name and the confidence level
    """
    url = "https://gender-api.com/get?key=" + myKey + "&name=" + writer.split(' ')[0]
    response = urlopen(url)
    decoded = response.read().decode('utf-8')
    data = json.loads(decoded)

    return [data['gender'], data['accuracy']]
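
First names containing accented or other non-ASCII characters need to be percent-encoded before they are placed in a query string. The sketch below shows one way to build the request URL with the standard library's `urlencode`, assuming the same endpoint and the same `key` and `name` parameters used in `writerGender()`. The helper name `buildGenderUrl`, the writer name, and the key value are hypothetical and exist only for illustration; no request is sent.

```python
from urllib.parse import urlencode

def buildGenderUrl(writer, key):
    """Build the Gender API URL for a writer's first name (hypothetical helper)."""
    # as in writerGender(), only the first token of the full name is sent
    firstName = writer.split(' ')[0]
    # urlencode percent-encodes each value (UTF-8 by default)
    return "https://gender-api.com/get?" + urlencode({'key': key, 'name': firstName})

# hypothetical name and key, used only to show the encoding
url = buildGenderUrl("Zoë Example", "demo-key")
print(url)
```

The resulting string could be passed to `urlopen` in place of the concatenated URL.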

Although we now have an API and an accompanying function to derive gender from a writer's name, we do not want to apply the function to the episodesData dataframe directly. Most writers are credited on multiple episodes, so doing so would call the API over 11,000 times, mostly for names that have already been looked up. Instead, we create a new dataframe in which each writer and their corresponding gender appear exactly once. This smaller dataframe acts as a lookup key.
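
A complementary way to limit API calls is to cache results by first name, since only the first name is sent to the API. The following is a minimal sketch of that idea, not part of the original analysis: `makeCachedLookup` and `fakeFetch` are hypothetical names, and `fakeFetch` returns a fixed placeholder value so the example runs without a network connection or an API key.

```python
from functools import lru_cache

def makeCachedLookup(fetch):
    """Wrap a first-name lookup so each distinct first name is fetched only once."""
    @lru_cache(maxsize=None)
    def lookupFirstName(firstName):
        # store an immutable copy of the [gender, accuracy] result
        return tuple(fetch(firstName))

    def lookup(fullName):
        # mirror writerGender(): only the first token of the name is used
        return lookupFirstName(fullName.split(' ')[0])

    return lookup

# hypothetical stand-in for the real API call; the returned values are placeholders
calls = []
def fakeFetch(firstName):
    calls.append(firstName)
    return ['male', 99]

lookup = makeCachedLookup(fakeFetch)
results = [lookup(name) for name in ["Aaron Korsh", "Aaron Lee", "Adam Chase"]]
print(calls)  # the first names that actually triggered a fetch
```

In this sketch two of the three names share a first name, so only two fetches occur.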

In [212]:
writerDict = {}

# because each episode's writers are stored as a single string, 
# separated by commas, create a list of writers by using str.split()
for episode in range(0, len(episodesData)): 
    writersNoSpace = episodesData['writers'][episode].replace(', ', ',')
    episodesWriters = writersNoSpace.split(',')
    
    # for each writer in the list, add to the dictionary of writers
    for writer in episodesWriters:
        if writer not in writerDict:
            writerDict[writer] = 1
        else: 
            writerDict[writer] = writerDict[writer] + 1    
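
The counting loop above can also be expressed with pandas string methods. The sketch below is one possible equivalent, assuming pandas 0.25 or later (for `Series.explode`). The dataframe `toy` and the writer names in it are hypothetical stand-ins for episodesData.

```python
import pandas as pd

# hypothetical stand-in for episodesData; only the 'writers' column matters here
toy = pd.DataFrame({'writers': ["Ann Alpha",
                                "Bob Beta, Ann Alpha"]})

writerCounts = (
    toy['writers']
    .str.split(',')     # one list of names per episode
    .explode()          # one row per (episode, writer) pair
    .str.strip()        # remove the space left after each comma
    .value_counts()     # number of episodes credited to each unique writer
)
print(writerCounts)
```

The result is a Series indexed by writer name, which plays the same role as `writerDict`.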

In [237]:
# convert dictionary into a dataframe
writerDictDF = pd.DataFrame(list(writerDict.items()), columns=['Writer', 'Episodes Written'])
writerDictDF = writerDictDF.sort_values(by = 'Episodes Written', ascending = False)
writerDictDF = writerDictDF.reset_index()
del writerDictDF['index']
writerDictDF.head()

Unnamed: 0,Writer,Episodes Written
0,James L. Brooks,630
1,Matt Groening,630
2,Sam Simon,630
3,David Zuckerman,304
4,Seth MacFarlane,304


This new dataframe includes only 472 observations, one for each unique writer found in the episodesData dataframe. We can now apply the writerGender() function defined above to determine the gender of each writer. Once calculated, two supporting functions separate the returned list of gender and prediction accuracy into individual values, attaching each as a column of the dataframe. 

In [241]:
# call the API once per unique writer; each result is a [gender, accuracy] list
writerDictDF['genderAndAccuracy'] = writerDictDF['Writer'].apply(writerGender)

def getGender(value):
    return value[0]

def getAccuracy(value):
    return value[1]

# split the [gender, accuracy] lists into two separate columns
writerDictDF['gender'] = writerDictDF['genderAndAccuracy'].apply(getGender)
writerDictDF['accuracy'] = writerDictDF['genderAndAccuracy'].apply(getAccuracy)

del writerDictDF['genderAndAccuracy']

In [310]:
writerDictDF.head()

Unnamed: 0,Writer,Episodes Written,gender,accuracy
0,James L. Brooks,630,male,99
1,Matt Groening,630,male,100
2,Sam Simon,630,male,85
3,David Zuckerman,304,male,99
4,Seth MacFarlane,304,male,99


In [244]:
writerDictDF.to_csv("C:\\Users\\Oliver\\Documents\\imbd episodes\\genderOfWriters.csv")

Now that we have obtained the gender of each writer, the next step is to add genders to the episodesData dataframe. A direct merge between the two dataframes would not work, because each episode's writers are recorded as a single string in the episodesData dataframe. Instead, we create a third dataframe, in which each writer of each episode is listed in its own row. This allows a merge with the writerDictDF dataframe. 

In [None]:
writingStaff = []

for index, row in episodesData.iterrows():
    
    # skip episodes whose writers field holds the '\N' placeholder instead of names
    if row['writers'] != r'\N':
        episode = row['tconst']
        episodeWriters = row['writers']
        
        # for each writer included in the writers column of episodesData, create a new row for that writer
        for writer in episodeWriters.split(','):
            newRow = [episode, writer]
            writingStaff.append(newRow)

writingStaffDF = pd.DataFrame(writingStaff, columns = ['tconst', 'Writer'])
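
The same one-row-per-writer reshaping can be sketched without an explicit loop by using `DataFrame.explode`, assuming pandas 0.25 or later. The dataframe `toy`, its episode IDs, and its writer names are hypothetical; only the column names follow the notebook.

```python
import pandas as pd

# hypothetical stand-in for episodesData with just the two columns needed
toy = pd.DataFrame({
    'tconst': ['ep1', 'ep2'],
    'writers': ["Ann Alpha", "Bob Beta, Ann Alpha"],
})

longForm = (
    toy.assign(Writer=toy['writers'].str.split(','))  # list of names per episode
       .explode('Writer')                             # one row per writer per episode
       .drop(columns='writers')
       .reset_index(drop=True)
)
# remove the space left after each comma so names match the writer key exactly
longForm['Writer'] = longForm['Writer'].str.strip()
print(longForm)
```

Stripping whitespace at this point would make a separate clean-up step before the merge unnecessary.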

In [314]:
writingStaffDF2 = writingStaffDF.sort_values(by = 'Writer')
writerDictDF2 = writerDictDF.sort_values(by = 'Writer')

def stripSpace(value):
    return value.strip(' ')
    
writingStaffDF2['Writer'] = writingStaffDF2['Writer'].apply(stripSpace)
writerDictDF2['Writer'] = writerDictDF2['Writer'].apply(stripSpace)

In [334]:
writingStaffDF2.head(10)

Unnamed: 0,tconst,Writer
11370,tt7852294,
1082,tt1795203,Aaron Blitzstein
1058,tt1819898,Aaron Blitzstein
1029,tt1758140,Aaron Blitzstein
1016,tt1610753,Aaron Blitzstein
1066,tt1795201,Aaron Blitzstein
1127,tt1795206,Aaron Blitzstein
1010,tt1610752,Aaron Blitzstein
1103,tt1795207,Aaron Blitzstein
1038,tt1759063,Aaron Blitzstein


In [333]:
writerDictDF2.head(10)

Unnamed: 0,Writer,Episodes Written,gender,accuracy
470,,1,unknown,0
117,Aaron Blitzstein,16,male,99
24,Aaron Korsh,103,male,99
264,Aaron Lee,4,male,99
224,Aaron Shure,6,male,99
326,Abed Gheith,2,male,97
41,Abraham Higginbotham,48,male,99
123,Adam Chase,16,male,99
66,Adam Faberman,29,male,99
447,Adam I. Lapidus,1,male,99


In [340]:
writingStaffWithGender = writingStaffDF2.merge(writerDictDF2, how = 'left')
writingStaffWithGender = writingStaffWithGender.sort_values(by = 'tconst')
writingStaffWithGender = writingStaffWithGender.reset_index()
del writingStaffWithGender['index']

In [None]:
# add a new column of blank strings
writingStaffWithGender['writerGenders'] = ''

numRows = len(writingStaffWithGender)

# iterate over every row of the data frame in order to collect the genders of an episode's full
# writing staff in one string, rather than having them separated over multiple rows
for writer in range(0, numRows):
    
    # create a variable holding the gender of the writer in the current row
    genders = writingStaffWithGender['gender'][writer]
    count = 1
    
    # while subsequent rows contain information about the current row's same episode, add those
    # writers' genders to the 'genders' variable; the bounds check stops at the last row
    while (writer + count < numRows and
           writingStaffWithGender['tconst'][writer] == writingStaffWithGender['tconst'][writer + count]):
        genders = genders + "," + writingStaffWithGender['gender'][writer + count]
        count = count + 1
        
    # once the last row for the given episode is reached, replace the empty string in the
    # 'writerGenders' column with the string held by 'genders' (.loc avoids chained assignment)
    writingStaffWithGender.loc[writer, 'writerGenders'] = genders

In [347]:
writingStaffWithGender.head()

Unnamed: 0,tconst,Writer,Episodes Written,gender,accuracy,writerGenders
0,tt0177842,John Swartzwelder,59,male,99,"male,male,male,male,male,male,male"
1,tt0177842,James L. Brooks,630,male,99,"male,male,male,male,male,male"
2,tt0177842,Steve Tompkins,4,male,99,"male,male,male,male,male"
3,tt0177842,Matt Groening,630,male,100,"male,male,male,male"
4,tt0177842,David X. Cohen,35,male,99,"male,male,male"


We now have a dataframe of episode title IDs and corresponding gender compositions, in comma-separated string form. However, each episode is listed multiple times, once for each writer. Additionally, the full gender composition (the gender of every writer for a given episode) appears only in the first row for each episode. Thus, the final step is to drop duplicate episode IDs, keeping only the first instance, and to delete the columns that are no longer needed.

In [372]:
# drop duplicate episodes in the data frame, always keeping the first instance of each repeated episode
# because the full list of writer genders is only included in the first instance
writingStaffWithGender2 = writingStaffWithGender.drop_duplicates(subset = 'tconst', keep = 'first')

writingStaffWithGender2 = writingStaffWithGender2.drop(['Episodes Written', 'Writer', 'gender', 'accuracy'], axis = 1)

writingStaffWithGender2 = writingStaffWithGender2.reset_index()
del writingStaffWithGender2['index']

writingStaffWithGender2.head()

Unnamed: 0,tconst,writerGenders
0,tt0177842,"male,male,male,male,male,male,male"
1,tt0348034,"male,male,male,female"
2,tt0394893,"male,male,male,male"
3,tt0458217,"male,male,male"
4,tt0515207,"male,male,male"
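
The row-by-row concatenation followed by dropping duplicates could also be collapsed into a single grouped aggregation. The sketch below shows that alternative on a hypothetical dataframe `toy` that has the same `tconst` and `gender` columns as writingStaffWithGender; the values are made up for illustration.

```python
import pandas as pd

# hypothetical long-form data: one row per writer per episode
toy = pd.DataFrame({
    'tconst': ['ep1', 'ep2', 'ep2', 'ep3'],
    'gender': ['male', 'male', 'female', 'unknown'],
})

genderStrings = (
    toy.groupby('tconst')['gender']
       .agg(','.join)                       # join each episode's genders with commas
       .reset_index(name='writerGenders')   # back to a two-column dataframe
)
print(genderStrings)
```

This yields one row per episode directly, so no duplicate rows need to be removed afterwards.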


Finally, we have a proper dataframe containing each TV episode and the genders of each writer for that episode. This is merged with the episodesData dataframe, pulling all necessary data together.

In [388]:
# attach each episode's writer genders to our main dataframe, episodesData
episodesDataWithGenderComp = episodesData.merge(writingStaffWithGender2, how = "inner", on = "tconst")

episodesDataWithGenderComp.head(3)

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,averageRating,numVotes,parentTconst,series,episodeNumberInt,seasonNumberInt,totalNum,writers,writerGenders
0,tt0515236,Pilot,2003,21,8.3,2290,tt0367279,Arrested Development,1,1,1,Mitchell Hurwitz,male
1,tt0515256,Top Banana,2003,22,8.6,1986,tt0367279,Arrested Development,2,1,2,"Abraham Higginbotham, Mitchell Hurwitz, John L...","male,male,male"
2,tt0515212,Bringing Up Buster,2003,22,8.2,1826,tt0367279,Arrested Development,3,1,3,"Richard Rosenstock, Mitchell Hurwitz, Abraham ...","male,male,male"


In [376]:
# Confirming all dataframes are of equal length...
print(len(writingStaffWithGender2))
print(len(episodesData))
print(len(episodesDataWithGenderComp))

2512
2512
2512
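
Comparing lengths confirms that no rows were lost or duplicated. As a complementary check, `merge` itself can verify the key relationship through its `validate` and `indicator` parameters. The sketch below uses two hypothetical dataframes, `episodes` and `genders`, with made-up values; `how='left'` is used here only so that an unmatched episode remains visible.

```python
import pandas as pd

# hypothetical stand-ins for episodesData and writingStaffWithGender2
episodes = pd.DataFrame({'tconst': ['ep1', 'ep2', 'ep3'],
                         'series': ['A', 'A', 'B']})
genders = pd.DataFrame({'tconst': ['ep1', 'ep2'],
                        'writerGenders': ['male', 'male,female']})

# validate raises MergeError if 'tconst' is duplicated on either side;
# indicator adds a '_merge' column recording where each row was found
checked = episodes.merge(genders, how='left', on='tconst',
                         validate='one_to_one', indicator=True)

# episodes that found no match in the gender table
unmatched = checked.loc[checked['_merge'] == 'left_only', 'tconst'].tolist()
print(unmatched)
```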


In [381]:
def calculateGenderComp(value):
    """
    Given a list of individuals' genders in comma-separated string format, 
    return the proportion of males included in the list
    """
    
    listOfGender = value.split(',')
    
    numMaleWriters = 0
    
    # loop variable is named 'gender' so it does not shadow the writerGender() function
    for gender in listOfGender:
        if gender == 'male':
            numMaleWriters = numMaleWriters + 1
            
    return numMaleWriters / len(listOfGender)


episodesDataWithGenderComp['perMaleWrite'] = episodesDataWithGenderComp['writerGenders'].apply(calculateGenderComp)
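
As a worked example of the proportion, the string "male,female,male" contains two males out of three writers, giving 2/3. Writers whose gender the API could not predict are labelled 'unknown' and count toward the denominator in the calculation above. The sketch below reproduces that calculation and adds an optional `excludeUnknown` flag; the flag and the function name `maleProportion` are hypothetical additions for illustration, not part of the original analysis.

```python
def maleProportion(genderString, excludeUnknown=False):
    """Return the share of 'male' entries in a comma-separated gender string."""
    genders = genderString.split(',')
    if excludeUnknown:
        # hypothetical option: ignore writers whose gender could not be predicted
        genders = [g for g in genders if g != 'unknown']
    if not genders:
        # nothing left to count, so the proportion is undefined
        return float('nan')
    return genders.count('male') / len(genders)

print(maleProportion("male,female,male"))
print(maleProportion("male,unknown"))
print(maleProportion("male,unknown", excludeUnknown=True))
```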

In [383]:
episodesDataWithGenderComp

Unnamed: 0,tconst,primaryTitle,startYear,runtimeMinutes,averageRating,numVotes,parentTconst,series,episodeNumberInt,seasonNumberInt,totalNum,writers,writerGenders,perMaleWrite
0,tt0515236,Pilot,2003,21,8.3,2290,tt0367279,Arrested Development,1,1,1,Mitchell Hurwitz,male,1.000000
1,tt0515256,Top Banana,2003,22,8.6,1986,tt0367279,Arrested Development,2,1,2,"Abraham Higginbotham, Mitchell Hurwitz, John L...","male,male,male",1.000000
2,tt0515212,Bringing Up Buster,2003,22,8.2,1826,tt0367279,Arrested Development,3,1,3,"Richard Rosenstock, Mitchell Hurwitz, Abraham ...","male,male,male",1.000000
3,tt0515223,Key Decisions,2003,21,8.5,1739,tt0367279,Arrested Development,4,1,4,"Brad Copeland, Mitchell Hurwitz, Abraham Higgi...","male,male,male",1.000000
4,tt0515214,Charity Drive,2003,21,8.4,1647,tt0367279,Arrested Development,5,1,5,"Barbie Adler, Mitchell Hurwitz, Abraham Higgin...","male,female,male",0.666667
5,tt0515257,Visiting Ours,2003,21,8.2,1600,tt0367279,Arrested Development,6,1,6,"John Levenstein, Richard Rosenstock, Mitchell ...","male,male,male,male",1.000000
6,tt0515221,In God We Trust,2003,21,8.2,1563,tt0367279,Arrested Development,7,1,7,"Abraham Higginbotham, Mitchell Hurwitz","male,male",1.000000
7,tt0515231,My Mother the Car,2003,21,8.2,1549,tt0367279,Arrested Development,8,1,8,"Mitchell Hurwitz, Abraham Higginbotham, Chuck ...","male,male,male",1.000000
8,tt0515247,Storming the Castle,2004,22,8.5,1520,tt0367279,Arrested Development,9,1,9,"Mitchell Hurwitz, Abraham Higginbotham, Brad C...","male,male,male",1.000000
9,tt0515235,Pier Pressure,2004,22,9.2,1956,tt0367279,Arrested Development,10,1,10,"Abraham Higginbotham, Mitchell Hurwitz, James ...","male,male,male",1.000000


In [384]:
episodesDataWithGenderComp.to_csv("C:\\Users\\Oliver\\Documents\\imbd episodes\\episodesDataWithGenderComp-4-24-18.csv")