## Gender-characteristic language used in ACL 2017 titles and abstracts
### This was created using the tool Scattertext, described in the paper below
### Twitter @jasonkessler

### Notes
* You can install the package using `pip3 install scattertext`, and read the documentation/browse the source at https://github.com/JasonKessler/scattertext
* A YouTube video of the 2017 PyData talk discussing this package is available at https://www.youtube.com/watch?v=H7X9CA2pWKoa
* Genders were imputed from names via the AgeFromName Python package, documented at https://github.com/JasonKessler/agefromname

Please cite as: 

Jason S. Kessler. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations. Vancouver, BC. 2017.
https://arxiv.org/abs/1703.00565

In [7]:
import re, time, glob, sys, collections
import pandas as pd
import seaborn as sns
import spacy
import agefromname
import scattertext as st
import ipywidgets as widgets
# display and HTML are imported from IPython.display; IPython.core.display is deprecated for this use
from IPython.display import IFrame, display, HTML
# Paper is a local helper imported from ../scripts/
sys.path.append('../scripts/')
from paper_info import Paper
# Widen the notebook container so the visualization has room
display(HTML("<style>.container { width:98% !important; }</style>"))
%matplotlib inline

## Assemble list of papers and abstracts

In [2]:
def build_data_frame():
    data = []
    for fn in glob.glob('../data/*/*/final/*/*_metadata.txt'):
        paper = Paper(fn)
        d = {'abstract':paper.abstract,
             'title':paper.escaped_title(),
             'n_authors':len(paper.authors)}
        for i, author in enumerate(paper.authors):
            author_place = ('0' if i < 9 else '') + str(i+1)
            d['%s_first_name'%author_place] = author.first
            d['%s_last_name'%author_place] = author.last
            d['%s_email_domain'%author_place] = author.email.split('@')[1]
        d['venue'] = fn.split('/')[2]
        d['meta'] = fn.split('/')[2] + ': ' + str(paper)
        data.append(d)
    return pd.DataFrame(data)
df = build_data_frame()
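
The per-author columns built above use zero-padded positions (`01_first_name`, `02_first_name`, ...) so that they sort in author order. A minimal sketch of that flattening step follows. The `Author` tuple is a hypothetical stand-in for the objects in `paper.authors` (the `Paper` class itself is not shown here), and `partition` is swapped in for `split('@')[1]` as one possible way to avoid an `IndexError` when an address has no `@`.

```python
from collections import namedtuple

# Hypothetical stand-in for the author objects exposed by Paper.authors.
Author = namedtuple('Author', ['first', 'last', 'email'])

def flatten_authors(authors):
    """Flatten a list of authors into zero-padded, per-position columns."""
    row = {'n_authors': len(authors)}
    for i, author in enumerate(authors):
        # Equivalent to ('0' if i < 9 else '') + str(i + 1)
        place = '%02d' % (i + 1)
        row[place + '_first_name'] = author.first
        row[place + '_last_name'] = author.last
        # partition returns '' for the domain when there is no '@'
        row[place + '_email_domain'] = author.email.partition('@')[2]
    return row

row = flatten_authors([Author('Ada', 'Lovelace', 'ada@example.org'),
                       Author('Alan', 'Turing', 'no-address')])
print(sorted(row))
```

Whether a missing domain should become an empty string or the row should be skipped depends on how the metadata files are formatted, which this sketch does not assume.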

In [3]:
# spaCy 1.x API; spaCy 2 and later load a model instead, e.g. spacy.load('en_core_web_sm')
nlp = spacy.en.English()

In [4]:
df['parse'] = (df['title'] + '.\n\n' + df['abstract']).apply(nlp)

## Record gender of 1st author if P(gender|name) > 0.9

In [8]:
male_prob = agefromname.AgeFromName().get_all_name_male_prob()

In [21]:
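# Lowercase the first author's first name so it can be matched against the name table
df['1st_auth_lower'] = df['01_first_name'].apply(str.lower)
# Inner join: papers whose first-author name is absent from the name table are dropped
df_aug = pd.merge(df, male_prob, left_on='1st_auth_lower', right_index=True)
# 'prob' is the probability that the name is male; names between 0.1 and 0.9 stay unlabeled ('?')
df_aug['gender'] = df_aug['prob'].apply(lambda x: 'm' if x > 0.9 else 'f' if x < 0.1 else '?')
df_mf = df_aug[df_aug['gender'].isin(['m', 'f'])]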

In [22]:
df_mf.gender.value_counts()

m    337
f    122
Name: gender, dtype: int64

In [27]:
print('Portion of papers labeled w/ gender of first author', len(df_mf)*1./len(df))

Portion of papers labeled w/ gender of first author 0.5730337078651685


In [33]:
print('Top first names that could not be assigned a gender (0.1 <= P(male|name) <= 0.9)')
df_aug[df_aug['gender'] == '?']['01_first_name'].value_counts().iloc[:10]

Top first names that could not be assigned a gender (0.1 <= P(male|name) <= 0.9)


Jan       4
Yu        2
Chris     2
Ye        2
Xing      2
Nikola    2
Chen      2
Yang      2
Alexis    1
Le        1
Name: 01_first_name, dtype: int64
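
The labeling rule used in this section can be sketched on toy data to show why coverage falls well short of 100%. A paper goes unlabeled either because its first author's name is missing from the name table or because the name's male probability falls between the two thresholds. The probabilities and names below are invented for illustration and are not taken from AgeFromName, and `label_gender` is a hypothetical helper rather than part of any package.

```python
# Toy stand-in for the name table: P(male | first name). Values are invented.
toy_male_prob = {'james': 0.99, 'mary': 0.01, 'jan': 0.55}

def label_gender(first_name, male_prob, hi=0.9, lo=0.1):
    """Return 'm', 'f', '?' or None when the name is missing from the table."""
    prob = male_prob.get(first_name.lower())
    if prob is None:
        return None  # corresponds to a row dropped by the inner merge
    if prob > hi:
        return 'm'
    if prob < lo:
        return 'f'
    return '?'       # ambiguous name, excluded from the m/f comparison

first_authors = ['James', 'Mary', 'Jan', 'Xiaoming']
labels = [label_gender(name, toy_male_prob) for name in first_authors]
# Coverage is measured against all papers, not only those found in the table
coverage = sum(label in ('m', 'f') for label in labels) / len(labels)
print(labels, coverage)
```

In this toy case two of the four first authors receive a label, one is ambiguous and one is missing from the table.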

In [28]:
gender_corpus = st.CorpusFromParsedDocuments(df_mf, category_col='gender', parsed_col='parse').build()

In [29]:
html = st.produce_scattertext_explorer(gender_corpus,
                                       category='f',
                                       category_name='Women',
                                       not_category_name='Men',
                                       width_in_pixels=1000,
                                       minimum_term_frequency=4,
                                       pmi_threshold_coefficient=6,
                                       term_ranker=st.OncePerDocFrequencyRanker,
                                       metadata=df_mf['meta'])
file_name = 'authors_gender.html'
# Write the explorer as a standalone UTF-8 HTML file
with open(file_name, 'wb') as out:
    out.write(html.encode('utf-8'))
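
The explorer call passes `term_ranker=st.OncePerDocFrequencyRanker`. Assuming, as the name suggests, that this ranker counts a term at most once per document, the counts behind the plot are document frequencies (here, numbers of papers) rather than raw token counts. A plain-Python sketch of that counting scheme on invented toy documents follows. It illustrates the idea only and is not Scattertext's implementation.

```python
from collections import Counter

# Toy tokenized documents with category labels; invented for illustration.
docs = [('f', ['parsing', 'parsing', 'dialogue']),
        ('f', ['dialogue', 'corpus']),
        ('m', ['parsing', 'neural', 'neural', 'neural'])]

def doc_frequencies(docs):
    """Count, per category, the number of documents containing each term."""
    counts = {}
    for category, tokens in docs:
        # set() collapses repeats so a term counts at most once per document
        counts.setdefault(category, Counter()).update(set(tokens))
    return counts

doc_freq = doc_frequencies(docs)
print(doc_freq['f']['parsing'], doc_freq['m']['neural'])
```

Under this scheme a paper that repeats a term many times contributes no more than a paper that uses it once, so a single repetitive abstract cannot dominate a term's position.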

In [30]:
%%HTML
<iframe width=100% height=800 name="iframe" src="authors_gender.html"></iframe>