<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Part-5:-Analyse-the-results-of-the-LDA-model" data-toc-modified-id="Part-5:-Analyse-the-results-of-the-LDA-model-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Part 5: Analyse the results of the LDA model</a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Load-the-processed-speeches-dataframe" data-toc-modified-id="Load-the-processed-speeches-dataframe-1.0.1"><span class="toc-item-num">1.0.1&nbsp;&nbsp;</span>Load the processed speeches dataframe</a></span></li></ul></li><li><span><a href="#Let's-see-which-topics-have-the-most-words-associated-with-them" data-toc-modified-id="Let's-see-which-topics-have-the-most-words-associated-with-them-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Let's see which topics have the most words associated with them</a></span><ul class="toc-item"><li><span><a href="#Select-only-the-topics-that-are-interesting-to-us" data-toc-modified-id="Select-only-the-topics-that-are-interesting-to-us-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Select only the topics that are interesting to us</a></span></li><li><span><a href="#Sum-up-the-total-number-of-words-an-MP-has-said-about-a-topic" data-toc-modified-id="Sum-up-the-total-number-of-words-an-MP-has-said-about-a-topic-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span>Sum up the total number of words an MP has said about a topic</a></span></li><li><span><a href="#Now-normalise-words-for-each-MP-and-topic-by-the-total-number-of-words-the-MP-has-spoken-(ignoring-stop-words,-etc)" data-toc-modified-id="Now-normalise-words-for-each-MP-and-topic-by-the-total-number-of-words-the-MP-has-spoken-(ignoring-stop-words,-etc)-1.1.3"><span class="toc-item-num">1.1.3&nbsp;&nbsp;</span>Now normalise words for each MP and topic by the total number of words the MP has spoken (ignoring stop words, etc)</a></span></li></ul></li></ul></li><li><span><a 
href="#How-are-topics-segregated-by-gender?" data-toc-modified-id="How-are-topics-segregated-by-gender?-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>How are topics segregated by gender?</a></span><ul class="toc-item"><li><span><a href="#All-parties" data-toc-modified-id="All-parties-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>All parties</a></span></li><li><span><a href="#Republicans" data-toc-modified-id="Republicans-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Republicans</a></span></li><li><span><a href="#Democrats" data-toc-modified-id="Democrats-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Democrats</a></span></li></ul></li><li><span><a href="#Total-words-spoken-by-each-gender-over-time" data-toc-modified-id="Total-words-spoken-by-each-gender-over-time-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Total words spoken by each gender over time</a></span></li></ul></div>

# Analyse all House of Representatives speeches since 1994

[Part 1: Get a list of MPs and their affiliations](MP_speeches-Part1.ipynb)

[Part 2: Download all speeches belonging to MPs in list](MP_speeches-Part2.ipynb)

[Part 3: Train bigram and trigram models and use them on all speeches](MP_speeches-Part3.ipynb)

[Part 4: Train an LDA topic model and process all speeches with it](MP_speeches-Part4.ipynb)

## Part 5: Analyse the results of the LDA model

In this notebook, we will use the LDA model we trained in Part 4 to understand broad trends in parliamentary speeches.

In [1]:
import pandas as pd
import numpy as np
import bcolz

# Read in details of MPs
members = pd.read_hdf("list_of_members.h5", "members")
members["full_name"] = members["first_name"] + " " + members["last_name"]

In [2]:
# Total number of women who served
members.query("gender=='F'")[["bioguide_id", "full_name", "party", "gender", "type"]].drop_duplicates().groupby("type").count()

type,bioguide_id,full_name,party,gender
rep,288,288,288,288
sen,51,51,51,51


In [3]:
# Total number of women who are currently serving
members.query("gender=='F' & term_end > '2018-04-05'")[["bioguide_id", "full_name", "party", "gender", "type"]].drop_duplicates().groupby("type").count()

type,bioguide_id,full_name,party,gender
rep,89,89,89,89
sen,22,22,22,22


In [4]:
# Reformat it and get rid of useless columns
members = members.sort_values("term_start").groupby("bioguide_id").last()[["full_name", "party", "gender"]]

Let's start by importing the names of topics in the LDA model.

In [128]:
# %load topic_names.py
# Dictionary of topic names
topic_names_100 = {
    0: "congress terminology",
    1: "congress terminology+",
    2: "healthcare",
    3: "",
    4: "random names (?)",
    5: "science",
    6: "social security",
    7: "africa?",
    8: "national security",
    9: "midwest (?)",
    10: "government finances",
    11: "OSHA",
    12: "honors",
    13: "congress terminology++",
    14: "budget",
    15: "consumer protection",
    16: "committees",
    17: "sports",
    18: "arts & culture",
    19: "military equipment (?)",
    20: "coast guard & fishing industry",
    21: "",
    22: "judaism",
    23: "fire fighting",
    24: "",
    25: "NASA",
    26: "homeland security",
    27: "african-american history",
    28: "affordable housing",
    29: "war history (?)",
    30: "",
    31: "",
    32: "",
    33: "china",
    34: "census",
    35: "veteran affairs",
    36: "agriculture",
    37: "FEMA",
    38: "unemployment",
    39: "",
    40: "civil rights",
    41: "transport",
    42: "",
    43: "army corps",
    44: "awards",
    45: "mexico & turkey (?)",
    46: "",
    47: "senior citizens healthcare",
    48: "abortion",
    49: "legislation",
    50: "constitution",
    51: "california",
    52: "texas",
    53: "foreign policy",
    54: "defense",
    55: "medicine + campaign finance reform",
    56: "",
    57: "native americans",
    58: "mental health",
    59: "land management & forestry",
    60: "drug enforcement",
    61: "birth control & women's rights",
    62: "iraq & afghanistan",
    63: "FCC",
    64: "EPA",
    65: "medicare",
    66: "welfare reform",
    67: "",
    68: "",
    69: "employee-employer relations",
    70: "",
    71: "school education",
    72: "",
    73: "trade policy",
    74: ""
}

topic_names_50 = {
    1: "israel & palestine",
    2: "labor relations",
    3: "transportation",
    4: "arts & culture",
    5: "tobacco",
    6: "aviation",
    7: "FDA",
    8: "medicare",
    9: "fiscal policy",
    10: "budget",
    12: "secondary education",
    13: "wildlife conservation",
    14: "veteran affairs",
    15: "nasa & baseball ??",
    16: "health insurance",
    17: "agriculture",
    18: "elections",
    19: "disease prevention",
    20: "medicare+",
    21: "first names (male)",
    22: "young people",
    23: "iraq & afghanistan",
    25: "russia",
    26: "housing",
    27: "homeland security",
    28: "welfare reform",
    29: "honors",
    30: "oil & gas",
    31: "constitutional law",
    32: "EPA",
    33: "indian affairs",
    34: "FEMA",
    35: "energy",
    36: "honors+",
    37: "affordable care act (?)",
    38: "border enforcement",
    39: "congressional terminology",
    40: "MURICA",
    41: "china & india",
    42: "business & corporate responsibility (?)",
    43: "congressional terminology+",
    44: "congressional terminology++",
    45: "veterans",
    46: "congressional terminology+++",
    47: "tax",
    48: "african-american civil rights",
    49: "minimum wages"
}

topic_names_75 = {
    0: "healthcare",
    1: "constitution",
    2: "african-american history",
    3: "IRS & trafficking",
    5: "infrastructure & construction",
    7: "african-americans",
    9: "military",
    10: "secondary education",
    11: "border enforcement",
    12: "agriculture",
    13: "cybersecurity",
    14: "budget",
    15: "entrepreneurship",
    16: "transportation",
    17: "worker welfare",
    18: "medical research",
    20: "india & pakistan",
    21: "congressional terminology",
    23: "first names",
    26: "armenian genocide",
    27: "drugs war",
    28: "puerto rico",
    29: "sports",
    30: "community service",
    31: "???",
    32: "tobacco",
    34: "iraq & afghanistan wars",
    35: "affordable housing",
    37: "honors",
    38: "congressional terminology+",
    39: "human rights in china",
    40: "natural disasters",
    42: "financial sector",
    43: "congressional terminology++",
    44: "congressional terminology+++",
    45: "energy",
    48: "veteran affairs",
    49: "child welfare",
    50: "gun safety & abortion",
    51: "honors+",
    55: "veterans",
    56: "unions",
    57: "health insurance & medicare",
    59: "honors++",
    60: "forestry",
    61: "arts",
    62: "nuclear waste",
    63: "trade",
    68: "law",
    70: "terrorism & foreign policy",
    71: "defense",
    72: "voting rights & democracy",
    74: "homeland security"
}

topic_names = topic_names_75
def topic_dict(topic_number):
    """
    Return the topic's name if it has been labelled, otherwise the topic number
    """
    
    try:
        return topic_names[topic_number]
    except KeyError:
        return topic_number
    
# Reverse the topic names so that we can find them easily
reverse_topic_dict = {i[1]:i[0] for i in topic_names.items()}

#### Load the processed speeches dataframe

In [73]:
# Import speeches data (excludes speech texts)
speeches = pd.read_hdf("processed_speeches_75.h5", "speeches")

# Import pointer to speech texts
# Open read-only: we only look speeches up here, and mode="w" risks emptying the stored data
zspeeches = bcolz.open("speeches.bcolz", mode="r")

Here's what the speeches dataframe looks like.

Each column with a number refers to a particular topic, and the cell contains the probability that the speech is about that particular topic.
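
To make the weighting used throughout this notebook concrete, here is a minimal sketch on made-up data. `toy_speeches` and its values are hypothetical stand-ins for the real dataframe; the idea is that multiplying a topic probability by `n_words` gives a rough estimate of how many words of a speech belong to that topic.

```python
import pandas as pd

# Hypothetical stand-in for the speeches dataframe: two topic columns
# (0 and 1) holding topic probabilities, plus each speech's word count.
toy_speeches = pd.DataFrame({
    0: [0.5, 0.0, 0.25],
    1: [0.0, 1.0, 0.25],
    "n_words": [100, 40, 200],
})

# Weight each probability by the speech length (row-wise multiplication).
topic_words = toy_speeches[[0, 1]].mul(toy_speeches["n_words"], axis=0)

# Summing over speeches gives an estimated word total per topic.
words_per_topic = topic_words.sum()
print(words_per_topic)
```

Probabilities in a row do not have to add up to 1 here, so the per-topic estimates need not add up to the total word count.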

In [74]:
speeches.head()

Unnamed: 0,date,doc_title,id,speaker,speaker_bioguide,0,1,2,3,4,...,66,67,68,69,70,71,72,73,74,n_words
0,2001-03-30,MARRIAGE PENALTY AND FAMILY TAX RELIEF ACT OF ...,CREC-2001-03-30-pt1-PgE503,Mr. LANGEVIN,L000559,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,165
1,2001-03-30,MARRIAGE PENALTY AND FAMILY TAX RELIEF ACT OF ...,CREC-2001-03-30-pt1-PgE503-2,Mr. BOEHLERT,B000586,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,131
2,2001-03-30,RECOGNIZING EVAN DOBELLE'S CONTRIBUTIONS TO TH...,CREC-2001-03-30-pt1-PgE504,Mr. LARSON of Connecticut,L000557,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,253
3,2001-03-30,MAGGIE LENA WALKER,CREC-2001-03-30-pt1-PgE505,Mr. SCOTT,S000185,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.057343,0.0,0.0,311
4,2001-03-30,"CONCURRENT RESOLUTION ON THE BUDGET, FISCAL YE...",CREC-2001-03-30-pt1-PgE504-2,Mr. LANGEVIN,L000559,0.0,0.0,0.0,0.0,0.038147,...,0.0,0.0,0.0,0.0,0.0,0.036316,0.0,0.0,0.0,383


In [75]:
# Some helper functions
def topics_by_mp(mp_id):
    """Count the MP's speeches in which each labelled topic has probability above 0.4"""
    mp_speeches = speeches[(speeches["n_words"] > 5) &
                       (speeches["speaker_bioguide"] == mp_id)]
    mp_speeches = pd.melt(mp_speeches, value_vars=list(topic_names.keys())).query('value > 0.4').groupby("variable").size()
    return mp_speeches
    

def topics_by_mp_words(mp_id):
    """Return the estimated number of words the MP has spoken on each labelled topic"""
    topic_ids = list(topic_names.keys())
    # Copy so that the assignment below does not write into a slice of `speeches`
    mp_speeches = speeches[(speeches["speaker_bioguide"] == mp_id)].copy()
    # Weight each topic probability by the length of the speech
    mp_speeches[topic_ids] = mp_speeches[topic_ids].mul(mp_speeches["n_words"], axis=0)
    mp_speeches = mp_speeches[topic_ids].sum().sort_values(ascending=False).reset_index()
    mp_speeches["topic_name"] = mp_speeches["index"].apply(topic_dict)
    mp_speeches = mp_speeches.set_index("index").rename(columns={0:"num_words"})
    return mp_speeches

### Let's see which topics have the most words associated with them

In [129]:
import cufflinks
import plotly

cufflinks.set_config_file(theme="ggplot")
# Plot most popular topics discussed
# list(topic_names_100.keys())
a = speeches[list(range(75))].mul(speeches["n_words"], axis=0).sum().sort_values(ascending=False).reset_index()
a["index"] = a["index"].apply(lambda x: "-" + str(topic_dict(x)) + "-")
a.set_index("index")[0].iplot(kind="bar",
                              xTitle="<b>Topic</b>",
                              yTitle="<b>Number of words</b>",
                              title="Total number of words assigned to each topic",
                              margin=(100,50,150,50))

In [155]:
topic_id = reverse_topic_dict["budget"]
a = speeches[speeches[topic_id] > 0.6][["n_words", "speaker", "doc_title", topic_id]].sort_values(topic_id, ascending=False)
a["n_topic_words"] = a["n_words"] * a[topic_id]
# Pick one random speech with between 50 and 100 words attributed to the topic
# (sorting before sample() has no effect on which row is drawn)
speech_index = a.query("n_topic_words > 50 & n_topic_words < 100").sample(1).index[0]
print(speeches.iloc[speech_index][["date", "doc_title", "speaker"]], "\n\n", zspeeches[speech_index])

date                                       1996-03-28 00:00:00
doc_title    CONFERENCE REPORT ON H.R. 2854, FEDERAL AGRICU...
speaker                              Mr. TAYLOR of Mississippi
Name: 246268, dtype: object 

   Mr. TAYLOR of Mississippi. Mr. Speaker, gentlemen and ladies, last 
year, during the welfare debate, I heard speaker after speaker come to 
this floor and say that we had to end the practice of paying people to 
do nothing, that we should no longer pay people not to work.


  Something remarkable happened that day. Every single Member of this 
body voted to no longer pay people for not working. Many of us 
supported the coalition plan, the rest of the folks supported the 
Republican plan, but everyone supported at least one plan that would 
stop paying people for doing nothing. And it was remarkable, and it was 
a good thing.
  Unfortunately, in this bill there is a plan to pay people up to 
$80,000 a year per individual for 7 years to do nothing. You do not 
have to 

#### Select only the topics that are interesting to us

In [78]:
# List of specific topics to index for graphing
topics_to_graph = list(topic_names.values())

#### Sum up the total number of words an MP has said about a topic

In [79]:
def get_mp_topic_fraction(df):
    # Melts the speech data frame from wide to long format, with one row per (speech, topic) probability,
    # then sums the estimated topic words (n_words * probability) by topic
    return df.melt(id_vars=["speaker", "speaker_bioguide", "date", "n_words"],
          value_vars=list(range(75)), var_name="topic_id")\
    .assign(n_topic_words = lambda x: x.n_words*x.value)\
    .groupby(["topic_id"])\
    ["n_topic_words"].sum()

mp_topics = speeches.groupby("speaker_bioguide").apply(get_mp_topic_fraction)
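
The wide-to-long reshaping can be hard to picture, so here is a small sketch on made-up data. It is an alternative formulation, not the cell above: it melts once and groups by both keys instead of calling `apply` per MP, which should give the same table. `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical speeches: two MPs, two topic columns (0 and 1).
toy = pd.DataFrame({
    "speaker_bioguide": ["A1", "A1", "B2"],
    "n_words": [100, 50, 200],
    0: [0.5, 0.0, 0.25],
    1: [0.5, 1.0, 0.0],
})

# Wide to long: one row per (speech, topic) pair.
long = toy.melt(id_vars=["speaker_bioguide", "n_words"],
                value_vars=[0, 1], var_name="topic_id")

# Estimated words on the topic in that speech.
long["n_topic_words"] = long["n_words"] * long["value"]

# Sum per MP and topic, then pivot topics back into columns.
toy_mp_topics = (long.groupby(["speaker_bioguide", "topic_id"])["n_topic_words"]
                     .sum()
                     .unstack("topic_id"))
print(toy_mp_topics)
```

Each row of `toy_mp_topics` is one MP and each column one topic, which is the same layout as `mp_topics`.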

In [80]:
mp_topics#.head()

speaker_bioguide,0,1,2,3,4,5,6,7,8,9,...,65,66,67,68,69,70,71,72,73,74
A000014,1348.919434,3516.661621,1574.261597,960.937378,1431.689575,770.114197,658.549194,371.025208,5241.501465,4200.224121,...,821.387756,56.135895,2035.946289,1119.199829,1800.544189,7739.689941,5272.338867,2520.580811,27.163864,834.686523
A000022,549.084473,2261.854004,1489.754150,272.324677,2325.843750,512.709534,191.480896,1331.155273,184.813965,822.239502,...,1480.219727,137.944336,805.677734,583.712646,603.167603,14340.167969,764.299133,1462.430542,49.419128,1172.055908
A000055,473.051117,2013.320801,648.001587,240.729568,299.441376,7634.675293,85.422455,1161.929077,300.401733,159.330505,...,2174.373047,1515.252441,142.782288,622.378418,75.604599,1083.712524,4425.603027,600.046814,52.791573,2035.535767
A000069,54.254456,17.676941,0.000000,12.112305,0.000000,200.768921,23.621399,4.215576,58.201477,12.624329,...,0.000000,0.000000,62.314148,23.395264,112.716583,0.000000,3.771667,11.935303,0.000000,420.172638
A000109,268.632751,1254.077271,95.525467,327.239685,461.132416,371.682953,319.462219,39.296143,43.559891,4.292664,...,1253.828857,11.231583,329.925354,163.146988,102.852066,201.411575,155.494446,206.628189,5.327393,14.449654
A000209,203.453522,218.107361,636.802612,72.133118,6.336823,105.006683,34.176025,17.949081,8.865509,0.000000,...,408.051941,0.000000,146.228943,87.068542,182.525024,198.935562,42.745476,12.169846,0.000000,0.000000
A000210,1997.141235,5145.391602,837.425476,654.209656,1477.536621,1050.253784,584.353699,1874.694946,383.456543,1671.144897,...,1340.070679,164.425873,1017.445618,2399.590332,2047.261963,10635.322266,2567.461182,2045.397461,36.405762,1308.222046
A000211,3.638916,75.020691,38.297852,10.994629,0.000000,21.591644,51.110840,0.000000,20.898438,9.308350,...,93.535278,0.000000,15.444580,0.000000,8.233032,75.140625,441.670898,59.048004,0.000000,0.000000
A000214,6.166443,26.720123,15.448792,16.447266,31.677429,176.128662,0.000000,144.582031,8.427246,0.000000,...,114.426514,6.094849,2.192871,414.131836,10.377441,0.000000,10.651520,64.468018,1.281006,0.000000
A000215,729.942932,803.303833,205.957642,581.281433,1472.128662,323.189026,108.361801,0.000000,86.480713,156.961853,...,296.735291,36.000641,108.402008,178.953857,198.646622,290.439880,198.783386,403.430023,2.393890,277.702087


#### Now normalise words for each MP and topic by the total number of words the MP has spoken (ignoring stop words, etc)

In [132]:
# This gives us the fraction of time MP spent on a particular subject
normalised_mp_topics = mp_topics.div(mp_topics.sum(axis=1), axis=0).rename(columns=topic_names)

normalised_mp_topics.head()

speaker_bioguide,healthcare,constitution,african-american history,IRS & trafficking,4,infrastructure & construction,6,african-americans,8,military,...,65,66,67,law,69,terrorism & foreign policy,defense,voting rights & democracy,73,homeland security
A000014,0.009657,0.025177,0.011271,0.00688,0.01025,0.005514,0.004715,0.002656,0.037526,0.030071,...,0.005881,0.000402,0.014576,0.008013,0.012891,0.055411,0.037747,0.018046,0.000194,0.005976
A000022,0.004668,0.019228,0.012664,0.002315,0.019772,0.004358,0.001628,0.011316,0.001571,0.00699,...,0.012583,0.001173,0.006849,0.004962,0.005127,0.121904,0.006497,0.012432,0.00042,0.009964
A000055,0.006493,0.027634,0.008894,0.003304,0.00411,0.104792,0.001172,0.015948,0.004123,0.002187,...,0.029845,0.020798,0.00196,0.008543,0.001038,0.014875,0.060745,0.008236,0.000725,0.027939
A000069,0.016897,0.005505,0.0,0.003772,0.0,0.062526,0.007356,0.001313,0.018126,0.003932,...,0.0,0.0,0.019407,0.007286,0.035104,0.0,0.001175,0.003717,0.0,0.130855
A000109,0.009231,0.043096,0.003283,0.011245,0.015847,0.012773,0.010978,0.00135,0.001497,0.000148,...,0.043087,0.000386,0.011338,0.005606,0.003534,0.006921,0.005344,0.007101,0.000183,0.000497
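
As a quick check of what the normalisation does, here is a sketch on made-up numbers (`toy_mp_topics` is hypothetical). Dividing with `axis=0` aligns the row totals with the MP index, so each row ends up summing to 1.

```python
import pandas as pd

# Hypothetical per-MP topic word totals (rows: MPs, columns: topic ids).
toy_mp_topics = pd.DataFrame(
    {0: [50.0, 50.0], 1: [100.0, 0.0], 2: [50.0, 150.0]},
    index=pd.Index(["A1", "B2"], name="speaker_bioguide"),
)

# Total estimated words per MP across all topics.
row_totals = toy_mp_topics.sum(axis=1)

# Divide every row by its own total to get topic fractions.
toy_normalised = toy_mp_topics.div(row_totals, axis=0)
print(toy_normalised)
```

An MP whose total is zero would get `NaN` in every column (0 divided by 0), which is presumably why the later cells call `fillna(0)` before using these fractions.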


In [149]:
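# These are the labelled topics that are not yet part of any combination
# (`combinations` is defined in the next cell, so that cell has to be run first)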
sorted(list(filter(lambda x: x not in sum(combinations.values(), []), topic_names.values())))

['???',
 'IRS & trafficking',
 'arts',
 'community service',
 'constitution',
 'entrepreneurship',
 'financial sector',
 'first names',
 'gun safety & abortion',
 'puerto rico',
 'sports',
 'tobacco']

In [156]:
# Let's combine some topics together
combinations = {
    "drugs war": ["drugs war"],
    "border enforcement": ["border enforcement"],
    "government budget": ["budget"],
    "congressional terminology": ["congressional terminology", "congressional terminology+",
                                  "congressional terminology++", "congressional terminology+++"],
    "affordable housing": ["affordable housing"],
    "african-americans": ["african-american history", "african-americans"],
    "honors": ["honors", "honors+", "honors++"],
    "healthcare": ["health insurance & medicare", "healthcare", "medical research"],
    "veteran affairs": ["veteran affairs", "veterans"],
    "iraq & afghanistan wars": ["iraq & afghanistan wars"],
    "agriculture": ["agriculture"],
    "child welfare": ["child welfare"],
    "military": ["military", "defense"],
    "homeland security": ["homeland security", "cybersecurity"],
    "human rights in china": ["human rights in china"],
    "foreign policy": ["india & pakistan",  "terrorism & foreign policy", "armenian genocide"],
    "energy": ["energy"],
    "environment": ["nuclear waste", "forestry"],
    "natural disasters": ["natural disasters"],
    "secondary education": ["secondary education"],
    "trade": ["trade"],
    "worker's rights": ["worker welfare", "unions"],
    "democracy": ["voting rights & democracy", "law"],
    "infrastructure & transportation": ["infrastructure & construction", "transportation"],
    "arts": ["arts"],
    "sports": ["sports"]
}

"""
combinations = {
    "congressional terminology": ["congressional terminology", "congressional terminology+",
                                  "congressional terminology++", "congressional terminology+++"],
    "middle east": ["israel & palestine", "iraq & afghanistan"],
    "medicare & healthcare": ["medicare", "medicare+", "health insurance", "affordable care act (?)", "disease prevention"],
    "veteran affairs": ["veteran affairs", "veterans"],
    "honors": ["honors", "honors+"],
    "environment": ["EPA", "wildlife conservation"],
    "energy": ["energy", "oil & gas"],
    "budget": ["budget", "fiscal policy", "tax"],
    "labor rights": ["minimum wages", "labor relations"]
}
"""


for combination in combinations:
    normalised_mp_topics[combination] = normalised_mp_topics[combinations[combination]].sum(axis=1)
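
One thing to watch with the loop above: when a combined name is also the name of one of its parts (for example "veteran affairs"), the column is overwritten in place, so running the cell a second time would add the other parts again. A sketch of one way around this, on made-up data, is to build the combined columns in a separate dataframe. `toy_fractions` and `toy_combinations` are hypothetical.

```python
import pandas as pd

# Hypothetical topic fractions for two MPs.
toy_fractions = pd.DataFrame(
    {"veteran affairs": [0.10, 0.02],
     "veterans": [0.05, 0.03],
     "energy": [0.20, 0.10]},
    index=["A1", "B2"],
)

# Same shape as the `combinations` dictionary: new name -> list of parts.
toy_combinations = {
    "veteran affairs": ["veteran affairs", "veterans"],
    "energy": ["energy"],
}

# Build a fresh dataframe rather than overwriting columns of the source,
# so the result is the same however many times this runs.
combined = pd.DataFrame({
    name: toy_fractions[parts].sum(axis=1)
    for name, parts in toy_combinations.items()
})
print(combined)
```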

In [157]:
# Select just a few topics
topics_to_save = ["congressional terminology", "middle east", "medicare & healthcare", "veteran affairs",
                  "honors", "environment", "energy", "budget", "labor rights", "african-american civil rights",
                  "border enforcement", "MURICA", "arts & culture", "agriculture", "FDA", "FEMA", "secondary education", "housing", "young people"]

In [158]:
topics_to_save = list(combinations.keys())

In [159]:
a = normalised_mp_topics\
    .reset_index()\
    .fillna(0)\
    .join(members, on="speaker_bioguide")

a = a[topics_to_save + ["gender"]].groupby("gender")\
    .median()\
    .T.rename(columns={"M":"male", "F":"female"})
    
a.index.names = ["topic"]
a.columns.name = ""

# Use log scale to compress numbers
a = np.log10(a)

# Save to csv for viz
a.to_csv("topic_medians.csv", float_format="%.2f")

In [160]:
a = normalised_mp_topics\
    .reset_index()\
    .fillna(0)\
    .join(members, on="speaker_bioguide")

# Sort topics by gender polarisation
# Only use topics that have names assigned
topic_sorter = list(a[topics_to_save + ["gender"]]\
                    .groupby("gender")\
                    .median()\
                    .diff()\
                    .iloc[1]\
                    .sort_values(ascending=False)\
                    .index)
# Convert values to log10 to store higher precision with less data
a[topics_to_save] = np.log10(a[topics_to_save])

a = a.rename(columns={"speaker_bioguide":"id"})
a.gender = (a["gender"] == "F").astype(int)
a = a[["id", "full_name", "party", "gender"] + topic_sorter]\
    .sort_values(["party", "gender"])

a.to_csv("mp_topic_fraction.csv", index=False, float_format="%.2f")


divide by zero encountered in log10
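
The warning above comes from MPs whose fraction for some topic is exactly 0, for which `log10` returns `-inf`. If that is not wanted in the CSV, one option is to mask the zeros first. This is only a sketch on made-up numbers, and whether missing values suit the visualisation better than `-inf` is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical topic fractions, including an MP who never touched the topic.
fractions = pd.Series([0.1, 0.0, 0.001], index=["A1", "B2", "C3"])

# Replace non-positive values with NaN, then take the log.
# log10(NaN) is NaN, so no divide-by-zero warning is raised.
safe_log = np.log10(fractions.where(fractions > 0))
print(safe_log)
```

`to_csv` writes `NaN` as an empty field by default, so these MPs would show up as missing rather than as `-inf`.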



In [161]:
from IPython.display import IFrame
IFrame(src="violin_plot.html", width=1300, height=1000)

## How are topics segregated by gender?

In [162]:
clustered_mps = normalised_mp_topics.join(members, how="inner")

### All parties

In [163]:
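# numeric_only=True skips the text columns (full_name, party) joined in from `members`
clustered_mps\
    .groupby("gender")\
    .median(numeric_only=True)\
    .diff()\
    .rename(columns=lambda x: topic_dict(x))[topic_sorter]\
    .T\
    .iplot(kind="bar")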
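
To read these charts it helps to know which way the difference goes. `groupby("gender")` sorts its keys, so the rows come out as F then M, and `.diff()` then subtracts the previous row from each row. The second row is therefore the male median minus the female median. A small sketch on made-up numbers (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical topic fractions for five MPs.
toy = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "M"],
    "healthcare": [0.10, 0.20, 0.05, 0.07, 0.09],
    "defense": [0.02, 0.04, 0.10, 0.06, 0.08],
})

# Median fraction per gender; rows are ordered F, M.
medians = toy.groupby("gender").median()

# Row 0 of diff() is NaN, row 1 is (M median) - (F median).
gap = medians.diff().iloc[1]

# Most male-leaning topic first.
ordering = list(gap.sort_values(ascending=False).index)
print(gap, ordering)
```

Positive values therefore mean the median man spends a larger share of his words on the topic, and negative values the median woman.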

### Republicans

In [91]:
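# numeric_only=True skips the text columns (full_name, party) joined in from `members`
clustered_mps\
    .query("party=='Republican'")\
    .groupby("gender")\
    .median(numeric_only=True)\
    .diff()\
    .rename(columns=lambda x: topic_dict(x))[topic_sorter]\
    .T\
    .iplot(kind="bar")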

### Democrats

In [92]:
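# numeric_only=True skips the text columns (full_name, party) joined in from `members`
clustered_mps\
    .query("party=='Democrat'")\
    .groupby("gender")\
    .median(numeric_only=True)\
    .diff()\
    .rename(columns=lambda x: topic_dict(x))[topic_sorter]\
    .T\
    .iplot(kind="bar")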

## Total words spoken by each gender over time

In [29]:
# Select n_words before summing so the text and topic columns are not aggregated needlessly
total_words = speeches.join(members[["gender"]], on="speaker_bioguide").groupby(["date", "gender"])[["n_words"]].sum().reset_index().set_index("date")
total_words = total_words.pivot(columns="gender").resample("A").sum().fillna(0)
total_words.div(total_words.sum(axis=1), axis=0).iplot(kind="bar", barmode="stack")
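
Here is the same yearly-share calculation on made-up data, as a sketch. It groups by calendar year directly rather than using `resample`, which should be equivalent for annual totals, and the plotting step is left out. `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical speeches with a date, the speaker's gender and a word count.
toy = pd.DataFrame({
    "date": pd.to_datetime(["2001-03-30", "2001-06-01", "2001-09-10",
                            "2002-01-15", "2002-02-20"]),
    "gender": ["F", "M", "M", "F", "M"],
    "n_words": [100, 300, 100, 200, 200],
})

# Total words per calendar year and gender, with genders as columns.
yearly = (toy.groupby([toy["date"].dt.year, "gender"])["n_words"]
             .sum()
             .unstack("gender")
             .fillna(0))

# Divide each year by its total to get each gender's share of words.
share = yearly.div(yearly.sum(axis=1), axis=0)
print(share)
```

Each row of `share` sums to 1, which is what makes the stacked bar chart comparable from year to year.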