# <span style="color:darkblue"> Lecture 23 - Text Data  </span>

<font size = "5">

In this lecture we will work with text data.

<font size = "5">

Import Libraries

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

<font size = "5">

Import Data

- Congressional bills in the United States

In [3]:
bills_actions = pd.read_csv("data_raw/bills_actions.csv")
bills_actions.dtypes

Congress        int64
bill_number     int64
bill_type      object
action         object
main_action    object
category       object
member_id       int64
dtype: object

# <span style="color:darkblue"> II. Basic Text Operations </span>

<font size = "5">

Count Frequency

In [4]:
bills_actions["category"].value_counts()

category
amendment                       1529
house bill                       902
senate bill                      514
house resolution                 234
senate resolution                 60
house joint resolution            22
house concurrent resolution       20
senate concurrent resolution      14
senate joint resolution            8
Name: count, dtype: int64

<font size = "5">

Subset text categories

In [15]:
# For this analysis we are only interested in bills. With ".query()" ...
#     - We select all rows whose "category" value is
#       contained in "list_categories"
#     - "in" tests whether a value belongs to a list
#     - "@" is the syntax to reference variables that are defined
#       in the Python environment rather than as dataset columns

list_categories = ["house bill","senate bill"]
bills           = bills_actions.query('category in @list_categories')

# Verify that the code worked:
display(bills["category"].value_counts())



category
house bill     902
senate bill    514
Name: count, dtype: int64
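
The same filter can be tried on a toy table. The sketch below uses a small made-up DataFrame (the rows are illustrative and not taken from `bills_actions.csv`) and checks that `.query()` with `@` gives the same rows as a boolean mask built with `.isin()`.

```python
import pandas as pd

# Toy data: illustrative rows only, not from bills_actions.csv
toy = pd.DataFrame({
    "category": ["amendment", "house bill", "senate bill", "house resolution"],
    "bill_number": [1, 2, 3, 4],
})

list_categories = ["house bill", "senate bill"]

# "@" tells .query() to look up a Python variable instead of a column
subset_query = toy.query("category in @list_categories")

# Equivalent version with a boolean mask
subset_isin = toy[toy["category"].isin(list_categories)]

print(subset_query.equals(subset_isin))
```

Either form works; `.query()` tends to read more naturally when the condition is written as a sentence, while `.isin()` avoids putting code inside a string.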

In [23]:
bills

Unnamed: 0,Congress,bill_number,bill_type,action,main_action,category,member_id
3,116,1199,s,"Committee on Health, Education, Labor, and Pen...",senate committee/subcommittee actions,senate bill,1561
4,116,1208,s,Committee on the Judiciary. Reported by Senato...,senate committee/subcommittee actions,senate bill,1580
5,116,1231,s,Committee on the Judiciary. Reported by Senato...,senate committee/subcommittee actions,senate bill,1580
6,116,1228,s,"Committee on Commerce, Science, and Transporta...",senate committee/subcommittee actions,senate bill,1002
7,116,123,s,Committee on Veterans' Affairs. Reported by Se...,senate committee/subcommittee actions,senate bill,1490
...,...,...,...,...,...,...,...
3262,116,991,hr,Mr. Blumenauer moved to suspend the rules and ...,house floor actions,house bill,328
3263,116,995,hr,"At the conclusion of debate, the chair put the...",house floor actions,house bill,1548
3264,116,995,hr,Ms. Hill (CA) moved to suspend the rules and p...,house floor actions,house bill,1548
3265,116,9,hr,Mr. Barr moved to recommit with instructions t...,house committee/subcommittee actions,house bill,1243


<font size = "5">

Data manipulation with sentences

In [16]:
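# What share of bill actions mention the word "Senator"?
# (the mean of a True/False column is the proportion of True values)
bool_contains = bills["action"].str.contains("Senator")
print(bool_contains.mean())

# How to replace the word "Senator" with "Custom Title"
# (this returns a new column; "bills" itself is not modified)
bills["action"].str.replace("Senator","Custom Title")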

0.3199152542372881


3       Committee on Health, Education, Labor, and Pen...
4       Committee on the Judiciary. Reported by Custom...
5       Committee on the Judiciary. Reported by Custom...
6       Committee on Commerce, Science, and Transporta...
7       Committee on Veterans' Affairs. Reported by Cu...
                              ...                        
3262    Mr. Blumenauer moved to suspend the rules and ...
3263    At the conclusion of debate, the chair put the...
3264    Ms. Hill (CA) moved to suspend the rules and p...
3265    Mr. Barr moved to recommit with instructions t...
3280           Mr. Pallone moved that the Committee rise.
Name: action, Length: 1416, dtype: object
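
To see what these two methods return, here is a minimal sketch on a made-up Series (the sentences are illustrative, not rows from the dataset). It separates the count of matches from the share of matches and shows that `.str.replace()` leaves the original untouched. The `regex=False` argument is an optional choice made here to treat the pattern as plain text.

```python
import pandas as pd

# Toy actions: illustrative text only
actions = pd.Series([
    "Reported by Senator Graham.",
    "Mr. Hoyer moved to table the measure.",
    "Motion by Senator McConnell.",
    "Committee rise.",
])

mentions = actions.str.contains("Senator")  # True/False per row
n_mentions = mentions.sum()                 # number of rows that match
share_mentions = mentions.mean()            # fraction of rows that match

# .str.replace() returns a new Series; assign it to keep the result
renamed = actions.str.replace("Senator", "Custom Title", regex=False)

print(n_mentions, share_mentions)
print(renamed[0])
print(actions[0])  # the original Series is unchanged
```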

<font size = "5">

Try it yourself!

- Obtain a new dataset called "resolutions" <br>
 that keeps only the rows whose "category" value is in:

 ``` ["house resolution","senate resolution"] ```

In [30]:
# Write your own code
resolution_list = ["house resolution","senate resolution"]
resolutions = bills_actions.query("category in @resolution_list")
display(resolutions)

Unnamed: 0,Congress,bill_number,bill_type,action,main_action,category,member_id
485,116,123,sres,Committee on Foreign Relations. Reported by Se...,senate committee/subcommittee actions,senate resolution,505
486,116,135,sres,Committee on Foreign Relations. Reported by Se...,senate committee/subcommittee actions,senate resolution,505
487,116,142,sres,Committee on Foreign Relations. Reported by Se...,senate committee/subcommittee actions,senate resolution,505
488,116,152,sres,Committee on Foreign Relations. Reported by Se...,senate committee/subcommittee actions,senate resolution,505
489,116,183,sres,Committee on Foreign Relations. Reported by Se...,senate committee/subcommittee actions,senate resolution,505
...,...,...,...,...,...,...,...
1085,116,603,hres,Mr. Hoyer moved to table the measure.,house floor actions,house resolution,1065
1086,116,603,hres,QUESTION OF THE PRIVILEGES OF THE HOUSE - The ...,house floor actions,house resolution,1560
1087,116,647,hres,Mr. Hoyer moved to table the measure.,house floor actions,house resolution,1065
1088,116,770,hres,Mr. Hoyer moved to table the measure.,house floor actions,house resolution,1065


In [47]:
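# The second positional argument of ".str.contains()" is "case", not a
# second pattern. Use "|" (or) to search for either string.
is_resolution = bills_actions["category"].str.contains("house resolution|senate resolution")
display(is_resolution)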

0       False
1       False
2       False
3       False
4       False
        ...  
3298    False
3299    False
3300    False
3301    False
3302    False
Name: category, Length: 3303, dtype: bool
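
A pitfall worth noting when matching several values with `.str.contains()`: its second positional argument is `case`, so a second string passed there is not treated as another pattern. The sketch below, on made-up category labels, shows two ways to match either value: `|` inside the pattern, or `.isin()` for exact matches.

```python
import pandas as pd

# Toy labels: illustrative only
category = pd.Series([
    "house resolution",
    "senate resolution",
    "house joint resolution",
    "amendment",
])

# "|" means "or" inside a regular expression
is_resolution = category.str.contains("house resolution|senate resolution")

# Exact-match alternative that does not use regular expressions
is_resolution_exact = category.isin(["house resolution", "senate resolution"])

print(is_resolution.tolist())
print(is_resolution_exact.tolist())
```

Keep in mind that `.str.contains()` looks for substrings, while `.isin()` requires the whole value to match.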

# <span style="color:darkblue"> III. Regular Expressions </span>

<font size = "5">

Regular expressions enable advanced searching <br>
for string data.

In [35]:
dataset = pd.read_csv("data_raw/bills_actions.csv")
senate_bills = dataset.query('category == "senate bill"')
amendments   = dataset.query('category == "amendment"')

In [41]:
to_reconsider = dataset[dataset["action"].str.contains("to reconsider")]
display(to_reconsider)

Unnamed: 0,Congress,bill_number,bill_type,action,main_action,category,member_id
38,116,1,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
39,116,1,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
40,116,1,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
41,116,1,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
268,116,2657,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
269,116,2657,s,S.Amdt.1407 Motion by Senator McConnell to rec...,other senate amendment actions,amendment,858
400,116,3985,s,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate bill,858
548,116,50,sres,Motion by Senator McConnell to reconsider the ...,senate floor actions,senate resolution,858
823,116,28,hjres,VITIATION OF EARLIER PROCEEDINGS - Mr. Hoyer a...,house floor actions,house joint resolution,1065
1023,116,758,hres,Mr. Nadler moved to table the motion to recons...,house floor actions,house resolution,546


In [22]:
display(amendments["action"])

0       S.Amdt.1274 Amendment SA 1274 proposed by Sena...
1       S.Amdt.2698 Amendment SA 2698 proposed by Sena...
2       S.Amdt.2659 Amendment SA 2659 proposed by Sena...
8       S.Amdt.2424 Amendment SA 2424 proposed by Sena...
11      S.Amdt.1275 Amendment SA 1275 proposed by Sena...
                              ...                        
3298    H.Amdt.172 Amendment (A004) offered by Ms. Kus...
3299    H.Amdt.171 Amendment (A003) offered by Ms. Hou...
3300    H.Amdt.170 Amendment (A002) offered by Ms. Oma...
3301    POSTPONED PROCEEDINGS - At the conclusion of d...
3302    H.Amdt.169 Amendment (A001) offered by Mr. Esp...
Name: action, Length: 1529, dtype: object

<font size = "5">

Search word

In [36]:
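# We use the ".str.findall()" method
# The argument is a regular expression. The "r" prefix makes it a raw
# string, so "\." reaches the regex engine as an escaped (literal) period

amendments["action"].str.findall(r"Amdt\.")
# Find every occurrence of "Amdt." in the "action" column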

0       [Amdt.]
1       [Amdt.]
2       [Amdt.]
8       [Amdt.]
11      [Amdt.]
         ...   
3298    [Amdt.]
3299    [Amdt.]
3300    [Amdt.]
3301         []
3302    [Amdt.]
Name: action, Length: 1529, dtype: object
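
Why escape the period? In a regular expression an unescaped `.` matches any character, so it can match more than intended. A minimal sketch on two made-up strings:

```python
import pandas as pd

# Toy text: illustrative only
text = pd.Series(["S.Amdt.1274 proposed", "Amdts were offered"])

escaped = text.str.findall(r"Amdt\.")   # requires a literal period
unescaped = text.str.findall(r"Amdt.")  # "." matches any one character

print(escaped.tolist())
print(unescaped.tolist())
```

The unescaped version also picks up "Amdts" in the second string, because the `s` counts as "any character".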

In [54]:
display(amendments["action"].str.findall(r"Amdt\.\d"))
# "\d" matches the single digit right after "Amdt."
display(amendments["action"].str.findall(r"Amdt\.\D"))
# "\D" matches a non-digit; the displayed rows are empty because "Amdt." is followed by a digit

0       [Amdt.1]
1       [Amdt.2]
2       [Amdt.2]
8       [Amdt.2]
11      [Amdt.1]
          ...   
3298    [Amdt.1]
3299    [Amdt.1]
3300    [Amdt.1]
3301          []
3302    [Amdt.1]
Name: action, Length: 1529, dtype: object

0       []
1       []
2       []
8       []
11      []
        ..
3298    []
3299    []
3300    []
3301    []
3302    []
Name: action, Length: 1529, dtype: object
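
The contrast between `\d` and `\D` is easier to see when one string has a digit after "Amdt." and another does not. The strings below are made up for illustration.

```python
import pandas as pd

# Toy text: illustrative only
text = pd.Series(["S.Amdt.1274 proposed", "See Amdt. A for details"])

digit_after = text.str.findall(r"Amdt\.\d")     # one digit after the period
nondigit_after = text.str.findall(r"Amdt\.\D")  # one non-digit after the period

print(digit_after.tolist())
print(nondigit_after.tolist())
```

In the second string the character after the period is a space, which counts as a non-digit, so `\D` matches it.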

<font size = "5">

Wildcards

$\quad$ <img src="figures/wildcards_regex1.png" alt="drawing" width="300"/>

In [57]:
# Get any one character after the string
example1 = amendments["action"].str.findall(r"Amdt\..")

# Get any one character before the string
example2 = amendments["action"].str.findall(r".Amdt\.")

# Get two characters before and fourteen characters after the string
example3 = amendments["action"].str.findall(r"..Amdt\...............") # each unescaped dot matches exactly one character

display(example1)
display(example2)
display(example3)

0       [Amdt.1]
1       [Amdt.2]
2       [Amdt.2]
8       [Amdt.2]
11      [Amdt.1]
          ...   
3298    [Amdt.1]
3299    [Amdt.1]
3300    [Amdt.1]
3301          []
3302    [Amdt.1]
Name: action, Length: 1529, dtype: object

0       [.Amdt.]
1       [.Amdt.]
2       [.Amdt.]
8       [.Amdt.]
11      [.Amdt.]
          ...   
3298    [.Amdt.]
3299    [.Amdt.]
3300    [.Amdt.]
3301          []
3302    [.Amdt.]
Name: action, Length: 1529, dtype: object

0       [S.Amdt.1274 Amendment]
1       [S.Amdt.2698 Amendment]
2       [S.Amdt.2659 Amendment]
8       [S.Amdt.2424 Amendment]
11      [S.Amdt.1275 Amendment]
                 ...           
3298    [H.Amdt.172 Amendment ]
3299    [H.Amdt.171 Amendment ]
3300    [H.Amdt.170 Amendment ]
3301                         []
3302    [H.Amdt.169 Amendment ]
Name: action, Length: 1529, dtype: object
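
Each wildcard dot must match exactly one character, so a pattern with more dots than there are remaining characters finds nothing. A small sketch on made-up strings:

```python
import pandas as pd

# Toy text: illustrative only
text = pd.Series(["S.Amdt.1274 Amendment SA 1274", "H.Amdt.9"])

one_after = text.str.findall(r"Amdt\..")      # escaped period + 1 wildcard
three_after = text.str.findall(r"Amdt\....")  # escaped period + 3 wildcards

print(one_after.tolist())
print(three_after.tolist())
```

The second string has only one character after "Amdt.", so the three-wildcard pattern returns an empty list for it.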

<font size = "5">

Wildcards + Quantifiers

$\quad$ <img src="figures/wildcards_regex2.png" alt="drawing" width="300"/>

In [60]:
# Get all consecutive digits after the string ("*" means zero or more)
example4 = amendments["action"].str.findall(r"Amdt\.\d*")

# Get all characters before the string, back to the start of the text
example5 = amendments["action"].str.findall(r".*Amdt\.")

display(example4)
display(example5)


0       [Amdt.1274]
1       [Amdt.2698]
2       [Amdt.2659]
8       [Amdt.2424]
11      [Amdt.1275]
           ...     
3298     [Amdt.172]
3299     [Amdt.171]
3300     [Amdt.170]
3301             []
3302     [Amdt.169]
Name: action, Length: 1529, dtype: object

0       [S.Amdt.]
1       [S.Amdt.]
2       [S.Amdt.]
8       [S.Amdt.]
11      [S.Amdt.]
          ...    
3298    [H.Amdt.]
3299    [H.Amdt.]
3300    [H.Amdt.]
3301           []
3302    [H.Amdt.]
Name: action, Length: 1529, dtype: object
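
Two details of `*` are easy to miss: it also accepts zero repetitions, and it is greedy (it grabs as much text as it can). The sketch below uses made-up strings; the `+` quantifier (one or more) is standard regular expression syntax but is an addition here, not part of the examples above.

```python
import pandas as pd

# Toy text: illustrative only
text = pd.Series(["S.Amdt.1274 and S.Amdt.1275", "See Amdt. A"])

zero_or_more = text.str.findall(r"Amdt\.\d*")  # digits are optional
one_or_more = text.str.findall(r"Amdt\.\d+")   # at least one digit required
greedy = text.str.findall(r".*Amdt\.")         # runs up to the LAST "Amdt."

print(zero_or_more.tolist())
print(one_or_more.tolist())
print(greedy.tolist())
```

With `\d*` the second string still matches "Amdt." even though no digit follows, and `.*` in the first string swallows the first amendment number on its way to the last "Amdt.".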

<font size = "5">

Try it yourself

- Practice using the ```senate_bills``` dataset
- Use ```.str.findall()``` to find the word "Senator"
- Use the regular expression ```"Senator \S"``` to extract <br>
 the first letter of each senator's name
- Use ```*``` to extract senator names

In [69]:
# Write your own code
senate_bills["action"].str.findall("Senator")
display(senate_bills["action"].str.findall(r"Senator \S"))
# "\S" matches one non-whitespace character: the first letter after "Senator "
display(senate_bills["action"].str.findall(r"Senator \S*"))
# "\S*" keeps matching non-whitespace characters until the next space


3      [Senator A]
4      [Senator G]
5      [Senator G]
6      [Senator W]
7      [Senator M]
          ...     
795    [Senator J]
796             []
797    [Senator H]
798             []
799    [Senator G]
Name: action, Length: 514, dtype: object

3      [Senator Alexander]
4         [Senator Graham]
5         [Senator Graham]
6         [Senator Wicker]
7          [Senator Moran]
              ...         
795      [Senator Johnson]
796                     []
797       [Senator Hoeven]
798                     []
799       [Senator Graham]
Name: action, Length: 514, dtype: object

In [74]:
display(senate_bills["action"].str.findall("Senator..."))
# Find the next 3 characters after "Senator": the space plus two letters

3      [Senator Al]
4      [Senator Gr]
5      [Senator Gr]
6      [Senator Wi]
7      [Senator Mo]
           ...     
795    [Senator Jo]
796              []
797    [Senator Ho]
798              []
799    [Senator Gr]
Name: action, Length: 514, dtype: object
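
If only the name is wanted, without the word "Senator", one option is a capture group with `.str.extract()`. This is a sketch on made-up sentences and goes slightly beyond the exercise above: `.str.extract()` and the `\w` character class (letters, digits, underscore) are not used elsewhere in this lecture.

```python
import pandas as pd

# Toy text: illustrative only
text = pd.Series([
    "Reported by Senator Graham with amendments.",
    "Mr. Hoyer moved to table the measure.",
    "Motion by Senator McConnell.",
])

# The parentheses mark the part of the match to keep
names_nonspace = text.str.extract(r"Senator (\S+)", expand=False)
names_word = text.str.extract(r"Senator (\w+)", expand=False)

print(names_nonspace.tolist())
print(names_word.tolist())
```

Rows without a match come back as missing values. Note that `\S+` keeps trailing punctuation ("McConnell."), while `\w+` stops before it.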