
## 0. Download data


In [1]:
%%bash
wget https://raw.githubusercontent.com/sandeepsoni/QTM340-Fall23/main/data/114_speeches.tar.gz
tar -xzvf 114_speeches.tar.gz

114/
114/speeches_114.txt
114/README.txt
114/114_SpeakerMap.txt


--2024-02-25 01:06:33--  https://raw.githubusercontent.com/sandeepsoni/QTM340-Fall23/main/data/114_speeches.tar.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 43102274 (41M) [application/octet-stream]
Saving to: ‘114_speeches.tar.gz’


The above execution should create a directory named 114 with the following structure:

```
114/
114/speeches_114.txt
114/README.txt
114/114_SpeakerMap.txt
```
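
Before moving on, it can help to confirm that the archive unpacked as expected. The following is a minimal sketch and not part of the original notebook: the helper name `missing_files` is hypothetical, and it only checks for the three paths listed above.

```python
from pathlib import Path

# File names taken from the directory listing above.
EXPECTED_FILES = ["speeches_114.txt", "README.txt", "114_SpeakerMap.txt"]

def missing_files (root="114", expected=EXPECTED_FILES):
    """Return the expected file names that are not present under `root`."""
    root = Path (root)
    return [name for name in expected if not (root / name).is_file ()]

# An empty list means every expected file was found.
print (missing_files ())
```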

## 1. Setup

Let's load all the speeches for which we have additional metadata.

The speaker info file is delimited by `|` and the columns are named.

In [2]:
%%bash
head -n 5 114/114_SpeakerMap.txt

speakerid|speech_id|lastname|firstname|chamber|state|gender|party|district|nonvoting
114120480|1140000007|MCMORRIS RODGERS|CATHY|H|WA|F|R|5|voting
114118560|1140000009|BECERRA|XAVIER|H|CA|M|D|34|voting
114121890|1140000011|MASSIE|THOMAS|H|KY|M|R|4|voting
114122500|1140000013|BRIDENSTINE|JIM|H|OK|M|R|1|voting


Similarly, the speeches file is delimited by `|` and contains each speech along with its ID (`speech_id`).

In [None]:
%%bash
head -n 5 114/speeches_114.txt

speech_id|speech
1140000001|The Representativeselect and their guests will please remain standing and join in the Pledge of Allegiance.
1140000002|As directed by law. the Clerk of the House has prepared the official roll of the Representativeselect. Certificates of election covering 435 seats in the 114th Congress have been received by the Clerk of the House. and the names of those persons whose credentials show that they were regularly elected as Representatives in accordance with the laws of their respective States or of the United States will be called. The Representativeselect will record their presence by electronic device and their names will be reported in alphabetical order by State. beginning with the State of Alabama. to determine whether a quorum is present. Representatives- elect will have a minimum of 15 minutes to record their presence by electronic device. Representatives- elect who have not obtained their voting ID cards may do so now in the Speakers lobby.
1140000003|F

We'll use the `pandas` library to load both the speeches and the speaker info. If you are familiar with `R`, pandas provides much the same functionality for constructing and manipulating dataframes. You can read more about it in the [10 minutes to pandas guide](https://pandas.pydata.org/docs/user_guide/10min.html). We'll also import other libraries and configure them so they're ready to use later in the notebook.

In [None]:
# Import the general libraries
import math
import pandas as pd
from tqdm import tqdm
import numpy as np
from collections import defaultdict, Counter
import matplotlib.pyplot as pyplot
%matplotlib inline

# Import spacy and configure the nlp pipeline for spacy.
# The "ner" and "parser" components are disabled at load time,
# so no separate disable_pipe calls are needed.
import spacy
nlp = spacy.load ("en_core_web_sm", disable=["ner", "parser"])

# Import nltk and download the punkt models
import nltk
nltk.download ("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Let's read the speeches
speeches = pd.read_csv ("114/speeches_114.txt", #name of the file
                        sep='|', #delimiter
                        encoding="utf-8", #encoding of the characters
                        encoding_errors="ignore", #ignore any errors in encoding
                        on_bad_lines="skip" #skip lines which contain ill-formatted speeches
                       )

In [None]:
# Let's read the speaker info.
speaker_map = pd.read_csv ("114/114_SpeakerMap.txt", #name of the file
                        sep='|', #delimiter
                        encoding="utf-8", #encoding of the characters
                        encoding_errors="ignore", #ignore any errors in encoding
                        on_bad_lines="skip" #skip ill-formatted rows
                       )
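
To see what `on_bad_lines="skip"` does, here is a small sketch on an in-memory, made-up file (the rows are illustrative, not taken from the dataset). It assumes pandas 1.3 or later, where `on_bad_lines` was introduced. A row with more fields than the header is dropped instead of raising an error.

```python
import io
import pandas as pd

# Made-up data: the second data row contains a stray '|' and so has three fields.
raw = "speech_id|speech\n1|first speech\n2|broken|row\n3|third speech\n"

# The malformed row is skipped; the two well-formed rows are kept.
toy = pd.read_csv (io.StringIO (raw), sep="|", on_bad_lines="skip")
print (toy)
```

Rows skipped this way disappear silently, so comparing the number of lines in a file with the length of the loaded dataframe is a rough way to see how much was dropped.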

Let's look at the first few rows of both dataframes. We can do this by calling the `.head` method of a pandas dataframe.

In [None]:
speeches.head (5)

Unnamed: 0,speech_id,speech
0,1140000001,The Representativeselect and their guests will...
1,1140000002,As directed by law. the Clerk of the House has...
2,1140000003,Four hundred and one Represent ativeselect hav...
3,1140000004,Credentials. regular in form. have been receiv...
4,1140000005,The Clerk is in receipt of a letter from the H...


In [None]:
speaker_map.head (5)

Unnamed: 0,speakerid,speech_id,lastname,firstname,chamber,state,gender,party,district,nonvoting
0,114120480,1140000007,MCMORRIS RODGERS,CATHY,H,WA,F,R,5.0,voting
1,114118560,1140000009,BECERRA,XAVIER,H,CA,M,D,34.0,voting
2,114121890,1140000011,MASSIE,THOMAS,H,KY,M,R,4.0,voting
3,114122500,1140000013,BRIDENSTINE,JIM,H,OK,M,R,1.0,voting
4,114120780,1140000017,PELOSI,NANCY,H,CA,F,D,12.0,voting


Now we'll merge the two dataframes into a single dataframe. We can do this by calling `pd.merge` as follows (if you're familiar with SQL, this is an inner join of the two tables on their shared `speech_id` field).

In [None]:
overall_data = pd.merge (speeches,
                         speaker_map,
                         how="inner",
                         on="speech_id")

The resulting dataframe can be accessed with the variable `overall_data`.

In [None]:
overall_data.head (5)

Unnamed: 0,speech_id,speech,speakerid,lastname,firstname,chamber,state,gender,party,district,nonvoting
0,1140000007,RODGERS. Madam Clerk. it is an honor to addres...,114120480,MCMORRIS RODGERS,CATHY,H,WA,F,R,5.0,voting
1,1140000009,Madam Clerk. first I would like to recognize e...,114118560,BECERRA,XAVIER,H,CA,M,D,34.0,voting
2,1140000011,Madam Clerk. I present for election to the off...,114121890,MASSIE,THOMAS,H,KY,M,R,4.0,voting
3,1140000013,Madam Clerk. I present for the election of the...,114122500,BRIDENSTINE,JIM,H,OK,M,R,1.0,voting
4,1140000015,Madam Clerk. I rise to place in a nomination f...,114120060,KING,STEVE,H,IA,M,R,4.0,voting
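
The effect of `how="inner"` is easiest to see on a toy example. The sketch below uses made-up rows (the IDs and names are illustrative, not from the dataset): only `speech_id` values present in both tables survive the merge, which is why speeches without speaker metadata, such as `1140000001`, do not appear in `overall_data`.

```python
import pandas as pd

# Made-up tables that share the speech_id column.
toy_speeches = pd.DataFrame ({"speech_id": [1, 2, 3],
                              "speech": ["first", "second", "third"]})
toy_speakers = pd.DataFrame ({"speech_id": [2, 3, 4],
                              "lastname": ["DOE", "ROE", "POE"]})

# Inner join: keep only the speech_ids found in both tables (2 and 3).
toy_merged = pd.merge (toy_speeches, toy_speakers, how="inner", on="speech_id")
print (toy_merged)
```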


Now let's randomly pick 50000 speeches from the dataframe for our analysis. We can do this by calling the `.sample` method on the dataframe and passing `n`, the number of rows to draw. Fixing `random_state` makes the sample reproducible across runs.

In [None]:
# @title Get our final data sample
n = 50000 # @param {type:"integer"}
overall_data = overall_data.sample (n=n, random_state=42)
print (len (overall_data))

50000
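
Two properties of `.sample` are worth keeping in mind: a fixed `random_state` makes the draw repeatable, and by default rows are drawn without replacement, so asking for more rows than the dataframe holds raises an error. A minimal sketch on a made-up dataframe (the helper name `safe_sample` is hypothetical):

```python
import pandas as pd

def safe_sample (df, n, seed=42):
    """Draw up to n rows without replacement, reproducibly."""
    # Cap n at the number of rows to avoid a ValueError from .sample.
    return df.sample (n=min (n, len (df)), random_state=seed)

toy = pd.DataFrame ({"speech_id": range (10)})

first = safe_sample (toy, 4)
second = safe_sample (toy, 4)
print (first.equals (second))        # same seed, same rows
print (len (safe_sample (toy, 50)))  # capped at the 10 available rows
```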


### Helpful code

In [None]:
!pip install pyphen

Collecting pyphen
  Downloading pyphen-0.14.0-py3-none-any.whl (2.0 MB)
Installing collected packages: pyphen
Successfully installed pyphen-0.14.0


In [None]:
# @title Test syllable counter
test_word = 'wonderfully' # @param {type:"string"}
import pyphen
dic = pyphen.Pyphen (lang="en_US")

print (dic.inserted(test_word))
print (f"Number of syllables={len(dic.inserted (test_word).split('-'))}")

won-der-ful-ly
Number of syllables=4
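
For repeated use, the same idea can be wrapped in a small function. This is a sketch rather than part of the original notebook: `count_syllables` is a hypothetical helper, and it treats pyphen's hyphenation points as an approximation of syllable boundaries, which will not always match a dictionary's syllable count. The hyphenating function is passed in as an argument, so `dic.inserted` from the cell above can be used directly.

```python
def count_syllables (word, hyphenate):
    """Approximate the number of syllables in `word`.

    `hyphenate` is any function that returns the word with '-' inserted
    at its break points, e.g. `dic.inserted` from pyphen.
    """
    word = word.strip ()
    if not word:
        return 0
    # Drop empty pieces so leading or trailing hyphens are not counted.
    return len ([part for part in hyphenate (word).split ("-") if part])

# Usage with the pyphen dictionary created above (assumes pyphen is installed):
#   count_syllables ("wonderfully", dic.inserted)
```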
