# Corpus Collecting Notebook
This notebook collects a corpus of the most significant 20th-century American political speeches by women. The data is gathered by web scraping and stored as plain-text files and a CSV file.

## 1. Importing necessary libraries

In [26]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from datetime import datetime
from urllib.parse import urljoin
import re
import os
import glob

## 2. Creating the DataFrame
The first step is to manually select the most famous 20th-century speeches by women from the [American Rhetoric](https://www.americanrhetoric.com/top100speechesfemales.htm) website. The next step is to create a DataFrame holding each speaker's name and the link to the corresponding speech transcript.

In [27]:
data = [
    {'Speaker': 'Barbara Charline Jordan', 'Link': 'https://www.americanrhetoric.com/speeches/barbarajordan1976dnc.html'},
    {'Speaker': 'Barbara Charline Jordan', 'Link': 'https://www.americanrhetoric.com/speeches/barbarajordanjudiciarystatement.htm'},
    {'Speaker': 'Anna Howard Shaw', 'Link': 'https://www.americanrhetoric.com/speeches/annahowardshawprinciplerepublic.htm'},
    {'Speaker': 'Hillary Diane Rodham Clinton', 'Link': 'https://www.americanrhetoric.com/speeches/hillaryclintonbeijingspeech.htm'},
    {'Speaker': 'Dorothy Ann Willis Richards', 'Link': 'https://www.americanrhetoric.com/speeches/annrichards1988dnc.htm'},
    {'Speaker': 'Margaret Chase Smith', 'Link': 'https://www.americanrhetoric.com/speeches/margaretchasesmithconscience.html'},
    {'Speaker': 'Barbara Pierce Bush', 'Link': 'https://www.americanrhetoric.com/speeches/barbarabushwellesleycommencement.htm'},
    {'Speaker': 'Mary Fisher', 'Link': 'https://www.americanrhetoric.com/speeches/maryfisher1992rnc.html'},
    {'Speaker': 'Anna Eleanor Roosevelt', 'Link': 'https://www.americanrhetoric.com/speeches/eleanorroosevelt.htm'},
    {'Speaker': 'Geraldine Anne Ferraro', 'Link': 'https://www.americanrhetoric.com/speeches/gferraroacceptanceaddress.html'},
    {'Speaker': 'Emma Goldman', 'Link': 'https://www.americanrhetoric.com/speeches/emmagoldmanjuryaddress.htm'},
    {'Speaker': 'Carrie Chapman Catt', 'Link': 'https://www.americanrhetoric.com/speeches/carriechapmancattthecrisis.htm'},
    {'Speaker': 'Anita Faye Hill', 'Link': 'https://www.americanrhetoric.com/speeches/anitahillsenatejudiciarystatement.htm'},
    {'Speaker': 'Carrie Chapman Catt', 'Link': 'https://www.americanrhetoric.com/speeches/carriechapmancattsuffragespeech.htm'},
    {'Speaker': 'Elizabeth Glaser', 'Link': 'https://www.americanrhetoric.com/speeches/elizabethglaser1992dnc.htm'},
    {'Speaker': 'Margaret Higgins Sanger', 'Link': 'https://www.americanrhetoric.com/speeches/margaretsangerchildrensera.html'},
    {'Speaker': 'Ursula Kroeber Le Guin', 'Link': 'https://www.americanrhetoric.com/speeches/ursulakleguinlefthandedcommencementspeech.htm'},
    {'Speaker': 'Elizabeth Gurley Flynn', 'Link': 'https://www.americanrhetoric.com/speeches/elizabethgurleyflynn.htm'},
    {'Speaker': 'Shirley Anita Chisholm', 'Link': 'https://www.americanrhetoric.com/speeches/shirleychisholmequalrights.htm'},
    {'Speaker': 'Anna Eleanor Roosevelt', 'Link': 'https://www.americanrhetoric.com/speeches/eleanorrooseveltdeclarationhumanrights.htm'}
]

df = pd.DataFrame(data)
df

| | Speaker | Link |
|---|---|---|
| 0 | Barbara Charline Jordan | https://www.americanrhetoric.com/speeches/barb... |
| 1 | Barbara Charline Jordan | https://www.americanrhetoric.com/speeches/barb... |
| 2 | Anna Howard Shaw | https://www.americanrhetoric.com/speeches/anna... |
| 3 | Hillary Diane Rodham Clinton | https://www.americanrhetoric.com/speeches/hill... |
| 4 | Dorothy Ann Willis Richards | https://www.americanrhetoric.com/speeches/annr... |
| 5 | Margaret Chase Smith | https://www.americanrhetoric.com/speeches/marg... |
| 6 | Barbara Pierce Bush | https://www.americanrhetoric.com/speeches/barb... |
| 7 | Mary Fisher | https://www.americanrhetoric.com/speeches/mary... |
| 8 | Anna Eleanor Roosevelt | https://www.americanrhetoric.com/speeches/elea... |
| 9 | Geraldine Anne Ferraro | https://www.americanrhetoric.com/speeches/gfer... |
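
Because the link list is typed in by hand, a quick sanity check before scraping can catch copy-paste mistakes. The sketch below is not part of the original notebook: `check_links` is a hypothetical helper, and it assumes every link should live on `www.americanrhetoric.com` and appear only once. It uses only the standard library and works on the same list-of-dicts shape as `data`.

```python
from collections import Counter
from urllib.parse import urlparse


def check_links(records, expected_host="www.americanrhetoric.com"):
    """Return (off_site_links, duplicate_links) for a list of {'Speaker', 'Link'} dicts.

    Hypothetical helper; the expected host is an assumption based on the links above.
    """
    links = [record["Link"] for record in records]
    # Links whose host differs from the expected one
    off_site = [link for link in links if urlparse(link).netloc != expected_host]
    # Links that occur more than once
    counts = Counter(links)
    duplicates = [link for link, n in counts.items() if n > 1]
    return off_site, duplicates


sample = [
    {'Speaker': 'Barbara Charline Jordan', 'Link': 'https://www.americanrhetoric.com/speeches/barbarajordan1976dnc.html'},
    {'Speaker': 'Mary Fisher', 'Link': 'https://www.americanrhetoric.com/speeches/maryfisher1992rnc.html'},
]
off_site, duplicates = check_links(sample)
print(off_site, duplicates)
```

Calling `check_links(data)` on the full list would report any off-site or repeated link before a request is made.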


## 3. Scraping the speeches and saving .txt files
Define a function called 'scraping_speech' that takes a URL as input and scrapes the speech title, speaker, date, and transcript from the webpage. Each scraped speech is stored as a .txt file in the 'women_speech' folder.

In [28]:
def scraping_speech(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }
    
    try:
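        # Time out on stalled connections and treat HTTP error pages (4xx/5xx) as failures
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()
        html_string = response.text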
        document = BeautifulSoup(html_string, "html.parser")
        
        # Remove all \n, \t and \r characters from the title
        title = re.sub(r'[\n\r\t]', '', document.title.text.replace("American Rhetoric:",""))
        
        # Extract the speaker and remove all \n, \t and \r characters
        speaker_element = document.find('b')
        speaker = re.sub(r'[\n\r\t]', '', speaker_element.text.strip()) if speaker_element else ''
        
        # Extract the date and remove all \n, \t and \r characters
        date_element = document.find('font', {'color': '#CE0A04', 'face': 'Arial'})
        date = re.sub(r'[\n\r\t]', '', date_element.text.strip()) if date_element else ''
        
        # Extract the speech transcript and remove all \n, \t and \r characters
        transcript = []
        transcript_elements = document.find_all('font', {'face': 'Verdana', 'size': '2'})
        for element in transcript_elements:
            transcript.append(re.sub(r'[\n\r\t]', '', element.text.strip()))
        transcript = ' '.join([line.replace('Book/CDs by Michael E. Eidenmuller, Published by McGraw-Hill (2008)', '') for line in transcript])
        
        # Create the output folder if it doesn't exist
        os.makedirs('women_speech', exist_ok=True)
        
        # Save the speech as a text file
        filename = 'women_speech/' + title.replace(' ', '_') + '.txt'
        with open(filename, 'w') as file:
            file.write(title + '\n')
            file.write(speaker + '\n')
            file.write(date + '\n')
            file.write(transcript + '\n')        
        print('Saved speech:', filename)
    
    except requests.exceptions.RequestException as e:
        print('Error:', e)
    
    return 

# Split the links into smaller chunks
links = df['Link'].tolist()
chunks = [links[i:i+10] for i in range(0, len(links), 10)]

# Scrape the speeches for each chunk of links
for chunk in chunks:
    for link in chunk:
        scraping_speech(link)
        print('---')

Saved speech: women_speech/_Barbara_Jordan_-_1976_Democratic_National_Convention_Keynote_Address.txt
---
Saved speech: women_speech/_Barbara_Jordan_-_Statement_on_House_Judiciary_Proceedings_to_Impeach_President_Richard_Nixon.txt
---
Saved speech: women_speech/_Anna_Howard_Shaw_-_The_Fundamental_Principle_of_a_Republic.txt
---
Saved speech: women_speech/_Hillary_Rodham_Clinton_--_United_Nations_4th_World_Conference_Speech_("Women's_Rights_are_Human_Rights").txt
---
Saved speech: women_speech/_Ann_Richards_-_Democratic_National_Convention_Keynote_Address.txt
---
Saved speech: women_speech/_Margaret_Chase_Smith_--_"Declaration_of_Conscience".txt
---
Saved speech: women_speech/Barbara_Bush_--_Wellesley_College_Commencement_Speech.txt
---
Saved speech: women_speech/_Mary_Fisher_--_1992_Republican_National_Convention_Address_("A_Whisper_of_Aids").txt
---
Saved speech: women_speech/_Eleanor_Roosevelt_--_"The_Struggle_for_Human_Rights".txt
---
Saved speech: women_speech/_Geraldine_Ferraro_-_1
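
Several of the file names above start with an underscore (from a leading space in the page title) and contain double quotes, which are not allowed in Windows file names. A minimal sketch of a more portable naming step follows. `safe_filename` is a hypothetical helper, not part of the original function, and the set of stripped characters is an assumption based on Windows naming rules.

```python
import re


def safe_filename(title: str) -> str:
    """Turn a page title into a portable file stem (hypothetical helper)."""
    # Collapse runs of whitespace and trim both ends
    cleaned = re.sub(r'\s+', ' ', title).strip()
    # Drop characters that Windows does not allow in file names
    cleaned = re.sub(r'[<>:"/\\|?*]', '', cleaned)
    # Use underscores instead of spaces, as the notebook does
    return cleaned.replace(' ', '_')


print(safe_filename(' Margaret Chase Smith -- "Declaration of Conscience"'))
```

Using such a helper inside `scraping_speech` would change the saved file names, so the output listed above would no longer match exactly.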

## 4. Creating a DataFrame from extracted data
Gather the file paths of all the saved speech text files, then iterate over them and read the content of each file. Extract the title, speaker, date, and transcript from each .txt file, and create a new DataFrame from the extracted data with the columns "Title", "Speaker", "Date", and "Transcript".

In [29]:
file_paths = glob.glob("women_speech/*.txt")

data = []
for file_path in file_paths:
    with open(file_path, "r") as file:
        # Each file holds four lines: title, speaker, date/location, transcript
        lines = file.read().split("\n")
        title = lines[0]
        speaker = lines[1]
        date_loc = lines[2]
        # Keep the text before the first comma and drop the word "delivered"
        date = date_loc.split(",", 1)[0].replace("delivered", "")
        transcript = lines[3]
        
        data.append([title, speaker, date, transcript])

df = pd.DataFrame(data, columns=["Title", "Speaker", "Date", "Transcript"])
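
The parsing step can be illustrated on a single in-memory string. The sketch below mirrors the four-line layout written by 'scraping_speech'. `parse_speech_text` is a hypothetical helper, and the sample text is a made-up placeholder rather than a real scraped file.

```python
def parse_speech_text(content: str) -> dict:
    """Split a saved speech file into its four fields (hypothetical helper)."""
    # At most three splits: title, speaker, date/location, then the rest
    title, speaker, date_loc, transcript = content.split("\n", 3)
    # Keep the text before the first comma and drop the word "delivered"
    date = date_loc.split(",", 1)[0].replace("delivered", "").strip()
    return {
        "Title": title,
        "Speaker": speaker,
        "Date": date,
        "Transcript": transcript.strip(),
    }


# Placeholder content in the same layout as the saved .txt files
sample = "Example Title\nExample Speaker\ndelivered 12 July 1976, New York, NY\nExample transcript text.\n"
print(parse_speech_text(sample))
```

Unlike the cell above, this version strips surrounding whitespace from the date and transcript, which is a small design choice rather than something the original code does.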

## 5. Cleaning data and formatting the date
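Replace incorrect dates for certain speeches, then create a new column called 'Year' and drop the 'Date' column. The row labels used below (17, 7, 12, 1, and 6) refer to positions in the DataFrame built in the previous step, so they depend on the order in which `glob` returned the files.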

In [30]:
df.loc[17,'Date']="19 July 1984"
df.loc[7,'Date']="March 1925"
df.loc[12,'Date']="9 December 1948"
df.loc[1,'Date']="24 April 1952"
df['Year'] = df['Date'].str.split().str[-1]
df.loc[6,'Year']="1915"
df = df.drop('Date', axis=1)
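
The year extraction takes the last whitespace-separated token of each date string. A slightly more defensive variant, sketched here with the standard library only, looks for a four-digit year anywhere in the string and returns `None` when there is none. `extract_year` and the accepted century range (1800s to 2000s) are assumptions for illustration, not part of the original notebook.

```python
import re

# Four-digit years starting with 18, 19, or 20 (assumed range)
YEAR_PATTERN = re.compile(r'\b(?:18|19|20)\d{2}\b')


def extract_year(date_text: str):
    """Return the last four-digit year found in date_text, or None (hypothetical helper)."""
    matches = YEAR_PATTERN.findall(date_text)
    return matches[-1] if matches else None


for text in ["19 July 1984", "March 1925", "9 December 1948", ""]:
    print(repr(text), extract_year(text))
```

With pandas, the same function could be applied through `df['Date'].apply(extract_year)`; rows that come back as `None` are the ones that still need a manual correction.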

## 6. Saving the DataFrame to a CSV file

In [31]:
df.to_csv('women_speech.csv', index=False)
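
Transcripts contain commas and quotation marks, so it is worth confirming that a row survives a write-and-read round trip. The sketch below uses the standard-library `csv` module instead of pandas and an in-memory buffer instead of 'women_speech.csv'. The row is a made-up placeholder, and the column order assumes the DataFrame produced above ("Title", "Speaker", "Transcript", "Year").

```python
import csv
import io

FIELDS = ["Title", "Speaker", "Transcript", "Year"]

# Placeholder row with commas and quotes inside the transcript
rows = [
    {"Title": "Example Title", "Speaker": "Example Speaker",
     "Transcript": 'She said, "Thank you," and began.', "Year": "1976"},
]

# Write the rows to an in-memory CSV
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)

# Read them back and compare with the original rows
buffer.seek(0)
restored = list(csv.DictReader(buffer))
print(restored == rows)
```

The same check can be done with pandas by reading 'women_speech.csv' back with `pd.read_csv` and comparing it with `df`, keeping in mind that 'Year' may be read back as an integer column.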