# Corpus Collecting Notebook
This notebook collects transcripts of the most significant 20th-century American political speeches by women. The transcripts are gathered by web scraping and stored in a CSV file.

## 1. Importing necessary libraries

In [106]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from datetime import datetime
from urllib.parse import urljoin
import re
import os
import glob

## 2. Collecting the links of each speech transcript
   The first step is to manually select the most famous 20th-century speeches by women from the [American Rhetoric](https://www.americanrhetoric.com/top100speechesfemales.htm) website. Then create a DataFrame holding each speaker's name and the link to the corresponding speech transcript.

In [107]:
data = [
    {'Speaker': 'Barbara Charline Jordan', 'Link': 'https://www.americanrhetoric.com/speeches/barbarajordan1976dnc.html'},
    {'Speaker': 'Barbara Charline Jordan', 'Link': 'https://www.americanrhetoric.com/speeches/barbarajordanjudiciarystatement.htm'},
    {'Speaker': 'Anna Howard Shaw', 'Link': 'https://www.americanrhetoric.com/speeches/annahowardshawprinciplerepublic.htm'},
    {'Speaker': 'Hillary Diane Rodham Clinton', 'Link': 'https://www.americanrhetoric.com/speeches/hillaryclintonbeijingspeech.htm'},
    {'Speaker': 'Dorothy Ann Willis Richards', 'Link': 'https://www.americanrhetoric.com/speeches/annrichards1988dnc.htm'},
    {'Speaker': 'Margaret Chase Smith', 'Link': 'https://www.americanrhetoric.com/speeches/margaretchasesmithconscience.html'},
    {'Speaker': 'Barbara Pierce Bush', 'Link': 'https://www.americanrhetoric.com/speeches/barbarabushwellesleycommencement.htm'},
    {'Speaker': 'Mary Fisher', 'Link': 'https://www.americanrhetoric.com/speeches/maryfisher1992rnc.html'},
    {'Speaker': 'Anna Eleanor Roosevelt', 'Link': 'https://www.americanrhetoric.com/speeches/eleanorroosevelt.htm'},
    {'Speaker': 'Geraldine Anne Ferraro', 'Link': 'https://www.americanrhetoric.com/speeches/gferraroacceptanceaddress.html'},
    {'Speaker': 'Emma Goldman', 'Link': 'https://www.americanrhetoric.com/speeches/emmagoldmanjuryaddress.htm'},
    {'Speaker': 'Carrie Chapman Catt', 'Link': 'https://www.americanrhetoric.com/speeches/carriechapmancattthecrisis.htm'},
    {'Speaker': 'Anita Faye Hill', 'Link': 'https://www.americanrhetoric.com/speeches/anitahillsenatejudiciarystatement.htm'},
    {'Speaker': 'Carrie Chapman Catt', 'Link': 'https://www.americanrhetoric.com/speeches/carriechapmancattsuffragespeech.htm'},
    {'Speaker': 'Elizabeth Glaser', 'Link': 'https://www.americanrhetoric.com/speeches/elizabethglaser1992dnc.htm'},
    {'Speaker': 'Margaret Higgins Sanger', 'Link': 'https://www.americanrhetoric.com/speeches/margaretsangerchildrensera.html'},
    {'Speaker': 'Ursula Kroeber Le Guin', 'Link': 'https://www.americanrhetoric.com/speeches/ursulakleguinlefthandedcommencementspeech.htm'},
    {'Speaker': 'Elizabeth Gurley Flynn', 'Link': 'https://www.americanrhetoric.com/speeches/elizabethgurleyflynn.htm'},
    {'Speaker': 'Shirley Anita Chisholm', 'Link': 'https://www.americanrhetoric.com/speeches/shirleychisholmequalrights.htm'},
    {'Speaker': 'Anna Eleanor Roosevelt', 'Link': 'https://www.americanrhetoric.com/speeches/eleanorrooseveltdeclarationhumanrights.htm'}
]

df = pd.DataFrame(data)
df

Unnamed: 0,Speaker,Link
0,Barbara Charline Jordan,https://www.americanrhetoric.com/speeches/barb...
1,Barbara Charline Jordan,https://www.americanrhetoric.com/speeches/barb...
2,Anna Howard Shaw,https://www.americanrhetoric.com/speeches/anna...
3,Hillary Diane Rodham Clinton,https://www.americanrhetoric.com/speeches/hill...
4,Dorothy Ann Willis Richards,https://www.americanrhetoric.com/speeches/annr...
5,Margaret Chase Smith,https://www.americanrhetoric.com/speeches/marg...
6,Barbara Pierce Bush,https://www.americanrhetoric.com/speeches/barb...
7,Mary Fisher,https://www.americanrhetoric.com/speeches/mary...
8,Anna Eleanor Roosevelt,https://www.americanrhetoric.com/speeches/elea...
9,Geraldine Anne Ferraro,https://www.americanrhetoric.com/speeches/gfer...
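
Because the link list is built by hand, a quick check for duplicated or malformed URLs before scraping can catch copy-paste mistakes. The sketch below is only an illustration: `check_links` is a hypothetical helper, not part of the original notebook, and it only inspects the URL strings without contacting the website.

```python
from collections import Counter
from urllib.parse import urlparse


def check_links(records):
    """Return (duplicate_links, malformed_links) for a list of {'Speaker', 'Link'} dicts."""
    counts = Counter(record['Link'] for record in records)
    duplicates = [link for link, n in counts.items() if n > 1]
    malformed = []
    for record in records:
        parts = urlparse(record['Link'])
        # A usable link needs an http(s) scheme and a host name
        if parts.scheme not in ('http', 'https') or not parts.netloc:
            malformed.append(record['Link'])
    return duplicates, malformed


# Two entries copied from the list above, used here as sample input
sample = [
    {'Speaker': 'Mary Fisher', 'Link': 'https://www.americanrhetoric.com/speeches/maryfisher1992rnc.html'},
    {'Speaker': 'Emma Goldman', 'Link': 'https://www.americanrhetoric.com/speeches/emmagoldmanjuryaddress.htm'},
]
duplicates, malformed = check_links(sample)
print('duplicates:', duplicates)
print('malformed:', malformed)
```

In the notebook, the same helper could be called on the full `data` list before the DataFrame is created.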


## 3. Web scraping process
   Define a function called `scraping_speech` that takes a URL as input, scrapes the speech page, and returns the title, speaker, date, and transcript as a list.

In [108]:
def scraping_speech(url):
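    # Send a browser-like User-Agent with the request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }
    
    try:
        # A timeout keeps a stalled connection from hanging the notebook
        response = requests.get(url, headers=headers, timeout=30)
        # Raise an HTTPError (a RequestException) on 4xx/5xx responses instead of parsing an error page
        response.raise_for_status()
        html_string = response.text
        document = BeautifulSoup(html_string, "html.parser")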
        
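        # Drop the site prefix and remove all \n, \t and \r characters from the title
        # (fall back to an empty string when the page has no <title> element)
        title_text = document.title.text if document.title else ''
        title = re.sub(r'[\n\r\t]', '', title_text.replace("American Rhetoric:", ""))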
        
        # Extract the speaker and remove all \n, \t and \r characters
        speaker_element = document.find('b')
        speaker = re.sub(r'[\n\r\t]', '', speaker_element.text.strip()) if speaker_element else ''
        
        # Extract the date and remove all \n, \t and \r characters
        date_element = document.find('font', {'color': '#CE0A04', 'face': 'Arial'})
        date = re.sub(r'[\n\r\t]', '', date_element.text.strip()) if date_element else ''
        
        # Extract the speech transcript and remove all \n, \t and \r characters
        transcript = []
        transcript_elements = document.find_all('font', {'face': 'Verdana', 'size': '2'})
        for element in transcript_elements:
            transcript.append(re.sub(r'[\n\r\t]', '', element.text.strip()))
        transcript = ' '.join([line.replace('Book/CDs by Michael E. Eidenmuller, Published by McGraw-Hill (2008)', '') for line in transcript])
        
        return [title, speaker, date, transcript]
    
    except requests.exceptions.RequestException as e:
        print('Error:', e)
        return []

# Split the links into smaller chunks
links = df['Link'].tolist()
chunks = [links[i:i+10] for i in range(0, len(links), 10)]

# Scrape the speeches for each chunk of links and store the data in a list
data = []
for chunk in chunks:
    for link in chunk:
        speech_data = scraping_speech(link)
        if speech_data:
            data.append(speech_data)
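
One side effect of deleting `\n`, `\r` and `\t` outright is that words separated only by a line break get fused, which may be why values such as `MaryFisher` appear in the scraped output. A minimal sketch of an alternative, assuming it is acceptable to collapse every run of whitespace into a single space; `clean_text` is a hypothetical helper and not part of the original notebook:

```python
import re


def clean_text(text):
    """Collapse each run of whitespace (spaces, newlines, tabs) into one space."""
    return re.sub(r'\s+', ' ', text).strip()


raw = "Mary\r\n\tFisher"

# Pattern used in scraping_speech: the line break is deleted, fusing the words
fused = re.sub(r'[\n\r\t]', '', raw.strip())
print(fused)

# Alternative: the line break becomes a single space
print(clean_text(raw))
```

Swapping this helper into `scraping_speech` would change the scraped values, so the outputs shown in this notebook would no longer match exactly.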

In [109]:
# Create a dataframe from the scraped data
df = pd.DataFrame(data, columns=["Title", "Speaker", "Year", "Transcript"])

## 4. Formatting the data

In [110]:
# fill in the missing date with information looked up manually on the website
df.loc[df['Speaker'] == 'Anna Howard Shaw', 'Year'] = 'delivered 21 June 1915, Ogdensburg, New York'

In [111]:
# split the raw 'Year' value (date and location) at the first comma and keep only the date part
# only the year will be kept in a later step
df['Year'] = df['Year'].str.split(',', n=1, expand=True)[0]

In [112]:
# some date values are in formats that the standardize_date function (defined below) cannot parse.
# replace these values manually
df.loc[df['Year'] == 'delivered 9 December 1948 in Paris', ['Year']] = ['delivered 9 December 1948']

In [113]:
df.loc[df['Year'] == 'delivered 19 July 1984 at the Democratic National Convention', ['Year']] = ['delivered 19 July 1984']

In [114]:
df.loc[df['Year'] == 'delivered March 1925 New York', ['Year']] = ['delivered March 1925']

In [115]:
# standardize date to year format. 
df['Year'] = df['Year'].str.replace('delivered', '').str.strip()  # Removing 'delivered' prefix

def standardize_date(date_string):
    try:
        return datetime.strptime(date_string, '%d %B %Y').strftime('%Y')
    except ValueError:
        try:
            return datetime.strptime(date_string, '%B %d %Y').strftime('%Y')
        except ValueError:
            return None

df['Year'] = df['Year'].apply(standardize_date)

In [116]:
# fill the remaining null values in the 'Year' column by row index (these dates were in formats that standardize_date could not parse)
df.loc[15, 'Year'] = '1925'
df.loc[13, 'Year'] = '1917'
df.loc[18, 'Year'] = '1970'
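
Since only the year is kept in the end, most of the manual replacements above could be avoided by pulling the four-digit year straight out of the raw string. The following is a sketch under the assumption that each date string contains exactly one year between 1800 and 2099; `extract_year` and `YEAR_PATTERN` are hypothetical names, not part of the original notebook.

```python
import re

# Four-digit years from 1800 to 2099, bounded so longer numbers are not matched
YEAR_PATTERN = re.compile(r'\b(1[89]\d{2}|20\d{2})\b')


def extract_year(date_string):
    """Return the first four-digit year found in date_string, or None."""
    if not isinstance(date_string, str):
        return None
    match = YEAR_PATTERN.search(date_string)
    return match.group(1) if match else None


# Date strings in the formats that appear earlier in this notebook
for raw in ['delivered 9 December 1948 in Paris',
            'delivered 19 July 1984 at the Democratic National Convention',
            'delivered March 1925 New York']:
    print(raw, '->', extract_year(raw))
```

In the notebook this could be applied with `df['Year'].apply(extract_year)`. If a string mentions more than one year, only the first is returned, so the results would still need a quick manual review.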

In [117]:
# Strip everything up to and including the first hyphen, keeping only the speech title
df['Title'] = df['Title'].apply(lambda x: re.sub(r'^[^-]*-\s*', '', x))

In [118]:
# Remove punctuation from the 'Title' column
df['Title'] = df['Title'].apply(lambda x: re.sub(r'[^\w\s]', '', x))
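
To see what the two title regexes do, the sketch below applies them in sequence to made-up title strings. The strings are hypothetical examples, not actual page titles from the website, and `clean_title` is an illustrative wrapper around the same two patterns.

```python
import re


def clean_title(raw_title):
    """Apply the two title-cleaning steps used above, in the same order."""
    # Step 1: drop everything up to and including the first hyphen
    without_prefix = re.sub(r'^[^-]*-\s*', '', raw_title)
    # Step 2: drop every character that is neither a word character nor whitespace
    return re.sub(r'[^\w\s]', '', without_prefix)


print(clean_title("Speaker Name - A Speech's Title: Part 1"))
print(clean_title("Left-Handed Address"))
```

The second call shows a caveat: when a title has no prefix but contains a hyphen of its own, the first step removes the text before that hyphen, so hyphenated titles are worth checking by eye.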

In [119]:
df

Unnamed: 0,Title,Speaker,Year,Transcript
0,1976 Democratic National Convention Keynote Ad...,Barbara Charline Jordan,1976,Thank you ladies and gentlemen for a very warm...
1,Statement on House Judiciary Proceedings to Im...,Barbara Charline Jordan,1974,"Thank you, Mr. Chairman. Mr. Chairman, I join ..."
2,The Fundamental Principle of a Republic,Anna Howard Shaw,1915,"When I came into your hall tonight, I thought ..."
3,United Nations 4th World Conference Speech Wo...,Hillary Rodham Clinton,1995,"Thank you very much, Gertrude Mongella, for yo..."
4,Democratic National Convention Keynote Address,Ann Richards,1988,"Thank you, very much. Good evening, ladies and..."
5,Declaration of Conscience,Margaret Chase Smith,1950,r. President: I would like to speak briefly an...
6,Wellesley College Commencement Speech,Barbara Pierce Bush,1990,"Thank you very, very much,President Keohane. M..."
7,1992 Republican National Convention Address A...,MaryFisher,1992,Less than three months ago at platform hearing...
8,The Struggle for Human Rights,Eleanor Roosevelt,1948,I have come this evening to talk with you on o...
9,1984 Vice Presidential Nomination Acceptance A...,Geraldine Ferraro,1984,Ladies and gentlemen of the convention: My nam...


## 5. Saving the DataFrame to a CSV file

In [120]:
df.to_csv('women_speech.csv', index=False)
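
After saving, it can be worth reading the file back to confirm that the columns, the row count and the long transcript fields survived the round trip. The sketch below uses the standard library `csv` module rather than pandas to give an independent check; `summarize_csv` is a hypothetical helper, and the demonstration runs on a small temporary file with made-up rows rather than on `women_speech.csv` itself.

```python
import csv
import os
import tempfile

# Transcripts can be long; raise the per-field limit above the csv module's default
csv.field_size_limit(10**9)


def summarize_csv(path):
    """Return (column_names, row_count, indices_of_rows_with_empty_cells)."""
    with open(path, newline='', encoding='utf-8') as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        columns = reader.fieldnames or []
    incomplete = [i for i, row in enumerate(rows)
                  if any(not str(value or '').strip() for value in row.values())]
    return columns, len(rows), incomplete


# Demonstration on a temporary file with made-up rows
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.csv')
    with open(demo_path, 'w', newline='', encoding='utf-8') as handle:
        writer = csv.writer(handle)
        writer.writerow(['Title', 'Speaker', 'Year', 'Transcript'])
        writer.writerow(['Demo Title', 'Demo Speaker', '1950', 'Text with, a comma and "quotes".'])
        writer.writerow(['Another Title', 'Another Speaker', '', 'More text.'])
    demo_summary = summarize_csv(demo_path)

print(demo_summary)
```

Calling `summarize_csv('women_speech.csv')` in the notebook would list any rows that still have an empty title, speaker, year or transcript.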