# README
#### The following code is to **rescrape all the speeches from the UCSB website**
#### The scraped data is available in "all_presidential_speeches.csv"

## You will likely NOT need to run this entire file unless you want to rescrape
#### The following code takes 25-30 minutes to run everything, so it's best to access the pre-scraped data in "all_presidential_speeches.csv"

- scrape_all( ) builds a pandas dataframe (df) of 6000+ speeches in the categories listed in "list_to_scrape.csv" and writes it to a csv
- To change the categories, edit list_to_scrape.csv
- The scraped data is in a csv called "all_presidential_speeches.csv"
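
If you only need the data, a minimal sketch for loading the pre-scraped csv is shown here. The helper name `load_speeches` is hypothetical, and the sample row is made up purely so the example runs without the real file. The sketch assumes the csv was written with its index as the first column, as scrape_all( ) does.

```python
import io
import pandas as pd

def load_speeches(source):
    """Load the pre-scraped speeches; `source` is a path or a file-like object."""
    # The first csv column is the saved df index, so read it back as the index
    return pd.read_csv(source, index_col=0)

# Made-up sample standing in for "all_presidential_speeches.csv"
sample = io.StringIO(
    ",title,date,year,president,president_id,content,url,footnote,speech_type\n"
    '0,Sample Title,"January 1, 2000",2000,Sample President,1,'
    "Sample text.,https://example.org,No footnote,oral_address\n"
)
speeches = load_speeches(sample)

# Number of speeches per category
counts = speeches["speech_type"].value_counts()
```

With the real file, pass its path instead: `load_speeches("all_presidential_speeches.csv")`.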


# To run everything:
- uncomment scrape_all( ) to run everything

# How it works:
- scrape_all( ) does the following:
    - reads list_to_scrape.csv
    - calls dict_of_all_presidents( )
    - calls feed_urls( ), which calls the following:
        - create_pages( )
        - get_links( )
        - get_speech( )
    - calls removeQ_A( ) to drop Q and A transcripts
    - writes all speeches from all categories (e.g. oral addresses, farewell addresses, etc.) to a csv

In [11]:
import os
import csv
import pandas as pd
from datetime import datetime
import requests
from bs4 import BeautifulSoup

In [2]:
def scrape(url):
    content = requests.get(url)
    return BeautifulSoup(content.text, 'html.parser')

dict_of_all_presidents( ):
- returns a dict of presidents mapped to an id_number
- is called by scrape_all( )

In [4]:
def dict_of_all_presidents():
    all_presidents_url = 'https://www.presidency.ucsb.edu/presidents'
    soup = scrape(all_presidents_url)
    # collect every text node on the page; the president names sit between
    # 'Donald J. Trump' and 'George Washington'
    texts = soup.find_all(string=True)
    start_index = texts.index('Donald J. Trump')
    end_index = texts.index('George Washington')
    texts = texts[start_index: end_index+1]
    dict_all_prezs = dict()
    count = 45
    for elem in texts:
        if not elem.isdigit() and elem != ' to ' and elem != ' ' and elem != '\n':
            dict_all_prezs[elem] = count
            count -= 1
    # print(dict_all_prezs)
    return dict_all_prezs
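
The numbering step can be illustrated on its own. This is a minimal sketch with a hypothetical helper, `number_presidents`, and placeholder names. It assumes the names arrive newest first, as on the page above, and derives the starting number from the list length instead of hard-coding it.

```python
def number_presidents(names_newest_first):
    """Map each name to an id, counting down from the length of the list."""
    total = len(names_newest_first)
    ids = {}
    for offset, name in enumerate(names_newest_first):
        # A repeated name is overwritten, so it keeps the lowest number
        ids[name] = total - offset
    return ids

# Placeholder names, newest first
example_ids = number_presidents(["Third", "Second", "First"])
```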

create_pages( ) parameters: url, page_count
- url is the listing page for one speech category
- page_count is the total number of items in that category
    - it comes from the count column of list_to_scrape.csv
    - the function requests 10 items per page

- create_pages( ):
    - is called by feed_urls( )
    - returns a list [ ] with the URL of every listing page for that category

In [10]:
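def create_pages(url, page_count):
    items_per_page = 'items_per_page=10'
    page_lst = []
    # the first listing page needs no page parameter
    page_lst.append(url + "?" + items_per_page)
    # remaining pages are numbered from 1; the + 1 includes the last,
    # partially filled page
    for i in range(1, (page_count - 1) // 10 + 1):
        page = "page=" + str(i)
        page_lst.append(url + "?" + page + "&" + items_per_page)
    return page_lst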
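
To check the paging arithmetic by hand, here is a small standalone sketch. The helper `page_urls` and the example.org URL are hypothetical. It assumes the first listing page takes no page parameter and later pages are numbered from 1, which is how the URLs above are built.

```python
import math

def page_urls(base_url, total_items, per_page=10):
    """Build one URL per listing page needed to show `total_items` items."""
    # Always request at least the first page, even for an empty category
    page_total = max(1, math.ceil(total_items / per_page))
    urls = [f"{base_url}?items_per_page={per_page}"]
    for page in range(1, page_total):
        urls.append(f"{base_url}?page={page}&items_per_page={per_page}")
    return urls

# 25 items at 10 per page need three pages: the first page, page=1, and page=2
example_pages = page_urls("https://example.org/list", 25)
```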

- get_speech( ) 
    - is called by feed_urls( )
    - scrapes the text of the speech and places it into a one-row df
    - returns None if the author is not in the dict of presidents
    - otherwise returns a df with the following columns:
        - title
        - date
        - year
        - president
        - president_id
        - content
        - url
        - footnote

In [6]:
def get_speech(url, dict_all_prezs): 
    soup = scrape(url)
    # title
    if soup.find("div", {"class": "field-ds-doc-title"}) == None:
        title = "None"
    else:
        title = soup.find("div", {"class": "field-ds-doc-title"}).get_text().strip() 
    #content
    if soup.find("div", {"class": "field-docs-content"}) == None:
        content = "None"
    else:
        content = soup.find("div", {"class": "field-docs-content"}).get_text().strip()
    #date
    if soup.find("span", {"class": "date-display-single"}) == None:
        date = "None"
    else:
        date = soup.find("span", {"class": "date-display-single"}).get_text().strip()
    #president
    if soup.find("div", {"class": "field-title"}) == None:
        president = "None"
    else: 
        president = soup.find("div", {"class": "field-title"}).get_text().strip()
    #removes non president speeches
    if president not in dict_all_prezs: 
        return None
    president_id = dict_all_prezs[president]
    #url to the direct website 
    if soup.find("div", {"class": "field-prez-document-citation"}) == None:
        link2site = "None"
    else:
        link2site = soup.find("div", {"class": "field-prez-document-citation"}).get_text().strip()
        link2site_index = link2site.find('https')
        link2site = link2site[link2site_index:]
    #footnotes
    if soup.find("div", {"class": "field-docs-footnote"}) == None:
        footnote = "None"
    else:
        footnote = soup.find("div", {"class": "field-docs-footnote"}).get_text().strip()
    # the year is the last four characters of the date string
    year = date.strip()[-4:]
    df = pd.DataFrame({'title': [title],
                   'date': [date],
                   'year': [year],
                   'president': [president],
                   'president_id': [president_id],
                   'content': [content],
                   'url': [link2site],
                   'footnote': [footnote]
                   })
    return df
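
The year could also be parsed rather than sliced. The following is a sketch only: `parse_year` is a hypothetical helper, and the "Month DD, YYYY" format is an assumption about how the site prints dates, so check it against a real page before relying on it.

```python
from datetime import datetime

def parse_year(date_text, date_format="%B %d, %Y"):
    """Return the year as an int, or None if the text is not a date."""
    try:
        return datetime.strptime(date_text.strip(), date_format).year
    except ValueError:
        # Covers the "None" placeholder and any unexpected date format
        return None

example_year = parse_year("January 20, 2017")
missing_year = parse_year("None")
```

Unlike slicing, this returns None for the "None" placeholder instead of the string "None".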

- feed_urls( ): 
    - is called by scrape_all( )
    - calls create_pages( ), get_links( ), get_speech( )
    - returns a df of all the speeches in that category

In [12]:
def feed_urls(count, name, url, dict_all_prezs):
    print('Starting scrape for ' + name)
    pages = create_pages(url, count)
    all_speeches = []
    for each_page in pages:
        for link in get_links(each_page):
            spch = get_speech(link, dict_all_prezs)
            # get_speech returns None for non-president documents
            if spch is not None:
                all_speeches.append(spch)
    col_names = ['title', 'date', 'year', 'president', 'president_id', 'content', 'url', 'footnote']
    if all_speeches:
        # each element is a one-row df, so stack them into a single df
        df = pd.concat(all_speeches, ignore_index=True)[col_names]
    else:
        df = pd.DataFrame(columns=col_names)
    print('Finished scraping ' + name, '\n', df)
    return df

- get_links(): 
    - is called by feed_urls()
    - opens one listing page from that category
    - returns all_links, a list with the URL of each speech on that page

In [13]:
def get_links(url):
    soup = scrape(url)
    content = soup.find_all("div", {"class": "field-title"})
    all_links = []
    for i in content:
        link = "https://www.presidency.ucsb.edu/" + str(i.find("a").get("href"))
        all_links.append(link)
    return all_links
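
Joining the base URL and the href by string concatenation can produce a double slash when the href already starts with "/". A small sketch using the standard library's `urljoin` avoids that. The helper `absolute_link` and the example href are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.presidency.ucsb.edu/"

def absolute_link(href, base_url=BASE_URL):
    """Resolve a relative href against the site's base URL."""
    return urljoin(base_url, href)

# Both forms of a relative href resolve to the same absolute URL
with_slash = absolute_link("/documents/example")
without_slash = absolute_link("documents/example")
```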

- removeQ_A( ):
    - takes in the df that contains all the speeches from every category
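    - drops any speech that is a Q and A speech (its content contains 'Q. ')
    - also drops rows whose content is missing or shorter than 10 characters
    - returns the df without the dropped rows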

In [16]:
def removeQ_A(df):
    dropLst = []
    for index, row in df.iterrows():
        try:
            # drop near-empty content and transcripts with 'Q. ' question markers
            if len(row['content']) < 10 or row['content'].find('Q. ') != -1:
                dropLst.append(index)
        except TypeError:
            # content is not a string (e.g. missing), so drop the row
            dropLst.append(index)
    df = df.drop(index=dropLst)
    keep_col =  ['title', 'date', 'year', 'president', 'president_id', 'content', 'url', 'footnote', 'speech_type']
    df = df[keep_col]
    return df
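
The same filter can be written without an explicit loop. This is a sketch of an alternative, not the method used above: `drop_q_and_a` is a hypothetical name, and the sample rows are made up for illustration.

```python
import pandas as pd

def drop_q_and_a(df, min_length=10):
    """Keep rows whose content is text, long enough, and has no 'Q. ' marker."""
    content = df["content"]
    # Missing or non-string content cannot be a usable speech
    is_text = content.apply(lambda value: isinstance(value, str))
    text = content.where(is_text, "")
    too_short = text.str.len() < min_length
    has_question = text.str.contains("Q. ", regex=False)
    return df[is_text & ~too_short & ~has_question]

# Made-up rows: only the first one should survive
sample_df = pd.DataFrame({"content": [
    "A long enough speech text.",
    "Q. What about taxes? A long answer.",
    "short",
    None,
]})
kept = drop_q_and_a(sample_df)
```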

- scrape_all( ):
    - calls dict_of_all_presidents( )
    - calls feed_urls( ) to scrape every category listed in "list_to_scrape.csv"
        - feed_urls( ) calls create_pages( ), get_links( ), get_speech( )

TO DO:
- Go to the UCSB website to look at which categories of speeches you want to scrape
- add those categories to list_to_scrape.csv
- _IF YOU WANT TO RERUN EVERYTHING_, uncomment "scrape_all()" in the following code chunk

In [17]:
def scrape_all():
    # TO DO: Change the csv_file to your csv_file
    csv_file = '/Users/Andey/Desktop/presidential_speeches/scraping_data/list_to_scrape.csv'
    dict_all_prezs = dict_of_all_presidents()
    df_scrape_categories = pd.read_csv(csv_file)
    frames = []
    for i in range(len(df_scrape_categories)):
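        # read the row for this category from list_to_scrape.csv
        count = df_scrape_categories.iloc[i]['count']
        name = df_scrape_categories.iloc[i]['name']
        url = df_scrape_categories.iloc[i]['url']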
        
        #df_speech: all speeches of "name" category (i.e. oral_address)
        df_speech = feed_urls(count, name, url, dict_all_prezs)
        df_speech['speech_type'] = name
        frames.append(df_speech)
        
    #concatenates all speech types (i.e. oral address, farewell speeches, etc.) into 1 df
    result = pd.concat(frames, ignore_index=True)
    
    #drops duplicates
    # drop_duplicates returns a new df, so keep its result
    result = result.drop_duplicates(subset=['content'])
    
    #TO DO: Change the path to your output path
    all_prez_df_path = '/Users/Andey/Desktop/presidential_speeches/scraping_data/all_presidential_speeches.csv'
    keep_col =  ['title', 'date', 'year', 'president', 'president_id', 'content', 'url', 'footnote', 'speech_type']
    new_df = result[keep_col]
    # removeQ_A returns a new df, so keep its result
    new_df = removeQ_A(new_df)
    
    #writing new_df into a csv with the path above
    new_df.to_csv(all_prez_df_path, index=True)
        
    return None

#TO DO: uncomment scrape_all() to run everything
#scrape_all()
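
Rather than editing two absolute paths, both could be derived from one data directory. This is a sketch under that assumption: `data_paths` is a hypothetical helper, and "scraping_data" is taken from the paths used in scrape_all( ).

```python
from pathlib import Path

def data_paths(data_dir):
    """Return the input and output csv paths inside one data directory."""
    data_dir = Path(data_dir)
    csv_file = data_dir / "list_to_scrape.csv"
    all_prez_df_path = data_dir / "all_presidential_speeches.csv"
    return csv_file, all_prez_df_path

# Relative to wherever the notebook is run from
csv_file, all_prez_df_path = data_paths("scraping_data")
```

pandas accepts these Path objects directly in read_csv and to_csv.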