# Speech Scraper
Data scientists often need creative ways to obtain data for an analysis. Web scraping is one common way to collect data from websites.

Here, we are going to obtain the Secretary of Defense's public speeches from 2014 through the present. These speeches are available [online here](https://www.defense.gov/News/Speeches/Customspeechwho/16001/), but there are over 200 of them, so we will build a quick scraper to collect them.

First, let's import a few key packages:

1. `requests`: this allows us to make requests to webpages
2. `BeautifulSoup`: this is a handy tool for parsing websites
3. `pandas`: this allows us to manipulate tabular data

In [101]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

Now we can define two functions for processing the web data. The first, `get_links`, collects the links to speeches from a listing page:

In [21]:
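def get_links(soup):
    """Return the href of every link inside a <div class="item"> on a listing page."""
    links = []
    for div in soup.find_all('div', {'class': 'item'}):
        # href=True skips anchors that have no href attribute
        for a in div.find_all('a', href=True):
            links.append(a['href'])
    print(str(len(links)) + " links were found")
    return links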
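To see what `get_links` picks up, here is a minimal sketch that runs a pared-down copy of it against a small hand-written HTML snippet. The snippet and its `example.com` URLs are made up for illustration; the real listing page's markup is only assumed to follow the same `div.item` pattern that the function above relies on.

```python
from bs4 import BeautifulSoup

def get_links(soup):
    """Collect the href of every <a> inside a <div class="item">."""
    links = []
    for div in soup.find_all("div", {"class": "item"}):
        for a in div.find_all("a", href=True):
            links.append(a["href"])
    return links

# Hypothetical markup: two "item" blocks plus one unrelated block
sample_html = """
<div class="item"><a href="https://example.com/speech-1">Speech 1</a></div>
<div class="item"><a href="https://example.com/speech-2">Speech 2</a></div>
<div class="sidebar"><a href="https://example.com/about">About</a></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")
links = get_links(soup)
print(links)
```

Only the two links inside `div.item` are returned; the sidebar link is ignored because its parent `div` has a different class.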

The second function, `process_speech`, parses the speech and transforms it into something we can use.

In [80]:
def process_speech(link, soup):
    """Extract the title, date, and full text of one speech page into a dict."""
    # Look up the article container once and reuse it
    article = soup.find('div', {'class': 'article-body'})
    title = article.find('h1').text
    date = soup.find('time').text

    # {'class': None} matches only <p> tags that have no class attribute
    body = article.find_all('p', {'class': None})
    # Strip each paragraph and follow it with a single space
    speech = ''.join(p.text.strip() + ' ' for p in body)

    return {'url': link,
            'title': title,
            'date': date,
            'speech': speech}

Now, we can obtain links to each of the respective speeches:

In [24]:
speech_links = []
base = 'https://www.defense.gov/News/Speeches/Customspeechwho/16001/'

In [30]:
# The listing spans pages 1 through 8; page 1 is the bare base URL
for i in range(1, 9):
    if i == 1:
        url = base
    else:
        url = '{0}?Page={1}'.format(base, i)
    
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    links = get_links(soup)
    speech_links += links

30 links were found
30 links were found
30 links were found
30 links were found
30 links were found
30 links were found
30 links were found
15 links were found
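The URL logic in the loop above can also be pulled into a small helper, which makes it easy to inspect the page addresses before sending any requests. This is only a sketch: `page_url` is a hypothetical name, and the `?Page=N` pattern is taken from the loop itself.

```python
def page_url(base, page):
    """Return the listing URL for a 1-indexed page number."""
    # Page 1 is the bare base URL; later pages add a ?Page=N query string
    return base if page == 1 else '{0}?Page={1}'.format(base, page)

base = 'https://www.defense.gov/News/Speeches/Customspeechwho/16001/'
urls = [page_url(base, i) for i in range(1, 9)]

print(len(urls))
print(urls[0])
print(urls[-1])
```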


In [77]:
speech_links = list(set(speech_links))

In [78]:
print('In total, {} speeches were found.'.format(len(speech_links)))

In total, 225 speeches were found.
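Note that `list(set(...))` removes duplicates but does not guarantee any particular order. If a reproducible order matters (for example, to keep the links in the order they appeared on the site), one alternative is `dict.fromkeys`, since dictionaries preserve insertion order in Python 3.7 and later. A minimal sketch with made-up values:

```python
# Toy list with repeats, standing in for speech_links
links = ['a', 'b', 'a', 'c', 'b']

# dict.fromkeys keeps the first occurrence of each key, in order
unique_links = list(dict.fromkeys(links))
print(unique_links)  # ['a', 'b', 'c']
```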


Now that we have links to the speeches, we can fetch the speeches themselves, load them into a DataFrame, and save them to a `.csv` file.

In [81]:
speeches = []
for link in speech_links:  # already de-duplicated above
    response = requests.get(link)
    soup = BeautifulSoup(response.text, "html.parser")
    scraped_speech = process_speech(link, soup)
    speeches.append(scraped_speech)
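With more than 200 requests, a single failed page will stop the loop above, and rapid-fire requests can strain the server. The sketch below shows one way to make the loop more forgiving: it pauses between requests and records failures instead of raising. `scrape_all` and `fake_fetch` are hypothetical names introduced for illustration, and the fetcher here is a stand-in that touches no network.

```python
import time

def scrape_all(links, fetch, delay=1.0):
    """Call fetch(link) for each link, pausing between requests.

    Returns (results, failures). A link whose fetch raises is recorded
    in failures instead of aborting the whole run.
    """
    results, failures = [], []
    for link in links:
        try:
            results.append(fetch(link))
        except Exception as exc:  # broad on purpose: keep the crawl going
            failures.append((link, repr(exc)))
        time.sleep(delay)
    return results, failures

# Stand-in fetcher for illustration; no network access involved
def fake_fetch(link):
    if 'bad' in link:
        raise ValueError('could not parse ' + link)
    return {'url': link, 'speech': 'text of ' + link}

results, failures = scrape_all(['good-1', 'bad-2', 'good-3'], fake_fetch, delay=0)
print(len(results), len(failures))  # 2 1
```

In this notebook, `fetch` could be a small function that calls `requests.get(link, timeout=10)`, then `response.raise_for_status()`, and finally passes the parsed soup to `process_speech`.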

In [97]:
df = pd.DataFrame.from_records(speeches)

In [98]:
# Keep only entries whose text is longer than 1,000 characters
df = df[df.speech.str.len() > 1000]

In [100]:
df.shape

(204, 4)
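The length filter reduces the 225 scraped pages to 204 rows. The notebook does not say why the cutoff is 1,000 characters; presumably very short entries are pages without a full transcript, but that is an assumption. Here is a minimal sketch of how the filter behaves, using made-up rows:

```python
import pandas as pd

# Toy records: one long text and one short stub (both invented)
toy = pd.DataFrame.from_records([
    {'title': 'Full speech', 'speech': 'word ' * 300},  # 1,500 characters
    {'title': 'Stub page', 'speech': 'Transcript unavailable.'},
])

# .str.len() gives the character count of each row's text
kept = toy[toy.speech.str.len() > 1000]
print(kept.title.tolist())  # ['Full speech']
```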

In [99]:
df.to_csv('SecDef_Speeches.csv', index=False)