## About this code
The purpose of this notebook is to show an example of how we can extract information from web pages with a reasonable level of accuracy.
I've made this notebook as an example to help you in case you need to scrape information in the future.
Here, we are collecting the speeches of previous presidents of the United States.

In [2]:
import os
from os import walk
from bs4 import BeautifulSoup
import requests

### File structure
Each folder is named after one of the American presidents.
Inside each folder there is an HTML file, "initial.txt", scraped from millercenter.org,
containing links to all the speeches of the corresponding president.

In [3]:
# parent folder
path = 'presidents-speeches/'

# will contain all the data, one dict per president:
# {'president_name': ..., 'speeches_links': [...]}
presidential_dictionary_list = []
for (root, dirs, files) in walk(path):
    # go through child folders, skip the first parent folder
    # (a president folder is expected to hold exactly one subfolder)
    if len(dirs) == 1:
        president_speeches_dict = {}
        # get the name of the president from the folder path
        president_name = os.path.basename(root)
        president_speeches_dict['president_name'] = president_name
        speeches_links = []
        file_path = os.path.join(root, files[0])
        # open the "initial.txt" file for each president
        with open(file_path) as html_links_file:
            # name the parser explicitly so the result does not depend on what is installed
            soup = BeautifulSoup(html_links_file, 'html.parser')
            # extract the speeches links from the <a> tags that have an href
            for link in soup.find_all('a', href=True):
                speeches_links.append(link.get('href'))
        president_speeches_dict['speeches_links'] = speeches_links
        presidential_dictionary_list.append(president_speeches_dict)
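
The `len(dirs) == 1` check is easy to misread, so here is a small self-contained sketch of how it behaves. It builds a throwaway copy of the folder layout in a temporary directory and walks it. The layout is an assumption: each president folder holds "initial.txt" plus a single "speeches" subfolder, and the placeholder file content is made up for illustration.

```python
import os
import tempfile

president_folders = []
with tempfile.TemporaryDirectory() as tmp:
    parent = os.path.join(tmp, "presidents-speeches")
    # Assumed layout: <parent>/<president>/initial.txt and <parent>/<president>/speeches/
    for name in ["Andrew Jackson", "Barack Obama"]:
        os.makedirs(os.path.join(parent, name, "speeches"))
        with open(os.path.join(parent, name, "initial.txt"), "w") as f:
            f.write("<a href='/placeholder'>placeholder</a>")

    for root, dirs, files in os.walk(parent):
        # parent folder: 2 subfolders, president folder: 1, speeches folder: 0
        if len(dirs) == 1:
            president_folders.append(os.path.basename(root))

president_folders.sort()  # os.walk does not guarantee an order
print(president_folders)
```

Note that under this layout a parent folder containing only one president would also pass the check, so the test relies on there being at least two president folders.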

In [4]:
presidential_dictionary_list

[{'president_name': 'Andrew Jackson',
  'speeches_links': ['/the-presidency/presidential-speeches/march-4-1837-farewell-address',
   '/the-presidency/presidential-speeches/december-21-1836-statement-independence-texas',
   '/the-presidency/presidential-speeches/december-5-1836-eighth-annual-message-congress',
   '/the-presidency/presidential-speeches/december-7-1835-seventh-annual-address-congress',
   '/the-presidency/presidential-speeches/december-1-1834-sixth-annual-message-congress',
   '/the-presidency/presidential-speeches/april-21-1834-addendum-protest-senate-censure',
   '/the-presidency/presidential-speeches/april-15-1834-protest-senate-censure',
   '/the-presidency/presidential-speeches/december-12-1833-message-constitutional-rights-and',
   '/the-presidency/presidential-speeches/december-3-1833-fifth-annual-message-congress',
   '/the-presidency/presidential-speeches/september-18-1833-message-regarding-bank-united-states',
   '/the-presidency/presidential-speeches/march-4-18

### Fetch speeches from the site
Now that we have the links to the speeches, we can fetch them and extract the content.
We will save them in files named date__title, inside the folder of the corresponding president.
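
The stored links are site-relative paths (they start with `/`), so each one has to be combined with the site address before it can be fetched. A minimal sketch of one way to do that with the standard library's `urljoin`; the `BASE_URL` name is only for illustration, and the link is the first one from the output above.

```python
from urllib.parse import urljoin

# Illustrative constant name; the site address comes from the notebook
BASE_URL = "https://millercenter.org/"
speech_link = "/the-presidency/presidential-speeches/march-4-1837-farewell-address"

# urljoin resolves the path against the base, so a base ending in "/"
# joined with a path starting with "/" does not produce a double slash
speech_url = urljoin(BASE_URL, speech_link)
print(speech_url)
```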

In [5]:
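# format speech titles to be appropriate for use as file names
import re
def file_name_format(text):
    # keep only letters, digits, hyphens and underscores
    # (raw string, so the backslash is not treated as a string escape)
    return re.sub(r'[^A-Za-z0-9\-_]+', '', text)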
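
To make the date__title naming concrete, here is a short sketch of how a file name could be assembled. The heading text is a hypothetical example in the "date: title" form that the scraping cell splits on; only the title part goes through `file_name_format`, so the date keeps its spaces and comma.

```python
import re

def file_name_format(text):
    # keep only letters, digits, hyphens and underscores
    return re.sub(r'[^A-Za-z0-9\-_]+', '', text)

# Hypothetical heading text, assumed to look like "date: title"
heading = "March 4, 1837: Farewell Address"
date, speech_title = [part.strip() for part in heading.split(":", 1)]

file_name = "{}__{}.txt".format(date, file_name_format(speech_title))
print(file_name)  # March 4, 1837__FarewellAddress.txt
```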

In [6]:
for presidential_dictionary in presidential_dictionary_list:
    president_name = presidential_dictionary['president_name']
    print("extracting speeches of president {}".format(president_name))
    # make sure the output folder exists before writing into it
    speeches_dir = os.path.join(path, president_name, "speeches")
    os.makedirs(speeches_dir, exist_ok=True)
    total_speeches = len(presidential_dictionary['speeches_links'])
    for speech_counter, speech_link in enumerate(presidential_dictionary['speeches_links'], start=1):
        print("speech {}/{}".format(speech_counter, total_speeches))
        try:
            # the links already start with "/", so no extra slash is needed
            resp = requests.get("https://millercenter.org{}".format(speech_link))
        except requests.exceptions.RequestException as e:
            # skip this speech; otherwise resp would be undefined or left over from the previous one
            print(e)
            continue
        if resp.status_code != 200:
            continue
        soup = BeautifulSoup(resp.content, 'html.parser')
        # the speech is present inside a <div> with class "view-transcript", as a set of <p> tags
        transcript = soup.find('div', {"class": "view-transcript"})  # there is only one view-transcript per page
        # get the title and date of every speech, and name the file accordingly
        title = soup.find('h2', {"class": "presidential-speeches--title"})
        if transcript is None or title is None or title.find('span') is None:
            continue
        president_speech = ' '.join(p.text for p in transcript.find_all('p'))
        # the heading text has the form "date: title"
        date, _, speech_title = title.find('span').text.partition(":")
        file_name = "{}__{}.txt".format(date.strip(), file_name_format(speech_title.strip()))
        with open(os.path.join(speeches_dir, file_name), "w", encoding="utf8") as out_file:
            # write the text itself, not the string form of its UTF-8 bytes
            out_file.write(president_speech)

extracting speeches of president Andrew Jackson
speech 1/26
speech 2/26
speech 3/26
speech 4/26
speech 5/26
speech 6/26
speech 7/26
speech 8/26
speech 9/26
speech 10/26
speech 11/26
speech 12/26
speech 13/26
speech 14/26
speech 15/26
speech 16/26
speech 17/26
speech 18/26
speech 19/26
speech 20/26
speech 21/26
speech 22/26
speech 23/26
speech 24/26
speech 25/26
speech 26/26
extracting speeches of president Barack Obama
speech 1/50
speech 2/50
speech 3/50
speech 4/50
speech 5/50
speech 6/50
speech 7/50
speech 8/50
speech 9/50
speech 10/50
speech 11/50
speech 12/50
speech 13/50
speech 14/50
speech 15/50
speech 16/50
speech 17/50
speech 18/50
speech 19/50
speech 20/50
speech 21/50
speech 22/50
speech 23/50
speech 24/50
speech 25/50
speech 26/50
speech 27/50
speech 28/50
speech 29/50
speech 30/50
speech 31/50
speech 32/50
speech 33/50
speech 34/50
speech 35/50
speech 36/50
speech 37/50
speech 38/50
speech 39/50
speech 40/50
speech 41/50
speech 42/50
speech 43/50
speech 44/50
speech 45/50
s

### fin
After running this code, all the speeches are saved inside the folder of their corresponding president.