## Web Scraping All Presidential Speeches
- Lihua Xiong, lx559

Compared to the original code, I made the following changes:
- specified the "html5lib" parser explicitly;
- changed the URL routing logic in an attempt to scrape all of the speeches:
    - first, fetch the list of US presidents from USPresident = http://millercenter.org/president;
    - second, for each president, go to his page (e.g. https://millercenter.org/president/washington), where multiple speeches are listed;
    - third, scrape all of the speeches listed for the current president.
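
The routing steps above depend on turning the links found on each page into full URLs. The notebook does this by concatenating `baseurl` with the scraped `href`, which works only while every link is site-relative. A minimal sketch of a more forgiving approach, using the standard library's `urljoin`, is shown here; the helper name `absolute_url` is hypothetical and not part of the notebook.

```python
from urllib.parse import urljoin

BASE_URL = 'http://millercenter.org'  # same value as `baseurl` in the notebook


def absolute_url(href, base=BASE_URL):
    """Resolve a scraped href against the site root.

    Site-relative links ('/president/washington') are joined to the base;
    links that are already absolute are returned unchanged.
    """
    return urljoin(base + '/', href)


# Example with the president page mentioned in the list above
print(absolute_url('/president/washington'))
```

With plain concatenation, an `href` that is already absolute would produce a malformed URL; `urljoin` handles both cases.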

In [102]:
import bs4
import requests

In [103]:
header = {'User-Agent':
          'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}

baseurl = 'http://millercenter.org'
USPresident = 'http://millercenter.org/president'


# fetch the page listing all presidents, then parse it
response = requests.get(USPresident, headers=header)
soup = bs4.BeautifulSoup(response.text, "html5lib")

In [169]:
speeches = {}  # Unused holder kept from the original code; speeches are written straight to speeches.txt

# loop over each president
for listing in soup.find_all('p', class_='views-field--title'):
    
    currentPresident = listing.text
    print('-------------------------------\n\n')
    print('Working on president: ' + currentPresident)
    
    presURL = listing.a["href"]
    
    # go to url for current president (send the same User-Agent header as the first request)
    presSoup = bs4.BeautifulSoup(requests.get(baseurl + presURL, headers=header).content, 'html5lib')
    
    # find speeches of current president
    for speech in presSoup.find_all('h2', class_='speech-title'):
        speechTitle = speech.span.text
        print('Got Speech title: ' + speechTitle)
        
        # go to url for current speech
        speechURL = speech.a['href']
        speechSoup = bs4.BeautifulSoup(requests.get(baseurl + speechURL, headers=header).content, 'html5lib')
        
        # skip pages without a transcript block instead of failing on None
        transcriptDiv = speechSoup.find('div', class_='view-transcript')
        if transcriptDiv is None:
            print('No transcript found, skipping.\n\n')
            continue
        
        with open('speeches.txt', 'a', encoding='utf-8') as myfile:
            myfile.write(str({'President': currentPresident,
                              'Speech Title': speechTitle}) + '\n')
            myfile.write('Transcript: \n')
            for para in transcriptDiv.find_all('p'):
                myfile.write(para.text + '\n')
        
        print('Successfully wrote speech to file!\n\n')


-------------------------------


Working on president: George Washington
Got Speech title: April 30, 1789: First Inaugural Address
Successfully wrote speech to file!


Got Speech title: April 22, 1793: Proclamation of Neutrality
Successfully wrote speech to file!


Got Speech title: August 29, 1796: Talk to the Cherokee Nation
Successfully wrote speech to file!


-------------------------------


Working on president: John Adams
Got Speech title: March 4, 1797: Inaugural Address
Successfully wrote speech to file!


Got Speech title: March 23, 1798: Proclamation of Day of Fasting, Humiliation and Prayer
Successfully wrote speech to file!


Got Speech title: May 21, 1800: Proclamation of Pardons to Those Engaged in Fries Rebellion
Successfully wrote speech to file!


-------------------------------


Working on president: Thomas Jefferson
Got Speech title: March 4, 1801: First Inaugural Address
Successfully wrote speech to file!


Got Speech title: June 20, 1803: Instructions to Captain

KeyboardInterrupt: 
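
The scraping cell above stores each speech as the `str()` of a dict followed by free-form paragraph lines, which is awkward to parse back later. One possible alternative, sketched here under assumptions, is to write one JSON object per line. The helper names `speech_record` and `append_speech` are hypothetical and not part of the notebook; the field names follow the dict keys used in the original code.

```python
import json


def speech_record(president, title, paragraphs):
    """Build a one-line JSON string for a single speech.

    `paragraphs` is a list of paragraph strings, e.g. the `.text` of each
    <p> element found in the transcript block.
    """
    record = {
        'President': president.strip(),
        'Speech Title': title.strip(),
        # Join paragraphs so the whole speech sits on one line
        'Transcript': ' '.join(p.strip() for p in paragraphs),
    }
    return json.dumps(record, ensure_ascii=False)


def append_speech(path, president, title, paragraphs):
    """Append one speech to `path`, one JSON object per line."""
    with open(path, 'a', encoding='utf-8') as myfile:
        myfile.write(speech_record(president, title, paragraphs) + '\n')
```

Each line of the resulting file can then be read back with `json.loads`, so no custom parsing of the transcript is needed.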

### The Original Code Online

speeches = {}  # Holder for current speech in each loop

for listing in soup.findAll(attrs={'id':'listing'}):
    
    for element in listing.nextGenerator():
        
        # Try to get president name, if not, then current element is something else
        try:
            if element.attrs.get('class')[0] == 'president':
                currentPresident = element.text[:-2]
                print('-------------------------------\n\n')
                print('Working on president: ' + currentPresident)
                
        except:
            pass
        
        # Try to get title of speech, if not, then current element is something else
        try:
            if element.attrs.get('class')[0] == 'title':
                speechTitle = element.text
                print('Got Speech title: ' + speechTitle)
        except:
            pass
        
        # Get speech link and then actual speech text
        try:
            if element.attrs.get('class')[0] == 'icons':
                for link in element.findAll('a'):
                    speechLink = link.get('href')
                    break # Text link will be first link in loop
                    
                # Get speech text for current speech title
                speechTextSoup = bs4.BeautifulSoup(requests.get(baseurl + speechLink).content)
                for text in speechTextSoup.findAll(name='div', attrs={'id':'transcript'}):
                    transcript = text.text.split('Transcript')[1:][0]
                    print('Received transcript {} chars in length!'.format(len(transcript)))
                    break
                    
                # Now that we have everything for one speech. Save it to file
                with open('speeches.txt','a') as myfile:
                    myfile.write(str({'President':currentPresident.replace('\n',' '),
                                      'Transcript':transcript.replace('\n',' '), # replace 1 lines, b/c we need whole speech on one
                                      'Speech Title': speechTitle.replace('\n',' ')}) + '\n') # write \n for next speech
                    
                print('Successfully wrote speech to file!\n\n')
                
                
        except:
            pass

