# Steps

### Part 1
1. Get data from: "https://en.wikisource.org/wiki/Portal:State_of_the_Union_Speeches_by_United_States_Presidents"
2. Using BeautifulSoup, get all the speeches from 1900-2022
3. Load all speech URLs into a dictionary with year as key. Hint (get year with regex: `r"\b(19|20)\d{2}\b"`)
4. Loop through dictionary and save content of each speech in [year].txt files
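
The regex hint in step 3 can be tried out in isolation before touching the scraped page. The sketch below is one possible way to wrap it; `extract_year` is a hypothetical helper name and the sample strings are made up for illustration, not taken from the Wikisource page.

```python
import re

# The pattern from the hint. The group (19|20) captures only the century prefix.
YEAR_PATTERN = re.compile(r"\b(19|20)\d{2}\b")

def extract_year(link_text):
    """Return the first four-digit year (1900-2099) in link_text, or None."""
    match = YEAR_PATTERN.search(link_text)
    # group() with no argument returns the whole match, e.g. "1934";
    # group(1) would return only the captured prefix, e.g. "19".
    return match.group() if match else None

# Hypothetical link texts, for illustration only
samples = [
    "First State of the Union Address (1934)",
    "A list item without a year",
]
years = [extract_year(s) for s in samples]
print(years)
```

Because the pattern contains a capturing group, `re.findall` would return only the prefixes (`"19"` or `"20"`), so `search(...).group()` is the safer call here. Returning `None` for items without a year lets the caller skip them instead of crashing.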

### Part 2
1. Install nltk: `pip install nltk`
2. From the `data/gdp.csv` file, create a dataframe with year and GDP
3. From the `data/US presidents.csv` file, create a dataframe with year, president and party
4. From the text files created in Part 1, create a dictionary with year:speech
5. Clean the text by converting it to lowercase and removing '\n'
6. Get the words from the texts (`from nltk.tokenize import word_tokenize`). Clean them by removing stop words (`from nltk.corpus import stopwords`) and all non-alphabetic characters (including , and .)
7. Use `from nltk.stem import WordNetLemmatizer` to lemmatize all texts
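
Steps 4 to 6 can be sketched with the standard library alone, which makes the data flow easy to check before NLTK is added. The helper names `load_speeches`, `normalize` and `filter_tokens` are hypothetical. In the demo, `str.split()` and a three-word set are stand-ins for `word_tokenize` and `stopwords.words("english")`. Newlines are replaced with a space rather than deleted, on the assumption that the words on either side should stay separate.

```python
from pathlib import Path
import tempfile

def load_speeches(folder):
    """Read every [year].txt file in folder into a {year: text} dictionary."""
    speeches = {}
    for path in sorted(Path(folder).glob("*.txt")):
        # The file stem is the year, e.g. "1934" for "1934.txt"
        speeches[path.stem] = path.read_text(encoding="utf8")
    return speeches

def normalize(text):
    """Lowercase the text and replace newlines with spaces."""
    return text.lower().replace("\n", " ")

def filter_tokens(tokens, stop_words):
    """Keep purely alphabetic tokens that are not stop words."""
    return [t for t in tokens if t.isalpha() and t not in stop_words]

# Demo with a throwaway file (made-up content, not a real speech)
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "1934.txt").write_text("The Nation\nis Recovering.", encoding="utf8")
    speeches = {year: normalize(text) for year, text in load_speeches(folder).items()}

# Stand-ins for word_tokenize and stopwords.words("english")
tokens = filter_tokens(speeches["1934"].split(), {"the", "is", "a"})
print(speeches)
print(tokens)
```

Note that the naive `split()` leaves the full stop attached to the last word, so the `isalpha()` filter drops that word entirely. A real tokenizer such as `word_tokenize` separates punctuation into its own token, which is why step 6 asks for it.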

### Part 3
**[TextBlob](https://textblob.readthedocs.io/en/dev/quickstart.html)** returns the polarity and subjectivity of a text. Polarity lies in [-1, 1], where -1 is the most negative sentiment and 1 the most positive. Subjectivity lies in [0, 1] and measures how much of the text is personal opinion rather than factual information: the higher the value, the more opinionated the text.
1. Install textblob for sentiment analysis and wordcloud for word clouds (`pip install textblob wordcloud`), and download the VADER lexicon (`nltk.download('vader_lexicon')`)
2. Find the polarity and subjectivity of each text (Hint: `TextBlob(text).sentiment`)
3. Is there a correlation between negativity and recession years?
4. Create a word cloud for the cleaned up speeches of both Trump and Obama. What can be learned from the word clouds?
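
For step 3, one possible approach is to turn "recession year" into a 0/1 flag and compute its Pearson correlation with the polarity scores. The sketch below assumes that a recession year is one where GDP is lower than in the previous year; the exercise does not define the term, so this is only a working definition. `pearson` and `recession_flags` are hypothetical helper names, and all numbers are toy values, not real GDP or sentiment data.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def recession_flags(gdp_by_year):
    """Flag a year with 1 if its GDP is below the previous year's, else 0.

    Assumed definition of a recession year. The first year has no
    predecessor and is therefore left out of the result.
    """
    years = sorted(gdp_by_year)
    return {
        year: int(gdp_by_year[year] < gdp_by_year[prev])
        for prev, year in zip(years, years[1:])
    }

# Toy numbers for illustration only
gdp = {2000: 100.0, 2001: 98.0, 2002: 101.0, 2003: 104.0, 2004: 103.0}
polarity = {2001: 0.05, 2002: 0.15, 2003: 0.20, 2004: 0.08}

flags = recession_flags(gdp)
common_years = sorted(set(flags) & set(polarity))
r = pearson([polarity[y] for y in common_years],
            [flags[y] for y in common_years])
print(f"correlation between polarity and recession flag: {r:.2f}")
```

A negative coefficient would mean that speeches in flagged years tend to have lower polarity. With the real data, the polarity values would come from `TextBlob(text).sentiment` and the GDP values from the dataframe built in Part 2.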

import pandas as pd
import requests as rq
import re
import bs4

# 1.1 - Get data from: "https://en.wikisource.org/wiki/Portal:State_of_the_Union_Speeches_by_United_States_Presidents"
url = "https://en.wikisource.org/wiki/Portal:State_of_the_Union_Speeches_by_United_States_Presidents"
data = rq.get(url).text

# 1.2 - Using BeautifulSoup, get all the speeches from 1900-2022
soup = bs4.BeautifulSoup(data, 'html.parser')
# The slice indices depend on the current page layout and will need adjusting if the list changes
elem = soup.select('li')[144:269]

# 1.3 - Load all speech URLs into a dictionary with year as key. Hint (get year with regex: `r"\b(19|20)\d{2}\b"`)
year_reg = re.compile(r"\b(19|20)\d{2}\b")  # compile once, outside the loop
speech_urls = {}
for e in elem:
    year_match = year_reg.search(e.text)
    # Skip list items that have no year or no link instead of raising AttributeError
    if year_match is None or e.a is None:
        continue
    speech_urls[year_match.group()] = e.a.get('href')

# Download each speech page and keep only the paragraph text
speech_texts = {}
for year, path in speech_urls.items():
    wiki_url = "https://en.wikisource.org" + path
    speech_data = rq.get(wiki_url).text
    soup = bs4.BeautifulSoup(speech_data, 'html.parser')
    # select() returns Tag objects; join their text so the files hold plain text, not HTML
    speech_texts[year] = "\n".join(p.get_text() for p in soup.select('div>p'))

# print(speech_texts["2022"])  # quick check of one speech
# 1.4 - Loop through dictionary and save content of each speech in [year].txt files
for year, text in speech_texts.items():
    # The with block closes the file automatically, so no explicit close() is needed
    with open(year + ".txt", 'w', encoding="utf8") as file_object:
        file_object.write(text)