<br />

<h1 align="center">The</h1>
<h1 align="center">Natural Language Processing</h1>
<h5 align="center">of</h5>
<h1 align="center">South Africa's State of the Nation</h1>
<h1 align="center">Addresses</h1>
<h5 align="center">by</h5>
<h3 align="center">Monde Anna</h3>

<br />
<br />


<p>In this notebook, we explore the priorities and challenges our country has faced over the last 23 years, that is, from the year 2000 up to and including the year 2022. Each president, and therefore each presidential term, offers insight into those priorities and provides a launchpad for identifying the core focuses of that term.</p>

<br />
<br />


<h3 align="Center">Sources</h3>

<br />

<ul>
    <br />
    <li>
        <a href="http://syllabus.africacode.net/projects/data-science-specific/natural-language-processing/">Brief</a>
    </li>
    <br />
    <li><a href="https://www.gov.za/state-nation-address">SONA Speech Archive</a>
    </li>
</ul>

<br />
<br />


<h3 align="Center">Imports</h3>

<br />
<br />


In [1]:
from sklearn.feature_extraction.text import CountVectorizer
from matplotlib import pyplot as plt
from wordcloud import WordCloud
from string import punctuation
from datetime import datetime
from textblob import TextBlob
import seaborn as sns
import pandas as pd
import numpy as np
import requests
import json
import nltk
import bs4
import re


<br />

<h3 align="Center">Global Settings</h3>

<br />
<br />


In [2]:
BOLD = "bold"
N_COLS = 3
N_ROWS = 9
SEED = 69

sns.set(rc={
    "axes.labelpad": 12,
    "axes.labelweight": BOLD,
    "axes.titlepad": 24,
    "axes.titlesize": 18,
    "axes.titleweight": BOLD,
    "figure.figsize": (12,6),
    "figure.titlesize": 32,
    "figure.titleweight": BOLD,
})

nltk_data = [
    "averaged_perceptron_tagger",
    "omw-1.4",
    "punkt",
    "stopwords",
    "tagsets",
    "wordnet",
]

for datum in nltk_data:
    nltk.download(datum, quiet=True)


<br />

<h2 align="Center">Data Acquisition</h2>

<br />
<br />

<p>Using the URLs provided in the <i>json</i> file accompanying this project, we will scrape the State of the Nation speech archive website for the transcript of each address from the year 2000 onwards. Of particular importance to us are the <b><i>Date</i></b>, the <b><i>President</i></b> making the speech and the speech's <b><i>Transcript</i></b>.</p>

<br />
<br />
<br />


In [3]:
def get_president(soup):
    """Return the full name of the 21st-century president named in the page title."""
    presidents = {
        "Ramaphosa": "Cyril Ramaphosa",
        "Zuma": "Jacob Zuma",
        "Motlanthe": "Kgalema Motlanthe",
        "Mbeki": "Thabo Mbeki",
    }

    # get_text() also copes with a title element that contains nested tags.
    title = soup.find(class_="title").get_text().lower()

    # The first matching surname wins; an empty string means no match.
    return next(
        (
            full_name
            for surname, full_name in presidents.items()
            if surname.lower() in title
        ),
        "",
    )


In [4]:
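def get_date(soup):
    """Parse the page's date field (day, abbreviated month, year) into a date."""
    # Strip surrounding whitespace so it cannot trip up strptime.
    str_date = soup.find(class_="field-item even").text.strip()
    return datetime.strptime(str_date, "%d %b %Y").date()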
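
<p>The exact date text on the archive pages is not shown in this notebook, so the following is only a sketch of how the date parsing could tolerate small variations. It assumes the date might carry surrounding whitespace or spell the month out in full. The helper name <i>parse_sona_date</i> and the second candidate format are hypothetical; only <i>"%d %b %Y"</i> comes from the code above.</p>

```python
from datetime import date, datetime

# Candidate formats to try in order. Only "%d %b %Y" is taken from the
# notebook; "%d %B %Y" (full month name) is an assumed fallback.
DATE_FORMATS = ("%d %b %Y", "%d %B %Y")


def parse_sona_date(text):
    """Return a date parsed from text, trying each candidate format."""
    cleaned = text.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue  # this format did not match, try the next one
    raise ValueError(f"unrecognised date: {text!r}")


print(parse_sona_date(" 10 Feb 2022 "))
print(parse_sona_date("10 February 2022"))
```

<p>Raising a <i>ValueError</i> for unrecognised text keeps a malformed page from silently producing a wrong date.</p>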


In [5]:
def get_speech(soup):
    return " ".join(
        s.text
        for s in soup.find(class_="section section-content").find_all("p")
    )


In [6]:
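def get_speech_from_url(url):
    """Download one speech page and extract its president, date and text."""
    # Fail on a hung connection or an HTTP error (e.g. 504) instead of
    # parsing an error page; the 60-second timeout is an arbitrary choice.
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    soup = bs4.BeautifulSoup(response.text, "html.parser")

    return {
        "president": get_president(soup),
        "date": get_date(soup),
        "speech": get_speech(soup),
    }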


In [7]:
with open("../assets/speech_urls.json", "r") as file:
    speech_urls = json.load(file)


In [8]:
flattened_urls = [
    url
    for year in speech_urls
    for urls in year.values()
    for url in urls
]
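
<p>The layout of <i>speech_urls.json</i> is not shown in this notebook. The comprehension above would work if the file holds a list of objects, each mapping a year to a list of URLs. The sketch below illustrates that assumed layout with made-up entries; the years and <i>example.org</i> links are placeholders, not the project's real data.</p>

```python
# Assumed layout of speech_urls.json: a list of objects, each mapping a
# year to a list of URLs. All entries here are placeholders.
sample_speech_urls = [
    {"2000": ["https://example.org/sona-2000"]},
    {"2009": ["https://example.org/sona-2009-a", "https://example.org/sona-2009-b"]},
]

sample_flattened = [
    url
    for year in sample_speech_urls   # each `year` is a dict
    for urls in year.values()        # each value is a list of URLs
    for url in urls                  # each `url` is a single string
]

print(len(sample_flattened))
```

<p>Reading the three <i>for</i> clauses from top to bottom mirrors the nesting of the data, from the outer list down to the individual URL strings.</p>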


<br />

<h5 align="Center">Web Scraping</h5>

<br />

<p>Sadly, our source website is slow and at times returns a <i>504 Gateway Time-Out</i> error. As such, the data scraped by the <i>get_speech_from_url</i> function above has been stored using Jupyter Notebook's magic command <b><i>%store</i></b>, so the site is only scraped again when no stored copy exists.</p>
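
<p>One possible way to soften the time-outs described above is to retry a failed request after a short pause. The following is a minimal sketch, not part of the original project: the helper name <i>fetch_with_retries</i>, the number of attempts and the delays are all assumptions. It takes the fetching function as an argument, so it can be tried without touching the network.</p>

```python
import time


def fetch_with_retries(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, then 4x ... before the next try.
            sleep(base_delay * 2 ** attempt)
```

<p>In this notebook the call might look like <i>fetch_with_retries(get_speech_from_url, url)</i>. The helper only retries when the fetch raises, so it relies on a failed request surfacing as an exception.</p>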

<br />
<br />


In [9]:
%%capture

%store -r scrapped_speeches

if "scrapped_speeches" not in globals():
    scrapped_speeches = [get_speech_from_url(url) for url in flattened_urls]
    %store scrapped_speeches
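
<p>The <i>%store</i> magic is only available inside IPython. As a rough alternative for running the same load-or-scrape logic in a plain script, the results could be cached on disk with <i>pickle</i>, which can serialise the <i>date</i> objects in each record. The helper name <i>load_or_build</i> and any cache path passed to it are hypothetical.</p>

```python
import pickle
from pathlib import Path


def load_or_build(cache_path, build):
    """Return the cached object if the file exists, else build and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as file:
            return pickle.load(file)

    result = build()  # the expensive step, e.g. scraping every URL
    with cache_path.open("wb") as file:
        pickle.dump(result, file)
    return result
```

<p>A call could look like <i>load_or_build("scrapped_speeches.pkl", lambda: [get_speech_from_url(url) for url in flattened_urls])</i>. Pickle files should only be loaded when they come from a trusted source, such as a file the same script wrote earlier.</p>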
