# Data Loader

Data flows through this project's pipeline as follows:

1. _This_ notebook parses the raw text files and concatenates them into a single file called **`speeches.csv`**.
2. **`pre-processing.ipynb`** ingests **`speeches.csv`**, processes it further, and writes the result to **`model_input.csv`**.
3. Finally, **`classifer.ipynb`** ingests **`model_input.csv`** and uses it for class balancing and classification.

In [1]:
import pandas as pd
import codecs
import os

# only SONA speeches
path = './speeches/'

# additional speeches for minority classes
# path = './speeches_additional/'

In [2]:
# each row of the DataFrame will represent one line from a speech
df = pd.DataFrame(columns=['labels', 'text', 'year'])

## Encode President Names
President names are encoded as integers in chronological order, from 0 (deKlerk) to 5 (Ramaphosa).

In [3]:
# encoding of president names
pres = {'deKlerk': 0,
        'Mandela': 1,
        'Mbeki': 2,
        'Motlanthe': 3,
        'Zuma': 4,
        'Ramaphosa': 5}
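
Because the codes follow chronological order, the same mapping could also be derived from an ordered list instead of being typed by hand. The following is a minimal sketch of that idea; the names `presidents_in_order` and `pres_from_list` are hypothetical and do not appear elsewhere in the notebook.

```python
# hypothetical helper list: president names in the chronological order used above
presidents_in_order = ['deKlerk', 'Mandela', 'Mbeki',
                       'Motlanthe', 'Zuma', 'Ramaphosa']

# enumerate assigns 0 to the earliest president and counts upwards
pres_from_list = {name: code for code, name in enumerate(presidents_in_order)}

print(pres_from_list)
```

Adding a president would then only require appending a name to the list.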

In [4]:
# collect one dict per line of text; the DataFrame is built once at the end
rows = []

# loop through all the files in the data folder
for file_name in os.listdir(path):

    # extract year and name of president from file name
    year, pres_name = file_name[:-4].replace('_', ' ').split()

    # open the file (closed automatically) and read the speech in as a list
    with open(os.path.join(path, file_name), 'r', encoding='utf-8') as file:
        speech = file.readlines()

    # skip the date header
    for line in speech[1:]:

        # replace commas with spaces
        line = line.replace(',', ' ')

        # remove extra whitespace
        line = " ".join(line.split())

        # ignore paragraph breaks / double line skips
        if len(line) > 1:
            rows.append({'labels': pres[pres_name],
                         'text': line,
                         'year': year})

# DataFrame.append was removed in pandas 2.0, so build the DataFrame from the list
df = pd.DataFrame(rows, columns=['labels', 'text', 'year'])
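
The loop relies on every file name splitting into exactly a year and a president name. The sketch below isolates that step so it can be checked on its own. The function name `parse_file_name` is hypothetical, and the example file name `1994_deKlerk.txt` is an assumption inferred from the parsing logic, not a confirmed file in the data folder.

```python
import os


def parse_file_name(file_name):
    """Split a file name such as '1994_deKlerk.txt' into (year, president)."""
    # splitext drops the extension regardless of its length
    stem, _extension = os.path.splitext(file_name)

    # same splitting rule as the loop: underscores become spaces, then split
    parts = stem.replace('_', ' ').split()
    if len(parts) != 2:
        raise ValueError(f'unexpected file name: {file_name!r}')

    year, pres_name = parts
    return year, pres_name


print(parse_file_name('1994_deKlerk.txt'))
```

Raising an explicit error makes a file that does not follow the naming pattern easier to spot than the unpacking error the loop would otherwise produce.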

In [5]:
df.head()

| | labels | text | year |
|---|---|---|---|
| 0 | 0 | Mr Speaker | 1994 |
| 1 | 0 | This Parliament has convened to adopt importan... | 1994 |
| 2 | 0 | The fact that we have done so in the midst of ... | 1994 |
| 3 | 0 | If we wish to have a peaceful and stable futur... | 1994 |
| 4 | 0 | It is essential that no-one and no party shoul... | 1994 |
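
Each row above is one cleaned line of a speech. To illustrate the cleaning rules from the loop in isolation, here is a small sketch. The function name `clean_line` is hypothetical, and the raw lines are made up for the example rather than taken from the data.

```python
def clean_line(line):
    """Replace commas with spaces and collapse runs of whitespace."""
    line = line.replace(',', ' ')
    return ' '.join(line.split())


# made-up raw lines, including an empty paragraph break
raw_lines = ['Mr  Speaker,\n', '\n', 'Honourable   members, friends\n']

cleaned = [clean_line(raw) for raw in raw_lines]

# same filter as the loop: drop lines with fewer than two characters
kept = [line for line in cleaned if len(line) > 1]

print(kept)
```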


### Number of Lines of Text in Total

In [6]:
df.shape

(6259, 3)

### Number of Lines of Text per President

In [7]:
df['labels'].value_counts()

4    2562
2    2055
1    1074
5     272
3     224
0      72
Name: labels, dtype: int64
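
The integer labels are easier to read once mapped back to names. The sketch below inverts the `pres` dictionary and applies it to the counts printed above, using plain dictionaries so that it runs on its own. The names `code_to_name` and `counts` are hypothetical; inside the notebook, `Series.map` with the same inverted dictionary would be one way to achieve this.

```python
pres = {'deKlerk': 0, 'Mandela': 1, 'Mbeki': 2,
        'Motlanthe': 3, 'Zuma': 4, 'Ramaphosa': 5}

# invert the encoding: integer code -> president name
code_to_name = {code: name for name, code in pres.items()}

# line counts per label, copied from the value_counts() output above
counts = {4: 2562, 2: 2055, 1: 1074, 5: 272, 3: 224, 0: 72}

for code, n_lines in counts.items():
    print(f'{code_to_name[code]:<10} {n_lines:>5}')

# the per-president counts should add up to the total number of rows
total = sum(counts.values())
print(total)
```

The total agrees with the 6259 rows reported by `df.shape`.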

### Save the File to CSV for the Next Notebook

In [8]:
# save the file to csv
df.to_csv('speeches.csv', encoding='utf-8', index=False)
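
As a quick check of what the next notebook will see, the sketch below writes a tiny stand-in DataFrame to an in-memory buffer and reads it back. The variable names and the second text row are made up for illustration. Note that `year` is stored as a string taken from the file name, but `read_csv` infers an integer type when the file is loaded again.

```python
import io

import pandas as pd

# tiny stand-in for df; the second text row is made up for illustration
sample = pd.DataFrame({'labels': [0, 1],
                       'text': ['Mr Speaker', 'Fellow South Africans'],
                       'year': ['1994', '1995']})

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)

# read the CSV back, as a downstream notebook might do
reloaded = pd.read_csv(buffer)

print(reloaded.dtypes)
```

Because `index=False` is used, no extra index column appears when the file is read back.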