# Pre-Processing

The flow of data through the pipeline of this project is as follows:

1. The raw text data files are parsed and concatenated within the **`data_loader.ipynb`** notebook. The output from that notebook is called **`speeches.csv`**.
2. **`speeches.csv`** is ingested by _this_ notebook in the code block below. It undergoes further processing and is written to a file called **`model_input.csv`**.
3. Finally, **`model_input.csv`** is ingested by **`classifier.ipynb`** and used for class balancing and classification.

In [1]:
import pandas as pd
import numpy as np
import string
import re

### Load Speech Data

In [2]:
# use SONA speeches only
# df = pd.read_csv('../data/speeches.csv')

# use additional speeches for minority classes
df = pd.read_csv('../data/speeches_extended.csv')

## Overview of Data Pre-Processing

The **`labels`** column in the data set contains an integer from 0 to 5 that identifies which president delivered a given line. The values follow the chronological order of the presidents, starting with FW de Klerk in 1994 and ending with Cyril Ramaphosa in 2017.

The **`text`** column contains lines from the State of the Nation addresses given over the last 15 years.
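
For readability it can help to translate the integer labels back into names. The lookup below is a minimal sketch: only FW de Klerk and Cyril Ramaphosa are named in this notebook, so the four names in between are an assumption based on the chronological order of South African presidents, and `PRESIDENTS` and `president_name` are hypothetical names.

```python
# Hypothetical lookup table. Labels 1 to 4 are an assumption: the notebook
# only states that labels run chronologically from de Klerk to Ramaphosa.
PRESIDENTS = {
    0: "FW de Klerk",
    1: "Nelson Mandela",
    2: "Thabo Mbeki",
    3: "Kgalema Motlanthe",
    4: "Jacob Zuma",
    5: "Cyril Ramaphosa",
}


def president_name(label):
    """Return the president's name for an integer label, or 'unknown'."""
    return PRESIDENTS.get(int(label), "unknown")


print(president_name(0))
print(president_name(5))
```

On the DataFrame itself, the same dictionary could be applied with `df['labels'].map(PRESIDENTS)`.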

In [3]:
# show some of the data
df.head(5)

Unnamed: 0,labels,text,year
0,0,The general election on September the 6th 1989...,1990
1,0,The alternative is growing violence tension an...,1990
2,0,On its part the Government will accord the pro...,1990
3,0,I hope that this new Parliament will play a co...,1990
4,0,* Let us put petty politics aside when we disc...,1990


In [4]:
df.shape

(6559, 3)

### Number of Lines per President

In [5]:
df['labels'].value_counts()

4    2562
2    2055
1    1074
5     379
3     281
0     208
Name: labels, dtype: int64

In [6]:
df['text'][2]

'On its part the Government will accord the process of negotiation the highest priority. The aim is a totally new and just constitutional dispensation in which every inhabitant will enjoy equal rights treatment and opportunity in every sphere of endeavour - constitutional social and economic.'

### Split Lines On End Characters

As the example above shows, some lines contain multiple sentences. The original data was split into rows at every newline character, so we need to split it further at every full stop, question mark and exclamation mark.

In [7]:
def divide_on(df, char):
  """Split each row's text at every occurrence of char, keeping the row's label."""
  sentences = []

  # emit one [label, sentence] pair for every piece of every row
  for _, row in df.iterrows():
    for sentence in row['text'].split(char):
      sentences.append([row['labels'], sentence])

  df = pd.DataFrame(sentences, columns=['labels', 'text'])

  # drop the empty strings left behind when a line ends with char
  return df[df['text'] != '']

In [8]:
df = divide_on(df, '.')
df = divide_on(df, '?')
df = divide_on(df, '!')
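
The three successive splits can also be expressed as a single pass with a regular-expression character class. The sketch below works on plain strings rather than a DataFrame; `split_sentences` and `END_CHARS` are hypothetical names, and unlike `divide_on` it also strips surrounding whitespace from each piece.

```python
import re

# Match any one of the three end characters: full stop, question mark,
# exclamation mark.
END_CHARS = re.compile(r"[.?!]")


def split_sentences(line):
    """Split one line of text on end characters, dropping empty pieces."""
    pieces = END_CHARS.split(line)
    return [piece.strip() for piece in pieces if piece.strip()]


example = "That is unacceptable. Is it in anybody's interest? No!"
print(split_sentences(example))
```

Like `divide_on`, this also splits at full stops inside numbers such as `4.5`.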

In [9]:
df.head(5)

Unnamed: 0,labels,text
0,0,The general election on September the 6th 1989...
1,0,Underlying this is the growing realisation by...
2,0,The alternative is growing violence tension an...
3,0,That is unacceptable and in nobody's interest
4,0,The well-being of all in this country is link...


In [10]:
df.shape

(9310, 2)

### Number of Sentences per President

In [11]:
df['labels'].value_counts()

4    3203
2    3149
1    1796
3     401
5     389
0     372
Name: labels, dtype: int64

In [12]:
# proportion of total
df['labels'].value_counts() / len(df)

4    0.344039
2    0.338238
1    0.192911
3    0.043072
5    0.041783
0    0.039957
Name: labels, dtype: float64

After splitting the data into individual sentences, we now have more observations for each president.

### Remove Punctuation

In [13]:
def remove_punctuation(text):
  return ''.join([char for char in text if char == '-' or char not in string.punctuation])


df['text'] = df['text'].apply(remove_punctuation)
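
The same filter can be written with a translation table instead of a character-by-character comprehension. This is a sketch of an equivalent approach, not the method used in this notebook; `TO_REMOVE`, `TABLE` and `remove_punctuation_translate` are hypothetical names. Because `string.punctuation` only covers ASCII, typographic characters such as `’` are left in place by both versions.

```python
import string

# Every ASCII punctuation character except the hyphen, which is kept.
TO_REMOVE = string.punctuation.replace("-", "")

# Translation table that deletes each character listed in TO_REMOVE.
TABLE = str.maketrans("", "", TO_REMOVE)


def remove_punctuation_translate(text):
    """Delete ASCII punctuation from text, keeping hyphens."""
    return text.translate(TABLE)


sample = "The well-being of all, in this country, is linked!"
print(remove_punctuation_translate(sample))
```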

### Remove Special Characters

In [14]:
def remove_spec(text):
  # strip a leading '*' or '-' marker; str.replace would treat the pattern as literal text
  return re.sub(r'^[*-]', '', text)
  
df['text'] = df['text'].apply(remove_spec)
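
Python's built-in `str.replace` matches its first argument as literal text, whereas `re.sub` interprets it as a regular expression. The sketch below contrasts the two on a made-up line that starts with a hyphen; the sample string is an assumption, not taken from the data set.

```python
import re

line = "- let us put petty politics aside"

# str.replace searches for the literal text '^[*-]', which does not occur,
# so the line comes back unchanged.
literal = line.replace(r"^[*-]", "")

# re.sub interprets the pattern: a single '*' or '-' at the start of the string.
anchored = re.sub(r"^[*-]", "", line)

print(repr(literal))
print(repr(anchored))
```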

### Make Lower Case

In [15]:
df['text'] = df['text'].str.lower()

In [16]:
df.head(5)

Unnamed: 0,labels,text
0,0,the general election on september the 6th 1989...
1,0,underlying this is the growing realisation by...
2,0,the alternative is growing violence tension an...
3,0,that is unacceptable and in nobodys interest
4,0,the well-being of all in this country is link...


### Sentence Length

In [17]:
# get length of sentence as variable
df['length'] = df['text'].apply(len)

In [18]:
# what are our longest sentences?
df.sort_values(by='length', ascending=False).head(10)

Unnamed: 0,labels,text,length
5684,3,facilitating the processes aimed at strengthen...,718
1528,1,we recognise the buhlebemvelo garden project f...,641
5168,2,it was the best of times it was the worst of t...,595
5289,2,speeding up land and agrarian reform with det...,554
1955,1,the second element is discipline - the balance...,537
4983,2,i would like to take advantage of this occasio...,517
3824,2,the iraq affair the continuing and painful con...,513
5034,2,speeding up the implementation of the taxi re...,502
2547,2,recognising the fact that we still have this o...,498
1492,1,but they know better than any politician that ...,489


In [19]:
df.loc[5684, 'text']

'facilitating the processes aimed at strengthening the machineries dealing with matters of gender equality such as 5050 representation in decision-making structures youth development the rights of people with disability and children’s rights – including completing consultations on the national youth policy preparing for the implementation of the african youth charter once it has been processed by parliament and for the setting up of the national youth development agency submitting the sadc protocol on gender and development to parliament strengthening advocacy on the rights of people with disability and extending the number of municipalities that have set up children’s rights focal points beyond the current 60'

There are a few sentences which contain more than 500 characters. Although somewhat long-winded, they are all technically still single sentences.   

Let's take a look at the other end of the spectrum:

In [20]:
# what are our shortest sentences?
df.sort_values(by='length').head(5)

Unnamed: 0,labels,text,length
7287,4,,0
5767,4,,0
7353,4,,0
3838,2,,0
3767,2,,0


In [21]:
# sentences with just a few characters are of no use to us
df = df[df['length']>8]

In [22]:
# what are our shortest sentences now?
df.sort_values(by='length').head(5)

Unnamed: 0,labels,text,length
4801,2,thank you,9
778,1,thank you,9
2800,2,thank you,9
8922,4,4 percent,9
4447,2,thank you,9
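
The filter keeps strings of exactly nine characters, which is why `thank you` and `4 percent` survive. A minimal sketch of the same boundary on a plain Python list follows; the sample strings and the name `MIN_LENGTH` are illustrative assumptions.

```python
# Hypothetical constant: `length > 8` is the same as `length >= 9`.
MIN_LENGTH = 9

samples = ["", "yes", "thank you", "4 percent", "that is unacceptable"]

# Keep only the strings that meet the minimum character count.
kept = [s for s in samples if len(s) >= MIN_LENGTH]
print(kept)
```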


In [23]:
df['labels'].value_counts()

4    3145
2    3074
1    1785
3     394
5     385
0     371
Name: labels, dtype: int64

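Finally, we save the processed dataset to a CSV file for the **`classifier.ipynb`** notebook, which uses it for class balancing and, ultimately, for classification. The file is called `model_input.csv` when built from `speeches.csv` and `model_input_extended.csv` when built from `speeches_extended.csv`.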

In [24]:
# when using `speeches.csv`
# df[['labels', 'text']].to_csv('../data/model_input.csv', index=False)

# when using `speeches_extended.csv`
df[['labels', 'text']].to_csv('../data/model_input_extended.csv', index=False)