# Create the Training Data

Load the SDG-related and non-SDG-related data.
Cleansing consists of:
- NaN removal
- renaming of columns
- dropping redundant features
- splitting the data into sentences with the ". " separator
- manual cleaning of sentences
    - concatenating sentences wrongly split at "e.g. "
    - removing formatting issues from PDF parsing
- concatenating the general and SDG datasets
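
The manual re-joining of sentences split at "e.g. " could in principle be automated. The following is a minimal sketch, not the procedure used in this notebook: the function name `split_into_sentences` and the abbreviation list are assumptions made for illustration.

```python
# Abbreviations that end in a period but do not end a sentence.
# Illustrative assumption: extend this tuple for the corpus at hand.
ABBREVIATIONS = ("e.g", "i.e")

def split_into_sentences(text):
    """Split on '. ' and re-join fragments that were cut after an abbreviation."""
    fragments = [f.strip() for f in text.split(". ") if f.strip()]
    sentences = []
    carry = ""  # holds a fragment that ended in an abbreviation
    for fragment in fragments:
        candidate = f"{carry} {fragment}" if carry else fragment
        if candidate.endswith(ABBREVIATIONS):
            # The split consumed the abbreviation's period: restore it and keep collecting.
            carry = candidate + "."
            continue
        # Restore the sentence-final period unless it is already there.
        sentences.append(candidate if candidate.endswith(".") else candidate + ".")
        carry = ""
    if carry:
        sentences.append(carry)
    return sentences

example = "Protect ecosystems, e.g. forests and wetlands. Reduce waste."
print(split_into_sentences(example))
# ['Protect ecosystems, e.g. forests and wetlands.', 'Reduce waste.']
```

A rule like this only covers abbreviations that are listed explicitly, so a manual pass over the output is still advisable.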

Data sources: <br>
General Data: https://www.kaggle.com/mikeortman/wikipedia-sentences <br>
SDG Data: Various resources from the United Nations

In [85]:
import pandas as pd

In [86]:
# load SDG data: read all sheets in a single pass (sheet_name=None returns a dict)
sheets_dict = pd.read_excel('./training_data/SDG_text.xlsx', sheet_name=None, engine='openpyxl')
sheets = list(sheets_dict.keys())
num_classes = len(sheets)
dfs = list()
for sheet, sheet_df in sheets_dict.items():
    # the last two characters of the sheet name hold the SDG number
    sheet_df['class'] = int(sheet[-2:])
    dfs.append(sheet_df)

df = pd.concat(dfs)

In [87]:
df.shape

(226, 3)

In [88]:
df = df.dropna()

In [89]:
df.shape

(172, 3)

**Question**: Should the 512 tokens be sampled statically, in which case it can be done here, or dynamically, in which case it has to be done at a later stage?
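
One reading of the static option is to cut each text into fixed-size pieces before training. The sketch below illustrates that idea only. It uses whitespace-separated words as a stand-in for model tokens, which is an assumption: a subword tokenizer typically produces more tokens per text, so real chunk boundaries would differ. The helper name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens, max_len=512):
    """Cut a token list into consecutive chunks of at most max_len tokens."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

# Stand-in tokens: whitespace split instead of the model's tokenizer (assumption).
tokens = ("word " * 1200).split()
chunks = chunk_tokens(tokens)
print([len(chunk) for chunk in chunks])  # [512, 512, 176]
```

With the dynamic option, the same truncation or windowing would instead happen when batches are built for the model.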

In [90]:
df = df.reset_index(drop=True)

In [91]:
df.drop(['Source'], inplace=True, axis=1)

In [92]:
df.head()

Unnamed: 0,Text,class
0,End poverty in all its forms everywhere. By 20...,1
1,"Despite progress under the MDGs, approximately...",1
2,Even before the coronavirus disease (COVID-19)...,1
3,"The decline of extreme poverty continues, but ...",1
4,Giving people in every part of the world the s...,1


In [51]:
# split sentences and output each sentence and its classification
# in a separate file for manual cleansing
df_train = pd.DataFrame(columns=['sentence', 'class'])
counter = 0
def split_sentences(text, label):
    global counter
    for sentence in text.split('. '):
        sentence = sentence.strip()
        if not sentence:
            continue  # skip empty fragments, e.g. when the text ends with '. '
        if not sentence.endswith('.'):
            sentence += '.'  # restore the period consumed by the split
        df_train.loc[counter] = [sentence + ' ', label]
        counter += 1

In [52]:
df.apply(lambda x: split_sentences(x['Text'], x['class']), axis=1)

0      None
1      None
2      None
3      None
4      None
       ... 
167    None
168    None
169    None
170    None
171    None
Length: 172, dtype: object

In [53]:
df_train

Unnamed: 0,sentence,class
0,End poverty in all its forms everywhere.,1
1,"By 2030, eradicate extreme poverty for all peo...",1
2,"By 2030, reduce at least by half the proportio...",1
3,Implement nationally appropriate social protec...,1
4,"By 2030, ensure that all men and women, in par...",1
...,...,...
5163,Adopting responsible business practices and co...,17
5164,But tackling some of the toughest global chall...,17
5165,Working in partnership can often lead to great...,17
5166,"With its reach and unique capabilities, busine...",17


In [54]:
df_train.to_csv('~/Desktop/training_sdg.csv', index=False)

### Import general text data from Wikipedia file

In [59]:
# load general wikipedia data
df_wiki = pd.read_csv('/Users/martinthoma/Desktop/codebase_thesis/datasets/wikisent2.txt', delimiter='\t', header=None)

In [60]:
df_wiki.head()

Unnamed: 0,0
0,"0.000123, which corresponds to a distance of 7..."
1,"000webhost is a free web hosting service, oper..."
2,"0010x0010 is a Dutch-born audiovisual artist, ..."
3,0-0-1-3 is an alcohol abuse prevention program...
4,"0.01 is the debut studio album of H3llb3nt, re..."


In [62]:
# shuffle data randomly before slicing
df_wiki = df_wiki.sample(frac=1).reset_index(drop=True)
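
Because the shuffle above is unseeded, re-running the notebook yields a different set of 600 sentences each time. The following standard-library sketch shows the idea of seeded sampling. The helper `sample_lines`, the seed value and the toy corpus are assumptions for illustration. In pandas, passing `random_state` to `DataFrame.sample` serves the same purpose.

```python
import random

def sample_lines(lines, n, seed=42):
    """Draw n distinct items reproducibly; the seed value is an arbitrary choice."""
    rng = random.Random(seed)  # local generator, leaves the global random state untouched
    return rng.sample(lines, min(n, len(lines)))

# Toy corpus standing in for the Wikipedia sentence file (assumption).
corpus = [f"sentence {i}" for i in range(10_000)]
subset = sample_lines(corpus, 600)

# The same seed always yields the same subset.
print(subset == sample_lines(corpus, 600))  # True
```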

In [69]:
# keep the first 600 sentences; selecting column 0 explicitly keeps this cell safe
# to re-run after the 'label' column has been added
df_wiki = df_wiki.iloc[:600, [0]].copy()
df_wiki.columns = ['sentence']

In [70]:
df_wiki['label'] = 0

In [71]:
df_wiki.head()

Unnamed: 0,sentence,label
0,"It is broadcast from Jhelum, Islamabad, Sargod...",0
1,It was written by Hanif Kureishi from his shor...,0
2,Kohn retired from competition after the 2010 W...,0
3,It spent 30 weeks on the AIR Indie Charts and ...,0
4,For a time he was also the manager of Pacific ...,0


In [72]:
df_wiki.to_csv('./datasets/general_text.csv', index=False)

### Concatenate general and SDG data to training set

In [78]:
# load clean versions of the datasets 
# and concatenate them
df_general = pd.read_csv('./datasets/general_text.csv')
df_sdg = pd.read_csv('./datasets/training_sdg.csv')

df_general.columns = df_sdg.columns = ['sentence', 'labels']
df_train = pd.concat([df_general, df_sdg])

In [81]:
# Check success of concatenation
print('Concat Successfully: ' + str(df_general.shape[0] + df_sdg.shape[0] == df_train.shape[0]))

Concat Successfully: True
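
The row-count check confirms that no rows were lost, but it says nothing about how the labels are distributed. A quick look at the class shares can be sketched as follows. The helper `label_distribution` and the toy labels are assumptions for illustration. On the actual DataFrame, `df_train['labels'].value_counts()` gives the corresponding counts.

```python
from collections import Counter

def label_distribution(labels):
    """Return the share of each label, ordered by label value."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: counts[label] / total for label in sorted(counts)}

# Toy labels: 0 = general text, 1..17 = SDG classes (values are made up).
toy_labels = [0] * 6 + [1] * 3 + [17] * 1
print(label_distribution(toy_labels))  # {0: 0.6, 1: 0.3, 17: 0.1}
```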


In [82]:
# shuffle and save data to disk 
df_train = df_train.sample(frac=1).reset_index(drop=True)
df_train.to_csv('./datasets/training_set', index=False)

In [67]:
# RoBERTa model requires single column one-hot-encoded labels
# this function returns this format from the usual numeric class label
def one_hot_encode(class_number):
    """
    Takes an integer representing the target class label.
    As RoBERTa needs the targets in one-hot-encoded tuple format,
    a tuple is returned which has a 1 at the position of the target class.
    E.g. 2 -> (0,0,1,0,0,...)
    """
    # assumption: labels run from 0 (general text) to num_classes (the highest
    # SDG number), so the tuple needs num_classes + 1 positions
    t = [0 for i in range(num_classes + 1)]
    t[class_number] = 1
    return tuple(t)

In [68]:
# apply one-hot encoding
df['one_hot_class'] = df['class'].apply(one_hot_encode)
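
The encoding can be checked in isolation with a standalone variant that takes the vector length as an explicit argument instead of reading a global. This is a sketch: the name `one_hot` and the `size` parameter are hypothetical. Assuming labels 0 (general text) through 17, the vector needs 18 positions.

```python
def one_hot(class_number, size):
    """Return a tuple of length size with a 1 at index class_number."""
    if not 0 <= class_number < size:
        raise ValueError(f"class {class_number} does not fit into {size} positions")
    return tuple(1 if i == class_number else 0 for i in range(size))

print(one_hot(2, 5))         # (0, 0, 1, 0, 0)
print(len(one_hot(17, 18)))  # 18
```

The explicit range check turns an out-of-range label into a clear error message instead of an `IndexError` raised inside the encoding.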