
## Preprocessing body texts (training & evaluation)

In this section, we preprocess the translated body text of both the training and the evaluation datasets and export the results for later use.

In [2]:
import os
import json
import pandas as pd
import spacy

nlp = spacy.load('en_core_web_sm')

import string
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords

this_dir = os.getcwd()

The translated texts are read from CSV files with pandas in the evaluation and training subsections below.

The chosen preprocessing routine is the following:

1. Lowercasing
2. Punctuation removal
3. Tokenization
4. Stopword removal
5. Lemmatization

Lowercasing and punctuation removal rely on Python's built-in string tools, while tokenization and stopword removal use [NLTK](https://www.nltk.org/).  
For lemmatization, the lemmatizer from [spaCy](https://spacy.io/) is used.

In [11]:
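# Build the stopword set once: a set lookup is faster than scanning a list per token.
# Requires the NLTK tokenizer and stopword data to be downloaded beforehand.
STOP_WORDS = set(stopwords.words('english'))

def preprocess_body(text):
    """Lowercase, strip punctuation, tokenize, drop stopwords, and lemmatize."""

    # 1. lowercasing
    text = text.lower()

    # 2. punctuation removal (ASCII punctuation only)
    text_p = "".join([char for char in text if char not in string.punctuation])

    # 3. tokenization (NLTK)
    words = word_tokenize(text_p)

    # 4. stopword removal
    filtered_words = [word for word in words if word not in STOP_WORDS]

    # 5. lemmatization (spaCy), applied to the re-joined tokens
    doc = nlp(' '.join(filtered_words))
    lemmatized_output = ' '.join([token.lemma_ for token in doc])

    return lemmatized_output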
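
One consequence of the step order above is worth seeing in isolation: because punctuation is stripped before stopwords are filtered, a contraction such as "don't" turns into "dont" and may no longer match a stopword entry. The following is a minimal sketch of that effect, not the notebook's actual pipeline. The tiny `STOP_WORDS_DEMO` set, the helper `strip_punctuation`, and the whitespace split are illustrative assumptions standing in for NLTK's stopword list and `word_tokenize`.

```python
import string

# Hypothetical mini stopword set, for illustration only.
STOP_WORDS_DEMO = {"i", "do", "not", "don't", "the"}

def strip_punctuation(text):
    """Remove every ASCII punctuation character from text."""
    return "".join(char for char in text if char not in string.punctuation)

text = "i don't like the rain"

# Punctuation removed first: "don't" becomes "dont", which is not in the set.
stripped_first = [w for w in strip_punctuation(text).split()
                  if w not in STOP_WORDS_DEMO]

# Stopwords removed first: "don't" is still intact and gets filtered out.
filtered_first = [w for w in text.split() if w not in STOP_WORDS_DEMO]

print(stripped_first)
print(filtered_first)
```

Whether this matters depends on the downstream task; if leftover fragments such as "dont" turn out to be noisy, swapping steps 2 and 4 is one option to evaluate.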


### Preprocess body text (evaluation)

We preprocess the translated body text of the articles in the evaluation dataset.

In [6]:
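# Build the path with os.path.join to avoid backslash escape issues in string literals
eval_data = pd.read_csv(os.path.join(this_dir, 'eval', '_EVAL_text_translated.csv'))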

nan_list1 =  eval_data[(eval_data['translated_body1'].isna())].index.tolist()

nan_list2 = eval_data[(eval_data['translated_body2'].isna())].index.tolist()


4902


In [17]:
preprocessed_eval1 = [preprocess_body(str(i)) for i in eval_data["translated_body1"].tolist()]
preprocessed_eval2 = [preprocess_body(str(i)) for i in eval_data["translated_body2"].tolist()]

eval_data["preprocessed_1"] = preprocessed_eval1
eval_data["preprocessed_2"] = preprocessed_eval2
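
Rows with a missing body deserve a note: a missing value in a pandas column is a float NaN, and `str()` turns it into the literal string `"nan"`, which would then likely flow through the preprocessing as an ordinary token. The sketch below shows one way to map missing values to an empty string first. The helper name `body_to_text` is hypothetical and not part of the original notebook, and the example uses only the standard library.

```python
import math

def body_to_text(value):
    """Return '' for missing values (None or float NaN), else str(value)."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value)

# Illustrative inputs: one real text and two kinds of missing value.
bodies = ["Some translated text.", float("nan"), None]

naive = [str(b) for b in bodies]             # NaN becomes the string "nan"
cleaned = [body_to_text(b) for b in bodies]  # missing values become ""

print(naive)
print(cleaned)
```

Under this assumption, `preprocess_body(body_to_text(i))` could replace `preprocess_body(str(i))`, so that rows listed in `nan_list1` and `nan_list2` end up with an empty preprocessed text.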

In [18]:
# Write next to the input file instead of the filesystem root
path = os.path.join(this_dir, 'eval', '_EVAL_preprocessed_text.csv')
eval_data.to_csv(path)


### Preprocess body text (training)

We preprocess the translated body text of the articles in the training dataset.

In [28]:
# os.path.join avoids the "\t" in a backslash-separated literal being read as a tab
train_data = pd.read_csv(os.path.join(this_dir, 'train', '_TRAIN_text_translated'))
#train_data = pd.read_csv(this_dir + '\eval_text_translations\train_fully_translated.csv')

nan_list1 =  train_data[(train_data['translated_body1'].isna())].index.tolist()

nan_list2 = train_data[(train_data['translated_body2'].isna())].index.tolist()
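
A note on file paths: in a regular Python string literal, a backslash followed by certain letters forms an escape sequence, so a Windows-style path fragment that begins with `\t` contains a tab character rather than a backslash and a `t`. The sketch below illustrates the pitfall and two common ways around it. The variable names are illustrative and the directory name is only an example.

```python
import os

# In a regular string literal, "\t" is the tab escape sequence.
literal = '\train'
print(repr(literal))        # the first character is a tab, not a backslash

# A raw string keeps the backslash as written.
raw = r'\train'
print(repr(raw))

# os.path.join sidesteps the problem and uses the right separator per platform.
base_dir = os.getcwd()
joined = os.path.join(base_dir, 'train', '_TRAIN_text_translated')
print(joined)
```

Building paths with `os.path.join` also keeps the notebook portable between Windows and Unix-like systems.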

In [35]:
preprocessed_train1 = [preprocess_body(str(i)) for i in train_data["translated_body1"].tolist()]
preprocessed_train2 = [preprocess_body(str(i)) for i in train_data["translated_body2"].tolist()]

train_data["preprocessed_1"] = preprocessed_train1
train_data["preprocessed_2"] = preprocessed_train2

In [36]:
# Write next to the input file instead of the filesystem root
path = os.path.join(this_dir, 'train', '_TRAIN_preprocessed_text')
train_data.to_csv(path)

An example of the spaCy lemmatizer in use:

In [3]:
text = "I will go and I am and I went and I was there and here and nowhere."

doc = nlp(text)
lemmatized_out = ' '.join([token.lemma_ for token in doc])

print(lemmatized_out)

I will go and I be and I go and I be there and here and nowhere .
