# Basic preprocessing of the corpus

In this Notebook, we will perform some basic preprocessing of the corpus. This is necessary to analyze and visualize the corpus in the upcoming Notebooks of this tutorial.

## Run notebook 02

We need this so that we can use the variables from that notebook. You can ignore the outputs from these cells.

In [None]:
%run 02_Loading_data_and_visualizing_corpus.ipynb 

### Check if DataFrame is stored correctly

Let's quickly check if the DataFrame is stored correctly. We do this by displaying the first two rows of the DataFrame.

In [None]:
data.head(2)

Looks good! Now we can move on to the actual preprocessing steps.

## Load the Natural Language Processing (NLP) model

[spaCy](https://spacy.io/) offers language models that you can load and use to perform natural language processing. We import the library and then load the Dutch model that fits our corpus.

In [None]:
import spacy
from spacy.lang.nl.examples import sentences

Load the small Dutch natural language processing (NLP) model, `nl_core_news_sm`. This model can perform:
* Tokenization:
  Breaking down text into individual tokens (words, punctuation marks, etc.).
* Part-of-Speech (POS) Tagging:
  Assigning grammatical categories (such as nouns, verbs, adjectives) to each token.
* Named Entity Recognition (NER):
  Identifying and classifying named entities (like people, organizations, locations) in the text.
* Dependency Parsing:
  Analyzing the grammatical structure of a sentence, establishing relationships between tokens.
* Lemmatization:
  Reducing words to their base or dictionary form.

In [None]:
import spacy

# Specify the relative path to the model directory
model_path = "model/nl_core_news_sm"

# Load the model from the relative path
nlp = spacy.load(model_path)

## Create SpaCy Doc objects

When you call `nlp` on a text, spaCy first tokenizes the text to produce a Doc object.

A [Doc object](https://spacy.io/api/doc/) is a container for accessing linguistic annotations.

We will add a Doc object for each article to our DataFrame.

Create a helper function. We will use this in the next step.

In [None]:
def process_text(text):
    return nlp(text)

Create a new column in the pandas DataFrame, called "doc".

This column will store the content of each article in a way that is easy to use in later steps of this tutorial. 

Before we can create the Doc objects, we need to remove the rows of the DataFrame that have no article content. These rows have a `NaN` value in the `content` column. The `dropna` method removes them.

In [None]:
data = data.dropna(subset=['content'])
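
To see what `dropna(subset=...)` does, here is a small sketch on a made-up DataFrame. The `toy` frame and its values are hypothetical and not part of the corpus.

```python
import pandas as pd

# Hypothetical toy data: the second row has no article content
toy = pd.DataFrame({
    "title": ["Artikel A", "Artikel B", "Artikel C"],
    "content": ["Het regent vandaag.", None, "De zon schijnt."],
})

# Keep only the rows where 'content' is not missing;
# missing values in other columns would not cause a row to be dropped
toy_clean = toy.dropna(subset=["content"])

print(len(toy), "rows before,", len(toy_clean), "rows after")
```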

We can create the Doc objects by executing the Python command below. Be aware that this can take a long time, depending on the size of the corpus and the speed and memory of your computer.

```data["doc"] = data["content"].apply(process_text)```
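
If you do want to run this step yourself, spaCy's `nlp.pipe` processes texts in batches, which is usually faster than calling `nlp` on one text at a time. The sketch below is only an illustration: it assumes a blank Dutch pipeline (`spacy.blank("nl")`, called `nlp_demo` here) as a stand-in for the loaded model, and a hypothetical list `texts` in place of the `content` column.

```python
import spacy

# Stand-in for the model loaded earlier: a blank Dutch pipeline that only tokenizes
nlp_demo = spacy.blank("nl")

# Hypothetical example texts; in the notebook this would be data["content"]
texts = ["Het regent vandaag.", "De zon schijnt in Amsterdam."]

# nlp.pipe yields one Doc per input text, in the same order as the input
docs = list(nlp_demo.pipe(texts, batch_size=50))

print(len(docs), "Doc objects created")
```

In the notebook itself, the equivalent would be something like `data["doc"] = list(nlp.pipe(data["content"], batch_size=50))`; the batch size of 50 is an arbitrary choice.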

Let's assume this has been done. The result has been saved for you, so you can open the processed data with the following code:

In [None]:
import pickle

# Deserialize
with open('data/processed_docs.pkl', 'rb') as f:
    processed_docs = pickle.load(f)
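
How the file `processed_docs.pkl` was created is not shown in this tutorial. As a rough sketch of how pickling works in general, the example below writes a small, made-up object to a temporary file and reads it back. The object `example` and the file name `example.pkl` are hypothetical.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the processed data
example = {"identifier": ["a1", "a2"], "content": ["Het regent.", "De zon schijnt."]}

# Use a temporary directory so that nothing in data/ is overwritten
path = os.path.join(tempfile.mkdtemp(), "example.pkl")

# Serialize: write the object to disk in binary mode
with open(path, "wb") as f:
    pickle.dump(example, f)

# Deserialize: read the same object back
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == example)
```

Only open pickle files from sources you trust, because loading a pickle can execute arbitrary code.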


In [None]:
processed_docs.head(2)

### Tokenization

Tokenization refers to the process of breaking down a piece of text into small units, called 'tokens'. In our case, the tokens are the words in an article, but tokens can also consist of parts of words or characters. Tokenization is a crucial part of NLP.
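
To make this concrete, here is a minimal sketch that tokenizes one made-up sentence. It assumes a blank Dutch pipeline (`spacy.blank("nl")`, called `nlp_demo` here), which only contains a tokenizer and does not need the downloaded model.

```python
import spacy

# Blank Dutch pipeline: tokenizer only, no trained model needed
nlp_demo = spacy.blank("nl")

doc = nlp_demo("Het regent vandaag.")

# Each token is a word or a punctuation mark
tokens = [token.text for token in doc]
print(tokens)
```

Note that the full stop at the end of the sentence becomes a token of its own.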

Create a helper function. We will use this in the next step.

In [None]:
def get_token(doc):
    # Return the text of every token in the Doc as a list of strings
    return [token.text for token in doc]

Create a 'tokens' column in the DataFrame. This column stores the tokens of each article as a list, which is useful in later steps.

In [None]:
processed_docs['tokens'] = processed_docs['doc'].apply(get_token)

In [None]:
processed_docs.head(2)

### Lemmatization

Lemmatization is the process of reducing words to their most basic form, known as the lemma. For example, the lemma of 'running' is 'run' and the lemma of 'better' is 'good'. Lemmatization is useful in NLP because the different inflected forms of a word are then treated as the same word. This reduces the size of the vocabulary and makes word counts and comparisons between texts more meaningful.

Create a helper function. We will use this function in the next step.

In [None]:
def get_lemma(doc):
    # Return the lemma of every token in the Doc as a list of strings
    return [token.lemma_ for token in doc]

Create a 'lemmas' column in the DataFrame.

In [None]:
processed_docs['lemmas'] = processed_docs['doc'].apply(get_lemma)

Display lemmas and tokens.

In [None]:
processed_docs[['tokens', 'lemmas']].head()

Comparing the tokens with the lemmas, we can see how some words have been reduced to their root form. For example, 'viel' has been changed to 'vallen'. However, we can also see that 'tweeden' has been changed to 'tweed'. This change is more questionable. 

### Named Entity Recognition (NER)

spaCy features an extremely fast statistical entity recognition system. The default trained pipelines can identify a variety of named and numeric entities, including companies, locations, organizations and products.

Named entities are available as the `ents` property of a `Doc` object.
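
The sketch below illustrates how `ents` can be read. Because it uses a blank Dutch pipeline (`spacy.blank("nl")`) without a trained entity recognizer, the entity is set by hand; with the loaded model, spaCy fills in `doc.ents` itself. The sentence and the name `nlp_demo` are made up for this illustration.

```python
import spacy
from spacy.tokens import Span

# Blank Dutch pipeline: tokenizer only, no trained NER component
nlp_demo = spacy.blank("nl")
doc = nlp_demo("De zon schijnt in Amsterdam.")

# Mark the token 'Amsterdam' as a GPE by hand (a trained model does this automatically)
start = [token.text for token in doc].index("Amsterdam")
doc.ents = [Span(doc, start, start + 1, label="GPE")]

# Each entity has a text and a label
for ent in doc.ents:
    print(ent.text, ent.label_)
```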

The function `spacy.explain` will return a description for a given entity type (tag).

Try it out and see if you can find the meaning of these entity labels:
* FAC
* PERSON
* NORP
* GPE

In [None]:
spacy.explain('FAC')

#### Let's add some GPE entities to our DataFrame

In [None]:
def get_gpe(doc):
    return [ent.text for ent in doc.ents if ent.label_ == 'GPE']

In [None]:
processed_docs['GPEs'] = processed_docs['doc'].apply(get_gpe)

In [None]:
# Get the first few rows of GPEs as a list
gpe_list = processed_docs[['identifier','GPEs']].head(2).values.tolist()
for item in gpe_list:
    print(item)
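
To get an overview across all articles rather than a few rows, one option is to count how often each GPE occurs. The sketch below assumes a hypothetical list `gpes_per_article` as a stand-in for the `GPEs` column.

```python
from collections import Counter
from itertools import chain

# Hypothetical stand-in for processed_docs['GPEs']: one list of GPEs per article
gpes_per_article = [["Amsterdam", "Nederland"], [], ["Amsterdam", "Parijs"]]

# Flatten the per-article lists into one sequence and count each place name
gpe_counts = Counter(chain.from_iterable(gpes_per_article))

# Show the most frequent place name
print(gpe_counts.most_common(1))
```

In the notebook, `gpes_per_article` would be replaced by `processed_docs['GPEs']`.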

## Write the preprocessed dataset to a separate file

We will create a new file, called 'data_preprocessed.csv'. We will use this in the next Notebooks.

In [None]:
processed_docs.to_csv('data/data_preprocessed.csv', index=False)
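
Keep in mind that a CSV file can only store text. List-valued columns such as 'tokens' are written as strings, and a `Doc` is likewise written out as text rather than as an object. The sketch below shows one way to turn such a string back into a list after loading, using a made-up DataFrame `toy` and an in-memory buffer instead of a file.

```python
import ast
import io

import pandas as pd

# Hypothetical toy frame with a list-valued column, like 'tokens'
toy = pd.DataFrame({"identifier": ["a1"], "tokens": [["Het", "regent", "."]]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# After reading back, the list has become a string
restored = pd.read_csv(buffer)
print(type(restored.loc[0, "tokens"]))

# ast.literal_eval turns that string back into a Python list
restored["tokens"] = restored["tokens"].apply(ast.literal_eval)
print(restored.loc[0, "tokens"])
```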

## Use case: count the number of words in each article

In the next code block, we will count the number of tokens in each article and store it in a separate column, called 'article_length'. Because the tokens include punctuation marks, this number is slightly higher than the actual number of words. We will use this information to make a comparison between the different newspapers in the next Notebook.

In [None]:
# Compute the length of each article as the number of tokens it contains
processed_docs['article_length'] = processed_docs['tokens'].apply(len)

# Show the first rows of title, tokens and article length in the DataFrame
processed_docs[['title', 'tokens', 'article_length']].head()
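
The article length computed here counts every token, including punctuation marks. If you want a count that is closer to the number of words, one option is to skip tokens for which spaCy's `is_punct` attribute is true. A minimal sketch, again assuming a blank Dutch pipeline (`nlp_demo`) and a made-up sentence:

```python
import spacy

# Blank Dutch pipeline: tokenizer only, no trained model needed
nlp_demo = spacy.blank("nl")
doc = nlp_demo("Het regent vandaag, helaas.")

# Number of tokens, punctuation included
n_tokens = len(doc)

# Number of tokens that are not punctuation marks
n_words = sum(1 for token in doc if not token.is_punct)

print(n_tokens, "tokens,", n_words, "words")
```

In the notebook, the same idea could be applied to the 'doc' column, for example with `processed_docs['doc'].apply(lambda d: sum(1 for t in d if not t.is_punct))`.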