# Preparing the data

In this notebook we prepare the data from the test set. The dataset is stored in the data folder of the GitHub repository and is named [testset_delpher.csv](https://github.com/KBNLresearch/delpher_demo/blob/main/data/testset_delpher.csv).
We add a 'bag of words' created from the original text, extract the year, and build a dataframe with keywords.

When you use a dataset downloaded from http://delpher_demo.kbresearch.nl/, the bag of words and the year are already provided, so that part of the code can be skipped.

If you have not yet installed the following packages, please install them from the command line with the following commands:
```
pip install pandas
pip install Unidecode
```

In [2]:
## Import the necessary packages
import pandas as pd
import re
from unidecode import unidecode

In [3]:
## Define the functions that will be needed later on

## Sorts the words in the original string alphabetically
def sortedSentence(Sentence): 
    # Splitting the Sentence into words 
    words = Sentence.split(" ") 
    # Sorting the words 
    words.sort() 
    # Making new Sentence by joining the sorted words 
    newSentence = " ".join(words) 
    return newSentence 

## Searches for given words in a given text and appends [id, keyword] pairs to the given list
## Note: the words to look for are read from the global dictionary keywords_search, which is defined further on
def search(identifier, text, wordlist):
    ## Go through every word in the dictionary
    for word, keyword in keywords_search.items():
        ## If the word is found, store the id and the indicated value in the list
        if word in text:
            wordlist.append([identifier, keyword])
    return wordlist

In [None]:
## Load the dataset
path = ''  ## Set the folder and name of the file (including .csv) between the quotes
data = pd.read_csv(path)

In [None]:
## Check if everything is loaded correctly
data.head(5)

## Create the bag of words

A bag of words is essentially the text of the article split into separate words, with punctuation and capitals removed. When the words in the bag of words are ordered alphabetically, the copyright restrictions no longer apply.
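
The steps described above can be tried out on a single string before applying them to the whole dataframe. The snippet below is a minimal sketch in plain Python; the function name `bag_of_words` and the sample sentence are made up for illustration and are not part of the dataset or the notebook.

```python
import re

def bag_of_words(text):
    """Lowercase a text, strip punctuation, and return its words in alphabetical order."""
    text = text.lower()
    # Drop every character that is not a word character or whitespace
    text = re.sub(r"[^\w\s]", "", text)
    # split() without arguments splits on any run of whitespace, including newlines
    words = sorted(text.split())
    return " ".join(words)

# Hypothetical sample sentence, only used to illustrate the transformation
sample = "De Griep heerscht,\nde stad is stil."
result = bag_of_words(sample)
print(result)
```

Because `split()` is called without arguments here, newlines and duplicate spaces do not produce empty entries, so no extra clean-up of leading spaces is needed in this sketch.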

In [106]:
## First, we create a new column which contains the article text with every word in lower case.
data['bag_of_words'] = data['text'].str.lower()
## Replace the newline characters \r and \n with a space, so words on different lines are not glued together
data['bag_of_words'] = data['bag_of_words'].str.replace('\r', ' ', regex=False)
data['bag_of_words'] = data['bag_of_words'].str.replace('\n', ' ', regex=False)
## Remove the punctuation: every character that is not (^) a word character or whitespace is replaced
## with the empty string (underscores count as word characters and are not removed)
data['bag_of_words'] = data['bag_of_words'].apply(lambda x: re.sub(r'[^\w\s]','',str(x)))

In [107]:
## To make sure the copyright restrictions no longer apply, we place the words in alphabetical order
data['bag_of_words'] = data['bag_of_words'].apply(lambda x: sortedSentence(x))
## And remove the leading spaces that appear because empty entries (caused by duplicate spaces) are sorted to the front
data['bag_of_words'] = data['bag_of_words'].str.lstrip(" ")

## Extract the year

Normally, you can use the built-in pandas datetime functions to extract the year. However, this does not work for dates that lie too far in the past: with the default nanosecond resolution, pandas timestamps cannot represent dates before 1677-09-21. Therefore, two methods of extracting the year are provided. For the provided test set the datetime method raises an error, so the second method is recommended.


#### The datetime method

In [None]:
## Create a new column with a copy of the date and set this column to datetime
## Note: for the provided dataset, this option will raise an error!
data['year']= pd.to_datetime(data['date'])
## Extract the year
data['year'] = data['year'].dt.year

#### The manual method

In [109]:
## Create a new column in which the date is split on the separator, and take the first part.
## Note: the test dataset has two types of date formats, e.g. '1865-04-28' and '1861/06/24 00:00:00'.
## To take care of that, we need two different string splits.
data['year'] = data['date'].str.split("-").str[0]
data['year'] = data['year'].str.split("/").str[0]
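
As an alternative to the two string splits, the leading four digits can be picked out with a regular expression. This is only a sketch under the assumption that every date starts with a four-digit year, as in the two formats mentioned above; the helper name `extract_year` is hypothetical.

```python
import re

def extract_year(date_string):
    """Return the leading four-digit year of a date string as an int, or None if there is none."""
    match = re.match(r"\s*(\d{4})", str(date_string))
    return int(match.group(1)) if match else None

# The first two values follow the formats found in the test set; the third shows the fallback
dates = ["1865-04-28", "1861/06/24 00:00:00", "unknown"]
years = [extract_year(d) for d in dates]
print(years)
```

On the dataframe this could be applied with `data['date'].apply(extract_year)`. Unlike the string splits, it returns numbers instead of strings, and `None` for values without a recognisable year.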

In [137]:
## Save the file with the bag of words and the year
dest_path = ''  ## Set the folder and name of the destination file (including .csv) between the quotes
data.to_csv(dest_path, index = False)

## Extract keywords

Now we are ready to extract some keywords from the bag of words. We store these keywords in a separate dataframe.

In [110]:
## First, we create a dictionary in which we define which keywords we want to search for, and which value they
## must be given in the frame. For example, Dutch has historical spelling variations, so if we want the keyword
## 'ziekte' we also have to search for 'sieckt' and 'siekte'. We search for the shortest meaningful word stem, to make
## sure we include every variation. We choose to set the keywords to the current variant.

keywords_search = { 'ziekt': 'ziekte',
                    'siekt': 'ziekte',
                    'sieck': 'ziekte',
                    'virus': 'virus',
                    'vira' : 'virus',
                    'viren': 'virus',
                    'griep' : 'griep',
                    'vaccin': 'vaccin',
                    'besmetting': 'besmetting',
                    'influenza' : 'influenza',
                    'quarantaine' : 'quarantaine',
                    'verspreiding' : 'verspreiding',
                    'verspreyding' : 'verspreiding',
                  }

In [111]:
## Create an empty list in which the keywords will be stored
keywords = []

## Iterate through the frame and use the function defined above to store found words
for index, row in data.iterrows():
    identifier = row['identifier']
    text = row['bag_of_words']
    ## Use unidecode to remove accents and normalise words, so this won't disturb matching
    text = unidecode(text)
    search(identifier, text, keywords)
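
The matching is a plain substring test, which is why a short stem such as 'ziekt' also catches 'ziekte' and 'ziekten'. The sketch below shows this on a few invented rows. It is a self-contained variant that receives the dictionary as an argument instead of reading a global; the names `find_keywords`, `search_terms` and `toy_rows` are assumptions made for this illustration.

```python
def find_keywords(identifier, text, search_terms):
    """Return [identifier, keyword] pairs for every search term that occurs in text."""
    found = []
    for term, keyword in search_terms.items():
        # Substring match: 'ziekt' also matches 'ziekte' and 'ziekten'
        if term in text:
            found.append([identifier, keyword])
    return found

# Reduced dictionary and invented rows, only for illustration
search_terms = {"ziekt": "ziekte", "siekt": "ziekte", "griep": "griep"}
toy_rows = [
    ("article-1", "de griep en andere ziekten"),
    ("article-2", "een siekte in de stad"),
    ("article-3", "het weer is fraai"),
]

toy_keywords = []
for identifier, text in toy_rows:
    toy_keywords.extend(find_keywords(identifier, text, search_terms))
print(toy_keywords)
```

Note that a substring test can also match inside unrelated longer words, and that an article containing two spelling variants of the same keyword is listed twice for that keyword.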

In [112]:
## Make a list of all identifiers that have one or more keywords
identifiers = [item[0] for item in keywords]
## And make this a set to remove duplicates
identifiers = set(identifiers)

In [113]:
## We now have all the identifiers that have a keyword, but we want to give all the other identifiers the keyword
## 'overig' (other), so we can see how many articles are not related to any of the given keywords
for index, row in data.iterrows():
    identifier = row['identifier']
    if identifier not in identifiers:
        keywords.append([identifier, 'overig'])               

In [114]:
## Turn the list into a dataframe
keywords_frame = pd.DataFrame(keywords, columns=['identifier','keyword']) 

In [115]:
## Save the dataframe
dest_path_key = ''  ## Set the folder and name of the destination file (including .csv) between the quotes
keywords_frame.to_csv(dest_path_key, index = False)