In [None]:
## Notebook introducing the words_n_fun module
# Copyright (C) <2018-2022>  <Agence Data Services, DSI Pôle Emploi>
# 
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.
# 
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU Affero General Public License for more details.
# 
# You should have received a copy of the GNU Affero General Public License
# along with this program.  If not, see <https://www.gnu.org/licenses/>.

# Tutorial notebook for the words_n_fun module

## Introduction

This notebook highlights how to use the preprocessing features of the words_n_fun module on a given text corpus. 

To do so, we will work on an English dataset, `comments.csv`, which is located alongside this notebook.  
This dataset contains **several thousand comments about YouTube videos**, from https://www.kaggle.com/datasets/advaypatil/youtube-statistics.

The package structure (as of 09/2022) looks like this :
```
.
├── configs
│   └── pipeline_usage_order.json
├── __init__.py
├── nltk_data
│   └── corpora
│       └── stopwords
│           └── french
├── preprocessing
│   ├── api.py
│   ├── basic.py
│   ├── __init__.py
│   ├── lemmatizer.py
│   ├── split_sentences.py
│   ├── stopwords.py
│   ├── synonym_malefemale_replacement.py
│   └── vectorization_tokenization.py
└── utils.py
```

The `utils.py` file provides utility functions. The `configs` subfolder includes a json file that is used to trigger warnings if preprocessing functions are called in the wrong order. The `nltk_data` subfolder provides data to be used with nltk (for now, French only).

The most important part is the `preprocessing` subfolder :

- `basic.py` : this file exposes **all available preprocessing functions**. These functions preprocess pandas Series, but we added a decorator, `utils.data_agnostic`, that lets them also process strings, lists of strings, np.arrays and pandas DataFrames (for DataFrames, it uses either a `prefered_column` arg or the first column).
- **`api.py`** : this file includes **the main entry point : `preprocess_pipeline`**. This function takes input data and applies a preprocessing pipeline to it. It also manages different input types (same types as `utils.data_agnostic`).
- `lemmatizer.py`, `split_sentences.py`, `stopwords.py`, `synonym_malefemale_replacement.py`, `vectorization_tokenization.py` : these files contain more complex and specific preprocessing functions.
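
To make the idea of a data-agnostic decorator more concrete, here is a minimal sketch in plain Python. It is not the implementation of `utils.data_agnostic` (which targets pandas Series and more input types). The names `accepts_single_string` and `to_upper` are hypothetical and only illustrate the wrap, process, unwrap pattern.

```python
import functools
from typing import Callable, List, Union


def accepts_single_string(func: Callable[..., List[str]]) -> Callable:
    """Hypothetical, simplified stand-in for a data-agnostic decorator.

    The decorated function is written for lists of strings only;
    the wrapper also lets it accept a single string.
    """
    @functools.wraps(func)
    def wrapper(docs: Union[str, List[str]], **kwargs):
        if isinstance(docs, str):
            # Wrap the lone string in a list, process it, then unwrap the result
            return func([docs], **kwargs)[0]
        return func(list(docs), **kwargs)
    return wrapper


@accepts_single_string
def to_upper(docs: List[str]) -> List[str]:
    """Toy preprocessing function written for lists of strings only."""
    return [doc.upper() for doc in docs]


single_result = to_upper("one text")
list_result = to_upper(["first text", "second text"])
print(single_result)
print(list_result)
```

The preprocessing function itself stays simple because it only has to handle one input type; the decorator takes care of converting back and forth.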

---

## Imports

In [None]:
import os
import functools
import numpy as np
import pandas as pd

from words_n_fun import utils
from words_n_fun.preprocessing import api, basic

# Reduce amount of logs for wnf
import logging
logging.getLogger('words_n_fun').setLevel(logging.ERROR)

---

## Load dataset


Here we load the dataset into a pandas dataframe, and we then extract the pandas series to be preprocessed (i.e. the `Comment` column)

In [None]:
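# Manage path & load as a pd dataframe
# ('__file__' is passed as a string on purpose : __file__ is not defined in a notebook,
# so this resolves to the current working directory)
dir_path = os.path.dirname(os.path.realpath('__file__'))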
file_path = os.path.join(dir_path, "comments.csv")
df = pd.read_csv(file_path, sep=',', encoding='utf-8', index_col=0)

In [None]:
# Display the first 3 rows of the dataset
df.head(3)

In [None]:
# Shape of the dataset :
print(f"The loaded dataset has {df.shape[0]} rows and {df.shape[1]} columns")

**The preprocessing will be applied to the "Comment" column**

In [None]:
# Select pd Series to be preprocessed
docs = df["Comment"]

In [None]:
docs

---

# Preprocessing

As mentioned in the introduction, the main entry point for words_n_fun preprocessing is `api.preprocess_pipeline`. This function takes two main arguments :
- `docs` : the data to be preprocessed (str, list, np.ndarray, pd.Series or pd.DataFrame)
- `pipeline` : a list of preprocessing functions to apply successively to the input data. Some basic functions are listed in the `api.USAGE` dictionary, so we can use their string keys instead of the functions themselves in the pipeline definition.
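
As a rough illustration of how a pipeline can mix string keys and functions, here is a library-free sketch. It is only an assumption about the general mechanism, not the code of `api.preprocess_pipeline`. The names `TOY_USAGE`, `run_pipeline` and `collapse_spaces` are hypothetical.

```python
import re
from typing import Callable, Dict, List, Union

# Hypothetical registry playing the role of a "name -> function" dictionary
TOY_USAGE: Dict[str, Callable[[str], str]] = {
    "to_lower": str.lower,
    "remove_punct": lambda text: re.sub(r"[^\w\s]", " ", text),
    "collapse_spaces": lambda text: re.sub(r"\s+", " ", text).strip(),
}


def run_pipeline(text: str, pipeline: List[Union[str, Callable[[str], str]]]) -> str:
    """Applies each step in order; string steps are looked up in the registry."""
    for step in pipeline:
        func = TOY_USAGE[step] if isinstance(step, str) else step
        text = func(text)
    return text


result = run_pipeline("Hello, World !", ["to_lower", "remove_punct", "collapse_spaces"])
print(result)
```

Because each step receives the output of the previous one, the order of the steps matters.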

### Simple preprocessing

We will start by using a simple preprocessing pipeline :

In [None]:
pipeline_1 = ['remove_non_string', 'to_lower', 'remove_punct']

This pipeline will :
- replace all NaNs with an empty string
- convert all letters to lowercase
- remove (most of) the punctuation

Let's try it !

In [None]:
# First, on a string
print('\n---------------\n')
test_str = 'This is a test !'
print(f"{test_str} ---> {api.preprocess_pipeline(test_str, pipeline_1)}")
# Then on a list
print('\n---------------\n')
test_list = ['This is a test !', 'Btw, this sentence is also a test ;)', None]
print(f"{test_list} ---> {api.preprocess_pipeline(test_list, pipeline_1)}")
# Then on an np array
print('\n---------------\n')
test_np_array = np.array(['This is a test !', 'Btw, this sentence is also a test ;)', None])
print(f"{test_np_array} ---> {api.preprocess_pipeline(test_np_array, pipeline_1)}")
# Then on a pd Series
print('\n---------------\n')
test_pd_series = pd.Series(['This is a test !', 'Btw, this sentence is also a test ;)', None])
print(f"{test_pd_series} \n--->\n {api.preprocess_pipeline(test_pd_series, pipeline_1)}")
# Then on a DataFrame
print('\n---------------\n')
test_pd_dataframe = pd.DataFrame({'col1' : ['Test 1.', 'Test 2!'], 'col2' : ['Test 3?', 'Test 4$']})
print(f"{test_pd_dataframe} \n--->\n {api.preprocess_pipeline(test_pd_dataframe, pipeline_1)}")
# Finally on a DataFrame - version 2
print('\n---------------\n')
test_pd_dataframe = pd.DataFrame({'col1' : ['Test 1.', 'Test 2!'], 'col2' : ['Test 3?', 'Test 4$']})
print(f"{test_pd_dataframe} \n--->\n {api.preprocess_pipeline(test_pd_dataframe, pipeline_1, prefered_column='col2', modify_data=False)}")

As you can see, this function can process many types of input.

### Use custom functions

You are not limited to the provided functions ! You can also use your own custom functions 😊 Let's try it !

In [None]:
# This function replaces all 'mr.' with 'mister'. We provide a function `get_regex_match_words` to
# automatically create the correct regex. You can have a look at it, but we could have used any other preprocessing function.
def my_custom_function(docs: pd.Series) -> pd.Series:
    '''Replaces 'mr.' with 'mister' '''
    my_regex = utils.get_regex_match_words(['mr.'], case_insensitive=True, words_as_regex=False)
    docs = docs.str.replace(my_regex, 'mister', regex=True)
    return docs

In [None]:
# Let's try it alone
my_custom_function(pd.Series(['Hello Mr. Smith']))

In [None]:
# Now let's use it in a pipeline !
# We add utils.data_agnostic to make it work with more input types
pipeline_2 = ['remove_non_string', 'to_lower', utils.data_agnostic(my_custom_function), 'remove_punct']
test_str = 'Hello Mr. Smith !'
print(f"{test_str} ---> {api.preprocess_pipeline(test_str, pipeline_2)}")

### Adapt existing functions

As we have seen before, we can use custom functions in our pipelines. But we can also adapt existing functions by modifying the default kwargs. To do so, we will use partial functions (`functools.partial`). 

Let's try it with `remove_stopwords`. We want to remove `hello` and `test` from our texts. To do so, we will reuse the `remove_stopwords` function :

In [None]:
new_remove_stopwords = functools.partial(basic.remove_stopwords, opt='none', set_to_add=['hello', 'test', 'Hello', 'Test'])

In [None]:
# Let's try it alone
new_remove_stopwords(pd.Series(['Hello, this is a test !']))

In [None]:
# Now let's use it in a pipeline !
# No utils.data_agnostic wrapper is needed here : functions from basic.py are already decorated with it
pipeline_3 = ['remove_non_string', 'to_lower', new_remove_stopwords, 'remove_punct']
test_str = 'Hello, this is a test !'
print(f"{test_str} ---> {api.preprocess_pipeline(test_str, pipeline_3)}")

It works like a charm !
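
For readers less familiar with `functools.partial`, the mechanism can be shown without the library. This sketch uses a toy function, `drop_words`, which is hypothetical and not part of words_n_fun.

```python
import functools
from typing import Iterable, List


def drop_words(docs: List[str], words: Iterable[str] = ()) -> List[str]:
    """Toy function (not part of words_n_fun): removes the given words from each text."""
    blocked = set(words)
    return [" ".join(w for w in doc.split() if w not in blocked) for doc in docs]


# Freeze the keyword argument; the resulting callable only needs `docs`
drop_greetings = functools.partial(drop_words, words=["hello", "test"])

cleaned = drop_greetings(["hello this is a test"])
print(cleaned)
```

The partial object behaves like a one-argument function, which is what a pipeline step is expected to be.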

---

# Full example on our dataset

Now that we have seen how to create and use a preprocessing pipeline, we will preprocess our dataset.  

Let's first define our pipeline :

In [None]:
pipeline_4 = ['remove_non_string', 'get_true_spaces', 'remove_punct', 'to_lower', 'remove_numeric',
              'trim_string', 'remove_leading_and_ending_spaces']

If we want, we can show the result of each step :

In [None]:
# We define a function to showcase each step results. Works only for predefined functions in api.USAGE.
def preprocess_pipeline_detail(text: str, pipeline: list):
    print(f"\033[94mInitial text :\033[0m {text}\n")
    for item in pipeline:
        text = api.USAGE[item](text)
        print(f"\033[94mAfter preprocess {item}:\033[0m {text}\n")

In [None]:
preprocess_pipeline_detail("      This is a test,    \n\n  with some  \t puncTuAtIONs and numbers 1 2 3 4 !", pipeline_4)

Let's use our preprocessing on the **whole dataset** !

In [None]:
# Preprocess
preprocessed_docs = api.preprocess_pipeline(docs, pipeline_4)

In [None]:
# Display some results
preprocessed_docs

We can also specify a **chunk size** if the dataset is too large to be preprocessed at once.
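
Conceptually, chunked processing splits the data into consecutive slices, processes each slice, and concatenates the results. The sketch below illustrates this in plain Python. It is an assumption about the general idea, not how words_n_fun implements `chunksize`, and the helpers `iter_chunks` and `process_by_chunks` are hypothetical.

```python
from typing import Callable, Iterator, List


def iter_chunks(docs: List[str], chunksize: int) -> Iterator[List[str]]:
    """Yields consecutive slices of at most `chunksize` elements."""
    for start in range(0, len(docs), chunksize):
        yield docs[start:start + chunksize]


def process_by_chunks(docs: List[str],
                      func: Callable[[List[str]], List[str]],
                      chunksize: int) -> List[str]:
    """Applies `func` chunk by chunk and concatenates the results in order."""
    results: List[str] = []
    for chunk in iter_chunks(docs, chunksize):
        results.extend(func(chunk))
    return results


docs_sample = [f"Comment {i}" for i in range(7)]
chunk_sizes = [len(chunk) for chunk in iter_chunks(docs_sample, 3)]
lowered = process_by_chunks(docs_sample, lambda chunk: [d.lower() for d in chunk], 3)
print(chunk_sizes)
print(lowered)
```

Only one chunk is transformed at a time, which limits the size of the intermediate data, while the final result is the same as processing everything at once.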

In [None]:
# Let's reactivate the logs just to showcase the chunk split
logging.getLogger('words_n_fun').setLevel(logging.INFO)
preprocessed_docs_2 = api.preprocess_pipeline(docs, pipeline_4, chunksize=5000)
logging.getLogger('words_n_fun').setLevel(logging.ERROR)

**/!\ Not advised /!\**

We can even preprocess the dataset directly by providing only its file path. This is not advised, as it can be hazardous.

In [None]:
# Preprocess our dataset by providing its path and the column to be preprocessed
new_csv_path = api.preprocess_pipeline(file_path, pipeline_4, prefered_column='Comment', modify_data=False)
print(new_csv_path)

In [None]:
# Reload it and check first rows
df_preprocessed = pd.read_csv(new_csv_path, sep=',', encoding='utf-8', index_col=0)
df_preprocessed.head(5)

---

# Conclusion

In this notebook, we have covered the main principles of the words_n_fun library. If you have any comments about this notebook, feel free to create an issue on GitHub.