In [None]:
## Notebook introducing the words_n_fun module
# Copyright (C) <2018-2022>  <Agence Data Services, DSI Pôle Emploi>
# 
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU Affero General Public License as
# published by the Free Software Foundation, either version 3 of the
# License, or (at your option) any later version.
# 
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU Affero General Public License for more details.
# 
# You should have received a copy of the GNU Affero General Public License
# along with this program.  If not, see <https://www.gnu.org/licenses/>.

# Tutorial notebook for the words_n_fun module

## Introduction

This notebook highlights how to use preprocessing features of the words_n_fun module on a given text corpus. 

To do so, we will work on an English dataset, `comments.csv`, which is located alongside this notebook.  
This dataset contains **several thousand comments about YouTube videos**, from https://www.kaggle.com/datasets/advaypatil/youtube-statistics.

The package structure (as of 09/2022) looks like this:
```
.
├── configs
│   └── pipeline_usage_order.json
├── __init__.py
├── nltk_data
│   └── corpora
│       └── stopwords
│           └── french
├── preprocessing
│   ├── api.py
│   ├── basic.py
│   ├── __init__.py
│   ├── lemmatizer.py
│   ├── split_sentences.py
│   ├── stopwords.py
│   ├── synonym_malefemale_replacement.py
│   └── vectorization_tokenization.py
└── utils.py
```

The `utils.py` file provides utility functions. The `configs` subfolder includes a JSON file that is used to trigger warnings when preprocessing functions are called in the wrong order. The `nltk_data` subfolder provides data to be used with nltk (for now, French only).
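
To make the idea of an order check concrete, here is a minimal sketch. The rule format, the `ORDER_RULES` dictionary and the `check_pipeline_order` function are all assumptions made for illustration: they do not reflect the actual content of `pipeline_usage_order.json` nor the way words_n_fun reads it.

```python
import logging

logger = logging.getLogger("pipeline_order_sketch")

# Hypothetical rules (NOT the actual content of pipeline_usage_order.json):
# each step maps to the steps that are expected to run before it.
ORDER_RULES = {
    "remove_stopwords": ["to_lower"],
    "remove_accents": ["remove_stopwords"],
}

def check_pipeline_order(pipeline):
    """Return (and log as warnings) the ordering problems found in a pipeline."""
    problems = []
    for position, step in enumerate(pipeline):
        for prerequisite in ORDER_RULES.get(step, []):
            # Problem: the prerequisite is present, but only after the current step
            if prerequisite in pipeline[position + 1:]:
                message = f"'{step}' is called before '{prerequisite}'"
                logger.warning(message)
                problems.append(message)
    return problems

print(check_pipeline_order(["remove_stopwords", "to_lower"]))  # one problem reported
print(check_pipeline_order(["to_lower", "remove_stopwords"]))  # no problem: empty list
```

The check only warns and never reorders the steps, so the caller stays in control of the pipeline.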

The most important part is the `preprocessing` subfolder:

- `basic.py`: this file exposes **all available preprocessing functions**. These functions preprocess pandas Series, but the `utils.data_agnostic` decorator also lets them process strings, lists of strings, np.arrays and pandas DataFrames (for a DataFrame, it uses either the `prefered_column` argument or the first column).
- **`api.py`**: this file includes **the main entry point: `preprocess_pipeline`**. This function takes input data and applies a preprocessing pipeline to it. It also handles different input types (the same types as `utils.data_agnostic`).
- `lemmatizer.py`, `split_sentences.py`, `stopwords.py`, `synonym_malefemale_replacement.py`, `vectorization_tokenization.py`: these files contain more complex and specific preprocessing functions.
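
The idea behind a "data agnostic" decorator can be sketched without pandas. The example below is a deliberately simplified stand-in, not the implementation of `utils.data_agnostic`: the hypothetical `accepts_str_or_list` decorator only handles strings and lists, whereas the real decorator targets pandas Series and several other input types.

```python
import functools

def accepts_str_or_list(func):
    """Let a function written for a list of strings also accept a single string."""
    @functools.wraps(func)
    def wrapper(docs, *args, **kwargs):
        if isinstance(docs, str):
            # Wrap the single string, process it, then unwrap the result
            return func([docs], *args, **kwargs)[0]
        return func(list(docs), *args, **kwargs)
    return wrapper

@accepts_str_or_list
def to_lower(docs):
    """Lowercase every document of a list of strings."""
    return [doc.lower() for doc in docs]

print(to_lower("This is a TEST"))             # a string in, a string out
print(to_lower(["First DOC", "Second DOC"]))  # a list in, a list out
```

The preprocessing function is written once for a single input type, and the wrapper converts the input before the call and the output after it.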

---

## Imports

In [None]:
import os
import functools
import numpy as np
import pandas as pd

from words_n_fun import utils
from words_n_fun.preprocessing import api, basic

# Reduce amount of logs for wnf
import logging
logging.getLogger('words_n_fun').setLevel(logging.ERROR)

---

## Load dataset


Here we load the dataset into a pandas DataFrame, then extract the pandas Series to be preprocessed (i.e. the `Comment` column).

In [None]:
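# Manage path & load as a pd dataframe
# Note: '__file__' is quoted on purpose. A notebook does not define __file__,
# so this expression resolves to the current working directory.
dir_path = os.path.dirname(os.path.realpath('__file__'))
file_path = os.path.join(dir_path, "comments.csv")
df = pd.read_csv(file_path, sep=',', encoding='utf-8', index_col=0)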

In [None]:
# Displays the first 3 rows of the dataset
df.head(3)

In [None]:
# Shape of the dataset :
print(f"The loaded dataset has {df.shape[0]} rows and {df.shape[1]} columns")

**The preprocessing will be applied to the "Comment" column**

In [None]:
# Select pd Series to be preprocessed
docs = df["Comment"]

In [None]:
docs

---

# Preprocessing

As mentioned in the introduction, the main entry point for words_n_fun preprocessing is `api.preprocess_pipeline`. This function takes two main arguments:
- `docs`: the data to be preprocessed (str, list, np.ndarray, pd.Series or pd.DataFrame)
- `pipeline`: a list of preprocessing functions to apply successively to the input data. Some basic functions are listed in the `api.USAGE` dictionary, so their string keys can be used instead of the functions themselves in the pipeline definition.
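
The mechanism of mixing string keys and functions in a pipeline can be sketched as follows. This is a pandas-free illustration under assumptions: `REGISTRY` and `run_pipeline` are hypothetical names that mimic the roles of a name-to-function dictionary and of a pipeline runner, and they are not the words_n_fun implementation.

```python
import string

def to_lower(docs):
    """Lowercase every document."""
    return [doc.lower() for doc in docs]

def remove_punct(docs):
    """Replace every ASCII punctuation character with a space."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return [doc.translate(table) for doc in docs]

def strip_spaces(docs):
    """Collapse runs of whitespace and trim both ends."""
    return [" ".join(doc.split()) for doc in docs]

# Hypothetical registry playing the role of a name -> function dictionary
REGISTRY = {"to_lower": to_lower, "remove_punct": remove_punct}

def run_pipeline(docs, pipeline):
    """Apply each step in order; a step is either a registry key or a callable."""
    for step in pipeline:
        func = REGISTRY[step] if isinstance(step, str) else step
        docs = func(docs)
    return docs

# Two steps referenced by name, one custom function passed directly
print(run_pipeline(["Hello, World!"], ["to_lower", "remove_punct", strip_spaces]))
```

Because each step receives the output of the previous one, the order of the list matters.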

### Simple preprocessing

We will start with a simple preprocessing pipeline:

In [None]:
pipeline_1 = ['remove_non_string', 'to_lower', 'remove_punct']

This pipeline will:
- replace all NaNs with an empty string
- convert all letters to lowercase
- remove (most of) the punctuation

Let's try it!

In [None]:
# First, on a string
print('\n---------------\n')
test_str = 'This is a test !'
print(f"{test_str} ---> {api.preprocess_pipeline(test_str, pipeline_1)}")
# Then on a list
print('\n---------------\n')
test_list = ['This is a test !', 'Btw, this sentence is also a test ;)']
print(f"{test_list} ---> {api.preprocess_pipeline(test_list, pipeline_1)}")
# Then on an np array
print('\n---------------\n')
test_np_array = np.array(['This is a test !', 'Btw, this sentence is also a test ;)'])
print(f"{test_np_array} ---> {api.preprocess_pipeline(test_np_array, pipeline_1)}")
# Then on a pd Series
print('\n---------------\n')
test_pd_series = pd.Series(['This is a test !', 'Btw, this sentence is also a test ;)'])
print(f"{test_pd_series} \n--->\n {api.preprocess_pipeline(test_pd_series, pipeline_1)}")
# Then on a DataFrame
print('\n---------------\n')
test_pd_dataframe = pd.DataFrame({'col1' : ['Test 1.', 'Test 2!'], 'col2' : ['Test 3?', 'Test 4$']})
print(f"{test_pd_dataframe} \n--->\n {api.preprocess_pipeline(test_pd_dataframe, pipeline_1)}")
# Finally on a DataFrame - version 2
print('\n---------------\n')
test_pd_dataframe = pd.DataFrame({'col1' : ['Test 1.', 'Test 2!'], 'col2' : ['Test 3?', 'Test 4$']})
print(f"{test_pd_dataframe} \n--->\n {api.preprocess_pipeline(test_pd_dataframe, pipeline_1, prefered_column='col2', modify_data=False)}")

As you can see, this function can process many types of input.

### Use custom functions

You are not limited to the provided functions: you can also use custom functions 😊 Let's try it!

In [None]:
# This function replaces all 'mr.' with 'mister'. We provide a function `get_regex_match_words` to
# automatically create the correct regex. You can have a look at it, but we could have used any other preprocessing function.
def my_custom_function(docs: pd.Series):
    '''Replaces 'mr.' with 'mister' '''
    my_regex = utils.get_regex_match_words(['mr.'], case_insensitive=True, words_as_regex=False)
    docs = docs.str.replace(my_regex, 'mister', regex=True)
    return docs

In [None]:
# Let's try it alone
my_custom_function(pd.Series(['Hello Mr. Smith']))

In [None]:
# Now let's use it in a pipeline !
# We add utils.data_agnostic to make it work with more input types
pipeline_2 = ['remove_non_string', 'to_lower', utils.data_agnostic(my_custom_function), 'remove_punct']
test_str = 'Hello Mr. Smith !'
print(f"{test_str} ---> {api.preprocess_pipeline(test_str, pipeline_2)}")

### Adapt existing functions

As we have just seen, we can use custom functions in our pipelines. We can also adapt existing functions by overriding their default kwargs with partial functions (`functools.partial`).

Let's try it with `remove_stopwords`: we want to remove `hello` and `test` from our texts, so we reuse the `remove_stopwords` function with a custom set of words:

In [None]:
new_remove_stopwords = functools.partial(basic.remove_stopwords, opt='none', set_to_add=['hello', 'test', 'Hello', 'Test'])

In [None]:
# Let's try it alone
new_remove_stopwords(pd.Series(['Hello, this is a test !']))

In [None]:
# Now let's use it in a pipeline !
# No need to wrap it with utils.data_agnostic here: the functions in basic are already decorated
pipeline_3 = ['remove_non_string', 'to_lower', new_remove_stopwords, 'remove_punct']
test_str = 'Hello, this is a test !'
print(f"{test_str} ---> {api.preprocess_pipeline(test_str, pipeline_3)}")

It works like a charm!
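
For readers less familiar with `functools.partial`, here is a short standalone sketch of what it does. The `remove_words` function below is a hypothetical helper written for this example only (it is not `basic.remove_stopwords`) and works on plain lists of strings.

```python
import functools

def remove_words(docs, words=(), case_sensitive=True):
    """Drop every whitespace-separated token of each document that belongs to `words`."""
    banned = set(words) if case_sensitive else {word.lower() for word in words}
    cleaned = []
    for doc in docs:
        kept = [token for token in doc.split()
                if (token if case_sensitive else token.lower()) not in banned]
        cleaned.append(" ".join(kept))
    return cleaned

# Pre-fill two keyword arguments: the result is a new callable that only needs `docs`
remove_greetings = functools.partial(remove_words, words=["hello", "hi"], case_sensitive=False)

print(remove_greetings(["Hello this is a test", "hi there"]))
# A pre-filled keyword argument can still be overridden at call time
print(remove_greetings(["Hello you"], case_sensitive=True))
```

The partial object keeps the original function untouched, so several variants with different defaults can coexist in the same pipeline.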

---

# Full example on our dataset

In [None]:
# TODO: create a full pipeline, showcase each step, use it on the whole dataset, process by chunk, use it directly on the .csv? (not advised)

### Preprocessing on the whole corpus

In [None]:
# Sample: the first 10 comments
docs = df["Comment"][0:10]

<p>Here we define the desired pipeline.</p>
<p>Transformations are applied in the order in which they are specified:</p>
<ul>
    <li>**remove_non_string** : Replaces non-string values (such as NaN) with an empty string</li>
    <li>**get_true_spaces** : Replaces all white spaces with a single space</li>
    <li>**to_lower_except_singleletters** : Lower case transformation except for single letters (such as language R or language C)</li>
    <li>**pe_matching** : Basic one-to-one substitution.
        *Example* : "permis b" (French driving licence) => "permisb"</li>
    <li>**remove_gender_synonyms** : Finds occurrences where both the male and female versions of a single word are used (e.g. Serveur/Serveuse) and keeps only the male version (language convention)</li>
    <li>**remove_punct_except_parenthesis** : Replaces all non-alphanumeric characters with whitespace, except for parentheses</li>
    <li>**remove_numeric** : Returns a text without any numerical character</li>
    <li>**remove_stopwords** : Returns a text without stopwords</li>
    <li>**lemmatize** or **stemmatize** : Text lemmatization or stemming</li>
    <li>**remove_accents** : Returns a text without any accent</li>
    <li>**trim_string** : Replaces multiple white spaces with a single one</li>
    <li>**remove_leading_and_ending_spaces** : Removes leading and trailing white spaces</li>
</ul>

In [None]:
# Pipeline definition
pipeline = ['remove_non_string', 'get_true_spaces', 'to_lower_except_singleletters', 'pe_matching',
            'remove_gender_synonyms', 'remove_punct_except_parenthesis', 'remove_numeric',
            'remove_stopwords', 'stemmatize', 'remove_accents', 'trim_string', 'remove_leading_and_ending_spaces']

In [None]:
# Running the pipeline (preprocess_pipeline lives in the api module imported above)
docs_preprocess = api.preprocess_pipeline(docs, pipeline=pipeline)
docs_preprocess.head(3)

In [None]:
# Display the first documents before and after preprocessing
for i in range(0, 4):
    print("Document index n°", i, "before preprocessing :")
    print("'", docs[i], "'")
    print("  and after preprocessing ")
    print("'", docs_preprocess[i], "'")
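
The run above only covers a 10-row sample. For a corpus that is too large to preprocess in one pass, one option is to read and process it chunk by chunk. The sketch below uses only the standard library. The `iter_chunks` helper and the `preprocess` stand-in are hypothetical, and the small in-memory CSV merely imitates a file with a `Comment` column.

```python
import csv
import io
from itertools import islice

def iter_chunks(rows, chunk_size):
    """Yield successive lists of at most `chunk_size` rows."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk

def preprocess(docs):
    """Stand-in for a real preprocessing pipeline (lowercase + trim)."""
    return [doc.lower().strip() for doc in docs]

# Small in-memory CSV standing in for a large file on disk
csv_file = io.StringIO("Comment\nGreat VIDEO \nSo Cool\nLoved it\n")
reader = csv.DictReader(csv_file)

processed = []
for chunk in iter_chunks(reader, chunk_size=2):
    # Only one chunk of raw rows is held in memory at a time
    processed.extend(preprocess([row["Comment"] for row in chunk]))

print(processed)
```

With pandas, the same pattern is usually written with the `chunksize` argument of `pd.read_csv`, which returns an iterator of DataFrames.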

### Diving into each single step

We only consider the first row of our sample.

In [None]:
# Wrap the first document in a pd Series, as expected by the preprocessing functions
text = docs[0]
text = pd.Series(text)
print(text.values)

In [None]:
pipeline = ['notnull', 'remove_non_string', 'to_lower_except_singleletters', 'pe_matching', 'trim_string',
            'remove_gender_synonyms', 'remove_punct_except_parenthesis', 'remove_numeric',
            'remove_stopwords', 'lemmatize', 'remove_accents']

def preprocess_pipeline_detail(text, pipeline=pipeline):
    '''Applies each step of the pipeline in turn and prints the intermediate result'''
    print("Initial text")
    print(text.values)
    for item in pipeline:
        # Steps that are not registered in api.USAGE are silently skipped
        if item in api.USAGE:
            print("\n")
            print(str(item))
            text = api.USAGE[item](text)
            print(text.values)

In [None]:
preprocess_pipeline_detail(text,pipeline)