# Text preprocessing

This notebook gives some examples of how to use the script.

## First steps
First, download the `text_preprocessing.py` file from the script directory and place it in any local directory you want.
You may have to install the following libraries to make the script usable:
- pandas
- nltk
- spacy
- bs4
- textblob

We recommend using a virtual environment, such as a conda or venv environment.
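
Before going further, it can help to confirm that the libraries listed above are visible from the active environment. The snippet below is a minimal sketch that uses only the standard library; the helper name `missing_modules` is hypothetical and not part of the script. It only checks that the packages can be found, not that any extra data they may need (for example an NLTK corpus or a spaCy language model) is installed.

```python
import importlib.util

# Import names of the libraries listed above.
REQUIRED = ["pandas", "nltk", "spacy", "bs4", "textblob"]


def missing_modules(names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All required libraries are available.")
```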

Then follow the process shown here for single texts (string type), lists of strings, or DataFrames.

---


## Add the script to the path and import it

First, add the location of the directory containing the script to the path, as shown below.

In [3]:
import sys
sys.path.append('../script/')
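
If the import later fails, the relative path is the usual suspect. As a rough sketch, the hypothetical helper `add_script_dir` below (not part of the script) resolves the directory, checks that `text_preprocessing.py` is actually there, and only then appends it to the path. The `'../script/'` location is the one used in the cell above.

```python
import sys
from pathlib import Path


def add_script_dir(script_dir, filename="text_preprocessing.py"):
    """Append script_dir to sys.path if it contains the script; return True on success."""
    directory = Path(script_dir).resolve()
    if not (directory / filename).is_file():
        return False
    if str(directory) not in sys.path:
        sys.path.append(str(directory))
    return True


if not add_script_dir("../script/"):
    print("text_preprocessing.py not found; check the directory path.")
```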

Then import the script like a normal library.

In [4]:
import text_preprocessing as tp


---

## Initializing the object

Initialize the object, passing the language in which the preprocessing should be performed as the argument.
The supported languages are:
- 'english'
- 'spanish'

In [5]:
prep = tp.Preprocessing('english')


---

Once the object is initialized, you can preprocess the texts contained in the following data structures:

- String objects
- Lists of string objects
- DataFrames

The signature of the main preprocessing function and the meaning of each argument are given below.

```python
main_preprocess(data, column= None, tweet = False, 
                      remove_stop_words = False, is_dataframe= False, 
                      lemmatize = False, emoji_path = None)
```

- data: the data you want to preprocess.
- column: if you want to preprocess texts contained in a DataFrame, use this field to provide the name of the column where the texts are stored as strings.
- tweet: if `True`, the function cleans Twitter entities such as:
    - HTML entities.
    - Users: removes words beginning with @, like @user.
    - Hashtags: splits hashtags into their component words, as in #LifesMatter -> Lifes Matter.
    - Retweets: removes the 'rt' identifiers.
- remove_stop_words: if `True`, the stop words contained in the NLTK stop word corpus are removed.
- is_dataframe: set to `True` when you pass a pandas DataFrame object to the function.
- lemmatize: if `True`, the words are lemmatized using the spaCy language model.
- emoji_path: if you provide the path to an emoji dictionary that maps each emoji code to its textual description, like { :smiley: : smiley}, the function translates each emoji into its textual meaning.
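
To make the `tweet` and `emoji_path` options more concrete, here is a minimal sketch of the kind of cleanup they describe. It is an illustration only and not the code used by `text_preprocessing.py`: the function names `clean_tweet` and `translate_emojis` are hypothetical, the regular expressions are assumptions, and the emoji dictionary is passed in memory because the file format expected by `emoji_path` is not specified here.

```python
import html
import re


def clean_tweet(text):
    """Illustrative tweet cleanup (assumed behavior, not the script's implementation)."""
    text = html.unescape(text)  # decode HTML entities, e.g. &amp; -> &
    text = re.sub(r"\brt\b", " ", text, flags=re.IGNORECASE)  # drop the 'rt' marker
    text = re.sub(r"@\w+", " ", text)  # drop @user mentions
    # Split CamelCase hashtags into words: #LifesMatter -> Lifes Matter
    text = re.sub(
        r"#(\w+)",
        lambda m: re.sub(r"(?<=[a-z])(?=[A-Z])", " ", m.group(1)),
        text,
    )
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace


def translate_emojis(text, emoji_map):
    """Replace each emoji code in text with its textual description."""
    for code, meaning in emoji_map.items():
        text = text.replace(code, f" {meaning} ")
    return re.sub(r"\s+", " ", text).strip()


print(clean_tweet("RT @user #LifesMatter &amp; more"))
# Lifes Matter & more
print(translate_emojis("good morning :smiley:", {":smiley:": "smiley"}))
# good morning smiley
```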

---

## Single texts


In [10]:
text = 'This is a text to show the script users how to preprocess a string with the script. :) !!!! pl3ase n0t3 all the things removed and feel free to check the code'

preprocessed_text = prep.main_preprocess(data = [text])

print(preprocessed_text[0])

this is a text to show the script users how to preprocess a string with the script all the things removed and feel free to check the code
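
The example above wraps the string in a list and reads the first element of the result. If you do this often, a small wrapper can hide that step. The helper `preprocess_one` below is hypothetical and not part of the script; it assumes only that `main_preprocess` returns a list with one result per input, as in the cell above.

```python
def preprocess_one(prep, text, **kwargs):
    """Preprocess a single string by wrapping it in a list, as in the example above."""
    return prep.main_preprocess(data=[text], **kwargs)[0]


# Usage with the object created earlier:
# preprocess_one(prep, 'Some text to clean!!!', lemmatize=True)
```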



---

## Lists of texts

In [13]:
text_list = [
    'text 1: this is 4n example TEXT',
    'this is another. Example text?',
    "Oh god, I don't what else I should write in here!!!! :( (hehe)"
]

preprocessed_text_list = prep.main_preprocess(data = text_list, lemmatize = True)
preprocessed_text_list

['text this be example text',
 'this be another example text',
 'oh god I do not what else I should write in here hehe']


---

## Data frames

In [15]:
import pandas as pd

In [19]:
df = pd.read_csv('example.csv')
df.TWEETS[:3]

0    La novena estadounidense se fue arriba en la p...
1    🔴#ÚltimaHora🔴 Japón conquistó su tercer Clásic...
2    Ahora | #MilenioNegocios con Regina Reyes-Hero...
Name: TWEETS, dtype: object

In [22]:
prep_es = tp.Preprocessing('spanish')

First example: removing Twitter entities from a DataFrame.

In [23]:
preprocessed_df_1 = prep_es.main_preprocess(data = df, column= 'TWEETS', tweet = True, 
                      remove_stop_words = False, is_dataframe= True, 
                      lemmatize = False, emoji_path = None)

preprocessed_df_1.TWEETS[:3]

0    la novena estadounidense se fue arriba en la p...
1    ultima hora japon conquisto su tercer clasico ...
2    ahora milenio negocios con regina reyes herole...
Name: TWEETS, dtype: object

Second example: removing Twitter entities and stop words.

In [24]:
preprocessed_df_2 = prep_es.main_preprocess(data = df, column= 'TWEETS', tweet = True, 
                      remove_stop_words = True, is_dataframe= True, 
                      lemmatize = False, emoji_path = None)

preprocessed_df_2.TWEETS[:3]

0    novena estadounidense arriba pizarra luego jap...
1    ultima hora japon conquisto tercer clasico mun...
2    ahora milenio negocios regina reyes heroles so...
Name: TWEETS, dtype: object
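
Comparing the first two outputs shows which words `remove_stop_words` drops, for example "la", "se", "fue" and "en" in the first tweet. The sketch below reproduces that effect with a tiny hand-written stop word set. It is an assumption for illustration: the script itself relies on the NLTK stop word corpus, and the names `SPANISH_STOP_WORDS` and `remove_stop_words` here are hypothetical.

```python
# Tiny hand-written set for illustration only; NLTK's Spanish list is much longer.
SPANISH_STOP_WORDS = {"la", "se", "fue", "en", "su", "con"}


def remove_stop_words(text, stop_words):
    """Drop every whitespace-separated word that appears in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)


print(remove_stop_words("la novena estadounidense se fue arriba en la", SPANISH_STOP_WORDS))
# novena estadounidense arriba
```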

Third example: removing Twitter entities and stop words, and lemmatizing.

In [25]:
preprocessed_df_3 = prep_es.main_preprocess(data = df, column= 'TWEETS', tweet = True, 
                      remove_stop_words = True, is_dataframe= True, 
                      lemmatize = True, emoji_path = None)

preprocessed_df_3.TWEETS[:3]

0    noveno estadounidense arriba pizarro luego jap...
1    ultimo hora japon conquistar tercer clasico mu...
2    ahora milenio negocio regin reyes herol sod en...
Name: TWEETS, dtype: object