# Text preprocessing

This notebook gives some examples of how to use the script.

## First steps
First, download the `text_preprocessing.py` file from the script directory and place it in any local directory you want.

We recommend using a virtual environment, such as a conda or venv environment.

Then follow the process shown here for single texts (string type), lists of strings, or pandas DataFrames.

---


## Add the script to the path and import it

First, add the directory containing the script to the Python path, as shown below.

In [1]:
import sys
sys.path.append('../text_preprocessing/')
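
A relative path is resolved against the directory the notebook is launched from, so turning it into an absolute path can make the import more predictable. The following is a minimal sketch, assuming the script lives in `../text_preprocessing/` as above; `script_dir` is a name introduced here only for illustration.

```python
import sys
from pathlib import Path

# Assumed location of the directory that holds text_preprocessing.py.
script_dir = Path("../text_preprocessing").resolve()

# Add the directory only once, so re-running the cell does not duplicate it.
if str(script_dir) not in sys.path:
    sys.path.append(str(script_dir))
```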

Then import the script like a normal library.

In [None]:
import text_preprocessing as tp


---

## Initializing the object

Initialize the object by passing the language you want to use for preprocessing as the argument.
The supported languages are:
- 'english'
- 'spanish'

In [14]:
prep_en = tp.Preprocessing('english')
prep_es = tp.Preprocessing('spanish')


---

Once the object is initialized, you can preprocess the texts contained in the following data structures:

- String objects
- Lists of string objects
- pandas DataFrames

The signature of the main preprocessing function and the meaning of each argument are given below.

```python
main_preprocess(data, column= None, 
                      tweet = False,
                      tweet_tags = False,
                      remove_stop_words = False, 
                      lemmatize = False,
                      translate_emojis = False,
                      whitelist = "")
```

- data: the data you want to preprocess
- column: if you want to preprocess the texts contained in a DataFrame, you must use this field to provide the name of the column where the texts are stored as strings
- tweet: indicates that the text was retrieved from Twitter. If `True`, the function processes the following Twitter entities by default:
    - HTML entities.
    - Users: removes the words beginning with @, like @user
    - Hashtags: splits the hashtags into their main words, as in #LifesMatter -> Lifes Matter
    - Retweets: removes the 'rt' identifiers
- tweet_tags: if `True`, instead of removing all the Twitter entities, the function replaces the following entities with a special tag:
    - rt: the retweet identifier is tagged as 'RT'
    - urls: all the urls are tagged as 'URL'
    - hashtag: the hashtags are tagged as 'HASHTAG'
- remove_stop_words: if `True`, the stopwords contained in the stopword corpus of the NLTK library are removed
- is_dataframe: if `True`, indicates that you are passing a pandas DataFrame object to the function
- lemmatize: if `True`, the words are lemmatized using the spaCy language model
- translate_emojis: if `True`, emojis are translated to their textual meaning, e.g. :smiley: becomes smiley
- whitelist: a string that contains the characters you do not want to remove during the preprocessing stage
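
To make the idea behind `whitelist` concrete, here is a rough sketch of one way a whitelist could protect characters during cleaning. `keep_letters` is a hypothetical helper written for this illustration; it is not the code in `text_preprocessing.py`, which may handle the whitelist differently.

```python
import re

def keep_letters(text: str, whitelist: str = "") -> str:
    """Keep lowercase ASCII letters plus any whitelisted characters (illustrative only)."""
    # Escape the whitelist so characters like '-' or ']' are safe inside the class.
    allowed = "a-z\\s" + re.escape(whitelist)
    pattern = re.compile(f"[^{allowed}]")
    # Replace every disallowed character with a space, then collapse whitespace.
    return " ".join(pattern.sub(" ", text.lower()).split())

print(keep_letters("El niño come 3 piñas!"))                 # 'ñ' is removed
print(keep_letters("El niño come 3 piñas!", whitelist="ñ"))  # 'ñ' is kept
```

In this sketch, a character survives only if it is a lowercase ASCII letter, whitespace, or listed in `whitelist`.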


---

## Examples

### Single texts

#### English


In [4]:
text = 'This is a text to show the script users how to preprocess a string with the script. :) !!!! pl3ase n0t3 all the things removed and feel free to check the code'

preprocessed_text = prep_en.main_preprocess(data = text)

print(preprocessed_text)

this is a text to show the script users how to preprocess a string with the script all the things removed and feel free to check the code


#### Spanish

In [6]:
text = 'Este es un TEXTO de pru3ba, la transformación de texto a texto plano se realiza como 4 c0nt1nuaci0n'

preprocessed_text = prep_es.main_preprocess(data = text)

print(preprocessed_text)

este es un texto de la transformacion de texto a texto plano se realiza como
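
Judging only from the two outputs above, the default cleaning appears to lowercase the text, drop tokens that contain digits, strip accents, and remove punctuation. The sketch below reproduces both outputs using only the standard library. `rough_clean` is a hypothetical helper written for this illustration; it is not the code in `text_preprocessing.py`, which may work differently on other inputs.

```python
import re
import unicodedata

def rough_clean(text: str) -> str:
    """Approximate the default cleaning seen in the examples (illustrative only)."""
    text = text.lower()
    # Drop whitespace-delimited tokens that contain any digit, e.g. 'pl3ase' or '4'.
    tokens = [tok for tok in text.split() if not any(ch.isdigit() for ch in tok)]
    text = " ".join(tokens)
    # Strip accents: decompose characters, then discard the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace anything that is not a lowercase ASCII letter or whitespace.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse the runs of spaces left behind by the replacements.
    return " ".join(text.split())

print(rough_clean("Este es un TEXTO de pru3ba, la transformación de texto"))
```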



---

### Lists of texts
#### English

In [7]:
text_list = [
    'text 1: this is 4n example TEXT',
    'this is another. Example text?',
    "Oh god, I don't what else I should write in here!!!! :( (hehe)"
]


preprocessed_text_list = prep_en.main_preprocess(data = text_list,
                                                 lemmatize = True)
preprocessed_text_list

['text this be example text',
 'this be another example text',
 'oh god I do not what else I should write in here hehe']

#### Spanish

In [8]:
text_list = [
    'texto 1: ESTE es un texto de EjemPlo',
    'Este texto es otro t3xt0 de ejemplo',
    "este otro texto, es más largo :) para que sirva como otro ejemplo!!!! :( (hehe)"
]


preprocessed_text_list = prep_es.main_preprocess(data = text_list,
                                                 lemmatize = True)
preprocessed_text_list

['texto este ser uno texto de ejemplo',
 'este texto ser otro de ejemplo',
 'este otro texto ser mas largo para que servir como otro ejemplo hehe']


---

### Data frames

In [9]:
import pandas as pd

In [15]:
df = pd.read_csv('example.csv')
df.TWEETS[:3]

0    La novena estadounidense se fue arriba en la p...
1    🔴#ÚltimaHora🔴 Japón conquistó su tercer Clásic...
2    Ahora | #MilenioNegocios con Regina Reyes-Hero...
Name: TWEETS, dtype: object

First example.

Removing Twitter entities.

In [17]:
preprocessed_df_1 = prep_es.main_preprocess(data = df, 
                                            column= 'TWEETS', 
                                            tweet = True,
                                            tweet_tags = False,
                                            remove_stop_words = False, 
                                            lemmatize = False,
                                            translate_emojis = False)

preprocessed_df_1.TWEETS[:3]

0    la novena estadounidense se fue arriba en la p...
1    ultimahora japon conquisto su tercer clasico m...
2    ahora milenionegocios con regina reyes heroles...
Name: TWEETS, dtype: object

Second example.

Tagging Twitter entities, and removing stopwords.

In [18]:
preprocessed_df_2 = prep_es.main_preprocess(data = df, 
                                            column= 'TWEETS', 
                                            tweet = True,
                                            tweet_tags = True,
                                            remove_stop_words = True, 
                                            lemmatize = False,
                                            translate_emojis = False)

preprocessed_df_2.TWEETS[:3]

0    novena estadounidense arriba pizarra luego jap...
1    ultimahora japon conquisto tercer clasico mund...
2    ahora HASHTAG regina reyes heroles entrevista URL
Name: TWEETS, dtype: object
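
As an illustration of what tagging means, the sketch below replaces URLs, hashtags, and retweet markers with the tags described for `tweet_tags`. The function name `tag_tweet_entities` and the regular expressions are assumptions made for this example, not the script's implementation; the script's own matching rules may differ, for instance in which hashtags it recognizes.

```python
import re

# Hypothetical patterns; the real script may use different ones.
URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
RT_RE = re.compile(r"\brt\b", flags=re.IGNORECASE)

def tag_tweet_entities(text: str) -> str:
    """Replace URLs, hashtags, and 'rt' markers with fixed tags (illustrative only)."""
    # URLs go first so that a '#fragment' inside a URL is not tagged as a hashtag.
    text = URL_RE.sub("URL", text)
    text = HASHTAG_RE.sub("HASHTAG", text)
    text = RT_RE.sub("RT", text)
    return text

print(tag_tweet_entities("rt Ahora | #MilenioNegocios https://example.com/x"))
```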

Third example.

Tagging Twitter entities, removing stopwords, and lemmatizing.

In [19]:
preprocessed_df_3 = prep_es.main_preprocess(data = df, 
                                            column= 'TWEETS', 
                                            tweet = True,
                                            tweet_tags = True,
                                            remove_stop_words = True, 
                                            lemmatize = True,
                                            translate_emojis = False)

preprocessed_df_3.TWEETS[:3]

0    noveno estadounidense arriba pizarro luego jap...
1    ultimahoro japon conquistar tercer clasico mun...
2     ahora HASHTAG reginar reyes herol entrevista URL
Name: TWEETS, dtype: object

Fourth example.

Tagging Twitter entities, removing stopwords, lemmatizing, and translating emojis to text.

In [20]:
preprocessed_df_4 = prep_es.main_preprocess(data = df, 
                                            column= 'TWEETS', 
                                            tweet = True,
                                            tweet_tags = True,
                                            remove_stop_words = True, 
                                            lemmatize = True,
                                            translate_emojis = True,
                                            whitelist= 'ñ')

preprocessed_df_4.TWEETS[:3]

0    noveno estadounidense arriba pizarro luego jap...
1    circulo rojo grande ultimahora circulo rojo gr...
2     ahora HASHTAG reginar reyes herol entrevista URL
Name: TWEETS, dtype: object