# Text Preprocessing 

This notebook walks you through common text preprocessing operations.
All the preprocessing functions are located in **nautilus_nlp.preprocessing.preprocess**.

## Load an example

In [1]:
english_text = """
The nautilus 🐚🐚 (from the Latin form of the original Ancient Greek: ναυτίλος, 'sailor') is a pelagic marine mollusc of the cephalopod family Nautilidae, the sole extant family of the superfamily Nautilaceae and of its smaller but near equal suborder, Nautilina.

It comprises six living species in two genera, the type of which is the genus Nautilus. Though it more specifically refers to species Nautilus pompilius, the name chambered nautilus is also used for any of the Nautilidae. All are protected under CITES Appendix II.

Nautilidae, both extant and extinct, are characterized by involute or more or less convolute shells that are generally smooth, with compressed or depressed whorl sections, straight to sinuous sutures, and a tubular, generally central siphuncle.[3] Having survived relatively unchanged for millions of years, nautiluses represent the only living members of the subclass nautiloidea, and are often considered "living fossils".

The word nautilus is derived from the Greek ναυτίλος nautílos and originally referred to the paper nautiluses of the genus Argonauta, which are actually octopuses. The word nautílos literally means "sailor", as paper nautiluses were thought to use two of their arms as sails.[4]

Here's how you can contact us:

Source : https://en.wikipedia.org/wiki/Nautilus
Contact:
Email : Jules.vernes@nautilus.net
Phone : +33 6 24 24 24 24 
Cost : €45
"""

## preprocess_text: wrap-up function

In [2]:
from nautilus_nlp.preprocessing.preprocess import preprocess_text

In [3]:
clean_txt = preprocess_text(english_text,
                            fix_unicode=False,
                            lowercase=True,
                            no_urls=False,
                            no_emails=True,
                            no_phone_numbers=True,
                            phone_method='detection', # 'detection' relies on a phone number library
                            phone_countries_format=[None, 'US', 'FR'], # phone formats to test; None stands for international
                            no_numbers=True,
                            no_currency_symbols=True,
                            no_punct=True,
                            no_contractions=True, # expands English contractions
                            no_accents=True,
                            no_emoji=True,
                            replace_with=' ',
                            no_stopwords='en' # removes English stopwords
                           )
print(clean_txt)

nautilus latin form original ancient greek ναυτιλος sailor pelagic marine mollusc cephalopod family nautilidae sole extant family superfamily nautilaceae smaller equal suborder nautilina comprises living species genera type genus nautilus specifically refers species nautilus pompilius chambered nautilus nautilidae protected cites appendix ii nautilidae extant extinct characterized involute convolute shells generally smooth compressed depressed whorl sections straight sinuous sutures tubular generally central siphuncle survived unchanged millions years nautiluses represent living members subclass nautiloidea considered living fossils word nautilus derived greek ναυτιλος nautilos originally referred paper nautiluses genus argonauta octopuses word nautilos literally means sailor paper nautiluses thought arms sails contact source https en wikipedia org wiki nautilus contact email phone cost


# Overview of the pre-processing functions

If you prefer, you can call the preprocessing functions one by one and build your own preprocessing pipeline. This is the recommended approach, as many parameters are not necessarily exposed by the wrap-up function.

For the sake of simplicity, we won't go through every preprocessing function. Please refer to the documentation if you want more details!
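
As a rough illustration of what building your own pipeline can look like, here is a minimal sketch that chains a few steps in order. The step functions (`lowercase`, `strip_urls`, `collapse_whitespace`) and the `make_pipeline` helper are hypothetical stand-ins written with the standard library only. They are not part of nautilus_nlp; in practice you would plug in the library functions shown in the sections below.

```python
import re
from functools import reduce
from typing import Callable, Iterable

# A step is any function that takes a string and returns a string.
Step = Callable[[str], str]


def lowercase(text: str) -> str:
    return text.lower()


def strip_urls(text: str) -> str:
    # Crude URL pattern, for illustration only.
    return re.sub(r"https?://\S+", " ", text)


def collapse_whitespace(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()


def make_pipeline(steps: Iterable[Step]) -> Step:
    """Return a function that applies each step in order (hypothetical helper)."""
    steps = list(steps)

    def run(text: str) -> str:
        return reduce(lambda acc, step: step(acc), steps, text)

    return run


pipeline = make_pipeline([lowercase, strip_urls, collapse_whitespace])
result = pipeline("Source : https://en.wikipedia.org/wiki/Nautilus\nContact US")
print(result)  # source : contact us
```

The order of the steps matters: here whitespace is collapsed last so that the gaps left by removed URLs are cleaned up.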

## Remove end-of-line characters

In [4]:
from nautilus_nlp.preprocessing.preprocess import remove_EOL_characters

In [5]:
text_with_eol = "hello words!\nthis is the second line."

text_without_eol = remove_EOL_characters(text_with_eol)

print('Before:\n-------\n{}\n\nAfter:\n-------\n{}'.format(text_with_eol,text_without_eol))

Before:
-------
hello words!
this is the second line.

After:
-------
hello words! this is the second line.
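
To make the behaviour concrete, here is a minimal standard-library sketch of end-of-line removal. It assumes that every line break is replaced by a single space, which matches the output above. `strip_eol` is a hypothetical name and is not the implementation of `remove_EOL_characters`.

```python
import re


def strip_eol(text: str, replace_with: str = " ") -> str:
    # Treat Windows (\r\n), old Mac (\r) and Unix (\n) line breaks alike.
    return re.sub(r"\r\n|\r|\n", replace_with, text)


print(strip_eol("hello words!\nthis is the second line."))
# hello words! this is the second line.
```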


## Stop words

In [6]:
from nautilus_nlp.preprocessing.preprocess import get_stopwords

In [7]:
french_sw = get_stopwords('fr')

print(french_sw[:10])

['soi-même', 'prealable', 'ayez', 'fi', 'concernant', 'dedans', 'celles', 'tiennes', 'plutôt', 'bravo']


In [8]:
from nautilus_nlp.preprocessing.preprocess import remove_stopwords

In [9]:
tokens_with_sw = ['je','suis','malade','et','toi','?']

tokens_without_sw = remove_stopwords(tokens_with_sw, stopwords=french_sw)

print('Before:\n-------\n{}\n\nAfter:\n-------\n{}'.format(tokens_with_sw,tokens_without_sw))

Before:
-------
['je', 'suis', 'malade', 'et', 'toi', '?']

After:
-------
['malade', '?']
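
Conceptually, stopword removal is a membership filter over the tokens. The sketch below shows that idea with a tiny hand-written stopword list. `drop_stopwords` and `sample_stopwords` are hypothetical, and the case-insensitive comparison is an assumption rather than documented behaviour of `remove_stopwords`.

```python
def drop_stopwords(tokens, stopwords):
    # A set gives constant-time lookups, which helps with long stopword lists.
    stopword_set = {word.lower() for word in stopwords}
    return [token for token in tokens if token.lower() not in stopword_set]


sample_stopwords = ["je", "suis", "et", "toi"]  # tiny illustrative list
filtered = drop_stopwords(["je", "suis", "malade", "et", "toi", "?"], sample_stopwords)
print(filtered)  # ['malade', '?']
```

Note that punctuation such as `'?'` is kept, since it is not in the stopword list.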


## Fix bad unicode

In [10]:
from nautilus_nlp.preprocessing.preprocess import fix_bad_unicode

In [11]:
before = "Les augmentations de rÃ©munÃ©rations"

after = fix_bad_unicode(before)

print('Before:\n-------\n{}\n\nAfter:\n-------\n{}'.format(before,after))

Before:
-------
Les augmentations de rÃ©munÃ©rations

After:
-------
Les augmentations de rémunérations
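
The broken text above is a typical case of UTF-8 bytes that were decoded as Latin-1. The sketch below reproduces the problem and reverses it with the standard library. It only handles this one kind of corruption, and `fix_bad_unicode` may well use a more robust approach; `make_mojibake` and `undo_mojibake` are hypothetical helper names.

```python
def make_mojibake(text: str) -> str:
    # Simulate the bug: UTF-8 bytes read back as Latin-1.
    return text.encode("utf-8").decode("latin-1")


def undo_mojibake(text: str) -> str:
    # Reverse the bug; leave the text untouched if it does not round-trip.
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


clean = "Les augmentations de rémunérations"
broken = make_mojibake(clean)
repaired = undo_mojibake(broken)
print(broken)
print(repaired)
```

Text that is already correct fails the round trip and is returned unchanged, so the function is safe to apply to mixed input.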


## Phone Numbers

In [12]:
from nautilus_nlp.preprocessing.preprocess import replace_phone_numbers

In [13]:
before = '(541) 754-3010 is a US. Phone'

after = replace_phone_numbers(before, replace_with=' ')

print('Before:\n-------\n{}\n\nAfter:\n-------\n{}'.format(before,after))

Before:
-------
(541) 754-3010 is a US. Phone

After:
-------
  is a US. Phone
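
For intuition, a pattern-based replacement can be sketched with a single regular expression. The pattern below only covers common US formats and is an assumption made for illustration; it is not the pattern that `replace_phone_numbers` uses. `mask_us_phone_numbers` is a hypothetical name.

```python
import re

# Optional parentheses around the area code, then 3 + 4 digits,
# separated by an optional space, dot or hyphen.
US_PHONE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")


def mask_us_phone_numbers(text: str, replace_with: str = " ") -> str:
    return US_PHONE.sub(replace_with, text)


masked = mask_us_phone_numbers("(541) 754-3010 is a US. Phone")
print(repr(masked))  # '  is a US. Phone'
```

A single regex like this is fast but misses other national formats, which is where a detection-based approach becomes useful.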


There is also a utility to detect phone numbers. Here we provide a list of countries. Detection takes time because it scans the text with every supported country code.

In [14]:
import nautilus_nlp.utils.phone_number as phone

In [15]:
%%time
phone.extract_phone_numbers(before, countrylist=phone.SUPPORTED_COUNTRY)

CPU times: user 296 ms, sys: 0 ns, total: 296 ms
Wall time: 294 ms


['(541) 754-3010', '754-3010']
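
The result above contains both the full number and a shorter match nested inside it. If you only want the longest match, one possible post-processing step is to drop any match that is a substring of another. This is a heuristic sketch, not part of nautilus_nlp, and `keep_longest_matches` is a hypothetical helper: it would also discard a genuinely separate number that happens to be contained in a longer one.

```python
def keep_longest_matches(matches):
    # Remove exact duplicates while preserving the original order.
    unique = list(dict.fromkeys(matches))
    # Keep a match only if no other, different match contains it.
    return [
        m for m in unique
        if not any(m != other and m in other for other in unique)
    ]


longest = keep_longest_matches(["(541) 754-3010", "754-3010"])
print(longest)  # ['(541) 754-3010']
```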

There is also a **phone parser** class that helps you extract information from phone numbers.

In [16]:
p = phone.phoneParser()

In [17]:
p.parse_number('(541) 754-3010',region_code='US')

PhoneNumber(country_code=1, national_number=5417543010, extension=None, italian_leading_zero=None, number_of_leading_zeros=None, country_code_source=0, preferred_domestic_carrier_code=None)

In [18]:
p.format_number('INTERNATIONAL')

'541-754-3010'

## Emojis

In [19]:
from nautilus_nlp.preprocessing.preprocess import remove_emoji, convert_emoji_to_text

In [20]:
before = 'My favorite emojies are: 🎅🏿⌚'

after = remove_emoji(before)

print('Before:\n-------\n{}\n\nAfter:\n-------\n{}'.format(before,after))

Before:
-------
My favorite emojies are: 🎅🏿⌚

After:
-------
My favorite emojies are: 
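
A simple way to picture emoji removal is a regular expression over a few Unicode blocks. The ranges below are an assumption chosen to cover the example and are not exhaustive (flags, for instance, are not handled). `strip_emoji` is a hypothetical name and this is not how `remove_emoji` is implemented.

```python
import re

EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, skin-tone modifiers
    "\u2300-\u23FF"          # Miscellaneous Technical (includes the watch)
    "\u2600-\u27BF"          # Miscellaneous Symbols and Dingbats
    "\uFE0F\u200D"           # variation selector-16 and zero-width joiner
    "]+"
)


def strip_emoji(text: str, replace_with: str = "") -> str:
    return EMOJI_PATTERN.sub(replace_with, text)


# Santa Claus + dark skin tone modifier + watch, written as escapes.
sample = "My favorite emojies are: \U0001F385\U0001F3FF\u231A"
stripped = strip_emoji(sample)
print(repr(stripped))  # 'My favorite emojies are: '
```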


In [21]:
before = 'My favorite emojies are: 🎅🏿⌚'

after = convert_emoji_to_text(before, code_delimiters=('', ''))

print('Before:\n-------\n{}\n\nAfter:\n-------\n{}'.format(before,after))

Before:
-------
My favorite emojies are: 🎅🏿⌚

After:
-------
My favorite emojies are: Santa_Claus_dark_skin_tonewatch
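
With empty delimiters the emoji names run together, as in `tonewatch` above. The sketch below shows how non-empty delimiters keep them apart. It uses the standard `unicodedata` module, so it produces official Unicode character names (for example `father_christmas`) rather than the names returned by `convert_emoji_to_text`. `emoji_to_names` and `_is_emoji` are hypothetical helpers, and the code point ranges are an illustrative assumption.

```python
import unicodedata


def _is_emoji(ch: str) -> bool:
    cp = ord(ch)
    return (
        0x1F300 <= cp <= 0x1FAFF
        or 0x2300 <= cp <= 0x23FF
        or 0x2600 <= cp <= 0x27BF
    )


def emoji_to_names(text: str, delimiters=(":", ":")) -> str:
    opening, closing = delimiters
    parts = []
    for ch in text:
        name = unicodedata.name(ch, "") if _is_emoji(ch) else ""
        if name:
            # e.g. "FATHER CHRISTMAS" becomes "father_christmas"
            parts.append(opening + name.lower().replace(" ", "_") + closing)
        else:
            parts.append(ch)
    return "".join(parts)


sample_text = "x \U0001F385\u231A"  # Santa Claus and watch, written as escapes
print(emoji_to_names(sample_text))                       # x :father_christmas::watch:
print(emoji_to_names(sample_text, delimiters=("", "")))  # x father_christmaswatch
```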
