# Natural Language Processing

In [3]:
#Import the module
import nltk

# Download a Spanish corpus (cess_esp) to process
nltk.download("cess_esp")

[nltk_data] Downloading package cess_esp to /root/nltk_data...
[nltk_data]   Unzipping corpora/cess_esp.zip.


True

## Regular Expressions

* Regular expressions are a standardized language for defining text search patterns.
* Python's `re` module provides the operations for working with regular expressions.
* Patterns are written according to a fixed set of syntax rules.

In [4]:
#Import regular expressions module
import re

#Define a corpus in python and print it
corpus = nltk.corpus.cess_esp.sents()

print(corpus)
print(len(corpus))

[['El', 'grupo', 'estatal', 'Electricité_de_France', '-Fpa-', 'EDF', '-Fpt-', 'anunció', 'hoy', ',', 'jueves', ',', 'la', 'compra', 'del', '51_por_ciento', 'de', 'la', 'empresa', 'mexicana', 'Electricidad_Águila_de_Altamira', '-Fpa-', 'EAA', '-Fpt-', ',', 'creada', 'por', 'el', 'japonés', 'Mitsubishi_Corporation', 'para', 'poner_en_marcha', 'una', 'central', 'de', 'gas', 'de', '495', 'megavatios', '.'], ['Una', 'portavoz', 'de', 'EDF', 'explicó', 'a', 'EFE', 'que', 'el', 'proyecto', 'para', 'la', 'construcción', 'de', 'Altamira_2', ',', 'al', 'norte', 'de', 'Tampico', ',', 'prevé', 'la', 'utilización', 'de', 'gas', 'natural', 'como', 'combustible', 'principal', 'en', 'una', 'central', 'de', 'ciclo', 'combinado', 'que', 'debe', 'empezar', 'a', 'funcionar', 'en', 'mayo_del_2002', '.'], ...]
6030


In [4]:
#Create a list with all the words
flatten = [w for l in corpus for w in l]
print(len(flatten))

192685


Function **re.search()**

Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. 
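
As a minimal sketch of what that match object looks like, the snippet below runs `re.search` on two hand-picked words (chosen here for illustration, not taken from the notebook's cells) and inspects the result. When nothing matches, `re.search` returns `None`, which is why it can be used directly as a filter condition.

```python
import re

# re.search scans the whole string and stops at the first match.
match = re.search("es", "centrales")
print(match.group())  # the matched text: es
print(match.span())   # (start, end) indices of the first match: (7, 9)

# When nothing matches, re.search returns None, which is falsy.
no_match = re.search("es", "grupo")
print(no_match)       # None
```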

In [5]:
# Keep only the words that contain the pattern "es"
arr = [w for w in flatten if re.search("es", w)]
print(arr[:10])

['estatal', 'jueves', 'empresa', 'centrales', 'francesa', 'japonesa', 'millones', 'millones', 'dólares', 'millones']


In [6]:
# "$" matches at the end of the string, or just before a newline at the end of the string

arr = [w for w in flatten if re.search("es$", w)]
print(arr[:10])

['jueves', 'centrales', 'millones', 'millones', 'dólares', 'millones', 'millones', 'dólares', 'es', 'militantes']
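
The "just before the newline" part of that rule is easy to miss. The sketch below uses a made-up input with a trailing newline and compares `$` with `\Z`, which matches only at the very end of the string.

```python
import re

word = "jueves\n"  # illustrative input with a trailing newline

# "$" still matches, because it accepts the position before a final newline.
dollar = re.search(r"es$", word)
print(dollar is not None)   # True

# "\Z" matches only at the absolute end of the string, so it fails here.
strict = re.search(r"es\Z", word)
print(strict is not None)   # False
```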


In [7]:
# "^" matches at the start of the string
arr = [w for w in flatten if re.search("^es", w)] 
print(arr[:10])

['estatal', 'es', 'esta', 'esta', 'eso', 'es', 'especial', 'especialmente', 'este', 'estas']


In [8]:
# A range of characters is written as two characters separated by a '-', e.g. [a-d] matches 'a', 'b', 'c', or 'd'

arr = [w for w in flatten if re.search("^[a-d]", w)]
print(arr[:10])

['anunció', 'compra', 'del', 'de', 'creada', 'central', 'de', 'de', 'de', 'a']


In [9]:
#Characters can be listed individually, e.g. [amk] will match 'a', 'm', or 'k'. 

arr = [w for w in flatten if re.search("^[amk]", w)]
print(arr[:10])

['anunció', 'mexicana', 'megavatios', 'a', 'al', 'a', 'mayo_del_2002', 'a', 'acuerdo', 'años']


* "*" Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.

* "+" Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.
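
A minimal sketch of the difference between the two quantifiers, using a small hand-made list of strings. `re.fullmatch` is used so that the whole string has to fit the pattern.

```python
import re

candidates = ["a", "ab", "abbb", "b"]

# fullmatch requires the entire string to match the pattern.
star = [s for s in candidates if re.fullmatch("ab*", s)]
plus = [s for s in candidates if re.fullmatch("ab+", s)]

print(star)  # ['a', 'ab', 'abbb']
print(plus)  # ['ab', 'abbb']
```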

In [11]:
# "(...)" matches the regular expression inside the parentheses and marks the start and end of a group; here "+" repeats the whole group "no"
arr = [w for w in flatten if re.search("^(no)+", w)]
print(arr[:10])

['norte', 'no', 'no', 'noche', 'no', 'no', 'notificación', 'no', 'no', 'no']
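
To show what the group changes, the sketch below compares `(no)+` with `no+` on two made-up strings. With the parentheses, `+` repeats the pair "no"; without them, it repeats only the "o". Note that when the pattern is used purely as a filter, as in the cell above, `^(no)+` accepts the same words as `^no`, since one repetition is enough.

```python
import re

# "+" applied to the group repeats the two-letter sequence "no".
grouped_a = re.match(r"(no)+", "nonono").group()
grouped_b = re.match(r"(no)+", "nooo").group()

# "+" applied to a single character repeats only the "o".
plain_a = re.match(r"no+", "nonono").group()
plain_b = re.match(r"no+", "nooo").group()

print(grouped_a, grouped_b)  # nonono no
print(plain_a, plain_b)      # no nooo
```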


## Text Normalization (Regular expression applications)

In [12]:
# In a normal string, \n is interpreted as a newline.
print("Esta es una \nprueba")

# A raw string is created by prefixing a string literal with 'r' or 'R'; it treats the backslash (\) as a literal character.
# This is useful when a string contains backslashes that should not be treated as escape characters.

print(r"Esto es una \nprueba")


Esta es una 
prueba
Esto es una \nprueba
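
Raw strings matter for regular expressions because many regex tokens start with a backslash. The sketch below, on a made-up sentence, shows that `\b` reaches the regex engine as a word-boundary anchor only when the pattern is a raw string; in a normal string Python turns it into a backspace character first.

```python
import re

print(len("\n"))   # 1: a single newline character
print(len(r"\n"))  # 2: a backslash followed by 'n'

sentence = "es una mesa"

# Raw string: \b is the regex word boundary, so only the standalone "es" matches.
raw_hits = re.findall(r"\bes\b", sentence)
# Normal string: "\b" is a backspace character, which the sentence does not contain.
plain_hits = re.findall("\bes\b", sentence)

print(raw_hits)    # ['es']
print(plain_hits)  # []
```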


## Tokenization 

Tokenization is the process of splitting a string of text into a list of tokens.

In [17]:
text = """Cuando sea el rey del mundo (imaginaba en su cabeza) no tendré que preocuparme por estas bobadas.
          Era solo un niño de 7 años, pero pensaba que podia ser cualquier cosa que su imaginación le permitiera
          visualizar en su cabeza ..."""

#Case 1

#Split string by the occurrences of pattern. If capturing parentheses are used in pattern, 
#then the text of all groups in the pattern are also returned as part of the resulting list.

print(re.split(r" ", text))

['Cuando', 'sea', 'el', 'rey', 'del', 'mundo', '(imaginaba', 'en', 'su', 'cabeza)', 'no', 'tendré', 'que', 'preocuparme', 'por', 'estas', 'bobadas.\n', '', '', '', '', '', '', '', '', '', 'Era', 'solo', 'un', 'niño', 'de', '7', 'años,', 'pero', 'pensaba', 'que', 'podia', 'ser', 'cualquier', 'cosa', 'que', 'su', 'imaginación', 'le', 'permitiera\n', '', '', '', '', '', '', '', '', '', 'visualizar', 'en', 'su', 'cabeza', '...']
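
The note about capturing parentheses is not exercised by the cell above, so here is a minimal sketch on a short fragment of the same text. Wrapping the separator in a group makes `re.split` keep it in the result.

```python
import re

sample = "años, pero pensaba"

# Without a group, the separator is discarded.
without_group = re.split(r", ", sample)
# With a capturing group, the separator is returned as its own list item.
with_group = re.split(r"(, )", sample)

print(without_group)  # ['años', 'pero pensaba']
print(with_group)     # ['años', ', ', 'pero pensaba']
```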


In [18]:
# Case 2: Using regular expressions
# - Split on runs of spaces, tabs, and line breaks

print(re.split(r"[ \t\n]+", text))

['Cuando', 'sea', 'el', 'rey', 'del', 'mundo', '(imaginaba', 'en', 'su', 'cabeza)', 'no', 'tendré', 'que', 'preocuparme', 'por', 'estas', 'bobadas.', 'Era', 'solo', 'un', 'niño', 'de', '7', 'años,', 'pero', 'pensaba', 'que', 'podia', 'ser', 'cualquier', 'cosa', 'que', 'su', 'imaginación', 'le', 'permitiera', 'visualizar', 'en', 'su', 'cabeza', '...']


In [19]:
# Case 3: \W matches any character that is not a word character,
# so punctuation such as parentheses is removed as well

print(re.split(r"[\W\t\n]+", text))

['Cuando', 'sea', 'el', 'rey', 'del', 'mundo', 'imaginaba', 'en', 'su', 'cabeza', 'no', 'tendré', 'que', 'preocuparme', 'por', 'estas', 'bobadas', 'Era', 'solo', 'un', 'niño', 'de', '7', 'años', 'pero', 'pensaba', 'que', 'podia', 'ser', 'cualquier', 'cosa', 'que', 'su', 'imaginación', 'le', 'permitiera', 'visualizar', 'en', 'su', 'cabeza', '']
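
Splitting on non-word characters leaves an empty string at the end whenever the text finishes with punctuation. One possible alternative, sketched here on a shortened made-up sentence, is to collect the word runs directly with `re.findall`. Since `\t` and `\n` are themselves non-word characters, `\W+` alone covers the same separators as `[\W\t\n]+`.

```python
import re

sample = "no tendré que preocuparme (por ahora) ..."

# Splitting on separators leaves a trailing empty token.
split_tokens = re.split(r"\W+", sample)
# Matching the words themselves avoids it.
found_tokens = re.findall(r"\w+", sample)

print(split_tokens)  # ['no', 'tendré', 'que', 'preocuparme', 'por', 'ahora', '']
print(found_tokens)  # ['no', 'tendré', 'que', 'preocuparme', 'por', 'ahora']
```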


## NLTK Tokenizer

In [8]:
text = "En los E.U. esa postal vale $15.50 ..."

In [9]:
# The drawback of this approach is that it breaks up the acronym and the price: "E.U." and "15.50" are split apart and the "$" is lost

print(re.split(r"[ \W\t\n]+", text))

['En', 'los', 'E', 'U', 'esa', 'postal', 'vale', '15', '50', '']


In [10]:
# A verbose regular expression can handle these cases

pattern = r"""(?x)                  # verbose mode: allows whitespace and comments in the pattern
              (?:[A-Z]\.)+          # abbreviations, e.g. E.U.
              | \w+(?:-\w+)*        # words with optional internal hyphens
              | \$?\d+(?:\.\d+)?%?  # currency and percentages
              | \.\.\.              # ellipsis
              | [][.,;"?():_`-]     # separator tokens, including []; the hyphen goes last so it is literal
              """
nltk.regexp_tokenize(text, pattern)

['En', 'los', 'E.U.', 'esa', 'postal', 'vale', '$15.50', '...']
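
Inside a character class, a hyphen between two characters defines a range, so its position in a class of separator tokens matters. The sketch below uses a made-up probe string to compare a class where the hyphen sits between `:` and `_` with one where it comes last.

```python
import re

probe = "A=b-c"

# ':-_' is the ASCII range from ':' to '_', which covers '=', '?', '@' and 'A'-'Z'.
as_range = re.findall(r"[:-_]", probe)
# With the hyphen last, the class contains just ':', '_' and '-'.
as_literal = re.findall(r"[:_-]", probe)

print(as_range)    # ['A', '=']
print(as_literal)  # ['-']
```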