# Text manipulation (Preprocessing and Modeling)

- Python String Module and str.methods
- Python Regular Expressions
- NLTK
- Gensim
- spaCy

**Why is it important?**

Text manipulation is the core preprocessing step whenever you work with natural language processing (NLP). Some applications that you can build with NLP are:

* Part-of-speech tagging
* Named Entity Recognition (NER)
* Question answering
* Speech recognition
* Text-to-speech and Speech-to-text
* Topic modeling
* Sentiment classification
* Language modeling
* Translation

## Python String Module and str.methods

 
- Codec registry and base classes [codecs](https://docs.python.org/3/library/codecs.html#standard-encodings)
- Common string operations [string module](https://docs.python.org/3/library/string.html)
- String Methods [str.methods](https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str)


In [None]:
import string
from string import Formatter
from string import Template

# String constants
print('ascii_letters: ',string.ascii_letters)
print('ascii_lowercase: ',string.ascii_lowercase)
print('ascii_uppercase: ',string.ascii_uppercase)
print('digits: ',string.digits)
print('hexdigits: ',string.hexdigits)
print('whitespace: ',string.whitespace)  # ' \t\n\r\x0b\x0c'
print('punctuation: ',string.punctuation)
print('printable: ',string.printable)

In [None]:
# Custom String Formatting
# Format String Syntax
formatter = Formatter()
print(formatter.format('{website}', website='DS4A'))
print(formatter.format('{} {website}', 'Welcome to', website='DS4A'))
# format() behaves in similar manner
print('{} {website}'.format('Welcome to', website='DS4A'))
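
The replacement fields shown above can also carry a format specification after a colon. The following is a minimal sketch of alignment, width, and precision; the field names and values are made up for illustration.

```python
# Hypothetical values, used only to illustrate the format specification.
name, score, share = 'DS4A', 3.14159, 0.256

# <10 left-aligns in 10 columns, >8.2f right-aligns a float with 2 decimals,
# and 6.1% renders a fraction as a percentage in 6 columns.
row = '{name:<10}|{score:>8.2f}|{share:6.1%}'.format(name=name, score=score, share=share)
print(row)

# f-strings accept the same specification.
f_row = f'{name:<10}|{score:>8.2f}|{share:6.1%}'
print(f_row)
```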

In [None]:
# Template strings
t = Template('$name is the $title of $company')
s = t.substitute(name='MinTic', title='Founder', company='DS4A.')
print(s)
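
`substitute()` expects a value for every placeholder. The sketch below, which reuses the template from the cell above, shows what happens when one is missing and how `safe_substitute()` behaves instead.

```python
from string import Template

t = Template('$name is the $title of $company')

# substitute() raises KeyError when a placeholder has no value.
try:
    t.substitute(name='MinTic', title='Founder')
except KeyError as missing:
    print('missing placeholder:', missing)

# safe_substitute() leaves unknown placeholders untouched instead.
partial = t.safe_substitute(name='MinTic', title='Founder')
print(partial)
```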

In [None]:
# Helper functions
# string capwords() function - Uses str.split() and str.capitalize()
string.capwords('hello world ds4a')

In [None]:
# String Methods
# str.methods()

# Changing text
print('hello world ds4a'.capitalize())
print('hello world ds4a'.upper())
print('HELLO WORLD DS4A'.lower())
print('  123456  '.lstrip())
print('  123456  '.rstrip())
print('  123456  '.strip())
print('привет мир ds4a'.encode())
print('hello world ds4a'.encode())
print('привет мир ds4a'.encode(encoding="cp866", errors="strict"))
print('hello world ds4a'.encode(encoding="utf-8", errors="strict"))
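
The `errors` argument decides what happens when a character cannot be represented in the target codec. The cells above only use `"strict"`. Here is a small sketch of the alternatives, using ASCII as a codec that is assumed to lack the Cyrillic letters.

```python
text = 'привет мир ds4a'

# Encoding and decoding with the same codec round-trips the text.
utf8_bytes = text.encode('utf-8')
round_trip = utf8_bytes.decode('utf-8')

# ASCII cannot represent Cyrillic, so errors='strict' (the default) raises.
try:
    text.encode('ascii')
except UnicodeEncodeError as error:
    print('strict failed:', error.reason)

# Other error handlers trade fidelity for robustness.
replaced = text.encode('ascii', errors='replace')  # unknown characters become '?'
ignored = text.encode('ascii', errors='ignore')    # unknown characters are dropped
print(replaced)
print(ignored)
```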

In [None]:
# Looking in text
print('hello world ds4a'.count('o'))
print('hello world ds4a'.endswith('a'))
print('hello world ds4a'.startswith('a'))
print('hello world ds4a'.find('o'))
print('hello world ds4a'.find('z'))
print('hello world ds4a'.index('o'))
print('hello world ds4a'.isalnum())
print('123456'.isalnum())
print('hello'.isalpha())
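
`find()` and `index()` return the same position when the substring exists, but they differ when it does not. A short sketch of that difference, on the same sample string:

```python
text = 'hello world ds4a'

# find() signals "not found" with -1 ...
position = text.find('z')
print(position)

# ... while index() raises ValueError for the same lookup.
try:
    text.index('z')
except ValueError:
    print('index() raised ValueError')

# rfind() searches from the right, and `in` covers the plain membership test.
last_o = text.rfind('o')
has_world = 'world' in text
print(last_o, has_world)
```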

## Python Regular Expressions (regex)

- Versatile tool for text processing.
- Supported by most programming languages and tools (Python, SQL, JavaScript, etc.)
- Mini programming language
- Parts of regular expressions can be saved for future use.
- There are ways to perform AND, OR, NOT conditionals.
- Operations similar to range function, string repetition operator and so on.

Some common use cases:

- Sanitizing a string to ensure that it satisfies a known set of rules. For example, to check if a given string matches password rules.
- Filtering or extracting portions of text by category, such as letters, digits, or punctuation.
- Qualified string replacement. For example, at the start or the end of a string, only whole words, based on surrounding text, etc.
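
As a rough illustration of the first use case, the sketch below checks a string against a set of password rules. The rules themselves are hypothetical and chosen only for the example. It also uses character classes and quantifiers, which this section has not covered.

```python
import re

# Hypothetical rules, chosen only for illustration.
RULES = {
    'at least 8 characters': re.compile(r'.{8,}'),
    'a lowercase letter': re.compile(r'[a-z]'),
    'an uppercase letter': re.compile(r'[A-Z]'),
    'a digit': re.compile(r'[0-9]'),
}

def failed_rules(password):
    """Return the descriptions of the rules the password does not satisfy."""
    return [name for name, pattern in RULES.items() if not pattern.search(password)]

print(failed_rules('Ds4aRocks2024'))
print(failed_rules('ds4a'))
```

Keeping one small pattern per rule makes it easy to report which rule failed, at the cost of scanning the string several times.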

In [None]:
sentence = 'This is a sample string'

print('is' in sentence)
print('xyz' in sentence)

In [None]:
import re
# As a good practice, always use raw strings to construct the RE, unless other formats are required
print(bool(re.search(r'is', sentence)))
print(bool(re.search(r'xyz', sentence)))

In [None]:
if re.search(r'ring', sentence):
    print('mission success')

In [None]:
if not re.search(r'xyz', sentence):
    print('mission failed')

In [None]:
# list comprehension and generator expression examples
words = ['cat', 'attempt', 'tattle']

print([w for w in words if re.search(r'tt', w)])

print(all(re.search(r'at', w) for w in words))

print(any(re.search(r'stat', w) for w in words))

In [None]:
# Compiling regular expressions
pet = re.compile(r'dog')

print(type(pet))

print(bool(pet.search('They bought a dog')))

print(bool(pet.search('A cat crossed their path')))


In [None]:
byte_data = b'This is a sample string'

# To work with the bytes data type, the RE must be a bytes pattern as well
try:
    re.search(r'is', byte_data)
except TypeError as error:
    print("Error in re.search: ", error)

print(bool(re.search(rb'is', byte_data)))
print(bool(re.search(rb'xyz', byte_data)))

In [None]:
# Example applied
filename = 'programming_quotes.txt'
word = re.compile(r'two')
with open(filename, 'r') as ip_file:
    for ip_line in ip_file:
        if word.search(ip_line):
            print(ip_line, end='')

In [None]:
purchases = '''\
apple 24
mango 50
guava 42
onion 31
water 10'''
num = re.compile(r'2') 
for line in purchases.split('\n'):
    if not num.search(line):
        print(line)

### Anchors

- Qualifying a pattern
- These restrictions are made possible by assigning special meaning to certain characters and escape sequences.

In [None]:
# String Anchors: 
# String anchors restrict a RE to match only at the start or the end of an input string.
# The escape sequence \A restricts the matching to the start of the string.

print(bool(re.search(r'\Acat', 'cater')))
print(bool(re.search(r'\Acat', 'concatenation')))
print(bool(re.search(r'\Ahi', 'hi hello\ntop spot')))
print(bool(re.search(r'\Atop', 'hi hello\ntop spot')))

In [None]:
# To restrict the matching to the end of string, \Z is used.

print(bool(re.search(r'are\Z', 'spare')))
print(bool(re.search(r'are\Z', 'nearest')))

words = ['surrender', 'unicorn', 'newer', 'door', 'empty', 'eel', 'pest']

print([w for w in words if re.search(r'er\Z', w)])

print([w for w in words if re.search(r't\Z', w)])

In [None]:
word_pat = re.compile(r'\Acat\Z')
print(bool(word_pat.search('cat')))
print(bool(word_pat.search('cater')))
print(bool(word_pat.search('concatenation')))

In [None]:
# insert text at the start of a string
# first argument to re.sub is the search RE
# second argument is the replacement value
# third argument is the string value to be acted upon

print(re.sub(r'\A', r're', 'live'))
print(re.sub(r'\A', r're', 'send'))

# appending text
print(re.sub(r'\Z', r'er', 'cat'))
print(re.sub(r'\Z', r'er', 'hack'))

In [None]:
# A common mistake, not specific to re.sub, is forgetting that strings are immutable in Python.

word = 'cater'

# this will return a string object, won't modify 'word' variable
print(re.sub(r'\Acat', r'hack', word))
print(word)

# need to explicitly assign the result if 'word' has to be changed
word = re.sub(r'\Acat', r'hack', word)
print(word)

In [None]:
# Line anchors
# The newline character \n is used as the line separator
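# ^ is the metacharacter for matching the start of a line and $ for matching the end of a line.
# If there are no newline characters in the input string, these behave the same as \A and \Z respectively.
# Matching at every line of a multiline string additionally requires the re.MULTILINE flag.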

pets = 'cat and dog'

print(bool(re.search(r'^cat', pets)))
print(bool(re.search(r'^dog', pets)))
print(bool(re.search(r'dog$', pets)))
print(bool(re.search(r'^dog$', pets)))
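
The sample string above has a single line, so the line anchors cannot be told apart from the string anchors. The sketch below uses a two-line string and the `re.MULTILINE` flag (`re.M` for short) to show the difference. The `\w+` in the last pattern means "one or more word characters" and is only there to capture the final word of each line.

```python
import re

text = 'hi hello\ntop spot'

# Without a flag, ^ only matches at the very start of the string.
default_match = bool(re.search(r'^top', text))

# With re.MULTILINE, ^ and $ also match at each line boundary.
multiline_match = bool(re.search(r'^top', text, flags=re.M))
line_ends = re.findall(r'\w+$', text, flags=re.M)

print(default_match, multiline_match, line_ends)
```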

In [None]:
# \Z will always match the end of string, irrespective of what characters are present.
greeting = 'hi there\nhave a nice day\n'

print(greeting)

print(bool(re.search(r'day$', greeting)))
print(bool(re.search(r'day\n$', greeting)))
print(bool(re.search(r'day\Z', greeting)))
print(bool(re.search(r'day\n\Z', greeting)))

In [None]:
# Word anchors
# The escape sequence \b denotes a word boundary.

words = 'par spar apparent spare part'

print(re.sub(r'par', r'X', words))
print(re.sub(r'\bpar', r'X', words))
print(re.sub(r'par\b', r'X', words))
print(re.sub(r'\bpar\b', r'X', words))

In [None]:
words = 'par spar apparent spare part'

print(re.sub(r'\b', r'"', words).replace(' ', ','))
print(re.sub(r'\b', r' ', '-----hello-----'))
print(re.sub(r'\b', r' ', 'foo_baz=num1+35*42/num2'))
print(re.sub(r'\b', r' ', 'foo_baz=num1+35*42/num2').strip())

In [None]:
# The word boundary has an opposite anchor too. \B matches wherever \b doesn’t match.
words = 'par spar apparent spare part'

print(re.sub(r'\Bpar', r'X', words))
print(re.sub(r'\Bpar\b', r'X', words))
print(re.sub(r'par\B', r'X', words))
print(re.sub(r'\Bpar\B', r'X', words))
print(re.sub(r'\b', r':', 'copper'))
print(re.sub(r'\B', r':', 'copper'))
print(re.sub(r'\b', r' ', '-----hello-----'))
print(re.sub(r'\B', r' ', '-----hello-----'))

**Cheatsheet and Summary**

![](CheatSheetandSummary1.png)

In [None]:
# Alternation and Grouping

print(bool(re.search(r'cat|dog', 'I like cats')))
print(bool(re.search(r'cat|dog', 'I like dogs')))
print(bool(re.search(r'cat|dog', 'I like parrots')))


print(re.sub(r'\Acat|cat\b', r'X', 'catapults concatenate cat scat'))
print(re.sub(r'cat|dog|fox', r'mammal', 'cat dog bee parrot fox'))
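
The cell above covers alternation. Grouping with parentheses is the other half: it factors out the part that the alternatives share and limits how far an anchor applies. A small sketch with made-up sample strings:

```python
import re

text = 'red reform read arrest'

# Without grouping, every alternative repeats the common prefix.
flat = re.sub(r'reform|rest', 'X', text)

# Grouping factors out the shared part: 're' followed by 'form' or 'st'.
grouped = re.sub(r're(form|st)', 'X', text)

# Parentheses also let a single pair of anchors apply to every alternative.
whole_word = re.sub(r'\b(par|part)\b', 'X', 'par spar apparent spare part')

print(flat, grouped, whole_word, sep='\n')
```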


In [None]:
# The join string method can be used to build the alternation list automatically from an iterable of strings.

print('|'.join(['car', 'jeep']))

words = ['cat', 'dog', 'fox']

print('|'.join(words))

print(re.sub('|'.join(words), r'mammal', 'cat dog bee parrot fox'))
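
Joining strings directly works for plain words like the ones above. If the terms may contain regex metacharacters, or if one term is a prefix of another, a more defensive construction helps. The following sketch uses hypothetical terms: it escapes each one with `re.escape` and places longer terms first.

```python
import re

# Hypothetical search terms; 'a.b' contains a regex metacharacter.
terms = ['a.b', 'cat', 'catalog']

# Naive join: '.' matches any character and 'cat' wins before 'catalog'.
naive = re.compile('|'.join(terms))

# Escape each term and put longer terms first.
ordered = sorted(terms, key=len, reverse=True)
safe = re.compile('|'.join(re.escape(t) for t in ordered))

sample = 'axb a.b catalog'
naive_hits = naive.findall(sample)
safe_hits = safe.findall(sample)
print(naive_hits)
print(safe_hits)
```

Alternatives are tried from left to right, so ordering by length is what lets `catalog` match as a whole.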


**Cheatsheet and Summary**

![](CheatSheetandSummary2.png)

In [None]:
from unicodedata import normalize

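def formatingText(text):
    text = text.lower()
    # Drop HTML tags and colon-delimited tokens such as :smile:
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r':.*?:', '', text)
    # Decompose accented characters (NFD) and strip combining marks, except the tilde of 'ñ'
    text = re.sub(r"([^n\u0300-\u036f]|n(?!\u0303(?![\u0300-\u036f])))[\u0300-\u036f]+", r"\1", normalize("NFD", text), flags=re.I)
    text = normalize('NFC', text)
    # Keep only ASCII lowercase letters and spaces (this also drops 'ñ')
    text = re.sub(r'[^a-z ]', '', text)
    return text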

In [None]:
print(formatingText('hello@ Ds4á'))
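
The accent-stripping step in `formatingText` relies on Unicode normalization. As a simpler sketch of the same idea, the hypothetical helper below decomposes the text and drops every combining mark. Unlike the regex above, it makes no exception for `ñ`.

```python
import unicodedata

def strip_accents(text):
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

plain = strip_accents('Ds4á canción')
print(plain)

# 'ñ' decomposes into 'n' plus a combining tilde, so it becomes 'n' here.
print(strip_accents('año'))
```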

Regular Expressions Cheat Sheet [Here](https://cheatography.com/davechild/cheat-sheets/regular-expressions/)

## NLTK

NLTK is one of the oldest Python libraries for natural language processing. It remains very useful for text preprocessing tasks such as tokenization, lemmatization, and stop-word removal. NLTK is also widely used as a tool for studying and teaching language processing. To learn more, you can read the NLTK book.

In [None]:
!pip install nltk

In [None]:
import pandas as pd
import nltk
from nltk import sent_tokenize
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
nltk.download('punkt')
nltk.download('punkt_tab')  # NLTK 3.9 and later load the tokenizer models from 'punkt_tab'
nltk.download('stopwords')
nltk.download('wordnet')

In [None]:
data = pd.read_csv('IMDB Dataset.csv',nrows=20)

In [None]:
data.shape

In [None]:
data.head()

In [None]:
print(data.review[1])

In [None]:
def striphtml(data):
    p = re.compile(r'<.*?>')
    return p.sub('', data)

In [None]:
sent_tokenize(data.review[1])

In [None]:
data.review = data.review.apply(striphtml)

In [None]:
print(data.review[1])

In [None]:
sentences = sent_tokenize(data.review[1])
print(sentences)

In [None]:
print(sentences[3])

In [None]:
words = word_tokenize(sentences[3])
print(words)

In [None]:
stop_words = stopwords.words('english')
print(stop_words)

In [None]:
words = [w.lower() for w in words]
print(words)

In [None]:
words = [w for w in words if w not in stop_words and w.isalpha()]

In [None]:
print(words)
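
The import cell loads `PorterStemmer` but the notebook never uses it. As a possible next step after stop-word removal, the sketch below applies stemming, a cruder relative of lemmatization that needs no extra downloads. The token list is hypothetical and stands in for the filtered `words`.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical tokens standing in for the filtered `words` list.
tokens = ['running', 'movies', 'watched', 'acting']

# Stems are truncated forms and need not be dictionary words.
stems = [stemmer.stem(token) for token in tokens]
print(stems)
```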

In [None]:
!pip install wordcloud

In [None]:
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt 

textoWC = ' '.join(words)

wordcloud = WordCloud(width = 600, height = 600, 
                background_color ='white', 
                stopwords = STOPWORDS,# max_words=200,
                relative_scaling =0,
                min_font_size = 10).generate(textoWC) 

plt.figure(figsize = (8, 8), facecolor = None) 
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off") 
plt.tight_layout(pad = 0) 
  
plt.show() 



## Gensim 

Gensim is a natural language processing library created by Radim Řehůřek. Its strength is topic modeling: it can automatically identify what a set of documents is about. Gensim is also useful for building or importing distributed vector representations such as word2vec. We can also use it to analyze the similarity between documents, which is very useful for search. To learn more, see the Gensim tutorials.

In [None]:
!pip install gensim

In [None]:
import gensim
from gensim import corpora
from pprint import pprint

# How to create a dictionary from a list of sentences?
documents = ["The Saudis are preparing a report that will acknowledge that", 
             "Saudi journalist Jamal Khashoggi's death was the result of an", 
             "interrogation that went wrong, one that was intended to lead", 
             "to his abduction from Turkey, according to two sources."]

# Tokenize(split) the sentences into words
texts = [doc.split() for doc in documents]

# Create dictionary
dictionary = corpora.Dictionary(texts)

# Get information about the dictionary
print(dictionary)
#> Dictionary(33 unique tokens: ['Saudis', 'The', 'a', 'acknowledge', 'are']...)

# Mapping from each token to its integer id
print(dictionary.token2id)

## spaCy 

spaCy is one of the fastest natural language processing libraries available. It is designed for use in real applications and for extracting relevant information. spaCy is also very useful for preparing text for other machine learning tasks; for example, we can prepare data for use with TensorFlow, PyTorch, scikit-learn, Gensim, etc. With spaCy we can also build sophisticated statistical linguistic models for many natural language processing problems. To learn more, see the spaCy documentation.


In [None]:
!pip install spacy

In [None]:
!python -m spacy download en_core_web_sm

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [None]:
file_name = 'programming_quotes.txt'
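# Read the file with a context manager so the handle is closed afterwards
with open(file_name) as text_file:
    introduction_file_text = text_file.read()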
introduction_file_doc = nlp(introduction_file_text)

# Extract tokens for the given doc
print([token.text for token in introduction_file_doc])