## Natural Language Processing Pipelines (NLP Pipelines)

Most modern NLP systems are built on machine learning, and doing anything complicated in machine learning usually means building a pipeline. The idea is to break the problem into small pieces and solve each piece with its own model. By chaining several models so that each one feeds the next, you can handle very complicated tasks.

![picture](https://drive.google.com/uc?export=view&id=16e6wwg2eKxwwZgQ2DOfY65SEtpdX8Kv-)
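
The chaining idea can be sketched with plain Python functions standing in for models. The step names (`strip_edges`, `lowercase`, `split_words`) and the `run_pipeline` helper are hypothetical, chosen only for illustration:

```python
from functools import reduce
from typing import Any, Callable, List


def strip_edges(text: str) -> str:
    """Drop leading and trailing whitespace."""
    return text.strip()


def lowercase(text: str) -> str:
    """Convert the text to lowercase."""
    return text.lower()


def split_words(text: str) -> List[str]:
    """Split the text on whitespace."""
    return text.split()


def run_pipeline(data: Any, steps: List[Callable[[Any], Any]]) -> Any:
    """Feed the output of each step into the next one."""
    return reduce(lambda value, step: step(value), steps, data)


result = run_pipeline(
    "  NLP Pipelines Chain Small Steps  ",
    [strip_edges, lowercase, split_words],
)
print(result)  # ['nlp', 'pipelines', 'chain', 'small', 'steps']
```

Because every step takes one value and returns one value, steps can be reordered, removed, or replaced without touching the others.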


In this tutorial, you’ll learn:

- How to apply pre-processing techniques
- How to apply text-normalization techniques
- How to use spaCy and NLTK


In [5]:
!pip install unidecode



In [6]:
!pip install contractions



In [7]:
!pip install word2number



## Import all needed libraries


In [8]:
import nltk  # must be imported before nltk.download() is called
import unidecode

nltk.download('wordnet')

## I. Text Processing in Python

This tutorial uses two popular Python libraries for text processing: NLTK (Natural Language Toolkit) and spaCy.

For text processing we can perform a series of steps:

1. Remove symbols
2. Remove non-ASCII characters
3. Remove extra whitespaces
4. Expand contractions
5. Treatment for numbers





In [9]:
text = """Natural language processing (NLP) refers to the branch of computer science—and more specifically, the branch of artificial intelligence or AI—concerned with giving computers the ability to understand text and spoken words in much the same way human beings can."""

## Remove symbols

A text may contain unwanted symbols that add noise to the analysis.

In [10]:
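import numpy as np

def remove_symbols(text):
    # the backslash is escaped so Python does not read "\<" as an invalid escape sequence
    symbols = "'\\<>?;:#@&()—"
    for symbol in symbols:
        # delete every occurrence of the current symbol
        text = text.replace(symbol, '')
    return str(text)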


remove_symbols("Do we have extra #symbols in this @sentence. ")

'Do we have extra symbols in this sentence. '

In [11]:
remove_symbols(text)

'Natural language processing NLP refers to the branch of computer scienceand more specifically, the branch of artificial intelligence or AIconcerned with giving computers the ability to understand text and spoken words in much the same way human beings can.'

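## Removing Non-ASCII characters

ASCII represents lowercase letters (a-z), uppercase letters (A-Z), digits (0-9) and symbols such as punctuation marks. Anything outside that set, such as accented letters or zero-width characters, can be dropped by encoding the text to ASCII and ignoring the characters that cannot be encoded.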

In [12]:
def remove_non_ascii(text):
  # encoding the text to ASCII format
  text_encode = text.encode(encoding="ascii", errors="ignore")
  # decoding the text
  text_decode = text_encode.decode()
  return text_decode

remove_non_ascii("Python is easy \u200c to learn" )


'Python is easy  to learn'

In [13]:
remove_non_ascii("àa string withé fuünny charactersß.")

'a string with funny characters.'
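
Dropping non-ASCII characters also discards accented letters, which can mangle words such as "Café". One alternative, not used in the original code, is to decompose accented letters first so that their base letters survive. The sketch below relies only on the standard library's `unicodedata` module; the function name `strip_accents` is hypothetical:

```python
import unicodedata


def strip_accents(text: str) -> str:
    # NFKD splits an accented letter into a base letter plus a combining mark
    decomposed = unicodedata.normalize("NFKD", text)
    # combining marks are non-ASCII, so the ASCII round trip removes them
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


print(strip_accents("Your Café is awesome"))  # Your Cafe is awesome
```

Characters that have no ASCII decomposition, such as ß, are still dropped by this approach.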

To remove all symbols and non-ASCII characters at once, one option is to keep only the letters and digits in the text.

In [14]:
import nltk  # Natural Language Toolkit
nltk.download('punkt')
from nltk.tokenize import word_tokenize
import re  # regular expressions

def keep_characters_numbers(text):
  # strip every character that is not an ASCII letter or digit from each token
  cleaned_text = []
  for word in word_tokenize(text):
    cleaned_text.append(re.sub("[^a-zA-Z0-9]", "", word))
  return " ".join(cleaned_text)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [15]:
keep_characters_numbers("àa string withé fuünny charactersß. The number is 2012.")

'a string with funny characters  The number is 2012 '

## Remove extra whitespaces
Text sometimes contains extra whitespace that needs to be removed.

In [16]:

def remove_whitespace(text):
    """Remove leading, trailing, and repeated whitespace from text."""
    # split() with no arguments already discards leading and trailing whitespace
    return " ".join(text.split())

print(remove_whitespace("   Here, there are extra    white     spaces  "))


Here, there are extra white spaces


## Expand Contractions

Contractions are shortened forms, e.g., don’t and can’t. Expanding them to “do not” and “cannot” helps to standardize text.

We use the contractions module to expand the contractions.

In [17]:
import contractions

def expand_contractions(text):
    text = contractions.fix(text)
    return text

expand_contractions("""expand shortened words, e.g. don't to do not""")

'expand shortened words, e.g. do not to do not'

*Note: This step is optional depending on your NLP task, because spaCy’s tokenization and lemmatization have a similar effect on contractions such as can’t and don’t. One slight difference is that spaCy reduces “we’re” to “we be”, while the contractions module gives “we are”.*
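
To make the mechanism concrete, here is a minimal sketch of dictionary-based expansion using only the standard library. It is not how the `contractions` module is implemented. The mapping `CONTRACTION_MAP` and the function `expand_with_map` are hypothetical, cover only a handful of forms, and handle only the straight apostrophe:

```python
import re

# Tiny illustrative mapping; keys are lowercase
CONTRACTION_MAP = {
    "don't": "do not",
    "can't": "cannot",
    "we're": "we are",
    "i'd": "I would",
}

# One alternation pattern, longest keys first, matched on word boundaries
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(CONTRACTION_MAP, key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)


def _replace(match: re.Match) -> str:
    original = match.group(0)
    expanded = CONTRACTION_MAP[original.lower()]
    # keep a leading capital letter, e.g. "We're" -> "We are"
    if original[0].isupper():
        expanded = expanded[0].upper() + expanded[1:]
    return expanded


def expand_with_map(text: str) -> str:
    return _PATTERN.sub(_replace, text)


print(expand_with_map("We're sure you don't mind."))  # We are sure you do not mind.
```

A lookup table like this is easy to extend, but it cannot resolve ambiguous forms such as "I'd" (which may mean "I had"), so the choice of "I would" here is only an assumption.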

## Treatment for Numbers

There are two steps in our treatment of numbers.

The first step converts number words to numeric form, e.g., seven to 7, to standardize text. To do this, we use the word2number module. Sample code follows:

In [18]:
import spacy
# load spacy model
nlp = spacy.load('en_core_web_sm')

from word2number import w2n

text = """three cups of coffee"""
doc = nlp(text) #create a doc object
tokens = [w2n.word_to_num(token.text) if token.pos_ == 'NUM' else token for token in doc]

print(tokens) 



[3, cups, of, coffee]



The second step is to remove numbers. Removing numbers may make sense for sentiment analysis, since numbers rarely carry information about sentiment. However, if our NLP task is to extract the number of tickets ordered in a message to our chatbot, we definitely do not want to remove them.

In [20]:
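def remove_numbers(text):
  # \d+ matches a whole run of digits, so multi-digit numbers such as 2012 are removed completely
  text = re.sub(r"\s*\d+", "", text)
  return str(text)

remove_numbers("Here, there is a number 7 that we don't need.")

"Here, there is a number that we don't need."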

## II. Text Normalization

Text normalization is the process of transforming a text into a canonical (standard) form. For example, the words “gooood” and “gud” can be transformed to “good”, their canonical form. Another example is mapping near-identical forms such as “stopwords”, “stop-words” and “stop words” to just “stopwords”.

This section covers four normalization steps:

1. Lowercase all texts
2. Remove stopwords
3. Lemmatization
4. Stemming
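
The canonical-form idea described above can be sketched with a small lookup table plus a rule that shortens runs of repeated letters. Both the table `CANONICAL_FORMS` and the function `to_canonical` are hypothetical, and a real system would need a far larger vocabulary:

```python
import re

# Hand-written variants mapped to their canonical form (illustrative only)
CANONICAL_FORMS = {
    "gud": "good",
    "stop-words": "stopwords",
    "stop words": "stopwords",
}


def to_canonical(term: str) -> str:
    term = term.lower().strip()
    # collapse runs of three or more identical letters to two: "gooood" -> "good"
    term = re.sub(r"([a-z])\1{2,}", r"\1\1", term)
    # fall back to the term itself when no canonical form is known
    return CANONICAL_FORMS.get(term, term)


for raw in ["gooood", "gud", "stop-words", "stop words", "stopwords"]:
    print(raw, "->", to_canonical(raw))
```

Collapsing to two letters is a heuristic: it works for “gooood” but turns “soooo” into “soo”, so it is usually combined with a dictionary check.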

## Lower case the text

Converting all text to lowercase ensures that variants such as “Text” and “text” are treated as the same token, which helps during preprocessing and in later stages of the NLP application, such as parsing.

In [21]:
def convert_lower_case(text):
    # the built-in str.lower() is sufficient for a single string
    return text.lower()

convert_lower_case("THIS is An ExaMple to Convert a Text To a lower CASE.")

'this is an example to convert a text to a lower case.'

## Remove Stop words

Stopwords are words that carry little meaning on their own, such as prepositions. NLTK and spaCy ship stopword lists of different sizes, and both allow us to add any words we consider necessary. For example, when we deal with email, we may add Gmail, com, and outlook as stopwords.

In [None]:
from nltk.corpus import stopwords
nltk.download('stopwords')

# Get the list of stop words
stop_words = stopwords.words('english')
# add new stopwords to the list
stop_words.extend(["could","though","would","also","many",'much'])
print(stop_words)


In [23]:
text = "The idea of giving computers the ability to process human language is as old as the idea of computers themselves. "

# Remove the stopwords from the list of tokens
tokens = [x for x in  word_tokenize(convert_lower_case(text)) if x not in stop_words]
print(tokens)

['idea', 'giving', 'computers', 'ability', 'process', 'human', 'language', 'old', 'idea', 'computers', '.']
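
Membership tests against a Python list scan the list element by element, so for large corpora it can help to store the stopwords in a set. The sketch below uses a tiny hand-written stop list and whitespace tokenization to stay self-contained. `SAMPLE_STOP_WORDS` and `filter_stop_words` are hypothetical names; in practice the set would be built from the `stop_words` list above.

```python
# Tiny illustrative stop list; a real one would come from NLTK or spaCy
SAMPLE_STOP_WORDS = {"the", "of", "to", "is", "as", "themselves"}


def filter_stop_words(tokens, stop_words):
    # a set gives constant-time membership checks on average
    return [token for token in tokens if token.lower() not in stop_words]


tokens = "The idea of giving computers the ability to process human language".split()
print(filter_stop_words(tokens, SAMPLE_STOP_WORDS))
# ['idea', 'giving', 'computers', 'ability', 'process', 'human', 'language']
```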


## Lemmatization

Lemmatization is the process of converting a word to its base form, e.g., “caring” to “care”. We use spaCy’s lemmatizer to obtain the lemma, or base form, of the words. Sample code:

In [24]:
text = """I'm happiness commitment running are"""
doc = nlp(text) #create a doc object
# "-PRON-" was the lemma for pronouns in spaCy v2; spaCy v3 no longer uses it, so this check only matters for v2
mytokens = [word.lemma_ if word.lemma_ != "-PRON-" else word.lower_ for word in doc]
print(mytokens) 

['I', 'be', 'happiness', 'commitment', 'running', 'be']


## Stemming
Stemming is similar to lemmatization. The difference is that a lemma is always an actual word, whereas a stem may not be a meaningful word, e.g., “happi” for “happy”.

In [19]:
from nltk.stem import PorterStemmer

porter = PorterStemmer()
def stemming(text):
  token_words=word_tokenize(text)
  stem_sentence=[]
  for word in token_words:
      stem_sentence.append(porter.stem(word))
  return stem_sentence

In [25]:
print(stemming(text)) 

['i', "'m", 'happi', 'commit', 'run', 'are']


In [26]:
stemming('happy')

['happi']
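
To see why a stem can be a non-word, consider a deliberately crude suffix stripper. This is not the Porter algorithm used above, only a sketch of the general idea; the suffix list `SUFFIXES` and the function `crude_stem` are hypothetical:

```python
# Suffixes are tried in order; the first match is removed
SUFFIXES = ("ness", "ment", "ing", "ed", "s")


def crude_stem(word: str) -> str:
    word = word.lower()
    for suffix in SUFFIXES:
        # keep at least three characters so short words survive
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


print([crude_stem(w) for w in ["happiness", "commitment", "running", "are"]])
# ['happi', 'commit', 'runn', 'are']
```

The rule knows nothing about vocabulary, so it happily produces “happi” and “runn”. Real stemmers such as Porter add further rules (for example, undoubling the final consonant to get “run”), but they still do not guarantee dictionary words.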

## **Assignment:**

Putting everything together, the full text preprocessing code (define a function as text_preprocessing):



In [40]:
text = "I'd like to have three cups   of coffee #?> Your Café is #awesome."

def text_preprocessing(text):
  text = keep_characters_numbers(remove_symbols(text))
  return str(text)



# result: I would like to have three cups of coffee? Your Cafe is awesome.

In [41]:
text_preprocessing(text)

'Id like to have three cups of coffee Your Caf is awesome '

Put all the text normalization functions together!

In [None]:
def text_normalization(text):
  pass  # TODO: combine the normalization steps above


#result: ['like', 'cup', 'coffee', 'cafe', 'delicious']

References:

https://towardsdatascience.com/nlp-text-preprocessing-a-practical-guide-and-template-d80874676e79

https://towardsdatascience.com/text-processing-in-python-29e86ea4114c