# NLP: Pre-processing

**In NLP problems, text is usually passed through a preprocessing pipeline first. Before applying any technique that transforms text into embeddings, the texts were preprocessed using the following steps:**

**Normalization:** converting the text to lower case and removing special characters and punctuation.

In [23]:
import re
import string

text =  "Modeling, is arguably the most fun part of a machine learning task. #ML https://data.com"

print("before:\n",text)
def cleaning(text):
    #remove the #
    clean_text = re.sub(r'#',"",text)
    #remove hyperlinks
    clean_text = re.sub(r'https?:\/\/.*[\r\n]*',"",clean_text)
    #remove retweet text
    clean_text = re.sub(r'^RT[\s]+',"",clean_text)
    
    return clean_text

print("after:\n",cleaning(text))

before:
 Modeling, is arguably the most fun part of a machine learning task. #ML https://data.com
after:
 Modeling, is arguably the most fun part of a machine learning task. ML 


In [24]:
import string
    
text =  "Modeling, is arguably the { most fun part of a "" machine learning task."

print("before:\n",text)

def remove_punctuation(text):
    clean_text = []
    #strip punctuation attached to each word and drop tokens that are only punctuation
    for word in text.split():
        stripped = word.strip(string.punctuation)
        if stripped:
            clean_text.append(stripped)
            
    return clean_text

print("after\n", remove_punctuation(text))

before:
 Modeling, is arguably the { most fun part of a  machine learning task.
after
 ['Modeling', 'is', 'arguably', 'the', 'most', 'fun', 'part', 'of', 'a', 'machine', 'learning', 'task']
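
The normalization step as defined earlier (lower-casing plus removing punctuation) can also be done on the whole string at once. The following is a minimal sketch using only the standard library; the name `normalize` is hypothetical, and it only handles ASCII punctuation as listed in `string.punctuation`.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def normalize(text):
    """Lower-case the text, drop ASCII punctuation, and collapse whitespace."""
    lowered = text.lower()
    no_punct = lowered.translate(_PUNCT_TABLE)
    # split() without arguments also collapses runs of spaces
    return " ".join(no_punct.split())

print(normalize("Modeling, is arguably the { MOST fun part of a machine learning task."))
# modeling is arguably the most fun part of a machine learning task
```

Working on the string rather than on whitespace-separated words also removes punctuation that sits inside a word, which may or may not be wanted (for example, "don't" becomes "dont").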


**Removing stop words:** stop words are the most commonly used words in a language and add little meaning to the text. Examples include ‘the’, ‘a’, and ‘will’.

In [25]:
import nltk
#nltk.download('stopwords')
from nltk.corpus import stopwords

#import the list of english stopwords from nltk
eng_stopwords = stopwords.words("english")

text =  "Modeling, is arguably the most fun part of a  machine learning task."

print("before:\n",text)

def remove_stopwords(text):
    clean_text = []
    #remove stop words (the NLTK list is lower case, so the match is case-sensitive)
    for word in text.split():
        if(word not in eng_stopwords):
            clean_text.append(word)
            
    return clean_text

print("after\n", remove_stopwords(text))

before:
 Modeling, is arguably the most fun part of a  machine learning task.
after
 ['Modeling,', 'arguably', 'fun', 'part', 'machine', 'learning', 'task.']
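
Because the stop-word list is lower case, a capitalized word such as "The", or a word with punctuation attached, would not be filtered by a plain membership test. The sketch below shows one way to handle that by normalizing each word before the lookup. The `STOP_WORDS` set is a tiny illustrative assumption, not the NLTK list, and `remove_stopwords_normalized` is a hypothetical name.

```python
import string

# Tiny illustrative stop-word set (an assumption; NLTK's English list is much longer).
STOP_WORDS = {"a", "an", "the", "is", "of", "most", "will"}

def remove_stopwords_normalized(text):
    """Strip punctuation, lower-case each word, then drop stop words."""
    kept = []
    for word in text.split():
        token = word.strip(string.punctuation).lower()
        if token and token not in STOP_WORDS:
            kept.append(token)
    return kept

print(remove_stopwords_normalized("The model IS arguably the most fun part of a task."))
# ['model', 'arguably', 'fun', 'part', 'task']
```

Using a `set` instead of a list also makes each membership test constant time, which matters on large corpora.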


**Tokenization:** taking the normalized text and splitting it into a list of tokens.

**String Tokenization:** In Natural Language Processing, string tokenization is the process of splitting a string into individual words or parts, without blanks and tabs. In the same step, the words in the string can be converted to lower case. The tokenize module from NLTK (the Natural Language Toolkit) makes this process very easy to carry out.

In [26]:
from nltk.tokenize import TweetTokenizer

text =  "Modeling, is arguably the MOST fun part of a  Machine learning task."
print("before:\n",text)

#instantiate the Tokenizer class
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

#tokenize the text
text_tokens = tokenizer.tokenize(text)

print("after\n",text_tokens)

before:
 Modeling, is arguably the MOST fun part of a  Machine learning task.
after
 ['modeling', ',', 'is', 'arguably', 'the', 'most', 'fun', 'part', 'of', 'a', 'machine', 'learning', 'task', '.']
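
To make the splitting rule itself visible, here is a minimal regex-based sketch that needs only the standard library. It is not how `TweetTokenizer` works internally and it does not handle handles, emoticons, or URLs; `simple_tokenize` and `TOKEN_PATTERN` are hypothetical names.

```python
import re

# \w+ matches a run of word characters; [^\w\s] matches a single punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    """Lower-case the text and split it into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())

print(simple_tokenize("Modeling, is arguably the MOST fun part of a  Machine learning task."))
# ['modeling', ',', 'is', 'arguably', 'the', 'most', 'fun', 'part', 'of', 'a', 'machine', 'learning', 'task', '.']
```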


**Stemming:** the process of reducing words to their root. The resulting stem is not always the morphological root of the word; the goal is for related words to map to the same stem. Example: branched and branching both become branch.

**Stemming in Natural Language Processing:** Stemming converts a word to its most general form, or stem, essentially by removing its suffix. This reduces the size of the vocabulary and is one of the most important steps when working with text.

**Porter Stemmer:** one of the most common stemmers. It is gentle and very fast, but not very precise.

**Snowball Stemmer:** also known as the English Stemmer. It is more precise on large datasets.

**Lancaster Stemmer:** a very aggressive algorithm. It trims words heavily, which has both pros and cons.

In [29]:
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer 

txt =  "Modeling, is arguably the MOST fun part of a  Machine learning task."
#split on whitespace only, so punctuation stays attached to words such as "Modeling," and "task."
text = txt.split()

#Porter Stemmer
stemmer = PorterStemmer()   #instantiate stemmer class
stemWords = [stemmer.stem(word) for word in text]
print(stemWords)

#Snowball Stemmer
stemmer = SnowballStemmer("english")   #instantiate stemmer class
stemWords = [stemmer.stem(word) for word in text]
print(stemWords)

#Lancaster Stemmer
stemmer = LancasterStemmer()   #instantiate stemmer class
stemWords = [stemmer.stem(word) for word in text]
print(stemWords)

['modeling,', 'is', 'arguabl', 'the', 'most', 'fun', 'part', 'of', 'a', 'machin', 'learn', 'task.']
['modeling,', 'is', 'arguabl', 'the', 'most', 'fun', 'part', 'of', 'a', 'machin', 'learn', 'task.']
['modeling,', 'is', 'argu', 'the', 'most', 'fun', 'part', 'of', 'a', 'machin', 'learn', 'task.']
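
The idea of suffix stripping can be illustrated with a toy stemmer. This is a deliberately simplified sketch, not the Porter, Snowball, or Lancaster algorithm; the suffix list, the minimum stem length, and the name `toy_stem` are all assumptions made for the example.

```python
# Suffixes are tried in order, longest first.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, unless the remaining stem would be too short."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return stem
    return word

for w in ["branched", "branching", "branches", "branch", "is"]:
    print(w, "->", toy_stem(w))
# branched -> branch
# branching -> branch
# branches -> branch
# branch -> branch
# is -> is
```

The minimum stem length is what keeps short words such as "is" or "sing" intact; real stemmers use more elaborate conditions for the same purpose.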


**Lemmatization:** the process of mapping a group of inflected word forms to the same word. The simplest way to do this is with a dictionary. Example: is, was, and were all become be.

**Lemmatization in Natural Language Processing:** Lemmatization is the process of grouping together the inflected forms of a word so that they can be analysed as a single item, identified by the word's lemma, or dictionary form. Individual tokens from a sentence are reduced to their base form. Lemmatization is much more informative than simple stemming. It can look at the surrounding text to determine a given word's part of speech, but it does not categorize phrases.
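
The dictionary approach mentioned above can be sketched in a few lines. The table below is a hand-written illustration, not data taken from spaCy or WordNet, and `lookup_lemma` is a hypothetical name; words missing from the table are returned unchanged.

```python
# Hand-written lookup table mapping inflected forms to their lemma (illustrative only).
LEMMA_TABLE = {
    "is": "be",
    "was": "be",
    "were": "be",
    "drinks": "drink",
    "dogs": "dog",
}

def lookup_lemma(word):
    """Return the lemma from the table, or the lower-cased word if it is unknown."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)

for w in ["is", "Were", "dogs", "sings"]:
    print(w, "->", lookup_lemma(w))
# is -> be
# Were -> be
# dogs -> dog
# sings -> sings
```

A pure lookup cannot tell a noun from a verb ("drinks" can be either), which is why full lemmatizers also take the part of speech into account.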

In [41]:
!pip install spacy==2.3.7

Collecting spacy==2.3.7
  Using cached spacy-2.3.7-cp38-cp38-win_amd64.whl (9.7 MB)
Installing collected packages: spacy
  Attempting uninstall: spacy
    Found existing installation: spacy 3.4.1
    Uninstalling spacy-3.4.1:
      Successfully uninstalled spacy-3.4.1

ERROR: Could not install packages due to an OSError: [WinError 5] Access is denied: 'C:\\Users\\Meryem\\anaconda3\\Lib\\site-packages\\~pacy\\attrs.cp38-win_amd64.pyd'
Consider using the `--user` option or check the permissions.

In [44]:
from nltk.stem import WordNetLemmatizer
#spacy.lemmatizer.Lemmatizer is the spaCy 2.x API (the install cell above pins spacy==2.3.7)
from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups
#nltk.download('wordnet')

words = ["drinks","were","dogs","is","sings"]

#lemmatization with SpaCy, using a single noun rule that strips a trailing "s"
lookups = Lookups()
lookups.add_table("lemma_rules", {"noun":[['s','']]})
lemmatizer = Lemmatizer(lookups)

for word in words:
    lemma = lemmatizer(word, "NOUN")
    print(f"Lemmatization with SpaCy: {lemma}")
    

#lemmatization with NLTK (lemmatize() defaults to pos="n", so verb forms like "were" and "sings" are left unchanged)
lemmatizer = WordNetLemmatizer()

for word in words:
    lemma = lemmatizer.lemmatize(word)
    print(f"Lemmatization with NLTK: {lemma}")

Lemmatization with SpaCy: ['drink']
Lemmatization with SpaCy: ['were']
Lemmatization with SpaCy: ['dog']
Lemmatization with SpaCy: ['i', 'be']
Lemmatization with SpaCy: ['sing']
Lemmatization with NLTK: drink
Lemmatization with NLTK: were
Lemmatization with NLTK: dog
Lemmatization with NLTK: is
Lemmatization with NLTK: sings




In [45]:
from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups

words = ["is","were"]

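#lemmatization with SpaCy, using custom rules that map "is" and "were" to "be"
#(the rules are stored under "noun" only because the call below passes "NOUN")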
lookups = Lookups()
lookups.add_table("lemma_rules", {"noun":[['is','be'],['were','be']]})
lemmatizer = Lemmatizer(lookups)

for word in words:
    lemma = lemmatizer(word, "NOUN")
    print(f"Lemmatization with SpaCy: {lemma}")


Lemmatization with SpaCy: ['be']
Lemmatization with SpaCy: ['be']


**The output of this pipeline is a list with the formatted tokens.**
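
To show how the steps chain together, here is a compact end-to-end sketch using only the standard library. It stands in for the NLTK and spaCy components with simplified rules: the stop-word set and the suffix list are small illustrative assumptions, and `preprocess` is a hypothetical name.

```python
import re

STOP_WORDS = {"a", "is", "of", "the", "most"}   # illustrative subset only
SUFFIXES = ("ing", "ed", "s")                    # toy stemming rules

def preprocess(text):
    """Clean, tokenize, remove stop words, and stem; returns a list of tokens."""
    text = re.sub(r"https?://\S+", "", text)     # drop hyperlinks
    text = text.replace("#", "").lower()         # drop hashtag signs, lower-case
    tokens = re.findall(r"\w+", text)            # keep word tokens only
    tokens = [t for t in tokens if t not in STOP_WORDS]
    stems = []
    for token in tokens:
        for suffix in SUFFIXES:
            # strip at most one suffix, and only if at least 3 characters remain
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        stems.append(token)
    return stems

print(preprocess("Modeling, is arguably the most fun part of a machine learning task. #ML https://data.com"))
# ['model', 'arguably', 'fun', 'part', 'machine', 'learn', 'task', 'ml']
```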