## Text Preprocessing in Natural Language Processing:

Data preprocessing and cleaning are essential steps in building a Machine Learning model.

Machine Learning models need data in numeric form, so we use an encoding technique (Bag of Words, bi-grams, n-grams, TF-IDF, Word2Vec) to turn text into numeric vectors.
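To make the idea of encoding concrete, here is a minimal Bag of Words sketch in plain Python. The two documents and the helper name `bag_of_words` are made up for illustration; in practice a library vectorizer would usually do this work.

```python
from collections import Counter

# Two toy documents (made up for illustration, not taken from the dataset).
docs = [
    "great taffy great price",
    "healthy dog food",
]

# Build a sorted vocabulary from every word that appears in any document.
vocabulary = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]

vectors = [bag_of_words(doc, vocabulary) for doc in docs]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each document becomes a vector with one dimension per vocabulary word, which is why the number of distinct tokens matters for the steps that follow.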

The preprocessing steps are:
1. Remove URL.
2. Remove all irrelevant characters (Numbers and Punctuation).
3. Convert all characters into lowercase.
4. Tokenization
5. Removing Stopwords
6. Stemming
7. Lemmatization
8. Remove the words having length <= 2

These steps are widely used to reduce dimensionality: fewer distinct tokens means fewer dimensions in the encoded vectors.

Install the `nltk` library (for example, with `pip install nltk`) to perform the preprocessing steps.

In [1]:
# Load data
import pandas as pd

# Data was downloaded from Kaggle
# Link: https://www.kaggle.com/snap/amazon-fine-food-reviews?select=Reviews.csv
amazon_review = pd.read_csv("reviews.csv")
amazon_review.head(10)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...
5,6,B006K2ZZ7K,ADT0SRK1MGOEU,Twoapennything,0,0,4,1342051200,Nice Taffy,I got a wild hair for taffy and ordered this f...
6,7,B006K2ZZ7K,A1SP2KVKFXXRU1,David C. Sullivan,0,0,5,1340150400,Great! Just as good as the expensive brands!,This saltwater taffy had great flavors and was...
7,8,B006K2ZZ7K,A3JRGQVEQN31IQ,Pamela G. Williams,0,0,5,1336003200,"Wonderful, tasty taffy",This taffy is so good. It is very soft and ch...
8,9,B000E7L2R4,A1MZYO9TZK0BBI,R. James,1,1,5,1322006400,Yay Barley,Right now I'm mostly just sprouting this so my...
9,10,B00171APVA,A21BT40VZCCYT4,Carol A. Reed,0,0,5,1351209600,Healthy Dog Food,This is a very healthy dog food. Good for thei...


### Step-1. Remove URLs:

In [3]:
import re

def remove_url(text):
    return re.sub(r'http\S+', '', text)

In [5]:
amazon_review["clean_review"] = amazon_review["Text"].apply(remove_url)

amazon_review[["Text", "clean_review"]].head(10)

Unnamed: 0,Text,clean_review
0,I have bought several of the Vitality canned d...,I have bought several of the Vitality canned d...
1,Product arrived labeled as Jumbo Salted Peanut...,Product arrived labeled as Jumbo Salted Peanut...
2,This is a confection that has been around a fe...,This is a confection that has been around a fe...
3,If you are looking for the secret ingredient i...,If you are looking for the secret ingredient i...
4,Great taffy at a great price. There was a wid...,Great taffy at a great price. There was a wid...
5,I got a wild hair for taffy and ordered this f...,I got a wild hair for taffy and ordered this f...
6,This saltwater taffy had great flavors and was...,This saltwater taffy had great flavors and was...
7,This taffy is so good. It is very soft and ch...,This taffy is so good. It is very soft and ch...
8,Right now I'm mostly just sprouting this so my...,Right now I'm mostly just sprouting this so my...
9,This is a very healthy dog food. Good for thei...,This is a very healthy dog food. Good for thei...


> Note: There are no URLs in the displayed part of these reviews, so the output looks unchanged.
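Since the reviews shown contain no links, here is a small sketch of the same pattern on a sample string that does. The sample text is hypothetical, and the extra whitespace cleanup is an optional addition, not part of the original function.

```python
import re

def remove_url(text):
    # Same pattern as above: drop "http" plus everything up to the next whitespace.
    return re.sub(r'http\S+', '', text)

# Hypothetical review text containing a link (not from the dataset).
sample = "Found the recipe at https://example.com/recipe?id=1 and loved it"

cleaned = remove_url(sample)
# Collapse the double space left behind where the URL used to be.
cleaned = re.sub(r'\s+', ' ', cleaned).strip()
print(cleaned)
```

Note that this pattern only matches links that start with `http`; a bare `www.example.com` would be left in place.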

### Step-2. Remove all irrelevant characters (Numbers and Punctuation).

In [6]:
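def remove_non_alphanumeric(text):
    # Replace everything except ASCII letters (digits included) with a space.
    # The range must be A-Z: A-z would also keep [ \ ] ^ _ and `.
    return re.sub('[^a-zA-Z]', ' ', text)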

In [7]:
amazon_review["clean_review"] = amazon_review["clean_review"].apply(remove_non_alphanumeric)

amazon_review[["Text", "clean_review"]].head(10)

Unnamed: 0,Text,clean_review
0,I have bought several of the Vitality canned d...,I have bought several of the Vitality canned d...
1,Product arrived labeled as Jumbo Salted Peanut...,Product arrived labeled as Jumbo Salted Peanut...
2,This is a confection that has been around a fe...,This is a confection that has been around a fe...
3,If you are looking for the secret ingredient i...,If you are looking for the secret ingredient i...
4,Great taffy at a great price. There was a wid...,Great taffy at a great price There was a wid...
5,I got a wild hair for taffy and ordered this f...,I got a wild hair for taffy and ordered this f...
6,This saltwater taffy had great flavors and was...,This saltwater taffy had great flavors and was...
7,This taffy is so good. It is very soft and ch...,This taffy is so good It is very soft and ch...
8,Right now I'm mostly just sprouting this so my...,Right now I m mostly just sprouting this so my...
9,This is a very healthy dog food. Good for thei...,This is a very healthy dog food Good for thei...


> Note: Punctuation such as '.' and the apostrophe in "I'm" has been replaced by spaces.
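The sketch below shows what this kind of substitution does on a hypothetical sample string, and illustrates a common pitfall with character ranges: writing `A-z` instead of `A-Z` silently keeps a few punctuation characters.

```python
import re

# Hypothetical sample containing digits, punctuation, and an underscore.
sample = "Rated 5/5, best_taffy ever!"

# Strict class: anything that is not an ASCII letter becomes a space.
strict = re.sub(r'[^a-zA-Z]', ' ', sample)

# Loose class: the range A-z also spans the ASCII characters [ \ ] ^ _ `
# that sit between 'Z' and 'a', so the underscore is kept.
loose = re.sub(r'[^a-zA-z]', ' ', sample)

print(repr(strict))
print(repr(loose))
```

Because each removed character is replaced by a space rather than deleted, words on either side of punctuation stay separated.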

### Step-3. Convert all characters into lowercase:

Convert every word to lowercase.

If we don't, the same word written in different cases (for example, "Great" and "great") is represented as two different words in the vector space model, which results in more dimensions.
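A quick way to see the effect on dimensionality is to compare the vocabulary size before and after lowercasing. The token list below is hypothetical.

```python
# Hypothetical tokens that differ only in letter case.
tokens = ["Great", "great", "GREAT", "Taffy", "taffy"]

# Each distinct string would become its own dimension in the vector space.
vocab_before = set(tokens)
vocab_after = {token.lower() for token in tokens}

print(len(vocab_before), sorted(vocab_before))
print(len(vocab_after), sorted(vocab_after))
```

Here five distinct strings collapse into two once case is ignored.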

In [9]:
def convert_lowercase(text):
    return str(text).lower()

In [10]:
amazon_review["clean_review"] = amazon_review["clean_review"].apply(convert_lowercase)

amazon_review[["Text", "clean_review"]].head(10)

Unnamed: 0,Text,clean_review
0,I have bought several of the Vitality canned d...,i have bought several of the vitality canned d...
1,Product arrived labeled as Jumbo Salted Peanut...,product arrived labeled as jumbo salted peanut...
2,This is a confection that has been around a fe...,this is a confection that has been around a fe...
3,If you are looking for the secret ingredient i...,if you are looking for the secret ingredient i...
4,Great taffy at a great price. There was a wid...,great taffy at a great price there was a wid...
5,I got a wild hair for taffy and ordered this f...,i got a wild hair for taffy and ordered this f...
6,This saltwater taffy had great flavors and was...,this saltwater taffy had great flavors and was...
7,This taffy is so good. It is very soft and ch...,this taffy is so good it is very soft and ch...
8,Right now I'm mostly just sprouting this so my...,right now i m mostly just sprouting this so my...
9,This is a very healthy dog food. Good for thei...,this is a very healthy dog food good for thei...


> Note: All letters are now in lowercase.

### Step-4. Tokenization:

Tokenization is the process of splitting the given text into smaller pieces called tokens.
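As a rough illustration of the idea, the sketch below tokenizes with a regular expression. The helper `simple_tokenize` is hypothetical and much cruder than NLTK's `word_tokenize`, which treats punctuation and contractions more carefully.

```python
import re

def simple_tokenize(text):
    """Split text into runs of letters; a rough stand-in for a real tokenizer."""
    return re.findall(r"[a-z]+", text.lower())

sample = "great taffy at a great price"
tokens = simple_tokenize(sample)
print(tokens)
```

On text that has already been reduced to lowercase letters and spaces, this gives essentially the same word list as splitting on whitespace.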

In [11]:
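import nltk
from nltk.tokenize import word_tokenize
# If word_tokenize raises a LookupError, run nltk.download('punkt') once
# (newer NLTK releases may ask for 'punkt_tab' instead).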

In [12]:
# Split each cleaned review into a list of word tokens.

amazon_review["clean_review"] = amazon_review["clean_review"].apply(word_tokenize)

amazon_review[["Text", "clean_review"]].head(10)

Unnamed: 0,Text,clean_review
0,I have bought several of the Vitality canned d...,"[i, have, bought, several, of, the, vitality, ..."
1,Product arrived labeled as Jumbo Salted Peanut...,"[product, arrived, labeled, as, jumbo, salted,..."
2,This is a confection that has been around a fe...,"[this, is, a, confection, that, has, been, aro..."
3,If you are looking for the secret ingredient i...,"[if, you, are, looking, for, the, secret, ingr..."
4,Great taffy at a great price. There was a wid...,"[great, taffy, at, a, great, price, there, was..."
5,I got a wild hair for taffy and ordered this f...,"[i, got, a, wild, hair, for, taffy, and, order..."
6,This saltwater taffy had great flavors and was...,"[this, saltwater, taffy, had, great, flavors, ..."
7,This taffy is so good. It is very soft and ch...,"[this, taffy, is, so, good, it, is, very, soft..."
8,Right now I'm mostly just sprouting this so my...,"[right, now, i, m, mostly, just, sprouting, th..."
9,This is a very healthy dog food. Good for thei...,"[this, is, a, very, healthy, dog, food, good, ..."


> Note: The result above shows each review split into a list of words.

### Step-5. Stop Words Removal:

Stop words carry little meaning on their own and do not help in distinguishing one document from another.
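The filtering itself is a simple set lookup, sketched here with a tiny hand-picked stop-word set. Both `toy_stop_words` and `drop_stop_words` are hypothetical names for illustration; NLTK's English list is much longer.

```python
# A tiny hand-picked stop-word set, for illustration only.
toy_stop_words = {"i", "of", "the", "it", "this", "is", "a", "have"}

def drop_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in stop_words]

tokens = ["this", "is", "a", "very", "healthy", "dog", "food"]
filtered = drop_stop_words(tokens, toy_stop_words)
print(filtered)
```

The toy set keeps "very", whereas NLTK's list removes it. NLTK's list also contains negations such as "not" and "no", so for sentiment-style tasks it may be worth checking which words you actually want to drop.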

In [13]:
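from nltk.corpus import stopwords

# Requires the NLTK stop-word corpus; run nltk.download('stopwords') once if it is missing.
stop_words = set(stopwords.words('english'))

def remove_stop_words(tokens):
    # Keep only the tokens that are not in the stop-word set.
    return [word for word in tokens if word not in stop_words]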

In [14]:
amazon_review["clean_review"] = amazon_review["clean_review"].apply(remove_stop_words)

amazon_review[["Text", "clean_review"]].head(10)

Unnamed: 0,Text,clean_review
0,I have bought several of the Vitality canned d...,"[bought, several, vitality, canned, dog, food,..."
1,Product arrived labeled as Jumbo Salted Peanut...,"[product, arrived, labeled, jumbo, salted, pea..."
2,This is a confection that has been around a fe...,"[confection, around, centuries, light, pillowy..."
3,If you are looking for the secret ingredient i...,"[looking, secret, ingredient, robitussin, beli..."
4,Great taffy at a great price. There was a wid...,"[great, taffy, great, price, wide, assortment,..."
5,I got a wild hair for taffy and ordered this f...,"[got, wild, hair, taffy, ordered, five, pound,..."
6,This saltwater taffy had great flavors and was...,"[saltwater, taffy, great, flavors, soft, chewy..."
7,This taffy is so good. It is very soft and ch...,"[taffy, good, soft, chewy, flavors, amazing, w..."
8,Right now I'm mostly just sprouting this so my...,"[right, mostly, sprouting, cats, eat, grass, l..."
9,This is a very healthy dog food. Good for thei...,"[healthy, dog, food, good, digestion, also, go..."


> Note: Stop words such as "i", "of", "the", "it", and "this" are removed.

### Step-6. Stemming:

Stemming is the process of reducing a word to its root form.

It is a crude heuristic that chops off the ends of words in the hope of getting this right most of the time, and it often removes derivational affixes as well. The resulting element is known as the stem.
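To show what "chopping off the ends of words" looks like, here is a deliberately naive suffix stripper. This is not the Porter algorithm; the suffix list, the minimum stem length, and the name `toy_stem` are all assumptions made for this sketch.

```python
# Suffixes to strip, checked in this order (an arbitrary choice for the sketch).
SUFFIXES = ("ing", "ed", "ly", "s")

def toy_stem(word):
    """Chop the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["looking", "arrived", "mostly", "flavors", "dog"]
stems = [toy_stem(word) for word in words]
print(stems)
```

Even this toy version produces a non-word ("arriv"), which is typical of rule-based stemmers: the goal is a consistent key for related word forms, not a readable word.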

In [15]:
from nltk.stem import PorterStemmer

# Create a new Porter stemmer
stemmer = PorterStemmer()

def stemming(tokens):
    return [stemmer.stem(token) for token in tokens]

In [19]:
amazon_review["clean_review"] = amazon_review["clean_review"].apply(stemming)

amazon_review[["Text", "clean_review"]].head(10)

Unnamed: 0,Text,clean_review
0,I have bought several of the Vitality canned d...,"[bought, sever, vital, can, dog, food, product..."
1,Product arrived labeled as Jumbo Salted Peanut...,"[product, arriv, label, jumbo, salt, peanut, p..."
2,This is a confection that has been around a fe...,"[confect, around, centuri, light, pillowi, cit..."
3,If you are looking for the secret ingredient i...,"[look, secret, ingredi, robitussin, believ, fo..."
4,Great taffy at a great price. There was a wid...,"[great, taffi, great, price, wide, assort, yum..."
5,I got a wild hair for taffy and ordered this f...,"[got, wild, hair, taffi, order, five, pound, b..."
6,This saltwater taffy had great flavors and was...,"[saltwat, taffi, great, flavor, soft, chewi, c..."
7,This taffy is so good. It is very soft and ch...,"[taffi, good, soft, chewi, flavor, amaz, would..."
8,Right now I'm mostly just sprouting this so my...,"[right, mostli, sprout, cat, eat, grass, love,..."
9,This is a very healthy dog food. Good for thei...,"[healthi, dog, food, good, digest, also, good,..."


> Note: In the result above, "vitality" is transformed into "vital" and "canned" into "can".

Stemming can also produce strings that are not real words (for example, "arriv" and "taffi"), so lemmatization is often preferred.

### Step-7. Lemmatization:

Unlike stemming, lemmatization reduces each word to a form that exists in the language.

Lemmatization uses a vocabulary and morphological analysis of words to return the base or dictionary form of a word, which is known as the lemma.
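The core idea, looking a word up and returning its dictionary form, can be sketched with a tiny hand-written mapping. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical; a real lemmatizer such as WordNet covers far more of the language.

```python
# A hypothetical mini-dictionary mapping inflected forms to lemmas.
TOY_LEMMAS = {
    "centuries": "century",
    "flavors": "flavor",
    "cats": "cat",
    "bought": "buy",
}

def toy_lemmatize(token):
    """Return the dictionary lemma if known, otherwise the token unchanged."""
    return TOY_LEMMAS.get(token, token)

tokens = ["centuries", "flavors", "cats", "taffy"]
lemmas = [toy_lemmatize(token) for token in tokens]
print(lemmas)
```

Every output is a real word, in contrast to stemming. One detail worth knowing about NLTK's `WordNetLemmatizer.lemmatize`: it takes an optional `pos` argument that defaults to noun, so verb forms such as "bought" are left unchanged unless `pos="v"` is passed.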

In [16]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')

lemma = WordNetLemmatizer()

def lemmatization(tokens):
    return [lemma.lemmatize(token) for token in tokens]

[nltk_data] Downloading package wordnet to /Users/tanuja/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /Users/tanuja/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


In [17]:
amazon_review["clean_review"] = amazon_review["clean_review"].apply(lemmatization)

amazon_review[["Text", "clean_review"]].head(10)

Unnamed: 0,Text,clean_review
0,I have bought several of the Vitality canned d...,"[bought, several, vitality, canned, dog, food,..."
1,Product arrived labeled as Jumbo Salted Peanut...,"[product, arrived, labeled, jumbo, salted, pea..."
2,This is a confection that has been around a fe...,"[confection, around, century, light, pillowy, ..."
3,If you are looking for the secret ingredient i...,"[looking, secret, ingredient, robitussin, beli..."
4,Great taffy at a great price. There was a wid...,"[great, taffy, great, price, wide, assortment,..."
5,I got a wild hair for taffy and ordered this f...,"[got, wild, hair, taffy, ordered, five, pound,..."
6,This saltwater taffy had great flavors and was...,"[saltwater, taffy, great, flavor, soft, chewy,..."
7,This taffy is so good. It is very soft and ch...,"[taffy, good, soft, chewy, flavor, amazing, wo..."
8,Right now I'm mostly just sprouting this so my...,"[right, mostly, sprouting, cat, eat, grass, lo..."
9,This is a very healthy dog food. Good for thei...,"[healthy, dog, food, good, digestion, also, go..."


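> Note: Lemmatization gives more readable results than stemming: every token is still a real word (for example, "centuries" becomes "century" rather than "centuri"). In this output the lemmatizer was applied to the unstemmed tokens from Step-5; stemming and lemmatization are normally alternatives, not consecutive steps.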

### Step-8. Remove the words having length <= 2:

In [18]:
def remove_small_words(tokens):
    return [word for word in tokens if len(word) > 2]

In [20]:
amazon_review["clean_review"] = amazon_review["clean_review"].apply(remove_small_words)

amazon_review[["Text", "clean_review"]].head(10)

Unnamed: 0,Text,clean_review
0,I have bought several of the Vitality canned d...,"[bought, sever, vital, can, dog, food, product..."
1,Product arrived labeled as Jumbo Salted Peanut...,"[product, arriv, label, jumbo, salt, peanut, p..."
2,This is a confection that has been around a fe...,"[confect, around, centuri, light, pillowi, cit..."
3,If you are looking for the secret ingredient i...,"[look, secret, ingredi, robitussin, believ, fo..."
4,Great taffy at a great price. There was a wid...,"[great, taffi, great, price, wide, assort, yum..."
5,I got a wild hair for taffy and ordered this f...,"[got, wild, hair, taffi, order, five, pound, b..."
6,This saltwater taffy had great flavors and was...,"[saltwat, taffi, great, flavor, soft, chewi, c..."
7,This taffy is so good. It is very soft and ch...,"[taffi, good, soft, chewi, flavor, amaz, would..."
8,Right now I'm mostly just sprouting this so my...,"[right, mostli, sprout, cat, eat, grass, love,..."
9,This is a very healthy dog food. Good for thei...,"[healthi, dog, food, good, digest, also, good,..."

> Note: The tokens shown above are stemmed because, in this run, the stemming cell (In [19]) was executed after the lemmatization cell (In [17]).


So, these are the steps used for text preprocessing in NLP problems. We don't need to follow every step; sometimes fewer steps are enough. It depends on the dataset as well as the problem.
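One way to make the choice of steps explicit is to treat the pipeline as a list of functions, so steps can be added or dropped per problem. This is a minimal sketch in plain Python, with hypothetical helper names and whitespace splitting standing in for NLTK's tokenizer.

```python
import re

def remove_url(text):
    return re.sub(r'http\S+', '', text)

def keep_letters(text):
    return re.sub(r'[^a-zA-Z]', ' ', text)

def to_lower(text):
    return text.lower()

def tokenize(text):
    # Whitespace split; a simple stand-in for nltk's word_tokenize.
    return text.split()

def drop_short(tokens, min_length=3):
    return [t for t in tokens if len(t) >= min_length]

def run_pipeline(text, steps):
    """Apply each step in order, feeding the result into the next one."""
    result = text
    for step in steps:
        result = step(result)
    return result

# Two hypothetical configurations: a fuller pipeline and a lighter one.
full = [remove_url, keep_letters, to_lower, tokenize, drop_short]
light = [to_lower, tokenize]

sample = "Great taffy at http://example.com for $5!"
print(run_pipeline(sample, full))
print(run_pipeline(sample, light))
```

The order of the list matters: the text-level steps must come before tokenization, and the token-level steps after it.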