### Text Preprocessing in NLP
Text preprocessing is an essential step in natural language processing (NLP): it cleans and transforms unstructured text to prepare it for analysis. Typical steps include tokenization, stemming, lemmatization, stop-word removal, and part-of-speech tagging.
#### Why is Text Preprocessing Important?

Text preprocessing is crucial in natural language processing (NLP) for several reasons:

Noise reduction. Text data often contains noise such as punctuation, special characters, and irrelevant symbols. Preprocessing helps remove these elements, making the text cleaner and easier to analyze.

Normalization. Different forms of words (e.g., “run,” “running,” “ran”) can convey the same meaning but appear in different forms. Preprocessing techniques like stemming and lemmatization help standardize these variations.

Tokenization. Text data needs to be broken down into smaller units, such as words or phrases, for analysis. Tokenization divides text into meaningful units, facilitating subsequent processing steps like feature extraction.

Stopword removal. Stopwords are common words like “the,” “is,” and “and” that occur frequently but convey little semantic meaning. Removing stopwords can improve the efficiency of text analysis by reducing noise.

Feature extraction. Preprocessing can involve extracting features from text, such as word frequencies, n-grams, or word embeddings, which are essential for building machine learning models.

Dimensionality reduction. Text data often has a high dimensionality due to the presence of a large vocabulary. Preprocessing techniques like term frequency-inverse document frequency (TF-IDF) or dimensionality reduction methods can help.
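Several of the steps listed above can be chained into a small pipeline. The following is a minimal sketch using only the standard library; the stop-word set, the helper names `preprocess` and `ngrams`, and the sample review are illustrative assumptions, not part of the notebook.

```python
import string
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer)
STOP_WORDS = {"the", "is", "and", "a", "it", "my", "to"}

def preprocess(text):
    """Lowercase, strip ASCII punctuation, tokenize on whitespace, drop stop words."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

def ngrams(tokens, n=2):
    """Return the list of n-grams (as tuples) over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

review = "Love my Echo! The sound is great, and the setup is easy."
tokens = preprocess(review)
print(tokens)                          # cleaned tokens
print(Counter(tokens).most_common(3))  # word frequencies as simple features
print(ngrams(tokens))                  # bigram features
```

Word counts and n-grams like these are the kind of features that a weighting scheme such as TF-IDF would then operate on.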

In [5]:
# Read Data
import pandas as pd 
import string

In [6]:
# let's read the dataset (a tab-separated file)
data = pd.read_csv('amazon_alexa.tsv', delimiter = '\t')
# let's check the shape of the dataset
data.shape

(3150, 5)

In [7]:
data.head()

Unnamed: 0,rating,date,variation,verified_reviews,feedback
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1
4,5,31-Jul-18,Charcoal Fabric,Music,1


In [8]:
# let's check the value counts for the product variation
data['variation'].value_counts()

variation
Black  Dot                      516
Charcoal Fabric                 430
Configuration: Fire TV Stick    350
Black  Plus                     270
Black  Show                     265
Black                           261
Black  Spot                     241
White  Dot                      184
Heather Gray Fabric             157
White  Spot                     109
White                            91
Sandstone Fabric                 90
White  Show                      85
White  Plus                      78
Oak Finish                       14
Walnut Finish                     9
Name: count, dtype: int64

In [23]:
# Let's calculate the length (in characters) of each review
data['length'] = data['verified_reviews'].apply(len)
data['length']

0        12
1         8
2       192
3       167
4         5
       ... 
3145     47
3146    129
3147    423
3148    371
3149      4
Name: length, Length: 3150, dtype: int64

In [24]:
# Character count of each review (identical to the 'length' column above)
data['char_count'] = data['verified_reviews'].apply(len)
data['char_count']

0        12
1         8
2       192
3       167
4         5
       ... 
3145     47
3146    129
3147    423
3148    371
3149      4
Name: char_count, Length: 3150, dtype: int64

In [25]:
# Word count: number of whitespace-separated tokens in each review
data['word_count'] = data['verified_reviews'].apply(lambda x: len(x.split()))
data['word_count']

0        3
1        2
2       38
3       33
4        1
        ..
3145     8
3146    23
3147    83
3148    76
3149     1
Name: word_count, Length: 3150, dtype: int64

In [28]:
import re

# Count words: each run of word characters between word boundaries
def count_words(text):
    words = re.findall(r'\b\w+\b', text)
    return len(words)

# Count sentences by splitting on runs of '.', '!' or '?'
def count_sentences(text):
    sentences = re.split(r'[.!?]+', text)
    sentences = [s for s in sentences if s.strip()]  # drop empty fragments
    return len(sentences)

# Apply both functions to the review column
data['word_count'] = data['verified_reviews'].apply(count_words)
data['sentence_count'] = data['verified_reviews'].apply(count_sentences)
data['sentence_count']

0       1
1       1
2       1
3       1
4       1
       ..
3145    1
3146    1
3147    1
3148    1
3149    1
Name: sentence_count, Length: 3150, dtype: int64
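A column of all ones is what this kind of sentence counter returns for text whose punctuation has already been stripped, because nothing is left to split on. The sketch below illustrates the effect with a made-up review (an assumption, not taken from the dataset); it suggests computing sentence counts on the raw text, before any cleaning.

```python
import re
import string

def count_sentences(text):
    # Same approach as above: split on runs of '.', '!' or '?', ignore empty pieces
    parts = re.split(r'[.!?]+', text)
    return len([p for p in parts if p.strip()])

raw = "Love my Echo! Works great. Would I buy it again? Yes."
# Delete every ASCII punctuation character, as a cleaning step would
stripped = raw.translate(str.maketrans('', '', string.punctuation))

print(count_sentences(raw))       # 4
print(count_sentences(stripped))  # 1
```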

In [29]:
# Calculate totals
total_words = data['word_count'].sum()
total_sentences = data['sentence_count'].sum()
# Display results
print(data)
print(f"\nTotal Words: {total_words}")
print(f"Total Sentences: {total_sentences}")

      rating       date         variation  \
0          5  31-Jul-18  Charcoal Fabric    
1          5  31-Jul-18  Charcoal Fabric    
2          4  31-Jul-18    Walnut Finish    
3          5  31-Jul-18  Charcoal Fabric    
4          5  31-Jul-18  Charcoal Fabric    
...      ...        ...               ...   
3145       5  30-Jul-18        Black  Dot   
3146       5  30-Jul-18        Black  Dot   
3147       5  30-Jul-18        Black  Dot   
3148       5  30-Jul-18        White  Dot   
3149       4  29-Jul-18        Black  Dot   

                                       verified_reviews  feedback  length  \
0                                          Love my Echo         1      12   
1                                              Loved it         1       8   
2     Sometimes while playing a game you can answer ...         1     192   
3     I have had a lot of fun with this thing My  yr...         1     167   
4                                                 Music         1       5 

### Data Cleaning

#### 1. Punctuation Removal
First, let's remove punctuation from the reviews.

In [30]:
# First attempt: remove punctuation from the reviews (assumes every value is a string)
def punctuation_removal(messy_str):
    clean_list = [char for char in messy_str if char not in string.punctuation]
    clean_str = ''.join(clean_list)
    return clean_str

data['verified_reviews'] = data['verified_reviews'].apply(punctuation_removal)

In [14]:
data.isnull().sum()

rating              0
date                0
variation           0
verified_reviews    1
feedback            0
dtype: int64

In [15]:
def punctuation_removal(messy_str):
    if isinstance(messy_str, str):  # Check if input is a string
        clean_list = [char for char in messy_str if char not in string.punctuation]
        clean_str = ''.join(clean_list)
        return clean_str
    return messy_str  # Return non-string values as is
# Apply the function to the DataFrame
data['verified_reviews'] = data['verified_reviews'].apply(punctuation_removal)
print(data)

      rating       date         variation  \
0          5  31-Jul-18  Charcoal Fabric    
1          5  31-Jul-18  Charcoal Fabric    
2          4  31-Jul-18    Walnut Finish    
3          5  31-Jul-18  Charcoal Fabric    
4          5  31-Jul-18  Charcoal Fabric    
...      ...        ...               ...   
3145       5  30-Jul-18        Black  Dot   
3146       5  30-Jul-18        Black  Dot   
3147       5  30-Jul-18        Black  Dot   
3148       5  30-Jul-18        White  Dot   
3149       4  29-Jul-18        Black  Dot   

                                       verified_reviews  feedback  
0                                          Love my Echo         1  
1                                              Loved it         1  
2     Sometimes while playing a game you can answer ...         1  
3     I have had a lot of fun with this thing My 4 y...         1  
4                                                 Music         1  
...                                                
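As an alternative to the character-by-character filter above, `str.translate` with a deletion table does the same job in one call. This is a minimal sketch; the name `strip_punctuation` and the sample strings are illustrative assumptions.

```python
import string

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(text):
    # Leave non-string values (such as NaN or None) untouched
    if not isinstance(text, str):
        return text
    return text.translate(PUNCT_TABLE)

print(strip_punctuation("Love my Echo!"))
print(strip_punctuation("I’ve heard it all, haven't I?"))
```

Note that `string.punctuation` covers ASCII punctuation only, so typographic characters such as the curly apostrophe in “I’ve” survive this step. The function could be applied with `data['verified_reviews'].apply(strip_punctuation)`.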

#### 2. Number Removal
Let's make a function to remove numbers from the reviews.

In [32]:
# Remove digits from a review, one character at a time
import re
def drop_numbers(text):
    kept_chars = []
    for char in text:
        if not re.search(r'\d', char):  # raw string avoids an invalid-escape warning
            kept_chars.append(char)
    return ''.join(kept_chars)

data['verified_reviews'] = data['verified_reviews'].apply(drop_numbers)
data['verified_reviews']

0                                            Love my Echo
1                                                Loved it
2       Sometimes while playing a game you can answer ...
3       I have had a lot of fun with this thing My  yr...
4                                                   Music
                              ...                        
3145      Perfect for kids adults and everyone in between
3146    Listening to music searching locations checkin...
3147    I do love these things i have them running my ...
3148    Only complaint I have is that the sound qualit...
3149                                                 Good
Name: verified_reviews, Length: 3150, dtype: object

In [17]:
data.isnull().sum()

rating              0
date                0
variation           0
verified_reviews    1
feedback            0
dtype: int64

In [18]:
data['verified_reviews'] = data['verified_reviews'].fillna('')  # Replace NaN with an empty string

In [19]:
# Re-apply drop_numbers now that missing reviews are empty strings
data['verified_reviews'] = data['verified_reviews'].apply(drop_numbers)

In [20]:
data['verified_reviews'].head(10)

0                                         Love my Echo
1                                             Loved it
2    Sometimes while playing a game you can answer ...
3    I have had a lot of fun with this thing My  yr...
4                                                Music
5    I received the echo as a gift I needed another...
6    Without having a cellphone I cannot use many o...
7    I think this is the th one Ive purchased Im wo...
8                                          looks great
9    Love it I’ve listened to songs I haven’t heard...
Name: verified_reviews, dtype: object
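The same digit removal can be written as a single regular-expression substitution. The sketch below also collapses the double spaces that removed numbers leave behind (visible in "My  yr" above); the name `drop_digits` is an illustrative assumption.

```python
import re

def drop_digits(text):
    # Remove runs of digits, then collapse the whitespace left behind
    no_digits = re.sub(r'\d+', '', text)
    return ' '.join(no_digits.split())

print(drop_digits("My 4 yr old loves it"))
print(drop_digits("this is the 5th one"))
```

Digits glued to letters still leave fragments such as "th", so whether to drop the whole token instead is a judgment call that depends on the task.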

#### 3. Removing Special Characters

Special characters are non-alphanumeric characters such as symbols and currency signs. They are common in comments, references, and prices, and for most text-analysis tasks they add noise rather than meaning. Regular expressions (regex) make it straightforward to strip them out.

In [21]:
# Replace every character that is not an ASCII letter or digit with a space
def remove_special_characters(text):
    pat = r'[^a-zA-Z0-9]'  # A-Z, not A-z: the range A-z would also cover [ \ ] ^ _ `
    return re.sub(pat, ' ', text)

# Apply the function to the review column
data['verified_reviews'] = data['verified_reviews'].apply(remove_special_characters)

In [22]:
# let's check the head of the verified reviews after cleaning
data['verified_reviews'][:5]

0                                         Love my Echo
1                                             Loved it
2    Sometimes while playing a game you can answer ...
3    I have had a lot of fun with this thing My  yr...
4                                                Music
Name: verified_reviews, dtype: object
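A variation on this step, sketched under the assumption that whitespace should be preserved and then normalized, is shown below. The function name `clean_special` and the sample string are hypothetical. The last line demonstrates why character ranges need care: in ASCII, the range `A-z` spans more than letters.

```python
import re

def clean_special(text):
    # Keep ASCII letters, digits and whitespace; replace anything else with a space
    text = re.sub(r'[^A-Za-z0-9\s]', ' ', text)
    # Collapse repeated whitespace and trim the ends
    return re.sub(r'\s+', ' ', text).strip()

print(clean_special("Great sound!!! Cost: $49.99 (on sale) :-)"))

# The range A-z is wider than it looks: it also matches [ \ ] ^ _ and `
print(re.findall(r'[A-z]', "a_Z^"))
```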

### Tokenization

Tokenization means splitting a larger body of text into smaller units such as sentences, words, or characters. Many tokenization functions are built into the nltk module, including tokenizers for languages other than English.
Tokens can be as small as single characters or as long as whole words. The process matters because it helps machines work with human language by breaking it into bite-sized pieces that are easier to analyze.
Imagine you're trying to teach a child to read. Instead of diving straight into complex paragraphs, you'd start by introducing them to individual letters, then syllables, and finally, whole words. In a similar vein, tokenization breaks down vast stretches of text into more digestible and understandable units for machines.

The primary goal of tokenization is to represent text in a manner that's meaningful for machines without losing its context. By converting text into tokens, algorithms can more easily identify patterns. This pattern recognition is crucial because it makes it possible for machines to understand and respond to human input. For instance, when a machine encounters the word "running", it doesn't see it as a singular entity but rather as a combination of tokens that it can analyze and derive meaning from.

To delve deeper into the mechanics, consider the sentence, "Chatbots are helpful." When we tokenize this sentence by words, it transforms into an array of individual words:

["Chatbots", "are", "helpful"].

This is a straightforward approach where spaces typically dictate the boundaries of tokens. However, if we were to tokenize by characters, the sentence would fragment into:

["C", "h", "a", "t", "b", "o", "t", "s", " ", "a", "r", "e", " ", "h", "e", "l", "p", "f", "u", "l"].

This character-level breakdown is more granular and can be especially useful for certain languages or specific NLP tasks.
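Both breakdowns can be reproduced with plain Python. This is only a sketch: the trailing period is left out of the sample string so that whitespace splitting yields exactly the three words shown above.

```python
sentence = "Chatbots are helpful"

word_tokens = sentence.split()   # split on whitespace
char_tokens = list(sentence)     # one token per character, spaces included

print(word_tokens)
print(char_tokens)
print(len(char_tokens))
```

With punctuation present, `split()` would keep "helpful." as one token, which is one reason library tokenizers such as NLTK's `word_tokenize` are usually preferred.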

In essence, tokenization is akin to dissecting a sentence to understand its anatomy. Just as doctors study individual cells to understand an organ, NLP practitioners use tokenization to dissect and understand the structure and meaning of text.

Tokenization methods vary based on the granularity of the text breakdown and the specific requirements of the task at hand. These methods can range from dissecting text into individual words to breaking them down into characters or even smaller units. Here's a closer look at the different types:

Word tokenization. This method breaks text down into individual words. It's the most common approach and is particularly effective for languages with clear word boundaries like English.
Character tokenization. Here, the text is segmented into individual characters. This method is beneficial for languages that lack clear word boundaries or for tasks that require a granular analysis, such as spelling correction.
Subword tokenization. Striking a balance between word and character tokenization, this method breaks text into units that might be larger than a single character but smaller than a full word. For instance, "Chatbots" could be tokenized into "Chat" and "bots". This approach is especially useful for languages that form meaning by combining smaller units or when dealing with out-of-vocabulary words in NLP tasks.
Here's a table explaining the differences: 


| Type | Description | Use Cases |
| --- | --- | --- |
| Word Tokenization | Breaks text into individual words. | Effective for languages with clear word boundaries like English. |
| Character Tokenization | Segments text into individual characters. | Useful for languages without clear word boundaries or tasks requiring granular analysis. |
| Subword Tokenization | Breaks text into units larger than characters but smaller than words. | Beneficial for languages with complex morphology or handling out-of-vocabulary words. |
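Subword tokenization can be illustrated with a greedy longest-match split against a fixed vocabulary. This is a toy sketch: the vocabulary and the function name `subword_tokenize` are hypothetical, and practical subword tokenizers learn their vocabulary from data rather than using a hand-written set.

```python
def subword_tokenize(word, vocab):
    """Greedy longest-match-first split of a word into pieces from vocab."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first and shrink until one is in the vocabulary
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character
            pieces.append(word[start])
            start += 1
    return pieces

VOCAB = {"Chat", "bots", "bot", "s", "help", "ful"}  # hypothetical toy vocabulary

print(subword_tokenize("Chatbots", VOCAB))
print(subword_tokenize("helpful", VOCAB))
print(subword_tokenize("Chatty", VOCAB))   # unknown tail falls back to characters
```

The character fallback is what lets this approach handle out-of-vocabulary words: any word can be represented, even if some pieces are single letters.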

In [19]:
# for Tokenization
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

In [20]:
# Sentence tokenization
sentence_data = "The First sentence is about Python. The Second: about Django. You can learn Python, \
Django and Data Analysis here. "

nltk_tokens = nltk.sent_tokenize(sentence_data)
print(nltk_tokens)

['The First sentence is about Python.', 'The Second: about Django.', 'You can learn Python, Django and Data Analysis here.']


In [22]:
# Non-English tokenization: pass the language name to sent_tokenize
german_tokens = sent_tokenize('Wie geht es Ihnen?  Gut, danke.', language='german')
print(german_tokens)

['Wie geht es Ihnen?', 'Gut, danke.']


In [23]:
# Word tokenization
word_data = "It originated from the idea that there are readers who prefer learning new \
skills from the comforts of their drawing rooms"

nltk_tokens = nltk.word_tokenize(word_data)
print(nltk_tokens)

['It', 'originated', 'from', 'the', 'idea', 'that', 'there', 'are', 'readers', 'who', 'prefer', 'learning', 'new', 'skills', 'from', 'the', 'comforts', 'of', 'their', 'drawing', 'rooms']


### 3. Stopwords

Stop words are a set of commonly used words in a language. In English, for example, “the”, “is”, and “and” easily qualify as stop words. In NLP and text mining applications, stop words are filtered out so that processing can focus on the more informative words.

In [24]:
# for stopwords Removal
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [25]:
# let's print the English stopwords
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [27]:
print(stopwords.words('arabic'))

['إذ', 'إذا', 'إذما', 'إذن', 'أف', 'أقل', 'أكثر', 'ألا', 'إلا', 'التي', 'الذي', 'الذين', 'اللاتي', 'اللائي', 'اللتان', 'اللتيا', 'اللتين', 'اللذان', 'اللذين', 'اللواتي', 'إلى', 'إليك', 'إليكم', 'إليكما', 'إليكن', 'أم', 'أما', 'أما', 'إما', 'أن', 'إن', 'إنا', 'أنا', 'أنت', 'أنتم', 'أنتما', 'أنتن', 'إنما', 'إنه', 'أنى', 'أنى', 'آه', 'آها', 'أو', 'أولاء', 'أولئك', 'أوه', 'آي', 'أي', 'أيها', 'إي', 'أين', 'أين', 'أينما', 'إيه', 'بخ', 'بس', 'بعد', 'بعض', 'بك', 'بكم', 'بكم', 'بكما', 'بكن', 'بل', 'بلى', 'بما', 'بماذا', 'بمن', 'بنا', 'به', 'بها', 'بهم', 'بهما', 'بهن', 'بي', 'بين', 'بيد', 'تلك', 'تلكم', 'تلكما', 'ته', 'تي', 'تين', 'تينك', 'ثم', 'ثمة', 'حاشا', 'حبذا', 'حتى', 'حيث', 'حيثما', 'حين', 'خلا', 'دون', 'ذا', 'ذات', 'ذاك', 'ذان', 'ذانك', 'ذلك', 'ذلكم', 'ذلكما', 'ذلكن', 'ذه', 'ذو', 'ذوا', 'ذواتا', 'ذواتي', 'ذي', 'ذين', 'ذينك', 'ريث', 'سوف', 'سوى', 'شتان', 'عدا', 'عسى', 'عل', 'على', 'عليك', 'عليه', 'عما', 'عن', 'عند', 'غير', 'فإذا', 'فإن', 'فلا', 'فمن', 'في', 'فيم', 'فيما', 'فيه', 'فيها', '

In [28]:
# Now let's remove the stopwords, targeting only English
stop_words = set(stopwords.words('english'))  # build the set once for fast lookups

text = "Nick likes to play football, however he is not too fond of tennis."
text_tokens = word_tokenize(text)

tokens_without_sw = [word for word in text_tokens if word.lower() not in stop_words]

print(tokens_without_sw)

['Nick', 'likes', 'play', 'football', ',', 'however', 'fond', 'tennis', '.']


In [29]:
# using gensim to remove stopwords
from gensim.parsing.preprocessing import remove_stopwords
text = "Nick likes to play football, however he is not too fond of tennis."
filtered_sentence = remove_stopwords(text)
print(filtered_sentence)

Nick likes play football, fond tennis.
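In the example above, dropping "not" turns "not too fond of tennis" into "fond tennis", which reverses the meaning. For tasks such as sentiment analysis on reviews, one option is to keep negations out of the stop-word set. The sketch below uses a small hand-written list as a stand-in (an assumption; the NLTK and gensim lists are far longer), and the names `BASE_STOP_WORDS`, `NEGATIONS`, and `remove_stop_words` are illustrative.

```python
# Small illustrative stop-word list (an assumption; library lists are far longer)
BASE_STOP_WORDS = {"to", "he", "is", "not", "too", "of", "the", "a", "and"}
NEGATIONS = {"not", "no", "nor", "never"}

# Keep negations by removing them from the stop-word set
stop_words_keep_neg = BASE_STOP_WORDS - NEGATIONS

def remove_stop_words(tokens, stop_words):
    # Compare in lowercase so capitalized stop words are also removed
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = "Nick is not too fond of tennis".split()
print(remove_stop_words(tokens, BASE_STOP_WORDS))      # negation lost
print(remove_stop_words(tokens, stop_words_keep_neg))  # negation kept
```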


### 4. Stemming

Stemming is the process of reducing inflected or derived words to their word stem, base, or root form. The stem need not be identical to the original word, or even be a valid word. There are several ways to perform stemming, such as lookup tables and suffix-stripping algorithms. Most rely on chopping suffixes such as ‘s’, ‘es’, ‘ed’, ‘ing’, and ‘ly’ off the end of words, and the result is sometimes not desirable. Nonetheless, stemming helps standardize text.
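To make the suffix-stripping idea concrete, here is a deliberately naive sketch. It is not the Porter algorithm, which applies many more rules in several phases; the suffix list, the `min_stem_len` guard, and the name `naive_stem` are assumptions chosen for illustration.

```python
SUFFIXES = ("ing", "ly", "ed", "es", "s")  # checked in this order

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, provided enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[:-len(suffix)]
    return word

for w in ["playing", "played", "plays", "quickly", "boxes", "amazing", "sing"]:
    print(w, "->", naive_stem(w))
```

"amazing" becomes "amaz", which is not a word, while "sing" is left alone only because of the length guard. Both cases show why blind suffix removal can give undesirable results.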

In [30]:
# Stem every whitespace-separated word with the Porter stemmer
from nltk.stem import PorterStemmer

def get_stem(text):
    stemmer = PorterStemmer()
    text = ' '.join([stemmer.stem(word) for word in text.split()])
    return text

# call function
get_stem("we are eating and swimming ; we have been eating and swimming ; he eats and swims ; he ate and swam ")

'we are eat and swim ; we have been eat and swim ; he eat and swim ; he ate and swam'

In [3]:
words_to_stem = ['happy', 'happiest', 'happier', 'cactus', 'cactii', 'elephant', 'elephants', 'amazed', 'amazing', 'amazingly', 'cement', 'owed', 'maximum']
from nltk.stem import PorterStemmer, LancasterStemmer
porter = PorterStemmer()
lancaster = LancasterStemmer()

stemmed = [(porter.stem(word), lancaster.stem(word)) for word in words_to_stem]

print("Porter | Lancaster")
for stem in stemmed:
    print(f"{stem[0]} | {stem[1]}")

Porter | Lancaster
happi | happy
happiest | happiest
happier | happy
cactu | cact
cactii | cacti
eleph | eleph
eleph | eleph
amaz | amaz
amaz | amaz
amazingli | amaz
cement | cem
owe | ow
maximum | maxim


### Lemmatization

Stemming and lemmatization both reduce inflected or derived words to a root form, but lemmatization is the more sophisticated of the two. Stemming may not produce an actual word, whereas lemmatization uses a vocabulary, normally removing inflectional endings only and returning the base or dictionary form of a word, which is known as the lemma.
Lemmatization is considerably slower than stemming, so keep performance in mind when choosing between the two.
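The role of the vocabulary can be sketched with a plain dictionary lookup set against a crude suffix stripper. The table below is a hypothetical mini-dictionary, not WordNet, and both helper functions are illustrative stand-ins for a real lemmatizer and stemmer.

```python
# Hypothetical mini-dictionary mapping inflected forms to lemmas
LEMMA_TABLE = {
    "ate": "eat", "eats": "eat", "eating": "eat",
    "swam": "swim", "swims": "swim", "swimming": "swim",
    "better": "good", "mice": "mouse",
}

def lookup_lemma(word):
    # Fall back to the word itself when it is not in the dictionary
    return LEMMA_TABLE.get(word.lower(), word)

def strip_suffix(word):
    # Crude stemmer for comparison: drop a trailing 'ing' or 's'
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

for w in ["ate", "swimming", "mice", "eats"]:
    print(f"{w}: stem={strip_suffix(w)}, lemma={lookup_lemma(w)}")
```

Irregular forms such as "ate" and "mice" are out of reach for suffix rules but trivial for a lookup, which is the main reason lemmatizers depend on a vocabulary.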

In [34]:
from nltk.stem import WordNetLemmatizer
#defining the object for Lemmatization
wordnet_lemmatizer = WordNetLemmatizer()

In [40]:
def lemmatize_text(text, pos='n'):
    # Split into words first; iterating over the string itself would yield single characters
    return [wordnet_lemmatizer.lemmatize(word, pos=pos) for word in text.split()]

# call function, treating every word as a verb because the sample is verb-heavy
lemmatize_text("we are eating and swimming ; we have been eating and swimming ; he eats and swims ; he ate and swam ", pos='v')

['we',
 'be',
 'eat',
 'and',
 'swim',
 ';',
 'we',
 'have',
 'be',
 'eat',
 'and',
 'swim',
 ';',
 'he',
 'eat',
 'and',
 'swim',
 ';',
 'he',
 'eat',
 'and',
 'swim']

In [1]:
words = ['amaze', 'amazed', 'amazing']
import nltk

nltk.download('wordnet')

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

[lemmatizer.lemmatize(word) for word in words]


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ASIM\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['amaze', 'amazed', 'amazing']

In [2]:
from nltk.corpus import wordnet
[lemmatizer.lemmatize(word, wordnet.VERB) for word in words]

['amaze', 'amaze', 'amaze']

### 5. POS (Part-of-Speech) Tagging
Definition: POS tagging is the process of assigning a part of speech (such as noun, verb, adjective, etc.) to each word in a given text based on its definition and context.

We can implement POS tagging in Python using the Natural Language Toolkit (NLTK).

In [49]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# Sample text
text = "We are eating and swimming; we have been eating and swimming."

# Step 1: Tokenize the text into words
tokens = word_tokenize(text)

# Step 2: Perform POS tagging
pos_tags = pos_tag(tokens)

# Display the results
print("POS Tagged Tokens:")
for word, tag in pos_tags:
    print(f"{word} -> {tag}")

POS Tagged Tokens:
We -> PRP
are -> VBP
eating -> VBG
and -> CC
swimming -> VBG
; -> :
we -> PRP
have -> VBP
been -> VBN
eating -> VBG
and -> CC
swimming -> NN
. -> .
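POS tags are also useful as input to lemmatization, since the WordNet lemmatizer accepts a part-of-speech argument. The helper below is a minimal sketch of a commonly used mapping from Penn Treebank tag prefixes to the single-letter codes WordNet uses ('a', 'v', 'n', 'r'); the function name `penn_to_wordnet` is an assumption, and the mapping is intentionally coarse.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter ('a', 'v', 'n' or 'r')."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # everything else is treated as a noun

# A few (word, tag) pairs copied from the output above
tagged = [('We', 'PRP'), ('are', 'VBP'), ('eating', 'VBG'), ('swimming', 'NN')]
print([(word, penn_to_wordnet(tag)) for word, tag in tagged])
```

Each pair could then be passed to the lemmatizer as `lemmatizer.lemmatize(word, penn_to_wordnet(tag))`. Because the tagger labelled the final "swimming" as a noun, it would be lemmatized as a noun, so tagging errors carry over into the lemmas.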
