### Text Cleaning
- Text cleaning is the process of preparing raw text for Natural Language Processing (NLP) so that models can work with human language.
- Gathering, sorting, and preparing data is the most important step in the data analysis process. Bad data that is not corrected has cumulative negative effects downstream.

### Steps involved in Text Cleaning / Preprocessing

- Tokenization 
- Normalize Text
- Remove Unicode Characters
- Remove Stopwords
- Perform Stemming and Lemmatization
- Part of Speech (POS) Tagging

#### Tokenization 
- Tokenization is the process of splitting a text or a sentence into meaningful units, called tokens. These tokens can be words, phrases, symbols, or other elements, depending on the task. Tokenization is a foundational step in natural language processing (NLP) tasks, enabling computers to process human language effectively.

In [17]:
# python code 

import nltk
from nltk.tokenize import word_tokenize

text = "Tokenization is an important step in natural language processing."
tokens = word_tokenize(text)

print("Original text:", text)
print("Tokens:", tokens)


Original text: Tokenization is an important step in natural language processing.
Tokens: ['Tokenization', 'is', 'an', 'important', 'step', 'in', 'natural', 'language', 'processing', '.']
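
To get a feel for what a tokenizer does, here is a minimal sketch that uses only the standard `re` module. It is a simplified stand-in and not how `word_tokenize` works internally; the function name `simple_tokenize` and the pattern are assumptions made for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    # \w+ grabs runs of letters, digits, and underscores;
    # [^\w\s] grabs one punctuation character at a time
    return re.findall(r"\w+|[^\w\s]", text)

text = "Tokenization is an important step in natural language processing."
tokens = simple_tokenize(text)
print(tokens)
```

For this sentence the result matches the token list above, but a rule this simple splits contractions such as "don't" into three tokens, which is one reason to prefer a library tokenizer for real work.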


#### Normalize Text

- Normalizing text means standardizing it so that NLP models treat equivalent inputs the same way, which helps them perform the downstream task more effectively.
- Here, normalizing means standardizing capitalization so that models don't treat a capitalized word (Hey) as different from its lowercase counterpart (hey). Python's built-in `str.lower()` is enough for this step; NLTK is not required.

In [7]:
# Python code
# .lower() is used here for standardization.
text = "Hello, I am trying to Solve Text Normalization Here"
text1 = text.lower()
print("Original text:", text)
print("Cleaned text:", text1)

Original text: Hello, I am trying to Solve Text Normalization Here
Cleaned text: hello, i am trying to solve text normalization here
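
The effect of lowercasing is easiest to see by counting words before and after. The sketch below uses `collections.Counter` from the standard library; the sample sentence is made up for illustration.

```python
from collections import Counter

text = "Hey you! hey, HEY there"
# Strip the two punctuation marks so that only capitalization differs
words = text.replace("!", "").replace(",", "").split()

raw_counts = Counter(words)
normalized_counts = Counter(word.lower() for word in words)

print(raw_counts)         # "Hey", "hey", and "HEY" are counted separately
print(normalized_counts)  # all three are merged into "hey"
```

Without normalization the model sees five different words; after lowercasing it sees three, with "hey" counted three times.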


#### Remove Unicode Characters
- Punctuation, emojis, URLs, and @-mentions confuse models because they are unique signatures: they either end up as unhelpful Unicode escape sequences (the smiley 😊 can turn into something like `\U0001f60a`) or are unique to a single document (as with @-mentions and hyperlinks).
- Punctuation also creates noise because it relates to the tone of the whole sentence, not necessarily to the word it is attached to.
- `https?://\S+` matches hyperlinks starting with http:// or https://.
- `[\U0001F600-\U0001F64F ... \U0001F1E0-\U0001F1FF]` matches a wide range of emojis and special symbols.

In [4]:
# Python code 
# Regular expressions (the "re" module) can be used for filtering

import re
text = "Hello, check this link: https://example.com 😊. Here is another one: http://test.org and emojis like 😀, 😂, 🥳."
pattern = r'(https?://\S+|[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF\U0001F680-\U0001F6FF\U0001F700-\U0001F77F\U0001F780-\U0001F7FF\U0001F800-\U0001F8FF\U0001F900-\U0001F9FF\U0001FA00-\U0001FA6F\U0001FA70-\U0001FAFF\U00002700-\U000027BF\U0001F1E0-\U0001F1FF])'
cleaned_text = re.sub(pattern, '', text)
print("Original text:", text)
print("Cleaned text:", cleaned_text)

Original text: Hello, check this link: https://example.com 😊. Here is another one: http://test.org and emojis like 😀, 😂, 🥳.
Cleaned text: Hello, check this link:  . Here is another one:  and emojis like , , .
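
The pattern above handles links and emojis but leaves @-mentions, punctuation, and extra spaces behind. A minimal sketch of how those could be removed as well is shown below; the helper name `strip_noise`, the `@\w+` pattern, and the sample sentence are assumptions for illustration, and `string.punctuation` only covers ASCII punctuation.

```python
import re
import string

def strip_noise(text):
    # Remove URLs and @-mentions first, while their punctuation is intact
    text = re.sub(r"https?://\S+|@\w+", "", text)
    # Remove ASCII punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse the leftover runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()

sample = "Thanks @nltk_fan, see https://example.com for details!"
print(strip_noise(sample))  # Thanks see for details
```

The order matters: if punctuation were stripped first, `https://example.com` would lose its `://` and the URL pattern would no longer match.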


#### Remove Stopwords

- Stop words are common words (such as "the", "is", and "to") that carry little meaning on their own and can usually be removed before analysis.
- NLTK ships a predefined stop word list that we can import. You can also build your own list to suit your requirements.

In [11]:
# python code 

import nltk.corpus
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to /home/ishan-
[nltk_data]     pc/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [12]:
stop_word = stopwords.words('english')
text = "Here I am trying to see if the stop words are being removed or not"
cleaned_text = " ".join([word for word in text.split() if word not in stop_word])
print("Original text:", text)
print("Cleaned text:", cleaned_text)

Original text: Here I am trying to see if the stop words are being removed or not
Cleaned text: Here I trying see stop words removed


Note that the word <b>not</b> is also removed, even though it can carry very important meaning in a sentiment analysis task. Keep this in mind when removing stopwords. Also note that "Here" and "I" survive: the NLTK list is lowercase and the comparison is case-sensitive, so lowercase the text before filtering.
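
Both points can be sketched without NLTK. The stopword set below is a tiny hand-written subset chosen for this sentence (an assumption for illustration, NLTK's list is much longer); the comparison is made case-insensitive and "not" is taken out of the set so that it is kept.

```python
# A tiny hand-written stopword set for illustration only
stop_words = {"here", "i", "am", "to", "if", "the", "are", "being", "or", "not"}
# Negations we want to keep for sentiment analysis
keep = {"not"}
custom_stop_words = stop_words - keep

text = "Here I am trying to see if the stop words are being removed or not"
cleaned = " ".join(
    # Compare the lowercased word so that "Here" and "I" are caught too
    word for word in text.split() if word.lower() not in custom_stop_words
)
print(cleaned)  # trying see stop words removed not
```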

#### Perform Stemming and Lemmatization

##### Stemming:
Stemming is the process of reducing words to their word stem or root form. It involves chopping off the end of words using simple rules, often based on common suffixes. The goal is to reduce related words to a common base form, even if the stem itself may not be a valid word in the language.
* "fishing" could become "fish".
* "cats" might become "cat".

Stemming is fast and efficient but may not always result in a valid word. It's useful in tasks where speed and simplicity are prioritized over grammatical correctness.

##### Lemmatization: 
Lemmatization, on the other hand, is the process of reducing words to their base or dictionary form (known as the lemma). Unlike stemming, lemmatization considers the context and meaning of the word. It involves resolving words to their canonical form based on dictionary definitions and grammatical rules of the language.
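* "better" becomes "good" (when the word is treated as an adjective).

Lemmatization requires more computational resources and linguistic knowledge than stemming, but it typically produces more accurate and meaningful word forms. Note that NLTK's `WordNetLemmatizer.lemmatize()` treats every word as a noun unless a `pos` argument is passed, so verbs such as "running" are left unchanged in the example below.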


In [16]:
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

sentence = "He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun."
words = word_tokenize(sentence)

stemmed_words = [stemmer.stem(word) for word in words]
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]

print("Original Sentence:")
print(sentence)
print("\nStemmed Sentence:")
print(" ".join(stemmed_words))
print("\nLemmatized Sentence:")
print(" ".join(lemmatized_words))


Original Sentence:
He was running and eating at same time. He has bad habit of swimming after playing long hours in the Sun.

Stemmed Sentence:
he wa run and eat at same time . he ha bad habit of swim after play long hour in the sun .

Lemmatized Sentence:
He wa running and eating at same time . He ha bad habit of swimming after playing long hour in the Sun .
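
Stems such as "wa" and "ha" in the output above look odd, but they follow from how rule-based suffix stripping works. The sketch below is a toy stemmer with three hand-picked rules; it is not the Porter algorithm, and the function name `toy_stem` and its rules are assumptions made only to show the idea.

```python
def toy_stem(word):
    """Strip one common suffix; a toy rule set, not the Porter algorithm."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least two characters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

for word in ["fishing", "cats", "played", "was", "running"]:
    print(word, "->", toy_stem(word))
```

The rules turn "fishing" into "fish" and "cats" into "cat", but they also turn "was" into "wa" and "running" into "runn". A real stemmer adds more rules (Porter, for example, produces "run"), yet it still has no dictionary to check its results against.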


#### Part of Speech (POS) Tagging
- POS tagging converts a sentence into a list of tuples, where each tuple has the form (word, tag). The tag is a part-of-speech tag that indicates whether the word is a noun, adjective, verb, and so on.


In [18]:
import nltk
from nltk import word_tokenize
text = "Nepal is a land of Mount Enerest"
tokens = nltk.word_tokenize(text)
print("Parts of Speech: ",nltk.pos_tag(tokens))

Parts of Speech:  [('Nepal', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('land', 'NN'), ('of', 'IN'), ('Mount', 'NNP'), ('Enerest', 'NNP')]
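
Once a sentence is tagged, the (word, tag) tuples can be filtered like any other Python list. The sketch below reuses the tagged output printed above (copied in by hand, so NLTK is not needed to run it) and picks out the nouns. The tags are Penn Treebank tags, where every noun tag starts with "NN".

```python
# Tagged output copied from the example above
tagged = [('Nepal', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('land', 'NN'),
          ('of', 'IN'), ('Mount', 'NNP'), ('Enerest', 'NNP')]

# Noun tags all start with "NN" (NN, NNS, NNP, NNPS)
nouns = [word for word, tag in tagged if tag.startswith("NN")]
# NNP marks a singular proper noun
proper_nouns = [word for word, tag in tagged if tag == "NNP"]

print(nouns)         # ['Nepal', 'land', 'Mount', 'Enerest']
print(proper_nouns)  # ['Nepal', 'Mount', 'Enerest']
```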


Now we will apply the cleaning and preprocessing steps above to reviews loaded from a TSV file.

In [41]:
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/ishan-
[nltk_data]     pc/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [25]:
df = pd.read_csv("/home/ishan-pc/Desktop/Ishan-Github/NLP-projects/NLTK/Restaurant_Reviews.tsv",sep='\t')
df.head()

|   | Review | Liked |
|---|--------|-------|
| 0 | Wow... Loved this place. | 1 |
| 1 | Crust is not good. | 0 |
| 2 | Not tasty and the texture was just nasty. | 0 |
| 3 | Stopped by during the late May bank holiday of... | 1 |
| 4 | The selection on the menu was great and so wer... | 1 |


In [47]:
corpus = []
print(df["Review"][:5])
print("-" * 20)

# Build the stopword set and the stemmer once, outside the loop
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

for i in range(5):
    # Replace every non-letter character with a space
    review = re.sub(pattern='[^a-zA-Z]', repl=' ', string=df['Review'][i])
    # Lowercase, then split into tokens
    review = word_tokenize(review.lower())
    # Drop stopwords and stem the remaining words
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    print(review)
    corpus.append(review)

0                             Wow... Loved this place.
1                                   Crust is not good.
2            Not tasty and the texture was just nasty.
3    Stopped by during the late May bank holiday of...
4    The selection on the menu was great and so wer...
Name: Review, dtype: object
--------------------
wow love place
crust good
tasti textur nasti
stop late may bank holiday rick steve recommend love
select menu great price


Stemming the reviews sometimes produces words that do not make sense (tasti, textur, nasti). Let's use lemmatization instead and check whether there is a noticeable difference.

In [42]:
corpus = []
print(df["Review"][:5])
print("-" * 20)

# Build the stopword set and the lemmatizer once, outside the loop
stop_words = set(stopwords.words('english'))
wl = WordNetLemmatizer()

for i in range(5):
    # Replace every non-letter character with a space
    review = re.sub(pattern='[^a-zA-Z]', repl=' ', string=df['Review'][i])
    # Lowercase, then split into tokens
    review = word_tokenize(review.lower())
    # Drop stopwords and lemmatize the remaining words
    review = [wl.lemmatize(word) for word in review if word not in stop_words]
    review = ' '.join(review)
    print(review)
    corpus.append(review)

0                             Wow... Loved this place.
1                                   Crust is not good.
2            Not tasty and the texture was just nasty.
3    Stopped by during the late May bank holiday of...
4    The selection on the menu was great and so wer...
Name: Review, dtype: object
--------------------
wow loved place
crust good
tasty texture nasty
stopped late may bank holiday rick steve recommendation loved
selection menu great price


- The new corpus is easier to read: the lemmatizer keeps valid dictionary words (tasty, texture) where the stemmer produced truncated stems (tasti, textur).
- There is still a problem: the stopword list contains the word "not", so it was removed.
- The original text was <b>Crust is not good.</b> but we got <b>crust good</b>, which means the complete opposite.

In [44]:
# Create our own stopword set: start from the NLTK defaults and keep "not"

from nltk.corpus import stopwords

default_stopwords = set(stopwords.words('english'))

custom_stopwords = default_stopwords.copy()
custom_stopwords.discard('not')

In [46]:
corpus = []
print(df["Review"][:5])
print("-" * 20)

# Create the lemmatizer once, outside the loop
wl = WordNetLemmatizer()

for i in range(5):
    # Replace every non-letter character with a space
    review = re.sub(pattern='[^a-zA-Z]', repl=' ', string=df['Review'][i])
    # Lowercase, then split into tokens
    review = word_tokenize(review.lower())
    # Drop only the custom stopwords ("not" is kept), then lemmatize
    review = [wl.lemmatize(word) for word in review if word not in custom_stopwords]
    review = ' '.join(review)
    print(review)
    corpus.append(review)

0                             Wow... Loved this place.
1                                   Crust is not good.
2            Not tasty and the texture was just nasty.
3    Stopped by during the late May bank holiday of...
4    The selection on the menu was great and so wer...
Name: Review, dtype: object
--------------------
wow loved place
crust not good
not tasty texture nasty
stopped late may bank holiday rick steve recommendation loved
selection menu great price


We can see that the word "not" is no longer removed, because we created our own stopword set.
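
One related pitfall is worth checking before relying on this: the `[^a-zA-Z]` pattern also replaces apostrophes, so a contraction such as "don't" is split into "don" and "t" before the stopword filter runs. To my knowledge both fragments are in NLTK's English list, so the negation could still be lost. The sketch below only shows the splitting step with the standard `re` module; the sample sentence and the apostrophe-preserving variant are assumptions, not part of the notebook above.

```python
import re

text = "I don't like it"

# Pattern used above: every non-letter becomes a space
letters_only = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
# Variant that also keeps apostrophes
keep_apostrophes = re.sub(r"[^a-zA-Z']", " ", text).lower().split()

print(letters_only)      # ['i', 'don', 't', 'like', 'it']
print(keep_apostrophes)  # ['i', "don't", 'like', 'it']
```

If contractions matter for the task, keeping the apostrophe (or expanding "don't" to "do not" before cleaning) is one possible way to preserve them.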