## 1. Data Preprocessing

* Step 1 : Cleaning
* Step 2 : Stop words removal
* Step 3 : Tokenization
* Step 4 : Stemming
* Step 5 : Lemmatization

NLTK, TextBlob, and spaCy are a few of the libraries commonly used for text cleaning.

In [3]:
import warnings
warnings.filterwarnings("ignore")

##### Reading Data

In [90]:
import pandas as pd

# Sample DataFrame

data = {'text': [
    "Check out my Comment in this link: https://example.com",
    "<p>Running is fun!</p>",
    "The cats are sitting on the mats.",
    "I'm Fine. how are you :>",
    "Felling Worried :(",
    "Do it ASAP",
    "pls correct my spellig"
]}

df = pd.DataFrame(data)
df

Unnamed: 0,text
0,Check out my Comment in this link: https://exa...
1,<p>Running is fun!</p>
2,The cats are sitting on the mats.
3,I'm Fine. how are you :>
4,Felling Worried :(
5,Do it ASAP
6,pls correct my spellig


# Step 1 : Cleaning

##### 1.1 Convert to lowercase

In [91]:
# df['text'] is a pandas Series, so the .str accessor is used to apply lower() to each string in it
df['clean_text'] = df['text'].str.lower()
df

Unnamed: 0,text,clean_text
0,Check out my Comment in this link: https://exa...,check out my comment in this link: https://exa...
1,<p>Running is fun!</p>,<p>running is fun!</p>
2,The cats are sitting on the mats.,the cats are sitting on the mats.
3,I'm Fine. how are you :>,i'm fine. how are you :>
4,Felling Worried :(,felling worried :(
5,Do it ASAP,do it asap
6,pls correct my spellig,pls correct my spellig


##### 1.2 Removing URLs

In [92]:
import re

def remove_url(text):
    # Remove anything starting with http://, https:// or www. up to the next whitespace
    return re.sub(r'https?://\S+|www\.\S+', '', text)

df['clean_text'] = df['clean_text'].apply(remove_url)

df

Unnamed: 0,text,clean_text
0,Check out my Comment in this link: https://exa...,check out my comment in this link:
1,<p>Running is fun!</p>,<p>running is fun!</p>
2,The cats are sitting on the mats.,the cats are sitting on the mats.
3,I'm Fine. how are you :>,i'm fine. how are you :>
4,Felling Worried :(,felling worried :(
5,Do it ASAP,do it asap
6,pls correct my spellig,pls correct my spellig


##### 1.3 Removing HTML Tags

In [93]:
from bs4 import BeautifulSoup

def clean_text(text):
    return BeautifulSoup(text, "html.parser").get_text()

# Strip HTML tags from the 'clean_text' column
df['clean_text'] = df['clean_text'].apply(clean_text)

df

Unnamed: 0,text,clean_text
0,Check out my Comment in this link: https://exa...,check out my comment in this link:
1,<p>Running is fun!</p>,running is fun!
2,The cats are sitting on the mats.,the cats are sitting on the mats.
3,I'm Fine. how are you :>,i'm fine. how are you :>
4,Felling Worried :(,felling worried :(
5,Do it ASAP,do it asap
6,pls correct my spellig,pls correct my spellig


##### 1.4 Remove unwanted characters

* Remove punctuation, symbols, and other non-alphanumeric characters to reduce noise in the data.

In [94]:
import re

def clean_text(tweet):
    return re.sub(r'[^A-Za-z0-9\s]', '', tweet)

df['clean_text'] = df['clean_text'].apply(clean_text)

df

Unnamed: 0,text,clean_text
0,Check out my Comment in this link: https://exa...,check out my comment in this link
1,<p>Running is fun!</p>,running is fun
2,The cats are sitting on the mats.,the cats are sitting on the mats
3,I'm Fine. how are you :>,im fine how are you
4,Felling Worried :(,felling worried
5,Do it ASAP,do it asap
6,pls correct my spellig,pls correct my spellig
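
Steps 1.1 to 1.4 can also be chained in one helper. The sketch below is only an illustration using the standard library: the name `basic_clean` is hypothetical, HTML tags are stripped with a simple regex instead of BeautifulSoup (an assumption that holds only for simple tags), and extra whitespace is collapsed at the end.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"<[^>]+>")            # assumption: simple tags only, not a full HTML parser
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")  # text is already lowercase when this runs
SPACE_RE = re.compile(r"\s+")

def basic_clean(text: str) -> str:
    """Lowercase, drop URLs and tags, remove symbols, collapse whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = NON_ALNUM_RE.sub("", text)
    return SPACE_RE.sub(" ", text).strip()

samples = [
    "Check out my Comment in this link: https://example.com",
    "<p>Running is fun!</p>",
    "I'm Fine. how are you :>",
]
cleaned = [basic_clean(s) for s in samples]
print(cleaned)
# ['check out my comment in this link', 'running is fun', 'im fine how are you']
```

With a DataFrame, the same helper could be applied as `df['text'].apply(basic_clean)`.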


##### 1.5 Chat words

In [95]:
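chat_words={
"afaik":"As Far As I Know",
"afk": "Away From Keyboard",
"asap":"As Soon As Possible",
"btw":"By The Way",
"b4":"Before",
"lmao":"Laugh My A.. Off",
"fyi":"For your information"
}

# Wrap each key in \b word boundaries so that only whole words are expanded
# (a bare substring match would also rewrite "afk" inside a longer word)
chat_patterns = {rf"\b{key}\b": value for key, value in chat_words.items()}
df["clean_text"] = df["clean_text"].replace(chat_patterns, regex=True)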
df

Unnamed: 0,text,clean_text
0,Check out my Comment in this link: https://exa...,check out my comment in this link
1,<p>Running is fun!</p>,running is fun
2,The cats are sitting on the mats.,the cats are sitting on the mats
3,I'm Fine. how are you :>,im fine how are you
4,Felling Worried :(,felling worried
5,Do it ASAP,do it As Soon As Possible
6,pls correct my spellig,pls correct my spellig
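
Why whole-word matching matters for chat-word expansion can be shown on plain strings. This is a minimal sketch with the standard library only; the helper name `expand_chat_words` and the lowercase expansions are assumptions made for the illustration.

```python
import re

chat_words = {
    "asap": "as soon as possible",
    "btw": "by the way",
    "fyi": "for your information",
}

# One pattern for all keys; \b keeps "btw" from matching inside "subtweet".
pattern = re.compile(r"\b(" + "|".join(map(re.escape, chat_words)) + r")\b")

def expand_chat_words(text: str) -> str:
    """Replace each whole-word chat abbreviation with its expansion."""
    return pattern.sub(lambda m: chat_words[m.group(1)], text)

print(expand_chat_words("do it asap"))
# do it as soon as possible
print(expand_chat_words("btw the subtweet was deleted"))
# by the way the subtweet was deleted
```

Keeping the expansions lowercase also keeps the column consistent with step 1.1.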


##### 1.6 Correct the spelling

In [96]:
from textblob import TextBlob

# Apply spelling correction (it can also change words that were fine, e.g. im -> in, pls -> pus)
df["clean_text"] = df["clean_text"].apply(lambda x: TextBlob(x).correct().string)
df

Unnamed: 0,text,clean_text
0,Check out my Comment in this link: https://exa...,check out my comment in this link
1,<p>Running is fun!</p>,running is fun
2,The cats are sitting on the mats.,the cats are sitting on the mats
3,I'm Fine. how are you :>,in fine how are you
4,Felling Worried :(,felling worried
5,Do it ASAP,do it Is Soon Is Possible
6,pls correct my spellig,pus correct my spelling
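
The output above shows the corrector rewriting tokens that should have been left alone ("im" to "in", "pls" to "pus"). One possible mitigation is to skip a protected list of words. The sketch below is hypothetical: `correct_text` and `protected` are illustrative names, and the corrector is a stand-in dictionary lookup so the example runs without TextBlob. In the notebook, something like `lambda w: str(TextBlob(w).correct())` could be passed instead.

```python
def correct_text(text, corrector, protected=frozenset()):
    """Apply `corrector` word by word, leaving protected words untouched."""
    out = []
    for word in text.split():
        out.append(word if word.lower() in protected else corrector(word))
    return " ".join(out)

# Stand-in corrector that mimics the corrections seen in the notebook output.
fake_fixes = {"spellig": "spelling", "pls": "pus", "im": "in"}

def corrector(word):
    return fake_fixes.get(word, word)

print(correct_text("pls correct my spellig", corrector))
# pus correct my spelling
print(correct_text("pls correct my spellig", corrector, protected={"pls", "im"}))
# pls correct my spelling
```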


# Step 2 : Removal of Stop words

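Stop words are very common words (such as "the", "is", "my") that help complete a sentence but carry little meaning on their own, so they are usually removed.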

In [4]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))
print(stop_words)

{'mustn', 'by', 'again', 'shan', 'doesn', 'won', "weren't", 'than', 'isn', 'during', 'here', 'doing', "it'd", 'don', 'should', 'they', 're', 'a', 'such', 'how', 'you', 'there', 'did', 'his', 'up', 'other', 'against', 'hadn', "should've", "i'd", 'why', 'their', 'd', 'when', 'while', 'down', 'her', 'has', "wouldn't", "you're", "she's", "she'll", 'after', 'hasn', 'm', 'hers', 'now', 'which', "shan't", "we'd", 'didn', 'o', 'but', 'myself', 'few', "hadn't", "didn't", 'at', 'or', 'further', 'wasn', 'we', 'not', 'once', 'ours', 'couldn', "we've", "she'd", "mustn't", 'wouldn', "he'd", "mightn't", 'in', 'through', 'because', 'an', 'does', 'do', "he'll", "i've", 'been', 'll', "won't", 'from', 'who', 'itself', 'were', 'to', 'that', "you'll", 'yourselves', 'as', "needn't", 'y', 'needn', 'me', 'being', 'out', "that'll", 'these', 'them', 'the', 'very', 'between', 'where', 'some', "aren't", 'over', 's', 'have', 'i', "we'll", 'then', 've', "it'll", 'herself', 'him', 'only', 'above', 'own', "they'll", 

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jemim\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [98]:
# Function to remove stop words (punctuation was already stripped in step 1.4)
def remove_stop_words(text):
    # Split the text into words, remove stop words, and join the remaining words back into a sentence
    return " ".join([word for word in text.split() if word.lower() not in stop_words])

# Apply function to remove stop words in df['clean_text']
df['clean_text'] = df['clean_text'].apply(remove_stop_words)

# Output the DataFrame to check the result
df

Unnamed: 0,text,clean_text
0,Check out my Comment in this link: https://exa...,check comment link
1,<p>Running is fun!</p>,running fun
2,The cats are sitting on the mats.,cats sitting mats
3,I'm Fine. how are you :>,fine
4,Felling Worried :(,felling worried
5,Do it ASAP,Soon Possible
6,pls correct my spellig,pus correct spelling
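
The NLTK list printed above contains 'not', so removing every stop word can flip the meaning of a sentence, which matters for tasks such as sentiment analysis. The sketch below shows one way to keep selected words. The tiny stop list and the `keep` parameter are assumptions for the illustration; the notebook uses NLTK's full English list.

```python
# Hypothetical mini stop list for the demo; not NLTK's list.
stop_words = {"i", "am", "is", "the", "a", "not", "no", "this", "it"}
negations = {"not", "no", "nor", "never"}  # assumption: words worth keeping

def remove_stop_words(text, stop_words, keep=frozenset()):
    """Drop stop words, except those explicitly listed in `keep`."""
    effective = stop_words - set(keep)
    return " ".join(w for w in text.split() if w.lower() not in effective)

print(remove_stop_words("this is not a good movie", stop_words))
# good movie
print(remove_stop_words("this is not a good movie", stop_words, keep=negations))
# not good movie
```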


# Step 3 : Tokenization

Tokenization splits text into individual units called tokens, usually words or sentences.

--> Many ways to tokenize : split method, regex, NLTK, LangChain, LlamaIndex

--> Update : LangChain and regex tokenizers are the ones mostly used nowadays.

* Paragraph --> Corpus
* Sentence --> Document
* Words --> tokens
* Character --> character
* Vocabulary --> collection of unique words(tokens)
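
The split method, the regex approach, and the idea of a vocabulary can be compared on a short string. This is a minimal standard-library sketch; the sample text and the regex patterns are assumptions chosen for the illustration, not the patterns NLTK uses internally.

```python
import re
from collections import Counter

text = "The cats are sitting on the mats. The cats are happy!"

# 1. str.split: fast, but punctuation stays attached ("mats.", "happy!")
split_tokens = text.split()

# 2. Regex: keep only runs of word characters, after lowercasing
regex_tokens = re.findall(r"\w+", text.lower())

# 3. Naive sentence split: break after ., ! or ? followed by whitespace
sentences = re.split(r"(?<=[.!?])\s+", text)

# Vocabulary = the unique tokens; Counter gives their frequencies
vocabulary = sorted(set(regex_tokens))
counts = Counter(regex_tokens)

print(split_tokens[6])   # mats.
print(regex_tokens[6])   # mats
print(sentences)         # ['The cats are sitting on the mats.', 'The cats are happy!']
print(vocabulary)        # ['are', 'cats', 'happy', 'mats', 'on', 'sitting', 'the']
print(counts["the"])     # 3
```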

##### 3.1 word tokenizer

In [88]:
nltk.download('punkt_tab')
from nltk.tokenize import word_tokenize

def tokenize(text):
    return word_tokenize(text)

df['clean_text'] = df['clean_text'].apply(tokenize)
df

[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\jemim\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt_tab.zip.


Unnamed: 0,text,clean_text
0,Check out my Comment in this link: https://exa...,"[check, comment, link]"
1,<p>Running is fun!</p>,"[running, fun]"
2,The cats are sitting on the mats.,"[cats, sitting, mats]"
3,I'm Fine :>,[fine]
4,Felling Worried :(,"[felling, worried]"
5,Do it ASAP,"[Soon, Possible]"
6,pls correct my spellig,"[pus, correct, spelling]"


##### 3.2 Sentence tokenize

In [None]:
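from nltk.tokenize import sent_tokenize

def tokenize_sentences(text):
    return sent_tokenize(text)

# sent_tokenize expects a string, so it is applied to the raw 'text' column
# ('clean_text' already holds lists of word tokens after step 3.1)
df['sentences'] = df['text'].apply(tokenize_sentences)
df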

# Step 4 : Stemming

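* Stemming reduces a word to its stem by stripping suffixes or prefixes with fixed rules, without considering the word's meaning. The result is not always a real word, e.g. running becomes run, but history becomes histori.
* Ideally, related forms collapse to one stem: eating, eats --> eat. Irregular forms such as eaten are generally not reduced by rule-based stemmers.
* Which one to use depends on the task: stemming is faster but can produce non-words, while lemmatization returns valid dictionary words.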

##### 1. Porter Stemmer

In [13]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stemming(tokens):
    return [stemmer.stem(token) for token in tokens]

df['Stemmed_text'] = df['clean_text'].apply(stemming)
    
df  

Unnamed: 0,text,clean_text,Stemmed_text
0,Check out my Comment in this link: https://exa...,"[check, comment, link]","[check, comment, link]"
1,<p>Running is fun!</p>,"[running, fun]","[run, fun]"
2,The cats are sitting on the mats.,"[cats, sitting, mats]","[cat, sit, mat]"
3,I'm Fine :>,"[im, fine]","[im, fine]"
4,Felling Worried :(,"[felling, worried]","[fell, worri]"


##### 2. RegexpStemmer Class

It takes a regular expression and removes every part of the word that matches it, which is typically used to strip prefixes or suffixes.

In [2]:
from nltk.stem import RegexpStemmer

# Words shorter than `min` (4) characters are returned unchanged.
# The last alternative, 'ing' without '$', matches anywhere in the word (e.g. as a prefix).
reg_stemmer = RegexpStemmer('ing$|s$|e$|able$|ing', min=4)

#example
print(reg_stemmer.stem('eatable'))
print(reg_stemmer.stem('eating'))
print(reg_stemmer.stem('ingeatable'))

eat
eat
eat
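
The behaviour shown above can be approximated with plain `re.sub`. The sketch below is not NLTK's implementation: `regexp_stem` is a hypothetical helper that assumes the stemmer simply deletes every match and skips words shorter than the minimum length. It also shows how the unanchored `ing` can over-stem a word.

```python
import re

def regexp_stem(word: str, pattern: str, min_len: int = 0) -> str:
    """Delete every match of `pattern`, unless the word is too short."""
    if len(word) < min_len:
        return word
    return re.sub(pattern, "", word)

pattern = r"ing$|s$|e$|able$|ing"

for w in ["eatable", "eating", "ingeatable", "was", "singing"]:
    print(w, "->", regexp_stem(w, pattern, min_len=4))
# eatable -> eat
# eating -> eat
# ingeatable -> eat
# was -> was        (shorter than 4 characters, left alone)
# singing -> s      (both "ing" occurrences are removed)
```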


##### 3. Snowball Stemmer

It gives better stems than the Porter stemmer for some words, but not for all.

In [17]:
from nltk.stem import SnowballStemmer

snowball_stemmer = SnowballStemmer('english') # Supports several languages, so the language is passed as a parameter

#example
print("Porter Stemmer --> " + stemmer.stem('fairly'),stemmer.stem('sportingly'),stemmer.stem('history'))
print("Snowball_stemmer --> " + snowball_stemmer.stem('fairly'),snowball_stemmer.stem('sportingly'),snowball_stemmer.stem('history'))

Porter Stemmer --> fairli sportingli histori
Snowball_stemmer --> fair sport histori


Note : Stemming does not work well for every word; the stem is often not a valid word (e.g. histori).

# Step 5 : Lemmatization

* The output of lemmatization is called a 'lemma', which is a valid root word rather than a root stem, the output of stemming.
* Lemmatization solves this problem because it looks words up in a dictionary (WordNet) and takes the word's grammatical role (e.g. part of speech) into account.
* POS tags : noun - n, verb - v, adjective - a, adverb - r

In [13]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')  # WordNet data is required by the lemmatizer
lemmatizer = WordNetLemmatizer()

words = ["eating","eats","eaten","writing","writes","programming","programs","history","finally","finalized"]
for word in words:
    # syntax : lemmatizer.lemmatize(word, pos='n'); pos defaults to 'n' (noun), here every word is treated as a verb
    print(word+"-->"+ lemmatizer.lemmatize(word,pos='v'))

eating-->eat
eats-->eat
eaten-->eat
writing-->write
writes-->write
programming-->program
programs-->program
history-->history
finally-->finally
finalized-->finalize


In [37]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# syntax : lemmatizer.lemmatize(word, pos='n'); with pos='n' every token is treated as a noun, so verbs such as "running" stay unchanged

def lem(tokens):
    return [lemmatizer.lemmatize(token,pos='n') for token in tokens]

df['lem_text'] = df['clean_text'].apply(lem)

In [38]:
df

Unnamed: 0,text,clean_text,lem_text
0,Check out my Comment in this link: https://exa...,"[check, comment, link]","[check, comment, link]"
1,<p>Running is fun!</p>,"[running, fun]","[running, fun]"
2,The cats are sitting on the mats.,"[cats, sitting, mats]","[cat, sitting, mat]"
3,I'm Fine :>,"[im, fine]","[im, fine]"
4,Felling Worried :(,"[felling, worried]","[felling, worried]"


# POS Tagging

In [40]:
import nltk
from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\jemim\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [42]:
lemmatizer = WordNetLemmatizer()
wordnet_map = {"N":wordnet.NOUN,"V" : wordnet.VERB, "J": wordnet.ADJ, "R": wordnet.ADV}

def lemmatize(tokens):
    # POS tagging for the tokens
    pos_tokens = pos_tag(tokens)
    print(pos_tokens)
    
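    # Lemmatize each token with its POS tag. The first letter of the Penn Treebank tag selects the
    # WordNet POS (e.g. "running" is tagged VBG, so "V" maps to wordnet.VERB and the lemma is "run");
    # any other tag falls back to wordnet.NOUN.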
    return [lemmatizer.lemmatize(word, wordnet_map.get(pos[0], wordnet.NOUN)) for word, pos in pos_tokens]

# Apply lemmatization to the DataFrame
df['lemmatized_text'] = df['clean_text'].apply(lemmatize)

df

[('check', 'VB'), ('comment', 'NN'), ('link', 'NN')]
[('running', 'VBG'), ('fun', 'NN')]
[('cats', 'NNS'), ('sitting', 'VBG'), ('mats', 'NNS')]
[('im', 'NN'), ('fine', 'NN')]
[('felling', 'VBG'), ('worried', 'VBD')]


Unnamed: 0,text,clean_text,lem_text,lemmatized_text
0,Check out my Comment in this link: https://exa...,"[check, comment, link]","[check, comment, link]","[check, comment, link]"
1,<p>Running is fun!</p>,"[running, fun]","[running, fun]","[run, fun]"
2,The cats are sitting on the mats.,"[cats, sitting, mats]","[cat, sitting, mat]","[cat, sit, mat]"
3,I'm Fine :>,"[im, fine]","[im, fine]","[im, fine]"
4,Felling Worried :(,"[felling, worried]","[felling, worried]","[fell, worry]"
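
The tag lookup used above relies on the first letter of each Penn Treebank tag. A minimal sketch of that mapping, written without NLTK so it runs anywhere: `to_wordnet_pos` is a hypothetical helper, and the single-letter values are the WordNet POS codes listed earlier (n, v, a, r).

```python
def to_wordnet_pos(penn_tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    mapping = {"J": "a", "V": "v", "N": "n", "R": "r"}
    return mapping.get(penn_tag[:1], "n")

tagged = [
    ("cats", "NNS"), ("sitting", "VBG"), ("quickly", "RB"),
    ("happy", "JJ"), ("the", "DT"),
]
for word, tag in tagged:
    print(word, tag, "->", to_wordnet_pos(tag))
# cats NNS -> n
# sitting VBG -> v
# quickly RB -> r
# happy JJ -> a
# the DT -> n      (no mapping for "D", so the noun fallback applies)
```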


* Lemmatization takes more time than stemming, because it has to look each word up in the WordNet corpus to return a meaningful word.
* Use cases : Q/A, chatbots, text summarization.

# Named Entity Recognition

In [15]:
# nltk.download('maxent_ne_chunker')
# nltk.download('words')

In [None]:
sentence="The Eiffel Tower was built from 1887 to 1889 by French engineer Gustave Eiffel, whose company specialized in building metal frameworks and structures."

words = nltk.word_tokenize(sentence)
tags = nltk.pos_tag(words)
# ne_chunk returns a tree of named-entity chunks; draw() opens it in a separate window
nltk.ne_chunk(tags).draw()