
Let's start with some basic text cleaning techniques: converting text to lowercase, trimming whitespace, and removing punctuation and other unwanted characters.

In [None]:
import re
import string

text = "  Hello, world! This is a sample sentence with some extra spaces and weird characters like #$@!.  "

# 1. Convert to lowercase
text_lower = text.lower()
print(f"Lowercase: {text_lower}")

# 2. Remove leading/trailing whitespace
text_stripped = text_lower.strip()
print(f"Stripped: {text_stripped}")

# 3. Remove punctuation
text_no_punct = text_stripped.translate(str.maketrans('', '', string.punctuation))
print(f"No punctuation: {text_no_punct}")

# 4. Remove extra whitespace within text
text_cleaned = re.sub(r'\s+', ' ', text_no_punct).strip()
print(f"Cleaned: {text_cleaned}")

Lowercase:   hello, world! this is a sample sentence with some extra spaces and weird characters like #$@!.  
Stripped: hello, world! this is a sample sentence with some extra spaces and weird characters like #$@!.
No punctuation: hello world this is a sample sentence with some extra spaces and weird characters like 
Cleaned: hello world this is a sample sentence with some extra spaces and weird characters like
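
Step 3 deletes punctuation outright, which also fuses tokens that were joined by punctuation: `well-known` becomes `wellknown` and `U.K.` becomes `UK`. The sketch below wraps the four steps in one function and adds an option to replace punctuation with spaces instead. The function name `clean_text` and the `keep_word_breaks` parameter are illustrative and not part of the original notebook.

```python
import re
import string

# Translation tables built once: one deletes punctuation, the other maps it to spaces.
_DELETE_PUNCT = str.maketrans("", "", string.punctuation)
_SPACE_PUNCT = str.maketrans({ch: " " for ch in string.punctuation})


def clean_text(text, keep_word_breaks=False):
    """Lowercase, remove punctuation, and collapse whitespace (illustrative helper)."""
    table = _SPACE_PUNCT if keep_word_breaks else _DELETE_PUNCT
    text = text.lower().translate(table)
    # Collapsing whitespace runs and stripping also covers leading/trailing spaces.
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("A well-known U.K. startup!"))
# a wellknown uk startup
print(clean_text("A well-known U.K. startup!", keep_word_breaks=True))
# a well known u k startup
```

Which variant is preferable depends on the downstream task; neither handles abbreviations such as `U.K.` gracefully, which is one reason tokenizers usually treat them specially.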


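Now, let's look at some normalization techniques, which reduce inflected words to a common base form. Stemming does this by stripping suffixes with rules, while lemmatization looks up the dictionary form (the lemma), usually with the help of a part-of-speech tag.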

In [None]:
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

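# Tokenizer, WordNet, and POS-tagger resources. Newer NLTK releases look for
# the 'punkt_tab' and 'averaged_perceptron_tagger_eng' names, so both the
# older and newer resource names are downloaded here.
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')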

text = "running ran runs runner runners"

# 1. Stemming (using Porter Stemmer)
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in nltk.word_tokenize(text)]
print(f"Stemmed: {stemmed_words}")

# 2. Lemmatization (using WordNet Lemmatizer)
lemmatizer = WordNetLemmatizer()

# Helper: map the Penn Treebank tag of a word to the WordNet POS constant
# that WordNetLemmatizer expects.
def get_wordnet_pos(word):
    """Return the WordNet POS for a single word, defaulting to noun."""
    print("\nWord:", word)
    # The word is tagged on its own, so the tagger has no sentence context.
    tag = nltk.pos_tag([word])[0][1][0].upper()
    print("\nTag:", tag)
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

lemmatized_words = [lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in nltk.word_tokenize(text)]
print(f"Lemmatized: {lemmatized_words}")

Stemmed: ['run', 'ran', 'run', 'runner', 'runner']

Word: running

Tag: V

Word: ran

Tag: N

Word: runs

Tag: N

Word: runner

Tag: N

Word: runners

Tag: N
Lemmatized: ['run', 'ran', 'run', 'runner', 'runner']


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
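
In the output above, `ran` survives both steps unchanged. The stemmer only strips suffixes, so an irregular past tense is out of its reach. The lemmatizer depends on the part-of-speech tag, and here `ran` was tagged in isolation and came back as a noun. The toy sketch below, which is not NLTK's implementation, contrasts the two ideas: rule-based suffix stripping versus a vocabulary lookup keyed on part of speech. `toy_stem`, `toy_lemmatize`, and the `IRREGULAR_VERBS` table are hypothetical names made up for illustration.

```python
# Toy illustration only; Porter and WordNet use far richer rules and data.
IRREGULAR_VERBS = {"ran": "run"}  # hypothetical one-entry lookup table


def toy_stem(word):
    """Strip a trailing 'ing' or 's', then undouble a final consonant."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if len(word) > 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]  # e.g. 'runn' -> 'run'
    return word


def toy_lemmatize(word, pos):
    """Look up irregular verb forms; leave everything else untouched."""
    if pos == "v":
        return IRREGULAR_VERBS.get(word, word)
    return word


words = "running ran runs runner runners".split()
print([toy_stem(w) for w in words])
# ['run', 'ran', 'run', 'runner', 'runner']
print(toy_lemmatize("ran", "n"), toy_lemmatize("ran", "v"))
# ran run
```

With the real lemmatizer the same effect comes from the POS argument: a call such as `lemmatizer.lemmatize("ran", wordnet.VERB)` should return `run`, and tagging whole sentences rather than single words gives the tagger the context it needs to assign a verb tag.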


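You can combine these techniques for a more comprehensive cleaning and normalization pipeline. spaCy packages tokenization, tagging, parsing, and named-entity recognition behind a single call; the next cell uses it to extract the named entities from a sentence.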

In [2]:
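!pip install spacy
# Fetch the small English model if it is not already installed
!python -m spacy download en_core_web_sm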

import spacy

# Load English tokenizer, tagger, parser and NER
nlp = spacy.load("en_core_web_sm")

text = "Apple is looking at buying U.K. startup for $1 billion"

# Process the text
doc = nlp(text)

# Print entities and their labels
print("Named Entities:")
for ent in doc.ents:
    print(f"{ent.text} ({ent.label_})")

Named Entities:
Apple (ORG)
U.K. (GPE)
$1 billion (MONEY)
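
Returning to the idea of combining cleaning and normalization, the steps from the earlier cells can be chained into one function with a pluggable normalizer. This is a minimal, dependency-free sketch: `preprocess` and `strip_plural` are hypothetical names, and `strip_plural` is only a stand-in for a real stemmer or lemmatizer.

```python
import re
import string


def preprocess(text, normalize=None):
    """Clean the text, split it on whitespace, and optionally normalize each token."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = re.sub(r"\s+", " ", text).strip().split()
    if normalize is None:
        return tokens
    return [normalize(tok) for tok in tokens]


def strip_plural(token):
    """Hypothetical stand-in normalizer: drop a trailing 's' from longer tokens."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


print(preprocess("  The runners, and the RUNNER!  ", normalize=strip_plural))
# ['the', 'runner', 'and', 'the', 'runner']
```

In the notebook's setting, a callable such as `PorterStemmer().stem` could be passed as `normalize` in place of the stand-in.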
