## 1. Stemming

**Definition**:  
Stemming reduces a word to a base form by stripping affixes, usually suffixes, with rule-based heuristics. It does not consider the word's meaning or context, so the result (the "stem") may not be a dictionary word.

**Purpose**:  
Stemming maps inflected variants of a word to the same stem. This shrinks the vocabulary and can help in tasks such as text classification and information retrieval.

### Example:
- "running" → "run"
- "flies" → "fli" (non-dictionary form)
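
To make the suffix-stripping idea concrete, here is a minimal sketch of a rule-based stemmer. The function name `toy_stem` and its three rules are assumptions chosen for illustration; this is not the Porter algorithm, and it will mis-stem many words that a real stemmer handles.

```python
def toy_stem(word):
    """Strip a few common English suffixes (illustrative toy, not Porter)."""
    # "flies" -> "fli": replace a trailing "ies" with "i"
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "i"
    # "running" -> "runn" -> "run": drop "ing", then undo consonant doubling
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]
        return stem
    # "cats" -> "cat": drop a plural "s", but leave words ending in "ss" alone
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


for w in ["running", "flies", "cats", "glass"]:
    print(w, "->", toy_stem(w))
```

The rules never consult a dictionary, which is why a non-word such as "fli" can come out.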


## 2. Lemmatization

**Definition**:  
Lemmatization is a more sophisticated approach that reduces a word to its lemma (its dictionary form). Unlike stemming, it relies on a vocabulary and on the word's part of speech, and therefore on its context, so that the result is a valid word.

**Purpose**:  
Lemmatization maps the different forms of a word (e.g., "running", "ran", "runs") to a common dictionary form ("run") while preserving meaning. This includes irregular forms such as "ran", which suffix stripping alone cannot handle.

### Example:
- "running" → "run" (valid word in the dictionary)
- "flies" → "fly" (valid word in the dictionary)
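
The role of the part of speech can be sketched with a tiny lookup table. Everything here is hypothetical: `LEMMA_TABLE` and `toy_lemmatize` are made-up names, and a real lemmatizer uses a full vocabulary plus morphological rules rather than six hand-written entries.

```python
# Hypothetical miniature vocabulary keyed by (word, part of speech).
LEMMA_TABLE = {
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("runs", "v"): "run",
    ("flies", "v"): "fly",
    ("flies", "n"): "fly",
    ("better", "a"): "good",
}


def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)


print(toy_lemmatize("ran", "v"))      # irregular form resolved by lookup
print(toy_lemmatize("better", "a"))   # adjective reading
print(toy_lemmatize("better", "v"))   # no verb entry, returned unchanged
```

The same surface form can map to different lemmas depending on the tag supplied, which is why the lemmatizer needs a part of speech.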

In [None]:
#Stemming
import nltk
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

# Initialize stemmers
porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()

# Words to be stemmed
words = ["running", "flies", "universities", "better"]

# Apply stemming using different stemmers
print("Porter Stemmer:", [porter.stem(word) for word in words])
print("Snowball Stemmer:", [snowball.stem(word) for word in words])
print("Lancaster Stemmer:", [lancaster.stem(word) for word in words])


Porter Stemmer: ['run', 'fli', 'univers', 'better']
Snowball Stemmer: ['run', 'fli', 'univers', 'better']
Lancaster Stemmer: ['run', 'fli', 'univers', 'bet']


In [None]:
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')



[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [None]:
#Lemmatization
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

lemmatizer = WordNetLemmatizer()

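# Words to be lemmatized
# pos='v' is passed explicitly here; WordNetLemmatizer.lemmatize defaults to pos='n'.
# "better" is left unchanged as a verb; with pos='a' WordNet maps it to "good".
words = ["running", "flies", "better"]
print("Lemmatized Words (default POS 'v'):", [lemmatizer.lemmatize(word, pos='v') for word in words])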

# Example sentence
sentence = "The cats are running quickly."
tokens = nltk.word_tokenize(sentence)

# POS tagging
tagged = nltk.pos_tag(tokens)

# Lemmatize using the POS tags: verbs are passed as 'v', every other tag
# (including adjectives and adverbs) falls back to 'n'
lemmatized_words = [lemmatizer.lemmatize(word, pos='v' if tag.startswith('V') else 'n') for word, tag in tagged]
print("Lemmatized with POS:", lemmatized_words)


Lemmatized Words (default POS 'v'): ['run', 'fly', 'better']
Lemmatized with POS: ['The', 'cat', 'be', 'run', 'quickly', '.']
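
The verb-or-noun choice above can be extended to adjectives and adverbs. The sketch below maps Penn Treebank tags (the kind returned by `nltk.pos_tag`) to the single-letter codes that `WordNetLemmatizer.lemmatize` accepts for `pos`. The helper name `penn_to_wordnet` is an assumption, not part of NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (defaults to noun)."""
    if tag.startswith("J"):   # JJ, JJR, JJS -> adjective
        return "a"
    if tag.startswith("V"):   # VB, VBD, VBG, ... -> verb
        return "v"
    if tag.startswith("R"):   # RB, RBR, RBS -> adverb
        return "r"
    return "n"                # NN, NNS, and anything else -> noun


for tag in ["NNS", "VBG", "JJR", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

With the cell above, it would be used as `lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))` for each `(word, tag)` pair in `tagged`.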


In [None]:
#custom lemmatizer NLTK

import nltk
from nltk.stem import WordNetLemmatizer

# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()

# Example tweet
tweet = "LOL! I'm gonna be BRB, IDK if I can make it #excited #NLP"

# Custom slang dictionary
slang_dict = {
    "lol": "laughing_out_loud",
    "brb": "be_right_back",
    "idk": "i_dont_know",
    "gonna": "going_to"
}

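# word_tokenize splits "gonna" into "gon" and "na", so the "gonna" entry
# in slang_dict is never matched by the token-level lookup below
tokens = nltk.word_tokenize(tweet)
lemmatized_tweet = [slang_dict.get(token.lower(), lemmatizer.lemmatize(token)) for token in tokens]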
print("Custom Lemmatized Tweet:", lemmatized_tweet)

# Lemmatize without the slang dictionary (tokens were computed above)
lemmatized_tweet = [lemmatizer.lemmatize(token) for token in tokens]
print("Lemmatized:", lemmatized_tweet)

# Lemmatize again using POS tags: verbs as 'v', everything else as 'n'
tagged = nltk.pos_tag(tokens)
lemmatized_words = [lemmatizer.lemmatize(word, pos='v' if tag.startswith('V') else 'n') for word, tag in tagged]
print("Lemmatized with POS:", lemmatized_words)


Custom Lemmatized Tweet: ['laughing_out_loud', '!', 'I', "'m", 'gon', 'na', 'be', 'be_right_back', ',', 'i_dont_know', 'if', 'I', 'can', 'make', 'it', '#', 'excited', '#', 'NLP']
Lemmatized: ['LOL', '!', 'I', "'m", 'gon', 'na', 'be', 'BRB', ',', 'IDK', 'if', 'I', 'can', 'make', 'it', '#', 'excited', '#', 'NLP']


In [1]:
# spaCy stemming:
# spaCy does not include a stemmer; it provides lemmatization instead

import spacy

# Lemmatization
nlp = spacy.load("en_core_web_sm")

text = "The cats were running quickly."
doc = nlp(text)

# Lemmatized words
lemmatized_words = [token.lemma_ for token in doc]
print("Lemmatized Words:", lemmatized_words)


Lemmatized Words: ['the', 'cat', 'be', 'run', 'quickly', '.']


In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

tweet = "LOL! I'm gonna be BRB, IDK if I can make it #excited #NLP"
# Custom slang dictionary for spaCy
slang_dict = {
    "lol": "laughing_out_loud",
    "brb": "be_right_back",
    "idk": "i_dont_know",
    "gonna": "going_to"
}

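# spaCy also splits "gonna" into two tokens, so the "gonna" entry is not
# matched here either; those tokens fall through to token.lemma_
doc = nlp(tweet)
lemmatized_tweet_spacy = [slang_dict.get(token.text.lower(), token.lemma_) for token in doc]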
print("Custom Lemmatized Tweet (spaCy):", lemmatized_tweet_spacy)



Custom Lemmatized Tweet (spaCy): ['laughing_out_loud', '!', 'I', 'be', 'go', 'to', 'be', 'be_right_back', ',', 'i_dont_know', 'if', 'I', 'can', 'make', 'it', '#', 'excited', '#', 'nlp']


In [None]:
!pip install emoji

Collecting emoji
  Downloading emoji-2.14.0-py3-none-any.whl.metadata (5.7 kB)
Downloading emoji-2.14.0-py3-none-any.whl (586 kB)
Installing collected packages: emoji
Successfully installed emoji-2.14.0


In [None]:
import re

from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

# Placeholder for custom handling of hashtags and mentions.
# Each substitution replaces a match with itself, so the text comes back
# unchanged; the printed results show '#' still split off as its own token.
def custom_tokenize(text):
    text = re.sub(r'#\w+', lambda x: x.group(), text)  # hashtags: no-op rewrite
    text = re.sub(r'@\w+', lambda x: x.group(), text)  # mentions: no-op rewrite
    return text

# Example tweet
tweet2 = "I'm loving #AI and #MachineLearning, follow me @user!"

# Custom tokenize the tweet
custom_tokenized_tweet = custom_tokenize(tweet2)
print(custom_tokenized_tweet)

# Lemmatize the tweet with NLTK
lemmatizer = WordNetLemmatizer()
tokens2 = word_tokenize(custom_tokenized_tweet)
lemmatized_tweet2 = [lemmatizer.lemmatize(token) for token in tokens2]
print("Custom Tokenized and Lemmatized Tweet (NLTK):", lemmatized_tweet2)


# Lemmatize the tweet with spaCy
doc2 = nlp(custom_tokenized_tweet)
lemmatized_tweet2_spacy = [token.lemma_ for token in doc2]
print("Custom Tokenized and Lemmatized Tweet (spaCy):", lemmatized_tweet2_spacy)


I'm loving #AI and #MachineLearning, follow me @user!
Custom Tokenized and Lemmatized Tweet (NLTK): ['I', "'m", 'loving', '#', 'AI', 'and', '#', 'MachineLearning', ',', 'follow', 'me', '@', 'user', '!']
Custom Tokenized and Lemmatized Tweet (spaCy): ['I', 'be', 'love', '#', 'AI', 'and', '#', 'MachineLearning', ',', 'follow', 'I', '@us', '!']
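
In the output above, `#` is separated from the hashtag text. One way to keep hashtags and mentions as single tokens is to tokenize with a regular expression that tries those patterns first. This is a minimal sketch using only the standard library; the name `tweet_tokenize` and the pattern are assumptions, and the pattern handles only straight apostrophes and simple punctuation.

```python
import re

# Order matters: hashtags/mentions first, then words (with an optional
# apostrophe part such as "I'm"), then any single punctuation character.
TOKEN_PATTERN = re.compile(r"[#@]\w+|\w+(?:'\w+)?|[^\w\s]")


def tweet_tokenize(text):
    """Split text into tokens, keeping #hashtags and @mentions intact."""
    return TOKEN_PATTERN.findall(text)


tweet2 = "I'm loving #AI and #MachineLearning, follow me @user!"
print(tweet_tokenize(tweet2))
```

The resulting tokens could then be lemmatized one by one, skipping any token that starts with `#` or `@`.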


In [None]:
import nltk
import re

# Example tweet with emojis
tweet = "I love NLP 😊 and it's so fun! 😄 #AI"

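# Function to remove emojis
# Note: this pattern drops every character that is not a word character,
# whitespace, or a comma, so it also removes '!', '#', and apostrophes
def remove_emojis(text):
    return re.sub(r'[^\w\s,]', '', text)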

# Tokenize tweet
tokens = nltk.word_tokenize(remove_emojis(tweet))
print("Tokens after Removing Emojis:", tokens)


import spacy
import emoji

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Example tweet with emojis
tweet = "I love NLP 😊 and it's so fun! 😄 #AI"

# Function to replace emojis with mapped text
def map_emoji_to_text(text):
    emoji_dict = {'😊': 'happy', '😄': 'joyful'}
    for emj, word in emoji_dict.items():
        text = text.replace(emj, word)
    return text

# Process the tweet with spaCy
custom_tweet = map_emoji_to_text(tweet)
doc = nlp(custom_tweet)

# Tokenize tweet
tokens = [token.text for token in doc]
print("Tokens after Mapping Emojis to Text:", tokens)


Tokens after Removing Emojis: ['I', 'love', 'NLP', 'and', 'its', 'so', 'fun', 'AI']
Tokens after Mapping Emojis to Text: ['I', 'love', 'NLP', 'happy', 'and', 'it', "'s", 'so', 'fun', '!', 'joyful', '#', 'AI']
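
The first output above loses the apostrophe in "it's" and the `#` because the removal pattern targets all punctuation. A narrower sketch, using only the standard library, matches emoji code points instead. The two Unicode ranges below are an assumption that covers many common emoji but not all of them, and the names `strip_emojis` and `name_emojis` are hypothetical.

```python
import re
import unicodedata

# Approximate emoji ranges (assumption): pictographs/emoticons and
# the older "miscellaneous symbols" and "dingbats" blocks.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def strip_emojis(text):
    """Remove emoji characters and collapse the leftover whitespace."""
    without = EMOJI_PATTERN.sub("", text)
    return re.sub(r"\s+", " ", without).strip()


def name_emojis(text):
    """Replace each emoji with its Unicode name, e.g. for use as a token."""
    def to_name(match):
        name = unicodedata.name(match.group(), "emoji")
        return name.lower().replace(" ", "_")
    return EMOJI_PATTERN.sub(to_name, text)


tweet = "I love NLP \U0001F60A and it's so fun! \U0001F604 #AI"
print(strip_emojis(tweet))
print(name_emojis(tweet))
```

Unlike a hand-written emoji dictionary, the Unicode name lookup needs no per-emoji entries, though the names are longer than labels such as "happy".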
