**Use Cases**
Text preprocessing, Text cleaning, Spell Correction, String Similarity (Semantic), Text Classification

| **Stage**                | **Included Tasks**                                                    | **Output Example**                            |
| ------------------------ | --------------------------------------------------------------------- | --------------------------------------------- |
| **Lexical Processing**   | Normalization, Tokenization, Stopword Removal, Morphological Analysis | “The cats are running.” → [the, cat, are, run] |
| **Syntactic Processing** | POS Tagging, Chunking, Parsing, Agreement Checking                    | [NP The cats] [VP are running]                |
| **Semantic Processing**  | Role Labelling, Disambiguation, Coreference, Inference                | agent(cats), action(run), time(now)           |


- Lexical - "Lexicon" - words - Order, grammar, and context do not matter
- Syntactic - "Syntax" - Order and grammar matter (POS tags, NER tags)
- Semantic - "Semantics" - Similarity - Order, grammar, and context all matter
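
To make the lexical level concrete, here is a minimal sketch using only the standard library. The two example sentences are made up for illustration. Seen as bags of words they are identical, even though the word order (and therefore the meaning) differs.

```python
from collections import Counter

s1 = "dog bites man"
s2 = "man bites dog"

tokens1 = s1.split()
tokens2 = s2.split()

# Lexical view: only which words occur (and how often) matters.
same_words = Counter(tokens1) == Counter(tokens2)

# Order-sensitive view: the token sequences are compared position by position.
same_order = tokens1 == tokens2

print("Same bag of words:", same_words)  # True
print("Same word order:  ", same_order)  # False
```

A purely lexical comparison cannot tell these two sentences apart, which is why the syntactic and semantic stages are needed.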

In [51]:
# conda activate ai-lab
# conda install -c conda-forge nltk spacy jellyfish scipy python-levenshtein
# python -m spacy download en_core_web_md

# !pip install nltk spacy jellyfish scipy python-Levenshtein
# !python -m spacy download en_core_web_md

In [None]:
import string
import nltk  # natural language toolkit
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer

**Different Libraries**
- spaCy and nltk serve the same purpose, and much of their functionality overlaps.
- nltk offers some additional functionality.
- They are developed by different organizations.

Actions we will perform: tokenization, stemming, lemmatization, punctuation removal, case folding, etc.

In [4]:
tweets = ["Just finished reading an amazing book on AI 🤖📚 #MachineLearning #AI.",
          "What a beautiful morning! ☀️ Feeling super motivated to start the day 💪 #GoodVibes",
          "Ugh...@user my laptop crashed again right before my deadline 😤 #MondayBlues",
          "brb grabbing some ☕️ lol can’t start work without caffeine 😂"
]

**Models from nltk**
- Download the required data packages (models and corpora) from nltk. These are not Python libraries.
- nltk offers a large number of such packages. Some are heavy files and take up disk space.
- So nltk lets us download only the ones we need.

In [3]:
nltk.download('punkt') # for tokenization (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('wordnet') # for lemmatization

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/aditikulkarni/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/aditikulkarni/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [5]:
print ("Original Tweets:")
for t in tweets:
    print(t)

Original Tweets:
Just finished reading an amazing book on AI 🤖📚 #MachineLearning #AI.
What a beautiful morning! ☀️ Feeling super motivated to start the day 💪 #GoodVibes
Ugh...@user my laptop crashed again right before my deadline 😤 #MondayBlues
brb grabbing some ☕️ lol can’t start work without caffeine 😂


In [2]:
# A. CASE FOLDING - Convert text to lower case

def case_folding(text):
    return text.lower()
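
As a side note on case folding, Python also has `str.casefold()`, which is more aggressive than `str.lower()` for some non-English characters. The sketch below shows the difference. The helper name `case_folding_unicode` is hypothetical and not part of the notebook.

```python
def case_folding_unicode(text):
    """Hypothetical variant of case_folding() that uses str.casefold()."""
    return text.casefold()

word = "Straße"
print(word.lower())                # straße
print(case_folding_unicode(word))  # strasse
```

For plain English tweets both methods give the same result, so `lower()` is enough here.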

In [11]:
# B. REMOVE PUNCTUATION: ! # @ $ % ^ & * ( ) _ + - = { } [ ] | \ : ; " ' < > , . ? /
# Remove only the punctuation you need to drop, not necessarily all of it

print("Default Punctuation in String Library:")
print(string.punctuation)

def remove_punctuations(text):
    return ''.join(char for char in text if char not in string.punctuation)


Default Punctuation in String Library:
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
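
The comment above suggests removing only the punctuation that is not needed. One possible way to do that is sketched below, assuming we want to keep `#` and `@` so hashtags and mentions stay recognizable. The function name `remove_punctuations_except` and its `keep` parameter are hypothetical.

```python
import string

def remove_punctuations_except(text, keep="#@"):
    """Remove ASCII punctuation, except the characters listed in `keep`."""
    to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
    # The third argument of str.maketrans lists characters to delete.
    return text.translate(str.maketrans("", "", to_remove))

tweet = "ugh...@user my laptop crashed again! #mondayblues"
cleaned = remove_punctuations_except(tweet)
print(cleaned)  # ugh@user my laptop crashed again #mondayblues
```

Two things to watch for: deleting `...` glues `ugh` and `@user` together, and `string.punctuation` only covers ASCII, so a curly apostrophe such as the one in `can’t` is left untouched.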


In [7]:
# C. TOKENIZATION - Split text into words or tokens
# Breaking the text into smaller units - words, phrases, sentences from paragraphs, or other meaningful elements

def tokenize(text):
    return nltk.word_tokenize(text)
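
To show what "splitting text into tokens" means without any downloads, here is a rough regex-based stand-in. It is not the algorithm behind `nltk.word_tokenize`, and the name `simple_tokenize` and the pattern are assumptions made for this sketch. It keeps hashtags and mentions as single tokens and emits every other non-space symbol on its own.

```python
import re

# A word optionally prefixed by '#' or '@', or any single non-word, non-space character.
TOKEN_PATTERN = re.compile(r"[#@]?\w+|[^\w\s]")

def simple_tokenize(text):
    """Hypothetical regex tokenizer; a stand-in, not nltk's tokenizer."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_tokenize("Ugh...@user my laptop crashed again #MondayBlues")
print(tokens)
# ['Ugh', '.', '.', '.', '@user', 'my', 'laptop', 'crashed', 'again', '#MondayBlues']
```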

In [28]:
# D. STEMMING - Reduce words to their root form
# Remove prefixes, suffixes to get to the base or root form of a word
# Rule based, fast, but may not produce actual words
# Porter Stemmer and Snowball Stemmer are popular algorithms
# Snowball (also called Porter2) is a later refinement of the Porter algorithm and also supports other languages

def porter_stemming(text):
    ps = PorterStemmer()
    words = tokenize(text)
    return ' '.join(ps.stem(word) for word in words)

def snowball_stemming(text):
    ss = SnowballStemmer("english")
    words = tokenize(text)
    return ' '.join(ss.stem(word) for word in words)
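
The idea of rule-based suffix stripping can be sketched with a toy stemmer. This is far simpler than Porter or Snowball, and the names `SUFFIXES`, `toy_stem` and `min_stem` are made up for illustration. It shows why stems are fast to compute but are not always real words.

```python
SUFFIXES = ("ing", "ed", "s")  # checked in this order

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix if at least `min_stem` characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

words = ["finished", "reading", "amazing", "books", "crashed"]
stems = [toy_stem(w) for w in words]
print(stems)  # ['finish', 'read', 'amaz', 'book', 'crash']
```

Note that `amaz` is not a dictionary word. Real stemmers apply many more rules, but they share this limitation.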

In [24]:
# E. LEMMATIZATION - Reduce words to their base or dictionary form
# Uses vocabulary and morphological analysis of words
# More accurate than stemming, but slower
# WordNet is a popular lexical database for English
# WordNetLemmatizer treats every word as a noun unless a POS tag is passed (pos='n' by default)
# Words it cannot find in WordNet (e.g. misspellings, slang) are returned unchanged

def lemmatization(text):
    lemmatizer = WordNetLemmatizer()
    words = tokenize(text)
    return ' '.join(lemmatizer.lemmatize(word) for word in words)
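
Lemmatization is a dictionary lookup in which the part of speech matters. The toy sketch below illustrates that idea with a tiny hand-written table. `TOY_LEMMAS` and `toy_lemmatize` are hypothetical and do not use WordNet data. Like `WordNetLemmatizer.lemmatize`, it defaults to the noun POS and falls back to the input word when nothing is found.

```python
# Hypothetical mini-dictionary keyed by (word, part of speech); not WordNet data.
TOY_LEMMAS = {
    ("running", "v"): "run",
    ("ran", "v"): "run",
    ("cars", "n"): "car",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); return the word unchanged when there is no entry."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("running"))           # running (looked up as a noun)
print(toy_lemmatize("running", pos="v"))  # run
print(toy_lemmatize("better", pos="a"))   # good
print(toy_lemmatize("lol"))               # lol (unknown word, returned as-is)
```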

In [None]:
for tweet in tweets:
    print("\nOriginal Tweet:", tweet)

    # A. Case Folding
    tweet = case_folding(tweet)
    print("After Case Folding:", tweet)

    # B. Remove Punctuations
    tweet_out = remove_punctuations(tweet)
    print("After Removing Punctuations:", tweet_out)

    

    # C. Tokenization
    #tokens = tokenize(tweet)
    #print("After Tokenization:")
    #print(tokens)

    # D. Stemming
    porter_stemmed = porter_stemming(tweet_out)
    snowball_stemmed = snowball_stemming(tweet_out)
    print("After Porter Stemming:", porter_stemmed)
    print("After Snowball Stemming:", snowball_stemmed)

    # E. Lemmatization
    lemmatized = lemmatization(tweet_out)
    print("After Lemmatization:", lemmatized)


Original Tweet: Just finished reading an amazing book on AI 🤖📚 #MachineLearning #AI.
After Case Folding: just finished reading an amazing book on ai 🤖📚 #machinelearning #ai.
After Removing Punctuations: just finished reading an amazing book on ai 🤖📚 machinelearning ai
After Porter Stemming: just finish read an amaz book on ai 🤖📚 machinelearn ai
After Snowball Stemming: just finish read an amaz book on ai 🤖📚 machinelearn ai
After Lemmatization: just finished reading an amazing book on ai 🤖📚 machinelearning ai

Original Tweet: What a beautiful morning! ☀️ Feeling super motivated to start the day 💪 #GoodVibes
After Case Folding: what a beautiful morning! ☀️ feeling super motivated to start the day 💪 #goodvibes
After Removing Punctuations: what a beautiful morning ☀️ feeling super motivated to start the day 💪 goodvibes
After Porter Stemming: what a beauti morn ☀️ feel super motiv to start the day 💪 goodvib
After Snowball Stem

---

In [43]:
from nltk.metrics import edit_distance

s1 = "kitten"
s2 = "sitting"

distance = edit_distance(s1, s2)
print(f"\nEdit Distance between '{s1}' and '{s2}': {distance}")


Edit Distance between 'kitten' and 'sitting': 3


In [44]:
import Levenshtein

lev_distance = Levenshtein.distance(s1, s2)
print(f"Levenshtein Distance between '{s1}' and '{s2}': {lev_distance}")

Levenshtein Distance between 'kitten' and 'sitting': 3
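
Both library calls above compute the Levenshtein (edit) distance. For reference, a minimal pure-Python sketch of the standard dynamic-programming approach is shown below. It is an illustration, not the implementation used by either library, and the function name `levenshtein` is made up here.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

The three edits are k→s, e→i, and inserting a final g.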


In [48]:
# !pip install scipy python-Levenshtein
# !python -m spacy download en_core_web_md

In [49]:
import spacy
from scipy.spatial.distance import hamming, cosine
import numpy as np

# Load a model with vectors (en_core_web_md or en_core_web_lg)
nlp = spacy.load("en_core_web_md")   # download if you don't have it

doc1 = nlp("kitten")
doc2 = nlp("sitting")

v1 = doc1.vector
v2 = doc2.vector

# Cosine distance (recommended for dense embeddings)
cos_dist = cosine(v1, v2)
print("Cosine distance:", cos_dist)

# Hamming distance = proportion of differing elements.
# Hamming is designed for binary/boolean vectors, so convert floats to binary using a threshold if needed.
b1 = (v1 > 0).astype(int)
b2 = (v2 > 0).astype(int)
hamming_dist = hamming(b1, b2)
print("Hamming distance (on binarized vectors):", hamming_dist)

# Alternatively, spaCy provides a convenient .similarity() which uses cosine:
print("spaCy similarity (cosine-like):", doc1.similarity(doc2))

Cosine distance: 0.69406223
Hamming distance (on binarized vectors): 0.4633333333333333
spaCy similarity (cosine-like): 0.305937796831131
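
Cosine distance is simply 1 minus cosine similarity, which is why the first and third values above add up to 1. The sketch below reproduces both measures with the standard library on small made-up vectors. The helper names `cosine_similarity` and `hamming_fraction` are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def hamming_fraction(u, v):
    """Fraction of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(u, v)) / len(u)

u = [1.0, 2.0, -2.0, 0.0]
v = [2.0, -1.0, -2.0, 0.0]

cos_sim = cosine_similarity(u, v)   # 4 / (3 * 3)
cos_dist = 1.0 - cos_sim

# Binarize with the same "> 0" threshold used above before taking the Hamming distance.
bu = [int(x > 0) for x in u]        # [1, 1, 0, 0]
bv = [int(x > 0) for x in v]        # [1, 0, 0, 0]
ham = hamming_fraction(bu, bv)      # one of four positions differs

print("cosine similarity:", cos_sim)
print("cosine distance:  ", cos_dist)
print("hamming distance: ", ham)    # 0.25
```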


In [52]:
#import nltk
from collections import Counter
import numpy as np

def bow_vector(text, vocab):
    tokens = nltk.word_tokenize(text.lower())
    c = Counter(tokens)
    return np.array([c[w] for w in vocab], dtype=float)

a = "I love machine learning"
b = "I enjoy machine learning and AI"
vocab = sorted(set(nltk.word_tokenize(a.lower()) + nltk.word_tokenize(b.lower())))
va = bow_vector(a, vocab)
vb = bow_vector(b, vocab)

norm_a, norm_b = np.linalg.norm(va), np.linalg.norm(vb)
# Guard against division by zero when either vector is all zeros
sim = np.dot(va, vb) / (norm_a * norm_b) if norm_a and norm_b else 0.0
print("cosine similarity (BOW):", sim)

cosine similarity (BOW): 0.6123724356957946
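
The value above can be checked by hand. Assuming the tokenizer splits both sentences on whitespace, the sorted vocabulary is [ai, and, enjoy, i, learning, love, machine], so the count vectors are a = (0, 0, 0, 1, 1, 1, 1) and b = (1, 1, 1, 1, 1, 0, 1). They share three words (i, learning, machine), which gives:

$$
\cos(\mathbf{a}, \mathbf{b})
= \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
= \frac{3}{\sqrt{4}\,\sqrt{6}}
= \frac{3}{2\sqrt{6}}
\approx 0.6124
$$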


---

Compare the stemmers with the WordNet lemmatizer. Passing pos='v' (verb) or pos='a' (adjective) to the lemmatizer gives better results than the default noun POS.

In [35]:
import nltk
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
from nltk.stem import RegexpStemmer
from nltk.stem import WordNetLemmatizer

# If you haven't already:
nltk.download('wordnet')
nltk.download('omw-1.4')  # optional, improves WordNet coverage

words = ["running", "runs", "ran", "better", "cars"]
print("Original: ", words)

# Stemmers
ps = PorterStemmer()
ls = LancasterStemmer()
ss = SnowballStemmer("english")
rs = RegexpStemmer('ing$|s$|e$')  # simple example regexp

print("Porter:   ", [ps.stem(w) for w in words])
print("Lancaster:", [ls.stem(w) for w in words])
print("Snowball: ", [ss.stem(w) for w in words])
print("Regexp:   ", [rs.stem(w) for w in words])

# Lemmatizer (WordNet)
wn = WordNetLemmatizer()
print("WordNet (default): ", [wn.lemmatize(w) for w in words])
# Pass POS for better results (verbs)
print("WordNet (as verbs):", [wn.lemmatize(w, pos='v') for w in words])
# Comparative adjective handling (better -> good)
print("WordNet (adj).    :", wn.lemmatize("better", pos='a'))

Original:  ['running', 'runs', 'ran', 'better', 'cars']
Porter:    ['run', 'run', 'ran', 'better', 'car']
Lancaster: ['run', 'run', 'ran', 'bet', 'car']
Snowball:  ['run', 'run', 'ran', 'better', 'car']
Regexp:    ['runn', 'run', 'ran', 'better', 'car']
WordNet (default):  ['running', 'run', 'ran', 'better', 'car']
WordNet (as verbs): ['run', 'run', 'run', 'better', 'cars']
WordNet (adj).    : good


[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/aditikulkarni/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/aditikulkarni/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


----