
In [None]:
import numpy as np

# Custom implementation where Substitution Cost = 2
def weighted_levenshtein(source, target):
    rows = len(source) + 1
    cols = len(target) + 1

    # Initialize the matrix with zeros (nested list comprehension)
    dist = [[0 for _ in range(cols)] for _ in range(rows)]

    # Initialize first row (Insertion costs: 0 -> j)
    for j in range(1, cols):
        dist[0][j] = j

    # Initialize first column (Deletion costs: i -> 0)
    for i in range(1, rows):
        dist[i][0] = i

    # Fill the matrix (Dynamic Programming)
    for i in range(1, rows):
        for j in range(1, cols):
            # Calculate Substitution Cost
            if source[i-1] == target[j-1]:
                cost = 0  # Match
            else:
                # Substitution penalty (weighted). The weight is a manually chosen value;
                # with 2, a substitution costs the same as a deletion plus an insertion.
                cost = 2

            dist[i][j] = min(
                dist[i-1][j] + 1,      # Deletion
                dist[i][j-1] + 1,      # Insertion
                dist[i-1][j-1] + cost  # Substitution / Match
            )

    return dist[rows-1][cols-1], dist


print("\n--- Weighted Levenshtein (Sub=2) ---")

# Example: kitten -> sitting
source_word = "kitten"
target_word = "sitting"

distance, matrix = weighted_levenshtein(source_word, target_word)

print(f"Distance ('{source_word}' -> '{target_word}'): {distance}")
print("Matrix State:")
print(np.matrix(matrix))

# Example usage (the function returns (distance, matrix), so index [0] to get the distance)
#print(f"Distance: {weighted_levenshtein('ROSY', 'POSE')[0]}")
#print(f"Distance: {weighted_levenshtein('KITTEN', 'SITTING')[0]}")
#print(f"Distance: {weighted_levenshtein('Execute', 'Intuition')[0]}")


--- Weighted Levenshtein (Sub=2) ---
Distance ('kitten' -> 'sitting'): 5
Matrix State:
[[0 1 2 3 4 5 6 7]
 [1 2 3 4 5 6 7 8]
 [2 3 2 3 4 5 6 7]
 [3 4 3 2 3 4 5 6]
 [4 5 4 3 2 3 4 5]
 [5 6 5 4 3 4 5 6]
 [6 7 6 5 4 5 4 5]]
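
The bottom-right cell of the matrix gives only the total cost. One way to see which edits produce it is to walk back from that cell to the top-left. The sketch below is an illustration under stated assumptions: `weighted_edit_matrix` and `backtrace` are hypothetical helper names (not part of NLTK), and when several optimal edit scripts exist it returns only one of them, preferring the diagonal move.

```python
def weighted_edit_matrix(source, target, sub_cost=2):
    """Build the DP matrix: insertion/deletion cost 1, substitution cost sub_cost."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i  # delete every source character
    for j in range(1, cols):
        dist[0][j] = j  # insert every target character
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else sub_cost
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist


def backtrace(source, target, dist, sub_cost=2):
    """Walk from the bottom-right cell to the top-left and collect one optimal edit script."""
    i, j = len(source), len(target)
    ops = []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if source[i - 1] == target[j - 1] else sub_cost
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                name = "match" if cost == 0 else "substitute"
                ops.append((name, source[i - 1], target[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", source[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", "", target[j - 1]))
            j -= 1
    return ops[::-1]  # reverse so the script reads left to right


matrix = weighted_edit_matrix("kitten", "sitting")
for op in backtrace("kitten", "sitting", matrix):
    print(op)
```

For `kitten` -> `sitting` this walk finds two substitutions (k -> s, e -> i) and one insertion (g), so the weighted cost is 2 + 2 + 1 = 5.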


## Edit Distance using Library

#### Instead of manually implementing Levenshtein distance, we use a library function. Edit distance is commonly used in spell checkers, DNA sequence matching, and NLP preprocessing.

In [None]:
import nltk
import numpy as np

# ==========================================
# Standard Levenshtein (NLTK)
# ==========================================
# Default NLTK behavior: Substitution Cost = 1
s1 = "kitten"
s2 = "sitting"
dist_nltk = nltk.edit_distance(s1, s2)
print(f"--- Standard NLTK (Sub=1) ---")
print(f"Distance ('{s1}' -> '{s2}'): {dist_nltk}")

--- Standard NLTK (Sub=1) ---
Distance ('kitten' -> 'sitting'): 3


In [None]:
# Calculate distance with Substitution Cost = 2
dist = nltk.edit_distance("kitten", "sitting", substitution_cost=2)

print(f"Edit Distance: {dist}")

Edit Distance: 5
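
The two results (3 and 5) differ only in how much each substitution is charged. The sketch below reproduces both numbers with a small pure-Python function that keeps just two rows of the table in memory. It is a hypothetical helper written for illustration, not NLTK's implementation; only the parameter name `substitution_cost` is borrowed from the call above.

```python
def edit_distance(source, target, substitution_cost=1):
    """Two-row dynamic programming; insertion and deletion cost 1 each."""
    previous = list(range(len(target) + 1))  # row for the empty source prefix
    for i, s_char in enumerate(source, start=1):
        current = [i]  # cost of deleting the first i source characters
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else substitution_cost
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


print(edit_distance("kitten", "sitting", substitution_cost=1))  # 3
print(edit_distance("kitten", "sitting", substitution_cost=2))  # 5
```

An optimal alignment uses two substitutions and one insertion, so the totals are 2·1 + 1 = 3 and 2·2 + 1 = 5.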


## Lab Assignment 2

## [Question](https://docs.google.com/document/d/1wZGgB_LHBeQnnb9V3cr5SijoNwmbKXt9WmbWBU_vpMY/edit?usp=sharing)

Import the necessary libraries, download the required NLTK resources, and load the spaCy English model as specified in the instructions. This sets up the environment for the tasks that follow.



In [None]:
import nltk
import numpy as np
import spacy

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

nlp = spacy.load('en_core_web_sm')

print("Libraries imported and resources loaded successfully.")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Libraries imported and resources loaded successfully.


Normalize the given text: define the text to be processed, then lowercase it, remove punctuation, and clean up whitespace.



In [None]:
import string

# Define the input text for normalization
text = "  This is an example sentence, showcasing N.L.P. normalization!  With some extra spaces.  "

# 1. Lowercasing
normalized_text = text.lower()

# 2. Removing punctuation
normalized_text = normalized_text.translate(str.maketrans('', '', string.punctuation))

# 3. Cleaning whitespace (remove extra spaces and leading/trailing spaces)
normalized_text = ' '.join(normalized_text.split())

print(f"Original Text: '{text}'")
print(f"Normalized Text: '{normalized_text}'")

Original Text: '  This is an example sentence, showcasing N.L.P. normalization!  With some extra spaces.  '
Normalized Text: 'this is an example sentence showcasing nlp normalization with some extra spaces'
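
Deleting punctuation outright fuses the pieces around it, which is why `N.L.P.` became `nlp` above. The sketch below wraps the same three steps in a hypothetical `normalize` helper and adds an assumed `punctuation_replacement` parameter, so the alternative (replacing punctuation with a space) can be compared on a made-up sentence.

```python
import re
import string


def normalize(text, punctuation_replacement=""):
    """Lowercase, replace ASCII punctuation, and collapse runs of whitespace."""
    text = text.lower()
    # Character class built from string.punctuation; re.escape guards -, ], ^ and \
    pattern = f"[{re.escape(string.punctuation)}]"
    text = re.sub(pattern, punctuation_replacement, text)
    return " ".join(text.split())


sample = "  Don't split U.S.A. or e-mail!  "
print(normalize(sample))       # dont split usa or email
print(normalize(sample, " "))  # don t split u s a or e mail
```

Neither variant is right for every task: deletion keeps abbreviations together but merges contractions, while replacement with a space splits both.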


Tokenization is a fundamental step in NLP, breaking down text into smaller units (tokens). This allows for further processing, such as stop word removal, stemming, and lemmatization.

Tokenize `normalized_text` using both NLTK's `word_tokenize` and spaCy's NLP pipeline.



In [None]:
import nltk
import numpy as np
import spacy

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')  # tokenizer tables that recent NLTK releases load for word_tokenize

nlp = spacy.load('en_core_web_sm')

print("Libraries imported and resources loaded successfully.")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


Libraries imported and resources loaded successfully.


In [None]:
import nltk

# NLTK Tokenization
nltk_tokens = nltk.word_tokenize(normalized_text)

# spaCy Tokenization
doc = nlp(normalized_text)
spacy_tokens = [token.text for token in doc]

print(f"\nNLTK Tokens: {nltk_tokens}")
print(f"spaCy Tokens: {spacy_tokens}")


NLTK Tokens: ['this', 'is', 'an', 'example', 'sentence', 'showcasing', 'nlp', 'normalization', 'with', 'some', 'extra', 'spaces']
spaCy Tokens: ['this', 'is', 'an', 'example', 'sentence', 'showcasing', 'nlp', 'normalization', 'with', 'some', 'extra', 'spaces']
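
The two token lists are identical here, most likely because the normalized text contains only lowercase words separated by single spaces. Tokenizers tend to differ on raw text with punctuation. As a rough illustration of what a tokenizer has to decide, the sketch below uses a standard-library regular expression. `simple_tokenize` is a hypothetical helper and does not reproduce the rules of either NLTK or spaCy.

```python
import re


def simple_tokenize(text):
    """Return runs of word characters and single punctuation marks as separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


print(simple_tokenize("Hello, world! It's 3.5 km."))
# ['Hello', ',', 'world', '!', 'It', "'", 's', '3', '.', '5', 'km', '.']
```

This naive rule splits the contraction and the decimal number into pieces, which is the kind of case where library tokenizers apply more careful rules.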


### Stop Word Removal

Stop words are common words (like 'the', 'is', 'and') that often carry little meaning in NLP tasks. Removing them reduces noise and can improve processing efficiency in applications such as text classification or information retrieval. Both NLTK and spaCy provide built-in ways to filter these words.

In [None]:
from nltk.corpus import stopwords

# NLTK Stop Word Removal
nltk_stop_words = set(stopwords.words('english'))
nltk_filtered_tokens = [word for word in nltk_tokens if word not in nltk_stop_words]

# spaCy Stop Word Removal
spacy_filtered_tokens = [token.text for token in doc if not token.is_stop]

print(f"\nNLTK Filtered Tokens (Stop Words Removed): {nltk_filtered_tokens}")
print(f"spaCy Filtered Tokens (Stop Words Removed): {spacy_filtered_tokens}")


NLTK Filtered Tokens (Stop Words Removed): ['example', 'sentence', 'showcasing', 'nlp', 'normalization', 'extra', 'spaces']
spaCy Filtered Tokens (Stop Words Removed): ['example', 'sentence', 'showcasing', 'nlp', 'normalization', 'extra', 'spaces']
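
Stop word filtering is a membership test against a word list, which is why the list is converted to a `set` before filtering: set lookups take constant time on average. The sketch below shows the same idea without any library. `STOP_WORDS` is a tiny illustrative subset chosen for this example, not the NLTK or spaCy list, and `remove_stop_words` is a hypothetical helper.

```python
# Illustrative subset only; real stop word lists are much longer.
STOP_WORDS = {"this", "is", "an", "with", "some", "the", "a", "and", "of"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep tokens whose lowercase form is not in the stop word set."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = ["this", "is", "an", "example", "sentence", "showcasing", "nlp",
          "normalization", "with", "some", "extra", "spaces"]
print(remove_stop_words(tokens))
# ['example', 'sentence', 'showcasing', 'nlp', 'normalization', 'extra', 'spaces']
```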


### Stemming and Lemmatization


Stemming and lemmatization are techniques used to reduce inflected words to their base or root form. Stemming uses heuristic rules to chop off suffixes, while lemmatization uses vocabulary and morphological analysis to return the base form (lemma) of a word, which is often more accurate than stemming. These processes are crucial for normalizing text and reducing the vocabulary size, which can improve the performance of NLP models.
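
To make the contrast concrete, the toy sketch below puts a rule-based suffix stripper next to a dictionary lookup. Both are deliberately simplistic stand-ins: `toy_stem` is not the Porter algorithm, and `LEMMAS` is a hypothetical three-entry table rather than a real vocabulary such as WordNet.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order


def toy_stem(word):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table standing in for a real vocabulary.
LEMMAS = {"studies": "study", "running": "run", "better": "good"}


def toy_lemmatize(word):
    """Return the dictionary form if the word is known, otherwise the word itself."""
    return LEMMAS.get(word, word)


for word in ["studies", "running", "better"]:
    print(word, toy_stem(word), toy_lemmatize(word))
# studies studi study
# running runn run
# better better good
```

The stemmer produces non-words such as `studi` and `runn` and cannot relate `better` to `good`, while the lookup returns real words but only for entries it knows.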

In [None]:
from nltk.stem import PorterStemmer

# NLTK Stemming (Porter Stemmer)
porter_stemmer = PorterStemmer()
nltk_stemmed_tokens = [porter_stemmer.stem(word) for word in nltk_filtered_tokens]

print(f"\nNLTK Stemmed Tokens (Porter Stemmer): {nltk_stemmed_tokens}")


NLTK Stemmed Tokens (Porter Stemmer): ['exampl', 'sentenc', 'showcas', 'nlp', 'normal', 'extra', 'space']


In [None]:
from nltk.stem import WordNetLemmatizer

# NLTK Lemmatization (WordNet Lemmatizer)
wordnet_lemmatizer = WordNetLemmatizer()
nltk_lemmatized_tokens = [wordnet_lemmatizer.lemmatize(word) for word in nltk_filtered_tokens]

print(f"\nNLTK Lemmatized Tokens (WordNet Lemmatizer): {nltk_lemmatized_tokens}")


NLTK Lemmatized Tokens (WordNet Lemmatizer): ['example', 'sentence', 'showcasing', 'nlp', 'normalization', 'extra', 'space']


Apply spaCy lemmatization to the previously processed document (`doc`), skipping stop words, and print the lemmatized tokens.



In [None]:
spacy_lemmatized_tokens = [token.lemma_ for token in doc if not token.is_stop]

print(f"spaCy Lemmatized Tokens: {spacy_lemmatized_tokens}")

spaCy Lemmatized Tokens: ['example', 'sentence', 'showcasing', 'nlp', 'normalization', 'extra', 'space']


### Edit Distance Calculation

Identify a misspelled word and its correct form, then calculate the edit distance between them using NLTK's `edit_distance` function with substitution costs of 1 and 2.


In [None]:
# Identify a misspelled word and its correct form
misspelled_word = "normalisation"
correct_word = "normalization"

# Calculate edit distance with substitution_cost = 1 (default)
distance_sub_1 = nltk.edit_distance(misspelled_word, correct_word, substitution_cost=1)

# Calculate edit distance with substitution_cost = 2
distance_sub_2 = nltk.edit_distance(misspelled_word, correct_word, substitution_cost=2)

print(f"\nMisspelled Word: '{misspelled_word}'")
print(f"Correct Word: '{correct_word}'")
print(f"Edit Distance (substitution_cost=1): {distance_sub_1}")
print(f"Edit Distance (substitution_cost=2): {distance_sub_2}")


Misspelled Word: 'normalisation'
Correct Word: 'normalization'
Edit Distance (substitution_cost=1): 1
Edit Distance (substitution_cost=2): 2
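
The two words differ by a single substitution (s -> z), so the distance equals the substitution cost. A spell checker can use the same measure to rank candidate corrections. The sketch below is a minimal illustration under stated assumptions: `edit_distance` is a memoized recursive stand-in written for this example (not NLTK's function), and `suggest`, `max_distance`, and the small `vocabulary` list are all hypothetical.

```python
from functools import lru_cache


def edit_distance(source, target, substitution_cost=1):
    """Memoized recursive edit distance; insertion and deletion cost 1 each."""
    @lru_cache(maxsize=None)
    def d(i, j):
        if i == 0:
            return j  # insert the remaining target characters
        if j == 0:
            return i  # delete the remaining source characters
        cost = 0 if source[i - 1] == target[j - 1] else substitution_cost
        return min(d(i - 1, j) + 1, d(i, j - 1) + 1, d(i - 1, j - 1) + cost)

    return d(len(source), len(target))


def suggest(word, vocabulary, max_distance=2):
    """Return vocabulary entries within max_distance of word, closest first."""
    scored = sorted((edit_distance(word, candidate), candidate) for candidate in vocabulary)
    return [candidate for distance, candidate in scored if distance <= max_distance]


vocabulary = ["normalization", "normalizations", "tokenization", "lemmatization", "normal"]
print(suggest("normalisation", vocabulary))
# ['normalization', 'normalizations']
```

The recursion is fine for single words; for long strings an iterative table avoids deep call stacks.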
