
# NLP Lab – Session 2

Topics:
- Edit Distance (Library-based)
- Text Preprocessing & Normalization
- Tokenization
- Lowercasing, Stemming, Lemmatization
- Stop-word Removal
- NLP Tools: NLTK & spaCy (Basics)

This lab follows **Module 1 – Foundations of NLP**.

## 2. Text Preprocessing & Normalization

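Text preprocessing converts raw text into a clean, consistent form that downstream models can use. Typical steps, covered in the sections that follow, are tokenization, lowercasing, stop-word removal, and stemming or lemmatization.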

### Sample Text

In [None]:
text = "Natural Language Processing (NLP) is AMAZING!!! It helps machines understand human language."

## 3. Tokenization

Tokenization splits text into smaller units (tokens).

In [None]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [None]:
from nltk.tokenize import word_tokenize

tokens = word_tokenize(text)
print(tokens)

['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'AMAZING', '!', '!', '!', 'It', 'helps', 'machines', 'understand', 'human', 'language', '.']
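
To make the idea of a token concrete, here is a minimal regex-based sketch that uses only the standard library. It is not how `word_tokenize` works internally; the pattern and the name `simple_tokenize` are assumptions made for illustration. Runs of word characters become tokens, and every other non-space character becomes a token of its own.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    # \w+ matches a run of letters, digits, or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Natural Language Processing (NLP) is AMAZING!!! It helps machines understand human language."
simple_tokens = simple_tokenize(sample)
print(simple_tokens)
```

On this sentence the sketch happens to produce the same tokens as the output above. It would, however, split a contraction such as "don't" into `don`, `'`, and `t`, which is one reason to prefer a dedicated tokenizer.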


## 4. Text Normalization – Lowercasing

Lowercasing converts every token to lowercase so that variants such as `Language` and `language` are treated as the same word.

In [None]:
lower_tokens = [word.lower() for word in tokens]
print(lower_tokens)

['natural', 'language', 'processing', '(', 'nlp', ')', 'is', 'amazing', '!', '!', '!', 'it', 'helps', 'machines', 'understand', 'human', 'language', '.']
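
One way to see the effect of lowercasing is to count tokens before and after. The sketch below applies `collections.Counter` to a hand-typed copy of the word tokens from the sample sentence; the variable names are illustrative assumptions.

```python
from collections import Counter

# Word tokens from the sample sentence, typed out so the block runs on its own.
word_tokens = ["Natural", "Language", "Processing", "NLP", "is", "AMAZING",
               "It", "helps", "machines", "understand", "human", "language"]

counts_raw = Counter(word_tokens)
counts_lower = Counter(w.lower() for w in word_tokens)

# "Language" and "language" are separate keys until the tokens are lowercased.
print(counts_raw["Language"], counts_raw["language"])
print(counts_lower["language"])
```

Lowercasing also discards information: the acronym `NLP` becomes `nlp`, and a proper noun such as "Apple" can no longer be told apart from the common noun.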


## 5. Stop-word Removal

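Stop-words are common words (is, the, and) that add little meaning on their own. The filter in the next cell also drops punctuation, because `word.isalpha()` keeps only purely alphabetic tokens.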

In [None]:
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in lower_tokens if word.isalpha() and word not in stop_words]
print(filtered_tokens)

['natural', 'language', 'processing', 'nlp', 'amazing', 'helps', 'machines', 'understand', 'human', 'language']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
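
The `word.isalpha()` check in the cell above removes punctuation, but it also removes any token that contains a digit or a hyphen. The sketch below illustrates this with a tiny hand-written stop list (`toy_stop_words` is a stand-in assumption, not NLTK's list) so that it runs without any downloads.

```python
# A tiny illustrative stop list; NLTK's English list is much longer.
toy_stop_words = {"is", "it", "the", "a", "for"}

demo_tokens = ["the", "covid-19", "vaccine", "is", "a",
               "state-of-the-art", "result", "for", "2021", "!"]

# Same filter shape as in the cell above: alphabetic tokens that are not stop-words.
strict = [w for w in demo_tokens if w.isalpha() and w not in toy_stop_words]

# A looser variant: keep any token that contains at least one letter or digit.
loose = [w for w in demo_tokens
         if any(ch.isalnum() for ch in w) and w not in toy_stop_words]

print(strict)  # tokens with digits or hyphens are gone
print(loose)   # only punctuation and stop-words are gone
```

Which variant is appropriate depends on the task; for text where numbers or hyphenated terms matter, the strict filter may discard too much.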


## 6. Stemming

Stemming reduces words to a root form by stripping suffixes; the resulting stem may not be a valid word.

In [None]:
from nltk.stem import PorterStemmer

ps = PorterStemmer()
stemmed_words = [ps.stem(word) for word in filtered_tokens]
print(stemmed_words)

['natur', 'languag', 'process', 'nlp', 'amaz', 'help', 'machin', 'understand', 'human', 'languag']
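
To show why a stem need not be a dictionary word, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem; the suffix list and the `toy_stem` name are assumptions chosen for illustration.

```python
def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, provided enough of the word remains."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["amazing", "machines", "helps", "processing", "understand"]:
    print(w, "->", toy_stem(w))
```

Blind suffix removal yields `amaz` and `machin`, as in the Porter output above. It also over-stems: `toy_stem("string")` returns `str`. Real stemmers add many more conditions to limit such errors.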


## 7. Lemmatization

Lemmatization converts words to their dictionary form (lemma) using vocabulary knowledge, here the WordNet database.

In [None]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_tokens]
print(lemmatized_words)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['natural', 'language', 'processing', 'nlp', 'amazing', 'help', 'machine', 'understand', 'human', 'language']
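
The output above leaves `processing` and `amazing` unchanged because `WordNetLemmatizer.lemmatize` treats every word as a noun unless a `pos` argument is given. The sketch below is a toy dictionary lookup, not WordNet, that shows why the part of speech matters; the `TOY_LEMMAS` table and the `toy_lemmatize` function are hypothetical.

```python
# Hypothetical lookup table keyed by (word, part of speech); a real lemmatizer
# consults a lexical database such as WordNet instead.
TOY_LEMMAS = {
    ("machines", "n"): "machine",
    ("helps", "v"): "help",
    ("processing", "v"): "process",
    ("amazing", "v"): "amaze",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Return the lemma for (word, pos), or the word itself if it is unknown."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize("processing"))           # looked up as a noun: unchanged
print(toy_lemmatize("processing", pos="v"))  # looked up as a verb
print(toy_lemmatize("better", pos="a"))      # irregular form needs a dictionary entry
```

With NLTK itself, the analogous call is `lemmatizer.lemmatize(word, pos='v')`.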


## 8. spaCy Basics

spaCy is an industrial-strength NLP library providing fast tokenization and linguistic features.

In [None]:
!pip install spacy



In [None]:
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
print(doc)
print('\n -----------------')
for token in doc:
    print(token.text, '→', token.lemma_, '| Stopword:', token.is_stop)

Natural Language Processing (NLP) is AMAZING!!! It helps machines understand human language.

 -----------------
Natural → Natural | Stopword: False
Language → Language | Stopword: False
Processing → Processing | Stopword: False
( → ( | Stopword: False
NLP → NLP | Stopword: False
) → ) | Stopword: False
is → be | Stopword: True
AMAZING → amazing | Stopword: False
! → ! | Stopword: False
! → ! | Stopword: False
! → ! | Stopword: False
It → it | Stopword: True
helps → help | Stopword: False
machines → machine | Stopword: False
understand → understand | Stopword: False
human → human | Stopword: False
language → language | Stopword: False
. → . | Stopword: False


### What is `spaCy` doing here?

In this notebook, `spaCy` is demonstrating its capabilities by processing the `text` string. Specifically, after loading the `en_core_web_sm` model:

1.  **`nlp = spacy.load('en_core_web_sm')`**: Initializes the spaCy processing pipeline using the downloaded English small model.
2.  **`doc = nlp(text)`**: This is the core step where spaCy processes the input `text`. It tokenizes the text and then runs it through the components of the `en_core_web_sm` pipeline (including the tagger, parser, lemmatizer, and named entity recognizer). The result is a `Doc` object, which is a container for accessing all the processed linguistic annotations.
3.  **`for token in doc:`**: We then iterate through each `token` in the `doc` object. Each `token` object in spaCy provides easy access to various linguistic features.
4.  **`print(token.text, '→', token.lemma_, '| Stopword:', token.is_stop)`**: This line prints three key pieces of information for each token:
    *   **`token.text`**: The original text of the token.
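    *   **`token.lemma_`**: The base or dictionary form (lemma) of the token, comparable to lemmatization in NLTK. spaCy's lemmatizer takes the part-of-speech tag predicted by the model into account, which is why `is` becomes `be` here without the part of speech being supplied by hand.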
    *   **`token.is_stop`**: A boolean indicating whether the token is identified as a stop-word by spaCy's internal stop-word list. This is similar to NLTK's stop-word removal, but integrated directly into spaCy's processing pipeline.

In essence, spaCy performs several NLP tasks (tokenization, lemmatization, and stop-word identification) in a single call to `nlp(text)`. Because its components can use context such as part-of-speech tags, its results are often more accurate than those of approaches that look at each word in isolation.

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

text = "Apple Inc. is looking at buying U.K. startup for $1 billion!"
doc = nlp(text)

print(f"{'Token':<12} {'POS':<6} {'Lemma':<10} {'Entity'}")
for token in doc:
    print(f"{token.text:<12} {token.pos_:<6} {token.lemma_:<10} {token.ent_type_}")


Token        POS    Lemma      Entity
Apple        PROPN  Apple      ORG
Inc.         PROPN  Inc.       ORG
is           AUX    be         
looking      VERB   look       
at           ADP    at         
buying       VERB   buy        
U.K.         PROPN  U.K.       GPE
startup      VERB   startup    
for          ADP    for        
$            SYM    $          MONEY
1            NUM    1          MONEY
billion      NUM    billion    MONEY
!            PUNCT  !          


## 10. Explanations of NLP Concepts and Libraries

### NLTK `punkt_tab`

`nltk.download('punkt_tab')` downloads the data for NLTK's Punkt tokenizer; `punkt_tab` is the tabular (non-pickle) packaging of that data used by recent NLTK releases. Punkt is a pre-trained sentence tokenizer, available for several languages, that decides where sentences end, for example by recognizing abbreviations. In this notebook, `word_tokenize` first uses it to split the input `text` into sentences and then applies a separate rule-based word tokenizer to break each sentence into individual words and punctuation marks.

### NLTK `stopwords`

`nltk.download('stopwords')` downloads lists of common words, for several languages, that are often filtered out of text before processing. These words (like 'the', 'is', 'and') are called stop-words because they usually carry little meaning on their own and can add noise. Removing them shrinks the vocabulary and can help many models, although for some tasks (such as sentiment analysis, where a word like 'not' matters) it can hurt.

### `set(stopwords.words('english'))`

This line of code retrieves the English stop-words from the downloaded NLTK `stopwords` corpus and converts them into a `set`. A `set` is used for efficient lookup. When checking if a word is a stop-word (`word not in stop_words`), using a set is much faster than iterating through a list, especially for large lists of stop-words.
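
The difference can be measured with the standard-library `timeit` module. The sketch below builds a synthetic vocabulary (the `word0`, `word1`, ... entries are made up) rather than loading NLTK's list, so the timings are only indicative and will vary by machine.

```python
import timeit

# Synthetic stand-in for a stop-word list; the entries are made up.
words_list = [f"word{i}" for i in range(1000)]
words_set = set(words_list)

probe = "word999"  # last element: the worst case for a list scan

# Both containers give the same answer to a membership test...
print(probe in words_list, probe in words_set)

# ...but the list is scanned element by element, while the set uses hashing.
t_list = timeit.timeit(lambda: probe in words_list, number=2000)
t_set = timeit.timeit(lambda: probe in words_set, number=2000)
print(f"list: {t_list:.4f}s  set: {t_set:.4f}s")
```

For a list of only a couple of hundred stop-words the absolute saving per lookup is small, but the check runs once per token, so it adds up on large corpora.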

### `PorterStemmer`

`PorterStemmer` is NLTK's implementation of the Porter algorithm for **stemming**. Stemming is a heuristic process that chops off suffixes from words to reduce them to a root form. For example, the Porter stemmer reduces 'running' and 'runs' to 'run', while 'runner' is left unchanged, which shows that related words do not always share a stem. The key characteristic of stemming is that the resulting 'stem' may not be a valid dictionary word (e.g., 'amaz' from 'amazing'). It's generally faster but less accurate than lemmatization.

### NLTK `wordnet`

`nltk.download('wordnet')` downloads WordNet, which is a large lexical database of English. It groups English words into sets of synonyms called synsets, provides short definitions, and records various semantic relations between these synsets. In NLTK, WordNet is primarily used by the `WordNetLemmatizer` to perform lemmatization, allowing it to find the base or dictionary form of a word.

### `en_core_web_sm`

`en_core_web_sm` is a small English language model provided by spaCy. When you run `!python -m spacy download en_core_web_sm`, you're downloading this pre-trained model. The 'sm' denotes 'small', meaning it's a lightweight model. On top of spaCy's tokenizer, its pipeline includes components for part-of-speech tagging, lemmatization, dependency parsing, and named entity recognition. It's a common choice for quick experiments or when resources are limited.

## 11. Summary

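- Edit distance measures string similarity
- Tokenization splits text into tokens (words and punctuation marks)
- Normalization (lowercasing, stop-word removal) improves consistency
- Stemming strips suffixes and may yield non-words; lemmatization returns dictionary forms
- NLTK and spaCy are core NLP tools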