
# Intro to NLP
NLP (Natural Language Processing) = teaching computers to understand and use human language.

It is the science of enabling computers to read, understand, and generate human language. It transforms raw text or speech into structured data so that machines can perform meaningful tasks.

Examples of NLP tasks:

### Chatbots

Input: "What’s my next meeting?"

Output: "Your team sync is at 10 AM."

Helps machines talk like humans.

### Spell-check & Auto-correct

Input: "I can’t wrte this"

Output: "Did you mean ‘write’?"

Fixes typing mistakes.

### Sentiment Analysis

Input: "I absolutely loved to watch football."

Output: Positive

Finds feelings (positive, negative, neutral) in text.

## Setup (install required packages)

In [1]:
# Run this cell first in Colab to install packages that may not be preinstalled.
# The list is kept small and common: nltk, spacy, scikit-learn, sentence-transformers (optional), transformers, and textblob.
# pip only installs the libraries; models and corpora are downloaded separately the first time they are needed.


!pip install --quiet nltk spacy scikit-learn sentence-transformers transformers textblob

## NLP Workflow Steps

These steps clean the text so machine learning models can focus on important words instead of noise.

### Lexical Analysis

This step processes raw text into tokens and standardized forms. Operations include:

**1. Tokenization**

Breaking text into smaller pieces (words or sentences).

Example:
"I love NLP." → ["I", "love", "NLP", "."]

**2. Stopword Removal**

Removing common words that don’t add much meaning.

Example:
"I love NLP" → after removing "I" → ["love", "NLP"]


**3. Stemming & Lemmatization**

Stemming: Cuts words down to a root form (which may not be a real word).
Example: "playing" → "play", "studies" → "studi"

Lemmatization: Converts a word to its base dictionary form.
Example: "playing" → "play", "studies" → "study"


**4. Normalization**

Making text clean and consistent.

Includes:

Lowercasing: "NLP" → "nlp"

Removing punctuation: "Hello!!!" → "Hello"

Removing URLs: "Check http://example.com" → "Check"

In [10]:
# nltk = Natural Language Toolkit, a popular NLP library
import nltk

# re = Regular expressions (for pattern matching, e.g. removing URLs or special text)
import re

# unicodedata = Helps handle different Unicode characters (e.g., accented letters, emojis)
import unicodedata

# string = Python’s built-in string functions (e.g., punctuation removal)
import string
# Download the tokenizer data used by word_tokenize
nltk.download("punkt")       # for tokenization
nltk.download("punkt_tab")   # new requirement in NLTK 3.9+


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [3]:
# word_tokenize = splits sentences into words (Tokenization)
from nltk.tokenize import word_tokenize

# stopwords = list of common words (like "the", "is", "and") that we usually remove
from nltk.corpus import stopwords

# PorterStemmer = algorithm for stemming (reducing words to root form, e.g., "running" → "run")
from nltk.stem import PorterStemmer

# WordNetLemmatizer = tool for lemmatization (reduces words to their dictionary form, e.g., "better" → "good" when pos="a")
from nltk.stem import WordNetLemmatizer

In [4]:
# 'punkt' = tokenizer models (used by word_tokenize to split sentences/words)
nltk.download("punkt")

# 'stopwords' = list of common words (like "the", "is", "and") for stopword removal
nltk.download("stopwords")

# 'wordnet' = lexical database for English (used by WordNetLemmatizer for lemmatization)
nltk.download("wordnet")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [7]:
sample = "I loved the new hotel downtown! The rooms were amazing . Check out their website: https://hoteldowntown.com"

In [5]:
def normalize_text(text):
    # 1. Convert all text to lowercase (consistency)
    text = text.lower()

    # 2. Strip accents and drop any remaining non-ASCII characters (e.g., café → cafe)
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()

    # 3. Remove URLs (anything starting with http or www)
    text = re.sub(r"http\S+|www\.\S+", "", text)

    # 4. Remove mentions (e.g., @username in social media)
    text = re.sub(r"@\w+", "", text)

    # 5. Remove hashtags (e.g., #NLP → removed)
    text = re.sub(r"#\w+", "", text)

    # 6. Remove punctuation (replace with space to avoid word merging)
    text = re.sub(r"[{}]".format(re.escape(string.punctuation)), " ", text)

    # 7. Remove extra spaces (convert multiple spaces into one)
    text = re.sub(r"\s+", " ", text).strip()

    return text
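
The order of steps 3 and 6 matters: if punctuation is stripped first, a URL is broken into pieces and the URL pattern no longer matches all of it. The sketch below reuses the same two regular expressions to show the difference; the function names and the sample string are hypothetical.

```python
import re
import string

URL_RE = re.compile(r"http\S+|www\.\S+")
PUNCT_RE = re.compile(r"[{}]".format(re.escape(string.punctuation)))

def squeeze(text):
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()

def urls_then_punct(text):
    # Same order as normalize_text: the URL is still intact when it is matched.
    return squeeze(PUNCT_RE.sub(" ", URL_RE.sub("", text)))

def punct_then_urls(text):
    # Reversed order: "://" and "." are already gone, so only "https" is matched.
    return squeeze(URL_RE.sub("", PUNCT_RE.sub(" ", text)))

text = "see https://example.com now!"
print(urls_then_punct(text))  # see now
print(punct_then_urls(text))  # see example com now
```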

## Tokenization:
Input:

norm = "i loved the new hotel downtown the rooms were amazing check out their website"


Output:

Tokens: ['i', 'loved', 'the', 'new', 'hotel', 'downtown', 'the', 'rooms', 'were', 'amazing', 'check', 'out', 'their', 'website']


In [8]:
# Apply normalization function to the sample text
norm = normalize_text(sample)

# Print the result after normalization
print("Normalized:", norm)

Normalized: i loved the new hotel downtown the rooms were amazing check out their website


In [11]:
# Split the normalized text into individual words (tokens)
tokens = word_tokenize(norm)

# Print the list of tokens
print("Tokens:", tokens)

Tokens: ['i', 'loved', 'the', 'new', 'hotel', 'downtown', 'the', 'rooms', 'were', 'amazing', 'check', 'out', 'their', 'website']


## Stopword Removal:

Input tokens:

['i', 'loved', 'the', 'new', 'hotel', 'downtown', 'the', 'rooms', 'were', 'amazing', 'check', 'out', 'their', 'website']


After stopword removal:

['loved', 'new', 'hotel', 'downtown', 'rooms', 'amazing', 'check', 'website']

In [12]:
# Get the list of English stopwords from NLTK (e.g., "the", "is", "and")
stop_words = set(stopwords.words("english"))

In [13]:
# Remove stopwords: keep only words not in stop_words
filtered = [w for w in tokens if w not in stop_words]


In [14]:
# Print the tokens after removing stopwords
print("Without Stopwords:", filtered)


Without Stopwords: ['loved', 'new', 'hotel', 'downtown', 'rooms', 'amazing', 'check', 'website']


## Stemming vs Lemmatization

Stemming: Shortens words to a “root” form by chopping off endings.
It uses simple rules such as removing -ing, -ed, or -s, so the result may not be a real word.

Example:

"running" → "run"

"cats" → "cat"

"amazing" → "amaz" (not a real word)


Lemmatization: Converts words to their dictionary/base form (called the lemma), so it produces valid words.

It uses a vocabulary/dictionary together with the word's part of speech (noun, verb, adjective, etc.). NLTK's WordNetLemmatizer treats every word as a noun unless a part of speech is passed explicitly.

Example:

"running" → "run" (if POS is verb)

"cats" → "cat"

"better" → "good" (if POS is adjective)

In [15]:
# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

### Stemming:
Stemming chops words roughly to their root:

"loved" → "love"

"rooms" → "room"

"amazing" → "amaz" (not a real word)

"website" → "websit"

In [29]:
# Apply stemming
stems = [stemmer.stem(w) for w in tokens]


### Lemmatization:
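Lemmatization looks each word up in a vocabulary and returns a valid dictionary word. Called without a part of speech, as in the cell below, the lemmatizer treats every token as a noun:

"rooms" → "room"

"amazing" stays "amazing"

"loved" stays "loved" (it is not a noun; with the verb POS it would become "love")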

In [30]:
# Apply lemmatization
lemmas = [lemmatizer.lemmatize(w) for w in tokens]

In [31]:
# Show results
print("Original Words:     ", tokens)
print("After Stemming:     ", stems)
print("After Lemmatization:", lemmas)

Original Words:      ['i', 'loved', 'the', 'new', 'hotel', 'downtown', 'the', 'rooms', 'were', 'amazing', 'check', 'out', 'their', 'website']
After Stemming:      ['i', 'love', 'the', 'new', 'hotel', 'downtown', 'the', 'room', 'were', 'amaz', 'check', 'out', 'their', 'websit']
After Lemmatization: ['i', 'loved', 'the', 'new', 'hotel', 'downtown', 'the', 'room', 'were', 'amazing', 'check', 'out', 'their', 'website']
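
The three lists are easier to compare when only the altered tokens are shown. The sketch below copies the lists from the output above; `changed` is a hypothetical helper name.

```python
tokens = ['i', 'loved', 'the', 'new', 'hotel', 'downtown', 'the', 'rooms',
          'were', 'amazing', 'check', 'out', 'their', 'website']
stems = ['i', 'love', 'the', 'new', 'hotel', 'downtown', 'the', 'room',
         'were', 'amaz', 'check', 'out', 'their', 'websit']
lemmas = ['i', 'loved', 'the', 'new', 'hotel', 'downtown', 'the', 'room',
          'were', 'amazing', 'check', 'out', 'their', 'website']

def changed(before, after):
    # Pair tokens position by position and keep only the ones that were altered.
    return [(b, a) for b, a in zip(before, after) if b != a]

print(changed(tokens, stems))
# [('loved', 'love'), ('rooms', 'room'), ('amazing', 'amaz'), ('website', 'websit')]
print(changed(tokens, lemmas))
# [('rooms', 'room')]
```

Stemming altered four tokens, two of which are no longer real words, while noun-only lemmatization altered just one.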
