

# NLP Pipeline

This notebook cleans ten free-text product reviews and turns them into a binary bag-of-words (one-hot) matrix, using lightweight NLP preprocessing with NLTK and scikit-learn.

**Pipeline:** ten reviews as a corpus -> data cleaning -> tokenization -> stopword removal -> stemming and lemmatization -> POS tagging -> NER -> one-hot encoding (vectorization).



In [None]:
# Review Preprocessing Assignment

# Download required NLTK resources
import nltk
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("maxent_ne_chunker_tab")
nltk.download("words")
nltk.download("omw-1.4")
nltk.download("averaged_perceptron_tagger_eng")
nltk.download("punkt_tab")

from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, SnowballStemmer
import re


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker_tab.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


In [None]:
# Step 1: Input Review List
import pandas as pd

reviews = ["1.  This product exceeded my expectations. The quality is outstanding and it's very easy to use. I highly recommend it to anyone looking for a reliable solution.",
"2.  I was a bit skeptical at first, but this service delivered on all its promises. The support team was incredibly helpful and the results were exactly what I needed.",
"3.  The user interface is a bit clunky, and I found several bugs during my trial. It has potential, but it needs a lot of work before I'd consider it a viable option.",
"4.  A total waste of money. The item broke after just a week of light use. The company's customer service was unresponsive and unhelpful. Avoid this at all costs.",
"5.  I've been a loyal customer for years, and this company continues to impress me. Their latest update adds some fantastic new features that make my work so much easier.",
"6.  It's an okay product, but nothing special. It does what it says on the tin, but there are cheaper alternatives that offer similar functionality.",
"7.  Absolutely fantastic! The design is sleek and modern, and the performance is top-notch. I've been using it every day since I got it and I couldn't be happier.",
"8.  I'm torn on this one. The concept is brilliant, but the execution is flawed. It's plagued by performance issues and the learning curve is steeper than it should be.",
"9.  This is a must-have for anyone in my profession. It's a game-changer. The time I've saved using this tool is invaluable.",
"10. The delivery was fast, but the packaging was damaged and the product inside had a few scratches. The item itself works fine, but the overall experience was disappointing."]

# Convert the list into a DataFrame for easy manipulation
df = pd.DataFrame({'review': reviews})

# Display the dataset
df.head(12)

,review
0,1. This product exceeded my expectations. The...
1,"2. I was a bit skeptical at first, but this s..."
2,"3. The user interface is a bit clunky, and I ..."
3,4. A total waste of money. The item broke aft...
4,"5. I've been a loyal customer for years, and ..."
5,"6. It's an okay product, but nothing special...."
6,7. Absolutely fantastic! The design is sleek ...
7,8. I'm torn on this one. The concept is brill...
8,9. This is a must-have for anyone in my profe...
9,"10. The delivery was fast, but the packaging w..."


In [None]:
# Step 2: Preprocessing and stopword removal (lowercase, strip punctuation, drop stopwords; digits are kept)

stop_words = set(stopwords.words('english'))  # Set of English stopwords
negation_words = {"not", "no", "never"}   # words to keep
custom_stopwords = stop_words - negation_words

# Step 2b: Define preprocessing function
import re
def preprocess(text):
    text = text.lower()  # Convert text to lowercase
    text = re.sub(r"[-—]", " ", text)  # Replace dash characters with space
    text = re.sub(r"[^a-z0-9\s]", "", text)  # Remove punctuation (keep letters/digits/spaces); apostrophes go too, so "i've" becomes "ive"
    text = re.sub(r"\s+", " ", text).strip()  # Remove extra spaces
    tokens = [t for t in text.split() if t not in custom_stopwords]  # Remove stopwords
    return " ".join(tokens), tokens  # Return cleaned string and token list

# Apply preprocessing to all reviews
df['clean'], df['tokens'] = zip(*df['review'].apply(preprocess))

# Display cleaned reviews and tokens
df[['review','clean','tokens']].head(12)

,review,clean,tokens
0,1. This product exceeded my expectations. The...,1 product exceeded expectations quality outsta...,"[1, product, exceeded, expectations, quality, ..."
1,"2. I was a bit skeptical at first, but this s...",2 bit skeptical first service delivered promis...,"[2, bit, skeptical, first, service, delivered,..."
2,"3. The user interface is a bit clunky, and I ...",3 user interface bit clunky found several bugs...,"[3, user, interface, bit, clunky, found, sever..."
3,4. A total waste of money. The item broke aft...,4 total waste money item broke week light use ...,"[4, total, waste, money, item, broke, week, li..."
4,"5. I've been a loyal customer for years, and ...",5 ive loyal customer years company continues i...,"[5, ive, loyal, customer, years, company, cont..."
5,"6. It's an okay product, but nothing special....",6 okay product nothing special says tin cheape...,"[6, okay, product, nothing, special, says, tin..."
6,7. Absolutely fantastic! The design is sleek ...,7 absolutely fantastic design sleek modern per...,"[7, absolutely, fantastic, design, sleek, mode..."
7,8. I'm torn on this one. The concept is brill...,8 im torn one concept brilliant execution flaw...,"[8, im, torn, one, concept, brilliant, executi..."
8,9. This is a must-have for anyone in my profe...,9 must anyone profession game changer time ive...,"[9, must, anyone, profession, game, changer, t..."
9,"10. The delivery was fast, but the packaging w...",10 delivery fast packaging damaged product ins...,"[10, delivery, fast, packaging, damaged, produ..."
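The cleaned output above still carries the leading list numbers (`1`, `2`, ...) as tokens, and contractions lose their apostrophes (`ive`, `im`) and so slip past the stopword filter. One possible way to handle both is sketched below. This is not the notebook's method: `preprocess_v2`, the `CONTRACTIONS` map, and the tiny `STOPWORDS` set are hypothetical names introduced for illustration, and the stopword set only stands in for NLTK's English list.

```python
import re

# Tiny illustrative stopword set (assumption: stands in for NLTK's English list).
STOPWORDS = {"i", "a", "the", "is", "it", "and", "to", "my", "this",
             "have", "been", "be", "for", "very"}

# Hypothetical, deliberately small suffix map. Irregular forms such as
# "can't" or "won't" are NOT handled correctly by this simple approach.
CONTRACTIONS = {"'ve": " have", "'m": " am", "'s": "", "n't": " not", "'d": " would"}

def preprocess_v2(text):
    """Lowercase, drop leading list numbering, expand contractions,
    strip punctuation, and remove stopwords. Returns a token list."""
    text = text.lower()
    text = re.sub(r"^\s*\d+\.\s*", "", text)      # drop a leading "5. " style prefix
    for suffix, expansion in CONTRACTIONS.items():
        text = text.replace(suffix, expansion)     # "i've" -> "i have"
    text = re.sub(r"[-\u2014]", " ", text)         # hyphen / long dash -> space
    text = re.sub(r"[^a-z0-9\s]", "", text)        # remove remaining punctuation
    return [t for t in text.split() if t not in STOPWORDS]

sample_a = preprocess_v2("5.  I've been a loyal customer for years.")
sample_b = preprocess_v2("I couldn't be happier.")
print(sample_a)
print(sample_b)
```

Expanding `n't` to `not` before punctuation is stripped also keeps the negation visible, which is the reason the notebook excludes `not`, `no`, and `never` from its stopword set.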


In [None]:
# Step 3: Tokenization (already applied in df['tokens'])

print("Sample Tokens:", df['tokens'][0])


Sample Tokens: ['1', 'product', 'exceeded', 'expectations', 'quality', 'outstanding', 'easy', 'use', 'highly', 'recommend', 'anyone', 'looking', 'reliable', 'solution']


In [None]:
# Step 4: Stemming and lemmatization
# get_wordnet_pos() maps Penn Treebank POS tags to the WordNet POS constants the lemmatizer expects

from nltk.stem import SnowballStemmer, WordNetLemmatizer
from nltk import pos_tag

stemmer = SnowballStemmer("english")
lemmatizer = WordNetLemmatizer()

# Function to map POS tags for lemmatizer
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

# Apply stemming
df['stems'] = df['tokens'].apply(lambda words: [stemmer.stem(w) for w in words])
print("Sample Stems:", df['stems'][0])

# Apply lemmatization with POS tagging
df['lemmas'] = df['tokens'].apply(lambda words: [
    lemmatizer.lemmatize(w, get_wordnet_pos(tag))
    for w, tag in pos_tag(words, lang='eng')
])
print("Sample Lemmas:", df['lemmas'][0])


Sample Stems: ['1', 'product', 'exceed', 'expect', 'qualiti', 'outstand', 'easi', 'use', 'high', 'recommend', 'anyon', 'look', 'reliabl', 'solut']
Sample Lemmas: ['1', 'product', 'exceed', 'expectation', 'quality', 'outstanding', 'easy', 'use', 'highly', 'recommend', 'anyone', 'look', 'reliable', 'solution']
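To see where the two normalizers disagree, the sample lists printed above can be compared position by position. The sketch below only reuses the printed output for review 0; `compare` is a hypothetical helper name, not an NLTK function.

```python
# Lists copied from the sample output for review 0.
tokens = ['1', 'product', 'exceeded', 'expectations', 'quality', 'outstanding',
          'easy', 'use', 'highly', 'recommend', 'anyone', 'looking',
          'reliable', 'solution']
stems = ['1', 'product', 'exceed', 'expect', 'qualiti', 'outstand', 'easi',
         'use', 'high', 'recommend', 'anyon', 'look', 'reliabl', 'solut']
lemmas = ['1', 'product', 'exceed', 'expectation', 'quality', 'outstanding',
          'easy', 'use', 'highly', 'recommend', 'anyone', 'look',
          'reliable', 'solution']

def compare(tokens, stems, lemmas):
    """Return (token, stem, lemma) triples where the stem and lemma differ."""
    return [(t, s, l) for t, s, l in zip(tokens, stems, lemmas) if s != l]

diffs = compare(tokens, stems, lemmas)
for t, s, l in diffs:
    print(f"{t:>13}  stem={s:<9} lemma={l}")
```

Eight of the fourteen tokens differ. In each case the stem is a truncated form (`qualiti`, `reliabl`, `solut`), while the lemma is a dictionary word.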


In [None]:
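# Step 5: POS tagging and NER (named entity recognition)

# POS-tag the stopword-stripped tokens. With function words removed the tagger
# has less context: 'use' is tagged NN here but VB in the full sentence below.
df['pos_tags'] = df['tokens'].apply(lambda words: pos_tag(words, lang='eng'))
print("Sample POS Tags:", df['pos_tags'][0])

# Run NER on the raw review text: tokenize, POS-tag, then chunk named entities.
# A review that mentions no people, organizations, or places yields a flat tree.
tokens = word_tokenize(df['review'][0])
pos_tags = pos_tag(tokens, lang='eng')
ner_tree = nltk.ne_chunk(pos_tags)
print("NER Result (Review 0):")
ner_tree.pprint()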

Sample POS Tags: [('1', 'CD'), ('product', 'NN'), ('exceeded', 'VBD'), ('expectations', 'NNS'), ('quality', 'NN'), ('outstanding', 'JJ'), ('easy', 'JJ'), ('use', 'NN'), ('highly', 'RB'), ('recommend', 'VBP'), ('anyone', 'NN'), ('looking', 'VBG'), ('reliable', 'JJ'), ('solution', 'NN')]
NER Result (Review 0):
(S
  1/CD
  ./.
  This/DT
  product/NN
  exceeded/VBD
  my/PRP$
  expectations/NNS
  ./.
  The/DT
  quality/NN
  is/VBZ
  outstanding/JJ
  and/CC
  it/PRP
  's/VBZ
  very/RB
  easy/JJ
  to/TO
  use/VB
  ./.
  I/PRP
  highly/RB
  recommend/VBP
  it/PRP
  to/TO
  anyone/NN
  looking/VBG
  for/IN
  a/DT
  reliable/JJ
  solution/NN
  ./.)


In [None]:
# Step 6: One-hot encoding (vectorization)

from sklearn.feature_extraction.text import CountVectorizer

# binary=True records presence (1) or absence (0) of each term instead of counts
vectorizer_oh = CountVectorizer(binary=True)
X_oh = vectorizer_oh.fit_transform(df['clean'])

onehot_df = pd.DataFrame(X_oh.toarray(), columns=vectorizer_oh.get_feature_names_out())
onehot_df.index = [f"rev_{i}" for i in range(len(onehot_df))]

# Final output: one row per review, one binary column per vocabulary term
display(onehot_df.head(12))

,10,absolutely,adds,alternatives,anyone,avoid,bit,brilliant,broke,bugs,...,update,use,user,using,viable,waste,week,work,works,years
rev_0,0,0,0,0,1,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
rev_1,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
rev_2,0,0,0,0,0,0,1,0,0,1,...,0,0,1,0,1,0,0,1,0,0
rev_3,0,0,0,0,0,1,0,0,1,0,...,0,1,0,0,0,1,1,0,0,0
rev_4,0,0,1,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,1,0,1
rev_5,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
rev_6,0,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
rev_7,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
rev_8,0,0,0,0,1,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
rev_9,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
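The matrix above has a column for `10` but none for `1` through `9`. That is consistent with `CountVectorizer`'s default `token_pattern`, which keeps only tokens of two or more word characters. A minimal pure-Python sketch of binary bag-of-words encoding under that same pattern follows. `one_hot` is a hypothetical helper written for illustration, not part of scikit-learn, and the two input strings are shortened stand-ins for rows of `df['clean']`.

```python
import re

# Same regex as CountVectorizer's default token_pattern: 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def one_hot(docs):
    """Return (vocab, rows): a sorted vocabulary and one 0/1 row per document."""
    tokenized = [set(TOKEN_PATTERN.findall(d.lower())) for d in docs]
    vocab = sorted(set().union(*tokenized))        # alphabetical column order
    rows = [[1 if term in toks else 0 for term in vocab] for toks in tokenized]
    return vocab, rows

docs = ["1 product exceeded expectations", "10 delivery fast product"]
vocab, rows = one_hot(docs)
print(vocab)
for row in rows:
    print(row)
```

The single-character token `1` never enters the vocabulary, while `10` does, so the leading review numbers leak into the features only for the tenth review.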



### Reflection (Why preprocessing matters)
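Preprocessing reduces noise and standardizes text, which makes it easier
for downstream models to work with. Stopword removal keeps the content-bearing
words, while stemming and lemmatization collapse inflected forms
(e.g., plagued → plague). The two differ in output: the stemmer can produce
non-words (quality → qualiti), whereas the lemmatizer returns dictionary forms
(expectations → expectation). The resulting one-hot matrix gives every review a
fixed-length numeric representation that a classifier or similarity measure can consume.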
