# Assignment 3: Text Cleaning, Lemmatization, Stop Words, Label Encoding, TF-IDF

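This notebook demonstrates:
- **Text Cleaning** - Lowercase the text and strip URLs, HTML tags, digits, punctuation, and extra whitespace
- **Lemmatization** - Reduce each token to its base form with the WordNet lemmatizer
- **Stop Words Removal** - Drop common English words using the NLTK stop word list
- **Label Encoding** - Map the sentiment categories to integers
- **TF-IDF Representations** - Build a TF-IDF feature matrix from the processed text
- **Save Outputs** - Write the processed data, TF-IDF matrix, label classes, and vocabulary to disk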

In [1]:
# Import required libraries
import re

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Download the NLTK resources used below (tokenizer models, WordNet, stop word list)
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ayush\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ayush\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ayush\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\ayush\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


## Sample Dataset

In [2]:
data = {
    'text': [
        "The movie was AMAZING!!! I loved it so much <3 :)",
        "This product is terrible... worst purchase ever!!! :/",
        "Great service, very helpful staff & quick delivery.",
        "Disappointed with the quality. Not worth the price...",
        "Absolutely fantastic experience! Would recommend 10/10",
        "Poor customer support, waited 2 hours on hold :(",
        "Best restaurant in town! The food was delicious.",
        "Never buying from this company again. Horrible!!!"
    ],
    'category': ['positive', 'negative', 'positive', 'negative', 
                 'positive', 'negative', 'positive', 'negative']
}

df = pd.DataFrame(data)

print("ORIGINAL DATA:")
for i, row in df.iterrows():
    print(f"  [{row['category']:8}] {row['text']}")

ORIGINAL DATA:
  [positive] The movie was AMAZING!!! I loved it so much <3 :)
  [negative] This product is terrible... worst purchase ever!!! :/
  [positive] Great service, very helpful staff & quick delivery.
  [negative] Disappointed with the quality. Not worth the price...
  [positive] Absolutely fantastic experience! Would recommend 10/10
  [negative] Poor customer support, waited 2 hours on hold :(
  [positive] Best restaurant in town! The food was delicious.
  [negative] Never buying from this company again. Horrible!!!


## 1. Text Cleaning

In [3]:
def clean_text(text):
    """Lowercase text and strip URLs, HTML tags, non-letter characters, and extra whitespace."""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)  # Remove URLs
    text = re.sub(r'<.*?>', '', text)  # Remove HTML tags (anything between '<' and the next '>')
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Keep only letters and whitespace (drops digits, punctuation, emoticons)
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse repeated whitespace
    return text

df['cleaned_text'] = df['text'].apply(clean_text)

print("Cleaned Text:")
for i, row in df.iterrows():
    print(f"  Original: {row['text']}")
    print(f"  Cleaned:  {row['cleaned_text']}")
    print()

Cleaned Text:
  Original: The movie was AMAZING!!! I loved it so much <3 :)
  Cleaned:  the movie was amazing i loved it so much

  Original: This product is terrible... worst purchase ever!!! :/
  Cleaned:  this product is terrible worst purchase ever

  Original: Great service, very helpful staff & quick delivery.
  Cleaned:  great service very helpful staff quick delivery

  Original: Disappointed with the quality. Not worth the price...
  Cleaned:  disappointed with the quality not worth the price

  Original: Absolutely fantastic experience! Would recommend 10/10
  Cleaned:  absolutely fantastic experience would recommend

  Original: Poor customer support, waited 2 hours on hold :(
  Cleaned:  poor customer support waited hours on hold

  Original: Best restaurant in town! The food was delicious.
  Cleaned:  best restaurant in town the food was delicious

  Original: Never buying from this company again. Horrible!!!
  Cleaned:  never buying from this company again horrible
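One caveat with the `<.*?>` pattern used in `clean_text`: it removes everything between any `<` and the next `>`, so a stray emoticon such as `<3` followed later by `->` can swallow real words. The sample data above is not affected because none of the reviews contain a `>`. The sketch below compares the notebook's pattern with a stricter one that only matches a `<` followed by a letter. The names `LOOSE_TAG`, `STRICT_TAG`, and `clean_with` are hypothetical helpers introduced for this comparison, and the stricter pattern is one possible choice, not a full HTML parser.

```python
import re

# Pattern used in clean_text above: anything between '<' and the next '>'.
LOOSE_TAG = re.compile(r'<.*?>')
# Hypothetical stricter alternative: '<', optional '/', then a letter, up to '>'.
STRICT_TAG = re.compile(r'</?[a-zA-Z][^>]*>')


def clean_with(tag_pattern, text):
    """Same steps as clean_text, with the HTML-tag pattern passed in."""
    text = text.lower()
    text = re.sub(r'http\S+|www\S+', '', text)   # remove URLs
    text = tag_pattern.sub('', text)             # remove HTML tags
    text = re.sub(r'[^a-zA-Z\s]', '', text)      # keep letters and whitespace
    return re.sub(r'\s+', ' ', text).strip()     # collapse whitespace


sample = "I <3 this -> great"
loose_result = clean_with(LOOSE_TAG, sample)     # 'i great' ("this" is lost)
strict_result = clean_with(STRICT_TAG, sample)   # 'i this great'
html_result = clean_with(STRICT_TAG, "<b>Great</b> service")  # 'great service'

print(loose_result)
print(strict_result)
print(html_result)
```

For text that contains real markup, a dedicated HTML parser is usually a safer choice than either regular expression.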



## 2. Lemmatization

In [4]:
lemmatizer = WordNetLemmatizer()

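def lemmatize_text(text):
    """Tokenize text and lemmatize every token as a verb."""
    tokens = word_tokenize(text)
    # pos='v' treats each token as a verb, so "loved" -> "love" and "was" -> "be",
    # but plural nouns such as "hours" are left unchanged.
    lemmatized_tokens = [lemmatizer.lemmatize(token, pos='v') for token in tokens]
    return ' '.join(lemmatized_tokens)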

df['lemmatized_text'] = df['cleaned_text'].apply(lemmatize_text)

print("Lemmatized Text:")
for i, row in df.iterrows():
    print(f"  Before: {row['cleaned_text']}")
    print(f"  After:  {row['lemmatized_text']}")
    print()

Lemmatized Text:
  Before: the movie was amazing i loved it so much
  After:  the movie be amaze i love it so much

  Before: this product is terrible worst purchase ever
  After:  this product be terrible worst purchase ever

  Before: great service very helpful staff quick delivery
  After:  great service very helpful staff quick delivery

  Before: disappointed with the quality not worth the price
  After:  disappoint with the quality not worth the price

  Before: absolutely fantastic experience would recommend
  After:  absolutely fantastic experience would recommend

  Before: poor customer support waited hours on hold
  After:  poor customer support wait hours on hold

  Before: best restaurant in town the food was delicious
  After:  best restaurant in town the food be delicious

  Before: never buying from this company again horrible
  After:  never buy from this company again horrible
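Because every token is lemmatized with `pos='v'`, verbs are reduced ("waited" becomes "wait") while the plural noun "hours" stays as-is, and the adjective "amazing" is turned into the verb "amaze". A common alternative is to tag each token first and pass the matching part of speech to the lemmatizer. The following is a minimal sketch of that idea, not the method used in this notebook. `penn_to_wordnet` and `lemmatize_with_pos` are hypothetical names, and `nltk.pos_tag` needs an additional tagger resource (`averaged_perceptron_tagger` or `averaged_perceptron_tagger_eng`, depending on the NLTK version) that is not downloaded in the first cell.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the single-letter POS code WordNet expects."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun, which is also WordNetLemmatizer's default


def lemmatize_with_pos(text):
    """Tokenize, POS-tag, and lemmatize each token with its own part of speech."""
    # Imported here so penn_to_wordnet stays usable without NLTK installed.
    # Assumes the punkt, wordnet, and tagger resources are already downloaded.
    from nltk import pos_tag
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    lemmatizer = WordNetLemmatizer()
    tagged = pos_tag(word_tokenize(text))
    return ' '.join(
        lemmatizer.lemmatize(token, pos=penn_to_wordnet(tag))
        for token, tag in tagged
    )
```

In the notebook this could be applied with `df['cleaned_text'].apply(lemmatize_with_pos)`. The results depend on the tagger's predictions, so they may differ from the verb-only output shown above.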



## 3. Stop Words Removal

In [5]:
stop_words = set(stopwords.words('english'))
print(f"Number of stop words: {len(stop_words)}")
print(f"Sample stop words: {list(stop_words)[:20]}")

def remove_stopwords(text):
    """Remove NLTK English stop words from text."""
    tokens = word_tokenize(text)
    # The NLTK list includes negations such as "not", so "not worth" becomes "worth".
    filtered_tokens = [token for token in tokens if token not in stop_words]
    return ' '.join(filtered_tokens)

df['no_stopwords_text'] = df['lemmatized_text'].apply(remove_stopwords)

print("\nText after Stop Words Removal:")
for i, row in df.iterrows():
    print(f"  Before: {row['lemmatized_text']}")
    print(f"  After:  {row['no_stopwords_text']}")
    print()

Number of stop words: 198
Sample stop words: ['an', 'aren', "should've", 'their', 'too', 'few', 'than', "they've", 'above', 'up', 'both', 'won', 'where', 'o', 'such', 'had', "didn't", "hasn't", 'him', 'did']

Text after Stop Words Removal:
  Before: the movie be amaze i love it so much
  After:  movie amaze love much

  Before: this product be terrible worst purchase ever
  After:  product terrible worst purchase ever

  Before: great service very helpful staff quick delivery
  After:  great service helpful staff quick delivery

  Before: disappoint with the quality not worth the price
  After:  disappoint quality worth price

  Before: absolutely fantastic experience would recommend
  After:  absolutely fantastic experience would recommend

  Before: poor customer support wait hours on hold
  After:  poor customer support wait hours hold

  Before: best restaurant in town the food be delicious
  After:  best restaurant town food delicious

  Before: never buy from this company again horrible
  After:  never buy company horrible
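The fourth review shows a side effect of the default list: "not worth the price" is reduced to "worth price", which reverses its apparent sentiment. For sentiment tasks it can help to keep negation words. The sketch below is one way to do that, assuming whitespace tokenization is sufficient because the text has already been cleaned. `NEGATIONS` and `remove_stopwords_keep_negations` are hypothetical names, and the small `demo_stop_words` set stands in for the full NLTK list so the example runs without any downloads.

```python
# Negation words that appear in the NLTK English stop word list.
NEGATIONS = frozenset({'not', 'no', 'nor'})


def remove_stopwords_keep_negations(text, stop_words, keep=NEGATIONS):
    """Remove stop words from text but leave the words in `keep` untouched."""
    effective_stop_words = set(stop_words) - set(keep)
    tokens = text.split()  # assumes text is already lowercased and cleaned
    return ' '.join(t for t in tokens if t not in effective_stop_words)


# Tiny stand-in for stopwords.words('english'), for illustration only.
demo_stop_words = {'with', 'the', 'not'}
demo = remove_stopwords_keep_negations(
    "disappoint with the quality not worth the price", demo_stop_words
)
print(demo)  # disappoint quality not worth price
```

In the notebook, the `stop_words` set built in the cell above could be passed in place of `demo_stop_words`.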

## 4. Label Encoding

In [6]:
label_encoder = LabelEncoder()
df['encoded_label'] = label_encoder.fit_transform(df['category'])

print("Label Encoding:")
print(f"  Classes: {label_encoder.classes_}")
print(f"  Mapping: {dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))}")
print("\n  Category -> Encoded Label:")
for i, row in df.iterrows():
    print(f"  {row['category']:10} -> {row['encoded_label']}")

Label Encoding:
  Classes: ['negative' 'positive']
  Mapping: {'negative': np.int64(0), 'positive': np.int64(1)}

  Category -> Encoded Label:
  positive   -> 1
  negative   -> 0
  positive   -> 1
  negative   -> 0
  positive   -> 1
  negative   -> 0
  positive   -> 1
  negative   -> 0
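`LabelEncoder` assigns integers in the sorted order of the class names, which is why `negative` maps to 0 and `positive` to 1 even though `positive` appears first in the data. The pure-Python sketch below mirrors that behavior to make the mapping explicit. `fit_label_mapping`, `encode`, and `decode` are hypothetical helpers written for illustration and are not part of scikit-learn.

```python
def fit_label_mapping(labels):
    """Assign consecutive integers to the sorted unique labels."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}


def encode(labels, mapping):
    """Convert labels to their integer codes."""
    return [mapping[label] for label in labels]


def decode(codes, mapping):
    """Convert integer codes back to the original labels."""
    inverse = {index: label for label, index in mapping.items()}
    return [inverse[code] for code in codes]


categories = ['positive', 'negative', 'positive', 'negative']
mapping = fit_label_mapping(categories)
codes = encode(categories, mapping)
restored = decode(codes, mapping)

print(mapping)   # {'negative': 0, 'positive': 1}
print(codes)     # [1, 0, 1, 0]
print(restored)  # ['positive', 'negative', 'positive', 'negative']
```

With the fitted encoder from the cell above, `label_encoder.inverse_transform(df['encoded_label'])` performs the decoding step.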


## 5. TF-IDF Representations

In [7]:
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['no_stopwords_text'])

print("Vocabulary:")
vocabulary = tfidf_vectorizer.get_feature_names_out()
print(f"  {list(vocabulary)}")

print(f"\nTF-IDF Matrix Shape: {tfidf_matrix.shape}")

tfidf_array = tfidf_matrix.toarray()

print("\nTF-IDF Representations (showing top 5 terms per document):")
for i, row in enumerate(tfidf_array):
    top_indices = row.argsort()[-5:][::-1]
    top_terms = [(vocabulary[idx], round(row[idx], 3)) for idx in top_indices if row[idx] > 0]
    print(f"  Doc {i+1}: {top_terms}")

Vocabulary:
  ['absolutely', 'amaze', 'best', 'buy', 'company', 'customer', 'delicious', 'delivery', 'disappoint', 'ever', 'experience', 'fantastic', 'food', 'great', 'helpful', 'hold', 'horrible', 'hours', 'love', 'movie', 'much', 'never', 'poor', 'price', 'product', 'purchase', 'quality', 'quick', 'recommend', 'restaurant', 'service', 'staff', 'support', 'terrible', 'town', 'wait', 'worst', 'worth', 'would']

TF-IDF Matrix Shape: (8, 39)

TF-IDF Representations (showing top 5 terms per document):
  Doc 1: [('much', np.float64(0.5)), ('love', np.float64(0.5)), ('movie', np.float64(0.5)), ('amaze', np.float64(0.5))]
  Doc 2: [('worst', np.float64(0.447)), ('terrible', np.float64(0.447)), ('purchase', np.float64(0.447)), ('product', np.float64(0.447)), ('ever', np.float64(0.447))]
  Doc 3: [('quick', np.float64(0.408)), ('delivery', np.float64(0.408)), ('great', np.float64(0.408)), ('helpful', np.float64(0.408)), ('staff', np.float64(0.408))]
  Doc 4: [('worth', np.float64(0.5)), ('pric
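The scores above follow a simple pattern: the vocabulary has 39 terms and no term occurs in more than one document, so all terms in a document get the same weight, and L2 normalization gives each of the k terms a score of 1/sqrt(k) (0.5 for four terms, 0.447 for five, 0.408 for six). The sketch below recomputes this by hand using the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 row normalization, which are the `TfidfVectorizer` defaults. `tfidf_l2` is a hypothetical helper, it tokenizes on whitespace rather than with the vectorizer's token pattern, and it uses only the first three processed documents as a mini corpus.

```python
import math
from collections import Counter


def tfidf_l2(docs):
    """Smoothed TF-IDF with L2-normalized rows, one dict of scores per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {term: tf[term] * idf[term] for term in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: round(w / norm, 3) for term, w in weights.items()})
    return rows


docs = [
    "movie amaze love much",
    "product terrible worst purchase ever",
    "great service helpful staff quick delivery",
]
rows = tfidf_l2(docs)
for number, row in enumerate(rows, start=1):
    print(f"Doc {number}: {row}")
```

Because the scores depend only on how many distinct terms each document has, a corpus this small gives TF-IDF little to discriminate on. Terms shared across documents would be needed for the IDF component to matter.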

## 6. Save Outputs

In [8]:
# Save processed dataframe to CSV
df.to_csv('processed_data.csv', index=False)
print("Processed data saved to 'processed_data.csv'")

# Save TF-IDF matrix to CSV
tfidf_df = pd.DataFrame(tfidf_array, columns=vocabulary)
tfidf_df.to_csv('tfidf_matrix.csv', index=False)
print("TF-IDF matrix saved to 'tfidf_matrix.csv'")

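# Save label encoder classes
# (object array of strings: reload with np.load(..., allow_pickle=True))
np.save('label_encoder_classes.npy', label_encoder.classes_)
print("Label encoder classes saved to 'label_encoder_classes.npy'")

# Save vocabulary (also an object array, so the same allow_pickle=True applies on load)
np.save('vocabulary.npy', vocabulary)
print("Vocabulary saved to 'vocabulary.npy'")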

# Display final processed dataframe
print("\nFINAL PROCESSED DATAFRAME:")
print(df)

Processed data saved to 'processed_data.csv'
TF-IDF matrix saved to 'tfidf_matrix.csv'
Label encoder classes saved to 'label_encoder_classes.npy'
Vocabulary saved to 'vocabulary.npy'
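A note on reloading: `label_encoder.classes_` and `vocabulary` are NumPy object arrays of strings, so `np.load` needs `allow_pickle=True` to read the `.npy` files back. Since both are just ordered lists of strings, a plain JSON file is a possible alternative that avoids pickling. The sketch below shows a round trip with hypothetical helpers `save_string_list` and `load_string_list`, writing to a temporary directory instead of the notebook's working directory.

```python
import json
import tempfile
from pathlib import Path


def save_string_list(path, values):
    """Write an ordered list of strings as JSON; list position is the integer id."""
    items = [str(value) for value in values]
    Path(path).write_text(json.dumps(items, indent=2), encoding='utf-8')


def load_string_list(path):
    """Read the list back in the same order it was saved."""
    return json.loads(Path(path).read_text(encoding='utf-8'))


classes = ['negative', 'positive']  # same order as label_encoder.classes_ above

with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / 'label_encoder_classes.json'
    save_string_list(target, classes)
    restored_classes = load_string_list(target)

print(restored_classes)  # ['negative', 'positive']
```

In the notebook, `save_string_list('vocabulary.json', vocabulary)` would store the TF-IDF vocabulary the same way. To reuse the exact fitted vectorizer on new text, the vectorizer object itself would also need to be persisted, since the vocabulary alone does not include the IDF weights.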

FINAL PROCESSED DATAFRAME:
                                                text  category  \
0  The movie was AMAZING!!! I loved it so much <3 :)  positive   
1  This product is terrible... worst purchase eve...  negative   
2  Great service, very helpful staff & quick deli...  positive   
3  Disappointed with the quality. Not worth the p...  negative   
4  Absolutely fantastic experience! Would recomme...  positive   
5   Poor customer support, waited 2 hours on hold :(  negative   
6   Best restaurant in town! The food was delicious.  positive   
7  Never buying from this company again. Horrible!!!  negative   

                                        cleaned_text  \
0           the movie was amazing i loved it so much   
1       this product is terrible worst purchase ever   
2    great service very he