# NLP Text Preprocessing Notebook
This notebook covers:
- Text Cleaning
- Tokenization
- Stopwords
- Stemming & Lemmatization
- N-grams
- Basic Frequency Analysis


In [None]:
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.util import ngrams
from collections import Counter

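# Tokenizer models, stopword lists, and WordNet data used in the cells below
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load word_tokenize models from this package
nltk.download('stopwords')
nltk.download('wordnet')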

In [None]:
text = "Hello!!! This is NLP Class 2025. Learn Text Preprocessing @Google Meet. :)"
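# Keep only ASCII letters and whitespace; this also drops digits such as "2025"
cleaned_text = re.sub(r'[^a-zA-Z\s]', '', text)
# Normalize case and trim leading/trailing whitespace
cleaned_text = cleaned_text.lower().strip()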
cleaned_text
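
The pattern above deletes every non-letter character, so "2025" disappears and a double space is left where it stood. A minimal sketch of a reusable cleaner follows; the name `clean_text` and the `keep_digits` flag are illustrative and not part of the notebook.

```python
import re

def clean_text(text, keep_digits=False):
    """Lowercase, drop punctuation/symbols, and collapse repeated whitespace."""
    # Pattern of characters to remove; digits survive only when requested
    unwanted = r'[^a-zA-Z0-9\s]' if keep_digits else r'[^a-zA-Z\s]'
    stripped = re.sub(unwanted, '', text)
    # Collapse runs of whitespace left behind by the removed characters
    return re.sub(r'\s+', ' ', stripped).lower().strip()

sample = "Hello!!! This is NLP Class 2025. Learn Text Preprocessing @Google Meet. :)"
print(clean_text(sample))
print(clean_text(sample, keep_digits=True))
```

Whether digits should be kept depends on the task: years and prices can carry signal in reviews, while they are mostly noise for topic-style features.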

In [None]:
from nltk.tokenize import word_tokenize
tokens = word_tokenize(cleaned_text)
tokens

In [None]:
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word not in stop_words]
filtered_tokens

In [None]:
ps = PorterStemmer()
stemmed = [ps.stem(word) for word in filtered_tokens]
stemmed

In [None]:
lemmatizer = WordNetLemmatizer()
# lemmatize() defaults to pos='n' (noun); pass pos='v' to reduce verb forms
lemmatized = [lemmatizer.lemmatize(word) for word in filtered_tokens]
lemmatized

In [None]:
bigrams = list(ngrams(tokens, 2))
trigrams = list(ngrams(tokens, 3))
bigrams, trigrams
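
`ngrams(tokens, n)` slides a window of length `n` over the token list, so a list of `k` tokens yields `k - n + 1` n-grams. A pure-Python sketch of the same idea is shown below; `make_ngrams` is a hypothetical helper, not an NLTK function.

```python
def make_ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    # zip over n shifted views of the list; zip stops at the shortest view
    return list(zip(*(tokens[i:] for i in range(n))))

toks = ["learn", "text", "preprocessing", "today"]
print(make_ngrams(toks, 2))
print(make_ngrams(toks, 3))
```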

In [None]:
Counter(tokens)

In [2]:
!pip install gensim

Collecting gensim
  Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (8.4 kB)
Downloading gensim-4.4.0-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (27.9 MB)
Installing collected packages: gensim
Successfully installed gensim-4.4.0


In [3]:
# Imports & sample data =====
import numpy as np
import pandas as pd

# Scikit-learn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report, roc_auc_score, confusion_matrix

# Word2Vec (requires gensim, installed in the cell above)
from gensim.models import Word2Vec

# Optional for sentence embeddings
# pip install sentence-transformers
# from sentence_transformers import SentenceTransformer

# For saving
import joblib

# Sample small dataset (binary sentiment)
data = {
    "text": [
        "I love this product, it is amazing and works great",
        "Terrible service, I will never buy again",
        "Really enjoyed the experience, very satisfied",
        "The product broke within days, worst purchase ever",
        "Delicious food and quick delivery",
        "Not worth the money, extremely disappointed"
    ],
    "label": [1, 0, 1, 0, 1, 0]  # 1 -> positive, 0 -> negative
}
df = pd.DataFrame(data)
df


| | text | label |
|---|---|---|
| 0 | I love this product, it is amazing and works g... | 1 |
| 1 | Terrible service, I will never buy again | 0 |
| 2 | Really enjoyed the experience, very satisfied | 1 |
| 3 | The product broke within days, worst purchase ... | 0 |
| 4 | Delicious food and quick delivery | 1 |
| 5 | Not worth the money, extremely disappointed | 0 |


In [4]:
# BoW and TF-IDF examples =====
texts = df['text'].tolist()

# Bag of Words
cv = CountVectorizer(ngram_range=(1,1), min_df=1)
X_bow = cv.fit_transform(texts)
print("BoW features shape:", X_bow.shape)
print("BoW feature names:", cv.get_feature_names_out())

# TF-IDF
tfidf = TfidfVectorizer(ngram_range=(1,2), min_df=1)  # unigrams + bigrams
X_tfidf = tfidf.fit_transform(texts)
print("TF-IDF features shape:", X_tfidf.shape)
print("Example TF-IDF feature names (first 20):", tfidf.get_feature_names_out()[:20])


BoW features shape: (6, 36)
BoW feature names: ['again' 'amazing' 'and' 'broke' 'buy' 'days' 'delicious' 'delivery'
 'disappointed' 'enjoyed' 'ever' 'experience' 'extremely' 'food' 'great'
 'is' 'it' 'love' 'money' 'never' 'not' 'product' 'purchase' 'quick'
 'really' 'satisfied' 'service' 'terrible' 'the' 'this' 'very' 'will'
 'within' 'works' 'worst' 'worth']
TF-IDF features shape: (6, 70)
Example TF-IDF feature names (first 20): ['again' 'amazing' 'amazing and' 'and' 'and quick' 'and works' 'broke'
 'broke within' 'buy' 'buy again' 'days' 'days worst' 'delicious'
 'delicious food' 'delivery' 'disappointed' 'enjoyed' 'enjoyed the' 'ever'
 'experience']
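
To make the TF-IDF numbers less of a black box, the sketch below recomputes them by hand. It assumes the default `TfidfVectorizer` settings as described in the scikit-learn documentation: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. The three documents are made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "delicious food and quick delivery",
    "terrible service and slow delivery",
    "quick service",
]

# Raw term counts (the bag-of-words matrix)
counts = CountVectorizer().fit_transform(docs).toarray()

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)               # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1   # smoothed inverse document frequency

weighted = counts * idf
# L2-normalize each row so document length does not dominate
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(docs).toarray()
print("Manual matches TfidfVectorizer:", np.allclose(manual, reference))
```

Terms that occur in every document get the smallest IDF (exactly 1 under this formula), so they are down-weighted but not removed.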


In [5]:
# Word2Vec demo =====
# Simple whitespace tokenization; punctuation stays attached (e.g. 'product,'),
# so in practice run your preprocessing pipeline first
tokenized = [t.lower().split() for t in texts]
w2v_model = Word2Vec(sentences=tokenized, vector_size=50, window=5, min_count=1, workers=1, seed=42)

# Get embedding for a word
print("Vector for 'product' (shape):", w2v_model.wv['product'].shape)

# To get sentence embedding: average word vectors (simple)
def sentence_vector(sentence, model):
    toks = sentence.lower().split()
    vecs = [model.wv[w] for w in toks if w in model.wv]
    if len(vecs)==0:
        return np.zeros(model.vector_size)
    return np.mean(vecs, axis=0)

sent_emb = np.vstack([sentence_vector(s, w2v_model) for s in texts])
print("Sentence embeddings shape:", sent_emb.shape)


Vector for 'product' (shape): (50,)
Sentence embeddings shape: (6, 50)
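
Averaged vectors like these are usually compared with cosine similarity. The sketch below uses small stand-in vectors rather than the trained model, and `cosine_similarity` here is a hand-written helper; with only six training sentences, similarities from the Word2Vec model above would not be meaningful anyway.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Guard against the all-zero vector that sentence_vector returns for unknown words
    return float(np.dot(a, b) / denom) if denom else 0.0

# Stand-in vectors; in the notebook these would be rows of sent_emb
v1 = np.array([1.0, 2.0, 0.0])
v2 = np.array([2.0, 4.0, 0.0])   # same direction as v1
v3 = np.array([0.0, 0.0, 3.0])   # orthogonal to v1

print("same direction:", cosine_similarity(v1, v2))
print("orthogonal:    ", cosine_similarity(v1, v3))
print("zero vector:   ", cosine_similarity(v1, np.zeros(3)))
```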


In [6]:
# Train-test split (use TF-IDF features) =====
X = X_tfidf
y = df['label'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
print("Train:", X_train.shape, "Test:", X_test.shape)


Train: (4, 70) Test: (2, 70)


In [7]:
# Train baseline models =====
models = {
    "NaiveBayes": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000, solver='liblinear'),
    "LinearSVC": LinearSVC(max_iter=10000)
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    acc = accuracy_score(y_test, preds)
    print(f"{name} -> Accuracy: {acc:.3f}")
    print(classification_report(y_test, preds))


NaiveBayes -> Accuracy: 0.500
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2

LogisticRegression -> Accuracy: 0.500
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2

LinearSVC -> Accuracy: 0.500
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
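
The `_warn_prf` lines appear to be fragments of scikit-learn's `UndefinedMetricWarning`: every model predicted class 1 for both test rows, so precision for class 0 is 0/0. A small sketch of the same situation, using `zero_division=0` to report the undefined value as 0.0 without a warning:

```python
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 1]
y_pred = [1, 1]   # class 0 is never predicted, as in the reports above

# zero_division=0 reports undefined precision/F1 as 0.0 instead of warning
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1], zero_division=0
)
print("precision:", precision)
print("recall:   ", recall)
print("f1:       ", f1)
```

`classification_report` accepts the same `zero_division` argument.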


In [8]:
# Evaluation utilities =====
def evaluate_model(model, X_test, y_test):
    if hasattr(model, "predict_proba"):
        probs = model.predict_proba(X_test)[:,1]
    else:
        # fallback: use decision_function if available and scale to [0,1]
        if hasattr(model, "decision_function"):
            from sklearn.preprocessing import MinMaxScaler
            scores = model.decision_function(X_test).reshape(-1,1)
            probs = MinMaxScaler().fit_transform(scores).ravel()
        else:
            probs = None

    preds = model.predict(X_test)
    acc = accuracy_score(y_test, preds)
    p, r, f1, _ = precision_recall_fscore_support(y_test, preds, average='binary', zero_division=0)
    print("Accuracy:", acc)
    print("Precision:", p, "Recall:", r, "F1:", f1)
    if probs is not None:
        try:
            print("ROC-AUC:", roc_auc_score(y_test, probs))
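        except ValueError as err:
            # roc_auc_score raises ValueError when y_test contains only one class
            print("ROC-AUC undefined:", err)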
    print("Confusion Matrix:\n", confusion_matrix(y_test, preds))
    print("---- Detailed report ----")
    print(classification_report(y_test, preds))

# Example run on the logistic regression model
evaluate_model(models['LogisticRegression'], X_test, y_test)


Accuracy: 0.5
Precision: 0.5 Recall: 1.0 F1: 0.6666666666666666
ROC-AUC: 1.0
Confusion Matrix:
 [[0 1]
 [0 1]]
---- Detailed report ----
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
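
A ROC-AUC of 1.0 next to an accuracy of 0.5 looks contradictory, but ROC-AUC only checks ranking: the positive row received a higher score than the negative row, even though both scores fell above the 0.5 threshold. The sketch below reproduces this with hypothetical probabilities (the actual values from the model are not shown in the output).

```python
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = [0, 1]
probs = [0.55, 0.60]   # hypothetical positive-class probabilities

# Thresholding at 0.5 labels both samples as positive
preds = [int(p >= 0.5) for p in probs]

print("Predictions:", preds)
print("Accuracy:", accuracy_score(y_true, preds))
print("ROC-AUC:", roc_auc_score(y_true, probs))
```

With a two-row test set, either metric can only take a handful of values, so neither says much about real performance.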


In [9]:
# Pipeline + GridSearch =====
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(solver='liblinear', max_iter=1000))
])

param_grid = {
    'tfidf__ngram_range': [(1,1), (1,2)],
    'tfidf__min_df': [1, 2],
    'clf__C': [0.1, 1, 10]
}

gs = GridSearchCV(pipeline, param_grid, cv=3, scoring='f1', n_jobs=-1)
gs.fit(df['text'], df['label'])
print("Best params:", gs.best_params_)
print("Best CV score:", gs.best_score_)

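# Refit the best configuration on a fresh train split and evaluate on the held-out rows.
# Note: the grid search above was fit on all six rows, so the hyperparameters
# were chosen with knowledge of these test rows.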
X_train2, X_test2, y_train2, y_test2 = train_test_split(df['text'], df['label'], test_size=0.33, random_state=42, stratify=df['label'])
best_model = gs.best_estimator_
best_model.fit(X_train2, y_train2)
preds = best_model.predict(X_test2)
print(classification_report(y_test2, preds))


Best params: {'clf__C': 0.1, 'tfidf__min_df': 1, 'tfidf__ngram_range': (1, 1)}
Best CV score: 0.4444444444444444
              precision    recall  f1-score   support

           0       0.00      0.00      0.00         1
           1       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2



  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
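
One way to keep the holdout rows out of hyperparameter selection is to split first and run the search on the training part only. This is a sketch under the assumption that the same six sentences are used; `cv=2` and the reduced grid are choices made here because only four training rows remain.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

texts = [
    "I love this product, it is amazing and works great",
    "Terrible service, I will never buy again",
    "Really enjoyed the experience, very satisfied",
    "The product broke within days, worst purchase ever",
    "Delicious food and quick delivery",
    "Not worth the money, extremely disappointed",
]
labels = [1, 0, 1, 0, 1, 0]

# Split first so the held-out rows never influence hyperparameter selection
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.33, random_state=42, stratify=labels
)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(solver="liblinear", max_iter=1000)),
])

# cv=2 because only two samples per class remain in the training split
search = GridSearchCV(pipe, {"clf__C": [0.1, 1, 10]}, cv=2, scoring="accuracy")
search.fit(X_tr, y_tr)

# With the default refit=True, the best configuration is refit on all of X_tr
preds = search.predict(X_te)
print("Best params:", search.best_params_)
print("Holdout predictions:", preds, "true labels:", y_te)
```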


- Try ngram_range=(1,2) or (1,3) to capture short phrases.
- Remove stopwords when the vocabulary is noisy.
- Use character n-grams for short, noisy text (reviews, tweets).
- Limit max_features or raise min_df to control vocabulary size.
- Use HashingVectorizer for a memory-efficient transform on large corpora.
- Combine TF-IDF with pretrained sentence embeddings for hybrid features (concatenate the sparse and dense matrices).
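
Three of the tips above can be sketched in a few lines. The parameter values (`ngram_range=(2, 4)`, `n_features=2**10`, eight dense dimensions) are arbitrary choices for illustration, and the random dense matrix is only a placeholder for real sentence embeddings.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

texts = [
    "Delicious food and quick delivery",
    "Terrible service, I will never buy again",
    "Not worth the money, extremely disappointed",
]

# Character n-grams (2 to 4 characters, within word boundaries) tolerate typos
char_tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
X_char = char_tfidf.fit_transform(texts)

# HashingVectorizer stores no vocabulary; n_features fixes the output width
hasher = HashingVectorizer(n_features=2**10, alternate_sign=False)
X_hash = hasher.transform(texts)

# Hybrid features: sparse word-level TF-IDF next to a dense block.
# The random matrix stands in for pretrained sentence embeddings.
X_word = TfidfVectorizer().fit_transform(texts)
dense = np.random.default_rng(0).normal(size=(len(texts), 8))
X_hybrid = hstack([X_word, csr_matrix(dense)]).tocsr()

print("char n-grams:", X_char.shape)
print("hashed:      ", X_hash.shape)
print("hybrid:      ", X_hybrid.shape)
```

Because `HashingVectorizer` is stateless, it needs no `fit`, but the original feature names cannot be recovered from it.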

In [10]:

# !pip install sentence-transformers
# from sentence_transformers import SentenceTransformer
# s_model = SentenceTransformer('all-MiniLM-L6-v2')  # example model
# sentence_embeddings = s_model.encode(df['text'].tolist(), show_progress_bar=True)
# print("Embeddings shape:", sentence_embeddings.shape)


In [11]:
# =====  Save & load pipeline =====
# Example: save best_model from GridSearch
joblib.dump(best_model, "best_text_pipeline.joblib")
# Load
loaded = joblib.load("best_text_pipeline.joblib")
# Inference
sample = ["I really hate the support and the product quality"]
print("Prediction:", loaded.predict(sample))


Prediction: [1]


**Inference pipeline (production notes)**

- The pipeline should cover preprocessing → vectorization → model. Save it as a single Pipeline object (as above).
- Apply the same text cleaning and normalization at inference time as at training time.
- To serve in production, wrap the pipeline in a small API (FastAPI or Flask), run it under Gunicorn or Uvicorn, and package it in a container (Docker).
- If the model is retrained periodically, automate dataset collection, validation, and CI tests.
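
One way to guarantee identical cleaning at train and inference time is to put the cleaning function inside the pipeline itself. The sketch below passes a hypothetical `clean_text` helper as the vectorizer's `preprocessor`; the four training sentences are taken from the sample data above.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def clean_text(text):
    """Cleaning shared by fit and predict: letters only, lowercased."""
    return re.sub(r"[^a-zA-Z\s]", "", text).lower().strip()

# A custom preprocessor replaces the vectorizer's built-in lowercasing step,
# so clean_text lowercases the text itself
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=clean_text)),
    ("clf", LogisticRegression(solver="liblinear", max_iter=1000)),
])

train_texts = [
    "I love this product, it is amazing and works great",
    "The product broke within days, worst purchase ever",
    "Delicious food and quick delivery",
    "Not worth the money, extremely disappointed",
]
train_labels = [1, 0, 1, 0]
pipe.fit(train_texts, train_labels)

# Raw, uncleaned input goes straight in; cleaning happens inside the pipeline
print("Prediction:", pipe.predict(["Worst!!! purchase EVER :("]))
```

If such a pipeline is saved with joblib, `clean_text` is pickled by reference, so it must be importable under the same module path wherever the file is loaded.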

**Short assignment / exercises**

- Compare performance: CountVectorizer vs. TfidfVectorizer vs. averaged Word2Vec embeddings on a 10k-sample dataset.
- Try ngram_range=(1,3) and observe overfitting and feature explosion.
- Use GridSearchCV to tune C for LogisticRegression and alpha for MultinomialNB.
- Create an inference API with FastAPI that loads best_text_pipeline.joblib and exposes POST /predict.
- (Advanced) Fine-tune a small transformer (e.g., DistilBERT) for sentiment classification using Hugging Face **transformers**.
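
A possible starting point for the first exercise is sketched below. It covers only the two scikit-learn vectorizers (Word2Vec is left out), uses MultinomialNB as an arbitrary choice of classifier, and runs on the six sample sentences rather than a 10k-sample dataset; `compare_vectorizers` is a hypothetical helper name.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def compare_vectorizers(texts, labels, cv=3):
    """Mean cross-validated accuracy of MultinomialNB for each vectorizer."""
    candidates = {"bow": CountVectorizer(), "tfidf": TfidfVectorizer()}
    results = {}
    for name, vectorizer in candidates.items():
        # Vectorizer inside the pipeline, so each fold fits its own vocabulary
        pipe = Pipeline([("vec", vectorizer), ("clf", MultinomialNB())])
        scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="accuracy")
        results[name] = float(scores.mean())
    return results

texts = [
    "I love this product, it is amazing and works great",
    "Terrible service, I will never buy again",
    "Really enjoyed the experience, very satisfied",
    "The product broke within days, worst purchase ever",
    "Delicious food and quick delivery",
    "Not worth the money, extremely disappointed",
]
labels = [1, 0, 1, 0, 1, 0]

print(compare_vectorizers(texts, labels))
```

Swapping in a larger dataset only requires replacing `texts` and `labels`; the scores on six sentences are too noisy to draw conclusions from.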