# AIG230 NLP (Week 3 Lab) — Notebook 1: Text Representation

This notebook focuses on **turning raw text into numeric features** you can use in real-world ML systems.

You will build:
- a clean **train/test split**
- **Bag-of-Words** (binary and count)
- **Document-Term Matrix** (DTM)
- **TF-IDF** (with n-grams)
- **Hashing trick** (production-friendly)
- basic **retrieval** (cosine similarity) and a **baseline classifier**
- model **persistence** (save/load)

## 0) Setup


In [3]:
!pip install numpy


In [7]:
!pip install pandas
!pip install scikit-learn
!pip install nltk
!pip install matplotlib
!pip install seaborn






In [2]:

import re
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
import joblib


## 1) A small, realistic dataset (you can replace with your own CSV)


In industry, text often comes with:
- an **ID**
- free-text **description**
- a **label** (category, priority, intent, topic) or a target (churn, fraud, etc.)

Here we create a toy dataset that looks like support tickets / ops incidents.  
Swap this section with a `pd.read_csv(...)` in your own workflows.


In [3]:

data = [
    ("T-001", "VPN keeps disconnecting every 10 minutes on Windows 11 after latest update", "network"),
    ("T-002", "Password reset link is expired and user cannot login to the portal", "auth"),
    ("T-003", "Email delivery delayed, outbound messages queued for hours", "messaging"),
    ("T-004", "Cannot install printer driver, installer fails with error code 1603", "device"),
    ("T-005", "MFA prompt never arrives on mobile app, user stuck at login", "auth"),
    ("T-006", "WiFi signal drops in meeting rooms, access point reboot helps temporarily", "network"),
    ("T-007", "Outlook search not returning results, index seems corrupted", "messaging"),
    ("T-008", "Laptop battery drains fast after BIOS update, power settings unchanged", "device"),
    ("T-009", "Portal shows 500 error when submitting form, happened after deployment", "app"),
    ("T-010", "API requests timing out, latency spike observed in last hour", "app"),
    ("T-011", "User cannot access shared drive, permission denied though in correct group", "auth"),
    ("T-012", "Teams calls have choppy audio, jitter high on corporate network", "network"),
    ("T-013", "Push notifications not working on Android for the app", "app"),
    ("T-014", "Mailbox is full and cannot receive emails, auto-archive not running", "messaging"),
    ("T-015", "Bluetooth mouse not pairing after restart, device shows as unknown", "device"),
]

df = pd.DataFrame(data, columns=["ticket_id", "text", "label"])
df


| | ticket_id | text | label |
|---|---|---|---|
| 0 | T-001 | VPN keeps disconnecting every 10 minutes on Wi... | network |
| 1 | T-002 | Password reset link is expired and user cannot... | auth |
| 2 | T-003 | Email delivery delayed, outbound messages queu... | messaging |
| 3 | T-004 | Cannot install printer driver, installer fails... | device |
| 4 | T-005 | MFA prompt never arrives on mobile app, user s... | auth |
| 5 | T-006 | WiFi signal drops in meeting rooms, access poi... | network |
| 6 | T-007 | Outlook search not returning results, index se... | messaging |
| 7 | T-008 | Laptop battery drains fast after BIOS update, ... | device |
| 8 | T-009 | Portal shows 500 error when submitting form, h... | app |
| 9 | T-010 | API requests timing out, latency spike observe... | app |


### Train/test split


In [4]:

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.33, random_state=42, stratify=df["label"]
)

print("Train size:", len(X_train))
print("Test size:", len(X_test))


Train size: 10
Test size: 5


## 2) Tokenization basics and normalization (lightweight, practical)


In production pipelines you typically do **minimal, safe normalization**:
- lowercase
- normalize whitespace
- optionally strip obvious punctuation
- keep numbers when they carry meaning (error codes, versions, dates)

Heavy normalization (stemming, aggressive regexes) can hurt when your text includes:
error codes, product names, IDs, or domain terminology.


In [5]:

def simple_normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"\s+", " ", text).strip()
    return text

df["text_norm"] = df["text"].map(simple_normalize)
df[["ticket_id","text_norm","label"]].head()


| | ticket_id | text_norm | label |
|---|---|---|---|
| 0 | T-001 | vpn keeps disconnecting every 10 minutes on wi... | network |
| 1 | T-002 | password reset link is expired and user cannot... | auth |
| 2 | T-003 | email delivery delayed, outbound messages queu... | messaging |
| 3 | T-004 | cannot install printer driver, installer fails... | device |
| 4 | T-005 | mfa prompt never arrives on mobile app, user s... | auth |
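If you do opt into punctuation stripping, one cautious approach is to remove only non-word characters so that error codes, versions, and hyphenated terms survive. The sketch below is illustrative only: `normalize_keep_codes` and the example string are hypothetical and not part of the lab.

```python
import re

def normalize_keep_codes(text: str) -> str:
    """Lowercase, drop punctuation except hyphens, and collapse whitespace."""
    text = text.lower()
    # Replace anything that is not a word character, whitespace, or hyphen.
    # Digits count as word characters, so "1603" and "22h2" are preserved.
    text = re.sub(r"[^\w\s-]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

example = "Installer fails with error code 1603!!  (Windows 11, build 22H2)"
print(normalize_keep_codes(example))
```

Whether this helps depends on your data, so compare it against the minimal `simple_normalize` before adopting it.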


## 3) Vocabulary + Document-Term Matrix (DTM) with CountVectorizer


**CountVectorizer** builds:
- a vocabulary (token → column index)
- a sparse matrix where rows are documents and columns are tokens

This is the classic **Document-Term Matrix** representation.


In [6]:

count_vec = CountVectorizer(
    lowercase=True,
    token_pattern=r"(?u)\b\w+\b",  # also keeps 1-character tokens; "500", "1603", "mfa" are kept by the default pattern too
    min_df=1
)

X_train_counts = count_vec.fit_transform(X_train)
X_test_counts  = count_vec.transform(X_test)


print("DTM shape (train):", X_train_counts.shape)
print("Vocabulary size:", len(count_vec.vocabulary_))


DTM shape (train): (10, 92)
Vocabulary size: 92


### Inspect the vocabulary and a single row


In [8]:

# Show a small slice of the vocabulary (token -> index)
vocab_items = sorted(count_vec.vocabulary_.items(), key=lambda x: x[1])[:25]
vocab_items


[('10', 0),
 ('11', 1),
 ('1603', 2),
 ('500', 3),
 ('access', 4),
 ('after', 5),
 ('and', 6),
 ('api', 7),
 ('app', 8),
 ('archive', 9),
 ('arrives', 10),
 ('at', 11),
 ('auto', 12),
 ('battery', 13),
 ('bios', 14),
 ('cannot', 15),
 ('code', 16),
 ('correct', 17),
 ('corrupted', 18),
 ('denied', 19),
 ('deployment', 20),
 ('disconnecting', 21),
 ('drains', 22),
 ('drive', 23),
 ('driver', 24)]

In [9]:

# Look at a specific document row: non-zero entries (token counts)
row_id = 0
row = X_train_counts[row_id]
inv_vocab = {idx: tok for tok, idx in count_vec.vocabulary_.items()}

nz_cols = row.nonzero()[1]
tokens_counts = sorted([(inv_vocab[c], int(row[0, c])) for c in nz_cols], key=lambda x: -x[1])
tokens_counts[:20]


[('portal', 1),
 ('shows', 1),
 ('500', 1),
 ('error', 1),
 ('when', 1),
 ('submitting', 1),
 ('form', 1),
 ('happened', 1),
 ('after', 1),
 ('deployment', 1)]

## 4) Binary vs Count-based Bag-of-Words


Binary BoW: token present or not (good for short texts and some classification tasks)  
Count BoW: raw frequency (baseline for many pipelines)

Both discard word order.


In [11]:
binary_vec = CountVectorizer(binary=True, token_pattern=r"(?u)\b\w+\b")
X_train_bin = binary_vec.fit_transform(X_train)

X_train_bin.shape

(10, 92)

In [12]:
X_train_bin

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 104 stored elements and shape (10, 92)>
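The tickets in this lab rarely repeat a word, so binary and count matrices look almost the same here. A hypothetical ticket with a repeated token shows the difference (the example text is invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical ticket in which "error" appears three times
doc = ["error error error on login"]

count_row = CountVectorizer().fit_transform(doc).toarray()[0]
binary_row = CountVectorizer(binary=True).fit_transform(doc).toarray()[0]

print("count :", count_row)   # raw frequencies
print("binary:", binary_row)  # presence/absence only
```

With counts, the repeated token dominates the vector; with `binary=True`, every present token contributes equally.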

## 5) TF-IDF (a refinement, not a replacement)


TF-IDF downweights very common tokens and upweights tokens that are more distinctive.

In industry, TF-IDF with **n-grams** is a strong baseline for:
- ticket routing
- intent detection
- spam detection
- incident clustering


In [15]:
tfidf_vec = TfidfVectorizer(
    ngram_range=(1, 2),            # unigrams + bigrams
    token_pattern=r"(?u)\b\w+\b",  # also keeps single-character tokens
    min_df=1,                      # keep terms that appear in at least 1 document (no rare-term filtering)
    sublinear_tf=True              # use 1 + log(tf) instead of raw tf; a common practical tweak
)

X_train_tfidf = tfidf_vec.fit_transform(X_train)
X_test_tfidf  = tfidf_vec.transform(X_test)


In [14]:
print("TF-IDF shape (train):", X_train_tfidf.shape)

TF-IDF shape (train): (10, 186)
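As a worked check of the "downweights common tokens" claim: with scikit-learn's default settings (`smooth_idf=True`), the inverse document frequency is `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` is the number of documents containing `t`. The sketch below recomputes it by hand on an invented three-document corpus and compares it with the fitted `idf_` values.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical corpus: "login" is in every document, "error" in only one
docs = ["login error", "login timeout", "login failed"]

vec = TfidfVectorizer()
vec.fit(docs)

n_docs = len(docs)
doc_freq = {"login": 3, "error": 1}  # counted by hand from docs above

manual = {}
for term, df_t in doc_freq.items():
    manual[term] = np.log((1 + n_docs) / (1 + df_t)) + 1
    fitted = vec.idf_[vec.vocabulary_[term]]
    print(f"{term}: manual={manual[term]:.4f} sklearn={fitted:.4f}")
```

The token that appears everywhere gets the minimum weight of 1.0, while the rarer token gets about 1.69.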


## 6) Quick retrieval: 'find similar tickets' with cosine similarity


A very common industry use case is **nearest neighbor retrieval** for:
- deduplication
- suggesting knowledge base articles
- finding similar past incidents


In [17]:
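# Build a search index from ALL tickets using TF-IDF.
# Use a separate vectorizer so the train-only tfidf_vec from section 5 is not re-fitted.
search_vec = TfidfVectorizer(
    ngram_range=(1, 2),
    token_pattern=r"(?u)\b\w+\b",
    sublinear_tf=True,
)
X_all = search_vec.fit_transform(df["text"])

def search_similar(query: str, top_k: int = 5):
    qv = search_vec.transform([query])
    sims = cosine_similarity(qv, X_all).ravel()
    top_idx = np.argsort(-sims)[:top_k]
    # iloc: top_idx holds row positions, not index labels
    return df.iloc[top_idx][["ticket_id", "text", "label"]].assign(similarity=sims[top_idx])

search_similar("login mfa not working on phone", top_k=5)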

| | ticket_id | text | label | similarity |
|---|---|---|---|---|
| 12 | T-013 | Push notifications not working on Android for ... | app | 0.426113 |
| 4 | T-005 | MFA prompt never arrives on mobile app, user s... | auth | 0.21186 |
| 1 | T-002 | Password reset link is expired and user cannot... | auth | 0.069304 |
| 6 | T-007 | Outlook search not returning results, index se... | messaging | 0.054095 |
| 14 | T-015 | Bluetooth mouse not pairing after restart, dev... | device | 0.048894 |
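One detail worth knowing for retrieval: `TfidfVectorizer` L2-normalizes each row by default, so cosine similarity reduces to a plain dot product. The sketch below checks this on an invented mini-corpus (the documents and query are hypothetical); `linear_kernel` is scikit-learn's dot-product kernel.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical documents and query
docs = ["mfa prompt never arrives", "vpn keeps disconnecting", "mfa login stuck"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)
q = vec.transform(["mfa not arriving"])

cos = cosine_similarity(q, X).ravel()
dot = linear_kernel(q, X).ravel()  # plain dot products, no extra normalization

print("cosine:", cos)
print("dot   :", dot)
```

A document that shares no tokens with the query scores exactly 0, which is why unrelated tickets drop to the bottom of the ranking.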


## 7) Classification baseline (Logistic Regression)


For text classification, a strong baseline is:

**TF-IDF → Linear model (LogReg / Linear SVM)**

This is fast, reliable, easy to explain, and often hard to beat without deep learning.


In [20]:
clf = LogisticRegression(max_iter=2000)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1,2),
        token_pattern=r"(?u)\b\w+\b",
        sublinear_tf=True
    )),
    ("model", clf)
])

pipeline.fit(X_train, y_train)
pred = pipeline.predict(X_test)

print(classification_report(y_test, pred))
print("Confusion matrix:\n", confusion_matrix(y_test, pred))


              precision    recall  f1-score   support

         app       0.00      0.00      0.00         1
        auth       0.50      1.00      0.67         1
      device       0.00      0.00      0.00         1
   messaging       0.00      0.00      0.00         1
     network       1.00      1.00      1.00         1

    accuracy                           0.40         5
   macro avg       0.30      0.40      0.33         5
weighted avg       0.30      0.40      0.33         5

Confusion matrix:
 [[0 1 0 0 0]
 [0 1 0 0 0]
 [1 0 0 0 0]
 [1 0 0 0 0]
 [0 0 0 0 1]]
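With only five test tickets, a single split gives a very noisy estimate. One common alternative for small datasets is stratified k-fold cross-validation, sketched below on an invented two-class mini-dataset (the texts, labels, and the `cv_pipe` name are assumptions for illustration, not the lab data).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical mini-dataset: 3 tickets per class
texts = [
    "vpn keeps disconnecting", "wifi signal drops", "choppy audio on network",
    "password reset link expired", "mfa prompt never arrives", "permission denied on shared drive",
]
labels = ["network"] * 3 + ["auth"] * 3

cv_pipe = Pipeline([
    ("tfidf", TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")),
    ("model", LogisticRegression(max_iter=2000)),
])

# Each fold keeps the class balance; the vectorizer is re-fitted inside every fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(cv_pipe, texts, labels, cv=cv, scoring="accuracy")
print("Fold accuracies:", scores, "mean:", scores.mean())
```

Because the vectorizer lives inside the pipeline, each fold builds its vocabulary from its own training portion only, which avoids leaking test vocabulary into training.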




## 8) Production pattern: HashingVectorizer (no stored vocab)


In production, you may need:
- constant memory usage
- privacy (no vocabulary inspection)
- streaming support
- easier deployment across services

**HashingVectorizer** avoids building a vocabulary. Tradeoff: hash collisions (different tokens can land in the same column), and columns cannot be mapped back to tokens.


In [24]:

hash_pipe = Pipeline([
    ("hash", HashingVectorizer(
        n_features=2**18,        # tune for your scale
        alternate_sign=False,    # keep hashed feature values non-negative (no sign flipping)
        ngram_range=(1,2),
        token_pattern=r"(?u)\b\w+\b"
    )),
    ("model", LogisticRegression(max_iter=2000))
])

hash_pipe.fit(X_train, y_train)
pred_hash = hash_pipe.predict(X_test)
print(classification_report(y_test, pred_hash))


              precision    recall  f1-score   support

         app       0.00      0.00      0.00         1
        auth       1.00      1.00      1.00         1
      device       0.00      0.00      0.00         1
   messaging       0.00      0.00      0.00         1
     network       1.00      1.00      1.00         1

    accuracy                           0.40         5
   macro avg       0.40      0.40      0.40         5
weighted avg       0.40      0.40      0.40         5
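The "no stored vocab" property can be seen directly: a `HashingVectorizer` needs no `fit` step, and the output width is fixed by `n_features` regardless of which tokens show up. A minimal sketch, with invented example strings and a deliberately small `n_features`:

```python
from sklearn.feature_extraction.text import HashingVectorizer

hv = HashingVectorizer(
    n_features=2**10,            # small on purpose; more features means fewer collisions
    alternate_sign=False,
    token_pattern=r"(?u)\b\w+\b",
)

# No fit: the token -> column mapping is a hash function, so transform works immediately
X_a = hv.transform(["vpn keeps disconnecting"])
X_b = hv.transform(["a brand new token never seen before"])

print(X_a.shape, X_b.shape)  # same width for both inputs
```

This is what makes the vectorizer suitable for streaming: unseen tokens never change the feature space, they just hash into existing columns.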





## 9) Save and load the model (typical deployment step)


In [25]:
model_path = "week3_text_representation_model.joblib"
joblib.dump(pipeline, model_path)

loaded = joblib.load(model_path)
loaded.predict(["portal returns 500 error after deploy"])



array(['app'], dtype=object)
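A quick sanity check after saving is to confirm that the reloaded pipeline gives the same predictions as the original. The sketch below does this round trip in a temporary directory on an invented mini-dataset (texts, labels, and file name are hypothetical).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical training data
texts = [
    "portal shows 500 error", "api requests timing out",
    "password reset link expired", "mfa prompt never arrives",
]
labels = ["app", "app", "auth", "auth"]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")),
    ("model", LogisticRegression(max_iter=2000)),
]).fit(texts, labels)

# Save and reload inside a temp dir so no files are left behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

queries = ["portal error after deploy", "user cannot reset password"]
same = np.array_equal(pipe.predict(queries), restored.predict(queries))
print("Predictions match after reload:", same)
```

Two practical cautions: load the file with the same scikit-learn version that saved it, and only load joblib/pickle files from sources you trust, since unpickling can execute arbitrary code.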

## Exercises (do these during lab)
1) Add 10 more tickets to `data` with realistic wording and labels. Re-train and compare results.  
2) Try `ngram_range=(1,3)` and observe what changes.  
3) For retrieval, test at least 3 queries and explain why the top result makes sense.  
4) Replace the dataset with a CSV you create (columns: `text`, `label`) and rerun the notebook.

### 1) Add 10 more tickets to `data`, re-train, and compare results


In [None]:
data_extended = data + [
    ("T-016", "DNS lookup fails randomly, websites do not load on office network", "network"),
    ("T-017", "User locked out after too many login attempts, cannot unlock account", "auth"),
    ("T-018", "Slack messages not syncing, notifications delayed on desktop app", "messaging"),
    ("T-019", "Printer prints blank pages, tried reinstalling driver but still happens", "device"),
    ("T-020", "Mobile app crashes on startup after the latest version update", "app"),
    ("T-021", "Ethernet connection drops when laptop sleeps and wakes up", "network"),
    ("T-022", "Two-factor code not accepted even though it is correct", "auth"),
    ("T-023", "Exchange calendar not syncing, meetings missing in Outlook", "messaging"),
    ("T-024", "Touchpad not working after Windows update, device manager shows error", "device"),
    ("T-025", "API returns 403 forbidden for valid token, started after config change", "app"),
]


In [37]:
df2 = pd.DataFrame(data_extended, columns=["ticket_id", "text", "label"])

X_train2, X_test2, y_train2, y_test2 = train_test_split(
    df2["text"], df2["label"], test_size=0.33, random_state=42, stratify=df2["label"]
)

pipeline.fit(X_train2, y_train2)
pred2 = pipeline.predict(X_test2)

print("NEW dataset sizes:")
print("Train:", len(X_train2), "Test:", len(X_test2))

print("\nResults after adding 10 tickets:")
print(classification_report(y_test2, pred2, zero_division=0))
print("Confusion matrix:\n", confusion_matrix(y_test2, pred2))


NEW dataset sizes:
Train: 16 Test: 9

Results after adding 10 tickets:
              precision    recall  f1-score   support

         app       0.00      0.00      0.00         2
        auth       0.17      1.00      0.29         1
      device       1.00      1.00      1.00         2
   messaging       1.00      0.50      0.67         2
     network       0.00      0.00      0.00         2

    accuracy                           0.44         9
   macro avg       0.43      0.50      0.39         9
weighted avg       0.46      0.44      0.40         9

Confusion matrix:
 [[0 2 0 0 0]
 [0 1 0 0 0]
 [0 0 2 0 0]
 [0 1 0 1 0]
 [0 2 0 0 0]]


**Compare Results**

1. Accuracy moved from 40% (2 of 5 correct) to 44% (4 of 9 correct). The two test sets differ in size and content, so this small change is not strong evidence of improvement.
2. The model performed well on the device category (precision and recall are both 1.00).
3. For auth, recall is 1.00 but precision is only 0.17, which means the model produces many false positives for this class.
4. The confusion matrix shows that the model predicted "auth" 6 times, while only 1 of those tickets was actually an authentication issue.
5. The model labels most tickets as auth. It does not miss the authentication ticket, but it creates a lot of false positives to clean up, and no app or network ticket is classified correctly.
6. This is expected, as the dataset is very small.
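The precision and recall figures can be recomputed directly from the confusion matrix printed above: precision divides the diagonal by the column sums (how often a predicted label was right), and recall divides it by the row sums (how many true tickets of a class were found). A short sketch using those exact numbers:

```python
import numpy as np

labels = ["app", "auth", "device", "messaging", "network"]
# Confusion matrix from the run above: rows = true class, columns = predicted class
cm = np.array([
    [0, 2, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 2, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 2, 0, 0, 0],
])

true_pos = np.diag(cm)
predicted = cm.sum(axis=0)  # column sums: how often each class was predicted
actual = cm.sum(axis=1)     # row sums: how many tickets truly belong to each class

# Classes that were never predicted get precision 0 instead of a division error
precision = np.divide(true_pos, predicted, out=np.zeros(len(labels)), where=predicted > 0)
recall = true_pos / actual
accuracy = true_pos.sum() / cm.sum()

for name, p, r in zip(labels, precision, recall):
    print(f"{name:10s} precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The auth column sums to 6 with a single true positive, which gives the 1/6 ≈ 0.17 precision in the report.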

### 2) Try ngram_range=(1,3) and observe what changes

In [38]:
pipeline_123 = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1,3),       # unigrams + bigrams + trigrams
        token_pattern=r"(?u)\b\w+\b",
        sublinear_tf=True
    )),
    ("model", LogisticRegression(max_iter=2000))
])

pipeline_123.fit(X_train2, y_train2)
pred_123 = pipeline_123.predict(X_test2)

print(classification_report(y_test2, pred_123, zero_division=0))
print("Confusion matrix:\n", confusion_matrix(y_test2, pred_123))


              precision    recall  f1-score   support

         app       0.00      0.00      0.00         2
        auth       0.12      1.00      0.22         1
      device       1.00      0.50      0.67         2
   messaging       0.00      0.00      0.00         2
     network       0.00      0.00      0.00         2

    accuracy                           0.22         9
   macro avg       0.23      0.30      0.18         9
weighted avg       0.24      0.22      0.17         9

Confusion matrix:
 [[0 2 0 0 0]
 [0 1 0 0 0]
 [0 1 1 0 0]
 [0 2 0 0 0]
 [0 2 0 0 0]]


**Results**  
1. Accuracy dropped from 0.44 to 0.22, so the model performs worse with trigrams.
2. Trigrams add many more features. With so little data, the model cannot learn reliable patterns from longer phrases, because most trigrams occur in only one ticket.
3. This is most likely an effect of the small training set (16 tickets): the features far outnumber the examples, a setting in which models tend to overfit.
4. The model latches onto small surface similarities instead of general patterns, so it classifies 8 of the 9 test tickets as auth.
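To see how quickly the feature space grows with longer n-grams, the sketch below counts vocabulary size for three `ngram_range` settings on two tickets taken from the dataset wording (the two-document corpus itself is just an illustration).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "vpn keeps disconnecting after latest update",
    "mfa prompt never arrives on mobile app",
]

sizes = {}
for max_n in (1, 2, 3):
    vec = TfidfVectorizer(ngram_range=(1, max_n), token_pattern=r"(?u)\b\w+\b")
    vec.fit(docs)
    sizes[max_n] = len(vec.vocabulary_)
    print(f"ngram_range=(1, {max_n}): {sizes[max_n]} features")
```

Here the vocabulary grows from 13 to 24 to 33 features while the number of documents stays at 2, and every bigram and trigram appears in exactly one document, so none of them can generalize.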

### 3) For retrieval, test at least 3 queries and explain why the top result makes sense.  

In [None]:
print(search_similar("wifi keeps dropping in meeting room", top_k=3))

| | ticket_id | text | label | similarity |
|---|---|---|---|---|
| 5 | T-006 | WiFi signal drops in meeting rooms, access poi... | network | 0.372074 |
| 0 | T-001 | VPN keeps disconnecting every 10 minutes on Wi... | network | 0.099991 |
| 9 | T-010 | API requests timing out, latency spike observe... | app | 0.064914 |


**Why it makes sense**  
The top result (T-006) shares the tokens `wifi`, `in`, and `meeting` and the bigram `in meeting` with the query. `dropping`/`drops` and `room`/`rooms` do not match because no stemming is applied. TF-IDF gives the rarer shared terms (`wifi`, `meeting`) more weight, which makes this ticket the most similar.

In [None]:
search_similar("mfa code not arriving on phone", top_k=3)

| | ticket_id | text | label | similarity |
|---|---|---|---|---|
| 4 | T-005 | MFA prompt never arrives on mobile app, user s... | auth | 0.195177 |
| 12 | T-013 | Push notifications not working on Android for ... | app | 0.146229 |
| 3 | T-004 | Cannot install printer driver, installer fails... | device | 0.135454 |


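**Why it makes sense**  
The top result (T-005) matches on `mfa`, a rare and therefore highly weighted term, plus the common token `on`. `arriving`/`arrives` and `phone`/`mobile` do not match because TF-IDF applies no stemming or synonym handling, which is why the similarity is fairly low (about 0.20) even though the ticket describes the same problem.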

In [None]:
search_similar("Teams audio is not choppy", top_k=3)


| | ticket_id | text | label | similarity |
|---|---|---|---|---|
| 11 | T-012 | Teams calls have choppy audio, jitter high on ... | network | 0.338493 |
| 13 | T-014 | Mailbox is full and cannot receive emails, aut... | messaging | 0.137086 |
| 1 | T-002 | Password reset link is expired and user cannot... | auth | 0.080572 |


**Why it makes sense**  
The top result (T-012) shares the tokens `teams`, `audio`, and `choppy` with the query, and TF-IDF weights these rare terms highly. Note that the query says the audio is *not* choppy: bag-of-words features ignore negation, so the match reflects shared keywords rather than meaning.
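The query "Teams audio is not choppy" still retrieves the choppy-audio ticket because bag-of-words features treat "not" as just one more token. A small sketch with two invented sentences that mean the opposite of each other shows how similar their vectors are:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical pair: same words, opposite meaning
pair = ["teams audio is choppy", "teams audio is not choppy"]

X_pair = CountVectorizer().fit_transform(pair)
sim = cosine_similarity(X_pair[0], X_pair[1])[0, 0]
print(f"cosine similarity: {sim:.3f}")
```

The two sentences share four of five tokens, so the similarity is about 0.89 despite the contradiction. If negation matters for your use case, bigrams such as "not choppy" help a little, but keyword-based retrieval remains limited here.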

### 4) Replace dataset with a CSV you create and rerun the notebook

In [42]:
df2_csv = df2[["text", "label"]]   # only the required columns
df2_csv.to_csv("my_tickets.csv", index=False)
print("Saved my_tickets.csv", df2_csv.shape)


Saved my_tickets.csv (25, 2)
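To finish the exercise, the saved CSV has to be read back and fed into the same split. The sketch below shows one possible loading step with basic validation; it reads from an inline string as a stand-in for `pd.read_csv("my_tickets.csv")` so that it runs anywhere, and the four rows and the `df_csv` name are made up for illustration.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for pd.read_csv("my_tickets.csv"); inline so the example is self-contained
csv_text = """text,label
VPN keeps disconnecting,network
WiFi signal drops,network
Password reset link expired,auth
MFA prompt never arrives,auth
"""
df_csv = pd.read_csv(io.StringIO(csv_text))

# Fail early if the required columns are missing, then drop incomplete rows
missing = {"text", "label"} - set(df_csv.columns)
if missing:
    raise ValueError(f"CSV is missing required columns: {sorted(missing)}")
df_csv = df_csv.dropna(subset=["text", "label"])

X_tr, X_te, y_tr, y_te = train_test_split(
    df_csv["text"], df_csv["label"],
    test_size=0.5, random_state=42, stratify=df_csv["label"],
)
print("Train:", len(X_tr), "Test:", len(X_te))
```

Stratified splitting needs at least two rows per label, so check the label counts of your own CSV before reusing the `test_size=0.33` split from the notebook.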
