# AIG230 NLP (Week 3 Lab) — Notebook 1: Text Representation

This notebook focuses on **turning raw text into numeric features** you can use in real-world ML systems.

You will build:
- a clean **train/test split**
- **Bag-of-Words** (binary and count)
- **Document-Term Matrix** (DTM)
- **TF-IDF** (with n-grams)
- **Hashing trick** (production-friendly)
- basic **retrieval** (cosine similarity) and a **baseline classifier**
- model **persistence** (save/load)

## 0) Setup


In [90]:

import re
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
import joblib


## 1) A small, realistic dataset (you can replace with your own CSV)


In industry, text often comes with:
- an **ID**
- free-text **description**
- a **label** (category, priority, intent, topic) or a target (churn, fraud, etc.)

Here we create a toy dataset that looks like support tickets / ops incidents.  
Swap this section with a `pd.read_csv(...)` in your own workflows.


In [91]:

data = [
    ("T-001", "VPN keeps disconnecting every 10 minutes on Windows 11 after latest update", "network"),
    ("T-002", "Password reset link is expired and user cannot login to the portal", "auth"),
    ("T-003", "Email delivery delayed, outbound messages queued for hours", "messaging"),
    ("T-004", "Cannot install printer driver, installer fails with error code 1603", "device"),
    ("T-005", "MFA prompt never arrives on mobile app, user stuck at login", "auth"),
    ("T-006", "WiFi signal drops in meeting rooms, access point reboot helps temporarily", "network"),
    ("T-007", "Outlook search not returning results, index seems corrupted", "messaging"),
    ("T-008", "Laptop battery drains fast after BIOS update, power settings unchanged", "device"),
    ("T-009", "Portal shows 500 error when submitting form, happened after deployment", "app"),
    ("T-010", "API requests timing out, latency spike observed in last hour", "app"),
    ("T-011", "User cannot access shared drive, permission denied though in correct group", "auth"),
    ("T-012", "Teams calls have choppy audio, jitter high on corporate network", "network"),
    ("T-013", "Push notifications not working on Android for the app", "app"),
    ("T-014", "Mailbox is full and cannot receive emails, auto-archive not running", "messaging"),
    ("T-015", "Bluetooth mouse not pairing after restart, device shows as unknown", "device"),
    # additional tickets (T-016 to T-025)
    ("T-016", "VPN client fails to start on macOS after OS upgrade, error says configuration missing", "network"),
    ("T-017", "User locked out after multiple failed login attempts, account not auto-unlocking", "auth"),
    ("T-018", "Emails sent to external domains bouncing with SPF failure", "messaging"),
    ("T-019", "Docking station not detecting external monitors after firmware update", "device"),
    ("T-020", "Single sign-on loops back to login page repeatedly for some users", "auth"),
    ("T-021", "Network file transfer extremely slow during peak hours", "network"),
    ("T-022", "Calendar invites not syncing between Outlook and mobile devices", "messaging"),
    ("T-023", "Web application loads blank page in Safari but works in Chrome", "app"),
    ("T-024", "Laptop overheating and fan running constantly after driver update", "device"),
    ("T-025", "Background job stuck in pending state, queue length growing steadily", "app"),
]

df = pd.DataFrame(data, columns=["ticket_id", "text", "label"])
df


Unnamed: 0,ticket_id,text,label
0,T-001,VPN keeps disconnecting every 10 minutes on Wi...,network
1,T-002,Password reset link is expired and user cannot...,auth
2,T-003,"Email delivery delayed, outbound messages queu...",messaging
3,T-004,"Cannot install printer driver, installer fails...",device
4,T-005,"MFA prompt never arrives on mobile app, user s...",auth
5,T-006,"WiFi signal drops in meeting rooms, access poi...",network
6,T-007,"Outlook search not returning results, index se...",messaging
7,T-008,"Laptop battery drains fast after BIOS update, ...",device
8,T-009,"Portal shows 500 error when submitting form, h...",app
9,T-010,"API requests timing out, latency spike observe...",app


In [92]:
df_csv = pd.read_csv(
    "input.csv",
    header=None,
    names=["ticket_id", "text", "label"],
    skipinitialspace=True
)
print(df_csv)

df = df_csv

   ticket_id                                               text      label
0      T-001  VPN keeps disconnecting every 10 minutes on Wi...    network
1      T-002  Password reset link is expired and user cannot...       auth
2      T-003  Email delivery delayed, outbound messages queu...  messaging
3      T-004  Cannot install printer driver, installer fails...     device
4      T-005  MFA prompt never arrives on mobile app, user s...       auth
5      T-006  WiFi signal drops in meeting rooms, access poi...    network
6      T-007  Outlook search not returning results, index se...  messaging
7      T-008  Laptop battery drains fast after BIOS update, ...     device
8      T-009  Portal shows 500 error when submitting form, h...        app
9      T-010  API requests timing out, latency spike observe...        app
10     T-011  User cannot access shared drive, permission de...       auth
11     T-012  Teams calls have choppy audio, jitter high on ...    network
12     T-013  Push notifi

### Train/test split


In [93]:

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.33, random_state=42, stratify=df["label"]  # stratify preserves the label distribution in both splits
)

print("Train size:", len(X_train))
print("Test size:", len(X_test))


Train size: 16
Test size: 9
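
To see what `stratify` does, it can help to count the labels in each split. The sketch below is illustrative only: it uses placeholder texts and assumes, like the toy dataset above, 25 tickets with five per label. The variable names are not part of the notebook.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Placeholder data (an assumption for illustration): 25 tickets, 5 per label.
labels = ["network", "auth", "messaging", "device", "app"] * 5
texts = [f"ticket {i}" for i in range(len(labels))]

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.33, random_state=42, stratify=labels
)

# With stratification every label appears in both splits in near-equal numbers.
train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

Because 9 test rows cannot be divided evenly across 5 labels, at least one label ends up with a single test example, which makes per-class metrics on a dataset this small very noisy.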


In [94]:
X_train

9     API requests timing out, latency spike observe...
20    Network file transfer extremely slow during pe...
1     Password reset link is expired and user cannot...
2     Email delivery delayed, outbound messages queu...
11    Teams calls have choppy audio, jitter high on ...
3     Cannot install printer driver, installer fails...
19    Single sign-on loops back to login page repeat...
4     MFA prompt never arrives on mobile app, user s...
12    Push notifications not working on Android for ...
22    Web application loads blank page in Safari but...
7     Laptop battery drains fast after BIOS update, ...
0     VPN keeps disconnecting every 10 minutes on Wi...
17    Emails sent to external domains bouncing with ...
16    User locked out after multiple failed login at...
14    Bluetooth mouse not pairing after restart, dev...
13    Mailbox is full and cannot receive emails, aut...
Name: text, dtype: str

## 2) Tokenization basics and normalization (lightweight, practical)


In production pipelines you typically do **minimal, safe normalization**:
- lowercase
- normalize whitespace
- optionally strip obvious punctuation
- keep numbers when they carry meaning (error codes, versions, dates)

Heavy normalization (stemming, aggressive regexes) can hurt when your text includes:
error codes, product names, IDs, or domain terminology.
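
A small sketch of why heavy normalization can hurt. `aggressive_normalize` is a hypothetical cleaner written only for this comparison, and the example ticket is made up in the style of the dataset above.

```python
import re

def light_normalize(text: str) -> str:
    """Lowercase and collapse whitespace; digits and punctuation are kept."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def aggressive_normalize(text: str) -> str:
    """Hypothetical heavy cleaner: drops everything that is not a letter or a space."""
    text = re.sub(r"[^a-z ]+", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

ticket = "Installer fails with  error code 1603 on Windows 11"
light = light_normalize(ticket)
heavy = aggressive_normalize(ticket)

print(light)  # the error code and OS version survive
print(heavy)  # "1603" and "11" are gone
```

The aggressive version can no longer distinguish error 1603 from any other installer failure, which is exactly the signal a ticket router would want to keep.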


In [95]:

def simple_normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"\s+", " ", text).strip()
    return text

df["text_norm"] = df["text"].map(simple_normalize)
df[["ticket_id","text_norm","label"]].head()


Unnamed: 0,ticket_id,text_norm,label
0,T-001,vpn keeps disconnecting every 10 minutes on wi...,network
1,T-002,password reset link is expired and user cannot...,auth
2,T-003,"email delivery delayed, outbound messages queu...",messaging
3,T-004,"cannot install printer driver, installer fails...",device
4,T-005,"mfa prompt never arrives on mobile app, user s...",auth


## 3) Vocabulary + Document-Term Matrix (DTM) with CountVectorizer


**CountVectorizer** builds:
- a vocabulary (token → column index)
- a sparse matrix where rows are documents and columns are tokens

This is the classic **Document-Term Matrix** representation.
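
The two pieces described above (a vocabulary and a count matrix) can be sketched by hand in plain Python. This is a simplified illustration, not how `CountVectorizer` is implemented internally: it builds a dense list of lists rather than a sparse matrix, and the three documents are made up.

```python
import re
from collections import Counter

docs = [
    "VPN keeps disconnecting",
    "VPN client fails to start",
    "Installer fails with error code 1603",
]

def tokenize(text):
    """Lowercase, then split into word tokens (digits included)."""
    return re.findall(r"\b\w+\b", text.lower())

# Vocabulary: token -> column index, in sorted order.
all_tokens = sorted({tok for d in docs for tok in tokenize(d)})
vocab = {tok: idx for idx, tok in enumerate(all_tokens)}

# Document-term matrix: one row per document, one column per token.
dtm = []
for d in docs:
    counts = Counter(tokenize(d))
    dtm.append([counts.get(tok, 0) for tok in vocab])

print(vocab)
for row in dtm:
    print(row)
```

Digits sort before letters, so `"1603"` lands in column 0 here, just as `"10"`, `"11"` and `"1603"` occupy the first columns of the notebook's vocabulary.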


In [96]:

count_vec = CountVectorizer(
    lowercase=True,
    token_pattern=r"(?u)\b\w+\b",  # \w+ keeps numeric tokens like "500" and "1603" and, unlike the default pattern, single-character tokens
    min_df=1
)

In [97]:

X_train_counts = count_vec.fit_transform(X_train)
X_test_counts  = count_vec.transform(X_test)

print("DTM shape (train):", X_train_counts.shape)
print("Vocabulary size:", len(count_vec.vocabulary_))


DTM shape (train): (16, 130)
Vocabulary size: 130


In [98]:
print(X_train_counts.toarray())

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


### DTM (Document-Term Matrix) as a DataFrame

In [99]:
dtm_df = pd.DataFrame(
    X_train_counts.toarray(),
    columns=count_vec.get_feature_names_out()
)

dtm_df

Unnamed: 0,10,11,1603,account,after,and,android,api,app,application,...,unlocking,update,user,users,vpn,web,windows,with,working,works
0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,0,0
8,0,0,0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,1,0
9,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,1


### Inspect the vocabulary and a single row


In [100]:

# Show a small slice of the vocabulary (token -> index)
# get words ordered by their column position in the vectorized matrix
vocab_items = sorted(count_vec.vocabulary_.items(), key=lambda x: x[1])[:25]
vocab_items


[('10', 0),
 ('11', 1),
 ('1603', 2),
 ('account', 3),
 ('after', 4),
 ('and', 5),
 ('android', 6),
 ('api', 7),
 ('app', 8),
 ('application', 9),
 ('archive', 10),
 ('arrives', 11),
 ('as', 12),
 ('at', 13),
 ('attempts', 14),
 ('audio', 15),
 ('auto', 16),
 ('back', 17),
 ('battery', 18),
 ('bios', 19),
 ('blank', 20),
 ('bluetooth', 21),
 ('bouncing', 22),
 ('but', 23),
 ('calls', 24)]

In [101]:

# Look at a specific document row: non-zero entries (token counts)
row_id = 0
row = X_train_counts[row_id]
inv_vocab = {idx: tok for tok, idx in count_vec.vocabulary_.items()}
print(len(inv_vocab))



130


In [102]:
row

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 10 stored elements and shape (1, 130)>

In [103]:
row.data

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [104]:
row.indices

array([  7,  98, 115,  84,  64, 111,  82,  56,  63,  54], dtype=int32)

In [105]:
row.indptr

array([ 0, 10], dtype=int32)

In [106]:
row.nonzero()

(array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32),
 array([  7,  98, 115,  84,  64, 111,  82,  56,  63,  54], dtype=int32))

The next cell extracts the non-zero tokens of one document, maps their column indices back to tokens, sorts them by count (descending), and shows up to 20 of them.

In [107]:

nz_cols = row.nonzero()[1]
tokens_counts = sorted([(inv_vocab[c], int(row[0, c])) for c in nz_cols], key=lambda x: -x[1])
tokens_counts[:20]


[('api', 1),
 ('requests', 1),
 ('timing', 1),
 ('out', 1),
 ('latency', 1),
 ('spike', 1),
 ('observed', 1),
 ('in', 1),
 ('last', 1),
 ('hour', 1)]
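
The `data`, `indices`, and `indptr` attributes inspected above are the three arrays of the Compressed Sparse Row (CSR) format. As a rough sketch of how they relate, the following plain-Python code builds them from a small made-up dense matrix; SciPy does this far more efficiently, and the helper `row_entries` is hypothetical.

```python
dense = [
    [0, 2, 0, 1],
    [0, 0, 0, 0],
    [3, 0, 0, 4],
]

data = []      # the non-zero values, row by row
indices = []   # the column index of each stored value
indptr = [0]   # indptr[i]:indptr[i + 1] is the slice of data/indices for row i

for row in dense:
    for col, value in enumerate(row):
        if value != 0:
            data.append(value)
            indices.append(col)
    indptr.append(len(data))

def row_entries(i):
    """Return (column, value) pairs stored for row i."""
    start, end = indptr[i], indptr[i + 1]
    return list(zip(indices[start:end], data[start:end]))

print(data, indices, indptr)
print(row_entries(2))
```

For the single-row slice shown earlier, `indptr` is `[0, 10]` because all ten stored values belong to that one row.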

In [108]:
X_train

9     API requests timing out, latency spike observe...
20    Network file transfer extremely slow during pe...
1     Password reset link is expired and user cannot...
2     Email delivery delayed, outbound messages queu...
11    Teams calls have choppy audio, jitter high on ...
3     Cannot install printer driver, installer fails...
19    Single sign-on loops back to login page repeat...
4     MFA prompt never arrives on mobile app, user s...
12    Push notifications not working on Android for ...
22    Web application loads blank page in Safari but...
7     Laptop battery drains fast after BIOS update, ...
0     VPN keeps disconnecting every 10 minutes on Wi...
17    Emails sent to external domains bouncing with ...
16    User locked out after multiple failed login at...
14    Bluetooth mouse not pairing after restart, dev...
13    Mailbox is full and cannot receive emails, aut...
Name: text, dtype: str

## 4) Binary vs Count-based Bag-of-Words

- **Binary BoW**: token present or not (good for short texts and some classification tasks)
- **Count BoW**: raw frequency (baseline for many pipelines)

Both discard word order.
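
In the toy tickets almost every token occurs once per document, so the binary and count matrices look identical. A made-up document with repeated tokens shows the difference; this is a plain-Python sketch, not the `CountVectorizer` implementation.

```python
from collections import Counter

tokens = "login failed login failed login timeout".split()
counts = Counter(tokens)
vocab = sorted(counts)  # ['failed', 'login', 'timeout']

count_row = [counts[tok] for tok in vocab]                    # raw frequencies
binary_row = [1 if counts[tok] > 0 else 0 for tok in vocab]   # presence only

print(vocab)
print("count :", count_row)
print("binary:", binary_row)

# Word order is discarded: these two phrases get the same representation.
same_bag = Counter("user locked account".split()) == Counter("account locked user".split())
print(same_bag)
```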


In [109]:

binary_vec = CountVectorizer(binary=True, token_pattern=r"(?u)\b\w+\b")  # same token pattern as count_vec
X_train_bin = binary_vec.fit_transform(X_train)  # every non-zero entry is 1, however often the token occurs


In [110]:
X_train_bin.shape

(16, 130)

In [111]:
X_train_bin

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 164 stored elements and shape (16, 130)>

In [112]:
print(X_train_bin.toarray())


[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [113]:
dtm_bin_df= pd.DataFrame(
    X_train_bin.toarray(),
    columns=binary_vec.get_feature_names_out()
)

dtm_bin_df

Unnamed: 0,10,11,1603,account,after,and,android,api,app,application,...,unlocking,update,user,users,vpn,web,windows,with,working,works
0,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,0,0
8,0,0,0,0,0,0,1,0,1,0,...,0,0,0,0,0,0,0,0,1,0
9,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,1,0,0,0,1


## 5) TF-IDF (a refinement, not a replacement)



TF-IDF downweights very common tokens and upweights tokens that are more distinctive.

In industry, TF-IDF with n-grams is a strong baseline for:
- ticket routing
- intent detection
- spam detection
- incident clustering
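
To make the weighting concrete, here is a hand-rolled sketch of the formula scikit-learn documents for `TfidfVectorizer` with its defaults `smooth_idf=True` and `norm="l2"`: idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The three tokenized documents and the helper names are made up for illustration.

```python
import math
from collections import Counter

docs = [
    ["vpn", "error", "error"],
    ["login", "error"],
    ["vpn", "slow"],
]
n_docs = len(docs)

# Document frequency: in how many documents each token appears.
doc_freq = Counter(tok for d in docs for tok in set(d))

def idf(tok):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1

def tfidf_row(doc, sublinear_tf=True):
    """Return {token: weight} for one document, L2-normalized."""
    weights = {}
    for tok, count in Counter(doc).items():
        tf = 1 + math.log(count) if sublinear_tf else count
        weights[tok] = tf * idf(tok)
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {tok: w / norm for tok, w in weights.items()}

print(tfidf_row(docs[1]))
```

In the second document, `login` (found in one document) receives a higher weight than `error` (found in two), which is the "upweight distinctive tokens" behavior described above.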



In [114]:

tfidf_vec = TfidfVectorizer(ngram_range=(1,2),
                            token_pattern=r"(?u)\b\w+\b",
                            min_df=1,
                            sublinear_tf=True  # sublinear tf scaling: replace tf with 1 + log(tf)
                            )

X_train_tfidf = tfidf_vec.fit_transform(X_train)
X_test_tfidf = tfidf_vec.transform(X_test)

In [115]:
print(X_train_tfidf.shape)

(16, 279)


In [116]:
tf_idf_df= pd.DataFrame(
    X_train_tfidf.toarray(),
    columns=tfidf_vec.get_feature_names_out()
)

tf_idf_df

Unnamed: 0,10,10 minutes,11,11 after,1603,account,account not,after,after bios,after latest,...,web application,windows,windows 11,with,with error,with spf,working,working on,works,works in
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.233344,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.203213,0.233344,0.0,0.0,0.0,0.0,0.0
6,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.257784,0.257784,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.218569,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.218569,0.218569


## 6) Quick retrieval: 'find similar tickets' with cosine similarity


A very common industry use case is **nearest neighbor retrieval** for:
- deduplication
- suggesting knowledge base articles
- finding similar past incidents
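
Cosine similarity measures the angle between two vectors, so it ignores document length and only compares the direction of the token weights. A minimal plain-Python sketch follows; the notebook itself uses scikit-learn's `cosine_similarity`, and the zero-vector guard here is a choice made for this illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

print(cosine([1, 0, 1], [2, 0, 2]))  # same direction, different length
print(cosine([1, 0], [0, 1]))        # no shared tokens
print(cosine([0, 0], [1, 1]))        # query with no known tokens
```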


In [124]:
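# Build a search index from ALL tickets using TF-IDF.
# Note: fit_transform refits tfidf_vec, so X_train_tfidf / X_test_tfidf from section 5
# no longer match its vocabulary after this cell runs.
X_all = tfidf_vec.fit_transform(df["text"])

def search_similar(query: str, top_k: int = 5):
    qv = tfidf_vec.transform([query])            # vectorize the query with the index vocabulary
    sims = cosine_similarity(qv, X_all).ravel()  # one similarity score per ticket
    top_idx = np.argsort(-sims)[:top_k]          # row positions of the top_k highest scores
    # top_idx holds positions, so use iloc (loc only works with a default RangeIndex)
    return df.iloc[top_idx][["ticket_id", "text", "label"]].assign(similarity=sims[top_idx])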

search_similar("Calling second time, but not getting response", top_k=5)

Unnamed: 0,ticket_id,text,label,similarity
22,T-023,Web application loads blank page in Safari but...,app,0.190051
6,T-007,"Outlook search not returning results, index se...",messaging,0.084692
12,T-013,Push notifications not working on Android for ...,app,0.082245
18,T-019,Docking station not detecting external monitor...,device,0.08216
21,T-022,Calendar invites not syncing between Outlook a...,messaging,0.080955


In [118]:
search_similar("I love NLP", top_k=5)

Unnamed: 0,ticket_id,text,label,similarity
0,T-001,VPN keeps disconnecting every 10 minutes on Wi...,network,0.0
22,T-023,Web application loads blank page in Safari but...,app,0.0
21,T-022,Calendar invites not syncing between Outlook a...,messaging,0.0
20,T-021,Network file transfer extremely slow during pe...,network,0.0
19,T-020,Single sign-on loops back to login page repeat...,auth,0.0
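
The all-zero scores above arise because none of the query tokens occur in the fitted vocabulary, so the "top 5" is just an arbitrary set of tickets. One possible guard, sketched here in plain Python, is to drop results below a minimum similarity. The function name and the `0.05` threshold are hypothetical and would need tuning on real data.

```python
def top_matches(sims, top_k=5, min_similarity=0.05):
    """Return (position, score) pairs sorted by score, dropping weak matches."""
    ranked = sorted(enumerate(sims), key=lambda pair: -pair[1])
    return [(i, s) for i, s in ranked[:top_k] if s >= min_similarity]

no_overlap = top_matches([0.0, 0.0, 0.0])
some_overlap = top_matches([0.27, 0.0, 0.12, 0.06], top_k=2)

print(no_overlap)    # nothing is returned when every score is zero
print(some_overlap)
```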


In [119]:
search_similar("I needed to login to check weather", top_k=5)

Unnamed: 0,ticket_id,text,label,similarity
1,T-002,Password reset link is expired and user cannot...,auth,0.274129
19,T-020,Single sign-on loops back to login page repeat...,auth,0.267126
17,T-018,Emails sent to external domains bouncing with ...,messaging,0.115244
15,T-016,VPN client fails to start on macOS after OS up...,network,0.092813
4,T-005,"MFA prompt never arrives on mobile app, user s...",auth,0.062477


## 7) Classification baseline (Logistic Regression)


For text classification, a strong baseline is:

**TF-IDF → Linear model (LogReg / Linear SVM)**

This is fast, reliable, easy to explain, and often hard to beat without deep learning.


In [120]:

clf = LogisticRegression(max_iter=2000)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1,3),
        token_pattern=r"(?u)\b\w+\b",
        sublinear_tf=True
    )),
    ("model", clf)
])

pipeline.fit(X_train, y_train)
pred = pipeline.predict(X_test)

print(classification_report(y_test, pred))
print("Confusion matrix:\n", confusion_matrix(y_test, pred))


              precision    recall  f1-score   support

         app       0.00      0.00      0.00         2
        auth       0.12      1.00      0.22         1
      device       0.00      0.00      0.00         2
   messaging       0.00      0.00      0.00         2
     network       0.00      0.00      0.00         2

    accuracy                           0.11         9
   macro avg       0.03      0.20      0.04         9
weighted avg       0.01      0.11      0.02         9

Confusion matrix:
 [[0 2 0 0 0]
 [0 1 0 0 0]
 [0 2 0 0 0]
 [0 2 0 0 0]
 [1 1 0 0 0]]
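
The numbers in the report can be recomputed from the confusion matrix, where rows are true labels and columns are predicted labels (in sorted label order). The sketch below copies the matrix printed above; the helper names are illustrative.

```python
labels = ["app", "auth", "device", "messaging", "network"]
cm = [
    [0, 2, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 2, 0, 0, 0],
    [0, 2, 0, 0, 0],
    [1, 1, 0, 0, 0],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(labels))) / total

def precision(label):
    """Correct predictions of `label` divided by all predictions of `label`."""
    j = labels.index(label)
    predicted = sum(cm[i][j] for i in range(len(labels)))
    return cm[j][j] / predicted if predicted else 0.0

def recall(label):
    """Correct predictions of `label` divided by all true examples of `label`."""
    i = labels.index(label)
    actual = sum(cm[i])
    return cm[i][i] / actual if actual else 0.0

print(f"accuracy={accuracy:.2f}")
print(f"auth: precision={precision('auth'):.3f} recall={recall('auth'):.2f}")
```

Eight of the nine test tickets are predicted as `auth`, which explains its perfect recall and low precision. With only 16 training examples the model has too little evidence to separate the classes, so these scores say more about the dataset size than about the TF-IDF plus linear model approach.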




## 8) Production pattern: HashingVectorizer (no stored vocab)


In production, you may need:
- constant memory usage
- privacy (no stored vocabulary that exposes raw tokens)
- streaming support
- easier deployment across services

**HashingVectorizer** avoids building a vocabulary. Tradeoff: collisions.

https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html

https://kavita-ganesan.com/hashingvectorizer-vs-countvectorizer/
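
The idea behind the hashing trick, and the collision trade-off, can be sketched in a few lines. This is an illustration only: it uses MD5 from the standard library as a stand-in hash so the result is reproducible, whereas scikit-learn uses a different hash function (MurmurHash3), and the function names are hypothetical.

```python
import hashlib

def hashed_index(token, n_features):
    """Map a token to a column index without any stored vocabulary."""
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_features

def hash_vectorize(tokens, n_features=16):
    """Count tokens into a fixed-size vector; distinct tokens may share a column."""
    vec = [0] * n_features
    for tok in tokens:
        vec[hashed_index(tok, n_features)] += 1
    return vec

vec = hash_vectorize(["vpn", "vpn", "login"])
print(vec)

# With a single column every token collides, the extreme case of the trade-off.
print(hash_vectorize(["vpn", "login"], n_features=1))
```

The vector length is fixed up front, so memory stays constant and unseen tokens never cause errors. The cost is that columns can no longer be mapped back to tokens, and a small `n_features` merges unrelated tokens.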


In [121]:
hash_pipe = Pipeline([
    ("hash", HashingVectorizer(
        n_features=2**18,        # tune for your scale
        alternate_sign=False,    # no sign flipping, so hashed feature values stay non-negative
        ngram_range=(1,3),
        token_pattern=r"(?u)\b\w+\b"
    )),
    ("model", LogisticRegression(max_iter=2000))
])

hash_pipe.fit(X_train, y_train)
pred_hash = hash_pipe.predict(X_test)

print(classification_report(y_test, pred_hash))
print("Confusion matrix:\n", confusion_matrix(y_test, pred_hash))


              precision    recall  f1-score   support

         app       0.00      0.00      0.00         2
        auth       0.11      1.00      0.20         1
      device       0.00      0.00      0.00         2
   messaging       0.00      0.00      0.00         2
     network       0.00      0.00      0.00         2

    accuracy                           0.11         9
   macro avg       0.02      0.20      0.04         9
weighted avg       0.01      0.11      0.02         9

Confusion matrix:
 [[0 2 0 0 0]
 [0 1 0 0 0]
 [0 2 0 0 0]
 [0 2 0 0 0]
 [0 2 0 0 0]]




In [122]:
hash_pipe



## 9) Save and load the model (typical deployment step)


In [123]:
model_path = "week3_text_representation_model.joblib"
joblib.dump(pipeline, model_path)

loaded = joblib.load(model_path)
loaded.predict(["portal returns 500 error after deploy"])




array(['auth'], dtype=object)
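
Before shipping a saved model, a quick round-trip check can confirm that the loaded pipeline behaves like the one in memory. The sketch below is self-contained and uses a tiny made-up training set and a temporary file; the texts, labels, and file name are assumptions for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny placeholder training set (not the notebook's data).
texts = [
    "vpn keeps disconnecting",
    "password reset link expired",
    "wifi signal drops",
    "user cannot login to portal",
]
labels = ["network", "auth", "network", "auth"]

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression(max_iter=1000)),
])
pipe.fit(texts, labels)

queries = ["vpn drops", "login link expired"]
before = pipe.predict(queries)

# Save to a temporary directory, load it back, and predict again.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

after = restored.predict(queries)
same = list(before) == list(after)
print(same)
```

Saving the whole `Pipeline` keeps the vectorizer and the classifier together, so the loaded object accepts raw strings. Since joblib files are pickle-based, only load files from sources you trust, and keep the scikit-learn version consistent between saving and loading.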

## Exercises (do these during lab)
1) Add 10 more tickets to `data` with realistic wording and labels. Re-train and compare results.  
2) Try `ngram_range=(1,3)` in the section 5 `TfidfVectorizer` (currently `(1,2)`) and observe how the feature count changes.  
3) For retrieval, test at least 3 queries and explain why the top result makes sense.  
4) Replace the dataset with a CSV you create (columns: `ticket_id`, `text`, `label`) and rerun the notebook.
