In [23]:
# HW 3 - NYT Articles Notebook
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer


import warnings

# This suppresses all warnings (UserWarning, DeprecationWarning, etc.)
warnings.filterwarnings("ignore")

In [2]:
df = pd.read_csv("nyt.csv")
print(df.head())
print(df.shape)

                                                text   label
0  (reuters) - carlos tevez sealed his move to ju...  sports
1  if professional pride and strong defiance can ...  sports
2  palermo, sicily — roberta vinci beat top-seede...  sports
3  spain's big two soccer teams face a pair of it...  sports
4  the argentine soccer club san lorenzo complete...  sports
(11519, 2)


**1. Data preparation**
* Train-Validation-Test -> 80:10:10 split with random_state=42
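
The 80:10:10 split is done in two stages (20% held out, then halved), so the resulting sizes can be checked with simple arithmetic. The sketch below is a rough check and not part of the assignment code. It assumes a fractional `test_size` is rounded up to a whole number of rows, with the remainder going to the other part. `split_sizes` is a hypothetical helper introduced only for illustration.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_first, n_second) for a fractional test_size.

    Assumption: the second part is rounded up and the first part gets the rest.
    """
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second

n_total = 11519  # number of rows, taken from df.shape above

n_train, n_temp = split_sizes(n_total, 0.20)  # 80% train, 20% held out
n_val, n_test = split_sizes(n_temp, 0.50)     # held-out rows split in half

print(n_train, n_val, n_test)  # 9215 1152 1152
```

Because 11519 is not divisible by 10, the three parts cannot be exactly 80%, 10% and 10%; the rounding decides which part receives the extra rows.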


In [3]:
print(f"distinct y-values--{df['label'].unique()}")

# X and y -> numpy objects : to train and test the models
X = df['text'].astype(str).values
y = df['label']

# Encode 'labels' as integers since categorical variable
le = LabelEncoder()
y_enc = le.fit_transform(y)

# np.unique(y_enc), 
#np.unique(X)

## Train-Validation-Test : 80:10:10 split, random_state=42
#---------------------------------------------------------
# 1. split into 80% train + 20% temp
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y_enc,
    test_size=0.20,
    random_state=42,
    stratify=y_enc
)

# 2. split temp into 10% val + 10% test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp,
    test_size=0.50,
    random_state=42,
    stratify=y_temp
)
print(f'Total rows {df.shape[0]}, 80%of it = {df.shape[0]*0.8}, 10%of it= {df.shape[0]*0.1}')
print("Train size:", len(X_train))
print("Validation size:", len(X_val))
print("Test size:", len(X_test))

distinct y-values--['sports' 'business' 'politics']
Total rows 11519, 80%of it = 9215.2, 10%of it= 1151.9
Train size: 9215
Validation size: 1152
Test size: 1152


**Part 1: Bag Of Words (20 points):**

* Here we are training a text classifier via *multi-class logistic regression model*.
* Each document is represented as a binary-valued vector of dimension equal to the size of the
vocabulary. The value at an index is 1 if the word corresponding to that index is present in the
document, else 0.
* We are using *macro-f1* score for evaluation.
* Macro F1 = average of F1-scores for each class (unweighted).
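
To make the macro-F1 definition concrete, here is a small standard-library sketch that computes per-class F1 by hand and then takes the unweighted mean. The labels are toy values invented for illustration, and `f1_per_class` is a hypothetical helper; in the notebook itself the score comes from `f1_score(..., average="macro")`.

```python
def f1_per_class(y_true, y_pred, labels):
    """F1 for each label, from true positives, false positives and false negatives."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

# Toy labels (invented): class 0 is the majority class.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

per_class = f1_per_class(y_true, y_pred, labels=[0, 1, 2])
macro_f1 = sum(per_class) / len(per_class)  # every class counts equally
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print([round(s, 3) for s in per_class])  # [0.75, 0.8, 0.667]
print(round(macro_f1, 3), accuracy)      # 0.739 0.75
```

The smallest class pulls the macro average below the accuracy here. For single-label multi-class problems, micro F1 equals accuracy, so a gap between macro F1 and accuracy usually points to weaker performance on a smaller class.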
  

(a) **Binary Valued Vector**:

Question answered for each word: does it appear in the document? Yes = 1, No = 0.
* `CountVectorizer` builds a vocabulary from the training text and turns each document into a numerical vector. Settings used here:
    - `binary=True` → 1 if the word appears, 0 otherwise (no word counts)
    - `stop_words="english"` → drop very common English words such as "the"
    - `max_features` → cap on the vocabulary size (the most frequent terms are kept). The cell below passes 200000, which is larger than the 60,681 distinct terms found in the training text, so no terms are dropped.
* Multiclass logistic regression:
    - The validation macro-F1 is 0.962 and the test macro-F1 is 0.975.


In [4]:
# (a) binary valued vector:


# 1. CountVectorizer:
#-----------------------------------
binary_vectorizer = CountVectorizer(
    binary=True,
    stop_words="english", 
    max_features=200000
)

# 1.1: Fit the vectorizer on the TRAINING text:
X_train_binary = binary_vectorizer.fit_transform(X_train)
X_val_binary   = binary_vectorizer.transform(X_val)
X_test_binary = binary_vectorizer.transform(X_test)


# 2. LogisticRegression multi class classifier:
#---------------------------------------------------
model_a = LogisticRegression(
    max_iter=1000,
    n_jobs=-1
)

# 2.1: Train the classifier using the training vectors:
model_a.fit(X_train_binary, y_train)


# 2.2:Predict on the train-val-test:
y_train_pred = model_a.predict(X_train_binary)
y_val_pred = model_a.predict(X_val_binary)
y_test_pred = model_a.predict(X_test_binary)

#2.3: Evaluate using MACRO F1:
#-----------------------------------

# Evaluation on train and Validation set:
f1_train= f1_score(y_train,y_train_pred, average='macro')
f1_val = f1_score(y_val, y_val_pred, average="macro")

# Evaluation on test set :
f1_test = f1_score(y_test, y_test_pred, average="macro")


# RESULTS:
#----------------------------------------
print("Binary Bag-of-Words Model: Results")
print("-------------------------------------")
print("Training shape:", X_train_binary.shape)
print("Validation shape:", X_val_binary.shape)
print("Test shape:", X_test_binary.shape)
print("\nTrain set Macro F1:",f1_train)
print("Validation Macro F1:", f1_val)
print("Test Macro F1:", f1_test)


Binary Bag-of-Words Model: Results
-------------------------------------
Training shape: (9215, 60681)
Validation shape: (1152, 60681)
Test shape: (1152, 60681)

Train set Macro F1: 1.0
Validation Macro F1: 0.962051917357083
Test Macro F1: 0.9754474603839395


**(b) Frequency Vector**

In this part we count how many times each word appears in a document.
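
To illustrate how the frequency vector differs from the binary vector of part (a), here is a small standard-library sketch. The documents are invented, and the tokenization is a plain whitespace split with no stop-word removal, so it only approximates what `CountVectorizer` does. `count_vector` and `binary_vector` are hypothetical helpers.

```python
from collections import Counter

# Toy documents, invented for illustration.
toy_docs = [
    "the team won the match and the team celebrated",
    "markets fell as the bank raised rates",
]

# Vocabulary: sorted unique tokens across the toy corpus.
toy_vocab = sorted({tok for doc in toy_docs for tok in doc.split()})

def count_vector(doc):
    """Number of occurrences of each vocabulary term in the document."""
    counts = Counter(doc.split())
    return [counts[tok] for tok in toy_vocab]

def binary_vector(doc):
    """1 if the vocabulary term occurs at least once, else 0."""
    return [1 if c > 0 else 0 for c in count_vector(doc)]

i_team, i_the = toy_vocab.index("team"), toy_vocab.index("the")
print(count_vector(toy_docs[0])[i_team], binary_vector(toy_docs[0])[i_team])  # 2 1
print(count_vector(toy_docs[0])[i_the], binary_vector(toy_docs[0])[i_the])    # 3 1
```

The two vectors share the same vocabulary and the same non-zero positions; only the magnitudes differ.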

In [5]:
# (b) frequency vector
# 1. Count Vectorizer:
#-------------------------
freq_vectorizer = CountVectorizer(
    binary=False,          # use counts instead of 0/1
    stop_words="english",  # remove common English stop words
    max_features=20000     # limit vocabulary size
)

# 1.1: Fit the vectorizer on the TRAINING text:
X_train_freq = freq_vectorizer.fit_transform(X_train)
X_val_freq   = freq_vectorizer.transform(X_val)
X_test_freq = freq_vectorizer.transform(X_test)

# 2. LogisticRegression multi class classifier :
#---------------------------------------------------
model_b = LogisticRegression(
    max_iter=1000,
    n_jobs=-1
)
# 2.1: Train the classifier using the training vectors:
model_b.fit(X_train_freq, y_train)


# 2.2:Predict on the train-val-test:
y_train_pred = model_b.predict(X_train_freq)
y_val_pred = model_b.predict(X_val_freq)
y_test_pred = model_b.predict(X_test_freq)

#2.3: Evaluate using MACRO F1:
#-----------------------------------

# Evaluation on train and Validation set:
f1_train= f1_score(y_train,y_train_pred, average='macro')
f1_val = f1_score(y_val, y_val_pred, average="macro")

# Evaluation on test set :
f1_test = f1_score(y_test, y_test_pred, average="macro")


# RESULTS:
#----------------------------------------
print("Frequency Bag-of-Words Model: Results")
print("-------------------------------------")
print("Training shape:", X_train_freq.shape)
print("Validation shape:", X_val_freq.shape)
print("Test shape:", X_test_freq.shape)
print("\nTrain set Macro F1:",f1_train)
print("Validation Macro F1:", f1_val)
print("Test Macro F1:", f1_test)



Frequency Bag-of-Words Model: Results
-------------------------------------
Training shape: (9215, 20000)
Validation shape: (1152, 20000)
Test shape: (1152, 20000)

Train set Macro F1: 1.0
Validation Macro F1: 0.9711100702364143
Test Macro F1: 0.9834104938271605


**(c) TF-IDF Vector**:
 
 Here we weight each word by how important it is to a given document relative to the rest of the corpus.
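
As a rough illustration of the weighting, the sketch below computes TF-IDF by hand on a toy corpus using only the standard library. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which to my understanding matches the `TfidfVectorizer` defaults. The corpus and the helper `tfidf_vector` are invented for illustration.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration. "team" occurs in every document.
toy_corpus = [
    "goal goal match team",
    "market bank team",
    "election vote team",
]

tokenized = [d.split() for d in toy_corpus]
terms = sorted({t for doc in tokenized for t in doc})
n_docs = len(toy_corpus)

# Document frequency: in how many documents each term occurs.
doc_freq = {t: sum(t in doc for doc in tokenized) for t in terms}

# Smoothed inverse document frequency (assumed formula, see lead-in).
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in terms}

def tfidf_vector(tokens):
    """Raw term count times IDF, scaled to unit Euclidean length."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in terms]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm if norm else 0.0 for x in raw]

weights = dict(zip(terms, tfidf_vector(tokenized[0])))
print({t: round(w, 3) for t, w in weights.items() if w > 0})
```

In the first document, "goal" receives the largest weight (frequent here, rare elsewhere), while "team" receives the smallest because it appears in every document.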


In [6]:
# (c) tf-idf vector
from sklearn.feature_extraction.text import TfidfVectorizer

# 1. TF-IDF Vectorizer:
#-------------------------
tfid_vectorizer = TfidfVectorizer(
    stop_words="english",  # remove common English stop words
    max_features=20000     # limit vocabulary size
)

# 1.1: Fit the vectorizer on the TRAINING text:
X_train_tfid = tfid_vectorizer.fit_transform(X_train)
X_val_tfid   = tfid_vectorizer.transform(X_val)
X_test_tfid = tfid_vectorizer.transform(X_test)

# 2. LogisticRegression multi class classifier :
#---------------------------------------------------
model_c = LogisticRegression(
    max_iter=1000,
    n_jobs=-1
)
# 2.1: Train the classifier using the training vectors:
model_c.fit(X_train_tfid, y_train)


# 2.2:Predict on the train-val-test:
y_train_pred = model_c.predict(X_train_tfid)
y_val_pred = model_c.predict(X_val_tfid)
y_test_pred = model_c.predict(X_test_tfid)

#2.3: Evaluate using MACRO F1:
#-----------------------------------

# Evaluation on train and Validation set:
f1_train= f1_score(y_train,y_train_pred, average='macro')
f1_val = f1_score(y_val, y_val_pred, average="macro")

# Evaluation on test set :
f1_test = f1_score(y_test, y_test_pred, average="macro")


# RESULTS:
#----------------------------------------
print("TFID Bag-of-Words Model: Results")
print("-------------------------------------")
print("Training shape:", X_train_tfid.shape)
print("Validation shape:", X_val_tfid.shape)
print("Test shape:", X_test_tfid.shape)
print("\nTrain set Macro F1:",f1_train)
print("Validation Macro F1:", f1_val)
print("Test Macro F1:", f1_test)



TFID Bag-of-Words Model: Results
-------------------------------------
Training shape: (9215, 20000)
Validation shape: (1152, 20000)
Test shape: (1152, 20000)

Train set Macro F1: 0.986960895811587
Validation Macro F1: 0.9591570379635157
Test Macro F1: 0.9756933280640178


**Part 2: Word2Vec (20 points):**
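
Both embedding models in this part represent a document as the mean of the vectors of its known words, falling back to a zero vector when no word is in the vocabulary. The sketch below shows that idea with plain Python lists. The 3-dimensional embeddings are invented for illustration, and `mean_vector` is a hypothetical helper (the notebook cells use `np.mean` over 100-dimensional vectors).

```python
def mean_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    # zip(*known) iterates over dimensions, so each column is averaged.
    return [sum(col) / len(known) for col in zip(*known)]

# Hypothetical 3-dimensional embeddings, invented for illustration.
toy_embeddings = {
    "goal":   [1.0, 0.0, 0.0],
    "match":  [0.0, 1.0, 0.0],
    "market": [0.0, 0.0, 1.0],
}

print(mean_vector("goal match xyzzy".split(), toy_embeddings, dim=3))  # [0.5, 0.5, 0.0]
print(mean_vector("xyzzy".split(), toy_embeddings, dim=3))             # [0.0, 0.0, 0.0]
```

Averaging discards word order and document length, which is one reason a 100-dimensional mean vector can carry less class information than a sparse bag-of-words vector.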

In [15]:
# 3(i): Pretrained GloVe model performance:
###-------------------------------------------


# 1. Download the Glove model
#----------------------------------
import gensim.downloader as api

print("Downloading/Loading GloVe model...")
# This will download the data (approx. 128MB) to your home directory
# equivalent to glove.6B.100d.txt
glove_model = api.load("glove-wiki-gigaword-100")

print("Model loaded!")

# Test it: Get the vector for 'king'
vector = glove_model['king']
print(f"Vector shape: {vector.shape}")  # Should be (100,)

Downloading/Loading GloVe model...
Model loaded!
Vector shape: (100,)


In [16]:
EMB_DIM = glove_model.vector_size

# 2. Convert document to averaged GloVe vector
#---------------------------------------------
def glove_doc_vector(doc):
    words = doc.lower().split()
    vectors = [glove_model[w] for w in words if w in glove_model]
    
    if not len(vectors):  # if no words found
        return np.zeros(EMB_DIM)
    
    return np.mean(vectors, axis=0)

# 2.1 Build document vectors
X_train_glove = np.vstack([glove_doc_vector(doc) for doc in X_train])
X_val_glove   = np.vstack([glove_doc_vector(doc) for doc in X_val])
X_test_glove  = np.vstack([glove_doc_vector(doc) for doc in X_test])

print("Document vector shape:", X_train_glove.shape)

# 3.Logistic Regression classifier
#-----------------------------------
model_glove = LogisticRegression(max_iter=1000, n_jobs=-1)
model_glove.fit(X_train_glove, y_train)

# 4. Evaluation
#---------------------------
# Validation scores
y_val_pred = model_glove.predict(X_val_glove)
val_accuracy = accuracy_score(y_val, y_val_pred)
val_macro_f1 = f1_score(y_val, y_val_pred, average="macro")
val_micro_f1 = f1_score(y_val, y_val_pred, average="micro")

# Test scores (final evaluation)
y_test_pred = model_glove.predict(X_test_glove)
test_accuracy = accuracy_score(y_test, y_test_pred)
test_macro_f1 = f1_score(y_test, y_test_pred, average="macro")
test_micro_f1 = f1_score(y_test, y_test_pred, average="micro")

print("\n--- GloVe 100d Results ---")
print("\nVALIDATION SET PERFORMANCE")
print("Validation Accuracy :", val_accuracy)
print("Validation Macro F1 :", val_macro_f1)
print("Validation Micro F1 :", val_micro_f1)
print("\nTEST SET PERFORMANCE")
print("Test Accuracy :", test_accuracy)
print("Test Macro F1 :", test_macro_f1)
print("Test Micro F1 :", test_micro_f1)

Document vector shape: (9215, 100)

--- GloVe 100d Results ---

VALIDATION SET PERFORMANCE
Validation Accuracy : 0.9765625
Validation Macro F1 : 0.945777565768855
Validation Micro F1 : 0.9765625

TEST SET PERFORMANCE
Test Accuracy : 0.9852430555555556
Test Macro F1 : 0.9646170284241168
Test Micro F1 : 0.9852430555555556


In [24]:
# 3(ii) Word2Vec Model :
##--------------------------------------------

from gensim.models import Word2Vec
import re
from nltk.tokenize import word_tokenize
import nltk

# Download tokenizer resource if not present
#nltk.download('punkt_tab')



# 1. Create tokenized sentences from the training text
# ---------------------------------------------------------

# Manual tokenizer using regex to remove punctuation
# def tokenize(doc):
#     return re.findall(r"\b[a-zA-Z]+\b", doc.lower())
#train_sentences = [tokenize(doc) for doc in X_train]

train_sentences=[word_tokenize(doc) for doc in X_train]

# 2. Train Word2Vec model
# ---------------------------------------------------------

w2v_model = Word2Vec(
    sentences=train_sentences,
    vector_size=100,   # 100-dimensional as required
    window=5,          # context window
    min_count=2,       # ignore words seen fewer than 2 times
    workers=4,         # parallel training (results can vary slightly between runs)
    sg=1               # sg=1: skip-gram (sg=0 would be CBOW)
)

EMB_DIM = w2v_model.vector_size


# 3. Convert documents to averaged W2V vectors
# ---------------------------------------------------------

def w2v_doc_vector(doc):
    #tokens = tokenize(doc)
    tokens=word_tokenize(doc)
    vectors = [w2v_model.wv[w] for w in tokens if w in w2v_model.wv] #Access the vectors via .wv
    
    if not vectors:
        return np.zeros(EMB_DIM)
    
    return np.mean(vectors, axis=0)

# Build vectors for train/validation/test
X_train_w2v = np.vstack([w2v_doc_vector(doc) for doc in X_train])
X_val_w2v   = np.vstack([w2v_doc_vector(doc) for doc in X_val])
X_test_w2v  = np.vstack([w2v_doc_vector(doc) for doc in X_test])

print("W2V document vector shape:", X_train_w2v.shape)


# 4. Train Logistic Regression classifier
# ---------------------------------------------------------

model_w2v = LogisticRegression(max_iter=1000, n_jobs=-1)
model_w2v.fit(X_train_w2v, y_train)


# 5. Evaluation:
# ---------------------------------------------------------

y_val_pred = model_w2v.predict(X_val_w2v)

val_accuracy = accuracy_score(y_val, y_val_pred)
val_macro_f1 = f1_score(y_val, y_val_pred, average="macro")
val_micro_f1 = f1_score(y_val, y_val_pred, average="micro")



y_test_pred = model_w2v.predict(X_test_w2v)

test_accuracy = accuracy_score(y_test, y_test_pred)
test_macro_f1 = f1_score(y_test, y_test_pred, average="macro")
test_micro_f1 = f1_score(y_test, y_test_pred, average="micro")

print("\n--- Word2Vec (trained on NYT) Results ---")
print("Validation Accuracy :", val_accuracy)
print("Validation Macro F1 :", val_macro_f1)
print("Validation Micro F1 :", val_micro_f1)

print("\nTEST SET PERFORMANCE")
print("Test Accuracy :", test_accuracy)
print("Test Macro F1 :", test_macro_f1)
print("Test Micro F1 :", test_micro_f1)


Exception ignored in: 'gensim.models.word2vec_inner.our_dot_float'

W2V document vector shape: (9215, 100)

--- Word2Vec (trained on NYT) Results ---
Validation Accuracy : 0.9644097222222222
Validation Macro F1 : 0.9227649132845706
Validation Micro F1 : 0.9644097222222222

TEST SET PERFORMANCE
Test Accuracy : 0.9791666666666666
Test Macro F1 : 0.948267861342574
Test Micro F1 : 0.9791666666666666


**Part 3: BERT (20 points):**
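
The dataset class in this part asks the tokenizer for `padding="max_length"` and `truncation=True`, so every article becomes a sequence of exactly `MAX_LEN` positions. The sketch below is a simplified, standard-library illustration of that step, not the tokenizer's actual implementation. `encode_fixed_length` is a hypothetical helper, the word-piece IDs passed in are arbitrary, and the special-token IDs are the ones I understand `bert-base-uncased` to use.

```python
# Assumed special-token IDs for bert-base-uncased: [CLS], [SEP], [PAD].
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0

def encode_fixed_length(token_ids, max_len):
    """Truncate or pad a list of word-piece IDs to exactly max_len positions."""
    body = token_ids[: max_len - 2]        # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)  # 1 marks a real token
    n_pad = max_len - len(input_ids)
    # 0 in the mask marks padding, which the model should ignore.
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad

short_ids, short_mask = encode_fixed_length([7592, 2088], max_len=6)
long_ids, long_mask = encode_fixed_length(list(range(1000, 1010)), max_len=6)

print(short_ids, short_mask)  # [101, 7592, 2088, 102, 0, 0] [1, 1, 1, 1, 0, 0]
print(long_ids, long_mask)    # [101, 1000, 1001, 1002, 1003, 102] [1, 1, 1, 1, 1, 1]
```

With `MAX_LEN = 64`, only roughly the first 62 word pieces of each article reach the model, which is worth keeping in mind when comparing BERT with the bag-of-words models that see the full text.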

In [None]:
# Packages for Bert:
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

from transformers import BertTokenizer, BertForSequenceClassification
from torch.optim import AdamW

from tqdm import tqdm


In [37]:
# 1. Set Parameters:
#-----------------------
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

MAX_LEN = 64
BATCH_SIZE = 16
EPOCHS = 3
NUM_LABELS = len(set(y_train))  # typically 3

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cpu


In [39]:
# 2. Dataset Class:
#----------------------------------
class NYTDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels
        
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = int(self.labels[idx])
        
        enc = tokenizer(
            text,
            padding="max_length",
            truncation=True,
            max_length=MAX_LEN,
            return_tensors="pt"
        )
        
        return {
            "input_ids": enc["input_ids"].squeeze(),
            "attention_mask": enc["attention_mask"].squeeze(),
            "labels": torch.tensor(label)
        }


# 3. Data Loader:
#------------------------------
train_ds = NYTDataset(X_train, y_train)
val_ds   = NYTDataset(X_val, y_val)
test_ds  = NYTDataset(X_test, y_test)

train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True)
val_loader   = DataLoader(val_ds, batch_size=BATCH_SIZE)
test_loader  = DataLoader(test_ds, batch_size=BATCH_SIZE)

# 4. Model + Optimizer:
#------------------------------
model_bert = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=NUM_LABELS
)
model_bert = model_bert.to(device)

optimizer = AdamW(model_bert.parameters(), lr=2e-5)



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [40]:
# 5. Train the model for EPOCHS (= 3) epochs
#---------------------------------------
for epoch in range(EPOCHS):
    model_bert.train()
    loop = tqdm(train_loader, leave=True)
    total_loss = 0

    for batch in loop:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)
        
        model_bert.zero_grad()
        outputs = model_bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            labels=labels
        )
        
        loss = outputs.loss
        total_loss += loss.item()
        
        loss.backward()
        optimizer.step()
        
        loop.set_description(f"Epoch {epoch+1}/{EPOCHS}")
        loop.set_postfix(loss=loss.item())
    
    print(f"Epoch {epoch+1} completed. Avg Loss = {total_loss / len(loop):.4f}")


Epoch 1/3: 100%|██████████| 576/576 [05:31<00:00,  1.74it/s, loss=0.00343]


Epoch 1 completed. Avg Loss = 0.1400


Epoch 2/3: 100%|██████████| 576/576 [05:44<00:00,  1.67it/s, loss=0.025]  


Epoch 2 completed. Avg Loss = 0.0428


Epoch 3/3: 100%|██████████| 576/576 [05:49<00:00,  1.65it/s, loss=0.00579] 

Epoch 3 completed. Avg Loss = 0.0182





In [41]:
# Evaluation Function:
def evaluate(loader):
    model_bert.eval()
    preds, labels_all = [], []
    
    with torch.no_grad():
        for batch in loader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].cpu().numpy()
            
            outputs = model_bert(
                input_ids=input_ids,
                attention_mask=attention_mask
            )
            
            batch_preds = torch.argmax(outputs.logits, dim=1).cpu().numpy()
            
            preds.extend(batch_preds)
            labels_all.extend(labels)
    
    acc = accuracy_score(labels_all, preds)
    macro = f1_score(labels_all, preds, average="macro")
    micro = f1_score(labels_all, preds, average="micro")
    
    return acc, macro, micro

# Validation Set :
val_acc, val_macro, val_micro = evaluate(val_loader)

print("Validation Results")
print("---------------------------")
print("Accuracy :", val_acc)
print("Macro F1 :", val_macro)
print("Micro F1 :", val_micro)


Validation Results
---------------------------
Accuracy : 0.9774305555555556
Macro F1 : 0.9527343723887988
Micro F1 : 0.9774305555555556


In [42]:
test_acc, test_macro, test_micro = evaluate(test_loader)

print("\nTEST RESULTS")
print("---------------------------")
print("Accuracy :", test_acc)
print("Macro F1 :", test_macro)
print("Micro F1 :", test_micro)



TEST RESULTS
---------------------------
Accuracy : 0.984375
Macro F1 : 0.9651937934318485
Micro F1 : 0.984375


**Part 4: Summary of results / Reflection**
* Part 1:
  
| Model                       | Validation Macro F1  | Test Macro F1  |
| --------------------------- | -------------------- | -------------- |
| **Binary Bag-of-Words**     | 0.962                | 0.975          |
| **Frequency Bag-of-Words**  | 0.971                | 0.983          |
| **TF-IDF Bag-of-Words**     | 0.959                | 0.976          |

* Part 2 a:

|**GloVe Model (Pretrained)** | Validation Set | Test Set  |
|-----------------------------|----------------|-----------|
|Accuracy score               | 0.977          | 0.985     |
|Macro F1                     | 0.946          | 0.965     |
|Micro F1                     | 0.977          | 0.985     |

* Part 2 b:


|**Word2Vec (Trained on NYT)** | Validation Set | Test Set  |
|------------------------------|----------------|-----------|
|Accuracy score                | 0.964          | 0.979     |
|Macro F1                      | 0.923          | 0.948     |
|Micro F1                      | 0.964          | 0.979     |

* Part 3:

|**BERT (Fine-tuned on NYT)** | Validation Set | Test Set  |
|-----------------------------|----------------|-----------|
|Accuracy score               | 0.977          | 0.984     |
|Macro F1                     | 0.953          | 0.965     |
|Micro F1                     | 0.977          | 0.984     |


* **Summary**:


  All models performed very well on the NYT dataset, most likely because the three categories (business, sports, politics) use distinctive vocabulary with little overlap. Even the simplest representations achieved high macro F1: the bag-of-words variants scored 0.96 to 0.97 on validation and 0.975 to 0.983 on test, with Frequency Bag-of-Words best on both. Note that the binary model used the full 60,681-term vocabulary, while the frequency and TF-IDF models were capped at 20,000 terms, so the three were not compared on identical feature sets.

  The pretrained GloVe embeddings also performed strongly (test macro F1 0.965), benefiting from semantic information learned on a much larger corpus. The Word2Vec model trained only on the NYT training split scored lower (test macro F1 0.948), plausibly because the corpus is small for learning embeddings from scratch.

  BERT, despite being a powerful transformer-based model, did not improve on the classical models: its test macro F1 (0.965) essentially matched GloVe and fell below all three bag-of-words variants. Two caveats apply: inputs were truncated to 64 tokens and the model was fine-tuned for only 3 epochs, so this is not necessarily the best result BERT could achieve. Still, the task does not appear to require deep contextual reasoning, since vocabulary alone is largely sufficient to separate the categories. Overall, the experiment suggests that for clean topic-classification datasets, traditional vector-space models can perform as well as or better than modern deep models like BERT.


