# Task: News Topic Classification with AG News

## Objective
Classify **news articles** into 4 categories (*World, Sports, Business, Sci/Tech*) using different **text representation methods**.

<small>[AG News Classification Dataset on Kaggle](https://www.kaggle.com/datasets/amananandrai/ag-news-classification-dataset)</small>
    
---

## Step 1: Data Preparation
- Load the **AG News dataset** (train.csv & test.csv).  
- Combine the **title + description** into one text field.  
- Apply **basic preprocessing**:
  - Lowercase  
  - Remove symbols/punctuation  
  - Try stopwords removal or stemming → compare results  
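
The preprocessing steps above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's exact function: `TOY_STOP_WORDS` is a hypothetical, tiny stop-word list standing in for the NLTK one, and punctuation is replaced with a space (an assumption made here so that hyphenated words split into two tokens). It only compares stop-word removal on and off; stemming is not covered.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the NLTK English list is much longer.
TOY_STOP_WORDS = {"the", "a", "to", "of", "in", "on", "for", "and"}

def preprocess(text, remove_stopwords=True):
    """Lowercase, strip non-letters, and optionally drop stop words."""
    text = text.lower()
    # Replace anything that is not a-z or whitespace with a space,
    # so "all-time" becomes the two tokens "all" and "time".
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in TOY_STOP_WORDS]
    return " ".join(tokens)

sample = "Oil prices soar to all-time record"
with_stop = preprocess(sample, remove_stopwords=False)
without_stop = preprocess(sample, remove_stopwords=True)
print(with_stop)     # oil prices soar to all time record
print(without_stop)  # oil prices soar all time record
```

Running both variants through the same vectorizer and classifier is one way to make the "compare results" step concrete.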

---

## Step 2: Representations to Try
You must implement **all 5 methods** below:

1. **Bag of Words (BoW)**  
   - Represent each text as a count of words.  


2. **TF-IDF**  
   - Apply TF-IDF weighting instead of raw counts.  


3. **N-grams (Bi/Tri-grams)**  
   - Use bigrams and trigrams to capture context.   

    
4. **Word2Vec (Pretrained)**  
   - Use pretrained embeddings (e.g., GoogleNews vectors).  
   - Convert each document into a vector (average word embeddings).  

    
5. **Doc2Vec**  
   - Train your own Doc2Vec model on the dataset.  
   - Represent each document with its vector.  
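
To make the difference between raw counts and TF-IDF weights concrete, here is a small hand-rolled sketch on three made-up sentences (the documents are illustrative, not taken from AG News). It uses the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1`, which is, as far as I know, the default in scikit-learn's `TfidfVectorizer`; the library additionally L2-normalizes each row by default, which this sketch omits.

```python
import math
from collections import Counter

# Illustrative toy corpus (not from the dataset).
docs = [
    "stocks fall as oil prices rise",
    "oil exports halted",
    "team wins the final match",
]
tokenized = [d.split() for d in docs]

# Bag of Words: raw term counts per document.
bow = [Counter(tokens) for tokens in tokenized]

# Document frequency: number of documents containing each term.
df = Counter(term for tokens in tokenized for term in set(tokens))
n_docs = len(docs)

def tfidf(counts):
    """Scale each count by a smoothed inverse document frequency."""
    return {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in counts.items()
    }

weights = tfidf(bow[0])
# "oil" occurs in two documents, "stocks" in only one,
# so "stocks" receives the larger weight despite equal counts.
print(bow[0]["oil"], bow[0]["stocks"])
print(round(weights["oil"], 4), round(weights["stocks"], 4))
```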
   
---

## Step 3: Try Two Classifiers
For **each text representation method**, train **two different models** and compare:

- **Logistic Regression**
- **Naive Bayes** (or any other model of your choice, e.g., SVM, Decision Tree)

Hint:  
- Logistic Regression usually performs well on sparse features (BoW, TF-IDF, N-grams).  
- Naive Bayes is very fast and works surprisingly well for text classification.  
- Compare their accuracy for each representation.

---

## Step 4: Results Table
Fill in your results:

| Representation | Logistic Regression Acc | Naive Bayes Acc | Notes |
|----------------|--------------------------|-----------------|-------|
| BoW            |                          |                 |       |
| TF-IDF         |                          |                 |       |
| N-grams        |                          |                 |       |
| Word2Vec       |                          |                 |       |
| Doc2Vec        |                          |                 |       |

---

## Reflection Questions
1. Which method gave the best accuracy? Why?  
2. Did N-grams improve performance compared to BoW?  
3. How do pretrained embeddings (Word2Vec) compare to TF-IDF?  
4. Which method is more efficient in terms of speed and memory?  
5. If you had to build a **real news classifier**, which method would you choose and why?  


In [30]:
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer , TfidfVectorizer
import gensim.downloader as api
from gensim.models.doc2vec import Doc2Vec , TaggedDocument
from gensim.models import KeyedVectors
import numpy as np
import nltk , re
from nltk.corpus import stopwords
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mosta\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

1-Data Preparation

In [31]:
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
train_df.head()

Unnamed: 0,Class Index,Title,Description
0,3,Wall St. Bears Claw Back Into the Black (Reuters),"Reuters - Short-sellers, Wall Street's dwindli..."
1,3,Carlyle Looks Toward Commercial Aerospace (Reu...,Reuters - Private investment firm Carlyle Grou...
2,3,Oil and Economy Cloud Stocks' Outlook (Reuters),Reuters - Soaring crude prices plus worries\ab...
3,3,Iraq Halts Oil Exports from Main Southern Pipe...,Reuters - Authorities have halted oil export\f...
4,3,"Oil prices soar to all-time record, posing new...","AFP - Tearaway world oil prices, toppling reco..."


In [32]:
# Combine the title and description into one text field.
train_df['text'] = train_df["Title"] + " " + train_df['Description']
test_df['text'] = test_df['Title'] + ' ' + test_df['Description']
test_df['text']

0       Fears for T N pension after talks Unions repre...
1       The Race is On: Second Private Team Sets Launc...
2       Ky. Company Wins Grant to Study Peptides (AP) ...
3       Prediction Unit Helps Forecast Wildfires (AP) ...
4       Calif. Aims to Limit Farm-Related Smog (AP) AP...
                              ...                        
7595    Around the world Ukrainian presidential candid...
7596    Void is filled with Clement With the supply of...
7597    Martinez leaves bitter Like Roger Clemens did ...
7598    5 of arthritis patients in Singapore take Bext...
7599    EBay gets into rentals EBay plans to buy the a...
Name: text, Length: 7600, dtype: object

In [33]:
# Preprocessing
stop_words = set(stopwords.words("english"))

def preprocess(text):
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)
    tokens = [t for t in text.split() if t not in stop_words]
    return " ".join(tokens)

train_labels = train_df["Class Index"]
test_labels = test_df["Class Index"]


In [34]:
# Classifier
clf = LogisticRegression(max_iter=1000)

results = {}

2-Representations to Try

In [35]:
# A) Bag of Words
train_df["clean_text"] = train_df["text"].apply(preprocess)
test_df["clean_text"] = test_df["text"].apply(preprocess)

bow = CountVectorizer(max_features=5000)
X_train = bow.fit_transform(train_df["clean_text"])
X_test = bow.transform(test_df["clean_text"])
clf.fit(X_train, train_labels)
results["Bag of Words"] = accuracy_score(test_labels, clf.predict(X_test))
results

{'Bag of Words': 0.8938157894736842}

In [36]:
# B) TF-IDF
tfidf = TfidfVectorizer(max_features=5000)
X_train = tfidf.fit_transform(train_df["clean_text"])
X_test = tfidf.transform(test_df["clean_text"])
clf.fit(X_train, train_labels)
results["TF-IDF"] = accuracy_score(test_labels, clf.predict(X_test))
results


{'Bag of Words': 0.8938157894736842, 'TF-IDF': 0.9031578947368422}

In [37]:
# C) N-grams
ngram = TfidfVectorizer(ngram_range=(2,3), max_features=5000)
X_train = ngram.fit_transform(train_df["clean_text"])
X_test = ngram.transform(test_df["clean_text"])
clf.fit(X_train, train_labels)
results["N-grams"] = accuracy_score(test_labels, clf.predict(X_test))
results


{'Bag of Words': 0.8938157894736842,
 'TF-IDF': 0.9031578947368422,
 'N-grams': 0.7980263157894737}
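
One thing worth noting about the N-gram run: `ngram_range=(2,3)` keeps only bigrams and trigrams and drops single words entirely, which may partly explain the lower score. The stdlib sketch below (the helper `ngrams` is hypothetical, written just for this illustration) shows which features each range produces. Trying `ngram_range=(1,3)` would be a natural follow-up experiment; no result is claimed for it here.

```python
def ngrams(tokens, n_min, n_max):
    """Return all word n-grams with n_min <= n <= n_max, in order."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

tokens = "oil prices soar record".split()

only_bi_tri = ngrams(tokens, 2, 3)    # mirrors ngram_range=(2, 3)
with_unigrams = ngrams(tokens, 1, 3)  # mirrors ngram_range=(1, 3)

print(only_bi_tri)
print(with_unigrams)
```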

In [38]:
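# D) Word2Vec (pretrained embeddings)
# Note: this loads 100-dimensional GloVe vectors, not the GoogleNews Word2Vec model.
w2v_model = api.load("glove-wiki-gigaword-100")

def doc_vector(doc):
    # Average the vectors of all in-vocabulary words in the document.
    words = [w for w in doc.split() if w in w2v_model]
    if len(words) == 0:
        # The fallback must match the embedding size (100 here, not 300),
        # otherwise np.vstack fails on documents with no known words.
        return np.zeros(w2v_model.vector_size)
    return np.mean(w2v_model[words], axis=0)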

X_train = np.vstack([doc_vector(doc) for doc in train_df["clean_text"]])
X_test = np.vstack([doc_vector(doc) for doc in test_df["clean_text"]])
clf.fit(X_train, train_labels)
results["Word2Vec"] = accuracy_score(test_labels, clf.predict(X_test))
results

{'Bag of Words': 0.8938157894736842,
 'TF-IDF': 0.9031578947368422,
 'N-grams': 0.7980263157894737,
 'Word2Vec': 0.8844736842105263}
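
The document-vector step is just a per-dimension mean over the words the embedding model knows. The sketch below uses a hypothetical 3-dimensional toy dictionary in place of a pretrained model, and plain lists instead of NumPy, to show the averaging and why the zero-vector fallback has to have the same length as the embeddings.

```python
# Hypothetical toy embeddings standing in for a pretrained model.
toy_vectors = {
    "oil": [1.0, 0.0, 2.0],
    "prices": [3.0, 2.0, 0.0],
}
VECTOR_SIZE = 3  # must equal the length of every embedding above

def doc_vector(doc, vectors, size):
    """Average the embeddings of known words; zeros if none are known."""
    known = [vectors[w] for w in doc.split() if w in vectors]
    if not known:
        return [0.0] * size
    # zip(*known) groups the values of each dimension together.
    return [sum(dim) / len(known) for dim in zip(*known)]

avg = doc_vector("oil prices soar", toy_vectors, VECTOR_SIZE)
empty = doc_vector("unknown words only", toy_vectors, VECTOR_SIZE)
print(avg)    # [2.0, 1.0, 1.0]
print(empty)  # [0.0, 0.0, 0.0]
```

Because every document maps to a vector of the same length, the results can be stacked into one feature matrix regardless of how many words were out of vocabulary.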

In [39]:
# E) Doc2Vec
train_tagged = [TaggedDocument(words=doc.split(), tags=[i]) for i, doc in enumerate(train_df["clean_text"])]

doc2vec_model = Doc2Vec(vector_size=300, window=5, min_count=2, workers=4, epochs=20)
doc2vec_model.build_vocab(train_tagged)
doc2vec_model.train(train_tagged, total_examples=doc2vec_model.corpus_count, epochs=doc2vec_model.epochs)

X_train = np.vstack([doc2vec_model.infer_vector(doc.split()) for doc in train_df["clean_text"]])
X_test = np.vstack([doc2vec_model.infer_vector(doc.split()) for doc in test_df["clean_text"]])
clf.fit(X_train, train_labels)
results["Doc2Vec"] = accuracy_score(test_labels, clf.predict(X_test))

In [40]:
print(results)

{'Bag of Words': 0.8938157894736842, 'TF-IDF': 0.9031578947368422, 'N-grams': 0.7980263157894737, 'Word2Vec': 0.8844736842105263, 'Doc2Vec': 0.6647368421052632}


3-Try two Classifiers

In [41]:
# Evaluating the model
def evaluate_model(X_train, X_test, y_train, y_test, model, model_name, rep_name, results):
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    results.append({"Representation": rep_name, "Model": model_name, "Accuracy": acc})

In [44]:
results = []

# --------------------------
# 1) Bag of Words
# --------------------------
bow = CountVectorizer(max_features=5000)
X_train = bow.fit_transform(train_df["clean_text"])
X_test = bow.transform(test_df["clean_text"])

evaluate_model(X_train, X_test, train_labels, test_labels, LogisticRegression(max_iter=1000), "Logistic Regression", "Bag of Words", results)
evaluate_model(X_train, X_test, train_labels, test_labels, MultinomialNB(), "Naive Bayes", "Bag of Words", results)


# --------------------------
# 2) TF-IDF
# --------------------------
tfidf = TfidfVectorizer(max_features=5000)
X_train = tfidf.fit_transform(train_df["clean_text"])
X_test = tfidf.transform(test_df["clean_text"])

evaluate_model(X_train, X_test, train_labels, test_labels, LogisticRegression(max_iter=1000), "Logistic Regression", "TF-IDF", results)
evaluate_model(X_train, X_test, train_labels, test_labels, MultinomialNB(), "Naive Bayes", "TF-IDF", results)


# --------------------------
# 3) N-grams (Bi/Tri-grams)
# --------------------------
ngram = TfidfVectorizer(ngram_range=(2,3), max_features=5000)
X_train = ngram.fit_transform(train_df["clean_text"])
X_test = ngram.transform(test_df["clean_text"])

evaluate_model(X_train, X_test, train_labels, test_labels, LogisticRegression(max_iter=1000), "Logistic Regression", "N-grams", results)
evaluate_model(X_train, X_test, train_labels, test_labels, MultinomialNB(), "Naive Bayes", "N-grams", results)

# --------------------------
# 4) Doc2Vec
# --------------------------
X_train = np.vstack([doc2vec_model.infer_vector(doc.split()) for doc in train_df["clean_text"]])
X_test = np.vstack([doc2vec_model.infer_vector(doc.split()) for doc in test_df["clean_text"]])

# MultinomialNB requires non-negative features; Doc2Vec vectors contain negative values, so Naive Bayes is skipped here.
evaluate_model(X_train, X_test, train_labels, test_labels, LogisticRegression(max_iter=1000), "Logistic Regression", "Doc2Vec", results)

# --------------------------
# 5) Word2Vec (Pretrained)
# --------------------------
X_train = np.vstack([doc_vector(doc) for doc in train_df["clean_text"]])
X_test = np.vstack([doc_vector(doc) for doc in test_df["clean_text"]])

# MultinomialNB requires non-negative features; averaged embeddings contain negative values, so Naive Bayes is skipped here.
evaluate_model(X_train, X_test, train_labels, test_labels, LogisticRegression(max_iter=1000), "Logistic Regression", "Word2Vec", results)

4-Result Table

In [47]:
# Convert results into a DataFrame
results_df = pd.DataFrame(results)

# Pivot into the required table format
table = results_df.pivot(index="Representation", columns="Model", values="Accuracy").reset_index()

# Add empty Notes column
table["Notes"] = ""

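# Reorder columns to match the required format.
# Note: "Notes" is not selected here, so the empty column is dropped from the printed table.
table = table[["Representation", "Logistic Regression", "Naive Bayes"]]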

print(table.to_string(index=False))


Representation  Logistic Regression  Naive Bayes
  Bag of Words             0.893816     0.888684
       Doc2Vec             0.666974          NaN
       N-grams             0.798026     0.764605
        TF-IDF             0.903158     0.889474
      Word2Vec             0.884474          NaN
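
To copy these numbers into the Markdown table from Step 4, the `results` list can be formatted directly. The sketch below is one possible way to do it with the standard library: the accuracy values are the ones printed above, while `to_markdown`, `ORDER`, and `MODELS` are hypothetical helpers introduced for this example. Combinations that were not run appear as `n/a`.

```python
# Accuracies copied from the printed table above.
results = [
    {"Representation": "Bag of Words", "Model": "Logistic Regression", "Accuracy": 0.893816},
    {"Representation": "Bag of Words", "Model": "Naive Bayes", "Accuracy": 0.888684},
    {"Representation": "TF-IDF", "Model": "Logistic Regression", "Accuracy": 0.903158},
    {"Representation": "TF-IDF", "Model": "Naive Bayes", "Accuracy": 0.889474},
    {"Representation": "N-grams", "Model": "Logistic Regression", "Accuracy": 0.798026},
    {"Representation": "N-grams", "Model": "Naive Bayes", "Accuracy": 0.764605},
    {"Representation": "Doc2Vec", "Model": "Logistic Regression", "Accuracy": 0.666974},
    {"Representation": "Word2Vec", "Model": "Logistic Regression", "Accuracy": 0.884474},
]

ORDER = ["Bag of Words", "TF-IDF", "N-grams", "Word2Vec", "Doc2Vec"]
MODELS = ["Logistic Regression", "Naive Bayes"]

def to_markdown(rows):
    """Format result rows as a Markdown table in the Step 4 row order."""
    lookup = {(r["Representation"], r["Model"]): r["Accuracy"] for r in rows}
    lines = [
        "| Representation | Logistic Regression Acc | Naive Bayes Acc |",
        "|---|---|---|",
    ]
    for rep in ORDER:
        cells = []
        for model in MODELS:
            acc = lookup.get((rep, model))
            cells.append("n/a" if acc is None else f"{acc:.4f}")
        lines.append(f"| {rep} | {cells[0]} | {cells[1]} |")
    return "\n".join(lines)

table_md = to_markdown(results)
print(table_md)
```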


Reflection Questions

1. Which method gave the best accuracy? Why?  
- TF-IDF with Logistic Regression gave the best accuracy (0.903), because TF-IDF up-weights distinctive words and down-weights words that appear in most documents.

2. Did N-grams improve performance compared to BoW?  
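- No. The N-gram model (bigrams and trigrams only, without unigrams) reached 0.798 with Logistic Regression, compared with 0.894 for BoW.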

3. How do pretrained embeddings (Word2Vec) compare to TF-IDF?  
- TF-IDF was better: 0.903 versus 0.884 for the averaged pretrained embeddings, both with Logistic Regression.

4. Which method is more efficient in terms of speed and memory?  
- TF-IDF

5. If you had to build a **real news classifier**, which method would you choose and why?  
- I would choose TF-IDF with Logistic Regression because it is efficient, works on large datasets, and had the highest accuracy here, outperforming BoW and N-grams.