# Task: News Topic Classification with AG News

## Objective
Classify **news articles** into 4 categories (*World, Sports, Business, Sci/Tech*) using different **text representation methods**.

<small>[AG News Classification Dataset on Kaggle](https://www.kaggle.com/datasets/amananandrai/ag-news-classification-dataset)</small>
    
---

## Step 1: Data Preparation
- Load the **AG News dataset** (train.csv & test.csv).  
- Combine the **title + description** into one text field.  
- Apply **basic preprocessing**:
  - Lowercase  
  - Remove symbols/punctuation  
  - Try stopwords removal or stemming → compare results  

---

## Step 2: Representations to Try
You must implement **all 5 methods** below:

1. **Bag of Words (BoW)**  
   - Represent each text as a count of words.  


2. **TF-IDF**  
   - Apply TF-IDF weighting instead of raw counts.  


3. **N-grams (Bi/Tri-grams)**  
   - Use bigrams and trigrams to capture context.   

    
4. **Word2Vec (Pretrained)**  
   - Use pretrained embeddings (e.g., GoogleNews vectors).  
   - Convert each document into a vector (average word embeddings).  

    
5. **Doc2Vec**  
   - Train your own Doc2Vec model on the dataset.  
   - Represent each document with its vector.  
   
---

## Step 3: Try Two Classifiers
For **each text representation method**, train **two different models** and compare:

- **Logistic Regression**
- **Naive Bayes** (or any other model of your choice, e.g., SVM, Decision Tree)

Hint:  
- Logistic Regression usually performs well on sparse features (BoW, TF-IDF, N-grams).  
- Naive Bayes is very fast and works surprisingly well for text classification.  
- Compare their accuracy for each representation.
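
Since the same two classifiers are trained for every representation, the comparison can be wrapped in one helper. The sketch below is one possible way to do it, not the notebook's own code: `evaluate_representation` is a hypothetical name, and the toy sentences and labels stand in for AG News.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB


def evaluate_representation(vectorizer, X_train, y_train, X_test, y_test):
    """Fit the vectorizer on the training text, then score both classifiers."""
    X_train_vec = vectorizer.fit_transform(X_train)  # vocabulary from train only
    X_test_vec = vectorizer.transform(X_test)

    models = {
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Naive Bayes": MultinomialNB(),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train_vec, y_train)
        scores[name] = accuracy_score(y_test, model.predict(X_test_vec))
    return scores


# Toy data standing in for AG News (assumption); the labels are arbitrary class ids.
toy_train = ["oil prices soar", "stocks fall sharply", "team wins final", "striker scores goal"]
toy_labels = [3, 3, 2, 2]
toy_test = ["oil stocks", "team scores"]
toy_test_labels = [3, 2]

scores = evaluate_representation(
    CountVectorizer(), toy_train, toy_labels, toy_test, toy_test_labels
)
print(scores)
```

Passing `TfidfVectorizer()` or `CountVectorizer(ngram_range=(2, 3))` instead of `CountVectorizer()` would cover the other sparse representations with the same call.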

---

## Step 4: Results Table
Fill in your results:

| Representation | Logistic Regression Acc | Naive Bayes Acc | Notes |
|----------------|--------------------------|-----------------|-------|
| BoW            | 0.9093                   | 0.9009          |       |
| TF-IDF         | 0.9183                   | 0.9022          |       |
| N-grams        | 0.8959                   | 0.9030          | Bigrams + trigrams only (`ngram_range=(2,3)`) |
| Word2Vec       |                          |                 |       |
| Doc2Vec        |                          |                 |       |

---

## Reflection Questions
1. Which method gave the best accuracy? Why?  
2. Did N-grams improve performance compared to BoW?  
3. How do pretrained embeddings (Word2Vec) compare to TF-IDF?  
4. Which method is more efficient in terms of speed and memory?  
5. If you had to build a **real news classifier**, which method would you choose and why?  


-------


## Step 1: Data Preparation
---


In [20]:
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('stopwords')



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [5]:
train_df = pd.read_csv(r"/content/train.csv")
test_df = pd.read_csv(r"/content/test.csv")

In [6]:
train_df.head()

Unnamed: 0,Class Index,Title,Description
0,3,Wall St. Bears Claw Back Into the Black (Reuters),"Reuters - Short-sellers, Wall Street's dwindli..."
1,3,Carlyle Looks Toward Commercial Aerospace (Reu...,Reuters - Private investment firm Carlyle Grou...
2,3,Oil and Economy Cloud Stocks' Outlook (Reuters),Reuters - Soaring crude prices plus worries\ab...
3,3,Iraq Halts Oil Exports from Main Southern Pipe...,Reuters - Authorities have halted oil export\f...
4,3,"Oil prices soar to all-time record, posing new...","AFP - Tearaway world oil prices, toppling reco..."


In [7]:
# Rename columns (because "Class Index" contains a space)
train_df = train_df.rename(columns={"Class Index": "label"})
test_df = test_df.rename(columns={"Class Index": "label"})

In [8]:
#Combine the title + description into one text field
train_df["text"] = train_df["Title"] + " " + train_df["Description"]
test_df["text"] = test_df["Title"] + " " + test_df["Description"]

train_df[["Title", "Description", "text"]].head()


Unnamed: 0,Title,Description,text
0,Wall St. Bears Claw Back Into the Black (Reuters),"Reuters - Short-sellers, Wall Street's dwindli...",Wall St. Bears Claw Back Into the Black (Reute...
1,Carlyle Looks Toward Commercial Aerospace (Reu...,Reuters - Private investment firm Carlyle Grou...,Carlyle Looks Toward Commercial Aerospace (Reu...
2,Oil and Economy Cloud Stocks' Outlook (Reuters),Reuters - Soaring crude prices plus worries\ab...,Oil and Economy Cloud Stocks' Outlook (Reuters...
3,Iraq Halts Oil Exports from Main Southern Pipe...,Reuters - Authorities have halted oil export\f...,Iraq Halts Oil Exports from Main Southern Pipe...
4,"Oil prices soar to all-time record, posing new...","AFP - Tearaway world oil prices, toppling reco...","Oil prices soar to all-time record, posing new..."


In [9]:
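# Features (X) and labels (y). Note: these use the raw combined text; the
# preprocessed columns built in the following cells are not fed to the models.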
X_train, y_train = train_df["text"], train_df["label"]
X_test, y_test = test_df["text"], test_df["label"]
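
Because `X_train` holds the raw combined text, it can help to see what the scikit-learn vectorizers do to that text on their own. A small sketch, assuming default `CountVectorizer` settings (`lowercase=True`, the default token pattern, no stop-word list):

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer() returns the callable the vectorizer uses to turn raw text into tokens.
analyze = CountVectorizer().build_analyzer()

sample = "Wall St. Bears Claw Back Into the Black (Reuters)"
tokens = analyze(sample)
print(tokens)
# ['wall', 'st', 'bears', 'claw', 'back', 'into', 'the', 'black', 'reuters']
```

With the defaults, lowercasing and punctuation removal happen inside the vectorizer, but stop words are kept and no stemming is applied.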


In [10]:
# LowerCase
train_df["text_lower"] = train_df["text"].str.lower()

In [11]:
train_df[["text", "text_lower"]].head()

Unnamed: 0,text,text_lower
0,Wall St. Bears Claw Back Into the Black (Reute...,wall st. bears claw back into the black (reute...
1,Carlyle Looks Toward Commercial Aerospace (Reu...,carlyle looks toward commercial aerospace (reu...
2,Oil and Economy Cloud Stocks' Outlook (Reuters...,oil and economy cloud stocks' outlook (reuters...
3,Iraq Halts Oil Exports from Main Southern Pipe...,iraq halts oil exports from main southern pipe...
4,"Oil prices soar to all-time record, posing new...","oil prices soar to all-time record, posing new..."


In [12]:
# Remove symbols/punctuation (keeps only a-z and whitespace, so digits are dropped too)
train_df["text_no_symbols"] = train_df["text_lower"].apply(lambda x: re.sub(r'[^a-z\s]', '', x))
train_df[["text_lower", "text_no_symbols"]].head()


Unnamed: 0,text_lower,text_no_symbols
0,wall st. bears claw back into the black (reute...,wall st bears claw back into the black reuters...
1,carlyle looks toward commercial aerospace (reu...,carlyle looks toward commercial aerospace reut...
2,oil and economy cloud stocks' outlook (reuters...,oil and economy cloud stocks outlook reuters r...
3,iraq halts oil exports from main southern pipe...,iraq halts oil exports from main southern pipe...
4,"oil prices soar to all-time record, posing new...",oil prices soar to alltime record posing new m...


In [13]:
# remove stopwords
sw = set(stopwords.words("english"))

train_df["text_no_stopwords"] = train_df["text_no_symbols"].apply(
    lambda x: " ".join([word for word in x.split() if word not in sw])
)

train_df[["text_no_symbols", "text_no_stopwords"]].head()


Unnamed: 0,text_no_symbols,text_no_stopwords
0,wall st bears claw back into the black reuters...,wall st bears claw back black reuters reuters ...
1,carlyle looks toward commercial aerospace reut...,carlyle looks toward commercial aerospace reut...
2,oil and economy cloud stocks outlook reuters r...,oil economy cloud stocks outlook reuters reute...
3,iraq halts oil exports from main southern pipe...,iraq halts oil exports main southern pipeline ...
4,oil prices soar to alltime record posing new m...,oil prices soar alltime record posing new mena...


In [14]:
# Stemming (optional; left commented out)
# stemmer = PorterStemmer()

# train_df["text_stemming"] = train_df["text_no_stopwords"].apply(
#     lambda x: " ".join([stemmer.stem(word) for word in x.split()])
# )
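
The preprocessing cells above run step by step on `train_df` only. One way to apply the same steps to both splits is to fold them into a single function, sketched below. The name `preprocess` and the tiny `STOPWORDS` set are assumptions for illustration; the notebook itself uses nltk's English stop-word list.

```python
import re

# Small illustrative stop-word set (assumption); the notebook uses
# nltk's stopwords.words("english") instead.
STOPWORDS = {"the", "a", "an", "and", "to", "of", "in", "into", "from"}


def preprocess(text, stopwords=STOPWORDS, stem=None):
    """Lowercase, drop non-letters, remove stop words, optionally stem."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)
    tokens = [tok for tok in text.split() if tok not in stopwords]
    if stem is not None:
        tokens = [stem(tok) for tok in tokens]
    return " ".join(tokens)


print(preprocess("Oil prices soar to all-time record, posing new menace"))
# oil prices soar alltime record posing new menace
```

With pandas this could be applied as `train_df["text"].apply(preprocess)` and likewise for `test_df`; passing `stem=PorterStemmer().stem` would switch stemming on for the comparison the task asks for.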


In [67]:
# BoW (Bag of Words)
# ----------------------------
bow_vectorizer = CountVectorizer()
X_train_bow = bow_vectorizer.fit_transform(X_train)
X_test_bow = bow_vectorizer.transform(X_test)

# Logistic Regression on BoW
lr_bow = LogisticRegression(max_iter=1000)
lr_bow.fit(X_train_bow, y_train)
y_pred_lr_bow = lr_bow.predict(X_test_bow)
acc_lr_bow = accuracy_score(y_test, y_pred_lr_bow)

# Naive Bayes on BoW
nb_bow = MultinomialNB()
nb_bow.fit(X_train_bow, y_train)
y_pred_nb_bow = nb_bow.predict(X_test_bow)
acc_nb_bow = accuracy_score(y_test, y_pred_nb_bow)

print("\nBoW Results:")
print("Logistic Regression Accuracy:", acc_lr_bow)
print("Naive Bayes Accuracy:", acc_nb_bow)


BoW Results:
Logistic Regression Accuracy: 0.9093421052631578
Naive Bayes Accuracy: 0.900921052631579


In [68]:
# TF-IDF
# ----------------------------
tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Logistic Regression on TF-IDF
lr_tfidf = LogisticRegression(max_iter=1000)
lr_tfidf.fit(X_train_tfidf, y_train)
y_pred_lr_tfidf = lr_tfidf.predict(X_test_tfidf)
acc_lr_tfidf = accuracy_score(y_test, y_pred_lr_tfidf)

# Naive Bayes on TF-IDF
nb_tfidf = MultinomialNB()
nb_tfidf.fit(X_train_tfidf, y_train)
y_pred_nb_tfidf = nb_tfidf.predict(X_test_tfidf)
acc_nb_tfidf = accuracy_score(y_test, y_pred_nb_tfidf)

print("\nTF-IDF Results:")
print("Logistic Regression Accuracy:", acc_lr_tfidf)
print("Naive Bayes Accuracy:", acc_nb_tfidf)




TF-IDF Results:
Logistic Regression Accuracy: 0.9182894736842105
Naive Bayes Accuracy: 0.9022368421052631
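
To make the weighting behind these numbers concrete, here is a hand-rolled sketch of the smoothed TF-IDF formula that scikit-learn documents as the `TfidfVectorizer` default (`smooth_idf=True`, `norm="l2"`): idf = ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation of each row. The three-document corpus is a toy assumption.

```python
import math
from collections import Counter

docs = [
    "oil prices rise",
    "oil exports halted",
    "team wins match",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term.
df = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(tokens):
    """Smoothed TF-IDF with L2 normalisation for one document."""
    counts = Counter(tokens)
    weights = {
        term: tf * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vec = tfidf(tokenized[0])
print(vec)
```

In the first document, "oil" appears in two of the three documents and so ends up with a lower weight than "prices" or "rise", which each appear in only one.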


In [15]:
from sklearn.feature_extraction.text import CountVectorizer

In [16]:
# N-grams (Bigrams + Trigrams)
# ----------------------------
ngram_vectorizer = CountVectorizer(ngram_range=(2,3))  # 2-grams & 3-grams
X_train_ngram = ngram_vectorizer.fit_transform(X_train)
X_test_ngram = ngram_vectorizer.transform(X_test)

# Logistic Regression on N-grams
lr_ngram = LogisticRegression(max_iter=1000)
lr_ngram.fit(X_train_ngram, y_train)
y_pred_lr_ngram = lr_ngram.predict(X_test_ngram)
acc_lr_ngram = accuracy_score(y_test, y_pred_lr_ngram)

# Naive Bayes on N-grams
nb_ngram = MultinomialNB()
nb_ngram.fit(X_train_ngram, y_train)
y_pred_nb_ngram = nb_ngram.predict(X_test_ngram)
acc_nb_ngram = accuracy_score(y_test, y_pred_nb_ngram)

print("\nN-grams Results:")
print("Logistic Regression Accuracy:", acc_lr_ngram)
print("Naive Bayes Accuracy:", acc_nb_ngram)


N-grams Results:
Logistic Regression Accuracy: 0.895921052631579
Naive Bayes Accuracy: 0.9030263157894737


---

1️⃣ Which method gave the best accuracy? Why?

TF-IDF + Logistic Regression gave the best accuracy (91.8%). A likely reason is that TF-IDF down-weights words that appear in most articles and up-weights words that are distinctive for a few, which gives a linear model like Logistic Regression cleaner features than raw counts.

2️⃣ Did N-grams improve performance compared to BoW?

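For Logistic Regression: BoW (90.9%) > N-grams (89.6%).

For Naive Bayes: N-grams (90.3%) > BoW (90.0%).

So N-grams gave Naive Bayes a small gain (about 0.2 points) and cost Logistic Regression about 1.3 points. Note that the N-gram run used `ngram_range=(2,3)`, which drops single words entirely, so this compares bigrams and trigrams against unigrams rather than adding context on top of BoW.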
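
The N-gram run in this notebook used `ngram_range=(2,3)`. The sketch below shows what that range includes compared with `(1,3)`; `word_ngrams` is a hypothetical helper written for illustration, not scikit-learn's implementation.

```python
def word_ngrams(text, n_min, n_max):
    """Return all word n-grams of length n_min..n_max, shortest first."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


text = "oil prices soar"
bi_tri = word_ngrams(text, 2, 3)   # bigrams and trigrams only
uni_to_tri = word_ngrams(text, 1, 3)  # unigrams included

print(bi_tri)      # ['oil prices', 'prices soar', 'oil prices soar']
print(uni_to_tri)  # ['oil', 'prices', 'soar', 'oil prices', 'prices soar', 'oil prices soar']
```

A range of `(1,3)` would keep the single-word features that BoW relies on while adding the longer phrases, which may be a fairer test of whether context helps.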

3️⃣ How do pretrained embeddings (Word2Vec) compare to TF-IDF?

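No Word2Vec results appear in this notebook, so there is no measured comparison. The expectation is that averaged pretrained embeddings would score somewhat lower than TF-IDF (roughly 82–86%), because the vectors were trained on a different corpus and averaging blurs individual words, while TF-IDF builds its vocabulary and weights directly from this dataset.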
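
For reference, the averaging step that turns word embeddings into a document vector can be sketched as follows. The 3-dimensional `toy_vectors` are made-up stand-ins for pretrained embeddings (the real GoogleNews vectors are 300-dimensional), and `document_vector` is a hypothetical helper name.

```python
import numpy as np

# Toy 3-dimensional vectors standing in for pretrained embeddings (assumption).
toy_vectors = {
    "oil": np.array([0.9, 0.1, 0.0]),
    "prices": np.array([0.8, 0.2, 0.1]),
    "team": np.array([0.0, 0.9, 0.3]),
    "wins": np.array([0.1, 0.8, 0.4]),
}
DIM = 3


def document_vector(text, vectors=toy_vectors, dim=DIM):
    """Average the vectors of known words; return zeros if none are known."""
    known = [vectors[tok] for tok in text.lower().split() if tok in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


doc_vec = document_vector("Oil prices soar")  # "soar" is out of vocabulary
print(doc_vec)
```

With real embeddings, the dictionary would be replaced by vectors loaded through gensim, for example `KeyedVectors.load_word2vec_format(path, binary=True)`, and words missing from the pretrained vocabulary would be skipped in the same way.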

4️⃣ Which method is more efficient in terms of speed and memory?

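No timings or memory figures were measured here. In general, Naive Bayes on BoW/TF-IDF is the fastest to train, since it only needs per-class counts and works directly on sparse matrices. Logistic Regression is slower because it is fit iteratively, and Word2Vec/Doc2Vec are the heaviest: pretrained vectors must be loaded into memory, and Doc2Vec has to be trained first.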
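
Speed and memory can be checked directly rather than estimated. The sketch below is one rough way to do it, using `time.perf_counter` for fit time and the sparse matrix's underlying arrays for memory. The repeated toy sentences are an assumption standing in for AG News, so the printed numbers say nothing about the real dataset.

```python
import time

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Toy data standing in for AG News (assumption).
texts = ["oil prices soar", "stocks fall sharply",
         "team wins final", "striker scores goal"] * 50
labels = [3, 3, 2, 2] * 50

X = CountVectorizer().fit_transform(texts)  # SciPy sparse CSR matrix

# Memory held by the sparse matrix: stored values plus the two index arrays.
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print("shape:", X.shape, "sparse bytes:", sparse_bytes)

fit_seconds = {}
for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("Naive Bayes", MultinomialNB())]:
    start = time.perf_counter()
    model.fit(X, labels)
    fit_seconds[name] = time.perf_counter() - start

print(fit_seconds)
```

Running the same measurement on the BoW, TF-IDF, and N-gram matrices from this notebook would show how much the bigram and trigram vocabulary grows the feature matrix.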

5️⃣ If you had to build a real news classifier, which method would you choose and why?

TF-IDF + Logistic Regression, because it is simple, fast, and gave the best accuracy on this dataset.