# Task: News Topic Classification with AG News

## Objective
Classify **news articles** into 4 categories (*World, Sports, Business, Sci/Tech*) using different **text representation methods**.

<small>[AG News Classification Dataset on Kaggle](https://www.kaggle.com/datasets/amananandrai/ag-news-classification-dataset)</small>
    
---

## Step 1: Data Preparation
- Load the **AG News dataset** (train.csv & test.csv).  
- Combine the **title + description** into one text field.  
- Apply **basic preprocessing**:
  - Lowercase  
  - Remove symbols/punctuation  
  - Try stopwords removal or stemming → compare results  

---

## Step 2: Representations to Try
You must implement **all 6 methods** below:

1. **One-Hot Encoding**  (NOW)

2. **Bag of Words (BoW)**   (NOW)
   - Represent each text as a count of words.  


3. **TF-IDF**   (NOW)
   - Apply TF-IDF weighting instead of raw counts.  


4. **N-grams (Bi/Tri-grams)**   (Later)
   - Use bigrams and trigrams to capture context.   

    
5. **Word2Vec (Pretrained)**   (Later)
   - Use pretrained embeddings (e.g., GoogleNews vectors).  
   - Convert each document into a vector (average word embeddings).  

    
6. **Doc2Vec**   (Later)
   - Train your own Doc2Vec model on the dataset.  
   - Represent each document with its vector.  
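
As a rough illustration of items 4 and 5, the sketch below builds unigram and bigram features and averages word vectors in plain Python. The helper names `word_ngrams` and `average_embedding` and the 2-dimensional `toy_vectors` are made up for this example; real Word2Vec vectors would come from a pretrained model. In scikit-learn, the n-gram part corresponds to the `ngram_range` parameter of `CountVectorizer`.

```python
from collections import Counter

def word_ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous word n-grams with n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def average_embedding(tokens, vectors, dim):
    """Mean of the vectors of known tokens; zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]

tokens = "oil prices soar".split()
ngram_counts = Counter(word_ngrams(tokens, 1, 2))

# Toy 2-dimensional vectors, invented for illustration (not real Word2Vec values).
toy_vectors = {"oil": [1.0, 0.0], "prices": [0.0, 1.0]}
doc_vector = average_embedding(tokens, toy_vectors, dim=2)

print(sorted(ngram_counts))  # unigrams and bigrams of the document
print(doc_vector)            # mean of the "oil" and "prices" vectors
```

Words missing from the embedding table ("soar" here) are simply skipped, which is one common choice; a document with no known words falls back to a zero vector.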
   
---

## Step 3: Try Two Classifiers
For **each text representation method**, train **two different models** and compare:

- **Logistic Regression**
- **Naive Bayes** (or any other model of your choice, e.g., SVM, Decision Tree)

Hint:  
- Logistic Regression usually performs well on sparse features (BoW, TF-IDF, N-grams).  
- Naive Bayes is very fast and works surprisingly well for text classification.  
- Compare their accuracy for each representation.
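
One way to see why Naive Bayes is fast: training amounts to counting words per class, and prediction is a sum of log probabilities. The following is a minimal plain-Python sketch of multinomial Naive Bayes with additive smoothing on a made-up four-document corpus. The function names `train_nb` and `predict_nb` are hypothetical, and this is not how `MultinomialNB` is implemented internally.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes with additive (Laplace) smoothing."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    model = {}
    for label, n in class_docs.items():
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        log_prior = math.log(n / len(labels))
        log_lik = {w: math.log((word_counts[label][w] + alpha) / total)
                   for w in vocab}
        model[label] = (log_prior, log_lik)
    return model

def predict_nb(model, tokens):
    """Return the class with the highest log posterior; unseen words are skipped."""
    def score(label):
        log_prior, log_lik = model[label]
        return log_prior + sum(log_lik[w] for w in tokens if w in log_lik)
    return max(model, key=score)

# Toy training data, invented for illustration.
train_docs = [d.split() for d in [
    "oil prices soar", "stocks fall",
    "team wins final", "coach praises team",
]]
train_labels = ["Business", "Business", "Sports", "Sports"]

model = train_nb(train_docs, train_labels)
prediction = predict_nb(model, "team wins".split())
print(prediction)
```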

---

## Step 4: Results Table
Fill in your results:

| Representation | Logistic Regression Acc | Naive Bayes Acc | Notes |
|----------------|--------------------------|-----------------|-------|
| One Hot            |                          |                 |       |
| BoW            |                          |                 |       |
| TF-IDF         |                          |                 |       |
| N-grams        |                          |                 |       |
| Word2Vec       |                          |                 |       |
| Doc2Vec        |                          |                 |       |

---

## Reflection Questions
1. Which method gave the best accuracy? Why?  
2. Did N-grams improve performance compared to BoW?  
3. How do pretrained embeddings (Word2Vec) compare to TF-IDF?  
4. Which method is more efficient in terms of speed and memory?  
5. If you had to build a **real news classifier**, which method would you choose and why?  


In [47]:
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

In [48]:
# Load data
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

train_df.head()


Unnamed: 0,Class Index,Title,Description
0,3,Wall St. Bears Claw Back Into the Black (Reuters),"Reuters - Short-sellers, Wall Street's dwindli..."
1,3,Carlyle Looks Toward Commercial Aerospace (Reu...,Reuters - Private investment firm Carlyle Grou...
2,3,Oil and Economy Cloud Stocks' Outlook (Reuters),Reuters - Soaring crude prices plus worries\ab...
3,3,Iraq Halts Oil Exports from Main Southern Pipe...,Reuters - Authorities have halted oil export\f...
4,3,"Oil prices soar to all-time record, posing new...","AFP - Tearaway world oil prices, toppling reco..."


In [49]:
train_df['text'] = train_df['Title'] + " " + train_df['Description']
test_df['text'] = test_df['Title'] + " " + test_df['Description']

In [50]:
# Preprocessing
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def preprocess(text, remove_stopwords=True, do_stemming=False):
    text = text.lower()  # lowercase
    text = re.sub(r'[^a-z0-9\s]', '', text)  # remove punctuation/symbols
    words = text.split()
    if remove_stopwords:
        words = [w for w in words if w not in stop_words]
    if do_stemming:
        words = [stemmer.stem(w) for w in words]
    return " ".join(words)
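
To make the effect of each step concrete, here is a stdlib-only sketch that mirrors `preprocess` on one headline from the sample rows. `TOY_STOPWORDS` is a tiny hypothetical stand-in for NLTK's English stopword list, `preprocess_demo` is an illustrative name, and stemming is left out.

```python
import re

# Hypothetical stand-in for stopwords.words('english'); the real list is much longer.
TOY_STOPWORDS = {"into", "the", "a", "of"}

def preprocess_demo(text, remove_stopwords=True):
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)  # same pattern as in preprocess
    words = text.split()
    if remove_stopwords:
        words = [w for w in words if w not in TOY_STOPWORDS]
    return " ".join(words)

headline = "Wall St. Bears Claw Back Into the Black (Reuters)"
kept = preprocess_demo(headline, remove_stopwords=False)
dropped = preprocess_demo(headline, remove_stopwords=True)
fused = preprocess_demo("worries\\about")

print(kept)     # wall st bears claw back into the black reuters
print(dropped)  # wall st bears claw back black reuters
print(fused)    # worriesabout
```

The last line shows a side effect worth knowing: the pattern deletes symbols instead of replacing them with a space, so text such as `worries\about` (visible in the sample rows) is fused into one token. Substituting a space is one variant to compare.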

In [51]:
train_df['text_clean'] = train_df['text'].apply(preprocess)
test_df['text_clean'] = test_df['text'].apply(preprocess)

In [52]:
y_train = train_df['Class Index']
y_test = test_df['Class Index']

In [53]:
# One-hot encoding: binary=True records word presence (0/1) instead of counts.
# CountVectorizer is already imported in the first cell.

vectorizer_oh = CountVectorizer(binary=True)
X_train_oh = vectorizer_oh.fit_transform(train_df['text_clean'])
X_test_oh = vectorizer_oh.transform(test_df['text_clean'])

In [54]:
# Bag of Words: raw term counts
vectorizer_bow = CountVectorizer()
X_train_bow = vectorizer_bow.fit_transform(train_df['text_clean'])
X_test_bow = vectorizer_bow.transform(test_df['text_clean'])

In [55]:
# TF-IDF: term counts reweighted by inverse document frequency
vectorizer_tfidf = TfidfVectorizer()
X_train_tfidf = vectorizer_tfidf.fit_transform(train_df['text_clean'])
X_test_tfidf = vectorizer_tfidf.transform(test_df['text_clean'])
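
The three representations differ only in the value stored for each vocabulary word. The sketch below computes all three by hand on a made-up toy corpus. The TF-IDF part assumes the smoothed form `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is my understanding of the `TfidfVectorizer` defaults; treat it as an approximation of the idea, not a drop-in replacement.

```python
import math
from collections import Counter

docs = ["oil prices soar", "oil exports halted", "stocks outlook"]  # toy corpus
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def count_vector(tokens):
    """Bag of Words: how many times each vocabulary word occurs."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def binary_vector(tokens):
    """One-hot style: 1 if the word occurs at all, else 0."""
    return [1 if c > 0 else 0 for c in count_vector(tokens)]

n_docs = len(tokenized)
doc_freq = {w: sum(w in doc for doc in tokenized) for w in vocab}
# Smoothed inverse document frequency (assumed to match scikit-learn's default).
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_vector(tokens):
    """Counts scaled by idf, then L2-normalized."""
    raw = [c * idf[w] for c, w in zip(count_vector(tokens), vocab)]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw] if norm else raw

repeated = "oil oil prices".split()
print(vocab)
print(binary_vector(repeated))  # "oil" counts once
print(count_vector(repeated))   # "oil" counts twice
print([round(x, 3) for x in tfidf_vector(repeated)])
```

In this corpus "oil" appears in two documents and "prices" in one, so "prices" receives the larger idf weight even though "oil" has the larger raw count.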

In [56]:
def train_evaluate(X_train, X_test, y_train, y_test):
    # Logistic Regression
    lr = LogisticRegression(max_iter=500)
    lr.fit(X_train, y_train)
    y_pred_lr = lr.predict(X_test)
    acc_lr = accuracy_score(y_test, y_pred_lr)

    # Naive Bayes
    nb = MultinomialNB()
    nb.fit(X_train, y_train)
    y_pred_nb = nb.predict(X_test)
    acc_nb = accuracy_score(y_test, y_pred_nb)

    return acc_lr, acc_nb

In [62]:
acc_lr_oh, acc_nb_oh = train_evaluate(X_train_oh, X_test_oh, y_train, y_test)
print("One Hot -> Logistic Regression:", acc_lr_oh, "Naive Bayes:", acc_nb_oh)

One Hot -> Logistic Regression: 0.9060526315789473 Naive Bayes: 0.8989473684210526


In [57]:
acc_lr_bow, acc_nb_bow = train_evaluate(X_train_bow, X_test_bow, y_train, y_test)
print("BoW -> Logistic Regression:", acc_lr_bow, "Naive Bayes:", acc_nb_bow)

BoW -> Logistic Regression: 0.9085526315789474 Naive Bayes: 0.9025


In [63]:
acc_lr_tfidf, acc_nb_tfidf = train_evaluate(X_train_tfidf, X_test_tfidf, y_train, y_test)
print("TF-IDF -> Logistic Regression:", acc_lr_tfidf, "Naive Bayes:", acc_nb_tfidf)

TF-IDF -> Logistic Regression: 0.9171052631578948 Naive Bayes: 0.9030263157894737


In [64]:
results = pd.DataFrame({
    'Representation': ['One Hot', 'BoW', 'TF-IDF'],
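    # Float placeholders: integer zeros would make the later float assignments
    # trigger a pandas dtype warning.
    'Logistic Regression Acc': [0.0, 0.0, 0.0],
    'Naive Bayes Acc': [0.0, 0.0, 0.0],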
    'Notes': ['', '', '']
})

# Fill in the results after evaluation
results.loc[0, 'Logistic Regression Acc'] = acc_lr_oh
results.loc[0, 'Naive Bayes Acc'] = acc_nb_oh
results.loc[0, 'Notes'] = 'One-Hot encoding'

results.loc[1, 'Logistic Regression Acc'] = acc_lr_bow
results.loc[1, 'Naive Bayes Acc'] = acc_nb_bow
results.loc[1, 'Notes'] = 'Bag of Words'

results.loc[2, 'Logistic Regression Acc'] = acc_lr_tfidf
results.loc[2, 'Naive Bayes Acc'] = acc_nb_tfidf
results.loc[2, 'Notes'] = 'TF-IDF weighting'

# Display table
print(results)

  Representation  Logistic Regression Acc  Naive Bayes Acc             Notes
0        One Hot                 0.906053         0.898947  One-Hot encoding
1            BoW                 0.908553         0.902500      Bag of Words
2         TF-IDF                 0.917105         0.903026  TF-IDF weighting




# Which method gave the best accuracy? Why?
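* Logistic Regression beats Naive Bayes on all three representations, likely because it handles sparse, high-dimensional features better than Naive Bayes. The best single result in the printed outputs is TF-IDF with Logistic Regression (about 0.917).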