### 📰 News Article Topic Classification with NLP

#### 📦 Dataset Overview:

This project uses the AG News dataset, a popular benchmark for text classification tasks. The dataset consists of news articles categorized into four topics:
- World => class index 1
- Sports => class index 2
- Business => class index 3
- Science/Technology => class index 4

Each entry in the dataset includes:
- Class Index: The category label (1–4)
- Title: The news headline
- Description: A short summary of the article

The dataset is already split into:
- Training set: 120,000 articles
- Test set: 7,600 articles
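
The class indices listed above are easier to read with topic names attached. The following is a minimal sketch, assuming the index-to-topic mapping given in this overview; the name `CLASS_NAMES` and the sample labels are hypothetical and not part of the original notebook.

```python
from collections import Counter

# Mapping taken from the dataset overview (class index -> topic name).
CLASS_NAMES = {1: "World", 2: "Sports", 3: "Business", 4: "Science/Technology"}

# Hypothetical labels standing in for the "Class Index" column.
sample_labels = [3, 3, 1, 2, 4, 2, 3]

# Count how many articles fall into each topic.
topic_counts = Counter(CLASS_NAMES[label] for label in sample_labels)

for topic, count in topic_counts.most_common():
    print(f"{topic}: {count}")
```

On the real data, the same dictionary could be applied to the label column with `train_df['Class Index'].map(CLASS_NAMES)`.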

#### 🎯 Project Objective:
The goal of this project is to automatically classify news articles into their correct topics using Natural Language Processing (NLP) techniques. We will explore and compare several popular text vectorization methods:
- Bag of Words (BoW)
- TF-IDF (Term Frequency-Inverse Document Frequency)
- Word2Vec Embeddings

Multiple machine learning models will be trained and evaluated to determine which combination of vectorization and classification yields the best performance.

#### 📝 What will we do?
- Explore the dataset and understand its structure.
- Preprocess the text data (cleaning, tokenization, etc.).
- Extract features using BoW, TF-IDF, and Word2Vec.
- Train and evaluate several machine learning classifiers.
- Compare results and discuss insights.

In [1]:
import numpy as np
import pandas as pd

In [2]:
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

In [3]:
train_df.head(10)

| | Class Index | Title | Description |
|---|---|---|---|
| 0 | 3 | Wall St. Bears Claw Back Into the Black (Reuters) | Reuters - Short-sellers, Wall Street's dwindli... |
| 1 | 3 | Carlyle Looks Toward Commercial Aerospace (Reu... | Reuters - Private investment firm Carlyle Grou... |
| 2 | 3 | Oil and Economy Cloud Stocks' Outlook (Reuters) | Reuters - Soaring crude prices plus worries\ab... |
| 3 | 3 | Iraq Halts Oil Exports from Main Southern Pipe... | Reuters - Authorities have halted oil export\f... |
| 4 | 3 | Oil prices soar to all-time record, posing new... | AFP - Tearaway world oil prices, toppling reco... |
| 5 | 3 | Stocks End Up, But Near Year Lows (Reuters) | Reuters - Stocks ended slightly higher on Frid... |
| 6 | 3 | Money Funds Fell in Latest Week (AP) | AP - Assets of the nation's retail money marke... |
| 7 | 3 | Fed minutes show dissent over inflation (USATO... | USATODAY.com - Retail sales bounced back a bit... |
| 8 | 3 | Safety Net (Forbes.com) | Forbes.com - After earning a PH.D. in Sociolog... |
| 9 | 3 | Wall St. Bears Claw Back Into the Black | NEW YORK (Reuters) - Short-sellers, Wall Stre... |


We classify each article from its title and description. The class index is the target label: it tells us which category the article belongs to.
To do that, we combine the title and description into a single text feature and use Class Index as the output.

In [4]:
# Combining Title and Description into a single text feature:

train_df['text'] = train_df['Title'].astype(str) + ' ' + train_df['Description'].astype(str)

test_df['text'] = test_df['Title'].astype(str) + ' ' + test_df['Description'].astype(str)

In [5]:
# Select the combined text as the input feature and Class Index as the target

X_train = train_df['text']
X_test = test_df['text']

y_train = train_df['Class Index']
y_test = test_df['Class Index']

#### Text Preprocessing:
- Lowercase
- Remove special characters
- Remove stopwords
- Lemmatize

In [6]:
import re # Regular Expression
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [7]:
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)  # Remove special characters
    tokens = text.split()
    tokens = [i for i in tokens if i not in stop_words]
    tokens = [lemmatizer.lemmatize(i) for i in tokens]
    return ' '.join(tokens)

In [8]:
# Applying the preprocessing steps to the X_train and X_test values:

X_train_clean = X_train.apply(preprocess_text)
X_test_clean = X_test.apply(preprocess_text)

In [9]:
print("Text before preprocessing : ",X_train[0])
print()
print("Text after preprocessing : ",X_train_clean[0])

Text before preprocessing :  Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\band of ultra-cynics, are seeing green again.

Text after preprocessing :  wall st bear claw back black reuters reuters shortsellers wall street dwindlingband ultracynics seeing green
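
In the output above, `dwindling\band` became `dwindlingband` and `Short-sellers` became `shortsellers`, because the regular expression deletes special characters instead of replacing them. One possible variant, shown here only as a sketch and not used for any result in this notebook, substitutes a space. The helper name `strip_specials` is hypothetical.

```python
import re

# Fragment of the first training example; "\\" is a literal backslash.
raw = "Short-sellers, Wall Street's dwindling\\band of ultra-cynics"

def strip_specials(text, replacement):
    """Lowercase the text and replace every character that is not a
    lowercase letter, digit, or whitespace with `replacement`."""
    return re.sub(r'[^a-z0-9\s]', replacement, text.lower())

# Behaviour of the notebook's regex: neighbouring words get glued together.
merged = strip_specials(raw, '')

# Variant: substitute a space, then collapse repeated whitespace.
separated = ' '.join(strip_specials(raw, ' ').split())

print(merged)
print(separated)
```

The trade-off is that the space variant leaves fragments such as a lone `s` from `street's`, which a later stopword or minimum-length filter would need to handle.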


#### BoW (Bag of Words):

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
bow_vectorizer = CountVectorizer(max_features=10000)

In [11]:
X_train_bow = bow_vectorizer.fit_transform(X_train_clean)
X_test_bow = bow_vectorizer.transform(X_test_clean)

#### TF-IDF:

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=10000)

In [13]:
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train_clean)
X_test_tfidf = tfidf_vectorizer.transform(X_test_clean)
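
As a rough illustration of what `TfidfVectorizer` computes with its default settings, the sketch below rebuilds the weights by hand on two made-up documents and compares them with scikit-learn's result. It assumes the defaults (`smooth_idf=True`, `norm='l2'`, raw term counts), where the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`. All variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_docs = ["oil price soar", "stock price fall price"]

# Raw term counts and the number of documents containing each term.
counts = CountVectorizer().fit_transform(toy_docs).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (the scikit-learn default).
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale each row to unit L2 length.
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

Words that occur in every document (here `price`) receive the lowest IDF, so they contribute less than rarer words with the same count.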

#### Word2Vec:

In [14]:
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\DELL\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [15]:
# Tokenize for Word2Vec
X_train_tokens = X_train_clean.apply(word_tokenize)
X_test_tokens = X_test_clean.apply(word_tokenize)

In [16]:
# Train Word2Vec model:
w2v_model = Word2Vec(sentences=X_train_tokens, vector_size=100, window=5, min_count=2, workers=4)

In [17]:
def get_review_vector(tokens, model, vector_size):
    # Average the vectors of all tokens the Word2Vec model knows about
    vectors = [model.wv[word] for word in tokens if word in model.wv]
    if len(vectors) == 0:
        # No known tokens: fall back to a zero vector
        return np.zeros(vector_size)
    return np.mean(vectors, axis=0)

# Build one fixed-length (100-dimensional) vector per article
X_train_w2v = np.vstack([get_review_vector(tokens, w2v_model, 100) for tokens in X_train_tokens])
X_test_w2v = np.vstack([get_review_vector(tokens, w2v_model, 100) for tokens in X_test_tokens])
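
The averaging step above can be illustrated without training a model. This sketch uses a hypothetical dictionary of 3-dimensional vectors in place of `w2v_model.wv`; the function name `average_vector` is also made up for the example.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings; a trained model would supply these.
toy_embeddings = {
    "oil": np.array([1.0, 0.0, 2.0]),
    "price": np.array([3.0, 2.0, 0.0]),
}

def average_vector(tokens, embeddings, vector_size):
    """Mean of the vectors of all known tokens, or zeros if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return np.zeros(vector_size)
    return np.mean(vectors, axis=0)

# "unseenword" is skipped, so only "oil" and "price" are averaged.
known = average_vector(["oil", "price", "unseenword"], toy_embeddings, 3)

# No known tokens at all: the zero-vector fallback applies.
unknown = average_vector(["unseenword"], toy_embeddings, 3)

print(known, unknown)
```

Averaging gives every article a vector of the same length regardless of how many words it has, at the cost of ignoring word order.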

#### Model Training and Evaluation:

In [18]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score, f1_score

In [19]:
def collect_metrics(X_train_vec, X_test_vec, y_train, y_test, model, model_name, feature_name):
    model.fit(X_train_vec, y_train)
    y_pred = model.predict(X_test_vec)
    return {
        "Model": model_name,
        "Feature": feature_name,
        "Accuracy": accuracy_score(y_test, y_pred),
        "Precision (macro)": precision_score(y_test, y_pred, average='macro'),
        "Recall (macro)": recall_score(y_test, y_pred, average='macro'),
        "F1-score (macro)": f1_score(y_test, y_pred, average='macro')
    }
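
The `average='macro'` setting used in `collect_metrics` computes the metric separately for each class and then takes the unweighted mean. A small sketch with made-up labels (not taken from the dataset) shows the relationship:

```python
import numpy as np
from sklearn.metrics import precision_score

# Made-up true and predicted labels for three classes.
y_true = [1, 1, 2, 2, 3, 3]
y_pred = [1, 2, 2, 2, 3, 1]

# average=None returns one precision value per class (sorted by label).
per_class = precision_score(y_true, y_pred, average=None)

# average='macro' is the plain mean of those per-class values.
macro = precision_score(y_true, y_pred, average="macro")

print(per_class, macro)
```

Because the AG News test set has the same number of articles in every class, macro recall coincides with accuracy, which is why those two columns match in the results table.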

In [20]:
results = []

# BoW features:
# results.append(collect_metrics(X_train_bow, X_test_bow, y_train, y_test, LogisticRegression(max_iter=1000), "Logistic Regression", "BoW"))
# results.append(collect_metrics(X_train_bow.toarray(), X_test_bow.toarray(), y_train, y_test, GaussianNB(), "GaussianNB", "BoW"))
results.append(collect_metrics(X_train_bow, X_test_bow, y_train, y_test, RandomForestClassifier(), "RandomForestClassifier", "BoW"))


In [21]:
# results.append(collect_metrics(X_train_bow, X_test_bow, y_train, y_test, SVC(), "SVC", "BoW"))

In [22]:
# TF-IDF features:
# results.append(collect_metrics(X_train_tfidf, X_test_tfidf, y_train, y_test, LogisticRegression(max_iter=1000), "Logistic Regression", "TF-IDF"))
# results.append(collect_metrics(X_train_tfidf.toarray(), X_test_tfidf.toarray(), y_train, y_test, GaussianNB(), "GaussianNB", "TF-IDF"))
results.append(collect_metrics(X_train_tfidf, X_test_tfidf, y_train, y_test, RandomForestClassifier(), "RandomForestClassifier", "TF-IDF"))
# results.append(collect_metrics(X_train_tfidf, X_test_tfidf, y_train, y_test, SVC(), "SVC", "TF-IDF"))


In [23]:

# Word2Vec features:
# results.append(collect_metrics(X_train_w2v, X_test_w2v, y_train, y_test, LogisticRegression(max_iter=1000), "Logistic Regression", "Word2Vec"))
# results.append(collect_metrics(X_train_w2v, X_test_w2v, y_train, y_test, GaussianNB(), "GaussianNB", "Word2Vec"))
results.append(collect_metrics(X_train_w2v, X_test_w2v, y_train, y_test, RandomForestClassifier(), "RandomForestClassifier", "Word2Vec"))
# results.append(collect_metrics(X_train_w2v, X_test_w2v, y_train, y_test, SVC(), "SVC", "Word2Vec"))


In [25]:
# Display results as a Dataframe:
df_results = pd.DataFrame(results)
df_results = df_results.sort_values(by="Accuracy", ascending=False).reset_index(drop=True)
display(df_results)

| | Model | Feature | Accuracy | Precision (macro) | Recall (macro) | F1-score (macro) |
|---|---|---|---|---|---|---|
| 0 | RandomForestClassifier | BoW | 0.891053 | 0.890675 | 0.891053 | 0.890626 |
| 1 | RandomForestClassifier | TF-IDF | 0.888026 | 0.887571 | 0.888026 | 0.887486 |
| 2 | RandomForestClassifier | Word2Vec | 0.886053 | 0.885921 | 0.886053 | 0.885864 |


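Logistic Regression performed well in this case: it achieved about 91% accuracy with TF-IDF features. Its runs are commented out in the cells above, so it does not appear in the results table, and we retrain it here as the final model: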

In [26]:
# Training the Logistic regression model
final_model = LogisticRegression(max_iter=1000)
final_model.fit(X_train_tfidf,y_train)

In [28]:
y_pred_by_log_reg = final_model.predict(X_test_tfidf)

In [29]:
print(accuracy_score(y_test,y_pred_by_log_reg))

0.9140789473684211


In [30]:
print(classification_report(y_test,y_pred_by_log_reg))

              precision    recall  f1-score   support

           1       0.93      0.90      0.92      1900
           2       0.95      0.98      0.97      1900
           3       0.89      0.88      0.88      1900
           4       0.89      0.89      0.89      1900

    accuracy                           0.91      7600
   macro avg       0.91      0.91      0.91      7600
weighted avg       0.91      0.91      0.91      7600



In [31]:
# Save the model and Vectorizer:
import joblib

joblib.dump(final_model, "log_reg_model.pkl")
joblib.dump(tfidf_vectorizer, "tfidf_vectorizer.pkl")

['tfidf_vectorizer.pkl']
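
To use saved artifacts later, both the vectorizer and the model have to be loaded and applied in the same order as during training. The sketch below shows that round trip on a tiny stand-in corpus written to a temporary directory, so it does not touch the files saved above; the toy texts, labels, and file names are all hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny stand-in corpus; labels follow the class indices (2 = Sports, 3 = Business).
toy_texts = ["stocks rally on wall street", "oil prices climb again",
             "team wins the cup final", "striker scores twice in derby"]
toy_labels = [3, 3, 2, 2]

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pkl")
    vectorizer_path = os.path.join(tmp_dir, "toy_vectorizer.pkl")

    # Save both objects, then load them back as a fresh session would.
    joblib.dump(toy_model, model_path)
    joblib.dump(toy_vectorizer, vectorizer_path)
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

# Transform new text with the loaded vectorizer, then predict.
new_texts = ["oil stocks climb", "cup final derby"]
original_pred = toy_model.predict(toy_vectorizer.transform(new_texts))
loaded_pred = loaded_model.predict(loaded_vectorizer.transform(new_texts))

print(list(loaded_pred))
```

With the real `log_reg_model.pkl` and `tfidf_vectorizer.pkl`, new articles would also need to pass through `preprocess_text` before `transform`, since the vectorizer was fitted on cleaned text.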