<a href="https://www.kaggle.com/code/saibhossain/build-a-simple-ml-model?scriptVersionId=290017903" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Build a simple ML model for classification

## 1. Dataset Description

I used the **AG News dataset**, a large-scale text classification dataset for news topic prediction.

It contains short news articles labeled into **4 categories: World, Sports, Business, and Science/Technology.**

The training set has 120,000 samples, and the test set has 7,600 samples, each consisting of a news headline + description and a numeric label.

In [1]:
from datasets import load_dataset

dataset = load_dataset("ag_news")
dataset.shape

{'train': (120000, 2), 'test': (7600, 2)}

In [2]:
dataset["train"][0:2]

{'text': ["Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.",
  'Carlyle Looks Toward Commercial Aerospace (Reuters) Reuters - Private investment firm Carlyle Group,\\which has a reputation for making well-timed and occasionally\\controversial plays in the defense industry, has quietly placed\\its bets on another part of the market.'],
 'label': [2, 2]}

## 2. Model Choice & Justification

I used **Logistic Regression** with **TF-IDF features.**

Why this model?

- Logistic Regression is strong for high-dimensional sparse text data.
- TF-IDF converts raw text into meaningful numeric signals.
- It is fast to train, interpretable, and a strong baseline.
- It is often competitive with shallow neural models for text classification.
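To make the TF-IDF step concrete, here is a minimal pure-Python sketch of the weighting. It assumes the smoothed formula that scikit-learn documents as the `TfidfVectorizer` default, `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization. The three headlines are hypothetical, and stop-word removal and the real tokenizer are left out.

```python
import math
from collections import Counter

# Three toy headlines (hypothetical, not taken from AG News).
docs = [
    "stocks rally as markets rise",
    "team wins the cup final",
    "markets fall as oil prices rise",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tfidf(tokens):
    # Raw term counts times IDF, then scaled to unit Euclidean length.
    counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vector = tfidf(tokenized[0])
for term, weight in sorted(vector.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.3f}")
```

Terms that occur in only one headline ("stocks", "rally") end up with larger weights than terms shared across headlines ("markets", "rise"), which is the signal the classifier relies on.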

In [3]:
dataset["train"].features

{'text': Value('string'),
 'label': ClassLabel(names=['World', 'Sports', 'Business', 'Sci/Tech'])}
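The integer labels map to topic names by position in the `ClassLabel` list shown above. The helper below is a small illustrative sketch (the name `label_name` is hypothetical) that makes numeric output easier to read.

```python
# Label names in index order, copied from the ClassLabel output above.
LABEL_NAMES = ["World", "Sports", "Business", "Sci/Tech"]


def label_name(label_id):
    """Return the topic name for an integer AG News label."""
    return LABEL_NAMES[label_id]


# Both sample articles shown earlier carry label 2.
print(label_name(2))  # Business
```

The same list could be passed as `target_names` to scikit-learn's `classification_report` to print names instead of 0 to 3.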

## 3. Model Training

In [4]:
X_train = dataset["train"]["text"]
y_train = dataset["train"]["label"]

X_test = dataset["test"]["text"]
y_test = dataset["test"]["label"]

print("X_train,y_train =", len(X_train), len(y_train),", X_test,y_test =", len(X_test), len(y_test))

X_train,y_train = 120000 120000 , X_test,y_test = 7600 7600


In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report


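# Build TF-IDF features over a vocabulary capped at 120,000 terms, dropping English stop words.
vectorizer = TfidfVectorizer(max_features=120000, stop_words="english")

# Fit the vocabulary on the training text only, then reuse it for the test text.
X_train_final = vectorizer.fit_transform(X_train)
X_test_final = vectorizer.transform(X_test)

# Train the classifier; max_iter is raised so the solver has room to converge.
model = LogisticRegression(max_iter=1800)
model.fit(X_train_final, y_train)

y_pred = model.predict(X_test_final)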

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 0.9148684210526316
              precision    recall  f1-score   support

           0       0.93      0.90      0.92      1900
           1       0.96      0.98      0.97      1900
           2       0.89      0.88      0.88      1900
           3       0.89      0.90      0.89      1900

    accuracy                           0.91      7600
   macro avg       0.91      0.91      0.91      7600
weighted avg       0.91      0.91      0.91      7600



## 4. Evaluation Metrics

The trained TF-IDF + Logistic Regression model achieves an overall accuracy of 91.49% on the AG News test set, indicating strong generalization to unseen news articles.

Class-wise precision ranges from 0.89 to 0.96, showing that the model makes reliable predictions across all four categories.
Recall is consistently high (0.88–0.98), meaning most articles belonging to each class are correctly identified.
The F1-scores are well balanced, with all classes scoring around 0.88–0.97, reflecting a good trade-off between precision and recall.
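The F1-score is the harmonic mean of precision and recall. As a quick check, the sketch below recomputes it from the rounded precision and recall values in the report. Because the inputs are already rounded to two decimals, the result can differ from the reported F1 by about 0.01.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per class, as rounded in the classification report.
report = {0: (0.93, 0.90), 1: (0.96, 0.98), 2: (0.89, 0.88), 3: (0.89, 0.90)}

for label, (p, r) in report.items():
    print(label, round(f1_score_from(p, r), 3))
```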

The macro and weighted averages are both approximately 0.91. They coincide because every class has the same number of test samples (1,900), so neither average favors any single category.
The highest performance is observed for class 1 (Sports), which likely has the most distinctive vocabulary, while the slightly lower scores for classes 2 (Business) and 3 (Sci/Tech) suggest some overlap in their language.
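Per-class scores alone do not show which classes get confused with each other; a confusion matrix does. The dependency-free sketch below uses made-up labels only to illustrate the layout. In the notebook, one would pass `y_test` and `y_pred` instead, or call `sklearn.metrics.confusion_matrix`, which uses the same rows-are-true, columns-are-predicted layout.

```python
from collections import Counter

LABEL_NAMES = ["World", "Sports", "Business", "Sci/Tech"]


def confusion_counts(y_true, y_pred, n_classes):
    """Return matrix[i][j] = number of samples of true class i predicted as j."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(i, j)] for j in range(n_classes)] for i in range(n_classes)]


# Hypothetical labels for illustration only, not results from the model.
y_true_demo = [0, 1, 2, 2, 3, 3, 2, 0]
y_pred_demo = [0, 1, 2, 3, 3, 2, 2, 2]

matrix = confusion_counts(y_true_demo, y_pred_demo, len(LABEL_NAMES))
for name, row in zip(LABEL_NAMES, matrix):
    print(f"{name:9s}", row)
```

Large off-diagonal entries identify the pairs of topics whose vocabulary overlaps most.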

Overall, these metrics confirm that the model provides robust and consistent multi-class text classification performance.

## 5. Explanation

I chose Logistic Regression because it performs exceptionally well on sparse, high-dimensional text features generated by TF-IDF.

The results show that simple linear models can capture strong topical signals from word frequency patterns alone.
An accuracy of about 91.5% indicates that topic-specific vocabulary is usually enough to identify the category.
The model performs especially well on the most clearly separated categories, Sports (F1 0.97) and World (F1 0.92).
It struggles slightly with Business (F1 0.88) and Sci/Tech (F1 0.89), which suggests these two topics share vocabulary.
A key limitation is that TF-IDF ignores word order and deeper semantic meaning.
The vocabulary is also fixed at training time, so the model cannot adapt to new terms or shifting usage without retraining.

With more time, I would experiment with n-grams, class-weighted loss, and transformer-based embeddings for deeper semantic understanding.
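As a small illustration of why n-grams are worth trying, the sketch below builds word n-grams by hand for two hypothetical headlines that contain the same words in a different order. In scikit-learn, the equivalent experiment would set `ngram_range=(1, 2)` on `TfidfVectorizer`.

```python
def ngrams(text, n):
    """Return the n-word sequences in a whitespace-tokenized text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Two hypothetical headlines with identical words in a different order.
a = "oil prices lift stocks"
b = "stocks lift oil prices"

# Same bag of words, so unigram features cannot tell the two apart.
print(sorted(ngrams(a, 1)) == sorted(ngrams(b, 1)))  # True
# Bigrams keep local word order, so the feature sets differ.
print(set(ngrams(a, 2)) == set(ngrams(b, 2)))  # False
```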

# Part 2: Explain a Core ML Concept Simply

## Overfitting

Overfitting means the model **learns** the AG News training articles too **closely**.

It may remember specific words or phrases from the training news. Because of this, it can perform very well on the **training set**. But when it sees new news articles from the test set, it may make more mistakes.

This typically happens when the model is too complex for the amount of training data.
In AG News, overfitting would show as **high training accuracy** but **lower test accuracy.**
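The train-versus-test gap can be shown with a toy sketch. The headlines below are hypothetical (labels follow the AG News convention, 1 = Sports and 2 = Business), and both "models" are hand-written stand-ins: one memorizes the training text exactly, the other uses a simple keyword rule.

```python
# Hypothetical toy data; 1 = Sports, 2 = Business.
train = [
    ("team wins league match", 1),
    ("coach praises young team", 1),
    ("stocks fall on profit fears", 2),
    ("bank shares rise after earnings", 2),
]
test = [
    ("league match ends in draw", 1),
    ("coach leaves team", 1),
    ("profit fears hit bank stocks", 2),
    ("shares rise on earnings", 2),
]


def accuracy(predict, data):
    """Fraction of (text, label) pairs that predict() labels correctly."""
    return sum(predict(text) == label for text, label in data) / len(data)


# Overfit stand-in: looks up exact training headlines, guesses Sports otherwise.
memory = dict(train)


def memorizer(text):
    return memory.get(text, 1)


# Simpler stand-in: a keyword rule that generalizes to unseen wording.
SPORTS_WORDS = {"team", "match", "coach", "league"}


def keyword_rule(text):
    return 1 if SPORTS_WORDS & set(text.split()) else 2


for name, model_fn in [("memorizer", memorizer), ("keyword rule", keyword_rule)]:
    print(name, accuracy(model_fn, train), accuracy(model_fn, test))
```

The memorizer is perfect on the training set but no better than guessing on the test set, while the keyword rule scores the same on both. For the real model, the analogous check is to compare `model.score(X_train_final, y_train)` with the test accuracy.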

Capping the **TF-IDF** vocabulary with `max_features` and using **Logistic Regression**, which applies L2 regularization by default in scikit-learn, both help reduce this risk.