<a href="https://colab.research.google.com/github/MajedKawa/Classify-News-Headlines/blob/main/Supervised_Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Classify News Headlines into Categories (Text)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt


In [None]:
# Get the dataset
splits = {'train': 'data/train-00000-of-00001.parquet', 'test': 'data/test-00000-of-00001.parquet'}
df = pd.read_parquet("hf://datasets/wangrongsheng/ag_news/" + splits["train"])



In [None]:
df.info() # Get some information about the DataFrame

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 120000 entries, 0 to 119999
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    120000 non-null  object
 1   label   120000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 1.8+ MB


There are no null values, as the output of the `info` method shows: both columns report 120000 non-null entries.

In [None]:
df.head(10)

,text,label
0,Wall St. Bears Claw Back Into the Black (Reute...,2
1,Carlyle Looks Toward Commercial Aerospace (Reu...,2
2,Oil and Economy Cloud Stocks' Outlook (Reuters...,2
3,Iraq Halts Oil Exports from Main Southern Pipe...,2
4,"Oil prices soar to all-time record, posing new...",2
5,"Stocks End Up, But Near Year Lows (Reuters) Re...",2
6,Money Funds Fell in Latest Week (AP) AP - Asse...,2
7,Fed minutes show dissent over inflation (USATO...,2
8,Safety Net (Forbes.com) Forbes.com - After ear...,2
9,Wall St. Bears Claw Back Into the Black NEW Y...,2


In [None]:
df['label'].value_counts() # To check the values we have

label,count
2,30000
3,30000
1,30000
0,30000


In [None]:
label_mapping = {
    0: "World",
    1: "Sports",
    2: "Business",
    3: "Science/Technology"
}
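
The mapping is defined here but not applied later in the notebook. As a minimal sketch of how it could be used, the example below translates integer labels into category names with `Series.map`. The `toy_df` rows and the `category` column name are hypothetical stand-ins, not part of the original data.

```python
import pandas as pd

# Same mapping as in the notebook
label_mapping = {
    0: "World",
    1: "Sports",
    2: "Business",
    3: "Science/Technology",
}

# Hypothetical stand-in for the real DataFrame
toy_df = pd.DataFrame({
    "text": ["Stocks end up", "Team wins final", "New chip released"],
    "label": [2, 1, 3],
})

# Series.map looks up each integer label in the dictionary
toy_df["category"] = toy_df["label"].map(label_mapping)
print(toy_df)
```

The same call on the real DataFrame would be `df['label'].map(label_mapping)`, which makes tables and plots easier to read than raw integers.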

In [None]:
df = df.sample(frac=1, random_state=42) # shuffle the DataFrame rows (the seed is an added assumption so the shuffle is reproducible)


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['text'], df['label'], test_size=0.2, random_state=42)

# TF-IDF Vectorization
# max_features=5000 keeps only the 5000 terms with the highest term frequency across the corpus;
# with max_features=None, all terms are used.
# stop_words='english' drops common English words using scikit-learn's built-in list.
tfidf = TfidfVectorizer(max_features=5000, stop_words='english')
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
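
To make the effect of `max_features` concrete, here is a small sketch on a toy corpus. The three sentences and the `toy_` names are made up for illustration; only the `TfidfVectorizer` parameters mirror the cell above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, not taken from the dataset
toy_corpus = [
    "oil prices rise as stocks fall",
    "stocks rally as oil prices drop",
    "the team wins the final match",
]

# Keep only the 3 most frequent terms after stop-word removal
toy_tfidf = TfidfVectorizer(max_features=3, stop_words="english")
toy_matrix = toy_tfidf.fit_transform(toy_corpus)

# The learned vocabulary and the shape of the sparse matrix
print(sorted(toy_tfidf.vocabulary_))
print(toy_matrix.shape)  # (number of documents, number of kept terms)
```

Each row is one document and each column one kept term, so the real `X_train_tfidf` has at most 5000 columns. The vectorizer is fitted on the training texts only and then reused for the test texts, which keeps test-set vocabulary from leaking into training.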

Train Multiple Classifiers

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier

# Initialize models
models = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'KNN': KNeighborsClassifier()
}

# Train each model and store its predictions on the test set
results = {}
for name, model in models.items():
    model.fit(X_train_tfidf, y_train)
    y_pred = model.predict(X_test_tfidf)
    results[name] = y_pred

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
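
The warning above comes from Logistic Regression stopping at its iteration limit before the solver converged. One way to address it, as the message suggests, is to raise `max_iter`. The sketch below does this on synthetic data; the value 1000 and the `make_classification` settings are assumptions chosen for illustration, not values from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 4-class data standing in for the TF-IDF features (assumption)
X_toy, y_toy = make_classification(
    n_samples=200,
    n_features=20,
    n_informative=5,
    n_classes=4,
    random_state=0,
)

# A higher iteration cap gives the solver room to converge
clf = LogisticRegression(max_iter=1000)
clf.fit(X_toy, y_toy)

# n_iter_ reports how many iterations the solver actually used
print(clf.n_iter_)
```

In the notebook this would mean replacing `LogisticRegression()` with `LogisticRegression(max_iter=1000)` in the `models` dictionary. If `n_iter_` stays below the cap, the solver converged and the warning should no longer appear.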


Evaluate the models using precision, recall, and F1-score

In [None]:
from sklearn.metrics import precision_score, recall_score, f1_score

# Evaluate each model
evaluation = {}
for name, y_pred in results.items():
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    evaluation[name] = {
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1
    }

# Display evaluation results
for name, metrics in evaluation.items():
    print(f"{name}:")
    print(f"  Precision: {metrics['Precision']:.2f}")
    print(f"  Recall: {metrics['Recall']:.2f}")
    print(f"  F1-Score: {metrics['F1-Score']:.2f}")
    print()

Logistic Regression:
  Precision: 0.91
  Recall: 0.91
  F1-Score: 0.91

Decision Tree:
  Precision: 0.81
  Recall: 0.81
  F1-Score: 0.81

Gradient Boosting:
  Precision: 0.83
  Recall: 0.83
  F1-Score: 0.83

KNN:
  Precision: 0.89
  Recall: 0.89
  F1-Score: 0.89
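
A note on reading these numbers: with `average='weighted'`, each class's score is weighted by its number of true samples, and weighted recall works out to the same value as plain accuracy. The sketch below checks this on made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical) and prints a per-class breakdown with `classification_report`.

```python
from sklearn.metrics import accuracy_score, classification_report, recall_score

# Hypothetical labels for four classes, two samples each
y_true_toy = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred_toy = [0, 1, 1, 1, 2, 0, 3, 3]

# Weighted recall sums each class's recall times its share of samples,
# which equals the fraction of all samples predicted correctly
weighted_recall = recall_score(y_true_toy, y_pred_toy, average="weighted")
accuracy = accuracy_score(y_true_toy, y_pred_toy)
print(weighted_recall, accuracy)

# Per-class precision, recall, and F1 reveal what the averages hide
print(classification_report(y_true_toy, y_pred_toy))
```

Running `classification_report(y_test, y_pred)` for each model in `results` would show whether some categories are harder to classify than others.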



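As we can see from the results, Logistic Regression scores highest (weighted F1-score: 0.91), which makes it the strongest candidate of the four models tested. Note that it stopped at its iteration limit, so its score might change slightly with a higher `max_iter`. KNN is close behind (F1-score: 0.89), but it is more expensive at prediction time because each query is compared against the stored training samples. Gradient Boosting and Decision Tree trail (F1-scores: 0.83 and 0.81), possibly because of overfitting or untuned hyperparameters; all models were run with default settings, so this was not verified.

Precision and recall are nearly identical for every model. This is expected here: the classes are balanced and the scores are weighted averages, so they summarize overall performance and can hide differences between individual categories. Possible next steps are hyperparameter tuning and trying other models such as SVM.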
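
As a rough sketch of the two suggested next steps, the example below tunes the regularization strength `C` of a linear SVM with `GridSearchCV`. Everything here is an assumption for illustration: the toy texts, the choice of `LinearSVC`, the grid values, and `cv=2` are not from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical mini-dataset: 2 = Business, 1 = Sports
toy_texts = [
    "stocks fall as oil prices rise",
    "bank profits beat market forecasts",
    "shares rally after strong earnings",
    "oil market slides on weak demand",
    "team wins the championship final",
    "striker scores twice in cup match",
    "coach praises players after win",
    "injured player misses the season",
]
toy_labels = [2, 2, 2, 2, 1, 1, 1, 1]

# Vectorizer and classifier in one pipeline, so TF-IDF is refit per fold
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svm", LinearSVC()),
])

# Candidate values for the SVM regularization strength
param_grid = {"svm__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_weighted")
search.fit(toy_texts, toy_labels)
print(search.best_params_)
```

On the real data, the pipeline would be fitted on the raw `X_train` texts rather than on `X_train_tfidf`, with a larger `cv` and a wider grid. Keeping the vectorizer inside the pipeline means each cross-validation fold builds its own vocabulary, so the validation folds do not influence feature extraction.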


