# News Topic Classification

The project involves a large dataset of news articles collected over several years. These articles cover a wide range of topics such as world events, sports, business, and science/technology. Each article headline is labeled with a number from 0 to 3, indicating its category, as described below. 

| Value | Topic        |
|:------|:-------------|
| 0     | World        |
| 1     | Sports       |
| 2     | Business     |
| 3     | Sci/Tech     |


Our goal is to create a model that, given an unknown article headline, can classify it into one of these 4 topics.
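
The label scheme in the table above can be captured in a small lookup, which is handy when turning numeric predictions back into readable topic names. The sketch below is illustrative only: the names `ID2TOPIC` and `topic_name` are assumptions and do not appear elsewhere in this notebook.

```python
# Mapping from numeric label to topic name, copied from the table above.
ID2TOPIC = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}


def topic_name(label: int) -> str:
    """Return the topic name for a numeric label (hypothetical helper)."""
    if label not in ID2TOPIC:
        raise ValueError(f"Unknown label: {label!r}")
    return ID2TOPIC[label]


print(topic_name(2))  # prints "Business"
```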

This notebook focuses on fine-tuning a transformer as a way to solve the problem at hand.

# Importing the Data Set
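Our dataset has two columns of interest, *text* and *label*, as shown below. The extra `Unnamed: 0` column in the output appears to be a row index that was saved along with the CSV file.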

In [1]:
import pandas as pd
df = pd.read_csv('training_data.csv')
df.head(10)

| Unnamed: 0 | text                                               | label |
|:-----------|:---------------------------------------------------|:------|
| 0          | Wall St. Bears Claw Back Into the Black (Reute...  | 2     |
| 1          | Carlyle Looks Toward Commercial Aerospace (Reu...  | 2     |
| 2          | Oil and Economy Cloud Stocks' Outlook (Reuters...  | 2     |
| 3          | Iraq Halts Oil Exports from Main Southern Pipe...  | 2     |
| 4          | Oil prices soar to all-time record, posing new...  | 2     |
| 5          | Stocks End Up, But Near Year Lows (Reuters) Re...  | 2     |
| 6          | Money Funds Fell in Latest Week (AP) AP - Asse...  | 2     |
| 7          | Fed minutes show dissent over inflation (USATO...  | 2     |
| 8          | Safety Net (Forbes.com) Forbes.com - After ear...  | 2     |
| 9          | Wall St. Bears Claw Back Into the Black NEW Y...   | 2     |


# Creating the Training Functions

In [2]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification, BertForSequenceClassification
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

def load_model_and_tokenizer(model_name):
    """
    Load the appropriate model and tokenizer based on the model name.
    """
    # Checkpoints that are loaded explicitly with the BERT classification class.
    bert_checkpoints = {
        "lucasresck/bert-base-cased-ag-news",
        "fabriceyhc/bert-base-uncased-ag_news",
    }

    # The tokenizer is loaded the same way for every checkpoint.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    if model_name in bert_checkpoints:
        model = BertForSequenceClassification.from_pretrained(model_name)
    else:  # Default to AutoModelForSequenceClassification for other models
        model = AutoModelForSequenceClassification.from_pretrained(model_name)
    return tokenizer, model

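def train_data(training_data, test_data):
    """
    Run each listed checkpoint on the test data and print its metrics.
    """
    # No weights are updated in this function: the training columns are
    # extracted here but are not used below.
    X_train = training_data['text']
    y_train = training_data['label']
    X_val = test_data['text']
    y_val = test_data['label']

    # Hugging Face Hub checkpoints to compare.
    transformer_models = [
        "textattack/roberta-base-ag-news",
        "fabriceyhc/bert-base-uncased-ag_news",
        "lucasresck/bert-base-cased-ag-news",
    ]

    for model_name in transformer_models:
        tokenizer, model = load_model_and_tokenizer(model_name)
        transformer_pipeline = pipeline("text-classification", model=model, tokenizer=tokenizer)

        # Each prediction is a dict with 'label' and 'score' keys.
        transformer_predictions = transformer_pipeline(X_val.tolist())
        # Labels are assumed to look like 'LABEL_2'; the trailing integer is the class id.
        y_pred_transformer = [int(pred['label'].split('_')[-1]) for pred in transformer_predictions]
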
        print('----------------------------------------------------------------')
        print(f"Transformer Model ({model_name}) Accuracy:", accuracy_score(y_val, y_pred_transformer))
        print(f"Transformer Model ({model_name}) Confusion Matrix:")
        print(confusion_matrix(y_val, y_pred_transformer))
        print(f"Transformer Model ({model_name}) Classification Report:")
        print(classification_report(y_val, y_pred_transformer))
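
The list comprehension that builds `y_pred_transformer` relies on the pipeline returning labels of the form `LABEL_<n>`. As a minimal sketch of that parsing step, the hypothetical helper `parse_label` below handles that format and, as an assumed fallback, checkpoints whose configuration might emit topic names instead. The sample dictionaries are made-up stand-ins shaped like pipeline output, not real predictions.

```python
# Assumed fallback mapping for checkpoints that emit topic names (hypothetical).
TOPIC2ID = {"world": 0, "sports": 1, "business": 2, "sci/tech": 3}


def parse_label(raw_label: str) -> int:
    """Convert a pipeline label string into an integer class id."""
    suffix = raw_label.split("_")[-1]
    if suffix.isdigit():
        # Generic format, e.g. "LABEL_2" becomes 2.
        return int(suffix)
    key = raw_label.strip().lower()
    if key in TOPIC2ID:
        # Named format, e.g. "Business" becomes 2.
        return TOPIC2ID[key]
    raise ValueError(f"Unrecognized label: {raw_label!r}")


# Made-up examples shaped like text-classification pipeline output.
sample_predictions = [
    {"label": "LABEL_2", "score": 0.98},
    {"label": "Sports", "score": 0.91},
]
y_pred = [parse_label(p["label"]) for p in sample_predictions]
print(y_pred)  # prints [2, 1]
```

Raising on an unrecognized label, rather than silently guessing, makes a mismatch between a checkpoint's label names and the dataset's numbering visible before any metric is computed.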

In [3]:
training_data = pd.read_csv('training_data.csv')
test_data = pd.read_csv('test_data.csv')


train_data(training_data, test_data)

Some weights of the model checkpoint at textattack/roberta-base-ag-news were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


----------------------------------------------------------------
Transformer Model (textattack/roberta-base-ag-news) Accuracy: 0.6
Transformer Model (textattack/roberta-base-ag-news) Confusion Matrix:
[[4 0 1 0]
 [2 1 1 1]
 [0 0 4 1]
 [2 0 0 3]]
Transformer Model (textattack/roberta-base-ag-news) Classification Report:
              precision    recall  f1-score   support

           0       0.50      0.80      0.62         5
           1       1.00      0.20      0.33         5
           2       0.67      0.80      0.73         5
           3       0.60      0.60      0.60         5

    accuracy                           0.60        20
   macro avg       0.69      0.60      0.57        20
weighted avg       0.69      0.60      0.57        20

----------------------------------------------------------------
Transformer Model (fabriceyhc/bert-base-uncased-ag_news) Accuracy: 0.7
Transformer Model (fabriceyhc/bert-base-uncased-ag_news) Confusion Matrix:
[[4 0 1 0]
 [2 2 0 1]
 [0 0 4 1]




----------------------------------------------------------------
Transformer Model (lucasresck/bert-base-cased-ag-news) Accuracy: 0.7
Transformer Model (lucasresck/bert-base-cased-ag-news) Confusion Matrix:
[[3 0 1 1]
 [3 2 0 0]
 [0 0 5 0]
 [1 0 0 4]]
Transformer Model (lucasresck/bert-base-cased-ag-news) Classification Report:
              precision    recall  f1-score   support

           0       0.43      0.60      0.50         5
           1       1.00      0.40      0.57         5
           2       0.83      1.00      0.91         5
           3       0.80      0.80      0.80         5

    accuracy                           0.70        20
   macro avg       0.77      0.70      0.70        20
weighted avg       0.77      0.70      0.70        20
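
The figures in the classification reports follow directly from the confusion matrices. As a worked check, the sketch below recomputes accuracy, precision, and recall from the matrix printed for `textattack/roberta-base-ag-news`, assuming the scikit-learn convention that rows are true labels and columns are predicted labels. The helper names `accuracy`, `precision`, and `recall` are illustrative and not part of the notebook.

```python
# Confusion matrix printed above for textattack/roberta-base-ag-news.
# Rows are true labels, columns are predicted labels.
cm = [
    [4, 0, 1, 0],
    [2, 1, 1, 1],
    [0, 0, 4, 1],
    [2, 0, 0, 3],
]


def accuracy(matrix):
    """Fraction of samples on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def precision(matrix, k):
    """Correct predictions of class k divided by all predictions of class k."""
    predicted_k = sum(row[k] for row in matrix)
    return matrix[k][k] / predicted_k if predicted_k else 0.0


def recall(matrix, k):
    """Correct predictions of class k divided by all true samples of class k."""
    actual_k = sum(matrix[k])
    return matrix[k][k] / actual_k if actual_k else 0.0


print("accuracy:", round(accuracy(cm), 2))
for k in range(len(cm)):
    print(k, "precision:", round(precision(cm, k), 2), "recall:", round(recall(cm, k), 2))
```

For class 0, for example, 4 of the 8 headlines predicted as World are correct (precision 0.50), while 4 of the 5 true World headlines are recovered (recall 0.80), matching the first row of the report.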

