# News Topic Classification

The project involves a large dataset of news articles collected over several years. These articles cover a wide range of topics such as world events, sports, business, and science/technology. Each article headline is labeled with a number from 0 to 3, indicating its category, as described below. 

| Value | Topic        |
|:------|:-------------|
| 0     | World        |
| 1     | Sports       |
| 2     | Business     |
| 3     | Sci/Tech     |


Our goal is to create a model that, given an unknown article headline, can classify it into one of these 4 topics.
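As a small illustration, the label table above can be kept as a lookup so that numeric predictions stay readable. The names `TOPIC_NAMES` and `to_topic` are hypothetical helpers, not part of the original notebook; the `LABEL_<n>` string format is the one handled by the label mapping in the evaluation cell of this notebook.

```python
# Hypothetical helper: maps the numeric labels from the table above to topic names.
TOPIC_NAMES = {0: "World", 1: "Sports", 2: "Business", 3: "Sci/Tech"}

def to_topic(label):
    """Accept an integer label (e.g. 2) or a pipeline-style string (e.g. 'LABEL_2')."""
    if isinstance(label, str):
        # Keep only the part after the last underscore and convert it to an int.
        label = int(label.rsplit("_", 1)[-1])
    if label not in TOPIC_NAMES:
        raise ValueError(f"Unknown label: {label}")
    return TOPIC_NAMES[label]

print(to_topic(2))          # Business
print(to_topic("LABEL_3"))  # Sci/Tech
```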

This notebook focuses on fine-tuning an established, domain-adapted model of our own as a way to solve this problem.

# Importing the Data Set
Our dataset consists of only two columns, *text* and *label*, as shown below:

In [1]:
import pandas as pd
df = pd.read_csv('training_data.csv')
df.head(10)

Unnamed: 0,text,label
0,Wall St. Bears Claw Back Into the Black (Reute...,2
1,Carlyle Looks Toward Commercial Aerospace (Reu...,2
2,Oil and Economy Cloud Stocks' Outlook (Reuters...,2
3,Iraq Halts Oil Exports from Main Southern Pipe...,2
4,"Oil prices soar to all-time record, posing new...",2
5,"Stocks End Up, But Near Year Lows (Reuters) Re...",2
6,Money Funds Fell in Latest Week (AP) AP - Asse...,2
7,Fed minutes show dissent over inflation (USATO...,2
8,Safety Net (Forbes.com) Forbes.com - After ear...,2
9,Wall St. Bears Claw Back Into the Black NEW Y...,2


## Creating the train functions
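We first run the model as it is, before any further fine-tuning, to get a baseline. Note that `train_data` only evaluates the model on the test set; it does not train anything.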

In [2]:
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

def load_model_and_tokenizer(model_name):
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return tokenizer, model

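def train_data(training_data, test_data):
    # Despite its name, this function only evaluates the model on test_data;
    # training_data is accepted but not used.
    X_val = test_data['text']
    y_val = test_data['label']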

    model_name = "./fine-tuned-distilbert"

    tokenizer, model = load_model_and_tokenizer(model_name)
    transformer_pipeline = pipeline("text-classification", model=model, tokenizer=tokenizer)
    
    # The pipeline returns labels as strings such as 'LABEL_2'; map them back to integers.
    label_mapping = {'LABEL_0': 0, 'LABEL_1': 1, 'LABEL_2': 2, 'LABEL_3': 3}

    transformer_predictions = transformer_pipeline(X_val.tolist())
    y_pred_transformer = [label_mapping[pred['label']] for pred in transformer_predictions]
    
    print('----------------------------------------------------------------')
    print(f"Transformer Model ({model_name}) Accuracy:", accuracy_score(y_val, y_pred_transformer))
    print(f"Transformer Model ({model_name}) Confusion Matrix:")
    print(confusion_matrix(y_val, y_pred_transformer))
    print(f"Transformer Model ({model_name}) Classification Report:")
    print(classification_report(y_val, y_pred_transformer))

training_data = pd.read_csv('training_data.csv')
test_data = pd.read_csv('test_data.csv')


train_data(training_data, test_data)

----------------------------------------------------------------
Transformer Model (./fine-tuned-distilbert) Accuracy: 0.7
Transformer Model (./fine-tuned-distilbert) Confusion Matrix:
[[3 0 2 0]
 [0 4 1 0]
 [0 0 5 0]
 [1 0 2 2]]
Transformer Model (./fine-tuned-distilbert) Classification Report:
              precision    recall  f1-score   support

           0       0.75      0.60      0.67         5
           1       1.00      0.80      0.89         5
           2       0.50      1.00      0.67         5
           3       1.00      0.40      0.57         5

    accuracy                           0.70        20
   macro avg       0.81      0.70      0.70        20
weighted avg       0.81      0.70      0.70        20
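The report above can be reproduced by hand from the confusion matrix, where rows are true labels and columns are predicted labels. The sketch below is illustrative and uses plain Python rather than scikit-learn; the function name `per_class_metrics` is an assumption, not something from the original notebook.

```python
# Confusion matrix printed above: rows = true label, columns = predicted label.
CONFUSION = [
    [3, 0, 2, 0],
    [0, 4, 1, 0],
    [0, 0, 5, 0],
    [1, 0, 2, 2],
]

def per_class_metrics(matrix):
    """Return (accuracy, [(precision, recall), ...]) for a square confusion matrix."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(n))
    metrics = []
    for i in range(n):
        predicted_as_i = sum(matrix[r][i] for r in range(n))  # column sum
        actually_i = sum(matrix[i])                            # row sum
        precision = matrix[i][i] / predicted_as_i if predicted_as_i else 0.0
        recall = matrix[i][i] / actually_i if actually_i else 0.0
        metrics.append((precision, recall))
    return correct / total, metrics

accuracy, metrics = per_class_metrics(CONFUSION)
print(f"accuracy: {accuracy:.2f}")
for label, (precision, recall) in enumerate(metrics):
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f}")
```

Class 2 (Business) has perfect recall but only 0.50 precision, because five headlines from the other classes were also predicted as Business.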



# Fine-tuning the model

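Now onto the fine-tuning. The code was taken directly from the fine-tuning notebook we went through. We ran it on the whole dataset of 120,000 rows, which took around 6 hours on a laptop. Note that the cell as shown here samples 2,000 rows, and the logged output below it (a `train_runtime` of about 8,625 seconds, roughly 2.4 hours) corresponds to that smaller run.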
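To put the runtime in perspective, the number of optimizer steps follows from the dataset size, batch size, and epoch count. This is a back-of-the-envelope sketch, assuming a single device and no gradient accumulation (neither is configured in this notebook); `total_steps` is a hypothetical helper.

```python
import math

def total_steps(num_examples, batch_size, epochs):
    """Optimizer steps for a run: batches per epoch times number of epochs."""
    return math.ceil(num_examples / batch_size) * epochs

# Batch size 8 and 3 epochs are the values used in the training cell.
print(total_steps(2_000, 8, 3))    # 2,000-row sample
print(total_steps(120_000, 8, 3))  # full 120,000-row dataset
```

Under these assumptions the full dataset needs 60 times as many steps as the 2,000-row sample.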

In [3]:
import pandas as pd
from datasets import Dataset, DatasetDict
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
import torch

train_df = pd.read_csv('training_data.csv')
train_df_subset = train_df.sample(n=2000, random_state=42)

test_df = pd.read_csv('test_data.csv')

train_dataset = Dataset.from_pandas(train_df_subset)
test_dataset = Dataset.from_pandas(test_df)

dataset = DatasetDict({
    'train': train_dataset,
    'test': test_dataset
})

model_name = "./fine-tuned-distilbert"

tokenizer, model = load_model_and_tokenizer(model_name)
transformer_pipeline = pipeline("text-classification", model=model, tokenizer=tokenizer)

def tokenize_function(examples):
    # max_length counts tokens, not characters, so use the 512-token cap directly.
    # A fixed value also keeps every example the same length across map batches.
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
)

trainer.train()

trainer.save_model("./fine-tuned-fine-tuned-distilbert")
tokenizer.save_pretrained("./fine-tuned-fine-tuned-distilbert")


{'eval_loss': 0.8011118173599243, 'eval_runtime': 2.7144, 'eval_samples_per_second': 7.368, 'eval_steps_per_second': 1.105, 'epoch': 1.0}
{'loss': 0.2949, 'grad_norm': 0.11164164543151855, 'learning_rate': 6.666666666666667e-06, 'epoch': 2.0}


{'eval_loss': 0.5572220683097839, 'eval_runtime': 2.8254, 'eval_samples_per_second': 7.079, 'eval_steps_per_second': 1.062, 'epoch': 2.0}


{'eval_loss': 0.5526213645935059, 'eval_runtime': 2.6477, 'eval_samples_per_second': 7.554, 'eval_steps_per_second': 1.133, 'epoch': 3.0}
{'train_runtime': 8625.2005, 'train_samples_per_second': 0.696, 'train_steps_per_second': 0.087, 'train_loss': 0.24927465311686198, 'epoch': 3.0}
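The `learning_rate` of about 6.67e-06 logged at epoch 2.0 is consistent with a linear decay from the initial 2e-5 down to zero. The sketch below assumes a linear schedule with no warmup (to our knowledge the `Trainer` default, which this notebook does not override) and 750 total steps (2,000 examples at batch size 8, times 3 epochs); `linear_lr` is a hypothetical helper, not a library function.

```python
def linear_lr(step, total_steps, initial_lr, warmup_steps=0):
    """Learning rate under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return initial_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return initial_lr * remaining / max(1, total_steps - warmup_steps)

# With 250 steps per epoch, step 500 of 750 corresponds to epoch 2.0 in the log above.
lr = linear_lr(step=500, total_steps=750, initial_lr=2e-5)
print(f"{lr:.4e}")
```

At that point one third of the steps remain, so the rate is one third of its initial value.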


('./fine-tuned-fine-tuned-distilbert\\tokenizer_config.json',
 './fine-tuned-fine-tuned-distilbert\\special_tokens_map.json',
 './fine-tuned-fine-tuned-distilbert\\vocab.txt',
 './fine-tuned-fine-tuned-distilbert\\added_tokens.json',
 './fine-tuned-fine-tuned-distilbert\\tokenizer.json')
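Because the tokenization function pads every example to a fixed `max_length`, short headlines still occupy the full sequence length, which adds to the training time. The sketch below compares that with padding each batch only to its longest member (dynamic padding), which is a different technique from the one used in this notebook. The token lengths are made-up values for illustration, and `padded_tokens_fixed` and `padded_tokens_dynamic` are hypothetical helpers.

```python
def padded_tokens_fixed(lengths, max_length):
    """Total token slots when every example is padded or truncated to max_length."""
    return len(lengths) * max_length

def padded_tokens_dynamic(lengths, batch_size, max_length):
    """Total token slots when each batch is padded to its longest (truncated) example."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = [min(n, max_length) for n in lengths[start:start + batch_size]]
        total += max(batch) * len(batch)
    return total

# Made-up token counts: 15 short snippets and one long article (illustrative only).
lengths = [38, 52, 47, 61, 44, 70, 55, 49, 63, 41, 58, 66, 39, 53, 48, 600]

fixed = padded_tokens_fixed(lengths, max_length=512)
dynamic = padded_tokens_dynamic(lengths, batch_size=8, max_length=512)
print(fixed, dynamic)
```

In the `transformers` library, dynamic padding is typically done by tokenizing without padding and passing a `DataCollatorWithPadding` to the `Trainer` as `data_collator`. Whether it helps here depends on the real length distribution of the headlines.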

# Seeing our Results

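Running the model again after fine-tuning gives better results on the 20-headline test set: macro-averaged precision rises from 0.81 to 0.87, and accuracy, macro recall, and macro F1 all rise from 0.70 to 0.85. With only five test headlines per class, this corresponds to 17 correct predictions instead of 14, so the comparison should be read as indicative rather than conclusive.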

In [4]:
# Load the fine-tuned model
model = AutoModelForSequenceClassification.from_pretrained("./fine-tuned-fine-tuned-distilbert")

# Define a function to get predictions from the model
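def get_predictions(model, tokenizer, dataset):
    # tokenizer is unused here because the dataset is already tokenized.
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    model.eval()  # make sure dropout is disabled for inference
    predictions = []
    labels = []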

    for batch in torch.utils.data.DataLoader(dataset, batch_size=8):
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)

        with torch.no_grad():
            outputs = model(input_ids, attention_mask=attention_mask)

        predictions.extend(torch.argmax(outputs.logits, axis=1).tolist())
        labels.extend(batch["labels"].tolist())

    return predictions, labels

# Get predictions on the test dataset
test_predictions, test_labels = get_predictions(model, tokenizer, tokenized_datasets["test"])

# Print evaluation metrics
from sklearn.metrics import classification_report
print(classification_report(test_labels, test_predictions))

              precision    recall  f1-score   support

           0       0.75      0.60      0.67         5
           1       1.00      0.80      0.89         5
           2       0.71      1.00      0.83         5
           3       1.00      1.00      1.00         5

    accuracy                           0.85        20
   macro avg       0.87      0.85      0.85        20
weighted avg       0.87      0.85      0.85        20
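The comparison quoted in this section uses the macro averages, which are unweighted means of the per-class rows. As a quick check, the sketch below recomputes them from the rounded per-class values in the two reports, so the results agree with the reports only to about two decimals; `macro_avg` is a hypothetical helper.

```python
def macro_avg(values):
    """Unweighted mean across classes, as in the 'macro avg' report row."""
    return sum(values) / len(values)

# Per-class (precision, recall) rows copied from the two reports, classes 0 to 3.
baseline = [(0.75, 0.60), (1.00, 0.80), (0.50, 1.00), (1.00, 0.40)]
fine_tuned = [(0.75, 0.60), (1.00, 0.80), (0.71, 1.00), (1.00, 1.00)]
support = 5  # test headlines per class

for name, rows in [("baseline", baseline), ("fine-tuned", fine_tuned)]:
    precision = macro_avg([p for p, _ in rows])
    recall = macro_avg([r for _, r in rows])
    # Recall times support gives the number of correctly classified headlines per class.
    correct = round(sum(r * support for _, r in rows))
    print(f"{name}: precision={precision:.3f} recall={recall:.3f} correct={correct}/20")
```

The three additional correct predictions all come from class 3 (Sci/Tech), whose recall rose from 0.40 to 1.00; recall for the other classes is unchanged.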

