## 1. Data Overview

In [2]:

import pandas as pd

data = pd.read_csv('/kaggle/input/article-categorization-dataset/article_data.csv')

print("Dataset Shape:", data.shape)

# Show the first few rows of the dataset
print("First 5 rows of the dataset:")
print(data.head())

# Print an overview of the dataset (data types and non-null counts)
print("Dataset Info:")
# info() prints its summary itself and returns None, hence the trailing "None" in the output
print(data.info())

# Check for missing values
missing_values = data.isnull().sum()
print("Missing Values:\n", missing_values)


Dataset Shape: (4000, 2)
First 5 rows of the dataset:
                                             Article  Category
0  Sudan Govt rejects call to separate religion, ...         0
1  Hassan:  #39;Abhorrent act #39; says Blair Wes...         0
2  Sharon Says Gaza Evacuation Set for 2005 (AP) ...         0
3  Prince Charles chastised for  quot;old fashion...         0
4  U.S. Says N.Korea Blast Probably Not Nuclear  ...         0
Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Article   4000 non-null   object
 1   Category  4000 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 62.6+ KB
None
Missing Values:
 Article     0
Category    0
dtype: int64
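
The sample rows above contain fragments such as `#39;` and `quot;`, which look like HTML entities (`&#39;`, `&quot;`) that lost their ampersand. A minimal cleaning sketch follows, assuming only those two fragments need handling. The helper name `clean_article` and the mapping are illustrative and not part of the original notebook, and dropping one space before each fragment is a heuristic that fits the rows shown but may not suit every article.

```python
import re

# Fragments observed in the sample rows; extend the mapping if others appear (assumption)
ENTITY_RESIDUE = {"#39;": "'", "quot;": '"'}
_RESIDUE_PATTERN = re.compile(r"\s?&?(#39;|quot;)")


def clean_article(text: str) -> str:
    """Replace entity residue with the intended character and collapse whitespace."""
    # Drop at most one space before the fragment, then substitute the character
    text = _RESIDUE_PATTERN.sub(lambda m: ENTITY_RESIDUE[m.group(1)], text)
    return re.sub(r"\s+", " ", text).strip()


print(clean_article("Hassan:  #39;Abhorrent act #39; says Blair"))
```

Applied to the whole column, this could be written as `data['Article'] = data['Article'].map(clean_article)` before encoding or tokenizing.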


## 2. Model Building - Sentence Transformer + Machine Learning

In [3]:
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-3.2.0-py3-none-any.whl.metadata (10 kB)
Downloading sentence_transformers-3.2.0-py3-none-any.whl (255 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 255.2/255.2 kB 4.0 MB/s eta 0:00:00
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-3.2.0


In [4]:
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score

# Load sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode the dataset (Article is the text column, and Category is the label)
X = model.encode(data['Article'].values)
y = data['Category'].values

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Random Forest Base Model
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Evaluate the base model
print("Base Model Performance:")
print(classification_report(y_test, y_pred))

# Random Forest with Class Weights
clf_weighted = RandomForestClassifier(class_weight='balanced', random_state=42)
clf_weighted.fit(X_train, y_train)
y_pred_weighted = clf_weighted.predict(X_test)

# Evaluate the class_weight model
print("Class-Weighted Model Performance:")
print(classification_report(y_test, y_pred_weighted))

# Hyperparameter Tuning using GridSearchCV
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2, 4]
}

grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Best hyperparameters and model performance
best_rf = grid_search.best_estimator_
y_pred_best = best_rf.predict(X_test)
print("Best Random Forest Model Performance:")
print(classification_report(y_test, y_pred_best))
print("Best Parameters:", grid_search.best_params_)


Base Model Performance:
              precision    recall  f1-score   support

           0       0.93      0.87      0.90       209
           1       0.94      0.96      0.95       213
           2       0.80      0.82      0.81       194
           3       0.84      0.85      0.85       184

    accuracy                           0.88       800
   macro avg       0.88      0.88      0.88       800
weighted avg       0.88      0.88      0.88       800

Class-Weighted Model Performance:
              precision    recall  f1-score   support

           0       0.91      0.85      0.88       209
           1       0.94      0.96      0.95       213
           2       0.82      0.85      0.84       194
           3       0.83      0.83      0.83       184

    accuracy                           0.88       800
   macro avg       0.87      0.87      0.87       800
weighted avg       0.88      0.88      0.88       800

Best Random Forest Model Performance:
              precision    recall 

## 3. Model Building - Transformer

In [6]:
import torch
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset

# Load the BERT tokenizer and dataset
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
train_texts, test_texts, train_labels, test_labels = train_test_split(
    data['Article'].tolist(), data['Category'].tolist(), test_size=0.2, random_state=42
)


class TextDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=512):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        tokenized_input = self.tokenizer(
            self.texts[idx], padding='max_length', truncation=True, max_length=self.max_length, return_tensors="pt"
        )
        return {
            'input_ids': tokenized_input['input_ids'].squeeze(0),
            'attention_mask': tokenized_input['attention_mask'].squeeze(0),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }


train_dataset = TextDataset(train_texts, train_labels, tokenizer)
test_dataset = TextDataset(test_texts, test_labels, tokenizer)


model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=len(set(train_labels)))


training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    logging_dir='./logs',
    logging_steps=10,
    report_to="none" 
)


trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)

# Train the model
trainer.train()

# Evaluate the model
evaluation_results = trainer.evaluate()
print("Evaluation Results:", evaluation_results)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch  Training Loss  Validation Loss
1      0.3364         0.371815
2      0.3367         0.328879
3      0.2667         0.367618


Evaluation Results: {'eval_loss': 0.36761805415153503, 'eval_runtime': 13.5442, 'eval_samples_per_second': 59.066, 'eval_steps_per_second': 7.383, 'epoch': 3.0}
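
The evaluation results above contain only the loss, because the `Trainer` was built without a metrics function. As a sketch, assuming accuracy is the metric of interest, a function like the one below could be supplied through the `compute_metrics` argument of `Trainer`. The function body is illustrative and was not part of the original run.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Turn (logits, labels) from the Trainer into an accuracy value."""
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == np.asarray(labels)).mean())}


# Toy check with three examples and two classes (made-up numbers)
toy_logits = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
toy_labels = np.array([1, 0, 0])
print(compute_metrics((toy_logits, toy_labels)))
```

Passing `compute_metrics=compute_metrics` when constructing the `Trainer` would make `trainer.evaluate()` report the value under the key `eval_accuracy` alongside `eval_loss`.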


## 4. Model Performance Comparison and Final Model Selection

In [16]:
import torch
from sklearn.metrics import accuracy_score

# RandomForestClassifier, SentenceTransformer, and tokenizer come from the earlier cells
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Accuracy of the tuned Random Forest on the held-out test set
rf_best_accuracy = accuracy_score(y_test, y_pred_best)

# Accuracy of the fine-tuned BERT model on the same test split
transformer_predictions = trainer.predict(test_dataset).predictions
transformer_pred_labels = torch.argmax(torch.tensor(transformer_predictions), dim=1).numpy()
transformer_accuracy = accuracy_score(test_labels, transformer_pred_labels)

# Print performance comparison
print(f"Random Forest Accuracy: {rf_best_accuracy}")
print(f"Transformer Accuracy: {transformer_accuracy}")

# Select the best model based on accuracy (final_model must be set before it is used)
if rf_best_accuracy > transformer_accuracy:
    print("Random Forest selected as the best model.")
    final_model = best_rf
else:
    print("Transformer selected as the best model.")
    final_model = model

# Classify new text with the selected model
new_test_data = ['Sample text for classification']

if isinstance(final_model, RandomForestClassifier):
    # The Random Forest was trained on sentence embeddings, so encode the text the same way
    sentence_model = SentenceTransformer('all-MiniLM-L6-v2')
    X_new_test = sentence_model.encode(new_test_data)
    predictions = final_model.predict(X_new_test)
else:
    # Move the BERT model to the target device and switch to inference mode
    final_model.to(device)
    final_model.eval()
    tokenized_test = tokenizer(new_test_data, return_tensors='pt', padding=True, truncation=True, max_length=512)
    tokenized_test = {key: val.to(device) for key, val in tokenized_test.items()}

    with torch.no_grad():
        logits = final_model(**tokenized_test).logits
        predictions = torch.argmax(logits, dim=1).cpu().numpy()




Random Forest Accuracy: 0.87875
Transformer Accuracy: 0.92
Transformer selected as the best model.


## 5. Actionable Insights and Recommendations


**Model Performance:**

* Transformer Accuracy: 92%
* Random Forest Accuracy: 87.88%
* Preferred Model: Transformer selected for article classification.

**Model Deployment:**

* Deploy the Transformer model in a production environment.
* Create a user-friendly interface or API for easy article submission and classification.

**Continuous Monitoring and Retraining:**

* Establish a system for regular performance monitoring.
* Retrain the model with new data periodically to maintain accuracy.

**Model Interpretability:**

* Use SHAP or LIME to explain which parts of an article drive each prediction.
* Display confidence scores alongside predictions to enhance user trust.
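
One way to obtain the confidence scores mentioned above is to apply a softmax to the classifier's logits and report the largest probability. The sketch below uses NumPy and assumes the logits have already been moved to a NumPy array (for example with `.cpu().numpy()` on the BERT output). The helper name `softmax_confidence` is illustrative. Softmax probabilities are often overconfident, so they are best read as a rough signal.

```python
import numpy as np


def softmax_confidence(logits):
    """Return (predicted_labels, confidences) for a 2-D array of logits."""
    logits = np.asarray(logits, dtype=np.float64)
    # Subtract the row maximum for numerical stability before exponentiating
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs.max(axis=1)


# Made-up logits for one article across the four categories
labels, confidences = softmax_confidence([[2.0, 0.0, 0.0, 0.0]])
print(labels, confidences)
```

For the Random Forest, scikit-learn's `predict_proba` already returns per-class probabilities, so no softmax is needed there.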

**Model Optimization:**

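* Tune the Transformer's hyperparameters (learning rate, batch size, number of epochs); validation loss rose from 0.329 after epoch 2 to 0.368 after epoch 3, which suggests overfitting by the third epoch.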
* Experiment with different Transformer architectures, such as DistilBERT or RoBERTa.

**Evaluation Metrics:**

* Report per-class precision, recall, and F1-score for the Transformer as well; so far only the Random Forest models have a classification report.
* Implement a confusion matrix to visualize performance across different classes.
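
The confusion matrix suggested above can be computed with scikit-learn's `confusion_matrix`, where rows are true classes and columns are predicted classes. The sketch below also derives per-class recall from it. The helper name `per_class_recall` and the toy labels are illustrative; in this notebook the inputs would be `test_labels` and `transformer_pred_labels`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def per_class_recall(y_true, y_pred, labels):
    """Return the confusion matrix and the recall of each class."""
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    row_totals = cm.sum(axis=1)
    # Recall = correct predictions on the diagonal / number of true samples per class
    recall = np.divide(
        cm.diagonal(), row_totals,
        out=np.zeros(len(labels)), where=row_totals > 0,
    )
    return cm, recall


# Toy labels for three classes (made-up values)
cm, recall = per_class_recall([0, 0, 1, 2], [0, 1, 1, 2], labels=[0, 1, 2])
print(cm)
print(recall)
```

Off-diagonal cells show which categories are confused with each other, which the single accuracy figure cannot reveal.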