---

### **Text Classification by Fine-tuning a Language Model**

---

### 1. **Data Loading**
   - Load the dataset (CSV format in this case).
   - Perform exploratory data analysis (EDA) to understand class distributions and data structure.
   - Split the dataset into training and validation sets.

```python
# Install simpletransformers package
!pip install simpletransformers

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset (replace with your dataset path)
data = pd.read_csv('text_classification_data.csv')

# Exploratory Data Analysis (EDA)
print(data.info())  # Overview of data structure
print(data['label'].value_counts())  # Class distribution

# Split dataset into train and validation sets, stratified so both keep the class balance
train_data, val_data = train_test_split(data, test_size=0.2, random_state=42, stratify=data['label'])

# Preparing the data in the correct format for SimpleTransformers
train_df = pd.DataFrame({
    'text': train_data['text'],
    'labels': train_data['label']
})

val_df = pd.DataFrame({
    'text': val_data['text'],
    'labels': val_data['label']
})
```
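
SimpleTransformers expects integer class labels starting at 0, so string categories need to be encoded before training. The sketch below shows one way to build that mapping in plain Python. The label names are assumptions for illustration: `Diagnosis` and `Outcome` appear in the sample output later in this notebook, while `Treatment` is a hypothetical third class.

```python
# Illustrative labels only; "Treatment" is a hypothetical class name.
raw_labels = ["Diagnosis", "Outcome", "Treatment", "Diagnosis", "Outcome"]

# Sort the unique labels so the mapping is deterministic across runs.
label_to_id = {label: idx for idx, label in enumerate(sorted(set(raw_labels)))}

# Inverse mapping, used later to turn predicted ids back into names.
id_to_label = {idx: label for label, idx in label_to_id.items()}

# Encode the labels and derive num_labels from the data instead of hardcoding it.
encoded = [label_to_id[label] for label in raw_labels]
num_labels = len(label_to_id)

print(label_to_id)
print(encoded, num_labels)
```

Deriving `num_labels` from the mapping avoids a mismatch between the classifier head and the data when the dataset has more than two classes.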

---

### 2. **Text Processing**
   - Clean the text by converting it to lowercase, removing special characters and numbers, and stripping extra whitespace.

```python
import re

# Define a function to clean text data
def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    
    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Remove extra whitespace (collapse internal runs, then trim both ends)
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

# Apply the cleaning function to the dataset
train_df['text'] = train_df['text'].apply(clean_text)
val_df['text'] = val_df['text'].apply(clean_text)

print(train_df.head())
```
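
To see what this cleaning does to individual sentences, the following self-contained sketch applies the same steps to a few inputs. The helper name `clean_text_safe` and the second example sentence are assumptions made for illustration. The helper also returns an empty string for non-string values, since a missing cell in the CSV would otherwise raise an `AttributeError` on `.lower()`.

```python
import re

def clean_text_safe(text):
    """Lowercase, drop non-letters, and normalize whitespace.

    Hypothetical variant of clean_text that tolerates missing values.
    """
    if not isinstance(text, str):
        return ""  # e.g. None or NaN from an empty CSV cell
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)      # remove digits and punctuation
    return re.sub(r"\s+", " ", text).strip()  # collapse and trim whitespace

examples = [
    "MRI results indicate a torn ligament in the right knee.",
    "  Dosage: 20 mg,  twice daily ",  # made-up sentence
    None,
]
cleaned = [clean_text_safe(raw) for raw in examples]
for raw, out in zip(examples, cleaned):
    print(repr(raw), "->", repr(out))
```

Note that the second example loses its dosage value: stripping numbers can discard information that matters in clinical text, so it is worth checking whether the classifier needs it.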

---

### 3. **Text Embedding using BERT and RoBERTa**
   - Create BERT and RoBERTa classification models. SimpleTransformers tokenizes and embeds the text internally, so no separate embedding step is needed. Set `num_labels` to the number of classes in your dataset.

```python
from simpletransformers.classification import ClassificationModel

# Create a BERT model for text classification
bert_model = ClassificationModel('bert', 'bert-base-uncased', num_labels=2, use_cuda=False)  # Set use_cuda=True if using a GPU

# Create a RoBERTa model for text classification
roberta_model = ClassificationModel('roberta', 'roberta-base', num_labels=2, use_cuda=False)  # Set use_cuda=True if using a GPU
```

---

### 4. **Model Training with BERT and RoBERTa**

#### **Basic Model Training**

#### **Train the BERT Model**
```python
# Train BERT model
bert_model.train_model(train_df)
```

#### **Train the RoBERTa Model**
```python
# Train RoBERTa model
roberta_model.train_model(train_df)
```

#### **Model Training with Hyperparameters**
   - Train the models with a set of hyperparameters such as learning rate, batch size, epochs, etc.

```python
from simpletransformers.classification import ClassificationArgs

# Set up model arguments with custom hyperparameters
model_args = ClassificationArgs(
    num_train_epochs=3,       # Start with 3 epochs
    train_batch_size=8,       # Use a batch size of 8
    eval_batch_size=8,        # Same for evaluation
    learning_rate=3e-5,       # Learning rate
    max_seq_length=128,       # Max sequence length
    weight_decay=0.01,        # Weight decay
    warmup_steps=0,           # Optional: adjust based on total steps
    logging_steps=50,         # Log training progress every 50 steps
    save_steps=200,           # Save the model every 200 steps
)

# Train the BERT model with custom hyperparameters
bert_model = ClassificationModel('bert', 'bert-base-uncased', num_labels=2, args=model_args, use_cuda=False)
bert_model.train_model(train_df)

# Train the RoBERTa model with custom hyperparameters
roberta_model = ClassificationModel('roberta', 'roberta-base', num_labels=2, args=model_args, use_cuda=False)
roberta_model.train_model(train_df)
```
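
The `warmup_steps` comment above suggests adjusting the value based on the total number of steps. A minimal sketch of that arithmetic follows, assuming no gradient accumulation and that the last partial batch is kept. The function names and the 10% warmup share are assumptions for illustration, not SimpleTransformers settings. The 799 training examples correspond to an 80/20 split of the 999-row dataset used later in this notebook.

```python
import math

def total_training_steps(num_examples, batch_size, epochs):
    """Optimizer steps for a full run, assuming one step per batch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

def warmup_steps_for(total_steps, warmup_percent=10):
    """Warmup steps as an integer percentage of the total (hypothetical choice)."""
    return total_steps * warmup_percent // 100

# Settings from the hyperparameter block above: batch size 8, 3 epochs.
steps_template = total_training_steps(num_examples=799, batch_size=8, epochs=3)

# Settings from the reduced run later in the notebook: batch size 4, 2 epochs.
steps_reduced = total_training_steps(num_examples=799, batch_size=4, epochs=2)

print(steps_template, warmup_steps_for(steps_template))
print(steps_reduced, warmup_steps_for(steps_reduced))
```

The reduced configuration works out to 400 steps, which matches the global step reported by `train_model` in the training output further down.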

---

### 5. **Evaluation on Validation Set**
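   - Evaluate both BERT and RoBERTa on the validation set. By default, `eval_model` reports the Matthews correlation coefficient (`mcc`) and `eval_loss`; accuracy, precision, recall, and F1-score can be added by passing metric functions as keyword arguments, for example `acc=sklearn.metrics.accuracy_score`.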

#### **Evaluate BERT Model**
```python
# Evaluate BERT on validation data
result_bert, model_outputs_bert, wrong_predictions_bert = bert_model.eval_model(val_df)

print("BERT Evaluation Results:")
print(result_bert)
```

#### **Evaluate RoBERTa Model**
```python
# Evaluate RoBERTa on validation data
result_roberta, model_outputs_roberta, wrong_predictions_roberta = roberta_model.eval_model(val_df)

print("RoBERTa Evaluation Results:")
print(result_roberta)
```
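
If the result dictionary does not include accuracy, precision, recall, or F1-score, they can also be computed directly from the true and predicted labels. The sketch below does this in plain Python with macro averaging (each class weighted equally). The function name and the toy label lists are assumptions for illustration, not output from the models above.

```python
from collections import Counter

def macro_report(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        pr_sum = precision + recall
        f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }

# Toy labels for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
report = macro_report(y_true, y_pred)
print(report)
```

In practice, the predicted labels can be obtained by taking the argmax of each row of the model outputs returned by `eval_model`.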

---

### 6. **Saving the Best Model**
   - Save the best-performing model for later use.

#### **Saving the BERT Model**
```python
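# Pass the underlying model explicitly so its weights are written to the directory
bert_model.save_model('bert_best_model', model=bert_model.model)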
```

#### **Saving the RoBERTa Model**
```python
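# Pass the underlying model explicitly so its weights are written to the directory
roberta_model.save_model('roberta_best_model', model=roberta_model.model)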
```
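
When the labels were encoded as integers before training, it helps to store the label mapping next to the saved model so that predictions can be decoded in a later session. A minimal sketch using only the standard library is shown below. The file name `label_mapping.json`, the helper names, and the example mapping are assumptions for illustration; a temporary directory stands in for the model directory.

```python
import json
import tempfile
from pathlib import Path

MAPPING_FILE = "label_mapping.json"  # hypothetical file name

def save_label_mapping(mapping, model_dir):
    """Write the label-to-id mapping as JSON inside the model directory."""
    path = Path(model_dir) / MAPPING_FILE
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(mapping, indent=2), encoding="utf-8")
    return path

def load_label_mapping(model_dir):
    """Read the mapping back; JSON keeps the string keys and integer values."""
    path = Path(model_dir) / MAPPING_FILE
    return json.loads(path.read_text(encoding="utf-8"))

# Example mapping for illustration only.
label_mapping = {"Diagnosis": 0, "Outcome": 1}

with tempfile.TemporaryDirectory() as model_dir:
    save_label_mapping(label_mapping, model_dir)
    restored = load_label_mapping(model_dir)

print(restored)
```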

---

### 7. **Prediction on Real-World Input**
   - Test the saved model on real-world input data. Preprocess the input text, use the model to predict the class, and output the results.

#### **Prediction Using BERT Model**
```python
# Load the saved BERT model
bert_model = ClassificationModel('bert', 'bert_best_model', use_cuda=False)

# Real-world input text
real_world_text = ["This is a great product!", "I didn't like the service."]

# Apply the same cleaning used for the training data, then predict the class
predictions_bert, _ = bert_model.predict([clean_text(t) for t in real_world_text])

print(f"BERT Predictions: {predictions_bert}")
```

#### **Prediction Using RoBERTa Model**
```python
# Load the saved RoBERTa model
roberta_model = ClassificationModel('roberta', 'roberta_best_model', use_cuda=False)

# Real-world input text
real_world_text = ["This is a great product!", "I didn't like the service."]

# Apply the same cleaning used for the training data, then predict the class
predictions_roberta, _ = roberta_model.predict([clean_text(t) for t in real_world_text])

print(f"RoBERTa Predictions: {predictions_roberta}")
```
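
The second value returned by `predict` (discarded as `_` above) holds the raw model outputs for each input. As a rough sketch, these can be turned into probabilities with a softmax and mapped back to class names. The logits and the `id_to_label` mapping below are made-up values for illustration, not results from the models in this notebook.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of raw outputs."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical mapping and made-up logits, one row per input text.
id_to_label = {0: "Diagnosis", 1: "Outcome", 2: "Treatment"}
raw_outputs = [[-1.2, 0.3, 2.9], [0.1, 3.4, -0.8]]

decoded = []
for logits in raw_outputs:
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    decoded.append((id_to_label[best], probs[best]))

for label, confidence in decoded:
    print(f"{label} ({confidence:.2f})")
```

Reporting the probability alongside the label makes low-confidence predictions easier to spot before acting on them.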

---

1. Data Loading

In [None]:
#!pip install simpletransformers

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset (replace with your dataset path)
data = pd.read_csv('/content/md.csv')

# Exploratory Data Analysis (EDA)
print(data.info())  # Overview of data structure
print(data['mre'].value_counts())  # Counts per text (reveals duplicate sentences); use data['Category'] for the class distribution

# Split dataset into train and validation sets
train_data, val_data = train_test_split(data, test_size=0.2, random_state=42)

# Preparing the data in the correct format for SimpleTransformers
train_df = pd.DataFrame({
    'text': train_data['mre'],
    'labels': train_data['Category']
})

val_df = pd.DataFrame({
    'text': val_data['mre'],
    'labels': val_data['Category']
})

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 999 entries, 0 to 998
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   mre       999 non-null    object
 1   Category  999 non-null    object
dtypes: object(2)
memory usage: 15.7+ KB
None
mre
"Cholesterol levels decreased, and cardiovascular risk lowered."         2
"Statins prescribed to lower LDL cholesterol levels."                    2
"Energy levels improved, and deficiency corrected."                      2
"Routine screening detects high cholesterol levels."                     2
"The patient’s mobility and coordination have improved with therapy."    2
                                                                        ..
"Infection cleared, and neurological function remained intact."          1
"Thyroid ultrasound reveals nodules."                                    1
"Scheduled biopsy to assess malignancy risk."                            1
"Nodules determined

In [None]:
import re

# Define a function to clean text data
def clean_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Remove extra whitespace (collapse internal runs, then trim both ends)
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Apply the cleaning function to the dataset
train_df['text'] = train_df['text'].apply(clean_text)
val_df['text'] = val_df['text'].apply(clean_text)

print(train_df.head())

                                                  text     labels
778              blood test detects high triglycerides  Diagnosis
286  ct scan confirms spinal stenosis in the lower ...  Diagnosis
165  aneurysm remains stable with no signs of ruptu...    Outcome
960  blood test shows elevated calcium levels indic...  Diagnosis
493                      skin biopsy identifies eczema  Diagnosis


In [None]:
from simpletransformers.classification import ClassificationModel

# Create a BERT model for text classification
bert_model = ClassificationModel('bert', 'bert-base-uncased', num_labels=2, use_cuda=False)  # Set use_cuda=True if using a GPU

# Create a RoBERTa model for text classification
roberta_model = ClassificationModel('roberta', 'roberta-base', num_labels=2, use_cuda=False)  # Set use_cuda=True if using a GPU

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a dow

In [None]:
import shutil
import os
from simpletransformers.classification import ClassificationModel, ClassificationArgs

# Clear the output directory
output_dir = 'outputs/'
if os.path.exists(output_dir):
    shutil.rmtree(output_dir)

# Assuming train_df is your DataFrame and it has a 'labels' column
label_mapping = {label: idx for idx, label in enumerate(sorted(train_df['labels'].unique()))}
train_df['labels'] = train_df['labels'].map(label_mapping)

num_labels = len(train_df['labels'].unique())

model_args = ClassificationArgs(
    num_train_epochs=2,  # Reduced epochs
    train_batch_size=4,  # Reduced batch size
    eval_batch_size=4,
    learning_rate=3e-5,
    max_seq_length=128,
    weight_decay=0.01,
    warmup_steps=0,
    logging_steps=50,
    save_steps=200,
    overwrite_output_dir=True
)

# Use a smaller model (DistilBERT); the variable keeps the name bert_model so later cells still work
bert_model = ClassificationModel('distilbert', 'distilbert-base-uncased', num_labels=num_labels, args=model_args, use_cuda=False)
bert_model.train_model(train_df)

# To train RoBERTa as well, uncomment the lines below.
# The model type ('roberta') must match the checkpoint name ('roberta-base').
# roberta_model = ClassificationModel('roberta', 'roberta-base', num_labels=num_labels, args=model_args, use_cuda=False)
# roberta_model.train_model(train_df)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



(400, 0.12077893596258946)

In [None]:
# # Evaluate BERT model
# result_bert, model_outputs_bert, wrong_predictions_bert = bert_model.eval_model(val_df)
# print("BERT Evaluation Results:")
# print(result_bert)
# # Train RoBERTa model
# roberta_model = ClassificationModel('roberta', 'roberta-base', num_labels=num_labels, args=model_args, use_cuda=False)
# roberta_model.train_model(train_df)

# # Evaluate RoBERTa model
# result_roberta, model_outputs_roberta, wrong_predictions_roberta = roberta_model.eval_model(val_df)
# print("RoBERTa Evaluation Results:")
# print(result_roberta)

In [None]:
# Apply the same label mapping used during training to the validation data
val_df['labels'] = val_df['labels'].map(label_mapping)

# Evaluate BERT model
result_bert, model_outputs_bert, wrong_predictions_bert = bert_model.eval_model(val_df)
print("BERT Evaluation Results:")
print(result_bert)

BERT Evaluation Results:
{'mcc': np.float64(0.9849352214522447), 'eval_loss': 0.03428435988491401}


In [None]:
bert_model.save_model('bert_best_model')
# Note: roberta_model was never fine-tuned in this notebook (its training code is commented out)
roberta_model.save_model('roberta_best_model')

In [None]:
# Load the saved BERT model
#bert_model = ClassificationModel('bert', 'bert_best_model', use_cuda=False)

# Real-world input text
real_world_text = ["Prescribed a low-sodium diet to help manage hypertension.", "The surgery was successful, and the patient is expected to recover fully within six weeks.","MRI results indicate a torn ligament in the right knee."]

# Predict the class
predictions_bert, _ = bert_model.predict(real_world_text)

print(f"BERT Predictions: {predictions_bert}")

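# Load the saved RoBERTa model
# Note: roberta_model was never fine-tuned in this notebook (its training code is commented out),
# so the RoBERTa predictions below come from a randomly initialized classification head.
#roberta_model = ClassificationModel('roberta', 'roberta_best_model', use_cuda=False)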

# Real-world input text
real_world_text = ["Prescribed a low-sodium diet to help manage hypertension.", "I didn't like the service.","MRI results indicate a torn ligament in the right knee."]

# Predict the class
predictions_roberta, _ = roberta_model.predict(real_world_text)

print(f"RoBERTa Predictions: {predictions_roberta}")

BERT Predictions: [2 1 0]


RoBERTa Predictions: [1 1 1]
