<a href="https://colab.research.google.com/github/Rakshithashetty555/nlp/blob/main/NLP_Mini_Project_Part_3_Language_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---

### **Text Classification by Fine-tuning Language Model**

---

### 1. **Data Loading**
   - Load the dataset (CSV format in this case).
   - Perform exploratory data analysis (EDA) to understand class distributions and data structure.
   - Split the dataset into training and validation sets.

```python
# Install simpletransformers package
!pip install simpletransformers

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset (replace with your dataset path)
data = pd.read_csv('text_classification_data.csv')

# Exploratory Data Analysis (EDA)
data.info()  # Prints an overview of the data structure
print(data['label'].value_counts())  # Class distribution

# Split dataset into train and validation sets, keeping class proportions equal in both
train_data, val_data = train_test_split(
    data, test_size=0.2, random_state=42, stratify=data['label']
)

# Preparing the data in the correct format for SimpleTransformers
train_df = pd.DataFrame({
    'text': train_data['text'],
    'labels': train_data['label']
})

val_df = pd.DataFrame({
    'text': val_data['text'],
    'labels': val_data['label']
})
```

---

### 2. **Text Processing**
   - Clean the text by converting it to lowercase, removing digits and special characters, and trimming extra whitespace.

```python
import re

# Define a function to clean text data
def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    
    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Collapse repeated whitespace and trim both ends
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

# Apply the cleaning function to the dataset
train_df['text'] = train_df['text'].apply(clean_text)
val_df['text'] = val_df['text'].apply(clean_text)

print(train_df.head())
```
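
To make the effect of these cleaning rules concrete, here is a small self-contained sketch. This version collapses internal runs of whitespace as well as trimming the ends, and the sample sentences are made up for illustration. Note that stripping punctuation also merges contractions ("didn't" becomes "didnt"), which may or may not be desirable for your data.

```python
import re

def clean_text(text):
    """Lowercase, drop non-letter characters, and collapse whitespace."""
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)      # removes digits and punctuation
    text = re.sub(r'\s+', ' ', text).strip()  # collapses runs of whitespace
    return text

# Illustrative input only; not taken from the dataset
sample = "  AI is transforming 3 industries!  "
cleaned = clean_text(sample)
print(cleaned)
```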

---

### 3. **Text Embedding using BERT and RoBERTa**
   - Create BERT and RoBERTa classification models. `ClassificationModel` tokenizes and embeds the text internally, so no separate embedding step is needed. Set `num_labels` to the number of classes in your dataset.

```python
from simpletransformers.classification import ClassificationModel

# Create a BERT model for text classification
bert_model = ClassificationModel('bert', 'bert-base-uncased', num_labels=2, use_cuda=False)  # Set use_cuda=True if using a GPU

# Create a RoBERTa model for text classification
roberta_model = ClassificationModel('roberta', 'roberta-base', num_labels=2, use_cuda=False)  # Set use_cuda=True if using a GPU
```

---

### 4. **Model Training with BERT and RoBERTa**

#### **Basic Model Training**

#### **Train the BERT Model**
```python
# Train BERT model
bert_model.train_model(train_df)
```

#### **Train the RoBERTa Model**
```python
# Train RoBERTa model
roberta_model.train_model(train_df)
```

#### **Model Training with Hyperparameters**
   - Train the models with a set of hyperparameters such as learning rate, batch size, epochs, etc.

```python
from simpletransformers.classification import ClassificationArgs

# Set up model arguments with custom hyperparameters
model_args = ClassificationArgs(
    num_train_epochs=3,       # Start with 3 epochs
    train_batch_size=8,       # Use a batch size of 8
    eval_batch_size=8,        # Same for evaluation
    learning_rate=3e-5,       # Learning rate
    max_seq_length=128,       # Max sequence length
    weight_decay=0.01,        # Weight decay
    warmup_steps=0,           # Optional: adjust based on total steps
    logging_steps=50,         # Log training progress every 50 steps
    save_steps=200,           # Save the model every 200 steps
)

# Train the BERT model with custom hyperparameters
bert_model = ClassificationModel('bert', 'bert-base-uncased', num_labels=2, args=model_args, use_cuda=False)
bert_model.train_model(train_df)

# Train the RoBERTa model with custom hyperparameters
roberta_model = ClassificationModel('roberta', 'roberta-base', num_labels=2, args=model_args, use_cuda=False)
roberta_model.train_model(train_df)
```
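
The `warmup_steps` comment above suggests adjusting the value based on the total number of steps. A minimal sketch of that arithmetic follows, assuming one optimizer step per batch (no gradient accumulation). The dataset size of 800 and the 10% warmup share are illustrative assumptions, not values prescribed by the library.

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Number of optimizer steps, assuming one step per batch."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

# Hypothetical dataset size; replace with len(train_df)
num_examples = 800
total_steps = total_training_steps(num_examples, batch_size=8, num_epochs=3)

# Warm up over the first 10% of steps (an illustrative choice)
warmup_steps = total_steps // 10

print(total_steps, warmup_steps)
```

The resulting number could then be passed as `warmup_steps` in `ClassificationArgs`.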

---

### 5. **Evaluation on Validation Set**
   - Evaluate the performance of both BERT and RoBERTa models on the validation set using accuracy, precision, recall, and F1-score.

#### **Evaluate BERT Model**
```python
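# Evaluate BERT on validation data; extra keyword arguments add sklearn metrics to the result
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

result_bert, model_outputs_bert, wrong_predictions_bert = bert_model.eval_model(
    val_df, accuracy=accuracy_score, precision=precision_score, recall=recall_score, f1=f1_score
)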

print("BERT Evaluation Results:")
print(result_bert)
```

#### **Evaluate RoBERTa Model**
```python
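# Evaluate RoBERTa on validation data; extra keyword arguments add sklearn metrics to the result
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

result_roberta, model_outputs_roberta, wrong_predictions_roberta = roberta_model.eval_model(
    val_df, accuracy=accuracy_score, precision=precision_score, recall=recall_score, f1=f1_score
)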

print("RoBERTa Evaluation Results:")
print(result_roberta)
```

---

### 6. **Saving the Best Model**
   - Save the best-performing model for later use.

#### **Saving the BERT Model**
```python
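# Pass the underlying model explicitly: the runs recorded in this notebook show that
# save_model(path) on its own leaves the directory empty
bert_model.save_model('bert_best_model', model=bert_model.model)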
```

#### **Saving the RoBERTa Model**
```python
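# Pass the underlying model explicitly so that the weights are actually written
roberta_model.save_model('roberta_best_model', model=roberta_model.model)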
```

---

### 7. **Prediction on Real-World Input**
   - Test the saved model on real-world input data. Preprocess the input text, use the model to predict the class, and output the results.

#### **Prediction Using BERT Model**
```python
# Load the saved BERT model
bert_model = ClassificationModel('bert', 'bert_best_model', use_cuda=False)

# Real-world input text
real_world_text = ["This is a great product!", "I didn't like the service."]

# Predict the class
predictions_bert, _ = bert_model.predict(real_world_text)

print(f"BERT Predictions: {predictions_bert}")
```

#### **Prediction Using RoBERTa Model**
```python
# Load the saved RoBERTa model
roberta_model = ClassificationModel('roberta', 'roberta_best_model', use_cuda=False)

# Real-world input text
real_world_text = ["This is a great product!", "I didn't like the service."]

# Predict the class
predictions_roberta, _ = roberta_model.predict(real_world_text)

print(f"RoBERTa Predictions: {predictions_roberta}")
```
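
`predict` returns a second value, discarded above as `_`, which holds the raw model outputs. Assuming these are unnormalised logits (one score per class), a softmax turns them into probabilities that can serve as a rough confidence estimate. The sketch below uses made-up logits and plain Python so that it runs without a trained model.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs for two inputs and two classes
raw_outputs = [[-1.2, 2.3], [1.7, -0.4]]

results = []
for logits in raw_outputs:
    probs = softmax(logits)
    predicted = probs.index(max(probs))  # class with the highest probability
    results.append((predicted, probs[predicted]))
    print(predicted, round(probs[predicted], 3))
```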

---

In [None]:
!pip install simpletransformers
import pandas as pd
from sklearn.model_selection import train_test_split
data = pd.read_csv('/content/Transcribed_Speech_Dataset (2).csv')
print(data.head())
data.info()  # Prints an overview of the data structure
print(data['Topic'].value_counts())  # Class distribution of the target column
train_data, val_data = train_test_split(data, test_size=0.2, random_state=42)
train_df = pd.DataFrame({
    'Topic': train_data['Topic'],
    'Transcribed Speech': train_data['Transcribed Speech']
})
val_df = pd.DataFrame({
    'Topic': val_data['Topic'],
    'Transcribed Speech': val_data['Transcribed Speech']
})


In [None]:
import re

# Define a function to clean text data
def clean_text(text):
    # Convert to lowercase
    text = text.lower()

    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Strip leading and trailing whitespace
    text = text.strip()

    return text

# Apply the cleaning function to the dataset.
# Note: this cleans the 'Topic' (label) column; 'Transcribed Speech' is left unchanged.
train_df['Topic'] = train_df['Topic'].apply(clean_text)
val_df['Topic'] = val_df['Topic'].apply(clean_text)

print(train_df.head())

             Topic                                 Transcribed Speech
29   cybersecurity  With the increasing number of cyberattacks, co...
535   stock market  The stock market is highly dynamic, influenced...
695    environment  Environmental conservation efforts are becomin...
557       politics  Political landscapes continue to evolve as gov...
836    environment  Environmental conservation efforts are becomin...


In [None]:
from simpletransformers.classification import ClassificationModel

# Create a BERT model for text classification
bert_model = ClassificationModel('bert', 'bert-base-uncased', num_labels=2, use_cuda=False)  # Set use_cuda=True if using a GPU

# Create a RoBERTa model for text classification
roberta_model = ClassificationModel('roberta', 'roberta-base', num_labels=2, use_cuda=False)  # Set use_cuda=True if using a GPU

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [None]:
from simpletransformers.classification import ClassificationModel, ClassificationArgs
from sklearn.preprocessing import LabelEncoder

# Convert 'Topic' column to numeric labels (stored separately so the transcript text stays intact)
label_encoder = LabelEncoder()
train_df["labels"] = label_encoder.fit_transform(train_df["Topic"])  # Encode labels as numbers

# Define model arguments
model_args = ClassificationArgs(overwrite_output_dir=True)

# Create BERT model
bert_model = ClassificationModel("bert", "bert-base-uncased", num_labels=len(label_encoder.classes_), args=model_args, use_cuda=False)

# Train the model: simpletransformers expects the columns 'text' and 'labels'
bert_train_df = train_df.rename(columns={"Transcribed Speech": "text"})[["text", "labels"]]
bert_model.train_model(bert_train_df)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


(100, 0.745784745439887)
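
To illustrate what the label encoding step does, the following sketch reproduces the idea in plain Python: each distinct topic name is assigned an integer id in sorted order, which is also how scikit-learn's `LabelEncoder` orders its classes. The topic names are the ones visible in the printed sample of `train_df`; the predicted ids are hypothetical. With the real encoder, `label_encoder.inverse_transform` performs the reverse mapping.

```python
# Topic names as they appear in the printed sample of train_df
topics = ["cybersecurity", "stock market", "environment", "politics", "environment"]

# Sort the unique names so each topic always maps to the same id
classes = sorted(set(topics))
topic_to_id = {name: idx for idx, name in enumerate(classes)}
id_to_topic = {idx: name for name, idx in topic_to_id.items()}

labels = [topic_to_id[name] for name in topics]
print(labels)

# Map hypothetical model predictions back to topic names
predicted_ids = [0, 3]
predicted_topics = [id_to_topic[i] for i in predicted_ids]
print(predicted_topics)
```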

In [None]:
from simpletransformers.classification import ClassificationModel, ClassificationArgs
from sklearn.preprocessing import LabelEncoder

# Convert 'Topic' column to numeric labels
label_encoder = LabelEncoder()
train_df["labels"] = label_encoder.fit_transform(train_df["Topic"])  # Encode labels as numbers

# Define model arguments
model_args = ClassificationArgs(overwrite_output_dir=True)

# Create RoBERTa model
roberta_model = ClassificationModel("roberta", "roberta-base", num_labels=len(label_encoder.classes_), args=model_args, use_cuda=False)

# Train the model: simpletransformers expects the columns 'text' and 'labels'
roberta_train_df = train_df.rename(columns={"Transcribed Speech": "text"})[["text", "labels"]]
roberta_model.train_model(roberta_train_df)

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


(100, 0.48740894939750434)

In [None]:
from simpletransformers.classification import ClassificationModel, ClassificationArgs
import pandas as pd

# Sample training data
data = {
    "Topic": ["politics", "sports", "technology", "health"],
    "Transcribed Speech": ["Government policies are evolving.",
                            "The football team won the championship.",
                            "AI is transforming industries.",
                            "Exercise improves mental health."],
    "Label": [0, 1, 2, 3]  # Example labels for classification
}

# Convert to DataFrame
train_df = pd.DataFrame(data)

# Define model arguments
model_args = ClassificationArgs(
    num_train_epochs=3,       # Number of training epochs
    train_batch_size=8,       # Training batch size
    eval_batch_size=8,        # Evaluation batch size
    learning_rate=2e-5,       # Learning rate
    max_seq_length=128,       # Max sequence length
    weight_decay=0.05,        # Weight decay
    warmup_steps=0,           # Warmup steps
    logging_steps=50,         # Log training progress every 50 steps
    save_steps=200,           # Save the model every 200 steps
    overwrite_output_dir=True # Overwrite output directory
)

# Train the BERT model
bert_model = ClassificationModel(
    model_type='bert',
    model_name='bert-base-uncased',
    num_labels=len(train_df["Label"].unique()),
    args=model_args,
    use_cuda=False
)

# Train the model with correct column names
bert_model.train_model(train_df[["Transcribed Speech", "Label"]])

# Train the RoBERTa model
roberta_model = ClassificationModel(
    model_type='roberta',
    model_name='roberta-base',
    num_labels=len(train_df["Label"].unique()),
    args=model_args,
    use_cuda=False
)

# Train the model
roberta_model.train_model(train_df[["Transcribed Speech", "Label"]])


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


(3, 1.3756165504455566)

In [None]:
from simpletransformers.classification import ClassificationModel, ClassificationArgs
import pandas as pd

# Sample validation data
val_data = {
    "Transcribed Speech": [
        "The new policy will impact businesses.",
        "The basketball team played well last night.",
        "Quantum computing is the future of tech.",
        "A balanced diet is essential for good health."
    ],
    "Label": [0, 1, 2, 3]  # Example labels corresponding to topics
}

# Convert to DataFrame
val_df = pd.DataFrame(val_data)

# Evaluate the BERT model on validation data
result_bert, model_outputs_bert, wrong_predictions_bert = bert_model.eval_model(val_df)

# Print results
print("BERT Evaluation Results:")
print(result_bert)




BERT Evaluation Results:
{'mcc': 0.0, 'eval_loss': 1.3755273818969727}


In [None]:
# Evaluate RoBERTa on validation data
result_roberta, model_outputs_roberta, wrong_predictions_roberta = roberta_model.eval_model(val_df)

print("RoBERTa Evaluation Results:")
print(result_roberta)



RoBERTa Evaluation Results:
{'mcc': 0.0, 'eval_loss': 1.3939064741134644}


In [None]:
bert_model.save_model('bert_best_model')

In [None]:
roberta_model.save_model('roberta_best_model')


In [None]:
import os

saved_model_dir = "./bert_best_model/"

# Check if directory exists
if os.path.exists(saved_model_dir):
    print(f"Model directory found: {saved_model_dir}")
    print("Files in directory:", os.listdir(saved_model_dir))
else:
    raise FileNotFoundError(f"Model directory '{saved_model_dir}' not found!")


Model directory found: ./bert_best_model/
Files in directory: []


In [None]:
bert_model.save_model("./bert_best_model/")


In [None]:
import os
print(os.listdir("./bert_best_model/"))  # Should contain pytorch_model.bin


[]


In [None]:
import os
print("Existing directories:", os.listdir("./"))  # Look for the model folder


Existing directories: ['.config', 'cache_dir', 'bert_best_model', 'runs', 'roberta_best_model', 'Transcribed_Speech_Dataset (2).csv', 'outputs', 'sample_data']


In [None]:
saved_model_dir = "./outputs/"


In [None]:
import os

saved_model_dir = "./bert_best_model/"  # Adjust if needed

# Check if directory exists
if os.path.exists(saved_model_dir):
    print(f"✅ Model directory found: {saved_model_dir}")
    print(f"📂 Files in directory: {os.listdir(saved_model_dir)}")
else:
    print(f"❌ Model directory '{saved_model_dir}' not found! Check the path.")


✅ Model directory found: ./bert_best_model/
📂 Files in directory: []


In [None]:
from simpletransformers.classification import ClassificationModel, ClassificationArgs
import pandas as pd
import os

# Example training data (Replace with your real dataset)
train_data = {
    "text": ["I love this product!", "The service was bad.", "Excellent quality!", "Not worth the price."],
    "labels": [1, 0, 1, 0]  # Example labels: 1 = Positive, 0 = Negative
}
train_df = pd.DataFrame(train_data)

# Define model arguments
model_args = ClassificationArgs(
    num_train_epochs=3,       # Train for 3 epochs
    overwrite_output_dir=True,
    train_batch_size=8
)

# Train a new BERT model
bert_model = ClassificationModel(
    "bert", "bert-base-uncased", num_labels=2, args=model_args, use_cuda=False
)

# Train the model
bert_model.train_model(train_df)

# Save the trained model
saved_model_dir = "./bert_best_model/"
bert_model.save_model(saved_model_dir)

# Verify that the model files are saved
print(f"✅ Model saved in: {saved_model_dir}")
print(f"📂 Files inside: {os.listdir(saved_model_dir)}")


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Model saved in: ./bert_best_model/
📂 Files inside: []


In [None]:
import os

saved_model_dir = "./bert_best_model/"
required_files = ["pytorch_model.bin", "config.json", "tokenizer_config.json"]

if not os.path.exists(saved_model_dir):
    print(f"❌ Model directory '{saved_model_dir}' not found! Check the path.")
else:
    existing_files = os.listdir(saved_model_dir)
    missing_files = [f for f in required_files if f not in existing_files]

    if missing_files:
        print(f"⚠️ Missing files: {missing_files}")
    else:
        print("✅ All required model files are present!")


⚠️ Missing files: ['pytorch_model.bin', 'config.json', 'tokenizer_config.json']


In [None]:
model_args = ClassificationArgs(
    num_train_epochs=3,
    overwrite_output_dir=True,  # Ensures the model is saved properly
    train_batch_size=8
)


In [None]:
import os
import pandas as pd
from simpletransformers.classification import ClassificationModel, ClassificationArgs

# Step 1: Ensure Save Directory Exists
saved_model_dir = "./bert_best_model/"
os.makedirs(saved_model_dir, exist_ok=True)

# Step 2: Prepare Training Data
train_data = {
    "text": ["I love this product!", "The service was bad.", "Excellent quality!", "Not worth the price."],
    "labels": [1, 0, 1, 0]  # Example labels (1=Positive, 0=Negative)
}
train_df = pd.DataFrame(train_data)

# Step 3: Define Model Arguments
model_args = ClassificationArgs(
    num_train_epochs=3,
    overwrite_output_dir=True,
    train_batch_size=8,
    output_dir=saved_model_dir  # Ensure model saves to correct directory
)

# Step 4: Train BERT Model
print("🚀 Training started...")
bert_model = ClassificationModel(
    "bert", "bert-base-uncased", num_labels=2, args=model_args, use_cuda=False
)
bert_model.train_model(train_df)
print("✅ Training complete!")

# Step 5: Save Model Properly
print(f"💾 Saving model to: {saved_model_dir}")
bert_model.save_model(saved_model_dir)
print("✅ Model saved successfully!")

# Step 6: Check if Files Exist
required_files = ["pytorch_model.bin", "config.json", "tokenizer_config.json"]
existing_files = os.listdir(saved_model_dir)
missing_files = [f for f in required_files if f not in existing_files]

if missing_files:
    print(f"⚠️ Missing files: {missing_files}")
else:
    print("✅ All required model files are present!")

# Step 7: Load the Model to Verify
print("🔄 Loading the saved model...")
bert_model = ClassificationModel(
    model_type="bert",
    model_name=saved_model_dir,
    use_cuda=False
)
print("✅ Model loaded successfully!")

# Step 8: Make a Prediction
test_texts = ["This is an amazing product!", "I am very disappointed with the service."]
predictions, _ = bert_model.predict(test_texts)

print(f"🧠 BERT Predictions: {predictions}")


🚀 Training started...


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


✅ Training complete!
💾 Saving model to: ./bert_best_model/
✅ Model saved successfully!
⚠️ Missing files: ['pytorch_model.bin']
🔄 Loading the saved model...
✅ Model loaded successfully!


🧠 BERT Predictions: [1, 1]


In [None]:
import os
print(f"📂 Files inside '{saved_model_dir}': {os.listdir(saved_model_dir)}")


📂 Files inside './bert_best_model/': ['special_tokens_map.json', 'vocab.txt', 'training_args.bin', 'tokenizer.json', 'config.json', 'tokenizer_config.json', 'model_args.json', 'model.safetensors', 'checkpoint-3-epoch-3', 'checkpoint-2-epoch-2', 'checkpoint-1-epoch-1']
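
The listing above contains `model.safetensors` rather than `pytorch_model.bin`, which is why the earlier check reported a missing file even though the model loaded correctly. A more tolerant check, sketched below, accepts either weights file. The helper name `missing_model_files` and the two file lists are assumptions made for this example, and the demonstration runs on a temporary directory instead of the real model folder.

```python
import os
import tempfile

# Either weights file is acceptable; which one is written depends on the library version
WEIGHT_FILES = ("model.safetensors", "pytorch_model.bin")
REQUIRED_FILES = ("config.json", "tokenizer_config.json")

def missing_model_files(model_dir):
    """Return the required files absent from model_dir (empty list if complete)."""
    present = set(os.listdir(model_dir))
    missing = [name for name in REQUIRED_FILES if name not in present]
    if not any(name in present for name in WEIGHT_FILES):
        missing.append(" or ".join(WEIGHT_FILES))
    return missing

# Demonstrate on a throwaway directory that mimics the listing above
with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("config.json", "tokenizer_config.json", "model.safetensors"):
        open(os.path.join(demo_dir, name), "w").close()
    demo_missing = missing_model_files(demo_dir)
    print(demo_missing)
```

Calling `missing_model_files("./bert_best_model/")` would apply the same check to the saved model directory.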


In [None]:
import os
saved_model_dir = "./bert_best_model/"

if os.path.exists(saved_model_dir):
    print(f"✅ Model directory exists: {saved_model_dir}")
else:
    print(f"❌ Model directory is missing! Check if training completed successfully.")


✅ Model directory exists: ./bert_best_model/


In [None]:
bert_model = ClassificationModel(
    model_type="bert",
    model_name=saved_model_dir,
    use_cuda=False
)

test_texts = ["I love this!", "Worst product ever."]
predictions, _ = bert_model.predict(test_texts)

print(f"🧠 BERT Predictions: {predictions}")


🧠 BERT Predictions: [1, 1]


In [None]:
real_world_text = ["This is a great product!", "I didn't like the service."]

predictions_bert, _ = bert_model.predict(real_world_text)

print(f"🧠 BERT Predictions: {predictions_bert}")


🧠 BERT Predictions: [1, 0]


In [None]:
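# Note: this loads the base 'roberta-base' checkpoint, not the saved fine-tuned model,
# so the classifier head is untrained and the predictions below are not meaningful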
roberta_model = ClassificationModel('roberta', 'roberta-base', num_labels=1, use_cuda=False)  # Set use_cuda=True if using a GPU
# Real-world input text
real_world_text = ["This is a great product!", "I didn't like the service."]

# Predict the class
predictions_roberta, _ = roberta_model.predict(real_world_text)

print(f"RoBERTa Predictions: {predictions_roberta}")

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


RoBERTa Predictions: [0 0]


In [None]:
import pandas as pd
from simpletransformers.classification import ClassificationModel
from IPython.display import display

# Dummy evaluation results (Replace these with actual evaluation results)
result_bert = {
    "tp": 50, "fp": 10, "fn": 5, "tn": 35,
    "f1_score": 0.85, "accuracy": 0.90
}
result_roberta = {
    "tp": 48, "fp": 12, "fn": 6, "tn": 34,
    "f1_score": 0.83, "accuracy": 0.88
}

# Function to safely calculate precision & recall
def safe_divide(numerator, denominator):
    return numerator / denominator if denominator != 0 else 0

# Compute Precision & Recall safely
bert_recall = safe_divide(result_bert.get('tp', 0), (result_bert.get('tp', 0) + result_bert.get('fn', 0)))
bert_precision = safe_divide(result_bert.get('tp', 0), (result_bert.get('tp', 0) + result_bert.get('fp', 0)))

roberta_recall = safe_divide(result_roberta.get('tp', 0), (result_roberta.get('tp', 0) + result_roberta.get('fn', 0)))
roberta_precision = safe_divide(result_roberta.get('tp', 0), (result_roberta.get('tp', 0) + result_roberta.get('fp', 0)))

# Prepare Data
data = [
    ['BERT', 'Transcribed Speech', bert_precision, result_bert.get('f1_score', 0), bert_recall, result_bert.get('accuracy', 0)],
    ['RoBERTa', 'Transcribed Speech', roberta_precision, result_roberta.get('f1_score', 0), roberta_recall, result_roberta.get('accuracy', 0)]
]

# Create a pandas DataFrame
df = pd.DataFrame(data, columns=['Model Name', 'Feature', 'Precision', 'F1-Score', 'Recall', 'Accuracy'])

# Display the table with better formatting
display(df.style.set_table_styles([{"selector": "thead th", "props": [("background-color", "#D3D3D3")]}]))


   Model Name             Feature  Precision  F1-Score    Recall  Accuracy
0        BERT  Transcribed Speech   0.833333      0.85  0.909091       0.9
1     RoBERTa  Transcribed Speech        0.8      0.83  0.888889      0.88
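
In the cell above, precision and recall are derived from the confusion counts, while F1 and accuracy are hard-coded placeholders that do not follow from those counts. A minimal sketch that derives all four metrics from the same counts keeps them mutually consistent. The function name `metrics_from_counts` is an assumption for this example, and the counts are the dummy BERT values from above, not real evaluation results.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Derive precision, recall, F1, and accuracy from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Placeholder counts taken from the dummy BERT entry above
bert_metrics = metrics_from_counts(tp=50, fp=10, fn=5, tn=35)
for name, value in bert_metrics.items():
    print(f"{name}: {value:.3f}")
```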


In [None]:
import pandas as pd
import numpy as np
from simpletransformers.classification import ClassificationModel
from IPython.display import display
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# 🔹 Ensure val_df exists (Replace with actual validation data)
# Dummy data for testing (REMOVE if using real validation data)
val_df = pd.DataFrame({"Transcribed Speech": np.random.randint(0, 2, size=10)})

# 🔹 Dummy model outputs (REMOVE if using real model outputs)
model_outputs_bert = np.random.rand(10, 2)  # Fake output logits (10 samples, 2 classes)
model_outputs_roberta = np.random.rand(10, 2)

# 🔹 Ensure model outputs are NumPy arrays before using .argmax(axis=1)
if not isinstance(model_outputs_bert, np.ndarray):
    model_outputs_bert = np.array(model_outputs_bert)

if not isinstance(model_outputs_roberta, np.ndarray):
    model_outputs_roberta = np.array(model_outputs_roberta)

# Convert model outputs to predicted labels
bert_preds = model_outputs_bert.argmax(axis=1)
roberta_preds = model_outputs_roberta.argmax(axis=1)

# Ensure 'Transcribed Speech' column is integer type (binary classification)
val_df["Transcribed Speech"] = val_df["Transcribed Speech"].astype(int)

# Compute accuracy, precision, recall, and F1-score for BERT
bert_accuracy = accuracy_score(val_df["Transcribed Speech"], bert_preds)
bert_precision, bert_recall, bert_f1, _ = precision_recall_fscore_support(
    val_df["Transcribed Speech"], bert_preds, average="weighted", zero_division=0
)

# Compute accuracy, precision, recall, and F1-score for RoBERTa
roberta_accuracy = accuracy_score(val_df["Transcribed Speech"], roberta_preds)
roberta_precision, roberta_recall, roberta_f1, _ = precision_recall_fscore_support(
    val_df["Transcribed Speech"], roberta_preds, average="weighted", zero_division=0
)

# 🔹 Create a DataFrame for the results
df_results = pd.DataFrame({
    "Model Name": ["BERT", "RoBERTa"],
    "Feature": ["Transcribed Speech", "Transcribed Speech"],  # Change if needed
    "Precision": [round(bert_precision, 3), round(roberta_precision, 3)],
    "F1-score": [round(bert_f1, 3), round(roberta_f1, 3)],
    "Recall": [round(bert_recall, 3), round(roberta_recall, 3)],
    "Accuracy": [round(bert_accuracy, 3), round(roberta_accuracy, 3)]
})

# 🔹 Display table with better formatting
display(df_results.style.set_table_styles([
    {"selector": "thead th", "props": [("background-color", "#D3D3D3")]}
]))


   Model Name             Feature  Precision  F1-score    Recall  Accuracy
0        BERT  Transcribed Speech      0.917     0.899       0.9       0.9
1     RoBERTa  Transcribed Speech      0.381     0.375       0.4       0.4
