<a href="https://colab.research.google.com/github/SahilGhg/nlp-project-a/blob/main/Copy_of_NLP_Mini_Project_Part_3_Language_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

---

### **Text Classification by Fine-tuning Language Models**

---

### 1. **Data Loading**
   - Load the dataset (CSV format in this case).
   - Perform exploratory data analysis (EDA) to understand class distributions and data structure.
   - Split the dataset into training and validation sets.

```python
# Install simpletransformers package
!pip install simpletransformers

# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset (replace with your dataset path)
data = pd.read_csv('text_classification_data.csv')

# Exploratory Data Analysis (EDA)
print(data.info())  # Overview of data structure
print(data['label'].value_counts())  # Class distribution

# Split dataset into train and validation sets
train_data, val_data = train_test_split(data, test_size=0.2, random_state=42)

# Preparing the data in the correct format for SimpleTransformers
train_df = pd.DataFrame({
    'text': train_data['text'],
    'labels': train_data['label']
})

val_df = pd.DataFrame({
    'text': val_data['text'],
    'labels': val_data['label']
})
```
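The split above is random, so with a small or imbalanced dataset the class proportions in the two sets can drift apart. In scikit-learn this is handled by passing `stratify=data['label']` to `train_test_split`. The following is a minimal, dependency-free sketch of what stratification does; the helper name `stratified_split` and the toy labels are illustrative assumptions, not part of the project code.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, val_idx) with similar class proportions in both."""
    # Group example indices by their class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for validation
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Toy, hypothetical label column with three unevenly sized classes
labels = ["pos"] * 50 + ["neg"] * 30 + ["neu"] * 20
train_idx, val_idx = stratified_split(labels)
val_counts = Counter(labels[i] for i in val_idx)
print(len(train_idx), len(val_idx), dict(val_counts))
```

Each class contributes the same fraction of its examples to the validation set, which keeps per-class metrics comparable between training and validation.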

---

### 2. **Text Processing**
   - Here we clean the text by removing special characters, converting to lowercase, removing numbers, and stripping any extra whitespace.

```python
import re

# Define a function to clean text data
def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    
    # Remove special characters and numbers
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Strip leading and trailing whitespace
    text = text.strip()
    
    return text

# Apply the cleaning function to the dataset
train_df['text'] = train_df['text'].apply(clean_text)
val_df['text'] = val_df['text'].apply(clean_text)

print(train_df.head())
```
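It can help to see what this cleaning step does to individual strings before applying it to a whole column. The sketch below repeats `clean_text` so it runs on its own and adds a hypothetical variant, `clean_text_collapsed`, that also collapses the double spaces left behind when digits or symbols are removed. The sample sentences are made up for illustration.

```python
import re


def clean_text(text):
    # Same steps as above: lowercase, keep only letters and whitespace, trim ends
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    return text.strip()


def clean_text_collapsed(text):
    # Hypothetical variant: additionally collapse runs of whitespace to one space
    return re.sub(r'\s+', ' ', clean_text(text))


print(repr(clean_text("I didn't like the service.")))
# 'i didnt like the service'
print(repr(clean_text("The S&P 500 index closed flat")))
# 'the sp  index closed flat'
print(repr(clean_text_collapsed("The S&P 500 index closed flat")))
# 'the sp index closed flat'
```

Note that apostrophes, percentages, and ticker-style names such as "S&P 500" are altered by this cleaning, which may or may not be desirable for financial text.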

---

### 3. **Text Embedding using BERT and RoBERTa**
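   - Create BERT and RoBERTa classification models for the cleaned text. Each `ClassificationModel` loads the matching tokenizer, so the text is tokenized and embedded internally during training and prediction. Set `num_labels` to the number of classes in your dataset.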

```python
from simpletransformers.classification import ClassificationModel

# Create a BERT model for text classification
bert_model = ClassificationModel('bert', 'bert-base-uncased', num_labels=2, use_cuda=False)  # Set use_cuda=True if using a GPU

# Create a RoBERTa model for text classification
roberta_model = ClassificationModel('roberta', 'roberta-base', num_labels=2, use_cuda=False)  # Set use_cuda=True if using a GPU
```

---

### 4. **Model Training with BERT and RoBERTa**

#### **Basic Model Training**

#### **Train the BERT Model**
```python
# Train BERT model
bert_model.train_model(train_df)
```

#### **Train the RoBERTa Model**
```python
# Train RoBERTa model
roberta_model.train_model(train_df)
```

#### **Model Training with Hyperparameters**
   - Train the models with a set of hyperparameters such as learning rate, batch size, epochs, etc.

```python
from simpletransformers.classification import ClassificationArgs

# Set up model arguments with custom hyperparameters
model_args = ClassificationArgs(
    num_train_epochs=3,       # Start with 3 epochs
    train_batch_size=8,       # Use a batch size of 8
    eval_batch_size=8,        # Same for evaluation
    learning_rate=3e-5,       # Learning rate
    max_seq_length=128,       # Max sequence length
    weight_decay=0.01,        # Weight decay
    warmup_steps=0,           # Optional: adjust based on total steps
    logging_steps=50,         # Log training progress every 50 steps
    save_steps=200,           # Save the model every 200 steps
)

# Train the BERT model with custom hyperparameters
bert_model = ClassificationModel('bert', 'bert-base-uncased', num_labels=2, args=model_args, use_cuda=False)
bert_model.train_model(train_df)

# Train the RoBERTa model with custom hyperparameters
roberta_model = ClassificationModel('roberta', 'roberta-base', num_labels=2, args=model_args, use_cuda=False)
roberta_model.train_model(train_df)
```

---

### 5. **Evaluation on Validation Set**
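   - Evaluate both BERT and RoBERTa on the validation set. `eval_model` returns a results dictionary with metrics such as the Matthews correlation coefficient (`mcc`) and the evaluation loss; accuracy, precision, recall, and F1-score can be computed from the model's predictions with scikit-learn.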

#### **Evaluate BERT Model**
```python
# Evaluate BERT on validation data
result_bert, model_outputs_bert, wrong_predictions_bert = bert_model.eval_model(val_df)

print("BERT Evaluation Results:")
print(result_bert)
```

#### **Evaluate RoBERTa Model**
```python
# Evaluate RoBERTa on validation data
result_roberta, model_outputs_roberta, wrong_predictions_roberta = roberta_model.eval_model(val_df)

print("RoBERTa Evaluation Results:")
print(result_roberta)
```

---

### 6. **Saving the Best Model**
   - Save the best-performing model for later use.

#### **Saving the BERT Model**
```python
bert_model.save_model('bert_best_model')
```

#### **Saving the RoBERTa Model**
```python
roberta_model.save_model('roberta_best_model')
```
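The two calls above save both models. If only the better one should be kept, the result dictionaries from the evaluation step can be compared first. This is a small sketch under stated assumptions: the helper `pick_best` is hypothetical, and the metric values are placeholders, not measured results.

```python
def pick_best(results, metric="mcc"):
    """Return the name of the model with the highest value for `metric`."""
    return max(results, key=lambda name: results[name][metric])


# Placeholder values shaped like the dictionaries returned by eval_model
results = {
    "bert": {"mcc": 0.91, "eval_loss": 0.22},
    "roberta": {"mcc": 0.88, "eval_loss": 0.30},
}

best_name = pick_best(results)
print(f"Best model by mcc: {best_name}")
```

With the best name in hand, only that model's `save_model` call needs to run. Using `eval_loss` instead would require picking the minimum rather than the maximum.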

---

### 7. **Prediction on Real-World Input**
   - Test the saved model on real-world input data. Preprocess the input text, use the model to predict the class, and output the results.

#### **Prediction Using BERT Model**
```python
# Load the saved BERT model
bert_model = ClassificationModel('bert', 'bert_best_model', use_cuda=False)

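# Real-world input text
real_world_text = ["This is a great product!", "I didn't like the service."]

# Apply the same cleaning used on the training data (clean_text from step 2)
cleaned_text = [clean_text(text) for text in real_world_text]

# Predict the class
predictions_bert, _ = bert_model.predict(cleaned_text)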

print(f"BERT Predictions: {predictions_bert}")
```

#### **Prediction Using RoBERTa Model**
```python
# Load the saved RoBERTa model
roberta_model = ClassificationModel('roberta', 'roberta_best_model', use_cuda=False)

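# Real-world input text
real_world_text = ["This is a great product!", "I didn't like the service."]

# Apply the same cleaning used on the training data (clean_text from step 2)
cleaned_text = [clean_text(text) for text in real_world_text]

# Predict the class
predictions_roberta, _ = roberta_model.predict(cleaned_text)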

print(f"RoBERTa Predictions: {predictions_roberta}")
```

---

1. Data Loading

In [None]:
!pip install simpletransformers pandas scikit-learn


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from simpletransformers.classification import ClassificationModel
import torch

In [None]:
data = pd.read_csv('/content/dataset_NLP - Sheet1.csv')

print(data.head())
print(data.info())
print(data['Prediction'].value_counts())

                                               Input Prediction
0                                                NaN        NaN
1  Tesla's stock experienced a remarkable 5% surg...   Positive
2  Concerns over a looming market crash have inte...   Negative
3  The S&P 500 index closed the day without any s...    Neutral
4  Apple reported that its latest iPhone model ac...   Positive
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1007 entries, 0 to 1006
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Input       1006 non-null   object
 1   Prediction  1006 non-null   object
dtypes: object(2)
memory usage: 15.9+ KB
None
Prediction
Positive    357
Negative    341
Neutral     308
Name: count, dtype: int64


In [None]:
from sklearn.model_selection import train_test_split
import pandas as pd

data = pd.read_csv('/content/dataset_NLP - Sheet1.csv')

# Drop rows with a missing text or label (row 0 of the sheet is empty)
data = data.dropna(subset=['Input', 'Prediction'])

data = data.rename(columns={"Input": "text", "Prediction": "label"})

# Keep only rows with a known label, then encode the labels as integers
label_mapping = {"Positive": 1, "Negative": 0, "Neutral": 2}
data = data[data['label'].isin(label_mapping)]

data["label"] = data["label"].map(label_mapping)

# Stratified 80/20 split keeps the class proportions similar in both sets
train_data, val_data = train_test_split(data, test_size=0.2, random_state=42, stratify=data["label"])

# Column order matters here: text first, label second
train_df = train_data[['text', 'label']].copy()
val_df = val_data[['text', 'label']].copy()

print(f"Training size: {len(train_df)}, Validation size: {len(val_df)}")

Training size: 804, Validation size: 202
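Since the models work with the integer ids from `label_mapping`, predictions need to be mapped back to names to be readable. A minimal sketch, reusing the mapping defined above; the helper `decode_predictions` is a hypothetical name introduced for illustration.

```python
# Same mapping as in the cell above
label_mapping = {"Positive": 1, "Negative": 0, "Neutral": 2}

# Invert it so integer ids can be turned back into label names
id_to_label = {idx: name for name, idx in label_mapping.items()}


def decode_predictions(predictions):
    """Convert an iterable of integer class ids into label names."""
    return [id_to_label[int(p)] for p in predictions]


print(decode_predictions([1, 0, 2]))
# ['Positive', 'Negative', 'Neutral']
```

This decoding only applies to models fine-tuned with this mapping; a checkpoint trained elsewhere may assign the ids differently.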


2. Text Processing

In [None]:
import re

def clean_text(text):
    # Lowercase, keep only letters and whitespace, then trim both ends
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = text.strip()
    return text

train_df['text'] = train_df['text'].apply(clean_text)
val_df['text'] = val_df['text'].apply(clean_text)

print(train_df.head())

                                                  text  label
929  crude oil price volatility continues to impact...      0
317  fears of an economic downturn and sluggish gdp...      0
538  reliances strategic entry into the semiconduct...      1
742  reliance retails aggressive expansion into the...      1
365  foreign institutional investors fiis poured su...      1


3. Text Embedding using BERT and RoBERTa

In [None]:
bert_model = ClassificationModel('bert', 'bert-base-uncased', num_labels=3, use_cuda=False)
roberta_model = ClassificationModel('roberta', 'roberta-base', num_labels=3, use_cuda=False)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



4. Model Training with BERT and RoBERTa

In [None]:
from simpletransformers.classification import ClassificationArgs

model_args = ClassificationArgs(
    num_train_epochs=3,
    train_batch_size=8,
    eval_batch_size=8,
    learning_rate=3e-5,
    max_seq_length=128,
    weight_decay=0.01,
    warmup_steps=0,
    logging_steps=50,
    save_steps=200,
    overwrite_output_dir=True
)

bert_model = ClassificationModel('bert', 'bert-base-uncased', num_labels=3, args=model_args, use_cuda=False)
bert_model.train_model(train_df)

roberta_model = ClassificationModel('roberta', 'roberta-base', num_labels=3, args=model_args, use_cuda=False)
roberta_model.train_model(train_df)


(303, 0.23222205984381583)
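The first value in the tuple above appears to be the total number of optimization steps, and the second the average training loss. The step count can be sanity-checked from the training size (804), the batch size (8), and the number of epochs (3). This sketch assumes no gradient accumulation, which is an assumption about the default configuration; `batches_per_pass` is a hypothetical helper name.

```python
import math


def batches_per_pass(num_examples, batch_size):
    """Number of batches needed to see every example once (last batch may be smaller)."""
    return math.ceil(num_examples / batch_size)


steps_per_epoch = batches_per_pass(804, 8)   # training set, train_batch_size=8
total_steps = steps_per_epoch * 3            # num_train_epochs=3
eval_batches = batches_per_pass(202, 8)      # validation set, eval_batch_size=8

print(steps_per_epoch, total_steps, eval_batches)
# 101 303 26
```

Knowing the total step count is also useful when choosing `warmup_steps`, `logging_steps`, and `save_steps`: with 303 steps in total, `save_steps=200` writes a single intermediate checkpoint.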

5. Evaluation on Validation Set

In [None]:
result_bert, model_outputs_bert, wrong_predictions_bert = bert_model.eval_model(val_df)

print("BERT Evaluation Results:")
print(result_bert)




BERT Evaluation Results:
{'mcc': np.float64(0.9183237689841234), 'eval_loss': 0.2153392847484121}


In [None]:
result_roberta, model_outputs_roberta, wrong_predictions_roberta = roberta_model.eval_model(val_df)

print("RoBERTa Evaluation Results:")
print(result_roberta)




RoBERTa Evaluation Results:
{'mcc': 0.0, 'eval_loss': 1.1292158762613933}
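The `mcc` value in these dictionaries is the Matthews correlation coefficient. For readers unfamiliar with it, here is a dependency-free sketch of the usual multiclass formulation, computed from label counts. The function name `multiclass_mcc` and the toy labels are illustrative assumptions; in practice `sklearn.metrics.matthews_corrcoef` would normally be used.

```python
import math
from collections import Counter


def multiclass_mcc(y_true, y_pred):
    """Matthews correlation coefficient for any number of classes."""
    s = len(y_true)                                    # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))    # correctly classified
    t_counts = Counter(y_true)                         # occurrences of each true class
    p_counts = Counter(y_pred)                         # occurrences of each predicted class
    labels = set(t_counts) | set(p_counts)

    cov_tp = c * s - sum(t_counts[k] * p_counts[k] for k in labels)
    cov_tt = s * s - sum(v * v for v in t_counts.values())
    cov_pp = s * s - sum(v * v for v in p_counts.values())

    denom = math.sqrt(cov_tt * cov_pp)
    # By convention the score is 0 when the denominator vanishes
    return cov_tp / denom if denom else 0.0


y_true = [0, 0, 1, 1, 2, 2]
perfect = multiclass_mcc(y_true, y_true)
constant = multiclass_mcc(y_true, [1] * 6)
mixed = multiclass_mcc(y_true, [0, 1, 1, 1, 2, 0])
print(perfect, constant, round(mixed, 4))
```

A score of 1.0 means perfect agreement, while 0.0 is what the formula yields when, for example, every prediction falls into a single class. An `mcc` of exactly 0.0 is therefore worth investigating before the model is compared with others.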


6. Saving the Best Model

In [None]:
bert_model.save_model('bert_best_model')

In [None]:
roberta_model.save_model('roberta_best_model')

7. Prediction on Real-World Input

In [None]:
from simpletransformers.classification import ClassificationModel

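# Note: this loads the pretrained 'yiyanghkust/finbert-tone' checkpoint from the
# Hugging Face Hub, not the fine-tuned model saved in step 6. Its label ids follow
# its own configuration and may differ from label_mapping. To predict with the
# saved model, pass 'bert_best_model' as the model name instead.
bert_model = ClassificationModel('bert', 'yiyanghkust/finbert-tone', use_cuda=False)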

real_world_text = ["Global stock market surge as as investors respond positively to the latest economic stimulus pakage,fueling optimism for a strong post-pandemic recovery.", "Oil prices climb to their highest levels in three years,boosting energy sector stocks and signaling strong demand in the global market."]

predictions_bert, _ = bert_model.predict(real_world_text)

print(f"BERT Predictions: {predictions_bert}")


BERT Predictions: [1 1]


In [None]:
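# Note: 'roberta-base' is the base checkpoint with a newly initialized
# classification head, so these predictions do not come from the fine-tuned model.
# To predict with the model saved in step 6, pass 'roberta_best_model' instead.
roberta_model = ClassificationModel('roberta', 'roberta-base', use_cuda=False)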

real_world_text = ["Global stock market surge as as investors respond positively to the latest economic stimulus pakage,fueling optimism for a strong post-pandemic recovery.", "Oil prices climb to their highest levels in three years,boosting energy sector stocks and signaling strong demand in the global market."]

predictions_roberta, _ = roberta_model.predict(real_world_text)

print(f"RoBERTa Predictions: {predictions_roberta}")

Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



RoBERTa Predictions: [1 1]


In [None]:
result_bert, model_outputs_bert, wrong_predictions_bert = bert_model.eval_model(val_df)

print("BERT Evaluation Results:")
print(result_bert)




BERT Evaluation Results:
{'mcc': np.float64(0.9192194802292856), 'eval_loss': 0.17796127249797186}


In [None]:
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def get_metrics(model, val_df):
    # Extract texts and true labels from the validation dataframe
    texts = val_df['text'].tolist()
    y_true = val_df['label'].tolist()

    # Get model predictions using the model's predict method
    predictions, _ = model.predict(texts)

    # Calculate evaluation metrics
    accuracy = accuracy_score(y_true, predictions)
    precision = precision_score(y_true, predictions, average='weighted')
    recall = recall_score(y_true, predictions, average='weighted')
    f1 = f1_score(y_true, predictions, average='weighted')

    return accuracy, precision, recall, f1

# Dictionary to store metrics for both models
metrics_dict = {
    "Model": [],
    "Accuracy": [],
    "Precision": [],
    "Recall": [],
    "F1 Score": []
}


models = [(bert_model, "BERT Model"), (roberta_model, "RoBERTa Model")]

for model, name in models:
    acc, prec, rec, f1 = get_metrics(model, val_df)
    metrics_dict["Model"].append(name)
    metrics_dict["Accuracy"].append(round(acc, 4))
    metrics_dict["Precision"].append(round(prec, 4))
    metrics_dict["Recall"].append(round(rec, 4))
    metrics_dict["F1 Score"].append(round(f1, 4))

# Create and display the results table
results_df = pd.DataFrame(metrics_dict)
print("Evaluation Metrics Comparison:")
display(results_df)


Evaluation Metrics Comparison:


           Model  Accuracy  Precision  Recall  F1 Score
0     BERT Model    0.9455     0.9464  0.9455    0.9452
1  RoBERTa Model    0.9554     0.9558  0.9554    0.9549
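
In the table, Recall equals Accuracy for both models. That is a property of `average='weighted'`: weighting each class's recall by its share of the true labels always sums to the overall accuracy. The dependency-free sketch below shows what the weighted averages compute; `weighted_metrics` and the toy labels are illustrative assumptions, not the scikit-learn implementation.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1) with support-weighted averaging."""
    n = len(y_true)
    support = Counter(y_true)        # true examples per class
    predicted = Counter(y_pred)      # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    accuracy = sum(correct.values()) / n
    precision = recall = f1 = 0.0
    for label in sorted(set(y_true) | set(y_pred)):
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / support[label] if support[label] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = support[label] / n  # class share of the true labels
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return accuracy, precision, recall, f1


acc, prec, rec, f1 = weighted_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
print(round(acc, 4), round(prec, 4), round(rec, 4), round(f1, 4))
# 0.6667 0.7222 0.6667 0.6556
```

When the classes are imbalanced, `average='macro'` in scikit-learn gives every class equal weight instead, which can reveal weak performance on a minority class that the weighted average hides.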
