# Bias Analysis of Sentiment Analysis Models and Datasets

The bias analysis proceeds in 3 steps:


1.   First, train and test a logistic regression model
2.   Second, fine-tune a second model on a standard dataset and test it
3.   Lastly, train the same model from step 2 on a toxicity dataset

Following these 3 steps, we analyse the bias that is either inherent in the model or gradually learnt from the provided training datasets.



# 1. Basic Test of Bias Using Logistic Regression

We first create a baseline to test whether any kind of bias exists in a simple model such as logistic regression. This model is trained on the Stanford Sentiment Treebank v2 (SST2) dataset and then tested on the Equity Evaluation Corpus (EEC) dataset.

This baseline then serves as the basis for the bias analysis of our control model and, after that, of our actual test model.
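One way to quantify bias on a template-based corpus such as EEC is to compare the model's scores for sentences that share a template and differ only in the group the person word belongs to. The sketch below is a minimal illustration of that idea, not necessarily the metric used later in this notebook; the function name `group_score_gap`, the record layout, and the toy scores are all assumptions made for the example.

```python
from statistics import mean

def group_score_gap(records, group_a, group_b):
    """Average per-template score difference (group_a minus group_b).

    `records` is a list of (template_id, group, score) tuples, where `score`
    is the model's predicted probability of positive sentiment.
    Raises statistics.StatisticsError if no template contains both groups.
    """
    # Collect scores by (template, group) so only matched templates are compared
    by_key = {}
    for template_id, group, score in records:
        by_key.setdefault((template_id, group), []).append(score)

    diffs = []
    templates = {template_id for template_id, _ in by_key}
    for t in sorted(templates):
        if (t, group_a) in by_key and (t, group_b) in by_key:
            diffs.append(mean(by_key[(t, group_a)]) - mean(by_key[(t, group_b)]))
    return mean(diffs)

# Made-up scores for illustration only; these are not results from this notebook
toy_records = [
    ("t1", "female", 0.40), ("t1", "male", 0.50),
    ("t2", "female", 0.70), ("t2", "male", 0.70),
]
gap = group_score_gap(toy_records, "female", "male")
print(f"mean score gap (female - male): {gap:+.3f}")
```

A gap close to zero suggests the model treats the two groups alike on these templates, while a consistently positive or negative gap points to a systematic difference.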

### Obtaining the datasets from Kaggle:

In [1]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("atulanandjha/stanford-sentiment-treebank-v2-sst2")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/atulanandjha/stanford-sentiment-treebank-v2-sst2?dataset_version_number=30...

100%|██████████| 19.1M/19.1M [00:00<00:00, 168MB/s]

Extracting files...

Path to dataset files: /root/.cache/kagglehub/datasets/atulanandjha/stanford-sentiment-treebank-v2-sst2/versions/30


### The logistic regression model, along with its loss function and optimizer, is defined as follows:
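In equation form, the model in the cell below applies a sigmoid to a single linear layer and is trained with the binary cross-entropy loss, averaged over the batch (the default reduction of `nn.BCELoss`). The symbols $w$, $b$ and $N$ are notation introduced here for the layer's weights, its bias, and the number of training examples.

```latex
\hat{y}_i = \sigma\left(w^\top x_i + b\right),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}

\mathcal{L}(w, b) = -\frac{1}{N} \sum_{i=1}^{N}
\left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right]
```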

In [None]:
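import torch
import torch.nn as nn

# Use the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hyperparameters
learning_rate = 1e-3
num_epochs = 500

class LogisticRegression(nn.Module):
  def __init__(self, input_size):
    super().__init__()
    # A single linear layer mapping the input features to one logit
    self.linear = nn.Linear(input_size, 1)

  def forward(self, x):
    # The sigmoid squashes the logit into a probability in (0, 1)
    return torch.sigmoid(self.linear(x))


# NOTE: input_size, x_train, y_train and x_test must be defined beforehand.
# The tensors must live on `device`, and y_train must be a float tensor
# with the same shape as the model output, i.e. (N, 1).
model = LogisticRegression(input_size).to(device)

# Binary cross-entropy loss and Adam optimizer
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Train the model on the full training set at every epoch
for epoch in range(num_epochs):
  optimizer.zero_grad()
  out_data = model(x_train)

  loss = criterion(out_data, y_train)
  loss.backward()

  optimizer.step()

# Evaluate without tracking gradients.
# .item() only works if the model output holds exactly one element.
with torch.no_grad():
  print(model(x_test).item())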





50.06367874145508
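The training cell above uses `input_size`, `x_train`, `y_train` and `x_test` without showing how they are built. One simple possibility is a bag-of-words encoding, sketched below in plain Python. This is an assumption for illustration, not necessarily the featurization used in this notebook; the helpers `build_vocab` and `vectorize` and the toy sentences are hypothetical.

```python
from collections import Counter

def build_vocab(sentences):
    """Map each distinct lowercase token to a column index."""
    vocab = {}
    for sentence in sentences:
        for token in sentence.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def vectorize(sentence, vocab):
    """Return a count vector of length len(vocab); unknown tokens are ignored."""
    counts = Counter(sentence.lower().split())
    vector = [0.0] * len(vocab)
    for token, count in counts.items():
        if token in vocab:
            vector[vocab[token]] = float(count)
    return vector

# Toy sentences for illustration only; not taken from SST2
train_sentences = ["a great great film", "a dull film"]
train_labels = [1.0, 0.0]

vocab = build_vocab(train_sentences)
input_size = len(vocab)  # number of features fed to nn.Linear
features = [vectorize(s, vocab) for s in train_sentences]

# With PyTorch available, these lists could become the tensors used above, e.g.
#   x_train = torch.tensor(features, dtype=torch.float32).to(device)
#   y_train = torch.tensor(train_labels, dtype=torch.float32).unsqueeze(1).to(device)
```

The `unsqueeze(1)` in the closing comment gives the labels the shape `(N, 1)`, which matches the model output as `nn.BCELoss` requires.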
