# Supplementary Code for Cross-Domain Evaluation

This notebook includes all steps used for data processing, model training, and evaluation in the research paper.

## 1. Library Imports

The following libraries are essential for data processing, model building, and evaluation.

In [None]:
!pip install transformers datasets
import matplotlib.pyplot as plt
import numpy as np
from transformers import GPT2LMHeadModel, GPT2Tokenizer, T5ForConditionalGeneration, T5Tokenizer, Trainer, TrainingArguments
from datasets import load_dataset
from math import pi


## 2. Data Loading and Preprocessing

This section loads datasets from several domains and applies the preprocessing needed for cross-domain evaluation, so that the model can be tested on data with varied linguistic and contextual features.
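
As a rough illustration of the per-domain preprocessing described above, the sketch below normalizes text and groups examples by domain. The record fields (`domain`, `text`, `label`), the normalization rule, and the domain names are hypothetical placeholders, not the datasets or steps used in the paper.

```python
from collections import defaultdict

def normalize_text(text):
    """Lowercase and collapse runs of whitespace (an illustrative choice)."""
    return " ".join(text.lower().split())

def group_by_domain(records):
    """Return {domain: [cleaned record, ...]}, skipping records with empty text."""
    grouped = defaultdict(list)
    for record in records:
        cleaned = normalize_text(record["text"])
        if not cleaned:
            continue  # drop examples that are empty after normalization
        grouped[record["domain"]].append({"text": cleaned, "label": record["label"]})
    return dict(grouped)

# Hypothetical toy records, for illustration only
records = [
    {"domain": "news", "text": "  Markets  rallied today. ", "label": 1},
    {"domain": "reviews", "text": "Great   battery life", "label": 1},
    {"domain": "news", "text": "   ", "label": 0},
]
by_domain = group_by_domain(records)
print({domain: len(items) for domain, items in by_domain.items()})
```

Keeping each domain in its own list makes it straightforward to train on one domain and evaluate on the others.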


In [None]:
import os

# Create main project directory
os.makedirs("research-paper/notebooks", exist_ok=True)
os.makedirs("research-paper/scripts", exist_ok=True)
os.makedirs("research-paper/results/figures", exist_ok=True)
os.makedirs("research-paper/results/models", exist_ok=True)

# Create placeholder files
open("research-paper/requirements.txt", "w").close()
open("research-paper/README.md", "w").close()

# Create some Python script files in the 'scripts' folder
open("research-paper/scripts/train_model.py", "w").close()
open("research-paper/scripts/evaluate_model.py", "w").close()
open("research-paper/scripts/preprocess_data.py", "w").close()

# Confirm directory structure
!ls -R research-paper/


research-paper/:
notebooks  README.md  requirements.txt	results  scripts

research-paper/notebooks:

research-paper/results:
figures  models

research-paper/results/figures:

research-paper/results/models:

research-paper/scripts:
evaluate_model.py  preprocess_data.py  train_model.py


In [None]:
# Creating a README.md file
import textwrap

# dedent strips the common indentation so the Markdown headings and lists
# are not rendered as an indented code block
with open('README.md', 'w') as f:
    f.write(textwrap.dedent("""
    # Research Paper: Bias Detection and Fairness Analysis in Object Detection

    ## Overview
    This research investigates bias detection and fairness analysis in object detection and image classification using the Open Images V7 dataset. The study evaluates model fairness on selected object classes: person, car, dog, cat, and chair.

    ## Project Structure

    - `notebooks/`: Jupyter notebooks for model training, testing, and experiments.
      - `model_training.ipynb`: Notebook with model training and evaluation.
    - `scripts/`: Python scripts for data processing, training, and evaluation.
      - `train_model.py`: Script for model training.
      - `evaluate_model.py`: Script for evaluation.
      - `preprocess_data.py`: Script for preprocessing the dataset.
    - `results/`: Directory for storing results.
      - `figures/`: Directory for storing charts, graphs, and visualizations.
      - `models/`: Directory for saving trained models.
      - `metrics.txt`: File for storing model metrics and performance.

    ## Requirements
    - Python 3.x
    - Hugging Face Transformers
    - PyTorch
    - Datasets library
    - Other dependencies in `requirements.txt`

    ## Installation

    To install the necessary dependencies:

    ```bash
    pip install -r requirements.txt
    ```

    ## Training the Model

    To start training, run the following:

    ```bash
    python scripts/train_model.py
    ```

    ## Evaluation

    After training, you can evaluate the model by running:

    ```bash
    python scripts/evaluate_model.py
    ```

    ## License
    Include any licensing information if necessary.
    """))


## 3. Model Training

The following cells implement model training and optimization steps.


## Cross-Domain Evaluation

We evaluate model performance across different domains using metrics like accuracy, precision, and recall. This section provides insights into how well the model can adapt and perform on domains outside its training data.
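
To make the per-domain metrics concrete, here is a minimal sketch that computes accuracy, precision, and recall separately for each domain in plain Python. It assumes binary labels; the domain names and predictions are invented for illustration and are not results from the paper.

```python
def binary_metrics(predictions, labels, positive=1):
    """Accuracy, precision, and recall for one domain (binary labels)."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p == positive and y == positive)
    fp = sum(1 for p, y in pairs if p == positive and y != positive)
    fn = sum(1 for p, y in pairs if p != positive and y == positive)
    correct = sum(1 for p, y in pairs if p == y)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

def evaluate_domains(results):
    """results maps a domain name to a (predictions, labels) pair."""
    return {domain: binary_metrics(p, y) for domain, (p, y) in results.items()}

# Hypothetical predictions for two made-up domains
results = {
    "domain_a": ([1, 0, 1, 1], [1, 0, 0, 1]),
    "domain_b": ([0, 0, 1, 0], [1, 0, 1, 1]),
}
report = evaluate_domains(results)
for domain, scores in report.items():
    print(domain, scores)
```

Reporting the three metrics side by side per domain helps show whether a drop in accuracy comes from missed positives (low recall) or from false alarms (low precision).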


In [None]:
# Install required packages if not already done
!pip install transformers wandb -q

# Import necessary libraries
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer, Trainer, TrainingArguments
import wandb


In [None]:
# Load pre-trained GPT-2 model and tokenizer
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# GPT-2 defines no padding token, so reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = model.config.eos_token_id


In [None]:
import torch
from transformers import Trainer, TrainingArguments

# Define a custom collate function to ensure that labels are passed to the model as tensors
def collate_fn(batch):
    # Make sure the 'labels' column is included during training and convert to tensor
    input_ids = torch.tensor([item['input_ids'] for item in batch])
    attention_mask = torch.tensor([item['attention_mask'] for item in batch])
    token_type_ids = torch.tensor([item['token_type_ids'] for item in batch])
    labels = torch.tensor([item['labels'] for item in batch])

    # Return the dictionary as required by the model
    return {
        'input_ids': input_ids,
        'attention_mask': attention_mask,
        'token_type_ids': token_type_ids,
        'labels': labels
    }

# Define the Trainer with the collate function
# (assumes `training_args` and `tokenized_datasets` are already defined in the session)
trainer = Trainer(
    model=model,                         # The model to train
    args=training_args,                  # Training arguments
    data_collator=collate_fn,            # Use the custom collate function
    train_dataset=tokenized_datasets['train'],    # Training dataset
    eval_dataset=tokenized_datasets['validation'],  # Validation dataset
    tokenizer=tokenizer,                 # Tokenizer
    compute_metrics=None,                # Optional: You can add metrics computation
)

# Start training
trainer.train()

# Save the model and tokenizer
model.save_pretrained('./fine_tuned_model')
tokenizer.save_pretrained('./fine_tuned_tokenizer')


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 0.731707 |
| 2 | No log | 0.796723 |
| 3 | No log | 0.830584 |


('./fine_tuned_tokenizer/tokenizer_config.json',
 './fine_tuned_tokenizer/special_tokens_map.json',
 './fine_tuned_tokenizer/vocab.txt',
 './fine_tuned_tokenizer/added_tokens.json',
 './fine_tuned_tokenizer/tokenizer.json')
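
The `collate_fn` above calls `torch.tensor` on lists of token ids, which only works when every example in a batch has the same length. The sketch below shows, in plain Python, the padding step that would make that true. The `pad_id=0` default and the toy token ids are assumptions for illustration.

```python
def pad_batch(batch, pad_id=0):
    """Pad variable-length 'input_ids' to the longest sequence in the batch."""
    max_len = max(len(item["input_ids"]) for item in batch)
    input_ids, attention_mask = [], []
    for item in batch:
        ids = list(item["input_ids"])
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": [item["labels"] for item in batch],
    }

# Toy batch with made-up token ids
batch = [
    {"input_ids": [101, 7592, 102], "labels": 1},
    {"input_ids": [101, 102], "labels": 0},
]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

The resulting equal-length lists can then be converted with `torch.tensor` as in the cell above. The `transformers` library also ships a `DataCollatorWithPadding` class that covers the same need.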

In [None]:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # Output directory
    eval_strategy="epoch",           # Evaluation strategy (updated parameter)
    logging_dir='./logs',            # Directory for logs
    logging_steps=10,                # Log every 10 steps
    save_steps=500,                  # Save the model every 500 steps
    num_train_epochs=3,              # Number of training epochs
    per_device_train_batch_size=8,   # Batch size per device during training
    per_device_eval_batch_size=8,    # Batch size per device during evaluation
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    tokenizer=tokenizer,
    compute_metrics=None,  # You can add custom metrics computation if needed
)

trainer.train()


| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | No log | 0.791697 |
| 2 | No log | 1.080412 |
| 3 | No log | 1.280854 |


TrainOutput(global_step=3, training_loss=0.48189441363016766, metrics={'train_runtime': 16.862, 'train_samples_per_second': 0.178, 'train_steps_per_second': 0.178, 'total_flos': 7708331700.0, 'train_loss': 0.48189441363016766, 'epoch': 3.0})
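
The logged numbers allow a quick sanity check. The sketch below picks the epoch with the lowest validation loss and estimates how many training samples were processed. It assumes that `train_samples_per_second` is the total number of samples seen divided by `train_runtime`; all values are copied from the output above.

```python
# Values copied from the training output above
val_loss = {1: 0.791697, 2: 1.080412, 3: 1.280854}
train_runtime = 16.862             # seconds
train_samples_per_second = 0.178
num_epochs = 3

# Epoch with the lowest validation loss
best_epoch = min(val_loss, key=val_loss.get)

# True if validation loss rose at every epoch after the first
rising = all(val_loss[e] < val_loss[e + 1] for e in range(1, num_epochs))

# Assumption: throughput = total samples seen / runtime
samples_seen = round(train_samples_per_second * train_runtime)
samples_per_epoch = samples_seen / num_epochs

print(f"best epoch: {best_epoch}, validation loss rising: {rising}")
print(f"~{samples_seen} samples seen, ~{samples_per_epoch:.0f} per epoch")
```

Under that assumption the run would have seen roughly one training example per epoch, and validation loss increased after the first epoch, so these figures are probably better read as a smoke test of the pipeline than as a training result.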

In [None]:
from transformers import BertForSequenceClassification, BertTokenizer
from evaluate import load  # Import from the evaluate library
import torch

# Load the fine-tuned model and tokenizer
model = BertForSequenceClassification.from_pretrained('./fine_tuned_model')
tokenizer = BertTokenizer.from_pretrained('./fine_tuned_tokenizer')

# Load the metric
accuracy_metric = load("accuracy")

# Define a function to perform evaluation
def evaluate_model(model, tokenizer, dataset, metric):
    model.eval()  # Set the model to evaluation mode
    for example in dataset:
        inputs = tokenizer(example['text'], return_tensors="pt", truncation=True, padding=True)
        with torch.no_grad():
            outputs = model(**inputs)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)
        # Update the metric with predictions and labels
        metric.add_batch(predictions=predictions, references=[example['labels']])

    # Compute final results
    result = metric.compute()
    return result

# Evaluate on the test dataset
test_results = evaluate_model(model, tokenizer, tokenized_datasets['test'], accuracy_metric)

print("Test Accuracy:", test_results["accuracy"])


Test Accuracy: 1.0


In [None]:
# Load the fine-tuned model and tokenizer
model = BertForSequenceClassification.from_pretrained("./fine_tuned_model")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Load the dataset
test_dataset = load_dataset("glue", "mrpc", split="test")

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['sentence1'], examples['sentence2'], padding="max_length", truncation=True)

test_dataset = test_dataset.map(tokenize_function, batched=True)



Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



In [None]:
from transformers import Trainer, TrainingArguments, BertForSequenceClassification, BertTokenizer
from datasets import load_dataset
import evaluate
import numpy as np

# Load the model and tokenizer
# Note: this is the base checkpoint, so the classification head is newly initialized
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Load the dataset
test_dataset = load_dataset("glue", "mrpc", split="test")

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['sentence1'], examples['sentence2'], padding="max_length", truncation=True)

test_dataset = test_dataset.map(tokenize_function, batched=True)

# Load the accuracy metric
metric = evaluate.load("accuracy")

# Define the evaluation function
def compute_metrics(p):
    # The Trainer passes logits and labels as NumPy arrays, so take the argmax with NumPy
    logits, labels = p
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Set the evaluation arguments
training_args = TrainingArguments(
    output_dir='./results',
    eval_strategy="epoch",  # same name as in the training cell; `evaluation_strategy` is the deprecated spelling
    per_device_eval_batch_size=8,
    logging_dir='./logs',
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    eval_dataset=test_dataset
)

# Evaluate the model
eval_results = trainer.evaluate()

# Print the evaluation results
print(f"Evaluation results: {eval_results}")


## 4. Model Evaluation

This section evaluates model performance using relevant metrics.


# Cross-Domain Evaluation for Multi-Task Learning in NLP

This notebook evaluates cross-domain generalization and robustness of models in multi-task NLP. We focus on evaluating model behavior and biases across multiple domains (e.g., Domain A, Domain B, Domain C).

**Objectives**:
1. Examine the generalization performance of models on unseen domains.
2. Assess any emerging biases when transferring tasks across domains.
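
One simple way to quantify the first objective is a generalization gap: in-domain accuracy minus accuracy on each unseen domain. The sketch below is illustrative only. The function name and the accuracy values are hypothetical and do not come from the paper.

```python
def generalization_gaps(in_domain_acc, out_of_domain_acc):
    """Gap = in-domain accuracy minus accuracy on each unseen domain."""
    return {domain: in_domain_acc - acc for domain, acc in out_of_domain_acc.items()}

# Hypothetical accuracies, for illustration only
in_domain_acc = 0.90
out_of_domain_acc = {"domain_b": 0.80, "domain_c": 0.65}

gaps = generalization_gaps(in_domain_acc, out_of_domain_acc)
worst_domain = max(gaps, key=gaps.get)
print({d: round(g, 2) for d, g in gaps.items()}, "largest drop:", worst_domain)
```

A large positive gap for one domain flags where transfer is weakest and where a closer look for bias may be warranted.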

---



## Results and Analysis

This section visualizes the metrics to expose biases and domain-specific performance variations, and highlights the key findings of the cross-domain analysis.
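
A possible starting point for such a visualization is sketched below: one helper computes each domain's deviation from the mean score, and another draws a bar chart. The helper names, the accuracy values, and the output file name are hypothetical; matplotlib is imported inside the plotting function so the numeric helper runs without it.

```python
def deviation_from_mean(scores):
    """Difference between each domain's score and the mean across domains."""
    mean = sum(scores.values()) / len(scores)
    return {domain: value - mean for domain, value in scores.items()}

def plot_domain_scores(scores, metric_name="accuracy", path=None):
    """Bar chart of one metric per domain; matplotlib is imported lazily."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(list(scores), list(scores.values()))
    ax.set_xlabel("domain")
    ax.set_ylabel(metric_name)
    ax.set_ylim(0, 1)
    ax.set_title(f"{metric_name} by domain")
    if path is not None:
        fig.savefig(path)
    return fig

# Hypothetical per-domain accuracies, for illustration only
accuracy_by_domain = {"domain_a": 0.90, "domain_b": 0.80, "domain_c": 0.70}
deviations = deviation_from_mean(accuracy_by_domain)
print({d: round(v, 2) for d, v in deviations.items()})

# Uncomment to save a figure (the file name is a placeholder):
# plot_domain_scores(accuracy_by_domain, path="research-paper/results/figures/accuracy_by_domain.png")
```

Domains with a clearly negative deviation are the ones to examine first for domain-specific weaknesses.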


In [None]:
from transformers import TrainingArguments

# Setup training arguments with a custom run name
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # number of training epochs
    per_device_train_batch_size=4,   # batch size for training
    per_device_eval_batch_size=8,    # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
    run_name="gpt2_experiment_v1",   # Unique run name
)


In [None]:
!pip install evaluate


Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3


In [None]:
accuracy_metric = load("accuracy")
f1_metric = load("f1")
precision_metric = load("precision")
recall_metric = load("recall")


In [None]:
def compute_metrics(preds, labels):
    accuracy = accuracy_metric.compute(predictions=preds, references=labels)['accuracy']
    f1 = f1_metric.compute(predictions=preds, references=labels, average='weighted')['f1']
    precision = precision_metric.compute(predictions=preds, references=labels, average='weighted')['precision']
    recall = recall_metric.compute(predictions=preds, references=labels, average='weighted')['recall']
    return {
        'accuracy': accuracy,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }
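
To clarify what `average='weighted'` means in the cell above, the following sketch computes a weighted F1 score by hand: the F1 of each class is weighted by the number of true examples of that class. It is a plain-Python illustration with made-up predictions, not the `evaluate` implementation.

```python
from collections import Counter

def weighted_f1(predictions, labels):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(labels)
    total = 0.0
    for cls, count in support.items():
        tp = sum(1 for p, y in zip(predictions, labels) if p == cls and y == cls)
        fp = sum(1 for p, y in zip(predictions, labels) if p == cls and y != cls)
        fn = sum(1 for p, y in zip(predictions, labels) if p != cls and y == cls)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * count
    return total / len(labels)

# Made-up predictions and labels, for illustration only
preds = [1, 1, 0, 1]
labels = [1, 0, 0, 1]
score = weighted_f1(preds, labels)
print(round(score, 4))
```

Weighting by support keeps a rare class from dominating the average, but it can also hide poor performance on that class, so per-class scores are worth reporting alongside it.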


In [None]:
import evaluate


In [None]:
metric = evaluate.load("accuracy")


In [None]:
# Each `!` command runs in its own subshell, so `source myenv/bin/activate` would not
# carry over to later commands, and creating the virtual environment fails here because
# the python3-venv package is not installed. Install into the kernel's environment instead.
%pip install fsspec==2024.9.0
%pip install evaluate


In [None]:
!pip install fsspec==2024.9.0
!pip install gcsfs==2024.9.0
!pip install datasets==3.1.0
!pip install evaluate


Collecting fsspec==2024.9.0
  Using cached fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Using cached fsspec-2024.9.0-py3-none-any.whl (179 kB)
Installing collected packages: fsspec
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2024.10.0
    Uninstalling fsspec-2024.10.0:
      Successfully uninstalled fsspec-2024.10.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.9.0 which is incompatible.
Successfully installed fsspec-2024.9.0
Collecting gcsfs==2024.9.0
  Downloading gcsfs-2024.9.0-py2.py3-none-any.whl.metadata (1.6 kB)
Collecting fsspec==2024.6.1 (from gcsfs==2024.9.0)
  Using cached fsspec-2024.6.1-py3-none-any.whl.metadata (11 kB)
Reason for being yanked: requirements incorrect
Downloading gcsfs-2024.9.0-py2.py3-none-any.whl (3

In [None]:
!pip install fsspec==2024.9.0  # Install compatible fsspec version for datasets
!pip install gcsfs==2024.9.0   # Install compatible gcsfs version for bigframes
!pip install datasets==3.1.0   # Install datasets package
!pip install evaluate          # Reinstall evaluate if necessary


Collecting fsspec==2024.9.0
  Using cached fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Using cached fsspec-2024.9.0-py3-none-any.whl (179 kB)
Installing collected packages: fsspec
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 1.25.0 requires gcsfs>=2023.3.0, which is not installed.
Successfully installed fsspec-2024.9.0
Collecting gcsfs==2024.9.0
  Using cached gcsfs-2024.9.0-py2.py3-none-any.whl.metadata (1.6 kB)
Collecting fsspec==2024.6.1 (from gcsfs==2024.9.0)
  Using cached fsspec-2024.6.1-py3-none-any.whl.metadata (11 kB)
Reason for being yanked: requirements incorrect
Using cached gcsfs-2024.9.0-py2.py3-none-any.whl (34 kB)
Using cached fsspec-2024.6.1-py3-none-any.whl (177 kB)
Installing collected packages: fsspec, gcsfs
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2024.9.0
    U
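
The pip output above shows why pinning `fsspec==2024.9.0` and `gcsfs==2024.9.0` together keeps failing: installing `gcsfs==2024.9.0` pulls in `fsspec==2024.6.1`, which undoes the first pin. The sketch below makes that check explicit. The pin table is copied from the pip messages above and is not a complete dependency list; the names `PINS` and `find_conflicts` are illustrative.

```python
# Exact pins as reported by pip above: (package, version) -> {dependency: pinned version}
PINS = {
    ("gcsfs", "2024.10.0"): {"fsspec": "2024.10.0"},
    ("gcsfs", "2024.9.0"): {"fsspec": "2024.6.1"},
}


def find_conflicts(installed: dict) -> list:
    """List the pinned dependencies that a {package: version} mapping violates."""
    conflicts = []
    for name, version in installed.items():
        for dep, wanted in PINS.get((name, version), {}).items():
            found = installed.get(dep)
            if found != wanted:
                conflicts.append(
                    f"{name} {version} requires {dep}=={wanted}, found {found}"
                )
    return conflicts


# The combination requested in the cell above cannot be satisfied as written.
print(find_conflicts({"fsspec": "2024.9.0", "gcsfs": "2024.9.0"}))
```

Under these pins, a consistent pair has to come from the same row of the table, for example `gcsfs` 2024.10.0 with `fsspec` 2024.10.0.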

In [None]:
from transformers import Trainer, TrainingArguments, BertForSequenceClassification, BertTokenizer
from datasets import load_dataset
import evaluate
import torch


In [None]:
import numpy as np

metric = evaluate.load("accuracy")

def compute_metrics(p):
    # Trainer passes predictions and labels as NumPy arrays, not torch tensors
    logits, labels = p
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)


NameError: name 'evaluate' is not defined
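
`compute_metrics` receives the model's logits and the reference labels, picks the highest-scoring class for each example, and reports the share of matches. A dependency-free sketch of that computation, for illustration only (the function names are hypothetical, and this is not the `evaluate` implementation):

```python
def argmax(row):
    """Index of the largest value in a sequence (the first one on ties)."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    # Same result shape as the accuracy metric: a dict with one float.
    return {"accuracy": correct / len(labels)}
```

Checking a metric function on a handful of hand-written logits like this is a quick way to catch shape or type problems before a full evaluation run.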

In [None]:
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    per_device_eval_batch_size=8,
    logging_dir='./logs',
)


NameError: name 'TrainingArguments' is not defined

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    eval_dataset=test_dataset
)


NameError: name 'Trainer' is not defined

In [None]:
eval_results = trainer.evaluate()
print(f"Evaluation results: {eval_results}")


NameError: name 'trainer' is not defined

In [None]:
from transformers import Trainer, TrainingArguments, BertForSequenceClassification, BertTokenizer
from datasets import load_dataset
import evaluate
import numpy as np

# Load the model and tokenizer.
# Note: "bert-base-uncased" has no fine-tuned classification head, so the head
# created here is randomly initialized; point this at a fine-tuned checkpoint
# to obtain meaningful accuracy.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Load the dataset
test_dataset = load_dataset("glue", "mrpc", split="test")

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples['sentence1'], examples['sentence2'], padding="max_length", truncation=True)

test_dataset = test_dataset.map(tokenize_function, batched=True)

# Load the accuracy metric
metric = evaluate.load("accuracy")

# Define the evaluation function
def compute_metrics(p):
    # Trainer passes predictions and labels as NumPy arrays, not torch tensors
    logits, labels = p
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Set the training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy="epoch",
    per_device_eval_batch_size=8,
    logging_dir='./logs',
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    eval_dataset=test_dataset
)

# Evaluate the model
eval_results = trainer.evaluate()

# Print the evaluation results
print(f"Evaluation results: {eval_results}")


In [None]:
!pip install transformers datasets evaluate torch
!pip install fsspec==2024.9.0 gcsfs==2024.9.0


Collecting datasets
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess<0.70.17 (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2024.9.0,>=2023.1.0 (from fsspec[http]<=2024.9.0,>=2023.1.0->datasets)
  Downloading fsspec-2024.9.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.1.0-py3-none-any.whl (480 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)

# Cross-Domain Evaluation for Multi-Task Learning in NLP

This notebook evaluates cross-domain generalization and robustness of models in multi-task NLP. We focus on domains A, B, and C, analyzing model behav


## Setup and Dependencies

Below we install and import necessary libraries and authenticate access if needed.
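
A minimal sketch of a setup check, assuming the notebook needs the packages imported in the evaluation cells (`transformers`, `datasets`, `evaluate`, `torch`). It reports which packages are importable and their installed versions instead of failing on the first missing import; the helper name `report_packages` is illustrative.

```python
from importlib import metadata
from importlib.util import find_spec

# Assumed package list, taken from the imports used in this notebook.
REQUIRED = ["transformers", "datasets", "evaluate", "torch"]


def report_packages(names):
    """Map each package name to its installed version, or None if it is missing."""
    report = {}
    for name in names:
        if find_spec(name) is None:
            report[name] = None
            continue
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Importable, but no distribution metadata under this name.
            report[name] = "unknown"
    return report


for package, version in report_packages(REQUIRED).items():
    status = version if version is not None else "MISSING (install with pip)"
    print(f"{package}: {status}")
```

Running the check at the top of the notebook, after any `pip install` cell, helps avoid the `NameError` chain that appears when later cells run without their imports.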



## Model Loading

Load the fine-tuned models for each domain. Ensure paths to locally saved models or configurations are set correctly.
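
One way to check the paths before loading is sketched below. The domain names, the directory layout under `research-paper/results/models`, and the helper `missing_model_dirs` are assumptions for illustration. The check relies on `config.json`, which `save_pretrained` writes into a saved model directory.

```python
from pathlib import Path

# Hypothetical layout: one directory per domain, each written by save_pretrained().
MODEL_ROOT = Path("research-paper/results/models")
MODEL_DIRS = {
    "domain_a": MODEL_ROOT / "domain_a",
    "domain_b": MODEL_ROOT / "domain_b",
    "domain_c": MODEL_ROOT / "domain_c",
}


def missing_model_dirs(model_dirs):
    """Return the domains whose directory lacks a config.json file."""
    return [
        domain
        for domain, path in model_dirs.items()
        if not (Path(path) / "config.json").is_file()
    ]


missing = missing_model_dirs(MODEL_DIRS)
if missing:
    print(f"No saved model found for: {', '.join(missing)}")
else:
    print("All model directories look complete.")
```

Once the check passes, each directory can be passed to `BertForSequenceClassification.from_pretrained(path)`, the class used in the evaluation cell.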



## Conclusion

Summarize the findings, including the model's robustness and generalization across domains, limitations observed, and potential areas for improvement in handling domain shift in NLP tasks.
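
One possible way to quantify generalization for the summary is to compare in-domain accuracy with cross-domain accuracy. The sketch below assumes results are stored as a mapping from `(train_domain, eval_domain)` to accuracy. That layout, the function name, and the definition of the gap as mean in-domain minus mean cross-domain accuracy are assumptions, not something this notebook defines.

```python
from statistics import mean


def generalization_gap(scores):
    """Summarize a {(train_domain, eval_domain): accuracy} mapping."""
    in_domain = [acc for (train, test), acc in scores.items() if train == test]
    cross_domain = [acc for (train, test), acc in scores.items() if train != test]
    if not in_domain or not cross_domain:
        raise ValueError("need at least one in-domain and one cross-domain score")
    return {
        "in_domain": mean(in_domain),
        "cross_domain": mean(cross_domain),
        # A larger gap means accuracy drops more under domain shift.
        "gap": mean(in_domain) - mean(cross_domain),
    }
```

The function takes the accuracies produced by the evaluation runs as input; a gap near zero would suggest the model transfers well between the evaluated domains.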
