# **Final Project**


# Interactive UI for Human vs. GPT Text

Advanced Machine Learning Project
*   Priyanshi Gupta - 200070061
*   Srushti Bangde - 200070081


**Problem statement**

Our goal is to build an interactive user interface (UI) that classifies text as either
human-authored or generated by a GPT model (such as ChatGPT). We also aim to add bias
detection mechanisms that identify and highlight potential biases in GPT-generated text.
We use the Gradio library to build an accessible interface where users can enter text and
receive real-time feedback on its source (human or GPT) and on any detected biases.

The command `!pip install transformers torch` is used to install the "transformers" library and the "torch" library using the Python package manager called `pip`.

1. **transformers:**
   - The "transformers" library, developed by Hugging Face, is a widely-used library for working with state-of-the-art pre-trained models in the field of Natural Language Processing (NLP).
   - It provides easy access to a wide range of pre-trained models, including popular ones like BERT, GPT-2, DistilBERT, and more.
   - These models are pre-trained on large text corpora and can be fine-tuned for specific downstream tasks like sentiment analysis, text classification, text generation, and more.
   - The library also includes tools for tokenization, model loading, fine-tuning, and inference using these pre-trained models.

2. **torch:**
   - "torch" refers to the PyTorch library, which is an open-source machine learning framework.
   - PyTorch provides tools and libraries for building and training neural networks.
   - It offers GPU acceleration, automatic differentiation, and dynamic computation graphs, making it popular for deep learning tasks.
   - Many popular deep learning libraries, including the "transformers" library, are built on top of PyTorch.
   
Running `!pip install transformers torch` installs these libraries in our Python environment so that we can use them in our code.
In this project, we use the "transformers" library to work with the DistilBERT model for classifying text as human-written or GPT-generated, and the "torch" library to handle tensor computations and GPU acceleration.

In [8]:
!pip install transformers torch



In [7]:
import transformers
import torch


**Importing Required Libraries:**

In this section of the code, we import the libraries needed to set up and train a text classification model based on the DistilBERT architecture. From the `transformers` library we import two components: `DistilBertTokenizer` for text tokenization and `DistilBertForSequenceClassification` for sequence classification, which here means deciding whether a text is human-written or GPT-generated.

**Initializing Training Components:**

Next, we import the `Trainer` and `TrainingArguments` classes from the Hugging Face "transformers" library. These classes provide a streamlined way to manage training settings, data, and evaluation. The training cell later in this notebook uses a manual PyTorch training loop rather than `Trainer`.

**Data Splitting:**

To ensure effective model training and evaluation, we're utilizing the `train_test_split` function from the `sklearn.model_selection` module. This function allows us to divide our dataset into separate training and testing subsets, which is crucial for validating our model's performance accurately.


In [6]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
from transformers import Trainer, TrainingArguments
from sklearn.model_selection import train_test_split

- The command `!pip install datasets` installs the "datasets" library from Hugging Face.
- The library provides access to a wide range of datasets for training and evaluating machine learning models, particularly for natural language processing tasks.
- It handles downloading, loading, and preprocessing, so less custom data-handling code is needed before training and evaluation.

In [1]:
!pip install datasets

Collecting datasets
  Downloading datasets-2.15.0-py3-none-any.whl (521 kB)
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: pyarrow-hotfix, dill, multiprocess, datasets
Successfully installed datasets-2.15.0 dill-0.3.7 multiprocess-0.70.15 pyarrow-hotfix-0.6


The code cell below loads the dataset and splits it into training and testing subsets using the Hugging Face "datasets" library.

- `from datasets import load_dataset`: imports the `load_dataset` function, which downloads and loads datasets from the Hugging Face Hub.

- `load_dataset` with `"Hello-SimpleAI/HC3"`: loads HC3 (the Human ChatGPT Comparison Corpus), which pairs questions with answers written by humans and answers generated by ChatGPT. This makes it suitable for classifying the source of a text. The prediction code later in this notebook treats class 1 as GPT-generated and class 0 as human-written.

- `train_test_split(..., test_size=0.2, random_state=42)`: holds out 20% of the examples for testing. The fixed random seed of 42 makes the split reproducible and keeps every text paired with its own label.

The cell leaves four lists, `train_texts`, `test_texts`, `train_labels`, and `test_labels`, ready for tokenization and model training.
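
To make the point about keeping texts and labels aligned concrete, here is a minimal pure-Python sketch of a seeded split. It is an illustration only, not scikit-learn's implementation; the helper name `split_pairs` and the toy data are assumptions made for this example.

```python
import random


def split_pairs(texts, labels, test_size=0.2, seed=42):
    """Shuffle (text, label) pairs together, then cut off a test portion."""
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    pairs = list(zip(texts, labels))
    # A seeded generator makes the shuffle, and so the split, reproducible
    random.Random(seed).shuffle(pairs)
    n_test = max(1, round(len(pairs) * test_size))
    test, train = pairs[:n_test], pairs[n_test:]
    train_texts, train_labels = zip(*train)
    test_texts, test_labels = zip(*test)
    return list(train_texts), list(test_texts), list(train_labels), list(test_labels)


# Toy data (hypothetical): label 0 = human-written, label 1 = GPT-generated
texts = [f"human answer {i}" for i in range(5)] + [f"gpt answer {i}" for i in range(5)]
labels = [0] * 5 + [1] * 5

train_texts, test_texts, train_labels, test_labels = split_pairs(texts, labels)
print(len(train_texts), len(test_texts))
```

Because the pairs are shuffled as a unit, a text can never end up with another text's label, and calling the function again with the same seed returns the same split.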

In [9]:
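# Load and preprocess the dataset
from datasets import load_dataset
# Assumption: the "all" configuration of HC3, whose rows store lists of answers
dataset = load_dataset("Hello-SimpleAI/HC3", "all")

# Flatten the answers into (text, label) pairs: 0 = human, 1 = GPT
texts, labels = [], []
for row in dataset["train"]:
    texts += row["human_answers"] + row["chatgpt_answers"]
    labels += [0] * len(row["human_answers"]) + [1] * len(row["chatgpt_answers"])

# Split texts and labels in one call so each text keeps its own label
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42)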

In this code block, we use the `DistilBertTokenizer` from the Hugging Face Transformers library to tokenize and encode the text data, a preprocessing step required before the text can be fed into the model.

- `tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")`: This line initializes the pre-trained DistilBERT tokenizer provided by Hugging Face. The "distilbert-base-uncased" variant lowercases its input, so it does not distinguish between "Text" and "text".

- `train_encodings = tokenizer(list(train_texts), truncation=True, padding=True)`: Here, we apply the tokenizer to the training texts. `truncation=True` truncates any text that exceeds the model's maximum input length (512 tokens for DistilBERT), and `padding=True` adds padding tokens so that all sequences have the same length.

- `test_encodings = tokenizer(list(test_texts), truncation=True, padding=True)`: The same tokenizer is applied to the test texts.

- `train_dataset = torch.utils.data.TensorDataset(torch.tensor(train_encodings['input_ids']), torch.tensor(train_encodings['attention_mask']), torch.tensor(train_labels))`: This line creates a `TensorDataset` that holds the input IDs, attention mask, and labels for the training data. The input IDs are the tokenized and encoded training texts, the attention mask marks which tokens are actual content and which are padding, and the labels record the source of each text (human or GPT).

- `test_dataset = torch.utils.data.TensorDataset(torch.tensor(test_encodings['input_ids']), torch.tensor(test_encodings['attention_mask']), torch.tensor(test_labels))`: Similarly, a `TensorDataset` is created for the test data, containing the input IDs, attention mask, and source labels for the test set.

This code block tokenizes and encodes the texts, generates attention masks, and organizes the data into `TensorDataset` objects for model training and evaluation.
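
The effect of `truncation=True` and `padding=True` can be sketched in plain Python. This is a simplified illustration, not the tokenizer's actual implementation: the helper name `pad_and_mask`, the made-up token IDs, and the padding ID of 0 are assumptions for the example, and special tokens are ignored.

```python
def pad_and_mask(sequences, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = width - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token IDs for three texts of different lengths
batch = pad_and_mask([[11, 12, 13], [21, 22], [31, 32, 33, 34, 35, 36]], max_length=4)
print(batch["input_ids"])
print(batch["attention_mask"])
```

After this step every row has the same length, which is what allows the lists to be converted into rectangular tensors with `torch.tensor`.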

In [6]:
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
train_encodings = tokenizer(list(train_texts), truncation=True, padding=True)
test_encodings = tokenizer(list(test_texts), truncation=True, padding=True)

train_dataset = torch.utils.data.TensorDataset(torch.tensor(train_encodings['input_ids']),
                                              torch.tensor(train_encodings['attention_mask']),
                                              torch.tensor(train_labels))
test_dataset = torch.utils.data.TensorDataset(torch.tensor(test_encodings['input_ids']),
                                             torch.tensor(test_encodings['attention_mask']),
                                             torch.tensor(test_labels))


This code block trains and evaluates a DistilBERT classifier that labels text as human-written or GPT-generated. Here's a step-by-step explanation of each part:

1. **Importing Libraries:**
   - `torch` and `torch.optim as optim`: These libraries provide tools for creating and optimizing PyTorch models.
   - `torch.utils.data.DataLoader`: A utility for loading data in batches for training.
   - `transformers.DistilBertForSequenceClassification` and `transformers.DistilBertTokenizer`: These components from the Hugging Face Transformers library provide the DistilBERT model architecture and tokenizer specifically designed for sequence classification tasks.
   - `sklearn.model_selection.train_test_split`: This function is used to split the dataset into training and testing sets.

2. **Data Preparation:**
   - `batch_size = 8`: This specifies the batch size used during training and evaluation. You can adjust it based on your available resources.
   - `train_dataloader` and `test_dataloader`: These DataLoader instances manage the training and testing data batches. `shuffle=True` shuffles the data in each epoch, ensuring varied training examples.

3. **Model Initialization and Configuration:**
   - `model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)`: This initializes the DistilBERT model for sequence classification with the base architecture and two output labels (human-written and GPT-generated).

4. **Optimizer and Loss Function:**
   - `optimizer = optim.AdamW(model.parameters(), lr=2e-5)`: This sets up the AdamW optimizer to update the model parameters during training.
   - `loss_fn = torch.nn.CrossEntropyLoss()`: This defines the loss function, which is the cross-entropy loss used for classification tasks.

5. **Training Loop:**
   - `num_epochs = 2`: This specifies the number of training epochs. You can adjust it based on your time constraints.
   - `device = torch.device("cuda" if torch.cuda.is_available() else "cpu")`: This determines whether the GPU (if available) or CPU will be used for computation.
   - The training loop runs for the specified number of epochs. It iterates through the training data, computes gradients, and updates the model's weights using backpropagation. The average loss per epoch is printed.

6. **Model Evaluation:**
   - The trained model is switched to evaluation mode (`model.eval()`) and evaluated on the test dataset.
   - The accuracy is calculated by comparing model predictions to actual labels, and the result is printed as "Test Accuracy."

7. **Execution Time:**
   - The code measures the time taken for training using the `time` module. The training time is then printed.

In summary, the cell handles data loading, model configuration, optimization, the training loop, and evaluation, and it reports the training time and test accuracy.
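
The two quantities the cell reports, average loss and test accuracy, can be illustrated with a small pure-Python sketch. It mirrors what cross-entropy loss (averaged over a batch) and argmax-based accuracy compute, but it is a simplified stand-in rather than PyTorch's implementation; the function names and the example logits are assumptions.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_logits, labels):
    """Mean negative log-probability assigned to the correct class."""
    losses = [-math.log(softmax(logits)[label])
              for logits, label in zip(batch_logits, labels)]
    return sum(losses) / len(losses)


def accuracy(batch_logits, labels):
    """Fraction of examples whose highest-scoring class equals the label."""
    predictions = [max(range(len(logits)), key=logits.__getitem__)
                   for logits in batch_logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical logits for three texts; columns are (human, GPT)
batch_logits = [[2.0, -1.0], [0.5, 1.5], [0.3, 0.1]]
labels = [0, 1, 1]
print(f"loss={cross_entropy(batch_logits, labels):.4f}")
print(f"accuracy={accuracy(batch_logits, labels):.2%}")
```

The loss falls as the model assigns more probability to the correct class, while accuracy only changes when the highest-scoring class changes, which is why the two numbers are reported separately.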

In [7]:
import torch
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer
from sklearn.model_selection import train_test_split
import time

# Create DataLoader instances
batch_size = 8  # You can adjust this based on your resources
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size)

# Load DistilBERT model for sequence classification
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

# Define optimizer and loss function
optimizer = optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = torch.nn.CrossEntropyLoss()

# Training loop
num_epochs = 2  # You can adjust this based on your time constraints
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

start_time = time.time()
for epoch in range(num_epochs):
    model.train()
    total_loss = 0

    for batch in train_dataloader:
        optimizer.zero_grad()

        input_ids = batch[0].to(device)
        attention_mask = batch[1].to(device)
        labels = batch[2].to(device)

        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits

        loss = loss_fn(logits, labels)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    avg_loss = total_loss / len(train_dataloader)
    print(f"Epoch {epoch+1}/{num_epochs} - Avg Loss: {avg_loss:.4f}")

end_time = time.time()
training_time = end_time - start_time
print(f"Training took {training_time:.2f} seconds")

# Evaluate the model on the test set
model.eval()
correct = 0
total = 0

with torch.no_grad():
    for batch in test_dataloader:
        input_ids = batch[0].to(device)
        attention_mask = batch[1].to(device)
        labels = batch[2].to(device)

        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=1)

        correct += (predictions == labels).sum().item()
        total += labels.size(0)

accuracy = correct / total
print(f"Test Accuracy: {accuracy:.2%}")


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Epoch 1/2 - Avg Loss: 0.2666
Epoch 2/2 - Avg Loss: 0.1353
Training took 1996.28 seconds
Test Accuracy: 92.48%


In [10]:
# Save the trained model
model.save_pretrained("pred_model")

`!pip install gradio` installs the Gradio library using the pip package manager. Gradio is an open-source library that makes it easy to create user-friendly web interfaces for interacting with machine learning models.

In [13]:
!pip install gradio



This version of the code adds several touches to our Gradio interface: a placeholder in the input textbox, a label for the output textbox, and example inputs to help users get started. It also gives the interface a title and a description that explain what the app does.
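
The Gradio cell itself does not appear in this export. A minimal sketch of an interface with the features described above might look like the following. The helper names, the title, the description, the placeholder text, and the example input are all assumptions for illustration; only the two output messages come from the prediction code in this notebook.

```python
def format_prediction(predicted_class: int) -> str:
    """Map the classifier's class index to the message shown to the user."""
    if predicted_class == 1:
        return "Most likely to be written by GPT"
    return "Most likely to be written by a Human"


def build_interface(predict_fn):
    """Wrap predict_fn (text -> message) in a Gradio interface."""
    # Imported here so format_prediction can be used without Gradio installed
    import gradio as gr

    return gr.Interface(
        fn=predict_fn,
        # Placeholder text hints at what to type (hypothetical wording)
        inputs=gr.Textbox(lines=5, placeholder="Paste a paragraph of text here..."),
        outputs=gr.Textbox(label="Prediction"),
        title="Human or GPT?",
        description="Enter some text to see whether the model thinks a human or GPT wrote it.",
        examples=[["I went to the market yesterday and forgot my wallet."]],
    )
```

With a prediction function that tokenizes the text, runs the fine-tuned model, and passes the predicted class to `format_prediction`, calling `build_interface(predict_fn).launch()` would start the app.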

In [15]:
!pip install -U fastapi
!pip install -U typing-extensions




In [17]:
!pip install typing-extensions==3.10.0.0


Collecting typing-extensions==3.10.0.0
  Downloading typing_extensions-3.10.0.0-py3-none-any.whl (26 kB)
Installing collected packages: typing-extensions
  Attempting uninstall: typing-extensions
    Found existing installation: typing_extensions 4.8.0
    Uninstalling typing_extensions-4.8.0:
      Successfully uninstalled typing_extensions-4.8.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
lida 0.0.10 requires kaleido, which is not installed.
sqlalchemy 2.0.23 requires typing-extensions>=4.2.0, but you have typing-extensions 3.10.0.0 which is incompatible.
arviz 0.15.1 requires typing-extensions>=4.1.0, but you have typing-extensions 3.10.0.0 which is incompatible.
chex 0.1.7 requires typing-extensions>=4.2.0; python_version < "3.11", but you have typing-extensions 3.10.0.0 which is incompatible.
fastapi 0.104.1 requires typing-extensions>=4.8.0, but 

In [19]:
!pip install flask



In [22]:
from flask import Flask, render_template, request
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

app = Flask(__name__)

# Load the saved model (written earlier by model.save_pretrained("pred_model"))
model = DistilBertForSequenceClassification.from_pretrained("pred_model")

# Load the DistilBERT tokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# Define the function to make predictions
def predict_sentiment(text):
    inputs = tokenizer(text, truncation=True, padding=True, return_tensors="pt")
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class = logits.argmax().item()
    sentiment = "Most likely to be written by GPT" if predicted_class == 1 else "Most likely to be written by a Human"
    return sentiment

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    text = request.form['text']
    result = predict_sentiment(text)
    return render_template('index.html', result=result, text=text)

if __name__ == '__main__':
    app.run(debug=True)


 * Serving Flask app '__main__'
 * Debug mode: on


 * Running on http://127.0.0.1:5000
INFO:werkzeug:Press CTRL+C to quit
INFO:werkzeug: * Restarting with stat


In [26]:
!pip install flask

from flask import Flask, render_template, request
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

app = Flask(__name__)

# Load the saved model (written earlier by model.save_pretrained("pred_model"))
model = DistilBertForSequenceClassification.from_pretrained("pred_model")

# Load the DistilBERT tokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# Define the function to make predictions
def predict_sentiment(text):
    inputs = tokenizer(text, truncation=True, padding=True, return_tensors="pt")
    outputs = model(**inputs)
    logits = outputs.logits
    predicted_class = logits.argmax().item()
    sentiment = "Most likely to be written by GPT" if predicted_class == 1 else "Most likely to be written by a Human"
    return sentiment

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    text = request.form['text']
    result = predict_sentiment(text)
    return render_template('index.html', result=result, text=text)

# Run the application
app.run()

 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on http://127.0.0.1:5000
INFO:werkzeug:Press CTRL+C to quit


In conclusion, this project demonstrates how to fine-tune a DistilBERT model to classify whether a text was written by a human or generated by GPT, and how to serve the trained model through a simple web interface.