# Clark University SPS Chatbot (Still In Progress)

- __Project Objective:__ Develop an AI-powered chatbot for Clark University's School of Professional Studies (SPS) that leverages advanced NLP technologies to enhance student interaction and information retrieval.
- __Technology Integration:__ Integrate the chatbot with Dialogflow for natural language understanding and FastAPI for creating a responsive and scalable backend service.
- __AI Model Usage:__ Use GPT-2, a pretrained causal language model, fine-tuned on the FAQ data to generate human-like text responses.
- __Training Data:__ Fine-tune the GPT-2 model on a dataset of 10,000 FAQ samples collected from the Clark University website. The system is designed so that additional samples can be added later to improve response accuracy and relevance.
- __Functionality and User Interaction:__ Design the chatbot to answer queries directly related to university processes, academic programs, campus facilities, and more, thereby facilitating an efficient and interactive user experience.
- __Scalability and Improvement:__ The chatbot's architecture allows for the ongoing addition of new FAQ samples and further training, ensuring that the system evolves with changing student needs and maintains high accuracy in responses.
- __Impact and Benefit:__ Aim to significantly improve the accessibility of information for students and reduce the workload on administrative staff by automating responses to frequently asked questions, with the potential for even greater accuracy as more data is integrated.
- __Packages Used:__ transformers, torch, accelerate, fastapi, uvicorn, pyngrok

## Installing Packages

In [None]:
!pip install --upgrade pip
!pip install --upgrade "transformers[torch]" accelerate datasets scikit-learn


## FastAPI for Dialogflow (Integration)

In [None]:
# Step 1: Installing required packages
!pip install fastapi uvicorn pyngrok

# Step 2: Creating FastAPI app
from fastapi import FastAPI

app = FastAPI()

@app.get("/")
def read_root():
    return {"Hello": "World"}

# Step 3: Running FastAPI app with Uvicorn in a background thread
import uvicorn
from threading import Thread

def run_server():
    uvicorn.run(app, host="0.0.0.0", port=8000)

thread = Thread(target=run_server)
thread.start()

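# Step 4: Authenticating ngrok and tunneling the localhost server
import os
from pyngrok import ngrok

# Read the ngrok auth token (from the ngrok dashboard) from an environment variable
# instead of hard-coding it; the variable name NGROK_AUTH_TOKEN is an assumption.
# Never commit a real token to a shared notebook.
NGROK_AUTH_TOKEN = os.environ["NGROK_AUTH_TOKEN"]

# Setting the ngrok auth token
ngrok.set_auth_token(NGROK_AUTH_TOKEN)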

# Terminating open tunnels if exist
ngrok.kill()

# Setting up a tunnel to the localhost server
public_url = ngrok.connect(8000)
print(f"Public URL: {public_url}")


INFO:     Started server process [240]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)


Public URL: NgrokTunnel: "https://71b9-35-201-139-174.ngrok-free.app" -> "http://localhost:8000"
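
The root endpoint above only returns a placeholder. For the Dialogflow integration, a fulfillment webhook has to accept a POST request and return JSON. The following is a minimal sketch of that mapping, assuming the Dialogflow ES webhook format, where the user's text arrives in `queryResult.queryText` and the reply is read from `fulfillmentText`. The names `build_fulfillment` and `answer_fn` are hypothetical and not part of the original notebook; the logic is kept as a plain function so it can be tried without a running server.

```python
from typing import Any, Callable, Dict


def build_fulfillment(payload: Dict[str, Any], answer_fn: Callable[[str], str]) -> Dict[str, str]:
    """Map a Dialogflow ES webhook request body to a webhook response body."""
    # Dialogflow ES places the raw user utterance under queryResult.queryText
    query_text = payload.get("queryResult", {}).get("queryText", "").strip()
    if not query_text:
        # Fallback reply when the request carries no usable text
        return {"fulfillmentText": "Sorry, I did not catch that. Could you rephrase?"}
    # Dialogflow ES shows the fulfillmentText field to the user
    return {"fulfillmentText": answer_fn(query_text)}


# Example with a stand-in answer function (the real bot would call the GPT-2 model)
sample_request = {"queryResult": {"queryText": "How do I apply to SPS?"}}
response = build_fulfillment(sample_request, lambda q: f"You asked: {q}")
print(response)
```

In the FastAPI app, a function like this could sit behind a POST route (for example a hypothetical `/webhook` path) that parses the request's JSON body and passes the notebook's generate_answer function as `answer_fn`. The ngrok public URL plus that path would then be entered as the fulfillment URL in the Dialogflow console.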


## Initializing GPT-2 Model and Tokenizer

In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TextDataset, DataCollatorForLanguageModeling, Trainer, TrainingArguments

## Loading and Preparing Dataset


The prepare_data function takes a file path, reads the whole text file into a string, and replaces every occurrence of the question and answer markers ("Q: " and "A: ") with GPT-2's end-of-text token, <|endoftext|>. The markers are not simply removed: each question and each answer ends up preceded by <|endoftext|>, which acts as a separator between entries in the training text.

In [None]:
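def prepare_data(file_path):
    """Read a Q/A text file and replace the "Q: " / "A: " markers with GPT-2's end-of-text token."""
    # Explicit encoding (assumed UTF-8) avoids platform-dependent decoding of the FAQ text
    with open(file_path, "r", encoding="utf-8") as f:
        text = f.read()
    # Each question and each answer is now preceded by <|endoftext|>
    return text.replace("Q: ", "<|endoftext|>").replace("A: ", "<|endoftext|>")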
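
To make the effect of the marker replacement concrete, here is a small sketch that applies the same two replace calls to an in-memory string. The helper `replace_markers` and the sample text are made up for illustration; they are not taken from the project's data.txt.

```python
def replace_markers(text: str) -> str:
    """Same string replacement as prepare_data, without the file I/O."""
    return text.replace("Q: ", "<|endoftext|>").replace("A: ", "<|endoftext|>")


# Made-up sample in the assumed "Q: ... / A: ..." layout
sample = "Q: Where is Clark University?\nA: Worcester, Massachusetts.\n"
formatted = replace_markers(sample)
print(formatted)

# Caveat: str.replace matches the marker anywhere, not only at the start of a line
print(replace_markers("FAQ: see the website"))
```

The second call shows a side effect worth checking in real data: "FAQ: " contains "Q: ", so it is rewritten as well. If the dataset includes such text, matching the markers only at the start of a line would be safer.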

## Loading Tokenizer and Model
This cell loads the pretrained GPT-2 tokenizer and model from the Hugging Face Hub. GPT2Tokenizer converts text into the token IDs the model expects, and GPT2LMHeadModel is the GPT-2 architecture with a language-modeling head that predicts the next token in a sequence. The "gpt2" checkpoint is the smallest GPT-2 model (about 124 million parameters).

In [None]:
# Loading tokenizer and model
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

## Preparing and Saving Formatted Dataset
This cell calls prepare_data on "data.txt" to load the FAQ text and replace the "Q: " and "A: " markers with <|endoftext|>, then writes the result to "formatted_data.txt". TextDataset reads its input from a file, so the formatted text is saved to disk before the training dataset is built.

In [None]:
# Preparing dataset: format the raw FAQ text and save it for TextDataset
data = prepare_data("data.txt")
with open("formatted_data.txt", "w", encoding="utf-8") as f:
    f.write(data)

## Creating Training Dataset
This cell builds the training dataset from "formatted_data.txt". TextDataset tokenizes the whole file with the GPT-2 tokenizer and splits the resulting token stream into consecutive blocks of block_size tokens (128 here); each block is one training example. Note that TextDataset is deprecated in recent versions of Transformers, which recommend the datasets library for preprocessing instead.

In [None]:
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="formatted_data.txt",
    block_size=128
)
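
The block splitting can be sketched in plain Python. This is an illustration of the idea, not the library's code: it assumes the token stream is cut into consecutive, non-overlapping blocks and that a trailing partial block is dropped. The helper name `split_into_blocks` is hypothetical.

```python
from typing import List


def split_into_blocks(token_ids: List[int], block_size: int) -> List[List[int]]:
    """Cut a token stream into consecutive full blocks, dropping any remainder."""
    return [
        token_ids[i:i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]


token_ids = list(range(300))  # stand-in for the tokenizer's output on the whole file
blocks = split_into_blocks(token_ids, block_size=128)

# 300 tokens give two full blocks of 128; the last 44 tokens are dropped
print(len(blocks), [len(b) for b in blocks])
```

Because blocks are cut by position rather than by FAQ entry, a single block can contain the end of one question-answer pair and the start of the next, with <|endoftext|> marking the boundary.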



## Configuring Data Collator for Language Modeling
DataCollatorForLanguageModeling assembles batches of tokenized examples for training. Setting mlm to False selects causal (left-to-right) language modeling, the objective GPT-2 is trained with, instead of masked language modeling: the collator copies each batch's input IDs into the labels, and the model learns to predict each token from the tokens before it. Because TextDataset already yields blocks of equal length, the batches need no padding here.

In [None]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False
)
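
What the collator produces with mlm=False can be sketched with plain lists. This is a simplified illustration under stated assumptions, not the library implementation: labels are a copy of the input IDs, and positions holding a padding token (if one is defined) are set to -100 so the loss ignores them. The function name `collate_causal` and the token IDs are hypothetical.

```python
from typing import Dict, List, Optional

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def collate_causal(batch: List[List[int]], pad_token_id: Optional[int] = None) -> Dict[str, List[List[int]]]:
    """Build causal-LM inputs and labels: labels mirror the inputs, padding is ignored."""
    labels = [
        [IGNORE_INDEX if pad_token_id is not None and tok == pad_token_id else tok for tok in seq]
        for seq in batch
    ]
    return {"input_ids": [list(seq) for seq in batch], "labels": labels}


# Two toy sequences; 0 plays the role of a padding token in the second one
batch = collate_causal([[10, 11, 12], [20, 21, 0]], pad_token_id=0)
print(batch["input_ids"])
print(batch["labels"])
```

The labels are not shifted here. The shift by one position, so that each token is predicted from the tokens before it, happens inside the model when it computes the loss.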

## Define Training Arguments
TrainingArguments sets the fine-tuning configuration: checkpoints and outputs go to "./gpt2-finetuned" (overwriting any existing contents), training runs for 7 epochs with a batch size of 4 per device, and a checkpoint is saved every 10,000 steps, keeping only the two most recent. All other settings, such as the learning rate, keep their defaults.

In [None]:
# Defining training arguments
training_args = TrainingArguments(
    output_dir="./gpt2-finetuned",
    overwrite_output_dir=True,
    num_train_epochs=7,
    per_device_train_batch_size=4,
    save_steps=10_000,
    save_total_limit=2,
)

## Initialize Trainer Instance
The Trainer ties the pieces together: the loaded GPT-2 model, the training arguments (epochs, batch size, checkpointing), the data collator that builds each batch, and the training dataset. It then handles the training loop, including batching, optimization, and checkpointing.

In [None]:
# Creating Trainer instance
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)

## Model Training

In [None]:
# Training the model
trainer.train()


TrainOutput(global_step=336, training_loss=1.6380869547526042, metrics={'train_runtime': 3352.0458, 'train_samples_per_second': 0.399, 'train_steps_per_second': 0.1, 'total_flos': 87336861696000.0, 'train_loss': 1.6380869547526042, 'epoch': 7.0})
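
The numbers in this TrainOutput can be cross-checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept (the defaults, as far as the configuration above shows). All input values are copied from the output and the training arguments; the variable names are illustrative.

```python
import math

train_runtime = 3352.0458    # seconds, from TrainOutput
samples_per_second = 0.399   # from TrainOutput (rounded)
epochs = 7                   # num_train_epochs
batch_size = 4               # per_device_train_batch_size
block_size = 128             # TextDataset block_size
save_steps = 10_000          # checkpoint interval

# Total examples processed, then examples (blocks) per epoch
samples_seen = train_runtime * samples_per_second
blocks_per_epoch = round(samples_seen / epochs)

# One optimizer step per batch; the last batch of an epoch may be smaller
steps_per_epoch = math.ceil(blocks_per_epoch / batch_size)
total_steps = steps_per_epoch * epochs

print("blocks per epoch:", blocks_per_epoch)
print("total steps:", total_steps)
print("approx. training tokens:", blocks_per_epoch * block_size)
print("intermediate checkpoints:", total_steps // save_steps)
```

Under these assumptions the estimate reproduces global_step=336. It also shows that the run never reaches the 10,000-step checkpoint interval, so no intermediate checkpoint is written during training; a smaller save_steps value would be needed to get one.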

## Saving the Fine-Tuned Model

In [None]:
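# Saving the fine-tuned model and its tokenizer so both can be reloaded from the same directory
model.save_pretrained("./gpt2-finetuned")
tokenizer.save_pretrained("./gpt2-finetuned")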

## Testing the Fine-Tuned GPT-2 Model
The generate_answer function uses the fine-tuned model to answer a question. It wraps the question in a "Q: ... A:" prompt, tokenizes it, and calls the model's generate method with sampling enabled; the temperature (0.8) controls how random the sampling is, and max_length (90) caps the total length, prompt tokens included. Generation runs inside torch.no_grad(), which disables gradient tracking since no parameters are updated at inference time. The output tokens are decoded back into text, which contains the prompt as well as the generated continuation. Note that prepare_data replaced the "Q: " and "A: " markers with <|endoftext|> in the training text, so this prompt format differs from what the model saw during fine-tuning. The cell ends with an example question about Clark University.
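
The role of the temperature can be illustrated without the model. In temperature sampling, the model's scores (logits) for the next token are divided by the temperature before the softmax, so values below 1.0 concentrate probability on the most likely tokens. The sketch below uses made-up logits and a hypothetical helper name, `softmax_with_temperature`.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for temperature in (1.0, 0.8, 0.2):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

Lower temperatures push the distribution toward the top-scoring token, which makes answers more repeatable but less varied; the value 0.8 used in this notebook is a mild sharpening.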

In [None]:
import torch

def generate_answer(question):
    # Build a Q/A-style prompt for the model to continue
    input_text = "Q: " + question + " A:"
    input_ids = tokenizer.encode(input_text, return_tensors='pt').to(model.device)

    # Put the model in evaluation mode so dropout is disabled during generation
    model.eval()

    # Generating answer using sampling; gradients are not needed for inference
    with torch.no_grad():
        output = model.generate(input_ids,
                                max_length=90,  # Cap on total tokens, prompt included
                                num_return_sequences=1,
                                temperature=0.8,  # Below 1.0 makes sampling less random
                                do_sample=True)

    # Decoding and returning the generated text (prompt plus continuation)
    generated_answer = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_answer.strip()

# Example usage
question = "How is Clark University?"
answer = generate_answer(question)
print("Answer:", answer)

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Answer: Q: How is Clark University? A: It is a non-profit institution with a global reach, staffed by international students and a rich alumni network.
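
As the output shows, the decoded text repeats the prompt before the generated answer. A chatbot reply would normally contain only the answer part. One simple way to get it, sketched below, is to keep the text after the first "A:" marker; the helper `extract_answer` is hypothetical and assumes the question itself does not contain "A:".

```python
def extract_answer(generated_text: str, marker: str = "A:") -> str:
    """Return the text after the first answer marker, or the whole text if none is found."""
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else generated_text.strip()


# Decoded text copied from the sample output above
decoded = (
    "Q: How is Clark University? A: It is a non-profit institution with a global reach, "
    "staffed by international students and a rich alumni network."
)
print(extract_answer(decoded))
```

An alternative that does not depend on the marker is to decode only the newly generated tokens, that is, the part of the output sequence that comes after the prompt's token IDs.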


# Fine-tuning, improvement, and deployment are still in progress...