
# Training the model

In [9]:
!pip install -U PyPDF2
!pip install python-docx
!pip install -U accelerate
!pip install -U transformers



In [10]:
import os
import re
from PyPDF2 import PdfReader
import docx
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel, TextDataset, DataCollatorForLanguageModeling
from transformers import Trainer, TrainingArguments

In [11]:
# Functions to read different file types
def read_pdf(file_path):
    with open(file_path, "rb") as file:
        pdf_reader = PdfReader(file)
        text = ""
        for page in pdf_reader.pages:
            # extract_text() can return an empty result for image-only pages
            text += page.extract_text() or ""
    return text

def read_word(file_path):
    doc = docx.Document(file_path)  # Document() takes no encoding argument; .docx is zipped XML
    text = ""
    for paragraph in doc.paragraphs:
        text += paragraph.text + "\n"
    return text

def read_txt(file_path):
    with open(file_path, "r", encoding='windows-1252') as file:
        text = file.read()
    return text

def read_documents_from_directory(directory):
    combined_text = ""
    for filename in os.listdir(directory):
        file_path = os.path.join(directory, filename)
        if filename.endswith(".pdf"):
            combined_text += read_pdf(file_path)
        elif filename.endswith(".docx"):
            combined_text += read_word(file_path)
        elif filename.endswith(".txt"):
            combined_text += read_txt(file_path)
    return combined_text

In [12]:
def train_chatbot(directory, model_output_path, train_fraction=0.8):
    # Read documents from the directory
    combined_text = read_documents_from_directory(directory)
    combined_text = re.sub(r'\n+', '\n', combined_text).strip()  # Remove excess newline characters

    # Split the text into training and validation sets
    split_index = int(train_fraction * len(combined_text))
    train_text = combined_text[:split_index]
    val_text = combined_text[split_index:]

    # Save the training and validation data as text files
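    # Write UTF-8 explicitly so the files do not depend on the platform default encoding
    with open("train.txt", "w", encoding="utf-8") as f:
        f.write(train_text)
    with open("val.txt", "w", encoding="utf-8") as f:
        f.write(val_text)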

    # Set up the tokenizer and model
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")  # other checkpoints to try: gpt2, gpt2-large, gpt2-xl
    model = GPT2LMHeadModel.from_pretrained("gpt2-medium")  # keep in sync with the tokenizer checkpoint above

    # Prepare the dataset
    train_dataset = TextDataset(tokenizer=tokenizer, file_path="train.txt", block_size=128)
    val_dataset = TextDataset(tokenizer=tokenizer, file_path="val.txt", block_size=128)
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    # Set up the training arguments
    training_args = TrainingArguments(
        output_dir=model_output_path,
        overwrite_output_dir=True,
        per_device_train_batch_size=4,
        per_device_eval_batch_size=4,
        num_train_epochs=100,
        save_steps=10_000,
        save_total_limit=2,
        logging_dir='./logs',
    )

    # Train the model
    trainer = Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
    )

    trainer.train()
    trainer.save_model(model_output_path)

    # Save the tokenizer
    tokenizer.save_pretrained(model_output_path)
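
The character-based split in `train_chatbot` can cut the corpus in the middle of a word or sentence. A minimal sketch of one alternative, assuming it is acceptable for the actual train fraction to drift slightly, moves the cut forward to the next newline. The helper name `split_at_line_boundary` is hypothetical and not part of the notebook.

```python
def split_at_line_boundary(text, train_fraction=0.8):
    """Split text into (train, val), moving the cut forward to the next newline."""
    split_index = int(train_fraction * len(text))
    # Look for the first newline at or after the nominal split point.
    newline_index = text.find("\n", split_index)
    if newline_index == -1:
        # No newline left: fall back to the plain character split.
        return text[:split_index], text[split_index:]
    # Keep the newline with the training half so val starts on a fresh line.
    return text[:newline_index + 1], text[newline_index + 1:]


train_text, val_text = split_at_line_boundary(
    "line one\nline two\nline three\nline four\nline five"
)
```

Inside `train_chatbot`, the two slicing lines could be replaced by a single call such as `train_text, val_text = split_at_line_boundary(combined_text, train_fraction)`.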

In [13]:
def generate_response(model, tokenizer, prompt, max_length=100):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # Create the attention mask and pad token id
    attention_mask = torch.ones_like(input_ids)
    pad_token_id = tokenizer.eos_token_id

    output = model.generate(
        input_ids,
        max_length=max_length,
        num_return_sequences=1,
        attention_mask=attention_mask,
        pad_token_id=pad_token_id
    )

    return tokenizer.decode(output[0], skip_special_tokens=True)

In [20]:
def main():
    directory = "/content/nitt-dataset"  # Replace with the path to your directory containing the files
    model_output_path = "/content/nitt-gpt"

    # Train the chatbot
    train_chatbot(directory, model_output_path)

    # Load the fine-tuned model and tokenizer
    model = GPT2LMHeadModel.from_pretrained(model_output_path)
    tokenizer = GPT2Tokenizer.from_pretrained(model_output_path)

    # Test the chatbot
    prompt = "tell me about national institute of technology."  # Replace with your desired prompt
    response = generate_response(model, tokenizer, prompt)
    print("Generated response:", response)

In [21]:
if __name__ == "__main__":
    main()



| Step | Training Loss |
|------|---------------|
| 500  | 0.3379        |
| 1000 | 0.0094        |
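
For a causal language model, the logged training loss is a mean token-level cross-entropy, so exponentiating it gives perplexity. The sketch below converts the two logged values. It is only illustrative arithmetic, and the helper name `perplexity` is an assumption rather than something defined in the notebook.

```python
import math


def perplexity(cross_entropy_loss):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(cross_entropy_loss)


# The two training-loss values logged above.
for step, loss in [(500, 0.3379), (1000, 0.0094)]:
    print(f"step {step}: loss={loss:.4f}, perplexity={perplexity(loss):.3f}")
```

A training perplexity very close to 1 usually suggests the model has nearly memorized the training text, so it may be worth comparing against the loss on `val.txt` before trusting the result.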


Generated response: tell me about national institute of technology.  
The National Institute of Technology Tiruchirappalli (NIT Tiruchirappalli or NIT Trichy) is a national research deemed university near the city of Tiruchirappalli in Tamil Nadu, India. It was founded as Regional Engineering College Tiruchirappalli in 1964 by the governments of India and Tamil Nadu under the affiliation of the University of Madras. The college was granted deemed university status in 2003 with the approval of the University Grants Commission (UGC), the All India Council for Technical Education (AICTE), and the Government of India and renamed the National Institute of Technology Tiruchirappalli. The college was granted deemed university status in 2007 with the approval of the University Grants Commission (UGC), the All India Council for Technical Education (AICTE), and the Government of India and renamed the National Institute of Technology Coimbatore. The college was granted deemed university st
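
The sample above loops on the same sentence, which is a common failure of greedy decoding. A hedged sketch of one alternative follows: it samples with `top_k` and `top_p` and blocks repeated n-grams, using standard `generate` arguments from `transformers`. The function name `generate_response_sampled` and the specific values are assumptions to tune, not the notebook's settings.

```python
def generate_response_sampled(model, tokenizer, prompt, max_new_tokens=100):
    """Generate a reply with sampling and an n-gram repetition block."""
    # Calling the tokenizer returns both input_ids and attention_mask.
    inputs = tokenizer(prompt, return_tensors="pt")

    output = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,        # counts generated tokens only, not the prompt
        do_sample=True,                       # sample instead of greedy decoding
        top_k=50,                             # assumed value, worth tuning
        top_p=0.95,                           # assumed value, worth tuning
        no_repeat_ngram_size=3,               # never emit the same 3-gram twice
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

It takes the same `model` and `tokenizer` objects as `generate_response`, for example `generate_response_sampled(model, tokenizer, "tell me about national institute of technology.")`. Because sampling is random, the output differs between runs.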

# Testing the model

In [22]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

In [23]:
def generate_response(model, tokenizer, prompt, max_length=250):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # Create the attention mask and pad token id
    attention_mask = torch.ones_like(input_ids)
    pad_token_id = tokenizer.eos_token_id

    output = model.generate(
        input_ids,
        max_length=max_length,
        num_return_sequences=1,
        attention_mask=attention_mask,
        pad_token_id=pad_token_id
    )

    return tokenizer.decode(output[0], skip_special_tokens=True)


In [24]:
model_path = "/content/nitt-gpt"
# Load the fine-tuned model and tokenizer
my_chat_model = GPT2LMHeadModel.from_pretrained(model_path)
my_chat_tokenizer = GPT2Tokenizer.from_pretrained(model_path)

In [28]:
# prompt = "Where is the library? And when does the library open?"  # Replace with your desired prompt
# #prompt = "What is the most promising future technology?"
# response = generate_response(my_chat_model, my_chat_tokenizer, prompt, max_length=100)  #
# print("Generated response:", response)

while True:
    # Take user input
    user_input = input("User: ")

    # Check if the user wants to quit
    if user_input.lower() == 'quit':
        print("Chatbot: Goodbye!")
        break

    # Tokenize and encode the user's input
    # input_ids = tokenizer.encode(user_input, return_tensors="pt")

    # Generate a response
    # output = model.generate(input_ids, max_length=50, num_beams=5, no_repeat_ngram_size=2, top_k=50, top_p=0.95)
    response = generate_response(my_chat_model, my_chat_tokenizer, user_input, max_length=150)

    # Print the chatbot's response
    print("Chatbot:", response)

User: national institute of technology
Chatbot: national institute of technology Tiruchirappalli. NIT Trichy is recognized as an Institute of National Importance by the Government of India under the National Institutes of Technology, Science Education and Research (NITSER) Act, 2007 and is one of the members of the National Institutes of Technology (NITs) system, a group of premier centrally funded technical institutes (CFTIs) governed by the Council of NITSER. The institute is funded by the Ministry of Education (MoE), Government of India; and focuses exclusively on engineering, management, and research. The institute is funded by the Ministry of Education (MoE), Government of India
User: how large is the campus
Chatbot: how large is the campus?  
The campus is about 2.2 km2 and is one of the largest academic campuses in India. The main entrance is located on the southern end of the campus, facing National Highway 67. There is one other entrance, popularly called the Staff Gat
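
Both replies begin by echoing the user's message, because `generate` returns the prompt tokens followed by the continuation and the notebook decodes the whole sequence. A minimal sketch of a post-processing step is below, assuming the decoded text starts with the prompt verbatim. `strip_prompt` is a hypothetical helper name.

```python
def strip_prompt(prompt, response):
    """Return the response without the echoed prompt, if it is present."""
    if response.startswith(prompt):
        return response[len(prompt):].lstrip()
    # Decoding may have altered whitespace; leave the text unchanged in that case.
    return response


# Toy strings for illustration only, not real model output.
reply = strip_prompt("hello", "hello world")
print("Chatbot:", reply)
```

In the chat loop, the last line could become `print("Chatbot:", strip_prompt(user_input, response))`.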