<a href="https://colab.research.google.com/github/Janani-Withana/CTSE_Chatbot/blob/main/Chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
# Install dependencies
!pip install transformers langchain langchain-community faiss-cpu sentence-transformers pypdf PyPDF2



In [3]:
# Import modules
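# Loaders, vector stores and embeddings live in langchain-community (installed above);
# the old langchain.* import paths are deprecated.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings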
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
import os

In [4]:
# Load the CTSE lecture notes PDF
pdf_path = "CTSE_Lecture_Notes.pdf"
loader = PyPDFLoader(pdf_path)
documents = loader.load()

In [5]:
# Split text into chunks for embedding
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = text_splitter.split_documents(documents)
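
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding window over characters. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to break on separators such as paragraph breaks, newlines and spaces. The helper name `sliding_window_chunks` and the sample text are hypothetical.

```python
def sliding_window_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Hypothetical sample: 2500 characters of repeating digits
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

Each chunk repeats the last 200 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.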

In [6]:
# Create embeddings and store in FAISS
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(docs, embeddings)
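
Conceptually, the vector store keeps one embedding per chunk and answers a query by returning the chunks whose vectors lie closest to the query vector. The sketch below shows that idea in plain Python with Euclidean (L2) distance, which, to my understanding, is the default metric of LangChain's FAISS wrapper. The three-dimensional vectors and the sample texts are invented for illustration; real `all-MiniLM-L6-v2` embeddings have 384 dimensions, and FAISS uses optimized index structures rather than a sort.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k_nearest(query_vec, indexed, k=2):
    """Return the k texts whose vectors are closest to query_vec."""
    ranked = sorted(indexed, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Hypothetical (text, embedding) pairs standing in for embedded chunks
toy_index = [
    ("Docker images are read-only templates", [0.9, 0.1, 0.0]),
    ("Kubernetes schedules pods", [0.1, 0.9, 0.0]),
    ("Containers are running instances of images", [0.8, 0.2, 0.1]),
]
query = [1.0, 0.0, 0.0]  # stand-in for the embedded question
nearest = top_k_nearest(query, toy_index, k=2)
print(nearest)
```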

In [7]:
# Load the GPT-2 tokenizer and model, on the GPU when one is available
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda" if torch.cuda.is_available() else "cpu")

In [8]:
# GPT-2 ships without a padding token, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id

generation_pipeline = pipeline(
    task="text-generation",
    model=model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1,
    truncation=True
)

Device set to use cpu


In [15]:
# Retrieve the most relevant chunks and generate a response with GPT-2
def CTSE_Chatbot(question, top_k=4, max_length=512):
    results = vectorstore.similarity_search(question, k=top_k)
    context = "\n\n".join([doc.page_content for doc in results])

    prompt = f"""
You are a knowledgeable assistant. Analyze the academic content below and generate a comprehensive, structured, and insightful answer
to the question as if you're helping a university student understand the topic deeply. Your response should include:

- A clear and descriptive title
- An introductory paragraph summarizing the concept
- Well-structured sections with subheadings
- Bullet points or numbered lists to organize key ideas
- In-depth elaboration of technical concepts with examples
- Additional context or real-world relevance where useful

### Context:
{context}

### Question:
{question}

### Answer:
"""
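    # Note: max_length counts the prompt tokens as well as the generated ones, and GPT-2's
    # context window is 1024 tokens, so a long retrieved context leaves little room for the answer.
    response = generation_pipeline(prompt, max_length=max_length, do_sample=True, top_p=0.9, temperature=0.7)[0]['generated_text']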

    # Extract only the answer part (everything after 'Answer:')
    if "Answer:" in response:
        return response.split("Answer:")[-1].strip()
    else:
        return response.strip()
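
Because GPT-2 can attend to at most 1024 tokens, the instructions, the retrieved context and the generated answer all have to share that window. One possible way to handle this, sketched below under stated assumptions, is to keep retrieved chunks in ranked order only while they fit a token budget. The helper names `fit_chunks_to_budget` and `rough_token_count`, the overhead of 150 tokens and the 200-token answer allowance are all hypothetical, and counting whitespace-separated words is only a crude stand-in for real tokenization.

```python
GPT2_CONTEXT_TOKENS = 1024  # GPT-2's maximum sequence length


def fit_chunks_to_budget(chunks, count_tokens, prompt_overhead, max_new_tokens,
                         context_window=GPT2_CONTEXT_TOKENS):
    """Keep chunks, in ranked order, while prompt plus answer still fit the window."""
    budget = context_window - prompt_overhead - max_new_tokens
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # the next chunk would overflow the context window
        kept.append(chunk)
        used += cost
    return kept


def rough_token_count(text):
    # Assumption: whitespace-separated words as a crude proxy for tokens
    return len(text.split())


# Hypothetical ranked chunks of 300 "tokens" each
ranked_chunks = ["word " * 300, "word " * 300, "word " * 300]
kept = fit_chunks_to_budget(ranked_chunks, rough_token_count,
                            prompt_overhead=150, max_new_tokens=200)
print(len(kept))  # 2
```

In the notebook, a more faithful counter would be `lambda t: len(tokenizer.encode(t))`, applied to each `doc.page_content` before the chunks are joined into `context`.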

In [16]:
# Example
question = "what is a docker image"
response = CTSE_Chatbot(question)

print("Question:", question + "\n")
print("Answer:", response)

Question: what is a docker image

Answer: Docker is a container engine, a set of distributed processes which is a framework for managing and building applications.

Docker is an open source toolkit which can be used for any purpose, from production to production.

Docker is an open source toolkit which can be used for any purpose, from production to production. It is built with the Open Source Toolkit (OS) (Open Source Software).

Docker is built with the Open Source Toolkit (OS) (Open Source Software). It is an open source toolkit that can be
