<a href="https://colab.research.google.com/github/ShikharKunal/Chatbot_over_PDF/blob/main/RAGmodel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


- ### Loaded the PDF from Google Drive using PyPDF2
- ### Preprocessed the text and split it into chunks
- ### Created embeddings using HuggingFaceEmbeddings
- ### Used **FAISS** to search the index for the most relevant context
- ### Merged the query and the corresponding context into a single prompt
- ### Loaded the **Llama-2-7b** chat model with **4-bit quantization** and passed the prompt to it


## Installing all the dependencies

In [None]:
!pip install transformers torch bitsandbytes==0.41.0 langchain tiktoken PyPDF2 faiss-cpu sentence-transformers accelerate

## Mounting the drive

In [None]:

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## PDF loading and text preprocessing

In [None]:
#Loading the PDF

from PyPDF2 import PdfReader

doc_reader = PdfReader('/content/drive/MyDrive/ugrulebook.pdf')

# read the text of every page into one string
raw_text = ''
for page in doc_reader.pages:
    text = page.extract_text()
    if text:  # skip pages with no extractable text
        raw_text += text



In [None]:
import re

def clean_text(text):
    # Remove leading and trailing whitespace
    text = text.strip()

    # Remove bullet points and special characters
    text = re.sub(r'•', '', text)  # Replace bullet points with an empty string
    text = re.sub(r'[^\w\s@.-]', '', text)  # Keep only word characters, whitespace, '@', '.' and '-'

    # Remove emojis using regex
    emoji_pattern = re.compile("["
                               "\U0001F600-\U0001F64F"  # Emojis
                               "\U0001F300-\U0001F5FF"  # Symbols & Pictographs
                               "\U0001F680-\U0001F6FF"  # Transport & Map Symbols
                               "\U0001F700-\U0001F77F"  # Alchemical Symbols
                               "\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
                               "\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
                               "\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
                               "\U0001FA00-\U0001FA6F"  # Chess Symbols
                               "\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
                               "\U00002702-\U000027B0"  # Dingbats
                               "\U000024C2"  # Circled Latin capital letter M (Enclosed Alphanumerics)
                               "]+", flags=re.UNICODE)
    text = emoji_pattern.sub(r'', text)
    # text = text.lower()

    # Collapse repeated newlines, then all whitespace runs (newlines included) into single spaces
    text = re.sub(r'\n+', '\n', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text
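
The two substitutions in `clean_text` that change the text the most are the character whitelist and the whitespace collapse. The sketch below applies just those two steps to a made-up line (the address and wording are placeholders, not taken from the rulebook) to show what survives. Note that `/` is stripped as well, which is presumably why grade lists appear fused together in the cleaned chunks.

```python
import re

# Made-up sample line; the address and wording are placeholders, not from the rulebook
sample = "• Grades FR/DX/DR/W: mail office@example.edu (Ref: 250th Senate Meeting)\n\n  CPI >= 7.0"

# Same character whitelist as clean_text: keep word chars, whitespace, '@', '.' and '-'
no_symbols = re.sub(r'[^\w\s@.-]', '', sample)

# Same whitespace collapse as clean_text: every run of whitespace becomes one space
cleaned = re.sub(r'\s+', ' ', no_symbols).strip()

print(cleaned)
```

Comparison operators such as `>=` are removed too, so numeric conditions in the source text lose their meaning after cleaning.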

In [None]:
len(raw_text)

110222

In [None]:
from langchain.text_splitter import CharacterTextSplitter

# Splitting up the text into smaller chunks for indexing
text_splitter = CharacterTextSplitter(
    separator = "\n",
    chunk_size = 1000,
    chunk_overlap  = 200, #striding over the text
    length_function = len,
)
texts_ = text_splitter.split_text(raw_text)
texts = [clean_text(x) for x in texts_]
len(texts)

137
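
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified stand-in for the splitter. It is a sketch of the general idea (merge separator-delimited pieces greedily, then carry the tail of each chunk into the next one), not LangChain's actual implementation; the function name `split_with_overlap` and the toy input are assumptions made for illustration.

```python
def split_with_overlap(text, separator="\n", chunk_size=1000, chunk_overlap=200):
    """Greedy piece-merging splitter; a simplified stand-in, not LangChain's code."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], []
    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # The current chunk is full: emit it
            chunks.append(separator.join(current))
            # Drop leading pieces until the remainder fits within chunk_overlap
            while current and len(separator.join(current)) > chunk_overlap:
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(separator.join(current))
    return chunks


# Toy input: 50 lines of 8 characters each
toy_text = "\n".join(f"line {i:03d}" for i in range(50))
toy_chunks = split_with_overlap(toy_text, chunk_size=40, chunk_overlap=17)

print(toy_chunks[0])
print("---")
print(toy_chunks[1])
```

With these toy settings each chunk holds four lines and repeats the last two lines of the previous chunk, so a sentence that straddles a chunk boundary still appears whole in at least one chunk.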

In [None]:
texts[100]

'b No FRDXDRW grades at the end of the first two regular registered semesters. c NP grade in NOCS shall not be a bar for applying for a branch change Ref 250thSenate Meeting. d NP grade in GC 101 shall not be a bar for applying for a branch change. e Students should secure an eligibility-CPI of at least 7.0. The eligibility- CPI is calculated taking into account only the following 1st year courses Ref 254th Senate Meeting. i. Introduction to HASMED 4 4 8 Credits ii. DIC-1 6 Credits iii. DIC-2 6 Credits iv. Makerspace MS 101 8 Credits v. Physics Lab PH 117 3 Credits 34 vi. Chemistry Lab CH 117 3 Credits B The calculation of CPI for Change of Branch henceforth referred to as the Branch- Change-CPI will be based only on grades obtained in the following 1st year theory courses Ref 254th Senate Meeting. a Physics two half-semester courses PH 111 and PH 112 8 credits'

## Creating embeddings and storing them

In [None]:
#creating embeddings
from langchain.embeddings import HuggingFaceEmbeddings

model_name = "sentence-transformers/all-mpnet-base-v2"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}
embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

In [None]:
#using FAISS for searching
from langchain.vectorstores import FAISS

docsearch = FAISS.from_texts(texts, embeddings)

In [None]:
docsearch.embedding_function

HuggingFaceEmbeddings(client=SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
), model_name='sentence-transformers/all-mpnet-base-v2', cache_folder=None, model_kwargs={'device': 'cpu'}, encode_kwargs={'normalize_embeddings': False}, multi_process=False)
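
Conceptually, the vector store keeps one embedding per chunk and answers a query by returning the chunks whose embeddings lie closest to the query embedding. The sketch below shows that idea with a brute-force Euclidean search over hand-written 3-dimensional vectors. The vectors, texts, and function names are all made up for illustration (the real embeddings above have 768 dimensions), and the sketch assumes a plain L2 distance rather than reproducing FAISS internals.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query_vec, index, k=1):
    """Return the k (text, distance) pairs closest to query_vec."""
    scored = [(text, l2_distance(query_vec, vec)) for text, vec in index]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Hypothetical toy "index": (chunk text, embedding) pairs
toy_index = [
    ("branch change rules", [0.9, 0.1, 0.0]),
    ("minor degree credits", [0.1, 0.8, 0.2]),
    ("hostel fee payment", [0.0, 0.2, 0.9]),
]

hits = nearest([0.8, 0.2, 0.1], toy_index, k=2)
for text, dist in hits:
    print(f"{dist:.3f}  {text}")
```

A smaller distance means a closer match, so the first hit is the chunk that would be used as context.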

## Context from FAISS search

In [None]:
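#function for getting context
def get_context(question):
  # similarity_search returns Document objects ranked by closeness to the query
  docs = docsearch.similarity_search(question)
  # Use the text of the best match; passing the Document itself would also
  # put its "page_content=... metadata=..." wrapper into the prompt
  return docs[0].page_content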
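
`get_context` uses only the single best match. One possible variation, sketched below, joins several retrieved chunks while staying within a character budget so that the prompt does not grow unbounded. The helper name `join_context`, the `max_chars` value, and the stand-in objects are assumptions for illustration; the only property relied on is that each retrieved item exposes a `page_content` attribute.

```python
from types import SimpleNamespace


def join_context(docs, max_chars=2000):
    """Concatenate page_content of ranked docs until max_chars would be exceeded."""
    parts, used = [], 0
    for doc in docs:
        text = doc.page_content
        if used + len(text) > max_chars:
            break  # stop at the first chunk that no longer fits
        parts.append(text)
        used += len(text)
    return "\n\n".join(parts)


# Stand-ins for retrieved documents, ordered from best to worst match
fake_docs = [
    SimpleNamespace(page_content="A" * 900),
    SimpleNamespace(page_content="B" * 900),
    SimpleNamespace(page_content="C" * 900),
]

context = join_context(fake_docs)
print(len(context))
```

The budget counts characters, not tokens, so it is only a rough guard against exceeding the model's context length.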

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
    
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


## LLM Model

In [None]:
#LLAMA 2 7b pipeline loading
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch
import bitsandbytes
import accelerate

name = "meta-llama/Llama-2-7b-chat-hf"  # Replace with your desired model name

tokenizer = AutoTokenizer.from_pretrained(name)
tokenizer.pad_token_id = tokenizer.eos_token_id  # for open-ended generation

model = AutoModelForCausalLM.from_pretrained(
    name,
    torch_dtype=torch.float16,
    load_in_4bit=True,  # changing this to load_in_8bit=True works on smaller models
    trust_remote_code=True,
    device_map="auto",  # finds GPU
)

generation_pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    trust_remote_code=True,
    device_map="auto",  # finds GPU
)

# Now you can use the 'generation_pipe' to generate text.
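
A rough back-of-the-envelope calculation shows why 4-bit loading matters on a Colab GPU. The sketch below estimates the memory taken by the weights alone at different precisions. It assumes a nominal 7 billion parameters and ignores activations, the KV cache, and quantization overhead, so the figures are lower bounds and not measurements.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


N_PARAMS = 7e9  # nominal parameter count for a "7B" model (assumption)

fp16_gib = weight_memory_gib(N_PARAMS, 16)
int8_gib = weight_memory_gib(N_PARAMS, 8)
int4_gib = weight_memory_gib(N_PARAMS, 4)

print(f"fp16:  {fp16_gib:.1f} GiB")
print(f"8-bit: {int8_gib:.1f} GiB")
print(f"4-bit: {int4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights need roughly a quarter of the fp16 footprint, which leaves room for the prompt and the generated tokens on a single small GPU.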



In [None]:
#output function
def get_output(context, question):
  prompt = f"### Instruction: Answer the question based on the provided context, if the information is provided in the context, answer should be based only on the prompt, try answering from whatever you can get if context is irrelevant ### Question: {question} ### Context: {context} ### Answer:"
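  # The pipeline returns the prompt followed by the model's continuation
  sequences = generation_pipe(
      prompt,
      do_sample=True,
      top_k=10,
      num_return_sequences=1,  # only the first sequence is used below
      eos_token_id=tokenizer.eos_token_id,
      max_length=1000,  # token budget for the prompt plus the generated answer
  )
  return sequences[0]['generated_text']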


In [None]:
#processing the output to just show the bot's answer
import warnings

# Silence pipeline UserWarnings before generation runs, not after it
warnings.filterwarnings("ignore", category=UserWarning, module="transformers.pipelines.base")

def final_output(question):
  context_ = get_context(question)
  output = get_output(context_, question)
  # The generated text echoes the prompt, so keep only what follows the last marker
  answer = output.split('### Answer:')[-1]
  print(answer)
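
The prompt assembly and the answer extraction can be checked without loading the model. The sketch below rebuilds the same `### Question / ### Context / ### Answer:` layout and simulates a generation that echoes the prompt before the answer. The helper names, the shortened instruction text, and the placeholder question and context are assumptions for illustration.

```python
PROMPT_TEMPLATE = (
    "### Instruction: {instruction} "
    "### Question: {question} "
    "### Context: {context} "
    "### Answer:"
)


def build_prompt(question, context, instruction="Answer the question using the context."):
    """Fill the template with the question and the retrieved context."""
    return PROMPT_TEMPLATE.format(
        instruction=instruction, question=question, context=context
    )


def extract_answer(generated_text):
    """Keep only the text after the last '### Answer:' marker."""
    return generated_text.split("### Answer:")[-1].strip()


prompt = build_prompt("What is X?", "X is a placeholder fact.")

# Simulated model output: the prompt echoed back, followed by the continuation
fake_generation = prompt + " X is a placeholder fact."
answer = extract_answer(fake_generation)

print(answer)
```

Splitting on the last marker keeps the extraction correct even if the retrieved context happens to contain the string `### Answer:` itself.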


# Five example queries


In [None]:
question = 'What is IDDDP?'
final_output(question)

 IDDDP stands for Integrated Dual Degree Program. It is a program that allows students to complete two different postgraduate degrees simultaneously, with a combination of courses and research work. In the context of the provided text, IDDDP is used to describe the movement of students from one academic unit to another, and the completion of a DD specialization M.Tech. program, which typically requires the completion of 8-9 courses and a DDPMTP project. The program is designed to allow students to earn a dual degree in a specific specialization without honors.


In [None]:
question = 'how to calculate SPI?'
final_output(question)

 The calculation of SPI involves taking the total number of credit hours attempted and dividing it by the total number of quality points earned. The formula used to calculate SPI is: SPI = Total Credit Hours Attempted / Total Quality Points Earned. In the context provided, the formula is: SPI = C1g1 C2 g2 C3 g3 C4 g4 C5 g5, where C1, C2, C3, C4, C5 are the number of credit hours attempted in each course, and g1, g2, g3, g4, g5 are the quality points earned in each course. To calculate the SPI, you need to know the number of credit hours attempted and the quality points earned in each course. Once you have this information, you can plug it into the formula and calculate the SPI. For example, if the student attempted 15 credit hours and earned 6 quality points in course 1, 4 quality points in course 2, and 3 quality points in course 3, the calculation would be: SPI = 15 / 6 + 4 / 4 + 3 / 3 = 2.00. This means that the student's SPI for that semester is 2.00.


In [None]:
question = 'What is a minor degree?'
final_output(question)

 A minor degree is an additional credential that a student can earn in addition to their primary degree. In this context, a minor degree is an additional 30 credit worth of courses that a student can take in a discipline other than their major discipline of B.Tech. All academic units in the institute offer minors in their disciplines and the student must pre-register for a minor course which is then allotted based on their highest CPI. By accumulating credits through the required courses, the student can earn a minor degree in a specific discipline.


In [None]:
question = 'How to register for projects?'
final_output(question)

 To register for projects, you must complete the online Course Registration Form (CRF) on or before the prescribed last date for registration. You must also consult with your Faculty Adviser and obtain their approval for your registration. If you have any outstanding dues to the Institute or a hostel, you will not be permitted to register. Late registration may be permitted in valid reasons, but you must pay a late registration fee. Additionally, you must register for the first two semesters, except for B.Des, on or before the prescribed last date for registration.


In [None]:
question = 'How to get AP?'
final_output(question)

 How to get AP? To get AP, a student must meet the eligibility criteria set by the institute. In this case, the student must have received FRDX grades for at least 36 credits in core courses and be transferred to the Academic Rehabilitation Program ARP. Once the student is transferred to ARP, they will be provided with an opportunity to continue their studies and successfully complete their degree. The program provides a buffer for students with poor performance in academics, and the student's CPI will improve if they complete the program with good grades.


## References
- https://github.com/msuatgunerli/FAISSAL/blob/main/utils/run_llm.py
- https://colab.research.google.com/drive/1SQmK0GYz34RGVlOnL5YMkdm7hXD6OjQT?usp=sharing#scrollTo=m8RwW7Axcu9E
- https://colab.research.google.com/drive/13FpBqmhYa5Ex4smVhivfEhk2k4S5skwG?usp=sharing#scrollTo=Eji7bv3-To_D
- https://python.langchain.com/docs/get_started/introduction
- https://huggingface.co/docs/transformers/index