<a href="https://colab.research.google.com/github/Suhail372/files_for_chatbot/blob/master/chatbot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!git clone https://github.com/Suhail372/files_for_chatbot

Cloning into 'files_for_chatbot'...
remote: Enumerating objects: 205, done.
remote: Counting objects: 100% (205/205), done.
remote: Compressing objects: 100% (92/92), done.
remote: Total 205 (delta 136), reused 164 (delta 110), pack-reused 0 (from 0)
Receiving objects: 100% (205/205), 18.29 MiB | 10.37 MiB/s, done.
Resolving deltas: 100% (136/136), done.


In [2]:
!pip install sentence-transformers faiss-gpu


Collecting sentence-transformers
  Downloading sentence_transformers-3.1.1-py3-none-any.whl.metadata (10 kB)
Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.4 kB)
Downloading sentence_transformers-3.1.1-py3-none-any.whl (245 kB)
Downloading faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
Installing collected packages: faiss-gpu, sentence-transformers
Successfully installed faiss-gpu-1.7.2 sentence-transformers-3.1.1


In [3]:
import os
os.chdir('/content/files_for_chatbot')

In [None]:
import os
import json
import random

import numpy as np
import torch
import faiss
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForCausalLM


In [5]:

class VectorSearchWrapper:
    def __init__(self, location_is_hyd=False):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.EMBED_MODEL = 'sentence-transformers/paraphrase-MiniLM-L3-v2'
        self.location_is_hyd = location_is_hyd

        self.hyd_json_file_path = 'combined files/cleaned_and_combined_hyd.json'
        self.blore_json_file_path = 'combined files/cleaned_and_combined_blore.json'

        self.model = SentenceTransformer(self.EMBED_MODEL, device=self.device)
        self.saved_vectors_path = 'saved_vectors'
        self.index_hyd = None
        self.index_blore = None
        self.embeddings_hyd = []
        self.embeddings_blore = []
        self.id_to_entry = {}
        self.run()

    def embedding(self, text_data):
        embedding = self.model.encode(text_data, convert_to_tensor=True, device=self.device)
        normalized_embedding = torch.nn.functional.normalize(embedding, p=2, dim=-1)
        return normalized_embedding.cpu().numpy()

    def preprocess_and_embed(self, json_file_path):
        embedded_list = []
        with open(json_file_path, 'r') as file:
            json_data = json.load(file)

        for entry in json_data:
            # Shorten the address to its last four comma-separated terms.
            address = entry['Location']
            terms = [term.strip() for term in address.split(',')]
            shortened_address = ', '.join(terms[-4:]) if len(terms) > 4 else address

            entry['text data'] = entry['text data'].replace(address, shortened_address)
            # Drop the school name from the text that gets embedded.
            text_data = entry["text data"].replace(f'Name: {entry["Name"]}', '')
            entry_id = entry.get("Id", None)

            if entry_id is not None:
                embedding = self.embedding(text_data)
                embedded_list.append({
                    "embedding": embedding,
                    "text": text_data,
                    "id": entry_id
                })

        return embedded_list

    def save_embeddings(self, embeddings, location_name):
        os.makedirs(self.saved_vectors_path, exist_ok=True)

        embeddings_array = np.vstack([entry["embedding"] for entry in embeddings])
        ids = [entry["id"] for entry in embeddings]
        texts = [entry["text"] for entry in embeddings]

        np.save(os.path.join(self.saved_vectors_path, f'embeddings_{location_name}.npy'), embeddings_array)
        with open(os.path.join(self.saved_vectors_path, f'metadata_{location_name}.json'), 'w') as f:
            json.dump({"ids": ids, "texts": texts}, f)

    def load_embeddings(self, location_name):
        embeddings_path = os.path.join(self.saved_vectors_path, f'embeddings_{location_name}.npy')
        metadata_path = os.path.join(self.saved_vectors_path, f'metadata_{location_name}.json')

        if os.path.exists(embeddings_path) and os.path.exists(metadata_path):
            embeddings_array = np.load(embeddings_path)
            with open(metadata_path, 'r') as f:
                metadata = json.load(f)

            # entry_id avoids shadowing the built-in id().
            embeddings = [
                {"embedding": emb, "text": text, "id": entry_id}
                for emb, text, entry_id in zip(embeddings_array, metadata["texts"], metadata["ids"])
            ]
            return embeddings
        return None

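    def create_faiss_index(self):
        # Ask the model for its embedding size (384 for paraphrase-MiniLM-L3-v2)
        # instead of hard-coding it, so changing EMBED_MODEL keeps the index valid.
        dimension = self.model.get_sentence_embedding_dimension()
        return faiss.IndexFlatL2(dimension)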

    def insert_data(self, index, embeddings):
        embeddings_array = np.vstack([entry["embedding"] for entry in embeddings])
        index.add(embeddings_array)

    def run(self):
        self.embeddings_hyd = self.load_embeddings('hyd')
        self.embeddings_blore = self.load_embeddings('blore')

        if self.embeddings_hyd is None:
            print("Hyderabad embeddings not found. Preprocessing and creating new embeddings.")
            self.embeddings_hyd = self.preprocess_and_embed(self.hyd_json_file_path)
            self.save_embeddings(self.embeddings_hyd, 'hyd')
        if self.embeddings_blore is None:
            print("Bangalore embeddings not found. Preprocessing and creating new embeddings.")
            self.embeddings_blore = self.preprocess_and_embed(self.blore_json_file_path)
            self.save_embeddings(self.embeddings_blore, 'blore')

        self.index_hyd = self.create_faiss_index()
        self.index_blore = self.create_faiss_index()

        self.insert_data(self.index_hyd, self.embeddings_hyd)
        self.insert_data(self.index_blore, self.embeddings_blore)

    def search_faiss(self, query, k=3):
        query_embedding = self.embedding(query).reshape(1, -1)
        index = self.index_hyd if self.location_is_hyd else self.index_blore
        embeddings = self.embeddings_hyd if self.location_is_hyd else self.embeddings_blore
        json_file_path = self.hyd_json_file_path if self.location_is_hyd else self.blore_json_file_path

        distances, indices = index.search(query_embedding, k)

        # FAISS pads with -1 when the index holds fewer than k vectors; skip those slots.
        results = [{"id": embeddings[idx]["id"], "text": embeddings[idx]["text"]} for idx in indices[0] if idx != -1]
        with open(json_file_path, 'r') as file:
            data = json.load(file)

        # Swap the shortened embedding text for the full 'text data' record with the same Id.
        full_text_by_id = {entry['Id']: entry['text data'] for entry in data if 'Id' in entry}
        for result in results:
            result['text'] = full_text_by_id.get(result['id'], result['text'])

        return results
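
The wrapper L2-normalizes every embedding before adding it to `faiss.IndexFlatL2`. For unit vectors, squared L2 distance and cosine similarity are linked by ||a - b||² = 2 - 2·cos(a, b), so the nearest neighbours by L2 distance are also the most cosine-similar. The following is a minimal pure-Python check of that identity. It uses made-up 3-dimensional vectors rather than real embeddings, and the helper names (`l2_normalize`, `squared_l2`, `cosine`) are illustrative and not part of the notebook.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine(a, b):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy vectors standing in for sentence embeddings (assumed values).
query = l2_normalize([1.0, 2.0, 0.5])
docs = {
    "doc_a": l2_normalize([0.9, 2.1, 0.4]),
    "doc_b": l2_normalize([-1.0, 0.5, 3.0]),
    "doc_c": l2_normalize([2.0, 0.1, 0.1]),
}

# Smallest distance first versus largest similarity first.
by_l2 = sorted(docs, key=lambda name: squared_l2(query, docs[name]))
by_cosine = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)

print(by_l2 == by_cosine)
```

Because both orderings agree, switching the index type to an inner-product index would not be expected to change which schools are retrieved, as long as the vectors stay normalized.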


In [None]:

class LLMHandler:
    def __init__(self):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.llama_tokenizer, self.llama_model = self.load_llama_model()

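    def load_llama_model(self):
        # Despite the "llama" names, this loads Qwen2.5-1.5B-Instruct.
        llama_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
        llama_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct").to(self.device)
        # generate() reads sampling settings from generation_config, not from attributes set on the model.
        llama_model.generation_config.temperature = 1.0
        llama_model.generation_config.top_p = 0.95
        return llama_tokenizer, llama_model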

    def generate_query(self, query, history):
        input_prompt = f"""
        <<SYS>>
            Given the chat history and the user query, generate a concise question similar to the previous queries. Do not answer the query.
        <</SYS>>
        [INST]
        History: {history}
        Query: {query}
        [/INST]
        """
        inputs = self.llama_tokenizer(input_prompt, return_tensors="pt", padding=True).to(self.device)
        outputs = self.llama_model.generate(**inputs, max_new_tokens=1000, pad_token_id=self.llama_model.config.eos_token_id)
        # Decode only the newly generated tokens so the prompt and special tokens are not echoed back.
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        return self.llama_tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    def is_query(self, query):
        query = query.lower()
        my_list = ["school", "schools", "facilities", "amenities", "sports", "faculty", "fees", "institute", "organisation", "org", "inst", "scl", "schol"]
        return any(item in query for item in my_list)

    def requires_context(self, query):
        query = query.lower()
        context_keywords = ["previous", "last", "before", "history", "context", "same"]
        return any(item in query for item in context_keywords)

    def generate_chat_response(self, query, data):
        input_prompt = f"""
        <<SYS>>You are a chatbot assistant that recommends schools to the user by describing the school's information. Generate an answer for the query given using the search results provided and nothing else in the response field.
        <</SYS>>
        [INST]
        User Query: {query}
        Search results: {data}
        [/INST]
        """
        inputs = self.llama_tokenizer([input_prompt], return_tensors="pt").to(self.device)
        outputs = self.llama_model.generate(**inputs, max_new_tokens=1000, pad_token_id=self.llama_model.config.eos_token_id)
        # Decode only the newly generated tokens so the prompt and special tokens are not echoed back.
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        return self.llama_tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    def get_not_school_related_response(self):
        not_school_related_responses = [
            "I'm sorry, but I can only provide information related to schools. How can I assist you with a school-related query today?",
            "It looks like your question isn't related to schools. Could you please ask something about schools so I can help you better?",
            "I'm here to help with school-related questions! If you have any queries about schools, feel free to ask.",
            "Oops! I can only assist with questions about schools. Please let me know if there's anything you need to know about schools.",
            "I'm not equipped to handle this query. Do you have any school-related queries?",
            "I'm focused on answering school-related questions. Can you ask something about schools?"
        ]
        return random.choice(not_school_related_responses)
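
`is_query` and `requires_context` use substring tests, so short keywords such as "org" or "inst" also fire inside unrelated words ("forgot", "instead"). One possible tightening, sketched here as an assumption rather than the notebook's own method, is to match whole words with a regular expression. The names `matches_keyword` and `matches_substring` are illustrative; the keyword list is copied from `is_query`.

```python
import re

SCHOOL_KEYWORDS = ["school", "schools", "facilities", "amenities", "sports",
                   "faculty", "fees", "institute", "organisation", "org",
                   "inst", "scl", "schol"]

# \b anchors each keyword to word boundaries, so "org" no longer matches "forgot".
_KEYWORD_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in SCHOOL_KEYWORDS) + r")\b",
    re.IGNORECASE,
)

def matches_keyword(query):
    """Whole-word match against the keyword list."""
    return _KEYWORD_PATTERN.search(query) is not None

def matches_substring(query):
    """Substring match, mirroring the notebook's is_query for comparison."""
    query = query.lower()
    return any(word in query for word in SCHOOL_KEYWORDS)

print(matches_substring("I forgot my umbrella"), matches_keyword("I forgot my umbrella"))
print(matches_keyword("Public Schools in Dilsukh nagar"))
```

The trade-off is that whole-word matching no longer catches inflected forms such as "schooling", so the keyword list would need to name every form that should count.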


In [None]:
import random

class Chatbot(VectorSearchWrapper, LLMHandler):
    def __init__(self, location_is_hyd=False):
        VectorSearchWrapper.__init__(self, location_is_hyd)
        LLMHandler.__init__(self)

    def main(self):
        user_query = input("Enter your query: ").strip()
        user_history = input("Enter your chat history: ").strip()

        if self.is_query(user_query):
            if self.requires_context(user_query):
                prompt = self.generate_query(user_query, user_history)
                print(f"Generated Prompt: {prompt}")
                user_query = prompt

            search_results = self.search_faiss(user_query)
            if not search_results:
                print("No results found. Please try a different query.")
                return

            chat_response = self.generate_chat_response(user_query, search_results)
            print(f"Response: {chat_response}")
        else:
            print(self.get_not_school_related_response())



In [6]:
vector_search = VectorSearchWrapper(location_is_hyd=True)


In [31]:
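import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def load_llama_model():
    # Disabled quantization config, kept for reference. Double quantization and the
    # "nf4" quant type are 4-bit options in BitsAndBytesConfig (bnb_4bit_*), so this
    # would need reworking before use.
    # bnb_config = BitsAndBytesConfig(load_in_8bit=True,
    #                                 bnb_8bit_use_double_quant=True,
    #                                 bnb_8bit_quant_type="nf4",
    #                                 bnb_8bit_compute_dtype=torch.bfloat16)
    llama_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
    llama_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct").to(device)
    # generate() reads sampling settings from generation_config, not from attributes set on the model.
    llama_model.generation_config.temperature = 1.0
    llama_model.generation_config.top_p = 0.95
    return llama_tokenizer, llama_model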



def generate_query(llama_tokenizer, llama_model, query, history):
    input_prompt = f"""
    <<SYS>>
        Given the chat history and the user query, generate a concise question similar to the previous queries. Do not answer the query.
    <</SYS>>
    [INST]
    History: {history}
    Query: {query}
    [/INST]
    """
    # Use the module-level device so the function also runs without a GPU.
    inputs = llama_tokenizer(input_prompt, return_tensors="pt", padding=True).to(device)
    outputs = llama_model.generate(**inputs, max_new_tokens=1000, pad_token_id=llama_model.config.eos_token_id)
    # Decode only the newly generated tokens so the prompt and special tokens are not echoed back.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return llama_tokenizer.decode(new_tokens, skip_special_tokens=True).strip()


def is_query(query):
    query = query.lower()
    my_list = ["school", "schools", "facilities", "amenities", "sports", "faculty", "fees", "institute", "organisation", "org", "inst", "scl", "schol"]
    return any(item in query for item in my_list)

def requires_context(query):
    query = query.lower()
    context_keywords = ["previous", "last", "before", "history", "context", "same"]
    return any(item in query for item in context_keywords)


In [26]:

def generate_chat_response(llama_tokenizer, llama_model, query, data):
    input_prompt = f"""
    <<SYS>>You are a chatbot assistant that recommends schools to the user by describing the school's information. Generate an answer for the query given using the search results provided and nothing else in the response field.
    <</SYS>>
    [INST]
    User Query: {query}
    Search results: {data}
    [/INST]
    """

    # Use the module-level device so the function also runs without a GPU.
    inputs = llama_tokenizer([input_prompt], return_tensors="pt").to(device)
    outputs = llama_model.generate(**inputs, max_new_tokens=1000, pad_token_id=llama_model.config.eos_token_id)
    # Decode only the newly generated tokens so the prompt and special tokens are not echoed back.
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return llama_tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

In [9]:
llama_tokenizer, llama_model = load_llama_model()


In [41]:
query="Public Schools in Dilsukh nagar"
data="['Name: Oakridge International School~Category: International Schools~Location: Bowrampet, Near Bachupally, Hyderabad  500 043, Telangana, India~Faculty: Ms. Baljeet Oberoi Principal / Masters in Mathematics~Sports: Athletics, Badminton, Basketball, Carroms, Chess, Cricket, Football, Gymnastics, Hockey, Ice-Hockey, Karate, Lawn Tennis, Skating, Swimming, Tennis, Throwball, Volleyball, Yoga~Amenities: Transport, Medical Facility, Laboratory, Smart Classrooms, Computers Facility, Library~Board: CBSE, International Baccalaureate, Cambridge IGCSE~Years: -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12~Fee: 21800000, 71700000~Since: 2001~Strength: Not Available~']"

In [36]:
import random

def search(query):
    if not is_query(query):
        not_school_related_responses = [
            "I'm sorry, but I can only provide information related to schools. How can I assist you with a school-related query today?",
            "It looks like your question isn't related to schools. Could you please ask something about schools so I can help you better?",
            "I'm here to help with school-related questions! If you have any queries about schools, feel free to ask.",
            "Oops! I can only assist with questions about schools. Please let me know if there's anything you need to know about schools.",
            "I'm not equipped to answer that question. Could you ask something specific about schools?",
            "It seems your question isn't related to schools. How about asking me something about school programs, admissions, or activities?",
            "I'm here to talk about schools! Can you ask me something about school events, classes, or teachers?",
            "I can help with school-related topics. If you have a question about schools, please let me know!",
            "For now, I can only provide information on schools. Do you have a specific question about a school or education?",
            "I specialize in school-related information. Please ask a question about schools so I can assist you."
        ]
        return random.choice(not_school_related_responses)
    if requires_context(query):
        query = generate_query(llama_tokenizer, llama_model, query, "")
    search_results = vector_search.search_faiss(query)
    return generate_chat_response(llama_tokenizer, llama_model, query, search_results)

In [21]:
search_results = vector_search.search_faiss(query)
search_results=', '.join(item['text'] for item in search_results)

In [22]:
print(search_results)

Name: Jubilee Hills Public School~Category: Public Schools~Location: Block III, Rd Number 71, Jubilee Hills, Hyderabad, Telangana 500033~Faculty: ~Sports: Archery, Badminton, Basketball, Carroms, Chess, Cricket, Football, Gymnastics, Hockey, Ice-Hockey, Kabaddi, Karate, Kho-Kho, Skating, Table-tennis, Tennis, Throwball, Yoga~Amenities: Transport, Medical Facility, Laboratory, Smart Classrooms, Computers Facility, Library~Board: CBSE~Years: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11~Fee: 50000~Since: 1986~Strength: 2300~, Name: CLAPS International School~Category: Public Schools, Boarding Schools~Location: Survey No. 234 & 239, Kalabgur, Near Shilparamam , SangaReddy, Hyderabad~Faculty: ~Sports: Athletics, Billiards, Carroms, Chess, Cricket, Football, Gymnastics, Skating~Amenities: Transport, Medical Facility, Laboratory, Smart Classrooms, Computers Facility, Library~Board: CBSE, International Baccalaureate, ICSE~Years: LKG, UKG, 1, 2, 3, 4, 5, 6~Fee: -1~Since: Not Available~Strength: Not Availa
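
Each search result is a single string of `Key: Value` fields separated by `~`. If structured access is useful (for example, to filter by board or fee before prompting the model), the string can be split into a dictionary. This is a small sketch: `parse_record` is a hypothetical helper that does not exist in the notebook, and the sample string is abridged from the output above.

```python
def parse_record(text):
    """Split a 'Key: Value~Key: Value~' record into a dict of stripped strings."""
    record = {}
    for field in text.split("~"):
        if not field.strip():
            continue  # the trailing '~' leaves an empty field
        # partition splits on the first ':' only, so values may contain colons.
        key, _, value = field.partition(":")
        record[key.strip()] = value.strip()
    return record

sample = ("Name: Jubilee Hills Public School~Category: Public Schools~"
          "Faculty: ~Board: CBSE~Fee: 50000~Since: 1986~Strength: 2300~")
school = parse_record(sample)
print(school["Name"], "|", school["Board"], "|", school["Fee"])
```

The helper expects one record at a time, so it would be applied to each `item['text']` before the results are joined with `', '`.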

In [27]:
print(generate_chat_response(llama_tokenizer, llama_model, query, search_results))

 IGCSE board schools with ice-hockey include Jubilee Hills Public School and Global Edge School. Both schools have sports facilities such as hockey, ice-hockey, and other sports like basketball, football, cricket, etc., making them suitable for students interested in these activities. Additionally, CLAPS International School also offers ice-hockey among its sports options. The fees at all three schools range from ₹50,000 to ₹100,000 per annum, depending on the student’s age group and enrollment duration. They provide various amenities including transport, medical facility, laboratory, smart classrooms, computers facility, library, and others.

    It is important to note that the strength of each school may vary based on their current enrolment levels. For the most accurate and up-to-date information, it would be best to visit their official websites or contact them directly. Also, consider checking if there are any special scholarships or financial aid available for students who quali

In [42]:
print(search(query))

 Public Schools in Dilsukhnagar can be found at Bharatiya Vidya Bhavan's Public School located at Rd Number 71, Navanirman Nagar Colony, Film Nagar, Hyderabad, Telangana 500033. This school offers Sports like Athletics, Badminton, Basketball, Carroms, Chess, Cricket, Karate, Table-tennis, Tennis, Volleyball. It also provides Amenities such as Transport, Medical Facility, Laboratory, Smart Classrooms, Computers Facility, Library. The school follows CBSE board and has been established since 1979 with a strength of not available. <sep> Another option is Chirec International School which is situated at 1-55/12, Botanical Garden Rd, Sri Ram Nagar, Kondapur, Hyderabad, Telangana 500084. This school features Sports including Archery, Athletics, Badminton, Basketball, Carroms, Chess, Cricket, Football, Handball, Hockey, Kabaddi, Kho-Kho, Skating, Table-tennis, Tennis, Throwball, Volleyball, Yoga. They offer Transport, Medical Facility, Laboratory, Smart Classrooms, Computers Facility, Library 