This notebook builds a chatbot, deployed with Streamlit, that answers questions about ICS and related topics. It uses a simple retrieval-augmented generation (RAG) pipeline: for each query, the most relevant pieces of the ICS website are retrieved and passed to an LLM, which generates an answer from them.

Run the following pip install commands to install the required libraries:

In [None]:
!pip install -U langchain_community --quiet
!pip install langchain_google_genai --quiet
!pip install langchain_text_splitters --quiet
!pip install chromadb --quiet
!pip install streamlit --quiet
!pip install pyngrok --quiet

The code below scrapes the ICS website: it fetches the base URL, collects every link on that page that points to the same host, and extracts the text of each linked page. The text is then split into chunks of at most 400 characters with an overlap of 50 characters to maintain continuity between neighboring chunks.

In [4]:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse
import time

# Example URL (replace with your target)
base_url = "https://www.iitk.ac.in/counsel/"

response = requests.get(base_url)
soup = BeautifulSoup(response.text, "html.parser")

# Collect every link on the landing page that stays on the same host
links = set()
for a_tag in soup.find_all("a", href=True):
    href = a_tag["href"]
    full_url = urljoin(base_url, href)

    if urlparse(full_url).netloc == urlparse(base_url).netloc:
        links.add(full_url)
# Fetch each linked page and keep its visible text
all_text = []
for link in links:
    try:
        response = requests.get(link, timeout=10)
        time.sleep(1)  # Pause between requests to avoid hammering the server
        souppage = BeautifulSoup(response.text, "html.parser")
        textpage = souppage.get_text(separator="\n", strip=True)
        all_text.append(textpage)
        print(f"\n📄 {link}")
        print(textpage[:50])  # Print the first 50 characters as a preview
    except Exception as e:
        print(f"Couldn't load link: {link} ({e})")

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

docs = [Document(page_content=text) for text in all_text]

splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""]
)
chunks = splitter.split_documents(docs)


📄 https://www.iitk.ac.in/counsel/referral.php
Counselling Service, IIT Kanpur
Team
Services
Events
Workshop
Samvad
Deaddiction Clinic
Appointments

📄 https://www.iitk.ac.in/counsel/new-pg-information.php
Counselling Service, IIT Kanpur
Team
Services
Events
Workshop
Samvad
Deaddiction Clinic
Appointments

📄 https://www.iitk.ac.in/counsel/assets/img/events/SP_Poster_nn7.png
404 Not Found
Not Found
The requested URL was not found on this server.
Please contact concerned adm

📄 https://www.iitk.ac.in/counsel/family_tree/index.html
IITK Family Tree
i
Instructions
1. Search for any Roll Number or Name. If multiple options are prese

📄 https://www.iitk.ac.in/counsel/blog.php
Counselling Service, IIT Kanpur
Team
Services
Events
Workshop
Samvad
Deaddiction Clinic
Appointments

📄 https://www.iitk.ac.in/counsel/psytool.php
Counselling Service, IIT Kanpur
Team
Services
Events
Workshop
Samvad
Deaddiction Clinic
Appointments

📄 https://www.iitk.ac.in/counsel/academic-support.ph
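
To make the chunk size of 400 and the overlap of 50 used in the splitter more concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification, not the algorithm of `RecursiveCharacterTextSplitter`, which prefers to cut at the listed separators and therefore usually produces chunks shorter than the limit. The function name `sliding_chunks` and the sample string are made up for illustration.

```python
def sliding_chunks(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Cut text into fixed-size windows; neighbors share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each new chunk starts this many characters later
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical sample: 1000 characters of repeating digits
sample = "".join(str(i % 10) for i in range(1000))
pieces = sliding_chunks(sample)

print([len(p) for p in pieces])
# The last 50 characters of one chunk are the first 50 of the next
print(pieces[0][-50:] == pieces[1][:50])
```

The shared 50 characters are what keep a sentence that straddles a chunk boundary retrievable from either side.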

In [5]:
from google.colab import userdata
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
import os
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY

Here, we initialize the embedding model and store the chunks in a Chroma vector store persisted to the directory 'my_chroma_db', so that app.py can load it again without rebuilding the vector store every time the app is launched.

In [8]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import Chroma
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001",google_api_key=GOOGLE_API_KEY)
texts = [str(chunk.page_content) for chunk in chunks]
vectorstore = Chroma.from_texts(texts, embeddings, persist_directory="my_chroma_db")

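# Persist to disk (recent Chroma versions persist automatically and this call
# only triggers a deprecation warning; it is kept for older versions)
vectorstore.persist()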
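
Conceptually, a vector store keeps one embedding per chunk and answers a query by returning the chunks whose embeddings lie closest to the query embedding. The sketch below illustrates that idea with hand-written three-dimensional vectors and cosine similarity. Everything in it is an assumption made for illustration: the texts and vectors are invented, real embeddings have hundreds of dimensions, and the distance metric Chroma actually uses depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the texts of the k stored vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Invented (text, embedding) pairs standing in for stored chunks
toy_store = [
    ("appointment booking", [0.9, 0.1, 0.0]),
    ("workshop schedule", [0.1, 0.9, 0.0]),
    ("family tree", [0.0, 0.1, 0.9]),
]
toy_query = [0.8, 0.3, 0.0]  # invented query embedding

results = top_k(toy_query, toy_store, k=2)
print(results)
```

The app in this notebook asks its retriever for the 10 nearest chunks in the same spirit.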



Here, we write the app.py file, which loads the persisted vector store and contains the retrieval and answer-generation logic of the RAG pipeline.

In [None]:
%%writefile app.py
import os
import time

import streamlit as st
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import PromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings

# Reload the vector store that was persisted by the notebook
GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001",google_api_key=GOOGLE_API_KEY)
vectorstore = Chroma(persist_directory="my_chroma_db",embedding_function=embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k":10})

llm = ChatGoogleGenerativeAI(
    model="gemini-1.5-flash",
    temperature = 0.1,
    api_key=GOOGLE_API_KEY,
)
prompt = PromptTemplate(
    # These names must match the placeholders used in the template
    input_variables=["context", "question"],
    template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. "
             "If you don't know the answer, just say that you don't know. "
             "Use three sentences maximum and keep the answer concise. "
             "Question: {question} Context: {context} Answer:"
)
chain = prompt | llm
st.title("ICS Chatbot For All Your Queries!")
st.write("Welcome! Any questions you have can be asked below:")
query = st.text_input("Enter the topic of your question:")
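if query:
    # Show the spinner while retrieval and generation are actually running
    with st.spinner("Loading..."):
        docs = retriever.invoke(query)
        # Pass only the chunk text to the prompt, not the Document objects
        context = "\n\n".join(doc.page_content for doc in docs)
        response = chain.invoke({"context": context, "question": query}).content
    st.write(response)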
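
To see what the model ultimately receives, the sketch below fills the same prompt text with plain `str.format` and a list of chunk strings. Because every scraped page repeats the same navigation menu (visible in the scraping output earlier), identical chunks can be retrieved more than once, so the sketch also drops exact duplicates before joining them. The helper `build_prompt` and the sample inputs are hypothetical and are not part of app.py.

```python
PROMPT_TEMPLATE = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know. "
    "Use three sentences maximum and keep the answer concise. "
    "Question: {question} Context: {context} Answer:"
)


def build_prompt(question: str, chunks: list[str]) -> str:
    """Drop empty and duplicate chunks (keeping order) and fill the template."""
    seen = set()
    unique = []
    for chunk in chunks:
        text = chunk.strip()
        if text and text not in seen:
            seen.add(text)
            unique.append(text)
    context = "\n\n".join(unique)
    return PROMPT_TEMPLATE.format(question=question, context=context)


# Hypothetical retrieved chunks; the first two are the same repeated menu text
prompt_text = build_prompt(
    "How do I book an appointment?",
    ["Team\nServices", "Team\nServices", "Appointments"],
)
print(prompt_text)
```

Printing the filled prompt like this is a quick way to check whether the retrieved context actually contains the information needed to answer the question.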


Here, we launch the app. To open it, click on the link printed by the cell.

In [None]:
from google.colab import userdata
from pyngrok import ngrok

# Read the ngrok auth token from Colab secrets
AUTH = userdata.get("NGROK")
ngrok.set_auth_token(AUTH)
!pkill streamlit
ngrok.kill()

public_url = ngrok.connect(8501)
print(f"Streamlit is live at: {public_url}")

!streamlit run app.py --server.port 8501 --server.headless true --server.enableCORS false
