<a href="https://colab.research.google.com/github/DorcasO/DorcasO/blob/main/UniofYorkchatbot_latest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Mount Google Drive

To save our scraped data persistently, we mount Google Drive to access its file system from Colab.

In [1]:
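import os
from google.colab import drive

# Mount Google Drive so that /content/drive is available (Colab only)
drive.mount("/content/drive")

# Path inside Google Drive where the scraped data and outputs are stored
drive_path = "/content/drive/MyDrive/AMDARI/UniofYork_scraped_data"

# Create the folder if it doesn't exist
os.makedirs(drive_path, exist_ok=True)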

In [3]:
%pip install -U langchain-community


In [4]:
import requests
from bs4 import BeautifulSoup
# Third-party integrations are imported from langchain_community (installed above)
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.docstore.document import Document
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langchain_community.llms import HuggingFacePipeline
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import gradio as gr


# === 1. Scrape University of York Pages ===

In [5]:
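def scrape_york_pages(urls):
    """Download each URL and return the visible text of every page that loads."""
    texts = []
    for url in urls:
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # treat HTTP errors (404, 500, ...) as failures
            soup = BeautifulSoup(response.text, "html.parser")
            # Keep only the visible text, one line per text node
            content = soup.get_text(separator="\n", strip=True)
            texts.append(content)
        except requests.RequestException as e:
            print(f"Failed to scrape {url}: {e}")
    return texts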

urls = [
    "https://www.york.ac.uk/study/undergraduate/courses/bsc-computer-science/",
    "https://www.york.ac.uk/study/undergraduate/applying/",
    "https://www.york.ac.uk/study/accommodation/undergraduate/"
]

raw_texts = scrape_york_pages(urls)
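
The introduction mentions saving the scraped data persistently, but the cells above only keep `raw_texts` in memory. A minimal sketch of writing each page to its own text file follows. The helper name `save_texts`, the `page_<i>.txt` naming scheme, and the sample strings are assumptions for illustration; in the notebook you would probably pass `raw_texts` and `drive_path` instead.

```python
import os
import tempfile

def save_texts(texts, out_dir):
    """Write each text to out_dir/page_<i>.txt and return the file paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, text in enumerate(texts):
        path = os.path.join(out_dir, f"page_{i}.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        paths.append(path)
    return paths

# Demo with placeholder strings and a temporary folder (hypothetical data).
demo_dir = tempfile.mkdtemp()
saved_paths = save_texts(["first page text", "second page text"], demo_dir)
print(saved_paths)
```

Note that `scrape_york_pages` skips URLs that fail, so the file index does not necessarily match the position of the URL in `urls`.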

# === 2. Chunk text and convert to LangChain documents ===

In [6]:
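def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into chunk_size-character pieces; neighbouring pieces share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size - overlap)]

documents = [Document(page_content=chunk) for text in raw_texts for chunk in chunk_text(text)]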
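
To make the overlap concrete, the sketch below runs the same slicing logic on a short made-up string with a small `chunk_size` and `overlap` (values chosen only for readability). Each chunk starts `chunk_size - overlap` characters after the previous one, so the last `overlap` characters of one chunk reappear at the start of the next.

```python
# Same slicing logic as the notebook's chunk_text, copied so the snippet runs standalone.
def chunk_text(text, chunk_size=500, overlap=50):
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "abcdefghijklmnopqrstuvwxyz"  # 26 characters, illustrative only
chunks = chunk_text(sample, chunk_size=10, overlap=3)
for c in chunks:
    print(repr(c))
# 'abcdefghij'
# 'hijklmnopq'
# 'opqrstuvwx'
# 'vwxyz'
```

The slices are taken by character count, so a chunk can begin or end in the middle of a word or sentence.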


# === 3. Embed and Index with FAISS ===

In [8]:
%pip install faiss-cpu

Successfully installed faiss-cpu-1.11.0


In [9]:
embedding_model = HuggingFaceEmbeddings(model_name="BAAI/bge-base-en-v1.5")
vectorstore = FAISS.from_documents(documents, embedding_model)
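
Conceptually, the vector store maps each chunk to a vector and answers a query by returning the chunks whose vectors lie closest to the query vector. The toy sketch below illustrates that idea with hand-written 3-dimensional vectors and cosine similarity in plain Python. It is not FAISS or the BGE model (the real index may use a different metric, such as Euclidean distance), and the texts, vectors, and the `top_k` helper are invented for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, indexed, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    scored = sorted(indexed, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in scored[:k]]

# Hypothetical (text, vector) pairs standing in for embedded chunks.
toy_index = [
    ("Accommodation fees and room types", [0.9, 0.1, 0.0]),
    ("How to apply as an undergraduate", [0.1, 0.9, 0.1]),
    ("Computer Science module list", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of a question about housing
results = top_k(query_vec, toy_index, k=2)
print(results)
```

With these made-up numbers, the accommodation text ranks first because its vector points in nearly the same direction as the query vector.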

In [13]:
# Optional: Save FAISS index
vectorstore.save_local("york_faiss_index")
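
A saved index can be reloaded in a later session instead of re-scraping and re-embedding. The sketch below assumes the `langchain_community` FAISS wrapper, whose `load_local` needs the same embedding model that built the index and, in recent versions, an explicit `allow_dangerous_deserialization=True` because part of the index is stored with pickle. The helper name `load_york_index` is hypothetical, and the import sits inside the function so that defining it does not itself require the package.

```python
def load_york_index(embedding_model, folder="york_faiss_index"):
    """Reload the FAISS index previously written with vectorstore.save_local(folder)."""
    # Imported lazily; assumes langchain-community and faiss-cpu are installed.
    from langchain_community.vectorstores import FAISS

    # Only enable deserialization for index files you created yourself.
    return FAISS.load_local(
        folder,
        embedding_model,
        allow_dangerous_deserialization=True,
    )
```

In the notebook this would be called as `vectorstore = load_york_index(embedding_model)`.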

# === 4. Load Falcon3-1B Instruct ===

In [12]:
model_name = "tiiuae/Falcon3-1B-Instruct"
# If the model is private, you might need to provide a token like this:
# from huggingface_hub import login
# login(token="your_huggingface_token")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", trust_remote_code=True)

falcon_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.2,
    do_sample=True
)

llm = HuggingFacePipeline(pipeline=falcon_pipeline)

Device set to use cpu


# === 5. Prompt Template ===

In [14]:
custom_prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="""You are a helpful course advisor for the University of York.

Answer the question using ONLY the context provided. If the answer is not in the context, say:
"I don’t know based on the available information."

Context:
{context}

Question:
{question}

Answer:"""
)
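
To see what the model actually receives, the sketch below fills the same template text with Python's built-in `str.format`. This only approximates what `PromptTemplate` does with its default formatting; the context and question strings are placeholders, not scraped content.

```python
template = """You are a helpful course advisor for the University of York.

Answer the question using ONLY the context provided. If the answer is not in the context, say:
"I don’t know based on the available information."

Context:
{context}

Question:
{question}

Answer:"""

# Placeholder values; at runtime the chain supplies the retrieved chunks and the user's question.
filled = template.format(
    context="(retrieved chunks would be pasted here)",
    question="What accommodation options are there for first years?",
)
print(filled)
```

The prompt ends with `Answer:` so that the model's continuation is the answer itself.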


# === 6. RAG Pipeline ===

In [15]:
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(),
    chain_type="stuff",
    chain_type_kwargs={"prompt": custom_prompt}
)
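
Roughly, a "stuff" chain retrieves the most relevant chunks, pastes them all into the `{context}` slot of the prompt, and sends the result to the model in a single call. The sketch below mimics that flow with stand-in functions. `fake_retrieve`, `fake_llm`, and `answer` are hypothetical names, the blank-line separator between chunks is an assumption, and none of this is LangChain's actual implementation.

```python
def fake_retrieve(question, k=4):
    """Stand-in for the retriever: returns up to k canned chunks."""
    chunks = ["Chunk about applications.", "Chunk about accommodation."]
    return chunks[:k]

def fake_llm(prompt):
    """Stand-in for the language model: just reports the prompt length."""
    return f"(model output for a prompt of {len(prompt)} characters)"

def answer(question, prompt_template):
    docs = fake_retrieve(question)
    context = "\n\n".join(docs)  # "stuff" every retrieved chunk into one context
    prompt = prompt_template.format(context=context, question=question)
    return prompt, fake_llm(prompt)

prompt, reply = answer(
    "How do I apply?",
    "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:",
)
print(prompt)
print(reply)
```

Because every retrieved chunk shares one prompt, the combined length has to fit within the model's context window; smaller chunks or fewer retrieved documents keep the prompt short.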


# === 7. Gradio UI ===

In [16]:
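def chatbot(query):
    # RetrievalQA takes the question under "query" and returns the answer under "result"
    return qa_chain.invoke({"query": query})["result"]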

gr.Interface(
    fn=chatbot,
    inputs=gr.Textbox(lines=2, placeholder="Ask about York's courses, accommodation, or applications"),
    outputs="text",
    title="University of York Advisor Chatbot",
    description="Ask anything about studying at the University of York. Powered by Falcon3-1B-Instruct and University of York web content."
).launch()


It looks like you are running Gradio on a hosted a Jupyter notebook. For the Gradio app to work, sharing must be enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://d24c05b692762d0241.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)




In [22]:
# Complete README.md content for Hugging Face Spaces

readme_content = """
# 🎓 University of York Advisor Chatbot

This chatbot is designed to answer questions about undergraduate study, accommodation, and admissions at the University of York using real web content.

## 🔍 Features

- 💬 Chat with a helpful university advisor
- 📄 Uses real scraped pages from york.ac.uk
- 🔎 Retrieval-Augmented Generation (RAG)
- 🧠 Powered by Falcon3-1B-Instruct & LangChain
- ⚡ Fast semantic search with FAISS and BGE embeddings

## 🤖 Technologies

- **LLM**: [`tiiuae/Falcon3-1B-Instruct`](https://huggingface.co/tiiuae/Falcon3-1B-Instruct)
- **Embeddings**: [`BAAI/bge-base-en-v1.5`](https://huggingface.co/BAAI/bge-base-en-v1.5)
- **Framework**: LangChain
- **Vector Database**: FAISS
- **Interface**: Gradio UI

## 💡 How It Works

1. Scrapes selected pages from the University of York website
2. Splits content into overlapping 500-character chunks
3. Embeds chunks using BGE embeddings
4. Indexes using FAISS for fast similarity search
5. Retrieves relevant passages at runtime
6. Answers questions using Falcon3-1B-Instruct via a prompt template

## 🧪 Sample Questions

- What undergraduate courses are available at York?
- How do I apply as an international student?
- What are the accommodation options for first years?
- How much does it cost to live on campus?

## 🚀 How to Use (Locally or via HF Spaces)

1. Clone this repo or upload to Hugging Face Spaces (select **Gradio** SDK)
2. Add `requirements.txt` and `uoy_chatbot_hf.py`
3. Space will automatically build and deploy
4. Ask anything about the University of York 🎓

## 🧠 Author

Created by Dorcas Ojo as part of a beginner-friendly AI project.
"""

# Save to README.md
readme_path = os.path.join(drive_path, "README.md")
with open(readme_path, "w", encoding="utf-8") as f:  # the README contains emoji
    f.write(readme_content)

readme_path

'/content/drive/MyDrive/AMDARI/UniofYork_scraped_data/README.md'
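
The README asks for a `requirements.txt` alongside the app script, but the notebook does not create one. A possible starting point is sketched below. The package list is inferred from the imports and installs in this notebook, the last three entries are assumed indirect dependencies, no versions are pinned, and the helper name `write_requirements` is hypothetical.

```python
import os
import tempfile

# Packages the notebook imports or installs directly.
requirements = [
    "requests",
    "beautifulsoup4",
    "langchain",
    "langchain-community",
    "faiss-cpu",
    "transformers",
    "gradio",
    # Assumed indirect needs: HuggingFaceEmbeddings relies on
    # sentence-transformers, and device_map="auto" relies on accelerate.
    "sentence-transformers",
    "accelerate",
    "torch",
]

def write_requirements(folder, packages):
    """Write one package name per line to <folder>/requirements.txt."""
    path = os.path.join(folder, "requirements.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(packages) + "\n")
    return path

# Demo in a temporary folder; in the notebook you might pass drive_path instead.
req_path = write_requirements(tempfile.mkdtemp(), requirements)
with open(req_path, encoding="utf-8") as f:
    print(f.read())
```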
