# USF GenAI Lab 1: **Admissions Chatbot for USF**
**Instructor:** Dr. Alla Abdella  
**Course:** EEL 6935 & EEL 4935: Advanced Generative AI Development  
**Department:** Electrical Engineering, University of South Florida (USF)



## Lab Overview

In this lab, you will **build an intelligent USF Admissions Chatbot** that helps prospective students learn about the University of South Florida by:
- Scraping and processing **public USF web content**.
- Prompting local LLMs using **Ollama**.
- Evaluating model responses using structured techniques.




## What You'll Do
- 🔹 Replace each code cell's **TODO scaffold** with your implementation.
- 🔹 Write **clean, documented, PEP8-compliant code**.
- 🔹 Cite all external sources used in your work.
- 🔹 Remove any `NotImplementedError` lines after completing tasks.



## Learning Objectives
By the end of this lab, you will be able to:

1. **Collect & preprocess** public USF information responsibly.
2. **Engineer prompts** that integrate retrieved context.
3. **Query local LLMs** using **Ollama** and **swap between models** for experiments.
4. **Compare models** on:
   - Response quality  
   - Response length  
   - Execution speed  
5. **Evaluate answers** using:
   - A structured **rubric**
   - An **LLM-as-judge** evaluation approach  
6. **Generate citations** or extract **references** for factual claims.



##  Prerequisites
> Before starting, ensure:
- **Ollama** is installed and running locally:  
  `http://localhost:11434`
- The models you plan to test are downloaded:  
  ```bash
  ollama pull model_name:model_size
  ```

## What to Submit

Please include the following in your final submission:

1. **Completed Notebook**: All **TODOs** must be addressed.
2. **Error-Free Execution**: Ensure the notebook runs **from start to finish** without issues.
3. **Final Comparison Table**: Submit a CSV file.
4. **Summary of Your Approach**: Briefly explain your methodology and any **assumptions made**.
5. **Insights on the following**
   - Which **model performed best** and why.
   - Where **citations succeeded or failed**.
   - How you would **improve grounding and evaluation** in the future.


##  Academic Integrity & Ethics
- Use **only publicly accessible USF pages**.
- **Do NOT** bypass authentication, scrape personal data, or overload servers.
- If you change URLs, **limit changes to ≤ 3** and keep requests minimal.
- Always **cite the exact pages** used when making factual claims.


## Extensions *(Optional)*
Push your analysis further by trying one or more of these:

- Add **another USF page** to your scraping and re-evaluate the results.
- Swap in a **bigger or smarter local model** and compare performance.
- Use a **separate LLM as the judge** to evaluate responses independently.


> **Tip:** Going beyond the base requirements can strengthen your understanding of **LLM evaluation techniques** and **prompt engineering strategies**.


## 0) Setup & Imports
Import the libraries used throughout the notebook (install any that are missing first).

In [25]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re
import numpy as np
import requests
import time
import pandas as pd
from bs4 import BeautifulSoup
from typing import List
import warnings
warnings.filterwarnings('ignore')


## 1) Configuration
Change the `MODELS` list to compare different local models. All requests go to Ollama's `/api/generate` endpoint.

In [26]:
MODELS = [
    "llama3.2:3b",
    "mistral:7b",
    "llama3.1:8b"
]

# USF pages to scrape (keeping it minimal and respectful)
USF_URLS = [
    "https://www.usf.edu/admissions/freshmen/admission-information/academic-requirements.aspx",
    "https://www.usf.edu/about-usf/",
    "https://www.usf.edu/facilities/service-center/index.aspx"
]

# Configuration constants
OLLAMA_BASE_URL = "http://localhost:11434/api/generate"
DEFAULT_TEMPERATURE = 0
MAX_CONTEXT_CHARS = 3000
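
Before running the comparison, it can help to confirm that every configured model has actually been pulled. The sketch below assumes Ollama's `/api/tags` endpoint, which lists the models available locally, and the default local address. The helper names `fetch_local_models` and `missing_models` are illustrative and not part of the lab scaffold.

```python
import json
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def fetch_local_models(url=OLLAMA_TAGS_URL, timeout=5):
    """Return the names of the models the local Ollama server reports."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    return [m["name"] for m in payload.get("models", [])]


def missing_models(configured, available):
    """Return configured model names that are not available locally."""
    available_set = set(available)
    return [name for name in configured if name not in available_set]
```

With the server running, `missing_models(MODELS, fetch_local_models())` would return the names that still need an `ollama pull`.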

## 2) Responsible Scraper
We'll fetch a few USF pages, strip boilerplate, and keep clean text for prompting. **Do not** overload servers; keep requests minimal and cache results.

In [27]:
def scrape_usf_pages(urls: List[str]):
    """Scrape main content from a list of USF web pages.

    Args:
        urls: List of URLs to scrape

    Returns:
        Dictionary mapping URLs to their scraped content and metadata
    """
    pages = {}

    for url in urls:
        try:

            # Add delay to be respectful to servers
            time.sleep(1)

            response = requests.get(url, timeout=10)
            response.raise_for_status()

            soup = BeautifulSoup(response.text, 'html.parser')

            # Remove scripts, styles, navigation, and footer elements
            for element in soup(['script', 'style', 'nav', 'footer', 'header']):
                element.decompose()

            # Try to find main content area
            main_content = soup.find('main') or soup.find('div', class_=re.compile(r'content|main'))
            if not main_content:
                main_content = soup.find('body')

            # Extract clean text
            if main_content:
                text = main_content.get_text(separator=' ', strip=True)
                # Clean up whitespace
                text = re.sub(r'\s+', ' ', text)
                text = text.strip()
            else:
                text = ""

            pages[url] = {
                'content': text,
                'title': soup.title.string if soup.title else url,
                'length': len(text)
            }

        except Exception as e:
            # Record the failure without passing error text to the model as context
            print(f"Warning: could not scrape {url}: {e}")
            pages[url] = {
                'content': "",
                'title': url,
                'length': 0
            }

    return pages
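
The section asks for cached results, so here is one possible way to avoid re-downloading a page on every run. It is a minimal sketch, assuming a simple on-disk cache keyed by a hash of the URL. `cached_fetch`, the `fetch_fn` callback, and the `page_cache` directory are hypothetical names introduced for illustration.

```python
import hashlib
from pathlib import Path


def cached_fetch(url, fetch_fn, cache_dir="page_cache"):
    """Return page text for `url`, calling `fetch_fn` only on a cache miss."""
    cache_path = Path(cache_dir)
    cache_path.mkdir(parents=True, exist_ok=True)

    # Hash the URL so it becomes a safe, fixed-length file name.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cache_file = cache_path / f"{key}.txt"

    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")

    # fetch_fn is any callable that downloads and cleans one page.
    text = fetch_fn(url)
    cache_file.write_text(text, encoding="utf-8")
    return text
```

The trade-off is staleness: cached text will not reflect later changes to the page until the cache directory is cleared.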

## 3) System Prompt & Model Client
We'll use the instructor's system prompt and a thin client that requests complete (non-streaming) responses from Ollama.

In [28]:
class USFUniversity:
    """Client for interacting with Ollama-hosted LLMs for USF admissions chatbot."""

    def __init__(self, model: str = "llama3.2:3b", temperature: float = DEFAULT_TEMPERATURE):
        """Initialize the USFUniversity class.

        Args:
            model: Name of the Ollama model to use
            temperature: Sampling temperature for generation
        """
        self.model = model
        self.temperature = temperature
        self.base_url = OLLAMA_BASE_URL

        # System prompt for USF admissions chatbot
        self.system_prompt = """You are a helpful and knowledgeable USF (University of South Florida) admissions assistant.
        Your job is to:
        1. Answer students' questions about USF using **only the information provided in the context**
        2. Write confidently and naturally, **as if you are an expert at USF**
        3. Use **plain bullet points (•) or (-)** for all items; do not use `+` or nested bullets
        4. **Include citation numbers like [1], [2]** only after facts directly supported by the context
        5. Include a **"References" section** with only the **used** citations in the response mapped to URLs.
        6. If a question cannot be answered from the context, say so clearly (e.g., "That information is not available in the current context.")
        7. Never refer to "the context" or "the text above"; just present the information directly, as if it's from your own knowledge backed by citations
        8. Avoid filler like "According to the context provided" or "Based on the information above".
        9. Be concise but comprehensive."""

    def generate(self, prompt):
        """Generate text based on the provided prompt.

        Args:
            prompt: User prompt to generate response for

        Returns:
            Dictionary containing response, timing, and metadata
        """
        full_prompt = f"{self.system_prompt}\n\nUser: {prompt}\n\nAssistant:"
        
        payload = {
            "model": self.model,
            "prompt": full_prompt,
            # Ollama reads sampling settings from "options", not the top level
            "options": {"temperature": self.temperature},
            "stream": False
        }

        start_time = time.time()

        try:
            response = requests.post(
                self.base_url,
                json=payload,
                timeout=300  # fail instead of hanging if the server stalls
            )
            response.raise_for_status()

            result = response.json()
            end_time = time.time()

            return {
                'response': result.get('response', ''),
                'model': self.model,
                'latency_seconds': end_time - start_time,
                'success': True
            }

        except Exception as e:
            end_time = time.time()
            return {
                'response': f"Error generating response: {str(e)}",
                'model': self.model,
                'latency_seconds': end_time - start_time,
                'success': False
            }
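
Ollama's `/api/generate` also accepts a separate `system` field and an `options` object for sampling settings such as temperature. The sketch below shows a request body that uses both, as an alternative to concatenating the system prompt into `prompt`. `build_generate_payload` is a hypothetical helper, not part of the lab scaffold.

```python
def build_generate_payload(model, prompt, system_prompt, temperature=0.0):
    """Build a non-streaming request body for Ollama's /api/generate."""
    return {
        "model": model,
        "prompt": prompt,
        # System message sent separately from the user prompt
        "system": system_prompt,
        # Sampling settings are nested under "options"
        "options": {"temperature": temperature},
        # Ask for one JSON object instead of a stream of chunks
        "stream": False,
    }


payload = build_generate_payload(
    "llama3.2:3b", "What is USF?", "You are a USF admissions assistant."
)
print(payload["options"])
```

The resulting dictionary can be passed as `json=payload` to `requests.post`.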

## 4) Build the Context Prompt from Scraped Pages
Create a compact **context block** with key facts from the scraped pages. Keep it under ~2–3k characters to avoid overlong prompts.

In [29]:
def build_context_block(pages, max_chars = MAX_CONTEXT_CHARS):
    """Build a context block from the main content of the provided pages.
    
    Args:
        pages: Dictionary of scraped page data
        max_chars: Maximum characters to include in context
        
    Returns:
        Formatted context string with source references
    """
    per_page_context_length = max_chars // max(1, len(pages))  # Per-page budget; guards against an empty dict
    context = "USF INFORMATION CONTEXT\n"
    
    for i, (url, page_data) in enumerate(pages.items(), 1):
        # Add source reference
        context += f"[{i}] Source: {page_data['title']} ({url})\n"
        
        # Add content (truncated if needed)
        content = page_data['content']
        if len(content) > per_page_context_length: 
            mid_content = len(content)/2
            start_content = int(mid_content-per_page_context_length/2)
            end_content = int(mid_content+per_page_context_length/2)
            content = content[start_content:end_content] + "..."
        
        context += f"Content: {content}\n"
    
    context += "END CONTEXT\n"
    
    return context
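
To make the truncation arithmetic concrete, here is a small standalone sketch of the middle-window slice. With a 3000-character budget split across three pages, each page gets 1000 characters, and a 5000-character page keeps positions 2000 to 3000. `middle_window` is an illustrative name, and the 5000-character page is a stand-in rather than real scraped content.

```python
def middle_window(text, width):
    """Return up to `width` characters centred on the midpoint of `text`."""
    if len(text) <= width:
        return text
    start = (len(text) - width) // 2
    return text[start:start + width]


page = "x" * 5000          # stand-in for a 5,000-character page
budget = 3000 // 3         # MAX_CONTEXT_CHARS split across three pages
snippet = middle_window(page, budget)
print(len(snippet))        # 1000
```

A window taken from the middle is simple but blind to content: requirements listed near the top of a page fall outside it.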

## 5) Prompt Template (You will edit this)
Write a task asking the assistant to summarize USF research strengths **with citations** pointing to the scraped sources.

**Your Task:**
1. Edit the `USER_TASK` below to ask a precise question.
2. Keep: request for concise bullets + **cite specific URLs** from the context.
3. Avoid claims not grounded in the context block.

In [30]:
def build_user_prompt(context_block, task):
    """Build the complete user prompt with context and task.
    
    Args:
        context_block: Formatted context from scraped pages
        task: Specific task/question for the model
        
    Returns:
        Complete formatted prompt
    """
    user_prompt = f"""{context_block}

TASK: {task}

INSTRUCTIONS:
- Use ONLY the information provided in the USF Information Context above.
- Format your response with clear bullet points.
- Cite sources using [1], [2], etc. format
- Include a "References" section at the end mapping numbers to URLs used in the response. Do not add references which are not used in the response. 
- If information is not available in the context, state that clearly.
- Be concise but comprehensive.


Your response:"""
    
    return user_prompt

## 6) Run Multiple Models & Collect Results
We'll loop through `MODELS`, ask the same question, and capture answer, tokens (approx.), and latency.

In [31]:
def approx_token_count(text):
    return len(text) // 4

def ask_models(models, context_block, task, temperature = DEFAULT_TEMPERATURE):
    """Ask the specified models to perform the given task using the provided context.
    
    Args:
        models: List of model names to query
        context_block: Context information from scraped pages
        task: Task/question to ask models
        temperature: Sampling temperature
        
    Returns:
        DataFrame with results from all models
    """
    results = []
    user_prompt = build_user_prompt(context_block, task)
    
    for model in models:        
        client = USFUniversity(model=model, temperature=temperature)
        result = client.generate(user_prompt)
        
        # Calculate metrics
        response_length = len(result['response'])
        response_tokens = approx_token_count(result['response'])
        
        results.append({
            'model': model,
            'response': result['response'],
            'latency_seconds': result['latency_seconds'],
            'response_length_chars': response_length,
            'response_tokens_approx': response_tokens,
            'success': result['success']
        })
    
    return pd.DataFrame(results)
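
The character-based token estimate is rough. A non-streaming Ollama response also reports `eval_count` (generated tokens) and `eval_duration` (in nanoseconds), which allow an exact count and a throughput figure. This is a minimal sketch, assuming the raw response JSON is kept rather than only its `response` text. `generation_stats` is a hypothetical helper and the sample values are made up.

```python
def generation_stats(result):
    """Extract token count and tokens/second from an Ollama response dict."""
    tokens = result.get("eval_count")
    duration_ns = result.get("eval_duration")
    if not tokens or not duration_ns:
        # Fields missing (e.g. failed request): report nothing rather than guess
        return {"response_tokens": None, "tokens_per_second": None}
    return {
        "response_tokens": tokens,
        "tokens_per_second": round(tokens / (duration_ns / 1e9), 2),
    }


# Made-up sample shaped like a non-streaming /api/generate response
sample = {"response": "...", "eval_count": 120, "eval_duration": 4_000_000_000}
print(generation_stats(sample))
```

Tokens per second separates model speed from answer length, which wall-clock latency alone does not.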

## 7) Evaluation Rubric + LLM-as-Judge
Score each answer on a 1–5 scale for:
- **Groundedness** (stays within the provided context)
- **Specificity** (concrete facts vs. vagueness)
- **Citations** (uses and maps [1], [2] to URLs)
- **Clarity** (readable bullets)

**LLM-as-Judge:** You can ask a judge model to score answers using the same rubric. (By default, it uses the **same** model list; feel free to pick a separate judge model.)
A good paper for learning about LLM-as-Judge prompts: https://arxiv.org/pdf/2306.05685


In [32]:
def judge_answer(model, context_block, answer):
    """Judge the quality of the generated answer based on the context.
    
    Args:
        model: Model name to use as judge
        context_block: Original context provided
        answer: Generated answer to evaluate
        
    Returns:
        Dictionary with scores for different criteria (1-5 scale)
    """
    judge_prompt = f"""You are an expert evaluator. Please score the following answer on a scale of 1-5 for each criterion:

CONTEXT PROVIDED:
{context_block}

ANSWER TO EVALUATE:
{answer}

Please rate the answer on these criteria (1=Poor, 2=Below Average, 3=Average, 4=Good, 5=Excellent):

1. GROUNDEDNESS: Does the answer stick to information in the provided context?
2. SPECIFICITY: Are the facts concrete and specific rather than vague?
3. CITATIONS: Are sources properly cited with [1], [2] format and References has only the cited urls?
4. CLARITY: Is the answer well-organized and easy to read?

Respond ONLY with four numbers separated by commas (e.g., "4,3,5,4"):
"""
        
    try:
        payload = {
            "model": model,
            "prompt": judge_prompt,
            # Ollama reads sampling settings from "options", not the top level
            "options": {"temperature": DEFAULT_TEMPERATURE},
            "stream": False
        }

        response = requests.post(
                OLLAMA_BASE_URL,
                json=payload
                #timeout=60
            )
        response.raise_for_status()

        result = response.json().get('response','')

        # Correct parsing
        score_text = result.strip()
        scores = [int(x.strip()) for x in score_text.split(',')[:4]]

        return {
            'groundedness': scores[0] if len(scores) > 0 else 3,
            'specificity': scores[1] if len(scores) > 1 else 3,
            'citations': scores[2] if len(scores) > 2 else 3,
            'clarity': scores[3] if len(scores) > 3 else 3
        }

    except Exception:
        # Default scores if evaluation fails
        return {'groundedness': 3, 'specificity': 3, 'citations': 3, 'clarity': 3}

def evaluate_all(results_df, judge_model=None):
    """Evaluate all results in the DataFrame.
    
    Args:
        results_df: DataFrame with model results
        judge_model: Model to use for evaluation (defaults to first model)
        
    Returns:
        DataFrame with evaluation scores added
    """
    if judge_model is None:
        judge_model = results_df.iloc[0]['model']
        
    judged_scores = []
    
    for idx, row in results_df.iterrows():
        context = row.get("context", "")
        answer = row.get("response", "")

        scores = judge_answer(judge_model, context, answer)
        
        judged_scores.append(scores)
        time.sleep(1)

    results_df["llm_judged_scores"] = judged_scores
    return results_df
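
Judge models do not always return a bare `4,3,5,4`. Extra words or trailing punctuation make `int()` fail, and the fallback then silently records all 3s. The sketch below shows one more tolerant way to parse the reply, returning `None` so failures stay visible. `parse_judge_scores` is an illustrative helper. It still assumes the first four standalone digits in the 1 to 5 range are the scores, so a numbered or `4/5`-style reply would confuse it.

```python
import re


def parse_judge_scores(text, n_scores=4):
    """Return the first `n_scores` standalone digits in 1-5, or None if too few."""
    # Lookarounds skip digits that are part of a longer number such as "10".
    digits = re.findall(r"(?<!\d)[1-5](?!\d)", text)
    if len(digits) < n_scores:
        return None
    return [int(d) for d in digits[:n_scores]]


print(parse_judge_scores("Scores: 4, 3, 5, 4."))
print(parse_judge_scores("I cannot rate this answer."))
```

Returning `None` instead of a default score lets the results table distinguish "judged as average" from "judge output could not be parsed".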

## 8) Citations Function
This function attempts to **extract citations** in the `[n]` pattern from an answer and map them to URLs found in the **References** section. If none found, it heuristically matches page URLs to the answer via fuzzy keyword overlap (very naive).

In [33]:
def extract_citations(answer):
    """Extract in-text citations in [n] format from the main body of the answer (ignores References section)."""

    # Only keep content before the References section (case-insensitive)
    main_text = re.split(r'\breferences\b', answer, flags=re.IGNORECASE)[0]
    
    # Find all [n] patterns
    citations = re.findall(r'\[(\d+)\]', main_text)
    
    return [int(c) for c in citations]

def citations_report(answer, pages):
    """Generate a report of citations for the given answer and the reference page data.
    
    Args:
        answer: Model-generated answer that may contain [n] citations
        pages: Scraped USF page content, dict with URL keys
        
    Returns:
        Dictionary with citation stats
    """
    # Extract only citations from the main body (not from the References section)
    citations_found = extract_citations(answer)

    # Check if a References section is present (case-insensitive)
    has_references = bool(re.search(r'\breferences\b', answer, flags=re.IGNORECASE))

    # Map citations to URLs, assuming [1] -> first scraped URL, [2] -> second, etc.
    urls = list(pages.keys())
    citation_mapping = {}

    for cite_num in citations_found:
        if 1 <= cite_num <= len(urls):
            citation_mapping[cite_num] = urls[cite_num - 1]

    return {
        'citations_found': citations_found,
        'num_citations': len(citations_found),
        'has_references_section': has_references
        # 'citation_mapping': citation_mapping,
        # 'properly_cited': len(citation_mapping) > 0 and has_references
    }
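
Counting citations does not show whether they are consistent. One possible extension compares the `[n]` markers in the answer body with those listed under References and flags numbers that point beyond the scraped sources. This is a sketch under the assumption that the References section lists entries in the same `[n]` format. `citation_consistency` is a hypothetical helper and the sample answer is invented.

```python
import re


def citation_consistency(answer, num_sources):
    """Compare [n] markers in the answer body with those in its References section."""
    parts = re.split(r"\breferences\b", answer, maxsplit=1, flags=re.IGNORECASE)
    body = parts[0]
    refs = parts[1] if len(parts) > 1 else ""

    cited = {int(n) for n in re.findall(r"\[(\d+)\]", body)}
    listed = {int(n) for n in re.findall(r"\[(\d+)\]", refs)}
    valid = set(range(1, num_sources + 1))

    return {
        "cited_but_not_listed": sorted(cited - listed),
        "listed_but_not_cited": sorted(listed - cited),
        "out_of_range": sorted((cited | listed) - valid),
    }


# Invented answer: cites [1] and [4], lists [1] and [2], with 3 sources scraped
sample_answer = (
    "- USF is in Tampa [1]\n- It has labs [4]\n\n"
    "References\n[1] https://a\n[2] https://b"
)
report = citation_consistency(sample_answer, num_sources=3)
print(report)
```

Each of the three lists corresponds to a distinct failure mode: a missing reference entry, an unused reference entry, and a citation number with no matching source.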

## 9) Comparison Table (Final)
Join runtime stats and rubric scores into a single table and save to CSV for submission.

In [34]:
def return_scores(model, input_prompt, generated_output):
    """Return relevance, coherence, and factual accuracy scores for the generated output."""
    
    scores = {}
    
    # ---------- Relevance: TF-IDF cosine similarity ----------
    try:
        tfidf = TfidfVectorizer().fit_transform([input_prompt, generated_output])
        relevance_score = cosine_similarity(tfidf[0:1], tfidf[1:2])[0][0]
    except Exception:
        relevance_score = 0.0  # fallback for empty or invalid input
    scores['relevance'] = round(float(relevance_score), 3)
    
    # ---------- Coherence: based on structure ----------
    sentences = re.split(r'[.!?]', generated_output)
    sentences = [s.strip() for s in sentences if len(s.strip()) > 0]
    avg_sentence_len = np.mean([len(s.split()) for s in sentences]) if sentences else 0
    num_paragraphs = generated_output.count('\n\n') + 1

    # Normalize: we assume ideal sentence length ~12 words, paragraph count ~2+
    coherence_raw = min(1.0, (len(sentences)/3 + num_paragraphs/2 + avg_sentence_len/12) / 3)
    scores['coherence'] = round(coherence_raw, 3)
    
    # ---------- Factual Accuracy: based on citations ----------
    citations = extract_citations(generated_output)
    num_citations = len(citations)

    # Normalize: max 5 citations -> 1.0 score
    factual_score = min(1.0, num_citations / 5.0)

    # Add bonus if phrasing indicates attribution
    if re.search(r'\b(according to|based on|source|reference)\b', generated_output.lower()):
        factual_score += 0.1
    scores['factual_accuracy'] = round(min(1.0, factual_score), 3)

    return scores

## Pipeline 

In [35]:
def end_to_end_pipeline():
    """Run the entire end-to-end chatbot pipeline and validate results with a few prompts."""
    
    # Step 1: Scrape USF pages
    pages = scrape_usf_pages(USF_URLS)
    
    # Step 2: Build context
    context_block = build_context_block(pages)
    
    # Step 3: Define test prompts
    test_prompts = [
        "What are the admission requirements for undergraduate programs at USF?",
        "Can you provide information about the campus facilities at USF?",
        "What research opportunities are available for students at USF?",
        "How does USF support student mental health and well-being?",
        "What are the career services offered by USF to help students with job placements?"
    ]
    
    all_results = []
    
    # Step 4: Test each prompt
    for i, prompt in enumerate(test_prompts, 1):
        try:
            # Query models
            results = ask_models(MODELS, context_block, prompt)
            
            # Add prompt info
            results['prompt'] = prompt

            # Context info
            results['context'] = context_block
            
            # Quick evaluation
            for idx, row in results.iterrows():
                scores = return_scores(row['model'], prompt, row['response'])
                for key, value in scores.items():
                    results.loc[idx, key] = value
                
                # Citation analysis
                cite_report = citations_report(row['response'], pages)
                results.loc[idx, 'num_of_citations_found'] = cite_report['num_citations']
            
            all_results.append(results)
                
        except Exception as e:
            print(f"❌ Error processing prompt {i}: {str(e)}")
    
    # Step 5: Combine and save results
    if all_results:
        final_results = pd.concat(all_results, ignore_index=True)
        
        # Evaluate with LLaMA judge
        final_results = evaluate_all(final_results, judge_model='llama3.1:8b')
        
        # Reorder columns: prompt before response
        columns_order = ['model', 'prompt', 'response'] + [col for col in final_results.columns if col not in ['model', 'prompt', 'response']]
        final_results = final_results[columns_order]
        
        # Save to CSV
        output_file = "usf_chatbot_results_nishat.csv"
        final_results.to_csv(output_file, index=False)

        
# ==================== MAIN EXECUTION ====================

if __name__ == "__main__":
    # Run the complete pipeline
    end_to_end_pipeline()
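
The judge scores are stored as one dictionary per row, which ends up as a single string column in the CSV. For a comparison table it may be easier to have one column per criterion. The sketch below flattens the nested scores using only the standard library. `flatten_scores` is an illustrative helper and the sample row values are made up. In the notebook, `final_results.to_dict('records')` would supply `rows`.

```python
import csv
import io


def flatten_scores(rows, score_key="llm_judged_scores"):
    """Copy each row, replacing the nested score dict with one column per criterion."""
    flat_rows = []
    for row in rows:
        flat = {k: v for k, v in row.items() if k != score_key}
        for criterion, score in row.get(score_key, {}).items():
            flat[f"judge_{criterion}"] = score
        flat_rows.append(flat)
    return flat_rows


# Made-up sample row for illustration
rows = [{"model": "llama3.2:3b", "latency_seconds": 2.5,
         "llm_judged_scores": {"groundedness": 4, "clarity": 5}}]
flat = flatten_scores(rows)

# Write to an in-memory buffer; a real run would open a file instead
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(flat[0].keys()))
writer.writeheader()
writer.writerows(flat)
print(buffer.getvalue())
```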

## Report
To build the full pipeline, I followed these main steps:
1. Scraping: First, I scraped the content from a list of USF web pages that I selected.
2. Context Construction: Since we had to stay under a 3000-character limit for the context block, I took about 1000 characters from the middle of each page. I assumed the middle section would hold the most relevant information, although this can miss important details at the top or bottom of the page.
3. Prompting the Models: I created a complete prompt by combining a system prompt, the context block, and a user question with instructions. I used this prompt to query each model and collected the responses.
4. Evaluation: I logged each model's response and the associated metrics (latency, length, and scores) in the CSV file for evaluation.
5. LLM as judge: Finally, I used a single judge model (`llama3.1:8b`) to score all responses on groundedness, specificity, citation usage, and clarity.

**My insights:**

Out of all the models tested, `llama3.1:8b` gave the best overall performance. It was more consistent in grounding its answers in the context and was better at clarity and structure. It also handled citations more carefully, even though no model was perfect in that area. It also took the least time to generate responses across all tasks.

Some models correctly included citations when the answer clearly came from the context. There were still a bunch of issues:
1. Models sometimes included citations that weren't actually referenced in the answer.
2. Sometimes, they gave accurate information but didn't cite it at all.
3. In a few cases, I noticed hallucinated citations: numbers that didn't refer to anything real in the context.

To improve the pipeline, I may:
1. Try to include more relevant content per page (maybe by summarizing or ranking key sections instead of grabbing text from the middle, or by increasing the character limit).
2. Use a smarter method to extract context, such as keyphrase matching or a content ranker.
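
A minimal sketch of the ranking idea, assuming fixed-size chunks scored by how many words they share with the question. This is a deliberately naive stand-in for a real content ranker. The names `tokenize` and `top_chunks` and the chunk size are illustrative choices, not something tested in this lab.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_chunks(page_text, question, chunk_size=500, k=2):
    """Return the k fixed-size chunks sharing the most words with the question."""
    chunks = [page_text[i:i + chunk_size]
              for i in range(0, len(page_text), chunk_size)]
    q_words = tokenize(question)
    # sorted() is stable, so ties keep their original page order
    ranked = sorted(chunks, key=lambda c: len(tokenize(c) & q_words), reverse=True)
    return ranked[:k]


# Toy page: two 20-character chunks, only the second is about admissions
page = "parking dining maps!" + "admission gpa scores"
best = top_chunks(page, "What GPA is needed for admission?", chunk_size=20, k=1)
print(best)
```

Fixed-size chunks can cut sentences in half, so splitting on paragraphs or headings would likely work better on real pages.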

To help with this assignment, I referred to Claude AI, ChatGPT, and a few YouTube videos ([1], [2], [3]).

References:

[1] https://www.youtube.com/watch?v=bargNl2WeN4

[2] https://www.youtube.com/watch?v=xjA1HjvmoMY&t=369s

[3] https://www.youtube.com/watch?v=UtSSMs6ObqY