### ***RAG Model***<br>
This notebook implements a retrieval-augmented generation (RAG) system that uses GPT-3.5 to answer queries over UMBC ISSS data.<br>
***Key Components:***
* ***Embeddings:***<br>
Generated using the sentence-transformers/all-mpnet-base-v2 model.<br>
* ***Hybrid Retrieval:***<br>
FAISS handles dense (embedding-based) retrieval and BM25 handles sparse keyword-based retrieval.<br>
The top results from both are merged, de-duplicated, and re-ranked with a cross-encoder.<br>
* ***Generation:***<br>
GPT-3.5 (gpt-3.5-turbo) answers each query using the top-ranked retrieved context.<br>
* ***Evaluation:***<br>
Tested on 20 FAQs with metrics like Cosine Similarity and BERTScore.

## Importing Libraries

In [1]:
!pip install openai==0.28
!pip install faiss-cpu rank-bm25 sentence-transformers
!pip install  bert-score

Collecting openai==0.28
  Downloading openai-0.28.0-py3-none-any.whl.metadata (13 kB)
Downloading openai-0.28.0-py3-none-any.whl (76 kB)
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 1.54.4
    Uninstalling openai-1.54.4:
      Successfully uninstalled openai-1.54.4
Successfully installed openai-0.28.0
Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Collecting rank-bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.5 MB)

## **Data Loading from Github**

In [2]:
import requests
import json

# GitHub repository details
repo_owner = "Sritejam"
repo_name = "datahub"
folder_path = "LLM/processed_documents"

# Function to fetch JSON files from GitHub API
def fetch_files_from_github_folder(owner, repo, folder):
    base_url = f"https://api.github.com/repos/{owner}/{repo}/contents/{folder}"
    try:
        response = requests.get(base_url, timeout=30)  # avoid hanging indefinitely on a stalled connection
        response.raise_for_status()
        file_urls = []
        for item in response.json():
            if item['type'] == 'file' and item['name'].endswith('.json'):
                file_urls.append(item['download_url'])
        return file_urls
    except requests.exceptions.RequestException as e:
        print(f"Error fetching folder contents from GitHub: {e}")
        return []


# Fetch list of JSON files in the folder
document_files = fetch_files_from_github_folder(repo_owner, repo_name, folder_path)

# Initialize a list to store all text chunks
documents = []

def process_chunk(chunk):
    processed_text = []
    for key, value in chunk.items():
        if isinstance(value, list):
            for item in value:
                if isinstance(item, dict):
                    section_info = "; ".join(f"{k}: {v}" for k, v in item.items())
                    processed_text.append(f"{key} - {section_info}")
        elif isinstance(value, dict):
            nested_info = "; ".join(f"{k}: {v}" for k, v in value.items())
            processed_text.append(f"{key} - {nested_info}")
        else:
            processed_text.append(f"{key}: {value}")

    return " | ".join(processed_text)


# FAQ URL
faq_url = "https://raw.githubusercontent.com/Sritejam/datahub/main/LLM/FAQ.json"

# fetch and parse JSON from raw GitHub URLs
def fetch_json_from_github(raw_url):
    try:
        response = requests.get(raw_url, timeout=30)  # avoid hanging indefinitely on a stalled connection
        response.raise_for_status()  # Check for HTTP errors
        return response.json()  # Parse JSON content
    except requests.exceptions.RequestException as e:
        print(f"Error fetching file from GitHub: {raw_url}")
        print(f"Error message: {e}")
        return None
    except json.JSONDecodeError as e:
        print(f"Error decoding JSON from file: {raw_url}")
        print(f"Error message: {e}")
        return None



# Fetch and process each document file
for file_url in document_files:
    data = fetch_json_from_github(file_url)
    if data:
        chunk_text = data.get('ChunkText')
        if isinstance(chunk_text, dict):
            documents.append(process_chunk(chunk_text))
        elif isinstance(chunk_text, str):
            documents.append(chunk_text)
        else:
            print(f"Loaded URL: {file_url}")

# Check the number of documents loaded
print(f"Loaded {len(documents)} documents.")

# Load preprocessed FAQs
faqs = fetch_json_from_github(faq_url)
if faqs:
    print(f"Loaded {len(faqs)} FAQs.")
else:
    print("Failed to load FAQs.")



Loaded URL: https://raw.githubusercontent.com/Sritejam/datahub/main/LLM/processed_documents/About_OISS.docx_chunk0.json
Loaded URL: https://raw.githubusercontent.com/Sritejam/datahub/main/LLM/processed_documents/Admitted_Student_Events_and_Orientation.docx_chunk0.json
Loaded URL: https://raw.githubusercontent.com/Sritejam/datahub/main/LLM/processed_documents/Admitted_Student_Events_and_Orientation.docx_chunk1.json
Loaded URL: https://raw.githubusercontent.com/Sritejam/datahub/main/LLM/processed_documents/Applying_to_UMBC.docx_chunk0.json
Loaded URL: https://raw.githubusercontent.com/Sritejam/datahub/main/LLM/processed_documents/Applying_to_UMBC.docx_chunk1.json
Loaded URL: https://raw.githubusercontent.com/Sritejam/datahub/main/LLM/processed_documents/CPT.docx_chunk0.json
Loaded URL: https://raw.githubusercontent.com/Sritejam/datahub/main/LLM/processed_documents/CPT.docx_chunk1.json
Loaded URL: https://raw.githubusercontent.com/Sritejam/datahub/main/LLM/processed_documents/CPT.docx_chu

In [3]:
documents[0]

'about oiss umbc office of international students and scholars the office of international students and scholars oiss is dedicated to supporting umbcs international student and scholar community. the office offers a variety of programming and services to assist students and scholars in pursuit of their academic, personal, and professional goals.'


## **Embeddings:**
Generated using the sentence-transformers/all-mpnet-base-v2 model.
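
The notebook indexes these embeddings with an L2 (Euclidean) FAISS index but reports cosine similarity during evaluation. For unit-length vectors the two measures are monotonically related, since ‖a − b‖² = 2 − 2·cos(a, b), so ranking by smallest L2 distance gives the same order as ranking by largest cosine similarity. The sketch below checks this on made-up toy vectors in plain Python. Whether the real embeddings are unit-normalized is an assumption worth verifying (for example by inspecting their norms); note also that FAISS's `IndexFlatL2` reports squared distances.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy vectors (illustrative only, not real embeddings)
query = normalize([1.0, 2.0, 0.5])
docs = [normalize(v) for v in ([1.0, 2.1, 0.4], [0.2, -1.0, 3.0], [2.0, 1.0, 0.0])]

# The identity holds for every unit-length pair
for d in docs:
    assert math.isclose(squared_l2(query, d), 2 - 2 * cosine(query, d), abs_tol=1e-9)

# Both measures therefore produce the same ranking
by_l2 = sorted(range(len(docs)), key=lambda i: squared_l2(query, docs[i]))
by_cos = sorted(range(len(docs)), key=lambda i: cosine(query, docs[i]), reverse=True)
print(by_l2 == by_cos)
```

If the embeddings turn out not to be normalized, the two rankings can differ, and an inner-product index over normalized vectors would be the closer match to cosine similarity.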

## **Improving Retrieval with FAISS**
1. Hybrid Retrieval (Sparse + Dense):
Combine FAISS with a traditional lexical search such as BM25. Sparse methods like BM25 match exact keywords well, while dense FAISS retrieval captures semantic similarity. Candidates from both can then be scored and passed to a re-ranker to order the final results.
al performance by expanding the coverage of relevant content.
4. Re-Ranking:
Re-rank the top-k retrieved documents with a cross-encoder (a fine-tuned BERT-style model) so that the most relevant document appears first.
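
One common way to "score documents from both" retrievers without a learned model is reciprocal rank fusion (RRF). This is not what the notebook does (it takes the union of the BM25 and FAISS top-k and re-ranks it with a cross-encoder); the sketch below is only an illustration of the idea. The function name, the toy document ids, and the constant `k=60` (a conventional default) are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical top-3 document ids from each retriever
bm25_ranking = [3, 1, 7]
faiss_ranking = [1, 4, 3]

fused = reciprocal_rank_fusion([bm25_ranking, faiss_ranking])
print(fused)
```

Because RRF uses only ranks, it sidesteps the problem that BM25 scores and FAISS distances live on different scales. A cross-encoder can still be applied to the fused list afterwards.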

In [4]:
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer, CrossEncoder
from rank_bm25 import BM25Okapi
from sklearn.metrics.pairwise import cosine_similarity  # used inside hybrid_retrieval

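# Load models: a bi-encoder for embeddings and cross-encoders for re-ranking

embedding_model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')  # loaded, but hybrid_retrieval below uses bert_reranker only
bert_reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-12-v2')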

print("Models loaded successfully!")


# Tokenize the documents for BM25
tokenized_documents = [doc.split(" ") for doc in documents]

# Initialize BM25
bm25 = BM25Okapi(tokenized_documents)

# Generate embeddings and build an exact (brute-force) FAISS index using L2 distance
def prepare_flat_faiss_index(documents):
    # Encode all documents in one batched call; result has shape (n_docs, embedding_dim)
    embeddings = embedding_model.encode(documents, convert_to_numpy=True).astype("float32")
    embedding_dim = embeddings.shape[1]

    # Create a FlatL2 index
    index = faiss.IndexFlatL2(embedding_dim)  # L2 distance
    index.add(embeddings)  # Add all document vectors to the index
    print(f"Flat FAISS index built with {index.ntotal} vectors.")
    return index, embeddings

# Build FAISS FlatL2 index
faiss_index, document_embeddings = prepare_flat_faiss_index(documents)

# Map for document metadata retrieval
document_metadata = {i: {"Text": doc} for i, doc in enumerate(documents)}



# Hybrid Retrieval
def hybrid_retrieval(query, top_k=3):
    # Sparse Retrieval with BM25
    bm25_scores = bm25.get_scores(query.split(" "))
    top_bm25_indices = np.argsort(bm25_scores)[::-1][:top_k]
    sparse_results = [{"Text": documents[idx], "BM25_Score": bm25_scores[idx]} for idx in top_bm25_indices]

    # Dense Retrieval with FAISS
    query_embedding = embedding_model.encode(query, convert_to_tensor=False).astype("float32").reshape(1, -1)
    distances, indices = faiss_index.search(query_embedding, top_k)
    dense_results = [
        {"Text": document_metadata[idx]["Text"], "FAISS_Distance": distances[0][i]}
        for i, idx in enumerate(indices[0])
    ]

    # Combine Sparse and Dense Results

    seen_texts = set()
    combined_results = []
    for result in sparse_results + dense_results:
        if result["Text"] not in seen_texts:
            combined_results.append(result)
            seen_texts.add(result["Text"])

    # Re-rank results using BERT
    rerank_inputs = [(query, result["Text"]) for result in combined_results]
    rerank_scores = bert_reranker.predict(rerank_inputs)
    for i, result in enumerate(combined_results):
        result["Rerank_Score"] = rerank_scores[i]
        # Calculate cosine similarity
        doc_embedding = embedding_model.encode(result["Text"], convert_to_tensor=False).astype("float32").reshape(1, -1)
        result["Cosine_Similarity"] = cosine_similarity(query_embedding, doc_embedding)[0][0]


    # Sort by Rerank_Score
    final_results = sorted(combined_results, key=lambda x: x["Rerank_Score"], reverse=True)[:top_k]
    return final_results





Models loaded successfully!
Flat FAISS index built with 493 vectors.
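
The BM25 step above tokenizes with `split(" ")`, which is sensitive to case and punctuation: the query token `OPT?` will not match a document token `opt`. A possible refinement is a small normalizing tokenizer applied to both documents and queries. The sketch below is an assumption, not part of the original pipeline; the `tokenize` helper and its regular expression are illustrative choices.

```python
import re

# Lowercase alphanumeric tokens, keeping inner hyphens so forms like "i-20" stay whole
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")

def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens."""
    return TOKEN_PATTERN.findall(text.lower())

query = "Do I need a job before applying for STEM OPT?"

# Plain whitespace splitting keeps case and trailing punctuation
print(query.split(" ")[-1])

# Normalized tokens match regardless of case or punctuation
print(tokenize(query)[-2:])
print(tokenize("request your OPT I-20"))
```

To use it, the same function would replace `doc.split(" ")` when building the BM25 index and `query.split(" ")` at query time, so that both sides are normalized identically.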


## **Using GPT-3.5 as Generator**

In [5]:
from bert_score import score
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np

In [6]:
# set up OpenAI
import os
import openai
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")


openai.api_key = os.environ["OPENAI_API_KEY"]

OpenAI API Key:··········


In [7]:
# GPT-3.5 Generator Function
def generate_response(input_text, retrieved_context, max_length=100):
    #  message for GPT
    messages1 = [
        {"role": "system", "content": "You are a professional assistant who provides concise and formal answers."},
        {"role": "system", "content": f"Use the following context to answer the question:\n{retrieved_context}"},
        {"role": "user", "content": input_text}
    ]

    # OpenAI API call
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages1,
        max_tokens=max_length,
        temperature=0.2
    )

    return response['choices'][0]['message']['content'].strip()

# Combine retrieved documents into context for the generator
def generate_response_with_retrieved_docs(query, retrieved_docs, top_k=3):
    retrieved_context = "\n".join([f"Context: {doc['Text']}" for doc in retrieved_docs[:top_k]])
    input_text = f"Answer the following query: {query}"
    #print(f"Input to GPT-3.5 model:\n{input_text[:750]}...\n")  # Print the first 750 characters of the input.
    generated_answer = generate_response(input_text, retrieved_context)
    return generated_answer
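
The generator above concatenates every retrieved text into the prompt. With long chunks this can crowd out the question or exceed the model's context window. A minimal sketch of one safeguard follows, assuming a simple character budget as a rough stand-in for a token budget; `build_context` and `max_chars` are hypothetical names, not part of the notebook.

```python
def build_context(retrieved_docs, max_chars=2000):
    """Join retrieved texts in rank order, stopping before the budget is exceeded."""
    parts, used = [], 0
    for rank, doc in enumerate(retrieved_docs, start=1):
        entry = f"Context {rank}: {doc['Text']}"
        cost = len(entry) + (1 if parts else 0)  # +1 for the joining newline
        if used + cost > max_chars:
            break
        parts.append(entry)
        used += cost
    return "\n".join(parts)

# Toy documents in the same {"Text": ...} shape that hybrid_retrieval returns
docs = [{"Text": "a" * 50}, {"Text": "b" * 50}, {"Text": "c" * 50}]
context = build_context(docs, max_chars=130)
print(context.count("Context"))
```

Because results are consumed in rank order, the lowest-ranked passages are dropped first. A tokenizer-based budget would be more precise than counting characters.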


In [8]:
# Example: Run the hybrid retrieval and use generator for response
query = "Do I need a job before applying for STEM OPT?"

# Perform hybrid retrieval
retrieved_docs = hybrid_retrieval(query, top_k=3)

# Generate a detailed answer based on retrieved results
generated_answer = generate_response_with_retrieved_docs(query, retrieved_docs, top_k=3)

print("\nGenerated Answer:\n", generated_answer)



Generated Answer:
 Yes, you need to have a job offer from an employer enrolled and participating in the E-Verify program before applying for STEM OPT.


In [9]:
# Example: Run the hybrid retrieval and use generator for response
query = "When is Large Language Models course next semester at UMBC? Give details about it."

# Perform hybrid retrieval
retrieved_docs = hybrid_retrieval(query, top_k=3)

# Generate a detailed answer based on retrieved results
generated_answer = generate_response_with_retrieved_docs(query, retrieved_docs, top_k=3)

print("\nGenerated Answer:\n", generated_answer)


Generated Answer:
 The Large Language Models course (Course Code: DATA 690) at UMBC is scheduled for next semester. The course is offered in Section 08 on Tuesdays from 7:10 PM to 9:40 PM in Fine Arts 001. The instructor is Donghwa Kim. It is a 3-unit course with 25 total seats, of which 17 are currently available. The instruction mode is hybrid.


## **Evaluation**
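
BERTScore compares a generated answer with a reference by greedily matching token embeddings: precision averages, over candidate tokens, the best cosine similarity to any reference token; recall does the same over reference tokens; F1 is their harmonic mean. The sketch below shows that computation on made-up 2-D vectors. The real `bert_score` package uses contextual embeddings from a pretrained model and offers optional IDF weighting and baseline rescaling, all of which are omitted here; `greedy_match_scores` is a hypothetical helper name.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def greedy_match_scores(candidate_vecs, reference_vecs):
    """Return (precision, recall, f1) from greedy cosine matching of token vectors."""
    precision = sum(
        max(cosine(c, r) for r in reference_vecs) for c in candidate_vecs
    ) / len(candidate_vecs)
    recall = sum(
        max(cosine(r, c) for c in candidate_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy "token embeddings": the candidate covers two of the three reference tokens exactly
candidate = [(1.0, 0.0), (0.0, 1.0)]
reference = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

p, r, f1 = greedy_match_scores(candidate, reference)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

In this toy case precision is perfect because every candidate token has an exact match, while recall is lower because one reference token is only partially covered.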

In [10]:
def evaluate_all_faqs_with_mean(faqs, top_k=3):
    results = []

    for idx, faq in enumerate(faqs):  # Iterate over all FAQs
        query = faq["question"]
        gold_answer = faq["answer"]
        print(f"\nProcessing FAQ #{idx + 1}")
        print(f"Query: {query}")

        # Step 1: Retrieve context
        retrieved_results = hybrid_retrieval(query, top_k=top_k)

        # Extract cosine similarities
        cosine_similarities = [result["Cosine_Similarity"] for result in retrieved_results]

        # Compute retrieval metric (mean similarity score)
        retrieval_score = np.mean(cosine_similarities)
        print(f"Retrieval Metric (Mean Cosine Similarity): {retrieval_score:.4f}")

        # Use the top retrieved context for generation
        retrieved_context = " ".join([result["Text"] for result in retrieved_results])

        # Debug: Retrieved results with cosine similarities
        print("Top Retrieved Results with Cosine Similarities:")
        for i, (text, cos_sim) in enumerate(zip([r["Text"] for r in retrieved_results], cosine_similarities)):
            print(f"Top-{i+1} Text: {text}\nCosine Similarity: {cos_sim:.4f}\n")

        # Step 2: Generate response using the retrieved context
        generated_response = generate_response(input_text=query, retrieved_context=retrieved_context)

        # Debug: Generated response
        print(f"Generated Response:\n{generated_response}")

        # Step 3: Evaluate the generated response against the gold answer
        P, R, F1 = score([generated_response], [gold_answer], lang="en", verbose=False)

        # Debug: BERTScore metrics
        print(f"BERTScore - Precision: {P.mean().item():.4f}, Recall: {R.mean().item():.4f}, F1: {F1.mean().item():.4f}")

        # Collect results
        results.append({
            "query": query,
            "gold_answer": gold_answer,
            "retrieved_context": retrieved_context,
            "generated_response": generated_response,
            "retrieval_score": retrieval_score,  # Mean cosine similarity
            "bertscore_precision": P.mean().item(),
            "bertscore_recall": R.mean().item(),
            "bertscore_f1": F1.mean().item(),
        })

    # Convert results to DataFrame for better visualization
    results_df = pd.DataFrame(results)

    # Compute mean scores for each metric
    mean_retrieval_score = results_df["retrieval_score"].mean()
    mean_precision = results_df["bertscore_precision"].mean()
    mean_recall = results_df["bertscore_recall"].mean()
    mean_f1 = results_df["bertscore_f1"].mean()

    # Print mean scores
    print("\nMean Scores Across All 20 FAQs:")
    print(f"Mean Retrieval Score (Cosine Similarity): {mean_retrieval_score:.4f}")
    print(f"Mean BERTScore Precision: {mean_precision:.4f}")
    print(f"Mean BERTScore Recall: {mean_recall:.4f}")
    print(f"Mean BERTScore F1: {mean_f1:.4f}")

    return results_df, {
        "mean_retrieval_score": mean_retrieval_score,
        "mean_precision": mean_precision,
        "mean_recall": mean_recall,
        "mean_f1": mean_f1,
    }


## **Results**

In [11]:
# Run the evaluation and get results
faq_results, mean_scores = evaluate_all_faqs_with_mean(faqs, top_k=3)


# Display mean scores
print("\nMean Scores:")
print(mean_scores)



Processing FAQ #1
Query: What services does the ISSS office provide?
Retrieval Metric (Mean Cosine Similarity): 0.4759
Top Retrieved Results with Cosine Similarities:
Top-1 Text: the office offers a variety of programming and services to assist students and scholars in pursuit of their academic, personal, and professional goals.
Cosine Similarity: 0.4582

Top-2 Text: isss office closed on thursday, november 7 if you have a question or concern, please submit a help ticket below and our team will respond as soon as we can. thank you! help ticket the isss office does not respond to emails. instead, we use a help ticket system to send and receive messages.
Cosine Similarity: 0.5398

Top-3 Text: about oiss umbc office of international students and scholars the office of international students and scholars oiss is dedicated to supporting umbcs international student and scholar community. the office offers a variety of programming and services to assist students and scholars in pursuit of th

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BERTScore - Precision: 0.9095, Recall: 0.9127, F1: 0.9111

Processing FAQ #2
Query: How can I contact the ISSS office?
Retrieval Metric (Mean Cosine Similarity): 0.4730
Top Retrieved Results with Cosine Similarities:
Top-1 Text: if you have entered the address perfectly and it is not accepted, please contact us on isss website your name, birthday, or other personal details that you can see in the portal are incorrect: please contact us if you notice any errors to information in the portal that you do not have access to.
Cosine Similarity: 0.6152

Top-2 Text: isss office closed on thursday, november 7 if you have a question or concern, please submit a help ticket below and our team will respond as soon as we can. thank you! help ticket the isss office does not respond to emails. instead, we use a help ticket system to send and receive messages.
Cosine Similarity: 0.6122

Top-3 Text: if you are in a position to use the cap gap, please contact isss so we can update your I-20 with the cap 

BERTScore - Precision: 0.8825, Recall: 0.8866, F1: 0.8846

Processing FAQ #3
Query: How do I request an I-20?
Retrieval Metric (Mean Cosine Similarity): 0.6201
Top Retrieved Results with Cosine Similarities:
Top-1 Text: step 1: request your OPT I-20 students must first request a special OPT I-20 from the isss office. applications submitted to uscis without an OPT I-20 will be denied. students must submit a complete OPT request in the isss portal. first, students choose an OPT start date.
Cosine Similarity: 0.6486

Top-2 Text: requesting an I-20 in the isss portal once admitted, you will receive an email from our office directing you to complete a new student request application.
Cosine Similarity: 0.7354

Top-3 Text: students can request another stem OPT I-20 and submit a new i-765 application if there is enough time before the OPT end date. contact isss immediately if your application is denied, and do not resubmit until speaking with an isss advisor. approval the approval notice will

BERTScore - Precision: 0.8533, Recall: 0.8749, F1: 0.8640

Processing FAQ #4
Query: What should I do if my visa expires while in the U.S.?
Retrieval Metric (Mean Cosine Similarity): 0.5281
Top Retrieved Results with Cosine Similarities:
Top-1 Text: if your visa is expired and you are outside of the us, you must visit a US embassy to get a new f-1 visa before returning to the us. to renew your visa, you will actually apply for a new one, following the same process as the first time you applied for an f-1 visa.
Cosine Similarity: 0.6530

Top-2 Text: if you renew your passport, but your visa stamp is in the old passport, you can carry both passports with you until the visa stamp expires and you get a new visa in the new passport.
Cosine Similarity: 0.3880

Top-3 Text: please remember that you do not need a valid visa to remain the US your visa only allows you to enter the us. while you are here, your I-20 maintains your immigration status, not your visa.
Cosine Similarity: 0.5434

Generat

BERTScore - Precision: 0.8817, Recall: 0.8959, F1: 0.8888

Processing FAQ #5
Query: What is SEVIS, and how do I pay the SEVIS fee?
Retrieval Metric (Mean Cosine Similarity): 0.5783
Top Retrieved Results with Cosine Similarities:
Top-1 Text: 2. after you decide which school you will attend and have an I-20 from that school, you will need to pay the i-901 sevis fee for that sevis id found a the top right of the i-20. this is a one-time fee, which is used to maintain the immigration database that manages international student and scholar information.
Cosine Similarity: 0.6069

Top-2 Text: please note you will need to take your sevis fee receipt to your visa appointment. 3. next, complete a ds-160 visa application form, the ds-160 has a US $185 application fee, which you often cannot pay until you actually schedule your visa appointment. each embassy handles this a bit differently.
Cosine Similarity: 0.5955

Top-3 Text: sevp portal help the sevp portal is a tool students use to report thei

BERTScore - Precision: 0.8817, Recall: 0.9386, F1: 0.9092

Processing FAQ #6
Query: What are the full-time enrollment requirements?
Retrieval Metric (Mean Cosine Similarity): 0.5978
Top Retrieved Results with Cosine Similarities:
Top-1 Text: enrollment requirements: must enroll in half of full-time enrollment at least 5 credits for graduate students, at least 6 credits for undergraduate students requirements to be approved: email or letter from course instructor or academic advisor, recommending that the student be allowed to drop the class for one of the following reasons: improper course placement this option is appropriate for students who were advised inappropriately and have taken classes out of sequence, or who were inappropriately allowed to take a course for which they did not have the prerequisite knowledge to be successful.
Cosine Similarity: 0.6302

Top-2 Text: summer and winter session enrollment is optional it is not required unless you plan to complete your academic progr

BERTScore - Precision: 0.8624, Recall: 0.9180, F1: 0.8893

Processing FAQ #7
Query: Can I take a Reduced Course Load (RCL)?
Retrieval Metric (Mean Cosine Similarity): 0.6860
Top Retrieved Results with Cosine Similarities:
Top-1 Text: possible exceptions to full-time enrollment requirements the immigration language for receiving permission to enroll less than full-time is a reduced course load rcl. any rcl must be approved in advance with isss, before the semester begins in which you plan to be enrolled part-time.
Cosine Similarity: 0.6431

Top-2 Text: 2. medical reduced course load rcl duration: approved one semester at a time, can be granted until the deadline to drop classes during a semester limitations: up to 12 months cumulative during one degree program enrollment requirements: enrollment can be part-time or not at all requirements to be approved: you must provide a note from your doctor indicating their recommendation that you do not study full-time this semester based on a medi

BERTScore - Precision: 0.8412, Recall: 0.8768, F1: 0.8587

Processing FAQ #8
Query: Can I take online classes?
Retrieval Metric (Mean Cosine Similarity): 0.4916
Top Retrieved Results with Cosine Similarities:
Top-1 Text: online courses can be taken as much as you like in summer or winter sessions, when enrollment is not required, unless you plan to complete your academic program during a summer term, in which case you must follow all full-time enrollment rules and are only permitted to take at most one fully-online course.
Cosine Similarity: 0.6302

Top-2 Text: for example, an undergraduate could take one online course for 3 credits, 3 regular courses for 9 credits, and thats the required 12. beyond the required 12, an additional online course could be taken.
Cosine Similarity: 0.4989

Top-3 Text: undergraduate students- undergrads are required to enroll in a minimum of 12 credits per semester. one class is typically 3 or 4 credits. online coursework immigration regulations allow for o

BERTScore - Precision: 0.8853, Recall: 0.8791, F1: 0.8822

Processing FAQ #9
Query: What is CPT, and how do I apply?
Retrieval Metric (Mean Cosine Similarity): 0.6173
Top Retrieved Results with Cosine Similarities:
Top-1 Text: cpt: off-campus work authorization curricular practical training cpt allows f-1 international students to work off-campus in paid positions within the us. cpt requires an application completed by the student, which can be submitted via the isss portal. once approved, isss will add cpt work authorization to the students immigration record, and issue a new form I-20 noting the work authorization.
Cosine Similarity: 0.6694

Top-2 Text: cpt approval is granted per semester if a position continues to the next semester, students will need to request cpt again for the next term. students are eligible to start working from the date they are approved by isss. isss cannot approve cpt for dates that have passed. please plan ahead and apply early!
Cosine Similarity: 0.6013



BERTScore - Precision: 0.8682, Recall: 0.9181, F1: 0.8924

Processing FAQ #10
Query: What is OPT, and when can I apply?
Retrieval Metric (Mean Cosine Similarity): 0.6523
Top Retrieved Results with Cosine Similarities:
Top-1 Text: opt: post-graduation work authorization students who plan on working in a paid position after completing their program must receive work authorization. OPT optional practical training is a type of work permission available for f-1 international students who finish their program requirements. OPT is approved by u.s. citizen and immigration services uscis, with support from the university isss office.
Cosine Similarity: 0.6594

Top-2 Text: eligibility to be eligible for opt, students must meet following criteria: currently enrolled at umbc full time with active f-1 status completed two consecutive academic semesters fall and spring in the US completed all program requirements for degree exceptions for graduate students who only have a thesisdissertation remainin

BERTScore - Precision: 0.8472, Recall: 0.8837, F1: 0.8651

Processing FAQ #11
Query: What is STEM OPT, and how is it different?
Retrieval Metric (Mean Cosine Similarity): 0.6398
Top Retrieved Results with Cosine Similarities:
Top-1 Text: stem OPT follows federal and school regulations: all work done pursuant to stem OPT must be directly related to the students major program of study. students must first request and receive a special stem OPT I-20 from the isss office before submitting their application to uscis. stem OPT is granted only once per degree level.
Cosine Similarity: 0.6765

Top-2 Text: eligibility to be eligible for stem opt, students must meet following criteria: currently in active f-1 status currently participating in one year of regular OPT completed all program requirements for a stem-eligible degree pursuing work with an employer enrolled and participating in the e-verify program have not been previously granted optstem OPT at the current level of study application pr

BERTScore - Precision: 0.8342, Recall: 0.8750, F1: 0.8541

Processing FAQ #12
Query: What documents do I need to travel internationally?
Retrieval Metric (Mean Cosine Similarity): 0.2097
Top Retrieved Results with Cosine Similarities:
Top-1 Text: travel abroad and visa renewal during OPT travelling abroad during OPT or stem OPT you can absolutely travel abroad during opt, but the there are a few additional pieces to prepare than the typical documents required for current students.
Cosine Similarity: 0.0867

Top-2 Text: request a travel signature: please submit an I-20 reprint request via the isss portal please see our website specific to travel and visa renewal during i-515 temporary entry if you are missing documents when you try to enter the us, the cbp officer can choose to refuse you entry to the us, or give you an i-515 document allowing you a temporary stay in the us, during which time you will need to provide the missing documents.
Cosine Similarity: 0.2453

Top-3 Text: if flyin

BERTScore - Precision: 0.8468, Recall: 0.9133, F1: 0.8788

Processing FAQ #13
Query: Can I travel with an expired visa?
Retrieval Metric (Mean Cosine Similarity): 0.4275
Top Retrieved Results with Cosine Similarities:
Top-1 Text: short trips to canada, mexico or some caribbean islands with an expired f-1 visa automatic visa revalidation students who have a valid I-20 and are in good f-1 status, but whose visa has expired, are able to take short trips less than 30 days to countries neighboring the US to canada, mexico and some caribbean islands.
Cosine Similarity: 0.4090

Top-2 Text: travel abroad and visa renewal your f-1 immigration status makes it easy for you to take vacations and short trips abroad, outside of the us! make sure to plan ahead for smooth travels. dont forget your I-20 travel signature! travel checklist passport check your passport to make sure it is valid. you cant travel at all without a valid passport!
Cosine Similarity: 0.3951

Top-3 Text: if you renew your passpo

BERTScore - Precision: 0.8429, Recall: 0.9109, F1: 0.8756

Processing FAQ #14
Query: What is Automatic Visa Revalidation?
Retrieval Metric (Mean Cosine Similarity): 0.4995
Top Retrieved Results with Cosine Similarities:
Top-1 Text: please read more about automatic visa revalidation, and see the list of eligible destinations please also note this rule is only for re-entry to the us, and does not play a role in allowing you to enter the country you plan to visit.
Cosine Similarity: 0.5777

Top-2 Text: f-1 visa process a visa is a document that allows you to enter the US in a certain kind of status, and determines the restrictions and opportunities you have in the US based on the visa category you enter in, as well as how long you can stay in the us.
Cosine Similarity: 0.4691

Top-3 Text: administrative processing simply means that the consular officer feels they need more information to make a decision on your visa application. administrative processing is more common for academic progra

BERTScore - Precision: 0.8775, Recall: 0.9026, F1: 0.8899

Processing FAQ #15
Query: What are my options after graduation?
Retrieval Metric (Mean Cosine Similarity): 0.3037
Top Retrieved Results with Cosine Similarities:
Top-1 Text: to this extent, students that have completed their required coursework but have not defended their thesis and graduated can choose to end their I-20 and apply for opt.
Cosine Similarity: 0.3765

Top-2 Text: stem OPT extension students who have graduated from umbc with a dhs-designated stem degree can extend their initial year of OPT optional practical training via the stem OPT extension. this type of work authorization is available to f-1 international students who have completed their program requirements and have already been approved for an initial period of regular opt.
Cosine Similarity: 0.3580

Top-3 Text: f-1 immigration rules work options for f-1 students billing and health insurance about US and umbc culture programs and services available on campu

BERTScore - Precision: 0.8414, Recall: 0.8670, F1: 0.8540

Processing FAQ #16
Query: What is the grace period after completing my program?
Retrieval Metric (Mean Cosine Similarity): 0.4383
Top Retrieved Results with Cosine Similarities:
Top-1 Text: students can request another OPT I-20 and submit a new i-765 application if their 60 day grace period has not ended. contact the isss office about any questions or concerns. approval the approval notice will list the approved OPT start and end dates. however, students cannot begin working until their physical ead card arrives in the mail.
Cosine Similarity: 0.4789

Top-2 Text: please note, you must work with us to get your new I-20 before your current I-20 end date, plus the 60 day grace period.
Cosine Similarity: 0.3398

Top-3 Text: I-20 program extension the program dates specified in the program of study section of your form I-20 reflect the typical duration for completing your degree program. however, certain circumstances may require ad

BERTScore - Precision: 0.9062, Recall: 0.8633, F1: 0.8842

Processing FAQ #17
Query: How do I access the ISSS Portal?
Retrieval Metric (Mean Cosine Similarity): 0.4388
Top Retrieved Results with Cosine Similarities:
Top-1 Text: when a student receives an OPT recommendation from our office, we send out an email with instructions on how to set up an isss portal account using a personal email address. students on OPT will need to login using their personal email address to access the isss portal.
Cosine Similarity: 0.5286

Top-2 Text: how to access your I-20 how to access your signed electronic umbc form i-20: request your I-20 in the isss portal the umbcs isss team will contact you via email when your I-20 is ready. it will be attached to your original request under the documents tab.
Cosine Similarity: 0.3747

Top-3 Text: if you have entered the address perfectly and it is not accepted, please contact us on isss website your name, birthday, or other personal details that you can see in 

BERTScore - Precision: 0.8437, Recall: 0.8522, F1: 0.8479

Processing FAQ #18
Query: What should I do if I lose my passport or I-20?
Retrieval Metric (Mean Cosine Similarity): 0.4879
Top Retrieved Results with Cosine Similarities:
Top-1 Text: request a travel signature: please submit an I-20 reprint request via the isss portal please see our website specific to travel and visa renewal during i-515 temporary entry if you are missing documents when you try to enter the us, the cbp officer can choose to refuse you entry to the us, or give you an i-515 document allowing you a temporary stay in the us, during which time you will need to provide the missing documents.
Cosine Similarity: 0.5139

Top-2 Text: copy of new I-20 issued by international advisor recommending economic hardship. checkmoney order for $410 only made out to the department of homeland security. do not abbreviate 2 passport size photographs. copy of your I-94 front and back of card, or printed record from cbp website. copy

BERTScore - Precision: 0.9008, Recall: 0.8991, F1: 0.8999

Processing FAQ #19
Query: How do I get a Social Security Number (SSN)?
Retrieval Metric (Mean Cosine Similarity): 0.6128
Top Retrieved Results with Cosine Similarities:
Top-1 Text: ssn are issued by the united states social security administration ssa. to start the process, students should submit a ssn letter request in the isss portal. after a student has an I-20 with cpt approval and a receipt from the ssa, they present these documents to their employers human resources department.
Cosine Similarity: 0.6255

Top-2 Text: applying for a social security number ssn an ssn is required of any individual in order to get paid in the us.
Cosine Similarity: 0.6088

Top-3 Text: when you receive your first job offer, you will need to apply for a social security number ssn if you dont already have one. an ssn is required to be paid in the us. to learn more about what an ssn is and how to apply, please review the information in this websit

BERTScore - Precision: 0.8383, Recall: 0.8806, F1: 0.8589

Processing FAQ #20
Query: What should I do if my I-20 is about to expire?
Retrieval Metric (Mean Cosine Similarity): 0.6087
Top Retrieved Results with Cosine Similarities:
Top-1 Text: deadline please note that your form I-20 can only be extended up until the current end date. after your current I-20 end date passes, we can no longer extend the date into the future. it is important to plan ahead for this deadline.
Cosine Similarity: 0.5679

Top-2 Text: please note, you must work with us to get your new I-20 before your current I-20 end date, plus the 60 day grace period.
Cosine Similarity: 0.6465

Top-3 Text: step 3: request an I-20 reprint if needed you will need an updated I-20 if: your employer requests it from you you need to renew your drivers license you plan on travelling abroad in the next 6 months note: students on OPT will not be able to access the isss portal using their myumbc credentials.
Cosine Similarity: 0.6116



BERTScore - Precision: 0.8523, Recall: 0.8815, F1: 0.8666

Mean Scores Across All 20 FAQs:
Mean Retrieval Score (Cosine Similarity): 0.5194
Mean BERTScore Precision: 0.8649
Mean BERTScore Recall: 0.8915
Mean BERTScore F1: 0.8778

Mean Scores:
{'mean_retrieval_score': 0.51935446, 'mean_precision': 0.8648564994335175, 'mean_recall': 0.891491037607193, 'mean_f1': 0.8777661472558975}
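
The per-FAQ "Retrieval Metric (Mean Cosine Similarity)" appears to be the plain average of the cosine similarities of the retrieved chunks: the three values printed for FAQ #20 (0.5679, 0.6465, 0.6116) average to the reported 0.6087. The sketch below reproduces that check. It assumes top-3 retrieval, and `mean_cosine_similarity` is a hypothetical helper name for illustration, not a function from this notebook.

```python
def mean_cosine_similarity(similarities):
    """Average the cosine similarities of the chunks retrieved for one query."""
    if not similarities:
        raise ValueError("at least one similarity score is required")
    return sum(similarities) / len(similarities)

# Values copied from the FAQ #20 output above.
faq20_similarities = [0.5679, 0.6465, 0.6116]
print(f"{mean_cosine_similarity(faq20_similarities):.4f}")  # 0.6087
```

A low average such as FAQ #12's 0.2097 therefore signals that none of the retrieved chunks is semantically close to the query, even when the generated answer still scores well on BERTScore.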


In [12]:
# Display mean scores
print("\nMean Scores:")
print(mean_scores)


Mean Scores:
{'mean_retrieval_score': 0.51935446, 'mean_precision': 0.8648564994335175, 'mean_recall': 0.891491037607193, 'mean_f1': 0.8777661472558975}
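
For a single answer/reference pair, BERTScore F1 is the harmonic mean of precision and recall, so the per-FAQ values above can be sanity-checked by hand. The aggregate `mean_f1`, by contrast, looks like the average of the 20 per-FAQ F1 scores rather than the harmonic mean of `mean_precision` and `mean_recall`, which would explain the small gap between the two. A minimal sketch of both checks, assuming no baseline rescaling was applied; `f1_from_pr` and `reported_scores` are hypothetical names introduced here:

```python
def f1_from_pr(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Per-FAQ check using the rounded values printed for FAQ #20.
faq20_f1 = f1_from_pr(0.8523, 0.8815)
print(f"FAQ #20 F1 recomputed: {faq20_f1:.4f} (printed: 0.8666)")

# Aggregate check using the values from the mean scores dictionary above.
reported_scores = {
    "mean_precision": 0.8648564994335175,
    "mean_recall": 0.891491037607193,
    "mean_f1": 0.8777661472558975,
}
harmonic_of_means = f1_from_pr(
    reported_scores["mean_precision"], reported_scores["mean_recall"]
)
print(f"Harmonic mean of the means: {harmonic_of_means:.4f}")
print(f"Reported mean_f1:           {reported_scores['mean_f1']:.4f}")
```

Because the printed precision and recall are rounded to four decimals, the recomputed per-FAQ F1 can differ from the printed one in the last digit.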
