### USMLE

paper link: https://arxiv.org/html/2404.13343v1

The paper uses the USMLE dataset together with the authors' own expert difficulty annotations. The copy of the dataset they gave access to does not include any difficulty labels, so here we have only checked whether Gemini answers the questions correctly.

Based on that accuracy, we can estimate how well Gemini can predict the difficulty of the questions.

Both steps are implemented for the Chinese dataset below, where we evaluate Gemini's difficulty prediction as well as how it answers the questions, and from there estimate how well it can predict difficulty.
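One way to relate answer accuracy to difficulty, once difficulty labels are available, is to group questions by their labelled difficulty and compare accuracy across the groups. The following is a minimal sketch under that assumption; the function name and the parallel lists `y_true`, `y_pred` and `difficulty` are hypothetical and not taken from the paper.

```python
from collections import defaultdict

def accuracy_by_difficulty(y_true, y_pred, difficulty):
    """Return {difficulty_level: accuracy} for parallel lists of answers and labels."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred, level in zip(y_true, y_pred, difficulty):
        totals[level] += 1
        hits[level] += int(truth == pred)
    return {level: hits[level] / totals[level] for level in sorted(totals)}

# Toy data, invented for illustration only.
example = accuracy_by_difficulty(
    ["A", "B", "C", "D"],
    ["A", "B", "A", "D"],
    [2, 2, 8, 8],
)
print(example)
```

If accuracy falls as the labelled difficulty rises, the model's answer accuracy is behaving like a rough difficulty signal.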

In [18]:
import json
import re
import time
from sklearn.metrics import accuracy_score, classification_report
import google.generativeai as genai
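import os
# The API key is read from an environment variable (the name GEMINI_API_KEY is an assumption) rather than hard-coded.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])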

dataset = []
with open("imltest1.jsonl", "r",encoding="utf-8") as f:
    for i, line in enumerate(f):
        dataset.append(json.loads(line))
model = genai.GenerativeModel(
  model_name="gemini-2.0-flash",
)

def clean_prediction(pred):
    if not isinstance(pred, str):
        return "UNKNOWN"
    pred = pred.upper()
    match = re.search(r'\b([A-D])\b', pred)
    return match.group(1) if match else "UNKNOWN"

def ask_gemini(entry):
    question = entry["question"]
    options = entry["options"]
    formatted_options = "\n".join([f"{k}. {v}" for k, v in options.items()])

    prompt = (
        f"You are a medical student taking the USMLE Step exam.\n"
        f"Question:\n{question}\n\n"
        f"Choices:\n{formatted_options}\n\n"
        f"Which one is correct? Just reply with the letter only (A, B, C, or D)."
    )

    try:
        response = model.generate_content(prompt)
        raw_answer = response.text.strip()
        return clean_prediction(raw_answer)
    except Exception as e:
        print(f"Error on Gemini API call: {e}")
        return "UNKNOWN"

y_true = []
y_pred = []

for i, entry in enumerate(dataset):
    print(f"Processing Q{i + 1}/{len(dataset)}...")
    correct_answer = entry["answer_idx"].strip().upper()
    prediction = ask_gemini(entry)

    y_true.append(correct_answer)
    y_pred.append(prediction)

    if i < len(dataset) - 1:
        print("Waiting 10 seconds...")
        time.sleep(10)

# Print results
print("\n Final Evaluation (Normalized Predictions)")
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Classification Report:\n", classification_report(y_true, y_pred))

Processing Q1/100...
Waiting 10 seconds...
Processing Q2/100...
Waiting 10 seconds...
Processing Q3/100...
Waiting 10 seconds...
Processing Q4/100...
Waiting 10 seconds...
Processing Q5/100...
Waiting 10 seconds...
Processing Q6/100...
Waiting 10 seconds...
Processing Q7/100...
Waiting 10 seconds...
Processing Q8/100...
Waiting 10 seconds...
Processing Q9/100...
Waiting 10 seconds...
Processing Q10/100...
Waiting 10 seconds...
Processing Q11/100...
Waiting 10 seconds...
Processing Q12/100...
Waiting 10 seconds...
Processing Q13/100...
Waiting 10 seconds...
Processing Q14/100...
Waiting 10 seconds...
Processing Q15/100...
Waiting 10 seconds...
Processing Q16/100...
Waiting 10 seconds...
Processing Q17/100...
Waiting 10 seconds...
Processing Q18/100...
Waiting 10 seconds...
Processing Q19/100...
Waiting 10 seconds...
Processing Q20/100...
Waiting 10 seconds...
Processing Q21/100...
Waiting 10 seconds...
Processing Q22/100...
Waiting 10 seconds...
Processing Q23/100...
Waiting 10 seconds.

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
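The `_warn_prf` lines are the tail of a scikit-learn warning about an ill-defined precision or recall, which most likely means some label (for example the `UNKNOWN` fallback) appears on only one side of the comparison. A minimal sketch of reporting unparsed replies separately from wrong answers, using only the standard library; the function name and the returned keys are hypothetical:

```python
def summarize_answers(y_true, y_pred, unknown="UNKNOWN"):
    """Report overall accuracy, the number of unparsed replies, and accuracy on parsed replies."""
    total = len(y_true)
    answered = [(t, p) for t, p in zip(y_true, y_pred) if p != unknown]
    correct = sum(t == p for t, p in answered)
    return {
        "total": total,
        "unanswered": total - len(answered),
        "accuracy_overall": correct / total if total else 0.0,
        "accuracy_answered": correct / len(answered) if answered else 0.0,
    }

# Toy data, invented for illustration only.
summary = summarize_answers(["A", "B", "C", "D"], ["A", "UNKNOWN", "C", "B"])
print(summary)
```

Keeping the two accuracies apart shows whether a low score comes from wrong answers or from replies the letter regex could not parse.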


### Evaluation of question difficulty and answer prediction on the Chinese medical entrance exam

paper link: https://www.researchgate.net/publication/337015779_Question_Difficulty_Prediction_for_Multiple_Choice_Problems_in_Medical_Exams

In [1]:
import json
import re
import time
from sklearn.metrics import accuracy_score, classification_report
import google.generativeai as genai
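import os
# The API key is read from an environment variable (the name GEMINI_API_KEY is an assumption) rather than hard-coded.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])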

with open("imltest2.json", "r", encoding="utf-8") as f:
    data = json.load(f)
    dataset = data["example"][:200]  
model = genai.GenerativeModel(
  model_name="gemini-2.0-flash",
)

def ask_gemini(entry):
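    # English gloss of the Chinese prompt: "You are a biology teacher. Read the following question and,
    # based on its complexity, required knowledge, similarity of the distractor options, and reasoning
    # difficulty, rate its difficulty from 1 to 10 (1 = very easy, 10 = very hard).
    # Reply with a single number only, no explanation."
    prompt = (
        "你是一位生物老师。请阅读以下题目，并根据其复杂性、所需知识、迷惑选项的相似性和推理难度，"
        "将题目的难度评为 1 到 10（1 表示非常简单，10 表示非常困难）。\n"
        "请只回复一个数字，不要解释。\n\n"
        f"{entry['question']}"
    )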
    try:
        response = model.generate_content(prompt)
        raw = response.text.strip()
        match = re.search(r'\b(10|[1-9])\b', raw)
        return int(match.group(1)) if match else None
    except Exception as e:
        print(f"Error on Gemini API call: {e}")
        return None

y_true = []
y_pred = []

for i, entry in enumerate(dataset):
    print(f"Processing question {i + 1}/{len(dataset)}")

    true_score = int(entry.get("score", 0))
    pred_score = ask_gemini(entry)

    if pred_score is not None:
        y_true.append(true_score)
        y_pred.append(pred_score)
    else:
        print("Skipped (invalid prediction)")

    if i < len(dataset) - 1:
        time.sleep(10)  

print("\n Gemini Difficulty Prediction Evaluation")
# mean_squared_error returns the MSE; take the square root if an RMSE is wanted.
print("MSE:", mean_squared_error(y_true, y_pred))
print("MAE:", mean_absolute_error(y_true, y_pred))

Processing question 1/150
Processing question 2/150
Processing question 3/150
Processing question 4/150
Processing question 5/150
Processing question 6/150
Processing question 7/150
Processing question 8/150
Processing question 9/150
Processing question 10/150
Processing question 11/150
Processing question 12/150
Processing question 13/150
Processing question 14/150
Processing question 15/150
Processing question 16/150
Processing question 17/150
Processing question 18/150
Processing question 19/150
Processing question 20/150
Processing question 21/150
Processing question 22/150
Processing question 23/150
Processing question 24/150
Processing question 25/150
Processing question 26/150
Processing question 27/150
Processing question 28/150
Processing question 29/150
Processing question 30/150
Processing question 31/150
Processing question 32/150
Processing question 33/150
Processing question 34/150
Processing question 35/150
Processing question 36/150
Processing question 37/150
Processing

NameError: name 'mean_squared_error' is not defined

In [2]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

In [3]:
print("\n Gemini Difficulty Prediction Evaluation")
# mean_squared_error returns the MSE, so the value is labelled accordingly.
print("MSE:", mean_squared_error(y_true, y_pred))
print("MAE:", mean_absolute_error(y_true, y_pred))


 Gemini Difficulty Prediction Evaluation
MSE: 1.1333333333333333
MAE: 0.72
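Note that `mean_squared_error(y_true, y_pred)` with default arguments returns the mean squared error, so a root has to be taken explicitly to obtain an RMSE; the square root of the 1.1333 printed above is about 1.06. The three metrics can be sketched in plain Python as follows (the helper names and the toy data are illustrative, not part of the notebook):

```python
import math

def mse(y_true, y_pred):
    """Mean squared error over two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean absolute error over two equal-length sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy difficulty scores, invented for illustration only.
truth = [3, 5, 7, 4]
preds = [4, 5, 5, 4]
print(mse(truth, preds), rmse(truth, preds), mae(truth, preds))
```

RMSE is in the same units as the 1 to 10 difficulty scale, which makes it directly comparable to the MAE.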


In [25]:
import json
import re
import time
from sklearn.metrics import accuracy_score, classification_report
import google.generativeai as genai
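import os
# The API key is read from an environment variable (the name GEMINI_API_KEY is an assumption) rather than hard-coded.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])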

with open("imltest2.json", "r", encoding="utf-8") as f:
    data = json.load(f)
    dataset = data["example"][:200]  
model = genai.GenerativeModel(
  model_name="gemini-2.0-flash",
)

def clean_prediction(pred):
    if not isinstance(pred, str):
        return "UNKNOWN"
    pred = pred.upper()
    match = re.search(r'\b([A-D])\b', pred)
    return match.group(1) if match else "UNKNOWN"

def ask_gemini(entry):
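    # English gloss of the Chinese prompt: "You are a biology teacher; choose the most appropriate
    # answer from the options below. ... Reply with the option letter only (A, B, C, or D)."
    prompt = (
        "你是一名生物老师，请从以下选项中选择最合适的答案。\n"
        f"{entry['question']}\n"
        "请只回复选项字母（A、B、C 或 D）。"
    )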
    try:
        response = model.generate_content(prompt)
        raw = response.text.strip()
        return clean_prediction(raw)
    except Exception as e:
        print(f"Error on Gemini API call: {e}")
        return "UNKNOWN"

y_true = []
y_pred = []

for i, entry in enumerate(dataset):
    print(f"Processing question {i + 1}/{len(dataset)}")
    true_answer = entry["answer"][0].strip().upper()
    pred_answer = ask_gemini(entry)

    y_true.append(true_answer)
    y_pred.append(pred_answer)

    if i < len(dataset) - 1:
        print("Waiting 20 seconds...")
        time.sleep(20)

# Final evaluation
print("\n✅ Evaluation on Chinese Biology Dataset")
print("Accuracy:", accuracy_score(y_true, y_pred))
print("Classification Report:\n", classification_report(y_true, y_pred))


Processing question 1/150
Waiting 20 seconds...
Processing question 2/150
Waiting 20 seconds...
Processing question 3/150
Waiting 20 seconds...
Processing question 4/150
Waiting 20 seconds...
Processing question 5/150
Waiting 20 seconds...
Processing question 6/150
Waiting 20 seconds...
Processing question 7/150
Waiting 20 seconds...
Processing question 8/150
Waiting 20 seconds...
Processing question 9/150
Waiting 20 seconds...
Processing question 10/150
Waiting 20 seconds...
Processing question 11/150
Waiting 20 seconds...
Processing question 12/150
Waiting 20 seconds...
Processing question 13/150
Waiting 20 seconds...
Processing question 14/150
Waiting 20 seconds...
Processing question 15/150
Waiting 20 seconds...
Processing question 16/150
Waiting 20 seconds...
Processing question 17/150
Waiting 20 seconds...
Processing question 18/150
Waiting 20 seconds...
Processing question 19/150
Waiting 20 seconds...
Processing question 20/150
Waiting 20 seconds...
Processing question 21/150
Wa
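The loops above pause for a fixed 10 or 20 seconds between calls and record a failed call as `UNKNOWN` or `None`. An alternative is to retry a failed call with exponentially growing waits. A minimal sketch, assuming a hypothetical helper `call_with_retry` that wraps any zero-argument callable; it is not part of the `google.generativeai` API:

```python
import time

def call_with_retry(fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on an exception wait base_delay * 2**attempt and try again."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception as exc:
            if attempt == retries - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)

# Demo with a stand-in function that fails twice, then succeeds.
attempts = {"n": 0}

def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return "B"

waits = []  # record the waits instead of actually sleeping
result = call_with_retry(flaky, retries=4, base_delay=0.5, sleep=waits.append)
print(result, waits)
```

In the notebook this would be used as `call_with_retry(lambda: model.generate_content(prompt))`, so a transient error no longer costs a data point.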

### NEET 2017 and 2023

The difficulty labels here were obtained from YouTube videos and from human evaluation by students who are taking NEET this year, since once again there is no direct source of NEET questions paired with their difficulty.

In [1]:
import json
import re
import time
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report, mean_squared_error, mean_absolute_error
import google.generativeai as genai
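import os
# The API key is read from an environment variable (the name GEMINI_API_KEY is an assumption) rather than hard-coded.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])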
model = genai.GenerativeModel(
  model_name="tunedModels/neetv1-zj2zjpwyn6yl",
)

df = pd.read_csv("2023.csv")
# Normalise the headers first so the column lookups below are reliable.
df.columns = df.columns.str.replace('\xa0', ' ').str.strip()
df = df.dropna(subset=["Question"])
print("Columns:", df.columns.tolist())

def format_prompt(row):
    return f"""
You are an expert NEET exam question reviewer.

Question: {row['Question']}

A. {row['Option 1']}
B. {row['Option 2']}
C. {row['Option 3']}
D. {row['Option 4']}

Based on the above question, your task is to rate the DIFFICULTY of this multiple-choice question on a scale of 1 (very easy) to 10 (very difficult).

Consider the following when rating:
- Whether it requires memorization vs reasoning
- How confusing or close the incorrect options are
- Alignment with NEET/NCERT syllabus
- Average 12th-grade student performance

Return ONLY a single number from 1 to 10.
"""


def predict_difficulty_with_gemini(row):
    prompt = format_prompt(row)
    try:
        response = model.generate_content(prompt)
        score = int(response.text.strip())
        return max(1, min(10, score))  
    except Exception as e:
        print("Error:", e)
        return None

predictions = []

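for n, (idx, row) in enumerate(df.iterrows(), start=1):
    # idx is the DataFrame label, which can have gaps after dropna; n is the running count.
    print(f"Processing question {n}/{len(df)}...")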
    prediction = predict_difficulty_with_gemini(row)
    predictions.append(prediction)
    
    # Delay between calls
    time.sleep(10)

df["Predicted"] = predictions

df = df.dropna(subset=["Predicted"])

true = df["Difficulty"]
pred = df["Predicted"]

rmse = mean_squared_error(true, pred) ** 0.5  # mean_squared_error returns the MSE; take the root for RMSE
mae = mean_absolute_error(true, pred)

print(f" RMSE: {rmse:.2f}")
print(f" MAE: {mae:.2f}")



Columns: ['Question', 'Option 1', 'Option 2', 'Option 3', 'Option 4', 'Difficulty']
Processing question 1/136...
Processing question 2/136...
Processing question 3/136...
Processing question 4/136...
Error: invalid literal for int() with base 10: "I need more information to rate the question's difficulty. Please provide the following:\n\n* **List I and List II:** What are the items being matched?\n* **The Subject:**  What subject is this quest
Processing question 5/136...
Processing question 6/136...
Processing question 7/136...
Processing question 8/136...
Processing question 9/136...
Processing question 10/136...
Processing question 11/136...
Error: invalid literal for int() with base 10: 'I need more information to rate the difficulty of the question.  Please provide:\n\n* **List I:** What are the items in List I?\n* **List II:** What are the items in List II?\n\nWithout knowing the a
Processing question 12/136...
Processing question 13/136...
Processing question 14/136...
Processin
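The `invalid literal for int()` errors above occur because `int(response.text.strip())` only succeeds when the reply is a bare number. A more tolerant parser can pull the first in-range integer out of a free-text reply and return `None` otherwise. A minimal sketch, with a hypothetical helper name:

```python
import re

def parse_difficulty(text, low=1, high=10):
    """Return the first standalone integer in [low, high] found in text, else None."""
    if not isinstance(text, str):
        return None
    for token in re.findall(r"\b\d+\b", text):
        value = int(token)
        if low <= value <= high:
            return value
    return None

print(parse_difficulty("7"))
print(parse_difficulty("Difficulty: 10/10"))
print(parse_difficulty("I need more information to rate the question's difficulty."))
```

Replies that ask for more information, like the two shown in the log, would still be skipped, but without raising an exception.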

In [4]:
clean_true = []
clean_pred = []

for true_val, pred_val in zip(true, pred):
    try:
        pred_int = int(pred_val)
        clean_true.append(true_val)
        clean_pred.append(pred_int)
    except (ValueError, TypeError):
        continue  
from sklearn.metrics import mean_squared_error, mean_absolute_error

mse = mean_squared_error(clean_true, clean_pred)
mae = mean_absolute_error(clean_true, clean_pred)

print("Clean Evaluation")
print("RMSE:", mse**0.5)
print("MAE:", mae)


Clean Evaluation
RMSE: 3.032345948472818
MAE: 2.5609756097560976


In [5]:
import json
import re
import time
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report
import google.generativeai as genai
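import os
# The API key is read from an environment variable (the name GEMINI_API_KEY is an assumption) rather than hard-coded.
genai.configure(api_key=os.environ["GEMINI_API_KEY"])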
model = genai.GenerativeModel(
  model_name="tunedModels/neetv1-zj2zjpwyn6yl",
)

df = pd.read_excel("2017_tune.xlsx")
df = df.dropna(subset=["Question"])  
df.columns = df.columns.str.replace('\xa0', ' ')  
print("Columns:", df.columns.tolist())  

def format_prompt(row):
    return f"""
You are an expert NEET exam question reviewer.

Question: {row['Question']}

A. {row['Option A']}
B. {row['30 Hz']}
C. {row['Option C']}
D. {row['Option D']}

Based on the above question, your task is to rate the DIFFICULTY of this multiple-choice question on a scale of 1 (very easy) to 10 (very difficult).

Consider the following when rating:
- Whether it requires memorization vs reasoning
- How confusing or close the incorrect options are
- Alignment with NEET/NCERT syllabus
- Average 12th-grade student performance

Return ONLY a single number from 1 to 10.
"""

def predict_difficulty_with_gemini(row):
    prompt = format_prompt(row)
    try:
        response = model.generate_content(prompt)
        score = int(response.text.strip())
        return max(1, min(10, score)) 
    except Exception as e:
        print("Error:", e)
        return None

predictions = []

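for n, (idx, row) in enumerate(df.iterrows(), start=1):
    # idx is the DataFrame label, which can have gaps after dropna; n is the running count.
    print(f"Processing question {n}/{len(df)}...")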
    prediction = predict_difficulty_with_gemini(row)
    predictions.append(prediction)
    
    time.sleep(25)

df["Predicted"] = predictions
df = df.dropna(subset=["Predicted"])

# Evaluate Gemini's predictions
true = df["Difficulty (1-10)"]
pred = df["Predicted"]

rmse = mean_squared_error(true, pred) ** 0.5  # explicit root; avoids the squared= argument, which newer scikit-learn releases no longer accept
mae = mean_absolute_error(true, pred)

print(f" RMSE: {rmse:.2f}")
print(f" MAE: {mae:.2f}")



Columns: ['Question Number', 'Question', 'Option A', '30 Hz', 'Option C', 'Option D', 'Difficulty (1-10)']
Processing question 1/140...
Processing question 2/140...
Processing question 3/140...
Processing question 4/140...
Processing question 5/140...
Processing question 6/140...
Processing question 7/140...
Processing question 8/140...
Processing question 9/140...
Processing question 10/140...
Processing question 11/140...
Processing question 12/140...
Processing question 13/140...
Processing question 14/140...
Processing question 15/140...
Processing question 16/140...
Processing question 17/140...
Processing question 18/140...
Processing question 19/140...
Processing question 20/140...
Processing question 21/140...
Processing question 22/140...
Processing question 23/140...
Processing question 24/140...
Processing question 25/140...
Processing question 26/140...
Processing question 27/140...
Processing question 28/140...
Processing question 29/140...
Processing question 30/140...
Pr

NameError: name 'mean_squared_error' is not defined

In [7]:
from sklearn.metrics import mean_squared_error, mean_absolute_error
# mean_squared_error returns the MSE, so the value is labelled accordingly.
mse = mean_squared_error(true, pred)
mae = mean_absolute_error(true, pred)

print(f" MSE: {mse:.2f}")
print(f" MAE: {mae:.2f}")



 MSE: 3.06
 MAE: 1.35


### Question Generation

#### Machine Learning

Run the cell that defines `extract_chunks_from_pdf` first, and then the cell below it.

In [5]:
import os
import re
import textwrap
import pdfplumber
import numpy as np
import tensorflow_hub as hub
from sklearn.metrics.pairwise import cosine_similarity
import google.generativeai as genai

PDF_PATH = "imlbook.pdf"  
TOP_K = 3 
MODEL_NAME = "gemini-2.0-flash"
os.makedirs("static", exist_ok=True)

def extract_chunks_from_pdf(pdf_path, chunk_size=500):
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            page_text = page.extract_text()
            if page_text:
                text += page_text + "\n"

    paragraphs = re.split(r'\n\s*\n', text)
    chunks, current = [], ""
    for para in paragraphs:
        if len(current) + len(para) < chunk_size:
            current += " " + para
        else:
            chunks.append(current.strip())
            current = para
    if current.strip():
        chunks.append(current.strip())
    return chunks

embed_model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

def embed_texts(text_list):
    return embed_model(text_list).numpy()

def get_top_k_chunks(query, chunks, chunk_embeddings, k=TOP_K):
    query_embedding = embed_texts([query])[0]
    similarities = cosine_similarity([query_embedding], chunk_embeddings)[0]
    top_indices = similarities.argsort()[-k:][::-1]
    return [chunks[i] for i in top_indices]

def generate_mcq_from_chunks(topic, top_chunks):
    context = "\n\n".join(top_chunks)
    prompt = f"""
Use the following machine learning content to generate one high-quality multiple-choice question (MCQ) related to the topic: "{topic}"

Content:
{textwrap.shorten(context, width=4000)}

Format:
Question: ...
A. ...
B. ...
C. ...
D. ...
Answer: ...
"""
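    # The API key is read from an environment variable (the name GEMINI_API_KEY is an assumption) rather than hard-coded.
    genai.configure(api_key=os.environ["GEMINI_API_KEY"])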
    model = genai.GenerativeModel(model_name=MODEL_NAME)
    response = model.generate_content(prompt)
    return response.text.strip()
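`extract_chunks_from_pdf` splits on blank lines, and text extracted from a PDF may contain very few of them, which leaves only a handful of very large chunks for retrieval. An alternative is to chunk by word count with some overlap. A minimal sketch; the function name and the default sizes are assumptions, not values used in this notebook:

```python
def chunk_words(text, chunk_size=80, overlap=20):
    """Split text into windows of chunk_size words, each sharing `overlap` words with the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks

# Tiny example with ten placeholder words.
sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
```

The overlap keeps a sentence that straddles a boundary retrievable from either neighbouring chunk.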

In [6]:
if __name__ == "__main__":
    chunks = extract_chunks_from_pdf(PDF_PATH)
    print(f"Extracted {len(chunks)} chunks from PDF.")
    
    chunk_embeddings = embed_texts(chunks)
    
    # input() takes the prompt string directly; wrapping it in print() passed None as the prompt.
    topic = input("Please enter the Machine Learning topic you want a question on: ").strip()
    top_chunks = get_top_k_chunks(topic, chunks, chunk_embeddings)
    
    question = generate_mcq_from_chunks(topic, top_chunks)
    print("\nGenerated MCQ:\n")
    print(question)

Extracted 2 chunks from PDF.
Please enter desired Machine Learning topic on which you want to generate a question on. 


None Logistic regression



Generated MCQ:

Question: Which of the following is a primary characteristic of Logistic Regression?

A. It is primarily used for predicting continuous numerical values.
B. It directly outputs class labels (e.g., "spam" or "not spam").
C. It models the probability of a sample belonging to a certain class using a sigmoid function.
D. It is only applicable to datasets with a large number of features.

Answer: C
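Because the prompt fixes the output format (`Question:`, options `A.` to `D.`, `Answer:`), the generated text can be turned into a structured record for later difficulty or accuracy evaluation. A minimal sketch with a hypothetical `parse_mcq` helper; the sample text is invented for illustration:

```python
import re

def parse_mcq(text):
    """Parse 'Question: / A. / B. / C. / D. / Answer:' text into a dict; return None if any part is missing."""
    question = re.search(r"Question:\s*(.+)", text)
    options = dict(re.findall(r"^([A-D])\.\s*(.+)$", text, flags=re.MULTILINE))
    answer = re.search(r"Answer:\s*([A-D])\b", text)
    if not question or len(options) != 4 or not answer:
        return None
    return {
        "question": question.group(1).strip(),
        "options": {k: v.strip() for k, v in options.items()},
        "answer": answer.group(1),
    }

sample = """Question: Which function does logistic regression use to model class probability?

A. Step function
B. Sigmoid function
C. ReLU
D. Softmax over features

Answer: B"""
parsed = parse_mcq(sample)
print(parsed["answer"], parsed["options"]["B"])
```

Returning `None` for malformed output makes it easy to count how often the model departs from the requested format.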


#### NEET Questions

In [7]:
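import os
import google.generativeai as genai

# The API key is read from an environment variable (the name GEMINI_API_KEY is an assumption) rather than hard-coded.
GENAI_API_KEY = os.environ["GEMINI_API_KEY"]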
GENAI_MODEL = "tunedModels/neetv1-zj2zjpwyn6yl"

genai.configure(api_key=GENAI_API_KEY)
model = genai.GenerativeModel(model_name=GENAI_MODEL)

def build_prompt(topic, complexity):
    return f"""
You are an expert in NEET exam question creation.

Generate ONE high-quality NEET-style multiple-choice question.

Constraints:
- Topic: {topic}
- Complexity: {complexity} on a scale of 1-10, where 1 is the easiest and 10 is the hardest
- Format the output exactly as follows:

Question: ...
A. ...
B. ...
C. ...
D. ...
Answer: ...
"""

# --- Main ---
if __name__ == "__main__":
    topic = input("Enter NEET topic: ").strip()
    complexity = input("Enter difficulty (1-10): ").strip()

    prompt = build_prompt(topic, complexity)
    response = model.generate_content(prompt)

    print("\nGenerated NEET Question:\n")
    print(response.text.strip())


Enter NEET topic:  Electromagnetic Waves
Enter difficulty (1-10):  8



Generated NEET Question:

Question: A radio wave travelling in vacuum has a wavelength of 300 m. What is the frequency of this wave?

A. 1 Hz
B. 10 Hz
C. 100 Hz
D. 1000 Hz

Answer: D. 1000 Hz
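Generated questions are worth checking before they are used. For this one, the relation f = c / λ gives roughly 1 MHz for a 300 m wave in vacuum, which matches none of the listed options (1 Hz to 1000 Hz), so the stated answer does not hold. A minimal sketch of the check; the function name is illustrative:

```python
SPEED_OF_LIGHT = 299_792_458  # m/s, exact by definition of the metre

def frequency_hz(wavelength_m):
    """Frequency of an electromagnetic wave in vacuum: f = c / wavelength."""
    return SPEED_OF_LIGHT / wavelength_m

f = frequency_hz(300)
print(f"{f:.3e} Hz")
```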
