Sources of assessment: <br>
https://colab.research.google.com/drive/1S2j8Hzwt3qJzc0sawizX_bsLdqiZM-0_#scrollTo=fz1JcLXD7d_k <br>
https://gist.github.com/inoue-cw/ca171ed5e3e063e1fbaed265ec05b694 <br>
https://ies.ed.gov/ncee/rel/regions/northeast/onlinetraining/ResourcesTools/Bloom%27s%20Taxonomy.pdf <br>

Sources of OpenAI API:<br>
https://platform.openai.com/docs/guides/text?lang=python

Sources on Bloom's taxonomy: <br>
https://tophat.com/blog/blooms-taxonomy-question-stems/ <br>
https://www.coloradocollege.edu/other/assessment/how-to-assess-learning/learning-outcomes/blooms-revised-taxonomy.html

# Part A: LLM Analysis & Human Validation

Write code that uses LLMs to perform initial analysis, then critically evaluate their outputs:

1. **LLM-Generated Classifications**
   - Write code to have the LLM classify each question's Bloom Level (1-6 scale)
   - Code LLM assessment of question quality and potential issues
   - Have LLM generate improved versions of questions (fix issues AND elevate to higher Bloom levels when possible, while keeping the same content)
     - Raise the Bloom level especially for questions that fall on the first level
   - Get LLM recommendations for format conversions (MCQ → MCA/TEXT/NUMERIC)

2. **Human Validation & Critical Analysis**
   - **Agreement Check**: Do you agree with the LLM's Bloom classifications? Why or why not?
   - **LLM Check Review**: Verify the validity of the quality issues the LLM identifies across the question set. Did it miss anything important?
   - **Critical Gaps**: Where does the LLM seem confident but wrong? Where is it uncertain but correct?

3. **Improvement & Format Evaluation**
   - **Improvement Quality**: Are the LLM's suggested question improvements actually better?
   - **Format Recommendations**: Do the LLM's format conversion suggestions make sense?


4. **Synthesis & Insights**
   - Document patterns in LLM performance (what it does well vs. poorly)
   - Identify which types of analysis require human oversight
   - Note questions where human judgment significantly differs from LLM assessment

### Importing modules and testing

In [3]:
import pandas as pd
import os
import json
from dotenv import load_dotenv
from openai import OpenAI

# load the variables from .env into the process environment
load_dotenv()

client = OpenAI()  # reads OPENAI_API_KEY from the environment

In [4]:
PATH_FILE = 'C:/Users/Pedro Malavota/Desktop/CloudWalk/meu_projeto/Dataset.csv'

df_questions = pd.read_csv(PATH_FILE)

In [7]:
i = 22
print(f"Question {i+1}")
#print(df_questions.iloc[i,1])
print(df_questions.iloc[i])

Question 23
ID                                                               23
Question          You're analyzing customer churn and find that ...
Option_A                Users naturally become more loyal over time
Option_B                 Early user experience may need improvement
Option_C                               Seasonal effects in the data
Option_D                                    Sample size differences
Correct_Answer                                                    B
Name: 22, dtype: object


In [29]:
# Load my answers

my_answers1_10 = [3,1,1,5,1,2,2,4,4,5]
my_answers11_20 = [3,5,1,2,1,2,1,4,1,1]
my_answers21_30 = [5,1,5,1,1,4,2,1,1,2]
my_answers = my_answers1_10+my_answers11_20+my_answers21_30
print(f"my_answers = {my_answers[0:30]}")

my_answers = [3, 1, 1, 5, 1, 2, 2, 4, 4, 5, 3, 5, 1, 2, 1, 2, 1, 4, 1, 1, 5, 1, 5, 1, 1, 4, 2, 1, 1, 2]


In [20]:
prompt = """
    say "hello world!" and state your name
    """
response = client.responses.create(
    model="gpt-5-2025-08-07",
    input=prompt
)

print(response.output_text)

"hello world!"
I'm ChatGPT.


### Run models and save outputs

In [107]:
def run_models(label, start, stop):
    """
    Process questions with one of the LLM tasks and save the results to disk.
    Args:
        label (str): Name of the task to run (e.g. "classify_bloom_level").
        start (int): Start index in the DataFrame.
        stop (int): Stop index in the DataFrame (exclusive).
    Returns:
        list: A list with the result for each processed question.
    """

    results = []

    # Output folder (named after the function/label)
    output_dir = os.path.join("outputs", label)
    os.makedirs(output_dir, exist_ok=True)

    # Read the fields of each question in one place
    for i in range(start, stop):
        question = df_questions.iloc[i, 1]  # "Question" column
        options = df_questions.iloc[i, 2:6].tolist()  # columns Option_A to Option_D
        correct_answer = df_questions.iloc[i, 6]  # "Correct_Answer" column
        ID = df_questions.iloc[i, 0]  # "ID" column

        if label == "classify_bloom_level":
            result = classify_bloom_level(question, options, correct_answer)
        elif label == "validate_question_quality":
            result = validate_question_quality(question, options, correct_answer)
        elif label == "suggest_question_improvements":
            result = suggest_question_improvements(question, options, correct_answer)
        elif label == "recommend_format_conversion":
            result = recommend_format_conversion(question, current_format="MCQ")
        elif label == "create_format_conversion":
            new_format = input(f"Choose question {ID} format (MCA, NUMERIC, TEXT): \n")
            result = create_format_conversion(question, new_format, current_format="MCQ")
        else:
            result = {"error": f"Unknown label '{label}' for question {ID}"}

        # Save the result to disk (one file per question)
        file_name = f"{label}_q{ID}.json"
        file_path = os.path.join(output_dir, file_name)

        with open(file_path, "w", encoding="utf-8") as f:
            json.dump(result, f, indent=4, ensure_ascii=False)

        results.append(result)

    return results
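
Since `run_models` makes one API call per question and already writes one JSON file per question, a rerun could reuse those files instead of repeating the call. The sketch below shows one possible way to do that. `load_or_compute` and `fake_model` are hypothetical names that do not exist in the notebook, and the fake model stands in for a real call such as `classify_bloom_level(question, options, correct_answer)`.

```python
import json
import os
import tempfile


def load_or_compute(file_path, compute):
    """Return the cached JSON result at file_path, or call compute() and cache it.

    Hypothetical helper: `compute` is any zero-argument callable that returns
    a JSON-serializable result (for example the raw string returned by the model).
    """
    if os.path.exists(file_path):
        with open(file_path, "r", encoding="utf-8") as f:
            return json.load(f)

    result = compute()
    os.makedirs(os.path.dirname(file_path) or ".", exist_ok=True)
    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(result, f, indent=4, ensure_ascii=False)
    return result


# Demonstration with a fake "model", so no API call is made.
calls = []


def fake_model():
    calls.append(1)
    return '{"level": 3}'


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "classify_bloom_level", "classify_bloom_level_q1.json")
    first = load_or_compute(path, fake_model)   # computes and writes the file
    second = load_or_compute(path, fake_model)  # reads the file, no second call

print(first == second, len(calls))
```

The trade-off is that a cached file hides prompt changes: after editing a prompt, the corresponding output folder has to be cleared for the new prompt to take effect.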


### Classify Bloom level

In [9]:
def classify_bloom_level(question, options, correct_answer):
    """Use LLM to classify Bloom taxonomy level (1-6) with reasoning"""
    
    prompt = f"""
    Analyze the following multiple-choice question and classify its Bloom's Taxonomy cognitive level (1-6).
    
    Question: {question}
    Options: {options}
    Correct Answer: {correct_answer}
    
    Bloom's Taxonomy Levels and Subcategories (in parentheses):
    1. Remember (Recognizing,Recalling)
    2. Understand (Interpreting,Exemplifying,Classifying,Summarizing,Inferring,Comparing,Explaining)
    3. Apply (Executing,Implementing)
    4. Analyze (Differentiating,Organizing,Attributing)
    5. Evaluate (Checking,Critiquing)
    6. Create (Generating,Planning,Producing)
    
    Answer format (txt):
    {{
        "level": <int from 1 to 6>
    }}
    """
    # "reasoning": "<concise reasoning (<40 words)>"

    response = client.responses.create(
        model="gpt-5-mini",
        reasoning={"effort": "medium"},
        instructions="You are an expert in educational assessment and Bloom’s Taxonomy classification.",
        input=prompt
    )
    
    return response.output_text

In [11]:
data = run_models("classify_bloom_level", 0, 30)

output = []
for raw in data:
    try:
        level = json.loads(raw)["level"]
    except (json.JSONDecodeError, KeyError) as e:
        level = f"Error: {e}"   # or you could set level = None
    output.append(level)

print(output)

[3, 1, 1, 2, 1, 1, 1, 3, 2, 2, 3, 1, 1, 1, 1, 2, 1, 2, 1, 1, 5, 1, 2, 1, 1, 3, 1, 1, 1, 2]
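
The parsing loop above relies on the model returning nothing but a JSON object. A slightly more tolerant parser could be sketched as follows, assuming the reply contains at most one JSON object, possibly surrounded by code fences or extra text. `extract_json` is a hypothetical helper, and the sample replies are made up for illustration.

```python
import json


def extract_json(text):
    """Parse the first JSON object found in a model reply.

    Tolerates replies that wrap the object in Markdown code fences or add
    text before or after it. Returns None when no valid object is found.
    """
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None


fence = "`" * 3  # Markdown code fence, built here to keep this example readable
replies = [
    '{\n    "level": 3\n}',                    # clean reply
    f'{fence}json\n{{"level": 2}}\n{fence}',   # fenced reply
    'Level: {"level": 1} (Remember)',          # reply with extra text
    'I think this is level 4',                 # no JSON at all
]
levels = [(extract_json(r) or {}).get("level") for r in replies]
print(levels)
```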


After batch-testing: <br>
classifications_medium_reasoning gpt-5-mini = [3, 1, 1, 2, 1, 1, 1, 3, 2, 2, 3, 1, 1, 1, 1, 2, 1, 2, 1, 1, 5, 1, 2, 1, 1, 3, 1, 1, 1, 2] <br>
classifications_low_reasoning gpt-5 = [3, 1, 1, 2, 1, 1, 1, 3, 1, 2, 3, 2, 1, 1, 1, 2, 1, 2, 1, 1, 4, 1, 4, 1, 1, 3, 2, 1, 1, 2] <br>
classifications_medium_reasoning gpt-5 = [3, 1, 1, 3, 1, 1, 1, 3, 1, 2, 3, 2, 1, 1, 1, 2, 1, 1, 1, 1, 4, 1, 4, 1, 1, 3, 2, 1, 1, 2]
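
For the agreement check, the human labels in `my_answers` can be compared with any of the classification lists above. The sketch below computes the exact agreement rate, the rate of agreement within one Bloom level, and the questions where the two differ. The function names `agreement` and `disagreements` are hypothetical; the two lists are copied from the notebook (human labels and the gpt-5-mini medium-reasoning run).

```python
def agreement(human, llm):
    """Return exact and within-one-level agreement rates between two label lists."""
    if len(human) != len(llm):
        raise ValueError("label lists must have the same length")
    n = len(human)
    exact = sum(h == m for h, m in zip(human, llm))
    within_one = sum(abs(h - m) <= 1 for h, m in zip(human, llm))
    return {"exact": exact / n, "within_one": within_one / n}


def disagreements(human, llm):
    """Return (question number, human level, LLM level) for every mismatch."""
    return [(i + 1, h, m) for i, (h, m) in enumerate(zip(human, llm)) if h != m]


my_answers = [3, 1, 1, 5, 1, 2, 2, 4, 4, 5, 3, 5, 1, 2, 1, 2, 1, 4, 1, 1,
              5, 1, 5, 1, 1, 4, 2, 1, 1, 2]
gpt5_mini_medium = [3, 1, 1, 2, 1, 1, 1, 3, 2, 2, 3, 1, 1, 1, 1, 2, 1, 2, 1, 1,
                    5, 1, 2, 1, 1, 3, 1, 1, 1, 2]

stats = agreement(my_answers, gpt5_mini_medium)
diffs = disagreements(my_answers, gpt5_mini_medium)
print(stats)
print(diffs)
```

Listing the mismatches by question number makes it easier to revisit each one when answering the "why or why not" part of the agreement check.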

### Validate question quality

In [24]:
def validate_question_quality(question, options, correct_answer):

    """
    Use an LLM to check a question for quality issues.
    Parameters:
        question (str): question stem
        options (list): answer options
        correct_answer (str): letter of the correct option
    Returns:
        str: raw model output, expected to be JSON of the form {"issues": str}
    """
        
    prompt = f"""
    Analyze the following multiple-choice question and check for relevant quality issues. If no issue is relevant, answer "none". 
    
    Question: {question}
    Options: {options}
    Correct Answer: {correct_answer}

    Examples of question quality issues: Ambiguity, poor phrasing, unnecessary complexity, oversimplification.
    Examples of options quality issues: More than one correct option, no correct option, wrong option mentioned as correct answer
    
    Answer format (txt):
    {{ 
        "issues": "<list issues (max 3 words)>"
    }}
    """
    # "issues": "<list issues (max 3 words)>"
    # "reasoning": "<concise reasoning (max 40 words)>"
    
    response = client.responses.create(
        model="gpt-5-mini",
        reasoning={"effort": "low"},
        instructions="You are an expert in data analysis assessments for professional recruitment.",
        input=prompt
    )
    
    return response.output_text

In [28]:
data_quality = run_models("validate_question_quality", 0, 30)
output = []
for item in data_quality:
    try:
        issues = json.loads(item)["issues"]
    except (json.JSONDecodeError, KeyError) as e:
        issues = f"Error: {e}"   # or you could set issues = None
    output.append(issues)

print(output)

['none', 'none', 'none', 'ambiguous, multiple correct', 'none', 'wrong correct answer', 'imprecise wording', 'insufficient info, multiple correct options', 'none', 'none', 'none', 'oversimplification, metric ambiguity', 'none', 'none', 'none', 'none', 'none', 'term ambiguity, missing sample size, oversimplification', 'imprecision', 'none', 'Ambiguity, Multiple correct options', 'none', 'multiple correct answers, ambiguous', 'oversimplification', 'none', 'none', 'misleading options, incorrect distractors', 'Ambiguity, Multiple interpretations', 'none', 'none']


batch-testing gpt-5-mini <br>
issues = ['none', 'none', 'none', 'ambiguous, multiple correct', 'none', 'wrong correct answer', 'imprecise wording', 'insufficient info, multiple correct options', 'none', 'none', 'none', 'oversimplification, metric ambiguity', 'none', 'none', 'none', 'none', 'none', 'term ambiguity, missing sample size, oversimplification', 'imprecision', 'none', 'Ambiguity, Multiple correct options', 'none', 'multiple correct answers, ambiguous', 'oversimplification', 'none', 'none', 'misleading options, incorrect distractors', 'Ambiguity, Multiple interpretations', 'none', 'none'] <br>
batch-testing gpt-5: <br>
issues = ['oversimplification', 'none', 'none', 'Nonparallel options', 'Ambiguity, oversimplification', 'Incorrect answer key, Ambiguous wording', 'oversimplification', 'Multiple correct options, Missing sample size, Ambiguous phrasing', 'none', 'none', 'oversimplification', 'none', 'none', 'none', 'none', 'none', 'none', 'Terminology ambiguity', 'oversimplification, minor ambiguity', 'none', 'none', 'none', 'Ambiguity, Multiple correct options, Oversimplification', 'oversimplification', 'none', 'none', 'none', 'Ambiguity, No correct option', 'none', 'none']
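
To see where the two models agree on which questions are problematic, the issue lists can be reduced to sets of flagged question numbers and compared. The sketch below does this for the first eight entries of each list above (copied from the outputs; the full 30-item lists work the same way). `flagged` is a hypothetical helper.

```python
def flagged(issues):
    """Return the 1-based question numbers whose issue label is not 'none'."""
    return {i + 1 for i, label in enumerate(issues) if label.strip().lower() != "none"}


# First eight entries of each batch run, copied from the outputs above.
mini_issues = ['none', 'none', 'none', 'ambiguous, multiple correct', 'none',
               'wrong correct answer', 'imprecise wording',
               'insufficient info, multiple correct options']
gpt5_issues = ['oversimplification', 'none', 'none', 'Nonparallel options',
               'Ambiguity, oversimplification',
               'Incorrect answer key, Ambiguous wording', 'oversimplification',
               'Multiple correct options, Missing sample size, Ambiguous phrasing']

mini_flags = flagged(mini_issues)
gpt5_flags = flagged(gpt5_issues)

both = sorted(mini_flags & gpt5_flags)       # flagged by both models
only_mini = sorted(mini_flags - gpt5_flags)  # flagged only by gpt-5-mini
only_gpt5 = sorted(gpt5_flags - mini_flags)  # flagged only by gpt-5
print(both, only_mini, only_gpt5)
```

Questions flagged by only one model are natural candidates for the human review step, since the models disagree on whether a problem exists at all.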

### Suggest question improvements

In [29]:
def suggest_question_improvements(question, options, correct_answer):

    """
    Use an LLM to generate an improved version of a question at a higher Bloom level.
    Parameters:
        question (str): question stem
        options (list): answer options
        correct_answer (str): letter of the correct option
    Returns:
        str: raw model output, expected to be JSON with the new question, level, options and answer
    """

    level = classify_bloom_level(question, options, correct_answer)
    issues = validate_question_quality(question, options, correct_answer)
        
    prompt = f"""
    Generate an improved version of the question that sits one level higher on the Bloom scale and has no remaining issues. Do not change the content of the question.
    Take the current taxonomy level and the detected issues into account. If the question is level 1, always push it to level 2 or 3.

    Good example: <original question>Which pandas function reads CSV files?</original question> <final question>Explain the purpose of the pandas function 'read_csv()' in data analysis.</final question> The new question requires students not only to recall the function name but also to understand its purpose and application in data analysis, which aligns with the 'Understand' level of Bloom's Taxonomy.
    Bad example: <original question>What does the correlation coefficient measure?</original question> <final question>How would you explain the significance of the correlation coefficient in statistical analysis?</final question> The new question only changes the wording; the thinking process remains the same.

    Question: {question}
    Options: {options}
    Correct Answer: {correct_answer}
    Current Bloom level (classifier output): {level}
    Detected issues (validator output): {issues}
    
    Bloom's Taxonomy Levels and Subcategories (in parentheses):
    1. Remember (Recognizing,Recalling)
    2. Understand (Interpreting,Exemplifying,Classifying,Summarizing,Inferring,Comparing,Explaining)
    3. Apply (Executing,Implementing)
    4. Analyze (Differentiating,Organizing,Attributing)
    5. Evaluate (Checking,Critiquing)
    6. Create (Generating,Planning,Producing)
    
    Answer format (txt):
    {{
        "predicted level": <int from 1 to 6>,
        "new question": "<question generated by LLM>",
        "new level": <int from 1 to 6>,
        "new options": "<options generated by LLM>",
        "new correct option": "<letter between a,b,c,d>"
    }}
    """

    # "reasoning": "<concise reasoning (<40 words)>"
    
    response = client.responses.create(
        model="gpt-5-mini",
        reasoning={"effort": "medium"},
        instructions="You are an expert in data analysis assessments for professional recruitment.",
        input=prompt
    )
    
    return response.output_text

In [34]:
questions_improved = run_models("suggest_question_improvements", 0,30)

In [45]:
import csv
# Assumes the variable 'questions_improved' already exists and was filled by your function
# Example of what 'questions_improved' could look like:
# questions_improved = [
#     '{"predicted level": 2, "new question": "What is the purpose of the pandas `read_csv` function?", "new options": "a) Read Excel files\\nb) Read text files\\nc) Read CSV files into a DataFrame\\nd) Write to CSV files", "new correct option": "c"}',
#     '{"predicted level": 3, "new question": "How would you apply the `groupby` function to compute the average salary per department?", "new options": "a) df.groupby(\'department\').mean()\\nb) df.mean().groupby(\'department\')\\nc) df.groupby(\'department\')[\'salary\'].mean()\\nd) df[\'salary\'].groupby().mean()", "new correct option": "c"}'
# ]


# 1. Prepare the data for the CSV
dados_para_csv = []
for resultado_str in questions_improved:
    try:
        # Convert the JSON string into a Python dictionary
        dados_dict = json.loads(resultado_str)

        # Extract the desired fields; .get() is safer in case a key is missing
        nova_questao = dados_dict.get("new question", "N/A")
        novas_opcoes = dados_dict.get("new options", "N/A")

        # Append a dictionary with the extracted fields to the list
        dados_para_csv.append({
            'nova_questao': nova_questao,
            'novas_opcoes': novas_opcoes
        })

    except json.JSONDecodeError:
        print(f"Warning: could not process the following result because it is not valid JSON:\n{resultado_str}\n")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")


# 2. Write the data to a CSV file
nome_arquivo_csv = 'questoes_melhoradas.csv'
colunas = ['nova_questao', 'novas_opcoes']

try:
    with open(nome_arquivo_csv, 'w', newline='', encoding='utf-8') as arquivo_csv:
        # Create a CSV writer that works with dictionaries
        writer = csv.DictWriter(arquivo_csv, fieldnames=colunas)

        # Write the header row
        writer.writeheader()

        # Write all data rows at once
        writer.writerows(dados_para_csv)

    print(f"File '{nome_arquivo_csv}' saved successfully with {len(dados_para_csv)} questions!")

except IOError:
    print(f"Error: could not write to file '{nome_arquivo_csv}'. Check the permissions.")

File 'questoes_melhoradas.csv' saved successfully with 30 questions!


Batch testing gpt-5-mini: (new questions) <br>
['Explain what the mean (average) of the dataset [2, 4, 6, 8] represents and calculate its value.', "Explain the purpose of the pandas function 'read_csv()' in a data analysis workflow.", "Explain why approximately 68% of values in a normal distribution fall within one standard deviation of the mean, and what this indicates about the distribution's variability.", 'For the dataset [1, 1, 2, 3, 100], calculate the mean and the median, then identify which measure of central tendency best represents the typical values and explain your reasoning.', 'Explain what the correlation coefficient measures with respect to both the direction and strength of a linear relationship between two variables, and how its numerical values are interpreted.', 'Which Python library is primarily used for data manipulation, offering DataFrame structures and high-level data operations?', 'Explain what a p-value represents in hypothesis testing and how it is used when deciding whether to reject the null hypothesis.', 'Explain which statistical test is most appropriate to determine whether the observed difference between two website conversion rates (2.1% and 2.8%) is statistically significant, and briefly justify your choice.', 'In a right-skewed distribution caused by a few large outliers, which statement best describes the typical relationship between the mean and the median?', 'Explain in which situation you would choose a scatter plot instead of a bar chart.', 'Explain the correct method for finding the median of the dataset [1, 3, 5, 7, 9], then identify the median.', 'Interpret the model performance: it achieves 85% accuracy on the training set but only 65% on the test set. Which of the following issues best explains this discrepancy?', 'Which SQL clause is used to filter rows? 
Choose the option that correctly identifies the clause and explains its purpose in a query.', 'In an A/B test comparing a new website design to the existing one, which group is the control group and why is that group used?', 'Which pandas method removes duplicate rows and (by default) returns a DataFrame with those duplicates removed?', 'Given that your sales data shows a spike every December, identify which type of time-series pattern this represents and briefly explain your reasoning.', "Explain the purpose of the matplotlib function 'hist()' and describe a situation in data visualization where it is the appropriate choice.", 'Explain which statistical inputs you must know to determine whether an observed 95% customer satisfaction rate is statistically significantly different from the industry average of 90%.', 'Explain the primary purpose of the SQL GROUP BY clause and, from the options below, choose the best description of what it accomplishes.', 'Explain which of the following measures best describes the spread of a dataset and justify your choice.', 'When cleaning data, you find 15% missing values in a key variable. What is the first action you would take in your data-cleaning workflow to address this issue?', "Describe what the box in a box plot represents and explain how this summary helps in understanding a dataset's distribution.", 'Design a specific analysis or experiment to test whether the higher churn in months 1–3 (40%) versus months 4–6 (15%) is due to a poorer early user experience. Which of the following would be the most direct and effective first step?', "Explain the purpose of NumPy's std() function and describe a situation in data analysis where computing the standard deviation would be useful.", 'Describe the information returned by df.shape in pandas and explain when you would use it during data inspection.', 'You want to compare average sales across 5 different regions. 
Which statistical test is most appropriate, and briefly explain why this test is preferable to the other options listed?', 'Which statement best explains the difference between pandas .loc and .iloc when selecting data from a DataFrame?', 'Explain what the whiskers in a box plot typically represent in a dataset and how they relate to the quartiles and outliers.', "Explain what is meant by 'rejecting a true null hypothesis' and select the term that correctly names this error.", 'Identify which type of analytics is being used when a marketing team predicts customer lifetime value using past purchase behavior, demographics, and engagement metrics, and briefly explain why this classification is appropriate.']


### Recommend Format Conversion

In [72]:
def recommend_format_conversion(question, current_format="MCQ"):
    prompt = f"""
Analyze the multiple-choice question and assign it a new format.

Good examples of changing format:
- Question too specific or too simple → MCA
- Calculation where options may add bias → NUMERIC
- Question could assess higher levels of cognitive complexity (Checking, Critiquing, Generating, Planning, Producing) → TEXT

Question: {question}
Current format: {current_format}

Formats:
MCQ: One correct option
MCA: Multiple correct options
TEXT: Short free response (max 3 sentences)
NUMERIC: Numeric answer

Answer format (txt):
{{
    "new format": "<choose "MCA", "TEXT" or "NUMERIC">"
}}
"""

    response = client.responses.create(
        model="gpt-5-mini",
        reasoning={"effort": "low"},
        instructions="You are an expert in recruitment assessments.",
        input=prompt
    )
    
    return response.output_text


# Prompt notes: tell the model to change the format only when it has a meaningful impact
# and use fewer unnecessary words (verbs and synonyms)

In [82]:
new_format = run_models("recommend_format_conversion", 0,30)
print(new_format)

['{\n    "new format": "NUMERIC"\n}', '{\n    "new format": "MCA"\n}', '{\n    "new format": "MCA"\n}', '{\n    "new format": "MCA"\n}', '{\n    "new format": "TEXT"\n}', '{\n    "new format": "MCA"\n}', '{\n    "new format": "TEXT"\n}', '{\n    "new format": "TEXT"\n}', '{\n    "new format": "MCA"\n}', '{\n    "new format": "TEXT"\n}', '{\n    "new format": "NUMERIC"\n}', '{\n    "new format": "MCA"\n}', '{\n    "new format": "MCA"\n}', '{\n    "new format": "MCA"\n}', '{\n    "new format": "MCA"\n}', '{\n    "new format": "MCA"\n}', '{\n    "new format": "MCA"\n}', '{\n    "new format": "MCA"\n}', '{\n    "new format": "TEXT"\n}', '{\n    "new format": "MCA"\n}', '{\n    "new format": "TEXT"\n}', '{\n    "new format": "MCA"\n}', '{\n    "new format": "TEXT"\n}', '{\n    "new format": "MCA"\n}', '{\n    "new format": "TEXT"\n}', '{\n    "new format": "TEXT"\n}', '{\n    "new format": "TEXT"\n}', '{\n    "new format": "MCA"\n}', '{\n    "new format": "MCA"\n}', '{\n    "new format": "M

In [84]:
output = []
for item in new_format:
    try:
        fmt = json.loads(item)["new format"]  # 'fmt' avoids shadowing the built-in format()
    except (json.JSONDecodeError, KeyError) as e:
        fmt = f"Error: {e}"   # or you could set fmt = None
    output.append(fmt)

print(output)

['NUMERIC', 'MCA', 'MCA', 'MCA', 'TEXT', 'MCA', 'TEXT', 'TEXT', 'MCA', 'TEXT', 'NUMERIC', 'MCA', 'MCA', 'MCA', 'MCA', 'MCA', 'MCA', 'MCA', 'TEXT', 'MCA', 'TEXT', 'MCA', 'TEXT', 'MCA', 'TEXT', 'TEXT', 'TEXT', 'MCA', 'MCA', 'MCA']


After batch-testing: <br>
['NUMERIC', 'MCA', 'MCA', 'MCA', 'TEXT', 'MCA', 'TEXT', 'TEXT', 'MCA', 'TEXT', 'NUMERIC', 'MCA', 'MCA', 'MCA', 'MCA', 'MCA', 'MCA', 'MCA', 'TEXT', 'MCA', 'TEXT', 'MCA', 'TEXT', 'MCA', 'TEXT', 'TEXT', 'TEXT', 'MCA', 'MCA', 'MCA']
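
A quick tally helps when judging whether these recommendations make sense as a whole, for example whether one format dominates. The sketch below counts the recommendations from the list above using the standard-library `Counter` and checks that every value is one of the three allowed formats. The variable names are illustrative only.

```python
from collections import Counter

# Recommendations copied from the batch-testing output above.
recommended = ['NUMERIC', 'MCA', 'MCA', 'MCA', 'TEXT', 'MCA', 'TEXT', 'TEXT', 'MCA', 'TEXT',
               'NUMERIC', 'MCA', 'MCA', 'MCA', 'MCA', 'MCA', 'MCA', 'MCA', 'TEXT', 'MCA',
               'TEXT', 'MCA', 'TEXT', 'MCA', 'TEXT', 'TEXT', 'TEXT', 'MCA', 'MCA', 'MCA']

allowed = {"MCA", "TEXT", "NUMERIC"}
counts = Counter(recommended)
unexpected = [f for f in recommended if f not in allowed]

for name, count in counts.most_common():
    print(f"{name}: {count} ({count / len(recommended):.0%})")
print("unexpected values:", unexpected)
```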

### Create format conversion

In [93]:
def create_format_conversion(question, new_format, current_format="MCQ"):
    """Convert a question into another type"""

    promptMCA = f"""
Analyze the question and generate a new question in the multiple-correct-answers (MCA) format.
Do not change the question topic. Generate 4 different options labeled from "A" to "D".


Question: {question}
Current format: {current_format}

Answer format (txt):
{{
    "new_question": "<new question>",
    "new_options": "<new options>",
    "new_correct_answer": "<labels of the correct new options>"
}}
"""
    
    promptTYPING = f"""
Analyze the question and generate a new question in the new format.

Question: {question}
Current format: {current_format}
New format: {new_format}

Formats:
TEXT: Short free response (max 3 sentences)
NUMERIC: Numeric answer

Answer format (txt):
{{
    "new_question": "<TEXT, NUMERIC>",
    "correct_answer": <int, string>
}}
"""
    if new_format == "MCA":
        response = client.responses.create(
            model="gpt-5-mini",
            reasoning={"effort": "medium"},
            instructions="You are an expert in recruitment assessments.",
            input=promptMCA
        )

    else:
        response = client.responses.create(
            model="gpt-5-mini",
            reasoning={"effort": "low"},
            instructions="You are an expert in recruitment assessments.",
            input=promptTYPING
        )
        
    
    return response.output_text

In [115]:
converted_formats = run_models("create_format_conversion", 0,5)

Choose question 1 format (MCA, NUMERIC, TEXT): 
 NUMERIC
Choose question 2 format (MCA, NUMERIC, TEXT): 
 MCA
Choose question 3 format (MCA, NUMERIC, TEXT): 
 MCA
Choose question 4 format (MCA, NUMERIC, TEXT): 
 MCA
Choose question 5 format (MCA, NUMERIC, TEXT): 
 TEXT
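
Because `create_format_conversion` sends anything other than `"MCA"` to the TEXT/NUMERIC prompt, a typo at the `input()` prompt (for example `mca` or `NUMERC`) would silently produce the wrong kind of question. One way to guard against that is to normalize and validate the typed value first. This is a minimal sketch; `VALID_FORMATS` and `normalize_format` are hypothetical names.

```python
VALID_FORMATS = ("MCA", "NUMERIC", "TEXT")


def normalize_format(raw):
    """Return the canonical format name, or None if the input is not recognized."""
    cleaned = raw.strip().upper()
    return cleaned if cleaned in VALID_FORMATS else None


# Values as they might be typed at the prompt (illustrative only).
typed = [" NUMERIC", "mca", "Text ", "NUMERC", ""]
normalized = [normalize_format(t) for t in typed]
print(normalized)
```

In `run_models`, the `input()` call could be repeated until `normalize_format` returns a value other than `None`.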


# Part B: Format Exploration & Open Research

In [143]:
def short_answer(question: str) -> str:
    """LLM-1: Generates a concise 2-3 word answer."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # can use gpt-4o or gpt-3.5 depending on your plan
        messages=[
            {"role": "system", "content": "Answer each question in 2-3 words only."},
            {"role": "user", "content": question}
        ],
        max_tokens=10
    )
    return response.choices[0].message.content.strip()

def judge_answer(question: str, answer: str) -> str:
    """LLM-2: Judges if the given answer is correct or wrong."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a strict grader. Reply only with 'Right' or 'Wrong'."},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}\nIs the answer correct?"}
        ],
        max_tokens=5
    )
    return response.choices[0].message.content.strip()

def conflict_mode(question: str):
    """Runs the two-LLM conflict process."""
    ans = short_answer(question)
    verdict = judge_answer(question, ans)
    return {"question": question, "answer": ans, "verdict": verdict}

# Example usage
question = "When would you use a scatter plot instead of a bar chart?"
result = conflict_mode(question)
print(result)

{'question': 'When would you use a scatter plot instead of a bar chart?', 'answer': 'Relationship analysis.', 'verdict': 'Right'}
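
The judge is asked to reply only with 'Right' or 'Wrong', but the reply is still free text and may come back with punctuation or different casing. If the verdicts are to be counted across many questions, they can be mapped to booleans first. The sketch below assumes those two words are the only valid verdicts; `parse_verdict` is a hypothetical helper.

```python
def parse_verdict(reply):
    """Map the judge's reply to True (Right), False (Wrong), or None (unclear)."""
    cleaned = reply.strip().strip(".!'\"").lower()
    if cleaned == "right":
        return True
    if cleaned == "wrong":
        return False
    return None


# Illustrative replies, not actual model output.
samples = ["Right", " Wrong.", "'Right'", "It depends"]
verdicts = [parse_verdict(s) for s in samples]
print(verdicts)
```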


In [None]:
def get_arg(answer):
    """Break the answer into its main arguments."""
    pass  # stub, implemented in the next cell

def arg_logic(arg):
    """Decide whether the argument makes logical sense.
    valid, inconclusive, invalid -> score 1, 0.5, 0
    """
    pass  # stub, implemented in the next cell

def arg_context(arg, question):
    """Decide whether the argument is connected to the context.
    directly connected, indirectly connected, disconnected -> score 1, 0.5, 0
    """
    pass  # stub, implemented in the next cell


def score_question(answer, question):
    """Average logic_score * context_score over the arguments of the answer."""
    arguments = get_arg(answer)
    N = len(arguments)
    score = 0
    for arg in arguments:
        logic_score = arg_logic(arg)
        context_score = arg_context(arg, question)
        score += 1/N * logic_score * context_score

    return score

In [20]:
import json
from openai import OpenAI

client = OpenAI()

# ------------------------
# Helper functions
# ------------------------

def get_arg(answer: str) -> list:
    """Break answer into main arguments using GPT"""
    prompt = f"""
    Break the following answer into a list of its main arguments.
    Return ONLY a JSON object with a key 'arguments' that contains a list of strings.

    Answer: {answer}
    """
    response = client.chat.completions.create(
        model="gpt-5-nano",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"}
    )
    args = response.choices[0].message.content
    data = json.loads(args)
    return data["arguments"]


def arg_logic(arg: str) -> float:
    """Evaluate logical validity of argument"""
    prompt = f"""
    Evaluate the logical validity of this argument: "{arg}"

    Categories:
    - valid -> score 1
    - mostly valid -> score 0.5
    - invalid/contradictory -> score 0

    Answer ONLY with a number (1, 0.5, or 0).
    """
    response = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[{"role": "user", "content": prompt}]
    )
    score_text = response.choices[0].message.content.strip()
    return float(score_text)


def arg_context(arg: str, question: str) -> float:
    """Evaluate contextual relevance of argument to question"""
    prompt = f"""
    Evaluate whether the argument is connected to the question.

    Question: {question}
    Argument: {arg}

    Categories:
    - directly connected -> score 1
    - indirectly connected -> score 0.5
    - disconnected -> score 0

    Answer ONLY with a number (1, 0.5, or 0).
    """
    response = client.chat.completions.create(
        model="gpt-5-nano",
        messages=[{"role": "user", "content": prompt}]
    )
    score_text = response.choices[0].message.content.strip()
    return float(score_text)


# ------------------------
# Main function
# ------------------------

def score_question(answer: str, question: str) -> float:
    """Score an answer (0-10) based on logic + context of its arguments"""
    arguments = get_arg(answer)
    N = len(arguments)

    if N == 0:
        return 0.0

    total_score = 0

    for arg in arguments:
        logic_score = arg_logic(arg)
        context_score = arg_context(arg, question)
        arg_score = logic_score * context_score

        total_score += arg_score

    avg_score = total_score / N

    final_score = 10 * avg_score
    return max(0, round(final_score, 2))
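
Both `arg_logic` and `arg_context` pass the model's reply straight to `float()`, which raises `ValueError` if the reply is anything other than a bare number (for example `Score: 0.5` or `2/3`). A more forgiving parser could look like the sketch below, which pulls the first number or simple fraction out of the reply and snaps it to the nearest allowed score. `parse_score` and `ALLOWED_SCORES` are hypothetical names, and falling back to 0.0 for unreadable replies is an assumption, not the notebook's behavior.

```python
import re

ALLOWED_SCORES = (0.0, 0.5, 1.0)
_NUMBER = r"\d*\.?\d+"  # matches "1", "0.5" and ".5"


def parse_score(text, default=0.0):
    """Extract the first number (or simple fraction) and snap it to an allowed score."""
    match = re.search(rf"({_NUMBER})(?:\s*/\s*({_NUMBER}))?", text)
    if match is None:
        return default
    value = float(match.group(1))
    if match.group(2) is not None and float(match.group(2)) != 0:
        value /= float(match.group(2))  # handle replies such as "2/3"
    # Pick the allowed score closest to the parsed value.
    return min(ALLOWED_SCORES, key=lambda s: abs(s - value))


# Illustrative replies, not actual model output.
replies = ["1", "0.5", "Score: 0.5.", ".5", "2/3", "valid"]
scores = [parse_score(r) for r in replies]
print(scores)
```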


In [40]:
q = "You're analyzing customer churn and find that Month 1-3 users have 40% churn, Month 4-6 users have 15% churn. What insight should you investigate?"
a = "The main insight is that older customers churn more than new ones. Therefore, you should focus only on Month 4–6 users."
score_question(a,q)

0

Question = "When would you use a scatter plot instead of a bar chart?" <br>
Answer 1 (good): You would use a scatter plot when you want to see the relationship between two continuous variables. For example, plotting income against education level reveals correlation patterns that a bar chart would hide. grade: 10.0 <br>
Answer 2 (good): Scatter plots are useful when the goal is to detect clusters, trends, or outliers in paired data. Unlike bar charts, they show how one variable changes in relation to another. grade: 7.5 <br>
Answer 3 (bad): A scatter plot is used when you want to count categories. It’s basically the same as a bar chart but with dots instead of bars. grade: 1.25

Question = "You're analyzing customer churn and find that Month 1-3 users have 40% churn, Month 4-6 users have 15% churn. What insight should you investigate?" <br>
Answer 1 (good): The high churn in the first three months suggests problems with onboarding or early product adoption. It would be important to investigate user experience and engagement in the first weeks. grade: 7.5 <br>
Answer 2 (good): This pattern implies that newer customers are not finding immediate value. You should investigate why early users drop off and what differentiates them from those who stay longer. grade: 6.5 <br>
Answer 3 (bad): The main insight is that older customers churn more than new ones. Therefore, you should focus only on Month 4–6 users. grade: 0
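
The grades above come from the aggregation in `score_question`: each argument gets the product of its logic and context scores, the products are averaged over the N arguments, and the average is scaled to 0-10. The sketch below reproduces only that arithmetic, with no API calls. `aggregate` is a hypothetical helper and the score pairs are hand-picked for illustration, not values returned by the models.

```python
def aggregate(pairs):
    """Combine (logic_score, context_score) pairs into a 0-10 grade.

    Mirrors the arithmetic of score_question: 10 * mean(logic * context).
    """
    if not pairs:
        return 0.0
    avg = sum(logic * context for logic, context in pairs) / len(pairs)
    return round(10 * avg, 2)


# Two arguments: one fully valid and on topic, one only partly valid.
mostly_good = aggregate([(1, 1), (0.5, 1)])
# Two arguments: one invalid, one valid but only indirectly connected.
mostly_bad = aggregate([(0, 1), (1, 0.5)])
# A single invalid argument gives zero no matter how relevant it is.
invalid_only = aggregate([(0, 1)])
print(mostly_good, mostly_bad, invalid_only)
```

Because the two scores are multiplied, an argument that is either illogical or off topic contributes nothing, so a single bad argument pulls the grade down in proportion to 1/N.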