# Outline

- [Required Packages](#RP)
- [Define Functions](#DF)
  - [Utility Functions](#UF)
  - [MathPix](#MP)
  - [OpenAI](#OA)
- [Execution](#E)
  - [Read URLS from file](#RUFF)
  - [Convert PDFs into LaTeX (MathPix APIs)](#CPIL)
  - [Get LaTeX contents using `pdf_ids` (MathPix APIs)](#GLCUP)
  - [Injecting into LLM (OpenAI APIs)](#IJIL)

**IMPORTANT**: Ensure that all necessary environment variables for the MathPix API and OpenAI API are set before running the project. Review the output files for accuracy and completeness after processing.

Documentation for this notebook can be found here: [Practise Exam Scraper Notebook Documentation](https://github.com/ThongLai/Learnspot-content-scraping/blob/main/practise_exam_scraper/practise_exam_scraper.md)
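
The environment variables mentioned in the note above can be checked before any API call is made. The snippet below is a minimal sketch, not part of the original notebook: `missing_env_vars` is a hypothetical helper, the `MATHPIX_*` names are the ones this notebook reads, and `OPENAI_API_KEY` is the variable the OpenAI client looks up by default.

```python
import os

# MATHPIX_* are read later in this notebook; OPENAI_API_KEY is the variable
# the OpenAI client reads when no key is passed explicitly.
REQUIRED_VARS = ["MATHPIX_APP_ID", "MATHPIX_APP_KEY", "OPENAI_API_KEY"]

def missing_env_vars(names=REQUIRED_VARS, environ=os.environ):
    """Return the names that are unset or empty in the given environment."""
    return [name for name in names if not environ.get(name)]

missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Running this first avoids discovering a missing key only after several PDFs have already been submitted.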

<a name="RP"></a>
# Required Packages

In [7]:
# Run the following command if a package is missing:
# %pip install msgraph-sdk ipykernel openai tiktoken pandas xlsxwriter requests openpyxl

In [8]:
import pandas as pd
import requests
import xlsxwriter
import tiktoken
from openai import OpenAI

import re
import os
import json
import time
import urllib.parse

<a name="DF"></a>
# Define Functions

<a name="UF"></a>
## Utility Functions

In [9]:
def get_pdf_embed_links(url):
    if "https://pmt" not in url:
        # Extract the 'pdf' parameter from the query string
        parsed_url = urllib.parse.urlparse(url)
        pdf_url = urllib.parse.parse_qs(parsed_url.query)['pdf'][0]
        
        # Decode the URL
        decoded_url = urllib.parse.unquote(pdf_url).replace(' ', '%20')
        return decoded_url
    else:
        return url.replace(' ', '%20')

def get_urls_from_file(filename = 'input_urls.txt'):
    urls = []
    
    with open(filename, 'r', encoding='utf-8') as file:
        lines = file.readlines()
    
    year_group = lines[0].strip()
    subject = lines[1].strip()
    sub_topic = lines[2].strip()
    
    url_group = []
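    for line in lines[3:]:
        # Each line holds a question URL and an answer URL separated by a space
        url_group = line.strip().split(' https://')

        # Check the pair is complete before indexing into it (avoids an IndexError)
        if len(url_group) == 2:
            question_url = get_pdf_embed_links(url_group[0])
            answer_url = get_pdf_embed_links(f'https://{url_group[1]}')
            urls.append((question_url, answer_url))
        else:
            print(f'Missing 1 pair: {url_group}')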

    if len(lines) == len(urls)+3:
        print(f"[{len(urls)}] Read urls")
    else:
        print(f"[{len(urls)}] Failed to read urls")
    
    return year_group, subject, sub_topic, urls

def fix_latex_delimiters(latex_string):
    if not isinstance(latex_string, str):
        return latex_string
        
    # Count the number of double dollar signs
    double_dollar_count = latex_string.count('$$')
    
    # Check if the count of double dollar signs is odd
    if double_dollar_count % 2 != 0:
        latex_string += ' $$'
        
    # Count the number of single dollar signs
    single_dollar_count = latex_string.count('$')
    
    # Check if the count of single dollar signs is odd
    if single_dollar_count % 2 != 0:
        latex_string += ' $'
    
    return latex_string

# Save to excel file with Data Validation
def save_excel(output_file, practise_data, output_folder='.output'):
    if not practise_data:
        return
    
    df = pd.DataFrame(practise_data).rename(columns={'Question':'Question_Title'})
    df.insert(1, 'Year Group', year_group)
    df.insert(2, 'Subject', subject)
    df.insert(3, 'Sub-Topic', sub_topic)
    df.insert(6, 'Type of question', "Practise Exam")
    df.insert(12, 'Source (Internal use)', "Physicsandmathstutor")

    # Fix LaTeX delimiters in specified columns
    columns_to_check = ['Question_Title', 'Answer', 'Mark Scheme', 'Other Text', 'Options']
    for column in columns_to_check:
        df[column] = df[column].apply(fix_latex_delimiters)

    difficulty_values = ['easy', 'medium', 'hard']
    
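    os.makedirs(output_folder, exist_ok=True)  # Create the output folder if it does not exist yet
    # `worksheet.data_validation` below is an xlsxwriter method, so request that engine explicitly
    with pd.ExcelWriter(os.path.join(output_folder, output_file), engine='xlsxwriter') as writer: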
        df.to_excel(writer, index=False)
    
        row_num = 1  
        last_row = len(df) + 1

        worksheet = writer.sheets['Sheet1']
        
        # Apply data validation to the 'Difficulty' column
        col_num = df.columns.get_loc('Difficulty')
        worksheet.data_validation(f'${chr(col_num+65)}{row_num}:${chr(col_num+65)}{last_row}', {'validate': 'list', 'source': difficulty_values})

    print(f"[{len(df)}] Data has been successfully saved to `{output_file}`.")

# Save Json files (For testing/fixing bugs)
def saveJSON(data,name="data.json"):
    with open(name, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)

def read_json_file(filename='pdf_ids_logs.json'):
    with open(filename, 'r', encoding='utf-8') as file:
        data = json.load(file)  # Load the JSON data into a Python dictionary
    return data

# Function to append a new set of questions and answers pdf_ids to the JSON file
def append_to_json_logs(pdf_ids, logs_file='pdf_ids_logs.json'):
    # Check if the JSON file exists
    if os.path.exists(logs_file):
        # Read existing data
        with open(logs_file, 'r', encoding='utf-8') as file:
            try:
                data = json.load(file)
            except json.JSONDecodeError:
                data = []  # If the file is empty or invalid, start with an empty list
    else:
        data = []  # If the file does not exist, start with an empty list

    # Append the new data
    data.append(pdf_ids)

    # Write the updated data back to the file
    with open(logs_file, 'w', encoding='utf-8') as file:
        json.dump(data, file, indent=4)
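
`get_pdf_embed_links` handles viewer links that carry the real PDF address in a `pdf` query parameter. The snippet below is a minimal standalone sketch of that extraction, under the assumption that a link without a `pdf` parameter is already a direct PDF link. `extract_pdf_url` and the example URL are hypothetical and used only for illustration.

```python
import urllib.parse

def extract_pdf_url(url):
    """Return the `pdf` query parameter if present, otherwise the URL itself."""
    query = urllib.parse.urlparse(url).query
    # parse_qs already percent-decodes the values it returns
    values = urllib.parse.parse_qs(query).get('pdf')
    target = values[0] if values else url
    # Re-encode spaces so the result can be passed on as a URL
    return target.replace(' ', '%20')

# Hypothetical viewer link, for illustration only
example = ('https://viewer.example.com/view.php'
           '?pdf=https%3A%2F%2Ffiles.example.com%2FMy%20Paper%20QP.pdf')
print(extract_pdf_url(example))
```

Using `.get('pdf')` instead of indexing means a link without the parameter falls through to the direct-link branch rather than raising a `KeyError`.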

<a name="MP"></a>
## MathPix

[MathPix APIs Documentation](https://docs.mathpix.com)

In [10]:
# MathPix
MATHPIX_APP_ID = os.environ.get("MATHPIX_APP_ID")
MATHPIX_APP_KEY = os.environ.get("MATHPIX_APP_KEY")

def process_pdf(url, app_id=MATHPIX_APP_ID, app_key=MATHPIX_APP_KEY):
    response = requests.post(
        "https://api.mathpix.com/v3/pdf",
        json={
            "url": url,
            "conversion_formats": {
                "md": True,
            },
            "math_inline_delimiters": ["$", "$"]
        },
        headers={
            "app_id": app_id,
            "app_key": app_key,
            "Content-type": "application/json"
        }
    )

    return response.json()

def process_pdfs(url_pairs):
    pdf_ids = {
        'questions': [],
        'answers': []
    }
    print(f'Processing [{len(url_pairs)}] pairs of PDF(s)... ')
    
    for idx, (question_url, answer_url) in enumerate(url_pairs):
        # Process the question PDF
        question_url = get_pdf_embed_links(question_url)
        print(f"Q{idx+1}) Extracting questions from:{question_url}", end='')
        
        response = process_pdf(question_url)
        pdf_ids['questions'].append(response['pdf_id'])  # Store question PDF ID
        
        print(f" | pdf_id:{response['pdf_id']}")
    
        # Process the answer PDF
        answer_url = get_pdf_embed_links(answer_url)
        print(f"A{idx+1}) Extracting answers from:{answer_url}",end='')
        
        response = process_pdf(answer_url)
        pdf_ids['answers'].append(response['pdf_id'])  # Store answer PDF ID
    
        print(f" | pdf_id:{response['pdf_id']}")

    append_to_json_logs(pdf_ids) # Save logs for future use

    return pdf_ids

def get_result_in_latex(pdf_id, app_id=MATHPIX_APP_ID, app_key=MATHPIX_APP_KEY):
    response = requests.get(
        "https://api.mathpix.com/v3/pdf/" + pdf_id + ".mmd", # get mmd response
        headers={
            "app_id": app_id,
            "app_key": app_key,
        }
    )
    
    return response.text

def get_results_in_latex(pdf_ids):
    contents = {
        'questions': [],
        'answers': []
    }
    
    # Get LaTeX contents using the stored PDF IDs
    for idx, (ques_pdf_id, ans_pdf_id) in enumerate(zip(pdf_ids['questions'], pdf_ids['answers'])):
        while True:
            content = get_result_in_latex(ques_pdf_id)

            if '"status":"split"' not in content:
                contents['questions'].append(content)
                print(f"Q{idx+1}) Got questions contents from pdf_id:{ques_pdf_id}")
                break
            else:
                print('Wait for the file to process.. ')
                time.sleep(2)

        while True:
            content = get_result_in_latex(ans_pdf_id)

            if '"status":"split"' not in content:
                print(f"A{idx+1}) Got answers contents from pdf_id:{ans_pdf_id}")
                contents['answers'].append(content)
                break
            else:
                print('Wait for the file to process.. ')
                time.sleep(2)
                
    return contents
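
The two `while True` loops in `get_results_in_latex` keep retrying for as long as the response contains `"status":"split"`, so a PDF that never finishes would block the notebook indefinitely. The snippet below is a minimal sketch of the same polling idea with an upper bound on the wait. `poll_until_ready` and its parameters are hypothetical names, and the iterator stands in for repeated calls to `get_result_in_latex(pdf_id)`.

```python
import time

def poll_until_ready(fetch, is_pending, interval=2.0, timeout=300.0, sleep=time.sleep):
    """Call fetch() until is_pending(result) is False or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if not is_pending(result):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Result still pending after {timeout} seconds")
        sleep(interval)

# Stand-in for get_result_in_latex(pdf_id): pending twice, then ready
responses = iter(['{"status":"split"}', '{"status":"split"}', '# Question 1 ...'])
content = poll_until_ready(
    fetch=lambda: next(responses),
    is_pending=lambda text: '"status":"split"' in text,
    interval=0,
)
print(content)
```

Passing the fetch and pending checks as callables keeps the helper independent of the MathPix client, so the same loop could serve both the question and the answer PDFs.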

<a name="OA"></a>
## OpenAI

[OpenAI APIs Documentation](https://platform.openai.com/docs/api-reference)

In [11]:
client = OpenAI()
LLM_responses = []
LLM_error_responses = []
# Prompt for injecting into LLMs
SYSTEM_PROMPT = f'''
You will receive 2 LaTeX contents, one are the questions and one are the answers. Your task is to output a **Python list** of **JSON objects**. Each JSON object must follow this structure:

- **ID**: (Required) A unique identifier for each question, generated from the question number and parent question. DO NOT FORGET the parent question that contains a general question. 
- **Difficulty**: (Required) Rate the difficulty as one of: ['easy', 'medium', 'hard'].
- **Parent_ID**: (MUST EXIST AN ID FOR THE PARENT) For sub-questions, assign IDs based on the parent question (e.g., if the parent question is '1', sub-questions could be '1a', '1b', etc.).
- **Question**: (Required) The full question derived from the question file. Do not summarize; ensure clarity for both parent and sub-questions.
- **Options**: Include multiple-choice options if available; otherwise, leave as ''.
- **Images**: Indicate if an image is associated with the question; otherwise, leave as ''.
- **Mark Scheme**: Provide the marking scheme if available.
- **Answer**: (Required except for questions that have sub-questions) The complete answer derived from the answer file. Do not summarize.
- **Mark**: (Required) Must be greater than 1. If specified in the answer file, include it; otherwise, assign based on difficulty. The parent question's mark must equal the sum of its sub-questions.
- **Other Text**: Any additional context or explanation.

**IMPORTANT NOTES**:
- Please use `$` for inline math in LaTeX.
- Ensure your output is NOT a markdown and can be used with the `json.loads()` method to convert.
- DO NOT put unnecessary newline, or space characters. keep the double backslash for the LaTeX format.
- Strip empty spaces to minimize output size.
- Do not answer questions or shorten content.
- Ensure all sub-questions are covered and there is a parent for all sub-questions.
- Retain any mathematical characters or equations in LaTeX format.
- Leave fields empty ('') if information cannot be extracted.
'''

def count_tokens(text, model="gpt-4o-mini"):
    encoding = tiktoken.encoding_for_model(model)  # Change model as needed
    return len(encoding.encode(text))
    
def get_completion_from_messages(messages, model="gpt-4o-mini"):
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        # max_tokens=6000
    )
    return response.choices[0].message.content


def extract_from_llm(questions_contents, answers_contents, system_message=SYSTEM_PROMPT):
    input_contents = f"""
    Questions File Contents: {questions_contents},
    
    Answers File Contents:{answers_contents}"""

    messages =  [  
    {'role':'system', 'content': system_message},    
    # {'role':'user', 'content': f"{few_shot_user_1}"},  #Few-shot learning can be used here, in-case the model got hallucination issues
    # {'role':'assistant', 'content': few_shot_assistant_1 },
    {'role':'user', 'content': input_contents},  
    ]
    
    return get_completion_from_messages(messages)

def fix_llm_response(LLM_response, error, model='gpt-4o-mini'):
    system_message = '''You will be given a string that gives an error while trying to load it into JSON format, you will be provided the error message as well
    ONLY Output only the JSON format corrected version of the string that can be successfully loaded into JSON format.
    '''

    input_message = f'''
    The error string is failed to load into JSON format:
    
    {LLM_response}
    
    The error message: {error}.
    '''
    
    messages =  [  
    {'role':'system', 'content': system_message},    
    {'role':'user', 'content': input_message},  
    ]

    return get_completion_from_messages(messages)

def extract_practise_exams(practise_data, contents):
    global LLM_responses
    for idx, (questions_contents, answers_contents) in enumerate(zip(contents['questions'], contents['answers'])):
        pre_len = len(practise_data)
        total_tokens = 0
        input_tokens = count_tokens(f'{SYSTEM_PROMPT}{questions_contents}{answers_contents}')
        total_tokens += input_tokens
        print(f"{idx+1}) Injecting into LLM... | Input Tokens:{input_tokens}", end='')
        
        # Extract data from LLM
        LLM_response = extract_from_llm(questions_contents, answers_contents).strip("```json").strip("```")
        
        output_tokens = count_tokens(f'{LLM_response}')
        total_tokens += output_tokens
        print(f' | Output Tokens:{output_tokens}',end='')
        
        # Handle error (if error happens while converting into JSON format)
        cur_data = []
        while not cur_data:
            try:
                cur_data = json.loads(LLM_response) # Convert string to JSON format
            except Exception as e:
                LLM_error_responses.append({'response':LLM_response,'error':e})
                error_message = f'Error encountered at index `{e.pos}` character `{LLM_response[e.pos]}` in this part `...{LLM_response[e.pos-10:e.pos+10]}...`'
                print(f'\n{error_message} | Fixing response...', end='')
        
                if LLM_response[e.pos] == '\\' and LLM_response[e.pos + 1] != '\\' and LLM_response[e.pos - 1] != '\\': # Mostly it is missing a `\` somewhere in the response
                    LLM_response = f'{LLM_response[:e.pos]}{'\\'}{LLM_response[e.pos:]}'
                    print(f" [Added '\\']", end='')
                else: # Otherwise, have to use LLM to fix the response
                    input_tokens = count_tokens(f'{LLM_response}{e} {error_message}')
                    total_tokens += input_tokens
                    print(f" | Input Tokens:{input_tokens}", end='')
                    
                    LLM_response = fix_llm_response(LLM_response, f'{e} {error_message}').strip("```json").strip("```")

                    output_tokens = count_tokens(f'{LLM_response}')
                    total_tokens += output_tokens
                    print(f" | Output Tokens:{output_tokens}",end='')
        
        LLM_responses.append(LLM_response)
        
        practise_data += cur_data
    
        print(f" | Total Tokens:{total_tokens} | Questions:{len(practise_data) - pre_len} | Total:{len(practise_data)}")

    return LLM_responses
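
`extract_practise_exams` removes a surrounding Markdown code fence from the model output with chained `str.strip` calls. `str.strip` treats its argument as a set of characters to trim rather than as a prefix or suffix, so it can also eat leading or trailing `j`, `s`, `o` or `n` characters. The snippet below is a minimal sketch of a stricter alternative; `strip_code_fence` is a hypothetical helper, not part of the notebook.

```python
import json

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

def strip_code_fence(text):
    """Remove one surrounding Markdown code fence, if present."""
    text = text.strip()
    if text.startswith(FENCE):
        # Drop the whole opening fence line (with or without a language tag)
        first_newline = text.find("\n")
        text = text[first_newline + 1:] if first_newline != -1 else text[len(FENCE):]
        if text.rstrip().endswith(FENCE):
            text = text.rstrip()[:-len(FENCE)]
    return text.strip()

fenced = FENCE + 'json\n[{"ID": "1", "Difficulty": "easy"}]\n' + FENCE
print(json.loads(strip_code_fence(fenced)))
```

Text without a fence passes through unchanged apart from surrounding whitespace, so the helper is safe to apply to every response.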

<a name="E"></a>
# Execution

<a name="RUFF"></a>
## Read URLS from file

In [12]:
year_group, subject, sub_topic, urls = get_urls_from_file()
year_group, subject, sub_topic

[15] Read urls


('A Levels', 'Physics', 'Particles and Radiation')

<a name="CPIL"></a>
## Convert PDFs into LaTeX (MathPix APIs)

In [13]:
pdf_ids = process_pdfs(urls)

Processing [15] pairs of PDF(s)... 
Q1) Extracting questions from:https://pmt.physicsandmathstutor.com/download/Physics/A-level/Topic-Qs/AQA/02-Particles-and-Radiation/Set-M/Applications%20of%20Conservation%20Laws%20QP.pdf | pdf_id:2025_01_13_e99cf271cc569080484dg
A1) Extracting answers from:https://pmt.physicsandmathstutor.com/download/Physics/A-level/Topic-Qs/AQA/02-Particles-and-Radiation/Set-M/Applications%20of%20Conservation%20Laws%20MS.pdf | pdf_id:2025_01_13_ae2d9f37a0e60c483cbfg
Q2) Extracting questions from:https://pmt.physicsandmathstutor.com/download/Physics/A-level/Topic-Qs/AQA/02-Particles-and-Radiation/Set-M/Classification%20of%20Particles%20QP.pdf | pdf_id:2025_01_13_b024e1492a56f8da33c3g
A2) Extracting answers from:https://pmt.physicsandmathstutor.com/download/Physics/A-level/Topic-Qs/AQA/02-Particles-and-Radiation/Set-M/Classification%20of%20Particles%20MS.pdf | pdf_id:2025_01_13_c07a222d2317a8c2c14bg
Q3) Extracting questions from:https://pmt.physicsandmathstutor.com/download/Physics/A-level/Topic-Qs/AQA/02-Particles-and-Radiation/Set-M/Collisions%20of%20Electrons%20with%20Atoms%20QP.pdf | pdf_id:2025_01_13_566ac15ee45ee29885c6g
A3) 

<a name="GLCUP"></a>
## Get LaTeX contents using `pdf_ids` (MathPix APIs)

PDFs with many pages take longer to process. While a file is still being processed, the cell prints `Wait for the file to process.. ` and retries.

In [14]:
pdf_ids = read_json_file()[-1]
contents = get_results_in_latex(pdf_ids)

Q1) Got questions contents from pdf_id:2025_01_13_e99cf271cc569080484dg
A1) Got answers contents from pdf_id:2025_01_13_ae2d9f37a0e60c483cbfg
Q2) Got questions contents from pdf_id:2025_01_13_b024e1492a56f8da33c3g
A2) Got answers contents from pdf_id:2025_01_13_c07a222d2317a8c2c14bg
Q3) Got questions contents from pdf_id:2025_01_13_566ac15ee45ee29885c6g
A3) Got answers contents from pdf_id:2025_01_13_a668ed00459a4ff1a3a6g
Q4) Got questions contents from pdf_id:2025_01_13_fdd42c5bc6fc204dc167g
A4) Got answers contents from pdf_id:2025_01_13_2d1d03e1cad68424d44cg
Q5) Got questions contents from pdf_id:2025_01_13_64606b258cbf4c5797a9g
A5) Got answers contents from pdf_id:2025_01_13_511c4e911b1114bcffacg
Q6) Got questions contents from pdf_id:2025_01_13_b388b444e2cf9b2ed562g
A6) Got answers contents from pdf_id:2025_01_13_d8bb204b5b88264d3a4ag
Q7) Got questions contents from pdf_id:2025_01_13_ae69c670b25cfaf44142g
A7) Got answers contents from pdf_id:2025_01_13_645c491b4d14f6361b72g
Q8) Go

<a name="IJIL"></a>
## Injecting into LLM (OpenAI APIs)

- Currently using [gpt-4o-mini](https://platform.openai.com/docs/models/gpt-4o-mini), a cost-efficient small model that is smarter and cheaper than GPT-3.5 Turbo and has vision capabilities (max output: **16,384** tokens).
- A model with a higher output-token limit can be used instead when a single PDF contains many questions.

In [15]:
practise_data = []
LLM_responses = []
LLM_responses = extract_practise_exams(practise_data, contents)

1) Injecting into LLM... | Input Tokens:2600 | Output Tokens:2292
Error encountered at index `136` character `\` in this part `...ve pion, $\pi^{+}$, ...` | Fixing response... [Added '\']
Error encountered at index `5369` character `\` in this part `... \\pi^{-}+\mathrm{p}...` | Fixing response... [Added '\']
Error encountered at index `6676` character `\` in this part `... \\pi^{+}+\mathrm{n}...` | Fixing response... [Added '\']
Error encountered at index `6701` character `\` in this part `...ightarrow \mathrm{p}...` | Fixing response... [Added '\']
Error encountered at index `6713` character `\` in this part `...mathrm{p}+\pi \\] (i...` | Fixing response... [Added '\'] | Total Tokens:4892 | Questions:29 | Total:29
2) Injecting into LLM... | Input Tokens:4055 | Output Tokens:2806
Error encountered at index `2784` character `\` in this part `... quark ( $\mathrm{u}...` | Fixing response... [Added '\'] | Total Tokens:6861 | Questions:24 | Total:53
 | Output Tokens:2204 | Total Tokens:5353 | Questions:29 | T
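
The `[Added '\']` messages in the output above come from the repair step in `extract_practise_exams`: when `json.loads` fails at a lone backslash (typically an unescaped LaTeX command), one extra backslash is inserted at the reported position and parsing is retried. The snippet below is a simplified sketch of that loop with a cap on the number of attempts. `repair_lone_backslashes` is a hypothetical helper, and it assumes the error position points at the backslash itself, as it does in the output above.

```python
import json

def repair_lone_backslashes(text, max_attempts=50):
    """Parse JSON, doubling one offending backslash per failed attempt."""
    for _ in range(max_attempts):
        try:
            return json.loads(text)
        except json.JSONDecodeError as error:
            pos = error.pos
            if text[pos:pos + 1] != "\\":
                raise  # not a backslash problem; another strategy is needed
            # Escape the backslash so that the JSON parser accepts it
            text = text[:pos] + "\\" + text[pos:]
    raise ValueError(f"Gave up after {max_attempts} repair attempts")

# Two unescaped LaTeX commands, so two repairs are needed
raw = r'[{"Question": "A pion $\pi^{+}$ decays", "Answer": "$\mathrm{p}$"}]'
print(repair_lone_backslashes(raw))
```

This kind of repair only triggers on escapes that JSON rejects. Commands such as `\frac`, `\theta` or `\nu` start with valid JSON escapes (`\f`, `\t`, `\n`), so they parse without error and are silently altered; spot-checking the saved output for those cases is worthwhile.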

In [16]:
output_file = f"{sub_topic.lower()} {year_group.lower()} {subject.lower()}.xlsx"
save_excel(output_file, practise_data)

[474] Data has been successfully saved to `particles and radiation a levels physics.xlsx`.
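
`save_excel` builds the validation range with `chr(col_num+65)`, which yields a valid column letter only for the first 26 columns (A to Z). That is enough for the sheet produced here, but it would break if more columns were added. xlsxwriter provides `xlsxwriter.utility.xl_col_to_name` for this conversion; the snippet below is a dependency-free sketch of the same idea, with `column_letter` as a hypothetical helper name.

```python
def column_letter(col_index):
    """Convert a zero-based column index to an Excel column name (0 -> 'A')."""
    if col_index < 0:
        raise ValueError("col_index must be non-negative")
    letters = ""
    n = col_index + 1  # Excel column names behave like a 1-based base-26 numbering
    while n > 0:
        n, remainder = divmod(n - 1, 26)
        letters = chr(65 + remainder) + letters
    return letters

for index in (0, 25, 26, 27):
    print(index, column_letter(index))
```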
