## Gemini ground truth: Gemini output in JSON

In [21]:
import os
from dotenv import load_dotenv
import google.generativeai as genai

# === Load API Key ===
load_dotenv()

model_name = "gemini-2.5-pro"
genai.configure(api_key=os.getenv("GOOGLE_GEMINI_API"))
model = genai.GenerativeModel(model_name)

def send_pdf_to_gemini_and_save_json(pdf_path, prompt, output_base_dir):
    try:
        # Upload the PDF file
        file_resource = genai.upload_file(pdf_path, mime_type="application/pdf")
        # Compose the prompt and file
        response = model.generate_content([prompt, file_resource])
        generated_text = response.text

        pdf_filename = os.path.basename(pdf_path)
        pdf_stem = os.path.splitext(pdf_filename)[0]
        output_dir = os.path.join(output_base_dir, pdf_stem)
        os.makedirs(output_dir, exist_ok=True)
        output_json_path = os.path.join(output_dir, f"{pdf_stem}.json")

        with open(output_json_path, 'w', encoding='utf-8') as output_file:
            output_file.write(generated_text)
        print(f"Output saved to {output_json_path}")
    except Exception as e:
        print(f"Error in generating response: {str(e)}")

# Example usage
if __name__ == "__main__":
    pdf_path = "/Users/simrannaik/Desktop/solution_improvement/ds-prototypes/subjective_grading/solution_improvement/Z_OCR_gemini_pcmb/Maths/Maths_human/maths_pdf_docx_human_ocr/14_10021393251035351171693723024.pdf"
    output_base_dir = "/Users/simrannaik/Desktop/solution_improvement/ds-prototypes/subjective_grading/solution_improvement/Z_OCR_gemini_pcmb/Maths/Maths_human/pdf_docx_json"
    prompt = """
**Role**: You are a meticulous digital archivist tasked with transcribing handwritten student answer sheets.

**Core Task**: Your goal is to create a perfect digital copy of the student's work. You must transcribe the text *exactly* as it appears, including any spelling or grammatical errors.
---

### Other Directives
1.  **Collate Sub-Questions**: Group all parts of a question (e.g., 11. (1), 11. (2)) under a single main question number. Preserve the original sub-question numbering in the text.

---

### Output Format
- The output MUST be a single, valid JSON array containing one object per main question.
- Do NOT include any text or explanations outside of the JSON array.
- Only transcribe text that is not enclosed within angle brackets (<TEXT>).

**Example of a valid JSON object:**
[
  {
    "question_number": 11,
    "ocr_text": "11. (1) This is the answer to the first part. 11. (2) This is the final answer, which contains a PV curve <diagram_1>. [NOTE: The student originally wrote 'the initial answer' here and struck it through; it has been correctly omitted from this output per the critical rule.]",
    "diagrams": [
      {
        "id": "diagram_1",
        "coordinates": "0.5,0.5,0.2,0.3",
        "diagram_class": "graph or diagram",
        "page_number": 3
      }
    ],
    "pages": [2, 3]
  }
]
"""
    send_pdf_to_gemini_and_save_json(pdf_path, prompt, output_base_dir)

Output saved to /Users/simrannaik/Desktop/solution_improvement/ds-prototypes/subjective_grading/solution_improvement/Z_OCR_gemini_pcmb/Maths/Maths_human/pdf_docx_json/14_10021393251035351171693723024/14_10021393251035351171693723024.json
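
The cell above writes `response.text` to the `.json` file unchanged, so the saved file can contain a Markdown code fence or text that does not parse. The sketch below shows one way to check the response before saving it. The helper name `extract_json_array` and the sample string are hypothetical, and the required keys are assumed from the example object in the prompt.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker, built here to keep this example readable


def extract_json_array(raw_text):
    """Strip an optional Markdown code fence and parse the rest as a JSON array."""
    text = raw_text.strip()
    if text.startswith(FENCE):
        # Drop the whole opening fence line (for example, the fence followed by "json")
        first_newline = text.find("\n")
        text = text[first_newline + 1:] if first_newline != -1 else ""
        text = text.rstrip()
        if text.endswith(FENCE):
            text = text[:-len(FENCE)]
    data = json.loads(text)
    if not isinstance(data, list):
        raise ValueError("Expected a JSON array of question objects")
    for entry in data:
        # Keys assumed from the example object in the prompt
        if "question_number" not in entry or "ocr_text" not in entry:
            raise ValueError(f"Missing required keys in entry: {entry}")
    return data


# Made-up response text for illustration only
sample = FENCE + 'json\n[{"question_number": 11, "ocr_text": "11. (1) text"}]\n' + FENCE
questions = extract_json_array(sample)
print(questions[0]["question_number"])
```

Calling a helper like this before `output_file.write(...)` would surface malformed responses at generation time instead of in the later comparison step.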


## Send the PDF to Gemini to get OCR output

In [46]:
import os
import fitz  # PyMuPDF for PDF processing
from PIL import Image
import json
import google.generativeai as genai
from dotenv import load_dotenv
import base64
import time
import numpy as np
import cv2

# === Load API Key ===
load_dotenv()
model_name = "gemini-2.5-pro"
genai.configure(api_key=os.getenv("GOOGLE_GEMINI_API"))
model = genai.GenerativeModel(model_name)

# === Prompt for Gemini ===
PROMPT = """
### System Instruction

**Role**: You are a meticulous digital archivist tasked with transcribing handwritten student answer sheets.

**Core Task**: Your goal is to create a perfect digital copy of the student's work.
**Important Rule**: **DO NOT** correct any spelling, punctuation, or grammatical errors. This includes cases where words might seem misspelled, such as "metabolities" instead of "metabolites". **Preserve all text exactly as it appears in the PDF**, even if there are apparent mistakes or inconsistencies.
- Please avoid using Unicode escape sequences (e.g., \u25b3) and provide direct characters like ∆, ², and cm.
- Whenever a table is found in the image, output the information in a proper table format.
- Please omit symbols like "->" from the final text output.
- If the word "Ans." is followed by a number (e.g., "Ans1."), only include the number and exclude the word "Ans."
---
### Other Directives
1.  **Ignore Page Template**: Exclude all non-content elements like headers, footers, page numbers, or decorative logos.
2.  **Collate Sub-Questions**: Group all parts of a question (e.g., 11. (1), 11. (2)) under a single main question number. Preserve the original sub-question numbering in the text.

---
### IGNORE ALL STRIKETHROUGH TEXT
Any portion of text that has a line through it (strikethrough) MUST BE COMPLETELY REMOVED OR IGNORED from the output. Do not include it.
---

### Output Format
- The output MUST be a single, valid JSON array containing one object per main question.
- Do NOT include any text or explanations outside of the JSON array.



**Example of a valid JSON object:**
[
  {
    "question_number": 11,
    "ocr_text": "11. (1) This is the answer to the first part. 11. (2) This is the final answer. [NOTE: The student originally wrote 'the initial answer' here and struck it through; it has been correctly omitted from this output per the critical rule.]",
    "diagrams": [
      {
        "id": "diagram_1",
        "coordinates": "0.5,0.5,0.2,0.3",
        "diagram_class": "graph or diagram",
        "page_number": 3
      }
    ],
    "pages": [2, 3]
  }
]
**Schema Definitions:**
- `question_number` (integer): The main question number.
- `ocr_text` (string): The full, collated text for the question and all its sub-parts.
- `diagrams` (array): A list of diagram objects. Leave as an empty array `[]` if none.
  - `id` (string): The diagram identifier from the text.
  - `coordinates` (string): "x_mid,y_mid,width,height", with values normalized between 0 and 1 relative to image dimensions.
  - `page_number` (integer): The page where the diagram is located.
- `pages` (array): A list of all page numbers on which any part of the question appears.

"""

##768,1536,364
# === Set the dimension value ===
dim = 1536  # Define the dimension value

# === Resize Image Function ===
def resize_image(image, dim=dim, save_path=None):
    image1 = np.array(image.convert('RGB'))  # Ensure the image is in RGB mode
    original_size = image1.shape  # (height, width, channels)
    image1 = image1.mean(axis=2)  # Convert image to grayscale
    h, w = image1.shape
    if w > h:
        new_w = dim
        new_h = int(h * (dim / w))
    else:
        new_h = dim
        new_w = int(w * (dim / h))
    resized_image = cv2.resize(image1, (new_w, new_h), interpolation=cv2.INTER_AREA)
    resized_image_pil = Image.fromarray(resized_image)
    resized_image_pil = resized_image_pil.convert('RGB')  # Convert to RGB before saving
    if save_path:
        resized_image_pil.save(save_path)
    return original_size, (new_h, new_w), resized_image_pil

# === Load PDF, Convert Pages to Images ===
def pdf_to_images(pdf_path, output_folder):
    doc = fitz.open(pdf_path)
    images = []
    num_pages = doc.page_count
    for i in range(num_pages):
        page = doc.load_page(i)
        pix = page.get_pixmap(dpi=300)
        img_path_default = os.path.join(output_folder, f"page_{i + 1}.jpeg")
        pix.save(img_path_default)
        images.append(img_path_default)
    return images, num_pages

# === Load Base64 Images ===
def load_base64_images(folder_path, num_pages):
    b64_list = []
    for i in range(num_pages):
        img_filename_dim = f"DIM_{dim}_PAGE_{i + 1}.jpeg"  # Use dim variable here
        img_filename_default = f"page_{i + 1}.jpeg"  # Old naming convention

        img_path_dim = os.path.join(folder_path, img_filename_dim)
        img_path_default = os.path.join(folder_path, img_filename_default)

        if os.path.exists(img_path_dim):
            path = img_path_dim
        elif os.path.exists(img_path_default):
            path = img_path_default
        else:
            print(f"❌ Image {img_filename_dim} or {img_filename_default} not found in {folder_path}")
            continue

        with open(path, "rb") as f:
            b64_list.append(base64.b64encode(f.read()).decode())
    return b64_list

# === Batch send to Gemini ===
def send_to_gemini(resized_images_objs):
    try:
        response = model.generate_content([PROMPT] + resized_images_objs)  # Batch processing
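        raw = response.text.strip()
        # str.strip() removes a set of characters, not a prefix, so strip the code fence explicitly
        cleaned = raw.removeprefix('```json').removeprefix('```').removesuffix('```').strip()
        parsed = json.loads(cleaned)
        return parsed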
    except Exception as e:
        print(f"❌ Failed to process images: {e}")
        return None

# === Main Process ===
def main(pdf_file_path):
    # Extract folder name from the PDF file name
    pdf_filename = os.path.splitext(os.path.basename(pdf_file_path))[0]
    
    # Define the output folder path using the PDF filename
    output_folder = os.path.join("/Users/simrannaik/Desktop/solution_improvement/ds-prototypes/subjective_grading/solution_improvement/Z_OCR_gemini_pcmb/Maths/Maths_Gemini/Maths_json", pdf_filename)  # Adjust path as needed
    os.makedirs(output_folder, exist_ok=True)

    # Step 1: Convert PDF to images
    images, num_pages = pdf_to_images(pdf_file_path, output_folder)

    # Step 2: Resize images
    resized_images = []
    for page_num in range(len(images)):
        img_path = images[page_num]
        original_size, new_size, resized_image = resize_image(Image.open(img_path), dim=dim)
        
        # Save resized image
        img_filename_dim = f"DIM_{dim}_PAGE_{page_num + 1}.jpeg"
        resized_image.save(os.path.join(output_folder, img_filename_dim))
        
        resized_images.append(resized_image)

    # Step 3: Load base64 images (not used below; the resized PIL images are sent to Gemini directly)
    images_b64 = load_base64_images(output_folder, len(images))

    # Step 4: Send to Gemini for OCR
    results = send_to_gemini(resized_images)

    if results:
        output_json_filename = f"{pdf_filename}.json"
        json_path = os.path.join(output_folder, output_json_filename)

        # Save the OCR results
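        # Write UTF-8 with ensure_ascii=False so characters like ∆ and ² are not saved as \u escapes
        with open(json_path, "w", encoding="utf-8") as f:
            json.dump(results, f, indent=3, ensure_ascii=False)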

        print(f"OCR results saved to {json_path}")
    else:
        print("❌ No results from Gemini OCR")

# Running the process with a given PDF file path
pdf_file_path = "/Users/simrannaik/Desktop/solution_improvement/ds-prototypes/subjective_grading/solution_improvement/Z_OCR_gemini_pcmb/Maths/Maths_Gemini/pdf/math/12_100210408223409261171703682786.pdf"  # Replace with your PDF file path
main(pdf_file_path)


OCR results saved to /Users/simrannaik/Desktop/solution_improvement/ds-prototypes/subjective_grading/solution_improvement/Z_OCR_gemini_pcmb/Maths/Maths_Gemini/Maths_json/12_100210408223409261171703682786/12_100210408223409261171703682786.json
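
The schema in the prompt defines `coordinates` as `"x_mid,y_mid,width,height"`, normalized to the image dimensions. To crop or draw a diagram region, those values have to be converted to pixel corners. The sketch below shows that conversion; the function name `normalized_box_to_pixels` and the 1000×1000 image size are hypothetical, chosen only for illustration.

```python
def normalized_box_to_pixels(coordinates, image_width, image_height):
    """Convert "x_mid,y_mid,width,height" (normalized to 0..1) into a pixel box
    (left, top, right, bottom), clamped to the image bounds."""
    x_mid, y_mid, box_w, box_h = (float(v) for v in coordinates.split(","))
    left = (x_mid - box_w / 2) * image_width
    top = (y_mid - box_h / 2) * image_height
    right = (x_mid + box_w / 2) * image_width
    bottom = (y_mid + box_h / 2) * image_height

    def clamp(value, upper):
        # Round to the nearest pixel and keep the result inside [0, upper]
        return max(0, min(int(round(value)), upper))

    return (clamp(left, image_width), clamp(top, image_height),
            clamp(right, image_width), clamp(bottom, image_height))


# Coordinates taken from the example object in the prompt; image size is assumed
box = normalized_box_to_pixels("0.5,0.5,0.2,0.3", 1000, 1000)
print(box)  # (400, 350, 600, 650)
```

The returned tuple follows the `(left, upper, right, lower)` order that Pillow's `Image.crop` expects, so it can be passed to the resized page image directly.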


## Make a table with human OCR, Gemini OCR, CER, type of error, error description, question number, docx
Error types used in the table:

- Spelling
- Wording
- Extra Content
- Punctuation
- Numerical Difference
- Missing Content
- Content Mix-up
- Omission

In [None]:
import os
import json
from difflib import ndiff

# === CONFIGURATION ===
GEMINI_JSON_PATH = "/Users/simrannaik/Desktop/solution_improvement/ds-prototypes/subjective_grading/solution_improvement/Z_OCR_gemini_pcmb/Maths/Maths_Gemini/Maths_json/01_10021165141080491171694788750/01_10021165141080491171694788750.json"
HUMAN_JSON_PATH = "/Users/simrannaik/Desktop/solution_improvement/ds-prototypes/subjective_grading/solution_improvement/Z_OCR_gemini_pcmb/Maths/Maths_human/pdf_docx_json/01_10021165141080491171694788750/01_10021165141080491171694788750.json"
OUTPUT_MD_DIR = "/Users/simrannaik/Desktop/solution_improvement/ds-prototypes/subjective_grading/solution_improvement/Z_OCR_gemini_pcmb/Maths/Tables"

os.makedirs(OUTPUT_MD_DIR, exist_ok=True)

# === LOAD JSONS ===
with open(GEMINI_JSON_PATH, "r", encoding="utf-8") as f:
    gemini_data = json.load(f)
with open(HUMAN_JSON_PATH, "r", encoding="utf-8") as f:
    human_data = json.load(f)

# === BUILD QUESTION MAPS ===
def build_q_map(data):
    q_map = {}
    for entry in data:
        qnum = entry.get("question_number")
        text = entry.get("ocr_text", "")
        q_map[qnum] = text
    return q_map

gemini_q = build_q_map(gemini_data)
human_q = build_q_map(human_data)

# === CER CALCULATION ===
def char_error_rate(s1, s2):
    if s1 == "na" or s2 == "na":
        return "na"
    diff = list(ndiff(s1, s2))
    insertions = sum(1 for d in diff if d[0] == '+')
    deletions = sum(1 for d in diff if d[0] == '-')
    ref_len = len(s1)
    if ref_len == 0:
        # Return strings so the value can be joined into a Markdown row like the other cases
        return "0.0000" if len(s2) == 0 else "1.0000"
    cer = (insertions + deletions) / ref_len
    return f"{cer:.4f}"

# === HIGHLIGHT DIFFERENCES ===
def highlight_differences(s1, s2):
    if s1 == "na" or s2 == "na":
        return "na"
    diff = list(ndiff(s1, s2))
    result = []
    for d in diff:
        if d[0] == ' ':
            result.append(d[2])
        elif d[0] == '-':
            result.append(f"[-{d[2]}-]")
        elif d[0] == '+':
            result.append(f"[+{d[2]}+]")
    return ''.join(result)

# === SANITIZE FOR MARKDOWN ===
def sanitize(text):
    if text == "na":
        return text
    text = text.replace("\n", " ").replace("|", "\\|").replace(",", "&#44;")
    return text

# === OUTPUT HEADERS ===
header = [
    "docx source",
    "question number",
    "Human_ocr",
    "Gemini_ocr",
    "cer",
    "highlighted differences",
    "type of error",
    "Discrepancy Analysis"
]

# === OUTPUT FILE NAME ===
json_base = os.path.splitext(os.path.basename(GEMINI_JSON_PATH))[0]
output_md_path = os.path.join(OUTPUT_MD_DIR, f"{json_base}.md")

# === BUILD TABLE ===
md_lines = ["# OCR Comparison Table", "\n", "| " + " | ".join(header) + " |", "| " + " | ".join(['---'] * len(header)) + " |"]

for qnum in sorted(set(gemini_q.keys()) | set(human_q.keys())):
    human_raw = human_q.get(qnum, "na")
    gemini_raw = gemini_q.get(qnum, "na")
    if human_raw == "na" or gemini_raw == "na":
        cer_value = "na"
        highlighted_diff = "na"
    else:
        # Compare the raw texts, then sanitize once, so escapes such as "&#44;" and "\|"
        # neither inflate the CER nor get escaped a second time
        cer_value = char_error_rate(human_raw, gemini_raw)
        highlighted_diff = sanitize(highlight_differences(human_raw, gemini_raw))
    human_text = sanitize(human_raw)
    gemini_text = sanitize(gemini_raw)
    row = [
        os.path.basename(GEMINI_JSON_PATH),
        str(qnum),
        human_text,
        gemini_text,
        cer_value,
        highlighted_diff,
        "",
        "",
    ]
    md_lines.append("| " + " | ".join(row) + " |")

# === WRITE OUTPUT ===
with open(output_md_path, "w", encoding="utf-8") as f:
    f.write("\n".join(md_lines))

print(f"Markdown table written to {output_md_path}")


Markdown table written to /Users/simrannaik/Desktop/solution_improvement/ds-prototypes/subjective_grading/solution_improvement/Z_OCR_gemini_pcmb/Maths/Tables/01_10021165141080491171694788750.md
CSV table written to /Users/simrannaik/Desktop/solution_improvement/ds-prototypes/subjective_grading/solution_improvement/Z_OCR_gemini_pcmb/Maths/tables_csv/01_10021165141080491171694788750.csv
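
The `char_error_rate` function above counts the insertions and deletions reported by `difflib.ndiff`, so a single substituted character contributes two errors. CER is more commonly defined through the Levenshtein edit distance, where a substitution counts once. The sketch below shows that variant for comparison; the function names are hypothetical and it is not the metric used to produce the table above.

```python
def levenshtein_distance(reference, hypothesis):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # deletion from reference
                               current[j - 1] + 1,       # insertion into reference
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def levenshtein_cer(reference, hypothesis):
    """Edit distance divided by the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein_distance(reference, hypothesis) / len(reference)


# The misspelling from the OCR prompt versus its corrected form: one deleted character
print(f"{levenshtein_cer('metabolities', 'metabolites'):.4f}")
```

With the human transcription as `reference`, this treats the human text as ground truth, which matches how `ref_len` is taken from `s1` in the cell above.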


## Send the two columns to Gemini for evaluation

In [59]:
import os
from dotenv import load_dotenv
import google.generativeai as genai

# === Load API Key ===
load_dotenv()
genai.configure(api_key=os.getenv("GOOGLE_GEMINI_API"))

# Set model
model = genai.GenerativeModel("gemini-2.5-pro")

def send_md_and_prompt(input_md_path, prompt, output_md_dir):
    # Read Markdown file
    with open(input_md_path, 'r', encoding='utf-8') as f:
        md_content = f.read()

    # Compose prompt content
    full_prompt = f"{prompt}\n\n<Markdown Table Input>\n{md_content}"

    try:
        response = model.generate_content(
            full_prompt,
            generation_config={"temperature": 0.2},
        )
        generated_text = response.text

        # Try to extract Markdown from the response (strip markdown code block if present)
        if generated_text.strip().startswith('```markdown'):
            generated_text = generated_text.strip().removeprefix('```markdown').removesuffix('```').strip()
        elif generated_text.strip().startswith('```'):
            generated_text = generated_text.strip().removeprefix('```').removesuffix('```').strip()

        # Output file name
        base_name = os.path.splitext(os.path.basename(input_md_path))[0]
        os.makedirs(output_md_dir, exist_ok=True)
        output_md_path = os.path.join(output_md_dir, f"{base_name}_gemini_output.md")

        # Save the generated output
        with open(output_md_path, 'w', encoding='utf-8') as out_file:
            out_file.write(generated_text)

        print(f"🎉 Output saved to {output_md_path}")

    except Exception as e:
        print(f"❌ Error generating response: {e}")

# === Example Usage ===
if __name__ == "__main__":
    input_md_path = "/Users/simrannaik/Desktop/solution_improvement/ds-prototypes/subjective_grading/solution_improvement/Z_OCR_gemini_pcmb/Maths/analysis_md/01_10021165141080491171694788750.md"  # Change as needed
    output_md_dir = "/Users/simrannaik/Desktop/solution_improvement/ds-prototypes/subjective_grading/solution_improvement/Z_OCR_gemini_pcmb/Maths/gemini_md"
    prompt = """
Role: You are a highly accurate Quality Assurance (QA) Engine for OCR systems.
Context: You will be provided with a Markdown file containing two columns:
Human OCR: The text extracted by a human from the original content.
Gemini OCR: The text extracted by the Gemini model.
Objective: Your sole purpose is to determine which of the two columns—Human OCR or Gemini OCR—more accurately reflects the original content.

Specific Instructions:
1. For each row:
   - Compare the Human OCR and Gemini OCR for accuracy.
   - Determine the discrepancy (if any) between the two columns. Discrepancies can include:
     - Spelling mistakes
     - Punctuation errors
     - Wording differences
     - Numerical discrepancies
     - Missing Content
     - Extra Content
     - Content Mix-up
     - Omission

2. If either OCR text contains "na", mention "na" in the type of error and Discrepancy Analysis columns.

3. For each discrepancy found, report it in a table with the following columns:
   - Question Number: The ID of the solution where the discrepancy occurred.
   - Human OCR: The full text of the solution from the Human OCR column.
   - Gemini OCR: The full text of the solution from the Gemini OCR column.
   - Type of error: The type of error (e.g., Spelling, Wording, Extra Content, Punctuation, Numerical Difference, Missing Content, Content Mix-up, Omission).
   - Discrepancy Analysis: A brief analysis of the error.

4. If no discrepancies are found, flag the Discrepancy Analysis as "NO ERRORS" and the Type of Error as "NO ERRORS".

Output: You will output a single table (in Markdown format) with discrepancies, including an analysis of the error type and a brief justification of which version is more accurate.

"""
    send_md_and_prompt(input_md_path, prompt, output_md_dir)


🎉 Output saved to /Users/simrannaik/Desktop/solution_improvement/ds-prototypes/subjective_grading/solution_improvement/Z_OCR_gemini_pcmb/Maths/gemini_md/01_10021165141080491171694788750_gemini_output.md
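
Once the evaluation table is saved, a natural next step is to tally how often each error type occurs. The sketch below parses a Markdown table into rows and counts one column. The helper name `parse_markdown_table` and the sample table are made up for illustration, and the column name `Type of error` is assumed from the prompt; the header Gemini actually returns may differ.

```python
import re
from collections import Counter


def parse_markdown_table(md_text):
    """Parse a pipe-delimited Markdown table into a list of {header: cell} dicts."""
    rows = []
    header = None
    for line in md_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip titles and blank lines
        # Split on pipes that are not escaped with a backslash; drop the outer empty cells
        cells = [c.strip() for c in re.split(r"(?<!\\)\|", line)[1:-1]]
        if header is None:
            header = cells
            continue
        if all(re.fullmatch(r":?-+:?", c) for c in cells):
            continue  # separator row such as | --- | --- |
        rows.append(dict(zip(header, cells)))
    return rows


# Made-up table for illustration; it is not output from the run above
sample = """
| Question Number | Type of error | Discrepancy Analysis |
| --- | --- | --- |
| 1 | Spelling | Gemini wrote 'teh' |
| 2 | NO ERRORS | NO ERRORS |
| 3 | Spelling | a \\| b |
"""

rows = parse_markdown_table(sample)
counts = Counter(row["Type of error"] for row in rows)
print(counts.most_common())
```

Splitting only on unescaped pipes matters here because the comparison table escapes literal `|` characters as `\|` before they are sent to Gemini.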
