# OCR & Metadata Extraction from a Seismic Report (PDF)

In this notebook, I demonstrate how to extract structured data and images from a real seismic report in PDF format, combining direct PDF text extraction with OCR on the embedded images. This is part of a broader geophysics pipeline aimed at enabling large-scale, automated analysis of seismic literature and field reports.

Key skills demonstrated:
- Text and image extraction from complex PDFs
- PDF parsing with PyMuPDF and OCR with Tesseract (via pytesseract)
- Structuring raw OCR output for downstream processing


In [1]:
import fitz              # PyMuPDF
import pytesseract
from PIL import Image
import cv2
import numpy as np
import os

# Set up path to tesseract executable (needed on Windows)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
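
The hard-coded path above only works for a default Windows install. Below is a minimal sketch of a more portable lookup, assuming Tesseract is either on `PATH` or at that same default location. `resolve_tesseract_cmd` is a hypothetical helper written for this notebook, not part of pytesseract.

```python
import os
import shutil


def resolve_tesseract_cmd(
    windows_default=r"C:\Program Files\Tesseract-OCR\tesseract.exe",
):
    """Return a path to the tesseract executable, or None if it is not found."""
    # Prefer whatever is on PATH (typical on Linux/macOS, and on Windows
    # when the installer added Tesseract to PATH).
    found = shutil.which("tesseract")
    if found:
        return found
    # Assumption: fall back to the default Windows install location.
    if os.name == "nt" and os.path.isfile(windows_default):
        return windows_default
    return None


cmd = resolve_tesseract_cmd()
print(cmd or "tesseract executable not found")
```

If a path is returned, it can be assigned to `pytesseract.pytesseract.tesseract_cmd` exactly as in the cell above.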


In [2]:
# Load the PDF and preview its page count
pdf_path = "../data_report.pdf"

doc = fitz.open(pdf_path)

print(f" PDF loaded: {pdf_path}")
print(f" Number of pages: {len(doc)}")

 PDF loaded: ../data_report.pdf
 Number of pages: 10


In [3]:
for page_num in range(len(doc)):
    text = doc[page_num].get_text()
    print(f"\n--- Page {page_num+1} ---\n{text[:1000]}")  # Only show first 1000 chars for brevity



--- Page 1 ---
Magnitude 7.0 NEW CALEDONIA
Thursday,  March 31, 2022, 05:44:01 UTC
A magnitude 7.0 earthquake has occurred 279 km (173 miles) ESE of 
issued for the region has been lifted. There are no immediate reports of 
damage or injuries.
Australia
Yédjélé Beach, Maré, New Caledonia


--- Page 2 ---
Extreme 
Violent
Severe
Very Strong
Strong
Moderate
Light
Weak
Not Felt
USGS estimated shaking intensity from M 7.0 Earthquake
The Modified-Mercalli Intensity 
(MMI) scale is a ten-stage 
scale, from I to X, that indicates 
the severity of ground shaking. 
Intensity is based on observed 
effects and is variable over the 
area affected by an earthquake. 
Intensity is dependent on 
earthquake size, depth, 
distance, and local conditions.  
MMI   Perceived Shaking
Magnitude 7.0 NEW CALEDONIA
Thursday,  March 31, 2022, 05:44:01 UTC


--- Page 3 ---
The USGS PAGER map shows the 
population exposed to different Modified 
Mercalli Intensity (MMI) levels. 
The USGS estimates that over 3,000
p

In [4]:
img_dir = "../output/images"
os.makedirs(img_dir, exist_ok=True)

image_paths = []  # list of (page, image_path)

for i in range(len(doc)):
    for img_index, img_info in enumerate(doc.get_page_images(i), start=1):
        xref = img_info[0]
        base_image = doc.extract_image(xref)
        img_bytes = base_image["image"]
        img_ext = base_image["ext"]
        img_path = os.path.join(img_dir, f"page{i+1}_img{img_index}.{img_ext}")
        with open(img_path, "wb") as img_file:
            img_file.write(img_bytes)
        image_paths.append((i+1, img_path))
        print(f" Saved image from page {i+1}: {img_path}")



 Saved image from page 1: ../output/images\page1_img1.png
 Saved image from page 1: ../output/images\page1_img2.png
 Saved image from page 1: ../output/images\page1_img3.png
 Saved image from page 1: ../output/images\page1_img4.png
 Saved image from page 1: ../output/images\page1_img5.png
 Saved image from page 1: ../output/images\page1_img6.png
 Saved image from page 1: ../output/images\page1_img7.jpeg
 Saved image from page 1: ../output/images\page1_img8.jpeg
 Saved image from page 1: ../output/images\page1_img9.png
 Saved image from page 1: ../output/images\page1_img10.png
 Saved image from page 1: ../output/images\page1_img11.png
 Saved image from page 1: ../output/images\page1_img12.png
 Saved image from page 1: ../output/images\page1_img13.png
 Saved image from page 1: ../output/images\page1_img14.png
 Saved image from page 1: ../output/images\page1_img15.png
 Saved image from page 1: ../output/images\page1_img16.png
 Saved image from page 1: ../output/images\page1_img17.png
 Sav

The cell above extracts the embedded images from each page. Next, I run Tesseract OCR on each extracted image to capture any text labels, map legends, or captions that aren’t in the PDF’s text stream.


In [5]:
ocr_results = {}  # page → [ocr_texts]

for page_num, img_path in image_paths:
    # Read image with OpenCV; imread returns None if the file cannot be decoded
    img = cv2.imread(img_path)
    if img is None:
        print(f"\n Skipping unreadable image: {os.path.basename(img_path)}")
        continue
    # Convert from OpenCV's BGR channel order to RGB for PIL
    img_rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    pil_img = Image.fromarray(img_rgb)
    # OCR
    text = pytesseract.image_to_string(pil_img)
    ocr_results.setdefault(page_num, []).append(text)
    print(f"\n OCR from {os.path.basename(img_path)}:\n{text[:300]}")  # preview



 OCR from page1_img1.png:
S

|

Teachable Moments


 OCR from page1_img2.png:


 OCR from page1_img3.png:


 OCR from page1_img4.png:


 OCR from page1_img5.png:


 OCR from page1_img6.png:


 OCR from page1_img7.jpeg:


 OCR from page1_img8.jpeg:
Coral Seg



 OCR from page1_img9.png:


 OCR from page1_img10.png:


 OCR from page1_img11.png:


 OCR from page1_img12.png:


 OCR from page1_img13.png:


 OCR from page1_img14.png:


 OCR from page1_img15.png:


 OCR from page1_img16.png:


 OCR from page1_img17.png:


 OCR from page2_img1.png:
S

|

Teachable Moments


 OCR from page2_img2.jpeg:


 OCR from page2_img3.png:


 OCR from page2_img4.png:


 OCR from page2_img5.png:


 OCR from page2_img6.png:


 OCR from page2_img7.png:


 OCR from page2_img8.png:


 OCR from page2_img9.jpeg:


 OCR from page2_img10.png:


 OCR from page2_img11.png:


 OCR from page2_img12.png:


 OCR from page2_img13.png:


 OCR from page2_img14.png:


 OCR from page2_img15.png:


 OCR from page2_img16.png:
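
In the output above, `page1_img1.png` and `page2_img1.png` produce identical OCR text, which suggests (but does not prove) that the same image is embedded on more than one page. If so, hashing the image bytes before OCR would avoid repeated work. The sketch below assumes duplicates are byte-identical; `dedupe_by_content` is a hypothetical helper, and the byte strings are stand-ins rather than real image data.

```python
import hashlib


def dedupe_by_content(items):
    """Group images by the SHA-256 digest of their bytes.

    items: iterable of (page, name, image_bytes).
    Returns (first_seen, duplicates), where first_seen maps each digest to the
    first (page, name) carrying it, and duplicates lists
    (page, name, name_of_first_copy) for every later copy.
    """
    first_seen = {}
    duplicates = []
    for page, name, data in items:
        digest = hashlib.sha256(data).hexdigest()
        if digest in first_seen:
            duplicates.append((page, name, first_seen[digest][1]))
        else:
            first_seen[digest] = (page, name)
    return first_seen, duplicates


# Stand-in bytes; in the notebook these would be base_image["image"].
logo = b"stand-in-logo-bytes"
items = [
    (1, "page1_img1.png", logo),
    (1, "page1_img8.jpeg", b"stand-in-map-bytes"),
    (2, "page2_img1.png", logo),
]
unique, duplicates = dedupe_by_content(items)
print(len(unique), duplicates)
```

Only the first copy of each digest would then need to go through Tesseract, with the result reused for the later pages.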

**Compiling Structured Output**

I combine the raw page text and OCR output into a dictionary, then save as JSON and CSV for easy consumption.
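
Most OCR results in the previous output are empty or contain stray single characters such as `S` and `|`. A small clean-up step before combining could drop those lines. This is a sketch only: `clean_ocr_text` is a hypothetical helper, and the threshold of three alphanumeric characters is an assumption that would also discard short but genuine labels such as `km`.

```python
def clean_ocr_text(raw, min_alnum=3):
    """Keep only lines with at least `min_alnum` letters or digits."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        # str.isalnum also counts accented letters, e.g. in "Yédjélé".
        if sum(ch.isalnum() for ch in line) >= min_alnum:
            kept.append(line)
    return "\n".join(kept)


# OCR text as shown in the preview for page1_img1.png
sample = "S\n\n|\n\nTeachable Moments\n"
print(repr(clean_ocr_text(sample)))
```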


In [7]:
import json
import csv
import os

# 1. Ensure output directory exists
output_dir = "../output"
os.makedirs(output_dir, exist_ok=True)

# 2. Build a dictionary of selectable text for each page
page_texts = {}
for i in range(len(doc)):
    text = doc[i].get_text().strip()
    page_texts[i + 1] = text

# 3. Assume ocr_results is already populated as:
#    { page_number: [ocr_text_from_image1, ocr_text_from_image2, …], … }

# 4. Combine page text + OCR into a records list
records = []
for page in range(1, len(doc) + 1):
    records.append({
        "page": page,
        "text": page_texts.get(page, ""),
        "ocr":  "\n\n".join(ocr_results.get(page, []))
    })

# 5. Save to JSON
out_json = os.path.join(output_dir, "metadata_extraction.json")
with open(out_json, "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2, ensure_ascii=False)
print(f" Saved structured output to {out_json}")

# 6. Save to CSV (truncating each field to 1000 characters to keep the file small)
out_csv = os.path.join(output_dir, "metadata_extraction.csv")
with open(out_csv, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["page", "text", "ocr"])
    writer.writeheader()
    for rec in records:
        writer.writerow({
            "page": rec["page"],
            "text": rec["text"][:1000],
            "ocr":  rec["ocr"][:1000]
        })
print(f" Saved CSV summary to {out_csv}")


 Saved structured output to ../output\metadata_extraction.json
 Saved CSV summary to ../output\metadata_extraction.csv
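
Because the page text contains embedded newlines, one record in the CSV spans several physical lines. The `csv` module quotes such fields and reads them back intact as long as the file is opened with `newline=""`. The sketch below checks this round trip with an in-memory buffer; the two sample records are abbreviated from the page previews above and are illustrative only.

```python
import csv
import io

records = [
    {
        "page": 1,
        "text": "Magnitude 7.0 NEW CALEDONIA\nThursday,  March 31, 2022, 05:44:01 UTC",
        "ocr": "Teachable Moments",
    },
    {
        "page": 2,
        "text": "USGS estimated shaking intensity from M 7.0 Earthquake",
        "ocr": "",
    },
]

# newline="" mirrors how the real file is opened in the cell above.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["page", "text", "ocr"])
writer.writeheader()
writer.writerows(records)

# Read the rows back; multi-line fields are reassembled by the csv reader.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(len(rows), rows[0]["text"] == records[0]["text"])
```

Note that every value comes back as a string, so `page` would need converting with `int()` after reading.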


**Parse Key Metadata Fields**

Now that I have both the PDF‑text (`page_texts`) and image‑OCR (`ocr_results`) for each page, I’ll:

1. Combine them into a single string per page.  
2. Use regular expressions to extract:
   - **Event date/time**  
   - **Magnitude**  
   - **Location coordinates**  
   - **Depth**  
3. Build a Pandas DataFrame and inspect the results.
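
The coordinate pattern used in the next cell accepts any pair of comma-separated decimals, so it could match numbers that are not coordinates. A range check removes some of those false positives. This is a sketch under that assumption: `parse_coords` is a hypothetical helper, and the example strings are made up rather than taken from the report.

```python
import re

# Same pattern as loc_pat in the parsing cell.
LOC_PAT = re.compile(r"\[?(-?\d+\.\d+),\s*(-?\d+\.\d+)\]?")


def parse_coords(text):
    """Return the first (lat, lon) pair within valid ranges, else None."""
    for match in LOC_PAT.finditer(text):
        lat, lon = float(match.group(1)), float(match.group(2))
        if -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0:
            return lat, lon
    return None


# Made-up inputs for illustration only.
print(parse_coords("Epicenter [-22.6, 170.3]"))
print(parse_coords("Readings 123.4, 567.8"))
```

The check only rejects out-of-range values; a pair such as `7.0, 3.5` would still pass, so anchoring the pattern to a nearby keyword may also be needed.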


In [8]:
import re
import pandas as pd
import json

# 1. Load the structured JSON output
with open("../output/metadata_extraction.json", "r", encoding="utf-8") as f:
    records = json.load(f)

# 2. Define regex patterns
date_pat      = re.compile(r"(\d{4}-\d{2}-\d{2}T?\s*\d{2}:\d{2}:\d{2})")  # ISO or space‑separated
mag_pat       = re.compile(r"Magnitude[:\s]+([0-9]\.\d)")
loc_pat       = re.compile(r"\[?(-?\d+\.\d+),\s*(-?\d+\.\d+)\]?")         # [lat, lon]
depth_pat     = re.compile(r"depth.*?(\d+)\s*km", re.IGNORECASE)

# 3. Extract into a list of dicts
parsed = []
for rec in records:
    page = rec["page"]
    text = rec["text"] + "\n" + rec["ocr"]
    
    date_match  = date_pat.search(text)
    mag_match   = mag_pat.search(text)
    loc_match   = loc_pat.search(text)
    depth_match = depth_pat.search(text)
    
    parsed.append({
        "page":       page,
        "date":       date_match.group(1)    if date_match   else None,
        "magnitude":  float(mag_match.group(1)) if mag_match  else None,
        "latitude":   float(loc_match.group(1)) if loc_match  else None,
        "longitude":  float(loc_match.group(2)) if loc_match  else None,
        "depth_km":   int(depth_match.group(1)) if depth_match else None
    })

# 4. Create DataFrame
df_meta = pd.DataFrame(parsed)
df_meta


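|   | page | date | magnitude | latitude | longitude | depth_km |
|---|------|------|-----------|----------|-----------|----------|
| 0 | 1    |      | 7.0       |          |           | 10.0     |
| 1 | 2    |      | 7.0       |          |           |          |
| 2 | 3    |      | 7.0       |          |           |          |
| 3 | 4    |      | 7.0       |          |           |          |
| 4 | 5    |      | 7.0       |          |           |          |
| 5 | 6    |      | 7.0       |          |           |          |
| 6 | 7    |      | 7.0       |          |           |          |
| 7 | 8    |      | 7.0       |          |           |          |
| 8 | 9    |      | 7.0       |          |           |          |
| 9 | 10   |      |           |          |           |          |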
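
The `date` column is empty on every page because `date_pat` expects an ISO-style timestamp, while the page text writes the date as `Thursday,  March 31, 2022, 05:44:01 UTC`. A pattern for that long form might look like the sketch below. `LONG_DATE_PAT` and `parse_report_datetime` are hypothetical names, and the month parsing assumes English month names (the default C locale).

```python
import re
from datetime import datetime, timezone

# Matches e.g. "March 31, 2022, 05:44:01 UTC"; the weekday is ignored.
LONG_DATE_PAT = re.compile(
    r"([A-Z][a-z]+)\s+(\d{1,2}),\s*(\d{4}),\s*(\d{2}:\d{2}:\d{2})\s*UTC"
)


def parse_report_datetime(text):
    """Return a timezone-aware UTC datetime, or None if no date is found."""
    match = LONG_DATE_PAT.search(text)
    if not match:
        return None
    month, day, year, clock = match.groups()
    try:
        parsed = datetime.strptime(
            f"{month} {day} {year} {clock}", "%B %d %Y %H:%M:%S"
        )
    except ValueError:
        # The capitalised word was not a month name, or the values are invalid.
        return None
    return parsed.replace(tzinfo=timezone.utc)


line = "Thursday,  March 31, 2022, 05:44:01 UTC"
print(parse_report_datetime(line).isoformat())  # 2022-03-31T05:44:01+00:00
```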


In [9]:
# Save the parsed metadata to CSV, then read it back to verify the file
df_meta.to_csv("../output/parsed_seismic_metadata.csv", index=False)
df_check = pd.read_csv("../output/parsed_seismic_metadata.csv")
print(df_check.head())  # Display the first few rows to verify


   page  date  magnitude  latitude  longitude  depth_km
0     1   NaN        7.0       NaN        NaN      10.0
1     2   NaN        7.0       NaN        NaN       NaN
2     3   NaN        7.0       NaN        NaN       NaN
3     4   NaN        7.0       NaN        NaN       NaN
4     5   NaN        7.0       NaN        NaN       NaN
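
Since every page repeats the same event header, the per-page rows can be collapsed into a single event record by taking the first non-null value of each field. The sketch below assumes the rows are plain dicts with `None` for missing values, as in the `parsed` list built earlier (not the DataFrame read back from CSV, where missing values are `NaN`). `first_non_null` is a hypothetical helper, and the sample rows mirror the table above.

```python
def first_non_null(rows, fields):
    """For each field, return the first value that is not None across rows."""
    event = {}
    for field in fields:
        event[field] = next(
            (row[field] for row in rows if row.get(field) is not None), None
        )
    return event


# Sample rows mirroring pages 1, 2 and 10 of the parsed output.
rows = [
    {"page": 1, "date": None, "magnitude": 7.0, "latitude": None,
     "longitude": None, "depth_km": 10},
    {"page": 2, "date": None, "magnitude": 7.0, "latitude": None,
     "longitude": None, "depth_km": None},
    {"page": 10, "date": None, "magnitude": None, "latitude": None,
     "longitude": None, "depth_km": None},
]
fields = ["date", "magnitude", "latitude", "longitude", "depth_km"]
event = first_non_null(rows, fields)
print(event)
```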
