<a href="https://colab.research.google.com/github/Rohan912Jacob/pdf-geodata-extraction/blob/main/Main_Study_Area_Extraction_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook automatically extracts the main study area(s) from a collection of academic geoscience theses, linking each location to the thesis title, author, and publication year.

It uses large language models to identify the primary research sites where fieldwork or data collection was performed based on the full text of each thesis.

The extracted study areas and associated metadata are then visualized on an interactive map, providing an at-a-glance summary of the geographic focus and research coverage across all theses.

In short, it gives a quick, relevant overview of a folder full of academic PDFs.

In [None]:
!pip install unidecode

Collecting unidecode
  Downloading Unidecode-1.4.0-py3-none-any.whl.metadata (13 kB)
Downloading Unidecode-1.4.0-py3-none-any.whl (235 kB)
Installing collected packages: unidecode
Successfully installed unidecode-1.4.0


In [None]:
!pip install --upgrade openai

Collecting openai
  Downloading openai-2.3.0-py3-none-any.whl.metadata (29 kB)
Downloading openai-2.3.0-py3-none-any.whl (999 kB)
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 1.109.1
    Uninstalling openai-1.109.1:
      Successfully uninstalled openai-1.109.1
Successfully installed openai-2.3.0


In [None]:
!pip install pdfminer.six

Collecting pdfminer.six
  Downloading pdfminer_six-20250506-py3-none-any.whl.metadata (4.2 kB)
Downloading pdfminer_six-20250506-py3-none-any.whl (5.6 MB)
Installing collected packages: pdfminer.six
Successfully installed pdfminer.six-20250506


In [None]:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer
import pandas as pd
import json
import re
import os
import openai
from openai import OpenAI
import time
import ast
import csv
import glob

## PDF TO TEXT

Extract pages and page numbers

In [None]:
def extract_pages_text(pdf_path):
    """Input: path to a PDF file.
       Output: list of page texts (strings), one per page.
    """
    pages = []
    for page_layout in extract_pages(pdf_path):
        lines = []
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                lines.append(element.get_text())
        page_text = '\n'.join(lines)
        pages.append(page_text)
    return pages

Filter out noise: dedications, resumes, declarations, acknowledgements, tables of contents, empty lines, and reference lists. These sections often contain names and places that would hinder the location extraction. Only the main body is kept.

In [None]:
def is_toc_page(text):
    if "table of contents" in text or "contents" in text:
        return True
    if re.search(r'\.{5,}', text) and re.search(r'\d{1,3}\s*$', text, re.MULTILINE):
        return True
    if sum(1 for l in text.split('\n') if re.match(r'.*\d{1,3}\s*$', l)) > 5:
        return True
    return False

def is_ack_page(text):
    # Covers both spellings, singular and plural (acknowledgement / acknowledgment).
    return "acknowledgement" in text or "acknowledgment" in text

def is_declaration_page(text):
    return "declaration" in text

def is_main_section_start(text):
    return bool(re.search(
        r'\b(?:1\.|chapter\s*1)[:\s-]*introduction\b|\bintroduction\b',
        text, re.IGNORECASE
    ))

def remove_empty_lines(text):
    return "\n".join(line for line in text.splitlines() if line.strip())
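
The substring test `"contents" in text` used by `is_toc_page` also fires on body pages that merely use the word (for example "gold contents"). A minimal sketch of a stricter check follows, assuming the table of contents carries a heading on a line of its own. The name `has_toc_heading` is hypothetical and not part of the notebook.

```python
import re

# Hypothetical helper (not part of the notebook): treat a page as a table of
# contents only when "Contents" or "Table of Contents" is a line by itself.
TOC_HEADING = re.compile(r'^\s*(table of )?contents\s*$', re.IGNORECASE | re.MULTILINE)

def has_toc_heading(text):
    """Return True if some line consists solely of a contents heading."""
    return bool(TOC_HEADING.search(text))

body_page = "The gold contents of the samples were measured by fire assay."
toc_page = "Table of Contents\n1. Introduction ........ 1\n2. Geology ........ 7"

print("contents" in body_page.lower())  # True: the substring test matches body text
print(has_toc_heading(body_page))       # False
print(has_toc_heading(toc_page))        # True
```

Whether this is preferable depends on the corpus: a thesis whose contents page has no heading would still need the dotted-leader and trailing-number heuristics.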

In [None]:
def remove_references_sections(page_texts):
    """
    Removes 'References' sections from each page and drops any blank pages.
    Input: list of page_texts (strings)
    Output: list of cleaned page_texts (strings)
    """
    cleaned_pages = []
    skip_mode = False
    for page in page_texts:
        lines = page.splitlines()
        cleaned_lines = []
        for line in lines:
            # Detect references header
            if not skip_mode and re.match(r'^\s*(\d+\.?)?\s*references\b', line, re.I):
                skip_mode = True
                continue
            # Exit skip mode if a new section/chapter starts
            if skip_mode and (
                re.match(r'^\s*(chapter|paper|section|abstract|introduction)\b', line, re.I) or
                re.match(r'^\s*(\d+\.?)?\s*(abstract|introduction|chapter|paper|section)\b', line, re.I)
            ):
                skip_mode = False
            if not skip_mode:
                cleaned_lines.append(line)
        # Remove empty lines
        non_empty = [l for l in cleaned_lines if l.strip()]
        # If after cleaning, page is not blank, keep it
        if non_empty:
            cleaned_pages.append('\n'.join(non_empty))
    return cleaned_pages
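
Because `remove_references_sections` drops pages that become blank, its output can be shorter than its input, and the original page numbers are no longer known to the caller. The sketch below is one possible variant that carries the page numbers through; the function name `remove_references_keep_pages` and its return shape are assumptions made for illustration.

```python
import re

REF_HEADER = re.compile(r'^\s*(\d+\.?)?\s*references\b', re.IGNORECASE)
NEW_SECTION = re.compile(
    r'^\s*(\d+\.?)?\s*(abstract|introduction|chapter|paper|section)\b', re.IGNORECASE
)

def remove_references_keep_pages(page_numbers, page_texts):
    """Hypothetical variant: return (page_number, cleaned_text) pairs.

    Pages that end up blank are dropped together with their number, so the
    remaining labels still point at the original PDF pages.
    """
    kept = []
    skip_mode = False
    for number, page in zip(page_numbers, page_texts):
        cleaned_lines = []
        for line in page.splitlines():
            if not skip_mode and REF_HEADER.match(line):
                skip_mode = True          # references header: start skipping
                continue
            if skip_mode and NEW_SECTION.match(line):
                skip_mode = False         # next chapter/section: resume keeping
            if not skip_mode and line.strip():
                cleaned_lines.append(line)
        if cleaned_lines:
            kept.append((number, "\n".join(cleaned_lines)))
    return kept

pages = ["Chapter 1\nField area text", "References\nSmith (2001)", "Chapter 2\nResults text"]
result = remove_references_keep_pages([5, 6, 7], pages)
print([n for n, _ in result])  # [5, 7]
```

In this example the references page sits between two chapters, which is the case where simply truncating the page-number list would label the third page as page 6.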

Convert each PDF to text while recording which original page every kept page came from, which helps with lookups later. Only the main body is converted. Page numbering starts at 1 on the cover page, matching what a PDF reader such as Adobe Acrobat displays.

In [None]:
def pdf_to_text_with_page_mapping(pdf_path):
    pages = extract_pages_text(pdf_path)
    body_pages = pages
    filtered_pages = []
    kept_pages = []
    skip_mode = None
    original_page_numbers = list(range(1, len(pages)+1))

    for idx, pg in enumerate(body_pages):
        pg_lower = pg.lower()
        if skip_mode == 'toc':
            if is_main_section_start(pg_lower):
                skip_mode = None
            elif is_toc_page(pg_lower):
                continue
            else:
                skip_mode = None
        elif skip_mode == 'ack':
            if is_main_section_start(pg_lower):
                skip_mode = None
            elif is_ack_page(pg_lower):
                continue
            else:
                skip_mode = None
        elif skip_mode == 'dec':
            if is_main_section_start(pg_lower):
                skip_mode = None
            elif is_declaration_page(pg_lower):
                continue
            else:
                skip_mode = None
        if skip_mode is None:
            if is_toc_page(pg_lower):
                skip_mode = 'toc'
                continue
            elif is_ack_page(pg_lower):
                skip_mode = 'ack'
                continue
            elif is_declaration_page(pg_lower):
                skip_mode = 'dec'
                continue
        filtered_pages.append(pg)
        kept_pages.append(original_page_numbers[idx])

    main_body_text = "\n\n".join(remove_empty_lines(pg) for pg in filtered_pages)

    return main_body_text, kept_pages, filtered_pages

Next, the same steps are applied to a folder of thesis PDFs. The output is a folder of text files that serve as input to the location extraction. A page count report can also be generated to check that the main body was kept and that no relevant parts were removed. This check matters because these PDFs are unstructured, and their formatting can cause problems in the extraction pipeline.

In [None]:
def process_pdf_folder(folder_path, output_txt_folder=None, csv_report_path=None):
    """
    Processes all PDFs in the given folder:
    - Extracts text (and keeps track of page numbers).
    - Removes References sections (text is kept again from the next chapter/section heading).
    - Writes cleaned text to .txt files (with page markers).
    - Prints and optionally saves a kept-page count per file as CSV.
    """
    if output_txt_folder:
        os.makedirs(output_txt_folder, exist_ok=True)
    report = []
    for filename in os.listdir(folder_path):
        if filename.lower().endswith('.pdf'):
            pdf_path = os.path.join(folder_path, filename)
            print(f"Processing: {filename}")
            try:
                main_text, kept_page_nums, kept_page_texts = pdf_to_text_with_page_mapping(pdf_path)
                kept_page_texts_norefs = remove_references_sections(kept_page_texts)
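                # NOTE: truncating assumes only trailing pages were dropped; if a page in the
                # middle of the document is removed, the page labels after it are shifted.
                kept_page_nums_norefs = kept_page_nums[:len(kept_page_texts_norefs)]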

                if output_txt_folder:
                    txt_filename = os.path.splitext(filename)[0] + ".txt"
                    txt_path = os.path.join(output_txt_folder, txt_filename)
                    with open(txt_path, "w", encoding="utf-8") as f:
                        for page_num, page_text in zip(kept_page_nums_norefs, kept_page_texts_norefs):
                            f.write(f"\n--- Page {page_num} ---\n")
                            f.write(remove_empty_lines(page_text).strip() + "\n")
                report.append({"filename": filename, "kept_pages": len(kept_page_nums_norefs)})
            except Exception as e:
                print(f"Failed to process {filename}: {e}")
    print("\n=== Page Count Report ===")
    for row in report:
        print(f"{row['filename']}: {row['kept_pages']} pages kept")
    if csv_report_path:
        pd.DataFrame(report).to_csv(csv_report_path, index=False)
    return report
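
The `--- Page N ---` markers written to each text file are what make later page lookups possible. As a minimal sketch, assuming the marker format used in `process_pdf_folder`, the text can be split back into a page-number-to-text mapping. The helper name `split_by_page_markers` is hypothetical.

```python
import re

# Matches the marker lines written by process_pdf_folder, e.g. "--- Page 12 ---".
PAGE_MARKER = re.compile(r'^--- Page (\d+) ---$', re.MULTILINE)

def split_by_page_markers(text):
    """Return {page_number: page_text} for text containing page markers."""
    # re.split with one capture group yields:
    # [preamble, number, body, number, body, ...]
    parts = PAGE_MARKER.split(text)
    pages = {}
    for number, body in zip(parts[1::2], parts[2::2]):
        pages[int(number)] = body.strip()
    return pages

sample = "\n--- Page 6 ---\nFieldwork at Essakane.\n\n--- Page 7 ---\nSampling details.\n"
pages = split_by_page_markers(sample)
print(pages[6])  # Fieldwork at Essakane.
```

This can be used to spot-check that a page number reported by the model really contains the quoted sentence.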


In [None]:
if __name__ == "__main__":
    folder = "/content/drive/MyDrive/data_pdf"
    out_folder = "/content/drive/MyDrive/Main_txt"
    report_csv = "/content/drive/MyDrive/pdf_page_report.csv"
    process_pdf_folder(folder, out_folder, report_csv)

Processing: 2013_Peters.pdf
Processing: 2015_Masurel_phd.pdf
Processing: 2013_FUNYUFUNYU.pdf
Processing: 2014_MSc_YOSSI.pdf
Processing: 2015_LeBrun_Siguiri.pdf
Processing: 2008_MATABANE_FE3.pdf
Processing: 2011_Peters_East Markoye_2011.pdf
Processing: 2010_Matsheka_Irvin Final Thesis.pdf
Processing: 2013_Ramabulana_Sadiola Hill petrology.pdf
Processing: 2007_Tshibubudze_THE MARKOYE FAULT_2007.pdf
Processing: 2009_Bontle Nkuna_0605886P_Honours Report.pdf
Processing: 2010_Mohale_GIS interpretation of NE Burkina Faso.pdf
Processing: 2011_Woolfe_The stratigraphy and metamorphic facies of the KEMB.pdf
Processing: 2012_Simoko_Petrology, geochemistry and structure of the Pissila batholith and the Saaba Zone gneiss.pdf

=== Page Count Report ===
2013_Peters.pdf: 78 pages kept
2015_Masurel_phd.pdf: 228 pages kept
2013_FUNYUFUNYU.pdf: 72 pages kept
2014_MSc_YOSSI.pdf: 39 pages kept
2015_LeBrun_Siguiri.pdf: 192 pages kept
2008_MATABANE_FE3.pdf: 40 pages kept
2011_Peters_East Markoye_2011.pdf: 51 

## EXTRACTION OF KEY INFORMATION

In [None]:
!pip show openai

Name: openai
Version: 2.3.0
Summary: The official Python library for the openai API
Home-page: https://github.com/openai/openai-python
Author: 
Author-email: OpenAI <support@openai.com>
License: Apache-2.0
Location: /usr/local/lib/python3.12/dist-packages
Requires: anyio, distro, httpx, jiter, pydantic, sniffio, tqdm, typing-extensions
Required-by: 


In [None]:
# API key
os.environ['OPENAI_API_KEY'] = 'INSERT KEY HERE'

# Testing API connection
client = OpenAI()
try:
    model_list = client.models.list()
    print("API connection successful! Number of models:", len(model_list.data))
except Exception as e:
    print("API connection failed:", e)

API connection successful! Number of models: 93


Prompt used to extract each thesis's author(s), title, year, and main study area(s).

In [None]:
study_area_doc_prompt = """
You are an expert assistant. You are given the full text of a geoscience thesis (with page numbers).

Instructions:
- First, extract the thesis **title**, **author(s)**, and **year** (look at the title page or first few pages).
- Then, read the ENTIRE text and pay attention to where the main fieldwork, case study, or research focus is described.
- Only extract the MAIN study area or areas (usually 1–2), not all places mentioned. These are where the research was conducted, data collected, or fieldwork performed.
- Do not include places only mentioned in passing, references, or in background/literature review.
- For each main study area, return:
    - "mention":  Return the full sentence from the document that defines or introduces the main study area, not just the place name.
    - "page": Page number where it first appears
    - "canonical": Cleaned name for geocoding
    - "latitude", "longitude": Decimal degrees WGS84 (your best estimate)
- If no main study area is clear, return {"study_areas": []}
- Output as strict JSON, including the thesis metadata and the study areas:

{
  "title": "<thesis title>",
  "author": "<author(s)>",
  "year": <year>,
  "study_areas": [
    {
      "mention": "...",
      "page": ...,
      "canonical": "...",
      "latitude": ...,
      "longitude": ...
    },
    ...
  ]
}

Example:

Text:
--- Page 1 ---
RELATIVE TIMING OF STRUCTURAL EVENTS: THE MARKOYE FAULT AND ITS ASSOCIATION TO GOLD MINERALISATION
Jessica Haritina Wainaina, 2007

This thesis investigates gold mineralization in the Markoye Fault Zone in NE Burkina Faso.
--- Page 3 ---
Background about the geology of Mali and Niger.
--- Page 6 ---
Detailed fieldwork was performed at Essakane and Markoye.
--- Page 15 ---
Conclusion.

JSON:
{
  "title": "RELATIVE TIMING OF STRUCTURAL EVENTS: THE MARKOYE FAULT AND ITS ASSOCIATION TO GOLD MINERALISATION",
  "author": "Jessica Haritina Wainaina",
  "year": 2007,
  "study_areas": [
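    {
      "mention": "This thesis investigates gold mineralization in the Markoye Fault Zone in NE Burkina Faso.",
      "page": 1,
      "canonical": "Markoye Fault Zone, Burkina Faso",
      "latitude": 14.77,
      "longitude": -0.05
    },
    {
      "mention": "Detailed fieldwork was performed at Essakane and Markoye.",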
      "page": 6,
      "canonical": "Essakane, Burkina Faso",
      "latitude": 14.56,
      "longitude": -0.15
    }
  ]
}


"""


In [None]:
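def extract_study_areas_doclevel(full_text, model="gpt-4o"):
    """Send the full thesis text to the model and return the parsed JSON ({} on failure)."""
    prompt = (
        study_area_doc_prompt +
        "\nText:\n" +
        full_text +
        "\nJSON:"
    )
    # Reuse the client created in the API-connection cell above.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1000,
        temperature=0,
    )
    # content can be None, so fall back to an empty string before stripping.
    text = (response.choices[0].message.content or "").strip()
    # Strip a Markdown code fence if the model wrapped its JSON in one.
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text)
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        print("Error parsing LLM output:\n", text)
        return {}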
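
The prompt asks the model for its best-estimate latitude and longitude, so the returned values are worth checking before they reach the CSV and the map. The following is a minimal sketch of such a check, not part of the notebook: the helper `clean_study_areas` is hypothetical, and it only verifies that coordinates are numeric and within valid WGS84 ranges.

```python
def clean_study_areas(result):
    """Hypothetical post-check: keep areas whose coordinates are numeric and in range."""
    valid = []
    for area in result.get("study_areas") or []:
        try:
            lat = float(area["latitude"])
            lon = float(area["longitude"])
        except (KeyError, TypeError, ValueError):
            continue  # missing, null, or non-numeric coordinates
        if -90 <= lat <= 90 and -180 <= lon <= 180:
            valid.append({**area, "latitude": lat, "longitude": lon})
    return valid

parsed = {
    "study_areas": [
        {"canonical": "Essakane, Burkina Faso", "latitude": 14.56, "longitude": -0.15},
        {"canonical": "Unknown", "latitude": None, "longitude": None},
        {"canonical": "Bad value", "latitude": 145.0, "longitude": 10.0},
    ]
}
print([a["canonical"] for a in clean_study_areas(parsed)])  # ['Essakane, Burkina Faso']
```

A range check like this cannot tell whether a coordinate is the right place, only that it is a possible one.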


In [None]:
def batch_extract_main_study_areas(files, model="gpt-4o", out_csv="main_study_areas_llm_geocoded.csv"):
    rows = []
    for path in files:
        filename = os.path.basename(path)
        with open(path, "r", encoding="utf-8") as f:
            text = f.read()

        result = extract_study_areas_doclevel(text, model=model)


        title = result.get('title', '')
        author = result.get('author', '')
        year = result.get('year', '')
        study_areas = result.get('study_areas', [])

        for area in study_areas:
            rows.append({
                "filename": filename,
                "title": title,
                "author": author,
                "year": year,
                "pages": area.get('page',''),
                "mention": area.get('mention',''),
                "location": area.get('canonical',''),
                "latitude": area.get('latitude',''),
                "longitude": area.get('longitude','')
            })
    with open(out_csv, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(
            f,
            fieldnames=[
                "filename", "title", "author", "year",
                "pages", "mention", "location", "latitude", "longitude"
            ])
        w.writeheader()
        w.writerows(rows)
    print(f"Done! Results saved to {out_csv}")


In [None]:
folder_path = "/content/drive/MyDrive/Main_txt"
files = glob.glob(f"{folder_path}/*.txt")
batch_extract_main_study_areas(files, model="gpt-4o", out_csv="main_study_areas_llm_geocoded.csv")


Done! Results saved to main_study_areas_llm_geocoded.csv


Summary of the extraction done

In [None]:
df = pd.read_csv("main_study_areas_llm_geocoded.csv")

# Number of files with at least one extracted study area
# (files that returned no study areas have no rows in the CSV)
n_files = df['filename'].nunique()
print(f"Files processed: {n_files}")

# Total main study areas extracted
n_main_areas = len(df)
print(f"Total main study areas: {n_main_areas}")

# How many studies have multiple main areas?
area_counts = df.groupby('filename').size()
n_multi = (area_counts > 1).sum()
print(f"Studies with multiple main areas: {n_multi}")


def get_countries(loc):
    # Country is taken as the text after the last comma; missing values give "".
    if not isinstance(loc, str) or "," not in loc:
        return ""
    return loc.split(",")[-1].strip()

summary = (
    df.groupby('filename')
      .agg(
        num_areas = ('location', 'count'),
        first_pages = ('pages', lambda x: ";".join(map(str, sorted(set(x))))),
        countries = ('location', lambda x: ";".join(sorted(set(get_countries(l) for l in x))))
      )
      .reset_index()
)

print(summary)
summary.to_csv("coverage_summary.csv", index=False)


Files processed: 14
Total main study areas: 18
Studies with multiple main areas: 4
                                             filename  num_areas first_pages  \
0         2007_Tshibubudze_THE MARKOYE FAULT_2007.txt          2         6;7   
1                               2008_MATABANE_FE3.txt          1           7   
2       2009_Bontle Nkuna_0605886P_Honours Report.txt          1           7   
3                2010_Matsheka_Irvin Final Thesis.txt          1           8   
4   2010_Mohale_GIS interpretation of NE Burkina F...          1          11   
5                   2011_Peters_East Markoye_2011.txt          2         5;6   
6   2011_Woolfe_The stratigraphy and metamorphic f...          1           4   
7   2012_Simoko_Petrology, geochemistry and struct...          1          10   
8                                 2013_FUNYUFUNYU.txt          1          15   
9                                     2013_Peters.txt          1           3   
10         2013_Ramabulana_Sadiola Hi

## VISUALIZATION

Map of the main study areas for the papers processed

In [None]:
import folium
from folium.plugins import MarkerCluster

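# Drop rows without coordinates; a marker cannot be placed at a NaN location.
df = df.dropna(subset=['latitude', 'longitude'])

# Center map on mean of all coordinates
center_lat = df['latitude'].astype(float).mean()
center_lon = df['longitude'].astype(float).mean()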


m = folium.Map(
    location=[center_lat, center_lon],
    zoom_start=6,
    tiles='cartodbpositron',
    attr='© OpenStreetMap contributors, © CARTO'
)

marker_cluster = MarkerCluster().add_to(m)

for _, row in df.iterrows():
    popup_parts = [
        f"<b>Filename:</b> {row.get('filename', '')}",
        f"<b>Page:</b> {row.get('pages', '')}",
        f"<b>Location:</b> {row.get('location', '')}",
        f"<b>Mention:</b> {row.get('mention', '')}",
        f"<b>Lat/Lon:</b> {row.get('latitude', '')}, {row.get('longitude', '')}"
    ]

    if 'title' in df.columns:
        popup_parts.insert(1, f"<b>Title:</b> {row.get('title', '')}")
    if 'author' in df.columns:
        popup_parts.insert(2, f"<b>Author:</b> {row.get('author', '')}")
    if 'year' in df.columns:
        popup_parts.insert(3, f"<b>Year:</b> {row.get('year', '')}")

    popup = "<br>".join(popup_parts)

    folium.Marker(
        location=[float(row['latitude']), float(row['longitude'])],
        popup=folium.Popup(popup, max_width=500),
        tooltip=row.get('location', '')
    ).add_to(marker_cluster)

m.save("main_study_areas_map.html")
print("Map saved to main_study_areas_map.html")
m

Map saved to main_study_areas_map.html


This works well here because all of the theses concern West Africa, so the main study areas fall within that region.
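
A quick sanity check on that regional assumption is to flag coordinates that fall outside a rough bounding box. The sketch below is illustrative only: the box limits are approximate values chosen for this example, not taken from the notebook, and `outside_region` is a hypothetical helper.

```python
# Rough, assumed bounding box for West Africa (illustrative values only).
WEST_AFRICA_BOX = {"lat_min": 0.0, "lat_max": 28.0, "lon_min": -20.0, "lon_max": 17.0}

def outside_region(rows, box=WEST_AFRICA_BOX):
    """Return the rows whose coordinates fall outside the box, for manual review."""
    flagged = []
    for row in rows:
        lat, lon = float(row["latitude"]), float(row["longitude"])
        inside = (box["lat_min"] <= lat <= box["lat_max"]
                  and box["lon_min"] <= lon <= box["lon_max"])
        if not inside:
            flagged.append(row)
    return flagged

rows = [
    {"location": "Essakane, Burkina Faso", "latitude": 14.56, "longitude": -0.15},
    {"location": "Johannesburg, South Africa", "latitude": -26.20, "longitude": 28.05},
]
print([r["location"] for r in outside_region(rows)])  # ['Johannesburg, South Africa']
```

With the notebook's DataFrame, `df.to_dict("records")` would supply rows in this shape. A flagged row is not necessarily wrong; it only marks a location worth checking against the thesis text.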