# Find People In Texts

This code reads input texts and identifies the people in them. For each person it outputs their name, the sentence the name appears in, a date, the citation and a URL if available, written to a CSV file. A CSV file can be opened in Excel for correcting and adding other data.

These methods will save a great deal of time and make large-scale research feasible where before it was not. Nonetheless, these methods unavoidably have a high error rate and will require a substantial amount of human checking and manual correction.

The input format is not strict markup, but is designed to be very quick and easy to use for a non-IT person trawling full-text OCR archives. Just copy and paste the citation and the text, and separate each entry with a line of hashes.

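It will also attempt to identify a date from the citation, to help piece together people's life course or the order of the events they were involved in. Anything after 'Retrieved', 'Accessed' or 'Viewed' in the citation is ignored, but the code may still pick up some other date instead of the publication date, so check the dates in the output.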
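As a rough illustration of the date step, the sketch below drops everything from 'Retrieved', 'Accessed' or 'Viewed' onward and then looks for a 'day Month year' date in what remains. It is a simplified stand-in for the `get_date` function defined later and handles one date pattern only. The name `publication_date` and the sample citation are made up for this example.

```python
import re

# Month names mapped to two-digit month numbers (full names only, for brevity)
MONTHS = {
    "january": "01", "february": "02", "march": "03", "april": "04",
    "may": "05", "june": "06", "july": "07", "august": "08",
    "september": "09", "october": "10", "november": "11", "december": "12",
}

def publication_date(citation: str) -> str:
    """Return an ISO date (YYYY-MM-DD) for a 'day Month year' date, or ''."""
    # Ignore anything from 'Retrieved', 'Accessed' or 'Viewed' onward
    cut = re.search(r"\b(retrieved|accessed|viewed)\b", citation, re.I)
    head = citation[:cut.start()] if cut else citation

    m = re.search(r"\b(\d{1,2})\s+([A-Za-z]+)\s+(\d{4})\b", head)
    if m and m.group(2).lower() in MONTHS:
        day, month, year = m.groups()
        return f"{year}-{MONTHS[month.lower()]}-{int(day):02d}"
    return ""

# Hypothetical citation, for illustration only
example = ("The Example Gazette, 29 August 1839, p. 2. "
           "Retrieved June 9, 2025, from http://example.org/article")
iso = publication_date(example)
print(iso)
```

If the citation contains no recognisable date before the retrieval note, the sketch returns an empty string rather than guessing.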


## TO DO
- File upload button (if you are using this online in Binder Hub, you can use Binder Hub to upload a file, and set the file name in the settings below).
- Optionally get locations.
- Optionally geolocate locations.
- Linked Open Data (LOD) for linking people and places.


In [None]:
print("Confirm script is working by printing this sentence.")

## Inputs, options and settings

### Input

The expected input is a plain text file with a citation at the top, separated from the text by a blank line.

You can also process many small texts, such as news articles from Trove, by putting them in a single text file.

Simply separate each text with a line of hashes, and put the citation at the start of each section, followed by a blank line and then the full text.

An example text 'Example_Eureka.txt' has been provided. This example includes texts with few and with many names, and some with OCR errors, to illustrate the sorts of things you are likely to encounter. 

For example:

The Australian, p. 2. Retrieved June 9, 2025, from http://nla.gov.au/nla.news-article1

This is a full news story about something or other on the night of the ...

########################

Sydney Morning Herald, p. 5. Retrieved June 9, 2025, from http://nla.gov.au/nla.news-article2

Another full text article is here, according to eye witness accounts ...

###############

etc
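If you already have citations and texts in some other form, a file in this layout can also be assembled programmatically. The sketch below is one possible way to do that. The function name `build_input_text` and the sample entries are assumptions made for illustration and are not part of this notebook.

```python
from typing import Iterable, Tuple

SEPARATOR = "#" * 24  # any line made up only of hashes works as a separator

def build_input_text(entries: Iterable[Tuple[str, str]]) -> str:
    """Join (citation, text) pairs into the layout described above."""
    blocks = []
    for citation, text in entries:
        # Citation first, then a blank line, then the full text
        blocks.append(f"{citation.strip()}\n\n{text.strip()}")
    # A line of hashes, surrounded by blank lines, separates the blocks
    return f"\n\n{SEPARATOR}\n\n".join(blocks) + "\n"

# Placeholder entries, for illustration only
entries = [
    ("Example Paper A, p. 2. Retrieved June 9, 2025, from http://example.org/a",
     "First article text ..."),
    ("Example Paper B, p. 5. Retrieved June 9, 2025, from http://example.org/b",
     "Second article text ..."),
]
combined = build_input_text(entries)
print(combined)
```

The resulting string can be saved with `open(path, "w", encoding="utf-8")` and the file name set in the settings cell.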

### Options

show_highlighted_text
Set show_highlighted_text to True to print the text to screen with all the identified people highlighted. This is useful for checking results and making manual corrections in the output file. Other 'entities' such as places, organisations and dates are highlighted too, each colour coded (people are purple). All entities are shown to help find people misidentified as places, or otherwise. The highlighting does get a lot wrong, but it is still useful. Set to False if you just want to process and output the text.

results_to_screen
Set to True if you want to see the results listed to this screen.

write_to_file
Set to True if you want to output the results to a CSV file.

In [None]:
text_file = "Example_Eureka.txt"
show_highlighted_text = True
results_to_screen = True
write_to_file = True

## Utility and helper functions for getting IDs and dates
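The helpers in the next cell pick an identifier for each source: a URL if the citation has one, then a DOI, then an ISBN, and otherwise a short hash of the citation. The sketch below illustrates only that last fallback, in a self-contained form. `fallback_id` is a name chosen for this example, and the citations are invented. Because the citation is normalised before hashing, differences in case or spacing do not change the identifier.

```python
import hashlib
import re
import unicodedata

def fallback_id(citation: str) -> str:
    """Short, repeatable identifier derived from the citation text."""
    # Normalise Unicode forms, collapse whitespace and lowercase
    norm = unicodedata.normalize("NFKC", citation)
    norm = re.sub(r"\s+", " ", norm).strip().lower()
    # First 12 hex characters of the SHA-1 digest
    digest = hashlib.sha1(norm.encode("utf-8")).hexdigest()[:12]
    return f"urn:source:{digest}"

# Two spellings of the same made-up citation give the same identifier
a = fallback_id("The Example Gazette,  29 August 1839, p. 2.")
b = fallback_id("the example gazette, 29 august 1839, p. 2.")
print(a)
print(a == b)
```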

In [None]:
import hashlib
import re
import unicodedata
from typing import Dict

# A line of only hashes splits blocks; the first blank line splits citation from body
_RE_BLOCK_SPLIT = re.compile(r'^\s*#+\s*$', re.MULTILINE)
_RE_BLANKLINE   = re.compile(r'\n\s*\n', re.MULTILINE)

# Helpers for getting a unique identifier: a URL, DOI or ISBN if present,
# otherwise a short-hash URN built from the citation
RE_URL = re.compile(r'https?://[^\s)]+', re.I)
RE_DOI = re.compile(r'10\.\d{4,9}/\S+', re.I)
RE_ISBN = re.compile(r'\bISBN(?:-1[03])?:?\s*([0-9Xx][0-9Xx\-\s–]{8,})\b')

def _norm_for_hash(s: str) -> str:
    s = unicodedata.normalize("NFKC", s)
    return re.sub(r'\s+', ' ', s).strip().lower()

def _strip_trailing_punct(s: str) -> str:
    return s.rstrip(').,;]»”')

def _clean_isbn(raw: str) -> str:
    digits = re.sub(r'[\s\-\u2013]', '', raw).upper()  # remove spaces/hyphens/en-dash
    return digits if len(digits) in (10, 13) else ''

def make_identifier(citation: str) -> str:
    """Return a stable identifier: URL, DOI URL, ISBN URN, or short-hash URN."""
    if not citation:
        return "urn:source:000000000000"

    # URL
    m = RE_URL.search(citation)
    if m:
        return _strip_trailing_punct(m.group(0))

    # DOI
    m = RE_DOI.search(citation)
    if m:
        doi = _strip_trailing_punct(m.group(0))
        return f"https://doi.org/{doi}"

    # ISBN (expects 'ISBN' present; avoids false positives)
    m = RE_ISBN.search(citation)
    if m:
        isbn = _clean_isbn(m.group(1))
        if isbn:
            return f"urn:isbn:{isbn}"

    # Fallback: deterministic short hash of normalized citation
    short = hashlib.sha1(_norm_for_hash(citation).encode("utf-8")).hexdigest()[:12]
    return f"urn:source:{short}"

# DETECT DATE
from typing import Tuple

# Map month names -> MM
_MONTH = {
    "jan": "01","january": "01","feb": "02","february": "02","mar": "03","march": "03",
    "apr": "04","april": "04","may": "05","jun": "06","june": "06","jul": "07","july": "07",
    "aug": "08","august": "08","sep": "09","sept": "09","september": "09","oct": "10","october": "10",
    "nov": "11","november": "11","dec": "12","december": "12"
}

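def get_date(citation: str) -> Tuple[str, str]:
    """Return (matched text, ISO date) for the citation's date, or ("", "") if none is found."""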
    # 0) ignore retrieval/access dates
    cut = re.search(r'\b(retrieved|accessed|viewed)\b', citation, re.I)
    head = citation[:cut.start()] if cut else citation

    # 1) (1839, August 29)
    m = re.search(r'\(\s*(\d{4})\s*,\s*([A-Za-z]+)\s+(\d{1,2})\s*\)', head)
    if m and _MONTH.get(m.group(2).lower()):
        y, mon, d = m.group(1), m.group(2), m.group(3)
        return (m.group(0), f"{y}-{_MONTH[mon.lower()]}-{int(d):02d}")

    # 2) 1839, August 29
    m = re.search(r'\b(\d{4})\s*,\s*([A-Za-z]+)\s+(\d{1,2})\b', head)
    if m and _MONTH.get(m.group(2).lower()):
        y, mon, d = m.groups()
        return (m.group(0), f"{y}-{_MONTH[mon.lower()]}-{int(d):02d}")

    # 3) 29 August 1839
    m = re.search(r'\b(\d{1,2})\s+([A-Za-z]+)\s+(\d{4})\b', head)
    if m and _MONTH.get(m.group(2).lower()):
        d, mon, y = m.groups()
        return (m.group(0), f"{y}-{_MONTH[mon.lower()]}-{int(d):02d}")

    # 4) August 29, 1839
    m = re.search(r'\b([A-Za-z]+)\s+(\d{1,2}),?\s+(\d{4})\b', head)
    if m and _MONTH.get(m.group(1).lower()):
        mon, d, y = m.groups()
        return (m.group(0), f"{y}-{_MONTH[mon.lower()]}-{int(d):02d}")

    # 5) Month Year
    m = re.search(r'\b([A-Za-z]+)\s+(\d{4})\b', head)
    if m and _MONTH.get(m.group(1).lower()):
        mon, y = m.groups()
        return (m.group(0), f"{y}-{_MONTH[mon.lower()]}")

    # 6) Year range: no single ISO date applies, so only the matched text is returned
    m = re.search(r'\b(c\.|ca\.|circa\s*)?(\d{4})\s*[–-]\s*(\d{4})\b', head, re.I)
    if m:
        return (m.group(0).strip(), "")

    # 7) Single year
    m = re.search(r'\b(?:c\.|ca\.|circa\s*)?(\d{4})(?!\d)\b', head, re.I)
    if m:
        y = m.group(1)
        return (m.group(0).strip(), y)

    return ("", "")



## Split input file into citations, texts and URL

Function for splitting the input text into several texts (assuming they are separated by a line of hashes: ########),
extracting the citation, date and identifier, and returning a data structure keyed by the identifier
(the URL where the citation has one) that contains the citation, date and text of each article.
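To see the two splitting steps in isolation, the sketch below runs the same two regular expressions over a tiny made-up input. It leaves out the date and identifier steps and just collects (citation, text) pairs. The sample text and the variable names are placeholders for this example.

```python
import re

# A line containing only hashes separates blocks
block_split = re.compile(r"^\s*#+\s*$", re.MULTILINE)
# The first blank line separates the citation from the text
blank_line = re.compile(r"\n\s*\n")

sample = (
    "Example Paper A, p. 2. http://example.org/a\n"
    "\n"
    "First article text.\n"
    "\n"
    "Second paragraph of the first article.\n"
    "########\n"
    "Example Paper B, p. 5. http://example.org/b\n"
    "\n"
    "Second article text.\n"
)

pairs = []
for chunk in block_split.split(sample):
    block = chunk.strip()
    if not block:
        continue
    # maxsplit=1 keeps later blank lines inside the body
    parts = blank_line.split(block, maxsplit=1)
    citation = parts[0].strip()
    body = parts[1].strip() if len(parts) > 1 else ""
    pairs.append((citation, body))

for citation, body in pairs:
    print(citation, "->", repr(body))
```

Only the first blank line in a block is treated as the boundary, so paragraph breaks inside an article stay in its text.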



In [None]:
def split_text_blocks(big_text: str) -> Dict[str, Dict[str, str]]:
    """
    Input text contains blocks separated by a line of hashes (#####...).
    Each block: citation at top, then a blank line, then the text.
    Returns { source_id: {"citation": citation, "text": body, "date": date} },
    where source_id comes from make_identifier (URL, DOI, ISBN or short hash).
    If two blocks share an identifier, the later one replaces the earlier.
    """
    out: Dict[str, Dict[str, str]] = {}
    for chunk in _RE_BLOCK_SPLIT.split(big_text):
        block = chunk.strip()
        if not block:
            continue

        parts = _RE_BLANKLINE.split(block, maxsplit=1)
        citation = parts[0].strip()
        body = parts[1].strip() if len(parts) > 1 else ""

        # get date; pad a year-only or year-month date with 01 to give a full YYYY-MM-DD
        rawdate, iso_date = get_date(citation)
        date = "-".join((iso_date.split("-") + ["01","01"])[:3]) if iso_date else ""
        
        source_id = make_identifier(citation)
        out[source_id] = {"citation": citation, "text": body, "date": date}

    return out

## Find people in each text
Function for identifying people in a text, and displaying them using spaCy.

In [None]:
import spacy
from spacy import displacy

# load once
nlp = spacy.load("en_core_web_sm")

def find_people(text: str, show_displacy: bool = True):
    """Return unique (name, sentence) pairs for every PERSON entity found in text."""
    doc = nlp(text)

    people, seen = [], set()
    for ent in doc.ents:
        if ent.label_ == "PERSON":
            name = " ".join(ent.text.split())
            sent = ent.sent.text.strip()
            key = (name, sent)
            if key not in seen:
                seen.add(key)
                people.append(key)

    if show_displacy:
        displacy.render(doc, style="ent")

    return people


## Call the functions to process each text and find people
Read text from file and split into articles, and identify citation and URL.

In [None]:
with open(text_file, "r", encoding="utf-8") as f:
    intext = f.read()

text_data = split_text_blocks(intext)


Find the people in each text.

In [None]:

for url, item in text_data.items():
    # print(url, item["citation"])
    print("Finding people...")
    print(item["citation"])
    people = find_people(item["text"], show_highlighted_text)
    item["people"] = people
    # report to screen
    #for name, sentence in people:
    #    print(f"{name} → {sentence}")


In [None]:
if results_to_screen:
    for url, item in text_data.items():
        print("Citation: " + item.get("citation", ""))
        print("Date: " + item.get("date", ""))
        print()
        for p in item.get("people", []):
            if isinstance(p, dict):
                name = p.get("name", "")
                sentence = p.get("sentence", "")
            else:  # assume tuple/list; pad so a missing sentence becomes ""
                name, sentence = (tuple(p) + ("",))[:2]
            print("Name: " + name)
            print("Sentence: " + sentence)
            print()

## Output file

Output the results to CSV. CSV files can be opened in Excel and corrected, and other information added, such as whether the person is a colonist, Aboriginal, Torres Strait Islander, their country, etc.
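Sentences taken from news text often contain commas and quotation marks, which Python's `csv` module quotes automatically so that each sentence stays in a single column. The sketch below shows this with an in-memory buffer instead of a file. The row is a placeholder invented for the example, using the same column names as the export cell.

```python
import csv
import io

# Placeholder row, for illustration only
rows = [
    ("Jane Example",
     'Jane Example said, "I saw it all," according to the report.',
     "1839-08-29", "Example Paper A, p. 2.", "http://example.org/a"),
]

# newline="" lets the csv module control line endings itself
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["person", "sentence", "date", "citation", "URL"])
writer.writerows(rows)

# Read the text back to confirm each field survives unchanged
buffer.seek(0)
records = list(csv.DictReader(buffer))
print(records[0]["sentence"])
```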

In [None]:
import csv

def export_people_csv(text_data, csv_path):
    """
    text_data: {
      url: {"citation": str, "text": str, "date": str, "people": [
              # either:
              (name, sentence)  OR  {"name": name, "sentence": sentence}
      ]}
    }
    Writes CSV with columns: person, sentence, date, citation, URL
    (the URL column holds the source identifier, which is a URN when
    the citation has no URL).
    """
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        w = csv.writer(f)
        w.writerow(["person", "sentence", "date", "citation", "URL"])
        for url, item in text_data.items():
            citation = item.get("citation", "")
            date = item.get("date", "")
            for p in item.get("people", []):
                if isinstance(p, dict):
                    name = p.get("name", "")
                    sentence = p.get("sentence", "")
                else:  # assume tuple/list; pad so a missing sentence becomes ""
                    name, sentence = (tuple(p) + ("",))[:2]
                w.writerow([name, sentence, date, citation, url])

In [None]:
# Only write the file when the write_to_file option is set
if write_to_file:
    export_people_csv(text_data, "people.csv")

## Data cleaning post processing

Don't be worried if there are hundreds of rows in the results. Many of them are false positives and can quickly be deleted, leaving you with a manageable number to work through.

You will probably want to take the following manual data cleaning steps:

- Remove false positives from glitchy data, and things which aren't people (e.g. places are sometimes misidentified as people).
- Remove people from news not relevant to the conflict (e.g. sometimes an article may summarise many disconnected events, such as Jackey raiding a hut as well as Mr. Weber buying a horse. Remove Mr. Weber.)
- Correct spelling errors.
- Check each record against the source to confirm that it is a properly identified person who should be included.
- Check the highlighted output in the script for anyone not identified and manually add to the spreadsheet.