Install the dependencies and download spaCy's [`en_core_web_sm` model](https://spacy.io/models/en#en_core_web_sm), a small general-purpose English pipeline trained on written web text. Then import the libraries we'll use.

In [60]:
!pip install spacy unidecode
!python -m spacy download en_core_web_sm

import csv        # loading/saving data
import spacy      # nlp library
import difflib    # comparing lists of terms
import unidecode  # normalizing terms for comparison

from collections import Counter, defaultdict
from itertools import islice

Collecting unidecode
  Using cached Unidecode-1.1.1-py2.py3-none-any.whl (238 kB)
Installing collected packages: unidecode
Successfully installed unidecode-1.1.1
You should consider upgrading via the '/Users/nbudak/src/geniza/venv/bin/python3 -m pip install --upgrade pip' command.
You should consider upgrading via the '/Users/nbudak/src/geniza/venv/bin/python -m pip install --upgrade pip' command.
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')


Load the spaCy model and define the types of entities we are interested in recognizing. Create `Counter` objects to track the number of times we see entities.

In [9]:
nlp = spacy.load("en_core_web_sm")
ent_types = ["GPE", "PERSON"]       # "GPE" = geopolitical entity (place)
ent_counts = defaultdict(Counter)   # stats container
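
To illustrate how the `defaultdict(Counter)` container behaves, here is a minimal sketch that needs no spaCy model. The `(label, text)` pairs and the name `demo_counts` are hypothetical stand-ins for real entities, not values from the PGP data.

```python
from collections import Counter, defaultdict

# Hypothetical (label, text) pairs standing in for spaCy entities
sample_ents = [
    ("GPE", "Fustat"),
    ("PERSON", "Nahray"),
    ("GPE", "Fustat"),
    ("GPE", "Alexandria"),
]

demo_counts = defaultdict(Counter)
for label, text in sample_ents:
    # the first time a label is seen, defaultdict creates an empty Counter for it
    demo_counts[label][text] += 1

print(demo_counts["GPE"].most_common(1))   # → [('Fustat', 2)]
```

Because each entity type gets its own `Counter`, the most frequent places and the most frequent people can be queried independently with `most_common()`.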

Load all items from the PGP Metadata spreadsheet, storing their descriptions and PGPIDs.

In [11]:
items = []
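with open("pgp_items.csv", encoding="utf-8-sig", newline="") as file:  # utf-8-sig strips a leading BOM, if any
  reader = csv.DictReader(file)
  for row in reader:
    items.append((
        row["Description"],             # text to run through the pipeline
        { "PGPID": row["PGPID"], }      # context kept alongside each description
    ))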
print(f"loaded {len(items)} items")

loaded 29946 items
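
Spreadsheet exports saved as UTF-8 often begin with a byte order mark (BOM), which `csv.DictReader` treats as part of the first column name unless the text is decoded with `utf-8-sig`. The sketch below shows the difference on a hypothetical two-column sample held in memory; it does not read the real `pgp_items.csv`.

```python
import csv
import io

# Hypothetical CSV bytes with a leading UTF-8 BOM, as some spreadsheet tools write
raw = "\ufeffPGPID,Description\n1,Letter from Fustat\n".encode("utf-8")

# Decoding as plain UTF-8 keeps the BOM attached to the first header name
plain = csv.DictReader(io.StringIO(raw.decode("utf-8")))
print(plain.fieldnames)   # → ['\ufeffPGPID', 'Description']

# Decoding as utf-8-sig drops the BOM, so the header is just "PGPID"
clean = csv.DictReader(io.StringIO(raw.decode("utf-8-sig")))
print(clean.fieldnames)   # → ['PGPID', 'Description']
```

Passing `encoding="utf-8-sig"` to `open()` has the same effect when reading from disk, and it is harmless for files that have no BOM.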


Process all item descriptions using our spaCy pipeline, counting occurrences of named entities as we encounter them.

In [69]:
for doc, context in nlp.pipe(items, as_tuples=True, disable=["tagger", "parser"]):  # only run NER part of pipe
  for ent in doc.ents:
    ent_counts[ent.label_][ent.text] += 1       # one Counter per entity type, incremented once per occurrence
print("processing completed")

processing completed


Load the list of known places, and compare it to the places we identified to see which ones aren't in the list. Of those that weren't listed, check how common they are in item descriptions. Write the results to a file.

In [70]:
known_places = []
with open("pgp_places.csv") as file:
    reader = csv.reader(file)
    for row in islice(reader, 1, None):   # skip header row
        for cell in row:
            if cell:                      # skip blank cells
                known_places.append(cell.lower().strip())

# normalize both lists by removing whitespace and lowercasing
places = [place.lower().strip() for place in ent_counts["GPE"].keys()]

# some markers clearly indicate a person has been misidentified as a place
person_markers = ["b.", "bat", "abū ", "abu ", "abu-"]
places = [place for place in places if not any(marker in place for marker in person_markers)]

# if there are numbers present it's likely a date, not a place
places = [place for place in places if not any(char.isdigit() for char in place)]

# normalize unicode (e.g. diacritics)
places = [unidecode.unidecode(place) for place in places]

# ensure we only have unique values after normalization
places = list(set(places))

# compare the two lists and keep those we didn't find in the known places list
missing = []
for line in difflib.ndiff(sorted(places), sorted(known_places)):
    if line.startswith("-"):
        missing.append(line[2:])
print(f"identified {len(missing)} potential places not in list")

# get the most frequently occurring missing places
norm_counts = Counter({ unidecode.unidecode(place.lower().strip()): count for (place, count) in ent_counts["GPE"].items() })
missing_counts = Counter({ place: count for (place, count) in norm_counts.items() if place in missing })

# write the results to a file
with open("missing_places.csv", mode="w", newline="") as file:   # newline="" as recommended by the csv module
    fieldnames = ["place", "count"]
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    for place, count in missing_counts.most_common():       # order by count
        writer.writerow({ "place": place, "count": count })
print("wrote missing_places.csv")



identified 1036 potential places not in list
wrote missing_places.csv
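
One thing to watch in the comparison: the recognized places are passed through `unidecode`, but the known places are only lowercased and stripped, so a known place spelled with diacritics could still be reported as missing. The sketch below is one possible way to apply the same normalization to both sides and compare them with a set difference instead of `difflib.ndiff`. It is an illustration under stated assumptions, not the notebook's method: `normalize_place` is a hypothetical helper, it uses the standard library's `unicodedata` as an approximation of `unidecode`, and the sample names are made up rather than taken from the PGP spreadsheets.

```python
import unicodedata

def normalize_place(name: str) -> str:
    """Lowercase, trim, and drop combining diacritics (stdlib approximation of unidecode)."""
    decomposed = unicodedata.normalize("NFKD", name.lower().strip())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hypothetical sample data, not taken from the PGP spreadsheets
found = ["Fusṭāṭ", " Cairo", "Ramla"]
known = ["fustat", "Cairo "]

# normalize both sides the same way, then keep what only appears in `found`
missing_demo = sorted(
    {normalize_place(p) for p in found} - {normalize_place(p) for p in known}
)
print(missing_demo)   # → ['ramla']
```

A set difference ignores ordering and duplicates, which is all this comparison needs; `difflib.ndiff` is more useful when the position of each line matters.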
