# IS547 Project Jupyter Notebook

<details>
<summary>Project Overview</summary>

This project involves managing approximately 2,200 digital documents originating from an internal WordPress site migration at my workplace. As outlined in my Dataset Profile, the data consists of PDFs, Word documents, Excel spreadsheets, and occasional PowerPoint presentations already archived in our Box storage. They were curated over a decade or more by our seventy-plus library committees, although the majority of the data comes from 10-15 committees. The documents include meeting minutes, agendas, and related institutional records, and they were publicly available via our open staff site. With FAIR in mind, my curation goals are to enhance internal accessibility, maintain institutional memory and data provenance, and support governance through improved data organization and documentation.


</details>

<details>
<summary>Deliverables</summary>

- Consistent naming conventions applied across all documents
- Documentation of data governance and ethical compliance per our institutional policies; if none exist, resources from university-wide policies will be utilized
- Metadata enhancement to improve retrieval, searchability, and discoverability
- Documented provenance and fixity check to support institutional memory

</details>



Note that all code imports functions from the `data_pipeline` package, whose Python files group functions according to their purpose.

First I get a total file count to check against later.

In [None]:
from data_pipeline.data_explore import count_files

committees_directory = 'data/Committees'
total_files = count_files(committees_directory)
print(f"Total number of files in '{committees_directory}': {total_files}")
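
The body of `count_files` lives in `data_pipeline` and is not shown in this notebook. As a rough illustration of what a recursive file count can look like, here is a minimal sketch using only the standard library. The name `count_files_sketch` is hypothetical and the real function may differ (for example in how it treats hidden files).

```python
import os


def count_files_sketch(directory: str) -> int:
    """Count every file under `directory`, including files in subfolders."""
    total = 0
    # os.walk yields (current_folder, subfolder_names, file_names) per folder.
    for _root, _dirs, files in os.walk(directory):
        total += len(files)
    return total
```

Calling `count_files_sketch('data/Committees')` would give a baseline number to compare against after each later step.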

Next I review the file types in the data set.

In [None]:
from data_pipeline.data_explore import find_file_types
file_types = find_file_types('data/Committees')
print(file_types)
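
For readers without access to the package, a file-type review can be approximated by tallying extensions. This is a sketch under assumptions: `file_type_counts` is a hypothetical name, and lowercasing extensions (so `.PDF` and `.pdf` count together) is a choice made here, not necessarily what `find_file_types` does.

```python
from collections import Counter
from pathlib import Path


def file_type_counts(directory: str) -> Counter:
    """Tally files under `directory` by lowercased extension."""
    counts = Counter()
    for path in Path(directory).rglob("*"):
        if path.is_file():
            # Files without an extension get their own bucket.
            counts[path.suffix.lower() or "(no extension)"] += 1
    return counts
```

`file_type_counts('data/Committees').most_common()` would list the extensions from most to least frequent.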

Listing the committees along with a count helps confirm that everything looks as it should (81 committees).

In [None]:
from data_pipeline.data_explore import list_committees_and_count
list_committees_and_count('data/Committees')


Then a list of files just to see what I'm working with.

In [None]:
from data_pipeline.data_explore import list_files

list_files('data/Committees')


Next, a function ensures the output directory exists so that processed files land in the right place and no mess is created.

In [None]:
from data_pipeline.data_cleaning import ensure_output_directory

ensure_output_directory()



Now I copy the original files to the processed directory.  This ensures the original data set is untouched.

In [None]:
from data_pipeline.data_cleaning import copy_files

copy_files()

Count files again to verify the copy was successful.

In [None]:
from data_pipeline.data_explore import count_files
count_files('data/Processed_Committees')
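
The copy-then-recount pattern above can be sketched in one helper. This is an illustration only: `copy_tree_and_verify` is a hypothetical name, and the real `copy_files` takes no arguments, so its source and destination paths are configured elsewhere in the package.

```python
import os
import shutil


def copy_tree_and_verify(src: str, dst: str) -> int:
    """Copy `src` to `dst`, then confirm both trees hold the same number of files."""
    # copytree uses copy2 by default, so timestamps are preserved where possible.
    # dirs_exist_ok requires Python 3.8 or later.
    shutil.copytree(src, dst, dirs_exist_ok=True)

    def count(directory: str) -> int:
        return sum(len(files) for _root, _dirs, files in os.walk(directory))

    src_count, dst_count = count(src), count(dst)
    if src_count != dst_count:
        raise RuntimeError(f"File count mismatch: {src_count} in source, {dst_count} in copy")
    return dst_count
```

Raising on a mismatch, rather than only printing the two numbers, makes the check hard to overlook when the notebook is rerun.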

Review file types again to see if anything changed.

In [None]:
from data_pipeline.data_explore import find_file_types
file_types = find_file_types('data/Processed_Committees')
print(file_types)

This function creates a CSV with committee, type, original filename, extracted date, and proposed filename. The CSV is named "names.csv" and is placed in the data directory. From examining the CSV data I can see:
1. **Related Documents**: There are a significant number of files in "Related Documents" folders. These maintain their original, often unique filenames and are skipped during renaming.
2. **"Unknown" Date Files**: Many files have "unknown" in their proposed filenames (especially from committees like "Diversity Residency Advisory Committee" and "DEIA Task Force"). These would standardize to the same pattern, reducing unique names.
3. **Duplicate Resolution**: Files like `capt_agenda_minutes_2013_04_30.docx` and `capt_agenda_minutes_2013_04_30 (1).docx` would be normalized to the same standardized name, with collision handling adding suffixes as needed.

The reduction in unique values (2193 - 1610 = 583 fewer) indicates that about 26.6% of my original filenames were standardized or excluded from renaming (like Related Documents), which is expected in a file organization project focused on consistent naming.
This is a positive outcome: the standardization process reduces naming inconsistencies while preserving the original files in Related Documents folders, which likely need their distinct names for context.
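
To make the three observations concrete, here is a minimal sketch of date extraction, the "unknown" fallback, and collision handling. Everything in it is an assumption for illustration: the name `propose_name`, the date regex, and the `_1`, `_2` suffix style are hypothetical, and only the `Committee_Type_Date` pattern is taken from the filenames this project produces.

```python
import re
from pathlib import PurePath

# Matches dates such as 2013_04_30, 2013-4-30, or 2013.04.30 (no calendar validation).
DATE_RE = re.compile(r"(\d{4})[-_.](\d{1,2})[-_.](\d{1,2})")


def propose_name(committee: str, doc_type: str, original_name: str, taken: set) -> str:
    """Build 'Committee_Type_YYYY-MM-DD.ext', adding a numeric suffix on collision."""
    match = DATE_RE.search(original_name)
    if match:
        year, month, day = match.groups()
        date = f"{year}-{int(month):02d}-{int(day):02d}"
    else:
        date = "unknown"
    ext = PurePath(original_name).suffix.lower()
    stem = f"{committee}_{doc_type}_{date}"
    candidate = f"{stem}{ext}"
    n = 1
    # Keep trying suffixes until the name is free within this folder.
    while candidate in taken:
        candidate = f"{stem}_{n}{ext}"
        n += 1
    taken.add(candidate)
    return candidate
```

With one shared `taken` set per folder, the two CAPT files from item 3 would first map to the same stem, and the second would receive a suffix instead of overwriting the first.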


In [None]:
from data_pipeline.file_naming import generate_names_csv

generate_names_csv()

Again I list files to see if anything has changed.

In [None]:
from data_pipeline.data_cleaning import list_files

list_files("./data/Processed_Committees")

This is where I update the filenames based on the CSV created in the previous step. It comes after the hours I spent manually cleaning the data and adding dates the hard way to the date column in "names.csv", which I then renamed "manually_updated_committee_names.csv". The function adds a column for the final concatenated names and saves the updated CSV as "final_updated_committee_names.csv" in the data directory. Note how close the unique values are from beginning to end.

In [None]:
from data_pipeline.final_file_naming import build_final_filenames
build_final_filenames()

I verify the folder structure and files are as expected before the final renaming.

In [None]:
from data_pipeline.final_file_naming import verify_folder_file_structure
verify_folder_file_structure()

The big event: renaming the files. Fewer files are renamed than the full count, because some of the new file names match the old ones and Related Documents were never renamed, since many of them have unique names with no dates.

In [None]:
from data_pipeline.final_file_naming import rename_processed_files
rename_processed_files()
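
A CSV-driven rename that skips unchanged names can be sketched as follows. This is not the code in `final_file_naming`: the function name `rename_from_csv` and the column names `relative_path` and `final_filename` are hypothetical, and the dry-run default is a safety choice made for this example.

```python
import csv
from pathlib import Path


def rename_from_csv(csv_path, base_dir, dry_run=True):
    """Rename files listed in a CSV; return (renamed_pairs, skipped_paths)."""
    renamed, skipped = [], []
    planned = set()  # targets already claimed during this run
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            src = Path(base_dir) / row["relative_path"]
            dst = src.with_name(row["final_filename"])
            # Skip when the name is unchanged, the source is missing,
            # or the target would overwrite or duplicate another file.
            if src == dst or not src.exists() or dst.exists() or dst in planned:
                skipped.append(src)
                continue
            planned.add(dst)
            if not dry_run:
                src.rename(dst)
            renamed.append((src, dst))
    return renamed, skipped
```

Running once with `dry_run=True` and reviewing the returned pairs before repeating with `dry_run=False` gives a chance to catch surprises before anything on disk changes.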

When checked manually, the file names with dates appended appear to work exactly as I want.

In [None]:
from data_pipeline.data_cleaning import list_files

list_files("./data/Processed_Committees")

Validate the same number of files exist as when we started:


In [None]:
from data_pipeline.data_explore import count_files
count_files('data/Processed_Committees')
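
A matching file count shows that nothing was lost, but not that file contents are unchanged. Since a fixity check is one of the deliverables, a checksum comparison is a natural complement. The sketch below is an assumption about how that could be done, not code from `data_pipeline`, and all function names are hypothetical. It compares the multiset of SHA-256 digests in two trees, so it still works after files have been renamed.

```python
import hashlib
from collections import Counter
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def digest_counts(directory):
    """Count how many files under `directory` share each digest."""
    return Counter(sha256_of(p) for p in Path(directory).rglob("*") if p.is_file())


def same_content(original_dir, processed_dir):
    """True when both trees contain exactly the same file contents, ignoring names."""
    return digest_counts(original_dir) == digest_counts(processed_dir)
```

This comparison only makes sense before sidecar metadata files are added to the processed directory, because those extra files would make the two trees differ by design.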

Enhance file metadata with JSON-LD files.

In [None]:
# Import the module
from data_pipeline import enhance_metadata

# Call the single combined function instead of both separately
enhance_metadata.enhance_all_metadata(
    csv_path="data/final_updated_committee_names.csv",
    base_dir="data/Processed_Committees",
    skip_existing=False
)
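
For context, a JSON-LD sidecar is a small JSON file stored next to each document. The sketch below shows one possible shape using schema.org terms. It is an assumption, not the output of `enhance_all_metadata`: the function name, the chosen properties, and the use of `dateCreated` for the meeting date are all illustrative.

```python
import json
import mimetypes
from pathlib import Path


def write_jsonld_sidecar(doc_path, committee, doc_type, date):
    """Write '<document stem>.json' next to the document and return its path."""
    doc_path = Path(doc_path)
    record = {
        "@context": "https://schema.org",
        "@type": "DigitalDocument",
        "name": doc_path.stem,
        # Fall back to a generic type when the extension is not recognized.
        "encodingFormat": mimetypes.guess_type(doc_path.name)[0] or "application/octet-stream",
        "genre": doc_type,
        "sourceOrganization": {"@type": "Organization", "name": committee},
    }
    # Omit the date entirely rather than recording the placeholder "unknown".
    if date and date != "unknown":
        record["dateCreated"] = date
    sidecar = doc_path.with_suffix(".json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

One caveat of this naming scheme: two documents that differ only by extension (for example a `.pdf` and a `.docx` with the same stem) would share a sidecar, so a real implementation needs a rule for that case.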


Enhance project metadata by creating a JSON-LD file in the root directory with a basic description of the project.

In [None]:
from data_pipeline.project_metadata import write_project_metadata
write_project_metadata()


NLP term extraction to create a preview of entities in the data set.  This is a first step in identifying key terms and concepts for further analysis.

In [None]:
from data_pipeline.nlp_term_extraction_preview import run_entity_preview
run_entity_preview()

In [2]:
from data_pipeline.add_nlp_terms_to_metadata import enhance_json_with_nlp

# Update a small sample of JSON-LD files first as a test
# Using a limit of 10 files to see quick results
test_results = enhance_json_with_nlp(base_dir="data/Processed_Committees", limit=10)

# Review the test results
print("\nTest completed. Check the output above to see if it looks correct.")

Enhancing JSON-LD metadata with NLP entities...
Will process up to 10 of 2198 JSON files
Processing file 0/10: The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/The Library as Catalyst Project - Special Collections Research Center Working Group_Minutes_2019-09-17.json
  Skipped - already has entities

=== NLP Entity Extraction Summary ===
Files processed: 0
JSON files updated: 0
Files already enhanced (skipped): 10
Files skipped due to errors: 0
JSON files without corresponding documents: 0

Top 10 people mentioned:
  Bill: 9 mentions
  Joanne: 9 mentions
  Library Building: 9 mentions
  Tom: 9 mentions
  Tom Teper: 9 mentions
  Bill Maher: 8 mentions
  Lynne: 8 mentions
  Joanne Kaczmarek: 7 mentions
  John: 5 mentions
  Lynne Thomas
Absent: 4 mentions

Top 10 organizations mentioned:
  Special Collections: 7 mentions
  Archives: 6 mentions
  Division: 4 mentions
  ARC: 4 mentions
  University Archives: 4 mentions
  UGL: 4 mentions
  FAQ: 2 me
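
The people list above includes entries that are not people ("Library Building"), bare first names, and one name fused with a following word across a line break ("Lynne Thomas" plus "Absent"). A light post-processing pass can reduce this noise before the terms reach the metadata. The sketch below is a hypothetical cleanup step, not part of `data_pipeline`; the stoplist contents and the two-token rule are assumptions that would need tuning against the real data.

```python
from collections import Counter

# Assumed stoplist of words that should never count as part of a person's name.
GENERIC_TERMS = {"library", "building", "dean", "absent", "attendees"}


def clean_person_counts(counts, stoplist=GENERIC_TERMS):
    """Drop generic tokens, then keep only names with at least two tokens left."""
    cleaned = Counter()
    for name, mentions in counts.items():
        # split() with no argument also breaks on embedded newlines.
        tokens = [t for t in name.split() if t.lower() not in stoplist]
        if len(tokens) < 2:
            continue
        cleaned[" ".join(tokens)] += mentions
    return cleaned
```

The two-token rule trades recall for precision: it removes "Tom" and "John", which are probably real mentions, in exchange for a list that identifies people unambiguously.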

In [5]:
from data_pipeline.add_nlp_terms_to_metadata import enhance_json_with_nlp

# First batch - process up to 2500 files
enhance_json_with_nlp(
    base_dir="data/Processed_Committees",
    limit=2500,
    skip_existing=True
)

# After this completes, run the next batch
# enhance_json_with_nlp(
#     base_dir="data/Processed_Committees",
#     limit=1000,
#     skip_existing=True
# )

# Finally, process any remaining files
# enhance_json_with_nlp(
#     base_dir="data/Processed_Committees",
#     skip_existing=True
# )

CropBox missing from /Page, defaulting to MediaBox

Enhancing JSON-LD metadata with NLP entities...
Will process up to 2500 of 2198 JSON files
Processing file 0/2500: The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/The Library as Catalyst Project - Special Collections Research Center Working Group_Minutes_2019-09-17.json
  Skipped - already has entities
Processing file 10/2500: Graduate Student Survey Working Group/Minutes/Graduate Student Survey Working Group_Minutes_2014-12-12.json
  Skipped - already has entities
Processing file 20/2500: Diversity Residency Advisory Committee/Minutes/Diversity Residency Advisory Committee_Minutes_2022-03-14.json
  Skipped - already has entities
Processing file 30/2500: Academic Professional Promotion Implementation Team/Minutes/Academic Professional Promotion Implementation Team_Minutes_2017-02-17.json
Processing file 40/2500: Content Access Policy & Technology (CAPT)/Related Documents/Discovery_Study_Group_rev.json
  No corresponding document found


  Skipped - already has entities
Processing file 50/2500: Content Access Policy & Technology (CAPT)/Minutes/Content Access Policy & Technology (CAPT)_Minutes_2012-08-22.json


Processing file 60/2500: Content Access Policy & Technology (CAPT)/Minutes/Content Access Policy & Technology (CAPT)_Minutes_2013-06-25.json


  Skipped - already has entities
Processing file 70/2500: Content Access Policy & Technology (CAPT)/Minutes/Content Access Policy & Technology (CAPT)_Minutes_2013-09-24.json
  Skipped - already has entities
Processing file 80/2500: Content Access Policy & Technology (CAPT)/Agendas/Content Access Policy & Technology (CAPT)_Agendas_2013-05-28.json
  Skipped - already has entities
Processing file 90/2500: Content Access Policy & Technology (CAPT)/Agendas/Content Access Policy & Technology (CAPT)_Agendas_2016-11-01.json
  Skipped - already has entities
Processing file 100/2500: Content Access Policy & Technology (CAPT)/Agendas/Content Access Policy & Technology (CAPT)_Agendas_2018-08-01.json
  Skipped - already has entities
Processing file 110/2500: Content Access Policy & Technology (CAPT)/Agendas/Content Access Policy & Technology (CAPT)_Agendas_2017-11-01.json
  Skipped - already has entities
Processing file 120/2500: Content Access Policy & Technology (CAPT)/Agendas/Content Access Poli

  Skipped - already has entities
Processing file 180/2500: Library Faculty Meeting/Related Documents/Sept28-CAPT-Agenda.json
  Skipped - already has entities
Processing file 190/2500: Library Faculty Meeting/Related Documents/Payments_to_foreign_nationals.json
  No corresponding document found
  Skipped - already has entities
Processing file 200/2500: Library Faculty Meeting/Related Documents/CAPT_ppt_-_FacultyMeeting2016.json
  No corresponding document found
  Skipped - already has entities
Processing file 210/2500: Library Faculty Meeting/Related Documents/SavvyResearcher_FacultyMeeting12.16.json
  No corresponding document found
Processing file 220/2500: Library Faculty Meeting/Minutes/Library Faculty Meeting_Minutes_2019-09-18.json
  Skipped - already has entities
Processing file 230/2500: Library Faculty Meeting/Minutes/Library Faculty Meeting_Minutes_2014-09-24.json
  Skipped - already has entities
Processing file 240/2500: Library Faculty Meeting/Minutes/Library Faculty Meeting

  Skipped - already has entities
Processing file 1920/2500: Collection Development Committee/Minutes/Collection Development Committee_Minutes_2005-10-01.json


  Skipped - already has entities
Processing file 1930/2500: Collection Development Committee/Minutes/Collection Development Committee_Minutes_2006-04-01.json
  Skipped - already has entities
Processing file 1940/2500: Collection Development Committee/Minutes/Collection Development Committee_Minutes_2016-08-01.json
  Skipped - already has entities
Processing file 1950/2500: Collection Development Committee/Minutes/Collection Development Committee_Minutes_2004-04-01.json


  Skipped - already has entities
Processing file 1960/2500: Collection Development Committee/Minutes/Collection Development Committee_Minutes_2012-11-27.json
  Skipped - already has entities
Processing file 1970/2500: Collection Development Committee/Minutes/Collection Development Committee_Minutes_2011-05-17.json
  Skipped - already has entities
Processing file 1980/2500: Collection Development Committee/Minutes/Collection Development Committee_Minutes_2014-03-25.json
  Skipped - already has entities
Processing file 1990/2500: Collection Development Committee/Minutes/Collection Development Committee_Minutes_2006-10-01.json
  Skipped - already has entities
Processing file 2000/2500: Collection Development Committee/Agendas/Collection Development Committee_Agendas_2009-05-26.json
  Skipped - already has entities
Processing file 2010/2500: Collection Development Committee/Agendas/Collection Development Committee_Agendas_2014-09-30.json
  Skipped - already has entities
Processing file 202

  Skipped - already has entities
Processing file 2040/2500: Collection Development Committee/Agendas/Collection Development Committee_Agendas_2017-10-01.json
  Skipped - already has entities
Processing file 2050/2500: Collection Development Committee/Agendas/Collection Development Committee_Agendas_2013-09-24.json
  Skipped - already has entities
Processing file 2060/2500: Accessibility Advisory Group/Minutes/Accessibility Advisory Group_Minutes_2017-12-21.json
  Skipped - already has entities
Processing file 2070/2500: Library Strategic Planning Team/Agendas/Library Strategic Planning Team_Agendas_2019-02-28.json
  Skipped - already has entities
Processing file 2080/2500: Library Staff Support Committee/Minutes/Library Staff Support Committee_Minutes_2016-04-28.json
  No corresponding document found
  Skipped - already has entities
Processing file 2090/2500: Library Staff Support Committee/Minutes/Library Staff Support Committee_Minutes_2016-11-30.json
  Skipped - already has entities

  Skipped - already has entities
Processing file 2190/2500: The Library as Catalyst Project - Managing the Library's Collections WG/Minutes/The Library as Catalyst Project - Managing the Library's Collections WG_Minutes_2019-04-29.json

=== NLP Entity Extraction Summary ===
Files processed: 341
JSON files updated: 341
Files already enhanced (skipped): 1774
Files skipped due to errors: 0
JSON files without corresponding documents: 83

Top 10 people mentioned:
  Library: 906 mentions
  Tom Teper: 479 mentions
  John Wilkin: 364 mentions
  Mary Laskowski: 333 mentions
  Bill Mischo: 306 mentions
  David Ward: 282 mentions
  John: 265 mentions
  Dean Wilkin: 224 mentions
  Nancy O’Brien: 220 mentions
  Dean: 214 mentions

Top 10 organizations mentioned:
  EC: 600 mentions
  AP: 240 mentions
  AC: 205 mentions
  University: 167 mentions
  Administrative Council: 159 mentions
  Attendees: 140 mentions
  AUL: 139 mentions
  UGL: 132 mentions
  CDC: 117 mentions
  Collection Development Commit

{'processed': 341,
 'updated': 341,
 'skipped': 0,
 'already_enhanced': 1774,
 'missing': 83,
 'all_entities': {'PERSON': Counter({'Library': 906,
           'Tom Teper': 479,
           'John Wilkin': 364,
           'Mary Laskowski': 333,
           'Bill Mischo': 306,
           'David Ward': 282,
           'John': 265,
           'Dean Wilkin': 224,
           'Nancy O’Brien': 220,
           'Dean': 214,
           'Tom': 213,
           'Sue Searing': 208,
           'Lynne Rudasill': 192,
           'Chris Prom': 190,
           'LSSC': 188,
           'Mara Thacker': 180,
           'Lynn Wiley': 180,
           'Cindy Ingold': 177,
           'Jennifer Teper': 176,
           'Bill Maher': 160,
           'Paula Kaufman': 152,
           'Kirstin Dougan': 145,
           'Lisa Hinchliffe': 139,
           'JoAnn Jacoby': 133,
           'Paula': 132,
           'Greg Knott': 131,
           'Beth': 131,
           'John Laskowski': 130,
           'Beth Woodard': 129,
       