# IS547 Project Jupyter Notebook

<details>
<summary>Project Overview</summary>

This project manages approximately 2,200 digital documents that originated from an internal WordPress site migration at my workplace. As outlined in my Dataset Profile, the data consists of PDFs, Word documents, Excel spreadsheets, and occasionally PowerPoint presentations already archived in our Box storage. They were curated over a decade or more by our seventy-plus library committees, although the majority of the data comes from 10-15 committees. The documents include meeting minutes, agendas, and related institutional records. With the FAIR principles in mind, my curation goals are to enhance internal accessibility, maintain institutional memory and data provenance, and support governance through improved data organization and documentation. These documents were publicly available via our open staff site.


</details>

<details>
<summary>Deliverables</summary>

- Consistent naming conventions applied across all documents
- Documentation of data governance and ethical compliance per our institutional policies; if none exist, resources from university-wide policies will be utilized
- Metadata enhancement to improve retrieval, searchability, and discoverability
- Documented provenance and fixity check to support institutional memory

</details>



Note that all code imports functions from the `data_pipeline` package, whose Python files group functions by purpose.

First I get a total file count to check against later.

In [1]:
from data_pipeline.data_explore import count_files

committees_directory = 'data/Committees'
total_files = count_files(committees_directory)
print(f"Total number of files in '{committees_directory}': {total_files}")

Total number of files in 'data/Committees': 2203
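The body of `count_files` lives in `data_pipeline.data_explore` and is not shown in this notebook. A minimal sketch of how such a helper might work, assuming it simply walks the tree and counts every regular file (the name `count_files_sketch` is hypothetical):

```python
import os

def count_files_sketch(directory):
    """Count regular files under `directory`, including all subfolders."""
    total = 0
    for _root, _dirs, files in os.walk(directory):
        total += len(files)
    return total
```

Because hidden files such as `.DS_Store` would also be counted by a walk like this, the total is only a useful baseline if the directory is clean.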


Next I review the file types in the data set.

In [2]:
from data_pipeline.data_explore import find_file_types
file_types = find_file_types('data/Committees')
print(file_types)

{'.docx': 1764, '.ppt': 26, '.doc': 53, '.pdf': 333, '.pptx': 21, '.xls': 4, '.xlsx': 2}
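For reference, an extension tally like the one above can be produced with a `Counter`. This is only a sketch under my own assumptions (extensions are lowercased, and the name `find_file_types_sketch` is hypothetical); the real `find_file_types` may differ:

```python
from collections import Counter
from pathlib import Path

def find_file_types_sketch(directory):
    """Tally lowercased file extensions for every file under `directory`."""
    counts = Counter()
    for path in Path(directory).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower()] += 1
    return dict(counts)
```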


A list of committees and their count helps confirm that everything looks as it should (81 committees).

In [3]:
from data_pipeline.data_explore import list_committees_and_count
list_committees_and_count('data/Committees')


Research and Publication Committee
Reference Management Team
Promotion and Tenure Advisory Committee
The Library as Catalyst Project - Special Collections Research Center Working Group
Teaching and Learning Task Force
Graduate Student Survey Working Group
University Library Residency Program Working Group
Diversity Residency Advisory Committee
Awards and Recognition Committee
Academic Professional Promotion Implementation Team
Content Access Policy & Technology (CAPT)
Working Group on Library Grants, Outreach and Training (COMPLETED CHARGE)
Open Licensing Task Force
Academic Professional Peer Review Promotion Advisory Committee
Diversity, Equity, Inclusion, and Accessibility (DEIA) Task Force
Student-Focused Spaces Task Force
220 Exploratory Use Team
Marshall Gallery Task Force
Marketing and Communications Strategy Working Group
Reproduction and Use Fees Working Group
Library Faculty Meeting
Faculty Meeting Agenda Committee
CAPT Digital Production
CAPT Repositories, Preservation, and A

Then a list of files just to see what I'm working with.

In [4]:
from data_pipeline.data_explore import list_files

list_files('data/Committees')


File: data/Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.06.13.docx
File: data/Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.09.17.docx
File: data/Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.07.29.docx
File: data/Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.10.10.docx
File: data/Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.07.15.docx
File: data/Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.09.04.docx
File: data/Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.05.30.docx
File: data/Committees/The Library as Catalyst Project - Special Collections Research Cente

Next, a function ensures that files are delivered to the right place so no mess is created.

In [5]:
from data_pipeline.data_cleaning import ensure_output_directory, clean_ds_store_files

ensure_output_directory()



Now I copy the original files to the processed directory.  This ensures the original data set is untouched.

In [6]:
from data_pipeline.data_cleaning import copy_files

copy_files()

Count files again to verify the copy was successful.

In [7]:
from data_pipeline.data_explore import count_files
count_files('data/Processed_Committees')

2203
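Matching counts confirm that nothing was dropped, but not that the contents are identical. Since fixity is one of the deliverables, a checksum comparison is a natural companion check. The sketch below is an assumption about how that could be done with SHA-256; `sha256_of` and `find_copy_mismatches` are hypothetical names, not functions from `data_pipeline`:

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_copy_mismatches(source_dir, copy_dir):
    """Return relative paths that are missing from the copy or differ in content."""
    source_dir, copy_dir = Path(source_dir), Path(copy_dir)
    mismatches = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        dst = copy_dir / rel
        if not dst.is_file() or sha256_of(src) != sha256_of(dst):
            mismatches.append(str(rel))
    return mismatches
```

An empty list from `find_copy_mismatches('data/Committees', 'data/Processed_Committees')` would mean every original has a byte-identical copy at this stage, before any renaming.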

Review file types again to see if anything changed.

In [8]:
from data_pipeline.data_explore import find_file_types
file_types = find_file_types('data/Processed_Committees')
print(file_types)

{'.docx': 1764, '.ppt': 26, '.doc': 53, '.pdf': 333, '.pptx': 21, '.xls': 4, '.xlsx': 2}


This function creates a CSV with committee, type, original filename, extracted date, and proposed filename.  The CSV is "names.csv" and placed in the data directory.  From examining the CSV data I can see:
1. **Related Documents**: There are a significant number of files in "Related Documents" folders. These maintain their original, often unique filenames and are skipped during renaming.
2. **"Unknown" Date Files**: Many files have "unknown" in their proposed filenames (especially from committees like "Diversity Residency Advisory Committee" and "DEIA Task Force"). These would standardize to the same pattern, reducing unique names.
3. **Duplicate Resolution**: Files like `capt_agenda_minutes_2013_04_30.docx` and `capt_agenda_minutes_2013_04_30 (1).docx` would be normalized to the same standardized name, with collision handling adding suffixes as needed.

The reduction (2193 - 1610 = 583 fewer unique values) indicates that about 26.6% of the original filenames were standardized or excluded from renaming (like Related Documents), which is expected in a file organization project focused on consistent naming.  This indicates the standardization process is successfully reducing naming inconsistencies while preserving the original files in Related Documents folders that likely need their distinct names for context.
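The collision handling mentioned above can be sketched as follows. This is a minimal illustration, assuming a ` (n)` suffix placed before the extension (mirroring the `(1)` in the example filename); the suffix format the pipeline actually uses is not shown here, and `resolve_collisions` is a hypothetical name:

```python
from pathlib import PurePath

def resolve_collisions(names):
    """Return names in order, appending ' (n)' before the extension on repeats."""
    seen = set()
    resolved = []
    for name in names:
        path = PurePath(name)
        candidate = name
        counter = 1
        while candidate in seen:
            candidate = f"{path.stem} ({counter}){path.suffix}"
            counter += 1
        seen.add(candidate)
        resolved.append(candidate)
    return resolved
```

Keeping a `seen` set means every proposed name is checked against all earlier ones, so two different originals can never end up with the same final filename.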


In [9]:
from data_pipeline.file_naming import generate_names_csv

generate_names_csv()

Unnamed: 0,Committee,Document Type,Original File Name,Extracted Date,Proposed File Name
0,The Library as Catalyst Project - Special Coll...,Minutes,2019.06.13.docx,2019-06-13,The Library as Catalyst Project - Special Coll...
1,The Library as Catalyst Project - Special Coll...,Minutes,2019.09.17.docx,2019-09-17,The Library as Catalyst Project - Special Coll...
2,The Library as Catalyst Project - Special Coll...,Minutes,2019.07.29.docx,2019-07-29,The Library as Catalyst Project - Special Coll...
3,The Library as Catalyst Project - Special Coll...,Minutes,2019.10.10.docx,2019-10-10,The Library as Catalyst Project - Special Coll...
4,The Library as Catalyst Project - Special Coll...,Minutes,2019.07.15.docx,2019-07-15,The Library as Catalyst Project - Special Coll...
...,...,...,...,...,...
2198,The Library as Catalyst Project - Managing the...,Minutes,2019-February-5.docx,unknown,The Library as Catalyst Project - Managing the...
2199,The Library as Catalyst Project - Managing the...,Minutes,2019-May-15.docx,unknown,The Library as Catalyst Project - Managing the...
2200,The Library as Catalyst Project - Managing the...,Minutes,2019-March-7.docx,unknown,The Library as Catalyst Project - Managing the...
2201,The Library as Catalyst Project - Managing the...,Minutes,2019-January-7.docx,unknown,The Library as Catalyst Project - Managing the...
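The preview shows numeric names such as `2019.06.13.docx` resolving to `2019-06-13`, while month-name variants such as `2019-February-5.docx` come back as `unknown`. A regex-based extractor along the following lines would produce the numeric results; the month-name branch is my own illustrative extension (it is not what `generate_names_csv` did, since I filled those dates in manually), and `extract_date` is a hypothetical name:

```python
import re
from datetime import datetime

NUMERIC_DATE = re.compile(r"(?<!\d)(\d{4})[._-](\d{1,2})[._-](\d{1,2})(?!\d)")
MONTH_NAME_DATE = re.compile(r"(?<!\d)(\d{4})[._ -]([A-Za-z]+)[._ -](\d{1,2})(?!\d)")

def extract_date(filename):
    """Return an ISO date string found in `filename`, or 'unknown'."""
    match = NUMERIC_DATE.search(filename)
    if match:
        year, month, day = (int(part) for part in match.groups())
        try:
            return datetime(year, month, day).strftime("%Y-%m-%d")
        except ValueError:
            return "unknown"  # digits present but not a real calendar date
    match = MONTH_NAME_DATE.search(filename)
    if match:
        year, month_name, day = match.groups()
        # Try full month names first, then three-letter abbreviations.
        for fmt in ("%Y %B %d", "%Y %b %d"):
            try:
                parsed = datetime.strptime(f"{year} {month_name} {day}", fmt)
                return parsed.strftime("%Y-%m-%d")
            except ValueError:
                continue
    return "unknown"
```

Month names are parsed with `strptime`, so this branch assumes an English locale.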


Again I list files to see if anything has changed.

In [10]:
from data_pipeline.data_cleaning import list_files

list_files("./data/Processed_Committees")

File: ./data/Processed_Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.06.13.docx
File: ./data/Processed_Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.09.17.docx
File: ./data/Processed_Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.07.29.docx
File: ./data/Processed_Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.10.10.docx
File: ./data/Processed_Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.07.15.docx
File: ./data/Processed_Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.09.04.docx
File: ./data/Processed_Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.05.30.docx
File: 

This is where I update the filenames based on the CSV created in the previous step. This comes after the hours I spent manually cleaning the data and adding dates the hard way to the date column in "names.csv", which I then renamed "manually_updated_committee_names.csv". The function adds a column for the final concatenated names and saves the updated CSV as "final_updated_committee_names.csv" in the data directory. **Note how close the unique values are from beginning to end.**

In [11]:
from data_pipeline.final_file_naming import build_final_filenames
build_final_filenames()

Final filenames saved to data/final_updated_committee_names.csv
Rows processed: 2203


Unnamed: 0,Committee,Document Type,Original File Name,Extracted Date,Proposed File Name,Final File Name
0,The Library as Catalyst Project - Special Coll...,Minutes,2019.06.13.docx,2019-06-13,The Library as Catalyst Project - Special Coll...,The Library as Catalyst Project - Special Coll...
1,The Library as Catalyst Project - Special Coll...,Minutes,2019.09.17.docx,2019-09-17,The Library as Catalyst Project - Special Coll...,The Library as Catalyst Project - Special Coll...
2,The Library as Catalyst Project - Special Coll...,Minutes,2019.07.29.docx,2019-07-29,The Library as Catalyst Project - Special Coll...,The Library as Catalyst Project - Special Coll...
3,The Library as Catalyst Project - Special Coll...,Minutes,2019.10.10.docx,2019-10-10,The Library as Catalyst Project - Special Coll...,The Library as Catalyst Project - Special Coll...
4,The Library as Catalyst Project - Special Coll...,Minutes,2019.07.15.docx,2019-07-15,The Library as Catalyst Project - Special Coll...,The Library as Catalyst Project - Special Coll...
...,...,...,...,...,...,...
2198,The Library as Catalyst Project - Managing the...,Minutes,2019-February-5.docx,2019-02-05,The Library as Catalyst Project - Managing the...,The Library as Catalyst Project - Managing the...
2199,The Library as Catalyst Project - Managing the...,Minutes,2019-May-15.docx,2019-05-15,The Library as Catalyst Project - Managing the...,The Library as Catalyst Project - Managing the...
2200,The Library as Catalyst Project - Managing the...,Minutes,2019-March-7.docx,2019-03-07,The Library as Catalyst Project - Managing the...,The Library as Catalyst Project - Managing the...
2201,The Library as Catalyst Project - Managing the...,Minutes,2019-January-7.docx,2019-01-07,The Library as Catalyst Project - Managing the...,The Library as Catalyst Project - Managing the...


I verify the folder structure and files are as expected before the final renaming.

In [12]:
from data_pipeline.final_file_naming import verify_folder_file_structure
verify_folder_file_structure()

Verification complete: All 2203 entries match the folder structure in data/Processed_Committees


True

The big event: renaming the files. It renames fewer than the full number because some of the new file names match the old ones, and Related Documents were never renamed because their names are unique and in many cases contain no dates.

In [13]:
from data_pipeline.final_file_naming import rename_processed_files
rename_processed_files()

Renamed: data/Processed_Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.06.13.docx -> data/Processed_Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/The Library as Catalyst Project - Special Collections Research Center Working Group_Minutes_2019-06-13.docx
Renamed: data/Processed_Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.09.17.docx -> data/Processed_Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/The Library as Catalyst Project - Special Collections Research Center Working Group_Minutes_2019-09-17.docx
Renamed: data/Processed_Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/2019.07.29.docx -> data/Processed_Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/

1892
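The log above suggests final names of the form `<Committee>_<Document Type>_<date><ext>` inside a `<Committee>/<Document Type>/` folder. A minimal sketch of a CSV-driven rename under that assumption is below; `build_final_name` and `rename_from_rows` are hypothetical names, and the skip rules (unchanged name, missing source, existing target) are my guess at sensible safeguards rather than the pipeline's confirmed behaviour:

```python
from pathlib import Path

def build_final_name(committee, doc_type, date, original_name):
    """Compose '<committee>_<doc_type>_<date><ext>' from the CSV columns."""
    return f"{committee}_{doc_type}_{date}{Path(original_name).suffix}"

def rename_from_rows(base_dir, rows):
    """Rename files described by `rows`; return how many were actually renamed."""
    renamed = 0
    for row in rows:
        folder = Path(base_dir) / row["Committee"] / row["Document Type"]
        source = folder / row["Original File Name"]
        target = folder / row["Final File Name"]
        if source == target or not source.is_file() or target.exists():
            continue  # unchanged name, missing source, or would overwrite
        source.rename(target)
        renamed += 1
    return renamed
```

Rows could come from `csv.DictReader` over "final_updated_committee_names.csv". Returning the count makes it easy to compare against the expected number of renames.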

When checked manually the file names with dates appended appear to work exactly as I want.

In [14]:
from data_pipeline.data_cleaning import list_files

list_files("./data/Processed_Committees")

File: ./data/Processed_Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/The Library as Catalyst Project - Special Collections Research Center Working Group_Minutes_2019-10-10.docx
File: ./data/Processed_Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/The Library as Catalyst Project - Special Collections Research Center Working Group_Minutes_2019-07-15.docx
File: ./data/Processed_Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/The Library as Catalyst Project - Special Collections Research Center Working Group_Minutes_2019-06-13.docx
File: ./data/Processed_Committees/The Library as Catalyst Project - Special Collections Research Center Working Group/Minutes/The Library as Catalyst Project - Special Collections Research Center Working Group_Minutes_2019-07-29.docx
File: ./data/Processed_Committees/The Library as Catalyst Projec

Validate the same number of files exist as when we started:


In [15]:
from data_pipeline.data_explore import count_files
count_files('data/Processed_Committees')

2203

Enhance file metadata with JSON-LD files.

In [16]:
# Import the module
from data_pipeline import enhance_metadata

# Call the single combined function instead of both separately
enhance_metadata.enhance_all_metadata(
    csv_path="data/final_updated_committee_names.csv",
    base_dir="data/Processed_Committees",
    skip_existing=False
)


=== Metadata Enhancement Complete ===
Total files processed: 2234
  - Regular files from CSV: 1892
  - Related documents from CSV: 311
  - Additional files found: 31
Committees with related documents: 7
Created: 2198, Updated: 36, Skipped: 0, Missing: 0

File type counts:
  .DOCX: 1791
  .PDF: 337
  .DOC: 53
  .PPT: 26
  .PPTX: 21
  .XLS: 4
  .XLSX: 2
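To make the idea of a per-file JSON-LD record concrete, here is a minimal sketch. It assumes the schema.org vocabulary with a `DigitalDocument` type and a sidecar named `<document>.jsonld`; the fields that `enhance_all_metadata` really writes are not shown in this notebook, and `write_sidecar` is a hypothetical name:

```python
import json
import mimetypes
from pathlib import Path

def write_sidecar(document_path, committee, doc_type, date):
    """Write '<document>.jsonld' next to the document and return its path."""
    document_path = Path(document_path)
    mime_type = mimetypes.guess_type(document_path.name)[0]
    metadata = {
        "@context": "https://schema.org",
        "@type": "DigitalDocument",
        "name": document_path.name,
        "encodingFormat": mime_type or "application/octet-stream",
        "dateCreated": date,
        "genre": doc_type,
        "sourceOrganization": {"@type": "Organization", "name": committee},
    }
    sidecar = document_path.with_name(document_path.name + ".jsonld")
    sidecar.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return sidecar
```

Keeping the sidecar beside the document means the metadata travels with the file if a committee folder is moved.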


Enhance project metadata by creating a JSON-LD file in the root directory with a basic description of the project.

In [17]:
from data_pipeline.project_metadata import write_project_metadata
write_project_metadata()


Project-level metadata written to: project_metadata.jsonld


NLP term extraction to create a preview of entities in the data set.  This is a first step in identifying key terms and concepts for further analysis.

In [18]:
from data_pipeline.nlp_term_extraction_preview import run_entity_preview
run_entity_preview()

The Library as Catalyst Project - Special Collections Research Center Working Group_Minutes_2019-09-17.docx: ['55 years', 'AEON', 'Archives', 'Archon', 'Bill', 'Bill Maher', 'Bill – Oak Street', 'Bill – Voyager', 'Box', 'Cara Bertram', 'Chris', 'Division', 'Division’s', 'EC', 'FAQ', 'Hort', 'Jameatris Rimkus', 'Joanne', 'Joanne Kaczmarek\nNote', 'July', 'Katie Nichols', 'Krannert', 'Krista', 'Library', 'Library Building', 'Linda', 'Linda Stahnke', 'Lynne', 'Lynne Thomas\nAbsent', 'Main Library', 'Maps', 'Mary', 'Numbers', 'October 10th', 'SCRC', 'Sarah Harris', 'September 17, 2019', 'September 25th', 'Solberg', 'Sousa', 'Special Collections', 'Steelcase', 'Student Life Archives', 'Thursday', 'Tim', 'Tom', 'Tom Teper', 'Training', 'Tuesday', 'University', 'Valerie', 'Wednesday', 'Wendy', 'Wendy Wolter\n\nBill', 'all day', 'daily', 'last summer', 'the 29th', 'the Tuesday', 'the early days', 'this weekly', 'today', 'tomorrow', 'two weeks']
The Library as Catalyst Project - Special Collect

Next I test the enhance_json_with_nlp function quickly before running the full process.

In [19]:
from data_pipeline.add_nlp_terms_to_metadata import enhance_json_with_nlp

# Update a small sample of JSON-LD files first as a test
# Using a limit of 10 files to see quick results
test_results = enhance_json_with_nlp(base_dir="data/Processed_Committees", limit=10)

# Review the test results
print("\nTest completed. Check the output above to see if it looks correct.")

Starting NLP enhancement of 2198 JSON metadata files...
Processing limit: 10 files





=== NLP Enhancement Complete ===
Files examined: 10
Files processed: 10
JSON files updated: 10
Already enhanced: 0
Missing documents: 0
Skipped (errors): 0

Entities extracted:
  PERSON: 74 unique
  ORG: 62 unique
  GPE: 4 unique
  DATE: 58 unique
Total unique entities: 198

Test completed. Check the output above to see if it looks correct.


This is a big event that takes several minutes: batch processing the files for term extraction and adding the terms to the JSON-LD metadata.

In [20]:
from data_pipeline.add_nlp_terms_to_metadata import enhance_json_with_nlp

# Full run - the limit of 2500 exceeds the 2198 JSON metadata files
enhance_json_with_nlp(
    base_dir="data/Processed_Committees",
    limit=2500,
    skip_existing=False,
)


CropBox missing from /Page, defaulting to MediaBox


Starting NLP enhancement of 2198 JSON metadata files...
Processing limit: 2500 files




=== NLP Enhancement Complete ===
Files examined: 2198
Files processed: 1776
JSON files updated: 1776
Already enhanced: 0
Missing documents: 83
Skipped (errors): 0

Entities extracted:
  PERSON: 4712 unique
  ORG: 9706 unique
  GPE: 663 unique
  DATE: 5051 unique
Total unique entities: 20132


{'processed': 1776,
 'updated': 1776,
 'already_enhanced': 0,
 'missing_files': 83,
 'skipped': 0,
 'entities': {'PERSON': Counter({'Library': 905,
           'Tom Teper': 479,
           'John Wilkin': 364,
           'Mary Laskowski': 333,
           'Bill Mischo': 306,
           'David Ward': 282,
           'John': 265,
           'Dean Wilkin': 224,
           'Nancy O’Brien': 220,
           'Dean': 214,
           'Tom': 213,
           'Sue Searing': 208,
           'Lynne Rudasill': 192,
           'Chris Prom': 190,
           'LSSC': 188,
           'Mara Thacker': 180,
           'Lynn Wiley': 180,
           'Cindy Ingold': 177,
           'Jennifer Teper': 176,
           'Bill Maher': 160,
           'Paula Kaufman': 152,
           'Kirstin Dougan': 145,
           'Lisa Hinchliffe': 139,
           'JoAnn Jacoby': 133,
           'Greg Knott': 131,
           'Beth': 131,
           'Paula': 131,
           'John Laskowski': 130,
           'Beth Woodard': 129,
      

Let's make sure no Mac .DS_Store files are contaminating the set.

In [21]:
import importlib
from data_pipeline import data_cleaning
importlib.reload(data_cleaning)

# Then try calling it
data_cleaning.clean_ds_store_files()

✅ No .DS_Store files found - directory is clean!


0
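A cleanup like this is short to write with `pathlib`. The sketch below is an assumption about the approach, not the code in `data_cleaning` (the name `remove_ds_store` is hypothetical); like the call above, it returns the number of files removed:

```python
from pathlib import Path

def remove_ds_store(directory):
    """Delete every '.DS_Store' file under `directory`; return how many were removed."""
    removed = 0
    # Collect matches first so the tree is not modified while it is being walked.
    for path in list(Path(directory).rglob(".DS_Store")):
        if path.is_file():
            path.unlink()
            removed += 1
    return removed
```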

Build knowledge graph with redacted set of data.

In [1]:
# Import the functions
from data_pipeline.build_redacted_knowlege_graph import create_person_document_explorer

# Create an interactive knowledge graph
graph, network = create_person_document_explorer(
    base_dir="data/Processed_Committees/Executive Committee",
    committee=None,  # All committees
    limit=50,       # Process up to 50 files
    min_person_mentions=2,
    output_file="knowledge_graph_explorer.html"
)

Building person-centric graph...
Found 29 people with 2+ mentions
Graph built: 79 nodes, 174 edges
knowledge_graph_explorer.html
Interactive person explorer saved to knowledge_graph_explorer.html
Instructions:
- Click on any person (blue node) to see only their document connections
- Click the same person again or click empty space to show all nodes
- Hover over nodes to see detailed information
- Physics and animations are disabled so the graph stays static


Before building the final knowledge graph with added entities, I check the PERSON entities currently stored in the metadata.

In [23]:
from data_pipeline.metadata_check import check_person_entities
check_person_entities()

Checked 20 files
Current PERSON entities found:
  Bill: 9 mentions
  Joanne: 9 mentions
  Lynne: 9 mentions
  Tom: 9 mentions
  Tom Teper: 9 mentions
  Bill Maher: 8 mentions
  Victor Jones: 8 mentions
  Joanne Kaczmarek: 7 mentions
  Kelli Trei: 7 mentions
  Hannah Williams
Box: 7 mentions
  John: 6 mentions
  Chris Wiley: 6 mentions
  Mara Thacker: 6 mentions
  Victor: 5 mentions
  Lynne Thomas
Absent: 4 mentions
Checked 10 files
Current PERSON entities found:
  Bill: 9 mentions
  Joanne: 9 mentions
  Tom: 9 mentions
  Tom Teper: 9 mentions
  Bill Maher: 8 mentions
  Lynne: 8 mentions
  Joanne Kaczmarek: 7 mentions
  John: 5 mentions
  Lynne Thomas
Absent: 4 mentions
  Box: 3 mentions
  Tim: 3 mentions
  Wendy: 3 mentions
  Krista: 2 mentions
  Maps: 2 mentions
  Sousa: 2 mentions


Counter({'Bill': 9,
         'Joanne': 9,
         'Tom': 9,
         'Tom Teper': 9,
         'Bill Maher': 8,
         'Lynne': 8,
         'Joanne Kaczmarek': 7,
         'John': 5,
         'Lynne Thomas\nAbsent': 4,
         'Box': 3,
         'Tim': 3,
         'Wendy': 3,
         'Krista': 2,
         'Maps': 2,
         'Sousa': 2,
         'Aeon': 2,
         'John’s': 2,
         'Krista Gray': 2,
         'Lynne Thomas': 2,
         'Dennis': 2,
         'Tim Newman': 2,
         'Wendy Wolter': 2,
         'William Maher': 2,
         'Tom - I': 2,
         'Archon': 1,
         'Bill – Oak Street': 1,
         'Bill – Voyager': 1,
         'Cara Bertram': 1,
         'Chris': 1,
         'Joanne Kaczmarek\nNote': 1,
         'Katie Nichols': 1,
         'Krannert': 1,
         'Linda': 1,
         'Linda Stahnke': 1,
         'Mary': 1,
         'Sarah Harris': 1,
         'Solberg': 1,
         'Training': 1,
         'Valerie': 1,
         'Wendy Wolter\n\nBill': 1,
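Several of these entries are the same person with extraction spill-over attached, for example `'Lynne Thomas\nAbsent'` alongside `'Lynne Thomas'`, or `'John’s'` alongside `'John'`. A minimal sketch of merging such variants, assuming that everything after the first newline is noise and that a trailing possessive can be dropped (`normalize_person_counts` is a hypothetical name, not part of `data_pipeline`):

```python
from collections import Counter

def normalize_person_counts(raw_counts):
    """Merge counts after trimming newline spill-over and possessive endings."""
    cleaned = Counter()
    for name, count in raw_counts.items():
        name = name.split("\n", 1)[0].strip()
        for ending in ("’s", "'s"):
            if name.endswith(ending):
                name = name[: -len(ending)]
        if name:
            cleaned[name] += count
    return cleaned
```

This only tidies the strings. Deciding whether `'Bill'` and `'Bill Maher'` are the same person is a separate disambiguation problem.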
   

In [5]:
from data_pipeline.add_nlp_terms_to_metadata import reprocess_all_entities

result = reprocess_all_entities(
    report_path="data/nlp_quality_report.json"
)


import json
with open("data/nlp_quality_report.json") as f:
    report = json.load(f)

print(f"Low quality documents: {len(report['problematic_documents'])}")
print(f"PERSON rejection rate: {report['entity_stats']['PERSON']['rejection_rate']:.1%}")


REPROCESSING ALL ENTITIES WITH IMPROVED VALIDATION

Step 1: Clearing existing entities from JSON metadata files...
  Cleared entities from 2198 JSON files

Step 2: Running enhanced NLP extraction with quality validation...
Starting NLP enhancement of 2198 JSON metadata files...


CropBox missing from /Page, defaulting to MediaBox

Progress: 500 files processed, 500 updated



Progress: 1000 files processed, 1000 updated



Progress: 1500 files processed, 1500 updated



Progress: 2000 files processed, 2000 updated




=== NLP Enhancement Complete ===
Files examined: 2198
Files processed: 2115
JSON files updated: 2115
Already enhanced: 0
Missing documents: 83
Skipped (errors): 0

Entities extracted:
  PERSON: 2840 unique
  ORG: 9540 unique
  GPE: 653 unique
  DATE: 5006 unique
Total unique entities: 18039

NLP EXTRACTION QUALITY REPORT

Documents Processed: 2119
  Successful: 2115
  Failed: 4
  Low Quality: 347
  Avg Quality Score: 0.940

Entity Statistics:
  PERSON:
    Extracted: 32430
    Kept: 21355
    Rejected: 11075 (34.2%)
  ORG:
    Extracted: 23768
    Kept: 23756
    Rejected: 12 (0.1%)
  GPE:
    Extracted: 2210
    Kept: 2189
    Rejected: 21 (1.0%)
  DATE:
    Extracted: 12675
    Kept: 12656
    Rejected: 19 (0.1%)

Top Rejected PERSON Entities:
  'Library': 3610
  'LSSC': 636
  'Dean Wilkin': 568
  'Dean': 449
  'Nancy O’Brien': 340
  'Lynn Wiley': 276
  'Primo': 212
  'Librarian': 167
  'Lynn': 110
  'Scott Schwartz': 97

Top Rejection Reasons (PERSON):
  filter_term_exact: 5067
  n

In [6]:
# Neo4j Export
from data_pipeline.neo4j_export import export_to_neo4j

result = export_to_neo4j(
    base_dir="data/Processed_Committees",
    output_format="both",  # generates both .cypher and CSV files
    min_person_mentions=2,
    min_coappear_count=2
)

print(f"Cypher file: {result.get('cypher')}")
print(f"CSV directory: {result.get('csv_dir')}")



Scanning documents in data/Processed_Committees...
  Processed 500 documents...
  Processed 1000 documents...
  Processed 1500 documents...
  Processed 2000 documents...

Scan complete:
  Documents: 2198
  Committees: 26
  Persons: 2815
  Organizations: 9540
  Locations: 653
  Co-appearances: 80891
Cypher script written to: data/neo4j_export/neo4j_import.cypher
CSV files written to: data/neo4j_export/csv/
Files created:
  - doc_mentions_person.csv
  - person_coappears.csv
  - documents.csv
  - organizations.csv
  - committees.csv
  - locations.csv
  - persons.csv
  - doc_mentions_location.csv
  - doc_mentions_org.csv
  - doc_belongs_to_committee.csv
Cypher file: data/neo4j_export/neo4j_import.cypher
CSV directory: data/neo4j_export/csv
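The co-appearance relationships counted above link two people who are mentioned in the same document. A minimal sketch of that counting step, assuming `min_coappear_count` means "keep pairs that share at least this many documents" (the function name and the input shape, a dict of document id to person names, are hypothetical):

```python
from collections import Counter
from itertools import combinations

def count_coappearances(doc_persons, min_coappear_count=2):
    """Count person pairs sharing a document; keep pairs at or above the threshold."""
    pair_counts = Counter()
    for persons in doc_persons.values():
        # Deduplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(persons)), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_coappear_count}
```

The number of pairs grows quadratically with the number of people per document, which is why a threshold is useful for keeping the graph readable.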


In [7]:
# Neo4j Direct Import
from data_pipeline.neo4j_import import import_to_neo4j

stats = import_to_neo4j(
    base_dir="data/Processed_Committees",
    min_person_mentions=2,
    clear_first=True  # Clears existing data before import
)


Connected to Neo4j at bolt://localhost:7687
Clearing existing data...
Database cleared.
Creating constraints and indexes...
Constraints created.
Scanning documents in data/Processed_Committees...
  Scanned 500 documents...
  Scanned 1000 documents...
  Scanned 1500 documents...
  Scanned 2000 documents...

Scanned 2198 documents
  Committees: 26
  Persons: 2815
  Organizations: 9540
  Locations: 653
  Persons (≥2 mentions): 1242

Importing committees...
Importing persons...
Importing organizations...
Importing locations...
Importing documents...
  Imported 500 documents...
  Imported 1000 documents...
  Imported 1500 documents...
  Imported 2000 documents...
  Imported 2198 documents total
Creating person mentions...
  Created 19694 person mentions
Creating organization mentions...
  Created 23756 organization mentions
Creating location mentions...
  Created 2189 location mentions
Creating co-appearance relationships...
  Created 52453 co-appearance relationships

Import complete!

===

## Graph Dataset Preparation for GraphRAG

The following cells create a filtered, cleaned dataset specifically for knowledge graph and GraphRAG applications:

1. **Filter Minutes Only** - Extract only Minutes documents (excluding Agendas and Related Documents) into a flattened structure
2. **Clean Entities** - Apply stricter NLP entity validation to remove:
   - Single-word names (first names only)
   - Acronyms and abbreviations  
   - Misclassified entities (persons as ORG/GPE)
   - Contraction artifacts and garbage text

This creates `data/committees_processed_for_graph/` with cleaner entity data for semantic search and Q&A.
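The PERSON rules listed above can be sketched as a single predicate. This is an illustration under my own assumptions, not the code in `cleanup_graph_entities`: the `GENERIC_TERMS` set is a small made-up sample, and `is_valid_person` is a hypothetical name:

```python
import re

GENERIC_TERMS = {"library", "dean", "librarian"}  # illustrative filter list only

def is_valid_person(name):
    """Apply stricter PERSON rules: full names only, no acronyms or artifacts."""
    name = name.strip()
    lowered = name.lower()
    if "\n" in name or "n't" in lowered or "n’t" in lowered:
        return False  # spill-over text or contraction artifact
    words = name.split()
    if len(words) < 2:
        return False  # single-word names (first names only)
    if any(w.isalpha() and w.isupper() and len(w) > 1 for w in words):
        return False  # acronyms such as 'LSSC'
    if any(w.lower() in GENERIC_TERMS for w in words):
        return False  # generic terms misread as people
    # Every word must look like part of a name (letters, apostrophes, dots, hyphens).
    return all(re.fullmatch(r"[A-Za-zÀ-ÿ'’.\-]+", w) for w in words)
```

Ordering the cheap structural checks first keeps the function easy to extend when new kinds of garbage text turn up in the quality report.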

In [None]:
# Step 1: Filter to Minutes-only dataset
# Creates data/committees_processed_for_graph/ with flattened structure

from data_pipeline.filter_for_graph import filter_for_graph

filter_result = filter_for_graph(
    source_dir="data/Processed_Committees",
    dest_dir="data/committees_processed_for_graph"
)

print(f"\nReady for entity cleanup: {filter_result['documents_copied']} documents")
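The flattening in Step 1 can be pictured with a short sketch. It assumes the `<Committee>/Minutes/<files>` layout seen in the earlier listings and copies everything in each Minutes folder up one level into `<dest>/<Committee>/`. The function name is hypothetical, and only the `documents_copied` key is taken from the cell above:

```python
import shutil
from pathlib import Path

def copy_minutes_flat(source_dir, dest_dir, folder_name="Minutes"):
    """Copy '<committee>/Minutes/*' files to '<dest>/<committee>/' and count them."""
    copied = 0
    for committee in sorted(Path(source_dir).iterdir()):
        minutes = committee / folder_name
        if not minutes.is_dir():
            continue  # committee has no Minutes folder
        target = Path(dest_dir) / committee.name
        target.mkdir(parents=True, exist_ok=True)
        for item in sorted(minutes.iterdir()):
            if item.is_file():
                shutil.copy2(item, target / item.name)
                copied += 1
    return {"documents_copied": copied}
```

Copying rather than moving leaves `data/Processed_Committees` intact, in keeping with the earlier choice to leave the original data set untouched.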

In [None]:
# Step 2: Clean up entities with stricter validation
# Reprocesses NLP entities with filters for:
# - Single-word names, acronyms, generic terms
# - Misclassified persons in ORG/GPE
# - Contraction artifacts (n't)

from data_pipeline.cleanup_graph_entities import cleanup_graph_entities

cleanup_result = cleanup_graph_entities(
    base_dir="data/committees_processed_for_graph",
    report_path="data/graph_nlp_quality_report.json"
)

In [None]:
# Step 3: Review cleaned entities
# Verify entity quality after cleanup

from data_pipeline.cleanup_graph_entities import show_top_entities

show_top_entities(
    base_dir="data/committees_processed_for_graph",
    top_n=15
)

### Graph Dataset Summary

The filtered and cleaned dataset is now ready at `data/committees_processed_for_graph/`:

- **Documents:** ~1,143 Minutes files from 26 committees
- **Structure:** Flattened `[Committee Name]/[files]` (no subfolders)
- **Entity Quality:**
  - PERSON: Full names only (Tom Teper, John Wilkin, etc.)
  - ORG: Real organizations (User Education Committee, Administrative Council, etc.)
  - GPE: Real locations (Illinois, Chicago, etc.)

**Next Steps:** This dataset is ready for GraphRAG integration with Ollama embeddings. See `docs/features/feature-graphrag-ollama-integration.md` for the implementation plan.