# IS547 Project Jupyter Notebook

<details>
<summary>Project Overview</summary>

This project involves managing approximately 2200 digital documents originating from an internal WordPress site migration at my workplace. As outlined in my Dataset Profile, the data consists of PDFs, Word documents, Excel spreadsheets, and occasionally PowerPoint presentations already archived in our Box storage. These were curated over a decade or more by our seventy-plus library committees, although most of the data comes from 10 to 15 committees. The documents include meeting minutes, agendas, and related institutional records. With the FAIR principles in mind, my curation goals are to enhance internal accessibility, maintain institutional memory and data provenance, and support governance through improved data organization and documentation. These documents were publicly available via our open staff site.


</details>

<details>
<summary>Deliverables</summary>

- Consistent naming conventions applied across all documents
- Documentation of data governance and ethical compliance per our institutional policies; if none exist, resources from university-wide policies will be utilized
- Metadata enhancement to improve retrieval, searchability, and discoverability
- Documented provenance and fixity check to support institutional memory

</details>



Note that all code imports functions from the `data_pipeline` package, whose Python files group functions according to their purpose.

First I get a total file count to check against later.

In [None]:
from data_pipeline.data_explore import count_files

committees_directory = 'data/Committees'
total_files = count_files(committees_directory)
print(f"Total number of files in '{committees_directory}': {total_files}")
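The implementation of `count_files` lives in the package and is not shown in this notebook. As a rough illustration only, a recursive count could look like the sketch below; the name `count_files_sketch` is hypothetical and the real function may differ (for example, in how it treats hidden files).

```python
import os


def count_files_sketch(directory):
    """Count every file under `directory`, including files in subfolders."""
    total = 0
    # os.walk yields one (root, dirs, files) tuple per folder it visits.
    for _root, _dirs, files in os.walk(directory):
        total += len(files)
    return total
```

Recording this number before any processing gives a baseline to compare against after the copy and rename steps.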

Next I review the file types in the data set.

In [None]:
from data_pipeline.data_explore import find_file_types
file_types = find_file_types('data/Committees')
print(file_types)
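For readers without access to the package, a file type review can be approximated by tallying file extensions. This is a sketch under my own assumptions, not the actual `find_file_types` code; the function name and the "(no extension)" label are hypothetical.

```python
from collections import Counter
from pathlib import Path


def find_file_types_sketch(directory):
    """Tally files under `directory` by lowercased extension."""
    counts = Counter()
    for path in Path(directory).rglob("*"):
        if path.is_file():
            # Lowercasing groups ".PDF" and ".pdf" under one key.
            counts[path.suffix.lower() or "(no extension)"] += 1
    return counts
```

Lowercasing matters here because a decade of uploads from many committees tends to mix extension casing.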

A list of committees and their file counts helps confirm that everything looks as it should (81 committees).

In [None]:
from data_pipeline.data_explore import list_committees_and_count
list_committees_and_count('data/Committees')


Then a list of files just to see what I'm working with.

In [None]:
from data_pipeline.data_explore import list_files

list_files('data/Committees')


Next I make sure the output directory exists, so processed files are delivered to the right place and no mess is created.

In [None]:
from data_pipeline.data_cleaning import ensure_output_directory, clean_ds_store_files

ensure_output_directory()



Now I copy the original files to the processed directory.  This ensures the original data set is untouched.

In [None]:
from data_pipeline.data_cleaning import copy_files

copy_files()

Count files again to verify the copy was successful.

In [None]:
from data_pipeline.data_explore import count_files
count_files('data/Processed_Committees')
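A matching file count shows that nothing was dropped, but it does not show that the contents are identical. Since a fixity check is one of the deliverables, the sketch below illustrates one common approach: build a SHA-256 manifest of each directory and compare them. The function names are hypothetical and this is not code from `data_pipeline`.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file's path (relative to `directory`) to its checksum."""
    base = Path(directory)
    return {
        path.relative_to(base).as_posix(): sha256_of(path)
        for path in sorted(base.rglob("*"))
        if path.is_file()
    }


def compare_manifests(original, copy):
    """Return (missing, changed) relative paths, each sorted."""
    missing = sorted(set(original) - set(copy))
    changed = sorted(
        key for key in original.keys() & copy.keys()
        if original[key] != copy[key]
    )
    return missing, changed
```

Run right after the copy, `compare_manifests(build_manifest('data/Committees'), build_manifest('data/Processed_Committees'))` should return two empty lists. After renaming, the paths will differ, so a later check would need to compare checksums rather than paths.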

Review file types again to see if anything changed.

In [None]:
from data_pipeline.data_explore import find_file_types
file_types = find_file_types('data/Processed_Committees')
print(file_types)

This function creates a CSV with committee, type, original filename, extracted date, and proposed filename. The CSV is named "names.csv" and is placed in the data directory. From examining the CSV data I can see:
1. **Related Documents**: There are a significant number of files in "Related Documents" folders. These maintain their original, often unique filenames and are skipped during renaming.
2. **"Unknown" Date Files**: Many files have "unknown" in their proposed filenames (especially from committees like "Diversity Residency Advisory Committee" and "DEIA Task Force"). These would standardize to the same pattern, reducing unique names.
3. **Duplicate Resolution**: Files like `capt_agenda_minutes_2013_04_30.docx` and `capt_agenda_minutes_2013_04_30 (1).docx` would be normalized to the same standardized name, with collision handling adding suffixes as needed.

The specific reduction (2193 - 1610 = 583 fewer unique values) indicates that about 26.6% of the original filenames were standardized or excluded from renaming (like Related Documents), which is expected in a file organization project focused on consistent naming.
This is a positive outcome: the standardization process is reducing naming inconsistencies while preserving the original files in Related Documents folders, which likely need their distinct names for context.
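To make the standardization and collision handling concrete, here is a minimal sketch of how a proposed name could be derived and how duplicates could receive suffixes. Everything in it is an assumption for illustration: the date pattern, the `committee_type_date` layout, the `_2` suffix style, and all function names are hypothetical and may not match `generate_names_csv`.

```python
import re
from pathlib import PurePath

# Assumed pattern: a four-digit year, then month and day, separated by -, _ or .
DATE_PATTERN = re.compile(r"(\d{4})[-_.](\d{1,2})[-_.](\d{1,2})")


def extract_date(filename):
    """Return the first date found as YYYY_MM_DD, or 'unknown'."""
    match = DATE_PATTERN.search(filename)
    if not match:
        return "unknown"
    year, month, day = match.groups()
    return f"{year}_{int(month):02d}_{int(day):02d}"


def propose_name(committee_slug, doc_type, filename):
    """Build a standardized name from committee, type, and extracted date."""
    suffix = PurePath(filename).suffix.lower()
    return f"{committee_slug}_{doc_type}_{extract_date(filename)}{suffix}"


def resolve_collisions(names):
    """Append _2, _3, ... to repeated names so each result is distinct.

    Simplification: assumes no input name already ends in such a suffix.
    """
    seen = {}
    resolved = []
    for name in names:
        seen[name] = seen.get(name, 0) + 1
        if seen[name] == 1:
            resolved.append(name)
        else:
            path = PurePath(name)
            resolved.append(f"{path.stem}_{seen[name]}{path.suffix}")
    return resolved
```

With this sketch, both `capt_agenda_minutes_2013_04_30.docx` and its `(1)` copy produce the same proposed name, which is why the number of unique proposed names is lower than the number of unique originals until suffixes are applied.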


In [None]:
from data_pipeline.file_naming import generate_names_csv

generate_names_csv()

Again I list files to see if anything has changed.

In [None]:
from data_pipeline.data_cleaning import list_files

list_files("./data/Processed_Committees")

This is where I build the final filenames from the CSV created in the previous step. Before running it, I spent hours manually cleaning the data and adding dates the hard way to the date column of "names.csv", which I then saved as "manually_updated_committee_names.csv". The function adds a column for the final concatenated names and saves the updated CSV as "final_updated_committee_names.csv" in the data directory. Note how close the unique value counts are from beginning to end.

In [None]:
from data_pipeline.final_file_naming import build_final_filenames
build_final_filenames()

I verify the folder structure and files are as expected before the final renaming.

In [None]:
from data_pipeline.final_file_naming import verify_folder_file_structure
verify_folder_file_structure()

The big event: renaming the files. Fewer files are renamed than the full count, because some of the new filenames match the old ones and the Related Documents files were never renamed, since many have unique names with no dates.

In [None]:
from data_pipeline.final_file_naming import rename_processed_files
rename_processed_files()
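The sketch below shows one cautious way a CSV-driven rename can work, and why the renamed count ends up lower than the file count: rows whose name is unchanged, whose source is missing, or whose target already exists are skipped. The column names (`committee`, `original_filename`, `final_filename`), the one-folder-per-committee layout, and both function names are assumptions; the actual `rename_processed_files` may be organized differently.

```python
import csv
from pathlib import Path


def plan_renames(csv_path, base_dir, folder_col="committee",
                 old_col="original_filename", new_col="final_filename"):
    """Return (source, target) pairs that are safe to rename."""
    base = Path(base_dir)
    plan = []
    claimed = set()  # targets already promised to an earlier row
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            source = base / row[folder_col] / row[old_col]
            target = source.with_name(row[new_col])
            if source == target:
                continue  # new name matches the old one
            if not source.exists():
                continue  # nothing to rename
            if target.exists() or target in claimed:
                continue  # never overwrite another file
            claimed.add(target)
            plan.append((source, target))
    return plan


def apply_renames(plan):
    """Perform the planned renames and return how many were done."""
    for source, target in plan:
        source.rename(target)
    return len(plan)
```

Separating the plan from the rename makes it possible to inspect the plan as a dry run before any file is touched.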

When checked manually, the filenames with dates appended appear to work exactly as I want.

In [None]:
from data_pipeline.data_cleaning import list_files

list_files("./data/Processed_Committees")

Validate that the same number of files exists as when I started:


In [None]:
from data_pipeline.data_explore import count_files
count_files('data/Processed_Committees')

Enhance file metadata by creating a JSON-LD file for each document.

In [None]:
# Import the module
from data_pipeline import enhance_metadata

# Call the single combined function instead of both separately
enhance_metadata.enhance_all_metadata(
    csv_path="data/final_updated_committee_names.csv",
    base_dir="data/Processed_Committees",
    skip_existing=False
)
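The fields written by `enhance_all_metadata` are not shown in this notebook. As a hedged illustration of what a per-document JSON-LD record can look like, the sketch below uses the schema.org vocabulary; the choice of `DigitalDocument`, the property mapping, and the function name are my assumptions rather than the pipeline's actual output.

```python
import json


def build_document_jsonld(filename, committee, doc_type, date):
    """Build a small schema.org-style JSON-LD record for one document."""
    record = {
        "@context": "https://schema.org",
        "@type": "DigitalDocument",
        "name": filename,
        "genre": doc_type,
        "isPartOf": {"@type": "Collection", "name": committee},
    }
    # Leave the date out entirely rather than recording "unknown" as a date.
    if date != "unknown":
        record["dateCreated"] = date
    return record


example = build_document_jsonld(
    "example_minutes_2013_04_30.docx",  # hypothetical filename
    "Example Committee",                # hypothetical committee
    "minutes",
    "2013-04-30",
)
print(json.dumps(example, indent=2))
```

Keeping the record in a sidecar file next to each document means the metadata travels with the file if the folder is moved or re-archived.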


Enhance project metadata by creating a JSON-LD file in the root directory with a basic description of the project.

In [None]:
from data_pipeline.project_metadata import write_project_metadata
write_project_metadata()


NLP term extraction to create a preview of entities in the data set.  This is a first step in identifying key terms and concepts for further analysis.

In [None]:
from data_pipeline.nlp_term_extraction_preview import run_entity_preview
run_entity_preview()

Next I test the enhance_json_with_nlp function quickly before running the full process.

In [None]:
from data_pipeline.add_nlp_terms_to_metadata import enhance_json_with_nlp

# Update a small sample of JSON-LD files first as a test
# Using a limit of 10 files to see quick results
test_results = enhance_json_with_nlp(base_dir="data/Processed_Committees", limit=10)

# Review the test results
print("\nTest completed. Check the output above to see if it looks correct.")

This is the big batch step, which takes several minutes: it processes the files for term extraction and adds the extracted terms to the JSON-LD files.

In [None]:
from data_pipeline.add_nlp_terms_to_metadata import enhance_json_with_nlp

# Full batch - process up to 2500 files
enhance_json_with_nlp(
    base_dir="data/Processed_Committees",
    limit=2500,
    skip_existing=False
)


Let's make sure no macOS .DS_Store files are contaminating the set.

In [5]:
import importlib
from data_pipeline import data_cleaning
importlib.reload(data_cleaning)

# Then try calling it
data_cleaning.clean_ds_store_files()

✅ No .DS_Store files found - directory is clean!


0
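For reference, a cleanup like this can be sketched in a few lines. This is an illustration under my own assumptions (the function name and the `dry_run` flag are hypothetical), not the code behind `clean_ds_store_files`.

```python
from pathlib import Path


def remove_ds_store(directory, dry_run=True):
    """Find .DS_Store files under `directory`; delete them unless dry_run.

    Returns the number of .DS_Store files found.
    """
    # Collect first so nothing is deleted while the tree is being scanned.
    targets = [p for p in Path(directory).rglob(".DS_Store") if p.is_file()]
    if not dry_run:
        for path in targets:
            path.unlink()
    return len(targets)
```

A return value of 0, as in the output above, means the directory is already clean.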

In [11]:
# Import the functions
from data_pipeline.build_redacted_knowlege_graph import create_person_document_explorer

# Create an interactive knowledge graph
graph, network = create_person_document_explorer(
    base_dir="data/Processed_Committees/Executive Committee",
    committee=None,  # All committees
    limit=50,       # Process up to 50 files
    min_person_mentions=2,
    output_file="knowledge_graph_explorer.html"
)

Building person-centric graph...
Found 67 people with 2+ mentions
Graph built: 117 nodes, 249 edges
knowledge_graph_explorer.html
Interactive person explorer saved to knowledge_graph_explorer.html
Instructions:
- Click on any person (blue node) to see only their document connections
- Click the same person again or click empty space to show all nodes
- Hover over nodes to see detailed information


In [9]:
from data_pipeline.metadata_check import check_person_entities
check_person_entities()

Checked 20 files
Current PERSON entities found:
  Bill: 9 mentions
  Joanne: 9 mentions
  Lynne: 9 mentions
  Tom: 9 mentions
  Tom Teper: 9 mentions
  Bill Maher: 8 mentions
  Victor Jones: 8 mentions
  Joanne Kaczmarek: 7 mentions
  Kelli Trei: 7 mentions
  Hannah Williams
Box: 7 mentions
  John: 6 mentions
  Chris Wiley: 6 mentions
  Mara Thacker: 6 mentions
  Victor: 5 mentions
  Lynne Thomas
Absent: 4 mentions
Checked 10 files
Current PERSON entities found:
  Bill: 9 mentions
  Joanne: 9 mentions
  Tom: 9 mentions
  Tom Teper: 9 mentions
  Bill Maher: 8 mentions
  Lynne: 8 mentions
  Joanne Kaczmarek: 7 mentions
  John: 5 mentions
  Lynne Thomas
Absent: 4 mentions
  Box: 3 mentions
  Tim: 3 mentions
  Wendy: 3 mentions
  Krista: 2 mentions
  Maps: 2 mentions
  Sousa: 2 mentions


Counter({'Bill': 9,
         'Joanne': 9,
         'Tom': 9,
         'Tom Teper': 9,
         'Bill Maher': 8,
         'Lynne': 8,
         'Joanne Kaczmarek': 7,
         'John': 5,
         'Lynne Thomas\nAbsent': 4,
         'Box': 3,
         'Tim': 3,
         'Wendy': 3,
         'Krista': 2,
         'Maps': 2,
         'Sousa': 2,
         'Aeon': 2,
         'John’s': 2,
         'Krista Gray': 2,
         'Lynne Thomas': 2,
         'Dennis': 2,
         'Tim Newman': 2,
         'Wendy Wolter': 2,
         'William Maher': 2,
         'Tom - I': 2,
         'Archon': 1,
         'Bill – Oak Street': 1,
         'Bill – Voyager': 1,
         'Cara Bertram': 1,
         'Chris': 1,
         'Joanne Kaczmarek\nNote': 1,
         'Katie Nichols': 1,
         'Krannert': 1,
         'Linda': 1,
         'Linda Stahnke': 1,
         'Mary': 1,
         'Sarah Harris': 1,
         'Solberg': 1,
         'Training': 1,
         'Valerie': 1,
         'Wendy Wolter\n\nBill': 1,
   