# IS547 Project Jupyter Notebook

<details>
<summary>Project Overview</summary>

This project manages approximately 2,200 digital documents that originated from an internal WordPress site migration at my workplace. As outlined in my Dataset Profile, the data consists of PDFs, Word documents, Excel spreadsheets, and occasional PowerPoint presentations already archived in our Box storage. They were curated over a decade or more by our seventy-plus library committees, although most of the data comes from 10-15 committees. The documents include meeting minutes, agendas, and related institutional records, and they were publicly available via our open staff site. With the FAIR principles in mind, my curation goals are to enhance internal accessibility, maintain institutional memory and data provenance, and support governance through improved data organization and documentation.


</details>

<details>
<summary>Deliverables</summary>

- Consistent naming conventions applied across all documents
- Documentation of data governance and ethical compliance per our institutional policies; if none exist, resources from university-wide policies will be utilized
- Metadata enhancement to improve retrieval, searchability, and discoverability
- Documented provenance and fixity check to support institutional memory

</details>
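
For the fixity deliverable, one common approach is to record a checksum per file and compare it again later. The sketch below is a minimal illustration using SHA-256 from the standard library. The function names `sha256_of` and `build_manifest` are hypothetical and are not part of the `data_pipeline` package.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


# Demo on a temporary folder: a changed file shows up as a mismatch
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "minutes.txt"
    sample.write_text("Call to order at 9:00")
    before = build_manifest(tmp)
    sample.write_text("Call to order at 9:05")
    after = build_manifest(tmp)

changed = [name for name in before if before[name] != after.get(name)]
print(changed)
```

Saving the manifest next to the data would allow the same comparison to be repeated after any later move or migration.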



Note: every code cell imports functions from the `data_pipeline` package, whose Python modules group functions by purpose.

First I get a total file count to check against later.

In [None]:
from data_pipeline.data_explore import count_files

committees_directory = 'data/Committees'
total_files = count_files(committees_directory)
print(f"Total number of files in '{committees_directory}': {total_files}")

Next I review the file types in the data set.

In [None]:
from data_pipeline.data_explore import find_file_types
file_types = find_file_types('data/Committees')
print(file_types)
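
A minimal sketch of how a file-type review like this could work, assuming it simply tallies file extensions. `tally_extensions` is a hypothetical stand-in, not the actual `find_file_types` implementation, and the demo runs on a temporary folder rather than the real data.

```python
import tempfile
from collections import Counter
from pathlib import Path


def tally_extensions(root):
    """Count files under root by lowercased extension."""
    return Counter(
        p.suffix.lower() or "(none)"
        for p in Path(root).rglob("*")
        if p.is_file()
    )


# Demo: mixed-case extensions are grouped together
with tempfile.TemporaryDirectory() as tmp:
    for name in ["a.PDF", "b.pdf", "c.docx", "README"]:
        (Path(tmp) / name).touch()
    demo_types = tally_extensions(tmp)

print(dict(demo_types))
```

Lowercasing the extension keeps `.PDF` and `.pdf` from being reported as two different types.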

Listing the committees with their file counts helps confirm that everything looks as it should (81 committees).

In [None]:
from data_pipeline.data_explore import list_committees_and_count
list_committees_and_count('data/Committees')


Then a list of files just to see what I'm working with.

In [None]:
from data_pipeline.data_explore import list_files

list_files('data/Committees')


This function ensures the output directory exists, so processed files land in the right place and no mess is created.

In [None]:
from data_pipeline.data_cleaning import ensure_output_directory, clean_ds_store_files

ensure_output_directory()



Now I copy the original files to the processed directory.  This ensures the original data set is untouched.

In [None]:
from data_pipeline.data_cleaning import copy_files

copy_files()
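
The copy-then-verify idea can be sketched as follows, assuming a plain recursive copy is enough. `copy_tree_and_verify` is a hypothetical helper and may differ from what `copy_files` does. It relies on `shutil.copytree(..., dirs_exist_ok=True)`, which needs Python 3.8 or later.

```python
import shutil
import tempfile
from pathlib import Path


def copy_tree_and_verify(source, destination):
    """Copy source to destination and confirm the file counts match."""
    shutil.copytree(source, destination, dirs_exist_ok=True)
    source_count = sum(1 for p in Path(source).rglob("*") if p.is_file())
    dest_count = sum(1 for p in Path(destination).rglob("*") if p.is_file())
    if source_count != dest_count:
        raise RuntimeError(f"Expected {source_count} files, found {dest_count}")
    return dest_count


# Demo on a temporary folder so the real data set is not touched
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "Committees"
    (src / "Executive Committee").mkdir(parents=True)
    (src / "Executive Committee" / "agenda.pdf").write_text("demo")
    copied = copy_tree_and_verify(src, Path(tmp) / "Processed_Committees")
    original_still_there = (src / "Executive Committee" / "agenda.pdf").exists()

print(copied, original_still_there)
```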

Count files again to verify the copy was successful.

In [None]:
from data_pipeline.data_explore import count_files
count_files('data/Processed_Committees')

Review file types again to see if anything changed.

In [None]:
from data_pipeline.data_explore import find_file_types
file_types = find_file_types('data/Processed_Committees')
print(file_types)

This function creates a CSV with the committee, type, original filename, extracted date, and proposed filename. The CSV is named "names.csv" and is placed in the data directory. From examining the CSV data I can see:
1. **Related Documents**: A significant number of files sit in "Related Documents" folders. These keep their original, often unique filenames and are skipped during renaming.
2. **"Unknown" Date Files**: Many files have "unknown" in their proposed filenames (especially from committees like "Diversity Residency Advisory Committee" and "DEIA Task Force"). These standardize to the same pattern, which reduces the number of unique names.
3. **Duplicate Resolution**: Files like `capt_agenda_minutes_2013_04_30.docx` and `capt_agenda_minutes_2013_04_30 (1).docx` are normalized to the same standardized name, with collision handling adding suffixes as needed.

The reduction in unique values (2193 - 1610 = 583 fewer) means that about 26.6% of the original filenames were standardized or excluded from renaming (like Related Documents), which is expected in a file organization project focused on consistent naming. The standardization process is reducing naming inconsistencies while preserving the original names in Related Documents folders, which likely need their distinct names for context.


In [None]:
from data_pipeline.file_naming import generate_names_csv

generate_names_csv()
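
To illustrate why the number of unique names drops, here is a rough sketch of date extraction and name proposal. Everything in it is an assumption made for illustration: the regular expression, the `<committee>_<type>_<date>` pattern, and the function names are hypothetical and are not taken from `generate_names_csv`.

```python
import re
from pathlib import Path

# Assumed pattern: four-digit year, then month and day, separated by "-", "_" or nothing
DATE_PATTERN = re.compile(r"(?<!\d)(\d{4})[-_]?(\d{2})[-_]?(\d{2})(?!\d)")


def extract_date(filename):
    """Return an ISO-style date found in the filename, or 'unknown'."""
    match = DATE_PATTERN.search(filename)
    if not match:
        return "unknown"
    year, month, day = match.groups()
    if not (1 <= int(month) <= 12 and 1 <= int(day) <= 31):
        return "unknown"
    return f"{year}-{month}-{day}"


def propose_name(committee, doc_type, filename):
    """Build '<committee>_<type>_<date><ext>' from the original filename."""
    return f"{committee}_{doc_type}_{extract_date(filename)}{Path(filename).suffix}"


# "CAPT" is a hypothetical committee label used only for this demo
first = propose_name("CAPT", "Minutes", "capt_agenda_minutes_2013_04_30.docx")
second = propose_name("CAPT", "Minutes", "capt_agenda_minutes_2013_04_30 (1).docx")
undated = propose_name("DEIA Task Force", "Minutes", "notes from retreat.docx")
print(first, second, undated, sep="\n")
```

The first two inputs collapse to the same proposed name, which is the kind of collision that needs a suffix at rename time, and the undated file falls back to "unknown".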

Again I list files to see if anything has changed.

In [None]:
from data_pipeline.data_cleaning import list_files

list_files("./data/Processed_Committees")

This is where I build the final filenames from the CSV created in the previous step. It comes after hours of manually cleaning the data and adding dates by hand to the date column of "names.csv", which I then saved as "manually_updated_committee_names.csv". The function adds a column for the final concatenated names and saves the result as "final_updated_committee_names.csv" in the data directory. Note how close the unique-value counts are from beginning to end.

In [None]:
from data_pipeline.final_file_naming import build_final_filenames
build_final_filenames()

I verify the folder structure and files are as expected before the final renaming.

In [None]:
from data_pipeline.final_file_naming import verify_folder_file_structure
verify_folder_file_structure()

The big event: renaming the files. Fewer files are renamed than the total, because some new filenames match the old ones and Related Documents are never renamed (many have unique names with no dates).

In [None]:
from data_pipeline.final_file_naming import rename_processed_files
rename_processed_files()
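
A minimal sketch of a rename step with collision handling, assuming a numeric suffix is appended when the target name is already taken. The `_1`, `_2` suffix format and the `safe_rename` helper are hypothetical; the real `rename_processed_files` may resolve collisions differently.

```python
import tempfile
from pathlib import Path


def safe_rename(path, new_name):
    """Rename path to new_name in the same folder, adding _1, _2, ... on collision."""
    path = Path(path)
    target = path.with_name(new_name)
    if target == path:
        return path  # name already matches, nothing to rename
    stem, suffix = target.stem, target.suffix
    counter = 1
    while target.exists():
        target = path.with_name(f"{stem}_{counter}{suffix}")
        counter += 1
    path.rename(target)
    return target


# Demo: two files that standardize to the same name
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    originals = ["capt_minutes_a.docx", "capt_minutes_b.docx"]
    for name in originals:
        (folder / name).write_text(name)
    wanted = "CAPT_Minutes_2013-04-30.docx"
    results = [safe_rename(folder / name, wanted).name for name in originals]

print(results)
```

Skipping files whose name already matches is one reason a rename pass can touch fewer files than the total count.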

When checked manually, the filenames with dates appended look exactly as I want.

In [None]:
from data_pipeline.data_cleaning import list_files

list_files("./data/Processed_Committees")

Validate that the same number of files exists as when we started:


In [None]:
from data_pipeline.data_explore import count_files
count_files('data/Processed_Committees')

Enhance file metadata with JSON-LD files.

In [None]:
# Import the module
from data_pipeline import enhance_metadata

# Call the single combined function instead of both separately
enhance_metadata.enhance_all_metadata(
    csv_path="data/final_updated_committee_names.csv",
    base_dir="data/Processed_Committees",
    skip_existing=False
)
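
For readers unfamiliar with JSON-LD sidecar files, here is a small sketch of what one might contain. The schema.org vocabulary, the chosen properties, and the `.jsonld` naming are assumptions for illustration. The actual output of `enhance_all_metadata` is not shown in this notebook and may differ.

```python
import json
import tempfile
from pathlib import Path


def write_jsonld_sidecar(doc_path, committee, doc_type, date):
    """Write '<document>.jsonld' describing the file with schema.org terms."""
    doc_path = Path(doc_path)
    record = {
        "@context": "https://schema.org",
        "@type": "DigitalDocument",
        "name": doc_path.name,
        "dateCreated": date,
        "genre": doc_type,
        "sourceOrganization": {"@type": "Organization", "name": committee},
    }
    sidecar = doc_path.with_name(doc_path.name + ".jsonld")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar


# Demo with a placeholder document in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    doc = Path(tmp) / "Executive Committee_Minutes_2019-03-26.docx"
    doc.touch()
    sidecar = write_jsonld_sidecar(doc, "Executive Committee", "Minutes", "2019-03-26")
    loaded = json.loads(sidecar.read_text(encoding="utf-8"))
    sidecar_name = sidecar.name

print(sidecar_name)
print(loaded["@type"], loaded["dateCreated"])
```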


Enhance the project metadata by creating a JSON-LD file in the root directory with a basic description of the project.

In [None]:
from data_pipeline.project_metadata import write_project_metadata
write_project_metadata()


NLP term extraction creates a preview of the entities in the data set. This is a first step in identifying key terms and concepts for further analysis.

In [None]:
from data_pipeline.nlp_term_extraction_preview import run_entity_preview
run_entity_preview()

Next I test the enhance_json_with_nlp function quickly before running the full process.

In [None]:
from data_pipeline.add_nlp_terms_to_metadata import enhance_json_with_nlp

# Update a small sample of JSON-LD files first as a test
# Using a limit of 10 files to see quick results
test_results = enhance_json_with_nlp(base_dir="data/Processed_Committees", limit=10)

# Review the test results
print("\nTest completed. Check the output above to see if it looks correct.")

This is the other big event and takes several minutes: the files are batch processed for term extraction, and the extracted terms are added to the JSON-LD files.

In [None]:
from data_pipeline.add_nlp_terms_to_metadata import enhance_json_with_nlp

# Process up to 2500 files, enough to cover the full data set
enhance_json_with_nlp(
    base_dir="data/Processed_Committees",
    limit=2500,
    skip_existing=False,
)


Let's make sure no macOS .DS_Store files are contaminating the set.

In [None]:
import importlib
from data_pipeline import data_cleaning
importlib.reload(data_cleaning)

# Call the cleanup function from the reloaded module
data_cleaning.clean_ds_store_files()
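
A sketch of the cleanup idea, assuming the function walks the tree and deletes every `.DS_Store` it finds. `remove_ds_store` is a hypothetical equivalent, not the actual `clean_ds_store_files` code.

```python
import tempfile
from pathlib import Path


def remove_ds_store(root):
    """Delete every .DS_Store file under root and return how many were removed."""
    removed = 0
    # Collect matches first so the tree is not modified while it is being walked
    for path in list(Path(root).rglob(".DS_Store")):
        if path.is_file():
            path.unlink()
            removed += 1
    return removed


# Demo: two stray .DS_Store files next to one real document
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "Executive Committee"
    folder.mkdir()
    (folder / ".DS_Store").touch()
    (Path(tmp) / ".DS_Store").touch()
    (folder / "minutes.docx").touch()
    removed_count = remove_ds_store(tmp)
    remaining = sorted(p.name for p in Path(tmp).rglob("*") if p.is_file())

print(removed_count, remaining)
```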

Build the knowledge graph with a redacted set of data.

In [None]:
# Import the functions
from data_pipeline.build_redacted_knowlege_graph import create_person_document_explorer

# Create an interactive knowledge graph
graph, network = create_person_document_explorer(
    base_dir="data/Processed_Committees/Executive Committee",
    committee=None,  # No committee filter; base_dir already points at one committee
    limit=50,        # Process up to 50 files
    min_person_mentions=2,
    output_file="knowledge_graph_explorer.html"
)

Build the final knowledge graph with added entities.

In [None]:
from data_pipeline.metadata_check import check_person_entities
check_person_entities()

In [None]:
import json

from data_pipeline.add_nlp_terms_to_metadata import reprocess_all_entities

# Reprocess entities and write a quality report
result = reprocess_all_entities(
    report_path="data/nlp_quality_report.json"
)

# Load the report to summarize entity quality
with open("data/nlp_quality_report.json") as f:
    report = json.load(f)

print(f"Low quality documents: {len(report['problematic_documents'])}")
print(f"PERSON rejection rate: {report['entity_stats']['PERSON']['rejection_rate']:.1%}")


In [None]:
# Neo4j Export
from data_pipeline.neo4j_export import export_to_neo4j

result = export_to_neo4j(
    base_dir="data/Processed_Committees",
    output_format="both",  # generates both .cypher and CSV files
    min_person_mentions=2,
    min_coappear_count=2
)

print(f"Cypher file: {result.get('cypher')}")
print(f"CSV directory: {result.get('csv_dir')}")
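
The two thresholds passed above can be pictured with a small sketch. It assumes a person's mention count is the number of documents naming them and that a co-appearance is two people named in the same document. Both are assumptions about `export_to_neo4j`, and the function and sample names below are hypothetical.

```python
from collections import Counter
from itertools import combinations


def filter_people_and_pairs(doc_people, min_person_mentions=2, min_coappear_count=2):
    """Keep people and person pairs that meet the minimum thresholds."""
    mentions = Counter(
        name for people in doc_people.values() for name in set(people)
    )
    kept = {name for name, n in mentions.items() if n >= min_person_mentions}

    pairs = Counter()
    for people in doc_people.values():
        present = sorted(set(people) & kept)
        pairs.update(combinations(present, 2))

    strong_pairs = {pair: n for pair, n in pairs.items() if n >= min_coappear_count}
    return kept, strong_pairs


# Placeholder names, not real entities from the data set
doc_people = {
    "minutes_1": ["Person A", "Person B", "Person C"],
    "minutes_2": ["Person A", "Person B"],
    "minutes_3": ["Person A", "Person D"],
}
kept, strong_pairs = filter_people_and_pairs(doc_people)
print(sorted(kept))
print(strong_pairs)
```

Raising either threshold trims one-off mentions and weak links, which tends to make the exported graph easier to read.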



In [None]:
# Neo4j Direct Import
from data_pipeline.neo4j_import import import_to_neo4j

stats = import_to_neo4j(
    base_dir="data/Processed_Committees",
    min_person_mentions=2,
    clear_first=True  # Clears existing data before import
)


## Graph Dataset Preparation for GraphRAG

The following cells create a filtered, cleaned dataset specifically for knowledge graph and GraphRAG applications:

1. **Filter Minutes Only** - Extract only Minutes documents (excluding Agendas and Related Documents) into a flattened structure
2. **Clean Entities** - Apply stricter NLP entity validation to remove:
   - Single-word names (first names only)
   - Acronyms and abbreviations  
   - Misclassified entities (persons as ORG/GPE)
   - Contraction artifacts and garbage text

This creates `data/committees_processed_for_graph/` with cleaner entity data for semantic search and Q&A.
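
The validation rules listed above could be expressed roughly as follows for PERSON entities. This is a simplified sketch with hypothetical rules and names. The real checks in `cleanup_graph_entities` are likely more thorough.

```python
def is_valid_person(name):
    """Apply stricter PERSON rules: full names only, no acronyms or artifacts."""
    name = name.strip()
    tokens = name.split()
    if len(tokens) < 2:
        return False  # single-word names (first names only)
    if any(t.isupper() and len(t.strip(".")) > 1 for t in tokens):
        return False  # acronyms and abbreviations such as "EC"
    if "n't" in name.lower():
        return False  # contraction artifacts
    if not all(ch.isalpha() or ch in ".'- " for ch in name):
        return False  # digits and other garbage text
    # First and last tokens should look like capitalized name parts
    return tokens[0][0].isupper() and tokens[-1][0].isupper()


candidates = ["Tom Teper", "Tom", "EC Chair", "Did n't", "John Wilkin", "Room 428"]
valid_people = [c for c in candidates if is_valid_person(c)]
print(valid_people)
```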

In [None]:
# Step 1: Filter to Minutes-only dataset
# Creates data/committees_processed_for_graph/ with flattened structure

from data_pipeline.filter_for_graph import filter_for_graph

filter_result = filter_for_graph(
    source_dir="data/Processed_Committees",
    dest_dir="data/committees_processed_for_graph"
)

print(f"\nReady for entity cleanup: {filter_result['documents_copied']} documents")

In [None]:
# Step 2: Clean up entities with stricter validation
# Reprocesses NLP entities with filters for:
# - Single-word names, acronyms, generic terms
# - Misclassified persons in ORG/GPE
# - Contraction artifacts (n't)

from data_pipeline.cleanup_graph_entities import cleanup_graph_entities

cleanup_result = cleanup_graph_entities(
    base_dir="data/committees_processed_for_graph",
    report_path="data/graph_nlp_quality_report.json"
)

In [None]:
# Step 3: Review cleaned entities
# Verify entity quality after cleanup

from data_pipeline.cleanup_graph_entities import show_top_entities

show_top_entities(
    base_dir="data/committees_processed_for_graph",
    top_n=15
)

### Graph Dataset Summary

The filtered and cleaned dataset is now ready at `data/committees_processed_for_graph/`:

- **Documents:** ~1,143 Minutes files from 26 committees
- **Structure:** Flattened `[Committee Name]/[files]` (no subfolders)
- **Entity Quality:**
  - PERSON: Full names only (Tom Teper, John Wilkin, etc.)
  - ORG: Real organizations (User Education Committee, Administrative Council, etc.)
  - GPE: Real locations (Illinois, Chicago, etc.)

**Next Steps:** This dataset is ready for GraphRAG integration with Ollama embeddings. See `docs/features/feature-graphrag-ollama-integration.md` for the implementation plan.

In [None]:
# Step 4: Re-import cleaned data to Neo4j
# This replaces the original import with the cleaned Minutes-only dataset
# Note: This clears existing data including embeddings - you'll need to regenerate them

from data_pipeline.neo4j_import import import_to_neo4j

stats = import_to_neo4j(
    base_dir="data/committees_processed_for_graph",  # Use cleaned dataset
    min_person_mentions=2,
    clear_first=True  # Clear old dirty data first
)

print("\nNeo4j now contains cleaned entities from Minutes-only dataset.")

## GraphRAG Integration with Ollama

This section adds semantic search and question-answering capabilities using:
- **neo4j-graphrag-python**: Official Neo4j GraphRAG library
- **Ollama**: Local LLM and embeddings (privacy-preserving)
- **Vector Index**: Semantic similarity search on document embeddings
- **Hybrid Retrieval**: Combines vector search + keyword matching + graph traversal
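
To make the idea of combining vector and keyword signals concrete, here is a toy sketch with hand-written three-dimensional vectors. It is not how `neo4j-graphrag-python` computes scores. The weighting `alpha`, the function names, and the sample documents are all assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def keyword_overlap(query, text):
    """Fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def hybrid_score(query, query_vec, doc_text, doc_vec, alpha=0.7):
    """Weighted mix of vector similarity and keyword overlap."""
    vector_part = cosine_similarity(query_vec, doc_vec)
    keyword_part = keyword_overlap(query, doc_text)
    return alpha * vector_part + (1 - alpha) * keyword_part


# Toy documents: (text, embedding); real embeddings have hundreds of dimensions
docs = {
    "budget_minutes": ("budget discussion for fiscal year", [0.9, 0.1, 0.0]),
    "building_minutes": ("main library building programming", [0.1, 0.9, 0.2]),
}
query, query_vec = "budget discussion", [1.0, 0.0, 0.0]

ranked = sorted(
    docs,
    key=lambda name: hybrid_score(query, query_vec, *docs[name]),
    reverse=True,
)
print(ranked)
```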

**Prerequisites:**
1. Ollama installed and running (`ollama serve`)
2. Models downloaded: `ollama pull nomic-embed-text` and `ollama pull llama3.1:8b`
3. Neo4j database populated (run previous cells first)

In [None]:
# Step 1: Verify GraphRAG prerequisites
# Checks Ollama connection, Neo4j connection, and available models

from data_pipeline.graphrag_verification import run_full_verification

verification = run_full_verification(verbose=True)

In [None]:
# Step 2: Generate embeddings for documents
# This processes all documents and stores embeddings in Neo4j
# Takes ~60-90 minutes for full dataset, can be interrupted and resumed

from data_pipeline.generate_embeddings import (
    setup_embedder,
    extract_and_embed_documents,
    check_embedding_status
)

# Setup embedder
embedder_config = setup_embedder()

# Generate embeddings (skip existing to allow resume)
embedding_stats = extract_and_embed_documents(
    base_dir="data/committees_processed_for_graph",
    skip_existing=True,  # Set to False to regenerate all
    batch_size=50
)

# Check status
check_embedding_status()
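
The resume behavior implied by `skip_existing=True` and `batch_size=50` can be sketched with an in-memory dictionary standing in for Neo4j. `embed_in_batches` and `fake_embed` are hypothetical and only illustrate the control flow, not the actual `extract_and_embed_documents` implementation.

```python
def embed_in_batches(doc_ids, embed_fn, store, batch_size=50, skip_existing=True):
    """Embed documents batch by batch, skipping ids already in the store."""
    pending = [d for d in doc_ids if not (skip_existing and d in store)]
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        for doc_id in batch:
            store[doc_id] = embed_fn(doc_id)
        # A real pipeline would commit to the database here, so an
        # interruption loses at most one batch of work.
    return len(pending)


def fake_embed(doc_id):
    """Stand-in for a call to an embedding model."""
    return [float(len(doc_id))]


# Two documents were embedded before an interruption
store = {"doc_0": [0.0], "doc_1": [0.0]}
doc_ids = [f"doc_{i}" for i in range(5)]

first_run = embed_in_batches(doc_ids, fake_embed, store, batch_size=2)
second_run = embed_in_batches(doc_ids, fake_embed, store, batch_size=2)
print(first_run, second_run, len(store))
```

The second run has nothing left to do, which is what makes a long embedding job safe to interrupt and restart.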

In [None]:
# Step 3: Create vector and fulltext indexes
# Required for semantic search operations

from data_pipeline.setup_vector_index import setup_all_indexes

index_results = setup_all_indexes(
    vector_index_name="document_embeddings",
    fulltext_index_name="document_fulltext",
    wait_for_online=True
)

In [None]:
# Step 4: Test semantic search
# Verify vector retrieval is working

from data_pipeline.graphrag_retriever import GraphRAGRetriever

with GraphRAGRetriever() as retriever:
    # Test query
    query = "What budget discussions took place?"
    results = retriever.search(query, top_k=5, search_type="graph_enhanced")
    
    print(f"Query: {query}\n")
    print(f"Results ({len(results)} documents):\n")
    
    for i, doc in enumerate(results, 1):
        print(f"{i}. {doc['name']}")
        print(f"   Date: {doc.get('date', 'N/A')}")
        print(f"   Committee: {doc.get('committeeName') or doc.get('committee', 'N/A')}")
        score = doc.get('score')
        if score is not None:
            print(f"   Score: {score:.3f}")
        if doc.get('mentionedPeople'):
            print(f"   People: {', '.join(doc['mentionedPeople'][:3])}")
        print()

In [None]:
# Step 5: Ask questions with GraphRAG Q&A
# Uses retrieval + LLM to answer questions about documents

from data_pipeline.graphrag_qa import GraphRAGQA, example_questions

# Show example questions
print("Example questions you can ask:")
for i, q in enumerate(example_questions(), 1):
    print(f"  {i}. {q}")
print()

# Initialize the Q&A system; the context manager closes it even if a call fails
with GraphRAGQA() as qa:
    # Ask a question
    question = "What were the main topics discussed in Executive Committee meetings?"
    print(f"Question: {question}\n")
    print("Generating answer...\n")

    result = qa.ask(question, top_k=5)

    print(f"Answer:\n{result['answer']}\n")
    print(f"Sources ({len(result['sources'])} documents):")
    for i, src in enumerate(result['sources'], 1):
        print(f"  {i}. {src['name']}")
        if src.get('committee'):
            print(f"     Committee: {src['committee']}")

### GraphRAG Capabilities

The system now supports:

- **Semantic Search**: Find documents by meaning, not just keywords
- **Graph-Enhanced Retrieval**: Includes committee context and mentioned people
- **Natural Language Q&A**: Ask questions and get answers with source citations
- **Person Search**: Find all documents mentioning a specific person
- **Committee Search**: Find all documents from a specific committee

**Example Usage:**
```python
from data_pipeline.graphrag_qa import GraphRAGQA

with GraphRAGQA() as qa:
    result = qa.ask("What technology initiatives were discussed?")
    print(result['answer'])
```

**Interactive Mode:**
```python
from data_pipeline.graphrag_qa import interactive_qa
interactive_qa()  # Starts interactive Q&A session
```

In [7]:
# Quick Q&A - Ask a single question
from data_pipeline.graphrag_qa import GraphRAGQA

# Change the question to whatever you want to ask
question = "what can you find on new building project"

with GraphRAGQA() as qa:
    result = qa.ask(question, top_k=25)
    
    print(f"Question: {question}\n")
    print(f"Answer:\n{result['answer']}\n")
    print(f"Sources ({len(result['sources'])}):")
    for s in result['sources']:
        print(f"  - {s['name']}")

LLM configured: llama3.1:8b, temperature=0.1
Question: what can you find on new building project

Answer:
After analyzing the documents, I found several mentions of a "Library Building Project" or "Main Library Building". Here are some specific details:

* Document 2 (The Library as Catalyst Project_ Programming the Main Library Building WG_Minutes_2019-03-26.docx) mentions the "Library Building Project: Programming" organization.
* Document 3 (The Library as Catalyst Project_ Programming the Main Library Building WG_Minutes_2019-02-04.docx) mentions the "Library Building Project: Programming" organization and also lists "Exhibit-Public Event" as an organization related to this project.
* Document 4 (The Library as Catalyst Project_ Programming the Main Library Building WG_Minutes_2019-07-08.docx) mentions the "Library Building Project: Programming" organization.
* Document 5 (The Library as Catalyst Project_ Programming the Main Library Building WG_Minutes_2019-02-25.docx) mentions th

In [5]:
from data_pipeline.graphrag_retriever import get_neo4j_driver

driver = get_neo4j_driver()
with driver.session() as session:
    result = session.run("""
        MATCH (p:Person)
        RETURN p.name AS name, p.mentionCount AS mentions
        ORDER BY p.mentionCount DESC
        LIMIT 10
    """)
    
    for record in result:
        print(f"{record['name']}: {record['mentions']} mentions")

driver.close()

Tom Teper: 379 mentions
John Wilkin: 325 mentions
Mary Laskowski: 317 mentions
Bill Mischo: 257 mentions
David Ward: 256 mentions
Sue Searing: 190 mentions
Lynne Rudasill: 179 mentions
Mara Thacker: 175 mentions
Chris Prom: 170 mentions
Jennifer Teper: 166 mentions
