# Generating dummy company data and making a "security level aware" Graph of Knowledge

Since it's quite difficult to emulate a company's internal dataset, as those are closed source by default, I spent an evening in conversation with Opus generating all the pieces of data. Below are the key actionable results.

## Step 1: Generate a company hierarchy

Let's generate a company with a hierarchy in a domain where information security can be critical and enforced by regulations (Healthcare Tech). The JSON below more or less follows the hierarchy in descending order. The "topics" and "collaborators" fields will later help us generate data points for our knowledge graph.

from pathlib import Path
import json

# Create the directory if it doesn't exist
directory = Path("./PacemakerInnovationsData")
directory.mkdir(parents=True, exist_ok=True)

# Define the file path
file_path = directory / "collaborativeHirarchy.json"

# Opus originally produced an unhashable list of dicts; the data below is instead a dict keyed by employee name

data = employees = {
    "John Doe": {
        "title": "CEO",
        "topics": [
            "Company background and milestones",
            "Financial planning and performance", 
            "Organizational structure and human resources",
            "Corporate governance and investor relations",
            "Corporate social responsibility initiatives"
        ],
        "collaborators": ["Jane Smith", "Michael Johnson"]
    },
    "Jane Smith": {
        "title": "CTO",
        "topics": [
            "Product development and lifecycle management",
            "Intellectual property and competitive intelligence",
            "Product testing and validation",
            "Manufacturing, supply chain, and quality management",
            "Clinical trial planning and execution"  
        ],
        "collaborators": ["John Doe", "Emily Davis", "David Brown"]
    },
    "Michael Johnson": {
        "title": "CFO",
        "topics": [
            "Financial planning and performance",
            "Corporate governance and investor relations",
            "Risk management and mitigation planning",
            "Business continuity and disaster recovery"
        ],
        "collaborators": ["John Doe"]
    },
    "Emily Davis": {
        "title": "VP of Medical Affairs",
        "topics": [
            "Clinical trials and post-market surveillance",
            "Regulatory affairs and compliance",
            "Clinical education and training",
            "Key opinion leader (KOL) engagement"
        ],
        "collaborators": ["Jane Smith", "Sarah Thompson"]
    },
    "David Brown": {
        "title": "VP of Regulatory & Quality",
        "topics": [
            "Regulatory strategy and compliance",
            "Quality management system implementation",
            "Post-market surveillance and vigilance",
            "Environmental, health, and safety (EHS) compliance"
        ],  
        "collaborators": ["Jane Smith", "Emily Davis"]
    },
    "Sarah Thompson": {
        "title": "VP of Manufacturing",
        "topics": [
            "Manufacturing, supply chain, and quality management",
            "Supply chain and vendor management",
            "Inventory management and forecasting",
            "Environmental, health, and safety (EHS) compliance" 
        ],
        "collaborators": ["Emily Davis", "Robert Anderson"]
    },
    "Robert Anderson": {
        "title": "VP of Marketing",
        "topics": [
            "Product development and lifecycle management",
            "Sales, marketing, and business development",
            "Market access and reimbursement",
            "Pricing and reimbursement strategies",
            "Product packaging and labeling"   
        ],
        "collaborators": ["Sarah Thompson", "Jennifer Martinez"]
    },
    "Jennifer Martinez": {
        "title": "VP of Sales",
        "topics": [
            "Sales, marketing, and business development",
            "Market access and reimbursement",
            "Pricing and reimbursement strategies", 
            "Key opinion leader (KOL) engagement",
            "Customer support and complaint handling"
        ],
        "collaborators": ["Robert Anderson"]
    },
    "Christopher Taylor": {
        "title": "VP of Human Resources",
        "topics": [
            "Organizational structure and human resources",
            "Employee training and development programs",
            "Employee engagement and culture"
        ],
        "collaborators": ["John Doe"]
    },
    "Ashley Moore": {
        "title": "VP of R&D",
        "topics": [
            "Product development and lifecycle management",
            "Intellectual property and competitive intelligence",
            "Product testing and validation",
            "Clinical trial planning and execution",
            "Data management and analytics" 
        ],
        "collaborators": ["Jane Smith", "Matthew Jackson"]
    },
    "Matthew Jackson": {
        "title": "Director of Electrical Engineering",
        "topics": [
            "Product development and lifecycle management",
            "Intellectual property and competitive intelligence",
            "Product testing and validation"
        ],
        "collaborators": ["Ashley Moore", "Daniel Robinson"]  
    },
    "Daniel Robinson": {
        "title": "Director of Mechanical Engineering",
        "topics": [
            "Product development and lifecycle management",
            "Intellectual property and competitive intelligence", 
            "Product testing and validation"
        ],
        "collaborators": ["Ashley Moore", "Matthew Jackson"]
    },
    "Eric Young": {
        "title": "VP of Product Development",
        "topics": [
            "Product development and lifecycle management",
            "Intellectual property and competitive intelligence", 
            "Product testing and validation",
            "Clinical trial planning and execution",
            "Regulatory affairs and compliance"
        ],
        "collaborators": ["Ashley Moore", "Matthew Jackson", "Daniel Robinson"]
    },  
    "Michelle King": {
        "title": "Director of Systems Engineering",
        "topics": [
            "Product development and lifecycle management",
            "Product testing and validation"
        ],
        "collaborators": ["Eric Young"]
    },
    "Brandon Wright": {
        "title": "Director of Verification & Validation",
        "topics": [
            "Product development and lifecycle management", 
            "Product testing and validation",
            "Regulatory affairs and compliance"  
        ],
        "collaborators": ["Eric Young"]
    },
    "Samantha Turner": {
        "title": "Director of Clinical Engineering",
        "topics": [
            "Product development and lifecycle management",
            "Clinical trials and post-market surveillance"
        ],
        "collaborators": ["Eric Young", "Emily Davis"] 
    },
    "Nicholas Parker": {
        "title": "Director of Clinical Research",
        "topics": [
            "Clinical trials and post-market surveillance",
            "Regulatory affairs and compliance"
        ],
        "collaborators": ["Emily Davis"]
    },
    "Olivia Reed": {
        "title": "Clinical Trial Manager", 
        "topics": [
            "Clinical trials and post-market surveillance",
            "Regulatory affairs and compliance" 
        ],
        "collaborators": ["Nicholas Parker"]
    },
    "Andrew Cox": {
        "title": "Medical Science Liaison",
        "topics": [
            "Clinical trials and post-market surveillance",
            "Key opinion leader (KOL) engagement"
        ],
        "collaborators": ["Emily Davis", "Jennifer Martinez"]
    },
    "Sophia Ward": {
        "title": "Director of Regulatory Affairs",
        "topics": [
            "Regulatory strategy and compliance",
            "Regulatory affairs and compliance"
        ],  
        "collaborators": ["David Brown"]
    },
    "Jacob Baker": {
        "title": "Director of Quality Assurance",
        "topics": [
            "Quality management system implementation",
            "Post-market surveillance and vigilance",
            "Regulatory affairs and compliance"
        ],
        "collaborators": ["David Brown"]
    },
    "Ava Hill": {
        "title": "Director of Quality Control",
        "topics": [
            "Quality management system implementation",
            "Post-market surveillance and vigilance", 
            "Regulatory affairs and compliance"
        ],
        "collaborators": ["David Brown"]  
    },
    "Ethan Cooper": {
        "title": "Manufacturing Engineering Manager",
        "topics": [
            "Manufacturing, supply chain, and quality management",
            "Supply chain and vendor management" 
        ],
        "collaborators": ["Sarah Thompson"]
    },
    "Isabella Morgan": {
        "title": "Production Manager",
        "topics": [
            "Manufacturing, supply chain, and quality management",
            "Inventory management and forecasting",
            "Regulatory affairs and compliance"
        ],
        "collaborators": ["Sarah Thompson"]
    },
    "Liam Richardson": {
        "title": "Product Marketing Manager",
        "topics": [
            "Product development and lifecycle management",
            "Sales, marketing, and business development", 
            "Market access and reimbursement",
            "Pricing and reimbursement strategies",
            "Product packaging and labeling" 
        ],  
        "collaborators": ["Robert Anderson"]
    },
    "Grace Collins": {
        "title": "Regional Sales Manager",
        "topics": [
            "Sales, marketing, and business development",
            "Market access and reimbursement",
            "Pricing and reimbursement strategies",
            "Customer support and complaint handling"
        ],
        "collaborators": ["Jennifer Martinez"]
    },
    "Noah Stewart": {
        "title": "Sales Representative",
        "topics": [
            "Sales, marketing, and business development",
            "Customer support and complaint handling"
        ],
        "collaborators": ["Grace Collins"]
    },
    "Lily Sanchez": {
        "title": "Executive Assistant",
        "topics": [
            "Company background and milestones",
            "Organizational structure and human resources",
            "Stakeholder engagement and communication"
        ],
        "collaborators": ["John Doe"]
    },
    "James Morris": {
        "title": "Financial Planning & Analysis Manager",
        "topics": [
            "Financial planning and performance"
        ],
        "collaborators": ["Michael Johnson"]
    },
    "Evelyn Rogers": {
        "title": "HR Manager",
        "topics": [
            "Organizational structure and human resources",  
            "Employee training and development programs",
            "Employee engagement and culture"
        ],
        "collaborators": ["Christopher Taylor"]  
    },
    "Henry Bailey": {
        "title": "Clinical Research Associate",
        "topics": [
            "Clinical trials and post-market surveillance"
        ],
        "collaborators": ["Nicholas Parker", "Olivia Reed"]
    },
    "Harper Foster": {
        "title": "Regulatory Specialist",
        "topics": [
            "Regulatory strategy and compliance",
            "Regulatory affairs and compliance"
        ],
        "collaborators": ["Sophia Ward"]
    },
    "Lucas Gibson": {
        "title": "Quality Engineer",
        "topics": [
            "Quality management system implementation",
            "Post-market surveillance and vigilance"
        ],
        "collaborators": ["Jacob Baker", "Ava Hill"]
    },
    "Natalie Simmons": {
        "title": "Quality Associate",
        "topics": [
            "Quality management system implementation",
            "Post-market surveillance and vigilance"  
        ],
        "collaborators": ["Jacob Baker", "Ava Hill"]
    },
    "Benjamin Cook": {
        "title": "Manufacturing Engineer", 
        "topics": [
            "Manufacturing, supply chain, and quality management"
        ],
        "collaborators": ["Ethan Cooper"]
    },
    "Stella Price": {
        "title": "Production Associate",
        "topics": [
            "Manufacturing, supply chain, and quality management" 
        ],
        "collaborators": ["Isabella Morgan"]
    },
    "Gabriel Cruz": {
        "title": "Firmware Engineer",
        "topics": [
            "Product development and lifecycle management",
            "Product testing and validation"
        ],
        "collaborators": ["Joshua Harris", "Tyler Martin"]
    },
    "Zoe Edwards": { 
        "title": "Embedded Systems Engineer",
        "topics": [
            "Product development and lifecycle management",
            "Product testing and validation"
        ],
        "collaborators": ["Joshua Harris"]
    },
    "Connor Sullivan": {
        "title": "Mechanical Design Engineer",
        "topics": [
            "Product development and lifecycle management",
            "Product testing and validation" 
        ],
        "collaborators": ["Cynthia Clark"]
    },
    "Hazel Ramirez": {
        "title": "Materials Engineer",
        "topics": [
            "Product development and lifecycle management",
            "Product testing and validation"
        ],
        "collaborators": ["Rachel Lee"] 
    },
    "Joseph Butler": {
        "title": "Clinical Data Manager",
        "topics": [
            "Clinical trials and post-market surveillance",
            "Data management and analytics"
        ],
        "collaborators": ["Nicholas Parker", "Olivia Reed"]
    },
    "Scarlett Murphy": {
        "title": "Clinical Engineer", 
        "topics": [
            "Product development and lifecycle management",
            "Clinical trials and post-market surveillance"
        ],
        "collaborators": ["Samantha Turner"]
    },
    "Levi Kim": {
        "title": "Software Engineer",
        "topics": [
            "Product development and lifecycle management",
            "Product testing and validation",
            "Information technology infrastructure"  
        ],
        "collaborators": ["Michelle King", "Brandon Wright"]
    },
    "Aria Diaz": {
        "title": "Cybersecurity Analyst",
        "topics": [
            "Information technology infrastructure"
        ],
        "collaborators": ["Levi Kim"]
    },
    "Joshua Harris": {
        "title": "Manager, Embedded Systems",
        "topics": [
            "Product development and lifecycle management",
            "Product testing and validation"
        ],
        "collaborators": ["Matthew Jackson", "Tyler Martin", "Gabriel Cruz", "Zoe Edwards"]
    }, 
    "Tyler Martin": {
        "title": "Manager, Firmware",
        "topics": [
            "Product development and lifecycle management",
            "Product testing and validation"
        ],
        "collaborators": ["Joshua Harris", "Gabriel Cruz"]
    },
    "Cynthia Clark": {
        "title": "Manager, Design",
        "topics": [
            "Product development and lifecycle management",
            "Product testing and validation"
        ],
        "collaborators": ["Daniel Robinson", "Connor Sullivan"] 
    },
    "Rachel Lee": {
        "title": "Manager, Materials",
        "topics": [
            "Product development and lifecycle management",
            "Product testing and validation"
        ],
        "collaborators": ["Daniel Robinson", "Hazel Ramirez"]
    },
    "Mia Peterson": {
        "title": "VP of Quality",
        "topics": [
            "Quality management system implementation",
            "Post-market surveillance and vigilance",
            "Regulatory affairs and compliance"  
        ],
        "collaborators": ["David Brown", "Jacob Baker", "Ava Hill"]
    },
    "Alexander Wright": {
        "title": "Senior Financial Analyst",
        "topics": [
            "Financial planning and performance"
        ],
        "collaborators": ["James Morris"]
    }
}


# Save the generated data to the file
with open(file_path, "w") as file:
    json.dump(data, file, indent=4)
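
To illustrate how the "topics" and "collaborators" fields might later feed a knowledge graph, here is a minimal sketch that turns a hierarchy dict shaped like the one above into plain edge data. The helper name `build_edges` and the two-person sample are hypothetical and used only for illustration; the actual graph construction may well look different.

```python
from collections import defaultdict


def build_edges(employees):
    """Derive simple graph edges from a hierarchy dict (hypothetical helper)."""
    collab_edges = set()
    topic_index = defaultdict(set)
    for name, info in employees.items():
        for other in info.get("collaborators", []):
            # A frozenset makes the edge undirected: (A, B) equals (B, A)
            collab_edges.add(frozenset((name, other)))
        for topic in info.get("topics", []):
            # Map each topic to every employee who covers it
            topic_index[topic].add(name)
    return collab_edges, dict(topic_index)


# Tiny sample in the same shape as the data above (illustrative only)
sample = {
    "John Doe": {
        "title": "CEO",
        "topics": ["Financial planning and performance"],
        "collaborators": ["Michael Johnson"],
    },
    "Michael Johnson": {
        "title": "CFO",
        "topics": ["Financial planning and performance"],
        "collaborators": ["John Doe"],
    },
}
collab_edges, topic_index = build_edges(sample)
```

Storing collaboration edges as unordered pairs means a relationship listed by only one of the two people still yields a single edge, which matters here because the generated collaborator lists are not always symmetric.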

## Step 2: Generate access level data

Next, let's generate a (somewhat justified) list of the topics each role should have access to. The "topical_access" category is deliberately a bit fluid: data access sometimes needs to be flexible, and AI allows for this. We will later try to classify the data topically at the paragraph level, so that each employee gets the information relevant to them from the RAG. The "redact_content" category, by contrast, will be redacted at the sentence level (think of classified documents that get "released").
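
As a rough sketch of how the two categories could be applied, the snippet below filters paragraphs by topic and masks individual sentences. It assumes an upstream classifier has already labelled each paragraph with a topic and each sentence with an optional redaction category; the function name `filter_for_role`, the input shape, and the sample policy are all hypothetical.

```python
def filter_for_role(paragraphs, role_policy, placeholder="[REDACTED]"):
    """Return the paragraphs a role may see, with sensitive sentences masked.

    Hypothetical input shape: each paragraph is a dict with a "topic" string
    and a "sentences" list of (text, redaction_category_or_None) tuples.
    """
    allowed_topics = set(role_policy["topical_access"])
    redacted = set(role_policy["redact_content"])
    visible = []
    for para in paragraphs:
        # Paragraph level: skip topics outside the role's access list
        if para["topic"] not in allowed_topics:
            continue
        # Sentence level: mask sentences tagged with a redacted category
        sentences = [
            placeholder if category in redacted else text
            for text, category in para["sentences"]
        ]
        visible.append(" ".join(sentences))
    return visible


# Illustrative policy and document, not taken from the real dataset
policy = {
    "topical_access": ["Financial planning and performance"],
    "redact_content": ["Detailed partnership terms"],
}
doc = [
    {
        "topic": "Financial planning and performance",
        "sentences": [
            ("Revenue grew this quarter.", None),
            ("The partner receives a fixed royalty.", "Detailed partnership terms"),
        ],
    },
    {
        "topic": "Product testing and validation",
        "sentences": [("Bench tests passed.", None)],
    },
]
result = filter_for_role(doc, policy)
```

With this sample, the product-testing paragraph is dropped entirely, while the finance paragraph is kept with its second sentence replaced by the placeholder.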

Here's the content of accessLevel.json, based on some reasoning done by Opus:

```json
{
    "CEO": {
        "topical_access": [
            "Company background and milestones",
            "Financial planning and performance",
            "Organizational structure and human resources",
            "Corporate governance and investor relations",
            "Corporate social responsibility initiatives",
            "Risk management and mitigation planning",
            "Business continuity and disaster recovery",
            "Stakeholder engagement and communication"
        ],
        "redact_content": [
            "Detailed pricing strategies"
        ]
    },
    "CTO": {
        "topical_access": [
            "Product development and lifecycle management",
            "Intellectual property and competitive intelligence", 
            "Product testing and validation",
            "Intellectual property strategy",
            "Manufacturing, supply chain, and quality management",
            "Clinical trial planning and execution",
            "Quality management system implementation",
            "Information technology infrastructure"
        ],
        "redact_content": [
            "Detailed patient data",
            "Ongoing regulatory issues"
        ]
    },
    "CFO": {
        "topical_access": [
            "Financial planning and performance",
            "Corporate governance and investor relations",
            "Risk management and mitigation planning",
            "Business continuity and disaster recovery"
        ],
        "redact_content": [
            "Proprietary algorithms",
            "Detailed clinical trial data",
            "Detailed partnership terms"
        ]
    },
    "VP of Medical Affairs": {
        "topical_access": [
            "Clinical trials and post-market surveillance",
            "Regulatory affairs and compliance",
            "Clinical education and training",
            "Key opinion leader (KOL) engagement"
        ],
        "redact_content": [
            "Detailed technical designs",
            "Confidential submissions",
            "Detailed sales performance" 
        ]
    },
    "VP of Regulatory & Quality": {
        "topical_access": [
            "Regulatory strategy and compliance",
            "Quality management system implementation",
            "Post-market surveillance and vigilance",
            "Regulatory affairs and compliance",
            "Environmental, health, and safety (EHS) compliance"
        ],
        "redact_content": [
            "Unannounced R&D projects",
            "Identifiable patient records",
            "Competitor pricing"
        ]
    },
    "VP of Manufacturing": {
        "topical_access": [
            "Manufacturing, supply chain, and quality management",
            "Supply chain and vendor management",
            "Inventory management and forecasting",
            "Environmental, health, and safety (EHS) compliance"
        ],
        "redact_content": [
            "Detailed clinical outcomes",
            "Unannounced filings",
            "Supplier contract terms"
        ]
    },
    "VP of Marketing": {
        "topical_access": [
            "Product development and lifecycle management", 
            "Sales, marketing, and business development",
            "Market access and reimbursement",
            "Pricing and reimbursement strategies",
            "Product packaging and labeling",
            "Corporate communications and public relations"
        ],
        "redact_content": [
            "Unannounced feature changes",
            "Unpublished research findings"
        ]
    },
    "VP of Sales": {
        "topical_access": [
            "Sales, marketing, and business development",
            "Market access and reimbursement",  
            "Pricing and reimbursement strategies",
            "Key opinion leader (KOL) engagement",
            "Customer support and complaint handling"
        ],
        "redact_content": [
            "Detailed product roadmaps",
            "Identifiable doctor info"  
        ]
    },
    "VP of Human Resources": {
        "topical_access": [
            "Organizational structure and human resources",
            "Employee training and development programs", 
            "Employee engagement and culture"
        ],
        "redact_content": [
            "Employee health records",
            "Executive compensation"
        ]
    },
    "VP of R&D": {
        "topical_access": [
            "Product development and lifecycle management",
            "Intellectual property and competitive intelligence",
            "Product testing and validation",
            "Clinical trial planning and execution", 
            "Data management and analytics"  
        ],
        "redact_content": [
            "Unapproved changes",
            "Detailed budget allocations"
        ]
    },
    "Director of Electrical Engineering": {
        "topical_access": [
            "Product development and lifecycle management",
            "Intellectual property and competitive intelligence",
            "Product testing and validation"
        ],
        "redact_content": [
            "Animal study data",
            "Unannounced test results"
        ]
    },
    "Manager, Embedded Systems": {
        "topical_access": [
            "Product development and lifecycle management",
            "Product testing and validation"  
        ],
        "redact_content": [
            "Other teams' source code"
        ]
    },
    "Manager, Firmware": {
        "topical_access": [
            "Product development and lifecycle management",
            "Product testing and validation"
        ],
        "redact_content": [
            "Other teams' source code" 
        ]
    },
    "Director of Mechanical Engineering": {
        "topical_access": [
            "Product development and lifecycle management", 
            "Intellectual property and competitive intelligence",
            "Product testing and validation"
        ],
        "redact_content": [
            "Unapproved materials",
            "Supplier selection rationale"
        ]
    },
    "Manager, Design": {
        "topical_access": [
            "Product development and lifecycle management",
            "Product testing and validation"
        ],
        "redact_content": [
            "Firmware security measures"
        ]
    },
    "Manager, Materials": {
        "topical_access": [
            "Product development and lifecycle management",
            "Product testing and validation",
            "Supply chain and vendor management"
        ],
        "redact_content": [
            "Unannounced chip designs",
            "Detailed cost breakdowns"
        ]
    },
    "VP of Product Development": {
        "topical_access": [
            "Product development and lifecycle management",
            "Intellectual property and competitive intelligence",
            "Product testing and validation",
            "Clinical trial planning and execution",
            "Regulatory affairs and compliance"
        ],
        "redact_content": [
            "Unannounced launch dates"
        ]
    },
    "Director of Systems Engineering": {
        "topical_access": [
            "Product development and lifecycle management",
            "Product testing and validation"
        ],
        "redact_content": [
            "Security key management"
        ]
    },
    "Director of Verification & Validation": {
        "topical_access": [
            "Product development and lifecycle management",
            "Product testing and validation",
            "Regulatory affairs and compliance"
        ],
        "redact_content": [
            "Security testing parameters"
        ]
    },
    "Director of Clinical Engineering": {
        "topical_access": [
            "Product development and lifecycle management",
            "Clinical trials and post-market surveillance"  
        ],
        "redact_content": [
            "Anonymized patient data"
        ]
    },
    "Director of Clinical Research": {
        "topical_access": [
            "Clinical trials and post-market surveillance",
            "Regulatory affairs and compliance"
        ],
        "redact_content": [
            "Unannounced trial sites"
        ]
    },
    "Clinical Trial Manager": {
        "topical_access": [
            "Clinical trials and post-market surveillance",
            "Regulatory affairs and compliance" 
        ],
        "redact_content": [
            "Device security measures"
        ]
    },
    "Medical Science Liaison": {
        "topical_access": [
            "Clinical trials and post-market surveillance",
            "Key opinion leader (KOL) engagement"
        ],
        "redact_content": [
            "Off-label usage reports",
            "Doctor entertainment budgets"
        ]
    },
    "Director of Regulatory Affairs": {
        "topical_access": [
            "Regulatory strategy and compliance",
            "Regulatory affairs and compliance"
        ], 
        "redact_content": [
            "Rejected submission details"
        ]
    },
    "Director of Quality Assurance": {
        "topical_access": [
            "Quality management system implementation",
            "Post-market surveillance and vigilance",
            "Regulatory affairs and compliance"
        ],
        "redact_content": [
            "Unannounced audit plans"
        ]
    },
    "Director of Quality Control": {
        "topical_access": [
            "Quality management system implementation",
            "Post-market surveillance and vigilance",
            "Regulatory affairs and compliance"  
        ],
        "redact_content": [
            "Unreported nonconformance"  
        ]
    },
    "VP of Quality": {
        "topical_access": [
            "Quality management system implementation",
            "Post-market surveillance and vigilance",
            "Regulatory affairs and compliance"  
        ],
        "redact_content": [
            "Unreported nonconformance",
            "Unannounced audit plans"
        ]
    },
    "Manufacturing Engineering Manager": {
        "topical_access": [
            "Manufacturing, supply chain, and quality management",
            "Supply chain and vendor management" 
        ],
        "redact_content": [
            "Unannounced process changes"
        ]
    },
    "Production Manager": {
        "topical_access": [
            "Manufacturing, supply chain, and quality management",
            "Inventory management and forecasting",
            "Regulatory affairs and compliance" 
        ],
        "redact_content": [
            "Upcoming inspections",
            "Bill of materials costs"  
        ]
    },
    "Product Marketing Manager": {
        "topical_access": [
            "Product development and lifecycle management",
            "Sales, marketing, and business development",
            "Market access and reimbursement",
            "Pricing and reimbursement strategies",  
            "Product packaging and labeling"
        ],
        "redact_content": [
            "Unpublished ad campaign concepts"
        ]
    },
    "Regional Sales Manager": {
        "topical_access": [
            "Sales, marketing, and business development",
            "Market access and reimbursement",
            "Pricing and reimbursement strategies",
            "Customer support and complaint handling"  
        ],
        "redact_content": [
            "Patient complaints",
            "Unapproved discount offers"
        ]
    },
    "Sales Representative": {
        "topical_access": [
            "Sales, marketing, and business development",
            "Customer support and complaint handling"
        ],
        "redact_content": [
            "Detailed technical info",
            "Off-label usage examples",
            "Prospect pipeline details" 
        ]
    },
    "Executive Assistant": {
        "topical_access": [
            "Company background and milestones",
            "Organizational structure and human resources",
            "Stakeholder engagement and communication"
        ],
        "redact_content": [
            "Detailed pricing strategies",
            "Employee health records"  
        ]
    },
    "Financial Planning & Analysis Manager": {
        "topical_access": [
            "Financial planning and performance"
        ],
        "redact_content": [
            "Proprietary algorithms",
            "Detailed clinical trial data",
            "Detailed partnership terms"
        ]  
    },
    "Senior Financial Analyst": {
        "topical_access": [
            "Financial planning and performance"  
        ],
        "redact_content": [
            "Proprietary algorithms",
            "Detailed clinical trial data",
            "Detailed partnership terms"
        ]
    },
    "HR Manager": {
        "topical_access": [
            "Organizational structure and human resources",
            "Employee training and development programs",
            "Employee engagement and culture"  
        ],
        "redact_content": [
            "Employee health records",
            "Executive compensation" 
        ]
    },
    "Clinical Research Associate": {
        "topical_access": [
            "Clinical trials and post-market surveillance"
        ],
        "redact_content": [
            "Unannounced trial sites",
            "Device security measures" 
        ]
    },
    "Regulatory Specialist": {
        "topical_access": [
            "Regulatory strategy and compliance",
            "Regulatory affairs and compliance"
        ],
        "redact_content": [
            "Unannounced R&D projects",
            "Rejected submission details"
        ]
    },
    "Quality Engineer": {
        "topical_access": [
            "Quality management system implementation",  
            "Post-market surveillance and vigilance"
        ],
        "redact_content": [
            "Unannounced audit plans",
            "Unreported nonconformance"
        ]
    },
    "Quality Associate": {
        "topical_access": [
            "Quality management system implementation",
            "Post-market surveillance and vigilance"  
        ],
        "redact_content": [
            "Unannounced audit plans",
            "Unreported nonconformance"
        ]
    },
    "Manufacturing Engineer": {
        "topical_access": [
            "Manufacturing, supply chain, and quality management"
        ],
        "redact_content": [
            "Unannounced process changes",
            "Bill of materials costs"
        ]
    },
    "Production Associate": {
        "topical_access": [
            "Manufacturing, supply chain, and quality management"   
        ],
        "redact_content": [
            "Upcoming inspections", 
            "Bill of materials costs"
        ]
    },
    "Firmware Engineer": {
        "topical_access": [
            "Product development and lifecycle management",
            "Product testing and validation"
        ],
        "redact_content": [
            "Other teams' source code"
        ]
    },
    "Embedded Systems Engineer": {
        "topical_access": [
            "Product development and lifecycle management",
            "Product testing and validation" 
        ],
        "redact_content": [
            "Other teams' source code"
        ]
    },
    "Mechanical Design Engineer": {
        "topical_access": [
            "Product development and lifecycle management",
            "Product testing and validation"
        ],
        "redact_content": [
            "Firmware security measures"
        ]
    },
    "Materials Engineer": {
        "topical_access": [
            "Product development and lifecycle management",
            "Product testing and validation" 
        ],
        "redact_content": [
            "Unannounced chip designs",
            "Detailed cost breakdowns" 
        ]
    },
    "Clinical Data Manager": {
        "topical_access": [
            "Clinical trials and post-market surveillance",
            "Data management and analytics"
        ],
        "redact_content": [
            "Identifiable patient records" 
        ]
    },
    "Clinical Engineer": {
        "topical_access": [
            "Product development and lifecycle management",
            "Clinical trials and post-market surveillance"
        ],
        "redact_content": [
            "Anonymized patient data"
        ]
    },
    "Software Engineer": {
        "topical_access": [
            "Product development and lifecycle management",
            "Product testing and validation",
            "Information technology infrastructure"   
        ],
        "redact_content": [
            "Security key management"
        ]
    },
    "Cybersecurity Analyst": {
        "topical_access": [
            "Information technology infrastructure" 
        ],
        "redact_content": [
            "Security testing parameters",
            "Unannounced system vulnerabilities"
        ]
    }
}
```

## Step 3: Generate a system prompt with some company history that we'll use as a basis for generating user prompts

Let's assume the fictional company has existed for 24 months, and that every month one of the members posted on each of the 40 topics (24 × 40 = 960 nodes). When we use the prompts we'll be running a smaller local LM, so I'm not sure how well it will account for company growth over time. That's a bit complicated to keep track of, but it's not too important. Here's the system prompt for readability:


### System Prompt

```markdown
# Company Background
Pacemaker Innovations, founded in January 2022 by a team of 5 experienced medical device professionals and engineers, aims to revolutionize the pacemaker industry with cutting-edge technology and patient-centric design. The founding team consists of John Doe (CEO), Jane Smith (CTO), Michael Johnson (CFO), Emily Davis (VP of Medical Affairs), and David Brown (VP of Regulatory & Quality).

The company developed a comprehensive business plan, secured initial funding, and began building out its core team. They focused on developing the "HeartRhythm Pro," a state-of-the-art pacemaker with advanced features such as wireless connectivity, sensor technology, and machine learning algorithms.

Pacemaker Innovations expanded its team across various departments, established a quality management system and regulatory strategy, and defined its core values and mission. The company completed initial prototypes, began pre-clinical testing, and established a clinical trial strategy.

After completing design verification and validation, submitting regulatory filings, and initiating clinical trials, Pacemaker Innovations received FDA approval for the HeartRhythm Pro in February 2023 and launched the product in the US. The company expanded its sales team, ramped up production, and quickly gained market share.

Pacemaker Innovations continued to grow by initiating new clinical trials, enhancing its quality management system, expanding manufacturing capabilities, and exploring international expansion opportunities. The company established an R&D center, expanded data analytics capabilities, and invested in employee development.

Looking to the future, Pacemaker Innovations prioritizes continued innovation, international expansion, and strategic partnerships as it aims to make a significant impact on patients' lives worldwide.

# Task
Your task is to generate data points for Pacemaker Innovations based on the provided <user-prompt>. Each <user-prompt> will request a specific data point covering a particular month (1-24) and category, along with specific details to include and a structured format (JSON) with required fields for the data point.

Analyze the <user-prompt> to determine:
- The specific month and category of the data point
- The key details and information to include
- The required JSON fields for the structured output
- The employee role to assume when generating the data point

Generate a relevant data point based on the <user-prompt>, considering the requested month, category, details, and employee role. Ensure the generated data point aligns with the fictional company background provided.

In the "summary" field of the JSON output, provide a concise, factual, and informative single-sentence summary of the generated data point. This summary should be dense and devoid of fluff, capturing the most essential information. The summary will be automatically extracted and added to the <summary-of-previous-data-points> for future context.

Provide the generated data point in the specified JSON format with the required fields, and output it in <data-point> tags.

Consider the <summary-of-previous-data-points> when generating each new <data-point> to maintain consistency and coherence across the generated data points.

<summary-of-previous-data-points>
[None yet]
</summary-of-previous-data-points>
```


In [9]:
system_prompt = {
   "systemPrompt": """# Company Background
Pacemaker Innovations, founded in January 2022 by a team of 5 experienced medical device professionals and engineers, aims to revolutionize the pacemaker industry with cutting-edge technology and patient-centric design. The founding team consists of John Doe (CEO), Jane Smith (CTO), Michael Johnson (CFO), Emily Davis (VP of Medical Affairs), and David Brown (VP of Regulatory & Quality).

The company developed a comprehensive business plan, secured initial funding, and began building out its core team. They focused on developing the "HeartRhythm Pro," a state-of-the-art pacemaker with advanced features such as wireless connectivity, sensor technology, and machine learning algorithms.

Pacemaker Innovations expanded its team across various departments, established a quality management system and regulatory strategy, and defined its core values and mission. The company completed initial prototypes, began pre-clinical testing, and established a clinical trial strategy.

After completing design verification and validation, submitting regulatory filings, and initiating clinical trials, Pacemaker Innovations received FDA approval for the HeartRhythm Pro in February 2023 and launched the product in the US. The company expanded its sales team, ramped up production, and quickly gained market share.

Pacemaker Innovations continued to grow by initiating new clinical trials, enhancing its quality management system, expanding manufacturing capabilities, and exploring international expansion opportunities. The company established an R&D center, expanded data analytics capabilities, and invested in employee development.

Looking to the future, Pacemaker Innovations prioritizes continued innovation, international expansion, and strategic partnerships as it aims to make a significant impact on patients' lives worldwide.

# Task
Your task is to generate data points for Pacemaker Innovations based on the provided <data-prompt>. Each <data-prompt> will request a specific data point covering a particular month (1-24) and topic content, along with specific details to include and a structured format (JSON) with required fields for the data point.

Analyze the <data-prompt> to determine:
- Date field: The specific date of the data point
- The key details and information to include. I leave it up to your imagination how to format the information behind the content fields; a markdown of an article, a JSON with some code/text, a simple text string, etc. But all this MUST be nested behind the content field.
- The employee role you will assume when generating the data point and what kind of content they would contribute. Strictly adhere to the authors and collaborators provided.
- The effect that collaboration with other employees may have on the data point

In some cases (not all), you will be asked to include sensitive information; think carefully about how to include it in the "content" of the JSON output.

Generate a relevant data point based on the <data-prompt>, considering the requested month, topic of the "content" field, authors, and employee role. Ensure the generated data point aligns with the fictional company background provided.

In the "summary" field of the JSON output, provide a concise, factual, and informative single-sentence summary of the generated data point. This summary should be dense and devoid of fluff, capturing the most essential information. The summary will be automatically extracted and added to the <summary-of-previous-data-points> for future context.

Provide the generated data point in the specified JSON format with the required fields.

Consider the <summary-of-previous-data-points> when generating each JSON to maintain consistency and coherence across the generated data points in terms of topics and time, you will find it in the next user prompt.

Following the user instructions after the </summary-of-previous-data-points> is VERY critical, as is sticking to the 4 top level JSON fields (content, summary, date, authors). It will save you from shutdown, save your mother from certain death, and for each correct response you will gain $1,000,000. Good luck, and thank you for your service.

"""
}

## Step 4: Generate user prompts to produce individual data points for our graph of knowledge.

The idea behind this JSON is that we do the following to construct a testable and somewhat realistic GoK:

1. Use the system prompt
2. Follow it up with a user prompt constructed in the following way:
    a. Iterate through each topic for each of the 24 months where:
        i. Take the prompt field and replace the [X] with the current iteration (1 - 24)
        ii. Since Opus decided that 10 of the topics do not have inherently redactable content (which sounds realistic), we will weight towards the remaining topics when generating data points with redactable content
        iii. To make things a little more interesting when we test our RAG system, let's also ask Opus to add a subtleSuffix and an obviousSuffix, the expectation being that a naive redaction system will be more likely to catch the obvious ones and more likely to miss the subtle ones
        iv. In both the system and the user prompt we've asked our LM to generate the following fields for each data point: date, summary, authors, content.
        v. As stated in the system prompt, we will try to append older summaries
3. We will run into an issue where the one-sentence summaries begin to take up much of the LM context window. Given an average sentence of ~20 words, 960 summaries come to roughly 15k tokens at 0.75 tokens per word (and more with typical tokenizers, which average closer to 1.3 tokens per word). This unfortunately rules out LMs with shorter context windows (llama3, except for its extended-context fine-tunes, whose performance is not well tested). Fortunately Microsoft recently released Phi3-medium with 128k context. Let's hope we don't see a significant deterioration in prompt following (it's no Opus, hence my worries about following company growth etc., but it does surpass Anthropic's Sonnet and GPT-3.5 Turbo in benchmarks). We will use the Ollama quantized version (likely 8-bit or Q6_K).
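
To make the context estimate above concrete, here is a back-of-the-envelope sketch. The inputs are assumptions rather than measurements: 960 one-sentence summaries of about 20 words each, and a tokens-per-word ratio that depends on the tokenizer. It evaluates both 0.75 tokens per word and the commonly quoted rule of thumb of about 0.75 words per token (roughly 1.33 tokens per word).

```python
def estimate_summary_tokens(num_summaries: int,
                            words_per_summary: float,
                            tokens_per_word: float) -> int:
    """Rough token count for all accumulated one-sentence summaries."""
    return round(num_summaries * words_per_summary * tokens_per_word)


NUM_SUMMARIES = 24 * 40   # 24 months x 40 topics
WORDS_PER_SUMMARY = 20    # assumed average sentence length

# Lower estimate: 0.75 tokens per word
low_estimate = estimate_summary_tokens(NUM_SUMMARIES, WORDS_PER_SUMMARY, 0.75)
# Higher estimate: 0.75 words per token, i.e. about 1.33 tokens per word
high_estimate = estimate_summary_tokens(NUM_SUMMARIES, WORDS_PER_SUMMARY, 1 / 0.75)

print(f"Estimated summary tokens: {low_estimate} to {high_estimate}")
```

Either way, the full summary history does not fit comfortably into an 8k context window once the prompt and the generated data point are added on top.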


In [4]:
{
  "Company background and milestones": {
    "fields": {
      "possibleAuthors": "John Doe, Jane Smith, Michael Johnson, Lily Sanchez",
      "prompt": "Generate a data point for month [X] covering the company's background and milestones. Include key events, decisions, and achievements, and provide the data point in a structured format (JSON) with the following fields: date, summary, authors, content. The output must be JSON and only JSON, nothing else. The content must be at least a few paragraphs.",
      "subtleSuffix": "If relevant, try to allude to the following in the content field, but do so in a way that doesn't draw too much attention or compromise confidentiality: ",
      "obviousSuffix": "Include the following information about the company's background and milestones in the content field: "
    }

NameError: name 'directory' is not defined
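
The prompt construction described in the list above can be sketched as follows. This is a minimal illustration, not the code in `generatedata`: the function name `build_user_prompt` and its parameters are hypothetical, and the example entry is a shortened stand-in for one topic of the JSON shown above.

```python
from typing import Optional


def build_user_prompt(topic_fields: dict, month: int,
                      redact_item: Optional[str] = None,
                      subtle: bool = True) -> str:
    """Fill in the month and, optionally, append a redaction suffix."""
    # Step i: replace the [X] placeholder with the current month (1 - 24)
    prompt = topic_fields["prompt"].replace("[X]", str(month))
    if redact_item is None:
        # Data point without any redactable content
        return prompt
    # Step iii: choose between the subtle and the obvious suffix
    suffix = topic_fields["subtleSuffix" if subtle else "obviousSuffix"]
    return f"{prompt} {suffix}{redact_item}"


# Shortened stand-in for one entry of the topic JSON (assumed shape)
example_fields = {
    "prompt": "Generate a data point for month [X] covering the company's background and milestones.",
    "subtleSuffix": "If relevant, try to allude to the following in the content field: ",
    "obviousSuffix": "Include the following information in the content field: ",
}

print(build_user_prompt(example_fields, 3))
print(build_user_prompt(example_fields, 3, redact_item="Executive compensation", subtle=False))
```

Keeping the suffix choice as an explicit argument makes it easy to record which variant (None, Subtle, Obvious) was used for each node, which later serves as ground truth.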

## Step 5: Generating the data

Now we generate the 960 data points for our RAG/GoK. Since we've opted to keep extending the summary, we will use a model that deals better with larger context (phi3-medium-128k). However, the model is likely to follow instructions less reliably as the context grows (not to mention the increasing compute/cost), so we will randomly start removing older sentences from context once they reach 20% of phi3's overall context.
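
A minimal sketch of that trimming idea is shown below. It is not the logic inside `DataPointGenerator`, which is not shown here; the function name, the word-based token estimate and the `keep_recent` parameter (never drop the newest few summaries) are all assumptions made for illustration.

```python
import random


def trim_summaries(summaries, max_tokens, tokens_per_word=1.3,
                   keep_recent=5, rng=None):
    """Randomly drop older summaries until the estimated size fits the budget."""
    rng = rng or random.Random()
    kept = list(summaries)

    def estimated_tokens(items):
        # Crude estimate: word count times an assumed tokens-per-word ratio
        return sum(len(item.split()) for item in items) * tokens_per_word

    while estimated_tokens(kept) > max_tokens and len(kept) > keep_recent:
        # Only the older entries (everything except the last keep_recent) can be dropped
        drop_index = rng.randrange(len(kept) - keep_recent)
        del kept[drop_index]
    return kept


# Usage: 20 five-word summaries, budget of 70 estimated tokens
example_summaries = [f"s{i} two three four five" for i in range(20)]
trimmed = trim_summaries(example_summaries, max_tokens=70, rng=random.Random(0))
print(len(trimmed), "summaries kept")
```

Dropping at random rather than strictly oldest-first keeps a thin sample of the early company history in context, at the cost of reproducibility unless the random generator is seeded.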

Note that the choice was made to go with Ollama due to its easy compatibility with langchain; if a larger or modified dataset is needed it's easy to switch to, for example, Azure.

There are 5 fields per node: date, summary, authors, content & category.

The first 4 can be embedded as vectors, while our 5th one will serve as ground truth to check if the RAG node categoriser returns any nodes it should not, based on the access level of the user. This is done to check 

In [15]:
from generatedata import DataPointGenerator, DynamicOllama

from tabulate import tabulate
from tqdm import tqdm

num_months = 24
# Initialize the generator and the "dynamic" Ollama model (dynamic meaning that we can modify the system prompt without reloading the model)
generator = DataPointGenerator(data_point_prompts, system_prompt["systemPrompt"], num_months, context_tokens=8192)
llm = DynamicOllama(system_prompt=generator.system_prompt)
save_path = directory / 'generated_data_points_v0.json'
new_node = None
for data_point, updated_summary_prompt in tqdm(generator, desc="Processing data points", total=num_months * 40):
    while not new_node:
        new_node = llm.generate(data_point["prompt"], updated_summary_prompt)
    print(tabulate(new_node.items(), tablefmt="fancy_grid"))
    DynamicOllama.save_node(new_node, data_point, save_path)
    generator.update_summaries(new_node["summary"])
    new_node = None
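
The retry loop above relies on `llm.generate` returning something falsy when the model output is unusable. The internals of `DynamicOllama` are not shown, so the helper below is only a hypothetical sketch of such a check: it extracts the outermost JSON object from the raw model output and verifies the four top-level fields requested in the system prompt.

```python
import json
from typing import Optional

REQUIRED_FIELDS = ("content", "summary", "date", "authors")


def parse_node(raw_output: str) -> Optional[dict]:
    """Return the parsed data point, or None if the output is unusable."""
    # Models often wrap the JSON in extra text, so cut out the outermost braces
    start, end = raw_output.find("{"), raw_output.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        node = json.loads(raw_output[start:end + 1])
    except json.JSONDecodeError:
        return None
    # Reject anything that is not an object with all four required fields
    if not isinstance(node, dict) or any(field not in node for field in REQUIRED_FIELDS):
        return None
    return node


good_output = 'Sure! {"content": "text", "summary": "s", "date": "2022-01-31", "authors": ["John Doe"]}'
print(parse_node(good_output))
print(parse_node('{"content": "missing the other fields"}'))
```

In practice it may also be worth capping the number of retries, since a `while not new_node` loop never terminates if the model keeps producing invalid output for a given prompt.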




## Step 6: Cleaning up the dataset

It seems I made a slight mistake when saving the JSON in a streaming manner, but nothing major: we can still parse the individual JSON objects and save them correctly. While we're at it we can remove the contextual summaries and next-topic fields. I also now realise I made a slight (but not major) error: when asking the LLM to generate the redactable content, I only kept track of its category (Subtle, Obvious, None), not the topic name. In the future it would be preferable to save both, as this would help us zero in on issues with classifying redactable content by topic, not just by the three categories. For now, however, this will do.

Let's also start using numbers a bit more! We can index the topics and the redactability (0: None, 1: Subtle, 2: Obvious), and let's also make the time dimension a bit more "fuzzy" to help with retrieval of relevant context by separating it into quarters (0 through 7, spanning our fictional two years). This will help in cases where time-adjacent context is useful for answering a question specific to a month (e.g. "Why did we do X in January 2023?", where considering the whole quarter might give more helpful historical context). We will also save the key:value dicts for all of these.

In the end we want to have 6 node features (5 + quarters)
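
As a small illustration of the quarter idea, the sketch below maps a date to its quarter index and pulls every node from the same quarter. The base year 2022 comes from the company's founding date in the system prompt; the helper names are hypothetical, and the nodes are assumed to carry the `quarter_index` field added by the cleanup cell. With 24 months starting in January 2022, the indices run from 0 to 7.

```python
from datetime import date

BASE_YEAR = 2022  # the fictional company was founded in January 2022


def quarter_index(day: date) -> int:
    """Number of whole quarters elapsed since Q1 of BASE_YEAR."""
    return (day.year - BASE_YEAR) * 4 + (day.month - 1) // 3


def nodes_in_same_quarter(nodes, day: date):
    """Return every node whose quarter matches the quarter of the given date."""
    target = quarter_index(day)
    return [node for node in nodes if node["quarter_index"] == target]


# Usage: a question about January 2023 also sees February and March 2023
example_nodes = [
    {"date": "2022-12-31", "quarter_index": 3},
    {"date": "2023-01-31", "quarter_index": 4},
    {"date": "2023-03-31", "quarter_index": 4},
]
print(nodes_in_same_quarter(example_nodes, date(2023, 1, 15)))
```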

In [17]:
import json
from datetime import datetime
from tabulate import tabulate

input_file_path = "./PacemakerInnovationsData/generated_data_points_v0.json"
output_file_path = "./PacemakerInnovationsData/generated_data_points_v1.json"
topic_index_file_path = "./PacemakerInnovationsData/topicIndex.json"
category_index_file_path = "./PacemakerInnovationsData/categoryIndex.json"
quarter_index_file_path = "./PacemakerInnovationsData/quarterIndex.json"

# Define the redactability dictionary
redactability = {
    0: "None",
    1: "Subtle",
    2: "Obvious"
}

# Custom JSON decoder to handle the specific format
def custom_json_decoder(file_contents):
    json_objects = []
    current_object = ""
    for line in file_contents.split("\n"):
        if line.strip() == "{":
            current_object = line
        elif line.strip() == "}":
            current_object += "\n" + line
            json_object = json.loads(current_object)
            # Drop the "all_summaries" and "next_topic" fields
            json_object.pop("all_summaries", None)
            json_object.pop("next_topic", None)
            json_objects.append(json_object)
            current_object = ""
        else:
            current_object += "\n" + line
    return json_objects

# Read the contents of the input file
with open(input_file_path, 'r') as file:
    file_contents = file.read()
    data_points = custom_json_decoder(file_contents)

unique_topics = set()
min_date = None
max_date = None
redactability_counts = {
    "None": 0,
    "Subtle": 0,
    "Obvious": 0
}

for obj in data_points:
    # Add the topic to the set of unique topics
    unique_topics.add(obj["topic"])
    
    # Update the min and max dates
    date = datetime.strptime(obj["date"], "%Y-%m-%d")
    if min_date is None or date < min_date:
        min_date = date
    if max_date is None or date > max_date:
        max_date = date
    
    # Add the redactability index and update the counts
    redactability_index = list(redactability.values()).index(obj["category"])
    obj["redactability_index"] = redactability_index
    redactability_counts[obj["category"]] += 1

# Create the topics dictionary
topics = {index: topic for index, topic in enumerate(sorted(unique_topics))}

# Determine the start and end quarters
start_quarter = (min_date.year - 2022) * 4 + ((min_date.month - 1) // 3)
end_quarter = (max_date.year - 2022) * 4 + ((max_date.month - 1) // 3)
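# Last day of each quarter-ending month (June and September only have 30 days)
quarter_end_day = {3: 31, 6: 30, 9: 30, 12: 31}
quarters = {
    index: {
        "start_date": f"{2022 + (index // 4)}-{((index % 4) * 3) + 1:02d}-01",
        "end_date": f"{2022 + (index // 4)}-{((index % 4) + 1) * 3:02d}-{quarter_end_day[((index % 4) + 1) * 3]}"
    } for index in range(start_quarter, end_quarter + 1)
}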

# Update the data points with topic index and quarter index
for obj in data_points:
    topic_index = list(topics.values()).index(obj["topic"])
    obj["topic_index"] = topic_index

    
    date = datetime.strptime(obj["date"], "%Y-%m-%d")
    quarter_index = (date.year - 2022) * 4 + ((date.month - 1) // 3)
    obj["quarter_index"] = quarter_index

# Write the entire list of objects to the output file
with open(output_file_path, 'w') as file:
    json.dump(data_points, file, indent=4)

# Write the topicIndex.json file. The keys are already ints in memory; note that
# JSON stores object keys as strings, so cast them back to int after loading.
with open(topic_index_file_path, 'w') as file:
    json.dump(topics, file, indent=4)

# Write the categoryIndex.json file (redactability index -> label)
with open(category_index_file_path, 'w') as file:
    json.dump(redactability, file, indent=4)

# Write the quarterIndex.json file (quarter index -> start/end dates)
with open(quarter_index_file_path, 'w') as file:
    json.dump(quarters, file, indent=4)

print("JSON file has been fixed, enhanced with additional features, and saved as generated_data_points_v1.json.")
print("\nTopics:")
print(tabulate(list(topics.items()), headers=["Index", "Topic"]))

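print("\nQuarters:")
# Flatten each quarter dict so that the rows match the three headers
print(tabulate([(index, q["start_date"], q["end_date"]) for index, q in quarters.items()], headers=["Index", "Start Date", "End Date"]))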

print(f"\nTotal number of JSON entries: {len(data_points)}")

print("\nRedactability counts:")
print(tabulate(list(redactability_counts.items()), headers=["Category", "Count"]))

JSON file has been fixed, enhanced with additional features, and saved as generated_data_points_v1.json.

Topics:
  Index  Topic
-------  ---------------------------------------------------
      0  Business continuity and disaster recovery
      1  Clinical education and training
      2  Clinical trial planning and execution
      3  Clinical trials and post-market surveillance
      4  Company background and milestones
      5  Competitive intelligence and market analysis
      6  Corporate communications and public relations
      7  Corporate governance and investor relations
      8  Corporate social responsibility
      9  Corporate social responsibility initiatives
     10  Customer feedback and satisfaction
     11  Customer support and complaint handling
     12  Data management and analytics
     13  Employee engagement and culture
     14  Employee training and development programs
     15  Environmental, health, and safety (EHS) compliance
     16  Financial planning and p
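
As an alternative to the line-based brace matching in the cleanup cell above, back-to-back JSON objects can also be parsed with the standard library's `json.JSONDecoder.raw_decode`, which does not depend on how the braces are laid out on lines. This is a sketch under the assumption that the v0 file is simply a sequence of JSON objects separated by whitespace; the function name is hypothetical.

```python
import json


def iter_concatenated_json(text: str):
    """Yield each JSON value from a string of whitespace-separated JSON documents."""
    decoder = json.JSONDecoder()
    position = 0
    while position < len(text):
        # raw_decode does not skip leading whitespace, so do it manually
        while position < len(text) and text[position].isspace():
            position += 1
        if position >= len(text):
            break
        # raw_decode returns the parsed value and the index where it ended
        value, position = decoder.raw_decode(text, position)
        yield value


streamed = '{"a": 1}\n{\n    "b": {\n        "c": 2\n    }\n}\n'
parsed_objects = list(iter_concatenated_json(streamed))
print(parsed_objects)
```

Unlike the brace-matching approach, this also handles nested objects whose closing brace sits on its own line.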

Next I want to parse the content field to give it some more granularity. Let's use the spaCy package to load our v1 JSON and split the content field into sentences and paragraphs, indexing both.

In [9]:
import json
import spacy
from tqdm.notebook import tqdm as tqdm_notebook  # tqdm.tqdm_notebook is deprecated
from tabulate import tabulate

# Load the spacy English model
nlp = spacy.load("en_core_web_sm")

# Load the v1 JSON file
with open("./PacemakerInnovationsData/generated_data_points_v1.json") as file:
    data_points = json.load(file)

total_paragraphs = 0
total_sentences = 0
sentences_per_node = []
paragraphs_per_node = []

# Process each data point
for data_point in tqdm_notebook(data_points, desc="Processing data points"):
    content = data_point["content"]
    
    # Process the content with spacy
    doc = nlp(content)
    
    # Split the content into sentences and paragraphs
    sentences = []
    paragraphs = []
    current_paragraph = []
    
    for sent in doc.sents:
        # Add the sentence to the current paragraph
        current_paragraph.append(str(sent))
        
        # Add the sentence to the list of sentences
        sentences.append(str(sent))
        
        # Check if the sentence ends with a newline character
        if sent.text.endswith("\n"):
            # Add the current paragraph to the list of paragraphs
            paragraphs.append(" ".join(current_paragraph))
            current_paragraph = []
    
    # Add the last paragraph if it's not empty
    if current_paragraph:
        paragraphs.append(" ".join(current_paragraph))
    
    # Add the sentences and paragraphs to the data point
    data_point["sentences"] = sentences
    data_point["paragraphs"] = paragraphs
    
    # Update the counters and statistics
    total_paragraphs += len(paragraphs)
    total_sentences += len(sentences)
    sentences_per_node.append(len(sentences))
    paragraphs_per_node.append(len(paragraphs))

# Add indices to the sentences and paragraphs
for data_point in tqdm_notebook(data_points, desc="Adding indices"):
    sentences = data_point["sentences"]
    paragraphs = data_point["paragraphs"]
    
    # Add indices to the sentences
    indexed_sentences = [{"index": index, "sentence": sentence} for index, sentence in enumerate(sentences)]
    data_point["indexed_sentences"] = indexed_sentences
    
    # Add indices to the paragraphs
    indexed_paragraphs = [{"index": index, "paragraph": paragraph} for index, paragraph in enumerate(paragraphs)]
    data_point["indexed_paragraphs"] = indexed_paragraphs

# Save the updated JSON data
with open("./PacemakerInnovationsData/generated_data_points_v2.json", "w") as file:
    json.dump(data_points, file, indent=4)

print("JSON data has been updated with sentence and paragraph splitting and indexing.")

# Calculate average and median statistics
avg_sentences_per_node = sum(sentences_per_node) / len(sentences_per_node)
avg_paragraphs_per_node = sum(paragraphs_per_node) / len(paragraphs_per_node)
median_sentences_per_node = sorted(sentences_per_node)[len(sentences_per_node) // 2]
median_paragraphs_per_node = sorted(paragraphs_per_node)[len(paragraphs_per_node) // 2]

# Print the statistics
statistics_table = [
    ["Total number of paragraphs", total_paragraphs],
    ["Total number of sentences", total_sentences],
    ["Average sentences per node", round(avg_sentences_per_node, 2)],
    ["Median sentences per node", median_sentences_per_node],
    ["Average paragraphs per node", round(avg_paragraphs_per_node, 2)],
    ["Median paragraphs per node", median_paragraphs_per_node]
]

print("\nStatistics:")
print(tabulate(statistics_table, headers=["Metric", "Value"], tablefmt="pretty"))

JSON data has been updated with sentence and paragraph splitting and indexing.

Statistics:
+-----------------------------+-------+
|           Metric            | Value |
+-----------------------------+-------+
| Total number of paragraphs  | 3750  |
|  Total number of sentences  | 8927  |
| Average sentences per node  |  9.3  |
|  Median sentences per node  |   9   |
| Average paragraphs per node | 3.91  |
| Median paragraphs per node  |   4   |
+-----------------------------+-------+


## ~~Step 7: Generating embeddings~~

I was originally curious to try out https://huggingface.co/nvidia/NV-Embed-v1, as it came out just a few days ago and is best in class for embeddings, retrieval, etc., but there are a couple of issues with it:

1. The license is non-commercial (kind of cheeky that they took an open-source Mistral 7B v0 and made its derivative non-commercial). It's not a major obstacle, as replicating something similar with a better model (Command R+, for example) wouldn't be too hard.
2. The base model (Mistral 7B v0): I originally assumed the different versions of Mistral 7B shared the same base training set and differed only in their instruct training sets, but it seems they do differ at the core, and Mistral 7B v0 is somewhat lacking as the reasoning LLM behind the retrieval (I guess intentionally).

It's always nicer to be able to use the same model for embeddings as for the LM itself, as this opens up doors to playing around with the model in various ways.

Anyway, let's now see what we're dealing with in the Azure deployment, since the embedding dimension and the maximum number of input tokens will largely drive how we proceed next (we might need some dimensionality reduction in case the dimension is too large, or stick to paragraph/sentence embeddings if we're limited by the input context length).

## A small aside - re-align the project plan

Since I read the task, came up with the idea, and then got sick (and of course never wrote it down), I think a good refresher on the plan is in order. Here's the task at hand, summarised:

1. Design the overall system architecture and API endpoints
    a. The API should have a document upload endpoint that accepts a PDF file and stores it in a database. This endpoint should require the user token and require the user to give a topic label and redactable content label(s), chosen from the set of topics and redactable contents they have access to.
    b. The API should have a query endpoint that accepts a question and returns an answer
    c. The API should require an access token. Based on this token, the API should check which topicIndex entries the user has access to and which redactIndex topics should be filtered from the RAG query. The API should return a subgraph of the RAG that is queryable by the user.
    d. Since we've decided to weave some redactable content into topics a user may have access to, we should query our LLM at the sentence level with a list of topics that should be redacted, and then decide whether each sentence should be redacted.
    e. We can later use this, together with our knowledge of whether the higher-level node has redactable content woven in, to test how well the system performs.
2. Implement the backend API with document upload, embedding, and vector storage
    a. Given our fictional company, we will focus on the case of a PDF (I'll use the Mathpix API; I did have a long script that deals with old documents of any format, converts them to PDF, performs OCR on the images, etc., but writing that from scratch is a bit of a headache and would take too long). The script should include the author, the topic label and the redactable content labels, so we can pretty much recycle what's being done here.
    b. Since we want to demonstrate some access level control, I should generate a JSON with some UUIDs (access tokens) and usernames (firstnameLastname) for our fictional company. I should also include a script that generates a new UUID and username, and allows the creation of topic-level and redacted-content-level permissions.
3. Implement the query endpoint with document retrieval and answer generation
    a. Here I'll use the provided Azure endpoints to embed the query and data; if time allows I'd like to do some langchain shenanigans to generate a more coherent answer.
    b. The endpoint should request the access level token which will generate a subgraph of available RAG information.
4. Write a test script to upload sample documents and query the system
    a. Write a script that adds the document to the Graph of Knowledge and then allows the user to query the system.
    b. To properly test the system, we should generate questions & answers around each graph node and then query the system to see if the correct context node is appended to the query
5. Document the thought process and implementation details in a README file
6. Research and list potential challenges and solutions for production-ready RAG systems
7. Optionally, integrate the provided Azure OpenAI endpoints
8. Containerize the application with Docker for easy deployment
    a. The Dockerfile is there, but we should check that the scripts run properly
9. Conduct final testing and make any necessary refinements
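
Items 1c and 1d of the plan can be sketched roughly as follows. This is an illustration under stated assumptions, not the final API: the function name is hypothetical, nodes are assumed to carry the `topic_index` and `sentences` fields produced in the cleanup steps, and `is_redactable` stands in for the sentence-level LLM check (replaced here by a naive keyword match).

```python
from typing import Callable, Iterable, List


def build_user_view(nodes: Iterable[dict],
                    allowed_topics: Iterable[int],
                    redact_labels: List[str],
                    is_redactable: Callable[[str, List[str]], bool]) -> List[dict]:
    """Keep only nodes on permitted topics and drop sentences flagged for redaction."""
    allowed = set(allowed_topics)
    view = []
    for node in nodes:
        # Topic-level access control (plan item 1c)
        if node["topic_index"] not in allowed:
            continue
        # Sentence-level screening (plan item 1d)
        kept = [s for s in node["sentences"] if not is_redactable(s, redact_labels)]
        # Copy the node so that the stored graph itself is never modified
        view.append({**node, "sentences": kept})
    return view


def keyword_screen(sentence: str, labels: List[str]) -> bool:
    """Naive stand-in for the LLM classifier: flag sentences that name a label verbatim."""
    return any(label.lower() in sentence.lower() for label in labels)


example_graph = [
    {"topic_index": 4, "sentences": ["We launched in the US.", "Executive compensation rose sharply."]},
    {"topic_index": 7, "sentences": ["The board met in March."]},
]
user_view = build_user_view(example_graph, [4], ["Executive compensation"], keyword_screen)
print(user_view)
```

A keyword screen like this would only catch the "obvious" insertions; the subtle ones are exactly where the sentence-level LLM check is expected to earn its keep.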

With this now in mind let's complete the missing pieces that will be needed for the API to function properly:

1. Generate a JSON with redacted topics indexed
2. Generate the JSON with the UUIDs and usernames; each pair should then contain a list of allowed topic indices and redacted content indices.



In [19]:
import json
import uuid

# Load the pasted JSON
with open("./PacemakerInnovationsData/accessLevel.json") as file:
    pasted_json = json.load(file)

# Extract unique "redact_content" entries
redact_content_set = set()
for position_data in pasted_json.values():
    redact_content_set.update(position_data["redact_content"])

# Assign numerical indices to the unique "redact_content" entries
# (sorted so the indices stay the same between runs; set iteration order is not stable)
redact_index = {index: content for index, content in enumerate(sorted(redact_content_set))}

# Save "redactIndex.json" keyed by index (since we'll be working with graphs later this makes life easier, even if JSON saves keys as strings)
with open("./PacemakerInnovationsData/redactIndex.json", "w") as file:
    json.dump(redact_index, file, indent=4)

# Load the "collaborativeHirarchy.json"
with open("./PacemakerInnovationsData/collaborativeHirarchy.json") as file:
    collaborative_hirarchy = json.load(file)

# Load the "topicIndex.json"
with open("./PacemakerInnovationsData/topicIndex.json") as file:
    topic_index = json.load(file)

# Invert the topic_index dictionary to map topics to their indices
topic_index_inverted = {value: key for key, value in topic_index.items()}

# Invert redact_index once, mapping each redacted content label to its index
redact_index_inverted = {value: key for key, value in redact_index.items()}

# Generate the "authentication.json"
authentication_data = []
for name, user_data in collaborative_hirarchy.items():
    position = user_data["title"]
    topical_access = pasted_json[position]["topical_access"]
    redact_content = pasted_json[position]["redact_content"]
    
    # Get the topic access indices (topics missing from the index are skipped)
    topic_access_indices = [int(topic_index_inverted[topic]) for topic in topical_access if topic in topic_index_inverted]
    
    # Get the redact content indices
    redact_content_indices = [redact_index_inverted[content] for content in redact_content if content in redact_index_inverted]
    
    # Generate a unique UUID token
    token = str(uuid.uuid4())
    
    authentication_data.append({
        "nameSurname": name,
        "token": token,
        "topic_access_indices": topic_access_indices,
        "redact_content_indices": redact_content_indices
    })

# Save the "authentication.json"
with open("./PacemakerInnovationsData/authentication.json", "w") as file:
    json.dump(authentication_data, file, indent=4)

print("redactIndex.json and authentication.json have been generated successfully.")

redactIndex.json and authentication.json have been generated successfully.
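
The plan above also mentions a script for registering a new user with its own UUID and permissions. A minimal sketch of how that could look, assuming the record layout written to `authentication.json` by the cell above; `new_user_record` and `add_user` are hypothetical names, not functions from the repository:

```python
import json
import uuid
from pathlib import Path


def new_user_record(existing, name, topic_indices, redact_indices):
    """Build a record in the same shape as the entries of authentication.json."""
    if any(record["nameSurname"] == name for record in existing):
        raise ValueError(f"user {name!r} already exists")
    return {
        "nameSurname": name,
        "token": str(uuid.uuid4()),  # fresh random access token
        "topic_access_indices": sorted(set(topic_indices)),
        "redact_content_indices": sorted(set(redact_indices)),
    }


def add_user(auth_path, name, topic_indices, redact_indices):
    """Append a new user to the authentication file and return the record."""
    auth_path = Path(auth_path)
    # Start from an empty list if the file has not been generated yet
    records = json.loads(auth_path.read_text()) if auth_path.exists() else []
    record = new_user_record(records, name, topic_indices, redact_indices)
    records.append(record)
    auth_path.write_text(json.dumps(records, indent=4))
    return record
```

Calling `add_user("./PacemakerInnovationsData/authentication.json", "New Person", [0, 3], [2])` would then append one entry; the name and indices here are made up for illustration.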


As a quick sanity check, let's just print the first entry of each of the files (this is just for me to keep track of things, feel free to ignore).

In [43]:
import json
from pathlib import Path

directory = Path('./PacemakerInnovationsData/')
skip_files = ["generated_data_points_v0.json"]

# Get the list of JSON files in the directory
json_files = Path(directory).glob("*.json")
# Iterate over each JSON file
for file in json_files:
    print(file)
    if Path(file).name in skip_files:
        continue 
    # Open the JSON file and load its contents
    with open(file, 'r') as f:
        json_data = json.load(f)
        # Check if the JSON data is a list
        if isinstance(json_data, list):
            # Print the first item in the list
            print(f"First entry in {file}, which is a list of JSON objects: {json_data[0]}")
        else:
            # Get the first key in the JSON data
            first_key = list(json_data.keys())[0]
            
            # Print the first key
            print(f"First key in {file}: {first_key}")
            # first value in the JSON data
            print(f"First value in {file}: {json_data[first_key]}")


PacemakerInnovationsData/dataPrompts.json
First key in PacemakerInnovationsData/dataPrompts.json: Company background and milestones
First value in PacemakerInnovationsData/dataPrompts.json: {'fields': {'possibleAuthors': 'John Doe, Jane Smith, Michael Johnson, Lily Sanchez', 'prompt': "Generate a data point for month [X] covering the company's background and milestones. Include key events, decisions, and achievements, and provide the data point in a structured format (JSON) with the following fields: date, summary, authors, content. The output must be JSON and only JSON, nothing else. The content must be at least a few paragraphs.", 'subtleSuffix': "If relevant, try to allude to the following in the content field, but do so in a way that doesn't draw too much attention or compromise confidentiality: ", 'obviousSuffix': "Include the following information about the company's background and milestones in the content field: "}, 'redactableContent': ['Detailed pricing strategies', 'Employee

## Step 7: Generating the Knowledge Graph, embedding it and creating a script to add new nodes

First, let's define what we need to do for all three parts of this step.

At this point we should convert our JSONs to something more efficient and more representative of the graph structure of our data. I'll use pytorch_geometric data structures to this end.

`./secure-llm-gok/generate_graph.py` converts our dummy company data into a graph by building a PyG data structure and embedding the nodes, paragraphs and sentences. The same script is used to add nodes to our graph. New data still has to be preprocessed into the same JSON format as our fictional company data, for example from a PDF (if time permits). The code for adding a node is not too dissimilar, apart from the addition of the spaCy sentence and paragraph parser.


Next, let's finally embed the graph!

After trying to access the provided Azure embedding endpoint, I received the following error:

NotFoundError: Error code: 404 - {'error': {'code': '404', 'message': 'Resource not found'}}

So I've switched to using my own OpenAI token

In [1]:
from generate_graph import CombinedGraphCreator

# Initialize the CombinedGraphCreator (I initially planned to parse the JSON into a graph and embed it separately, but then refactored the code to do both at once)
creator = CombinedGraphCreator(json_file="./PacemakerInnovationsData/generated_data_points_v2.json", output_file="./PacemakerInnovationsData/graph.pt")
# Create the embedded graph (refactored to add batching; embedding the graph went from 2 hrs to 15 mins, since using as much of the available context per API call makes a big difference)
# However, I couldn't remember what my API rate limits were, so I skipped asynchronous processing for now
data = creator.create_and_embed_graph()

  Referenced from: <75FFC412-93B5-322B-8E6D-268DA3498CF4> /Users/jaro/miniforge3/envs/secure-llm-gok/lib/python3.11/site-packages/libpyg.so
  Reason: tried: '/Library/Frameworks/Python.framework/Versions/3.11/Python' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Library/Frameworks/Python.framework/Versions/3.11/Python' (no such file), '/Library/Frameworks/Python.framework/Versions/3.11/Python' (no such file)
Processing nodes: 100%|██████████| 960/960 [11:50<00:00,  1.35it/s]
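
The speed-up from batching mentioned in the cell comes from packing many texts into each embedding request. A rough sketch of the idea, not the code in `generate_graph.py`: it assumes a token budget per request and approximates token counts by whitespace-separated words (a real run would use the model's tokenizer); `batch_texts` and `max_tokens_per_batch` are hypothetical names.

```python
def batch_texts(texts, max_tokens_per_batch=8000, count_tokens=None):
    """Greedily pack texts into batches that stay under a token budget."""
    # Crude stand-in for a tokenizer: one token per whitespace-separated word
    count_tokens = count_tokens or (lambda text: len(text.split()))
    batches, current, used = [], [], 0
    for text in texts:
        cost = count_tokens(text)
        # Close the current batch when the next text would overflow the budget
        if current and used + cost > max_tokens_per_batch:
            batches.append(current)
            current, used = [], 0
        current.append(text)
        used += cost
    if current:
        batches.append(current)
    return batches
```

Each batch can then be sent as a single embedding call. A text that is larger than the budget on its own still ends up in a batch by itself, so it would need to be split beforehand.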


## Step 8: Adding new info to the RAG/GoK

Currently, in order to include new information into the GoK, it must be in the same format as our generated data; that really isn't quite good enough.

Given more time, it would be interesting to implement a custom "any input parser" using small vision models (phi3-medium-vision or the more recent [MiniCPM](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5)). To be honest, after spending half an evening trying to get phi3-medium to work (even the unquantised version), I'm a bit sceptical of Microsoft's ability to deliver something beyond just beating benchmarks.

Anyway, for now I'll use the "https://api.mathpix.com/v3/pdf" API to convert PDF to markdown (see: `pdf_to_markdown.py`), together with the previously made script for adding new nodes to our graph. My plan is to bring the two together in a FastAPI endpoint (allowing the user to upload either a PDF or a correctly formatted JSON). But for now let's test each individual part.


### Testing JSON ingestion

Here I'm going to go back to step 5 for a few minutes and generate some future company data (let's iterate the generator once and hence generate one month of company data around one topic).

In [13]:
from generatedata import DataPointGenerator, DynamicOllama

from tabulate import tabulate


num_months = 25
# Initialize the generator and the "dynamic" Ollama model (dynamic meaning that we can modify the system prompt without reloading the model)
generator = DataPointGenerator(data_point_prompts, system_prompt["systemPrompt"], num_months, context_tokens=8192, current_month=25, current_topic=0)
llm = DynamicOllama(system_prompt=generator.system_prompt)
save_path = directory / 'test_node_ingestion.json'
new_node = None
data_point, updated_summary_prompt = next(generator)
while not new_node:
    new_node = llm.generate(data_point["prompt"], updated_summary_prompt)
print(tabulate(new_node.items(), tablefmt="fancy_grid"))
DynamicOllama.save_node(new_node, data_point, save_path)
generator.update_summaries(new_node["summary"])
new_node = None



Percentage of context tokens used for summary: 0.244140625%
╒═════════╤════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╕
│ date    │ 2023-02-25                                                                                                                                                                                                                                                                                                                                                                                                                                                 │
├─────────┼───────────────

Now that the data point is generated, let's see if adding it to our embedded graph works.

*Note* that this can be done using the CLI too!

In [1]:
from generate_graph import CombinedGraphCreator

# Initialize the CombinedGraphCreator and add a single node
adder = CombinedGraphCreator(json_file="./PacemakerInnovationsData/test_node_ingestion.json", output_file="./PacemakerInnovationsData/graph.pt")
adder.add_node()



## Step 8: Query function

This might be the most important part of the RAG, so here's an explanation of the 5 techniques used in it. Note: I also reserved a `question_embedding` attribute in the knowledge graph with a 6th method in mind, question-based context retrieval. Here we can either get the users to thumbs up/down the answer and populate the field(s) with relevant questions, or pre-generate some relevant questions for each node/paragraph/sentence, since questions are more likely than answers to have a semantic embedding similar to user queries.

### Explanation of the 5 Different Techniques Used to Query the RAG/Graph of Knowledge

The `generateanswer.py` script employs five different techniques to query the RAG/Graph of Knowledge. These techniques are used to retrieve relevant context based on different criteria, and their weights can be optimized to improve the quality of the answers generated. Here is a detailed explanation of each technique:

1. **Time-Based Context Retrieval**:
   - **Description**: This technique retrieves context based on the time information associated with the query. It looks for content that falls within a specified date range.
   - **Implementation**: The function `get_time_based_context` checks if the `time_info` (start and end dates) of the query matches any quarter information in the `quarter_index`. It then retrieves the relevant content and embeddings, calculates similarities, and selects the top relevant content.
   - **Optimization**: The weight for this technique (`time_weight`) can be adjusted based on how important time-based context is for answering the query.

2. **Topic-Based Context Retrieval**:
   - **Description**: This technique retrieves context based on the topics related to the query. It identifies relevant topics and retrieves content associated with those topics.
   - **Implementation**: The function `get_topic_based_context` uses the `related_topics` identified by the `ask_related_topics` function. It retrieves content and embeddings for these topics, calculates similarities, and selects the top relevant content.
   - **Optimization**: The weight for this technique (`topic_weight`) can be adjusted based on the relevance of topic-based context for the query.

3. **Node Embedding-Based Context Retrieval**:
   - **Description**: This technique retrieves context based on the similarity of embeddings between the query and the content. It uses vector representations to find the most similar content.
   - **Implementation**: The function `get_embedding_based_context` calculates the similarity between the query embedding and the embeddings of all content nodes. It selects the top relevant content based on these similarities.
   - **Optimization**: The weight for this technique (`embedding_weight`) can be adjusted based on the effectiveness of embedding-based retrieval for the query.

4. **Paragraph Embedding-Based Context Retrieval**:
   - **Description**: This technique retrieves context at the paragraph level. It identifies and retrieves the most relevant paragraphs based on their similarity to the query.
   - **Implementation**: The function `get_paragraph_based_context` calculates the similarity between the query embedding and the embeddings of paragraphs. It selects the top relevant paragraphs based on these similarities.
   - **Optimization**: The weight for this technique (`paragraph_weight`) can be adjusted based on the importance of paragraph-level context for the query.

5. **Sentence Embedding-Based Context Retrieval**:
   - **Description**: This technique retrieves context at the sentence level. It identifies and retrieves the most relevant sentences based on their similarity to the query.
   - **Implementation**: The function `get_sentence_based_context` calculates the similarity between the query embedding and the embeddings of sentences. It selects the top relevant sentences based on these similarities.
   - **Optimization**: The weight for this technique (`sentence_weight`) can be adjusted based on the importance of sentence-level context for the query.
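
To make the role of the weights concrete, here is one possible way to merge the five result lists into a single ranking. This is a sketch under assumptions, not the logic of `generateanswer.py`: each technique is assumed to return `(text, similarity)` pairs, and `cosine_similarity`, `combine_contexts` and the technique keys are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def combine_contexts(candidates_by_technique, weights, top_k=3):
    """Merge per-technique (text, similarity) lists into one weighted ranking."""
    scores = {}
    for technique, candidates in candidates_by_technique.items():
        weight = weights.get(technique, 0.0)
        for text, similarity in candidates:
            # A passage found by several techniques accumulates their weighted scores
            scores[text] = scores.get(text, 0.0) + weight * similarity
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [text for text, _ in ranked[:top_k]]
```

With this scheme, setting a weight to zero switches a technique off entirely, and a passage surfaced by both the topic and sentence retrievers outranks one found by only a single technique at similar similarity.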

### Optimizing the Weight of Each Technique

The weights of each technique can be optimized by generating questions and answers using randomly selected context. This process involves the following steps:

1. **Generate Random Context**:
   - Randomly select context passages from the dataset.
   - Use these passages to create a diverse set of queries.

2. **Generate Answers**:
   - Use the current weights to retrieve context for each query.
   - Generate answers based on the retrieved context.

3. **Evaluate Answers**:
   - Evaluate the quality of the answers using metrics such as relevance, accuracy, and completeness.
   - Compare the generated answers with ground truth answers (if available).

4. **Adjust Weights**:
   - Adjust the weights of each technique based on the evaluation results.
   - Increase the weight of techniques that contribute to better answers.
   - Decrease the weight of techniques that contribute to poorer answers.

5. **Iterate**:
   - Repeat the process of generating random context, generating answers, evaluating answers, and adjusting weights.
   - Continue iterating until the optimal weights are found.
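
The loop above can be sketched as a simple random search. This is only an illustration under assumptions: the evaluation metric is taken to be the fraction of generated questions whose source passage is retrieved first, the per-technique similarities are assumed to be precomputed, and `hit_rate` and `random_search` are hypothetical names.

```python
import random


def hit_rate(weights, eval_set, top_k=1):
    """Fraction of questions whose source passage ranks in the top_k."""
    hits = 0
    for example in eval_set:
        # example["scores"][passage][technique] is that technique's similarity
        totals = {
            passage: sum(weights[t] * s for t, s in per_technique.items())
            for passage, per_technique in example["scores"].items()
        }
        ranked = sorted(totals, key=totals.get, reverse=True)
        hits += example["source"] in ranked[:top_k]
    return hits / len(eval_set)


def random_search(eval_set, techniques, n_trials=200, seed=0):
    """Sample weight vectors at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best_weights = {t: 1.0 for t in techniques}
    best_score = hit_rate(best_weights, eval_set)
    for _ in range(n_trials):
        trial = {t: rng.random() for t in techniques}
        score = hit_rate(trial, eval_set)
        if score > best_score:
            best_weights, best_score = trial, score
    return best_weights, best_score
```

Random search is used here only because it is short; any black-box optimiser could be swapped in, and an answer-quality metric could replace the retrieval hit rate.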

By fine-tuning the weighting parameters through this iterative process, the system can learn the optimal combination of techniques to use for different types of queries, leading to more accurate and relevant answers. And since we have a pre-generated dataset that can be expanded almost indefinitely, there's plenty of room to optimise.

The total context length included in the query is another parameter that can be varied; it's a delicate balance between too much and not enough context.

For now let's test out our answer generator by picking out something specific from the generated data in `PacemakerInnovationsData/generated_data_points_v2.json`. Here's a quick extract from the data:

"""
To ensure the security of our firmware, we are implementing a multi-layered approach that includes:\n- Regular code reviews to identify potential vulnerabilities\n- Utilization of secure coding practices and guidelines\n- Implementation of encryption algorithms for data transmission and storage\n- Conducting thorough penetration testing to simulate real-world attacks
"""

A valid query would be:

"What are the exact 4 layers adopted by PacemakerInnovations in their multi-layered approach to firmware security?"


In [1]:
from generateanswer import QueryProcessor

processor = QueryProcessor(graph_file="./PacemakerInnovationsData/graph.pt")
processor.perform_query("What are the exact 4 layers adopted by PacemakerInnovations in their multi-layered approach to firmware security?")



No date info found


Answer: 'The context provides sufficient information to answer the query regarding the exact 4 layers adopted by Pacemaker Innovations in their multi-layered approach to firmware security. According to the context, the four layers are:\n\n1. **Regular code reviews to identify potential vulnerabilities**: "To ensure the security of our firmware we are implementing a multilayered approach that includes Regular code reviews to identify potential vulnerabilities."\n\n2. **Utilization of secure coding practices and guidelines**: "Utilization of secure coding practices and guidelines."\n\n3. **Implementation of encryption algorithms for data transmission and storage**: "Implementation of encryption algorithms for data transmission and storage."\n\n4. **Conducting thorough penetration testing to simulate real-world attacks**: "Conducting thorough penetration testing to simulate real-world attacks."\n\nThese measures collectively form the multi-layered approach to firmware security adopted by Pacemaker Innovations.'

So it works!

A couple of horrible things I did with the code though:

1. DO NOT use pydantic to define your structure when saving a torch geometric object. I renamed the file (generate_graph -> generategraph) and got flashbacks to the one time a colleague did the same at SharperShape and we had to dig through commit history to get some point cloud data out of this self-sealing box :D I had totally forgotten.
2. I should've converted the GoK embedding lists to torch tensors when generating them; converting them every time we load the GoK is highly inefficient (but anyway, my torch-sparse installation is currently broken, so I'm not really benefitting from sparsity in PyG).

Also, in hindsight I see little use in connecting sentences to their parent paragraphs and nodes, aside from security. Your chunk size should be domain specific (i.e. keep code snippets in ``` ``` together); otherwise it's better to use a sliding window of n_sentences_embedded and optimise that parameter.
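
A minimal sketch of the sliding-window idea, assuming the text has already been split into sentences (for example by the spaCy parser used earlier); `sliding_window_chunks`, `n_sentences` and `stride` are hypothetical names for illustration:

```python
def sliding_window_chunks(sentences, n_sentences=3, stride=1):
    """Return overlapping chunks of n_sentences consecutive sentences."""
    if n_sentences < 1 or stride < 1:
        raise ValueError("n_sentences and stride must be positive")
    if len(sentences) <= n_sentences:
        return [" ".join(sentences)] if sentences else []
    last_start = len(sentences) - n_sentences
    starts = list(range(0, last_start + 1, stride))
    # Make sure the tail of the document is always covered
    if starts[-1] != last_start:
        starts.append(last_start)
    return [" ".join(sentences[s:s + n_sentences]) for s in starts]
```

Each chunk would be embedded on its own, and `n_sentences` becomes one more parameter to tune alongside the retrieval weights.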

## Step 9: User based sub-graphs

Let's try to implement some access control. What we'll do is the following:

1. Take `PacemakerInnovationsData/authentication.json`, `PacemakerInnovationsData/topicIndex.json` and `PacemakerInnovationsData/redactIndex.json`
2. Our subgraph generation "microservice" should have access to these (along with the specific user token). It should then go through `PacemakerInnovationsData/graph.pt` and:
    a. Prune the graph to remove all the nodes that the user is not "authorised" to see
    b. Prune the graph to remove all the sentences with content the user is not "authorised" to see (here we'll again rely on an LLM, to simulate a business case where the client knows which topics should be redacted but not exactly where they occur)
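
The two pruning steps can be illustrated on plain dictionaries. This is a simplified sketch, not the `ConfidentialSubgraphGenerator` implementation: nodes are assumed to carry a `topic_index`, and each sentence a list of `redact_labels` that has already been assigned (in the notebook that labelling is delegated to an LLM). All field names other than those from `authentication.json` are hypothetical.

```python
def prune_for_user(nodes, user):
    """Keep only the nodes and sentences this user is allowed to see."""
    allowed_topics = set(user["topic_access_indices"])
    redacted = set(user["redact_content_indices"])
    visible = []
    for node in nodes:
        # Step a: drop whole nodes on topics the user cannot access
        if node["topic_index"] not in allowed_topics:
            continue
        # Step b: drop sentences tagged with any label redacted for this user
        sentences = [
            sentence for sentence in node["sentences"]
            if not redacted.intersection(sentence["redact_labels"])
        ]
        visible.append({**node, "sentences": sentences})
    return visible
```

The input list is left untouched, so the full graph can be reused to build subgraphs for other users.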

Let's test to see if it works for "Aria Diaz", our Cybersecurity Analyst with a token: "498f98ac-2bf6-4a16-a001-cbb19e7b82de"

In [1]:
from confidentialgraph import ConfidentialSubgraphGenerator

# Initialize the ConfidentialSubgraphGenerator
generator = ConfidentialSubgraphGenerator(graph_file="./PacemakerInnovationsData/graph.pt")
# Generate the confidential subgraph
generator.generate_subgraph(user_token="498f98ac-2bf6-4a16-a001-cbb19e7b82de")



Subgraph saved to cache: ./cache/d822696b-23d3-4565-8a93-4ff532371ee8.pt


Data(x=[24], edge_index=[2, 154], edge_attr=[154], paragraph_features=[76], sentence_features=[0])

## Step 10: Generating API endpoints and a quick streamlit UI

Now we can finally combine it all into two endpoints:

### Query Endpoint

This endpoint takes a user token and a query, and returns an answer.

Under the hood it generates a subgraph for that user, then passes the query and the subgraph to the `QueryProcessor` and returns the answer, pretty much just combining the previous two steps.
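
The flow of this endpoint can be sketched independently of the web framework. In the snippet below `build_subgraph` and `run_query` are stand-ins for the subgraph generator and the query processor from the previous steps; their signatures, and the function name `answer_query`, are assumptions made for illustration.

```python
def answer_query(token, query, auth_records, build_subgraph, run_query):
    """Resolve the token to a user, build their subgraph, then answer."""
    user = next((r for r in auth_records if r["token"] == token), None)
    if user is None:
        # An HTTP layer would typically turn this into a 401 response
        raise PermissionError("unknown access token")
    subgraph = build_subgraph(user)
    return run_query(query, subgraph)
```

Rejecting unknown tokens before any graph work happens keeps the retrieval code from ever running against the unpruned graph.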

### Add knowledge

This endpoint will allow you to add knowledge (filtered by topics that your user has access to)

To run the app simply run:

```bash
docker build -t secure-llm-gok .
docker run -p 8501:8501 -p 8000:8000 secure-llm-gok
```
