<a href="https://colab.research.google.com/github/Jacob-Rose-BU/Alternative-Investments---Assette-Capstone-Project/blob/main/Generate_Synthetic_Data_CSV_Excel_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Team Summary & Disclosures**

This notebook generates synthetic data needed for ESG equity fund fact sheets, specifically for the team summary and compliance disclosures sections.

- For HR, synthetic data simulates what would be manually provided by the HR department (as per business advisor guidance). This includes names, titles, tenure, and team affiliations. The data is output as a CSV and prepared for GPT-based summary generation to describe the team managing each fund.

- For compliance, a reusable list of regulatory disclosure footnotes has been synthetically created. These footnotes are tagged with unique IDs and labeled as either mandatory or voluntary (the DisclosureRequirement field). For each fund, the compliance team will provide a list of relevant disclosure IDs, which are then pulled and added to the fund's fact sheet.


### **Execution Instructions**

**To run this notebook:**
1. Run the notebook sequentially from top to bottom to generate and export HR and Compliance data
2. Output files will be saved locally and visible in the file panel

### **File Roadmap**
HR Data: Faker-generated employee names, titles, tenure, and team <br>
Compliance Footnotes: Reusable disclosures labeled with DisclosureID and category <br>
**Outputs:** synthetic_employee_hr.csv, full_compliance_footnotes.csv

### **Next Steps**
- Generate GPT-based Team Summary using the HR CSV
- Fund-Specific Disclosure DataFrame Generation

### **Future Improvement:**
#### **Synthetic Data Enhancements**
- Add variability in job title seniority
- Group HR data by team/product alignment


In [None]:
pip install faker

Collecting faker
  Downloading faker-37.4.2-py3-none-any.whl.metadata (15 kB)
Downloading faker-37.4.2-py3-none-any.whl (1.9 MB)
Installing collected packages: faker
Successfully installed faker-37.4.2


# **Team Summary**

HR data for this project is synthetically created to simulate how team information would be incorporated into ESG fund fact sheets. As discussed with the business advisor, real HR data will be provided manually since the HR team operates within separate systems. To emulate this, we generated synthetic HR profiles, including names, roles, tenure, and team affiliations, and saved them to a CSV file. This dataset serves as input for the GPT API, which will be used to generate narrative summaries about the fund teams responsible for managing each product. These summaries are intended to provide investors with a concise, professional overview of the team behind their fund. While the data generation and CSV preparation are complete, the next step is to generate the GPT-based summaries. The final deliverable will be a table containing the fund name and corresponding team summary, ready to be manually inserted into the final fact sheets.

#### **HR Data Pipeline**
1. Synthetic HR Data Creation - For development and demonstration purposes, we synthetically generated HR data including names, titles, tenure, and teams.
2. CSV Output - The synthetic HR data is saved as a CSV file to mirror how the real HR data might be delivered manually.
3. GPT-Based Summary Generation (Pending) - A GPT function will read the HR CSV and generate a narrative team summary for each fund, describing who manages the fund and their qualifications.
4. Manual Transfer Step - Once the GPT summaries are generated, they will be shared with the business team to be inserted into the final fact sheets manually.
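
Step 3 is still pending, so the following is only a rough sketch of how the HR rows might be turned into a prompt before any GPT call is made. The helper name `build_team_prompt`, the prompt wording, and the choice to group by `team` and keep only active staff are assumptions for illustration. The notebook does not yet define how teams map to funds, and no API call is shown.

```python
def build_team_prompt(records, team):
    """Build a plain-text prompt describing the active members of one team.

    `records` is a list of dicts using the lowercase column names produced
    by load_hr_data(). Hypothetical helper, not part of the notebook.
    """
    members = [
        r for r in records
        if r["team"] == team and r["status"] == "Active"
    ]
    # List key personnel first, then order by experience (most first)
    members.sort(key=lambda r: (not r["iskeypersonnel"], -r["industry_experience_years"]))

    lines = [
        f"- {r['name']}, {r['title']} ({r['job_function']}), "
        f"{r['industry_experience_years']} years of experience, {r['education']}"
        for r in members
    ]
    return (
        f"Write a concise, professional summary of the {team} team "
        "for a fund fact sheet. Team members:\n" + "\n".join(lines)
    )

# Made-up rows in the same shape as the synthetic HR data
sample = [
    {"name": "B. Sample", "title": "Analyst", "job_function": "ESG Analyst",
     "team": "Equities", "education": "MBA", "industry_experience_years": 20.0,
     "iskeypersonnel": False, "status": "Active"},
    {"name": "A. Example", "title": "Director", "job_function": "Portfolio Manager",
     "team": "Equities", "education": "CFA", "industry_experience_years": 12.5,
     "iskeypersonnel": True, "status": "Active"},
    {"name": "C. Inactive", "title": "Associate", "job_function": "Trader",
     "team": "Equities", "education": "BBA", "industry_experience_years": 3.0,
     "iskeypersonnel": False, "status": "Inactive"},
    {"name": "D. Other", "title": "Analyst", "job_function": "Risk Analyst",
     "team": "Fixed Income", "education": "PhD", "industry_experience_years": 8.0,
     "iskeypersonnel": False, "status": "Active"},
]

prompt = build_team_prompt(sample, "Equities")
print(prompt)
```

With the DataFrame returned by `load_hr_data`, `df_hr.to_dict("records")` would supply the `records` argument. The resulting string would then be passed to whichever GPT client the team settles on.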

#### **Final Output Format**: CSV/Excel with columns fund_name, team_summary

## **Generate Synthetic HR data**

The script generates synthetic HR data tailored for an asset management firm by creating fake employee records with realistic fields such as name, start date, job function, title, team, education, and key personnel status using the Faker library. It calculates industry experience based on the start date, saves the dataset to a CSV file, and includes a separate function to load and clean this data for analysis, standardizing column names and recalculating experience if needed.
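
The experience figure is the number of days since the start date divided by 365.25, rounded to one decimal place. The notebook computes it against today's date, so the values drift each time the cell is rerun. A minimal sketch of the same arithmetic with an explicit reference date is shown here; the function name `years_of_experience` and the `as_of` parameter are illustrative and not part of the notebook.

```python
from datetime import date

def years_of_experience(start_date, as_of):
    """Elapsed years between two dates, using 365.25 days per year."""
    return round((as_of - start_date).days / 365.25, 1)

# 2015-01-01 to 2025-01-01 spans 3653 days (three leap days included)
print(years_of_experience(date(2015, 1, 1), date(2025, 1, 1)))  # 10.0
```

Note that this measures time since StartDate, which the synthetic data treats as a stand-in for industry experience.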

In [None]:
import pandas as pd
import random
from faker import Faker
from datetime import datetime

fake = Faker()

def generate_synthetic_hr_data(num_records=20, output_path='synthetic_employee_hr.csv'):
    """
    Generates synthetic HR data for asset management roles and saves as CSV.
    """
    job_functions = [
        'Equity Research Analyst', 'Fixed Income Analyst', 'Quantitative Analyst',
        'Portfolio Manager', 'Research Associate', 'Investment Strategist',
        'ESG Analyst', 'Trader', 'Risk Analyst', 'Compliance Officer',
        'Operations Manager', 'Chief Investment Officer (CIO)', 'Product Specialist'
    ]
    titles = ['Analyst', 'Associate', 'Vice President', 'Director', 'Managing Director']
    teams = ['Equities', 'Fixed Income', 'Multi-Asset', 'Quant', 'Compliance']
    education_levels = ['MBA', 'CFA', 'PhD', 'BBA', 'MS Finance', 'MA Econ']
    employment_types = ['Full-time', 'Contract']
    locations = ['New York', 'Boston', 'London', 'San Francisco', 'Chicago']

    data = []
    for i in range(num_records):
        start_date = fake.date_between(start_date='-15y', end_date='-1y')
        years_exp = round((datetime.today().date() - start_date).days / 365.25, 1)
        job_function = random.choice(job_functions)
        name = fake.name()
        employee_id = f"EID{i+1000}"
        title = random.choice(titles)
        team = random.choice(teams)
        education = random.choice(education_levels)
        employment_type = random.choice(employment_types)
        location = random.choice(locations)
        is_key_personnel = random.choices([True, False], weights=[0.3, 0.7])[0]
        status = random.choices(['Active', 'Inactive'], weights=[0.85, 0.15])[0]

        data.append({
            'EmployeeID': employee_id,
            'Name': name,
            'StartDate': start_date,
            'IndustryExperienceYears': years_exp,
            'PrimaryJobFunction': job_function,
            'Title': title,
            'Team': team,
            'Education': education,
            'EmploymentType': employment_type,
            'Location': location,
            'IsKeyPersonnel': is_key_personnel,
            'Status': status
        })

    df_hr = pd.DataFrame(data)
    df_hr.to_csv(output_path, index=False)
    print(f"Synthetic HR data saved to: {output_path}")
    return df_hr

def load_hr_data(file_path):
    """
    Loads and cleans HR data CSV. Calculates years of experience if not provided.
    """
    df_hr = pd.read_csv(file_path, parse_dates=['StartDate'])

    # Normalize column names
    df_hr.columns = df_hr.columns.str.strip().str.lower()

    # Rename for clarity
    df_hr.rename(columns={
        'startdate': 'start_date',
        'industryexperienceyears': 'industry_experience_years',
        'primaryjobfunction': 'job_function'
    }, inplace=True)

    # Calculate experience if missing
    if 'industry_experience_years' not in df_hr.columns or df_hr['industry_experience_years'].isnull().all():
        df_hr['industry_experience_years'] = (datetime.today() - df_hr['start_date']).dt.days / 365.25

    return df_hr

# Example run
if __name__ == "__main__":
    generate_synthetic_hr_data(50, 'synthetic_employee_hr.csv')
    df_hr = load_hr_data('synthetic_employee_hr.csv')
    print(df_hr.head())


Synthetic HR data saved to: synthetic_employee_hr.csv
  employeeid              name start_date  industry_experience_years  \
0    EID1000       Nathan Hull 2021-05-04                        4.2   
1    EID1001         John Mann 2014-02-04                       11.4   
2    EID1002      Emily Patton 2019-05-22                        6.2   
3    EID1003  Brandon Martinez 2014-09-30                       10.8   
4    EID1004        Amanda Key 2020-02-13                        5.4   

           job_function           title         team   education  \
0  Fixed Income Analyst  Vice President  Multi-Asset         PhD   
1    Research Associate         Analyst  Multi-Asset  MS Finance   
2  Quantitative Analyst       Associate  Multi-Asset         CFA   
3    Operations Manager         Analyst     Equities  MS Finance   
4  Fixed Income Analyst         Analyst     Equities  MS Finance   

  employmenttype  location  iskeypersonnel    status  
0      Full-time   Chicago            True    Act

## **Load HR Data**

This updated load_hr_data() function reads the synthetic HR CSV file, standardizes column names, and parses StartDate as a date. If the industry experience column is absent or contains any missing values, it recalculates the column from the start date and rounds to one decimal place; existing values are not otherwise validated. The function also converts IsKeyPersonnel to a boolean and Status, EmploymentType, Team, and the job function to categorical types, which makes the dataset cleaner and more memory-efficient for analysis or for visualization in dashboards and reporting tools.
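
One caveat on the boolean conversion: it is safe when the CSV column has already been read as True/False values, as is the case for the synthetic file. If a manually provided HR file spelled the flag as text (for example "Yes"/"No"), a truthiness-based conversion would mark every non-empty string, including "False", as True. The sketch below shows one way to parse such text explicitly. The helper `parse_flag` and the accepted spellings are assumptions for illustration, not part of the notebook.

```python
TRUE_VALUES = {"true", "t", "yes", "y", "1"}
FALSE_VALUES = {"false", "f", "no", "n", "0"}

def parse_flag(value):
    """Map common text spellings to a bool; return None if unrecognised."""
    if isinstance(value, bool):
        return value
    text = str(value).strip().lower()
    if text in TRUE_VALUES:
        return True
    if text in FALSE_VALUES:
        return False
    return None

print(bool("False"))        # True, because the string is non-empty
print(parse_flag("False"))  # False
print(parse_flag(" yes "))  # True
print(parse_flag("n/a"))    # None
```

Applied to a DataFrame, something like `df_hr['iskeypersonnel'].map(parse_flag)` would replace the plain cast, leaving unrecognised entries visible as missing values instead of silently turning them into True.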

In [None]:
import pandas as pd
from datetime import datetime

def load_hr_data(file_path):
    """
    Loads and cleans HR data from CSV.
    Standardizes column names, parses dates, and verifies key fields.
    """
    df_hr = pd.read_csv(file_path, parse_dates=['StartDate'])

    # Normalize column names
    df_hr.columns = df_hr.columns.str.strip().str.lower()

    # Rename key columns for consistency
    df_hr.rename(columns={
        'startdate': 'start_date',
        'industryexperienceyears': 'industry_experience_years',
        'primaryjobfunction': 'job_function'
    }, inplace=True)

    # Recalculate experience for all rows if the column is absent or has any missing values
    if 'industry_experience_years' not in df_hr.columns or df_hr['industry_experience_years'].isnull().any():
        df_hr['industry_experience_years'] = (datetime.today() - df_hr['start_date']).dt.days / 365.25
        df_hr['industry_experience_years'] = df_hr['industry_experience_years'].round(1)

    # Optional: Convert booleans & categories
    df_hr['iskeypersonnel'] = df_hr['iskeypersonnel'].astype(bool)
    df_hr['status'] = df_hr['status'].astype('category')
    df_hr['employmenttype'] = df_hr['employmenttype'].astype('category')
    df_hr['team'] = df_hr['team'].astype('category')
    df_hr['job_function'] = df_hr['job_function'].astype('category')

    return df_hr

# Example usage
if __name__ == "__main__":
    hr_df = load_hr_data('synthetic_employee_hr.csv')
    print(hr_df.head())


  employeeid              name start_date  industry_experience_years  \
0    EID1000       Nathan Hull 2021-05-04                        4.2   
1    EID1001         John Mann 2014-02-04                       11.4   
2    EID1002      Emily Patton 2019-05-22                        6.2   
3    EID1003  Brandon Martinez 2014-09-30                       10.8   
4    EID1004        Amanda Key 2020-02-13                        5.4   

           job_function           title         team   education  \
0  Fixed Income Analyst  Vice President  Multi-Asset         PhD   
1    Research Associate         Analyst  Multi-Asset  MS Finance   
2  Quantitative Analyst       Associate  Multi-Asset         CFA   
3    Operations Manager         Analyst     Equities  MS Finance   
4  Fixed Income Analyst         Analyst     Equities  MS Finance   

  employmenttype  location  iskeypersonnel    status  
0      Full-time   Chicago            True    Active  
1       Contract    London           False    Ac

In [None]:
df_hr

Unnamed: 0,employeeid,name,start_date,industry_experience_years,job_function,title,team,education,employmenttype,location,iskeypersonnel,status
0,EID1000,Nathan Hull,2021-05-04,4.2,Fixed Income Analyst,Vice President,Multi-Asset,PhD,Full-time,Chicago,True,Active
1,EID1001,John Mann,2014-02-04,11.4,Research Associate,Analyst,Multi-Asset,MS Finance,Contract,London,False,Active
2,EID1002,Emily Patton,2019-05-22,6.2,Quantitative Analyst,Associate,Multi-Asset,CFA,Contract,New York,False,Active
3,EID1003,Brandon Martinez,2014-09-30,10.8,Operations Manager,Analyst,Equities,MS Finance,Contract,Chicago,False,Inactive
4,EID1004,Amanda Key,2020-02-13,5.4,Fixed Income Analyst,Analyst,Equities,MS Finance,Contract,London,False,Active
5,EID1005,Jennifer Newman,2021-12-17,3.6,Fixed Income Analyst,Managing Director,Compliance,PhD,Contract,New York,False,Active
6,EID1006,Margaret Proctor,2020-03-21,5.3,Chief Investment Officer (CIO),Vice President,Fixed Income,BBA,Full-time,New York,False,Active
7,EID1007,Henry Jones,2018-07-22,7.0,Product Specialist,Managing Director,Fixed Income,CFA,Full-time,London,False,Active
8,EID1008,Tara Wiley,2011-06-30,14.1,Chief Investment Officer (CIO),Director,Fixed Income,MA Econ,Contract,Chicago,False,Active
9,EID1009,Christopher Hensley,2021-09-06,3.9,Quantitative Analyst,Analyst,Fixed Income,MBA,Full-time,Boston,False,Active


## **GPT API for Team Summary**

In [None]:
# Next steps: reach out to Jacob regarding this portion. Maybe we can add this to Jacob's GPT sheet or do it in this sheet.

# **Compliance & Disclosure Information**

Compliance and disclosure information for fund fact sheets is managed through a centralized, reusable repository. As advised by the business team, most disclosures are consistent across funds and can be reused as needed. To support this, we generated a synthetic master list of disclosures, each labeled with a unique DisclosureID, its description, and a tag indicating whether it is mandatory or voluntary. For each fund, the compliance team will provide the specific IDs applicable to that product, and those entries will be pulled from the master list and added to the fact sheet. This approach allows for consistency, ease of updates, and simplified integration with downstream reporting. The final output will be a table of selected disclosures that align with each fund’s regulatory and marketing requirements.

#### **Compliance & Disclosure Pipeline**
1. Synthetic Disclosure List Creation - A synthetic set of compliance footnotes has been generated, each with a unique DisclosureID and corresponding description text. Each record is tagged as mandatory or voluntary.
2. Central Repository Approach - This disclosure list will act as a central, master repository. It can be maintained and updated by the compliance team outside of the pipeline (in Excel or Snowflake).
3. Manual Input from Compliance - For each fact sheet, the compliance team will specify which DisclosureIDs apply to that fund. These IDs will be matched to the master list and pulled into the fact sheet.
4. Fund-Specific Disclosure DataFrame Generation (Pending) - Load the provided DisclosureIDs for a specific fund from compliance, pull the matching footnotes from the master list, and generate a structured table of disclosures.
5. Fund-Specific Disclosure Table Generation (Future Improvement) - The table from the previous step will be loaded into Snowflake for use in final fact sheet assembly.
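
Step 4 is still pending. As a hedged illustration of the lookup it describes, the sketch below matches a fund's DisclosureIDs against a master list and writes rows in the fund_name, footnote layout. The function name `build_fund_disclosures`, the fund name, and the tiny master list are made up for the example, and it uses only the standard library so that it runs without the generated CSV.

```python
import csv
import io

def build_fund_disclosures(fund_name, disclosure_ids, master):
    """Return (rows, missing) for one fund.

    `master` maps DisclosureID -> footnote text. IDs keep the order in which
    compliance supplied them; unknown IDs are reported, not silently dropped.
    """
    rows, missing = [], []
    for disclosure_id in disclosure_ids:
        if disclosure_id in master:
            rows.append({"fund_name": fund_name, "footnote": master[disclosure_id]})
        else:
            missing.append(disclosure_id)
    return rows, missing

# Made-up master list and fund name, for illustration only
master = {
    "DISC100": "Past performance is not indicative of future results.",
    "DISC101": "This document is a marketing communication and not investment research.",
}
rows, missing = build_fund_disclosures(
    "Example ESG Equity Fund", ["DISC101", "DISC100", "DISC999"], master
)

# Write the result in the fund_name, footnote layout
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["fund_name", "footnote"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
print("Unmatched IDs:", missing)
```

Returning the unmatched IDs separately is a design choice: a mistyped ID from compliance then surfaces as a visible warning instead of a footnote quietly missing from the fact sheet.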


#### **Final Output Format**: CSV/Excel with columns fund_name, footnote

## **Synthetic Compliance & Disclosure Information**

This function creates a synthetic dataset of compliance footnotes and disclosures commonly found in ESG equity fact sheets. Each record includes a disclosure ID, topic, descriptive footnote, effective date, disclosure type (e.g., legal, performance), applicable entity (e.g., fund or strategy), review details (last reviewed date and reviewer), a disclosure source, and a DisclosureRequirement field that flags whether the disclosure is mandatory or voluntary. The requirement is determined from the disclosure type: Legal and Regulatory disclosures are always mandatory, while other types are marked mandatory with a 20% probability and voluntary otherwise. The final dataset is saved as a CSV file for analysis or reporting.

In [None]:
import random

import pandas as pd
from faker import Faker

fake = Faker()

def generate_full_compliance_data(num_records=25, output_path='full_compliance_footnotes.csv'):
    """
    Generates synthetic compliance footnotes and disclosures for ESG equity fact sheets,
    incorporating regulatory disclaimers, usage restrictions, and disclosure classifications.
    """
    topics = [
        "Risk Disclosure", "ESG Methodology", "Benchmark Comparison",
        "Performance Past vs Future", "Index Usage", "Data Source",
        "Carbon Footprint", "Proxy Voting", "Engagement Policy",
        "Sustainable Investing Risk", "Regulatory Classification",
        "Holdings Disclosure", "Sector Allocation Methodology",
        "Data Accuracy Disclaimer", "Third-Party Content Disclaimer",
        "Marketing Classification", "Jurisdictional Disclosure",
        "Investor Rights", "Professional Use Only", "Fund Documentation Reference",
        "Source Attribution", "Reproduction Restriction"
    ]

    footnotes = {
        "Risk Disclosure": "Past performance is not indicative of future results.",
        "ESG Methodology": "This strategy integrates ESG criteria but may still be exposed to non-sustainable risks.",
        "Benchmark Comparison": "Benchmark returns are presented for comparison purposes only and do not reflect fees.",
        "Performance Past vs Future": "Historical returns are not a guarantee of future performance.",
        "Index Usage": "The strategy may invest in securities not included in the benchmark.",
        "Data Source": "Data used in ESG scoring is obtained from third-party sources deemed reliable.",
        "Carbon Footprint": "This fund's carbon footprint is calculated based on Scope 1 and 2 emissions.",
        "Proxy Voting": "All voting activity aligns with the firm's proxy voting policy and stewardship principles.",
        "Engagement Policy": "Company engagement is a key part of our ESG integration framework.",
        "Sustainable Investing Risk": "ESG ratings may not reflect all material sustainability risks.",
        "Regulatory Classification": "Distributed in accordance with local regulations governing UCITS.",
        "Holdings Disclosure": "Full details of underlying fund holdings can be found at www.ssga.com.",
        "Sector Allocation Methodology": "Sector classifications follow GICS methodology unless otherwise noted.",
        "Data Accuracy Disclaimer": "Information is not warranted to be accurate, complete, or timely.",
        "Third-Party Content Disclaimer": "Morningstar and its content providers disclaim liability for damages from use of the data.",
        "Marketing Classification": "This document is a marketing communication and not investment research.",
        "Jurisdictional Disclosure": "Distributed in accordance with Swiss and Luxembourg regulations.",
        "Investor Rights": "A summary of investor rights is available at www.ssga.com.",
        "Professional Use Only": "This communication is directed at professional clients only.",
        "Fund Documentation Reference": "Review the KID and Prospectus before making an investment decision.",
        "Source Attribution": "Source: Morningstar, Inc. and SSGA.",
        "Reproduction Restriction": "No part of this document may be reproduced without written consent."
    }

    disclosure_types = ['Legal', 'Regulatory', 'ESG Policy', 'Performance', 'Marketing']
    applies_to = ['Fund', 'Firm', 'Strategy', 'Share Class', 'All', 'Professional Clients']
    disclosure_sources = ['SSGA', 'Morningstar', 'Internal Compliance']

    data = []
    for i in range(num_records):
        topic = random.choice(topics)
        footnote = footnotes[topic]
        disclosure_id = f"DISC{i+100}"
        effective_date = fake.date_between(start_date='-3y', end_date='today')
        disclosure_type = random.choice(disclosure_types)
        applicable_entity = random.choice(applies_to)
        last_reviewed = fake.date_between(start_date=effective_date, end_date='today')
        reviewer = fake.name()
        disclosure_source = random.choice(disclosure_sources)

        # Assign disclosure requirement classification
        if disclosure_type in ['Legal', 'Regulatory']:
            requirement = 'Mandatory'
        else:
            requirement = random.choices(['Mandatory', 'Voluntary'], weights=[0.2, 0.8])[0]

        data.append({
            'DisclosureID': disclosure_id,
            'Topic': topic,
            'Footnote': footnote,
            'EffectiveDate': effective_date,
            'DisclosureType': disclosure_type,
            'AppliesTo': applicable_entity,
            'LastReviewedDate': last_reviewed,
            'ReviewedBy': reviewer,
            'DisclosureRequirement': requirement,
            'DisclosureSource': disclosure_source
        })

    df_disclosures = pd.DataFrame(data)
    df_disclosures.to_csv(output_path, index=False)
    print(f"Compliance data saved to: {output_path}")
    return df_disclosures

# Example usage
if __name__ == "__main__":
    df_disclosures = generate_full_compliance_data(50, 'full_compliance_footnotes.csv')


Compliance data saved to: full_compliance_footnotes.csv
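
Because the generator picks each topic with random.choice, a run of 50 records over 22 topics necessarily repeats footnotes under different DisclosureIDs. That is harmless for demonstration data, but if the master list is meant to hold each disclosure once, one alternative is to assign IDs by enumerating the distinct topics. The sketch below shows that idea only; `assign_disclosure_ids` is a hypothetical helper and not the notebook's method.

```python
def assign_disclosure_ids(topics, start=100):
    """Give each distinct topic exactly one DisclosureID, in list order."""
    unique_topics = list(dict.fromkeys(topics))  # drop repeats, keep order
    return {f"DISC{start + i}": topic for i, topic in enumerate(unique_topics)}

topics = ["Risk Disclosure", "ESG Methodology", "Risk Disclosure", "Data Source"]
id_to_topic = assign_disclosure_ids(topics)
print(id_to_topic)
# {'DISC100': 'Risk Disclosure', 'DISC101': 'ESG Methodology', 'DISC102': 'Data Source'}
```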


## **Load Compliance & Disclosure Footnotes**

This function loads the compliance footnotes CSV file into a pandas DataFrame, parses the date fields (EffectiveDate and LastReviewedDate), and standardizes column names to lowercase for consistency. It also converts the disclosuretype, appliesto, topic, disclosurerequirement, and disclosuresource columns to categorical types, which improves memory efficiency and prepares the data for structured analysis or dashboard integration.

In [None]:
import pandas as pd

def load_compliance_data(file_path):
    """
    Loads and cleans the enhanced compliance footnotes and disclosures dataset.
    Parses dates and ensures proper data types for new fields.
    """
    # Load CSV and parse dates
    df_disclosures = pd.read_csv(file_path, parse_dates=['EffectiveDate', 'LastReviewedDate'])

    # Normalize column names
    df_disclosures.columns = df_disclosures.columns.str.strip().str.lower()

    # Convert appropriate columns to categorical
    categorical_fields = [
        'disclosuretype',
        'appliesto',
        'topic',
        'disclosurerequirement',
        'disclosuresource'
    ]
    for col in categorical_fields:
        if col in df_disclosures.columns:
            df_disclosures[col] = df_disclosures[col].astype('category')

    return df_disclosures

# Example usage
if __name__ == "__main__":
    df_disclosures = load_compliance_data('full_compliance_footnotes.csv')
    print(df_disclosures.head())


  disclosureid                           topic  \
0      DISC100   Sector Allocation Methodology   
1      DISC101  Third-Party Content Disclaimer   
2      DISC102        Data Accuracy Disclaimer   
3      DISC103           Professional Use Only   
4      DISC104                     Data Source   

                                            footnote effectivedate  \
0  Sector classifications follow GICS methodology...    2024-05-21   
1  Morningstar and its content providers disclaim...    2022-11-04   
2  Information is not warranted to be accurate, c...    2023-11-14   
3  This communication is directed at professional...    2024-09-09   
4  Data used in ESG scoring is obtained from thir...    2024-06-26   

  disclosuretype             appliesto lastrevieweddate            reviewedby  \
0     ESG Policy                   All       2024-09-14          Douglas Wood   
1     ESG Policy  Professional Clients       2025-04-17          Alan Edwards   
2     Regulatory                  F

In [None]:
df_disclosures
# Note: there are systems dedicated to this (disclosure ID, text, and whether it is mandatory)

Unnamed: 0,disclosureid,topic,footnote,effectivedate,disclosuretype,appliesto,lastrevieweddate,reviewedby,disclosurerequirement,disclosuresource
0,DISC100,Sector Allocation Methodology,Sector classifications follow GICS methodology...,2024-05-21,ESG Policy,All,2024-09-14,Douglas Wood,Voluntary,Internal Compliance
1,DISC101,Third-Party Content Disclaimer,Morningstar and its content providers disclaim...,2022-11-04,ESG Policy,Professional Clients,2025-04-17,Alan Edwards,Voluntary,SSGA
2,DISC102,Data Accuracy Disclaimer,"Information is not warranted to be accurate, c...",2023-11-14,Regulatory,Firm,2025-01-28,Annette Montoya,Mandatory,Internal Compliance
3,DISC103,Professional Use Only,This communication is directed at professional...,2024-09-09,ESG Policy,Firm,2025-06-19,Mr. Michael Gonzalez,Mandatory,Morningstar
4,DISC104,Data Source,Data used in ESG scoring is obtained from thir...,2024-06-26,Performance,Firm,2025-04-24,Cameron Johnson,Voluntary,Internal Compliance
5,DISC105,Marketing Classification,This document is a marketing communication and...,2023-02-12,Performance,Share Class,2025-04-14,David Leon,Voluntary,Morningstar
6,DISC106,Professional Use Only,This communication is directed at professional...,2023-05-04,Legal,Firm,2023-07-16,Patrick Andersen,Mandatory,Morningstar
7,DISC107,Investor Rights,A summary of investor rights is available at w...,2023-02-19,Legal,Fund,2024-06-13,Patricia Summers,Mandatory,Morningstar
8,DISC108,Data Source,Data used in ESG scoring is obtained from thir...,2024-08-10,Legal,Professional Clients,2024-11-04,Darren Jones,Mandatory,Internal Compliance
9,DISC109,Holdings Disclosure,Full details of underlying fund holdings can b...,2024-02-17,Performance,Fund,2025-06-03,Lee Hill,Voluntary,SSGA
