<a href="https://colab.research.google.com/github/MODA-NYC/Agency-Name-Project/blob/main/Agency_Name_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Overview
This project aims to create a standardized list of Agency Names* and publish this as a dataset on the NYC Open Data portal. The primary goal is to enhance data legibility and interoperability by providing official, consistently formatted agency names. This will provide a clear canonical source for how to format Agency Names, improving data quality and saving time when joining datasets on the Agency Name field.

This project is being developed by the Data Governance team in the Office of Data and Analytics.

*The word “Agency” is used colloquially here to mean any New York City government organization, including a City Agency, a Mayoral Office, or a Commission.

Project Plan document: https://docs.google.com/document/d/1u9-sZXUWdand1yIRmmKGbq7D5RAgD2puWoYvbP06a4g/edit?usp=sharing

GitHub repository (final location of the code and documentation of this project): https://github.com/MODA-NYC/Agency-Name-Project

Import pandas, then import and mount Google Drive in the Colab environment for file access.

In [177]:
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Web Scraping for https://www.nyc.gov/nyc-resources/agencies.page

Import required libraries (requests for HTTP requests, BeautifulSoup from bs4 for HTML parsing, and pandas for data manipulation). Define two functions to process and scrape agency information from a given URL (https://www.nyc.gov/nyc-resources/agencies.page). The process_agency_info function extracts and processes agency names, URLs, and descriptions from HTML list tags. The scrape_agency_list function performs a web scrape to collect agency data, handling possible request exceptions and storing the data in a pandas DataFrame.

In [178]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def process_agency_info(li_tag):
    a_tag = li_tag.find('a', class_='name')
    name = a_tag.text.strip() if a_tag else ''
    url = a_tag.get('href') if a_tag else ''
    description = li_tag.get('data-desc', '')

    # Preprocess name for unique identification
    name_processed = name.lower().strip()

    return {
        'Name': name_processed,
        'Name - NYC.gov Agency List': name,
        'URL': url,
        'Description': description
    }

def scrape_agency_list(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()  # Raise on HTTP error codes so the except block below handles them
        soup = BeautifulSoup(response.text, 'html.parser')

        agencies_info = []
        for li_tag in soup.select('.alpha-list li'):
            agencies_info.append(process_agency_info(li_tag))

        return agencies_info
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return []

# URL for the List of NYC agencies
url = 'https://www.nyc.gov/nyc-resources/agencies.page'
agencies_info = scrape_agency_list(url)

# Load the scraped data into a DataFrame
df_nyc_gov_agency_list = pd.DataFrame(agencies_info)

# Display the DataFrame
#print(df_nyc_gov_agency_list.head())

In [179]:
df_nyc_gov_agency_list.shape

(159, 4)

In [260]:
df_nyc_gov_agency_list.head()

Unnamed: 0,Name,Name - NYC.gov Agency List,URL,Description
0,actuary nyc office of the (nycoa),"Actuary, NYC Office of the (NYCOA)",http://www.nyc.gov/actuary,"The New York City Office of the Actuary (""NYCO..."
1,administrative justice coordinator nyc office ...,"Administrative Justice Coordinator, NYC Office...",http://www.nyc.gov/ajc,The Office of the Administrative Justice Coord...
2,administrative tax appeals office of,"Administrative Tax Appeals, Office of",http://www.nyc.gov/oata,The Office of Administrative Tax Appeals was e...
3,administrative trials and hearings office of (...,"Administrative Trials and Hearings, Office of ...",http://www.nyc.gov/oath,The NYC Office of Administrative Trials and H...
4,aging department for the (nyc aging),"Aging, Department for the (NYC Aging)",http://www.nyc.gov/aging,NYC Aging funds community-based organizations ...


In [180]:
# Show duplicate entries based on the 'Name' column in df_nyc_gov_agency_list
duplicates_nyc_gov_agency_list = df_nyc_gov_agency_list[df_nyc_gov_agency_list.duplicated('Name', keep=False)]
duplicates_nyc_gov_agency_list

Unnamed: 0,Name,Name - NYC.gov Agency List,URL,Description
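
The same duplicate check is repeated for every source in this notebook, so it could be wrapped in a small helper. The sketch below is illustrative only: find_duplicate_names is a hypothetical function name and the toy DataFrame is made-up data, not scraped output.

```python
import pandas as pd

def find_duplicate_names(df, key='Name'):
    """Return every row whose key value occurs more than once, sorted by key."""
    # keep=False marks all members of a duplicate group, not just the later ones
    mask = df.duplicated(key, keep=False)
    return df[mask].sort_values(key)

# Toy data (hypothetical) with one repeated name
toy = pd.DataFrame({
    'Name': ['board of elections', 'parks', 'board of elections'],
    'Source': ['A', 'A', 'B'],
})
print(find_duplicate_names(toy))
```

An empty result, as in the output above, means every processed name in that source is unique.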


# Web Scraping for https://www.nyc.gov/office-of-the-mayor/admin-officials.page

Define functions to scrape and process information from https://www.nyc.gov/office-of-the-mayor/admin-officials.page. The process_mayor_office_info function extracts agency names, URLs, contact names, and titles from HTML elements and lowercases agency names for unique identification. The scrape_mayor_office_list function uses the requests library to fetch the webpage, parses it with BeautifulSoup, and aggregates the data into a list; if the request fails, it prints the error and returns an empty list. The results are loaded into a pandas DataFrame.


In [181]:
def process_mayor_office_info(li_tag, source_name):
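    # Guard each lookup so a list item missing a block does not raise AttributeError
    agency_div = li_tag.find('div', class_='al-agency')
    agency_tag = agency_div.find('a') if agency_div else None
    agency_name = agency_tag.text.strip() if agency_tag else ''
    agency_url = agency_tag.get('href') if agency_tag else ''

    contact_div = li_tag.find('div', class_='al-contact')
    contact_name_tag = contact_div.find('a') if contact_div else None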
    contact_name = contact_name_tag.text.strip() if contact_name_tag else ''
    contact_title_tag = li_tag.find('li', class_='al-contact-info')
    contact_title = contact_title_tag.text.strip() if contact_title_tag else ''

    # Entries listed under "Mayor, Office of the" are renamed using the contact's title
    if agency_name == "Mayor, Office of the":
        agency_name = "Office of the " + contact_title

    # Preprocess name for unique identification
    name_processed = agency_name.lower().strip()

    return {
        'Name': name_processed,
        'Name - NYC.gov Mayor\'s Office': agency_name,
        'URL': agency_url,
        'Contact Name': contact_name,
        'Contact Title': contact_title
    }

def scrape_mayor_office_list(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    try:
        response = requests.get(url, headers=headers)
        response.raise_for_status()  # Ensure we raise an error for bad status
        soup = BeautifulSoup(response.text, 'html.parser')
        officials_info = []
        for li_tag in soup.select('li[data-topic]'):
            officials_info.append(process_mayor_office_info(li_tag, 'NYC.gov Mayor\'s Office'))
        return officials_info
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return []

# URL for the Office of the Mayor officials
url_mayor_office = 'https://www.nyc.gov/office-of-the-mayor/admin-officials.page'
officials_info = scrape_mayor_office_list(url_mayor_office)

# Load the scraped data into a DataFrame
df_nyc_mayor_office = pd.DataFrame(officials_info)

# Display the DataFrame
df_nyc_mayor_office.head()

Unnamed: 0,Name,Name - NYC.gov Mayor's Office,URL,Contact Name,Contact Title
0,"actuary, nyc office of the (nycoa)","Actuary, NYC Office of the (NYCOA)",http://www.nyc.gov/actuary,Marek Tyszkiewicz,Chief Actuary
1,"administrative justice coordinator, nyc office...","Administrative Justice Coordinator, NYC Office...",http://www.nyc.gov/ajc,David Goldin,Administrative Justice Coordinator
2,"administrative tax appeals, office of","Administrative Tax Appeals, Office of",http://www.nyc.gov/oata,Frances Henn,Director
3,"administrative trials and hearings, office of ...","Administrative Trials and Hearings, Office of ...",http://www.nyc.gov/oath,Asim Rehman,Commissioner
4,"aging, department for the (nyc aging)","Aging, Department for the (NYC Aging)",http://www.nyc.gov/aging,Lorraine A. Cortés-Vázquez,Commissioner


In [182]:
# Show duplicate entries based on the 'Name' column in df_nyc_mayor_office
duplicates_nyc_mayor_office = df_nyc_mayor_office[df_nyc_mayor_office.duplicated('Name', keep=False)]
duplicates_nyc_mayor_office

Unnamed: 0,Name,Name - NYC.gov Mayor's Office,URL,Contact Name,Contact Title


# Web Scraper for https://opendata.cityofnewyork.us/data/

Define a function to scrape agency information from the NYC Open Data Portal. The scrape_open_data_list function fetches the page using requests, parses it with BeautifulSoup, and iterates through the list items under div.content-block ul.space-section to collect agency names and URLs. Agency names are lowercased and trimmed for uniformity. The collected data is loaded into a pandas DataFrame for further manipulation and analysis.

In [183]:
def scrape_open_data_list(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    data_info = []

    for ul_tag in soup.select('div.content-block ul.space-section'):
        for li_tag in ul_tag.select('li'):
            a_tag = li_tag.find('a')
            if a_tag:
                agency_name = a_tag.text.strip()
                agency_url = a_tag.get('href')

                # Preprocess name for unique identification
                name_processed = agency_name.lower().strip()

                data_info.append({
                    'Name': name_processed,
                    'Name - NYC Open Data Portal': agency_name,
                    'URL': agency_url
                })

    return data_info

# URL for the NYC Open Data Portal
url_open_data = 'https://opendata.cityofnewyork.us/data/'
open_data_info = scrape_open_data_list(url_open_data)

# Load the scraped data into a DataFrame
df_nyc_open_data_portal = pd.DataFrame(open_data_info)

# Display the DataFrame
#print(df_nyc_open_data_portal.head())

In [261]:
df_nyc_open_data_portal.head()

Unnamed: 0,Name,Name - NYC Open Data Portal,URL
0,administration for childrens services (acs),Administration for Children’s Services (ACS),https://data.cityofnewyork.us/browse?Dataset-I...
1,board of elections (boeny),Board of Elections (BOENY),https://data.cityofnewyork.us/browse?Dataset-I...
2,board of standards and appeals (bsa),Board of Standards and Appeals (BSA),https://data.cityofnewyork.us/browse?Dataset-I...
3,bronx borough president (bpbx),Bronx Borough President (BPBX),https://data.cityofnewyork.us/browse?Dataset-I...
4,brooklyn borough president (bpbk),Brooklyn Borough President (BPBK),https://data.cityofnewyork.us/browse?Dataset-I...


In [184]:
# Show duplicate entries based on the 'Name' column in df_nyc_open_data_portal
duplicates_nyc_open_data_portal = df_nyc_open_data_portal[df_nyc_open_data_portal.duplicated('Name', keep=False)]
duplicates_nyc_open_data_portal


Unnamed: 0,Name,Name - NYC Open Data Portal,URL


# Web Scraper for https://www.checkbooknyc.com/agency_codes/newwindow

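Define a function, load_checkbook_data, that fetches the Checkbook NYC agency codes page, locates the first HTML table, and reads the agency code, agency name, and agency short name from each row. The lowercased name is stored in 'Name' for merging, and the original formatting is kept in 'Name - Checkbook'. The function returns an empty DataFrame if the page cannot be retrieved or no table is found.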

In [185]:
def load_checkbook_data():
    checkbook_url = 'https://www.checkbooknyc.com/agency_codes/newwindow'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    response = requests.get(checkbook_url, headers=headers)

    if response.status_code != 200:
        print("Failed to retrieve the webpage")
        return pd.DataFrame()

    soup = BeautifulSoup(response.content, 'html.parser')

    # Locate the table containing the data
    table = soup.find('table')
    if table is None:
        print("No table found on the webpage")
        return pd.DataFrame()

    # Extract table rows
    rows = table.find_all('tr')

    data = []
    for row in rows:
        cols = row.find_all('td')
        if len(cols) > 2:
            code = cols[0].text.strip()
            name = cols[1].text.strip()  # Keep the original name formatting
            short_name = cols[2].text.strip()  # Extract the "Agency Short Name"
            data.append({
                'Code': code,
                'Name': name.lower(),  # Preprocess the name to lowercase for merging
                'Name - Checkbook': name,
                'Agency Short Name': short_name
            })

    df_checkbook = pd.DataFrame(data)

    return df_checkbook[['Name', 'Name - Checkbook', 'Agency Short Name']]

# Load and process Checkbook NYC data
df_checkbook = load_checkbook_data()

In [262]:
df_checkbook.head()

Unnamed: 0,Name,Name - Checkbook,Agency Short Name
0,administration for childrens services,Administration for Children's Services,ADM CHILD SV
1,board of correction,Board of Correction,BD CORRECTN
2,board of elections,Board of Elections,BD ELECTIONS
3,borough president bronx,Borough President - Bronx,BP BRONX
4,borough president brooklyn,Borough President - Brooklyn,BP BROOKLYN


In [186]:
# Show duplicate entries based on the 'Name' column in df_checkbook
duplicates_checkbook = df_checkbook[df_checkbook.duplicated('Name', keep=False)]
duplicates_checkbook

Unnamed: 0,Name,Name - Checkbook,Agency Short Name


# Load Greenbook data from the Open Data Portal and Preprocess the data

In [187]:
import requests

# URL of the data endpoint
data_url = 'https://data.cityofnewyork.us/resource/mdcw-n682.json'

# Initialize a list to store dataframe chunks
dataframes = []
offset = 0
limit = 1000  # Adjust if necessary based on the API limits

while True:
    # Append the offset and limit parameters to the query
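    response = requests.get(f"{data_url}?$limit={limit}&$offset={offset}")
    response.raise_for_status()  # Stop on HTTP errors instead of parsing an error body
    # Build the chunk from the parsed JSON; newer pandas versions deprecate
    # passing a raw JSON string to pd.read_json
    data_chunk = pd.DataFrame(response.json())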

    # If no data is returned, we've read all rows, so break the loop
    if data_chunk.empty:
        break

    # Append the chunk to the list of dataframes
    dataframes.append(data_chunk)

    # Increase the offset to get the next chunk of data
    offset += limit

# Concatenate all chunks into a single DataFrame
df_greenbook = pd.concat(dataframes, ignore_index=True)

# Copy the original name field to a new column with the dataset specific name
df_greenbook['Name - Greenbook'] = df_greenbook['agency_name'].copy()

# Preprocess the "Name" field to trim whitespace and convert to lowercase for unique identification
df_greenbook['Name'] = df_greenbook['agency_name'].str.lower().str.strip()

# Select only the 'Name' and 'Name - Greenbook' columns
df_greenbook = df_greenbook[['Name', 'Name - Greenbook']]

# Drop duplicate rows based on the 'Name' column
df_greenbook = df_greenbook.drop_duplicates(subset=['Name'])

In [263]:
df_greenbook.head()

Unnamed: 0,Name,Name - Greenbook
0,actuary office of,"Actuary, Office of"
7,administrative trials and hearings office of,"Administrative Trials And Hearings, Office of"
26,health mental hygiene department of,"Health & Mental Hygiene, Department of"
27,aging department for the,"Aging, Department for the"
48,borough historians,Borough Historians


In [188]:
# Show duplicate entries based on the 'Name' column in df_greenbook
duplicates_greenbook = df_greenbook[df_greenbook.duplicated('Name', keep=False)]
duplicates_greenbook

Unnamed: 0,Name,Name - Greenbook
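
The offset-based paging loop used to load the Greenbook data can be separated from the HTTP call, which makes the stopping condition easy to check without network access. This is a minimal sketch under stated assumptions: fetch_all and fake_fetch_page are hypothetical names, and the in-memory list stands in for the API.

```python
def fetch_all(fetch_page, limit=1000):
    """Collect rows from an offset-paginated source until an empty page is returned."""
    rows = []
    offset = 0
    while True:
        page = fetch_page(limit=limit, offset=offset)
        if not page:
            # An empty page means every row has been read
            break
        rows.extend(page)
        offset += limit
    return rows

# Stand-in for the HTTP request: serves slices of an in-memory list (made-up data)
FAKE_ROWS = [{'agency_name': f'Agency {i}'} for i in range(2500)]

def fake_fetch_page(limit, offset):
    return FAKE_ROWS[offset:offset + limit]

all_rows = fetch_all(fake_fetch_page, limit=1000)
print(len(all_rows))
```

With a real endpoint, fetch_page would wrap the requests.get call shown above. Paged results are only reliable if the row order is stable between requests, so it may be worth checking the API documentation for an explicit ordering parameter.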



# Load and Preprocess ODA Agency Data from CSV

Loads a CSV file located on Google Drive into a DataFrame. The original 'Name' field is duplicated into a new column labeled 'Name - ODA'. The 'Name' field is then processed to remove whitespace and convert to lowercase for uniformity and ease of comparison or linking with other datasets.

In [189]:
# Path to the CSV file
file_path = '/content/drive/MyDrive/Projects/ODA/Agency Name Project/ODA Data.csv'

# Load the CSV file into a DataFrame
df_oda_data = pd.read_csv(file_path)

# Copy the original name field to a new column with the dataset specific name
df_oda_data['Name - ODA'] = df_oda_data['Name'].copy()

# Preprocess the "Name" field to trim whitespace and convert to lowercase for unique identification
df_oda_data['Name'] = df_oda_data['Name'].str.lower().str.strip()

# Display the DataFrame
#print(df_oda_data.head())

In [264]:
df_oda_data.head()

Unnamed: 0,Name,Agency Code,Parent Organization,Child Organization(s),Acronym,Agency Type,Website,Name - ODA
0,administration for childrens services (acs),68.0,,,ACS,City Department,https://www.nyc.gov/site/acs/index.page,Administration for Children's Services (ACS)
1,association for a better new york (abny),,,,ABNY,Other,abny.org,Association for a Better New York (ABNY)
2,board of correction (boc),73.0,,,BOC,Other,https://www.nyc.gov/site/boc/index.page,Board of Correction (BOC)
3,board of education retirement system (bers),,,,BERS,Other,https://www.bers.nyc.gov/,Board of Education Retirement System (BERS)
4,board of elections (boe),3.0,,,BOE,Other,vote.nyc,Board of Elections (BOE)


In [190]:
# Show duplicate entries based on the 'Name' column in df_oda_data
duplicates_oda_data = df_oda_data[df_oda_data.duplicated('Name', keep=False)]
duplicates_oda_data


Unnamed: 0,Name,Agency Code,Parent Organization,Child Organization(s),Acronym,Agency Type,Website,Name - ODA



# Load and Process Chief Privacy Officer (CPO) Data

Loads a CSV file containing agency data from the Chief Privacy Officer (CPO) into a DataFrame. Copies the 'Agency or Office' column to 'Name - CPO' to retain the original values for reference, then creates a 'Name' column by trimming whitespace and converting to lowercase for consistent identification across datasets. Finally, the original 'Agency or Office' column is dropped to streamline the DataFrame.

In [191]:
import pandas as pd

# Path to the CSV file
cpo_file_path = '/content/drive/MyDrive/Projects/ODA/Agency Name Project/CPO Data.csv'

# Load the CSV file into a DataFrame
df_cpo_data = pd.read_csv(cpo_file_path)

# Assume 'Agency or Office' is the column we want to rename and preprocess
# Copy the original 'Agency or Office' to 'Name - CPO' before preprocessing
df_cpo_data['Name - CPO'] = df_cpo_data['Agency or Office'].copy()

# Preprocess 'Agency or Office' for unique identification (trim and lower case)
df_cpo_data['Name'] = df_cpo_data['Agency or Office'].str.lower().str.strip()

# Now we can drop the original 'Agency or Office' column if it's no longer needed
df_cpo_data.drop(columns=['Agency or Office'], inplace=True)

# Display the DataFrame
#print(df_cpo_data.head())

In [265]:
df_cpo_data.head()

Unnamed: 0,Acronym,Name - CPO,Name
0,ACS,Administration for Children's Services,administration for childrens services
1,BOC,Board of Correction,board of correction
2,BERS,Board of Education Retirement System,board of education retirement system
3,BSA,Board of Standards and Appeals,board of standards and appeals
4,Bronx BP,Bronx Borough President's Office,bronx borough presidents office


In [192]:
# Show duplicate entries based on the 'Name' column in df_cpo_data
duplicates_cpo_data = df_cpo_data[df_cpo_data.duplicated('Name', keep=False)]
duplicates_cpo_data


Unnamed: 0,Acronym,Name - CPO,Name


# Load, Process, and Filter WeGov Data

Loads a CSV file containing data from the civic group WeGov into a DataFrame. Copies and renames the 'name' column to 'Name - WeGov', then preprocesses the 'name' for uniformity by trimming and converting to lowercase. The original 'name' column is removed post-processing. Additionally, filters the dataset to include only rows where the 'type' column values are 'City Agency' or 'Elected Office', focusing on relevant entities for further analysis.

In [193]:
# Path to the CSV file
wegov_file_path = '/content/drive/MyDrive/Projects/ODA/Agency Name Project/WeGov Data.csv'

# Load the CSV file into a DataFrame
df_wegov_data = pd.read_csv(wegov_file_path)

# Copy the original 'name' to 'Name - WeGov' before renaming
df_wegov_data['Name - WeGov'] = df_wegov_data['name'].copy()

# Preprocess 'name' for unique identification (trim and lower case)
df_wegov_data['Name'] = df_wegov_data['name'].str.lower().str.strip()

# Drop the original 'name' column as its data has been preserved and preprocessed
df_wegov_data.drop(columns=['name'], inplace=True)

# Filter the DataFrame for rows where the "type" column is either "City Agency" or "Elected Office"
#df_wegov_data['type'] = df_wegov_data['type'].str.strip()
df_wegov_data = df_wegov_data[df_wegov_data['type'].isin(['City Agency', 'Elected Office'])]

# Display the DataFrame
#print(df_wegov_data.head())

In [266]:
df_wegov_data.head()

Unnamed: 0,type,Name - WeGov,Name
4,City Agency,NYC Municipal Water Finance Authority,nyc municipal water finance authority
5,City Agency,NYC Technology Development Corporation,nyc technology development corporation
6,City Agency,Office of Administrative Tax Appeals,office of administrative tax appeals
7,City Agency,Transitional Finance Authority,transitional finance authority
24,Elected Office,Mayor's Office,mayors office


In [194]:
# Show duplicate entries based on the 'Name' column in df_wegov_data
duplicates_wegov_data = df_wegov_data[df_wegov_data.duplicated('Name', keep=False)]
duplicates_wegov_data

Unnamed: 0,type,Name - WeGov,Name
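
The 'type' filter above matches exact strings, so a value with stray whitespace would be silently dropped. The sketch below shows one way to make the filter tolerant of that. It is an assumption-laden illustration: filter_by_type is a hypothetical helper and the toy rows are invented, not taken from the WeGov file.

```python
import pandas as pd

def filter_by_type(df, allowed, column='type'):
    """Keep rows whose column value, after trimming whitespace, is in allowed."""
    # Missing values become empty strings so that .str.strip() is safe to call
    cleaned = df[column].fillna('').str.strip()
    return df[cleaned.isin(allowed)]

# Toy data (hypothetical) with padded and missing type values
toy = pd.DataFrame({
    'type': ['City Agency ', ' Elected Office', 'Nonprofit', None],
    'Name': ['a', 'b', 'c', 'd'],
})
kept = filter_by_type(toy, ['City Agency', 'Elected Office'])
print(kept)
```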


# Standardize and Normalize Agency Names Across DataFrames

Imports re for regular expressions and unicodedata for character normalization. Defines a function standardize_name to normalize and clean agency names for consistency. The function applies NFKD normalization to decompose combined characters, replaces curly apostrophes with straight ones, expands 'dept' to 'department', converts the text to ASCII, removes punctuation other than parentheses, collapses repeated whitespace, and lowercases the result. Acronyms in parentheses are kept at this stage and extracted in a later step. Applies this function to the 'Name' column of each DataFrame containing agency information, so that names are comparable across data sources, and then concatenates the DataFrames.

In [195]:
import re
import unicodedata

def standardize_name(name):
    # Normalize the string to decompose combined characters and replace special characters
    name = unicodedata.normalize('NFKD', name)
    name = name.replace('’', "'").replace('‘', "'")

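    # Drop invisible format characters (Unicode category Cf, e.g. zero-width spaces)
    name = "".join(char for char in name if unicodedata.category(char) != 'Cf')

    # Expand common abbreviations
    name = re.sub(r'\bdept\b', 'department', name, flags=re.IGNORECASE)
    # Note: \b&\b only matches an ampersand directly between word characters (e.g. "A&B");
    # a spaced " & " is not expanded here and is removed by the punctuation step below
    name = re.sub(r'\b&\b', 'and', name)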

    # Convert to ASCII
    name = name.encode('ascii', 'ignore').decode('ascii')

    # Remove punctuation except parentheses
    name = re.sub(r'[^\w\s\(\)]', '', name)

    # Remove multiple spaces
    name = re.sub(r'\s+', ' ', name)

    return name.lower().strip()

# Re-apply the standardized function to each dataframe's 'Name' field with acronyms included
dataframes = [df_nyc_gov_agency_list, df_nyc_mayor_office, df_nyc_open_data_portal, df_oda_data, df_cpo_data, df_wegov_data, df_greenbook, df_checkbook]

for df in dataframes:
    df['Name'] = df['Name'].apply(standardize_name)

# Combine dataframes into one for further processing
combined_df = pd.concat(dataframes, ignore_index=True)

# Display the first few rows to verify
#print(combined_df.head())

In [196]:
combined_df.head()

Unnamed: 0,Name,Name - NYC.gov Agency List,URL,Description,Name - NYC.gov Mayor's Office,Contact Name,Contact Title,Name - NYC Open Data Portal,Agency Code,Parent Organization,...,Acronym,Agency Type,Website,Name - ODA,Name - CPO,type,Name - WeGov,Name - Greenbook,Name - Checkbook,Agency Short Name
0,actuary nyc office of the (nycoa),"Actuary, NYC Office of the (NYCOA)",http://www.nyc.gov/actuary,"The New York City Office of the Actuary (""NYCO...",,,,,,,...,,,,,,,,,,
1,administrative justice coordinator nyc office ...,"Administrative Justice Coordinator, NYC Office...",http://www.nyc.gov/ajc,The Office of the Administrative Justice Coord...,,,,,,,...,,,,,,,,,,
2,administrative tax appeals office of,"Administrative Tax Appeals, Office of",http://www.nyc.gov/oata,The Office of Administrative Tax Appeals was e...,,,,,,,...,,,,,,,,,,
3,administrative trials and hearings office of (...,"Administrative Trials and Hearings, Office of ...",http://www.nyc.gov/oath,The NYC Office of Administrative Trials and H...,,,,,,,...,,,,,,,,,,
4,aging department for the (nyc aging),"Aging, Department for the (NYC Aging)",http://www.nyc.gov/aging,NYC Aging funds community-based organizations ...,,,,,,,...,,,,,,,,,,
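
Two of the steps in standardize_name are easy to check in isolation. The sketch below is illustrative only: ascii_fold and expand_ampersand are hypothetical helper names, not part of the notebook. It shows how NFKD decomposition followed by an ASCII encode strips accents, and one possible way to expand a spaced ampersand, which the \b&\b pattern used above does not match.

```python
import re
import unicodedata

def ascii_fold(text):
    """Decompose accented characters (NFKD), then drop anything outside ASCII."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')

def expand_ampersand(text):
    """Replace '&' with ' and ', whether or not it is surrounded by spaces."""
    return re.sub(r'\s*&\s*', ' and ', text)

print(ascii_fold('Lorraine A. Cortés-Vázquez'))
print(expand_ampersand('Health & Mental Hygiene, Department of'))

# The word-boundary pattern leaves a spaced ampersand untouched
print(re.sub(r'\b&\b', 'and', 'Health & Mental Hygiene'))
```

If an expansion like this were adopted, the Greenbook entry would likely standardize to 'health and mental hygiene department of' rather than 'health mental hygiene department of'.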


# Display DataFrame Names and Sizes

In [197]:
dataframes = {
    'NYC Gov Agency List': df_nyc_gov_agency_list,
    'NYC Mayor Office': df_nyc_mayor_office,
    'NYC Open Data Portal': df_nyc_open_data_portal,
    'ODA Data': df_oda_data,
    'CPO Data': df_cpo_data,
    'WeGov Data': df_wegov_data,
    'Greenbook Data': df_greenbook,
    'Checkbook Data': df_checkbook
}

# Print the name and shape of each dataframe
for name, df in dataframes.items():
    print(f"{name}: {df.shape}")

NYC Gov Agency List: (159, 4)
NYC Mayor Office: (176, 5)
NYC Open Data Portal: (88, 3)
ODA Data: (183, 8)
CPO Data: (186, 3)
WeGov Data: (180, 3)
Greenbook Data: (123, 2)
Checkbook Data: (145, 3)


# Combine DataFrames with Agency Names from Multiple Sources

Initializes a combined DataFrame using the 'Name' and 'Name - NYC.gov Agency List' columns from the NYC government agency list. Constructs a list of tuples, each containing a DataFrame and its respective unique agency name column. Iterates through this list, merging each DataFrame with the combined DataFrame based on the standardized 'Name' field, using an outer join to ensure all data is included. The result is a comprehensive DataFrame that aligns agency names across different sources, useful for data comparison and integration.

In [198]:
# Initialize the combined dataframe with the first dataframe's relevant columns
combined_df = df_nyc_gov_agency_list[['Name', 'Name - NYC.gov Agency List']]

# List of tuples containing dataframes and their respective "Name - Dataset" columns
dataframes_to_merge = [
    (df_nyc_mayor_office, 'Name - NYC.gov Mayor\'s Office'),
    (df_nyc_open_data_portal, 'Name - NYC Open Data Portal'),
    (df_oda_data, 'Name - ODA'),
    (df_cpo_data, 'Name - CPO'),
    (df_wegov_data, 'Name - WeGov'),
    (df_greenbook, 'Name - Greenbook'),
    (df_checkbook, 'Name - Checkbook')
]

# Merge each dataframe in the list with the combined dataframe
for df, name_column in dataframes_to_merge:
    combined_df = combined_df.merge(df[['Name', name_column]], on='Name', how='outer')

# Display the head of the combined dataframe to verify
#print(combined_df.head())

In [199]:
combined_df.shape

(760, 9)

In [200]:
combined_df.head()

Unnamed: 0,Name,Name - NYC.gov Agency List,Name - NYC.gov Mayor's Office,Name - NYC Open Data Portal,Name - ODA,Name - CPO,Name - WeGov,Name - Greenbook,Name - Checkbook
0,actuary nyc office of the (nycoa),"Actuary, NYC Office of the (NYCOA)","Actuary, NYC Office of the (NYCOA)",,,,,,
1,administrative justice coordinator nyc office ...,"Administrative Justice Coordinator, NYC Office...","Administrative Justice Coordinator, NYC Office...",,,,,,
2,administrative tax appeals office of,"Administrative Tax Appeals, Office of","Administrative Tax Appeals, Office of",,,,,,
3,administrative trials and hearings office of (...,"Administrative Trials and Hearings, Office of ...","Administrative Trials and Hearings, Office of ...",,,,,,
4,aging department for the (nyc aging),"Aging, Department for the (NYC Aging)","Aging, Department for the (NYC Aging)",,,,,,
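
To see which source contributed each row of an outer join, pandas can add a marker column with indicator=True. The sketch below uses two made-up DataFrames (the names and values are hypothetical) rather than the scraped data.

```python
import pandas as pd

# Toy sources (hypothetical) sharing only one standardized name
left = pd.DataFrame({
    'Name': ['board of elections', 'parks'],
    'Name - A': ['Board of Elections', 'Parks'],
})
right = pd.DataFrame({
    'Name': ['parks', 'sanitation'],
    'Name - B': ['Parks Dept', 'Sanitation'],
})

# indicator=True adds a '_merge' column: 'left_only', 'right_only', or 'both'
merged = left.merge(right, on='Name', how='outer', indicator=True)
coverage = dict(zip(merged['Name'], merged['_merge'].astype(str)))
print(coverage)
```

Because an outer join keeps unmatched keys from both sides, every name variant that fails to match adds a row, which is consistent with the combined DataFrame growing to 760 rows.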


# Extract Acronym and Remove it from the Name Field

In [201]:
import re

def extract_acronym(name):
    # Extract acronym from parentheses
    match = re.search(r'\((.*?)\)', name)
    return match.group(1).upper() if match else ''

# Apply the function to create the Acronym field
combined_df['Acronym'] = combined_df['Name'].apply(extract_acronym)

# Display the updated DataFrame
print(combined_df[['Name', 'Acronym']].head())

                                                Name    Acronym
0                  actuary nyc office of the (nycoa)      NYCOA
1  administrative justice coordinator nyc office ...        AJC
2               administrative tax appeals office of           
3  administrative trials and hearings office of (...       OATH
4               aging department for the (nyc aging)  NYC AGING


In [202]:
def remove_acronym(name):
    # Remove the acronym if present
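    # Replace the parenthesized acronym with a single space so that words on
    # either side of a mid-name acronym are not fused together
    return re.sub(r'\s*\([^)]+\)\s*', ' ', name).strip()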

# Apply the function to create a new Name field without acronyms
combined_df['Name'] = combined_df['Name'].apply(remove_acronym)

# Display the updated DataFrame
print(combined_df[['Name', 'Acronym']].head())

                                               Name    Acronym
0                         actuary nyc office of the      NYCOA
1  administrative justice coordinator nyc office of        AJC
2              administrative tax appeals office of           
3      administrative trials and hearings office of       OATH
4                          aging department for the  NYC AGING


In [203]:
combined_df.shape

(760, 10)
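
The extraction and removal steps can also be combined into a single pass that returns both pieces. This is a sketch, not the notebook's method: split_acronym is a hypothetical helper, and it assumes at most one parenthesized acronym per name.

```python
import re

# Matches a parenthesized group plus any whitespace around it
ACRONYM_PATTERN = re.compile(r'\s*\(([^)]+)\)\s*')

def split_acronym(name):
    """Return (name_without_acronym, ACRONYM); the acronym is '' if none is found."""
    match = ACRONYM_PATTERN.search(name)
    if not match:
        return name.strip(), ''
    acronym = match.group(1).strip().upper()
    # Substitute a single space so surrounding words stay separated
    stripped = ACRONYM_PATTERN.sub(' ', name, count=1).strip()
    return stripped, acronym

print(split_acronym('aging department for the (nyc aging)'))
print(split_acronym('administrative tax appeals office of'))
```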

# Merge just based on Acronym

In [204]:
# Function to merge records based on matching acronyms
def merge_on_acronym(df):
    # Create a dictionary to hold merged records
    merged_records = {}
    no_acronym_records = []

    # Iterate through each row in the DataFrame
    for index, row in df.iterrows():
        acronym = row['Acronym']
        if acronym:
            if acronym in merged_records:
                # Merge the current record with the existing record
                for col in df.columns:
                    if pd.notnull(row[col]):
                        if col not in merged_records[acronym]:
                            merged_records[acronym][col] = row[col]
                        elif pd.isnull(merged_records[acronym][col]):
                            merged_records[acronym][col] = row[col]
                        elif col.startswith('Name -') and row[col] not in merged_records[acronym][col]:
                            merged_records[acronym][col] += '; ' + row[col]
            else:
                # Add the record to the dictionary
                merged_records[acronym] = row.to_dict()
        else:
            # Add records without acronyms to a separate list
            no_acronym_records.append(row.to_dict())

    # Convert the merged records dictionary back to a DataFrame
    merged_df = pd.DataFrame.from_dict(merged_records, orient='index')

    # Convert the no_acronym_records list back to a DataFrame
    no_acronym_df = pd.DataFrame(no_acronym_records)

    # Concatenate the merged_df and no_acronym_df DataFrames
    final_df = pd.concat([merged_df, no_acronym_df], ignore_index=True)

    return final_df

# Perform the merge based on acronym
merged_df = merge_on_acronym(combined_df)

# Display the merged DataFrame
print(merged_df.head())

# Save the merged DataFrame to a CSV file if needed
# merged_df.to_csv('merged_agency_names.csv', index=False)

# Continue with the dedupe process using merged_df
combined_df = merged_df

                                               Name  \
0                         actuary nyc office of the   
1  administrative justice coordinator nyc office of   
2      administrative trials and hearings office of   
3                          aging department for the   
4                     appointments mayors office of   

                          Name - NYC.gov Agency List  \
0                 Actuary, NYC Office of the (NYCOA)   
1  Administrative Justice Coordinator, NYC Office...   
2  Administrative Trials and Hearings, Office of ...   
3              Aging, Department for the (NYC Aging)   
4              Appointments, Mayor's Office of (MOA)   

                       Name - NYC.gov Mayor's Office  \
0                 Actuary, NYC Office of the (NYCOA)   
1  Administrative Justice Coordinator, NYC Office...   
2  Administrative Trials and Hearings, Office of ...   
3              Aging, Department for the (NYC Aging)   
4              Appointments, Mayor's Office of (MOA)

In [205]:
combined_df.shape

(690, 10)

# Merge just based on Name field

In [206]:
# Function to merge records based on matching names
def merge_on_name(df):
    # Create a dictionary to hold merged records
    merged_records = {}
    no_name_records = []

    # Iterate through each row in the DataFrame
    for index, row in df.iterrows():
        name = row['Name']
        if name:
            if name in merged_records:
                # Merge the current record with the existing record
                for col in df.columns:
                    if pd.notnull(row[col]):
                        if col not in merged_records[name]:
                            merged_records[name][col] = row[col]
                        elif pd.isnull(merged_records[name][col]):
                            merged_records[name][col] = row[col]
                        elif col.startswith('Acronym -') and row[col] not in merged_records[name][col]:
                            merged_records[name][col] += '; ' + row[col]
            else:
                # Add the record to the dictionary
                merged_records[name] = row.to_dict()
        else:
            # Add records without names to a separate list
            no_name_records.append(row.to_dict())

    # Convert the merged records dictionary back to a DataFrame
    merged_df = pd.DataFrame.from_dict(merged_records, orient='index')

    # Convert the no_name_records list back to a DataFrame
    no_name_df = pd.DataFrame(no_name_records)

    # Concatenate the merged_df and no_name_df DataFrames
    final_df = pd.concat([merged_df, no_name_df], ignore_index=True)

    return final_df

# Perform the merge based on acronym
merged_df = merge_on_acronym(combined_df)

# Perform the merge based on name
merged_df = merge_on_name(merged_df)

# Display the merged DataFrame
print(merged_df.head())

# Save the merged DataFrame to a CSV file if needed
# merged_df.to_csv('merged_agency_names.csv', index=False)

# Continue with the dedupe process using merged_df
combined_df = merged_df

                                               Name  \
0                         actuary nyc office of the   
1  administrative justice coordinator nyc office of   
2      administrative trials and hearings office of   
3                          aging department for the   
4                     appointments mayors office of   

                          Name - NYC.gov Agency List  \
0                 Actuary, NYC Office of the (NYCOA)   
1  Administrative Justice Coordinator, NYC Office...   
2  Administrative Trials and Hearings, Office of ...   
3              Aging, Department for the (NYC Aging)   
4              Appointments, Mayor's Office of (MOA)   

                       Name - NYC.gov Mayor's Office  \
0                 Actuary, NYC Office of the (NYCOA)   
1  Administrative Justice Coordinator, NYC Office...   
2  Administrative Trials and Hearings, Office of ...   
3              Aging, Department for the (NYC Aging)   
4              Appointments, Mayor's Office of (MOA)

In [207]:
combined_df.shape

(582, 10)
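
The row-by-row logic in `merge_on_acronym` and `merge_on_name` can also be expressed as a groupby. The following is a minimal sketch on a toy frame, not the notebook's method: `merge_by_key` and the sample rows are hypothetical, and the join rule (distinct values joined with "; ") only approximates the substring check used above.

```python
import pandas as pd

def merge_by_key(df, key, join_prefix):
    """Collapse rows sharing `key` (hypothetical helper, not the notebook's code)."""
    def join_distinct(series):
        values = series.dropna().astype(str).unique()
        return "; ".join(values) if len(values) else None

    def first_non_null(series):
        non_null = series.dropna()
        return non_null.iloc[0] if len(non_null) else None

    has_key = df[key].notna() & (df[key] != "")
    # Columns with the given prefix are joined; all others keep the first non-null value
    agg = {
        col: (join_distinct if col.startswith(join_prefix) else first_non_null)
        for col in df.columns if col != key
    }
    merged = df[has_key].groupby(key, sort=False).agg(agg).reset_index()
    # Rows without a key cannot be matched, so they pass through unchanged
    rest = df[~has_key]
    return pd.concat([merged, rest], ignore_index=True) if len(rest) else merged

toy = pd.DataFrame({
    "Name": ["board of elections", "board of elections", "actuary office of"],
    "Name - ODA": ["Board of Elections (BOE)", None, "Office of the Actuary"],
    "Acronym - ODA": ["BOE", "BOENY", "NYCOA"],
})
toy_merged = merge_by_key(toy, key="Name", join_prefix="Acronym -")
print(toy_merged)
```

The two "board of elections" rows collapse into one, with both acronyms retained in the joined column.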

# Merging to dedupe - use generated list of confirmed matches to deduplicate the dataframe

I worked directly in several ChatGPT instances to develop and run code that generated lists of potential matches, then manually reviewed the output and labeled every candidate pair. I then deduplicated combined_df, generated a second round of candidate pairs, and labeled those as well. Finally, I combined the labeled pairs from both rounds into consolidated_matches.
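
The pair-generation code itself is not included in this notebook. As an illustration only, one way to produce candidate pairs for manual labeling is string similarity from the standard library's `difflib`. The function name `candidate_pairs`, the sample names, and the 0.9 threshold are all assumptions, not what was actually run.

```python
from difflib import SequenceMatcher
from itertools import combinations

def candidate_pairs(names, threshold=0.85):
    """Return (name_1, name_2, score) tuples scoring at or above `threshold`."""
    pairs = []
    # Compare every unordered pair of distinct names once
    for a, b in combinations(sorted(set(names)), 2):
        score = SequenceMatcher(None, a, b).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 3)))
    # Highest-scoring pairs first, so likely matches are reviewed first
    return sorted(pairs, key=lambda p: p[2], reverse=True)

sample_names = [
    "board of elections",
    "board of election",
    "board of correction",
    "actuary office of",
]
pairs = candidate_pairs(sample_names, threshold=0.9)
for name_1, name_2, score in pairs:
    print(name_1, "|", name_2, "|", score)
```

Pairwise comparison is quadratic in the number of names, which is workable for a few hundred rows but would need blocking or indexing for much larger lists.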

In [208]:
import pandas as pd

# Load the data
consolidated_matches_path = '/content/drive/MyDrive/Projects/ODA/Agency Name Project/Output/consolidated_matches.csv'
consolidated_matches = pd.read_csv(consolidated_matches_path)

# Assuming combined_data exists in your notebook as a dataframe called combined_df
# combined_df = ...

# Normalize Names
consolidated_matches['Name_1'] = consolidated_matches['Name_1'].str.lower().str.strip()
consolidated_matches['Name_2'] = consolidated_matches['Name_2'].str.lower().str.strip()
combined_df['Name'] = combined_df['Name'].str.lower().str.strip()

# Filter the matches to only include "Match" records
filtered_matches = consolidated_matches[consolidated_matches['Label'] == 'Match']

# Create a new mapping dictionary from filtered matches
name_mapping_filtered = pd.Series(filtered_matches['Name_2'].values,
                                  index=filtered_matches['Name_1']).to_dict()

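# Function to collapse chains and cycles in the mapping so every name
# points directly at its final standardized name
def remove_cycles(mapping):
    resolved_mapping = {}
    visited = set()

    def resolve(name):
        path = []
        # Follow the chain until it ends or reaches a name already processed
        while name in mapping and name not in visited:
            path.append(name)
            visited.add(name)
            name = mapping[name]
        # If the walk stopped at an already-resolved name, reuse its final target
        name = resolved_mapping.get(name, name)
        for p in path:
            resolved_mapping[p] = name
        return name

    for key in mapping:
        resolve(key)

    return resolved_mapping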

# Remove cycles from the mapping
resolved_mapping = remove_cycles(name_mapping_filtered)

# Function to map variations to standardized names
def map_name(name, mapping):
    return mapping.get(name, name)

# Apply the Mapping directly to the 'Name' field
combined_df['Name'] = combined_df['Name'].apply(lambda name: map_name(name, resolved_mapping))

# Reapply mapping to ensure consistency
combined_df['Name'] = combined_df['Name'].apply(lambda name: map_name(name, resolved_mapping))

# Create a new column "Merged Names" to store the list of merged names
merged_names_dict = combined_df.groupby('Name')['Name'].apply(lambda x: '; '.join(x.unique())).to_dict()
combined_df['Merged Names'] = combined_df['Name'].map(merged_names_dict)

# Aggregate Data, keeping the first non-null value for each column
deduplicated_df = combined_df.groupby('Name').first().reset_index()

# Ensure the "Merged Names" column is properly aggregated
deduplicated_df['Merged Names'] = deduplicated_df['Name'].map(merged_names_dict)

# Display the deduplicated dataframe
deduplicated_df.head()

Unnamed: 0,Name,Name - NYC.gov Agency List,Name - NYC.gov Mayor's Office,Name - NYC Open Data Portal,Name - ODA,Name - CPO,Name - WeGov,Name - Greenbook,Name - Checkbook,Acronym,Merged Names
0,actuary office of,"Actuary, NYC Office of the (NYCOA)","Actuary, NYC Office of the (NYCOA)",,Office of the Actuary,Office of the Actuary,Office of the Actuary,"Actuary, Office of",Office of the Actuary,NYCOA,actuary office of
1,administration for childrens services,"Children's Services, Administration for (ACS)","Children's Services, Administration for (ACS)",Administration for Children’s Services (ACS),Administration for Children's Services (ACS),Administration for Children's Services,Administration for Children's Services,"Children's Services, Administration for",Administration for Children's Services,ACS,administration for childrens services
2,association for a better new york,,,,Association for a Better New York (ABNY),,,,,ABNY,association for a better new york
3,board of correction,"Correction, Board of (BOC)","Correction, Board of (BOC)",,Board of Correction (BOC),Board of Correction,Board of Correction,"Correction, Board of",Board of Correction,BOC,board of correction
4,board of elections,"Elections, Board of (BOE)","Elections, Board of (BOE)",Board of Elections (BOENY),Board of Elections (BOE),,Board of Elections,"Elections, Board of",Board of Elections,BOE,board of elections


In [209]:
deduplicated_df.shape

(334, 11)
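
In the rows shown above, "Merged Names" repeats "Name" exactly. That is what one would expect when the grouping runs after the mapping has already overwritten the variant spellings. A minimal sketch of one way to keep the variants, assuming a hypothetical "Original Name" column and an illustrative mapping (not taken from the labeled matches file):

```python
import pandas as pd

names = pd.DataFrame(
    {"Name": ["board of election", "board of elections", "actuary office of"]}
)
mapping = {"board of election": "board of elections"}  # illustrative only

# Keep the pre-mapping spelling so it can still be listed after standardizing
names["Original Name"] = names["Name"]
names["Name"] = names["Name"].map(lambda n: mapping.get(n, n))

# Collect every original spelling that ended up under each standardized name
merged_names = (
    names.groupby("Name")["Original Name"]
    .apply(lambda s: "; ".join(sorted(s.unique())))
)

# One row per standardized name, first non-null value per column
deduped = names.groupby("Name", as_index=False).first()
deduped["Merged Names"] = deduped["Name"].map(merged_names)
print(deduped[["Name", "Merged Names"]])
```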

In [210]:
# Order the dataframe by 'Name' in ascending order
deduplicated_df = deduplicated_df.sort_values(by='Name').reset_index(drop=True)

# Add an 'ID' column with sequential numbers for each row
deduplicated_df['ID'] = range(1, len(deduplicated_df) + 1)

# Display the updated dataframe
deduplicated_df.head()

Unnamed: 0,Name,Name - NYC.gov Agency List,Name - NYC.gov Mayor's Office,Name - NYC Open Data Portal,Name - ODA,Name - CPO,Name - WeGov,Name - Greenbook,Name - Checkbook,Acronym,Merged Names,ID
0,actuary office of,"Actuary, NYC Office of the (NYCOA)","Actuary, NYC Office of the (NYCOA)",,Office of the Actuary,Office of the Actuary,Office of the Actuary,"Actuary, Office of",Office of the Actuary,NYCOA,actuary office of,1
1,administration for childrens services,"Children's Services, Administration for (ACS)","Children's Services, Administration for (ACS)",Administration for Children’s Services (ACS),Administration for Children's Services (ACS),Administration for Children's Services,Administration for Children's Services,"Children's Services, Administration for",Administration for Children's Services,ACS,administration for childrens services,2
2,association for a better new york,,,,Association for a Better New York (ABNY),,,,,ABNY,association for a better new york,3
3,board of correction,"Correction, Board of (BOC)","Correction, Board of (BOC)",,Board of Correction (BOC),Board of Correction,Board of Correction,"Correction, Board of",Board of Correction,BOC,board of correction,4
4,board of elections,"Elections, Board of (BOE)","Elections, Board of (BOE)",Board of Elections (BOENY),Board of Elections (BOE),,Board of Elections,"Elections, Board of",Board of Elections,BOE,board of elections,5


In [211]:
deduplicated_df.shape

(334, 12)

# Filtering Out of Scope Entities

- Flagging entities that exist as multiple administrative units based on geography (such as community boards and district attorneys, keeping just one record for "Community Boards")

- Flagging non-New York City entities that are out of scope for this exercise (like New York State courts)

In [212]:
# Add a new column "Instance Of" and initialize with None
deduplicated_df['Instance Of'] = None

# Define lists of names for different categories
community_boards = ["bronx community board", "brooklyn community board", "manhattan community board", "queens community board", "staten island community board"]
borough_presidents = ["bronx borough president", "brooklyn borough president", "manhattan borough president", "queens borough president", "staten island borough president"]
district_attorney = ["district attorney bronx county", "district attorney kings county", "district attorney new york county", "district attorney queens county", "district attorney richmond county"]
public_administrator = ["public administrator bronx county", "public administrator kings county", "public administrator new york county", "public administrator queens county", "public administrator richmond county"]

# Assign values to the "Instance Of" column based on the Name column
deduplicated_df.loc[deduplicated_df['Name'].str.lower().str.startswith(tuple(community_boards)), 'Instance Of'] = 'community boards'
deduplicated_df.loc[deduplicated_df['Name'].str.lower().isin(borough_presidents), 'Instance Of'] = 'borough presidents'
deduplicated_df.loc[deduplicated_df['Name'].str.lower().isin(district_attorney), 'Instance Of'] = 'district attorney'
deduplicated_df.loc[deduplicated_df['Name'].str.lower().isin(public_administrator), 'Instance Of'] = 'public administrator'

# Create a new row for borough presidents
new_row = pd.DataFrame([{'Name': 'borough presidents', 'Instance Of': 'borough presidents'}])
deduplicated_df = pd.concat([deduplicated_df, new_row], ignore_index=True)

In [213]:
deduplicated_df.shape

(335, 13)

In [214]:
# Create a new column "Out of Scope" and initialize with None
deduplicated_df['Out of Scope'] = None

# Set "Out of Scope" to "Out of Scope" for rows with a non-null value in "Instance Of"
deduplicated_df.loc[deduplicated_df['Instance Of'].notnull(), 'Out of Scope'] = 'Out of Scope'

# Set "Out of Scope" to "Out of Scope" for rows where the Name starts with "new york state" or "nys"
deduplicated_df.loc[deduplicated_df['Name'].str.lower().str.startswith('new york state'), 'Out of Scope'] = 'Out of Scope'
deduplicated_df.loc[deduplicated_df['Name'].str.lower().str.startswith('nys'), 'Out of Scope'] = 'Out of Scope'

# Filter rows to create a new DataFrame "in_scope_agencies" where "Out of Scope" is null
in_scope_agencies = deduplicated_df[deduplicated_df['Out of Scope'].isnull()]

In [215]:
in_scope_agencies.shape

(243, 14)

In [216]:
in_scope_agencies.head()

Unnamed: 0,Name,Name - NYC.gov Agency List,Name - NYC.gov Mayor's Office,Name - NYC Open Data Portal,Name - ODA,Name - CPO,Name - WeGov,Name - Greenbook,Name - Checkbook,Acronym,Merged Names,ID,Instance Of,Out of Scope
0,actuary office of,"Actuary, NYC Office of the (NYCOA)","Actuary, NYC Office of the (NYCOA)",,Office of the Actuary,Office of the Actuary,Office of the Actuary,"Actuary, Office of",Office of the Actuary,NYCOA,actuary office of,1.0,,
1,administration for childrens services,"Children's Services, Administration for (ACS)","Children's Services, Administration for (ACS)",Administration for Children’s Services (ACS),Administration for Children's Services (ACS),Administration for Children's Services,Administration for Children's Services,"Children's Services, Administration for",Administration for Children's Services,ACS,administration for childrens services,2.0,,
2,association for a better new york,,,,Association for a Better New York (ABNY),,,,,ABNY,association for a better new york,3.0,,
3,board of correction,"Correction, Board of (BOC)","Correction, Board of (BOC)",,Board of Correction (BOC),Board of Correction,Board of Correction,"Correction, Board of",Board of Correction,BOC,board of correction,4.0,,
4,board of elections,"Elections, Board of (BOE)","Elections, Board of (BOE)",Board of Elections (BOENY),Board of Elections (BOE),,Board of Elections,"Elections, Board of",Board of Elections,BOE,board of elections,5.0,,


# Choosing a preferred name for each agency

The format of the CPO data source seems closest to what appears in the City Charter, so I use it as the base and make a best guess at the preferred name for each remaining row. I may go back later and create a rubric for these decisions.

I did the evaluation manually, then uploaded the results here and matched the agencies against them.
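
If the manual evaluation were ever replaced by a rubric, a simple starting point could be a fixed source priority with CPO first. The sketch below is only an assumption about what such a rubric might look like: the order after CPO in `SOURCE_PRIORITY` is hypothetical, and the sample rows are illustrative. The acronym-stripping pattern is the same one used in the commented-out `strip_acronyms` cell further down.

```python
import re
import pandas as pd

# Assumed priority: CPO first (per the text above); the remaining order is a guess
SOURCE_PRIORITY = ["Name - CPO", "Name - ODA", "Name - NYC Open Data Portal"]

def strip_acronym(name):
    """Remove a parenthesized acronym such as ' (BOC)' from a name."""
    return re.sub(r"\s*\([^)]*\)", "", name).strip()

def preferred_name(row):
    """Return the first non-empty source name in priority order, without acronym."""
    for col in SOURCE_PRIORITY:
        value = row.get(col)
        if isinstance(value, str) and value.strip():
            return strip_acronym(value)
    return None

sample = pd.DataFrame({
    "Name - CPO": ["Board of Correction", None],
    "Name - ODA": ["Board of Correction (BOC)", "Association for a Better New York (ABNY)"],
    "Name - NYC Open Data Portal": [None, None],
})
sample["Name - Preferred"] = sample.apply(preferred_name, axis=1)
print(sample["Name - Preferred"].tolist())
```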

In [217]:
# # Code to select the Preferred Name for each row

# # Add a new column "Name - Preferred" using .loc to avoid SettingWithCopyWarning
# in_scope_agencies.loc[:, 'Name - Preferred'] = ''

# # Use .loc with apply method to set "Name - Preferred" to the value in "Name - CPO" if it exists
# in_scope_agencies.loc[:, 'Name - Preferred'] = in_scope_agencies.apply(
#     lambda row: row['Name - CPO'] if pd.notnull(row['Name - CPO']) else '',
#     axis=1
# )

# # Function to prompt user to choose a value for "Name - Preferred"
# def choose_name(row):
#     options = [
#         "Name - NYC.gov Agency List",
#         "Name - NYC.gov Mayor's Office",
#         "Name - NYC Open Data Portal",
#         "Name - ODA",
#         "Name - CPO",
#         "Name - WeGov",
#         "Name - Greenbook",
#         "Name - Checkbook"
#     ]

#     print("\nChoose a preferred name for the following row:")
#     for idx, option in enumerate(options, 1):
#         value = row.get(option, '')
#         print(f"{str(idx).ljust(2)}. {option.ljust(30)} : {str(value).rjust(50)}")

#     choice = input("\nEnter the number of the preferred name (or leave blank to skip): ")
#     if choice.isdigit():
#         choice = int(choice)
#         if 1 <= choice <= len(options):
#             return row[options[choice - 1]]
#     return ''

# # Iterate over rows where "Name - Preferred" is empty
# for index, row in in_scope_agencies[in_scope_agencies['Name - Preferred'] == ''].iterrows():
#     preferred_name = choose_name(row)
#     in_scope_agencies.at[index, 'Name - Preferred'] = preferred_name

In [218]:
# Path to the CSV file
# preferred_name = '/content/drive/MyDrive/Projects/ODA/Agency Name Project/Output/preferred_name.csv'

# Load the CSV file into a DataFrame
# preferred_name = pd.read_csv(preferred_name)

In [219]:
# Merge in_scope_agencies with preferred_name_df to add the "Preferred Name" column based on the "Name" field
# in_scope_agencies = in_scope_agencies.merge(preferred_name[['Name', 'Name - Preferred']], on='Name', how='left')

In [220]:
# import re

# # Function to remove acronyms and parentheses
# def strip_acronyms(name):
#     if isinstance(name, str):
#         return re.sub(r'\s*\([^)]*\)', '', name).strip()
#     return name

# # Strip acronyms from the "Name - Preferred" column
# if 'Name - Preferred' in in_scope_agencies.columns:
#     in_scope_agencies['Name - Preferred'] = in_scope_agencies['Name - Preferred'].apply(strip_acronyms)

# Merging in Manually Compiled Preferred Name, Agency Type, and Legal Citation information

In [225]:
# Path to the CSV file
agencies_with_charter_citation = '/content/drive/MyDrive/Projects/ODA/Agency Name Project/Output/agencies_with_charter_citation.csv'

# Load the CSV file into a DataFrame
agencies_with_charter_citation = pd.read_csv(agencies_with_charter_citation)

In [226]:
agencies_with_charter_citation.shape

(253, 11)

In [227]:
in_scope_agencies.shape

(243, 14)

In [228]:
agencies_enhanced = pd.merge(in_scope_agencies, agencies_with_charter_citation, on="Name", how="left")
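
The left merge above joins frames of different sizes (243 and 253 rows), so some rows may not line up. Rows from `in_scope_agencies` with no match receive a null "Agency Operational Status" and are then dropped by the Active filter below. A minimal sketch of how unmatched rows could be surfaced first, using pandas' `indicator` and `validate` options on toy data (the sample rows are illustrative):

```python
import pandas as pd

left = pd.DataFrame(
    {"Name": ["board of correction", "board of elections", "actuary office of"]}
)
right = pd.DataFrame({
    "Name": ["board of correction", "board of elections", "civic engagement commission"],
    "Agency Operational Status": ["Active", "Active", "Active"],
})

# indicator=True adds a '_merge' column; validate raises if keys are not unique
checked = left.merge(
    right, on="Name", how="left", indicator=True, validate="one_to_one"
)

# Names present on the left with no counterpart on the right
unmatched = checked.loc[checked["_merge"] == "left_only", "Name"].tolist()
print(unmatched)
```

Reviewing the unmatched names before filtering makes it easier to tell a genuinely inactive agency from a spelling mismatch in the join key.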

Filter for just Active Agencies

In [244]:
agencies_enhanced = agencies_enhanced[agencies_enhanced["Agency Operational Status"] == "Active"]

In [245]:
agencies_enhanced.shape

(208, 24)

In [246]:
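# Sanity check: rows with no operational status.
# This is empty here because the Active filter above already excludes nulls.
agencies_enhanced_null_status = agencies_enhanced[agencies_enhanced["Agency Operational Status"].isna()]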

In [247]:
agencies_enhanced_null_status.head(20)

Unnamed: 0,Name,Name - NYC.gov Agency List,Name - NYC.gov Mayor's Office,Name - NYC Open Data Portal,Name - ODA,Name - CPO,Name - WeGov,Name - Greenbook,Name - Checkbook,Acronym,...,Name - Preferred,NYC Administrative Organization Type,Agency Operational Status,Parent Organization,Authorizing Authority,Legal Citation,Legal Citation URL,Legal Citation Text,Additional Citation,Additional Citation URL


In [267]:
agencies_enhanced.columns

Index(['Name', 'Name - NYC.gov Agency List', 'Name - NYC.gov Mayor's Office',
       'Name - NYC Open Data Portal', 'Name - ODA', 'Name - CPO',
       'Name - WeGov', 'Name - Greenbook', 'Name - Checkbook', 'Acronym',
       'Merged Names', 'ID', 'Instance Of', 'Out of Scope', 'Name - Preferred',
       'NYC Administrative Organization Type', 'Agency Operational Status',
       'Parent Organization', 'Authorizing Authority', 'Legal Citation',
       'Legal Citation URL', 'Legal Citation Text', 'Additional Citation',
       'Additional Citation URL'],
      dtype='object')

In [269]:
df_nyc_gov_agency_list.columns

Index(['Name', 'Name - NYC.gov Agency List', 'URL', 'Description'], dtype='object')

In [270]:
# Merge the DataFrames on 'Name - NYC.gov Agency List'
agency_name_final = agencies_enhanced.merge(df_nyc_gov_agency_list[['Name - NYC.gov Agency List', 'URL']],
                                           on='Name - NYC.gov Agency List',
                                           how='left')

# Rename the 'URL' column to 'Organization Website'
agency_name_final.rename(columns={'URL': 'Organization Website'}, inplace=True)

In [272]:
df_nyc_mayor_office.columns

Index(['Name', 'Name - NYC.gov Mayor's Office', 'URL', 'Contact Name',
       'Contact Title'],
      dtype='object')

In [281]:
df_oda_data.columns

Index(['Name', 'Agency Code', 'Parent Organization', 'Child Organization(s)',
       'Acronym', 'Agency Type', 'Website', 'Name - ODA'],
      dtype='object')

In [295]:
agencies_enhanced.columns

Index(['Name', 'Name - NYC.gov Agency List', 'Name - NYC.gov Mayor's Office',
       'Name - NYC Open Data Portal', 'Name - ODA', 'Name - CPO',
       'Name - WeGov', 'Name - Greenbook', 'Name - Checkbook', 'Acronym',
       'Merged Names', 'ID', 'Instance Of', 'Out of Scope', 'Name - Preferred',
       'NYC Administrative Organization Type', 'Agency Operational Status',
       'Parent Organization', 'Authorizing Authority', 'Legal Citation',
       'Legal Citation URL', 'Legal Citation Text', 'Additional Citation',
       'Additional Citation URL'],
      dtype='object')

In [343]:
# Merge the DataFrames on 'Name - NYC.gov Agency List' and keep all original columns
agency_name_final = agencies_enhanced.merge(df_nyc_gov_agency_list[['Name - NYC.gov Agency List', 'URL', 'Description']],
                                            on='Name - NYC.gov Agency List',
                                            how='left')

# Rename the 'URL' column to 'Organization Website'
agency_name_final.rename(columns={'URL': 'Organization Website'}, inplace=True)

# Merge with df_nyc_mayor_office to fill in missing 'Organization Website' and add 'Contact Name'
agency_name_final = agency_name_final.merge(df_nyc_mayor_office[['Name - NYC.gov Mayor\'s Office', 'URL', 'Contact Name']],
                                            left_on='Name - NYC.gov Mayor\'s Office',
                                            right_on='Name - NYC.gov Mayor\'s Office',
                                            how='left')

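# Fill in missing 'Organization Website' values from the Mayor's Office data
# (assign the result back; calling fillna with inplace=True on a selected column is unreliable in recent pandas)
agency_name_final['Organization Website'] = agency_name_final['Organization Website'].fillna(agency_name_final['URL'])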

# Merge with df_oda_data to fill in remaining missing 'Organization Website'.
# Join on 'Name - ODA', which both frames share (the NYC.gov list names are formatted differently).
agency_name_final = agency_name_final.merge(df_oda_data[['Name - ODA', 'Website']],
                                            on='Name - ODA',
                                            how='left')

# Fill in missing 'Organization Website' values from the ODA data
agency_name_final['Organization Website'] = agency_name_final['Organization Website'].fillna(agency_name_final['Website'])

# Drop temporary columns
agency_name_final.drop(columns=['URL', 'Website'], errors='ignore', inplace=True)

# Drop any leftover suffixed columns ending with '_x' or '_y'
agency_name_final = agency_name_final.loc[:, ~agency_name_final.columns.str.endswith(('_x', '_y'))]

In [344]:
agency_name_final.columns

Index(['Name', 'Name - NYC.gov Agency List', 'Name - NYC.gov Mayor's Office',
       'Name - NYC Open Data Portal', 'Name - ODA', 'Name - CPO',
       'Name - WeGov', 'Name - Greenbook', 'Name - Checkbook', 'Acronym',
       'Merged Names', 'ID', 'Instance Of', 'Out of Scope', 'Name - Preferred',
       'NYC Administrative Organization Type', 'Agency Operational Status',
       'Parent Organization', 'Authorizing Authority', 'Legal Citation',
       'Legal Citation URL', 'Legal Citation Text', 'Additional Citation',
       'Additional Citation URL', 'Organization Website', 'Description',
       'Contact Name'],
      dtype='object')

In [345]:
df_nyc_gov_agency_list.columns

Index(['Name', 'Name - NYC.gov Agency List', 'URL', 'Description'], dtype='object')

In [346]:
df_nyc_mayor_office.columns

Index(['Name', 'Name - NYC.gov Mayor's Office', 'URL', 'Contact Name',
       'Contact Title'],
      dtype='object')

In [347]:
# Merge to add the 'Description' column
agency_name_final = agency_name_final.merge(df_nyc_gov_agency_list[['Name - NYC.gov Agency List', 'Description']],
                                            on='Name - NYC.gov Agency List',
                                            how='left')

In [348]:
# Merge to add the 'Contact Name' column
agency_name_final = agency_name_final.merge(df_nyc_mayor_office[['Name - NYC.gov Mayor\'s Office', 'Contact Name', 'Contact Title']],
                                            on='Name - NYC.gov Mayor\'s Office',
                                            how='left')

In [349]:
# Merge to add the 'Open Datasets URL' column
agency_name_final = agency_name_final.merge(df_nyc_open_data_portal[['Name - NYC Open Data Portal', 'URL']],
                                            left_on='Name - NYC Open Data Portal',
                                            right_on='Name - NYC Open Data Portal',
                                            how='left')

# Rename the 'URL' column to 'Open Datasets URL'
agency_name_final.rename(columns={'URL': 'Open Datasets URL'}, inplace=True)

In [350]:
agency_name_final.columns

Index(['Name', 'Name - NYC.gov Agency List', 'Name - NYC.gov Mayor's Office',
       'Name - NYC Open Data Portal', 'Name - ODA', 'Name - CPO',
       'Name - WeGov', 'Name - Greenbook', 'Name - Checkbook', 'Acronym',
       'Merged Names', 'ID', 'Instance Of', 'Out of Scope', 'Name - Preferred',
       'NYC Administrative Organization Type', 'Agency Operational Status',
       'Parent Organization', 'Authorizing Authority', 'Legal Citation',
       'Legal Citation URL', 'Legal Citation Text', 'Additional Citation',
       'Additional Citation URL', 'Organization Website', 'Description_x',
       'Contact Name_x', 'Description_y', 'Contact Name_y', 'Contact Title',
       'Open Datasets URL'],
      dtype='object')
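
The `_x` and `_y` suffixes in the column list above appear because "Description" and "Contact Name" were already present when they were merged in a second time. One way to avoid this is to merge only the columns the left frame does not yet have. This is a minimal sketch: `merge_new_columns`, the "Key" column, and the sample rows are hypothetical.

```python
import pandas as pd

def merge_new_columns(left, right, on):
    """Left-merge only the columns of `right` that `left` lacks (hypothetical helper)."""
    new_cols = [c for c in right.columns if c not in left.columns]
    # De-duplicate the key on the right so the merge cannot multiply rows
    right_unique = right.drop_duplicates(subset=on)
    return left.merge(right_unique[[on] + new_cols], on=on, how="left")

final = pd.DataFrame({"Key": ["a", "b"], "Description": ["first", "second"]})
extra = pd.DataFrame({
    "Key": ["a", "b"],
    "Description": ["first", "second"],
    "Contact Title": ["Commissioner", "Chair"],
})
result = merge_new_columns(final, extra, on="Key")
print(list(result.columns))
```

With this approach the later rename from "Description_x" to "Description" would not be needed, since no suffixed columns are created.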

In [351]:
# Merge to add the 'Agency Code' column
agency_name_final = agency_name_final.merge(df_oda_data[['Name - ODA', 'Agency Code']],
                                            left_on='Name - ODA',
                                            right_on='Name - ODA',
                                            how='left')

In [352]:
agency_name_final.head(2)

Unnamed: 0,Name,Name - NYC.gov Agency List,Name - NYC.gov Mayor's Office,Name - NYC Open Data Portal,Name - ODA,Name - CPO,Name - WeGov,Name - Greenbook,Name - Checkbook,Acronym,...,Additional Citation,Additional Citation URL,Organization Website,Description_x,Contact Name_x,Description_y,Contact Name_y,Contact Title,Open Datasets URL,Agency Code
0,actuary office of,"Actuary, NYC Office of the (NYCOA)","Actuary, NYC Office of the (NYCOA)",,Office of the Actuary,Office of the Actuary,Office of the Actuary,"Actuary, Office of",Office of the Actuary,NYCOA,...,,,http://www.nyc.gov/actuary,"The New York City Office of the Actuary (""NYCO...",Marek Tyszkiewicz,"The New York City Office of the Actuary (""NYCO...",Marek Tyszkiewicz,Chief Actuary,,8
1,administration for childrens services,"Children's Services, Administration for (ACS)","Children's Services, Administration for (ACS)",Administration for Children’s Services (ACS),Administration for Children's Services (ACS),Administration for Children's Services,Administration for Children's Services,"Children's Services, Administration for",Administration for Children's Services,ACS,...,,,http://www.nyc.gov/acs,The Administration for Children's Services (AC...,Jess Dannhauser,The Administration for Children's Services (AC...,Jess Dannhauser,Commissioner,https://data.cityofnewyork.us/browse?Dataset-I...,68


In [353]:
agency_name_final.columns

Index(['Name', 'Name - NYC.gov Agency List', 'Name - NYC.gov Mayor's Office',
       'Name - NYC Open Data Portal', 'Name - ODA', 'Name - CPO',
       'Name - WeGov', 'Name - Greenbook', 'Name - Checkbook', 'Acronym',
       'Merged Names', 'ID', 'Instance Of', 'Out of Scope', 'Name - Preferred',
       'NYC Administrative Organization Type', 'Agency Operational Status',
       'Parent Organization', 'Authorizing Authority', 'Legal Citation',
       'Legal Citation URL', 'Legal Citation Text', 'Additional Citation',
       'Additional Citation URL', 'Organization Website', 'Description_x',
       'Contact Name_x', 'Description_y', 'Contact Name_y', 'Contact Title',
       'Open Datasets URL', 'Agency Code'],
      dtype='object')

In [354]:
# Create a new DataFrame called agency_name_export based on agency_name_final
agency_name_export = agency_name_final.copy()

# Drop the specified fields
agency_name_export.drop(columns=['Name', 'Description_y', 'Contact Name_y'], errors='ignore', inplace=True)

# Reorganize the order and rename the columns
agency_name_export = agency_name_export[[
    'Name - Preferred',
    'Acronym',
    'Agency Code',
    'NYC Administrative Organization Type',
    'Agency Operational Status',
    'Organization Website',
    'Description_x',
    'Contact Name_x',
    'Contact Title',
    'Parent Organization',
    'Authorizing Authority',
    'Legal Citation',
    'Legal Citation URL',
    'Legal Citation Text',
    'Open Datasets URL'
]].rename(columns={
    'Description_x': 'Description',
    'Contact Name_x': 'Contact Name'
})

In [359]:
agency_name_crosswalk = agency_name_final[[
    'Name - Preferred',
    'Acronym',
    'Agency Code',
    'NYC Administrative Organization Type',
    'Agency Operational Status',
    'Name - NYC.gov Agency List',
    'Name - NYC.gov Mayor\'s Office',
    'Name - NYC Open Data Portal',
    'Name - ODA',
    'Name - CPO',
    'Name - WeGov',
    'Name - Greenbook',
    'Name - Checkbook'
]].copy()

In [361]:
agency_name_crosswalk.shape

(208, 13)

In [356]:
agency_name_export.shape

(208, 15)

# Export combined list

In [363]:
# Define the output file path
output_file_path = '/content/drive/MyDrive/Projects/ODA/Agency Name Project/Output/agency_name_export.csv'

# Export the combined dataframe to a CSV file
agency_name_export.to_csv(output_file_path, index=False)

print(f'Data exported successfully to {output_file_path}')

Data exported successfully to /content/drive/MyDrive/Projects/ODA/Agency Name Project/Output/agency_name_export.csv
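
Since the exported file is meant to serve as a canonical list, a few checks before writing the CSV may be worthwhile. The sketch below is an assumption about which checks are useful (missing and duplicate preferred names); `export_problems` and the sample frame are hypothetical.

```python
import pandas as pd

def export_problems(df, name_col="Name - Preferred"):
    """Return human-readable problems; an empty list means no issues were found."""
    problems = []
    missing = int(df[name_col].isna().sum())
    if missing:
        problems.append(f"{missing} row(s) missing {name_col}")
    # keep=False marks every member of a duplicated group, not just the repeats
    dupes = df.loc[df[name_col].notna() & df[name_col].duplicated(keep=False), name_col]
    if not dupes.empty:
        problems.append(f"duplicate names: {sorted(dupes.unique())}")
    return problems

sample_export = pd.DataFrame({
    "Name - Preferred": ["Board of Elections", "Board of Elections", None],
})
problems = export_problems(sample_export)
for problem in problems:
    print(problem)
```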
