<a href="https://colab.research.google.com/github/MODA-NYC/Agency-Name-Project/blob/main/Agency_Name_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project Overview
This project aims to create a standardized list of Agency Names* and publish this as a dataset on the NYC Open Data portal. The primary goal is to enhance data legibility and interoperability by providing official, consistently formatted agency names. This will provide a clear canonical source for how to format Agency Names, improving data quality and saving time when joining datasets on the Agency Name field.

This project is being developed by the Data Governance team in the Office of Data and Analytics.

*The word “Agency” is used colloquially here to mean any city government organization, including a New York City Agency, a Mayoral Office, or a Commission.

Project Plan document: https://docs.google.com/document/d/1u9-sZXUWdand1yIRmmKGbq7D5RAgD2puWoYvbP06a4g/edit?usp=sharing

GitHub repository (final location of the code and documentation of this project): https://github.com/MODA-NYC/Agency-Name-Project

Import pandas, then mount Google Drive in the Colab environment for file access.

In [None]:
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Web Scraping for https://www.nyc.gov/nyc-resources/agencies.page

Import the required libraries: requests for HTTP requests, BeautifulSoup (from bs4) for HTML parsing, and pandas for data manipulation. Two functions then scrape agency information from https://www.nyc.gov/nyc-resources/agencies.page. The process_agency_info function extracts the agency name, URL, and description from a single HTML list item and adds a lowercased copy of the name for use as a join key. The scrape_agency_list function fetches the page, applies process_agency_info to every list item, and returns a list of records (or an empty list if the request fails). The records are then loaded into a pandas DataFrame.

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

def process_agency_info(li_tag):
    a_tag = li_tag.find('a', class_='name')
    name = a_tag.text.strip() if a_tag else ''
    url = a_tag.get('href') if a_tag else ''
    description = li_tag.get('data-desc', '')

    # Preprocess name for unique identification
    name_processed = name.lower().strip()

    return {
        'Name': name_processed,
        'Name - NYC.gov Agency List': name,
        'URL': url,
        'Description': description
    }

def scrape_agency_list(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    try:
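        # A timeout avoids hanging forever; raise_for_status turns HTTP errors (e.g. 404) into exceptions
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()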
        soup = BeautifulSoup(response.text, 'html.parser')

        agencies_info = []
        for li_tag in soup.select('.alpha-list li'):
            agencies_info.append(process_agency_info(li_tag))

        return agencies_info
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return []

# URL for the List of NYC agencies
url = 'https://www.nyc.gov/nyc-resources/agencies.page'
agencies_info = scrape_agency_list(url)

# Load the scraped data into a DataFrame
df_nyc_gov_agency_list = pd.DataFrame(agencies_info)

# Display the DataFrame
#print(df_nyc_gov_agency_list.head())

In [None]:
df_nyc_gov_agency_list.head()

Unnamed: 0,Name,Name - NYC.gov Agency List,URL,Description
0,"actuary, nyc office of the (nycoa)","Actuary, NYC Office of the (NYCOA)",http://www.nyc.gov/actuary,"The New York City Office of the Actuary (""NYCO..."
1,"administrative justice coordinator, nyc office...","Administrative Justice Coordinator, NYC Office...",http://www.nyc.gov/ajc,The Office of the Administrative Justice Coord...
2,"administrative tax appeals, office of","Administrative Tax Appeals, Office of",http://www.nyc.gov/oata,The Office of Administrative Tax Appeals was e...
3,"administrative trials and hearings, office of ...","Administrative Trials and Hearings, Office of ...",http://www.nyc.gov/oath,The NYC Office of Administrative Trials and H...
4,"aging, department for the (nyc aging)","Aging, Department for the (NYC Aging)",http://www.nyc.gov/aging,NYC Aging funds community-based organizations ...
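
To see what `process_agency_info` extracts without requesting the live page, the same lookups can be run against an inline snippet. This is a minimal sketch: the HTML below is a hypothetical fragment shaped to match the selectors used above (`.alpha-list li`, `a.name`, `data-desc`), not a copy of the real page, and `parse_agencies` is an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, shaped to match the selectors used by the scraper.
SAMPLE_HTML = """
<ul class="alpha-list">
  <li data-desc="Funds community-based organizations.">
    <a class="name" href="http://www.nyc.gov/aging"> Aging, Department for the (NYC Aging) </a>
  </li>
  <li data-desc="No link in this entry."></li>
</ul>
"""

def parse_agencies(html):
    """Return one dict per <li> under .alpha-list, tolerating missing links."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for li_tag in soup.select('.alpha-list li'):
        a_tag = li_tag.find('a', class_='name')
        name = a_tag.text.strip() if a_tag else ''
        rows.append({
            'Name': name.lower(),
            'Name - NYC.gov Agency List': name,
            'URL': a_tag.get('href') if a_tag else '',
            'Description': li_tag.get('data-desc', ''),
        })
    return rows

sample_rows = parse_agencies(SAMPLE_HTML)
for row in sample_rows:
    print(row)
```

An entry without a link yields empty strings rather than an exception, so such rows may be worth filtering out before any merge on 'Name'.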


# Web Scraping for https://www.nyc.gov/office-of-the-mayor/admin-officials.page

Define functions to scrape and process the list of administration officials at https://www.nyc.gov/office-of-the-mayor/admin-officials.page. The process_mayor_office_info function extracts the agency name, URL, contact name, and contact title from each list item and lowercases the agency name for use as a join key. The scrape_mayor_office_list function fetches the page with requests, parses it with BeautifulSoup, and collects the records into a list, returning an empty list if the request fails. The results are loaded into a pandas DataFrame.


In [None]:
def process_mayor_office_info(li_tag, source_name):
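    # Note: source_name is accepted but not used in the returned record.
    # Look up each container first so a missing <div> yields '' instead of an AttributeError.
    agency_div = li_tag.find('div', class_='al-agency')
    agency_tag = agency_div.find('a') if agency_div else None
    agency_name = agency_tag.text.strip() if agency_tag else ''
    agency_url = agency_tag.get('href') if agency_tag else ''

    contact_div = li_tag.find('div', class_='al-contact')
    contact_tag = contact_div.find('a') if contact_div else None
    contact_name = contact_tag.text.strip() if contact_tag else ''
    title_tag = li_tag.find('li', class_='al-contact-info')
    contact_title = title_tag.text.strip() if title_tag else ''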

    # Preprocess name for unique identification
    name_processed = agency_name.lower().strip()

    return {
        'Name': name_processed,
        'Name - NYC.gov Mayor\'s Office': agency_name,
        'URL': agency_url,
        'Contact Name': contact_name,
        'Contact Title': contact_title
    }

def scrape_mayor_office_list(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    try:
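        # A timeout avoids hanging forever; raise_for_status turns HTTP errors (e.g. 404) into exceptions
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()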
        soup = BeautifulSoup(response.text, 'html.parser')
        officials_info = []
        for li_tag in soup.select('li[data-topic]'):
            officials_info.append(process_mayor_office_info(li_tag, 'NYC.gov Mayor\'s Office'))
        return officials_info
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return []

# URL for the Office of the Mayor officials
url_mayor_office = 'https://www.nyc.gov/office-of-the-mayor/admin-officials.page'
officials_info = scrape_mayor_office_list(url_mayor_office)

# Load the scraped data into a DataFrame
df_nyc_mayor_office = pd.DataFrame(officials_info)

# Display the DataFrame
#print(df_nyc_mayor_office.head())


In [None]:
df_nyc_mayor_office.head()

Unnamed: 0,Name,Name - NYC.gov Mayor's Office,URL,Contact Name,Contact Title
0,"actuary, nyc office of the (nycoa)","Actuary, NYC Office of the (NYCOA)",http://www.nyc.gov/actuary,Marek Tyszkiewicz,Chief Actuary
1,"administrative justice coordinator, nyc office...","Administrative Justice Coordinator, NYC Office...",http://www.nyc.gov/ajc,David Goldin,Administrative Justice Coordinator
2,"administrative tax appeals, office of","Administrative Tax Appeals, Office of",http://www.nyc.gov/oata,Frances Henn,Director
3,"administrative trials and hearings, office of ...","Administrative Trials and Hearings, Office of ...",http://www.nyc.gov/oath,Asim Rehman,Commissioner
4,"aging, department for the (nyc aging)","Aging, Department for the (NYC Aging)",http://www.nyc.gov/aging,Lorraine A. Cortés-Vázquez,Commissioner
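
The Mayor's Office parser looks inside nested containers, so it helps to check how it behaves when one of them is absent. The sketch below is illustrative only: the HTML is a hypothetical fragment built around the selectors used above (`li[data-topic]`, `div.al-agency`, `div.al-contact`, `li.al-contact-info`), and `first_link` and `text_of` are helper names introduced here, not part of the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the second entry omits the contact block on purpose.
SAMPLE_HTML = """
<ul>
  <li data-topic="a">
    <div class="al-agency"><a href="http://www.nyc.gov/oath">Administrative Trials and Hearings, Office of (OATH)</a></div>
    <div class="al-contact"><a href="#">Asim Rehman</a>
      <ul><li class="al-contact-info">Commissioner</li></ul>
    </div>
  </li>
  <li data-topic="b">
    <div class="al-agency"><a href="http://www.nyc.gov/oata">Administrative Tax Appeals, Office of</a></div>
  </li>
</ul>
"""

def first_link(parent, div_class):
    """Return the first <a> inside the named <div>, or None if the <div> is missing."""
    div = parent.find('div', class_=div_class)
    return div.find('a') if div else None

def text_of(tag):
    """Return stripped text for a tag, or '' when the tag is None."""
    return tag.text.strip() if tag else ''

soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
officials = []
# Only <li> elements carrying a data-topic attribute are treated as entries.
for li_tag in soup.select('li[data-topic]'):
    agency_tag = first_link(li_tag, 'al-agency')
    officials.append({
        'Name': text_of(agency_tag).lower(),
        'URL': agency_tag.get('href') if agency_tag else '',
        'Contact Name': text_of(first_link(li_tag, 'al-contact')),
        'Contact Title': text_of(li_tag.find('li', class_='al-contact-info')),
    })

for record in officials:
    print(record)
```

Because the nested `li.al-contact-info` has no `data-topic` attribute, the attribute selector does not pick it up as a separate entry.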


# Web Scraping for https://opendata.cityofnewyork.us/data/

Define a function to scrape agency information from the NYC Open Data Portal. The scrape_open_data_list function fetches the page using requests, parses it with BeautifulSoup, and iterates through specified HTML elements to collect agency names and URLs. Agency names are processed for uniformity. The collected data is stored in a pandas DataFrame for further manipulation and analysis.

In [None]:
def scrape_open_data_list(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }
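    # Handle request failures the same way as the other scrapers: report and return an empty list
    try:
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return []
    soup = BeautifulSoup(response.text, 'html.parser')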

    data_info = []

    for ul_tag in soup.select('div.content-block ul.space-section'):
        for li_tag in ul_tag.select('li'):
            a_tag = li_tag.find('a')
            if a_tag:
                agency_name = a_tag.text.strip()
                agency_url = a_tag.get('href')

                # Preprocess name for unique identification
                name_processed = agency_name.lower().strip()

                data_info.append({
                    'Name': name_processed,
                    'Name - NYC Open Data Portal': agency_name,
                    'URL': agency_url
                })

    return data_info

# URL for the NYC Open Data Portal
url_open_data = 'https://opendata.cityofnewyork.us/data/'
open_data_info = scrape_open_data_list(url_open_data)

# Load the scraped data into a DataFrame
df_nyc_open_data_portal = pd.DataFrame(open_data_info)

# Display the DataFrame
#print(df_nyc_open_data_portal.head())

In [None]:
df_nyc_open_data_portal.head()

Unnamed: 0,Name,Name - NYC Open Data Portal,URL
0,administration for children’s services (acs),Administration for Children’s Services (ACS),https://data.cityofnewyork.us/browse?Dataset-I...
1,board of elections (boeny),Board of Elections (BOENY),https://data.cityofnewyork.us/browse?Dataset-I...
2,board of standards and appeals (bsa),Board of Standards and Appeals (BSA),https://data.cityofnewyork.us/browse?Dataset-I...
3,bronx borough president (bpbx),Bronx Borough President (BPBX),https://data.cityofnewyork.us/browse?Dataset-I...
4,brooklyn borough president (bpbk),Brooklyn Borough President (BPBK),https://data.cityofnewyork.us/browse?Dataset-I...



# Load and Preprocess ODA Agency Data from CSV

Loads a CSV file located on Google Drive into a DataFrame. The original 'Name' field is duplicated into a new column labeled 'Name - ODA'. The 'Name' field is then processed to remove whitespace and convert to lowercase for uniformity and ease of comparison or linking with other datasets.

In [None]:
# Path to the CSV file
file_path = '/content/drive/MyDrive/Projects/ODA/Agency Name Project/ODA Data.csv'

# Load the CSV file into a DataFrame
df_oda_data = pd.read_csv(file_path)

# Copy the original name field to a new column with the dataset specific name
df_oda_data['Name - ODA'] = df_oda_data['Name'].copy()

# Preprocess the "Name" field to trim whitespace and convert to lowercase for unique identification
df_oda_data['Name'] = df_oda_data['Name'].str.lower().str.strip()

# Display the DataFrame
#print(df_oda_data.head())

In [None]:
df_oda_data.head()

Unnamed: 0,Name,Agency Code,Parent Organization,Child Organization(s),Acronym,Agency Type,Website,Name - ODA
0,administration for children's services (acs),68.0,,,ACS,City Department,https://www.nyc.gov/site/acs/index.page,Administration for Children's Services (ACS)
1,association for a better new york (abny),,,,ABNY,Other,abny.org,Association for a Better New York (ABNY)
2,board of correction (boc),73.0,,,BOC,Other,https://www.nyc.gov/site/boc/index.page,Board of Correction (BOC)
3,board of education retirement system (bers),,,,BERS,Other,https://www.bers.nyc.gov/,Board of Education Retirement System (BERS)
4,board of elections (boe),3.0,,,BOE,Other,vote.nyc,Board of Elections (BOE)



# Load and Process Chief Privacy Officer (CPO) Data

Loads the CSV file of agency data from the Chief Privacy Officer (CPO) into a DataFrame. The original 'Agency or Office' values are preserved under 'Name - CPO' for reference, and a trimmed, lowercase copy is stored in 'Name' for consistent identification across datasets. The 'Agency or Office' column is then dropped to streamline the DataFrame.

In [None]:
import pandas as pd

# Path to the CSV file
cpo_file_path = '/content/drive/MyDrive/Projects/ODA/Agency Name Project/CPO Data.csv'

# Load the CSV file into a DataFrame
df_cpo_data = pd.read_csv(cpo_file_path)

# Assume 'Agency or Office' is the column we want to rename and preprocess
# Copy the original 'Agency or Office' to 'Name - CPO' before preprocessing
df_cpo_data['Name - CPO'] = df_cpo_data['Agency or Office'].copy()

# Preprocess 'Agency or Office' for unique identification (trim and lower case)
df_cpo_data['Name'] = df_cpo_data['Agency or Office'].str.lower().str.strip()

# Now we can drop the original 'Agency or Office' column if it's no longer needed
df_cpo_data.drop(columns=['Agency or Office'], inplace=True)

# Display the DataFrame
#print(df_cpo_data.head())

In [None]:
df_cpo_data.head()

Unnamed: 0,Acronym,Name - CPO,Name
0,ACS,Administration for Children's Services,administration for children's services
1,BOC,Board of Correction,board of correction
2,BERS,Board of Education Retirement System,board of education retirement system
3,BSA,Board of Standards and Appeals,board of standards and appeals
4,Bronx BP,Bronx Borough President's Office,bronx borough president's office


# Load, Process, and Filter WeGov Data

Loads a CSV file containing data from the civic group WeGov into a DataFrame. The original 'name' values are copied to 'Name - WeGov', a trimmed, lowercase version is stored in 'Name', and the original 'name' column is dropped. The dataset is then filtered to rows whose 'type' is 'City Agency' or 'Elected Office', the entity types relevant to this project.

In [None]:
# Path to the CSV file
wegov_file_path = '/content/drive/MyDrive/Projects/ODA/Agency Name Project/WeGov Data.csv'

# Load the CSV file into a DataFrame
df_wegov_data = pd.read_csv(wegov_file_path)

# Copy the original 'name' to 'Name - WeGov' before renaming
df_wegov_data['Name - WeGov'] = df_wegov_data['name'].copy()

# Preprocess 'name' for unique identification (trim and lower case)
df_wegov_data['Name'] = df_wegov_data['name'].str.lower().str.strip()

# Drop the original 'name' column as its data has been preserved and preprocessed
df_wegov_data.drop(columns=['name'], inplace=True)

# Filter the DataFrame for rows where the "type" column is either "City Agency" or "Elected Office"
#df_wegov_data['type'] = df_wegov_data['type'].str.strip()
df_wegov_data = df_wegov_data[df_wegov_data['type'].isin(['City Agency', 'Elected Office'])]

# Display the DataFrame
#print(df_wegov_data.head())

In [None]:
df_wegov_data.head()

Unnamed: 0,id,alternate_name,type,tags,child_of,description,email,url,main_address,main_phone,...,legislation,legal_status,tax_status,tax_id,year_incorporated,news,capital_code,ical,Name - WeGov,Name
4,170019017,,City Agency,Business,170011000.0,,,https://www1.nyc.gov/site/nyw/index.page,"255 Greenwich St 6th floor, New York, NY 10007...",(212) 788-5889,...,https://laws.council.nyc.gov/search/?sort_by=d...,,,,,,,,NYC Municipal Water Finance Authority,nyc municipal water finance authority
5,170019007,TDC,City Agency,Business,170011000.0,,info@nyctdc.org,www.nyctdc.org,"USA, NY 11201, Brooklyn, 15 MetroTech Center",(718) 724-6560,...,https://laws.council.nyc.gov/search/?sort_by=d...,,,,,,,,NYC Technology Development Corporation,nyc technology development corporation
6,170010021,OATA,City Agency,Business,170011000.0,,,https://www1.nyc.gov/site/oata/index.page,"One Centre Street, Room 2400 New York, N.Y. 1...",(212) 669-2070,...,https://laws.council.nyc.gov/search/?sort_by=d...,,,,,,,,Office of Administrative Tax Appeals,office of administrative tax appeals
7,170019016,TFA,City Agency,Business,170011000.0,,,https://www1.nyc.gov/site/transitionalfinance/...,"255 Greenwich Street 9th Floor New York, NY 10...",1-212-442-5775,...,https://laws.council.nyc.gov/search/?sort_by=d...,,,,,,,,Transitional Finance Authority,transitional finance authority
24,170010002,OOM,Elected Office,,170000000.0,"Mayor Bill de Blasio took office on January 1,...",,http://www1.nyc.gov/office-of-the-mayor/index....,,(212) 788-3000,...,https://laws.council.nyc.gov/search/?sort_by=d...,,,,,,,,Mayor's Office,mayor's office
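
The three CSV loaders above share one pattern: keep the source spelling under a dataset-specific column and store a trimmed, lowercase key under 'Name'. A minimal sketch of that pattern as a reusable helper follows. The name `add_normalized_name` and the sample rows are hypothetical, and trimming 'type' before filtering is an optional safeguard suggested here rather than something the notebook does.

```python
import pandas as pd

def add_normalized_name(df, source_col, label):
    """Keep the source spelling under `label` and a trimmed, lowercase key under 'Name'."""
    out = df.copy()
    original = out[source_col]
    out[label] = original
    out['Name'] = original.str.strip().str.lower()
    # Drop the source column unless it is 'Name' itself (as in the ODA file).
    if source_col != 'Name':
        out = out.drop(columns=[source_col])
    return out

# Hypothetical rows standing in for a source CSV.
sample = pd.DataFrame({
    'name': ["  Mayor's Office ", 'Transitional Finance Authority', None],
    'type': ['Elected Office', 'City Agency ', 'State Agency'],
})

prepared = add_normalized_name(sample, 'name', 'Name - WeGov')

# Trim 'type' before filtering so a stray trailing space does not drop a row.
keep_types = ['City Agency', 'Elected Office']
prepared = prepared[prepared['type'].str.strip().isin(keep_types)]
print(prepared)
```

The `.str` accessors pass missing values through untouched, so a blank name in the source does not raise an error at this stage.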


# Standardize and Normalize Agency Names Across DataFrames

Imports re for regular expressions and unicodedata for character normalization. Defines a function standardize_name to normalize, clean, and reformat agency names for consistency. This includes decomposing characters, replacing special characters, removing acronyms, and adjusting name order. Applies this standardized function to the 'Name' column of each DataFrame containing agency information, ensuring uniform naming across multiple data sources. This process aids in data integration and comparison.

In [None]:
import re
import unicodedata

def standardize_name(name):
    # Normalize the string to decompose combined characters and replace special characters
    name = unicodedata.normalize('NFKD', name)
    name = name.replace('’', "'").replace('‘', "'")

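    # Collapse whitespace runs, then drop control/format characters (Unicode categories starting with 'C')
    name = re.sub(r'\s+', ' ', name)
    name = "".join(char for char in name if not unicodedata.category(char).startswith('C'))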

    # Extract the acronym if present, then remove it from the base name
    acronym_search = re.search(r'\(([^)]+)\)', name)
    acronym = acronym_search.group(1) if acronym_search else ''
    base_name = re.sub(r'\s*\([^)]+\)\s*', '', name).strip()

    # Convert alphabetized names to non-alphabetized format if necessary
    if ',' in base_name:
        parts = base_name.split(', ')
        base_name = ' '.join(parts[::-1])

    return base_name.lower().strip()

# Apply the standardization function to each dataframe's 'Name' field
dataframes = [df_nyc_gov_agency_list, df_nyc_mayor_office, df_nyc_open_data_portal, df_oda_data, df_cpo_data, df_wegov_data]

for df in dataframes:
    df['Name'] = df['Name'].apply(standardize_name)

# Now you can recombine the dataframes as before and check the results
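
The standardization steps described above can be traced on a few inputs. The following is a self-contained sketch of those steps, not the notebook's exact function: `standardize_name_sketch` is an illustrative name, and the third input deliberately contains a non-breaking space, a double space, and a zero-width space to show the cleanup.

```python
import re
import unicodedata

def standardize_name_sketch(name):
    """Illustrative variant of the standardization steps."""
    # 1. Compatibility-decompose (e.g. non-breaking space -> space) and unify apostrophes.
    name = unicodedata.normalize('NFKD', name).replace('’', "'").replace('‘', "'")
    # 2. Remove a parenthesized acronym such as "(OATH)".
    name = re.sub(r'\s*\([^)]+\)\s*', ' ', name)
    # 3. Collapse runs of whitespace into single spaces.
    name = re.sub(r'\s+', ' ', name).strip()
    # 4. Drop control/format characters (Unicode categories starting with 'C').
    name = ''.join(ch for ch in name if not unicodedata.category(ch).startswith('C'))
    # 5. Turn "Aging, Department for the" into "Department for the Aging".
    if ', ' in name:
        name = ' '.join(reversed(name.split(', ')))
    return name.lower().strip()

examples = [
    'Aging, Department for the (NYC Aging)',
    'Administration for Children’s Services (ACS)',
    'Office of\u00a0Administrative  Tax Appeals\u200b',
]
for raw in examples:
    print(repr(raw), '->', repr(standardize_name_sketch(raw)))
```

Whitespace is collapsed before control characters are dropped so that a tab or newline between two words becomes a space instead of fusing the words together.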

# Display DataFrame Names and Sizes

In [None]:
dataframes = {
    'NYC Gov Agency List': df_nyc_gov_agency_list,
    'NYC Mayor Office': df_nyc_mayor_office,
    'NYC Open Data Portal': df_nyc_open_data_portal,
    'ODA Data': df_oda_data,
    'CPO Data': df_cpo_data,
    'WeGov Data': df_wegov_data
}

# Print the name and shape of each dataframe
for name, df in dataframes.items():
    print(f"{name}: {df.shape}")

NYC Gov Agency List: (158, 4)
NYC Mayor Office: (175, 5)
NYC Open Data Portal: (88, 3)
ODA Data: (183, 8)
CPO Data: (186, 3)
WeGov Data: (180, 32)
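
Shapes alone do not show whether standardization mapped two source spellings to the same key. In a pandas merge, a key that repeats on both sides produces one row per pairing, so it may be worth listing repeated keys before combining. A small sketch, with a hypothetical helper `duplicate_keys` and made-up rows:

```python
import pandas as pd

def duplicate_keys(df, key='Name'):
    """Return the key values that occur more than once, with their counts."""
    counts = df[key].value_counts()
    return counts[counts > 1]

# Hypothetical frame: two source spellings collapse to the same standardized key.
sample = pd.DataFrame({
    'Name': ['board of elections', 'board of elections', 'board of correction'],
    'Name - ODA': ['Board of Elections (BOE)', 'Board of Elections (BOENY)', 'Board of Correction (BOC)'],
})

dupes = duplicate_keys(sample)
print(dupes)
```

Running the same check on each source DataFrame would show which keys, if any, need manual review before the merge.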


# Combine DataFrames with Agency Names from Multiple Sources

Initializes a combined DataFrame using the 'Name' and 'Name - NYC.gov Agency List' columns from the NYC government agency list. Constructs a list of tuples, each containing a DataFrame and its respective unique agency name column. Iterates through this list, merging each DataFrame with the combined DataFrame based on the standardized 'Name' field, using an outer join to ensure all data is included. The result is a comprehensive DataFrame that aligns agency names across different sources, useful for data comparison and integration.

In [None]:
# Initialize the combined dataframe with the first dataframe's relevant columns
combined_df = df_nyc_gov_agency_list[['Name', 'Name - NYC.gov Agency List']]

# List of tuples containing dataframes and their respective "Name - Dataset" columns
dataframes_to_merge = [
    (df_nyc_mayor_office, 'Name - NYC.gov Mayor\'s Office'),
    (df_nyc_open_data_portal, 'Name - NYC Open Data Portal'),
    (df_oda_data, 'Name - ODA'),
    (df_cpo_data, 'Name - CPO'),
    (df_wegov_data, 'Name - WeGov')  # Already filtered to 'City Agency' and 'Elected Office' rows
]

# Merge each dataframe in the list with the combined dataframe
for df, name_column in dataframes_to_merge:
    combined_df = combined_df.merge(df[['Name', name_column]], on='Name', how='outer')

# Display the head of the combined dataframe to verify
#print(combined_df.head())

In [None]:
combined_df.shape

(447, 7)

In [None]:
combined_df.head()

Unnamed: 0,Name,Name - NYC.gov Agency List,Name - NYC.gov Mayor's Office,Name - NYC Open Data Portal,Name - ODA,Name - CPO,Name - WeGov
0,nyc office of the actuary,"Actuary, NYC Office of the (NYCOA)","Actuary, NYC Office of the (NYCOA)",,,,
1,nyc office of administrative justice coordinator,"Administrative Justice Coordinator, NYC Office...","Administrative Justice Coordinator, NYC Office...",,,,
2,office of administrative tax appeals,"Administrative Tax Appeals, Office of","Administrative Tax Appeals, Office of",,Office of Administrative Tax Appeals (OATA),Office of Administrative Tax Appeals,Office of Administrative Tax Appeals
3,office of administrative trials and hearings,"Administrative Trials and Hearings, Office of ...","Administrative Trials and Hearings, Office of ...",Office of Administrative Trials and Hearings (...,Office of Administrative Trials and Hearings (...,,Office of Administrative Trials and Hearings
4,department for the aging,"Aging, Department for the (NYC Aging)","Aging, Department for the (NYC Aging)",Department for the Aging (NYC Aging),Department for the Aging (DFTA),Department for the Aging,Department for the Aging
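
The effect of the outer join is easier to see on two tiny frames. In this sketch the rows are hypothetical (the names are borrowed from the sample output above), and `indicator=True` is an optional `merge` argument that the notebook does not use. It adds a `_merge` column recording whether each key came from the left frame, the right frame, or both.

```python
import pandas as pd

# Hypothetical two-source example.
left = pd.DataFrame({
    'Name': ['department for the aging', 'nyc office of the actuary'],
    'Name - NYC.gov Agency List': ['Aging, Department for the (NYC Aging)',
                                   'Actuary, NYC Office of the (NYCOA)'],
})
right = pd.DataFrame({
    'Name': ['department for the aging', 'board of correction'],
    'Name - CPO': ['Department for the Aging', 'Board of Correction'],
})

# An outer join keeps every key from both sides and fills the gaps with NaN.
merged = left.merge(right, on='Name', how='outer', indicator=True)
print(merged[['Name', '_merge']])

# Keys present in both sources.
matched = merged.loc[merged['_merge'] == 'both', 'Name'].tolist()
print(matched)
```

Rows marked `left_only` or `right_only` are the ones where a source's spelling did not line up with any other source after standardization.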


# Export combined list

In [None]:
# Define the output file path
output_file_path = '/content/drive/MyDrive/Projects/ODA/Agency Name Project/Output/combined_data.csv'

# Export the combined dataframe to a CSV file
combined_df.to_csv(output_file_path, index=False)

print(f'Data exported successfully to {output_file_path}')

Data exported successfully to /content/drive/MyDrive/Projects/ODA/Agency Name Project/Output/combined_data.csv
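
If the Output folder does not exist yet, `to_csv` fails, so one option is to create it before writing and read the file back as a quick check. This is a sketch under stated assumptions: `export_csv` is a hypothetical helper, and a temporary folder and made-up row stand in for the Drive path and `combined_df`.

```python
import tempfile
from pathlib import Path

import pandas as pd

def export_csv(df, path):
    """Create the parent folder if needed, write the CSV, and return the row count read back."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return len(pd.read_csv(path))

# Hypothetical data and a temporary folder stand in for combined_df and the Drive path.
sample = pd.DataFrame({'Name': ['board of correction'], 'Name - CPO': ['Board of Correction']})
with tempfile.TemporaryDirectory() as tmp:
    rows_written = export_csv(sample, Path(tmp) / 'Output' / 'combined_data.csv')

print(rows_written)
```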
