# Data Dump Processing

The purpose of this notebook is to process building permit data extracted from a monthly report generated in Accela. At the end of each month, the CPDI team conducts a manual audit of the data and compiles a consolidated report detailing building permit activity.

While long-term development efforts aim to establish a direct data pipeline from Accela to an interactive dashboard, eliminating the need for manual intervention, the current process allows audited data dumps to be systematically combined into a unified dataset.

Once compiled, the consolidated dataset can be manually uploaded to the GIS platform, where it refreshes the Community Development Snapshot dashboard and gives stakeholders an up-to-date view of building permit activity.

### Scope

This report has been compiled for many years, and its format has changed over time. This iteration includes only data from July 2021 onward. Because some data dump sheets within the workbooks had to be edited, the workbooks should not be re-uploaded in a different state. These edits included moving revisions to the appropriate place and deleting obviously duplicate rows that did not conform to general permit processing rules, e.g. a 4-plex having 4 permit lines when only 1 is needed.

When the next month of data is ready, the CPDI team overwrites the current fiscal year workbook so that it contains the new month's data dump. At fiscal year changeover, the CPDI team creates a new fiscal year workbook, which should be uploaded to the same data folder as all the other workbooks.

In [1]:
import os
import pandas as pd
import numpy as np
import re
import warnings
from collections import Counter, defaultdict
from itertools import chain
import itertools

In [2]:
def get_files_in_folder(folder):
    '''
    Generates a list of all of the Development Reports
    
    Args
    folder: the folder path (string) for the folder that contains the workbooks
    
    Outputs
    files: a list of all of the development reports
    '''
    files = []
    for file in os.listdir(folder):
        # Skip .DS_Store files
        if file == '.DS_Store':
            continue
        file_path = os.path.join(folder, file)
        if os.path.isfile(file_path):
            files.append(file)
    return files

# Example
folder = os.getcwd()+'/data'
get_files_in_folder(folder)

['FY23 Development Report Workbook.xlsx',
 'FY24 Development Report Workbook.xlsx',
 'FY22 Development Report Workbook.xlsx',
 'FY25 Development Report Workbook.xlsx']
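
The listing above comes back in whatever order the operating system returns it, and any stray non-workbook file other than `.DS_Store` would be included. A minimal sketch of a stricter variant follows, assuming workbooks always end in `.xlsx`. The helper name `list_workbooks` is hypothetical and not part of this notebook, and the demo runs against a throwaway folder rather than the real `data` folder.

```python
import tempfile
from pathlib import Path


def list_workbooks(folder):
    """Return sorted .xlsx file names in `folder`, skipping Excel lock files (~$...)."""
    return sorted(
        p.name
        for p in Path(folder).glob("*.xlsx")
        if p.is_file() and not p.name.startswith("~$")
    )


# Demonstrate on a temporary folder so the sketch runs anywhere
with tempfile.TemporaryDirectory() as tmp:
    for name in ["FY23 Development Report Workbook.xlsx",
                 "FY22 Development Report Workbook.xlsx",
                 ".DS_Store"]:
        (Path(tmp) / name).touch()
    workbooks = list_workbooks(tmp)

print(workbooks)
```

Sorting by name keeps the fiscal years in order as long as the `FYxx` prefix convention holds.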

In [3]:
def get_file_fiscal_year(file):
    '''
    From the file path, finds the fiscal year of the development report workbook.
    
    Args
    file: the file path (string) for the development report workbook

    Outputs
    fiscal_year: the fiscal year (string) for the development report workbook
    '''
    # Get the fiscal year
    match = re.search(r'FY(\d{2})', file)
    if match:
        fiscal_year = '20' + match.group(1)
    else:
        # Issue a warning if the fiscal year is not found
        warnings.warn(f"The file name '{file}' does not specify a fiscal year. Please specify 'FY' in the file name.", UserWarning)
        fiscal_year = None  # Set to None to indicate missing fiscal year
    return fiscal_year

# Example
file = os.getcwd()+'/data/FY22 Development Report Workbook.xlsx'
get_file_fiscal_year(file)

'2022'
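
One thing to watch: `re.search` scans the whole path, so an `FYxx` token in a parent folder name would be matched before the one in the file name. A small sketch of restricting the search to the file name is below; the function name `fiscal_year_from_name` and the example path are hypothetical.

```python
import os
import re


def fiscal_year_from_name(path):
    """Parse 'FYxx' from the file name only, ignoring parent folders."""
    match = re.search(r"FY(\d{2})", os.path.basename(path))
    return "20" + match.group(1) if match else None


# Hypothetical path whose parent folder also contains an 'FY' token
path = "/archive/FY19 backups/data/FY22 Development Report Workbook.xlsx"
print(fiscal_year_from_name(path))
```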

In [4]:
def build_sheet_details_df(file):
    ''' 
    Generate a sheet details data frame to be merged as data within a workbook is processed. 
    At this point, it's important to ensure that data dumps and five reports sheets in the workbook can be recognized and referenced appropriately.
    This is important because this data frame is later joined on the data dumps and/or five reports data. 
    
    Args 
    file: the file path (string) for the development report workbook

    Outputs
    sheet_details_df: the details (DataFrame) of the workbook contents recognized by the program
    '''
    # Load the sheet names from the Excel file
    sheets = pd.ExcelFile(file).sheet_names
    
    # Isolate the data dump sheets while accounting for variations
    data_dump_sheets = sorted([
        sheet.replace("data dump", "Data Dump") 
        for sheet in sheets 
        if "data dump" in sheet.lower()
    ])

    # Isolate five reports sheets without modifying names
    five_reports_sheets = sorted([sheet for sheet in sheets if '5 Reports' in sheet or 'Five Reports' in sheet])

    # Check if the number of Data Dump sheets matches Five Reports sheets
    if len(data_dump_sheets) != len(five_reports_sheets):
        raise ValueError("The number of 'Data Dump' sheets does not match the number of 'Five Reports' sheets. Please check the sheet names inside the file.")
    
    # Generate a list of the fiscal years for how many data dumps we have
    fiscal_year = get_file_fiscal_year(file)
    fiscal_years = [fiscal_year]*len(data_dump_sheets)
    
    # Extract the month numbers from the sheet names
    month_numbers = [re.match(r'(\d{2})', sheet).group(1) for sheet in data_dump_sheets if re.match(r'^\d{2} ', sheet)]
    
    # Create the DataFrame with the initial columns
    df_1 = pd.DataFrame({
        'Data Dump' : data_dump_sheets,
        'Five Reports' : five_reports_sheets,
        'Month Number' : month_numbers,  
        'Fiscal Year' : fiscal_years
    })

    # Populate the calendar year by subtracting 1 from the fiscal year if it is not one of the first 6 months of the year
    df_1['Calendar Year'] = df_1['Month Number'].apply(lambda x: str(int(fiscal_year)-1) if x not in ['01', '02', '03', '04', '05', '06'] else fiscal_year)

    # Generate a date that can be used in dashboards for time frame filters
    df_1['Permit Date'] = pd.to_datetime(df_1['Calendar Year'].astype(str) + '-' + df_1['Month Number'] + '-01')

    # Create the DataFrame of month names that we will join to df_1 so we have full month names
    df_2 = pd.DataFrame({
        'Month Number' : ['01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12'],
        'Permit Month' : ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
    })

    # Join the full month names onto the sheet details
    df_merged = pd.merge(df_1, df_2, how='left', left_on='Month Number', right_on='Month Number')

    # Define a mapping of abbreviations to full month names
    month_map = {
        "Jan": "January", "Feb": "February", "Mar": "March", "Apr": "April",
        "May": "May", "June": "June", "July": "July", "Aug": "August",
        "Sept": "September", "Oct": "October", "Nov": "November", "Dec": "December"
    }

    # Filter sheet names that match the pattern "<Month Abbreviation> Report"
    selected_sheets = [sheet for sheet in sheets if any(sheet.startswith(m) and "Report" in sheet for m in month_map)]

    # Create a new dataset with "Report" (the sheet name) and "Permit Month"
    df_3 = pd.DataFrame([{"Report": sheet, 'Permit Month': month_map.get(sheet.split()[0], "Unknown")} for sheet in selected_sheets])

    # Create the final DataFrame that will eventually be used to join onto row data read from a specific sheet.
    df_merged = pd.merge(df_merged, df_3, how='left', left_on='Permit Month', right_on='Permit Month')

    sheet_details_df = df_merged[['Data Dump', 'Five Reports', 'Report', 'Permit Date', 'Month Number', 'Permit Month', "Fiscal Year", 'Calendar Year']]

    return sheet_details_df

# Example
file = os.getcwd()+'/data/FY25 Development Report Workbook.xlsx'
build_sheet_details_df(file)

,Data Dump,Five Reports,Report,Permit Date,Month Number,Permit Month,Fiscal Year,Calendar Year
0,01 Data Dump,01 Five Reports,Jan Report,2025-01-01,1,January,2025,2025
1,07 Data Dump,07 Five Reports,July Report,2024-07-01,7,July,2025,2024
2,08 Data Dump,08 Five Reports,Aug Report,2024-08-01,8,August,2025,2024
3,09 Data Dump,09 Five Reports,Sept Report,2024-09-01,9,September,2025,2024
4,10 Data Dump,10 Five Reports,Oct Report,2024-10-01,10,October,2025,2024
5,11 Data Dump,11 Five Reports,Nov Report,2024-11-01,11,November,2025,2024
6,12 Data Dump,12 Five Reports,Dec Report,2024-12-01,12,December,2025,2024
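
The calendar-year rule used in `build_sheet_details_df` assumes a fiscal year that runs July through June, which is what the table above shows (month 07 of FY 2025 falls in calendar year 2024). A standalone sketch of that rule is below; the helper name `calendar_year_for` is hypothetical.

```python
def calendar_year_for(month_number, fiscal_year):
    """Map a two-digit month string and a fiscal-year string to a calendar-year string."""
    # July through December belong to the calendar year before the fiscal-year label
    if int(month_number) >= 7:
        return str(int(fiscal_year) - 1)
    return fiscal_year


july = calendar_year_for("07", "2025")
january = calendar_year_for("01", "2025")
print(july, january)
```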


# Creation of `BuildingPermitDataAll.csv` Data Set
## Extract a Data Dump from a Sheet

In [5]:
def find_header_row_in_main(file, sheet):
    """
    Finds the row number that likely contains the main header based on >50% non-null values.
    
    Args:
    file: the file path (string) for the development report workbook
    sheet: the specific data dump sheet (string) 

    Outputs:
    header_row: the best row number (index) for the main header row
    """
    # Get the first few rows to see where the header exists
    preview = pd.read_excel(file, sheet, header=None, nrows=5).iloc[:, 0:15]
    total_cols = preview.shape[1]
    
    header_row = None

    # Starting with top row, iterate through until we get columns with names
    for i, row in preview.iterrows():
        non_null_count = row.notna().sum()
        unnamed_count = sum(str(col).startswith("Unnamed") for col in row.astype(str))

        # More than 50% of the row contains valid data and fewer than 50% of the cells are 'Unnamed'
        if non_null_count > (0.5 * total_cols) and unnamed_count < (0.5 * total_cols):
            header_row = i
            break

    if header_row is None:
        print(f"Warning: Header row for the main data in '{sheet}' was not found. "
              "Please ensure the data is formatted like previous data dump months.")

    return header_row

# Example
file = os.getcwd()+'/data/FY23 Development Report Workbook.xlsx'
sheet = '03 Data Dump'
find_header_row_in_main(file, sheet)

0
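
The header search can be tried without an Excel file. The sketch below applies the same "first row that is more than half filled" idea to an in-memory preview; the function name `first_mostly_filled_row` and the toy rows are assumptions for illustration, and the sketch leaves out the "Unnamed" check.

```python
import numpy as np
import pandas as pd


def first_mostly_filled_row(preview, threshold=0.5):
    """Return the index of the first row whose non-null cell count exceeds threshold * columns."""
    total_cols = preview.shape[1]
    for i, row in preview.iterrows():
        if row.notna().sum() > threshold * total_cols:
            return i
    return None


# Toy preview: a title row, a blank row, then the real header and one data row
preview = pd.DataFrame([
    ["March 2023", np.nan, np.nan, np.nan],
    [np.nan, np.nan, np.nan, np.nan],
    ["CATEGORY", "RECORD NUMBER", "ADDRESS", "PROJECT COST"],
    ["02 - Remodel Residential", "2023-MSS-RES-00165", "2804 HARMONY CT", 25000.0],
])

header_row = first_mostly_filled_row(preview)
print(header_row)
```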

In [6]:
def extract_data_dump_main(file, sheet):
    """
    Extracts the main data dump from a specific sheet within a workbook.
    
    Args:
    file: the file path (string) for the development report workbook
    sheet: the specific data dump sheet (string) 

    Outputs:
    data: a pandas DataFrame of the main data dump
    """
    header = find_header_row_in_main(file, sheet)

    # Read Excel with the identified header row
    data = pd.read_excel(file, sheet, header=header).iloc[:,0:15]

    # Sometimes the data dump and its revisions have column headers that aren't formatted the same.
    # This mapping allows for the program to recognize column headers from data dumps.
    # If data for a column is not showing up, the column header needs to be changed or added to fit the data requirements as seen in output data sets.

    column_mapping = {
        "CATEGORY" : "Category",
        "RECORD NUMBER": 'Record Number',
        "REC_NO" : "Record Number",
        "REC_NO1" : "Record Number",
        "COMM_SUB" : "Commercial Subtype",
        "COMMERCIAL SUBTYPE" : "Commercial Subtype",
        "RES_SUB" : "Residential Subtype",
        "RESIDENTIAL SUBTYPE" : "Residential Subtype",
        "WORK DESCRIPTION": "Work Description",
        "WORK_DES1" : "Work Description",
        "WORK_DES" : "Work Description",
        "BUSINESS NAME" : "Business Name",
        "BUS_NAME1" : "Business Name",
        "BUS_NAME" : "Business Name",
        "PROJECT COST" : "Project Cost",
        "MUL_PROJ_COST" : "Project Cost",
        "JOB COST VALUATION" : "Job Cost Valuation",
        "MUL_JOB_COST_VAL" : "Job Cost Valuation",
        "CENSUS - CURRENT VALUATION RES" : "Current Valuation - Residential",
        "Current Valuation Res" : "Current Valuation - Residential",
        "CENSUS CURRENT VALUATION RESIDENTIAL" : "Current Valuation - Residential",
        "Current Valuation - Res" : "Current Valuation - Residential",
        "MUL_CURR_VAL_RES" : "Current Valuation - Residential",
        "CENSUS Current Valuation - Res" : "Current Valuation - Residential",
        "CURRENT VALUATION COMM" : "Current Valuation - Commercial",
        "Current Valuation - Comm" : "Current Valuation - Commercial",
        "Current Valuation Comm" : "Current Valuation - Commercial",
        "MUL_CURR_VAL_COMM" : "Current Valuation - Commercial",
        "CURRENT VALUATION COMMERCIAL": "Current Valuation - Commercial",
        "ADDRESS" : "Address",
        "FULL_ADDRESS" : "Address",
        "ADU/TED ON PERMIT?" : "ADU/TED on Permit?",
        "ADU/TED ON ADDRESS\n" : "ADU or TED on Address?",
        "ADU/TED ON ADDRESS" : "ADU or TED on Address?",
        "ADU/TED ON ADDRESS?" : "ADU or TED on Address?",
        "ADU OR TED ON ADDRESS?" : "ADU or TED on Address?",
        "TOTAL CONSTRUCTION VALUATION" : "Total Construction Valuation",
        "Total Construction Valuation TCV" : "Total Construction Valuation",
        "Total Costruction Valuation" : "Total Construction Valuation",
        "CURRENT MARKET VALUATION" : "Current Market Valuation",
        "Current Market Valuation CMV" : "Current Market Valuation"
        }


    # Rename columns using mapping
    data.rename(columns=column_mapping, inplace=True)
    
    # Filter out rows where 'Record Number' contains numeric values
    data = data[~data['Record Number'].apply(lambda x: isinstance(x, (int, float)))]
    data = data[data['Record Number']!='count']

    # Filter out NaN values in 'Record Number'; copy so the assignments below do not write to a slice
    data = data[~data['Record Number'].isna()].copy()

    # Remove only the trailing newline character
    data['ADU or TED on Address?'] = data['ADU or TED on Address?'].str.rstrip('\n').replace('', np.nan)

    # Add 'Property Type' column
    data['Property Type'] = np.nan

    # For each (Record Number, Address) pair, keep only the row with the most non-null values
    data = data.loc[data.groupby(['Record Number','Address'])[data.columns].apply(lambda x: x.notnull().sum(axis=1).idxmax())]

    return data

data_dump_main = extract_data_dump_main(file, sheet)
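
The last step of `extract_data_dump_main` keeps one row per record and address: the row with the most filled-in cells. A minimal sketch of that idea on made-up data follows. The record numbers, addresses, and values are hypothetical, and the sketch counts non-null cells once up front rather than inside `apply`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: record A-1 appears twice, once sparsely filled
toy = pd.DataFrame({
    "Record Number": ["A-1", "A-1", "B-2"],
    "Address": ["1 MAIN ST", "1 MAIN ST", "2 OAK AVE"],
    "Project Cost": [np.nan, 5000.0, 1200.0],
    "Business Name": [np.nan, "ACME", np.nan],
})

# Count non-null cells per row, then pick the fullest row within each group
filled = toy.notna().sum(axis=1)
keep_idx = filled.groupby([toy["Record Number"], toy["Address"]]).idxmax()
deduped = toy.loc[keep_idx.tolist()].reset_index(drop=True)

print(deduped)
```

Note that `groupby` drops groups whose key is missing by default, so a row with a blank address would not survive this step.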

In [7]:
def find_header_row_in_revisions(file, sheet):
    """
    Finds the row number that likely contains the revisions header based on >50% non-null values.
    
    Args:
    file: the file path (string) for the development report workbook
    sheet: the specific data dump sheet (string) 

    Outputs:
    header_row: the best row number (index) for the revisions header row
    """
    # Get the first few rows to see where the header exists
    preview = pd.read_excel(file, sheet, header=None, nrows=5).iloc[:,16:]
    total_cols = preview.shape[1]
    
    header_row = None

    # Starting with top row, iterate through until we get columns with names
    for i, row in preview.iterrows():
        non_null_count = row.notna().sum()
        unnamed_count = sum(str(col).startswith("Unnamed") for col in row.astype(str))

        # More than 50% of the row contains valid data and fewer than 50% of the cells are 'Unnamed'
        if non_null_count > (0.5 * total_cols) and unnamed_count < (0.5 * total_cols):
            header_row = i
            break

    if header_row is None:
        print(f"Warning: Header row for the '{sheet}' revisions was not found. Please ensure data exists for revisions and is formatted like previous data dump months.")

    return header_row

# Example
file = os.getcwd()+'/data/FY23 Development Report Workbook.xlsx'
sheet = '03 Data Dump'
find_header_row_in_revisions(file, sheet)

2

In [8]:
def extract_data_dump_revisions(file, sheet):
    """
    Extracts the revisions from a data dump from a specific sheet within a workbook.
    
    Args:
    file: the file path (string) for the development report workbook
    sheet: the specific data dump sheet (string) 

    Outputs:
    data: a pandas DataFrame of the revisions from the data dump
    """
    header = find_header_row_in_revisions(file, sheet)

    # Read Excel with the identified header row
    data = pd.read_excel(file, sheet, header=header).iloc[:,16:]

    # Sometimes the data dump and its revisions have column headers that aren't formatted the same.
    # This mapping allows for the program to recognize column headers from data dumps.
    # If data for a column is not showing up, the column header needs to be changed or added to fit the data requirements as seen in output data sets.

    column_mapping = {
        "CATEGORY" : "Category",
        "RECORD NUMBER": 'Record Number',
        "REC_NO" : "Record Number",
        "REC_NO1" : "Record Number",
        "COMM_SUB" : "Commercial Subtype",
        "COMMERCIAL SUBTYPE" : "Commercial Subtype",
        "RES_SUB" : "Residential Subtype",
        "RESIDENTIAL SUBTYPE" : "Residential Subtype",
        "WORK DESCRIPTION": "Work Description",
        "WORK_DES1" : "Work Description",
        "WORK_DES" : "Work Description",
        "BUSINESS NAME" : "Business Name",
        "BUS_NAME1" : "Business Name",
        "BUS_NAME" : "Business Name",
        "PROJECT COST" : "Project Cost",
        "MUL_PROJ_COST" : "Project Cost",
        "JOB COST VALUATION" : "Job Cost Valuation",
        "MUL_JOB_COST_VAL" : "Job Cost Valuation",
        "CENSUS - CURRENT VALUATION RES" : "Current Valuation - Residential",
        "Current Valuation Res" : "Current Valuation - Residential",
        "CENSUS CURRENT VALUATION RESIDENTIAL" : "Current Valuation - Residential",
        "Current Valuation - Res" : "Current Valuation - Residential",
        "MUL_CURR_VAL_RES" : "Current Valuation - Residential",
        "CENSUS Current Valuation - Res" : "Current Valuation - Residential",
        "CURRENT VALUATION COMM" : "Current Valuation - Commercial",
        "Current Valuation - Comm" : "Current Valuation - Commercial",
        "Current Valuation Comm" : "Current Valuation - Commercial",
        "MUL_CURR_VAL_COMM" : "Current Valuation - Commercial",
        "CURRENT VALUATION COMMERCIAL": "Current Valuation - Commercial",
        "ADDRESS" : "Address",
        "FULL_ADDRESS" : "Address",
        "ADU/TED ON PERMIT?" : "ADU/TED on Permit?",
        "ADU/TED ON ADDRESS\n" : "ADU or TED on Address?",
        "ADU/TED ON ADDRESS" : "ADU or TED on Address?",
        "ADU/TED ON ADDRESS?" : "ADU or TED on Address?",
        "ADU OR TED ON ADDRESS?" : "ADU or TED on Address?",
        "TOTAL CONSTRUCTION VALUATION" : "Total Construction Valuation",
        "Total Construction Valuation TCV" : "Total Construction Valuation",
        "Total Costruction Valuation" : "Total Construction Valuation",
        "CURRENT MARKET VALUATION" : "Current Market Valuation",
        "Current Market Valuation CMV" : "Current Market Valuation"
        }
    
    # Rename columns using mapping
    data.rename(columns=column_mapping, inplace=True)
    
    # Check if any column matches a key or value in column_mapping (i.e. a recognized header exists)
    if not any(col in data.columns for col in chain(*column_mapping.items())):
        return pd.DataFrame()  # Return an empty DataFrame if no matches

    # Convert relevant columns to numeric
    numeric_cols = ['Project Cost', 'Job Cost Valuation', 'Current Valuation - Commercial', 'Current Valuation - Residential']
    for col in numeric_cols:
        data[col] = pd.to_numeric(data[col], errors='coerce')

    # Filter out rows where 'Record Number' is numeric
    if 'Record Number' in data.columns:
        data = data[~data['Record Number'].apply(lambda x: isinstance(x, (int, float)))]

    # Keep rows where there has been a valuation change
    data = data[(abs(data['Project Cost'].fillna(0)) > 0) | 
                (abs(data['Job Cost Valuation'].fillna(0)) > 0) | 
                (abs(data['Current Valuation - Commercial'].fillna(0)) > 0) | 
                (abs(data['Current Valuation - Residential'].fillna(0)) > 0)]

    # Drop columns with "Unnamed" in their name; copy so the assignments below do not write to a slice
    data = data.loc[:, ~data.columns.str.contains("Unnamed", na=False)].copy()

    # Add 'Property Type' column
    data['Property Type'] = 'Modification to Work in Progress'

    # Calculate 'Total Construction Valuation'
    data['Total Construction Valuation'] = data['Project Cost'].fillna(0) + data['Job Cost Valuation'].fillna(0)

    # Calculate 'Current Market Valuation'
    data['Current Market Valuation'] = (data['Current Valuation - Commercial'].fillna(0) + 
                                        data['Current Valuation - Residential'].fillna(0) - 
                                        data['Total Construction Valuation']).clip(lower=0)

    # Remove duplicate rows
    data.drop_duplicates(inplace=True)

    return data

# Example usage
data_dump_revisions = extract_data_dump_revisions(file, sheet)
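
The two derived valuation columns follow a simple rule: total construction valuation is project cost plus job cost, and current market valuation is the combined current valuations minus that total, floored at zero. The sketch below works through the arithmetic on two rows. The first row is made up, and the second reuses figures from the sample output in this notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Project Cost": [10000.0, np.nan],
    "Job Cost Valuation": [np.nan, 64971.80],
    "Current Valuation - Commercial": [np.nan, np.nan],
    "Current Valuation - Residential": [np.nan, 207273.78],
})

# Missing values count as zero in both sums
total = toy["Project Cost"].fillna(0) + toy["Job Cost Valuation"].fillna(0)
market = (
    toy["Current Valuation - Commercial"].fillna(0)
    + toy["Current Valuation - Residential"].fillna(0)
    - total
).clip(lower=0)  # a negative difference is reported as 0

toy["Total Construction Valuation"] = total
toy["Current Market Valuation"] = market
print(toy[["Total Construction Valuation", "Current Market Valuation"]])
```

In the first row the difference would be negative, so `clip(lower=0)` reports it as zero.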

In [9]:
def combine_main_and_revisions(main, revisions):
    """
    Combines the main data dump and the revisions from a specific sheet within a workbook.
    
    Args:
    main: a pandas DataFrame of the main data dump
    revisions: a pandas DataFrame of the revisions from the data dump

    Outputs:
    data: a pandas DataFrame of the combined main and revisions data dump
    """
    data = pd.concat([main, revisions], ignore_index = True)
    return data

data_dump = combine_main_and_revisions(data_dump_main, data_dump_revisions)
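
Both extraction functions carry their own copy of the same header mapping, so a new header variant has to be added in two places. One possible alternative, sketched under assumptions, is a single module-level lookup keyed on upper-cased, whitespace-stripped header text. The names `COLUMN_ALIASES` and `normalize_columns` are hypothetical, and only a handful of the notebook's entries are reproduced.

```python
import pandas as pd

# Subset of the notebook's mapping, keyed by upper-cased, stripped header text
COLUMN_ALIASES = {
    "RECORD NUMBER": "Record Number",
    "REC_NO": "Record Number",
    "ADDRESS": "Address",
    "FULL_ADDRESS": "Address",
    "ADU/TED ON ADDRESS": "ADU or TED on Address?",
    "CURRENT VALUATION - RES": "Current Valuation - Residential",
}


def normalize_columns(df):
    """Rename recognized headers; leave unrecognized ones untouched."""
    renamed = {
        col: COLUMN_ALIASES[str(col).strip().upper()]
        for col in df.columns
        if str(col).strip().upper() in COLUMN_ALIASES
    }
    return df.rename(columns=renamed)


raw = pd.DataFrame(columns=["REC_NO", "ADU/TED ON ADDRESS\n", "Current Valuation - Res", "Notes"])
clean_columns = list(normalize_columns(raw).columns)
print(clean_columns)
```

Normalizing case and whitespace before the lookup means variants such as a trailing newline or mixed capitalization need no separate entries.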

## Data Definitions for Cleaning

In [10]:
def assign_property_type(row):
    """
    Depending on the row data, this function will assign a property type to the row.
    
    Args:
    row: a pandas Series of the row data

    Outputs:
    a string of the property type
    """
    # New Construction
    
    # Single Dwelling Attached
    if (row['Residential Subtype'] == 'BNSFT - New Single Family Townhouse' or row['Residential Subtype'] == 'BNSFR - New Single Family Residence') and row['ADU or TED on Address?'] == 'SFR-ATT':
        return 'Single Dwelling Attached'
    # Single Dwelling Detached
    elif (row['Residential Subtype'] == 'BNSFT - New Single Family Townhouse' or row['Residential Subtype'] == 'BNSFR - New Single Family Residence') and (row['ADU or TED on Address?'] == 'SFR-DET' or row['ADU or TED on Address?'] == 'TWNHS'):
        return 'Single Dwelling Detached'
    # ADU
    elif (row['Residential Subtype'] == 'BNSFT - New Single Family Townhouse' or row['Residential Subtype'] == 'BNSFR - New Single Family Residence' or row['Residential Subtype'] == 'BAARR -  Add/Alter/Remodel Residential' or (row['Category'] == '06 - New Miscellaneous' and row['Residential Subtype'] == 'BNRDG - New Detached Garage/Carport')) and (row['ADU or TED on Address?'] == 'ADU' or row['ADU/TED on Permit?'] == 'ADU'):
        return 'ADU'
    # Duplex
    elif (row['Category'] == '04 - New Duplex' or((row['Residential Subtype'] == 'BNRDX - New Duplex' or row['Commercial Subtype'] == 'BNMRA - New Multifamily 3-4 Units') and (row['ADU or TED on Address?'] == 'Duplex' or row['ADU or TED on Address?'] == 'DUPLEX'))):
        return 'Duplex'
    # Multi-Dwelling Apartment
    elif (row['Commercial Subtype'] == 'BNMRA - New Multifamily 3-4 Units' or row['Commercial Subtype'] == 'BNMRB - New Multifamily 5+ Units') and row['ADU or TED on Address?'] == 'MFR-APT':
        return 'Multi-Dwelling Apartment'
    # Multi-Dwelling Condo
    elif (row['Commercial Subtype'] == 'BNMRA - New Multifamily 3-4 Units' or row['Commercial Subtype'] == 'BNMRB - New Multifamily 5+ Units') and row['ADU or TED on Address?'] == 'MFR-CONDO':
        return 'Multi-Dwelling Condo'
    # TED Single Dwelling
    elif (row['Residential Subtype'] == 'BNSFT - New Single Family Townhouse' or row['Residential Subtype'] == 'BNSFR - New Single Family Residence') and (row['ADU or TED on Address?'] == 'TED SF' or row['ADU or TED on Address?'] == 'TED-SFR' or row['ADU or TED on Address?'] == 'TED') :
        return 'TED Single Dwelling'
    # TED Two Unit
    elif (row['Residential Subtype'] == 'BNSFT - New Single Family Townhouse' or row['Residential Subtype'] == 'BNSFR - New Single Family Residence') and (row['ADU or TED on Address?'] == 'TED 2U' or row['ADU or TED on Address?'] == 'TED-2U'):
        return 'TED Two Unit'    
    # TED 3+ Unit
    elif (row['Residential Subtype'] == 'BNSFT - New Single Family Townhouse' or row['Residential Subtype'] == 'BNSFR - New Single Family Residence' or row['Residential Subtype'] == 'BNMRA - New Multifamily 3 -4 Units') and row['ADU or TED on Address?'] == 'TED 3+':
        return 'TED 3+'
    # Misc. (Garage, Shed, etc.)
    elif (
          ('CARPORT' in str(row['Work Description']).upper() 
           and 'REMODEL' not in str(row['Category']).upper()) 
          or row['Residential Subtype'] == 'BNRDA - New Detached Accessory Building' 
          or row['Residential Subtype'] == 'BNRDG - New Detached Garage/Carport' 
          or row['Residential Subtype'] == 'BRFND - New Residential Foundation' 
          or row['Residential Subtype'] == 'BO\\S\\R - Other Residential' 
          or (row['Category'] == '06 - New Miscellaneous' 
              and row['Commercial Subtype'] == 'BNCON -  New Other')
          ):
        return 'Misc. (Garage, Shed, etc.)'
    # Assembly
    elif row['Category'] == '07 - New Assembly':
        return 'Assembly'
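    # Business (excluding 'New Other Than Building', which is classified as Commercial further down)
    elif row['Category'] == '08 - New Business' and row['Commercial Subtype'] != 'BNCNB -  New Other Than Building':
        return 'Business'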
    # Education (Undefined)
    elif row['Category'] == '09 - New Education':
        return 'Education'
    # Hazardous (Undefined)
    elif row['Category'] == '10 - New Hazardous':
        return 'Hazardous'
    # Institutional (Undefined)
    elif 1 == 2:
        return 'Institutional'
    
    # Addition/Remodel

    # Commercial
    if (row['Category'] == '01 - Remodel Commercial' or (row['Category'] == '08 - New Business' and row['Commercial Subtype'] == 'BNCNB -  New Other Than Building')):
        return 'Commercial'
    # Residential
    if row['Category'] == '02 - Remodel Residential':
        return 'Residential'

    # Revision

    # Modification to Work in Progress
    if row['Property Type'] == 'Modification to Work in Progress':
        return 'Modification to Work in Progress'

    # Flag Unspecified
    else:
        return 'Unspecified'

def assign_project_type(row):
    """
    Depending on the row data, this function will assign a project type to the row.
    
    Args:
    row: a pandas Series of the row data

    Outputs:
    a string of the project type
    """
    if (row['Property Type'] == 'Commercial') or (row['Property Type'] == 'Residential') or (row['Property Type'] == 'Modification to Work in Progress'):
        return 'Addition/Remodel'
    else:
        return 'New Construction'
    
def assign_property_type_order(row):
    """
    Depending on the row data, this function will assign a property type order to the row.
    
    Args:
    row: a pandas Series of the row data

    Outputs:
    an integer of the property type order
    """
    permit_type_order = {
        "Single Dwelling Attached": 1,
        "Single Dwelling Detached": 2,
        "Duplex": 3,
        "Multi-Dwelling Apartment": 4,
        "Multi-Dwelling Condo": 5,
        "TED Single Dwelling": 6,
        "TED Two Unit": 7,
        "TED 3+": 8,
        "Misc. (Garage, Shed, etc.)": 9,
        "Assembly": 10,
        "Business": 11,
        "Education": 12,
        "Hazardous": 13,
        "Institutional": 14,
        "Residential": 15,
        "Commercial": 16,
        "Modification to Work in Progress": 17
    }
    if row['Property Type'] in permit_type_order:
        return permit_type_order[row['Property Type']]
    else:
        return 99
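
Because the order lookup is an exact string match, any label whose spelling or capitalization differs from what `assign_property_type` returns falls through to 99 without warning. The sketch below shows a vectorized version of the lookup and a quick check for labels that have no assigned order. `ORDER` is a hypothetical subset of the mapping, not the full dictionary.

```python
import pandas as pd

# Hypothetical subset of labels and their sort order
ORDER = {
    "Duplex": 3,
    "Multi-Dwelling Apartment": 4,
    "Modification to Work in Progress": 17,
}

property_types = pd.Series(
    ["Duplex", "Multi-Dwelling Apartment", "Modification to Work in Progress", "Unspecified"]
)

# Vectorized lookup; labels missing from ORDER fall back to 99
order = property_types.map(ORDER).fillna(99).astype(int)

# Surface labels that fell through to the default
unmapped = sorted(set(property_types) - set(ORDER))
print(order.tolist(), unmapped)
```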


In [11]:
# Apply the functions to the data dump
data_dump['Property Type'] = data_dump.apply(assign_property_type, axis=1)
data_dump['Project Type'] = data_dump.apply(assign_project_type, axis=1)
data_dump['Property Type Order'] = data_dump.apply(assign_property_type_order, axis=1)
data_dump

,Category,Record Number,Commercial Subtype,Residential Subtype,ADU/TED on Permit?,Work Description,Business Name,Project Cost,Job Cost Valuation,Current Valuation - Commercial,Current Valuation - Residential,Address,ADU or TED on Address?,Total Construction Valuation,Current Market Valuation,Property Type,Project Type,Property Type Order
0,03 - New Single Family Residential,2021-MSS-RES-00429,,BNSFT - New Single Family Townhouse,,UNIT A/NEW TOWNHOME/ATT GARAGE//VB/R-3,TOLLEFSON CONSTRUCTION,,64971.80,,207273.78,2205-A S 13TH ST W,TED 2U,64971.80,142301.98,TED Two Unit,New Construction,7
1,03 - New Single Family Residential,2021-MSS-RES-00435,,BNSFT - New Single Family Townhouse,,NEW TOWNHOME UNIT B/ATT GARAGE/VB/R-3,TOLLEFSON CONSTRUCTION,,64971.80,,207273.78,2205-B S 13TH ST W,TED 2U,64971.80,142301.98,TED Two Unit,New Construction,7
2,08 - New Business,2022-MSS-COM-00215,BNCON - New Other,,,HILLVIEW APTS GARAGE BLDG 1/NEW GARAGE BLDG/VB/U,QUALITY CONSTRUCTION COMPANY,,44023.10,,171651.36,1990 RIMEL RD,,44023.10,127628.26,Business,New Construction,11
3,08 - New Business,2022-MSS-COM-00221,BNCON - New Other,,,HILLVIEW APTS GARAGE BLDG 2/NEW GARAGE BLDG/VB/U,QUALITY CONSTRUCTION COMPANY,,48086.80,,151981.45,1980 RIMEL RD,,48086.80,103894.65,Business,New Construction,11
4,08 - New Business,2022-MSS-COM-00222,BNCON - New Other,,,HILLVIEW APTS GARAGE BLDG 3/NEW GARAGE BLDG/VB/U,QUALITY CONSTRUCTION COMPANY,,55371.52,,175005.28,1950 RIMEL RD,,55371.52,119633.76,Business,New Construction,11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97,02 - Remodel Residential,2023-MSS-RES-00165,,BRRRS - Reroof or Reside Residential,,Detached Shed / 1 3-TAB / 1 story re-roof / R&...,JARED LANGLEY ENTERPRISES INC,25000.0,0.00,,,2804 HARMONY CT,,25000.00,0.00,Residential,Addition/Remodel,15
98,02 - Remodel Residential,2023-MSS-RES-00169,,BRRRS - Reroof or Reside Residential,,SFR / Remove & Replace 30# Felt / replace with...,RHINO ROOFING,13000.0,0.00,,,2508 BLACKTHORN DR,,13000.00,0.00,Residential,Addition/Remodel,15
99,02 - Remodel Residential,2023-MSS-RES-00170,,BRRRS - Reroof or Reside Residential,,SFR / Remove & Replace 30# Felt / replace with...,RHINO ROOFING,22000.0,0.00,,,5055 JORDAN CT,,22000.00,0.00,Residential,Addition/Remodel,15
100,02 - Remodel Residential,2023-MSS-RES-00171,,BRRRS - Reroof or Reside Residential,,SFR / Tear off and remove existing shingles an...,MIGHTY DOG ROOFING 157,9500.0,0.00,,,127 HASTINGS AVE,,9500.00,0.00,Residential,Addition/Remodel,15


## Iterating over Multiple Data Dumps

In [12]:
def process_data_dumps(file):
    """
    This function will process the data dumps from a specific workbook.
    
    Args:
    file: the file path (string) for the development report workbook

    Outputs:
    final_data_dump: a pandas DataFrame of the processed data dumps in the workbook
    """
    sheet_details_df = build_sheet_details_df(file)[['Data Dump', 'Permit Date', 'Permit Month', 'Month Number', 'Calendar Year', 'Fiscal Year']]

    # Get sheets that contain both 'data' and 'dump'
    sheets = [sheet for sheet in pd.ExcelFile(file).sheet_names if all(word in sheet.lower() for word in ['data', 'dump'])]

    # List to store processed data frames
    all_data = []

    for sheet in sheets:
        
        # Extract main and revision data
        data_dump_main = extract_data_dump_main(file, sheet)
        data_dump_revisions = extract_data_dump_revisions(file, sheet)

        # Combine main and revision data
        data_dump = combine_main_and_revisions(data_dump_main, data_dump_revisions)

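        # Strip '_x000D_', Excel's escaped carriage return, from text cells
        # (DataFrame.map requires pandas >= 2.1; earlier versions use applymap)
        data_dump = data_dump.map(lambda x: x.replace('_x000D_', '') if isinstance(x, str) else x)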

        data_dump['Property Type'] = data_dump.apply(assign_property_type, axis=1)
        data_dump['Project Type'] = data_dump.apply(assign_project_type, axis=1)
        data_dump['Property Type Order'] = data_dump.apply(assign_property_type_order, axis=1)

        # Drop columns that are unneeded
        data_dump = data_dump.drop([col for col in ['DATE ISSUED', 'TOTAL MOD'] if col in data_dump.columns], axis=1)
        
        # Select condo rows, which can appear once per address under the same record number
        condo_dupes = data_dump[data_dump['Property Type'] == 'Multi-Dwelling Condo']

        if not condo_dupes.empty:

            # Collapse to one row per "Record Number": sum the valuations, join the addresses, keep the first value elsewhere
            consolidated_condos = condo_dupes.groupby(
                ["Record Number"], as_index=False
            ).agg(
                {   'Category': 'first',
                    'Project Cost': 'sum',
                    'Job Cost Valuation': 'sum',
                    'Current Valuation - Commercial': 'sum',
                    'Current Valuation - Residential': 'sum',
                    'Total Construction Valuation': 'sum',
                    'Current Market Valuation': 'sum',
                    'Address': lambda x: ', '.join(x),  # Concatenate addresses
                    'Commercial Subtype': 'first',
                    'Residential Subtype': 'first',
                    'ADU/TED on Permit?': 'first',
                    'Work Description': 'first',
                    'Business Name': 'first',
                    'ADU or TED on Address?': 'first',
                    'Property Type': 'first',
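                    'Project Type': 'first',
                    'Property Type Order': 'first'  # Columns left out of agg would be NaN for these rows after the concat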
                }
            )
            # Remove original duplicate rows
            data_dump = data_dump.drop(condo_dupes.index)
            # Append the consolidated rows back into the DataFrame
            data_dump = pd.concat([data_dump, consolidated_condos], ignore_index=True)

        # Now, we will add the dwellable Units column
        data_dump['Units'] = 0  # Default value
        
        # Property Types increasing units by 1
        data_dump.loc[data_dump['Property Type'].isin(['Single Dwelling Attached', 'Single Dwelling Detached', 'TED Single Dwelling', 'TED Two Unit', 'TED 3+', 'ADU']), 'Units'] = 1

        # Property Types increasing units by 2
        data_dump.loc[data_dump['Property Type'] == 'Duplex', 'Units'] = 2
    
        # Property Types increasing units by 3+
        word_to_num = {
            "Tri": 3,
            "Quad": 4,
            "Five": 5,
            "Six": 6
        }

        # Create a mask for multi-dwelling properties to filter only relevant rows
        multi_dwelling_mask = data_dump['Property Type'].isin([
            'Multi-Dwelling Apartment',
            'Multi-Dwelling Condo'
        ])
        multi_dwelling_data = data_dump[multi_dwelling_mask]

        # Process only the filtered rows
        for index, row in multi_dwelling_data.iterrows():
            work_description = row['Work Description']
            
            # Look for numeric patterns first (more common case)
            match = re.search(
                r'(\d+)[- ]*(Plex|Unit|APT)',
                work_description,
                re.IGNORECASE
            )
            if match:
                data_dump.at[index, 'Units'] = int(match.group(1))
                continue
                
            # Look for word-based numbers if numeric pattern wasn't found
            for word, num in word_to_num.items():
                match = re.search(
                    rf'\b{word}\b',
                    work_description,
                    re.IGNORECASE
                )
                if match:
                    data_dump.at[index, 'Units'] = num
                    break
        
        # Create a new column for the sheet name
        data_dump["Data Dump"] = sheet

        # Left join with sheet_details_df on "Data Dump" (the sheet name)
        data_dump = data_dump.merge(sheet_details_df, on="Data Dump", how="left")

        # Append processed sheet data to the list
        all_data.append(data_dump)

    # Concatenate all processed sheets into a single DataFrame
    data_dumps = pd.concat(all_data, ignore_index=True) if all_data else pd.DataFrame()

    # Duplicate rows, while creating new column which enables dashboard filtering by year type
    final_data_dump = pd.concat([data_dumps.assign(**{"Year Type": 'Calendar', 'Permit Year': data_dumps['Calendar Year']}),
                                data_dumps.assign(**{"Year Type": 'Fiscal', 'Permit Year': data_dumps['Fiscal Year']})
                                ]).drop(['Calendar Year', 'Fiscal Year'], axis=1)

    return final_data_dump

file = os.getcwd()+'/data'+'/FY24 Development Report Workbook.xlsx'

data = process_data_dumps(file)
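
To make the unit-count parsing inside `process_data_dumps` easier to follow, here is a minimal standalone sketch of the same two-step lookup: a numeric pattern first, then number words. The helper name `parse_units` and the sample work descriptions are hypothetical and were made up for illustration; they are not taken from the workbooks.

```python
import re

# Same word-to-number mapping used in process_data_dumps
WORD_TO_NUM = {"Tri": 3, "Quad": 4, "Five": 5, "Six": 6}

def parse_units(work_description):
    """Hypothetical helper: return the unit count found in a description, or 0."""
    if not isinstance(work_description, str):
        return 0  # A missing description (NaN) carries no unit information
    # Numeric pattern first, e.g. "12 unit", "4-plex", "8 APT"
    match = re.search(r'(\d+)[- ]*(Plex|Unit|APT)', work_description, re.IGNORECASE)
    if match:
        return int(match.group(1))
    # Fall back to number words that appear as whole words
    for word, num in WORD_TO_NUM.items():
        if re.search(rf'\b{word}\b', work_description, re.IGNORECASE):
            return num
    return 0

print(parse_units("New 4-plex with attached garages"))  # 4
print(parse_units("MFR / 12 unit apartment building"))  # 12
print(parse_units("Six townhome condos"))               # 6
print(parse_units("Triplex on corner lot"))             # 0
```

The last call returns 0 because `\bTri\b` only matches "Tri" as a whole word, so a description written as "Triplex" would fall through both steps. Whether that case occurs in the audited data is not something this sketch can confirm.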



## Process & Store Data Dumps

In [13]:
# 2025 ONLY
file = os.getcwd()+'/data/FY25 Development Report Workbook.xlsx'

data = process_data_dumps(file)

data.to_csv('output/BuildingPermitDataFY2025.csv', index=False)

In [14]:
def process_all_data_dumps(folder):
    """
    This function will process all data dumps in a folder and write the combined result to 'output/BuildingPermitDataAll.csv'.
    
    Args:
    folder: the folder path (string) for the development report workbooks

    Outputs:
    None. The combined pandas DataFrame (with zero-filled placeholder rows) is saved to 'output/BuildingPermitDataAll.csv'.
    """
    all_data = []

    for file in os.listdir(folder):
        if file.endswith(".xlsx") and not file.startswith("~$"):  # Exclude temp Excel files
            file = os.path.join(folder, file)
            df = process_data_dumps(file)  # Process each workbook with the function defined above
            all_data.append(df)

    # Only continue if at least one workbook was processed
    if all_data:
        data = pd.concat(all_data, ignore_index=True)

        # Get valid combinations of 'Project Type' and 'Property Type'
        valid_combinations = data[['Project Type', 'Property Type']].dropna().drop_duplicates().values.tolist()

        # Get unique values for 'Month Number', 'Year Type', and 'Permit Year'
        month_numbers = data['Month Number'].dropna().unique().tolist()
        year_types = data['Year Type'].dropna().unique().tolist()
        permit_years = data['Permit Year'].dropna().unique().tolist()

        # Generate Cartesian product
        dummy_combinations = list(itertools.product(month_numbers, year_types, permit_years, valid_combinations))

        # Unpack 'Project Type' and 'Property Type' from the tuples
        dummy_rows = pd.DataFrame(dummy_combinations, columns=['Month Number', 'Year Type', 'Permit Year', 'Project_Property'])
        dummy_rows[['Project Type', 'Property Type']] = pd.DataFrame(dummy_rows['Project_Property'].tolist(), index=dummy_rows.index)
        dummy_rows.drop(columns=['Project_Property'], inplace=True)

        # Add blank columns for other schema values
        for col in data.columns:
            if col not in ['Month Number', 'Year Type', 'Permit Year', 'Project Type', 'Property Type']:
                if col in ['Total Construction Valuation', 'Current Market Valuation', 'Units']:
                    dummy_rows[col] = 0  # Fill numerical fields with 0
                else:
                    dummy_rows[col] = None  # Keep others as None

        # Append and save
        final_df = pd.concat([data, dummy_rows], ignore_index=True)
        final_df.to_csv('output/BuildingPermitDataAll.csv', index=False)
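
The placeholder rows built in `process_all_data_dumps` can be seen more clearly on a toy example. The sketch below is a simplified illustration, not the notebook's exact code: the column values are made up (including the "New Construction" project type), and only `Units` is zero-filled.

```python
import itertools
import pandas as pd

# Toy data: each month has only one of the two property types
data = pd.DataFrame({
    "Month Number": [7, 8],
    "Year Type": ["Calendar", "Calendar"],
    "Permit Year": [2024, 2024],
    "Project Type": ["New Construction", "Addition/Remodel"],
    "Property Type": ["Duplex", "Residential"],
    "Units": [2, 0],
})

# Observed (Project Type, Property Type) pairs
valid_pairs = data[["Project Type", "Property Type"]].drop_duplicates().values.tolist()

# Cartesian product of months, year types, years, and observed pairs
combos = itertools.product(
    data["Month Number"].unique().tolist(),
    data["Year Type"].unique().tolist(),
    data["Permit Year"].unique().tolist(),
    valid_pairs,
)

# Flatten each (month, year_type, year, [project, property]) tuple into one row
dummy_rows = pd.DataFrame(
    [(m, yt, y, proj, prop) for m, yt, y, (proj, prop) in combos],
    columns=["Month Number", "Year Type", "Permit Year", "Project Type", "Property Type"],
)
dummy_rows["Units"] = 0  # Placeholders contribute nothing to totals

padded = pd.concat([data, dummy_rows], ignore_index=True)
totals = padded.groupby(["Month Number", "Property Type"])["Units"].sum()
print(totals)
```

After padding, every month has an entry for every property type, so a grouped total shows an explicit 0 for a combination with no permits (here, Duplex in month 8) instead of a missing row.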

In [15]:
# Final Processing and Creation of `BuildingPermitDataAll.csv` Data Set
folder = os.getcwd()+'/data'
process_all_data_dumps(folder)





# Creation of `PermitDataAll.csv` Data Set

In [16]:
# We need to get the number of building permits per month so we can report on total number of permits
file = os.getcwd() + '/output/BuildingPermitDataAll.csv'

df = pd.read_csv(file)

# Filter the DataFrame to only include calendar year rows
df_calendar = df[df['Year Type'] == 'Calendar']

# Initialize a dictionary to store permit counts
permit_counts = defaultdict(int)

# Iterate through each row and count occurrences
for _, row in df_calendar.iterrows():
    year, month = str(row['Permit Year']), str(row['Permit Month'])  # 'Permit Month' holds the month name, e.g. 'July'
    key = f"{year}-{month}"  # Format as 'YYYY-Month', e.g. '2022-July'
    permit_counts[key] += 1  # Increment count

permit_counts

defaultdict(int,
            {'2022-July': 109,
             '2022-August': 138,
             '2022-September': 92,
             '2022-October': 129,
             '2022-November': 48,
             '2022-December': 46,
             '2023-January': 43,
             '2023-February': 55,
             '2023-March': 102,
             '2023-April': 85,
             '2023-May': 133,
             '2023-June': 143,
             '2023-July': 123,
             '2023-August': 106,
             '2023-September': 134,
             '2023-October': 85,
             '2023-November': 47,
             '2023-December': 57,
             '2024-January': 55,
             '2024-February': 54,
             '2024-March': 86,
             '2024-April': 125,
             '2024-May': 146,
             '2024-June': 140,
             '2021-September': 118,
             '2021-October': 107,
             '2021-November': 95,
             '2021-December': 38,
             '2022-January': 54,
             '2022-March': 6
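
The same tally can also be computed without an explicit row loop. The following is a minimal sketch on made-up rows, assuming the `Permit Year` and `Permit Month` columns look like the ones read from `BuildingPermitDataAll.csv`; the name `permit_counts_alt` is hypothetical.

```python
import pandas as pd

# Toy stand-in for the calendar-year rows of BuildingPermitDataAll.csv
df_calendar = pd.DataFrame({
    "Permit Year": [2024, 2024, 2024, 2023],
    "Permit Month": ["July", "July", "August", "July"],
})

# Count rows per (year, month) in one vectorized pass
counts = df_calendar.groupby(["Permit Year", "Permit Month"]).size()

# Rebuild the same 'YYYY-Month' string keys used by permit_counts
permit_counts_alt = {f"{year}-{month}": int(n) for (year, month), n in counts.items()}
print(permit_counts_alt)
```

Two differences from the loop are worth keeping in mind: `groupby` drops rows whose key is missing by default, and a plain dict raises `KeyError` for an unseen month where the `defaultdict` returns 0, so lookups would need `permit_counts_alt.get(key, 0)`.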

In [17]:
def get_value_to_right(df, search_value, num_columns=1):
    """
    Finds the first occurrence of `search_value` in the DataFrame and returns the value `num_columns` to the right.
    
    Args:
    df: a pandas DataFrame of the development report workbook
    search_value: the value (string) to search for
    num_columns: the number of columns (integer) to move to the right

    Outputs:
    value_to_right: the value to the right of the search value
    """
    for row in range(df.shape[0]):
        for col in range(df.shape[1] - num_columns):  # Avoid index error
            if str(df.iloc[row, col]).strip() == search_value:
                value_to_right = df.iloc[row, col + num_columns]
                if pd.notna(value_to_right) and str(value_to_right).strip():  # Check non-null and non-empty
                    return value_to_right  # Return first valid value found
    return None  # Return None if no valid value is found
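
As a quick illustration of how `get_value_to_right` behaves, the sketch below runs it on a small made-up grid. The function body is repeated so the example runs on its own; the layout (label, spacer column, quantity, revenue) is an assumption for illustration and may not match the real report sheets.

```python
import pandas as pd

def get_value_to_right(df, search_value, num_columns=1):
    # Copy of the function defined above, repeated so this sketch is self-contained
    for row in range(df.shape[0]):
        for col in range(df.shape[1] - num_columns):
            if str(df.iloc[row, col]).strip() == search_value:
                value_to_right = df.iloc[row, col + num_columns]
                if pd.notna(value_to_right) and str(value_to_right).strip():
                    return value_to_right
    return None

# Hypothetical report layout: label, spacer column, quantity, revenue
toy_report = pd.DataFrame([
    ["Permit Type", None, "Qty", "Revenue"],
    ["Electrical ", None, 159, 24109.30],
    ["Plumbing", None, None, 10724.00],
])

print(get_value_to_right(toy_report, "Electrical", num_columns=2))  # 159
print(get_value_to_right(toy_report, "Plumbing", num_columns=2))    # None (blank cell)
print(get_value_to_right(toy_report, "Fence", num_columns=2))       # None (label absent)
```

Note that the match is exact after stripping whitespace, so "Electrical " is found but a label with extra text, such as a footnote marker, would need to be searched for verbatim.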

In [18]:
def extract_building_permits(file, sheet, year, month):
    """
    This function will extract the building permits from the development report workbook.
    
    Args:
    file: the file path (string) for the development report workbook
    sheet: the sheet name (string) of the development report workbook
    year: the year (string) of the development report workbook
    month: the month (string) of the development report workbook

    Outputs:
    df: a pandas DataFrame of the building permits
    """
        
    # Read the sheet without a header row so every cell can be searched by position
    data = pd.read_excel(file, sheet, header=None)

    # Initialize a dictionary to hold permit data
    permit_data = {
        'Permit Type': [],
        'Quantity': [],
        'Revenue': []
    }

    # Electrical
    search = "Electrical"
    permits_number = get_value_to_right(data, search, num_columns=2)
    permits_revenue = get_value_to_right(data, search, num_columns=3)
    permit_data['Permit Type'].append("Electrical")
    permit_data['Quantity'].append(permits_number if permits_number else 0)
    permit_data['Revenue'].append(permits_revenue if permits_revenue else 0)

    # Plumbing
    search = "Plumbing"
    permits_number = get_value_to_right(data, search, num_columns=2)
    permits_revenue = get_value_to_right(data, search, num_columns=3)
    permit_data['Permit Type'].append("Plumbing")
    permit_data['Quantity'].append(permits_number if permits_number else 0)
    permit_data['Revenue'].append(permits_revenue if permits_revenue else 0)

    # Mechanical
    search = "Mechanical"
    permits_number = get_value_to_right(data, search, num_columns=2)
    permits_revenue = get_value_to_right(data, search, num_columns=3)
    permit_data['Permit Type'].append("Mechanical")
    permit_data['Quantity'].append(permits_number if permits_number else 0)
    permit_data['Revenue'].append(permits_revenue if permits_revenue else 0)

    # Demolition 
    search = "Demolition"
    permits_number = get_value_to_right(data, search, num_columns=2)
    permits_revenue = get_value_to_right(data, search, num_columns=3)
    permit_data['Permit Type'].append("Demolition")
    permit_data['Quantity'].append(permits_number if permits_number else 0)
    permit_data['Revenue'].append(permits_revenue if permits_revenue else 0)

    # Building Permits (quantity comes from the permit_counts tally built above, not from the report sheet)
    search = "Building"
    permits_number = permit_counts[f"{year}-{month}"]
    permits_revenue = get_value_to_right(data, search, num_columns=3)
    permit_data['Permit Type'].append("Building")
    permit_data['Quantity'].append(permits_number if permits_number else 0)
    permit_data['Revenue'].append(permits_revenue if permits_revenue else 0)

    # Create a DataFrame from the dictionary
    df = pd.DataFrame(permit_data)
    df['Permit Category'] = ['Building Permits']*len(df)
    df['Calendar Year'] = [year]*len(df)
    df['Permit Month'] = [month]*len(df)
    
    return df

# Example
file = os.getcwd()+'/data/FY25 Development Report Workbook.xlsx'
sheet = 'Aug Report'
building_permits = extract_building_permits(file, sheet, '2024', 'August')
building_permits

Unnamed: 0,Permit Type,Quantity,Revenue,Permit Category,Calendar Year,Permit Month
0,Electrical,159,24109.3,Building Permits,2024,August
1,Plumbing,71,10724.0,Building Permits,2024,August
2,Mechanical,117,11052.0,Building Permits,2024,August
3,Demolition,12,231.0,Building Permits,2024,August
4,Building,179,134084.23,Building Permits,2024,August


In [19]:
def extract_other_permits(file, sheet, year, month):
    """
    This function will extract the other permits from the development report workbook.
    
    Args:
    file: the file path (string) for the development report workbook
    sheet: the sheet name (string) of the development report workbook
    year: the year (string) of the development report workbook
    month: the month (string) of the development report workbook

    Outputs:
    df: a pandas DataFrame of the other permits
    """
    # Read the sheet without a header row so every cell can be searched by position
    data = pd.read_excel(file, sheet, header=None)

    # Initialize a dictionary to hold permit data
    permit_data = {
        'Permit Type': [],
        'Quantity': [],
        'Revenue': []
    }

    # Water Service
    search = "Water Service***"
    permits_number = get_value_to_right(data, search, num_columns=2)
    permits_revenue = get_value_to_right(data, search, num_columns=3)
    permit_data['Permit Type'].append("Water Service")
    permit_data['Quantity'].append(permits_number if permits_number else 0)
    permit_data['Revenue'].append(permits_revenue if permits_revenue else 0)

    # Utility Excavation
    search = "Utility Excavation"
    permits_number = get_value_to_right(data, search, num_columns=2)
    permits_revenue = get_value_to_right(data, search, num_columns=3)
    permit_data['Permit Type'].append("Utility Excavation")
    permit_data['Quantity'].append(permits_number if permits_number else 0)
    permit_data['Revenue'].append(permits_revenue if permits_revenue else 0)

    # Sanitary Sewer Service
    search = "Sanitary Sewer Service"
    permits_number = get_value_to_right(data, search, num_columns=2)
    permits_revenue = get_value_to_right(data, search, num_columns=3)
    permit_data['Permit Type'].append("Sanitary Sewer Service")
    permit_data['Quantity'].append(permits_number if permits_number else 0)
    permit_data['Revenue'].append(permits_revenue if permits_revenue else 0)

    # Storm Sewer Service
    search = "Storm Sewer Service"
    permits_number = get_value_to_right(data, search, num_columns=2)
    permits_revenue = get_value_to_right(data, search, num_columns=3)
    permit_data['Permit Type'].append("Storm Sewer Service")
    permit_data['Quantity'].append(permits_number if permits_number else 0)
    permit_data['Revenue'].append(permits_revenue if permits_revenue else 0)

    # Right-of-way Construction
    search = "Right-of-way Construction"
    permits_number = get_value_to_right(data, search, num_columns=2)
    permits_revenue = get_value_to_right(data, search, num_columns=3)
    permit_data['Permit Type'].append("Right-of-way Construction")
    permit_data['Quantity'].append(permits_number if permits_number else 0)
    permit_data['Revenue'].append(permits_revenue if permits_revenue else 0)

    # Right-of-way Use
    search = "Right-of-way Use"
    permits_number = get_value_to_right(data, search, num_columns=2)
    permits_revenue = get_value_to_right(data, search, num_columns=3)
    permit_data['Permit Type'].append("Right-of-way Use")
    permit_data['Quantity'].append(permits_number if permits_number else 0)
    permit_data['Revenue'].append(permits_revenue if permits_revenue else 0)

    # ADA
    search = "ADA"
    permits_number = get_value_to_right(data, search, num_columns=2)
    permits_revenue = get_value_to_right(data, search, num_columns=3)
    permit_data['Permit Type'].append("ADA")
    permit_data['Quantity'].append(permits_number if permits_number else 0)
    permit_data['Revenue'].append(permits_revenue if permits_revenue else 0)

    # Paving
    search = "Paving"
    permits_number = get_value_to_right(data, search, num_columns=2)
    permits_revenue = get_value_to_right(data, search, num_columns=3)
    permit_data['Permit Type'].append("Paving")
    permit_data['Quantity'].append(permits_number if permits_number else 0)
    permit_data['Revenue'].append(permits_revenue if permits_revenue else 0)

    # Grading
    search = "Grading"
    permits_number = get_value_to_right(data, search, num_columns=2)
    permits_revenue = get_value_to_right(data, search, num_columns=3)
    permit_data['Permit Type'].append("Grading")
    permit_data['Quantity'].append(permits_number if permits_number else 0)
    permit_data['Revenue'].append(permits_revenue if permits_revenue else 0)
    
    # SWPPP
    search = "SWPPP"
    permits_number = get_value_to_right(data, search, num_columns=2)
    permits_revenue = get_value_to_right(data, search, num_columns=3)
    permit_data['Permit Type'].append("SWPPP")
    permit_data['Quantity'].append(permits_number if permits_number else 0)
    permit_data['Revenue'].append(permits_revenue if permits_revenue else 0)

    # Fence
    search = "Fence"
    permits_number = get_value_to_right(data, search, num_columns=2)
    permits_revenue = get_value_to_right(data, search, num_columns=3)
    permit_data['Permit Type'].append("Fence")
    permit_data['Quantity'].append(permits_number if permits_number else 0)
    permit_data['Revenue'].append(permits_revenue if permits_revenue else 0)

    # Sign
    search = "Sign"
    permits_number = get_value_to_right(data, search, num_columns=2)
    permits_revenue = get_value_to_right(data, search, num_columns=3)
    permit_data['Permit Type'].append("Sign")
    permit_data['Quantity'].append(permits_number if permits_number else 0)
    permit_data['Revenue'].append(permits_revenue if permits_revenue else 0)

    # Zoning Compliance
    search = "Zoning Compliance"
    permits_number = get_value_to_right(data, search, num_columns=2)
    permits_revenue = get_value_to_right(data, search, num_columns=3)
    permit_data['Permit Type'].append("Zoning Compliance")
    permit_data['Quantity'].append(permits_number if permits_number else 0)
    permit_data['Revenue'].append(permits_revenue if permits_revenue else 0)

    # Planning Floodplain (the row label may carry an "(added FY24)" suffix)
    search = "Planning Floodplain (added FY24)"
    permits_number = get_value_to_right(data, search, num_columns=2)
    permits_revenue = get_value_to_right(data, search, num_columns=3)
    # Fall back to the plain label when the suffixed label yields no value
    if not permits_number:
        permits_number = get_value_to_right(data, "Planning Floodplain", num_columns=2)
    if not permits_revenue:
        permits_revenue = get_value_to_right(data, "Planning Floodplain", num_columns=3)
    permit_data['Permit Type'].append("Planning Floodplain")
    permit_data['Quantity'].append(permits_number if permits_number else 0)
    permit_data['Revenue'].append(permits_revenue if permits_revenue else 0)

    # Create a DataFrame from the dictionary
    df = pd.DataFrame(permit_data)
    df['Permit Category'] = ['Other Permits']*len(df)
    df['Calendar Year'] = [year]*len(df)
    df['Permit Month'] = [month]*len(df)

    return df

# Example
file = os.getcwd()+'/data/FY25 Development Report Workbook.xlsx'
sheet = 'Sept Report'
year = '2024'
month = 'August'
other_permits = extract_other_permits(file, sheet, year, month)
other_permits

Unnamed: 0,Permit Type,Quantity,Revenue,Permit Category,Calendar Year,Permit Month
0,Water Service,35,41313.0,Other Permits,2024,August
1,Utility Excavation,21,12078.0,Other Permits,2024,August
2,Sanitary Sewer Service,43,24198.0,Other Permits,2024,August
3,Storm Sewer Service,0,0.0,Other Permits,2024,August
4,Right-of-way Construction,23,23587.38,Other Permits,2024,August
5,Right-of-way Use,7,434.0,Other Permits,2024,August
6,ADA,4,2215.6,Other Permits,2024,August
7,Paving,21,4347.0,Other Permits,2024,August
8,Grading,0,0.0,Other Permits,2024,August
9,SWPPP,16,6332.0,Other Permits,2024,August
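
The lookup blocks in `extract_other_permits` all follow one pattern, so they could be driven from a table of labels. The sketch below is one possible shape for that, under stated assumptions: `find_right` is a compact stand-in for `get_value_to_right`, `extract_permit_rows` and the label mapping are hypothetical names, and the Planning Floodplain figures in the toy sheet are made up.

```python
import pandas as pd

def find_right(df, label, offset):
    """Stand-in for get_value_to_right: first non-empty cell `offset` columns right of `label`."""
    for row in range(df.shape[0]):
        for col in range(df.shape[1] - offset):
            if str(df.iloc[row, col]).strip() == label:
                value = df.iloc[row, col + offset]
                if pd.notna(value) and str(value).strip():
                    return value
    return None

def extract_permit_rows(data, permit_labels):
    """Hypothetical loop-based variant: permit_labels maps output name -> candidate sheet labels."""
    rows = []
    for permit_type, candidates in permit_labels.items():
        quantity = revenue = None
        for label in candidates:
            # Keep the first candidate label that yields a value
            quantity = quantity or find_right(data, label, 2)
            revenue = revenue or find_right(data, label, 3)
        rows.append({"Permit Type": permit_type, "Quantity": quantity or 0, "Revenue": revenue or 0})
    return pd.DataFrame(rows)

# Toy sheet: label, spacer column, quantity, revenue
toy = pd.DataFrame([
    ["Water Service***", None, 35, 41313.0],
    ["Planning Floodplain", None, 2, 150.0],
])
labels = {
    "Water Service": ["Water Service***"],
    "Planning Floodplain": ["Planning Floodplain (added FY24)", "Planning Floodplain"],
    "Fence": ["Fence"],
}
result = extract_permit_rows(toy, labels)
print(result)
```

Listing several candidate labels per permit type handles a renamed row (as with Planning Floodplain) in the same loop as every other permit, and a permit type with no matching row falls back to 0 for both quantity and revenue.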


In [20]:
def process_reports(file):
    """
    This function will process the development report workbooks and return a pandas DataFrame of the permit data.
    
    Args:
    file: the file path (string) for the development report workbook

    Outputs:
    final_df: a pandas DataFrame of the permit data
    """

    # Get sheet details
    sheet_details_df = build_sheet_details_df(file)[['Report', 'Permit Date', 'Permit Month', 'Month Number', 'Calendar Year', 'Fiscal Year']]

    # Get the report sheet names listed in the 'Report' column
    reports_sheets = sheet_details_df['Report'].dropna()

    df_1 = pd.DataFrame()
    df_2 = pd.DataFrame()
    for sheet in reports_sheets:
        
        year = sheet_details_df.loc[sheet_details_df['Report'] == sheet, 'Calendar Year'].iloc[0]
        month = sheet_details_df.loc[sheet_details_df['Report'] == sheet, 'Permit Month'].iloc[0]

        df = extract_building_permits(file, sheet, year, month)
        df_1 = pd.concat([df_1, df], ignore_index=True)

        df = extract_other_permits(file, sheet, year, month)
        df_2 = pd.concat([df_2, df], ignore_index=True)

    # Combine both dataframes
    combined_df = pd.concat([df_1, df_2], ignore_index=True)
    
    # Merge combined_df with sheet_details_df on 'Calendar Year' and 'Permit Month' using a left join
    final_df = combined_df.merge(
        sheet_details_df, 
        on=['Calendar Year', 'Permit Month'], 
        how='left'
    )
    # Duplicate rows, while creating new column which enables dashboard filtering by year type
    final_df = pd.concat([final_df.assign(**{"Year Type": 'Calendar', 'Permit Year': final_df['Calendar Year']}),
                                final_df.assign(**{"Year Type": 'Fiscal', 'Permit Year': final_df['Fiscal Year']})
                                ]).drop(['Calendar Year', 'Fiscal Year', 'Report'], axis=1)
    
    return final_df
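
The row duplication at the end of `process_reports` can be illustrated on a single toy row. This is a minimal sketch with made-up values; in particular, representing the fiscal year as the integer 2025 is an assumption about how `build_sheet_details_df` encodes it.

```python
import pandas as pd

# One permit row carrying both year definitions
toy = pd.DataFrame({
    "Permit Type": ["Electrical"],
    "Quantity": [159],
    "Calendar Year": [2024],
    "Fiscal Year": [2025],
})

# Stack one copy per year type, then drop the two source columns
long_df = pd.concat([
    toy.assign(**{"Year Type": "Calendar", "Permit Year": toy["Calendar Year"]}),
    toy.assign(**{"Year Type": "Fiscal", "Permit Year": toy["Fiscal Year"]}),
]).drop(["Calendar Year", "Fiscal Year"], axis=1)

# Filtering on one year type recovers each permit exactly once
fiscal_only = long_df[long_df["Year Type"] == "Fiscal"]
print(long_df)
print(fiscal_only["Quantity"].sum())  # 159
```

Because every permit now appears twice, any total computed from the output file should first filter on `Year Type`; summing the unfiltered toy frame would give 318 instead of 159.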

In [21]:
# 2025 ONLY
file = os.path.join(os.getcwd(), 'data', 'FY25 Development Report Workbook.xlsx')

data = process_reports(file)

data.to_csv('output/PermitDataFY2025.csv', index=False)

In [22]:
def process_all_reports(folder):
    """
    This function will process all the development report workbooks in a folder and write the combined permit data to 'output/PermitDataAll.csv'.
    
    Args:
    folder: the folder path (string) for the development report workbooks

    Outputs:
    None. The combined pandas DataFrame is saved to 'output/PermitDataAll.csv'.
    """
    all_data = []

    for file in os.listdir(folder):
        if file.endswith(".xlsx") and not file.startswith("~$"):  # Exclude temp Excel files
            file = os.path.join(folder, file)
            df = process_reports(file)  # Process each workbook with the function defined above
            all_data.append(df)
            all_data.append(df)

    if all_data:
        final_df = pd.concat(all_data, ignore_index=True)
        final_df.to_csv('output/PermitDataAll.csv', index=False)
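
Both folder-level functions use the same rule to pick workbooks: keep `.xlsx` files and skip the `~$` lock files Excel creates while a workbook is open. A minimal sketch of that rule as a reusable helper follows; `list_workbooks` is a hypothetical name, and the file names are created in a throwaway folder purely for demonstration.

```python
import os
import tempfile

def list_workbooks(folder):
    """Hypothetical helper: full paths of .xlsx workbooks in `folder`, skipping Excel lock files."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.endswith(".xlsx") and not name.startswith("~$")
    )

# Demonstrate on a temporary folder with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    for name in ["FY24 Development Report Workbook.xlsx",
                 "~$FY24 Development Report Workbook.xlsx",
                 "notes.txt"]:
        open(os.path.join(tmp, name), "w").close()
    found = [os.path.basename(path) for path in list_workbooks(tmp)]

print(found)  # ['FY24 Development Report Workbook.xlsx']
```

Sorting the paths makes the processing order reproducible, since `os.listdir` returns entries in arbitrary order.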

In [23]:
# Final Processing and Creation of `PermitDataAll.csv` Data Set
folder = os.getcwd()+'/data'
process_all_reports(folder)