**"fy24-adopted-operating-budget" pre-processing**

In [2]:
import pandas as pd

# Load the provided CSV file
file_path = 'data/fy24-adopted-operating-budget.csv'
data = pd.read_csv(file_path)

# Displaying the first few rows of the dataset to understand its structure
data.head()


Unnamed: 0,Cabinet,Dept,Program,Expense Category,FY21 Actual Expense,FY22 Actual Expense,FY23 Appropriation,FY24 Adopted
0,Mayor's Cabinet,Mayor's Office,Mayor's Administration,Personnel Services,1820538.46,1624903.69,1584054.209,1921403.81
1,Mayor's Cabinet,Mayor's Office,Mayor's Administration,Contractual Services,127557.82,284597.9,99314.0,219633.42
2,Mayor's Cabinet,Mayor's Office,Mayor's Administration,Supplies & Materials,27318.17,28541.55,44938.0,55573.65
3,Mayor's Cabinet,Mayor's Office,Mayor's Administration,Current Charges & Obligations,11365.77,19410.3,29630.0,16734.29
4,Mayor's Cabinet,Mayor's Office,Mayor's Administration,Equipment,39040.6,16164.36,24900.0,36115.0


In [3]:
# 1. Handling missing values
# Checking for missing values in the dataset
missing_values = data.isnull().sum()

# 2. Converting data types
# Converting budget and expense columns to numeric types
columns_to_convert = ['FY21 Actual Expense', 'FY22 Actual Expense', 'FY23 Appropriation', 'FY24 Adopted']
for col in columns_to_convert:
    data[col] = pd.to_numeric(data[col], errors='coerce')

# 3. Data cleaning
# Handling any specific data cleaning based on domain knowledge (if required)

# 4. Standardizing column names
data.columns = [col.strip().replace(' ', '_').lower() for col in data.columns]

# 5. Adding or modifying features
# Example: Adding a new column for the change in budget from FY23 to FY24
data['fy24_change_from_fy23'] = data['fy24_adopted'] - data['fy23_appropriation']

# Displaying the updated dataframe and missing values
updated_missing_values = data.isnull().sum()
data.head(), missing_values, updated_missing_values


(           cabinet            dept                 program  \
 0  Mayor's Cabinet  Mayor's Office  Mayor's Administration   
 1  Mayor's Cabinet  Mayor's Office  Mayor's Administration   
 2  Mayor's Cabinet  Mayor's Office  Mayor's Administration   
 3  Mayor's Cabinet  Mayor's Office  Mayor's Administration   
 4  Mayor's Cabinet  Mayor's Office  Mayor's Administration   
 
                 expense_category  fy21_actual_expense  fy22_actual_expense  \
 0             Personnel Services           1820538.46           1624903.69   
 1           Contractual Services            127557.82            284597.90   
 2           Supplies & Materials             27318.17             28541.55   
 3  Current Charges & Obligations             11365.77             19410.30   
 4                      Equipment             39040.60             16164.36   
 
    fy23_appropriation  fy24_adopted  fy24_change_from_fy23  
 0         1584054.209    1921403.81             337349.601  
 1           99314.0

In [5]:
# Saving the preprocessed data to a new CSV file
processed_file_path = 'data/processed_fy24-adopted-operating-budget.csv'
data.to_csv(processed_file_path, index=False)

processed_file_path


'data/processed_fy24-adopted-operating-budget.csv'

**Explanation:**

**Missing Values:** Several budget and expenditure columns contain missing values. For example, 'FY21 Actual Expense' has 181 and 'FY22 Actual Expense' has 146.
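
To see how the gaps are spread across columns, a per-column summary of counts and percentages can help. The following is a minimal sketch on a made-up toy frame; `missing_summary` and `toy` are hypothetical names, and in the notebook the function would be called on `data` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the operating-budget data
toy = pd.DataFrame({
    "fy21_actual_expense": [100.0, np.nan, 250.0, np.nan],
    "fy22_actual_expense": [110.0, 90.0, np.nan, 300.0],
    "fy24_adopted": [120.0, 95.0, 260.0, 310.0],
})


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, most gaps first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)


summary = missing_summary(toy)
print(summary)
```

Reporting the percentage alongside the raw count makes it easier to judge whether a column is still usable for year-over-year comparisons.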

**Data Type Conversion:** The four budget and expenditure columns are converted to numeric types with pd.to_numeric and errors='coerce', so any entry that cannot be parsed as a number becomes NaN.
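
Because `errors='coerce'` silently turns unparseable entries into NaN, it can be worth checking which raw values were lost in the conversion. The sketch below uses invented sample strings (the `raw` series is hypothetical, not taken from the CSV) and compares the column before and after `pd.to_numeric`.

```python
import pandas as pd

# Hypothetical raw column: one clean number, one with a thousands separator,
# one placeholder string, and one entry that is already missing
raw = pd.Series(
    ["1820538.46", "127,557.82", "n/a", None],
    name="fy21_actual_expense",
)

converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but failed to parse
newly_missing = raw[raw.notna() & converted.isna()]
print(newly_missing.tolist())
```

If `newly_missing` is non-empty on the real data, some of the reported missing values come from formatting rather than from absent figures.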

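**Column Name Standardization:** All column names are stripped of surrounding whitespace, lowercased, and have their spaces replaced with underscores, which makes them easier to read and to reference in code.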

**New Feature Added:** A new column, 'fy24_change_from_fy23', holds the FY24 adopted amount minus the FY23 appropriation. Rows missing either value get NaN in this column.
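
The difference column inherits missing values from both of its inputs. The sketch below illustrates this on a hypothetical three-row frame (`toy` is an invented name; the first row reuses the Contractual Services figures shown earlier, the other two rows are made up).

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 is complete, rows 1 and 2 each lack one value
toy = pd.DataFrame({
    "fy23_appropriation": [99314.0, np.nan, 24900.0],
    "fy24_adopted": [219633.42, 55573.65, np.nan],
})

# Subtraction propagates NaN whenever either operand is missing
toy["fy24_change_from_fy23"] = toy["fy24_adopted"] - toy["fy23_appropriation"]

# Number of rows where a year-over-year comparison is actually possible
comparable = int(toy["fy24_change_from_fy23"].notna().sum())
print(toy)
print(comparable)
```

This would explain why the change column can have fewer non-null entries than either of the two columns it is derived from.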

**"fy24-adopted-revenue-budget" pre-processing**

In [6]:
import pandas as pd

# Load the CSV file
file_path = 'data/fy24-adopted-revenue-budget.csv'
df = pd.read_csv(file_path)

# Display the first few rows of the dataframe to understand its structure
df.head()


Unnamed: 0,Revenue Category,Account,Cabinet,Department,FY21 Actual,FY22 Actual,FY23 Appropriation,FY24 Adopted,Unnamed: 8,Unnamed: 9
0,Property Tax Levy,Real Estate Taxes,Finance,Assessing Department,2490082613,2630469593,2784133324,2913736657,,
1,Property Tax Levy,Personal Property Tax,Finance,Assessing Department,189939113,196499737,209010762,214236031,,
2,Property Tax Levy,Property Tax Overlay,Finance,Assessing Department,-3735387,-33174590,-29845007,-30000000,,
3,Excises,MV Excise - Current Year,Finance,Assessing Department,36012943,36227381,34000000,34000000,,
4,Excises,MV Excise - Prior Year,Finance,Assessing Department,15202795,21692077,17500000,18000000,,


In [8]:
# Remove unnamed columns; .copy() makes df_cleaned an independent DataFrame,
# so the column assignments below do not trigger SettingWithCopyWarning
df_cleaned = df.loc[:, ~df.columns.str.contains('^Unnamed')].copy()

# Convert financial figures to numerical values
financial_columns = ['FY21 Actual', 'FY22 Actual', 'FY23 Appropriation', 'FY24 Adopted']
for column in financial_columns:
    # Strip dollar signs and thousands separators, then cast to float
    df_cleaned[column] = df_cleaned[column].replace(r'[\$,]', '', regex=True).astype(float)

# Save the preprocessed file
output_file_path = 'data/processed_fy24-adopted-revenue-budget.csv'
df_cleaned.to_csv(output_file_path, index=False)

output_file_path


'data/processed_fy24-adopted-revenue-budget.csv'

In [9]:
# Load the CSV file
file_path = 'data/processed_fy24-adopted-revenue-budget.csv'
df = pd.read_csv(file_path)

# Display the first few rows of the dataframe to understand its structure
df.head()

Unnamed: 0,Revenue Category,Account,Cabinet,Department,FY21 Actual,FY22 Actual,FY23 Appropriation,FY24 Adopted
0,Property Tax Levy,Real Estate Taxes,Finance,Assessing Department,2490083000.0,2630470000.0,2784133000.0,2913737000.0
1,Property Tax Levy,Personal Property Tax,Finance,Assessing Department,189939100.0,196499700.0,209010800.0,214236000.0
2,Property Tax Levy,Property Tax Overlay,Finance,Assessing Department,-3735387.0,-33174590.0,-29845010.0,-30000000.0
3,Excises,MV Excise - Current Year,Finance,Assessing Department,36012940.0,36227380.0,34000000.0,34000000.0
4,Excises,MV Excise - Prior Year,Finance,Assessing Department,15202800.0,21692080.0,17500000.0,18000000.0


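**Explanation:**

**Unnamed Columns Removed:** The columns whose names start with 'Unnamed' ('Unnamed: 8' and 'Unnamed: 9') are dropped; they are empty in the rows shown above and carry no relevant information.

**Data Type Conversion:** Dollar signs and thousands separators are stripped from the four financial columns, which are then cast to float for easier processing.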
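
Assigning into a column-filtered frame is what raises pandas' SettingWithCopyWarning; taking an explicit `.copy()` of the filtered frame is one common way to avoid it. The sketch below shows the pattern on a hypothetical two-row frame. The `$` and comma formatting in `raw` is invented for illustration, since the rows shown above are already plain numbers.

```python
import pandas as pd

# Hypothetical input with currency formatting and an empty unnamed column
raw = pd.DataFrame({
    "Account": ["Real Estate Taxes", "Property Tax Overlay"],
    "FY24 Adopted": ["$2,913,736,657", "-30,000,000"],
    "Unnamed: 8": [None, None],
})

# .copy() makes `cleaned` independent of `raw`, so the assignment below
# writes to a real DataFrame rather than to a possible view of one
cleaned = raw.loc[:, ~raw.columns.str.contains("^Unnamed")].copy()

# Strip dollar signs and thousands separators, then cast to float
cleaned["FY24 Adopted"] = (
    cleaned["FY24 Adopted"].str.replace(r"[\$,]", "", regex=True).astype(float)
)
print(cleaned)
```

The raw-string prefix on the pattern keeps `\$` from being treated as an invalid Python escape sequence.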

**"fy24-capital-budget-plan-recommended" pre-processing**

In [11]:
import pandas as pd

# Load the CSV file
file_path = 'data/fy24-capital-budget-plan-recommended.csv'
df = pd.read_csv(file_path)

# Display the first few rows of the dataframe to understand its structure
df.head()


Unnamed: 0,Department,Project_Name,Scope_Of_Work,PM_Department,Project_Status,Neighborhood,Authorization_Existing,Authorization_FY,Authorization_Future,Grant_Existing,...,GO_Expended,Capital_Year_0,CapitalYear_1,Capital_Year_25,Grant_Expended,Grant_Year_0,Grant_Year_1,GrantYear_25,External_Funds,Total_Project_Budget
0,Boston Centers for Youth and Families,BCYF Security and Technology Upgrades,Improvements to technology infrastructure and ...,Boston Centers for Youth and Families,To Be Scheduled,Citywide,750000,1250000,0,0,...,0,0,100000,1900000,0,0,0,0,0,2000000
1,Boston Centers for Youth and Families,BCYF Tobin Community Center Retaining Wall,Repair or replace the retaining wall adjacent ...,Public Facilities Department,In Design,Mission Hill,1800000,0,0,0,...,0,100000,1500000,200000,0,0,0,0,0,1800000
2,Boston Centers for Youth and Families,BCYF North End Community Center,Develop a design for a new North End Community...,Public Facilities Department,In Design,North End,5000000,63000000,0,0,...,0,400000,3000000,64600000,0,0,0,0,20000000,88000000
3,Boston Centers for Youth and Families,Pool Repairs,Renovate and upgrade locker rooms and pools in...,Boston Centers for Youth and Families,Annual Program,Citywide,1300000,1000000,0,0,...,383450,450000,700000,766550,0,0,0,0,0,2300000
4,Boston Centers for Youth and Families,Youth Budget Round 4,Engage youth across the City to create a capit...,Youth Engagement and Employment,Implementation Underway,Citywide,1000000,0,0,0,...,17140,25000,250000,707860,0,0,0,0,0,1000000


In [12]:
# Standardize column names: convert to lower case and replace spaces with underscores
# NOTE: `data` is the operating-budget DataFrame from In [3], not the capital-budget `df` loaded in In [11]
data_standardized_columns = data.columns.str.lower().str.replace(' ', '_')

data_standardized_columns.tolist()

['cabinet',
 'dept',
 'program',
 'expense_category',
 'fy21_actual_expense',
 'fy22_actual_expense',
 'fy23_appropriation',
 'fy24_adopted',
 'fy24_change_from_fy23']

In [15]:
# Remove duplicate rows from the dataset (still the operating-budget `data`)
data_no_duplicates = data.drop_duplicates()


unique_departments = data_no_duplicates['dept'].unique()
unique_departments[:10]

data_no_duplicates.head()

Unnamed: 0,cabinet,dept,program,expense_category,fy21_actual_expense,fy22_actual_expense,fy23_appropriation,fy24_adopted,fy24_change_from_fy23
0,Mayor's Cabinet,Mayor's Office,Mayor's Administration,Personnel Services,1820538.46,1624903.69,1584054.209,1921403.81,337349.601
1,Mayor's Cabinet,Mayor's Office,Mayor's Administration,Contractual Services,127557.82,284597.9,99314.0,219633.42,120319.42
2,Mayor's Cabinet,Mayor's Office,Mayor's Administration,Supplies & Materials,27318.17,28541.55,44938.0,55573.65,10635.65
3,Mayor's Cabinet,Mayor's Office,Mayor's Administration,Current Charges & Obligations,11365.77,19410.3,29630.0,16734.29,-12895.71
4,Mayor's Cabinet,Mayor's Office,Mayor's Administration,Equipment,39040.6,16164.36,24900.0,36115.0,11215.0


In [18]:
# Outlier screening: use descriptive statistics to inspect the distribution of the numeric columns
numerical_descriptions = data_no_duplicates.describe()
numerical_descriptions

Unnamed: 0,fy21_actual_expense,fy22_actual_expense,fy23_appropriation,fy24_adopted,fy24_change_from_fy23
count,719.0,754.0,780.0,807.0,752.0
mean,4046999.0,4105683.0,4055082.0,4157598.0,239859.9
std,21274360.0,21431620.0,21081820.0,21499390.0,2457819.0
min,-825.0,4.32,0.0,-215.0,-32664690.0
25%,7597.445,9225.305,10576.5,14426.0,-205.4983
50%,112818.0,130917.6,137416.0,172904.0,1155.05
75%,831380.4,843050.0,934282.7,1008667.0,68250.75
max,272543200.0,271720900.0,221964800.0,229918700.0,40026740.0
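
The summary above shows maxima far beyond the 75th percentile (for 'fy24_adopted', 229,918,700 against 1,008,667), but `describe()` alone does not say which rows are extreme. One common follow-up, not used in the original notebook, is the 1.5 × IQR rule. The sketch below applies it to a hypothetical series built from a few of the figures shown above; `iqr_outliers` and `toy` are invented names.

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons with NaN are False, so missing values are never flagged
    return (series < lower) | (series > upper)


# Hypothetical sample mixing typical amounts with one very large line item
toy = pd.Series(
    [14426.0, 172904.0, 1008667.0, 55573.65, 229918700.0],
    name="fy24_adopted",
)

mask = iqr_outliers(toy)
print(toy[mask])
```

On heavily right-skewed budget data this rule is likely to flag many legitimately large line items, so the flagged rows are better treated as candidates for review than as errors to drop.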


In [19]:
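# NOTE: data_no_duplicates holds the operating-budget rows, so the file written here
# does not contain the capital-budget plan loaded into `df` in In [11]
processed_file_path = 'data/processed_fy24-capital-budget-plan.csv'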
data_no_duplicates.to_csv(processed_file_path, index=False)

processed_file_path

'data/processed_fy24-capital-budget-plan.csv'