# Cleaning Greenway Data

General demographic surveys of the census tracts surrounding Boston's greenways contain information that is not relevant to displacement in the area. These scripts filter that data out for all the tracts surrounding one greenway in a given year.

The first step loads the data and drops every column that does not hold a tract-wide total. The raw sheets also break each tract down under sub-headers, for example age among the part of the tract's population that are U.S. citizens. We are not interested in these breakdowns and prefer to operate on the population of each tract as a whole. Thus, we preserve the columns with a total estimate for each tract (plus the label column), then trim their headers for clarity.

In [37]:
import pandas as pd

# Retrieving demographic data for greenway
def retrieve_data(greenway, year):
    file_path = f'../../data/greenways-data/{greenway}-data/{year}-demographic-data.csv'
    data = pd.read_csv(file_path)
    return data

def filter_columns(data, year):
    # Keep only the label column and each tract's total-estimate column
    filtered_columns = [col for col in data.columns if "Massachusetts!!Total!!Estimate" in col or "Label" in col]
    data = data[filtered_columns]
    # Trim lower-level headers: keep the text up to "County" and append the year
    data.columns = [col.split('County')[0] + f'County, {year}' if 'County' in col else col + f' {year}' for col in data.columns]
    return data

test_data = retrieve_data("riverway", "2018")
filtered = filter_columns(test_data, "2018")
filtered.head()

Unnamed: 0,Label (Grouping) 2018,"Census Tract 4001, Norfolk County, 2018","Census Tract 102.04, Suffolk County, 2018","Census Tract 103, Suffolk County, 2018","Census Tract 104.08, Suffolk County, 2018"
0,Total population,5305,3592,5500,1392
1,AGE,,,,
2,Under 5 years,3.5%,0.3%,0.0%,1.9%
3,5 to 17 years,5.1%,0.2%,2.6%,0.4%
4,18 to 24 years,20.2%,65.1%,86.4%,19.3%
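
The header trimming in `filter_columns` can be checked on a single column name. The sketch below is a minimal illustration: the raw header string is an assumption about how the source CSV names its columns (inferred from the substrings the filter looks for), and `trim_header` is a hypothetical helper that is not part of the notebook.

```python
def trim_header(col, year):
    """Apply the same renaming rule as filter_columns to one header."""
    if "County" in col:
        # Keep everything up to "County", then append the year
        return col.split("County")[0] + f"County, {year}"
    # Non-tract columns (the label column) just get the year appended
    return f"{col} {year}"

# Assumed raw header format; the real file may differ
raw = "Census Tract 4001, Norfolk County, Massachusetts!!Total!!Estimate"

print(trim_header(raw, "2018"))                 # Census Tract 4001, Norfolk County, 2018
print(trim_header("Label (Grouping)", "2018"))  # Label (Grouping) 2018
```

Both results match the headers shown in the output above.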


The data also contains rows we are not interested in. From the demographic sheets we primarily want race, income, and poverty status, so we filter out every other section while keeping the sub-headers of the sections we retain. You can see in the head below that the "AGE" section from the previous output has been filtered out.

In [38]:
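def filter_rows(data):
    # Section headers whose rows we want to keep
    saved_categories = [
        "RACE AND HISPANIC OR LATINO ORIGIN",
        "INDIVIDUALS' INCOME IN THE PAST 12 MONTHS",
        "POVERTY STATUS"
    ]

    keep_rows = []
    relevant_section = False
    for index, row in data.iterrows():
        # A row with no value in the first tract column is a section header,
        # which closes whatever section came before it
        if pd.isna(row.iloc[1]):
            relevant_section = False
        # A saved category in the label opens a section we keep
        if any(category in str(row.iloc[0]) for category in saved_categories):
            relevant_section = True
        if relevant_section:
            keep_rows.append(index)
    # iterrows() yields index labels, so select with .loc rather than .iloc
    data = data.loc[keep_rows]
    return data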

filtered_again = filter_rows(filtered)
filtered_again.head(20)



Unnamed: 0,Label (Grouping) 2018,"Census Tract 4001, Norfolk County, 2018","Census Tract 102.04, Suffolk County, 2018","Census Tract 103, Suffolk County, 2018","Census Tract 104.08, Suffolk County, 2018"
14,RACE AND HISPANIC OR LATINO ORIGIN,,,,
15,One race,93.3%,95.1%,96.9%,97.8%
16,White,73.4%,63.6%,75.1%,66.2%
17,Black or African American,1.8%,2.7%,5.6%,8.0%
18,American Indian and Alaska Native,0.0%,0.5%,0.0%,0.0%
19,Asian,11.1%,25.6%,12.4%,20.9%
20,Native Hawaiian and Other Pacific ...,0.1%,0.2%,0.1%,0.0%
21,Some other race,6.8%,2.5%,3.7%,2.6%
22,Two or more races,6.7%,4.9%,3.1%,2.2%
23,Hispanic or Latino origin (of any race),18.1%,9.0%,9.1%,8.7%
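
The section tracking in `filter_rows` can also be written without an explicit loop. The sketch below is one possible alternative, not the notebook's method: `filter_rows_vectorized` is a hypothetical name, and it assumes, as the loop does, that the label is in the first column and that a header row has no value in the second column. It differs from the loop in one edge case: a data row whose own label contains a saved category does not open a new section here.

```python
import pandas as pd

def filter_rows_vectorized(data, saved_categories):
    labels = data.iloc[:, 0].astype(str)
    # Header rows have no value in the first tract column
    is_header = data.iloc[:, 1].isna()
    # Each row inherits the label of the most recent header above it
    section = labels.where(is_header).ffill()
    # Rows before the first header have no section and are dropped
    keep = section.apply(
        lambda s: isinstance(s, str) and any(c in s for c in saved_categories)
    )
    return data.loc[keep]

# Toy data for illustration only
toy = pd.DataFrame({
    "Label": ["Total population", "AGE", "Under 5 years",
              "RACE AND HISPANIC OR LATINO ORIGIN", "White", "SEX", "Male"],
    "Tract A": ["100", None, "3.5%", None, "73.4%", None, "50.0%"],
})
result = filter_rows_vectorized(toy, ["RACE AND HISPANIC OR LATINO ORIGIN"])
print(result["Label"].tolist())
```

On the toy frame this keeps the race header and the "White" row beneath it, and drops the population, age, and sex rows.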


We now have the relevant data for the Riverway in 2018. We can repeat this process on the data from 2019 through 2022 and combine the results into one dataset for analysis.

In [39]:
datasets = [filtered_again]
for year in ["2019", "2020", "2021", "2022"]:
    annual_data = retrieve_data("riverway", year)
    filtered_annual_data = filter_columns(annual_data, year)
    filtered_rows_annual_data = filter_rows(filtered_annual_data)
    datasets.append(filtered_rows_annual_data)

# Checking that 2019 formatted correctly
datasets[1].head()

Unnamed: 0,Label (Grouping) 2019,"Census Tract 4001, Norfolk County, 2019","Census Tract 102.04, Suffolk County, 2019","Census Tract 103, Suffolk County, 2019","Census Tract 104.08, Suffolk County, 2019"
14,RACE AND HISPANIC OR LATINO ORIGIN,,,,
15,One race,92.3%,96.2%,96.3%,97.7%
16,White,60.7%,69.1%,74.7%,65.2%
17,Black or African American,4.5%,2.9%,5.6%,12.8%
18,American Indian and Alaska Native,0.0%,0.7%,0.0%,0.0%


In [40]:
# Isolate the label column (as a Series) from the first dataset
label_column = datasets[0].iloc[:, 0]
# Drop the label column from every dataset
drop_labels = [df.drop(df.columns[0], axis=1) for df in datasets]
# Concatenate and reorder columns (sorted alphabetically by header)
combined_df = pd.concat(drop_labels, axis=1)
combined_df = combined_df.reindex(sorted(combined_df.columns), axis=1)
combined_df.insert(0, "Label", label_column)
combined_df.head()

Unnamed: 0,Label,"Census Tract 102.04, Suffolk County, 2018","Census Tract 102.04, Suffolk County, 2019","Census Tract 102.04, Suffolk County, 2020","Census Tract 102.04, Suffolk County, 2021","Census Tract 102.04; Suffolk County, 2022","Census Tract 102.05, Suffolk County, 2020","Census Tract 102.05, Suffolk County, 2021","Census Tract 102.05; Suffolk County, 2022","Census Tract 102.06, Suffolk County, 2020",...,"Census Tract 104.08, Suffolk County, 2018","Census Tract 104.08, Suffolk County, 2019","Census Tract 104.08, Suffolk County, 2020","Census Tract 104.08, Suffolk County, 2021","Census Tract 104.08; Suffolk County, 2022","Census Tract 4001, Norfolk County, 2018","Census Tract 4001, Norfolk County, 2019","Census Tract 4001, Norfolk County, 2020","Census Tract 4001, Norfolk County, 2021","Census Tract 4001; Norfolk County, 2022"
14,RACE AND HISPANIC OR LATINO ORIGIN,,,,,,,,,,...,,,,,,,,,,
15,One race,95.1%,96.2%,96.5%,95.2%,91.5%,89.7%,88.9%,88.8%,99.2%,...,97.8%,97.7%,97.8%,98.2%,97.5%,93.3%,92.3%,89.9%,91.8%,91.2%
16,White,63.6%,69.1%,70.7%,70.5%,62.4%,54.3%,52.0%,53.5%,73.0%,...,66.2%,65.2%,66.9%,67.0%,66.0%,73.4%,60.7%,65.1%,63.8%,62.2%
17,Black or African American,2.7%,2.9%,2.9%,3.4%,6.6%,3.7%,4.0%,3.0%,12.8%,...,8.0%,12.8%,13.0%,12.3%,14.5%,1.8%,4.5%,3.8%,4.1%,4.8%
18,American Indian and Alaska Native,0.5%,0.7%,0.9%,0.0%,0.0%,0.0%,0.2%,0.1%,1.1%,...,0.0%,0.0%,0.0%,0.0%,0.0%,0.0%,0.0%,0.0%,0.0%,0.1%
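
Note that the 2022 headers in this output use a semicolon after the tract number ("Census Tract 102.04; Suffolk County, 2022"), while earlier years use a comma. Presumably the 2022 source file formats its headers differently. If consistent headers matter for later analysis, one option is to normalize the separator. The sketch below is an assumption about how that could be done; `normalize_header` is a hypothetical helper that the notebook does not define.

```python
def normalize_header(col):
    """Replace the semicolon separator seen in 2022 headers with a comma."""
    return col.replace(";", ",")

headers = [
    "Census Tract 102.04, Suffolk County, 2021",
    "Census Tract 102.04; Suffolk County, 2022",
]
normalized = [normalize_header(h) for h in headers]
print(normalized[1])  # Census Tract 102.04, Suffolk County, 2022
```

The same function could be passed to `DataFrame.rename(columns=normalize_header)` before the columns are sorted.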


The combined dataset shows demographic data by tract and year for all tracts around the Riverway. We can save it to a CSV in our data folder, and then generalize the process into a function for other greenways.

In [41]:
combined_df.to_csv('../../data/greenways-data/riverway-data/combined_demographic_data.csv', index=False)

In [42]:
def combine_demographic_data(greenway_name):
    datasets = []
    for year in ["2018", "2019", "2020", "2021", "2022"]:
        annual_data = retrieve_data(greenway_name, year)
        filtered_annual_data = filter_columns(annual_data, year)
        filtered_rows_annual_data = filter_rows(filtered_annual_data)
        datasets.append(filtered_rows_annual_data)
    # Take the label column (as a Series) from the first year
    label_column = datasets[0].iloc[:, 0]
    # Drop the label column from every year, then combine and sort by header
    drop_labels = [df.drop(df.columns[0], axis=1) for df in datasets]
    combined_df = pd.concat(drop_labels, axis=1)
    combined_df = combined_df.reindex(sorted(combined_df.columns), axis=1)
    combined_df.insert(0, "Label", label_column)
    # Write without the index, matching the earlier Riverway export
    combined_df.to_csv(f'../../data/greenways-data/{greenway_name}-data/combined_demographic_data.csv', index=False)
    print(f'Combined data for {greenway_name}')
    return combined_df

df_test = combine_demographic_data("riverway")
df_test.head(10)

Combined data for riverway


Unnamed: 0,Label,"Census Tract 102.04, Suffolk County, 2018","Census Tract 102.04, Suffolk County, 2019","Census Tract 102.04, Suffolk County, 2020","Census Tract 102.04, Suffolk County, 2021","Census Tract 102.04; Suffolk County, 2022","Census Tract 102.05, Suffolk County, 2020","Census Tract 102.05, Suffolk County, 2021","Census Tract 102.05; Suffolk County, 2022","Census Tract 102.06, Suffolk County, 2020",...,"Census Tract 104.08, Suffolk County, 2018","Census Tract 104.08, Suffolk County, 2019","Census Tract 104.08, Suffolk County, 2020","Census Tract 104.08, Suffolk County, 2021","Census Tract 104.08; Suffolk County, 2022","Census Tract 4001, Norfolk County, 2018","Census Tract 4001, Norfolk County, 2019","Census Tract 4001, Norfolk County, 2020","Census Tract 4001, Norfolk County, 2021","Census Tract 4001; Norfolk County, 2022"
14,RACE AND HISPANIC OR LATINO ORIGIN,,,,,,,,,,...,,,,,,,,,,
15,One race,95.1%,96.2%,96.5%,95.2%,91.5%,89.7%,88.9%,88.8%,99.2%,...,97.8%,97.7%,97.8%,98.2%,97.5%,93.3%,92.3%,89.9%,91.8%,91.2%
16,White,63.6%,69.1%,70.7%,70.5%,62.4%,54.3%,52.0%,53.5%,73.0%,...,66.2%,65.2%,66.9%,67.0%,66.0%,73.4%,60.7%,65.1%,63.8%,62.2%
17,Black or African American,2.7%,2.9%,2.9%,3.4%,6.6%,3.7%,4.0%,3.0%,12.8%,...,8.0%,12.8%,13.0%,12.3%,14.5%,1.8%,4.5%,3.8%,4.1%,4.8%
18,American Indian and Alaska Native,0.5%,0.7%,0.9%,0.0%,0.0%,0.0%,0.2%,0.1%,1.1%,...,0.0%,0.0%,0.0%,0.0%,0.0%,0.0%,0.0%,0.0%,0.0%,0.1%
19,Asian,25.6%,21.4%,20.1%,17.8%,19.4%,29.7%,30.5%,27.3%,6.4%,...,20.9%,17.3%,15.3%,16.7%,15.5%,11.1%,20.0%,18.3%,22.2%,20.2%
20,Native Hawaiian and Other Pacific ...,0.2%,0.2%,0.0%,0.0%,0.0%,0.0%,0.0%,0.0%,0.0%,...,0.0%,0.0%,0.0%,0.0%,0.0%,0.1%,0.1%,0.1%,0.0%,0.0%
21,Some other race,2.5%,1.9%,2.0%,3.4%,3.1%,2.0%,2.2%,4.8%,5.9%,...,2.6%,2.5%,2.6%,2.3%,1.5%,6.8%,7.0%,2.5%,1.8%,3.9%
22,Two or more races,4.9%,3.8%,3.5%,4.8%,8.5%,10.3%,11.1%,11.2%,0.8%,...,2.2%,2.3%,2.2%,1.8%,2.5%,6.7%,7.7%,10.1%,8.2%,8.8%
23,Hispanic or Latino origin (of any race),9.0%,8.2%,9.2%,10.8%,9.3%,5.6%,8.3%,10.8%,28.1%,...,8.7%,12.6%,9.6%,8.5%,6.6%,18.1%,13.5%,12.6%,8.9%,11.5%
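
The values in the combined table are still strings such as "95.1%", so they would need to be converted before any numeric analysis. The notebook does not show this step; the sketch below is one possible approach, and `percent_to_float` is a hypothetical helper. It would apply to the tract columns only, not to the label column.

```python
import pandas as pd

def percent_to_float(series):
    """Strip a trailing '%' and thousands separators, then parse as numbers."""
    cleaned = series.astype(str).str.rstrip("%").str.replace(",", "", regex=False)
    # Anything that is not a number (such as an empty header cell) becomes NaN
    return pd.to_numeric(cleaned, errors="coerce")

# Toy values for illustration only
sample = pd.Series(["93.3%", "73.4%", None, "5305"])
values = percent_to_float(sample)
print(values.tolist())
```

Header rows, which have no values in the tract columns, come out as NaN and can be dropped or kept as section markers.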
