# Cleaning Greenway Data

General demographic surveys of the census tracts surrounding Boston's greenways contain information that is not relevant to displacement in the area. These scripts filter out that data for all the tracts surrounding one greenway in a given year.

This script loads the data and filters out every column that does not hold a total estimate. For example, sub-headers under each tract might break age down by the portion of the tract's population that are U.S. citizens. We are not interested in these breakdowns and prefer to operate on the population of each tract as a whole. Thus, we keep only the columns with a total estimate for each tract, then trim their names for clarity.

In [13]:
import pandas as pd

# Retrieving demographic data for greenway
def retrieve_data(greenway, year):
    file_path = f'../../data/greenways-early-insights/{greenway}-data/{year}-demographic-data.csv'
    data = pd.read_csv(file_path)
    return data

def filter_columns(data, year):
    # Filter out unnecessary columns
    filtered_columns = [col for col in data.columns if "Massachusetts!!Total!!Estimate" in col or "Label" in col]
    data = data[filtered_columns]
    # Trimming lower-level headers: cut everything after "County" and append the year
    data.columns = [
        col.split('County')[0] + f'County, {year}' if 'County' in col else col + f' {year}'
        for col in data.columns
    ]
    return data

test_data = retrieve_data("riverway", "2018")
filtered = filter_columns(test_data, "2018")
filtered.head()

Unnamed: 0,Label (Grouping) 2018,"Census Tract 4001, Norfolk County, 2018","Census Tract 102.04, Suffolk County, 2018","Census Tract 103, Suffolk County, 2018","Census Tract 104.08, Suffolk County, 2018"
0,Total population,5305,3592,5500,1392
1,AGE,,,,
2,Under 5 years,3.5%,0.3%,0.0%,1.9%
3,5 to 17 years,5.1%,0.2%,2.6%,0.4%
4,18 to 24 years,20.2%,65.1%,86.4%,19.3%
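
To make the column step concrete, here is a minimal sketch that applies the same filter and rename to a tiny hand-built frame. The raw column names are an assumption inferred from the filter string and the trimmed headers above; the real CSV headers may differ in detail, and the "Native" breakdown column is purely hypothetical.

```python
import pandas as pd

# Assumed raw header pattern: "<tract>, <county> County, Massachusetts!!<group>!!Estimate".
# The second tract column is a hypothetical sub-population breakdown.
raw = pd.DataFrame({
    "Label (Grouping)": ["Total population", "AGE"],
    "Census Tract 4001, Norfolk County, Massachusetts!!Total!!Estimate": ["5305", None],
    "Census Tract 4001, Norfolk County, Massachusetts!!Native!!Estimate": ["1200", None],
})
year = "2018"

# Keep the label column and the per-tract total estimates only.
kept = [c for c in raw.columns if "Massachusetts!!Total!!Estimate" in c or "Label" in c]
trimmed = raw[kept].copy()

# Cut everything after "County" and append the year; other columns just get the year.
trimmed.columns = [
    c.split("County")[0] + f"County, {year}" if "County" in c else f"{c} {year}"
    for c in trimmed.columns
]

print(list(trimmed.columns))
# ['Label (Grouping) 2018', 'Census Tract 4001, Norfolk County, 2018']
```

The breakdown column is dropped because its name lacks the `Massachusetts!!Total!!Estimate` substring, and the surviving tract column is shortened to the tract, county, and year.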


The data also contains rows we are not interested in. From the demographic sheets, we primarily want race, income, and poverty status, so we filter out everything else while keeping the sub-headers of those sections. In the output below, the "AGE" header from the previous output has been filtered out.

In [14]:
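def filter_rows(data):
    # Section headers whose rows we keep
    saved_categories = [
        "RACE AND HISPANIC OR LATINO ORIGIN",
        "INDIVIDUALS' INCOME IN THE PAST 12 MONTHS",
        "POVERTY STATUS"
    ]

    keep_rows = []
    relevant_section = False
    for index, row in data.iterrows():
        # A missing value in the first tract column marks a section header,
        # which ends the previous section
        if pd.isna(row.iloc[1]):
            relevant_section = False
        # Start keeping rows when the label matches a saved category
        if any(category in str(row.iloc[0]) for category in saved_categories):
            relevant_section = True
        if relevant_section:
            keep_rows.append(index)
    # iterrows() yields index labels, so select with .loc rather than .iloc
    data = data.loc[keep_rows]
    return data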

filtered_again = filter_rows(filtered)
filtered_again.head(20)



Unnamed: 0,Label (Grouping) 2018,"Census Tract 4001, Norfolk County, 2018","Census Tract 102.04, Suffolk County, 2018","Census Tract 103, Suffolk County, 2018","Census Tract 104.08, Suffolk County, 2018"
14,RACE AND HISPANIC OR LATINO ORIGIN,,,,
15,One race,93.3%,95.1%,96.9%,97.8%
16,White,73.4%,63.6%,75.1%,66.2%
17,Black or African American,1.8%,2.7%,5.6%,8.0%
18,American Indian and Alaska Native,0.0%,0.5%,0.0%,0.0%
19,Asian,11.1%,25.6%,12.4%,20.9%
20,Native Hawaiian and Other Pacific ...,0.1%,0.2%,0.1%,0.0%
21,Some other race,6.8%,2.5%,3.7%,2.6%
22,Two or more races,6.7%,4.9%,3.1%,2.2%
23,Hispanic or Latino origin (of any race),18.1%,9.0%,9.1%,8.7%
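
The section logic above can also be expressed without an explicit loop. The following is a sketch of an alternative, not the method used in this notebook: it treats a row as a section header when its first tract column is missing, forward-fills the header label onto the rows beneath it, and keeps rows whose section matches a saved category. Unlike the loop, it only matches categories against header rows. The toy frame reuses a few values from the output above; the column name `Tract A, 2018` and the last section with its row are made-up placeholders.

```python
import pandas as pd

saved_categories = [
    "RACE AND HISPANIC OR LATINO ORIGIN",
    "INDIVIDUALS' INCOME IN THE PAST 12 MONTHS",
    "POVERTY STATUS",
]

# Toy frame: the final two rows are hypothetical placeholders.
toy = pd.DataFrame({
    "Label (Grouping) 2018": [
        "Total population", "AGE", "Under 5 years",
        "RACE AND HISPANIC OR LATINO ORIGIN", "White", "Asian",
        "HYPOTHETICAL SECTION", "Hypothetical row",
    ],
    "Tract A, 2018": ["5305", None, "3.5%", None, "73.4%", "11.1%", None, "1.0%"],
})

# Header rows have no value in the first tract column.
is_header = toy.iloc[:, 1].isna()

# Label every row with the header of the section it belongs to.
section = toy.iloc[:, 0].where(is_header).ffill()

# Keep rows whose section header contains one of the saved categories.
keep = section.fillna("").apply(lambda s: any(cat in s for cat in saved_categories))
result = toy[keep]

print(result.iloc[:, 0].tolist())
# ['RACE AND HISPANIC OR LATINO ORIGIN', 'White', 'Asian']
```

Rows before the first header, such as "Total population", have no section and are dropped, which matches the behavior of the loop.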
