## Green Line Extension Analysis

The MBTA had been planning the Green Line Extension project since 2005 to bring the Green Line into Somerville and Medford. The project extends the line beyond Lechmere and splits it into two branches, one running to Union Square in Somerville and the other to Tufts University in Medford. The MBTA has estimated that the line will support 45,000 one-way trips by 2030. The extension also includes an additional vehicle maintenance and storage yard in Somerville. Construction proceeded intermittently from 2018 and was completed in December 2022, which makes the stops along the extension the most recent additions to Boston's train infrastructure. This mid-semester report explores demographic characteristics related to displacement and how they changed during the construction of the extension.

Observing these trends requires data collected at a granular level, which the U.S. Census Bureau provides. The Census Bureau manages the American Community Survey (ACS), which publishes community data and estimates down to the block-group level. Its multi-year estimates pool several years of responses to give more statistically reliable figures for smaller communities, which might otherwise lack relevant or accurate data. The tables used here are broken down at the tract level, so we used the Census Bureau's tract maps for the area to identify the tracts that are largely covered by the extension project.

The extension lies largely within Middlesex County. The maps below show the project route and the census tracts side by side.

### Green Line Extension Map and Tracts

In [75]:
from IPython.display import display, HTML

tracts = 'green-line-census-tracts.png'
extension_map = 'green-line-extension.png'

html = f"""
<table><tr>
<td><img src='{extension_map}' width='600'></td>
<td><img src='{tracts}' width='520'></td>
</tr></table>
"""
display(HTML(html))

Link to Middlesex County Tracts: https://www2.census.gov/geo/maps/DC2020/PL20/st25_ma/censustract_maps/c25017_middlesex/DC20CT_C25017.pdf

The train line passes through several tracts in three cities:
- Cambridge (green): 3521.01, 3521.02, 3522, 3527
- Somerville (purple): 3503, 3504, 3505, 3506, 3502.01, 3502.02, 3510.01, 3511.01, 3511.02, 3501.08, 3512.04, 3512.03, 3513, 3515, 3514.04, 3514.03, 3501.09, 3501.07
- Medford (orange): 3394, 3395, 3396, 3397

Below are the relevant ACS 5-year tables for these tracts:
- [S0802](https://data.census.gov/table/ACSST5Y2022.S0802?g=1400000US25017339400,25017339500,25017339600,25017339700,25017350107,25017350108,25017350109,25017350201,25017350202,25017350300,25017350400,25017350500,25017350600,25017351001,25017351101,25017351102,25017351203,25017351204,25017351300,25017351403,25017351404,25017351500,25017352101,25017352102,25017352200,25017352700): Means of Transportation to Work
- [S0601](https://data.census.gov/table/ACSST5Y2022.S0601?g=1400000US25017339400,25017339500,25017339600,25017339700,25017350107,25017350108,25017350109,25017350201,25017350202,25017350300,25017350400,25017350500,25017350600,25017351001,25017351101,25017351102,25017351203,25017351204,25017351300,25017351403,25017351404,25017351500,25017352101,25017352102,25017352200,25017352700): General Characteristics of the Population
- [DP04](https://data.census.gov/table/ACSDP5Y2022.DP04?t=Homeownership%20Rate&g=1400000US25017339400,25017339500,25017339600,25017339700,25017350107,25017350108,25017350109,25017350201,25017350202,25017350300,25017350400,25017350500,25017350600,25017351001,25017351101,25017351102,25017351203,25017351204,25017351300,25017351403,25017351404,25017351500,25017352101,25017352102,25017352200,25017352700): Selected Housing Characteristics
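
The `g=` parameter in the links above identifies each tract by an 11-digit GEOID rather than by its tract number. As a minimal sketch of that mapping, assuming the usual layout of a 2-digit state code, a 3-digit county code, and a 6-digit tract code (the prefix `25017` is taken from the links; the helper name `tract_geoid` is hypothetical):

```python
STATE_FIPS = "25"    # Massachusetts, as it appears in the links above
COUNTY_FIPS = "017"  # Middlesex County, as it appears in the links above

def tract_geoid(tract: str) -> str:
    """Build the 11-digit GEOID for a tract number such as '3394' or '3521.01'."""
    whole, _, suffix = tract.partition(".")
    # The tract code is 4 digits plus a 2-digit suffix ('00' when there is none).
    return f"{STATE_FIPS}{COUNTY_FIPS}{whole.zfill(4)}{suffix.ljust(2, '0')}"

medford = ["3394", "3395", "3396", "3397"]
cambridge = ["3521.01", "3521.02", "3522", "3527"]
geoids = [tract_geoid(t) for t in medford + cambridge]
print(",".join(geoids))
```

Comparing the generated identifiers against the `g=` list is a quick way to confirm that every tract in the lists above is included in each query.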



### Cleaning Survey Data

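Each table is released annually. The code below combines the 2018 through 2022 releases into one CSV per table, drops columns that are not needed (such as margins of error), and keeps only the rows in the categories relevant to this analysis.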

In [76]:
import pandas as pd

# Retrieves all surveys for a chosen survey
def retrieve_survey_data(survey):
    years = ['2018', '2019', '2020', '2021', '2022']
    surveys = []
    for year in years:
        file_path = f'../../data/green-line-extension-data/{survey}/{survey}-{year}.csv'
        data = pd.read_csv(file_path)
        surveys.append((data, year))
    return surveys

# Filter for DP04
def dp04_filter(data, year):
    # Filtering columns: drop the margin-of-error columns
    filtered_columns = [col for col in data.columns if "Margin of Error" not in col]
    data = data[filtered_columns]
    # Rename tract columns to end in "County, <year>"; append the year to any other column
    data.columns = [col.split('County')[0] + f'County, {year}' if 'County' in col else col + f' {year}' for col in data.columns]

    # Filtering rows
    saved_categories = [
        "HOUSING OCCUPANCY", 
        "UNITS IN STRUCTURE", 
        "YEAR STRUCTURE BUILT",
        "HOUSING TENURE",
        "YEAR HOUSEHOLDER MOVED INTO UNIT",
        "VALUE",
        "MORTGAGE STATUS",
        "SMOC",
        "SMOCAPI",
        "GROSS RENT",
        "GRAPI"
    ]
    keep_rows = []
    relevant_section = False
    for index, row in data.iterrows():
        if pd.isna(row.iloc[1]):
            relevant_section = False
        if any(category in str(row.iloc[0]) for category in saved_categories):
            relevant_section = True
        if relevant_section:
            keep_rows.append(index)
    data = data.loc[keep_rows]  # keep_rows holds index labels, so select by label
    return data

# Filter for S0601
def s0601_filter(data, year):
    # Filtering columns: keep the label column and the "Massachusetts!!Total!!Estimate" columns
    filtered_columns = [col for col in data.columns if "Massachusetts!!Total!!Estimate" in col or "Label" in col]
    data = data[filtered_columns]
    # Rename tract columns to end in "County, <year>"; append the year to any other column
    data.columns = [col.split('County')[0] + f'County, {year}' if 'County' in col else col + f' {year}' for col in data.columns]

    # Filtering rows
    saved_categories = [
        "RACE AND HISPANIC OR LATINO ORIGIN", 
        "INDIVIDUALS' INCOME IN THE PAST 12 MONTHS", 
        "POVERTY STATUS"
    ]
    keep_rows = []
    relevant_section = False
    for index, row in data.iterrows():
        if pd.isna(row.iloc[1]):
            relevant_section = False
        if any(category in str(row.iloc[0]) for category in saved_categories):
            relevant_section = True
        if relevant_section:
            keep_rows.append(index)
    data = data.loc[keep_rows]  # keep_rows holds index labels, so select by label
    return data

# Filter for S0802
def s0802_filter(data, year):
    # Filtering columns: keep the label column and the "Massachusetts!!Total!!Estimate" columns
    filtered_columns = [col for col in data.columns if "Massachusetts!!Total!!Estimate" in col or "Label" in col]
    data = data[filtered_columns]
    # Rename tract columns to end in "County, <year>"; append the year to any other column
    data.columns = [col.split('County')[0] + f'County, {year}' if 'County' in col else col + f' {year}' for col in data.columns]

    # Filtering rows
    saved_categories = [
        "AGE", 
        "SEX", 
        "POVERTY STATUS",
        "PLACE OF WORK",
        "TIME LEAVING HOME TO GO TO WORK",
        "TRAVEL TIME TO WORK",
        "HOUSING TENURE",
        "VEHICLES AVAILABLE",
        "PERCENT ALLOCATED",
    ]
    keep_rows = []
    relevant_section = False
    for index, row in data.iterrows():
        if pd.isna(row.iloc[1]):
            relevant_section = False
        if any(category in str(row.iloc[0]) for category in saved_categories):
            relevant_section = True
        if relevant_section:
            keep_rows.append(index)
    data = data.loc[keep_rows]  # keep_rows holds index labels, so select by label
    return data

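def combine_years(dfs):
    # Use the first year's label column as the shared row labels
    label_column = dfs[0].iloc[:, 0]
    # Drop each year's label column, then place the years side by side
    trimmed_dfs = [df.iloc[:, 1:] for df in dfs]
    combined_df = pd.concat(trimmed_dfs, axis=1)
    # Keep only the first column for any repeated column name
    combined_df = combined_df.loc[:, ~combined_df.columns.duplicated()]
    # Sort columns by name so each tract's years sit next to each other
    sorted_df = combined_df.sort_index(axis=1)
    sorted_df.insert(0, 'Label', label_column)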
    return sorted_df
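
The three filters above share the same row-selection logic: a row with no value in its first data column is treated as a section header, and rows are kept from a matching header until the next header. As a minimal sketch of that logic, assuming a hypothetical helper `keep_sections` and a made-up toy frame (neither is part of the notebook):

```python
import pandas as pd

def keep_sections(data, saved_categories):
    """Keep each header row that matches a saved category plus the rows under it."""
    keep_rows = []
    relevant_section = False
    for index, row in data.iterrows():
        # A header row has no value in the first data column; it ends the current section.
        if pd.isna(row.iloc[1]):
            relevant_section = False
        # Substring match against the label, as in the filters above.
        if any(category in str(row.iloc[0]) for category in saved_categories):
            relevant_section = True
        if relevant_section:
            keep_rows.append(index)
    return data.loc[keep_rows]

# Toy frame; the labels and numbers are illustrative only.
toy = pd.DataFrame({
    "Label": ["HOUSING TENURE", "Owner-occupied", "Renter-occupied", "ROOMS", "1 room"],
    "Tract A": [None, 10.0, 20.0, None, 5.0],
})
kept = keep_sections(toy, ["HOUSING TENURE"])
print(kept["Label"].tolist())
```

Because the match is by substring, a short category such as `"AGE"` also matches any header that merely contains those letters, so it is worth checking the kept labels after filtering.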

In [77]:
dp04 = retrieve_survey_data('dp04')
dp04 = [dp04_filter(data, year) for data, year in dp04]
dp04 = combine_years(dp04)
dp04.to_csv('../../data/green-line-extension-data/dp04/dp04.csv', index=False)
dp04.head()

Unnamed: 0,Label,"Census Tract 3394, Middlesex County, 2018","Census Tract 3394, Middlesex County, 2019","Census Tract 3394, Middlesex County, 2020","Census Tract 3394, Middlesex County, 2021","Census Tract 3394; Middlesex County, 2022","Census Tract 3395, Middlesex County, 2018","Census Tract 3395, Middlesex County, 2019","Census Tract 3395, Middlesex County, 2020","Census Tract 3395, Middlesex County, 2021",...,"Census Tract 3522, Middlesex County, 2018","Census Tract 3522, Middlesex County, 2019","Census Tract 3522, Middlesex County, 2020","Census Tract 3522, Middlesex County, 2021","Census Tract 3522; Middlesex County, 2022","Census Tract 3527, Middlesex County, 2018","Census Tract 3527, Middlesex County, 2019","Census Tract 3527, Middlesex County, 2020","Census Tract 3527, Middlesex County, 2021","Census Tract 3527; Middlesex County, 2022"
0,HOUSING OCCUPANCY,,,,,,,,,,...,,,,,,,,,,
1,Total housing units,1524.0,1517.0,1449.0,1525.0,1547.0,1651.0,1636.0,1758.0,1863.0,...,1162.0,1151.0,1050.0,1030.0,1068.0,1047.0,1009.0,1012.0,1038.0,1052.0
2,Occupied housing units,1496.0,1517.0,1449.0,1504.0,1525.0,1539.0,1541.0,1595.0,1694.0,...,1069.0,1044.0,982.0,962.0,963.0,972.0,947.0,969.0,958.0,942.0
3,Vacant housing units,28.0,0.0,0.0,21.0,22.0,112.0,95.0,163.0,169.0,...,93.0,107.0,68.0,68.0,105.0,75.0,62.0,43.0,80.0,110.0
4,Homeowner vacancy rate,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.6,4.3,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [78]:
s0601 = retrieve_survey_data('s0601')
s0601 = [s0601_filter(data, year) for data, year in s0601]
s0601 = combine_years(s0601)
s0601.to_csv('../../data/green-line-extension-data/s0601/s0601.csv', index=False)
s0601.head()

Unnamed: 0,Label,"Census Tract 3394, Middlesex County, 2018","Census Tract 3394, Middlesex County, 2019","Census Tract 3394, Middlesex County, 2020","Census Tract 3394, Middlesex County, 2021","Census Tract 3394; Middlesex County, 2022","Census Tract 3395, Middlesex County, 2018","Census Tract 3395, Middlesex County, 2019","Census Tract 3395, Middlesex County, 2020","Census Tract 3395, Middlesex County, 2021",...,"Census Tract 3522, Middlesex County, 2018","Census Tract 3522, Middlesex County, 2019","Census Tract 3522, Middlesex County, 2020","Census Tract 3522, Middlesex County, 2021","Census Tract 3522; Middlesex County, 2022","Census Tract 3527, Middlesex County, 2018","Census Tract 3527, Middlesex County, 2019","Census Tract 3527, Middlesex County, 2020","Census Tract 3527, Middlesex County, 2021","Census Tract 3527; Middlesex County, 2022"
14,RACE AND HISPANIC OR LATINO ORIGIN,,,,,,,,,,...,,,,,,,,,,
15,One race,99.3%,97.6%,96.7%,95.6%,93.0%,96.9%,97.0%,95.8%,94.4%,...,97.6%,95.6%,93.2%,93.2%,91.9%,93.0%,95.5%,96.3%,93.4%,91.5%
16,White,85.2%,86.7%,90.0%,86.9%,84.0%,78.9%,80.1%,80.3%,79.8%,...,71.6%,73.0%,62.9%,57.3%,55.8%,71.4%,57.5%,64.3%,62.4%,64.0%
17,Black or African American,1.8%,1.4%,0.8%,1.4%,2.4%,8.2%,6.3%,2.9%,2.8%,...,15.7%,14.6%,8.2%,10.0%,10.6%,4.8%,21.2%,15.2%,15.7%,14.0%
18,American Indian and Alaska Native,0.0%,0.0%,0.0%,0.0%,0.0%,0.0%,0.0%,0.0%,0.0%,...,0.4%,0.0%,0.0%,0.0%,0.0%,0.6%,0.5%,0.6%,0.0%,0.0%
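
The S0601 values in the preview above are percentage strings, so they need to be converted to numbers before any change over time can be computed. A minimal sketch, assuming a hypothetical helper `percent_to_float` and using the "White" row for Census Tract 3394 copied from that preview:

```python
import pandas as pd

def percent_to_float(value):
    """Convert a percentage string such as '85.2%' to 85.2; pass missing values through."""
    if pd.isna(value):
        return float("nan")
    return float(str(value).rstrip("%"))

# "White" row for Census Tract 3394, 2018 through 2022, from the preview above.
white_share = pd.Series(
    ["85.2%", "86.7%", "90.0%", "86.9%", "84.0%"],
    index=[2018, 2019, 2020, 2021, 2022],
).map(percent_to_float)

# Percentage-point change between the first and last release.
change = white_share.loc[2022] - white_share.loc[2018]
print(round(change, 1))
```

The same conversion could be applied column by column to the combined frames before plotting or comparing tracts.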


In [79]:
s0802 = retrieve_survey_data('s0802')
s0802 = [s0802_filter(data, year) for data, year in s0802]
s0802 = combine_years(s0802)
s0802.to_csv('../../data/green-line-extension-data/s0802/s0802.csv', index=False)
s0802.head()

Unnamed: 0,Label,"Census Tract 3394, Middlesex County, 2018","Census Tract 3394, Middlesex County, 2019","Census Tract 3394, Middlesex County, 2020","Census Tract 3394, Middlesex County, 2021","Census Tract 3394; Middlesex County, 2022","Census Tract 3395, Middlesex County, 2018","Census Tract 3395, Middlesex County, 2019","Census Tract 3395, Middlesex County, 2020","Census Tract 3395, Middlesex County, 2021",...,"Census Tract 3522, Middlesex County, 2018","Census Tract 3522, Middlesex County, 2019","Census Tract 3522, Middlesex County, 2020","Census Tract 3522, Middlesex County, 2021","Census Tract 3522; Middlesex County, 2022","Census Tract 3527, Middlesex County, 2018","Census Tract 3527, Middlesex County, 2019","Census Tract 3527, Middlesex County, 2020","Census Tract 3527, Middlesex County, 2021","Census Tract 3527; Middlesex County, 2022"
1,AGE,,,,,,,,,,...,,,,,,,,,,
2,16 to 19 years,1.2%,1.6%,1.5%,1.7%,0.8%,10.2%,12.0%,12.7%,10.7%,...,0.0%,0.0%,0.0%,0.0%,0.0%,0.5%,1.4%,3.2%,3.9%,5.3%
3,20 to 24 years,14.6%,20.4%,19.7%,19.1%,14.1%,16.9%,21.2%,18.7%,20.1%,...,15.2%,21.5%,14.1%,14.1%,13.2%,5.1%,12.2%,13.7%,12.4%,14.8%
4,25 to 44 years,54.5%,53.3%,46.8%,47.4%,54.8%,44.8%,41.4%,38.4%,37.7%,...,62.1%,55.8%,66.6%,62.4%,67.2%,77.2%,65.3%,61.7%,65.4%,62.6%
5,45 to 54 years,9.6%,7.2%,6.5%,5.9%,5.1%,10.3%,7.8%,9.3%,11.4%,...,5.3%,3.5%,3.1%,5.0%,3.7%,9.4%,13.4%,11.0%,12.9%,11.1%
