# Data Processing: OCO-2 and SVI at the census-tract and census-county levels

### Issues and workarounds
- Census tract IDs may have changed over time:
    - Within the census data, some tracts are missing SVI values for one or more years. This suggests to me that tract boundaries changed within our selected time range. To work around this, we keep only tracts that have SVI values for every year in the selected range.
    - If we set our range to 2014-2020, we get far fewer samples. It seems quite a few census boundaries changed in 2020, so we will set our range to 2014-2018.
    - Hypothetical example: Smallville, Tennessee has census tract 108. This tract has SVI values for 2014 and 2016 but none for 2018, so we would not use it in our analysis, since our selected time range is 2014-2018.
- Sparsity of the OCO-2 XCO2 data:
    - Grouping by census tract increases our granularity, but the sparsity of the OCO-2 XCO2 dataset and potentially changing tract boundaries may make it necessary to group at the county level instead. I am also creating datasets at that level. We can evaluate performance when running the clustering model.
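
The "keep only tracts with every year" rule above can be sketched on toy data as follows. This is a minimal illustration, not the code used later in this notebook: `toy_svi` and `keep_complete_tracts` are hypothetical names, and the values are made up.

```python
import pandas as pd

# Toy SVI-like frame (hypothetical values): tract 108 is missing 2018.
toy_svi = pd.DataFrame({
    "FIPS": [101, 101, 101, 108, 108],
    "YEAR": [2014, 2016, 2018, 2014, 2016],
    "RPL_THEMES": [0.35, 0.19, 0.25, 0.60, 0.62],
})

def keep_complete_tracts(df, required_years):
    """Keep only tracts that have a row for every year in required_years."""
    required = set(required_years)
    # Ignore years outside the selected range before counting
    in_range = df[df["YEAR"].isin(required)]
    years_per_tract = in_range.groupby("FIPS")["YEAR"].nunique()
    complete = years_per_tract[years_per_tract == len(required)].index
    return in_range[in_range["FIPS"].isin(complete)].reset_index(drop=True)

complete_2014_2018 = keep_complete_tracts(toy_svi, [2014, 2016, 2018])
print(complete_2014_2018["FIPS"].unique())
```

Tract 108 is dropped because it has no 2018 row, which mirrors the Smallville example.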

In [14]:
import pandas as pd
import numpy as np
import geopandas as gpd
from shapely.geometry import Point

# SVI data
Obtained from here: https://www.atsdr.cdc.gov/placeandhealth/svi/data_documentation_download.html

In [15]:
df_svi_2014= pd.read_csv(r"C:\Users\ddrye\OneDrive\Documents\OMSA_Program\OMSA 2023\Summer2023\Practicum\Team-Project-Practicum-6748\nasa_data\data\SVI_2014_US.csv")
df_svi_2016= pd.read_csv(r"C:\Users\ddrye\OneDrive\Documents\OMSA_Program\OMSA 2023\Summer2023\Practicum\Team-Project-Practicum-6748\nasa_data\data\SVI_2016_US.csv")
df_svi_2018= pd.read_csv(r"C:\Users\ddrye\OneDrive\Documents\OMSA_Program\OMSA 2023\Summer2023\Practicum\Team-Project-Practicum-6748\nasa_data\data\SVI_2018_US.csv")
df_svi_2020= pd.read_csv(r"C:\Users\ddrye\OneDrive\Documents\OMSA_Program\OMSA 2023\Summer2023\Practicum\Team-Project-Practicum-6748\nasa_data\data\SVI_2020_US.csv")
pd.set_option('display.max_columns', None)
display(df_svi_2020.head(2))

Unnamed: 0,ST,STATE,ST_ABBR,STCNTY,COUNTY,FIPS,LOCATION,AREA_SQMI,E_TOTPOP,M_TOTPOP,E_HU,M_HU,E_HH,M_HH,E_POV150,M_POV150,E_UNEMP,M_UNEMP,E_HBURD,M_HBURD,E_NOHSDP,M_NOHSDP,E_UNINSUR,M_UNINSUR,E_AGE65,M_AGE65,E_AGE17,M_AGE17,E_DISABL,M_DISABL,E_SNGPNT,M_SNGPNT,E_LIMENG,M_LIMENG,E_MINRTY,M_MINRTY,E_MUNIT,M_MUNIT,E_MOBILE,M_MOBILE,E_CROWD,M_CROWD,E_NOVEH,M_NOVEH,E_GROUPQ,M_GROUPQ,EP_POV150,MP_POV150,EP_UNEMP,MP_UNEMP,EP_HBURD,MP_HBURD,EP_NOHSDP,MP_NOHSDP,EP_UNINSUR,MP_UNINSUR,EP_AGE65,MP_AGE65,EP_AGE17,MP_AGE17,EP_DISABL,MP_DISABL,EP_SNGPNT,MP_SNGPNT,EP_LIMENG,MP_LIMENG,EP_MINRTY,MP_MINRTY,EP_MUNIT,MP_MUNIT,EP_MOBILE,MP_MOBILE,EP_CROWD,MP_CROWD,EP_NOVEH,MP_NOVEH,EP_GROUPQ,MP_GROUPQ,EPL_POV150,EPL_UNEMP,EPL_HBURD,EPL_NOHSDP,EPL_UNINSUR,SPL_THEME1,RPL_THEME1,EPL_AGE65,EPL_AGE17,EPL_DISABL,EPL_SNGPNT,EPL_LIMENG,SPL_THEME2,RPL_THEME2,EPL_MINRTY,SPL_THEME3,RPL_THEME3,EPL_MUNIT,EPL_MOBILE,EPL_CROWD,EPL_NOVEH,EPL_GROUPQ,SPL_THEME4,RPL_THEME4,SPL_THEMES,RPL_THEMES,F_POV150,F_UNEMP,F_HBURD,F_NOHSDP,F_UNINSUR,F_THEME1,F_AGE65,F_AGE17,F_DISABL,F_SNGPNT,F_LIMENG,F_THEME2,F_MINRTY,F_THEME3,F_MUNIT,F_MOBILE,F_CROWD,F_NOVEH,F_GROUPQ,F_THEME4,F_TOTAL,E_DAYPOP,E_NOINT,M_NOINT,E_AFAM,M_AFAM,E_HISP,M_HISP,E_ASIAN,M_ASIAN,E_AIAN,M_AIAN,E_NHPI,M_NHPI,E_TWOMORE,M_TWOMORE,E_OTHERRACE,M_OTHERRACE,EP_NOINT,MP_NOINT,EP_AFAM,MP_AFAM,EP_HISP,MP_HISP,EP_ASIAN,MP_ASIAN,EP_AIAN,MP_AIAN,EP_NHPI,MP_NHPI,EP_TWOMORE,MP_TWOMORE,EP_OTHERRACE,MP_OTHERRACE
0,1,Alabama,AL,1001,Autauga,1001020100,"Census Tract 201, Autauga County, Alabama",3.79357,1941,390,710,120,693,121,352,138,18,18,144,59,187,93,187,91,295,101,415,208,413,147,51,31,0,48,437,192,0,16,88,43,0,16,10,12,0,12,18.1,6.1,2.1,2.1,20.8,7.7,14.3,6.3,9.6,5.1,15.2,5.1,21.4,9.8,21.3,7.4,7.4,4.3,0.0,2.6,22.5,8.8,0.0,2.3,12.4,5.8,0.0,2.3,1.4,1.8,0.0,0.6,0.4727,0.1731,0.3448,0.6963,0.6529,2.3398,0.4578,0.4693,0.4653,0.8926,0.6627,0.0,2.4899,0.5079,0.3921,0.3921,0.3921,0.0,0.818,0.0,0.1872,0.0,1.0052,0.0945,6.227,0.2823,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1033,217,544,235,151,33,34,41,54,0,12,0,12,128,99,0,12,11.2,2.3,12.1,7.1,1.7,1.8,2.1,2.7,0.0,1.8,0.0,1.8,6.6,5.1,0.0,1.8
1,1,Alabama,AL,1001,Autauga,1001020200,"Census Tract 202, Autauga County, Alabama",1.282174,1757,310,720,99,573,99,384,182,29,26,149,60,139,59,91,55,284,97,325,110,168,73,21,25,0,48,1116,306,3,13,5,8,9,16,57,37,212,85,25.4,11.0,4.0,3.5,26.0,9.5,10.6,4.8,5.9,3.7,16.2,4.6,18.5,5.3,11.0,3.8,3.7,4.3,0.0,2.9,63.5,13.3,0.4,1.8,0.7,1.1,1.6,2.8,9.9,6.3,12.1,4.3,0.6491,0.421,0.5214,0.5673,0.432,2.5908,0.5348,0.5252,0.2898,0.3984,0.3493,0.0,1.5627,0.081,0.761,0.761,0.761,0.2798,0.5115,0.4821,0.7387,0.957,2.9691,0.7915,7.8836,0.5406,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,4080,301,372,1026,298,30,38,0,12,0,12,0,12,46,54,14,19,19.5,3.8,58.4,8.0,1.7,2.1,0.0,2.0,0.0,2.0,0.0,2.0,2.6,3.0,0.8,1.1


### There are some mismatched column IDs from year to year. Let's keep only the columns we want, then concatenate all years together

Some string values in the dataframe also contain stray whitespace, so we strip it out.

In [16]:
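def whitespace_remover(dataframe):
    """Strip leading/trailing whitespace from every string value, in place."""
    # iterating over the columns
    for i in dataframe.columns:

        # only object (string) columns can carry stray whitespace
        if dataframe[i].dtype == 'object':

            # strip strings only; NaN and other non-string values pass through
            # unchanged (map(str.strip) would raise a TypeError on them)
            dataframe[i] = dataframe[i].map(lambda v: v.strip() if isinstance(v, str) else v)
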

In [17]:
#Getting the columns that match from each dataframe
match_20142016=(set(df_svi_2014.columns) & set(df_svi_2016.columns))
match_201420162018=(match_20142016 & set(df_svi_2018.columns))
match_2014201620182020=(match_201420162018 & set(df_svi_2020.columns))
print('shared columns between all years:',match_2014201620182020)

subset=[]
for i in ['ST', 'STATE', 'ST_ABBR', 'STCNTY', 'COUNTY', 'FIPS', 'LOCATION', 'E_TOTPOP','E_HU', 'E_PCI', 'E_HH', 'E_UNEMP', 'E_POV', 'RPL_THEME1','RPL_THEME2','RPL_THEME3', 'RPL_THEME4','RPL_THEMES']:
    if i in match_2014201620182020:
        #print("Yes,",i, "is in this set")
        subset.append(i)

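# .copy() makes each subset an independent frame, so adding YEAR
# does not trigger a SettingWithCopyWarning
df_svi_2014=df_svi_2014[subset].copy()
df_svi_2014['YEAR']=2014

df_svi_2016=df_svi_2016[subset].copy()
df_svi_2016['YEAR']=2016

df_svi_2018=df_svi_2018[subset].copy()
df_svi_2018['YEAR']=2018

df_svi_2020=df_svi_2020[subset].copy()
df_svi_2020['YEAR']=2020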

df_svi = pd.concat([df_svi_2014,df_svi_2016,df_svi_2018,df_svi_2020],ignore_index=True)

#formatting
df_svi=df_svi.sort_values(by=['FIPS','YEAR']).reset_index(drop=True)
df_svi['STATE']=df_svi['STATE'].str.upper()

# applying whitespace_remover function on dataframe
whitespace_remover(df_svi)

display(df_svi.head(2))

shared columns between all years: {'F_MOBILE', 'SPL_THEME2', 'M_MUNIT', 'EP_UNEMP', 'EPL_LIMENG', 'RPL_THEME2', 'M_HH', 'EPL_NOHSDP', 'EP_DISABL', 'E_UNINSUR', 'MP_MINRTY', 'F_GROUPQ', 'F_THEME4', 'E_DAYPOP', 'EPL_DISABL', 'EP_LIMENG', 'MP_AGE65', 'E_HH', 'E_SNGPNT', 'EPL_UNEMP', 'F_MINRTY', 'SPL_THEME3', 'RPL_THEME4', 'F_UNEMP', 'MP_DISABL', 'M_UNEMP', 'COUNTY', 'F_AGE17', 'E_MOBILE', 'EP_GROUPQ', 'M_NOVEH', 'F_NOVEH', 'F_THEME3', 'M_AGE17', 'F_SNGPNT', 'EPL_MOBILE', 'E_HU', 'M_AGE65', 'F_NOHSDP', 'E_MINRTY', 'RPL_THEME1', 'EP_NOHSDP', 'RPL_THEMES', 'F_CROWD', 'MP_NOHSDP', 'EPL_SNGPNT', 'EP_NOVEH', 'MP_NOVEH', 'E_NOVEH', 'EPL_CROWD', 'EP_AGE65', 'SPL_THEMES', 'SPL_THEME1', 'M_MINRTY', 'M_LIMENG', 'M_HU', 'M_NOHSDP', 'F_DISABL', 'E_AGE17', 'EP_SNGPNT', 'E_AGE65', 'STCNTY', 'ST_ABBR', 'E_MUNIT', 'F_THEME1', 'E_UNEMP', 'MP_MOBILE', 'EPL_MUNIT', 'EPL_GROUPQ', 'MP_CROWD', 'ST', 'EP_CROWD', 'EP_AGE17', 'RPL_THEME3', 'AREA_SQMI', 'E_LIMENG', 'EP_UNINSUR', 'MP_UNINSUR', 'LOCATION', 'E_DISABL'

Unnamed: 0,ST,STATE,ST_ABBR,STCNTY,COUNTY,FIPS,LOCATION,E_TOTPOP,E_HU,E_HH,E_UNEMP,RPL_THEME1,RPL_THEME2,RPL_THEME3,RPL_THEME4,RPL_THEMES,YEAR
0,1,ALABAMA,AL,1001,Autauga,1001020100,"Census Tract 201, Autauga County, Alabama",1900.0,714.0,688.0,48.0,0.4399,0.3403,0.3134,0.3634,0.3466,2014
1,1,ALABAMA,AL,1001,Autauga,1001020100,"Census Tract 201, Autauga County, Alabama",2010.0,751.0,740.0,43.0,0.3885,0.2355,0.3804,0.1057,0.1918,2016


# AT THE CENSUS-TRACT LEVEL

### Imputing missing years for census tracts where we have 2014, 2016 and 2018. The output is a df with the original values for these years and imputed values for 2015 and 2017
#### The cell below is computationally expensive

In [18]:
#Get the unique tracts from the df
tracts=np.unique(df_svi['FIPS'])

svi_df_tract_w_imputed_2014_2018=pd.DataFrame(columns=['ST', 'STATE', 'ST_ABBR', 'STCNTY', 'COUNTY', 'FIPS', 'LOCATION', 'E_TOTPOP', 'E_HU', 'E_HH', 'E_UNEMP', 'RPL_THEME1', 'RPL_THEME2', 'RPL_THEME3', 'RPL_THEME4', 'RPL_THEMES', 'YEAR'])
svi_df_tract_w_imputed_2014_2020=pd.DataFrame(columns=['ST', 'STATE', 'ST_ABBR', 'STCNTY', 'COUNTY', 'FIPS', 'LOCATION', 'E_TOTPOP', 'E_HU', 'E_HH', 'E_UNEMP', 'RPL_THEME1', 'RPL_THEME2', 'RPL_THEME3', 'RPL_THEME4', 'RPL_THEMES', 'YEAR'])

#for each unique tract
for tract in tracts:
    #subset by the orig df by that tract
    subset=df_svi[df_svi['FIPS']==tract]

    #if the tract has exactly the years 2014, 2016, 2018 (no 2020 row): impute 2015 and 2017
    if sorted(list(subset['YEAR']))==[2014,2016,2018]:
        
        #get numeric fields for averaging
        numeric=subset[['E_TOTPOP', 'E_HU', 'E_HH', 'E_UNEMP', 'RPL_THEME1', 'RPL_THEME2', 'RPL_THEME3', 'RPL_THEME4', 'RPL_THEMES']]
        non_numeric=subset[['ST', 'STATE', 'ST_ABBR', 'STCNTY', 'COUNTY', 'FIPS', 'LOCATION']]

        #calc mean with rolling window that averages 2 years at a time
        #returns imputed values for 2015 and 2017
        impute=numeric.rolling(2).mean()
        impute.dropna(inplace=True)
        impute['YEAR']=[2015,2017]

        #recombining the string label fields with the newly imputed data
        result=pd.concat([non_numeric, impute], axis=1)
        result.dropna(inplace=True)
        result['YEAR']=result['YEAR'].astype(int)

        #combining our computed values and the original values for that tract for each iteration into a new dataframe
        svi_df_tract_w_imputed_2014_2018 = pd.concat([svi_df_tract_w_imputed_2014_2018, result, subset], ignore_index=True)

    #if the tract has all of 2014, 2016, 2018, 2020: impute 2015, 2017 and 2019
    #(these tracts go only into the 2014-2020 dataframe, not the 2014-2018 one)
    elif sorted(list(subset['YEAR']))==[2014,2016,2018,2020]:
        #get numeric fields for averaging
        numeric=subset[['E_TOTPOP', 'E_HU', 'E_HH', 'E_UNEMP', 'RPL_THEME1', 'RPL_THEME2', 'RPL_THEME3', 'RPL_THEME4', 'RPL_THEMES']]
        non_numeric=subset[['ST', 'STATE', 'ST_ABBR', 'STCNTY', 'COUNTY', 'FIPS', 'LOCATION']]

        #calc mean with rolling window that averages 2 years at a time
        #returns imputed values for 2015, 2017 and 2019
        impute=numeric.rolling(2).mean()
        impute.dropna(inplace=True)
        impute['YEAR']=[2015,2017,2019]

        #recombining the string label fields with the newly imputed data
        result=pd.concat([non_numeric, impute], axis=1)
        result.dropna(inplace=True)
        result['YEAR']=result['YEAR'].astype(int)


        #combining our computed values and the original values for that tract for each iteration into a new dataframe
        svi_df_tract_w_imputed_2014_2020 = pd.concat([svi_df_tract_w_imputed_2014_2020, result, subset], ignore_index=True)

    #if we DON'T have years 2014-2018 or 2014-2020 for that tract: we're not using it in the new dataframes
    else:
        continue

#writing to csv
svi_df_tract_w_imputed_2014_2018=svi_df_tract_w_imputed_2014_2018.sort_values(by=['FIPS','YEAR'])
svi_df_tract_w_imputed_2014_2020=svi_df_tract_w_imputed_2014_2020.sort_values(by=['FIPS','YEAR'])
svi_df_tract_w_imputed_2014_2018['YEAR']=svi_df_tract_w_imputed_2014_2018['YEAR'].astype(str)
svi_df_tract_w_imputed_2014_2020['YEAR']=svi_df_tract_w_imputed_2014_2020['YEAR'].astype(str)

svi_df_tract_w_imputed_2014_2018.to_csv('svi_df_tract_w_imputed_2014_2018.csv',index=False)
svi_df_tract_w_imputed_2014_2020.to_csv('svi_df_tract_w_imputed_2014_2020.csv',index=False)
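
Since the loop above calls `pd.concat` once per tract, it gets slow on the full dataset. A vectorized version of the same midpoint averaging could look like the sketch below. It is not a drop-in replacement: `impute_midpoint_years` is a hypothetical helper, it assumes the frame already contains only tracts with a complete set of survey years spaced two years apart, and it is shown on a single tract using the E_TOTPOP values from the output further down.

```python
import pandas as pd

def impute_midpoint_years(df, id_col, year_col, value_cols):
    """Insert in-between years as the mean of the two neighbouring survey years.

    Assumes survey years are exactly two years apart within each tract.
    """
    df = df.sort_values([id_col, year_col]).reset_index(drop=True)
    # Values from the previous survey year of the same tract
    previous = df.groupby(id_col)[value_cols].shift(1)
    midpoints = df.copy()
    midpoints[value_cols] = (df[value_cols] + previous) / 2
    midpoints[year_col] = df[year_col] - 1
    # The first survey year of each tract has no predecessor, so drop it
    midpoints = midpoints[previous.notna().all(axis=1)]
    out = pd.concat([df, midpoints], ignore_index=True)
    return out.sort_values([id_col, year_col]).reset_index(drop=True)

# One tract, using the E_TOTPOP values shown for tract 1001020500
toy = pd.DataFrame({
    "FIPS": [1001020500] * 3,
    "YEAR": [2014, 2016, 2018],
    "E_TOTPOP": [10881.0, 10529.0, 9883.0],
})
imputed = impute_midpoint_years(toy, "FIPS", "YEAR", ["E_TOTPOP"])
print(imputed)
```

The groupby/shift does the pairing for all tracts at once, so there is a single `pd.concat` at the end instead of one per tract.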

In [19]:
display(svi_df_tract_w_imputed_2014_2018.head(10))
print(len(svi_df_tract_w_imputed_2014_2018))

display(svi_df_tract_w_imputed_2014_2020.head(10))
print(len(svi_df_tract_w_imputed_2014_2020))

Unnamed: 0,ST,STATE,ST_ABBR,STCNTY,COUNTY,FIPS,LOCATION,E_TOTPOP,E_HU,E_HH,E_UNEMP,RPL_THEME1,RPL_THEME2,RPL_THEME3,RPL_THEME4,RPL_THEMES,YEAR
2,1,ALABAMA,AL,1001,Autauga,1001020500,"Census Tract 205, Autauga County, Alabama",10881.0,4440.0,4165.0,182.0,0.1828,0.4615,0.3291,0.597,0.336,2014
0,1,ALABAMA,AL,1001,Autauga,1001020500,"Census Tract 205, Autauga County, Alabama",10705.0,4476.5,4227.0,129.0,0.1995,0.4392,0.3729,0.4993,0.3175,2015
3,1,ALABAMA,AL,1001,Autauga,1001020500,"Census Tract 205, Autauga County, Alabama",10529.0,4513.0,4289.0,76.0,0.2162,0.4169,0.4167,0.4016,0.299,2016
1,1,ALABAMA,AL,1001,Autauga,1001020500,"Census Tract 205, Autauga County, Alabama",10206.0,4500.5,4231.5,88.0,0.24815,0.5322,0.41715,0.5303,0.38515,2017
4,1,ALABAMA,AL,1001,Autauga,1001020500,"Census Tract 205, Autauga County, Alabama",9883.0,4488.0,4174.0,100.0,0.2801,0.6475,0.4176,0.659,0.4713,2018
7,1,ALABAMA,AL,1001,Autauga,1001020802,"Census Tract 208.02, Autauga County, Alabama",10471.0,4142.0,3717.0,385.0,0.5683,0.745,0.2223,0.5335,0.5434,2014
5,1,ALABAMA,AL,1001,Autauga,1001020802,"Census Tract 208.02, Autauga County, Alabama",10607.0,4199.0,3857.0,298.5,0.5267,0.76815,0.2898,0.4592,0.52135,2015
8,1,ALABAMA,AL,1001,Autauga,1001020802,"Census Tract 208.02, Autauga County, Alabama",10743.0,4256.0,3997.0,212.0,0.4851,0.7913,0.3573,0.3849,0.4993,2016
6,1,ALABAMA,AL,1001,Autauga,1001020802,"Census Tract 208.02, Autauga County, Alabama",11173.0,4272.0,4048.0,161.5,0.4345,0.81585,0.3703,0.46105,0.51265,2017
9,1,ALABAMA,AL,1001,Autauga,1001020802,"Census Tract 208.02, Autauga County, Alabama",11603.0,4288.0,4099.0,111.0,0.3839,0.8404,0.3833,0.5372,0.526,2018


57425


Unnamed: 0,ST,STATE,ST_ABBR,STCNTY,COUNTY,FIPS,LOCATION,E_TOTPOP,E_HU,E_HH,E_UNEMP,RPL_THEME1,RPL_THEME2,RPL_THEME3,RPL_THEME4,RPL_THEMES,YEAR
3,1,ALABAMA,AL,1001,Autauga,1001020100,"Census Tract 201, Autauga County, Alabama",1900.0,714.0,688.0,48.0,0.4399,0.3403,0.3134,0.3634,0.3466,2014
0,1,ALABAMA,AL,1001,Autauga,1001020100,"Census Tract 201, Autauga County, Alabama",1955.0,732.5,714.0,45.5,0.4142,0.2879,0.3469,0.23455,0.2692,2015
4,1,ALABAMA,AL,1001,Autauga,1001020100,"Census Tract 201, Autauga County, Alabama",2010.0,751.0,740.0,43.0,0.3885,0.2355,0.3804,0.1057,0.1918,2016
1,1,ALABAMA,AL,1001,Autauga,1001020100,"Census Tract 201, Autauga County, Alabama",1966.5,765.0,752.5,39.0,0.3899,0.3976,0.37695,0.1025,0.22255,2017
5,1,ALABAMA,AL,1001,Autauga,1001020100,"Census Tract 201, Autauga County, Alabama",1923.0,779.0,765.0,35.0,0.3913,0.5597,0.3735,0.0993,0.2533,2018
2,1,ALABAMA,AL,1001,Autauga,1001020100,"Census Tract 201, Autauga County, Alabama",1932.0,744.5,729.0,26.5,0.42455,0.5338,0.3828,0.0969,0.2678,2019
6,1,ALABAMA,AL,1001,Autauga,1001020100,"Census Tract 201, Autauga County, Alabama",1941.0,710.0,693.0,18.0,0.4578,0.5079,0.3921,0.0945,0.2823,2020
10,1,ALABAMA,AL,1001,Autauga,1001020200,"Census Tract 202, Autauga County, Alabama",2342.0,855.0,797.0,166.0,0.8152,0.3436,0.684,0.8143,0.7777,2014
7,1,ALABAMA,AL,1001,Autauga,1001020200,"Census Tract 202, Autauga County, Alabama",2269.0,872.5,817.5,119.0,0.79325,0.48505,0.64065,0.83655,0.794,2015
11,1,ALABAMA,AL,1001,Autauga,1001020200,"Census Tract 202, Autauga County, Alabama",2196.0,890.0,838.0,72.0,0.7713,0.6265,0.5973,0.8588,0.8103,2016


429422


# OCO-2 Data

### Reading in previously created df and dropping columns we don't need


In [20]:
df_xco2= pd.read_csv(r"C:\Users\ddrye\OneDrive\Documents\OMSA_Program\OMSA 2023\Summer2023\Practicum\off_git\data\OCO2_BASE_2014-2023_V1.csv")
df_xco2.drop(['Unnamed: 0', 'geoid'], axis=1, inplace=True)
display(df_xco2.head(3))

Unnamed: 0,county_name,state_name,DateTime,Year,Month,Day,Latitude,Longitude,xco2,xco2_quality_flag
0,Moore,North Carolina,2014-09-06 18:30:51.370,2014,9,6,35.10113,-79.46456,388.31067,1
1,Moore,North Carolina,2014-09-06 18:30:51.730,2014,9,6,35.141167,-79.51347,385.0724,1
2,Moore,North Carolina,2014-09-06 18:30:52.080,2014,9,6,35.135918,-79.46529,388.94995,1


### Assigning census label to each oco2 reading
Tract shapefile obtained from here: https://www.census.gov/geographies/mapping-files/time-series/geo/cartographic-boundary.html

In [21]:
cens_tracts = gpd.GeoDataFrame.from_file(r"C:\Users\ddrye\OneDrive\Documents\OMSA_Program\OMSA 2023\Summer2023\Practicum\off_git\cb_2020_us_tract_500k\cb_2020_us_tract_500k.shp")
df_xco2_tract=df_xco2
df_xco2_tract['coords'] = list(zip(df_xco2_tract['Longitude'],df_xco2_tract['Latitude']))
df_xco2_tract['coords'] = df_xco2_tract['coords'].apply(Point)
points = gpd.GeoDataFrame(df_xco2_tract, geometry='coords', crs=cens_tracts.crs)
df_xco2_tract = gpd.tools.sjoin(points, cens_tracts, predicate="within", how='left')
display(df_xco2_tract.head(3))

Unnamed: 0,county_name,state_name,DateTime,Year,Month,Day,Latitude,Longitude,xco2,xco2_quality_flag,coords,index_right,STATEFP,COUNTYFP,TRACTCE,AFFGEOID,GEOID,NAME,NAMELSAD,STUSPS,NAMELSADCO,STATE_NAME,LSAD,ALAND,AWATER
0,Moore,North Carolina,2014-09-06 18:30:51.370,2014,9,6,35.10113,-79.46456,388.31067,1,POINT (-79.46456 35.10113),71393.0,37,125,951200,1400000US37125951200,37125951200,9512.0,Census Tract 9512,NC,Moore County,North Carolina,CT,119041601.0,339457.0
1,Moore,North Carolina,2014-09-06 18:30:51.730,2014,9,6,35.141167,-79.51347,385.0724,1,POINT (-79.51347 35.14117),71393.0,37,125,951200,1400000US37125951200,37125951200,9512.0,Census Tract 9512,NC,Moore County,North Carolina,CT,119041601.0,339457.0
2,Moore,North Carolina,2014-09-06 18:30:52.080,2014,9,6,35.135918,-79.46529,388.94995,1,POINT (-79.46529 35.13592),35729.0,37,125,951101,1400000US37125951101,37125951101,9511.01,Census Tract 9511.01,NC,Moore County,North Carolina,CT,28421525.0,345471.0


### Transforming oco2 data for all available years

In [22]:
# Many of the readings are over the ocean or outside of the US - we're going to drop these
df_xco2_tract_filter=df_xco2_tract.dropna()

#dropping values with bad quality flag
df_xco2_tract_filter=df_xco2_tract_filter[df_xco2_tract_filter['xco2_quality_flag']==0]

#restrict to 2015-2020 (note: the strict '> 2014' excludes 2014 even though SVI covers it)
df_xco2_tract_filter = df_xco2_tract_filter.loc[(df_xco2_tract_filter['Year'] <2021) & (df_xco2_tract_filter['Year'] > 2014)]

#calc new fields
counts=df_xco2_tract_filter.groupby(['GEOID','Year'], as_index=False)["xco2"].size().rename(columns={'size':'readings_count'})
mean=df_xco2_tract_filter.groupby(['GEOID','Year'], as_index=False)["xco2"].mean().rename(columns={'xco2':'avg_xco2'})
std_deviation=df_xco2_tract_filter.groupby(['GEOID','Year'], as_index=False)["xco2"].std().rename(columns={'xco2':'stddev_xco2'})

intermediate_df=pd.merge(counts, mean, on=['GEOID','Year'])
intermediate_df=pd.merge(intermediate_df, std_deviation, on=['GEOID','Year'])

#ratio of each year's mean to the previous available year's mean for the same tract (pct_change + 1)
intermediate_df['pct_change'] = intermediate_df.groupby('GEOID')['avg_xco2'].pct_change() + 1

#change vs. the previous available year, and the running total of those changes
intermediate_df["delta"] = intermediate_df.groupby('GEOID')['avg_xco2'].diff()
intermediate_df["cum_delta"] = intermediate_df.groupby('GEOID')['delta'].cumsum()

labels=df_xco2_tract_filter.groupby(['state_name','county_name','GEOID','Year'], as_index=False).size()
xco2_df_tract_w_vars=pd.merge(intermediate_df, labels, on=['GEOID','Year'], how="left")
xco2_df_tract_w_vars.drop(['size'], axis=1, inplace=True)
xco2_df_tract_w_vars.rename(columns={'Year':'YEAR'}, inplace=True)
xco2_df_tract_w_vars['YEAR']=xco2_df_tract_w_vars['YEAR'].astype(str)

display(xco2_df_tract_w_vars.sort_values(by=['GEOID','YEAR']).head(10))

Unnamed: 0,GEOID,YEAR,readings_count,avg_xco2,stddev_xco2,pct_change,delta,cum_delta,state_name,county_name
0,1001020100,2015,4,395.127552,0.424238,,,,Alabama,Autauga
1,1001020100,2018,5,407.772954,1.080408,1.032003,12.645402,12.645402,Alabama,Autauga
2,1001020100,2019,5,410.868384,0.580349,1.007591,3.09543,15.740832,Alabama,Autauga
3,1001020200,2015,1,393.92688,,,,,Alabama,Autauga
4,1001020200,2018,3,407.703827,0.54323,1.034973,13.776947,13.776947,Alabama,Autauga
5,1001020300,2018,15,409.356357,1.569522,,,,Alabama,Autauga
6,1001020300,2019,5,409.869212,0.270231,1.001253,0.512855,0.512855,Alabama,Autauga
7,1001020400,2018,13,409.474572,0.788158,,,,Alabama,Autauga
8,1001020400,2019,8,409.846386,0.899963,1.000908,0.371814,0.371814,Alabama,Autauga
9,1001020501,2016,5,404.97372,1.156374,,,,Alabama,Autauga
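
To make the derived fields easier to read, here is a small sketch of the same per-tract, per-year summary on made-up readings. `readings` and `yearly` are hypothetical names and the xco2 values are invented; the only point is to show what `pct_change`, `delta` and `cum_delta` mean when a tract skips years.

```python
import pandas as pd

# Made-up readings: tract A has data in 2015 and 2018 only, tract B a single reading
readings = pd.DataFrame({
    "GEOID": ["A", "A", "A", "A", "B"],
    "Year": [2015, 2015, 2018, 2018, 2015],
    "xco2": [395.0, 396.0, 407.0, 409.0, 394.0],
})

# Count, mean and standard deviation per tract and year in one pass
yearly = (
    readings.groupby(["GEOID", "Year"], as_index=False)
    .agg(readings_count=("xco2", "size"),
         avg_xco2=("xco2", "mean"),
         stddev_xco2=("xco2", "std"))
)

by_tract = yearly.groupby("GEOID")["avg_xco2"]
# Ratio to the previous available year (not necessarily the previous calendar year)
yearly["pct_change"] = by_tract.pct_change() + 1
# Difference to the previous available year, and its running total
yearly["delta"] = by_tract.diff()
yearly["cum_delta"] = yearly.groupby("GEOID")["delta"].cumsum()
print(yearly)
```

Two things to keep in mind when reading the real table: a tract with a single reading in a year has no standard deviation (NaN), and `delta` for a row like 2018 after 2015 is a three-year change, not a year-over-year one.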


### Getting an idea of yearly readings

In [23]:
for i in range(2015,2021):
    i=str(i)
    print(i,'num rows:',len(xco2_df_tract_w_vars[xco2_df_tract_w_vars['YEAR']==i]))

2015 num rows: 21767
2016 num rows: 20588
2017 num rows: 16982
2018 num rows: 20741
2019 num rows: 19355
2020 num rows: 17709


# Joining SVI Data and OCO-2 Data on FIPS code (GEOID) and Year

In [24]:
print(xco2_df_tract_w_vars.dtypes)
print(svi_df_tract_w_imputed_2014_2018.dtypes)
print(svi_df_tract_w_imputed_2014_2020.dtypes)
display(xco2_df_tract_w_vars.head())
display(svi_df_tract_w_imputed_2014_2020.head())

GEOID              object
YEAR               object
readings_count      int64
avg_xco2          float64
stddev_xco2       float64
pct_change        float64
delta             float64
cum_delta         float64
state_name         object
county_name        object
dtype: object
ST             object
STATE          object
ST_ABBR        object
STCNTY         object
COUNTY         object
FIPS           object
LOCATION       object
E_TOTPOP      float64
E_HU          float64
E_HH          float64
E_UNEMP       float64
RPL_THEME1    float64
RPL_THEME2    float64
RPL_THEME3    float64
RPL_THEME4    float64
RPL_THEMES    float64
YEAR           object
dtype: object
ST             object
STATE          object
ST_ABBR        object
STCNTY         object
COUNTY         object
FIPS           object
LOCATION       object
E_TOTPOP      float64
E_HU          float64
E_HH          float64
E_UNEMP       float64
RPL_THEME1    float64
RPL_THEME2    float64
RPL_THEME3    float64
RPL_THEME4    float64
RPL_THEM

Unnamed: 0,GEOID,YEAR,readings_count,avg_xco2,stddev_xco2,pct_change,delta,cum_delta,state_name,county_name
0,1001020100,2015,4,395.127552,0.424238,,,,Alabama,Autauga
1,1001020100,2018,5,407.772954,1.080408,1.032003,12.645402,12.645402,Alabama,Autauga
2,1001020100,2019,5,410.868384,0.580349,1.007591,3.09543,15.740832,Alabama,Autauga
3,1001020200,2015,1,393.92688,,,,,Alabama,Autauga
4,1001020200,2018,3,407.703827,0.54323,1.034973,13.776947,13.776947,Alabama,Autauga


Unnamed: 0,ST,STATE,ST_ABBR,STCNTY,COUNTY,FIPS,LOCATION,E_TOTPOP,E_HU,E_HH,E_UNEMP,RPL_THEME1,RPL_THEME2,RPL_THEME3,RPL_THEME4,RPL_THEMES,YEAR
3,1,ALABAMA,AL,1001,Autauga,1001020100,"Census Tract 201, Autauga County, Alabama",1900.0,714.0,688.0,48.0,0.4399,0.3403,0.3134,0.3634,0.3466,2014
0,1,ALABAMA,AL,1001,Autauga,1001020100,"Census Tract 201, Autauga County, Alabama",1955.0,732.5,714.0,45.5,0.4142,0.2879,0.3469,0.23455,0.2692,2015
4,1,ALABAMA,AL,1001,Autauga,1001020100,"Census Tract 201, Autauga County, Alabama",2010.0,751.0,740.0,43.0,0.3885,0.2355,0.3804,0.1057,0.1918,2016
1,1,ALABAMA,AL,1001,Autauga,1001020100,"Census Tract 201, Autauga County, Alabama",1966.5,765.0,752.5,39.0,0.3899,0.3976,0.37695,0.1025,0.22255,2017
5,1,ALABAMA,AL,1001,Autauga,1001020100,"Census Tract 201, Autauga County, Alabama",1923.0,779.0,765.0,35.0,0.3913,0.5597,0.3735,0.0993,0.2533,2018


In [36]:
combined_2014_2018=pd.merge(svi_df_tract_w_imputed_2014_2018,xco2_df_tract_w_vars,left_on=['FIPS','YEAR'],right_on=['GEOID','YEAR'], how='left')
combined_2014_2020=pd.merge(svi_df_tract_w_imputed_2014_2020,xco2_df_tract_w_vars,left_on=['FIPS','YEAR'],right_on=['GEOID','YEAR'], how='left')

KeyError: ['GEOID', 'YEAR']

In [35]:
#dropna() returns a new frame, so combined_2014_2018 itself is left untouched
test=combined_2014_2018.dropna()
print(test)

Empty DataFrame
Columns: [ST, STATE, ST_ABBR, STCNTY, COUNTY, FIPS, LOCATION, E_TOTPOP, E_HU, E_HH, E_UNEMP, RPL_THEME1, RPL_THEME2, RPL_THEME3, RPL_THEME4, RPL_THEMES, YEAR, GEOID, readings_count, avg_xco2, stddev_xco2, pct_change, delta, cum_delta, state_name, county_name]
Index: []
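
An empty result after `dropna()` means no SVI row found a matching OCO-2 row. One possible cause (an assumption, the notebook does not confirm it) is that the join keys look alike but are stored differently: both print as `object`, which can hide integers on one side and strings on the other, or a leading zero dropped when a CSV column was parsed as a number. A tract GEOID is 11 digits (2 state + 3 county + 6 tract), so a sketch of a check is to normalize both keys to zero-padded strings and merge with `indicator=True`. `normalize_tract_id` and the toy frames are hypothetical.

```python
import pandas as pd

# Toy frames: FIPS stored as integers (leading zero lost), GEOID as 11-char strings
svi = pd.DataFrame({
    "FIPS": [1001020100, 1001020200],
    "YEAR": ["2015", "2015"],
    "RPL_THEMES": [0.27, 0.79],
})
xco2 = pd.DataFrame({
    "GEOID": ["01001020100"],
    "YEAR": ["2015"],
    "avg_xco2": [395.13],
})

def normalize_tract_id(series):
    """Cast tract IDs to 11-character, zero-padded strings."""
    return series.astype(str).str.strip().str.zfill(11)

svi["FIPS"] = normalize_tract_id(svi["FIPS"])
xco2["GEOID"] = normalize_tract_id(xco2["GEOID"])

# indicator=True adds a _merge column showing which rows found a partner
merged = svi.merge(xco2, left_on=["FIPS", "YEAR"],
                   right_on=["GEOID", "YEAR"], how="left", indicator=True)
print(merged["_merge"].value_counts())
```

If `_merge` is `left_only` for every row on the real data even after normalizing, the mismatch is somewhere else (for example in YEAR) and is worth inspecting with a few sample keys from each side.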


In [29]:
combined_2014_2018.to_csv('svi_oco2_combined_2014-2018.csv',index=False)
combined_2014_2020.to_csv('svi_oco2_combined_2014-2020.csv',index=False)

