# Basic Data Structure Observations
---
## Purpose:
In this notebook we're going to do our first exploration of the EPA, USDA, and Redfin datasets. Eventually, we will combine these three datasets to identify correlations between pollution, indicators of regional wealth, and urbanization levels. To start though, it's important to evaluate each dataset and the potential issues that could arise from using them. These issues could include:
* **Ethical concerns** <br> Such as how the data was collected, whether and how consent was obtained, how the data will be used, and how it could cause harm or shift the balance of power.  <br><br>
* **Technical concerns**  <br> Such as the magnitude of the data, its structure, its reliability, and how missing data should be treated.     

As we explore the data, a key ethical concern is how the findings could be used to enact legislative change or shift public opinion about particular groups. As such, <u>we strictly prohibit the aggregation of sensitive population data</u> (race, religion, age, etc.) with this study. Likewise, this study should not be used to endorse proposed pollution legislation without substantial supporting evidence from peer-reviewed sources. You should refer to the [*Terms of Use* clause](https://github.com/MDJonesBYU/Wealth_and_Pollution_Study/blob/main/Terms_of_Use) to understand how this work should and should not be used. Before citing or using this research, you must agree to adhere to our ToU and [copyright limitations](https://github.com/MDJonesBYU/Wealth_and_Pollution_Study/blob/main/License).  

### Package Installation and Versioning Requirements:
For questions regarding python version, package installations, and other functional requirements, see the *Read Me* file contained [here](https://github.com/MDJonesBYU/Wealth_and_Pollution_Study/blob/main/Read_me/Read_me.txt).

Now, let's review the data structure we have.

In [1]:
# Import necessary packages: 
import pandas as pd


# Create a function to load the data
def load_base_data(): 
    """Load the raw EPA, USDA, and Redfin datasets and return them as a tuple of DataFrames."""
    # Read the FIPS codes as strings rather than integers
    df_emissions = pd.read_json("data/nei.json", dtype={"COUNTY FIPS": str, "STATE FIPS": str})
    # The USDA column names sit on the fifth row of the sheet (header=4 is zero-based)
    df_USDA = pd.read_excel("data/Unemployment.xlsx", header=4)
    # The Redfin file is tab-separated
    df_Redfin = pd.read_csv("data/county_market_tracker.tsv000", sep="\t")
    return df_emissions, df_USDA, df_Redfin


# Add a function to count missing data in the datasets (if any)
def check_nan(df): 
    """Checking for null values"""
    return(df.isnull().sum())

# Get the data
df_emissions, df_USDA, df_Redfin = load_base_data()

# Highlight missing data if any
print(check_nan(df_emissions))

# View column types and non-null counts -- starting with the emission data
df_emissions.info()

STATE              0
STATE FIPS         0
COUNTY             0
SECTOR             0
COUNTY FIPS        0
POLLUTANT          0
POLLUTANT TYPE     0
EMISSIONS          0
UNIT OF MEASURE    0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 101010 entries, 0 to 101009
Data columns (total 9 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   STATE            101010 non-null  object
 1   STATE FIPS       101010 non-null  object
 2   COUNTY           101010 non-null  object
 3   SECTOR           101010 non-null  object
 4   COUNTY FIPS      101010 non-null  object
 5   POLLUTANT        101010 non-null  object
 6   POLLUTANT TYPE   101010 non-null  object
 7   EMISSIONS        101010 non-null  int64 
 8   UNIT OF MEASURE  101010 non-null  object
dtypes: int64(1), object(8)
memory usage: 6.9+ MB


In [2]:
# Let's also take a sample to see the data contents
df_emissions.sample(4)


Unnamed: 0,STATE,STATE FIPS,COUNTY,SECTOR,COUNTY FIPS,POLLUTANT,POLLUTANT TYPE,EMISSIONS,UNIT OF MEASURE
76984,SD,46,Kingsbury,Fuel Comb - Comm/Institutional - Natural Gas,77,Carbon Monoxide,CAP,1,TON
68424,IA,19,Chickasaw,"Fuel Comb - Industrial Boilers, ICEs - Oil",37,Carbon Monoxide,CAP,2,TON
42786,VA,51,King George,Mobile - On-Road Diesel Heavy Duty Vehicles,99,Carbon Monoxide,CAP,24,TON
34563,NC,37,Columbus,Mobile - Non-Road Equipment - Other,47,Carbon Monoxide,CAP,49,TON


In [3]:
# Okay, so we see 9 columns, including unique IDs (FIPS codes) for the state and county in the emissions dataset. 
# Emissions are separated by sector, which could be useful for probing differences across pollution sources. We 
# also have the county name and state abbreviation for each emission source. Since we extracted the data
# directly from the EPA's National Emissions Inventory (NEI), we know carbon monoxide is the only pollutant 
# considered, and it's always in units of U.S. tons. So we can ignore these columns in the future. 

# Now, the dataframe has no missing values, but that doesn't mean the dataframe contains all 
# counties in the US. We'll check on that for all datasets shortly. Also, while it's not explicitly stated, 
# this is 2020 emission data only. This will be a problem when we pair it with the USDA data. To explain, 
# let's go ahead and explore that data. 
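
The two observations above (some columns hold a single value, and the state and county IDs are stored separately) can be sketched on a few toy rows. This is only an illustration under stated assumptions: the rows are made up to mirror the sample, the helper names `constant_columns` and `add_full_fips` are hypothetical, and we assume the FIPS columns are strings without leading zeros, as the sample output suggests.

```python
import pandas as pd

# Toy rows mimicking the emissions columns shown above (values are illustrative only).
toy_emissions = pd.DataFrame({
    "STATE": ["SD", "IA", "AL"],
    "STATE FIPS": ["46", "19", "1"],
    "COUNTY FIPS": ["77", "37", "1"],
    "POLLUTANT": ["Carbon Monoxide"] * 3,
    "UNIT OF MEASURE": ["TON"] * 3,
    "EMISSIONS": [1, 2, 5],
})


def constant_columns(df):
    """Return the names of columns that hold a single unique value (hypothetical helper)."""
    return [col for col in df.columns if df[col].nunique(dropna=False) == 1]


def add_full_fips(df):
    """Add a 5-digit 'FIPS' column: 2-digit state code + 3-digit county code (hypothetical helper)."""
    out = df.copy()
    # Assumption: both codes are strings that may have lost their leading zeros.
    out["FIPS"] = out["STATE FIPS"].str.zfill(2) + out["COUNTY FIPS"].str.zfill(3)
    return out


droppable = constant_columns(toy_emissions)
toy_with_fips = add_full_fips(toy_emissions)
n_counties = toy_with_fips["FIPS"].nunique()
```

On the real data, `n_counties` would give a first answer to whether every US county is represented, and a combined 5-digit code would be comparable to the USDA `FIPS_Code` column once that column is zero-padded to a string.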

In [4]:
# Checking the USDA data: 
print(check_nan(df_USDA))

# View column types and non-null counts for the USDA data
df_USDA.info()

FIPS_Code                                     0
State                                         0
Area_Name                                     0
Rural_Urban_Continuum_Code_2013              58
Urban_Influence_Code_2013                    58
                                             ..
Employed_2022                                 4
Unemployed_2022                               4
Unemployment_rate_2022                        4
Median_Household_Income_2021                 83
Med_HH_Income_Percent_of_State_Total_2021    84
Length: 100, dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3277 entries, 0 to 3276
Data columns (total 100 columns):
 #   Column                                     Non-Null Count  Dtype  
---  ------                                     --------------  -----  
 0   FIPS_Code                                  3277 non-null   int64  
 1   State                                      3277 non-null   object 
 2   Area_Name                                  32

In [5]:
df_USDA.sample(3)

Unnamed: 0,FIPS_Code,State,Area_Name,Rural_Urban_Continuum_Code_2013,Urban_Influence_Code_2013,Metro_2013,Civilian_labor_force_2000,Employed_2000,Unemployed_2000,Unemployment_rate_2000,...,Civilian_labor_force_2021,Employed_2021,Unemployed_2021,Unemployment_rate_2021,Civilian_labor_force_2022,Employed_2022,Unemployed_2022,Unemployment_rate_2022,Median_Household_Income_2021,Med_HH_Income_Percent_of_State_Total_2021
783,18133,IN,"Putnam County, IN",1.0,1.0,1.0,17100.0,16609.0,491.0,2.9,...,16514.0,15960.0,554.0,3.4,16999.0,16475.0,524.0,3.1,61223.0,97.6
3197,56045,WY,"Weston County, WY",7.0,9.0,0.0,3284.0,3147.0,137.0,4.2,...,3718.0,3595.0,123.0,3.3,3792.0,3691.0,101.0,2.7,62509.0,94.0
406,13005,GA,"Bacon County, GA",7.0,9.0,0.0,4641.0,4431.0,210.0,4.5,...,4830.0,4673.0,157.0,3.3,4879.0,4736.0,143.0,2.9,43154.0,64.9


In [6]:
# So there's a lot to unpack here. First, we have lots of columns that repeat the same variables across different years. 
# There are some really useful features in this dataset, including income, unemployment rates, labor force
# size, location, and the rural/urban designation code (Continuum Code). 

# After a few minutes, some issues should be apparent though. For one, we don't have income data for 2020, and
# the last rural-urban continuum code record (a tag for how rural or urban a county is) was taken in 2013. These 
# variables could really add depth to our analysis. 

# Since we could not find readily available data to compensate here, we plan to use both of these variables, but
# this rests on some pretty big assumptions: 
# 1. We assume income changes between 2020 and 2021 are negligible down to the county level. For some counties this 
#    could be vastly different from reality given the COVID pandemic's impact on local economies. 
# 2. We assume the rural-urban continuum code is reasonably close to what it was in 2013. Again, this could be an 
#    issue, as some regions saw substantial growth during this period (like Austin, TX), while other areas saw 
#    significant out-migration (like Edenville, MI, where a dam failure forced residents to leave).
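
To make the plan above concrete, here is a minimal sketch of narrowing the wide USDA table to the 2013 continuum code and the 2021 income and unemployment columns. The column names come from the output above; the toy rows, the `KEEP` list, the helper `select_usda_columns`, and the zero-padded `FIPS` string column are illustrative assumptions, not the notebook's actual cleaning step.

```python
import pandas as pd

# Toy rows using column names from the USDA output above.
# The first two rows echo the sample; the third is made up to show missing values.
toy_usda = pd.DataFrame({
    "FIPS_Code": [18133, 56045, 1001],
    "Area_Name": ["Putnam County, IN", "Weston County, WY", "Hypothetical County"],
    "Rural_Urban_Continuum_Code_2013": [1.0, 7.0, None],
    "Employed_2000": [16609.0, 3147.0, 100.0],
    "Unemployment_rate_2021": [3.4, 3.3, 4.0],
    "Median_Household_Income_2021": [61223.0, 62509.0, None],
})

# Hypothetical choice of columns to carry forward.
KEEP = [
    "FIPS_Code",
    "Area_Name",
    "Rural_Urban_Continuum_Code_2013",
    "Unemployment_rate_2021",
    "Median_Household_Income_2021",
]


def select_usda_columns(df, columns=KEEP):
    """Keep only the chosen columns and add a zero-padded 5-character FIPS string."""
    out = df.loc[:, columns].copy()
    # FIPS_Code is stored as int64, so leading zeros (e.g. 01001) are lost.
    out["FIPS"] = out["FIPS_Code"].astype(str).str.zfill(5)
    return out


usda_small = select_usda_columns(toy_usda)

# Rows with any missing value among the kept columns.
incomplete = usda_small[usda_small.isna().any(axis=1)]
```

Inspecting `incomplete` on the real data would show which rows lack a continuum code or a 2021 income value before either assumption is relied on.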


In [7]:
# Before we dive further into this, let's review the Redfin dataset

# Checking the Redfin data: 
print(check_nan(df_Redfin))

# View column types and non-null counts for the Redfin data
df_Redfin.info()

period_begin                           0
period_end                             0
period_duration                        0
region_type                            0
region_type_id                         0
table_id                               0
is_seasonally_adjusted                 0
region                                 0
city                              563122
state                                  0
state_code                             0
property_type                          0
property_type_id                       0
median_sale_price                    685
median_sale_price_mom              52383
median_sale_price_yoy              69332
median_list_price                  43920
median_list_price_mom              81940
median_list_price_yoy              98455
median_ppsf                         7395
median_ppsf_mom                    58352
median_ppsf_yoy                    75335
median_list_ppsf                   44644
median_list_ppsf_mom               82833
median_list_ppsf

In [8]:
df_Redfin.sample(3)

Unnamed: 0,period_begin,period_end,period_duration,region_type,region_type_id,table_id,is_seasonally_adjusted,region,city,state,...,sold_above_list_yoy,price_drops,price_drops_mom,price_drops_yoy,off_market_in_two_weeks,off_market_in_two_weeks_mom,off_market_in_two_weeks_yoy,parent_metro_region,parent_metro_region_metro_code,last_updated
499792,1/1/2017,1/31/2017,30,county,5,1300,f,"Knox County, ME",,Maine,...,0.023256,0.11,0.038571,0.030078,0.133333,0.008333,-0.097436,Maine nonmetropolitan area,,1/9/2022 14:29
552999,7/1/2018,7/31/2018,30,county,5,2744,f,"Hays County, TX",,Texas,...,0.257143,,,,0.5,0.0,-0.214286,"Austin, TX",12420.0,1/9/2022 14:29
57910,8/1/2015,8/31/2015,30,county,5,3170,f,"Calumet County, WI",,Wisconsin,...,,,,,,,,"Appleton, WI",11540.0,1/9/2022 14:29


Okay, so at a high level we see that the Redfin dataset provides information on sale price, property type, 
location, number of listings, and the period when the sale occurred. Notably, the state name is not 
abbreviated and we aren't given any FIPS codes to connect to the other datasets, so that's going to be 
an issue down the road. We also only want 2021 data (since our income data is from the same time period, 
and we assume emissions are the same in 2020 and 2021). 
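
As a rough sketch of those two points, the snippet below keeps only 2021 periods and splits the `region` string into a county name and a state abbreviation. It assumes `period_begin` follows the month/day/year pattern seen in the sample and that `region` always looks like "Name County, ST". The toy rows (including the dates and prices) and the helper `prepare_redfin` are made up for illustration.

```python
import pandas as pd

# Toy rows using column names from the Redfin sample above (dates and prices are made up).
toy_redfin = pd.DataFrame({
    "period_begin": ["1/1/2017", "7/1/2021", "8/1/2021"],
    "region": ["Knox County, ME", "Hays County, TX", "Calumet County, WI"],
    "state": ["Maine", "Texas", "Wisconsin"],
    "median_sale_price": [180000.0, 350000.0, None],
})


def prepare_redfin(df, year=2021):
    """Keep rows whose period begins in `year` and split `region` into two columns."""
    out = df.copy()
    # Assumption: dates are month/day/year strings, as in the sample output.
    out["period_begin"] = pd.to_datetime(out["period_begin"], format="%m/%d/%Y")
    out = out[out["period_begin"].dt.year == year].copy()
    # Split on the last ", " so that the trailing state abbreviation is isolated.
    parts = out["region"].str.rsplit(pat=", ", n=1, expand=True)
    out["county_name"] = parts[0]
    out["state_abbr"] = parts[1]
    return out


redfin_2021 = prepare_redfin(toy_redfin)
```

In the samples shown, `region` follows the same "County, ST" pattern as `Area_Name` in the USDA data, which may offer a way to attach FIPS codes. That is an observation from a handful of rows, not something verified across the full datasets.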

Now that we've scratched the surface, we'd like to see if we can make more sense of the data graphically, so we can understand how factors like income, pollution, unemployment, and sale prices are distributed. 

To do this graphically, we need to do some light data cleaning and grouping, which is covered in the next notebook. 

### End of Notebook

Next notebook: Data_manipulation


*Note: to limit the number of functions duplicated, all codebook functions will be saved in py files that can be imported to execute.*

---
