# Setting up Spark

In [1]:
from pyspark.sql import SparkSession

# spark = SparkSession.builder.getOrCreate()

# only use 4 cores
spark = SparkSession.builder.master("local[4]").getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/30 20:42:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
spark

In [3]:
spark.sparkContext.defaultParallelism

4

In [None]:
# spark.stop()

# Preparing NO2 Data

For NO2 data, I will be using OMI NO2 from NASA. This dataset is daily, at roughly 0.1° × 0.1° resolution. The OMI instrument is carried on a satellite that orbits the Earth 14 to 15 times a day and measures NO2 vertical column density. Vertical column density (VCD) is the total amount of NO2 molecules in a column from the Earth's surface to the top of the atmosphere. NO2 pollution in tons, by contrast, would be calculated from surface concentrations (NO2 molecules present at the Earth's surface), which can be measured with ground-based sensors from EPA or PANDORA.

Even though the data I am using are vertical columns, they are still a very good indicator of NO2 pollution. Because the data is satellite-based, we can do global-level analysis and evaluation. Ground-based sensors, on the other hand, are scarce, and it is not possible to place them throughout the world. That is where satellite-based evaluation comes in handy.
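
As a rough illustration of what a vertical column density is, the sketch below sums NO2 number density over stacked atmospheric layers. The layer values are hypothetical and chosen only for illustration; they are not OMI measurements, and this is not how the OMI retrieval itself works.

```python
# Hypothetical NO2 number densities (molecules per cm^3) for three stacked
# layers, ordered from the surface upward. Illustrative values, not OMI data.
layer_density = [2.0e10, 5.0e9, 1.0e9]

# Thickness of each layer in cm (1 km = 1e5 cm).
layer_thickness_cm = [1.0e5, 2.0e5, 7.0e5]


def vertical_column_density(densities, thicknesses_cm):
    """Sum density * thickness over all layers (molecules per cm^2)."""
    if len(densities) != len(thicknesses_cm):
        raise ValueError("Each layer needs both a density and a thickness")
    return sum(n * dz for n, dz in zip(densities, thicknesses_cm))


vcd = vertical_column_density(layer_density, layer_thickness_cm)
print(f"VCD = {vcd:.2e} molecules/cm^2")
```

The point of the sketch is that a VCD collapses the whole column into one number per pixel, which is why it cannot be read directly as a surface concentration.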

<b>Higher OMI NO2 VCD indicates more NO2 emissions from fuel burning (cars), industry, etc. Furthermore, NO2 irritates the airways in our lungs and nose, causing inflammation and swelling, and is one of the top pollutants causing respiratory diseases such as asthma.</b>

I will be using the `OMI_MINDS_NO2` dataset from NASA. You can find more about it at the following link: https://disc.gsfc.nasa.gov/datasets/OMI_MINDS_NO2_1.1/summary?keywords=OMI_MINDS_NO2_1.1


This data is also available on S3. I will be reading the files directly from S3. 

In [17]:
import os
import earthaccess
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

def get_earthdata_credentials():
    """
    Authenticate with NASA Earthdata and retrieve temporary S3 credentials for the LAADS DAAC.

    This function checks that the environment variables `EARTHDATA_USERNAME` and
    `EARTHDATA_PASSWORD` are set, logs in to Earthdata using the `earthaccess` library,
    and retrieves temporary AWS S3 credentials.

    Returns:
        dict: A dictionary containing the S3 credentials (access key, secret key, and session token).

    Raises:
        ValueError: If the Earthdata username or password is not set in the environment variables.
    """

    username = os.getenv("EARTHDATA_USERNAME")
    password = os.getenv("EARTHDATA_PASSWORD")

    if not username or not password:
        raise ValueError("Missing Earthdata credentials")

    # Log in using the environment variables checked above.
    earthaccess.login(strategy="environment")

    # Errors from login or credential retrieval propagate to the caller unchanged.
    # NOTE: the OMI_MINDS_NO2 link above points to GES DISC; if LAADS credentials
    # are rejected for those buckets, daac="GES_DISC" may be the value needed.
    return earthaccess.get_s3_credentials(daac="LAADS")

In [18]:
s3_creds = get_earthdata_credentials()
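
To read files from S3, the temporary credentials have to be handed to an S3 client. The sketch below assumes the returned dictionary uses the keys `accessKeyId`, `secretAccessKey`, and `sessionToken`, and that `s3fs` is the client; the helper name `s3fs_kwargs_from_earthdata` is hypothetical and not part of `earthaccess`.

```python
def s3fs_kwargs_from_earthdata(creds):
    """Map an Earthdata temporary-credential dict onto s3fs.S3FileSystem keyword arguments.

    Assumes the dict carries 'accessKeyId', 'secretAccessKey' and 'sessionToken'.
    """
    missing = [k for k in ("accessKeyId", "secretAccessKey", "sessionToken") if k not in creds]
    if missing:
        raise KeyError(f"Credential dict is missing keys: {missing}")
    return {
        "key": creds["accessKeyId"],
        "secret": creds["secretAccessKey"],
        "token": creds["sessionToken"],
    }


# Usage (needs the s3fs package and network access, so it is left commented out):
# import s3fs
# fs = s3fs.S3FileSystem(**s3fs_kwargs_from_earthdata(s3_creds))
```

These credentials are temporary, so a long-running job would need to request fresh ones once they expire.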

# Getting Raw Datasets

We can get the county-level dataset from CDC's PLACES:

1. 2022: https://data.cdc.gov/500-Cities-Places/PLACES-County-Data-GIS-Friendly-Format-2024-releas/i46a-9kgh/about_data
2. 2021: https://data.cdc.gov/500-Cities-Places/PLACES-County-Data-GIS-Friendly-Format-2023-releas/7cmc-7y5g/about_data
3. 2020: https://data.cdc.gov/500-Cities-Places/PLACES-County-Data-GIS-Friendly-Format-2022-releas/xyst-f73f/about_data
4. 2019: https://data.cdc.gov/500-Cities-Places/PLACES-County-Data-GIS-Friendly-Format-2021-releas/kmvs-jkvx/about_data
5. 2018: https://data.cdc.gov/500-Cities-Places/PLACES-County-Data-GIS-Friendly-Format-2020-releas/mssc-ksj7/about_data
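
The links above can be kept in one lookup so each year is fetched the same way. This is a sketch: the dataset IDs are copied from the URLs in the list, while the CSV export URL pattern (`/api/views/<id>/rows.csv?accessType=DOWNLOAD`) is an assumption about the data.cdc.gov portal and should be checked before relying on it. The names `PLACES_COUNTY_GIS_DATASETS` and `places_csv_url` are hypothetical.

```python
# Data year -> dataset ID, taken from the PLACES links listed above.
PLACES_COUNTY_GIS_DATASETS = {
    2022: "i46a-9kgh",
    2021: "7cmc-7y5g",
    2020: "xyst-f73f",
    2019: "kmvs-jkvx",
    2018: "mssc-ksj7",
}


def places_csv_url(data_year):
    """Build an assumed CSV export URL for the given data year."""
    try:
        dataset_id = PLACES_COUNTY_GIS_DATASETS[data_year]
    except KeyError:
        raise ValueError(f"No PLACES dataset listed for {data_year}") from None
    return f"https://data.cdc.gov/api/views/{dataset_id}/rows.csv?accessType=DOWNLOAD"


for year in sorted(PLACES_COUNTY_GIS_DATASETS):
    print(year, places_csv_url(year))
```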

In [19]:
df = spark.read.csv(r"C:\Users\neupa\Downloads\PLACES__Local_Data_for_Better_Health__County_Data_2023_release (1).csv", header=True, inferSchema=True)

In [21]:
df.show(5)

+----+---------+---------+------------+----------+---------------+--------------------+---------------+----------------+----------+--------------------------+-------------------+--------------------+---------------------+---------------+----------+----------+----------+---------------+-------------------+--------------------+
|Year|StateAbbr|StateDesc|LocationName|DataSource|       Category|             Measure|Data_Value_Unit| Data_Value_Type|Data_Value|Data_Value_Footnote_Symbol|Data_Value_Footnote|Low_Confidence_Limit|High_Confidence_Limit|TotalPopulation|LocationID|CategoryID| MeasureId|DataValueTypeID|Short_Question_Text|         Geolocation|
+----+---------+---------+------------+----------+---------------+--------------------+---------------+----------------+----------+--------------------------+-------------------+--------------------+---------------------+---------------+----------+----------+----------+---------------+-------------------+--------------------+
|2021|       TX|

In [22]:
df.groupBy("Measure").count().show()

+--------------------+-----+
|             Measure|count|
+--------------------+-----+
|Fecal occult bloo...| 6288|
|Obesity among adu...| 6154|
|Physical health n...| 6154|
|Self-care disabil...| 6154|
|Binge drinking am...| 6154|
|Any disability am...| 6154|
|No leisure-time p...| 6154|
|Visits to dentist...| 6288|
|Vision disability...| 6154|
|Sleeping less tha...| 6288|
|High blood pressu...| 6154|
|Arthritis among a...| 6154|
|Visits to doctor ...| 6154|
|Mammography use a...| 6288|
|Older adult women...| 6288|
|Cholesterol scree...| 6154|
|Fair or poor self...| 6154|
|Stroke among adul...| 6154|
|Depression among ...| 6154|
|Diagnosed diabete...| 6154|
+--------------------+-----+
only showing top 20 rows



In [13]:
# https://data.cdc.gov/500-Cities-Places/PLACES-County-Data-GIS-Friendly-Format-2024-releas/i46a-9kgh/about_data

In [22]:
import pandas as pd
df = pd.read_csv(r"C:\Users\neupa\Downloads\PLACES__Local_Data_for_Better_Health__County_Data_2024_release_20250321.csv")


In [24]:
df["Year"].value_counts()

Year
2022    216262
2021     24624
Name: count, dtype: int64

In [25]:
df = df[df["Year"] == 2022]

In [26]:
df

Unnamed: 0,Year,StateAbbr,StateDesc,LocationName,DataSource,Category,Measure,Data_Value_Unit,Data_Value_Type,Data_Value,...,Low_Confidence_Limit,High_Confidence_Limit,TotalPopulation,TotalPop18plus,LocationID,CategoryID,MeasureId,DataValueTypeID,Short_Question_Text,Geolocation
0,2022,US,United States,,BRFSS,Health Outcomes,Diagnosed diabetes among adults,%,Crude prevalence,12.0,...,11.8,12.2,333287557,260836730,59,HLTHOUT,DIABETES,CrdPrv,Diabetes,
1,2022,CO,Colorado,Lake,BRFSS,Health Outcomes,Stroke among adults,%,Crude prevalence,2.4,...,2.2,2.6,7327,5862,8065,HLTHOUT,STROKE,CrdPrv,Stroke,POINT (-106.344971513974 39.2024367117474)
2,2022,CO,Colorado,Mesa,BRFSS,Disability,Hearing disability among adults,%,Crude prevalence,7.1,...,6.3,8.0,158636,126505,8077,DISABLT,HEARING,CrdPrv,Hearing Disability,POINT (-108.466537411781 39.0183551841305)
3,2022,CT,Connecticut,Capitol,BRFSS,Health Outcomes,Arthritis among adults,%,Crude prevalence,26.2,...,24.4,28.4,981447,783914,9110,HLTHOUT,ARTHRITIS,CrdPrv,Arthritis,POINT (-72.5720699045246 41.8184543884154)
4,2022,FL,Florida,Alachua,BRFSS,Health Outcomes,Arthritis among adults,%,Age-adjusted prevalence,24.4,...,21.7,27.2,284030,234132,12001,HLTHOUT,ARTHRITIS,AgeAdjPrv,Arthritis,POINT (-82.3582005204153 29.6751856950068)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
240879,2022,WI,Wisconsin,Fond du Lac,BRFSS,Health Outcomes,Stroke among adults,%,Crude prevalence,3.7,...,3.3,4.0,103836,82265,55039,HLTHOUT,STROKE,CrdPrv,Stroke,POINT (-88.4883433780916 43.7536089759286)
240882,2022,WI,Wisconsin,Trempealeau,BRFSS,Health Outcomes,Depression among adults,%,Age-adjusted prevalence,24.5,...,20.9,28.2,30899,23116,55121,HLTHOUT,DEPRESSION,AgeAdjPrv,Depression,POINT (-91.3584214806691 44.3039450660913)
240883,2022,WI,Wisconsin,Door,BRFSS,Prevention,Visited dentist or dental clinic in the past y...,%,Age-adjusted prevalence,64.3,...,60.1,67.8,30526,25807,55029,PREVENT,DENTAL,AgeAdjPrv,Dental Visit,POINT (-87.3114193001272 44.9500144269812)
240884,2022,WI,Wisconsin,Marathon,BRFSS,Disability,Self-care disability among adults,%,Crude prevalence,3.2,...,2.9,3.5,137958,107333,55073,DISABLT,SELFCARE,CrdPrv,Self-care Disability,POINT (-89.7588560093353 44.8983004431375)


In [18]:
df["Year"] = 2022

In [19]:
selected_column = [
    "Year", "StateAbbr", "StateDesc", "CountyName", "CountyFIPS",
    "TotalPopulation", "TotalPop18plus",
    "CASTHMA_CrudePrev", "CASTHMA_Crude95CI", "CASTHMA_AdjPrev", "CASTHMA_Adj95CI",
    "Geolocation",
]

In [20]:
df = df[selected_column]

In [21]:
df[df.CountyName == "Otero"]

Unnamed: 0,Year,StateAbbr,StateDesc,CountyName,CountyFIPS,TotalPopulation,TotalPop18plus,CASTHMA_CrudePrev,CASTHMA_Crude95CI,CASTHMA_AdjPrev,CASTHMA_Adj95CI,Geolocation
2676,2022,CO,Colorado,Otero,8089,18303,14058,11.8,"(10.4, 13.3)",12.0,"(10.6, 13.5)",POINT (-103.717022312193 37.9021617548008)
2866,2022,NM,New Mexico,Otero,35035,68823,53831,11.4,"(10.0, 12.9)",11.6,"(10.1, 13.0)",POINT (-105.741478524611 32.6132896393527)
