### Converting Google Earth Engine Assets to ~50 CSV Files

* County-level average embeddings for every state, 2017 to 2024.
* Each asset covers one state, identified by its two-digit state FIPS code.
* State FIPS codes are listed here: https://transition.fcc.gov/oet/info/maps/census/fips/fips.txt
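
County FIPS codes are five digits: two for the state followed by three for the county. When a CSV is read with pandas, the codes are usually parsed as integers, so leading zeros are lost (`06085` becomes `6085`). The sketch below shows one way to recover the two-digit state code from a county code. The helper name `state_fips_from_county` is hypothetical and is not part of `utils.py`.

```python
def state_fips_from_county(county_code) -> str:
    """Return the two-digit state FIPS prefix of a county FIPS code."""
    # Re-pad to five characters first, because integer storage drops leading zeros.
    padded = str(int(county_code)).zfill(5)
    # The first two characters identify the state.
    return padded[:2]


print(state_fips_from_county(6085))   # 06 (California)
print(state_fips_from_county(51169))  # 51 (Virginia)
```

The returned strings match the zero-padded entries used in `state_fips_codes`.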

#### Saving as CSV to `.\notebooks\national_embeddings\all_embeddings_csvs`

Using the `convert_to_df()` function from `utils.py`: 

In [1]:
from pathlib import Path
import sys
import os

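# `utils` lives at the project root (wnv_embeddings), two levels above this notebook.
# Add the root to sys.path so the `utils.utils` import below resolves.
PROJECT_ROOT = Path.cwd().parents[1]  # <-- wnv_embeddings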
sys.path.insert(0, str(PROJECT_ROOT))

from utils.utils import convert_to_df
import pandas as pd
import ee
import requests

In [7]:
# will prompt you to authorize access to GEE
# this is needed to obtain assets from the cloud saved under your account
ee.Authenticate()

# enter your own registered project name here
ee.Initialize(project="wnv-embeddings")

In [None]:
# 50 states plus the District of Columbia (FIPS 11): 51 codes in total
state_fips_codes = [
    "01", "02", "04", "05", "06", "08", "09", "10", "11", "12",
    "13", "15", "16", "17", "18", "19", "20", "21", "22", "23",
    "24", "25", "26", "27", "28", "29", "30", "31", "32", "33",
    "34", "35", "36", "37", "38", "39", "40", "41", "42", "44",
    "45", "46", "47", "48", "49", "50", "51", "53", "54", "55", "56"
]

In [None]:
# ============= CONVERT GEE ASSETS TO CSVS ============= #
# ONLY RUN ONCE: exports all 51 assets (one per FIPS code above) as CSVs.
# Left commented out so the export is not re-run by accident.

# csv_destination = Path("all_embeddings_csvs")
# csv_destination.mkdir(parents=True, exist_ok=True)

# for fips in state_fips_codes:
#     gee_path = f"users/angel314/{fips}_2017_2024_embeddings"
#     save_to = csv_destination / f"{fips}-avg-embeddings-2017-2024.csv"
#     convert_to_df(gee_path, True, save_to)

#### Appending Yearly WNV Case Data + County Population Data

##### Getting WNV Case Data:
* Source: https://www.cdc.gov/west-nile-virus/data-maps/historic-data.html  
* Section: "Explore county level data for 1999-2024" - "Yearly data"
	* Returns: one CSV with case data at a county level for 1999-2024
* The `Location` column holds the county FIPS code for each row. It is stored as an integer, so leading zeros are dropped (e.g. `06085` appears as `6085`).
* The case data is then filtered to 2017 onward and to rows with at least one human disease case.

A random sample of five rows from the 1999 to 2024 county-level case data:

In [75]:
cases = pd.read_csv("./national_wnv_case_data/wnv_county_cases_1999_2024.csv")
cases.sample(5)

Unnamed: 0,FullGeoName,Year,Location,Activity,Total human disease cases,Neuroinvasive disease cases,**Presumptive viremic blood donors,Notes
16252,"CA, Santa Clara",2007,6085,Human infections and non-human activity,4.0,1.0,0.0,
4456,"CO, Pitkin",2019,8097,Human infections,1.0,0.0,0.0,
26347,"VA, Scott",2002,51169,Non-human activity,0.0,0.0,0.0,
13116,"PA, Indiana",2012,42063,Non-human activity,0.0,0.0,0.0,
7243,"OH, Meigs",2017,39105,Human infections and non-human activity,1.0,1.0,0.0,


In [76]:
###### filtering ######

# keep only 2017 onward, to match the embeddings time range
cases = cases[cases["Year"] >= 2017]
# remove any rows with 0 total human disease cases
cases = cases[cases["Total human disease cases"] > 0]
# only keep relevant columns
cases = cases.drop(
    columns=[
        "FullGeoName", "Activity", "Neuroinvasive disease cases",
        "**Presumptive viremic blood donors", "Notes",
    ]
).reset_index(drop=True)
cases

Unnamed: 0,Year,Location,Total human disease cases
0,2024,1001,2.0
1,2024,1003,2.0
2,2024,1021,1.0
3,2024,1043,2.0
4,2024,1047,1.0
...,...,...,...
4006,2017,55141,2.0
4007,2017,56003,1.0
4008,2017,56013,3.0
4009,2017,56015,2.0


In [77]:
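# sum cases per (Year, Location) pair so each county appears at most once per year;
# groupby also sorts the rows by Year, then Location
cases = cases.groupby(["Year","Location"]).agg("sum").reset_index()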
cases

Unnamed: 0,Year,Location,Total human disease cases
0,2017,1001,6.0
1,2017,1003,3.0
2,2017,1007,1.0
3,2017,1011,1.0
4,2017,1015,2.0
...,...,...,...
4006,2024,55133,1.0
4007,2024,55139,1.0
4008,2024,55141,1.0
4009,2024,56015,1.0


In [78]:
# convert from long format to wide format:
# one row per location, one Cases_<year> column per year (2017 - 2024).

# index="Location" -> one row per county
# columns="Year" -> each unique year becomes a column
# values="Total human disease cases" -> numbers that fill the pivot table
# aggfunc="sum" -> sum all entries for the same location and year
# fill_value=0 -> location/year pairs with no rows count as 0 cases

# reset_index turns "Location" from the index back into a regular column.

cases_wide = (
    cases.pivot_table(
        index="Location",
        columns="Year",
        values="Total human disease cases",
        aggfunc="sum",
        fill_value=0,
    )
    .add_prefix("Cases_")
    .reset_index()
)
cases_wide

Year,Location,Cases_2017,Cases_2018,Cases_2019,Cases_2020,Cases_2021,Cases_2022,Cases_2023,Cases_2024
0,1001,6.0,0.0,0.0,1.0,1.0,0.0,1.0,2.0
1,1003,3.0,2.0,1.0,0.0,2.0,1.0,0.0,2.0
2,1007,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,1011,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1015,2.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0
...,...,...,...,...,...,...,...,...,...
1607,56025,0.0,0.0,1.0,0.0,0.0,1.0,3.0,1.0
1608,56029,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
1609,56031,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0
1610,56033,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
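
A quick sanity check for the long-to-wide step is to confirm that pivoting neither creates nor loses cases. The sketch below runs the same `pivot_table` call on a small made-up frame (the values are illustrative, not CDC data) and compares totals before and after.

```python
import pandas as pd

# Toy long-format data mimicking the cleaned `cases` frame (values are made up).
toy_cases = pd.DataFrame({
    "Year": [2017, 2018, 2017],
    "Location": [1001, 1001, 1003],
    "Total human disease cases": [6.0, 1.0, 3.0],
})

toy_wide = (
    toy_cases.pivot_table(
        index="Location",
        columns="Year",
        values="Total human disease cases",
        aggfunc="sum",
        fill_value=0,
    )
    .add_prefix("Cases_")
    .reset_index()
)

case_cols = [c for c in toy_wide.columns if c.startswith("Cases_")]

# Each location should appear exactly once in the wide table.
assert toy_wide["Location"].is_unique
# The grand total of cases must be unchanged by the pivot.
assert toy_wide[case_cols].to_numpy().sum() == toy_cases["Total human disease cases"].sum()
```

The same two assertions can be run on `cases` and `cases_wide` directly.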


Save the cleaned dataframe to a CSV for future use.

In [85]:
cases_wide.to_csv("./national_wnv_case_data/agg_wnv_county_cases_2017_2024.csv")

The remaining step is to merge the case counts with the embeddings data for each state FIPS code.

The cells below test the merge on FIPS code 17 (Illinois).

In [None]:
##### fips 17 test #####

# path to the embeddings csv for fips code 17
path = "./all_embeddings_csvs/17-avg-embeddings-2017-2024.csv"
# load in the csv
df = pd.read_csv(path)
df

Unnamed: 0,A00_2017,A00_2018,A00_2019,A00_2020,A00_2021,A00_2022,A00_2023,A00_2024,A01_2017,A01_2018,...,A62_2024,A63_2017,A63_2018,A63_2019,A63_2020,A63_2021,A63_2022,A63_2023,A63_2024,GEOID
0,-0.126185,-0.112618,-0.083152,-0.137135,-0.105750,-0.125978,-0.123757,-0.102107,-0.038990,-0.037035,...,-0.136299,0.030838,0.038678,0.028741,0.034277,0.042769,0.022800,0.033839,0.033290,17121
1,-0.119659,-0.100279,-0.086142,-0.111147,-0.091950,-0.108974,-0.110964,-0.088635,-0.035458,-0.033798,...,-0.143509,0.048798,0.048666,0.044434,0.057424,0.066643,0.039234,0.050398,0.049886,17005
2,-0.121869,-0.106163,-0.089410,-0.093877,-0.100690,-0.095218,-0.117983,-0.091646,-0.032314,-0.017284,...,-0.147811,0.030798,0.035180,0.038207,0.042865,0.044161,0.031365,0.031765,0.035637,17083
3,-0.120115,-0.113954,-0.098588,-0.113361,-0.104700,-0.109215,-0.123083,-0.098814,-0.096917,-0.080333,...,-0.142231,0.030921,0.034907,0.035352,0.036025,0.049167,0.029630,0.036940,0.040171,17163
4,-0.104617,-0.090185,-0.070940,-0.106841,-0.082370,-0.101803,-0.099361,-0.082830,-0.084239,-0.075842,...,-0.129191,0.080986,0.080554,0.077953,0.085153,0.096826,0.073851,0.082292,0.084914,17027
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97,-0.081918,-0.083240,-0.049877,-0.071286,-0.032761,-0.055054,-0.061407,-0.069657,-0.047272,-0.037998,...,-0.165051,0.045601,0.055496,0.062862,0.061943,0.088035,0.038284,0.045226,0.050090,17131
98,-0.068598,-0.065659,-0.033529,-0.056787,-0.025147,-0.045408,-0.046732,-0.061580,-0.081419,-0.070210,...,-0.154932,0.036447,0.041207,0.067978,0.053747,0.078759,0.036642,0.040094,0.035452,17161
99,-0.070369,-0.077984,-0.038843,-0.059882,-0.027230,-0.047454,-0.051155,-0.067624,-0.060634,-0.034753,...,-0.154816,0.057985,0.058326,0.064610,0.065451,0.092426,0.048008,0.046217,0.055491,17073
100,-0.063715,-0.053766,-0.017952,-0.039881,-0.024010,-0.040352,-0.044199,-0.061335,-0.044527,-0.032002,...,-0.126775,0.055892,0.071999,0.090880,0.094341,0.123831,0.065502,0.072743,0.072798,17103


In [83]:
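# left merge keeps every county in the embeddings file;
# counties missing from the case data get NaN, which fillna(0) turns into 0 cases
df_merged = pd.merge(df, cases_wide, left_on="GEOID", right_on="Location", how="left").fillna(0).drop(columns=["Location"])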
df_merged

Unnamed: 0,A00_2017,A00_2018,A00_2019,A00_2020,A00_2021,A00_2022,A00_2023,A00_2024,A01_2017,A01_2018,...,A63_2024,GEOID,Cases_2017,Cases_2018,Cases_2019,Cases_2020,Cases_2021,Cases_2022,Cases_2023,Cases_2024
0,-0.126185,-0.112618,-0.083152,-0.137135,-0.105750,-0.125978,-0.123757,-0.102107,-0.038990,-0.037035,...,0.033290,17121,0.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0
1,-0.119659,-0.100279,-0.086142,-0.111147,-0.091950,-0.108974,-0.110964,-0.088635,-0.035458,-0.033798,...,0.049886,17005,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
2,-0.121869,-0.106163,-0.089410,-0.093877,-0.100690,-0.095218,-0.117983,-0.091646,-0.032314,-0.017284,...,0.035637,17083,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,-0.120115,-0.113954,-0.098588,-0.113361,-0.104700,-0.109215,-0.123083,-0.098814,-0.096917,-0.080333,...,0.040171,17163,1.0,2.0,2.0,0.0,0.0,1.0,4.0,3.0
4,-0.104617,-0.090185,-0.070940,-0.106841,-0.082370,-0.101803,-0.099361,-0.082830,-0.084239,-0.075842,...,0.084914,17027,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97,-0.081918,-0.083240,-0.049877,-0.071286,-0.032761,-0.055054,-0.061407,-0.069657,-0.047272,-0.037998,...,0.050090,17131,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
98,-0.068598,-0.065659,-0.033529,-0.056787,-0.025147,-0.045408,-0.046732,-0.061580,-0.081419,-0.070210,...,0.035452,17161,0.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0
99,-0.070369,-0.077984,-0.038843,-0.059882,-0.027230,-0.047454,-0.051155,-0.067624,-0.060634,-0.034753,...,0.055491,17073,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
100,-0.063715,-0.053766,-0.017952,-0.039881,-0.024010,-0.040352,-0.044199,-0.061335,-0.044527,-0.032002,...,0.072798,17103,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
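
One caveat: calling `.fillna(0)` on the whole merged frame would also zero out any missing embedding values, not only missing case counts. Whether the embedding CSVs contain NaNs is not shown here, so this is a precaution rather than a confirmed problem. The sketch below, on made-up toy data, fills only the `Cases_` columns and uses `validate="m:1"` so pandas raises an error if a county appears more than once in the case table.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for `df` (embeddings) and `cases_wide`; all values are made up.
toy_embeddings = pd.DataFrame({
    "A00_2017": [-0.12, np.nan],  # second row has a genuinely missing embedding
    "GEOID": [17121, 17005],
})
toy_cases_wide = pd.DataFrame({"Location": [17121], "Cases_2017": [2.0]})

toy_merged = pd.merge(
    toy_embeddings,
    toy_cases_wide,
    left_on="GEOID",
    right_on="Location",
    how="left",
    validate="m:1",  # raise if a county is duplicated in the case table
).drop(columns=["Location"])

# Fill only the case columns: a county absent from the case table had no reported cases.
case_cols = [c for c in toy_merged.columns if c.startswith("Cases_")]
toy_merged[case_cols] = toy_merged[case_cols].fillna(0)

print(toy_merged)
```

With this variant the missing embedding stays `NaN` and can be handled separately, while the unmatched county still gets 0 cases.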


##### Iterating Over `all_embeddings_csvs` to Add WNV Human Cases for Each Year

In [86]:
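# make sure the output folder exists before writing to it
output_dir = Path("./all_embeddings_with_cases")
output_dir.mkdir(parents=True, exist_ok=True)

for code in state_fips_codes:
    # get csv for current fips code
    path = f"./all_embeddings_csvs/{code}-avg-embeddings-2017-2024.csv"
    # load in the csv
    df = pd.read_csv(path)
    # attach yearly case counts; counties with no reported cases get 0
    df_merged = pd.merge(df, cases_wide, left_on="GEOID", right_on="Location", how="left").fillna(0).drop(columns=["Location"])

    df_merged.to_csv(output_dir / f"cleaned-{code}-avg-embeddings-2017-2024.csv")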

##### Appending Population Data:

* Using Data Commons API:

	* https://docs.datacommons.org/what_is.html 

	* It lets us query specific statistical variables and get one unified result.

	* There is an option to query for counties as well using FIPS codes: https://datacommons.org/browser/County 
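
Data Commons identifies US counties by DCIDs of the form `geoId/<five-digit FIPS>` and exposes total population under the statistical variable `Count_Person`. A minimal sketch of how the integer `GEOID` values could be turned into DCIDs before querying; the helper name `county_dcid` is hypothetical, and the API call itself (which needs a key) is left out.

```python
def county_dcid(geoid) -> str:
    """Build a Data Commons DCID for a US county from its FIPS code."""
    # Data Commons expects the zero-padded five-digit code, e.g. geoId/06085.
    return f"geoId/{int(geoid):05d}"


# Data Commons statistical variable for total population.
POPULATION_VARIABLE = "Count_Person"

dcids = [county_dcid(g) for g in (6085, 17121)]
print(dcids)  # ['geoId/06085', 'geoId/17121']
```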

Yearly county population data is needed to normalize case counts with this formula:

$\text{Cases per 100k} = \frac{\text{Number of disease cases}}{\text{County population}} \times 100{,}000$

Normalized cases (cases per 100k) will be the target variable when measuring machine learning models' performance.
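
As a minimal sketch of the normalization, assuming the population data ends up in one column per year: the `Pop_2023` column name is an assumption (no population data has been loaded yet) and the numbers are made up.

```python
import pandas as pd

# Toy frame; `Pop_2023` is a hypothetical column name and the values are illustrative.
toy = pd.DataFrame({
    "GEOID": [17121, 17163],
    "Cases_2023": [1.0, 4.0],
    "Pop_2023": [40_000, 250_000],
})

# cases per 100k = cases / population * 100,000
# e.g. 1 case in a county of 40,000 people is 2.5 cases per 100k
toy["CasesPer100k_2023"] = toy["Cases_2023"] / toy["Pop_2023"] * 100_000

print(toy[["GEOID", "CasesPer100k_2023"]])
```

The same expression can be repeated for each year from 2017 to 2024 once the matching population columns exist.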

Note: api.census.gov does not have consistent, up-to-date county population data for 2017 to 2024.

In [None]:
# TODO: get api key

### Model Evaluation

### Visualizations