# **Variable Exploration**

### PVI Variable from the National Institute of Environmental Health Sciences.

#### NIH has created a Live Pandemic Vulnerability Index (PVI):
- Their dashboard creates risk profiles, called PVI scorecards, for every county in the United States. The PVI is built by weighting several different variables.
- It is continuously updated with the latest data, and its history goes back to 02/28/2020.

#### Screenshot of PVI Dashboard:
![Screen%20Shot%202021-05-22%20at%204.43.30%20PM.png](attachment:Screen%20Shot%202021-05-22%20at%204.43.30%20PM.png)

#### NIH Explanation of this Chart
"Population-level data is a powerful resource for understanding how the virus is spreading and which communities are at risk. However, interpreting that information is challenging. The data visualization in this dashboard offers an effective means of communicating data to scientists, policy makers, and the public."

#### Opportunity:
Use this indexed measurement of COVID risk to answer: which COVID indicator is most closely tied to our county-level economic data?  
- Search for county-level economic data and examine the relationship between those economic factors and this COVID measurement.
- I am not 100% sure how to weigh COVID in an analysis (deaths, cases, recoveries, infection rates... what do we focus on?). I think this index would do a lot of that problem solving for us.

The metric is evaluated under four major domains: 
- Infection Rate
- Population Concentration
- Intervention Measures
- Health & Environment

Further details on how each domain/coefficient is weighted when creating the PVI: https://www.niehs.nih.gov/research/programs/coronavirus/covid19pvi/details/index.cfm 


#### PVI Coefficients 



![Screen%20Shot%202021-05-22%20at%204.17.16%20PM.png](attachment:Screen%20Shot%202021-05-22%20at%204.17.16%20PM.png)

### Potential use case for PVI 
- **Narrow down project scope** to assess the relationship between this county-level COVID Pandemic Vulnerability Index and economic variables.
- Decide how we account for COVID in the economic risk prediction model.
- I personally do not know how to combine variables such as deaths and hospitalizations into one complete index, so I think building on the PVI would work well.
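
Relating the PVI to county-level economic variables requires a shared county key. The PVI results file loaded below identifies each county with a single `Name` string such as `Georgia, Colquitt`, while the CDCAR tables use separate `State` and `County` columns (`Alabama`, `Autauga County`). The following is a minimal sketch of one way to normalize both into a comparable key. The helper names `county_key` and `key_from_pvi_name` are hypothetical, and stripping a trailing ` County` is an assumption that will not handle parishes, boroughs, or independent cities.

```python
import pandas as pd

def county_key(state, county):
    """Build a lowercase 'state|county' key, dropping a trailing ' County'."""
    county = county.strip()
    if county.lower().endswith(" county"):
        county = county[: -len(" county")]
    return f"{state.strip().lower()}|{county.strip().lower()}"

def key_from_pvi_name(name):
    """Split a PVI-style 'State, County' string and build the same key."""
    state, county = name.split(",", 1)
    return county_key(state, county)

# Names copied from the outputs shown in this notebook
pvi = pd.DataFrame({"Name": ["California, Humboldt", "Georgia, Colquitt"]})
pop = pd.DataFrame({"State": ["Alabama"], "County": ["Autauga County"]})

pvi["county_key"] = pvi["Name"].map(key_from_pvi_name)
pop["county_key"] = [county_key(s, c) for s, c in zip(pop["State"], pop["County"])]
```

Once both tables carry `county_key`, an ordinary `merge` on that column would line up PVI scores with economic variables. Any rows that fail to match are worth inspecting by hand before trusting the join.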





In [31]:
# Link to the snippet used to pull this data
# https://medium.com/towards-entrepreneurship/importing-a-csv-file-from-github-in-a-jupyter-notebook-e2c28e7e74a5
# Libraries needed for the tutorial
import pandas as pd 
import requests 
import io

#Downloading the csv file from your GitHub account
#Link to Repo: https://github.com/COVID19PVI/data
url = "https://raw.githubusercontent.com/COVID19PVI/data/master/Model11.2.1/Model_11.2.1_20200228_results.csv"

#Make sure the url is the raw version of the file on GitHub
download = requests.get(url).content

#Reading the downloaded content and turning it into a pandas dataframe
df_PVI = pd.read_csv(io.StringIO(download.decode('utf-8')))

# Printing out the first 5 rows of the dataframe
# Link to how each variable is weighted: https://www.niehs.nih.gov/research/programs/coronavirus/covid19pvi/details/
df_PVI.head()

Unnamed: 0,ToxPi Score,HClust Group,KMeans Group,Name,Source,Infection Rate: Transmissible Cases!25!0xcc3333ff,Infection Rate: Disease Spread!5!0xe64d4dff,Pop Concentration: Pop Mobility!10!0x57b757ff,Pop Concentration: Residential Density!10!0x5ced5cff,Intervention: Social Distancing!10!0x4258c9ff,Intervention: Testing!10!0x6079f7ff,Health & Environment: Pop Demographics!10!0x6b0b9eff,Health & Environment: Air Pollution!10!0x8e26c4ff,Health & Environment: Age Distribution!10!0x9a42c8ff,Health & Environment: Co-morbidities!10!0xb460e0ff,Health & Environment: Health Disparities!10!0xc885ecff,Health & Environment: Hospital Beds!5!0xdeb9f1ff
0,0.651913,1,8,"California, Humboldt","-123.876,40.6992",1.0,1.0,0.593877,0.9042,0.75,0.504366,0.545336,0.347305,0.551736,0.283945,0.430832,0.474621
1,0.535333,3,6,"Georgia, Colquitt","-83.7678,31.1881",0.0,0.0,0.575826,0.9491,1.0,1.0,0.670197,0.479042,0.504195,0.527423,0.751368,0.469028
2,0.530385,3,6,"Georgia, Crisp","-83.7681,31.9229",0.0,0.0,0.579201,0.9475,1.0,1.0,0.604013,0.479042,0.571298,0.542399,0.696341,0.420039
3,0.530276,3,6,"Georgia, Sumter","-84.1982,32.0365",0.0,0.0,0.565632,0.979,1.0,1.0,0.669669,0.502994,0.533441,0.506358,0.646898,0.448907
4,0.530236,3,6,"Georgia, Toombs","-82.3296,32.1206",0.0,0.0,0.563958,0.9223,1.0,1.0,0.643108,0.461078,0.540407,0.565237,0.695868,0.471986
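
The slice column headers in the output above appear to carry each slice's weight between the exclamation marks (for example `!25!`), followed by a color code. The sketch below assumes that `Label!weight!0xcolor` reading of the headers, which is inferred from the output rather than taken from NIEHS documentation. It parses the weights, totals them per domain, and recomputes the score as a weighted average. The function names are hypothetical, and the two sample rows are copied from the output above.

```python
import pandas as pd

def parse_slice_weights(columns):
    """Map each 'Label!weight!0xcolor' header to its numeric weight."""
    weights = {}
    for col in columns:
        parts = col.split("!")
        if len(parts) == 3:  # skips plain columns such as 'Name' or 'ToxPi Score'
            weights[col] = float(parts[1])
    return pd.Series(weights, dtype=float)

def weighted_score(df):
    """Weighted average of the slice columns, using the weights in the headers."""
    weights = parse_slice_weights(df.columns)
    return df[weights.index].mul(weights, axis=1).sum(axis=1) / weights.sum()

def domain_weights(columns):
    """Total weight per domain, where the domain is the text before the first ':'."""
    totals = {}
    for col, weight in parse_slice_weights(columns).items():
        domain = col.split(":")[0]
        totals[domain] = totals.get(domain, 0.0) + weight
    return totals

slice_columns = [
    "Infection Rate: Transmissible Cases!25!0xcc3333ff",
    "Infection Rate: Disease Spread!5!0xe64d4dff",
    "Pop Concentration: Pop Mobility!10!0x57b757ff",
    "Pop Concentration: Residential Density!10!0x5ced5cff",
    "Intervention: Social Distancing!10!0x4258c9ff",
    "Intervention: Testing!10!0x6079f7ff",
    "Health & Environment: Pop Demographics!10!0x6b0b9eff",
    "Health & Environment: Air Pollution!10!0x8e26c4ff",
    "Health & Environment: Age Distribution!10!0x9a42c8ff",
    "Health & Environment: Co-morbidities!10!0xb460e0ff",
    "Health & Environment: Health Disparities!10!0xc885ecff",
    "Health & Environment: Hospital Beds!5!0xdeb9f1ff",
]
rows = [
    [1.0, 1.0, 0.593877, 0.9042, 0.75, 0.504366,
     0.545336, 0.347305, 0.551736, 0.283945, 0.430832, 0.474621],
    [0.0, 0.0, 0.575826, 0.9491, 1.0, 1.0,
     0.670197, 0.479042, 0.504195, 0.527423, 0.751368, 0.469028],
]
sample = pd.DataFrame(rows, columns=slice_columns)
sample.insert(0, "ToxPi Score", [0.651913, 0.535333])
sample.insert(1, "Name", ["California, Humboldt", "Georgia, Colquitt"])

sample["recomputed"] = weighted_score(sample)
per_domain = domain_weights(sample.columns)
```

On these two rows the recomputed values agree with the published `ToxPi Score` to within rounding, which suggests (but does not prove) that the score is a plain weighted average of the slices. The per-domain totals give a quick view of how much each of the four domains contributes.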


## Cleaned Aggregated COVID Dataset
- Link to Repo: https://github.com/covid19-dashboard-us/cdcar
- Link to Paper: https://ui.adsabs.harvard.edu/abs/2020arXiv200601333W/abstract

## Variables to Leverage from this data set / Repo

- **County_Population** -- the count of people in each county;
- **FIPS_C** -- county-level identifying variable;
- **FIPS_S** -- state-level identifying variable;
- **AA_Rate** -- the percent of the population who identify as African American;
- **HL_PCT** -- the percent of the population who identify as Hispanic or Latino;
- **Sex_ratio** -- the ratio of male over female;
- **HEducation_PCT** -- the percent of the population aged 25 years or older with a bachelor’s degree or higher;
- **HHD_PAI_PCT** -- the percent of the households with public assistance income;
- **HHD_F_PCT** -- the percent of households with female householders and no husband present;
- **Unemployment_PCT** -- civilian labor force unemployment rate; 


### Why I find this datasource valuable:
- The paper describes the wide variation found between different open-source datasets, and the repo provides access to the cleaned data. See the screenshot from the paper below, which shows the areas affected.

![Screen%20Shot%202021-05-22%20at%205.30.27%20PM.png](attachment:Screen%20Shot%202021-05-22%20at%205.30.27%20PM.png)

In [4]:

# Link to how I pulled this data
# https://medium.com/towards-entrepreneurship/importing-a-csv-file-from-github-in-a-jupyter-notebook-e2c28e7e74a5
# Libraries needed for the tutorial

import pandas as pd
import requests
import io

# Downloading the csv file from your GitHub account
# Link to Repo: https://github.com/covid19-dashboard-us/cdcar
url = "https://raw.githubusercontent.com/covid19-dashboard-us/cdcar/master/data/Cov_v14.csv" 
# Make sure the url is the raw version of the file on GitHub
download = requests.get(url).content

# Reading the downloaded content and turning it into a pandas dataframe

df_cdcar = pd.read_csv(io.StringIO(download.decode('utf-8')))

# Listing the column names of the dataframe
df_cdcar.columns

Index(['ID', 'County', 'State', 'FIPS_C', 'FIPS_S', 'avemort', 'BlackRate',
       'HLRate', 'Gini', 'Affluence', 'HighIncome', 'EduAttain', 'OccupAdv',
       'MedHouVal', 'Disadvantage', 'PublicAssistance', 'FemaleLeadRate',
       'EmployStatus', 'ViolentCrime', 'PropertyCrime', 'ResidStability',
       'UrbanRate', 'HealCovRate', 'ExpHealth', 'Latitude', 'Longtitude', 'MF',
       'dPop_ml2', 'LOG_pop', 'prop_old', 'BED_SUM'],
      dtype='object')

In [29]:
url = "https://raw.githubusercontent.com/covid19-dashboard-us/cdcar/master/data/County_pop.tsv" 
# Make sure the url is the raw version of the file on GitHub
download = requests.get(url).content

# Reading the downloaded content and turning it into a pandas dataframe
#'latin-1' needed for .tsv
df_cdcarpop = pd.read_csv(io.StringIO(download.decode('latin-1')),sep='\t')

# Printing out the first 8 rows of the dataframe
df_cdcarpop.head(8)

Unnamed: 0,ID,County,State,population
0,1001,Autauga County,Alabama,55601
1,1003,Baldwin County,Alabama,218022
2,1005,Barbour County,Alabama,24881
3,1007,Bibb County,Alabama,22400
4,1009,Blount County,Alabama,57840
5,1011,Bullock County,Alabama,10138
6,1013,Butler County,Alabama,19680
7,1015,Calhoun County,Alabama,114277
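
The download cells in this notebook repeat the same three steps: request, decode, parse. `pandas.read_csv` also accepts an http(s) URL directly, so the pattern can be wrapped in one small helper. This is only a sketch. `load_table` is a hypothetical name, not part of either repo, and the in-memory buffer at the end stands in for a real file so the snippet can run offline.

```python
import io
import pandas as pd

def load_table(source, sep=",", encoding="utf-8"):
    """Read a delimited file from a path, http(s) URL, or file-like object."""
    return pd.read_csv(source, sep=sep, encoding=encoding)

# Usage with the URLs from this notebook (requires network access):
# df_cdcar = load_table(
#     "https://raw.githubusercontent.com/covid19-dashboard-us/cdcar/master/data/Cov_v14.csv"
# )
# df_cdcarpop = load_table(
#     "https://raw.githubusercontent.com/covid19-dashboard-us/cdcar/master/data/County_pop.tsv",
#     sep="\t",
#     encoding="latin-1",
# )

# Offline check: a tab-separated buffer shaped like County_pop.tsv
demo = load_table(io.StringIO("ID\tCounty\n1001\tAutauga County\n"), sep="\t")
```

Keeping the separator and encoding as parameters makes the `.tsv` case a one-argument change instead of a separate cell.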


**VARIABLES From CDCAR git repo**
- **ID** -- County-level Federal Information Processing System (FIPS) codes, which uniquely identify geographic areas. The full code has five digits, of which the first two are the FIPS code of the state to which the county belongs. When the state code has a leading zero, the value is stored with four digits (e.g. `1001` for Autauga County, Alabama); 
- **County** -- Name of county matched with ID. There are 3,104 counties and county-equivalents (e.g. independent cities, parishes, boroughs) in the United States; 
- **State** -- Name of state matched with ID. The data cover the 48 mainland U.S. states and the District of Columbia;  
- **FIPS_C** -- County-level Federal Information Processing System (FIPS) codes;
- **FIPS_S** -- State-level Federal Information Processing System (FIPS) codes;
- **Mortality** -- the 5-year (1998–2002) average mortality rate, measured by the total counts of deaths per 100,000 population in a county;
- **AA_PCT** -- the percent of the population who identify as African American;
- **HL_PCT** -- the percent of the population who identify as Hispanic or Latino;
- **Gini** -- the Gini coefficient, a measure for income inequality and wealth distribution in economics; 
- **Affluence** -- social affluence generated by factor analysis from HighIncome, HighEducation, WCEmployment and MedHU;  
- **HIncome_PCT** -- the percent of families with annual incomes higher than $75,000;
- **HEducation_PCT** -- the percent of the population aged 25 years or older with a bachelor’s degree or higher;
- **WCEmployment_PCT** -- the percent of the people working in management, professional, and related occupations;
- **MedHU** -- the median value of owner-occupied housing units;
- **Disadvantage** -- concentrated disadvantage obtained by factor analysis from HHD_PAI_PCT, HHD_F_PCT and Unemployment_PCT;
- **HHD_PAI_PCT** -- the percent of the households with public assistance income;
- **HHD_F_PCT** -- the percent of households with female householders and no husband present;
- **Unemployment_PCT** -- civilian labor force unemployment rate; 
- **ViolentCrime** -- the total number of violent crimes per 1,000 population;
- **PropertyCrime** -- the total number of property crimes per 1,000 population;
- **ResidStability** -- the percent of the population residing in the same house for one year or longer;
- **UrbanRate** -- urban rate from the 2010 Census (U.S. Census Bureau, 2010);
- **NHIC_PCT** -- the percent of persons under 65 years without health insurance;
- **EHPC** -- the local government expenditures for health per capita;
- **Latitude** -- 
- **Longitude** -- 
- **Sex_ratio** -- the ratio of male over female; 
- **PD_log** -- log population density per square mile of land area;
- **Pop_log** -- log local population; 
- **Old_PCT** -- the percent of aged people (age greater than or equal to 65 years); 
- **TBed** -- total bed counts per 1,000 population;
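
Because `ID` is read as an integer, a leading zero in the state part is lost. The sketch below restores the five-character code and splits it into state and county parts, assuming the standard layout of a two-digit state code followed by a three-digit county code. The column names `fips`, `state_fips`, and `county_fips` are introduced here for illustration, and the sample rows are copied from the `County_pop.tsv` output above.

```python
import pandas as pd

pop = pd.DataFrame({
    "ID": [1001, 1003, 1005],
    "County": ["Autauga County", "Baldwin County", "Barbour County"],
    "State": ["Alabama", "Alabama", "Alabama"],
})

# Zero-pad to five characters so every code has the same width
pop["fips"] = pop["ID"].astype(str).str.zfill(5)

# First two characters identify the state, the remaining three the county
pop["state_fips"] = pop["fips"].str[:2]
pop["county_fips"] = pop["fips"].str[2:]
```

Storing the padded code as a string avoids mismatches when joining against sources that keep FIPS codes as text.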

In [3]:

    
url = "https://raw.githubusercontent.com/CDCgov/covid_case_privacy_review/master/data/raw/county_pop_demo_for_verify.csv" 
# Make sure the url is the raw version of the file on GitHub
download = requests.get(url).content

# Reading the downloaded content and turning it into a pandas dataframe

df_cdc = pd.read_csv(io.StringIO(download.decode('utf-8')))

# Printing out the first 5 rows of the dataframe
df_cdc.head()    

Unnamed: 0,state_county_combined_fips,STNAME,CTYNAME,SUM_of_TOT_POP,SUM_of_TOT_MALE,SUM_of_TOT_FEMALE,SUM_of_WA_MALE,SUM_of_WA_FEMALE,SUM_of_BA_MALE,SUM_of_BA_FEMALE,...,SUM_of_NHBA_MALE,SUM_of_NHBA_FEMALE,SUM_of_NHIA_MALE,SUM_of_NHIA_FEMALE,SUM_of_NHAA_MALE,SUM_of_NHAA_FEMALE,SUM_of_NHNA_MALE,SUM_of_NHNA_FEMALE,SUM_of_NHTOM_MALE,SUM_of_NHTOM_FEMALE
0,1001,Alabama,Autauga County,55869,27092,28777,20878,21729,5237,6000,...,5171,5927,105,138,282,364,20,20,492,464
1,1003,Alabama,Baldwin County,223234,108247,114987,94810,100388,9486,10107,...,9308,9907,753,754,911,1435,53,70,1832,1930
2,1005,Alabama,Barbour County,24686,13064,11622,6389,5745,6311,5595,...,6260,5547,52,43,55,61,21,10,153,132
3,1007,Alabama,Bibb County,22394,11929,10465,8766,8425,2941,1822,...,2912,1807,50,41,21,25,5,1,116,130
4,1009,Alabama,Blount County,57826,28472,29354,27258,28154,516,462,...,453,419,143,139,73,90,14,7,345,385
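
The two population tables loaded in this notebook share a county FIPS key under different names (`ID` and `state_county_combined_fips`), so they can be joined to check how much the sources disagree. The sketch below uses the first three rows shown in the outputs above. The names `pop_cdcar`, `pop_cdc`, `abs_diff`, and `rel_diff` are introduced here for illustration only.

```python
import pandas as pd

# First three rows of County_pop.tsv as displayed above
cdcar_pop = pd.DataFrame({
    "ID": [1001, 1003, 1005],
    "population": [55601, 218022, 24881],
})

# First three rows of county_pop_demo_for_verify.csv as displayed above
cdc_pop = pd.DataFrame({
    "state_county_combined_fips": [1001, 1003, 1005],
    "SUM_of_TOT_POP": [55869, 223234, 24686],
})

merged = cdcar_pop.merge(
    cdc_pop,
    left_on="ID",
    right_on="state_county_combined_fips",
    how="inner",
    validate="one_to_one",  # fail loudly if either side has duplicate counties
).rename(columns={"population": "pop_cdcar", "SUM_of_TOT_POP": "pop_cdc"})

merged["abs_diff"] = merged["pop_cdc"] - merged["pop_cdcar"]
merged["rel_diff"] = merged["abs_diff"] / merged["pop_cdcar"]
```

In these three rows the sources differ by a few hundred to a few thousand people per county. Running the same comparison on the full tables would show whether the choice of population source matters for any per-capita rates.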
