# Explainer Notebook

## Motivation.
### The datasets
Our data consists of four different datasets that describe counties across the United States. After cleaning and preprocessing, our final dataset contains 14 variables, including adult obesity, mean income and political stance.

### Why health, Fastfood chains & income data?
One of the problems in some modern welfare states is a tendency toward obesity.

We have chosen health in the US because we would like to study how other social factors may affect one's health. The health data allows us to investigate many potential determinants of obesity in the United States of America: income, exposure to fast-food restaurants, physical health, mental health, smoking habits, drinking habits, employment status and political orientation.

The fast-food chain data is relevant because exposure to fast food can also affect health. Americans appear to spend more money on take-away every year, with the exception of 2020, which was an extraordinary year because of the COVID-19 lockdowns. The income data is just as relevant: sources tell us that almost the same percentage of American income is spent on take-away each year, whereas the percentage spent on homemade food is decreasing.

#### The idea and goal of the project 


## Basic stats. Let's understand the dataset better
* Write about your choices in data cleaning and preprocessing
* Write a short section that discusses the dataset stats, containing key points/plots from your exploratory data analysis.

### Choices in data cleaning and preprocessing

#### County Health Rankings Dataset
The health dataset consists of 3193 rows and 250 columns. Each row corresponds to a county in the US. The first columns contain the FIPS code, the name of the state the county belongs to and the name of the county. The remaining columns describe health factors for each county, such as obesity, smoking, alcohol consumption and education.

226 of the rows have an x in a column named "Unreliable". The column is not explained further in the [data description PDF](https://www.countyhealthrankings.org/sites/default/files/media/document/DataDictionary_2020_2.pdf), but given the column name, these rows will be removed.
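
A minimal sketch of such a filter on a toy frame is shown here. It assumes the flag is stored as the literal string `x` and that unflagged rows are empty; `health_demo` and `reliable` are illustrative names, not part of the project code.

```python
import pandas as pd

# Toy stand-in for the health data (assumption: flagged rows contain "x",
# all other rows are missing in the "Unreliable" column).
health_demo = pd.DataFrame(
    {
        "FIPS": ["01001", "01003", "01005"],
        "County": ["Autauga", "Baldwin", "Barbour"],
        "Unreliable": [None, "x", None],
    }
)

# eq("x") is False for missing values, so negating it keeps unflagged rows.
reliable = health_demo[~health_demo["Unreliable"].eq("x")].reset_index(drop=True)

print(reliable["County"].tolist())  # ['Autauga', 'Barbour']
```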

Because of the way pandas reads a CSV file, the leading zero of a FIPS code can be dropped automatically. `plotly` cannot plot the affected counties, so we pad every code shorter than 5 digits with leading zeros using:

`df['FIPS'] = df['FIPS'].apply(lambda x: '{0:0>5}'.format(x))`
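
The effect can be reproduced on a small in-memory CSV. The sketch below is only an illustration with made-up file content: it shows the leading zero being lost, restored afterwards with `str.zfill`, and alternatively preserved from the start by passing `dtype` to `read_csv`.

```python
import io
import pandas as pd

# Made-up two-row sample in the same semicolon-separated layout.
csv_text = "FIPS;State;County\n01001;Alabama;Autauga\n56043;Wyoming;Washakie\n"

# Default parsing infers integers, so "01001" becomes 1001.
as_int = pd.read_csv(io.StringIO(csv_text), delimiter=";")
print(as_int["FIPS"].tolist())  # [1001, 56043]

# Option 1: restore the padding after loading.
padded = as_int["FIPS"].astype(str).str.zfill(5)

# Option 2: read the column as text so nothing is lost in the first place.
as_str = pd.read_csv(io.StringIO(csv_text), delimiter=";", dtype={"FIPS": str})

print(padded.tolist())          # ['01001', '56043']
print(as_str["FIPS"].tolist())  # ['01001', '56043']
```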

To make it easier to combine the different datasets, a dictionary mapping state names to their abbreviations (`us_state_to_abbrev`) is needed to translate the states. This allows us to use `DataFrame.groupby` to combine the states and counties. It is important to group by both State and County, since some county names repeat across different states.
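
Why both keys matter can be seen on a toy example. The income figures below are invented for illustration; the only real fact used is that a Washington County exists in both Alabama and Georgia.

```python
import pandas as pd

# Invented income rows for two different counties that share a name.
rows = pd.DataFrame(
    {
        "State": ["Alabama", "Alabama", "Georgia"],
        "County": ["Washington", "Washington", "Washington"],
        "Mean": [40000.0, 50000.0, 70000.0],
    }
)

# Grouping by county name alone merges the two counties into one group.
by_county_only = rows.groupby("County")["Mean"].mean()

# Grouping by state and county keeps them apart.
by_state_county = rows.groupby(["State", "County"])["Mean"].mean().reset_index()

print(len(by_county_only))                # 1
print(by_state_county["Mean"].tolist())   # [45000.0, 70000.0]
```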

A few columns also stored floats as strings (with a comma as the decimal separator), which had to be converted to floats before they could be analysed.
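
As a small illustration of that conversion on single values, here is a sketch in plain Python. The helper name `parse_decimal_comma` and the sample values are made up, and the helper assumes the strings contain no thousands separators. When loading with pandas, the `decimal=","` argument of `read_csv` is another way to get the same result.

```python
def parse_decimal_comma(text):
    """Convert a string such as '7,5' (decimal comma) to the float 7.5."""
    return float(text.strip().replace(",", "."))


# Made-up sample values in the decimal-comma style.
raw_values = ["7,5", "3,9", "4"]
parsed = [parse_decimal_comma(value) for value in raw_values]

print(parsed)  # [7.5, 3.9, 4.0]
```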

#### FastFood Chains across America
The fast-food dataset consists of 10000 rows and 14 columns. Each row corresponds to a restaurant in the US, and the columns include the address, the name of the fast-food chain, the state, etc. The dataset has no county column, so we have to derive that information ourselves and create a new column in order to group this data with the other datasets. Since the dataset has a postal code column, we can use `pgeocode` to look up the county for each restaurant.

```
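import pgeocode

# Look up the county name of every restaurant from its postal code
nomi = pgeocode.Nominatim('us')
county_names = []
for i in range(len(FastFood)):
    county_names.append(nomi.query_postal_code(FastFood["postalCode"][i]).county_name)

# Store the result as a new column so it can be used as a merge key
FastFood["County"] = county_names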
```

The variables we are interested in from this dataset are the number of fast-food restaurants in each county and which fast-food chains appear across the states.
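
The per-county count and its attachment to the county table could look roughly like the sketch below. The toy frames are invented, and `size()` is used here as a simple row count, which is a slight variation on counting a single column. Counties without any restaurant come out of the left merge as missing values and are set to 0.

```python
import pandas as pd

# Invented restaurant rows and county list for illustration.
restaurants = pd.DataFrame(
    {
        "State": ["Alabama", "Alabama", "Wyoming"],
        "County": ["Autauga", "Autauga", "Teton"],
        "name": ["McDonald's", "Subway", "Wendy's"],
    }
)
counties = pd.DataFrame(
    {
        "State": ["Alabama", "Alabama", "Wyoming"],
        "County": ["Autauga", "Bibb", "Teton"],
    }
)

# One row per (State, County) with the number of restaurants found there.
counts = (
    restaurants.groupby(["State", "County"])
    .size()
    .reset_index(name="nr of FFchains")
)

# A left merge keeps every county; missing counts mean "no restaurants".
with_counts = counties.merge(counts, how="left", on=["State", "County"])
with_counts["nr of FFchains"] = with_counts["nr of FFchains"].fillna(0).astype(int)

print(with_counts["nr of FFchains"].tolist())  # [2, 0, 1]
```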

#### US Household Income Statistics & Political data

The US Household dataset consists of 32526 rows and 19 columns. Each row corresponds to an area code within a county. Both this dataset and the political data have the word *County* appended to each value in their county column, so we need to remove the last word of each value. From the income dataset we are interested in the `Mean` column, which represents the mean household income, and from the political data in `per_gop`, the Republican share of the vote.
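
One detail worth illustrating is that a plain replacement of `' County'` removes the word wherever it occurs, not only at the end. The sketch below uses an anchored regular expression instead. The helper name and the name "Tri County Area" are made up for the example. The pandas counterpart would be `Series.str.replace(r"\s+County$", "", regex=True)`.

```python
import re


def strip_county_suffix(name):
    """Remove a trailing ' County' but leave the word alone elsewhere."""
    return re.sub(r"\s+County$", "", name.strip())


# "Tri County Area" is a made-up name to show the difference.
names = ["Autauga County", "Baldwin County", "Tri County Area", "Teton"]
cleaned = [strip_county_suffix(name) for name in names]

print(cleaned)  # ['Autauga', 'Baldwin', 'Tri County Area', 'Teton']

# The unanchored replacement would also alter the third name.
print("Tri County Area".replace(" County", ""))  # Tri Area
```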

#### Data Cleaning Code

import numpy as np
import pandas as pd

#loading datasets
health = pd.read_csv("Datasets/rankmd.csv", delimiter=";")
FastFood = pd.read_csv("Datasets/FastFoodRestaurants.csv")
income = pd.read_csv("Datasets/kaggle_income.csv", encoding="ISO 8859-1")
poldata = pd.read_csv("Datasets/2020_US_County_Level_Presidential_Results.csv", delimiter=",")

#dictionary of the states to abbreviation
us_state_to_abbrev = {
    "Alabama": "AL",
    "Alaska": "AK",
    "Arizona": "AZ",
    "Arkansas": "AR",
    "California": "CA",
    "Colorado": "CO",
    "Connecticut": "CT",
    "Delaware": "DE",
    "Florida": "FL",
    "Georgia": "GA",
    "Hawaii": "HI",
    "Idaho": "ID",
    "Illinois": "IL",
    "Indiana": "IN",
    "Iowa": "IA",
    "Kansas": "KS",
    "Kentucky": "KY",
    "Louisiana": "LA",
    "Maine": "ME",
    "Maryland": "MD",
    "Massachusetts": "MA",
    "Michigan": "MI",
    "Minnesota": "MN",
    "Mississippi": "MS",
    "Missouri": "MO",
    "Montana": "MT",
    "Nebraska": "NE",
    "Nevada": "NV",
    "New Hampshire": "NH",
    "New Jersey": "NJ",
    "New Mexico": "NM",
    "New York": "NY",
    "North Carolina": "NC",
    "North Dakota": "ND",
    "Ohio": "OH",
    "Oklahoma": "OK",
    "Oregon": "OR",
    "Pennsylvania": "PA",
    "Rhode Island": "RI",
    "South Carolina": "SC",
    "South Dakota": "SD",
    "Tennessee": "TN",
    "Texas": "TX",
    "Utah": "UT",
    "Vermont": "VT",
    "Virginia": "VA",
    "Washington": "WA",
    "West Virginia": "WV",
    "Wisconsin": "WI",
    "Wyoming": "WY",
    "District of Columbia": "DC",
    "American Samoa": "AS",
    "Guam": "GU",
    "Northern Mariana Islands": "MP",
    "Puerto Rico": "PR",
    "United States Minor Outlying Islands": "UM",
    "U.S. Virgin Islands": "VI",
}
    
# Inverting the dictionary
abbrev_to_us_state = dict(map(reversed, us_state_to_abbrev.items()))

# Creating a state dataset
FastFood['State'] = FastFood['province'].map(abbrev_to_us_state)
States = health.copy()
States = States[States['FIPS'].astype(str).str.endswith('000')]

# Converting FIPS to string
health['FIPS']=health['FIPS'].apply(lambda x: '{0:0>5}'.format(x))

# Converting the food environment index from string (decimal comma) to float
health["food_environment_index_Food Environment Index"] = health["food_environment_index_Food Environment Index"].str.replace(",",".").astype(float)

# Removing ' County' from the county names in income
income["County"] = income.County.str.replace(' County', '')

# Merging income and health data
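# numeric_only=True skips the text columns, which pandas >= 2.0 would otherwise reject
temp_df = income.groupby(["State_Name","County"]).mean(numeric_only=True).reset_index()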
new_df = pd.merge(health.copy(), temp_df.copy(),  how='left', left_on=['State','County'], right_on = ['State_Name','County'])

#ONLY NEEDS TO BE RUN ONCE AS THE COUNTIES ARE STORED IN THE CSV.

#import sys
#!{sys.executable} -m pip install pgeocode
#import pgeocode

#nomi = pgeocode.Nominatim('us')
#county_names = []
#for i in range(len(FastFood)):
#    county_names.append(nomi.query_postal_code(FastFood["postalCode"][i]).county_name)
    
#FastFood["County"] = county_names
#FastFood.to_csv("../Datasets/FastFoodRestaurants.csv")

# Merging fastfood data with income and health data
temptemp = FastFood.groupby(["State", "County"]).count().reset_index()[['State','County','address']]
tempo = temptemp.rename(columns={'address':'nr of FFchains'})
data_df = pd.merge(new_df, tempo,  how='left', left_on=['State','County'], right_on =['State','County'])
data_df['nr of FFchains'] = data_df['nr of FFchains'].fillna(0)

# Removing ' County' from the county names
poldata["county_name"] = poldata.county_name.str.replace(' County', '')

# Merging the political data with the other data
merged=pd.merge(data_df.copy(), poldata.copy(),  how='left', left_on=['State','County'], right_on = ['state_name','county_name'])

# Snipping the columns to a more clean dataset
data = merged[["premature_deathYears_of_Potential_Life_Lost_Rate",'adult_obesity_% Adults with Obesity',
                "adult_smoking_% Smokers", "excessive_drinking_% Excessive Drinking", "food_environment_index_Food Environment Index",
                "uninsured_% Uninsured", "unemployed_% Unemployed", 'nr of FFchains', 'Mean',
                "poor_physical_health_days_Average Number of Physically Unhealthy Days",
                "poor_mental_health_days_Average Number of Mentally Unhealthy Days","per_gop"]]

# Dropping NaNs; the explicit copy avoids a SettingWithCopyWarning on the assignments below
data = data.dropna().copy()

# Creating the response variable for the ML model
data['is_obese'] = data['adult_obesity_% Adults with Obesity']>=33
data = data.drop(['adult_obesity_% Adults with Obesity', 'Mean'],axis=1)

# Cleansing columns, making unemployed percentage an integer and physical and mental unhealthy days floats
data["unemployed_% Unemployed"] = data["unemployed_% Unemployed"].str.replace(",",".").astype(float).astype(int)
data["poor_physical_health_days_Average Number of Physically Unhealthy Days"] = data["poor_physical_health_days_Average Number of Physically Unhealthy Days"].str.replace(",",".").astype(float)
data["poor_mental_health_days_Average Number of Mentally Unhealthy Days"] = data["poor_mental_health_days_Average Number of Mentally Unhealthy Days"].str.replace(",",".").astype(float)

# Storing the merged, cleansed dataset as a csv file
data.to_csv("Datasets/Mixed_data.csv")
merged.to_csv("Datasets/All_data.csv")

merged

| Unnamed: 0 | FIPS | State | County | Unreliable | premature_deathDeaths | premature_deathYears_of_Potential_Life_Lost_Rate | premature_death_95% CILow | premature_death_95% CI - High | premature_death_Quartile | premature_death_YPLL Rate (AIAN) | ... | state_name | county_fips | county_name | votes_gop | votes_dem | total_votes | diff | per_gop | per_dem | per_point_diff |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 01000 | Alabama |  |  | 82249.0 | 9820.0 | 9718.0 | 9922.0 |  | 5145.0 | ... |  |  |  |  |  |  |  |  |  |  |
| 1 | 01001 | Alabama | Autauga |  | 787.0 | 7830.0 | 6998.0 | 8662.0 | 1.0 |  | ... | Alabama | 1001.0 | Autauga | 19838.0 | 7503.0 | 27770.0 | 12335.0 | 0.714368 | 0.270184 | 0.444184 |
| 2 | 01003 | Alabama | Baldwin |  | 3147.0 | 7680.0 | 7237.0 | 8124.0 | 1.0 |  | ... | Alabama | 1003.0 | Baldwin | 83544.0 | 24578.0 | 109679.0 | 58966.0 | 0.761714 | 0.224090 | 0.537623 |
| 3 | 01005 | Alabama | Barbour |  | 515.0 | 11477.0 | 9908.0 | 13045.0 | 3.0 |  | ... | Alabama | 1005.0 | Barbour | 5622.0 | 4816.0 | 10518.0 | 806.0 | 0.534512 | 0.457882 | 0.076631 |
| 4 | 01007 | Alabama | Bibb |  | 476.0 | 12173.0 | 10506.0 | 13839.0 | 4.0 |  | ... | Alabama | 1007.0 | Bibb | 7525.0 | 1986.0 | 9595.0 | 5539.0 | 0.784263 | 0.206983 | 0.577280 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3188 | 56037 | Wyoming | Sweetwater |  | 527.0 | 7775.0 | 6849.0 | 8701.0 | 3.0 |  | ... | Wyoming | 56037.0 | Sweetwater | 12229.0 | 3823.0 | 16603.0 | 8406.0 | 0.736554 | 0.230260 | 0.506294 |
| 3189 | 56039 | Wyoming | Teton |  | 109.0 | 2980.0 | 2094.0 | 3866.0 | 1.0 |  | ... | Wyoming | 56039.0 | Teton | 4341.0 | 9848.0 | 14677.0 | -5507.0 | 0.295769 | 0.670982 | -0.375213 |
| 3190 | 56041 | Wyoming | Uinta |  | 271.0 | 8081.0 | 6637.0 | 9525.0 | 4.0 |  | ... | Wyoming | 56041.0 | Uinta | 7496.0 | 1591.0 | 9402.0 | 5905.0 | 0.797277 | 0.169219 | 0.628058 |
| 3191 | 56043 | Wyoming | Washakie |  | 104.0 | 6541.0 | 4417.0 | 8665.0 | 2.0 |  | ... | Wyoming | 56043.0 | Washakie | 3245.0 | 651.0 | 4012.0 | 2594.0 | 0.808824 | 0.162263 | 0.646560 |


## Data Analysis.
* Describe your data analysis and explain what you've learned about the dataset. If relevant, talk about your machine learning.

## Genre.
* Which genre of data story did you use?
* Which tools did you use from each of the 3 categories of Visual Narrative (Figure 7 in Segel and Heer). Why?
* Which tools did you use from each of the 3 categories of Narrative Structure (Figure 7 in Segel and Heer). Why?

## Visualizations.
* Explain the visualizations you've chosen.
* Why are they right for the story you want to tell?

### Folium map

### Choropleth Map

### 

## Think critically about your creation
* What went well?
* What is still missing? What could be improved? Why?

## Contributions