# Explainer Notebook

## Motivation.
* What is your dataset?
* Why did you choose this/these particular dataset(s)?
* What was your goal for the end user's experience?

## Basic stats. Let's understand the dataset better
* Write about your choices in data cleaning and preprocessing
* Write a short section that discusses the dataset stats, containing key points/plots from your exploratory data analysis.

### Choices in data cleaning and preprocessing

#### County Health Rankings Dataset
The Health Dataset consists of 3193 rows and 250 columns. Each row corresponds to a county in the US, and the first columns contain the FIPS code, the name of the state the county belongs to, and the name of the county. The remaining columns describe different health factors of each county, such as obesity, smoking, alcoholism, and education.

226 of the rows in the data have an `x` in a column named "Unreliable". The column is not explained further in the [PDF of the data description](https://www.countyhealthrankings.org/sites/default/files/media/document/DataDictionary_2020_2.pdf), but given the column name, these rows are removed.
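A minimal sketch of this filtering step is shown here. It assumes the flag is the literal string `x` and that unflagged rows are empty in that column. The small frame is made-up illustration data, not the real dataset.

```python
import pandas as pd

# Hypothetical miniature of the health data; all values are made up.
health = pd.DataFrame({
    "FIPS": ["01001", "01003", "01005"],
    "Unreliable": [None, "x", None],
    "Obesity": [0.33, 0.31, 0.41],
})

# Assumption: a row is flagged when "Unreliable" holds an "x"
# (compared case-insensitively, ignoring surrounding whitespace).
flagged = health["Unreliable"].fillna("").astype(str).str.strip().str.lower() == "x"

# Keep only the rows that are not flagged.
health_clean = health[~flagged].reset_index(drop=True)
```

Checking `flagged.sum()` against the expected 226 rows on the real data would be a quick way to confirm that the assumed marker matches.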

Depending on how pandas reads the CSV, the FIPS column can be parsed as a number, which drops the leading zero. `plotly` then cannot plot the counties in those states, which is why we need to pad the code with zeros on the left whenever it is shorter than 5 digits, using:

`df['FIPS'] = df['FIPS'].apply(lambda x: '{0:0>5}'.format(x))`
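As an alternative sketch (not necessarily what was used in this project), the leading zero can be preserved at read time by loading the column as text, or restored afterwards with `str.zfill`. The CSV snippet below is made up for illustration.

```python
import io
import pandas as pd

# Made-up CSV snippet; the real file has many more columns.
csv_text = "FIPS,State,County\n01001,Alabama,Autauga\n56045,Wyoming,Weston\n"

# Option 1: read FIPS as text so the leading zero is never dropped.
df = pd.read_csv(io.StringIO(csv_text), dtype={"FIPS": str})

# Option 2: the column was parsed as integers, so pad it back to 5 digits.
df_int = pd.read_csv(io.StringIO(csv_text))
df_int["FIPS"] = df_int["FIPS"].astype(str).str.zfill(5)
```

Both options give the same 5-character strings. The first avoids the round trip through integers entirely.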

To make it easier to combine the different datasets, a dictionary mapping each state to its abbreviation (`us_state_to_abbrev`) is needed to translate the states. This allows us to use pandas `groupby` to combine the states and counties. It is important to group by both State and County, since some county names repeat across different states.
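The following sketch illustrates why both keys matter. The two-entry dictionary is only an excerpt of `us_state_to_abbrev`, and the rows and the `Obesity` values are made up.

```python
import pandas as pd

# Two-entry excerpt of the mapping; the real dictionary covers every state.
us_state_to_abbrev = {"Alabama": "AL", "Georgia": "GA"}

# Made-up rows. Jefferson County exists in both Alabama and Georgia.
health = pd.DataFrame({
    "State": ["Alabama", "Georgia"],
    "County": ["Jefferson", "Jefferson"],
    "Obesity": [0.30, 0.35],
})

# Translate full state names to abbreviations.
# Names missing from the dictionary would become NaN here.
health["State"] = health["State"].map(us_state_to_abbrev)

# Grouping by County alone merges the two different counties into one group.
groups_county_only = health.groupby("County").ngroups

# Grouping by State and County keeps them apart.
groups_state_county = health.groupby(["State", "County"]).ngroups
```

With these rows, grouping on `County` alone yields a single group, while grouping on both columns yields two.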

A few rows also had floats stored as strings, which had to be converted to floats before they could be analysed.
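One way to do this conversion is `pandas.to_numeric`, sketched below on a made-up column. Turning unparseable entries into `NaN` with `errors="coerce"` is an assumption for illustration; the real data may call for different handling.

```python
import pandas as pd

# Made-up column mixing real floats with floats stored as text.
values = pd.Series([12.5, "13.1", "9.8", "n/a"])

# Convert everything to floats; entries that cannot be parsed become NaN.
numeric = pd.to_numeric(values, errors="coerce")

# Count how many entries could not be converted.
n_failed = int(numeric.isna().sum())
```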

#### FastFood Chains across America
The FastFood Dataset consists of 10000 rows and 14 columns. Each row corresponds to a restaurant in the US, and the first columns contain the address, the name of the fast-food chain, the state, etc. This dataset does not have a county column, so we have to extract that information ourselves and create a new column in order to group this data together with the other datasets. Since the postal code is a column in the dataset, we can use `pgeocode` to look up the county of each restaurant.

```
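import pgeocode

# Look up the county of each restaurant from its postal code
nomi = pgeocode.Nominatim('us')
county_names = []
for postal_code in FastFood["postalCode"]:
    county_names.append(nomi.query_postal_code(postal_code).county_name)

# Store the result as a new column so it can be used as a grouping key
FastFood["County"] = county_names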
```

The variables we are interested in from this dataset are the count of how many chains there are in each county and which fast-food chains we can see across the states.
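These counts can be sketched with a `groupby`. The column names `name` and `State` are assumptions about the FastFood data, and the rows are made up.

```python
import pandas as pd

# Made-up rows; "name" and "State" are assumed column names.
fastfood = pd.DataFrame({
    "name": ["ChainA", "ChainB", "ChainA", "ChainA"],
    "State": ["AL", "AL", "AL", "GA"],
    "County": ["Jefferson", "Jefferson", "Mobile", "Fulton"],
})

# Per county: number of restaurants (rows) and number of distinct chains.
per_county = (
    fastfood.groupby(["State", "County"])
    .agg(n_restaurants=("name", "size"), n_chains=("name", "nunique"))
    .reset_index()
)

# Per state: how many restaurants each chain has.
chains_per_state = (
    fastfood.groupby(["State", "name"]).size().reset_index(name="n_restaurants")
)
```

Which of the two per-county numbers to plot depends on whether the story is about restaurant density or chain variety.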

#### US Household Income Statistics [link](https://www.kaggle.com/datasets/goldenoakresearch/us-household-income-stats-geo-locations)

The US Household Dataset consists of 32526 rows and 19 columns. Each row corresponds to an area code within a county. Every value in the `County` column has the word *County* appended, so we need to remove the last word of each element in this column. The variable we are interested in from this dataset is the `Mean` column, which represents the mean income for households in that county.
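A minimal sketch of this clean-up follows, using made-up rows. The `State` column name is an assumption. Averaging the area-level `Mean` values per county is shown only as one possible aggregation, and it is an unweighted average.

```python
import pandas as pd

# Made-up rows; "State" is an assumed column name.
income = pd.DataFrame({
    "State": ["AL", "AL", "GA"],
    "County": ["Mobile County", "Mobile County", "Fulton County"],
    "Mean": [40000, 50000, 60000],
})

# Remove the last word of each entry ("Mobile County" becomes "Mobile").
# A single-word entry has nothing to split and is left unchanged.
income["County"] = income["County"].str.rsplit(" ", n=1).str[0]

# Assumption: collapse the area-level rows into one value per county.
county_income = income.groupby(["State", "County"], as_index=False)["Mean"].mean()
```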

## Data Analysis.
* Describe your data analysis and explain what you've learned about the dataset.
* If relevant, talk about your machine-learning.

## Genre.
* Which genre of data story did you use?
* Which tools did you use from each of the 3 categories of Visual Narrative (Figure 7 in Segel and Heer)? Why?
* Which tools did you use from each of the 3 categories of Narrative Structure (Figure 7 in Segel and Heer)? Why?

## Visualizations.
* Explain the visualizations you've chosen.
* Why are they right for the story you want to tell?

### Folium map

### Choropleth Map

### 

## Think critically about your creation
* What went well?
* What is still missing? What could be improved? Why?

## Contributions