# Explainer Notebook

## Motivation.
### The datasets
Our data consists of 4 different datasets that describe the counties across the United States. After cleaning and preprocessing, our final datasets contain 14 variables, including adult obesity, mean income and political stance.

### Why health, Fastfood chains & income data?
One of the problems of some modern welfare states is a tendency towards obesity. 

We have chosen health in the US because we would like to study how other social factors may affect one's health. The health data allows us to investigate many potential factors in determining obesity in the United States of America. These factors are: income, exposure to fastfood restaurants, physical health, mental health, smoking habits, drinking habits, employment status and political orientation. 

The fastfood chain data can also have an effect on health. The trend seems to be that Americans spend more money on take-away every year, with the exception of 2020, which was an extraordinary year because of the lockdowns caused by COVID-19. The income data is just as relevant, as sources tell us that the percentage of American income spent on take-away stays almost the same, whereas the percentage spent on homemade food is decreasing.

#### The idea and goal of the project 

The idea arose from the green challenge and the third SDG: Good health and well-being. With data from such a large nation, known for a wide range in the quality of its health culture, the idea of exploring its health arose. This would hopefully give us insight into whether different changes to society could affect the individual's health. 

The goal of the project is to investigate whether the society around you, together with personal factors and factors of your local area, has anything to do with your own health: whether living in a society with lower income, a different political orientation or other characteristics determines how healthy you are on average. In the end, we wanted to tell a story that showcases American health behaviour and whether these different aspects of American life can be put into categories, so that we can predict unhealthy counties based on many other variables describing the average American in different parts of the US. 

## Basic stats. Let's understand the dataset better
* Write about your choices in data cleaning and preprocessing

### Choices in data cleaning and preprocessing

#### County Health Rankings Dataset
The Health Dataset consists of 3193 rows and 250 columns. A row corresponds to a county in the US, and the first columns consist of a FIPS code, the name of the state the county is in and the name of the county. The rest of the columns describe different health factors of each county, such as obesity, smokers, alcoholism, education etc.

226 of the rows in the data have an x in a column named "Unreliable". The column is not explained further in the [PDF of data description](https://www.countyhealthrankings.org/sites/default/files/media/document/DataDictionary_2020_2.pdf), but given the column name, these rows will be removed. 

Due to the way pandas reads a CSV file, the leading zero of a FIPS code can be dropped automatically. `plotly` cannot plot the affected counties, which is why we left-pad the code with zeros whenever it has fewer than 5 digits:

`df['FIPS'] = df['FIPS'].apply(lambda x: '{0:0>5}'.format(x))`
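As a small illustration of the padding step, the sketch below applies the same format string to a few example codes. The helper name `pad_fips` and the sample values are assumptions made for this example only.

```python
def pad_fips(code):
    """Left-pad a FIPS code with zeros to 5 characters, using the same format string."""
    return '{0:0>5}'.format(code)

# Example values: codes that lost their leading zero when read as integers,
# a code that already has five digits, and a code stored as a short string
examples = [1001, 6037, 48201, "1003"]
padded = [pad_fips(code) for code in examples]
print(padded)

# str.zfill gives the same result for these non-negative codes
assert padded == [str(code).zfill(5) for code in examples]
```

If the source file itself stores the leading zeros, reading the column as text (for example `dtype={'FIPS': str}` in `pd.read_csv`) would avoid losing them in the first place.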

To combine the different datasets more easily, a dictionary of the states and their abbreviations (`us_state_to_abbrev`) is needed to translate between the two. This allows us to use `DataFrame.groupby` to combine the states and counties. It is important to group by both State and County, since some county names repeat across different states.
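The effect of grouping on both keys can be seen on a toy table. The rows below are made up for illustration (the income values are not from the dataset); only the fact that a county name such as "Washington" occurs in more than one state matters here.

```python
import pandas as pd

# Made-up toy rows: "Washington" is a county name in more than one state
toy = pd.DataFrame({
    "State": ["Oregon", "Oregon", "Oklahoma", "Oklahoma"],
    "County": ["Washington", "Washington", "Washington", "Tulsa"],
    "Mean": [60000.0, 70000.0, 40000.0, 50000.0],
})

# Grouping on the county name alone mixes counties from different states
by_county = toy.groupby("County")["Mean"].mean()

# Grouping on both keys keeps them apart
by_state_county = toy.groupby(["State", "County"])["Mean"].mean()

print(by_county)
print(by_state_county)
```

With the single key, the two Washington counties collapse into one average; with both keys, each (State, County) pair gets its own row.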

A few columns also had floats represented as strings (with a comma as the decimal separator), which had to be converted into floats before analysis.
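A minimal sketch of that conversion for single values is shown below. The function name `to_float` and the sample values are assumptions for this example; in the notebook the same replacement is done column-wise with `.str.replace(",", ".").astype(float)`.

```python
def to_float(value):
    """Convert a number written with a decimal comma (e.g. '7,4') to a float."""
    if isinstance(value, str):
        return float(value.replace(",", "."))
    return float(value)

# Made-up sample values mixing decimal-comma strings and plain numbers
raw = ["7,4", "3,25", 8, "10"]
parsed = [to_float(v) for v in raw]
print(parsed)
```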

#### FastFood Chains across America
The FastFood Dataset consists of 10000 rows and 14 columns. A row corresponds to a restaurant in the US, and the first columns consist of an address, the name of the fastfood chain, the state etc. This dataset does not have a column for county, so we have to extract that information ourselves and create a new column to group this data together with the other datasets. Since the postal code is a column in the dataset, we can use `pgeocode` to extract the county information for each restaurant. 

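```
import pgeocode

# Look up the county name for each restaurant's postal code
nomi = pgeocode.Nominatim('us')
county_names = []
for i in range(len(FastFood)):
    county_names.append(nomi.query_postal_code(FastFood["postalCode"][i]).county_name)

# Store the result as a new column
FastFood["County"] = county_names
```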

The focus variable we are interested in from this dataset is the number of restaurants in each county, as well as which fastfood chains we can see across the states.

#### US Household Income Statistics & Political data

The US Household Dataset consists of 32526 rows and 19 columns. Each row corresponds to some area code within a county. Both this dataset and the political data have the word *County* appended to each string value in the `County` column, meaning we need to remove the last word of each element in this column. The focus variables we are interested in are the `Mean` column from the income dataset, which represents the mean income for households in that county, and `per_gop` from the political data.
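One way to remove the suffix is a regular expression anchored at the end of the string, so that only a trailing " County" is dropped. This is a sketch with a hypothetical helper name (`strip_county_suffix`); the notebook itself uses a plain `str.replace(' County', '')`.

```python
import re

def strip_county_suffix(name):
    """Remove a trailing ' County' from a county name, leaving other text untouched."""
    return re.sub(r"\s+County$", "", name)

names = ["Orange County", "Los Angeles County", "District of Columbia"]
print([strip_county_suffix(n) for n in names])
```

Entries that do not end in "County" are left unchanged.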



#### Data Cleaning Code
##### Imports


In [None]:
# Importing packages
import sys
import os
import json
import warnings
from urllib.request import urlopen

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
#!{sys.executable} -m pip install folium
import folium
#!{sys.executable} -m pip install geopandas
import plotly.express as px
!{sys.executable} -m pip install colorcet
import colorcet as cc

# GeoJSON with the county boundaries keyed by FIPS code (used by the choropleth maps)
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)

# Ignore warnings
warnings.filterwarnings('ignore')

##### Loading the datasets

In [None]:
#loading datasets
health = pd.read_csv("Datasets/rankmd.csv", delimiter=";")
FastFood = pd.read_csv("Datasets/FastFoodRestaurants.csv")
income = pd.read_csv("Datasets/kaggle_income.csv", encoding="ISO 8859-1")
poldata = pd.read_csv("Datasets/2020_US_County_Level_Presidential_Results.csv", delimiter=",")

##### Cleaning up the data formatting
We added abbreviations for each state to the dataset and cleaned it up by ensuring that the formatting was correct, i.e. strings should be strings, floats should be floats, and so on. 
As our data consists of 4 datasets, we also merged them into more *handy* datasets that were easier to work with.
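The merges in the following cell are left joins on (State, County). A quick way to see which rows found no partner is the `indicator` option of `pd.merge`. The sketch below uses made-up toy tables (`health_toy`, `income_toy`) that stand in for the real data.

```python
import pandas as pd

# Made-up toy tables standing in for the health and income data
health_toy = pd.DataFrame({
    "State": ["Oregon", "Oregon", "Oklahoma"],
    "County": ["Baker", "Benton", "Tulsa"],
    "obesity": [30.0, 25.0, 34.0],
})
income_toy = pd.DataFrame({
    "State_Name": ["Oregon", "Oklahoma"],
    "County": ["Baker", "Tulsa"],
    "Mean": [50000.0, 60000.0],
})

# A left merge keeps every health row; indicator=True adds a "_merge" column
# that records whether a match was found in the right-hand table
merged_toy = pd.merge(
    health_toy, income_toy, how="left",
    left_on=["State", "County"], right_on=["State_Name", "County"],
    indicator=True,
)
unmatched = merged_toy[merged_toy["_merge"] == "left_only"]
print(unmatched[["State", "County"]])
```

Rows marked `left_only` end up with NaN in the merged columns, so they are removed by a later `dropna()` unless the missing values are filled first.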

In [None]:
#dictionary of the states to abbreviation
us_state_to_abbrev = {
    "Alabama": "AL",
    "Alaska": "AK",
    "Arizona": "AZ",
    "Arkansas": "AR",
    "California": "CA",
    "Colorado": "CO",
    "Connecticut": "CT",
    "Delaware": "DE",
    "Florida": "FL",
    "Georgia": "GA",
    "Hawaii": "HI",
    "Idaho": "ID",
    "Illinois": "IL",
    "Indiana": "IN",
    "Iowa": "IA",
    "Kansas": "KS",
    "Kentucky": "KY",
    "Louisiana": "LA",
    "Maine": "ME",
    "Maryland": "MD",
    "Massachusetts": "MA",
    "Michigan": "MI",
    "Minnesota": "MN",
    "Mississippi": "MS",
    "Missouri": "MO",
    "Montana": "MT",
    "Nebraska": "NE",
    "Nevada": "NV",
    "New Hampshire": "NH",
    "New Jersey": "NJ",
    "New Mexico": "NM",
    "New York": "NY",
    "North Carolina": "NC",
    "North Dakota": "ND",
    "Ohio": "OH",
    "Oklahoma": "OK",
    "Oregon": "OR",
    "Pennsylvania": "PA",
    "Rhode Island": "RI",
    "South Carolina": "SC",
    "South Dakota": "SD",
    "Tennessee": "TN",
    "Texas": "TX",
    "Utah": "UT",
    "Vermont": "VT",
    "Virginia": "VA",
    "Washington": "WA",
    "West Virginia": "WV",
    "Wisconsin": "WI",
    "Wyoming": "WY",
    "District of Columbia": "DC",
    "American Samoa": "AS",
    "Guam": "GU",
    "Northern Mariana Islands": "MP",
    "Puerto Rico": "PR",
    "United States Minor Outlying Islands": "UM",
    "U.S. Virgin Islands": "VI",
}
    
# Inverting the dictionary
abbrev_to_us_state = dict(map(reversed, us_state_to_abbrev.items()))

# Creating a state dataset
FastFood['State'] = FastFood['province'].map(abbrev_to_us_state)
States = health.copy()
States = States[States['FIPS'].astype(str).str.endswith('000')]

# Converting FIPS to string
health['FIPS']=health['FIPS'].apply(lambda x: '{0:0>5}'.format(x))

#Setting the food_enviornment index as float instead of string
health["food_environment_index_Food Environment Index"] = health["food_environment_index_Food Environment Index"].str.replace(",",".").astype(float)

# Removing ' County' from the county names in income
income["County"] = income.County.str.replace(' County', '')

# Merging income and health data
temp_df = income.groupby(["State_Name","County"]).mean(numeric_only=True).reset_index()
new_df = pd.merge(health.copy(), temp_df.copy(),  how='left', left_on=['State','County'], right_on = ['State_Name','County'])

#ONLY NEEDS TO BE RUN ONCE AS THE COUNTIES ARE STORED IN THE CSV.

#!{sys.executable} -m pip install pgeocode
#import pgeocode

#nomi = pgeocode.Nominatim('us')
#county_names = []
#for i in range(len(FastFood)):
#    county_names.append(nomi.query_postal_code(FastFood["postalCode"][i]).county_name)
    
#FastFood["County"] = county_names
#FastFood.to_csv("../Datasets/FastFoodRestaurants.csv")

# Merging fastfood data with income and health data
temptemp = FastFood.groupby(["State", "County"]).count().reset_index()[['State','County','address']]
tempo = temptemp.rename(columns={'address':'nr of FFchains'})
data_df = pd.merge(new_df, tempo,  how='left', left_on=['State','County'], right_on =['State','County'])
data_df['nr of FFchains'] = data_df['nr of FFchains'].fillna(0)

# Removing ' County' from the county names in political data
poldata["county_name"] = poldata.county_name.str.replace(' County', '')

# Merging the political data with the other data
merged=pd.merge(data_df.copy(), poldata.copy(),  how='left', left_on=['State','County'], right_on = ['state_name','county_name'])

# Snipping the columns to a more clean dataset
data = merged[["State","premature_deathYears_of_Potential_Life_Lost_Rate",'adult_obesity_% Adults with Obesity',
                "adult_smoking_% Smokers", "excessive_drinking_% Excessive Drinking", "food_environment_index_Food Environment Index",
                "uninsured_% Uninsured", "unemployed_% Unemployed", 'nr of FFchains',
                "poor_physical_health_days_Average Number of Physically Unhealthy Days",
                "poor_mental_health_days_Average Number of Mentally Unhealthy Days","per_gop"]]

# Dropping NaNs
data = data.dropna()

# Creating the binary response variable for the ML model (True if at least 33% of adults have obesity)
data['is_obese'] = data['adult_obesity_% Adults with Obesity']>=33
data = data.drop(['adult_obesity_% Adults with Obesity'],axis=1)

# Cleansing columns, making unemployed percentage an integer and physical and mental unhealthy days floats
data["unemployed_% Unemployed"] = data["unemployed_% Unemployed"].str.replace(",",".").astype(float).astype(int)
data["poor_physical_health_days_Average Number of Physically Unhealthy Days"] = data["poor_physical_health_days_Average Number of Physically Unhealthy Days"].str.replace(",",".").astype(float)
data["poor_mental_health_days_Average Number of Mentally Unhealthy Days"] = data["poor_mental_health_days_Average Number of Mentally Unhealthy Days"].str.replace(",",".").astype(float)

# Storing the merged, cleansed dataset as a csv file
data.to_csv("Datasets/Mixed_data.csv")
merged.to_csv("Datasets/All_data.csv")



* Write a short section that discusses the dataset stats, containing key points/plots from your exploratory data analysis.

## Data Analysis.
* Describe your data analysis and explain what you've learned about the dataset. *If relevant, talk about your machine-learning.

### Statewise plot of health data
As we have a lot of states, we start by plotting, for each state, the total number of premature deaths, the years of potential life lost to premature death, 
the number of fastfood restaurants and the percentage of obesity. 

We do this because our goal is to explore public health and well-being in relation to fastfood restaurants, and obesity is one major factor contributing to premature loss of life. 
Likewise, fastfood can lead to obesity. 

A simple bar plot can give us an indication of how the different states rank. Each state has a specific color, so it is easier to navigate across the four plots, which can otherwise seem quite overwhelming.




In [None]:
# Map the state abbreviations to full state names
FastFood['State'] = FastFood['province'].map(abbrev_to_us_state)
States = health.copy()

# Keep only the state-level rows (their FIPS codes end in '000')
States = States[States['FIPS'].astype(str).str.endswith('000')]

#Define subplots to be a 2x2 grid
fig, axes = plt.subplots(2, 2, sharex=False, figsize=(25,20))
sns.set(rc={'figure.figsize':(12,9)})
sns.despine(left=True, bottom=True)
sns.set_color_codes("muted")

s = States['State'].unique()

palette = sns.color_palette(cc.glasbey, 51)

hue_dic = dict(zip(s, palette))
# Bar plot of premature deaths per state
sns.barplot(ax=axes[0,0], data=States.sort_values("premature_deathDeaths",ascending=False), y='State', x='premature_deathDeaths', palette=hue_dic).set(title='Total amount of Premature Deaths per state',  xlabel='Total number of premature deaths')
# Years of potential life lost per state
sns.barplot(ax=axes[0,1], data=States.sort_values("premature_deathYears_of_Potential_Life_Lost_Rate",ascending=False), y='State', x='premature_deathYears_of_Potential_Life_Lost_Rate', palette=hue_dic).set(title='Amount of Potential years of life lost per state',  xlabel='Potential Life lost rate')
# Number of fastfood restaurants per state
sns.barplot(ax=axes[1,0], data=FastFood.groupby(['State']).count().reset_index().sort_values("address",ascending=False), y='State', x='address', palette=hue_dic).set(title='Amount of Fastfood Restaurants per state', xlabel='Total number of Fastfood Restaurants')
# Percentage of obesity per state 
sns.barplot(ax=axes[1,1], data=States.sort_values('adult_obesity_% Adults with Obesity',ascending=False), y='State', x='adult_obesity_% Adults with Obesity', palette=hue_dic).set(title='Percentage of Obese people per state', xlabel='Percentage of adult people with obesity')

# Save subplot for website hosting as png file
#fig.savefig("../Visulisations/4x4plot.png") 

### Fastfood restaurants

We plot the *"popularity"* of the fastfood restaurants to find our top 10. This will be our focus point, 
as the top 10 accounts for the "major" contribution of restaurants across the US. 
We already saw a plot with many colors, so to make things easier going forward we decided to give each of the top 10 restaurants its own distinctive color.  
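The colour index built in the next cell can equivalently be expressed as a dictionary lookup on a normalised chain name. The sketch below is one way to write that for single names; the helper names (`normalise`, `colour_index`) and the restaurant "Joe's Diner" are hypothetical and only serve the example.

```python
import re

# Chain names in the order used for the colour index; index 0 is the 'Other' group
TOP_CHAINS = ["mcdonalds", "burger king", "taco bell", "wendys", "arbys",
              "kfc", "subway", "sonic drive in", "dominos pizza"]
CHAIN_TO_INDEX = {chain: i + 1 for i, chain in enumerate(TOP_CHAINS)}

def normalise(name):
    """Lower-case a restaurant name, drop quotes and commas, turn hyphens into spaces."""
    return re.sub(r"[\"',]", "", name.lower()).replace("-", " ")

def colour_index(name):
    """Return 1..9 for the named chains and 0 for every other restaurant."""
    return CHAIN_TO_INDEX.get(normalise(name), 0)

print([colour_index(n) for n in ["McDonald's", "Sonic Drive-In", "Joe's Diner"]])
```

Normalising first means that spelling variants such as "McDonald's" and "Mcdonalds" map to the same group.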

In [None]:
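# Normalising the chain names (lower case, no quotes or commas, hyphens as spaces) so that spelling variants match
FastFood["Chain"] = FastFood["name"].str.lower().str.replace(r"[\"\',]", '', regex=True).str.replace("-", " ", regex=False)

# Creating index to the corresponding type of fastfood restaurant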
colorindex=(FastFood["Chain"]=="mcdonalds")+(FastFood["Chain"]=="burger king")*2+(FastFood["Chain"]=="taco bell")*3+(FastFood["Chain"]=="wendys")*4+(FastFood["Chain"]=="arbys")*5+(FastFood["Chain"]=="kfc")*6+(FastFood["Chain"]=="subway")*7+(FastFood["Chain"]=="sonic drive in")*8+(FastFood["Chain"]=="dominos pizza")*9

# Colors for the fastfood restaurants 
# Yellow Mcdonalds, Orange for Burger King, Purple for Taco bell, Red for Wendys, Green for Arbys, KFC Beige, Dark green Subway, Blue for Sonic, Dark blue for Domino's, Dark grey for Other
restaurants=['Other',"Mcdonald's", 'Burger King', 'Taco Bell', "Wendy's", "Arby's", 'KFC', 'Subway', 'Sonic drive-in', "Domino's Pizza"]
colors=["#4A4A4A","#FFFF00", "#FF6103", "#8A2BE2","#FF3030", "#00C957", "#FFE4C4", "#228B22", "#1E90FF", "#104E8B"]

In [None]:
# Appending the dark color for each of the other chains that are in the 'Other' group
for i in range(len(FastFood["Chain"].unique())-9):
    colors.append("#4A4A4A")

# Normalising the chain names (regex=True is needed for the character class in recent pandas versions)
FastFood["Chain"] = FastFood["name"].str.lower().str.replace(r"[\"\',]", '', regex=True).str.replace("-", " ", regex=False)

# Bar plot
sns.barplot(data=FastFood.groupby(['Chain']).count().reset_index().sort_values('address',ascending=False)[0:50], y='Chain', x='address', palette=colors).set(title='Number of Restaurants for the 50 most popular chains', xlabel='Number of Restaurants')

# Saving the figure to plot in the website
#plt.savefig('../docs/exploring_the_data/stat2.png')

Knowing the major contributors of restaurants across the US, it was time to create a map of their locations to see how they accumulate throughout the country. We expect that a vast majority of the restaurants will be around the major cities. 
A Folium map, with the colors decided previously for the 10 major restaurants and `other`, fits this nicely.

In [None]:
# Coordinates for USA
lat, lon = 37.77919, -100.41914 

# Lists of coordinates for fastfood restaurants
lat_list = list(FastFood.latitude)
lon_list = list(FastFood.longitude)

# Creating map and point layers
map_FF = folium.Map([lat, lon], tiles = "Stamen Toner", zoom_start=4)
point_layer = [folium.FeatureGroup(name=name) for name in restaurants]

# Creating the points to the map in the respective layer
for i in range(len(FastFood)):
    point_layer[colorindex[i]].add_child(folium.Circle(location=[lat_list[i], lon_list[i]], radius=50,
    color=colors[colorindex[i]],
    fill=True,
    # Adding HTML styling for more detail to the pop up.
    popup=folium.Popup(f"""<b>Name: </b>  {restaurants[colorindex[i]]} <br>
                               <b>Address: </b>  {FastFood["address"][i]} <br>
                               <b>State: </b>  {FastFood["State"][i]} <br>
                               <b>County: </b> {FastFood["County"][i]}
                            """, max_width=len(f"name= {restaurants[colorindex[i]]}")*20),
    fill_color="#3186cc", tooltip = restaurants[colorindex[i]])).add_to(map_FF)
    
# Adding all point layers to the map, followed by the layer control
for layer in point_layer:
    map_FF.add_child(layer)
map_FF.add_child(folium.LayerControl())

# Showing the map
map_FF.save("Fastfood_locations.html") # Save map as HTML for hosting on other GitHub Repo
map_FF


To our surprise, a vast majority of the restaurants were located on the east side of the country. We expected a more even distribution. 
However, after realising that 36% of the United States' population lives on the east side **[[14]](https://worldpopulationreview.com/state-rankings/east-coast-states)**, it made a lot more sense that the restaurants would accumulate there. 


Looking at the West Coast distribution, we see a pattern of restaurants accumulating around the big population centres, for example in Washington, in Los Angeles and in Las Vegas, Nevada. 

### Exploring the Food Environment Index 
With knowledge of how the states and counties performed health-wise, and knowing where the fastfood restaurants are located, we explored the `Food Environment Index` score. 
It says something about the salary in an area relative to the possibility of buying healthy food, and about the distribution of grocery stores. 

The goal was to find some sort of correlation between a bad food environment index and perhaps the ease of access to fastfood as a quick and cheap alternative to driving further to a grocery store. 

In [None]:
# Create a barplot in decreasing order of the state and the corresponding food environment index score
sns.barplot(data=States.sort_values("food_environment_index_Food Environment Index",ascending=False), y='State', x='food_environment_index_Food Environment Index', palette=hue_dic).set(title='Food Environment Index per state',  xlabel='Food Environment Index')
# Save the plot for GitHub Repository hosting
# plt.savefig('../Visulisations/FoodEnv.png', dpi=300)

New Jersey came out on top. California, which previously had the second-lowest percentage of obese people and the most fastfood chains, was also in the top range of the scoring. 

To explore this county-wise, we created a *choropleth map*. Its heatmap overview makes it easier for us to explore the counties and areas, and also lets readers explore for themselves.

#### Exploring the county-wise Food Environment Index

As the location of each fastfood restaurant was available, we tried to get an idea of the distribution of the food environment index county-wise. 
This could perhaps reveal a pattern between the restaurant locations and the scores.

In [None]:
# Create Plotly Choropleth map
fig = px.choropleth_mapbox(health, geojson=counties, locations='FIPS', color="food_environment_index_Food Environment Index",
                           color_continuous_scale="Viridis",
                           range_color=(0, 10),
                           mapbox_style="carto-positron",
                           zoom=3, center = {"lat": 37.77919, "lon": -100.41914},
                           opacity=0.5,
                           labels={'food_environment_index_Food Environment Index':'Food Environment Index'},
                           hover_data=["State", "County"] # Add additional info to mouse-overs (Hovers) 
                          )

fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

# Save the plot as HTML, for hosting on the GitHub Repository website. 
#fig.write_html("foodindx.html")

One thing to extract from the plot is that the top-ranked states by FEI (Food Environment Index) are located on the east side of the country. Likewise, they are mostly northern states. 

As we have the option within our data and our goal is to describe obesity and food quality in the US, we decided to create a similar plot; just describing the adult obesity percentage in each county. 

#### County Obesity percentage 

In [None]:
# Create Plotly Choropleth map with the obesity percentage county-wise. 
fig = px.choropleth_mapbox(health.fillna(0), geojson=counties, locations='FIPS', color='adult_obesity_% Adults with Obesity',
                           color_continuous_scale="Viridis",
                           range_color=(0, health['adult_obesity_% Adults with Obesity'].max()),
                           mapbox_style="carto-positron",
                           zoom=3.3, center = {"lat": 37.77919, "lon": -100.41914},
                           opacity=0.5,
                           labels={'adult_obesity_% Adults with Obesity':'Adult obesity percentage'},
                           hover_data=["State", "County"] # Add additional info to mouse-overs (Hovers) 
                          )

fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

# Save as HTML for GitHub  Repository hosting
#fig.write_html("obesitypercentage.html")

From the above plots of `Adult obesity percentage` and `Food Environment Index` we see a correlation, especially on the west coast: counties that have a low percentage of obese citizens usually also have a fairly high FEI. 

To further explore the data, we also plot the political orientation in 2020 to see if there is any correlation between political belief and obesity (health).
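A visual impression like this can be backed by a number, for example the Pearson correlation coefficient between the two columns. The sketch below implements the coefficient in plain Python; the five value pairs are made up for hypothetical counties and are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values for five hypothetical counties (not taken from the dataset)
fei = [6.0, 7.0, 8.0, 8.5, 9.0]
obesity = [38.0, 35.0, 30.0, 29.0, 24.0]
r = pearson(fei, obesity)
print(round(r, 2))
```

A value close to -1 would support the reading that a higher FEI goes together with lower obesity. On the real data, the pandas method `Series.corr` computes the same coefficient between two columns.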

In [None]:
# Load the prepared dataframe from the data handling in the beginning.
tmp_data = pd.read_csv("Datasets/All_data.csv", delimiter=",", index_col=0)
# Zero-pad the FIPS codes so they have their leading zeros
tmp_data['FIPS']=tmp_data['FIPS'].apply(lambda x: '{0:0>5}'.format(x))
data2 = pd.read_csv("Datasets/Mixed_data.csv", delimiter=",", index_col=0)

# import State and County to All_data -- Again #Hotfixing
tmp_data["State"] = health["State"]
tmp_data["County"] = health["County"]
tmp_data["is_obese"] = data2["is_obese"]

# Rename columns as a hotfix
tmp_data = tmp_data.rename(columns={"per_dem": "Democrate percentage", "per_gop": "Republican percentage"})


In [None]:
# Create Plotly Choropleth map with the Republican vote share county-wise. 
fig = px.choropleth_mapbox(tmp_data, geojson=counties, locations='FIPS', color='Republican percentage',
                           color_continuous_scale="Bluered",
                           range_color=(0, 1),
                           mapbox_style="carto-positron",
                           zoom=3.3, center = {"lat": 37.77919, "lon": -100.41914},
                           opacity=0.5,
                           labels={'Republican percentage':'Republican percentage'},
                           hover_data= ["State", "County", "Democrate percentage", "is_obese"]# Add additional info to mouse-overs (Hovers) 
                          )

fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

fig.write_html("republicancentage.html")

### Machine learning modelling


Based on what we learned about the data, we set out to **predict obesity in the counties of a state**. 
Rather than focusing only on the attributes we described above, we wanted to create a model based on the entire feature set of all the data we had. The thought was that we could finish with a variance-explained-by-feature exploration to check which attributes contributed the most to whether or not a county in a state was obese. To set a reasonable goal, we aimed to predict whether or not at least a third (33%) of the adults in a county were obese. 

#### Preparations

We start by preparing the mixed data, which consists of data from multiple datasets. In the data preparation we saved a file with the mixed dataset. This is loaded and used here. 

In [None]:
# From the mixed data sets
# Loading the data
data = pd.read_csv("Datasets/Mixed_data.csv", delimiter=",", index_col=0)


Create a correlation plot to identify possibly correlated features.

Create a temporary dataframe with nicer labels for the correlation heatmap.

In [None]:
prettylabels=['Years of potential life lost rate', 'Adult smokers %', 'Excessive drinkers %', 'Food Environment Index', 'Uninsured %', 'Unemployed %', 'No. fastfood restaurants','No physically unhealthy days', 'No mentally unhealthy days', 'Percentage Republican Voters']
data.head()
tmp_data = data.rename(columns = {"State": "State", "premature_deathYears_of_Potential_Life_Lost_Rate": prettylabels[0], "adult_smoking_% Smokers": prettylabels[1], "excessive_drinking_% Excessive Drinking": prettylabels[2], "food_environment_index_Food Environment Index": prettylabels[3], "uninsured_% Uninsured": prettylabels[4], "unemployed_% Unemployed": prettylabels[5], "nr of FFchains": prettylabels[6], "poor_physical_health_days_Average Number of Physically Unhealthy Days": prettylabels[7],"poor_mental_health_days_Average Number of Mentally Unhealthy Days": prettylabels[8], "per_gop": prettylabels[9], "is_obese": "Is_obese"})

In [None]:

# 'State' is a text column, so it is left out of the correlation matrix
sns.heatmap(tmp_data.drop(columns=["State"]).corr(), cmap="YlGnBu", annot=True )
# Save for later
plt.savefig('corrHeat.png', dpi=300, bbox_inches='tight')

From the heatmap we learn which features are correlated. 
Interestingly, political beliefs seem to have some correlation with the general health of a county. 


As we need test and training data, and the test data cannot be included in the training data, we have to set some states aside, so that the data is unknown to the model when predicting. 

For this we chose three states: Oregon (west coast), Oklahoma (central) and Pennsylvania (east). We chose three states in different locations because we saw previously that the distribution of restaurants mattered for the FEI, but not for the obesity percentage. By holding out 3 states from 3 different parts of the country, we could explore the relation of health between the east, middle and west side of the country. 

For simplicity we start by setting up a `DecisionTreeClassifier`. From there we can improve the model and see whether having *multiple* `DecisionTreeClassifiers`, i.e. a `RandomForestClassifier`, has an effect on the model performance.

In [None]:
# Further imports relevant to the Machine Learning part
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix

# Set random seed for replication of results. 
random_state = 42

# Define model classifier
clf = DecisionTreeClassifier(random_state=random_state)

# Split data into the values to predict (y) and the model features (X)
y = data['is_obese']
X = data.drop(['is_obese'],axis=1)


In [None]:
# Define target states and split into train and test data! 
target_states = ['Oregon', 'Oklahoma', 'Pennsylvania']
X_train = data[~data['State'].isin(target_states)]
X_test = data[data['State'].isin(target_states)]
y_train = X_train["is_obese"]
y_test = X_test["is_obese"]
X_train = X_train.drop(['is_obese', 'State'],axis=1)
X_test = X_test.drop(['is_obese', 'State'],axis=1)

#### Training a basic model

In [None]:
# Training the model
clf = clf.fit(X_train, y_train)

# Predicting obesity based on features and data.     
y_pred = clf.predict(X_test)

# Extract accuracy measures 
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
#return tn, fp, fn, tp
# Plot a confusion matrix
cf_matrix = confusion_matrix(y_test, y_pred)


# Customising the confusion matrix with a heatmap and scores. 
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
                    cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
                        cf_matrix.flatten()/np.sum(cf_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
heatmap = sns.heatmap(cf_matrix, annot=labels, fmt='')
fig = heatmap.get_figure()

fig.show()
fig.savefig("confusionMatrix.png")

In [None]:
accuracy = (tp+tn)/(tp+tn+fp+fn)
precision = tp/(tp+fp)
recall = tp/(tp+fn)
F1 = 2*precision*recall/(precision + recall)

print("Accuracy of the model:", accuracy)
print("Precision of the model:", precision)
print("recall of the model:", recall)
print("F1-score of the model:", F1)
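The four scores printed above follow directly from the confusion-matrix counts. The sketch below wraps the same formulas in a hypothetical helper (`classification_scores`) and guards against division by zero, which can occur when a model predicts no positives at all. The counts in the example are made up and only show the arithmetic.

```python
def classification_scores(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up counts, only to show the arithmetic
scores = classification_scores(tn=50, fp=10, fn=5, tp=35)
print(scores)
```

The functions `accuracy_score`, `precision_score`, `recall_score` and `f1_score` in `sklearn.metrics` compute the same quantities directly from `y_test` and `y_pred`.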

As the scores above reveal, there is definitely room for improvement. This can be done by trying a `RandomForestClassifier`, which creates a *forest* of `DecisionTreeClassifiers` and classifies by majority vote.
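The majority-vote idea can be sketched in a few lines. The tree predictions below are hypothetical and stand in for the outputs of five fitted trees on three counties. Note that scikit-learn's implementation combines the trees by averaging their predicted class probabilities rather than counting hard votes, so this is a simplified picture of the principle.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label predicted by most trees (ties go to the label seen first)."""
    return Counter(votes).most_common(1)[0][0]

# Hypothetical predictions from five trees for three counties
tree_predictions = [
    [True, True, False],    # tree 1
    [True, False, False],   # tree 2
    [False, True, False],   # tree 3
    [True, True, True],     # tree 4
    [True, False, False],   # tree 5
]

# Transpose so that each entry holds all votes for one county
per_county = list(zip(*tree_predictions))
forest_prediction = [majority_vote(votes) for votes in per_county]
print(forest_prediction)
```

Because individual trees make different mistakes, the combined vote tends to be more stable than any single tree.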

#### Setting up a `RandomForestClassifier`

In [None]:
# Defining the RandomForestClassifier model

clf = RandomForestClassifier(random_state=random_state)

# Training the model using the splits made above

model = clf.fit(X_train, y_train)

# Predicting with the trained model 

y_pred = model.predict(X_test)


# Getting the scores as previously, with the confusion matrix.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
#return tn, fp, fn, tp
cf_matrix = confusion_matrix(y_test, y_pred)

group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
                    cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
                        cf_matrix.flatten()/np.sum(cf_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
heatmap = sns.heatmap(cf_matrix, annot=labels, fmt='')
fig = heatmap.get_figure()
fig.savefig("confusionMatrix_RF.png")

In [None]:
accuracy = (tp+tn)/(tp+tn+fp+fn)
precision = tp/(tp+fp)
recall = tp/(tp+fn)
F1 = 2*precision*recall/(precision + recall)

print("Accuracy of the model:", accuracy)
print("Precision of the model:", precision)
print("recall of the model:", recall)
print("F1-score of the model:", F1)

Definitely an improvement over the `DecisionTreeClassifier`. However, the `RandomForestClassifier` has a lot of *hyperparameters* that can be tuned to further improve the model's predictions. 

We start with a `RandomizedSearchCV`, which samples random combinations of parameters and evaluates each of them with 5-fold cross-validation. 

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from pprint import pprint

# Number of trees in the random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num=10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10, 20, 30, 40, 50]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4, 6, 8, 10, 15, 20]
# Method of selecting samples for training each tree
bootstrap = [True, False]



# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}


**THE FOLLOWING SECTION HAS COMMENTED-OUT CODE TO AVOID RETRAINING A MODEL - UNCOMMENT IT TO RE-FIND THE HYPERPARAMETERS - OR CONTINUE AND USE THE PARAMETERS ALREADY FOUND** 

Uncomment the following section to run the `RandomizedSearchCV`. It is commented out due to time constraints.
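
An alternative to commenting code in and out is to cache the hyperparameters on disk once they have been found. The sketch below is one minimal way to do that, not what the notebook does: `load_or_search`, `fake_search` and the file name `best_params.json` are hypothetical, and in the notebook the search function would fit the search and return `rf_random.best_params_`.

```python
import json
import tempfile
from pathlib import Path


def load_or_search(cache_path, run_search):
    """Return cached hyperparameters, calling `run_search` only if no cache exists."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return json.loads(cache_path.read_text())
    params = run_search()
    cache_path.write_text(json.dumps(params, indent=2))
    return params


calls = []

def fake_search():
    # Hypothetical stand-in for the expensive search; records each invocation
    calls.append(1)
    return {"n_estimators": 150, "max_depth": 35, "bootstrap": True}


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "best_params.json"
    first = load_or_search(cache, fake_search)   # runs the search and writes the cache
    second = load_or_search(cache, fake_search)  # reads the cache, no second search

print("Same parameters:", first == second, "| searches run:", len(calls))
```

JSON can store the integers, booleans, strings and `None` values used in the grid above, so the cached dictionary can be passed straight back to the classifier.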


In [None]:
## Create base model as previously
#rf = RandomForestClassifier(random_state=random_state)
## Use the classifier in the randomized search CV
#rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, n_iter=100, cv = 5, verbose=1, random_state=random_state, n_jobs=-1)
#
#rf_random.fit(X_train, y_train)

To get more information on the progress, increase the `verbose` level.
With the randomized search performed, we can print the parameters and further narrow down the search for the best model by using `GridSearchCV`.

In [None]:
# Showing best params
#rf_random.best_params_

Based on the above, we search a narrower grid centred on the parameters found by the randomized search, in the hope of finding the same or an even better set of *hyperparameters* for the model.
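
The narrowing step can be sketched as a small helper that builds a few candidate values around each integer parameter and keeps every other parameter fixed. This is only an illustration: `narrow_grid`, its `rel_width` and `n_values` arguments and the example values in `best` are hypothetical and were not part of the original search.

```python
def narrow_grid(best_params, rel_width=0.2, n_values=3):
    """Build a small grid around each integer value in `best_params`."""
    grid = {}
    half = n_values // 2
    for name, value in best_params.items():
        # bool is a subclass of int, so booleans, strings and None are kept as-is
        if isinstance(value, bool) or not isinstance(value, int):
            grid[name] = [value]
            continue
        step = max(1, round(value * rel_width))
        candidates = {value + step * k for k in range(-half, half + 1)}
        # Drop non-positive candidates; note that min_samples_split must still be >= 2
        grid[name] = sorted(c for c in candidates if c >= 1)
    return grid


# Hypothetical output of a randomized search
best = {"bootstrap": True, "max_depth": 40, "max_features": "sqrt",
        "min_samples_leaf": 20, "min_samples_split": 30, "n_estimators": 200}

param_grid_demo = narrow_grid(best)
for name, values in param_grid_demo.items():
    print(name, values)
```

The resulting dictionary has the shape `GridSearchCV` expects for `param_grid`; in practice the candidate values are often adjusted by hand, as in the grid below.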

Uncomment the following section to perform the `GridSearchCV`, or continue and use the model already found.

In [None]:
#from sklearn.model_selection import GridSearchCV
## Based on the parameters we got (uncomment to retrieve the same)
#param_grid = {
#    'bootstrap': [True],
#    'max_depth': [35,37,40,43,46],
#    'max_features': ["auto"],
#    'min_samples_leaf': [17,20,22,25],
#    'min_samples_split': [25, 27, 30 ,33, 35],
#    'n_estimators': [100, 150, 175, 200, 225, 250, 300]
#}
## Define Random grid variable for the classifier
#rf_random_grid = RandomForestClassifier(random_state=random_state)
#
#grid_search = GridSearchCV(estimator=rf_random_grid, param_grid=param_grid, cv = 5, n_jobs=-1, verbose=1)
#grid_search.fit(X_train, y_train)

In [None]:
#grid_search.best_params_

We can now define the model with the *hyperparameters* printed above.

In [None]:
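# "sqrt" is what the older "auto" option meant for classifiers; "auto" was removed in recent scikit-learn releases
final_model = RandomForestClassifier(bootstrap=True, max_depth=35, max_features="sqrt", min_samples_leaf=20, min_samples_split=25, n_estimators=150, random_state=random_state)

# Fit the final model to the training data
final_model.fit(X_train, y_train)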



In [None]:
# Predicting with the model
y_pred = final_model.predict(X_test)


# Defining a confusion matrix as previously
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
#return tn, fp, fn, tp
cf_matrix = confusion_matrix(y_test, y_pred)

group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
                    cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
                        cf_matrix.flatten()/np.sum(cf_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
heatmap = sns.heatmap(cf_matrix, annot=labels, fmt='')
fig = heatmap.get_figure()
fig.show()
fig.savefig("confusionMatrix_RF_tuned.png")

In [None]:
accuracy = (tp+tn)/(tp+tn+fp+fn)
precision = tp/(tp+fp)
recall = tp/(tp+fn)
F1 = 2*precision*recall/(precision + recall)

print("Accuracy of the model:", accuracy)
print("Precision of the model:", precision)
print("recall of the model:", recall)
print("F1-score of the model:", F1)

With the hyperparameter tuning, the model performance did not increase by much. However, the recall, i.e. the share of truly obese counties that the model identifies (`True positive` out of all actual positives), increased. 
This is quite useful in a model such as this one, as we would rather flag one county too many as obese than miss one: a false positive does not affect *human life* and at worst leads to additional resources being put into combating obesity. 
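
Since the same four scores are computed after every model, they can be collected in one small helper. The sketch below is one possible way to do that; `classification_scores` is a hypothetical name, the zero-division guards are an addition, and the counts in the example are made up for illustration rather than taken from the model.

```python
def classification_scores(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    # Precision: of all counties predicted obese, how many really are
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of all truly obese counties, how many the model found
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up counts, only to show the call
scores = classification_scores(tn=50, fp=10, fn=5, tp=35)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

In the notebook the helper could be called as `classification_scores(*confusion_matrix(y_test, y_pred).ravel())`, since `ravel()` returns the counts in the order `tn, fp, fn, tp`.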

To get an idea of how the model predicts, we plot the 10 most significant features along with the standard deviations of their importance across the trees. Since the project goal is to predict obesity and learn the factors behind it, with a focus on fastfood restaurants, it is interesting to see which features the model considers most important among the data we had available.  

In [None]:
prettylabels=['Years of potential life lost rate', 'Adult smokers %', 'Excessive drinkers %', 'Food Environment Index', 'Uninsured %', 'Unemployed %', 'No. fastfood restaurants','No physically unhealthy days', 'No mentally unhealthy days', 'Percentage Republican Voters']
importances = pd.DataFrame({'Features':X_test.columns,
                            'Importance':final_model.feature_importances_, 'Labels':prettylabels})
# Standard deviation of each feature's importance across the trees in the forest
importances['Std'] = np.std([tree.feature_importances_ for tree in final_model.estimators_], axis=0)
# Sort once so that bars, error bars and tick labels all share the same order
importances = importances.sort_values("Importance", ascending=False)
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(15,5))
sns.despine(left=True, bottom=True)
sns.barplot(data=importances, y='Features', x='Importance', ax=ax)
# Replace the raw column names on the y-axis with the readable labels
ax.set_yticks(range(len(importances)))
ax.set_yticklabels(importances["Labels"])
# Error bars are drawn at the bar positions, using the std that belongs to each feature
ax.errorbar(x=importances['Importance'], y=range(len(importances)), xerr=importances['Std'], fmt='o', ecolor='black')

plt.savefig("Barplot.png")

From the above, smoking, political belief, mentally unhealthy days and excessive drinking are among the five most important features. The number of fastfood restaurants is still significant, but not as much as the other features! However, keep in mind the effect that smoking has on many things; the Food Environment Index and the number of fastfood restaurants still have quite a significant impact on whether a county is obese or not. 

### Predictions

To evaluate the model, we plotted the results of our predictions. 

In [None]:
import ssl
comp={'is_obese': y_test, 'predictions': y_pred, 'correct': y_test == y_pred, 'FIPS':merged['FIPS'][X_test.index], 'State':merged['State'][X_test.index]} 
comparison=pd.DataFrame(comp)
#comparison.reset_index()
#data['FIPS']=data['FIPS'].apply(lambda x: '{0:0>5}'.format(x))
ssl._create_default_https_context = ssl._create_unverified_context

# fig = px.choropleth_mapbox(data, geojson=counties, locations='FIPS', color="is_obese",
#                            color_continuous_scale="Viridis",
#                            range_color=(0, 10),
#                            mapbox_style="carto-positron",
#                            zoom=3, center = {"lat": 37.77919, "lon": -100.41914},
#                            opacity=0.5,
#                            labels={'food_environment_index_Food Environment Index':'Food Enviornement Index'}
#                           )

# Plotting, per county, whether the prediction was correct
fig = px.choropleth_mapbox(comparison, geojson=counties, locations='FIPS', color="correct",
                           color_continuous_scale="Viridis",
                           range_color=(0, 10),
                           mapbox_style="carto-positron",
                           zoom=3, center = {"lat": 37.77919, "lon": -100.41914},
                           opacity=0.5,
                           labels={'correct': 'Correct prediction'}  # labels must be a dict, not a set
                          )


fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

fig.write_html("predctive_overview.html")

In [None]:
import plotly.figure_factory as ff

def figures_to_html(figs, filename="dashboard.html"):
    with open(filename, 'w') as dashboard:
        dashboard.write("<html><head></head><body>" + "\n")
        for fig in figs:
            inner_html = fig.to_html().split('<body>')[1].split('</body>')[0]
            dashboard.write(inner_html)
        dashboard.write("</body></html>" + "\n")

#creating a small comparison dataframe to easily plot the data
comp={'is_obese': y_test, 'predictions': y_pred, 'correct': y_test == y_pred, 'FIPS':merged['FIPS'][X_test.index], 'State':merged['State'][X_test.index]} 
comparison=pd.DataFrame(comp)

#Defining States and Fips for each state and a colorscale for True/False
Oregon_values = comparison[comparison['State'] == 'Oregon']['correct'].tolist()
Oregon_fips = comparison[comparison['State'] == 'Oregon']['FIPS'].tolist()
Oklahoma_values = comparison[comparison['State'] == 'Oklahoma']['correct'].tolist()
Oklahoma_fips = comparison[comparison['State'] == 'Oklahoma']['FIPS'].tolist()
Penn_values = comparison[comparison['State'] == 'Pennsylvania']['correct'].tolist()
Penn_fips = comparison[comparison['State'] == 'Pennsylvania']['FIPS'].tolist()
colorscale = ["#E60000","#0000CD"]

#Creating figure for Oklahoma
fig1 = ff.create_choropleth(
    fips=Oklahoma_fips, values=Oklahoma_values, scope=['Oklahoma'],
    colorscale=colorscale, round_legend_values=True,
    simplify_county=0, simplify_state=0,
    county_outline={'color': 'rgb(15, 15, 55)', 'width': 0.5},
    state_outline={'width': 1},
    legend_title='Correct Prediction',
    title='Oklahoma'
)

#Creating figure for Oregon
fig2 = ff.create_choropleth(
    fips=Oregon_fips, values=Oregon_values, scope=['Oregon'],
    colorscale=colorscale, round_legend_values=True,
    simplify_county=0, simplify_state=0,
    county_outline={'color': 'rgb(15, 15, 55)', 'width': 0.5},
    state_outline={'width': 1},
    legend_title='Correct Prediction',
    title='Oregon'
)

#Creating figure for Pennsylvania
fig3 = ff.create_choropleth(
    fips=Penn_fips, values=Penn_values, scope=['Pennsylvania'],
    colorscale=colorscale, round_legend_values=True,
    simplify_county=0, simplify_state=0,
    county_outline={'color': 'rgb(15, 15, 55)', 'width': 0.5},
    state_outline={'width': 1},
    legend_title='Correct Prediction',
    title='Pennsylvania'
)

#Due to some flaw in the package, the hover data gets erased. A hotfix for it was found and implemented below.
#The hotfix more or less duplicates the intended hover text and adds it back on
def fix_hover(fig, state):
    hover_ix, hover = [(ix, t) for ix, t in enumerate(fig['data']) if t.text][0]
    df_sample_r = comparison[comparison['State'] == state]
    # mismatching lengths indicates bug
    if len(hover['text']) != len(df_sample_r):

        ht = pd.Series(hover['text'])

        no_dupe_ix = ht.index[~ht.duplicated()]

        hover_x_deduped = np.array(hover['x'])[no_dupe_ix]
        hover_y_deduped = np.array(hover['y'])[no_dupe_ix]

        new_hover_x = [x if type(x) == float else x[0] for x in hover_x_deduped]
        new_hover_y = [y if type(y) == float else y[0] for y in hover_y_deduped]

        fig['data'][hover_ix]['text'] = ht.drop_duplicates()
        fig['data'][hover_ix]['x'] = new_hover_x
        fig['data'][hover_ix]['y'] = new_hover_y

#Hotfix for the Oregon map
fix_hover(fig2, 'Oregon')
#Same hotfix for Pennsylvania
fix_hover(fig3, 'Pennsylvania')

figures_to_html([fig1,fig2,fig3], filename = 'Predict_states.html')

#figures_to_html([fig1], filename = 'Oregon.html')
#figures_to_html([fig2], filename = 'Oklahoma.html')
#figures_to_html([fig3], filename = 'Pennsylvania.html')
#with open('p_graph.html', 'a') as f:
 #   f.write(fig1.to_html(full_html=True, include_plotlyjs='cdn'))
  #  f.write(fig2.to_html(full_html=True, include_plotlyjs='cdn'))
   # f.write(fig3.to_html(full_html=True, include_plotlyjs='cdn'))

In [None]:
hover_ix, hover = [(ix, t) for ix, t in enumerate(fig2['data']) if t.text][0]
df_sample_r = comparison[comparison['State'] == 'Oregon']
# mismatching lengths indicates bug
if len(hover['text']) != len(df_sample_r):

    ht = pd.Series(hover['text'])

    no_dupe_ix = ht.index[~ht.duplicated()]

    hover_x_deduped = np.array(hover['x'])[no_dupe_ix]
    hover_y_deduped = np.array(hover['y'])[no_dupe_ix]

    new_hover_x = [x if type(x) == float else x[0] for x in hover_x_deduped]
    new_hover_y = [y if type(y) == float else y[0] for y in hover_y_deduped]

    fig2['data'][hover_ix]['text'] = ht.drop_duplicates()
    fig2['data'][hover_ix]['x'] = new_hover_x
    fig2['data'][hover_ix]['y'] = new_hover_y

fig2.show()

# Thoughts on the model

Overall the model predicts quite well. One wouldn't expect a model with 100% accuracy, so getting most counties right across 3 different regions of the US makes for a pretty decent model, especially considering that the culture is fairly standardized and yet so different from the west to the east of the US. 
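
To put a number on "most of them right" per state, the share of correct predictions can be grouped by state. The sketch below uses a tiny made-up frame, `toy`, with the same column names as the `comparison` dataframe; the values are hypothetical and only show the mechanics.

```python
import pandas as pd

# Made-up rows with the same columns as the notebook's `comparison` frame
toy = pd.DataFrame({
    "State": ["Oregon", "Oregon", "Oklahoma", "Oklahoma", "Oklahoma", "Pennsylvania"],
    "is_obese": [False, True, True, True, False, False],
    "predictions": [False, False, True, True, True, False],
})
toy["correct"] = toy["is_obese"] == toy["predictions"]

# Mean of a boolean column is the share of True values, i.e. the accuracy per state
per_state = toy.groupby("State")["correct"].agg(accuracy="mean", n_counties="size")
print(per_state)
```

Reporting the number of test counties next to the accuracy helps, since a state with only a handful of test counties gives a noisy estimate.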

To further improve the model, one could try other, more advanced classification methods such as neural networks, or apply for example PCA to reduce the number of input dimensions and gain an overall simpler model. 
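
The PCA idea could look roughly like the sketch below, which chains scaling, PCA and a random forest in one pipeline. It was not run on the county data: the synthetic data, the 95% explained-variance threshold and all names (`X_demo`, `pca_forest`, `n_kept`) are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for this sketch, not the county data)
X_demo, y_demo = make_classification(n_samples=400, n_features=10,
                                     n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

pca_forest = make_pipeline(
    StandardScaler(),        # PCA is sensitive to feature scale
    PCA(n_components=0.95),  # keep enough components to explain 95% of the variance
    RandomForestClassifier(n_estimators=100, random_state=0),
)
pca_forest.fit(X_tr, y_tr)

n_kept = pca_forest.named_steps["pca"].n_components_
test_accuracy = pca_forest.score(X_te, y_te)
print("Components kept:", n_kept, "of", X_demo.shape[1])
print("Test accuracy:", test_accuracy)
```

One trade-off is that the principal components are mixtures of the original variables, so the feature importance plot would become harder to interpret.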

# Genre.
The genre we chose to use is **Magazine Style** with some **Slide Show** features. Our focus from the beginning was the story and how the visualizations should fit together with it. The style of telling the story came more naturally as we began to construct visualizations, analyze them and build a webpage. 

We chose this genre since our visualizations are overall quite detailed and the reader has the opportunity to spend time exploring the visualizations in depth. The corresponding text is also rather in depth, but we use bold text and emojis to make it more readable, so that skimming the respective webpages should also give an impression of what is going on, with the opportunity to go in depth by reading and exploring the interactive plots. 

There is some linearity in the story, but it is not strictly bound: the sidebar makes it easy for the reader to jump to the headlines that spark interest, and each page has an introduction at the beginning, so it is not strictly necessary to have read the other pages, although that would be the intuitive way to click around. Therefore it can be argued that it is mostly **author-driven** with regard to the somewhat **linear ordering of scenes** and as there is (if the reader chooses to go in depth with the text) a lot of **messaging**. With regard to the webpage structure and (some of) the visualizations it is **reader-driven**, as the reader has the opportunity to click around and explore the **interactive website** and **interactive map plots**. 

## Visual Narrative

| **Visual structuring** 	| *** 	|
|---:	|---:	|
| Establishing Shot/Splash Screen 	| + 	|
| Consistent Visual Platform  	| + 	|
| Progress Bar/Timebar  	| - 	|
| "Checklist" Progress Tracker 	| - 	|
| **Highlight** 	| *** 	|
| Close-Ups 	| + 	|
| Feature Distinction  	| - 	|
| Character Direction  	| - 	|
| Motion 	| + 	|
| Audio 	| - 	|
| Zooming 	| + 	|
| **Transition Guidance** 	| *** 	|
| Familiar Objects  	| + 	|
| Viewing Angle  	| - 	|
| Viewer (Camera) Motion  	| - 	|
| Continuity Editing 	| - 	|
| Object Continuity 	| - 	|
| Animated Transitions 	| - 	|

**Visual Structuring** 


We chose to have an establishing shot for our web page to give a more fun and inspiring introduction to reading about what we did in the project. We chose a consistent visual platform because the plots themselves ranged from barplots and maps to correlation and confusion matrices, so a consistent background was used to create more consistency across the pages. It would have been nice with a progress bar, but we valued spending more time on the visualizations than on implementing one. 

**Highlight** 


The interactive maps gave the opportunity to zoom in and take a close look at the fastfood chains and the heat maps. Furthermore, the three chosen test states were zoomed in on in the model evaluation, to get a better view of the county data. Motion was necessary within the webpage, since there were large plots and text on each page; however, the motion was limited to scrolling. 

**Transition Guidance** 


We added emojis to the text as familiar objects, both to make an easier visual connection to the text and to make it more fun to read. Furthermore, there were buttons to guide the reader back and forth, together with the sidebar. 



## Narrative Structure

| **Ordering** 		| *** |
|---:	|---:	|
| Random Access 	| + 	|
| User Directed Path  	| - 	|
| Linear  	| + 	|
| **Interactivity** 	| *** | 
| Hover Highlighting/Details 	| + 	|
| Filtering/Selection/Search  	| - 	|
| Navigation Buttons  	| + 	|
| Very Limited Interactivity 	| - 	|
| Explicit Instruction 	| - 	|
| Tacit Tutorial 	| - 	|
| Stimulating default views 	| + 	|
| **Messaging** 	| ***	|
| Captions/Headlines 	| + 	|
| Annotations  	| - 	|
| Accompanying Article  	| + 	|
| Multi-messaging 	| + 	|
| Comment Repetition 	| - 	|
| Introductory Text 	| + 	|
| Summary 	| + 	|

**Ordering**


The storyline is overall linear, guided by back and forth buttons at the bottom of the pages, with the possibility to access independent topics through the sidebar. 

**Interactivity** 


All the maps were interactive, with a hover tool both for the counties and for the fastfood restaurants. Navigation was provided by the buttons at the bottom of the pages and by the ordering of the sidebar. Stimulating default views were used for the interactive maps, as they started in a view of the whole USA. It could have been considered whether the start page of each topic could be more stimulating, but it also shouldn't be overwhelming. 

**Messaging**  


All the topics had headlines and subtopics which were foldable. The visualizations had an accompanying article, which was necessary since the plots were not self-explanatory and could not stand on their own. Multi-messaging was used in a way, by showing both barplots and maps. The project had both introductory text and a summary in the discussion section. 




## Visualizations.
* We've chosen normal bar plots for a general understanding of rankings, as the vast majority of users will be familiar with these and know how to interpret them. They also perform very well at presenting basic elements where we want to show rankings; for example, we are able to show by which margin the restaurant chain McDonalds is dominant compared to others. 

* Likewise, to give an understanding of model performance, we've used confusion matrices to present the number of true positives, false negatives etc., rendered as heatmaps. 
* For the last plot, showing the feature importances in the model, we use a mix of a *bar* & *lollipop* plot to visualise the importance scores as well as the standard deviations. 
* For normal location and pinning we've used Folium maps, as they are easy to work with. Likewise, we've used Plotly choropleth maps for heat maps with geolocations. These types of plots were chosen as they allow for a "default" view, where we can tell our approach to the story. Meanwhile they add interactivity, which allows the reader to explore for themselves how the data might present itself and find new elements.


# Think critically about your creation

* What went well 

The data we found was very useful: the health, fastfood and political data all had almost all counties included. The data was also useful in terms of visualizing it geographically. 

* What is still missing

The income data was not very useful. Some time was spent on trying to interpolate the missing counties, but that only worked in a sort of pixel-wise sense, which is why we ended up leaving it out of the machine learning model. Furthermore, the county dataset was hard to visualize in other ways than on a geo map, since there were thousands of counties. The data could be aggregated by state, which was used in some barplots; this was not the easiest way to visualize it, but it made it easier to compare states, whereas the maps gave more of an overview. Finally, we had some thoughts regarding sex- and race-segregated data, which are also discussed on the webpage under 'Discussion'. 

# Contributions
* Fastfood locations - Michella
* Health data statewise - Gustav
* Food Quality - Peetz
* Political Orientation - Michella
* Income - Gustav
* Constructing and training model - Peetz
* Feature evaluating - Michella
* Prediction evaluation - Gustav
* Webpage - Peetz