# US State Food and Demographic Data Analysis

John Platt

Matt McClements

December 6, 2022

In this notebook, given a database, we aim to analyze data on obesity scores, socioeconomic status and fast food popularity across each state to answer our main question:

**How is fast food restaurant choice across the US states affected by the number of given fast food restaurants in each state, obesity rates, unemployment rates, and median household income?**





# Our Database (US_State_Food.db)

Our database has six different tables described in the table below.

The database file is **needed** to run this notebook and can be downloaded here: https://drive.google.com/file/d/1mrtOndnzf2zMonpI70WKeDyNZOZlWGTG/view?usp=sharing

Database Table Name|Description|Primary key
-|-|-
obesity_by_state|Shows the overall obesity score and overall rank for each US State (50 rows)|StateID
category_by_state|Shows the rank for different health categories across each state (150 rows)|CategoryID
state_numfastfood|Shows the number of certain fast food restaurants per 100,000 people in each state (500 rows)|FastFoodID
state_fast_food|Shows the data for all fast food restaurants, full service restaurants, and most popular fast food restaurants in each state (50 rows)|StateID
state_medincome|Shows the data for current median income across each state (50 rows)|StateID
state_unemp|Shows the data for the unemployment rate in 2021 and 2022 across each state (100 rows)|unempID
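
Before running any analysis, it can help to confirm that the downloaded file really contains the tables listed above. The following is a minimal sketch using only the standard sqlite3 module. The helper name `table_row_counts` and the small in-memory stand-in database are our own illustration (assumptions, not part of the notebook's functions or the real US_State_Food.db contents).

```python
import sqlite3

def table_row_counts(conn):
    """Return {table_name: row_count} for every table in an open connection."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    # The names come from sqlite_master itself, so quoting them here is safe.
    return {name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names}

# Stand-in database with two tiny tables (illustrative rows, not the real data).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE obesity_by_state (StateID INTEGER PRIMARY KEY, State TEXT)")
demo.execute("CREATE TABLE state_unemp (unempID INTEGER PRIMARY KEY, State TEXT)")
demo.executemany("INSERT INTO obesity_by_state VALUES (?, ?)",
                 [(1, "Alabama"), (2, "Alaska")])
demo.executemany("INSERT INTO state_unemp VALUES (?, ?)",
                 [(1, "Alabama"), (2, "Alabama"), (3, "Alaska")])

counts = table_row_counts(demo)
print(counts)
```

With the real file, we would pass `sqlite3.connect('US_State_Food.db')` instead of the in-memory connection and compare the counts against the table above.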

# Functions we are using to perform analysis

Since we want to create all sorts of plots that describe and analyze our data, we will create functions (all involve extracting data from our database) that help to break this complex problem into smaller pieces.

# execute_sql(db, sql_q)

Takes in the database file and an SQL query and returns the query result as a list of tuples (one tuple per row).


In [None]:
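def execute_sql(db, sql_q):
  '''Given a database file and an SQL query, return the query result as a list of tuples (one tuple per row).'''
  import sqlite3 as sql
  conn = sql.connect(db)
  try:
    #Run the query and collect every row before the connection is closed
    return conn.execute(sql_q).fetchall()
  finally:
    conn.close()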

# execute_sqldf(queryStatement,dbFile,columnList) function

This function will allow us to obtain any subset of data from our database sorted and filtered however we like. This function returns the subset of data obtained from our database as a pandas dataframe.

In [None]:
def execute_sqldf(queryStatement,dbFile,columnList):
  '''Given a query, database, and list of columns we are selecting, return a dataframe that illustrates the query execution'''
  import sqlite3 as sql
  import pandas as pd
  connection=sql.connect(dbFile)
  cursor=connection.cursor()
  exe=cursor.execute(queryStatement)
  return pd.DataFrame(exe.fetchall(),columns=columnList)
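
As a possible alternative, pandas can read a query result directly and take the column names from the result set, so the column list does not have to be passed separately. This is only a sketch: the helper name `query_to_df` and the in-memory stand-in table are our own assumptions, and the two income values are copied from the query output shown later in this notebook.

```python
import sqlite3
import pandas as pd

def query_to_df(conn, query, params=()):
    """Run a query on an open connection; column names come from the result set."""
    return pd.read_sql_query(query, conn, params=params)

# Stand-in table with an assumed two-column layout (not the real database file).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE state_medincome (State TEXT, HouseholdIncome INTEGER)")
conn.executemany("INSERT INTO state_medincome VALUES (?, ?)",
                 [("New Jersey", 85245), ("Maryland", 87063)])

income_df = query_to_df(
    conn,
    "SELECT State, HouseholdIncome FROM state_medincome ORDER BY HouseholdIncome DESC")
print(income_df)
```

The trade-off is that the caller manages the connection, which can be convenient when several queries run against the same file.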

# rankbycategory(state,dbFile) function

This function allows us to determine where a given state in the US ranks for our three given categories (Food & Fitness, Health Consequences, Obesity & Overweight Prevalence). It prints the rankings for all three categories of a given State. There is no return statement.

In [None]:
def rankbycategory(state,dbFile):
  '''Given a state and database, determine the rankings for a given state across all three categories. There is no return. This function prints
  the rankings for the different categories for a state.'''
  import sqlite3 as sql
  import pandas as pd
  connection=sql.connect(dbFile)
  cursor=connection.cursor()
  exe=cursor.execute('SELECT * FROM category_by_state')
  cat_df=pd.DataFrame(exe.fetchall(),columns=['CategoryID','State','Category','Ranking'])
  print(cat_df.loc[cat_df['State']==state,['State','Category','Ranking']].set_index('State'))
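
The function above loads the whole table and filters it in pandas. A variation, sketched below, lets SQLite do the filtering with a bound parameter, which also avoids building the state name into the query string. The helper name `category_ranks` is hypothetical and the rankings in the stand-in table are made up for illustration.

```python
import sqlite3

def category_ranks(conn, state):
    """Return [(Category, Ranking), ...] for one state using a bound parameter."""
    rows = conn.execute(
        "SELECT Category, Ranking FROM category_by_state "
        "WHERE State = ? ORDER BY Category",
        (state,))
    return rows.fetchall()

# Stand-in table; the ranking values are invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE category_by_state "
             "(CategoryID INTEGER PRIMARY KEY, State TEXT, Category TEXT, Ranking INTEGER)")
conn.executemany(
    "INSERT INTO category_by_state (State, Category, Ranking) VALUES (?, ?, ?)",
    [("Ohio", "Food & Fitness", 20),
     ("Ohio", "Health Consequences", 15),
     ("Ohio", "Obesity & Overweight Prevalence", 12),
     ("Utah", "Food & Fitness", 48)])

ohio_ranks = category_ranks(conn, "Ohio")
print(ohio_ranks)
```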

# state_comp_boxplots(db,table1,table2,tab1_cols,tab2_cols,cat_var,num_var,xlab,ylab) function

This function allows us to combine subsets of data extracted from two different tables in our database that have 50 rows and create comparative boxplots that will show distributions of a numerical variable across different versions of a categorical variable (categorical variable from our combined data on x axis and numerical variable from our combined data on y axis). It returns nothing. The output of the function is the comparative boxplots being displayed.

In [None]:
def state_comp_boxplots(db,table1,table2,tab1_cols,tab2_cols,cat_var,num_var,xlab,ylab):
  '''
  Given a database, two tables from that database, subsets of data from the two tables, a categorical variable to be on the x axis
  of the comparative boxplots, and a numerical variable to be on the y axis, create comparative boxplots that display distributions of a numerical
  variable across different versions of a categorical variable. This function assumes we are grabbing data from the tables of our database with
  stateID as the primary key and we want/need to combine data from two different tables. It also assumes that one of our tables has data on the most
  popular fast food restaurant in each state.
  Parameters:
  db - string - Our database file
  table1 - string - The name of the first table we want to extract data from
  table2 - string - The name of the second table we want to extract data from
  tab1_cols - string - The set of columns we want to extract from the first table
  tab2_cols - string - The set of columns we want to extract from the second table
  cat_var - string - The categorical variable for our comparative boxplots
  num_var - string - The numerical variable for our comparative boxplots
  xlab - string - The x label for our plots
  ylab - string - The y label for our plots
  Return:
  None. The output of the function is the comparative boxplots being displayed
  '''
  import pandas as pd
  import plotly.express as px
  #Obtain pandas dataframe by selecting certain columns from the first table
  df1=execute_sqldf(f'SELECT {tab1_cols} FROM {table1}',db,tab1_cols.split(',')).set_index('StateID')
  #Obtain pandas dataframe by selecting certain columns from the second table
  df2=execute_sqldf(f'SELECT {tab2_cols} FROM {table2}',db,tab2_cols.split(',')).set_index('StateID')
  #Combine the two pandas dataframes
  comb_df=pd.concat([df1,df2],axis=1,sort=True)
  #Create comparative boxplots
  fig=px.box(comb_df,x=cat_var,y=num_var, color='mostPopular',hover_data=['State'])
  fig.update_xaxes(title_text=xlab)
  fig.update_yaxes(title_text=ylab)
  fig.show()
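
The combination step above relies on both dataframes being indexed by StateID. A sketch of the same idea with an explicit key is shown below, using `merge` so that pandas checks that each StateID appears once on each side. The West Virginia and Mississippi scores and West Virginia's favorite come from this notebook's outputs; "Placeholder Burger" is a made-up value, since Mississippi's favorite is not shown in this section.

```python
import pandas as pd

fast_food = pd.DataFrame({
    "StateID": [148, 124],
    "State": ["West Virginia", "Mississippi"],
    "mostPopular": ["McDonald's", "Placeholder Burger"],  # second value is invented
})
# Deliberately listed in a different row order than fast_food.
obesity = pd.DataFrame({
    "StateID": [124, 148],
    "Total_Score": [72.33, 74.6],
})

# validate="one_to_one" raises an error if a StateID is duplicated on either side.
merged = fast_food.merge(obesity, on="StateID", how="inner", validate="one_to_one")
print(merged)
```

Because the rows are matched on the StateID values, the result does not depend on the order in which each table lists the states.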

# state_comp_boxplots2(db,table1,table2,tab1_cols,tab2_cols,cat_var,num_var,xlab,ylab) function

This function allows us to combine subsets of data of different lengths extracted from two different tables in our database and create comparative boxplots that will show distributions of a numerical variable across different versions of a categorical variable (categorical variable from our combined data on x axis and numerical variable from our combined data on y axis). It returns nothing. The output of the function is the comparative boxplots being displayed.

In [None]:
def state_comp_boxplots2(db,table1,table2,tab1_cols,tab2_cols,cat_var,num_var,xlab,ylab):
  '''
  Given a database, two tables from that database, subsets of data from the two tables, a categorical variable to be on the x axis
  of the comparative boxplots, and a numerical variable to be on the y axis, create comparative boxplots that display distributions of a numerical
  variable across different versions of a categorical variable. This function assumes we want/need to combine data from two different tables and one of those tables
  includes unemployment data across different years.
  Parameters:
  db - string - Our database file
  table1 - string - The name of the first table we want to extract data from
  table2 - string - The name of the second table we want to extract data from
  tab1_cols - string - The set of columns we want to extract from the first table
  tab2_cols - string - The set of columns we want to extract from the second table
  cat_var - string - The categorical variable for our comparative boxplots
  num_var - string - The numerical variable for our comparative boxplots
  xlab - string - The x label for our plots
  ylab - string - The y label for our plots
  Return:
  None. The output of the function is the comparative boxplots being displayed
  '''
  import pandas as pd
  import plotly.express as px
  #Obtain pandas dataframe by selecting certain columns from the first table
  df1=execute_sqldf(f'SELECT {tab1_cols} FROM {table1}',db,tab1_cols.split(',')).set_index('State')
  #Obtain pandas dataframe by selecting certain columns from the second table
  df2=execute_sqldf(f'SELECT {tab2_cols} FROM {table2}',db,tab2_cols.split(',')).set_index('State')
  #Combine the two pandas dataframes
  comb_df=df1.join(df2,how='inner')
  #Add state as column
  comb_df['State']=comb_df.index
  #Create comparative boxplots
  fig=px.box(comb_df,x=cat_var,y=num_var,color='Unemp_Year', hover_data=['State'])
  fig.update_xaxes(title_text=xlab)
  fig.update_yaxes(title_text=ylab)
  fig.show()
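
To see what joining tables of different lengths does, here is a small sketch with toy data. When the unemployment table has two rows per state (one per year), the inner join repeats each state's row from the other table once per year. The column name `Unemp_Rate`, the restaurant labels, and all values below are made up for illustration; only `Unemp_Year` and `mostPopular` are names used in this notebook.

```python
import pandas as pd

# One row per state (placeholder restaurant labels).
popular = pd.DataFrame({
    "State": ["Alabama", "Alaska"],
    "mostPopular": ["Restaurant A", "Restaurant B"],
}).set_index("State")

# Two rows per state, one for each year (invented rates).
unemp = pd.DataFrame({
    "State": ["Alabama", "Alabama", "Alaska", "Alaska"],
    "Unemp_Year": [2021, 2022, 2021, 2022],
    "Unemp_Rate": [3.4, 2.6, 6.4, 4.5],
}).set_index("State")

combined = popular.join(unemp, how="inner")
print(combined)
```

The combined frame has four rows, so each state contributes one point per year to the boxplots.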

# state_barplots(db,table,cols,cat_var,num_var,color_var,xlab,ylab) function

This function extracts a single subset of data in our database and creates a bar plot (categorical variable on x axis and numerical variable on y axis) colored by one of the variables in the subset of data. It has no return. It displays the bar plots.

In [None]:
def state_barplots(db,table,cols,cat_var,num_var,color_var,xlab,ylab):
  '''
  Given a database, a table from that database, a subset of columns from that table, a categorical variable to be on the x axis
  of the barplot, a numerical variable to be on the y axis, and a color variable, create a barplot that displays a numerical
  variable across different versions of a categorical variable. This function assumes we don't want/need to combine data from two different tables.
  Parameters:
  db - string - Our database file
  table - string - The name of the table we want to extract data from
  cols - string - The set of columns we want to extract from the table
  cat_var - string - The categorical variable for our barplot
  num_var - string - The numerical variable for our barplot
  color_var - string - The variable we are coloring each of our bars by
  xlab - string - The x axis label for our plots
  ylab - string - The y axis label for our plots
  Return:
  None. The output of the function is the barplots being displayed
  '''
  import pandas as pd
  import plotly.express as px
  #Obtain pandas dataframe by selecting certain columns from the first table
  df=execute_sqldf(f'SELECT {cols} FROM {table}',db,cols.split(','))
  #Create barplot
  fig=px.bar(df, x=cat_var, y=num_var, color=color_var)
  fig.update_xaxes(title_text=xlab)
  fig.update_yaxes(title_text=ylab)
  fig.show()

# state_barplots2(db,table1,table2,tab1_cols,tab2_cols,cat_var,num_var,xlab,ylab) function
This function extracts and combines two subsets of data extracted from our database and creates a bar plot (categorical variable on x axis and numerical variable on y axis) colored by one of the variables in the combined data. It has no return. It displays the bar plots.

In [None]:
def state_barplots2(db,table1,table2,tab1_cols,tab2_cols,cat_var,num_var,xlab,ylab):
  '''
  Given a database, two tables from that database, subsets of data from the two tables, a categorical variable to be on the x axis
  of the barplots, a numerical variable to be on the y axis, and a color var, create barplots that display distributions of a numerical
  variable across different versions of a categorical variable. This function assumes we want/need to combine data from two different tables and one of those tables
  includes data on most popular fast food restaurant.
  Parameters:
  db - string - Our database file
  table1 - string - The name of the first table we want to extract data from
  table2 - string - The name of the second table we want to extract data from
  tab1_cols - string - The set of columns we want to extract from the first table
  tab2_cols - string - The set of columns we want to extract from the second table
  cat_var - string - The categorical variable for our barplots
  num_var - string - The numerical variable for our barplots
  xlab - string - The x axis label for our plots
  ylab - string - The y axis label for our plots
  Return:
  None. The output of the function is the barplots being displayed
  '''
  import pandas as pd
  import plotly.express as px
  #Obtain pandas dataframe by selecting certain columns from the first table
  df1=execute_sqldf(f'SELECT {tab1_cols} FROM {table1}',db,tab1_cols.split(',')).set_index('State')
  #Obtain pandas dataframe by selecting certain columns from the second table
  df2=execute_sqldf(f'SELECT {tab2_cols} FROM {table2}',db,tab2_cols.split(',')).set_index('State')
  #Combine the two pandas dataframes
  comb_df=df1.join(df2,how='inner')
  #Add state as column
  comb_df['State']=comb_df.index
  #Create barplots
  fig=px.bar(comb_df,x=cat_var,y=num_var,color='mostPopular')
  fig.update_xaxes(title_text=xlab)
  fig.update_yaxes(title_text=ylab)
  fig.show()

# facet_state_barplots2(db,table1,table2,tab1_cols,tab2_cols,cat_var,num_var,facet_var,xlab,ylab) function
This function extracts and combines two subsets of data extracted from our database and creates a bar plot (categorical variable on x axis and numerical variable on y axis) colored by one of the variables in the combined data and facet wrapped by another variable. It has no return. It displays the bar plots.

In [None]:
def facet_state_barplots2(db,table1,table2,tab1_cols,tab2_cols,cat_var,num_var,facet_var,xlab,ylab):
  '''
  Given a database, two tables from that database, subsets of data from the two tables, a categorical variable to be on the x axis
  of the barplots, a numerical variable to be on the y axis, and a color var, create barplots that display distributions of a numerical
  variable across different versions of a categorical variable. This function assumes we want/need to combine data from two different tables and one of those tables
  includes data on most popular fast food restaurant.
  Parameters:
  db - string - Our database file
  table1 - string - The name of the first table we want to extract data from
  table2 - string - The name of the second table we want to extract data from
  tab1_cols - string - The set of columns we want to extract from the first table
  tab2_cols - string - The set of columns we want to extract from the second table
  cat_var - string - The categorical variable for our barplots
  num_var - string - The numerical variable for our barplots
  facet_var - string - The variable we are facet wrapping our barplots by
  xlab - string - The x axis label for our plots
  ylab - string - The y axis label for our plots
  Return:
  None. The output of the function is the barplots being displayed
  '''
  import pandas as pd
  import plotly.express as px
  #Obtain pandas dataframe by selecting certain columns from the first table
  df1=execute_sqldf(f'SELECT {tab1_cols} FROM {table1}',db,tab1_cols.split(',')).set_index('State')
  #Obtain pandas dataframe by selecting certain columns from the second table
  df2=execute_sqldf(f'SELECT {tab2_cols} FROM {table2}',db,tab2_cols.split(',')).set_index('State')
  #Combine the two pandas dataframes
  comb_df=df1.join(df2,how='inner')
  #Add state as column
  comb_df['State']=comb_df.index
  #Create barplots
  fig=px.bar(comb_df,x=cat_var,y=num_var,color='mostPopular',facet_col=facet_var)
  fig.update_xaxes(title_text=xlab)
  fig.update_yaxes(title_text=ylab)
  fig.show()

# state_sidebysidebarplots(db,table,cols,cat_var,num_var,color_var,column,xlab,ylab) function
This function extracts a single subset of data in our database and creates side by side bar plots (categorical variable on x axis and numerical variable on y axis) colored by one of the variables in the subset of data. It has no return. It displays the side by side bar plots.

In [None]:
def state_sidebysidebarplots(db,table,cols,cat_var,num_var,color_var,column,xlab,ylab):
  '''
  Given a database, a table from that database, subsets of data from the table, a categorical variable to be on the x axis
  of the barplot, and a numerical variable to be on the y axis, create a side by side barplot that display a numerical
  variable across different versions of a categorical variable. This function assumes we don't want/need to combine data from two different tables.
  Parameters:
  db - string - Our database file
  table - string - The name of the first table we want to extract data from
  cols - string - The set of columns we want to extract from the first table
  cat_var - string - The categorical variable for our barplot
  num_var - string - The numerical variable for our barplot
  color_var - string - The variable we are coloring each of our bars by
  column - string - column for the bar plots to be grouped by
  xlab - string - The x axis label for our plots
  ylab - string- The y axis label for our plots
  Return:
  None. The output of the function is the barplots being displayed
  '''
  import pandas as pd
  import plotly.express as px
  #Obtain pandas dataframe by selecting certain columns from the first table
  df=execute_sqldf(f'SELECT {cols} FROM {table}',db,cols.split(','))
  df[column]=df[column].astype('string')
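  #Create side by side barplot
  fig=px.bar(df, x=cat_var, y=num_var, color=color_var, barmode='group')
  #Apply the axis labels passed in by the caller
  fig.update_xaxes(title_text=xlab)
  fig.update_yaxes(title_text=ylab)
  fig.show()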

# state_stackedbarplots(db,table,cols,cat_var,num_var,color_var,column,xlab,ylab) function

This function extracts a single subset of data in our database and creates stacked bar plots (categorical variable on x axis and numerical variable on y axis) colored by one of the variables in the subset of data. It has no return. It displays the stacked bar plots.

In [None]:
def state_stackedbarplots(db,table,cols,cat_var,num_var,color_var,column,xlab,ylab):
  '''
  Given a database, a table from that database, subsets of data from the table, a categorical variable to be on the x axis
  of the barplot, and a numerical variable to be on the y axis, create a stacked barplot that display a numerical
  variable across different versions of a categorical variable. This function assumes we don't want/need to combine data from two different tables.
  Parameters:
  db - string - Our database file
  table - string - The name of the first table we want to extract data from
  cols - string - The set of columns we want to extract from the first table
  cat_var - string - The categorical variable for our barplot
  num_var - string - The numerical variable for our barplot
  color_var - string - The variable we are coloring each of our bars by
  column - string - column for the bar plots to be grouped by
  xlab - string - The x axis label for all of our plots
  ylab - string - The y axis label for all of our plots
  Return:
  None. The output of the function is the barplots being displayed
  '''
  import pandas as pd
  import plotly.express as px
  #Obtain pandas dataframe by selecting certain columns from the first table
  df=execute_sqldf(f'SELECT {cols} FROM {table}',db,cols.split(','))
  df[column]=df[column].astype('string')
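  #Create stacked barplot (stacking is the default bar mode)
  fig=px.bar(df, x=cat_var, y=num_var, color=color_var)
  #Apply the axis labels passed in by the caller
  fig.update_xaxes(title_text=xlab)
  fig.update_yaxes(title_text=ylab)
  fig.show()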

# facet_state_stackedbarplots(db,table1,table2,tab1_cols,tab2_cols,cat_var,num_var,facet_var,color_var) function

This function extracts and combines two subsets of data extracted from our database and creates stacked bar plots (categorical variable on x axis and numerical variable on y axis) colored by one of the variables in the combined data and facet wrapped by another variable. It has no return. It displays the bar plots.

In [None]:
def facet_state_stackedbarplots(db,table1,table2,tab1_cols,tab2_cols,cat_var,num_var,facet_var,color_var):
  '''
  Given a database, two tables from that database, subsets of data from the two tables, a categorical variable to be on the x axis
  of the barplots, a numerical variable to be on the y axis, and a color var, create barplots that display distributions of a numerical
  variable across different versions of a categorical variable. This function assumes we want/need to combine data from two different tables and one of those tables
  includes data on most popular fast food restaurant.
  Parameters:
  db - string - Our database file
  table1 - string - The name of the first table we want to extract data from
  table2 - string - The name of the second table we want to extract data from
  tab1_cols - string - The set of columns we want to extract from the first table
  tab2_cols - string - The set of columns we want to extract from the second table
  cat_var - string - The categorical variable for our barplots
  num_var - string - The numerical variable for our barplots
  facet_var - string - The variable we are facet wrapping our barplots by
  color_var - string - The variable we are coloring each of our bars by
  Return:
  None. The output of the function is the barplots being displayed
  '''
  import pandas as pd
  import plotly.express as px
  #Obtain pandas dataframe by selecting certain columns from the first table
  df1=execute_sqldf(f'SELECT {tab1_cols} FROM {table1}',db,tab1_cols.split(',')).set_index('State')
  #Obtain pandas dataframe by selecting certain columns from the second table
  df2=execute_sqldf(f'SELECT {tab2_cols} FROM {table2}',db,tab2_cols.split(',')).set_index('State')
  #Combine the two pandas dataframes
  comb_df=df1.join(df2,how='inner')
  #Add state as column
  comb_df['State']=comb_df.index
  #Create barplots
  fig=px.bar(comb_df,x=cat_var,y=num_var,color=color_var,facet_col=facet_var,facet_col_wrap=4)
  fig.show()

# mostpop(restaurant) function

This function takes in the parameter "restaurant" and returns the states that have that restaurant as their most popular, along with their StateID and total obesity score.

In [None]:
def mostpop(restaurant):
  '''
  Given a fast food restaurant, return the states (with StateID and total obesity score) that have it as their most popular
  fast food restaurant. This function assumes the obesity_by_state and state_fast_food tables list the states in the same row order.
  Parameters:
  restaurant - string - a fast food restaurant
  Return:
  A pandas dataframe with the columns StateID, State, Total_Score, and mostPopular
  '''
  import pandas as pd
  db = 'US_State_Food.db'

  #Create a pandas dataframe from the obesity_by_state data with the columns StateID, State, and Total_Score
  cols = 'StateID,State,Total_Score'
  data1 = execute_sql(db, f'SELECT {cols} FROM obesity_by_state')
  frame1 = pd.DataFrame(data1, columns = cols.split(','))

  #Create a pandas dataframe from the state_fast_food data with the column mostPopular
  cols = 'mostPopular'
  data3 = execute_sql(db, f'SELECT {cols} FROM state_fast_food')
  frame2 = pd.DataFrame(data3, columns = cols.split(','))

  #Concatenate the two dataframes side by side (rows are matched by position)
  master = pd.concat([frame1, frame2], axis = 1)
  #Keep only the states whose most popular restaurant matches the "restaurant" parameter
  final = master.loc[master["mostPopular"] == restaurant ,:]

  return final
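
The same lookup can also be sketched as a single SQL JOIN on StateID, so the two tables are matched by key rather than by row position. The helper name `states_with_favorite` is hypothetical, and the stand-in database below holds only two rows: the scores and West Virginia's favorite come from this notebook's outputs, while "Placeholder Burger" is invented.

```python
import sqlite3

def states_with_favorite(conn, restaurant):
    """Return (StateID, State, Total_Score) rows for states whose favorite matches."""
    query = """
        SELECT o.StateID, o.State, o.Total_Score
        FROM obesity_by_state AS o
        JOIN state_fast_food AS f ON f.StateID = o.StateID
        WHERE f.mostPopular = ?
        ORDER BY o.Total_Score DESC
    """
    return conn.execute(query, (restaurant,)).fetchall()

# Stand-in database with an assumed minimal column layout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obesity_by_state (StateID INTEGER PRIMARY KEY, State TEXT, Total_Score REAL)")
conn.execute("CREATE TABLE state_fast_food (StateID INTEGER PRIMARY KEY, State TEXT, mostPopular TEXT)")
conn.executemany("INSERT INTO obesity_by_state VALUES (?, ?, ?)",
                 [(148, "West Virginia", 74.6), (124, "Mississippi", 72.33)])
conn.executemany("INSERT INTO state_fast_food VALUES (?, ?, ?)",
                 [(124, "Mississippi", "Placeholder Burger"),  # invented favorite
                  (148, "West Virginia", "McDonald's")])

mcd_states = states_with_favorite(conn, "McDonald's")
print(mcd_states)
```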


# Data Analysis
In this section, given the functions above, we now perform analysis on our state food data to answer the main question:

**How is fast food restaurant choice across the US states affected by the number of given fast food restaurants in each state, obesity rates, unemployment rates, and median household income?**

# Choropleth map for favorite fast food restaurant by state

Here we have a map that shows the most popular fast food restaurant in each state. We obtain the map location for each state by adding a State_abb column to our state_fast_food dataframe. We use the px.choropleth function below to create the choropleth map with a color sequence of length 12 (one color per restaurant), setting the locations to the State_abb column and coloring each state by our mostPopular variable. As we can see, Chick-Fil-A is the most popular restaurant by a long shot, with McDonald's and In-N-Out among the runners-up. We can also see some regional trends in this plot. For example, the entire Southwest and all of the Southeast except for West Virginia preferred the same fast food restaurant. Also, the southern part of the western region all preferred In-N-Out.

In [None]:
import plotly.express as px
cols='State,mostPopular'
db='US_State_Food.db'
df=execute_sqldf(f"SELECT {cols} FROM state_fast_food",db,cols.split(','))
#Add state abbreviation column to df corresponding to each state in state column
df['State_abb']=['AL','AK','AZ','AR','CA','CO','CT','DE','FL','GA',
           'HI','ID','IL','IN','IA','KS','KY','LA','ME','MD','MA',
           'MI','MN','MS','MO','MT','NE','NV','NH','NJ','NM',
           'NY','NC','ND','OH','OK','OR','PA','RI','SC','SD','TN','TX',
           'UT','VT','VA','WA','WV','WI','WY']
fig = px.choropleth(df,
                    locations='State_abb',
                    locationmode="USA-states",
                   scope="usa",
                   color='mostPopular',
                    color_discrete_sequence=['red','orange','blue','purple','yellow','green','brown','pink','lightblue','lightpink','gold','lightgreen']
                    )
fig.show()
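
The abbreviation list above works only if the dataframe rows are in the same alphabetical order as the list. A more defensive sketch maps each state name to its abbreviation by name, so the row order no longer matters. The dictionary below is deliberately partial (three states) to keep the example short; a full version would need all 50 entries.

```python
import pandas as pd

# Partial name-to-abbreviation lookup, for illustration only.
STATE_ABBR = {"Alabama": "AL", "Maryland": "MD", "West Virginia": "WV"}

# Rows intentionally not in alphabetical order.
states = pd.DataFrame({"State": ["West Virginia", "Alabama", "Maryland"]})
states["State_abb"] = states["State"].map(STATE_ABBR)

# Any state name missing from the lookup shows up here instead of being mislabeled.
missing = states.loc[states["State_abb"].isna(), "State"].tolist()
print(states)
print(missing)
```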

# Use value_counts() function to see specific quantities

We also want to see the precise numbers for how many states have Chick-Fil-A as their favorite fast food restaurant, McDonald's as their favorite, Panda Express, etc. So, here we extract the mostPopular column from the state_fast_food data in the database, convert that column to a series, and then do a value counts on the mostPopular series to see the counts of different favorite fast food restaurants.

The counts support the choropleth map above: Chick-Fil-A, McDonald's, and Panda Express are the three most common favorite restaurants in our data, with 21 states, 12 states, and 5 states respectively.

In [None]:
import pandas as pd
db='US_State_Food.db'
cols='State,mostPopular'
df=execute_sqldf(f"SELECT {cols} FROM state_fast_food",db,cols.split(','))
s=pd.Series(df['mostPopular'])
s.value_counts()

Chick-Fil-A        21
McDonald's         12
Panda Express       5
In-N-Out            4
White Castle        1
Wendy's             1
Jack In the Box     1
Sonic               1
Church's            1
Taco Bell           1
Charley's           1
Carl's Jr           1
Name: mostPopular, dtype: int64
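
To put these counts in proportion, we can convert them to shares of the 50 states. The sketch below rebuilds the counts shown above by hand rather than querying the database, so it runs on its own.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({
    "Chick-Fil-A": 21, "McDonald's": 12, "Panda Express": 5, "In-N-Out": 4,
    "White Castle": 1, "Wendy's": 1, "Jack In the Box": 1, "Sonic": 1,
    "Church's": 1, "Taco Bell": 1, "Charley's": 1, "Carl's Jr": 1,
})

shares = counts / counts.sum()          # fraction of states per restaurant
top3_share = shares.nlargest(3).sum()   # combined share of the three leaders
print(shares.round(2))
print(top3_share)
```

With the real data, `s.value_counts(normalize=True)` gives the same shares directly from the mostPopular series.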

# Stacked bar plots to show the different category rankings for each state

In this plot, we use a stacked bar plot to show the combined category rankings for each state. We do this by passing our category_by_state table and the other necessary parameters to our state_stackedbarplots function. A key takeaway is that West Virginia and Mississippi have the lowest combined ranking totals across all three categories. This means that these two states are considered among the most obese and least healthy states in America.

In [None]:
db='US_State_Food.db'
table='category_by_state'
cols='State,Category,Ranking'
cat_var='State'
num_var='Ranking'
color_var='Category'
state_stackedbarplots(db,table,cols,cat_var,num_var,color_var,'Category','State','Ranking')
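
The height of each stacked bar is the sum of a state's three category rankings. A sketch of computing that total directly is shown below, so the states with the lowest combined ranking can be read off without the plot. The category names match our data, but the ranking numbers are made up for illustration.

```python
import pandas as pd

categories = ["Food & Fitness", "Health Consequences", "Obesity & Overweight Prevalence"]
# Invented rankings for three states, three categories each.
ranks = pd.DataFrame({
    "State": ["West Virginia"] * 3 + ["Mississippi"] * 3 + ["Utah"] * 3,
    "Category": categories * 3,
    "Ranking": [2, 1, 1, 1, 3, 2, 48, 50, 47],
})

# Sum the rankings per state and sort so the lowest total comes first.
totals = ranks.groupby("State")["Ranking"].sum().sort_values()
print(totals)
```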

# Query showing total obesity scores from highest to lowest with their corresponding state

Here, to answer which states have the highest total obesity score in our data, we query the necessary subset of data from the obesity_by_state table (the state and total score columns). Then, we order this subset by total obesity score in descending order and display the result as a pandas dataframe (at the top of the dataframe are the states with the highest obesity score).

We can see that the states of West Virginia, Mississippi, Kentucky, Arkansas, and Alabama have the five highest obesity scores in our data.

In [None]:
import pandas as pd

db = 'US_State_Food.db'
table = 'obesity_by_state'
cols = 'StateID,State,Total_Score'
#Column to sort by (named order_col so we do not shadow the built-in ord function)
order_col = 'Total_Score'

qry = f'SELECT {cols} FROM {table} ORDER BY {order_col} DESC'
data5 = execute_sql(db, qry)

pd.DataFrame(data5, columns = cols.split(','))

| |StateID|State|Total_Score|
|-|-|-|-|
|0|148|West Virginia|74.6|
|1|124|Mississippi|72.33|
|2|117|Kentucky|68.99|
|3|104|Arkansas|68.95|
|4|101|Alabama|68.63|
|5|142|Tennessee|67.46|
|6|118|Louisiana|65.66|
|7|108|Delaware|63.99|
|8|136|Oklahoma|63.71|
|9|140|South Carolina|63.43|


# Bar plot that shows the total obesity scores for each state and favorite fast food restaurant

This plot shows each state's total obesity score, colored by the fast food restaurant that is most popular in that state. We created this plot by combining our state_fast_food data with our obesity_by_state data in the state_barplots2 function. We can see that a large majority of the states whose favorite restaurant is Chick-Fil-A have generally higher obesity scores than many of the other states. West Virginia, whose favorite is McDonald's, is a clear outlier with the highest obesity score, but overall the Chick-Fil-A states are the most obese.

In [None]:
db='US_State_Food.db'
cols1='State,mostPopular'
cols2='State,Total_Score'
table1='state_fast_food'
table2='obesity_by_state'
cat_var='State'
num_var='Total_Score'
state_barplots2(db,table1,table2,cols1,cols2,cat_var,num_var,'State','Total Obesity Score')

# Comparative boxplots that show the distribution of total obesity score across each favorite fastfood restaurant

Here we call our state_comp_boxplots function to create comparative boxplots that support and complement the barplots above. Necessary subsets of data from our state_fast_food table (StateID, State, and mostPopular fields) and our obesity_by_state table (StateID and Total_Score fields) are combined. Then, we create comparative boxplots that show the distribution of total obesity score across the different favorite fast food restaurants in our data.

Based on these boxplots, the most obese states in our data generally favor Chick-Fil-A and McDonald's, although the Chick-Fil-A distribution of total score shows disproportionately high variability (IQR of 10 and range of about 42) compared to the rest of our distributions. We can also see that West Virginia, the state with the highest obesity score in our data (an outlier), rated McDonald's as its favorite fast food restaurant.



In [None]:
db='US_State_Food.db'
cols1='StateID,State,mostPopular'
cols2='StateID,Total_Score'
table1='state_fast_food'
table2='obesity_by_state'
cat_var='mostPopular'
num_var='Total_Score'
state_comp_boxplots(db,table1,table2,cols1,cols2,cat_var,num_var,'Most Popular Fast Food Restaurant','Total Obesity Score')
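
The IQR and range quoted above are read off the boxplots. A sketch of computing the same spread statistics per restaurant with a pandas groupby is shown below. The restaurant labels and scores are toy values chosen so the arithmetic is easy to check, not our real data.

```python
import pandas as pd

def iqr(series):
    """Interquartile range: 75th percentile minus 25th percentile."""
    return series.quantile(0.75) - series.quantile(0.25)

# Toy scores for two placeholder restaurants.
scores = pd.DataFrame({
    "mostPopular": ["Restaurant A"] * 5 + ["Restaurant B"] * 5,
    "Total_Score": [10, 20, 30, 40, 50, 1, 2, 3, 4, 5],
})

spread = scores.groupby("mostPopular")["Total_Score"].agg(["min", "max", iqr])
spread["range"] = spread["max"] - spread["min"]
print(spread)
```

Note that pandas interpolates quartiles linearly by default, so its IQR can differ slightly from the quartiles a plotting library draws.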

# The effect of household income on the total obesity score of each state.

This scatterplot shows the relationship between total obesity score and median household income across all states. It was created by combining our obesity_by_state data with our state_medincome data and then passing the combined table to the px.scatter function. As we can see, there seems to be a negative correlation between a state's total obesity score and its median household income: the more obese a state is, the lower its median household income tends to be.

In [None]:
import pandas as pd
import plotly.express as px

db = 'US_State_Food.db'

#Obesity data with the columns StateID, State, and Total_Score
cols = 'StateID,State,Total_Score'
data1 = execute_sql(db, f'SELECT {cols} FROM obesity_by_state')
frame1 = pd.DataFrame(data1, columns = cols.split(','))

#Median household income data
cols = 'HouseholdIncome'
data2 = execute_sql(db, f'SELECT {cols} FROM state_medincome')
frame3 = pd.DataFrame(data2, columns = cols.split(','))

#Combine the two dataframes side by side (assumes both tables list the states in the same row order)
plot = pd.concat([frame1, frame3], axis = 1)
#Scatterplot with an ordinary least squares trendline (trendline="ols" requires the statsmodels package)
fig = px.scatter(plot, x="HouseholdIncome", y="Total_Score", trendline="ols", hover_data=['State'])
fig.update_xaxes(title_text='Median Household Income')
fig.update_yaxes(title_text='Total Obesity Score')
fig.show()
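
The direction of the trendline can also be summarized with a correlation coefficient. The sketch below shows the calculation on a handful of made-up values (not our real data). With the real combined dataframe, the same `corr` call would apply to its HouseholdIncome and Total_Score columns.

```python
import pandas as pd

# Invented income and score pairs where scores fall as income rises.
toy = pd.DataFrame({
    "HouseholdIncome": [50000, 60000, 70000, 80000],
    "Total_Score": [70.0, 65.0, 62.0, 55.0],
})

# Pearson correlation: values near -1 mean a strong negative linear relationship.
r = toy["HouseholdIncome"].corr(toy["Total_Score"])
print(r)
```

A negative value of `r` matches a downward sloping trendline, and its magnitude indicates how tightly the points follow that line.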

# Query that shows the median income of any state from highest to lowest

Here, to answer which states have the highest median income in our data, we query the necessary subset of data from the state_medincome table (the state and HouseholdIncome columns). Then, we order this subset by total median income in descending order and display the result as a pandas dataframe (at the top of the dataframe are the states with the highest median income).

We can see that the states of Maryland, New Jersey, Massachusetts, Hawaii, and Connecticut have the five highest median incomes in our data.

In [None]:
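db = 'US_State_Food.db'
table = 'state_medincome'
cols = 'State, HouseholdIncome'
order_col = 'HouseholdIncome'  # renamed from ord to avoid shadowing the built-in ord()

qry = f'SELECT {cols} FROM {table} ORDER BY {order_col} DESC'
data2 = execute_sql(db, qry)

# Strip whitespace so the column is named 'HouseholdIncome', not ' HouseholdIncome'
frame3 = pd.DataFrame(data2, columns = [c.strip() for c in cols.split(',')])
frame3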

Index|State|HouseholdIncome
-|-|-
0|Maryland|87063
1|New Jersey|85245
2|Massachusetts|84385
3|Hawaii|83173
4|Connecticut|79855
5|California|78672
6|New Hampshire|77923
7|Alaska|77790
8|Washington|77006
9|Virginia|76398
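
If only the top few states are needed, the ordering and the cutoff can both be pushed into SQL. The following is a minimal sketch rather than the query we ran above: `top_income_states` and its parameter `n` are hypothetical names, and the cutoff is passed as a bound parameter instead of being formatted into the query string.

```python
import sqlite3
from contextlib import closing

import pandas as pd


def top_income_states(db_path, n=5):
    """Return the n states with the highest median household income.

    Hypothetical helper; uses the State and HouseholdIncome columns of state_medincome.
    """
    query = """
        SELECT State, HouseholdIncome
        FROM state_medincome
        ORDER BY HouseholdIncome DESC
        LIMIT ?
    """
    with closing(sqlite3.connect(db_path)) as conn:
        # The "?" placeholder is filled from params by the sqlite3 driver.
        return pd.read_sql_query(query, conn, params=(n,))
```

Because `pd.read_sql_query` takes its column names from the query result, no manual `cols.split(',')` step is needed here.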


# Bar plot that shows the median household income for each state and favorite fast food restaurant
Here, to support our comparative boxplots below and to look further into the correlation between median income and fast food restaurant choice, we call the state_barplots2 function. This function combines the needed subsets of data from our state_fast_food table and our state_medincome table. It then creates barplots that show the median income of each state along with that state's favorite fast food restaurant, so we can see which favorites belong to wealthier and less wealthy states.

We can see in the barplot below that states that rated their favorite fast food restaurant as In-N-Out, Panda Express, or McDonalds generally demonstrated higher levels of median household income. We see lower levels of median income for Chick-Fil-A states and states that have a unique favorite fast food restaurant (Carl's Jr, Taco Bell, Wendy's...). We can see that our boxplots below support these findings.

In [None]:
db='US_State_Food.db'
cols1='State,mostPopular'
cols2='State,HouseholdIncome'
table1='state_fast_food'
table2='state_medincome'
cat_var='State'
num_var='HouseholdIncome'
state_barplots2(db,table1,table2,cols1,cols2,cat_var,num_var,'State','Median Household Income')

# Comparative boxplots that show the distribution of median income across each favorite fastfood restaurant (shows what fastfood restaurant more wealthy states/less wealthy states tend to like more)

Here we have boxplots that show how a state's most popular fast food restaurant relates to its median household income. We created them by combining our state_fast_food data with our state_medincome data inside the state_comp_boxplots function. At the far left of the plot, it makes sense that Chick-Fil-A is at the low end of the median household income scale: as we have already seen, the states that chose Chick-Fil-A as their most popular restaurant tended to be more obese than other states, and higher obesity scores in our data correlate with lower median household income. Looking at the states with higher median household income, the favorites In-N-Out and Panda Express corresponded to higher income states, with McDonalds in the mix too. For restaurants that were the favorite of only one state, there didn't seem to be much of a correlation between median household income and fast food restaurant choice.

In [None]:
db='US_State_Food.db'
cols1='StateID,State,mostPopular'
cols2='StateID,HouseholdIncome'
table1='state_fast_food'
table2='state_medincome'
cat_var='mostPopular'
num_var='HouseholdIncome'
state_comp_boxplots(db,table1,table2,cols1,cols2,cat_var,num_var,'Most Popular Fast Food Restaurant','Median Household Income')
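
The boxplots can be backed up with a small numeric summary. The sketch below is an illustration rather than part of state_comp_boxplots: `median_income_by_favorite` is a hypothetical helper that assumes both tables share `StateID` (as the cell above selects it from each), and it reports how many states chose each favorite next to the median income of those states.

```python
import sqlite3
from contextlib import closing

import pandas as pd


def median_income_by_favorite(db_path):
    """Summarize median household income per most popular restaurant.

    Hypothetical helper; assumes state_fast_food and state_medincome share StateID.
    """
    query = """
        SELECT f.mostPopular, m.HouseholdIncome
        FROM state_fast_food AS f
        JOIN state_medincome AS m ON f.StateID = m.StateID
    """
    with closing(sqlite3.connect(db_path)) as conn:
        df = pd.read_sql_query(query, conn)
    # count = number of states per favorite, median = middle income of that group
    summary = df.groupby("mostPopular")["HouseholdIncome"].agg(["count", "median"])
    return summary.sort_values("median", ascending=False)
```

The `count` column matters when reading the result: a favorite chosen by a single state has a median that is just that one state's income, so it says little about a trend.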

# Query that shows unemployment rate from highest to lowest in any given year for any given state

Here we have a basic query that shows the unemployment rates from our state_unemp table for every state and year, ordered from highest to lowest. This gives us a good idea of which year had the highest unemployment rates. As we can see, the top 5 highest unemployment rates were all in 2021 and the bottom 5 were all in 2022.

In [None]:
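db = 'US_State_Food.db'
table = 'state_unemp'
cols = 'State, Unemp_Year, Unemployment_rate'
order_col = 'Unemployment_rate'  # renamed from ord to avoid shadowing the built-in ord()

qry = f'SELECT {cols} FROM {table} ORDER BY {order_col} DESC'
data = execute_sql(db, qry)

# Strip whitespace so column names carry no leading spaces
pd.DataFrame(data, columns = [c.strip() for c in cols.split(',')])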

Index|State|Unemp_Year|Unemployment_rate
-|-|-|-
0|California|2021|7.4
1|New Mexico|2021|7.0
2|New York|2021|6.9
3|Nevada|2021|6.6
4|New Jersey|2021|6.6
...|...|...|...
95|Vermont|2022|2.1
96|Nebraska|2022|2.0
97|New Hampshire|2022|2.0
98|Utah|2022|2.0
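
Reading the top and bottom rows gives a rough picture of the year-to-year difference. A more direct comparison is to average the rate within each year. The sketch below assumes the column names used in the query above; `mean_unemployment_by_year` is a hypothetical helper that lets SQLite do the grouping.

```python
import sqlite3
from contextlib import closing

import pandas as pd


def mean_unemployment_by_year(db_path):
    """Return the average unemployment rate and row count for each year.

    Hypothetical helper; uses Unemp_Year and Unemployment_rate from state_unemp.
    """
    query = """
        SELECT Unemp_Year,
               AVG(Unemployment_rate) AS mean_rate,
               COUNT(*) AS n_states
        FROM state_unemp
        GROUP BY Unemp_Year
        ORDER BY Unemp_Year
    """
    with closing(sqlite3.connect(db_path)) as conn:
        return pd.read_sql_query(query, conn)
```

The `n_states` column is a quick check that each year contributes the same number of rows before the two means are compared.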


# How unemployment rate affects the total obesity scores of each state.
Here, we combine our obesity_by_state data with our state_unemp data, which have different lengths (50 rows vs 100 rows), using the pandas join() method. Because both dataframes are indexed by State, each state's total obesity score is repeated for each year of unemployment data, giving a dataframe with both unemployment rate by year and total obesity score for each state. Then, from that combined dataframe, we create scatterplots that show the correlation between total score and unemployment rate for each year (2021 and 2022).

Looking at the scatterplots below, it appears that across 2021 and 2022, there isn't much of a correlation between total score and unemployment rate.

In [None]:
import pandas as pd
import plotly.express as px
db = 'US_State_Food.db'
cols1 = 'State,Total_Score'
table1 = 'obesity_by_state'
cols2 = 'Unemployment_rate,Unemp_Year,State'
table2 = 'state_unemp'
qry1 = f'SELECT {cols1} FROM {table1}'
qry2 = f'SELECT {cols2} FROM {table2}'
df1 = pd.DataFrame(execute_sql(db, qry1), columns = cols1.split(',')).set_index('State')
df2 = pd.DataFrame(execute_sql(db, qry2), columns = cols2.split(',')).set_index('State')
comb_dfs = df1.join(df2, how = 'inner')
fig = px.scatter(comb_dfs, x="Total_Score", y="Unemployment_rate", facet_col = 'Unemp_Year',trendline='ols')
fig.update_xaxes(title_text='Total Obesity Score')
fig.update_yaxes(title_text='Unemployment Rate')
fig.show()
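
The join between a 50-row table and a 100-row table can be illustrated on a small scale. The sketch below uses made-up toy values (not rows from US_State_Food.db) to show how an index join repeats each state's score once per year, and how a per-year correlation could then be computed from the combined frame.

```python
import pandas as pd

# Toy values for illustration only; these are NOT rows from US_State_Food.db.
scores = pd.DataFrame(
    {"State": ["A", "B", "C"], "Total_Score": [10.0, 20.0, 30.0]}
).set_index("State")

unemp = pd.DataFrame(
    {
        "State": ["A", "A", "B", "B", "C", "C"],
        "Unemp_Year": [2021, 2022, 2021, 2022, 2021, 2022],
        "Unemployment_rate": [5.0, 3.0, 6.0, 2.5, 5.5, 3.5],
    }
).set_index("State")

# One-to-many join on the shared State index:
# each state's score appears once for every year of unemployment data.
combined = scores.join(unemp, how="inner")
print(combined.shape)  # (6, 3)

# Pearson's r between score and unemployment rate, computed separately per year.
per_year_r = {
    year: group["Total_Score"].corr(group["Unemployment_rate"])
    for year, group in combined.groupby("Unemp_Year")
}
print(per_year_r)
```

With the real tables, the same pattern would turn 50 score rows and 100 unemployment rows into 100 combined rows, one per state and year.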

# Bar plots that show unemployment rate (2021 and 2022) and favorite fast food restaurant for each US State

Here we have two bar plots that show unemployment rates for each state in 2021 and 2022, grouped by most popular fast food restaurant. To build them, we combine our state_fast_food data with the state_unemp data by passing the needed columns to the facet_state_barplots2 function. Similarly to the scatterplots above, we can see that unemployment rates were much higher in 2021 than in 2022. In both graphs, unemployment rates look fairly evenly spread across the favorite restaurants, showing no real correlation.

In [None]:
db='US_State_Food.db'
cols1='State,mostPopular'
cols2='State,Unemployment_rate,Unemp_Year'
table1='state_fast_food'
table2='state_unemp'
cat_var='State'
num_var='Unemployment_rate'
facet_var='Unemp_Year'
facet_state_barplots2(db,table1,table2,cols1,cols2,cat_var,num_var,facet_var,'State','Unemployment Rate')

# Side by side comparative boxplots that show the distribution of unemployment rate across each favorite fastfood restaurant for two different years (2021 and 2022)
Here, in an effort to support our barplots in how unemployment rate correlates with fast food restaurant choice, we call the state_comp_boxplots2 function. What this does is combine necessary subsets of data from our state_unemp table with the state_fast_food table. Then, side by side comparative boxplots are created to determine how unemployment rate differs across different favorite fast food restaurants in our data.

Based on the boxplots below, there doesn't appear to be much of a relationship between unemployment rate and fast food restaurant choice. There seem to be similar levels of unemployment for states that have Chick-Fil-A as their favorite fast food restaurant, states that have McDonalds as their favorite fast food restaurant, etc.

In [None]:
db='US_State_Food.db'
cols1='State,mostPopular'
cols2='State,Unemp_Year,Unemployment_rate'
table1='state_fast_food'
table2='state_unemp'
cat_var='mostPopular'
num_var='Unemployment_rate'
state_comp_boxplots2(db,table1,table2,cols1,cols2,cat_var,num_var,'Most Popular Fast Food Restaurant','Unemployment Rate')

# Stacked bar plots that show the number of certain fast food restaurants per 100,000 people for each state and favorite fast food restaurant

This visual shows how many of each fast food restaurant there are per 100,000 people in every state, grouped by the state's favorite fast food restaurant. We created these stacked barplots by passing the needed data from the state_numfastfood table and the state_fast_food table to the facet_state_stackedbarplots function. There doesn't seem to be much of a correlation between the number of a given fast food restaurant per 100,000 people in each state and fast food restaurant choice. However, it does seem like there were more McDonalds per 100,000 people in states that had In-N-Out as their favorite fast food restaurant, and more Dunkin Donuts per 100,000 people in states that had McDonalds as their favorite fast food restaurant.

In [None]:
db='US_State_Food.db'
table1='state_numfastfood'
table2='state_fast_food'
cols1='State,Fast_Food_Resturant,Number_Per_100_People'
cols2='State,mostPopular'
cat_var='State'
num_var='Number_Per_100_People'
color_var='Fast_Food_Resturant'
column='Fast_Food_Resturant'
facet_var='mostPopular'
facet_state_stackedbarplots(db,table1,table2,cols1,cols2,cat_var,num_var,facet_var,color_var)
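
The patterns in the stacked barplots can also be checked as a table of averages. The sketch below is an illustration, not part of facet_state_stackedbarplots: `restaurants_per_capita_by_favorite` is a hypothetical helper that joins the two tables on `State` (the column both selections above include) and averages `Number_Per_100_People` for each combination of favorite and restaurant chain.

```python
import sqlite3
from contextlib import closing

import pandas as pd


def restaurants_per_capita_by_favorite(db_path):
    """Average restaurant density per chain, grouped by each state's favorite.

    Hypothetical helper; assumes state_numfastfood and state_fast_food share State.
    """
    query = """
        SELECT f.mostPopular, n.Fast_Food_Resturant, n.Number_Per_100_People
        FROM state_numfastfood AS n
        JOIN state_fast_food AS f ON n.State = f.State
    """
    with closing(sqlite3.connect(db_path)) as conn:
        df = pd.read_sql_query(query, conn)
    # Rows: favorite restaurant; columns: restaurant chain; cells: mean density.
    return df.pivot_table(
        index="mostPopular",
        columns="Fast_Food_Resturant",
        values="Number_Per_100_People",
        aggfunc="mean",
    )
```

Reading across a row shows which chains are densest in the states that share a favorite, which is the comparison the stacked bars make visually.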

# Conclusion:
To conclude all of our results...
* On average, states that had Chick-Fil-A or McDonalds as their most popular fast food restaurant saw higher obesity scores. This finding is based on our barplots showing state obesity scores categorized by the most popular fast food restaurant for each specific state and our comparative boxplots that showed the distribution of obesity score across each favorite fast food restaurant.
* There is a negative correlation between median income and total obesity scores. The lower a state's median income, the higher its obesity score tended to be.
* Generally, states that had In-N-Out or Panda Express as their most popular fast food restaurant saw higher median household income. States that favored Chick-Fil-A saw lower median household income. This makes sense considering the negative correlation we saw between median income and obesity score, and that states that had Chick-Fil-A as their favorite were more obese. This finding is based on our barplots showing median income categorized by the most popular fast food restaurant for each specific state and our comparative boxplots that showed the distribution of median household income across each favorite fast food restaurant.
* There seemed to be more Dunkin Donuts per 100,000 people in the states that had McDonalds as their favorite fast food restaurant. There seemed to be more McDonalds per 100,000 people in the states that had In-N-Out as their favorite fast food restaurant. Other than that, the number of different fast food restaurants per 100,000 people in each state doesn't really seem to correlate with fast food restaurant choice.
* We didn’t find much of a correlation between unemployment rate and fast food restaurant choice. This finding is based on our bar plots showing the unemployment rate (for 2021 and 2022) categorized by the most popular fast food restaurant for each specific state and our comparative boxplots that showed the distribution of unemployment rate (for 2021 and 2022) across each favorite fast food restaurant.
* 2021 saw much higher average unemployment rates than 2022 did. We can see this in every graph that uses unemployment rates: whichever variable we compared them against, the 2021 rates were higher than the 2022 rates.
* With the exception of West Virginia, all of the states in the southeast and southwest regions had Chick-Fil-A as their most popular fast food restaurant.

* The least healthy states based on our data are:
  1. West Virginia
  2. Mississippi
  3. Kentucky
  4. Alabama
  5. Arkansas

  We determined this based on our stacked bar plot that contained the rankings of each state for all three categories.
* The highest median household income states are:
  1. Maryland
  2. New Jersey
  3. Massachusetts
  4. Hawaii
  5. Connecticut

  We determined this by using a query that organized median household income from highest to lowest.



# Answering our central question:
Now we return to our main question: **How is fast food restaurant choice across the US states affected by the number of given fast food restaurants in each state, obesity rates, unemployment rates, and median household income?**

We can determine that the most obese states in our data had Chick-Fil-A and McDonalds as their most popular fast food restaurants. Also, since median household income was so strongly correlated with total obesity scores, median household income appears to be related to fast food restaurant choice as well, with In-N-Out and Panda Express being the favorites of generally wealthier states. Unfortunately, based on all of the comparative plots we made using unemployment rate, we found that unemployment rate really didn't have much of an effect on fast food restaurant choice. We also didn't really see much of a correlation between the number of given fast food restaurants in each state and fast food restaurant choice. Yet, we did see that states that had McDonalds as their favorite fast food restaurant had an above-average number of Dunkin Donuts per 100,000 people, and states that had In-N-Out as their favorite fast food restaurant had an above-average number of McDonalds per 100,000 people.