# Thanksgiving Dinner Analysis

In this project I will be exploring data on America's most famous meal, Thanksgiving dinner.
The dataset contains 1058 responses to an online survey about what Americans eat for dinner on Thanksgiving, along with some demographic questions such as gender, income and location.

The dataset lets us look at some interesting facets of Thanksgiving dinner and also gives me a chance to practice cleaning data for analysis with the pandas and numpy libraries.

###### Getting the dataset

In [14]:

import numpy as np
import pandas as pd
data = pd.read_csv('thanksgiving.csv', encoding='Latin-1')



###### Inspecting the first 5 rows of the data


In [15]:
data.head()

Unnamed: 0,RespondentID,Do you celebrate Thanksgiving?,What is typically the main dish at your Thanksgiving dinner?,What is typically the main dish at your Thanksgiving dinner? - Other (please specify),How is the main dish typically cooked?,How is the main dish typically cooked? - Other (please specify),What kind of stuffing/dressing do you typically have?,What kind of stuffing/dressing do you typically have? - Other (please specify),What type of cranberry saucedo you typically have?,What type of cranberry saucedo you typically have? - Other (please specify),...,Have you ever tried to meet up with hometown friends on Thanksgiving night?,"Have you ever attended a ""Friendsgiving?""",Will you shop any Black Friday sales on Thanksgiving Day?,Do you work in retail?,Will you employer make you work on Black Friday?,How would you describe where you live?,Age,What is your gender?,How much total combined money did all members of your HOUSEHOLD earn last year?,US Region
0,4337954960,Yes,Turkey,,Baked,,Bread-based,,,,...,Yes,No,No,No,,Suburban,18 - 29,Male,"$75,000 to $99,999",Middle Atlantic
1,4337951949,Yes,Turkey,,Baked,,Bread-based,,Other (please specify),Homemade cranberry gelatin ring,...,No,No,Yes,No,,Rural,18 - 29,Female,"$50,000 to $74,999",East South Central
2,4337935621,Yes,Turkey,,Roasted,,Rice-based,,Homemade,,...,Yes,Yes,Yes,No,,Suburban,18 - 29,Male,"$0 to $9,999",Mountain
3,4337933040,Yes,Turkey,,Baked,,Bread-based,,Homemade,,...,Yes,No,No,No,,Urban,30 - 44,Male,"$200,000 and up",Pacific
4,4337931983,Yes,Tofurkey,,Baked,,Bread-based,,Canned,,...,Yes,No,No,No,,Urban,30 - 44,Male,"$100,000 to $124,999",Pacific


###### A list of all the columns in our dataset

In [16]:
print(data.columns)

Index(['RespondentID', 'Do you celebrate Thanksgiving?',
       'What is typically the main dish at your Thanksgiving dinner?',
       'What is typically the main dish at your Thanksgiving dinner? - Other (please specify)',
       'How is the main dish typically cooked?',
       'How is the main dish typically cooked? - Other (please specify)',
       'What kind of stuffing/dressing do you typically have?',
       'What kind of stuffing/dressing do you typically have? - Other (please specify)',
       'What type of cranberry saucedo you typically have?',
       'What type of cranberry saucedo you typically have? - Other (please specify)',
       'Do you typically have gravy?',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Brussel sprouts',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Carrots',
       'Which of these side dishes aretypically served

#### Cleaning The Data

Because we want to understand what people ate for Thanksgiving, we'll remove any responses from people who don't celebrate it. The column `Do you celebrate Thanksgiving?` contains this information. We only want to keep data for people who answered `Yes` to this question.

In [17]:
# Counting number of yesses and nos in the 
#"Do you celebrate Thanksgiving" column

yes_no_count = data['Do you celebrate Thanksgiving?'].value_counts()
print(yes_no_count)

Yes    980
No      78
Name: Do you celebrate Thanksgiving?, dtype: int64


In [18]:
#filtering the dataframe to keep only the "Yes" responses

yes_data = data[data['Do you celebrate Thanksgiving?'] == 'Yes']
print(yes_data.shape)


(980, 65)


As can be seen above, 980 of the 1058 responders celebrate Thanksgiving. Since this is an overwhelming majority of the dataset, we can confidently filter it down to only those responders. The new `yes_data` dataframe represents the dataset we need.
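
As a quick check on the "overwhelming majority" claim, the counts printed above can be turned into percentages. This is only a sketch: the `counts` and `shares` names are mine, and the numbers are copied by hand from the output rather than read from `thanksgiving.csv`.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({"Yes": 980, "No": 78})

# Share of each answer among all responses, in percent.
shares = (counts / counts.sum() * 100).round(1)
print(shares)
```

On the real dataframe, `data['Do you celebrate Thanksgiving?'].value_counts(normalize=True)` should give the same proportions as fractions.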

###### We will now explore the main dishes eaten during Thanksgiving dinner:

In [19]:
main_dishes = yes_data['What is typically the main dish at your Thanksgiving dinner?'].value_counts()
print(main_dishes)

Turkey                    859
Other (please specify)     35
Ham/Pork                   29
Tofurkey                   20
Chicken                    12
Roast beef                 11
I don't know                5
Turducken                   3
Name: What is typically the main dish at your Thanksgiving dinner?, dtype: int64


As expected, turkey is the main dish at most Thanksgiving dinners. Just out of interest, I'd like to look at gravy use among the turkey-eating group.

###### Gravy use among Turkey Eaters

In [20]:
#filtering for turkey
turkey_group = yes_data[yes_data['What is typically the main dish at your Thanksgiving dinner?']== 'Turkey']

#Displaying the "Do you typically have gravy?" column for the "Turkey" group
#print(turkey_group['Do you typically have gravy?'])
print(turkey_group['Do you typically have gravy?'].value_counts())

Yes    814
No      45
Name: Do you typically have gravy?, dtype: int64


814 of the 859 turkey eaters have gravy as part of their Thanksgiving dinner, which is not surprising at all. Personally, I am surprised that it's not 859 out of 859, but I admit I am very biased.

#### Dessert: Pie, Pie and More Pie

Now that we've looked into the main dishes, let's explore the dessert dishes. Specifically, we'll look at how many people eat Apple, Pecan, or Pumpkin pie during Thanksgiving dinner.

In [21]:
# Generating Boolean Series indicating which pie columns are null:

apple_isnull = yes_data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple'].isnull()
pumpkin_isnull = yes_data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pumpkin'].isnull()
pecan_isnull = yes_data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pecan'].isnull()

#Combining all three Series: True marks a row where none of the three pies was selected,
#False a row where someone had at least one of the types of pie listed
ate_pies = apple_isnull & pumpkin_isnull & pecan_isnull
print(ate_pies.value_counts())


False    876
True     104
dtype: int64


876 responders ate at least one of apple, pumpkin or pecan pie during their Thanksgiving dinner. The rest may have had some other kind which was not surveyed for, or they are clearly joy-hating communists.

_Disclaimer: I am also biased about pie_
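
The null-combination logic above can be checked on a tiny example. The sketch below uses a made-up dataframe: the short column names `apple`, `pumpkin` and `pecan`, the rows, and the variable names are all illustrative, and it assumes (as in the survey file) that a cell is null when the pie was not selected.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the three pie columns: a cell holds the pie name
# when the box was ticked and NaN otherwise.
toy = pd.DataFrame({
    "apple":   ["Apple", np.nan, np.nan, np.nan],
    "pumpkin": [np.nan, "Pumpkin", np.nan, "Pumpkin"],
    "pecan":   [np.nan, "Pecan", np.nan, np.nan],
})

# True only where all three columns are null, i.e. none of the three pies.
no_listed_pie = toy["apple"].isnull() & toy["pumpkin"].isnull() & toy["pecan"].isnull()

# Equivalent view from the other side: at least one non-null value in the row.
ate_listed_pie = toy.notnull().any(axis=1)

print(no_listed_pie.tolist())
print(ate_listed_pie.tolist())
```

The two Series are exact opposites of each other, so counting `False` in the first is the same as counting `True` in the second.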

#### Age and Thanksgiving

We will now analyze the Age column in more depth; however, some data cleaning needs to take place before we do so. The Age column comes in the format:
- `18 - 29`
- `30 - 44`
- `45 - 59`
- `60+`
- `null`
 
But in order to figure out the average age of survey respondents we need to use numeric values. Because we are missing the exact age value, we will instead extract the first (lower) age value from each string.

In [22]:
#creating a function to convert age to an integer value:

def age_to_int(age_string):
    if pd.isnull(age_string):
        return None
    elif age_string == '60+':
        return 60
    else:
        split_string = age_string.split(' ')
        result_int = int(split_string[0])
        return result_int
    
data['int_age'] = data['Age'].apply(age_to_int)

print(data['int_age'].describe())    

count    1025.000000
mean       39.383415
std        15.398493
min        18.000000
25%        30.000000
50%        45.000000
75%        60.000000
max        60.000000
Name: int_age, dtype: float64


From the above data we can see that the mean age of respondents was 39.38 years with a standard deviation of 15.4. This figure is a bit misleading, however, since we converted each age range to the single value at its lower end: for example, everyone in the 18 to 29 age range was assumed to be 18, which is not necessarily the case. In all likelihood, the above summary statistics suggest a younger group than the respondents really are.
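
One way to reduce the lower-bound bias described above would be to use the midpoint of each bracket instead. The sketch below is an illustration, not part of the original analysis: `AGE_MIDPOINTS` and `age_to_midpoint` are hypothetical names, the bracket spellings are assumed to follow the `18 - 29` style seen in the data preview, and the value for `60+` is a guess since that bracket has no upper bound.

```python
import pandas as pd

# Assumed midpoint for each bracket. The "60+" value is an arbitrary guess
# because the bracket is open-ended.
AGE_MIDPOINTS = {
    "18 - 29": 23.5,
    "30 - 44": 37.0,
    "45 - 59": 52.0,
    "60+": 65.0,  # assumption, not derived from the survey
}

def age_to_midpoint(age_string):
    """Map an age bracket to its assumed midpoint, or None when missing."""
    if pd.isnull(age_string):
        return None
    return AGE_MIDPOINTS.get(age_string)

# Illustrative input, not the real column.
ages = pd.Series(["18 - 29", "60+", None, "45 - 59"])
midpoints = ages.apply(age_to_midpoint)
print(midpoints)
```

Whatever value is chosen for `60+` still drives the result, so the midpoint version trades one assumption for another rather than removing the uncertainty.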

#### Income and Thanksgiving

We will apply a similar method to investigate the household income of respondents, bearing in mind that our results will most likely skew low because we again take the lower bound of each bracket.

In [23]:
#creating a function to convert Household Income to an integer value:

def income_to_int(income_string):
    if pd.isnull(income_string):
        return None
    split_string = income_string.split(' ')[0]
    if split_string == 'Prefer':
        return None
    no_dollar_number = split_string.replace('$','')
    no_comma_number = no_dollar_number.replace(',','')
    income_int = int(no_comma_number)
    return income_int
    
data['int_income'] = data['How much total combined money did all members of your HOUSEHOLD earn last year?'].apply(income_to_int)
print(data['int_income'].describe())

count       889.000000
mean      74077.615298
std       59360.742902
min           0.000000
25%       25000.000000
50%       50000.000000
75%      100000.000000
max      200000.000000
Name: int_income, dtype: float64


Based on our method we can conclude that the mean household income of responders is about $74,000. However, due to the method used, this number is once again more than likely lower than the actual average household income. Moreover, the high standard deviation (about $59,000) shows that incomes are widely spread, so the mean on its own is a weak summary.

_Also see concluding remarks about how survey data can be unreliable especially with regards to income_
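
The same lower-bound extraction can also be written without a helper function, using pandas string methods. This is a sketch on made-up rows: the `income` Series and the `lower_bound` name are illustrative, and the exact wording of the "Prefer not to answer" option is an assumption (the function above only checks for the word `Prefer`).

```python
import pandas as pd

# Sample strings in the style of the income column; the rows are illustrative.
income = pd.Series([
    "$75,000 to $99,999",
    "$0 to $9,999",
    "$200,000 and up",
    "Prefer not to answer",
    None,
])

# Capture the first dollar amount, drop the thousands separators, and convert
# to numbers. Rows without a dollar amount end up as NaN.
first_amount = income.str.extract(r"\$([\d,]+)", expand=False)
lower_bound = pd.to_numeric(first_amount.str.replace(",", "", regex=False))
print(lower_bound)
```

The regular expression only looks at the first `$` amount in each string, which matches what `income_to_int` does by splitting on spaces.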

#### Travel and Thanksgiving


We can now see how the distance someone travels for Thanksgiving dinner relates to their income level. We hypothesize that people earning less money could be younger, and would travel to their parents' houses for Thanksgiving. People earning more are more likely to have Thanksgiving at their own house as a result.

We can test this by filtering the data based on int\_income, and seeing what the values in the _"How far will you travel for Thanksgiving?"_ column are.

In [24]:
#Exploring how far people will travel given income is less than 50,000

low_income = data[data['int_income'] < 50000]
print(low_income['How far will you travel for Thanksgiving?'].value_counts())

Thanksgiving is happening at my home--I won't travel at all                         106
Thanksgiving is local--it will take place in the town I live in                      92
Thanksgiving is out of town but not too far--it's a drive of a few hours or less     64
Thanksgiving is out of town and far away--I have to drive several hours or fly       16
Name: How far will you travel for Thanksgiving?, dtype: int64


In [25]:
#Exploring how far people will travel given income is at least 50,000

high_income = data[data['int_income'] >= 50000]
print(high_income['How far will you travel for Thanksgiving?'].value_counts())

Thanksgiving is happening at my home--I won't travel at all                         241
Thanksgiving is local--it will take place in the town I live in                     145
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    111
Thanksgiving is out of town and far away--I have to drive several hours or fly       54
Name: How far will you travel for Thanksgiving?, dtype: int64


We split the groups into `low_income` and `high_income`, with those earning less than $50,000 grouped under low income (the median household income is roughly $56,000 according to the US Census Bureau).

5.8 percent of low income respondents travel several hours or fly, compared to 9.8 percent of high income respondents. High income respondents were also more likely to host Thanksgiving dinner, with 43.7 percent hosting compared to 38.1 percent of low income respondents. This is also intuitive, as high income people are more likely to have the means to host others than low income people.
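
The percentages quoted above can be reproduced, up to rounding, from the two `value_counts()` outputs. In this sketch the counts are copied by hand from those outputs, and the short labels (`home`, `local`, `few_hours`, `far`) and variable names are my own shorthand for the four answers.

```python
import pandas as pd

# Counts copied from the two value_counts() outputs above, in the order:
# at my home, local, a drive of a few hours or less, several hours or fly.
labels = ["home", "local", "few_hours", "far"]
low_income_counts = pd.Series([106, 92, 64, 16], index=labels)
high_income_counts = pd.Series([241, 145, 111, 54], index=labels)

# Convert counts to percentages within each income group.
low_pct = (low_income_counts / low_income_counts.sum() * 100).round(1)
high_pct = (high_income_counts / high_income_counts.sum() * 100).round(1)

print(pd.DataFrame({"low_income": low_pct, "high_income": high_pct}))
```

Comparing percentages rather than raw counts matters here because the high income group is roughly twice the size of the low income group.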



#### Friendsgiving

There are two columns which directly pertain to friendship: `Have you ever tried to meet up with hometown friends on Thanksgiving night?` and `Have you ever attended a "Friendsgiving?"`. In the US, a "Friendsgiving" is when, instead of traveling home for the holiday, you celebrate it with friends who live in your area. Both questions seem skewed towards younger people. We will see if this hypothesis holds up.

In [26]:
#using pivot tables to investigate the Friendsgiving phenomenon: mean age for each combination of answers

print(data.pivot_table(
    index = 'Have you ever tried to meet up with hometown friends on Thanksgiving night?',
    columns = 'Have you ever attended a "Friendsgiving?"',
    values = 'int_age'
))

Have you ever attended a "Friendsgiving?"                  No        Yes
Have you ever tried to meet up with hometown fr...                      
No                                                  42.283702  37.010526
Yes                                                 41.475410  33.976744


In [27]:
#using pivot tables to investigate the Friendsgiving phenomenon: mean income for each combination of answers

print(data.pivot_table(
    index = 'Have you ever tried to meet up with hometown friends on Thanksgiving night?',
    columns = 'Have you ever attended a "Friendsgiving?"',
    values = 'int_income'
))

Have you ever attended a "Friendsgiving?"                     No           Yes
Have you ever tried to meet up with hometown fr...                            
No                                                  78914.549654  72894.736842
Yes                                                 78750.000000  66019.736842


From the above data we can confirm that both questions skew toward younger people when we look at mean age, and toward lower incomes when we look at mean income: respondents who answered Yes to both have the lowest mean age (about 34) and the lowest mean income (about $66,000). This is not surprising, as younger people are generally still in the early part of their careers and tend to make less money than their older peers.
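
Each cell in the pivot tables above is a mean, because `pivot_table` averages the `values` column by default. The toy example below makes that explicit; the dataframe, the shortened column names `met_friends` and `friendsgiving`, and the ages are all made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the survey: two yes/no questions plus an age.
toy = pd.DataFrame({
    "met_friends":   ["Yes", "Yes", "No", "No", "No"],
    "friendsgiving": ["Yes", "No", "Yes", "No", "No"],
    "int_age":       [18, 45, 30, 60, 30],
})

# aggfunc="mean" is the default; it is spelled out here only to show that
# each cell is the average age for that combination of answers.
table = toy.pivot_table(
    index="met_friends",
    columns="friendsgiving",
    values="int_age",
    aggfunc="mean",
)
print(table)
```

In the toy data the two respondents who answered No to both questions are 60 and 30, so that cell holds their average of 45.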



## Conclusion

All in all, we did some interesting analysis on Thanksgiving habits in the US, but we did not really uncover anything surprising about the holiday. Sometimes things are exactly as they appear and sometimes they are not, but having a way to confirm or debunk our assumptions is always fun.

Personally, I think survey data is never quite representative of a nation's population. The ability to fill out a survey, particularly online, necessarily excludes a portion of the population that may not have access to computers or reliable internet connections. According to [Pew Research data](http://www.pewresearch.org/fact-tank/2016/09/07/some-americans-dont-use-the-internet-who-are-they/), 13% of Americans don't use the internet at all, with 23% of people with incomes less than $30,000 a year having no access. The efficacy of surveys is a topic that I think warrants investigation, and hopefully in the future I will have an opportunity to tackle it.