This Jupyter notebook analyzes Thanksgiving celebrations in the US. The data comes from https://github.com/fivethirtyeight/data, a repository with many interesting datasets to explore.


In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [2]:
data=pd.read_csv('/home/raphael/Documents/DataSets/thanksgiving.csv',
encoding="Latin-1")

Let's look at the first 3 rows of the data.

In [4]:
data.head(3)


Unnamed: 0,RespondentID,Do you celebrate Thanksgiving?,What is typically the main dish at your Thanksgiving dinner?,What is typically the main dish at your Thanksgiving dinner? - Other (please specify),How is the main dish typically cooked?,How is the main dish typically cooked? - Other (please specify),What kind of stuffing/dressing do you typically have?,What kind of stuffing/dressing do you typically have? - Other (please specify),What type of cranberry saucedo you typically have?,What type of cranberry saucedo you typically have? - Other (please specify),...,Have you ever tried to meet up with hometown friends on Thanksgiving night?,"Have you ever attended a ""Friendsgiving?""",Will you shop any Black Friday sales on Thanksgiving Day?,Do you work in retail?,Will you employer make you work on Black Friday?,How would you describe where you live?,Age,What is your gender?,How much total combined money did all members of your HOUSEHOLD earn last year?,US Region
0,4337954960,Yes,Turkey,,Baked,,Bread-based,,,,...,Yes,No,No,No,,Suburban,18 - 29,Male,"$75,000 to $99,999",Middle Atlantic
1,4337951949,Yes,Turkey,,Baked,,Bread-based,,Other (please specify),Homemade cranberry gelatin ring,...,No,No,Yes,No,,Rural,18 - 29,Female,"$50,000 to $74,999",East South Central
2,4337935621,Yes,Turkey,,Roasted,,Rice-based,,Homemade,,...,Yes,Yes,Yes,No,,Suburban,18 - 29,Male,"$0 to $9,999",Mountain


We can look at the last 3 rows


In [5]:
data.tail(3)

Unnamed: 0,RespondentID,Do you celebrate Thanksgiving?,What is typically the main dish at your Thanksgiving dinner?,What is typically the main dish at your Thanksgiving dinner? - Other (please specify),How is the main dish typically cooked?,How is the main dish typically cooked? - Other (please specify),What kind of stuffing/dressing do you typically have?,What kind of stuffing/dressing do you typically have? - Other (please specify),What type of cranberry saucedo you typically have?,What type of cranberry saucedo you typically have? - Other (please specify),...,Have you ever tried to meet up with hometown friends on Thanksgiving night?,"Have you ever attended a ""Friendsgiving?""",Will you shop any Black Friday sales on Thanksgiving Day?,Do you work in retail?,Will you employer make you work on Black Friday?,How would you describe where you live?,Age,What is your gender?,How much total combined money did all members of your HOUSEHOLD earn last year?,US Region
1055,4335943060,Yes,Other (please specify),Duck,Baked,,Rice-based,,,,...,Yes,Yes,Yes,No,,Urban,60+,Male,"$100,000 to $124,999",Pacific
1056,4335934708,Yes,Turkey,,Baked,,,,Homemade,,...,Yes,No,Yes,Yes,Yes,,,,,
1057,4335894916,Yes,Turkey,,Baked,,Bread-based,,Canned,,...,Yes,Yes,Yes,No,,,,,,


We see there are 1058 rows (indexed 0 to 1057) and 66 columns, or attributes.
The columns in the dataset are:

In [6]:
data.columns

Index(['RespondentID', 'Do you celebrate Thanksgiving?',
       'What is typically the main dish at your Thanksgiving dinner?',
       'What is typically the main dish at your Thanksgiving dinner? - Other (please specify)',
       'How is the main dish typically cooked?',
       'How is the main dish typically cooked? - Other (please specify)',
       'What kind of stuffing/dressing do you typically have?',
       'What kind of stuffing/dressing do you typically have? - Other (please specify)',
       'What type of cranberry saucedo you typically have?',
       'What type of cranberry saucedo you typically have? - Other (please specify)',
       'Do you typically have gravy?',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Brussel sprouts',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Carrots',
       'Which of these side dishes aretypically served

To begin this analysis, it's important to know how many people celebrate Thanksgiving and how many don't.

In [7]:
data["Do you celebrate Thanksgiving?"].value_counts()

Yes    980
No      78
Name: Do you celebrate Thanksgiving?, dtype: int64

From the output above, only about 7% of respondents (78 of 1058) don't celebrate Thanksgiving.
Now let's filter out those who don't celebrate Thanksgiving. This is an important step, since the rest of the analysis
focuses on those who do.

In [8]:
data=data[data["Do you celebrate Thanksgiving?"]=="Yes"]

In [8]:
data["Do you celebrate Thanksgiving?"].value_counts() 

Yes    980
Name: Do you celebrate Thanksgiving?, dtype: int64
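
The two steps above (measuring the share of non-celebrants and keeping only the "Yes" rows) can be sketched on a toy table. The values below are made up for illustration; only the column name comes from the survey. In the real data the share is 78 / 1058, roughly 7.4%.

```python
import pandas as pd

# Toy stand-in for the survey (hypothetical values)
toy = pd.DataFrame({
    "RespondentID": [1, 2, 3, 4, 5],
    "Do you celebrate Thanksgiving?": ["Yes", "Yes", "No", "Yes", "Yes"],
})

col = "Do you celebrate Thanksgiving?"

# normalize=True returns fractions instead of raw counts
shares = toy[col].value_counts(normalize=True)
print(shares)

# Boolean mask: True for the rows we want to keep
mask = toy[col] == "Yes"
celebrants = toy[mask]
print(len(celebrants))
```

The mask is just a Series of True/False values, so it can be combined with other conditions using & and | if more filters are needed.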

We now explore various columns in the dataset. For example, we can easily find how many males and females
celebrate Thanksgiving. Notice that I set dropna=True. This excludes missing values (NaNs) from the counts; it is also the default for value_counts, so it is spelled out here only for clarity.

In [9]:

data["What is your gender?"].value_counts(dropna=True)

Female    515
Male      432
Name: What is your gender?, dtype: int64
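
To see what dropna actually changes, here is a small sketch on a made-up Series. In the survey output above, 515 + 432 = 947 of the 980 celebrants answered, so 33 left the gender question blank; dropna=False would show those as an extra NaN row.

```python
import numpy as np
import pandas as pd

# Hypothetical answers, two of them missing
gender = pd.Series(["Female", "Male", np.nan, "Female", np.nan])

counted = gender.value_counts(dropna=True)    # default: NaN is ignored
with_nan = gender.value_counts(dropna=False)  # NaN gets its own row

print(counted)
print(with_nan)
print(gender.isnull().sum())  # number of missing answers
```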

Explore main dishes 

In [10]:
data['What is typically the main dish at your Thanksgiving dinner?'].value_counts()

Turkey                    859
Other (please specify)     35
Ham/Pork                   29
Tofurkey                   20
Chicken                    12
Roast beef                 11
I don't know                5
Turducken                   3
Name: What is typically the main dish at your Thanksgiving dinner?, dtype: int64

We see that turkey is by far the most common main dish at Thanksgiving (859 respondents). With pandas, I can create a slice of my
data. For example, I can create a slice that contains only the respondents who have Tofurkey.

In [10]:
main_dish_Tofurkey=data[data["What is typically the main dish at your Thanksgiving dinner?"]=="Tofurkey"]
main_dish_Tofurkey.head(3)

Unnamed: 0,RespondentID,Do you celebrate Thanksgiving?,What is typically the main dish at your Thanksgiving dinner?,What is typically the main dish at your Thanksgiving dinner? - Other (please specify),How is the main dish typically cooked?,How is the main dish typically cooked? - Other (please specify),What kind of stuffing/dressing do you typically have?,What kind of stuffing/dressing do you typically have? - Other (please specify),What type of cranberry saucedo you typically have?,What type of cranberry saucedo you typically have? - Other (please specify),...,Have you ever tried to meet up with hometown friends on Thanksgiving night?,"Have you ever attended a ""Friendsgiving?""",Will you shop any Black Friday sales on Thanksgiving Day?,Do you work in retail?,Will you employer make you work on Black Friday?,How would you describe where you live?,Age,What is your gender?,How much total combined money did all members of your HOUSEHOLD earn last year?,US Region
4,4337931983,Yes,Tofurkey,,Baked,,Bread-based,,Canned,,...,Yes,No,No,No,,Urban,30 - 44,Male,"$100,000 to $124,999",Pacific
33,4337771439,Yes,Tofurkey,,Baked,,Bread-based,,Homemade,,...,Yes,No,No,No,,Suburban,30 - 44,Male,"$50,000 to $74,999",Middle Atlantic
69,4337553422,Yes,Tofurkey,,Baked,,Bread-based,,Canned,,...,No,Yes,No,No,,Urban,18 - 29,Male,"$10,000 to $24,999",West South Central


It's a simple matter to find the most commonly eaten dessert.

In [182]:
dessert_Cobbler='Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Apple cobbler'
dessert_Blondies='Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Blondies'
dessert_Brownies='Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Brownies'
dessert_CarrotCake='Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Carrot cake'
dessert_CheeseCake='Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Cheesecake'
dessert_Cookies='Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Cookies'
dessert_IceCream='Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Ice cream'
dessert_PeachCobbler='Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Peach cobbler'
dessert_others='Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Other (please specify).1'

In [177]:
#Non-missing answers in the peach cobbler column
peach_cobbler_answers=(data[dessert_PeachCobbler]).dropna(axis=0)

In [20]:
#Cobbler counts
Cobbler_counts=((data[dessert_Cobbler]).dropna(axis=0)).value_counts()
Cobbler_counts

Apple cobbler    110
Name: Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Apple cobbler, dtype: int64

In [21]:
Blondies_counts=((data[dessert_Blondies]).dropna()).value_counts()
Blondies_counts

Blondies    16
Name: Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Blondies, dtype: int64

In [22]:
dessert_icecream=((data[dessert_IceCream]).dropna()).value_counts()
dessert_icecream

Ice cream    266
Name: Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Ice cream, dtype: int64

In [23]:
#print(data.filter(like='meal').columns)
#Use a new name so the column-name variable dessert_PeachCobbler is not overwritten
PeachCobbler_counts=((data[dessert_PeachCobbler]).dropna()).value_counts()
PeachCobbler_counts

Peach cobbler    103
Name: Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Peach cobbler, dtype: int64

Of the four desserts counted above, ice cream is the most common (266 respondents).
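
Counting one dessert column at a time gets repetitive. Since each "select all that apply" column holds either the dessert name or NaN, all of them can be counted in one go. This is a minimal sketch on a toy table: the shortened question prefix and the values are hypothetical, not the survey's.

```python
import numpy as np
import pandas as pd

# Shortened, hypothetical question text
prefix = "Which of these desserts do you typically have? - "
toy = pd.DataFrame({
    prefix + "Apple cobbler": ["Apple cobbler", np.nan, np.nan, "Apple cobbler"],
    prefix + "Ice cream": ["Ice cream", "Ice cream", np.nan, "Ice cream"],
    prefix + "Blondies": [np.nan, np.nan, "Blondies", np.nan],
    "Age": ["18 - 29", "60+", "30 - 44", "45 - 59"],
})

# Keep only the dessert columns, then count non-missing entries per column
dessert_counts = (
    toy.filter(like="desserts")
       .count()
       .sort_values(ascending=False)
)

# Strip the shared question text so only the dessert name remains
dessert_counts.index = dessert_counts.index.str.replace(prefix, "", regex=False)
print(dessert_counts)
```

On the real data the same idea would cover every dessert column, not just the four inspected by hand.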

In [192]:
#Let's compare the mean income of people who eat homemade cranberry sauce with that of people who eat canned cranberry
#sauce


data["What type of cranberry saucedo you typically have?"].value_counts()



Canned                    502
Homemade                  301
None                      146
Other (please specify)     25
Name: What type of cranberry saucedo you typically have?, dtype: int64

In [193]:
#To get the mean income of those who eat homemade and canned cranberry sauce, we first have to clean
# the income column: get rid of the dollar sign and convert it from a string to an integer or float

(data["How much total combined money did all members of your HOUSEHOLD earn last year?"]).dtype


dtype('O')

In [32]:
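def cleaned_income(income):
    import math
    # The top bracket has no upper bound, so use its lower bound
    if income=="$200,000 and up":
        return 200000
    elif income=="Prefer not to answer":
        return np.nan
    # Missing answers are read in as float NaN
    elif isinstance(income,float) and math.isnan(income):
        return np.nan
    # "$75,000 to $99,999" becomes "75000 to 99999"
    income=income.replace(",", "").replace("$","")
    income_low,income_high=income.split("to")
    # Use the midpoint of the bracket as the income estimate
    return (int(income_low)+int(income_high))/2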
    
        

In [194]:
#Let's create a new column called income. It will contain the cleaned income values. To do this, we will
#use the .apply method of pandas.


data["income"]=data["How much total combined money did all members of your HOUSEHOLD earn last year?"].apply(cleaned_income)
data["income"].head()

0     87499.5
1     62499.5
2      4999.5
3    200000.0
4    112499.5
Name: income, dtype: float64
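
A quick way to check the cleaning logic is to run it on a handful of bracket strings. The helper below, bracket_midpoint, is a hypothetical stand-alone restatement of the same idea (not the function used above), so the snippet can run by itself.

```python
import numpy as np
import pandas as pd

def bracket_midpoint(income):
    """Hypothetical helper mirroring the cleaning logic above."""
    if pd.isnull(income) or income == "Prefer not to answer":
        return np.nan
    if income == "$200,000 and up":
        return 200000.0
    # "$75,000 to $99,999" -> low=75000, high=99999
    low, high = income.replace("$", "").replace(",", "").split(" to ")
    return (int(low) + int(high)) / 2

samples = pd.Series(["$75,000 to $99,999", "$0 to $9,999",
                     "$200,000 and up", "Prefer not to answer", np.nan])
cleaned = samples.apply(bracket_midpoint)
print(cleaned)
```

Keep in mind that the result is a bracket midpoint, not an actual income, and that the open-ended top bracket is pinned at 200000, so the means computed from this column are rough estimates.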

In [195]:
#Let's extract only those respondents who have homemade or canned sauce


homemade = data[data["What type of cranberry saucedo you typically have?"] == "Homemade"]
canned = data[data["What type of cranberry saucedo you typically have?"] == "Canned"]

In [196]:
#Now we can find the mean income of those who eat canned and homemade cranberry sauce.

print(canned["income"].mean())
print(homemade["income"].mean())

83823.4034091
94878.1072874


In [198]:
#Grouping data with pandas
#To do this, we use the groupby functionality.
#Read more here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html

grouped_cranberry= data.groupby("What type of cranberry saucedo you typically have?")
grouped_cranberry

#grouped_cranberry is a DataFrameGroupBy object rather than a plain dataframe, but many pandas functions can still be applied to it

<pandas.core.groupby.DataFrameGroupBy object at 0x7f6fc3b84908>

In [197]:
#.groups maps each type of cranberry sauce in the "What type of cranberry saucedo you typically have?" column to the index labels of its rows

grouped_cranberry.groups

{'Canned': Int64Index([   4,    6,    8,   11,   12,   15,   18,   19,   26,   27,
             ...
             1040, 1041, 1042, 1044, 1045, 1046, 1047, 1051, 1054, 1057],
            dtype='int64', length=502),
 'Homemade': Int64Index([   2,    3,    5,    7,   13,   14,   16,   20,   21,   23,
             ...
             1016, 1017, 1025, 1027, 1030, 1034, 1048, 1049, 1053, 1056],
            dtype='int64', length=301),
 'None': Int64Index([   0,   17,   24,   29,   34,   36,   40,   47,   49,   51,
             ...
              980,  981,  997, 1015, 1018, 1031, 1037, 1043, 1050, 1055],
            dtype='int64', length=146),
 'Other (please specify)': Int64Index([   1,    9,  154,  216,  221,  233,  249,  265,  301,  336,  380,
              435,  444,  447,  513,  550,  749,  750,  784,  807,  860,  872,
              905, 1000, 1007],
            dtype='int64')}

In [199]:
#we can also see the number of rows that correspond to Canned, Homemade, None and Other

grouped_cranberry.size()

#Seems most people have canned sauce

What type of cranberry saucedo you typically have?
Canned                    502
Homemade                  301
None                      146
Other (please specify)     25
dtype: int64
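
A single group can also be pulled out of the groupby object with get_group. The sketch below uses a toy table with hypothetical column names (sauce, income) and shows that the result matches the boolean-mask approach used earlier for the homemade and canned slices.

```python
import pandas as pd

# Toy data with hypothetical column names and values
toy = pd.DataFrame({
    "sauce": ["Canned", "Homemade", "Canned", "None", "Homemade", "Canned"],
    "income": [50000.0, 90000.0, 70000.0, 40000.0, 110000.0, 60000.0],
})

grouped = toy.groupby("sauce")

# size() counts the rows in each group
sizes = grouped.size()
print(sizes)

# get_group returns one group's rows as an ordinary DataFrame
canned_rows = grouped.get_group("Canned")
print(type(canned_rows))

# Same rows as filtering with a boolean mask
same = canned_rows.equals(toy[toy["sauce"] == "Canned"])
print(same)
```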

In [200]:
#One can also use a for loop to cycle through the group

for name, group in grouped_cranberry:
    print(name)
    print(group.shape)
    print(type(group))
    
#As you can see they are all dataframes!    

Canned
(502, 66)
<class 'pandas.core.frame.DataFrame'>
Homemade
(301, 66)
<class 'pandas.core.frame.DataFrame'>
None
(146, 66)
<class 'pandas.core.frame.DataFrame'>
Other (please specify)
(25, 66)
<class 'pandas.core.frame.DataFrame'>


In [201]:
#We can extract a column. The result will be a SeriesGroupBy object

grouped_cranberry["RespondentID"]


<pandas.core.groupby.SeriesGroupBy object at 0x7f6fc3b07c18>

In [202]:
grouped_cranberry["RespondentID"].size()

What type of cranberry saucedo you typically have?
Canned                    502
Homemade                  301
None                      146
Other (please specify)     25
dtype: int64

In [68]:
#Aggregating values in groups
grouped_cranberry["income"].aggregate(np.mean)

What type of cranberry saucedo you typically have?
Canned                    83823.403409
Homemade                  94878.107287
None                      78886.084034
Other (please specify)    86629.978261
Name: income, dtype: float64

In [203]:
#Or presented in a clearer table as


grouped_cranberry.aggregate(np.mean)

Unnamed: 0_level_0,RespondentID,income
What type of cranberry saucedo you typically have?,Unnamed: 1_level_1,Unnamed: 2_level_1
Canned,4336699416,83823.403409
Homemade,4336792040,94878.107287
None,4336764989,78886.084034
Other (please specify),4336763253,86629.978261
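
Passing np.mean works in the pandas version used for this notebook. As far as I know, newer pandas releases prefer the string name of the aggregation and may warn when a NumPy function is passed, so here is a small sketch of that form on toy data (column names and values are hypothetical).

```python
import pandas as pd

toy = pd.DataFrame({
    "sauce": ["Canned", "Homemade", "Canned", "Homemade"],
    "income": [50000.0, 90000.0, 70000.0, 110000.0],
})

# "mean" as a string selects the built-in group mean
mean_income = toy.groupby("sauce")["income"].agg("mean")
print(mean_income)

# The .mean() shortcut gives the same numbers
shortcut = toy.groupby("sauce")["income"].mean()
print(mean_income.equals(shortcut))
```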


In [205]:
#Let's group by more than one column. For example, the table below shows that families who have homemade cranberry
#sauce and Turducken have the highest mean income, 200000.

grouped=data.groupby(["What type of cranberry saucedo you typically have?",
                      "What is typically the main dish at your Thanksgiving dinner?"])

grouped.agg(np.mean)

Unnamed: 0_level_0,Unnamed: 1_level_0,RespondentID,income
What type of cranberry saucedo you typically have?,What is typically the main dish at your Thanksgiving dinner?,Unnamed: 2_level_1,Unnamed: 3_level_1
Canned,Chicken,4336354418,80999.6
Canned,Ham/Pork,4336757434,77499.535714
Canned,I don't know,4335987430,4999.5
Canned,Other (please specify),4336682072,53213.785714
Canned,Roast beef,4336254414,25499.5
Canned,Tofurkey,4337156546,100713.857143
Canned,Turkey,4336705225,85242.682045
Homemade,Chicken,4336539693,19999.5
Homemade,Ham/Pork,4337252861,96874.625
Homemade,I don't know,4336083561,


In [206]:
#We can compute several statistics at once

grouped["income"].agg([np.mean,np.sum,np.std]).head(5)

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,sum,std
What type of cranberry saucedo you typically have?,What is typically the main dish at your Thanksgiving dinner?,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Canned,Chicken,80999.6,404998.0,75779.481062
Canned,Ham/Pork,77499.535714,1084993.5,56645.063944
Canned,I don't know,4999.5,4999.5,
Canned,Other (please specify),53213.785714,372496.5,29780.94629
Canned,Roast beef,25499.5,127497.5,24584.039538


In [207]:
#Using apply on groups. For more info on groupby read
#http://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.core.groupby.GroupBy.apply.html

#With groupby, we can easily group data to discover hidden patterns. For example, we can see that suburban
#respondents account for the most turkey dinners, while Ham/Pork is least common among urban respondents.

grouped = data.groupby("How would you describe where you live?")["What is typically the main dish at your Thanksgiving dinner?"]

grouped.apply(lambda x:x.value_counts())

How would you describe where you live?                        
Rural                                   Turkey                    189
                                        Other (please specify)      9
                                        Ham/Pork                    7
                                        Tofurkey                    3
                                        I don't know                3
                                        Chicken                     2
                                        Turducken                   2
                                        Roast beef                  1
Suburban                                Turkey                    449
                                        Ham/Pork                   17
                                        Other (please specify)     13
                                        Tofurkey                    9
                                        Chicken                     3
                           

In [91]:
#Let's make it into a dataframe and get a nice table

pd.DataFrame(grouped.apply(lambda x:x.value_counts()))


Unnamed: 0_level_0,Unnamed: 1_level_0,What is typically the main dish at your Thanksgiving dinner?
How would you describe where you live?,Unnamed: 1_level_1,Unnamed: 2_level_1
Rural,Turkey,189
Rural,Other (please specify),9
Rural,Ham/Pork,7
Rural,Tofurkey,3
Rural,I don't know,3
Rural,Chicken,2
Rural,Turducken,2
Rural,Roast beef,1
Suburban,Turkey,449
Suburban,Ham/Pork,17


In [213]:
#Let's find regional patterns in menus. To do this we will use the pivot table functionality of pandas
#https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pivot.html. I introduced pivot tables in 
#an earlier tutorial

#values only needs to be a column with no missing entries; aggfunc=len then counts the rows in each cell
data.pivot_table(values="RespondentID",
                index="US Region", columns="What is typically the main dish at your Thanksgiving dinner?",
                aggfunc=len, fill_value=0)

What is typically the main dish at your Thanksgiving dinner?,Chicken,Ham/Pork,I don't know,Other (please specify),Roast beef,Tofurkey,Turducken,Turkey
US Region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
East North Central,0,4,0,5,0,1,0,135
East South Central,0,1,0,4,1,0,0,50
Middle Atlantic,1,2,0,4,2,5,1,130
Mountain,1,1,0,0,0,2,0,37
New England,2,0,0,1,0,1,0,51
Pacific,0,6,1,9,1,4,2,107
South Atlantic,3,7,0,6,3,3,0,181
West North Central,1,4,1,3,0,2,0,60
West South Central,1,2,0,3,0,2,0,77


In [107]:
#We could get the same result using crosstab. Read more here:
#https://pandas.pydata.org/pandas-docs/stable/generated/pandas.crosstab.html
pd.crosstab(data["US Region"],data["What is typically the main dish at your Thanksgiving dinner?"])

What is typically the main dish at your Thanksgiving dinner?,Chicken,Ham/Pork,I don't know,Other (please specify),Roast beef,Tofurkey,Turducken,Turkey
US Region,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
East North Central,0,4,0,5,0,1,0,135
East South Central,0,1,0,4,1,0,0,50
Middle Atlantic,1,2,0,4,2,5,1,130
Mountain,1,1,0,0,0,2,0,37
New England,2,0,0,1,0,1,0,51
Pacific,0,6,1,9,1,4,2,107
South Atlantic,3,7,0,6,3,3,0,181
West North Central,1,4,1,3,0,2,0,60
West South Central,1,2,0,3,0,2,0,77
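
Raw counts are hard to compare across regions because the regions differ a lot in size (South Atlantic has far more respondents than Mountain, for example). crosstab can return row shares instead via its normalize argument. A sketch on toy data with hypothetical column names and values:

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["Pacific", "Pacific", "Pacific", "Mountain", "Mountain"],
    "dish": ["Turkey", "Turkey", "Ham/Pork", "Turkey", "Tofurkey"],
})

# Plain counts, as in the table above
counts = pd.crosstab(toy["region"], toy["dish"])

# normalize="index" divides each row by its row total
shares = pd.crosstab(toy["region"], toy["dish"], normalize="index")

print(counts)
print(shares.round(2))
```

Each row of shares sums to 1, so a value can be read as the fraction of that region's respondents who chose the dish.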


In [208]:
#We can use groupby to get the same result as the pivot table. The presentation is slightly different, but they
#both convey the same information. Personally, I prefer the presentation of crosstab.

pd.DataFrame(data.groupby("US Region")
             ["What is typically the main dish at your Thanksgiving dinner?"].apply(lambda x:x.value_counts()))

Unnamed: 0_level_0,Unnamed: 1_level_0,What is typically the main dish at your Thanksgiving dinner?
US Region,Unnamed: 1_level_1,Unnamed: 2_level_1
East North Central,Turkey,135
East North Central,Other (please specify),5
East North Central,Ham/Pork,4
East North Central,Tofurkey,1
East South Central,Turkey,50
East South Central,Other (please specify),4
East South Central,Roast beef,1
East South Central,Ham/Pork,1
Middle Atlantic,Turkey,130
Middle Atlantic,Tofurkey,5


In [209]:
#Now let's find age-, gender-, and income-based patterns in dinner menus.

#It is interesting to see that the 60+ age group eats the least Ham/Pork (2 respondents). This could be a result of
#health concerns, although the counts are small.


pd.crosstab(data["Age"],data["What is typically the main dish at your Thanksgiving dinner?"])

What is typically the main dish at your Thanksgiving dinner?,Chicken,Ham/Pork,I don't know,Other (please specify),Roast beef,Tofurkey,Turducken,Turkey
Age,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
18 - 29,3,6,3,2,3,6,0,162
30 - 44,3,12,1,10,1,9,2,197
45 - 59,2,8,0,14,3,2,0,240
60+,4,2,0,9,3,3,1,236


In [210]:
#Let's add the "What is your gender?" column.

#It seems that in the 60+ age group, the males may be more concerned about their health. They eat more chicken and
#less Ham/Pork, which is known to contain a lot of fat. The counts are small, so this is only suggestive.

pd.crosstab((data["Age"],data["What is your gender?"]),data["What is typically the main dish at your Thanksgiving dinner?"])

Unnamed: 0_level_0,What is typically the main dish at your Thanksgiving dinner?,Chicken,Ham/Pork,I don't know,Other (please specify),Roast beef,Tofurkey,Turducken,Turkey
Age,What is your gender?,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
18 - 29,Female,1,3,2,2,0,4,0,90
18 - 29,Male,2,3,1,0,3,2,0,72
30 - 44,Female,2,6,0,8,1,4,0,106
30 - 44,Male,1,6,1,2,0,5,2,91
45 - 59,Female,2,4,0,7,2,2,0,125
45 - 59,Male,0,4,0,7,1,0,0,115
60+,Female,1,2,0,5,1,3,1,131
60+,Male,3,0,0,4,2,0,0,105


There are several potential next steps. For example, we could find the number of people who work on Thanksgiving. We could also check whether there is any correlation between not celebrating Thanksgiving and working on Thanksgiving.