In this project we will analyze 2015 survey data on Thanksgiving dinner in the US using the pandas library. We will load the CSV file into a dataframe and explore the data with a few pandas functions.

In [1]:
import pandas as pd

data = pd.read_csv('thanksgiving.csv', encoding = 'Latin-1')
# Display first 3 rows of dataframe
data.head(3)

Unnamed: 0,RespondentID,Do you celebrate Thanksgiving?,What is typically the main dish at your Thanksgiving dinner?,What is typically the main dish at your Thanksgiving dinner? - Other (please specify),How is the main dish typically cooked?,How is the main dish typically cooked? - Other (please specify),What kind of stuffing/dressing do you typically have?,What kind of stuffing/dressing do you typically have? - Other (please specify),What type of cranberry saucedo you typically have?,What type of cranberry saucedo you typically have? - Other (please specify),...,Have you ever tried to meet up with hometown friends on Thanksgiving night?,"Have you ever attended a ""Friendsgiving?""",Will you shop any Black Friday sales on Thanksgiving Day?,Do you work in retail?,Will you employer make you work on Black Friday?,How would you describe where you live?,Age,What is your gender?,How much total combined money did all members of your HOUSEHOLD earn last year?,US Region
0,4337954960,Yes,Turkey,,Baked,,Bread-based,,,,...,Yes,No,No,No,,Suburban,18 - 29,Male,"$75,000 to $99,999",Middle Atlantic
1,4337951949,Yes,Turkey,,Baked,,Bread-based,,Other (please specify),Homemade cranberry gelatin ring,...,No,No,Yes,No,,Rural,18 - 29,Female,"$50,000 to $74,999",East South Central
2,4337935621,Yes,Turkey,,Roasted,,Rice-based,,Homemade,,...,Yes,Yes,Yes,No,,Suburban,18 - 29,Male,"$0 to $9,999",Mountain


In [2]:
# Print the column names to get a sense of what data looks like
print(data.columns)

Index(['RespondentID', 'Do you celebrate Thanksgiving?',
       'What is typically the main dish at your Thanksgiving dinner?',
       'What is typically the main dish at your Thanksgiving dinner? - Other (please specify)',
       'How is the main dish typically cooked?',
       'How is the main dish typically cooked? - Other (please specify)',
       'What kind of stuffing/dressing do you typically have?',
       'What kind of stuffing/dressing do you typically have? - Other (please specify)',
       'What type of cranberry saucedo you typically have?',
       'What type of cranberry saucedo you typically have? - Other (please specify)',
       'Do you typically have gravy?',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Brussel sprouts',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Carrots',
       'Which of these side dishes aretypically served

In [3]:
# Print the table structure (row x column)
print(data.shape)

(1058, 65)


From the dataset we can see that each column name is a survey question and each row is one respondent's answers to these questions. We can narrow this dataset by removing all the rows that answered 'No' to the first question, 'Do you celebrate Thanksgiving?'.

This is done by comparing that column with 'Yes', which produces a Boolean series (True for every row that answered 'Yes'). Indexing the dataframe with this Boolean series keeps only those rows and drops everyone who answered 'No'.
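The filtering step described here can be sketched on a small example. The toy dataframe below is hypothetical and only stands in for the survey data; the column name is the one used in the survey.

```python
import pandas as pd

# Hypothetical toy data standing in for the survey responses.
toy = pd.DataFrame({
    "RespondentID": [1, 2, 3, 4],
    "Do you celebrate Thanksgiving?": ["Yes", "No", "Yes", "Yes"],
})

# Comparing a column with a value yields a Boolean Series, one entry per row.
celebrates = toy["Do you celebrate Thanksgiving?"] == "Yes"
print(celebrates.tolist())  # [True, False, True, True]

# Indexing the dataframe with that Series keeps only the True rows.
# .copy() returns an independent dataframe, so new columns can be added later.
kept = toy[celebrates].copy()
print(kept.shape)  # (3, 2)
```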

In [4]:
print(data['Do you celebrate Thanksgiving?'].value_counts())

Yes    980
No      78
Name: Do you celebrate Thanksgiving?, dtype: int64


In [5]:
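# Keep only respondents who celebrate; .copy() avoids a SettingWithCopyWarning when columns are added later
data = data[data['Do you celebrate Thanksgiving?'] == 'Yes'].copy()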
data.shape

(980, 65)

The number of rows has dropped from 1058 to 980, so the filter worked. We can now analyze our data. Since we are working with a survey, most columns hold a small set of repeated answers, and the .value_counts() method tallies how often each answer occurs in a column.

In [6]:
print(data['What is typically the main dish at your Thanksgiving dinner?'].value_counts())

Turkey                    859
Other (please specify)     35
Ham/Pork                   29
Tofurkey                   20
Chicken                    12
Roast beef                 11
I don't know                5
Turducken                   3
Name: What is typically the main dish at your Thanksgiving dinner?, dtype: int64


From the output above we can see that Turkey is by far the most common main dish at Thanksgiving dinner (859 of the 980 respondents). We can explore this data further. Let's say we want to investigate the number of people who have gravy when they eat Tofurkey as the main dish.

In [7]:
tofurkey_gravy = data[data['What is typically the main dish at your Thanksgiving dinner?'] == 'Tofurkey']['Do you typically have gravy?']
tofurkey_gravy.value_counts()

Yes    12
No      8
Name: Do you typically have gravy?, dtype: int64

Of the 20 respondents who eat Tofurkey as a main dish, 12 (60%) typically have gravy.
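The percentage can also be read directly from pandas rather than worked out by hand. A minimal sketch, assuming a Series that reproduces the counts printed above (12 'Yes', 8 'No'); the variable names are illustrative:

```python
import pandas as pd

# Hypothetical answers reproducing the counts printed above (12 Yes, 8 No).
gravy_answers = pd.Series(["Yes"] * 12 + ["No"] * 8)

# normalize=True returns fractions of the total instead of raw counts.
shares = gravy_answers.value_counts(normalize=True)
print(shares["Yes"])  # 0.6
```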

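Next we are interested to see how many people in this survey have apple, pumpkin, or pecan pie during Thanksgiving. Each pie column is empty (NaN) when the respondent did not select that pie, so pd.isnull() turns each column into a Boolean series that is True where the pie was not selected. Combining the three series with the & operator gives a Boolean series that is True only for respondents who selected none of the three pies. We can then use the .value_counts() method to tally the result: the False count is the number of respondents who have at least one of these pies.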

In [8]:
apple = data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple']
apple_isnull = pd.isnull(apple)

pumpkin = data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pumpkin']
pumpkin_isnull = pd.isnull(pumpkin)

pecan = data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pecan']
pecan_isnull = pd.isnull(pecan)

no_pie = (apple_isnull) & (pumpkin_isnull) & (pecan_isnull)
no_pie.value_counts()

False    876
True     104
dtype: int64

Of the 980 respondents who celebrate Thanksgiving, 876 (about 89%) have at least one of apple, pumpkin, or pecan pie.
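The same check can be written over several columns at once instead of chaining & by hand. This is a sketch on hypothetical toy data with shortened column names, not the survey's real columns:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a selected option holds the pie name, an unselected one is NaN.
pies = pd.DataFrame({
    "Apple":   ["Apple", np.nan, np.nan, "Apple"],
    "Pumpkin": ["Pumpkin", np.nan, "Pumpkin", np.nan],
    "Pecan":   [np.nan, np.nan, np.nan, "Pecan"],
})

# True where a respondent selected none of the three pies.
no_pie = pies.isnull().all(axis=1)
print(no_pie.tolist())  # [False, True, False, False]

# Share of respondents who selected at least one pie.
print((~no_pie).mean())  # 0.75
```

Using .all(axis=1) scales to any number of option columns, which helps with "select all that apply" questions.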

We want to make sure this survey isn't biased towards the older generation and covers all ages. Currently the Age column holds text ranges such as '18 - 29', which are difficult to analyze. We can write a function that extracts the lower limit of each range as an integer, then use the .apply() method to run it on every value in the column.

In [9]:
data['Age'].value_counts()

45 - 59    269
60+        258
30 - 44    235
18 - 29    185
Name: Age, dtype: int64

In [10]:
# Convert an age label such as '18 - 29' or '60+' to its integer lower limit
def convert_int(value):
    # Missing answers stay missing
    if pd.isnull(value):
        return None
    # Keep the text before the first space, then drop the '+' from '60+'
    string = value.split(' ')[0]
    string = string.replace('+', '')
    return int(string)

In [11]:
int_age = data['Age'].apply(convert_int)

# Add the column int_age to the dataframe 'data'
data['int_age'] = int_age

# Generate summary statistics
data['int_age'].describe()

count    947.000000
mean      40.089757
std       15.352014
min       18.000000
25%       30.000000
50%       45.000000
75%       60.000000
max       60.000000
Name: int_age, dtype: float64

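Because we took the lower limit of each age range, the mean of about 40 understates the true average age. Even so, the survey participants appear to cover all age groups: each of the four ranges has between 185 and 269 respondents.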
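As an alternative to a row-by-row function, the lower limit can be pulled out with pandas string methods. This sketch swaps in Series.str.extract for the .apply() approach used above; the sample values are the four age labels from the output plus one missing answer:

```python
import numpy as np
import pandas as pd

# The four age labels from the survey output, plus a missing answer.
ages = pd.Series(["18 - 29", "30 - 44", "45 - 59", "60+", np.nan])

# Pull out the first run of digits; missing answers stay NaN.
lower_limit = pd.to_numeric(ages.str.extract(r"(\d+)", expand=False))
print(lower_limit.tolist())  # [18.0, 30.0, 45.0, 60.0, nan]
```

The result is a float column because NaN cannot be stored in a plain integer column, which matches the float64 dtype shown by .describe() above.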

Next we are interested in the household income of the survey participants. We do this to check whether the average income of the participants is representative of the population.

In [12]:
data['How much total combined money did all members of your HOUSEHOLD earn last year?'].value_counts()

$25,000 to $49,999      166
$75,000 to $99,999      127
$50,000 to $74,999      127
Prefer not to answer    118
$100,000 to $124,999    109
$200,000 and up          76
$10,000 to $24,999       60
$0 to $9,999             52
$125,000 to $149,999     48
$150,000 to $174,999     38
$175,000 to $199,999     26
Name: How much total combined money did all members of your HOUSEHOLD earn last year?, dtype: int64

In [13]:
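# Convert an income label such as '$25,000 to $49,999' to its integer lower limit
def convert_int_inc(value):
    # Missing answers stay missing
    if pd.isnull(value):
        return None
    string = value.split(' ')[0]
    # 'Prefer not to answer' carries no amount
    if string == 'Prefer':
        return None
    # Strip the dollar sign and the thousands separators
    string = string.replace('$', '').replace(',', '')
    return int(string)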

In [14]:
int_income = data['How much total combined money did all members of your HOUSEHOLD earn last year?'].apply(convert_int_inc)
data['int_income'] = int_income
data['int_income'].describe()

count       829.000000
mean      75965.018094
std       59068.636748
min           0.000000
25%       25000.000000
50%       75000.000000
75%      100000.000000
max      200000.000000
Name: int_income, dtype: float64

We once again took the lower limit of each income range, so the average skews downwards. Even so, we can see that the average income is quite high (~76K). The median (75K) is relatively close to the average, which suggests the typical household earns around this amount.
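The income labels can be parsed with string methods as well. A minimal sketch, assuming the label formats shown in the output above; the regular expression and variable names are illustrative choices, not part of the original analysis:

```python
import pandas as pd

# A few of the income labels from the survey output.
income = pd.Series([
    "$0 to $9,999",
    "$75,000 to $99,999",
    "$200,000 and up",
    "Prefer not to answer",
])

# Capture the first dollar amount, then drop the thousands separators.
# Labels without a dollar amount do not match and become NaN.
first_amount = income.str.extract(r"\$([\d,]+)", expand=False)
lower_limit = pd.to_numeric(first_amount.str.replace(",", "", regex=False))
print(lower_limit.tolist())  # [0.0, 75000.0, 200000.0, nan]
```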

Next, let's see if there is any correlation between income and travel distance of those who took part in the survey.

In [15]:
data['How far will you travel for Thanksgiving?'].value_counts()

Thanksgiving is happening at my home--I won't travel at all                         396
Thanksgiving is local--it will take place in the town I live in                     276
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    197
Thanksgiving is out of town and far away--I have to drive several hours or fly       82
Name: How far will you travel for Thanksgiving?, dtype: int64

In [16]:
data[data['int_income'] < 150000]['How far will you travel for Thanksgiving?'].value_counts()

Thanksgiving is happening at my home--I won't travel at all                         281
Thanksgiving is local--it will take place in the town I live in                     203
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    150
Thanksgiving is out of town and far away--I have to drive several hours or fly       55
Name: How far will you travel for Thanksgiving?, dtype: int64

In [17]:
data[data['int_income'] > 150000]['How far will you travel for Thanksgiving?'].value_counts()

Thanksgiving is happening at my home--I won't travel at all                         49
Thanksgiving is local--it will take place in the town I live in                     25
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    16
Thanksgiving is out of town and far away--I have to drive several hours or fly      12
Name: How far will you travel for Thanksgiving?, dtype: int64

In [18]:
data[data['int_income'] < 75000]['How far will you travel for Thanksgiving?'].value_counts()

Thanksgiving is happening at my home--I won't travel at all                         147
Thanksgiving is local--it will take place in the town I live in                     138
Thanksgiving is out of town but not too far--it's a drive of a few hours or less     94
Thanksgiving is out of town and far away--I have to drive several hours or fly       26
Name: How far will you travel for Thanksgiving?, dtype: int64

It looks like high-income survey participants (> 150K) actually stay home at a higher rate (48%) than lower-income participants (< 150K, 41%). Note that both filters use strict comparisons on the lower limit of each income range, so respondents in the $150,000 to $174,999 range fall into neither group. A reason for the difference may be that the high-income participants are older and have their own families, while the lower-income participants may include students who live on campus and travel home for Thanksgiving. When we lower the income threshold again (< 75K), the share of participants who stay at home drops further (36%). This is consistent with the suggestion that students make up a significant portion of the lower-income participants, although the survey does not tell us who is a student.
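The percentages quoted here can be recomputed from the three outputs above. In this sketch the counts are copied from those outputs, while the short row and column labels are shorthand introduced for illustration:

```python
import pandas as pd

# Counts copied from the three outputs above.
# Rows: at home, local, a few hours away, far away.
counts = pd.DataFrame(
    {
        "under_75k": [147, 138, 94, 26],
        "under_150k": [281, 203, 150, 55],
        "over_150k": [49, 25, 16, 12],
    },
    index=["home", "local", "few hours", "far"],
)

# Divide each column by its total to get the share of each travel answer.
shares = counts / counts.sum()
for group, share in shares.loc["home"].items():
    print(f"{group}: {share * 100:.1f}%")
# under_75k: 36.3%
# under_150k: 40.8%
# over_150k: 48.0%
```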

We can use the .pivot_table() method, which by default reports the mean of the values column for each combination of answers, to see whether age or income is related to spending Thanksgiving with friends.

In [19]:
data.pivot_table(
    index = "Have you ever tried to meet up with hometown friends on Thanksgiving night?",
    columns = 'Have you ever attended a "Friendsgiving?"',
    values = 'int_age'
)

Have you ever attended a "Friendsgiving?"                                           No        Yes
Have you ever tried to meet up with hometown friends on Thanksgiving night?
No                                                                           42.283702  37.010526
Yes                                                                           41.47541  33.976744


In [20]:
data.pivot_table(
    index = "Have you ever tried to meet up with hometown friends on Thanksgiving night?",
    columns = 'Have you ever attended a "Friendsgiving?"',
    values = 'int_income'
)

Have you ever attended a "Friendsgiving?"                                              No           Yes
Have you ever tried to meet up with hometown friends on Thanksgiving night?
No                                                                           78914.549654  72894.736842
Yes                                                                               78750.0  66019.736842


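It looks like people who spend their Thanksgiving with friends are younger and have a lower average income. Respondents who answered 'Yes' to both questions have an average (lower-limit) age of about 34 and an average (lower-limit) income of about $66,000, compared with about 42 and $78,900 for those who answered 'No' to both.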
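To make explicit what the pivot tables above compute, here is a sketch on hypothetical toy data with shortened column names. With no aggfunc argument, pivot_table averages the values column within each cell, which is the same as a groupby followed by a mean:

```python
import pandas as pd

# Hypothetical toy data with shortened column names.
toy = pd.DataFrame({
    "met_friends":   ["Yes", "Yes", "No", "No", "No"],
    "friendsgiving": ["Yes", "Yes", "No", "No", "Yes"],
    "int_age":       [18, 30, 45, 60, 30],
})

# With no aggfunc given, pivot_table averages int_age within each cell.
table = toy.pivot_table(index="met_friends", columns="friendsgiving", values="int_age")
print(table)

# The same numbers via groupby, which makes the averaging explicit.
same = toy.groupby(["met_friends", "friendsgiving"])["int_age"].mean().unstack()
print(same)
```

Combinations that never occur in the data (here 'Yes' for met_friends with 'No' for friendsgiving) appear as NaN.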

Now let's perform some further investigations on this data set:
- **Most Common Dessert People Eat**

In [21]:
col_names = data.columns.tolist()

dessert_column = []

for c in col_names:
    if c.startswith('Which of these desserts'):
        dessert_column.append(c)
        
dessert_df = data[dessert_column]
dessert_df.head()

Unnamed: 0,Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply. - Apple cobbler,Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply. - Blondies,Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply. - Brownies,Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply. - Carrot cake,Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply. - Cheesecake,Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply. - Cookies,Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply. - Fudge,Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply. - Ice cream,Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply. - Peach cobbler,Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply. - None,Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply. - Other (please specify),Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply. - Other (please specify).1
0,,,,,Cheesecake,Cookies,,Ice cream,,,,
1,,,,,Cheesecake,Cookies,,,,,Other (please specify),"Jelly roll, sweet cheeseball, chocolate dipped..."
2,,,Brownies,Carrot cake,,Cookies,Fudge,Ice cream,,,,
3,,,,,,,,,,,,
4,,,,,,,,,,,,


In [22]:
# .count() counts non-NA cells for each column or row.
dessert_df.count()

Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Apple cobbler               110
Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Blondies                     16
Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Brownies                    128
Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Carrot cake                  72
Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Cheesecake                  191
Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Cookies                     204
Which of these desserts do you typically have at Thanksgiving dinner? Please select all that apply.   - Fudge                        43
Which of these desserts do you typically have at

Looking at the counts, the most common answer to the dessert question during Thanksgiving is None (no dessert), at approximately 19% of the dessert selections. It is followed by Ice cream, which makes up approximately 17%.
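Ranking the options is easier when the counts are sorted and the long question prefix is removed from the labels. A minimal sketch on hypothetical toy data (the three columns and five rows are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy answers; the real columns share this long question prefix.
prefix = ("Which of these desserts do you typically have at Thanksgiving dinner? "
          "Please select all that apply. - ")
toy = pd.DataFrame({
    prefix + "Cookies":   ["Cookies", "Cookies", "Cookies", np.nan, np.nan],
    prefix + "Ice cream": ["Ice cream", np.nan, np.nan, "Ice cream", np.nan],
    prefix + "None":      [np.nan, np.nan, np.nan, np.nan, "None"],
})

counts = toy.count()                                   # non-missing cells per column
counts.index = counts.index.str.split(" - ").str[-1]   # keep only the option name
counts = counts.sort_values(ascending=False)           # most common first
shares = counts / counts.sum()                         # share of all selections

print(counts.index.tolist())  # ['Cookies', 'Ice cream', 'None']
print(shares["Cookies"])      # 0.5
```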

- **Number of People Who Work on Black Friday**

In [23]:
retail = data['Do you work in retail?'].value_counts()
retail

No     881
Yes     70
Name: Do you work in retail?, dtype: int64

In [24]:
working = data['Will you employer make you work on Black Friday?'].value_counts()
working

Yes              43
No               20
Doesn't apply     7
Name: Will you employer make you work on Black Friday?, dtype: int64

Black Friday is an informal name for the day following Thanksgiving. A large number of retail stores are open on Black Friday and offer promotional sales. Of the respondents who celebrate Thanksgiving, 70 work in retail, and 43 of them (~61%) say their employer will make them work on Black Friday.
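The share can be checked from the counts printed above. The second figure in this sketch, which leaves out the "Doesn't apply" answers, is an alternative reading offered for comparison and not part of the original analysis:

```python
import pandas as pd

# Counts copied from the output above.
working = pd.Series({"Yes": 43, "No": 20, "Doesn't apply": 7})

# Share of all retail workers who expect to work on Black Friday.
print(round(working["Yes"] / working.sum() * 100, 1))  # 61.4

# Share among those for whom the question applies.
applies = working.drop("Doesn't apply")
print(round(applies["Yes"] / applies.sum() * 100, 1))  # 68.3
```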