# Thanksgiving data analysis

In [2]:
import pandas as pd
import re
data = pd.read_csv('thanksgiving.csv', encoding='Latin-1')
data.head()

Unnamed: 0,RespondentID,Do you celebrate Thanksgiving?,What is typically the main dish at your Thanksgiving dinner?,What is typically the main dish at your Thanksgiving dinner? - Other (please specify),How is the main dish typically cooked?,How is the main dish typically cooked? - Other (please specify),What kind of stuffing/dressing do you typically have?,What kind of stuffing/dressing do you typically have? - Other (please specify),What type of cranberry saucedo you typically have?,What type of cranberry saucedo you typically have? - Other (please specify),...,Have you ever tried to meet up with hometown friends on Thanksgiving night?,"Have you ever attended a ""Friendsgiving?""",Will you shop any Black Friday sales on Thanksgiving Day?,Do you work in retail?,Will you employer make you work on Black Friday?,How would you describe where you live?,Age,What is your gender?,How much total combined money did all members of your HOUSEHOLD earn last year?,US Region
0,4337954960,Yes,Turkey,,Baked,,Bread-based,,,,...,Yes,No,No,No,,Suburban,18 - 29,Male,"$75,000 to $99,999",Middle Atlantic
1,4337951949,Yes,Turkey,,Baked,,Bread-based,,Other (please specify),Homemade cranberry gelatin ring,...,No,No,Yes,No,,Rural,18 - 29,Female,"$50,000 to $74,999",East South Central
2,4337935621,Yes,Turkey,,Roasted,,Rice-based,,Homemade,,...,Yes,Yes,Yes,No,,Suburban,18 - 29,Male,"$0 to $9,999",Mountain
3,4337933040,Yes,Turkey,,Baked,,Bread-based,,Homemade,,...,Yes,No,No,No,,Urban,30 - 44,Male,"$200,000 and up",Pacific
4,4337931983,Yes,Tofurkey,,Baked,,Bread-based,,Canned,,...,Yes,No,No,No,,Urban,30 - 44,Male,"$100,000 to $124,999",Pacific


In [3]:
data.columns

Index(['RespondentID', 'Do you celebrate Thanksgiving?',
       'What is typically the main dish at your Thanksgiving dinner?',
       'What is typically the main dish at your Thanksgiving dinner? - Other (please specify)',
       'How is the main dish typically cooked?',
       'How is the main dish typically cooked? - Other (please specify)',
       'What kind of stuffing/dressing do you typically have?',
       'What kind of stuffing/dressing do you typically have? - Other (please specify)',
       'What type of cranberry saucedo you typically have?',
       'What type of cranberry saucedo you typically have? - Other (please specify)',
       'Do you typically have gravy?',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Brussel sprouts',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Carrots',
       'Which of these side dishes aretypically served

## Filtering rows in the DataFrame

In [4]:
# count the respondents who do and do not celebrate Thanksgiving
print(data['Do you celebrate Thanksgiving?'].value_counts())

Yes    980
No      78
Name: Do you celebrate Thanksgiving?, dtype: int64


In [8]:
# retain people who celebrate Thanksgiving
data = data[data['Do you celebrate Thanksgiving?'] == 'Yes']
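
The filter above works by indexing the frame with a boolean mask. Here is a minimal sketch on a toy frame; the `toy` data is invented for illustration and is not taken from `thanksgiving.csv`. The `.copy()` call is an assumption added here, not part of the original cell; it is one common way to avoid pandas' `SettingWithCopyWarning` when new columns are assigned to the filtered frame later.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    'Do you celebrate Thanksgiving?': ['Yes', 'No', 'Yes'],
    'Age': ['18 - 29', '60+', '30 - 44'],
})

# The comparison yields a boolean Series; indexing with it keeps only the True rows.
mask = toy['Do you celebrate Thanksgiving?'] == 'Yes'

# .copy() (an assumption, not in the original) makes the result an independent frame.
celebrants = toy[mask].copy()

# Adding a column to the copy leaves the original frame untouched.
celebrants['flag'] = 1
print(len(celebrants))
```

The original row labels are kept after filtering, which is why later outputs in this notebook show gaps in the index.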

## Using value_counts to explore main dishes

In [6]:
# display the main dishes people eat during Thanksgiving dinner
print(data['What is typically the main dish at your Thanksgiving dinner?'].value_counts())

Turkey                    859
Other (please specify)     35
Ham/Pork                   29
Tofurkey                   20
Chicken                    12
Roast beef                 11
I don't know                5
Turducken                   3
Name: What is typically the main dish at your Thanksgiving dinner?, dtype: int64


In [7]:
# display whether respondents whose main dish is Tofurkey typically have gravy
print(data['Do you typically have gravy?'][data['What is typically the main dish at your Thanksgiving dinner?']=='Tofurkey'])

4      Yes
33     Yes
69      No
72      No
77     Yes
145    Yes
175    Yes
218     No
243    Yes
275     No
393    Yes
399    Yes
571    Yes
594    Yes
628     No
774     No
820     No
837    Yes
860     No
953    Yes
Name: Do you typically have gravy?, dtype: object
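
The listing above has 12 "Yes" and 8 "No" answers among the 20 Tofurkey rows. Rather than reading the rows one by one, the same selection can be summarised with `value_counts`. This is a sketch on a toy frame: the short column names `main_dish` and `gravy` are hypothetical stand-ins for the long survey questions, and the rows are invented.

```python
import pandas as pd

# Toy data with hypothetical short column names, invented for illustration.
toy = pd.DataFrame({
    'main_dish': ['Turkey', 'Tofurkey', 'Tofurkey', 'Tofurkey', 'Ham/Pork'],
    'gravy': ['Yes', 'Yes', 'No', 'Yes', 'Yes'],
})

# .loc selects rows with the boolean mask and the column by label in one step.
tofurkey_gravy = toy.loc[toy['main_dish'] == 'Tofurkey', 'gravy']

# Count each answer instead of printing every row.
counts = tofurkey_gravy.value_counts()
print(counts)
```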


## Filtering for people who eat apple/pumpkin/pecan pie

In [13]:
col_apple = 'Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple'
col_Pumpkin = 'Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pumpkin'
col_Pecan = 'Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pecan'
apple_isnull = pd.isnull(data[col_apple])
pumpkin_isnull = pd.isnull(data[col_Pumpkin])
pecan_isnull  = pd.isnull(data[col_Pecan])
# note: despite its name, ate_pies is True when none of the three pies was selected
ate_pies = apple_isnull & pumpkin_isnull & pecan_isnull
ate_pies.value_counts()

False    876
True     104
dtype: int64
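
In the output above, `True` (104) counts respondents who selected none of the three pies and `False` (876) counts those who selected at least one. A sketch of a more direct way to build the "at least one pie" mask follows; the column names `apple`, `pumpkin`, and `pecan` are hypothetical short names and the rows are invented.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical short column names; NaN means the pie was not selected.
toy = pd.DataFrame({
    'apple': ['Apple', np.nan, np.nan],
    'pumpkin': ['Pumpkin', 'Pumpkin', np.nan],
    'pecan': [np.nan, np.nan, np.nan],
})

# True where at least one of the three pie columns is filled in.
ate_any_pie = toy[['apple', 'pumpkin', 'pecan']].notna().any(axis=1)

# The complement: none of the three pies was selected.
no_pie = ~ate_any_pie
print(ate_any_pie.value_counts())
```

Naming the mask after what `True` means makes the later selection read as `toy[ate_any_pie]` rather than a comparison against `False`.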

In [14]:
# rows where at least one of the three pies was selected
print(data[ate_pies == False])

      RespondentID Do you celebrate Thanksgiving?  \
0       4337954960                            Yes   
1       4337951949                            Yes   
2       4337935621                            Yes   
3       4337933040                            Yes   
4       4337931983                            Yes   
6       4337924420                            Yes   
8       4337914977                            Yes   
9       4337899817                            Yes   
11      4337893416                            Yes   
12      4337888291                            Yes   
13      4337878450                            Yes   
14      4337878351                            Yes   
16      4337856362                            Yes   
17      4337854106                            Yes   
18      4337844879                            Yes   
19      4337823612                            Yes   
20      4337820281                            Yes   
23      4337793158                            

## Converting Age to Numeric value

In [15]:
# start to inspect the Age column
print(data['Age'].value_counts())

45 - 59    269
60+        258
30 - 44    235
18 - 29    185
Name: Age, dtype: int64


In [16]:
# the values are age brackets, so convert each one to its lower bound as an integer
def age_conv(age_str):
    if pd.isnull(age_str):
        return None
    age_str = age_str.replace('+','')
    split_str = age_str.split(' ')
    age = int(split_str[0])
    return age
int_age = data['Age'].apply(age_conv)
data['int_age'] = int_age
data['int_age'].describe()

count    947.000000
mean      40.089757
std       15.352014
min       18.000000
25%       30.000000
50%       45.000000
75%       60.000000
max       60.000000
Name: int_age, dtype: float64
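
Note that `int_age` holds the lower bound of each bracket, so the mean of about 40 understates the respondents' real ages. The sketch below keeps both bounds of a bracket. `parse_age_bracket` is a hypothetical helper, not part of the original notebook, and it only handles `None` as a missing value; inside pandas a `pd.isnull` check would be needed first, as in `age_conv`.

```python
import re

def parse_age_bracket(age_str):
    """Return (low, high) for a label like '18 - 29'; high is None for '60+'."""
    if age_str is None:
        return None
    # Pull out every run of digits in the label.
    numbers = [int(n) for n in re.findall(r'\d+', age_str)]
    if not numbers:
        return None
    low = numbers[0]
    # An open-ended bracket such as '60+' has no upper bound.
    high = numbers[1] if len(numbers) > 1 else None
    return (low, high)

for label in ['18 - 29', '30 - 44', '45 - 59', '60+']:
    print(label, parse_age_bracket(label))
```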

## Converting Money to numeric value

In [19]:
# inspect the data
col_money = 'How much total combined money did all members of your HOUSEHOLD earn last year?'
data[col_money].value_counts()

$25,000 to $49,999      166
$75,000 to $99,999      127
$50,000 to $74,999      127
Prefer not to answer    118
$100,000 to $124,999    109
$200,000 and up          76
$10,000 to $24,999       60
$0 to $9,999             52
$125,000 to $149,999     48
$150,000 to $174,999     38
$175,000 to $199,999     26
Name: How much total combined money did all members of your HOUSEHOLD earn last year?, dtype: int64

In [20]:
def money_conv(money_str):
    if pd.isnull(money_str) or re.search('Prefer',money_str):
        return None
    split_money = money_str.split(' ')
    split_money[0] = split_money[0].replace('$','')
    split_money[0] = split_money[0].replace(',','')
    return int(split_money[0])
int_income  = data[col_money].apply(money_conv)
data['int_income'] = int_income
int_income.describe()

count       829.000000
mean      75965.018094
std       59068.636748
min           0.000000
25%       25000.000000
50%       75000.000000
75%      100000.000000
max      200000.000000
Name: How much total combined money did all members of your HOUSEHOLD earn last year?, dtype: float64

Finding: the deviation within the group is large. The income values range from 0 to 200,000 dollars. However, the proportion of people willing to disclose their income has not been checked yet, so it is not fair to draw any conclusion here.
<br>
<br>In addition, this conversion takes only the lower bound of each bracket as the income, which is also not a fair estimate.

In [23]:
# To reduce the bias of the lower-bound selection, use the midpoint of each bracket as a better estimator
# the '$200,000 and up' bracket uses 200000 only, because it has no upper bound
def money_conv2(money_str):
    if pd.isnull(money_str) or re.search('Prefer',money_str):
        return None
    split_money = money_str.split(' ')
    for i in [0,2]:
        if split_money[i] == 'up':
            split_money[i] = split_money[0]
        split_money[i] = split_money[i].replace('$','')
        split_money[i] = split_money[i].replace(',','')
    avg_money = (int(split_money[0]) + int(split_money[2]))/2
    return avg_money
int_income  = data[col_money].apply(money_conv2)
data['int_income'] = int_income
int_income.describe()

count       829.000000
mean      86486.276840
std       57789.467567
min        4999.500000
25%       37499.500000
50%       87499.500000
75%      112499.500000
max      200000.000000
Name: How much total combined money did all members of your HOUSEHOLD earn last year?, dtype: float64

Findings: the bracket midpoints should estimate income better than the lower bounds used in the previous conversion.
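
The midpoint conversion can also be written with a regular expression instead of positional splitting. This is a minimal sketch: `income_midpoint` is a hypothetical helper, not the notebook's `money_conv2`. It treats `None` and "Prefer not to answer" as missing and, like the original, falls back to the lower bound for the open-ended top bracket.

```python
import re

def income_midpoint(label):
    """Midpoint of a bracket such as '$25,000 to $49,999'.

    For the open-ended '$200,000 and up' the single amount is returned as is.
    """
    if label is None or label.startswith('Prefer'):
        return None
    # Capture every dollar amount, then drop the thousands separators.
    amounts = [int(a.replace(',', '')) for a in re.findall(r'\$([\d,]+)', label)]
    if not amounts:
        return None
    # One amount: returns it unchanged. Two amounts: returns their average.
    return sum(amounts) / len(amounts)

for label in ['$0 to $9,999', '$25,000 to $49,999', '$200,000 and up']:
    print(label, income_midpoint(label))
```

The values for these three labels match the `min`, `25%`, and `max` rows of the `describe()` output above.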

In [8]:
# count the respondents with a non-missing answer to the income question
non_missing_count = len(data[data[col_money].notnull()])
print('non-missing responses: ', non_missing_count)

non-missing responses:  947


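Of the 980 respondents who celebrate Thanksgiving, 947 answered the income question, but 118 of them chose "Prefer not to answer", which leaves 829 usable income values. Therefore, the statistics calculated above describe only the respondents who disclosed their income, and their representativeness of the whole sample is uncertain.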
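
As a quick check, the share of usable income answers can be worked out from counts already printed in this notebook: 980 celebrants, 947 non-null answers, and 118 "Prefer not to answer". The variable names below are chosen for this sketch only.

```python
# Counts copied from the outputs earlier in this notebook.
celebrants = 980   # "Yes" to "Do you celebrate Thanksgiving?"
answered = 947     # non-null answers to the income question
prefer_not = 118   # answered "Prefer not to answer"

# Answers that can actually be converted to a number.
usable = answered - prefer_not
usable_share = usable / celebrants

print('usable income answers:', usable)
print('share of celebrants:', round(usable_share, 3))
```

The usable count equals the `count` row reported by `describe()` for `int_income`.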

## Relation between Travel Distance and Income analysis
The hypothesis is that people earning less money could be younger and would travel to their parents' houses for Thanksgiving. As a result, people earning more would be more likely to have Thanksgiving at their own house.

In [25]:
data['How far will you travel for Thanksgiving?'][data['int_income'] <150000].value_counts()

Thanksgiving is happening at my home--I won't travel at all                         281
Thanksgiving is local--it will take place in the town I live in                     203
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    150
Thanksgiving is out of town and far away--I have to drive several hours or fly       55
Name: How far will you travel for Thanksgiving?, dtype: int64

In [30]:
data['How far will you travel for Thanksgiving?'][data['int_income'] >=150000].value_counts()

Thanksgiving is happening at my home--I won't travel at all                         66
Thanksgiving is local--it will take place in the town I live in                     34
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    25
Thanksgiving is out of town and far away--I have to drive several hours or fly      15
Name: How far will you travel for Thanksgiving?, dtype: int64

In [31]:
proportion_low_income = 281/(281+203+150+55)
proportion_high_income = 66/(66+34+25+15)
print('proportion_low_income:',proportion_low_income)
print('proportion_high_income',proportion_high_income)

proportion_low_income: 0.40783744557329465
proportion_high_income 0.4714285714285714


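Comparing the proportions of people staying at home, I find that people with an income of 150,000 dollars or more are somewhat more likely to have Thanksgiving at their own home (about 47%) than people with a lower income (about 41%). This is consistent with the hypothesis that people earning more are more likely to host, although the difference is modest.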
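
The notebook compares the two proportions by eye. One conventional way to judge whether such a gap could be chance is a two-proportion z-test, sketched below with the counts from the two `value_counts` outputs. This test is an addition for illustration and is not part of the original analysis; `two_proportion_z` is a hypothetical helper built on the standard library only.

```python
import math

def two_proportion_z(success_a, total_a, success_b, total_b):
    """Pooled two-proportion z statistic and two-sided p-value (normal approximation)."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value

# "At my home" counts and group totals from the outputs above.
low_home, low_total = 281, 281 + 203 + 150 + 55
high_home, high_total = 66, 66 + 34 + 25 + 15

z, p_value = two_proportion_z(low_home, low_total, high_home, high_total)
print('z:', round(z, 2), 'p-value:', round(p_value, 2))
```

With these counts the p-value stays above the usual 0.05 threshold, so the gap between the two income groups should be read as suggestive rather than established.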

## Link between Friendship and Age
In the table, two columns hold questions to which younger people are probably more likely to answer yes. Let's test this view.

In [36]:
import numpy as np
Q1 = 'Have you ever tried to meet up with hometown friends on Thanksgiving night?'
Q2 = 'Have you ever attended a "Friendsgiving?"'
result_Q1Q2 = pd.pivot_table(data,values = 'int_age',index=Q1,columns = Q2,aggfunc=np.mean)
result_Q1Q2

| Have you ever tried to meet up with hometown friends on Thanksgiving night? | Have you ever attended a "Friendsgiving?": No | Have you ever attended a "Friendsgiving?": Yes |
|---|---|---|
| No | 42.283702 | 37.010526 |
| Yes | 41.47541 | 33.976744 |


In [37]:
import numpy as np
Q1 = 'Have you ever tried to meet up with hometown friends on Thanksgiving night?'
Q2 = 'Have you ever attended a "Friendsgiving?"'
result_Q1Q2 = pd.pivot_table(data,values = 'int_income',index=Q1,columns = Q2,aggfunc=np.mean)
result_Q1Q2

| Have you ever tried to meet up with hometown friends on Thanksgiving night? | Have you ever attended a "Friendsgiving?": No | Have you ever attended a "Friendsgiving?": Yes |
|---|---|---|
| No | 89456.82448 | 83124.552632 |
| Yes | 89553.119048 | 76315.319079 |


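From the two pivot tables above, I find that younger people are more active in Friendsgiving parties than older ones: the groups that have attended a Friendsgiving have a lower mean age. The second pivot table is also consistent with younger people having less income than older ones, since the youngest group (yes to both questions) has the lowest mean income.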
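
A mean per cell says nothing about how many respondents sit behind it. The sketch below uses `groupby` with `agg` as an alternative to `pivot_table`, to show the mean and the group size side by side. The frame is invented, and `met_friends` and `friendsgiving` are hypothetical short names for the two survey questions.

```python
import pandas as pd

# Toy data with hypothetical short column names, invented for illustration.
toy = pd.DataFrame({
    'met_friends': ['Yes', 'Yes', 'No', 'No', 'Yes'],
    'friendsgiving': ['Yes', 'No', 'Yes', 'No', 'Yes'],
    'int_age': [18, 45, 30, 60, 30],
})

# One row per answer combination, with the mean age and the number of respondents.
summary = toy.groupby(['met_friends', 'friendsgiving'])['int_age'].agg(['mean', 'count'])
print(summary)
```

Cells backed by only a handful of respondents deserve less weight when comparing group means.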