In [1]:
import pandas as pd
data = pd.read_csv('thanksgiving.csv', encoding="Latin-1")

In [2]:
print(data.head())

   RespondentID Do you celebrate Thanksgiving?  \
0    4337954960                            Yes   
1    4337951949                            Yes   
2    4337935621                            Yes   
3    4337933040                            Yes   
4    4337931983                            Yes   

  What is typically the main dish at your Thanksgiving dinner?  \
0                                             Turkey             
1                                             Turkey             
2                                             Turkey             
3                                             Turkey             
4                                           Tofurkey             

  What is typically the main dish at your Thanksgiving dinner? - Other (please specify)  \
0                                                NaN                                      
1                                                NaN                                      
2                            

In [3]:
print(data.columns)

Index(['RespondentID', 'Do you celebrate Thanksgiving?',
       'What is typically the main dish at your Thanksgiving dinner?',
       'What is typically the main dish at your Thanksgiving dinner? - Other (please specify)',
       'How is the main dish typically cooked?',
       'How is the main dish typically cooked? - Other (please specify)',
       'What kind of stuffing/dressing do you typically have?',
       'What kind of stuffing/dressing do you typically have? - Other (please specify)',
       'What type of cranberry saucedo you typically have?',
       'What type of cranberry saucedo you typically have? - Other (please specify)',
       'Do you typically have gravy?',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Brussel sprouts',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Carrots',
       'Which of these side dishes aretypically served

In [4]:
# Count how many times each category occurs in the column
print(data['Do you celebrate Thanksgiving?'].value_counts())

# Filter DataFrame to include only Yes responses to Do you celebrate Thanksgiving?
data = data.loc[data['Do you celebrate Thanksgiving?'] == 'Yes']

# Confirm that number of rows matches number of Yeses from earlier count
print(data.shape)

Yes    980
No      78
Name: Do you celebrate Thanksgiving?, dtype: int64
(980, 65)


In [5]:
# Counts of responses to what main dish is
print(data['What is typically the main dish at your Thanksgiving dinner?'].value_counts())

# Display the gravy column for rows where Tofurkey was the main dish
tofurkey_rows = data.loc[data['What is typically the main dish at your Thanksgiving dinner?'] == 'Tofurkey']
print(tofurkey_rows['Do you typically have gravy?'])


Turkey                    859
Other (please specify)     35
Ham/Pork                   29
Tofurkey                   20
Chicken                    12
Roast beef                 11
I don't know                5
Turducken                   3
Name: What is typically the main dish at your Thanksgiving dinner?, dtype: int64
4      Yes
33     Yes
69      No
72      No
77     Yes
145    Yes
175    Yes
218     No
243    Yes
275     No
393    Yes
399    Yes
571    Yes
594    Yes
628     No
774     No
820     No
837    Yes
860     No
953    Yes
Name: Do you typically have gravy?, dtype: object


In [6]:
# Create a Boolean Series for each of apple, pumpkin, and pecan pie: True where the response is null (pie not selected)
apple_isnull = pd.isnull(data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple'])
pumpkin_isnull = pd.isnull(data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pumpkin'])
pecan_isnull = pd.isnull(data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pecan'])

In [7]:
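# Combine all three Series with the Boolean operator &.
# The result is True only when all three columns are null, i.e. the respondent had none of these pies.
no_pies = apple_isnull & pumpkin_isnull & pecan_isnull

In [8]:
# Print the value counts of the final Boolean Series (True = none of the three pies)
print(no_pies.value_counts())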

False    876
True     104
dtype: int64


In [9]:
# Parse a single string from the Age column into an int (the low end of the range)
def age_to_int(string):
    if pd.isnull(string):
        return None
    # Take the first token of the label and drop any '+' sign
    first = string.split()[0]
    return int(first.replace('+', ''))

# Apply the function to the Series of age values
data['int_age'] = data['Age'].apply(age_to_int)

print(data['int_age'].describe())

count    947.000000
mean      40.089757
std       15.352014
min       18.000000
25%       30.000000
50%       45.000000
75%       60.000000
max       60.000000
Name: int_age, dtype: float64


# Notes on findings
### Changing age to int value

1. The new int_age column is only a rough approximation of respondents' ages. Because we use the low end of each age range offered in the survey, it will be at or below each respondent's true age. (You can see this clearly in the fact that the 3rd quartile and the maximum are both 60.)


2. A better estimate of age, still not exact but closer, would be to use the midpoint of each range as the integer value for age. The best solution would be to design the next survey so that respondents fill in their exact age.
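
The midpoint idea in point 2 could be sketched roughly as follows. This is a minimal sketch, assuming age labels shaped like `18 - 29` and `60+` (inferred from how `age_to_int` parses them). The function name `age_to_midpoint` and the sample Series are hypothetical, and the open-ended top bracket is left at its lower bound because the survey gives no upper limit for it.

```python
import pandas as pd

def age_to_midpoint(label):
    """Return the midpoint of an age range label, or None for missing values."""
    if pd.isnull(label):
        return None
    parts = label.replace('+', '').split()
    low = int(parts[0])
    # Assumption: ranges look like '18 - 29', so the last token is the upper bound.
    # An open-ended label such as '60+' has a single token, so high equals low.
    high = int(parts[-1])
    return (low + high) / 2

# Hypothetical sample values, not rows taken from the survey file
sample = pd.Series(['18 - 29', '30 - 44', '45 - 59', '60+', None])
midpoints = sample.apply(age_to_midpoint)
print(midpoints)
```

With the real data, the same function could be passed to `data['Age'].apply(...)` in place of `age_to_int`, provided the labels really follow the assumed format.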

In [12]:
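# Parse a single string from the income column into an int (the low end of the range)
def income_to_int(string):
    if pd.isnull(string):
        return None
    first = string.split()[0]
    # Responses starting with 'Prefer' carry no income figure
    if first == 'Prefer':
        return None
    # Strip the dollar sign and thousands separators before converting
    return int(first.replace('$', '').replace(',', ''))

# Apply the function to the Series of income values
data['int_income'] = data['How much total combined money did all members of your HOUSEHOLD earn last year?'].apply(income_to_int)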

print(data['int_income'].describe())
    

count       829.000000
mean      75965.018094
std       59068.636748
min           0.000000
25%       25000.000000
50%       75000.000000
75%      100000.000000
max      200000.000000
Name: int_income, dtype: float64


# Notes on findings
### Changing income to int

Once again, this is only a rough approximation of respondents' true income. The income categories are wide and we use the lower endpoint of each one as the assigned value, so the gap between the assigned income and a respondent's true income can be large.
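
To get a feel for how large that gap can be, one could parse both ends of a bracket instead of only the first. The sketch below is illustrative only: it assumes labels shaped like `$25,000 to $49,999` (consistent with how `income_to_int` strips `$` and `,`), and `income_bounds` is a hypothetical helper, not part of the analysis above.

```python
import pandas as pd

def income_bounds(label):
    """Return (low, high) for an income bracket label; high is None if open-ended."""
    if pd.isnull(label) or label.startswith('Prefer'):
        return None
    # Keep only the tokens that look like dollar amounts, e.g. '$25,000'
    amounts = [int(tok.replace('$', '').replace(',', ''))
               for tok in label.split() if tok.startswith('$')]
    low = amounts[0]
    high = amounts[1] if len(amounts) > 1 else None
    return low, high

# Assumed label format; the width is the most the assigned value can undershoot by
low, high = income_bounds('$25,000 to $49,999')
print(high - low)  # prints 24999
```

A bracket with no second dollar amount returns `None` for the upper bound, which reflects that the undershoot for an open-ended top bracket cannot be bounded from the label alone.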

In [13]:
# Select rows where int_income is less than 150000
income_filtering1 = data[data['int_income'] < 150000]
print(income_filtering1['How far will you travel for Thanksgiving?'].value_counts())

# Select rows where int_income is greater than 150000
# (rows where int_income equals 150000 fall into neither group)
income_filtering2 = data[data['int_income'] > 150000]
print(income_filtering2['How far will you travel for Thanksgiving?'].value_counts())

Thanksgiving is happening at my home--I won't travel at all                         281
Thanksgiving is local--it will take place in the town I live in                     203
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    150
Thanksgiving is out of town and far away--I have to drive several hours or fly       55
Name: How far will you travel for Thanksgiving?, dtype: int64
Thanksgiving is happening at my home--I won't travel at all                         49
Thanksgiving is local--it will take place in the town I live in                     25
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    16
Thanksgiving is out of town and far away--I have to drive several hours or fly      12
Name: How far will you travel for Thanksgiving?, dtype: int64


# Notes on findings
### Distance travelled filtered by income

1. People making more than 150k are about 7 percentage points more likely to stay at home for Thanksgiving (48.0% vs. 40.8%), about 5 points less likely to attend a local Thanksgiving not hosted by them (24.5% vs. 29.5%), about 6 points less likely to go to one that is out of town but not far away (15.7% vs. 21.8%), but roughly 4 points MORE likely to travel far away for Thanksgiving (11.8% vs. 8.0%).

2. Above and below 150k is probably not the best dividing line, because the sample below the line (689 responses) is much larger than the sample above it (102 responses), which makes the shares for the high-income group noisier. Also, as far as behavior is concerned, I think someone who makes about 100k a year is more like someone who makes 150k than someone who makes 50k, so it makes little sense to group the 50k and 100k earners together.
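
The percentages in point 1 come from turning each group's counts into shares. A minimal sketch of that arithmetic is below; the counts are copied from the two `value_counts()` outputs above, while the short row labels and the column names `below_150k` and `above_150k` are made up for readability.

```python
import pandas as pd

# Counts copied from the two value_counts() outputs; row labels are shortened
labels = ['At home', 'Local', 'Few hours or less', 'Several hours or fly']
below = pd.Series([281, 203, 150, 55], index=labels)
above = pd.Series([49, 25, 16, 12], index=labels)

# Divide each count by its group total to get the share of that group
shares = pd.DataFrame({
    'below_150k': below / below.sum(),
    'above_150k': above / above.sum(),
})
shares['difference'] = shares['above_150k'] - shares['below_150k']

# Show shares and differences in percent, rounded to one decimal place
print((shares * 100).round(1))
```

On the DataFrame itself, `value_counts(normalize=True)` returns these shares directly, which avoids retyping the counts.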

In [15]:
# Create a pivot table showing the average age of respondents for categories related to meeting with 
# friends on Thanksgiving
data.pivot_table(index='Have you ever tried to meet up with hometown friends on Thanksgiving night?', columns='Have you ever attended a "Friendsgiving?"', values='int_age')

Have you ever attended a "Friendsgiving?"                                           No        Yes
Have you ever tried to meet up with hometown friends on Thanksgiving night?
No                                                                           42.283702  37.010526
Yes                                                                          41.475410  33.976744


# Notes on findings
### Age distribution for people who've met up with friends and/or attended a friendsgiving

Respondents who have attended a Friendsgiving are noticeably younger on average (by roughly 5 to 7.5 years) than those who haven't, while the age gap between people who've met up with hometown friends on Thanksgiving night and people who haven't is smaller (roughly 1 to 3 years). Most of the age variation seems to be driven by the Friendsgiving category.
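
Cell averages like these can hide very small groups, so it may help to look at the group sizes next to the means. The sketch below shows one way to do that by passing a list to `aggfunc`. The toy DataFrame and its short column names (`met_friends`, `friendsgiving`) are hypothetical stand-ins for the survey columns, not real responses.

```python
import pandas as pd

# Hypothetical toy data standing in for the survey columns
toy = pd.DataFrame({
    'met_friends': ['No', 'No', 'Yes', 'Yes', 'No', 'Yes'],
    'friendsgiving': ['No', 'Yes', 'No', 'Yes', 'No', 'Yes'],
    'int_age': [45, 30, 45, 18, 60, 30],
})

# aggfunc accepts a list, so mean age and group size appear side by side
summary = toy.pivot_table(index='met_friends', columns='friendsgiving',
                          values='int_age', aggfunc=['mean', 'count'])
print(summary)
```

Applied to `data` with the full question text as `index` and `columns`, the `count` block would show how many respondents sit behind each average.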