In [2]:
import pandas as pd

In [3]:
data = pd.read_csv('thanksgiving.csv', encoding='Latin-1')
data.head(3)

Unnamed: 0,RespondentID,Do you celebrate Thanksgiving?,What is typically the main dish at your Thanksgiving dinner?,What is typically the main dish at your Thanksgiving dinner? - Other (please specify),How is the main dish typically cooked?,How is the main dish typically cooked? - Other (please specify),What kind of stuffing/dressing do you typically have?,What kind of stuffing/dressing do you typically have? - Other (please specify),What type of cranberry saucedo you typically have?,What type of cranberry saucedo you typically have? - Other (please specify),...,Have you ever tried to meet up with hometown friends on Thanksgiving night?,"Have you ever attended a ""Friendsgiving?""",Will you shop any Black Friday sales on Thanksgiving Day?,Do you work in retail?,Will you employer make you work on Black Friday?,How would you describe where you live?,Age,What is your gender?,How much total combined money did all members of your HOUSEHOLD earn last year?,US Region
0,4337954960,Yes,Turkey,,Baked,,Bread-based,,,,...,Yes,No,No,No,,Suburban,18 - 29,Male,"$75,000 to $99,999",Middle Atlantic
1,4337951949,Yes,Turkey,,Baked,,Bread-based,,Other (please specify),Homemade cranberry gelatin ring,...,No,No,Yes,No,,Rural,18 - 29,Female,"$50,000 to $74,999",East South Central
2,4337935621,Yes,Turkey,,Roasted,,Rice-based,,Homemade,,...,Yes,Yes,Yes,No,,Suburban,18 - 29,Male,"$0 to $9,999",Mountain


In [12]:
data.columns

Index(['RespondentID', 'Do you celebrate Thanksgiving?',
       'What is typically the main dish at your Thanksgiving dinner?',
       'What is typically the main dish at your Thanksgiving dinner? - Other (please specify)',
       'How is the main dish typically cooked?',
       'How is the main dish typically cooked? - Other (please specify)',
       'What kind of stuffing/dressing do you typically have?',
       'What kind of stuffing/dressing do you typically have? - Other (please specify)',
       'What type of cranberry saucedo you typically have?',
       'What type of cranberry saucedo you typically have? - Other (please specify)',
       'Do you typically have gravy?',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Brussel sprouts',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Carrots',
       'Which of these side dishes aretypically served

# Removing those who don't celebrate Thanksgiving

In [13]:
data["Do you celebrate Thanksgiving?"].value_counts()

Yes    980
No      78
Name: Do you celebrate Thanksgiving?, dtype: int64

In [14]:
data = data[data["Do you celebrate Thanksgiving?"] == 'Yes']
data["Do you celebrate Thanksgiving?"].value_counts()

Yes    980
Name: Do you celebrate Thanksgiving?, dtype: int64

# Exploring main dishes

In [15]:
data["What is typically the main dish at your Thanksgiving dinner?"].value_counts()

Turkey                    859
Other (please specify)     35
Ham/Pork                   29
Tofurkey                   20
Chicken                    12
Roast beef                 11
I don't know                5
Turducken                   3
Name: What is typically the main dish at your Thanksgiving dinner?, dtype: int64

In [51]:
tofurkey_gravy = data[data['What is typically the main dish at your Thanksgiving dinner?'] == 'Tofurkey']['Do you typically have gravy?']
pd.value_counts(tofurkey_gravy.values)

Yes    12
No      8
dtype: int64

## Findings
Of the 20 respondents whose main dish is tofurkey, 12 (60%) typically have gravy.

# Exploring desserts

In [77]:
apple_isnull = data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple'].isnull()
pumpkin_isnull = data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pumpkin'].isnull()
pecan_isnull = data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pecan'].isnull()

apple_data = pd.value_counts(apple_isnull.values)
pumpkin_data = pd.value_counts(pumpkin_isnull.values)
pecan_data = pd.value_counts(pecan_isnull.values)

print('Apple pie\n', apple_data, '\n', apple_data/len(data),end='\n\n')
print('Pumpkin pie\n', pumpkin_data, '\n', pumpkin_data/len(data),end='\n\n')
print('Pecan pie\n', pecan_data, '\n', pecan_data/len(data))

Apple pie
 False    514
True     466
dtype: int64 
 False    0.52449
True     0.47551
dtype: float64

Pumpkin pie
 False    729
True     251
dtype: int64 
 False    0.743878
True     0.256122
dtype: float64

Pecan pie
 True     638
False    342
dtype: int64 
 True     0.65102
False    0.34898
dtype: float64


## Interpreting the results of 'Exploring Desserts'
Each pie has its own column: a respondent's cell holds the name of the pie if they selected it and is blank (null) otherwise. So isnull() flags the respondents who did not select a pie, and the False counts are the respondents who did. According to the numbers above, 514 respondents (52.45%) had apple pie, 729 (74.39%) had pumpkin pie and 342 (34.90%) had pecan pie. The 638 (65.10%) in the pecan output are the null answers, i.e. the respondents who did not have pecan pie.
Because a respondent could select more than one kind of pie, these groups overlap.

In [46]:
# True where all three pie columns are null, i.e. the respondent selected none of these pies
no_pies = apple_isnull & pumpkin_isnull & pecan_isnull
no_pies_data = pd.value_counts(no_pies.values)

print(no_pies_data, '\n', no_pies_data/len(data))

False    876
True     104
dtype: int64 
 False    0.893878
True     0.106122
dtype: float64


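The mask is True only when all three pie columns are null. So 876 respondents (89.39%) had at least one of these three pies, and 104 (10.61%) had none of them.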
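The null-based counting used above can be sketched on a small, made-up table. The `toy` DataFrame, its short column names and its values are hypothetical and only illustrate the idea; they are not taken from the survey.

```python
import pandas as pd

# Hypothetical toy survey: a cell holds the pie name if it was selected, else it is missing.
toy = pd.DataFrame({
    "Apple": ["Apple", None, None, "Apple"],
    "Pumpkin": ["Pumpkin", "Pumpkin", None, None],
    "Pecan": [None, None, None, "Pecan"],
})

# Respondents who selected each pie: count the non-null cells per column.
per_pie = toy.notna().sum()

# Respondents who selected at least one pie: any non-null cell in the row.
any_pie = toy.notna().any(axis=1)

print(per_pie)
print(any_pie.sum(), "of", len(toy), "selected at least one pie")
```

Counting with notna() reads the "had this pie" numbers directly, instead of reading the False rows of an isnull() count.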

# Exploring age and household income

In [66]:
# Convert the lower bound of each age range (e.g. '18 - 29', '60+') to an int
def age_to_int(string_age):
    if pd.isnull(string_age):
        return None
    lower_bound = string_age.split(' ')[0].replace('+', '')
    return int(lower_bound)

data['int_age'] = data['Age'].apply(age_to_int)

data['int_age'].describe()

count    947.000000
mean      40.089757
std       15.352014
min       18.000000
25%       30.000000
50%       45.000000
75%       60.000000
max       60.000000
Name: int_age, dtype: float64

## Interpreting the results of 'Exploring age'
Although this gives a rough idea of the bigger picture, it is not a good analysis. The survey only offered age ranges, so individual ages are unknown, and taking the lower bound of each range (18, 30, 45, 60) tends to understate every respondent's age. The statistics above are therefore a questionable approximation.
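One possible alternative, not what this notebook does, is to use the midpoint of each closed range and keep the lower bound only for the open-ended one. The sketch below assumes the hypothetical helper name `age_midpoint`, and the labels other than '18 - 29' are assumed from the lower bounds in the output above.

```python
def age_midpoint(age_range):
    """Midpoint of a range like '18 - 29'; lower bound for an open range like '60+'."""
    if not isinstance(age_range, str):
        return None  # missing answers (NaN) stay missing
    text = age_range.strip()
    if text.endswith("+"):
        return float(text[:-1])  # no upper bound to average with
    low, high = text.split(" - ")
    return (int(low) + int(high)) / 2


# Assumed labels; only '18 - 29' is visible in the data preview.
for label in ["18 - 29", "30 - 44", "45 - 59", "60+"]:
    print(label, "->", age_midpoint(label))
```

It could be applied with data['Age'].apply(age_midpoint). This is still an approximation, since it assumes ages are spread evenly within each range.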

In [72]:
def mean_income(income_range):
    if pd.isnull(income_range):
        return None
    
    min_value = income_range.split(' ')[0]
    
    if min_value == 'Prefer':
        return None
    elif min_value == '$200,000':
        return None
    
    max_value = income_range.split(' ')[2]
    
    min_value = min_value.replace(',','')
    min_value = min_value.replace('$','')
    max_value = max_value.replace(',','')
    max_value = max_value.replace('$','')
    
    return (int(min_value) + int(max_value)) / 2

data['income_mean'] = data['How much total combined money did all members of your HOUSEHOLD earn last year?'].apply(mean_income)

data['income_mean'].describe()

count       753.000000
mean      75029.380478
std       47365.158239
min        4999.500000
25%       37499.500000
50%       62499.500000
75%      112499.500000
max      187499.500000
Name: income_mean, dtype: float64

## Interpreting the results of 'Exploring income'
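This time I used the midpoint of each range. It may seem closer to reality than the age analysis, but it still has serious issues: it drops the '$200,000 or up' range entirely (along with respondents who preferred not to answer), and it assumes incomes are spread evenly within each range.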
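To avoid silently dropping the open-ended range, one option is to keep both bounds of each answer and decide later how to treat a missing upper bound. The sketch below is an assumption-laden illustration: `income_bounds` is a hypothetical helper, and it relies only on the dollar amounts in the label, not on the wording around them.

```python
import re


def income_bounds(income_range):
    """Return (low, high) in dollars; high is None for an open-ended range."""
    if not isinstance(income_range, str):
        return None  # missing answers (NaN) stay missing
    # Pull out every dollar amount, e.g. '$75,000 to $99,999' -> [75000, 99999]
    amounts = [int(a.replace(",", "")) for a in re.findall(r"\$([\d,]+)", income_range)]
    if not amounts:
        return None  # no amount at all, e.g. 'Prefer not to answer'
    low = amounts[0]
    high = amounts[1] if len(amounts) > 1 else None
    return low, high


print(income_bounds("$75,000 to $99,999"))
print(income_bounds("$200,000 or up"))
print(income_bounds("Prefer not to answer"))
```

With the bounds kept separately, the top range can be counted, reported on its own, or given an explicit assumed value, rather than disappearing from the statistics.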

# Correlating the travel distance and income

In [75]:
data[data["income_mean"] < 150000]["How far will you travel for Thanksgiving?"].value_counts()

Thanksgiving is happening at my home--I won't travel at all                         281
Thanksgiving is local--it will take place in the town I live in                     203
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    150
Thanksgiving is out of town and far away--I have to drive several hours or fly       55
Name: How far will you travel for Thanksgiving?, dtype: int64

In [76]:
data[data["income_mean"] > 150000]["How far will you travel for Thanksgiving?"].value_counts()

Thanksgiving is happening at my home--I won't travel at all                         28
Thanksgiving is local--it will take place in the town I live in                     16
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    13
Thanksgiving is out of town and far away--I have to drive several hours or fly       7
Name: How far will you travel for Thanksgiving?, dtype: int64

## Interpreting the results of the correlation
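The share of respondents who have Thanksgiving at home is slightly larger in the higher-income group (28 of 64, about 44%) than in the lower-income group (281 of 689, about 41%). The difference is small and the higher-income group has only 64 respondents, so this is weak evidence at best. One possible explanation is that younger respondents such as students, who tend to have lower incomes, travel home for Thanksgiving, whereas their parents, who tend to have higher incomes, host it.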
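Because the two income groups differ greatly in size, raw counts are hard to compare. A minimal sketch that turns the counts from the two outputs above into shares is shown here; the short keys and the `shares` helper are hypothetical names introduced for the example.

```python
# Counts copied from the two value_counts() outputs above.
below_150k = {"home": 281, "local": 203, "few hours": 150, "far away": 55}
above_150k = {"home": 28, "local": 16, "few hours": 13, "far away": 7}


def shares(counts):
    """Convert raw counts into fractions of the group total."""
    total = sum(counts.values())
    return {answer: count / total for answer, count in counts.items()}


low = shares(below_150k)
high = shares(above_150k)

for answer in below_150k:
    print(f"{answer:10s} below: {low[answer]:.1%}  above: {high[answer]:.1%}")
```

In pandas, value_counts(normalize=True) returns the same kind of shares directly from the filtered column.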

# Correlating ages and friendship 

In [78]:
data.pivot_table(
    index="Have you ever tried to meet up with hometown friends on Thanksgiving night?", 
    columns='Have you ever attended a "Friendsgiving?"',
    values="int_age"
)

Have you ever attended a "Friendsgiving?"                                           No        Yes
Have you ever tried to meet up with hometown friends on Thanksgiving night?
No                                                                           42.283702  37.010526
Yes                                                                          41.475410  33.976744
