In this project, you'll work in a Jupyter notebook, **analyzing data on Thanksgiving dinner in the US**. By the end, you'll have a notebook that you can add to your portfolio or extend on your own. If you need help at any point, you can consult our solution notebook here. The dataset came from FiveThirtyEight, and can be found here.

### Dataset
The dataset has **65 columns, and 1058 rows**. Most of the column names are questions, and most of the column values are string responses to the questions. Most of the columns are categorical, as a survey respondent had to select one of a few options. For example, one of the first column names is What is typically the main dish at your Thanksgiving dinner?. The potential responses are:

* Turkey
* Other (please specify)
* Ham/Pork
* Tofurkey
* Chicken
* Roast beef
* I don't know
* Turducken

Most of the columns follow the same question/response format as the above. There are also **quite a few NaN values** in the columns, which occurred when a survey respondent didn't fill out a question because they didn't want to, or it didn't apply to them.
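
To get a feel for how many answers are missing, the NaN values can be counted per column. The following is a minimal sketch on a made-up miniature of the survey; the `sample` frame and its contents are assumptions for illustration, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the survey: NaN marks an unanswered question.
sample = pd.DataFrame({
    "Do you celebrate Thanksgiving?": ["Yes", "Yes", "No", "Yes"],
    "Do you typically have gravy?": ["Yes", np.nan, np.nan, "No"],
    "Age": ["18 - 29", "30 - 44", np.nan, "60+"],
})

# isnull() marks each missing cell; summing the booleans counts them per column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# The mean of the booleans is the share of missing answers per column.
missing_share = sample.isnull().mean()
print(missing_share)
```

The same two calls should work unchanged on the full `data` frame once it is loaded.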

#### Descriptions of some of the most important columns:
* RespondentID -- a unique ID of the respondent to the survey.
* Do you celebrate Thanksgiving? -- a Yes/No response to the question.
* How would you describe where you live? -- responses are Suburban, Urban, and Rural.
* Age -- responses are one of several categories, such as 18-29, and 30-44.
* How much total combined money did all members of your HOUSEHOLD earn last year? -- one of several categories, such as $75,000 to $99,999.

In [14]:
import pandas as pd

data = pd.read_csv('thanksgiving.csv', encoding="Latin-1")
print(data.head(15))

    RespondentID Do you celebrate Thanksgiving?  \
0     4337954960                            Yes   
1     4337951949                            Yes   
2     4337935621                            Yes   
3     4337933040                            Yes   
4     4337931983                            Yes   
5     4337929779                            Yes   
6     4337924420                            Yes   
7     4337916002                            Yes   
8     4337914977                            Yes   
9     4337899817                            Yes   
10    4337899680                             No   
11    4337893416                            Yes   
12    4337888291                            Yes   
13    4337878450                            Yes   
14    4337878351                            Yes   

   What is typically the main dish at your Thanksgiving dinner?  \
0                                              Turkey             
1                                              Tu

In [15]:
areas = data['How would you describe where you live?'].unique()
print(areas)

['Suburban' 'Rural' 'Urban' nan]


In [17]:
columns = data.columns
print(columns)

Index(['RespondentID', 'Do you celebrate Thanksgiving?',
       'What is typically the main dish at your Thanksgiving dinner?',
       'What is typically the main dish at your Thanksgiving dinner? - Other (please specify)',
       'How is the main dish typically cooked?',
       'How is the main dish typically cooked? - Other (please specify)',
       'What kind of stuffing/dressing do you typically have?',
       'What kind of stuffing/dressing do you typically have? - Other (please specify)',
       'What type of cranberry saucedo you typically have?',
       'What type of cranberry saucedo you typically have? - Other (please specify)',
       'Do you typically have gravy?',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Brussel sprouts',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Carrots',
       'Which of these side dishes aretypically served

In [31]:
count = data['Do you celebrate Thanksgiving?'].value_counts()
print(count)

# Filter to only people who celebrate TG
celebrating = data[data['Do you celebrate Thanksgiving?'] == 'Yes']

#Count existing values (should only be 'Yes')
print(celebrating['Do you celebrate Thanksgiving?'].value_counts())

Yes    980
No      78
Name: Do you celebrate Thanksgiving?, dtype: int64
Yes    980
Name: Do you celebrate Thanksgiving?, dtype: int64


In [32]:
print(celebrating.head(5))

   RespondentID Do you celebrate Thanksgiving?  \
0    4337954960                            Yes   
1    4337951949                            Yes   
2    4337935621                            Yes   
3    4337933040                            Yes   
4    4337931983                            Yes   

  What is typically the main dish at your Thanksgiving dinner?  \
0                                             Turkey             
1                                             Turkey             
2                                             Turkey             
3                                             Turkey             
4                                           Tofurkey             

  What is typically the main dish at your Thanksgiving dinner? - Other (please specify)  \
0                                                NaN                                      
1                                                NaN                                      
2                            

In [34]:
dishes = celebrating['What is typically the main dish at your Thanksgiving dinner?']
print(dishes.value_counts())

Turkey                    859
Other (please specify)     35
Ham/Pork                   29
Tofurkey                   20
Chicken                    12
Roast beef                 11
I don't know                5
Turducken                   3
Name: What is typically the main dish at your Thanksgiving dinner?, dtype: int64


In [45]:
print(celebrating['Do you typically have gravy?'][celebrating['What is typically the main dish at your Thanksgiving dinner?'] == 'Tofurkey'])

#Alternative:
#print(celebrating[celebrating['What is typically the main dish at your Thanksgiving dinner?'] == 'Tofurkey']['Do you typically have gravy?'])


4      Yes
33     Yes
69      No
72      No
77     Yes
145    Yes
175    Yes
218     No
243    Yes
275     No
393    Yes
399    Yes
571    Yes
594    Yes
628     No
774     No
820     No
837    Yes
860     No
953    Yes
Name: Do you typically have gravy?, dtype: object
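
The selection above can also be written with `.loc`, which takes the row mask and the column label in a single step, and the long list of answers can be condensed with `value_counts()`. This is a sketch on made-up data; the `sample` frame and the short column names `main_dish` and `gravy` are assumptions standing in for the full survey questions.

```python
import pandas as pd

# Hypothetical stand-in for the `celebrating` frame, with shortened column names.
sample = pd.DataFrame({
    "main_dish": ["Turkey", "Tofurkey", "Tofurkey", "Ham/Pork", "Tofurkey"],
    "gravy": ["Yes", "Yes", "No", "Yes", "Yes"],
})

# .loc selects rows by boolean mask and one column by label in one indexing step.
tofurkey_gravy = sample.loc[sample["main_dish"] == "Tofurkey", "gravy"]

# value_counts() turns the individual answers into a Yes/No tally.
tally = tofurkey_gravy.value_counts()
print(tally)
```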


## Figuring Out What Pies People Eat


In [67]:
apple_isnull = celebrating['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple'].isnull()
pumpkin_isnull = celebrating['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pumpkin'].isnull()
pecan_isnull = celebrating['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pecan'].isnull()

#print(apple_isnull.head(10))

# All three masks are True for respondents who left apple, pumpkin, and pecan unselected
no_pies = celebrating[apple_isnull & pumpkin_isnull & pecan_isnull].head(5)
print(no_pies)


    RespondentID Do you celebrate Thanksgiving?  \
5     4337929779                            Yes   
7     4337916002                            Yes   
15    4337857295                            Yes   
21    4337813502                            Yes   
59    4337586061                            Yes   

   What is typically the main dish at your Thanksgiving dinner?  \
5                                              Turkey             
7                                              Turkey             
15                                             Turkey             
21                                             Turkey             
59                                             Turkey             

   What is typically the main dish at your Thanksgiving dinner? - Other (please specify)  \
5                                                 NaN                                      
7                                                 NaN                                      
15            
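
Note that combining the three `isnull()` masks with `&` selects respondents who left all three pie options empty. To count the people who did pick at least one of these pies, the mask can be inverted with `~`. A small sketch follows, assuming (as in the survey columns) that a pie column holds a value when selected and NaN otherwise; the `sample` frame and its short column names are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature: a pie column holds the pie name if selected, NaN otherwise.
sample = pd.DataFrame({
    "Apple": ["Apple", np.nan, np.nan, "Apple"],
    "Pumpkin": ["Pumpkin", np.nan, "Pumpkin", np.nan],
    "Pecan": [np.nan, np.nan, np.nan, "Pecan"],
})

# True where a respondent selected none of the three pies.
no_pie = sample.isnull().all(axis=1)

# Inverting the mask gives respondents who selected at least one pie.
any_pie = ~no_pie

print(no_pie.value_counts())
print("Selected at least one pie:", any_pie.sum())
```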

## Analyse age

In [96]:
#Transform age to numeric value

def toInt(x):
    if pd.isnull(x):
        return None
    splittedSpace = x.split(' ')
    splittedPlus = splittedSpace[0].split('+')
    return int(splittedPlus[0])

celebrating['int_age'] = celebrating['Age'].apply(toInt) 
celebrating['int_age'].describe()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


count    947.000000
mean      40.089757
std       15.352014
min       18.000000
25%       30.000000
50%       45.000000
75%       60.000000
max       60.000000
Name: int_age, dtype: float64

#### Remarks
* ages below 18 do not appear, and the open-ended 60+ bucket is recorded as 60
* all ranges are "floored" to the beginning of the range (18-29 to 18, ...)
* only four distinct values exist, so the quantiles jump suddenly from one bucket to the next
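
The warning printed before the `describe()` output is pandas' SettingWithCopyWarning: `celebrating` was created by filtering `data`, so pandas cannot tell whether the new column is meant for the filtered frame or the original. One common way to avoid it is to take an explicit `.copy()` when filtering. The sketch below uses a made-up `data` frame and assumes age labels of the form `18 - 29` and `60+`; `to_int` is a compact variant of the conversion function above.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the survey data.
data = pd.DataFrame({
    "Do you celebrate Thanksgiving?": ["Yes", "No", "Yes", "Yes"],
    "Age": ["18 - 29", "30 - 44", "60+", np.nan],
})

# .copy() makes the filtered frame independent of `data`,
# so adding a column to it is unambiguous.
celebrating = data[data["Do you celebrate Thanksgiving?"] == "Yes"].copy()

def to_int(age):
    """Return the lower bound of an age bucket such as '18 - 29' or '60+'."""
    if pd.isnull(age):
        return None
    return int(age.split(" ")[0].split("+")[0])

celebrating["int_age"] = celebrating["Age"].apply(to_int)
print(celebrating["int_age"])
```

The original `data` frame is left untouched, which can be checked with `"int_age" in data.columns`.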


In [97]:
celebrating['int_age'].value_counts()

45.0    269
60.0    258
30.0    235
18.0    185
Name: int_age, dtype: int64

## Analyse income

In [99]:
income = celebrating['How much total combined money did all members of your HOUSEHOLD earn last year?']
income.value_counts()


$25,000 to $49,999      166
$75,000 to $99,999      127
$50,000 to $74,999      127
Prefer not to answer    118
$100,000 to $124,999    109
$200,000 and up          76
$10,000 to $24,999       60
$0 to $9,999             52
$125,000 to $149,999     48
$150,000 to $174,999     38
$175,000 to $199,999     26
Name: How much total combined money did all members of your HOUSEHOLD earn last year?, dtype: int64

In [102]:
def convertIncome(income_str):
    if pd.isnull(income_str):
        return None
    income_str = income_str.split(' ')
    if income_str[0] == "Prefer":
        return None
    income_str = income_str[0].replace('$', '')
    income_str = income_str.replace(',','')
    income_int = int(income_str)
    
    return income_int

celebrating['int_income'] = income.apply(convertIncome)
celebrating['int_income'].describe()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


count       829.000000
mean      75965.018094
std       59068.636748
min           0.000000
25%       25000.000000
50%       75000.000000
75%      100000.000000
max      200000.000000
Name: int_income, dtype: float64

## Findings
* ranges are "floored" to the beginning of the range
* max income is capped at 200k
* hence the true mean income is at least 75,965 USD (the median falls in the bucket starting at 75,000 USD)
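
In the `describe()` output above, the `mean` row and the `50%` row are different statistics: the latter is the median. A tiny sketch with made-up floored incomes shows how the two can diverge when a few large values pull the mean upwards; the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical floored incomes (lower bound of each bucket).
floored_income = pd.Series([0, 25000, 25000, 50000, 75000, 100000, 200000])

mean_income = floored_income.mean()      # arithmetic average
median_income = floored_income.median()  # middle value, the "50%" row of describe()

print(round(mean_income, 2), median_income)
```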


## Analysing distance

*Hypothesis*: younger people have less income and travel further to their parents' homes. People with higher income are more likely to travel less, as they celebrate at their own home.

We can test this by filtering data based on int_income, and seeing what the values in the How far will you travel for Thanksgiving? column are.

In [120]:
lessThan150k = celebrating[celebrating['int_income'] < 150000]
values = lessThan150k['How far will you travel for Thanksgiving?'].value_counts()

print(values)
distributionLessThan150k = {}

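# Iterating over a Series yields its values, so each `value` here is a count
# and the resulting dict is keyed by count, not by the answer text
for value in values:
    distributionLessThan150k[value] = value/sum(values)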
    
print(distributionLessThan150k)

Thanksgiving is happening at my home--I won't travel at all                         281
Thanksgiving is local--it will take place in the town I live in                     203
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    150
Thanksgiving is out of town and far away--I have to drive several hours or fly       55
Name: How far will you travel for Thanksgiving?, dtype: int64
{281: 0.40783744557329465, 203: 0.2946298984034833, 150: 0.21770682148040638, 55: 0.07982583454281568}


In [124]:
moreThan150k = celebrating[celebrating['int_income'] >= 150000]
values = moreThan150k['How far will you travel for Thanksgiving?'].value_counts()

print(values)
distributionMoreThan150k = {}

# Each `value` is a count, so the dict is keyed by count rather than by answer text
for value in values:
    distributionMoreThan150k[value] = value/sum(values)
    
print(distributionMoreThan150k)

Thanksgiving is happening at my home--I won't travel at all                         66
Thanksgiving is local--it will take place in the town I live in                     34
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    25
Thanksgiving is out of town and far away--I have to drive several hours or fly      15
Name: How far will you travel for Thanksgiving?, dtype: int64
{25: 0.17857142857142858, 66: 0.47142857142857142, 34: 0.24285714285714285, 15: 0.10714285714285714}
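
The proportions computed by the loops above can also be obtained directly with `value_counts(normalize=True)`, which keeps the answer text as the index. The sketch below uses a made-up `travel` Series with shortened answer labels; the dict comprehension shows an equivalent manual computation keyed by label.

```python
import pandas as pd

# Hypothetical answers with shortened labels.
travel = pd.Series(["home", "home", "local", "home", "far", "local", "home", "near"])

counts = travel.value_counts()
# normalize=True divides each count by the total number of non-missing answers.
shares = travel.value_counts(normalize=True)
print(shares)

# Equivalent manual computation, keyed by answer label rather than by count.
manual = {label: count / counts.sum() for label, count in counts.items()}
print(manual)
```

Keying by label also avoids collisions if two answers happen to have the same count.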


## Findings

### People with income below 150k
* I won't travel at all: 40.8%
* it will take place in the town I live in: 29.5%
* it's a drive of a few hours or less: 21.8%
* I have to drive several hours or fly: 8.0%

### People with income of 150k or more
* I won't travel at all: 47.1%
* it will take place in the town I live in: 24.3%
* it's a drive of a few hours or less: 17.9%
* I have to drive several hours or fly: 10.7%



## Friendsgiving
*Hypothesis:* younger people tend to celebrate Thanksgiving with friends either at home or in the town they live in.

In [126]:
celebrating.pivot_table(index='Have you ever tried to meet up with hometown friends on Thanksgiving night?', columns='Have you ever attended a "Friendsgiving?"', values='int_age')

| Have you ever tried to meet up with hometown friends on Thanksgiving night? | Have you ever attended a "Friendsgiving?": No | Have you ever attended a "Friendsgiving?": Yes |
|---|---|---|
| No | 42.283702 | 37.010526 |
| Yes | 41.47541 | 33.976744 |


In [127]:
celebrating.pivot_table(index='Have you ever tried to meet up with hometown friends on Thanksgiving night?', columns='Have you ever attended a "Friendsgiving?"', values='int_income')

| Have you ever tried to meet up with hometown friends on Thanksgiving night? | Have you ever attended a "Friendsgiving?": No | Have you ever attended a "Friendsgiving?": Yes |
|---|---|---|
| No | 78914.549654 | 72894.736842 |
| Yes | 78750.0 | 66019.736842 |


### Findings
Both younger and lower-income people tend to be more likely to attend a Friendsgiving or to meet up with hometown friends on Thanksgiving night.

## Compare main dish with average income

*Hypothesis*: Cheaper meat / easier dishes are eaten by people with lower income.

In [138]:
import numpy as np
celebrating.pivot_table(index='What is typically the main dish at your Thanksgiving dinner?', values=['int_income', 'int_age'], aggfunc=np.mean)

| What is typically the main dish at your Thanksgiving dinner? | int_age | int_income |
|---|---|---|
| Chicken | 39.5 | 40500.0 |
| Ham/Pork | 33.857143 | 65370.37037 |
| I don't know | 21.0 | 16666.666667 |
| Other (please specify) | 43.028571 | 79193.548387 |
| Roast beef | 39.9 | 35625.0 |
| Tofurkey | 32.4 | 73235.294118 |
| Turducken | 40.0 | 200000.0 |
| Turkey | 40.462275 | 77113.543092 |


### Findings
* Turducken (three-bird roast) is only eaten by respondents in the top income bucket (200k and up)
* younger people with less income (students?) tend to not know what the typical dish is
* All higher-quality dishes (turducken, turkey, "other"), which are eaten by people with high income, go with an average age of 40 or above.
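
These averages should be read with the group sizes in mind: the main-dish counts earlier in the notebook show only 3 Turducken and 5 "I don't know" respondents among those celebrating. One way to keep that visible is to report the count next to the mean. The sketch below uses a made-up `sample` frame with a shortened `main_dish` column name; the values are illustrative only.

```python
import pandas as pd

# Hypothetical miniature of the `celebrating` frame.
sample = pd.DataFrame({
    "main_dish": ["Turkey", "Turkey", "Turkey", "Turducken", "Chicken", "Chicken"],
    "int_income": [50000, 75000, 100000, 200000, 25000, 50000],
})

# Aggregate with both mean and count so small groups are easy to spot.
summary = sample.groupby("main_dish")["int_income"].agg(["mean", "count"])
print(summary)
```

A mean based on one or a handful of respondents, as for Turducken here, says little about the wider population.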