<h1>
<center>
Dataquest Guided Project 4:
Analyzing Thanksgiving Dinner
</center>
</h1>

## Introduction

This project is part of the Dataquest program.

- Part of the paths **Data Analyst in Python & Data Scientist in Python**
    - Step 2 : **Intermediate Python and Pandas**
        - Course 1 : **Data Analysis with Pandas: Intermediate**
            - NumPy
            - Pandas
            - Working with missing data

Because this project belongs to the intermediate Python and pandas step, we will only use basic pandas functionality. It is a guided project: we follow the steps suggested by Dataquest and go a little deeper where it is useful.

## Use case : Analyzing Thanksgiving Dinner 

The dataset comes from [FiveThirtyEight](http://fivethirtyeight.com). It contains 1058 responses to an online survey about what Americans eat for Thanksgiving dinner. Each respondent was asked what they typically eat for Thanksgiving, along with some demographic questions such as gender, income, and location.
The dataset has 65 columns and 1058 rows. Most of the column names are questions, and most of the values are string responses to those questions. Most of the columns are categorical, because respondents had to select one of a few options.

In this project, we'll explore the data and try to find interesting patterns.

### Load file and prepare data

In [2]:
import pandas as pd
data = pd.read_csv("thanksgiving.csv", encoding="Latin-1")

Display the first rows of the dataset

In [3]:
data.head()

Unnamed: 0,RespondentID,Do you celebrate Thanksgiving?,What is typically the main dish at your Thanksgiving dinner?,What is typically the main dish at your Thanksgiving dinner? - Other (please specify),How is the main dish typically cooked?,How is the main dish typically cooked? - Other (please specify),What kind of stuffing/dressing do you typically have?,What kind of stuffing/dressing do you typically have? - Other (please specify),What type of cranberry saucedo you typically have?,What type of cranberry saucedo you typically have? - Other (please specify),...,Have you ever tried to meet up with hometown friends on Thanksgiving night?,"Have you ever attended a ""Friendsgiving?""",Will you shop any Black Friday sales on Thanksgiving Day?,Do you work in retail?,Will you employer make you work on Black Friday?,How would you describe where you live?,Age,What is your gender?,How much total combined money did all members of your HOUSEHOLD earn last year?,US Region
0,4337954960,Yes,Turkey,,Baked,,Bread-based,,,,...,Yes,No,No,No,,Suburban,18 - 29,Male,"$75,000 to $99,999",Middle Atlantic
1,4337951949,Yes,Turkey,,Baked,,Bread-based,,Other (please specify),Homemade cranberry gelatin ring,...,No,No,Yes,No,,Rural,18 - 29,Female,"$50,000 to $74,999",East South Central
2,4337935621,Yes,Turkey,,Roasted,,Rice-based,,Homemade,,...,Yes,Yes,Yes,No,,Suburban,18 - 29,Male,"$0 to $9,999",Mountain
3,4337933040,Yes,Turkey,,Baked,,Bread-based,,Homemade,,...,Yes,No,No,No,,Urban,30 - 44,Male,"$200,000 and up",Pacific
4,4337931983,Yes,Tofurkey,,Baked,,Bread-based,,Canned,,...,Yes,No,No,No,,Urban,30 - 44,Male,"$100,000 to $124,999",Pacific


Display the column names

In [4]:
data.columns

Index(['RespondentID', 'Do you celebrate Thanksgiving?',
       'What is typically the main dish at your Thanksgiving dinner?',
       'What is typically the main dish at your Thanksgiving dinner? - Other (please specify)',
       'How is the main dish typically cooked?',
       'How is the main dish typically cooked? - Other (please specify)',
       'What kind of stuffing/dressing do you typically have?',
       'What kind of stuffing/dressing do you typically have? - Other (please specify)',
       'What type of cranberry saucedo you typically have?',
       'What type of cranberry saucedo you typically have? - Other (please specify)',
       'Do you typically have gravy?',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Brussel sprouts',
       'Which of these side dishes aretypically served at your Thanksgiving dinner? Please select all that apply. - Carrots',
       'Which of these side dishes aretypically served

### Filtering out rows from the dataframe

Let's filter the rows to keep only the responses of people who celebrate Thanksgiving.
First, how many people celebrate Thanksgiving?

In [5]:
data["Do you celebrate Thanksgiving?"].value_counts()

Yes    980
No      78
Name: Do you celebrate Thanksgiving?, dtype: int64

In [6]:
data = data[data["Do you celebrate Thanksgiving?"] == 'Yes']

### Analyze the data

#### What main dishes do people tend to eat during Thanksgiving dinner?

In [7]:
data["What is typically the main dish at your Thanksgiving dinner?"].value_counts()

Turkey                    859
Other (please specify)     35
Ham/Pork                   29
Tofurkey                   20
Chicken                    12
Roast beef                 11
I don't know                5
Turducken                   3
Name: What is typically the main dish at your Thanksgiving dinner?, dtype: int64

The vast majority of respondents eat turkey (859 out of 980, about 88%). Turkey-based alternatives such as Tofurkey and Turducken are rare.

We are now very curious: Do people who eat Tofurkey add gravy to their dish?

In [8]:
data[data["What is typically the main dish at your Thanksgiving dinner?"]=="Tofurkey"]['Do you typically have gravy?'].value_counts()

Yes    12
No      8
Name: Do you typically have gravy?, dtype: int64

12 out of 20 people have gravy with Tofurkey. However, only 20 of the 980 people celebrating Thanksgiving eat Tofurkey. A sample of 20 is too small to draw conclusions about whether to serve gravy to a Tofurkey eater at Thanksgiving.
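
To get a feeling for how little 20 answers tell us, we can sketch an approximate 95% interval around the observed share. This is only an illustration: the Wilson score interval and the helper `wilson_interval` are our own choice here, not part of the Dataquest steps.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# 12 "Yes" answers out of 20 Tofurkey eaters (counts from the output above)
low, high = wilson_interval(12, 20)
print(f"{12 / 20:.0%} observed, plausible range {low:.0%} to {high:.0%}")
```

The range spans roughly 39% to 78%, so the data are compatible both with a minority and with a clear majority of Tofurkey eaters having gravy.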

#### What dessert dishes do people tend to eat during Thanksgiving dinner?

The survey asks "Which type of pie is typically served at your Thanksgiving dinner?". We focus on three of the possible answers:
- Apple
- Pumpkin
- Pecan

Instead of storing the answers in a single categorical column, the survey uses one column per pie, named like this: "Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple". A cell is missing (NaN) when the respondent did not select that pie.

We will therefore build Boolean series to find out which respondents eat these pies.

In [9]:
apple_isnull = data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Apple'].isnull()

In [10]:
pumpkin_isnull = data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pumpkin'].isnull()

In [11]:
pecan_isnull = data['Which type of pie is typically served at your Thanksgiving dinner? Please select all that apply. - Pecan'].isnull()

In [12]:
# True when the respondent selected none of the three pies
no_pie = apple_isnull & pumpkin_isnull & pecan_isnull

In [13]:
no_pie.value_counts()

False    876
True     104
dtype: int64

876 respondents eat at least one of these three pies, that is about 89% of the 980 respondents who celebrate Thanksgiving!
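
The same question can be answered from the other side, by asking directly who ticked at least one box. The sketch below uses a made-up four-row table (`toy` and its values are hypothetical) and then redoes the share arithmetic with the counts reported above.

```python
import numpy as np
import pandas as pd

prefix = ("Which type of pie is typically served at your Thanksgiving dinner? "
          "Please select all that apply. - ")

# Hypothetical answers: NaN means the box was not ticked
toy = pd.DataFrame({
    prefix + "Apple":   ["Apple", np.nan, np.nan, "Apple"],
    prefix + "Pumpkin": [np.nan, "Pumpkin", np.nan, "Pumpkin"],
    prefix + "Pecan":   [np.nan, np.nan, np.nan, "Pecan"],
})
pie_columns = [prefix + name for name in ["Apple", "Pumpkin", "Pecan"]]

# True when at least one of the three columns is filled in
eats_pie = toy[pie_columns].notnull().any(axis=1)

# The mean of a Boolean Series is the share of True values
share = eats_pie.mean()
print(share)

# Share computed from the counts in the notebook output: 876 out of 980
survey_share = 876 / 980
print(round(survey_share, 3))
```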

#### Let's analyse the Age of the participants

We want to look at the age of the respondents so that we can account for generational bias in our results. In a real study, this would be one of the first things to check.

In [14]:
data["Age"].unique()

array(['18 - 29', '30 - 44', '60+', '45 - 59', nan], dtype=object)

The age column contains ranges rather than exact ages. To analyze it, we first need to convert it to numerical values. As we cannot extract an exact age, we will keep the lower bound of each range instead.

In [15]:
def age(value):
    # Missing answers stay missing
    if pd.isnull(value):
        return None
    # Keep the lower bound of the range: "18 - 29" -> 18, "60+" -> 60
    lower = value.split(' ')[0].replace("+", "")
    return int(lower)

In [16]:
data["int_age"] = data["Age"].apply(age)
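
As an alternative to `apply`, the lower bound can also be extracted with the pandas string methods. This is a sketch on a small hand-written Series (`ages` is hypothetical and only mirrors the labels listed above):

```python
import numpy as np
import pandas as pd

# The four age labels from the survey plus one missing value
ages = pd.Series(["18 - 29", "30 - 44", "60+", "45 - 59", np.nan])

# Capture the leading digits of each label; missing labels stay NaN.
# The result is converted to float because NaN cannot be stored as int.
lower = ages.str.extract(r"^(\d+)", expand=False).astype(float)
print(lower)
```

Both approaches keep the same information; the vectorized version simply avoids writing a helper function.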

In [17]:
data["int_age"].describe()

count    947.000000
mean      40.089757
std       15.352014
min       18.000000
25%       30.000000
50%       45.000000
75%       60.000000
max       60.000000
Name: int_age, dtype: float64

Because we kept the lower bound of each age range, these statistics underestimate the real ages and are not very representative. It would also be interesting to compare this distribution with the age distribution of the overall American population.
We can, however, note that the values are relatively evenly distributed across the age ranges.
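
One way to check how evenly the answers are spread is to look at the share of each category rather than at summary statistics. A minimal sketch, using a made-up Series (`ages` is hypothetical, not the survey column):

```python
import numpy as np
import pandas as pd

# Hypothetical answers, including one missing value
ages = pd.Series(["18 - 29", "30 - 44", "60+", "45 - 59",
                  "30 - 44", "60+", "45 - 59", np.nan])

# normalize=True returns shares instead of counts;
# dropna=False keeps the missing answers as their own category
shares = ages.value_counts(normalize=True, dropna=False)
print(shares)
```

On the real data, the same call on `data["Age"]` would show directly whether one age range dominates.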

#### Let's analyse the income of the participants

As with the age column, we want to study the income of the participants.

In [18]:
data["How much total combined money did all members of your HOUSEHOLD earn last year?"].unique()

array(['$75,000 to $99,999', '$50,000 to $74,999', '$0 to $9,999',
       '$200,000 and up', '$100,000 to $124,999', '$25,000 to $49,999',
       'Prefer not to answer', '$10,000 to $24,999',
       '$175,000 to $199,999', '$150,000 to $174,999',
       '$125,000 to $149,999', nan], dtype=object)

As before, we need to convert the data to numerical values.

In [23]:
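def income(value):
    # Missing answers stay missing
    if pd.isnull(value):
        return None
    # First token is the lower bound ("$75,000") or "Prefer" for "Prefer not to answer"
    first = value.split(' ')[0]
    if first == "Prefer":
        return None
    # Strip the dollar sign and thousands separators: "$75,000" -> 75000
    return int(first.replace("$", "").replace(",", ""))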

In [24]:
data['int_income'] = data['How much total combined money did all members of your HOUSEHOLD earn last year?'].apply(income)

In [25]:
data['int_income'].describe()

count       829.000000
mean      75965.018094
std       59068.636748
min           0.000000
25%       25000.000000
50%       75000.000000
75%      100000.000000
max      200000.000000
Name: int_income, dtype: float64

Because we only kept the lower limit of each income range, our figures underestimate the real incomes.
Even so, the average income is \$75,965, slightly above the mean household income in the United States, which was \$72,641 according to the US Census Bureau 2014 Annual Social and Economic Supplement. The standard deviation (about \$59,000) is also high relative to the mean.
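
To see how much the lower-bound choice shifts a value, we can parse both ends of a range and compare the lower bound with the midpoint. This is a sketch only: the helper `income_bounds` is hypothetical, and open-ended labels such as "\$200,000 and up" have no upper bound, so no midpoint is computed for them.

```python
import re
import pandas as pd

def income_bounds(label):
    """Return (lower, upper) in dollars; upper is None for open-ended labels."""
    if pd.isnull(label) or label == "Prefer not to answer":
        return None
    # Find every dollar amount in the label, e.g. "$75,000 to $99,999"
    amounts = [int(a.replace(",", "")) for a in re.findall(r"\$([\d,]+)", label)]
    lower = amounts[0]
    upper = amounts[1] if len(amounts) > 1 else None
    return lower, upper

lower, upper = income_bounds("$75,000 to $99,999")
midpoint = (lower + upper) / 2
print(lower, upper, midpoint)

# Open-ended range: only the lower bound is known
print(income_bounds("$200,000 and up"))
```

For this bracket the midpoint is about \$12,500 above the lower bound, which gives an idea of the size of the downward shift.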

#### Correlating Travel Distance And Income

We can now see how the distance someone travels for Thanksgiving dinner relates to their income level. A reasonable hypothesis is that people earning less money tend to be younger and travel to their parents' houses for Thanksgiving. People earning more would then be more likely to have Thanksgiving at their own house.

In [26]:
data[data['int_income']<150000]["How far will you travel for Thanksgiving?"].value_counts()

Thanksgiving is happening at my home--I won't travel at all                         281
Thanksgiving is local--it will take place in the town I live in                     203
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    150
Thanksgiving is out of town and far away--I have to drive several hours or fly       55
Name: How far will you travel for Thanksgiving?, dtype: int64

In [27]:
data[data['int_income']>150000]["How far will you travel for Thanksgiving?"].value_counts()

Thanksgiving is happening at my home--I won't travel at all                         49
Thanksgiving is local--it will take place in the town I live in                     25
Thanksgiving is out of town but not too far--it's a drive of a few hours or less    16
Thanksgiving is out of town and far away--I have to drive several hours or fly      12
Name: How far will you travel for Thanksgiving?, dtype: int64

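The results are consistent with our hypothesis: about 48% of the high-income respondents (49 out of 102) have Thanksgiving at home, against about 41% of the lower-income respondents (281 out of 689). This fits the idea that people with less income are less likely to host a whole dinner. Students, for instance, often go back to their parents' house at this time. Note that both filters use strict inequalities, so respondents in the \$150,000 to \$174,999 bracket (encoded as 150000) fall in neither group.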
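
The two groups have very different sizes, so shares are easier to compare than raw counts. The sketch below re-enters the counts from the two outputs above by hand (the short row and column labels are our own) and divides each column by its total.

```python
import pandas as pd

# Counts copied from the two value_counts() outputs above
counts = pd.DataFrame(
    {"below 150000": [281, 203, 150, 55], "above 150000": [49, 25, 16, 12]},
    index=["at home", "local", "a few hours away", "far away"],
)

# Divide each column by its total to get the share of each answer per group
shares = counts / counts.sum()
print(shares.round(3))
```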

To take this idea further, it would be interesting to relate age to travel distance.

#### Linking Friendship And Age

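In the US, a "Friendsgiving" is when, instead of traveling home for the holiday, you celebrate it with friends who live in your area. The survey asks two questions about friendship: whether respondents have tried to meet up with hometown friends on Thanksgiving night, and whether they have attended a "Friendsgiving". Both seem skewed towards younger people. Let's see if this hypothesis holds up.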

In [34]:
data.pivot_table(index="Have you ever tried to meet up with hometown friends on Thanksgiving night?",
                columns='Have you ever attended a "Friendsgiving?"',
                values="int_age")

| Have you ever tried to meet up with hometown friends on Thanksgiving night? | Have you ever attended a "Friendsgiving?": No | Have you ever attended a "Friendsgiving?": Yes |
|---|---|---|
| No | 42.283702 | 37.010526 |
| Yes | 41.47541 | 33.976744 |


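Each cell is the mean `int_age` of the group, since `pivot_table` averages by default. People who have attended a "Friendsgiving" are younger on average (about 34 to 37) than those who have not (about 41 to 42).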

In [32]:
data.pivot_table(index="Have you ever tried to meet up with hometown friends on Thanksgiving night?",
                columns='Have you ever attended a "Friendsgiving?"',
                values="int_income")

| Have you ever tried to meet up with hometown friends on Thanksgiving night? | Have you ever attended a "Friendsgiving?": No | Have you ever attended a "Friendsgiving?": Yes |
|---|---|---|
| No | 78914.549654 | 72894.736842 |
| Yes | 78750.0 | 66019.736842 |


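It appears that younger people are more likely to have attended a Friendsgiving. The age difference linked to meeting up with hometown friends on Thanksgiving night is much smaller.
The second table also suggests that people who have attended a Friendsgiving have a lower average household income (about \$66,000 to \$73,000, against about \$79,000 for those who have not).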
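
Averages alone do not tell us how many respondents are behind each cell. A minimal sketch on a made-up table (`toy`, its short column names and its values are all hypothetical) shows how the group sizes can be obtained next to the means:

```python
import pandas as pd

# Hypothetical answers with shortened column names
toy = pd.DataFrame({
    "met_friends":   ["Yes", "Yes", "No", "No", "No", "Yes"],
    "friendsgiving": ["Yes", "No", "No", "Yes", "No", "Yes"],
    "int_age":       [18, 45, 60, 30, 45, 30],
})

# Mean age per group (aggfunc="mean" is also the default)
means = toy.pivot_table(index="met_friends", columns="friendsgiving",
                        values="int_age", aggfunc="mean")

# Number of respondents per group, to judge how reliable each mean is
sizes = toy.pivot_table(index="met_friends", columns="friendsgiving",
                        values="int_age", aggfunc="count")

print(means)
print(sizes)
```

Running the second call on the survey data would show whether some cells rest on only a handful of respondents.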