# DS-SF-34 | 02 | The `pandas` Library | Assignment | Answer Key

## The Bistro Meets `pandas`

You've just told one of your friends that you are taking a Data Science class.  (Yeah!)  Your friend runs a bistro, a small restaurant serving moderately priced simple meals in a modest setting ([Wikipedia](https://en.wikipedia.org/wiki/Bistro)).  Over some period of time, she collected the following information about her patrons' visits.

| Variable's name | Its meaning |
|:---:|:---|
| `name` | Patron's first name |
| `gender` | Patron's gender |
| `is_smoker` | Whether the patron is smoking or not |
| `party` | Party's size |
| `check` | Check amount (\$) (after taxes but before tips) |
| `tip` | Tip (\$) that the patron added to the check |
| `day` | Week day of the visit |
| `time` | Rough time estimate of the visit |

In this assignment, we will be exploring this dataset using `pandas`.<sup>(*)</sup>

<sup>(*)</sup> This dataset was adapted from the `tips` dataset of the `seaborn` package (https://github.com/mwaskom/seaborn-data).

> ### Question 1.  Import `numpy` (as `np`) and `pandas` (as `pd`).

In [1]:
import os

import numpy as np
import pandas as pd

pd.set_option('display.max_rows', 10)
pd.set_option('display.notebook_repr_html', True)
pd.set_option('display.max_columns', 10)

> ### Question 2.  Read the `dataset-02-bistro.csv` dataset.

In [2]:
df = pd.read_csv(os.path.join('..', 'datasets', 'dataset-02-bistro.csv'))
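To experiment with `pd.read_csv()` without the file at hand, one option is to read from an in-memory string instead. This is a minimal sketch: the two rows are the first two rows shown later in this answer key, and `csv_text` and `toy` are illustrative names, not part of the assignment.

```python
import io

import pandas as pd

# Two rows copied from the answer key's output; the real file has 244 rows.
csv_text = """day,time,name,gender,is_smoker,party,check,tip
Sunday,Dinner,Kimberly,Female,False,2,16.99,1.01
Sunday,Dinner,Nicholas,Male,False,3,10.34,1.66
"""

# pd.read_csv accepts a file path or any file-like object.
toy = pd.read_csv(io.StringIO(csv_text))

# Column types are inferred from the text (numbers, booleans, strings).
print(toy.dtypes)
```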

> ### Question 3.  What is the class of the `pandas` object storing the dataset?

In [3]:
type(df)

pandas.core.frame.DataFrame

Answer: `pandas` stored the dataset as a `DataFrame`.

> ### Question 4.  How many samples (i.e., rows) are in this dataset?

In [4]:
df.shape[0]

244

(or)

In [5]:
len(df)

244

Answer: 244

> ### Question 5.  How many variables (i.e., columns) are in this dataset?

In [6]:
df.shape[1]

8

(or)

In [7]:
len(df.columns)

8

Answer: 8
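Since `.shape` is a `(rows, columns)` tuple, both counts from Questions 4 and 5 can be unpacked in one step. A small sketch on a hypothetical three-row `toy` frame:

```python
import pandas as pd

# Hypothetical miniature frame; the real dataset has 244 rows and 8 columns.
toy = pd.DataFrame({
    'name': ['Kimberly', 'Nicholas', 'Jon'],
    'check': [16.99, 10.34, 17.82],
})

# .shape is a (number of rows, number of columns) tuple.
n_rows, n_columns = toy.shape
print(n_rows, n_columns)
```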

> ### Question 6.  Print the name of each column in the dataset, one name per line.

In [8]:
for column in df.columns:
    print(column)

day
time
name
gender
is_smoker
party
check
tip


> ### Question 7.  Print the first two rows of the dataset to the console.  What does the output look like?

In [9]:
df[:2]

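| | day | time | name | gender | is_smoker | party | check | tip |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | Sunday | Dinner | Kimberly | Female | False | 2 | 16.99 | 1.01 |
| 1 | Sunday | Dinner | Nicholas | Male | False | 3 | 10.34 | 1.66 |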


(or)

In [10]:
df.iloc[:2]

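| | day | time | name | gender | is_smoker | party | check | tip |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | Sunday | Dinner | Kimberly | Female | False | 2 | 16.99 | 1.01 |
| 1 | Sunday | Dinner | Nicholas | Male | False | 3 | 10.34 | 1.66 |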


(or)

In [11]:
df.head(2)

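| | day | time | name | gender | is_smoker | party | check | tip |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 0 | Sunday | Dinner | Kimberly | Female | False | 2 | 16.99 | 1.01 |
| 1 | Sunday | Dinner | Nicholas | Male | False | 3 | 10.34 | 1.66 |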


In [12]:
type(df[:2])

pandas.core.frame.DataFrame

Answer: To print these two rows to the console, we subset the first two rows of the `DataFrame`.  The result is itself a `DataFrame`.
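The three approaches agree here because the dataset uses the default `0, 1, 2, ...` index. As a hedged illustration of the difference between position and label, the sketch below uses a hypothetical `toy` frame with made-up index labels `10, 20, 30`:

```python
import pandas as pd

# Hypothetical frame with non-default index labels, for illustration only.
toy = pd.DataFrame(
    {'name': ['Kimberly', 'Nicholas', 'Jon'], 'tip': [1.01, 1.66, 1.75]},
    index=[10, 20, 30],
)

first_two = toy.iloc[:2]       # selects by position: rows 0 and 1
also_first_two = toy.head(2)   # same rows as .iloc[:2]
by_label = toy.loc[:20]        # selects by label; the end label is included

print(type(first_two))
```

All three results are `DataFrame` objects; only `.loc` looks at the index labels.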

> ### Question 8.  Extract the last 2 rows of the data frame and print them to the console.  What does the output look like?

In [13]:
df[-2:]

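| | day | time | name | gender | is_smoker | party | check | tip |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 242 | Saturday | Dinner | Jon | Male | False | 2 | 17.82 | 1.75 |
| 243 | Thursday | Dinner | Brandi | Female | False | 2 | 18.78 | 3.0 |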


(or)

In [14]:
df.iloc[-2:]

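| | day | time | name | gender | is_smoker | party | check | tip |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 242 | Saturday | Dinner | Jon | Male | False | 2 | 17.82 | 1.75 |
| 243 | Thursday | Dinner | Brandi | Female | False | 2 | 18.78 | 3.0 |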


(or)

In [15]:
df.tail(2)

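| | day | time | name | gender | is_smoker | party | check | tip |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 242 | Saturday | Dinner | Jon | Male | False | 2 | 17.82 | 1.75 |
| 243 | Thursday | Dinner | Brandi | Female | False | 2 | 18.78 | 3.0 |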


Answer: As in the previous question, we subsetted the last two rows of the `DataFrame` as another `DataFrame`.

> ### Question 9.  Does the dataset contain any missing values?

In [16]:
df.isnull().sum().sum()

0

Answer: There are no missing values in the dataset.
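When a dataset does contain gaps, the intermediate per-column counts are usually more informative than the grand total. The sketch below assumes a hypothetical `toy` frame with two deliberately missing entries; it is not the bistro data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing values planted on purpose.
toy = pd.DataFrame({
    'name': ['Kimberly', 'Nicholas', None],
    'tip': [1.01, np.nan, 1.75],
    'party': [2, 3, 2],
})

per_column = toy.isnull().sum()          # missing values in each column
total = per_column.sum()                 # missing values overall
has_missing = toy.isnull().any().any()   # True if anything is missing

print(per_column)
```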

> ### Question 10.  What can you say about the `is_smoker` variable?  I.e., will it bring any insights when analyzing the dataset?  What do you want to do with it?  (and do it...)

In [17]:
df.is_smoker.unique()

array([False], dtype=object)

Answer: No patrons are smokers.  Since there is no signal in this variable, we should get rid of it.

In [18]:
df.drop('is_smoker', axis = 1, inplace = True)
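With more columns, checking each one by hand gets tedious. One possible approach, sketched here on a hypothetical `toy` frame, is to list every column that holds a single distinct value and drop them together. It uses `drop(columns=...)`, which returns a new `DataFrame` rather than modifying the original in place.

```python
import pandas as pd

# Hypothetical miniature frame; 'is_smoker' is constant, as in the dataset.
toy = pd.DataFrame({
    'name': ['Kimberly', 'Nicholas', 'Jon'],
    'is_smoker': [False, False, False],
    'party': [2, 3, 2],
})

# A column with at most one distinct value carries no signal.
constant_columns = [c for c in toy.columns if toy[c].nunique() <= 1]

# Returns a new frame; `toy` itself is left unchanged.
trimmed = toy.drop(columns=constant_columns)

print(constant_columns)
```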

> ### Question 11.  For which week days does the dataset have data?

In [19]:
df.day.unique()

array(['Sunday', 'Saturday', 'Thursday', 'Friday'], dtype=object)

Answer: Thursdays to Sundays.  (No data for Mondays to Wednesdays)

> ### Question 12.  How often was the bistro patronized for each week day?

(check `.value_counts()`; it could come in handy)

(http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html)

In [20]:
df.day.value_counts()

Saturday    87
Sunday      76
Thursday    62
Friday      19
Name: day, dtype: int64

(or)

In [21]:
df.groupby('day').tip.count()

day
Friday      19
Saturday    87
Sunday      76
Thursday    62
Name: tip, dtype: int64

Answer:

| Day | Total Visits |
|:---:|:---:|
| Thursdays | 62 |
| Fridays | 19 |
| Saturdays | 87 |
| Sundays | 76 |
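Note that `.value_counts()` sorts by count and `.groupby()` sorts the days alphabetically, so neither matches the calendar order used in the answer table. A hedged sketch of one way to impose that order with `.reindex()`, on hypothetical visits (`weekday_order` is an illustrative name):

```python
import pandas as pd

# Hypothetical visits, for illustration only.
toy = pd.DataFrame({
    'day': ['Sunday', 'Saturday', 'Saturday', 'Thursday', 'Friday', 'Saturday'],
})

counts = toy['day'].value_counts()   # sorted by count, largest first

# Reorder the result to follow the calendar instead.
weekday_order = ['Thursday', 'Friday', 'Saturday', 'Sunday']
ordered = counts.reindex(weekday_order)

print(ordered)
```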

> ### Question 13.  How much tip did waiters collect for each week day?

In [22]:
df.groupby('day').tip.sum()

day
Friday       51.96
Saturday    260.40
Sunday      247.39
Thursday    171.83
Name: tip, dtype: float64

Answer:

| Day | Total Tips |
|:---:|:---:|
| Thursdays | \$171.83 |
| Fridays | \$51.96 |
| Saturdays | \$260.40 |
| Sundays | \$247.39 |

> ### Question 14.  What is the average tip per check (in absolute \$) for each week day?

In [23]:
df.groupby('day').tip.sum() / df.groupby('day').tip.count()

day
Friday      2.734737
Saturday    2.993103
Sunday      3.255132
Thursday    2.771452
Name: tip, dtype: float64

Answer:

| Day | Average tip per check |
|:---:|:---:|
| Thursdays | \$2.77 |
| Fridays | \$2.73 |
| Saturdays | \$2.99 |
| Sundays | \$3.26 |
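Dividing the sum by the count is the definition of the mean, so `.mean()` on the grouped column should give the same numbers. A small check on a hypothetical `toy` frame:

```python
import pandas as pd

# Hypothetical tips, for illustration only.
toy = pd.DataFrame({
    'day': ['Sunday', 'Sunday', 'Saturday', 'Thursday'],
    'tip': [1.01, 1.66, 1.75, 3.00],
})

by_day = toy.groupby('day')['tip']

manual_mean = by_day.sum() / by_day.count()   # the approach used above
builtin_mean = by_day.mean()                  # the built-in equivalent

print(builtin_mean)
```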

> ### Question 15.  What is the average tip per check (as a percentage of the check) for each week day?

In [24]:
df.groupby('day').tip.sum() / df.groupby('day').check.sum()

day
Friday      0.159445
Saturday    0.146424
Sunday      0.152038
Thursday    0.156732
dtype: float64

Answer:

| Day | Average tip per check (% of check) |
|:---:|:---:|
| Thursdays | 15.7% |
| Fridays | 15.9% |
| Saturdays | 14.6% |
| Sundays | 15.2% |
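The computation above divides total tips by total checks, which weights large checks more heavily. Another reading of the question is to compute the tip rate of each check first and then average those rates; the two generally differ. The sketch below uses two made-up checks to show the gap (`toy_rate` is an illustrative name):

```python
import pandas as pd

# Two made-up checks on the same day: a small one with a generous tip
# and a large one with a modest tip.
toy = pd.DataFrame({
    'day': ['Sunday', 'Sunday'],
    'check': [10.0, 100.0],
    'tip': [3.0, 10.0],
})

# Ratio of sums: total tips over total checks (the approach used above).
ratio_of_sums = toy.groupby('day')['tip'].sum() / toy.groupby('day')['check'].sum()

# Mean of ratios: the tip rate of each check, averaged per day.
toy_rate = toy['tip'] / toy['check']
mean_of_ratios = toy_rate.groupby(toy['day']).mean()

print(ratio_of_sums['Sunday'], mean_of_ratios['Sunday'])
```

Here the ratio of sums is 13/110 (about 11.8%), while the mean of the two rates (30% and 10%) is 20%.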

> ### Question 16.  Are there any names in common between male and female patrons?  (E.g., `Chris` can refer to either a man or a woman)

(check `numpy.intersect1d()`; it could come in handy)

(https://docs.scipy.org/doc/numpy/reference/generated/numpy.intersect1d.html)

In [25]:
np.intersect1d(df[df.gender == 'Male'].name, df[df.gender == 'Female'].name)

array(['Casey'], dtype=object)

Answer: `Casey` is a name used by both a male and a female patron.
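For comparison, plain Python sets give the same result as `np.intersect1d()`. This is a sketch on a hypothetical four-row `toy` frame, not the bistro data:

```python
import numpy as np
import pandas as pd

# Hypothetical patrons, for illustration only.
toy = pd.DataFrame({
    'name': ['Casey', 'David', 'Casey', 'Mary'],
    'gender': ['Male', 'Male', 'Female', 'Female'],
})

male_names = toy.loc[toy['gender'] == 'Male', 'name']
female_names = toy.loc[toy['gender'] == 'Female', 'name']

# NumPy: sorted unique values present in both inputs.
shared_np = np.intersect1d(male_names, female_names)

# Pure Python: set intersection, sorted for a stable order.
shared_set = sorted(set(male_names) & set(female_names))

print(shared_set)
```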

> ### Question 17.  If no patrons share the same name, how many unique patrons are in the dataset?

- We need to count the names of men and women separately:

In [26]:
len(df[df.gender == 'Male'].name.unique()) +\
    len(df[df.gender == 'Female'].name.unique())

182

- We can also group names by gender:

In [27]:
len(df.groupby(['name', 'gender']))

182

Answer: 182
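Two further ways to count distinct (`name`, `gender`) pairs are `.drop_duplicates()` and the `ngroups` attribute of a `groupby` object. The hypothetical `toy` frame below also shows why counting names alone would undercount when a name is shared across genders:

```python
import pandas as pd

# Hypothetical patrons; 'Casey' appears as both a man and a woman.
toy = pd.DataFrame({
    'name': ['Casey', 'David', 'Casey', 'Casey', 'Mary'],
    'gender': ['Male', 'Male', 'Female', 'Male', 'Female'],
})

# Distinct (name, gender) pairs, counted two ways.
n_pairs = len(toy[['name', 'gender']].drop_duplicates())
n_groups = toy.groupby(['name', 'gender']).ngroups

# Counting names alone merges the two Caseys into one.
n_names = toy['name'].nunique()

print(n_pairs, n_groups, n_names)
```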

> ### Question 18.  How many times did `Kevin` patronize the bistro?  How about `Alice`?

In [28]:
(df.name == 'Kevin').sum()

4

In [29]:
(df.name == 'Alice').sum()

2

Answer:
- `Kevin`: 4
- `Alice`: 2
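When looking up several names, it can be convenient to compute the visit counts once with `.value_counts()` and then query the result. A sketch on hypothetical data; `.get()` with a default of `0` handles a name that never appears (`Zoe` is a made-up example):

```python
import pandas as pd

# Hypothetical visits, for illustration only.
toy = pd.DataFrame({'name': ['Kevin', 'Alice', 'Kevin', 'Mary']})

visits = toy['name'].value_counts()

kevin_visits = visits.get('Kevin', 0)
zoe_visits = visits.get('Zoe', 0)   # absent name: falls back to the default

print(kevin_visits, zoe_visits)
```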

> ### Question 19.  Who are the top 3 female and male patrons?

In [30]:
df[df.gender == 'Female'].name.value_counts().head(3)

Mary     4
Casey    3
Laura    3
Name: name, dtype: int64

In [31]:
df[df.gender == 'Male'].name.value_counts().head(3)

David    8
James    5
Casey    5
Name: name, dtype: int64

Answer:

- Top 3 women: `Mary` (4); `Laura` and `Casey` (3 each)
- Top 3 men: `David` (8); `James` and `Casey` (5 each)

> ### Question 20.  Who's the best tipper (as a fraction of all tips over all check totals)?  Who's the worst?  How many times did they patronize the bistro?

(check `.sort_values()`; it could come in handy)

- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html)
- (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.sort_values.html)

In [32]:
(df.groupby(['name', 'gender']).tip.sum() / df.groupby(['name', 'gender']).check.sum()).\
    sort_values(ascending = False)

name      gender
Maryann   Female    0.416667
Bailey    Female    0.325733
Daniel    Male      0.313607
Dennis    Male      0.280535
Zackary   Male      0.266312
                      ...   
Willie    Male      0.066534
Kimberly  Female    0.059447
Destiny   Female    0.056797
Mildred   Female    0.056433
Jeremy    Male      0.035638
dtype: float64

In [33]:
df[df.name == 'Maryann']

| | day | time | name | gender | party | check | tip |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 178 | Sunday | Dinner | Maryann | Female | 2 | 9.6 | 4.0 |


In [34]:
df[df.name == 'Jeremy']

| | day | time | name | gender | party | check | tip |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 237 | Saturday | Dinner | Jeremy | Male | 2 | 32.83 | 1.17 |


Answer:
- `Maryann` is the best tipper.  She patronized the restaurant once and gave \$4.00 on a \$9.60 check.
- `Jeremy` is, at the other end, the worst tipper.  He also patronized the restaurant only once and gave a mere \$1.17 on a \$32.83 check.
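Instead of sorting and reading off both ends, the extremes and their visit counts can also be pulled out directly with `.idxmax()` and `.idxmin()`. In the sketch below, the `Maryann` and `Jeremy` rows come from the outputs above, while the two `David` rows are hypothetical filler; `totals`, `best`, and `worst` are illustrative names.

```python
import pandas as pd

# Maryann's and Jeremy's rows are taken from the outputs above;
# the two David rows are made up for illustration.
toy = pd.DataFrame({
    'name': ['Maryann', 'Jeremy', 'David', 'David'],
    'gender': ['Female', 'Male', 'Male', 'Male'],
    'check': [9.60, 32.83, 20.00, 30.00],
    'tip': [4.00, 1.17, 3.00, 5.00],
})

# Total tips and checks per patron, plus the number of visits.
totals = toy.groupby(['name', 'gender'])[['tip', 'check']].sum()
totals['visits'] = toy.groupby(['name', 'gender']).size()
totals['tip_rate'] = totals['tip'] / totals['check']

# idxmax / idxmin return the (name, gender) label of the extreme row.
best = totals['tip_rate'].idxmax()
worst = totals['tip_rate'].idxmin()

print(best, worst)
```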