# Introduction to data mining with the tips dataset

- Romain Billot 
- Yannis Haralambous 
- Philippe Lenca 
- Sorin Moga

---
Lab 1: Important issues illustrated from a case study
- Data and Objective understanding
- Descriptive statistics
- Visualisation tools
- Regression
---

## Data and objective understanding

The Tips dataset. Food servers’ tips in restaurants may be influenced by many factors (e.g. the
nature and location of the restaurant, the size of the party, the table location and the day of the week).
Restaurant managers need to know which factors matter when they assign tables to food servers. Indeed,
for the sake of staff morale, they usually want to avoid either the substance or the appearance of unfair
treatment of the servers, for whom tips (at least in restaurants in the United States) are a major component
of pay.

In one restaurant, a food server recorded some data on all customers they served during an interval
of two and a half months in early 1990. The restaurant, located in a suburban shopping mall, was part
of a national chain and served a varied menu. In observance of local law the restaurant offered seating
in a non-smoking section to patrons who requested it. Each record includes a day and time, and thus
taken together, they show the server’s work schedule. The food server provided a comma-separated-value
file tips.csv containing 244 records, described by 7 variables (total_bill, tip, sex, smoker, day, time
and size; see Table 1).

In [1]:
# import useful libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import scipy as sp

### Question 1

What do you know from the text above and what information is missing?

---
This is a supervised learning problem: there is a target variable, the tip, and the other attributes are the independent (explanatory) variables, also called features.

---

### Question 2

Do you have some idea about the objectives of the study and the knowledge you could extract from the data? Could you suggest a list of questions of interest?

Here we want to know the influence of total bill, sex, smoker, day, time and size on tip by finding a model that can generalise the data.  

Some questions:

- Which factors influence the tip value the most?
- Which factors matter when managers assign tables to food servers?
  

### Question 3

Load the dataset and have a look at it using the describe() function. Describe the data (the format of the data and the quantity of data: number of examples/records and of variables/fields). What are the expected values and role of each variable?

In [2]:
# load data
data_tips = pd.read_csv('tips.csv')
# the type of the object data_tips
type(data_tips)
# the shape of the data (rows, columns)
data_tips.shape
# look at the first 5 rows
data_tips.head()
# variables in the data (only this last expression is displayed by the cell)
data_tips.columns


Index(['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size'], dtype='object')

In [3]:
# describe the data
data_tips.describe() # here we give some statistics about only numerical variables

| | total_bill | tip | size |
|---|---|---|---|
| count | 244.0 | 244.0 | 244.0 |
| mean | 19.785943 | 2.998279 | 2.569672 |
| std | 8.902412 | 1.383638 | 0.9511 |
| min | 3.07 | 1.0 | 1.0 |
| 25% | 13.3475 | 2.0 | 2.0 |
| 50% | 17.795 | 2.9 | 2.0 |
| 75% | 24.1275 | 3.5625 | 3.0 |
| max | 50.81 | 10.0 | 6.0 |


In [4]:
# give a description about categorical variables
data_tips.describe(include=['object']) 

| | sex | smoker | day | time |
|---|---|---|---|---|
| count | 244 | 244 | 244 | 244 |
| unique | 2 | 2 | 4 | 2 |
| top | Male | No | Sat | Dinner |
| freq | 157 | 151 | 87 | 176 |


In [5]:
# a description of each variable:
# 'total_bill': total amount of the bill
# 'tip': amount of the tip
# 'sex': sex of the bill payer
# 'smoker': whether the party included smokers
# 'day': day of the week (4 values, Thursday to Sunday)
# 'time': either dinner or lunch
# 'size': the size of the party

### Question 4

Tips are usually expressed as a percentage of the total bill, i.e. as a rate. This normalises the tip by the total bill and makes values comparable across the other variables. Create a "tip rate" variable and add it to the original dataset.

In [6]:
data_tips['tip_rate']=data_tips['tip']/data_tips['total_bill']
data_tips.head()

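| | total_bill | tip | sex | smoker | day | time | size | tip_rate |
|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0.059447 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 0.160542 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 | 0.166587 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 0.13978 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0.146808 |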


#### Homework

Explore the notion of scale of measurement. Provide a short note with meaningful definitions
and examples. Explain why it is important to consider the right scale for each variable.
What is the scale for each of the eight variables?

http://stattrek.com/statistics/measurement-scales.aspx?Tutorial=AP

## Descriptive statistics and visualisation

### Question 5

Explore univariate summaries with the R summary function (in pandas, `describe()` plays a similar role).

In [7]:
data_tips.describe()

| | total_bill | tip | size | tip_rate |
|---|---|---|---|---|
| count | 244.0 | 244.0 | 244.0 | 244.0 |
| mean | 19.785943 | 2.998279 | 2.569672 | 0.160803 |
| std | 8.902412 | 1.383638 | 0.9511 | 0.061072 |
| min | 3.07 | 1.0 | 1.0 | 0.035638 |
| 25% | 13.3475 | 2.0 | 2.0 | 0.129127 |
| 50% | 17.795 | 2.9 | 2.0 | 0.15477 |
| 75% | 24.1275 | 3.5625 | 3.0 | 0.191475 |
| max | 50.81 | 10.0 | 6.0 | 0.710345 |


In [8]:
data_tips.describe(include=['object'])

| | sex | smoker | day | time |
|---|---|---|---|---|
| count | 244 | 244 | 244 | 244 |
| unique | 2 | 2 | 4 | 2 |
| top | Male | No | Sat | Dinner |
| freq | 157 | 151 | 87 | 176 |


### Question 6

Plot the distribution of the days in the dataset and comment.

In [9]:
sns.countplot(x='day',data=data_tips)



In [10]:
sns.barplot(x='day',y= 'tip',data= data_tips)


### Question 7

Prepare a plot of the amount of tips against the total bill. What can you see? Test the correlation
between the two variables.

In [11]:
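# jointplot builds its own figure; keep the returned grid to set the title on it
g = sns.jointplot(x="total_bill", y="tip", data=data_tips)
# plt.title would label only one sub-axes, so title the whole figure instead
g.fig.suptitle('Tip vs total_bill', y=1.02)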

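The question also asks for a correlation test, which the cell above does not perform. A minimal sketch with `scipy.stats.pearsonr` follows; it uses only the five rows printed by `head()` earlier as a stand-in (the `toy` frame is an illustrative name), so the numbers it prints say nothing about the full dataset.

```python
import pandas as pd
from scipy import stats

# Stand-in for data_tips: the five rows shown by head() earlier in the notebook.
toy = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
})

# Pearson's r and the two-sided p-value for H0: no linear correlation.
r, p_value = stats.pearsonr(toy["total_bill"], toy["tip"])
print(f"r = {r:.3f}, p-value = {p_value:.3f}")
```

On the real data the same call would be `stats.pearsonr(data_tips["total_bill"], data_tips["tip"])`.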

### Question 8

Draw and interpret three boxplots:

1. the distribution of the total bill;
2. the distribution of tips;
3. the distributions of tips vs. days.

In [12]:
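# one axes per boxplot, so that the three plots are not drawn on top of each other
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
sns.boxplot(x='total_bill', data=data_tips, ax=axes[0])
sns.boxplot(x='tip', data=data_tips, ax=axes[1])
sns.boxplot(x='tip', y='day', data=data_tips, ax=axes[2])
axes[2].set_title('Tip vs. day')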


### Question 9

Draw a histogram of tips. What can you say about the shape of the data? Is this restaurant
expensive? Split the plotting window into 6 subplots (`par(mfrow=...)` in R, `plt.subplots` in matplotlib)
and plot 6 histograms with increasing numbers of breaks.

In [13]:
plt.hist(x='tip',data=data_tips)

(array([ 41.,  79.,  66.,  27.,  19.,   5.,   4.,   1.,   1.,   1.]),
 array([  1. ,   1.9,   2.8,   3.7,   4.6,   5.5,   6.4,   7.3,   8.2,
          9.1,  10. ]),
 <a list of 10 Patch objects>)
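The second half of the question (six histograms with increasing numbers of breaks) is not covered by the cell above. A possible sketch with `plt.subplots` follows. The `tips` array is synthetic right-skewed data used only so the snippet runs on its own, and the bin counts are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for data_tips['tip']: 244 right-skewed values (not the real data).
rng = np.random.default_rng(0)
tips = 1 + rng.gamma(shape=2.0, scale=1.0, size=244)

bin_counts = [5, 10, 20, 30, 50, 100]  # increasing numbers of bins (arbitrary)
fig, axes = plt.subplots(2, 3, figsize=(12, 6), sharex=True)
for ax, bins in zip(axes.ravel(), bin_counts):
    ax.hist(tips, bins=bins)
    ax.set_title(f"{bins} bins")
    ax.set_xlabel("tip")
fig.tight_layout()
```

With the real data, replace `tips` by `data_tips['tip']`. Few bins show only the overall shape, while many bins reveal finer structure at the cost of noisier bars.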

### Question 10



### Question 11

Display the counts (and proportions) for the gender of the bill payer and for smoking parties. Do the same
for the time of the day (dinner or lunch) and the day of the week.
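One way to get counts and proportions side by side is `value_counts()` on each categorical column. The sketch below runs on a small made-up table (`toy`, eight invented parties) so that it is self-contained; the figures are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for data_tips: eight made-up parties.
toy = pd.DataFrame({
    "sex":    ["Male", "Female", "Male", "Male", "Female", "Male", "Male", "Female"],
    "smoker": ["No", "No", "Yes", "No", "Yes", "No", "Yes", "No"],
    "time":   ["Dinner", "Lunch", "Dinner", "Dinner", "Lunch", "Dinner", "Dinner", "Dinner"],
    "day":    ["Sat", "Thur", "Sat", "Sun", "Fri", "Sun", "Sat", "Sun"],
})

summaries = {}
for column in ["sex", "smoker", "time", "day"]:
    counts = toy[column].value_counts()
    summaries[column] = pd.DataFrame({
        "count": counts,
        "proportion": counts / counts.sum(),
    })
    print(f"\n{column}")
    print(summaries[column])
```

Replacing `toy` with `data_tips` gives the tables for the real dataset.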

### Question 12

Who mostly pays the bills, men or women, and when? Try to visualise the conditional distributions
of sex given the day of the week with a mosaic plot.
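The conditional distribution of sex given the day can be tabulated with `pd.crosstab` before plotting anything. The sketch below uses a made-up table (`toy`, ten invented parties); each row of `cond` sums to 1.

```python
import pandas as pd

# Hypothetical stand-in for data_tips: ten made-up parties.
toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat", "Sat", "Sun", "Sun", "Sun"],
    "sex": ["Female", "Male", "Female", "Male", "Male", "Male", "Female",
            "Male", "Male", "Male"],
})

# Joint counts of (day, sex).
counts = pd.crosstab(toy["day"], toy["sex"])
# Conditional distribution P(sex | day): each row is normalised to sum to 1.
cond = pd.crosstab(toy["day"], toy["sex"], normalize="index")
print(counts)
print(cond)
```

For the plot itself, and assuming statsmodels is installed, `mosaic` from `statsmodels.graphics.mosaicplot` accepts a DataFrame and a list of column names, for example `mosaic(data_tips, index=['day', 'sex'])`.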

## Regression

### Question 13

Before starting with the regression, we will learn how to build dummy variables, which is sometimes
useful. Create four new variables, named thu, fri, sat, sun, that take the value 1 if the dining party
was held on that day and 0 otherwise. In R, use the `with` function and force the variable to the factor
type with the `factor` function.
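In pandas, the same indicators can be built with a comparison followed by `astype(int)`. The sketch below assumes the day labels are `Thur`, `Fri`, `Sat` and `Sun` (worth checking with `data_tips['day'].unique()`) and works on a tiny made-up column.

```python
import pandas as pd

# Hypothetical stand-in for data_tips['day'].
toy = pd.DataFrame({"day": ["Thur", "Fri", "Sat", "Sun", "Sat", "Sun"]})

# Assumed dataset labels mapped to the variable names asked for in the question.
day_names = {"Thur": "thu", "Fri": "fri", "Sat": "sat", "Sun": "sun"}
for label, name in day_names.items():
    # 1 if the party dined on that day, 0 otherwise.
    toy[name] = (toy["day"] == label).astype(int)

print(toy)
```

A rough counterpart of R's `factor` would be `toy[name].astype("category")`. `pd.get_dummies(toy["day"])` builds all four columns in one call, with the original labels as column names.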

### Question 14

Fit a general linear model with the tip rate as the response variable against all the other variables of
interest: sex, smoker, time, size, thu, fri, sat, sun.
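A minimal ordinary least squares sketch with NumPy follows, on synthetic data (the `toy` frame and its coefficients are invented for illustration). One point to keep in mind: the four day indicators always sum to 1, so together with an intercept they are perfectly collinear. The sketch therefore drops one level of every categorical variable (`drop_first=True`), which makes that level the reference.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for data_tips (values are made up).
rng = np.random.default_rng(1)
n = 60
toy = pd.DataFrame({
    "sex": rng.choice(["Male", "Female"], n),
    "smoker": rng.choice(["Yes", "No"], n),
    "time": rng.choice(["Dinner", "Lunch"], n),
    "day": rng.choice(["Thur", "Fri", "Sat", "Sun"], n),
    "size": rng.integers(1, 7, n),
})
toy["tip_rate"] = 0.20 - 0.01 * toy["size"] + rng.normal(0, 0.03, n)

# Design matrix: numeric column kept as is, categoricals turned into 0/1 dummies.
X = pd.get_dummies(
    toy[["sex", "smoker", "time", "day", "size"]], drop_first=True
).astype(float)
X.insert(0, "intercept", 1.0)

# Least squares fit of tip_rate on the design matrix.
beta, *_ = np.linalg.lstsq(X.to_numpy(), toy["tip_rate"].to_numpy(), rcond=None)
coefs = pd.Series(beta, index=X.columns)
print(coefs)
```

This gives the coefficients only. If statsmodels is available, `statsmodels.formula.api.ols("tip_rate ~ sex + smoker + time + size + day", data=data_tips).fit().summary()` should additionally report standard errors and p-values.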

### Question 15

Fit a model with only the size as an explanatory variable.
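With a single explanatory variable, `scipy.stats.linregress` is enough. The sketch below fits it on synthetic data (the arrays `size` and `tip_rate` are invented stand-ins for the dataset columns).

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for data_tips['size'] and data_tips['tip_rate'].
rng = np.random.default_rng(2)
size = rng.integers(1, 7, 80)
tip_rate = 0.20 - 0.01 * size + rng.normal(0, 0.03, 80)

# Simple linear regression: tip_rate = intercept + slope * size.
fit = stats.linregress(size, tip_rate)
print(f"slope = {fit.slope:.4f}, intercept = {fit.intercept:.4f}")
print(f"R^2 = {fit.rvalue ** 2:.3f}, p-value (slope) = {fit.pvalue:.3g}")
```

On the real data the call would be `stats.linregress(data_tips['size'], data_tips['tip_rate'])`.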

### Question 16

Use a stepwise algorithm with the AIC statistic as a variable selection process to select a good
model. Start from the full model of question 14. What do you notice?
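As an illustration of the idea, here is a simplified backward elimination driven by AIC, written with NumPy only. It is a sketch, not a reimplementation of R's `step`: the helper names `ols_aic` and `backward_aic` are hypothetical, the data are synthetic, the AIC is computed up to an additive constant as n·log(RSS/n) + 2k, and each column is dropped individually (a multi-level factor such as the day would normally be dropped as a group).

```python
import numpy as np
import pandas as pd

def ols_aic(X, y):
    """AIC of a Gaussian linear model, up to a constant: n*log(RSS/n) + 2k."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    n, k = X.shape
    return n * np.log(rss / n) + 2 * k

def backward_aic(design, y):
    """Remove one column at a time for as long as this lowers the AIC."""
    kept = list(design.columns)
    best = ols_aic(design[kept].to_numpy(), y)
    while len(kept) > 1:
        # AIC of each model obtained by dropping one non-intercept column.
        trials = {
            c: ols_aic(design[[col for col in kept if col != c]].to_numpy(), y)
            for c in kept if c != "intercept"
        }
        candidate = min(trials, key=trials.get)
        if trials[candidate] >= best:
            break  # no removal improves the AIC any further
        best = trials[candidate]
        kept.remove(candidate)
    return kept, best

# Synthetic data: only 'size' truly affects the response.
rng = np.random.default_rng(3)
n = 100
design = pd.DataFrame({
    "intercept": np.ones(n),
    "size": rng.integers(1, 7, n).astype(float),
    "male": rng.integers(0, 2, n).astype(float),
    "smoker": rng.integers(0, 2, n).astype(float),
    "lunch": rng.integers(0, 2, n).astype(float),
})
y = 0.25 - 0.02 * design["size"].to_numpy() + rng.normal(0, 0.02, n)

full_aic = ols_aic(design.to_numpy(), y)
kept, best_aic = backward_aic(design, y)
print("kept columns:", kept)
print(f"AIC full = {full_aic:.2f}, AIC selected = {best_aic:.2f}")
```

Because AIC penalises each extra parameter by only 2, a pure-noise column can occasionally survive the selection, so the retained set should be read with some caution.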

#### Homework

Explore the notion of interaction between the gender and the smoking habit by including
this interaction explicitly in a model with size, sex and smoker.
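An interaction between two 0/1 indicators is simply their product, added as an extra column of the design matrix. The sketch below shows this on synthetic data; the column names `male` and `male_x_smoker` and the simulated effect sizes are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for data_tips with 0/1 coded factors (values are made up).
rng = np.random.default_rng(4)
n = 120
toy = pd.DataFrame({
    "size": rng.integers(1, 7, n),
    "male": rng.integers(0, 2, n),    # 1 = Male, 0 = Female
    "smoker": rng.integers(0, 2, n),  # 1 = Yes, 0 = No
})
toy["tip_rate"] = (
    0.18 - 0.01 * toy["size"]
    + 0.03 * toy["male"] * toy["smoker"]  # simulated interaction effect
    + rng.normal(0, 0.02, n)
)

# Main effects plus the product term that carries the interaction.
X = pd.DataFrame({
    "intercept": 1.0,
    "size": toy["size"],
    "male": toy["male"],
    "smoker": toy["smoker"],
    "male_x_smoker": toy["male"] * toy["smoker"],
}).astype(float)

beta, *_ = np.linalg.lstsq(X.to_numpy(), toy["tip_rate"].to_numpy(), rcond=None)
coefs = pd.Series(beta, index=X.columns)
print(coefs)
```

The coefficient of `male_x_smoker` measures how much the effect of smoking differs between male and female bill payers, over and above the two main effects.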

### Question 17

Check the linear relationship between the tip and the total bill, seen in question 7, with a linear
model and interpret the quality of this model.
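A possible way to judge the quality of such a model is to look at R² and at the residual standard error. The sketch below does this with `scipy.stats.linregress` on the five rows printed by `head()` earlier, used purely as a stand-in; five points are far too few to draw conclusions.

```python
import numpy as np
from scipy import stats

# Stand-in for data_tips: the five rows shown by head() earlier in the notebook.
total_bill = np.array([16.99, 10.34, 21.01, 23.68, 24.59])
tip = np.array([1.01, 1.66, 3.50, 3.31, 3.61])

# Linear model: tip = intercept + slope * total_bill.
fit = stats.linregress(total_bill, tip)
residuals = tip - (fit.intercept + fit.slope * total_bill)

# Share of variance explained, and residual standard error with n - 2 degrees of freedom.
r_squared = fit.rvalue ** 2
resid_se = np.sqrt(np.sum(residuals ** 2) / (len(tip) - 2))
print(f"slope = {fit.slope:.4f}, intercept = {fit.intercept:.4f}")
print(f"R^2 = {r_squared:.3f}, residual standard error = {resid_se:.3f}")
```

On the full dataset, plotting `residuals` against `total_bill` would also help to check whether the spread of the residuals grows with the bill.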