# Tips Dataset
_______________________________________________________________________________________________________________________________

## Introduction 


This document explores the well-known Tips Dataset using Python packages, including Seaborn, in a Jupyter Notebook.

The project is divided into three main areas:

- Description: Descriptive statistics and plots to understand the Tips Dataset.
- Regression: Analyse the relationship between the total bill and tip amount.
- Analyse: Analyse the relationship between the variables within the dataset.

### About Tips Dataset 

One waiter recorded information about each tip he received over a period of two and a half months working in one restaurant in early 1990. The restaurant, located in a suburban shopping mall, was part of a national chain and served a varied menu. In observance of local law, the restaurant offered seating in a non-smoking section to patrons who requested it. The data was recorded on those days and during those times when the food server was routinely assigned to work.

The Tips Dataset contains 244 observations of 8 variables, which can help hospitality managers understand the factors that influence their business, including party size, smoking preference, and the day and time of the visit, among others.



 

# Description: 
In this section we explore the Tips Dataset using descriptive statistics and plots.
_______________________________________________________________________________________________________________________________


### Import Libraries: 
As a first step, we import the libraries we will use to analyse the data efficiently.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Load Dataset: 

The data is loaded from a CSV file, which is also available in this project's repository.

In [None]:
df = pd.read_csv("tips dataset.csv")

### Check Dataset Dimensions: 
The dataset contains 244 observations of 8 variables, as we can see when we print the shape.

In [None]:
print (df.shape)

### Evaluate the Data: 
First, we take a look at the attributes in the data by printing the first rows of the dataset. The Tips dataset has 8 variables: 4 numeric (ID, total_bill, tip and size) and 4 strings (sex, smoker, day and time).

It is also worth reviewing the tail to check that the data is consistent and to spot any changes or missing information. In the example below, we see no significant difference from the head and, as expected, there is a total of 244 observations, which indicates we can work with the data.

In [None]:
print (df.head())

In [None]:
print (df.tail())
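
The consistency check described above can also be done programmatically rather than by eye. The following is a minimal sketch, assuming a small made-up frame called `sample` that borrows some column names from the Tips data; the rows are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows (illustrative values only, not rows from the dataset)
sample = pd.DataFrame({
    "total_bill": [12.50, 24.00, None],
    "tip": [2.00, 3.50, 4.00],
    "sex": ["Female", "Male", "Male"],
    "size": [2, 3, 4],
})

# Count missing values per column; a non-zero count flags a column to inspect
missing = sample.isna().sum()
print(missing)

# Column types show which variables are numeric and which are strings
print(sample.dtypes)
```

On the real data the same two calls would be applied to `df` instead of `sample`.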

### Summary of each attribute and other observations:
The `describe` method gives a summary of the numeric attributes, including the count, mean, standard deviation, minimum, maximum and percentiles. 

From the information below we observe: 

- The average bill was about 20 dollars with an average tip of about 3 dollars, which means the restaurant's clients paid around 15% in tips on average. This is a reasonable standard amount for tips in America. 
- The minimum bill was about 3 dollars and the maximum about 51 dollars, which suggests the restaurant is affordable, considering the maximum is for more than one person. 
- Party sizes range from 1 to 6 people, which suggests the restaurant does not accommodate large groups in one service. 

In [None]:
print (df.describe())
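
The 15% figure above comes from dividing the average tip by the average bill. As a rough sketch of that arithmetic, the snippet below uses a made-up frame `sample` (illustrative values, not rows from the dataset) and also shows a second way of measuring the tip rate, the mean of the per-bill percentages, which does not have to give the same number.

```python
import pandas as pd

# Hypothetical bills and tips (illustrative values, not rows from the dataset)
sample = pd.DataFrame({"total_bill": [10.0, 20.0, 30.0], "tip": [2.0, 3.0, 4.0]})

# Ratio of the average tip to the average bill
overall_rate = sample["tip"].mean() / sample["total_bill"].mean()

# Average of the per-bill tip percentages (can differ from the ratio above)
per_bill_rate = (sample["tip"] / sample["total_bill"]).mean()

print(f"ratio of means: {overall_rate:.1%}, mean of ratios: {per_bill_rate:.1%}")
```

The two measures differ here because small bills carry proportionally larger tips in the made-up rows.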

### Tables View
The tables below give a clearer idea of the distribution of the attributes by comparing values.

**View 1:** In the table below we count the number of parties per day and observe:

- Monday, Tuesday and Wednesday do not appear in this dataset, possibly because the waiter did not have shifts on these days. 
- The number of parties on Fridays is considerably lower than on the other recorded days. The causes should be evaluated, since this is not a typical volume for a Friday, although the possibility that the waiter worked a limited number of hours should be considered as well. 
- The days with the most visits are Saturdays, followed by Sundays, as expected for weekends.

In [None]:
print(df.groupby('day').size())

**View 2:** In the table below we group by day of the week and time of visit, counting the rows through the sex column, and observe:

- There is no recorded information on lunch for Saturday or Sunday, so all recorded weekend visits were for dinner.
- There is just 1 visit for dinner on Thursday, which is unusual compared with the other 61 Thursday visits, all during lunch. This should be reviewed so that changes can be applied as necessary.
- In general, more people visit for dinner.

In [None]:
df[['day', 'time', 'sex']].groupby(['day', 'time']).count()
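
A cross-tabulation is another way to build this kind of day-by-time table. The sketch below uses `pd.crosstab` on a made-up frame `sample` (illustrative rows only, and the day labels are assumed); unlike the grouped count, it shows combinations that never occur as 0, which makes gaps such as missing weekend lunches easy to spot.

```python
import pandas as pd

# Hypothetical visits (illustrative values only)
sample = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Sat", "Sat", "Sun"],
    "time": ["Lunch", "Lunch", "Dinner", "Dinner", "Dinner", "Dinner"],
})

# Rows are days, columns are meal times, cells are visit counts;
# combinations that never occur show up as 0 instead of being omitted
counts = pd.crosstab(sample["day"], sample["time"])
print(counts)
```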

**View 3:** In the table below we compare the average bill and tip per gender and observe:

- On average, men pay slightly more in bills and give slightly less in tips than women when we compare the mean tip as a percentage of the mean total bill.  
- Men and women have a similar mean party size; there is no considerable difference.

In [None]:
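# numeric_only=True skips the string columns, which recent pandas versions refuse to average
print(df.groupby('sex').mean(numeric_only=True))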
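
To compare tipping by percentage directly, a tip-percentage column can be added before grouping. This is a minimal sketch on a made-up frame `sample` (illustrative rows only); the column name `tip_pct` is an assumption introduced here and is not part of the dataset.

```python
import pandas as pd

# Hypothetical rows (illustrative values only)
sample = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Female"],
    "total_bill": [20.0, 30.0, 10.0, 20.0],
    "tip": [3.0, 4.0, 2.0, 3.0],
})

# Tip as a share of the bill for each row (tip_pct is a hypothetical column name)
sample["tip_pct"] = sample["tip"] / sample["total_bill"] * 100

# Average bill, tip and tip percentage per group
by_sex = sample.groupby("sex")[["total_bill", "tip", "tip_pct"]].mean()
print(by_sex)
```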


**View 4:** In the table below we count parties by day of the week and gender and observe:

- Women prefer to visit on Thursdays and Saturdays.
- Men prefer Saturdays and Sundays.

In [None]:
df[['day', 'sex', 'size']].groupby(['day', 'sex']).count()

**View 5:** In the table below we count parties by gender and smoking preference and observe:

- More parties were seated in the non-smoking section, with no significant difference between men and women.
- In total, more men than women go to the restaurant. 

In [None]:
df[['sex', 'smoker', 'size']].groupby(['sex', 'smoker']).count()

### Data view 
The plots below give a clearer idea of the distribution of the attributes.

As a summary of what we can see in the plots below: 

- Most of the reservations are for 2 people.
- Most of the tips are between roughly 2 and 3 dollars, and most total bills are between approximately 12 and 18 dollars.
- On Saturdays and Sundays the number of men is double or more the number of women.
- Sunday is the day with the most tables for 4.
- Friday is the only day of the week where most parties prefer the smoking area.

**Plot 1:** Histograms of all the numeric variables

In [None]:
%matplotlib inline

In [None]:
fig = plt.figure(figsize = (8,8))
ax = fig.gca()
df.hist(ax = ax)

**Plot 2:** Count plot of parties per day and sex

In [None]:
sns.countplot(x='day', hue='sex', data=df)

**Plot 3:** Count plot of parties per day and size

In [None]:
sns.countplot(x='day', hue='size', data=df)

**Plot 4:** Count plot of smoking preferences per day

In [None]:
sns.countplot(x='day', hue='smoker', data=df)

# Regression
Analyses the relationship between the total bill and tip amount
_______________________________________________________________________________________________________________________________

Analysis: There are two ways to calculate the linear regression that describes the relationship between the total bill and the tip amount. I have calculated it both ways for a better understanding.

- Find the values of m and c that give the lowest sum of squared errors using NumPy functions

In [None]:
%matplotlib inline

In [None]:
tb = df['total_bill']  # total bill (x values)
ta = df['tip']         # tip amount (y values)

In [None]:
np.polyfit(tb,ta,1)
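
With a degree of 1, `np.polyfit` returns two coefficients, highest power first, so they can be unpacked as the slope and the intercept. The sketch below illustrates this on made-up points that lie exactly on a known line; the names `x`, `y`, `m_fit` and `c_fit` are assumptions used only for this illustration.

```python
import numpy as np

# Hypothetical points that lie exactly on y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# With degree 1, polyfit returns the coefficients highest power first: [slope, intercept]
m_fit, c_fit = np.polyfit(x, y, 1)
print(m_fit, c_fit)
```

Because the points sit exactly on the line, the fitted slope and intercept recover 2 and 1 up to floating-point rounding.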

- Find the values of m and c that give the lowest sum of squared errors using formulas
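
For reference, the closed-form expressions that the cells below compute can be written as follows, taking $x_i$ as the total bill (`tb`), $y_i$ as the tip (`ta`), and $\bar{x}$, $\bar{y}$ as their means. This is the standard least-squares result for a straight line $y = mx + c$, which minimises $\sum_i (y_i - m x_i - c)^2$:

$$
m = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
\qquad
c = \bar{y} - m\,\bar{x}
$$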

In [None]:
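# Means of the total bill and the tip
tb_avg = np.mean(tb)
ta_avg = np.mean(ta)

In [None]:
# Centre both variables on their means
tb_zero = tb - tb_avg
ta_zero = ta - ta_avg

In [None]:
# Slope of the least-squares line
m = np.sum(tb_zero * ta_zero) / np.sum(tb_zero * tb_zero)

In [None]:
# Intercept of the least-squares line
c = ta_avg - m * tb_avg

In [None]:
print("The coefficient is:", m, "The intercept is:", c)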

Results: 

- The coefficient is 0.10502451738435341
- The intercept is 0.9202696135546722
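
One way to read these two numbers is as a rule for estimating a tip from a bill. The sketch below copies the coefficient and intercept from the results above; `predict_tip` is a hypothetical helper introduced only for this illustration.

```python
# Slope and intercept copied from the results above
m = 0.10502451738435341
c = 0.9202696135546722


def predict_tip(total_bill):
    """Tip predicted by the fitted line for a given bill (hypothetical helper)."""
    return m * total_bill + c


# Predicted tip for a 20 dollar bill, close to the 3 dollar average tip
print(round(predict_tip(20.0), 2))
```

In other words, the line predicts roughly 10.5 cents of extra tip for each extra dollar on the bill, on top of a base of about 92 cents.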

### The plot of the best fit line 
In the plot below, we can see the best fit line between the total bill and the tip. The straight line is the least-squares fit computed above, and its upward slope shows a positive linear relationship between the total bill and the tips paid.

Tips tend to increase as the bill gets higher.

Bills of less than 10 dollars and bills of around 35 and 45 dollars seem to come from the less generous tippers. In general, the distribution seems even between generous and less generous tippers. 




In [None]:
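# Scatter the observations and overlay the fitted line
plt.plot(tb, ta, "k.", label='Original data')
plt.plot(tb, m*tb + c, 'b-', label='Best fit line')
plt.xlabel('Total Bill')
plt.ylabel('Tips')
plt.title("Best Fit Line Plot")
plt.legend()  # display the labels defined above
plt.show()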
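
The strength of the linear relationship seen in the plot can also be expressed as a number. The following is a minimal sketch using `np.corrcoef` on made-up values (`bills` and `tips` are illustrative arrays, not the dataset); for a straight-line fit with one explanatory variable, the square of this correlation is the share of the variance in the tips explained by the line.

```python
import numpy as np

# Hypothetical bills and tips (illustrative values only)
bills = np.array([10.0, 20.0, 30.0, 40.0])
tips = np.array([2.0, 2.5, 4.5, 5.0])

# Pearson correlation: +1 is a perfect increasing line, 0 is no linear relationship
r = np.corrcoef(bills, tips)[0, 1]

# For a one-variable least-squares fit, r squared is the share of variance explained
r_squared = r ** 2
print(round(r, 3), round(r_squared, 3))
```

On the real data the same call would take `tb` and `ta` as its arguments.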


References:

- https://en.wikipedia.org/wiki/Data_set
- https://rdrr.io/cran/regclass/man/TIPS.html
- https://dicook.public.iastate.edu/stat503/05/cs-tips2.pdf
- https://towardsdatascience.com/analyze-the-data-through-data-visualization-using-seaborn-255e1cd3948e
- https://amitkushwaha.co.in/data-visualization-part-1.html
- https://towardsdatascience.com/introduction-to-data-visualization-in-python-89a54c97fbed
- https://towardsdatascience.com/visualizing-data-with-pair-plots-in-python-f228cf529166
- https://swcarpentry.github.io/python-novice-gapminder/09-plotting/
- https://matplotlib.org/3.1.1/tutorials/introductory/sample_plots.html
- https://medium.com/@madanflies/linear-regression-on-carprice-dataset-or-encoding-a-categorical-dataset-in-linear-regression-7378f207e5c1
- https://towardsdatascience.com/linear-regression-on-boston-housing-dataset-f409b7e4a155
- https://towardsdatascience.com/simple-and-multiple-linear-regression-with-python-c9ab422ec29c
- https://hackernoon.com/predict-a-tip-using-machine-learning-aee94f467ef2
- https://towardsdatascience.com/linear-regression-using-python-b136c91bf0a2