<a href="https://www.kaggle.com/code/saadatkhalid/eda-and-tips-prediction-waiter-tips?scriptVersionId=163566357" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

<div style="background-color: #FF7F50; color: #FFFFFF; padding: 10px; font-size: 30px; font-weight: bold; text-align: center; border-radius: 5px;">Project Title: 📊 EDA and Tips Prediction - 🕴Waiter Tips 💲 </div>


<div style="background-color: #FF7F50; color: #FFFFFF; padding: 10px; font-size: 24px; font-weight: bold; text-align: left; border-radius: 5px;">About Dataset 🕴</div>

__Context__

Tipping waiters for serving food depends on many factors, such as the type of restaurant, how many people are in your party, and the size of the bill.

__Content__

One waiter recorded information about each tip he received over a period of a few months working in one restaurant. In all he recorded 244 tips.

__Acknowledgements__

The data was reported in a collection of case studies for business statistics.

Bryant, P. G. and Smith, M. (1995). Practical Data Analysis: Case Studies in Business Statistics. Homewood, IL: Richard D. Irwin Publishing.

_______

For each tip, the food server recorded the following information:

_total_bill:_ total bill in dollars, including taxes\
_tip:_ tip given to the waiter in dollars\
_sex:_ gender of the person paying the bill\
_smoker:_ whether the person smoked or not\
_day:_ day of the week\
_time:_ lunch or dinner\
_size:_ number of people at the table

Based on this data, our task is to find the factors that affect waiter tips and to train a machine learning model that predicts the tip amount.


_Dataset is taken from [Kaggle](https://www.kaggle.com/datasets/jsphyg/tipping/data)_

<div style="background-color: #36c457; color: #FFFFFF; padding: 10px; font-size: 24px; font-weight: bold; text-align: center; border-radius: 5px;">About Author: Saadat Khalid Awan</div>

**Email:** *me.saadi96@gmail.com*\
**Website:** *https://thesaadat.blogspot.com/*

### 🌐 Let's Connect:
[![Facebook](https://img.shields.io/badge/Facebook-%231877F2.svg?logo=Facebook&logoColor=white)](https://facebook.com/Saadat.Khalid.Awan)

[![Instagram](https://img.shields.io/badge/Instagram-%23E4405F.svg?logo=Instagram&logoColor=white)](https://instagram.com/saadii_awan66)

[![LinkedIn](https://img.shields.io/badge/LinkedIn-%230077B5.svg?logo=linkedin&logoColor=white)](https://linkedin.com/in/saadatawan)

[![Medium](https://img.shields.io/badge/Medium-12100E?logo=medium&logoColor=white)](https://medium.com/@@me.saadat)

[![Pinterest](https://img.shields.io/badge/Pinterest-%23E60023.svg?logo=Pinterest&logoColor=white)](https://pinterest.com/its_saadatkhalid)

[![Quora](https://img.shields.io/badge/Quora-%23B92B27.svg?logo=Quora&logoColor=white)](https://quora.com/profile/Saadat-Khalid-Awan)

[![TikTok](https://img.shields.io/badge/TikTok-%23000000.svg?logo=TikTok&logoColor=white)](https://tiktok.com/@saadat.awan)

[![Twitter](https://img.shields.io/badge/Twitter-%231DA1F2.svg?logo=Twitter&logoColor=white)](https://twitter.com/saadat_96)

[![YouTube](https://img.shields.io/badge/YouTube-%23FF0000.svg?logo=YouTube&logoColor=white)](https://youtube.com/@saadatkhalidawan)

[![Github](https://img.shields.io/badge/Github-%23FF0000.svg?logo=Github&logoColor=Black)](https://github.com/Saadat-Khalid/)


# Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# EDA

Exploratory Data Analysis (EDA) is an approach to analyzing datasets with the objective of summarizing their main characteristics, often employing statistical graphics and other data visualization methods. The primary goal of EDA is to gain insights, detect patterns, and understand the structure of the data in order to inform subsequent steps in the data analysis process.

# Load the dataset

In [2]:
df = pd.read_csv("/kaggle/input/waiterswaitresses-tips/tips.csv")

In [3]:
df.head()

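| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |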


In [4]:
df.shape

(244, 7)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB


## Missing Values

In [6]:
df.isnull().sum()

total_bill    0
tip           0
sex           0
smoker        0
day           0
time          0
size          0
dtype: int64

# Descriptive Statistics

In [7]:
df.describe()

| | total_bill | tip | size |
|---|---|---|---|
| count | 244.0 | 244.0 | 244.0 |
| mean | 19.785943 | 2.998279 | 2.569672 |
| std | 8.902412 | 1.383638 | 0.9511 |
| min | 3.07 | 1.0 | 1.0 |
| 25% | 13.3475 | 2.0 | 2.0 |
| 50% | 17.795 | 2.9 | 2.0 |
| 75% | 24.1275 | 3.5625 | 3.0 |
| max | 50.81 | 10.0 | 6.0 |


# Total Bill and Tip

In [8]:
# Distribution of tip amount
fig = px.histogram(df, x='tip', title='Distribution of Tip Amount')
fig.show()

In [9]:
# Relationship between tip amount and total bill
fig = px.scatter(df, x='total_bill', y='tip', title='Tip Amount vs Total Bill')
fig.show()


In [10]:
correlation = df['total_bill'].corr(df['tip'])
print("Correlation coefficient between total bill and tip:", correlation)


Correlation coefficient between total bill and tip: 0.6757341092113641


The Pearson correlation coefficient between the 'total_bill' and 'tip' variables is approximately 0.676.

This positive correlation indicates a moderately strong linear relationship between the total bill amount and the tip amount. In other words, as the total bill amount increases, the tip amount tends to increase as well. 
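As a rough illustration of what `.corr()` computes, the sketch below evaluates Pearson's r by hand on the five rows shown in the `df.head()` output and compares the result with pandas. The `sample` frame and the `pearson_r` helper are names introduced here for illustration only; with just five rows the value will differ from the full-data coefficient of 0.676.

```python
import numpy as np
import pandas as pd

# First five rows of the dataset, copied from the df.head() output
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
})

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

manual = pearson_r(sample["total_bill"], sample["tip"])
builtin = sample["total_bill"].corr(sample["tip"])
print("manual:", manual)
print("pandas:", builtin)
```

Both numbers should agree up to floating-point error, since pandas uses the Pearson method by default.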

The highest total bill is `50.81` and the lowest is `3.07`.

The highest tip is `10.00` and the lowest is `1.00`, while the average tip is about `3.00` (`2.998279`).

# Smoker VS Non-Smoker

In [11]:
df['smoker'].value_counts()

smoker
No     151
Yes     93
Name: count, dtype: int64

In [12]:
# Distribution of categorical variables
fig = px.histogram(df, x='smoker', title='Distribution of Smoker', labels={'smoker': 'Smoker'})
fig.show()

`151` records are from non-smokers and `93` are from smokers.

## Time

In [13]:
df['time'].value_counts()

time
Dinner    176
Lunch      68
Name: count, dtype: int64

In [14]:
# Distribution of categorical variables
fig = px.histogram(df, x='time', title='Distribution of Time', labels={'time': 'Time'})
fig.show()

There are `176` instances recorded as `'Dinner'` and `68` instances recorded as `'Lunch'` in the dataset.

## Day

In [15]:
df['day'].value_counts()

day
Sat     87
Sun     76
Thur    62
Fri     19
Name: count, dtype: int64

In [16]:
fig = px.histogram(df, x='day', title='Distribution of Days', labels={'day': 'Day'})
fig.show()

`87` instances were recorded on `Saturday`, `76` on `Sunday`, `62` on `Thursday`, and `19` on `Friday`.

# Tip Amount by Day

In [17]:
# Box plot of tip amount by day
fig = px.box(df, x='day', y='tip', title='Tip Amount by Day', labels={'day': 'Day', 'tip': 'Tip Amount'})
fig.show()

In [18]:
total_tips_by_day = df.groupby('day')['tip'].sum()
print(total_tips_by_day)


day
Fri      51.96
Sat     260.40
Sun     247.39
Thur    171.83
Name: tip, dtype: float64


In [19]:
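# Share of the total tip amount collected on each day (px.pie sums `tip` per day)
figure = px.pie(df, values='tip', names='day', hole=0.2)
figure.show()

In [20]:
# Share of the total tip amount collected at lunch vs. dinner
figure = px.pie(df, values='tip', names='time', hole=0.5)
figure.show()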

In [21]:
# Distribution of categorical variables
fig = px.histogram(df, x='sex', title='Distribution of Gender', labels={'sex': 'Gender'})
fig.show()

In [22]:
# Grouped histogram of gender counts per day, with one panel per time of day
fig = px.histogram(df, x='day', color='sex', facet_col='time',
                   title='Gender Distribution based on Time and Day',
                   labels={'day': 'Day', 'sex': 'Gender', 'time': 'Time'},
                   barmode='group')

# Update layout
fig.update_layout(xaxis_title='Day', yaxis_title='Count')

# Show plot
fig.show()


In [23]:
# Aggregate data based on day and time
agg_data = df.groupby(['day', 'time'])['tip'].sum().reset_index()

# Create a sunburst chart
fig = px.sunburst(agg_data, path=['day', 'time'], values='tip', title='Sunburst Chart for Tip Dataset')

# Show plot
fig.show()

# Preprocess the Data

Convert categorical variables into numerical ones using Label Encoding.

In [24]:
df.head()

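| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |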


# Encoding the Data

In [25]:
label_encoder = LabelEncoder()
df['sex'] = label_encoder.fit_transform(df['sex'])
df['smoker'] = label_encoder.fit_transform(df['smoker'])
df['day'] = label_encoder.fit_transform(df['day'])
df['time'] = label_encoder.fit_transform(df['time'])


In [26]:
df.head()

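| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | 0 | 0 | 2 | 0 | 2 |
| 1 | 10.34 | 1.66 | 1 | 0 | 2 | 0 | 3 |
| 2 | 21.01 | 3.50 | 1 | 0 | 2 | 0 | 3 |
| 3 | 23.68 | 3.31 | 1 | 0 | 2 | 0 | 2 |
| 4 | 24.59 | 3.61 | 0 | 0 | 2 | 0 | 4 |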
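To read the encoded values shown above (for example, `Sun` became `2`), it may help to see how `LabelEncoder` assigns codes. The sketch below fits a fresh encoder on the four day labels. The names `days`, `encoder`, and `day_mapping` are introduced here for illustration and are not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# The four day labels that appear in the dataset
days = ["Sun", "Sat", "Thur", "Fri"]

encoder = LabelEncoder()
codes = encoder.fit_transform(days)

# classes_ holds the sorted labels; a label's position is its integer code
day_mapping = {str(label): int(code)
               for code, label in enumerate(encoder.classes_)}
print(day_mapping)

# inverse_transform recovers the original labels from the codes
decoded = [str(label) for label in encoder.inverse_transform(codes)]
print(decoded)
```

The codes follow alphabetical order of the labels, not the order of the week. Because the notebook reuses a single `label_encoder` for all four columns, that object only remembers the last column it was fitted on (`time`). Keeping one encoder per column is one way to make decoding possible later.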


Split the data into training and testing sets

In [27]:
X = df.drop('tip', axis=1)  # Features
y = df['tip']                # Target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
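As a quick sanity check on what `test_size=0.2` and `random_state=42` do, the sketch below splits placeholder arrays with the same number of rows as the tips data (244). `X_demo` and `y_demo` are made-up stand-ins, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the tips dataset (244)
X_demo = np.arange(244 * 6).reshape(244, 6)
y_demo = np.arange(244)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# The same random_state reproduces the same split on a second call
_, X_te_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te_again))
```

With 244 rows, 20% rounds up to 49 test rows, leaving 195 for training.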

#  Machine Learning

A Linear Regression Model in machine learning is like drawing a straight line through data points to predict a continuous outcome based on input features. It's used to understand how changes in the input features relate to changes in the target variable.
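To make the "straight line" idea concrete, here is a minimal sketch that fits a one-feature regression of `tip` on `total_bill` using only the five rows shown in `df.head()`. The variable names (`line`, `slope`, `intercept`) are illustrative, and a five-row fit says nothing about the quality of the full model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# total_bill and tip for the first five rows shown in df.head()
total_bill = np.array([[16.99], [10.34], [21.01], [23.68], [24.59]])
tip = np.array([1.01, 1.66, 3.50, 3.31, 3.61])

line = LinearRegression().fit(total_bill, tip)
slope = line.coef_[0]
intercept = line.intercept_
print("slope:", slope, "intercept:", intercept)

# A prediction is just intercept + slope * total_bill
by_hand = intercept + slope * total_bill[:, 0]
by_model = line.predict(total_bill)
print(np.allclose(by_hand, by_model))
```

With several features, as in the model trained below, the prediction is the intercept plus one such slope term per feature.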

In [28]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)

In [29]:
# Feature order: total_bill, sex, smoker, day, time, size
# Passing a DataFrame with the training column names avoids scikit-learn's
# "X does not have valid feature names" warning.
features = pd.DataFrame([[24.50, 1, 0, 0, 1, 4]], columns=X_train.columns)
model.predict(features)


array([3.97416925])

In [30]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate evaluation metrics
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

# Print the evaluation metrics
print("Mean Absolute Error:", mae)
print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("R-squared:", r2)


Mean Absolute Error: 0.6703807496461157
Mean Squared Error: 0.694812968628771
Root Mean Squared Error: 0.8335544185167343
R-squared: 0.4441368826121932


__Mean Absolute Error (MAE):__ The average absolute difference between the predicted and actual values. Here, the predicted tips are off by about 0.67 dollars on average.

__Mean Squared Error (MSE):__ The average of the squared differences between the predicted and actual values, which penalizes large errors more heavily. Here it is approximately 0.6948 (in squared dollars).

__Root Mean Squared Error (RMSE):__ The square root of the MSE, which brings the error back to the units of the target. Here it is approximately 0.8336 dollars.

__R-squared (R2):__ Also known as the coefficient of determination, R-squared measures the proportion of variance in the target variable that is explained by the model. Here, approximately 44.41% of the variance in the test-set tip amounts is explained by the model.
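The four metrics above can be reproduced with a few lines of NumPy. The sketch below uses small hypothetical `actual` and `predicted` arrays (made up for illustration, not taken from the test set) and checks the hand-computed values against scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted tips, for illustration only
actual = np.array([3.00, 2.00, 4.50, 1.50, 3.50])
predicted = np.array([2.80, 2.40, 3.90, 2.00, 3.30])

errors = actual - predicted
mae_manual = np.mean(np.abs(errors))   # average absolute error
mse_manual = np.mean(errors ** 2)      # average squared error
rmse_manual = np.sqrt(mse_manual)      # back in the units of the target
# R2 = 1 - (residual sum of squares / total sum of squares)
r2_manual = 1 - np.sum(errors ** 2) / np.sum((actual - actual.mean()) ** 2)

# Compare each hand-computed metric with scikit-learn's implementation
print(np.isclose(mae_manual, mean_absolute_error(actual, predicted)))
print(np.isclose(mse_manual, mean_squared_error(actual, predicted)))
print(np.isclose(r2_manual, r2_score(actual, predicted)))
```

Applied to the notebook's own result, the same relationship holds: the square root of the reported MSE (0.6948) is the reported RMSE (0.8336).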

## Kindly Upvote

Your feedback means a lot to me and motivates me to continue working hard.