# Introduction to Tabular Data: The Tips Dataset

In this notebook, we will explore a sample data set called **Tips**, available through seaborn. It records meals at a restaurant, including the bill, the tip, and details about each dining party.

## About the Libraries: pandas and seaborn

In Python, we use libraries to add extra functionality beyond the built-in features. Two important libraries for data analysis are:

- **pandas (imported as `pd`)**: A powerful library for data manipulation and analysis. It provides a DataFrame data structure (similar to a spreadsheet) that makes it easy to work with tabular data.

- **seaborn (imported as `sns`)**: A library for creating statistical graphics and visualisations. It also provides access to example data sets (like the Tips data set), which are ideal for learning and practice.

We use the aliases `pd` and `sns` to write code more concisely. For example, instead of writing `pandas.DataFrame()`, we use `pd.DataFrame()`.
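As a small illustration of the alias in use, the sketch below builds a DataFrame by hand with `pd.DataFrame()`. The variable name `sample` is only for illustration, and the three rows are copied from the first rows of the Tips data shown later in this notebook.

```python
import pandas as pd

# A tiny hand-made table; the values are copied from the first three Tips rows
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "tip": [1.01, 1.66, 3.50],
    "size": [2, 3, 3],
})

# Each dictionary key becomes a column; each list holds that column's values
print(sample)
print(type(sample))
```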

## About the Data: The Tips Dataset

The Tips dataset is a sample dataset available through seaborn that contains restaurant dining information, including:

- **Total bill**: The cost of the meal, in US dollars
- **Tip amount**: How much the customer tipped, in US dollars
- **Time**: Whether the meal was for lunch or dinner
- **Day**: Which day of the week the meal occurred
- **Size**: The number of people in the dining party
- **Sex**: The sex of the bill payer
- **Smoker**: Whether the party sat in the smoking or non-smoking section

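You do not need to download a file yourself: `sns.load_dataset()` fetches the dataset from seaborn's online example-data repository on first use (so an internet connection is required) and caches it locally.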

## Loading Data

We will start by importing the necessary libraries and loading the Tips data set.

In [None]:
import pandas as pd
import seaborn as sns

In [None]:
# Load the Tips data set
df = sns.load_dataset('tips')

## Inspecting the Data

Before analysing the data, it is important to inspect it. The `head()` and `tail()` methods show the first and last few rows, the `shape` attribute gives the number of rows and columns, and the `info()` and `describe()` methods summarise the structure and contents of the DataFrame.

In [None]:
# Print the first five rows
print(df.head())

# Print the last five rows
print(df.tail())

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
     total_bill   tip     sex smoker   day    time  size
239       29.03  5.92    Male     No   Sat  Dinner     3
240       27.18  2.00  Female    Yes   Sat  Dinner     2
241       22.67  2.00    Male    Yes   Sat  Dinner     2
242       17.82  1.75    Male     No   Sat  Dinner     2
243       18.78  3.00  Female     No  Thur  Dinner     2


In [None]:
# Print the shape and summary info of the DataFrame
print("Rows and columns:", df.shape)
print("Info")
# info() prints its report directly and returns None, so it is not wrapped in print()
df.info()

Rows and columns: (244, 7)
Info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB


In [None]:
# Get overview statistics of numerical columns with describe()
df.describe()

       total_bill         tip        size
count  244.000000  244.000000  244.000000
mean    19.785943    2.998279    2.569672
std      8.902412    1.383638    0.951100
min      3.070000    1.000000    1.000000
25%     13.347500    2.000000    2.000000
50%     17.795000    2.900000    2.000000
75%     24.127500    3.562500    3.000000
max     50.810000   10.000000    6.000000


## Understanding our data

Based on the output of `shape`, `info()` and `describe()`, the Tips dataset contains 244 rows (dining experiences) with 7 columns of different types:

**Numerical Columns:**
- `total_bill` and `tip` are stored as `float64`, which is appropriate as these monetary values can have decimal places
- `size` is stored as `int64`, which makes sense for whole numbers representing the count of diners
- We can see the value ranges and averages, along with other information about the spread of our numerical data

**Categorical Columns:**
- Four columns are stored as `category` type: `sex`, `smoker`, `day`, and `time`
- Using the category type is memory-efficient for columns with a limited set of possible values

All columns have 244 non-null values, indicating there are no missing values in the dataset, which is helpful for our analysis as we won't need to handle missing data.
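The two observations above (no missing values, and the memory saving from the `category` type) can be checked directly. The sketch below is a minimal illustration on made-up data rather than the Tips data itself: the `demo` table and the repeated list of day names are assumptions chosen so that the example runs without downloading anything.

```python
import pandas as pd

# Hypothetical mini-table in which one tip is missing (None becomes NaN)
demo = pd.DataFrame({
    "tip": [1.01, None, 3.50, 3.31],
    "day": ["Sun", "Sun", "Sat", "Sat"],
})

# isna() marks missing cells; sum() then counts them per column
missing_per_column = demo.isna().sum()
print(missing_per_column)

# Compare the memory used by plain text values and by the category dtype
days_as_text = pd.Series(["Thur", "Fri", "Sat", "Sun"] * 1000)
days_as_category = days_as_text.astype("category")
print("text bytes:    ", days_as_text.memory_usage(deep=True))
print("category bytes:", days_as_category.memory_usage(deep=True))
```

Running `df.isna().sum()` on the Tips DataFrame in the same way should report zero for every column, in line with the `info()` output above.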

## Accessing Data

You can access specific parts of the DataFrame using square bracket notation or the `iloc` indexer, which selects rows and columns by integer position (counting from 0). For example:

In [None]:
# Access a single column (e.g. total_bill)
print(df['total_bill'])

# Access multiple columns
print(df[['total_bill', 'tip']])

# Access rows where the day is 'Fri'
print(df[df['day'] == 'Fri'])

# Access the first row
print(df.iloc[0])

# Access the fifth row and third column
print(df.iloc[4, 2])

0      16.99
1      10.34
2      21.01
3      23.68
4      24.59
       ...  
239    29.03
240    27.18
241    22.67
242    17.82
243    18.78
Name: total_bill, Length: 244, dtype: float64
     total_bill   tip
0         16.99  1.01
1         10.34  1.66
2         21.01  3.50
3         23.68  3.31
4         24.59  3.61
..          ...   ...
239       29.03  5.92
240       27.18  2.00
241       22.67  2.00
242       17.82  1.75
243       18.78  3.00

[244 rows x 2 columns]
     total_bill   tip     sex smoker  day    time  size
90        28.97  3.00    Male    Yes  Fri  Dinner     2
91        22.49  3.50    Male     No  Fri  Dinner     2
92         5.75  1.00  Female    Yes  Fri  Dinner     2
93        16.32  4.30  Female    Yes  Fri  Dinner     2
94        22.75  3.25  Female     No  Fri  Dinner     2
95        40.17  4.73    Male    Yes  Fri  Dinner     4
96        27.28  4.00    Male    Yes  Fri  Dinner     2
97        12.03  1.50    Male    Yes  Fri  Dinner     2
98        21.01  3.
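
To make the row filter above less mysterious, the following sketch splits it into two steps on a five-row sample copied from the `head()` output. The names `sample` and `mask` are illustrative only, and the filter uses `size` rather than `day` so that the small sample contains both matching and non-matching rows.

```python
import pandas as pd

# Five rows copied from the head() output earlier in the notebook
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "sex": ["Female", "Male", "Male", "Male", "Female"],
    "size": [2, 3, 3, 2, 4],
})

# Step 1: the comparison produces one True/False value per row
mask = sample["size"] == 3
print(mask)

# Step 2: indexing with the mask keeps only the rows marked True
print(sample[mask])

# iloc is purely positional: row 4 is the fifth row, column 2 is the third column
print(sample.iloc[4, 2])
```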

## Basic Operations

With pandas, it is easy to perform basic operations on your data. For example, you can compute the mean of a column or find the maximum value in another column. Other useful methods include `min()`, `sum()`, and `std()` (standard deviation).

In [None]:
# Calculate the mean of 'total_bill'
print(df['total_bill'].mean())

# Find the maximum tip
print(df['tip'].max())

# Find the unique values
print(df['day'].unique())

19.78594262295082
10.0
['Sun', 'Sat', 'Thur', 'Fri']
Categories (4, object): ['Thur', 'Fri', 'Sat', 'Sun']
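
The other methods mentioned above work the same way. Here is a minimal sketch on a five-value Series copied from the first rows of the `tip` column; the variable name `tips` is only illustrative. Note that pandas computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Tip values copied from the first five rows of the Tips data
tips = pd.Series([1.01, 1.66, 3.50, 3.31, 3.61], name="tip")

# Each method reduces the whole column to a single number
print("Smallest tip:", tips.min())
print("Sum of tips:", round(tips.sum(), 2))
print("Standard deviation:", round(tips.std(), 2))
```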


### Exercise 1: Average spend

Compute and print the following information:

* Total revenue generated (sum of `total_bill`).
* Total number of customers (sum of `size`).
* Average revenue per customer.

In [None]:
# Calculate total revenue
total_revenue = df['total_bill'].sum()

# Calculate total number of customers
total_customers = df['size'].sum()

# Calculate average spend per customer
avg_spend = total_revenue / total_customers

# Print results with clear formatting
# The dataset records amounts in US dollars, so we use the $ symbol
print(f"Total revenue: ${total_revenue:.2f}")
print(f"Total customers: {total_customers}")
print(f"Average spend per customer: ${avg_spend:.2f}")

Total revenue: $4827.77
Total customers: 627
Average spend per customer: $7.70


### Exercise 2: Time Tips

Compare the average `tip` at Lunch and Dinner. Which meal time has the higher average tip? Use the cell below to show your work.

In [None]:
# Get all lunch tips and dinner tips separately
lunch = df[df['time'] == 'Lunch']
dinner = df[df['time'] == 'Dinner']

# Calculate averages using mean()
lunch_average = lunch['tip'].mean()
dinner_average = dinner['tip'].mean()

# Print results with clear formatting
print(f"Average lunch tip: ${lunch_average:.2f}")
print(f"Average dinner tip: ${dinner_average:.2f}")

Average lunch tip: $2.73
Average dinner tip: $3.10


## Adding and Modifying Columns

Sometimes you might want to derive new insights by creating additional features. For example, you can create a new column that represents the tip as a percentage of the total bill. This new column, `tip_pct`, allows you to compare tips relative to the size of the bill for each observation in the data.

In [None]:
# Create a new column 'tip_pct' representing the tip percentage
df['tip_pct'] = (df['tip'] / df['total_bill']) * 100

# Display the first few rows to verify the new column
print(df.head())

   total_bill   tip     sex smoker  day    time  size    tip_pct
0       16.99  1.01  Female     No  Sun  Dinner     2   5.944673
1       10.34  1.66    Male     No  Sun  Dinner     3  16.054159
2       21.01  3.50    Male     No  Sun  Dinner     3  16.658734
3       23.68  3.31    Male     No  Sun  Dinner     2  13.978041
4       24.59  3.61  Female     No  Sun  Dinner     4  14.680765
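
Modifying a column uses the same assignment syntax as adding one: assigning to a name that already exists overwrites that column. The sketch below is one possible illustration, rounding `tip_pct` to one decimal place on a three-row sample copied from the output above; the name `sample` and the choice of rounding are assumptions made for the example.

```python
import pandas as pd

# Three rows copied from the head() output above
sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "tip": [1.01, 1.66, 3.50],
})

# Add the new column, as in the cell above
sample["tip_pct"] = (sample["tip"] / sample["total_bill"]) * 100

# Modify it by assigning back to the same column name
sample["tip_pct"] = sample["tip_pct"].round(1)
print(sample)
```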


## Grouping Data

Grouping is a powerful feature in pandas that allows you to split your data into subsets based on a key, and then apply an aggregation function to each group. Here, we group the data by the `time` column to see how values differ between lunch and dinner.

Grouping is most often used with categorical data. To indicate that we are only interested in categories that actually appear in our data, we pass `observed=True` as an additional argument.

In [None]:
# Group the data by 'time' and calculate the average total_bill, tip, and tip_pct
grouped_time = df.groupby('time', observed=True)[['total_bill', 'tip', 'tip_pct']].mean()
print(grouped_time)

        total_bill       tip    tip_pct
time                                   
Lunch    17.168676  2.728088  16.412793
Dinner   20.797159  3.102670  15.951779
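
The effect of the `observed` argument only becomes visible when a category is declared but never occurs. The sketch below is a minimal illustration under that assumption: the three rows are copied from the first rows of the Tips data (all dinners), and the `time` column is built by hand as a categorical that allows both `Lunch` and `Dinner`.

```python
import pandas as pd

# 'time' allows two categories, but only 'Dinner' occurs in these rows
sample = pd.DataFrame({
    "time": pd.Categorical(["Dinner", "Dinner", "Dinner"],
                           categories=["Lunch", "Dinner"]),
    "tip": [1.01, 1.66, 3.50],
})

# observed=True: only categories present in the data get a row
only_observed = sample.groupby("time", observed=True)["tip"].mean()
print(only_observed)

# observed=False: every declared category gets a row; the empty one has a NaN mean
all_categories = sample.groupby("time", observed=False)["tip"].mean()
print(all_categories)
```

In the full Tips data both meal times occur, so the two settings would give the same table; the difference matters mainly after filtering rows.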


### Exercise 3: Smoking spend per head

Calculate the average spend per head on each row of the dataset. Then compare the average spend per head for smokers and non-smokers.

Use the cell below to show your work.

In [None]:
# Create new column for spend per head
df['spend_per_head'] = df['total_bill'] / df['size']

# Group by smoker and calculate average spend per head
avg_spend_by_smoker = df.groupby('smoker', observed=True)['spend_per_head'].mean()

# Print results with clear formatting
print("Average spend per head by smoking status:")
print(f"Non-smokers: ${avg_spend_by_smoker['No']:.2f}")
print(f"Smokers: ${avg_spend_by_smoker['Yes']:.2f}")

Average spend per head by smoking status:
Non-smokers: $7.36
Smokers: $8.74
