# Module 2, Activity 4: Visualising Distributions for 1-Dimensional Continuous Data

Barplots are useful if we only have one data point per category (e.g., counts). But what do we do when we have multiple observations in each category? We need to create visualisations to explore the distribution of the data. That's what we'll be doing here using Seaborn.

---

## Setup

Before creating any plots, we need to:
- Import necessary libraries
- Load the taxis dataset 🚖

In [None]:
# import packages/libraries
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
import pandas as pd
import numpy as np
import datetime as dt

# load (and optionally format) dataset
df = pd.read_csv("data/taxis.csv")
df["pickup"] = pd.to_datetime(df["pickup"], dayfirst=True)
df["dropoff"] = pd.to_datetime(df["dropoff"], dayfirst=True)

---
# Histograms and Density Plots with Seaborn

In this section, we'll create **histograms and density plots** to visualise distributions in our dataset.

* **Histogram (Hist)** → Shows how many data values fall into each bin (the frequency)
* **Kernel Density Estimation (KDE)** → Estimates a smooth probability density curve from the data

Seaborn provides an easy way to create both using `displot()`.
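
To make the histogram bullet concrete, here is a minimal sketch of the counting that sits behind a histogram, using NumPy's `np.histogram`. The fares are made-up values for illustration and are not taken from the taxis dataset.

```python
import numpy as np

# Hypothetical fares, for illustration only (not from the taxis dataset)
fares = np.array([4.5, 6.0, 7.5, 8.0, 12.0, 13.5, 18.0, 27.0])

# Split the range 0-30 into three equal-width bins and count the values in each
counts, edges = np.histogram(fares, bins=3, range=(0, 30))

print(counts)  # [4 3 1]
print(edges)   # [ 0. 10. 20. 30.]
```

Each bar in a histogram has a height equal to one of these counts and a width given by two neighbouring edges.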

---

## Step 1: Creating a Histogram

We can use `displot()` with `kind="hist"` to create a histogram of taxi fares.

In [None]:
sns.displot(data=df, x="total", kind="hist")  # histogram of taxi ride prices

plt.show()

---
## Step 2: Customizing the Histogram
We can change the number of bins in two ways: by modifying the width of the bins, or by specifying the total number of bins.

1. Modify the bin width

In [None]:
sns.displot(data=df, x="total", kind="hist", binwidth=5)  # change the bin width

plt.show()

2. Specify the total number of bins

In [None]:
sns.displot(data=df, x="total", kind="hist", bins=10)  # change the number of bins

plt.show()
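
The two options describe the same thing from different directions: fixing the bin width determines how many bins are needed, and fixing the number of bins determines how wide each one is. The sketch below shows that arithmetic with NumPy on made-up values. It is only an illustration of the relationship, and the exact bin edges Seaborn chooses may differ slightly.

```python
import numpy as np

# Hypothetical fares spanning 0 to 50, for illustration only
fares = np.array([0.0, 3.0, 8.0, 12.0, 19.0, 25.0, 31.0, 44.0, 50.0])
data_range = fares.max() - fares.min()

# Option 1: fix the bin width; the number of bins follows from the data range
binwidth = 5
n_bins_from_width = int(np.ceil(data_range / binwidth))

# Option 2: fix the number of bins; the bin width follows from the data range
n_bins = 10
width_from_bins = data_range / n_bins

print(n_bins_from_width)  # 10
print(width_from_bins)    # 5.0
```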

---
## Step 3: Creating a Density Plot (KDE)

Now for density plots. We can also create density plots with the **displot** call, but we need to specify that the kind of distribution we want to visualise is a kernel density estimate (`kind="kde"`).

**Note:** If you've never heard of a kernel density estimate before, it's an estimate of the probability density function of your data.

In [None]:
sns.displot(data=df, x="total", kind='kde')

plt.show()
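
For intuition, a kernel density estimate can be sketched by hand: place one small Gaussian bump on every observation and average them. The function name `gaussian_kde_sketch`, the data, and the fixed bandwidth of 3.0 are all assumptions made for this illustration. Seaborn chooses the bandwidth automatically, so this is not its exact computation.

```python
import numpy as np

def gaussian_kde_sketch(data, grid, bandwidth):
    """Average one Gaussian bump per observation, evaluated on `grid`."""
    # Standardised distance from every grid point to every observation
    z = (grid[:, None] - data[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Hypothetical fares, for illustration only
fares = np.array([5.0, 6.0, 7.0, 12.0, 20.0])
grid = np.linspace(-20, 50, 701)
density = gaussian_kde_sketch(fares, grid, bandwidth=3.0)

# A density curve encloses a total area of approximately 1
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

A larger bandwidth gives wider bumps and a smoother curve, and a smaller bandwidth gives a bumpier one.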

---
## Step 4: Overlaying Multiple Density Plots

What if we want to overlay multiple density plots? We can distinguish them by colour using the `hue` parameter. For example, the figure below shows total fare by payment type.

In [None]:
sns.displot(data=df, x="total", hue="payment", kind="kde")

plt.show()

---
## Step 5: Combining Histogram and KDE

We can overlay a density plot on a histogram by setting `kde=True`. This can be helpful, since the shape of the density curve does not depend on the number of histogram bins.

Play around with the `bins` parameter in the **displot** call below to see this.

In [None]:
sns.displot(data=df, x="total", kind="hist", kde=True, bins=10)  # overlay density plot on histogram
plt.show()

---
## Step 6: Visualising a Subset of Data

Sometimes we want to focus on only a subset of the data, which means we need to do some Pandas subsetting before we can visualise. 

Let's filter the dataset to include only trips where the pickup and drop-off happened within the same borough.

In [None]:
# Subset data where pickup and drop-off boroughs match
df1 = df[df["pickup_borough"] == df["dropoff_borough"]]

# Density plot for the subset
sns.displot(data=df1, x="total", kind="kde")

plt.show()
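
The subsetting line works through a boolean mask: the comparison produces one `True`/`False` value per row, and indexing with it keeps only the `True` rows. Here is a minimal sketch on a tiny made-up DataFrame named `trips` (the rows are hypothetical, not from the taxis dataset). It also shows that a row with missing boroughs is excluded, because a missing value never compares equal to another missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical trips, for illustration only (not rows from the taxis dataset)
trips = pd.DataFrame({
    "pickup_borough": ["Manhattan", "Queens", "Brooklyn", np.nan],
    "dropoff_borough": ["Manhattan", "Manhattan", "Brooklyn", np.nan],
    "total": [12.5, 40.0, 9.8, 15.0],
})

# Element-wise comparison gives one True/False value per row
same_borough = trips["pickup_borough"] == trips["dropoff_borough"]
print(same_borough.tolist())  # [True, False, True, False]

# Indexing with the mask keeps only the rows marked True
print(trips[same_borough]["total"].tolist())  # [12.5, 9.8]
```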


---
### Exercise: Overlaying Histograms and Density Plots with Seaborn

In this exercise, we’ll explore **density plots for different trip conditions** and overlay multiple histograms with density curves.

**Task:**  
- Plot the **density of total fares** where the **pickup borough is different** from the **dropoff borough**.  
- Compare it with the density plot for trips where the **pickup and drop-off borough are the same**.

---
## Step 7: Overlaying Multiple Histograms with KDE

Finally, what if we wanted to overlay multiple histograms, with their density plots as well? 

There are a few ways to do this. The easiest is a single call to Seaborn's `histplot()`. To show this, let's plot the distribution of total fares, split by payment type, with their density estimates included as well:

In [None]:
sns.histplot(data=df, x="total", hue="payment", kde=True)

plt.show()

What’s happening here?

* `hue="payment"` → Separates the histogram by payment type.
* kde=True → Adds a density curve to each histogram.

---
## Step 8: Customizing the Plot with Matplotlib (Pyplot)

Since Seaborn is built on top of Matplotlib, we can use Matplotlib directly to modify Seaborn figures. We can do this using either Matplotlib's object-oriented interface or Pyplot.

Pyplot comes with Matplotlib - we imported it earlier as `plt`. Pyplot is Matplotlib's function-based interface: it keeps track of the current figure and axes for you, which makes it a bit simpler to use than the object-oriented approach of base Matplotlib.

In [None]:
sns.histplot(data=df, x="total", hue="payment", kde=True)

# Add an x-axis label using Pyplot
plt.xlabel("Total Fare")

# Show the plot
plt.show()


---
### Exercise:
1) Write a comment for each line of the code in the cell above, clearly explaining what each line is doing.
2) Read the section on **Coding styles** in the following document: [The object-oriented and the pyplot interfaces](https://matplotlib.org/stable/tutorials/introductory/usage.html). **Take note** of the differences between Pyplot and base Matplotlib.

When we use Matplotlib, we'll mainly use the object-oriented style, but sometimes we will use Pyplot too.

---
# Boxplots and Violin Plots

Boxplots and violin plots are designed to summarise continuous data grouped by different categories. They help us understand the **distribution, spread, and outliers** in a dataset.


We can use Seaborn's **catplot** function to quickly generate boxplots and violin plots (documentation [here](https://seaborn.pydata.org/generated/seaborn.catplot.html)).

---

## Step 1: Creating a Boxplot

Let's plot the distribution of total fares by **pickup borough**.

In [None]:
# Boxplot of total fares by pickup borough
sns.catplot(data=df, kind="box", x="pickup_borough", y="total")

plt.show()
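
To read a boxplot, it helps to know the numbers behind the box: its lower edge is the first quartile, the line inside is the median, and its upper edge is the third quartile. The sketch below computes these with NumPy for a handful of made-up fares (illustrative values, not from the taxis dataset).

```python
import numpy as np

# Hypothetical fares for a single borough, for illustration only
fares = np.array([5.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 14.0, 40.0])

# The box spans the first to the third quartile, with a line at the median
q1, median, q3 = np.percentile(fares, [25, 50, 75])

# The interquartile range (IQR) is the height of the box
iqr = q3 - q1

print(q1, median, q3, iqr)  # 8.0 10.0 12.0 4.0
```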

---
## Step 2: Creating a Violin Plot
Violin plots have a very similar syntax:

In [None]:
# Violin plot of total fares by pickup borough
sns.catplot(data=df, kind="violin", x="pickup_borough", y="total")

plt.show()
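
A violin is a density estimate drawn sideways, and by default it extends a little beyond the smallest and largest observations. For a quantity like a fare, that can make the violin dip below zero. One option is to pass `cut=0`, which limits each violin to the range of the observed data. The sketch below uses a synthetic DataFrame `demo` (invented for illustration) as a stand-in for the taxis data.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for the taxis data (assumption, for illustration only)
rng = np.random.default_rng(4)
demo = pd.DataFrame({
    "total": np.concatenate([rng.gamma(4, 5, 200), rng.gamma(6, 5, 200)]),
    "pickup_borough": ["Manhattan"] * 200 + ["Queens"] * 200,
})

# cut=0 stops each violin at the minimum and maximum observed value
g = sns.catplot(data=demo, kind="violin", x="pickup_borough", y="total", cut=0)

plt.show()
```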

---
## Step 3: Grouping by Payment Type
We can show categories within different groups. For example, let's look at the total fares by payment type, for each pickup borough, with a boxplot.

In [None]:
sns.catplot(data=df, x="pickup_borough", y="total", hue="payment", kind="box")

plt.show()

What’s happening here?

* `hue="payment"` → Groups the fares by payment type.
* One boxplot per payment type now appears for each pickup borough.
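
The same grouping can be checked numerically with a Pandas `groupby`. As a minimal sketch on made-up rows (the DataFrame `trips` is hypothetical, not the taxis dataset), the code below computes one median per borough and payment type, which corresponds to the centre line of each box.

```python
import pandas as pd

# Hypothetical trips, for illustration only (not rows from the taxis dataset)
trips = pd.DataFrame({
    "pickup_borough": ["Manhattan"] * 5 + ["Queens"] * 3,
    "payment": ["cash", "cash", "cash", "credit card", "credit card",
                "cash", "credit card", "credit card"],
    "total": [8.0, 10.0, 12.0, 12.0, 16.0, 20.0, 30.0, 40.0],
})

# One median per (borough, payment type) pair: the centre line of each box
medians = trips.groupby(["pickup_borough", "payment"])["total"].median()
print(medians)
```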

---
## Step 4: Changing the Orientation

It's quite hard to compare these individual boxplots because they are all squished together, so let's try some variations. For example, let's swap the axes to change the orientation of the figure.

In [None]:
sns.catplot(data=df, kind="box", y="pickup_borough", x="total", hue="payment")
plt.show()

This flips the x and y axes!

* Now pickup_borough is on the y-axis
* Total fares appear horizontally


---
## Step 5: Removing Outliers

Next, let's remove the outliers from the figure by setting `showfliers=False`:

In [None]:
sns.catplot(data=df, kind="box", y="pickup_borough", x="total", hue="payment", orient="h", showfliers=False)
plt.show()

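Why remove outliers? Hiding them lets us focus on the main body of the distribution, because a few extreme values no longer stretch the axis. Note that `showfliers=False` only hides the outlier points in the figure; the data itself is unchanged.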
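
Which points count as outliers? With the default settings, a boxplot marks any value more than 1.5 times the interquartile range beyond the edges of the box. The sketch below applies that rule with NumPy to made-up fares (illustrative values, not from the taxis dataset) and confirms that flagging outliers does not delete anything.

```python
import numpy as np

# Hypothetical fares for a single borough, for illustration only
fares = np.array([5.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 14.0, 40.0])

q1, q3 = np.percentile(fares, [25, 75])
iqr = q3 - q1

# Values beyond these fences are drawn as individual outlier points
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

is_outlier = (fares < lower_fence) | (fares > upper_fence)
print(fares[is_outlier])  # [40.]
print(len(fares))         # 9, the original data is untouched
```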

---
### Exercise:

How is the number of passengers related to the distance of a trip? To answer this question, we've generated the figure below:

In [None]:
sns.catplot(x="passengers", y="distance", kind="bar", data=df)
plt.show()

In a few sentences, explain why this is not the best visualisation to address this question. Then create an alternative figure that is better suited to the question, using what you have learned so far.