# Module 2, Activity 5: Visualising 2-Dimensional Continuous Data

For visualising potential relationships between two continuous variables, the scatter plot is the first place to start. In Python, scatter plots are available through almost all visualisation libraries, from base Matplotlib to Seaborn to pandas' `DataFrame.plot`. Here, we're going to use Matplotlib.

---

## Step 1: Import Necessary Libraries

In [None]:
# import packages/libraries
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
import pandas as pd
import numpy as np
import datetime as dt
from datetime import datetime

# load (and optionally format) dataset
df = pd.read_csv("data/taxis.csv")  # Reads the CSV file and stores it in 'df'

# Convert the 'pickup' and 'dropoff' columns to proper date/time format
df["pickup"] = pd.to_datetime(df["pickup"], dayfirst=True)
df["dropoff"] = pd.to_datetime(df["dropoff"], dayfirst=True)

Let's have a look at what is happening here.

* `pd.read_csv("data/taxis.csv")` – Reads the dataset file and saves it in a variable called `df` (short for DataFrame, which is a table of data).
* `pd.to_datetime(df["pickup"], dayfirst=True)` – Converts the pickup time from text into a date/time format that pandas understands. `dayfirst=True` tells pandas to read ambiguous dates as day/month/year.
* `pd.to_datetime(df["dropoff"], dayfirst=True)` – Converts the dropoff time in the same way.
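
To see why the `dayfirst` argument matters, here is a minimal sketch on two made-up timestamp strings (they are illustrative only and are not taken from `taxis.csv`). The same text parses to different months depending on the setting.

```python
import pandas as pd

# Hypothetical, ambiguous timestamp strings (not from the taxi dataset)
raw = pd.Series(["03/04/2019 08:15:00", "05/04/2019 17:45:00"])

# dayfirst=True reads these as 3 April and 5 April
day_first = pd.to_datetime(raw, dayfirst=True)

# dayfirst=False reads the same text as 4 March and 4 May
month_first = pd.to_datetime(raw, dayfirst=False)

print(day_first.dt.month.tolist())    # [4, 4]
print(month_first.dt.month.tolist())  # [3, 5]
```

If the dates in your own file are already unambiguous (for example, year-month-day), this setting should make no difference to the result.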

---

## Step 2: Calculate Trip Duration
Now, we will calculate how long each taxi trip took.

In [None]:
# Calculate trip duration by subtracting pickup time from dropoff time
duration = df["dropoff"] - df["pickup"]

# Convert duration to minutes and add it to the dataset
df["duration_min"] = duration.dt.total_seconds() / 60  
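
As a quick check of the arithmetic, the sketch below repeats the same two steps on a single made-up trip (the timestamps are assumptions for illustration). Subtracting two datetime columns gives a column of time differences, and `.dt.total_seconds()` turns those into plain numbers.

```python
import pandas as pd

# Hypothetical single trip, for illustration only
trip = pd.DataFrame({
    "pickup": pd.to_datetime(["2019-03-23 20:21:09"]),
    "dropoff": pd.to_datetime(["2019-03-23 20:27:24"]),
})

# Subtracting datetimes gives a Timedelta (a length of time)
delta = trip["dropoff"] - trip["pickup"]
print(delta.iloc[0])  # 0 days 00:06:15

# Convert to seconds, then to minutes
minutes = delta.dt.total_seconds() / 60
print(minutes.iloc[0])  # 6.25
```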

---
## Step 3: View the Updated Dataset
If we look at our dataset, we'll see we've added a new variable at the end called "duration_min".

In [None]:
df  # Display the dataset

---
## Step 4: Creating a Simple Scatter Plot

Now, let's begin plotting with something easy. Let's explore the relationship between the distance travelled and the fare total.

In [None]:
# Create a figure and axis for the plot
fig, ax = plt.subplots()

# Create a scatter plot
ax.scatter(df.distance, df.total)  # Each point represents a trip

# Label the axes
ax.set_xlabel("Distance (km)")  # X-axis label with units
ax.set_ylabel("Total Fare ($)")  # Y-axis label with units

# Add a title to the plot
ax.set_title("Total Fare vs Distance")

# Show the plot
plt.show()


Expected Outcome: A scatter plot showing a trend where longer trips generally result in higher fares.

### Exercise:
Edit the axis labels in the figure above to include the units for Distance (km) and Total Fare ($).

---

## Step 5: Distinguishing Payment Methods

Sometimes, we want to see if there are patterns based on payment method. Let’s color-code the scatter plot to differentiate trips paid with cash vs credit card.

In [None]:
# Split the data into two groups based on payment type
df_card = df[df.payment == "credit card"]  # Filter trips paid by credit card
df_cash = df[df.payment == "cash"]  # Filter trips paid by cash

# Create a new scatter plot with different colors for each payment type
fig, ax = plt.subplots()

# Plot credit card payments in red
ax.scatter(df_card.distance, df_card.total, c="r", label="Credit Card")  

# Plot cash payments in blue
ax.scatter(df_cash.distance, df_cash.total, c="b", label="Cash")  

# Add axis labels and a title
ax.set_xlabel("Distance (km)")  
ax.set_ylabel("Total Fare ($)")  
ax.set_title("Total Fare vs Distance by Payment Method")

# Add a legend to differentiate between payment types
ax.legend()

# Show the plot
plt.show()
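
Splitting the data by hand works well for two groups. With more categories, one possible alternative is to loop over the groups with `groupby`. The sketch below uses a small made-up table (`toy`) and a hypothetical `colours` lookup, so it is an illustration of the idea rather than a replacement for the code above.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the taxi data (values invented for illustration)
toy = pd.DataFrame({
    "distance": [1.2, 3.4, 0.8, 5.1, 2.2, 4.0],
    "total": [8.5, 17.0, 6.0, 24.3, 12.1, 19.8],
    "payment": ["credit card", "cash", "cash", "credit card", "cash", "credit card"],
})

# Assumed colour for each payment type
colours = {"credit card": "r", "cash": "b"}

fig, ax = plt.subplots()

# groupby yields one (name, sub-table) pair per payment type
for payment_type, group in toy.groupby("payment"):
    ax.scatter(group["distance"], group["total"],
               c=colours[payment_type], label=payment_type.title())

ax.set_xlabel("Distance (km)")
ax.set_ylabel("Total Fare ($)")
ax.legend()
plt.show()
```

Note that `groupby` skips rows where the grouping column is missing, so trips with no recorded payment type would not be plotted.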


---
### Exercises:

Exercise 1: Analyzing Trip Distance and Duration by Payment Type
- **Objective:** Create a scatter plot to explore the relationship between trip distance and duration. Use different colors or markers to represent various payment types (e.g., cash, credit card).

Exercise 2: Scatter Plot with Distinctive Markers
- **Objective:** Referencing the [Matplotlib marker style guide](https://matplotlib.org/stable/api/markers_api.html#module-matplotlib.markers), create a scatter plot to display the relationship between distance and fare. Use distinct markers to differentiate between two groups.

Exercise 3: Adjusting Legend Placement
- **Objective:** Adjust the position of the legend in your scatter plot. Use the [Matplotlib legend documentation](https://matplotlib.org/stable/tutorials/intermediate/legend_guide.html) to:
  - Move the legend to the top left corner of the plot.
  - Position the legend on the right-hand side outside the plot axes.
  - Add a descriptive title to the legend.

Exercise 4: Truncating Plot Axes
- **Objective:** Modify your scatter plot to limit the x-axis to a range from 0 to 5 and the y-axis from 30 to 175. Reference Matplotlib's [axis documentation](https://matplotlib.org/stable/api/axes_api.html) to achieve this.

Exercise 5: Explaining Anomalies in Data
- **Objective:** Discuss potential reasons for observing very high fares despite little or no distance traveled. Consider aspects such as minimum fare charges, waiting time, or data entry errors in your explanation.


**HINT:** Were these travellers prepared when their taxi arrived to collect them? We'll look more at unexpected values in a later Module.

---
We can see from our figure above that total fare increases with distance travelled in a roughly linear way. This isn't very surprising, but there's more to this data than a single figure can show. For example, what if we wanted to colour the points by time of day? We could subset our data into fares during peak traffic hours (7:30am to 9:30am and 4:30pm to 6:30pm) and off-peak hours. We might also want to see how the distance versus fare relationship differs by payment type, which we could show by plotting two axes on the same figure. Let's do this now.

First, we need to subset our data into peak and off-peak travel times. We'll use the pickup time variable for this.
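
Before working with the full dataset, it can help to see the peak/off-peak logic on a handful of rows. The sketch below uses four made-up pickup timestamps and builds the time limits with the `datetime.time` constructor, which is an assumption made here for brevity. A pickup counts as peak if it falls in the morning window OR the evening window.

```python
from datetime import time

import pandas as pd

# Hypothetical pickup timestamps, for illustration only
toy = pd.DataFrame({"pickup": pd.to_datetime([
    "2019-03-01 07:45:00",  # morning peak
    "2019-03-01 12:00:00",  # midday, off-peak
    "2019-03-01 17:10:00",  # evening peak
    "2019-03-01 23:30:00",  # late night, off-peak
])})

# Keep only the time of day (drop the date)
pickup_time = toy["pickup"].dt.time

# & means 'and', | means 'or'
morning = (pickup_time >= time(7, 30)) & (pickup_time <= time(9, 30))
evening = (pickup_time >= time(16, 30)) & (pickup_time <= time(18, 30))
toy["peak"] = morning | evening

print(toy["peak"].tolist())  # [True, False, True, False]
```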

In [None]:
# Create datetime start and finish times for the two peak periods.
# For more information about datetime.strptime,
# see here: https://www.geeksforgeeks.org/python-datetime-strptime-function/
peakstart1 = datetime.strptime("07" + "30" + "00", "%H%M%S").time() 
peakstart2 = datetime.strptime("16" + "30" + "00", "%H%M%S").time()

peakend1 = datetime.strptime("09" + "30" + "00", "%H%M%S").time() 
peakend2 = datetime.strptime("18" + "30" + "00", "%H%M%S").time() 

# Extract time of day from datetime pickup object (we don't want the date)
df["pickup_time"] = df["pickup"].dt.time 

# Create a boolean variable called peak, which is True when pickup time is within peak hours, and False when it isn't.
# The | symbol means 'or', so we're searching for times in peak time one OR peak time two.
df["peak"] = (((df.pickup_time >= peakstart1) & (df.pickup_time <= peakend1)) |  # Peak time one
              ((df.pickup_time >= peakstart2) & (df.pickup_time <= peakend2)))   # Peak time two

df_peak = df[df.peak]  # Subset of our dataset, with peak time pickups only
df_offpeak = df[~df.peak]  # Subset of our dataset, with off-peak pickups only (~ means 'not')

# Create scatter plot
fig, ax = plt.subplots()
ax.scatter(df_peak.distance, df_peak.total, c="r", label="Peak")  # Peak pickups in red
ax.scatter(df_offpeak.distance, df_offpeak.total, c="b", label="Off-Peak")  # Off-peak pickups in blue

ax.set_xlabel("Distance (km)")
ax.set_ylabel("Total Fare ($)")
ax.set_title("No Difference Between Peak and Off-Peak")
ax.legend()  # Uses the labels given to each scatter call

plt.show()


So, the relationship between distance travelled and total fare doesn't really differ between peak and off-peak times after all. However, the off-peak points obscure most of the peak points. How do we deal with this? One way is to create separate axes for the peak and off-peak scatter plots. This is easy with Pyplot: we just need to specify each plot's position in our figure.

In [None]:
fig, axs = plt.subplots(1,2) # A figure with one row and two columns of axes

# Our first plot, note Python indexes from 0
axs[0].scatter(df_peak.distance, df_peak.total, c="r")
axs[0].set_xlabel("Distance (km)")
axs[0].set_ylabel("Total Fare ($)")
axs[0].set_title("Peak")

# Our second plot
axs[1].scatter(df_offpeak.distance, df_offpeak.total, c="b")
axs[1].set_xlabel("Distance (km)")
axs[1].set_ylabel("Total Fare ($)")
axs[1].set_title("Off-Peak")

# Figure aesthetics
plt.suptitle('Peak and Off-Peak are Similar', fontsize=16) # Main title
plt.tight_layout() # Shortcut to arrange figures without overlap (try without this)
plt.show() # Show plot command
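
Separate axes are one way to deal with overlapping points. Another option, sketched below, is to keep a single axis and make the points partly transparent with the `alpha` argument of `scatter`. The data here are randomly generated stand-ins (not the taxi dataset), and the value 0.4 is just an assumed starting point to experiment with.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in data, generated only for this illustration
rng = np.random.default_rng(0)
peak_x = rng.uniform(0, 20, 200)
peak_y = 3 + 2.5 * peak_x + rng.normal(0, 2, 200)
off_x = rng.uniform(0, 20, 600)
off_y = 3 + 2.5 * off_x + rng.normal(0, 2, 600)

fig, ax = plt.subplots()

# alpha ranges from 0 (invisible) to 1 (solid); lower values reveal overlap
ax.scatter(peak_x, peak_y, c="r", alpha=0.4, label="Peak")
ax.scatter(off_x, off_y, c="b", alpha=0.4, label="Off-Peak")

ax.set_xlabel("Distance (km)")
ax.set_ylabel("Total Fare ($)")
ax.legend()
plt.show()
```

Transparency keeps both groups on the same scale, while separate axes make each group easier to read on its own, so the better choice depends on what you want to compare.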

---
### Exercises:
1. Changing the Layout of Subplots

   Instead of having one row with two columns, modify your subplot layout to have one column with two rows. Modify your existing code so that:

   * The plots are arranged vertically (one above the other) instead of side by side.
   * The `plt.subplots()` function uses `nrows=2, ncols=1` instead of the one row and two columns used above.

2. Exploring Fares During Day vs Night

   Now, let's analyze how fares vary during daytime (6:00 AM - 6:00 PM) vs nighttime (6:00 PM - 6:00 AM). Modify your dataset to:

   * Extract the pickup time for each trip.
   * Create two subsets:
       * Daytime trips (6:00 AM - 6:00 PM)
       * Nighttime trips (6:00 PM - 6:00 AM)

3. Creating a 2x2 Grid of Plots

   Now, let's visualize four plots in a 2-row, 2-column layout:

   * The first row will compare peak vs off-peak fares.
   * The second row will compare daytime vs nighttime fares.
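
As a hint for the grid layout (not a full solution), the sketch below shows how axes are indexed when there is more than one row and more than one column: `axs` becomes a two-dimensional array, addressed as `axs[row, col]`. The points are randomly generated placeholders, and the `titles` list is a hypothetical helper, so you will need to swap in your own subsets.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)  # Placeholder data only

fig, axs = plt.subplots(nrows=2, ncols=2, figsize=(8, 6))

# One title per position in the grid: titles[row][col]
titles = [["Peak", "Off-Peak"], ["Daytime", "Nighttime"]]

for row in range(2):
    for col in range(2):
        x = rng.uniform(0, 20, 50)
        y = 3 + 2.5 * x + rng.normal(0, 2, 50)
        axs[row, col].scatter(x, y)  # Note the two indexes: row, then column
        axs[row, col].set_title(titles[row][col])

fig.tight_layout()
plt.show()
```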



