# Module 2, Activity 6: Time Series Plots

A time series is a sequence of discrete observations ordered in time. The simplest way to visualise one is with a line chart, where the x-axis is time and the y-axis is the variable being measured over time. We'll use the taxis dataset for these exercises.

---

## Step 1: Import Necessary Libraries

In [None]:
# import packages/libraries
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib
import pandas as pd
import numpy as np
import datetime as dt
from datetime import datetime

# load (and optionally format) dataset
df = pd.read_csv("data/taxis.csv")
df["pickup"] = pd.to_datetime(df["pickup"], dayfirst=True)
df["dropoff"] = pd.to_datetime(df["dropoff"], dayfirst=True)

# Add a new column, giving the duration of each trip in minutes
duration = df["dropoff"]-df["pickup"] #calculating trip duration as timedelta datatype
df["duration_min"] = duration.dt.total_seconds()/60 #converting to minutes and adding to df
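
The duration calculation above can be checked on a tiny example. This is a minimal sketch using a hypothetical two-row table (`toy`, with made-up timestamps), not the taxis data itself:

```python
import pandas as pd

# Hypothetical toy trips (made-up values, not taken from taxis.csv)
toy = pd.DataFrame({
    "pickup": pd.to_datetime(["2019-03-01 08:00:00", "2019-03-01 09:15:00"]),
    "dropoff": pd.to_datetime(["2019-03-01 08:12:30", "2019-03-01 10:00:00"]),
})

# Subtracting two datetime columns gives a timedelta column
toy_duration = toy["dropoff"] - toy["pickup"]

# .dt.total_seconds() converts each timedelta to seconds; divide by 60 for minutes
toy["duration_min"] = toy_duration.dt.total_seconds() / 60

print(toy["duration_min"].tolist())  # [12.5, 45.0]
```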

---

## Step 2: Aggregating Data Using `groupby()`
Since our dataset contains many individual trips, let's group the data by pickup day and borough to make it more manageable.

See the **groupby** [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html) for more information.

In [None]:
# Group data by pickup day and borough
df_gb = df.groupby(by=[pd.Grouper(key="pickup", freq="D"), "pickup_borough"])

# Sum only numerical columns
df_gb_sum = df_gb.sum(numeric_only=True).reset_index()

# Display the grouped dataset
df_gb_sum
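
To see what this kind of grouping produces, here is a minimal sketch on a hypothetical four-row table (`toy`, with made-up fares). It assumes the same column names as the taxis data but is not part of the original activity:

```python
import pandas as pd

# Hypothetical toy trips (made-up values)
toy = pd.DataFrame({
    "pickup": pd.to_datetime([
        "2019-03-01 08:00", "2019-03-01 17:30",
        "2019-03-01 09:00", "2019-03-02 10:00",
    ]),
    "pickup_borough": ["Queens", "Queens", "Bronx", "Queens"],
    "fare": [10.0, 5.0, 7.0, 20.0],
})

# pd.Grouper(freq="D") bins the pickup timestamps into calendar days
toy_gb = toy.groupby(by=[pd.Grouper(key="pickup", freq="D"), "pickup_borough"])

# One row per (day, borough) combination that occurs in the data
toy_sum = toy_gb.sum(numeric_only=True).reset_index()
print(toy_sum)
```

The two Queens trips on 1 March collapse into a single row with their fares added together, while the Bronx trip and the 2 March trip each keep a row of their own.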


---
## Exercise:
1. Explain each line of code in the grouping step above.
    * Use comments (#) to describe what each line does.
2. Create a new grouped dataset that calculates the average (mean) values instead of the sum.
    * Name this dataset `df_gb_ave`.
    * Use `.mean()` instead of `.sum()` in your code.

Now let's start to plot. We'll use the Matplotlib command **plot**, but we could use other libraries like Seaborn. 

---
### Visualizing the Data
Now that we have summarised the data, let’s start plotting graphs using Matplotlib.

Before we move on:

* We'll use Matplotlib’s plot() function for basic graphs.
* Later, we’ll try using Seaborn, which allows for easier, more stylish plots.
---
## Exercise:
* Try plotting a simple line graph using Matplotlib.
* Challenge: Convert the same graph to a Seaborn plot. 

Use [this page](http://seaborn.pydata.org/generated/seaborn.lineplot.html) as a starting reference.

In [None]:
fig, ax = plt.subplots()  # Create a figure and axis
ax.plot(df_gb_sum.pickup, df_gb_sum.fare)  # Create a line plot

# Add labels to the axes
ax.set_xlabel("Date")  
ax.set_ylabel("Total Fares")  

plt.show()  # Show the plot


What's happening?

* We grouped by both date and pickup borough, so each date appears several times in `df_gb_sum`, once per borough.
* Matplotlib connects every row in order with one single line, so the line jumps between boroughs on the same date and the trend for any one borough is hard to read.

We need to subset our data first, then plot each line separately.
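
The repeated dates can be checked directly. The sketch below uses a hypothetical grouped table (`toy_sum`, made-up values) shaped like `df_gb_sum`:

```python
import pandas as pd

# Hypothetical grouped data: one row per (day, borough), made-up fares
toy_sum = pd.DataFrame({
    "pickup": pd.to_datetime(["2019-03-01", "2019-03-01", "2019-03-02", "2019-03-02"]),
    "pickup_borough": ["Bronx", "Queens", "Bronx", "Queens"],
    "fare": [7.0, 15.0, 9.0, 20.0],
})

# Each date appears once per borough, so the x-values repeat
rows_per_date = toy_sum.groupby("pickup").size()
print(rows_per_date)

# After filtering to one borough, each date appears only once
queens = toy_sum[toy_sum["pickup_borough"] == "Queens"]
print(queens["pickup"].is_unique)  # True
```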

On a related note, let’s see what happens if we visualise something similar using Seaborn.



In [None]:
sns.lineplot(data=df_gb_sum,y='fare',x='pickup')
# You'll notice it's all squished - we can fix that using seaborn's set_theme function, but we'll come back to that later

Seaborn aggregates the rows that share a date: the solid line is the mean of the boroughs' daily fare totals, and the shaded band is, by default, a 95% confidence interval around that mean. Clever! However, the trade-off is that we have less direct control over our figure.

---

### Back to Matplotlib

Study the following code. Is there anything new we haven't looked at yet?

In [None]:
# List of borough names
label_names = ["Queens", "Bronx", "Brooklyn", "Manhattan"]

# Selecting data for each borough
df_gb_sum_0 = df_gb_sum[df_gb_sum.pickup_borough == label_names[0]]  # Queens
df_gb_sum_1 = df_gb_sum[df_gb_sum.pickup_borough == label_names[1]]  # Bronx
df_gb_sum_2 = df_gb_sum[df_gb_sum.pickup_borough == label_names[2]]  # Brooklyn
df_gb_sum_3 = df_gb_sum[df_gb_sum.pickup_borough == label_names[3]]  # Manhattan

# Create a figure and axis
fig, ax = plt.subplots()

# Plot data for each borough with different colors
plt1, = ax.plot(df_gb_sum_0.pickup, df_gb_sum_0.fare, color="r", label=label_names[0])  # Queens
plt2, = ax.plot(df_gb_sum_1.pickup, df_gb_sum_1.fare, color="b", label=label_names[1])  # Bronx
plt3, = ax.plot(df_gb_sum_2.pickup, df_gb_sum_2.fare, color="g", label=label_names[2])  # Brooklyn
plt4, = ax.plot(df_gb_sum_3.pickup, df_gb_sum_3.fare, color="m", label=label_names[3])  # Manhattan

# Add title and labels
ax.set_title("Total Fares by Borough")
ax.set_xlabel("Date")
ax.set_ylabel("Total Fares")

# Add a legend
plt.legend([plt1, plt2, plt3, plt4], label_names, title="Pickup Borough")

# Show the chart
plt.show()

Let's pause and unpack some of the new syntax we've just introduced. 

We have one large dataset (`df_gb_sum`), but we want to separate it by borough. Each line under "Selecting data for each borough" filters the dataset for one borough.
* `df_gb_sum_0` contains only the daily totals for Queens.
* `df_gb_sum_1` contains only the daily totals for the Bronx.

And so on.
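
The same filter-then-plot pattern can also be written as a loop. The sketch below is an alternative, not the version used in this activity; `toy_sum`, `toy_labels` and `toy_colors` are hypothetical names with made-up values:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical grouped data: one row per (day, borough), made-up fares
toy_sum = pd.DataFrame({
    "pickup": pd.to_datetime(["2019-03-01", "2019-03-01", "2019-03-02", "2019-03-02"]),
    "pickup_borough": ["Bronx", "Queens", "Bronx", "Queens"],
    "fare": [7.0, 15.0, 9.0, 20.0],
})
toy_labels = ["Queens", "Bronx"]
toy_colors = ["r", "b"]

fig, ax = plt.subplots()
for name, color in zip(toy_labels, toy_colors):
    # The boolean mask keeps only the rows for this borough
    subset = toy_sum[toy_sum["pickup_borough"] == name]
    ax.plot(subset["pickup"], subset["fare"], color=color, label=name)

# Each line already carries a label, so the legend can pick them up directly
ax.legend(title="Pickup Borough")

# In a notebook, finish with plt.show() as in the other cells
```

A loop avoids repeating near-identical lines for every borough, at the cost of not keeping a separately named variable for each subset.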

---

### Understanding Why We Use `plt1, = ax.plot(...)`
You'll notice that we're giving each `ax.plot()` an object name, followed by a comma.

For example:

`plt1, = ax.plot(df_gb_sum_0.pickup, df_gb_sum_0.fare, color="r", label=label_names[0])`

We do this so we can create a legend with all of the lines.

But why the comma in `plt1, = ax.plot(...)`?

**Matplotlib Returns a List**

When we use `ax.plot()`, Matplotlib can plot multiple lines at once.

For example, if we plot two datasets together:

`ax.plot(df_gb_sum_0.pickup, df_gb_sum_0.fare, df_gb_sum_1.pickup, df_gb_sum_1.fare)`

Matplotlib returns a list containing one line object for each dataset.

**Unpacking the First Line**
Even when plotting just one line, Matplotlib still returns a list.

So when we do:

`plt1 = ax.plot(df_gb_sum_0.pickup, df_gb_sum_0.fare)`

`plt1` is actually a list with one line inside it instead of just a single line.

To avoid this, we put a comma after the name, which unpacks the single line out of the one-item list:

`plt1, = ax.plot(df_gb_sum_0.pickup, df_gb_sum_0.fare)`

This way, `plt1` directly refers to the line object, making it easier to add it to the legend.
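
The behaviour described above can be confirmed with a few throwaway numbers. This is a minimal sketch; the variable names (`result`, `line`, `two_lines`) and the plotted values are purely illustrative:

```python
import matplotlib.pyplot as plt
from matplotlib.lines import Line2D

fig, ax = plt.subplots()

# Without the comma we get a list, even for a single line
result = ax.plot([1, 2, 3], [4, 5, 6])
print(type(result), len(result))

# With the comma, the one-item list is unpacked into the line object itself
line, = ax.plot([1, 2, 3], [6, 5, 4])
print(isinstance(line, Line2D))

# Passing two x/y pairs in one call returns a list of two lines
two_lines = ax.plot([1, 2], [1, 2], [1, 2], [2, 1])
print(len(two_lines))
```

Note that the comma form only works when exactly one line comes back; unpacking `two_lines` the same way would raise an error.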

---

### Figure size and legend position
This looks okay, but the dates are squished together and the legend is obscuring some of the plot. We can fix this by changing the figure size (see [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.figure.html)) and the position of the legend (see [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.legend.html)). We'll experiment with the legend location in the next Module.

In [None]:
#label_names = ["Queens", "Bronx", "Brooklyn", "Manhattan"]

fig, ax = plt.subplots(figsize=(15, 5))
plt1, = ax.plot(df_gb_sum_0.pickup, df_gb_sum_0.fare, c = "r", label = label_names[0])
plt2, = ax.plot(df_gb_sum_1.pickup, df_gb_sum_1.fare, c = "b", label = label_names[1])
plt3, = ax.plot(df_gb_sum_2.pickup, df_gb_sum_2.fare, c = "g", label = label_names[2])
plt4, = ax.plot(df_gb_sum_3.pickup, df_gb_sum_3.fare, c = "m", label = label_names[3])
ax.set_title("Total Fares by Borough")
ax.set_xlabel("Date")
ax.set_ylabel("Total Fares")
plt.legend([plt1, plt2, plt3, plt4], # line objects
           label_names, # line labels
           title = "Pickup Borough", loc = "center left", bbox_to_anchor = (1,0.5))
plt.show()

---
## Exercise:

Remember, time series are sequences of discrete observations ordered by time, but a line alone can give the impression that the data are continuous in time. Most of the time this is fine, but if we want to emphasise that each day is a discrete data point, we can add a point marker for each day by passing **marker = "."** to each **ax.plot** command.

**Task:** Recreate the plot above, with markers for each day on each borough's line. More information about marker styles can be found [here](https://matplotlib.org/stable/api/markers_api.html#module-matplotlib.markers).