# Module 2, Activity 3: Barplots for 1 and 2-Dimensional Categorical Data

Barplots are extremely useful **visualization tools** for representing **categorical data**.  
They help summarize and compare values across different categories.

In this activity, we'll explore:
- **1D Barplots** → Simple bar charts for a single categorical variable.
- **2D Barplots** → Grouped bar charts to compare multiple categorical variables.

We'll use **Pandas** to generate these plots with **built-in plotting functions**, which rely on Matplotlib.

🔗 More on Pandas bar plots: [built-in Pandas plot calls](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html)

---

## Setup

Before creating barplots, we need to:
- Import necessary libraries
- Load the taxis dataset 🚖
- Convert date columns to the correct format


In [None]:
# Import necessary libraries
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import pandas as pd
import numpy as np
import datetime as dt

# Load dataset
df = pd.read_csv("data/taxis.csv")

# Convert pickup & dropoff times to datetime format
df["pickup"] = pd.to_datetime(df["pickup"], dayfirst=True)
df["dropoff"] = pd.to_datetime(df["dropoff"], dayfirst=True)

# Display first few rows to inspect the data
df.head()

Before plotting, it's important to check that every column has the expected data type.

In [None]:
df.dtypes

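The `dayfirst=True` argument used above tells Pandas to read an ambiguous date string as day/month rather than month/day. The sketch below illustrates this with a made-up date string; the actual date format in `data/taxis.csv` is not shown here, so treat the example value as an assumption.

```python
import pandas as pd

# Hypothetical, ambiguous date string (not taken from the taxis data)
ambiguous = "01/02/2019"

# With dayfirst=True the string is read as 1 February 2019
day_first = pd.to_datetime(ambiguous, dayfirst=True)

# With the default (dayfirst=False) it is read as 2 January 2019
month_first = pd.to_datetime(ambiguous)

print(day_first, month_first)
```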
---
# Barplots for 1D Categorical Data

Let's start with a simple barplot to explore the difference in the number of people using cash or credit card to pay for their taxi fare.

Remember that we use barplots to visualise the number of observations in each category of our dataset. So first, we need to use some Pandas magic to build a data frame which gives us the total number of people who use cash or credit card.

---

## Step 1: Grouping Data by Payment Method

First, we use the **groupby** call to group the `df` dataset by payment. We also need to tell Pandas what we want to extract from each group. Since we want the number of people in each payment group, we call **size()**, which returns the number of rows in each group.

In [None]:
# Group data by payment method and count occurrences
df_table = df.groupby(['payment']).size()

# Display the table
df_table

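To see what `groupby(...).size()` returns without loading the full dataset, here is a minimal sketch on a small made-up DataFrame (the rows are hypothetical, not taken from the taxis data). It also shows `value_counts()`, which gives the same counts sorted from most to least frequent.

```python
import pandas as pd

# Hypothetical toy data standing in for the taxis dataset
toy = pd.DataFrame(
    {"payment": ["cash", "credit card", "credit card", "cash", "credit card"]}
)

# groupby(...).size() counts the rows in each group
counts_by_group = toy.groupby("payment").size()

# value_counts() returns the same counts, most frequent category first
counts_sorted = toy["payment"].value_counts()

print(counts_by_group)
print(counts_sorted)
```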
---
## Step 2: Formatting the Table for a DataFrame

We now have a table called `df_table` (a Pandas Series indexed by payment type) that gives the count of people by payment. So far so good.
To make this data easier to work with, we will:
1. Convert it to a Pandas DataFrame by resetting the index, so payment type becomes a column
2. Rename the count column from `0` to "counts"

In [None]:
# Convert table into a DataFrame and reset index
df_frame = df_table.reset_index()

# Rename count column from 0 to 'counts'
df_frame = df_frame.rename({0: 'counts'}, axis=1)

# Display the formatted DataFrame
df_frame


Now, `df_frame` contains two columns:

* `"payment"` → The type of payment method
* `"counts"` → The number of people who used each method

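As an aside, the reset and rename can also be done in one call, because `Series.reset_index()` accepts a `name` argument for the values column. The sketch below uses hypothetical toy data and is an alternative to the two-step approach, not a replacement for it.

```python
import pandas as pd

# Hypothetical toy data standing in for the taxis dataset
toy = pd.DataFrame({"payment": ["cash", "credit card", "credit card"]})
counts = toy.groupby("payment").size()

# name= labels the values column directly, so no separate rename is needed
frame = counts.reset_index(name="counts")

print(frame)
```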
---
## Step 3: Creating a Simple Barplot
We can now create a barplot using Pandas' built-in plotting function:

In [None]:
df_frame.plot(x="payment", y = "counts", kind = "bar", rot = 0)
plt.show()

* `x="payment"` → Sets the x-axis to the payment types
* `y="counts"` → Uses the number of people as the y-axis
* `kind="bar"` → Specifies that we want a barplot
* `rot=0` → Ensures x-axis labels remain horizontal

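The same plot can be requested through the `.plot.bar()` accessor, which is equivalent to passing `kind="bar"`. Either way, the call returns a Matplotlib Axes object holding one bar per row. Here is a small sketch with made-up counts (the numbers are assumptions for illustration only):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts, not the real taxis totals
toy_frame = pd.DataFrame({"payment": ["cash", "credit card"], "counts": [2, 3]})

# .plot.bar(...) is shorthand for .plot(kind="bar", ...)
ax = toy_frame.plot.bar(x="payment", y="counts", rot=0)

# Each row of the DataFrame becomes one bar on the returned Axes
bar_heights = [bar.get_height() for bar in ax.patches]
print(bar_heights)

plt.close("all")
```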
---

## Step 4: Customizing the Barplot with Pyplot
Since Pandas' `plot()` is built on Matplotlib, we can use Pyplot to customize it further.

**Note:** We imported Pyplot from Matplotlib as **plt** in the Setup section.

In [None]:
df_frame.plot(x="payment", y="counts", kind="bar", rot=0)

# Add axis labels and title
plt.xlabel("Payment Type")
plt.title("Credit Card is the Most \n Popular Payment Method")

# Show the plot
plt.show()

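Because the Pandas `plot()` call returns a Matplotlib Axes object, the same customization can also be written with Axes methods instead of `plt` functions. The sketch below is one possible alternative, using made-up counts and a placeholder title:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts, not the real taxis totals
toy_frame = pd.DataFrame({"payment": ["cash", "credit card"], "counts": [2, 3]})

# Keep a handle on the Axes that Pandas draws on; legend=False hides the legend
ax = toy_frame.plot(x="payment", y="counts", kind="bar", rot=0, legend=False)

# Axes methods mirror the plt.xlabel / plt.title calls
ax.set_xlabel("Payment Type")
ax.set_title("Payment Method Counts (toy data)")

plt.close("all")
```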
---
### Exercises:
1. **Add a y-axis label** to the barplot using `plt.ylabel()`.  
   Refer to the [Matplotlib ylabel documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.ylabel.html) for help.

2. **Tidy up the x-axis labels**:  
   - Use `plt.xticks()` to modify the labels.  
   - Change **'cash'** to **'Cash'** and **'credit card'** to **'Credit Card'**.  
   Refer to the [Matplotlib xticks documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.xticks.html) for help.


What if we want to order the bars from largest to smallest?

Modify `df_frame` so that the payment method with the most users appears first in the barplot.

In [None]:
# Sort the DataFrame by 'counts' in descending order
df_desc = df_frame.sort_values(by="counts", ascending=False)

Then we can plot:

In [None]:
# Plot the sorted barplot
df_desc.plot(x="payment", y="counts", kind="bar", rot=0)

plt.xlabel("Payment Type")
plt.title("Credit Card is the Most \n Popular Payment Method")

plt.show()

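Bars are drawn in row order, so sorting the DataFrame is what controls the order of the bars. A quick sketch with made-up counts shows how the `ascending` flag changes the row order:

```python
import pandas as pd

# Hypothetical counts, not the real taxis totals
toy_frame = pd.DataFrame({"payment": ["cash", "credit card"], "counts": [2, 3]})

# ascending=False puts the largest count first
largest_first = toy_frame.sort_values(by="counts", ascending=False)

# ascending=True (the default) puts the smallest count first
smallest_first = toy_frame.sort_values(by="counts", ascending=True)

print(largest_first["payment"].tolist())
print(smallest_first["payment"].tolist())
```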
Now, let's check if the payment method trend is the same across all boroughs.

Group the dataset by `pickup_borough` and `payment` to count the number of people in each combination.

In [None]:
df_table = df.groupby(['pickup_borough', 'payment']).size().reset_index()

df_table

We have one more step before we can use this data: pivoting the dataset from long-form to wide-form. We use the `.pivot()` call to do this, specifying which variables supply the columns, the rows (index), and the values in the cells.

In [None]:
df_frame2 = df_table.pivot(columns='payment', index='pickup_borough', values=0)
df_frame2

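The long-to-wide step can be easier to follow on a tiny example. The sketch below uses made-up rows (not the real taxis data) and also shows `pd.crosstab`, which builds the same wide table of counts in a single call:

```python
import pandas as pd

# Hypothetical toy data standing in for the taxis dataset
toy = pd.DataFrame(
    {
        "pickup_borough": ["Bronx", "Bronx", "Queens", "Queens", "Queens"],
        "payment": ["cash", "credit card", "cash", "credit card", "credit card"],
    }
)

# Long form: one row per (borough, payment) combination
long_form = (
    toy.groupby(["pickup_borough", "payment"]).size().reset_index(name="counts")
)

# Wide form: one row per borough, one column per payment type
wide_form = long_form.pivot(
    index="pickup_borough", columns="payment", values="counts"
)

# crosstab produces the same table of counts directly from the raw columns
wide_crosstab = pd.crosstab(toy["pickup_borough"], toy["payment"])

print(long_form)
print(wide_form)
```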
Plot a grouped barplot, where:

* X-axis → Pickup boroughs
* Bars → Payment method categories
* Different colors represent payment methods

In [None]:
# Create a grouped barplot
df_frame2.plot(kind="bar", rot=0)

# Show plot
plt.show()


---
### Exercise:

Modify the above command to transpose `df_frame2`, so that the rows are payment types and the columns are pickup boroughs. Plot the transposed DataFrame as a grouped barplot. Using Matplotlib's Pyplot functions, tidy the x-axis label and tick labels as required, and give the plot a title.

We can also stack these bars, which is useful if we want to compare the total number of observations in each pickup borough while still seeing the different payment types.

In [None]:
df_frame2.plot(kind = "bar", rot = 0, stacked = True)
plt.show()

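With `stacked=True`, each payment type's bar starts where the previous one ends, so the top of each stack equals the row total. The following sketch checks this on a small made-up table (the numbers are assumptions for illustration):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical wide-form counts, not the real taxis totals
wide = pd.DataFrame(
    {"cash": [1, 2], "credit card": [3, 4]}, index=["Bronx", "Queens"]
)

ax = wide.plot(kind="bar", rot=0, stacked=True)

# containers[1] holds the bars for the second column ("credit card");
# each one sits on top of the matching "cash" bar
stack_tops = [bar.get_y() + bar.get_height() for bar in ax.containers[1]]
row_totals = wide.sum(axis=1).tolist()

print(stack_tops, row_totals)

plt.close("all")
```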
---

### Exercise: 

Generate a table of counts for two variables other than pickup borough and payment, and generate grouped and stacked barplots of the variables.

It's hard to see the relative difference in payment type between boroughs in the figure above: the Bronx might have the same payment composition as Manhattan, but we wouldn't know. To see this, we need a stacked percentage plot. Again, we first need to transform our data with some Pandas calls, so that we have the percentage of all payments that were either cash or credit card in each pickup borough. The cell below computes these percentages; we'll leave it to you to generate the final figure, using what you've learned so far, in the Exercise below.

In [None]:
df_rel = df_frame2.apply(lambda x: x*100/sum(x), axis=1)
df_rel

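To illustrate why percentages help, the sketch below uses a made-up table in which two boroughs have very different totals but the same payment composition. It computes row percentages with `DataFrame.div`, a vectorized alternative to `apply` for this calculation; the numbers are hypothetical.

```python
import pandas as pd

# Hypothetical wide-form counts: Queens has twice the trips of the Bronx
wide = pd.DataFrame(
    {"cash": [1, 2], "credit card": [3, 6]}, index=["Bronx", "Queens"]
)

# Total number of payments in each borough (one value per row)
row_totals = wide.sum(axis=1)

# Divide every row by its own total, then scale to percentages
percent = wide.div(row_totals, axis=0) * 100

print(percent)
```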
---

### Exercise:
1) In your own words, explain what the `apply` function is doing in the above command. What would you need to change in the above command if you wanted the percentage of all cash payments in Brooklyn?\
2) Using `df_rel` and what we have learned so far, generate a stacked barplot of the percentage of all payments that were either cash or credit card in each pickup borough. Which borough has the highest percentage of cash payments?

---
# Barplots with Seaborn

Seaborn is a visualization library **built on top of Matplotlib** that works closely with **Pandas data structures**.  
It often lets us create **barplots** with less code than using Matplotlib directly.

### **Why Use Seaborn?**
* **Less manual coding** – Seaborn's categorical plots count and group the data for us  
* **Better styling** – Uses aesthetically pleasing default themes  
* **Simplifies categorical plots** – `countplot()` works directly on DataFrames  

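As a starting point, here is a minimal sketch of `countplot()` on a small made-up DataFrame (the rows are hypothetical, not the taxis data). Note that no `groupby` or `size()` step is needed, since Seaborn counts the rows in each category itself.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; omit this line in Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data standing in for the taxis dataset
toy = pd.DataFrame(
    {"payment": ["cash", "credit card", "credit card", "cash", "credit card"]}
)

# countplot counts the rows in each category and draws one bar per category
ax = sns.countplot(data=toy, x="payment")

print(ax.get_xlabel(), ax.get_ylabel())

plt.close("all")
```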
### Exercise: 
Work through the example code for Seaborn's countplot function, available [here](https://seaborn.pydata.org/generated/seaborn.countplot.html), using the taxis dataset you've loaded and prepared for this Module.