# Module 2, Activity 2: Preparing Data for Visualisation with Pandas
---

## Getting Started with Jupyter Notebook

Jupyter Notebook is an interactive environment where you can write and execute Python code in small sections called "cells". 

### How to Use This Notebook:
- **Running a cell**: Click on a cell and press `Shift + Enter` to execute it. Alternatively, hover over the cell (e.g. `[33]`) and select the 'Run' button (▶).

- **Adding new cells**: Click on `+` in the toolbar to add a new cell.

- **Editing a cell**: Click inside a cell to start typing.

---

# Loading Data from a CSV File

We are working with a tabular dataset, which means the data is stored in rows and columns—just like an Excel spreadsheet. We can load this data into Pandas, a Python library designed for handling data.

**Note:** See the [quick reference guide](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) to understand and utilise the pandas DataFrame.

---
## Step 1: Importing the dataset

**Importing a Dataset:** Ensure that your dataset is stored in the correct location before loading it. The taxis.csv file must be inside a folder named "data". Otherwise, the code will not work. Alternatively, modify the filepath in your code to match the current location of your dataset.

Run (▶) the following code.

In [None]:
import pandas as pd  # Import Pandas

# Load data from a CSV file
df2 = pd.read_csv("data/taxis.csv")

# Display the first 10 rows to understand the dataset
df2.head(10)

What did this code do?

`pd.read_csv("data/taxis.csv")` loads the data from a CSV file.

`.head(10)` shows the first 10 rows so we can see what the dataset looks like.
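
To try `read_csv` and `head` without needing the file on disk, here is a minimal sketch that reads a small CSV from a string. The column names and values are made up for illustration and are not taken from taxis.csv.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = """pickup,passengers,fare
2019-03-23 20:21:09,1,7.0
2019-03-04 16:11:55,1,5.0
2019-03-27 17:53:01,3,7.5
"""

# read_csv accepts a file path or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows as a new DataFrame
first_two = toy.head(2)
print(first_two)
```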

---

## Step 2: Checking Data Types
Before we start analyzing the data, we need to check what types of data we are dealing with.

Run (▶) the following code.

In [None]:
df2.dtypes

Pandas assigns a data type to each column (numbers, text, dates, etc.).
* object = Text (e.g., names, addresses, dates stored as text).
* int64 = Whole numbers (e.g., 1, 2, 3).
* float64 = Decimal numbers (e.g., 2.5, 1000.89).
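
A small sketch on made-up data shows how these types appear in practice. The DataFrame and its column names are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "zone": ["A", "B", "C"],      # text
    "passengers": [1, 2, 3],      # whole numbers
    "fare": [7.0, 5.5, 12.25],    # decimal numbers
})

# One data type per column
print(toy.dtypes)

# The type of a single column is available through .dtype
passengers_type = toy["passengers"].dtype
fare_type = toy["fare"].dtype
```

Depending on the Pandas version, text columns may be reported with a dedicated string type instead of `object`.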

---
## Step 3: Converting Dates to Datetime Format
The dataset contains **pickup** and **dropoff** times. These are dates, but they are currently stored as text (object type).
We need to convert them to datetime format so we can analyze time-based trends.

In [None]:
df2["pickup"] = pd.to_datetime(df2["pickup"], dayfirst=True)
df2["dropoff"] = pd.to_datetime(df2["dropoff"], dayfirst=True)

# Check the updated data types
df2.dtypes



This allows us to sort, filter, and calculate time differences easily.
If we leave them as text, Python won’t recognise them as real dates.

⚠ **Warning:** `pd.to_datetime()` tries to guess the format, but sometimes it gets it wrong! For more information go to: https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html#pandas.to_datetime 
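
One way to avoid the guessing is to pass an explicit `format`. The sketch below uses made-up date strings and assumes they are written as day/month/year; adjust the format string to match your own data.

```python
import pandas as pd

# Made-up, ambiguous strings: is 01/02/2019 the 1st of February or the 2nd of January?
raw = pd.Series(["01/02/2019 08:30", "03/02/2019 17:45"])

# An explicit format removes the guesswork: day/month/year hour:minute
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M")

print(parsed.dt.month.tolist())  # month of each parsed date
print(parsed.dt.day.tolist())    # day of each parsed date
```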

Note how we referred to each column by its name in square brackets, e.g. `df2["pickup"]`. This is one of many useful things we can do with Pandas DataFrames.

If we look at the data again, the table appears unchanged. However, we can now perform arithmetic with the pickup and dropoff times.

In [None]:
df2.head()
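
As an illustration of the arithmetic that datetime columns allow, the sketch below computes a trip duration on made-up data. The `duration` and `duration_min` column names are hypothetical and are not part of the dataset.

```python
import pandas as pd

# Made-up pickup and dropoff times, already converted to datetime
toy = pd.DataFrame({
    "pickup": pd.to_datetime(["2019-03-23 20:21:09", "2019-03-04 16:11:55"]),
    "dropoff": pd.to_datetime(["2019-03-23 20:27:24", "2019-03-04 16:19:00"]),
})

# Subtracting two datetime columns gives a timedelta column
toy["duration"] = toy["dropoff"] - toy["pickup"]

# Convert the timedelta to minutes as a plain number
toy["duration_min"] = toy["duration"].dt.total_seconds() / 60
print(toy)
```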

---
# Indexing and Slicing
Sometimes, we don’t need to use the entire dataset—we only want to focus on specific rows or columns.
We can do this using slicing.

---
## Step 1: Understanding Slicing Syntax
We can extract a portion of the dataset using this format:

df2[start:stop:step]

* start → The row index where the slice begins (included).
* stop → The row index where the slice ends (not included).
* step → The number of rows to skip (optional).
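
The three parts of a slice can be tried out on a small made-up DataFrame, where the results are easy to check by eye. Printing the index shows which rows were selected.

```python
import pandas as pd

# A toy DataFrame with 10 rows, indexed 0 to 9
toy = pd.DataFrame({"value": range(10)})

print(toy[2:5].index.tolist())   # rows 2, 3, 4 (stop is excluded)
print(toy[::3].index.tolist())   # every third row: 0, 3, 6, 9
print(toy[-3:].index.tolist())   # the last three rows: 7, 8, 9
```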

---
## Step 2: Slicing by Row
Let’s look at some examples:

In [None]:
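# Note: a cell only displays the result of its last line; run the lines one at a time to see each result
df2[5:6]  # Returns the row at index 5 (remember, Python starts counting from 0)
df2[:5]   # Returns the first 5 rows (index 0 to 4)
df2[6000:-100]  # Returns rows from index 6000 up to, but not including, the 100th-last row
df2[-60:-50]  # Returns 10 rows: from the 60th-last row up to, but not including, the 50th-last row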


---
## Step 3: Using Steps to Skip Rows
We can skip rows using a step value.

In [None]:
df2[::5]  # Selects every 5th row
df2[10:50:5]  # Selects every 5th row between index 10 and 50
df2[10::-1]  # Selects rows in reverse order, from index 10 back to index 0


Negative Step (-1) → Reverses the order of rows.

---
## Step 4: Understanding the Exercise
Look at this command:

In [None]:
df2[50:30:-2]

What does this do?

* Starts at row index 50.
* Ends at row index 30 (not included).
* Moves backward (-2 step), selecting every second row.
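
The same slice can be checked on a made-up DataFrame with 100 rows. This is only a sketch for confirming which row indices come back.

```python
import pandas as pd

toy = pd.DataFrame({"value": range(100)})  # made-up data with 100 rows

selected = toy[50:30:-2]

# The index lists the selected rows in the order they are returned
print(selected.index.tolist())
```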

---
# Indexing Data by Columns
So far, we've learned how to index rows. Now, let's explore how to select specific columns from a DataFrame.

---
## Step 1: Selecting a Single Column
We can extract a single column from a DataFrame using double square brackets:

In [None]:
df2[["passengers"]]


What happens when we run the code? It returns a **DataFrame** with only the `"passengers"` column. It preserves the DataFrame structure.

What if we forget the extra brackets?

In [None]:
df2["passengers"]


This returns a **Series**, not a DataFrame.
A Series is like a **single-column list**, while a **DataFrame** is a **table with multiple columns.**
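
The difference is easy to confirm by checking the type and shape of each result. The sketch below uses a made-up DataFrame rather than the taxi data.

```python
import pandas as pd

toy = pd.DataFrame({"passengers": [1, 3, 2], "fare": [7.0, 12.5, 9.0]})

as_frame = toy[["passengers"]]   # double brackets
as_series = toy["passengers"]    # single brackets

print(type(as_frame).__name__, as_frame.shape)    # DataFrame (3, 1)
print(type(as_series).__name__, as_series.shape)  # Series (3,)
```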

---
## Step 2: Selecting Multiple Columns
To select multiple columns, include them inside double square brackets:

In [None]:
df2[["pickup", "dropoff"]]


This returns **a new DataFrame** with only the `"pickup"` and `"dropoff"` columns.

---
## Step 3: Try It Yourself!
**Experiment:** Change the code below to select three different columns from your dataset.

In [None]:
df2[["passengers", "pickup", "dropoff"]]


---
# Indexing by Rows and Columns
So far, we’ve seen how to select only rows or only columns.
But what if we need to select both at the same time?
For this, we use `iloc` (index-based) or `loc` (label-based).

---
## Step 1: Understanding `iloc` (Index-Based Selection)
Think of `iloc` like selecting items by row/column number (starting from 0).

**Example:** Selecting specific rows and columns using `iloc`

In [None]:
df2.iloc[0:10, 3:5]


* Selects rows 0 to 9 (does NOT include row 10)
* Selects columns 3 and 4 (does NOT include column 5)

**Remember:** `iloc` uses Python-style indexing, so the end value is not included.

---
## Step 2: Understanding `loc` (Label-Based Selection)
Think of `loc` like selecting items by their actual names (labels).

Example: Selecting specific rows and columns using `loc`

In [None]:
df2.loc[0:10, ["distance", "fare"]]


* Selects rows 0 to 10 (includes row 10!)
* Selects columns "distance" and "fare"

**Remember:** `loc` includes the last value in the range.

---
## Step 3: Key Differences Between `iloc` and `loc`

| Feature             | `iloc` (Index-based)                  | `loc` (Label-based)                    |
|---------------------|--------------------------------------|---------------------------------------|
| **How it works**    | Uses row/column numbers (starting at 0) | Uses actual row/column labels        |
| **End value included?** | **No** (excludes stop index)          | **Yes** (includes stop index)           |
| **Example**         | `df2.iloc[0:10, 3:5]`                | `df2.loc[0:10, ["distance", "fare"]]` |
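
The distinction matters most once row labels no longer match row positions, for example after filtering. The following sketch uses made-up fares to show `iloc` and `loc` returning different rows for the same numbers.

```python
import pandas as pd

toy = pd.DataFrame({"fare": [5.0, 25.0, 8.0, 30.0, 22.0, 6.0]})

# Filtering keeps the original row labels: 1, 3 and 4
expensive = toy[toy["fare"] > 20]

# iloc counts positions within the filtered table (0, 1, 2)
by_position = expensive.iloc[1:4]

# loc looks up the labels themselves, and includes the end label
by_label = expensive.loc[1:4]

print(by_position.index.tolist())  # [3, 4]
print(by_label.index.tolist())     # [1, 3, 4]
```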

---

## Step 4: Special Case - Selecting All Rows but Specific Columns
We can select all rows but only certain columns using `iloc`:

In [None]:
df2.iloc[:, 3:5]  # Selects all rows but only columns 3 and 4
df2.iloc[:, [3, 4]]  # Same as above, but selects specific column positions


---
## Step 5: Try It Yourself!
Run (▶) each of these commands and write down your observations.

In [None]:
df2.iloc[0:10, 3:5]
df2.loc[0:10, ["distance", "fare"]]
df2[0:10][["distance", "fare"]]
df2.iloc[0:5]
df2.loc[0:5]

1. What is the difference between using iloc and loc?
2. How does `df2[0:10][["distance", "fare"]]` behave differently?
3. Does `df2.iloc[0:5]` return the same rows as `df2.loc[0:5]`? Why or why not?

Run (▶) these two commands and analyze the output.

In [None]:
df2.iloc[0:4, 1:4]
# df2.loc[0:4, 1:4]  # Uncomment and run this line on its own to see how loc handles column numbers
df2.loc[0:4, ["pickup", "dropoff"]]

1. Does `iloc` include the last row and column in the range?
2. Does `loc` behave differently when selecting a column by number?

---
# Filtering Data with Queries

Instead of selecting data by **row/column numbers**, we can **filter** our dataset based on **conditions**.

For example, we might want to:
- Find all taxi rides with a fare over $20
- Find rides between 1 and 5 miles long
- Find rides that happened in a specific date range

We can do this using **queries**.

--- 

## Step 1: Basic Query Syntax

We use **comparison operators** to filter data:

| Operator | Meaning |
|----------|---------|
| `==` | Equals |
| `!=` | Not equals |
| `>` | Greater than |
| `<` | Less than |
| `>=` | Greater than or equal |
| `<=` | Less than or equal |

---
## Step 2: Querying a DataFrame

Let's find all taxi rides where the fare was more than $20.



In [None]:
fare20 = df2.query("fare > 20")
fare20

This creates a new DataFrame called `fare20`, which contains only rides where `fare > 20`.
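
A query string can also refer to a Python variable by prefixing its name with `@`. The sketch below uses made-up fares, and `threshold` is a hypothetical variable name chosen for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"fare": [5.0, 25.0, 8.0, 30.0]})

threshold = 20  # hypothetical cut-off value

# @threshold refers to the Python variable defined above
above = toy.query("fare > @threshold")
print(above)
```

This keeps the cut-off in one place, so it can be changed without editing the query string.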

---
## Step 3: More Complex Queries
Let's filter rides that were between 1 and 5 miles long AND had a fare greater than $20.

In [None]:
df2.query("fare > 20 & 1 <= distance <= 5")


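* `&` means "AND" → Both conditions must be met.
* `1 <= distance <= 5` keeps distances between 1 and 5 (inclusive).

**Note:**
Inside `query()`, you can also write `and` instead of `&`. A malformed expression, such as a missing operator between two conditions, raises a syntax error.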

---
## Step 4: Using "OR" (|) in Queries
Now, let's find rides where fare was either above $20 OR below $10.

In [None]:
df2.query("fare > 20 | fare < 10")


`|` means "OR" → At least one condition must be met.

---
## Step 5: Filtering Data Without `query()`
Instead of using `query()`, we can filter data directly using brackets (`[]`).

In [None]:
df2[df2["fare"] > 20]  # Selects rows where fare > 20


This does the same thing as `df2.query("fare > 20")`, but uses boolean indexing instead.
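
When several conditions are combined with boolean indexing, each one needs its own parentheses. The sketch below, on made-up data, compares a combined `query()` with the equivalent boolean mask.

```python
import pandas as pd

toy = pd.DataFrame({
    "fare": [25.0, 30.0, 8.0, 22.0],
    "distance": [3.0, 9.0, 2.0, 0.5],
})

with_query = toy.query("fare > 20 & 1 <= distance <= 5")

# Outside query(), & binds more tightly than > or <=, so wrap each condition
with_mask = toy[(toy["fare"] > 20) & (toy["distance"] >= 1) & (toy["distance"] <= 5)]

# Both approaches select the same rows
print(with_query.equals(with_mask))
```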

---
## Step 6: Filtering by Date Range
Let's filter rides with a fare over $20 that happened between November 1, 2019, and January 31, 2020.

In [None]:
df2[(df2["fare"] > 20) & 
    (df2["pickup"] >= pd.to_datetime("2019-11-01")) & 
    (df2["pickup"] <= pd.to_datetime("2020-01-31"))]


Why use `pd.to_datetime()`?
The `pickup` column holds datetime values, so we convert the boundary dates to datetimes as well. That way both sides of each comparison are the same type.
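
One detail worth knowing: a date without a time is read as midnight at the start of that day. The sketch below, on made-up pickup times, shows how this affects the end of a range and one possible way to keep the whole last day. The variable names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "pickup": pd.to_datetime([
        "2019-10-31 23:50:00",
        "2019-11-01 00:10:00",
        "2020-01-31 09:00:00",
        "2020-02-01 00:05:00",
    ])
})

start = pd.to_datetime("2019-11-01")
end = pd.to_datetime("2020-01-31")  # this means 2020-01-31 00:00:00

# Rides later on 31 January fall after `end` and are left out
up_to_midnight = toy[(toy["pickup"] >= start) & (toy["pickup"] <= end)]

# Comparing against the start of the next day keeps all of 31 January
next_day = end + pd.Timedelta(days=1)
whole_last_day = toy[(toy["pickup"] >= start) & (toy["pickup"] < next_day)]

print(up_to_midnight.index.tolist())
print(whole_last_day.index.tolist())
```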

---
# Saving Subsets of Data

When working with large datasets, you often need to **save a filtered subset** for future use.  
You can do this by creating either a **copy** or a **view**.

---

## Step 1: Copy vs. View – What's the Difference?

### **Making a Copy (Recommended)**
A **copy** creates a **new, independent dataset** that is **not linked** to the original DataFrame.


In [None]:
df_copy = df2[df2['fare'] > 20].copy()  # This is a copy

Changes to df_copy will NOT affect the original dataset (df2). This is the safest option when modifying data.

---
### **Using a View (Risky)**
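A view is a reference to the original DataFrame's data rather than an independent dataset.



In [None]:
df_view = df2[df2['fare'] > 20]  # No .copy(): pandas decides whether this is a view or a copy


Without `.copy()`, you cannot rely on the result being independent of `df2`. Filtering with a condition, as here, normally produces new data, but other selections (such as plain slices) can return views, and changes to a view may also modify `df2`.
If you then modify the result, Pandas might show a warning (`SettingWithCopyWarning`).
Use `.copy()` if you are unsure!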
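
The independence of a copy can be checked directly. In this sketch on made-up fares, the copy is changed and the original is printed afterwards.

```python
import pandas as pd

toy = pd.DataFrame({"fare": [5.0, 25.0, 30.0]})

subset = toy[toy["fare"] > 20].copy()  # independent copy
subset["fare"] = 0.0                   # change every fare in the copy

print(toy["fare"].tolist())     # the original values are unchanged
print(subset["fare"].tolist())
```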

---

## Step 2: Creating a Filtered Copy
Let’s create a subset of taxi rides:

* Fare > $20
* Pickup between November 1, 2019, and January 31, 2020
* Reset index after filtering

In [None]:
import pandas as pd

df3 = df2[(df2["fare"] > 20) & 
          (df2["pickup"] >= pd.to_datetime("2019-11-01")) & 
          (df2["pickup"] <= pd.to_datetime("2020-01-31"))].copy()  # Making a copy

df3 = df3.reset_index()  # Reset index after filtering (the old index is kept in a new "index" column)
df3
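
By default, `reset_index()` keeps the old row labels as a new column. The sketch below, on made-up data, compares this with `drop=True`, which discards them. Which one to use depends on whether you still need to trace rows back to the original table.

```python
import pandas as pd

toy = pd.DataFrame({"fare": [5.0, 25.0, 8.0, 30.0]})
filtered = toy[toy["fare"] > 20]           # keeps labels 1 and 3

kept = filtered.reset_index()              # old labels move into an "index" column
dropped = filtered.reset_index(drop=True)  # old labels are discarded

print(kept.columns.tolist())     # ['index', 'fare']
print(dropped.columns.tolist())  # ['fare']
```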


---
# Transforming Data

Once you've loaded your data, you can **modify** it further by **adding new columns** or **transforming existing data**.

For example, we might want to **calculate new values** based on existing ones.  
Let's say we want to **calculate the average fare per passenger** and store it in a new column.

---

##  Step 1: Creating a New Column

We can create a new column called **`avg_fare`** by dividing the **total fare** by the **number of passengers**.


In [None]:
df3['avg_fare'] = df3['total'] / df3['passengers']
df3

This adds a new column to df3, where each row contains the average fare per passenger.
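
If any ride records zero passengers, the division produces an infinite value. The sketch below shows one possible way to handle this on made-up data, by treating a zero passenger count as missing. This is an assumption about how such rows should be handled, not a rule from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({"total": [12.0, 30.0, 9.0], "passengers": [1, 3, 0]})

# Keep passenger counts above zero; zeros become NaN (missing)
passengers = toy["passengers"].where(toy["passengers"] > 0)

# Dividing by NaN gives NaN instead of an infinite value
toy["avg_fare"] = toy["total"] / passengers
print(toy)
```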


---
## Exercises

### **1. Subsetting Based on a Pickup Zone**
Select all rows in `df2` where the **pickup zone** is `"Lenox Hill East"`.  
Then, **find the number of rows** in this subset using `len()` or the `.shape` attribute.

**Hint:** The `DataFrame.size` attribute (e.g. `df2.size`) returns the total number of values (rows × columns), not the number of rows.
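
Since the size of a DataFrame can be measured in more than one way, a small sketch on made-up data may help when interpreting the result.

```python
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})  # 3 rows, 2 columns

print(toy.size)   # 6: rows multiplied by columns
print(toy.shape)  # (3, 2): (rows, columns)
print(len(toy))   # 3: rows only
```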

### **2.  Experiment with Queries**
Write a query to find **all rows where the number of passengers is greater than or equal to 2**.


### **3.  Creating a Scatter Plot Based on Diet**
In the **"Python for Data Visualization"** module, we created a scatter plot **colored by exercise type**.  
Now, modify the visualization so that:
- **X-axis** → Time spent exercising  
- **Y-axis** → Heart rate  
- **Points are colored by diet type**  

Use **Matplotlib** and **Pandas** to create this scatter plot.

## Further Reading and Reference Material

We've only just started to scratch the surface of Matplotlib and Pandas, but we will rapidly expand our skill set for the purposes of visualisation in the coming modules. In the meantime, both libraries have extensive online guides ([Pandas](https://pandas.pydata.org/docs/user_guide/index.html) and [Matplotlib](https://matplotlib.org/stable/index.html)).

Helpful 'cheat sheets' have also been created for both libraries, which you can access for Matplotlib [here](https://matplotlib.org/cheatsheets/) and Pandas [here](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf).
