# Pandas Student Notebook — Foundations Practice (3)  
## Dataset: Kaggle “NYC Taxi Trip Duration” (train.csv)

### Goal of this notebook
This notebook strengthens **core Pandas analysis skills** with a slightly more realistic, time-based dataset.
You will practice datetime handling, validation, grouping, normalization, and careful reasoning about grain.

### Important reminders
Always ask yourself:
- What does one row represent?
- Am I aggregating before comparing?
- Am I accidentally leaking future information?
- Is this data quality, business logic, or analysis logic?

Write your code in the empty code cells.


## 0. Setup + grain

Load `train.csv` into a DataFrame called `df`.

In [12]:
import pandas as pd
import os
import numpy as np

df = pd.read_csv(os.path.join('data', 'taxi','train.csv'))
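
Loading the file does not by itself establish the grain. The sketch below shows one way to test the "one row per trip" assumption. It uses a tiny hypothetical frame `toy` in place of `df`, and it assumes a trip identifier column named `id`, which this notebook does not otherwise reference.

```python
import pandas as pd

# Hypothetical stand-in for df; the `id` column name is an assumption.
toy = pd.DataFrame({
    "id": ["id1", "id2", "id3", "id3"],
    "trip_duration": [455, 663, 2124, 2124],
})

n_rows = len(toy)
n_ids = toy["id"].nunique()

# keep=False marks every member of a duplicated group, not only the repeats.
dup_ids = toy.loc[toy["id"].duplicated(keep=False), "id"].unique().tolist()

print("rows:", n_rows, "unique ids:", n_ids)
print("ids appearing more than once:", dup_ids)

# "One row per trip" only holds if every id is unique.
grain_is_one_row_per_trip = toy["id"].is_unique
print("one row per trip:", grain_is_one_row_per_trip)
```

If the identifier is unique, aggregates such as counts and means can be read as per-trip statistics without deduplicating first.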

## 1. Datetime parsing + invariants

1) Convert `pickup_datetime` and `dropoff_datetime` to datetime.  
2) Create:
- `pickup_date`
- `pickup_hour`
- `pickup_weekday`
- `pickup_is_weekend`



In [13]:
dt_cols = ["pickup_datetime", "dropoff_datetime"]
for c in dt_cols:
    df[c] = pd.to_datetime(df[c], errors="coerce")

df["pickup_date"] = df["pickup_datetime"].dt.date
df["pickup_hour"] = df["pickup_datetime"].dt.hour
df["pickup_weekday"] = df["pickup_datetime"].dt.weekday  # Mon=0 ... Sun=6
df["pickup_is_weekend"] = df["pickup_weekday"].ge(5).astype(int)

df[["pickup_datetime", "dropoff_datetime", "pickup_date", "pickup_hour", "pickup_weekday", "pickup_is_weekend"]].head()


|   | pickup_datetime | dropoff_datetime | pickup_date | pickup_hour | pickup_weekday | pickup_is_weekend |
|---|---|---|---|---|---|---|
| 0 | 2016-03-14 17:24:55 | 2016-03-14 17:32:30 | 2016-03-14 | 17 | 0 | 0 |
| 1 | 2016-06-12 00:43:35 | 2016-06-12 00:54:38 | 2016-06-12 | 0 | 6 | 1 |
| 2 | 2016-01-19 11:35:24 | 2016-01-19 12:10:48 | 2016-01-19 | 11 | 1 | 0 |
| 3 | 2016-04-06 19:32:31 | 2016-04-06 19:39:40 | 2016-04-06 | 19 | 2 | 0 |
| 4 | 2016-03-26 13:30:55 | 2016-03-26 13:38:10 | 2016-03-26 | 13 | 5 | 1 |
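
Because `errors="coerce"` turns unparseable values into `NaT` without raising, the parsing step is worth pairing with explicit invariant checks. The following is a minimal sketch on hypothetical rows (the frame `toy` and its values are made up for illustration): it counts `NaT` values and counts trips whose dropoff precedes the pickup.

```python
import pandas as pd

# Hypothetical rows: one valid, one unparseable pickup, one with dropoff before pickup.
toy = pd.DataFrame({
    "pickup_datetime": ["2016-03-14 17:24:55", "not a date", "2016-01-19 12:10:48"],
    "dropoff_datetime": ["2016-03-14 17:32:30", "2016-06-12 00:54:38", "2016-01-19 11:35:24"],
})
for c in ["pickup_datetime", "dropoff_datetime"]:
    toy[c] = pd.to_datetime(toy[c], errors="coerce")

# Invariant 1: parsing produced no NaT (coerce hides failures unless they are counted).
nat_counts = toy[["pickup_datetime", "dropoff_datetime"]].isna().sum()

# Invariant 2: dropoff is not earlier than pickup.
# Comparisons involving NaT evaluate to False, so NaT rows are not counted here.
dropoff_before_pickup = int((toy["dropoff_datetime"] < toy["pickup_datetime"]).sum())

print(nat_counts)
print("dropoff before pickup:", dropoff_before_pickup)
```

On the real `df`, both counts would ideally be zero; any non-zero count points to rows worth inspecting before the derived columns are used.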


## 2. Duration validation

Trip duration is provided in seconds.

1) Recompute duration as `(dropoff - pickup).total_seconds()`.  
2) Create `duration_delta = recomputed - trip_duration`.  
3) Inspect:
- how many rows have `duration_delta == 0`
- distribution of `duration_delta`
- top 10 absolute deviations

Write as a comment:
- Why might small non-zero deltas be acceptable?


In [14]:
recomputed = (df["dropoff_datetime"] - df["pickup_datetime"]).dt.total_seconds()
df["duration_delta"] = recomputed - df["trip_duration"]

# 1) how many rows have duration_delta == 0
(df["duration_delta"] == 0).sum()


np.int64(1458644)

In [15]:
# 2) distribution of duration_delta
df["duration_delta"].describe()


count    1458644.0
mean           0.0
std            0.0
min            0.0
25%            0.0
50%            0.0
75%            0.0
max            0.0
Name: duration_delta, dtype: float64

In [16]:
# 3) top 10 absolute deviations
# (every delta is 0 in this dataset, so nlargest breaks the ties by returning the first 10 rows)
df.loc[df["duration_delta"].abs().nlargest(10).index,
       ["pickup_datetime", "dropoff_datetime", "trip_duration", "duration_delta"]]


# Small non-zero deltas can be acceptable due to rounding, timestamp precision, or ETL differences
# (e.g., trip_duration stored as int seconds while datetimes have higher precision / truncation).


|   | pickup_datetime | dropoff_datetime | trip_duration | duration_delta |
|---|---|---|---|---|
| 0 | 2016-03-14 17:24:55 | 2016-03-14 17:32:30 | 455 | 0.0 |
| 1 | 2016-06-12 00:43:35 | 2016-06-12 00:54:38 | 663 | 0.0 |
| 2 | 2016-01-19 11:35:24 | 2016-01-19 12:10:48 | 2124 | 0.0 |
| 3 | 2016-04-06 19:32:31 | 2016-04-06 19:39:40 | 429 | 0.0 |
| 4 | 2016-03-26 13:30:55 | 2016-03-26 13:38:10 | 435 | 0.0 |
| 5 | 2016-01-30 22:01:40 | 2016-01-30 22:09:03 | 443 | 0.0 |
| 6 | 2016-06-17 22:34:59 | 2016-06-17 22:40:40 | 341 | 0.0 |
| 7 | 2016-05-21 07:54:58 | 2016-05-21 08:20:49 | 1551 | 0.0 |
| 8 | 2016-05-27 23:12:23 | 2016-05-27 23:16:38 | 255 | 0.0 |
| 9 | 2016-03-10 21:45:01 | 2016-03-10 22:05:26 | 1225 | 0.0 |
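
All deltas are zero in this file, so the "small non-zero delta" case never shows up in the output. The sketch below illustrates it on a single hypothetical trip whose timestamps carry sub-second precision while the stored duration is whole seconds. The timestamps and the 1-second tolerance are assumptions chosen for illustration, not values from the dataset.

```python
import pandas as pd

# Hypothetical trip with sub-second timestamps; the stored duration is integer seconds.
pickup = pd.Series(pd.to_datetime(["2016-03-14 17:24:55.400"]))
dropoff = pd.Series(pd.to_datetime(["2016-03-14 17:32:30.900"]))
stored_duration = pd.Series([455])

recomputed = (dropoff - pickup).dt.total_seconds()
delta = recomputed - stored_duration

# A tolerance check treats sub-second disagreement as a match.
tolerance_s = 1.0  # assumed tolerance
within_tolerance = delta.abs() <= tolerance_s

print("delta (s):", delta.iloc[0])
print("within tolerance:", bool(within_tolerance.iloc[0]))
```

Comparing against a tolerance rather than testing `== 0` keeps rounding or truncation in the source system from being reported as a data error.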


## 3. Outlier detection (robust, no hard thresholds)

Create:
- `duration_z_robust` using median and MAD
- `duration_outlier` where `abs(duration_z_robust) > 5`

Constraints:
- No loops
- No `apply(axis=1)`



In [17]:
x = df["trip_duration"].astype(float)

med = x.median()
mad = (x - med).abs().median()

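# robust z-score; the 0.6745 factor rescales MAD so the score is comparable
# to a std-based z-score when the data is roughly normal
if mad == 0:
    # mad is a scalar, so a plain branch avoids dividing by zero
    df["duration_z_robust"] = 0.0
else:
    df["duration_z_robust"] = 0.6745 * (x - med) / mad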

df["duration_outlier"] = (df["duration_z_robust"].abs() > 5)
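
To see why median and MAD are used instead of mean and standard deviation, the sketch below compares both scores on a handful of hypothetical durations with one extreme value. The helper name `robust_z` and the sample values are illustrative assumptions, not part of the assignment.

```python
import pandas as pd

def robust_z(x: pd.Series) -> pd.Series:
    """Robust z-score: 0.6745 * (x - median) / MAD; all zeros when MAD is 0."""
    x = x.astype(float)
    med = x.median()
    mad = (x - med).abs().median()
    if mad == 0:
        return pd.Series(0.0, index=x.index)
    return 0.6745 * (x - med) / mad

# Hypothetical durations in seconds with one extreme value.
durations = pd.Series([400, 450, 500, 550, 600, 86000])

z_robust = robust_z(durations)
z_classic = (durations - durations.mean()) / durations.std()

flagged_robust = durations[z_robust.abs() > 5].tolist()
flagged_classic = durations[z_classic.abs() > 5].tolist()

print(pd.DataFrame({
    "duration": durations,
    "z_robust": z_robust.round(2),
    "z_classic": z_classic.round(2),
}))
print("flagged by robust z:", flagged_robust)
print("flagged by classic z:", flagged_classic)
```

In this toy sample the extreme value inflates the mean and standard deviation so much that its classic z-score stays below 5, while the median and MAD barely move and the robust score flags it.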


## 4. Groupby + transform (window thinking)

For each pickup weekday:
- Compute average trip duration
- Attach this average back to each row using `transform`
- Create `duration_vs_weekday_avg = trip_duration - weekday_avg`

Show:
- `head()` of relevant columns

Write as a comment:
- Why is `transform` the right tool here?


In [23]:
weekday_avg = df.groupby("pickup_weekday")["trip_duration"].transform("mean")
df["duration_vs_weekday_avg"] = df["trip_duration"] - weekday_avg

df[["trip_duration", "pickup_weekday", "duration_vs_weekday_avg"]].head()

# transform returns a per-row value aligned to the original rows (same length as df),
# so you can compare each trip to its weekday baseline without collapsing the dataset.



|   | trip_duration | pickup_weekday | duration_vs_weekday_avg |
|---|---|---|---|
| 0 | 455 | 0 | -442.947839 |
| 1 | 663 | 6 | -238.639395 |
| 2 | 2124 | 1 | 1140.536876 |
| 3 | 429 | 2 | -546.450494 |
| 4 | 435 | 5 | -513.051175 |
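
The difference between collapsing and broadcasting is easiest to see on a small example. This sketch uses a hypothetical five-row frame `toy` and contrasts a plain group mean (one row per weekday) with `transform("mean")` (one value per original row); the `map` variant is included only for comparison.

```python
import pandas as pd

# Hypothetical trips: two on weekday 0, three on weekday 1.
toy = pd.DataFrame({
    "pickup_weekday": [0, 0, 1, 1, 1],
    "trip_duration": [400, 600, 300, 600, 900],
})

# Aggregation collapses the data to one row per weekday.
per_weekday = toy.groupby("pickup_weekday")["trip_duration"].mean()

# transform broadcasts the group mean back onto every original row.
weekday_avg = toy.groupby("pickup_weekday")["trip_duration"].transform("mean")
toy["duration_vs_weekday_avg"] = toy["trip_duration"] - weekday_avg

# Same result via aggregate-then-map, which needs an extra lookup step.
via_map = toy["trip_duration"] - toy["pickup_weekday"].map(per_weekday)

print("rows after aggregation:", len(per_weekday))
print("rows after transform:", len(weekday_avg))
print(toy)
```

Because the transformed series shares the index of `toy`, the subtraction aligns row by row with no merge or lookup.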


## 9. Data quality rule across columns

Create boolean `trip_data_suspicious`, set to True if either of the following holds:
- the trip is marked as having passengers (`passenger_count > 0`) AND `trip_duration == 0`
- the pickup and dropoff coordinates are identical

Show:
- number of suspicious rows
- small sample




In [24]:
coords_identical = (
    (df["pickup_longitude"] == df["dropoff_longitude"]) &
    (df["pickup_latitude"] == df["dropoff_latitude"])
)

df["trip_data_suspicious"] = (
    ((df["passenger_count"] > 0) & (df["trip_duration"] == 0))
    | coords_identical
)

print("suspicious rows:", int(df["trip_data_suspicious"].sum()))

df.loc[df["trip_data_suspicious"],
       ["passenger_count", "trip_duration", "pickup_longitude", "pickup_latitude", "dropoff_longitude", "dropoff_latitude"]
      ].head(10)


suspicious rows: 5897


|   | passenger_count | trip_duration | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude |
|---|---|---|---|---|---|---|
| 246 | 1 | 227 | -73.981819 | 40.768963 | -73.981819 | 40.768963 |
| 291 | 2 | 1109 | -73.959068 | 40.775661 | -73.959068 | 40.775661 |
| 407 | 6 | 947 | -73.808487 | 40.687336 | -73.808487 | 40.687336 |
| 702 | 1 | 580 | -73.78672 | 40.647041 | -73.78672 | 40.647041 |
| 1620 | 1 | 27 | -74.209854 | 40.816853 | -74.209854 | 40.816853 |
| 1728 | 1 | 19 | -73.776314 | 40.645454 | -73.776314 | 40.645454 |
| 1769 | 5 | 254 | -73.954666 | 40.821003 | -73.954666 | 40.821003 |
| 2087 | 1 | 248 | -73.954628 | 40.77718 | -73.954628 | 40.77718 |
| 2441 | 1 | 8 | -73.78183 | 40.644699 | -73.78183 | 40.644699 |
| 2609 | 5 | 1212 | -73.875313 | 40.773682 | -73.875313 | 40.773682 |
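
A rule that mixes AND and OR can be read two ways, and the grouping changes which rows are flagged. The sketch below evaluates both readings on a hypothetical four-row frame. The `same_coords` column is a made-up shortcut standing in for the coordinate comparison.

```python
import pandas as pd

# Hypothetical trips; same_coords stands in for the coordinate equality check.
toy = pd.DataFrame({
    "passenger_count": [1, 0, 0, 2],
    "trip_duration": [0, 0, 500, 500],
    "same_coords": [False, False, True, False],
})

has_passengers = toy["passenger_count"] > 0
zero_duration = toy["trip_duration"] == 0
same_coords = toy["same_coords"]

# Reading A: (passengers AND zero duration) OR identical coordinates.
reading_a = (has_passengers & zero_duration) | same_coords

# Reading B: passengers AND (zero duration OR identical coordinates).
reading_b = has_passengers & (zero_duration | same_coords)

print(pd.DataFrame({"reading_a": reading_a, "reading_b": reading_b}))
print("rows where the readings disagree:", int((reading_a != reading_b).sum()))
```

Explicit parentheses document which reading is intended. Without them Python applies `&` before `|`, which corresponds to reading A.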


## 10. Capstone: clean analysis table

Create `analysis_df` containing:
- `trip_duration`
- `pickup_hour`
- `pickup_weekday`
- `duration_vs_weekday_avg`
- `duration_outlier` (as int 0/1)

Requirements:
- No missing values in engineered columns
- Show `analysis_df.head()` and `analysis_df.isna().sum()`

Write as a comment:
- Which columns would you trust least for modeling, and why?


In [27]:
analysis_df = df[[
    "trip_duration",
    "pickup_hour",
    "pickup_weekday",
    "duration_vs_weekday_avg",
    "duration_outlier",
]].copy()

analysis_df["duration_outlier"] = analysis_df["duration_outlier"].astype(int)

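# Requirement: no missing values in engineered columns.
# These fills only change rows where datetime parsing produced NaT. A filled 0 is
# indistinguishable from a real midnight pickup (hour 0) or a Monday (weekday 0),
# so check how many rows are affected before relying on the filled values.
analysis_df["duration_vs_weekday_avg"] = analysis_df["duration_vs_weekday_avg"].fillna(0.0)
analysis_df["pickup_hour"] = analysis_df["pickup_hour"].fillna(0).astype(int)
analysis_df["pickup_weekday"] = analysis_df["pickup_weekday"].fillna(0).astype(int)

analysis_df.head()

# Least trustworthy for modeling: duration_vs_weekday_avg and duration_outlier.
# Both are computed from trip_duration itself, so if trip_duration is the
# prediction target, using them as features would leak the target.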


|   | trip_duration | pickup_hour | pickup_weekday | duration_vs_weekday_avg | duration_outlier |
|---|---|---|---|---|---|
| 0 | 455 | 17 | 0 | -442.947839 | 0 |
| 1 | 663 | 0 | 6 | -238.639395 | 0 |
| 2 | 2124 | 11 | 1 | 1140.536876 | 0 |
| 3 | 429 | 19 | 2 | -546.450494 | 0 |
| 4 | 435 | 13 | 5 | -513.051175 | 0 |
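
The requirements also ask for `isna().sum()`, which makes the "no missing values" claim checkable. The sketch below runs that check on a hypothetical three-row frame and, as one possible alternative to filling with 0, drops rows with a missing engineered value. Whether dropping or filling is appropriate depends on the analysis; this is an illustration, not the required approach.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for analysis_df with one missing engineered value.
toy = pd.DataFrame({
    "trip_duration": [455, 663, 2124],
    "pickup_hour": [17.0, np.nan, 11.0],
    "duration_outlier": [0, 0, 0],
})

missing_before = toy.isna().sum()

# Dropping keeps only observed hours; filling with 0 would relabel an
# unknown hour as midnight.
clean = toy.dropna(subset=["pickup_hour"]).astype({"pickup_hour": int})
missing_after = clean.isna().sum()

print(missing_before)
print(missing_after)

# Fail loudly if any column still contains NaN.
assert missing_after.sum() == 0, "engineered columns still contain NaN"
```

Reporting the missing counts before cleaning shows how many rows the cleaning step actually touched.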
