# NYC Taxi Trip Data Analysis with Databricks (PySpark & SQL)

This notebook demonstrates basic data analysis using PySpark in Databricks.
We use a small NYC taxi dataset to:
- Load data from CSV
- Explore data structure
- Perform basic transformations
- Run SQL queries
- Create simple visualizations

**Author:** Your Name
**Date:** 2025-08-13


In [None]:
# Load the CSV file into a Spark DataFrame, inferring column types from the data
df = spark.read.csv("/FileStore/nyc_taxi_sample.csv", header=True, inferSchema=True)
df.show(5)
df.printSchema()

## Data Cleaning
We count missing values (nulls and NaNs) per column, then compute basic descriptive statistics.

In [None]:
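from pyspark.sql.functions import col, count, isnan, when

# Count missing values per column; isnan() is only valid on float/double columns
float_cols = {name for name, dtype in df.dtypes if dtype in ("float", "double")}
df.select([
    count(when((col(c).isNull() | isnan(c)) if c in float_cols else col(c).isNull(), c)).alias(c)
    for c in df.columns
]).show()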

# Show basic stats
df.describe().show()
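
The missing-value count combines a null test with a NaN test because a missing number can show up either way, and NaN never compares equal to itself. Below is a minimal plain-Python sketch of that distinction. The rows, and the helper name `is_missing`, are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
import math

# Hypothetical rows for illustration only; not taken from the taxi dataset
rows = [
    {"fare_amount": 12.5, "pickup_location": "Midtown"},
    {"fare_amount": float("nan"), "pickup_location": None},
    {"fare_amount": None, "pickup_location": "JFK"},
]


def is_missing(value):
    """Treat both None (null) and float NaN as missing."""
    if value is None:
        return True
    return isinstance(value, float) and math.isnan(value)


# Count missing values per column, mirroring the per-column count in the Spark cell
missing_counts = {
    column: sum(is_missing(row[column]) for row in rows)
    for column in rows[0]
}
print(missing_counts)
```

An equality test such as `value == float("nan")` would never match, which is why a dedicated NaN check is needed alongside the null check.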

## Transformations
We add a `trip_duration_minutes` column: the time between pickup and dropoff, in minutes, rounded to two decimal places.

In [None]:
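# Alias Spark's round so it does not shadow Python's built-in round()
from pyspark.sql.functions import col, round as spark_round, unix_timestamp

# Duration in minutes: difference of the two epoch-second timestamps, divided by 60
df = df.withColumn(
    "trip_duration_minutes",
    spark_round((unix_timestamp(col("dropoff_datetime")) - unix_timestamp(col("pickup_datetime"))) / 60, 2)
)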

df.show(5)
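
To make the duration formula concrete, here is a small worked example in plain Python. The pickup and dropoff times are made up for illustration; the arithmetic (elapsed seconds divided by 60, rounded to two decimals) is the same as in the Spark expression.

```python
from datetime import datetime

# Hypothetical pickup and dropoff times, chosen only to illustrate the arithmetic
pickup = datetime(2025, 1, 1, 8, 15, 0)
dropoff = datetime(2025, 1, 1, 8, 37, 45)

# Same formula as the Spark expression: seconds elapsed, divided by 60, rounded to 2 places
elapsed_seconds = (dropoff - pickup).total_seconds()
trip_duration_minutes = round(elapsed_seconds / 60, 2)
print(trip_duration_minutes)
```

Here 22 minutes and 45 seconds is 1365 seconds, which gives 22.75 minutes. A negative value would mean the dropoff time precedes the pickup time, which is usually a sign of a bad record.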

## Register as SQL Table and Query
We can use Spark SQL to run queries on the same data.

In [None]:
df.createOrReplaceTempView("nyc_taxi")

spark.sql("""
SELECT pickup_location, COUNT(*) as trip_count, ROUND(AVG(fare_amount), 2) as avg_fare
FROM nyc_taxi
GROUP BY pickup_location
ORDER BY trip_count DESC
""").show()

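One way to sanity-check the aggregation logic without a cluster is to run the same query text against a tiny in-memory table. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Spark SQL. The rows are hypothetical, and SQLite is a different engine, so this only checks the query's logic, not Spark's behavior.

```python
import sqlite3

# Hypothetical rows for illustration only; not taken from the taxi dataset
rows = [
    ("Midtown", 10.0), ("Midtown", 12.5), ("Midtown", 15.0),
    ("JFK", 52.0), ("JFK", 48.0),
    ("Harlem", 8.25),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nyc_taxi (pickup_location TEXT, fare_amount REAL)")
conn.executemany("INSERT INTO nyc_taxi VALUES (?, ?)", rows)

# Same query shape as the Spark SQL cell: count and average fare per pickup location
result = conn.execute("""
SELECT pickup_location, COUNT(*) AS trip_count, ROUND(AVG(fare_amount), 2) AS avg_fare
FROM nyc_taxi
GROUP BY pickup_location
ORDER BY trip_count DESC
""").fetchall()
conn.close()

for row in result:
    print(row)
```

With these rows, the busiest location comes first: Midtown with 3 trips averaging 12.5, then JFK with 2 trips averaging 50.0, then Harlem with 1 trip at 8.25.
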
## Visualization Example
We'll visualize average fare per pickup location.

In [None]:
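import matplotlib.pyplot as plt
from pyspark.sql.functions import avg

# Aggregate in Spark first so only one row per pickup location is collected to the driver
pdf = df.groupBy("pickup_location").agg(avg("fare_amount").alias("avg_fare")).toPandas()

plt.figure(figsize=(8, 5))
plt.bar(pdf["pickup_location"], pdf["avg_fare"])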
plt.xticks(rotation=45)
plt.xlabel("Pickup Location")
plt.ylabel("Average Fare ($)")
plt.title("Average Fare per Pickup Location")
plt.show()