<a href="https://colab.research.google.com/github/ShindeAnjali2k6/DataAnalysis/blob/main/BIG_DATA_ANALYSIS_CT_T1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# 1. Install Java
!apt-get install openjdk-11-jdk-headless -qq > /dev/null

# 2. Download Spark 3.3.2 (stable with py4j)
!wget -q https://archive.apache.org/dist/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
!tar -xzf spark-3.3.2-bin-hadoop3.tgz

# 3. Install findspark
!pip install -q findspark


In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.3.2-bin-hadoop3"

import findspark
findspark.init()


In [None]:
import glob
print(glob.glob("/content/spark-3.3.2-bin-hadoop3/python/lib/py4j-*.zip"))


['/content/spark-3.3.2-bin-hadoop3/python/lib/py4j-0.10.9.5-src.zip']


In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("NYC Taxi Trip Analysis") \
    .getOrCreate()


In [None]:
from google.colab import files
uploaded = files.upload()  # upload sample_nyc_taxi_trips_10k.csv

df = spark.read.csv("sample_nyc_taxi_trips_10k.csv", header=True, inferSchema=True)
df.show(5)


Saving sample_nyc_taxi_trips_10k.csv to sample_nyc_taxi_trips_10k.csv
+---------+-------------------+-------------------+---------------+-------------+------------+------------+
|vendor_id|    pickup_datetime|   dropoff_datetime|passenger_count|trip_distance|payment_type|total_amount|
+---------+-------------------+-------------------+---------------+-------------+------------+------------+
|      VTS|2023-01-11 23:15:00|2023-01-11 23:33:00|              5|        16.15|     Unknown|       99.07|
|      CMT|2023-01-01 14:20:00|2023-01-01 14:52:00|              4|         0.83|     Dispute|       44.07|
|      VTS|2023-01-27 11:58:00|2023-01-27 12:54:00|              4|         3.61| Credit card|       44.53|
|      VTS|2023-01-08 20:04:00|2023-01-08 20:57:00|              1|        16.44|   No charge|       18.55|
|      CMT|2023-01-05 08:25:00|2023-01-05 08:32:00|              2|         1.83| Credit card|       51.34|
+---------+-------------------+-------------------+---------------+-------------+------------+------------+

In [None]:
df.printSchema()


root
 |-- vendor_id: string (nullable = true)
 |-- pickup_datetime: timestamp (nullable = true)
 |-- dropoff_datetime: timestamp (nullable = true)
 |-- passenger_count: integer (nullable = true)
 |-- trip_distance: double (nullable = true)
 |-- payment_type: string (nullable = true)
 |-- total_amount: double (nullable = true)



In [None]:
df.groupBy("passenger_count").count().orderBy("count", ascending=False).show()


+---------------+-----+
|passenger_count|count|
+---------------+-----+
|              5| 2148|
|              4| 2026|
|              3| 2010|
|              1| 1920|
|              2| 1896|
+---------------+-----+
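
The raw counts are easier to compare as shares of the total. Below is a minimal sketch in plain Python using the counts printed above. The helper name `to_shares` is hypothetical, and the commented lines show one way the dictionary could be built from the Spark result instead of being typed in by hand.

```python
def to_shares(counts):
    """Convert a {category: count} mapping into {category: percent of total}."""
    total = sum(counts.values())
    return {key: round(100 * value / total, 2) for key, value in counts.items()}

# Counts copied from the groupBy output above. On the live DataFrame they
# could be collected with:
# counts = {row["passenger_count"]: row["count"]
#           for row in df.groupBy("passenger_count").count().collect()}
passenger_counts = {5: 2148, 4: 2026, 3: 2010, 1: 1920, 2: 1896}

shares = to_shares(passenger_counts)
for passengers, pct in shares.items():
    print(f"{passengers} passengers: {pct}%")
```

Every share falls between 18.96% and 21.48%, so no single passenger count dominates. The same helper applies unchanged to any other `groupBy(...).count()` result.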



In [None]:
df.select("trip_distance", "total_amount").summary().show()


+-------+-----------------+------------------+
|summary|    trip_distance|      total_amount|
+-------+-----------------+------------------+
|  count|            10000|             10000|
|   mean|10.05650899999996|51.703217999999865|
| stddev|5.721878267842658| 27.80625805503466|
|    min|              0.1|              3.01|
|    25%|             5.17|             27.87|
|    50%|            10.12|             51.59|
|    75%|            14.98|             75.71|
|    max|             20.0|             99.99|
+-------+-----------------+------------------+
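
The summary gives enough to estimate an aggregate cost per mile. Below is a rough sketch using the mean and median values printed above. It computes a ratio of aggregates, not the mean of per-trip ratios; the latter would need a derived column on the DataFrame, shown here only as a comment with `amount_per_mile` as an assumed column name.

```python
# Values copied from the summary() output above.
mean_distance, mean_amount = 10.05650899999996, 51.703217999999865
median_distance, median_amount = 10.12, 51.59

ratio_of_means = mean_amount / mean_distance
ratio_of_medians = median_amount / median_distance

print(f"mean amount / mean distance:     ${ratio_of_means:.2f} per mile")
print(f"median amount / median distance: ${ratio_of_medians:.2f} per mile")

# Per-trip values need a derived column on the live DataFrame, for example:
# from pyspark.sql import functions as F
# df.withColumn("amount_per_mile", F.col("total_amount") / F.col("trip_distance")) \
#   .select("amount_per_mile").summary().show()
```

Both ratios come out near $5.1 per mile (about 5.14 and 5.10). Per-trip values can differ a lot from this, because short trips with large totals inflate the ratio.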



In [None]:
df.groupBy("payment_type").count().orderBy("count", ascending=False).show()


+------------+-----+
|payment_type|count|
+------------+-----+
|     Unknown| 2037|
| Credit card| 2008|
|        Cash| 2001|
|   No charge| 1987|
|     Dispute| 1967|
+------------+-----+



In [None]:
df.select("trip_distance", "total_amount").orderBy(df.trip_distance.desc()).show(1)


+-------------+------------+
|trip_distance|total_amount|
+-------------+------------+
|         20.0|       47.41|
+-------------+------------+
only showing top 1 row



In [None]:
df_clean = df.dropna()
print("Rows after dropping nulls:", df_clean.count())


Rows after dropping nulls: 10000


✨ Insights derived from big data processing
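1️⃣ The dataset contains 10,000 taxi trips. Dropping rows with null values removed none, so every row is complete.
2️⃣ Passenger count distribution: The most common count is 5 passengers (2148 trips, about 21.5%), followed by 4 passengers (2026 trips). Counts 1 through 5 are spread almost evenly, each holding roughly 19% to 21.5% of trips.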
3️⃣ Trip distance stats:

Average trip distance ≈ 10.06 miles

Longest trip distance = 20.0 miles

Median trip distance = 10.12 miles (half of the trips are at or below this)

4️⃣ Fare stats:

Mean fare: $51.70

Most expensive fare: $99.99

Fare per mile was not computed per trip. As a rough aggregate, mean fare divided by mean distance ≈ $5.14 per mile (51.70 / 10.06)

5️⃣ Payment type distribution:

Largest category: Unknown (2037 trips)

Credit card: 2008 trips

Cash: 2001 trips

6️⃣ A trip at the maximum distance (20.0 miles) charged $47.41, below the mean fare of $51.70. This single row suggests, but does not establish, that fare does not scale linearly with distance.
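
Whether fare tracks distance is better judged across many rows than from one trip. Below is a minimal sketch of a Pearson correlation in plain Python, applied only to the five rows printed by `df.show(5)`. The result is illustrative and says nothing reliable about the full dataset. The `pearson` helper is hypothetical; on the live DataFrame, `df.stat.corr("trip_distance", "total_amount")` computes the same statistic over all rows.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# The five sample rows printed by df.show(5): (trip_distance, total_amount).
sample = [(16.15, 99.07), (0.83, 44.07), (3.61, 44.53), (16.44, 18.55), (1.83, 51.34)]
distances = [d for d, _ in sample]
amounts = [a for _, a in sample]

r = pearson(distances, amounts)
print(f"Pearson r on 5 sample rows: {r:.3f}")
```

A coefficient near 1 would mean fare rises steadily with distance. A coefficient near 0 over the full dataset would support the observation in item 6.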