**AIM:** To analyze an e-commerce dataset using Apache Spark SQL.

Data Analysis Using Spark SQL:

1. Load the dataset into Spark and understand its schema and structure.

2. Perform analytical operations on the dataset to derive insights:
*   Identify the country with the highest total purchase value.
*   Calculate the number of unique customers from each country.
*   Determine the earliest purchase date for each customer.
3.  Display the results in a structured format.

In [None]:
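from pyspark.sql import SparkSession
# Note: min and sum imported here are Spark column functions; within this
# notebook they shadow Python's built-in min() and sum().
from pyspark.sql.functions import col, countDistinct, min, sum, desc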

Step 1: Initialize SparkSession

The SparkSession is the entry point to Spark. It initializes the Spark environment so that we can use Spark SQL for data analysis.

In [None]:
# Step 1: Initialize SparkSession
spark = SparkSession.builder \
    .appName("Ecommerce Analysis") \
    .getOrCreate()

Step 2: Load the Dataset

*   The CSV file is loaded into Spark as a DataFrame.
*   header=True ensures that the first row is treated as column names.
*   inferSchema=True automatically detects the data type of each column, at the cost of an extra pass over the data.


In [None]:
# Step 2: Load the dataset
file_path = "/content/ecommerce_sample.csv"  # Replace with the actual dataset path
data = spark.read.csv(file_path, header=True, inferSchema=True)
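
To make the two options concrete, here is a plain-Python analogy that uses only the standard library. It is an illustrative sketch, not how Spark parses CSV files: the inline sample text and the infer helper are hypothetical, and only three of the dataset's columns are included.

```python
import csv
import io

# Hypothetical inline sample standing in for the CSV file.
sample = "Unit Price,Quantity,Country\n2.55,6,Australia\n4.25,50,Canada\n"


def infer(value):
    """Try int, then float; fall back to the original string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            continue
    return value


# DictReader takes field names from the first row (the role of header=True).
raw_rows = list(csv.DictReader(io.StringIO(sample)))

# Without inference every value is a string; infer() plays the role of inferSchema=True.
typed_rows = [{name: infer(value) for name, value in row.items()} for row in raw_rows]

print(raw_rows[0])
print(typed_rows[0])
```

The first printed row holds only strings, while the second holds a float, an integer, and a string, which mirrors the double, integer, and string types that Spark reports for these columns.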

Step 3: Display Schema

printSchema() displays the structure of the DataFrame: column names, data types, and nullability. Note that Invoice Date is inferred as a string, not as a date type.

In [None]:
# Step 3: Display the schema
print("Schema:")
data.printSchema()

Schema:
root
 |-- Invoice No.: integer (nullable = true)
 |-- Invoice Date: string (nullable = true)
 |-- Customer-ID: integer (nullable = true)
 |-- Unit Price: double (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- Country: string (nullable = true)



Step 4: Display First Five Rows

In [None]:
# Step 4: Display the first five rows
print("First five rows:")
data.show(5)

First five rows:
+-----------+------------+-----------+----------+--------+---------+
|Invoice No.|Invoice Date|Customer-ID|Unit Price|Quantity|  Country|
+-----------+------------+-----------+----------+--------+---------+
|     536365|   12/1/2021|      17850|      2.55|       6|Australia|
|     536366|   12/1/2021|      17850|      3.39|      10|Australia|
|     536367|   12/1/2021|      13047|      2.75|      32|      USA|
|     536368|   12/1/2021|      12583|      4.25|      50|   Canada|
|     536369|   12/1/2021|      13047|      1.99|      24|  Germany|
+-----------+------------+-----------+----------+--------+---------+



Step 5: Calculate Total Purchase per Country

*  Create a new column, Total_Purchase, as the product of Unit Price and Quantity.
*  Use groupBy and sum to calculate the total purchases for each country.
*  Sort the results in descending order and keep only the first row, which is the country with the highest total purchase value.
*  Display the result.

In [None]:
# Step 5: Country with the highest purchase
# Assuming "Unit Price" * "Quantity" gives the total purchase value
data = data.withColumn("Total_Purchase", col("Unit Price") * col("Quantity"))

highest_purchase_country = (
    data.groupBy("Country")
    .agg(sum("Total_Purchase").alias("Total_Purchase"))
    .orderBy(desc("Total_Purchase"))
    .limit(1)
)
print("Country with the highest purchase:")
highest_purchase_country.show()

Country with the highest purchase:
+-------+--------------+
|Country|Total_Purchase|
+-------+--------------+
| Canada|         212.5|
+-------+--------------+
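
As a sanity check, the same grouping and summing can be reproduced in plain Python on the five rows displayed in Step 4. This is only a sketch: it assumes those five rows are the whole sample, and the variable names are illustrative.

```python
from collections import defaultdict

# (Unit Price, Quantity, Country) for the five rows displayed in Step 4.
rows = [
    (2.55, 6, "Australia"),
    (3.39, 10, "Australia"),
    (2.75, 32, "USA"),
    (4.25, 50, "Canada"),
    (1.99, 24, "Germany"),
]

# Equivalent of groupBy("Country") followed by a sum of Unit Price * Quantity.
totals = defaultdict(float)
for unit_price, quantity, country in rows:
    totals[country] += unit_price * quantity

# Equivalent of ordering by the total in descending order and taking one row.
top_country, top_total = max(totals.items(), key=lambda item: item[1])
print(top_country, top_total)  # Canada 212.5
```

Canada's single row (4.25 × 50 = 212.5) outweighs the two Australian rows combined, which agrees with the Spark output.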



Step 6: Count Customers from Each Country

*  Group the data by Country.
*  Use countDistinct to calculate the number of unique customers in each country.
*  Sort the result in descending order to see countries with the most customers.



In [None]:
# Step 6: Number of customers from each country
customers_per_country = (
    data.groupBy("Country")
    .agg(countDistinct("Customer-ID").alias("Number_of_Customers"))
    .orderBy(desc("Number_of_Customers"))
)
print("Number of customers from each country:")
customers_per_country.show()

Number of customers from each country:
+---------+-------------------+
|  Country|Number_of_Customers|
+---------+-------------------+
|  Germany|                  1|
|      USA|                  1|
|   Canada|                  1|
|Australia|                  1|
+---------+-------------------+
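
The distinct count can be mirrored in plain Python with one set per country. The sketch below again assumes that the five rows from Step 4 are the whole sample; the variable names are illustrative.

```python
from collections import defaultdict

# (Customer-ID, Country) for the five rows displayed in Step 4.
rows = [
    (17850, "Australia"),
    (17850, "Australia"),
    (13047, "USA"),
    (12583, "Canada"),
    (13047, "Germany"),
]

# A set keeps each Customer-ID once per country, like countDistinct.
customers = defaultdict(set)
for customer_id, country in rows:
    customers[country].add(customer_id)

customers_per_country = {country: len(ids) for country, ids in customers.items()}
distinct_overall = len({customer_id for customer_id, _ in rows})

print(customers_per_country)
print(distinct_overall)  # 3
```

Customer 13047 appears under both USA and Germany, so that customer is counted once in each country. The per-country counts therefore add up to 4 even though the sample contains only 3 distinct customers.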



Step 7: Find Earliest Purchase per Customer

*  Group the data by Customer-ID.
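*  Use the min function to find the earliest purchase date for each customer. Because Invoice Date was loaded as a string, min and the sort compare the values as text, which matches chronological order only when all dates share the same format and width.
*  Display the result in ascending order by purchase date.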

In [None]:
# Step 7: Earliest purchase made by customer
earliest_purchase = (
    data.groupBy("Customer-ID")
    .agg(min("Invoice Date").alias("Earliest_Purchase"))
    .orderBy("Earliest_Purchase")
)
print("Earliest purchase made by customers:")
earliest_purchase.show(5)  # Display first five records for brevity

Earliest purchase made by customers:
+-----------+-----------------+
|Customer-ID|Earliest_Purchase|
+-----------+-----------------+
|      12583|        12/1/2021|
|      13047|        12/1/2021|
|      17850|        12/1/2021|
+-----------+-----------------+
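
Every date in this sample is identical, so the result above is unaffected, but the difference between text order and chronological order matters for larger data. The plain-Python sketch below uses two hypothetical dates and assumes the month/day/year format. It uses sorted rather than min because the notebook's imports shadow the built-in min.

```python
from datetime import datetime

# Hypothetical invoice dates, assumed to be in month/day/year text format.
invoice_dates = ["12/1/2021", "2/1/2021"]

# Text comparison goes character by character, so "12/..." sorts before "2/...".
earliest_as_text = sorted(invoice_dates)[0]

# Parsing first gives a chronological comparison.
parsed = [datetime.strptime(d, "%m/%d/%Y") for d in invoice_dates]
earliest_as_date = sorted(parsed)[0]

print(earliest_as_text)         # 12/1/2021
print(earliest_as_date.date())  # 2021-02-01
```

In Spark, one way to get the chronological behaviour would be to parse the column before aggregating, for example with to_date from pyspark.sql.functions and a pattern such as "M/d/yyyy", provided the dates really are month/day/year.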



**RESULT:**
The e-commerce dataset was loaded into Spark and analyzed with Spark SQL.
Grouping and aggregation identified the country with the highest total purchase value (Canada, 212.5), the number of unique customers in each country, and the earliest purchase date for each customer.