# <span style="color:#1f77b4">**Data Analytics 01 - Data Ingest**</span>

This notebook pulls sample sales data into DBFS, loads it into Spark, and walks through a basic SQL-driven analysis with quick visual checks.


### <span style="color:#1f77b4">**Loading CSV datasets into the Databricks File System (DBFS)**</span>

Create a working folder in DBFS and download the raw CSV files.

- `%sh` runs the cell's contents as shell commands on the cluster's driver node.
- `wget -O` downloads each CSV file from GitHub to a path under `/dbfs`, the local mount point for DBFS.


In [None]:
%sh
# Create a DBFS folder and download sample CSVs
# -f and -p keep the cell re-runnable whether or not the folder already exists
rm -rf /dbfs/spark_lab
mkdir -p /dbfs/spark_lab
wget -O /dbfs/spark_lab/2019.csv https://raw.githubusercontent.com/Ch3rry-Pi3-Azure/DataBricks-Data-Analytics/refs/heads/main/data/2019.csv
wget -O /dbfs/spark_lab/2020.csv https://raw.githubusercontent.com/Ch3rry-Pi3-Azure/DataBricks-Data-Analytics/refs/heads/main/data/2020.csv
wget -O /dbfs/spark_lab/2021.csv https://raw.githubusercontent.com/Ch3rry-Pi3-Azure/DataBricks-Data-Analytics/refs/heads/main/data/2021.csv
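
Before loading the files into Spark, it can help to confirm that each download actually landed and is not empty, since a failed `wget -O` can leave a zero-byte file behind. The following is a minimal sketch using only the Python standard library. The helper name `find_missing_files` is hypothetical; the folder and file names are the ones used in the download cell above.

```python
from pathlib import Path

def find_missing_files(folder, names):
    """Return the names that are absent from `folder` or have zero size."""
    base = Path(folder)
    missing = []
    for name in names:
        path = base / name
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing

# Folder and file names taken from the download cell above
expected = ["2019.csv", "2020.csv", "2021.csv"]
problems = find_missing_files("/dbfs/spark_lab", expected)
print("Missing or empty files:", problems)
```

An empty list means all three files are present and non-empty.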

### <span style="color:#1f77b4">**Loading CSV files into a DataFrame**</span>

Read the CSVs into a Spark DataFrame and preview the rows.

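- `spark.read.load` reads files into a DataFrame. With `format='csv'` and no schema or header option, columns get default names (`_c0`, `_c1`, ...) and are all read as strings.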
- `display` renders a quick preview in Databricks.


In [None]:
# Load all CSVs into a Spark DataFrame
# Use the same absolute DBFS path as the schema-based read later in the notebook
df = spark.read.load('/spark_lab/*.csv', format='csv')
display(df.limit(100))

### <span style="color:#1f77b4">**Defining a schema for the DataFrame**</span>

Apply an explicit schema so dates, numbers, and strings parse consistently.

- `StructType` defines the full schema structure.
- `StructField` defines each column name, type, and nullability.


In [None]:
# Define an explicit schema for consistent types
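# Import only the types used below; wildcard imports from pyspark.sql.functions
# shadow Python built-ins such as sum, min, and max
from pyspark.sql.types import (
    StructType, StructField, StringType, IntegerType, DateType, FloatType
)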
orderSchema = StructType([
    StructField("SalesOrderNumber", StringType()),
    StructField("SalesOrderLineNumber", IntegerType()),
    StructField("OrderDate", DateType()),
    StructField("CustomerName", StringType()),
    StructField("Email", StringType()),
    StructField("Item", StringType()),
    StructField("Quantity", IntegerType()),
    StructField("UnitPrice", FloatType()),
    StructField("Tax", FloatType())
])
df = spark.read.load('/spark_lab/*.csv', format='csv', schema=orderSchema)
display(df.limit(100))
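
To make the effect of the schema concrete, here is a plain-Python analogue (not Spark code) that converts the string fields of one CSV line to the types declared in `orderSchema`. The sample line is invented for illustration, and `COLUMN_PARSERS` and `parse_order_row` are hypothetical names. A field that cannot be converted becomes `None`, which is similar in spirit to how Spark's default permissive CSV mode yields nulls for malformed values.

```python
import csv
from datetime import date
from io import StringIO

# Column name -> converter, mirroring the types in orderSchema
COLUMN_PARSERS = {
    "SalesOrderNumber": str,
    "SalesOrderLineNumber": int,
    "OrderDate": date.fromisoformat,
    "CustomerName": str,
    "Email": str,
    "Item": str,
    "Quantity": int,
    "UnitPrice": float,
    "Tax": float,
}

def parse_order_row(fields):
    """Convert a list of raw CSV strings into a dict of typed values."""
    typed = {}
    for (name, parser), raw in zip(COLUMN_PARSERS.items(), fields):
        try:
            typed[name] = parser(raw)
        except ValueError:
            typed[name] = None  # unparsable value, kept as a null
    return typed

# Invented sample line (not taken from the real files)
sample = "SO0001,1,2019-07-01,Jane Doe,jane@example.com,Sample Item,2,10.5,1.25\n"
fields = next(csv.reader(StringIO(sample)))
row = parse_order_row(fields)
print(row["OrderDate"], row["Quantity"], row["UnitPrice"])
```

Checking for unexpected nulls after a typed read is a quick way to spot columns whose declared type does not match the data.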

### <span style="color:#1f77b4">**Query Data using Spark SQL**</span>

Register a temp view and run SQL to explore and aggregate the data.

- `createOrReplaceTempView` exposes the DataFrame as a SQL view.
- `spark.sql` runs SQL over Spark data.


In [None]:
# Register a temp view and query with Spark SQL
df.createOrReplaceTempView("salesorders")
spark_df = spark.sql("SELECT * FROM salesorders")
display(spark_df)

In [None]:
# Aggregate gross revenue by year
sqlQuery = """
    SELECT CAST(YEAR(OrderDate) AS CHAR(4)) AS OrderYear,
           SUM((UnitPrice * Quantity) + Tax) AS GrossRevenue
    FROM salesorders
    GROUP BY CAST(YEAR(OrderDate) AS CHAR(4))
    ORDER BY OrderYear
"""
df_spark = spark.sql(sqlQuery)
df_spark.show()
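
The aggregation in the query above can be traced by hand in plain Python. This sketch applies the same formula, `(UnitPrice * Quantity) + Tax`, to a few invented order lines and sums the result per year. The values and the helper name `gross_revenue_by_year` are illustrative only and do not come from the dataset.

```python
from collections import defaultdict
from datetime import date

def gross_revenue_by_year(order_lines):
    """Sum (unit_price * quantity) + tax per order year, sorted by year."""
    totals = defaultdict(float)
    for order_date, quantity, unit_price, tax in order_lines:
        totals[str(order_date.year)] += (unit_price * quantity) + tax
    return dict(sorted(totals.items()))

# Invented order lines: (OrderDate, Quantity, UnitPrice, Tax)
order_lines = [
    (date(2019, 7, 1), 2, 10.0, 1.5),   # 2 * 10.0 + 1.5 = 21.5
    (date(2019, 8, 15), 1, 4.0, 0.5),   # 1 * 4.0 + 0.5 = 4.5
    (date(2020, 1, 3), 3, 2.0, 0.25),   # 3 * 2.0 + 0.25 = 6.25
]

revenue = gross_revenue_by_year(order_lines)
print(revenue)  # {'2019': 26.0, '2020': 6.25}
```

Note that tax is added once per order line, not once per unit, exactly as in the SQL expression.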

### <span style="color:#1f77b4">**Using Matplotlib for visualisation**</span>

Convert to Pandas and plot revenue by year for a quick sanity check.

- `toPandas` collects results to the driver for plotting.
- `plt.bar` creates the bar chart.


In [None]:
# Convert to Pandas for local plotting
from matplotlib import pyplot as plt

# Matplotlib plots local, in-memory data, so collect the small aggregated result into a pandas DataFrame
df_sales = df_spark.toPandas()
# Create a bar plot of revenue by year
plt.bar(x=df_sales['OrderYear'], height=df_sales['GrossRevenue'])
# Display the plot
plt.show()
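
A bare bar chart is easy to misread without axis labels. As a minimal sketch, the chart can also be built with Matplotlib's object-oriented interface and labelled explicitly. The revenue figures below are placeholders rather than results from the dataset, and `plot_revenue_by_year` is a hypothetical helper; in the notebook it would be called with `df_sales['OrderYear']` and `df_sales['GrossRevenue']`.

```python
from matplotlib import pyplot as plt

def plot_revenue_by_year(years, revenue):
    """Draw a labelled bar chart and return the figure and axes."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(years, revenue)
    ax.set_xlabel("Order year")
    ax.set_ylabel("Gross revenue")
    ax.set_title("Gross revenue by year")
    return fig, ax

# Placeholder values for illustration only
fig, ax = plot_revenue_by_year(["2019", "2020", "2021"], [100.0, 250.0, 175.0])
plt.show()
```

Returning the figure and axes keeps the helper reusable, for example to save the chart with `fig.savefig` or to adjust the axes further.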

### <span style="color:#1f77b4">**Using Seaborn Library**</span>

Recreate the chart with Seaborn for a cleaner, styled visual.

- `sns.barplot` builds a categorical bar chart with styling.
- `plt.show` renders the chart output.


In [None]:
# Use Seaborn for a styled bar chart
import seaborn as sns

# Clear the plot area
plt.clf()
# Create a bar chart
ax = sns.barplot(x="OrderYear", y="GrossRevenue", data=df_sales)
plt.show()