# Task 1: Set Up Your Environment

### 1.1 Install Apache Spark (Windows 11 Example)

#### **Prerequisites**  
- Java JDK (version 8 or later, up to 17) <br>
Spark runs on the JVM, so Java is required.

- Python 3.x <br>
For PySpark support and Jupyter integration.

- Environment variables (JAVA_HOME, SPARK_HOME, HADOOP_HOME) <br>
Set in the steps below so that Windows can locate Java, Spark, and winutils.exe.

---

#### **Install Java JDK**
Apache Spark requires Java (version 8 or later, up to 17; Spark 3.5 supports Java 8, 11, and 17). <br>

a. Download JDK <br>
Go to: https://adoptium.net

Select: Temurin 11 (LTS recommended) <br>

Download the Windows x64 MSI installer <br>

b. Install and Set JAVA_HOME <br>
After installing, open Environment Variables (press Win + S and search for "Environment Variables"). <br>

Under System variables, click New: <br>

Name: JAVA_HOME <br>

Value: C:\Program Files\Eclipse Adoptium\jdk-11.x.x (or whatever matches your version) <br>

Add %JAVA_HOME%\bin to the Path variable. <br>
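
The Java setup can be checked from Python once the variables are saved. The sketch below is a minimal example rather than part of the original steps: the helper names `parse_java_major` and `detect_java_major` are hypothetical, and it assumes that `java -version` prints a line containing `version "..."`, which is the usual format for Temurin builds.

```python
import os
import re
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x (e.g. "1.8.0_392").
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def detect_java_major():
    """Run `java -version` from PATH and return the major version, or None."""
    try:
        result = subprocess.run(
            ["java", "-version"], capture_output=True, text=True, check=False
        )
    except OSError:
        return None  # java is not on PATH or could not be started
    # The JDK writes its version banner to stderr, not stdout.
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    print("JAVA_HOME =", os.environ.get("JAVA_HOME", "<not set>"))
    major = detect_java_major()
    if major is None:
        print("java was not found on PATH")
    elif 8 <= major <= 17:
        print(f"Java {major} detected: within the 8-17 range used in this guide")
    else:
        print(f"Java {major} detected: outside the 8-17 range used in this guide")
```

Run it from a newly opened terminal, because variables set through the Windows dialog only reach programs started afterwards.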

---

#### **Install Apache Spark**
a. Download Spark Binary <br>
Go to: https://spark.apache.org/downloads.html

Choose a version (e.g., Spark 3.5.0) <br>

Choose a package type: Pre-built for Apache Hadoop 3 <br>

Download the archive (a .tgz file) and extract it so that the Spark files sit in a folder such as C:\spark <br>

b. Set SPARK_HOME and PATH <br>
In Environment Variables:

Add a new variable: <br>

Name: SPARK_HOME <br>

Value: C:\spark <br>

Add %SPARK_HOME%\bin to the Path <br>

---

#### **Install winutils.exe (for Hadoop compatibility)**
Spark on Windows relies on Hadoop's helper program winutils.exe for local file operations, even when no full Hadoop installation is present. <br>

a. Download winutils.exe <br>
Go to: <br>
https://github.com/steveloughran/winutils

Find your Hadoop version (e.g., hadoop-3.3.1\bin\winutils.exe) <br>

Place the winutils.exe file in: C:\hadoop\bin <br>

b. Set HADOOP_HOME <br>
In Environment Variables:

Name: HADOOP_HOME <br>

Value: C:\hadoop <br>

Also add %HADOOP_HOME%\bin to your Path <br>
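
To confirm that SPARK_HOME and HADOOP_HOME point at usable folders, a short Python check can help. This is a sketch under stated assumptions: the helper `check_homes` is hypothetical, and the expected files (`bin\spark-submit.cmd` for Spark, `bin\winutils.exe` for Hadoop) reflect the folder layout described in this guide.

```python
import os
from pathlib import Path

# Files this guide expects under each variable's folder (paths relative to it).
EXPECTED_FILES = {
    "SPARK_HOME": Path("bin") / "spark-submit.cmd",
    "HADOOP_HOME": Path("bin") / "winutils.exe",
}


def check_homes(env, expected=EXPECTED_FILES):
    """Return {variable name: problem description} for every failed check."""
    problems = {}
    for name, relative_file in expected.items():
        value = env.get(name)
        if not value:
            problems[name] = "variable is not set"
        elif not (Path(value) / relative_file).is_file():
            problems[name] = f"{relative_file} not found under {value}"
    return problems


if __name__ == "__main__":
    issues = check_homes(os.environ)
    for name in EXPECTED_FILES:
        print(f"{name}: {issues.get(name, 'OK')}")
```

If a variable shows as not set right after you created it, restart VS Code or the terminal so the new value is picked up.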

---
#### **Install PySpark in Your Virtual Environment**
In the VS Code terminal, with your virtual environment activated (press Ctrl + ` in VS Code to open a PowerShell terminal): <br>

```bash
pip install pyspark
```

(Optional: for Jupyter support)

```bash
pip install notebook findspark
```
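
After installing, it can be useful to confirm that the packages landed in the active virtual environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical.

```python
import sys
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # sys.prefix differs from sys.base_prefix inside a venv-style environment.
    print("Virtual environment active:", sys.prefix != sys.base_prefix)
    print("Python:", sys.version.split()[0])
    found = installed_versions(["pyspark", "notebook", "findspark"])
    for name, version in found.items():
        print(f"{name}: {version or 'not installed'}")
```

Printing the Python version next to the PySpark version makes it easier to spot a mismatch between the two.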


### 1.2 Set Up SQLite JDBC Driver

#### **Download SQLite JDBC Driver**
Go to the official repository: <br>

https://github.com/xerial/sqlite-jdbc


#### **Download the .jar from Maven Central**

https://repo1.maven.org/maven2/org/xerial/sqlite-jdbc/

Open the folder for the latest version and download its .jar file (e.g., sqlite-jdbc-3.49.1.0.jar). <br>

#### **Create a directory to store the driver**
Create a folder in your project directory to store the JDBC .jar file: <br>

```bash
mkdir jdbc_drivers
```
Then move the downloaded .jar into that folder (e.g., jdbc_drivers/sqlite-jdbc-3.49.1.0.jar). <br>
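
A relative path such as `jdbc_drivers/...` is resolved against the working directory of the notebook or script, which may not be the project root. The sketch below shows one way to locate the driver and turn it into an absolute path; the helper name `find_sqlite_jars` is hypothetical, and it assumes the file keeps its `sqlite-jdbc-<version>.jar` name.

```python
from pathlib import Path


def find_sqlite_jars(folder="jdbc_drivers"):
    """Return absolute paths of all sqlite-jdbc .jar files in `folder`, sorted by name."""
    base = Path(folder).resolve()
    jars = sorted(base.glob("sqlite-jdbc-*.jar"))
    if not jars:
        raise FileNotFoundError(f"No sqlite-jdbc-*.jar found in {base}")
    return [str(jar) for jar in jars]


if __name__ == "__main__":
    try:
        for jar in find_sqlite_jars():
            print("Found driver:", jar)
    except FileNotFoundError as error:
        print(error)
```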

In your Jupyter Notebook or script (spark.ipynb), you’ll reference the JDBC driver when creating the Spark session:


```python
from pyspark.sql import SparkSession

# Register the SQLite JDBC driver .jar when the session is created.
spark = SparkSession.builder \
    .appName("SmartSales") \
    .config("spark.jars", "jdbc_drivers/sqlite-jdbc-3.49.1.0.jar") \
    .getOrCreate()

print("Spark Session created successfully!")
```

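Output recorded when this cell was run:

```text
ModuleNotFoundError: No module named 'typing.io'; 'typing' is not a package
```

This error is raised while importing PySpark, before any session is created. It most likely means that the installed PySpark release does not support the Python version in use: the `typing.io` submodule was removed in Python 3.13.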

### 1.3 Verify PySpark Works in Jupyter Notebook

```python
import findspark
findspark.init()  # locate the Spark installation and make pyspark importable

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SmartSales").getOrCreate()
print(spark)
```

Output:

```text
<pyspark.sql.session.SparkSession object at 0x0000020FB2055BE0>
```


### 1.4 Verify the SQLite JDBC Driver Works

```python
# Read the "sales" table from the SQLite database through the JDBC driver.
df_sales = spark.read.format("jdbc") \
    .option("url", "jdbc:sqlite:data/dw/smart_sales.db") \
    .option("dbtable", "sales") \
    .option("driver", "org.sqlite.JDBC") \
    .load()

df_sales.show(5)
```

Output recorded when this cell was run:

```text
Py4JJavaError: An error occurred while calling o67.load.
: java.lang.ClassNotFoundException: org.sqlite.JDBC
	at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:445)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:592)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
	at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:46)
	at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1(JDBCOptions.scala:103)
	at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1$adapted(JDBCOptions.scala:103)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:103)
	at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:41)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:34)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:346)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:229)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:211)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:172)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:842)
```
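
`java.lang.ClassNotFoundException: org.sqlite.JDBC` means the JVM behind the Spark session could not load the driver class. One likely cause in this notebook is that the session in 1.3 is created without the `spark.jars` setting, and `getOrCreate()` in later cells returns that existing session. The sketch below is one possible way to guard against this, not a definitive fix: the helper names `jdbc_session_options` and `build_session` are hypothetical, and it assumes that the driver .jar must be configured before the JVM starts, so it refuses to continue when a session is already active.

```python
from pathlib import Path


def jdbc_session_options(jar_path):
    """Build Spark config options that register a JDBC driver .jar by absolute path."""
    jar = Path(jar_path).resolve()
    if not jar.is_file():
        raise FileNotFoundError(f"JDBC driver not found: {jar}")
    return {"spark.jars": str(jar)}


def build_session(app_name, options):
    """Create a SparkSession with `options`, failing if one is already running."""
    # Imported here so the path check above works even without PySpark installed.
    from pyspark.sql import SparkSession

    if SparkSession.getActiveSession() is not None:
        # Assumption: settings such as spark.jars are only read at JVM startup.
        raise RuntimeError(
            "A Spark session already exists; restart the kernel and run this first."
        )
    builder = SparkSession.builder.appName(app_name)
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Example use in a freshly restarted kernel (paths taken from this guide):
#   options = jdbc_session_options("jdbc_drivers/sqlite-jdbc-3.49.1.0.jar")
#   spark = build_session("SmartSales", options)
#   df_sales = spark.read.format("jdbc") \
#       .option("url", "jdbc:sqlite:data/dw/smart_sales.db") \
#       .option("dbtable", "sales") \
#       .option("driver", "org.sqlite.JDBC") \
#       .load()
```

Stopping a session with `spark.stop()` may leave the underlying JVM running, so restarting the kernel is the safer way to get a clean start.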
