## Installation Check
You need to be able to run every cell in this notebook without errors before you continue with the course.
Each step includes troubleshooting guidelines.

### Step 1 : Importing needed modules
Possible error solutions:
1. Check that the required packages are installed correctly (View > Tool windows > Python Packages). If not, install the packages from requirements.txt again.
2. If you changed the environment during this session, restart the PyCharm IDE.
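
As a quick cross-check of the first point, the sketch below reports whether the two top-level modules used in this step can be found by the interpreter that runs the notebook, without importing them. The helper names `package_status` and `installed_version` are made up for this illustration; `pyspark` and `delta-spark` are the distribution names on PyPI.

```python
from importlib import metadata, util


def package_status(import_names):
    """Map each top-level module name to True if the interpreter can find it."""
    return {name: util.find_spec(name) is not None for name in import_names}


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# `delta` is the import name provided by the `delta-spark` distribution.
print(package_status(["pyspark", "delta"]))
for dist in ("pyspark", "delta-spark"):
    print(dist, installed_version(dist))
```

If a module shows `False` here but appears in the Python Packages window, the notebook is probably running on a different interpreter than the one the packages were installed into.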

In [11]:
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

### Step 2 : Setting up environment variables
To avoid setting up the environment variables in your OS, this code sets the correct variables for you. Make sure you have adjusted ConnectionConfig.py with the correct directories.
Check the listed variables closely: PATH, HADOOP_HOME, JAVA_HOME and SPARK_HOME must all be set correctly.
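
Reading a long variable listing by eye is error-prone, so a small check like the one below can help. It is only a sketch and does not use ConnectionConfig.py (its internals are not shown here); the function name `check_environment` is hypothetical. It verifies that the three `*_HOME` variables are set and point to existing directories, and that PATH is not empty.

```python
import os
from pathlib import Path

REQUIRED_HOME_VARS = ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME")


def check_environment(env=None):
    """Return a list of problems found; an empty list means all checks passed."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("PATH"):
        problems.append("PATH is empty or not set")
    for name in REQUIRED_HOME_VARS:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not Path(value).is_dir():
            problems.append(f"{name} points to a missing directory: {value}")
    return problems


# Run this after cc.setupEnvironment() so the variables it sets are visible.
for problem in check_environment():
    print(problem)
```

An empty result only shows that the directories exist, not that they contain a working Java, Spark or Hadoop installation.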


In [12]:
import ConnectionConfig as cc
cc.setupEnvironment()
cc.listEnvironment()

HOMEBREW_PREFIX: /opt/homebrew
COMMAND_MODE: unix2003
INFOPATH: /opt/homebrew/share/info:
SHELL: /bin/zsh
PYTHONPATH: /Users/user/Desktop/sparkdelta
__CFBundleIdentifier: com.jetbrains.pycharm
TMPDIR: /var/folders/k_/tkt88xx94n17f7_nvrzjrkwc0000gn/T/
LC_ALL: en_US.UTF-8
HOME: /Users/user
HOMEBREW_REPOSITORY: /opt/homebrew
PATH: /Users/user/Desktop/sparkdelta/badEnvironment/bin:/Users/user/Library/Java/JavaVirtualMachines/temurin-21.0.2/Contents/Home/bin:/opt/homebrew/opt/python@3.11/bin:/opt/homebrew/opt/python@3.11/bin:/opt/homebrew/bin:/opt/homebrew/opt/python@3.11/bin:/opt/homebrew/bin:/opt/homebrew/sbin:/usr/local/bin:/System/Cryptexes/App/usr/bin:/usr/bin:/bin:/usr/sbin:/sbin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/local/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/bin:/var/run/com.apple.security.cryptexd/codex.system/bootstrap/usr/appleinternal/bin:/Users/user/Desktop/KDG/VMware Fusion.app/Contents/Public:/usr/local/go/bin:/Users/use

### Step 3 : Configuring the SparkSession
Possible error solutions:
1. Make sure the imports in step 1 succeeded.

In [13]:
builder = SparkSession.builder \
    .appName("InstallCheck") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.sql.shuffle.partitions", "4") \
    .master("local[*]")

builder = configure_spark_with_delta_pip(builder)




### Step 4 : Creating a local Spark cluster
This step starts the SparkSession on a local cluster.
If you already started a SparkSession with getOrCreate(), running this cell again does not change the session. To apply changed settings, restart the Jupyter server and rerun all cells above.
After running this step you will get the URL of the Spark server (click on Spark UI). Check that you can open the URL.
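
The URL check can also be done from code. The sketch below uses only the standard library; `is_reachable` is a hypothetical helper name. It bypasses any configured proxy because the Spark UI runs on your own machine, and it treats any HTTP response, even an error status, as proof that the server is up.

```python
from urllib.error import HTTPError, URLError
from urllib.request import ProxyHandler, build_opener


def is_reachable(url, timeout=3.0):
    """Return True if an HTTP server answers at `url`, False otherwise."""
    opener = build_opener(ProxyHandler({}))  # no proxies for a local server
    try:
        with opener.open(url, timeout=timeout):
            return True
    except HTTPError:
        return True  # the server answered, just not with a success status
    except (URLError, OSError, ValueError):
        return False  # refused, timed out, or not a valid URL


# Once the session exists, the UI address is available on the SparkContext:
# print(is_reachable(spark.sparkContext.uiWebUrl))
```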

Possible error solutions:

1. Make sure the previous step was executed correctly.
2. Check your environment variables again with cc.listEnvironment(). HADOOP_HOME, SPARK_HOME, JAVA_HOME and PATH have to be set according to the instructions in README.MD. In most cases the error message will tell you what went wrong.
3. Read the error message. If you don't get a clear error message, look at the Jupyter console (View > Tool windows > Jupyter). The console gives information about the startup process of the Spark server.
4. On Windows, make sure your HADOOP_HOME has winutils.exe in the bin directory. If not, see README.MD for clear instructions.
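
The winutils.exe requirement from the last point can be verified with a few lines of standard-library Python. This is a sketch only; `has_winutils` is a hypothetical helper name, and the check is skipped on other operating systems.

```python
import os
import sys
from pathlib import Path


def has_winutils(hadoop_home):
    """Return True if <hadoop_home>/bin/winutils.exe exists as a file."""
    if not hadoop_home:
        return False
    return (Path(hadoop_home) / "bin" / "winutils.exe").is_file()


# Only relevant on Windows; macOS and Linux do not need winutils.exe.
if sys.platform == "win32":
    print("winutils.exe found:", has_winutils(os.environ.get("HADOOP_HOME")))
```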

In [6]:
spark = builder.getOrCreate()

25/02/26 11:31:18 WARN Utils: Your hostname, MacBook-Pro-170.local resolves to a loopback address: 127.0.0.1; using 10.140.33.222 instead (on interface en0)
25/02/26 11:31:18 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


:: loading settings :: url = jar:file:/Users/user/Desktop/spark_and_hadop/spark-3.5.4-bin-hadoop3/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /Users/user/.ivy2/cache
The jars for the packages stored in: /Users/user/.ivy2/jars
io.delta#delta-spark_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-b791f7fc-5fee-40b7-ae9f-5a8384e283c4;1.0
	confs: [default]
	found io.delta#delta-spark_2.12;3.2.0 in central
	found io.delta#delta-storage;3.2.0 in central
	found org.antlr#antlr4-runtime;4.9.3 in central
:: resolution report :: resolve 282ms :: artifacts dl 15ms
	:: modules in use:
	io.delta#delta-spark_2.12;3.2.0 from central in [default]
	io.delta#delta-storage;3.2.0 from central in [default]
	org.antlr#antlr4-runtime;4.9.3 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   0  

In [7]:

spark.getActiveSession()

### Step 5 : Reading source into Spark DataFrame

Possible error solutions:
1. Make sure the file is present in the project at [file_location]
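
Because `file_location` is a relative path, it is resolved against the working directory of the notebook, which is a common reason for a "path does not exist" error. The sketch below (with the hypothetical helper `describe_source`) shows where the path actually points and whether a file is there.

```python
from pathlib import Path


def describe_source(file_location):
    """Report the absolute path a location resolves to and whether a file exists there."""
    path = Path(file_location)
    return {"resolved": str(path.resolve()), "exists": path.is_file()}


print("working directory:", Path.cwd())
print(describe_source("./FileStore/tables/shakespeare.txt"))
```

If `exists` is `False`, compare the resolved path with the location of the file in the project tree before changing any Spark settings.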

In [8]:
# File location and type
file_location = "./FileStore/tables/shakespeare.txt"
file_type = "text"

# Read the file as plain text: each line becomes one row in a single `value` column.
df = spark.read.format(file_type)  \
  .load(file_location)
df.show()
df.describe()

+--------------------+
|               value|
+--------------------+
|This is the 100th...|
|is presented in c...|
|Library of the Fu...|
|often releases Et...|
|                    |
|         Shakespeare|
|                    |
|*This Etext has c...|
|                    |
|<<THIS ELECTRONIC...|
|SHAKESPEARE IS CO...|
|PROVIDED BY PROJE...|
|WITH PERMISSION. ...|
|DISTRIBUTED SO LO...|
|PERSONAL USE ONLY...|
|COMMERCIALLY.  PR...|
|SERVICE THAT CHAR...|
|                    |
|*Project Gutenber...|
|in the presentati...|
+--------------------+
only showing top 20 rows



DataFrame[summary: string, value: string]

### Step 7 : Creating a view on the source and performing SQL on View
This step should not pose any problem if the previous steps were successful.


In [9]:
df.createOrReplaceTempView('lines')
# explode() yields one row per word, in a column named `col` by default
words = spark.sql('select explode(split(value, " ")) from lines')
words.createOrReplaceTempView('words')
lowerwords = spark.sql('''
    select lower(trim(col)) as word, count(*) as amount
    from words where lower(trim(col)) <> ""
    group by lower(trim(col)) order by amount desc limit 20
''')
lowerwords.show()


+----+------+
|word|amount|
+----+------+
| the| 27549|
| and| 26037|
|   i| 19540|
|  to| 18700|
|  of| 18010|
|   a| 14383|
|  my| 12455|
|  in| 10671|
| you| 10630|
|that| 10487|
|  is|  9145|
| for|  7982|
|with|  7931|
| not|  7643|
|your|  6871|
| his|  6749|
|  be|  6700|
| but|  5886|
|  he|  5884|
|  as|  5882|
+----+------+
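
To make clear what the two queries compute, here is a plain-Python sketch of the same logic using `collections.Counter`. It is not how Spark executes the query, and `top_words` is a hypothetical name; it assumes the text is available as a list of lines. Words with equal counts may come out in a different order than in Spark.

```python
from collections import Counter


def top_words(lines, limit=20):
    """Count lower-cased, space-trimmed words and return the most frequent ones."""
    counts = Counter()
    for line in lines:
        for token in line.split(" "):        # split(value, " ")
            word = token.strip(" ").lower()  # lower(trim(col))
            if word:                         # where ... <> ""
                counts[word] += 1
    return counts.most_common(limit)         # order by amount desc limit 20


sample = ["To be, or not to be", "to  BE"]
print(top_words(sample, limit=2))  # [('to', 3), ('be', 2)]
```

Note that punctuation stays attached to a word (`"be,"` is counted separately from `"be"`), which is also what the SQL version does.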



                                                                                

### Step 8 : Saving the result as a Delta table
After running this step you should have a directory spark-warehouse/shakespeareWords in your project directory. This directory contains the Delta table. Right-click the root directory and click "Reload all from disk" to see the directory.

Possible error solutions:
1. Make sure the previous step was executed correctly.
2. Make sure delta-spark is installed correctly. If not, install the packages from requirements.txt again.
3. Make sure your project is not in a user directory with spaces in the name.
4. Make sure you have the correct permissions to write to the project directory.
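
The last two points can be checked before writing the table. The sketch below assumes the notebook runs from the project directory; `warehouse_preflight` is a hypothetical helper name. It flags spaces anywhere in the absolute path and tests write access with `os.access`.

```python
import os
from pathlib import Path


def warehouse_preflight(project_dir="."):
    """Report path properties that commonly break writing the Delta table."""
    project = Path(project_dir).resolve()
    return {
        "path": str(project),
        "has_spaces": " " in str(project),
        "writable": os.access(project, os.W_OK),
    }


print(warehouse_preflight())
```

A result with `has_spaces` set to `True` or `writable` set to `False` points to the project location, not to Spark or Delta, as the cause of a failed write.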

In [10]:
# The temp views created earlier are only available inside this Spark session.
# Saving the DataFrame as a table persists the result, so it survives restarts and can be queried from other notebooks.
# mode("overwrite") replaces the table if it already exists.
lowerwords.describe()

permanent_table_name = "shakespeareWords"

lowerwords.write.format("delta").mode("overwrite").saveAsTable(permanent_table_name)

25/02/26 11:32:38 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

In [9]:
spark.stop()