## Installation Check
To continue with the course, you should be able to run every cell in this notebook without errors.
Each step includes troubleshooting guidelines.

### Step 1 : Importing needed modules
Possible error solutions:
1. Check that the required packages are installed correctly (View > Tool windows > Python Packages). If not, install the packages from requirements.txt again.
2. If you changed the environment during this session, restart the PyCharm IDE.
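
As a complement to the Python Packages tool window, the installed versions can also be checked from code. This is a minimal sketch: `installed_version` is a hypothetical helper, and the distribution names `pyspark` and `delta-spark` are assumed to match the entries in requirements.txt.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Distribution names as published on PyPI (assumed to match requirements.txt).
for name in ("pyspark", "delta-spark"):
    print(f"{name}: {installed_version(name) or 'NOT INSTALLED'}")
```

If a package is reported as missing, reinstall from requirements.txt and restart the IDE before rerunning the import cell.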

In [1]:
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip

### Step 2 : Setting up environment variables
To avoid setting up the environment variables in your OS, this code sets the correct variables for you. Make sure you have adjusted ConnectionConfig.py to point to the correct directories.
Check the listed variables closely: PATH, HADOOP_HOME, JAVA_HOME and SPARK_HOME must all be set correctly.
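
Reading the full listing by eye is error-prone, so a small check can help. The sketch below does not use ConnectionConfig; it only inspects `os.environ`, and `find_problems` is a hypothetical helper written for illustration. Run it after `cc.setupEnvironment()`.

```python
import os

# Variables that must point to an existing directory.
REQUIRED = ("JAVA_HOME", "HADOOP_HOME", "SPARK_HOME")


def find_problems(env, names=REQUIRED):
    """Return a list of human-readable problems for the given variables."""
    problems = []
    for name in names:
        value = env.get(name)
        if not value:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(value):
            problems.append(f"{name} points to a missing directory: {value}")
    return problems


# Prints nothing when all required variables look correct.
for problem in find_problems(os.environ):
    print(problem)
```

PATH is left out here because it holds a list of directories rather than a single one; check it in the listing itself.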


In [2]:
import ConnectionConfig as cc
cc.setupEnvironment()
cc.listEnvironment()

ALLUSERSPROFILE: C:\ProgramData
APPDATA: C:\Users\overvelj\AppData\Roaming
COMMONPROGRAMFILES: C:\Program Files\Common Files
COMMONPROGRAMFILES(X86): C:\Program Files (x86)\Common Files
COMMONPROGRAMW6432: C:\Program Files\Common Files
COMPUTERNAME: AKDGPORT11191
COMSPEC: C:\WINDOWS\system32\cmd.exe
DRIVERDATA: C:\Windows\System32\Drivers\DriverData
HOMEDRIVE: C:
HOMEPATH: \Users\overvelj
IDEA_INITIAL_DIRECTORY: C:\WINDOWS\system32
JAVA_HOME: C:\Program Files\Java\jdk-11.0.8\
LANG: en_US.UTF-8
LANGUAGE: 
LC_ALL: en_US.UTF-8
LOCALAPPDATA: C:\Users\overvelj\AppData\Local
LOGONSERVER: \\CDCDCADM02
NUMBER_OF_PROCESSORS: 8
ONEDRIVE: C:\Users\overvelj\OneDrive - Karel de Grote Hogeschool
ONEDRIVECOMMERCIAL: C:\Users\overvelj\OneDrive - Karel de Grote Hogeschool
OS: Windows_NT
PATH: C:\DevApps\DeltaSpark\Scripts;C:\Program Files (x86)\Common Files\Oracle\Java\javapath;C:\Users\overvelj\AppData\Local\Programs\Python\Python39;C:\Program Files\Common Files\Oracle\Java\javapathREMOVE;C:\Program F

### Step 3 : Configuring the SparkSession
Possible error solutions:
1. Make sure the imports in step 1 succeeded.

In [3]:
builder = SparkSession.builder \
    .appName("InstallCheck") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .config("spark.sql.shuffle.partitions", "4") \
    .master("local[*]")

builder = configure_spark_with_delta_pip(builder)




### Step 4 : Creating a local Spark cluster
This step starts the SparkSession on a local cluster.
If you already started a SparkSession with getOrCreate(), running this cell again does not change the session. To apply a changed configuration, restart the Jupyter server and rerun all the cells above.
After running this step you will get the URL of the Spark server (click on Spark UI). Check that you can visit the URL.

Possible error solutions:

1. Make sure the previous step was executed correctly
2. Check your environment variables again with cc.listEnvironment(). HADOOP_HOME, SPARK_HOME and PATH have to be set according to the instructions in README.MD.
3. Read the error message; in most cases it tells you what went wrong. If you don't get a clear error message, look at the Jupyter console (View > Tool windows > Jupyter). The console shows information about the startup process of the Spark server.
4. On Windows, make sure your HADOOP_HOME has winutils.exe in its bin directory. If not, see README.MD for instructions.
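
The winutils.exe check from the last point can be done from a cell. This is a sketch that only applies on Windows; `winutils_path` is a hypothetical helper, and it assumes HADOOP_HOME has already been set by `cc.setupEnvironment()`.

```python
import os
from pathlib import Path


def winutils_path(hadoop_home):
    """Return the expected location of winutils.exe for a given HADOOP_HOME."""
    return Path(hadoop_home) / "bin" / "winutils.exe"


hadoop_home = os.environ.get("HADOOP_HOME")
if hadoop_home is None:
    print("HADOOP_HOME is not set")
elif winutils_path(hadoop_home).is_file():
    print("winutils.exe found")
else:
    print(f"winutils.exe is missing at {winutils_path(hadoop_home)}")
```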

In [4]:
spark = builder.getOrCreate()
spark.getActiveSession()

### Step 5 : Reading source into Spark DataFrame

Possible error solutions:
1. Make sure the file is present in the project at [file_location]

In [5]:
# File location and type
file_location = "./FileStore/tables/shakespeare.txt"
file_type = "text"

# The text format needs no reader options: each line of the file becomes one row in a single 'value' column.
df = spark.read.format(file_type)  \
  .load(file_location)
df.show()
df.describe()

+--------------------+
|               value|
+--------------------+
|This is the 100th...|
|is presented in c...|
|Library of the Fu...|
|often releases Et...|
|                    |
|         Shakespeare|
|                    |
|*This Etext has c...|
|                    |
|<<THIS ELECTRONIC...|
|SHAKESPEARE IS CO...|
|PROVIDED BY PROJE...|
|WITH PERMISSION. ...|
|DISTRIBUTED SO LO...|
|PERSONAL USE ONLY...|
|COMMERCIALLY.  PR...|
|SERVICE THAT CHAR...|
|                    |
|*Project Gutenber...|
|in the presentati...|
+--------------------+
only showing top 20 rows



DataFrame[summary: string, value: string]

### Step 7 : Creating a view on the source and performing SQL on View
This step should not pose any problem if the previous steps were successful.


In [6]:
df.createOrReplaceTempView('lines')
words = spark.sql('select explode(split(value, " ")) from lines')
words.createOrReplaceTempView('words')
lowerwords = spark.sql('select lower(trim(col)) as word, count(*) as amount from words where lower(trim(col)) <> "" group by lower(trim(col)) order by amount desc limit 20')
lowerwords.show()


+----+------+
|word|amount|
+----+------+
| the| 27549|
| and| 26037|
|   i| 19540|
|  to| 18700|
|  of| 18010|
|   a| 14383|
|  my| 12455|
|  in| 10671|
| you| 10630|
|that| 10487|
|  is|  9145|
| for|  7982|
|with|  7931|
| not|  7643|
|your|  6871|
| his|  6749|
|  be|  6700|
| but|  5886|
|  he|  5884|
|  as|  5882|
+----+------+
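
For intuition about what the two SQL statements compute, the same logic can be mirrored in plain Python. This is only an illustrative sketch, not how Spark executes the query; `top_words` is a hypothetical helper, and the order of words with equal counts may differ from Spark's.

```python
from collections import Counter


def top_words(lines, limit=20):
    """Count lower-cased, space-separated words and return the most frequent."""
    counts = Counter()
    for line in lines:
        for token in line.split(" "):        # explode(split(value, " "))
            word = token.strip(" ").lower()  # lower(trim(col))
            if word != "":                   # where lower(trim(col)) <> ""
                counts[word] += 1            # group by ... count(*)
    return counts.most_common(limit)         # order by amount desc limit 20


print(top_words(["To be or not to be", "that is  the question"], limit=3))
```

Splitting on a single space produces empty tokens wherever two spaces follow each other, which is why the query filters out empty strings.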



### Step 8 : Saving the result as a Delta table
After running this step you should have a directory spark-warehouse/shakespeareWords in your project directory. This directory contains the Delta table. Right-click the root directory and click "Reload all from disk" to see the directory.

Possible error solutions:
1. Make sure the previous step was executed correctly
2. Make sure delta-spark is installed correctly. If not, install the packages from requirements.txt again.
3. Make sure your project is not in a user directory with spaces in the name.
4. Make sure you have the correct permissions to write to the project directory.
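
The last two points can be checked roughly from code. This sketch assumes the notebook runs with the project directory as its working directory; `write_location_problems` is a hypothetical helper, and `os.access` only gives an approximate answer on Windows, so treat a clean result as a hint rather than a guarantee.

```python
import os


def write_location_problems(directory):
    """Return a list of likely problems with writing a table under directory."""
    problems = []
    full_path = os.path.abspath(directory)
    if " " in full_path:
        problems.append(f"path contains spaces: {full_path}")
    if not os.access(full_path, os.W_OK):
        problems.append(f"no write permission for: {full_path}")
    return problems


print(write_location_problems(os.getcwd()) or "no problems found")
```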

In [7]:
# The temp views registered above only exist within this Spark session.
# Saving the result as a table persists it in the spark-warehouse directory, so it survives restarts and can be queried from other notebooks.
# mode("overwrite") replaces the table if it already exists.
lowerwords.describe()

permanent_table_name = "shakespeareWords"

lowerwords.write.format("delta").mode("overwrite").saveAsTable(permanent_table_name)

DataFrame[summary: string, word: string, amount: string]