# Spark Introduction

In this tutorial, we will show:

  - how to create a basic `SparkSession`.
  - how to load data from a file system.
  - how to use `spark sql` for basic data analytics.
  - how to write results to a file system.

## Step 1: Import the PySpark API

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split, lower, col
from pathlib import Path

## Step 2: Initialize a Spark Session

In [2]:

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("WordCount") \
    .getOrCreate()

## Step 3: Load data into Spark

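After step 2, you should be able to access the [Spark UI](http://localhost:4040/jobs/). Port 4040 is the default; if it is already in use, Spark tries the next free port (4041, 4042, and so on).

You will notice that no jobs are listed in the Spark UI after running step 3. Do you know why?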

In [9]:
# Data directory: ../data relative to the current working directory
data_dir = Path.cwd().parent / 'data'
file_path = data_dir / 'le_petit_prince.txt'

# Read text file into DataFrame
# Each line becomes a row in the DataFrame
df = spark.read.text(file_path.as_posix())

## Step 4: Count word frequency

The code below lowercases each line and splits it into words on whitespace, drops empty strings, groups the rows by word, counts the occurrences of each word, and finally sorts the words by descending count.

> If you check the Spark UI after executing the code below, there is still no job. Do you know why?

In [7]:
# Process the text and count words
word_counts = df \
    .select(explode(split(lower(col("value")), "\\s+")).alias("word")) \
    .filter(col("word") != "") \
    .groupBy("word") \
    .count() \
    .orderBy("count", ascending=False)
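
To sanity-check the pipeline on a few lines, the same steps can be mirrored in plain Python. This is only a sketch for small inputs, not part of the Spark job: `count_words` and `sample` are hypothetical names, and the `re.ASCII` flag is used on the assumption that Spark's `split` follows Java's default `\s`, which matches ASCII whitespace only.

```python
import re
from collections import Counter


def count_words(lines):
    """Plain-Python mirror of the Spark word count, for small inputs only."""
    counts = Counter()
    for line in lines:
        # Lowercase, then split on runs of whitespace (ASCII-only \s).
        for word in re.split(r"\s+", line.lower(), flags=re.ASCII):
            if word != "":  # same role as .filter(col("word") != "")
                counts[word] += 1
    # most_common() sorts by descending count, like orderBy("count", ascending=False)
    return counts.most_common()


# Hypothetical sample lines, not taken from the data file
sample = ["Le petit prince", "  le  Petit renard ? "]
top = count_words(sample)
print(top[:2])  # [('le', 2), ('petit', 2)]
```

Note that a standalone `?` is counted as a word here, because splitting on whitespace does not strip punctuation. The Spark pipeline behaves the same way.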

## Step 5: Get the result

The code in step 3 and step 4 only defines transformations. Because Spark uses a `lazy evaluation` strategy, no job has been executed so far. The code below calls `show`, which is an action.
An action triggers the execution of the transformations above. After running step 5, if you check the [Spark UI](http://localhost:4040/jobs/), you should see the list of jobs that have been executed.

In [8]:
# Show top 20 most frequent words
word_counts.show(20)  

+-----+-----+
| word|count|
+-----+-----+
|   le|  454|
|    -|  434|
|   de|  428|
|   je|  316|
|   et|  283|
|   il|  260|
|  les|  249|
|   un|  230|
|   la|  219|
|petit|  193|
|    à|  178|
|   ne|  169|
|  que|  154|
|  pas|  148|
|   tu|  136|
|  des|  131|
|c'est|  126|
|  dit|  125|
|    ?|  125|
| mais|  123|
+-----+-----+
only showing top 20 rows
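
The lazy evaluation described in this step can be illustrated with Python generators. This is only an analogy, not Spark code: `read_lines`, `to_lower`, and `log` are hypothetical names, and Spark's real mechanism (building a query plan and running it as jobs) is more involved. Building the pipeline plays the role of the transformations, and consuming it with `list` plays the role of an action such as `show`.

```python
log = []  # records when work actually happens


def read_lines(lines):
    # Generator: the body only runs when a consumer asks for values
    for line in lines:
        log.append(f"read {line!r}")
        yield line


def to_lower(lines):
    for line in lines:
        log.append(f"lower {line!r}")
        yield line.lower()


# "Transformations": chaining the generators performs no work yet
pipeline = to_lower(read_lines(["Le Petit", "Prince"]))
steps_before = len(log)
print(steps_before)  # 0

# "Action": consuming the pipeline triggers every step
result = list(pipeline)
print(result)    # ['le petit', 'prince']
print(len(log))  # 4
```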



## Step 6: Write the result to the file system

In [11]:
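# Save the results as CSV. Spark writes a directory of part files at this path
# and, with the default save mode, raises an error if the path already exists.
output_file_path = data_dir / 'out' / 'le_petit_prince_count'
word_counts.write.csv(output_file_path.as_posix(), header=True)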
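
Because Spark writes a directory of `part-*.csv` files rather than a single CSV file, reading the result back without Spark takes a small loop. Below is a minimal sketch using only the standard library. `read_spark_csv_dir` is a hypothetical helper, and the demo creates stand-in part files in a temporary directory instead of using the real output path.

```python
import csv
import tempfile
from pathlib import Path


def read_spark_csv_dir(out_dir):
    """Hypothetical helper: collect rows from all part-*.csv files in out_dir."""
    rows = []
    for part in sorted(Path(out_dir).glob("part-*.csv")):
        with part.open(newline="", encoding="utf-8") as f:
            # With header=True, each part file starts with its own header row
            rows.extend(csv.DictReader(f))
    return rows


# Demo with stand-in files that imitate the layout of a Spark output directory
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    (tmp_dir / "part-00000-demo.csv").write_text("word,count\nle,454\n", encoding="utf-8")
    (tmp_dir / "part-00001-demo.csv").write_text("word,count\nde,428\n", encoding="utf-8")
    (tmp_dir / "_SUCCESS").write_text("", encoding="utf-8")  # marker file, skipped by the glob
    rows = read_spark_csv_dir(tmp_dir)

print(rows)  # [{'word': 'le', 'count': '454'}, {'word': 'de', 'count': '428'}]
```

The values come back as strings, so convert `count` with `int` before doing arithmetic on it.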

## Step 7: Stop the Spark session

In [12]:
spark.stop()