# Hello Spark

Spark runs on the Java Virtual Machine, so you need to install Java before running Spark. Download and install Java [from here](https://docs.aws.amazon.com/corretto/index.html). At the time of writing, Spark supported Java 8 or 11; the supported versions depend on the Spark release, so check the compatibility notes in the [Spark documentation](https://spark.apache.org/docs/latest/).

Then, install pyspark using pip.
  
`#> pip install pyspark`

This Hello Spark sample reads its data set from a local file (`spark_people.csv`).
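
If `spark_people.csv` is not at hand, a small stand-in file can be generated with the standard library. This is only a sketch: the column names are the ones referenced by the cells in this notebook (the real file may contain more), the rows are made-up placeholder values, and `write_sample_csv` is a hypothetical helper.

```python
import csv
import os

# Column names taken from the cells in this notebook; the real file may have more.
COLUMNS = ["id_number", "full_name", "email", "address"]

# Made-up placeholder rows, not real data.
PLACEHOLDER_ROWS = [
    ["1", "Sample Person One", "one@example.com", "1 Example Street"],
    ["2", "Sample Person Two", "two@example.com", "2 Example Avenue"],
]


def write_sample_csv(path):
    """Write a header row plus the placeholder rows to `path` and return the path."""
    directory = os.path.dirname(path)
    if directory:
        # Create the parent folder (for example `data/`) if it does not exist yet.
        os.makedirs(directory, exist_ok=True)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(COLUMNS)
        writer.writerows(PLACEHOLDER_ROWS)
    return path
```

Calling `write_sample_csv("data/spark_people.csv")` would create the file at the path used by the CSV read cell in this notebook.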

In [None]:
from pyspark.sql import SparkSession

`getOrCreate()` returns the SparkSession that is already running (in the PySpark shell, for example, both a SparkContext and a SparkSession exist at startup) or creates a new one.  
We can set some of its parameters, such as the application's name.  
Let's just call it *Hello Spark*.
  
This might take some time on the first run.

In [None]:
spark = SparkSession.builder.appName("Hello Spark").getOrCreate()

Inspect the Spark session. Its output also includes a link to the Spark UI.

In [None]:
# Print the configuration explicitly: a notebook cell only displays its last expression.
print(spark.sparkContext.getConf().getAll())
spark

## PySpark CSV Read

Read a CSV file whose first row is the header row. The result is a Spark DataFrame.

In [None]:
path = "data/spark_people.csv"
people_data = spark.read.csv(path, header=True)

Print the DataFrame's schema. Because `inferSchema` was not set, every CSV column is read as a string.

In [None]:
people_data.printSchema()

Show the first 10 rows without truncating long values (`False` is passed as the `truncate` argument).

In [None]:
people_data.show(10, False)

Select particular columns, then show the first 10 rows without truncating long values (`False` is passed as the `truncate` argument).

In [None]:
people_data.select("email", "full_name").show(10, False)

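Take the last 12 rows and iterate over them. `tail()` is available from Spark 3.0 onward and returns the rows to the driver as a list.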

In [None]:
for row in people_data.tail(12):
    print("id_number is {}, full_name is {}, email is {}, and address is {}".format(
        row.id_number, row.full_name, row["email"], row["address"]))

## PySpark JSONL Read

In [None]:
path = "data/spark_people.jsonl"
people_data = spark.read.json(path)

See the schema

In [None]:
people_data.printSchema()

Show the first 10 records without truncating long values.

In [None]:
people_data.show(10, False)

Iterate the first 8 records.

In [None]:
for row in people_data.take(8):
    print("id_number is {}, full_name is {}, email is {}, and address is {}".format(
        row.id_number, row.full_name, row["email"], row["address"]))

## PySpark JSON Array Read

If the file contains a single JSON array rather than JSONL (one JSON object per line), add `multiLine=True` when reading it.

In [None]:
path = "data/spark_people.json"
people_json_data = spark.read.json(path, multiLine=True)
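
To make the difference between the two layouts concrete, the sketch below serializes the same records both ways with the standard library. The helper names `to_jsonl` and `to_json_array` are hypothetical, and the records are made-up placeholders.

```python
import json


def to_jsonl(records):
    """Serialize records as JSONL: one complete JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


def to_json_array(records):
    """Serialize records as one JSON array that spans several lines."""
    return json.dumps(records, indent=2)


# Made-up placeholder records, not real data.
sample = [
    {"id_number": "1", "full_name": "Sample Person One"},
    {"id_number": "2", "full_name": "Sample Person Two"},
]

jsonl_text = to_jsonl(sample)
array_text = to_json_array(sample)

print(jsonl_text)   # each line parses on its own
print(array_text)   # only the whole text parses as JSON
```

In the JSONL text every line is a valid JSON document by itself, which is what `spark.read.json` expects by default. The array text is valid only as a whole, which is why it needs `multiLine=True`.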

In [None]:
people_json_data.printSchema()

In [None]:
people_json_data.show(10, False)

In [None]:
for row in people_json_data.take(8):
    print("id_number is {}, full_name is {}, email is {}, and address is {}".format(
        row.id_number, row.full_name, row["email"], row["address"]))

The complete PySpark API reference is [available here](http://spark.apache.org/docs/latest/api/python/reference/index.html).