# Introduction to Apache Spark with Python

This notebook walks through basic use cases of PySpark, Apache Spark's Python API.

**Some notes on Spark:**

* Any PySpark application starts by initializing a `SparkSession`, which is the entry point of PySpark.
* When Spark [transforms](https://spark.apache.org/docs/latest/rdd-programming-guide.html#transformations) data, it does not compute the result immediately; it only **plans how to compute** it later.
* When an [action](https://spark.apache.org/docs/latest/rdd-programming-guide.html#actions) such as `collect()` is explicitly called, the **computation starts**.
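
The split between planning and computing can be illustrated without a running cluster. The following is a minimal pure-Python sketch, not Spark's implementation: `LazyDataset` and `double` are hypothetical names invented for this illustration, and only the method names `map`, `filter`, and `collect` are borrowed from the RDD API.

```python
class LazyDataset:
    """Toy stand-in for a Spark dataset: records steps, runs them on demand."""

    def __init__(self, data, plan=()):
        self._data = data
        self._plan = tuple(plan)  # ordered (kind, function) steps, not yet executed

    def map(self, fn):
        # Transformation: extend the plan and return a new dataset; no work is done.
        return LazyDataset(self._data, self._plan + (("map", fn),))

    def filter(self, predicate):
        # Transformation: also only recorded.
        return LazyDataset(self._data, self._plan + (("filter", predicate),))

    def collect(self):
        # Action: only now is the recorded plan executed over the data.
        rows = list(self._data)
        for kind, fn in self._plan:
            if kind == "map":
                rows = [fn(row) for row in rows]
            else:
                rows = [row for row in rows if fn(row)]
        return rows


calls = []


def double(x):
    calls.append(x)  # side effect lets us observe when the function actually runs
    return x * 2


planned = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
print(len(calls))         # 0: nothing has been computed yet
print(planned.collect())  # [6, 8]
print(len(calls))         # 4: the map ran during collect()
```

In PySpark the same pattern appears when calls such as `rdd.map(...)` or `rdd.filter(...)` return immediately and the job only runs once `collect()` is called.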

**By the way:**

* When running PySpark applications in the PySpark shell via the <code>pyspark</code> executable, the shell automatically creates the session in the variable <code>spark</code>.
* PySpark DataFrames are implemented on top of [RDDs](https://spark.apache.org/docs/latest/rdd-programming-guide.html#overview).


In [1]:
# !pip install pyspark  # if needed
from pyspark.sql import SparkSession

In [2]:
import os
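# PYSPARK_SUBMIT_ARGS must end with "pyspark-shell"; otherwise spark-submit has no application to launch.
# The master URL must point to a running standalone master (its default port is 7077).
os.environ['PYSPARK_SUBMIT_ARGS'] = "--master spark://localhost:8888 pyspark-shell"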

In [3]:
from datetime import datetime, date
import pandas as pd
from pyspark.sql import Row

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
df

RuntimeError: Java gateway process exited before sending its port number
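
This error is raised when the JVM that PySpark launches exits during startup. Commonly reported causes include a missing Java installation and a `PYSPARK_SUBMIT_ARGS` value that lacks the trailing `pyspark-shell` token. The sketch below checks for both before a session is created. `diagnose_gateway_env` is a hypothetical helper written for this notebook, not part of PySpark, and it does not cover every possible cause.

```python
import os
import shutil


def diagnose_gateway_env(environ=None):
    """Return a list of likely problems with the PySpark launch environment."""
    env = os.environ if environ is None else environ
    problems = []

    # PySpark starts a JVM, so Java must be reachable through JAVA_HOME or PATH.
    if not env.get("JAVA_HOME") and shutil.which("java", path=env.get("PATH")) is None:
        problems.append("no Java found: set JAVA_HOME or put java on PATH")

    # If PYSPARK_SUBMIT_ARGS is set, its last token should be `pyspark-shell`.
    submit_args = env.get("PYSPARK_SUBMIT_ARGS")
    if submit_args is not None and submit_args.split()[-1:] != ["pyspark-shell"]:
        problems.append("PYSPARK_SUBMIT_ARGS should end with 'pyspark-shell'")

    return problems


# Report anything suspicious in the current process environment.
for problem in diagnose_gateway_env():
    print(problem)
```

Running this cell before `SparkSession.builder.getOrCreate()` prints nothing when neither problem is detected.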

Further references on the Apache Spark documentation site:

* [Spark SQL and DataFrames](https://spark.apache.org/docs/latest/sql-programming-guide.html) 
* [RDD Programming Guide](https://spark.apache.org/docs/latest/rdd-programming-guide.html)
* [Structured Streaming Programming Guide](https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html)
* [Spark Streaming Programming Guide](https://spark.apache.org/docs/latest/streaming-programming-guide.html)
* [Machine Learning Library (MLlib) Guide](https://spark.apache.org/docs/latest/ml-guide.html)