# Python vs PySpark Commands
*PySpark*


## Creating a Spark Session

In [0]:
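# Local installs only: uncomment if pyspark is not already on the Python path
# import findspark
# findspark.init()

import pyspark  # when using findspark, import only after findspark.init()
from pyspark.sql import SparkSession
# May take a while locally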
spark = SparkSession.builder.appName("PySpark").getOrCreate()
spark

### Create a Spark dataframe

In PySpark we need to create a Spark session before we do anything else. The `createDataFrame` method is then available on that session object.

In [0]:
# Initialize a list of lists (same as in plain Python)
data = [['tom', 10], ['nick', 15], ['juli', 14]]

# Create the Spark DataFrame, passing the column names as the second argument
df = spark.createDataFrame(data, ['Name', 'Age'])
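
For comparison, here is a minimal sketch of the pandas counterpart, which needs no session object. The variable name `pdf` is only illustrative and is not used elsewhere in this notebook.

```python
import pandas as pd

# Same list of lists as in the PySpark cell
data = [['tom', 10], ['nick', 15], ['juli', 14]]

# pandas takes the column names through the `columns` keyword
pdf = pd.DataFrame(data, columns=['Name', 'Age'])

print(pdf.columns.tolist())  # column names as a plain list
print(len(pdf))              # row count, the pandas analogue of df.count()
```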

## Display Dataframe and its properties

In [0]:
df.show()

+----+---+
|Name|Age|
+----+---+
| tom| 10|
|nick| 15|
|juli| 14|
+----+---+



In [0]:
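# toPandas() collects the whole DataFrame to the driver as a pandas DataFrame.
# For a preview closer to pandas df.head(), limit the rows first, e.g. df.limit(5).toPandas()
df.toPandas()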

Unnamed: 0,Name,Age
0,tom,10
1,nick,15
2,juli,14


In [0]:
# View column names (same attribute as in pandas)
df.columns

Out[4]: ['Name', 'Age']

In [0]:
# How many rows are in the dataframe
df.count()

Out[5]: 3

## Read in data

In [0]:
path = "dbfs:/FileStore/shared_uploads/purvajainpj123@gmail.com/StudentsPerformance.csv"
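# header=True takes column names from the first row; without inferSchema=True every column is read as a string
df = spark.read.csv(path, header=True)
df.toPandas()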

Unnamed: 0,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
0,female,group B,bachelor's degree,standard,none,72,72,74
1,female,group C,some college,standard,completed,69,90,88
2,female,group B,master's degree,standard,none,90,95,93
3,male,group A,associate's degree,free/reduced,none,47,57,44
4,male,group C,some college,standard,none,76,78,75
...,...,...,...,...,...,...,...,...
995,female,group E,master's degree,standard,completed,88,99,95
996,male,group C,high school,free/reduced,none,62,55,55
997,female,group C,high school,free/reduced,completed,59,71,65
998,female,group D,some college,standard,completed,68,78,77


## Aggregate Data

The dict form of `agg` is very similar to pandas, but because dict keys are unique it allows only one aggregate per column.

In [0]:
df.groupBy("gender").agg({'math score':'mean'}).show()

+------+------------------+
|gender|   avg(math score)|
+------+------------------+
|female|63.633204633204635|
|  male| 68.72821576763485|
+------+------------------+
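
For comparison, a sketch of the pandas side of this aggregation. The sample frame is an assumption made for illustration: it contains only the `gender` and `math score` values of the first five rows displayed earlier, so the means differ from the full-dataset output above.

```python
import pandas as pd

# Illustrative sample: first five rows of the table shown earlier
pdf = pd.DataFrame({
    'gender': ['female', 'female', 'female', 'male', 'male'],
    'math score': [72, 69, 90, 47, 76],
})

# Dict form, mirroring the PySpark call: one aggregate for the column
single = pdf.groupby('gender').agg({'math score': 'mean'})
print(single)

# pandas also accepts a list of aggregates per column in the same dict
multi = pdf.groupby('gender').agg({'math score': ['min', 'max', 'mean']})
print(multi)
```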



For more than one aggregate on the same column, pass explicit functions from `pyspark.sql.functions`:

In [0]:
from pyspark.sql import functions as F
df.groupBy("gender").agg(F.min("math score"), F.max("math score"), F.avg("math score")).show()

+------+---------------+---------------+------------------+
|gender|min(math score)|max(math score)|   avg(math score)|
+------+---------------+---------------+------------------+
|female|              0|             99|63.633204633204635|
|  male|            100|             99| 68.72821576763485|
+------+---------------+---------------+------------------+
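
Note that the male row reports a minimum of 100 and a maximum of 99. This is what string ordering would produce, and it is consistent with the CSV having been loaded with only `header=True`, which leaves every column as a string. The plain-Python sketch below reproduces the effect on a small hypothetical list of values; it is an illustration of the ordering, not the Spark computation itself.

```python
# Three score values as they would look when read as text (hypothetical sample)
scores_as_text = ["100", "99", "47"]

# Strings compare character by character, so "100" < "47" < "99"
text_min = min(scores_as_text)
text_max = max(scores_as_text)
print(text_min, text_max)  # 100 99

# Converting to integers restores numeric ordering
scores_as_int = [int(s) for s in scores_as_text]
num_min = min(scores_as_int)
num_max = max(scores_as_int)
print(num_min, num_max)  # 47 100
```

In PySpark, the same conversion could be done by passing `inferSchema=True` to `spark.read.csv`, or by casting the column before aggregating, for example `F.col("math score").cast("int")`.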



## Spark's Immutability

Spark DataFrames are built on top of RDDs, which are immutable, so DataFrames are immutable as well.

If we change a DataFrame, for example by adding a column or replacing values, Spark creates a new DataFrame (with a new unique ID) instead of updating the existing one, even when we assign the result back to the same variable name.

In [0]:
# Let's fetch the id of our dataframe we created above
df.rdd.id()

Out[10]: 601

In [0]:
# Assigning df to a second name does not copy it: both names refer to the same DataFrame, so the ID is the same
df2 = df
df2.rdd.id()

Out[11]: 601

In [0]:
# A transformation such as withColumn returns a new DataFrame, so the ID changes
df = df.withColumn('new_col', df['math score'] * 2)
df.rdd.id()

Out[12]: 607
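
By contrast, a pandas DataFrame is mutable: adding a column changes the existing object in place. The sketch below uses Python's built-in `id()` as a rough stand-in for `df.rdd.id()`; the two are not the same kind of identifier, and the sample values are only illustrative.

```python
import pandas as pd

# Small illustrative frame (values taken from the first rows shown earlier)
pdf = pd.DataFrame({'math score': [72, 69, 90]})
id_before = id(pdf)

# Column assignment mutates the same object; no new DataFrame is created
pdf['new_col'] = pdf['math score'] * 2
id_after = id(pdf)

print(id_before == id_after)  # True: still the same Python object
print(pdf['new_col'].tolist())
```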

## Spark's Lazy Computation


Lazy evaluation means that Spark does not start executing until it absolutely has to. Transformations are only recorded; the work runs when an action asks for a result.

Let's look at an example.

In [0]:
# Transformations like this are only recorded; nothing is computed yet...
df = df.withColumn('new_col', df['math score'] * 2)

In [0]:
# ...until we execute an action like this
collect = df.collect()

In [0]:
# By contrast, printing the DataFrame only shows its schema; no job is run
print(df)

DataFrame[gender: string, race/ethnicity: string, parental level of education: string, lunch: string, test preparation course: string, math score: string, reading score: string, writing score: string, new_col: double]


The benefit is that Spark can optimize the whole chain of transformations before running it and avoid work that is never needed, which saves cluster resources.
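
Plain Python generators offer a rough analogy for this behavior. The sketch below is only an analogy and not how Spark is implemented; the names `doubled` and `calls` are hypothetical.

```python
calls = []  # records which values have actually been processed

def doubled(values):
    """Yield each value times two, logging when the work really happens."""
    for v in values:
        calls.append(v)
        yield v * 2

# Building the generator is like a transformation: nothing runs yet
lazy = doubled([72, 69, 90])
processed_before = list(calls)  # snapshot taken before anything is consumed

# Consuming it is like an action such as collect(): now the work is done
result = list(lazy)

print(processed_before)  # []
print(result)            # [144, 138, 180]
```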