### Create a pandas dataframe

- Python has no built-in dataframe type, so we need to import a library that provides one. pandas is the usual choice.



In [1]:
import pandas as pd

# Initialize a list of lists
data = [["Cat", 10], ["Nick", 15], ["Juli", 14]]

# Create the pandas DataFrame
df = pd.DataFrame(data, columns=["Name", "Age"])


df.head()

|   | Name | Age |
|---|---|---|
| 0 | Cat | 10 |
| 1 | Nick | 15 |
| 2 | Juli | 14 |


--- 

### Creating a Spark Session



In [2]:
import findspark
findspark.init()

import pyspark
from pyspark.sql import SparkSession

# May take a while when running locally
spark = SparkSession.builder.appName("PySpark").getOrCreate()
spark

#### Create a Spark dataframe
- In PySpark, you need to create a Spark session before you can do anything else.
- `createDataFrame` is then available as a method on the session object.


In [3]:
# Initialize a list of lists (same data as in the pandas example)
data = [['Cat', 10], ['Nick', 15], ['Juli', 14]]

# Create the DataFrame: first the data, then the column names
df = spark.createDataFrame(data, ['Name', 'Age'])

df.show()


+----+---+
|Name|Age|
+----+---+
| Cat| 10|
|Nick| 15|
|Juli| 14|
+----+---+



In [9]:
df.show(2)  # pass the number of rows to display

+----+---+
|Name|Age|
+----+---+
| Cat| 10|
|Nick| 15|
+----+---+
only showing top 2 rows



In [10]:
# Convert it to a pandas DataFrame.

df.toPandas()

|   | Name | Age |
|---|---|---|
| 0 | Cat | 10 |
| 1 | Nick | 15 |
| 2 | Juli | 14 |


In [15]:
#To check columns 
df.columns

['Name', 'Age']

In [13]:
df.count()

3

---

### Read in Data

In [20]:
path = "students.csv"
df = spark.read.csv(path,header=True)
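df

# Evaluating df on its own only prints the schema, not the rows.
# To look at the data, call df.show() or convert it with df.toPandas().
# Note: without inferSchema=True, every column is read as a string.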

DataFrame[gender: string, race/ethnicity: string, parental level of education: string, lunch: string, test preparation course: string, math score: string, reading score: string, writing score: string]

In [21]:
df.toPandas()

|   | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score |
|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | female | group E | master's degree | standard | completed | 88 | 99 | 95 |
| 996 | male | group C | high school | free/reduced | none | 62 | 55 | 55 |
| 997 | female | group C | high school | free/reduced | completed | 59 | 71 | 65 |
| 998 | female | group D | some college | standard | completed | 68 | 78 | 77 |


### Aggregate Data

- This method is very similar to pandas, but the dictionary form of `agg` accepts only one aggregation per column.
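
The one-metric limit applies to the dictionary form of `agg` used in the next cell, and it follows from how Python dictionaries work rather than from Spark itself. The plain-Python sketch below (no Spark needed; the variable names are illustrative) shows why a second metric for the same column cannot survive in a dict, while different columns can each carry their own metric.

```python
# A dict cannot hold the same key twice: the last value silently wins,
# so only one aggregation per column can be expressed this way.
spec = {"math score": "min", "math score": "max"}
print(spec)       # {'math score': 'max'}
print(len(spec))  # 1

# Different columns are distinct keys, so each keeps its own aggregation.
multi_column_spec = {"math score": "mean", "reading score": "max"}
print(len(multi_column_spec))  # 2
```

To compute several metrics for the same column, pass column expressions from `pyspark.sql.functions` to `agg` instead of a dict.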

In [23]:
# groupBy (groupby is an alias for the same method)

df.groupBy("gender").agg({"math score":"mean"}).show()  

+------+------------------+
|gender|   avg(math score)|
+------+------------------+
|female|63.633204633204635|
|  male| 68.72821576763485|
+------+------------------+



In [25]:
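# groupBy with several aggregation functions
# Caution: "math score" was read as a string, so min and max compare the values
# as text (which is why male shows min 100 but max 99); avg casts them to numbers.

from pyspark.sql import functions as F
df.groupby("gender").agg(F.min("math score"), F.max("math score"), F.avg("math score")).show()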



+------+---------------+---------------+------------------+
|gender|min(math score)|max(math score)|   avg(math score)|
+------+---------------+---------------+------------------+
|female|              0|             99|63.633204633204635|
|  male|            100|             99| 68.72821576763485|
+------+---------------+---------------+------------------+
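
The male row above reports a minimum of 100 and a maximum of 99. That pattern is what text comparison produces, and the schema printed after the CSV read lists `math score` as `string`. The plain-Python sketch below reproduces the effect with a few hypothetical sample values; it illustrates the ordering rule only and does not run Spark.

```python
# Hypothetical sample values, stored as text the way an untyped CSV read yields them.
scores_as_text = ["100", "99", "27", "8"]

# Strings are compared character by character, so "100" sorts before "27".
print(min(scores_as_text), max(scores_as_text))  # 100 99

# Converting to integers restores numeric ordering.
scores_as_int = [int(s) for s in scores_as_text]
print(min(scores_as_int), max(scores_as_int))    # 8 100
```

In Spark, the usual remedies are to read the file with `inferSchema=True` or to cast the column to a numeric type before aggregating.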



### Spark dataframes are immutable by nature

- If you change a dataframe, for example by adding a column or modifying values, Spark generates a new dataframe **with a new unique id** instead of altering the original.
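
For readers coming from pandas, the contrast may be easier to see there first. The sketch below uses a hypothetical three-row pandas frame (the names `pdf` and `pdf2` are illustrative): `assign` returns a new object and leaves the original alone, which is roughly how Spark's `withColumn` behaves, while item assignment changes a pandas frame in place, something a Spark dataframe does not allow.

```python
import pandas as pd

# Hypothetical three-row frame, for illustration only.
pdf = pd.DataFrame({"math score": [72, 69, 90]})

# assign() returns a new DataFrame and leaves the original untouched.
pdf2 = pdf.assign(new_col=pdf["math score"] * 2)
columns_before = list(pdf.columns)
print(pdf2 is pdf)         # False
print(columns_before)      # ['math score']
print(list(pdf2.columns))  # ['math score', 'new_col']

# Item assignment, by contrast, mutates the pandas object in place.
pdf["new_col"] = pdf["math score"] * 2
print(list(pdf.columns))   # ['math score', 'new_col']
```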

#### Fetching the ID of the dataframe we created above

In [26]:
df.rdd.id()

68

In [27]:
df2 = df
df2.rdd.id()

68

#### Change the data and assign the result to another variable

In [30]:
df2 = df.withColumn("new_col", df['math score']*2)
df2.toPandas()

|   | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | new_col |
|---|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 | 144.0 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 | 138.0 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 | 180.0 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 | 94.0 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 | 152.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | female | group E | master's degree | standard | completed | 88 | 99 | 95 | 176.0 |
| 996 | male | group C | high school | free/reduced | none | 62 | 55 | 55 | 124.0 |
| 997 | female | group C | high school | free/reduced | completed | 59 | 71 | 65 | 118.0 |
| 998 | female | group D | some college | standard | completed | 68 | 78 | 77 | 136.0 |


In [31]:
df2.rdd.id()

104

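### Spark's Lazy Computation

- What does that mean exactly?
 
- As the name suggests, lazy evaluation means that Spark does not start executing until it absolutely has to. Transformations such as `withColumn` only add a step to the plan; an action such as `collect()`, `show()`, or `toPandas()` triggers the actual work.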
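
As a rough analogy only, a Python generator behaves in a similar way: defining the pipeline does no work, and the work happens when something consumes it. The sketch below is plain Python, not Spark; the function name `double_scores` and the `executed` log are illustrative.

```python
executed = []  # records which values were actually processed

def double_scores(rows):
    """Lazily yield each score doubled; nothing runs until iterated."""
    for score in rows:
        executed.append(score)
        yield score * 2

# Building the generator is like a transformation: no work happens yet.
pipeline = double_scores([72, 69, 90])
executed_before = list(executed)
print(executed_before)  # []

# Consuming it is like an action: this forces the computation.
result = list(pipeline)
print(result)           # [144, 138, 180]
print(executed)         # [72, 69, 90]
```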
 


In [47]:
# Transformations like this one are only recorded, not executed yet

df3 = df.withColumn('new_col', df['math score'] * 2)

In [50]:
# ...until we run an action like this

collect = df3.collect()

In [51]:
pan = df3.toPandas()
pan

|   | gender | race/ethnicity | parental level of education | lunch | test preparation course | math score | reading score | writing score | new_col |
|---|---|---|---|---|---|---|---|---|---|
| 0 | female | group B | bachelor's degree | standard | none | 72 | 72 | 74 | 144.0 |
| 1 | female | group C | some college | standard | completed | 69 | 90 | 88 | 138.0 |
| 2 | female | group B | master's degree | standard | none | 90 | 95 | 93 | 180.0 |
| 3 | male | group A | associate's degree | free/reduced | none | 47 | 57 | 44 | 94.0 |
| 4 | male | group C | some college | standard | none | 76 | 78 | 75 | 152.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 995 | female | group E | master's degree | standard | completed | 88 | 99 | 95 | 176.0 |
| 996 | male | group C | high school | free/reduced | none | 62 | 55 | 55 | 124.0 |
| 997 | female | group C | high school | free/reduced | completed | 59 | 71 | 65 | 118.0 |
| 998 | female | group D | some college | standard | completed | 68 | 78 | 77 | 136.0 |


In [None]:
# The benefit is that Spark saves resources and can optimize the whole job across the cluster.

# You'll notice the difference once you start working with really large datasets.

In [63]:
# Just out of curiosity: average scores per group, sorted by math average

(df.groupBy("race/ethnicity")
   .agg(F.avg("math score"), F.avg("reading score"), F.avg("writing score"))
   .orderBy("avg(math score)", ascending=False)
   .toPandas())

|   | race/ethnicity | avg(math score) | avg(reading score) | avg(writing score) |
|---|---|---|---|---|
| 0 | group E | 73.821429 | 73.028571 | 71.407143 |
| 1 | group D | 67.362595 | 70.030534 | 70.145038 |
| 2 | group C | 64.46395 | 69.103448 | 67.827586 |
| 3 | group B | 63.452632 | 67.352632 | 65.6 |
| 4 | group A | 61.629213 | 64.674157 | 62.674157 |
