
In [2]:
!pip install pyspark

Collecting pyspark
  Downloading https://files.pythonhosted.org/packages/f0/26/198fc8c0b98580f617cb03cb298c6056587b8f0447e20fa40c5b634ced77/pyspark-3.0.1.tar.gz (204.2MB)
Collecting py4j==0.10.9
  Downloading https://files.pythonhosted.org/packages/9e/b6/6a4fb90cd235dc8e265a6a2067f2a2c99f0d91787f06aca4bcf7c23f3f80/py4j-0.10.9-py2.py3-none-any.whl (198kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.0.1-py2.py3-none-any.whl size=204612243 sha256=2e5584c8ea78d90424b6032f9495c43189110c5a26385e955daeaf3e8ab903f7
  Stored in directory: /root/.cache/pip/wheels/5e/bd/07/031766ca628adec8435bb40f0bd83bb676ce65ff4007f8e73f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9 pyspark-3.0.1



# Spark DataFrame basics

In [3]:
from pyspark.sql import SparkSession    # Entry point for the DataFrame API

In [4]:
spark = SparkSession.builder.appName("Basics").getOrCreate()   # Start (or reuse) a Spark session

In [9]:
df = spark.read.json('/content/sample_data/people.json')   # Read the JSON input into a DataFrame

In [10]:
df.show()   # Display the table; Spark shows missing values as null

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [11]:
df.printSchema()   # Print each column's name, data type, and nullability

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)



In [13]:
df.columns      # columns is an attribute, so no parentheses are needed

['age', 'name']

In [16]:
df.describe()  # Returns a DataFrame holding the statistical summary; nothing is displayed yet

DataFrame[summary: string, age: string, name: string]

In [17]:
df.describe().show()   # Call show() to display the summary

+-------+------------------+-------+
|summary|               age|   name|
+-------+------------------+-------+
|  count|                 2|      3|
|   mean|              24.5|   null|
| stddev|7.7781745930520225|   null|
|    min|                19|   Andy|
|    max|                30|Michael|
+-------+------------------+-------+



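DEFINING YOUR OWN SCHEMA:
The schema of a DataFrame has to be correct, i.e. every field needs an appropriate data type. When Spark infers the schema it may not pick the type you want (above, 'age' was read as long), so the steps below define the schema explicitly: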

In [28]:
from pyspark.sql.types import (StructField, StringType,
                               IntegerType, StructType)

In [33]:
# Create a list of StructField objects
# A StructField takes 3 parameters: name, data type, and nullable (whether the field may contain null)

data_schema = [StructField('age',IntegerType(),True),
               StructField('name',StringType(),True)]
# This defines 'age' as a nullable integer column and 'name' as a nullable string column
# NOTE: The data type arguments must be instances, e.g. IntegerType(), not the bare class IntegerType

In [36]:
final_struct = StructType(fields=data_schema)   

In [37]:
df = spark.read.json('/content/sample_data/people.json', schema=final_struct)   # Re-read the input using the schema defined above

In [38]:
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)



SELECTING OR GRABBING THE DATA:

In [39]:
df['age']   

Column<b'age'>

In [40]:
type(df['age'])  #Column object

pyspark.sql.column.Column

In [41]:
df.select(['age'])

DataFrame[age: int]

In [44]:
type(df.select(['age']))  #Data frame object

pyspark.sql.dataframe.DataFrame

In [45]:
df.select(['age']).show()    #Displaying column

+----+
| age|
+----+
|null|
|  30|
|  19|
+----+



In [53]:
df.head(2)   # First 2 rows, returned as a list of Row objects

[Row(age=None, name='Michael'), Row(age=30, name='Andy')]

In [54]:
df.head(2)[0]   # Select the first row

Row(age=None, name='Michael')

In [55]:
type(df.head(2)[0])  

pyspark.sql.types.Row

WHY ARE THERE SO MANY SPECIALIZED OBJECTS IN SPARK?

Because Spark has to be able to read from a distributed data source and
then map the work out across distributed computing resources.

In [57]:
#SELECTING MULTIPLE COLUMNS IN THE DATA
df.select(['age','name']).show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [59]:
#CREATING NEW COLUMN
#withColumn returns a new DataFrame by adding a column or replacing an existing column of the same name

df.withColumn('newage',df['age']).show()   #Create a copy of the age column

+----+-------+------+
| age|   name|newage|
+----+-------+------+
|null|Michael|  null|
|  30|   Andy|    30|
|  19| Justin|    19|
+----+-------+------+



In [61]:
df.withColumn('double_age',df['age']*2).show()  #Double age

+----+-------+----------+
| age|   name|double_age|
+----+-------+----------+
|null|Michael|      null|
|  30|   Andy|        60|
|  19| Justin|        38|
+----+-------+----------+



In [62]:
df.show()   # The changes above are not permanent; each call only displays the result of the operation. Assign the result to a variable to keep it.

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [63]:
#RENAME A COLUMN
df.withColumnRenamed('age','my_new_age').show()

+----------+-------+
|my_new_age|   name|
+----------+-------+
|      null|Michael|
|        30|   Andy|
|        19| Justin|
+----------+-------+



USING SQL TO INTERACT WITH DATAFRAMES

In [71]:
#REGISTER DATAFRAME AS SQL TEMPORARY VIEW

df.createOrReplaceTempView('people_view')    # Name the view so that SQL queries can refer to it

# Creates the view, or replaces it if a view with this name already exists

In [72]:
results = spark.sql("SELECT * FROM people_view")

In [73]:
results.show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [74]:
new_results = spark.sql("SELECT * FROM people_view WHERE age = 30")   # Query the view registered above
new_results.show()

+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+

