# Spark Operations using Spark DataFrames and Spark SQL

### 0. Set up the PySpark environment

In [1]:
#!apt-get install openjdk-8-jdk-headless -qq > /dev/null
#!wget -q https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
#!tar xf spark-2.3.1-bin-hadoop2.7.tgz
#!pip install -q findspark



The `os` module is part of Python's standard library and provides a portable way to use operating-system-dependent functionality.

`os.environ` is a mapping object that represents the environment variables of the current process: each variable name is a key, and its value is the corresponding string.

`os.environ` behaves like a Python dictionary, so common dictionary operations such as get and set work on it. Changes made through `os.environ` affect only the current process (and any child processes it starts afterwards); they do not permanently change the system's environment.
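
As a small illustration of the points above, the following sketch sets a variable, reads it back, and checks that a child process inherits it. The variable name `DEMO_SPARK_HOME` and its value are made up for this example and are not used anywhere else in this notebook.

```python
import os
import subprocess
import sys

# DEMO_SPARK_HOME is a hypothetical variable used only for this illustration.
os.environ["DEMO_SPARK_HOME"] = "/tmp/demo-spark"

# Dictionary-style reads work, including .get() with a default.
current_value = os.environ.get("DEMO_SPARK_HOME", "<unset>")

# A child process started after the change inherits the modified environment.
child_value = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ.get('DEMO_SPARK_HOME'))"],
    capture_output=True,
    text=True,
    check=True,
).stdout.strip()

print(current_value)
print(child_value)

# Deleting the key undoes the change for this process.
del os.environ["DEMO_SPARK_HOME"]
missing_value = os.environ.get("DEMO_SPARK_HOME")
```

This is why `JAVA_HOME` and `SPARK_HOME` must be set in the same Python process that later starts Spark.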

In [0]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.3.1-bin-hadoop2.7"

### 1. Create a SparkSession

In [0]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession \
      .builder \
      .appName('PySpark on Google Colab') \
      .master('local[*]') \
      .getOrCreate()

### 2. Check the Spark Session Configuration

In [230]:
spark

In [0]:
sc = spark.sparkContext

In [232]:
sc

## **Spark DataFrame**

#### A DataFrame is the most common Structured API and simply represents a table of data with rows and columns. 
<br> The list that defines the columns and the types within those columns is called the schema. 
<br> One can think of a DataFrame as a spreadsheet with named columns.
<br> A spreadsheet sits on one computer in one specific location, whereas a Spark DataFrame can span thousands of computers.
<br> The reason for putting the data on more than one computer should be intuitive: 
<br>     either the data is too large to fit on one machine or 
<br>     it would simply take too long to perform that computation on one machine.

#### NOTE
Spark has several core abstractions: Datasets, DataFrames, SQL Tables, and Resilient Distributed Datasets (RDDs).
<br> These different abstractions all represent distributed collections of data.
<br> DataFrames are the easiest to use and generally the most efficient, and they are available in all of Spark's supported languages.



### 3. Create a DataFrame

In [233]:
myDF = spark.createDataFrame([[1, 'Alice', 30],
                              [2, 'Bob', 28],
                              [3, 'Cathy', 31], 
                              [4, 'Dave', 56]], ['Id', 'Name', 'Age'])

myDF.show()

+---+-----+---+
| Id| Name|Age|
+---+-----+---+
|  1|Alice| 30|
|  2|  Bob| 28|
|  3|Cathy| 31|
|  4| Dave| 56|
+---+-----+---+



#### Create Dataframe from an RDD

In [0]:
#from google.colab import files
#files.upload()

In [235]:
trainRDD = sc.textFile("./train.csv")
print("Total Records with header: ", trainRDD.count())

Total Records with header:  550069


In [236]:
print("\nFirst Two Records Before Removing Header\n")
print(trainRDD.take(2))


First Two Records Before Removing Header

['User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase', '1000001,P00069042,F,0-17,10,A,2,0,3,,,8370']


In [237]:
header = trainRDD.first()
header

'User_ID,Product_ID,Gender,Age,Occupation,City_Category,Stay_In_Current_City_Years,Marital_Status,Product_Category_1,Product_Category_2,Product_Category_3,Purchase'

In [238]:
trainRDD = trainRDD.filter(lambda line: line != header)
print("Total Records without header: ", trainRDD.count())
print("\nFirst Two Records After Removing Header\n")
print(trainRDD.take(2))

Total Records without header:  550068

First Two Records After Removing Header

['1000001,P00069042,F,0-17,10,A,2,0,3,,,8370', '1000001,P00248942,F,0-17,10,A,2,0,1,6,14,15200']


In [239]:
# Split each line into individual column values (a plain split; assumes no field contains a comma)
splitRDD = trainRDD.map(lambda row: row.split(","))
print("\nFirst Two Records After Split/Parsing\n")
print(splitRDD.take(2))


First Two Records After Split/Parsing

[['1000001', 'P00069042', 'F', '0-17', '10', 'A', '2', '0', '3', '', '', '8370'], ['1000001', 'P00248942', 'F', '0-17', '10', 'A', '2', '0', '1', '6', '14', '15200']]
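
Splitting on every comma works here because, in the records shown, no field contains a comma. As a hedged aside, the sketch below uses a made-up line (not taken from `train.csv`) to show how a quoted field would break a plain `split(",")`, and how Python's standard `csv` module parses the same line.

```python
import csv

# Hypothetical record with a quoted field that contains a comma (not from train.csv).
line = '1000001,"Widget, large",F,8370'

# A plain split treats the comma inside the quotes as a separator.
naive_fields = line.split(",")

# csv.reader understands quoting and keeps the quoted field together.
parsed_fields = next(csv.reader([line]))

print(len(naive_fields), naive_fields)
print(len(parsed_fields), parsed_fields)
```

If a file did contain quoted fields, one option would be to apply the same parsing per line, for example `trainRDD.map(lambda row: next(csv.reader([row])))`.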


#### Create a DataFrame for the above data
1. Define Schema
2. Create dataframe using the above schema

#### Create Schema

In [0]:
# Import only the types used below instead of a wildcard import
from pyspark.sql.types import StructType, StructField, StringType

# Every column is read as a nullable string
trainSchema = StructType([
    StructField("User_ID", StringType(), True),
    StructField("Product_ID", StringType(), True),
    StructField("Gender", StringType(), True),
    StructField("Age", StringType(), True),
    StructField("Occupation", StringType(), True),
    StructField("City_Category", StringType(), True),
    StructField("Stay_In_Current_City_Years", StringType(), True),
    StructField("Marital_Status", StringType(), True),
    StructField("Product_Category_1", StringType(), True),
    StructField("Product_Category_2", StringType(), True),
    StructField("Product_Category_3", StringType(), True),
    StructField("Purchase", StringType(), True)
])

#### Create DataFrame using toDF()

In [241]:
trainDF = splitRDD.toDF(schema = trainSchema)
trainDF.show(5)

+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|User_ID|Product_ID|Gender| Age|Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|1000001| P00069042|     F|0-17|        10|            A|                         2|             0|                 3|                  |                  |    8370|
|1000001| P00248942|     F|0-17|        10|            A|                         2|             0|                 1|                 6|                14|   15200|
|1000001| P00087842|     F|0-17|        10|            A|                         2|             0|                12|                  |                  |    1422|
|100

#### Create DataFrame using createDataFrame()

In [242]:
trainDF = spark.createDataFrame(data = splitRDD, schema=trainSchema)
trainDF.show(5)

+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|User_ID|Product_ID|Gender| Age|Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|1000001| P00069042|     F|0-17|        10|            A|                         2|             0|                 3|                  |                  |    8370|
|1000001| P00248942|     F|0-17|        10|            A|                         2|             0|                 1|                 6|                14|   15200|
|1000001| P00087842|     F|0-17|        10|            A|                         2|             0|                12|                  |                  |    1422|
|100

### 4. DataFrame Transformations & Actions

### Transformations
In Spark, the core data structures are immutable, meaning they cannot be changed after they’re created.
<br> To “change” a DataFrame, you need to instruct Spark how you would like to modify it to do what you want.
<br> These instructions are called transformations.
<br> Transformations are the core of how you express your business logic using Spark.
<br> Transformations are simply ways of specifying different series of data manipulation.



#### Create a dataframe with one column containing 100 rows with values from 0 to 99.

In [0]:
myRange = spark.range(100).toDF('number')

In [244]:
myRange.show(10)

+------+
|number|
+------+
|     0|
|     1|
|     2|
|     3|
|     4|
|     5|
|     6|
|     7|
|     8|
|     9|
+------+
only showing top 10 rows



In [245]:
divisBy2 = myRange.where("number % 2 = 0")
divisBy2

DataFrame[number: bigint]

Notice that this cell prints only the DataFrame's schema (`DataFrame[number: bigint]`), not any rows. <br>This is because we specified only an abstract transformation, and Spark will not act on transformations until we call an action.
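
The same idea can be sketched in plain Python, as an analogy only (this is not Spark code, and the names below are made up for the illustration). Python's built-in `filter` is also lazy: it records what to do and evaluates nothing until the result is consumed, which plays the role of an action here.

```python
# Record every value the predicate is actually called on.
evaluated = []

def is_even(n):
    evaluated.append(n)
    return n % 2 == 0

# Building the filter does no work yet (analogous to a transformation).
lazy_evens = filter(is_even, range(100))
calls_before = len(evaluated)

# Consuming the iterator triggers the evaluation (analogous to an action).
even_count = sum(1 for _ in lazy_evens)
calls_after = len(evaluated)

print(calls_before, even_count, calls_after)
```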

### Actions
Transformations allow us to build up our logical transformation plan. 
<br> To trigger the computation, we run an action.
<br> An action instructs Spark to compute a result from a series of transformations. 
<br> The simplest action is `show`, which displays the records in the DataFrame.

#### There are 3 types of actions
Actions to view data in the console
<br>Actions to collect data 
<br>Actions to write to output data sources

In [246]:
divisBy2.show()

+------+
|number|
+------+
|     0|
|     2|
|     4|
|     6|
|     8|
|    10|
|    12|
|    14|
|    16|
|    18|
|    20|
|    22|
|    24|
|    26|
|    28|
|    30|
|    32|
|    34|
|    36|
|    38|
+------+
only showing top 20 rows



In [247]:
divisBy2.count()

50

In [248]:
trainDF.take(2)

[Row(User_ID='1000001', Product_ID='P00069042', Gender='F', Age='0-17', Occupation='10', City_Category='A', Stay_In_Current_City_Years='2', Marital_Status='0', Product_Category_1='3', Product_Category_2='', Product_Category_3='', Purchase='8370'),
 Row(User_ID='1000001', Product_ID='P00248942', Gender='F', Age='0-17', Occupation='10', City_Category='A', Stay_In_Current_City_Years='2', Marital_Status='0', Product_Category_1='1', Product_Category_2='6', Product_Category_3='14', Purchase='15200')]

In [249]:
trainDF.show(4,truncate=False)

+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|User_ID|Product_ID|Gender|Age |Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|1000001|P00069042 |F     |0-17|10        |A            |2                         |0             |3                 |                  |                  |8370    |
|1000001|P00248942 |F     |0-17|10        |A            |2                         |0             |1                 |6                 |14                |15200   |
|1000001|P00087842 |F     |0-17|10        |A            |2                         |0             |12                |                  |                  |1422    |
|100

In [250]:
trainDF.count()

550068

### 5. Reading a CSV file into a DataFrame 

In [0]:
path ="./train.csv"

In [0]:
trainDF = spark.read.csv(path=path,header=True,schema=trainSchema,sep=",")

In [253]:
trainDF.take(5)

[Row(User_ID='1000001', Product_ID='P00069042', Gender='F', Age='0-17', Occupation='10', City_Category='A', Stay_In_Current_City_Years='2', Marital_Status='0', Product_Category_1='3', Product_Category_2=None, Product_Category_3=None, Purchase='8370'),
 Row(User_ID='1000001', Product_ID='P00248942', Gender='F', Age='0-17', Occupation='10', City_Category='A', Stay_In_Current_City_Years='2', Marital_Status='0', Product_Category_1='1', Product_Category_2='6', Product_Category_3='14', Purchase='15200'),
 Row(User_ID='1000001', Product_ID='P00087842', Gender='F', Age='0-17', Occupation='10', City_Category='A', Stay_In_Current_City_Years='2', Marital_Status='0', Product_Category_1='12', Product_Category_2=None, Product_Category_3=None, Purchase='1422'),
 Row(User_ID='1000001', Product_ID='P00085442', Gender='F', Age='0-17', Occupation='10', City_Category='A', Stay_In_Current_City_Years='2', Marital_Status='0', Product_Category_1='12', Product_Category_2='14', Product_Category_3=None, Purchase

In [254]:
trainDF.show(5,truncate=False)

+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|User_ID|Product_ID|Gender|Age |Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|1000001|P00069042 |F     |0-17|10        |A            |2                         |0             |3                 |null              |null              |8370    |
|1000001|P00248942 |F     |0-17|10        |A            |2                         |0             |1                 |6                 |14                |15200   |
|1000001|P00087842 |F     |0-17|10        |A            |2                         |0             |12                |null              |null              |1422    |
|100

#### Getting the shape of the Spark DataFrame
* Spark has no direct shape command; we derive the shape from the
  count of records and the number of columns.
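
The two pieces can be combined in a small helper. This is a minimal sketch: `spark_shape` is a hypothetical name, not a PySpark function, and it relies only on the `count()` method and `columns` attribute that Spark DataFrames provide.

```python
def spark_shape(df):
    """Return (row_count, column_count), similar to pandas' DataFrame.shape.

    Assumes df has a count() method and a columns list, as Spark DataFrames do.
    Note that count() is an action, so calling this triggers a Spark job.
    """
    return (df.count(), len(df.columns))
```

For example, `spark_shape(trainDF)` would return a `(rows, columns)` tuple for the DataFrame loaded above.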

In [255]:
## To Count the number of rows in DataFrame
print('Total records count in train dataset is {}'.format(trainDF.count()))

Total records count in train dataset is 550068


In [256]:
## Columns count and column names
print("Total Columns count in train dataset is {}".format(len(trainDF.columns)))
print("\n\nColumns in train dataset are: {} \n".format(trainDF.columns))

Total Columns count in train dataset is 12


Columns in train dataset are: ['User_ID', 'Product_ID', 'Gender', 'Age', 'Occupation', 'City_Category', 'Stay_In_Current_City_Years', 'Marital_Status', 'Product_Category_1', 'Product_Category_2', 'Product_Category_3', 'Purchase'] 



### 6. Verify Schema

In [257]:
## Print Schema
trainDF.printSchema()

root
 |-- User_ID: string (nullable = true)
 |-- Product_ID: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Occupation: string (nullable = true)
 |-- City_Category: string (nullable = true)
 |-- Stay_In_Current_City_Years: string (nullable = true)
 |-- Marital_Status: string (nullable = true)
 |-- Product_Category_1: string (nullable = true)
 |-- Product_Category_2: string (nullable = true)
 |-- Product_Category_3: string (nullable = true)
 |-- Purchase: string (nullable = true)



In [258]:
trainDF.dtypes

[('User_ID', 'string'),
 ('Product_ID', 'string'),
 ('Gender', 'string'),
 ('Age', 'string'),
 ('Occupation', 'string'),
 ('City_Category', 'string'),
 ('Stay_In_Current_City_Years', 'string'),
 ('Marital_Status', 'string'),
 ('Product_Category_1', 'string'),
 ('Product_Category_2', 'string'),
 ('Product_Category_3', 'string'),
 ('Purchase', 'string')]

#### Getting the Columns from the Spark DataFrame

In [259]:
trainDF.columns

['User_ID',
 'Product_ID',
 'Gender',
 'Age',
 'Occupation',
 'City_Category',
 'Stay_In_Current_City_Years',
 'Marital_Status',
 'Product_Category_1',
 'Product_Category_2',
 'Product_Category_3',
 'Purchase']

In [260]:
type(trainDF.columns)

list

### 7. Show the first n observations

In [261]:
## Use the head operation to see the first n observations (say, 2 observations).
## head in PySpark is similar to head in pandas, but it returns a list of Row objects rather than a DataFrame.
trainDF.head(2)

[Row(User_ID='1000001', Product_ID='P00069042', Gender='F', Age='0-17', Occupation='10', City_Category='A', Stay_In_Current_City_Years='2', Marital_Status='0', Product_Category_1='3', Product_Category_2=None, Product_Category_3=None, Purchase='8370'),
 Row(User_ID='1000001', Product_ID='P00248942', Gender='F', Age='0-17', Occupation='10', City_Category='A', Stay_In_Current_City_Years='2', Marital_Status='0', Product_Category_1='1', Product_Category_2='6', Product_Category_3='14', Purchase='15200')]

In [262]:
## The results above are in Row format.
## To see the result in a more readable tabular form (rows under the columns), use the show operation.
## Apply show to train and display its first 2 rows.
trainDF.show(2)


+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|User_ID|Product_ID|Gender| Age|Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|1000001| P00069042|     F|0-17|        10|            A|                         2|             0|                 3|              null|              null|    8370|
|1000001| P00248942|     F|0-17|        10|            A|                         2|             0|                 1|                 6|                14|   15200|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
only

### 8. Summary statistics

In [263]:
## To get the summary statistics (count, mean, standard deviation, min, max) of the numeric and string columns in a DataFrame
trainDF.describe().show(truncate=False)

+-------+------------------+----------+------+------+-----------------+-------------+--------------------------+-------------------+------------------+------------------+------------------+-----------------+
|summary|User_ID           |Product_ID|Gender|Age   |Occupation       |City_Category|Stay_In_Current_City_Years|Marital_Status     |Product_Category_1|Product_Category_2|Product_Category_3|Purchase         |
+-------+------------------+----------+------+------+-----------------+-------------+--------------------------+-------------------+------------------+------------------+------------------+-----------------+
|count  |550068            |550068    |550068|550068|550068           |550068       |550068                    |550068             |550068            |376430            |166821            |550068           |
|mean   |1003028.8424013031|null      |null  |null  |8.076706879876669|null         |1.468494139793958         |0.40965298835780306|5.404270017525106 |9.842329251122386

In [264]:
## Check what happens when describe is applied to a string column.
## mean and stddev are computed only when the strings can be read as numbers (as with Purchase) and are null otherwise.
## min and max are compared as strings (character by character), which is why '10000' sorts before '9999'.
trainDF.describe(['Purchase']).show()

+-------+-----------------+
|summary|         Purchase|
+-------+-----------------+
|  count|           550068|
|   mean|9263.968712959126|
| stddev|5023.065393820575|
|    min|            10000|
|    max|             9999|
+-------+-----------------+
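
The min and max above look reversed because `Purchase` is a string column. The plain-Python sketch below reproduces the effect on a handful of values that appear in the outputs in this notebook: strings are compared character by character, so `'1...'` sorts before `'9...'` regardless of length.

```python
# Purchase values as strings, the way the current schema stores them.
purchases = ["8370", "15200", "1422", "10000", "9999"]

# String comparison is lexicographic (character by character).
string_min = min(purchases)
string_max = max(purchases)

# Converting to integers first gives the numeric ordering.
numeric_min = min(int(p) for p in purchases)
numeric_max = max(int(p) for p in purchases)

print(string_min, string_max)
print(numeric_min, numeric_max)
```

Casting the column to a numeric type before calling `describe` avoids this.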



### 9.a. Adding Columns

In [265]:
## Add a constant column using lit()
from pyspark.sql.functions import lit
trainDF.withColumn("One", lit(1)).show(2)

+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+---+
|User_ID|Product_ID|Gender| Age|Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|One|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+---+
|1000001| P00069042|     F|0-17|        10|            A|                         2|             0|                 3|              null|              null|    8370|  1|
|1000001| P00248942|     F|0-17|        10|            A|                         2|             0|                 1|                 6|                14|   15200|  1|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+-------------

In [266]:
tempDF = trainDF.withColumn("SameCategoryCode",
                            trainDF["Product_Category_1"] == trainDF["Product_Category_2"])
tempDF.show(4)

+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+----------------+
|User_ID|Product_ID|Gender| Age|Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|SameCategoryCode|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+----------------+
|1000001| P00069042|     F|0-17|        10|            A|                         2|             0|                 3|              null|              null|    8370|            null|
|1000001| P00248942|     F|0-17|        10|            A|                         2|             0|                 1|                 6|                14|   15200|           false|
|1000001| P00087842|     F|0-17|        10|            A|                         2| 

### 9.b. Renaming Columns

In [267]:
tempDF.withColumnRenamed("SameCategoryCode", "SimilarCategory").show(2)

+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+---------------+
|User_ID|Product_ID|Gender| Age|Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|SimilarCategory|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+---------------+
|1000001| P00069042|     F|0-17|        10|            A|                         2|             0|                 3|              null|              null|    8370|           null|
|1000001| P00248942|     F|0-17|        10|            A|                         2|             0|                 1|                 6|                14|   15200|          false|
+-------+----------+------+----+----------+-------------+--------------------------+------

In [268]:
tempDF.show(2)

+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+----------------+
|User_ID|Product_ID|Gender| Age|Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|SameCategoryCode|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+----------------+
|1000001| P00069042|     F|0-17|        10|            A|                         2|             0|                 3|              null|              null|    8370|            null|
|1000001| P00248942|     F|0-17|        10|            A|                         2|             0|                 1|                 6|                14|   15200|           false|
+-------+----------+------+----+----------+-------------+--------------------------+-

### 9.c. Removing Columns

In [269]:
tempDF.drop("SameCategoryCode").show(2)

+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|User_ID|Product_ID|Gender| Age|Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|1000001| P00069042|     F|0-17|        10|            A|                         2|             0|                 3|              null|              null|    8370|
|1000001| P00248942|     F|0-17|        10|            A|                         2|             0|                 1|                 6|                14|   15200|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
only

### 10. Changing a Column’s Type (cast)

In [270]:
tempDF.printSchema()

root
 |-- User_ID: string (nullable = true)
 |-- Product_ID: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Occupation: string (nullable = true)
 |-- City_Category: string (nullable = true)
 |-- Stay_In_Current_City_Years: string (nullable = true)
 |-- Marital_Status: string (nullable = true)
 |-- Product_Category_1: string (nullable = true)
 |-- Product_Category_2: string (nullable = true)
 |-- Product_Category_3: string (nullable = true)
 |-- Purchase: string (nullable = true)
 |-- SameCategoryCode: boolean (nullable = true)



In [271]:
tempDF.withColumn("Purchase",tempDF.Purchase.cast("int")).printSchema()

root
 |-- User_ID: string (nullable = true)
 |-- Product_ID: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Occupation: string (nullable = true)
 |-- City_Category: string (nullable = true)
 |-- Stay_In_Current_City_Years: string (nullable = true)
 |-- Marital_Status: string (nullable = true)
 |-- Product_Category_1: string (nullable = true)
 |-- Product_Category_2: string (nullable = true)
 |-- Product_Category_3: string (nullable = true)
 |-- Purchase: integer (nullable = true)
 |-- SameCategoryCode: boolean (nullable = true)



### 11. Splitting the data into Train and Test

In [272]:
trainDF,testDF = trainDF.randomSplit([0.7, 0.3], seed=1234)
print(trainDF.count())
print(testDF.count())

385465
164603
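
The counts above are close to, but not exactly, a 70/30 split (about 70.1% of the rows landed in train). A rough plain-Python sketch of why: each row is assigned by an independent random draw against the weights, so the split sizes are only approximately proportional. This is an illustration under that assumption, not Spark's implementation; `random_split_sketch` is a hypothetical helper.

```python
import random

def random_split_sketch(rows, train_weight=0.7, seed=1234):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row lands in train with probability train_weight.
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(10000))
train_rows, test_rows = random_split_sketch(rows)

# Every row lands in exactly one split, but the sizes are only close to 70/30.
print(len(train_rows), len(test_rows))
```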


### 12. Working with Nulls in Data

In [273]:
from pyspark.sql.functions import isnan, when, count, col
trainDF.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in trainDF.columns]).show()

+-------+----------+------+---+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|User_ID|Product_ID|Gender|Age|Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|
+-------+----------+------+---+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|      0|         0|     0|  0|         0|            0|                         0|             0|                 0|            121619|            268621|       0|
+-------+----------+------+---+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+



#### Dropping rows with null values
##### Use the **dropna()** operation.
  `dropna()` accepts three optional parameters that control which rows are dropped.
* **how** – ‘any’ or ‘all’. If ‘any’, drop a row if it contains any nulls. If ‘all’, drop a row only if all its values are null.

* **thresh** – int, default None. If specified, drop rows that have fewer than thresh non-null values. This overrides the how parameter.

* **subset** – optional list of column names to consider.
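
The three options can be modelled in plain Python to make the rules concrete. This is a sketch of the documented behaviour, not PySpark code: `keep_row` is a hypothetical helper, and the rows are simplified to the three category columns with values taken from the records shown earlier.

```python
def keep_row(row, how="any", thresh=None, subset=None):
    """Return True if a row (a dict) would survive a dropna-style filter."""
    columns = subset if subset is not None else list(row)
    non_null = sum(1 for c in columns if row[c] is not None)
    if thresh is not None:
        # thresh takes precedence over how
        return non_null >= thresh
    if how == "any":
        return non_null == len(columns)  # dropped if any value is null
    if how == "all":
        return non_null > 0              # dropped only if every value is null
    raise ValueError("how must be 'any' or 'all'")

rows = [
    {"Product_Category_1": "3", "Product_Category_2": None, "Product_Category_3": None},
    {"Product_Category_1": "1", "Product_Category_2": "6", "Product_Category_3": "14"},
    {"Product_Category_1": "12", "Product_Category_2": "14", "Product_Category_3": None},
]

kept_any = [keep_row(r) for r in rows]
kept_all = [keep_row(r, how="all") for r in rows]
kept_thresh = [keep_row(r, thresh=2) for r in rows]
kept_subset = [keep_row(r, subset=["Product_Category_2"]) for r in rows]
print(kept_any, kept_all, kept_thresh, kept_subset)
```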

#### Drop null rows in train with default parameters and count the rows in the output DataFrame.
#### The defaults are any, None, None for how, thresh, subset respectively.

In [274]:
print(trainDF.dropna().count())
print(trainDF.na.drop().count())
print(trainDF.na.drop("any").count())

116844
116844
116844


#### Replacing null values in a DataFrame with a constant
#### Use the **fillna()** operation.

 `fillna()` takes two parameters:
* **value**:
    - Either a single value (int, float, string) used for all columns, or a dictionary that maps column names to their replacement values.
* **subset**: An optional list of columns to restrict the fill to. Columns whose data type does not match the type of value are ignored.
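
When a single value is passed, only columns whose type matches that value are filled. The plain-Python sketch below models this rule in a simplified form; `fill_scalar` is a hypothetical helper, and the two type labels `"string"` and `"numeric"` are a simplification for illustration, not Spark's type system.

```python
def fill_scalar(row, schema, value):
    """Fill None cells with value, but only in columns whose type matches it.

    schema maps each column name to "string" or "numeric" (simplified).
    """
    value_kind = "string" if isinstance(value, str) else "numeric"
    return {
        name: (value if cell is None and schema[name] == value_kind else cell)
        for name, cell in row.items()
    }

# Both columns are strings, as in the schema used in this notebook.
schema = {"Product_Category_2": "string", "Purchase": "string"}
row = {"Product_Category_2": None, "Purchase": "8370"}

filled_with_int = fill_scalar(row, schema, -1)    # type mismatch: null is left alone
filled_with_str = fill_scalar(row, schema, "-1")  # type matches: null is replaced
print(filled_with_int)
print(filled_with_str)
```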



In [275]:
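## Fill -1 in place of null values in the train DataFrame.
## Note: every column here is a string, and an integer fill value only applies to numeric columns,
## so the nulls remain in the output below. Use the string '-1' to fill string columns.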
trainDF.fillna(-1).show(5)

+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|User_ID|Product_ID|Gender| Age|Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|1000001| P00051842|     F|0-17|        10|            A|                         2|             0|                 4|                 8|              null|    2849|
|1000001| P00059442|     F|0-17|        10|            A|                         2|             0|                 6|                 8|                16|   16622|
|1000001| P00069042|     F|0-17|        10|            A|                         2|             0|                 3|              null|              null|    8370|
|100

In [276]:
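## Fill different columns with different values by passing a dictionary.
## count() only confirms that the number of rows is unchanged; use show() to inspect the filled values.
fill_cols_vals = {
    "Gender": 'M',
    "Purchase": 999999
}
trainDF.na.fill(fill_cols_vals).count()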

385465

### 13. Distinct Values

In [277]:
## Count the distinct products in the train and test datasets by applying the distinct operation.
print("Distinct values in Product_ID's in train dataset are {}".format(trainDF.select('Product_ID').distinct().count()))
print("Distinct values in Product_ID's in test dataset are {}".format(testDF.select('Product_ID').distinct().count()))

Distinct values in Product_ID's in train dataset are 3573
Distinct values in Product_ID's in test dataset are 3421


#### Differences in two columns

In [278]:
## The output above shows that the train split has more distinct Product_IDs than the test split.
## Use the subtract operation to find the Product_IDs that are in the test split but not in the train split,
## and vice versa.
## The same check can be applied to any categorical feature.
diff_cat_in_test_train = testDF.select('Product_ID').subtract(trainDF.select('Product_ID'))
print("Count of Product_ID's there in test dataset but not train dataset are {}".format(diff_cat_in_test_train.count()))

diff_cat_in_train_test = trainDF.select('Product_ID').subtract(testDF.select('Product_ID'))
print("Count of Product_ID's there in train dataset but not test dataset are {}".format(diff_cat_in_train_test.count()))

Count of Product_ID's there in test dataset but not train dataset are 58
Count of Product_ID's there in train dataset but not test dataset are 210
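
`subtract` behaves like a set difference, so its result depends on the order of the operands. The plain-Python sketch below shows the same asymmetry with sets. The Product_IDs are taken from records shown in this notebook, but their assignment to the two sets is made up for illustration and does not reflect the real split.

```python
# Illustrative assignment of Product_IDs to two sets (not the real train/test split).
train_ids = {"P00051842", "P00059442", "P00069042"}
test_ids = {"P00069042", "P00248942"}

# In test but not in train (like testDF.subtract(trainDF)).
only_in_test = test_ids - train_ids

# In train but not in test (like trainDF.subtract(testDF)).
only_in_train = train_ids - test_ids

print(sorted(only_in_test))
print(sorted(only_in_train))
```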


### 14. Using Spark SQL 
With Spark SQL, you can register any DataFrame as a table or view (a temporary table) and query it using pure SQL.
<br>There is no performance difference between writing SQL queries and writing DataFrame code: <br>both “compile” to the same underlying plan that Spark executes.

In [0]:
## Create view/table
trainDF.createOrReplaceTempView("trainDFTable")

In [280]:
## Verify Dataframe
trainDF.show(2)

+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|User_ID|Product_ID|Gender| Age|Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|1000001| P00051842|     F|0-17|        10|            A|                         2|             0|                 4|                 8|              null|    2849|
|1000001| P00059442|     F|0-17|        10|            A|                         2|             0|                 6|                 8|                16|   16622|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
only showing top 2 rows

In [281]:
## Verify Dataframe
trainDF.take(2)

[Row(User_ID='1000001', Product_ID='P00051842', Gender='F', Age='0-17', Occupation='10', City_Category='A', Stay_In_Current_City_Years='2', Marital_Status='0', Product_Category_1='4', Product_Category_2='8', Product_Category_3=None, Purchase='2849'),
 Row(User_ID='1000001', Product_ID='P00059442', Gender='F', Age='0-17', Occupation='10', City_Category='A', Stay_In_Current_City_Years='2', Marital_Status='0', Product_Category_1='6', Product_Category_2='8', Product_Category_3='16', Purchase='16622')]

In [282]:
## Verify Table
spark.sql("SELECT * FROM trainDFTable LIMIT 2").show()

+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|User_ID|Product_ID|Gender| Age|Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|1000001| P00051842|     F|0-17|        10|            A|                         2|             0|                 4|                 8|              null|    2849|
|1000001| P00059442|     F|0-17|        10|            A|                         2|             0|                 6|                 8|                16|   16622|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+



#### Column References

#### Select & SelectExpr

In [283]:
## Multiple ways of referring to a column in a DataFrame
from pyspark.sql.functions import expr, col, column

trainDF.select(expr("User_ID AS userID") , col("User_ID"), 
               column("User_ID"), "User_ID").show(2)

+-------+-------+-------+-------+
| userID|User_ID|User_ID|User_ID|
+-------+-------+-------+-------+
|1000001|1000001|1000001|1000001|
|1000001|1000001|1000001|1000001|
+-------+-------+-------+-------+
only showing top 2 rows



In [284]:
trainDF.select(col("User_ID"), "User_ID")

DataFrame[User_ID: string, User_ID: string]

#### Pandas-style dot notation returns a Column, not the data

In [0]:
result = trainDF.User_ID

This assigns a Column object (a reference to the column, not its data) to the newly created variable.

In [286]:
# select content from the above column
trainDF.select(result).show(2)

+-------+
|User_ID|
+-------+
|1000001|
|1000001|
+-------+
only showing top 2 rows



In [287]:
trainDF.select(expr("User_ID AS userID")).show(2)

+-------+
| userID|
+-------+
|1000001|
|1000001|
+-------+
only showing top 2 rows



In [288]:
spark.sql("SELECT User_ID AS userID FROM trainDFTable").show(2)

+-------+
| userID|
+-------+
|1000001|
|1000001|
+-------+
only showing top 2 rows



In [289]:
trainDF.selectExpr("User_ID AS userID", "Product_ID AS productID").show(2)

+-------+---------+
| userID|productID|
+-------+---------+
|1000001|P00051842|
|1000001|P00059442|
+-------+---------+
only showing top 2 rows



In [290]:
trainDF.select("User_ID", "Product_ID", "Age").show(2)

+-------+----------+----+
|User_ID|Product_ID| Age|
+-------+----------+----+
|1000001| P00051842|0-17|
|1000001| P00059442|0-17|
+-------+----------+----+
only showing top 2 rows



#### Converting to Spark Types (Literals)
Sometimes we need to pass an explicit value into Spark that is not a new column computed from the data but simply the same value in every row. This might be a constant or something we’ll need to compare against later on. The way we do this is through literals. 
A literal is a translation from a given programming language’s literal value to one that Spark understands. 
Literals are expressions and can be used in the same way as any other expression.

In [291]:
from pyspark.sql.functions import lit
trainDF.select("*", lit(1).alias('One')).show(2)

+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+---+
|User_ID|Product_ID|Gender| Age|Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|One|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+---+
|1000001| P00051842|     F|0-17|        10|            A|                         2|             0|                 4|                 8|              null|    2849|  1|
|1000001| P00059442|     F|0-17|        10|            A|                         2|             0|                 6|                 8|                16|   16622|  1|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+---+

In [292]:
## In SQL, literals are just the specific value.
trainDF.createOrReplaceTempView('trainDFTable')
spark.sql("SELECT *, 1 as One FROM trainDFTable LIMIT 2").show()

+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+---+
|User_ID|Product_ID|Gender| Age|Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|One|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+---+
|1000001| P00051842|     F|0-17|        10|            A|                         2|             0|                 4|                 8|              null|    2849|  1|
|1000001| P00059442|     F|0-17|        10|            A|                         2|             0|                 6|                 8|                16|   16622|  1|
+-------+----------+------+----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+---+

In [293]:
from pyspark.sql.functions import col
tempDF.withColumn("Purchase",col('Purchase').cast("integer")).printSchema()

root
 |-- User_ID: string (nullable = true)
 |-- Product_ID: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Age: string (nullable = true)
 |-- Occupation: string (nullable = true)
 |-- City_Category: string (nullable = true)
 |-- Stay_In_Current_City_Years: string (nullable = true)
 |-- Marital_Status: string (nullable = true)
 |-- Product_Category_1: string (nullable = true)
 |-- Product_Category_2: string (nullable = true)
 |-- Product_Category_3: string (nullable = true)
 |-- Purchase: integer (nullable = true)
 |-- SameCategoryCode: boolean (nullable = true)



#### Pair wise Frequencies - Crosstab

In [294]:
## To calculate pair wise frequency of categorical columns
## Use crosstab operation on DataFrame to calculate the pair wise frequency of columns. 
## Apply crosstab operation on ‘Age’ and ‘Gender’ columns of train DataFrame.
trainDF.crosstab('Age', 'Gender').show()

+----------+-----+------+
|Age_Gender|    F|     M|
+----------+-----+------+
|      0-17| 3548|  7027|
|     46-50| 9191| 22722|
|     18-25|17192| 52744|
|     36-45|19017| 57967|
|       55+| 3583| 11541|
|     51-55| 6962| 20053|
|     26-35|35670|118248|
+----------+-----+------+



In [295]:
trainDF.groupBy('Age', 'Gender').count().show()

+-----+------+------+
|  Age|Gender| count|
+-----+------+------+
|51-55|     F|  6962|
|18-25|     M| 52744|
| 0-17|     F|  3548|
|46-50|     M| 22722|
|18-25|     F| 17192|
|  55+|     M| 11541|
|  55+|     F|  3583|
|36-45|     M| 57967|
|26-35|     F| 35670|
| 0-17|     M|  7027|
|36-45|     F| 19017|
|51-55|     M| 20053|
|26-35|     M|118248|
|46-50|     F|  9191|
+-----+------+------+



In [296]:
spark.sql("""select Age,
    sum(case when Gender = 'F' then 1 else 0 end) F,
    sum(case when Gender = 'M' then 1 else 0 end) M
from trainDFTable
group by Age""").show()

# spark.sql("""select Age,
#     count(*) total,
#     sum(case when Gender = 'F' then 1 else 0 end) F,
#     sum(case when Gender = 'M' then 1 else 0 end) M
# from trainDFTable
# group by Age""").show()

+-----+-----+------+
|  Age|    F|     M|
+-----+-----+------+
|18-25|17192| 52744|
|26-35|35670|118248|
| 0-17| 3548|  7027|
|46-50| 9191| 22722|
|51-55| 6962| 20053|
|36-45|19017| 57967|
|  55+| 3583| 11541|
+-----+-----+------+
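The three cells above (`crosstab`, `groupBy(...).count()` and the conditional-sum SQL) return the same pairwise counts in different shapes. As a rough plain-Python sketch, with made-up rows rather than the real dataset, the long form is a counter keyed by the pair and the crosstab is that counter pivoted so that each Gender becomes a column.

```python
from collections import Counter

# Hypothetical (Age, Gender) rows, for illustration only.
rows = [("0-17", "F"), ("0-17", "M"), ("0-17", "M"), ("26-35", "F"), ("26-35", "M")]

# Long format: one count per (Age, Gender) pair, like groupBy('Age', 'Gender').count()
pair_counts = Counter(rows)

# Wide format: the same counts pivoted by Gender, like crosstab('Age', 'Gender')
genders = sorted({gender for _, gender in rows})
crosstab = {
    age: {gender: pair_counts.get((age, gender), 0) for gender in genders}
    for age in sorted({age for age, _ in rows})
}
for age, counts in crosstab.items():
    print(age, counts)
```

Pairs that never occur show up as 0 in the wide format, whereas the long format simply has no row for them.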



#### Removing Duplicates

In [297]:
## To get a DataFrame without any duplicate rows,
## use the dropDuplicates operation.
## Here it is applied to the two columns Age and Gender of the train dataset
## to get all unique combinations of these two columns.
trainDF.select('Age','Gender').dropDuplicates().show()

+-----+------+
|  Age|Gender|
+-----+------+
|51-55|     F|
|18-25|     M|
| 0-17|     F|
|46-50|     M|
|18-25|     F|
|  55+|     M|
|  55+|     F|
|36-45|     M|
|26-35|     F|
| 0-17|     M|
|36-45|     F|
|51-55|     M|
|26-35|     M|
|46-50|     F|
+-----+------+



#### Filtering the rows

In [298]:
## To keep only the rows of the train dataset with a Purchase of more than 15000,
## apply the filter operation on the Purchase column of the train DataFrame.
## The five calls below differ only in how they refer to the column.
print("Count of rows where Purchase Amount more than 15000 are {}".format(trainDF.filter(trainDF.Purchase > 15000).count()))
print("Count of rows where Purchase Amount more than 15000 are {}".format(trainDF.filter(col("Purchase") > 15000).count()))
print("Count of rows where Purchase Amount more than 15000 are {}".format(trainDF.filter(column("Purchase") > 15000).count()))
print("Count of rows where Purchase Amount more than 15000 are {}".format(trainDF.filter(expr("Purchase") > 15000).count()))
print("Count of rows where Purchase Amount more than 15000 are {}".format(trainDF.filter(trainDF["Purchase"] > 15000).count()))

Count of rows where Purchase Amount more than 15000 are 77137
Count of rows where Purchase Amount more than 15000 are 77137
Count of rows where Purchase Amount more than 15000 are 77137
Count of rows where Purchase Amount more than 15000 are 77137
Count of rows where Purchase Amount more than 15000 are 77137


In [299]:
spark.sql("""
SELECT 
COUNT(*) AS Count
FROM trainDFTable
WHERE Purchase > 15000""").show()

+-----+
|Count|
+-----+
|77137|
+-----+



In [300]:
trainDF.where("Purchase > 15000").where("Gender = 'F'").count()

15034

In [301]:
trainDF.filter("Purchase > 15000").where("Gender = 'F'").count()

15034

In [302]:
trainDF.where((col("Purchase") > 15000) & (col("Gender") == 'F')).count()

15034

In [303]:
trainDF.filter((col("Purchase") > 15000) & (col("Gender") == 'F')).count()

15034

In [304]:
spark.sql("SELECT * FROM trainDFTable WHERE Purchase > 15000 AND Gender = 'F'").count()

15034

### 15. Aggregations

#### Count Distinct

In [305]:
from pyspark.sql.functions import countDistinct
trainDF.select(countDistinct("Age")).show()

+-------------------+
|count(DISTINCT Age)|
+-------------------+
|                  7|
+-------------------+



#### Approximate Count Distinct
* **Parameters:**
    * col - Name of the column
    * rsd - maximum relative standard deviation allowed for the estimate (default = 0.05).

In [306]:
from pyspark.sql.functions import approx_count_distinct
trainDF.select(approx_count_distinct(col="Age", rsd=0.1)).show()

+--------------------------+
|approx_count_distinct(Age)|
+--------------------------+
|                         7|
+--------------------------+
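To give a feel for how a distinct count can be estimated without remembering every value, here is a small sketch of a k-minimum-values estimator. This is a swapped-in technique chosen for brevity: Spark's `approx_count_distinct` is based on HyperLogLog++, not on the code below, and the helper names `unit_hash` and `kmv_distinct` are hypothetical.

```python
import hashlib
import heapq

def unit_hash(value):
    """Deterministically map a value to a float in [0, 1)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64

def kmv_distinct(values, k=256):
    """Estimate the number of distinct values from the k smallest hashes."""
    heap = []      # negated hashes, so heap[0] is the largest hash kept
    kept = set()   # the same hashes, used to skip duplicates
    for value in values:
        h = unit_hash(value)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # Replace the largest kept hash with the new, smaller one
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return len(heap)             # fewer than k distinct values: exact count
    return int((k - 1) / -heap[0])   # density of the k smallest hashes

ages = ["0-17", "18-25", "26-35", "36-45", "46-50", "51-55", "55+"] * 1000
print(kmv_distinct(ages))            # low cardinality, so the count is exact
print(kmv_distinct(range(100000)))   # high cardinality, so this is an estimate
```

A larger `k` plays a role similar to a smaller `rsd`: more memory for a tighter estimate.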



#### First and Last

In [307]:
from pyspark.sql.functions import first, last
trainDF.select(first("Product_ID"), last("Product_ID")).show()

+------------------------+-----------------------+
|first(Product_ID, false)|last(Product_ID, false)|
+------------------------+-----------------------+
|               P00051842|              P00349442|
+------------------------+-----------------------+



#### Min and Max

In [308]:
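from pyspark.sql.functions import min, max
## Purchase is a string column in trainDF, so min and max compare the values as text
## ('10000' sorts before '9999'). Cast the column to integer first for numeric results.
trainDF.select(min("Purchase"), max("Purchase")).show()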

+-------------+-------------+
|min(Purchase)|max(Purchase)|
+-------------+-------------+
|        10000|         9999|
+-------------+-------------+
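A minimum of 10000 and a maximum of 9999 look wrong until we remember that `Purchase` was read as a string (the `take(2)` output earlier shows `Purchase='2849'`). Strings are compared character by character, as this plain-Python sketch shows with a few values that appear in this notebook. It goes through `builtins` because the import in the cell above shadows Python's own `min` and `max` inside the notebook.

```python
import builtins  # pyspark's min/max shadow the built-ins after the import above

# Purchase values that appear in the outputs of this notebook, kept as strings.
purchases = ["2849", "16622", "10000", "9999"]

# As text: "1..." sorts before "2..." and "9...", regardless of length
text_min = builtins.min(purchases)
text_max = builtins.max(purchases)
print(text_min, text_max)

# As integers: compared by magnitude
numbers = [int(p) for p in purchases]
num_min = builtins.min(numbers)
num_max = builtins.max(numbers)
print(num_min, num_max)
```

In Spark the analogous fix is to cast before aggregating, for example with `col("Purchase").cast("integer")` as in the `printSchema` cell earlier.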



#### Sum

In [309]:
from pyspark.sql.functions import sum
trainDF.select(sum("Purchase")).show()

+-------------+
|sum(Purchase)|
+-------------+
|3.566654792E9|
+-------------+



#### sumDistinct

In [310]:
from pyspark.sql.functions import sumDistinct
## Note: Spark 3.2 and later deprecate sumDistinct in favor of sum_distinct.
trainDF.select(sumDistinct("Purchase")).show()

+----------------------+
|sum(DISTINCT Purchase)|
+----------------------+
|          1.98153395E8|
+----------------------+
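The distinct sum is much smaller than the plain sum because every repeated amount is counted only once. A tiny plain-Python sketch with hypothetical amounts makes the difference concrete.

```python
import math

values = [100, 100, 250, 250, 250, 400]  # hypothetical purchase amounts

total = math.fsum(values)                # analogue of sum("Purchase")
distinct_total = math.fsum(set(values))  # analogue of sumDistinct("Purchase")
print(total, distinct_total)
```

`math.fsum` is used instead of the built-in `sum` because the notebook has already rebound `sum` to the Spark function.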



#### Avg

In [311]:
from pyspark.sql.functions import sum, count, avg, expr

trainDF.select(
    count("Purchase").alias("total_transactions"),
    sum("Purchase").alias("total_purchases"),
    avg("Purchase").alias("avg_purchases"),
    expr("mean(Purchase)").alias("mean_purchases"))\
  .selectExpr(
    "total_purchases/total_transactions",
    "avg_purchases",
    "mean_purchases").show()

+--------------------------------------+-----------------+-----------------+
|(total_purchases / total_transactions)|    avg_purchases|   mean_purchases|
+--------------------------------------+-----------------+-----------------+
|                     9252.862885086843|9252.862885086843|9252.862885086843|
+--------------------------------------+-----------------+-----------------+



#### Variance and Standard Deviation

In [312]:
from pyspark.sql.functions import var_pop, stddev_pop
from pyspark.sql.functions import var_samp, stddev_samp

trainDF.select(var_pop("Purchase"), var_samp("Purchase"),
  stddev_pop("Purchase"), stddev_samp("Purchase")).show()

+--------------------+--------------------+--------------------+---------------------+
|   var_pop(Purchase)|  var_samp(Purchase)|stddev_pop(Purchase)|stddev_samp(Purchase)|
+--------------------+--------------------+--------------------+---------------------+
|2.5196570178942546E7|2.5196635545799576E7|   5019.618529225358|    5019.625040359048|
+--------------------+--------------------+--------------------+---------------------+



In [313]:
spark.sql("""SELECT var_pop(Purchase), var_samp(Purchase),
             stddev_pop(Purchase), stddev_samp(Purchase)
             FROM trainDFTable""").show()

+---------------------------------+----------------------------------+------------------------------------+-------------------------------------+
|var_pop(CAST(Purchase AS DOUBLE))|var_samp(CAST(Purchase AS DOUBLE))|stddev_pop(CAST(Purchase AS DOUBLE))|stddev_samp(CAST(Purchase AS DOUBLE))|
+---------------------------------+----------------------------------+------------------------------------+-------------------------------------+
|             2.5196570178942546E7|              2.5196635545799576E7|                   5019.618529225358|                    5019.625040359048|
+---------------------------------+----------------------------------+------------------------------------+-------------------------------------+
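The population and sample versions differ only in the divisor: `var_pop` divides by n and `var_samp` by n - 1, so `var_samp = var_pop * n / (n - 1)`. The sketch below checks this with Python's `statistics` module on hypothetical values, and then on the figures printed above, taking n = 385465 (the sum of the Age/Gender counts shown in this notebook).

```python
import math
import statistics

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical values
n = len(xs)

var_pop = statistics.pvariance(xs)    # divides by n, like var_pop
var_samp = statistics.variance(xs)    # divides by n - 1, like var_samp
std_pop = statistics.pstdev(xs)       # square root of var_pop, like stddev_pop
print(var_pop, var_samp, std_pop)

assert math.isclose(var_samp, var_pop * n / (n - 1))
assert math.isclose(std_pop, math.sqrt(var_pop))

# The same relation holds for the values printed in this notebook.
n_rows = 385465
assert math.isclose(2.5196570178942546e7 * n_rows / (n_rows - 1),
                    2.5196635545799576e7, rel_tol=1e-9)
```

With several hundred thousand rows the two variants are almost identical, which is why the outputs above agree in the first six digits.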



#### Skewness and Kurtosis

In [314]:
from pyspark.sql.functions import skewness, kurtosis
trainDF.select(skewness("Purchase"), kurtosis("Purchase")).show()

+------------------+-------------------+
|skewness(Purchase)| kurtosis(Purchase)|
+------------------+-------------------+
|0.6011891058033598|-0.3332972102207403|
+------------------+-------------------+



In [315]:
spark.sql("""SELECT skewness(Purchase), kurtosis(Purchase)
             FROM trainDFTable""").show()

+----------------------------------+----------------------------------+
|skewness(CAST(Purchase AS DOUBLE))|kurtosis(CAST(Purchase AS DOUBLE))|
+----------------------------------+----------------------------------+
|                0.6011891058033598|               -0.3332972102207403|
+----------------------------------+----------------------------------+
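The negative kurtosis above tells us that Spark reports excess kurtosis (a normal distribution scores 0), since the raw fourth-moment ratio can never drop below 1. The sketch below assumes the usual population definitions based on central moments; the function name `skew_and_kurtosis` is hypothetical.

```python
import math

def skew_and_kurtosis(xs):
    """Population skewness and excess kurtosis from central moments."""
    n = len(xs)
    mean = math.fsum(xs) / n
    m2 = math.fsum((x - mean) ** 2 for x in xs) / n
    m3 = math.fsum((x - mean) ** 3 for x in xs) / n
    m4 = math.fsum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skew, excess_kurtosis

# Symmetric, flat data: zero skewness and negative excess kurtosis
skew, kurt = skew_and_kurtosis([1.0, 2.0, 3.0, 4.0, 5.0])
print(skew, kurt)
```

Read this way, the Purchase column is moderately right-skewed (about 0.60) and somewhat flatter than a normal distribution (about -0.33).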



#### Covariance and Correlation

In [316]:
from pyspark.sql.functions import corr, covar_pop, covar_samp
trainDF.select(corr("Product_Category_1", "Purchase"), covar_samp("Product_Category_1", "Purchase"),
    covar_pop("Product_Category_1", "Purchase")).show()

+----------------------------------+----------------------------------------+---------------------------------------+
|corr(Product_Category_1, Purchase)|covar_samp(Product_Category_1, Purchase)|covar_pop(Product_Category_1, Purchase)|
+----------------------------------+----------------------------------------+---------------------------------------+
|               -0.3441352558130045|                      -6799.583767801264|                      -6799.56612785012|
+----------------------------------+----------------------------------------+---------------------------------------+



In [317]:
spark.sql("""SELECT corr(Product_Category_1, Purchase), covar_samp(Product_Category_1, Purchase),
             covar_pop(Product_Category_1, Purchase)
             FROM trainDFTable""").show()

+------------------------------------------------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------+
|corr(CAST(Product_Category_1 AS DOUBLE), CAST(Purchase AS DOUBLE))|covar_samp(CAST(Product_Category_1 AS DOUBLE), CAST(Purchase AS DOUBLE))|covar_pop(CAST(Product_Category_1 AS DOUBLE), CAST(Purchase AS DOUBLE))|
+------------------------------------------------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------+
|                                               -0.3441352558130045|                                                      -6799.583767801264|                                                      -6799.56612785012|
+------------------------------------------------------------------+------------------------------------------------------------------------+-----------------------------------------------------------------------+
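Correlation is covariance rescaled by the two standard deviations, which is why `corr` always lies between -1 and 1 while the covariances carry the units of the data. A minimal plain-Python sketch, using hypothetical helper names `covar` and `pearson_corr` and made-up values:

```python
import math

def covar(xs, ys, sample=False):
    """Covariance of two equal-length sequences (population by default)."""
    n = len(xs)
    mean_x = math.fsum(xs) / n
    mean_y = math.fsum(ys) / n
    total = math.fsum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)

def pearson_corr(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covar(xs, ys) / math.sqrt(covar(xs, xs) * covar(ys, ys))

xs = [1.0, 2.0, 3.0, 4.0]   # hypothetical values
ys = [8.0, 6.0, 4.0, 2.0]   # falls as xs rises, so the correlation is -1

cov_pop = covar(xs, ys)
cov_samp = covar(xs, ys, sample=True)
r = pearson_corr(xs, ys)
print(cov_pop, cov_samp, r)
```

The correlation of about -0.34 above therefore indicates a moderate negative linear relationship between the `Product_Category_1` code and `Purchase`.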

#### Complex Aggregations

In [318]:
from pyspark.sql.functions import collect_set, collect_list
trainDF.agg(collect_set("Age"), collect_list("Age")).show()

+--------------------+--------------------+
|    collect_set(Age)|   collect_list(Age)|
+--------------------+--------------------+
|[55+, 51-55, 0-17...|[0-17, 0-17, 0-17...|
+--------------------+--------------------+



In [319]:
spark.sql("""SELECT collect_set(Age), collect_list(Age) FROM trainDFTable""").show()

+--------------------+--------------------+
|    collect_set(Age)|   collect_list(Age)|
+--------------------+--------------------+
|[55+, 51-55, 0-17...|[0-17, 0-17, 0-17...|
+--------------------+--------------------+



#### Grouping

In [320]:
trainDF.groupBy("Age", "Gender").count().show()

+-----+------+------+
|  Age|Gender| count|
+-----+------+------+
|51-55|     F|  6962|
|18-25|     M| 52744|
| 0-17|     F|  3548|
|46-50|     M| 22722|
|18-25|     F| 17192|
|  55+|     M| 11541|
|  55+|     F|  3583|
|36-45|     M| 57967|
|26-35|     F| 35670|
| 0-17|     M|  7027|
|36-45|     F| 19017|
|51-55|     M| 20053|
|26-35|     M|118248|
|46-50|     F|  9191|
+-----+------+------+



In [321]:
trainDF.select("Age","Gender","Purchase").groupBy("Age","Gender").agg(sum("Purchase").alias("Age Group Purchase")).show()

+-----+------+------------------+
|  Age|Gender|Age Group Purchase|
+-----+------+------------------+
|51-55|     F|       6.2724479E7|
|18-25|     M|      4.96745484E8|
| 0-17|     F|       2.9605153E7|
|46-50|     M|      2.13164876E8|
|18-25|     F|      1.43115866E8|
|  55+|     M|      1.08894726E8|
|  55+|     F|       3.2342305E7|
|36-45|     M|      5.48151866E8|
|26-35|     F|      3.11812355E8|
| 0-17|     M|       6.4693751E7|
|36-45|     F|       1.7023059E8|
|51-55|     M|      1.94432303E8|
|26-35|     M|     1.109729324E9|
|46-50|     F|       8.1011714E7|
+-----+------+------------------+



#### Grouping with Expressions

In [322]:
trainDF.groupBy("Age").agg(
  count("Purchase").alias("quan"),
  expr("count(Purchase)")).show()

+-----+------+---------------+
|  Age|  quan|count(Purchase)|
+-----+------+---------------+
|18-25| 69936|          69936|
|26-35|153918|         153918|
| 0-17| 10575|          10575|
|46-50| 31913|          31913|
|51-55| 27015|          27015|
|36-45| 76984|          76984|
|  55+| 15124|          15124|
+-----+------+---------------+



In [323]:
trainDF.groupBy("Age").agg(expr("avg(Purchase)"),expr("stddev_pop(Purchase)")).show()

+-----+-----------------+--------------------+
|  Age|    avg(Purchase)|stddev_pop(Purchase)|
+-----+-----------------+--------------------+
|18-25|9149.241449325098|    5030.73260424529|
|26-35|9235.707837939683|   5001.417464297852|
| 0-17|8917.154042553191|   5108.509402410989|
|46-50| 9218.08009275217|   4973.931293087956|
|51-55|9519.036905422914|   5074.569101301074|
|36-45|9331.581315598047|   5025.818032593724|
|  55+|9338.602948955302|   5026.226769074877|
+-----+-----------------+--------------------+



In [324]:
## To find the mean of each age group in train dataset - Average purchases in each age group
trainDF.groupby('Age').agg({'Purchase': 'mean'}).show()

+-----+-----------------+
|  Age|    avg(Purchase)|
+-----+-----------------+
|18-25|9149.241449325098|
|26-35|9235.707837939683|
| 0-17|8917.154042553191|
|46-50| 9218.08009275217|
|51-55|9519.036905422914|
|36-45|9331.581315598047|
|  55+|9338.602948955302|
+-----+-----------------+



In [325]:
trainDF.groupby('Age').agg({'Purchase': 'sum'}).show()

+-----+-------------+
|  Age|sum(Purchase)|
+-----+-------------+
|18-25|  6.3986135E8|
|26-35|1.421541679E9|
| 0-17|  9.4298904E7|
|46-50|  2.9417659E8|
|51-55| 2.57156782E8|
|36-45| 7.18382456E8|
|  55+| 1.41237031E8|
+-----+-------------+



In [326]:
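## Aggregates such as sum, min, max and count can be applied with groupBy to summarize each group.
## Here a dict maps every column to "sum"; non-numeric columns such as City_Category come back as null.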
exprs = {x: "sum" for x in trainDF.columns}
trainDF.groupBy("Age").agg(exprs).show(5)

+-----+------------------+-----------------------+-------------------+-------------+----------------+---------------+-------------------------------+-----------------------+--------+-----------+-----------------------+---------------+
|  Age|sum(City_Category)|sum(Product_Category_3)|sum(Marital_Status)|sum(Purchase)|    sum(User_ID)|sum(Occupation)|sum(Stay_In_Current_City_Years)|sum(Product_Category_1)|sum(Age)|sum(Gender)|sum(Product_Category_2)|sum(Product_ID)|
+-----+------------------+-----------------------+-------------------+-------------+----------------+---------------+-------------------------------+-----------------------+--------+-----------+-----------------------+---------------+
|18-25|              null|               271580.0|            14766.0|  6.3986135E8| 7.0132128001E10|       471577.0|                        81943.0|               357761.0|    null|       null|               459321.0|           null|
|26-35|              null|               592171.0|          
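The dict form of `agg` has one limitation that plain Python explains: a dict holds a single value per key, so it can carry only one aggregate per column. The sketch below shows the collapse and one way to limit the dict to columns worth summing; the column selection is an assumption made for illustration.

```python
# A dict literal with a repeated key silently keeps only the last value.
collapsed = {"Purchase": "sum", "Purchase": "mean"}
print(collapsed)

# Listing the columns explicitly avoids summing the grouping key and text columns.
# This particular selection is an assumption for illustration.
numeric_cols = ["Purchase", "Occupation", "Marital_Status"]
exprs = {column: "sum" for column in numeric_cols}
print(exprs)
```

To apply several aggregates to the same column, pass column expressions to `agg` instead, as the `agg(count(...), expr(...))` cell earlier in this section does.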

### 16. User-Defined Functions

##### a. Simple UDF for finding the cube of a number

In [327]:
udfExampleDF = spark.range(5).toDF("num")

def power3(double_value):
    return double_value ** 3

power3(2.0)

8.0

Once the function is created, we need to register it with Spark so that we can use it on all of our worker machines. Spark serializes the function on the driver and transfers it over the network to all executor processes. This happens regardless of language.

<br>When we use the function, one of two things happens. If the function is written in Scala or Java, it runs inside the JVM. This means there is little performance penalty, aside from the fact that we can’t take advantage of the code generation capabilities that Spark has for built-in functions.

<br>If the function is written in Python, something quite different happens. 
Spark starts a Python process on the worker, serializes all of the data to a format that Python can understand (remember, it was in the JVM before), executes the function row by row on that data in the Python process, and finally returns the results of the row operations to the JVM and Spark.
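The round trip described above can be imitated in a few lines of plain Python. This is only a toy model: `pickle` stands in for the serialization step and `python_worker` is a hypothetical name, while Spark's real worker protocol and wire format are different.

```python
import pickle

def power3(value):
    return value ** 3

def python_worker(payload, func):
    """Toy worker: decode the rows, apply func to each row, encode the results."""
    rows = pickle.loads(payload)
    return pickle.dumps([func(row) for row in rows])

rows_in_jvm = [0, 1, 2, 3, 4]            # stands in for spark.range(5)
payload = pickle.dumps(rows_in_jvm)      # JVM side: serialize the rows for Python
result = pickle.loads(python_worker(payload, power3))  # results travel back
print(result)
```

Every row pays for two serialization steps plus a Python function call, which is the overhead that built-in Spark functions avoid.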

In [0]:
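from pyspark.sql.functions import udf
## udf() defaults to a string return type; pass returnType (for example LongType()) for a numeric column.
power3udf = udf(power3)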

In [329]:
from pyspark.sql.functions import col
udfExampleDF.select(power3udf(col("num"))).show()

+-----------+
|power3(num)|
+-----------+
|          0|
|          1|
|          8|
|         27|
|         64|
+-----------+



##### b. Binning of Purchase column

In [0]:
def binning_purchase(purchase):
    """
    Map a Purchase amount to its bin label.

    args:
        purchase: Purchase amount (a number or a numeric string)
    return:
        bin label (Bin01, Bin02, ...), type=String

    0       - 500       -> Bin01
    501     - 1000      -> Bin02
    1001    - 2000      -> Bin03
    2001    - 4000      -> Bin04
    4001    - 6000      -> Bin05
    6001    - 8000      -> Bin06
    8001    - 10000     -> Bin07
    10001   - 20000     -> Bin08
    20001 and above     -> Bin09
    """
    # Anything that is not a positive number is treated as 0 and lands in Bin01
    purchase = float(purchase)
    if not purchase > 0:
        purchase = 0.0

    # Each branch is reached only if the amount exceeded all earlier upper bounds
    if purchase <= 500: return "Bin01"
    elif purchase <= 1000: return "Bin02"
    elif purchase <= 2000: return "Bin03"
    elif purchase <= 4000: return "Bin04"
    elif purchase <= 6000: return "Bin05"
    elif purchase <= 8000: return "Bin06"
    elif purchase <= 10000: return "Bin07"
    elif purchase <= 20000: return "Bin08"
    else:
        return "Bin09"

In [0]:
bin_purchase_udf = udf(binning_purchase)

In [332]:
trainDF.withColumn('Binned_Purchase',bin_purchase_udf('Purchase')).select("Purchase","Binned_Purchase").show(4)

+--------+---------------+
|Purchase|Binned_Purchase|
+--------+---------------+
|    2849|          Bin04|
|   16622|          Bin08|
|    8370|          Bin07|
|    1057|          Bin03|
+--------+---------------+
only showing top 4 rows
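The chain of comparisons in `binning_purchase` can also be written as a lookup in a sorted list of upper edges. The sketch below is an alternative formulation, not the notebook's method; `UPPER_EDGES` and `bin_label` are hypothetical names. `bisect_left` counts how many edges lie strictly below the amount, which keeps each upper edge inclusive.

```python
from bisect import bisect_left

# Inclusive upper edges of Bin01 to Bin08; anything above the last edge is Bin09.
UPPER_EDGES = [500, 1000, 2000, 4000, 6000, 8000, 10000, 20000]

def bin_label(purchase):
    """Return the bin label for a purchase amount (number or numeric string)."""
    position = bisect_left(UPPER_EDGES, float(purchase))
    return "Bin{:02d}".format(position + 1)

# The four amounts shown in the output above
for amount in ["2849", "16622", "8370", "1057"]:
    print(amount, bin_label(amount))
```

Adding or moving a bin then means editing one list instead of a chain of `elif` branches.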



### 17. Joins

#### Dataset
* The data comes from Surfeous, a recommender system prototype that uses social annotations (e.g., tags) and contextual models to find restaurants that best suit the user's preferences. It is a publicly available dataset from UCI. It has three tables: restaurants, consumers and user ratings. The tables used here are taken from them and filtered for our scenario.


#### Data dictionary :
* __RestGenInfo.csv__ contains :
    * placeID - Unique ID of the restaurant
    * latitude - Location detail
    * longitude - Location detail
    * name - Name of the restaurant
    * state - Name of the state
    * alcohol - Constraints on having alcoholic beverages
    * smoking_area - Information for smokers
    * price - Pricing type of the restaurant
    * franchise - Whether the restaurant has a franchise
    * area - Open or closed type of restaurant

* __Cuisine.csv__ contains :
    * placeID - Unique ID of the restaurant
    * Rcuisine - Different styles of food

* __PaymentMode.csv__ contains :
    * placeID - Unique ID of the restaurant
    * Rpayment - Different modes of payment

* __parking.csv__ contains :
    * placeID - Unique ID of the restaurant
    * parking_lot - Different types of parking available

#### Read the data as a dataframe

In [0]:
#from google.colab import files
#files.upload()

In [0]:
restoGen = spark.read.csv('./RestGenInfo.csv', header=True, inferSchema=True,nullValue='?')
cuisine = spark.read.csv('./Cuisine.csv', header=True, inferSchema=True)
paymentMode = spark.read.csv('./PaymentMode.csv', header=True, inferSchema=True)
parking = spark.read.csv('./parking.csv', header=True, inferSchema=True)

In [338]:
restoGen.select("placeID").distinct().count()

130

In [339]:
cuisine.select("placeID").distinct().count()

769

#### Check for any null values in the data

In [0]:
from pyspark.sql.functions import *

In [341]:
restoGen.select([count(when(isnan(c)| col(c).isNull(), 1)).alias(c) for c in restoGen.columns]).show()
cuisine.select([count(when(isnan(c)| col(c).isNull(), 1)).alias(c) for c in cuisine.columns]).show()
paymentMode.select([count(when(isnan(c)| col(c).isNull(), 1)).alias(c) for c in paymentMode.columns]).show()
parking.select([count(when(isnan(c)| col(c).isNull(), 1)).alias(c) for c in parking.columns]).show()

+-------+--------+---------+----+-----+-------+------------+-----+---------+----+
|placeID|latitude|longitude|name|state|alcohol|smoking_area|price|franchise|area|
+-------+--------+---------+----+-----+-------+------------+-----+---------+----+
|      0|       0|        0|   0|   18|      0|           0|    0|        0|   0|
+-------+--------+---------+----+-----+-------+------------+-----+---------+----+

+-------+--------+
|placeID|Rcuisine|
+-------+--------+
|      0|       0|
+-------+--------+

+-------+--------+
|placeID|Rpayment|
+-------+--------+
|      0|       0|
+-------+--------+

+-------+-----------+
|placeID|parking_lot|
+-------+-----------+
|      0|          0|
+-------+-----------+



In [0]:
restoGen = restoGen.dropna()

In [343]:
restoGen.select('placeID').distinct().count()

112

In [344]:
cuisine.select('placeID').distinct().count()

769

In [345]:
paymentMode.select('placeID').distinct().count()

616

In [346]:
parking.select('placeID').distinct().count()

675

In [347]:
cuisine.select('Rcuisine').distinct().count()

59

In [0]:
restoGen.createOrReplaceTempView('restoGenTable')
cuisine.createOrReplaceTempView('cuisineTable')
paymentMode.createOrReplaceTempView('paymentModeTable')
parking.createOrReplaceTempView('parkingTable')

In [349]:
## Count the restaurants (as numberOfHotels) for each payment mode and area,
## ordered by numberOfHotels in descending order.

spark.sql('''select count(*) as numberOfHotels, Rpayment, area from
restoGenTable a join paymentModeTable b 
on a.placeID = b.placeID group by Rpayment, area 
order by numberOfHotels desc''').show()

+--------------+-------------------+------+
|numberOfHotels|           Rpayment|  area|
+--------------+-------------------+------+
|            92|               cash|closed|
|            44|               VISA|closed|
|            40|MasterCard-Eurocard|closed|
|            24|   American_Express|closed|
|            11|   bank_debit_cards|closed|
|            11|               cash|  open|
|             3|               VISA|  open|
|             2|MasterCard-Eurocard|  open|
|             1|   bank_debit_cards|  open|
|             1|      Carte_Blanche|closed|
|             1|   American_Express|  open|
+--------------+-------------------+------+



#### Inner Join

In [350]:
inner_join = restoGen.join(paymentMode, restoGen.placeID == paymentMode.placeID,how='inner') 
inner_join.show(4)

+-------+----------+------------+--------------------+-------+-----------------+-------------+------+---------+------+-------+-------------------+
|placeID|  latitude|   longitude|                name|  state|          alcohol| smoking_area| price|franchise|  area|placeID|           Rpayment|
+-------+----------+------------+--------------------+-------+-----------------+-------------+------+---------+------+-------+-------------------+
| 135106|22.1497088|-100.9760928|El Rincón de San ...|    SLP|        Wine-Beer|  only at bar|medium|        f|  open| 135106|               cash|
| 135106|22.1497088|-100.9760928|El Rincón de San ...|    SLP|        Wine-Beer|  only at bar|medium|        f|  open| 135106|               VISA|
| 135106|22.1497088|-100.9760928|El Rincón de San ...|    SLP|        Wine-Beer|  only at bar|medium|        f|  open| 135106|MasterCard-Eurocard|
| 135088|18.8760113| -99.2198896|   Cafeteria cenidet|Morelos|No_Alcohol_Served|not permitted|   low|        f|closed|

In [351]:
count_of_hotels = inner_join.select('Rpayment','area').groupby('area','Rpayment').count()

count_of_hotels = count_of_hotels.withColumnRenamed('count','NumberofHotels')
count_of_hotels.show()

+------+-------------------+--------------+
|  area|           Rpayment|NumberofHotels|
+------+-------------------+--------------+
|  open|   American_Express|             1|
|  open|   bank_debit_cards|             1|
|closed|   bank_debit_cards|            11|
|  open|               cash|            11|
|closed|   American_Express|            24|
|closed|               VISA|            44|
|closed|      Carte_Blanche|             1|
|  open|               VISA|             3|
|closed|               cash|            92|
|closed|MasterCard-Eurocard|            40|
|  open|MasterCard-Eurocard|             2|
+------+-------------------+--------------+



In [352]:
count_of_hotels.orderBy(count_of_hotels.NumberofHotels.desc()).show()

+------+-------------------+--------------+
|  area|           Rpayment|NumberofHotels|
+------+-------------------+--------------+
|closed|               cash|            92|
|closed|               VISA|            44|
|closed|MasterCard-Eurocard|            40|
|closed|   American_Express|            24|
|closed|   bank_debit_cards|            11|
|  open|               cash|            11|
|  open|               VISA|             3|
|  open|MasterCard-Eurocard|             2|
|  open|   bank_debit_cards|             1|
|  open|   American_Express|             1|
|closed|      Carte_Blanche|             1|
+------+-------------------+--------------+



#### Left Outer Join

In [353]:
## Count the number of cuisines served by the restaurants
print("The number of distinct cuisines available =", cuisine.select('Rcuisine').distinct().count())
left_join = restoGen.join(cuisine, on=restoGen.placeID == cuisine.placeID, how='left')
left_join.show()
print("The number of cuisines used in the restaurants")
left_join.select(countDistinct('Rcuisine').alias("Distinct cuisines used in restaurants")).show()
print("The cuisines available are")
left_join.select('Rcuisine').distinct().show()

The number of distinct cuisines available = 59
+-------+----------+------------+--------------------+----------+-----------------+-------------+------+---------+------+-------+---------+
|placeID|  latitude|   longitude|                name|     state|          alcohol| smoking_area| price|franchise|  area|placeID| Rcuisine|
+-------+----------+------------+--------------------+----------+-----------------+-------------+------+---------+------+-------+---------+
| 132560|23.7523041| -99.1669133|  puesto de gorditas|Tamaulipas|No_Alcohol_Served|    permitted|   low|        f|  open| 132560| Regional|
| 132572|22.1416471|-100.9927118|        Cafe Chaires|       SLP|No_Alcohol_Served|not permitted|   low|        f|closed| 132572|Cafeteria|
| 132583|18.9222904|  -99.234332|    McDonalds Centro|   Morelos|No_Alcohol_Served|not permitted|   low|        t|closed| 132583| American|
| 132608|23.7588052| -99.1651297|Hamburguesas La p...|Tamaulipas|No_Alcohol_Served|    permitted|   low

In [354]:
## List the distinct restaurant names that have valet parking
spark.sql('''select distinct name, parking_lot from restoGenTable a join parkingTable b
          on a.placeID = b.placeID where b.parking_lot = 'valet parking' ''').show()

+--------------------+-------------+
|                name|  parking_lot|
+--------------------+-------------+
|La Posada del Virrey|valet parking|
+--------------------+-------------+



#### Right Join

In [355]:
right_join = restoGen.join(other=parking,on=parking.placeID==restoGen.placeID,how='right')
names_of_restaurants = right_join.select('name','parking_lot').filter(parking.parking_lot=='valet parking')
names_of_restaurants.distinct().filter(names_of_restaurants.name.isNotNull()).show()

+--------------------+-------------+
|                name|  parking_lot|
+--------------------+-------------+
|La Posada del Virrey|valet parking|
+--------------------+-------------+



#### Full outer Join

In [356]:
### Identify the placeIDs in parkingTable for which no payment mode is recorded
spark.sql("""SELECT parkingTable.placeID,parkingTable.parking_lot,paymentModeTable.Rpayment
FROM parkingTable
FULL OUTER JOIN paymentModeTable ON parkingTable.placeID=paymentModeTable.placeID WHERE paymentModeTable.Rpayment is NULL""").show(20)


+-------+-----------+--------+
|placeID|parking_lot|Rpayment|
+-------+-----------+--------+
| 133018|       none|    null|
| 132478|       none|    null|
| 132478|        fee|    null|
| 132663|       none|    null|
| 135108|       none|    null|
| 134979|       none|    null|
| 133010|        yes|    null|
| 132479|     street|    null|
| 132831|       none|    null|
| 132292|     public|    null|
| 132292|     street|    null|
| 133012|       none|    null|
| 132661|       none|    null|
| 132484|        yes|    null|
| 132326|        yes|    null|
| 132326|     public|    null|
| 132881|       none|    null|
| 135019|       none|    null|
| 132999|       none|    null|
| 132568|       none|    null|
+-------+-----------+--------+
only showing top 20 rows
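
To make the semantics of the query above concrete, here is a minimal plain-Python sketch of a full outer join followed by the `Rpayment IS NULL` filter. The sample rows and the helper `full_outer_join` are hypothetical and exist only for illustration; this is not how Spark executes the join.

```python
# Plain-Python sketch of FULL OUTER JOIN followed by "WHERE Rpayment IS NULL".
# The rows below are a small hypothetical sample, not the notebook's data.
parking_rows = [(133018, "none"), (132478, "none"), (132478, "fee"), (135106, "none")]
payment_rows = [(135106, "cash"), (135106, "VISA"), (999999, "cash")]


def full_outer_join(left, right):
    """Join two lists of (key, value) pairs on key, keeping unmatched rows from both sides."""
    left_keys = {key for key, _ in left}
    joined = []
    for key, left_value in left:
        matches = [value for right_key, value in right if right_key == key]
        if matches:
            joined.extend((key, left_value, value) for value in matches)
        else:
            joined.append((key, left_value, None))    # no payment row: Rpayment is null
    for key, right_value in right:
        if key not in left_keys:
            joined.append((None, None, right_value))  # no parking row: parking columns are null
    return joined


joined = full_outer_join(parking_rows, payment_rows)

# Keep only the rows where the payment side is missing.
missing_payment = [row for row in joined if row[2] is None]
for row in missing_payment:
    print(row)
```

Rows that exist only in the payment table always carry a non-null `Rpayment`, so the filter discards them. With this filter, a left outer join from `parkingTable` would return the same rows.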



#### Some more Queries

#### 1. The restaurant names with their cuisine styles, price, location details (latitude, longitude), and smoking_area information, only for restaurants that are located in Morelos state and have closed roofing, ordered by price

In [357]:
spark.sql('''select distinct name, Rcuisine, price, latitude, longitude, smoking_area from 
restoGenTable a join cuisineTable b 
where a.placeID = b.placeID and a.state = 'Morelos' and a.area = 'closed' 
order by price''').show(truncate = False)

+----------------------------------------------------+---------------+------+----------+-----------+-------------+
|name                                                |Rcuisine       |price |latitude  |longitude  |smoking_area |
+----------------------------------------------------+---------------+------+----------+-----------+-------------+
|Restaurant Las Mananitas                            |International  |high  |18.928798 |-99.239513 |none         |
|Restaurant and Bar and Clothesline Carlos N Charlies|Bar_Pub_Brewery|high  |18.948657 |-99.235361 |section      |
|Restaurant and Bar and Clothesline Carlos N Charlies|Bar            |high  |18.948657 |-99.235361 |section      |
|Restaurant Bar Coty y Pablo                         |Bar            |low   |18.875011 |-99.159422 |none         |
|Cafeteria cenidet                                   |Cafeteria      |low   |18.8760113|-99.2198896|not permitted|
|McDonalds Centro                                    |American       |low   |18.

#### 2. The distinct count of restaurants (as numberOfHotels) for each payment mode and area, ordered by numberOfHotels in descending order

In [358]:
spark.sql('''select distinct count(*) as numberOfHotels, Rpayment, area from
restoGenTable a join paymentModeTable b 
where a.placeID = b.placeID group by Rpayment, area 
order by numberOfHotels desc''').show()

+--------------+-------------------+------+
|numberOfHotels|           Rpayment|  area|
+--------------+-------------------+------+
|            92|               cash|closed|
|            44|               VISA|closed|
|            40|MasterCard-Eurocard|closed|
|            24|   American_Express|closed|
|            11|   bank_debit_cards|closed|
|            11|               cash|  open|
|             3|               VISA|  open|
|             2|MasterCard-Eurocard|  open|
|             1|   bank_debit_cards|  open|
|             1|      Carte_Blanche|closed|
|             1|   American_Express|  open|
+--------------+-------------------+------+



spark.sql("""SELECT * FROM TableA
FULL OUTER JOIN TableB
ON TableA.name = TableB.name""").show()

right_join = ta.join(tb, ta.name == tb.name,how='right') # Could also use 'right_outer'
right_join.show()

spark.sql("""SELECT * FROM TableA
RIGHT OUTER JOIN TableB
ON TableA.name = TableB.name""").show()

left_join = ta.join(tb, ta.name == tb.name,how='left') # Could also use 'left_outer'
left_join.show()

spark.sql("""SELECT * FROM TableA
LEFT OUTER JOIN TableB
ON TableA.name = TableB.name""").show()

#### Natural Joins
A natural join makes an implicit guess at the columns on which you would like to join.
It finds the columns whose names match in both tables and joins on them.
Left, right, and outer natural joins are all supported.

WARNING:
Implicit is always dangerous!
A natural join gives incorrect results when the two DataFrames/tables share a column name (such as a generic id) that means different things in each dataset.
The query below is safe only because placeID, the one column the two tables share, identifies the same restaurant in both.
Always use this join with caution.
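
As a rough illustration of how a natural join picks its keys, the sketch below intersects two lists of column names in plain Python. The column lists are copied from the tables shown in this notebook; the helper `natural_join_keys` and the second pair of tables are hypothetical.

```python
# Plain-Python illustration of how a natural join chooses its join columns.
resto_columns = ["placeID", "latitude", "longitude", "name", "state",
                 "alcohol", "smoking_area", "price", "franchise", "area"]
payment_columns = ["placeID", "Rpayment"]


def natural_join_keys(left_columns, right_columns):
    """Return the column names present on both sides, in left-side order."""
    right = set(right_columns)
    return [column for column in left_columns if column in right]


keys = natural_join_keys(resto_columns, payment_columns)
print(keys)

# Two hypothetical tables that each have their own, unrelated "id" column:
# the natural join would still match on it, because only the name is compared.
risky_keys = natural_join_keys(["id", "name"], ["id", "course"])
print(risky_keys)
```

Only the names are compared, never the meaning of the columns, which is why an explicit join condition is usually the safer choice.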

In [359]:
#spark.sql("""SELECT * FROM TableA NATURAL JOIN TableB""").show()

spark.sql('''select  * from restoGenTable a NATURAL JOIN paymentModeTable b ''').show(10)

+-------+----------+------------+--------------------+-------+-----------------+-------------+------+---------+------+-------------------+
|placeID|  latitude|   longitude|                name|  state|          alcohol| smoking_area| price|franchise|  area|           Rpayment|
+-------+----------+------------+--------------------+-------+-----------------+-------------+------+---------+------+-------------------+
| 135106|22.1497088|-100.9760928|El Rincón de San ...|    SLP|        Wine-Beer|  only at bar|medium|        f|  open|               cash|
| 135106|22.1497088|-100.9760928|El Rincón de San ...|    SLP|        Wine-Beer|  only at bar|medium|        f|  open|               VISA|
| 135106|22.1497088|-100.9760928|El Rincón de San ...|    SLP|        Wine-Beer|  only at bar|medium|        f|  open|MasterCard-Eurocard|
| 135088|18.8760113| -99.2198896|   Cafeteria cenidet|Morelos|No_Alcohol_Served|not permitted|   low|        f|closed|               cash|
| 135086| 22.141421| -101.0

#### Cross (Cartesian) Joins
In simplest terms, cross joins are inner joins that do not specify a predicate.
A cross join pairs every single row in the left DataFrame with every single row in the right DataFrame.
This causes an explosion in the number of rows in the resulting DataFrame:
if each DataFrame has 1,000 rows, their cross join has 1,000,000 (1,000 x 1,000) rows.
For this reason, you must state explicitly that you want a cross join, using crossJoin on DataFrames or the CROSS JOIN keyword in SQL:

tableA.crossJoin(tableB).show()

spark.sql("""SELECT * FROM TableA CROSS JOIN TableB""").show()
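
The row explosion can be checked without Spark. The sketch below uses `itertools.product` from the Python standard library as a stand-in for a cross join; the small lists are hypothetical sample rows.

```python
from itertools import product

# A tiny cross join: every left row is paired with every right row.
small_left = ["a1", "a2", "a3"]
small_right = ["b1", "b2"]
small_cross = list(product(small_left, small_right))
print(small_cross)

# The same pairing for two inputs of 1,000 rows each.
# Counting lazily avoids materializing the million pairs in memory.
big_count = sum(1 for _ in product(range(1000), range(1000)))
print(big_count)
```

The result size is always the product of the two input sizes, so even moderately large inputs become unmanageable quickly.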

#### Random Samples

In [360]:
## Create a sample DataFrame from the base DataFrame.
## The sample method returns a new DataFrame containing a sample of the base DataFrame.
## It takes 3 parameters:
## withReplacement = True or False, to sample an observation with or without replacement.
## fraction = x, where x = 0.5 means we want roughly 50% of the rows in the sample.
## seed = a number that makes the sample reproducible.
sampleDF1 = trainDF.sample(False, 0.2, 1234)
sampleDF2 = trainDF.sample(False, 0.2, 4321)
print(sampleDF1.count(), sampleDF2.count())

76916 76981
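
The two counts above differ because the fraction is a per-row probability, not an exact size. The sketch below mimics this in plain Python, assuming that sampling without replacement keeps each row independently with probability `fraction`; the helper `bernoulli_sample` and the 100,000-row input are hypothetical.

```python
import random


def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`, reproducibly for a given seed."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]


rows = list(range(100_000))
sample1 = bernoulli_sample(rows, 0.2, 1234)
sample2 = bernoulli_sample(rows, 0.2, 4321)

# Both sizes land near 20,000, but they are not exactly equal to it or to each other.
print(len(sample1), len(sample2))

# Re-running with the same seed reproduces the same sample.
print(sample1 == bernoulli_sample(rows, 0.2, 1234))
```

If an exact sample size is required, a different approach is needed, for example shuffling the rows and taking the first n.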


#### Map Transformation

In [361]:
## Apply a map operation to DataFrame columns.
## map applies a function to each row and returns the result as an RDD.
## Here we map each row of the User_ID column of train to the pair (x, 1)
## with a lambda function and print the first 5 elements of the mapped RDD.

trainDF.select('User_ID').rdd.map(lambda x:(x,1)).take(5)

[(Row(User_ID='1000001'), 1),
 (Row(User_ID='1000001'), 1),
 (Row(User_ID='1000001'), 1),
 (Row(User_ID='1000001'), 1),
 (Row(User_ID='1000001'), 1)]

*__Prior to Spark 2.0, spark_df.map would alias to spark_df.rdd.map(). 
With Spark 2.0, you must explicitly call .rdd first.__*
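
Mapping each element to `(x, 1)` is usually the first half of a count-by-key computation. The plain-Python sketch below shows the idea on a hypothetical list of user IDs; only `'1000001'` appears in the output above, the other values are made up.

```python
from collections import Counter

# Hypothetical User_ID values standing in for the column of the DataFrame.
user_ids = ["1000001", "1000001", "1000002", "1000001", "1000003", "1000002"]

# Step 1: the map step, pairing every element with the number 1.
pairs = list(map(lambda x: (x, 1), user_ids))
print(pairs[:3])

# Step 2: the reduce step, summing the 1s for each key.
counts = Counter()
for key, one in pairs:
    counts[key] += one
print(dict(counts))
```

In Spark the second step would run in parallel across partitions, but the pairing logic is the same.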

#### Sorting Rows

In [362]:
## Sort the DataFrame based on one or more columns.
## The orderBy operation returns the rows sorted by the given column(s).
## It takes two arguments:
## a list of columns, and
## ascending = True or False for ascending or descending order (a list of booleans when sorting by more than one column).
## Sort the train DataFrame by 'Purchase' in descending order.
trainDF.orderBy(trainDF.Purchase.desc()).show(5)

+-------+----------+------+-----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|User_ID|Product_ID|Gender|  Age|Occupation|City_Category|Stay_In_Current_City_Years|Marital_Status|Product_Category_1|Product_Category_2|Product_Category_3|Purchase|
+-------+----------+------+-----+----------+-------------+--------------------------+--------------+------------------+------------------+------------------+--------+
|1000264| P00144042|     M|36-45|         0|            B|                         1|             1|                 2|                 3|                 4|    9999|
|1000181| P00003242|     M|18-25|        17|            C|                         1|             0|                 8|                15|              null|    9999|
|1000058| P00189642|     M|26-35|         2|            B|                         3|             0|                 8|                13|              null|    9999

#### Repartition and Coalesce
Another important optimization opportunity is to partition the data according to frequently filtered columns. Partitioning controls the physical layout of data across the cluster, including the partitioning scheme and the number of partitions.

Repartition incurs a full shuffle of the data, regardless of whether one is necessary. This means you should typically repartition only when the future number of partitions is greater than the current number, or when you want to partition by a set of columns.

In [363]:
## Find existing partitions count
trainDF.rdd.getNumPartitions()
## Do the repartition
## trainDF.repartition(5)

## Repartition based on a column
## If we know we are going to filter by a certain column often,
## it can be worth repartitioning based on that column.
## trainDF.repartition(col("Purchase"))

## We can optionally specify the number of partitions as well.
## trainDF.repartition(5, col("Purchase"))

## Coalesce, on the other hand, does not incur a full shuffle; it tries to combine existing partitions.
## The following shuffles the data into 5 partitions based on Purchase,
## then coalesces them into 2 (without a full shuffle).
## trainDF.repartition(5, col("Purchase")).coalesce(2)

2
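
The difference between the two operations can be pictured with a toy model in plain Python. This is a simplified sketch, not Spark's implementation: it assumes repartitioning by a column places rows by a hash of that column, and that coalescing merges whole partitions without looking at individual rows. The helper names and the sample Purchase values are hypothetical.

```python
def repartition_by_key(rows, key_index, num_partitions):
    """Toy full shuffle: every row is placed by hashing its key column."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key_index]) % num_partitions].append(row)
    return partitions


def coalesce(partitions, num_partitions):
    """Toy coalesce: whole partitions are merged; rows are never re-hashed."""
    merged = [[] for _ in range(num_partitions)]
    for index, partition in enumerate(partitions):
        merged[index % num_partitions].extend(partition)
    return merged


# Hypothetical (User_ID, Purchase) rows.
purchases = [9999, 8370, 1422, 9999, 1057, 8370, 7969, 1422]
rows = [(user, purchase) for user, purchase in enumerate(purchases)]

five = repartition_by_key(rows, 1, 5)
two = coalesce(five, 2)
print([len(p) for p in five], [len(p) for p in two])
```

In the toy model, all rows with the same Purchase value end up in the same partition after the hash step, and coalescing only concatenates partitions, which is why it is cheaper than a second shuffle.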

### Miscellaneous

#### Unions

In [364]:
df1 = spark.createDataFrame([[1, 'Alex', 25],[3, 'Carol', 53],[5, 'Emily', 25],[7, 'Gabriel', 32],[9, 'Ilma', 35],[11, 'Kim', 45]], ['id', 'name', 'age'])
df2 = spark.createDataFrame([[2, 'Ben', 66],[4, 'Daniel', 28],[6, 'Frank', 64],[8, 'Harley', 29],[10, 'Jack', 35],[12, 'Litmya', 45]], ['id', 'name', 'age'])
## show() prints the table itself and returns None, so it is not wrapped in print()
print("Before")
print("DataFrame-1")
df1.show()
print("DataFrame-2")
df2.show()
print("After")
df1 = df1.union(df2)
print("DataFrame-1")
df1.show()

Before
DataFrame-1
+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|   Alex| 25|
|  3|  Carol| 53|
|  5|  Emily| 25|
|  7|Gabriel| 32|
|  9|   Ilma| 35|
| 11|    Kim| 45|
+---+-------+---+

DataFrame-2
+---+------+---+
| id|  name|age|
+---+------+---+
|  2|   Ben| 66|
|  4|Daniel| 28|
|  6| Frank| 64|
|  8|Harley| 29|
| 10|  Jack| 35|
| 12|Litmya| 45|
+---+------+---+

After
DataFrame-1
+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|   Alex| 25|
|  3|  Carol| 53|
|  5|  Emily| 25|
|  7|Gabriel| 32|
|  9|   Ilma| 35|
| 11|    Kim| 45|
|  2|    Ben| 66|
|  4| Daniel| 28|
|  6|  Frank| 64|
|  8| Harley| 29|
| 10|   Jack| 35|
| 12| Litmya| 45|
+---+-------+---+


#### Unions and conditional append

In [365]:
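## df1 was reassigned to df1.union(df2) in the previous cell, so it already holds df2's rows;
## union keeps duplicates, which is why the df2 rows with age < 60 appear twice below.
df1.union(df2).where("age < 60").show()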

+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|   Alex| 25|
|  3|  Carol| 53|
|  5|  Emily| 25|
|  7|Gabriel| 32|
|  9|   Ilma| 35|
| 11|    Kim| 45|
|  4| Daniel| 28|
|  8| Harley| 29|
| 10|   Jack| 35|
| 12| Litmya| 45|
|  4| Daniel| 28|
|  8| Harley| 29|
| 10|   Jack| 35|
| 12| Litmya| 45|
+---+-------+---+
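
The repeated rows in the output above come from union keeping duplicates. The plain-Python sketch below reproduces the effect on a few rows taken from the tables above, using list concatenation as a stand-in for union; the variable names are hypothetical.

```python
# A few (id, name, age) rows from the two DataFrames above.
df1_rows = [(1, "Alex", 25), (3, "Carol", 53)]
df2_rows = [(2, "Ben", 66), (4, "Daniel", 28)]

# First union: df1 now also contains the rows of df2.
merged = df1_rows + df2_rows

# Second union with df2: its rows are appended again, duplicates included.
merged_twice = merged + df2_rows
under_60 = [row for row in merged_twice if row[2] < 60]
print(under_60)

# Dropping duplicates afterwards restores one copy of each row, keeping first-seen order.
deduplicated = list(dict.fromkeys(under_60))
print(deduplicated)
```

On a DataFrame, calling `distinct()` on the result of the union removes the repeated rows in the same way.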



#### String Manipulations

In [366]:
from pyspark.sql.functions import lit, ltrim, rtrim, rpad, lpad, trim

trainDF.select(
ltrim(lit(" HELLO ")).alias("ltrim"),
rtrim(lit(" HELLO ")).alias("rtrim"),
trim(lit(" HELLO ")).alias("trim"),
lpad(lit("HELLO"), 7, " ").alias("lp"),
rpad(lit("HELLO"), 7, " ").alias("rp"))\
.show(2,truncate=False)

+------+------+-----+-------+-------+
|ltrim |rtrim |trim |lp     |rp     |
+------+------+-----+-------+-------+
|HELLO | HELLO|HELLO|  HELLO|HELLO  |
|HELLO | HELLO|HELLO|  HELLO|HELLO  |
+------+------+-----+-------+-------+
only showing top 2 rows



In [367]:
spark.sql("""SELECT
ltrim(' HELLLOOOO ') AS ltrim,
rtrim(' HELLLOOOO ') AS rtrim,
trim(' HELLLOOOO ') AS trim,
lpad('HELLOOOO ', 3, ' ') AS lp,
rpad('HELLOOOO ', 10, ' ') AS rp
FROM
trainDFTable""").show(2)

+----------+----------+---------+---+----------+
|     ltrim|     rtrim|     trim| lp|        rp|
+----------+----------+---------+---+----------+
|HELLLOOOO | HELLLOOOO|HELLLOOOO|HEL|HELLOOOO  |
|HELLLOOOO | HELLLOOOO|HELLLOOOO|HEL|HELLOOOO  |
+----------+----------+---------+---+----------+
only showing top 2 rows
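
Note that `lpad('HELLOOOO ', 3, ' ')` returned `HEL` above: when the target length is shorter than the string, the value is truncated rather than padded. The plain-Python sketch below reproduces the behaviour seen in these outputs; the helpers `lpad` and `rpad` are hypothetical re-implementations written for illustration, not Spark's code.

```python
def lpad(text, length, pad=" "):
    """Left-pad text to exactly `length` characters, truncating if it is longer."""
    if len(text) >= length or not pad:
        return text[:length]
    fill = (pad * length)[: length - len(text)]
    return fill + text


def rpad(text, length, pad=" "):
    """Right-pad text to exactly `length` characters, truncating if it is longer."""
    if len(text) >= length or not pad:
        return text[:length]
    fill = (pad * length)[: length - len(text)]
    return text + fill


# Python's own strip methods correspond to ltrim, rtrim, and trim.
trimmed = {
    "ltrim": " HELLO ".lstrip(),
    "rtrim": " HELLO ".rstrip(),
    "trim": " HELLO ".strip(),
}
print(trimmed)
print(repr(lpad("HELLO", 7)), repr(rpad("HELLO", 7)))
print(repr(lpad("HELLOOOO ", 3)), repr(rpad("HELLOOOO ", 10)))
```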



#### Regular Expressions

In [368]:
from pyspark.sql.functions import col, regexp_replace
regex_string = "F|M"

X = trainDF.select(
regexp_replace(col("Gender"), regex_string, "MALE_OR_FEMALE")
.alias("Gender_DECODE"),
col("Gender"))
X.show(10)

+--------------+------+
| Gender_DECODE|Gender|
+--------------+------+
|MALE_OR_FEMALE|     F|
|MALE_OR_FEMALE|     F|
|MALE_OR_FEMALE|     F|
|MALE_OR_FEMALE|     F|
|MALE_OR_FEMALE|     F|
|MALE_OR_FEMALE|     F|
|MALE_OR_FEMALE|     F|
|MALE_OR_FEMALE|     F|
|MALE_OR_FEMALE|     F|
|MALE_OR_FEMALE|     F|
+--------------+------+
only showing top 10 rows



In [369]:
spark.sql("""
SELECT
regexp_replace(Gender, 'F|M', 'MALE_OR_FEMALE') as
Gender_DECODE,
Gender
FROM
trainDFTable
""").show(2)

+--------------+------+
| Gender_DECODE|Gender|
+--------------+------+
|MALE_OR_FEMALE|     F|
|MALE_OR_FEMALE|     F|
+--------------+------+
only showing top 2 rows



In [370]:
from pyspark.sql.functions import translate
trainDF.select(
translate(col("Gender"), "FM", "01"),
col("Gender"))\
.show(10)

+-------------------------+------+
|translate(Gender, FM, 01)|Gender|
+-------------------------+------+
|                        0|     F|
|                        0|     F|
|                        0|     F|
|                        0|     F|
|                        0|     F|
|                        0|     F|
|                        0|     F|
|                        0|     F|
|                        0|     F|
|                        0|     F|
+-------------------------+------+
only showing top 10 rows



In [371]:
spark.sql("""
SELECT
translate(Gender, 'FM', '01'),
Gender
FROM
trainDFTable
""").show(10)

+-------------------------+------+
|translate(Gender, FM, 01)|Gender|
+-------------------------+------+
|                        0|     F|
|                        0|     F|
|                        0|     F|
|                        0|     F|
|                        0|     F|
|                        0|     F|
|                        0|     F|
|                        0|     F|
|                        0|     F|
|                        0|     F|
+-------------------------+------+
only showing top 10 rows
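
The two functions differ in what they replace: `regexp_replace` substitutes every match of a pattern with one replacement string, while `translate` maps individual characters one to one. Python's standard library has close analogues, sketched below with `re.sub` and `str.translate` on hypothetical sample values.

```python
import re

genders = ["F", "M"]

# Pattern-based replacement: every match of "F|M" becomes the same string.
decoded = [re.sub("F|M", "MALE_OR_FEMALE", g) for g in genders]
print(decoded)

# Character-level mapping: F -> 0 and M -> 1, position by position.
table = str.maketrans("FM", "01")
translated = [g.translate(table) for g in genders]
print(translated)

# On a longer string the difference is easier to see.
print(re.sub("F|M", "X", "FM"), "FM".translate(table))
```

Character-level translation preserves the distinction between the two input values, which the single replacement string does not.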



#### Working with Date and Time

In [372]:
from pyspark.sql.functions import current_date, current_timestamp
dateDF = spark.range(10)\
.withColumn("today", current_date())\
.withColumn("now", current_timestamp())
dateDF.show(truncate = False)

+---+----------+-----------------------+
|id |today     |now                    |
+---+----------+-----------------------+
|0  |2019-12-01|2019-12-01 03:29:06.722|
|1  |2019-12-01|2019-12-01 03:29:06.722|
|2  |2019-12-01|2019-12-01 03:29:06.722|
|3  |2019-12-01|2019-12-01 03:29:06.722|
|4  |2019-12-01|2019-12-01 03:29:06.722|
|5  |2019-12-01|2019-12-01 03:29:06.722|
|6  |2019-12-01|2019-12-01 03:29:06.722|
|7  |2019-12-01|2019-12-01 03:29:06.722|
|8  |2019-12-01|2019-12-01 03:29:06.722|
|9  |2019-12-01|2019-12-01 03:29:06.722|
+---+----------+-----------------------+



In [373]:
dateDF.createOrReplaceTempView("dateDFTable")
dateDF.printSchema()

root
 |-- id: long (nullable = false)
 |-- today: date (nullable = false)
 |-- now: timestamp (nullable = false)



In [374]:
from pyspark.sql.functions import date_add, date_sub
dateDF.select(date_sub(col("today"), 10),date_add(col("today"), 10)).show(1)

+-------------------+-------------------+
|date_sub(today, 10)|date_add(today, 10)|
+-------------------+-------------------+
|         2019-11-21|         2019-12-11|
+-------------------+-------------------+
only showing top 1 row



In [375]:
spark.sql("""
SELECT
date_sub(today, 10),
date_add(today, 10)
FROM
dateDFTable
""").show(1)

+-------------------+-------------------+
|date_sub(today, 10)|date_add(today, 10)|
+-------------------+-------------------+
|         2019-11-21|         2019-12-11|
+-------------------+-------------------+
only showing top 1 row



In [376]:
from pyspark.sql.functions import datediff, months_between, to_date
dateDF\
.withColumn("week_ago", date_sub(col("today"), 7))\
.select(datediff(col("week_ago"), col("today")).alias('datediff_today_weekago'))\
.show(1)

+----------------------+
|datediff_today_weekago|
+----------------------+
|                    -7|
+----------------------+
only showing top 1 row



In [377]:
dateDF\
.select(
to_date(lit("2017-01-01")).alias("start"),
to_date(lit("2018-02-18")).alias("end"))\
.select(months_between(col("end"), col("start")))\
.show(1)

+--------------------------+
|months_between(end, start)|
+--------------------------+
|                13.5483871|
+--------------------------+
only showing top 1 row



In [378]:
spark.sql("""
SELECT
to_date('2016-01-01') AS date,
months_between('2017-01-01', '2016-01-01') AS months_between,
datediff('2017-01-01', '2016-01-01') AS datediff_days
FROM
dateDFTable
""").show(2)

+----------+--------------+-------------+
|      date|months_between|datediff_days|
+----------+--------------+-------------+
|2016-01-01|          12.0|          366|
|2016-01-01|          12.0|          366|
+----------+--------------+-------------+
only showing top 2 rows
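
The values 13.5483871, 12.0, and 366 shown above can be reproduced by hand. The sketch below is a plain-Python approximation for date-only inputs. It assumes the documented rule that `months_between` returns a whole number when both dates fall on the same day of the month (or both on the last day), and otherwise counts the leftover days as a fraction of a 31-day month, rounded to 8 digits. The helper names are hypothetical and time-of-day handling is ignored.

```python
import calendar
from datetime import date


def months_between(end, start):
    """Approximate months_between for plain dates (no time-of-day component)."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    end_is_last = end.day == calendar.monthrange(end.year, end.month)[1]
    start_is_last = start.day == calendar.monthrange(start.year, start.month)[1]
    if end.day == start.day or (end_is_last and start_is_last):
        return float(months)
    # Leftover days are measured against a fixed 31-day month.
    return round(months + (end.day - start.day) / 31.0, 8)


def datediff(end, start):
    """Number of days from start to end."""
    return (end - start).days


print(months_between(date(2018, 2, 18), date(2017, 1, 1)))  # 13 months + 17/31
print(months_between(date(2017, 1, 1), date(2016, 1, 1)))
print(datediff(date(2017, 1, 1), date(2016, 1, 1)))          # 2016 is a leap year
```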



In [379]:
from pyspark.sql.functions import to_date, lit
spark.range(5).withColumn("date", lit("2017-01-01"))\
.select(to_date(col("date")))\
.show()

+---------------+
|to_date(`date`)|
+---------------+
|     2017-01-01|
|     2017-01-01|
|     2017-01-01|
|     2017-01-01|
|     2017-01-01|
+---------------+



__WARNING__
<br>Spark does not throw an error if it cannot parse a date; it just returns null. This can be tricky in larger pipelines, because you may expect your data in one format and receive it in another. To illustrate, consider a date whose format has switched from year-month-day to year-day-month. Spark fails to parse this date and silently returns null instead.

In [380]:
### 2016-20-12 - year-day-month
### 2017-12-11 - year-month-day
dateDF.select(to_date(lit("2016-20-12")),to_date(lit("2017-12-11"))).show(1)

+---------------------+---------------------+
|to_date('2016-20-12')|to_date('2017-12-11')|
+---------------------+---------------------+
|                 null|           2017-12-11|
+---------------------+---------------------+
only showing top 1 row
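
One way to guard against silent nulls is to count the values that failed to parse before moving on. The plain-Python sketch below mimics the null-on-failure behaviour with `datetime.strptime`, which raises `ValueError` on a bad date; the helper `to_date_or_none` is hypothetical.

```python
from datetime import date, datetime


def to_date_or_none(text, fmt="%Y-%m-%d"):
    """Parse text as a date, returning None instead of raising when it does not match."""
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        return None


raw = ["2016-20-12", "2017-12-11"]
parsed = [to_date_or_none(text) for text in raw]
print(parsed)

# Collect the inputs that could not be parsed so the problem is visible, not silent.
bad_inputs = [text for text, value in zip(raw, parsed) if value is None]
print(len(bad_inputs), bad_inputs)
```

The same check on a DataFrame amounts to comparing the number of nulls in the column before and after the conversion.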

