## Agenda:
- Dropping columns
- Dropping rows
- Various parameters in drop functionalities
- Handling missing values with mean, median and mode

In [1]:
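# If Spark fails with "An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext",
# point JAVA_HOME at JDK 1.8 instead of the latest JDK.
import os
# Raw string so the backslashes in the Windows path are not treated as escape sequences.
os.environ['JAVA_HOME'] = r"C:\Program Files\Java\jdk1.8.0_211"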

In [6]:
from pyspark.sql import SparkSession
import pandas as pd
spark = SparkSession.builder.appName("Practice").getOrCreate()

In [39]:
books_path = '../assets/data.csv'
# The quote option was garbled in the source; a double quote (Spark's default) is assumed here.
df_pyspark = spark.read.option('header', 'true').option('quote', '"').csv(books_path, inferSchema=True)
df_pyspark.printSchema()

root
 |-- ISBN13: string (nullable = true)
 |-- title: string (nullable = true)
 |-- author: string (nullable = true)
 |-- description: string (nullable = true)
 |-- edition: string (nullable = true)
 |-- image_url: string (nullable = true)
 |-- link_to_website: string (nullable = true)
 |-- price: string (nullable = true)
 |-- currency: string (nullable = true)
 |-- published_date: string (nullable = true)
 |-- publisher: string (nullable = true)



In [56]:
df_pyspark.describe().toPandas()

Unnamed: 0,summary,ISBN13,title,author,description,edition,image_url,link_to_website,price,currency,published_date,publisher
0,count,852,273,235,204,150,63,39,29,27,17,12
1,mean,9.781527107113719E12,8.661875000000006,,1.6297227E9,,,,,,,
2,stddev,1416.5567908675514,2.8897997266795943,,2.3925523236195162E7,,,,,,,
3,min,All,(whether in the home,& the Christian life.,Luke and John we have a record of over 50 dif...,Ellis’ experience from riots,Erin Wheeler unpacks what the Bible says abou...,I came to know that God offers more liberation,and an inspiration for readers to strive to l...,and more fulfilment than I could dare to imag...,Adèle,amongst others.
4,max,‘When I Use a Word’,When Home Hurts begins with an overview to pr...,https://christianfocus.s3-eu-west-1.amazonaws....,https://christianfocus.s3-eu-west-1.amazonaws....,’ we are now to live in holy ways,https://christianfocus.s3-eu-west-1.amazonaws....,"""https://www.christianfocus.com/products/3048/...",https://christianfocus.s3-eu-west-1.amazonaws....,https://christianfocus.s3-eu-west-1.amazonaws....,https://christianfocus.s3-eu-west-1.amazonaws....,https://christianfocus.s3-eu-west-1.amazonaws....


## Dropping columns

In [46]:
df_pyspark.drop('edition')

DataFrame[ISBN13: string, title: string, author: string, description: string, image_url: string, link_to_website: string, price: string, currency: string, published_date: string, publisher: string]

## Dropping rows / various parameters in drop functionalities

In [43]:
# drops every row that contains at least one null value
df_pyspark.na.drop(how="any").show()

+-------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|       ISBN13|               title|              author|         description|             edition|           image_url|     link_to_website|               price|            currency|      published_date|           publisher|
+-------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|9781527105577|Through All The C...|         David Ellis|"Following God’s ...| Ellis’ experienc...|  political upheaval|   physical hardship|        and burn out| to caring for hi...|               Adèle| through her Alzh...|
|9781527107786|40 Years Behind B...|Prison Fellowship...| "There are around 8|000 people incarc.

In [28]:
# drops a row only if all of its values are null
df_pyspark.na.drop(how="all").show()

+--------------------+--------------------+--------------------+--------------------+-------+--------------------+--------------------+-----+--------+--------------+---------+
|              ISBN13|               title|              author|         description|edition|           image_url|     link_to_website|price|currency|published_date|publisher|
+--------------------+--------------------+--------------------+--------------------+-------+--------------------+--------------------+-----+--------+--------------+---------+
|       9781527109001|Teaching Deuteron...|         Matt Fuller|Deuteronomy is pr...|   null|                null|                null| null|    null|          null|     null|
|       life or death| blessing or curs...|                null|                null|   null|                null|                null| null|    null|          null|     null|
|                    |                null|                null|                null|   null|                null|      

In [42]:
# keeps only rows with at least 2 non-null values (rows with fewer are dropped)
df_pyspark.na.drop(thresh=2).show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----+--------+--------------+---------+
|              ISBN13|               title|              author|         description|             edition|           image_url|     link_to_website|price|currency|published_date|publisher|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-----+--------+--------------+---------+
|       9781527109001|Teaching Deuteron...|         Matt Fuller|"Deuteronomy is p...| and yet Jesus qu...|                null|                null| null|    null|          null|     null|
|       life or death| blessing or curs...|                null|                null|                null|                null|                null| null|    null|          null|     null|
|In his helpful gu...| to choose to lov...|            

In [44]:
# subset: only the listed columns are checked for nulls
df_pyspark.na.drop(how="any",subset=['edition']).show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|              ISBN13|               title|              author|         description|             edition|           image_url|     link_to_website|               price|            currency|      published_date|           publisher|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|       9781527109001|Teaching Deuteron...|         Matt Fuller|"Deuteronomy is p...| and yet Jesus qu...|                null|                null|                null|                null|                null|                null|
|                   "|               12.99|                 USD|    

## Filling null values

In [50]:
df_pyspark.na.fill('XXX').toPandas()

Unnamed: 0,ISBN13,title,author,description,edition,image_url,link_to_website,price,currency,published_date,publisher
0,9781527109001,Teaching Deuteronomy: From Text to Message,Matt Fuller,"""Deuteronomy is probably not the first book an...",and yet Jesus quoted it regularly. It is the ...,XXX,XXX,XXX,XXX,XXX,XXX
1,life or death,blessing or curse. Moses exhorts the people t...,XXX,XXX,XXX,XXX,XXX,XXX,XXX,XXX,XXX
2,,XXX,XXX,XXX,XXX,XXX,XXX,XXX,XXX,XXX,XXX
3,In his helpful guide Matt Fuller suggests ways...,to choose to love him each and every day.,XXX,XXX,XXX,XXX,XXX,XXX,XXX,XXX,XXX
4,,XXX,XXX,XXX,XXX,XXX,XXX,XXX,XXX,XXX,XXX
...,...,...,...,...,...,...,...,...,...,...,...
847,9781527105638,Teaching 2 Peter & Jude: From Text to Message,Angus MacLeay,"""The books of 2 Peter and Jude are some of the...",these dynamic little books have an important ...,pragmatism and drift. These books are dense a...,XXX,XXX,XXX,XXX,XXX
848,,XXX,XXX,XXX,XXX,XXX,XXX,XXX,XXX,XXX,XXX
849,Teaching 2 Peter and Jude is a great addition ...,XXX,XXX,XXX,XXX,XXX,XXX,XXX,XXX,XXX,XXX
850,Whether you are a small group leader,preacher,youth worker or someone who simply want help ...,this book will help you to comprehend and com...,XXX,https://christianfocus.s3-eu-west-1.amazonaws....,"""https://www.christianfocus.com/products/2909/...",XXX,XXX,XXX,XXX


In [51]:
# fill nulls only in the listed columns; other columns keep their nulls
df_pyspark.na.fill(value='XXX',subset=['author', 'description', 'edition']).toPandas()

Unnamed: 0,ISBN13,title,author,description,edition,image_url,link_to_website,price,currency,published_date,publisher
0,9781527109001,Teaching Deuteronomy: From Text to Message,Matt Fuller,"""Deuteronomy is probably not the first book an...",and yet Jesus quoted it regularly. It is the ...,,,,,,
1,life or death,blessing or curse. Moses exhorts the people t...,XXX,XXX,XXX,,,,,,
2,,,XXX,XXX,XXX,,,,,,
3,In his helpful guide Matt Fuller suggests ways...,to choose to love him each and every day.,XXX,XXX,XXX,,,,,,
4,,,XXX,XXX,XXX,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
847,9781527105638,Teaching 2 Peter & Jude: From Text to Message,Angus MacLeay,"""The books of 2 Peter and Jude are some of the...",these dynamic little books have an important ...,pragmatism and drift. These books are dense a...,,,,,
848,,,XXX,XXX,XXX,,,,,,
849,Teaching 2 Peter and Jude is a great addition ...,,XXX,XXX,XXX,,,,,,
850,Whether you are a small group leader,preacher,youth worker or someone who simply want help ...,this book will help you to comprehend and com...,XXX,https://christianfocus.s3-eu-west-1.amazonaws....,"""https://www.christianfocus.com/products/2909/...",,,,


In [54]:
from pyspark.ml.feature import Imputer

imputer = Imputer(
    inputCols=['price'],
    outputCols=["{}_imputed".format(c) for c in ['price']]
).setStrategy('mean')

In [55]:
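# Fails with the error below: `price` was read as a string, and Imputer requires numeric input columns.
imputer.fit(df_pyspark).transform(df_pyspark).show()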

IllegalArgumentException: requirement failed: Column price must be of type numeric but was actually of type string.