# Spark DataFrame Basics

Spark DataFrames are the workhorse of Spark and the main way of working with Spark from Python since Spark 2.0. DataFrames act as powerful versions of tables, with rows and columns, and easily handle large datasets. The shift to DataFrames provides many advantages:
* A much simpler syntax
* Ability to run SQL directly against the DataFrame
* Operations are automatically distributed across RDDs
    
If you've used R or the pandas library with Python, you are probably already familiar with the concept of DataFrames. Spark DataFrames expand on a lot of these concepts, so you can transfer that knowledge easily once you understand their simple syntax. Remember that the main advantage of Spark DataFrames over those other tools is that Spark can distribute data across many RDDs, handling huge datasets that would never fit on a single computer. That comes at the slight cost of some "peculiar" syntax choices, but after this course you will feel very comfortable with all those topics!

Let's get started!

## Creating a DataFrame

First we need to start a SparkSession. Before we do, here is some background on PySpark:

# What is PySpark?
**PySpark** is the collaboration of **Apache Spark** and **Python**.

Apache Spark is an open-source cluster-computing framework built around speed, ease of use, and streaming analytics, whereas Python is a general-purpose, high-level programming language that provides a wide range of libraries and is widely used for Machine Learning and Real-Time Streaming Analytics.

In other words, PySpark **is a Python API for Spark** that lets you harness the simplicity of Python and the power of Apache Spark in order to tame Big Data and perform massive distributed processing over resilient sets of data. It's a must for Big Data enthusiasts.


# Why use PySpark in a Jupyter Notebook?

**Jupyter Notebook** is a popular application that enables you to edit, run and share Python code in a web view. It allows you to modify and re-execute parts of your code in a very flexible way. That’s why Jupyter is a great tool to test and prototype programs.

When using Spark, most data engineers recommend developing either in Scala (which is the “native” Spark language) or in Python through the complete PySpark API.

Python for Spark can be slower than Scala, most noticeably when Python UDFs or RDD operations move data between the JVM and the Python interpreter. However, many developers love Python because it’s flexible, robust, easy to learn, and benefits from all their favorite libraries. Many would agree that Python is the perfect language for prototyping in Big Data/Machine Learning fields.

If you prefer to develop in Scala, you will find many alternatives on the following GitHub repository: alexarchambault/jupyter-scala

To learn more about the pros and cons of Python vs. Scala in a Spark context, please refer to this interesting article: Scala vs. Python for Apache Spark.

Now, let’s get started.

# How is PySpark different than Python?

One of the most notable differences between PySpark and plain Python is that PySpark runs through a SparkContext, which connects to a cluster, so certain processes will look different, especially when you get into the machine learning libraries. In addition to this main difference, I've noted a few attributes to be aware of below:

1. PySpark DataFrames have no row index, so you cannot select rows by position the way you can in pandas
2. DataFrames (and the RDDs underneath them) are **immutable**: every transformation returns a new object
3. Error messages are much less informative
4. Many of the libraries you are used to using in Python won't function in PySpark

# Contents of this notebook
This notebook is intended to give users a tangible guide to get started using Python on Apache Spark. You will notice that many of the libraries you are used to in Python, like Pandas and NumPy, are not employed here, and that new libraries like pyspark.sql appear instead. I hope you all find this documentation useful!

# Some helpful resources

- Exploring S3 Keys: https://alexwlchan.net/2017/07/listing-s3-keys/
- Using S3 Select: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-s3select.html
- PySpark Cheat Sheets:
    - https://www.qubole.com/resources/pyspark-cheatsheet/
    - https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_SQL_Cheat_Sheet_Python.pdf

## Let's Get started!

Starting a PySpark Session

In [None]:
from pyspark.sql import SparkSession

In [None]:
# May take a while locally
spark = SparkSession.builder.appName("Operations").getOrCreate()

## Read in and view data

A DataFrame is equivalent to a relational table in Spark SQL, and can be created using various functions in SparkSession.

In [1]:
# Start by reading a basic csv dataset
# Let Spark know about the header and infer the Schema types!

#Some csv data
stocks = spark.read.csv('appl_stock.csv',inferSchema=True,header=True)
sales = spark.read.csv('sales_info.csv',inferSchema=True,header=True)
nulls = spark.read.csv("ContainsNull.csv",header=True,inferSchema=True)

# And one json
people = spark.read.json('people.json')


In [None]:
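# A solid summary of your data:

# show the data (like df.head() in pandas)
# show() and printSchema() print directly and return None, so they are not wrapped in print()
stocks.show()
print("")
stocks.printSchema()
print("")
print(stocks.columns)
print("")
# describe() returns a DataFrame, so call show() to see the statistics
stocks.describe().show()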

In [None]:
# Neat "describe" function
# (df is the parquet DataFrame read from S3 later in this section)
df.describe(['VISA_TYPE_ID']).show()

In [None]:
# Summary function
df.select("VISA_TYPE_ID", "CASE_STATUS_ID").summary("count", "min", "25%", "75%", "max").show()

Some data formats make it easy to infer the schema (tabular formats such as csv, as shown above).

However, you often have to set the schema yourself when the .read method you are using has no inferSchema option built in.

Spark has all the tools you need for this; it just requires a very specific structure:

In [None]:
from pyspark.sql.types import StructField,StringType,IntegerType,StructType

Next we need to create the list of StructFields, each of which takes:
* name: string, name of the field.
* dataType: `DataType` of the field.
* nullable: boolean, whether the field can be null (None) or not.

In [None]:
data_schema = [StructField("age", IntegerType(), True),StructField("name", StringType(), True)]

In [None]:
final_struc = StructType(fields=data_schema)

In [None]:
people = spark.read.json('people.json', schema=final_struc)

In [None]:
people.printSchema()

In [None]:
# Now something a bit more complicated: read in a full parquet dataset
bucket = "dol-ocio-dac-oflc-curated-prod"
# reads in all partitioned keys beginning with this prefix
key = "oflc-cmpdbprod/OFLC_CMS/OFLC_CASE/"
df = spark.read.parquet('s3://'+bucket+'/'+key)

In [None]:
# Read in user specified parts of partitions

bucket = "dol-ocio-dac-oflc-datascience-test"
key1 = "partition_test/oflc-cmpdb/OFLC_CMS/OFLC_CASE/CREATED_YEAR=2015/*"
key2 = "partition_test/oflc-cmpdb/OFLC_CMS/OFLC_CASE/CREATED_YEAR=2017/*"
# key3 = "partition_test/oflc-cmpdb/OFLC_CMS/OFLC_CASE/CREATED_YEAR=2018/part-00000-1cf1268f-9a00-443e-93a6-b148021f5753.c000.snappy.parquet"
key3 = "partition_test/oflc-cmpdb/OFLC_CMS/OFLC_CASE/CREATED_YEAR=2018/*"

test_df = spark.read.parquet('s3://'+bucket+'/'+key1,
                             's3://'+bucket+'/'+key2,
                             's3://'+bucket+'/'+key3)

test_df.show(1)


## Filtering Data

A large part of working with DataFrames is the ability to quickly filter out data based on conditions. Spark DataFrames are built on top of the Spark SQL platform, which means that if you already know SQL, you can quickly and easily grab that data using SQL commands, or using the DataFrame methods (which is what we focus on in this course).

In [None]:
# Using SQL
stocks.filter("Close<500").show()

In [None]:
stocks.select(['Open','Close']).show()

In [None]:
# Using SQL with .select()
stocks.filter("Close<500").select(['Open','Close']).show()

In [None]:
# Try it yourself!
# Edit the line below to select only closing values above 800
stocks.filter("Close<500").select(['Open','Close']).show()

**Slicing**

In [None]:
# Slicing
from pyspark.sql.functions import slice  # note: this shadows Python's built-in slice
# pyspark.sql.functions.slice(x, start, length)
# Returns an array containing all the elements in x from index start (or starting from the end if start is negative) with the specified length.
# Note: indexing starts at 1 here

slicer = spark.createDataFrame([([1, 2, 3],), ([4, 5],)], ['x']) 
slicer.select(slice(slicer.x, 2, 2).alias("sliced")).show()
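The 1-based indexing is the part that tends to trip people up, so here is a minimal plain-Python sketch of the documented behaviour that runs without a Spark session. The helper name `spark_style_slice` is hypothetical, and its handling of out-of-range starts is this sketch's own choice rather than a statement about Spark.

```python
def spark_style_slice(values, start, length):
    """Plain-Python model of slice(x, start, length) with a 1-based start."""
    if start == 0:
        raise ValueError("start must not be 0: positions are 1-based")
    if length < 0:
        raise ValueError("length must be non-negative")
    # Translate the 1-based (or negative, counted-from-the-end) start
    # into an ordinary 0-based Python index.
    begin = start - 1 if start > 0 else len(values) + start
    if begin < 0 or begin >= len(values):
        return []  # assumption of this sketch for out-of-range starts
    return values[begin:begin + length]


# Same inputs as the slicer DataFrame above
print(spark_style_slice([1, 2, 3], 2, 2))   # [2, 3]
print(spark_style_slice([4, 5], 2, 2))      # [5]
# A negative start counts back from the end of the array
print(spark_style_slice([1, 2, 3], -2, 2))  # [2, 3]
```

So `slice(slicer.x, 2, 2)` starts at the second element, not the third as a 0-based reading would suggest.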

**Like**

In [None]:
# Like
# Show only the emails that are gmail emails
df.select("CASE_ID","CREATED_BY").where(df.CREATED_BY.like("%GMAIL%")).show(5, False)

**Substrings**

In [None]:
# Substring
# Extract the second group of 3 characters (positions 4-6) from the Case Number field
# substr(startPos, length) counts positions from 1
df.select("CASE_ID","CASE_NUMBER",df.CASE_NUMBER.substr(4, 3)).show()

**Case When**

In [None]:
# Case When
# Create a binary classifier for age (i.e. label everyone over 30 as 1 and everyone else as 0)
from pyspark.sql import functions as F

values = [('Kyle',10),('Melbourne',36),('Nina',123),('Stephen',48),('Orphan',16),('Imran',1)]
case_when = spark.createDataFrame(values,['name','age'])
case_when.show()

case_when.select("name","age",F.when(case_when.age > 30, 1).otherwise(0).alias("Binary")).show()

# df[df.firstName.isin("Jane","Boris")].show()

**IS IN**

In [None]:
# ISIN
# Select specific foods from a column
from pyspark.sql import functions as F

values = [('Bananas',10),('Chips',36),('Sandwiches',123),('Chicken Fingers',48),('Fries',16),('Eggplant',1)]
IS_IN = spark.createDataFrame(values,['food','Quantity'])
IS_IN.show()

IS_IN[IS_IN.food.isin("Bananas","Eggplant")].show()

**Starts with Ends with**

In [None]:
# Startswith – Endswith
# Search for a specific case - begins with "PW" and ends with "1"
df.select("CASE_ID","CASE_NUMBER").where(df.CASE_NUMBER.startswith("PW")) \
                                  .where(df.CASE_NUMBER.endswith("1")).show()

## Collecting Results as Python Objects

In [None]:
# Collecting results as Python objects
result = stocks.filter("Low=197.16").collect()

In [None]:
# Note the nested structure returns a nested row object
type(result[0])

In [None]:
row = result[0]

A Row can be converted to a dictionary with asDict():

In [None]:
row.asDict()

In [None]:
for item in result[0]:
    print(item)

## Manipulate Data

DataFrames in PySpark are immutable, so every manipulation returns a new DataFrame rather than changing the existing one. If you just want to test some code, you can call the .show() method on the result as I've shown below. This will only display the results and NOT change your DataFrame.

In [None]:
# Change a datatype
# Available types:
    # DataType
    # NullType
    # StringType
    # BinaryType
    # BooleanType
    # DateType
    # TimestampType
    # DecimalType
    # DoubleType
    # FloatType
    # ByteType
    # IntegerType
    # LongType
    # ShortType
    # ArrayType
    # MapType
    # StructField
    # StructType
    
# Notice all vars are strings above....
# let's change that
from pyspark.sql.types import * #IntegerType
from pyspark.sql.functions import *

df = df.withColumn("CASE_ID", df["CASE_ID"].cast(IntegerType())) \
        .withColumn("VISA_TYPE_ID", df["VISA_TYPE_ID"].cast(IntegerType())) \
        .withColumn("MODIFIED_DATE", to_timestamp(df.MODIFIED_DATE, 'yyyy-MM-dd HH:mm:ss')) \
        .withColumn("SUBMIT_DATE", to_timestamp(df.SUBMIT_DATE, 'yyyy-MM-dd HH:mm:ss'))
# printSchema() and show() print directly and return None
df.printSchema()
df.show(1)

#### Creating new columns

In [None]:
# Add a new column from an existing column like this....
# withColumn(colName, col)
# Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
# The column expression must be an expression over this DataFrame; attempting to add a column from some other dataframe will raise an error.

# Parameters
# colName – string, name of the new column.
# col – a Column expression for the new column.

# CASE_ID was cast to an integer above, so arithmetic on it is well defined.
# 'CASE_ID_PLUS_2' is only an illustrative name; reusing an existing name would replace that column instead.
df.withColumn('CASE_ID_PLUS_2', df.CASE_ID + 2).show()

#### Rename Columns

In [None]:
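# Simple Rename
# withColumnRenamed returns a new DataFrame; call .collect() only if you need the rows as Python objects
renamed = people.withColumnRenamed('age','supernewage')
renamed.show()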

**Extract or Create New Values**

In [None]:
# Concatenate columns
# pyspark.sql.functions.concat_ws(sep, *cols)
# Concatenates multiple input string columns together into a single string column, using the given separator.
from pyspark.sql.functions import concat_ws

names = spark.createDataFrame([('Abraham','Lincoln')], ['first_name', 'last_name'])
names.select(names.first_name,names.last_name,concat_ws(' ', names.first_name, names.last_name).alias('full_name')).show()

In [None]:
# Extract year, month, day etc. from a date field
# Other options: dayofmonth, dayofweek, dayofyear, weekofyear
import pyspark.sql.functions as fn
year = df.withColumn("SUBMIT_YEAR",fn.year("SUBMIT_DATE")) \
         .withColumn("SUBMIT_MONTH",fn.month("SUBMIT_DATE"))
#QA
year.filter("SUBMIT_YEAR=2019").select(['SUBMIT_DATE','SUBMIT_YEAR','SUBMIT_MONTH']).show()


In [None]:
# Calculate the difference between two dates:
# pyspark.sql.functions.datediff(end, start)
# Returns the number of days from start to end.
from pyspark.sql.functions import datediff

date_df = spark.createDataFrame([('2015-04-08','2015-05-10')], ['d1', 'd2'])
date_df.select(datediff(date_df.d2, date_df.d1).alias('diff')).show()
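As a quick sanity check on the argument order (`end` first, then `start`), the same arithmetic can be reproduced with Python's standard `datetime` module. This is only a local sketch of the calculation, not Spark code, and `days_between` is a hypothetical helper name.

```python
from datetime import date


def days_between(end, start):
    """Number of days from start to end, mirroring datediff(end, start)."""
    return (date.fromisoformat(end) - date.fromisoformat(start)).days


# Same dates as date_df above: d2 is the end, d1 is the start
print(days_between('2015-05-10', '2015-04-08'))  # 32
# Swapping the arguments flips the sign
print(days_between('2015-04-08', '2015-05-10'))  # -32
```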

In [None]:
# Split a string around pattern (pattern is a regular expression).
from pyspark.sql.functions import split
# pyspark.sql.functions.split(str, pattern)

abc = spark.createDataFrame([('ab12cd',)], ['s',])
abc.select(abc.s,split(abc.s, '[0-9]+').alias('news')).show()
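To see what a pattern will do before running it on a cluster, it can help to try it with Python's `re` module first. This is a rough local stand-in: Spark evaluates patterns with Java regular expressions, so the two can differ on edge cases, but a simple character class like this one behaves the same way.

```python
import re

# Same pattern as above: one or more digits act as the separator
digits = re.compile(r'[0-9]+')

print(digits.split('ab12cd'))      # ['ab', 'cd']
# Each run of digits counts as a single separator, however long it is
print(digits.split('a1b22c333d'))  # ['a', 'b', 'c', 'd']
```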

**Clean Data**

In [None]:
# Trim
# pyspark.sql.functions.trim(col) - Trim the spaces from both ends for the specified string column.
from pyspark.sql.functions import ltrim, rtrim, trim

# create a dataframe - notice the extra whitespace in the date strings
trim_ex = spark.createDataFrame([(' 2015-04-08 ',' 2015-05-10 ')], ['d1', 'd2'])
trim_ex.show()
print("left trim")
trim_ex.select('d1', ltrim(trim_ex.d1)).show()
print("right trim")
trim_ex.select('d1', rtrim(trim_ex.d1)).show()
print("trim")
trim_ex.select('d1', trim(trim_ex.d1)).show()

In [None]:
# Regex!
# regexp_replace replaces all substrings of the string value that match the pattern with the replacement;
# regexp_extract pulls out a matched group instead.
# regexp_replace(str, pattern, replacement)
# Spark evaluates these patterns with Java regular expression syntax.
# for more info on regex calls visit: https://docs.oracle.com/cd/B19306_01/server.102/b14200/ap_posix001.htm#BABJDBHB
from pyspark.sql.functions import regexp_replace, regexp_extract

print("Example: Replace every digit in a string field with '#':")
reggi = spark.createDataFrame([('2, 5, and 10 are numbers in this example',)], ['dirty'])
reggi.select(reggi.dirty,regexp_replace('dirty', r'(\d)', '#').alias('clean')).show(1, False) # False prevents show() from truncating long values
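The pattern `(\d)` in the cell above matches single digits, so every digit is replaced with `#`; removing commas would need the pattern `,` instead. The plain-Python sketch below shows both on the same string using the standard `re` module, as a local approximation of what `regexp_replace` does rather than Spark itself.

```python
import re

dirty = '2, 5, and 10 are numbers in this example'

# Replace every digit with '#', as the pattern r'(\d)' does above
masked = re.sub(r'(\d)', '#', dirty)
print(masked)     # '#, #, and ## are numbers in this example'

# Remove every comma by replacing it with the empty string
no_commas = re.sub(r',', '', dirty)
print(no_commas)  # '2 5 and 10 are numbers in this example'
```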

**Arrays**

In [None]:
# Arrays - col/cols – list of column names (string) or list of Column expressions that have the same data type.
# pyspark.sql.functions
from pyspark.sql.functions import *
#      .array(*cols)   -   Creates a new array column.
#      .array_contains(col, value)  - Collection function: returns null if the array is null, true if the array contains the given value, and false otherwise.
#      .array_distinct(col) - Collection function: removes duplicate values from the array. :param col: name of column or expression
#      .array_except(col1, col2) - Collection function: returns an array of the elements in col1 but not in col2, without duplicates.
#      .array_intersect(col1, col2) - Collection function: returns an array of the elements in the intersection of col1 and col2, without duplicates.
#      .array_join(col, delimiter, null_replacement=None) - Concatenates the elements of column using the delimiter. Null values are replaced with null_replacement if set, otherwise they are ignored.
#      .array_max(col) - Collection function: returns the maximum value of the array.
#      .array_min(col) - Collection function: returns the minimum value of the array.
#      .array_position(col, value) - Collection function: Locates the position of the first occurrence of the given value in the given array. Returns null if either of the arguments are null.
#      .array_remove(col, element)- Collection function: Remove all elements that equal to element from the given array.
#      .array_repeat(col, count) - Collection function: creates an array containing a column repeated count times.
#      .array_sort(col) - Collection function: sorts the input array in ascending order. The elements of the input array must be orderable. Null elements will be placed at the end of the returned array.
#      .array_union(col1, col2) - Collection function: returns an array of the elements in the union of col1 and col2, without duplicates.
#      .arrays_overlap(a1, a2) - Collection function: returns true if the arrays contain any common non-null element; if not, returns null if both the arrays are non-empty and any of them contains a null element; returns false otherwise.
#      .arrays_zip(*cols) - Collection function: Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.

customer = spark.createDataFrame([('coffee','milk','coffee','coffee','chocolate','')], ['item1', 'item2','item3','item4','item5','item6'])
purchases = customer.select(array('item1', 'item2','item3').alias("Monday"),\
                            array('item4', 'item5','item6').alias("Tuesday"))

print("array")
purchases.show()

print("Which customers purchased milk? array_contains")
purchases.select(array_contains(purchases.Monday, "milk")).show(1, False)

print("List of unique products purchased on Monday: array_distinct")
purchases.select(array_distinct(purchases.Monday)).show(1, False)

print("What did our customers order on Monday but not Tuesday? array_except")
purchases.select(array_except(purchases.Monday, purchases.Tuesday)).show(1, False)

print("What did our customers order on BOTH Monday and Tuesday?: array_intersect")
purchases.select(array_intersect(purchases.Monday, purchases.Tuesday)).show(1, False)

print("All purchases on monday in a string: array_join")
purchases.select(array_join(purchases.Monday, ',')).show(1, False)

# Now you try!
# Can you list all the unique products purchased on BOTH Monday and Tuesday using one of the functions above
# The answer should be: 'coffee','milk','chocolate'
# Note: we did not use the function yet... you need to figure out which one to use :)

### Joins and Appends

In [None]:
# APPEND two dataframes together that have the same columns
# union() matches columns by position, not by name
new_df = df
df_concat = df.union(new_df)
print("df Counts:", df.count(), len(df.columns))
print("df_concat Counts:", df_concat.count(), len(df_concat.columns))
df_concat.show(5)


In [None]:
# JOINS!

valuesA = [('Pirate',1,'Arrrg'),('Monkey',2,'Oooo'),('Ninja',3,'Yaaaa'),('Spaghetti',4,'Slurp!')]
TableA = spark.createDataFrame(valuesA,['name','id','sound'])

valuesB = [('Rutabaga',1,2),('Pirate',2,45),('Ninja',3,102),('Darth Vader',4,87)]
TableB = spark.createDataFrame(valuesB,['name','id','age'])

# show() prints the table itself and returns None, so it is not wrapped in print()
print("This is TableA")
TableA.show()
print("And this is TableB")
TableB.show()

inner_join = TableA.join(TableB, ["name","id"],"inner")
print("Inner Join Example")
inner_join.show()

left_join = TableA.join(TableB, ["name","id"], how='left') # Could also use 'left_outer'
print("Left Join Example")
left_join.show()

# Keep only the TableA rows with no match in TableB.
# The joined result has a single "name" column, so test a TableB-only column (age) for null instead.
conditional_join = TableA.join(TableB, ["name","id"], how='left').filter("age IS NULL")
print("Conditional Left Join")
conditional_join.show()

right_join = TableA.join(TableB,  ["name","id"],how='right') # Could also use 'right_outer'
print("Right Join")
right_join.show()

full_outer_join = TableA.join(TableB, ["name","id"],how='full') # Could also use 'full_outer'
print("Full Outer Join")
full_outer_join.show()
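Because the join key is the pair `["name","id"]`, a row only matches when both values agree, which is easy to overlook when scanning the two tables. The plain-Python sketch below models the inner and left joins on the same data without a Spark session. It assumes the key pairs in TableB are unique, which holds for this example.

```python
table_a = [('Pirate', 1, 'Arrrg'), ('Monkey', 2, 'Oooo'),
           ('Ninja', 3, 'Yaaaa'), ('Spaghetti', 4, 'Slurp!')]
table_b = [('Rutabaga', 1, 2), ('Pirate', 2, 45),
           ('Ninja', 3, 102), ('Darth Vader', 4, 87)]

# Index TableB by the composite join key (name, id) -> age
ages_by_key = {(name, id_): age for name, id_, age in table_b}

# Inner join: keep only TableA rows whose (name, id) also appears in TableB
inner = [(n, i, s, ages_by_key[(n, i)])
         for n, i, s in table_a if (n, i) in ages_by_key]

# Left join: keep every TableA row, with None where TableB has no match
left = [(n, i, s, ages_by_key.get((n, i))) for n, i, s in table_a]

# Rows of the left join that found no partner in TableB
unmatched = [row for row in left if row[3] is None]

print(inner)      # [('Ninja', 3, 'Yaaaa', 102)]
print(len(left))  # 4
print([row[0] for row in unmatched])  # ['Pirate', 'Monkey', 'Spaghetti']
```

Pirate appears in both tables but with different ids (1 and 2), so it does not survive the inner join.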

In [None]:
# Now you try!
# Can you create a query to show ONLY Ninja and Darth Vader names WITHOUT the age column?

## Run SQL Queries on your DataFrame

Spark DataFrames provide two methods that allow users to run **SQL** queries against a Spark DataFrame:

1. **createOrReplaceTempView:** The lifetime of this temporary view is tied to the SparkSession that was used to create this DataFrame. It creates (or replaces if that view name already exists) a lazily evaluated "view" that you can then use like a Hive table in Spark SQL. It does not persist to memory unless you cache the dataset that underpins the view.

2. **createGlobalTempView:** The lifetime of this temporary view is tied to this Spark application.

**Spark Session vs. Spark application:**

A Spark application can be used for:

- a single batch job
- an interactive session with multiple jobs
- a long-lived server continually satisfying requests

A Spark job can consist of more than just a single map and reduce, and a Spark application can consist of more than one session.

A SparkSession, on the other hand, is associated with a Spark application:

Generally, a session is an interaction between two or more entities. Since Spark 2.0 you can use SparkSession, which can be created without explicitly creating a SparkConf, SparkContext or SQLContext (they’re encapsulated within the SparkSession).

Global temporary views were introduced in the Spark 2.1.0 release. This feature is useful when you want to share data among different sessions and keep it alive until your application ends.


In [None]:
# Create a temporary view of the dataframe
# The view name must match the table name used in the SQL queries below
people.createOrReplaceTempView("people")

In [None]:
# Then Query the temp view
spark.sql("SELECT * FROM people WHERE age=30").show()

In [None]:
# Or pass it to an object
sql_results = spark.sql("SELECT * FROM people")
sql_results

# GroupBy and Aggregate Functions

Let's learn how to use GroupBy and Aggregate methods on a DataFrame. GroupBy allows you to group rows together based on some column value. For example, you could group sales data by the day the sale occurred, or group repeat customer data by the name of the customer. Once you've performed the GroupBy operation you can apply an aggregate function to that data. An aggregate function aggregates multiple rows of data into a single output, such as taking the sum of inputs, or counting the number of inputs.

Let's see some examples on a sample dataset!

In [None]:
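# groupBy returns a GroupedData object, which needs an aggregation such as count (you can also use sum, min, max)
# The Company and Sales columns are assumed to live in the sales DataFrame (sales_info.csv)
sales.groupBy("Company").count().show()

In [None]:
# Once grouped, you can apply the following aggregate functions: mean, count, min, max, sum
# Like this for example
sales.groupBy("Company").mean().show()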
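If the split between grouping and aggregating feels abstract, the following plain-Python sketch walks through the same two steps on a handful of rows. The company names and amounts are hypothetical values made up for illustration, and the code models the idea rather than calling Spark.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (company, sales amount) rows
rows = [('GOOG', 200.0), ('GOOG', 120.0),
        ('MSFT', 600.0), ('MSFT', 100.0), ('MSFT', 200.0)]

# Step 1 (groupBy): collect the values that share the same key
groups = defaultdict(list)
for company, amount in rows:
    groups[company].append(amount)

# Step 2 (aggregate): reduce each group's values to a single output
summary = {company: {'count': len(values),
                     'mean': mean(values),
                     'max': max(values)}
           for company, values in groups.items()}

print(summary['GOOG'])  # {'count': 2, 'mean': 160.0, 'max': 200.0}
print(summary['MSFT'])  # {'count': 3, 'mean': 300.0, 'max': 600.0}
```

Grouping on its own produces no result to display, which is why an aggregate has to follow the `groupBy` call.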

In [None]:
# This is also a pretty neat function you can use:
stocks.summary("count", "min", "25%", "75%", "max").show()

In [None]:
# To do a summary for specific columns first select them:
# (age and name are columns of the people DataFrame)
people.select("age", "name").summary("count").show()

Aggregate on the entire DataFrame without groups (shorthand for df.groupBy().agg()).

In [None]:
# Aggregate!
# agg(*exprs)
# Aggregate on the entire DataFrame without groups (shorthand for df.groupBy().agg()).
# available agg functions: min, max, count, countDistinct, approx_count_distinct
# df.agg(covar_pop(col1, col2)) Returns a new Column for the population covariance of col1 and col2.
# df.agg(covar_samp(col1, col2)) Returns a new Column for the sample covariance of col1 and col2.
# df.agg(corr(col1, col2)) Returns a new Column for the Pearson Correlation Coefficient for col1 and col2.
from pyspark.sql import functions as F
df.agg(F.min(df.VISA_TYPE_ID).alias("Min Visa Type")).show()

In [None]:
# Max sales across everything
# (the Sales column is assumed to live in the sales DataFrame)
sales.agg({'Sales':'max'}).show()

In [None]:
# And then if you want to aggregate only certain columns per group you can do this:
sales.groupBy("Company").agg({"Sales":'max'}).show()

Pivot Function

In [None]:
# Pivot Function
# pivot(pivot_col, values=None)
df.groupBy("CASE_CATEGORY_ID").pivot("VISA_TYPE_ID", ["7", "3"]).count().show()
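A pivot turns the distinct values of one column into output columns, with one row per group. The plain-Python sketch below models `groupBy(...).pivot(..., values).count()` on a few hypothetical `(category, visa_type)` pairs; the data is invented for illustration, and group/value combinations with no rows are shown as `None`, standing in for the nulls in the Spark result.

```python
from collections import Counter

# Hypothetical (CASE_CATEGORY_ID, VISA_TYPE_ID) pairs
rows = [(1, 7), (1, 7), (1, 3), (2, 3), (2, 5)]
wanted = [7, 3]  # the pivot values to turn into columns

# Count rows per (category, visa type), ignoring visa types not requested
counts = Counter((cat, visa) for cat, visa in rows if visa in wanted)

# One output row per category, one column per requested pivot value
categories = sorted({cat for cat, _ in rows})
pivoted = {cat: {visa: counts.get((cat, visa)) for visa in wanted}
           for cat in categories}

print(pivoted[1])  # {7: 2, 3: 1}
print(pivoted[2])  # {7: None, 3: 1}
```

Passing the list of values, as the cell above does with `["7", "3"]`, limits the output to those columns, so the visa type 5 row contributes nothing here.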

## Functions
There are a variety of functions you can import from pyspark.sql.functions. Check out the documentation for the full list available:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions

In [None]:
# Import the functions we will need:
from pyspark.sql.functions import countDistinct,avg,stddev
from pyspark.sql.functions import abs # Absolute value
from pyspark.sql.functions import acos # inverse cosine of col, as if computed by java.lang.Math.acos()

In [None]:
sales.select(countDistinct("Sales"),avg('Sales'),stddev("Sales")).show()

Often you will want to change the column name. Use the .alias() method for this:

In [None]:
sales.select(countDistinct("Sales").alias("Distinct Sales")).show()

## Order By

You can easily sort with the orderBy method:

In [None]:
# OrderBy
# Ascending
sales.orderBy("Sales").show()

In [None]:
# Descending: call desc() off the column itself.
sales.orderBy(sales["Sales"].desc()).show()

Check out this link for more info on other methods:
http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark-sql-module

Not all methods need a groupBy call. Instead you can call the generalized .agg() method, which applies the aggregate across all rows of the specified DataFrame column. It can take a single column as its argument, or create multiple aggregate calls all at once using dictionary notation, for example `df.agg({'Sales': 'max'})`.

### Dates and Timestamps

Most real-world datasets contain time and date information. Let's see how to work with it!

In [3]:
# First import all the functions you will need:
from pyspark.sql.functions import format_number,dayofmonth,hour,dayofyear,month,year,weekofyear,date_format

In [None]:
# Then you can extract these elements from the date column...
# (the Date column is assumed to live in the stocks DataFrame)
stocks.select(year(stocks['Date']),month(stocks['Date'])).show()

So for example, let's say we wanted to know the average ________ per year. Easy! With a groupby and the year() function call:

In [None]:
newdf = stocks.withColumn("Year",year(stocks['Date']))
result = newdf.groupBy("Year").mean()[['avg(Year)','avg(Close)']]
result.show()

And if you wanted to format your results a bit, you could add the format_number and alias functions as follows:

In [None]:
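# Look the columns up on the DataFrame so that .alias() is called on a Column, not a string
formatted = result.select(
    result['avg(Year)'].alias("Year"),
    format_number(result['avg(Close)'], 2).alias("Mean Close")
)
formatted.show()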

# Missing Data

Often data sources are incomplete, which means you will have missing data. You have 3 basic options for handling missing data (you will personally have to decide which is the right approach):

* Just keep the missing data points.
* Drop the missing data points (including the entire row).
* Fill them in with some other value.

Let's cover examples of each of these methods!

## Keeping the missing data
A few machine learning algorithms can easily deal with missing data. Let's see what it looks like:

In [None]:
# We will be using the nulls dataframe for this section:
nulls.show()

## Drop the missing data

You can use the .na functions for missing data. The drop command has the following parameters:

    df.na.drop(how='any', thresh=None, subset=None)

In [None]:
# Drop any row that contains missing data
nulls.na.drop().show()

In [None]:
# Keep only rows that have at least 2 NON-null values (rows with fewer are dropped)
nulls.na.drop(thresh=2).show()
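The `thresh` parameter is easy to misread: a row is kept when it has at least `thresh` non-null values, and `thresh` takes precedence over `how` when both are given. The plain-Python sketch below models that decision per row. The helper `keep_row` and the sample rows are hypothetical, and the `subset` parameter is left out for brevity.

```python
def keep_row(row, how='any', thresh=None):
    """Model of the na.drop() decision: True means the row survives."""
    non_null = sum(value is not None for value in row)
    if thresh is not None:
        # thresh overrides how: keep rows with at least thresh non-null values
        return non_null >= thresh
    if how == 'any':
        # drop the row if any value is null
        return non_null == len(row)
    if how == 'all':
        # drop the row only if every value is null
        return non_null > 0
    raise ValueError("how must be 'any' or 'all'")


# Hypothetical (Id, Name, Sales) rows with some missing values
rows = [('emp1', 'John', None), ('emp2', None, None),
        ('emp3', None, 345.0), ('emp4', 'Cindy', 456.0)]

print([r[0] for r in rows if keep_row(r, thresh=2)])   # ['emp1', 'emp3', 'emp4']
print([r[0] for r in rows if keep_row(r, how='any')])  # ['emp4']
print([r[0] for r in rows if keep_row(r, how='all')])  # ['emp1', 'emp2', 'emp3', 'emp4']
```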

In [None]:
# Only drop the rows whose values in the sales column are null
nulls.na.drop(subset=["Sales"]).show()

In [None]:
# Drop any row that has a null value in ANY column (the default behavior)
nulls.na.drop(how='any').show()

In [None]:
# Drop a row only if ALL its values are null.
nulls.na.drop(how='all').show()

In [None]:
# Another way to do it
df.filter(df.CERT_RETURNED.isNotNull()).show()

## Fill the missing values

We can also fill the missing values with new values. If you have nulls across columns of different data types, Spark only fills the columns whose type matches the fill value: a string value fills string columns and a numeric value fills numeric columns. For example:

In [None]:
# Fill all null values in string columns with one common value (character value)
nulls.na.fill('NEW VALUE').show()

In [None]:
# Fill all null values in numeric columns with one common value (numeric value)
nulls.na.fill(0).show()

Usually you should specify which columns you want to fill with the subset parameter:

In [None]:
nulls.na.fill('No Name',subset=['Name']).show()

A very common practice is to fill values with the mean value for the column, for example:

In [None]:
from pyspark.sql.functions import mean
mean_val = nulls.select(mean(nulls['Sales'])).collect()

# Weird nested formatting of Row object!
mean_val[0][0]

In [None]:
mean_sales = mean_val[0][0]

In [None]:
nulls.na.fill(mean_sales,["Sales"]).show()
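The two steps above (compute the mean over the known values, then substitute it for the nulls) can be checked by hand on a small column. This plain-Python sketch uses hypothetical sales figures and relies on the fact that the mean aggregate ignores nulls rather than counting them as zero.

```python
# Hypothetical Sales column with two missing values
sales_column = [None, None, 345.0, 456.0]

# The mean is taken over the non-null values only
known = [value for value in sales_column if value is not None]
mean_sales = sum(known) / len(known)

# Replace each null with the mean, leaving known values untouched
filled = [mean_sales if value is None else value for value in sales_column]

print(mean_sales)  # 400.5
print(filled)      # [400.5, 400.5, 345.0, 456.0]
```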

That is all we need to know for now!

# Additional features

In [1]:
# Compute the Levenshtein distance between two strings
# pyspark.sql.functions.levenshtein(left, right)  - Computes the Levenshtein distance of the two given strings.
from pyspark.sql.functions import levenshtein

df0 = spark.createDataFrame([('Aple', 'Apple','Microsoft','IBM')], ['Input', 'Option1','Option2','Option3'])
print("Correct this company name: Aple")
df0.select(levenshtein('Input', 'Option1').alias('Apple')).show()
df0.select(levenshtein('Input', 'Option2').alias('Microsoft')).show()
df0.select(levenshtein('Input', 'Option3').alias('IBM')).show()
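For readers who want to see what the distance measures, here is a minimal sketch of the textbook dynamic-programming algorithm in plain Python. It is an illustration of the metric (the minimum number of single-character insertions, deletions and substitutions), not Spark's own implementation, and `edit_distance` is a hypothetical name chosen to avoid clashing with the Spark function.

```python
def edit_distance(left, right):
    """Levenshtein distance via the classic row-by-row dynamic program."""
    # previous[j] holds the distance between the processed prefix of left
    # and the first j characters of right
    previous = list(range(len(right) + 1))
    for i, left_char in enumerate(left, start=1):
        current = [i]  # turning i characters into an empty string costs i
        for j, right_char in enumerate(right, start=1):
            substitution = 0 if left_char == right_char else 1
            current.append(min(previous[j] + 1,          # delete from left
                               current[j - 1] + 1,       # insert into left
                               previous[j - 1] + substitution))
        previous = current
    return previous[-1]


for option in ['Apple', 'Microsoft', 'IBM']:
    print(option, edit_distance('Aple', option))
# Apple 1
# Microsoft 9
# IBM 4
```

The smallest distance points at "Apple" as the most likely correction for "Aple".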
