# PySpark Walkthrough

Based on the tutorial by Krish Naik:
[https://www.youtube.com/watch?v=WyZmM6K7ubc&list=PLZoTAELRMXVNjiiawhzZ0afHcPvC8jpcg&index=1]

What is PySpark?

-   PySpark is the Python API for Apache Spark, a distributed framework for large-scale data processing. Spark processes data in memory where it can, keeping intermediate results in RAM instead of writing them to the hard drive between steps. PySpark relies on Py4J, which lets Python code drive Spark's JVM objects, so devs can work with RDDs (Resilient Distributed Datasets) in Python.

What is the PySpark Pandas API?

-   The Pandas API on Spark (`pyspark.pandas`) runs preprocessing in parallel while mimicking the syntax of the Pandas library. It is generally not as fast as native PySpark, since matching the Pandas syntax adds overhead.

When to use Pandas, the Pandas API, and PySpark? As rough rules of thumb:

-   Pandas: smaller datasets with fewer than 1 million rows
-   Pandas API: medium-sized datasets with fewer than 10 million rows
-   PySpark: large datasets containing 1 billion or more rows

What to use for large-scale ML?

-   For distributed preprocessing and machine learning: Use PySpark and its MLlib.
-   For larger-than-memory computations with a Pandas-like API: Use Dask.
-   For deep learning: Use TensorFlow or PyTorch, which support efficient data loading and processing.
-   For NLP tasks: Use Hugging Face’s datasets library with transformers.

## Introduction

In [35]:
import pyspark
from pyspark.sql import SparkSession
import pandas as pd

In [36]:
df = pd.DataFrame({'fname': ['Lora', 'Happy', 'Donny', 'CJ'], 'House': ['Ravenclaw', 'Gryffindor', 'Slytherin', 'Hufflepuff'], 'Year': [4,4,5,6], 'No_Classes': [6,5,5,7]})
df

Unnamed: 0,fname,House,Year,No_Classes
0,Lora,Ravenclaw,4,6
1,Happy,Gryffindor,4,5
2,Donny,Slytherin,5,5
3,CJ,Hufflepuff,6,7


In [11]:
spark = SparkSession.builder.appName('Walkthrough').getOrCreate()

In [12]:
spark

In [13]:
df_pyspark = spark.createDataFrame(df)

In [14]:
df_pyspark.show()

+------+----------+
| fname|     house|
+------+----------+
|  Luna| Ravenclaw|
| Harry|Gryffindor|
| Draco| Slytherin|
|Cedric|Hufflepuff|
+------+----------+



In [15]:
# spark.read.option("header", "true").csv("test1.csv").show()

In [16]:
type(df_pyspark), type(df)

(pyspark.sql.dataframe.DataFrame, pandas.core.frame.DataFrame)

In [17]:
df_pyspark.head(3)

[Row(fname='Luna', house='Ravenclaw'),
 Row(fname='Harry', house='Gryffindor'),
 Row(fname='Draco', house='Slytherin')]

In [18]:
# pyspark df info

df_pyspark.printSchema()

root
 |-- fname: string (nullable = true)
 |-- house: string (nullable = true)



In [37]:
df.to_csv('test1.csv')
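
The `_c0` column that shows up when this file is read back with Spark comes from the Pandas index, which `to_csv` writes as an unnamed first column by default. A minimal sketch of the difference, writing to in-memory buffers instead of `test1.csv` (the buffers and variable names are illustrative, not part of the original notebook):

```python
import io

import pandas as pd

df = pd.DataFrame({'fname': ['Lora', 'Happy'], 'House': ['Ravenclaw', 'Gryffindor']})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
header_with_index = with_index.getvalue().splitlines()[0]

# index=False: only the real columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
header_without_index = without_index.getvalue().splitlines()[0]

print(repr(header_with_index))     # ',fname,House'
print(repr(header_without_index))  # 'fname,House'
```

The rest of this walkthrough keeps the index column, so `_c0` appears in every output.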

## PySpark with DataFrames

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName("DataFrame").getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/07/04 14:48:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
spark

In [41]:
# Read Dataset

spark.read.option('header','true').csv("test1.csv")

DataFrame[_c0: string, fname: string, House: string, Year: string, No_Classes: string]

In [42]:
df_pyspark = spark.read.option('header', 'true').csv('test1.csv')

In [43]:
# Display the rows (without inferSchema, every column is read as a string)

df_pyspark.show()

+---+-----+----------+----+----------+
|_c0|fname|     House|Year|No_Classes|
+---+-----+----------+----+----------+
|  0| Lora| Ravenclaw|   4|         6|
|  1|Happy|Gryffindor|   4|         5|
|  2|Donny| Slytherin|   5|         5|
|  3|   CJ|Hufflepuff|   6|         7|
+---+-----+----------+----+----------+



24/07/03 11:51:32 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: , fname, House, Year, No_Classes
 Schema: _c0, fname, House, Year, No_Classes
Expected: _c0 but found: 
CSV file: file:///Users/druestaples/projects/ner-chatbot-transformer/test1.csv


In [44]:
df_pyspark.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- fname: string (nullable = true)
 |-- House: string (nullable = true)
 |-- Year: string (nullable = true)
 |-- No_Classes: string (nullable = true)



In [4]:
# inferSchema option: set it to True, otherwise every column is read as a string.
# In this example the index column (_c0) is originally a string, but after setting
# inferSchema to True its dtype changes to integer.
df_spark = spark.read.option('header', 'true').csv('test1.csv', inferSchema=True)

In [46]:
df_spark.show()

+---+-----+----------+----+----------+
|_c0|fname|     House|Year|No_Classes|
+---+-----+----------+----+----------+
|  0| Lora| Ravenclaw|   4|         6|
|  1|Happy|Gryffindor|   4|         5|
|  2|Donny| Slytherin|   5|         5|
|  3|   CJ|Hufflepuff|   6|         7|
+---+-----+----------+----+----------+



In [47]:
df_spark.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- fname: string (nullable = true)
 |-- House: string (nullable = true)
 |-- Year: integer (nullable = true)
 |-- No_Classes: integer (nullable = true)



In [7]:
# Equivalent shorthand: pass header and inferSchema directly to spark.read.csv

df_pyspark = spark.read.csv('test1.csv', header=True, inferSchema=True)


In [67]:
df_pyspark

DataFrame[_c0: int, fname: string, House: string, Year: int, No_Classes: int]

In [68]:
df_pyspark.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- fname: string (nullable = true)
 |-- House: string (nullable = true)
 |-- Year: integer (nullable = true)
 |-- No_Classes: integer (nullable = true)



In [69]:
# Type 

type(df_pyspark), type(df_spark)

(pyspark.sql.dataframe.DataFrame, pyspark.sql.dataframe.DataFrame)

In [70]:
# List columns
df_pyspark.columns

['_c0', 'fname', 'House', 'Year', 'No_Classes']

In [71]:
df_pyspark.head(2)

[Row(_c0=0, fname='Lora', House='Ravenclaw', Year=4, No_Classes=6),
 Row(_c0=1, fname='Happy', House='Gryffindor', Year=4, No_Classes=5)]

In [72]:
# Selecting columns

df_pyspark.select('fname')

DataFrame[fname: string]

In [73]:
df_pyspark.select('fname').show()

+-----+
|fname|
+-----+
| Lora|
|Happy|
|Donny|
|   CJ|
+-----+



In [74]:
type(df_pyspark.select('fname'))

pyspark.sql.dataframe.DataFrame

In [75]:
# Select multiple columns

df_pyspark.select(['_c0', 'fname']).show()

+---+-----+
|_c0|fname|
+---+-----+
|  0| Lora|
|  1|Happy|
|  2|Donny|
|  3|   CJ|
+---+-----+



In [76]:
df_pyspark['fname']

Column<'fname'>

In [77]:
# Dtypes
df_pyspark.dtypes

[('_c0', 'int'),
 ('fname', 'string'),
 ('House', 'string'),
 ('Year', 'int'),
 ('No_Classes', 'int')]

In [8]:
# Describe option


df_pyspark.describe()

DataFrame[summary: string, _c0: string, fname: string, House: string, Year: string, No_Classes: string]

In [9]:
df_pyspark.describe().show()

24/07/04 14:49:24 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


+-------+------------------+-----+----------+------------------+------------------+
|summary|               _c0|fname|     House|              Year|        No_Classes|
+-------+------------------+-----+----------+------------------+------------------+
|  count|                 4|    4|         4|                 4|                 4|
|   mean|               1.5| NULL|      NULL|              4.75|              5.75|
| stddev|1.2909944487358056| NULL|      NULL|0.9574271077563382|0.9574271077563382|
|    min|                 0|   CJ|Gryffindor|                 4|                 5|
|    max|                 3| Lora| Slytherin|                 6|                 7|
+-------+------------------+-----+----------+------------------+------------------+



In [None]:
# Adding columns to pyspark dataframes

df_pyspark = df_pyspark.withColumn('Next_Year', df_pyspark['Year']+1)

In [None]:
df_pyspark

DataFrame[_c0: int, fname: string, House: string, Year: int, No_Classes: int, Next_Year: int]

In [None]:
df_pyspark.show()

+---+-----+----------+----+----------+---------+
|_c0|fname|     House|Year|No_Classes|Next_Year|
+---+-----+----------+----+----------+---------+
|  0| Lora| Ravenclaw|   4|         6|        5|
|  1|Happy|Gryffindor|   4|         5|        5|
|  2|Donny| Slytherin|   5|         5|        6|
|  3|   CJ|Hufflepuff|   6|         7|        7|
+---+-----+----------+----+----------+---------+



In [None]:
# Drop columns

df_pyspark = df_pyspark.drop('Next_Year')

In [None]:
df_pyspark.show()

+---+-----+----------+----+----------+
|_c0|fname|     House|Year|No_Classes|
+---+-----+----------+----+----------+
|  0| Lora| Ravenclaw|   4|         6|
|  1|Happy|Gryffindor|   4|         5|
|  2|Donny| Slytherin|   5|         5|
|  3|   CJ|Hufflepuff|   6|         7|
+---+-----+----------+----+----------+



In [None]:
# Rename Columns
df_pyspark = df_pyspark.withColumnRenamed('fname', 'Name')

In [6]:
# Likely run in a fresh kernel before df_pyspark was re-created, hence the NameError below
df_pyspark.show()

NameError: name 'df_pyspark' is not defined

## Handling Missing Values

In [1]:
from pyspark.sql import SparkSession
import pandas as pd
import numpy as np

In [2]:
spark = SparkSession.builder.appName("Practice").getOrCreate()
spark

In [9]:
df = pd.DataFrame(
    {
        'Name': ['Sarah', 'Blake', 'Gina', 'Timmy', 'Wayne', np.nan, np.nan, 'Joshua'],
        'WorkType': ['Contract', 'Contract-To-Hire', np.nan, 'Part-time', 'Full-Time', 'Full-Time', 'Part-Time', 'Contract-To-Hire'],
        'Age': [32, 38, 55, 23, 46, 34, 29, np.nan],
        'YearlySalary': [np.nan, 150000, 200000, 55000, 175000, 190000, 80000, np.nan],
        'TotalExperience': [7, 8, np.nan, 4, np.nan, 6, 5, 9]
    }
)

In [10]:
df

Unnamed: 0,Name,WorkType,Age,YearlySalary,TotalExperience
0,Sarah,Contract,32.0,,7.0
1,Blake,Contract-To-Hire,38.0,150000.0,8.0
2,Gina,,55.0,200000.0,
3,Timmy,Part-time,23.0,55000.0,4.0
4,Wayne,Full-Time,46.0,175000.0,
5,,Full-Time,34.0,190000.0,6.0
6,,Part-Time,29.0,80000.0,5.0
7,Joshua,Contract-To-Hire,,,9.0


In [11]:
# df.to_csv('MissingValuesDataset.csv')

In [3]:
df_pyspark = spark.read.csv('MissingValuesDataset.csv', header=True, inferSchema=True)

In [13]:
df_pyspark.show()

+---+------+----------------+----+------------+---------------+
|_c0|  Name|        WorkType| Age|YearlySalary|TotalExperience|
+---+------+----------------+----+------------+---------------+
|  0| Sarah|        Contract|32.0|        NULL|            7.0|
|  1| Blake|Contract-To-Hire|38.0|    150000.0|            8.0|
|  2|  Gina|            NULL|55.0|    200000.0|           NULL|
|  3| Timmy|       Part-time|23.0|     55000.0|            4.0|
|  4| Wayne|       Full-Time|46.0|    175000.0|           NULL|
|  5|  NULL|       Full-Time|34.0|    190000.0|            6.0|
|  6|  NULL|       Part-Time|29.0|     80000.0|            5.0|
|  7|Joshua|Contract-To-Hire|NULL|        NULL|            9.0|
+---+------+----------------+----+------------+---------------+



24/07/04 15:09:30 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: , Name, WorkType, Age, YearlySalary, TotalExperience
 Schema: _c0, Name, WorkType, Age, YearlySalary, TotalExperience
Expected: _c0 but found: 
CSV file: file:///Users/druestaples/projects/ner-chatbot-transformer/MissingValuesDataset.csv


In [14]:
# Drop columns
df_pyspark.drop('Age').show()

+---+------+----------------+------------+---------------+
|_c0|  Name|        WorkType|YearlySalary|TotalExperience|
+---+------+----------------+------------+---------------+
|  0| Sarah|        Contract|        NULL|            7.0|
|  1| Blake|Contract-To-Hire|    150000.0|            8.0|
|  2|  Gina|            NULL|    200000.0|           NULL|
|  3| Timmy|       Part-time|     55000.0|            4.0|
|  4| Wayne|       Full-Time|    175000.0|           NULL|
|  5|  NULL|       Full-Time|    190000.0|            6.0|
|  6|  NULL|       Part-Time|     80000.0|            5.0|
|  7|Joshua|Contract-To-Hire|        NULL|            9.0|
+---+------+----------------+------------+---------------+



In [15]:
df_pyspark.show()

+---+------+----------------+----+------------+---------------+
|_c0|  Name|        WorkType| Age|YearlySalary|TotalExperience|
+---+------+----------------+----+------------+---------------+
|  0| Sarah|        Contract|32.0|        NULL|            7.0|
|  1| Blake|Contract-To-Hire|38.0|    150000.0|            8.0|
|  2|  Gina|            NULL|55.0|    200000.0|           NULL|
|  3| Timmy|       Part-time|23.0|     55000.0|            4.0|
|  4| Wayne|       Full-Time|46.0|    175000.0|           NULL|
|  5|  NULL|       Full-Time|34.0|    190000.0|            6.0|
|  6|  NULL|       Part-Time|29.0|     80000.0|            5.0|
|  7|Joshua|Contract-To-Hire|NULL|        NULL|            9.0|
+---+------+----------------+----+------------+---------------+



In [16]:
# Drop Rows with Missing Values

df_pyspark.na.drop().show()

+---+-----+----------------+----+------------+---------------+
|_c0| Name|        WorkType| Age|YearlySalary|TotalExperience|
+---+-----+----------------+----+------------+---------------+
|  1|Blake|Contract-To-Hire|38.0|    150000.0|            8.0|
|  3|Timmy|       Part-time|23.0|     55000.0|            4.0|
+---+-----+----------------+----+------------+---------------+



In [18]:
# Understanding the drop method's parameters
# df_pyspark.dropna() is an equivalent alias

'''
Parameters: 
    - how: str -> 'any' or 'all'. 'any' drops a row if it contains any null; 'all' drops a row only if every value is null
    - thresh: int -> overrides the how parameter. Drops rows that have fewer than thresh non-null values
    - subset: str, tuple, or list -> columns to check for nulls (the columns themselves are not dropped)
'''

df_pyspark.na.drop('any').show()

+---+-----+----------------+----+------------+---------------+
|_c0| Name|        WorkType| Age|YearlySalary|TotalExperience|
+---+-----+----------------+----+------------+---------------+
|  1|Blake|Contract-To-Hire|38.0|    150000.0|            8.0|
|  3|Timmy|       Part-time|23.0|     55000.0|            4.0|
+---+-----+----------------+----+------------+---------------+



In [19]:
df_pyspark.na.drop('all').show()

+---+------+----------------+----+------------+---------------+
|_c0|  Name|        WorkType| Age|YearlySalary|TotalExperience|
+---+------+----------------+----+------------+---------------+
|  0| Sarah|        Contract|32.0|        NULL|            7.0|
|  1| Blake|Contract-To-Hire|38.0|    150000.0|            8.0|
|  2|  Gina|            NULL|55.0|    200000.0|           NULL|
|  3| Timmy|       Part-time|23.0|     55000.0|            4.0|
|  4| Wayne|       Full-Time|46.0|    175000.0|           NULL|
|  5|  NULL|       Full-Time|34.0|    190000.0|            6.0|
|  6|  NULL|       Part-Time|29.0|     80000.0|            5.0|
|  7|Joshua|Contract-To-Hire|NULL|        NULL|            9.0|
+---+------+----------------+----+------------+---------------+



In [29]:
# Keep only rows with at least `thresh` non-null values; all other rows are removed
df_pyspark.na.drop(thresh=5).show()


+---+-----+----------------+----+------------+---------------+
|_c0| Name|        WorkType| Age|YearlySalary|TotalExperience|
+---+-----+----------------+----+------------+---------------+
|  0|Sarah|        Contract|32.0|        NULL|            7.0|
|  1|Blake|Contract-To-Hire|38.0|    150000.0|            8.0|
|  3|Timmy|       Part-time|23.0|     55000.0|            4.0|
|  4|Wayne|       Full-Time|46.0|    175000.0|           NULL|
|  5| NULL|       Full-Time|34.0|    190000.0|            6.0|
|  6| NULL|       Part-Time|29.0|     80000.0|            5.0|
+---+-----+----------------+----+------------+---------------+
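
To see why rows 2 and 7 disappear with `thresh=5`, it helps to count the non-null values per row by hand. The sketch below is a plain-Python model of the rule, not Spark's implementation; the helper names are made up for illustration. Note that the index column `_c0` counts toward the threshold.

```python
# Rows of MissingValuesDataset.csv as shown above; None stands in for NULL.
rows = [
    (0, "Sarah", "Contract", 32.0, None, 7.0),
    (1, "Blake", "Contract-To-Hire", 38.0, 150000.0, 8.0),
    (2, "Gina", None, 55.0, 200000.0, None),
    (3, "Timmy", "Part-time", 23.0, 55000.0, 4.0),
    (4, "Wayne", "Full-Time", 46.0, 175000.0, None),
    (5, None, "Full-Time", 34.0, 190000.0, 6.0),
    (6, None, "Part-Time", 29.0, 80000.0, 5.0),
    (7, "Joshua", "Contract-To-Hire", None, None, 9.0),
]


def non_null_count(row):
    """Number of values in the row that are not None."""
    return sum(value is not None for value in row)


def keep_with_thresh(rows, thresh):
    """Keep rows that have at least `thresh` non-null values."""
    return [row for row in rows if non_null_count(row) >= thresh]


kept_thresh_5 = [row[0] for row in keep_with_thresh(rows, thresh=5)]
kept_thresh_6 = [row[0] for row in keep_with_thresh(rows, thresh=6)]
print(kept_thresh_5)  # [0, 1, 3, 4, 5, 6]
print(kept_thresh_6)  # [1, 3]
```

With six columns, `thresh=6` keeps the same rows as `how='any'` did earlier, since a row then needs every value to be non-null.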



In [31]:
# Drop rows that contain null values in a particular column or list of columns
df_pyspark.na.drop(subset='YearlySalary').show()

+---+-----+----------------+----+------------+---------------+
|_c0| Name|        WorkType| Age|YearlySalary|TotalExperience|
+---+-----+----------------+----+------------+---------------+
|  1|Blake|Contract-To-Hire|38.0|    150000.0|            8.0|
|  2| Gina|            NULL|55.0|    200000.0|           NULL|
|  3|Timmy|       Part-time|23.0|     55000.0|            4.0|
|  4|Wayne|       Full-Time|46.0|    175000.0|           NULL|
|  5| NULL|       Full-Time|34.0|    190000.0|            6.0|
|  6| NULL|       Part-Time|29.0|     80000.0|            5.0|
+---+-----+----------------+----+------------+---------------+



In [33]:
# Filling the missing values in a specific column
# df_pyspark.na.fill(value, subset) is an equivalent alias

'''
pyspark.sql.DataFrame.fillna 
Parameters:
    - value: int, float, str, bool, or dict. The value that replaces nulls
    - subset: str, tuple, or list. Column names in which nulls are filled
'''

df_pyspark.fillna(value=5.0, subset='TotalExperience').show()

+---+------+----------------+----+------------+---------------+
|_c0|  Name|        WorkType| Age|YearlySalary|TotalExperience|
+---+------+----------------+----+------------+---------------+
|  0| Sarah|        Contract|32.0|        NULL|            7.0|
|  1| Blake|Contract-To-Hire|38.0|    150000.0|            8.0|
|  2|  Gina|            NULL|55.0|    200000.0|            5.0|
|  3| Timmy|       Part-time|23.0|     55000.0|            4.0|
|  4| Wayne|       Full-Time|46.0|    175000.0|            5.0|
|  5|  NULL|       Full-Time|34.0|    190000.0|            6.0|
|  6|  NULL|       Part-Time|29.0|     80000.0|            5.0|
|  7|Joshua|Contract-To-Hire|NULL|        NULL|            9.0|
+---+------+----------------+----+------------+---------------+



In [34]:
# Replace all missing values in string columns

df_pyspark.fillna(value='N/A').show()

+---+------+----------------+----+------------+---------------+
|_c0|  Name|        WorkType| Age|YearlySalary|TotalExperience|
+---+------+----------------+----+------------+---------------+
|  0| Sarah|        Contract|32.0|        NULL|            7.0|
|  1| Blake|Contract-To-Hire|38.0|    150000.0|            8.0|
|  2|  Gina|             N/A|55.0|    200000.0|           NULL|
|  3| Timmy|       Part-time|23.0|     55000.0|            4.0|
|  4| Wayne|       Full-Time|46.0|    175000.0|           NULL|
|  5|   N/A|       Full-Time|34.0|    190000.0|            6.0|
|  6|   N/A|       Part-Time|29.0|     80000.0|            5.0|
|  7|Joshua|Contract-To-Hire|NULL|        NULL|            9.0|
+---+------+----------------+----+------------+---------------+



In [35]:
# Replace missing values in all numeric columns

df_pyspark.fillna(value=0.0).show()

+---+------+----------------+----+------------+---------------+
|_c0|  Name|        WorkType| Age|YearlySalary|TotalExperience|
+---+------+----------------+----+------------+---------------+
|  0| Sarah|        Contract|32.0|         0.0|            7.0|
|  1| Blake|Contract-To-Hire|38.0|    150000.0|            8.0|
|  2|  Gina|            NULL|55.0|    200000.0|            0.0|
|  3| Timmy|       Part-time|23.0|     55000.0|            4.0|
|  4| Wayne|       Full-Time|46.0|    175000.0|            0.0|
|  5|  NULL|       Full-Time|34.0|    190000.0|            6.0|
|  6|  NULL|       Part-Time|29.0|     80000.0|            5.0|
|  7|Joshua|Contract-To-Hire| 0.0|         0.0|            9.0|
+---+------+----------------+----+------------+---------------+



In [36]:
# Notice that the fill value must match the column's datatype
# e.g. filling a numeric column with 'N/A' is silently ignored and the nulls remain

df_pyspark.fillna(value='N/A', subset='YearlySalary').show()

+---+------+----------------+----+------------+---------------+
|_c0|  Name|        WorkType| Age|YearlySalary|TotalExperience|
+---+------+----------------+----+------------+---------------+
|  0| Sarah|        Contract|32.0|        NULL|            7.0|
|  1| Blake|Contract-To-Hire|38.0|    150000.0|            8.0|
|  2|  Gina|            NULL|55.0|    200000.0|           NULL|
|  3| Timmy|       Part-time|23.0|     55000.0|            4.0|
|  4| Wayne|       Full-Time|46.0|    175000.0|           NULL|
|  5|  NULL|       Full-Time|34.0|    190000.0|            6.0|
|  6|  NULL|       Part-Time|29.0|     80000.0|            5.0|
|  7|Joshua|Contract-To-Hire|NULL|        NULL|            9.0|
+---+------+----------------+----+------------+---------------+



In [38]:
# Replace missing values in several (but not all) columns
# Notice the integer 0 is cast to each column's double dtype
df_pyspark.fillna(value=0, subset=['YearlySalary', 'TotalExperience']).show()

+---+------+----------------+----+------------+---------------+
|_c0|  Name|        WorkType| Age|YearlySalary|TotalExperience|
+---+------+----------------+----+------------+---------------+
|  0| Sarah|        Contract|32.0|         0.0|            7.0|
|  1| Blake|Contract-To-Hire|38.0|    150000.0|            8.0|
|  2|  Gina|            NULL|55.0|    200000.0|            0.0|
|  3| Timmy|       Part-time|23.0|     55000.0|            4.0|
|  4| Wayne|       Full-Time|46.0|    175000.0|            0.0|
|  5|  NULL|       Full-Time|34.0|    190000.0|            6.0|
|  6|  NULL|       Part-Time|29.0|     80000.0|            5.0|
|  7|Joshua|Contract-To-Hire|NULL|         0.0|            9.0|
+---+------+----------------+----+------------+---------------+



In [4]:
df_pyspark.show()

+---+------+----------------+----+------------+---------------+
|_c0|  Name|        WorkType| Age|YearlySalary|TotalExperience|
+---+------+----------------+----+------------+---------------+
|  0| Sarah|        Contract|32.0|        NULL|            7.0|
|  1| Blake|Contract-To-Hire|38.0|    150000.0|            8.0|
|  2|  Gina|            NULL|55.0|    200000.0|           NULL|
|  3| Timmy|       Part-time|23.0|     55000.0|            4.0|
|  4| Wayne|       Full-Time|46.0|    175000.0|           NULL|
|  5|  NULL|       Full-Time|34.0|    190000.0|            6.0|
|  6|  NULL|       Part-Time|29.0|     80000.0|            5.0|
|  7|Joshua|Contract-To-Hire|NULL|        NULL|            9.0|
+---+------+----------------+----+------------+---------------+



In [5]:
from pyspark.ml.feature import Imputer

In [11]:
# Estimator that fills missing values with the mean, median, or mode of each column
imputer = Imputer(
    inputCols=["Age", "YearlySalary", "TotalExperience"],
    outputCols=[f"{col_name}_imputed" for col_name in ["Age", "YearlySalary", "TotalExperience"]]
).setStrategy('median')

In [12]:
imputer.fit(df_pyspark).transform(df_pyspark).show()

+---+------+----------------+----+------------+---------------+-----------+--------------------+-----------------------+
|_c0|  Name|        WorkType| Age|YearlySalary|TotalExperience|Age_imputed|YearlySalary_imputed|TotalExperience_imputed|
+---+------+----------------+----+------------+---------------+-----------+--------------------+-----------------------+
|  0| Sarah|        Contract|32.0|        NULL|            7.0|       32.0|            150000.0|                    7.0|
|  1| Blake|Contract-To-Hire|38.0|    150000.0|            8.0|       38.0|            150000.0|                    8.0|
|  2|  Gina|            NULL|55.0|    200000.0|           NULL|       55.0|            200000.0|                    6.0|
|  3| Timmy|       Part-time|23.0|     55000.0|            4.0|       23.0|             55000.0|                    4.0|
|  4| Wayne|       Full-Time|46.0|    175000.0|           NULL|       46.0|            175000.0|                    6.0|
|  5|  NULL|       Full-Time|34.
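
The imputed `YearlySalary` is 150000.0 rather than the interpolated median of 162500.0. As far as I understand, `Imputer` computes the median with `approxQuantile`, which returns an actual value from the column instead of averaging the two middle values. The sketch below uses Python's `statistics` module as a stand-in to show the difference; it does not reproduce Spark's algorithm.

```python
import statistics

# Non-null values from the dataset shown above.
yearly_salary = [150000.0, 200000.0, 55000.0, 175000.0, 190000.0, 80000.0]
total_experience = [7.0, 8.0, 4.0, 6.0, 5.0, 9.0]

# Interpolating median: the mean of the two middle values.
interpolated_salary = statistics.median(yearly_salary)
# Lower of the two middle values, which matches the imputed columns above.
low_salary = statistics.median_low(yearly_salary)
low_experience = statistics.median_low(total_experience)

print(interpolated_salary)  # 162500.0
print(low_salary)           # 150000.0
print(low_experience)       # 6.0
```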

## Filter Operations

In [2]:
import pyspark

In [5]:
# Create Spark Session

spark = pyspark.sql.SparkSession.builder.appName(name="FilterSession").getOrCreate()
spark

In [15]:
df_pyspark = spark.read.csv("MissingValuesDataset.csv", header=True, inferSchema=True)
df_pyspark

DataFrame[_c0: int, Name: string, WorkType: string, Age: double, YearlySalary: double, TotalExperience: double]

In [16]:
df_pyspark.describe()

DataFrame[summary: string, _c0: string, Name: string, WorkType: string, Age: string, YearlySalary: string, TotalExperience: string]

In [17]:
df_pyspark.show()

+---+------+----------------+----+------------+---------------+
|_c0|  Name|        WorkType| Age|YearlySalary|TotalExperience|
+---+------+----------------+----+------------+---------------+
|  0| Sarah|        Contract|32.0|        NULL|            7.0|
|  1| Blake|Contract-To-Hire|38.0|    150000.0|            8.0|
|  2|  Gina|            NULL|55.0|    200000.0|           NULL|
|  3| Timmy|       Part-time|23.0|     55000.0|            4.0|
|  4| Wayne|       Full-Time|46.0|    175000.0|           NULL|
|  5|  NULL|       Full-Time|34.0|    190000.0|            6.0|
|  6|  NULL|       Part-Time|29.0|     80000.0|            5.0|
|  7|Joshua|Contract-To-Hire|NULL|        NULL|            9.0|
+---+------+----------------+----+------------+---------------+



In [18]:
# double is Spark's 64-bit floating-point type
df_pyspark.dtypes

[('_c0', 'int'),
 ('Name', 'string'),
 ('WorkType', 'string'),
 ('Age', 'double'),
 ('YearlySalary', 'double'),
 ('TotalExperience', 'double')]

In [20]:
# Filter a single column using column instances
df_pyspark.filter(df_pyspark["Age"] > 35).show()

+---+-----+----------------+----+------------+---------------+
|_c0| Name|        WorkType| Age|YearlySalary|TotalExperience|
+---+-----+----------------+----+------------+---------------+
|  1|Blake|Contract-To-Hire|38.0|    150000.0|            8.0|
|  2| Gina|            NULL|55.0|    200000.0|           NULL|
|  4|Wayne|       Full-Time|46.0|    175000.0|           NULL|
+---+-----+----------------+----+------------+---------------+





In [29]:
# Filter a column using a SQL expression passed as a string
df_pyspark.filter('YearlySalary>100000').show()

+---+-----+----------------+----+------------+---------------+
|_c0| Name|        WorkType| Age|YearlySalary|TotalExperience|
+---+-----+----------------+----+------------+---------------+
|  1|Blake|Contract-To-Hire|38.0|    150000.0|            8.0|
|  2| Gina|            NULL|55.0|    200000.0|           NULL|
|  4|Wayne|       Full-Time|46.0|    175000.0|           NULL|
|  5| NULL|       Full-Time|34.0|    190000.0|            6.0|
+---+-----+----------------+----+------------+---------------+





In [38]:
# Filter using multiple conditions through column instances
df_pyspark.filter((df_pyspark['WorkType'] == 'Full-Time') & (df_pyspark['YearlySalary'] > 175000)).show()

+---+----+---------+----+------------+---------------+
|_c0|Name| WorkType| Age|YearlySalary|TotalExperience|
+---+----+---------+----+------------+---------------+
|  5|NULL|Full-Time|34.0|    190000.0|            6.0|
+---+----+---------+----+------------+---------------+





In [33]:
# Filter using multiple conditions through SQL
df_pyspark.filter("TotalExperience >= 6 and Age < 35").show()

+---+-----+---------+----+------------+---------------+
|_c0| Name| WorkType| Age|YearlySalary|TotalExperience|
+---+-----+---------+----+------------+---------------+
|  0|Sarah| Contract|32.0|        NULL|            7.0|
|  5| NULL|Full-Time|34.0|    190000.0|            6.0|
+---+-----+---------+----+------------+---------------+





In [35]:
# Filter on multiple conditions, then display only specific columns

df_pyspark.filter("YearlySalary >= 150000 and Age < 35").select(["WorkType", "Age"]).show()

+---------+----+
| WorkType| Age|
+---------+----+
|Full-Time|34.0|
+---------+----+



In [51]:
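# Filter using the NOT ("~") operator
# Rows where a comparison evaluates to NULL (e.g. Gina's missing WorkType) are filtered out as well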

df_pyspark.filter(
    ~(df_pyspark["WorkType"] == "Contract")
    &
    ~(df_pyspark["YearlySalary"] <= 100000)
    ).show()

+---+-----+----------------+----+------------+---------------+
|_c0| Name|        WorkType| Age|YearlySalary|TotalExperience|
+---+-----+----------------+----+------------+---------------+
|  1|Blake|Contract-To-Hire|38.0|    150000.0|            8.0|
|  4|Wayne|       Full-Time|46.0|    175000.0|           NULL|
|  5| NULL|       Full-Time|34.0|    190000.0|            6.0|
+---+-----+----------------+----+------------+---------------+
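
Gina earns 200000 and has no "Contract" work type, yet she is missing from the result above. The reason is SQL's three-valued logic: a comparison with NULL yields NULL, negating it still yields NULL, and a filter keeps only rows whose condition is true. The sketch below models that logic in plain Python, with `None` standing in for NULL. The helper names are hypothetical and the rows are copied from the dataset shown earlier.

```python
def sql_eq(a, b):
    """Equality that returns None (NULL) when either side is NULL."""
    return None if a is None or b is None else a == b

def sql_le(a, b):
    """Less-than-or-equal that returns None (NULL) when either side is NULL."""
    return None if a is None or b is None else a <= b

def sql_not(value):
    """NOT under three-valued logic: NULL stays NULL."""
    return None if value is None else not value

def sql_and(left, right):
    """AND under three-valued logic: False wins, otherwise NULL propagates."""
    if left is False or right is False:
        return False
    if left is None or right is None:
        return None
    return True

# (_c0, WorkType, YearlySalary) taken from the dataset above.
rows = [
    (0, "Contract", None),
    (1, "Contract-To-Hire", 150000.0),
    (2, None, 200000.0),
    (3, "Part-time", 55000.0),
    (4, "Full-Time", 175000.0),
    (5, "Full-Time", 190000.0),
    (6, "Part-Time", 80000.0),
    (7, "Contract-To-Hire", None),
]

# Keep a row only when the whole condition is exactly True.
kept = [
    row_id
    for row_id, work_type, salary in rows
    if sql_and(
        sql_not(sql_eq(work_type, "Contract")),
        sql_not(sql_le(salary, 100000)),
    ) is True
]

print(kept)  # [1, 4, 5]
```

To keep rows with a missing value, the condition would need an explicit NULL check, for example with the column method `isNull()` combined through `|`.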





## GroupBy and Aggregate Functions

In [14]:
import pyspark
import pandas as pd

In [17]:
df = pd.DataFrame(
    {
        'Name': ['Draco', 'Ron', 'Draco', 'Ron'],
        'Age': [18, 17, 18, 16],
        'Experience': [8, 6, 6, 8],
        'House': ["G", "G", "H", "H"]
    }
    )
df

Unnamed: 0,Name,Age,Experience,House
0,Draco,18,8,G
1,Ron,17,6,G
2,Draco,18,6,H
3,Ron,16,8,H


In [18]:
# df.to_csv("GroupByDataset.csv")

In [3]:
spark = pyspark.sql.SparkSession.builder.appName("GBAF").getOrCreate()
spark



In [19]:
df_pyspark = spark.read.csv('GroupByDataset.csv', header=True, inferSchema=True)
df_pyspark

DataFrame[_c0: int, Name: string, Age: int, Experience: int, House: string]

In [20]:
df_pyspark.show()

+---+-----+---+----------+-----+
|_c0| Name|Age|Experience|House|
+---+-----+---+----------+-----+
|  0|Draco| 18|         8|    G|
|  1|  Ron| 17|         6|    G|
|  2|Draco| 18|         6|    H|
|  3|  Ron| 16|         8|    H|
+---+-----+---+----------+-----+



24/07/05 16:00:59 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: , Name, Age, Experience, House
 Schema: _c0, Name, Age, Experience, House
Expected: _c0 but found: 
CSV file: file:///Users/druestaples/projects/ner-chatbot-transformer/GroupByDataset.csv


In [21]:
# Average every numeric column over the whole DataFrame (groupBy() with no keys forms a single group)
df_pyspark.groupBy().avg().show()

+--------+--------+---------------+
|avg(_c0)|avg(Age)|avg(Experience)|
+--------+--------+---------------+
|     1.5|   17.25|            7.0|
+--------+--------+---------------+





In [23]:
# GroupBy with an aggregate function
# .agg() accepts a dictionary that maps
# ...a column name (the key) to
# ...the name of the aggregate function to apply to that column (the value)
df_pyspark.groupBy('Name').agg({'Experience': 'sum'}).show()

+-----+---------------+
| Name|sum(Experience)|
+-----+---------------+
|Draco|             14|
|  Ron|             14|
+-----+---------------+



In [33]:
# Group by with Aggregate Function and sorting method
df_pyspark.groupBy('House').agg({"Age": "mean"}).sort('avg(Age)').show()

+-----+--------+
|House|avg(Age)|
+-----+--------+
|    H|    17.0|
|    G|    17.5|
+-----+--------+
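
To make the two `.agg()` calls above concrete, here is a small plain-Python sketch of what they compute: rows are bucketed by the grouping key, then one aggregate is applied per bucket. The helper names `group_sum` and `group_mean` are hypothetical, and the rows mirror `GroupByDataset.csv`. This is a model of the result, not of how Spark executes it across partitions.

```python
from collections import defaultdict

# Rows mirroring GroupByDataset.csv (the _c0 index column is omitted).
rows = [
    {"Name": "Draco", "Age": 18, "Experience": 8, "House": "G"},
    {"Name": "Ron", "Age": 17, "Experience": 6, "House": "G"},
    {"Name": "Draco", "Age": 18, "Experience": 6, "House": "H"},
    {"Name": "Ron", "Age": 16, "Experience": 8, "House": "H"},
]

def group_sum(records, key, column):
    """Sum `column` within each distinct value of `key`."""
    totals = defaultdict(int)
    for record in records:
        totals[record[key]] += record[column]
    return dict(totals)

def group_mean(records, key, column):
    """Average `column` within each distinct value of `key`."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record[key]].append(record[column])
    return {group: sum(values) / len(values) for group, values in buckets.items()}

experience_by_name = group_sum(rows, "Name", "Experience")
mean_age_by_house = group_mean(rows, "House", "Age")

print(experience_by_name)  # {'Draco': 14, 'Ron': 14}
print(mean_age_by_house)   # {'G': 17.5, 'H': 17.0}
```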



In [52]:
# Or call .mean(), .max(), .count(), etc. on the grouped data and follow with .select() instead of using .agg()
df_pyspark.groupBy('House').mean().select("House", "avg(Age)").sort("avg(Age)").show()

+-----+--------+
|House|avg(Age)|
+-----+--------+
|    H|    17.0|
|    G|    17.5|
+-----+--------+



In [36]:
# GroupBy Age and take the max of each numeric column, sorted by max(Experience)
df_pyspark.groupBy('Age').max().sort("max(Experience)").show()

+---+--------+--------+---------------+
|Age|max(_c0)|max(Age)|max(Experience)|
+---+--------+--------+---------------+
| 17|       1|      17|              6|
| 16|       3|      16|              8|
| 18|       2|      18|              8|
+---+--------+--------+---------------+





In [43]:
# GroupBy multiple columns and count the rows for each group

df_pyspark.groupBy([df_pyspark['Age'], 'House']).count().sort("Age", "House").show()

+---+-----+-----+
|Age|House|count|
+---+-----+-----+
| 16|    H|    1|
| 17|    G|    1|
| 18|    G|    1|
| 18|    H|    1|
+---+-----+-----+



In [45]:
# GroupBy the name and determine the max age and experience
df_pyspark.groupBy('Name').max().show()

+-----+--------+--------+---------------+
| Name|max(_c0)|max(Age)|max(Experience)|
+-----+--------+--------+---------------+
|Draco|       2|      18|              8|
|  Ron|       3|      17|              8|
+-----+--------+--------+---------------+





In [46]:
# GroupBy name and count the number of rows for each name
df_pyspark.groupBy('Name').count().show()

+-----+-----+
| Name|count|
+-----+-----+
|Draco|    2|
|  Ron|    2|
+-----+-----+



In [53]:
# Count the students in each house
df_pyspark.groupBy('House').count().show()

+-----+-----+
|House|count|
+-----+-----+
|    G|    2|
|    H|    2|
+-----+-----+



In [54]:
# .agg function without GroupBy

df_pyspark.agg({"Age": "mean"}).show()

+--------+
|avg(Age)|
+--------+
|   17.25|
+--------+



In [55]:
# Find the max age and experience for each name
df_pyspark.groupBy('Name').max().show()

+-----+--------+--------+---------------+
| Name|max(_c0)|max(Age)|max(Experience)|
+-----+--------+--------+---------------+
|Draco|       2|      18|              8|
|  Ron|       3|      17|              8|
+-----+--------+--------+---------------+





## Create Unit Test with Pytest

- First, load a DataFrame with missing data and apply a transformation function to preprocess it.
- Next, load a locally saved copy of the same DataFrame that has already been through the preprocessing step.
- Lastly, write a unit test that asserts the two results contain the same data and datatypes.
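
The three steps above can be sketched as a pytest-style test. To keep the sketch runnable without a Spark session, it works on plain Python tuples and `(name, type)` pairs, standing in for what `DataFrame.collect()` and `DataFrame.dtypes` return. The `preprocess` transform, the `assert_same_frame` helper, and the expected rows are all hypothetical illustrations, not the tutorial's final code.

```python
def preprocess(rows):
    """Hypothetical transform: drop rows with fewer than 5 non-null values,
    then replace a missing TotalExperience (the last field) with 0.0."""
    kept = [r for r in rows if sum(v is not None for v in r) >= 5]
    return [r[:-1] + (0.0 if r[-1] is None else r[-1],) for r in kept]

def assert_same_frame(actual_rows, actual_dtypes, expected_rows, expected_dtypes):
    """Assert that two row sets have the same datatypes and the same data."""
    assert actual_dtypes == expected_dtypes
    # Sort by repr so row order does not matter and None values do not break sorting.
    assert sorted(actual_rows, key=repr) == sorted(expected_rows, key=repr)

def test_preprocess_matches_expected():
    dtypes = [
        ("_c0", "int"), ("Name", "string"), ("WorkType", "string"),
        ("Age", "double"), ("YearlySalary", "double"), ("TotalExperience", "double"),
    ]
    # Step 1: raw data with missing values, run through the transform.
    raw = [
        (0, "Sarah", "Contract", 32.0, None, 7.0),
        (2, "Gina", None, 55.0, 200000.0, None),
        (4, "Wayne", "Full-Time", 46.0, 175000.0, None),
    ]
    actual = preprocess(raw)
    # Step 2: the already-preprocessed copy (hard-coded here instead of loaded from disk).
    expected = [
        (0, "Sarah", "Contract", 32.0, None, 7.0),
        (4, "Wayne", "Full-Time", 46.0, 175000.0, 0.0),
    ]
    # Step 3: compare data and datatypes.
    assert_same_frame(actual, dtypes, expected, dtypes)

# pytest would discover the test by its name; calling it directly also works.
test_preprocess_matches_expected()
```

With real DataFrames, the same helper could be fed `df.collect()` and `df.dtypes` for both the transformed and the expected DataFrame.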

In [6]:
import pytest
import pyspark
import pandas as pd

In [4]:
# Create the Spark session used for the tests

spark = pyspark.sql.SparkSession.builder.appName('UnitTest').getOrCreate()



In [5]:
spark

In [7]:
df_pyspark = spark.read.csv('MissingValuesDataset.csv', header=True, inferSchema=True)

In [8]:
df_pyspark.show()

+---+------+----------------+----+------------+---------------+
|_c0|  Name|        WorkType| Age|YearlySalary|TotalExperience|
+---+------+----------------+----+------------+---------------+
|  0| Sarah|        Contract|32.0|        NULL|            7.0|
|  1| Blake|Contract-To-Hire|38.0|    150000.0|            8.0|
|  2|  Gina|            NULL|55.0|    200000.0|           NULL|
|  3| Timmy|       Part-time|23.0|     55000.0|            4.0|
|  4| Wayne|       Full-Time|46.0|    175000.0|           NULL|
|  5|  NULL|       Full-Time|34.0|    190000.0|            6.0|
|  6|  NULL|       Part-Time|29.0|     80000.0|            5.0|
|  7|Joshua|Contract-To-Hire|NULL|        NULL|            9.0|
+---+------+----------------+----+------------+---------------+





In [15]:
''' 
Name: str
WorkType: str
Age: int
YearsWithCompany: int
AnnualPay: float (double)


dropna
fillna
fillna with agg
filter
groupby

'''
# Keep only the rows that have at least 5 non-null values
df_pyspark_dropna = df_pyspark.dropna(thresh=5)
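
As a quick check on what `thresh=5` means, the sketch below counts the non-null values per row in plain Python and keeps the rows that reach the threshold. The function name `drop_rows_below_threshold` is hypothetical, and the rows are copied from `MissingValuesDataset.csv` as displayed above, with `None` standing in for NULL.

```python
# Rows copied from the dataset above:
# (_c0, Name, WorkType, Age, YearlySalary, TotalExperience)
rows = [
    (0, "Sarah", "Contract", 32.0, None, 7.0),
    (1, "Blake", "Contract-To-Hire", 38.0, 150000.0, 8.0),
    (2, "Gina", None, 55.0, 200000.0, None),
    (3, "Timmy", "Part-time", 23.0, 55000.0, 4.0),
    (4, "Wayne", "Full-Time", 46.0, 175000.0, None),
    (5, None, "Full-Time", 34.0, 190000.0, 6.0),
    (6, None, "Part-Time", 29.0, 80000.0, 5.0),
    (7, "Joshua", "Contract-To-Hire", None, None, 9.0),
]

def drop_rows_below_threshold(records, thresh):
    """Keep rows that contain at least `thresh` non-null values."""
    return [r for r in records if sum(v is not None for v in r) >= thresh]

kept = drop_rows_below_threshold(rows, thresh=5)
kept_ids = [r[0] for r in kept]

print(kept_ids)  # [0, 1, 3, 4, 5, 6]
```

Gina and Joshua each have only four non-null values out of six, so they are the two rows removed.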


In [19]:
# Replace missing TotalExperience values with 0.0
df_pyspark_fillna_total_experience = df_pyspark_dropna.fillna(value=0.0, subset="TotalExperience")

In [21]:
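# Note: fillna() does not interpret 'mean' as a statistic. A string value only fills string columns,
# so the missing YearlySalary (a double) stays NULL in the output below.
df_pyspark_fillna_total_experience.fillna(value='mean', subset='YearlySalary').show()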

+---+-----+----------------+----+------------+---------------+
|_c0| Name|        WorkType| Age|YearlySalary|TotalExperience|
+---+-----+----------------+----+------------+---------------+
|  0|Sarah|        Contract|32.0|        NULL|            7.0|
|  1|Blake|Contract-To-Hire|38.0|    150000.0|            8.0|
|  3|Timmy|       Part-time|23.0|     55000.0|            4.0|
|  4|Wayne|       Full-Time|46.0|    175000.0|            0.0|
|  5| NULL|       Full-Time|34.0|    190000.0|            6.0|
|  6| NULL|       Part-Time|29.0|     80000.0|            5.0|
+---+-----+----------------+----+------------+---------------+
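
Sarah's salary is still NULL above because the fill value `'mean'` is a string while `YearlySalary` is a double column. The sketch below models both behaviours in plain Python: a type-mismatched fill that leaves the column untouched, and a fill with a mean computed beforehand. The `fillna_model` helper is a hypothetical simplification (string versus non-string only), and the salaries are the six rows that remain after the `dropna` step.

```python
# YearlySalary for the six remaining rows (None stands in for NULL).
salaries = [None, 150000.0, 55000.0, 175000.0, 190000.0, 80000.0]

def fillna_model(values, column_is_string, replacement):
    """Simplified model: a replacement is applied only when its type
    (string vs. non-string) matches the type of the column."""
    if isinstance(replacement, str) != column_is_string:
        return list(values)  # type mismatch: the column is left unchanged
    return [replacement if v is None else v for v in values]

# Passing the string 'mean' to a numeric column changes nothing.
unchanged = fillna_model(salaries, column_is_string=False, replacement="mean")

# Computing the mean first, ignoring missing values, gives a numeric fill value.
known = [s for s in salaries if s is not None]
mean_salary = sum(known) / len(known)
filled = fillna_model(salaries, column_is_string=False, replacement=mean_salary)

print(unchanged[0])  # None
print(mean_salary)   # 130000.0
print(filled[0])     # 130000.0
```

In PySpark the same two steps would typically be computing the value with `pyspark.sql.functions.mean` (for example through `select(...).first()[0]`) and then passing it to `fillna`, either as `value=` with `subset=` or as a `{"YearlySalary": mean_value}` dictionary.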



