# PySpark DataFrame basics

Install the PySpark library using pip.

In [None]:
## Package import
!pip install --user --upgrade pip
!pip install --user pyspark
!pip install --user pandas==1.3

In [2]:
## Basic Functions
import pandas as pd
import numpy as np

import random
import string

## PySpark
import pyspark
from pyspark.sql.functions import isnan, isnull
from pyspark.sql.functions import monotonically_increasing_id, row_number
from pyspark.sql import Window, SparkSession, SQLContext
from pyspark.ml.feature import Imputer, VectorAssembler, StandardScaler
from pyspark.ml.regression import LinearRegression

## Generate a Random Dataset
Create a random dataset with columns like name, age, salary, company, and position. 

In [3]:
#Create a random dataset

# Function to generate a random name and position
def generate_name_and_position():
    first_names = ["Alice", "Bob", "Charlie", "David", "Emma", "Frank", "Grace", "Henry", "Ivy", "Jack"]
    last_names = ["Smith", "Johnson", "Williams", "Jones", "Brown", "Davis", "Miller", "Wilson", "Moore", "Taylor"]

    # 0.01% chance (probability 0.0001) of being a C-suite or board member
    if random.random() < 0.0001:
        positions = ["CEO", "CTO", "CFO", "CMO", "COO", "Chairman", "Board Member"]
        position = random.choice(positions)
    else:
        positions = ["Data Scientist", "Machine Learning Engineer", "Data Engineer", "Data Analyst", "AI Researcher",
                     "Intern", "HR", "Manager", "Product Owner", "Developer"]
        position = random.choice(positions)

    return f"{random.choice(first_names)} {random.choice(last_names)}", position

# Function to generate a random company name
def generate_company_name():
    company_suffixes = ["Technologies", "Solutions", "Innovations", "Labs", "Systems", "Analytics"]
    company_prefixes = ['Alpha', 'Beta', 'Gamma', 'Delta','Meta']
    company_name = f"{random.choice(company_prefixes)} {random.choice(company_suffixes)}"
    
    return company_name


# Function to generate random years of experience, age, and salary (null values are injected later, in generate_csv)
def generate_experience_age_salary():
    experience = random.randint(0, 10)
    age = random.randint(18, 65)
    salary = random.randint(8000, 120000)

    return experience, age, salary

# Function to generate the dataset, inject missing values, and write it to a CSV file
def generate_csv(n, file_name):

    # Generate random data for n rows
    data = [(*generate_name_and_position(), *generate_experience_age_salary(),generate_company_name()) for _ in range(n)]

    # Convert data to DataFrame
    df = pd.DataFrame(data, columns=['Name', 'Position', 'Experience', 'Age', 'Salary','Company'])

    # Randomly remove values within each column
    for col in df.columns:
        # Set the value to NaN in a random 10% of the rows, sampled independently for each column
        df.loc[df.sample(frac=0.1).index, col] = np.nan

    # Write data to CSV file
    csv_file_path = file_name
    df.to_csv(csv_file_path, index=False)

    print(f"CSV file '{csv_file_path}' created successfully.")


In [7]:
def command_line_tool (n,file_name):
    
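    # IPython substitutes {n} and {file_name} with the Python values;
    # this assumes a generate_csv.py script that accepts these two arguments
    !python generate_csv.py {n} {file_name}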

command_line_tool (1000,'test2.csv')
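
The cell above shells out to a `generate_csv.py` script that is not shown in this notebook. The following is only a sketch of what such a script might look like, assuming it takes two positional arguments (row count and output path). The names `parse_args`, `write_rows`, and `main`, and the two placeholder columns, are hypothetical and not taken from the original script.

```python
import argparse
import csv
import random


def parse_args(argv=None):
    """Parse the two positional arguments: row count and output path."""
    parser = argparse.ArgumentParser(description="Write a random CSV file.")
    parser.add_argument("n", type=int, help="number of rows to generate")
    parser.add_argument("file_name", help="path of the CSV file to write")
    return parser.parse_args(argv)


def write_rows(n, file_name):
    """Write a header plus n rows of placeholder data (hypothetical columns)."""
    with open(file_name, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Age", "Salary"])
        for _ in range(n):
            writer.writerow([random.randint(18, 65), random.randint(8000, 120000)])


def main(argv=None):
    """Entry point; argv=None makes argparse read sys.argv."""
    args = parse_args(argv)
    write_rows(args.n, args.file_name)
    print(f"CSV file '{args.file_name}' created successfully.")

# In a standalone script, call main() under an `if __name__ == "__main__":` guard.
```

Taking `argv` as a parameter keeps the function callable from a notebook cell, for example `main(["1000", "test2.csv"])`, as well as from the shell.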

In [4]:
generate_csv(5000000,'test1.csv')

CSV file 'test1.csv' created successfully.
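
The null-injection loop in `generate_csv` samples a fixed fraction of rows per column, so each column loses exactly 10% of its values, while the rows affected differ from column to column. A small sketch of that behaviour on a 100-row stand-in frame (the column values and the `random_state` seeds are assumptions made for reproducibility):

```python
import numpy as np
import pandas as pd

# Small stand-in frame; float and string columns accept NaN without a dtype change.
df = pd.DataFrame({
    "Age": np.arange(100, dtype=float),
    "Salary": np.linspace(8000.0, 120000.0, 100),
    "Company": ["Alpha Labs"] * 100,
})

for i, col in enumerate(df.columns):
    # sample(frac=0.1) returns exactly 10% of the rows (10 of 100 here);
    # a different random_state per column gives independent row choices.
    rows = df.sample(frac=0.1, random_state=i).index
    df.loc[rows, col] = np.nan

nulls_per_column = df.isna().sum()
complete_rows = len(df.dropna())
print(nulls_per_column)
print("rows without any null:", complete_rows)
```

Because the sampled rows only partly overlap between columns, the number of fully populated rows is noticeably lower than 90% of the data.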


## Start Spark Session
Initializing a Spark session with `SparkSession.builder.appName("example").getOrCreate()` creates a connection to a Spark cluster, allowing you to use Spark functionality in your Jupyter notebook. Here `"example"` is the application name.

In [5]:
# Start a spark session
from pyspark.sql import SparkSession

## Spark Sessions: Local vs. Cloud

When working with Apache Spark locally, you typically create only one Spark session in your application, because local mode can already use all available CPU cores on your machine. Creating multiple Spark sessions in local mode can lead to resource conflicts, since each session would try to use the same resources. Local mode is typically meant for development and testing, and it is more appropriate to rely on Spark's built-in parallel processing capabilities than to open several sessions.

In a cloud environment, such as on platforms like Amazon EMR, Google Dataproc, or Databricks, you can create multiple Spark sessions because these platforms manage the distribution and allocation of resources across a cluster of machines. Each Spark session in a cloud environment is associated with one or more executor nodes in the cluster.

### Why Multiple Spark Sessions in the Cloud?

1. **Distributed Environment:**
   - In the cloud, Spark operates in a distributed environment, and each Spark session can be associated with different parts of the cluster.

2. **Resource Management:**
   - Cloud platforms handle resource management, allocating resources like CPU, memory, and storage to each Spark session independently. This allows multiple Spark sessions to run concurrently without resource conflicts.

3. **Scalability:**
   - Cloud environments allow you to scale the number of nodes in your Spark cluster based on your processing needs. This scalability facilitates the creation of multiple Spark sessions to handle different workloads concurrently.

In [6]:
spark = SparkSession.builder.appName('Practise').getOrCreate()
spark



# Data Handling

### Reading a Dataset

Reading a CSV file with PySpark's `spark.read.option("inferSchema", "true").option("header", "true")` lets Spark infer the data type of each column from the data itself, at the cost of an extra pass over the file. The "header" option indicates that the first row of the CSV file contains column headers.

In [7]:
df_spark = spark.read.csv('test1.csv', inferSchema=True, header=True)
df_spark.printSchema()

root
 |-- Name: string (nullable = true)
 |-- Position: string (nullable = true)
 |-- Experience: double (nullable = true)
 |-- Age: double (nullable = true)
 |-- Salary: double (nullable = true)
 |-- Company: string (nullable = true)



In [8]:
# Basic Operations
type(df_spark)

pyspark.sql.dataframe.DataFrame

In [9]:
df_spark.show(10)

+--------------+--------------------+----------+----+--------+------------------+
|          Name|            Position|Experience| Age|  Salary|           Company|
+--------------+--------------------+----------+----+--------+------------------+
|  Ivy Williams|       Product Owner|       8.0|30.0|117550.0|   Gamma Solutions|
|  Bob Williams|        Data Analyst|       8.0|58.0| 26085.0|              null|
|          null|Machine Learning ...|       2.0|51.0| 22728.0|     Alpha Systems|
| Charlie Brown|        Data Analyst|       2.0|39.0| 38455.0|    Meta Analytics|
|Frank Williams|       Product Owner|       4.0|57.0| 65109.0|        Gamma Labs|
|Henry Williams|                null|       2.0|26.0| 81062.0| Meta Technologies|
|          null|              Intern|      null|58.0| 53429.0|  Beta Innovations|
|          null|             Manager|       7.0|20.0| 37876.0|Delta Technologies|
| Jack Williams|Machine Learning ...|       4.0|36.0| 10997.0|    Meta Analytics|
|          null|

### Selecting Columns and Checking Column Data Types (Schema)

In [11]:
df_spark.select('Name','Age').show(5)

+--------------+----+
|          Name| Age|
+--------------+----+
|  Ivy Williams|30.0|
|  Bob Williams|58.0|
|          null|51.0|
| Charlie Brown|39.0|
|Frank Williams|57.0|
+--------------+----+
only showing top 5 rows



In [13]:
df_spark.dtypes

[('Name', 'string'),
 ('Position', 'string'),
 ('Experience', 'double'),
 ('Age', 'double'),
 ('Salary', 'double'),
 ('Company', 'string')]

In [14]:
print(df_spark.describe())
df_spark.describe().show()

DataFrame[summary: string, Name: string, Position: string, Experience: string, Age: string, Salary: string, Company: string]
+-------+-----------+-------------+------------------+------------------+------------------+-----------------+
|summary|       Name|     Position|        Experience|               Age|            Salary|          Company|
+-------+-----------+-------------+------------------+------------------+------------------+-----------------+
|  count|    4500000|      4500000|           4500000|           4500000|           4500000|          4500000|
|   mean|       null|         null| 5.001912222222222| 41.49889266666667|63980.533792222224|             null|
| stddev|       null|         null|3.1605540489825565|13.858792239732075|32324.683588721702|             null|
|    min|Alice Brown|AI Researcher|               0.0|              18.0|            8000.0|  Alpha Analytics|
|    max|Jack Wilson|Product Owner|              10.0|              65.0|          120000.0|Meta T

### Adding and Dropping Columns

In [16]:
# Adding column into dataframe
df_spark = df_spark.withColumn('Year Born',2023-df_spark['Age'])
df_spark.show(5)

+--------------+--------------------+----------+----+--------+---------------+---------+
|          Name|            Position|Experience| Age|  Salary|        Company|Year Born|
+--------------+--------------------+----------+----+--------+---------------+---------+
|  Ivy Williams|       Product Owner|       8.0|30.0|117550.0|Gamma Solutions|   1993.0|
|  Bob Williams|        Data Analyst|       8.0|58.0| 26085.0|           null|   1965.0|
|          null|Machine Learning ...|       2.0|51.0| 22728.0|  Alpha Systems|   1972.0|
| Charlie Brown|        Data Analyst|       2.0|39.0| 38455.0| Meta Analytics|   1984.0|
|Frank Williams|       Product Owner|       4.0|57.0| 65109.0|     Gamma Labs|   1966.0|
+--------------+--------------------+----------+----+--------+---------------+---------+
only showing top 5 rows



In [17]:
df_spark = df_spark.withColumnRenamed('Name','Full Name')
df_spark.show(5)

+--------------+--------------------+----------+----+--------+---------------+---------+
|     Full Name|            Position|Experience| Age|  Salary|        Company|Year Born|
+--------------+--------------------+----------+----+--------+---------------+---------+
|  Ivy Williams|       Product Owner|       8.0|30.0|117550.0|Gamma Solutions|   1993.0|
|  Bob Williams|        Data Analyst|       8.0|58.0| 26085.0|           null|   1965.0|
|          null|Machine Learning ...|       2.0|51.0| 22728.0|  Alpha Systems|   1972.0|
| Charlie Brown|        Data Analyst|       2.0|39.0| 38455.0| Meta Analytics|   1984.0|
|Frank Williams|       Product Owner|       4.0|57.0| 65109.0|     Gamma Labs|   1966.0|
+--------------+--------------------+----------+----+--------+---------------+---------+
only showing top 5 rows



# PySpark Handling Missing Values



Using `df_spark.show()` displays the first few rows of the Spark DataFrame, providing a quick overview of the data. `df_spark.printSchema()` prints the schema of the DataFrame, revealing the data types and structure of each column. These actions aid in understanding the content and structure of the dataset.

In [18]:
# drop all rows with at least one NA or null
df_spark.na.drop().show(5)

+--------------+--------------------+----------+----+--------+-----------------+---------+
|     Full Name|            Position|Experience| Age|  Salary|          Company|Year Born|
+--------------+--------------------+----------+----+--------+-----------------+---------+
|  Ivy Williams|       Product Owner|       8.0|30.0|117550.0|  Gamma Solutions|   1993.0|
| Charlie Brown|        Data Analyst|       2.0|39.0| 38455.0|   Meta Analytics|   1984.0|
|Frank Williams|       Product Owner|       4.0|57.0| 65109.0|       Gamma Labs|   1966.0|
| Jack Williams|Machine Learning ...|       4.0|36.0| 10997.0|   Meta Analytics|   1987.0|
|  Frank Wilson|             Manager|       8.0|47.0|107395.0|Beta Technologies|   1976.0|
+--------------+--------------------+----------+----+--------+-----------------+---------+
only showing top 5 rows



In [19]:
# The 'how' parameter accepts 'any' or 'all'; the default is 'any'
# 'any' drops a row if it has at least one null; 'all' drops it only if every value is null
df_spark.na.drop(how='any').show(10)
df_spark.na.drop(how='all').show(10)

+--------------+--------------------+----------+----+--------+-----------------+---------+
|     Full Name|            Position|Experience| Age|  Salary|          Company|Year Born|
+--------------+--------------------+----------+----+--------+-----------------+---------+
|  Ivy Williams|       Product Owner|       8.0|30.0|117550.0|  Gamma Solutions|   1993.0|
| Charlie Brown|        Data Analyst|       2.0|39.0| 38455.0|   Meta Analytics|   1984.0|
|Frank Williams|       Product Owner|       4.0|57.0| 65109.0|       Gamma Labs|   1966.0|
| Jack Williams|Machine Learning ...|       4.0|36.0| 10997.0|   Meta Analytics|   1987.0|
|  Frank Wilson|             Manager|       8.0|47.0|107395.0|Beta Technologies|   1976.0|
|     Ivy Brown|             Manager|       1.0|48.0|104867.0|    Alpha Systems|   1975.0|
|    Emma Smith|       Data Engineer|       2.0|30.0| 60924.0|     Meta Systems|   1993.0|
| Charlie Jones|             Manager|      10.0|54.0| 15520.0|  Gamma Analytics|   1969.0|

In [21]:
# Using thresh.
# With thresh=2, a row is kept only if it has at least two non-null values; when thresh is set it overrides how
df_spark.na.drop(how='any',thresh=2).show(10)

+--------------+--------------------+----------+----+--------+------------------+---------+
|     Full Name|            Position|Experience| Age|  Salary|           Company|Year Born|
+--------------+--------------------+----------+----+--------+------------------+---------+
|  Ivy Williams|       Product Owner|       8.0|30.0|117550.0|   Gamma Solutions|   1993.0|
|  Bob Williams|        Data Analyst|       8.0|58.0| 26085.0|              null|   1965.0|
|          null|Machine Learning ...|       2.0|51.0| 22728.0|     Alpha Systems|   1972.0|
| Charlie Brown|        Data Analyst|       2.0|39.0| 38455.0|    Meta Analytics|   1984.0|
|Frank Williams|       Product Owner|       4.0|57.0| 65109.0|        Gamma Labs|   1966.0|
|Henry Williams|                null|       2.0|26.0| 81062.0| Meta Technologies|   1997.0|
|          null|              Intern|      null|58.0| 53429.0|  Beta Innovations|   1965.0|
|          null|             Manager|       7.0|20.0| 37876.0|Delta Technologies

In [22]:
# Using subset.
# remove rows that have 'null' in one column
df_spark.na.drop(how='any',subset='Full Name').show(10)

+--------------+--------------------+----------+----+--------+-----------------+---------+
|     Full Name|            Position|Experience| Age|  Salary|          Company|Year Born|
+--------------+--------------------+----------+----+--------+-----------------+---------+
|  Ivy Williams|       Product Owner|       8.0|30.0|117550.0|  Gamma Solutions|   1993.0|
|  Bob Williams|        Data Analyst|       8.0|58.0| 26085.0|             null|   1965.0|
| Charlie Brown|        Data Analyst|       2.0|39.0| 38455.0|   Meta Analytics|   1984.0|
|Frank Williams|       Product Owner|       4.0|57.0| 65109.0|       Gamma Labs|   1966.0|
|Henry Williams|                null|       2.0|26.0| 81062.0|Meta Technologies|   1997.0|
| Jack Williams|Machine Learning ...|       4.0|36.0| 10997.0|   Meta Analytics|   1987.0|
|  Frank Wilson|             Manager|       8.0|47.0|107395.0|Beta Technologies|   1976.0|
|     Ivy Brown|             Manager|       1.0|48.0|104867.0|    Alpha Systems|   1975.0|

In [23]:
# Filling Missing values

In [24]:
# A string fill value is applied to string columns only; nulls in numeric columns are left as null
df_spark.na.fill('NaN').show(10)

+--------------+--------------------+----------+----+--------+------------------+---------+
|     Full Name|            Position|Experience| Age|  Salary|           Company|Year Born|
+--------------+--------------------+----------+----+--------+------------------+---------+
|  Ivy Williams|       Product Owner|       8.0|30.0|117550.0|   Gamma Solutions|   1993.0|
|  Bob Williams|        Data Analyst|       8.0|58.0| 26085.0|               NaN|   1965.0|
|           NaN|Machine Learning ...|       2.0|51.0| 22728.0|     Alpha Systems|   1972.0|
| Charlie Brown|        Data Analyst|       2.0|39.0| 38455.0|    Meta Analytics|   1984.0|
|Frank Williams|       Product Owner|       4.0|57.0| 65109.0|        Gamma Labs|   1966.0|
|Henry Williams|                 NaN|       2.0|26.0| 81062.0| Meta Technologies|   1997.0|
|           NaN|              Intern|      null|58.0| 53429.0|  Beta Innovations|   1965.0|
|           NaN|             Manager|       7.0|20.0| 37876.0|Delta Technologies

In [25]:
df_spark.na.fill(0,['Age','Experience','Salary']).show(10)

+--------------+--------------------+----------+----+--------+------------------+---------+
|     Full Name|            Position|Experience| Age|  Salary|           Company|Year Born|
+--------------+--------------------+----------+----+--------+------------------+---------+
|  Ivy Williams|       Product Owner|       8.0|30.0|117550.0|   Gamma Solutions|   1993.0|
|  Bob Williams|        Data Analyst|       8.0|58.0| 26085.0|              null|   1965.0|
|          null|Machine Learning ...|       2.0|51.0| 22728.0|     Alpha Systems|   1972.0|
| Charlie Brown|        Data Analyst|       2.0|39.0| 38455.0|    Meta Analytics|   1984.0|
|Frank Williams|       Product Owner|       4.0|57.0| 65109.0|        Gamma Labs|   1966.0|
|Henry Williams|                null|       2.0|26.0| 81062.0| Meta Technologies|   1997.0|
|          null|              Intern|       0.0|58.0| 53429.0|  Beta Innovations|   1965.0|
|          null|             Manager|       7.0|20.0| 37876.0|Delta Technologies

# Filter Operations

This section covers:
- PySpark DataFrames
- Filter operations
- &, |, == 
- ~


In [29]:
## Salary less than or equal to 60000
df_spark.filter("Salary<=60000").show()
# Other method: df_spark.filter(df_spark['Salary']<=60000).show()

+--------------+--------------------+----------+----+-------+------------------+---------+
|     Full Name|            Position|Experience| Age| Salary|           Company|Year Born|
+--------------+--------------------+----------+----+-------+------------------+---------+
|  Bob Williams|        Data Analyst|       8.0|58.0|26085.0|              null|   1965.0|
|          null|Machine Learning ...|       2.0|51.0|22728.0|     Alpha Systems|   1972.0|
| Charlie Brown|        Data Analyst|       2.0|39.0|38455.0|    Meta Analytics|   1984.0|
|          null|              Intern|      null|58.0|53429.0|  Beta Innovations|   1965.0|
|          null|             Manager|       7.0|20.0|37876.0|Delta Technologies|   2003.0|
| Jack Williams|Machine Learning ...|       4.0|36.0|10997.0|    Meta Analytics|   1987.0|
|          null|                null|       3.0|null|53073.0| Delta Innovations|     null|
| Charlie Jones|             Manager|      10.0|54.0|15520.0|   Gamma Analytics|   1969.0|

In [30]:
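# Note: with | (OR), every non-null Salary satisfies at least one of the two conditions, so only null salaries are removed;
# use & (AND) instead to keep only salaries between 60000 and 75000
df_spark.filter((df_spark['Salary']>=60000) | (df_spark['Salary']<=75000)).show()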

+--------------+--------------------+----------+----+--------+------------------+---------+
|     Full Name|            Position|Experience| Age|  Salary|           Company|Year Born|
+--------------+--------------------+----------+----+--------+------------------+---------+
|  Ivy Williams|       Product Owner|       8.0|30.0|117550.0|   Gamma Solutions|   1993.0|
|  Bob Williams|        Data Analyst|       8.0|58.0| 26085.0|              null|   1965.0|
|          null|Machine Learning ...|       2.0|51.0| 22728.0|     Alpha Systems|   1972.0|
| Charlie Brown|        Data Analyst|       2.0|39.0| 38455.0|    Meta Analytics|   1984.0|
|Frank Williams|       Product Owner|       4.0|57.0| 65109.0|        Gamma Labs|   1966.0|
|Henry Williams|                null|       2.0|26.0| 81062.0| Meta Technologies|   1997.0|
|          null|              Intern|      null|58.0| 53429.0|  Beta Innovations|   1965.0|
|          null|             Manager|       7.0|20.0| 37876.0|Delta Technologies

In [31]:
df_spark.filter((df_spark["Salary"] >= 70000) & (df_spark["Position"] == "Data Engineer")).show(10)

+---------------+-------------+----------+----+--------+-----------------+---------+
|      Full Name|     Position|Experience| Age|  Salary|          Company|Year Born|
+---------------+-------------+----------+----+--------+-----------------+---------+
|     Ivy Taylor|Data Engineer|       8.0|37.0| 75551.0|  Delta Analytics|   1986.0|
|Charlie Johnson|Data Engineer|       4.0|59.0| 82748.0|        Beta Labs|   1964.0|
|           null|Data Engineer|       4.0|64.0|117264.0|  Delta Solutions|   1959.0|
|    Henry Davis|Data Engineer|       4.0|34.0|107322.0|   Beta Analytics|   1989.0|
|   Alice Miller|Data Engineer|       4.0|35.0|100242.0|        Beta Labs|   1988.0|
|   Henry Wilson|Data Engineer|       8.0|62.0| 72275.0|Beta Technologies|   1961.0|
|     Jack Brown|Data Engineer|       5.0|44.0| 92677.0|             null|   1979.0|
| David Williams|Data Engineer|       2.0|28.0| 93153.0| Meta Innovations|   1995.0|
| Charlie Miller|Data Engineer|      10.0|28.0|101487.0|    Gamma

### PySpark GroupBy and Aggregate Operations

In Apache Spark, the GroupBy and Aggregate functions are used for data aggregation and summarization. The `groupBy` operation is similar to its counterpart in pandas and is employed to group data based on one or more columns. Subsequently, aggregate functions, such as `sum`, `avg`, `max`, or custom aggregation expressions, are applied to the grouped data.

Compared to pandas, Spark's GroupBy and Aggregate functions operate in a distributed manner, allowing them to handle large datasets that exceed the memory capacity of a single machine. While both pandas and Spark provide similar functionalities, Spark's distributed nature enables it to process vast amounts of data in parallel, making it advantageous for big data scenarios.

In terms of speed, Spark's GroupBy and Aggregate operations may exhibit better performance on large-scale datasets compared to pandas, especially when leveraging the parallel processing capabilities of a Spark cluster.


In [33]:
df_spark.groupBy('Position').max().show()

+--------------------+---------------+--------+-----------+--------------+
|            Position|max(Experience)|max(Age)|max(Salary)|max(Year Born)|
+--------------------+---------------+--------+-----------+--------------+
|                 CTO|           10.0|    65.0|   115678.0|        2004.0|
|                  HR|           10.0|    65.0|   120000.0|        2005.0|
|                null|           10.0|    65.0|   120000.0|        2005.0|
|              Intern|           10.0|    65.0|   120000.0|        2005.0|
|           Developer|           10.0|    65.0|   120000.0|        2005.0|
|       Product Owner|           10.0|    65.0|   120000.0|        2005.0|
|Machine Learning ...|           10.0|    65.0|   120000.0|        2005.0|
|                 CFO|           10.0|    65.0|   117446.0|        2005.0|
|                 CEO|           10.0|    64.0|   119662.0|        2005.0|
|      Data Scientist|           10.0|    65.0|   120000.0|        2005.0|
|        Data Analyst|   

In [34]:
df_spark.groupBy('Position').count().show()

+--------------------+------+
|            Position| count|
+--------------------+------+
|                 CTO|    55|
|                  HR|450842|
|                null|500000|
|              Intern|450314|
|           Developer|448851|
|       Product Owner|448808|
|Machine Learning ...|449738|
|                 CFO|    79|
|                 CEO|    69|
|      Data Scientist|450302|
|        Data Analyst|450126|
|            Chairman|    66|
|        Board Member|    60|
|       AI Researcher|450487|
|             Manager|449975|
|                 CMO|    67|
|                 COO|    73|
|       Data Engineer|450088|
+--------------------+------+



In [35]:
# Aggregation
df_spark.agg({'Salary':'sum'}).show()

+----------------+
|     sum(Salary)|
+----------------+
|2.87912402065E11|
+----------------+



In [36]:
df_spark.groupBy('Position').agg({'Salary':'mean'}).show()

+--------------------+-----------------+
|            Position|      avg(Salary)|
+--------------------+-----------------+
|                 CTO|57389.41176470588|
|                  HR|64049.72387482003|
|                null|63990.30724020897|
|              Intern|63989.16718638214|
|           Developer|64008.08671762606|
|       Product Owner|63985.54207518261|
|Machine Learning ...| 63862.1746125443|
|                 CFO|65832.30985915494|
|                 CEO|     64872.796875|
|      Data Scientist|63996.87451278357|
|        Data Analyst|63954.98347060536|
|            Chairman|68150.79032258065|
|        Board Member|57970.64705882353|
|       AI Researcher|64024.12765505994|
|             Manager| 63933.3196142544|
|                 CMO|61796.13114754098|
|                 COO|66257.74242424243|
|       Data Engineer|63990.78019210973|
+--------------------+-----------------+



### Using Imputer Function

In Spark ML, the `Imputer` estimator is used for handling missing values in the numeric columns of a DataFrame. It imputes (fills in) missing values in the specified columns with a statistic computed from the non-missing values of each column: the mean (the default), the median, or, in newer Spark releases, the mode. `Imputer` is particularly useful when dealing with datasets containing incomplete or null values.

In [26]:
from pyspark.ml.feature import Imputer

In [27]:
impute_cols = ['Experience', 'Age', 'Salary']
imputer = Imputer(
    inputCols=impute_cols,
    outputCols=["{}_impute".format(c) for c in impute_cols],
).setStrategy('mean')

In [28]:
imputer.fit(df_spark).transform(df_spark).show(10)

+--------------+--------------------+----------+----+--------+------------------+---------+-----------------+-----------------+-------------+
|     Full Name|            Position|Experience| Age|  Salary|           Company|Year Born|Experience_impute|       Age_impute|Salary_impute|
+--------------+--------------------+----------+----+--------+------------------+---------+-----------------+-----------------+-------------+
|  Ivy Williams|       Product Owner|       8.0|30.0|117550.0|   Gamma Solutions|   1993.0|              8.0|             30.0|     117550.0|
|  Bob Williams|        Data Analyst|       8.0|58.0| 26085.0|              null|   1965.0|              8.0|             58.0|      26085.0|
|          null|Machine Learning ...|       2.0|51.0| 22728.0|     Alpha Systems|   1972.0|              2.0|             51.0|      22728.0|
| Charlie Brown|        Data Analyst|       2.0|39.0| 38455.0|    Meta Analytics|   1984.0|              2.0|             39.0|      38455.0|
|Frank