# Data Manipulation with PySpark in 10 Steps

In this notebook, I'll show you how to perform data manipulation with PySpark in 10 steps. Let's dive in!

---
<a id="toc"></a>
# **Table of Contents**
---

**1.**  [**Creating SparkSession**](#Step1)<br>
**2.**  [**Reading Data**](#Step2)<br>
**3.**  [**Understanding Data**](#Step3)<br>
**4.**  [**Selecting Columns**](#Step4)<br>
**5.**  [**Data Filtering**](#Step5)<br>
**6.**  [**Adding New Columns**](#Step6)<br>
**7.**  [**Grouping Data**](#Step7)<br>
**8.**  [**Applying User-Defined Functions**](#Step8)<br>
**9.**  [**Deleting Data**](#Step9)<br>
**10.** [**Writing Data**](#Step10)<br>
**11.** [**Conclusion**](#Step11)<br>

---
<a name = Step1></a>
## **1. Creating SparkSession**
---
<a id="0"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#edcff0" data-toggle="popover">Content</a>

To work with PySpark, you first need to create a SparkSession, which is the entry point to PySpark functionality. Let's instantiate one.

In [1]:
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName('data_manipulation').getOrCreate()

PySpark uses a builder pattern: the `SparkSession.builder` object provides a set of methods for configuring the session. You can use the `appName` method to give the app a name. The `getOrCreate` method returns the existing SparkSession if there is one and creates a new one otherwise, so the same code works in both interactive and batch mode.

---
<a name = Step2></a>
## **2. Reading Data**
---
<a id="0"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#edcff0" data-toggle="popover">Content</a>

The data I'm going to use in this notebook is the diabetes dataset in CSV format. You can find this dataset here. PySpark offers two main structures for storing data when performing manipulations: the RDD and the DataFrame. You can think of the RDD as a distributed collection of objects (or rows), and of the DataFrame as a table whose records are organized in named columns. Let's read our dataset as a DataFrame.

In [2]:
df = spark.read.csv('data/diabetes.csv', header=True, inferSchema=True)

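I set the `header` parameter to True so that the first line of the file is used for the column names, and the `inferSchema` parameter to True so that Spark infers the data type of each column on its own. Note that schema inference requires an extra pass over the data.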

---
<a name = Step3></a>
## **3. Understanding Data**
---
<a id="0"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#edcff0" data-toggle="popover">Content</a>

Understanding data is one of the crucial steps of data analysis. Let's take a look at the first ten rows of the dataset with the `show` method.

In [3]:
df.show(10)

+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin| BMI|DiabetesPedigreeFunction|Age|Outcome|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|          6|    148|           72|           35|      0|33.6|                   0.627| 50|      1|
|          1|     85|           66|           29|      0|26.6|                   0.351| 31|      0|
|          8|    183|           64|            0|      0|23.3|                   0.672| 32|      1|
|          1|     89|           66|           23|     94|28.1|                   0.167| 21|      0|
|          0|    137|           40|           35|    168|43.1|                   2.288| 33|      1|
|          5|    116|           74|            0|      0|25.6|                   0.201| 30|      0|
|          3|     78|           50|           32|     88|31.0|                   0.248| 26|      1|


You can list the column names of your dataset with the `columns` attribute.

In [4]:
df.columns

['Pregnancies',
 'Glucose',
 'BloodPressure',
 'SkinThickness',
 'Insulin',
 'BMI',
 'DiabetesPedigreeFunction',
 'Age',
 'Outcome']

You can use the `count` method to get the total number of records in the DataFrame. Python's built-in `len` function, applied to `df.columns`, gives the number of columns. Let's take a look at the shape of our dataset with `count` and `len`.

In [5]:
print((df.count(),len(df.columns)))

(768, 9)


To get the schema information of the dataset, you can use the `printSchema` method. Together with the `show` method, it is often used to get a first understanding of the data.

In [6]:
# printSchema
df.printSchema()

root
 |-- Pregnancies: integer (nullable = true)
 |-- Glucose: integer (nullable = true)
 |-- BloodPressure: integer (nullable = true)
 |-- SkinThickness: integer (nullable = true)
 |-- Insulin: integer (nullable = true)
 |-- BMI: double (nullable = true)
 |-- DiabetesPedigreeFunction: double (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Outcome: integer (nullable = true)



You can use `describe().show()` to take a look at the descriptive statistics of the dataset. I'm going to set the `truncate` parameter to 8 so that values longer than 8 characters are shortened.

In [7]:
df.describe().show(truncate=8)

+-------+-----------+--------+-------------+-------------+--------+--------+------------------------+--------+--------+
|summary|Pregnancies| Glucose|BloodPressure|SkinThickness| Insulin|     BMI|DiabetesPedigreeFunction|     Age| Outcome|
+-------+-----------+--------+-------------+-------------+--------+--------+------------------------+--------+--------+
|  count|        768|     768|          768|          768|     768|     768|                     768|     768|     768|
|   mean|   3.845...|120.8...|     69.10...|     20.53...|79.79...|31.99...|                0.471...|33.24...|0.348...|
| stddev|   3.369...|31.97...|     19.35...|     15.95...|115.2...|7.884...|                0.331...|11.76...|0.476...|
|    min|          0|       0|            0|            0|       0|     0.0|                   0.078|      21|       0|
|    max|         17|     199|          122|           99|     846|    67.1|                    2.42|      81|       1|
+-------+-----------+--------+----------

---
<a name = Step4></a>
## **4. Selecting Columns**
---
<a id="0"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#edcff0" data-toggle="popover">Content</a>

You can use the `select` method to select specific columns. Let's take the `Pregnancies` and `Age` columns from the dataset with the `select` method.

In [8]:
df.select("Pregnancies", "Age").show(10)

+-----------+---+
|Pregnancies|Age|
+-----------+---+
|          6| 50|
|          1| 31|
|          8| 32|
|          1| 21|
|          0| 33|
|          5| 30|
|          3| 26|
|         10| 29|
|          2| 53|
|          8| 54|
+-----------+---+
only showing top 10 rows



You can also use the `col` function in the `pyspark.sql.functions` module to select columns. Let me show you.

In [9]:
import pyspark.sql.functions as F
df.select(F.col("Pregnancies"), F.col("Age")).show(10)

+-----------+---+
|Pregnancies|Age|
+-----------+---+
|          6| 50|
|          1| 31|
|          8| 32|
|          1| 21|
|          0| 33|
|          5| 30|
|          3| 26|
|         10| 29|
|          2| 53|
|          8| 54|
+-----------+---+
only showing top 10 rows



---
<a name = Step5></a>
## **5. Data Filtering**
---
<a id="0"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#edcff0" data-toggle="popover">Content</a>

To clean the dataset and keep only the records you want, you can filter records based on conditions. There are two methods for this, `filter()` and `where()`; `where()` is an alias for `filter()`. Let's keep the records where the value of the `Age` column is less than 40 with the `filter` method.

In [10]:
df.filter(df['Age']<40).show(10)

+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin| BMI|DiabetesPedigreeFunction|Age|Outcome|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|          1|     85|           66|           29|      0|26.6|                   0.351| 31|      0|
|          8|    183|           64|            0|      0|23.3|                   0.672| 32|      1|
|          1|     89|           66|           23|     94|28.1|                   0.167| 21|      0|
|          0|    137|           40|           35|    168|43.1|                   2.288| 33|      1|
|          5|    116|           74|            0|      0|25.6|                   0.201| 30|      0|
|          3|     78|           50|           32|     88|31.0|                   0.248| 26|      1|
|         10|    115|            0|            0|      0|35.3|                   0.134| 29|      0|


You can chain the `select` method after a filter to see only specific columns. Column names are resolved case-insensitively by default, so `df['age']` refers to the `Age` column.

In [11]:
df.where(df['age'] < 40).select('Insulin','Outcome').show(10)

+-------+-------+
|Insulin|Outcome|
+-------+-------+
|      0|      0|
|      0|      1|
|     94|      0|
|    168|      1|
|      0|      0|
|     88|      1|
|      0|      0|
|      0|      0|
|      0|      1|
|      0|      1|
+-------+-------+
only showing top 10 rows



You can also chain several `filter` calls. Let's find the records where the age is greater than 60 and the outcome is 1, that is, the person has diabetes.

In [12]:
df.filter(df['age'] > 60).filter(df['Outcome'] == 1).show(10)

+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin| BMI|DiabetesPedigreeFunction|Age|Outcome|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|          4|    146|           92|            0|      0|31.2|                   0.539| 61|      1|
|          0|    105|           84|            0|      0|27.9|                   0.741| 62|      1|
|          2|    158|           90|            0|      0|31.6|                   0.805| 66|      1|
|          4|    146|           78|            0|      0|38.5|                    0.52| 67|      1|
|          2|    197|           70|           99|      0|34.7|                   0.575| 62|      1|
|          4|    145|           82|           18|      0|32.5|                   0.235| 70|      1|
|          6|    190|           92|            0|      0|35.5|                   0.278| 66|      1|


You can use operators like `&` and `|` to combine multiple filter conditions. Wrap each condition in parentheses, because in Python these operators bind more tightly than comparisons. Let's filter persons who have diabetes and who have had 9 or more pregnancies using `&`.

In [13]:
df.filter((df['Outcome']==1) & (df['Pregnancies'] >=9)).show(10)

+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin| BMI|DiabetesPedigreeFunction|Age|Outcome|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|         10|    168|           74|            0|      0|38.0|                   0.537| 34|      1|
|          9|    119|           80|           35|      0|29.0|                   0.263| 29|      1|
|         11|    143|           94|           33|    146|36.6|                   0.254| 51|      1|
|         10|    125|           70|           26|    115|31.1|                   0.205| 41|      1|
|          9|    102|           76|           37|      0|32.9|                   0.665| 46|      1|
|          9|    171|          110|           24|    240|45.4|                   0.721| 54|      1|
|         13|    126|           90|            0|      0|43.4|                   0.583| 42|      1|


To count the records that remain after filtering, you can use the `count` method.

In [14]:
df.filter(df['age']>40).count()

194

You can filter data with the `where` method in the same way as with the `filter` method. Let me show you.

In [15]:
df.where((df['Outcome']==1) & (df['Pregnancies'] >=9)).show(10)

+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin| BMI|DiabetesPedigreeFunction|Age|Outcome|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+
|         10|    168|           74|            0|      0|38.0|                   0.537| 34|      1|
|          9|    119|           80|           35|      0|29.0|                   0.263| 29|      1|
|         11|    143|           94|           33|    146|36.6|                   0.254| 51|      1|
|         10|    125|           70|           26|    115|31.1|                   0.205| 41|      1|
|          9|    102|           76|           37|      0|32.9|                   0.665| 46|      1|
|          9|    171|          110|           24|    240|45.4|                   0.721| 54|      1|
|         13|    126|           90|            0|      0|43.4|                   0.583| 42|      1|


You can also use the `where` method along with the `count` method.

In [16]:
df.where(df['age'] > 40).count()

194

---
<a name = Step6></a>
## **6. Adding New Columns**
---
<a id="0"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#edcff0" data-toggle="popover">Content</a>

You can use the `withColumn` method to add a new column. Let's create a new column from the `Age` column by adding ten to each age value.

In [17]:
df.withColumn('New_Age',df['age'] + 10).show(10)

+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+-------+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin| BMI|DiabetesPedigreeFunction|Age|Outcome|New_Age|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+-------+
|          6|    148|           72|           35|      0|33.6|                   0.627| 50|      1|     60|
|          1|     85|           66|           29|      0|26.6|                   0.351| 31|      0|     41|
|          8|    183|           64|            0|      0|23.3|                   0.672| 32|      1|     42|
|          1|     89|           66|           23|     94|28.1|                   0.167| 21|      0|     31|
|          0|    137|           40|           35|    168|43.1|                   2.288| 33|      1|     43|
|          5|    116|           74|            0|      0|25.6|                   0.201| 30|      0|     40|
|          3|     78|       

---
<a name = Step7></a>
## **7. Grouping Data**
---
<a id="0"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#edcff0" data-toggle="popover">Content</a>

When working with large amounts of data, we'll often use the `groupBy` method to summarize it. After grouping the data, you can apply an aggregation function to each group. Let's count the number of records for each value of the `Outcome` column.

In [18]:
df.groupBy('Outcome').count().show()

+-------+-----+
|Outcome|count|
+-------+-----+
|      1|  268|
|      0|  500|
+-------+-----+



You can also combine the `distinct` and `count` methods to count the distinct values in a column. Let's take a look at the number of distinct values in the `Pregnancies` column.

In [19]:
df.select('Pregnancies').distinct().count()

17

You can also use other aggregate functions such as `sum`, `mean`, or `min`. Let's find the mean age for each value of the `Outcome` column. Note that the `alias` method is used to name the new column.

In [20]:
df.groupBy('Outcome').agg(F.mean("age").alias("age_mean")).show(10)

+-------+-----------------+
|Outcome|         age_mean|
+-------+-----------------+
|      1|37.06716417910448|
|      0|            31.19|
+-------+-----------------+



---
<a name = Step8></a>
## **8. Applying User-Defined Functions**
---
<a id="0"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#edcff0" data-toggle="popover">Content</a>

You can also apply your own Python function to a column with UDFs (user-defined functions), which are created with the `udf` function in the `pyspark.sql.functions` module. To show this, let's first create a function named `diabete`.

In [21]:
def diabete(case):
    if case == 1 :
        return "diabete"
    else:
        return 'no diabete'
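
Since a UDF wraps an ordinary Python function, you can check its logic without Spark. The schema reports `Outcome` as nullable, and the function above would label a missing value as 'no diabete', because `None == 1` is false. The sketch below is a hypothetical null-safe variant (the name `diabete_label` is an assumption, not part of the notebook) that passes missing values through unchanged.

```python
from typing import Optional


def diabete_label(case: Optional[int]) -> Optional[str]:
    """Hypothetical null-safe variant of the `diabete` function."""
    if case is None:
        # Keep missing values missing instead of labelling them
        return None
    return "diabete" if case == 1 else "no diabete"


# Quick check in plain Python, no Spark session needed
for value in (1, 0, None):
    print(value, "->", diabete_label(value))
```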

Now let's declare the UDF and its return type (`StringType` in this example). After that, I'm going to use `withColumn` to create a new column by passing the relevant DataFrame column (`Outcome`) to the UDF:

In [22]:
from pyspark.sql.types import StringType
diabete_udf = F.udf(diabete, StringType())
df.withColumn('diabete_case', diabete_udf(df['Outcome'])).show(10)

+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+------------+
|Pregnancies|Glucose|BloodPressure|SkinThickness|Insulin| BMI|DiabetesPedigreeFunction|Age|Outcome|diabete_case|
+-----------+-------+-------------+-------------+-------+----+------------------------+---+-------+------------+
|          6|    148|           72|           35|      0|33.6|                   0.627| 50|      1|     diabete|
|          1|     85|           66|           29|      0|26.6|                   0.351| 31|      0|  no diabete|
|          8|    183|           64|            0|      0|23.3|                   0.672| 32|      1|     diabete|
|          1|     89|           66|           23|     94|28.1|                   0.167| 21|      0|  no diabete|
|          0|    137|           40|           35|    168|43.1|                   2.288| 33|      1|     diabete|
|          5|    116|           74|            0|      0|25.6|                   0.201| 30|     

---
<a name = Step9></a>
## **9. Deleting Data**
---
<a id="0"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#edcff0" data-toggle="popover">Content</a>

To delete a column or multiple columns, you can use the `drop` method in PySpark. Let's delete the `Insulin` column with the `drop` method.

In [23]:
df.drop('Insulin').show(10)

+-----------+-------+-------------+-------------+----+------------------------+---+-------+
|Pregnancies|Glucose|BloodPressure|SkinThickness| BMI|DiabetesPedigreeFunction|Age|Outcome|
+-----------+-------+-------------+-------------+----+------------------------+---+-------+
|          6|    148|           72|           35|33.6|                   0.627| 50|      1|
|          1|     85|           66|           29|26.6|                   0.351| 31|      0|
|          8|    183|           64|            0|23.3|                   0.672| 32|      1|
|          1|     89|           66|           23|28.1|                   0.167| 21|      0|
|          0|    137|           40|           35|43.1|                   2.288| 33|      1|
|          5|    116|           74|            0|25.6|                   0.201| 30|      0|
|          3|     78|           50|           32|31.0|                   0.248| 26|      1|
|         10|    115|            0|            0|35.3|                   0.134| 

To remove duplicate records from the DataFrame, you can use the `dropDuplicates` method.

In [24]:
print("The number of records: ", df.count())
df=df.dropDuplicates()
print("The number of records after removing the duplicate : ", df.count())

The number of records:  768
The number of records after removing the duplicate :  768


As you can see, there are no duplicate records in the dataset.

---
<a name = Step10></a>
## **10. Writing Data**
---
<a id="0"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#edcff0" data-toggle="popover">Content</a>

After performing data manipulation, you'll often want to export your results. You can write the clean DataFrame to a desired location in the required format through its `write` attribute.


Let's write our results to CSV files.

In [25]:
df.write.csv("./data/my_dataset.csv")

Let's take a look at this file.

In [26]:
%%bash
ls data/my_dataset.csv | head -5

_SUCCESS
part-00000-2b5e0da4-fab1-4141-ae6f-1f89eea9818f-c000.csv
part-00001-2b5e0da4-fab1-4141-ae6f-1f89eea9818f-c000.csv
part-00002-2b5e0da4-fab1-4141-ae6f-1f89eea9818f-c000.csv
part-00003-2b5e0da4-fab1-4141-ae6f-1f89eea9818f-c000.csv


As you can see, this folder includes many part files, one for each partition of the DataFrame. To reduce the number of partitions, you can use the `coalesce` method with the desired number of partitions.

In [27]:
df.coalesce(1).write.csv("./data/my_single_partition.csv")

In [28]:
%%bash
ls data/my_single_partition.csv | head

_SUCCESS
part-00000-53da5401-5d0a-4fad-a7e3-0114833b68d8-c000.csv


As you can see, there is a single CSV file inside of this folder.

---
<a name = Step11></a>
## **Conclusion**
---
<a id="0"></a>
<a href="#toc" class="btn btn-primary btn-sm" role="button" aria-pressed="true" 
style="color:white; background-color:#edcff0" data-toggle="popover">Content</a>

In this notebook, I talked about data manipulation with PySpark, from reading data to exporting it.

Thanks for reading. I hope you enjoyed it. Don't forget to follow us on [YouTube](http://youtube.com/tirendazacademy) | [Medium](http://tirendazacademy.medium.com) | [Twitter](http://twitter.com/tirendazacademy) | [GitHub](http://github.com/tirendazacademy) | [Linkedin](https://www.linkedin.com/in/tirendaz-academy) | [Kaggle](https://www.kaggle.com/tirendazacademy) 😎