## Introduction

This article is the first installment in a PySpark tutorial series. We will look closely at how to get started with PySpark's data preprocessing techniques, see what a PySpark DataFrame looks like, and perform some general operations on it, from starting a PySpark session through to basic data preprocessing.

## Table of contents

1. **Starting PySpark session:** Mandatory step to get started with PySpark.
2. **Reading the dataset:** In this section we will read the dataset using only PySpark functions.
3. **Datatypes of the columns:** In this section we will inspect the datatype and related details of each column.
4. **Indexing:** Here we will learn how to select data by column.
5. **Describe:** A function similar to its pandas counterpart; we will see how to use it.
6. **Adding columns:** Get to know how to add columns using PySpark.
7. **Dropping columns:** Get to know how to drop irrelevant columns.
8. **Renaming columns:** Get to know how to rename existing columns in a PySpark DataFrame.

Before moving on to the main functionality we have to **start the Spark session**. So let's do that first!

## Starting PySpark Session

In [1]:
!pip install pyspark

Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
Collecting py4j==0.10.9.3
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=fec94c17ef29546704079ced4911417cbb1f97e6df190bd03abdbef95fad7464
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.3 pyspark-3.2.1


In [2]:
from pyspark.sql import SparkSession

data_spark = SparkSession.builder.appName('DataFrame_article').getOrCreate()

data_spark

**Code breakdown**

1. First, we imported **`SparkSession`** from the **`pyspark.sql`** module.
2. Then, using the **`builder`** attribute and its **`getOrCreate()`** method, we created a SparkSession and stored it in a variable.
3. Finally, we simply looked at what the **`data_spark`** variable holds.

**Note:** This is not a detailed illustration of *"how to start a Spark session"*. If anything here is unclear, I recommend going through my previous article, **`Getting started with PySpark using Python`**.

Readers who are already comfortable with this can jump to the main section of the article.

## Reading the dataset

In [3]:
data_spark.read.option('header','true').csv('/content/sample_data/california_housing_train.csv')

DataFrame[longitude: string, latitude: string, housing_median_age: string, total_rooms: string, total_bedrooms: string, population: string, households: string, median_income: string, median_house_value: string]

**read.option().csv():** This chain of calls reads a CSV file using PySpark. **read.csv()** also works on its own, but to use the first row of the file as the column headers we need **option('header', 'true')** as well.

**Inference:** The output shows that a DataFrame object is returned, listing each column name together with its type.

Now let's look at the actual data, i.e. the columns and the records, using the show() method.

In [4]:
# show() only prints the rows and returns None, so keep the DataFrame in its own variable
df_spark = data_spark.read.option('header','true').csv('/content/sample_data/california_housing_train.csv')
df_spark.show()

+-----------+---------+------------------+-----------+--------------+-----------+-----------+-------------+------------------+
|  longitude| latitude|housing_median_age|total_rooms|total_bedrooms| population| households|median_income|median_house_value|
+-----------+---------+------------------+-----------+--------------+-----------+-----------+-------------+------------------+
|-114.310000|34.190000|         15.000000|5612.000000|   1283.000000|1015.000000| 472.000000|     1.493600|      66900.000000|
|-114.470000|34.400000|         19.000000|7650.000000|   1901.000000|1129.000000| 463.000000|     1.820000|      80100.000000|
|-114.560000|33.690000|         17.000000| 720.000000|    174.000000| 333.000000| 117.000000|     1.650900|      85700.000000|
|-114.570000|33.640000|         14.000000|1501.000000|    337.000000| 515.000000| 226.000000|     3.191700|      73400.000000|
|-114.570000|33.570000|         20.000000|1454.000000|    326.000000| 624.000000| 262.000000|     1.925000|    

With the help of the show() function we can see the records of the dataset. By default it prints only the first 20 rows.

## Checking DataTypes of the columns

In [5]:
df_pyspark = data_spark.read.option('header','true').csv('/content/sample_data/california_housing_train.csv')
df_pyspark.printSchema()

root
 |-- longitude: string (nullable = true)
 |-- latitude: string (nullable = true)
 |-- housing_median_age: string (nullable = true)
 |-- total_rooms: string (nullable = true)
 |-- total_bedrooms: string (nullable = true)
 |-- population: string (nullable = true)
 |-- households: string (nullable = true)
 |-- median_income: string (nullable = true)
 |-- median_house_value: string (nullable = true)



**Inference:** The **printSchema()** function returns the name of each column, its datatype, and whether it is nullable.

But hold on! Every column is reported as **`string`**, and that is not right for numeric data, is it?
**Answer:** The cause is not **printSchema()** itself but the **default** setting of the CSV reader: unless we tell it otherwise, it reads every column as a string.

So, let's fix this issue first!

In [6]:
df_pyspark = data_spark.read.option('header','true').csv('/content/sample_data/california_housing_train.csv', inferSchema=True)
df_pyspark.printSchema()

root
 |-- longitude: double (nullable = true)
 |-- latitude: double (nullable = true)
 |-- housing_median_age: double (nullable = true)
 |-- total_rooms: double (nullable = true)
 |-- total_bedrooms: double (nullable = true)
 |-- population: double (nullable = true)
 |-- households: double (nullable = true)
 |-- median_income: double (nullable = true)
 |-- median_house_value: double (nullable = true)



**Inference:** Now we can see the correct datatype for each column with just one minor change: adding the argument **`inferSchema=True`**, which tells the CSV reader to scan the data and infer each column's type instead of defaulting to string. One more thing to note is **nullable = true**, which means that the column is allowed to contain null values.

There is one more way of **checking the Data types of the columns** which is pretty similar to what we used to do in the case of the pandas DataFrame. Let's see that approach as well!

In [10]:
df_pyspark.dtypes

[('longitude', 'double'),
 ('latitude', 'double'),
 ('housing_median_age', 'double'),
 ('total_rooms', 'double'),
 ('total_bedrooms', 'double'),
 ('population', 'double'),
 ('households', 'double'),
 ('median_income', 'double'),
 ('median_house_value', 'double')]

**Inference:** This returns the same information as the previous approach but in a different format: a **list of tuples**, each holding a column name and its type.

## Column Indexing

First let us see how we can get the name of each column, so that we can use those names for column indexing and other operations.

In [7]:
df_pyspark.columns

['longitude',
 'latitude',
 'housing_median_age',
 'total_rooms',
 'total_bedrooms',
 'population',
 'households',
 'median_income',
 'median_house_value']

**Inference:** The **`columns`** attribute returns the names of all the columns in the dataset as a Python **list**.

Now let's understand how we can select columns. For instance, let's say that we want to pluck out only the **total_rooms** column from the dataset.

In [8]:
df_pyspark.select('total_rooms').show()

+-----------+
|total_rooms|
+-----------+
|     5612.0|
|     7650.0|
|      720.0|
|     1501.0|
|     1454.0|
|     1387.0|
|     2907.0|
|      812.0|
|     4789.0|
|     1497.0|
|     3741.0|
|     1988.0|
|     1291.0|
|     2478.0|
|     1448.0|
|     2556.0|
|     1678.0|
|       44.0|
|     1388.0|
|       97.0|
+-----------+
only showing top 20 rows



**Inference:** With the help of the **`select`** function we have selected only the **total_rooms** column, and the result is itself a PySpark **DataFrame**.

So far we have plucked out only a single column from the dataset, but what if we want to grab **multiple columns**? Let's have a look!

In [9]:
df_pyspark.select(['total_rooms', 'total_bedrooms', 'median_income']).show()

+-----------+--------------+-------------+
|total_rooms|total_bedrooms|median_income|
+-----------+--------------+-------------+
|     5612.0|        1283.0|       1.4936|
|     7650.0|        1901.0|         1.82|
|      720.0|         174.0|       1.6509|
|     1501.0|         337.0|       3.1917|
|     1454.0|         326.0|        1.925|
|     1387.0|         236.0|       3.3438|
|     2907.0|         680.0|       2.6768|
|      812.0|         168.0|       1.7083|
|     4789.0|        1175.0|       2.1782|
|     1497.0|         309.0|       2.1908|
|     3741.0|         801.0|       2.6797|
|     1988.0|         483.0|        1.625|
|     1291.0|         248.0|       2.1571|
|     2478.0|         464.0|        3.212|
|     1448.0|         378.0|       0.8585|
|     2556.0|         587.0|       1.6991|
|     1678.0|         322.0|       2.9653|
|       44.0|          33.0|       0.8571|
|     1388.0|         386.0|       1.2049|
|       97.0|          24.0|       1.2656|
+----------

**Inference:** This time we passed **multiple column names** to the **`select`** method in the form of a **`list`**, the same logic we use with a **pandas DataFrame**. With just this small change we can grab multiple columns from our dataset as required.

## Describe function in PySpark

In [13]:
df_pyspark.describe().show()

+-------+-------------------+------------------+------------------+-----------------+-----------------+------------------+-----------------+------------------+------------------+
|summary|          longitude|          latitude|housing_median_age|      total_rooms|   total_bedrooms|        population|       households|     median_income|median_house_value|
+-------+-------------------+------------------+------------------+-----------------+-----------------+------------------+-----------------+------------------+------------------+
|  count|              17000|             17000|             17000|            17000|            17000|             17000|            17000|             17000|             17000|
|   mean|-119.56210823529375|  35.6252247058827| 28.58935294117647|2643.664411764706|539.4108235294118|1429.5739411764705|501.2219411764706| 3.883578100000021|207300.91235294117|
| stddev| 2.0051664084260357|2.1373397946570867|12.586936981660406|2179.947071452777|421.4994515798648| 1

**Inference:** Here is the result of PySpark's **describe function**. Anyone familiar with **pandas's describe function** will find it very similar, because it reports largely the same **statistics** in the same layout (pandas additionally reports the 25%, 50% and 75% quartiles).

This function reports the following details of the dataset:
1. count: The number of non-null records in each column.
2. mean: The mean of the column values.
3. stddev: The standard deviation of the column values.
4. min: The minimum value present in the column.
5. max: The maximum value present in the column.


## Adding columns in PySpark DataFrame 

In [15]:
df_pyspark = df_pyspark.withColumn('Updated longitude', df_pyspark['longitude']+1.2)
df_pyspark.show()

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+-------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|  Updated longitude|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+-------------------+
|  -114.31|   34.19|              15.0|     5612.0|        1283.0|    1015.0|     472.0|       1.4936|           66900.0|            -113.11|
|  -114.47|    34.4|              19.0|     7650.0|        1901.0|    1129.0|     463.0|         1.82|           80100.0|            -113.27|
|  -114.56|   33.69|              17.0|      720.0|         174.0|     333.0|     117.0|       1.6509|           85700.0|            -113.36|
|  -114.57|   33.64|              14.0|     1501.0|         337.0|     515.0|     226.0|       3.1917|           73400.0|-113.36999999999999|
|  -11

**Inference:** From the above output we can clearly see that a new column named **"Updated longitude"** has been added to the DataFrame.

Let's discuss what we did to add the column:
1. We used the **`withColumn()`** function, which adds a new column or replaces an existing one in a PySpark DataFrame.
2. We passed two parameters to that function:
  * The first one is the **name of the new column**.
  * The second one is the **expression that defines the values** the **new column will hold.**

## Dropping columns in PySpark DataFrame

Dropping a column from the dataset is a pretty straightforward task, and for that we will be using PySpark's **`drop()`** function.

In [17]:
df_pyspark.drop('Updated longitude').show()

+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population|households|median_income|median_house_value|
+---------+--------+------------------+-----------+--------------+----------+----------+-------------+------------------+
|  -114.31|   34.19|              15.0|     5612.0|        1283.0|    1015.0|     472.0|       1.4936|           66900.0|
|  -114.47|    34.4|              19.0|     7650.0|        1901.0|    1129.0|     463.0|         1.82|           80100.0|
|  -114.56|   33.69|              17.0|      720.0|         174.0|     333.0|     117.0|       1.6509|           85700.0|
|  -114.57|   33.64|              14.0|     1501.0|         337.0|     515.0|     226.0|       3.1917|           73400.0|
|  -114.57|   33.57|              20.0|     1454.0|         326.0|     624.0|     262.0|        1.925|           65500.0|
|  -114.58|   33.63|    

**Inference:** In the output we can see that the "Updated longitude" column no longer appears. We simply passed the name of the column as the parameter and the column was dropped from the result. Note that **`drop()`** returns a new DataFrame; since we did not assign the result back, `df_pyspark` itself still contains the column, which is why it shows up again in the next section's output.

Note: If we want to drop **multiple columns** at once, we can pass several column names as separate arguments, e.g. `drop('col_a', 'col_b')`, or unpack a list of names with `*`.

## Renaming the column

In [19]:
df_pyspark.withColumnRenamed('population', 'population per capita').show()

+---------+--------+------------------+-----------+--------------+---------------------+----------+-------------+------------------+-------------------+
|longitude|latitude|housing_median_age|total_rooms|total_bedrooms|population per capita|households|median_income|median_house_value|  Updated longitude|
+---------+--------+------------------+-----------+--------------+---------------------+----------+-------------+------------------+-------------------+
|  -114.31|   34.19|              15.0|     5612.0|        1283.0|               1015.0|     472.0|       1.4936|           66900.0|            -113.11|
|  -114.47|    34.4|              19.0|     7650.0|        1901.0|               1129.0|     463.0|         1.82|           80100.0|            -113.27|
|  -114.56|   33.69|              17.0|      720.0|         174.0|                333.0|     117.0|       1.6509|           85700.0|            -113.36|
|  -114.57|   33.64|              14.0|     1501.0|         337.0|                

**Inference:** From the above output we can see that the **"population"** column is renamed to **"population per capita"** by using the **`withColumnRenamed()`** function. The first parameter is the existing column name and the second parameter is the new name.

## Conclusion

So finally it's time to conclude this article. Let's quickly recap everything we have covered, with a short description of each point.

1. The very first thing we learned is how to start the Spark session, as this is a mandatory step for working with PySpark.
2. Then we learned how to get information about the columns of the dataset by using the printSchema() function, the columns attribute and the describe() function.
3. Finally, we looked at how to manipulate the schema of the dataset by adding, dropping and renaming columns.