<a href="https://colab.research.google.com/github/Ramprashanth17/DataEngineering/blob/main/Databricks_DE/Practice/PySpark_Fundamentals.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Spark Dataframes and their Operations


In [2]:
## Installing PySpark

!pip install pyspark



Once PySpark is installed, we create a Spark session, which acts as the entry point for any Spark application.

In [3]:
### Creating a spark session

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

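If you are using the Spark shell, you do not need to run the code above, because the session is created automatically. Note that `getOrCreate()` returns the existing session if one is already running rather than creating a duplicate. An application has only one active SparkContext at a time; any additional session created with `spark.newSession()` shares that same context.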

## Dataset or DataFrame API

A Dataset is a distributed collection of data. The Dataset API is available in Java and Scala, but not in Python or R. It is built on top of RDDs (Resilient Distributed Datasets) and provides strong, compile-time typing.

For Python and R users, there is the similar DataFrame API, influenced by Pandas DataFrames in Python. A DataFrame is essentially a table: the headers are the column names, and the data is arranged in rows beneath them. This API is also built on top of RDDs and hides much of their complexity. DataFrames are lazily evaluated and immutable.

Lazy evaluation lets Spark optimize the whole chain of transformations and run computations only when they are needed. Computation starts only when an action (such as `show()`, `count()`, or `collect()`) is called on a DataFrame.

## Creating DataFrames

DataFrames are the main building blocks for working with data in Spark. They organize data into rows and columns.

You can specify the schema of a DataFrame explicitly or let Spark infer it from the data.

#### 1. Creating DataFrame using a list of rows

```
from datetime import datetime, date
from pyspark.sql import Row

# Build the DataFrame from a list of Row objects.
# The schema is defined explicitly as a DDL string; omit the `schema`
# argument to let Spark infer the column types from the data instead.
data_df = spark.createDataFrame([
    Row(col_1=100, col_2=200., col_3='string_test_1', col_4=date(2023, 1, 1), col_5=datetime(2023, 1, 1, 12, 0)),
    Row(col_1=200, col_2=300., col_3='string_test_2', col_4=date(2023, 2, 1), col_5=datetime(2023, 1, 2, 12, 0)),
    Row(col_1=400, col_2=500., col_3='string_test_3', col_4=date(2023, 3, 1), col_5=datetime(2023, 1, 3, 12, 0))
], schema='col_1 long, col_2 double, col_3 string, col_4 date, col_5 timestamp')
```


#### 2. Creating from a Pandas DataFrame
First create a DataFrame using Pandas, then convert it to a PySpark DataFrame.

```
import pandas as pd
from datetime import datetime, date
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Build the DataFrame in Pandas first
pandas_df = pd.DataFrame({
    'col_1': [100, 200, 300],
    'col_2': [200., 300., 400.],
    'col_3': ['string_test_1', 'string_test_2', 'string_test_3'],
    'col_4': [date(2023, 1, 1), date(2023, 2, 1), date(2023, 3, 1)],
    'col_5': [datetime(2023, 1, 1, 12, 0), datetime(2023, 1, 2, 12, 0), datetime(2023, 1, 3, 12, 0)]
})
# Convert the Pandas DataFrame to a PySpark DataFrame
data_df = spark.createDataFrame(pandas_df)
```


#### 3. Using Tuples

We can represent each row as a tuple, parallelize the tuples into an RDD, and create the DataFrame from that RDD.

```
from datetime import datetime, date
rdd = spark.sparkContext.parallelize([
    (100, 200., 'string_test_1', date(2023, 1, 1), datetime(2023, 1, 1, 12, 0)),
    (200, 300., 'string_test_2', date(2023, 2, 1), datetime(2023, 1, 2, 12, 0)),
    (300, 400., 'string_test_3', date(2023, 3, 1), datetime(2023, 1, 3, 12, 0))
])
data_df = spark.createDataFrame(rdd, schema=['col_1', 'col_2', 'col_3', 'col_4', 'col_5'])
```

In [None]:
## Viewing DataFrames

data_df.show()

In [None]:
## Viewing top n rows

data_df.show(2)

In [None]:
## Viewing the schema of the dataframe

data_df.printSchema()

In [None]:
## To view the data vertically
data_df.show(1, vertical=True) # Shows the first record and its columns

In [None]:
data_df.columns ## Shows the columns

In [None]:
data_df.select('col_1', 'col_2').describe().show() ## Shows the summary statistics

#### Collecting the Data

`collect()` is used when we want to bring all the data being processed on the executors across the cluster back to the driver. Make sure that the driver has enough memory to hold the result, to avoid out-of-memory errors.


```
data_df.collect()
```

Use `head`, `tail`, or `take` to avoid out-of-memory errors, as they return only a subset of the rows in the DataFrame.


```
data_df.count()
```

`count()` returns the number of rows present in the DataFrame.


***`tail(n)` can be expensive, as it might have to scan the entire dataset before returning the last n rows***

- **`take` and `collect` are used to retrieve rows, with `take` being more suitable for small subsets and `collect` for retrieving all data**
- **`show` is used for visual inspection, `head` retrieves the first rows as Row objects, and `tail` retrieves the last rows of the dataset**

#### Converting a PySpark DataFrame to a Pandas DataFrame

A PySpark DataFrame can be converted to a Pandas DataFrame with ***toPandas().***

Since Pandas is not distributed, converting a PySpark DataFrame to Pandas makes the driver collect all the data into its own memory. The driver must have enough memory to hold it; otherwise the conversion fails with an out-of-memory error.

#### How to manipulate data on rows and columns

- 1. Selecting Columns

Use column expressions to manipulate data at the column level in a Spark DataFrame.
```
from pyspark.sql import Column
data_df.select(data_df.col_3).show()
### Or use this method

data_df.select(data_df['col_3']).show()
```
- 2. Creating Columns

Use the `withColumn()` function to create a new column in a DataFrame. It takes the name of the new column and a column expression that supplies its values.


```
from pyspark.sql import functions as F
data_df = data_df.withColumn("col_6", F.lit("A")) ## Here A is the literal
data_df.show()
```

- 3. Dropping Columns

Use the `drop()` function to drop a column from a Spark DataFrame.

```
data_df = data_df.drop("col_5")
data_df.show()
```

Dropping a column that does not exist does not raise an error; the DataFrame is returned unchanged.

- 4. Updating Columns

```
data_df.withColumn("col_2", F.col("col_2") / 100).show()
```

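One thing to note here is the use of the `col` function when updating the column. It returns a Column object, which supports column-wise operators such as `/`. If we don't use this function, `"col_2" / 100` tries to divide a plain Python string by a number, and our code will return an error.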

- 5. Renaming Columns
```
data_df = data_df.withColumnRenamed("col_3", "string_col")
data_df.show()
```

- 6. Find unique values in a column
```
data_df.select("col_6").distinct().show()
```

To show the count of distinct values in a given column:
```
data_df.select(F.countDistinct("col_6").alias("Total_Unique")).show()
```


- 7. To change the case
```
from pyspark.sql.functions import upper
data_df.withColumn('upper_string_col', upper(data_df.string_col)).show()
```


- 8. To filter out the records

```
data_df.filter(data_df.col_1 == 100).show()
```
- 9. Logical operators in a DataFrame
```
data_df.filter((data_df.col_1 == 100)
                  & (data_df.col_6 == 'A')).show()
```

```
data_df.filter((data_df.col_1 == 100)
                  | (data_df.col_2 == 300.00)).show()
```

- 10. isin() operator

The `isin()` function keeps the rows whose column value appears in a given list.

```
values = [100, 200]  # avoid naming this `list`, which would shadow the Python built-in
data_df.filter(data_df.col_1.isin(values)).show()
```



### DataType conversions

Use the `cast()` function to change a column's data type.

```
from pyspark.sql.functions import col
from pyspark.sql.types import StringType, IntegerType
# Cast col_4 to string and col_1 to integer
data_df_2 = data_df.withColumn("col_4", col("col_4").cast(StringType())) \
    .withColumn("col_1", col("col_1").cast(IntegerType()))
data_df_2.printSchema()
data_df_2.show()
```

```
data_df_3 = data_df_2.selectExpr("cast(col_4 as date) col_4",
    "cast(col_1 as long) col_1")
data_df_3.printSchema()
data_df_3.show(truncate=False)
```

```
data_df_3.createOrReplaceTempView("CastExample")
data_df_4 = spark.sql("SELECT DOUBLE(col_1), DATE(col_4) from CastExample")
data_df_4.printSchema()
data_df_4.show(truncate=False)
```


### Dropping null values

Use the `.dropna()` function to drop rows that contain null values.

### Dropping duplicates

Use `.dropDuplicates()` to drop duplicate rows.