## Data Transformation with Spark

Data transformation is a critical step in the journey from raw data to actionable insights. It involves cleaning, enriching, and structuring data to make it suitable for analysis. Data transformation matters because it can:

- **Enhance Data Quality**: Transformations help clean and validate data, ensuring that it is accurate and reliable for downstream analysis

- **Enable Analysis**: Well-transformed data is easier to analyze, allowing data scientists and analysts to derive meaningful patterns, trends, and insights

- **Support Decision-Making**: Businesses rely on high-quality, transformed data to make informed decisions, optimize processes, and gain a competitive edge

> Databricks leverages the power of Spark to ensure that data undergoes these transformations efficiently, providing a seamless and collaborative environment where the full potential of transformed data can be realized.

## Apache Spark Architecture

Before we start leveraging Spark to perform data transformations, let's first understand the architecture underlying Apache Spark. Remember, Spark is a unified engine for large-scale distributed data processing on computer clusters.

<p align="center">
    <img src="images/SparkArchitecture.png" width="700" height="350"/>
</p>

### Cluster Manager

> Apache Spark's architecture revolves around a *Cluster Manager*, a central entity that coordinates the distribution of tasks across a computing cluster. The Cluster Manager is responsible for resource allocation and task scheduling.
 
In Databricks, users are abstracted from direct interaction with the Cluster Manager. Users only interact with Databricks to create and configure clusters through an interface. In the background, the platform automatically manages cluster resources, handling tasks like resource allocation and task scheduling.

### Spark Application

> A *Spark Application* represents the entire computation process performed using Spark. It consists of the driver program (*Spark Driver*) and a set of executor programs (*Spark Executors*). The Spark Application defines the tasks to be executed on the Spark cluster and submits them to the Cluster Manager for execution.

Databricks facilitates the submission and management of Spark Applications. Users define and execute Spark Applications through Databricks Notebooks. Databricks also manages the Spark Application lifecycle, including job submission and execution.

### Spark Executors

> **Spark Executors** are processes that run on the cluster's worker nodes and are responsible for executing tasks. Executors manage data partitions in memory and store intermediate results. They enhance performance by processing data close to where it is stored, minimizing data movement across the network.

Databricks transparently manages Spark Executors. When users execute Spark Jobs, the Databricks platform dynamically allocates and oversees Spark Executors on the underlying infrastructure. Users do not need to manually configure or monitor individual executors.

### Spark Driver

> The **Spark Driver** is a central control program that manages the overall execution of a Spark job. It communicates with the Cluster Manager to acquire resources and coordinates tasks across Spark Executors. The Spark Driver is responsible for overseeing the execution flow and collection of final results.

Users initiate the Spark Driver through Databricks notebooks or jobs. When code is executed in a notebook or a job is triggered, Databricks coordinates with the Spark Driver. This automated process ensures smooth execution of Spark tasks without requiring users to start the Spark Driver manually.

### Spark Session

> The *Spark Session* serves as the entry point for interacting with Spark. It manages configuration settings and provides a unified interface for executing various operations.

Databricks abstracts the concept of Spark Session for users. Starting a Spark Session is implicit when running code in a notebook cell. Databricks handles the creation of the Spark Session behind the scenes, ensuring users can switch between Spark SQL, Python, and other components without explicit session management.

## Spark SQL vs PySpark

So far in the course, we have seen examples of managing relational entities and Delta Lakes using Spark SQL. Now, we will turn our attention to PySpark, the Python API for Spark, but first let's understand the difference between the two flavors of Spark.

> **Spark SQL** is a module in Spark designed for structured data processing. It allows users to execute SQL queries on Spark data, providing a high-level interface for working with structured and semi-structured data.

Spark SQL is ideal for scenarios where you want to leverage the familiarity and expressiveness of SQL for querying and analyzing data. It's particularly well-suited for structured datasets and situations where SQL-like operations are preferable.

> **PySpark** is the Python API for Spark. It enables Python developers to harness the power of Spark for distributed data processing. PySpark provides a programmatic interface for working with Spark, allowing more flexibility in expressing complex data transformations and analytics.

PySpark is versatile and can be employed when you need more control and customization in your data processing tasks. It's suitable for scenarios where Python is the preferred language, or when you need to integrate Spark with Python-based libraries and tools.

As we've seen before, Databricks provides a unified platform where you can seamlessly switch between Spark SQL and PySpark within the same notebook environment. This is because both Spark SQL and PySpark operate on the underlying concept of *DataFrames*. This serves as a bridge, allowing you to transition between the two of them using the DataFrame API.

In the previous lesson, we focused our attention on relational entities in Databricks. While effective, traditional data structures such as databases and tables are largely limited to SQL-based operations. With DataFrames, you can perform data manipulations and transformations using Spark SQL operations, benefiting from SQL-like syntax, then switch to PySpark to apply more programmatic and customized transformations, all on the same DataFrame.

## Spark DataFrames

> A **DataFrame** is a distributed collection of data organized into named columns, similar to a table in a relational database. DataFrames serve as a fundamental abstraction, providing a structured and tabular representation of data.

The key features of DataFrames are:

- **Distributed Nature**: DataFrames are distributed across a cluster of machines, allowing for parallel processing. This distribution enables Spark to handle large-scale datasets by dividing them into smaller partitions and processing them in parallel.

- **Immutable Structure**: DataFrames are immutable, meaning a DataFrame cannot be modified once created. Instead, each transformation on a DataFrame produces a new DataFrame with the desired changes. This immutability ensures data consistency and facilitates the construction of a lineage of transformations.

- **Lazy Evaluation**: Spark employs *lazy evaluation*, meaning that transformations on DataFrames are not executed immediately. Instead, Spark builds a logical execution plan, and the actual computation is deferred until an action is triggered. This optimization enhances performance by minimizing unnecessary computations.

- **Schema**: DataFrames have a well-defined schema that specifies the names and types of columns. The schema provides structure to the data, allowing Spark to optimize query execution and enabling users to express complex transformations in a declarative manner.

DataFrames in Spark can be created from various data sources, including structured data formats like `CSV` and `JSON`, or external databases. In the data transformation process, reading and loading data plays a pivotal role. In the next section, we will learn how to read and load data from various file types and examine practical examples.


## Reading Data into Databricks

Let's begin by exploring how to read `JSON` data. We'll cover different scenarios, including reading a single `JSON` file, a directory of `JSON` files, and multiple `JSON` files using a wildcard.

### Reading a Single `JSON` File

When dealing with a single `JSON` file, PySpark DataFrames provide a straightforward method for reading and interacting with the data. The syntax is as follows:

```python
json_data_single_file = spark.read.json("path/to/single/json/file")
```

Let's see how we would use this in Databricks. Begin by downloading [this example `JSON` file](https://cdn.theaicore.com/content/lessons/14cf5386-4ddc-44c3-a575-bf581a740fec/single_json_file.json). Next, import it into Databricks using the **Data** explorer. In the **Data** explorer tab, use the **Create Table** button. Utilize the drag-and-drop functionality to upload the previously downloaded file. Note and copy the path at which the file will be uploaded.

<p align="center">
    <img src="images/CreateTable.png" width="700" height="550"/>
</p>

Click the **Create Table with UI** button, select a cluster to preview the table, and click **Preview Table**. This should display the table preview and the inferred file type. Finally, click **Create Table** to finish.

Now, navigate to a Databricks Notebook and run the following PySpark code to read in the uploaded `JSON` file:

`json_data_from_dbfs1 = spark.read.json('/FileStore/tables/single_json_file_1.json')`

Replace the file path with the path you copied when uploading your file.

### Checking DataFrame Contents

After creating the DataFrame, you might want to inspect its contents. You can use the `show()` method to display the first few rows of the DataFrame: `json_data_from_dbfs1.show()`. 

<p align="center">
    <img src="images/ShowCommand.png" width="800" height="200"/>
</p>

This command prints a tabular representation of the DataFrame, providing an overview of the data's structure. You can adjust the number of rows displayed by specifying the desired value within the `show()` method (e.g., `show(10)` for the first 10 rows).

Alternatively, for a more interactive approach, you can use the `display()` method:

`display(json_data_from_dbfs1)`

<p align="center">
    <img src="images/DisplayTable.png" width="800" height="250"/>
</p>

The `display()` method provides a feature-rich interface for exploring and interacting with the DataFrame, including filtering, sorting, and visualizations. It's a powerful tool for a comprehensive examination of your data.

### Reading a Directory of Files

To read a directory of `JSON` files, you can use a similar approach but instead of specifying the file path, specify the path of the directory:

`data = spark.read.json("path/to/files/directory")`

> This command reads all the files in the specified directory into a PySpark DataFrame.

Let's look at an example to illustrate the process. Follow these steps:

- Begin by downloading [this file](https://cdn.theaicore.com/content/lessons/14cf5386-4ddc-44c3-a575-bf581a740fec/file1.json), and [this file](https://cdn.theaicore.com/content/lessons/14cf5386-4ddc-44c3-a575-bf581a740fec/file2.json), representing two different `JSON` files

- Import the files into Databricks using the **Data** explorer. In the **Data** explorer tab, make sure to modify the **DBFS target directory** to create a new folder where you can store your `JSON` files. See the example below for updating the first file:

<p align="center">
    <img src="images/NewDirectory.png" width="700" height="550"/>
</p>

- Upload both files to the same directory

- Run the following command to read in all the files in the directory: `json_data = spark.read.json('/FileStore/tables/json_files')`

- Visualize the output of this command using the `display()` command

### Reading Multiple Files Using a Wildcard

If a directory contains multiple files of different types and you want to read only the `JSON` files, you can use the following syntax:

`json_data_multiple_files = spark.read.json("path/to/json/files/*.json")`

This command reads all `JSON` files matching the wildcard pattern (`*`) into a PySpark DataFrame. The wildcard (`*`) here represents any number of possible characters, so `*.json` will match any file name ending in `.json`.
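
To see which names a pattern such as `*.json` selects, the sketch below applies Python's standard `fnmatch` module to a made-up list of file names. This only approximates what Spark does: Spark resolves wildcards itself using Hadoop-style glob rules, but for a simple `*` within a file name the outcome is the same.

```python
from fnmatch import fnmatch

# Hypothetical directory listing containing mixed file types.
file_names = ["file1.json", "file2.json", "notes.txt", "users.csv", "archive.json.bak"]

# Keep only the names that match the wildcard pattern.
matched = [name for name in file_names if fnmatch(name, "*.json")]
print(matched)  # ['file1.json', 'file2.json']
```

Note that `archive.json.bak` is not selected, because the pattern requires the name to end in `.json`.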

## Reading `CSV` Data

When dealing with tabular data, especially in scenarios where structured data is stored in `CSV` format, Spark provides efficient methods for reading and loading this data into Spark DataFrames.

### Reading a Single `CSV` File

To read a single `CSV` file into a DataFrame, you can use the following syntax:

```python
# Read a single CSV file into a PySpark DataFrame
csv_data_single_file = spark.read.csv("path/to/single/csv/file.csv", 
                                     header=True, 
                                     inferSchema=True, 
                                     sep=";")
```

The command above has the following key parameters:

- `header=True`: Indicates that the first row contains column headers
- `inferSchema=True`: Attempts to infer the schema of the data
- `sep=";"`: Specifies the column delimiter. The default value is a comma (`,`), but in this example, it's set to a semicolon (`;`). Other possible values for `sep` include, but are not limited to: `\t` for tab, `" "` for space, etc.

Let's look at an example to illustrate the process. Follow these steps:

- Begin by downloading [this CSV file](https://cdn.theaicore.com/content/lessons/14cf5386-4ddc-44c3-a575-bf581a740fec/username.csv)

- Import the file into Databricks using the **Data** explorer. In the preview table tab, make sure to enable the **First row is header** and **Infer schema** options before creating the table.

<p align="center">
    <img src="images/CreateCSVTable.png" width="800" height="350"/>
</p>

- Run the following command in a Databricks Notebook to read in the `CSV` file: `csv_data_single_file = spark.read.csv("dbfs:/FileStore/tables/username.csv", header=True, inferSchema=True, sep=";")`

- Visualize the output using the `display()` command

### Reading `CSV` Files from a Directory

If you have multiple `CSV` files in a directory, you can read them all into a DataFrame using a similar approach to the one used for reading `JSON` files from a directory:

`csv_data_directory = spark.read.csv("path/to/csv/files/directory", header=True, inferSchema=True)`

### Handling `CSV` Files with Custom Schemas

In some cases, you might want to specify a custom schema for your `CSV` data. You can achieve this by defining a schema and using it during the read operation:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Define a custom schema
custom_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    # Add more fields as needed
])

# Read CSV data with the custom schema
csv_data_custom_schema = spark.read.csv("path/to/csv/file.csv", header=True, schema=custom_schema)

```

This command reads a `CSV` file into a PySpark DataFrame, applying the specified custom schema. We will look in more detail at defining custom schemas in a future lesson.


## Reading Data from Cloud Storage

Data stored in cloud storage systems such as Amazon S3, Google Cloud Storage (GCS), or Azure Storage can be easily accessed with PySpark. Here are two general approaches:

Legacy approach:

- **Store Credentials Safely**: Avoid hardcoding credentials directly in code. Leverage secure methods such as environment variables or secure key storage services.
- **Mount Storage to Databricks**: Mount your cloud storage to Databricks, providing a secure way to access data. This involves configuring storage-specific credentials within Databricks.

While mounting storage locations still works on Databricks, it is **no longer recommended**.

Modern approach:

- **Configure Access using Cloud Permissions**: Each cloud provider has its own method of configuring Databricks to allow direct access to the cloud storage. Once this is configured, the data can be read directly into Databricks without the need to upload credentials. This reduces the risk that the credentials could be leaked or maliciously used.

### Reading Data from S3

Once Databricks has been configured for access, you will be able to read data into Databricks directly from an S3 bucket. This task would be the responsibility of the workspace administrator and would require the use of a Databricks AWS account. The steps to achieve this can be found in the Databricks documentation [here](https://docs.databricks.com/en/connect/storage/tutorial-s3-instance-profile.html). For any projects requiring this, this will already be configured for you.  

Once configured you can use the following syntax to read your files directly:

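```python
bucket_name = "<YOUR S3 BUCKET NAME>"

# Read a text file from the bucket (one row per line, in a single `value` column).
# Note: spark.read.load() without an explicit format uses the default data source,
# which is not plain text, so the text reader is used here instead.
df = spark.read.text(f"s3a://{bucket_name}/path/to/your/s3/files/somefile.txt")
display(df)

# List the contents of the S3 directory
dbutils.fs.ls(f"s3a://{bucket_name}/")
```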

This method also supports reading multiple files, just like the `JSON` and `CSV` examples.

## Spark SQL for Reading Data

In addition to using PySpark DataFrames, you can leverage Spark SQL to query and interact with your data using SQL commands. Let's explore how you can use Spark SQL to read `JSON` and `CSV` files.

You can use Spark SQL to query `JSON` files directly. The following example illustrates how to select all columns from a `JSON` file:

```sql
SELECT * FROM json.`path/to/json/file`
```

Similarly, Spark SQL enables you to query `CSV` files using SQL commands. For instance, you can use the following SQL command to select all columns from a `CSV` file:

```sql
SELECT * FROM csv.`path/to/csv/file`
```

## Data Cleaning and Transformation

In the data analysis process, ensuring that the data is clean and appropriately formatted is crucial. Raw data often comes with inconsistencies, missing values, or formats that are not helpful for analysis. Cleaning and transforming data involves preparing it for further analysis, enhancing its quality, and making it suitable for downstream processes. In this section we will look at different techniques for cleaning and transforming data using PySpark.

Let's start by considering the following example DataFrame representing information about individuals:

```python
# Create an example DataFrame
data = [
    ["John", "Doe", 30, "Male", "$500.00", "2022-01-01 08:30:00", ["Street1", "New York", "12345", "USA"], "john.doe@example.com", "Married"],
    ["Alice", "Smith", 25, "Female", "$700.50", "2022-01-02 15:45:30", ["Street2", "San Francisco", "54321", "USA"], "alice.smith@example.com", "Single"],
    ["Bob", "Jones", 30, "Male", "$650.00", "2022-01-03 12:15:00", ["Street3", "Los Angeles", "67890", "USA"], "Unknown", "Single"],
    ["Eve", "White", 25, "Female", "$600.75", "2022-01-04 10:00:45", ["Street4", "New Haven", "00000", "USA"], "eve.white@example.com", "User Info Error"],
    ["Chris", "Johnson", 25, "Male", "$550.00", "2022-01-05 09:30:15", ["Street5", "Chicago", "45678", "USA"], "chris.j@example.com", "Single"],
    ["Emily", "Davis", 30, "Female", "$750.25", "2022-01-06 11:20:30", ["Street6", "Seattle", "98765", "USA"], "emily.d@example.com", "Married"],
    ["Michael", "Brown", 25, "Male", "$620.50", "2022-01-07 14:45:00", ["Street7", "Boston", "13579", "USA"], "michael.b@example.com", "Single"],
    ["Samantha", "Clark", 30, "Female", "$720.75", "2022-01-08 16:00:45", ["Street8", "Denver", "24680", "USA"], "samantha.c@example.com", "User Info Error"]
]

columns = ["First Name", "Last Name", "Age", "Gender", "Salary", "Timestamp", "Location", "Email", "Status"]

# Creating the updated DataFrame
example_df = spark.createDataFrame(data, columns)

# Show the original DataFrame
print("Original DataFrame:")
example_df.show()
```

This DataFrame will serve as our example throughout this section.

### 1. Replacing Missing Values

Handling missing values is a common occurrence in the data cleaning process. The `replace` method allows us to replace specific values with designated replacements. In the example DataFrame, suppose we want to replace all occurrences of `'User Info Error'` with `None` in the `Status` column:

```python
cleaned_df = example_df.replace({'User Info Error': None}, subset=['Status'])
```

In this example:

  - `replace` initiates the replacement operation.
  - `{'User Info Error': None}` defines the replacement rule, indicating that occurrences of `'User Info Error'` should be replaced with `None` 
  - `subset=['Status']` specifies the column where the replacement should occur

### 2. Updating Data Points

The `replace` method is not limited to handling null values; it can also be employed to update existing values. For instance, suppose we want to update all occurrences of `Unknown` in the `Email` column to `Pending`:

```python
cleaned_df = cleaned_df.replace({'Unknown': 'Pending'}, subset=['Email'])
```
In this example, `{'Unknown': 'Pending'}` defines the replacement rule, indicating that occurrences of `Unknown` should be replaced with `Pending` in the `Email` column.

### 3. Using `regexp_replace` for Column Transformations

Column transformations are essential for manipulating text-based columns. The `regexp_replace` function enables us to apply regular expression patterns to modify or clean column values.

The default syntax is as follows:

```python
df = df.withColumn("ColumnName", regexp_replace("ColumnName", "pattern", "replacement"))
```
Let's break down the components:

- `.withColumn("ColumnName", ...)` : This method is used to add or replace a column in the DataFrame. In this case, it specifies that the operation is targeting a column called `ColumnName`.

- `regexp_replace("ColumnName", "pattern", "replacement")`: This is the PySpark `regexp_replace` function. It takes three arguments:
  - `ColumnName`: The name of the column to which the replacement will be applied
  - `pattern`: The regular expression pattern to search for in the values of the specified column
  - `replacement`: The string that will replace the matched pattern in the column values

Let's look at an example in our `cleaned_df` DataFrame:

```python
from pyspark.sql.functions import regexp_replace
cleaned_df = cleaned_df.withColumn("Salary", regexp_replace("Salary", "\\$", ""))
```
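In the example above, the `regexp_replace` function removes dollar signs (`$`) from the `Salary` column in the DataFrame. In a regular expression, `$` normally marks the end of the string, so it must be escaped with a backslash to match a literal dollar sign. The pattern is written as `\\$` in Python so that the backslash itself reaches the regex engine, and each match is replaced with an empty string `""`.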

### 4. Casting Columns to Different Data Types

Casting columns to different data types is a common operation, especially when the inferred schema needs adjustment. We can cast columns to ensure they are of the correct data type. The default syntax is as follows:

```python
df = df.withColumn("ColumnName", df["ColumnName"].cast("<desired_type>"))
```
The `cast` function converts the specified column to the desired data type. The argument, `<desired_type>`, is the target data type to which you want to cast the column. Common target types include:

- **Numeric Types**: Casting to numeric types, such as `integer`, `double`, `float`, etc., is common when dealing with numerical data

- **String Type**: You can cast a column to the `string` type if you want to treat it as text

- **Boolean Type**: Casting to `boolean` is suitable for columns representing true/false or binary data

- **Timestamp Type**: For columns containing timestamp or date data, casting to `timestamp` is useful

> Before casting different columns to new data types, it is useful to see exactly which data type is assigned to each column. You can do so using the `printSchema()` command.

<p align="center">
    <img src="images/printSchema.png" width="700" height="250"/>
</p>

In the example above, we can observe that the `Salary` column in our `cleaned_df` DataFrame is of type `string`, but as we removed the `$` sign from each value in the column in the previous step, we can now change the data type for this column to a numeric one:

```python
cleaned_df = cleaned_df.withColumn("Salary", cleaned_df["Salary"].cast("float"))
```
If we now rerun the `printSchema()` command we should see the `Salary` column is now of type `float`:

<p align="center">
    <img src="images/NewDataType.png" width="700" height="300"/>
</p>


### 5. Transforming Columns to Timestamp Type

Transforming columns to `timestamp` type is crucial for handling temporal data. The `to_timestamp` function is used to convert a `string` representation of a timestamp into the `timestamp` type. 

While the `cast` function allows casting to different data types, including timestamps, if you need to handle timestamp or date-related data, `to_timestamp` is the appropriate choice. This is because `to_timestamp` is specific to timestamp-related transformations and ensures the correct interpretation of time-related data.

The default syntax is:

```python
df = df.withColumn("NewTimestampColumn", to_timestamp("ExistingTimestampColumn"))
```

Let's break down the components of this syntax:

- `"NewTimestampColumn"`: Specifies the name of the new column that will store the converted timestamps
- `to_timestamp("ExistingTimestampColumn")`: This is the PySpark `to_timestamp` function. In its simplest form it takes one argument:
  - `"ExistingTimestampColumn"`: The name of the existing column containing string representations of timestamps that you want to convert

If we take a look at the output of the `printSchema()` command we ran earlier for our `cleaned_df` DataFrame, we can see that the `Timestamp` column is of type `string`. We can change this using the following command:

```python
from pyspark.sql.functions import to_timestamp
cleaned_df = cleaned_df.withColumn("Timestamp", to_timestamp("Timestamp"))
```
In this example, the `Timestamp` column is transformed to a `timestamp` type using the `to_timestamp` function.

<p align="center">
    <img src="images/ToTimestamp.png" width="700" height="300"/>
</p>


### 6. Creating New Columns Using Array Functions

Generating new insights and integrating data often requires the creation of new columns. In PySpark, we leverage *array functions* and concatenation to derive meaningful information.

> **Array functions** operate on arrays, which are ordered collections of elements. These functions become invaluable when working with columns that contain arrays of values. 

#### `array`: Creating a new Array Column

The `array` function creates a new array column. In the example below, we create a column named `new_array_column` by combining values from `column1` and `column2`:

```python
df = df.withColumn("new_array_column", array("column1", "column2"))
```

The output of such a command would look like this:

```markdown
| column1 | column2 | new_array_column |
|---------|---------|------------------|
|   val1  |   val2  | [val1, val2]     |
|   val3  |   val4  | [val3, val4]     |
|   val5  |   val6  | [val5, val6]     |
```

We can use the `array` function to create a new column called `Full Name Array` in our `cleaned_df` DataFrame from the `First Name` and `Last Name` columns:

```python
from pyspark.sql.functions import array
cleaned_df = cleaned_df.withColumn("Full Name Array", array("First Name", "Last Name"))
```

#### `array_contains`: Checking Array for a Value

The `array_contains` function checks if an array contains a specific value. In the example below, we check if the `array_column` contains the value `val3`:

```python
df = df.withColumn("contains_value", array_contains("array_column", "val3"))
```

The output of this command would look like this:

```markdown
| array_column    | contains_value  |
|-----------------|-----------------|
| [val1, val2]    | False           |
| [val3, val4]    | True            |
| [val5, val6]    | False           |
```

#### `size`: Getting Size of an Array

The `size` function returns the size (length) of an array. In the example below, we create a column named `array_size` to store the size of the `array_column`:

```python
df = df.withColumn("array_size", size("array_column"))
```
The output of this command would look like this:

```markdown
| array_column    | array_size |
|-----------------|------------|
| [val1, val2]    | 2          |
| [val3, val4]    | 2          |
| [val5, val6]    | 2          |
```

#### `concat`: Concatenating Multiple Arrays/Values

The `concat` function can concatenate multiple arrays into a single array. In the example below, we create a column named `concatenated_arrays` by combining values from `array1` and `array2`:

```python
df = df.withColumn("concatenated_arrays", concat("array1", "array2"))
```
The output of this command would look like this:

```markdown
| array1    | array2     | concatenated_arrays |
|-----------|------------|---------------------|
| [v1, v2]  | [v3, v4]   | [v1, v2, v3, v4]    |
| [v5, v6]  | [v7, v8]   | [v5, v6, v7, v8]    |
| [v9, v10] | [v11, v12] | [v9, v10, v11, v12] |
```

> The `concat` function in PySpark is not limited to concatenating arrays only; it can be employed to concatenate various elements, including strings and literals. Here's a breakdown of its utility:

```python
# Example: Concatenating strings from two columns
df = df.withColumn("concatenated_strings", concat("column1", "column2"))

# Example: Concatenating strings and literals
df = df.withColumn("combined_values", concat("column1", lit(" - "), "column2"))
```
The `lit` function allows you to include a constant or literal value in a DataFrame operation, making it useful for combining columns with fixed strings or values.

Let's explore the examples and their corresponding output:

- Example 1: Concatenating Strings from Two Columns:

```markdown
| column1 | column2 | concatenated_strings |
|---------|---------|----------------------|
|   val1  |   val2  |       val1val2       |
|   val3  |   val4  |       val3val4       |
|   val5  |   val6  |       val5val6       |
```

- Example 2: Concatenating Strings and Literals

```markdown
| column1 | column2 | combined_values |
|---------|---------|-----------------|
|   val1  |   val2  |   val1 - val2   |
|   val3  |   val4  |   val3 - val4   |
|   val5  |   val6  |   val5 - val6   |
```

In the first example, we concatenate the strings from two columns (`column1` and `column2`). In the second example, we combine strings with a literal hyphen (`" - "`) between them. This adds a hyphen between the values of `column1` and `column2` in the newly created `combined_values` column.

Earlier we created a new column `Full Name Array` in our `cleaned_df` DataFrame. As an alternative, we can create a new column called `Full Name` that uses the `concat` function to join the values of the `First Name` and `Last Name` columns. Instead of being of type `array` as in the previous example, this new column will simply be a `string`.

```python
from pyspark.sql.functions import concat, lit
cleaned_df = cleaned_df.withColumn("Full Name", concat("First Name", lit(" "), "Last Name"))
```

#### `explode`: Transforming Array into Rows

The `explode` function in PySpark is specifically designed for transforming columns containing arrays into rows. It duplicates the other columns for each element in the array, creating a new row for each array element.

Let's consider the following DataFrame:

```markdown
| column1 |       array_column        |
|---------|---------------------------|
|   val1  |   ["item1", "item2"]      |
|   val2  |   ["item3", "item4"]      |
|   val3  |   ["item5", "item6"]      |
```

If we now apply the following transformation to this DataFrame:

```python
df = df.select("column1", explode("array_column").alias("exploded_values"))
```

After applying the `explode` function:

```markdown
| column1 |  exploded_values  |
|---------|-------------------|
|   val1  |      "item1"      |
|   val1  |      "item2"      |
|   val2  |      "item3"      |
|   val2  |      "item4"      |
|   val3  |      "item5"      |
|   val3  |      "item6"      |
```

In this example, the `explode` function creates new rows for each element in the `array_column`, duplicating the values in the `column1` for each exploded row.

#### 7. Unfolding Information from Arrays

When working with arrays in PySpark, you may encounter scenarios where information is stored in array columns, and you want to unfold or extract that information into separate columns. The `withColumn` method, combined with array indexing, enables you to achieve this transformation. The general syntax is as follows:

```python
# Unfolding array information into separate columns
df = df.withColumn("NewColumn1", col("ArrayColumn")[index1]) \
       .withColumn("NewColumn2", col("ArrayColumn")[index2]) \
       .withColumn("NewColumn3", col("ArrayColumn")[index3])
# Repeat for each desired column
```

In the syntax above:

- `"NewColumn1"`, `"NewColumn2"`, ...: The names of the new columns you want to create
- `"ArrayColumn"`: The name of the column containing the array
- `index1`, `index2`, ...: The indices indicating which elements from the array should populate the new columns

Let's take a look now at our example `cleaned_df`, which has an array column named `"Location"` containing street, city, postcode, and country information:

```python
from pyspark.sql.functions import col

# Unfolding the Location array into separate columns
cleaned_df = cleaned_df.withColumn("Street", col("Location")[0]) \
            .withColumn("City", col("Location")[1]) \
            .withColumn("Postcode", col("Location")[2]) \
            .withColumn("Country", col("Location")[3])
```
After applying this transformation, the DataFrame will have new columns: `"Street"`, `"City"`, `"Postcode"`, and `"Country"`.

#### 8. Renaming and Dropping Columns

Renaming and dropping columns are common operations to enhance DataFrame clarity and exclude unnecessary information. In PySpark, you can achieve this using the `withColumnRenamed` method for renaming and the `drop` method for dropping columns.

The general syntax for renaming columns is:

```python
# Renaming a single column
df = df.withColumnRenamed("OldColumnName", "NewColumnName")

# Renaming multiple columns
df = df.withColumnRenamed("OldColumnName1", "NewColumnName1") \
       .withColumnRenamed("OldColumnName2", "NewColumnName2")
# Repeat for each column pair

```
The general syntax for dropping columns is:

```python
# Dropping a single column
df = df.drop("ColumnName")

# Dropping multiple columns
df = df.drop("ColumnName1", "ColumnName2", ...)
# List all columns you want to drop
```

Let's drop some redundant columns in our example `cleaned_df` DataFrame:

```python
cleaned_df = cleaned_df.drop("Location", "Full Name Array", "First Name", "Last Name")
```

#### 9. Reordering Columns

Reordering columns can enhance DataFrame structure, providing better organization and simplifying data exploration. The syntax is as follows:

```python
df = df.select("Column1", "Column2", ...)
```
Let's apply this to our example `cleaned_df` DataFrame. You can begin by observing the current column ordering using the `printSchema()` function, then decide on the desired ordering. For example:

```python
cleaned_df = cleaned_df.select("Full Name", "Age", "Gender", "Email", "Salary", "Street", "City", "Postcode", "Country", "Status", "Timestamp")
```

## Advanced Data Manipulations

In this section, we will look at more advanced data manipulation techniques for combining, grouping, aggregating, sorting, and utilizing window functions to extract deeper insights from your data.

Let's begin by creating a second DataFrame that we will use in combination with our previously cleaned `cleaned_df` DataFrame:

```python
# Create a second example DataFrame
additional_data = [
    ["john.doe@example.com", "Engineer"],
    ["alice.smith@example.com", "Designer"],
    ["bob.jones@example.com", "Analyst"],
    ["eve.white@example.com", "Manager"]
]
additional_columns = ["Email", "Occupation"]

additional_data_df = spark.createDataFrame(additional_data, additional_columns)

# Show the additional DataFrame
print("Additional DataFrame:")
additional_data_df.show()
```

### 1. Using Joins to Combine DataFrames

Joining DataFrames is a common operation to combine data from different sources. The `join` method is used, specifying the common column and the type of join (e.g., inner, outer, left, right).

The general syntax is:

```python
# Joining two DataFrames based on a common column
joined_df = df1.join(df2, df1["CommonColumn"] == df2["CommonColumn"], how="join_type")
```
The `join` method combines two DataFrames, `df1` and `df2`, based on a common column, in this case the column named `"CommonColumn"`. The condition `df1["CommonColumn"] == df2["CommonColumn"]` ensures that rows are matched where the values in the common column are equal in the two DataFrames.

The `how` parameter determines the type of join to perform, and it takes values such as `"inner"`, `"outer"`, `"left"`, and `"right"`. These join types correspond to standard SQL `JOIN` operations.

Let's join our two DataFrames, `cleaned_df` and `additional_data_df` based on the `Email` column:

```python
# Joining the two DataFrames
combined_df = cleaned_df.join(additional_data_df, additional_data_df["Email"] == cleaned_df["Email"], how="inner")
```

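In this example, we've joined `cleaned_df` and `additional_data_df` on the `Email` column, resulting in a new DataFrame `combined_df`. Because the join condition is an expression, `combined_df` keeps the `Email` column from both DataFrames. Because the join is an inner join, only rows whose `Email` value appears in both DataFrames are kept.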

### 2. Grouping and Aggregating Data

Grouping and aggregating data allows us to summarize information based on certain columns. The `groupBy` method is used to define the grouping column, and then *aggregation functions* are applied to obtain summary statistics. Essentially, the aggregation functions will operate on groups of rows defined by the `groupBy` method.

> **Aggregation functions** are mathematical operations that perform calculations on a set of values and return a single value as a result. In the context of data manipulation, these functions summarize or transform data within a group. 

There are various types of aggregation functions, categorized based on the type of summary they provide:

- **Counting Functions**:
  - `count`: Counts the number of elements in a group <br><br>

- **Statistical Functions**:
  - `sum`: Calculates the sum of values in a group
  - `avg`: Computes the average of values in a group
  - `min`: Identifies the minimum value in a group
  - `max`: Identifies the maximum value in a group

The general syntax here is:

```python
# Grouping by a column and applying an aggregation function
grouped_df = df.groupBy("GroupingColumn").agg({"AggregatedColumn": "aggregation_function"})
```

In this syntax:

- `"GroupingColumn"` is the column by which the DataFrame is grouped
- `"AggregatedColumn"` is the column for which the aggregation function is applied
- `"aggregation_function"` is the specific aggregation function to be applied

Let's group `combined_df` by the `Gender` column and calculate the average salary for each gender: 

```python
# Grouping and aggregating by gender
grouped_df = combined_df.groupBy("Gender").agg({"Salary": "avg"})
```
Alternatively, this can also be written in the following way:

```python
from pyspark.sql.functions import avg
grouped_df = combined_df.groupBy("Gender").agg(avg("Salary"))
```

Notice the output of this command:

<p align="center">
    <img src="images/AggregationOutput.png" width="700" height="300"/>
</p>


### 3. Using Aliases

Applying aliases to columns enhances clarity, especially when dealing with aggregated or calculated columns. Notice that in the aggregation example above, the aggregate column receives the auto-generated name `avg(Salary)`. In cases like this, we might want to replace it with a better-formatted or more descriptive name.

The general syntax for aliases is:

```python
aliased_df = df.select(col("OriginalColumn").alias("NewColumnName"))
```
Let's calculate the average salary again and apply an alias to the resulting column.

```python
from pyspark.sql.functions import avg
# Grouping by Gender and calculating average salary with alias
grouped_df = combined_df.groupBy("Gender").agg(avg("Salary").alias("AverageSalary"))
```

Notice the difference in output to the previous command:

<p align="center">
    <img src="images/Alias.png" width="700" height="250"/>
</p>


### 4. Sorting Data

Sorting data is crucial for better visualization and analysis. The `orderBy` method is used to sort the DataFrame based on one or more columns. The general syntax is:

```python
# Sorting data based on one or more columns
sorted_df = df.orderBy("Column1", ascending=False)
```
The `ascending` parameter in the `orderBy` method determines the sorting order for the specified columns. It is a Boolean parameter, where `True` represents ascending order (default) and `False` represents descending order.

Let's sort `combined_df` by the `Salary` column in descending order:

```python
sorted_df = combined_df.orderBy("Salary", ascending=False)
```


### 5. Utilizing Window Functions

> *Window functions* are used for advanced operations that involve calculations over a specified range or "window" of rows. Unlike standard aggregation functions, which collapse multiple rows into a single result, window functions perform calculations across a set of related rows defined by a *window specification*.

The **window specification** defines the criteria for partitioning and ordering the data before applying the window function. It is created using the `Window` class, specifying partitioning and ordering columns. The window specification is essential for defining the context in which the window function operates.

The general syntax of a window function is:

```python
# Creating a Window specification
window_spec = Window.partitionBy("PartitionColumn").orderBy("OrderColumn")

# Applying window functions
result_df = df.withColumn("NewColumn", function.over(window_spec))

```

In the above syntax:

- **Creating a Window Specification**:

  - `Window.partitionBy("PartitionColumn")`: This part of the syntax defines the partitioning column(s) based on which the data will be divided into distinct partitions. The window function will operate independently within each partition.
  - `orderBy("OrderColumn")`: This part specifies the ordering of rows within each partition. It defines the sequence in which the window function processes rows. <br><br>

- **Applying Window Functions**:

  - `df.withColumn("NewColumn", function.over(window_spec))`: Here, the `withColumn` method is used to add a new column, and the `function.over(window_spec)` applies the window function within the specified window specification. The result is stored in the `NewColumn`.

Let's consider an example where we partition the rows by gender, order them by age within each partition, and calculate the average salary over the resulting window:

```python
from pyspark.sql.window import Window
from pyspark.sql.functions import avg

# Creating a Window specification
window_spec = Window.partitionBy("Gender").orderBy("Age")

# Applying window function to calculate average salary within each gender partition
result_df = cleaned_df.withColumn("AvgSalaryByGender", avg("Salary").over(window_spec))

# Selecting only the relevant columns
result_df = result_df.select("Gender", "Age", "AvgSalaryByGender").distinct()

display(result_df)

```

Let's break down the example above:

- **Creating a Window Specification**:

  - `Window.partitionBy("Gender")`: This part of the specification defines the partitioning of the data. It means that the window function will be applied separately for each distinct value in the `Gender` column.
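  - `.orderBy("Age")`: This specifies the ordering of rows within each partition based on the `Age` column. Because an ordering is specified, the default window frame runs from the first row of the partition up to the current row (including any rows with the same `Age`), so the result is a running average rather than a single value per gender.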

- **Applying Window Function to Calculate Average Salary**:

  - `avg("Salary").over(window_spec)`: This applies the window function to calculate the average salary over the window specified by `window_spec`. It creates a new column called `AvgSalaryByGender` in the DataFrame.

- **Selecting Only Relevant Columns**:

  - Here we select only the columns `Gender`, `Age`, and `AvgSalaryByGender` from the DataFrame. Window functions keep every input row, so other columns could be retained as well. We restrict the selection here to keep the output focused on the calculation. The `distinct()` method then removes duplicate rows, so each combination of `Gender`, `Age`, and `AvgSalaryByGender` appears once.

The output should look like this:

<p align="center">
    <img src="images/WindowFunction3.png" width="700" height="350"/>
</p>

In the output DataFrame (`result_df`), each row holds a `Gender`, an `Age`, and the corresponding `AvgSalaryByGender` value, which is the average salary of all rows of that gender whose `Age` is less than or equal to the row's `Age`.

## Writing Data To Tables

Persisting DataFrames as tables is a critical step in data processing. A DataFrame's `write` attribute returns a `DataFrameWriter`, which lets you save the DataFrame to a table and specify the storage format, compression options, and additional properties.

```python
# Writing a DataFrame to a table in Parquet format
df.write.format("parquet").mode("overwrite").saveAsTable("my_table")
```

Let's break the syntax down:

- `format("parquet")`:

  - Specifies the storage format for the table. In this example, it's set to `parquet`, a columnar storage format that is efficient for analytics and provides good compression.
  - Other notable formats include `orc` (Optimized Row Columnar) and `csv` (Comma-Separated Values), among others. Each format has its advantages and use cases, and you can choose the one that best fits your requirements. <br><br>

- `mode("overwrite")`:

  - Defines the behavior when saving the DataFrame to an existing table. In this case, `overwrite` means that if the table already exists, its contents will be replaced with the new DataFrame.
  - Other modes include `append` (add data to the existing table), `ignore` (do nothing if the table already exists), and `error` (throw an error if the table already exists; this is the default). <br><br>

- `saveAsTable("my_table")`:

  - Indicates the name of the table to which the DataFrame is saved

## Key Takeaways

- Apache Spark is a distributed computing system that provides high-level APIs for distributed data processing. Apache Spark's architecture includes a **Cluster Manager**, **Spark Application**, **Spark Executors**, and **Spark Driver**:
  - Databricks abstracts users from direct interaction with the **Cluster Manager**, managing resources and tasks
  - **Spark Executors** process data close to where it's stored, enhancing performance
  - The **Spark Driver** oversees the overall execution of a Spark job and communicates with the **Cluster Manager** <br><br>

- Spark SQL is designed for structured data processing and allows SQL queries on Spark data. PySpark is the Python API for Spark, offering a programmatic interface for distributed data processing. 

- A **DataFrame** is a distributed collection of data organized into named columns, akin to a relational database table. DataFrames are distributed, immutable, and follow lazy evaluation in Spark.

- Spark DataFrames support various data sources and can be created from existing RDDs, external databases, or various file formats

- PySpark allows reading data from various sources, including `JSON` and `CSV` files. The `spark.read` module provides methods like `json()`, `csv()`, etc., to load data into DataFrames.

- Data transformations are fundamental operations to clean, preprocess, and reshape data in PySpark:

  - Techniques include replacing missing values, using `regexp_replace` for pattern-based string replacements, and casting data types with `cast`
  - `withColumn`, `withColumnRenamed`, `drop`, and `select` are essential for creating, renaming, dropping, and reordering columns
  - Array functions, such as `array`, `array_contains`, `size`, and `concat`, facilitate working with columns that contain arrays of values. They are useful for creating, checking, and manipulating arrays, including concatenating multiple arrays or values. <br><br>

- Joins are employed to combine data from different DataFrames based on a common column

- Grouping and aggregating data involve the `groupBy` method and aggregation functions like `avg`, providing summarized insights

- Window functions, defined by a `Window` specification, allow advanced operations over a specified range or "window" of rows

- Persisting DataFrames as tables is crucial for data processing, and PySpark provides the `write` method for this purpose