# Apache Spark

Apache Spark is an open-source, unified analytics engine for large-scale data processing and machine learning.

**Key Features**:
- **Distributed processing**: Spark processes data across a cluster of machines, making it scalable and fault-tolerant.
- **In-memory processing**: Spark can cache data in memory, reducing disk I/O and speeding up repeated access to the same data.
- **Unified analytics engine**: Spark supports various workloads, including batch processing, interactive queries, real-time analytics, and machine learning.
- **Multi-language support**: Spark has APIs for Python, Java, Scala, R, and SQL.
- **High-level APIs**: Spark provides high-level APIs like DataFrames, Datasets, and Spark SQL for easy data manipulation.

**Components**:
- **Spark Core**: The foundation of Spark, providing basic data structures and APIs.
- **Spark SQL**: A module for structured data processing, with support for SQL queries.
- **Spark Streaming**: A module for real-time data processing.
- **MLlib**: A machine learning library, providing algorithms for classification, regression, clustering, and more.
- **GraphX**: A module for graph processing and analytics.

# PySpark

PySpark is the Python API for Apache Spark, allowing Python developers to write Spark applications using Python. It provides a seamless integration with Spark's engine, enabling data processing, machine learning, and data analytics.

**Key Features**:
- **Pythonic API**: PySpark provides a Python-friendly API, making it easy to write Spark applications.
- **Dynamic Typing**: PySpark supports dynamic typing, allowing for flexible data processing.
- **Integration with Spark**: PySpark is built on top of Spark's engine, providing access to Spark's features and performance.
- **DataFrames**: PySpark exposes the DataFrame API for structured data processing. The typed Dataset API is available only in Scala and Java.
- **Machine Learning**: PySpark provides access to Spark's MLlib, enabling machine learning tasks.

**Documentation**
- [Documentation - Apache Spark](https://spark.apache.org/docs/latest/)
- [Documentation - PySpark](https://spark.apache.org/docs/latest/api/python/index.html)
- [Wikipedia - Apache Spark](https://en.wikipedia.org/wiki/Apache_Spark)

# Imports

In [31]:
!pip install pyspark

# Import the rest of the common libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import pyspark and create a session
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('PySpark_101').getOrCreate()



If Spark prints a Hadoop-related warning when the session starts, it can be ignored for this local setup.

# Dataframes
## Dataframe creation
There are different ways to create a PySpark DataFrame. Here is a list:
- Creating a DataFrame from a List of Tuples
- Creating a DataFrame from a Pandas DataFrame
- Creating a DataFrame from a CSV File
- Creating a DataFrame from a JSON File
- Creating a DataFrame with SQL Queries

In [32]:
# Read the CSV file from GitHub into a pandas DataFrame, then convert it to a Spark DataFrame
df = pd.read_csv('https://github.com/YBIFoundation/BigData/raw/main/HR50k.csv')
df = spark.createDataFrame(df)

<a id="viewing_data"></a>
## Viewing Data

PySpark DataFrames are lazily evaluated. This means that Spark delays the execution of operations until an action is performed. This is a fundamental concept in Spark's design and contributes to its efficiency and scalability.

- Transformations: When you apply transformations to a DataFrame (e.g., select(), filter(), groupBy()), Spark does not immediately execute these operations. Instead, it builds up a logical execution plan or a Directed Acyclic Graph (DAG) of the transformations.
- Actions: The actual computation only occurs when an action is called (e.g., show(), collect(), count()). Actions trigger Spark to execute the transformations and return results.

Here are some useful methods for data exploration:

1. [**Show the First Few Rows**](#show)
    
    `show()`: Displays the first 20 rows of the DataFrame (default). You can specify a different number of rows by passing an argument.

2. [**Print Schema**](#printSchema)

    `printSchema()`: Prints the schema of the DataFrame, which includes column names and types.

3. [**Describe DataFrame**](#describe)
    
    `describe()`: Provides summary statistics (count, mean, stddev, min, max) for numeric and string columns.

4. [**Show Column Names**](#columns)
    
    `columns`: Returns a list of column names.

5. [**Select Specific Columns**](#select)

    `select()`: Allows you to select and view specific columns from the DataFrame.

6. [**Filter Rows**](#filter)

    `filter()` or `where()`: Allows you to filter rows based on a condition.

7. [**Head of DataFrame**](#head)
    
    `head()`: Retrieves the first row as a `Row`, or the first n rows as a list of `Row` objects when called as `head(n)`.

8. [**First Row**](#first)

    `first()`: Retrieves the first row of the DataFrame.

9. [**Count Rows**](#count)

    `count()`: Returns the number of rows in the DataFrame.

10. [**Show DataFrame Summary**](#summary)

    `summary()`: Provides count, mean, stddev, min, approximate quartiles (25%, 50%, 75%), and max for numeric and string columns.

11. [**Sample Data**](#sample)

    `sample()`: Samples a fraction of rows from the DataFrame.

<a id="show"></a>
[Back to Viewing Data](#viewing_data)
### `show()`

In [33]:
# Displays the first 20 rows of the DataFrame (default). You can specify a different number of rows by passing an argument
df.show(2, vertical=True)

-RECORD 0----------------------------------------
 Age                      | 31                   
 Attrition                | No                   
 BusinessTravel           | Non-Travel           
 DailyRate                | 158                  
 Department               | Software             
 DistanceFromHome         | 7                    
 Education                | 3                    
 EducationField           | Medical              
 EmployeeCount            | 1                    
 EmployeeNumber           | 1                    
 EnvironmentSatisfaction  | 3                    
 Gender                   | Male                 
 HourlyRate               | 42                   
 JobInvolvement           | 2                    
 JobLevel                 | 3                    
 JobRole                  | Developer            
 JobSatisfaction          | 1                    
 MaritalStatus            | Married              
 MonthlyIncome            | 42682                


<a id="printSchema"></a>
[Back to Viewing Data](#viewing_data)
### `printSchema()`

In [34]:
# Prints the schema of the DataFrame, which includes column names and types
df.printSchema()

root
 |-- Age: long (nullable = true)
 |-- Attrition: string (nullable = true)
 |-- BusinessTravel: string (nullable = true)
 |-- DailyRate: long (nullable = true)
 |-- Department: string (nullable = true)
 |-- DistanceFromHome: long (nullable = true)
 |-- Education: long (nullable = true)
 |-- EducationField: string (nullable = true)
 |-- EmployeeCount: long (nullable = true)
 |-- EmployeeNumber: long (nullable = true)
 |-- EnvironmentSatisfaction: long (nullable = true)
 |-- Gender: string (nullable = true)
 |-- HourlyRate: long (nullable = true)
 |-- JobInvolvement: long (nullable = true)
 |-- JobLevel: long (nullable = true)
 |-- JobRole: string (nullable = true)
 |-- JobSatisfaction: long (nullable = true)
 |-- MaritalStatus: string (nullable = true)
 |-- MonthlyIncome: long (nullable = true)
 |-- MonthlyRate: long (nullable = true)
 |-- NumCompaniesWorked: long (nullable = true)
 |-- Over18: string (nullable = true)
 |-- OverTime: string (nullable = true)
 |-- PercentSalaryHike: 

<a id="describe"></a>
[Back to Viewing Data](#viewing_data)
### `describe()`

In [35]:
# Provides summary statistics (count, mean, stddev, min, max) for numeric and string columns
df.describe().show(vertical=True)

24/08/10 23:04:11 WARN TaskSetManager: Stage 58 contains a task of very large size (1021 KiB). The maximum recommended task size is 1000 KiB.

-RECORD 0----------------------------------------
 summary                  | count                
 Age                      | 50000                
 Attrition                | 50000                
 BusinessTravel           | 50000                
 DailyRate                | 50000                
 Department               | 50000                
 DistanceFromHome         | 50000                
 Education                | 50000                
 EducationField           | 50000                
 EmployeeCount            | 50000                
 EmployeeNumber           | 50000                
 EnvironmentSatisfaction  | 50000                
 Gender                   | 50000                
 HourlyRate               | 50000                
 JobInvolvement           | 50000                
 JobLevel                 | 50000                
 JobRole                  | 50000                
 JobSatisfaction          | 50000                
 MaritalStatus            | 50000                


                                                                                

<a id="columns"></a>
[Back to Viewing Data](#viewing_data)
### `columns`

In [36]:
# Returns a list of column names
df.columns

['Age',
 'Attrition',
 'BusinessTravel',
 'DailyRate',
 'Department',
 'DistanceFromHome',
 'Education',
 'EducationField',
 'EmployeeCount',
 'EmployeeNumber',
 'EnvironmentSatisfaction',
 'Gender',
 'HourlyRate',
 'JobInvolvement',
 'JobLevel',
 'JobRole',
 'JobSatisfaction',
 'MaritalStatus',
 'MonthlyIncome',
 'MonthlyRate',
 'NumCompaniesWorked',
 'Over18',
 'OverTime',
 'PercentSalaryHike',
 'PerformanceRating',
 'RelationshipSatisfaction',
 'StandardHours',
 'StockOptionLevel',
 'TotalWorkingYears',
 'TrainingTimesLastYear',
 'WorkLifeBalance',
 'YearsAtCompany',
 'YearsInCurrentRole',
 'YearsSinceLastPromotion',
 'YearsWithCurrManager']

<a id="select"></a>
[Back to Viewing Data](#viewing_data)
### `select()`

In [37]:
# Allows you to select and view specific columns from the DataFrame
df.select("Age").show()

+---+
|Age|
+---+
| 31|
| 38|
| 59|
| 52|
| 32|
| 19|
| 42|
| 30|
| 41|
| 45|
| 36|
| 23|
| 24|
| 39|
| 42|
| 44|
| 50|
| 42|
| 49|
| 58|
+---+
only showing top 20 rows



<a id="filter"></a>
[Back to Viewing Data](#viewing_data)
### `filter()` or `where()`

In [38]:
# Allows you to filter rows based on a condition
df.filter(df["Age"] > 50).show(2, vertical=True)

-RECORD 0----------------------------------------
 Age                      | 59                   
 Attrition                | Yes                  
 BusinessTravel           | Non-Travel           
 DailyRate                | 1273                 
 Department               | Sales                
 DistanceFromHome         | 5                    
 Education                | 2                    
 EducationField           | Technical Degree     
 EmployeeCount            | 1                    
 EmployeeNumber           | 3                    
 EnvironmentSatisfaction  | 4                    
 Gender                   | Female               
 HourlyRate               | 96                   
 JobInvolvement           | 1                    
 JobLevel                 | 3                    
 JobRole                  | Manufacturing Dir... 
 JobSatisfaction          | 2                    
 MaritalStatus            | Married              
 MonthlyIncome            | 46149                


In [39]:
# where() is an alias for filter(); compare the numeric Age column with an integer, not a string
df.where(df["Age"] == 50).show(2, vertical=True)

-RECORD 0----------------------------------------
 Age                      | 50                   
 Attrition                | Yes                  
 BusinessTravel           | Travel_Frequently    
 DailyRate                | 460                  
 Department               | Research & Develo... 
 DistanceFromHome         | 10                   
 Education                | 4                    
 EducationField           | Human Resources      
 EmployeeCount            | 1                    
 EmployeeNumber           | 17                   
 EnvironmentSatisfaction  | 4                    
 Gender                   | Male                 
 HourlyRate               | 181                  
 JobInvolvement           | 2                    
 JobLevel                 | 5                    
 JobRole                  | Manager              
 JobSatisfaction          | 3                    
 MaritalStatus            | Divorced             
 MonthlyIncome            | 22090                


<a id="head"></a>
[Back to Viewing Data](#viewing_data)
### `head()`

In [40]:
# Retrieves the first row or the first n rows as a list
df.head()  # First row

Row(Age=31, Attrition='No', BusinessTravel='Non-Travel', DailyRate=158, Department='Software', DistanceFromHome=7, Education=3, EducationField='Medical', EmployeeCount=1, EmployeeNumber=1, EnvironmentSatisfaction=3, Gender='Male', HourlyRate=42, JobInvolvement=2, JobLevel=3, JobRole='Developer', JobSatisfaction=1, MaritalStatus='Married', MonthlyIncome=42682, MonthlyRate=298774, NumCompaniesWorked=2, Over18='Y', OverTime='No', PercentSalaryHike=20, PerformanceRating=4, RelationshipSatisfaction=1, StandardHours=80, StockOptionLevel=2, TotalWorkingYears=15, TrainingTimesLastYear=1, WorkLifeBalance=2, YearsAtCompany=12, YearsInCurrentRole=4, YearsSinceLastPromotion=10, YearsWithCurrManager=11)

In [41]:
df.head(5)  # First 5 rows

[Row(Age=31, Attrition='No', BusinessTravel='Non-Travel', DailyRate=158, Department='Software', DistanceFromHome=7, Education=3, EducationField='Medical', EmployeeCount=1, EmployeeNumber=1, EnvironmentSatisfaction=3, Gender='Male', HourlyRate=42, JobInvolvement=2, JobLevel=3, JobRole='Developer', JobSatisfaction=1, MaritalStatus='Married', MonthlyIncome=42682, MonthlyRate=298774, NumCompaniesWorked=2, Over18='Y', OverTime='No', PercentSalaryHike=20, PerformanceRating=4, RelationshipSatisfaction=1, StandardHours=80, StockOptionLevel=2, TotalWorkingYears=15, TrainingTimesLastYear=1, WorkLifeBalance=2, YearsAtCompany=12, YearsInCurrentRole=4, YearsSinceLastPromotion=10, YearsWithCurrManager=11),
 Row(Age=38, Attrition='No', BusinessTravel='Travel_Rarely', DailyRate=985, Department='Human Resources', DistanceFromHome=33, Education=5, EducationField='Life Sciences', EmployeeCount=1, EmployeeNumber=2, EnvironmentSatisfaction=1, Gender='Female', HourlyRate=66, JobInvolvement=2, JobLevel=4, Jo

<a id="first"></a>
[Back to Viewing Data](#viewing_data)
### `first()`

In [42]:
# Retrieves the first row of the DataFrame
df.first()

Row(Age=31, Attrition='No', BusinessTravel='Non-Travel', DailyRate=158, Department='Software', DistanceFromHome=7, Education=3, EducationField='Medical', EmployeeCount=1, EmployeeNumber=1, EnvironmentSatisfaction=3, Gender='Male', HourlyRate=42, JobInvolvement=2, JobLevel=3, JobRole='Developer', JobSatisfaction=1, MaritalStatus='Married', MonthlyIncome=42682, MonthlyRate=298774, NumCompaniesWorked=2, Over18='Y', OverTime='No', PercentSalaryHike=20, PerformanceRating=4, RelationshipSatisfaction=1, StandardHours=80, StockOptionLevel=2, TotalWorkingYears=15, TrainingTimesLastYear=1, WorkLifeBalance=2, YearsAtCompany=12, YearsInCurrentRole=4, YearsSinceLastPromotion=10, YearsWithCurrManager=11)

<a id="count"></a>
[Back to Viewing Data](#viewing_data)
### `count()`

In [43]:
# Returns the number of rows in the DataFrame
df.count()

24/08/10 23:04:18 WARN TaskSetManager: Stage 67 contains a task of very large size (1021 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

50000

<a id="summary"></a>
[Back to Viewing Data](#viewing_data)
### `summary()`

In [44]:
# Provides count, mean, stddev, min, approximate quartiles (25%, 50%, 75%), and max for numeric and string columns
df.summary().show(vertical=True)

24/08/10 23:04:19 WARN TaskSetManager: Stage 70 contains a task of very large size (1021 KiB). The maximum recommended task size is 1000 KiB.

-RECORD 0----------------------------------------
 summary                  | count                
 Age                      | 50000                
 Attrition                | 50000                
 BusinessTravel           | 50000                
 DailyRate                | 50000                
 Department               | 50000                
 DistanceFromHome         | 50000                
 Education                | 50000                
 EducationField           | 50000                
 EmployeeCount            | 50000                
 EmployeeNumber           | 50000                
 EnvironmentSatisfaction  | 50000                
 Gender                   | 50000                
 HourlyRate               | 50000                
 JobInvolvement           | 50000                
 JobLevel                 | 50000                
 JobRole                  | 50000                
 JobSatisfaction          | 50000                
 MaritalStatus            | 50000                


                                                                                

<a id="sample"></a>
[Back to Viewing Data](#viewing_data)
### `sample()`

In [45]:
# Samples a fraction of rows from the DataFrame
df.sample(fraction=0.1).show(2, vertical=True)  # Sample 10% of the rows

-RECORD 0----------------------------------------
 Age                      | 31                   
 Attrition                | No                   
 BusinessTravel           | Non-Travel           
 DailyRate                | 158                  
 Department               | Software             
 DistanceFromHome         | 7                    
 Education                | 3                    
 EducationField           | Medical              
 EmployeeCount            | 1                    
 EmployeeNumber           | 1                    
 EnvironmentSatisfaction  | 3                    
 Gender                   | Male                 
 HourlyRate               | 42                   
 JobInvolvement           | 2                    
 JobLevel                 | 3                    
 JobRole                  | Developer            
 JobSatisfaction          | 1                    
 MaritalStatus            | Married              
 MonthlyIncome            | 42682                


## Applying Functions
PySpark supports various User-Defined Functions (UDFs) and APIs to allow users to execute Python native functions.

First we import the packages

In [46]:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

Here are some examples

In [47]:
# UDF that converts text to uppercase.

# Define a Python function
def to_uppercase(s):
    return s.upper() if s else None

# Register the UDF
to_uppercase_udf = udf(to_uppercase, StringType())

# Use the UDF in a DataFrame transformation
df_with_uppercase = df.withColumn("EducationField_uppercase", to_uppercase_udf(df["EducationField"]))
df_with_uppercase.select("EducationField").show(5)
df_with_uppercase.select("EducationField_uppercase").show(5)

+----------------+
|  EducationField|
+----------------+
|         Medical|
|   Life Sciences|
|Technical Degree|
|       Marketing|
| Human Resources|
+----------------+
only showing top 5 rows

+------------------------+
|EducationField_uppercase|
+------------------------+
|                 MEDICAL|
|           LIFE SCIENCES|
|        TECHNICAL DEGREE|
|               MARKETING|
|         HUMAN RESOURCES|
+------------------------+
only showing top 5 rows



In [48]:
# UDF that takes two arguments and concatenates them

# Define a Python function
def concatenate_strings(s1, s2):
    return (s1 + "-" + s2) if s1 and s2 else None

# Register the UDF
concatenate_udf = udf(concatenate_strings, StringType())

# Use the UDF in a DataFrame transformation
df_concatenated = df.withColumn("Concatenated", concatenate_udf(df["EducationField"], df["JobRole"]))
df_concatenated.select("Concatenated").show(5, vertical=True)

-RECORD 0----------------------------
 Concatenated | Medical-Developer    
-RECORD 1----------------------------
 Concatenated | Life Sciences-Hea... 
-RECORD 2----------------------------
 Concatenated | Technical Degree-... 
-RECORD 3----------------------------
 Concatenated | Marketing-Human R... 
-RECORD 4----------------------------
 Concatenated | Human Resources-M... 
only showing top 5 rows



## Grouping Data

The `groupBy` method is used to group rows in a DataFrame by one or more columns and perform aggregations on those groups. This is similar to SQL's GROUP BY clause and is essential for summarizing and analyzing data.

In [49]:
from pyspark.sql.functions import sum, avg  # note: this `sum` shadows Python's built-in sum for the rest of the notebook

In [50]:
df_grouped = df.groupBy("Department").agg(avg("MonthlyIncome").alias("avg_monthly_income"))
df_grouped.show()

24/08/10 23:04:29 WARN TaskSetManager: Stage 77 contains a task of very large size (1021 KiB). The maximum recommended task size is 1000 KiB.

+--------------------+------------------+
|          Department|avg_monthly_income|
+--------------------+------------------+
|               Sales| 26118.75346030995|
|Research & Develo...|25796.079456665466|
|            Software|26026.253958733207|
|             Support| 26065.20192655027|
|            Hardware|26028.070265638387|
|     Human Resources| 26058.44547398432|
+--------------------+------------------+



                                                                                

In [51]:
df_grouped = df.groupBy("Department", "JobRole").agg(
    sum("DistanceFromHome").alias("total_distance_from_home"),
    avg("DailyRate").alias("avg_daily_rate")
)
df_grouped.show()

24/08/10 23:04:31 WARN TaskSetManager: Stage 80 contains a task of very large size (1021 KiB). The maximum recommended task size is 1000 KiB.


+--------------------+--------------------+------------------------+-----------------+
|          Department|             JobRole|total_distance_from_home|   avg_daily_rate|
+--------------------+--------------------+------------------------+-----------------+
|               Sales|Sales Representative|                   21468| 781.016548463357|
|               Sales|     Human Resources|                   22035|794.9689655172414|
|     Human Resources|Manufacturing Dir...|                   19691|790.4191542288557|
|            Software|Healthcare Repres...|                   21464|777.5381984036488|
|     Human Resources|   Research Director|                   20837|797.6147132169576|
|            Software|     Sales Executive|                   21285|798.3353221957041|
|               Sales|  Research Scientist|                   21859|823.6493506493506|
|Research & Develo...|   Research Director|                   21026|792.7322540473225|
|            Hardware|             Manager|

## Getting Data In/Out

Getting data in and out of Spark is essential for data processing and analysis. PySpark provides methods to read from and write to many data sources. The examples below cover common file formats and storage systems.

### CSV File

In [52]:
# Writing (Spark creates a directory at this path containing one or more part files)
df.write.mode("overwrite").csv("path/to/file_csv.csv", header=True)
# Reading
df_csv = spark.read.csv("path/to/file_csv.csv", header=True, inferSchema=True)
df_csv.count()

24/08/10 23:04:32 WARN TaskSetManager: Stage 83 contains a task of very large size (1021 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

50000

### JSON Files

In [53]:
# Writing
df.write.mode("overwrite").json("path/to/file_json.json")
# Reading
df_json = spark.read.json("path/to/file_json.json")
df_json.count()

24/08/10 23:04:34 WARN TaskSetManager: Stage 89 contains a task of very large size (1021 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

50000

### Parquet File

In [54]:
# Writing
df.write.mode("overwrite").parquet("path/to/file_parquet.parquet")
# Reading
df_parquet = spark.read.parquet("path/to/file_parquet.parquet")
df_parquet.count()

24/08/10 23:04:36 WARN TaskSetManager: Stage 94 contains a task of very large size (1021 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

50000

### ORC File

In [55]:
# Writing
df.write.mode("overwrite").format("orc").save("path/to/file_orc.orc")
# Reading
df_orc = spark.read.format("orc").load("path/to/file_orc.orc")
df_orc.count()

24/08/10 23:04:37 WARN TaskSetManager: Stage 99 contains a task of very large size (1021 KiB). The maximum recommended task size is 1000 KiB.
                                                                                

50000

### JDBC (Databases)

JDBC (Java Database Connectivity) is a Java API for interacting with databases. Spark uses JDBC to connect to relational databases, so a table can be read into a DataFrame or a DataFrame can be written back to a table.
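The notebook does not include a JDBC example, so the following is only a sketch. The helper names, the connection URL, the table name, and the credentials are all hypothetical, and the matching JDBC driver JAR must be available to Spark for the read or write to succeed. The functions are defined but not called against a database here.

```python
def jdbc_options(url, table, user, password, driver):
    """Collect the standard options of Spark's JDBC data source in one dict."""
    return {
        "url": url,
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": driver,
    }


def read_jdbc_table(spark, options):
    """Read a database table into a DataFrame (needs a reachable database)."""
    return spark.read.format("jdbc").options(**options).load()


def write_jdbc_table(df, options, mode="append"):
    """Write a DataFrame to a database table (needs a reachable database)."""
    df.write.format("jdbc").options(**options).mode(mode).save()


# Hypothetical connection details, shown only to illustrate the shape of the options
example_options = jdbc_options(
    url="jdbc:postgresql://localhost:5432/hr",
    table="employees",
    user="spark_user",
    password="change_me",
    driver="org.postgresql.Driver",
)
print(sorted(example_options))
```

With a running database, `read_jdbc_table(spark, example_options)` would return a DataFrame for the `employees` table.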

## Working with SQL

PySpark lets you query and manipulate large datasets with SQL. Here is a basic guide to get you started.

### Running SQL Queries

In [56]:
# Create a DataFrame from a list of tuples
data = [("Alice", 1), ("Bob", 2), ("Cathy", 3)]
columns = ["Name", "Id"]
df_sql = spark.createDataFrame(data, columns)

# You can register a DataFrame as a temporary view to run SQL queries on it
df_sql.createOrReplaceTempView("my_table")

df_sql.show()

+-----+---+
| Name| Id|
+-----+---+
|Alice|  1|
|  Bob|  2|
|Cathy|  3|
+-----+---+



### Using SQL Queries

In [57]:
# Run a SQL query
result_df = spark.sql("SELECT * FROM my_table WHERE Id > 1")

# Show the results
result_df.show()

+-----+---+
| Name| Id|
+-----+---+
|  Bob|  2|
|Cathy|  3|
+-----+---+



### Using SQL Functions

In [58]:
from pyspark.sql.functions import col, expr

# Use SQL functions
result_df = df_sql.select(col("Name"), expr("Id + 1 as Id_plus_one"))
result_df.show()

+-----+-----------+
| Name|Id_plus_one|
+-----+-----------+
|Alice|          2|
|  Bob|          3|
|Cathy|          4|
+-----+-----------+



### Joining DataFrames

In [59]:
# Create another DataFrame
data2 = [("Alice", "Engineering"), ("Bob", "HR")]
columns2 = ["Name", "Department"]
df2 = spark.createDataFrame(data2, columns2)

# Join DataFrames
joined_df = df_sql.join(df2, on="Name", how="inner")
joined_df.show()



+-----+---+-----------+
| Name| Id| Department|
+-----+---+-----------+
|Alice|  1|Engineering|
|  Bob|  2|         HR|
+-----+---+-----------+



                                                                                

### Eager Evaluation
In PySpark, eager evaluation is a display option for notebooks and REPLs rather than a different execution model. When `spark.sql.repl.eagerEval.enabled` is set, a DataFrame that is the last expression in a cell is evaluated and rendered as a table without an explicit `show()` call. Transformations are still lazy: they are deferred until an action, or this automatic display, needs their result.

In [60]:
# Enable eager evaluation so that DataFrames are displayed automatically in notebooks and REPLs
spark.conf.set('spark.sql.repl.eagerEval.enabled', True)