<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Imports:" data-toc-modified-id="Imports:-1">Imports:</a></span></li><li><span><a href="#Creating-the-Spark-Entry-point-(SparkSession)" data-toc-modified-id="Creating-the-Spark-Entry-point-(SparkSession)-2">Creating the Spark Entry point (SparkSession)</a></span></li><li><span><a href="#Data-Preparation" data-toc-modified-id="Data-Preparation-3">Data Preparation</a></span></li><li><span><a href="#Filtering" data-toc-modified-id="Filtering-4">Filtering</a></span></li><li><span><a href="#How-to-Handle-Missing-Value" data-toc-modified-id="How-to-Handle-Missing-Value-5">How to Handle Missing Value</a></span></li><li><span><a href="#Demo-with-pyspark.sql.functions" data-toc-modified-id="Demo-with-pyspark.sql.functions-6">Demo with <code>pyspark.sql.functions</code></a></span></li><li><span><a href="#Sampling-Data" data-toc-modified-id="Sampling-Data-7">Sampling Data</a></span><ul class="toc-item"><li><span><a href="#Experiment-1" data-toc-modified-id="Experiment-1-7.1">Experiment 1</a></span></li><li><span><a href="#Experiment-2" data-toc-modified-id="Experiment-2-7.2">Experiment 2</a></span></li></ul></li><li><span><a href="#Partitioning" data-toc-modified-id="Partitioning-8">Partitioning</a></span></li><li><span><a href="#I/O" data-toc-modified-id="I/O-9">I/O</a></span><ul class="toc-item"><li><span><a href="#Convert-from-*.parqet-to-*.csv" data-toc-modified-id="Convert-from-*.parqet-to-*.csv-9.1">Convert from <code>*.parqet</code> to <code>*.csv</code></a></span></li><li><span><a href="#Convert-from-*.csv-to-*.parqet" data-toc-modified-id="Convert-from-*.csv-to-*.parqet-9.2">Convert from <code>*.csv</code> to <code>*.parqet</code></a></span></li></ul></li></ul></div>

#### Imports:

In [1]:
import os
from pyspark.sql import SparkSession

from pyspark.sql.functions import (
    col, when, asc, desc, lit,
    mean, sum, avg, stddev,
    count, countDistinct,
    format_number, isnan,
    rank, lag, lead,
)
from pyspark.sql.window import Window

from pyspark.sql.types import (
    StructField, StructType, LongType, TimestampType,
    StringType, IntegerType, 
    FloatType, BooleanType,
    DateType,
)

In [2]:
import pyspark
import datetime
import numpy as np
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

In [4]:
# ## 
# # You might have noticed this code in the screencast.

# import findspark
# findspark.init('spark-3.3.3-bin-hadoop3')

# # The findspark Python module makes it easier to install
# # Spark in local mode on your computer. This is convenient
# # for practicing Spark syntax locally. 
# # However, the workspaces already have Spark installed and you do not
# # need to use the findspark module

# ##


In [2]:
# pyspark.__version__

- [Spark 3.0.1](https://spark.apache.org/docs/3.0.1/index.html)
- [Spark Python API](https://spark.apache.org/docs/latest/api/python/index.html#)
- [PySpark API Reference](https://spark.apache.org/docs/latest/api/python/reference/index.html#api-reference)
- [Transformations and Actions](https://spark.apache.org/docs/latest/api/python/reference/pyspark.streaming.html#transformations-and-actions)
- [Broadcast and Accumulator](https://spark.apache.org/docs/latest/api/python/reference/pyspark.html#broadcast-and-accumulator)

- Spark SQL can execute SQL queries and can also read data from an existing Hive installation. When SQL is run from within another programming language, the results are returned as a `Dataset`/`DataFrame`.
- The `Dataset` API is available in Scala and Java. Python does not support the `Dataset` API.
- The `DataFrame` API is available in Scala, Java, Python, and R.
   

The following are the natively available data types in PySpark:

-   **BooleanType**: Represents a boolean value.
-   **ByteType**: Represents a byte value.
-   **ShortType**: Represents a short integer value.
-   **IntegerType**: Represents an integer value.
-   **LongType**: Represents a long integer value.
-   **FloatType**: Represents a float value.
-   **DoubleType**: Represents a double value.
-   **DecimalType**: Represents a decimal value.
-   **StringType**: Represents a string value.
-   **BinaryType**: Represents a binary (byte array) value.
-   **DateType**: Represents a date value.
-   **TimestampType**: Represents a timestamp value.
-   **ArrayType**: Represents an array of values.
-   **MapType**: Represents a map of key-value pairs.
-   **StructType**: Represents a struct (complex type) with fields.

#### Creating the Spark Entry point (SparkSession)

In [3]:
spark = SparkSession.builder \
        .appName("auto") \
        .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/05/31 17:12:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [4]:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10000000)

# Enable eager evaluation for better formatting of the output
spark.conf.set("spark.sql.repl.eagerEval.enabled", True)
spark.conf.get("spark.sql.sources.bucketing.enabled")
spark.conf.get("spark.sql.autoBroadcastJoinThreshold")

# Disable Broadcast Join
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
spark.conf.set("spark.sql.adaptive.enabled", False)

In [5]:
spark.conf.get("spark.sql.warehouse.dir")

'file:/Users/am/mydocs/Software_Development/notes_hub/nbs/spark-warehouse'

In [61]:
# spark.sparkContext.getConf().getAll()

In [4]:
# spark.conf.get("spark.sql.parquet.filterPushDown")

#### Data Preparation

In [48]:
DATA_DIR = os.environ['DATA'] + '/IBM_Data_Analysis'

In [49]:
# ! head -3 {DATA_DIR}/imports-85.csv

In [50]:
column_names = ['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
       'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
       'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
       'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
       'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
       'highway-mpg', 'price']

In [52]:
sqldf = spark.read.csv(DATA_DIR + "/imports-85.csv/", header=False).toDF(*column_names)

In [62]:
# sqldf.columns

- Alternative way to rename the columns
```python
columns = sqldf.columns
for old_col, new_col in zip(columns, column_names):
    sqldf = sqldf.withColumnRenamed(old_col, new_col)
```

In [None]:
# # creates a temporary view against which we can run SQL queries.
# # createOrReplaceTempView() returns None, so its result is not assigned.
# sqldf.createOrReplaceTempView('auto')
# spark.sql("SELECT * FROM auto LIMIT 2").show()

In [None]:
# sqldf.printSchema()
# sqldf.describe()
# sqldf.take(2)

In [None]:
# df = spark.range(1, 10000, 1, 10).select(col("id"), rand(10).alias("Attribute"))


In [4]:
EMP_PATH = "/Users/am/mydocs/Software_Development/Databases/RDBMS/sql/schemas/employees.csv"
DEPT_PATH = "/Users/am/mydocs/Software_Development/Databases/RDBMS/sql/schemas/department.csv"

emp_schema = StructType(fields=[
    StructField('employee_id', StringType(), True),
    StructField('first_name',  StringType(), True),
    StructField('last_name', StringType(), True),
    StructField('age', FloatType(), True),
    StructField('salary', FloatType(), True),
    StructField('joining_date', TimestampType(), True),
    StructField('department_id', LongType(), True),
    StructField('manager_id', LongType(), True)
    ]
)
dept_schema = StructType(fields=[
    StructField('department_id', LongType(), True),
    StructField('department',  StringType(), True)
])

emp = spark.read.csv(EMP_PATH, header=False, schema=emp_schema)
dept = spark.read.csv(DEPT_PATH, header=False, schema=dept_schema)

In [5]:
emp.show()
dept.show()

                                                                                

+-----------+----------+---------+----+-------+-------------------+-------------+----------+
|employee_id|first_name|last_name| age| salary|       joining_date|department_id|manager_id|
+-----------+----------+---------+----+-------+-------------------+-------------+----------+
|          1|     Alice|  Johnson|59.0|70000.0|2000-05-27 16:00:33|            1|      NULL|
|          2|      John|    Smith|30.0|60000.0|2001-02-27 16:00:33|            1|         5|
|          3|     James|    Smith|25.0|55000.0|2001-01-20 16:00:33|            1|         6|
|          4|      Mona|     null|28.0|62000.0|2000-05-27 16:00:33|            1|         7|
|          5|      Bill|  Clinton|29.0|54000.0|2024-05-27 16:00:33|            1|         5|
|          6|    Hilary|  Clinton|24.0|52000.0|2024-05-27 16:00:33|            1|         5|
|          7|       Eva|  Clinton|31.0|63000.0|2024-05-27 16:00:33|            1|         7|
|          8|   Charlie|    Brown|27.0|58000.0|2024-05-27 16:00:33|   

In [6]:
# df = emp
# df.show()

In [7]:
# Inner Join
df = emp.join(dept, emp.department_id == dept.department_id, "inner").select("employee_id","first_name","last_name","salary","age","department","manager_id")
pddf = df.toPandas(); pddf

Unnamed: 0,employee_id,first_name,last_name,salary,age,department,manager_id
0,1.0,Alice,Johnson,70000.0,59.0,Engineering,
1,2.0,John,Smith,60000.0,30.0,Engineering,5.0
2,3.0,James,Smith,55000.0,25.0,Engineering,6.0
3,4.0,Mona,,62000.0,28.0,Engineering,7.0
4,5.0,Bill,Clinton,54000.0,29.0,Engineering,5.0
5,6.0,Hilary,Clinton,52000.0,24.0,Engineering,5.0
6,7.0,Eva,Clinton,63000.0,31.0,Engineering,7.0
7,8.0,Charlie,Brown,58000.0,27.0,Finance,5.0
8,9.0,Grace,Brown,74000.0,33.0,Finance,5.0
9,10.0,Bob,Williams,65000.0,32.0,HR,6.0


In [44]:
# df = df.select("employee_id","first_name","last_name","age","department","manager_id").show()

#### Filtering

In [None]:
# # Initialize Spark session
# spark = SparkSession.builder \
#     .appName("PySpark Filtering Examples") \
#     .getOrCreate()

# # Sample data
# data = [
#     ("Alice", 34, "2023-01-01", "HR"),
#     ("Bob", 45, "2022-12-15", "Finance"),
#     ("Catherine", 29, "2023-03-05", "HR"),
#     ("David", 50, "2021-07-30", "Finance"),
#     ("Eva", 40, "2023-05-22", "IT")
# ]

# # Create DataFrame
# df = spark.createDataFrame(data, ["name", "age", "join_date", "department"])

# # Show the original DataFrame
# print("Original DataFrame:")
# df.show()

In [8]:
# Filtering with filter
df.filter(df.age > 30).show()

+-----------+----------+---------+-------+----+-----------+----------+
|employee_id|first_name|last_name| salary| age| department|manager_id|
+-----------+----------+---------+-------+----+-----------+----------+
|          1|     Alice|  Johnson|70000.0|59.0|Engineering|      NULL|
|          7|       Eva|  Clinton|63000.0|31.0|Engineering|         7|
|          9|     Grace|    Brown|74000.0|33.0|    Finance|         5|
|         10|       Bob| Williams|65000.0|32.0|         HR|         6|
|         13|    Nathan| Williams|70000.0|32.0|         HR|         5|
|         14|     Henry|Jefferson|66000.0|34.0|         IT|         7|
|         15|      Jack|Jefferson|78000.0|36.0|         IT|         7|
|         16|     Kathy|Jefferson|80000.0|38.0|         IT|         6|
|         19|     Peter|   Wilson|77000.0|33.0|    Finance|         7|
|       null|    Thomas|Jefferson|    0.0|38.0|         IT|         6|
+-----------+----------+---------+-------+----+-----------+----------+



In [9]:
# Filtering with where
df.where(df.department == "HR").show()

# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("employees")

# Filtering with SQL query
filtered_df = spark.sql("SELECT * FROM employees WHERE age > 30")
filtered_df.show()

# Filtering with col
df.filter(col("age") > 35).show()

# Filtering with AND condition
df.filter((df.age > 35) & (df.department == "HR")).show()

# Filtering with OR condition
df.filter((df.age > 35) | (df.department == "IT")).show()

# Filtering with isin
df.filter(df.department.isin("HR", "IT")).show()

# Filtering with like
df.filter(df.first_name.like("A%")).show()

# Filtering with rlike (regular expression)
df.filter(df.first_name.rlike("^[AE].*")).show()

# Filtering with between
df.filter(df.age.between(30, 40)).show()

+-----------+----------+---------+-------+----+----------+----------+
|employee_id|first_name|last_name| salary| age|department|manager_id|
+-----------+----------+---------+-------+----+----------+----------+
|         10|       Bob| Williams|65000.0|32.0|        HR|         6|
|         11|     David| Williams|    0.0|29.0|        HR|         7|
|         12|     Frank|     NULL|61000.0|26.0|        HR|         6|
|         13|    Nathan| Williams|70000.0|32.0|        HR|         5|
+-----------+----------+---------+-------+----+----------+----------+

+-----------+----------+---------+-------+----+-----------+----------+
|employee_id|first_name|last_name| salary| age| department|manager_id|
+-----------+----------+---------+-------+----+-----------+----------+
|          1|     Alice|  Johnson|70000.0|59.0|Engineering|      NULL|
|          7|       Eva|  Clinton|63000.0|31.0|Engineering|         7|
|          9|     Grace|    Brown|74000.0|33.0|    Finance|         5|
|         10|

In [10]:
# Sample data with null values
data_with_nulls = [
    ("Alice", None, "2023-01-01", "HR"),
    ("Bob", 45, "2022-12-15", "Finance"),
    ("Catherine", 29, "2023-03-05", "HR"),
    ("David", 50, "2021-07-30", None),
    ("Eva", 40, "2023-05-22", "IT")
]

# Create DataFrame
df_nulls = spark.createDataFrame(data_with_nulls, ["name", "age", "join_date", "department"])

# Filtering with isNull
df_nulls.filter(df_nulls.age.isNull()).show()

# Filtering with isNotNull
df_nulls.filter(df_nulls.department.isNotNull()).show()

                                                                                

+-----+----+----------+----------+
| name| age| join_date|department|
+-----+----+----------+----------+
|Alice|NULL|2023-01-01|        HR|
+-----+----+----------+----------+

+---------+----+----------+----------+
|     name| age| join_date|department|
+---------+----+----------+----------+
|    Alice|NULL|2023-01-01|        HR|
|      Bob|  45|2022-12-15|   Finance|
|Catherine|  29|2023-03-05|        HR|
|      Eva|  40|2023-05-22|        IT|
+---------+----+----------+----------+



In [11]:
df = df.withColumn(
    'AgeGroup',
    when(df['Age']<20, 'junior')
    .when((df['Age']>=20) & (df['Age']<30), 'young')
    .when((df['Age']>=30) & (df['Age']<55), 'middle')
    .when((df['Age']>=55) & (df['Age']<70), 'senior')
    .otherwise('elderly')
); df.show()

+-----------+----------+---------+-------+----+-----------+----------+--------+
|employee_id|first_name|last_name| salary| age| department|manager_id|AgeGroup|
+-----------+----------+---------+-------+----+-----------+----------+--------+
|          1|     Alice|  Johnson|70000.0|59.0|Engineering|      NULL|  senior|
|          2|      John|    Smith|60000.0|30.0|Engineering|         5|  middle|
|          3|     James|    Smith|55000.0|25.0|Engineering|         6|   young|
|          4|      Mona|     null|62000.0|28.0|Engineering|         7|   young|
|          5|      Bill|  Clinton|54000.0|29.0|Engineering|         5|   young|
|          6|    Hilary|  Clinton|52000.0|24.0|Engineering|         5|   young|
|          7|       Eva|  Clinton|63000.0|31.0|Engineering|         7|  middle|
|          8|   Charlie|    Brown|58000.0|27.0|    Finance|         5|   young|
|          9|     Grace|    Brown|74000.0|33.0|    Finance|         5|  middle|
|         10|       Bob| Williams|65000.

In [12]:
# # SQL 
# query = """
# SELECT
#     Age,
#     CASE
#         WHEN Age < 20 THEN 'junior'
#         WHEN (Age >=20) AND (Age<30) THEN 'young'
#         WHEN Age >= 30 AND Age < 55 THEN 'middle' -- matches the DataFrame version: 55 is 'senior'
#         WHEN Age>=55 AND Age<70 THEN 'senior'
#         ELSE 'elderly'
#     END AS AgeGroup
# FROM
#     employees -- temp view registered with df.createOrReplaceTempView("employees")
# """

# spark.sql(query).show(n=5, truncate=False)

#### GROUP_BY

In [32]:
df.groupBy('department').agg(avg('salary')).show()

+-----------+-----------------+
| department|      avg(salary)|
+-----------+-----------------+
|Engineering|59428.57142857143|
|         HR|          49000.0|
|    Finance|          69500.0|
|         IT|          61000.0|
+-----------+-----------------+



In [16]:
# Group by multiple columns and sum every numeric column
df.groupBy('department', 'salary').sum().show()

+-----------+-------+-----------+--------+---------------+
| department| salary|sum(salary)|sum(age)|sum(manager_id)|
+-----------+-------+-----------+--------+---------------+
|Engineering|52000.0|    52000.0|    24.0|              5|
|         IT|78000.0|    78000.0|    36.0|              7|
|    Finance|74000.0|    74000.0|    33.0|              5|
|         HR|70000.0|    70000.0|    32.0|              5|
|Engineering|55000.0|    55000.0|    25.0|              6|
|    Finance|77000.0|    77000.0|    33.0|              7|
|Engineering|60000.0|    60000.0|    30.0|              5|
|         IT|80000.0|    80000.0|    38.0|              6|
|Engineering|54000.0|    54000.0|    29.0|              5|
|Engineering|70000.0|    70000.0|    59.0|           NULL|
|         HR|61000.0|    61000.0|    26.0|              6|
|    Finance|69000.0|    69000.0|    29.0|              5|
|Engineering|62000.0|    62000.0|    28.0|              7|
|    Finance|58000.0|    58000.0|    27.0|              

In [None]:
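# NOTE: `train_df` is not created in this notebook. The cells below assume a DataFrame
# with HomePlanet, VIP, TotalBill and PassengerId columns.
# Get average values
train_df.groupby('HomePlanet').mean('TotalBill').show()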

# Combination with filter() function
train_df.groupBy('HomePlanet').mean('TotalBill') \
        .filter(col('avg(TotalBill)') >= 1000).show()

# mean() function calls agg(avg()) function. After agg() function, you can use alias() function for renaming.
train_df.groupBy('HomePlanet').agg(avg('TotalBill') \
        .alias('mean_total_bill')).filter(col('mean_total_bill') >= 1000) \
        .sort(desc('mean_total_bill')).show()

In [None]:
# Group by one column and apply several aggregate functions
train_df.groupBy('HomePlanet').agg(avg('TotalBill').alias('avg_total_bill'), \
                                   stddev('TotalBill').alias('stddev_total_bill')) \
                                   .show()

In [None]:
# Pivot Table 
# pivot() turns the distinct values of a column into column headers.
train_df.groupBy('HomePlanet').agg(count('VIP').alias('VIPCount')).show()
train_df.groupby('HomePlanet').pivot('VIP').count().show()
train_df.groupby('HomePlanet').pivot('VIP').max('TotalBill').show()

In [None]:
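# SQL (register the DataFrame as a temp view so the query can refer to it by name)
train_df.createOrReplaceTempView('train_df')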
query = """
SELECT
    HomePlanet
    , COUNT(*) AS Count
    , COUNT(DISTINCT PassengerId) as TotalIDs
    , AVG(TotalBill) AS mean_total_bill
    , MAX(TotalBill) AS max_total_bill
    , SUM(TotalBill) AS sum_of_total_bill
FROM
    train_df
GROUP BY
    HomePlanet
HAVING
    mean_total_bill > 1000
ORDER BY
    mean_total_bill DESC
"""

spark.sql(query).show(n=5, truncate=False)

In [None]:
# Group by multiple conditions
train_df.groupBy('HomePlanet', 'VIP').count().sort(asc('count')).show()

#### How to Handle Missing Value

In [None]:
from pyspark.sql import functions as F

# Replace nulls with a specific value per column
df = df.fillna({
    'first_name': 'Tom',
    'age': 0,
})

# Take the first value that is not null (assumes a `surname` column exists)
df = df.withColumn('last_name', F.coalesce(df.last_name, df.surname, F.lit('N/A')))

# Drop duplicate rows in a dataset (distinct)
df = df.dropDuplicates() # or
df = df.distinct()

# Drop duplicate rows, but consider only specific columns (assumes `name` and `height` columns)
df = df.dropDuplicates(['name', 'height'])

# Replace empty strings with null (leave out subset keyword arg to replace in all columns)
df = df.replace({"": None}, subset=["name"])

# Convert NaN values (Python/PySpark/NumPy) to null
df = df.replace(float("nan"), None)

In [12]:
float("nan")

nan

In [4]:
# Initialize SparkSession
spark = SparkSession.builder.appName("HandleMissingData").getOrCreate()

# Sample data with missing values
data = [
    (1, "John", None),
    (2, None, 5000),
    (3, "Sara", 4500),
    (None, "David", None),
    (5, "Mike", 5500)
]

# Define schema
schema = ["id", "name", "salary"]

# Create DataFrame
df = spark.createDataFrame(data, schema)

# Show the DataFrame
df.show()

In [None]:
# Drop rows with any null values
df_dropped_any = df.na.drop()
df_dropped_any.show()

df_dropped_any2 = df.dropna('any')
df_dropped_any2.show()

# Drop rows only when every value in the row is null
df_dropped_all = df.na.drop(how='all')
df_dropped_all.show()

df_dropped_all2 = df.dropna('all')

# Drop rows with null values in specific columns
df_dropped_subset = df.na.drop(subset=['name', 'salary'])
df_dropped_subset.show()

# Fill nulls with a single value (a string value only fills string columns)
df_filled_all = df.na.fill("Unknown")
df_filled_all.show()


# Fill null values in specific columns
df_filled_subset = df.na.fill({"name": "Unknown", "salary": 0})
df_filled_subset.show()

df_filled_subset = df.fillna('unknown', subset=['name'])

# Replace specific values
# df_replaced = df.na.replace({None: "Unknown"})
# df_replaced.show()

In [None]:
# Using SQL to handle missing data
df.createOrReplaceTempView("people")

filled_df_sql = spark.sql("""
SELECT id,
       COALESCE(name, 'Unknown') as name,
       COALESCE(salary, 0) as salary
FROM people
""")
filled_df_sql.show()

# Stop the SparkSession
spark.stop()

In [None]:
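# coalesce and na.fill on the id/name/salary DataFrame above
# (run this before spark.stop())
from pyspark.sql.functions import coalesce, lit

df.withColumn("name_filled", coalesce("name", lit("Unknown"))) \
    .withColumn("salary_filled", coalesce("salary", lit(0))) \
    .na.fill({"name": "Unknown", "salary": 0}) \
    .show()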

#### Demo with `pyspark.sql.functions`

In [8]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, lit, when, concat, concat_ws, substring, length, 
    upper, lower, round, bround, date_format, current_date, year, month, 
    dayofmonth, collect_list, collect_set
)

# Initialize Spark session
spark = SparkSession.builder \
    .appName("PySpark SQL Functions Examples") \
    .getOrCreate()

# Sample data
data = [
    ("Alice", 34, "New York"),
    ("Bob", 45, "San Francisco"),
    ("Catherine", 29, "Chicago"),
    ("David", 50, "New York"),
    ("Eva", 40, "San Francisco")
]

# Create DataFrame
df = spark.createDataFrame(data, ["name", "age", "city"])

# Show the original DataFrame
print("Original DataFrame:")
df.show()

# Demonstrating various functions
df = df.withColumn("country", lit("USA"))
df = df.withColumn("age_group", when(col("age") < 40, "Young").otherwise("Old"))
df = df.withColumn("name_city", concat_ws(" - ", col("name"), col("city")))
df = df.withColumn("name_substr", substring(col("name"), 1, 3))
df = df.withColumn("name_length", length(col("name")))
df = df.withColumn("name_upper", upper(col("name"))).withColumn("name_lower", lower(col("name")))

# Adding a current date column for demonstration
df = df.withColumn("current_date", current_date())
df = df.withColumn("formatted_date", date_format(col("current_date"), "MM/dd/yyyy"))
df = df.withColumn("year", year(col("current_date"))).withColumn("month", month(col("current_date"))).withColumn("day", dayofmonth(col("current_date")))

# Show the transformed DataFrame
print("Transformed DataFrame:")
df.show()

# Creating a DataFrame with a float column for demonstration
data = [("Alice", 34.567), ("Bob", 45.123), ("Catherine", 29.987), ("David", 50.456), ("Eva", 40.789)]
df = spark.createDataFrame(data, ["name", "score"])

df = df.withColumn("score_rounded", round(col("score"), 1)).withColumn("score_brounded", bround(col("score"), 1))

# Show the DataFrame with rounded scores
print("DataFrame with Rounded Scores:")
df.show()

# Group by city and aggregate names
df = spark.createDataFrame([
    ("Alice", 34, "New York"),
    ("Bob", 45, "San Francisco"),
    ("Catherine", 29, "Chicago"),
    ("David", 50, "New York"),
    ("Eva", 40, "San Francisco")
], ["name", "age", "city"])

df = df.groupBy("city").agg(collect_list("name").alias("names_list"), collect_set("name").alias("names_set"))

# Show the aggregated DataFrame
print("Aggregated DataFrame:")
df.show(truncate=False)


24/05/31 17:17:17 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


Original DataFrame:


                                                                                

+---------+---+-------------+
|     name|age|         city|
+---------+---+-------------+
|    Alice| 34|     New York|
|      Bob| 45|San Francisco|
|Catherine| 29|      Chicago|
|    David| 50|     New York|
|      Eva| 40|San Francisco|
+---------+---+-------------+

Transformed DataFrame:
+---------+---+-------------+-------+---------+-------------------+-----------+-----------+----------+----------+------------+--------------+----+-----+---+
|     name|age|         city|country|age_group|          name_city|name_substr|name_length|name_upper|name_lower|current_date|formatted_date|year|month|day|
+---------+---+-------------+-------+---------+-------------------+-----------+-----------+----------+----------+------------+--------------+----+-----+---+
|    Alice| 34|     New York|    USA|    Young|   Alice - New York|        Ali|          5|     ALICE|     alice|  2024-05-31|    05/31/2024|2024|    5| 31|
|      Bob| 45|San Francisco|    USA|      Old|Bob - San Francisco|        Bob

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, lit, date_add, date_sub, datediff, months_between, current_date, 
    year, month, dayofmonth, dayofweek, dayofyear, weekofyear, hour, minute, 
    second, current_timestamp, unix_timestamp, coalesce, split, array_contains
)

# Initialize Spark session
spark = SparkSession.builder \
    .appName("PySpark SQL Functions Additional Examples") \
    .getOrCreate()

# Sample data
data = [
    ("Alice", 34, "2023-01-01"),
    ("Bob", 45, "2022-12-15"),
    ("Catherine", 29, "2023-03-05"),
    ("David", 50, "2021-07-30"),
    ("Eva", 40, "2023-05-22")
]

# Create DataFrame
df = spark.createDataFrame(data, ["name", "age", "join_date"])

# Show the original DataFrame
print("Original DataFrame:")
df.show()

# date_add and date_sub
df.withColumn("date_plus_10", date_add("join_date", 10)) \
  .withColumn("date_minus_10", date_sub("join_date", 10)) \
  .show()

# datediff and months_between
df.withColumn("days_since_join", datediff(current_date(), "join_date")) \
  .withColumn("months_since_join", months_between(current_date(), "join_date")) \
  .show()

# year, month, dayofmonth, dayofweek, dayofyear, weekofyear
df.withColumn("year", year("join_date")) \
  .withColumn("month", month("join_date")) \
  .withColumn("day", dayofmonth("join_date")) \
  .withColumn("day_of_week", dayofweek("join_date")) \
  .withColumn("day_of_year", dayofyear("join_date")) \
  .withColumn("week_of_year", weekofyear("join_date")) \
  .show()

# Sample data with timestamp
data_with_time = [
    ("Alice", 34, "2023-01-01 12:34:56"),
    ("Bob", 45, "2022-12-15 14:20:30"),
    ("Catherine", 29, "2023-03-05 08:45:15"),
    ("David", 50, "2021-07-30 19:50:40"),
    ("Eva", 40, "2023-05-22 06:25:10")
]

# Create DataFrame
df_time = spark.createDataFrame(data_with_time, ["name", "age", "join_time"])

# hour, minute, second
df_time.withColumn("hour", hour("join_time")) \
      .withColumn("minute", minute("join_time")) \
      .withColumn("second", second("join_time")) \
      .show()

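# current_timestamp and unix_timestamp
# join_date holds 'yyyy-MM-dd' strings, so pass the format explicitly
# (the default 'yyyy-MM-dd HH:mm:ss' does not match these strings)
df.withColumn("current_ts", current_timestamp()) \
  .withColumn("unix_ts", unix_timestamp("join_date", "yyyy-MM-dd")) \
  .show()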

# Sample data with null values
data_with_nulls = [
    ("Alice", None),
    (None, 45),
    ("Catherine", 29),
    ("David", None),
    ("Eva", 40)
]

# Create DataFrame
df_nulls = spark.createDataFrame(data_with_nulls, ["name", "age"])

# coalesce and na.fill
df_nulls.withColumn("name_filled", coalesce("name", lit("Unknown"))) \
        .withColumn("age_filled", coalesce("age", lit(0))) \
        .na.fill({"name": "Unknown", "age": 0}) \
        .show()


In [None]:
from pyspark.sql import functions as F

# Construct a new dynamic column (assumes `fname` and `lname` columns exist)
df = df.withColumn('full_name', F.when(
    (df.fname.isNotNull() & df.lname.isNotNull()), F.concat(df.fname, df.lname)
).otherwise(F.lit('N/A')))

- **Demo: how to use the `collect_list()` and `explode()` functions**

In [8]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list, explode

# Initialize Spark session
spark = SparkSession.builder \
    .appName("Collect List and Explode Example") \
    .getOrCreate()

# Sample data
data = [
    ("Alice", "Math"),
    ("Alice", "Science"),
    ("Bob", "Math"),
    ("Bob", "English"),
    ("Charlie", "Math"),
    ("Charlie", "Science"),
    ("Charlie", "History")
]

# Create DataFrame
df = spark.createDataFrame(data, ["name", "subject"])

# Show the original DataFrame
print("Original DataFrame:")
df.show()

# Group by name and collect subjects into a list
collected_df = df.groupBy("name").agg(collect_list("subject").alias("subjects"))

# Show the DataFrame with collected lists
print("DataFrame after collect_list():")
collected_df.show(truncate=False)

# Explode the list of subjects into individual rows
exploded_df = collected_df.select("name", explode("subjects").alias("subject"))

# Show the DataFrame with exploded lists
print("DataFrame after explode():")
exploded_df.show()

# Stop the Spark session
spark.stop()


24/05/28 11:16:12 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.


Original DataFrame:


                                                                                

+-------+-------+
|   name|subject|
+-------+-------+
|  Alice|   Math|
|  Alice|Science|
|    Bob|   Math|
|    Bob|English|
|Charlie|   Math|
|Charlie|Science|
|Charlie|History|
+-------+-------+

DataFrame after collect_list():
+-------+------------------------+
|name   |subjects                |
+-------+------------------------+
|Alice  |[Math, Science]         |
|Bob    |[Math, English]         |
|Charlie|[Math, Science, History]|
+-------+------------------------+

DataFrame after explode():
+-------+-------+
|   name|subject|
+-------+-------+
|  Alice|   Math|
|  Alice|Science|
|    Bob|   Math|
|    Bob|English|
|Charlie|   Math|
|Charlie|Science|
|Charlie|History|
+-------+-------+



#### Sampling Data

##### Experiment 1

In [None]:
DATA_DIR = os.environ['DATA'] + '/Spark_Experiments'
transactions_file = DATA_DIR + "/TNX.csv"

In [1]:
import pandas as pd

sample_size = 100000
chunk_size = 1000000  # Adjust based on available memory
sample = None

# Stream the file in chunks; reading it whole with pd.read_csv would defeat the purpose
for chunk in pd.read_csv(transactions_file, chunksize=chunk_size):
    # Draw up to sample_size rows from the current chunk
    chunk_sample = chunk.sample(n=min(sample_size, len(chunk)), random_state=1)
    # Merge with the running sample, then trim back to at most sample_size rows
    sample = chunk_sample if sample is None else pd.concat([sample, chunk_sample], axis=0)
    sample = sample.sample(n=min(sample_size, len(sample)), random_state=1)

sample.to_csv(DATA_DIR + '/sampled_tnx.csv', index=False)
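
The chunked loop above re-samples a growing pool on every iteration, so rows from later chunks end up with a higher chance of being kept than rows from earlier ones. If a uniform sample matters, reservoir sampling (Algorithm R) is one alternative. It is swapped in here purely for illustration and is not the method used in this notebook. A minimal sketch using only the standard library, where `reservoir_sample` is a hypothetical helper name and the in-memory CSV is made-up demo data:

```python
import csv
import io
import random


def reservoir_sample(rows, k, seed=1):
    """Return up to k items drawn uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)  # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = row  # keep the new item with probability k / (i + 1)
    return reservoir


# Tiny in-memory CSV standing in for the real transactions file (demo data only)
demo_csv = io.StringIO("txn_id,amt\n" + "\n".join(f"T{i},{i}" for i in range(1000)))
demo_sample = reservoir_sample(csv.DictReader(demo_csv), k=10)
print(len(demo_sample))
```

Because the reader is consumed row by row, memory use stays proportional to `k` rather than to the file size.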

##### Experiment 2

In [5]:
import os

In [8]:
DATA_DIR = os.environ['DATA']
# Define paths
input_parquet_path = os.environ['HOME'] + '/Desktop/endomondoHR_proper.json'
# input_parquet_path = DATA_DIR + '/sparkify_log_small.json'
# output_csv_path = DATA_DIR + '/sparkify_log_small_test.json'
output_csv_path = DATA_DIR + '/endomondoHR_proper.json'
input_parquet_path, output_csv_path

('/Users/am/Desktop/endomondoHR_proper.json',
 '/Users/am/DATA/endomondoHR_proper.json')

In [2]:
from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("JSON Sampling") \
    .config("spark.sql.shuffle.partitions", "2") \
    .config("spark.driver.memory", "4g") \
    .config("spark.executor.memory", "2g") \
    .getOrCreate()

# Read the JSON file, dropping malformed records
# (despite their names, both path variables point to JSON files)
df = spark.read.json(input_parquet_path, mode="DROPMALFORMED")

# Get the total number of records
total_records = df.count()

# Fraction that yields approximately 10,000 records (capped at 1.0 for small inputs)
sample_fraction = min(1.0, 10000 / total_records)

# Sample the DataFrame; the resulting row count is only approximate
sampled_df = df.sample(withReplacement=False, fraction=sample_fraction, seed=42)

# Cap the result at 10,000 records (it may contain fewer)
sampled_df = sampled_df.limit(10000)

# Write the sampled records as JSON (the "header" option only applies to CSV)
sampled_df.write.mode('overwrite').json(output_csv_path)

# Stop the SparkSession
spark.stop()


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


24/05/27 21:57:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


NameError: name 'input_parquet_path' is not defined
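
`DataFrame.sample` treats `fraction` as a per-row probability, so the number of rows returned varies and can fall below the target. One common workaround is to oversample slightly and then apply `limit`. A small sketch of the fraction arithmetic in plain Python, where `sample_fraction_for` and its `margin` parameter are hypothetical names and not part of the PySpark API:

```python
def sample_fraction_for(total_records, target, margin=1.2):
    """Fraction to pass to DataFrame.sample so that roughly target * margin rows come back."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    # Fractions above 1.0 are invalid when sampling without replacement
    return min(1.0, target * margin / total_records)


fraction = sample_fraction_for(total_records=1_000_000, target=10_000)
print(fraction)
```

With a live session the result would be used as `df.sample(withReplacement=False, fraction=fraction, seed=42).limit(10000)`.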

#### Partitioning

```python
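## Write Partitioned Data
# One subdirectory per distinct value of "column"
df.write.partitionBy("column").parquet("path/to/output")

# coalesce(1): merge into a single partition, so the output directory holds one part file
df_transactions.coalesce(1).write.mode('overwrite').option("header", "true").csv("TNX_test.csv")

# repartition(5): shuffle into 5 partitions, so the output directory holds up to 5 part files
df_transactions.repartition(5).write.mode("overwrite").option("header", "true").csv("TNX_test.csv")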
```

```python
(
    df
    .repartition(3)
    .write
    .mode("overwrite")
    .partitionBy("listen_date")
    .parquet(DATA_DIR + "/partitioning/partitioned/listening_activity_pt_4")
)
```
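
`partitionBy` writes one subdirectory per distinct value using the Hive-style `column=value` naming, and each in-memory partition can contribute its own file to every subdirectory it has rows for. The sketch below only does the path and file-count arithmetic in plain Python. `partition_dir` and `max_output_files` are hypothetical helper names, the example values are made up, and the file count assumes default writer settings:

```python
import posixpath


def partition_dir(base_path, column, value):
    """Directory that a Hive-style partitioned write would use for one value."""
    return posixpath.join(base_path, f"{column}={value}")


def max_output_files(num_memory_partitions, num_distinct_values):
    """Upper bound on data files: one per (memory partition, partition value) pair."""
    return num_memory_partitions * num_distinct_values


print(partition_dir("listening_activity_pt_4", "listen_date", "2019-01-01"))
print(max_output_files(3, 7))
```

This is why `repartition(3)` followed by `partitionBy("listen_date")` can leave up to three files in each `listen_date=...` directory.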

#### I/O

```python
## Read CSV Files
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)

## Read JSON Files
df = spark.read.json("path/to/file.json")

## Read Parquet Files
df = spark.read.parquet("path/to/file.parquet")

## Read ORC Files
df = spark.read.orc("path/to/file.orc")

## Read Text Files
df = spark.read.text("path/to/file.txt")

## Read Avro Files
df = spark.read.format("avro").load("path/to/file.avro")

## Read Delta Lake
df = spark.read.format("delta").load("path/to/delta/table")

## Read JDBC/ODBC
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/db")
    .option("dbtable", "table_name")
    .option("user", "username")
    .option("password", "password")
    .load()
)

## Read Other Formats
df = spark.read.format("custom_format").load("path/to/source")
```
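
Each file reader above is shorthand for `spark.read.format(...).load(...)`, so the format name can be picked from the file extension. The mapping below is a plain-Python sketch. `format_for_path` is a hypothetical helper, and the extension table is an assumption that covers only the file formats listed above:

```python
import os

# Extension -> Spark data source name (only the file formats listed above)
_FORMATS = {
    ".csv": "csv",
    ".json": "json",
    ".parquet": "parquet",
    ".orc": "orc",
    ".txt": "text",
    ".avro": "avro",
}


def format_for_path(path):
    """Return the Spark format name for a path, based on its extension."""
    ext = os.path.splitext(path)[1].lower()
    try:
        return _FORMATS[ext]
    except KeyError:
        raise ValueError(f"No known Spark format for extension {ext!r}") from None


print(format_for_path("path/to/file.parquet"))
```

With a live session this would be used as `spark.read.format(format_for_path(p)).load(p)`.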

```python
## Generic Writer (format passed as an argument)
df.write.save(out_path, format="csv", header=True)

## Write CSV Files
df.write.csv("path/to/output.csv", header=True)

## Write JSON Files
df.write.json("path/to/output.json")

## Write Parquet Files
df.write.parquet("path/to/output.parquet")

## Write ORC Files
df.write.orc("path/to/output.orc")

## Write Text Files
df.write.text("path/to/output.txt")

## Write Avro Files
df.write.format("avro").save("path/to/output.avro")

## Write Delta Lake
df.write.format("delta").save("path/to/delta/output")

## Write JDBC/ODBC
(
    df.write.format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/db")
    .option("dbtable", "table_name")
    .option("user", "username")
    .option("password", "password")
    .save()
)

## Write Hive Tables
df.write.saveAsTable("hive_table")

df.write.format("parquet").saveAsTable("non_bucketed_table")

## Write Partitioned Data
df.write.partitionBy("column").parquet("path/to/output")

df_transactions.coalesce(1).write.mode('overwrite').option("header", "true").csv("TNX_test.csv")

df_transactions.repartition(5).write.mode("overwrite").option("header", "true").csv("TNX_test.csv")
```

In [8]:
# help(spark.read.csv)

```python
# Option 1: read with the file's header, then rename all columns at once

new_column_names = ["column1", "column2", "column3"]

df = spark.read.csv("path/to/csvfile.csv", header=True, inferSchema=True).toDF(*new_column_names)

# Option 2: for files without a header row, rename the default _c0, _c1, ... columns

df = spark.read.csv("path/to/csvfile.csv", header=False, inferSchema=True)

for idx, new_name in enumerate(new_column_names):
    df = df.withColumnRenamed(f"_c{idx}", new_name)

# Option 3: supply the column names (and types) as a DDL schema string

# A schema needs a type for each column; STRING is a placeholder here
schema_ddl = "column1 STRING, column2 STRING, column3 STRING"

# inferSchema is omitted because an explicit schema takes precedence
df = spark.read.option("header", "false") \
               .option("delimiter", ",") \
               .option("quote", "\"") \
               .option("escape", "\"") \
               .schema(schema_ddl) \
               .csv("path/to/csvfile.csv")
```
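
`DataFrameReader.schema` accepts a DDL-formatted string such as `column1 STRING, column2 INT`, so a bare comma-separated list of names is not enough. A small sketch that builds such a string from a list of names, assuming every column should be read as `STRING` (an assumption made only for this example). `ddl_schema` is a hypothetical helper, not a PySpark function:

```python
def ddl_schema(column_names, default_type="STRING"):
    """Build a DDL schema string, e.g. 'a STRING, b STRING'."""
    return ", ".join(f"{name} {default_type}" for name in column_names)


schema = ddl_schema(["column1", "column2", "column3"])
print(schema)
```

The resulting string can be passed to `.schema(...)` on the reader; real data would normally need per-column types rather than a single default.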

- <b style="color:magenta">How to read data from MySQL in Apache Spark?</b>

- **Download the MySQL JDBC Driver**:

Download MySQL Connector/J (the JDBC driver) from the official MySQL website.
Choose the version that matches your environment (e.g., `mysql-connector-java-8.0.30.jar`).

- **Ensure the Driver is Available to Spark**:

Place the downloaded `mysql-connector-java-8.0.xx.jar` file in a directory accessible by your Spark environment.
Note the full path to this JAR file.

- **Modify Your Spark Submit Command**:

    - Use the `--jars` option to include the MySQL JDBC driver when you submit your Spark job.
    - `spark-submit --jars /path/to/mysql-connector-java-8.0.xx.jar scripts/read_mysql.py`

In [3]:
# JDBC URL format: jdbc:mysql://<host>:<port>/<database>
jdbc_url = "jdbc:mysql://localhost:3306/interview_questions"

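# Connection properties
# Credentials are read from environment variables instead of being hardcoded;
# the names MYSQL_USER and MYSQL_PASSWORD are assumptions, use whatever your setup defines
connection_properties = {
    "user": os.environ["MYSQL_USER"],
    "password": os.environ["MYSQL_PASSWORD"],
    "driver": "com.mysql.cj.jdbc.Driver"
}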

In [5]:
from pyspark.sql import SparkSession

# Initialize Spark Session
spark = SparkSession.builder \
    .appName("MySQL Spark Integration") \
    .config("spark.jars", "/path/to/mysql-connector-java-8.0.xx.jar") \
    .getOrCreate()


# Reading table from MySQL
df = spark.read.jdbc(url=jdbc_url, table="Employee", properties=connection_properties)

# Show the DataFrame content
df.show()

# Stop Spark Session
spark.stop()


##### Convert from `*.parquet` to `*.csv`

In [None]:
DATA_DIR = os.environ['DATA'] + '/Spark_Experiments'
transactions_file_pq = DATA_DIR + "/transactions.parquet"

In [5]:
df_transactions = spark.read.parquet(transactions_file_pq)
df_transactions.show()

                                                                                

In [13]:
df_transactions.coalesce(1).write.mode('overwrite').option("header", "true").csv(DATA_DIR + "/TNX_test.csv")
# df_transactions.repartition(5).write.mode("overwrite").option("header", "true").csv(DATA_DIR + "/TNX_test.csv")

                                                                                

In [14]:
transactions_file_csv = DATA_DIR + "/TNX_test.csv"
df_transactions_csv = spark.read.csv(transactions_file_csv, header=True, inferSchema=True)
df_transactions_csv.show()

+----------+----------+----------+---------------+----------+----+-----+---+-------------+-----+-------------+
|   cust_id|start_date|  end_date|         txn_id|      date|year|month|day| expense_type|  amt|         city|
+----------+----------+----------+---------------+----------+----+-----+---+-------------+-----+-------------+
|CB4ONXHMAX|2012-05-01|      null|TV9Z6GK6TGNU830|2017-10-05|2017|   10|  5|    Groceries|37.01|san_francisco|
|CFO5OH0CZ2|2013-02-01|      null|TV24915NS0DVU5O|2014-02-05|2014|    2|  5| Motor/Travel| 6.18|       denver|
|C0YDPQWPBJ|2012-01-01|      null|TPLK255YSER4EAT|2016-09-23|2016|    9| 23|Entertainment|21.15| philadelphia|
|C0YDPQWPBJ|2011-05-01|2020-09-01|T6GASJ61JA491KC|2015-08-06|2015|    8|  6| Motor/Travel|12.64|      chicago|
|C0YDPQWPBJ|2010-07-01|2019-05-01|TYK1IXM9VOV6OXD|2012-02-19|2012|    2| 19| Motor/Travel| 65.2|     portland|
|C0YDPQWPBJ|2012-05-01|2020-07-01|TEW93GN8XIHW3KP|2019-10-19|2019|   10| 19|    Groceries| 6.23|      chicago|

##### Convert from `*.csv` to `*.parquet`

In [17]:
transactions_file_csv = DATA_DIR + "/TNX.csv"
df_transactions = spark.read.csv(transactions_file_csv, header=True, inferSchema=True)
df_transactions.repartition(5).write.mode("overwrite").parquet(DATA_DIR + "/TNX_test.parquet")

                                                                                