# 02. Use PySpark in CASD

In this tutorial, we will learn how to:
- install Spark and PySpark in CASD
- create a Spark session
- read a CSV file
- read a Parquet file


## 1. Install spark and pyspark in CASD

As explained earlier:
- `Apache Spark` is a distributed computation framework written mostly in Scala and Java. It runs in a JVM (Java Virtual Machine).
- `PySpark` is the Python API for Apache Spark.
- PySpark talks to the Spark engine via `Py4J`: your Python code → Py4J (calls are serialized and sent to the JVM) → Spark core executes them → results are sent back to Python.

> Installing the Spark framework is essential. Without it, PySpark will not work.

### 1.1 Install spark framework

CASD provides an installation script (`InstallSpark.ps1`) that installs the latest Spark framework available in CASD, along with the underlying JDK.
You can find this script in `Bureau->Raccourcis->Spark`.

Open a PowerShell terminal and run the commands below:

```powershell
# go to the target folder
cd C:\Users\Public\Desktop\Raccourcis\Spark

# run the installation script
.\InstallSpark.ps1
```

> If everything works well, this script installs Spark in `C:\Users\<your-id>\AppData\Local\spark\spark-3.5.5-bin-hadoop3`. It also installs OpenJDK and winutils, and sets up your Windows environment variables.

Now let's check whether Spark works. Open a new PowerShell terminal and run the command below:

```powershell
# check the installed spark version
spark-shell --version

# it may take a few seconds to show the output, be patient
# expected output
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.5.5
      /_/

Using Scala version 2.12.18, OpenJDK 64-Bit Server VM, 11.0.2
Branch HEAD
Compiled by user ubuntu on 2024-08-06T11:36:15Z
Revision bb7846dd487f259994fdc69e18e03382e3f64f42
Url https://github.com/apache/spark
Type --help for more information.
```
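
If the command is not recognized, the environment variables set by the installer are a reasonable first thing to check. The sketch below does this from Python. The source does not list which variables the script sets, so `SPARK_HOME`, `JAVA_HOME`, and `HADOOP_HOME` are assumptions based on common Spark-on-Windows setups, and `missing_env_vars` is a hypothetical helper name.

```python
import os

# Assumption: these are the variables a typical Spark-on-Windows setup defines.
# The CASD installer may use a different set; adjust the tuple as needed.
EXPECTED_VARS = ("SPARK_HOME", "JAVA_HOME", "HADOOP_HOME")


def missing_env_vars(env=None, names=EXPECTED_VARS):
    """Return the names that are absent or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print(f"Missing environment variables: {', '.join(missing)}")
    else:
        print("All expected environment variables are set.")
```

Run it in a terminal opened after the installation, since environment variables are only picked up by new terminals.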

> If you don't see the Spark output, contact `service@casd.eu`.

### 1.2 Install pyspark

As mentioned before, CASD recommends creating a `python virtual environment` for each of your Python projects.

Suppose we start a new project called `docs_in_paris`. Let's create a Python virtual environment with this name:


```powershell
# create a python virtual environment
conda create --name docs_in_paris python --offline

# activate the virtual environment
conda activate docs_in_paris

# check python version
python -V

# check installed packages
pip list

# install pyspark
pip install pyspark==3.5.5

# check the installed pyspark version
pip show pyspark

# expected output
Name: pyspark
Version: 3.5.5
Summary: Apache Spark Python API
......
```

> Notice that we installed a specific version of PySpark: the PySpark version must match the Spark framework version. The output in `section 1.1` shows **spark-3.5.5**, so we need to install pyspark 3.5.5.
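
The version rule above can also be checked from Python. This is a minimal sketch, not a CASD-provided tool: `parse_version`, `versions_match`, and `check_installed_versions` are hypothetical helper names, and the comparison only looks at the `major.minor.patch` part of each version string.

```python
import re


def parse_version(version: str) -> tuple:
    """Extract (major, minor, patch) from a string such as '3.5.5' or '3.5.5.dev0'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def versions_match(pyspark_version: str, spark_version: str) -> bool:
    """True when both strings describe the same major.minor.patch release."""
    return parse_version(pyspark_version) == parse_version(spark_version)


def check_installed_versions(spark) -> bool:
    """Compare the installed pyspark package with the version reported by a live session."""
    import pyspark  # imported lazily so the helpers above work without pyspark

    return versions_match(pyspark.__version__, spark.version)
```

With a running session named `spark`, `check_installed_versions(spark)` should return `True` when the two installations agree.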

## 2. Create a spark session

A `Spark session` is the entry point to Apache Spark. It lets us interact with Spark’s core engine, whether we are working in Python (PySpark), Scala, Java, or R.


It encapsulates:

- Cluster connection (or local JVM if local mode)
- Configuration settings (memory, partitions, serializer, etc.)
- Access to Spark’s APIs: Spark SQL API, DataFrame and Dataset API, RDD API (via .sparkContext), Streaming and machine learning APIs

To create a spark session, you need to
- import the required module
- configure the spark session settings
- create the spark session instance
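
The three steps above can be sketched as a small helper. The configuration values are illustrative assumptions, not CASD recommendations, and `SESSION_CONF` and `build_local_session` are hypothetical names. The configuration keys themselves are standard Spark settings.

```python
# Assumed example values; tune them for your own workload.
SESSION_CONF = {
    # memory for the driver JVM; likely only takes effect if no JVM is running yet
    "spark.driver.memory": "4g",
    # number of partitions used after shuffles (joins, aggregations)
    "spark.sql.shuffle.partitions": "8",
    # serializer used by the JVM for shuffled and cached data
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}


def build_local_session(app_name, conf=SESSION_CONF, master="local[*]"):
    """Create (or reuse) a Spark session with the given settings."""
    # step 1: import the required module
    from pyspark.sql import SparkSession

    # step 2: configure the session settings
    builder = SparkSession.builder.master(master).appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)

    # step 3: create the session instance
    return builder.getOrCreate()
```

For example, `spark = build_local_session("Use_pyspark_in_CASD")` returns a session, and `spark.conf.get("spark.sql.shuffle.partitions")` reads one of the settings back.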

### 2.1 A minimal Spark session

The code below creates a minimal Spark session in local mode.

```python
from pyspark.sql import SparkSession, DataFrame

# create a spark session in local mode
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("Use_pyspark_in_CASD")
    .getOrCreate()
)

# you can get and set the configuration of your spark session at any moment
# get all conf
spark.sparkContext.getConf().getAll()
```

```text
[('spark.driver.extraJavaOptions',
  '-Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/jdk.internal.ref=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false'),
 ('spark.app.name', 'Use_pyspark_in_CASD'),
 ('spark.executor.id', 'driver'),
 ('spark.app.id', 'local-
```

```python
from pyspark.sql.types import StructType, StringType, StructField, IntegerType

# create a dataframe by using List
dept = [
    ("Alice", "Finance", 10),
    ("Bob", "Marketing", 20),
    ("Charlie", "Sales", 30),
    ("Toto", "IT", 40),
]

# give an explicit schema
deptSchema = StructType([
    StructField('name', StringType(), True),
    StructField('dept_name', StringType(), True),
    StructField('age', IntegerType(), True)
])

# create dataframe
deptDF1 = spark.createDataFrame(data=dept, schema=deptSchema)
deptDF1.printSchema()
deptDF1.show(truncate=False)
```

```text
root
 |-- name: string (nullable = true)
 |-- dept_name: string (nullable = true)
 |-- age: integer (nullable = true)

+-------+---------+---+
|name   |dept_name|age|
+-------+---------+---+
|Alice  |Finance  |10 |
|Bob    |Marketing|20 |
|Charlie|Sales    |30 |
|Toto   |IT       |40 |
+-------+---------+---+
```



> In this tutorial, we focus only on Spark in local mode. CASD also offers Spark in `yarn` and `k8s` modes. For more information, please contact `service@casd.eu`.

## 3. Read a CSV file

```python
csv_sample_file_path = "C:/Users/PLIU/Documents/ubuntu_share/data_set/france_immobilier/transactions_sample.csv"

# header=True uses the first line of the file as column names
# inferSchema=True asks spark to scan the data and infer each column's type
sample_df = spark.read.csv(csv_sample_file_path, header=True, inferSchema=True)

sample_df.show(5)
```

```text
+--------------+----------------+--------+-----------+--------+--------------------+-----------+--------------------+-------------+-----+--------+-----------------+--------------------+----------------+----------------+-------------------+--------------------------+--------------------------+---------------------+-----------------------+
|id_transaction|date_transaction|    prix|departement|id_ville|               ville|code_postal|             adresse|type_batiment| vefa|n_pieces|surface_habitable|id_parcelle_cadastre|        latitude|       longitude|surface_dependances|surface_locaux_industriels|surface_terrains_agricoles|surface_terrains_sols|surface_terrains_nature|
+--------------+----------------+--------+-----------+--------+--------------------+-----------+--------------------+-------------+-----+--------+-----------------+--------------------+----------------+----------------+-------------------+--------------------------+--------------------------+---------------------+-----
```

```python
sample_df.printSchema()
```

```text
root
 |-- id_transaction: integer (nullable = true)
 |-- date_transaction: date (nullable = true)
 |-- prix: double (nullable = true)
 |-- departement: integer (nullable = true)
 |-- id_ville: integer (nullable = true)
 |-- ville: string (nullable = true)
 |-- code_postal: integer (nullable = true)
 |-- adresse: string (nullable = true)
 |-- type_batiment: string (nullable = true)
 |-- vefa: boolean (nullable = true)
 |-- n_pieces: integer (nullable = true)
 |-- surface_habitable: integer (nullable = true)
 |-- id_parcelle_cadastre: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- longitude: double (nullable = true)
 |-- surface_dependances: string (nullable = true)
 |-- surface_locaux_industriels: string (nullable = true)
 |-- surface_terrains_agricoles: string (nullable = true)
 |-- surface_terrains_sols: string (nullable = true)
 |-- surface_terrains_nature: string (nullable = true)
```
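
Schema inference requires an extra pass over the file and guesses types from the data it sees. As an alternative, the sketch below passes an explicit schema as a DDL string, which `spark.read.csv` accepts through its `schema` parameter. The column names and types are copied from the inferred schema printed above. `CSV_COLUMNS`, `to_ddl`, and `read_csv_with_schema` are hypothetical names introduced for this example.

```python
# Column names and types copied from the inferred schema of the sample file.
CSV_COLUMNS = {
    "id_transaction": "INT",
    "date_transaction": "DATE",
    "prix": "DOUBLE",
    "departement": "INT",
    "id_ville": "INT",
    "ville": "STRING",
    "code_postal": "INT",
    "adresse": "STRING",
    "type_batiment": "STRING",
    "vefa": "BOOLEAN",
    "n_pieces": "INT",
    "surface_habitable": "INT",
    "id_parcelle_cadastre": "STRING",
    "latitude": "DOUBLE",
    "longitude": "DOUBLE",
    "surface_dependances": "STRING",
    "surface_locaux_industriels": "STRING",
    "surface_terrains_agricoles": "STRING",
    "surface_terrains_sols": "STRING",
    "surface_terrains_nature": "STRING",
}


def to_ddl(columns):
    """Turn {'a': 'INT', 'b': 'STRING'} into the DDL string 'a INT, b STRING'."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns.items())


def read_csv_with_schema(spark, path, columns=CSV_COLUMNS):
    """Read a CSV file with a fixed schema instead of inferSchema=True."""
    return spark.read.csv(path, header=True, schema=to_ddl(columns))
```

An explicit schema also lets you choose types deliberately. For instance, you may prefer `STRING` for `departement` and `code_postal` so that leading zeros are kept.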



## 4. Read a Parquet file



```python
fr_immo_transaction_path = "C:/Users/PLIU/Documents/git/Seminar_PySpark_Sedona_GeoParquet/data/fr_immo_transaction.parquet"
fr_immo_transactions_df = spark.read.parquet(fr_immo_transaction_path)

# keep only the columns we need
required_col = ["id_transaction","date_transaction","prix","departement","ville","code_postal","adresse","type_batiment","n_pieces","surface_habitable","latitude","longitude"]
clean_fr_immo_df = fr_immo_transactions_df.select(required_col)

# optional: cache the dataframe for better performance
# clean_fr_immo_df.cache()
clean_fr_immo_df.show(5)
```

```text
+--------------+----------------+--------+-----------+--------------------+-----------+--------------------+-------------+--------+-----------------+----------------+----------------+
|id_transaction|date_transaction|    prix|departement|               ville|code_postal|             adresse|type_batiment|n_pieces|surface_habitable|        latitude|       longitude|
+--------------+----------------+--------+-----------+--------------------+-----------+--------------------+-------------+--------+-----------------+----------------+----------------+
|        141653|      2014-01-02|197000.0|         01|             TREVOUX|       1600|  6346 MTE DES LILAS|  Appartement|       4|               84|45.9423014034837|4.77069364742062|
|        141970|      2014-01-02|157500.0|         01|              VIRIAT|       1440|1369 RTE DE STRAS...|       Maison|       4|              103|46.2364072868351|5.26293493674271|
|        139240|      2014-01-02|112000.0|         01|SAINT-JEAN-SUR-VEYLE|     
```

## 5. Clean dataframe

We want to check some basic information about the dataframe:
- Total row count
- Schema (e.g. column names and data types)
- Empty rows (all-null)
- Rows with any missing values
- Duplicate rows
- Rows containing empty strings
- Null count per column
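
The row-level items in this list (total rows, all-null rows, rows with any missing value) can be sketched with `DataFrame.dropna`. `row_level_summary` is a hypothetical helper name. The sketch only counts values that Spark itself treats as missing, so blank strings and placeholder symbols such as `?` are not covered here.

```python
def row_level_summary(df):
    """Count total rows, rows where every column is null, and rows with at least one null."""
    total = df.count()
    # dropna(how="all") removes only the rows whose columns are all null
    all_null_rows = total - df.dropna(how="all").count()
    # dropna(how="any") removes every row that has a null in any column
    rows_with_any_null = total - df.dropna(how="any").count()
    return {
        "total_rows": total,
        "all_null_rows": all_null_rows,
        "rows_with_any_null": rows_with_any_null,
    }
```

Each call to `count()` triggers a Spark job, so on a large dataframe it may be worth caching it first.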

```python
from pyspark.sql.functions import sum as spark_sum
from pyspark.sql.functions import col, when, isnan, trim
import pyspark.sql.types as spark_types

def get_empty_row_count_per_column(df: DataFrame):
    # step1: get the total row count
    totalRowCount = df.count()

    # symbols that are treated as missing values in string columns
    nullSymbols = ["?", "-"]
    aggExpression = []

    # step2: build the condition expressions for detecting the various null cases
    for colName in df.columns:
        c = col(colName)
        colType = df.schema[colName].dataType
        # always test null
        aggExpression.append(when(c.isNull(), 1).otherwise(0).alias(f"{colName}__null"))
        # test isnan only for numeric columns
        if isinstance(colType, spark_types.NumericType):
            aggExpression.append(when(isnan(c), 1).otherwise(0).alias(f"{colName}__nan"))
        # test blank strings and null symbols only for string columns
        if isinstance(colType, spark_types.StringType):
            aggExpression.append(when(trim(c) == "", 1).otherwise(0).alias(f"{colName}__blank"))
            aggExpression.append(when(c.isin(nullSymbols), 1).otherwise(0).alias(f"{colName}__symbol"))

    # show the agg expressions
    for aggExpr in aggExpression:
        print(aggExpr)
    # flag every row with 0/1 for each null case
    flaggedDf = df.select(*aggExpression)
    flaggedDf.show(5)

    # step3: sum all per-column null case flags in one single pass
    try:
        summed = flaggedDf.agg(*[spark_sum(name).alias(name) for name in flaggedDf.columns]).collect()[0].asDict()
    except Exception as e:
        print(f"Aggregation failed on flaggedDf columns: {flaggedDf.columns}: {e}")
        # re-raise, otherwise `summed` would be undefined below
        raise

    # step4: build a list of tuples, one per column, for the final result dataframe
    result = []
    for colName in df.columns:
        # a missing key means the check does not apply to this column type
        nullCount = summed.get(f"{colName}__null", 0)
        nanCount = summed.get(f"{colName}__nan", 0)
        blankCount = summed.get(f"{colName}__blank", 0)
        symbolCount = summed.get(f"{colName}__symbol", 0)
        totalEmpty = nullCount + nanCount + blankCount + symbolCount

        result.append((
            colName, nullCount, nanCount, blankCount,
            symbolCount, totalEmpty, totalRowCount
        ))
    # convert the list of tuples into a new dataframe
    resDf = spark.createDataFrame(result, ["column_name", "null_count", "nan_count", "blank_count",
                                           "null_symbol_count", "total_empty_row_count",
                                           "total_row_count"])
    return resDf
```


```python
def get_duplicated_row_count(df: DataFrame):
    duplicate_row_count = df.count() - df.dropDuplicates().count()
    print(f"Duplicate row count: {duplicate_row_count}")
```


```python
null_col_stats = get_empty_row_count_per_column(clean_fr_immo_df)
```

```text
Column<'CASE WHEN (id_transaction IS NULL) THEN 1 ELSE 0 END AS id_transaction__null'>
Column<'CASE WHEN isnan(id_transaction) THEN 1 ELSE 0 END AS id_transaction__nan'>
Column<'CASE WHEN (date_transaction IS NULL) THEN 1 ELSE 0 END AS date_transaction__null'>
Column<'CASE WHEN (prix IS NULL) THEN 1 ELSE 0 END AS prix__null'>
Column<'CASE WHEN isnan(prix) THEN 1 ELSE 0 END AS prix__nan'>
Column<'CASE WHEN (departement IS NULL) THEN 1 ELSE 0 END AS departement__null'>
Column<'CASE WHEN (trim(departement) = ) THEN 1 ELSE 0 END AS departement__blank'>
Column<'CASE WHEN (departement IN (?, -)) THEN 1 ELSE 0 END AS departement__symbol'>
Column<'CASE WHEN (ville IS NULL) THEN 1 ELSE 0 END AS ville__null'>
Column<'CASE WHEN (trim(ville) = ) THEN 1 ELSE 0 END AS ville__blank'>
Column<'CASE WHEN (ville IN (?, -)) THEN 1 ELSE 0 END AS ville__symbol'>
Column<'CASE WHEN (code_postal IS NULL) THEN 1 ELSE 0 END AS code_postal__null'>
Column<'CASE WHEN isnan(code_postal) THEN 1 ELSE 0 END AS code_pos
```

```python
null_col_stats.show(20)
```

```text
+-----------------+----------+---------+-----------+-----------------+---------------------+---------------+
|      column_name|null_count|nan_count|blank_count|null_symbol_count|total_empty_row_count|total_row_count|
+-----------------+----------+---------+-----------+-----------------+---------------------+---------------+
|   id_transaction|         0|        0|          0|                0|                    0|        9141573|
| date_transaction|         0|        0|          0|                0|                    0|        9141573|
|             prix|         0|        0|          0|                0|                    0|        9141573|
|      departement|         0|        0|          0|                0|                    0|        9141573|
|            ville|         0|        0|          0|                0|                    0|        9141573|
|      code_postal|         0|        0|          0|                0|                    0|        9141573|
|          adresse|
```

```python
get_duplicated_row_count(clean_fr_immo_df)
```

```text
Duplicate row count: 0
```
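
Full-row duplicates are not the only kind worth checking. Two rows may differ in one field yet share the same business key. The sketch below is a variant that returns the count instead of printing it and accepts an optional list of key columns, relying on the `subset` argument of `dropDuplicates`. `count_duplicates` is a hypothetical helper name, and using `id_transaction` as the key is an assumption about this dataset.

```python
def count_duplicates(df, subset=None):
    """Return how many rows dropDuplicates would remove.

    subset=None compares whole rows; a list of column names compares only those columns.
    """
    return df.count() - df.dropDuplicates(subset).count()
```

For example, `count_duplicates(clean_fr_immo_df, ["id_transaction"])` would report how many rows reuse a transaction id that already appears in the dataframe.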


```python
from pyspark.sql.functions import max as spark_max

def has_value(df: DataFrame):
    """Return the names of the columns that contain no usable value at all."""
    exprs = []
    nullSymbols = ["?", "-"]
    for colName in df.columns:
        colRef = col(colName)
        colType = df.schema[colName].dataType
        # base condition, the given column is not null
        conditions = [colRef.isNotNull()]

        # for numeric column
        if isinstance(colType, spark_types.NumericType):
            conditions.append(~isnan(colRef))

        # for string column
        if isinstance(colType, spark_types.StringType):
            conditions.append(trim(colRef) != "")
            conditions.append(~colRef.isin(nullSymbols))

        # build final filter condition
        hasValCond = conditions[0]
        for cond in conditions[1:]:
            hasValCond = hasValCond & cond
        # the max flag is 1 as soon as one row holds a usable value
        exprs.append(spark_max(when(hasValCond, 1).otherwise(0)).alias(colName))
    result = df.agg(*exprs).collect()[0].asDict()
    # a column whose max flag is 0 has no row with a usable value
    return [c for c, flag in result.items() if flag == 0]
```

```python
has_value(clean_fr_immo_df)
```

```text
[]
```


