## Setup: Create a Spark Session and Input Data {#setup-create-a-spark-session-and-input-data}

> 📖 Read the full article: [Writing Safer PySpark Queries with Parameters](https://codecut.ai/pyspark-sql-enhancing-reusability-with-parameterized-queries/)


We'll begin by creating a Spark session and building a sample Spark DataFrame from a pandas DataFrame. For other common ways to build DataFrames in PySpark, see [this guide on creating PySpark DataFrames](https://codecut.ai/3-powerful-ways-to-create-pyspark-dataframes/).

```python
from datetime import date
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build a pandas DataFrame, then convert it to a Spark DataFrame
item_price_pandas = pd.DataFrame(
    {
        "item_id": [1, 2, 3, 4],
        "price": [4, 2, 5, 1],
        "transaction_date": [
            date(2025, 1, 15),
            date(2025, 2, 1),
            date(2025, 3, 10),
            date(2025, 4, 22),
        ],
    }
)

item_price = spark.createDataFrame(item_price_pandas)
item_price.show()
```

**Output**

```python
+-------+-----+----------------+
|item_id|price|transaction_date|
+-------+-----+----------------+
|      1|    4|      2025-01-15|
|      2|    2|      2025-02-01|
|      3|    5|      2025-03-10|
|      4|    1|      2025-04-22|
+-------+-----+----------------+
```

## Traditional PySpark Query Approach {#traditional-pyspark-query-approach}

The traditional approach uses f-strings to build SQL, which is not ideal because:

- **Security Risk**: Interpolated strings can expose your query to SQL injection.
- **Limited Flexibility**: F-strings can't handle Python objects like DataFrames directly, so you have to create temporary views and manually quote values like dates to match SQL syntax.

```python
item_price.createOrReplaceTempView("item_price_view")
transaction_date_str = "2025-02-15"

query_with_fstring = f"""SELECT *
FROM item_price_view
WHERE transaction_date > '{transaction_date_str}'
"""

spark.sql(query_with_fstring).show()
```

**Output**

```python
+-------+-----+----------------+
|item_id|price|transaction_date|
+-------+-----+----------------+
|      3|    5|      2025-03-10|
|      4|    1|      2025-04-22|
+-------+-----+----------------+
```
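
To make the injection risk concrete, the sketch below rebuilds the same query text in plain Python, with no Spark session required. The `build_query` helper and the hostile input value are hypothetical, chosen only for illustration.

```python
def build_query(transaction_date_str: str) -> str:
    """Build the SQL text by interpolation, as in the f-string example above."""
    return (
        "SELECT *\n"
        "FROM item_price_view\n"
        f"WHERE transaction_date > '{transaction_date_str}'"
    )


expected_input = "2025-02-15"
# Hypothetical attacker-controlled value containing a stray single quote
hostile_input = "2025-02-15' OR '1'='1"

print(build_query(expected_input))
print(build_query(hostile_input))
```

With the hostile input, the quote inside the value closes the string literal early, and the trailing `OR '1'='1'` becomes part of the SQL itself, so the filter would match every row.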



## Parameterized Queries with PySpark Custom String Formatting {#parameterized-queries-with-pyspark-custom-string-formatting}

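PySpark supports parameterized SQL through custom string formatting: the query contains `{name}` placeholders, and the values are passed to `spark.sql()` as keyword arguments, which keeps the SQL logic separate from the parameter values. When the query is processed, each value is handled as a typed literal and placed into the SQL parse tree, as sketched below, which helps prevent injection attacks and keeps data types correct.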

```python
Query
├── SELECT
│   └── *
├── FROM
│   └── {item_price}
└── WHERE
    └── Condition
        ├── Left: transaction_date
        ├── Operator: >
        └── Right: {transaction_date}
```

Because each value is handled according to its data type rather than pasted in as raw text:

- `item_price` can be passed directly without creating a temporary view
- `transaction_date` does not need to be manually wrapped in single quotes

```python
parametrized_query = """SELECT *
FROM {item_price}
WHERE transaction_date > {transaction_date}
"""

spark.sql(
    parametrized_query, item_price=item_price, transaction_date=transaction_date_str
).show()
```

**Output**

```python
+-------+-----+----------------+
|item_id|price|transaction_date|
+-------+-----+----------------+
|      3|    5|      2025-03-10|
|      4|    1|      2025-04-22|
+-------+-----+----------------+
```
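
One way to picture what happens to a string value is a formatter that quotes and escapes it before it reaches the SQL text. The following is a toy stand-in written with Python's `string.Formatter`; `QuotingFormatter` is a hypothetical name, and its escaping rule is a simplification rather than PySpark's actual implementation.

```python
import string


class QuotingFormatter(string.Formatter):
    """Toy formatter that renders str values as quoted, escaped SQL literals."""

    def get_field(self, field_name, args, kwargs):
        value, used_key = super().get_field(field_name, args, kwargs)
        if isinstance(value, str):
            # Escape backslashes first, then single quotes, then add the quotes
            escaped = value.replace("\\", "\\\\").replace("'", "\\'")
            value = f"'{escaped}'"
        return value, used_key


template = "SELECT * FROM item_price_view WHERE transaction_date > {transaction_date}"
formatter = QuotingFormatter()

normal = formatter.format(template, transaction_date="2025-02-15")
# Hypothetical hostile value: its quotes are escaped, so it stays one literal
hostile = formatter.format(template, transaction_date="2025-02-15' OR '1'='1")

print(normal)
print(hostile)
```

In this sketch the caller never writes quotes around the value, and a stray quote inside the value cannot end the literal early.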

## Parameterized Queries with Parameter Markers {#parameterized-queries-with-parameter-markers}

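Custom string formatting has a limitation with non-string values. A Python `date` such as `date(2023, 2, 15)` is inserted as the unquoted text `2023-02-15`, which SQL parses as the arithmetic expression `2023 - 2 - 15` rather than a date, causing a type mismatch error.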

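```python
transaction_date = date(2023, 2, 15)

parametrized_query = """SELECT *
FROM {item_price}
WHERE transaction_date > {transaction_date}
"""

# Fails: the date is inserted unquoted and parsed as 2023 - 2 - 15
spark.sql(
    parametrized_query, item_price=item_price, transaction_date=transaction_date
).show()
```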

**Output**

```python
[DATATYPE_MISMATCH.BINARY_OP_DIFF_TYPES] Cannot resolve "(transaction_date > ((2023 - 2) - 15))" due to data type mismatch
```
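
The error message hints at the cause. As a rough illustration using plain `str.format` (a stand-in for the formatting step, not PySpark's own code), a `date` renders as unquoted text that reads like subtraction:

```python
from datetime import date

template = "SELECT * FROM item_price_view WHERE transaction_date > {transaction_date}"

# A date formats to its ISO string, with no surrounding quotes
rendered = template.format(transaction_date=date(2023, 2, 15))
print(rendered)

# Read as arithmetic, the unquoted text 2023-02-15 is two subtractions
as_arithmetic = 2023 - 2 - 15
print(as_arithmetic)
```

The comparison therefore ends up between a DATE column and an integer expression, which matches the `BINARY_OP_DIFF_TYPES` error shown above.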

Parameter markers preserve type information, so `date` objects are passed as proper SQL DATE literals. This allows you to safely use Python dates without formatting or quoting them manually.

```python
query_with_markers = """SELECT *
FROM {item_price}
WHERE transaction_date > :transaction_date
"""

transaction_date = date(2025, 2, 15)

spark.sql(
    query_with_markers,
    item_price=item_price,
    args={"transaction_date": transaction_date},
).show()
```

## Make PySpark SQL Easier to Reuse {#make-pyspark-sql-easier-to-reuse}

Parameterized SQL templates are easier to reuse across your codebase. Instead of copying and pasting full SQL strings with values hardcoded inside, you can define flexible query templates that accept different input variables.

Here's a reusable query to filter using different transaction dates:

```python
transaction_date_1 = date(2025, 3, 9)

spark.sql(
    query_with_markers,
    item_price=item_price,
    args={"transaction_date": transaction_date_1},
).show()
```

**Output**

```python
+-------+-----+----------------+
|item_id|price|transaction_date|
+-------+-----+----------------+
|      3|    5|      2025-03-10|
|      4|    1|      2025-04-22|
+-------+-----+----------------+
```

You can easily change the filter with a different date:

```python
transaction_date_2 = date(2025, 3, 15)

spark.sql(
    query_with_markers,
    item_price=item_price,
    args={"transaction_date": transaction_date_2},
).show()
```

**Output**

```python
+-------+-----+----------------+
|item_id|price|transaction_date|
+-------+-----+----------------+
|      4|    1|      2025-04-22|
+-------+-----+----------------+
```
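
One possible way to package this pattern is a small helper that owns the template and exposes only the cutoff date. The `items_after` name and the `QUERY_WITH_MARKERS` constant are hypothetical; the sketch assumes `spark` is a `SparkSession` and `item_price` is the Spark DataFrame created in the setup section.

```python
from datetime import date

# Same template as query_with_markers above, kept in one place
QUERY_WITH_MARKERS = """SELECT *
FROM {item_price}
WHERE transaction_date > :transaction_date
"""


def items_after(spark, item_price, cutoff: date):
    """Return the rows of item_price dated strictly after cutoff.

    Hypothetical helper: `spark` is assumed to be a SparkSession and
    `item_price` a Spark DataFrame with a transaction_date column.
    """
    return spark.sql(
        QUERY_WITH_MARKERS,
        item_price=item_price,
        args={"transaction_date": cutoff},
    )


# Usage, assuming `spark` and `item_price` from the setup section:
# for cutoff in [date(2025, 3, 9), date(2025, 3, 15)]:
#     items_after(spark, item_price, cutoff).show()
```

Passing the session in as an argument, rather than relying on a global, is a design choice that also makes the helper easier to exercise in tests.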

## Easier Unit Testing with PySpark Parameterized Queries {#easier-unit-testing-with-pyspark-parameterized-queries}

Parameterization also simplifies testing by letting you pass different inputs into a reusable query string.

For example, in the code below, we define a function that takes a DataFrame and a threshold value, then filters rows using a parameterized query.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()


def filter_by_price_threshold(df, amount):
    """Return the rows of df whose price is greater than amount."""
    return spark.sql(
        "SELECT * FROM {df} WHERE price > :amount", df=df, args={"amount": amount}
    )
```

Because the values are passed separately from the SQL logic, we can easily reuse and test this function with different parameters without rewriting the query itself.

```python
def test_query_return_correct_number_of_rows():
    # Create test input DataFrame
    df = spark.createDataFrame(
        [
            ("Product 1", 10.0, 5),
            ("Product 2", 15.0, 3),
            ("Product 3", 8.0, 2),
        ],
        ["name", "price", "quantity"],
    )

    # Execute query with parameters
    assert filter_by_price_threshold(df, 10).count() == 1
    assert filter_by_price_threshold(df, 8).count() == 2
```

For more tips on validating DataFrame outputs effectively, see [best practices for PySpark DataFrame comparison and testing](https://codecut.ai/best-practices-for-pyspark-dataframe-comparison-testing/).