**drop():**

In PySpark, the `drop()` transformation is used to remove one or more columns from a DataFrame. This can be useful when you no longer need certain columns in your DataFrame for analysis or when you're preparing the data for further transformations or writing it to an output.

**Syntax:**

```
DataFrame.drop(*cols)
```

-   *cols: A variable-length argument that specifies the columns to drop. You can pass a single column name as a string, or several column names as separate string arguments. A Python list is not accepted directly; unpack it with `*` (for example, `df.drop(*names)`).

**Example Usage:**

**1. Dropping a single column:**

In [None]:
import findspark
findspark.init()
import getpass
from pyspark.sql import SparkSession

username = getpass.getuser()
spark = SparkSession. \
    builder. \
    config("spark.sql.catalogImplementation", "hive"). \
    config("spark.sql.warehouse.dir",f"/Users/{username}/Documents/data/warehouse"). \
    enableHiveSupport(). \
    master("local"). \
    getOrCreate()

In [2]:
# Sample DataFrame
data = [("Alice", 25, "Female"), ("Bob", 30, "Male"), ("Charlie", 35, "Male")]
columns = ["name", "age", "gender"]

In [3]:
df = spark.createDataFrame(data, columns)

In [4]:
# Show original DataFrame
df.show()

+-------+---+------+
|   name|age|gender|
+-------+---+------+
|  Alice| 25|Female|
|    Bob| 30|  Male|
|Charlie| 35|  Male|
+-------+---+------+



In [5]:
# Drop the 'gender' column
df_dropped = df.drop("gender")

In [6]:
# Show the DataFrame after dropping the 'gender' column
df_dropped.show()

+-------+---+
|   name|age|
+-------+---+
|  Alice| 25|
|    Bob| 30|
|Charlie| 35|
+-------+---+



**2. Dropping multiple columns:**

You can also drop multiple columns by passing more than one column name to the drop() method.

In [7]:
# Drop both 'age' and 'gender' columns
df_dropped_multiple = df.drop("age", "gender")

In [8]:
# Show the DataFrame after dropping multiple columns
df_dropped_multiple.show()

+-------+
|   name|
+-------+
|  Alice|
|    Bob|
|Charlie|
+-------+



**Important Notes:**

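-   The drop() method does not modify the original DataFrame in place. It returns a new DataFrame with the specified columns removed.

-   If you pass a column name that does not exist in the DataFrame, drop() does not raise an error. The call is a no-op for that name, and the schema comes back unchanged.

-   drop() is a narrow transformation: it only removes columns from the projection and does not shuffle data between partitions. Like other transformations it is evaluated lazily, so chaining several drop() calls is cheap, and Spark's optimizer will typically collapse them into a single projection.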

**Use Case:**

The drop() transformation is especially useful when you're working with large datasets and want to reduce the memory footprint or eliminate unnecessary columns before performing operations like filtering, aggregating, or writing the data to storage.

**Mastering Column Selection and Expression Evaluation with select() and expr() in PySpark**

In PySpark, `select()` is a transformation that allows you to choose specific columns or expressions from a DataFrame. It's similar to a SELECT statement in SQL. You can use select() to:

-   Select individual columns from a DataFrame.
-   Apply expressions or transformations to one or more columns.
-   Rename columns or create new columns.

**Syntax:**
```
df.select(*cols)
```

-   **df:** The DataFrame you are operating on.
-   **cols:** One or more columns or expressions you want to select.

You can use the `expr()` function inside select() to apply SQL-like expressions to the columns in your DataFrame. expr() parses a SQL expression string and returns a Column, so it can express arithmetic operations, string manipulations, and other column-level transformations. It evaluates a single expression, not a full SQL query.

The expr() function allows you to perform operations like:

-   Mathematical calculations (col1 + col2, col1 * 2, etc.)
-   String functions (upper(col1), concat(col1, col2), etc.)
-   Conditional operations (CASE WHEN, etc.)

**Example:**


In [9]:
# Sample DataFrame
data = [("Alice", 10), ("Bob", 15), ("Catherine", 20)]
df = spark.createDataFrame(data, ["Name", "Age"])

In [12]:
df.show()

+---------+---+
|     Name|Age|
+---------+---+
|    Alice| 10|
|      Bob| 15|
|Catherine| 20|
+---------+---+



In [10]:
df.select("Name").show()

+---------+
|     Name|
+---------+
|    Alice|
|      Bob|
|Catherine|
+---------+



In [11]:
from pyspark.sql.functions import expr

In [13]:
df.select("Age",expr("Age * 2 AS Double_Age")).show()

+---+----------+
|Age|Double_Age|
+---+----------+
| 10|        20|
| 15|        30|
| 20|        40|
+---+----------+



In [15]:
df.select("Name","Age", expr("CASE WHEN Age > 10 THEN 'Adult' ELSE 'Child' END AS Age_Group")).show()

+---------+---+---------+
|     Name|Age|Age_Group|
+---------+---+---------+
|    Alice| 10|    Child|
|      Bob| 15|    Adult|
|Catherine| 20|    Adult|
+---------+---+---------+



**Key Points:**

-   expr() allows you to write SQL expressions, which gives you flexibility in transforming your data.

-   You can use expr() to perform arithmetic, string operations, or conditional logic inside the select() method.

-   expr() is particularly useful when you want to apply complex expressions without having to write multiple functions like col(), when(), etc.

This makes select() with expr() very powerful for complex column manipulations in PySpark.

**Understanding the selectExpr() Transformation in PySpark:**

In PySpark, the selectExpr() transformation is a powerful way to select and transform columns in a DataFrame using SQL expressions. It allows you to use SQL syntax to specify how you want to select or modify the columns, enabling more complex expressions, calculations, and transformations in a concise and readable manner.

**Syntax:**

```
DataFrame.selectExpr(*expr)
```

**Parameters:**

-   *expr: One or more SQL expression strings, passed as separate arguments. Each string can be a plain column name or an expression that includes arithmetic operations, aggregations, and SQL functions.

**Key Features:**

-   **SQL Expressions:** You can use SQL expressions for selecting and transforming the data, similar to how you would write SQL queries.

-   **Column Alias:** You can give new names to the columns (aliases) in the result.

-   **Complex Operations:** You can apply complex transformations, including mathematical operations, aggregations, and conditional logic.

**Example 1: Selecting Columns with Expressions**

In [16]:
# Sample data
data = [(1, "John", 30), (2, "Jane", 25), (3, "Sam", 35)]
columns = ["id", "name", "age"]

In [17]:
# Create DataFrame
df = spark.createDataFrame(data, columns)

In [19]:
df.selectExpr("id", "Name", "Age", "Age * 2 AS Double_Age").show()

+---+----+---+----------+
| id|Name|Age|Double_Age|
+---+----+---+----------+
|  1|John| 30|        60|
|  2|Jane| 25|        50|
|  3| Sam| 35|        70|
+---+----+---+----------+



In this example, the selectExpr method is used to select the columns id, name, and age, plus a transformed version of age (Age * 2) that is aliased as Double_Age. Column names are resolved case-insensitively by default, which is why "Name" and "Age" match the columns created as name and age.

**Example 2: Using SQL Expressions for Complex Logic**

In [20]:
df.selectExpr("id", 
              "Name",
              "Age",
              "CASE WHEN Age > 30 THEN 'Senior' ELSE 'Junior' END AS Age_Group").show()

+---+----+---+---------+
| id|Name|Age|Age_Group|
+---+----+---+---------+
|  1|John| 30|   Junior|
|  2|Jane| 25|   Junior|
|  3| Sam| 35|   Senior|
+---+----+---+---------+



Here, the SQL CASE expression is used to create a new column Age_Group, which labels people as 'Senior' when their age is above 30 and 'Junior' otherwise.

**Example 3: Performing Aggregations**

In [22]:
df.selectExpr("avg(Age) AS average_age", "max(Age) AS max_age").show()

+-----------+-------+
|average_age|max_age|
+-----------+-------+
|       30.0|     35|
+-----------+-------+



In this example, we use selectExpr() to calculate the average and maximum age across the whole DataFrame. Because no grouping is applied, the result is a single row.

**Example 4: Renaming Columns**

You can also rename columns directly in selectExpr():

In [24]:
df.selectExpr("id as User_Id", "Name as User_Name", "Age as User_Age").show()

+-------+---------+--------+
|User_Id|User_Name|User_Age|
+-------+---------+--------+
|      1|     John|      30|
|      2|     Jane|      25|
|      3|      Sam|      35|
+-------+---------+--------+



**Summary:**

-   selectExpr() enables the use of SQL-like expressions to select and transform columns.
-   You can perform mathematical operations, conditional logic, aggregations, and rename columns within the same transformation.
-   It is shorthand for select() with each string wrapped in expr(), so it is more concise than building the same projection from column objects, though not more expressive.

**Using `expr()` Inside `withColumn()` for SQL Expressions in PySpark:**

You can use the expr() function inside the withColumn() transformation in PySpark.

The expr() function allows you to use SQL expressions to define column transformations within PySpark DataFrame operations. This can be useful when you need to perform complex operations or reference columns directly in a SQL-like syntax.

**Example of using expr() inside withColumn():**

In [25]:
# Sample data
data = [("John", 25), ("Alice", 30), ("Bob", 35)]
columns = ["name", "age"]

In [26]:
# Create DataFrame
df = spark.createDataFrame(data, columns)

In [27]:
# Add a new column 'age_double' using expr inside withColumn
df_with_age_double = df.withColumn("age_double", expr("age * 2"))

In [28]:
df_with_age_double.show()

+-----+---+----------+
| name|age|age_double|
+-----+---+----------+
| John| 25|        50|
|Alice| 30|        60|
|  Bob| 35|        70|
+-----+---+----------+



**In this example:**

-   The expr("age * 2") expression multiplies the age column by 2.
-   The result is added to a new column called age_double.

**Common Use Cases:**

-   Mathematical operations (e.g., addition, subtraction, multiplication, division).
-   String manipulations.
-   Conditional logic (e.g., CASE WHEN statements).
-   More complex SQL-style operations that are easier to write with SQL syntax rather than using PySpark's built-in functions.


You can also use expr() for more advanced SQL expressions like:

In [29]:
df_with_column = df.withColumn("age_group", expr("CASE WHEN age < 30 THEN 'Young' ELSE 'Old' END"))

In [30]:
df_with_column.show()

+-----+---+---------+
| name|age|age_group|
+-----+---+---------+
| John| 25|    Young|
|Alice| 30|      Old|
|  Bob| 35|      Old|
+-----+---+---------+



This would create a new column called age_group with conditional values based on the age column.

**Important Notes:**

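-   expr() strings are parsed into the same kind of Catalyst expressions that the equivalent PySpark functions produce, so performance is generally the same. The trade-off is that the expression is an opaque string to Python: typos and invalid references surface only when Spark parses or analyzes it, not in your editor or linter.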
-   You need to ensure that the SQL expression you write is valid and references the correct column names.