# Comprehensive PySpark DataFrame Guide

This notebook covers essential DataFrame operations in PySpark, including:
- Creating DataFrames from CSV files
- Creating DataFrames from RDDs
- Working with Schemas
- Selecting columns
- Modifying DataFrames with withColumn
- Renaming columns
- Filtering rows

---

## 1. Initialize Spark Session

First, we create a Spark session, which is the entry point for all DataFrame operations.

In [17]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import col, lit

# Create Spark session
spark = SparkSession.builder.appName("ComprehensiveDataFrameGuide").getOrCreate()
print("Spark Session Created Successfully!")

Spark Session Created Successfully!


## 2. Creating DataFrames from CSV Files

### 2.1 Basic CSV Read with Header

In [49]:
# Read CSV file with header
df = spark.read.options(header='True').csv("StudentData.csv")
print("Basic DataFrame loaded:")
df.show()

Basic DataFrame loaded:
+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB| 02984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud| 72409|   85|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC| 81492|   64|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP| 92882|   51|Judie Chipps_Clem...|
| 29|Female|       Kena Wild|   DSA|102285|   35|Dust

### 2.2 CSV Read with Schema Inference

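Setting `inferSchema='True'` makes Spark detect column data types automatically, at the cost of an extra pass over the file. Note in the output that `roll` is inferred as an integer, so the leading zero in `02984` is lost.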

In [19]:
# Read CSV with inferred schema
df_inferred = spark.read.options(inferSchema='True', header='True', delimiter=',').csv("StudentData.csv")
print("DataFrame with inferred schema:")
df_inferred.printSchema()
df_inferred.show()

DataFrame with inferred schema:
root
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- name: string (nullable = true)
 |-- course: string (nullable = true)
 |-- roll: integer (nullable = true)
 |-- marks: integer (nullable = true)
 |-- email: string (nullable = true)

+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB|  2984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund

## 3. Working with Schemas

### 3.1 Define Custom Schema

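Defining a schema explicitly gives you control over each column's data type and nullability, and it avoids the inference pass. Here `roll` is declared as a string so that values such as `02984` keep their leading zero.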

In [20]:
# Define custom schema
student_schema = StructType([
    StructField("age", IntegerType(), True),
    StructField("gender", StringType(), True),
    StructField("name", StringType(), True),
    StructField("course", StringType(), True),
    StructField("roll", StringType(), True),
    StructField("marks", IntegerType(), True),
    StructField("email", StringType(), True)
])

print("Custom schema defined successfully!")

Custom schema defined successfully!
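
As an alternative sketch, the same column types can be written as a DDL-formatted string, which `DataFrameReader.schema()` also accepts. The names `STUDENT_SCHEMA_DDL` and `read_students` are hypothetical helpers introduced here for illustration, and the function assumes an existing `spark` session and the same `StudentData.csv` file.

```python
# DDL string mirroring student_schema: roll stays a STRING to keep leading zeros.
STUDENT_SCHEMA_DDL = (
    "age INT, gender STRING, name STRING, course STRING, "
    "roll STRING, marks INT, email STRING"
)


def read_students(spark, path="StudentData.csv"):
    """Read the CSV with the DDL schema instead of a StructType (hypothetical helper)."""
    return spark.read.option("header", True).schema(STUDENT_SCHEMA_DDL).csv(path)


# List the declared column names and types without starting Spark.
declared = [field.split() for field in STUDENT_SCHEMA_DDL.split(", ")]
print(declared)
```

A `StructType` remains the better fit when a field needs explicit nullability or metadata; the string form is mainly a shorthand.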


### 3.2 Read CSV with Custom Schema

In [21]:
# Read CSV with custom schema
df_schema = spark.read.options(header='True').schema(student_schema).csv("StudentData.csv")
print("DataFrame with custom schema:")
df_schema.printSchema()
df_schema.show()

DataFrame with custom schema:
root
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- name: string (nullable = true)
 |-- course: string (nullable = true)
 |-- roll: string (nullable = true)
 |-- marks: integer (nullable = true)
 |-- email: string (nullable = true)

+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB| 02984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El

## 4. Creating DataFrames from RDDs

### 4.1 Read Data as RDD

In [22]:
from pyspark import SparkContext, SparkConf

# Get or create SparkContext
conf = SparkConf().setAppName("RDD to DataFrame")
sc = SparkContext.getOrCreate(conf=conf)

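# Read file as RDD, drop the header, and split each line into fields.
# age (index 0) and marks (index 5) are converted to int to match student_schema.
rdd_raw = sc.textFile("StudentData.csv")
header_row = rdd_raw.first()
rdd_data = (rdd_raw.filter(lambda x: x != header_row)
            .map(lambda x: x.split(","))
            .map(lambda f: [int(f[0]), *f[1:5], int(f[5]), *f[6:]]))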

print("RDD created successfully!")
print(f"Header: {header_row}")
print(f"Number of records: {rdd_data.count()}")

RDD created successfully!
Header: age,gender,name,course,roll,marks,email
Number of records: 1000


### 4.2 Convert RDD to DataFrame with Schema

In [24]:
# Create DataFrame from RDD using custom schema
df_from_rdd = spark.createDataFrame(rdd_data, schema=student_schema)
print("DataFrame created from RDD:")
df_from_rdd.printSchema()
# df_from_rdd.show()

DataFrame created from RDD:
root
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- name: string (nullable = true)
 |-- course: string (nullable = true)
 |-- roll: string (nullable = true)
 |-- marks: integer (nullable = true)
 |-- email: string (nullable = true)
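
The schema above declares `age` and `marks` as integers, while splitting a text line yields only strings, so each row has to carry Python `int` values in those positions before Spark can validate it against the schema. A minimal sketch of such a conversion follows. `parse_student_line` is a hypothetical helper, and using Python's `csv` module instead of `split(",")` is an assumption made here so that quoted fields containing commas would survive.

```python
import csv


def parse_student_line(line):
    """Split one CSV line and convert it to match student_schema's types."""
    # csv.reader handles quoted fields that a plain split(",") would break apart.
    age, gender, name, course, roll, marks, email = next(csv.reader([line]))
    # age and marks become ints; roll stays a string to keep its leading zero.
    return (int(age), gender, name, course, roll, int(marks), email)


# Illustrative line in the same column order as the header (the email is made up).
sample = "28,Female,Hubert Oliveras,DB,02984,59,someone@example.com"
parsed = parse_student_line(sample)
print(parsed)
```

With the RDD from section 4.1, this could be applied as `rdd_raw.filter(lambda x: x != header_row).map(parse_student_line)` before calling `spark.createDataFrame(..., schema=student_schema)`.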



## 5. Selecting DataFrame Columns

There are multiple ways to select columns from a DataFrame.

### 5.1 Select Columns by Name

In [28]:
# Use the inferred schema DataFrame for selection operations
print("Selecting 'name' and 'gender' columns:")
df_inferred.select("name", "gender").show()

Selecting 'name' and 'gender' columns:
+----------------+------+
|            name|gender|
+----------------+------+
| Hubert Oliveras|Female|
|Toshiko Hillyard|Female|
|  Celeste Lollis|  Male|
|    Elenore Choy|Female|
|  Sheryll Towler|  Male|
|  Margene Moores|  Male|
|     Neda Briski|  Male|
|    Claude Panos|Female|
|  Celeste Lollis|  Male|
|  Cordie Harnois|  Male|
|       Kena Wild|Female|
| Ernest Rossbach|  Male|
|  Latia Vanhoose|Female|
|  Latia Vanhoose|Female|
|     Neda Briski|  Male|
|  Latia Vanhoose|Female|
|  Loris Crossett|  Male|
|  Annika Hoffman|  Male|
|   Santa Kerfien|  Male|
|Mickey Cortright|Female|
+----------------+------+
only showing top 20 rows


### 5.2 Select Columns Using DataFrame Attributes

In [29]:
# Selecting columns using dot notation
print("Selecting columns using DataFrame attributes:")
df_inferred.select(df_inferred.name, df_inferred.gender).show()

Selecting columns using DataFrame attributes:
+----------------+------+
|            name|gender|
+----------------+------+
| Hubert Oliveras|Female|
|Toshiko Hillyard|Female|
|  Celeste Lollis|  Male|
|    Elenore Choy|Female|
|  Sheryll Towler|  Male|
|  Margene Moores|  Male|
|     Neda Briski|  Male|
|    Claude Panos|Female|
|  Celeste Lollis|  Male|
|  Cordie Harnois|  Male|
|       Kena Wild|Female|
| Ernest Rossbach|  Male|
|  Latia Vanhoose|Female|
|  Latia Vanhoose|Female|
|     Neda Briski|  Male|
|  Latia Vanhoose|Female|
|  Loris Crossett|  Male|
|  Annika Hoffman|  Male|
|   Santa Kerfien|  Male|
|Mickey Cortright|Female|
+----------------+------+
only showing top 20 rows


### 5.3 Select Columns Using col() Function

In [30]:
# Selecting columns using col function (recommended approach)
print("Selecting columns using col() function:")
df_inferred.select(col("name"), col("gender")).show()

Selecting columns using col() function:
+----------------+------+
|            name|gender|
+----------------+------+
| Hubert Oliveras|Female|
|Toshiko Hillyard|Female|
|  Celeste Lollis|  Male|
|    Elenore Choy|Female|
|  Sheryll Towler|  Male|
|  Margene Moores|  Male|
|     Neda Briski|  Male|
|    Claude Panos|Female|
|  Celeste Lollis|  Male|
|  Cordie Harnois|  Male|
|       Kena Wild|Female|
| Ernest Rossbach|  Male|
|  Latia Vanhoose|Female|
|  Latia Vanhoose|Female|
|     Neda Briski|  Male|
|  Latia Vanhoose|Female|
|  Loris Crossett|  Male|
|  Annika Hoffman|  Male|
|   Santa Kerfien|  Male|
|Mickey Cortright|Female|
+----------------+------+
only showing top 20 rows


### 5.4 Select All Columns

In [31]:
# Selecting all columns using wildcard
print("Selecting all columns:")
df_inferred.select('*').show()

Selecting all columns:
+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB|  2984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF| 21267|   45|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC| 52771|   32|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud| 72409|   85|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC| 81492|   64|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP| 92882|   51|Judie Chipps_Clem...|
| 29|Female|       Kena Wild|   DSA|102285|   35|Dusti

### 5.5 Select Columns by Index Range

In [32]:
# Select columns by position; the slice [2:6] covers indices 2 through 5
print("Selecting columns by index (2 to 5):")
df_inferred.select(df_inferred.columns[2:6]).show()

Selecting columns by index (2 to 5):
+----------------+------+------+-----+
|            name|course|  roll|marks|
+----------------+------+------+-----+
| Hubert Oliveras|    DB|  2984|   59|
|Toshiko Hillyard| Cloud| 12899|   62|
|  Celeste Lollis|    PF| 21267|   45|
|    Elenore Choy|    DB| 32877|   29|
|  Sheryll Towler|   DSA| 41487|   41|
|  Margene Moores|   MVC| 52771|   32|
|     Neda Briski|   OOP| 61973|   69|
|    Claude Panos| Cloud| 72409|   85|
|  Celeste Lollis|   MVC| 81492|   64|
|  Cordie Harnois|   OOP| 92882|   51|
|       Kena Wild|   DSA|102285|   35|
| Ernest Rossbach|    DB|111449|   53|
|  Latia Vanhoose|    DB|122502|   27|
|  Latia Vanhoose|   MVC|132110|   55|
|     Neda Briski|    PF|141770|   42|
|  Latia Vanhoose|    DB|152159|   27|
|  Loris Crossett|   MVC|161771|   36|
|  Annika Hoffman|   OOP|171660|   22|
|   Santa Kerfien|    PF|182129|   56|
|Mickey Cortright|    DB|192537|   62|
+----------------+------+------+-----+
only showing top 20 rows
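
Because `df.columns` is an ordinary Python list, the slice follows Python's rules: the start index is included and the end index is excluded. The sketch below uses a plain list with the column names from the header shown earlier, so it runs without Spark.

```python
# Column names in the order shown by the header of StudentData.csv.
columns = ["age", "gender", "name", "course", "roll", "marks", "email"]

# [2:6] takes indices 2, 3, 4 and 5; index 6 ("email") is excluded.
selected = columns[2:6]
print(selected)  # ['name', 'course', 'roll', 'marks']

# A negative index counts from the end, so this takes the last two columns.
last_two = columns[-2:]
print(last_two)  # ['marks', 'email']
```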


### 5.6 Create New DataFrame with Selected Columns

In [33]:
# Creating a new DataFrame with specific columns
df_selected = df_inferred.select(col("roll"), col("name"), col("marks"))
print("New DataFrame with selected columns:")
df_selected.show()

New DataFrame with selected columns:
+------+----------------+-----+
|  roll|            name|marks|
+------+----------------+-----+
|  2984| Hubert Oliveras|   59|
| 12899|Toshiko Hillyard|   62|
| 21267|  Celeste Lollis|   45|
| 32877|    Elenore Choy|   29|
| 41487|  Sheryll Towler|   41|
| 52771|  Margene Moores|   32|
| 61973|     Neda Briski|   69|
| 72409|    Claude Panos|   85|
| 81492|  Celeste Lollis|   64|
| 92882|  Cordie Harnois|   51|
|102285|       Kena Wild|   35|
|111449| Ernest Rossbach|   53|
|122502|  Latia Vanhoose|   27|
|132110|  Latia Vanhoose|   55|
|141770|     Neda Briski|   42|
|152159|  Latia Vanhoose|   27|
|161771|  Loris Crossett|   36|
|171660|  Annika Hoffman|   22|
|182129|   Santa Kerfien|   56|
|192537|Mickey Cortright|   62|
+------+----------------+-----+
only showing top 20 rows


## 6. Modifying DataFrames with withColumn

The `withColumn()` method is used to add new columns or modify existing ones.

### 6.1 Cast Column Type

In [34]:
# Cast 'roll' column from Integer to String
df_modified = df_inferred.withColumn("roll", col("roll").cast("String"))
print("Schema after casting 'roll' to String:")
df_modified.printSchema()

Schema after casting 'roll' to String:
root
 |-- age: integer (nullable = true)
 |-- gender: string (nullable = true)
 |-- name: string (nullable = true)
 |-- course: string (nullable = true)
 |-- roll: string (nullable = true)
 |-- marks: integer (nullable = true)
 |-- email: string (nullable = true)
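
Casting back to a string does not restore anything that schema inference discarded. The plain-Python sketch below mirrors what happens to the first roll value, `02984`, which appears as `2984` in the inferred DataFrame. It is an analogy for the behaviour, not Spark code.

```python
raw_roll = "02984"            # value as it appears in the CSV file

as_integer = int(raw_roll)    # what an integer column keeps: 2984
cast_back = str(as_integer)   # casting to string afterwards gives "2984"

print(as_integer, repr(cast_back))
print(cast_back == raw_roll)  # False: the leading zero is gone
```

If the zero matters, reading `roll` as a string from the start, as the custom schema in section 3 does, avoids the problem.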



### 6.2 Add Value to Existing Column

In [35]:
# Add 10 marks to all students (bonus marks)
df_bonus = df_modified.withColumn("marks", col("marks") + 10)
print("DataFrame after adding 10 bonus marks:")
df_bonus.show()

DataFrame after adding 10 bonus marks:
+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB|  2984|   69|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   72|Margene Moores_Ma...|
| 28|  Male|  Celeste Lollis|    PF| 21267|   55|Jeannetta Golden_...|
| 29|Female|    Elenore Choy|    DB| 32877|   39|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   51|Claude Panos_Judi...|
| 28|  Male|  Margene Moores|   MVC| 52771|   42|Toshiko Hillyard_...|
| 28|  Male|     Neda Briski|   OOP| 61973|   79|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud| 72409|   95|Sheryll Towler_Al...|
| 28|  Male|  Celeste Lollis|   MVC| 81492|   74|Nicole Harwood_Cl...|
| 29|  Male|  Cordie Harnois|   OOP| 92882|   61|Judie Chipps_Clem...|
| 29|Female|       Kena Wild|   DSA|10

### 6.3 Create New Column Based on Existing Column

In [36]:
# Create a new column 'final_marks' by subtracting 10 from bonus marks
df_final = df_bonus.withColumn("final_marks", col("marks") - 10)
print("DataFrame with new 'final_marks' column:")
df_final.show()

DataFrame with new 'final_marks' column:
+---+------+----------------+------+------+-----+--------------------+-----------+
|age|gender|            name|course|  roll|marks|               email|final_marks|
+---+------+----------------+------+------+-----+--------------------+-----------+
| 28|Female| Hubert Oliveras|    DB|  2984|   69|Annika Hoffman_Na...|         59|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   72|Margene Moores_Ma...|         62|
| 28|  Male|  Celeste Lollis|    PF| 21267|   55|Jeannetta Golden_...|         45|
| 29|Female|    Elenore Choy|    DB| 32877|   39|Billi Clore_Mitzi...|         29|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   51|Claude Panos_Judi...|         41|
| 28|  Male|  Margene Moores|   MVC| 52771|   42|Toshiko Hillyard_...|         32|
| 28|  Male|     Neda Briski|   OOP| 61973|   79|Alberta Freund_El...|         69|
| 28|Female|    Claude Panos| Cloud| 72409|   95|Sheryll Towler_Al...|         85|
| 28|  Male|  Celeste Lollis|   MVC| 81492|   

### 6.4 Multiple Column Additions and Modifications

In [None]:
# Overwrite 'name' with a literal. Each withColumn sees the previous result,
# so 'updated marks' is computed from the already reduced 'marks'.
df_enhanced = df_inferred.withColumn("name", lit("USA"))
df_enhanced = (df_enhanced.withColumn("marks", col("marks") - 10)
               .withColumn("updated marks", col("marks") + 20)
               .withColumn("Country", lit("USA")))
df_enhanced.show()

+---+------+----+------+------+-----+--------------------+-------------+-------+
|age|gender|name|course|  roll|marks|               email|updated marks|Country|
+---+------+----+------+------+-----+--------------------+-------------+-------+
| 28|Female| USA|    DB|  2984|   49|Annika Hoffman_Na...|           69|    USA|
| 29|Female| USA| Cloud| 12899|   52|Margene Moores_Ma...|           72|    USA|
| 28|  Male| USA|    PF| 21267|   35|Jeannetta Golden_...|           55|    USA|
| 29|Female| USA|    DB| 32877|   19|Billi Clore_Mitzi...|           39|    USA|
| 28|  Male| USA|   DSA| 41487|   31|Claude Panos_Judi...|           51|    USA|
| 28|  Male| USA|   MVC| 52771|   22|Toshiko Hillyard_...|           42|    USA|
| 28|  Male| USA|   OOP| 61973|   59|Alberta Freund_El...|           79|    USA|
| 28|Female| USA| Cloud| 72409|   75|Sheryll Towler_Al...|           95|    USA|
| 28|  Male| USA|   MVC| 81492|   54|Nicole Harwood_Cl...|           74|    USA|
| 29|  Male| USA|   OOP| 928
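
The numbers in this output follow from the order of the chained calls: each `withColumn` works on the result of the previous one. As a plain-Python analogy (a dictionary standing in for one row, not Spark code), the first row evolves like this:

```python
# First row of df_inferred, reduced to the columns that change.
row = {"name": "Hubert Oliveras", "marks": 59}

# Each step builds a new dict, much like withColumn returns a new DataFrame.
step1 = {**row, "name": "USA"}                            # overwrite name
step2 = {**step1, "marks": step1["marks"] - 10}           # 59 - 10 = 49
step3 = {**step2, "updated marks": step2["marks"] + 20}   # 49 + 20 = 69
step4 = {**step3, "Country": "USA"}                       # add a constant column

print(step4)
print(row)  # the original row is unchanged
```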

### 6.5 Rename Columns (withColumnRenamed)

In [None]:
# Rename columns
df_withColumnRenamed = df_enhanced.withColumnRenamed("gender", "sex").withColumnRenamed("roll", "roll number")
df_withColumnRenamed.show()

+---+------+----+------+-----------+-----+--------------------+-------------+-------+
|age|   sex|name|course|roll number|marks|               email|updated marks|Country|
+---+------+----+------+-----------+-----+--------------------+-------------+-------+
| 28|Female| USA|    DB|       2984|   49|Annika Hoffman_Na...|           69|    USA|
| 29|Female| USA| Cloud|      12899|   52|Margene Moores_Ma...|           72|    USA|
| 28|  Male| USA|    PF|      21267|   35|Jeannetta Golden_...|           55|    USA|
| 29|Female| USA|    DB|      32877|   19|Billi Clore_Mitzi...|           39|    USA|
| 28|  Male| USA|   DSA|      41487|   31|Claude Panos_Judi...|           51|    USA|
| 28|  Male| USA|   MVC|      52771|   22|Toshiko Hillyard_...|           42|    USA|
| 28|  Male| USA|   OOP|      61973|   59|Alberta Freund_El...|           79|    USA|
| 28|Female| USA| Cloud|      72409|   75|Sheryll Towler_Al...|           95|    USA|
| 28|  Male| USA|   MVC|      81492|   54|Nicole Harwo

In [None]:
# Rename columns using alias
df_final.select(col("name").alias("Full Name")).show()

+----------------+
|       Full Name|
+----------------+
| Hubert Oliveras|
|Toshiko Hillyard|
|  Celeste Lollis|
|    Elenore Choy|
|  Sheryll Towler|
|  Margene Moores|
|     Neda Briski|
|    Claude Panos|
|  Celeste Lollis|
|  Cordie Harnois|
|       Kena Wild|
| Ernest Rossbach|
|  Latia Vanhoose|
|  Latia Vanhoose|
|     Neda Briski|
|  Latia Vanhoose|
|  Loris Crossett|
|  Annika Hoffman|
|   Santa Kerfien|
|Mickey Cortright|
+----------------+
only showing top 20 rows


## 7. Row Filtering

### 7.1 Basic Filtering

In [None]:
# Keep only rows whose course is "DB" (attribute-style column reference)
df.filter(df.course == "DB").show()

+---+------+-----------------+------+-------+-----+--------------------+
|age|gender|             name|course|   roll|marks|               email|
+---+------+-----------------+------+-------+-----+--------------------+
| 28|Female|  Hubert Oliveras|    DB|  02984|   59|Annika Hoffman_Na...|
| 29|Female|     Elenore Choy|    DB|  32877|   29|Billi Clore_Mitzi...|
| 29|  Male|  Ernest Rossbach|    DB| 111449|   53|Maybell Duguay_Ab...|
| 28|Female|   Latia Vanhoose|    DB| 122502|   27|Latia Vanhoose_Mi...|
| 29|Female|   Latia Vanhoose|    DB| 152159|   27|Claude Panos_Sant...|
| 28|Female| Mickey Cortright|    DB| 192537|   62|Ernest Rossbach_M...|
| 28|Female|      Anna Santos|    DB| 311589|   79|Celeste Lollis_Mi...|
| 28|  Male|    Kizzy Brenner|    DB| 381712|   36|Paris Hutton_Kena...|
| 28|  Male| Toshiko Hillyard|    DB| 392218|   47|Leontine Phillips...|
| 29|  Male|     Paris Hutton|    DB| 481229|   57|Clementina Menke_...|
| 28|Female| Mickey Cortright|    DB| 551389|   43|

In [None]:
# Same filter, written with col()
df.filter(col("course") == "DB").show()

+---+------+-----------------+------+-------+-----+--------------------+
|age|gender|             name|course|   roll|marks|               email|
+---+------+-----------------+------+-------+-----+--------------------+
| 28|Female|  Hubert Oliveras|    DB|  02984|   59|Annika Hoffman_Na...|
| 29|Female|     Elenore Choy|    DB|  32877|   29|Billi Clore_Mitzi...|
| 29|  Male|  Ernest Rossbach|    DB| 111449|   53|Maybell Duguay_Ab...|
| 28|Female|   Latia Vanhoose|    DB| 122502|   27|Latia Vanhoose_Mi...|
| 29|Female|   Latia Vanhoose|    DB| 152159|   27|Claude Panos_Sant...|
| 28|Female| Mickey Cortright|    DB| 192537|   62|Ernest Rossbach_M...|
| 28|Female|      Anna Santos|    DB| 311589|   79|Celeste Lollis_Mi...|
| 28|  Male|    Kizzy Brenner|    DB| 381712|   36|Paris Hutton_Kena...|
| 28|  Male| Toshiko Hillyard|    DB| 392218|   47|Leontine Phillips...|
| 29|  Male|     Paris Hutton|    DB| 481229|   57|Clementina Menke_...|
| 28|Female| Mickey Cortright|    DB| 551389|   43|

### 7.2 Filtering with Specific Cases

In [None]:
# Keep rows whose course is any of the listed values
courses = ["DB", "Cloud", "OOP", "DSA"]
df.filter(df.course.isin(courses)).show()

+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB| 02984|   59|Annika Hoffman_Na...|
| 29|Female|Toshiko Hillyard| Cloud| 12899|   62|Margene Moores_Ma...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 28|  Male|     Neda Briski|   OOP| 61973|   69|Alberta Freund_El...|
| 28|Female|    Claude Panos| Cloud| 72409|   85|Sheryll Towler_Al...|
| 29|  Male|  Cordie Harnois|   OOP| 92882|   51|Judie Chipps_Clem...|
| 29|Female|       Kena Wild|   DSA|102285|   35|Dustin Feagins_Ma...|
| 29|  Male| Ernest Rossbach|    DB|111449|   53|Maybell Duguay_Ab...|
| 28|Female|  Latia Vanhoose|    DB|122502|   27|Latia Vanhoose_Mi...|
| 29|Female|  Latia Vanhoose|    DB|152159|   27|Claude Panos_Sant...|
| 29| 

In [50]:
# Keep rows whose course starts with "D"
df.filter(df.course.startswith("D")).show()

+---+------+----------------+------+------+-----+--------------------+
|age|gender|            name|course|  roll|marks|               email|
+---+------+----------------+------+------+-----+--------------------+
| 28|Female| Hubert Oliveras|    DB| 02984|   59|Annika Hoffman_Na...|
| 29|Female|    Elenore Choy|    DB| 32877|   29|Billi Clore_Mitzi...|
| 28|  Male|  Sheryll Towler|   DSA| 41487|   41|Claude Panos_Judi...|
| 29|Female|       Kena Wild|   DSA|102285|   35|Dustin Feagins_Ma...|
| 29|  Male| Ernest Rossbach|    DB|111449|   53|Maybell Duguay_Ab...|
| 28|Female|  Latia Vanhoose|    DB|122502|   27|Latia Vanhoose_Mi...|
| 29|Female|  Latia Vanhoose|    DB|152159|   27|Claude Panos_Sant...|
| 28|Female|Mickey Cortright|    DB|192537|   62|Ernest Rossbach_M...|
| 28|Female|    Jc Andrepont|   DSA|232060|   58|Billi Clore_Abram...|
| 29|Female|    Paris Hutton|   DSA|271472|   99|Sheryll Towler_Al...|
| 28|Female|  Dustin Feagins|   DSA|291984|   82|Abram Nagao_Kena ...|
| 28|F

In [51]:
# Keep rows whose course ends with "A"
df.filter(df.course.endswith("A")).show()

+---+------+----------------+------+-------+-----+--------------------+
|age|gender|            name|course|   roll|marks|               email|
+---+------+----------------+------+-------+-----+--------------------+
| 28|  Male|  Sheryll Towler|   DSA|  41487|   41|Claude Panos_Judi...|
| 29|Female|       Kena Wild|   DSA| 102285|   35|Dustin Feagins_Ma...|
| 28|Female|    Jc Andrepont|   DSA| 232060|   58|Billi Clore_Abram...|
| 29|Female|    Paris Hutton|   DSA| 271472|   99|Sheryll Towler_Al...|
| 28|Female|  Dustin Feagins|   DSA| 291984|   82|Abram Nagao_Kena ...|
| 28|Female|Mickey Cortright|   DSA| 342003|   44|Mitzi Seldon_Jean...|
| 29|Female|     Anna Santos|   DSA| 411479|   42|Kena Wild_Mitzi S...|
| 28|Female|  Maybell Duguay|   DSA| 452141|   29|Leontine Phillips...|
| 29|Female|    Paris Hutton|   DSA| 492159|   60|Nicole Harwood_Ma...|
| 29|  Male|  Celeste Lollis|   DSA| 562065|   85|Jc Andrepont_Mela...|
| 29|  Male|  Maybell Duguay|   DSA| 592061|   83|Eda Neathery_J

In [54]:
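# contains() matches the literal substring "se"
df.filter(df.name.contains("se")).show()
# like() uses SQL wildcards: an "s" followed anywhere later by an "e"
df.filter(df.name.like('%s%e%')).show()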

+---+------+--------------+------+-------+-----+--------------------+
|age|gender|          name|course|   roll|marks|               email|
+---+------+--------------+------+-------+-----+--------------------+
| 28|Female|Latia Vanhoose|    DB| 122502|   27|Latia Vanhoose_Mi...|
| 29|Female|Latia Vanhoose|   MVC| 132110|   55|Eda Neathery_Nico...|
| 29|Female|Latia Vanhoose|    DB| 152159|   27|Claude Panos_Sant...|
| 29|  Male|Loris Crossett|   MVC| 161771|   36|Mitzi Seldon_Jenn...|
| 29|Female|Loris Crossett|    PF| 201487|   96|Elenore Choy_Lati...|
| 28|Female|Loris Crossett|    PF| 332739|   62|Michelle Ruggiero...|
| 29|  Male|Loris Crossett|    PF| 911593|   46|Gonzalo Ferebee_M...|
| 28|Female|Loris Crossett|   DSA|1662549|   86|Paris Hutton_Lati...|
| 29|  Male|Latia Vanhoose| Cloud|1832268|   60|Marylee Capasso_S...|
| 29|  Male|Latia Vanhoose|   OOP|2372748|   94|Latia Vanhoose_La...|
| 28|Female|Loris Crossett|   OOP|2691881|   29|Maybell Duguay_Ni...|
| 28|  Male|Loris Cr
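
The two filters in this cell are not equivalent: `contains("se")` requires the letters `se` next to each other, while `like('%s%e%')` only requires an `s` followed somewhere later by an `e`. The sketch below emulates both in plain Python to show the difference. `like_match` is a hypothetical helper that translates the `%` and `_` wildcards into a regular expression; it does not handle escape characters and is not Spark's implementation.

```python
import re


def like_match(value, pattern):
    """Emulate a case-sensitive SQL LIKE: % is any run of characters, _ is one."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.fullmatch("".join(parts), value, flags=re.DOTALL) is not None


# Names taken from the dataset shown above.
for name in ["Latia Vanhoose", "Celeste Lollis", "Kena Wild"]:
    print(name, "se" in name, like_match(name, "%s%e%"))
```

`Celeste Lollis` is the interesting case: it has no adjacent `se`, yet it satisfies the wildcard pattern because a `t` sits between the `s` and the final `e` of `Celeste`.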

<!-- ## 7. Summary and Best Practices

### Key Takeaways:

1. **Reading CSV Files**: Use `inferSchema='True'` for automatic type detection, or define custom schemas for better control
2. **Schema Definition**: Explicitly defining schemas improves performance and data quality
3. **RDD to DataFrame**: Use `createDataFrame()` with a schema for converting RDDs to DataFrames
4. **Column Selection**: Use `col()` function for most operations as it's the most flexible approach
5. **withColumn**: Immutable operation - always returns a new DataFrame

### Performance Tips:
- Define schemas explicitly when possible to avoid schema inference overhead
- Chain multiple `withColumn()` operations for better readability
- Use `select()` to reduce the dataset size early in your pipeline -->

<!-- ## 8. Cleanup

Stop the Spark session when done. -->

In [None]:
# Uncomment the following line to stop the Spark session when you are done
# spark.stop()