| **Operation**              | **Definition**                                             | **Example**                                 | **Sample Output**                           | **When to Use**                                     |
| -------------------------- | ---------------------------------------------------------- | ------------------------------------------- | ------------------------------------------- | --------------------------------------------------- |
| **`show(n)`**              | Displays first `n` rows in table format                    | `df.show(2)`                                | First 2 rows in tabular format              | Quick inspection of data                            |
| **`take(n)`**              | Returns first `n` rows as a **Python list** of Row objects | `df.take(2)`                                | `[Row(name='John'..), Row(name='Alice'..)]` | When you want to work with sample rows in Python    |
| **`first()`**              | Returns the **first row** of the DataFrame                 | `df.first()`                                | `Row(name='John', age=30, city='New York')` | Check first record quickly                          |
| **`head(n)`**              | Returns first `n` rows, like `take()`                      | `df.head(2)`                                | Same as `take(2)`                           | Get small subset of data                            |
| **`count()`**              | Returns total row count                                    | `df.count()`                                | `5`                                         | Validate row count, dataset size                    |
| **`collect()`**            | Returns **all rows** to driver as a list                   | `df.collect()`                              | `[Row(...), Row(...)]`                      | Only for **small datasets** (⚠️ avoid for big data) |
| **`distinct()`**           | Removes duplicate rows                                     | `df.distinct().show()`                      | Unique rows only                            | Deduplication in ETL pipelines                      |
| **`filter()` / `where()`** | Filters rows matching a condition                          | `df.filter(df.age > 25).show()`             | Rows with `age > 25`                        | Data cleansing, conditional selection               |
| **`select()`**             | Selects specific columns                                   | `df.select("name","city").show()`           | Table with only selected columns            | Use when you need subset of columns                 |
| **`orderBy()` / `sort()`** | Sorts rows by column(s)                                    | `df.orderBy(df.age.desc()).show()`          | Sorted dataset by age                       | Ranking, reporting, ordered outputs                 |
| **`groupBy()`**            | Groups rows for aggregation                                | `df.groupBy("city").count().show()`         | City-wise counts                            | Summarization, aggregations (SUM, AVG, etc.)        |
| **`withColumn()`**         | Adds or updates a column                                   | `df.withColumn("age_plus_5", col("age")+5)` | Adds `age_plus_5` column                    | Feature engineering, new calculations               |
| **`drop()`**               | Removes one/more columns                                   | `df.drop("age").show()`                     | Table without `age`                         | Remove unnecessary columns                          |
| **`na.fill()`**            | Replace NULL values                                        | `df.na.fill({"city":"Unknown"})`            | Null replaced with "Unknown"                | Data cleaning, missing values                       |
| **`na.drop()`**            | Drops rows with NULLs                                      | `df.na.drop()`                              | Only complete rows kept                     | Ensure only valid records kept                      |
| **`limit(n)`**             | Returns a new DataFrame with at most `n` rows (lazy)       | `df.limit(3).show()`                        | First 3 rows                                | Sample small portion of data                        |
| **`cache()`**              | Caches DataFrame in memory (spills to disk if needed)      | `df.cache()`                                | N/A                                         | Reuse dataset multiple times in same job            |
| **`persist()`**            | Stores DataFrame with configurable storage level           | `df.persist(StorageLevel.DISK_ONLY)`        | N/A                                         | For larger datasets that don’t fit in memory        |
| **`dropDuplicates()`**     | Removes duplicates based on given columns                  | `df.dropDuplicates(["name","city"])`        | Deduplicated rows                           | Use when only specific columns define uniqueness    |
| **`describe()`**           | Returns summary statistics                                 | `df.describe().show()`                      | Count, mean, stddev, min, max               | Quick statistical overview                          |
| **`printSchema()`**        | Prints schema of DataFrame                                 | `df.printSchema()`                          | Tree of column names/types                  | Schema validation before processing                 |


In [0]:
from pyspark.sql import SparkSession
from pyspark.sql import Row

spark = SparkSession.builder.appName("Basics").getOrCreate()

data = [
    Row(name="John", age=30, city="New York"),
    Row(name="Alice", age=25, city="London"),
    Row(name="Bob", age=30, city="New York"),
    Row(name="Mary", age=22, city="Paris"),
    Row(name="John", age=30, city="New York")   # duplicate row
]

df = spark.createDataFrame(data)
df.show()


+-----+---+--------+
| name|age|    city|
+-----+---+--------+
| John| 30|New York|
|Alice| 25|  London|
|  Bob| 30|New York|
| Mary| 22|   Paris|
| John| 30|New York|
+-----+---+--------+



In [0]:
# Definition: Displays the first n rows in tabular format (20 by default); here n = 2.
df.show(2)

+-----+---+--------+
| name|age|    city|
+-----+---+--------+
| John| 30|New York|
|Alice| 25|  London|
+-----+---+--------+
only showing top 2 rows


In [0]:
# Definition: Returns the first n rows as a Python list of Row objects.
df.take(2)

[Row(name='John', age=30, city='New York'),
 Row(name='Alice', age=25, city='London')]

In [0]:
rows = df.take(2)            # a Python list of Row objects, not a DataFrame
first_name = rows[0].name    # Row fields can be read as attributes
print(first_name)

John

In [0]:
# Definition: Returns the first row as a single Row object (not a list).
df.first()


Row(name='John', age=30, city='New York')

In [0]:
# Definition: Returns the first n rows as a list, same as take(n); head() with no argument returns a single Row.
df.head(2)


In [0]:
#Definition: Returns the number of rows in the DataFrame/RDD.
df.count()

5

In [0]:
#Definition: Returns all rows as a list to the driver (⚠️ not recommended for very large datasets).
df.collect()

[Row(name='John', age=30, city='New York'),
 Row(name='Alice', age=25, city='London'),
 Row(name='Bob', age=30, city='New York'),
 Row(name='Mary', age=22, city='Paris'),
 Row(name='John', age=30, city='New York')]

In [0]:
# Definition: Returns a new DataFrame with duplicate rows removed (here, the distinct values of city).
df.select("city").distinct().show()


+--------+
|    city|
+--------+
|New York|
|  London|
|   Paris|
+--------+



In [0]:
# Definition: Returns rows that satisfy a given condition.
df.filter(df.age > 25).show()
df.where("age > 25").show()


+----+---+--------+
|name|age|    city|
+----+---+--------+
|John| 30|New York|
| Bob| 30|New York|
|John| 30|New York|
+----+---+--------+

+----+---+--------+
|name|age|    city|
+----+---+--------+
|John| 30|New York|
| Bob| 30|New York|
|John| 30|New York|
+----+---+--------+



In [0]:
# Definition: Selects specific columns.
df.select("name", "age").show()


+-----+---+
| name|age|
+-----+---+
| John| 30|
|Alice| 25|
|  Bob| 30|
| Mary| 22|
| John| 30|
+-----+---+



In [0]:
# Definition: Sorts DataFrame rows by one or more columns.
df.orderBy(df.age.desc()).show()


+-----+---+--------+
| name|age|    city|
+-----+---+--------+
| John| 30|New York|
|  Bob| 30|New York|
| John| 30|New York|
|Alice| 25|  London|
| Mary| 22|   Paris|
+-----+---+--------+



In [0]:
# Definition: Groups rows by column(s), often used with aggregates.
df.groupBy("city").count().show()


+--------+-----+
|    city|count|
+--------+-----+
|New York|    3|
|  London|    1|
|   Paris|    1|
+--------+-----+



In [0]:
# Definition: Adds a new column or updates an existing column.
from pyspark.sql.functions import col
df.withColumn("age_plus_5", col("age") + 5).show()


+-----+---+--------+----------+
| name|age|    city|age_plus_5|
+-----+---+--------+----------+
| John| 30|New York|        35|
|Alice| 25|  London|        30|
|  Bob| 30|New York|        35|
| Mary| 22|   Paris|        27|
| John| 30|New York|        35|
+-----+---+--------+----------+



In [0]:
# Definition: Removes one or more columns.
df.drop("age").show()

+-----+--------+
| name|    city|
+-----+--------+
| John|New York|
|Alice|  London|
|  Bob|New York|
| Mary|   Paris|
| John|New York|
+-----+--------+



In [0]:
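# Definition: Handles null values in a DataFrame.
# Note: the sample data contains no nulls, so both outputs below match the original DataFrame.
df.na.fill({"age": 0}).show()     # Replace null age values with 0
df.na.drop().show()               # Drop rows that contain any null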


+-----+---+--------+
| name|age|    city|
+-----+---+--------+
| John| 30|New York|
|Alice| 25|  London|
|  Bob| 30|New York|
| Mary| 22|   Paris|
| John| 30|New York|
+-----+---+--------+

+-----+---+--------+
| name|age|    city|
+-----+---+--------+
| John| 30|New York|
|Alice| 25|  London|
|  Bob| 30|New York|
| Mary| 22|   Paris|
| John| 30|New York|
+-----+---+--------+

