In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("JupyterStandalo") \
    .master("spark://8fa087ac675c:7077") \
    .config("spark.executor.instances", "3") \
    .config("spark.executor.cores", "6") \
    .config("spark.executor.memory", "2g") \
    .getOrCreate()

Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/06/24 10:46:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [2]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType
# Initialize our data
data2 = [("Pulkit", 12, "CS32", 82, "Programming"),
         ("Ritika", 20, "CS32", 94, "Writing"),
         ("Atirikt", 4, "BB21", 78, None),
         ("Reshav", 18, None, 56, None)
         ]

# The Spark session was already created in In [1]

# Define schema
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Roll Number", IntegerType(), True),
    StructField("Class ID", StringType(), True),
    StructField("Marks", IntegerType(), True),
    StructField("Extracurricular", StringType(), True)
])

# Create the DataFrame from the in-memory rows and the schema
df = spark.createDataFrame(data=data2, schema=schema)

In [3]:
df.show()

+-------+-----------+--------+-----+---------------+
|   Name|Roll Number|Class ID|Marks|Extracurricular|
+-------+-----------+--------+-----+---------------+
| Pulkit|         12|    CS32|   82|    Programming|
| Ritika|         20|    CS32|   94|        Writing|
|Atirikt|          4|    BB21|   78|           NULL|
| Reshav|         18|    NULL|   56|           NULL|
+-------+-----------+--------+-----+---------------+



In [4]:
from pyspark.sql.functions import col

In [5]:
x = df.groupBy("Name").agg({"Marks": "sum"})

In [6]:
x.show()

+-------+----------+
|   Name|sum(Marks)|
+-------+----------+
| Ritika|        94|
| Reshav|        56|
| Pulkit|        82|
|Atirikt|        78|
+-------+----------+



You can apply a wide variety of functions to Spark DataFrames to manipulate, analyze, and transform your data. Here’s an overview of the main categories:

---

### 1. **Basic DataFrame Operations**

* **`select()`** — Select columns:

  ```python
  df.select("col1", "col2")
  ```
* **`filter()` / `where()`** — Filter rows by condition:

  ```python
  df.filter(df["age"] > 30)
  ```
* **`groupBy()` + aggregation** — Group rows and aggregate:

  ```python
  df.groupBy("category").count()
  df.groupBy("category").agg({"sales": "sum"})
  ```
* **`orderBy()` / `sort()`** — Sort rows by columns:

  ```python
  df.orderBy("age", ascending=False)
  ```

---

### 2. **Column Functions (from `pyspark.sql.functions`)**

You can import many built-in functions to apply on columns, for example:

```python
from pyspark.sql.functions import col, lit, concat, upper, lower, when, isnan, count

df.select(
    col("name"),
    upper(col("name")).alias("upper_name"),
    when(col("age") > 18, "Adult").otherwise("Minor").alias("age_group")
)
```

Common functions include:

* String functions: `upper()`, `lower()`, `concat()`, `substring()`, `trim()`
* Math functions: `round()`, `sqrt()`, `abs()`
* Conditional functions: `when()`, chained with the `Column.otherwise()` method
* Null handling: `isnull()`, `isnan()`, `coalesce()`
* Date/time: `current_date()`, `datediff()`, `year()`, `month()`

---

### 3. **SQL Expressions**

You can run SQL queries on DataFrames registered as temp views:

```python
df.createOrReplaceTempView("my_table")
spark.sql("SELECT name, age FROM my_table WHERE age > 30").show()
```

---

### 4. **Joins**

Joining DataFrames on keys:

```python
df1.join(df2, df1.id == df2.id, "inner")
```

Join types (passed as the third argument, `how`): `inner`, `left_outer`, `right_outer`, `full_outer`, `cross`, `semi`, `anti`

---

### 5. **Window Functions**

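Perform calculations across a set of rows related to the current row, defined by a partition and an ordering, without collapsing the rows the way `groupBy()` does: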

```python
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

window_spec = Window.partitionBy("category").orderBy("sales")
df.withColumn("rank", row_number().over(window_spec)).show()
```

---

### 6. **Aggregations**

* `agg()`, `sum()`, `avg()`, `min()`, `max()`, `count()`

Example:

```python
df.groupBy("category").agg({"sales": "sum", "price": "avg"}).show()
```

---

### 7. **User-Defined Functions (UDFs)**

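Write your own functions for custom transformations when no built-in function fits. Python UDFs are usually slower than built-ins because each row has to be serialized between the JVM and the Python worker: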

```python
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

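def plus_one(x):
    # Spark passes SQL NULL as None; return None instead of raising a TypeError
    if x is None:
        return None
    return str(int(x) + 1)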

plus_one_udf = udf(plus_one, StringType())
df.withColumn("new_col", plus_one_udf(df["age"])).show()
```

---

### 8. **Other Useful Functions**

* `withColumn()` — Add or replace a column
* `drop()` — Drop columns
* `distinct()` — Remove duplicates
* `limit()` — Take first N rows
* `cache()` / `persist()` — Cache DataFrame in memory/disk
* `explain()` — Show query plan for optimization insight

---

### Summary

| Function Type          | Examples                            | Purpose                             |
| ---------------------- | ----------------------------------- | ----------------------------------- |
| Basic ops              | `select()`, `filter()`, `groupBy()` | Data selection, filtering, grouping |
| Column functions       | `upper()`, `when()`, `concat()`     | Column transformations              |
| SQL                    | `spark.sql("SELECT ...")`           | SQL querying                        |
| Joins                  | `join()`                            | Combining data                      |
| Window functions       | `row_number()`, `rank()`            | Analytics over partitions           |
| Aggregations           | `sum()`, `count()`, `avg()`         | Summary statistics                  |
| User Defined Functions | `udf()`                             | Custom transformations              |
| DataFrame management   | `withColumn()`, `drop()`, `cache()` | Modify columns, optimize caching    |

---

If you want, I can give examples of any of these or show how to chain multiple functions together!
