# Setting up PySpark in a Local Ubuntu System?

### ✅ Stage 2: Setting Up PySpark in a Local Ubuntu System (Step-by-Step)

This guide walks through setting up PySpark on Ubuntu (22.04 or similar) for local development.



### 🔧 Prerequisites

* ✅ Python (3.7+)
* ✅ Java (Java 8 or 11 recommended)
* ✅ pip (Python package manager)
* ✅ Ubuntu terminal access
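
The prerequisites above can be checked from Python before installing anything. This is a minimal sketch using only the standard library. The helper name `check_prerequisites` is made up for illustration, and the 3.7 floor simply mirrors the list above.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 7)):
    """Report which of the prerequisites listed above look available."""
    return {
        # The interpreter running this script meets the minimum version
        "python_ok": sys.version_info[:2] >= min_python,
        # Full path of the `java` executable on PATH, or None if missing
        "java_path": shutil.which("java"),
        # pip may be installed as `pip3` or `pip`
        "pip_path": shutil.which("pip3") or shutil.which("pip"),
    }


report = check_prerequisites()
for name, value in report.items():
    print(f"{name}: {value}")
```

A value of `None` for `java_path` or `pip_path` means the corresponding install step below is still needed.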



### 🚀 Step-by-Step Installation Guide

#### ✅ Step 1: Install Java (if not already installed)

```bash
sudo apt update
sudo apt install openjdk-11-jdk -y
```

✅ **Check version**:

```bash
java -version
```



#### ✅ Step 2: Install Python & pip (if not already)

```bash
sudo apt install python3 python3-pip -y
```

✅ **Check version**:

```bash
python3 --version
pip3 --version
```



#### ✅ Step 3: Install Apache Spark via pip

Use `findspark` and `pyspark`:

```bash
pip3 install pyspark findspark
```

* **pyspark** → PySpark bindings
* **findspark** → Helps Jupyter or Python scripts locate Spark



#### ✅ Step 4: Set Environment Variables (optional but recommended)

Edit your `.bashrc` file:

```bash
nano ~/.bashrc
```

Add these lines at the end:

```bash
export SPARK_HOME=$(pip3 show pyspark | grep Location | cut -d' ' -f2)/pyspark
export PATH=$SPARK_HOME/bin:$PATH
```

Then run:

```bash
source ~/.bashrc
```
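
As an alternative to parsing the output of `pip3 show`, the install directory can be looked up from Python. The sketch below uses the standard library's `importlib.util.find_spec`. `find_package_dir` is a hypothetical helper name, and the result is only meaningful if `pyspark` is installed for the same interpreter that runs the script.

```python
import importlib.util
import os


def find_package_dir(package_name):
    """Return the directory a top-level package is installed in, or None."""
    spec = importlib.util.find_spec(package_name)
    if spec is None or spec.origin is None:
        return None
    # spec.origin points at the package's __init__.py
    return os.path.dirname(spec.origin)


spark_home = find_package_dir("pyspark")
if spark_home is None:
    print("pyspark is not installed for this interpreter")
else:
    print("A candidate SPARK_HOME:", spark_home)
```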

---

#### ✅ Step 5: Test PySpark Shell

```bash
pyspark
```

If everything works, you'll see:

```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version X.X.X
      /_/
```

Type `exit()` or press `Ctrl + D` to exit.



#### ✅ Step 6: Test PySpark in Python

Create a file `test_spark.py`:

```python
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TestApp").getOrCreate()

data = [("Alice", 25), ("Bob", 30)]
df = spark.createDataFrame(data, ["Name", "Age"])
df.show()
```

Run it:

```bash
python3 test_spark.py
```

You should see the DataFrame printed in your terminal.

---

#### ✅ (Optional) Step 7: Jupyter Notebook + PySpark

Install Jupyter:

```bash
pip3 install notebook
```

Then create a kernel:

```bash
pip3 install ipykernel
python3 -m ipykernel install --user --name=pyspark_env
```

Launch Jupyter:

```bash
jupyter notebook
```

Use this code to initialize Spark in your notebook:

```python
import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("NotebookApp").getOrCreate()
```



# Setting up PySpark in Google Colab?

### ✅ Setting Up PySpark in Google Colab (2025 Guide)

Google Colab is an excellent platform for running **PySpark** without local installation. Follow these simple steps to configure and run Spark on Colab.



### 🔹 Step-by-Step Setup Guide

#### ✅ Step 1: Install PySpark

Run the following in a Colab cell:

```python
!apt-get install openjdk-11-jdk -y
!pip install pyspark
```



#### ✅ Step 2: Set Environment Variables

```python
import os

os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-11-openjdk-amd64"
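# SPARK_HOME is optional when PySpark is installed with pip. If you set it,
# the Python version in the path must match the Colab runtime:
# os.environ["SPARK_HOME"] = "/usr/local/lib/python3.10/dist-packages/pyspark"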
```

✅ `JAVA_HOME` tells Spark where Java is located (Spark runs on the JVM).
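
Hard-coded paths depend on the exact machine image, so it can be safer to derive `JAVA_HOME` from the `java` binary that is actually on the `PATH`. The following is a sketch under that assumption. `java_home_from_binary` is an illustrative helper, and the "two directories up" rule fits the usual `<jdk>/bin/java` layout but is not guaranteed for every JDK package.

```python
import os
import shutil


def java_home_from_binary(java_binary):
    """Derive JAVA_HOME from a resolved .../bin/java path (two levels up)."""
    return os.path.dirname(os.path.dirname(java_binary))


java_on_path = shutil.which("java")
if java_on_path is not None:
    # Follow symlinks such as /usr/bin/java -> /etc/alternatives/java -> real JDK
    resolved = os.path.realpath(java_on_path)
    os.environ["JAVA_HOME"] = java_home_from_binary(resolved)
    print("JAVA_HOME =", os.environ["JAVA_HOME"])
else:
    print("java not found on PATH; install a JDK first")
```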



#### ✅ Step 3: Start Spark Session

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Colab PySpark") \
    .getOrCreate()
```



#### ✅ Step 4: Test Spark

```python
data = [("Ahmad", 22), ("Raza", 25)]
df = spark.createDataFrame(data, ["Name", "Age"])
df.show()
```



### ✅ Optional: Check Spark Version

```python
print(spark.version)
```



### 📌 Notes

* No need for `findspark` in Colab: a pip-installed `pyspark` is already importable.
* Spark runs **locally** on Colab’s virtual machine (not distributed).
* You can upload files to Colab or use `gdown`, `wget`, or mount Google Drive for data input.



# Setting up PySpark in Databricks?

### ✅ Setting Up PySpark in **Databricks**

Databricks is a powerful cloud-based platform built on Apache Spark. It offers an easy-to-use environment for running **PySpark** code without manual installation or setup.



### 🔹 Step-by-Step Guide to Set Up PySpark in Databricks

#### ✅ Step 1: Create a Databricks Account

1. Go to: [https://community.cloud.databricks.com](https://community.cloud.databricks.com)
2. Sign up for a **free Community Edition** account (sufficient for learning).



#### ✅ Step 2: Create a Workspace

Once logged in:

1. Go to the **Workspace** tab.
2. Click on `Create > Notebook`.



#### ✅ Step 3: Create a New Notebook

1. **Name** your notebook (e.g., `My PySpark Demo`).
2. **Default language**: Select `Python`.
3. **Cluster**: You’ll be prompted to attach a cluster.



#### ✅ Step 4: Create and Start a Cluster

1. Go to `Compute > Create Cluster`.
2. Set:

   * Cluster name: `my-cluster`
   * Runtime: Choose default (`10.x` or higher is fine)
   * Cluster mode: `Single Node`
3. Click **Create Cluster**.

📌 Wait 2–3 minutes for the cluster to initialize.



#### ✅ Step 5: Run PySpark Code in Notebook

You can now run PySpark like this:

```python
# Databricks notebooks already provide a `spark` session;
# getOrCreate() simply returns that existing session
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DatabricksApp").getOrCreate()

# Sample data
data = [("Ahmad", 22), ("Raza", 25)]
df = spark.createDataFrame(data, ["Name", "Age"])
df.show()
```



### 🔄 Bonus Features in Databricks

| Feature                    | Use                                   |
| -------------------------- | ------------------------------------- |
| 📊 Built-in visualizations | Click on result ➝ Visualize           |
| 📁 File system (DBFS)      | `/dbfs/` path for storing files       |
| 📚 Markdown Support        | `%md` cells for documentation         |
| 📦 Libraries               | Install via `Libraries > Install New` |



### ✅ Databricks Magic Commands

| Magic Command | Description                          |
| ------------- | ------------------------------------ |
| `%fs`         | Access Databricks File System (DBFS) |
| `%run`        | Run another notebook                 |
| `%sql`        | Run SQL queries                      |
| `%python`     | Run Python code (default)            |
| `%sh`         | Run shell commands                   |



### Testing

```python
import findspark
import warnings
warnings.filterwarnings('ignore')
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("NotebookApp").getOrCreate()
```


# SparkSession and SparkContext?

## 🔍 SparkSession vs SparkContext in PySpark

### ✅ 1. What is `SparkSession`?

**`SparkSession`** is the entry point to work with Spark functionality using the **DataFrame and SQL API** in PySpark.

It unifies the older `SQLContext` and `HiveContext` entry points and wraps a `SparkContext`, which remains available as `spark.sparkContext`.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("MyApp") \
    .getOrCreate()
```



### ✅ 2. What is `SparkContext`?

**`SparkContext`** was the main entry point before Spark 2.0 and is still the entry point to Spark Core and the RDD API. In current code you normally obtain it from a `SparkSession` rather than constructing it directly.

```python
sc = spark.sparkContext  # Access from SparkSession
```



### 🆚 Difference Between `SparkSession` and `SparkContext`

| Feature       | `SparkSession`                        | `SparkContext`                   |
| ------------- | ------------------------------------- | -------------------------------- |
| Introduced In | Spark 2.0+                            | Spark 1.x                        |
| Used For      | DataFrames, Datasets, SQL, Spark Core | Spark Core and RDDs only         |
| Combines      | SQLContext, HiveContext, SparkContext | Only provides core functionality |
| Accessed From | `SparkSession.builder.getOrCreate()`  | `spark.sparkContext`             |
| Suitable For  | Most use cases today                  | Legacy RDD-based apps            |



### ✅ Typical Usage Example

```python
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName("Example").getOrCreate()

# Access SparkContext
sc = spark.sparkContext

# Use SparkContext to create an RDD
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.collect())

# Use SparkSession to create a DataFrame
df = spark.createDataFrame([(1, "Ahmad"), (2, "Raza")], ["ID", "Name"])
df.show()
```



📤 **Output:**

```
25/06/11 15:50:17 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.

[1, 2, 3, 4]

+---+-----+
| ID| Name|
+---+-----+
|  1|Ahmad|
|  2| Raza|
+---+-----+
```



# RDDs (Resilient Distributed Datasets)



## ✅ **RDDs (Resilient Distributed Datasets)**

### 📘 What is an RDD?

An **RDD (Resilient Distributed Dataset)** is the **core data structure** of Apache Spark.
It is:

* **Immutable** (once created, cannot be changed)
* **Distributed** across a cluster
* **Lazy-evaluated** (transformations are not executed until an action is called)
* **Fault-tolerant** (can recover from node failures)



### 📌 How to Create RDDs

#### 1. From an existing collection (list, tuple):

```python
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
```

#### 2. From an external source (e.g., text file):

```python
rdd = sc.textFile("/path/to/file.txt")
```



## 🔧 RDD Transformations vs Actions

| Type               | Meaning                                      | Example                |
| ------------------ | -------------------------------------------- | ---------------------- |
| **Transformation** | Lazy operation, returns a new RDD            | `map()`, `filter()`    |
| **Action**         | Triggers execution, returns result to driver | `collect()`, `count()` |



## 🔁 **RDD Transformations** (Lazy)

These return a new RDD and are lazily evaluated.

| Transformation   | Description                        | Example                             |
| ---------------- | ---------------------------------- | ----------------------------------- |
| `map(func)`      | Applies a function to each element | `rdd.map(lambda x: x * 2)`          |
| `filter(func)`   | Filters elements                   | `rdd.filter(lambda x: x > 3)`       |
| `flatMap(func)`  | Like map, but flattens result      | `rdd.flatMap(lambda x: x.split())`  |
| `distinct()`     | Removes duplicate elements         | `rdd.distinct()`                    |
| `union(rdd2)`    | Combines two RDDs                  | `rdd.union(other_rdd)`              |
| `intersection()` | Returns common elements            | `rdd1.intersection(rdd2)`           |
| `sample()`       | Random sampling of RDD             | `rdd.sample(False, 0.5)`            |
| `groupByKey()`   | Groups values by key (pair RDDs)   | `rdd.groupByKey()`                  |
| `reduceByKey()`  | Aggregates by key                  | `rdd.reduceByKey(lambda x, y: x+y)` |
| `sortBy()`       | Sorts by a custom function         | `rdd.sortBy(lambda x: x)`           |



## ⚡ **RDD Actions** (Trigger Execution)

These return values or output to the driver or external storage.

| Action             | Description                             | Example                       |
| ------------------ | --------------------------------------- | ----------------------------- |
| `collect()`        | Returns all elements as a list          | `rdd.collect()`               |
| `count()`          | Returns number of elements              | `rdd.count()`                 |
| `first()`          | Returns first element                   | `rdd.first()`                 |
| `take(n)`          | Returns first `n` elements              | `rdd.take(3)`                 |
| `reduce(func)`     | Reduces elements using function         | `rdd.reduce(lambda x,y: x+y)` |
| `saveAsTextFile()` | Saves RDD to file                       | `rdd.saveAsTextFile("/out")`  |
| `countByValue()`   | Counts occurrences of each unique value | `rdd.countByValue()`          |
| `foreach(func)`    | Applies a function (no return)          | `rdd.foreach(print)`          |



## ✅ Example Code: Basic RDD Operations

```python
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

# Transformations
squared_rdd = rdd.map(lambda x: x ** 2)
filtered_rdd = squared_rdd.filter(lambda x: x > 10)

# Actions
print("Squared values:", squared_rdd.collect())
print("Filtered values:", filtered_rdd.collect())
print("Sum of all values:", rdd.reduce(lambda x, y: x + y))
```



### 🧠 Why Use RDDs?

* Full control over **low-level transformations**
* Ideal for **unstructured** or **semi-structured** data
* More **manual**, but more **flexible** than DataFrames



### ⚠️ When **NOT** to use RDDs:

* When working with **structured data** (prefer DataFrames)
* When performance and optimization are crucial (RDDs don’t get Catalyst optimizer benefits)



📤 **Output of the basic RDD operations example:**

```
Squared values: [1, 4, 9, 16, 25]
Filtered values: [16, 25]
Sum of all values: 15
```



## ✅ Setup (Run this first)

```python
from pyspark.sql import SparkSession

# Start Spark session
spark = SparkSession.builder.appName("RDDFunctions").getOrCreate()

# Get SparkContext
sc = spark.sparkContext
```

---

## 🔁 RDD **Transformations** with Code Examples

### 1. `map(func)`

```python
rdd = sc.parallelize([1, 2, 3])
mapped = rdd.map(lambda x: x * 2)
print(mapped.collect())  # [2, 4, 6]
```

---

### 2. `filter(func)`

```python
rdd = sc.parallelize([1, 2, 3, 4, 5])
filtered = rdd.filter(lambda x: x > 3)
print(filtered.collect())  # [4, 5]
```

---

### 3. `flatMap(func)`

```python
rdd = sc.parallelize(["Hello Spark", "RDD example"])
flat_mapped = rdd.flatMap(lambda x: x.split())
print(flat_mapped.collect())  # ['Hello', 'Spark', 'RDD', 'example']
```

---

### 4. `distinct()`

```python
rdd = sc.parallelize([1, 2, 2, 3, 3, 3])
distinct_rdd = rdd.distinct()
print(sorted(distinct_rdd.collect()))  # [1, 2, 3] (sorted: element order is not guaranteed)
```

---

### 5. `union(rdd2)`

```python
rdd1 = sc.parallelize([1, 2])
rdd2 = sc.parallelize([3, 4])
unioned = rdd1.union(rdd2)
print(unioned.collect())  # [1, 2, 3, 4]
```

---

### 6. `intersection(rdd2)`

```python
rdd1 = sc.parallelize([1, 2, 3])
rdd2 = sc.parallelize([2, 3, 4])
intersected = rdd1.intersection(rdd2)
print(sorted(intersected.collect()))  # [2, 3] (sorted: element order is not guaranteed)
```

---

### 7. `sample(withReplacement, fraction)`

```python
rdd = sc.parallelize(range(10))
sampled = rdd.sample(False, 0.4)
print(sampled.collect())  # Roughly 40% of the elements; the exact count varies per run
```

---

### 8. `groupByKey()`

```python
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
grouped = rdd.groupByKey()
print(sorted((k, sorted(v)) for k, v in grouped.collect()))  # [('a', [1, 3]), ('b', [2])]
```

---

### 9. `reduceByKey(func)`

```python
rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
reduced = rdd.reduceByKey(lambda x, y: x + y)
print(sorted(reduced.collect()))  # [('a', 4), ('b', 2)]
```

---

### 10. `sortBy(func)`

```python
rdd = sc.parallelize([("a", 3), ("b", 1), ("c", 2)])
sorted_rdd = rdd.sortBy(lambda x: x[1])
print(sorted_rdd.collect())  # [('b', 1), ('c', 2), ('a', 3)]
```

---

## ⚡ RDD **Actions** with Code Examples

### 1. `collect()`

```python
rdd = sc.parallelize([10, 20])
print(rdd.collect())  # [10, 20]
```

---

### 2. `count()`

```python
rdd = sc.parallelize([10, 20, 30])
print(rdd.count())  # 3
```

---

### 3. `first()`

```python
rdd = sc.parallelize([5, 6, 7])
print(rdd.first())  # 5
```

---

### 4. `take(n)`

```python
rdd = sc.parallelize([5, 6, 7, 8])
print(rdd.take(2))  # [5, 6]
```

---

### 5. `reduce(func)`

```python
rdd = sc.parallelize([1, 2, 3, 4])
result = rdd.reduce(lambda x, y: x + y)
print(result)  # 10
```

---

### 6. `saveAsTextFile(path)`

```python
rdd = sc.parallelize(["line1", "line2"])
rdd.saveAsTextFile("/tmp/output_text")
```

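📌 `saveAsTextFile` writes a directory of part files rather than a single file, and it fails if the path already exists. On Colab the output lands on the temporary VM disk.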

---

### 7. `countByValue()`

```python
rdd = sc.parallelize([1, 2, 2, 3, 3, 3])
print(sorted(rdd.countByValue().items()))  # [(1, 1), (2, 2), (3, 3)]
```

---

### 8. `foreach(func)`

```python
rdd = sc.parallelize(["a", "b", "c"])
rdd.foreach(lambda x: print("Letter:", x))  # Runs on the executors, not the driver
```

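⚠️ Output appears wherever the executors write their stdout, so in a notebook it may show up in the terminal or logs instead of the cell, and the order is not guaranteed.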



# DataFrames



## ✅ **DataFrames in PySpark**

### 📘 What is a DataFrame?

A **DataFrame** in PySpark is a **distributed collection of data** organized into **named columns**, just like a table in a relational database or a **pandas DataFrame**.

It is built **on top of RDDs** and uses the **Catalyst optimizer** for query optimization and **Tungsten** for execution, which usually makes it faster than equivalent hand-written RDD code.



### 🔍 Key Features of DataFrames

* Schema (column names and types)
* SQL-like operations (`select`, `filter`, `groupBy`, etc.)
* Automatic optimization via Catalyst engine
* Lazy evaluation
* Supports reading from multiple sources (CSV, JSON, Parquet, Hive, etc.)
* Interoperable with SQL



## 🔧 Creating a DataFrame

### 1. ✅ From a Python List (with column names):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()

data = [("Ahmad", 22), ("Raza", 25), ("Ali", 30)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)
df.show()
```

```
+-----+---+
| Name|Age|
+-----+---+
|Ahmad| 22|
| Raza| 25|
|  Ali| 30|
+-----+---+
```



### 2. ✅ From a CSV File

```python
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.show()
```



### 3. ✅ From an RDD

```python
rdd = spark.sparkContext.parallelize([(1, "Spark"), (2, "PySpark")])
df = rdd.toDF(["ID", "Course"])
df.show()
```



## 🔍 Common DataFrame Operations

### ✅ Viewing Data

```python
df.show()            # Displays rows in tabular format
df.printSchema()     # Shows the schema of DataFrame
df.columns           # List of column names
df.describe().show() # Summary statistics
```



### ✅ Selecting Columns

```python
df.select("Name").show()
df.select("Name", "Age").show()
```



### ✅ Filtering Rows

```python
df.filter(df.Age > 24).show()
df.where(df.Name == "Ahmad").show()
```



### ✅ Adding New Columns

```python
df.withColumn("AgePlusOne", df.Age + 1).show()
```



### ✅ Renaming Columns

```python
df.withColumnRenamed("Age", "NewAge").show()
```



### ✅ Dropping Columns

```python
df.drop("Age").show()
```



### ✅ Grouping and Aggregation

```python
df.groupBy("Age").count().show()
df.groupBy("Age").agg({"Age": "avg"}).show()
```



### ✅ Sorting

```python
df.sort("Age").show()
df.orderBy(df.Age.desc()).show()
```



### ✅ Joining DataFrames

```python
data1 = [(1, "Ahmad"), (2, "Raza")]
data2 = [(1, "Male"), (2, "Male")]

df1 = spark.createDataFrame(data1, ["ID", "Name"])
df2 = spark.createDataFrame(data2, ["ID", "Gender"])

df1.join(df2, on="ID", how="inner").show()
```



## 🧠 When to Use DataFrames Over RDDs

| Feature         | DataFrame                           | RDD                           |
| --------------- | ----------------------------------- | ----------------------------- |
| Optimization    | Catalyst Optimizer                  | Manual                        |
| Ease of use     | SQL-like operations                 | Functional transformations    |
| Performance     | Fast and efficient                  | Slower for complex processing |
| Structured data | Best for structured/semi-structured | Better for unstructured data  |



## Select, filter, where, withColumn, drop, distinct

## ✅ 1. `select()`

### ▶ Purpose: Selects specific columns from a DataFrame.

```python
df = spark.createDataFrame(
    [("Ahmad", 22), ("Raza", 25)],
    ["Name", "Age"]
)

df.select("Name").show()
```

📤 **Output:**

```
+-----+
| Name|
+-----+
|Ahmad|
| Raza|
+-----+
```



## ✅ 2. `filter()` / `where()`

### ▶ Purpose: Filters rows based on a condition.

These two methods are functionally **identical**.

```python
df.filter(df.Age > 22).show()
df.where(df.Name == "Ahmad").show()
```

📤 **Output for filter:**

```
+----+---+
|Name|Age|
+----+---+
|Raza| 25|
+----+---+
```

📤 **Output for where:**

```
+-----+---+
| Name|Age|
+-----+---+
|Ahmad| 22|
+-----+---+
```



## ✅ 3. `withColumn()`

### ▶ Purpose: Adds a **new column** or **updates an existing one**.

```python
from pyspark.sql.functions import col

df.withColumn("AgePlus5", col("Age") + 5).show()
```

📤 **Output:**

```
+-----+---+--------+
| Name|Age|AgePlus5|
+-----+---+--------+
|Ahmad| 22|      27|
| Raza| 25|      30|
+-----+---+--------+
```

---

## ✅ 4. `drop()`

### ▶ Purpose: Removes one or more columns.

```python
df.drop("Age").show()
```

📤 **Output:**

```
+-----+
| Name|
+-----+
|Ahmad|
| Raza|
+-----+
```



## ✅ 5. `distinct()`

### ▶ Purpose: Removes duplicate rows.

```python
df_dup = spark.createDataFrame(
    [("Ahmad", 22), ("Ahmad", 22), ("Raza", 25)],
    ["Name", "Age"]
)

df_dup.distinct().show()
```

📤 **Output:**

```
+-----+---+
| Name|Age|
+-----+---+
| Raza| 25|
|Ahmad| 22|
+-----+---+
```



### ✅ Summary Table

| Function       | Use Case                 | Example                                |
| -------------- | ------------------------ | -------------------------------------- |
| `select()`     | Select specific columns  | `df.select("Name")`                    |
| `filter()`     | Filter rows by condition | `df.filter(df.Age > 20)`               |
| `where()`      | Same as `filter()`       | `df.where(df.Name == "Ahmad")`         |
| `withColumn()` | Add/modify column        | `df.withColumn("new", col("Age") + 1)` |
| `drop()`       | Drop a column            | `df.drop("Age")`                       |
| `distinct()`   | Remove duplicate rows    | `df.distinct()`                        |



# Data types and schema inference?


## ✅ What is a Schema in PySpark?

A **schema** defines the structure of a DataFrame:

* Column **names**
* Column **data types**
* Whether a column is **nullable**

It is similar to a **table definition in SQL**.



## 🧠 1. **Schema Inference**

When creating a DataFrame, PySpark can **automatically infer the schema** by inspecting the data.

### ✅ Example – Schema Inference

```python
data = [("Ahmad", 22), ("Raza", 25)]
df = spark.createDataFrame(data, ["Name", "Age"])
df.printSchema()
```

📤 **Output:**

```
root
 |-- Name: string (nullable = true)
 |-- Age: long (nullable = true)
```

PySpark infers:

* `"Name"` is a `string`
* `"Age"` is a `long` (64-bit integer, the default for Python `int` values)



## 📘 2. **Manually Defining Schema**

For better control, especially for large or structured datasets, define schema using `StructType` and `StructField`.

### ✅ Example:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True)
])

data = [("Ahmad", 22), ("Raza", 25)]
df = spark.createDataFrame(data, schema)
df.printSchema()
```

📤 **Output:**

```
root
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
```

---

## 🧩 Common PySpark Data Types

| PySpark Type      | Description                     |
| ----------------- | ------------------------------- |
| `StringType()`    | String                          |
| `IntegerType()`   | Integer (32-bit)                |
| `LongType()`      | Long (64-bit)                   |
| `FloatType()`     | Float (32-bit)                  |
| `DoubleType()`    | Double precision float (64-bit) |
| `BooleanType()`   | Boolean                         |
| `DateType()`      | Date only                       |
| `TimestampType()` | Date + Time                     |
| `ArrayType()`     | Array/List                      |
| `MapType()`       | Dictionary (key-value)          |
| `StructType()`    | Nested row structure            |



## ✅ Viewing Schema

```python
df.printSchema()  # Print schema
df.schema         # Returns schema object
```



## 🧪 Checking Data Types Programmatically

```python
for field in df.schema.fields:
    print(f"Column: {field.name}, Type: {field.dataType}")
```



## 📁 Schema Inference from CSV/JSON

```python
df = spark.read.csv("file.csv", header=True, inferSchema=True)
df.printSchema()
```

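Use `inferSchema=True` to detect column types automatically. Inference requires an extra pass over the data, so for large files an explicit schema is usually faster and more predictable. For JSON, `spark.read.json` infers the schema by default.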

