# DataFrame Operations in PySpark

## Table of Contents
1. [Introduction to DataFrame Operations](#1-introduction-to-dataframe-operations)
2. [Loading Data](#2-loading-data)
3. [Selecting Columns](#3-selecting-columns)
4. [Filtering Rows](#4-filtering-rows)
5. [Adding/Renaming/Dropping Columns](#5-addingrenamingdropping-columns)
6. [Sorting Data](#6-sorting-data)
7. [Handling Missing Data](#7-handling-missing-data)
8. [Distinct and Duplicate Handling](#8-distinct-and-duplicate-handling)
9. [Union Operations](#9-union-operations)


## 1. Introduction to DataFrame Operations

DataFrame operations are the backbone of data manipulation in PySpark. They let you transform, filter, and analyze data efficiently. In this notebook, we’ll explore various DataFrame operations using a **customer dataset**.

### Key Concepts
- **DataFrame**: A distributed collection of data organized into named columns.
- **Transformations**: Lazily evaluated operations that define a new DataFrame without running any computation (e.g., `select`, `filter`).
- **Actions**: Operations that trigger computation and return results (e.g., `show`, `count`).

```
┌──────────────┐
│  DataFrame   │
└──────┬───────┘
       │
       ▼
┌───────────────────┐
│  Transformations  │
│  (e.g., select,   │
│   filter, groupBy)│
└─────────┬─────────┘
          │
          ▼
┌───────────────────┐
│  Actions          │
│  (e.g., show,     │
│   count, collect) │
└───────────────────┘
```

## 2. Loading Data

Before performing any operations, we need to load the customer dataset into a DataFrame.

### Example

```python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("DataFrameOperations").getOrCreate()

# Load the customer dataset
df = spark.read.csv("customers.csv", header=True, inferSchema=True)

# Show the first 5 rows
df.show(5)
```

**Output**

| customer_id | name        | city      | state        | country | registration_date | is_active |
|-------------|-------------|-----------|--------------|---------|-------------------|-----------|
| 0           | Customer_0  | Pune      | West Bengal  | India   | 2023-10-10        | True      |
| 1           | Customer_1  | Bangalore | Gujarat      | India   | 2023-10-19        | False     |
| 2           | Customer_2  | Bangalore | Karnataka    | India   | 2023-02-10        | True      |
| 3           | Customer_3  | Bangalore | Telangana    | India   | 2023-03-24        | True      |
| 4           | Customer_4  | Hyderabad | Telangana    | India   | 2023-06-04        | False     |

### Explanation
- **`spark.read.csv`**: Loads a CSV file into a DataFrame.
- **`header=True`**: Uses the first row as column names.
- **`inferSchema=True`**: Automatically infers the data type of each column, at the cost of an extra pass over the data.

## 3. Selecting Columns

Selecting specific columns from a DataFrame is a common operation. You can use the `select` method to choose columns.

### Example

```python
# Select specific columns
df.select("customer_id", "name", "city").show(5)
```

**Output**

| customer_id | name        | city      |
|-------------|-------------|-----------|
| 0           | Customer_0  | Pune      |
| 1           | Customer_1  | Bangalore |
| 2           | Customer_2  | Bangalore |
| 3           | Customer_3  | Bangalore |
| 4           | Customer_4  | Hyderabad |

### Explanation
- **`select`**: Selects specific columns from the DataFrame.
- **`show(5)`**: Displays the first 5 rows of the selected columns.

## 4. Filtering Rows

Filtering keeps only the rows that satisfy a condition. You can use the `filter` method or its alias `where`; the two behave identically.

### Example

```python
# Filter rows where city is 'Bangalore'
df.filter(df["city"] == "Bangalore").show(5)
```

**Output**

| customer_id | name        | city      | state      | country | registration_date | is_active |
|-------------|-------------|-----------|------------|---------|-------------------|-----------|
| 1           | Customer_1  | Bangalore | Gujarat    | India   | 2023-10-19        | False     |
| 2           | Customer_2  | Bangalore | Karnataka  | India   | 2023-02-10        | True      |
| 3           | Customer_3  | Bangalore | Telangana  | India   | 2023-03-24        | True      |
| 7           | Customer_7  | Bangalore | Telangana  | India   | 2023-08-25        | True      |
| 8           | Customer_8  | Bangalore | Maharashtra| India   | 2023-07-13        | False     |

### Explanation
- **`filter`**: Filters rows based on a condition.
- **`df["city"] == "Bangalore"`**: Condition to filter rows where the city is Bangalore.

## 5. Adding/Renaming/Dropping Columns

You can add, rename, or drop columns in a DataFrame using methods like `withColumn`, `withColumnRenamed`, and `drop`.

### Example

```python
from pyspark.sql.functions import lit

# Add a new column 'is_premium' with a default value of False
df = df.withColumn("is_premium", lit(False))

# Rename the column 'is_active' to 'active_status'
df = df.withColumnRenamed("is_active", "active_status")

# Drop the column 'state'
df = df.drop("state")

# Show the updated DataFrame
df.show(5)
```

**Output**

| customer_id | name        | city      | country | registration_date | active_status | is_premium |
|-------------|-------------|-----------|---------|-------------------|---------------|------------|
| 0           | Customer_0  | Pune      | India   | 2023-10-10        | True          | False      |
| 1           | Customer_1  | Bangalore | India   | 2023-10-19        | False         | False      |
| 2           | Customer_2  | Bangalore | India   | 2023-02-10        | True          | False      |
| 3           | Customer_3  | Bangalore | India   | 2023-03-24        | True          | False      |
| 4           | Customer_4  | Hyderabad | India   | 2023-06-04        | False         | False      |

### Explanation
- **`withColumn`**: Adds a new column or updates an existing one.
- **`withColumnRenamed`**: Renames a column.
- **`drop`**: Drops a column from the DataFrame.

## 6. Sorting Data

Sorting arranges rows by the values of one or more columns. You can use the `orderBy` method or its alias `sort`.

### Example

```python
# Sort by 'registration_date' in descending order
df.orderBy("registration_date", ascending=False).show(5)
```

**Output**

| customer_id | name        | city      | country | registration_date | active_status | is_premium |
|-------------|-------------|-----------|---------|-------------------|---------------|------------|
| 1           | Customer_1  | Bangalore | India   | 2023-10-19        | False         | False      |
| 0           | Customer_0  | Pune      | India   | 2023-10-10        | True          | False      |
| 7           | Customer_7  | Bangalore | India   | 2023-08-25        | True          | False      |
| 5           | Customer_5  | Hyderabad | India   | 2023-07-26        | True          | False      |
| 8           | Customer_8  | Bangalore | India   | 2023-07-13        | False         | False      |

### Explanation
- **`orderBy`**: Sorts the DataFrame by specified columns.
- **`ascending=False`**: Sorts in descending order.

## 7. Handling Missing Data

Handling missing data is crucial for accurate analysis. You can use methods like `na.fill` and `na.drop` to handle missing values.

### Example

```python
# Fill missing values in 'city' with 'Unknown'
df.na.fill({"city": "Unknown"}).show(5)

# Drop rows with missing values
df.na.drop().show(5)
```

### Explanation
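- **`na.fill`**: Returns a new DataFrame with nulls replaced; a dictionary maps each column name to its replacement value.
- **`na.drop`**: Returns a new DataFrame without the rows that contain a null in any column (the default, `how="any"`).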

## 8. Distinct and Duplicate Handling

You can remove duplicate rows using the `distinct` and `dropDuplicates` methods.

### Example

```python
# Remove duplicate rows
df.distinct().show(5)

# Remove duplicates based on specific columns
# ('state' was dropped in Section 5, so 'country' is used here)
df.dropDuplicates(["city", "country"]).show(5)
```

### Explanation
- **`distinct`**: Removes rows that are duplicated across all columns.
- **`dropDuplicates`**: Keeps one row per unique combination of the listed columns; which row is kept is not guaranteed.

## 9. Union Operations

You can combine two DataFrames using the `union` method.

### Example

```python
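from pyspark.sql.functions import col

# Rows matching df's current columns (after Section 5: 'state' dropped,
# 'is_active' renamed to 'active_status', 'is_premium' added)
new_data = [
    (10, "Customer_10", "Mumbai", "India", "2023-09-01", True, False),
    (11, "Customer_11", "Chennai", "India", "2023-09-15", False, False),
]
new_df = spark.createDataFrame(new_data, schema=df.columns)

# Cast each column to df's type so the two schemas line up
new_df = new_df.select([col(c).cast(df.schema[c].dataType) for c in df.columns])

# Union the two DataFrames (columns are matched by position)
combined_df = df.union(new_df)
combined_df.show()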
```

### Explanation
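- **`union`**: Appends the rows of one DataFrame to another. Columns are matched by position, not by name, so both DataFrames need the same column order and compatible types. Duplicate rows are kept.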