In [1]:
import findspark
findspark.init('/home/akshay/spark-3.5.1-bin-hadoop3')
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("spark_tutorials_1").getOrCreate()

24/06/23 21:28:26 WARN Utils: Your hostname, akshay-vm resolves to a loopback address: 127.0.1.1; using 10.0.2.15 instead (on interface enp0s3)
24/06/23 21:28:26 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/06/23 21:28:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/06/23 21:28:28 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


## Apache Spark `collect` Function: Overview, Pros, and Cons

The `collect` function in Apache Spark retrieves all the elements of a distributed dataset (RDD or DataFrame) to the driver program. It returns the data as a local Python list (for a DataFrame, a list of `Row` objects), which can then be processed using standard Python operations.

### Syntax

```python
# For RDDs:
data = rdd.collect()

# For DataFrames:
data = df.collect()
```


## Apache Spark `collect` Function: Pros, Cons, and Best Practices

### Pros

- **Local Processing**: Data collected using `collect` is available locally in the driver program. This allows for easy integration with local Python libraries and facilitates further analysis or processing.

- **Ease of Use**: Once collected, the data can be manipulated using familiar Python operations, making it straightforward to perform subsequent calculations or transformations.

- **Debugging**: Useful for debugging and exploring small datasets extracted from larger distributed datasets. It enables quick inspection of data contents and structure.

### Cons

- **Memory Intensive**: The `collect` function brings the entire dataset to the driver program. If the data does not fit in the driver's memory, the job can fail with an out-of-memory error, so `collect` is unsuitable for very large datasets.

- **Performance Impact**: Collecting large amounts of data can be slow and may introduce significant overhead, especially when dealing with distributed computations where data needs to be transferred over the network.

- **Not Scalable**: Use of `collect` undermines Spark's parallel processing capabilities. It forces all data to be handled by a single machine (the driver), negating the benefits of distributed processing for large-scale data analysis.

### Best Practices

- **Use Sparingly**: Limit the use of `collect` to situations where you genuinely need to bring data to the driver for local processing or analysis.
  
- **Sampling**: If possible, sample a subset of data using techniques like `take` or `sample` before deciding to collect the entire dataset.

- **Aggregation**: For summarizing results, prefer Spark's built-in aggregations (`reduce`, `aggregate`, and `fold` on RDDs; `groupBy(...).agg(...)` on DataFrames) so that only the small summary reaches the driver, instead of calling `collect` on the raw data.

### Example

```python
# Assuming 'rdd' is an existing RDD (for a DataFrame, call df.collect() instead)
# Collecting data to the driver program (use with caution):
collected_data = rdd.collect()

# Process collected_data locally:
for item in collected_data:
    print(item)
```


In [2]:
data = [
    (1, "Alice", "Female", 60000, "HR"),
    (2, "Bob", "Male", 80000, "Engineering"),
    (3, "Charlie", "Male", 75000, "Marketing"),
    (4, "Diana", "Female", 90000, "Finance"),
    (5, "Eve", "Female", 70000, "Engineering"),
    (6, "Frank", "Male", 85000, "HR"),
    (7, "Gina", "Female", 65000, "Marketing"),
    (8, "Henry", "Male", 95000, "Finance"),
    (9, "Irene", "Female", 72000, "Engineering"),
    (10, "John", "Male", 78000, "HR")
]


In [3]:
schema = ["id", "name", "gender", "salary", "dept"]

In [4]:
df = spark.createDataFrame(data=data, schema=schema)
df.show()

                                                                                

+---+-------+------+------+-----------+
| id|   name|gender|salary|       dept|
+---+-------+------+------+-----------+
|  1|  Alice|Female| 60000|         HR|
|  2|    Bob|  Male| 80000|Engineering|
|  3|Charlie|  Male| 75000|  Marketing|
|  4|  Diana|Female| 90000|    Finance|
|  5|    Eve|Female| 70000|Engineering|
|  6|  Frank|  Male| 85000|         HR|
|  7|   Gina|Female| 65000|  Marketing|
|  8|  Henry|  Male| 95000|    Finance|
|  9|  Irene|Female| 72000|Engineering|
| 10|   John|  Male| 78000|         HR|
+---+-------+------+------+-----------+



In [5]:
type(df)

pyspark.sql.dataframe.DataFrame

In [6]:
help(df.collect)

Help on method collect in module pyspark.sql.dataframe:

collect() -> List[pyspark.sql.types.Row] method of pyspark.sql.dataframe.DataFrame instance
    Returns all the records as a list of :class:`Row`.
    
    .. versionadded:: 1.3.0
    
    .. versionchanged:: 3.4.0
        Supports Spark Connect.
    
    Returns
    -------
    list
        List of rows.
    
    Examples
    --------
    >>> df = spark.createDataFrame(
    ...     [(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
    >>> df.collect()
    [Row(age=14, name='Tom'), Row(age=23, name='Alice'), Row(age=16, name='Bob')]



In [7]:
type(df.collect())

list

In [8]:
list1 = df.collect()

In [9]:
list1

[Row(id=1, name='Alice', gender='Female', salary=60000, dept='HR'),
 Row(id=2, name='Bob', gender='Male', salary=80000, dept='Engineering'),
 Row(id=3, name='Charlie', gender='Male', salary=75000, dept='Marketing'),
 Row(id=4, name='Diana', gender='Female', salary=90000, dept='Finance'),
 Row(id=5, name='Eve', gender='Female', salary=70000, dept='Engineering'),
 Row(id=6, name='Frank', gender='Male', salary=85000, dept='HR'),
 Row(id=7, name='Gina', gender='Female', salary=65000, dept='Marketing'),
 Row(id=8, name='Henry', gender='Male', salary=95000, dept='Finance'),
 Row(id=9, name='Irene', gender='Female', salary=72000, dept='Engineering'),
 Row(id=10, name='John', gender='Male', salary=78000, dept='HR')]

In [10]:
list1[0]

Row(id=1, name='Alice', gender='Female', salary=60000, dept='HR')