#### How do you solve a data skew issue?

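- Pareto principle: roughly 80 percent of the data comes from 20 percent of the users, so a few keys (and the partitions that hold them) carry most of the load.
- Run a Spark job to get a summary of the data, for example record counts per key, to find the hot keys.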
- changed 
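- Two main ways to solve it:
    - use a different key (for example, add a salt to the hot key so its records spread out)
    - repartition the data so records are spread more evenly across partitions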
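
The "use a different key" idea above can be illustrated with a small plain-Python model (no Spark needed). This is only a sketch: the `events` data, the `hot_user` key, and `NUM_SALTS` are made up for illustration. It shows how appending a salt to the key splits one oversized group into several smaller ones, and how a second aggregation step recovers the original totals.

```python
from collections import Counter

NUM_SALTS = 4  # hypothetical number of salt buckets

# 80 of 100 events belong to one hot user, mirroring the 80/20 pattern.
events = [("hot_user", 1)] * 80 + [(f"user_{i}", 1) for i in range(20)]

# Summary step: records per key reveal the skew.
per_key = Counter(key for key, _ in events)

# Use a different key: append a salt so the hot key splits into NUM_SALTS groups.
salted = [((key, i % NUM_SALTS), value) for i, (key, value) in enumerate(events)]
per_salted_key = Counter(key for key, _ in salted)

# Stage 1: partial totals per salted key.
partial = Counter()
for salted_key, value in salted:
    partial[salted_key] += value

# Stage 2: merge the partial totals back onto the original key.
totals = Counter()
for (key, _salt), subtotal in partial.items():
    totals[key] += subtotal

print(max(per_key.values()))         # 80: one key holds most of the records
print(max(per_salted_key.values()))  # 20: the largest group is now 4x smaller
print(totals["hot_user"])            # 80: the final answer is unchanged
```

In a real Spark job the same idea would typically be expressed by adding a salt column before the `groupBy` or `join`, then aggregating a second time on the original key.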
    

#### What is Apache Spark, and how does it differ from Hadoop MapReduce?

Apache Spark is an open-source, distributed data processing framework designed for big data and analytics. 

In-Memory Processing: Spark can keep intermediate data in memory across operations, whereas Hadoop MapReduce writes the output of each map and reduce step to disk. Avoiding those disk round trips makes Spark faster, particularly for iterative algorithms and interactive data analytics.

##### Spark architecture:

Apache Spark has a distributed architecture designed to process large-scale data across a cluster of machines efficiently. Its architecture consists of several key components that work together to perform distributed data processing and computation. Here are the main components of the Apache Spark architecture:

![architecture](https://static.javatpoint.com/tutorial/spark/images/spark-architecture.png)






1. Driver Program:
   - The driver program is the entry point for a Spark application. It runs the user's main function and creates a SparkContext to coordinate the execution of tasks across the cluster.
   - The driver program is responsible for defining the application and its execution plan, and it manages the overall control flow.

2. SparkContext:
   - SparkContext is the heart of a Spark application. It is the driver's connection to the cluster and coordinates the execution of tasks on the executors.
   - SparkContext is responsible for setting up the application, connecting to the cluster manager, and distributing tasks to worker nodes.
   - It also manages the configuration and controls the parallelism of data processing tasks.

3. Cluster Manager:
   - The cluster manager is responsible for allocating and managing resources in the cluster. Apache Spark supports several cluster managers: its built-in standalone manager, Hadoop YARN, Kubernetes, and Apache Mesos.
   - The cluster manager ensures that resources are allocated to the Spark application's tasks and that the application runs efficiently.

4. Executors:
   - Executors are processes launched on the worker nodes of the Spark cluster; they run tasks on behalf of the driver program.
   - Each executor runs in its own JVM (Java Virtual Machine) and is responsible for executing tasks and caching data in memory for fast access.
   - Executors receive tasks from the driver program and report their status and results back to it.

5. RDD (Resilient Distributed Dataset):
   - RDD is the fundamental data structure in Spark, representing a distributed collection of data that can be processed in parallel.
   - RDDs are fault-tolerant, distributed, and immutable, and they can be cached in memory for faster access.
   - Spark applications perform transformations and actions on RDDs to process data.

6. Spark Core:
   - Spark Core is the foundation of the Spark framework, providing essential functionalities like task scheduling, memory management, and fault recovery.
   - It includes the core APIs for working with RDDs and offers the basic building blocks for distributed data processing.

7. Libraries and APIs:
   - Spark provides various libraries and APIs for different data processing tasks, including:
     - Spark SQL: For structured data processing using SQL queries.
     - Spark Streaming: For processing real-time data streams.
     - MLlib: For machine learning tasks.
     - GraphX: For graph processing.
     - SparkR: For R language integration.

8. Cluster Mode:
   - Spark applications can run on different cluster managers (standalone, YARN, Kubernetes, or Mesos), allowing users to choose the one that best fits their requirements. For development and testing they can also run in local mode on a single machine.

9. Storage Systems:
   - Spark can read and write data from/to various storage systems, including HDFS, Apache Cassandra, HBase, Amazon S3, and more.

These components work together to enable distributed data processing, fault tolerance, and in-memory computing, making Apache Spark a powerful framework for big data analytics and processing. The flexibility and scalability of Spark's architecture make it suitable for a wide range of data processing tasks.

### What is a broadcast join in Spark?

- As an optimization, we can broadcast a smaller DataFrame to all the nodes in the cluster and then perform the join locally on each node, so the larger DataFrame does not need to be shuffled. This is called a broadcast join.
- `large_df.join(broadcast(small_df), "id")`


### What are wide and narrow transformations in Spark?


##### Narrow transformations:
- Narrow transformations are transformations where each input partition contributes to only one output partition, so no data has to move between partitions.
- Examples: `map`, `filter`, `union`

##### Wide transformations:
- Wide transformations are transformations that require data to be shuffled between partitions. Each output partition depends on multiple input partitions.
- Examples: `join`, `groupByKey`, `reduceByKey`

### What is a broadcast variable?

- A broadcast variable is a read-only variable that is cached once on each machine in the Spark cluster, instead of being shipped with every task.

In [1]:
from pyspark import SparkContext

# Create a Spark context
sc = SparkContext("local", "Broadcast Example")

# Create a large read-only variable
large_variable = range(1, 1000)

# Broadcast the variable to all worker nodes
broadcast_variable = sc.broadcast(large_variable)

# Define a function that uses the broadcast variable
def process_data(x):
    # Access the broadcast variable locally on the executor
    local_data = broadcast_variable.value
    # local_data[0] is 1, so each element comes back unchanged
    return x * local_data[0]

# Create an RDD
data = sc.parallelize([1, 2, 3, 4, 5])

# Use the broadcast variable in a Spark transformation
result = data.map(process_data)

# Collect the results
print(result.collect())

# Stop the Spark context
sc.stop()


[1, 2, 3, 4, 5]


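### What is sc.parallelize in Spark?

- Creates an RDD from a local Python collection in the driver program
- Automatically distributes the data across partitions in the cluster
- Transformations on the resulting RDD can then be performed in parallel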

### RDD vs DataFrame:

- RDD:
    - low level and more flexible
    - often less performant, because optimizations have to be done manually

- DataFrame:
    - high level and easy to understand
    - more performant thanks to automatic optimization
    - uses Spark's Catalyst optimizer

### What is an accumulator in Spark?

- An accumulator is a shared variable that tasks can only add to, which lets values be accumulated across multiple tasks in a parallel and fault-tolerant manner.
- Useful for counters and sums: tasks running on different nodes update it, and only the driver program can read its value.

### Write a simple example in PySpark using an accumulator to count the number of elements in an RDD

In [6]:
from pyspark import SparkContext

sc = SparkContext("local", "Accumulator Example")

# Create an accumulator with an initial value of 0
accumulator = sc.accumulator(0)

# Create an RDD
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)

# Define a function to update the accumulator
def process_data(x):
    global accumulator
    accumulator += 1
    return x

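# Use the accumulator in a Spark transformation.
# Transformations are lazy, so the accumulator is not updated until an action runs.
# Note: updates made inside transformations may be applied more than once
# if a task or stage is re-executed.
result = rdd.map(process_data)

# Perform an action to trigger the execution
result.collect()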

# Access the final value of the accumulator in the driver program
final_count = accumulator.value
print("Final Count:", final_count)

# Stop the Spark context
sc.stop()


                                                                                

Final Count: 5



### What is a UDF in Spark?

A User-Defined Function (UDF) is a feature that lets you define your own functions for use in Spark SQL or DataFrame API operations. UDFs apply custom logic to the data in a distributed, parallelized manner across a Spark cluster.

Typical workflow:

 - define the function and its return type
 - register it (needed to call it by name from SQL)
 - use it in a DataFrame or SQL expression

### Write a UDF in Spark

In [31]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Create a Spark session
spark = SparkSession.builder.appName("UDF Example").getOrCreate()

# Sample DataFrame
data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]
columns = ["Name", "Value"]
df = spark.createDataFrame(data, columns)

# Define a UDF to square a number
@udf(IntegerType())  # the return type is declared as integer
def square_udf(value):
    return value ** 2

# Register the UDF so it can also be called by name ("square") from Spark SQL
spark.udf.register("square", square_udf)

# Use the UDF in a DataFrame transformation
result_df = df.withColumn("SquaredValue", square_udf(df["Value"]))

# Show the result
result_df.show()

# Stop the Spark session
spark.stop()


                                                                                

+-------+-----+------------+
|   Name|Value|SquaredValue|
+-------+-----+------------+
|  Alice|    1|           1|
|    Bob|    2|           4|
|Charlie|    3|           9|
+-------+-----+------------+

