<a href="https://colab.research.google.com/github/ShovalBenjer/Bigdata_Pyspark_Spark_Hadoop_Apache/blob/ShovalBenjer-patch-1/integral_approximation_with_spark_1_ipynb%22.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook explores distributed numerical integration with Apache Spark running in a Google Colab Jupyter notebook. It approximates the integral of a specific function f(x) over a fixed range. By varying the number of intervals n and the number of Spark workers, the project evaluates the trade-offs between accuracy, execution time, and scalability in a distributed environment. The results show that increasing n improves precision, while parallelism can reduce execution time at the cost of coordination overhead, which highlights both the efficiency and the limitations of distributed systems for computational tasks.
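
As a point of reference, the approximation used in this notebook is the composite trapezoidal rule (this is what the formula in `calculate_integral` works out to). The sketch below is a plain-Python version without Spark; the name `trapezoid` is introduced here for illustration only, and the exact value is computed from the antiderivative by hand rather than with SymPy.

```python
def f(x):
    """Integrand used throughout the notebook: f(x) = 10x^2 - 2."""
    return 10 * x**2 - 2


def trapezoid(func, a, b, n):
    """Composite trapezoidal rule with n equal-width intervals on [a, b]."""
    h = (b - a) / n
    # Interior points x_1 .. x_{n-1} carry full weight; the endpoints carry half.
    interior = sum(func(a + k * h) for k in range(1, n))
    return h * (0.5 * (func(a) + func(b)) + interior)


# The antiderivative of f is F(x) = 10x^3/3 - 2x, so the exact value is F(20) - F(1).
exact = (10 * 20**3 / 3 - 2 * 20) - (10 * 1**3 / 3 - 2 * 1)

for n in (100, 1000, 10000):
    approx = trapezoid(f, 1, 20, n)
    print(n, approx, abs(approx - exact))
```

The Spark version distributes only the evaluation and summation of `func` over the grid points; the arithmetic is otherwise the same.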

**Setup: imports and initialization**

In [8]:
import time

import pandas as pd
from pyspark import SparkConf, SparkContext
from sympy import symbols, integrate

# Reuse the active SparkContext if one exists; otherwise create a local one.
# Note: the "local" master runs Spark with a single worker thread.
conf = SparkConf().setMaster("local").setAppName("Integral Approximation")
sc = SparkContext.getOrCreate(conf)

In [9]:
def f(x):
    """Return f(x) = 10x^2 - 2. Replace the body to approximate a different integrand."""
    return 10 * x**2 - 2

def exact_integral_sympy(a, b):
    """
    Calculate the exact integral of f(x) using SymPy.
    Args:
        a (float): Lower bound of the integral.
        b (float): Upper bound of the integral.
    Returns:
        float: Exact integral value.
    """
    x = symbols('x')
    exact_integral = integrate(f(x), (x, a, b))
    return float(exact_integral)


def calculate_integral(a, b, n, num_workers):
    """
    Approximate the integral of f over [a, b] with the composite trapezoidal rule,
    evaluating f on a Spark RDD.
    Args:
        a (float): Lower bound of the integral.
        b (float): Upper bound of the integral.
        n (int): Number of intervals.
        num_workers (int): Number of Spark partitions (workers).
    Returns:
        float: Approximated integral value.
    """
    h = (b - a) / n  # step size
    x_values = [a + k * h for k in range(n + 1)]  # the n + 1 grid points
    rdd = sc.parallelize(x_values, num_workers)  # distribute the points across partitions
    # map(f) evaluates f at every grid point; reduce sums the values across partitions.
    integral_sum = rdd.map(f).reduce(lambda x, y: x + y)
    # integral_sum includes both endpoints, so subtract them to get the interior sum;
    # this is equivalent to h * (integral_sum - (f(a) + f(b)) / 2).
    result = h * ((f(a) + f(b)) / 2 + integral_sum - f(a) - f(b))
    return result

def run_experiments():
    """
    Run the integral approximation for various configurations of workers and intervals.
    The function performs the following steps:
    1. Sets the bounds of the integral (a = 1, b = 20).
    2. Defines the number of intervals (n_values) for approximation: [100, 1000, 10000].
    3. Defines the number of workers (worker_counts) for parallelism: [2, 4].
    4. Iterates through combinations of intervals and worker counts to:
        - Approximate the integral using Spark RDDs.
        - Measure the execution time for the computation.
        - Compare the approximate integral with the exact integral calculated using SymPy.
        - Compute the error between the approximate and exact integral values.
    5. Stores the results (n, number of workers, error, execution time) in a list.
    6. Converts the results into a Pandas DataFrame for analysis.
    Returns:
        pd.DataFrame: Results as a DataFrame with columns:
            - "n": Number of intervals used in the approximation.
            - "Number of Workers": Number of workers (parallelism) used in Spark.
            - "Error": Absolute error between the approximate and exact integral.
            - "Execution Time (s)": Time taken to compute the integral.
    """
    a, b = 1, 20
    n_values = [100, 1000, 10000]
    worker_counts = [2, 4]
    results = []

    # The exact value does not depend on n or the worker count, so compute it once.
    expected_value = exact_integral_sympy(a, b)

    for num_workers in worker_counts:
        for n in n_values:
            # perf_counter is a monotonic clock intended for measuring elapsed time.
            start_time = time.perf_counter()
            integral_value = calculate_integral(a, b, n, num_workers)
            execution_time = time.perf_counter() - start_time
            error = abs(integral_value - expected_value)
            results.append((n, num_workers, error, execution_time))
    df = pd.DataFrame(results, columns=["n", "Number of Workers", "Error", "Execution Time (s)"])
    return df
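
A short derivation may help connect the arithmetic in `calculate_integral` to the textbook formula. The symbols \( T_n \), \( S \), and \( x_k \) are notation introduced here, not names from the code. With step size \( h = (b - a)/n \) and grid points \( x_k = a + kh \), the composite trapezoidal rule is

\[
T_n = h \left[ \frac{f(a) + f(b)}{2} + \sum_{k=1}^{n-1} f(x_k) \right].
\]

The RDD reduction returns the sum over all \( n + 1 \) grid points, \( S = \sum_{k=0}^{n} f(x_k) \), which includes both endpoints. The interior sum is therefore \( S - f(a) - f(b) \), and

\[
T_n = h \left[ \frac{f(a) + f(b)}{2} + S - f(a) - f(b) \right] = h \left[ S - \frac{f(a) + f(b)}{2} \right].
\]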

In [None]:
results_df = run_experiments()
results_sorted = results_df.sort_values(by=["Number of Workers", "n"])
print(results_sorted)
sc.stop()
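
The following plain-Python sketch imitates what `sc.parallelize(x_values, num_workers)` followed by `map(f).reduce(...)` computes, without needing Spark. The names `split_into_partitions` and `partitioned_trapezoid` are hypothetical, and the contiguous near-equal slicing is an assumption made for illustration, not a description of Spark's internals. It shows that the partition count changes how the sum is grouped, not its value (up to floating-point rounding).

```python
def f(x):
    return 10 * x**2 - 2


def split_into_partitions(values, num_partitions):
    """Split a list into contiguous, near-equal chunks (hypothetical helper)."""
    size, extra = divmod(len(values), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        chunks.append(values[start:end])
        start = end
    return chunks


def partitioned_trapezoid(a, b, n, num_partitions):
    h = (b - a) / n
    x_values = [a + k * h for k in range(n + 1)]
    # "map" plus per-partition "reduce": each chunk is evaluated and summed independently.
    partial_sums = [sum(f(x) for x in chunk)
                    for chunk in split_into_partitions(x_values, num_partitions)]
    # Final "reduce": combine the per-partition partial sums.
    total = sum(partial_sums)
    # Endpoints get half weight in the trapezoidal rule.
    return h * (total - 0.5 * (f(a) + f(b)))


for parts in (1, 2, 4):
    print(parts, partitioned_trapezoid(1, 20, 100, parts))
```

Because addition is regrouped rather than changed, the approximation for a given n should agree across partition counts to within rounding.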

| \( n \) | Number of Workers | Error    | Execution Time (s) | Explanation |
|---------|-------------------|----------|--------------------|-------------|
| 100     | 2                 | 1.143167 | 3.728478           | Largest error because few intervals are used. This is the first measured run, so the long time likely includes Spark startup overhead rather than computation. |
| 1000    | 2                 | 0.011432 | 0.910908           | Error drops by a factor of 100 with 10 times more intervals; execution is much faster than the first run. |
| 10000   | 2                 | 0.000114 | 0.675122           | Smallest error and the fastest run overall. |
| 100     | 4                 | 1.143167 | 1.627744           | Same error as with 2 workers; faster than the 2-worker run at this \( n \). |
| 1000    | 4                 | 0.011432 | 1.029233           | Same error as with 2 workers; slightly slower than with 2 workers, consistent with added coordination overhead. |
| 10000   | 4                 | 0.000114 | 0.980111           | Same error as with 2 workers; slower than with 2 workers, again consistent with coordination overhead between workers. |

### Overall Comparison
- **Error**:
  - The error depends only on the number of intervals (\( n \)) and is identical for the 2-worker and 4-worker configurations at each \( n \).
  - Each tenfold increase in \( n \) reduces the error by a factor of about 100; the largest setting (\( n = 10000 \)) reaches an error of \( 0.000114 \).

- **Execution Time**:
  - For the smallest setting (\( n = 100 \)), going from 2 to 4 workers reduced execution time from 3.73 s to 1.63 s.
  - For \( n = 1000 \) and \( n = 10000 \), going from 2 to 4 workers slightly increased execution time, likely due to coordination overhead.
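
The error pattern can be checked against the standard error term of the composite trapezoidal rule, \( (b - a) h^2 |f''| / 12 \) with \( h = (b - a)/n \), which is exact when \( f'' \) is constant. For f(x) = 10x^2 - 2 the second derivative is 20 everywhere. The function name `predicted_trapezoid_error` below is introduced for this sketch only.

```python
def predicted_trapezoid_error(a, b, n, second_derivative):
    """Absolute trapezoidal-rule error when f'' is constant on [a, b]."""
    return (b - a) ** 3 * abs(second_derivative) / (12 * n**2)


# f(x) = 10x^2 - 2 has f''(x) = 20 everywhere.
for n in (100, 1000, 10000):
    print(n, round(predicted_trapezoid_error(1, 20, n, 20), 6))
```

The predicted values (1.143167, 0.011432, 0.000114) agree with the Error column in the table, and the \( 1/n^2 \) dependence explains the factor of 100 per tenfold increase in \( n \).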

- **Expected Behavior with \( n \)**:
  - Small \( n \): High execution time due to overhead dominating computation.
  - Moderate \( n \): Efficient execution as workers are better utilized, leading to reduced execution time.
  - Large \( n \): Gradual increase in execution time as computation grows linearly and distributed system limitations (e.g., memory or bandwidth) are reached.
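
The timings in the table come from a single run per configuration, so one-off costs can distort the comparison. One possible way to reduce that noise, sketched below, is to discard a warm-up run and report the median of several repeats. The helper `time_call` and the stand-in `workload` are hypothetical and are not part of the notebook.

```python
import statistics
import time


def time_call(func, *args, warmups=1, repeats=5):
    """Return the median wall-clock time of func(*args) over several runs."""
    for _ in range(warmups):
        func(*args)  # discard warm-up runs (startup and caching effects)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Stand-in workload so the sketch runs without Spark.
def workload(n):
    return sum(10 * (k / n) ** 2 - 2 for k in range(n + 1))


print(time_call(workload, 10000))
```

In the notebook, the same helper could wrap the Spark call, for example `time_call(calculate_integral, a, b, n, num_workers)`.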

**In these runs, n=10000 with 2 workers gave the best balance of accuracy and efficiency: the lowest error (0.000114) and the shortest execution time (0.675 s).**

