<a href="https://colab.research.google.com/github/ShovalBenjer/Bigdata_Pyspark_Spark_Hadoop_Apache/blob/main/integral_approximation_with_spark_1_ipynb%22.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook explores distributed numerical integration with Apache Spark running in a Google Colab Jupyter notebook. It approximates the integral of f(x) = 10x^2 - 2 over the range [1, 20] and varies the number of intervals n and the number of Spark workers to evaluate the trade-offs between accuracy, execution time, and scalability in a distributed environment. The results show that increasing n improves precision, while parallelism affects execution time through both shared work and coordination overhead, highlighting the efficiency and limitations of distributed systems for computational tasks.
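
For reference, the formula applied in `calculate_integral` corresponds to the composite trapezoidal rule. As a sketch in the notebook's notation, with step size \( h = (b - a)/n \) and grid points \( x_k = a + k h \) (the symbol \( x_k \) is introduced here for illustration):

\[
\int_a^b f(x)\,dx \;\approx\; h \left[ \frac{f(a) + f(b)}{2} + \sum_{k=1}^{n-1} f(x_k) \right]
\]

In this notebook \( f(x) = 10x^2 - 2 \), \( a = 1 \), and \( b = 20 \).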

In [1]:
from pyspark import SparkContext
import time
import pandas as pd
from sympy import symbols, integrate

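# The "local" master runs Spark locally with a single worker thread, so the
# "workers" varied below are RDD partitions rather than separate executors.
sc = SparkContext("local", "Integral Approximation")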

In [2]:
def f(x):
    """Function to calculate f(x) = 10x^2 - 2."""
    return 10 * x**2 - 2

def run_experiments():
    """
    Run the integral approximation for various configurations of workers and intervals.
    Returns:
        pd.DataFrame: Results as a DataFrame.
    """
    a, b = 1, 20  # bounds of the integral
    n_values = [100, 1000, 10000]  # number of intervals
    worker_counts = [2, 4]  # number of workers
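    results = []
    # Draft only: the loop over configurations and the returned DataFrame
    # are added in the full definition of run_experiments() in a later cell.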

In [5]:
def f(x):
    """Function to calculate f(x) = 10x^2 - 2."""
    return 10 * x**2 - 2

def exact_integral_sympy(a, b):
    """
    Compute the exact integral of f(x) = 10x^2 - 2 over [a, b] using SymPy.
    Args:
        a (float): Lower bound of the integral.
        b (float): Upper bound of the integral.
    Returns:
        float: Exact value of the integral.
    """
    x = symbols('x')
    f_sympy = 10 * x**2 - 2
    exact_integral = integrate(f_sympy, (x, a, b))
    return float(exact_integral)

# Define the integral approximation function
def calculate_integral(a, b, n, num_workers):
    """
    Calculate the integral approximation using Spark RDDs.
    Args:
        a (float): Lower bound of the integral.
        b (float): Upper bound of the integral.
        n (int): Number of intervals.
        num_workers (int): Number of Spark partitions (workers).
    Returns:
        float: Approximated integral value.
    """
    h = (b - a) / n  # Step size
    x_values = [a + k * h for k in range(n + 1)]  # n + 1 grid points, endpoints included
    rdd = sc.parallelize(x_values, num_workers)  # Split the points into num_workers partitions

    # Sum f over all grid points using the Spark RDD map and reduce
    integral_sum = rdd.map(f).reduce(lambda x, y: x + y)

    # Trapezoidal rule: integral_sum counts both endpoints with full weight,
    # so subtract them and add them back with weight 1/2
    result = h * ((f(a) + f(b)) / 2 + integral_sum - f(a) - f(b))
    return result

# Run the integral calculation for different configurations
def run_experiments():
    """
    Run the integral approximation for various configurations of workers and intervals.
    The function performs the following steps:
    1. Sets the bounds of the integral (a = 1, b = 20).
    2. Defines the number of intervals (n_values) for approximation: [100, 1000, 10000].
    3. Defines the number of workers (worker_counts) for parallelism: [2, 4].
    4. Iterates through combinations of intervals and worker counts to:
        - Approximate the integral using Spark RDDs.
        - Measure the execution time for the computation.
        - Compare the approximate integral with the exact integral calculated using SymPy.
        - Compute the error between the approximate and exact integral values.
    5. Stores the results (n, number of workers, error, execution time) in a list.
    6. Converts the results into a Pandas DataFrame for analysis.
    Returns:
        pd.DataFrame: Results as a DataFrame with columns:
            - "n": Number of intervals used in the approximation.
            - "Number of Workers": Number of workers (parallelism) used in Spark.
            - "Error": Absolute error between the approximate and exact integral.
            - "Execution Time (s)": Time taken to compute the integral.
    """
    a, b = 1, 20
    n_values = [100, 1000, 10000]
    worker_counts = [2, 4]
    results = []

    for num_workers in worker_counts:
        for n in n_values:
            start_time = time.time()
            integral_value = calculate_integral(a, b, n, num_workers)
            execution_time = time.time() - start_time
            expected_value = exact_integral_sympy(a, b)
            error = abs(integral_value - expected_value)
            results.append((n, num_workers, error, execution_time))
    df = pd.DataFrame(results, columns=["n", "Number of Workers", "Error", "Execution Time (s)"])
    return df
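
As a cross-check on `calculate_integral`, the same trapezoidal sum can be computed serially without Spark. The following is a minimal sketch, not part of the original experiment: `trapezoid_reference` is a hypothetical helper name, and the exact value is taken from the antiderivative 10x^3/3 - 2x instead of SymPy.

```python
def f(x):
    """Integrand from the notebook: f(x) = 10x^2 - 2."""
    return 10 * x**2 - 2


def trapezoid_reference(func, a, b, n):
    """Composite trapezoidal rule on n equal intervals, computed serially.

    Hypothetical helper for cross-checking; it does not use Spark.
    """
    h = (b - a) / n
    # Interior points carry full weight, the two endpoints half weight
    interior = sum(func(a + k * h) for k in range(1, n))
    return h * ((func(a) + func(b)) / 2 + interior)


# Exact value from the antiderivative F(x) = 10x^3/3 - 2x on [1, 20]
exact = 10 * (20**3 - 1**3) / 3 - 2 * (20 - 1)

for n in (100, 1000, 10000):
    approx = trapezoid_reference(f, 1, 20, n)
    print(n, abs(approx - exact))
```

If the Spark version is correct, its errors should agree with these serial values up to floating-point rounding, whatever the number of partitions.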

In [6]:
results_df = run_experiments()
results_sorted = results_df.sort_values(by=["Number of Workers", "n"])
print(results_sorted)
sc.stop()

       n  Number of Workers     Error  Execution Time (s)
0    100                  2  1.143167            2.545882
1   1000                  2  0.011432            0.575953
2  10000                  2  0.000114            0.424423
3    100                  4  1.143167            0.970460
4   1000                  4  0.011432            0.823609
5  10000                  4  0.000114            0.933813


| \( n \)   | Number of Workers | Error       | Execution Time (s) | Explanation                                                                 |
|-----------|-------------------|-------------|--------------------|-----------------------------------------------------------------------------|
| 100       | 2                 | 1.143167    | 2.545882           | Highest error, because few intervals give a coarse approximation. This is the first Spark job in the run, so its time likely includes one-time start-up cost. |
| 1000      | 2                 | 0.011432    | 0.575953           | Error falls by a factor of about 100 with ten times more intervals; execution time also drops. |
| 10000     | 2                 | 0.000114    | 0.424423           | Error falls by another factor of about 100; this is the fastest run measured. |
| 100       | 4                 | 1.143167    | 0.970460           | Same error as with 2 workers, since the error depends only on \( n \). Faster than the first 2-worker run, although that run likely carried start-up cost. |
| 1000      | 4                 | 0.011432    | 0.823609           | Same error as with 2 workers; slower than the 2-worker run, consistent with coordination overhead. |
| 10000     | 4                 | 0.000114    | 0.933813           | Same error as with 2 workers; slower than the 2-worker run, consistent with coordination overhead. |
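
Each timing in the table comes from a single run, so one-time costs can dominate the measurement. One way to reduce that effect is to discard a warm-up run and report the median of several repeats. The sketch below is an assumption about how this could be done, not what the notebook does: `time_call` and `dummy_workload` are hypothetical names, and the workload is a stand-in so the example runs without Spark.

```python
import statistics
import time


def time_call(fn, *args, warmup=1, repeats=5):
    """Return the median wall-clock time of fn(*args) after warm-up runs."""
    for _ in range(warmup):
        fn(*args)  # Discarded: absorbs one-time start-up cost
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def dummy_workload(n):
    """Stand-in for calculate_integral so the sketch runs without Spark."""
    return sum(k * k for k in range(n))


elapsed = time_call(dummy_workload, 10_000)
print(f"median time: {elapsed:.6f} s")
```

In the notebook, the call would be something like `time_call(calculate_integral, a, b, n, num_workers)` in place of the single `time.time()` measurement.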


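Increasing n (the number of intervals) reduces the error sharply: each tenfold increase in n cuts the error by a factor of about 100, and n=10000 brings it down to about 0.0001. The number of workers does not change the error. With 4 workers, n=100 ran faster than the first 2-worker run, which probably includes Spark start-up cost, but for n=1000 and n=10000 the 4-worker runs were slower, which suggests that coordination overhead outweighs the gain from parallelism at these problem sizes.
**In these runs, n=10000 with 2 workers gave the best balance of accuracy and efficiency.**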
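
The errors in the results table fall by a factor of about 100 for each tenfold increase in n, which matches the standard error term of the composite trapezoidal rule. As a sketch, assuming the usual form of that error term, writing \( T_n \) for the trapezoidal approximation with n intervals (a symbol introduced here for illustration), and using \( f''(x) = 20 \) for \( f(x) = 10x^2 - 2 \):

\[
\left| \int_a^b f(x)\,dx - T_n \right| = \frac{(b - a)^3}{12\,n^2}\,\lvert f''(\xi) \rvert = \frac{19^3 \cdot 20}{12\,n^2} \approx \frac{11431.67}{n^2}
\]

Because \( f'' \) is constant, this is an exact expression and not only a bound. It gives about 1.1432 for n=100, 0.011432 for n=1000, and 0.000114 for n=10000, in agreement with the Error column.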