# 0. Introduction and Motivation

Apache Spark has become one of the most popular frameworks for **large-scale data processing** and advanced analytics. It is widely adopted by companies such as **Netflix**, **Uber**, and **Airbnb**, primarily because:

- It offers **in-memory** computation, which can be significantly faster than traditional disk-based systems.
- It has a **unified engine** for both batch and streaming data.
- It provides **high-level APIs** in Python, Scala, Java, and R for accessible big data analytics.

## Real-World Spark Use Cases

1. **Netflix**:
   - Spark is used to build and run Netflix’s recommendation engine, analyzing user watching patterns and behaviors in near real-time.
   - Streams of event data (when you pause, play, skip, or interact with a video) are ingested, then processed with Spark to generate content suggestions.
   - Spark’s in-memory engine drastically reduces the time to train and refresh large-scale machine learning models.

2. **Uber**:
   - Uber leverages Spark (alongside Kafka and other data pipelines) to handle real-time analytics for ride pricing, supply-demand forecasting, and route optimization.
   - Spark’s streaming capabilities help them react quickly to changing traffic conditions, surges in rider demand, and more.

3. **Airbnb**:
   - Airbnb uses Spark to analyze user bookings, prices, and host preferences.
   - They apply machine learning models on huge datasets for personalized search rankings and dynamic pricing.

4. **E-commerce Platforms**:
   - Large online retailers use Spark for recommendation engines, A/B testing analysis, and fraud detection.
   - Data streams (clicks, orders, reviews) can be combined and processed in near real-time for better user experiences.

Spark’s ability to distribute processing across a cluster allows you to tackle massive datasets that would be infeasible to handle on a single machine. Additionally, Spark’s ecosystem includes libraries like **Spark SQL**, **Spark Streaming**, **MLlib**, and **GraphX**, which turn Spark into a one-stop solution for broad data processing tasks.

---

## Comparing Apache Spark with Hadoop

While both Apache Spark and Hadoop are powerful tools for big data processing, they have distinct differences in architecture, performance, and use cases.

### Diagram: Apache Spark vs. Hadoop Architecture

![Apache Spark vs. Hadoop Architecture](../images/spark_vs_hadoop_arch.jpg)

![Apache Spark vs. Hadoop Ecosystem](../images/spark_vs_hadoop_arch_2.png)


*Reference for more information: [GeeksforGeeks](https://www.geeksforgeeks.org/difference-between-hadoop-and-spark/)*

### Interpretation

- **Processing Model**:
  - **Hadoop**: Stores data in HDFS and processes it in batches using the MapReduce paradigm. Each MapReduce job reads its input from disk and writes its output back to disk, so a multi-step pipeline pays disk I/O latency at every step.
  - **Spark**: Can keep intermediate data in RAM between operations, spilling to disk only when memory runs short. This significantly reduces disk read/write cycles and improves performance, especially for iterative tasks that reuse the same data.

- **Performance**:
  - **Hadoop**: Suitable for large-scale batch processing but may experience slower performance due to disk I/O operations.
  - **Spark**: The Spark project has cited speedups of up to 100 times over Hadoop MapReduce for certain in-memory workloads. Actual gains depend on the workload, data size, and cluster configuration, and are most noticeable in iterative machine learning algorithms and low-latency stream processing.

- **Fault Tolerance**:
  - **Hadoop**: Achieves fault tolerance through data replication across multiple nodes. If a node fails, data can be retrieved from another replica.
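  - **Spark**: Uses Resilient Distributed Datasets (RDDs) that record lineage, the sequence of transformations used to build them. If a partition is lost, Spark can recompute it from the source data rather than relying on replicated copies of intermediate results, which saves storage space.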

- **Ease of Use**:
  - **Hadoop**: MapReduce jobs are typically written in Java and tend to be verbose, which makes rapid development harder.
  - **Spark**: Provides high-level APIs in multiple languages, including Python, Scala, Java, and R, simplifying the development process.
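
The lineage idea described under Fault Tolerance can be illustrated with a toy model in plain Python. This is a minimal sketch and not Spark's implementation: `ToyDataset`, `compute_partition`, and the `cache` dictionary are hypothetical names invented for illustration. It assumes the source data is durable and the transformations are deterministic, which is what makes recomputation safe.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ToyDataset:
    """Toy stand-in for an RDD: durable source partitions plus recorded lineage."""
    source: List[List[int]]                          # one list per partition
    lineage: Tuple[Callable[[int], int], ...] = ()   # ordered per-element functions

    def map(self, fn: Callable[[int], int]) -> "ToyDataset":
        # Transformations are only recorded here, not executed (lazy evaluation).
        return ToyDataset(self.source, self.lineage + (fn,))

    def compute_partition(self, index: int) -> List[int]:
        # Replay every recorded transformation over a single source partition.
        values = self.source[index]
        for fn in self.lineage:
            values = [fn(v) for v in values]
        return values


source = [[1, 2, 3], [4, 5, 6]]
dataset = ToyDataset(source).map(lambda x: x * x).map(lambda x: x + 1)

# Materialize every partition in memory, as a cache would.
cache = {i: dataset.compute_partition(i) for i in range(len(source))}

# Simulate losing the node that held partition 1.
del cache[1]

# Recover only the lost partition by replaying its lineage.
cache[1] = dataset.compute_partition(1)
recovered = cache[1]
print(recovered)  # [17, 26, 37]
```

Only the lost partition is rebuilt, and nothing beyond the source data and the list of transformations had to be stored to make that possible.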

### Diagram: Performance Comparison

![Performance Comparison](../images/performance_comparison.png)


*Reference for more information: [Data Engineer Academy](https://dataengineeracademy.com/blog/apache-spark-vs-hadoop-comprehensive-guide/)*

### Interpretation

- **Batch Processing**:
  - **Hadoop**: Efficient for processing large volumes of data in batches but may not be ideal for real-time analytics.
  - **Spark**: Handles both batch and near-real-time stream processing in one engine, offering flexibility for various data processing needs.

- **Scalability**:
  - Both frameworks are highly scalable, capable of handling petabytes of data across numerous nodes. However, Spark's in-memory requirements can lead to higher costs due to the need for more RAM.

### Use Cases

Hadoop use cases include:

- Processing large datasets in environments where data size exceeds available memory.
- Building data analysis infrastructure with a limited budget.
- Completing jobs where immediate results are not required, and time is not a limiting factor.
- Batch processing dominated by disk read and write operations.
- Historical and archive data analysis.

Spark outperforms Hadoop in the following use cases:

- The analysis of real-time stream data.
- Time-sensitive jobs, where in-memory computation delivers results quickly.
- Chains of parallel operations, such as iterative algorithms.
- Graph-parallel processing to model relationships in the data.
- Machine learning applications, particularly those that iterate over the same data many times.

*Reference for more information: [PhoenixNAP](https://phoenixnap.com/kb/hadoop-vs-spark)*

### Interpretation

- **Machine Learning**:
  - **Hadoop**: Limited machine learning capabilities through libraries like Mahout. Not ideal for iterative algorithms.
  - **Spark**: Offers MLlib for scalable machine learning tasks, supporting iterative algorithms and real-time processing.
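
To see why iterative algorithms favor in-memory processing, the toy comparison below counts how often the input is read from storage. It is a sketch in plain Python, not Hadoop or Spark code: `load`, `gradient_step`, and the file name `points.txt` are hypothetical, and the "algorithm" is a simple gradient descent that converges to the mean of four numbers. In PySpark, the in-memory variant roughly corresponds to calling `cache()` on an RDD or DataFrame before iterating over it.

```python
import os
import tempfile


def load(path, stats):
    """Read one float per line from disk, counting each trip to storage."""
    stats["reads"] += 1
    with open(path) as f:
        return [float(line) for line in f]


def gradient_step(values, estimate, lr=0.5):
    """One gradient-descent step that moves `estimate` toward the mean of `values`."""
    gradient = sum(estimate - v for v in values) / len(values)
    return estimate - lr * gradient


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "points.txt")
    with open(path, "w") as f:
        f.write("\n".join(["1.0", "2.0", "3.0", "4.0"]))

    # Disk-bound style: every iteration goes back to storage for its input.
    reload_stats = {"reads": 0}
    reload_estimate = 0.0
    for _ in range(10):
        reload_estimate = gradient_step(load(path, reload_stats), reload_estimate)

    # In-memory style: load once, then iterate over the cached values.
    cached_stats = {"reads": 0}
    cached_values = load(path, cached_stats)
    cached_estimate = 0.0
    for _ in range(10):
        cached_estimate = gradient_step(cached_values, cached_estimate)

print(reload_stats["reads"], cached_stats["reads"])  # 10 1
```

Both loops arrive at the same estimate; the difference is that the first pays the storage cost on every pass, which is the pattern that makes chained MapReduce jobs slow for iterative work.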



## Recap: Real-World Spark Use Cases

1. **Netflix**: Recommendation systems, real-time user behavior analysis.
2. **Uber**: Real-time analytics for pricing, supply-demand forecasting.
3. **Airbnb**: User personalization, dynamic pricing.
4. **E-commerce Platforms**: Fraud detection, recommendation engines.

## Simple Python Example

Below is a *trivial* code snippet that computes the mean of 10 million random numbers on a single machine, using only the Python standard library.

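```python
import random
import time

# Generate 10 million random floats in [0, 1) as an in-memory list
N = 10_000_000
data = [random.random() for _ in range(N)]

# Time only the mean computation, not the data generation
start = time.time()
mean_val = sum(data) / len(data)
end = time.time()

print(f"Mean: {mean_val}")
print(f"Time taken (pure Python): {end - start:.2f} seconds")
```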
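
Before moving to Spark, it may help to see how a mean can be computed when the data is split into partitions, since that is conceptually what a distributed engine has to do. The sketch below is plain Python and only illustrates the idea: `split`, `partial_stats`, and `merge` are hypothetical helper names, and everything runs in one process rather than on separate workers.

```python
from functools import reduce


def split(values, num_partitions):
    """Split values into contiguous chunks of roughly equal size."""
    size = max(1, -(-len(values) // num_partitions))  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def partial_stats(partition):
    """Per-partition work: reduce a chunk to a small (sum, count) pair."""
    return (sum(partition), len(partition))


def merge(a, b):
    """Combine two (sum, count) pairs into one."""
    return (a[0] + b[0], a[1] + b[1])


sample = [float(x) for x in range(1, 11)]  # 1.0 .. 10.0
partitions = split(sample, 3)              # chunk sizes 4, 4, 2

# Each partition could be processed independently; only the small
# (sum, count) pairs need to be brought together at the end.
total, count = reduce(merge, map(partial_stats, partitions))
partitioned_mean = total / count
print(partitioned_mean)  # 5.5
```

Carrying `(sum, count)` pairs rather than per-partition means matters because the partitions have different sizes: averaging the three partition means here would give a different, incorrect result.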

## Simple Spark Example

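Now let's compute the same mean with Spark, which splits the data into partitions and, on a cluster, distributes them across machines. At this data size on a single machine, Spark's startup and serialization overhead may well make it slower than the pure Python version; the benefit appears when the data no longer fits comfortably on one machine.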

```python
import time

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session
spark = SparkSession.builder \
    .appName("SparkIntro") \
    .getOrCreate()

# Convert the Python list `data` from the previous example into an RDD
rdd = spark.sparkContext.parallelize(data)

# Time only the distributed mean computation
start_spark = time.time()
mean_val_spark = rdd.mean()
end_spark = time.time()

print(f"Spark Mean: {mean_val_spark}")
print(f"Time taken (Spark): {end_spark - start_spark:.2f} seconds")

# Stop the Spark session
spark.stop()
```