# TASK 2 ⏰

### Install Apache Spark

In [2]:
!pip install pyspark



### Setup Apache Spark

In [14]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MySparkSession").getOrCreate()

### Data Frame

In [4]:
# A DataFrame is useful for analyzing structured data with SQL-like queries
df = spark.read.csv("customer_churn.csv", header=True, inferSchema=True)

In [22]:
df.printSchema()

root
 |-- CustomerID: integer (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Subscription_Length_Months: integer (nullable = true)
 |-- Watch_Time_Hours: double (nullable = true)
 |-- Number_of_Logins: integer (nullable = true)
 |-- Preferred_Content_Type: string (nullable = true)
 |-- Membership_Type: string (nullable = true)
 |-- Payment_Method: string (nullable = true)
 |-- Payment_Issues: integer (nullable = true)
 |-- Number_of_Complaints: integer (nullable = true)
 |-- Resolution_Time_Days: integer (nullable = true)
 |-- Churn: integer (nullable = true)



In [28]:
df.count()

1000
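
The comment above mentions SQL-like queries. A minimal sketch of one, assuming the `spark` session and `df` from the cells above. The view name `customers` and both helper names are illustrative and not part of the original notebook, and the query assumes `Churn` is a 0/1 flag, which the schema suggests but does not confirm.

```python
def churn_rate_query(view_name="customers"):
    """Build a Spark SQL query for the churn rate per membership type."""
    return (
        "SELECT Membership_Type, AVG(Churn) AS churn_rate, COUNT(*) AS n_customers "
        f"FROM {view_name} "
        "GROUP BY Membership_Type "
        "ORDER BY churn_rate DESC"
    )


def churn_rate_by_membership(spark, df, view_name="customers"):
    """Register df as a temporary view and run the query against it."""
    # Assumption: df has the Membership_Type and Churn columns shown by printSchema()
    df.createOrReplaceTempView(view_name)
    return spark.sql(churn_rate_query(view_name))
```

In the notebook this could be called as `churn_rate_by_membership(spark, df).show()`.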

### RDD (Resilient Distributed Dataset)

### 1. Initialize PySpark and Load Data

In [13]:
# An RDD is better suited to functional transformations such as map and reduce
rdd = spark.sparkContext.textFile("customer_churn.csv")

### 2. Perform the MapReduce Job

In [18]:
# Map: split lines into tokens and assign each an initial count of 1
# flatMap(lambda line: line.split(" ")) splits each line on spaces (the file is comma-separated, so a token can span several columns)
# map(lambda word: (word, 1)) maps each token to a (token, 1) tuple
words_rdd = rdd.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1))

# Reduce: Sum counts for each word
word_counts = words_rdd.reduceByKey(lambda a, b: a + b)

# Collect and show the results
#for word, count in word_counts.collect():
    #print(word, count)

In [19]:
#Show top 10 words
sorted_word_counts = word_counts.sortBy(lambda x: x[1], ascending=False)
print(sorted_word_counts.take(10))  # Show top 10 words

[('Shows,Standard,Credit', 32), ('Shows,Basic,Bank', 25), ('Shows,Standard,Bank', 25), ('Shows,Premium,Bank', 23), ('Shows,Premium,Credit', 23), ('Shows,Basic,Credit', 21), ('Transfer,0,0,26,1', 4), ('Card,0,7,21,0', 3), ('Transfer,0,7,18,0', 3), ('Card,0,7,4,0', 3)]
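
Tokens such as `'Shows,Standard,Credit'` appear because the file is comma-separated but the lines were split on spaces, so a space inside one value and a space inside a later value become the token boundaries. A minimal sketch of counting the values of a single column instead, assuming the first line is the header and no field contains a quoted comma. The helper names are hypothetical, and index 5 is assumed to be `Preferred_Content_Type`, following the column order printed by `df.printSchema()`.

```python
def column_value(line, index):
    """Return the field at position `index` of a comma-separated line."""
    return line.split(",")[index].strip()


def count_column_values(rdd, index):
    """MapReduce-style count of the values found in one CSV column."""
    header = rdd.first()                              # first line holds the column names
    rows = rdd.filter(lambda line: line != header)    # drop the header line
    pairs = rows.map(lambda line: (column_value(line, index), 1))  # map step
    return pairs.reduceByKey(lambda a, b: a + b)                    # reduce step
```

With the `rdd` loaded above, `count_column_values(rdd, 5).collect()` would return one `(value, count)` pair per distinct value in that column.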



* **Volume:** `df.count()` shows that the dataset contains 1000 rows.

* **Velocity:** The data is batch processed: it already existed before the analysis and was collected and loaded into the cluster as a whole rather than arriving as a stream.

* **Variety:** `df.printSchema()` shows the structure: 12 columns with integer, double, and string types.


# TASK 3 ⏰

## • What advantages does Spark offer over traditional MapReduce?
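Spark's MapReduce-style processing is significantly faster than traditional Hadoop MapReduce, largely because RDDs keep intermediate results in memory. Its API is also more concise: transformations are written as short lambda functions, and there is built-in support for statistical computations.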

It also uses lazy evaluation: transformations are only recorded as a lineage of operations, and execution occurs only when an action (such as `collect` or `reduce`) is triggered.
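
One way to observe lazy evaluation is to count how many elements have been processed before and after an action. This is a minimal sketch, assuming a SparkContext is passed in as `sc`. The function name is hypothetical, and accumulator counts inside transformations can be higher if Spark retries a task.

```python
def demonstrate_lazy_evaluation(sc, values):
    """Return how many elements were processed before and after an action."""
    processed = sc.accumulator(0)      # counter updated by the worker tasks

    def square(x):
        processed.add(1)               # record that this element was processed
        return x * x

    squared = sc.parallelize(values).map(square)   # transformation: only recorded
    before = processed.value                        # read before any action has run
    squared.collect()                               # action: triggers the computation
    after = processed.value                         # read after the action
    return before, after
```

Calling `demonstrate_lazy_evaluation(spark.sparkContext, [1, 2, 3, 4])` with the session created earlier should show that no elements are processed until `collect()` runs.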

RDDs can be persisted (cached) in memory or on disk for reuse, which improves performance for iterative computations.
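
A minimal sketch of reusing a cached RDD across two actions, assuming an RDD of comma-separated text lines such as the `rdd` loaded in Task 2. The function name is hypothetical, and the row count it returns includes the header line.

```python
def row_and_column_counts(rdd):
    """Run two actions over the same parsed RDD, caching it in between."""
    parsed = rdd.map(lambda line: line.split(",")).cache()  # mark for in-memory reuse
    n_rows = parsed.count()           # first action: computes and caches the partitions
    n_cols = parsed.map(len).max()    # second action: can read the cached partitions
    parsed.unpersist()                # release the cached data when done
    return n_rows, n_cols
```

Without `cache()`, the second action would re-read and re-split the file, because Spark recomputes an unpersisted RDD from its lineage for every action.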

RDDs are split into partitions that are processed simultaneously, which makes them efficient for parallel batch processing.

## • How does Spark's in-memory computation improve performance?
RDDs speed up MapReduce-style jobs mainly through in-memory computation: intermediate results stay in RAM, which is much faster to access than disk.


### MapReduce Approach
Reads data from HDFS → Maps → Writes to disk → Shuffles → Reduces → Writes final output to HDFS.

Multiple disk I/O operations slow down performance.

### Spark Approach
Loads data into memory → Processes transformations in-memory → Writes final output to disk once.

Avoiding repeated disk writes can lead to large speedups, often cited as 10-100x for iterative workloads that fit in memory.

# TASK 5 ⏰

| Feature             | **Apache Spark**                                                       | **Apache Hive**                                                    |
|---------------------|------------------------------------------------------------------------|--------------------------------------------------------------------|
| **Structured Data**  | Efficient with DataFrames and SQL-like queries (Spark SQL). Optimized for in-memory processing. | Designed for structured data with SQL-based querying. Runs on top of Hadoop. |
| **Unstructured Data**| Can process unstructured data (e.g., text, images) using RDDs and ML libraries. | Primarily for structured data but can handle semi-structured (JSON, XML) with extra parsing. |
| **Processing Speed** | In-memory computation makes it much faster for iterative tasks.         | Traditionally runs queries as disk-based MapReduce jobs, making it slower. |
| **Use Case**         | Best for real-time analytics, machine learning, and complex transformations. | Best for batch processing, ETL, and data warehousing.             |


Spark is more flexible for unstructured data, while Hive is optimized for structured queries.