## Step 1: Introduction to Apache Spark and Setting Up Spark on the Cluster

### Task 1: Install and Run Apache Spark

1. Download Spark:

In [None]:
$ wget https://downloads.apache.org/spark/spark-3.4.0/spark-3.4.0-bin-hadoop3.tgz
$ tar -xzf spark-3.4.0-bin-hadoop3.tgz

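2. Set Environment Variables for Spark: Add the following to your `~/.bashrc`, replacing `/path/to/spark` with the directory you extracted above, then run `source ~/.bashrc` so the current terminal picks up the change: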

In [None]:
export SPARK_HOME=/path/to/spark
export PATH=$PATH:$SPARK_HOME/bin
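
Before starting the shell, it can help to confirm that the variables took effect. The following is a minimal sketch in Python, not part of Spark itself: `check_spark_env` is a hypothetical helper name, and it only assumes that `spark-shell` lives under `$SPARK_HOME/bin`, as in the standard binary distribution.

```python
import os
from pathlib import Path


def check_spark_env(env=None):
    """Return a list of problems found with the Spark environment variables."""
    env = os.environ if env is None else env
    problems = []

    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        return ["SPARK_HOME is not set"]

    # The launcher scripts are expected under $SPARK_HOME/bin
    bin_dir = Path(spark_home) / "bin"
    shell = bin_dir / "spark-shell"
    if not shell.is_file():
        problems.append(f"{shell} not found")

    # Normalize PATH entries so trailing slashes do not cause false alarms
    path_entries = {str(Path(p)) for p in env.get("PATH", "").split(os.pathsep) if p}
    if str(bin_dir) not in path_entries:
        problems.append(f"{bin_dir} is not on PATH")

    return problems


if __name__ == "__main__":
    issues = check_spark_env()
    print("Spark environment looks OK" if not issues else "\n".join(issues))
```

An empty list means both variables point at a usable Spark directory; otherwise each string names one thing to fix.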

3. Start and verify the Spark Shell:

In [None]:
$ $SPARK_HOME/bin/spark-shell

## Step 2: Case Study: Setting Up a Big Data Ecosystem for a Retail Business. Introduction to Spark Basics Using pyspark-shell

**Scenario**:
You are tasked with building a big data ecosystem for a fictional online retail business. The business wants to analyze large volumes of sales data to gain insights into customer behavior, popular products, and seasonal trends. You will use Hadoop and Spark to set up this big data environment, process data, and perform basic analysis.

In this exercise, you'll focus on using Apache Spark to process and analyze customer sales data interactively in the `pyspark-shell`.

### Task 2: Starting and Testing pyspark-shell
**Description**: In this task, students will use the `pyspark-shell` to interact with data and perform basic Spark operations.

1. Launch `pyspark-shell`: Start the interactive PySpark shell from your terminal:

In [None]:
$ $SPARK_HOME/bin/pyspark

2. Spark Context: When the shell starts, a `SparkContext` named `sc` is automatically available. This is the entry point to Spark and allows you to interact with data in a distributed manner.

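Check that the SparkContext is running. To override it with your own, stop the existing one first, because only one SparkContext can be active at a time:

In [None]:
>>> sc
>>> # OR, to replace it:
>>> sc.stop()  # creating a second active context raises ValueError
>>> sc = SparkContext("local", "Simple App")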

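3. Run a Simple Spark Script (Word Count Example): The code below is written as a standalone script. If you paste it into the running shell instead, skip the `SparkContext(...)` line and reuse the existing `sc`:

In [None]:
from pyspark import SparkContext

# Standalone script: create the context explicitly
sc = SparkContext("local", "Word Count")

# Split each line into words, pair each word with 1, then sum the counts per word
text_file = sc.textFile("hdfs:///user/student/words.txt")
counts = text_file.flatMap(lambda line: line.split()) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)

# Spark writes a directory of part files at this path; the path must not already exist
counts.saveAsTextFile("hdfs:///user/student/output/wordcount.txt")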
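
To see what each stage of the word count produces, the same three steps can be traced locally in plain Python. This is only an illustration of the data flow, not how Spark executes it; the two sample lines are made up and stand in for the contents of `words.txt`.

```python
from collections import defaultdict

# Hypothetical sample lines standing in for the contents of words.txt
lines = ["to be or not to be", "to see or not to see"]

# flatMap: each line yields several words, flattened into one list
words = [word for line in lines for word in line.split()]

# map: pair every word with an initial count of 1
pairs = [(word, 1) for word in words]

# reduceByKey: merge the values of equal keys with a + b
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))
# {'to': 4, 'be': 2, 'or': 2, 'not': 2, 'see': 2}
```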


### Task 3: Loading and Handling Data in Spark

**Description**: Load a sample retail sales dataset into Spark. For this case study, imagine the dataset contains customer transactions with columns like `customer_id`, `product`, `category`, and `amount_spent`.

1. Create Sample Retail Data: In this example, we'll simulate a small dataset directly in the shell:


In [None]:
data = [
    ("1001", "Laptop", "Electronics", 1200.00),
    ("1002", "Smartphone", "Electronics", 800.00),
    ("1003", "Shoes", "Fashion", 150.00),
    ("1004", "T-shirt", "Fashion", 20.00),
    ("1005", "Book", "Books", 25.00)
]

2. **Parallelize the Data**: Use Spark's `parallelize` function to create an RDD (Resilient Distributed Dataset):

In [None]:
rdd = sc.parallelize(data)

3. **Display the Data**: Show the records in the RDD. `collect()` returns every record to the driver, which is fine for a dataset this small:

In [None]:
rdd.collect()

**Performing Basic Transformations:**

**Description**: Transformations in Spark are operations that create new RDDs from existing ones. They are lazy: Spark only records them and runs nothing until an action is called. Common transformations include `map`, `filter`, and `reduceByKey`.
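
Of the transformations named above, `filter` is the only one the steps below do not exercise, so here is a small sketch of it. The helper names `is_high_value` and `high_value_transactions` and the threshold of 100 are assumptions chosen for illustration; only `rdd.filter` is part of the Spark API. The predicate is a plain function, so it can be checked locally without a cluster.

```python
def is_high_value(record, threshold=100.0):
    """True when a (customer_id, product, category, amount_spent) record meets the threshold."""
    return record[3] >= threshold


def high_value_transactions(rdd, threshold=100.0):
    """Return a new RDD holding only the high-value transactions (lazy, nothing runs yet)."""
    return rdd.filter(lambda record: is_high_value(record, threshold))


# Local check of the predicate on two of the sample records
sample = [
    ("1001", "Laptop", "Electronics", 1200.00),
    ("1004", "T-shirt", "Fashion", 20.00),
]
print([r[1] for r in sample if is_high_value(r)])
# ['Laptop']
```

In the shell, `high_value_transactions(rdd).collect()` would apply the same predicate to the RDD built from the sample data.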

4. Map Transformation: Use the map function to extract the product categories and sales amounts:

In [None]:
categories_amounts = rdd.map(lambda x: (x[2], x[3]))
categories_amounts.collect()

5. ReduceByKey Transformation: Aggregate total sales by product category using `reduceByKey`:

In [None]:
total_sales_by_category = categories_amounts.reduceByKey(lambda x, y: x + y)
total_sales_by_category.collect()
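
`reduceByKey` first combines values inside each partition and then merges the partial results across partitions, which is why the function passed to it should be associative and commutative. The plain-Python simulation below sketches that two-phase idea; `reduce_by_key_local` is a hypothetical helper, and the split of the sample pairs into two partitions is an assumption, since Spark decides the real partitioning.

```python
def reduce_by_key_local(partitions, func):
    """Simulate reduceByKey: combine within each partition, then merge the partial results."""
    # Phase 1: combine values per key inside every partition
    partials = []
    for partition in partitions:
        combined = {}
        for key, value in partition:
            combined[key] = func(combined[key], value) if key in combined else value
        partials.append(combined)

    # Phase 2: merge the per-partition results key by key
    merged = {}
    for partial in partials:
        for key, value in partial.items():
            merged[key] = func(merged[key], value) if key in merged else value
    return merged


# The sample (category, amount) pairs, split into two hypothetical partitions
partitions = [
    [("Electronics", 1200.00), ("Electronics", 800.00), ("Fashion", 150.00)],
    [("Fashion", 20.00), ("Books", 25.00)],
]
print(reduce_by_key_local(partitions, lambda x, y: x + y))
# {'Electronics': 2000.0, 'Fashion': 170.0, 'Books': 25.0}
```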

**Performing Actions:**

**Description**: Actions trigger the execution of transformations and return results. Common actions include `collect`, `count`, and `take`.

6. Count the Number of Transactions: Use the count action to find the number of transactions:

In [None]:
total_transactions = rdd.count()
print(f"Total Transactions: {total_transactions}")

7. Display a Sample of the Data: Use the `take` action to show a few transactions:

In [None]:
sample_data = rdd.take(3)
print(sample_data)

### Task 4: Analyzing Retail Sales Data

**Description**: Now, students will apply the transformations and actions they’ve learned to answer specific business questions for the retail business.

1. **What is the Total Revenue?** Calculate the total revenue from all sales:



In [None]:
total_revenue = rdd.map(lambda x: x[3]).sum()
print(f"Total Revenue: ${total_revenue}")

2. **What is the Total Revenue by Category?** Calculate total sales revenue for each product category:

In [None]:
total_sales_by_category = rdd.map(lambda x: (x[2], x[3])).reduceByKey(lambda x, y: x + y)
total_sales_by_category.collect()

3. **Which Category Has the Highest Sales?** Find the product category with the highest sales:

In [None]:
highest_sales_category = total_sales_by_category.max(key=lambda x: x[1])
print(f"Highest Sales Category: {highest_sales_category[0]} with ${highest_sales_category[1]}")
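
If the business wants a ranking rather than a single winner, the same per-category totals can be ordered with the `takeOrdered` action. This is a sketch under assumptions: `top_categories` and `top_categories_local` are hypothetical helper names, and the totals listed are the ones the sample data produces.

```python
def top_categories(category_totals_rdd, n=3):
    """Return the n (category, total) pairs with the largest totals, highest first."""
    # Negating the total turns takeOrdered's ascending order into descending
    return category_totals_rdd.takeOrdered(n, key=lambda kv: -kv[1])


def top_categories_local(category_totals, n=3):
    """Plain-Python equivalent, useful for checking the expected ranking."""
    return sorted(category_totals, key=lambda kv: -kv[1])[:n]


totals = [("Fashion", 170.0), ("Books", 25.0), ("Electronics", 2000.0)]
print(top_categories_local(totals, n=2))
# [('Electronics', 2000.0), ('Fashion', 170.0)]
```

In the shell, `top_categories(total_sales_by_category, n=2)` would return the same ranking from the RDD.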

### Summary of Key Operations

- **RDD**: Resilient Distributed Dataset, the fundamental data structure in Spark.

- **Transformations**: Operations that return a new RDD (e.g., `map`, `filter`, `reduceByKey`).

- **Actions**: Operations that trigger the execution of transformations and return results (e.g., `collect`, `count`, `sum`).



--------------------------------------------------------