
In [1]:
pip install pyspark

Collecting pyspark
  Downloading pyspark-3.5.2.tar.gz (317.3 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.2-py2.py3-none-any.whl size=317812365 sha256=eb04db76e50f27705ccb5b587e3fdae4be68213eec70bdd9cdbc68f60e840898
  Stored in directory: /root/.cache/pip/wheels/34/34/bd/03944534c44b677cd5859f248090daa9fb27b3c8f8e5f49574
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.2


In [2]:
sales_data = [
    ("ProductA", 100),
    ("ProductB", 150),
    ("ProductA", 200),
    ("ProductC", 300),
    ("ProductB", 250),
    ("ProductC", 100)
]
regional_sales_data = [
    ("ProductA", 50),
    ("ProductC", 150)
]
### **Step 1: Initialize Spark Context**

# 1. **Initialize SparkSession and SparkContext:**
#    - Create a Spark session in PySpark and use the `spark.sparkContext` to create an RDD from the provided data.

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("RDD for sales and regional data").getOrCreate()

# The SparkContext is the entry point for the RDD API.
sc = spark.sparkContext
print("Spark session created")


Spark session created


In [6]:
# 2. **Task 1: Create an RDD from the Sales Data**
#    - Create an RDD from the `sales_data` list provided above.
#    - Print the first few elements of the RDD.
rdd = sc.parallelize(sales_data)
# The dataset is tiny, so collect() is safe here; on a large RDD, take(n) avoids pulling everything to the driver.
print("Original RDD:", rdd.collect())

# 3. **Task 2: Group Data by Product Name**
#    - Group the sales data by product name using `groupByKey()`.
#    - Print the grouped data to understand its structure.
grouped_sales_rdd = rdd.groupByKey()
print("Grouped Data by Product Name:")
for key, values in grouped_sales_rdd.collect():
    print(f"{key}: {list(values)}")

# 4. **Task 3: Calculate Total Sales by Product**
#    - Use `reduceByKey()` to calculate the total sales for each product.
#    - Print the total sales for each product.
total_sales_rdd = rdd.reduceByKey(lambda x, y: x + y)
print("Total Sales by Product:")
for product, total_sales in total_sales_rdd.collect():
    print(f"{product}: {total_sales}")

# 5. **Task 4: Sort Products by Total Sales**
#    - Sort the products by their total sales in descending order.
#    - Print the sorted list of products along with their sales amounts.
sorted_sales_rdd = total_sales_rdd.sortBy(lambda x: x[1], ascending=False)
print("Sorted Products by Total Sales:")
for product, total_sales in sorted_sales_rdd.collect():
    print(f"{product}: {total_sales}")

Original RDD: [('ProductA', 100), ('ProductB', 150), ('ProductA', 200), ('ProductC', 300), ('ProductB', 250), ('ProductC', 100)]
Grouped Data by Product Name:
ProductA: [100, 200]
ProductB: [150, 250]
ProductC: [300, 100]
Total Sales by Product:
ProductA: 300
ProductB: 400
ProductC: 400
Sorted Products by Total Sales:
ProductB: 400
ProductC: 400
ProductA: 300
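
The grouping, totalling, and sorting steps in the cell above can be modelled in plain Python, which makes it easier to see what each transformation computes. The sketch below is an illustration only, not Spark code: `grouped`, `totals`, and `ranked` are hypothetical names, and the dictionary stands in for the distributed key-value RDD. In Spark itself, `reduceByKey()` merges values within each partition before the shuffle, whereas `groupByKey()` ships every value across the network, so `reduceByKey()` is usually the better choice for sums.

```python
from collections import defaultdict
from functools import reduce

sales_data = [
    ("ProductA", 100), ("ProductB", 150), ("ProductA", 200),
    ("ProductC", 300), ("ProductB", 250), ("ProductC", 100),
]

# Model of groupByKey(): gather every value under its key.
grouped = defaultdict(list)
for product, amount in sales_data:
    grouped[product].append(amount)

# Model of reduceByKey(lambda x, y: x + y): fold each key's values pairwise.
totals = {
    product: reduce(lambda x, y: x + y, amounts)
    for product, amounts in grouped.items()
}

# Sort by total (descending), then by name, so tied products have a fixed order.
ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))

for product, total in ranked:
    print(f"{product}: {total}")
# ProductB: 400
# ProductC: 400
# ProductA: 300
```

ProductB and ProductC tie at 400, and sorting on the total alone does not say which of them comes first. If a stable order matters, a composite key such as `total_sales_rdd.sortBy(lambda x: (-x[1], x[0]))` should give the same tie-breaking in PySpark; that call was not run as part of this notebook.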


In [8]:
# ### **Step 4: Additional Transformations**

# 6. **Task 5: Filter Products with High Sales**
#    - Filter the products that have total sales greater than 200.
#    - Print the products that meet this condition.
high_sales_rdd = total_sales_rdd.filter(lambda x: x[1] > 200)
print("Products with Total Sales Greater Than 200:")
for product, total_sales in high_sales_rdd.collect():
    print(f"{product}: {total_sales}")

# 7. **Task 6: Combine Regional Sales Data**
#    - Create another RDD from the `regional_sales_data` list.
#    - Combine this RDD with the original sales RDD using `union()`.
#    - Calculate the new total sales for each product after combining the datasets.
#    - Print the combined sales data.

regional_sales_rdd = sc.parallelize(regional_sales_data)
combined_sales_rdd = rdd.union(regional_sales_rdd)
new_total_sales_rdd = combined_sales_rdd.reduceByKey(lambda x, y: x + y)
print("Combined Sales Data (after union):")
for product, total_sales in new_total_sales_rdd.collect():
    print(f"{product}: {total_sales}")

# ### **Step 5: Perform Actions on the RDD**

# 8. **Task 7: Count the Number of Distinct Products**
#    - Count the number of distinct products in the RDD.
#    - Print the count of distinct products.
distinct_products_count = combined_sales_rdd.keys().distinct().count()
print(f"Number of Distinct Products: {distinct_products_count}")

# 9. **Task 8: Identify the Product with Maximum Sales**
#    - Find the product with the maximum total sales using `reduce()`.
#    - Print the product name and its total sales amount.
max_sales_product = new_total_sales_rdd.reduce(lambda x, y: x if x[1] > y[1] else y)
print(f"Product with Maximum Sales: {max_sales_product[0]} with sales amount {max_sales_product[1]}")


Products with Total Sales Greater Than 200:
ProductA: 300
ProductB: 400
ProductC: 400
Combined Sales Data (after union):
ProductA: 350
ProductC: 550
ProductB: 400
Number of Distinct Products: 3
Product with Maximum Sales: ProductC with sales amount 550
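
A plain-Python model of the union, distinct-count, and maximum steps may help when checking the numbers above by hand. It is a sketch under the assumption that list concatenation is a fair stand-in for `union()`, which keeps duplicate records rather than removing them; `combined`, `totals`, `distinct_products`, and `best` are hypothetical names.

```python
from functools import reduce

sales_data = [
    ("ProductA", 100), ("ProductB", 150), ("ProductA", 200),
    ("ProductC", 300), ("ProductB", 250), ("ProductC", 100),
]
regional_sales_data = [("ProductA", 50), ("ProductC", 150)]

# Model of rdd.union(regional_sales_rdd): plain concatenation, duplicates are kept.
combined = sales_data + regional_sales_data

# Per-product totals over the combined records.
totals = {}
for product, amount in combined:
    totals[product] = totals.get(product, 0) + amount

# Model of keys().distinct().count().
distinct_products = len({product for product, _ in combined})

# Same comparison as the notebook's reduce(): keep the pair with the larger total.
best = reduce(lambda x, y: x if x[1] > y[1] else y, totals.items())

print(totals)             # {'ProductA': 350, 'ProductB': 400, 'ProductC': 550}
print(distinct_products)  # 3
print(best)               # ('ProductC', 550)
```

Because the comparison uses a strict `>`, two products with equal totals would resolve to whichever pair is seen later. In Spark the order in which pairs are combined depends on partitioning, so the winner of a tie is not fixed; there is no tie in this dataset.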


In [None]:
### **Challenge Task: Calculate the Average Sales per Product**

# 10. **Challenge Task:**
#     - Calculate the average sales amount per product using the key-value pair RDD.
#     - Print the average sales for each product.
sales_and_count_rdd = (
    combined_sales_rdd
    .mapValues(lambda x: (x, 1))  # pair each sale with a count of 1
    .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))  # sum amounts and counts per key
)

average_sales_rdd = sales_and_count_rdd.mapValues(lambda x: x[0] / x[1])
print("Average Sales per Product:")
for product, avg_sales in average_sales_rdd.collect():
    print(f"{product}: {avg_sales}")
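

The challenge cell was not executed in this notebook, so no output was recorded. The following plain-Python sketch models the same (sum, count) technique on the combined data so the expected averages can be checked by hand. It is an illustration, not Spark code, and `sums_and_counts` and `averages` are hypothetical names. Carrying a (sum, count) pair matters because two averages cannot be merged pairwise into a correct overall average, while sums and counts can.

```python
combined = [
    ("ProductA", 100), ("ProductB", 150), ("ProductA", 200),
    ("ProductC", 300), ("ProductB", 250), ("ProductC", 100),
    ("ProductA", 50), ("ProductC", 150),  # regional records
]

# Model of mapValues(lambda x: (x, 1)) followed by reduceByKey():
# accumulate a running (sum, count) pair for each product.
sums_and_counts = {}
for product, amount in combined:
    total, count = sums_and_counts.get(product, (0, 0))
    sums_and_counts[product] = (total + amount, count + 1)

# Model of mapValues(lambda x: x[0] / x[1]): divide sum by count.
averages = {
    product: total / count
    for product, (total, count) in sums_and_counts.items()
}

for product, avg in averages.items():
    print(f"{product}: {avg:.2f}")
# ProductA: 116.67
# ProductB: 200.00
# ProductC: 183.33
```

The PySpark cell prints unrounded floats and may list the products in a different order, since `collect()` on a shuffled RDD does not guarantee key order.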