
In [None]:
!pip install pyspark



In [None]:
from pyspark import SparkContext, SparkConf

In [None]:
#1. Create an RDD from the first 15 natural numbers (1 to 15)
sc = SparkContext("local", "ICP_8")

numbers_rdd = sc.parallelize(range(1,16))

In [None]:
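# Stopping the context here would make every later cell fail, so the call is
# left commented out; run it only after all the cells below have finished.
# sc.stop()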

In [None]:
#2. Show the elements and the number of partitions in the RDD
print("Elements in RDD: ", numbers_rdd.collect())
print("Number of partitions:", numbers_rdd.getNumPartitions())

Elements in RDD:  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Number of partitions: 1
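
The partition count is 1 because the context was created with the master "local", which runs Spark with a single worker thread; a master such as "local[4]", or the optional numSlices argument of sc.parallelize, would give more partitions. The following is a plain-Python sketch (no Spark needed) of how a collection can be cut into contiguous slices. The helper name slice_evenly is hypothetical, and the exact boundaries Spark picks may differ from this simplified model.

```python
def slice_evenly(items, num_slices):
    """Split items into num_slices contiguous chunks of near-equal size.

    Simplified model of partitioning; not Spark's actual implementation.
    """
    items = list(items)
    n = len(items)
    # Chunk i covers the index range [i*n//num_slices, (i+1)*n//num_slices).
    return [items[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]


one_partition = slice_evenly(range(1, 16), 1)
four_partitions = slice_evenly(range(1, 16), 4)

print("1 slice :", one_partition)
print("4 slices:", four_partitions)
```

Every element lands in exactly one slice, so collecting the slices in order gives back the original sequence.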


In [None]:
#3. Return the first element of the RDD
first = numbers_rdd.first()
print("First element in RDD:", first)

First element in RDD: 1


In [None]:
#4. Use the filter transformation to create a new RDD containing only the even elements
even = numbers_rdd.filter(lambda x: x % 2 == 0)
print("Even numbers in RDD:", even.collect())

Even numbers in RDD: [2, 4, 6, 8, 10, 12, 14]


In [None]:
#5. Apply the map transformation to get a new RDD containing the square of each element
squared = numbers_rdd.map(lambda x: x * x)
print("Squared numbers in RDD:", squared.collect())

Squared numbers in RDD: [1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225]
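
filter and map are transformations, so Spark only records them; nothing is computed until an action such as collect() runs. Python's built-in filter is lazy in a similar way, which makes the idea easy to see without Spark. This is only an analogy: the calls list and the is_even helper are hypothetical names used to count evaluations.

```python
calls = []  # records every element the predicate is actually applied to


def is_even(x):
    calls.append(x)
    return x % 2 == 0


numbers = range(1, 16)

# Like a transformation: this only describes the work, nothing runs yet.
pending = filter(is_even, numbers)
calls_before = len(calls)

# Like an action (collect): consuming the result forces the evaluation.
evens = list(pending)
calls_after = len(calls)

print("Predicate calls before materializing:", calls_before)
print("Predicate calls after materializing:", calls_after)
print("Even numbers:", evens)
```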


In [None]:
#6. Aggregate all elements of the RDD with the reduce action
# Named total rather than sum so the built-in sum() is not shadowed.
total = numbers_rdd.reduce(lambda x, y: x + y)
print("Sum of numbers in RDD:", total)

Sum of numbers in RDD: 120
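
The function passed to reduce should be commutative and associative, because Spark reduces each partition separately and then combines the partial results. The plain-Python sketch below models that two-stage reduction; reduce_partitions and the partition layouts are hypothetical, chosen only to show that addition is safe while subtraction depends on how the data is split.

```python
from functools import reduce


def reduce_partitions(partitions, fn):
    """Reduce each partition on its own, then reduce the partial results."""
    partials = [reduce(fn, part) for part in partitions if part]
    return reduce(fn, partials)


data = list(range(1, 16))
one_part = [data]
three_parts = [data[0:5], data[5:10], data[10:15]]


def add(x, y):
    return x + y


def subtract(x, y):
    return x - y


# Addition gives the same answer regardless of the partitioning.
add_one = reduce_partitions(one_part, add)
add_three = reduce_partitions(three_parts, add)

# Subtraction is neither commutative nor associative, so the answer changes.
sub_one = reduce_partitions(one_part, subtract)
sub_three = reduce_partitions(three_parts, subtract)

print("add:", add_one, add_three)
print("subtract:", sub_one, sub_three)
```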


In [None]:
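#7. Save the RDD as text (Spark writes a directory named squared_numbers1 containing part files
#   and raises an error if that directory already exists)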
squared.saveAsTextFile("squared_numbers1")

In [None]:
#8. Create two new RDDs from lists and combine them with the union transformation
one = sc.parallelize([1,2,3,4,5])
two = sc.parallelize([6,7,8,9,10])
union = one.union(two)
print("Union of RDDs:", union.collect())

Union of RDDs: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
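
The two lists above do not overlap, which hides one detail: union keeps duplicates, and distinct() has to be chained on to remove them. A plain-Python sketch of that behaviour, using hypothetical overlapping lists:

```python
left = [1, 2, 3, 4, 5]
right = [4, 5, 6, 7]  # hypothetical data that overlaps with left

# Like one.union(two): a plain concatenation, duplicates are kept.
unioned = left + right

# Like one.union(two).distinct(): duplicates removed.
# Sorted here for a stable printout; Spark does not guarantee an order.
deduplicated = sorted(set(unioned))

print("union:", unioned)
print("union + distinct:", deduplicated)
```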


In [None]:
#9. Use the cartesian transformation on the two RDDs to get a new RDD of ordered pairs
cartesian = one.cartesian(two)
print("Cartesian product of RDDs:", cartesian.collect())

Cartesian product of RDDs: [(1, 6), (1, 7), (1, 8), (1, 9), (1, 10), (2, 6), (2, 7), (2, 8), (2, 9), (2, 10), (3, 6), (3, 7), (3, 8), (3, 9), (3, 10), (4, 6), (4, 7), (4, 8), (4, 9), (4, 10), (5, 6), (5, 7), (5, 8), (5, 9), (5, 10)]
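
The result has 5 x 5 = 25 pairs because cartesian pairs every element of the first RDD with every element of the second, so the output grows with the product of the two sizes. itertools.product builds the same set of pairs in plain Python, shown here as a sketch without Spark:

```python
from itertools import product

one = [1, 2, 3, 4, 5]
two = [6, 7, 8, 9, 10]

# Every element of one paired with every element of two.
pairs = list(product(one, two))

print("Number of pairs:", len(pairs))
print("First pair:", pairs[0])
print("Last pair:", pairs[-1])
```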


In [None]:
#10. Create an RDD from a list of dictionaries
dictionary = sc.parallelize([{"a":1}, {"b":2}, {"c":3}, {"d":4}])
print("Dictionary RDD:", dictionary.collect())

Dictionary RDD: [{'a': 1}, {'b': 2}, {'c': 3}, {'d': 4}]


In [None]:
#11. Count occurrences: each unique value in the RDD becomes a key and its count the value
values = sc.parallelize(["a","b","c","d","e","e","a","f"])
unique_values = values.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y)
print("Unique Values and their Counts", unique_values.collect())

Unique Values and their Counts [('a', 2), ('b', 1), ('c', 1), ('d', 1), ('e', 2), ('f', 1)]
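
The count works in two steps: map turns each value into a (value, 1) pair, and reduceByKey adds up the 1s that share a key. The same logic in plain Python, as a sketch that needs no Spark:

```python
values = ["a", "b", "c", "d", "e", "e", "a", "f"]

# Step 1, like map(lambda x: (x, 1)): pair every value with a count of 1.
pairs = [(v, 1) for v in values]

# Step 2, like reduceByKey(lambda x, y: x + y): add the counts per key.
counts = {}
for key, n in pairs:
    counts[key] = counts.get(key, 0) + n

print("Unique values and their counts:", sorted(counts.items()))
```

For this particular task the RDD action countByValue() returns the same counts directly as a dictionary on the driver.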


In [None]:
#12. Create an RDD by combining multiple .txt files (textFile accepts comma-separated paths)
combined_text_files = sc.textFile("file1.txt,file2.txt")
print("Combined Text Files:", combined_text_files.collect())


Combined Text Files: ['apple', 'banana', 'orange', 'apple', 'mango', 'banana', 'grape', 'apple', 'pineapple', 'orange']


In [None]:
#13. Inspect the first 5 elements of an RDD (here the union RDD from step 8)
print("First 5 lines of combined RDD:", union.take(5))

First 5 lines of combined RDD: [1, 2, 3, 4, 5]


In [None]:
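#14. Create a DataFrame (the typed Dataset API exists only in Scala and Java, not in PySpark)
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ICP_8").getOrCreate()

# Only one column name is supplied for two-field tuples, so Spark names the second column _2.
numbers = spark.createDataFrame(cartesian, ["cartesian"])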
numbers.show()

+---------+---+
|cartesian| _2|
+---------+---+
|        1|  6|
|        1|  7|
|        1|  8|
|        1|  9|
|        1| 10|
|        2|  6|
|        2|  7|
|        2|  8|
|        2|  9|
|        2| 10|
|        3|  6|
|        3|  7|
|        3|  8|
|        3|  9|
|        3| 10|
|        4|  6|
|        4|  7|
|        4|  8|
|        4|  9|
|        4| 10|
+---------+---+
only showing top 20 rows



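15. Show the difference between RDD, DataFrame and Dataset using an example


RDD:
Resilient Distributed Dataset. A fault-tolerant, distributed collection of arbitrary objects with no schema enforcement; elements are accessed by position or through user-written functions.

DataFrame:
A distributed collection organized into named columns. It suits structured data, supports SQL queries, and is optimized by Spark's Catalyst query optimizer.

Dataset:
A strongly typed, structured collection that combines the type safety of RDDs with the optimized execution of DataFrames. The typed Dataset API is available only in Scala and Java; in PySpark the DataFrame fills this role.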
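
The schema point can be illustrated without Spark. The sketch below is only an analogy in plain Python: a list of tuples stands in for an RDD (fields reached by position, no type checks), and a NamedTuple plus a small schema check stands in for a DataFrame (fields reached by name, types declared up front). PairRow, schema and to_row are hypothetical names, not part of any Spark API.

```python
from typing import NamedTuple

# RDD-like: opaque tuples, fields reached by position, nothing is validated.
rdd_like = [(1, 6), (1, 7), (2, 6)]
by_position = [p for p in rdd_like if p[1] == 6]


# DataFrame-like: named columns with declared types.
class PairRow(NamedTuple):
    first: int
    second: int


schema = {"first": int, "second": int}


def to_row(record):
    """Build a PairRow and reject values that do not match the schema."""
    row = PairRow(*record)
    for name, expected in schema.items():
        value = getattr(row, name)
        if not isinstance(value, expected):
            raise TypeError(f"column {name!r} expects {expected.__name__}, got {value!r}")
    return row


df_like = [to_row(p) for p in rdd_like]
by_name = [r for r in df_like if r.second == 6]

print("Selected by position:", by_position)
print("Selected by column name:", by_name)
```

Passing a record such as ("x", 6) to to_row raises a TypeError, whereas the tuple list would accept it silently.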


In [None]:
#15. Show the difference between RDD, DataFrame and Dataset using an example
# An RDD returns plain Python tuples; a DataFrame prints a table with named columns.
print("RDD Elements:", cartesian.collect())
# show() prints the table itself and returns None, so it is called on its own line.
print("DataFrame Elements:")
numbers.show()

RDD Elements: [(1, 6), (1, 7), (1, 8), (1, 9), (1, 10), (2, 6), (2, 7), (2, 8), (2, 9), (2, 10), (3, 6), (3, 7), (3, 8), (3, 9), (3, 10), (4, 6), (4, 7), (4, 8), (4, 9), (4, 10), (5, 6), (5, 7), (5, 8), (5, 9), (5, 10)]
DataFrame Elements:
+---------+---+
|cartesian| _2|
+---------+---+
|        1|  6|
|        1|  7|
|        1|  8|
|        1|  9|
|        1| 10|
|        2|  6|
|        2|  7|
|        2|  8|
|        2|  9|
|        2| 10|
|        3|  6|
|        3|  7|
|        3|  8|
|        3|  9|
|        3| 10|
|        4|  6|
|        4|  7|
|        4|  8|
|        4|  9|
|        4| 10|
+---------+---+
only showing top 20 rows
