
#ICP 8 Evan Finger

In [4]:
from pyspark.sql import SparkSession

# One SparkSession for the whole notebook; spark.sparkContext is used for all RDD work below.
spark = SparkSession.builder.appName("RDDExample").getOrCreate()

##1.

In [5]:
numbers = list(range(1, 16))
rdd = spark.sparkContext.parallelize(numbers)

result = rdd.collect()
print(result)

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]


##2.

In [6]:
elements = rdd.collect()
print("Elements: ", elements)
partitions = rdd.getNumPartitions()
print("Number of Partitions: ", partitions)

Elements:  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
Number of Partitions:  2


##3.

In [7]:
first_element = rdd.first()
print("First Element: ", first_element)

First Element:  1


##4.

In [8]:
even_numbers = rdd.filter(lambda x: x % 2 == 0)
even_numbers_list = even_numbers.collect()
print("Even Numbers: ", even_numbers_list)

Even Numbers:  [2, 4, 6, 8, 10, 12, 14]


##5.

In [9]:
squared_numbers = rdd.map(lambda x: x ** 2)
squared_numbers_list = squared_numbers.collect()
print("Squared Numbers: ", squared_numbers_list)

Squared Numbers:  [1, 4, 9, 16, 25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225]


##6.

In [10]:
total_sum = rdd.reduce(lambda x, y: x + y)
print("Total Sum: ", total_sum)

Total Sum:  120
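
As a sanity check that needs no Spark, the same lambda can be folded over a plain Python list and compared with the closed form n(n+1)/2. This is only a sketch added for illustration; `folded` and `closed_form` are hypothetical names that do not appear in the cells above.

```python
from functools import reduce

numbers = list(range(1, 16))  # same values as the RDD: 1..15

# Fold the list pairwise with the same lambda that was passed to rdd.reduce
folded = reduce(lambda x, y: x + y, numbers)

# Closed form for 1 + 2 + ... + n
n = len(numbers)
closed_form = n * (n + 1) // 2

print(folded, closed_form)
```

Both values should agree with the total printed above. Unlike `functools.reduce`, `rdd.reduce` combines partial results from separate partitions, so the function passed to it should be associative and commutative, as addition is.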


##7.

In [11]:
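# saveAsTextFile creates a directory named output.txt (one part-file per partition),
# not a single file. The relative path resolves against the working directory,
# and the call fails if the path already exists.
output_path = "home/output.txt"
rdd.saveAsTextFile(output_path)
print("RDD saved to", output_path)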

RDD saved to home/output.txt


##8.

In [12]:
list1 = [1, 2, 3, 4, 5]
list2 = [3, 4, 5, 6, 7]
rdd1 = spark.sparkContext.parallelize(list1)
rdd2 = spark.sparkContext.parallelize(list2)
union_rdd = rdd1.union(rdd2)
union_list = union_rdd.collect()
print("Union: ", union_list)

Union:  [1, 2, 3, 4, 5, 3, 4, 5, 6, 7]
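
The output above shows that `union` keeps duplicates (3, 4 and 5 appear twice), so it behaves like list concatenation rather than a mathematical set union. A plain-Python sketch of the difference, with `concatenated` and `deduplicated` as hypothetical names:

```python
list1 = [1, 2, 3, 4, 5]
list2 = [3, 4, 5, 6, 7]

# RDD.union behaves like concatenation: duplicates are kept
concatenated = list1 + list2

# Order-preserving de-duplication, a local stand-in for removing the repeats
deduplicated = list(dict.fromkeys(concatenated))

print(concatenated)
print(deduplicated)
```

In Spark, the repeats can be dropped by chaining `distinct()` after `union`, although the order of the resulting elements is then not guaranteed.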


##9.

In [13]:
list3 = [1,2,3]
list4 = ['a', 'b', 'c']
rdd3 = spark.sparkContext.parallelize(list3)
rdd4 = spark.sparkContext.parallelize(list4)
cartesian_rdd = rdd3.cartesian(rdd4)
cartesian_list = cartesian_rdd.collect()
print("Cartesian Product: ", cartesian_list)

Cartesian Product:  [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'a'), (3, 'a'), (2, 'b'), (2, 'c'), (3, 'b'), (3, 'c')]
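
The pairs above are all present, but not in the row-major order that `itertools.product` would give. The sketch below checks both points without Spark. The partition split in `parts3` and `parts4` is an assumption (two partitions per RDD), chosen because it reproduces the printed order; it was not read from the cluster.

```python
from itertools import product

list3 = [1, 2, 3]
list4 = ['a', 'b', 'c']

# Output copied from the cell above
spark_output = [(1, 'a'), (1, 'b'), (1, 'c'), (2, 'a'), (3, 'a'),
                (2, 'b'), (2, 'c'), (3, 'b'), (3, 'c')]

# Row-major Cartesian product computed locally
expected = list(product(list3, list4))
same_pairs = sorted(spark_output) == expected
same_order = spark_output == expected

# Assumed partition layout: pairs are emitted one partition pair at a time
parts3 = [[1], [2, 3]]
parts4 = [['a'], ['b', 'c']]
by_partition = [pair
                for p3 in parts3
                for p4 in parts4
                for pair in product(p3, p4)]

print(same_pairs, same_order, by_partition == spark_output)
```

When order matters, sorting the collected list (or comparing as sets) is safer than relying on the order `collect()` happens to return.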


##10.

In [14]:
data = [{"id": 1, "name": "Anne", "age": 23},
        {"id": 2, "name": "Bill", "age": 25},
        {"id": 3, "name": "Evan", "age": 22}]
dict_rdd = spark.sparkContext.parallelize(data)
result = dict_rdd.collect()
print("RDD with dictionary: ", result)

RDD with dictionary:  [{'id': 1, 'name': 'Anne', 'age': 23}, {'id': 2, 'name': 'Bill', 'age': 25}, {'id': 3, 'name': 'Evan', 'age': 22}]


##11.

In [16]:
data = [1, 2, 3, 2, 1, 4, 5, 6, 4, 4, 7, 8, 7, 9, 1]
rdd = spark.sparkContext.parallelize(data)
mapped_rdd = rdd.map(lambda x: (x, 1))
reduced_rdd = mapped_rdd.reduceByKey(lambda x, y: x + y)
result = reduced_rdd.collect()
print("Unique values and counts:", result)

Unique values and counts: [(2, 2), (4, 3), (6, 1), (8, 1), (1, 3), (3, 1), (5, 1), (7, 2), (9, 1)]
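
The map and `reduceByKey` steps above can be traced in plain Python to confirm the counts. This is an illustrative sketch only; `pairs` and `counts` are hypothetical names, and the loop stands in for what Spark does per key across partitions.

```python
from collections import Counter

data = [1, 2, 3, 2, 1, 4, 5, 6, 4, 4, 7, 8, 7, 9, 1]

# map step: pair every value with a count of 1
pairs = [(x, 1) for x in data]

# reduceByKey step: combine the counts of equal keys with x + y
counts = {}
for key, value in pairs:
    counts[key] = counts[key] + value if key in counts else value

# Cross-check against the standard-library counter
matches_counter = counts == dict(Counter(data))

print(sorted(counts.items()), matches_counter)
```

The Spark output lists even keys before odd keys, which is consistent with the keys being hash-partitioned across two partitions. The counts themselves do not depend on that order.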


##12.

In [17]:
input_path1 = "/text1.txt"
input_path2 = "/text2.txt"
rdd1 = spark.sparkContext.textFile(input_path1)
rdd2 = spark.sparkContext.textFile(input_path2)
union_rdd = rdd1.union(rdd2)
result = union_rdd.collect()
print("RDD from multiple text files:", result)

RDD from multiple text files: ['1, 2, 3', '3, 4, 5']


##13.

In [19]:
numbers = list(range(1, 16))
rdd = spark.sparkContext.parallelize(numbers)
result = rdd.take(5)
print("First 5 elements:", result)

First 5 elements: [1, 2, 3, 4, 5]


##14.

In [24]:
from pyspark.sql import Row
data = [
        Row(id= 1, name= "Anne", age= 23),
        Row(id= 2, name= "Bill", age= 25),
        Row(id= 3, name= "Evan", age= 22)
        ]
df = spark.createDataFrame(data)
df.show()



+---+----+---+
| id|name|age|
+---+----+---+
|  1|Anne| 23|
|  2|Bill| 25|
|  3|Evan| 22|
+---+----+---+



Per the slides, the typed Dataset API is not supported in Python, so a Dataset cannot be created in Colab; the DataFrame above is the closest equivalent.

##15.

In [26]:
data = [(1, "Anne", 23), (2, "Bill", 25), (3, "Evan", 22)]
rdd = spark.sparkContext.parallelize(data)
print("Rdd elements: ", rdd.collect())
ages_rdd = rdd.map(lambda x: x[2])
print("Ages Rdd: ", ages_rdd.collect())

columns = ["id", "name", "age"]
df = rdd.toDF(columns)
df.show()
df_filtered = df.filter(df["age"] > 23)
df_filtered.show()

Rdd elements:  [(1, 'Anne', 23), (2, 'Bill', 25), (3, 'Evan', 22)]
Ages Rdd:  [23, 25, 22]
+---+----+---+
| id|name|age|
+---+----+---+
|  1|Anne| 23|
|  2|Bill| 25|
|  3|Evan| 22|
+---+----+---+

+---+----+---+
| id|name|age|
+---+----+---+
|  2|Bill| 25|
+---+----+---+



Once again, Datasets are not supported in Python; they are available only in Java and Scala, so the RDD is converted to a DataFrame instead.