In [3]:
"""In Apache Spark, the Resilient Distributed Dataset (RDD) is the fundamental data structure. RDD operations fall into two categories: transformations and actions. Transformations are lazy: each one returns a new RDD, and nothing is computed until an action is called. Actions trigger the computation and return a result to the driver program.

Transformations:

Map: Applies a function to each element of the RDD.
Filter: Keeps only the elements that satisfy a given condition.
FlatMap: Similar to Map, but each input item can be mapped to zero or more output items, and the results are flattened into a single RDD.
Union: Returns a new RDD containing the elements of both source RDDs (duplicates are kept).
Distinct: Returns a new RDD with duplicate elements removed.
GroupByKey: On an RDD of (key, value) pairs, groups the values for each key.
ReduceByKey: On an RDD of (key, value) pairs, aggregates the values of each key using a specified associative and commutative reduce function.
SortByKey: Sorts an RDD of (key, value) pairs by key.
Join: Performs an inner join between two RDDs of (key, value) pairs on their keys.

Actions:

Collect: Returns all elements of the RDD to the driver program (as a list in PySpark).
Count: Returns the number of elements in the RDD.
First: Returns the first element of the RDD.
Take: Returns the first n elements of the RDD.
Reduce: Aggregates the elements of the RDD using a specified associative and commutative function.
Foreach: Applies a function to each element of the RDD; it returns nothing, so it is used only for side effects.

"""
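
The difference between Map and FlatMap in the list above is easiest to see on a small input. The sketch below models both in plain Python, so it runs without Spark; the sample `lines` and the helper `split_words` are made up for illustration. On a real RDD the corresponding calls would be `rdd.map(split_words)` and `rdd.flatMap(split_words)`.

```python
from itertools import chain

# Hypothetical sample input, standing in for the elements of an RDD.
lines = ["spark is fast", "rdds are lazy"]


def split_words(line):
    """Split one line into a list of words."""
    return line.split()


# Model of Map: exactly one output item (here, a list) per input item.
mapped = [split_words(line) for line in lines]

# Model of FlatMap: each input yields zero or more items, and the
# per-input results are flattened into one sequence.
flat_mapped = list(chain.from_iterable(split_words(line) for line in lines))

print(mapped)
print(flat_mapped)
```

Map keeps the nesting (two lists of words), while FlatMap produces a single flat sequence of six words.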

In [4]:
from pyspark import SparkContext

# Create a Spark context
sc = SparkContext("local", "GroupByRDDExample")

# Sample data
data = [("Alice", 25), ("Bob", 30), ("Charlie", 22), ("Alice", 28), ("Bob", 32)]

# Create an RDD
myRDD = sc.parallelize(data)

# Perform groupByKey transformation
groupedRDD = myRDD.groupByKey()

# Calculate average age for each group
averageAgeRDD = groupedRDD.mapValues(lambda ages: sum(ages) / len(ages))

# Collect and display the result
result = averageAgeRDD.collect()
print(result)

# Stop the Spark context
sc.stop()



[('Alice', 26.5), ('Bob', 31.0), ('Charlie', 22.0)]
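
The cell above collects every age for a key with groupByKey before averaging. A common alternative is to carry a (sum, count) pair per key and merge pairs with an associative, commutative function, which is the shape ReduceByKey expects. The sketch below models that merge in plain Python, without Spark; the helper name `merge` is an assumption made for illustration.

```python
# Same sample data as in the cell above.
data = [("Alice", 25), ("Bob", 30), ("Charlie", 22), ("Alice", 28), ("Bob", 32)]


def merge(a, b):
    """Combine two (sum, count) pairs; the operation is associative and commutative."""
    return (a[0] + b[0], a[1] + b[1])


# Step 1 (models mapValues): turn each age into a (sum, count) pair.
pairs = [(name, (age, 1)) for name, age in data]

# Step 2 (models reduceByKey): merge the pairs that share a key.
totals = {}
for name, pair in pairs:
    totals[name] = merge(totals[name], pair) if name in totals else pair

# Step 3 (models a final mapValues): divide each sum by its count.
averages = {name: total / count for name, (total, count) in totals.items()}

print(sorted(averages.items()))
```

With an RDD, the same three steps would roughly read `myRDD.mapValues(lambda age: (age, 1)).reduceByKey(merge).mapValues(lambda p: p[0] / p[1])`. That line is a sketch and was not run as part of this notebook.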


In [23]:
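# Import only the names used below; wildcard imports from pyspark.sql.functions
# shadow Python built-ins such as sum, min, max and round.
from pyspark.sql import SparkSession
from pyspark.sql.functions import coalesce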


In [7]:
spark = SparkSession.builder.getOrCreate()



In [14]:
data1 = spark.read.csv('sources/student_dataset.csv', inferSchema=True, header=True)

In [15]:
data1


DataFrame[StudentID: int, FirstName: string, LastName: string, Gender: string, DOB: string, Grade: int]

In [16]:
data1.show()

+---------+---------+--------+------+----------+-----+
|StudentID|FirstName|LastName|Gender|       DOB|Grade|
+---------+---------+--------+------+----------+-----+
|        1|     John|    null|  Male|1995-05-10|   12|
|        2|     Jane|    null|Female|1996-08-22| null|
|        3|     Mark| Johnson|  Male|1997-03-15|   10|
|        4|    Emily|Williams|Female|1998-11-28| null|
|        5|  Michael|    null|  Male|1999-07-04|   11|
|        6|     Emma|   Jones|Female|2000-01-19|   10|
|        7|  William|    null|  Male|2001-09-03|    9|
|        8|   Sophia|  Miller|Female|2002-04-12|   11|
|        9|    Aiden|    null|  Male|2003-10-27|   10|
|       10|   Olivia|  Martin|Female|2004-06-08| null|
+---------+---------+--------+------+----------+-----+



In [57]:
data2 = spark.read.csv('sources/student_dataset2.csv', inferSchema=True, header=True)

In [58]:
# Rename LastName so it can be told apart from data1.LastName after the join
data2 = data2.withColumnRenamed('LastName', 'lname')

In [59]:
data2.show()

+---------+---------+--------+------+----------+-----+
|StudentID|FirstName|   lname|Gender|       DOB|Grade|
+---------+---------+--------+------+----------+-----+
|        1|     John|      MM|  Male|1995-05-10|   12|
|        2|     Jane|    null|Female|1996-08-22|   10|
|        3|     Mark| Johnson|  Male|1997-03-15|   10|
|        4|    Emily|Williams|Female|1998-11-28|   10|
|        5|  Michael|      HH|  Male|1999-07-04|   11|
|        6|     Emma|   Jones|Female|2000-01-19|   10|
|        7|  William|    null|  Male|2001-09-03|    9|
|        8|   Sophia|  Miller|Female|2002-04-12|   11|
|        9|    Aiden|     NNN|  Male|2003-10-27|   10|
+---------+---------+--------+------+----------+-----+



In [60]:
# Inner join on StudentID: only students present in both DataFrames are kept
data3 = data1.join(data2, data1.StudentID == data2.StudentID, "inner")

In [62]:
# coalesce returns the first non-null value among its arguments, row by row
data3.select('LastName', coalesce(data3['LastName'], data3['lname'])).show()

+--------+-------------------------+
|LastName|coalesce(LastName, lname)|
+--------+-------------------------+
|    null|                       MM|
|    null|                     null|
| Johnson|                  Johnson|
|Williams|                 Williams|
|    null|                       HH|
|   Jones|                    Jones|
|    null|                     null|
|  Miller|                   Miller|
|    null|                      NNN|
+--------+-------------------------+
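
The output above combines two behaviours: the inner join drops StudentID 10, which is missing from the second dataset, and coalesce fills a null LastName from lname when one is available. The sketch below models both in plain Python, without Spark. The dictionaries are copied from the tables shown earlier, `None` stands in for null, and the `coalesce` defined here is a plain-Python helper written for illustration, not the function from `pyspark.sql.functions`.

```python
# LastName values keyed by StudentID, copied from the two tables shown earlier.
# None stands in for Spark's null.
last_name = {1: None, 2: None, 3: "Johnson", 4: "Williams", 5: None,
             6: "Jones", 7: None, 8: "Miller", 9: None, 10: "Martin"}
lname = {1: "MM", 2: None, 3: "Johnson", 4: "Williams", 5: "HH",
         6: "Jones", 7: None, 8: "Miller", 9: "NNN"}


def coalesce(*values):
    """Return the first argument that is not None, or None if all are None."""
    return next((v for v in values if v is not None), None)


# Inner join: keep only the StudentIDs present in both inputs.
joined_ids = sorted(last_name.keys() & lname.keys())

# Coalesce: prefer LastName, fall back to lname.
merged = {sid: coalesce(last_name[sid], lname[sid]) for sid in joined_ids}

for sid, name in merged.items():
    print(sid, name)
```

Students 2 and 7 stay null because neither source has a last name for them, which matches the two null rows in the coalesce column.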



In [67]:
data4=data1.groupBy('Gender')

In [70]:
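# groupBy() returns a GroupedData object, not a DataFrame, so DataFrame methods
# such as select() are unavailable until an aggregation (count(), agg(), ...) is applied.
data4.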

AttributeError: 'GroupedData' object has no attribute 'select'
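
The error reflects that grouping and aggregating are separate steps: grouping only buckets rows by key, and a table comes back only once each bucket is reduced to a single row. The sketch below models the two steps in plain Python, without Spark, using the Gender and Grade columns copied from the data1 table (`None` stands in for null). The choice of count and average grade is an assumption made for illustration; in PySpark, the aggregation step would be a call such as `data4.count()` or `data4.agg(...)`, each of which returns a DataFrame.

```python
from collections import defaultdict

# (Gender, Grade) pairs copied from the data1 table; None stands in for null.
rows = [("Male", 12), ("Female", None), ("Male", 10), ("Female", None),
        ("Male", 11), ("Female", 10), ("Male", 9), ("Female", 11),
        ("Male", 10), ("Female", None)]

# Grouping step: bucket the rows by key. On its own this is not yet a table.
groups = defaultdict(list)
for gender, grade in rows:
    groups[gender].append(grade)

# Aggregation step: reduce each bucket to one row per key.
summary = {}
for gender, grades in groups.items():
    known = [g for g in grades if g is not None]  # nulls are skipped when averaging
    summary[gender] = {
        "count": len(grades),
        "avg_grade": sum(known) / len(known) if known else None,
    }

print(summary)
```

Skipping nulls in the average mirrors how Spark's `avg` treats null values, while the count here includes every row in the group.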