<!-- to run this interactively, start the PySpark shell with the command pyspark (spark-shell starts the Scala shell) -->

In [4]:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("StudentGrades") \
    .getOrCreate()

# Sample student scores
scores = [
    ("Alice", {"Math": 85, "Science": 90, "English": 80}),
    ("Bob", {"Math": 70, "Science": 75, "English": 85}),
    ("Charlie", {"Math": 60, "Science": 65, "English": 70}),
    ("David", {"Math": 90, "Science": 95, "English": 85}),
    ("Eve", {"Math": 75, "Science": 80, "English": 75})
]

# Create RDD from the scores
scores_rdd = spark.sparkContext.parallelize(scores)

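# Define the grading scheme (example): grade -> (lower bound, upper bound), both inclusive.
# The ranges assume integer scores; a fractional score such as 79.5 matches no range.
grading_scheme = {
    "A": (80, 100),
    "B": (60, 79),
    "C": (40, 59),
    "D": (0, 39)
}

# Return the grade whose range contains the score, or "F" if no range matches
def compute_grade(score):
    for grade, (lower_bound, upper_bound) in grading_scheme.items():
        if lower_bound <= score <= upper_bound:
            return grade
    return "F"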

# Map operation to compute grades for each student
grades_rdd = scores_rdd.map(
    lambda x: (x[0], {subject: compute_grade(score) for subject, score in x[1].items()})
)

# Flatten to one (student, subject, grade) tuple per subject, then convert to a DataFrame
grades_df = spark.createDataFrame(
    grades_rdd.flatMap(lambda x: [(x[0], subject, grade) for subject, grade in x[1].items()]),
    ["Student", "Subject", "Grade"],
)

# Display the result
grades_df.show()

# Stop SparkSession
spark.stop()


+-------+-------+-----+
|Student|Subject|Grade|
+-------+-------+-----+
|  Alice|   Math|    A|
|  Alice|Science|    A|
|  Alice|English|    A|
|    Bob|   Math|    B|
|    Bob|Science|    B|
|    Bob|English|    A|
|Charlie|   Math|    B|
|Charlie|Science|    B|
|Charlie|English|    B|
|  David|   Math|    A|
|  David|Science|    A|
|  David|English|    A|
|    Eve|   Math|    B|
|    Eve|Science|    A|
|    Eve|English|    B|
+-------+-------+-----+



In [None]:
from pyspark.sql import SparkSession: This line imports the SparkSession class from the pyspark.sql module. SparkSession is the entry point to Spark SQL functionality and allows the creation and management of DataFrame objects.
spark = SparkSession.builder.appName("StudentGrades").getOrCreate(): This code returns the existing SparkSession if one is already active; otherwise it creates a new one. The appName method sets the name of the application, here "StudentGrades".
scores: This variable holds sample student scores as a list of tuples. Each tuple contains a student's name as the first element and a dictionary representing subject scores as the second element.
scores_rdd: This variable holds an RDD (Resilient Distributed Dataset) created from the sample scores with spark.sparkContext.parallelize. RDDs are Spark's fundamental low-level data structure.
grading_scheme: This dictionary defines the grading scheme, mapping each grade letter to an inclusive (lower bound, upper bound) score range.
compute_grade: This function takes a score as input and returns the first grade whose range contains it. If no range matches, it returns "F".
grades_rdd: This variable applies a map operation to the scores_rdd, computing grades for each student's subject scores using the compute_grade function.
grades_df: This variable holds the DataFrame built from the resulting RDD with createDataFrame. A flatMap first flattens each student's grades dictionary into one (student, subject, grade) row per subject, and the column names are set to "Student", "Subject", and "Grade".
grades_df.show(): This line displays the DataFrame containing the computed grades for each student and subject.
spark.stop(): This line stops the SparkSession, releasing the resources associated with it.
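The behavior of compute_grade at the range boundaries can be checked without Spark. The sketch below copies grading_scheme and compute_grade from the cell above and adds compute_grade_by_threshold, a hypothetical variant that is not part of the original notebook. It compares against lower bounds only, which is one possible way to handle fractional scores if they ever appear.

```python
# Same inclusive ranges as the notebook's grading_scheme.
grading_scheme = {
    "A": (80, 100),
    "B": (60, 79),
    "C": (40, 59),
    "D": (0, 39),
}

def compute_grade(score):
    """Return the first grade whose inclusive range contains score, else "F"."""
    for grade, (lower_bound, upper_bound) in grading_scheme.items():
        if lower_bound <= score <= upper_bound:
            return grade
    return "F"

def compute_grade_by_threshold(score):
    """Hypothetical variant (an assumption, not in the notebook):
    check lower bounds from highest to lowest so no gaps exist between grades."""
    if not 0 <= score <= 100:
        return "F"
    by_lower_bound = sorted(
        grading_scheme.items(), key=lambda item: item[1][0], reverse=True
    )
    for grade, (lower_bound, _) in by_lower_bound:
        if score >= lower_bound:
            return grade
    return "F"

# Integer scores on the boundaries behave as expected.
print(compute_grade(80), compute_grade(79))   # A B
# A fractional score between two ranges, or a score above 100, matches nothing.
print(compute_grade(79.5), compute_grade(101))  # F F
# The threshold variant places 79.5 in the B band instead.
print(compute_grade_by_threshold(79.5))       # B
```

With the integer sample scores used in this notebook, both functions give the same grades, so the difference only matters if the input data changes.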
In summary, this code demonstrates how to compute student grades based on their subject scores using PySpark RDDs and then convert the result into a DataFrame for easy visualization and further analysis.
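To see what the map and flatMap steps do to the data, the pipeline can be traced in plain Python with list comprehensions. This is only an illustrative sketch on two of the sample students: it runs locally on a list, not on an RDD, and the names graded and rows are introduced here for illustration.

```python
scores = [
    ("Alice", {"Math": 85, "Science": 90, "English": 80}),
    ("Bob", {"Math": 70, "Science": 75, "English": 85}),
]

grading_scheme = {"A": (80, 100), "B": (60, 79), "C": (40, 59), "D": (0, 39)}

def compute_grade(score):
    # Same logic as the notebook: first inclusive range that contains the score.
    for grade, (lower_bound, upper_bound) in grading_scheme.items():
        if lower_bound <= score <= upper_bound:
            return grade
    return "F"

# Like scores_rdd.map(...): one output element per input element,
# with each score replaced by its grade.
graded = [
    (name, {subject: compute_grade(score) for subject, score in subjects.items()})
    for name, subjects in scores
]

# Like grades_rdd.flatMap(...): each student expands into several rows,
# and the per-student lists are concatenated into one flat list.
rows = [
    (name, subject, grade)
    for name, grades in graded
    for subject, grade in grades.items()
]

for row in rows:
    print(row)
```

The two students produce six rows, one per subject, which correspond to the first six rows of the DataFrame output shown earlier.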