**AIM**: To analyze a student dataset using PySpark DataFrame operations for data transformation, grouped aggregation, and sorting, showing how Spark handles structured data.

Operations Performed:  
1. Loading a CSV file into a PySpark DataFrame.  
2. Displaying the schema and first few rows of the dataset.  
3. Adding a new column to compute average marks for each student.  
4. Grouping data by country to:  
   - Find the country with the highest average marks.  
   - Count the number of students per country.  
   - Find the earliest enrollment date for each country.  
5. Sorting results based on specific columns (e.g., average marks, enrollment date).  

In [None]:
import pandas as pd
from pyspark.sql import SparkSession
# Note: importing min from pyspark.sql.functions shadows Python's built-in min() in this notebook
from pyspark.sql.functions import col, count, avg, min, desc

In [None]:
# Step 1: Create a sample student dataset
student_data = {
    "Student ID": [101, 102, 103, 104, 105],
    "Name": ["Alice", "Bob", "Charlie", "David", "Eve"],
    "Gender": ["F", "M", "M", "M", "F"],
    "Math Marks": [85, 90, 78, 88, 92],
    "Science Marks": [89, 85, 91, 84, 87],
    "English Marks": [90, 88, 86, 85, 89],
    "Enrollment Date": ["2021-06-15", "2021-07-20", "2021-06-18", "2021-07-25", "2021-06-10"],
    "Country": ["USA", "Canada", "USA", "Germany", "Canada"]
}

In [None]:
# Create a DataFrame and save it to a CSV file
student_df = pd.DataFrame(student_data)
student_file_path = "/content/student_dataset.csv"
student_df.to_csv(student_file_path, index=False)

print(f"Student CSV file created at: {student_file_path}")

Student CSV file created at: /content/student_dataset.csv


In [None]:

# Step 2: Initialize SparkSession
spark = SparkSession.builder \
    .appName("Student Dataset Analysis") \
    .getOrCreate()

In [None]:
# Step 3: Load the dataset
data = spark.read.csv(student_file_path, header=True, inferSchema=True)

In [None]:
# Step 4: Display the schema
print("Schema:")
data.printSchema()

Schema:
root
 |-- Student ID: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Math Marks: integer (nullable = true)
 |-- Science Marks: integer (nullable = true)
 |-- English Marks: integer (nullable = true)
 |-- Enrollment Date: date (nullable = true)
 |-- Country: string (nullable = true)



In [None]:
# Step 5: Display the first five rows
print("First five rows:")
data.show(5)

First five rows:
+----------+-------+------+----------+-------------+-------------+---------------+-------+
|Student ID|   Name|Gender|Math Marks|Science Marks|English Marks|Enrollment Date|Country|
+----------+-------+------+----------+-------------+-------------+---------------+-------+
|       101|  Alice|     F|        85|           89|           90|     2021-06-15|    USA|
|       102|    Bob|     M|        90|           85|           88|     2021-07-20| Canada|
|       103|Charlie|     M|        78|           91|           86|     2021-06-18|    USA|
|       104|  David|     M|        88|           84|           85|     2021-07-25|Germany|
|       105|    Eve|     F|        92|           87|           89|     2021-06-10| Canada|
+----------+-------+------+----------+-------------+-------------+---------------+-------+



In [None]:
# Step 6: Add a per-student "Average Marks" column, then find the country with the highest average marks
data = data.withColumn(
    "Average Marks",
    (col("Math Marks") + col("Science Marks") + col("English Marks")) / 3
)

highest_avg_country = (
    data.groupBy("Country")
    .agg(avg("Average Marks").alias("Average_Marks"))
    .orderBy(desc("Average_Marks"))
    .limit(1)
)
print("Country with the highest average marks:")
highest_avg_country.show()

Country with the highest average marks:
+-------+-------------+
|Country|Average_Marks|
+-------+-------------+
| Canada|         88.5|
+-------+-------------+
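
The figure above is a mean of per-student averages, so it can be checked by hand. The following is a minimal plain-Python sketch of that arithmetic, using the five rows from the sample dataset; the names `students`, `per_country`, `country_avg`, and `best_country` are illustrative and not part of the notebook.

```python
from collections import defaultdict
from statistics import mean

# Rows copied from the sample dataset: (name, country, math, science, english)
students = [
    ("Alice", "USA", 85, 89, 90),
    ("Bob", "Canada", 90, 85, 88),
    ("Charlie", "USA", 78, 91, 86),
    ("David", "Germany", 88, 84, 85),
    ("Eve", "Canada", 92, 87, 89),
]

# Per-student average, mirroring the "Average Marks" column
per_country = defaultdict(list)
for name, country, math_marks, science_marks, english_marks in students:
    per_country[country].append((math_marks + science_marks + english_marks) / 3)

# Mean of the per-student averages within each country
country_avg = {country: mean(values) for country, values in per_country.items()}
best_country = max(country_avg, key=country_avg.get)

# Print countries from highest to lowest average
for country, value in sorted(country_avg.items(), key=lambda kv: -kv[1]):
    print(f"{country}: {value:.2f}")
print("Highest:", best_country)
```

Canada's two students average 87.67 and 89.33, which gives the 88.5 reported by Spark; USA comes out at 86.5 and Germany at about 85.67.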



In [None]:
# Step 7: Number of students from each country
students_per_country = (
    data.groupBy("Country")
    .agg(count("Student ID").alias("Number_of_Students"))
    .orderBy(desc("Number_of_Students"))
)
print("Number of students from each country:")
students_per_country.show()

Number of students from each country:
+-------+------------------+
|Country|Number_of_Students|
+-------+------------------+
|    USA|                 2|
| Canada|                 2|
|Germany|                 1|
+-------+------------------+



In [None]:
# Step 8: Earliest enrollment date for students from each country
earliest_enrollment = (
    data.groupBy("Country")
    .agg(min("Enrollment Date").alias("Earliest_Enrollment"))
    .orderBy("Earliest_Enrollment")
)
print("Earliest enrollment date for students from each country:")
earliest_enrollment.show()

Earliest enrollment date for students from each country:
+-------+-------------------+
|Country|Earliest_Enrollment|
+-------+-------------------+
| Canada|         2021-06-10|
|    USA|         2021-06-15|
|Germany|         2021-07-25|
+-------+-------------------+



**Result:** Performed PySpark DataFrame operations for data transformation, aggregation, and statistical analysis on the student dataset.