### Task 1: Handling Schema Mismatches using Spark
**Description**: Use Apache Spark to address schema mismatches by casting each column to the
type declared in the expected schema.

**Steps**:
1. Create Spark session
2. Load dataframe
3. Define the expected schema
4. Handle schema mismatches
5. Show corrected data

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Step 1: Create Spark session
spark = SparkSession.builder.appName("SchemaMismatchHandling").getOrCreate()

# Step 2: Load dataframe (simulated with sample data that has schema issues)
data = [
    ("1", "John Doe", "18", "A"),
    ("2", "Jane Smith", "22", "B"),
    ("3", "Bob Johnson", "seventeen", "C"),  # Age is a word, not a numeric string
    ("4", "Alice Brown", None, "E"),         # Missing age
    ("5", "Tom White", "15", None)           # Missing grade
]

# Create DataFrame; every column is inferred as string, unlike the expected schema
df_raw = spark.createDataFrame(data, ["ID", "Name", "Age", "Grade"])
print("Raw Data (Before Schema Correction):")
df_raw.show()

# Step 3: Define the expected schema
expected_schema = StructType([
    StructField("ID", IntegerType(), True),
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("Grade", StringType(), True)
])

# Step 4: Handle schema mismatches
# Cast every column to the type declared in expected_schema.
# With ANSI mode off (the default before Spark 4.0), values that cannot be
# parsed, such as "seventeen", become null instead of raising an error.
from pyspark.sql.functions import col

df_corrected = df_raw.select(
    [col(f.name).cast(f.dataType).alias(f.name) for f in expected_schema.fields]
)

print("Corrected Data (After Schema Handling):")
df_corrected.show()

# Optional: Validate schema
print("Corrected Schema:")
df_corrected.printSchema()



### Task 2: Detect and Correct Incomplete Data in ETL
**Description**: Use Python and pandas to detect incomplete data in an ETL process and fill
missing values with estimates (a placeholder for names, the mean for ages, the mode for grades).

**Steps**:
1. Detect incomplete data
2. Fill missing values
3. Report changes

In [None]:
import pandas as pd

# Sample data simulating an ETL load with missing values
data = {
    'ID': [1, 2, 3, 4, 5],
    'Name': ['John', 'Jane', 'Bob', None, 'Eva'],
    'Age': [18, None, 17, 20, None],
    'Grade': ['A', 'B', None, 'C', 'A']
}

# Step 1: Detect incomplete data
df = pd.DataFrame(data)
print("Original Data:")
print(df)

print("\nMissing Values Count:")
print(df.isnull().sum())

# Step 2: Fill missing values with estimates
# Assign the result back to the column: calling fillna(inplace=True) on a
# column selection triggers warnings in recent pandas and may not update df.
# Fill missing Name with "Unknown"
df['Name'] = df['Name'].fillna('Unknown')

# Fill missing Age with mean age
mean_age = df['Age'].mean()
df['Age'] = df['Age'].fillna(mean_age)

# Fill missing Grade with mode
mode_grade = df['Grade'].mode()[0]
df['Grade'] = df['Grade'].fillna(mode_grade)

# Step 3: Report changes
print("\nCleaned Data:")
print(df)

print("\nSummary of Changes:")
print("Missing 'Name' filled with 'Unknown'")
print(f"Missing 'Age' filled with mean age: {mean_age:.2f}")
print(f"Missing 'Grade' filled with mode: {mode_grade}")


Original Data:
   ID  Name   Age Grade
0   1  John  18.0     A
1   2  Jane   NaN     B
2   3   Bob  17.0  None
3   4  None  20.0     C
4   5   Eva   NaN     A

Missing Values Count:
ID       0
Name     1
Age      2
Grade    1
dtype: int64

Cleaned Data:
   ID     Name        Age Grade
0   1     John  18.000000     A
1   2     Jane  18.333333     B
2   3      Bob  17.000000     A
3   4  Unknown  20.000000     C
4   5      Eva  18.333333     A

Summary of Changes:
Missing 'Name' filled with 'Unknown'
Missing 'Age' filled with mean age: 18.33
Missing 'Grade' filled with mode: A
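
The summary above states which estimate was used but not how many values each one replaced, and after filling there is no way to tell an imputed age from an observed one. A minimal sketch of a more traceable report, assuming the same sample data; `missing_mask`, `fill_values`, `filled_counts`, and the `Age_imputed` column are hypothetical names introduced here for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Name": ["John", "Jane", "Bob", None, "Eva"],
    "Age": [18, None, 17, 20, None],
    "Grade": ["A", "B", None, "C", "A"],
})

# Record which cells are missing before any filling happens.
missing_mask = df.isnull()

# One estimate per column, computed from the observed values only.
fill_values = {
    "Name": "Unknown",
    "Age": float(df["Age"].mean()),
    "Grade": df["Grade"].mode()[0],
}
df_filled = df.fillna(value=fill_values)

# Report how many cells each estimate replaced.
filled_counts = missing_mask.sum()[list(fill_values)]
for column, count in filled_counts.items():
    print(f"{column}: {count} value(s) filled with {fill_values[column]!r}")

# Keep a flag so imputed ages can be told apart from observed ones downstream.
df_filled["Age_imputed"] = missing_mask["Age"]
print(df_filled)
```

Capturing the mask before filling is the key design choice: once the values are replaced, the information about which cells were estimated is otherwise lost.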
