### Task 1: Handling Schema Mismatches Using Spark
**Description**: Use Apache Spark to address schema mismatches by transforming data to match
the expected schema.

**Steps**:
1. Create Spark session
2. Load dataframe
3. Define the expected schema
4. Handle schema mismatches
5. Show corrected data
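
Before casting anything in step 4, it can help to see which expected columns are missing from the loaded data and which loaded columns are not expected at all. The sketch below illustrates that comparison in plain Python. `diff_columns` and the sample column lists are hypothetical; in practice the inputs would be `df.columns` and the field names of the expected schema.

```python
def diff_columns(actual_columns, expected_columns):
    """Return (missing, unexpected) column names, preserving input order."""
    actual = set(actual_columns)
    expected = set(expected_columns)
    # Expected columns that the loaded data does not provide
    missing = [c for c in expected_columns if c not in actual]
    # Loaded columns that the expected schema does not mention
    unexpected = [c for c in actual_columns if c not in expected]
    return missing, unexpected


# Hypothetical example: the file lacks "balance" and carries an extra "notes" column
actual = ["id", "name", "age", "email", "is_active", "notes"]
expected = ["id", "name", "age", "email", "is_active", "balance"]
missing, unexpected = diff_columns(actual, expected)
print("Missing:", missing)
print("Unexpected:", unexpected)
```

Reporting both lists before the cast makes it easier to tell a genuinely absent column apart from one that was merely renamed upstream.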

In [2]:
# Write your code from here
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType, BooleanType
from pyspark.sql.functions import col

# 1. Create Spark session
spark = SparkSession.builder \
    .appName("SchemaMismatchHandling") \
    .getOrCreate()

# 2. Load dataframe (example CSV)
file_path = "your_data.csv"  # replace with your file path
df = spark.read.option("header", True).csv(file_path)

# 3. Define the expected schema
expected_schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("email", StringType(), True),
    StructField("is_active", BooleanType(), True),
    StructField("balance", FloatType(), True)
])

# 4. Handle schema mismatches by casting columns to expected types
from pyspark.sql.functions import lit  # used to build typed null columns for missing fields

def cast_columns_to_schema(df, schema):
    for field in schema.fields:
        col_name = field.name
        col_type = field.dataType
        if col_name in df.columns:
            # Cast the column to the expected data type
            df = df.withColumn(col_name, col(col_name).cast(col_type))
        else:
            # Column is missing: add it as an all-null column of the expected type
            df = df.withColumn(col_name, lit(None).cast(col_type))
    # Keep only the expected columns, in schema order
    return df.select([field.name for field in schema.fields])

corrected_df = cast_columns_to_schema(df, expected_schema)

# 5. Show corrected data
corrected_df.show()

# Stop Spark session when done
spark.stop()


JAVA_HOME is not set


PySparkRuntimeError: [JAVA_GATEWAY_EXITED] Java gateway process exited before sending its port number.
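
The error above indicates that PySpark could not start its Java gateway, here because `JAVA_HOME` is not set. A minimal sketch of a pre-flight check, using only the standard library, is shown below. `find_java` is a hypothetical helper, and whether a `java` executable on `PATH` is sufficient for a given Spark installation is an assumption that should be verified locally.

```python
import os
import shutil


def find_java(env=None):
    """Return a path hint for a Java installation, or None if none is visible."""
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    # Prefer JAVA_HOME when it points at an existing directory
    if java_home and os.path.isdir(java_home):
        return java_home
    # Otherwise look for a java executable on the search path
    return shutil.which("java", path=env.get("PATH"))


java = find_java()
if java is None:
    print("No Java installation found: install a JDK and set JAVA_HOME before creating the Spark session.")
else:
    print(f"Java found at: {java}")
```

Running a check like this before `SparkSession.builder...getOrCreate()` turns the gateway failure into a clearer message about the missing prerequisite.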

### Task 2: Detect and Correct Incomplete Data in ETL
**Description**: Use Python and Pandas to detect incomplete data in an ETL process and fill
missing values with estimates.

**Steps**:
1. Detect incomplete data
2. Fill missing values
3. Report changes
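
As a rough illustration of step 2, the sketch below fills gaps in plain Python lists with the mean (numeric values) or the mode (categorical values), using only the `statistics` module. The helper names and the sample values are hypothetical and are not taken from the dataset used in this task.

```python
from statistics import mean, mode


def fill_numeric(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    estimate = mean(known)  # raises StatisticsError if every value is missing
    return [estimate if v is None else v for v in values]


def fill_categorical(values):
    """Replace None entries with the most common known value."""
    known = [v for v in values if v is not None]
    estimate = mode(known)  # raises StatisticsError if every value is missing
    return [estimate if v is None else v for v in values]


# Hypothetical sample data with one gap in each column
ages = [25, None, 35]
cities = ["Paris", "Lima", None, "Lima"]
print(fill_numeric(ages))
print(fill_categorical(cities))
```

Both estimates are computed from the known values only, so a column with no known values cannot be filled this way and needs separate handling.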

In [3]:
# Write your code from here
import pandas as pd

def detect_and_correct_incomplete_data(file_path):
    # 1. Load data safely
    try:
        df = pd.read_csv(file_path)
        if df.empty:
            print("CSV file is empty.")
            return
    except Exception as e:
        print(f"Error loading CSV: {e}")
        return
    
    print("Initial missing values per column:")
    missing_before = df.isnull().sum()
    print(missing_before[missing_before > 0])
    
    # 2. Fill missing values with estimates
    for col in df.columns:
        if df[col].isnull().any():
            if pd.api.types.is_numeric_dtype(df[col]):
                # Fill numeric columns with mean
                mean_val = df[col].mean()
                # Assign the result back: chained fillna(..., inplace=True) triggers
                # warnings in recent pandas and may not update the DataFrame
                df[col] = df[col].fillna(mean_val)
                print(f"Filled missing values in numeric column '{col}' with mean: {mean_val}")
            else:
                # Fill categorical columns with mode
                mode_val = df[col].mode()
                if not mode_val.empty:
                    df[col] = df[col].fillna(mode_val[0])
                    print(f"Filled missing values in categorical column '{col}' with mode: {mode_val[0]}")
                else:
                    print(f"Column '{col}' has missing values but no mode found.")
    
    # 3. Report changes after filling
    missing_after = df.isnull().sum()
    print("\nMissing values per column after filling:")
    print(missing_after[missing_after > 0] if missing_after.sum() > 0 else "No missing values remaining.")
    
    # Optional: save corrected data
    corrected_file_path = "corrected_" + file_path
    df.to_csv(corrected_file_path, index=False)
    print(f"\nCorrected data saved to '{corrected_file_path}'.")

# Example usage
if __name__ == "__main__":
    detect_and_correct_incomplete_data("your_data.csv")


Initial missing values per column:
name     1
age      1
email    1
dtype: int64
Filled missing values in categorical column 'name' with mode: Alice Williams
Filled missing values in numeric column 'age' with mean: 28.5
Filled missing values in categorical column 'email' with mode: alice.williams@example.com

Missing values per column after filling:
No missing values remaining.

Corrected data saved to 'corrected_your_data.csv'.
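
Building the output name as `"corrected_" + file_path` works for a bare file name such as `your_data.csv`, but with a directory component the prefix would land on the directory instead of the file. A minimal sketch of a safer derivation with `pathlib` follows. `corrected_path` is a hypothetical helper name, and the example paths are illustrative only.

```python
from pathlib import Path


def corrected_path(file_path, prefix="corrected_"):
    """Return the input path with the prefix applied to the file name only."""
    p = Path(file_path)
    # with_name swaps the final path component and keeps the parent directories
    return p.with_name(prefix + p.name)


print(corrected_path("your_data.csv"))
print(corrected_path("data/raw/your_data.csv"))
```

The returned `Path` can be passed directly to `df.to_csv(...)`, which accepts path-like objects.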
