# Robust pipeline

| Requirement ID | Description                     | User Story                                                                 | Expected Behaviour / Outcome                                      |
|-----------------|---------------------------------|---------------------------------------------------------------------------|-------------------------------------------------------------------|
| RQ-001          | User Authentication            | As a user, I want to log in securely so that I can access my account.    | The system should validate credentials and grant access securely. |
| RQ-002          | Data Upload                    | As a user, I want to upload files so that I can share them with others.  | The system should allow file uploads and provide confirmation.   |
| RQ-003          | Data Visualization             | As a user, I want to view data charts so that I can analyze trends.      | The system should generate and display interactive charts.       |
| RQ-004          | Notification System            | As a user, I want to receive alerts so that I stay informed.             | The system should send timely notifications via email or SMS.    |

In [None]:
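from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Data columns plus a nullable string column that receives the raw text of
# malformed rows when the files are read in PERMISSIVE mode
schema = StructType([
    StructField("hour", IntegerType(), nullable=False),
    StructField("temperature", IntegerType(), nullable=False),
    StructField("humidity", IntegerType(), nullable=False),
    StructField("_corrupt_record", StringType(), nullable=True)
])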
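
To make concrete what it means for a row to fit this schema, here is a plain-Python analogue of a permissive read. It is only a sketch: `COLUMNS` and `parse_permissive` are hypothetical names introduced for illustration, the comma splitting is deliberately naive, and Spark's own CSV parser differs in detail and across versions.

```python
# Plain-Python analogue only; this is not how Spark implements PERMISSIVE mode.
COLUMNS = ("hour", "temperature", "humidity")  # mirrors the integer columns of the schema


def parse_permissive(line):
    """Parse one CSV line into a dict, keeping the raw text of rows that do not fit."""
    tokens = [t.strip() for t in line.split(",")]
    row = {}
    # Assumption for this sketch: a wrong token count also marks the row as malformed
    malformed = len(tokens) != len(COLUMNS)
    for i, name in enumerate(COLUMNS):
        try:
            row[name] = int(tokens[i])
        except (IndexError, ValueError):
            # A missing or non-integer field becomes None instead of aborting the read
            row[name] = None
            malformed = True
    # Well-formed rows get None here; malformed rows keep their original text
    row["_corrupt_record"] = line if malformed else None
    return row


rows = [parse_permissive(line) for line in ["1,20,55", "2,abc,60", "3,21"]]
```

In this sketch the first row parses cleanly, while the second and third keep their raw text in `_corrupt_record`, which is the property the later filter step relies on.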

In [None]:
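# Read the CSV files in PERMISSIVE mode: malformed rows are kept rather than dropped,
# and their raw text is stored in _corrupt_record (the schema must declare this
# column as a nullable string for it to be populated)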
df = spark.read.option("mode", "PERMISSIVE").schema(schema).csv("/data/*.csv")

# Collect all corrupt records in a separate DataFrame
bad_df = df.filter(df["_corrupt_record"].isNotNull())

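# Remove the bad rows from df so that valid and invalid data do not overlap.
# exceptAll is a multiset difference: each row in bad_df cancels one matching row
# in df, so duplicate valid rows are preserved (subtract would also deduplicate).
# This still holds when bad_df grows through additional bad-data checks.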
df = df.exceptAll(bad_df)

# Write valid and invalid data to separate tables
# df.write.format("parquet").save("/path/to/table")
# bad_df.write.format("parquet").save("/path/to/table_bad")
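
The difference between `exceptAll` and a set-style difference is easy to miss, so here is a minimal plain-Python sketch of the multiset semantics. The helper `except_all` and the sample rows are made up for illustration and are not part of the pipeline. Unlike Spark, this sketch preserves input order.

```python
from collections import Counter


def except_all(left, right):
    """Multiset difference: each row in `right` cancels one equal row in `left`."""
    remaining = Counter(right)  # how many copies of each row still need cancelling
    result = []
    for row in left:
        if remaining[row] > 0:
            remaining[row] -= 1  # cancel exactly one occurrence
        else:
            result.append(row)
    return result


# Hypothetical sample data: one valid row appears twice, one row is bad
all_rows = [(1, 20, 55), (1, 20, 55), (2, None, 60), (3, 21, 50)]
bad_rows = [(2, None, 60)]

valid_rows = except_all(all_rows, bad_rows)

# A set-style difference would collapse the duplicated valid row
distinct_valid = set(all_rows) - set(bad_rows)
```

Here `valid_rows` keeps both copies of `(1, 20, 55)`, whereas `distinct_valid` keeps only one. That is the reason to prefer `exceptAll` when duplicate readings are legitimate data.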