# Data Preprocessing Notebook

## Introduction
This Jupyter notebook preprocesses data collected from three sources: flights, weather, and reviews. It cleans and prepares the data for downstream analysis or machine learning tasks by handling missing values, detecting and treating outliers, and merging the datasets into a single cohesive structure.

---

## Objective
The goal of this preprocessing step is to ensure that the datasets are:
- **Clean**: Free from inaccuracies and inconsistencies.
- **Complete**: Missing values are addressed appropriately.
- **Conformant**: Data is standardized to expected formats.
- **Consolidated**: Relevant data from all three sources are combined logically.

---

## Datasets
The datasets being processed are:
1. **Flights**: Contains flight schedules and delays, including flight code, destination, airline, aircraft, and status.
2. **Weather**: Includes timestamped weather conditions per airport: wind direction and speed, temperature, dew point, pressure, and visibility.
3. **Reviews**: Comprises customer reviews and ratings for the flights.

---

## Tools and Libraries
- `Spark`: For distributed data processing.
- `PySpark`: Python API for Spark.
- `Pandas`: For in-memory data manipulation alongside Spark jobs.
- `Matplotlib`/`Seaborn`: For visualizations (if needed, considering the size of data).
- `MLlib`: Spark’s machine learning library (if preprocessing involves feature selection or dimensionality reduction).



---

In [94]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, concat, concat_ws, date_format, expr, lit, lower,
    regexp_replace, split, to_date, to_timestamp, when,
)
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

In [95]:
def flights_processing():
    """
    Transforms flight data by cleaning and structuring. Removes unnecessary columns, normalizes dates and times, 
    extracts key information from strings, and filters based on flight status. Assumes data is loaded from a CSV 
    with a predefined schema.
    """
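    # Initialize Spark Session
    # The LEGACY parser policy is set because Spark 3's default parser does not
    # accept day-of-week ('E') patterns, such as "EEEE" below, when parsing
    spark = SparkSession.builder.appName("FlightsDataProcessing")\
        .config("spark.sql.legacy.timeParserPolicy", "LEGACY")\
        .getOrCreate()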

    # Define the schema for reading the CSV file
    schema = StructType([
        StructField("host", StringType(), True),
        StructField("time", StringType(), True),
        StructField("flight_code", StringType(), True),
        StructField("destination", StringType(), True),
        StructField("airline", StringType(), True),
        StructField("aircraft", StringType(), True),
        StructField("color", StringType(), True),
        StructField("status", StringType(), True),
        StructField("date", StringType(), True),
        StructField("type", StringType(), True)
    ])

    # Load the data
    flights_df = spark.read.csv('./data/history/flights.csv', schema=schema, header=False)

    # Data Preprocessing Steps

    # 1. Remove unnecessary columns
    flights_df = flights_df.drop("color")

    # 2. Append year to 'date' and convert to DateType
    flights_df = flights_df.withColumn("date", concat(col("date"), lit(" 2024")))
    flights_df = flights_df.withColumn("date", to_date("date", "EEEE MMM dd yyyy"))

    # 3. Convert 'time' to TimestampType assuming it contains AM/PM
    # Concatenate 'date' with 'time' before converting to timestamp for 'expected_time'
    # This ensures the timestamp includes the correct date instead of defaulting to '1970-01-01'
    flights_df = flights_df.withColumn(
        "expected_time", 
        to_timestamp(concat_ws(" ", date_format(col("date"), "yyyy-MM-dd"), col("time")), "yyyy-MM-dd hh:mm a")
    )

    # 4. Extract city from 'destination' (the text before " (")
    flights_df = flights_df.withColumn("dest_city", split(col("destination"), " \\(")[0])

    # 5. Extract the 3-letter airport code from 'destination' and convert it to lowercase
    #    (substr positions are 1-based in Spark)
    flights_df = flights_df.withColumn("destination", lower(split(col("destination"), " \\(")[1].substr(1, 3)))

    # 6. Extract aircraft model from 'aircraft'
    flights_df = flights_df.withColumn("aircraft", split(col("aircraft"), " \\(")[0])

    # 7. Split 'status' into new 'status' (first word) and 'actual_time' (the remainder)
    # 'actual_time' is derived first, while 'status' still holds the full original string
    split_col = split(col("status"), " ")
    flights_df = flights_df.withColumn("actual_time", expr("substring(status, length(split(status, ' ')[0]) + 2)"))
    flights_df = flights_df.withColumn("status", split_col.getItem(0))

    # 8. Filter rows to only include statuses 'Departed' or 'Arrived'
    flights_df = flights_df.filter(col("status").rlike("Departed|Arrived"))

    # 9. Prefix 'actual_time' with the 'date' column and convert to TimestampType
    flights_df = flights_df.withColumn(
        "actual_time",
        to_timestamp(concat_ws(" ", col("date"), col("actual_time")), "yyyy-MM-dd hh:mm a")
    ).drop("time", "date")

    for i in flights_df.dtypes:
        print(i)
    # Display the processed DataFrame
    flights_df.show(100, truncate=False)

    return flights_df
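
The string rules in steps 4, 5 and 7 can be checked on single values without a Spark session. The following is a minimal pure-Python sketch, not part of the notebook's pipeline. The helper names `parse_destination` and `parse_status` are hypothetical, and the raw formats `Istanbul (SAW)` and `Departed 4:05 AM` are assumptions inferred from the split patterns used in the function.

```python
from datetime import date, datetime


def parse_destination(raw):
    """Split 'City (CODE)' into (city, lowercase 3-letter code)."""
    city, _, rest = raw.partition(" (")
    return city, rest[:3].lower()


def parse_status(raw, flight_date):
    """Split 'Status h:mm AM' into (status, datetime or None)."""
    status, _, time_part = raw.partition(" ")
    try:
        clock = datetime.strptime(time_part, "%I:%M %p").time()
    except ValueError:
        # No parsable time after the status word (e.g. 'Scheduled')
        return status, None
    return status, datetime.combine(flight_date, clock)


# Example values are assumed, not taken from the raw CSV
city, code = parse_destination("Istanbul (SAW)")
status, actual = parse_status("Departed 4:05 AM", date(2024, 3, 17))
print(city, code, status, actual)
```

Unlike the Spark version, which yields `NULL` when the pattern is absent, this sketch returns an empty code string for a destination without parentheses.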

In [96]:
# Run the flights processing function
flights_df = flights_processing()

('host', 'string')
('flight_code', 'string')
('destination', 'string')
('airline', 'string')
('aircraft', 'string')
('status', 'string')
('type', 'string')
('expected_time', 'timestamp')
('dest_city', 'string')
('actual_time', 'timestamp')


+----+-----------+-----------+--------------------------------+--------+--------+----------+-------------------+---------+-------------------+
|host|flight_code|destination|airline                         |aircraft|status  |type      |expected_time      |dest_city|actual_time        |
+----+-----------+-----------+--------------------------------+--------+--------+----------+-------------------+---------+-------------------+
|tia |PC282      |saw        |Pegasus                         |B738    |Departed|departures|2024-03-17 03:05:00|Istanbul |2024-03-17 04:05:00|
|tia |NULL       |mxp        |Wizz Air                        |A21N    |Departed|departures|2024-03-17 13:45:00|Milan    |2024-03-17 13:52:00|
|tia |OS850      |vie        |Austrian Airlines               |A320    |Departed|departures|2024-03-17 04:55:00|Vienna   |2024-03-17 04:49:00|
|tia |W43841     |dtm        |Wizz Air                        |A21N    |Departed|departures|2024-03-17 06:00:00|Dortmund |2024-03-17 07:01:00|
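
Because step 9 attaches the scheduled date to the actual time, a flight whose actual time falls on the other side of midnight would receive a timestamp that is one day off. One possible correction is sketched below in plain Python. It assumes that no flight deviates from its schedule by more than 12 hours; the function name `align_actual_time` and the `window_hours` parameter are hypothetical.

```python
from datetime import datetime, timedelta


def align_actual_time(expected, actual, window_hours=12):
    """Shift `actual` by one day if it is implausibly far from `expected`."""
    window = timedelta(hours=window_hours)
    diff = actual - expected
    if diff < -window:
        # Scheduled late in the evening, actually happened after midnight
        return actual + timedelta(days=1)
    if diff > window:
        # Scheduled just after midnight, actually happened the evening before
        return actual - timedelta(days=1)
    return actual


# Scheduled 23:50, departed 00:10: the naive timestamp is 23h40m too early
expected = datetime(2024, 3, 17, 23, 50)
naive_actual = datetime(2024, 3, 17, 0, 10)
print(align_actual_time(expected, naive_actual))
```

The same rule could be expressed in Spark with `when` and an interval offset; the threshold would need to be validated against the real delay distribution.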

In [97]:
def weather_processing():
    """
    Processes weather data by cleaning and transforming specific columns.
    This includes removing non-numeric characters, handling special cases in visibility,
    and converting date_time strings to timestamp format.
    """
    # Initialize Spark Session with a specified app name and configuration
    spark = SparkSession.builder.appName("WeatherDataProcessing") \
        .config("spark.sql.legacy.timeParserPolicy", "LEGACY") \
        .getOrCreate()

    # Define the schema for the weather data
    schema = StructType([
        StructField("day", StringType(), True),
        StructField("time", StringType(), True),
        StructField("wind_direction", StringType(), True),
        StructField("wind_speed", StringType(), True),
        StructField("temperature", StringType(), True),
        StructField("dew_point", StringType(), True),
        StructField("pressure", StringType(), True),
        StructField("visibility", StringType(), True),
        StructField("date_time", StringType(), True),
        StructField("airport", StringType(), True)
    ])

    # Load the weather data from a CSV file
    weather_df = spark.read.csv('./data/history/weather.csv', schema=schema, header=False)

    # Drop unnecessary columns
    weather_df = weather_df.drop("day", "time")

    # Clean numeric columns: strip everything except digits and the minus sign
    # (so negative temperatures keep their sign), then cast to IntegerType
    numeric_columns = ["wind_direction", "wind_speed", "temperature", "dew_point", "pressure"]
    for column in numeric_columns:
        weather_df = weather_df.withColumn(column, regexp_replace(col(column), "[^0-9-]", "").cast(IntegerType()))

    # Handle special visibility cases by replacing specific text with a numeric value and cleaning
    weather_df = weather_df.withColumn("visibility",
                                        when(col("visibility").rlike("Sky and visibility OK"), 10000)
                                        .otherwise(regexp_replace(col("visibility"), "[^0-9]", ""))
                                        .cast(IntegerType()))

    # Convert 'date_time' column to timestamp format
    weather_df = weather_df.withColumn("date_time", to_timestamp(col("date_time"), "HH:mm:ss yyyy-MM-dd"))


    for i in weather_df.dtypes:
        print(i)
        
    # Display the processed DataFrame
    weather_df.show(100, truncate=False)

    return weather_df
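
The numeric cleaning rules can also be exercised on single strings without Spark. The sketch below is a pure-Python approximation, not the notebook's implementation. It keeps the minus sign so that sub-zero temperatures survive cleaning. The helper names `clean_int` and `clean_visibility` are hypothetical, and raw values such as `-3 °C` or `1016 hPa` are assumed formats.

```python
import re


def clean_int(raw):
    """Keep digits and minus signs, then convert to int (None if not a number)."""
    if raw is None:
        return None
    stripped = re.sub(r"[^0-9-]", "", raw)
    try:
        return int(stripped)
    except ValueError:
        # Empty or malformed result, comparable to a NULL after a failed cast
        return None


def clean_visibility(raw):
    """Map the 'Sky and visibility OK' phrase to 10000, else clean as an integer."""
    if raw is not None and "Sky and visibility OK" in raw:
        return 10000
    return clean_int(raw)


# Example inputs are assumptions about the raw CSV contents
for value in ["-3 °C", "1016 hPa", "Variable", None]:
    print(repr(value), "->", clean_int(value))
```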

In [98]:
# Run the weather processing function
weather_df = weather_processing()

('wind_direction', 'int')
('wind_speed', 'int')
('temperature', 'int')
('dew_point', 'int')
('pressure', 'int')
('visibility', 'int')
('date_time', 'timestamp')
('airport', 'string')
+--------------+----------+-----------+---------+--------+----------+-------------------+-------+
|wind_direction|wind_speed|temperature|dew_point|pressure|visibility|date_time          |airport|
+--------------+----------+-----------+---------+--------+----------+-------------------+-------+
|240           |8         |26         |12       |1016    |10000     |2024-03-18 11:30:00|aae    |
|230           |9         |25         |12       |1017    |10000     |2024-03-18 11:00:00|aae    |
|230           |9         |24         |12       |1017    |10000     |2024-03-18 10:30:00|aae    |
|230           |9         |23         |11       |1017    |10000     |2024-03-18 10:00:00|aae    |
|220           |10        |22         |12       |1018    |10000     |2024-03-18 09:30:00|aae    |
|220           |9         |21

## Preprocessing Steps
The preprocessing will be conducted in the following order:
1. **Initial Exploration**: Quick overview of the datasets to understand the structure and content.
2. **Data Cleaning**:
    - Removing duplicates.
    - Fixing structural errors (e.g., mislabeled classes, wrong data types).
3. **Handling Missing Values**:
    - Identifying missing values.
    - Deciding on a strategy to handle missing values (e.g., imputation, removal).
4. **Outlier Detection**:
    - Statistical methods to detect outliers.
    - Deciding on a strategy to handle outliers (e.g., trimming, capping, or correcting).
5. **Data Integration**:
    - Aligning datasets by common attributes.
    - Merging datasets into a unified table.
6. **Data Transformation**:
    - Normalization or scaling.
    - Encoding categorical variables.
7. **Final Inspection**:
    - Ensuring the processed data meets the initial objectives.
    - Storing the preprocessed data in a suitable format.
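
Steps 4 and 5 can be illustrated on small in-memory samples. The sketch below is plain Python under stated assumptions, not the pipeline's actual implementation. It flags outliers with Tukey's interquartile-range fences and matches a flight time to the closest weather observation. The function names `iqr_bounds` and `nearest_observation`, the multiplier `k=1.5`, and the sample delay values are illustrative assumptions.

```python
from bisect import bisect_left
from datetime import datetime
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (low, high) Tukey fences for a list of numbers."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def nearest_observation(observations, when):
    """Return the (timestamp, payload) pair closest to `when`.

    `observations` must be sorted by timestamp; returns None if it is empty.
    """
    if not observations:
        return None
    times = [ts for ts, _ in observations]
    i = bisect_left(times, when)
    # Only the neighbours on either side of the insertion point can be closest
    candidates = observations[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda obs: abs(obs[0] - when))


# Hypothetical delays in minutes (actual_time minus expected_time)
delays = [7, -6, 60, 61, 5, 3, 480]
low, high = iqr_bounds(delays)
outliers = [d for d in delays if d < low or d > high]

# Temperature readings for one airport, sorted by observation time
weather = [
    (datetime(2024, 3, 18, 10, 0), 23),
    (datetime(2024, 3, 18, 10, 30), 24),
    (datetime(2024, 3, 18, 11, 0), 25),
]
match = nearest_observation(weather, datetime(2024, 3, 18, 10, 40))
print(outliers, match)
```

In the merged table, the lookup would be done per airport, which assumes that the weather `airport` code corresponds to the `host` or `destination` codes in the flights data.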

---



## Notes
- All changes will be documented and justified.
- Assumptions made during preprocessing will be clearly stated.
- Intermediary results will be visualized and inspected for validation.

---

## Conclusion
This section will summarize the preprocessing steps performed and discuss the readiness of the data for subsequent analysis.

---

## Change Log
- **Version 1.0** [Date]: Initial version of the notebook.
