# Data Preprocessing Notebook

## Introduction
This Jupyter notebook preprocesses data collected from three main sources: flights, weather, and reviews, together with two reference tables (aircraft info and airports info). The goal is to clean and prepare the data for further analysis or machine learning tasks. Preprocessing includes handling missing values, identifying and dealing with outliers, and merging the datasets into a single cohesive structure.

---

## Objective
The goal of this preprocessing step is to ensure that the datasets are:
- **Clean**: Free from inaccuracies and inconsistencies.
- **Complete**: Missing values are addressed appropriately.
- **Conformant**: Data is standardized to expected formats.
- **Consolidated**: Relevant data from all three sources are combined logically.

---

## Datasets
The datasets being processed are:
1. **Flights**: Contains information about flight schedules, delays, and other related attributes.
2. **Weather**: Includes weather conditions at different airport locations.
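3. **Reviews**: Comprises customer review comments, keyed by airport.
4. **Aircraft info**: Reference data per aircraft (type, airline, age).
5. **Airports info**: Reference data per airport (rating, arrival and departure delay indices).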

---

## Preprocessing Steps
The preprocessing will be conducted in the following order:
1. **Initial Exploration**: Quick overview of the datasets to understand the structure and content.
2. **Data Cleaning**:
    - Removing duplicates.
    - Fixing structural errors (e.g., mislabeled classes, wrong data types).
3. **Handling Missing Values**:
    - Identifying missing values.
    - Deciding on a strategy to handle missing values (e.g., imputation, removal).
4. **Outlier Detection**:
    - Statistical methods to detect outliers.
    - Deciding on a strategy to handle outliers (e.g., trimming, capping, or correcting).
5. **Data Integration**:
    - Aligning datasets by common attributes.
    - Merging datasets into a unified table.
6. **Data Transformation**:
    - Normalization or scaling.
    - Encoding categorical variables.
7. **Final Inspection**:
    - Ensuring the processed data meets the initial objectives.
    - Storing the preprocessed data in a suitable format.
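
Steps 3 and 4 can be illustrated without Spark. The following is a minimal plain-Python sketch, not the notebook's implementation: it counts missing values per column and flags outliers with the 1.5 * IQR rule. The sample rows, the delay values, and the helper names `count_missing` and `iqr_outliers` are all hypothetical.

```python
from statistics import quantiles

def count_missing(rows, columns):
    """Return {column: number of rows where the value is None or ''}."""
    return {c: sum(1 for r in rows if r.get(c) in (None, "")) for c in columns}

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical sample rows, for illustration only
rows = [
    {"flight": "AA100", "delay_time": 12.0},
    {"flight": "BA200", "delay_time": None},
    {"flight": "", "delay_time": 600.0},
]
missing = count_missing(rows, ["flight", "delay_time"])

# Hypothetical delays in minutes; one value is far from the rest
delays = [-5, 0, 3, 7, 10, 12, 15, 20, 25, 600]
outliers = iqr_outliers(delays)
```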

---

## Tools and Libraries
- `Spark`: For distributed data processing.
- `PySpark`: Python API for Spark.
- `Pandas`: For data manipulation within Spark jobs.
- `Matplotlib`/`Seaborn`: For visualizations (if needed, considering the size of data).
- `MLlib`: Spark’s machine learning library (if preprocessing involves feature selection or dimensionality reduction).



In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, concat, concat_ws, date_format, expr, hour, lit, lower, lpad, mean, minute,
    regexp_extract, regexp_replace, split, to_date, to_timestamp, trim, when,
)
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, FloatType

---
## Flights

In [2]:
def flights_processing():
    """
    Transforms flight data by cleaning and structuring. Removes unnecessary columns, normalizes dates and times, 
    extracts key information from strings, and filters based on flight status. Assumes data is loaded from a CSV 
    with a predefined schema.

    Returns:
        flights_df (DataFrame): A Spark DataFrame with processed flights information.
    """
    # Initialize Spark Session
    spark = SparkSession.builder.appName("FlightsDataProcessing")\
        .config("spark.sql.legacy.timeParserPolicy", "LEGACY")\
        .getOrCreate()

    # Define the schema for reading the CSV file
    schema = StructType([
        StructField("aircraft", StringType(), True),
        StructField("temp1", StringType(), True),
        StructField("temp2", StringType(), True),
        StructField("date", StringType(), True),
        StructField("from", StringType(), True),
        StructField("to", StringType(), True),
        StructField("flight", StringType(), True),
        StructField("flight_time", StringType(), True),
        StructField("scheduled_time_departure", StringType(), True),
        StructField("actual_time_departure", StringType(), True),
        StructField("scheduled_time_arrival", StringType(), True),
        StructField("temp3", StringType(), True),
        StructField("status", StringType(), True),
        StructField("temp4", StringType(), True),
    ])

    # Load the data
    flights_df = spark.read.csv('./data/history/flights.csv', schema=schema, header=False)
    
    # Data Preprocessing Steps

    # 1. Remove unnecessary columns
    flights_df = flights_df.drop("temp1", "temp2", "temp3", "temp4")
    
    # 2. Convert date to DateType
    flights_df = flights_df.withColumn("date", to_date("date", "dd MMM yyyy"))

    # 7. Split 'status' into new 'status' and 'actual_time_arrival'
    split_col = split(col("status"), " ")
    flights_df = flights_df.withColumn("actual_time_arrival", expr("substring(status, length(status) - 4, 5)"))
    flights_df = flights_df.withColumn("status", split_col.getItem(0))

    
    # 8. Keep only rows whose status is 'Landed'
    flights_df = flights_df.filter(col("status").rlike("Landed"))

    
    # 3. Convert the time columns to TimestampType (the values are 24-hour 'HH:mm' strings)
    # Concatenate 'date' with each time value before converting to timestamp
    # This ensures the timestamp includes the correct date instead of defaulting to '1970-01-01'
    flights_df = flights_df.withColumn(
        "flight_time", 
        to_timestamp(concat_ws(" ", date_format(col("date"), "yyyy-MM-dd"), col("flight_time")), "yyyy-MM-dd HH:mm")
    ).withColumn(
        "scheduled_time_departure", 
        to_timestamp(concat_ws(" ", date_format(col("date"), "yyyy-MM-dd"), col("scheduled_time_departure")), "yyyy-MM-dd HH:mm")
    ).withColumn(
        "actual_time_departure", 
        to_timestamp(concat_ws(" ", date_format(col("date"), "yyyy-MM-dd"), col("actual_time_departure")), "yyyy-MM-dd HH:mm")
    ).withColumn(
        "scheduled_time_arrival", 
        to_timestamp(concat_ws(" ", date_format(col("date"), "yyyy-MM-dd"), col("scheduled_time_arrival")), "yyyy-MM-dd HH:mm")
    ).withColumn(
        "actual_time_arrival", 
        to_timestamp(concat_ws(" ", date_format(col("date"), "yyyy-MM-dd"), col("actual_time_arrival")), "yyyy-MM-dd HH:mm")
    )
    
    
    # 4. Extract city from 'from' and 'to' and convert it to lowercase
    flights_df = flights_df.withColumn("from_city", lower(split(col("from"), " \\(")[0])) \
                           .withColumn("to_city", lower(split(col("to"), " \\(")[0]))

    # 5. Extract airport code from 'from' and 'to'
    flights_df = flights_df.withColumn("from", lower(split(col("from"), " \\(")[1].substr(1, 3))) \
                           .withColumn("to", lower(split(col("to"), " \\(")[1].substr(1, 3)))

    
    # Add a new column 'rounded_hour' that represents the closest hour to the scheduled time arrival
    flights_df = flights_df.withColumn("hour", hour("scheduled_time_arrival")) \
        .withColumn("minute", minute("scheduled_time_arrival")) \
        .withColumn("rounded_hour",
                        when(col("minute") >= 30, expr("hour + 1"))
                        .otherwise(col("hour"))
                    ) \
        .drop("hour", "minute")
    
    # Adjust for the case where adding 1 to the hour results in 24.
    # Note: the date is not advanced, so 23:30 or later maps to 00:00 of the same day.
    flights_df = flights_df.withColumn("rounded_hour",
                    when(col("rounded_hour") == 24, 0)
                    .otherwise(col("rounded_hour"))
                    )
    
    # Convert 'rounded_hour' to a string with two digits
    hour_str = lpad(col("rounded_hour"), 2, '0')
    
    # Concatenate 'date' and 'hour_str' to form a datetime string
    datetime_str = concat_ws(" ", col("date"), hour_str)

    # Append ":00:00" to represent minutes and seconds, forming a full datetime string
    datetime_str = concat_ws(":", datetime_str, lit("00"), lit("00"))

    # Convert the datetime string to a timestamp
    flights_df = flights_df.withColumn("rounded_hour", to_timestamp(datetime_str, "yyyy-MM-dd HH:mm:ss"))

    # 10. Remove duplicates
    flights_df = flights_df.dropDuplicates()

    flights_df = flights_df.withColumn('airport', col('to'))

    # 11. Add status and delay_time
    # Calculate delay in minutes
    flights_df = flights_df.withColumn("delay_time", 
                                (col("actual_time_arrival").cast("long") - col("scheduled_time_arrival").cast("long")) / 60)
    
    # Define status based on delay_time
    flights_df = flights_df.withColumn("status", when(col("delay_time") > 15, "Delayed").otherwise("On Time"))
    
    # Check the schema of columns
    flights_df.printSchema()

    # Display the processed DataFrame
    flights_df.show(truncate=False)
    
    # Return the processed DataFrame
    return flights_df

In [3]:
# Run the flights processing function
flights_df = flights_processing()

root
 |-- aircraft: string (nullable = true)
 |-- date: date (nullable = true)
 |-- from: string (nullable = true)
 |-- to: string (nullable = true)
 |-- flight: string (nullable = true)
 |-- flight_time: timestamp (nullable = true)
 |-- scheduled_time_departure: timestamp (nullable = true)
 |-- actual_time_departure: timestamp (nullable = true)
 |-- scheduled_time_arrival: timestamp (nullable = true)
 |-- status: string (nullable = false)
 |-- actual_time_arrival: timestamp (nullable = true)
 |-- from_city: string (nullable = true)
 |-- to_city: string (nullable = true)
 |-- rounded_hour: timestamp (nullable = true)
 |-- airport: string (nullable = true)
 |-- delay_time: double (nullable = true)

+--------+----------+----+---+------+-------------------+------------------------+---------------------+----------------------+-------+-------------------+-----------------+----------+-------------------+-------+----------+
|aircraft|date      |from|to |flight|flight_time        |scheduled_ti
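
To make the status handling concrete, here is a small plain-Python sketch of the same rule the function applies: the last five characters of `status` are taken as the arrival time, and a flight counts as delayed when it arrives more than 15 minutes after schedule. The input string `"Landed 14:35"` is an assumed example of the raw format, the dates and times are invented, and `parse_status` and `classify` are hypothetical helpers.

```python
from datetime import datetime

def parse_status(raw_status):
    """Split a raw status such as 'Landed 14:35' into (status, 'HH:mm')."""
    return raw_status.split(" ")[0], raw_status[-5:]

def classify(date_str, scheduled_hhmm, actual_hhmm, threshold_min=15):
    """Return (delay in minutes, label), attaching both times to the same date."""
    fmt = "%Y-%m-%d %H:%M"
    scheduled = datetime.strptime(f"{date_str} {scheduled_hhmm}", fmt)
    actual = datetime.strptime(f"{date_str} {actual_hhmm}", fmt)
    delay = (actual - scheduled).total_seconds() / 60
    return delay, "Delayed" if delay > threshold_min else "On Time"

status, arrival = parse_status("Landed 14:35")  # assumed raw format
delay, label = classify("2024-01-15", "14:10", arrival)

# Caveat: a flight scheduled at 23:50 that lands at 00:20 gets a large
# negative delay, because both times carry the same calendar date.
late_night = classify("2024-01-15", "23:50", "00:20")
```

Because the notebook attaches the same `date` to every time column, overnight arrivals would be labeled "On Time" with a strongly negative `delay_time`. Whether that matters depends on how many flights in the data cross midnight.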

In [4]:
flights_df.groupBy('to').count().sort('count', ascending=False).show()

+---+-----+
| to|count|
+---+-----+
|dfw| 5130|
|ord| 4467|
|hnd| 4169|
|lhr| 3761|
|cdg| 3535|
|den| 3419|
|clt| 3120|
|mad| 3054|
|lim| 2851|
|can| 2849|
|mia| 2824|
|sgn| 2781|
|ewr| 2571|
|svo| 2531|
|yyz| 2490|
|bkk| 2412|
|iah| 2388|
|sea| 2296|
|sin| 2281|
|fra| 2240|
+---+-----+
only showing top 20 rows
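
The `rounded_hour` column is the join key against the weather data, so its edge cases are worth spelling out. The sketch below is a plain-Python illustration, not the notebook's code: it rounds a timestamp to the nearest hour by adding 30 minutes and truncating, which also advances the date when 23:30 or later rounds up. The notebook's version resets the hour to 0 but keeps the original date. `round_to_hour` is a hypothetical helper and the timestamps are invented.

```python
from datetime import datetime, timedelta

def round_to_hour(ts):
    """Round to the nearest hour; minutes >= 30 round up, rolling the date if needed."""
    shifted = ts + timedelta(minutes=30)
    return shifted.replace(minute=0, second=0, microsecond=0)

early = round_to_hour(datetime(2024, 1, 15, 14, 29))     # rounds down to 14:00
late = round_to_hour(datetime(2024, 1, 15, 14, 30))      # rounds up to 15:00
rollover = round_to_hour(datetime(2024, 1, 15, 23, 45))  # rolls over to the next day
```

In Spark, the same idea can probably be expressed as `date_trunc("hour", col("scheduled_time_arrival") + expr("INTERVAL 30 MINUTES"))`, which would replace the `hour`/`minute`/`lpad` steps. Treat that as an untested suggestion rather than a drop-in replacement.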



---
## Aircraft info

In [5]:
def aircrafts_info_processing():
    """
    Processes aircraft information data: drops the 'photo' column, extracts the aircraft age in years
    from 'first_flight', and lowercases the 'aircraft' identifier.
    
    Returns:
        aircraft_info_df (DataFrame): A Spark DataFrame with processed aircraft information.
    """
    # Initialize Spark Session with legacy time parser policy for compatibility
    spark = SparkSession.builder.appName("AircraftsInfoDataProcessing") \
        .config("spark.sql.legacy.timeParserPolicy", "LEGACY") \
        .getOrCreate()

    # Define the schema for the aircraft information data
    schema = StructType([
        StructField("msn", StringType(), True),
        StructField("type", StringType(), True),
        StructField("aircraft", StringType(), True),
        StructField("airline", StringType(), True),
        StructField("first_flight", StringType(), True),
        StructField("photo", StringType(), True),
    ])
    
    # Load the data from a CSV file, ensuring correct schema application
    aircraft_info_df = spark.read.csv('./data/history/aircrafts_info.csv', schema=schema, header=False)

    aircraft_info_df = aircraft_info_df.drop("photo")

    age_pattern = r"\((\d+) years\)"

    # Add a new column "age" that extracts the age part and converts it to an integer
    aircraft_info_df = aircraft_info_df.withColumn("age", regexp_extract(col("first_flight"), age_pattern, 1).cast("integer")).drop('first_flight')

    
    # Convert the 'aircraft' column to lowercase
    aircraft_info_df = aircraft_info_df.withColumn("aircraft", lower(aircraft_info_df["aircraft"]))

    # Check the schema of columns
    aircraft_info_df.printSchema()

    aircraft_info_df.show(truncate=False)

    return aircraft_info_df

In [6]:
aircraft_info_df = aircrafts_info_processing()

root
 |-- msn: string (nullable = true)
 |-- type: string (nullable = true)
 |-- aircraft: string (nullable = true)
 |-- airline: string (nullable = true)
 |-- age: integer (nullable = true)

+--------+----+--------+-----------------------+----+
|msn     |type|aircraft|airline                |age |
+--------+----+--------+-----------------------+----+
|09108   |A21N|vn-a508 |Vietnam Airlines       |4   |
|42175   |B739|n68811  |United Airlines        |10  |
|64301   |B739|n292ak  |Alaska Airlines        |5   |
|39392   |B738|b-5543  |Shandong Airlines      |13  |
|11468   |A21N|tc-rdu  |Pegasus                |NULL|
|4008    |DH8D|5y-vvu  |Bluebird Aviation      |24  |
|32937   |B737|b-2620  |China Southern Airlines|19  |
|567     |A310|ep-mnv  |Mahan Air              |33  |
|259     |A388|a6-evj  |Emirates               |5   |
|145135  |E145|s5-acj  |Amelia                 |25  |
|35965   |B738|n8695d  |Southwest Airlines     |7   |
|33346   |B738|n339pl  |American Airlines      |6   
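
The `age` column comes from a regular expression applied to `first_flight`. As a rough illustration in plain Python, the same pattern behaves as shown below. The sample strings are invented, since the real column format is only implied by the pattern. `extract_age` is a hypothetical helper that returns `None` where Spark's `regexp_extract` would return an empty string that then casts to NULL.

```python
import re

AGE_PATTERN = re.compile(r"\((\d+) years\)")  # same pattern as in the notebook

def extract_age(first_flight):
    """Return the age in years as int, or None when the pattern does not match."""
    if first_flight is None:
        return None
    match = AGE_PATTERN.search(first_flight)
    return int(match.group(1)) if match else None

# Invented sample values, for illustration only
samples = ["Mar 2020 (4 years)", "Jan 2024 (1 year)", "N/A", None]
ages = [extract_age(s) for s in samples]
```

Note that a singular form such as `(1 year)` does not match the pattern. If the raw data contains that form, it could account for some NULL ages, and making the final `s` optional (`years?`) would capture it.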

---
## Airports info

In [7]:
def airports_info_processing():
    """
    Processes airport information data, cleaning and converting specific columns to proper data types.
    N/A values are treated as null, and numeric fields are cast to their respective types.
    
    Returns:
        info_df (DataFrame): A Spark DataFrame with processed airport information.
    """
    # Initialize Spark Session with legacy time parser policy for compatibility
    spark = SparkSession.builder.appName("AirportsInfoDataProcessing") \
        .config("spark.sql.legacy.timeParserPolicy", "LEGACY") \
        .getOrCreate()

    # Define the schema for the airport information data
    schema = StructType([
        StructField("my_flightradar24_rating", StringType(), True),
        StructField("temp", StringType(), True),  # Placeholder for column due to scraping error
        StructField("arrival_delay_index", StringType(), True),
        StructField("departure_delay_index", StringType(), True),
        StructField("utc", StringType(), True),
        StructField("local", StringType(), True),
        StructField("airport", StringType(), True),
    ])

    # Load the data from a CSV file, ensuring correct schema application
    info_df = spark.read.csv('./data/history/airports_info.csv', schema=schema, header=False)

    # Drop the 'temp' column as it contains null values due to scraping errors
    info_df = info_df.drop("temp")

    # Replace "N/A" string values with null across the DataFrame
    info_df = info_df.na.replace("N/A", None)

    # Clean numeric fields and cast to correct types
    info_df = info_df.withColumn("my_flightradar24_rating", 
                                 regexp_replace(col("my_flightradar24_rating"), "[^0-9]", "").cast(IntegerType())) \
                     .withColumn("arrival_delay_index", col("arrival_delay_index").cast(FloatType())) \
                     .withColumn("departure_delay_index", col("departure_delay_index").cast(FloatType()))
    
    # Extract the utc time part and convert it to a Spark timestamp format
    info_df = info_df.withColumn("utc", to_timestamp(regexp_extract(col("utc"), "(\\d{2}:\\d{2})", 0), "HH:mm"))

    # Convert local time to a Spark timestamp format
    info_df = info_df.withColumn("local", to_timestamp(concat(lit("1970-01-01 "), col("local")), "yyyy-MM-dd hh:mm a"))

    # Calculate time difference utc-local
    info_df = info_df.withColumn("time_diff", col('utc')-col('local')).drop('utc', 'local')

    # Remove duplicates
    info_df = info_df.dropDuplicates()

    # Check the schema of columns
    info_df.printSchema()

    # Display the processed DataFrame
    info_df.show(truncate=False)

    # Return the processed DataFrame
    return info_df

In [8]:
# Run the airports info processing function
info_df = airports_info_processing()

root
 |-- my_flightradar24_rating: integer (nullable = true)
 |-- arrival_delay_index: float (nullable = true)
 |-- departure_delay_index: float (nullable = true)
 |-- airport: string (nullable = true)
 |-- time_diff: interval day to second (nullable = true)

+-----------------------+-------------------+---------------------+-------+------------------------------------+
|my_flightradar24_rating|arrival_delay_index|departure_delay_index|airport|time_diff                           |
+-----------------------+-------------------+---------------------+-------+------------------------------------+
|92                     |0.3                |1.7                  |doh    |INTERVAL '-0 03:00:00' DAY TO SECOND|
|63                     |0.4                |1.0                  |crl    |INTERVAL '-0 01:00:00' DAY TO SECOND|
|72                     |0.4                |0.8                  |tia    |INTERVAL '-0 01:00:00' DAY TO SECOND|
|81                     |0.4                |2.1              
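
The `time_diff` column is the difference between two clock readings that both sit on a dummy date, so a reading taken on either side of midnight can be off by 24 hours. The following plain-Python sketch shows the subtraction plus one possible wrap-around correction. The correction is an assumption and is not in the notebook: it only holds for offsets within 12 hours of UTC, which excludes the UTC+13 and UTC+14 zones. `utc_minus_local` is a hypothetical helper and the clock values are invented.

```python
from datetime import datetime, timedelta

def utc_minus_local(utc_hhmm, local_12h):
    """Return utc - local as a timedelta, wrapped into (-12h, +12h]."""
    utc = datetime.strptime(utc_hhmm, "%H:%M")
    local = datetime.strptime(local_12h, "%I:%M %p")
    diff = utc - local
    # Both clock readings carry a dummy date, so fix wrap-around at midnight
    if diff > timedelta(hours=12):
        diff -= timedelta(hours=24)
    elif diff <= timedelta(hours=-12):
        diff += timedelta(hours=24)
    return diff

same_day = utc_minus_local("12:00", "03:00 PM")
across_midnight = utc_minus_local("23:30", "02:30 AM")
```

Both calls describe an airport three hours ahead of UTC, and both return minus three hours. Without the correction, the second call would give plus 21 hours.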

---
## Weather

In [9]:
def weather_processing():
    """
    Processes weather data by cleaning and transforming specific columns.
    This includes removing non-numeric characters, converting date and time strings
    to timestamp format, and rounding each observation to the nearest hour.

    Returns:
        weather_df (DataFrame): A Spark DataFrame with processed weather information.
    """
    # Initialize Spark Session with a specified app name and configuration
    spark = SparkSession.builder.appName("WeatherDataProcessing") \
        .config("spark.sql.legacy.timeParserPolicy", "LEGACY") \
        .getOrCreate()

    # Define the schema for reading the CSV file
    schema = StructType([
        StructField("time", StringType(), True),
        StructField("temperature", StringType(), True),
        StructField("dew_point", StringType(), True),
        StructField("humidity", StringType(), True),
        StructField("wind", StringType(), True),
        StructField("wind_speed", StringType(), True),
        StructField("wind_gust", StringType(), True),
        StructField("pressure", StringType(), True),
        StructField("precip", StringType(), True),
        StructField("condition", StringType(), True),
        StructField("airport", StringType(), True),
        StructField("date", StringType(), True),
    ])

    # Load the data
    weather_df = spark.read.csv('./data/history/weather.csv', schema=schema, header=False)

    # Drop null values
    weather_df = weather_df.dropna(how="any")

    # Clean numeric fields and cast to correct types
    weather_df = weather_df.withColumn("temperature", 
                                 regexp_replace(col("temperature"), "[^0-9-]", "").cast(IntegerType())) \
                            .withColumn("dew_point", 
                                 regexp_replace(col("dew_point"), "[^0-9-]", "").cast(IntegerType())) \
                            .withColumn("humidity", 
                                 regexp_replace(col("humidity"), "[^0-9]", "").cast(IntegerType())) \
                            .withColumn("wind_speed", 
                                 regexp_replace(col("wind_speed"), "[^0-9]", "").cast(IntegerType())) \
                            .withColumn("wind_gust", 
                                 regexp_replace(col("wind_gust"), "[^0-9]", "").cast(IntegerType())) \
                            .withColumn("pressure", 
                                 regexp_replace(col("pressure"), "[^0-9.]", "").cast(FloatType())) \
                            .withColumn("precip", 
                                 regexp_replace(col("precip"), "[^0-9.]", "").cast(FloatType()))

    

    weather_df = weather_df.withColumn(
        "date_time", 
        to_timestamp(concat_ws(" ", split(col("date"), " ")[0], col("time")), "yyyy-MM-dd hh:mm a")
    ).drop("date", "time")

    # Remove duplicates
    weather_df = weather_df.dropDuplicates()

    # Add a new column 'rounded_hour' that represents the closest hour to date_time
    weather_df = weather_df.withColumn("date", to_date("date_time")) \
        .withColumn("hour", hour("date_time")) \
        .withColumn("minute", minute("date_time")) \
        .withColumn("rounded_hour",
                        when(col("minute") >= 30, expr("hour + 1"))
                        .otherwise(col("hour"))
                    ) \
        .drop("hour", "minute")
    
    # Adjust for the case where adding 1 to the hour results in 24.
    # Note: the date is not advanced, so 23:30 or later maps to 00:00 of the same day.
    weather_df = weather_df.withColumn("rounded_hour",
                    when(col("rounded_hour") == 24, 0)
                    .otherwise(col("rounded_hour"))
                    )

    # Convert 'rounded_hour' to a string with two digits
    rounded_hour = lpad(col("rounded_hour"), 2, '0')
    
    # Concatenate 'date' and the two-digit hour to form a datetime string
    datetime_str = concat_ws(" ", col("date"), rounded_hour)

    # Append ":00:00" to represent minutes and seconds, forming a full datetime string
    datetime_str = concat_ws(":", datetime_str, lit("00"), lit("00"))

    # Convert the datetime string to a timestamp
    weather_df = weather_df.withColumn("rounded_hour", to_timestamp(datetime_str, "yyyy-MM-dd HH:mm:ss")).drop('date')
    
    # Keep one observation per airport and rounded hour (these are the join keys)
    weather_df = weather_df.dropDuplicates(['airport', 'rounded_hour'])

    '''
    # Join the airports_info data with the aggregated weather data
    weather_df = weather_df.join(info_df, "airport", "left")

    # Converting weather date_time to local time using difference from joining info_df
    weather_df = weather_df.withColumn("date_time", expr("date_time - time_diff")).drop("time_diff")
    '''
    '''
    # Aggregating wind direction, wind speed, temperature, dew point, pressure and visibility
    weather_df = weather_df.groupBy("airport", "rounded_hour").agg(
        mean("wind_direction").alias("wind_direction"),
        mean("wind_speed").alias("wind_speed"),
        mean("temperature").alias("temperature"),
        mean("dew_point").alias("dew_point"),
        mean("pressure").alias("pressure"),
        mean("visibility").alias("visibility"),
    )
    '''
    
    # Check the schema of columns
    weather_df.printSchema()
        
    # Display the processed DataFrame
    weather_df.show(100, truncate=False)

    # Return the processed DataFrame
    return weather_df

In [10]:
# Run the weather processing function
weather_df = weather_processing()

root
 |-- temperature: integer (nullable = true)
 |-- dew_point: integer (nullable = true)
 |-- humidity: integer (nullable = true)
 |-- wind: string (nullable = true)
 |-- wind_speed: integer (nullable = true)
 |-- wind_gust: integer (nullable = true)
 |-- pressure: float (nullable = true)
 |-- precip: float (nullable = true)
 |-- condition: string (nullable = true)
 |-- airport: string (nullable = true)
 |-- date_time: timestamp (nullable = true)
 |-- rounded_hour: timestamp (nullable = true)

+-----------+---------+--------+----+----------+---------+--------+------+---------------------+-------+-------------------+-------------------+
|temperature|dew_point|humidity|wind|wind_speed|wind_gust|pressure|precip|condition            |airport|date_time          |rounded_hour       |
+-----------+---------+--------+----+----------+---------+--------+------+---------------------+-------+-------------------+-------------------+
|68         |50       |52      |NNW |14        |0        |29.88 
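
The numeric cleaning in `weather_processing` strips every character outside a small allowed set and then casts the result. A plain-Python sketch of the same idea follows. The raw strings and their units are invented, and `clean_int` and `clean_float` are hypothetical helpers that return `None` where Spark's cast would yield NULL.

```python
import re

def clean_int(raw):
    """Keep digits and '-' only, then convert; return None if conversion fails."""
    digits = re.sub(r"[^0-9-]", "", raw)
    try:
        return int(digits)
    except ValueError:
        return None

def clean_float(raw):
    """Keep digits and '.' only, then convert; return None if conversion fails."""
    digits = re.sub(r"[^0-9.]", "", raw)
    try:
        return float(digits)
    except ValueError:
        return None

# Invented raw values, for illustration only
temperature = clean_int("68 °F")
below_zero = clean_int("-4 °F")
pressure = clean_float("29.88 in")
empty = clean_int("N/A")
```

One consequence of this approach is that a value with no digits becomes NULL silently, so counting NULLs per column after the cast is a cheap way to see how many raw values failed to parse.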

---
## Reviews

In [11]:
def reviews_processing():
    """
    Cleans review data from a CSV file. This function lowercases comments, removes special characters,
    filters out empty comments, and removes duplicate rows. It initializes a Spark session, reads the data using
    a predefined schema, and applies text preprocessing to the 'comment' field. The cleaned DataFrame is written
    to './data/processed/reviews' as CSV and then returned.

    Returns:
        DataFrame: The processed reviews DataFrame.
    """
    # Initialization and data loading
    spark = SparkSession.builder.appName("ReviewsDataProcessing")\
        .config("spark.sql.legacy.timeParserPolicy", "LEGACY")\
        .getOrCreate()
    schema = StructType([
        StructField("comment", StringType(), True),
        StructField("airport", StringType(), True),
    ])
    reviews_df = spark.read.csv('./data/history/reviews.csv', schema=schema, header=False)

    # Data cleaning and preprocessing
    reviews_df = reviews_df.withColumn("comment", lower(col("comment")))
    reviews_df = reviews_df.withColumn("comment", regexp_replace(col("comment"), "[^a-zA-Z0-9 ]", ""))
    reviews_df = reviews_df.filter(trim(col("comment")) != "")
    reviews_df = reviews_df.dropDuplicates()

    path = "./data/processed/reviews"
    reviews_df.coalesce(1).write.csv(path=path, mode="overwrite", header=True)

    # Show and return the processed DataFrame
    reviews_df.show(truncate=False)

    return reviews_df

In [12]:
reviews_df = reviews_processing()

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
|comment                                                                                                                                                                                                                                                                                                                                                                                                                             
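
The comment cleaning deletes every character that is not an ASCII letter, digit, or space. The plain-Python sketch below reproduces that rule and contrasts it with a variant that substitutes a space instead, which keeps words apart when punctuation was the only separator. The variant is a suggestion, not what the notebook does. Both helper names and the sample text are hypothetical.

```python
import re

def clean_comment(text):
    """Lowercase and drop every character except letters, digits, and spaces."""
    return re.sub(r"[^a-zA-Z0-9 ]", "", text.lower())

def clean_comment_spaced(text):
    """Variant: replace dropped characters with a space, then collapse runs of spaces."""
    spaced = re.sub(r"[^a-z0-9 ]", " ", text.lower())
    return re.sub(r" +", " ", spaced).strip()

sample = "Great lounge,but long queues!"  # invented review text
as_in_notebook = clean_comment(sample)
spaced_variant = clean_comment_spaced(sample)
```

Both versions also remove accented and other non-ASCII letters, which may matter if the reviews are not all in English.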

In [13]:
print(flights_df.count())

371302


---
## Joining data

In [16]:
# Join the flights data with the hourly weather data
joined_df = flights_df.join(weather_df, ["rounded_hour", "airport"], "left")

# Join flights data with airports info
joined_df = joined_df.join(info_df, ["airport"], "left").drop("time_diff")


# Join flights data with aircrafts info
joined_df = joined_df.join(aircraft_info_df, ["aircraft"], "left")

print(joined_df.count())

# Check the schema of columns
joined_df.printSchema()

# Display the result to verify the join
joined_df.where(joined_df.airport == 'doh').show(1000, truncate=False)

371302
root
 |-- aircraft: string (nullable = true)
 |-- airport: string (nullable = true)
 |-- rounded_hour: timestamp (nullable = true)
 |-- date: date (nullable = true)
 |-- from: string (nullable = true)
 |-- to: string (nullable = true)
 |-- flight: string (nullable = true)
 |-- flight_time: timestamp (nullable = true)
 |-- scheduled_time_departure: timestamp (nullable = true)
 |-- actual_time_departure: timestamp (nullable = true)
 |-- scheduled_time_arrival: timestamp (nullable = true)
 |-- status: string (nullable = false)
 |-- actual_time_arrival: timestamp (nullable = true)
 |-- from_city: string (nullable = true)
 |-- to_city: string (nullable = true)
 |-- delay_time: double (nullable = true)
 |-- temperature: integer (nullable = true)
 |-- dew_point: integer (nullable = true)
 |-- humidity: integer (nullable = true)
 |-- wind: string (nullable = true)
 |-- wind_speed: integer (nullable = true)
 |-- wind_gust: integer (nullable = true)
 |-- pressure: float (nullable = true)


---
## Saving output data

In [17]:
path = "./data/processed/flights"
joined_df.coalesce(1).write.csv(path=path, mode="overwrite", header=True)
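
Spark's `write.csv` creates a directory at `path` rather than a single file. With `coalesce(1)` that directory holds one `part-*.csv` file alongside marker files such as `_SUCCESS`. The sketch below shows one way a downstream script might locate that file. `find_part_file` is a hypothetical helper, and the demo fakes the directory layout in a temporary folder so it runs without Spark.

```python
import glob
import os
import tempfile

def find_part_file(output_dir):
    """Return the path of the single part-*.csv file in a Spark output directory."""
    matches = glob.glob(os.path.join(output_dir, "part-*.csv"))
    if len(matches) != 1:
        raise FileNotFoundError(
            f"expected 1 part file in {output_dir}, found {len(matches)}"
        )
    return matches[0]

# Self-contained demo: fake a Spark-style output directory in a temp folder
with tempfile.TemporaryDirectory() as tmp:
    for name in ("_SUCCESS", "part-00000-example-c000.csv"):
        open(os.path.join(tmp, name), "w").close()
    found = os.path.basename(find_part_file(tmp))
```

CSV output does not preserve column types, so timestamps and numbers come back as strings unless a schema is supplied when reading.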

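## Conclusion
This notebook cleaned the flights, weather, reviews, aircraft, and airport datasets, then joined the flights with the weather, airport, and aircraft data on `rounded_hour`, `airport`, and `aircraft`. The joined table keeps all 371302 flight rows and is written to `./data/processed/flights`. The cleaned reviews are written separately to `./data/processed/reviews`.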

---

## Change Log
- **Version 1.0** [Date]: Initial version of the notebook.
