# Data Preprocessing Notebook

## Introduction
This Jupyter notebook preprocesses data collected from three sources: flights, weather, and reviews. It cleans and prepares the data for further analysis or machine learning tasks by handling missing values, identifying and treating outliers, and merging the datasets into a single cohesive structure.

---

## Objective
The goal of this preprocessing step is to ensure that the datasets are:
- **Clean**: Free from inaccuracies and inconsistencies.
- **Complete**: Missing values are addressed appropriately.
- **Conformant**: Data is standardized to expected formats.
- **Consolidated**: Relevant data from all three sources is combined logically.

---

## Datasets
The datasets being processed are:
1. **Flights**: Contains information about flight schedules, delays, and other related attributes.
2. **Weather**: Includes weather conditions at different airport locations.
3. **Reviews**: Comprises customer reviews and ratings for the flights.

---

## Tools and Libraries
- `Spark`: For distributed data processing.
- `PySpark`: Python API for Spark.
- `Pandas`: For data manipulation within Spark jobs.
- `Matplotlib`/`Seaborn`: For visualizations (if needed, given the size of the data).
- `MLlib`: Spark’s machine learning library (if preprocessing involves feature selection or dimensionality reduction).



In [472]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
from pyspark.sql.functions import col, to_timestamp, split, lit, concat, to_date, hour, minute, second, concat_ws, when, lower

---

In [473]:
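# LEGACY time parsing: Spark 3's default parser rejects the day-of-week letter 'E'
# when parsing, and the `date` pattern used below ("EEEE MMM dd yyyy") relies on it.
spark = SparkSession.builder.appName("FlightsDataProcessing")\
    .config("spark.sql.legacy.timeParserPolicy", "LEGACY")\
    .getOrCreate()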


In [474]:
schema = StructType([
    StructField("host", StringType(), True),
    StructField("time", StringType(), True),
    StructField("flight_code", StringType(), True),
    StructField("destination", StringType(), True),
    StructField("airline", StringType(), True),
    StructField("aircraft", StringType(), True),
    StructField("color", StringType(), True),
    StructField("status", StringType(), True),
    StructField("date", StringType(), True),
    StructField("type", StringType(), True)
])
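# The CSV has no header row, so column names come from the explicit schema above.
flights_df = spark.read.csv('./data/history/flights.csv', schema=schema, header=False)
# Drop the `color` column, which none of the following cells use.
flights_df = flights_df.drop("color")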


In [475]:
flights_df.show(10, truncate=False)

+----+-------+-----------+---------------+-----------------+-------------+----------------+-------------+----------+
|host|time   |flight_code|destination    |airline          |aircraft     |status          |date         |type      |
+----+-------+-----------+---------------+-----------------+-------------+----------------+-------------+----------+
|tia |3:05 AM|PC282      |Istanbul (SAW) |Pegasus          |B738 (TC-CRE)|Departed 4:05 AM|Sunday Mar 17|departures|
|tia |1:45 PM|NULL       |Milan (MXP)    |Wizz Air         |A21N (9H-WDP)|Departed 1:52 PM|Sunday Mar 17|departures|
|tia |4:55 AM|OS850      |Vienna (VIE)   |Austrian Airlines|A320 (OE-LBN)|Departed 4:49 AM|Sunday Mar 17|departures|
|tia |6:00 AM|W43841     |Dortmund (DTM) |Wizz Air         |A21N (HA-LZK)|Departed 7:01 AM|Sunday Mar 17|departures|
|tia |6:00 AM|W46625     |Brussels (CRL) |Wizz Air         |A321 (HA-LXA)|Departed 6:14 AM|Sunday Mar 17|departures|
|tia |6:10 AM|W43845     |Milan (BGY)    |Wizz Air         |A321

In [476]:
# The raw dates (e.g. "Sunday Mar 17") carry no year; 2024 is assumed for all rows.
flights_df = flights_df.withColumn("date", concat(col("date"), lit(" 2024")))

In [477]:
flights_df = flights_df.withColumn("date", to_date("date", "EEEE MMM dd yyyy"))
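
The two cells above build a full date from values such as `Sunday Mar 17`. As a rough cross-check outside Spark, the same parse can be sketched in plain Python. The helper name `parse_flight_date` and the constant `ASSUMED_YEAR` are hypothetical; the year 2024 is the assumption the notebook already makes, and the weekday comparison is an extra guard that is not part of the Spark code.

```python
from datetime import date, datetime

ASSUMED_YEAR = 2024  # assumption carried over from the notebook; the raw data has no year


def parse_flight_date(raw: str, year: int = ASSUMED_YEAR) -> date:
    """Parse a value such as 'Sunday Mar 17' by appending the assumed year."""
    parsed = datetime.strptime(f"{raw} {year}", "%A %b %d %Y").date()
    # CPython's strptime reads the weekday name but does not cross-check it
    # against the calendar date, so compare the two explicitly.
    weekday_name = raw.split()[0]
    if parsed.strftime("%A") != weekday_name:
        raise ValueError(f"{raw!r} does not fall on a {weekday_name} in {year}")
    return parsed


print(parse_flight_date("Sunday Mar 17"))  # 2024-03-17
```

If the weekday check fails for many rows, the assumed year is probably wrong for that part of the data.
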

In [478]:
flights_df = flights_df.withColumn("expected_time", to_timestamp(col("time"), "hh:mm a"))

In [479]:
from pyspark.sql.functions import col, to_timestamp, date_format, unix_timestamp

# Step 1: Extract the time component from the original timestamp
time_component = date_format(col("expected_time"), "HH:mm:ss")

# Step 2: Concatenate the correct date with the extracted time component and convert back to timestamp
# `date` is already a date column here, so it is formatted as 'yyyy-MM-dd' before concatenation.
corrected_timestamp = to_timestamp(
    concat_ws(" ", date_format(col("date"), "yyyy-MM-dd"), time_component),
    "yyyy-MM-dd HH:mm:ss"
)

# Apply the transformation
flights_df = flights_df.withColumn("expected_time", corrected_timestamp)
flights_df = flights_df.drop("time")
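
A time-only pattern such as `hh:mm a` carries no date, so the parsed value has to be re-attached to the flight date, which is what the cell above does. Below is a minimal plain-Python sketch of the same idea, intended only for spot-checking single rows. `combine_date_and_clock` is a hypothetical helper name, not part of the notebook.

```python
from datetime import date, datetime


def combine_date_and_clock(day: date, clock: str) -> datetime:
    """Attach a 12-hour clock string such as '3:05 AM' to a calendar date."""
    # %I accepts one- or two-digit hours; %p reads the AM/PM marker.
    time_of_day = datetime.strptime(clock, "%I:%M %p").time()
    return datetime.combine(day, time_of_day)


print(combine_date_and_clock(date(2024, 3, 17), "1:45 PM"))  # 2024-03-17 13:45:00
```
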

In [480]:
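# Split "Istanbul (SAW)" into a lower-cased city name and the three-letter airport code.
# Raw strings avoid Python's invalid-escape warning for "\("; substr positions are 1-based.
flights_df = flights_df.withColumn("dest_city", lower(split(col("destination"), r" \(")[0])) \
                       .withColumn("destination", split(col("destination"), r" \(")[1].substr(1, 3))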
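
The destination split can also be expressed as a single regular expression, which makes the expected shape of the value explicit and rejects anything that does not match. This is a plain-Python sketch for checking individual values. The names `DESTINATION_PATTERN` and `split_destination` are hypothetical, and the assumption that every code is exactly three upper-case letters comes only from the sample rows shown above.

```python
import re
from typing import Optional, Tuple

# Assumed shape: 'City Name (XXX)' with a three-letter upper-case airport code.
DESTINATION_PATTERN = re.compile(r"^(?P<city>.+?) \((?P<code>[A-Z]{3})\)$")


def split_destination(raw: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (lower-cased city, airport code), or (None, None) if the shape differs."""
    match = DESTINATION_PATTERN.match(raw.strip())
    if match is None:
        return None, None
    return match.group("city").lower(), match.group("code")


print(split_destination("Istanbul (SAW)"))  # ('istanbul', 'SAW')
```
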


In [481]:
# Keep only the aircraft type, e.g. "B738 (TC-CRE)" becomes "B738".
flights_df = flights_df.withColumn("aircraft", split(col("aircraft"), r" \(")[0])

In [482]:
from pyspark.sql.functions import col, split, expr

# `status` holds values such as "Departed 4:05 AM": a status word followed by a time.

# Split the 'status' column into an array of words
split_col = split(col("status"), " ")

# Everything after the first word and its trailing space becomes 'actual_time'.
# This must run before 'status' is overwritten below; it handles a remainder of any length.
flights_df = flights_df.withColumn("actual_time", expr("substring(status, length(split(status, ' ')[0]) + 2)"))

# Keep only the first word as the new 'status' column
flights_df = flights_df.withColumn("status", split_col.getItem(0))
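
The status split above takes the first word as the status and everything after it as the time. A plain-Python sketch of the same rule, under the assumption that the first space always separates the two parts; `split_status` is a hypothetical helper and the `"Canceled"` value is an illustrative input, not one taken from the data.

```python
from typing import Optional, Tuple


def split_status(raw: str) -> Tuple[str, Optional[str]]:
    """Split 'Departed 4:05 AM' into ('Departed', '4:05 AM')."""
    # partition cuts at the first space only, so the remainder can have any length.
    word, _, rest = raw.partition(" ")
    return word, (rest or None)


print(split_status("Departed 4:05 AM"))  # ('Departed', '4:05 AM')
print(split_status("Canceled"))          # ('Canceled', None)
```

Rows whose remainder is empty have no usable time, which is consistent with the later filter that keeps only departed or arrived flights.
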



In [483]:
# Filter rows with status column containing Departed or Arrived
flights_df = flights_df.filter(
    col("status").like("%Departed%") | 
    col("status").like("%Arrived%")
)

In [484]:
flights_df = flights_df.withColumn("actual_time", to_timestamp(col("actual_time"), "hh:mm a"))

In [485]:

# Step 1: Extract the time component from the original timestamp
time_component = date_format(col("actual_time"), "HH:mm:ss")

# Step 2: Concatenate the correct date with the extracted time component and convert back to timestamp
# `date` is already a date column here, so it is formatted as 'yyyy-MM-dd' before concatenation.
corrected_timestamp = to_timestamp(
    concat_ws(" ", date_format(col("date"), "yyyy-MM-dd"), time_component),
    "yyyy-MM-dd HH:mm:ss"
)

# Apply the transformation
flights_df = flights_df.withColumn("actual_time", corrected_timestamp)
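
Because `actual_time` reuses the same `date` as `expected_time`, a flight scheduled shortly before midnight that leaves after midnight would appear to depart almost a day early (assuming `date` refers to the scheduled day). One possible correction, sketched in plain Python, shifts the actual time by one day when it is implausibly far from the expected time. The helper `align_actual_time` and the 12-hour threshold are assumptions for illustration; nothing in the source data confirms them.

```python
from datetime import datetime, timedelta


def align_actual_time(expected: datetime, actual: datetime,
                      window: timedelta = timedelta(hours=12)) -> datetime:
    """Shift `actual` by one day if it lies more than `window` away from `expected`."""
    gap = actual - expected
    if gap < -window:
        # e.g. scheduled 23:50, left 00:10: the actual time belongs to the next day
        return actual + timedelta(days=1)
    if gap > window:
        # e.g. scheduled 00:10, left 23:55: the actual time belongs to the previous day
        return actual - timedelta(days=1)
    return actual


expected = datetime(2024, 3, 17, 23, 50)
actual = datetime(2024, 3, 17, 0, 10)
print(align_actual_time(expected, actual))  # 2024-03-18 00:10:00
```
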

In [486]:
flights_df = flights_df.drop("date")

In [487]:
flights_df.show(100, truncate=False)

+----+-----------+-----------+--------------------------------+--------+--------+----------+-------------------+---------+-------------------+
|host|flight_code|destination|airline                         |aircraft|status  |type      |expected_time      |dest_city|actual_time        |
+----+-----------+-----------+--------------------------------+--------+--------+----------+-------------------+---------+-------------------+
|tia |PC282      |SAW        |Pegasus                         |B738    |Departed|departures|2024-03-17 03:05:00|istanbul |2024-03-17 04:05:00|
|tia |NULL       |MXP        |Wizz Air                        |A21N    |Departed|departures|2024-03-17 13:45:00|milan    |2024-03-17 13:52:00|
|tia |OS850      |VIE        |Austrian Airlines               |A320    |Departed|departures|2024-03-17 04:55:00|vienna   |2024-03-17 04:49:00|
|tia |W43841     |DTM        |Wizz Air                        |A21N    |Departed|departures|2024-03-17 06:00:00|dortmund |2024-03-17 07:01:00|

## Preprocessing Steps
The preprocessing will be conducted in the following order:
1. **Initial Exploration**: Quick overview of the datasets to understand the structure and content.
2. **Data Cleaning**:
    - Removing duplicates.
    - Fixing structural errors (e.g., mislabeled classes, wrong data types).
3. **Handling Missing Values**:
    - Identifying missing values.
    - Deciding on a strategy to handle missing values (e.g., imputation, removal).
4. **Outlier Detection**:
    - Statistical methods to detect outliers.
    - Deciding on a strategy to handle outliers (e.g., trimming, capping, or correcting).
5. **Data Integration**:
    - Aligning datasets by common attributes.
    - Merging datasets into a unified table.
6. **Data Transformation**:
    - Normalization or scaling.
    - Encoding categorical variables.
7. **Final Inspection**:
    - Ensuring the processed data meets the initial objectives.
    - Storing the preprocessed data in a suitable format.
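
Steps 2 to 4 can be illustrated on a handful of in-memory rows. The sketch below is plain Python rather than Spark and only shows the logic; on a DataFrame the corresponding operations would typically be `dropDuplicates`, `dropna`/`fillna`, and `approxQuantile`. All helper names are hypothetical, the delays are derived from the sample output above, and the `XX000` row is an invented extreme value. The 1.5 × IQR fence is one common choice, not a method the notebook prescribes.

```python
from statistics import quantiles


def drop_duplicate_rows(rows):
    """Keep the first occurrence of each fully identical row."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def count_missing(rows, columns):
    """Count None values per column."""
    return {c: sum(row.get(c) is None for row in rows) for c in columns}


def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def cap(value, lower, upper):
    """Clamp a value into [lower, upper]."""
    return max(lower, min(upper, value))


# Delay in minutes = actual_time - expected_time for the sample rows.
rows = [
    {"flight_code": "PC282", "delay_min": 60},
    {"flight_code": "PC282", "delay_min": 60},    # exact duplicate
    {"flight_code": None, "delay_min": 7},        # missing flight code
    {"flight_code": "OS850", "delay_min": -6},
    {"flight_code": "W43841", "delay_min": 61},
    {"flight_code": "W46625", "delay_min": 14},
    {"flight_code": "XX000", "delay_min": 600},   # hypothetical outlier
]

unique = drop_duplicate_rows(rows)
missing = count_missing(unique, ["flight_code"])
lower, upper = iqr_bounds([r["delay_min"] for r in unique])
capped = [cap(r["delay_min"], lower, upper) for r in unique]
```

Whether to cap, trim, or keep such values depends on the later analysis, so the fences are best treated as a flagging aid first.
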

---



## Notes
- All changes will be documented and justified.
- Assumptions made during preprocessing will be clearly stated.
- Intermediate results will be visualized and inspected for validation.

---

## Conclusion
This section will summarize the preprocessing steps performed and discuss the readiness of the data for subsequent analysis.

---

## Change Log
- **Version 1.0** [Date]: Initial version of the notebook.
