

#### NOTEBOOK 1: Airline Delays - DATA PIPELINE

***Delta Lake*** is a transactional storage layer that runs on top of cloud storage such as Azure Data Lake Storage, delivering reliability, security, and performance. It is the foundation of a cost-effective, highly scalable data pipeline. With support for ACID transactions, schema enforcement, and other features, Delta Lake delivers ***massive scale and speed***, with ETL workloads reported to run up to 48% faster.

Several notable features of Delta Lake help with our ETL work:

***1. Scalable metadata handling:*** Leverages Spark’s distributed processing power to handle all the metadata for petabyte-scale tables with billions of files with ease.

***2. Schema enforcement:*** Automatically handles schema variations to prevent insertion of bad records during ingestion. 

***3. ACID transactions on Spark:*** Serializable isolation, implemented through a transaction log with checkpoint support, ensures that readers never see inconsistent data.

***4. Upserts and deletes:*** Supports merge, update, and delete operations to enable complex use cases like change-data-capture, slowly-changing-dimension (SCD) operations, and streaming upserts.

***5. Open Format:***  All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet

***6. Performance:*** Delta boasts query performance 10 to 100 times faster than Apache Spark on Parquet. It accomplishes this via Data Skipping (Delta maintains file statistics so that only the relevant portions of the data are read in a query), Compaction (Delta manages the sizes of the underlying Parquet files for the most efficient use), and Data Caching (Delta automatically caches frequently accessed data to improve run times for commonly run queries), as well as other optimizations.
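To make the data skipping idea concrete, here is a toy sketch in plain Python. It is not Delta Lake's implementation; the `FileStats` structure, the file names, and the `files_to_scan` helper are hypothetical and only illustrate how per-file min/max statistics let a query ignore files that cannot contain matching rows.

```python
from dataclasses import dataclass


@dataclass
class FileStats:
    """Per-file min/max statistics for one column (hypothetical structure)."""
    path: str
    min_value: int
    max_value: int


def files_to_scan(stats, lower, upper):
    """Return only the files whose [min, max] range overlaps [lower, upper]."""
    return [s.path for s in stats if s.max_value >= lower and s.min_value <= upper]


# Toy statistics on a YEAR column for three made-up Parquet files.
stats = [
    FileStats("part-0.parquet", 2015, 2016),
    FileStats("part-1.parquet", 2017, 2018),
    FileStats("part-2.parquet", 2019, 2019),
]

# A filter on YEAR = 2019 only needs to open the last file.
selected = files_to_scan(stats, 2019, 2019)
print(selected)  # ['part-2.parquet']
```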

### 1. Create data tables from delta files

In [0]:

from pyspark.sql.functions import col, max

blob_container = "w261group5container" # The name of your container created in https://portal.azure.com
storage_account = "w261team5storage" # The name of your Storage account created in https://portal.azure.com
secret_scope = "w261_group_05" # The name of the scope created in your local computer using the Databricks CLI
secret_key = "w261_group_05_key" # The name of the secret key created in your local computer using the Databricks CLI
blob_url = f"wasbs://{blob_container}@{storage_account}.blob.core.windows.net"
mount_path = "/mnt/mids-w261" 

spark.conf.set(
  f"fs.azure.sas.{blob_container}.{storage_account}.blob.core.windows.net",
  dbutils.secrets.get(scope = secret_scope, key = secret_key)
)

#### 1.1 Airline data table

In [0]:
%sql
--create new table and drop the current one if exists
DROP TABLE IF EXISTS tbl_team05_airline_data;

In [0]:
#airline data table 
file_location =  f"{blob_url}/airline_data_delta"
file_type = "delta"
df = spark.read.format(file_type).option("inferSchema", "true").load(file_location)
df.write.format("delta").saveAsTable("tbl_team05_airline_data")

#### 1.2 Weather data table

In [0]:
%sql
DROP TABLE IF EXISTS tbl_team05_weather_data;

In [0]:
#weather data
file_location =  f"{blob_url}/weather_data_delta"
file_type = "delta"
df = spark.read.format(file_type).option("inferSchema", "true").load(file_location)
df.write.format("delta").saveAsTable("tbl_team05_weather_data")

#### 1.3 Weather Stations data table

In [0]:
%sql
DROP TABLE IF EXISTS tbl_team05_stations_data;

In [0]:
#station table
file_location =  f"{blob_url}/stations_data_delta"
file_type = "delta"
df = spark.read.format(file_type).option("inferSchema", "true").load(file_location)
df.write.format("delta").saveAsTable("tbl_team05_stations_data")


#### 1.4 Airport Code table

In [0]:
%sql
DROP TABLE IF EXISTS tbl_team05_airport_codes;

In [0]:
file_location =  f"{blob_url}/airport_delta_with_timezone"
file_type = "delta"
df = spark.read.format(file_type).option("inferSchema", "true").load(file_location)
df.write.format("delta").saveAsTable("tbl_team05_airport_codes")
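Sections 1.1 to 1.4 repeat the same drop, load, and save pattern. As a possible refactoring, the sketch below wraps that pattern in one helper. The names `DELTA_SOURCES` and `register_delta_tables` are hypothetical; the function only uses calls that already appear in the cells above, and it takes `spark` as a parameter so that nothing runs until it is called from the notebook.

```python
# Table name -> folder under blob_url, as used in sections 1.1 to 1.4.
DELTA_SOURCES = {
    "tbl_team05_airline_data": "airline_data_delta",
    "tbl_team05_weather_data": "weather_data_delta",
    "tbl_team05_stations_data": "stations_data_delta",
    "tbl_team05_airport_codes": "airport_delta_with_timezone",
}


def register_delta_tables(spark, blob_url, sources=DELTA_SOURCES):
    """Drop, reload, and save each Delta folder as a table; return the table names."""
    for table_name, folder in sources.items():
        spark.sql(f"DROP TABLE IF EXISTS {table_name}")
        df = spark.read.format("delta").load(f"{blob_url}/{folder}")
        df.write.format("delta").saveAsTable(table_name)
    return list(sources)


# In the notebook, where `spark` and `blob_url` already exist:
# register_delta_tables(spark, blob_url)
```

Keeping the table-to-folder mapping in one dictionary makes it easier to see at a glance which Delta folders feed which tables.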

### 2. New Data Table that links airport and weather station
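We note that the neighbor call signs in the stations table can be used to identify the airport codes: the last three characters of neighbor_call give the airport code that we match against the origin column of the airline data.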

In [0]:
%sql
DROP TABLE IF EXISTS tbl_team05_airport_station;

In [0]:
%sql
CREATE TABLE tbl_team05_airport_station (
airport_code string, 
airport_name string,
airport_call string,
airport_state string,
airport_lat double,
airport_long double,
distance double,
station_lat double,
station_long double,
station_id string
);

In [0]:
%sql
insert into tbl_team05_airport_station
select right(neighbor_call,3), neighbor_name, neighbor_call , neighbor_state, neighbor_lat, neighbor_lon, distance_to_neighbor , lat, lon, station_id 
from( select neighbor_call, neighbor_name, neighbor_state, neighbor_lat, neighbor_lon, station_id,lat, lon ,distance_to_neighbor, ROW_NUMBER() OVER(PARTITION BY neighbor_call ORDER BY distance_to_neighbor  ) as rn
from tbl_team05_stations_data) as a
where rn = 1 and right(neighbor_call,3) in (select distinct origin from tbl_team05_airline_data ) order by neighbor_call

num_affected_rows,num_inserted_rows
343,343
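The INSERT above keeps, for every neighbor call sign, the single station with the smallest distance, and then retains only the airports that appear as an origin in the airline data. The following plain-Python sketch mirrors that logic on a few made-up rows; the station ids, distances, and the `closest_station_per_airport` helper are illustrative assumptions, not values from the real tables.

```python
def closest_station_per_airport(rows, served_airports):
    """Keep, for each neighbor_call, the row with the smallest distance.

    Mirrors ROW_NUMBER() OVER (PARTITION BY neighbor_call
    ORDER BY distance_to_neighbor) = 1, then filters on the airport code
    taken from the last three characters of neighbor_call.
    """
    best = {}
    for row in rows:
        call = row["neighbor_call"]
        if call not in best or row["distance_to_neighbor"] < best[call]["distance_to_neighbor"]:
            best[call] = row
    return {
        call[-3:]: row["station_id"]
        for call, row in sorted(best.items())
        if call[-3:] in served_airports
    }


# Made-up sample rows; station ids and distances are illustrative only.
rows = [
    {"station_id": "S1", "neighbor_call": "KORD", "distance_to_neighbor": 12.5},
    {"station_id": "S2", "neighbor_call": "KORD", "distance_to_neighbor": 0.0},
    {"station_id": "S3", "neighbor_call": "KSFO", "distance_to_neighbor": 3.1},
    {"station_id": "S4", "neighbor_call": "KXYZ", "distance_to_neighbor": 0.0},
]
mapping = closest_station_per_airport(rows, served_airports={"ORD", "SFO"})
print(mapping)  # {'ORD': 'S2', 'SFO': 'S3'}
```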


In [0]:
#%sql
#select * from  tbl_team05_airport_station

### 3. New Flight Data Table (includes the WEATHER stations of origin and destination)
We will create a new airline data table from the original airline data table by adding the scheduled and actual UTC departure and arrival times and the weather stations. We also extract the hours from those times as separate fields for the JOIN later.

In [0]:

# CURRENT_TIMESTAMP() and -1 are placeholders that the UPDATE statements in sections 3.1 to 3.5 overwrite.
# The arrival timezone is joined on air.dest (the original joined on air.origin, which returned the origin's timezone).
sqlDF = spark.sql("""
SELECT YEAR, QUARTER, MONTH, DAY_OF_MONTH, DAY_OF_WEEK, FL_DATE,
  OP_UNIQUE_CARRIER, OP_CARRIER_AIRLINE_ID, OP_CARRIER, TAIL_NUM, OP_CARRIER_FL_NUM,
  ORIGIN_AIRPORT_ID, ORIGIN_AIRPORT_SEQ_ID, ORIGIN_CITY_MARKET_ID, ORIGIN, ORIGIN_CITY_NAME,
  ORIGIN_STATE_ABR, ORIGIN_STATE_FIPS, ORIGIN_STATE_NM, ORIGIN_WAC,
  DEST_AIRPORT_ID, DEST_AIRPORT_SEQ_ID, DEST_CITY_MARKET_ID, DEST, DEST_CITY_NAME,
  DEST_STATE_ABR, DEST_STATE_FIPS, DEST_STATE_NM, DEST_WAC,
  CRS_DEP_TIME, DEP_TIME, DEP_DELAY, DEP_DELAY_NEW, DEP_DEL15, DEP_DELAY_GROUP, DEP_TIME_BLK,
  TAXI_OUT, WHEELS_OFF, WHEELS_ON, TAXI_IN,
  CRS_ARR_TIME, ARR_TIME, ARR_DELAY, ARR_DELAY_NEW, ARR_DEL15, ARR_DELAY_GROUP, ARR_TIME_BLK,
  CANCELLED, CANCELLATION_CODE, DIVERTED, CRS_ELAPSED_TIME, ACTUAL_ELAPSED_TIME, AIR_TIME,
  FLIGHTS, air.DISTANCE, DISTANCE_GROUP,
  CARRIER_DELAY, WEATHER_DELAY, NAS_DELAY, SECURITY_DELAY, LATE_AIRCRAFT_DELAY,
  dep_station.station_id AS departure_weather_station,
  CURRENT_TIMESTAMP() AS utc_departure_time, -1 AS utc_departure_hour,
  CURRENT_TIMESTAMP() AS utc_departure_time_minus2, -1 AS utc_departure_minus2_hour,
  CURRENT_TIMESTAMP() AS utc_departure_time_minus3, -1 AS utc_departure_minus3_hour,
  CURRENT_TIMESTAMP() AS utc_actual_departure_time, -1 AS utc_actual_departure_hour,
  air_dep.Timezone AS departure_timezone,
  arrival_station.station_id AS arrival_weather_station,
  CURRENT_TIMESTAMP() AS utc_arrival_time, -1 AS utc_arrival_hour,
  CURRENT_TIMESTAMP() AS utc_actual_arrival_time, -1 AS utc_actual_arrival_hour,
  air_arrival.Timezone AS arrival_timezone,
  'key_join' AS key_join
FROM tbl_team05_airline_data air
INNER JOIN tbl_team05_airport_station dep_station ON air.origin = dep_station.airport_code
INNER JOIN tbl_team05_airport_codes air_dep ON air.origin = air_dep.IATA
INNER JOIN tbl_team05_airport_station arrival_station ON air.dest = arrival_station.airport_code
INNER JOIN tbl_team05_airport_codes air_arrival ON air.dest = air_arrival.IATA
""")


In [0]:
#sqlDF.display(5)

In [0]:
%sql
DROP TABLE IF EXISTS tbl_team05_expanded_airline_data;

In [0]:

sqlDF.write.format("delta").saveAsTable("tbl_team05_expanded_airline_data")

#### 3.1 Departure Time in UTC

We convert the scheduled departure time to UTC, accounting for Daylight Saving Time in the conversion.

In [0]:
%sql
--1. UTC departure time: hour
UPDATE tbl_team05_expanded_airline_data set utc_departure_time  = 
case when LENGTH(CRS_DEP_TIME) = 1 THEN cast(FL_Date as timestamp) 
when LENGTH(CRS_DEP_TIME) = 2 THEN cast(FL_Date as timestamp) 
when LENGTH(CRS_DEP_TIME) = 3 THEN cast(FL_Date as timestamp) +  make_interval(0, 0, 0, 0, cast(left(CRS_DEP_TIME, 1) as int),0,0) 
when LENGTH(CRS_DEP_TIME) = 4 THEN cast(FL_Date as timestamp) +  make_interval(0, 0, 0, 0, cast(left(CRS_DEP_TIME, 2) as int),0,0) 
ELSE cast(FL_Date as timestamp) END 


num_affected_rows
69502420


In [0]:
%sql
--1. UTC departure time: minutes
update tbl_team05_expanded_airline_data set utc_departure_time  = utc_departure_time + make_interval(0, 0, 0, 0, 0 , cast(right(CRS_DEP_TIME, 2) as int), 0)


num_affected_rows
69502420


In [0]:
%sql
--1.UTC departure time: time zone
update tbl_team05_expanded_airline_data set utc_departure_time  = utc_departure_time + make_interval(0, 0, 0, 0, -departure_timezone, 0, 0) 

num_affected_rows
69502420


In [0]:
%sql
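--1. UTC departure time: daylight saving time
-- Assuming departure_timezone is the standard-time UTC offset, local clocks run one hour
-- ahead during DST (e.g. EST -5 becomes EDT -4), so UTC is one hour earlier: subtract 1 hour.
update tbl_team05_expanded_airline_data set utc_departure_time   = utc_departure_time  + make_interval(0, 0, 0, 0, -1 , 0, 0)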
where (YEAR =2019  AND cast(FL_Date as timestamp) >= cast('2019-03-10' as timestamp)  AND cast(FL_Date as timestamp) <= cast('2019-11-03'as timestamp )) or
(YEAR =2018  AND cast(FL_Date as timestamp) >= cast('2018-03-11' as timestamp)  AND cast(FL_Date as timestamp) <= cast('2018-11-04' as timestamp )) or
(YEAR =2017  AND cast(FL_Date as timestamp) >= cast('2017-03-12' as timestamp)  AND cast(FL_Date as timestamp) <= cast('2017-11-05' as timestamp )) or
( YEAR =2016  AND cast(FL_Date as timestamp) >= cast('2016-03-13' as timestamp)  AND cast(FL_Date as timestamp) <= cast('2016-11-06' as timestamp )) or
 (YEAR =2015  AND cast(FL_Date as timestamp) >= cast('2015-03-08' as timestamp)  AND cast(FL_Date as timestamp) <= cast('2015-11-01' as timestamp )) 




num_affected_rows
46450410


In [0]:
%sql
-- Extract UTC departure HOUR
update tbl_team05_expanded_airline_data set  utc_departure_hour = hour(utc_departure_time)

num_affected_rows
69502420


In [0]:
%sql
-- UTC departure time - 2hours
update tbl_team05_expanded_airline_data set utc_departure_time_minus2 = utc_departure_time + make_interval(0, 0, 0, 0, -2 , 0, 0)

num_affected_rows
69502420


In [0]:
%sql
--Extract Hour from UTC departure time minus 2 hours
update tbl_team05_expanded_airline_data set utc_departure_minus2_hour = hour(utc_departure_time_minus2)

num_affected_rows
69502420


In [0]:
%sql
-- UTC departure time -3hours
update tbl_team05_expanded_airline_data set utc_departure_time_minus3 = utc_departure_time + make_interval(0, 0, 0, 0, -3 , 0, 0)

num_affected_rows
69502420


In [0]:
%sql
--Extract Hour from UTC departure time minus 3 hours
update tbl_team05_expanded_airline_data set  utc_departure_minus3_hour = hour(utc_departure_time_minus3)

num_affected_rows
69502420
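The UPDATE steps in this section can be summarized in a few lines of plain Python. This is a sketch under stated assumptions, not the notebook's code: `local_hhmm_to_utc` and `DST_WINDOWS` are hypothetical names, the timezone value is assumed to be the standard-time UTC offset in hours (for example -5 for US Eastern), and the DST windows are the same whole-day date ranges used in the WHERE clause above. Airports that do not observe DST would need separate handling.

```python
from datetime import date, datetime, timedelta

# DST date ranges per year, matching the WHERE clause (both end dates inclusive).
DST_WINDOWS = {
    2015: (date(2015, 3, 8), date(2015, 11, 1)),
    2016: (date(2016, 3, 13), date(2016, 11, 6)),
    2017: (date(2017, 3, 12), date(2017, 11, 5)),
    2018: (date(2018, 3, 11), date(2018, 11, 4)),
    2019: (date(2019, 3, 10), date(2019, 11, 3)),
}


def local_hhmm_to_utc(fl_date, hhmm, std_offset_hours):
    """Convert a local hhmm clock time on fl_date to a naive UTC datetime."""
    hours, minutes = divmod(int(hhmm), 100)
    local = datetime(fl_date.year, fl_date.month, fl_date.day) + timedelta(
        hours=hours, minutes=minutes
    )
    offset = std_offset_hours
    window = DST_WINDOWS.get(fl_date.year)
    if window and window[0] <= fl_date <= window[1]:
        offset += 1  # during DST the offset moves one hour closer to UTC (-5 becomes -4)
    return local - timedelta(hours=offset)


utc_departure = local_hhmm_to_utc(date(2019, 7, 1), 1000, -5)
print(utc_departure)  # 2019-07-01 14:00:00
print((utc_departure - timedelta(hours=2)).hour)  # 12
```

Because the hour and minute parts are added as a `timedelta`, a value such as 2400 rolls over to midnight of the next day instead of raising an error.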


#### 3.2 Actual Departure Time in UTC

In [0]:
%sql
-- UTC actual departure time: hour
UPDATE tbl_team05_expanded_airline_data set utc_actual_departure_time  = 
case when LENGTH(DEP_TIME) = 1 THEN cast(FL_Date as timestamp) 
when LENGTH(DEP_TIME) = 2 THEN cast(FL_Date as timestamp) 
when LENGTH(DEP_TIME) = 3 THEN cast(FL_Date as timestamp) +  make_interval(0, 0, 0, 0, cast(left(DEP_TIME, 1) as int),0,0) 
when LENGTH(DEP_TIME) = 4 THEN cast(FL_Date as timestamp) +  make_interval(0, 0, 0, 0, cast(left(DEP_TIME, 2) as int),0,0) 
ELSE cast(FL_Date as timestamp) END --never happen...length = 0 or >=5

num_affected_rows
69502420


In [0]:
%sql
---- UTC actual departure time add minutes
update tbl_team05_expanded_airline_data set utc_actual_departure_time  = utc_actual_departure_time + make_interval(0, 0, 0, 0, 0 , cast(right(DEP_TIME, 2) as int), 0)



num_affected_rows
69502420


In [0]:
%sql
---- UTC actual departure time handle time zone
update tbl_team05_expanded_airline_data set utc_actual_departure_time = utc_actual_departure_time + make_interval(0, 0, 0, 0, -departure_timezone, 0, 0) 

num_affected_rows
69502420


In [0]:
%sql
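-- UTC actual departure time: daylight saving time
-- Assuming departure_timezone is the standard-time UTC offset, local clocks run one hour
-- ahead during DST (e.g. EST -5 becomes EDT -4), so UTC is one hour earlier: subtract 1 hour.
update tbl_team05_expanded_airline_data set utc_actual_departure_time   = utc_actual_departure_time  + make_interval(0, 0, 0, 0, -1 , 0, 0)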
where (YEAR =2019  AND cast(FL_Date as timestamp) >= cast('2019-03-10' as timestamp)  AND cast(FL_Date as timestamp) <= cast('2019-11-03'as timestamp )) or
(YEAR =2018  AND cast(FL_Date as timestamp) >= cast('2018-03-11' as timestamp)  AND cast(FL_Date as timestamp) <= cast('2018-11-04' as timestamp )) or
(YEAR =2017  AND cast(FL_Date as timestamp) >= cast('2017-03-12' as timestamp)  AND cast(FL_Date as timestamp) <= cast('2017-11-05' as timestamp )) or
( YEAR =2016  AND cast(FL_Date as timestamp) >= cast('2016-03-13' as timestamp)  AND cast(FL_Date as timestamp) <= cast('2016-11-06' as timestamp )) or
 (YEAR =2015  AND cast(FL_Date as timestamp) >= cast('2015-03-08' as timestamp)  AND cast(FL_Date as timestamp) <= cast('2015-11-01' as timestamp )) 



num_affected_rows
46450410


In [0]:
%sql
-- Extract hour from UTC actual departure time
update tbl_team05_expanded_airline_data set utc_actual_departure_hour = hour(utc_actual_departure_time)

num_affected_rows
69502420


#### 3.3 Arrival Time in UTC

In [0]:
%sql

-- UTC arrival time: hour

UPDATE tbl_team05_expanded_airline_data set utc_arrival_time  = 
case when LENGTH(CRS_ARR_TIME) = 1 THEN cast(FL_Date as timestamp) 
when LENGTH(CRS_ARR_TIME) = 2 THEN cast(FL_Date as timestamp) 
when LENGTH(CRS_ARR_TIME) = 3 THEN cast(FL_Date as timestamp) +  make_interval(0, 0, 0, 0, cast(left(CRS_ARR_TIME, 1) as int),0,0) 
when LENGTH(CRS_ARR_TIME) = 4 THEN cast(FL_Date as timestamp) +  make_interval(0, 0, 0, 0, cast(left(CRS_ARR_TIME, 2) as int),0,0) 
ELSE cast(FL_Date as timestamp) END --never happen...length = 0 or >=5

num_affected_rows
69502420


In [0]:
%sql
--UTC Arrival Hour
--add minutes
update tbl_team05_expanded_airline_data set utc_arrival_time  = utc_arrival_time + make_interval(0, 0, 0, 0, 0 , cast(right(CRS_ARR_TIME, 2) as int), 0)



num_affected_rows
69502420


In [0]:
%sql
--UTC Arrival Hour
--handle time zone
update tbl_team05_expanded_airline_data set utc_arrival_time = utc_arrival_time + make_interval(0, 0, 0, 0, -arrival_timezone, 0, 0) 

num_affected_rows
69502420


In [0]:
%sql
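-- UTC arrival time: daylight saving time
-- Assuming arrival_timezone is the standard-time UTC offset, local clocks run one hour
-- ahead during DST (e.g. EST -5 becomes EDT -4), so UTC is one hour earlier: subtract 1 hour.
update tbl_team05_expanded_airline_data set utc_arrival_time   = utc_arrival_time  + make_interval(0, 0, 0, 0, -1 , 0, 0)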
where (YEAR =2019  AND cast(FL_Date as timestamp) >= cast('2019-03-10' as timestamp)  AND cast(FL_Date as timestamp) <= cast('2019-11-03'as timestamp )) or
(YEAR =2018  AND cast(FL_Date as timestamp) >= cast('2018-03-11' as timestamp)  AND cast(FL_Date as timestamp) <= cast('2018-11-04' as timestamp )) or
(YEAR =2017  AND cast(FL_Date as timestamp) >= cast('2017-03-12' as timestamp)  AND cast(FL_Date as timestamp) <= cast('2017-11-05' as timestamp )) or
( YEAR =2016  AND cast(FL_Date as timestamp) >= cast('2016-03-13' as timestamp)  AND cast(FL_Date as timestamp) <= cast('2016-11-06' as timestamp )) or
 (YEAR =2015  AND cast(FL_Date as timestamp) >= cast('2015-03-08' as timestamp)  AND cast(FL_Date as timestamp) <= cast('2015-11-01' as timestamp )) 


num_affected_rows
46450410


In [0]:
%sql
-- Extract hour from UTC arrival time
update tbl_team05_expanded_airline_data set utc_arrival_hour = hour(utc_arrival_time)

num_affected_rows
69502420


#### 3.4 Actual Arrival Time in UTC

In [0]:
%sql
--UTC actual Arrival Hour
UPDATE tbl_team05_expanded_airline_data set utc_actual_arrival_time  = 
case when LENGTH(ARR_TIME) = 1 THEN cast(FL_Date as timestamp) 
when LENGTH(ARR_TIME) = 2 THEN cast(FL_Date as timestamp) 
when LENGTH(ARR_TIME) = 3 THEN cast(FL_Date as timestamp) +  make_interval(0, 0, 0, 0, cast(left(ARR_TIME, 1) as int),0,0) 
when LENGTH(ARR_TIME) = 4 THEN cast(FL_Date as timestamp) +  make_interval(0, 0, 0, 0, cast(left(ARR_TIME, 2) as int),0,0) 
ELSE cast(FL_Date as timestamp) END --never happen...length = 0 or >=5

num_affected_rows
69502420


In [0]:
%sql
--UTC actual Arrival Hour add minutes
update tbl_team05_expanded_airline_data set utc_actual_arrival_time  = utc_actual_arrival_time + make_interval(0, 0, 0, 0, 0 , cast(right(ARR_TIME, 2) as int), 0)



num_affected_rows
69502420


In [0]:
%sql
--handle time zone
update tbl_team05_expanded_airline_data set utc_actual_arrival_time = utc_actual_arrival_time + make_interval(0, 0, 0, 0, -arrival_timezone, 0, 0) 

num_affected_rows
69502420


In [0]:
%sql
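-- UTC actual arrival time: daylight saving time
-- Assuming arrival_timezone is the standard-time UTC offset, local clocks run one hour
-- ahead during DST (e.g. EST -5 becomes EDT -4), so UTC is one hour earlier: subtract 1 hour.
update tbl_team05_expanded_airline_data set utc_actual_arrival_time   = utc_actual_arrival_time  + make_interval(0, 0, 0, 0, -1 , 0, 0)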
where (YEAR =2019  AND cast(FL_Date as timestamp) >= cast('2019-03-10' as timestamp)  AND cast(FL_Date as timestamp) <= cast('2019-11-03'as timestamp )) or
(YEAR =2018  AND cast(FL_Date as timestamp) >= cast('2018-03-11' as timestamp)  AND cast(FL_Date as timestamp) <= cast('2018-11-04' as timestamp )) or
(YEAR =2017  AND cast(FL_Date as timestamp) >= cast('2017-03-12' as timestamp)  AND cast(FL_Date as timestamp) <= cast('2017-11-05' as timestamp )) or
( YEAR =2016  AND cast(FL_Date as timestamp) >= cast('2016-03-13' as timestamp)  AND cast(FL_Date as timestamp) <= cast('2016-11-06' as timestamp )) or
 (YEAR =2015  AND cast(FL_Date as timestamp) >= cast('2015-03-08' as timestamp)  AND cast(FL_Date as timestamp) <= cast('2015-11-01' as timestamp )) 

num_affected_rows
46450410


In [0]:
%sql
--UTC actual Arrival Hour
update tbl_team05_expanded_airline_data set utc_actual_arrival_hour = hour(utc_actual_arrival_time)

num_affected_rows
69502420


#### 3.5 Create Key Join
This key is based on the UTC departure time minus 2 hours, since we are trying to predict the delay using the weather reported 2 hours before the scheduled departure.

In [0]:
%sql

update tbl_team05_expanded_airline_data set key_join = concat(
CAST( year(utc_departure_time_minus2) as string), '-',
CAST( month(utc_departure_time_minus2) as string), '-',
CAST( day( utc_departure_time_minus2) as string), '-',
CAST(departure_weather_station as string), '-', cast (hour( utc_departure_time_minus2) as string))

num_affected_rows
69502420
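As a quick illustration of the key format produced by the UPDATE above, the sketch below builds the same year-month-day-station-hour string in plain Python. The `flight_join_key` helper is hypothetical and the station id is a placeholder; like the SQL casts, the month, day, and hour are not zero-padded.

```python
from datetime import datetime, timedelta


def flight_join_key(utc_departure_time, station_id, hours_before=2):
    """Build 'year-month-day-station-hour' from the time hours_before departure."""
    t = utc_departure_time - timedelta(hours=hours_before)
    return f"{t.year}-{t.month}-{t.day}-{station_id}-{t.hour}"


# A 01:30 UTC departure on July 1 maps to hour 23 of June 30.
key = flight_join_key(datetime(2019, 7, 1, 1, 30), "72530094846")
print(key)  # 2019-6-30-72530094846-23
```

Subtracting the two hours before extracting the date parts matters: a departure shortly after midnight UTC has to be keyed to the previous day.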


In [0]:
%sql
select distinct origin, count(origin) from tbl_team05_expanded_airline_data where origin in ('IFP', 'EAR', 'XWA', 'TKI') group by origin

origin,count(origin)


### 4. New Weather Table  with parsed weather attributes
From the original weather data, we create a new table with the parsed weather fields that we use for quick model prototypes.

#### 4.1 New Expanded weather data for simple model

In [0]:
%sql
DROP TABLE IF EXISTS tbl_team05_expanded_weather_data;

In [0]:
%sql

CREATE TABLE tbl_team05_expanded_weather_data (
STATION string, 
DATE timestamp, 
NAME string,
REPORT_TYPE string,
CALL_SIGN string,
QUALITY_CONTROL string,
WND string,
Wind_Direction_Angle string,
Wind_Direction_Quality string,
Wind_Type_Code string,
Wind_Speed_Rate string,
Wind_speed_Quality string,
CIG string,
CIG_ceiling_height string, 
CIG_ceiling_quality string,
CIG_ceiling_deter string,
CIG_ceiling_visibility string,
VIS string,
VIS_distance string,
VIS_distance_quality string,
VIS_variability string,
VIS_variability_quality string,
TMP string,
TMP_air_temp string,
TMP_air_temp_quality string,
DEW string, 
DEW_temp string,
DEW_temp_quality string,
SLP string,
SLP_pressure string, 
SLP_pressure_quality string,
AA1 string,
RAIN_period_quantity string,
RAIN_depth string,
RAIN_condition string,
RAIN_quality_code string,
AJ1 string,
SNOW_depth_dimension string,
SNOW_condition string,
SNOW_quality_code string,
SNOW_eq_water_depth_dim string,
SNOW_eq_water_condition_code string,
SNOW_eq_water_condition_quality_code string,
MW1 string,
CURRENT_atmos_condition string,
CURRENT_atmos_condition_quality string,
MW2 string,
CURRENT_atmos_condition2 string,
CURRENT_atmos_condition_quality2 string,
key_join string,
date_only string,
hour string,
airport_code string
)



In [0]:
%sql
--MAIN TABLE : WEATHER
insert into tbl_team05_expanded_weather_data
select STATION , DATE ,NAME ,REPORT_TYPE ,CALL_SIGN, QUALITY_CONTROL,
WND, split(WND, ',')[0], split(WND, ',')[1], split(WND, ',')[2], split(WND, ',')[3], split(WND, ',')[4],
CIG, split(CIG, ',')[0], split(CIG, ',')[1], split(CIG, ',')[2], split(CIG, ',')[3],
VIS, split(VIS, ',')[0], split(VIS, ',')[1], split(VIS, ',')[2], split(VIS, ',')[3],
TMP, split(TMP, ',')[0], split(TMP, ',')[1],
DEW, split(DEW, ',')[0], split(DEW, ',')[1],
SLP, split(SLP, ',')[0], split(SLP, ',')[1],
AA1, split(AA1, ',')[0], split(AA1, ',')[1], split(AA1, ',')[2], split(AA1, ',')[3],
AJ1, split(AJ1, ',')[0], split(AJ1, ',')[1], split(AJ1, ',')[2], split(AJ1, ',')[3], split(AJ1, ',')[4], split(AJ1, ',')[5],
MW1, split(MW1, ',')[0], split(MW1, ',')[1],
MW2, split(MW2, ',')[0], split(MW2, ',')[1],
'key_join', date_format(DATE,"MM-dd-yyyy"), hour(DATE), right(call_sign,3)
from tbl_team05_weather_data



num_affected_rows,num_inserted_rows
630904436,630904436
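The INSERT above relies on splitting each comma-separated weather group into positional fields. The sketch below shows the same idea for the WND group in plain Python; `split_group` is a hypothetical helper and the sample string is a made-up reading. As with the SQL expression, a missing group or a missing position yields no value rather than an error.

```python
WND_FIELDS = (
    "Wind_Direction_Angle",
    "Wind_Direction_Quality",
    "Wind_Type_Code",
    "Wind_Speed_Rate",
    "Wind_speed_Quality",
)


def split_group(raw, field_names):
    """Split a comma-separated group into named fields; absent parts become None."""
    parts = raw.split(",") if raw is not None else []
    return {
        name: (parts[i] if i < len(parts) else None)
        for i, name in enumerate(field_names)
    }


parsed = split_group("260,1,N,0036,1", WND_FIELDS)
print(parsed["Wind_Direction_Angle"], parsed["Wind_Speed_Rate"])  # 260 0036
```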


#### 4.1.a Parse selective Weather attributes below:
1. Wind (WND) Parsing: WIND-OBSERVATION 
    a. Wind_Direction_Angle
    b. Wind_Direction_Quality
    c. Wind_Type_Code
    d. Wind_Speed_Rate
    e. Wind_speed_Quality
2. CIG Parsing : SKY-CONDITION-OBSERVATION ceiling height dimension (The height above ground level (AGL) of the lowest cloud )
    a. CIG_ceiling_height
    b. CIG_ceiling_quality
    c. CIG_ceiling_deter
    d. CIG_ceiling_visibility
3. Visibility (VIS) Parsing:
    a. VIS_distance:
    b. VIS_distance_quality:
    c. VIS_variability:
    d. VIS_variability_quality:
4. Temperature (TMP) Parsing:
    a. TMP_air_temp:
    b. TMP_air_temp_quality:
5. Dew (DEW) Parsing:
    a. DEW_temp:
    b. DEW_temp_quality:
6. Pressure (SLP) Parsing:
    a. SLP_pressure:
    b. SLP_pressure_quality:
7. Rain (AA1) Parsing: (ASSUMPTION IS THAT A BLANK READING INDICATES 0 RAIN DEPTH)
    a. RAIN_period_quantity:
    b. RAIN_depth:
    c. RAIN_condition:
    d. RAIN_quality_code:
8. Snow Depth (AJ1) Parsing: (ASSUMPTION IS THAT A BLANK READING INDICATES 0 SNOW DEPTH)
    a. SNOW_depth_dimension:
    b. SNOW_condition:
    c. SNOW_quality_code:
    d. SNOW_eq_water_depth_dim:
    e. SNOW_eq_water_condition_code:
    f. SNOW_eq_water_condition_quality_code:
9. FIRST Weather (MW1) Parsing: (The code that denotes a specific type of weather observed manually.)
    a. CURRENT_atmos_condition:
    b. CURRENT_atmos_condition_quality:
10. SECOND Weather (MW2) Parsing: (The code that denotes a specific type of weather observed manually.)
    a. CURRENT_atmos_condition2:
    b. CURRENT_atmos_condition_quality2:
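Several of the parsed fields use a numeric code for "missing" and store values with a scale factor. The sketch below shows one way to decode them in plain Python. The `SENTINELS` table and the `decode` helper are hypothetical, and the missing codes and scale factors listed are assumptions taken from the NOAA ISD format description; they should be checked against that document before being relied on.

```python
# Column -> (missing-value code, scale factor). Assumed values, to be verified.
SENTINELS = {
    "Wind_Direction_Angle": (999, 1),
    "Wind_Speed_Rate": (9999, 10),
    "CIG_ceiling_height": (99999, 1),
    "VIS_distance": (999999, 1),
    "TMP_air_temp": (9999, 10),
    "DEW_temp": (9999, 10),
    "SLP_pressure": (99999, 10),
}


def decode(column, raw):
    """Return the scaled numeric value, or None for blank or missing codes."""
    if raw is None or raw.strip() == "":
        return None
    missing_code, scale = SENTINELS[column]
    value = int(raw)  # int() accepts a leading sign such as '+0123'
    if value == missing_code:
        return None
    return value / scale


print(decode("TMP_air_temp", "+0123"))        # 12.3
print(decode("Wind_Direction_Angle", "999"))  # None
```

Mapping missing codes to None before modeling avoids treating values such as 999 degrees or 9999 tenths of a degree as real measurements.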

In [0]:
%sql
--Wind_Direction_Angle: 
-- TODO: decide how to handle the missing-value code 999 (replace with an average?)
select distinct(Wind_Direction_Angle), count (station) as total from  tbl_team05_expanded_weather_data group by Wind_Direction_Angle order by total,Wind_Direction_Angle

Wind_Direction_Angle,total
1,1
2,1
11,1
13,1
15,1
19,1
21,1
25,1
41,1
46,1


In [0]:
%sql
--Wind_Direction_Quality: CATEGORY 
select distinct(Wind_Direction_Quality), count (station) as total from  tbl_team05_expanded_weather_data group by Wind_Direction_Quality order by total,Wind_Direction_Quality

Wind_Direction_Quality,total
2,1
7,1
A,4
P,70
U,371
1,1634199
9,4456047
5,16173129


In [0]:
%sql
--Wind_Type_Code:
select distinct(Wind_Type_Code), count (station) as total from  tbl_team05_expanded_weather_data group by Wind_Type_Code order by total,Wind_Type_Code

Wind_Type_Code,total
9,651549
V,666588
C,3192146
N,17753539


In [0]:
%sql
--Wind_Speed_Rate:
select distinct(Wind_Speed_Rate), count (station) as total from  tbl_team05_expanded_weather_data group by Wind_Speed_Rate order by total,Wind_Speed_Rate

Wind_Speed_Rate,total
125,1
128,1
250,1
270,1
320,1
335,1
366,1
371,1
432,1
453,1


In [0]:
%sql
--Wind_speed_Quality:
select distinct(Wind_speed_Quality), count (station) as total from tbl_team05_expanded_weather_data group by Wind_speed_Quality order by total,Wind_speed_Quality


--good 

Wind_speed_Quality,total
I,55
7,64
2,307
U,414
6,2763
P,4287
A,4683
9,709351
1,1961447
5,19580451


In [0]:
%sql
-- CIG_ceiling_height:
select distinct(CIG_ceiling_height), count (station) as total from  tbl_team05_expanded_weather_data group by CIG_ceiling_height order by total,CIG_ceiling_height

--NO NULL
---99999 missing...

CIG_ceiling_height,total
1,1
5,1
22,1
2682,1
2804,1
2865,1
2987,1
3109,1
3139,1
3231,1


In [0]:
%sql
--CIG_ceiling_quality:

select distinct(CIG_ceiling_quality), count (station) as total from  tbl_team05_expanded_weather_data group by CIG_ceiling_quality order by total,CIG_ceiling_quality

CIG_ceiling_quality,total
6,9833
7,59770
1,581533
9,2118772
5,19493914


In [0]:
%sql
--CIG_ceiling_deter:

select distinct(CIG_ceiling_deter), count (station) as total from  tbl_team05_expanded_weather_data group by CIG_ceiling_deter order by total,CIG_ceiling_deter

CIG_ceiling_deter,total
C,4474
W,349620
M,9526570
9,12383158


In [0]:
%sql
--CIG_ceiling_visibility:

select distinct(CIG_ceiling_visibility), count (station) as total from  tbl_team05_expanded_weather_data group by CIG_ceiling_visibility order by total,CIG_ceiling_visibility

CIG_ceiling_visibility,total
9,645990
N,21617832


In [0]:
%sql
--VIS_distance:

select distinct(VIS_distance), count (station) as total from  tbl_team05_expanded_weather_data group by VIS_distance order by total,VIS_distance

VIS_distance,total
208,1
404,1
536,1
603,1
1104,1
2006,1
2300,1
4425,1
12874,1
43452,1


In [0]:
%sql
--VIS_distance_quality:

select distinct(VIS_distance_quality), count (station) as total from  tbl_team05_expanded_weather_data group by VIS_distance_quality order by total,VIS_distance_quality

VIS_distance_quality,total
I,1
P,10387
6,12304
7,17378
A,48561
9,671151
1,1960460
5,19543580


In [0]:
%sql
-- VIS_variability:

select distinct(VIS_variability), count (station) as total from  tbl_team05_expanded_weather_data group by VIS_variability order by total,VIS_variability

VIS_variability,total
V,92294
9,2607989
N,19563539


In [0]:
%sql

--VIS_variability_quality:

select distinct(VIS_variability_quality), count (station) as total from  tbl_team05_expanded_weather_data group by VIS_variability_quality order by total,VIS_variability_quality

VIS_variability_quality,total
A,63770
9,2607989
5,19592063


In [0]:
%sql
--TMP_air_temp:

select distinct(TMP_air_temp), count (station) as total from  tbl_team05_expanded_weather_data group by TMP_air_temp order by total,TMP_air_temp
--NO NULL

TMP_air_temp,total
386,1
397,1
402,1
412,1
413,1
418,1
423,1
431,1
432,1
442,1


In [0]:
%sql
--TMP_air_temp_quality:

select distinct(TMP_air_temp_quality), count (station) as total from  tbl_team05_expanded_weather_data group by TMP_air_temp_quality order by total,TMP_air_temp_quality
--NO NULL

TMP_air_temp_quality,total
I,3
P,249
2,1088
6,13019
7,24455
A,33083
9,683785
C,689194
1,2038918
5,18780028


In [0]:
%sql
-- DEW_temp:

select distinct(DEW_temp), count (station) as total from  tbl_team05_expanded_weather_data group by DEW_temp order by total,DEW_temp

DEW_temp,total
282,1
287,1
317,1
334,1
340,1
350,1
357,1
-401,1
-412,1
-418,1


In [0]:
%sql
--DEW_temp_quality:

select distinct(DEW_temp_quality), count (station) as total from tbl_team05_expanded_weather_data group by DEW_temp_quality order by total,DEW_temp_quality
--NO NULL

DEW_temp_quality,total
I,5
P,139
2,1036
6,4939
7,24447
A,30143
C,686779
9,717140
1,2032378
5,18766816


In [0]:
%sql
--SLP_pressure:

select distinct(SLP_pressure), count (station) as total from tbl_team05_expanded_weather_data group by SLP_pressure order by total,SLP_pressure
-- NO NULL


SLP_pressure,total
9591,1
9594,1
9596,1
10564,1
10570,1
10582,1
10584,1
10585,1
10589,1
10591,1


In [0]:
%sql
--SLP_pressure_quality:
 
select distinct(SLP_pressure_quality), count (station) as total from tbl_team05_expanded_weather_data group by SLP_pressure_quality order by total,SLP_pressure_quality

SLP_pressure_quality,total
I,6
P,6
2,824
6,49480
1,1783520
9,6956050
5,13473936


In [0]:
%sql
 --RAIN_period_quantity:

select distinct(RAIN_period_quantity), count (station) as total from  tbl_team05_expanded_weather_data group by RAIN_period_quantity order by total,RAIN_period_quantity
--there are a lot of  NULL =99

RAIN_period_quantity,total
12.0,1317
99.0,2621
0.0,5553
3.0,90162
6.0,298134
24.0,603970
,6670850
1.0,14591215


In [0]:
%sql
--RAIN_depth:

select distinct(RAIN_depth), count (station) as total from  tbl_team05_expanded_weather_data group by RAIN_depth order by total,RAIN_depth

--NULL = 0000

RAIN_depth,total
581.0,1
624.0,1
642.0,1
652.0,1
662.0,1
685.0,1
695.0,1
741.0,1
769.0,1
774.0,1


In [0]:
%sql
--RAIN_condition:
select distinct(RAIN_condition), count (station) as total from  tbl_team05_expanded_weather_data group by RAIN_condition  order by total,RAIN_condition

--  NULL


RAIN_condition,total
1.0,3202
3.0,1045321
2.0,1757662
,6670850
9.0,12786787


In [0]:
%sql
 --RAIN_quality_code:

select distinct(RAIN_quality_code), count (station) as total from tbl_team05_expanded_weather_data group by RAIN_quality_code  order by total,RAIN_quality_code

-- NULL

RAIN_quality_code,total
7,7
I,1165
U,1197
9,3223
2,4253
A,7512
P,18855
6,24938
1,1962309
,6670850


In [0]:
%sql
--SNOW_depth_dimension:

select distinct(SNOW_depth_dimension), count (station) as total from tbl_team05_expanded_weather_data group by SNOW_depth_dimension  order by total,SNOW_depth_dimension

--NULL


SNOW_depth_dimension,total
130.0,1
137.0,1
140.0,1
135.0,2
104.0,4
102.0,7
107.0,7
127.0,8
94.0,11
89.0,17


In [0]:
%sql
--SNOW_condition:

select distinct(SNOW_condition), count (station) as total from  tbl_team05_expanded_weather_data group by SNOW_condition  order by total,SNOW_condition

--NULL

SNOW_condition,total
1.0,56
3.0,50515
9.0,407370
,21805881


In [0]:
%sql
--SNOW_quality_code:
select distinct(SNOW_quality_code), count (station) as total from  tbl_team05_expanded_weather_data group by SNOW_quality_code  order by total,SNOW_quality_code
--NULL

SNOW_quality_code,total
9,117
P,2309
I,6674
1,71321
5,377520
,21805881


In [0]:
%sql
--SNOW_eq_water_depth_dim:
select distinct(SNOW_eq_water_depth_dim), count (station) as total from  tbl_team05_expanded_weather_data group by SNOW_eq_water_depth_dim  order by total,SNOW_eq_water_depth_dim
--NULL

SNOW_eq_water_depth_dim,total
13460.0,1
13720.0,1
13970.0,1
8100.0,3
8400.0,3
10410.0,3
7900.0,6
12700.0,6
7600.0,7
10160.0,7


In [0]:
%sql
--SNOW_eq_water_condition_code:

select distinct(SNOW_eq_water_condition_code), count (station) as total from  tbl_team05_expanded_weather_data group by SNOW_eq_water_condition_code  order by total,SNOW_eq_water_condition_code
--NULL

SNOW_eq_water_condition_code,total
9.0,457941
,21805881


In [0]:
%sql
--SNOW_eq_water_condition_quality_code:

select distinct(SNOW_eq_water_condition_quality_code), count (station) as total from  tbl_team05_expanded_weather_data group by SNOW_eq_water_condition_quality_code  order by total,SNOW_eq_water_condition_quality_code

SNOW_eq_water_condition_quality_code,total
9.0,457941
,21805881


In [0]:
%sql
--CURRENT_atmos_condition:
select distinct(CURRENT_atmos_condition), count (station) as total from  tbl_team05_expanded_weather_data group by CURRENT_atmos_condition  order by total,CURRENT_atmos_condition
-- NULL

CURRENT_atmos_condition,total
58.0,1
96.0,1
8.0,2
76.0,2
92.0,2
1.0,3
18.0,3
19.0,3
36.0,3
34.0,4


In [0]:
%sql
--CURRENT_atmos_condition_quality:
select distinct(CURRENT_atmos_condition_quality), count (station) as total from  tbl_team05_expanded_weather_data group by CURRENT_atmos_condition_quality  order by total,CURRENT_atmos_condition_quality
--  NULL

CURRENT_atmos_condition_quality,total
7.0,2341
6.0,11514
1.0,225178
5.0,1090332
,20934457


In [0]:
%sql
-- CURRENT_atmos_condition2:
select distinct(CURRENT_atmos_condition2), count (station) as total from  tbl_team05_expanded_weather_data group by CURRENT_atmos_condition2  order by total,CURRENT_atmos_condition2
-- NULL

CURRENT_atmos_condition2,total
18.0,1
31.0,1
57.0,1
96.0,1
99.0,1
82.0,2
92.0,5
93.0,5
40.0,6
77.0,8


In [0]:
%sql
--CURRENT_atmos_condition_quality2:
select distinct(CURRENT_atmos_condition_quality2), count (station) as total from  tbl_team05_expanded_weather_data group by CURRENT_atmos_condition_quality2  order by total,CURRENT_atmos_condition_quality2
--  NULL

CURRENT_atmos_condition_quality2,total
7.0,15
6.0,202
1.0,9110
5.0,13795
,22240700
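
The value counts above can be turned into a null share per column. The snippet below is a minimal sketch in plain Python rather than part of the pipeline. It copies the counts from the three complete result tables above, and it assumes that the row with an empty value is the NULL bucket (as the `--NULL` comments suggest) and that `station` is never NULL, so `count(station)` equals the row count. The names `value_counts` and `null_fraction` are illustrative only.

```python
# Counts copied from the query outputs above; None stands for the empty (NULL) bucket.
value_counts = {
    "SNOW_quality_code": {
        "9": 117, "P": 2309, "I": 6674, "1": 71321, "5": 377520, None: 21805881,
    },
    "CURRENT_atmos_condition_quality": {
        7.0: 2341, 6.0: 11514, 1.0: 225178, 5.0: 1090332, None: 20934457,
    },
    "CURRENT_atmos_condition_quality2": {
        7.0: 15, 6.0: 202, 1.0: 9110, 5.0: 13795, None: 22240700,
    },
}


def null_fraction(counts):
    """Share of rows that fall in the NULL bucket of one value-count table."""
    total = sum(counts.values())
    return counts.get(None, 0) / total


# Row total implied by each table, and the null share per column.
row_totals = {name: sum(counts.values()) for name, counts in value_counts.items()}
null_fractions = {name: null_fraction(counts) for name, counts in value_counts.items()}

for name, fraction in null_fractions.items():
    print(f"{name}: {fraction:.2%} NULL")
```

All three tables sum to the same row total, which is a quick consistency check on the counts. Every column here is NULL in well over 90% of rows.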


#### 4.1.b Create Join Key
The key is based on the date, station and hour of the weather report.

In [0]:
%sql
-- Update key_join (based on date, station and hour)
update tbl_team05_expanded_weather_data
set key_join = concat(
  cast(year(date) as string), '-',
  cast(month(date) as string), '-',
  cast(day(date) as string), '-',
  cast(station as string), '-',
  cast(hour(date) as string)
)
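
To make the key format concrete, here is a minimal plain-Python sketch of the same concatenation. It is an illustration under assumptions, not the pipeline code: the function name `make_key_join` and the sample station id are hypothetical.

```python
from datetime import datetime


def make_key_join(report_time: datetime, station: str) -> str:
    """Build a key shaped like the SQL concat: year-month-day-station-hour.

    The parts are not zero-padded, mirroring cast(month(date) as string) etc.
    """
    return (
        f"{report_time.year}-{report_time.month}-{report_time.day}"
        f"-{station}-{report_time.hour}"
    )


# Hypothetical station id and timestamp, for illustration only.
example_key = make_key_join(datetime(2015, 1, 5, 7, 56), "72295023174")
print(example_key)
```

Because the parts are not zero-padded, the key on the flight side has to be built in the same way (for example `1`, not `01`, for January), otherwise matching rows will not join.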


#### 4.2 Add airport code to the original weather table based on call_sign for Advanced Feature Engineering.

In [0]:
%sql
--DROP TABLE IF EXISTS tbl_team05_weather_data_with_airport;

In [0]:
%sql
--Backup solution for airport codes.
CREATE TABLE tbl_team05_weather_data_with_airport(
STATION string,
DATE timestamp,
SOURCE smallint,
LATITUDE double,
LONGITUDE double,
ELEVATION double,
NAME string,
REPORT_TYPE string,
CALL_SIGN string,
QUALITY_CONTROL string,
WND string,
CIG string,
VIS string,
TMP string,
DEW string,
SLP string,
AW1 string,
GA1 string,
GA2 string,
GA3 string,
GA4 string,
GE1 string,
GF1 string,
KA1 string,
KA2 string,
MA1 string,
MD1 string,
MW1 string,
MW2 string,
OC1 string,
OD1 string,
OD2 string,
REM string,
EQD string,
AW2 string,
AX4 string,
GD1 string,
AW5 string,
GN1 string,
AJ1 string,
AW3 string,
MK1 string,
KA4 string,
GG3 string,
AN1 string,
RH1 string,
AU5 string,
HL1 string,
OB1 string,
AT8 string,
AW7 string,
AZ1 string,
CH1 string,
RH3 string,
GK1 string,
IB1 string,
AX1 string,
CT1 string,
AK1 string,
CN2 string,
OE1 string,
MW5 string,
AO1 string,
KA3 string,
AA3 string,
CR1 string,
CF2 string,
KB2 string,
GM1 string,
AT5 string,
AY2 string,
MW6 string,
MG1 string,
AH6 string,
AU2 string,
GD2 string,
AW4 string,
MF1 string,
AA1 string,
AH2 string,
AH3 string,
OE3 string,
AT6 string,
AL2 string,
AL3 string,
AX5 string,
IB2 string,
AI3 string,
CV3 string,
WA1 string,
GH1 string,
KF1 string,
CU2 string,
CT3 string,
SA1 string,
AU1 string,
KD2 string,
AI5 string,
GO1 string,
GD3 string,
CG3 string,
AI1 string,
AL1 string,
AW6 string,
MW4 string,
AX6 string,
CV1 string,
ME1 string,
KC2 string,
CN1 string,
UA1 string,
GD5 string,
UG2 string,
AT3 string,
AT4 string,
GJ1 string,
MV1 string,
GA5 string,
CT2 string,
CG2 string,
ED1 string,
AE1 string,
CO1 string,
KE1 string,
KB1 string,
AI4 string,
MW3 string,
KG2 string,
AA2 string,
AX2 string,
AY1 string,
RH2 string,
OE2 string,
CU3 string,
MH1 string,
AM1 string,
AU4 string,
GA6 string,
KG1 string,
AU3 string,
AT7 string,
KD1 string,
GL1 string,
IA1 string,
GG2 string,
OD3 string,
UG1 string,
CB1 string,
AI6 string,
CI1 string,
CV2 string,
AZ2 string,
AD1 string,
AH1 string,
WD1 string,
AA4 string,
KC1 string,
IA2 string,
CF3 string,
AI2 string,
AT1 string,
GD4 string,
AX3 string,
AH4 string,
KB3 string,
CU1 string,
CN4 string,
AT2 string,
CG1 string,
CF1 string,
GG1 string,
MV2 string,
CW1 string,
GG4 string,
AB1 string,
AH5 string,
CN3 string,
airport_code string
)


In [0]:
%sql 
--- Backup solution
---insert into tbl_team05_weather_data_with_airport
select
STATION ,
DATE ,
SOURCE ,
LATITUDE ,
LONGITUDE ,
ELEVATION ,
NAME ,
REPORT_TYPE ,
CALL_SIGN ,
QUALITY_CONTROL ,
WND ,
CIG ,
VIS ,
TMP ,
DEW ,
SLP ,
AW1 ,
GA1 ,
GA2 ,
GA3 ,
GA4 ,
GE1 ,
GF1 ,
KA1 ,
KA2 ,
MA1 ,
MD1 ,
MW1 ,
MW2 ,
OC1 ,
OD1 ,
OD2 ,
REM ,
EQD ,
AW2 ,
AX4 ,
GD1 ,
AW5 ,
GN1 ,
AJ1 ,
AW3 ,
MK1 ,
KA4 ,
GG3 ,
AN1 ,
RH1 ,
AU5 ,
HL1 ,
OB1 ,
AT8 ,
AW7 ,
AZ1 ,
CH1 ,
RH3 ,
GK1 ,
IB1 ,
AX1 ,
CT1 ,
AK1 ,
CN2 ,
OE1 ,
MW5 ,
AO1 ,
KA3 ,
AA3 ,
CR1 ,
CF2 ,
KB2 ,
GM1 ,
AT5 ,
AY2 ,
MW6 ,
MG1 ,
AH6 ,
AU2 ,
GD2 ,
AW4 ,
MF1 ,
AA1 ,
AH2 ,
AH3 ,
OE3 ,
AT6 ,
AL2 ,
AL3 ,
AX5 ,
IB2 ,
AI3 ,
CV3 ,
WA1 ,
GH1 ,
KF1 ,
CU2 ,
CT3 ,
SA1 ,
AU1 ,
KD2 ,
AI5 ,
GO1 ,
GD3 ,
CG3 ,
AI1 ,
AL1 ,
AW6 ,
MW4 ,
AX6 ,
CV1 ,
ME1 ,
KC2 ,
CN1 ,
UA1 ,
GD5 ,
UG2 ,
AT3 ,
AT4 ,
GJ1 ,
MV1 ,
GA5 ,
CT2 ,
CG2 ,
ED1 ,
AE1 ,
CO1 ,
KE1 ,
KB1 ,
AI4 ,
MW3 ,
KG2 ,
AA2 ,
AX2 ,
AY1 ,
RH2 ,
OE2 ,
CU3 ,
MH1 ,
AM1 ,
AU4 ,
GA6 ,
KG1 ,
AU3 ,
AT7 ,
KD1 ,
GL1 ,
IA1 ,
GG2 ,
OD3 ,
UG1 ,
CB1 ,
AI6 ,
CI1 ,
CV2 ,
AZ2 ,
AD1 ,
AH1 ,
WD1 ,
AA4 ,
KC1 ,
IA2 ,
CF3 ,
AI2 ,
AT1 ,
GD4 ,
AX3 ,
AH4 ,
KB3 ,
CU1 ,
CN4 ,
AT2 ,
CG1 ,
CF1 ,
GG1 ,
MV2 ,
CW1 ,
GG4 ,
AB1 ,
AH5 ,
CN3 ,
A.airport_code 
from tbl_team05_weather_data W left join tbl_team05_airport_station A on W.station = A.station_id



In [0]:
%sql
ALTER TABLE tbl_team05_weather_data ADD COLUMNS ( airport_code string )

In [0]:
%sql
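-- Derive airport_code from call_sign: keep the last three characters when the trimmed value is longer than three, otherwise keep the trimmed value
update tbl_team05_weather_data set airport_code = 
case when LENGTH(trim(call_sign)) > 3 THEN right(trim(call_sign),3)
ELSE trim(call_sign) END 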

num_affected_rows
630904436
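
The rule in the update above can be sketched in plain Python as follows. This is an illustration only: the function name `airport_code_from_call_sign` is hypothetical, and the NULL handling assumes the usual SQL behavior, where a NULL `call_sign` fails the `LENGTH(...) > 3` test and the `ELSE` branch returns NULL.

```python
from typing import Optional


def airport_code_from_call_sign(call_sign: Optional[str]) -> Optional[str]:
    """Mirror the CASE expression: last three characters of the trimmed call sign."""
    if call_sign is None:
        # NULL in, NULL out, as in the SQL ELSE branch.
        return None
    trimmed = call_sign.strip(" ")  # SQL trim() removes spaces only
    if len(trimmed) > 3:
        return trimmed[-3:]
    return trimmed


samples = ["KLAX ", "SFO", None]
codes = [airport_code_from_call_sign(s) for s in samples]
print(codes)
```

Note that any trimmed value longer than three characters is truncated, whether or not the result is a real airport code, so the derived column is worth validating against a list of known airports before it is used in a join.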


In [0]:
%sql
--select distinct trim(call_sign), Right(trim(call_sign),3) from tbl_team05_weather_data where trim(call_sign) like '%SFO'
select count(airport_code) from tbl_team05_weather_data where  airport_code='LAX'


count(airport_code)
53058


### 5. Join Expanded WEATHER and FLIGHT tables 

We join the two expanded tables on key_join (date-station_id-hour) to build the final DataFrame for model training. However, given the limited time left on the Databricks cluster, the results from training on this table were not good enough to present here. We therefore focus on the Advanced Feature Engineering work.

In [0]:


# Join flights to weather on key_join and keep only rows with a valid delay label
DF = sqlContext.sql("""
    SELECT YEAR, QUARTER, MONTH, DAY_OF_MONTH, DAY_OF_WEEK, FL_DATE, OP_CARRIER, TAIL_NUM,
           ORIGIN, DEST, DEP_TIME_BLK, CRS_ELAPSED_TIME, DISTANCE,
           Wind_Direction_Angle, Wind_Direction_Quality, Wind_Type_Code, Wind_Speed_Rate, Wind_speed_Quality,
           CIG_ceiling_height, CIG_ceiling_quality, CIG_ceiling_deter, CIG_ceiling_visibility,
           VIS_distance, VIS_distance_quality, VIS_variability, VIS_variability_quality,
           TMP_air_temp, TMP_air_temp_quality, DEW_temp, DEW_temp_quality,
           SLP_pressure, SLP_pressure_quality,
           DEP_DEL15
    FROM tbl_final_airline_data_team05
    INNER JOIN tbl_final_weather_team05
        ON tbl_final_airline_data_team05.key_join = tbl_final_weather_team05.key_join
    WHERE (DEP_DEL15 == 1 OR DEP_DEL15 == 0)
""")
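
The semantics of the inner join on `key_join` can be illustrated with a small plain-Python sketch. It is not the Spark implementation, and the function name `inner_join_on_key` and the toy rows are hypothetical. It shows two properties worth checking on the real tables: flights with no matching weather key are dropped, and a flight is repeated once per weather report that shares its key.

```python
from collections import defaultdict


def inner_join_on_key(flights, weather):
    """Inner-join two lists of dict rows on 'key_join', keeping DEP_DEL15 in {0, 1}."""
    weather_by_key = defaultdict(list)
    for report in weather:
        weather_by_key[report["key_join"]].append(report)

    joined = []
    for flight in flights:
        if flight["DEP_DEL15"] not in (0, 1):
            continue  # mirrors the WHERE clause; missing labels are dropped
        for report in weather_by_key.get(flight["key_join"], []):
            joined.append({**report, **flight})
    return joined


# Toy rows with made-up keys and values.
flights = [
    {"key_join": "k1", "ORIGIN": "LAX", "DEP_DEL15": 1},
    {"key_join": "k2", "ORIGIN": "SFO", "DEP_DEL15": 0},     # no weather row
    {"key_join": "k1", "ORIGIN": "LAX", "DEP_DEL15": None},  # no label
]
weather = [
    {"key_join": "k1", "TMP_air_temp": 150},
    {"key_join": "k1", "TMP_air_temp": 155},  # second report in the same hour
    {"key_join": "k3", "TMP_air_temp": 90},
]

rows = inner_join_on_key(flights, weather)
print(len(rows))
```

If a station reports more than once per hour, the joined table will contain duplicate flights, so comparing `DF.count()` with the flight table's row count is a useful sanity check.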


In [0]:
DF.count()