# Data Sourcing

- Phase 3 requirement: pull the data for recent years and provide a clean Flights and/or Weather dataset for 2020-2023.
- Save each dataset as a Parquet file with the same structure as the raw datasets.
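
One way to check the "same structure" requirement is to compare the column names of the raw dataset with those of the new one before saving. The sketch below is a minimal, hypothetical helper (`compare_columns` is not part of the original notebook). It works on plain lists of column names, such as those returned by a Spark DataFrame's `columns` attribute, and the sample names are illustrative only.

```python
def compare_columns(raw_columns, new_columns):
    """Return (missing, extra) column names of the new dataset (hypothetical helper)."""
    raw_set, new_set = set(raw_columns), set(new_columns)
    missing = [c for c in raw_columns if c not in new_set]  # in raw, absent from new
    extra = [c for c in new_columns if c not in raw_set]    # in new, absent from raw
    return missing, extra


# Illustrative column names only; in the notebook these would come from
# something like df_weather.columns and weather_20_df.columns.
raw_cols = ["STATION", "DATE", "HourlyDryBulbTemperature"]
new_cols = ["STATION", "DATE", "source_file"]
missing, extra = compare_columns(raw_cols, new_cols)
print("missing:", missing)
print("extra:", extra)
```

Both lists being empty would suggest the new dataset has the same columns as the raw one. Column types would still need a separate check.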

**TO DO:**
- Weather data is available at https://www.ncei.noaa.gov/data/local-climatological-data/archive/ up to 2024. 
- Flights data is available at: https://www.transtats.bts.gov/homepage.asp 

- Data dictionary for flights: https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ
- Data dictionary for weather: https://www.ncei.noaa.gov/pub/data/cdo/documentation/LCD_documentation.pdf

**Flights data**
- The flight dataset is a subset of the passenger flight on-time performance data from the TranStats data collection (https://www.transtats.bts.gov/Fields.asp?gnoyr_VQ=FGJ), published by the U.S. Department of Transportation (DOT). It contains flight information from 2015 to 2021.

**Weather data**
- The weather dataset was downloaded from the National Oceanic and Atmospheric Administration (NOAA) repository and contains weather information from 2015 to 2021.

# Setup cluster


In [0]:
blob_container = "261storagecontainer"  
storage_account = "261storage" 
secret_scope = "261_team_6_1_spring24_scope"  
secret_key = "team_6_1_key"  
team_blob_url = f"wasbs://{blob_container}@{storage_account}.blob.core.windows.net" 


# blob storage is mounted here.
mids261_mount_path = "/mnt/mids-w261"

# SAS Token: Grant the team limited access to Azure Storage resources
spark.conf.set(
    f"fs.azure.sas.{blob_container}.{storage_account}.blob.core.windows.net",
    dbutils.secrets.get(scope=secret_scope, key=secret_key),
)

# see what's in the blob storage root folder
# display(dbutils.fs.ls(f"{team_blob_url}"))

# mount
data_BASE_DIR = "dbfs:/mnt/mids-w261/"
# display(dbutils.fs.ls(f"{data_BASE_DIR}"))

# uploading files to the DBFS via File -> Upload data to DBFS
my_file_system = "dbfs:/FileStore/shared_uploads/julianagc@berkeley.edu/"


# Import libraries

In [0]:
#standard
import pandas as pd
import matplotlib.pyplot as plt
import pyspark.sql.functions as F #cleaning: split, col, when, lit, concat_ws,regexp_replace, regexp_extract
import seaborn as sns

#imputing
from pyspark.ml.feature import Imputer

#normalization and feature extraction
from pyspark.ml.feature import StandardScaler, VectorAssembler, PCA

# to download files from url
import requests
import tarfile
import os
from io import BytesIO


# Load raw data (weather and flights)


In [0]:
# display(dbutils.fs.ls(f"{data_BASE_DIR}"))

In [0]:
# Weather data
df_weather = spark.read.parquet(f"{data_BASE_DIR}datasets_final_project_2022/parquet_weather_data/")
# display(df_weather)

In [0]:
# Flights data
# df_flights = spark.read.parquet(f"dbfs:/mnt/mids-w261/datasets_final_project_2022/parquet_airlines_data/")
# display(df_flights)

# Weather

#### 2020

In [0]:
# One archive per year, 2020-2023
weather_url = [
    "https://www.ncei.noaa.gov/data/local-climatological-data/archive/2020.tar.gz",
    "https://www.ncei.noaa.gov/data/local-climatological-data/archive/2021.tar.gz",
    "https://www.ncei.noaa.gov/data/local-climatological-data/archive/2022.tar.gz",
    "https://www.ncei.noaa.gov/data/local-climatological-data/archive/2023.tar.gz",
]
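
The yearly archive URLs share a single pattern, so the list can be generated instead of typed by hand, which avoids skipped or repeated years. A minimal sketch, assuming the `<year>.tar.gz` naming holds for every year of interest; `BASE_URL` and `build_weather_urls` are hypothetical names, not part of the original notebook.

```python
# Archive location taken from the notebook; the <year>.tar.gz pattern is assumed
# to hold for every requested year.
BASE_URL = "https://www.ncei.noaa.gov/data/local-climatological-data/archive"


def build_weather_urls(first_year, last_year):
    """Return one archive URL per year, inclusive of both endpoints."""
    return [f"{BASE_URL}/{year}.tar.gz" for year in range(first_year, last_year + 1)]


weather_urls = build_weather_urls(2020, 2023)
for u in weather_urls:
    print(u)
```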


In [0]:
# Create a folder for downloaded files
dbutils.fs.mkdirs(f"{team_blob_url}/weather_data_2020")  

In [0]:
# see team blob contents
display(dbutils.fs.ls(f"{team_blob_url}/weather_data_2020"))

In [0]:


# 2020 folder
weather_20_dir = f"{team_blob_url}/weather_data_2020"

# Download the tar.gz archive (the whole file is held in memory)
url = "https://www.ncei.noaa.gov/data/local-climatological-data/archive/2020.tar.gz"
response = requests.get(url)
response.raise_for_status()  # stop here if the download failed

# Open the archive, then loop through its files and save each to blob storage
with tarfile.open(fileobj=BytesIO(response.content), mode="r:gz") as tar_file:
    for member in tar_file.getmembers():
        if member.isfile():
            file_name = member.name.split("/")[-1]
            file_bytes = tar_file.extractfile(member).read()
            file_contents = file_bytes.decode()  # Convert bytes to string

            # Write the file to blob storage
            dbutils.fs.put(os.path.join(weather_20_dir, file_name), file_contents, overwrite=True)
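
The extraction loop can be isolated into a small function and exercised on a tiny in-memory archive before running it against the full NOAA download. This is only a sketch: `iter_archive_files` and `make_sample_archive` are hypothetical names, and the sample file names and contents are made up.

```python
import io
import tarfile


def iter_archive_files(fileobj):
    """Yield (file_name, text) for every regular file in a tar.gz archive."""
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive:
            if member.isfile():
                name = member.name.split("/")[-1]  # drop any directory prefix
                text = archive.extractfile(member).read().decode()
                yield name, text


def make_sample_archive():
    """Build a tiny in-memory tar.gz with made-up contents for demonstration."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, body in [("2020/a.csv", "x,y\n1,2\n"), ("2020/b.csv", "x,y\n3,4\n")]:
            data = body.encode()
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)  # rewind so the archive can be read from the start
    return buffer


extracted = dict(iter_archive_files(make_sample_archive()))
print(sorted(extracted))
```

In the notebook, each yielded pair could be passed to `dbutils.fs.put` in place of the sample dictionary.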

#### Save all files as a single Parquet dataset per year


In [0]:
# Set the path to the CSV files
csv_path = f"{team_blob_url}/weather_data_2020/*.csv"

# Read the CSV files into a DataFrame
weather_20_df = spark.read.csv(csv_path, header=True, inferSchema=True)

# Add a column with the name of the file each row came from
weather_20_df = weather_20_df.withColumn("source_file", F.input_file_name())

# Write the DataFrame in Parquet format
weather_20_df.write.mode("overwrite").parquet(f"{team_blob_url}/weather_data_2020/weather_20_parquet")

In [0]:
# Load the checkpointed 2020 weather data
weather_20_df = spark.read.parquet(f"{team_blob_url}/weather_data_2020/weather_20_parquet")

In [0]:
weather_20_df.display()

## Number of files per year
|Year|Number of files (stations)|
|---|---|
|2020|13562|
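
The table counts one CSV file per station. A count of this kind could be derived from a directory listing, as in the minimal sketch below. `count_csv_files` is a hypothetical helper, and the sample paths are made up for illustration.

```python
def count_csv_files(paths):
    """Count distinct CSV file names in a list of paths (hypothetical helper)."""
    names = {p.rstrip("/").split("/")[-1] for p in paths}
    return sum(1 for n in names if n.lower().endswith(".csv"))


# Made-up paths for illustration; in the notebook they could come from
# [f.path for f in dbutils.fs.ls(weather_20_dir)].
sample_paths = [
    "wasbs://container@account.blob.core.windows.net/weather_data_2020/01001099999.csv",
    "wasbs://container@account.blob.core.windows.net/weather_data_2020/01001499999.csv",
    "wasbs://container@account.blob.core.windows.net/weather_data_2020/weather_20_parquet/",
]
print(count_csv_files(sample_paths))
```

Non-CSV entries, such as the Parquet output folder, are ignored by the count.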



