# Data Ingestion 

This notebook aggregates the near-real-time **1-hour averaged data** from
[solarsoft](https://sohoftp.nascom.nasa.gov/sdb/goes/ace/monthly/). Real-time
solar wind data is captured from the `MAG`, `SWEPAM`, `EPAM`, and `SIS` instruments. ACE spacecraft location data is pulled as well. Note that `SWICS` instrument data is not available on this server, so a different approach is needed to obtain it for modeling.

## Imports and Configuration

In [None]:
# standard library
from datetime import date
import logging
import warnings

# third party
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
import requests
from tqdm import tqdm

# silence library warnings to keep the notebook output readable
warnings.filterwarnings("ignore")

In [None]:
# Set up logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)

## Prepare for Data Ingestion
One-hour aggregate ACE instrument measurements are publicly accessible on NASA servers and are organized by year and month by [The Solar and Heliospheric Observatory (SOHO) Project Scientist Team](https://sohoftp.nascom.nasa.gov/). We will scrape the text files and store the data in pandas DataFrames.

### Get URLs for all data files
We grab every link on https://sohoftp.nascom.nasa.gov/sdb/goes/ace/monthly/ whose first character is a digit, because the ACE hourly aggregates are named `YYYYMM_ace_<instrument>_1h.txt`, where `<instrument>` is one of `epam`, `sis`, `mag`, `loc`, or `swepam`.

In [None]:
ROOT_URL = "https://sohoftp.nascom.nasa.gov/sdb/goes/ace/monthly/"

file_list = []  # stays empty if the request fails
try:
    # the timeout keeps the cell from hanging if the server is unresponsive
    r = requests.get(ROOT_URL, timeout=30)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")

    # data files are the links whose href starts with a digit (YYYYMM...)
    file_list = [
        ROOT_URL + link["href"]
        for link in soup.find_all("a", href=True)
        if link["href"][:1].isdigit()
    ]

except requests.exceptions.RequestException as e:
    logging.error("Could not list %s: %s", ROOT_URL, e)

# display completion
print(f"{len(file_list)} files found")
if file_list:
    sorted_files = sorted(file_list)
    print(
        "Data from",
        sorted_files[0].split("/")[-1][:4],
        "to",
        sorted_files[-1].split("/")[-1][:4],
    )
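
The naming convention described above also makes it possible to group the links by instrument before downloading anything. The following is a minimal sketch, assuming every data file follows `YYYYMM_ace_<instrument>_1h.txt`; the helper `parse_ace_filename` and the example URLs are illustrative and not part of the original notebook.

```python
import re
from collections import defaultdict

# Pattern for the documented naming scheme: YYYYMM_ace_<instrument>_1h.txt
ACE_FILE_RE = re.compile(r"^(\d{4})(\d{2})_ace_(epam|sis|mag|loc|swepam)_1h\.txt$")


def parse_ace_filename(url):
    """Return (year, month, instrument) for an ACE hourly file URL, or None."""
    match = ACE_FILE_RE.match(url.split("/")[-1])
    if match is None:
        return None
    year, month, instrument = match.groups()
    return int(year), int(month), instrument


# Hypothetical URLs used only to demonstrate the helper
example_urls = [
    "https://sohoftp.nascom.nasa.gov/sdb/goes/ace/monthly/201501_ace_mag_1h.txt",
    "https://sohoftp.nascom.nasa.gov/sdb/goes/ace/monthly/201501_ace_swepam_1h.txt",
    "https://sohoftp.nascom.nasa.gov/sdb/goes/ace/monthly/201502_ace_epam_1h.txt",
]

# Group URLs by instrument name
files_by_instrument = defaultdict(list)
for url in example_urls:
    parsed = parse_ace_filename(url)
    if parsed is not None:
        files_by_instrument[parsed[2]].append(url)

print({name: len(urls) for name, urls in files_by_instrument.items()})
```

Matching the whole file name, rather than testing for a substring, avoids confusing `epam` with `swepam`.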

## Read and Process Data Files
Now that we have the URLs for the hourly aggregates of the ACE instruments and satellite location from 2000 to 2024, we can read this data into one pandas DataFrame per instrument.

In [None]:
# Creating Dataframe for each sensor
epam_df = pd.DataFrame(
    columns=[
        "Year",
        "Month",
        "Day",
        "HHMM",
        "Julian_Day",
        "Seconds_OTD",
        "Status_E",
        "E_38-53",
        "E_175-315",
        "Status_P",
        "P_47-65",
        "P_112-187",
        "P_310-580",
        "P_761-1220",
        "P_1060-1910",
        "Anis_Index",
    ]
)
loc_df = pd.DataFrame(
    columns=["Year", "Month", "Day", "HHMM", "Julian_Day", "Seconds_OTD", "X", "Y", "Z"]
)
mag_df = pd.DataFrame(
    columns=[
        "Year",
        "Month",
        "Day",
        "HHMM",
        "Julian_Day",
        "Seconds_OTD",
        "Status_Mag",
        "Bx",
        "By",
        "Bz",
        "Bt",
        "Lat",
        "Long",
    ]
)
sis_df = pd.DataFrame(
    columns=[
        "Year",
        "Month",
        "Day",
        "HHMM",
        "Julian_Day",
        "Seconds_OTD",
        "Status_PF_Low",
        ">10_MeV",
        "Status_PF_High",
        ">30_MeV",
    ]
)
swepam_df = pd.DataFrame(
    columns=[
        "Year",
        "Month",
        "Day",
        "HHMM",
        "Julian_Day",
        "Seconds_OTD",
        "Status_SW",
        "Proton_Density",
        "Bulk_Speed",
        "Ion_Temp",
    ]
)

In [None]:
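# Map a file-name token to each instrument's DataFrame and column names.
# The leading underscore on "_epam" keeps it from matching "swepam" files.
dataframes = {
    "_epam": {"df": epam_df, "columns": epam_df.columns},
    "loc": {"df": loc_df, "columns": loc_df.columns},
    "mag": {"df": mag_df, "columns": mag_df.columns},
    "sis": {"df": sis_df, "columns": sis_df.columns},
    "_swepam": {"df": swepam_df, "columns": swepam_df.columns},
}

# Read each monthly file and append it to the matching instrument's DataFrame
for link in tqdm(file_list, desc="Importing files"):
    data = pd.read_csv(link, comment="#", sep=r"\s+", header=None, skiprows=2)
    file_name = link.split("/")[-1]  # match on the file name, not the full URL
    for key, value in dataframes.items():
        if key in file_name:
            data.columns = value["columns"]
            value["df"] = pd.concat([value["df"], data], ignore_index=True)
            break

# pd.concat returns new objects, so rebind the original names to the results
epam_df = dataframes["_epam"]["df"]
loc_df = dataframes["loc"]["df"]
mag_df = dataframes["mag"]["df"]
sis_df = dataframes["sis"]["df"]
swepam_df = dataframes["_swepam"]["df"]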
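
To see what the `read_csv` arguments above do, the same call can be run on a small in-memory sample. This is a minimal sketch, assuming the files open with two metadata lines followed by `#`-prefixed header lines, which is what `skiprows=2` and `comment="#"` imply. The sample text and all of its values are made up for illustration, and `SAMPLE`, `MAG_COLUMNS`, and `sample_df` are names introduced here.

```python
import io

import pandas as pd

# Synthetic sample: two ":" metadata lines, "#" header lines, then
# whitespace-delimited rows. The measurements are invented for illustration.
SAMPLE = """\
:Data_list: 201501_ace_mag_1h.txt
:Created: 2015 Feb 01 0011 UT
# Synthetic header line; the values below are not real measurements
# YR MO DA  HHMM    Day      Day    S     Bx      By      Bz      Bt     Lat.   Long.
2015 01 01  0000   57023       0    0    1.5    -2.0     0.5     2.5    11.5   306.9
2015 01 01  0100   57023    3600    0    1.0    -1.0    -0.5     1.5   -19.5   315.0
"""

# Same column names the notebook assigns to the MAG DataFrame
MAG_COLUMNS = [
    "Year", "Month", "Day", "HHMM", "Julian_Day", "Seconds_OTD",
    "Status_Mag", "Bx", "By", "Bz", "Bt", "Lat", "Long",
]

sample_df = pd.read_csv(
    io.StringIO(SAMPLE),
    comment="#",   # drop the "#" header lines
    sep=r"\s+",    # columns are separated by runs of whitespace
    header=None,   # there is no machine-readable header row
    skiprows=2,    # skip the two leading ":" metadata lines
)
sample_df.columns = MAG_COLUMNS
print(sample_df[["Year", "Month", "Day", "HHMM", "Bz"]])
```

Note that `HHMM` is parsed as an integer, so a time written as `0100` is stored as `100`.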

## Write local copies for future use

In [None]:
# Save each instrument's concatenated DataFrame as a CSV file.
# The data lives in the `dataframes` dict built during import.
for key, value in dataframes.items():
    name = key.lstrip("_")  # "_epam" -> "epam", "_swepam" -> "swepam"
    file_path = f"../data/ace/raw/{date.today()}_ace_master_{name}_1hr.csv"
    value["df"].to_csv(file_path, index=False)
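
Because each file name starts with `date.today()`, which formats as an ISO `YYYY-MM-DD` string, a later notebook can pick the most recent local copy by sorting file names. The following is a minimal sketch under that assumption; `latest_master_csv` is a hypothetical helper and not part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def latest_master_csv(directory, name):
    """Return the newest `<date>_ace_master_<name>_1hr.csv` in `directory`, or None."""
    directory = Path(directory)
    if not directory.is_dir():
        return None
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings
    matches = sorted(directory.glob(f"*_ace_master_{name}_1hr.csv"))
    return matches[-1] if matches else None


path = latest_master_csv("../data/ace/raw", "mag")
if path is not None:
    mag_local = pd.read_csv(path)
    print(f"Loaded {len(mag_local)} rows from {path.name}")
else:
    print("No local copy found; run the ingestion cells first")
```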