# Solar Flare prediction dataset pipeline

This is a replication of the dataset used by [Liu et al.](https://web.njit.edu/~wangj/LSTMpredict/).

The data comes from 2 sources:

1. Flare data from the GOES flare catalog at NOAA, which can be accessed with the sunpy.instr.goes.get_goes_event_list() function.
   This tells us if an active region produced a flare or not.
2. Active region data from the Solar Dynamics Observatory's Helioseismic and Magnetic Imager instrument, which can be accessed from the JSOC database via a JSON API.
   This gives us the features characterizing each active region.

We assign each active region (AR) to one of two classes:

1. The positive class contains flaring active regions that will produce
a flare >M5.0 in the next 24 hours.
2. The negative class contains flaring active regions that will **not**
produce a flare >M5.0 in the next 24 hours.
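
As a minimal sketch of this labeling rule, the helper below converts a GOES class string such as `M5.1` into a peak flux and compares it with the M5.0 threshold. The names `goes_class_to_flux` and `is_positive` are hypothetical and not part of the original notebook. The sketch also treats M5.0 itself as positive (an inclusive threshold), which is an assumption; switch `>=` to `>` for a strict reading.

```python
# Hypothetical helpers, not part of the original notebook.
# Peak soft X-ray flux (W/m^2) at the base of each GOES class letter
GOES_BASE_FLUX = {'A': 1e-8, 'B': 1e-7, 'C': 1e-6, 'M': 1e-5, 'X': 1e-4}


def goes_class_to_flux(goes_class):
    """Convert a GOES class string such as 'M5.1' to peak flux in W/m^2."""
    letter, multiplier = goes_class[0].upper(), float(goes_class[1:])
    return GOES_BASE_FLUX[letter] * multiplier


def is_positive(goes_class, threshold='M5.0'):
    """Return True if the flare is at or above the threshold class (assumed inclusive)."""
    return goes_class_to_flux(goes_class) >= goes_class_to_flux(threshold)
```

Comparing fluxes rather than strings keeps the ordering correct across class letters, so an X1.2 flare ranks above an M9.0 flare.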

First, some imports.


In [1]:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import pandas as pd
import scipy.stats
import requests
import urllib
import json
from datetime import datetime as dt_obj
from datetime import timedelta
from sklearn import svm
from sklearn.model_selection import StratifiedKFold
from sunpy.time import TimeRange
from sunpy.net import hek
import sunpy.instr.goes
import lime
import lime.lime_tabular
import os
pd.set_option('display.max_rows', 500)
%matplotlib inline
%config InlineBackend.figure_format = 'retina'


# Step 1: Get flare list

We get the entire GOES flare catalog at NOAA.



In [2]:
# Grab all the data from the GOES database
t_start = "2010-05-01"
t_end = "2018-05-30"
time_range = TimeRange(t_start, t_end)
csv_path = "../Data/GOES/all_flares_list.csv"
if os.path.exists(csv_path):
    # Reuse the cached catalog; drop the index column written by to_csv
    listofresults = pd.read_csv(csv_path).drop(columns="Unnamed: 0")
else:
    # get_goes_event_list returns a list of dicts, so convert it before filtering
    listofresults = pd.DataFrame(
        sunpy.instr.goes.get_goes_event_list(time_range, 'B1'))
    # Remove all events without a NOAA active region number
    listofresults = listofresults[listofresults['noaa_active_region'] != 0]
    # Cache the catalog as CSV for later runs
    listofresults.to_csv(csv_path)

print('Grabbed all the GOES data; there are', len(listofresults), 'events.')


Grabbed all the GOES data; there are 11986 events.


Convert the ```times``` in the ```listofresults``` dataframe from a string
into a datetime object:

In [3]:
def parse_tai_string(tstr):
    year = int(tstr[:4])
    month = int(tstr[5:7])
    day = int(tstr[8:10])
    hour = int(tstr[11:13])
    minute = int(tstr[14:16])
    return dt_obj(year, month, day, hour, minute)

listofresults['start_time'] = listofresults['start_time'].apply(parse_tai_string)
listofresults['peak_time'] = listofresults['peak_time'].apply(parse_tai_string)
listofresults['end_time'] = listofresults['end_time'].apply(parse_tai_string)
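
For comparison, the same conversion could be done for a whole column at once with `pd.to_datetime`. This is only a sketch: the sample strings and their "YYYY-MM-DD HH:MM:SS" layout are assumptions about how the cached CSV stores times, and the names `times` and `parsed` are illustrative.

```python
import pandas as pd

# Illustrative timestamps in the assumed "YYYY-MM-DD HH:MM:SS" layout
times = pd.Series(["2010-05-01 01:34:00", "2011-02-15 01:56:30"])

# Parse the whole column at once, then drop the seconds as parse_tai_string does
parsed = pd.to_datetime(times).dt.floor("min")
```

Unlike the slicing approach, this raises an error on malformed strings instead of silently reading the wrong characters.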

Now let's query the JSOC database to see if there are active region parameters at the time of the flare.
First read the following file to map NOAA active region numbers to HARPNUMs (a HARP, or an HMI Active Region Patch, is the preferred numbering system for the HMI active regions as they appear in the magnetic field data before NOAA observes them in white light):

In [4]:
HARP_NOAA_list = pd.read_csv(
    'http://jsoc.stanford.edu/doc/data/hmi/harpnum_to_noaa/all_harps_with_noaa_ars.txt', sep=' ')
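
The sketch below shows one way to look up the HARPNUM for a NOAA number by exact match. It uses a toy table in place of `HARP_NOAA_list` and assumes that a `NOAA_ARS` entry can hold several comma-separated region numbers; `harp_noaa` and `harpnums_for_noaa` are hypothetical names, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for HARP_NOAA_list; the comma-separated NOAA_ARS layout is an assumption
harp_noaa = pd.DataFrame({
    "HARPNUM": [1, 8, 12],
    "NOAA_ARS": ["11064", "11067,11068", "1106"],
})


def harpnums_for_noaa(table, noaa_number):
    """Return HARPNUMs whose NOAA_ARS entry lists exactly this NOAA number."""
    target = str(int(noaa_number))
    # Split each entry into individual region numbers so '1106' cannot match '11064'
    matches = table["NOAA_ARS"].astype(str).str.split(",").apply(
        lambda numbers: target in numbers)
    return table.loc[matches, "HARPNUM"].tolist()
```

A plain substring test would also return HARPNUM 1 for NOAA 1106 in this toy table, which is why the entries are split first.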

Now, let's determine at which time we'd like to predict flares. In general,
many people try to predict a flare either 24 or 48 hours before it happens.
We can report both in this study by setting a variable called ```timedelayvariable```:

In [5]:
timedelayvariable = 24
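
One possible use of this variable is to turn a flare time into the time at which the active region is queried. The sketch below subtracts the delay from a peak time and formats the result in the JSOC time format ('%Y.%m.%d_%H:%M_TAI') that the JSOC query function's docstring describes. The helper `to_jsoc_time` is hypothetical, and measuring the delay from the peak time (rather than the start time) is an assumption.

```python
from datetime import datetime, timedelta

timedelayvariable = 24  # hours before the flare, as set in the notebook


def to_jsoc_time(peak_time, delay_hours=timedelayvariable):
    """Format the time `delay_hours` before `peak_time` as a JSOC time string."""
    query_time = peak_time - timedelta(hours=delay_hours)
    return query_time.strftime('%Y.%m.%d_%H:%M_TAI')


# Illustrative peak time only
example = to_jsoc_time(datetime(2011, 2, 15, 1, 56))
```

Applying such a helper to every event would yield a list of strings in the shape the notebook expects for `t_rec`.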

Now we want a list of all the flares, each with its corresponding label,
flare class, timestep, NOAA number, and HARP number.


In [12]:
# first flare start and last flare end for each AR
grouped = listofresults.groupby('noaa_active_region')
first = grouped['start_time'].first()
last = grouped['end_time'].last()

# sample at 1 hour cadence between the first start and last end time of each AR,
# then stack the per-AR ranges into one dataframe
flare_list = pd.concat(
    [pd.DataFrame({'timestamp': pd.date_range(first.loc[ar], last.loc[ar], freq='1h'),
                   'NOAA': ar})
     for ar in first.index],
    ignore_index=True)

Now we have the full time range for each active region. We need to pair each
timestep and AR with its accompanying flare.


In [7]:
# make dataframe of per active region
# flare_times = pd.DataFrame()
# for i in range (listofresults.shape[0]):
#     t_range = pd.date_range(listofresults['start_time'][i],
#                          listofresults['peak_time'][i], freq='1h').to_frame()
#     flare_class = listofresults['goes_class'][i]
#     for j in range(len(t_range)):
#         timestep = t_range.iloc[j]
#         flare_times = pd.concat([flare_times, pd.DataFrame([flare_class,
#                                                             timestep[0]
#                                                            ]).T ])
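
One way this assignment could work is sketched below: a timestep is marked positive when a large flare from the same active region peaks within the following 24 hours. The toy dataframes stand in for `flare_list` and `listofresults` with illustrative values, and `label_timesteps` and `is_big` are hypothetical names. The threshold rule (class X, or class M with multiplier of at least 5.0) is an assumption.

```python
import pandas as pd

# Toy stand-ins for flare_list and listofresults (values are illustrative only)
flare_list = pd.DataFrame({
    "timestamp": pd.to_datetime(["2011-02-14 00:00", "2011-02-14 12:00",
                                 "2011-02-16 00:00"]),
    "NOAA": [11158, 11158, 11158],
})
flares = pd.DataFrame({
    "noaa_active_region": [11158, 11158],
    "peak_time": pd.to_datetime(["2011-02-13 17:38", "2011-02-15 01:56"]),
    "goes_class": ["M6.6", "X2.2"],
})


def is_big(goes_class):
    """Assumed threshold rule: class X, or class M with multiplier >= 5.0."""
    return goes_class[0] == "X" or (goes_class[0] == "M" and float(goes_class[1:]) >= 5.0)


def label_timesteps(steps, flares, big_flare, window_hours=24):
    """Mark a timestep positive if a large flare from the same AR peaks within the window."""
    big = flares[flares["goes_class"].apply(big_flare)]
    window = pd.Timedelta(hours=window_hours)
    labels = []
    for _, row in steps.iterrows():
        same_ar = big[big["noaa_active_region"] == row["NOAA"]]
        # Time from this timestep to each large flare of the same region
        delay = same_ar["peak_time"] - row["timestamp"]
        labels.append(bool(((delay > pd.Timedelta(0)) & (delay <= window)).any()))
    return steps.assign(label=labels)


labelled = label_timesteps(flare_list, flares, is_big)
```

The row-by-row loop is easy to follow but slow for a full catalog; a merge on the NOAA number would be a natural replacement once the rule is settled.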


# Step 2: Get SHARP data

Now we can grab the SDO data from the JSOC database by executing the JSON queries.
We select data that satisfy several criteria. The data have to be:

1. disambiguated with a version of the disambiguation module greater than 1.1,
2. taken while the orbital velocity of the spacecraft is less than 3500 m/s,
3. of a high quality, and
4. within 70 degrees of central meridian.

If the data pass all these tests, they are stuffed into one of two lists:
one for the positive class (called pos_flare_data)
and one for the negative class (called neg_flare_data).
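
The criteria above can be sketched as two small helpers: one builds the `jsoc_info` URL with the version, velocity, and quality filters, and one applies the central-meridian check. The filter strings and the `CRVAL1 - CRLN_OBS` offset are taken from the notebook's own query function; the helper names `build_sharp_query` and `within_central_meridian` and the sample HARPNUM and time in the usage are hypothetical.

```python
JSOC_INFO_URL = "http://jsoc.stanford.edu/cgi-bin/ajax/jsoc_info"


def build_sharp_query(harpnum, t_rec, keywords):
    """Build a jsoc_info URL applying the version, velocity and quality filters."""
    filters = "(CODEVER7 !~ '1.1 ') and (abs(OBS_VR)< 3500) and (QUALITY<65536)"
    record_set = f"hmi.sharp_720s[{harpnum}][{t_rec}][? {filters} ?]"
    return f"{JSOC_INFO_URL}?ds={record_set}&op=rs_list&key={','.join(keywords)}"


def within_central_meridian(crval1, crln_obs, limit_deg=70.0):
    """Apply the central-meridian criterion using the CRVAL1 - CRLN_OBS offset."""
    return abs(crval1 - crln_obs) <= limit_deg


# Illustrative values only
url = build_sharp_query(377, "2011.02.14_01:56_TAI", ["USFLUX", "MEANGBT"])
```

Keeping the URL construction separate from the request makes it possible to inspect a query before sending it.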

In [24]:
def get_the_jsoc_data(event_count):
    """
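    Parameters
    ----------
    event_count: number of events
                 int

    Notes
    -----
    t_rec is not an argument: like HARP_NOAA_list, listofactiveregions and
    listofgoesclasses, it is read from the notebook's global scope and must
    be defined before the call. It is a list of times, one per event, as
    strings in JSOC format ('%Y.%m.%d_%H:%M_TAI').
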
    """

    catalog_data = []
    classification = []

    for i in range(event_count):

        print("=====", i, "=====")
        # next match NOAA_ARS to HARPNUM
        idx = HARP_NOAA_list[HARP_NOAA_list['NOAA_ARS'].str.contains(
            str(int(listofactiveregions[i])))]

        # if there's no HARPNUM, skip this event
        if idx.empty:
            print('skip: there are no matching HARPNUMs for',
                  str(int(listofactiveregions[i])))
            continue

        # construct jsoc_info queries and query jsoc database; we are querying for 18 keywords
        url = "http://jsoc.stanford.edu/cgi-bin/ajax/jsoc_info?ds=hmi.sharp_720s["+str(
            idx.HARPNUM.values[0])+"]["+t_rec[i]+"][? (CODEVER7 !~ '1.1 ') and (abs(OBS_VR)< 3500) and (QUALITY<65536) ?]&op=rs_list&key=USFLUX,MEANGBT,MEANJZH,MEANPOT,SHRGT45,TOTUSJH,MEANGBH,MEANALP,MEANGAM,MEANGBZ,MEANJZD,TOTUSJZ,SAVNCPP,TOTPOT,MEANSHR,AREA_ACR,R_VALUE,ABSNJZH"
        response = requests.get(url)

        # if there's no response at this time, quit
        if response.status_code != 200:
            print('skip: cannot successfully get an http response')
            continue

        # read the JSON output
        data = response.json()

        # if there are no data at this time, quit
        if data['count'] == 0:
            print('skip: there are no data for HARPNUM',
                  idx.HARPNUM.values[0], 'at time', t_rec[i])
            continue

        # check to see if the active region is too close to the limb
        # we can compute the longitude of an active region in stonyhurst coordinates as follows:
        # longitude_stonyhurst = CRVAL1 - CRLN_OBS
        # (the variable names and messages below say "latitude", but the quantity is a longitude)
        # for this we have to query the CEA series (but above we queried the other series as the CEA series does not have CODEVER7 in it)

        url = "http://jsoc.stanford.edu/cgi-bin/ajax/jsoc_info?ds=hmi.sharp_cea_720s["+str(
            idx.HARPNUM.values[0])+"]["+t_rec[i]+"][? (abs(OBS_VR)< 3500) and (QUALITY<65536) ?]&op=rs_list&key=CRVAL1,CRLN_OBS"
        response = requests.get(url)

        # if there's no response at this time, quit
        if response.status_code != 200:
            print('skip: failed to find CEA JSOC data for HARPNUM',
                  idx.HARPNUM.values[0], 'at time', t_rec[i])
            continue

        # read the JSON output
        latitude_information = response.json()

        # if there are no data at this time, quit
        if latitude_information['count'] == 0:
            print('skip: there are no data for HARPNUM',
                  idx.HARPNUM.values[0], 'at time', t_rec[i])
            continue

        CRVAL1 = float(latitude_information['keywords'][0]['values'][0])
        CRLN_OBS = float(latitude_information['keywords'][1]['values'][0])
        if (np.absolute(CRVAL1 - CRLN_OBS) > 70.0):
            print('skip: latitude is out of range for HARPNUM',
                  idx.HARPNUM.values[0], 'at time', t_rec[i])
            continue

        if ('MISSING' in str(data['keywords'])):
            print('skip: there are some missing keywords for HARPNUM',
                  idx.HARPNUM.values[0], 'at time', t_rec[i])
            continue

        print('accept NOAA Active Region number', str(int(
            listofactiveregions[i])), 'and HARPNUM', idx.HARPNUM.values[0], 'at time', t_rec[i])

        individual_flare_data = []
        for j in range(18):
            individual_flare_data.append(
                float(data['keywords'][j]['values'][0]))

        catalog_data.append(list(individual_flare_data))

        single_class_instance = [idx.HARPNUM.values[0], str(
            int(listofactiveregions[i])), listofgoesclasses[i], t_rec[i]]
        classification.append(single_class_instance)

    return catalog_data, classification

positive_result = get_the_jsoc_data(len(listofactiveregions))

KeyboardInterrupt: 

Now we prepare the data to be fed into the function.


In [26]:
listofactiveregions = list(listofresults['noaa_active_region'].unique())
listofgoesclasses = list(listofresults['goes_class'].values.flatten())


Call the function

In [10]:
# jsoc testing



In [19]:
positive_result = get_the_jsoc_data(len(listofactiveregions))

## Label Positive class

## Label Negative class

# Step 3: Generate history data

# Step 4: Calculate Decay values

# Step 5: Assign positive and negative classes?
