# Solar Flare prediction dataset pipeline

This is a replication of the dataset used by [Liu et al.](https://web.njit.edu/~wangj/LSTMpredict/).

The data comes from 2 sources:

1. Flare data from the GOES flare catalog at NOAA, which can be accessed with the sunpy.instr.goes.get_goes_event_list() function.
This tells us if an active region produced a flare or not.
2. Active region data from the Solar Dynamics Observatory's Helioseismic and Magnetic Imager instrument, which can be accessed from the JSOC database via a JSON API.
This gives us the features characterizing each active region.

We ascribe each Active Region (AR) to one of two classes:

1. The positive class contains flaring active regions that will produce
a flare of class M5.0 or above in the next 24 hours.
2. The negative class contains flaring active regions that will **not**
produce a flare of class M5.0 or above in the next 24 hours.
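
The M5.0 threshold can be made explicit by converting a GOES class string into a peak soft X-ray flux. The following is a minimal sketch, assuming the standard GOES scale in which the letters A, B, C, M and X stand for decades from 1e-8 to 1e-4 W/m². The helper names ```goes_class_to_flux``` and ```is_positive_class``` are illustrative and are not used by the pipeline below, which matches class prefixes instead.

```python
# Decade of peak soft X-ray flux (W/m^2) for each GOES class letter.
GOES_DECADES = {'A': 1e-8, 'B': 1e-7, 'C': 1e-6, 'M': 1e-5, 'X': 1e-4}


def goes_class_to_flux(goes_class):
    """Convert a GOES class string such as 'M5.4' to a flux in W/m^2."""
    letter = goes_class[0].upper()
    multiplier = float(goes_class[1:])
    return GOES_DECADES[letter] * multiplier


def is_positive_class(goes_class, threshold='M5.0'):
    """Return True if the flare is at least as strong as the threshold class."""
    # The small relative tolerance guards against round-off at the boundary.
    limit = goes_class_to_flux(threshold) * (1 - 1e-9)
    return goes_class_to_flux(goes_class) >= limit


for example in ['C9.9', 'M4.9', 'M5.0', 'X1.0']:
    print(example, is_positive_class(example))
```

Comparing fluxes rather than string prefixes makes it easy to move the threshold to any other class without editing a list of prefixes.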

First, some imports.


In [1]:
import numpy as np
import matplotlib.pylab as plt
import matplotlib.mlab as mlab
import pandas as pd
import scipy.stats
import requests
import urllib
import json
from datetime import datetime as dt_obj
from datetime import timedelta
from sklearn import svm
from sklearn.model_selection import StratifiedKFold
from sunpy.time import TimeRange
from sunpy.net import hek
from astropy.time import Time
import sunpy.instr.goes
import lime
import lime.lime_tabular
import os
import drms
pd.set_option('display.max_rows', 500)
%matplotlib inline
%config InlineBackend.figure_format = 'retina'


# Step 1: Get flare list

We get the entire GOES flare catalog at NOAA.



In [2]:
# Grab all the data from the GOES database
t_start = "2010-05-01"
t_end = "2018-05-30"
time_range = TimeRange(t_start, t_end)
goes_csv = '../Data/GOES/all_flares_list.csv'
if os.path.exists(goes_csv):
    listofresults = pd.read_csv(goes_csv).drop(columns="Unnamed: 0")
else:
    # get_goes_event_list returns a list of dicts, so build the dataframe
    # first; a plain list cannot be filtered with a boolean mask
    listofresults = pd.DataFrame(
        sunpy.instr.goes.get_goes_event_list(time_range, 'B1'))
    # Remove all events without NOAA number
    listofresults = listofresults[listofresults['noaa_active_region'] != 0]
    # save to csv
    listofresults.to_csv(goes_csv)

print('Grabbed all the GOES data; there are', len(listofresults), 'events.')


Grabbed all the GOES data; there are 11986 events.


Convert the time columns (```start_time```, ```peak_time``` and ```end_time```) of the
```listofresults``` dataframe from strings into datetime objects:

In [3]:
def parse_tai_string(tstr):
    year = int(tstr[:4])
    month = int(tstr[5:7])
    day = int(tstr[8:10])
    hour = int(tstr[11:13])
    minute = int(tstr[14:16])
    return dt_obj(year, month, day, hour, minute)

listofresults['start_time'] = listofresults['start_time'].apply(parse_tai_string)
listofresults['peak_time'] = listofresults['peak_time'].apply(parse_tai_string)
listofresults['end_time'] = listofresults['end_time'].apply(parse_tai_string)
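
Because ```parse_tai_string``` slices fixed character positions, it accepts any string that starts with a year, month, day, hour and minute in that order, whatever the separators are. The sketch below shows the same idea with a regular expression. The name ```parse_time_fields``` and the two sample strings are illustrative only. Like the function above, it drops the seconds and ignores the offset between TAI and UTC, which is a few tens of seconds.

```python
import re
from datetime import datetime


def parse_time_fields(tstr):
    """Build a datetime from the first five integer fields of a time string."""
    fields = re.findall(r'\d+', tstr)[:5]
    year, month, day, hour, minute = (int(field) for field in fields)
    return datetime(year, month, day, hour, minute)


# A JSOC-style TAI string and an ISO-like string give the same kind of object.
print(parse_time_fields('2014.01.01_00:12:00_TAI'))
print(parse_time_fields('2011-02-15 01:44:00'))
```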

Now let's query the JSOC database to see if there are active region parameters at the time of the flare.
First read the following file to map NOAA active region numbers to HARPNUMs (a HARP, or an HMI Active Region Patch, is the preferred numbering system for the HMI active regions as they appear in the magnetic field data before NOAA observes them in white light):

In [4]:
HARP_NOAA_list = pd.read_csv(
    'http://jsoc.stanford.edu/doc/data/hmi/harpnum_to_noaa/all_harps_with_noaa_ars.txt', sep=' ')
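
A HARP can cover more than one NOAA active region, so a lookup in this table has to search inside the ```NOAA_ARS``` column. The sketch below does this with an exact match on each entry. It assumes that multiple NOAA numbers are separated by commas; the three sample rows are made up for illustration, and ```harpnums_for_noaa``` is a hypothetical helper rather than part of the pipeline.

```python
import io

import pandas as pd

# Made-up excerpt in the same two-column layout as the mapping file.
sample = io.StringIO("HARPNUM NOAA_ARS\n1 11067\n2 11064\n8 11069,11070\n")
harp_noaa = pd.read_csv(sample, sep=' ', dtype={'NOAA_ARS': str})


def harpnums_for_noaa(table, noaa_number):
    """Return every HARPNUM whose NOAA_ARS entry lists exactly noaa_number."""
    target = str(int(noaa_number))
    # Split "11069,11070" into ["11069", "11070"] and test for membership.
    matches = table['NOAA_ARS'].str.split(',').apply(lambda ars: target in ars)
    return table.loc[matches, 'HARPNUM'].tolist()


print(harpnums_for_noaa(harp_noaa, 11070))
print(harpnums_for_noaa(harp_noaa, 1106))
```

An exact match avoids the partial hits that a substring search such as ```str.contains``` can return, for example when one number is a prefix of another.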

Now, let's determine at which time we'd like to predict flares. In general,
many people try to predict a flare either 24 or 48 hours before it happens.
We can report both in this study by setting a variable called ```timedelayvariable```:

In [5]:
timedelayvariable = 24

Now, we want the list of all the flares, each with its corresponding label,
flare_class, timestep, NOAA and HARP number.


In [6]:
# first flare start and last flare end of each AR
first = listofresults.groupby('noaa_active_region').nth(0)['start_time']
last = listofresults.groupby('noaa_active_region').nth(-1)['end_time']

# sample at 1 hour cadence between start and end time of AR
t_range_per_AR = [[pd.date_range(first.iloc[i], last.iloc[i],
    freq='1h'), first.index[i]] for i in range(first.shape[0])]

# make one (timestamp, NOAA) dataframe per AR and stack them
flare_list = pd.concat(
    [pd.DataFrame({'timestamp': times, 'NOAA': noaa})
     for times, noaa in t_range_per_AR],
    ignore_index=True)

Now we have the full time range of each AR. Next, we need to assign to each
time and AR its accompanying flare.


In [7]:
# make dataframe of per active region
# flare_times = pd.DataFrame()
# for i in range (listofresults.shape[0]):
#     t_range = pd.date_range(listofresults['start_time'][i],
#                          listofresults['peak_time'][i], freq='1h').to_frame()
#     flare_class = listofresults['goes_class'][i]
#     for j in range(len(t_range)):
#         timestep = t_range.iloc[j]
#         flare_times = pd.concat([flare_times, pd.DataFrame([flare_class,
#                                                             timestep[0]
#                                                            ]).T ])


# Step 2: Get SHARP data

Now we can grab the SDO data from the JSOC database by querying it with the ```drms``` client.
We select data that satisfy several criteria. The data have to be:

1. disambiguated with a version of the disambiguation module greater than 1.1,
2. taken while the orbital velocity of the spacecraft is less than 3500 m/s,
3. of a high quality, and
4. within 70 degrees of central meridian.

Records that pass all these tests are appended to a single dataframe (```data_jsoc```)
and to a CSV file; the class labels are assigned in a later step.
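
The first three criteria are expressed inside the record-set string that is sent to JSOC, while the longitude cut is applied afterwards in pandas. The sketch below only assembles and prints such a string, without contacting the server. The helper ```build_recordset``` and the HARP number 377 are illustrative assumptions; the series name, the time format and the conditions are copied from ```get_the_jsoc_data``` in this notebook.

```python
def build_recordset(series, harpnum, start, end, conditions, cadence='60m'):
    """Assemble series[HARPNUM][start-end@cadence][? conditions ?]."""
    return '%s[%d][%s-%s@%s][? %s ?]' % (
        series, harpnum, start, end, cadence, conditions)


conditions = '(CODEVER7 !~ "1.1") and (abs(OBS_VR)< 3500) and (QUALITY<65536)'
query = build_recordset('hmi.sharp_cea_720s', 377,
                        '2010.05.01_00:00_TAI', '2018.05.30_00:00_TAI',
                        conditions)
print(query)
```

Printing the string before running the full loop is a cheap way to check the time range and the filters for a single region.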

Now we prepare the data to be fed into the function:


In [8]:
minimum_class_label = ['M5', 'M6', 'M7', 'M8', 'M9', 'X']
listofactiveregions = list(listofresults['noaa_active_region'].unique())
listofgoesclasses = list(listofresults['goes_class'].values.flatten())

In [9]:
def get_the_jsoc_data(event_count):
    """
    Query JSOC for the SHARP and Lorentz-force keywords of each active
    region in ``listofactiveregions``.

    Parameters
    ----------
    event_count: number of entries of ``listofactiveregions`` to process
                 int

    Returns
    -------
    data_jsoc:   accepted records of all active regions, concatenated
                 pandas.DataFrame

    """
    start_date = drms.to_datetime(t_start).strftime('%Y.%m.%d_%H:%M_TAI')
    end_date = drms.to_datetime(t_end).strftime('%Y.%m.%d_%H:%M_TAI')
    series_sharp = 'hmi.sharp_cea_720s'
    series_lorentz = 'cgem.lorentz'
    ids = ['T_REC','NOAA_AR', 'HARPNUM', 'CRVAL1','CRVAL2', 'CRLN_OBS',
           'CRLT_OBS', 'LAT_FWT', 'LON_FWT']
    sharps = ['USFLUX', 'MEANGBT',
              'MEANJZH', 'MEANPOT', 'SHRGT45',
              'TOTUSJH', 'MEANGBH','MEANALP','MEANGAM','MEANGBZ','MEANJZD',
              'TOTUSJZ','SAVNCPP', 'TOTPOT','MEANSHR','AREA_ACR','R_VALUE',
              'ABSNJZH']
    lorentzs = ['TOTFX','TOTFY','TOTFZ','EPSX','EPSY','EPSZ']
    conditions = '(CODEVER7 !~ "1.1") and (abs(OBS_VR)< 3500) and (QUALITY<65536)'
    conditions_lor = '(abs(OBS_VR)< 3500) and (QUALITY<65536)'
    c = drms.Client()
    data_jsoc = pd.DataFrame()

    for i in range(event_count):

        print("=====", i, "=====")
        # next match NOAA_ARS to HARPNUM
        idx = HARP_NOAA_list[HARP_NOAA_list['NOAA_ARS'].str.contains(
            str(int(listofactiveregions[i])))]

        # if there's no HARPNUM, skip this active region
        if idx.empty:
            print('skip: there are no matching HARPNUMs for',
                  str(int(listofactiveregions[i])))
            continue

        harpnum = idx.HARPNUM.values[0]
        # query jsoc database for sharp data
        data_sharp = c.query('%s[%d][%s-%s@60m][? %s ?]' % (series_sharp,
                                                           harpnum,
                                                        start_date,
                                                        end_date,
                                                   conditions),
                       key=ids+sharps)

        # if there are no data at this time, skip this active region
        if len(data_sharp) == 0:
            print('skip: there are no data for HARPNUM',
                  harpnum)
            continue

        # query jsoc database for lorentz data
        data_lorentz = c.query('%s[%d][%s-%s@60m][? %s ?]' % (series_lorentz,harpnum,
                                                        start_date,
                                                        end_date,
                                                   conditions_lor),
                       key=lorentzs)

        # if there are no Lorentz data at this time, skip this active region
        if len(data_lorentz) == 0:
            print('skip: there are no Lorentz data for HARPNUM',
                  harpnum)
            continue

        #concat the tables
        data = pd.concat([data_sharp, data_lorentz], axis=1)

        # check to see if the active region is too close to the limb:
        # keep only the records whose flux-weighted centre (LON_FWT) lies
        # within 70 degrees of central meridian
        # (an alternative is the Stonyhurst longitude CRVAL1 - CRLN_OBS)
        data = data[np.abs(data['LON_FWT']) < 70.0].copy()

        # convert tai string to date time
        data['T_REC'] = data['T_REC'].apply(parse_tai_string)

        print('accept NOAA Active Region number', str(int(
            listofactiveregions[i])), 'and HARPNUM', harpnum)

        # Append to larger dataset
        data_jsoc = pd.concat([data_jsoc, data], ignore_index=True)
        # append to csv
        outfile = '../Data/SHARP/jsoc_data.csv'
        data.to_csv(outfile, mode='a', header=not os.path.exists(outfile),
                    index=False)

    return data_jsoc

Call the function, or load the cached result if it already exists:

In [10]:
if os.path.exists('../Data/SHARP/jsoc_data.csv'):
    data_jsoc = pd.read_csv('../Data/SHARP/jsoc_data.csv')
else:
    data_jsoc = get_the_jsoc_data(len(listofactiveregions))

## Match data with flares


In [11]:
# extra cleanup
data_jsoc = data_jsoc.drop(columns=['CRVAL1', 'CRVAL2', 'CRLN_OBS',
                                    'CRLT_OBS'])
data_jsoc['T_REC'] = pd.to_datetime(data_jsoc['T_REC'])
# drop=True avoids keeping the old row numbers as an extra 'index' column
data_jsoc = data_jsoc.sort_values(by=['T_REC', 'NOAA_AR']).reset_index(drop=True)

For each flare, we take its peak time and the peak time of the previous flare,
and tag every ```data_jsoc``` record of that active region between those two
times with the flare's GOES class.

In [12]:
# make sure the label column exists even if no record is matched
data_jsoc['flare'] = pd.Series(np.nan, index=data_jsoc.index, dtype=object)
known_noaa = data_jsoc['NOAA_AR'].unique()

for i in range(len(listofresults)):
    start_time = listofresults['start_time'].iloc[i]
    peak_time = listofresults['peak_time'].iloc[i]
    noaa_num = listofresults['noaa_active_region'].iloc[i]
    goes_class = listofresults['goes_class'].iloc[i]
    if noaa_num not in known_noaa:
        continue
    # get current noaa's data
    df = data_jsoc[data_jsoc['NOAA_AR'] == noaa_num]
    ar_start_time = df['T_REC'].iloc[0]

    # the first flare has no predecessor, so start from the first AR record;
    # otherwise use the peak of the previous flare in the catalog (any AR)
    previous_peak_time = ar_start_time if i == 0 else \
        listofresults['peak_time'].iloc[i-1]

    # create boolean mask
    mask = ((df['T_REC'] > previous_peak_time) &
            (df['T_REC'] < peak_time))
    current_flare = df.loc[mask]
    data_jsoc.loc[current_flare.index, 'flare'] = goes_class

    # label positive for minimum_class_label before peak time, else label
    # negative
    if any(c in goes_class for c in minimum_class_label):
        print(goes_class)
        time_before = peak_time - timedelta(hours=timedelayvariable)
        # get samples between peak and 24h before
        # (not yet written back to data_jsoc)
        mask = ((df['T_REC'] > time_before) &
                (df['T_REC'] < peak_time))
        time_before_flare_df = df.loc[mask]



M5.4
M6.6
X2.2
M6.6
M5.3
X1.5
M9.3
M6.0
M9.3
X6.9
M5.3
X2.1
X1.8
M6.7
X1.4
X1.9
M7.1
M5.8
M7.4
X1.9
M8.7
X1.7
X1.1
X5.4
M6.3
M8.4
M7.9
M5.7
M5.1
M5.6
M5.3
M6.1
X1.1
M6.9
M6.1
M5.5
M9.0
M5.0
X1.8
M6.0
M6.5
M5.7
X1.7
X2.8
X3.2
X1.2
M5.0
M5.9
M9.3
X1.7
X2.1
X1.0
M5.1
X2.3
M6.3
M5.0
X3.3
X1.1
X1.1
X1.0
M6.4
M9.9
M7.2
X1.2
M6.6
M5.2
X4.9
M9.3
X1.0
M6.5
M7.3
X2.2
X1.5
X1.0
M6.5
X1.1
M8.7
X1.6
X3.1
X1.0
X2.0
M7.1
M6.7
X2.0
M6.6
M6.5
M7.9
M5.4
X1.6
M5.7
M6.1
M8.7
M6.9
X1.8
M5.6
M8.2
M9.2
M5.8
M5.1
X2.1
X2.7
M6.5
M7.9
M5.6
M7.6
M5.5
M6.7
M5.3
M5.7
M5.8
M5.5
X2.2
X9.3
M7.3
X1.3
M8.1
X8.2


In [13]:
data_jsoc['flare'] = data_jsoc['flare'].fillna('N')
# temp = data_jsoc[data_jsoc['flare'].str.contains('M')]

## Label Positive class
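
The matching loop above selects the samples in the window before each strong flare but does not store a label yet. One possible way to turn those windows into a label column is sketched here on toy data. The function ```label_before_peaks```, the column name ```positive``` and the sample times are assumptions made for illustration, not the method of this notebook or of Liu et al.

```python
import pandas as pd


def label_before_peaks(records, peaks, delay_hours=24):
    """Return a boolean Series, True where T_REC lies within delay_hours before a peak."""
    window = pd.Timedelta(hours=delay_hours)
    positive = pd.Series(False, index=records.index)
    for peak in peaks:
        # Same open interval as the masks in the matching loop.
        positive |= (records['T_REC'] > peak - window) & (records['T_REC'] < peak)
    return positive


# Toy hourly records of one active region and one strong-flare peak time.
records = pd.DataFrame(
    {'T_REC': pd.date_range('2011-02-14 00:00', periods=48, freq='1h')})
peaks = [pd.Timestamp('2011-02-15 01:56')]
records['positive'] = label_before_peaks(records, peaks)
print(records['positive'].sum(), 'positive samples out of', len(records))
```

Records outside every window keep the value False, so the negative class would then follow as the complement within each flaring active region.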

## Label Negative class

# Step 3: Generate history data

# Step 4: Calculate Decay values

# Step 5: Assign positive and negative classes?
