# Predicting Coronal Mass Ejections Using SDO/HMI Vector Magnetic Data Products and Recurrent Neural Networks

Coronal mass ejections (CMEs) are intense bursts of magnetic flux and plasma that are ejected from the Sun into interplanetary space. They are often associated with solar flares and originate from active regions (ARs) on the Sun’s photosphere where magnetic fields are strong and evolve rapidly. Major CMEs and their associated flares can severely disturb the near-Earth environment, with potentially life-threatening consequences. Therefore, substantial efforts are being invested in developing new technologies for early detection and forecasting of flares and CMEs.

In this notebook, we demonstrate two machine learning models that predict whether an AR that produces an M- or X-class flare will also produce a CME. Both models are recurrent neural networks (RNNs): a long short-term memory (LSTM) network and a gated recurrent unit (GRU) network. We show the models' predictions for several time windows (T), where T is the number of hours within which the M- or X-class flare occurs, using data from the National Centers for Environmental Information (NCEI) data catalog for the years 2015-2019. The models were trained on NCEI data from the years 2010-2014.

In [None]:
import sys
sys.path.insert(0, './CMEpredict')

from CMEpredict import *

import numpy as np  # used below for np.arange, np.array, np.unique
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix as cm, roc_curve, roc_auc_score

### Set configuration variables

In [None]:
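mask_value = 0      # masking value passed to load_data
series_len = 20     # number of time steps in each input series
epochs = 20         # training epochs (used when retraining a model)
batch_size = 256    # batch size for training and prediction
nclass = 2          # two classes: N (negative) and P (positive)
start_feature = 4   # index of the first feature column (after Label, Timestamp, NOAA AR NUM, HARP NUM)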
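
To make `series_len` and `mask_value` more concrete, here is a minimal sketch of how a short history of observations might be padded to a fixed length before being fed to an RNN. This is an assumption about typical usage, not a description of what `load_data` in CMEpredict actually does; `pad_series` is a hypothetical helper and the values are toy numbers.

```python
def pad_series(rows, series_len, mask_value, n_features):
    """Left-pad (or truncate) a list of feature rows to exactly series_len rows.

    Hypothetical helper: rows filled with mask_value stand in for missing
    time steps so that a masking layer could skip them.
    """
    recent = rows[-series_len:]  # keep only the most recent series_len rows
    n_missing = series_len - len(recent)
    padding = [[mask_value] * n_features for _ in range(n_missing)]
    return padding + [list(r) for r in recent]


# Toy example: 2 observed time steps, 3 features, target length 4.
observed = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
padded = pad_series(observed, series_len=4, mask_value=0, n_features=3)
print(padded)
```

Padding on the left keeps the most recent observation as the last time step, which is the step an RNN reads just before emitting its prediction.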

### Definition of features

There are 18 features, or predictive parameters, which characterize AR magnetic field properties for CME predictions:

1. **MEANPOT**: mean photospheric magnetic free energy
1. **SHRGT45**: fraction of area with shear >45°
1. **TOTPOT**: total photospheric magnetic free energy density
1. **USFLUX**: total unsigned flux
1. **MEANJZH**: mean current helicity
1. **ABSNJZH**: absolute value of the net current helicity
1. **SAVNCPP**: sum of the modulus of the net current per polarity
1. **MEANALP**: mean characteristic twist parameter
1. **MEANSHR**: mean shear angle
1. **TOTUSJZ**: total unsigned vertical current
1. **TOTUSJH**: total unsigned current helicity
1. **MEANGAM**: mean angle of field from radial
1. **MEANGBZ**: mean gradient of vertical field
1. **MEANJZD**: mean vertical current density
1. **AREA_ACR**: area of strong field pixels in the active region
1. **R_VALUE**: sum of flux near polarity inversion line
1. **MEANGBT**: mean gradient of total field
1. **MEANGBH**: mean gradient of horizontal field

### Sample of NCEI data

The **Label** column has 3 values: _padding_, _N_, and _P_. _Padding_ means this is an auxiliary data sample used to construct time series for prediction. _N_ means there is a ≥M-class flare within the next T hours, but the flare is not associated with a CME. _P_ means there is a ≥M-class flare within the next T hours, and this flare is associated with a CME.

The **NOAA AR NUM** and **HARP NUM** columns are identifying columns for the AR and HARP, respectively.

The subsequent 18 columns are the 18 physical SHARP features.
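
The column layout described above can be sketched as follows. This is only an illustration under stated assumptions: `split_row` is a hypothetical helper, the row holds placeholder values rather than real data, and taking the first `n_features` feature columns is a guess at how a subset of the 18 features might be selected, not a description of `load_data`.

```python
START_FEATURE = 4  # first four columns: Label, Timestamp, NOAA AR NUM, HARP NUM


def split_row(row, n_features):
    """Split one CSV row into (target, metadata, features).

    target is 1 for 'P', 0 for 'N', and None for 'padding' rows.
    """
    label = row[0]
    target = {"P": 1, "N": 0}.get(label)  # None for 'padding'
    metadata = row[:START_FEATURE]
    stop = START_FEATURE + n_features
    features = [float(v) for v in row[START_FEATURE:stop]]
    return target, metadata, features


# Hypothetical row with placeholder values; only 3 feature values are shown.
row = ["P", "t0", "0", "0", "0.5", "-1.2", "0.3"]
target, metadata, features = split_row(row, n_features=2)
print(target, metadata, features)
```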

In [None]:
names = ['Label', 'Timestamp', 'NOAA AR NUM', 'HARP NUM',
          'TOTUSJH', 'TOTPOT', 'TOTUSJZ', 'ABSNJZH', 'SAVNCPP', 'USFLUX', 'AREA_ACR',
          'MEANPOT', 'R_VALUE', 'SHRGT45', 'MEANGAM', 'MEANJZH', 'MEANGBT', 'MEANGBZ',
          'MEANJZD', 'MEANGBH', 'MEANSHR', 'MEANALP']
sample_data = pd.read_csv('CMEpredict/normalized_testing_12.csv', names=names)
display(sample_data)

### Predicting

We get the threshold for binary classification. This is a binary classification problem: if the model returns a probability at or above the threshold, the prediction is Positive (P); otherwise it is Negative (N).

We also get `n_features` for each algorithm and time window, which is how many of the 18 features that model uses.

For each time window (12, 36, and 60 hours), we run both models and show the confusion matrix (CM) for each model, without displaying the per-sample results table.

Finally, we show the ROC for all 6 runs.
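
The thresholding rule can be written in a few lines. The sketch below is illustrative only: `classify` is a hypothetical helper, and the probabilities and threshold are made-up values rather than anything returned by `get_n_features_thresh` or the trained models.

```python
def classify(probabilities, threshold):
    """Return 'P' where the probability is at or above the threshold, else 'N'."""
    return ["P" if p >= threshold else "N" for p in probabilities]


# Hypothetical probabilities and threshold, for illustration only.
labels = classify([0.10, 0.45, 0.80], threshold=0.45)
print(labels)  # ['N', 'P', 'P']
```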

In [None]:
import itertools
time_windows = [12, 36, 60] # options are 12, 24, 36, 48, 60
rnn_types = ['gru', 'lstm']

# columns = ['Predicted Label', 'Label', 'Timestamp', 'NOAA AR NUM', 'HARP NUM',
#                     'TOTUSJH', 'TOTPOT', 'TOTUSJZ', 'ABSNJZH', 'SAVNCPP', 'USFLUX', 'AREA_ACR',
#                     'MEANPOT', 'R_VALUE', 'SHRGT45', 'MEANGAM', 'MEANJZH', 'MEANGBT', 'MEANGBZ',
#                     'MEANJZD', 'MEANGBH', 'MEANSHR', 'MEANALP']

cm_grid = [['TN','FP'], ['FN', 'TP']]

for (time_window, type) in itertools.product(time_windows, rnn_types):
    model_file = f'CMEpredict/{type}-{time_window}-model.h5'

    n_features, threshold = get_n_features_thresh(type, time_window)

    model = load_model(model_file)

    test_data_file = f'CMEpredict/normalized_testing_{time_window}.csv'
    X_test, y_test, nb_test = load_data(datafile=test_data_file,
                                        series_len=series_len,
                                        start_feature=start_feature,
                                        n_features=n_features,
                                        mask_value=mask_value,
                                        type=type,
                                        time_window=time_window)

    prob = model.predict(X_test,
                        batch_size=batch_size,
                        verbose=False,
                        steps=None)
    # table, y_pred = get_output_table(test_data_file, type, time_window, start_feature, n_features, thresh, prob)

    # data = pd.DataFrame(table, columns=columns)    
    # display(data)

    confmat = cm(y_test, [1 if p >= threshold else 0 for p in prob])
    plt.imshow(confmat, interpolation='nearest', cmap=(plt.cm.PuBu if type =='lstm' else plt.cm.OrRd))
    classNames = ['Negative','Positive']
    plt.title(f'{type.upper()} {time_window} Hour Confusion Matrix')
    plt.ylabel('True Label')
    plt.xlabel('Predicted Label')
    tick_marks = np.arange(len(classNames))
    plt.xticks(tick_marks, classNames, rotation=45)
    plt.yticks(tick_marks, classNames)
    for i in range(2):
        for j in range(2):
            plt.text(j,i, f'{cm_grid[i][j]} = {confmat[i][j]}', horizontalalignment='center')
    plt.show()
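
The four cells of each confusion matrix, laid out as `[[TN, FP], [FN, TP]]`, can be condensed into a few summary rates. The following is a minimal sketch in plain Python; the helper names and the toy labels are hypothetical and are not part of CMEpredict.

```python
def confusion_counts(y_true, y_pred):
    """Count (tn, fp, fn, tp) for binary labels where 1 = positive."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1 and p == 0:
            fn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def summarize(tn, fp, fn, tp):
    """Return recall, precision, false positive rate, and accuracy."""
    total = tn + fp + fn + tp
    return {
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }


# Toy labels, for illustration only.
y_true_toy = [0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 1, 1, 0, 1, 0]
counts = confusion_counts(y_true_toy, y_pred_toy)
metrics = summarize(*counts)
print(counts, metrics)
```

Recall and the false positive rate are the same two quantities that an ROC plot puts on its vertical and horizontal axes.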

In [None]:
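# ROC curve for the last model evaluated in the loop above.
# Score with the predicted probabilities rather than thresholded labels so
# that the curve is traced over every threshold.
scores = np.ravel(prob)
r_auc = roc_auc_score(y_test, scores)
r_fpr, r_tpr, _ = roc_curve(y_test, scores)

plt.plot(r_fpr, r_tpr, linestyle='--', label=f'{type.upper()} {time_window} hour (AUROC = {r_auc:.3f})')
plt.title('ROC Plot')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()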
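
An ROC curve is traced by sweeping the decision threshold over the predicted scores and recording the false positive rate and true positive rate at each step. The sketch below re-implements that idea in plain Python for illustration; `roc_points` and `auc` are hypothetical helpers, not the scikit-learn functions, and the labels and scores are toy values.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs obtained by sweeping the threshold over the scores.

    Assumes y_true contains at least one positive (1) and one negative (0).
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy labels and scores, for illustration only.
y_true_toy = [0, 0, 1, 1]
scores_toy = [0.1, 0.4, 0.35, 0.8]
points = roc_points(y_true_toy, scores_toy)
print(points, auc(points))
```

If hard 0/1 labels are passed instead of probabilities, only two distinct thresholds exist, so the curve collapses to a single interior point.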

### Predict 12 hour using GRU model

In [None]:
model_file = 'CMEpredict/gru-12-model.h5'

n_features, thresh = get_n_features_thresh('gru', 12)

model = load_model(model_file)

print('loading testing data')
test_data_file = 'CMEpredict/normalized_testing_12.csv'
X_test, y_test, nb_test = load_data(datafile=test_data_file,
                                    series_len=series_len,
                                    start_feature=start_feature,
                                    n_features=n_features,
                                    mask_value=mask_value,
                                    type='gru',
                                    time_window=12)
print('done loading testing data...')
display(X_test[2])

In [None]:
display(len(X_test[2]))

In [None]:
print('predicting testing data...')
prob = model.predict(X_test,
                     batch_size=batch_size,
                     verbose=False,
                     steps=None)
print('done predicting...')
table, y_pred = get_output_table(test_data_file, 'gru', 12, start_feature, n_features, thresh, prob)
columns = ['Predicted Label', 'Label', 'Timestamp', 'NOAA AR NUM', 'HARP NUM',
                'TOTUSJH', 'TOTPOT', 'TOTUSJZ', 'ABSNJZH', 'SAVNCPP', 'USFLUX', 'AREA_ACR',
                'MEANPOT', 'R_VALUE', 'SHRGT45', 'MEANGAM', 'MEANJZH', 'MEANGBT', 'MEANGBZ',
                'MEANJZD', 'MEANGBH', 'MEANSHR', 'MEANALP']
data = pd.DataFrame(table, columns=columns)    
display(data)

### Confusion Matrix for GRU 12

In [None]:
lp = np.array([1 if p >= thresh else 0 for p in prob])
confmat = cm(y_test, lp)
import matplotlib.pyplot as plt
plt.imshow(confmat, interpolation='nearest', cmap=plt.cm.Blues)
classNames = ['Negative','Positive']
plt.title('GRU 12 Confusion Matrix - Test Data')
plt.ylabel('True label')
plt.xlabel('Predicted label')
tick_marks = np.arange(len(classNames))
plt.xticks(tick_marks, classNames, rotation=45)
plt.yticks(tick_marks, classNames)
s = [['TN','FP'], ['FN', 'TP']]
for i in range(2):
    for j in range(2):
        plt.text(j,i, str(s[i][j])+" = "+str(confmat[i][j]))
plt.show()
# plt.matshow(confmat)
# plt.tight_layout()
# display(y_test)
# display(y_pred)
# display(thresh)
# pd.options.display.max_rows = 4000
# display(pd.DataFrame([p for p in prob]))

### Predict 12 hour using LSTM model

In [None]:
model_file = 'CMEpredict/lstm-12-model.h5'

n_features, thresh = get_n_features_thresh('lstm', 12)

model = load_model(model_file)

print('loading testing data')
test_data_file = 'CMEpredict/normalized_testing_12.csv'
X_test, y_test, nb_test = load_data(datafile=test_data_file,
                                    series_len=series_len,
                                    start_feature=start_feature,
                                    n_features=n_features,
                                    mask_value=mask_value,
                                    type='lstm',
                                    time_window=12)
print('done loading testing data...')
# display(X_test)

In [None]:
print('predicting testing data...')
prob = model.predict(X_test,
                     batch_size=batch_size,
                     verbose=False,
                     steps=None)
print('done predicting...')
table, y_pred = get_output_table(test_data_file, 'lstm', 12, start_feature, n_features, thresh, prob)
columns = ['Predicted Label', 'Label', 'Timestamp', 'NOAA AR NUM', 'HARP NUM',
                'TOTUSJH', 'TOTPOT', 'TOTUSJZ', 'ABSNJZH', 'SAVNCPP', 'USFLUX', 'AREA_ACR',
                'MEANPOT', 'R_VALUE', 'SHRGT45', 'MEANGAM', 'MEANJZH', 'MEANGBT', 'MEANGBZ',
                'MEANJZD', 'MEANGBH', 'MEANSHR', 'MEANALP']
data = pd.DataFrame(table, columns=columns)    
display(data)

### Predict 60 hour using LSTM model

In [None]:
model_file = 'CMEpredict/lstm-60-model.h5'

n_features, thresh = get_n_features_thresh('lstm', 60)

model = load_model(model_file)

print('loading testing data')
test_data_file = 'CMEpredict/normalized_testing_60.csv'
X_test, y_test, nb_test = load_data(datafile=test_data_file,
                                    series_len=series_len,
                                    start_feature=start_feature,
                                    n_features=n_features,
                                    mask_value=mask_value,
                                    type='lstm',
                                    time_window=60)
print('done loading testing data...')
display(nb_test)
# display(X_test)

In [None]:
print('predicting testing data...')
prob = model.predict(X_test,
                     batch_size=batch_size,
                     verbose=False,
                     steps=None)
print('done predicting...')
table, y_pred = get_output_table(test_data_file, 'lstm', 60, start_feature, n_features, thresh, prob)
columns = ['Predicted Label', 'Label', 'Timestamp', 'NOAA AR NUM', 'HARP NUM',
                'TOTUSJH', 'TOTPOT', 'TOTUSJZ', 'ABSNJZH', 'SAVNCPP', 'USFLUX', 'AREA_ACR',
                'MEANPOT', 'R_VALUE', 'SHRGT45', 'MEANGAM', 'MEANJZH', 'MEANGBT', 'MEANGBZ',
                'MEANJZD', 'MEANGBH', 'MEANSHR', 'MEANALP']
data = pd.DataFrame(table, columns=columns)    
display(data)

In [None]:
lp = np.array([1 if p >= thresh else 0 for p in prob])
confmat = cm(y_test, lp)
import matplotlib.pyplot as plt
plt.imshow(confmat, interpolation='nearest', cmap=plt.cm.Blues)
classNames = ['Negative','Positive']
plt.title('LSTM 60 Confusion Matrix - Test Data')
plt.ylabel('True label')
plt.xlabel('Predicted label')
tick_marks = np.arange(len(classNames))
plt.xticks(tick_marks, classNames, rotation=45)
plt.yticks(tick_marks, classNames)
s = [['TN','FP'], ['FN', 'TP']]
for i in range(2):
    for j in range(2):
        plt.text(j,i, str(s[i][j])+" = "+str(confmat[i][j]), horizontalalignment='center')
plt.show()
# plt.matshow(confmat)
# plt.tight_layout()
# display(y_test)
# display(y_pred)
# display(thresh)
# pd.options.display.max_rows = 4000
# display(pd.DataFrame([p for p in prob]))

In [None]:
train_data_file = 'CMEpredict/normalized_training_60.csv'
X_train, y_train, nb_train = load_data(datafile=train_data_file,
                                        series_len=series_len,
                                        start_feature=start_feature,
                                        n_features=n_features,
                                        mask_value=mask_value,
                                        type='lstm',
                                        time_window=60)

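# Recent scikit-learn releases require `classes` and `y` as keyword arguments.
from sklearn.utils import class_weight
class_weights = class_weight.compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
class_weight_ = {0: class_weights[0], 1: class_weights[1]}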
print('done loading training data...')


model = lstm(n_features, series_len)

print('training the model, wait until it is finished...')
model.compile(loss='binary_crossentropy',
                optimizer='RMSprop',
                metrics=['accuracy'])

history = model.fit(X_train,
                    y_train,
                    epochs=epochs,
                    batch_size=batch_size,
                    verbose=False,
                    shuffle=True,
                    class_weight=class_weight_)

print('predicting testing data...')
test_data_file = 'CMEpredict/normalized_testing_60.csv'
X_test, y_test, nb_test = load_data(datafile=test_data_file,
                                    series_len=series_len,
                                    start_feature=start_feature,
                                    n_features=n_features,
                                    mask_value=mask_value,
                                    type='lstm',
                                    time_window=60)
prob = model.predict(X_test,
                     batch_size=batch_size,
                     verbose=False,
                     steps=None)
print('done predicting...')
table, y_pred = get_output_table(test_data_file, 'lstm', 60, start_feature, n_features, thresh, prob)
columns = ['Predicted Label', 'Label', 'Timestamp', 'NOAA AR NUM', 'HARP NUM',
                'TOTUSJH', 'TOTPOT', 'TOTUSJZ', 'ABSNJZH', 'SAVNCPP', 'USFLUX', 'AREA_ACR',
                'MEANPOT', 'R_VALUE', 'SHRGT45', 'MEANGAM', 'MEANJZH', 'MEANGBT', 'MEANGBZ',
                'MEANJZD', 'MEANGBH', 'MEANSHR', 'MEANALP']
data = pd.DataFrame(table, columns=columns) 


lp = np.array([1 if p >= thresh else 0 for p in prob])
confmat = cm(y_test, lp)
import matplotlib.pyplot as plt
plt.imshow(confmat, interpolation='nearest', cmap=plt.cm.Blues)
classNames = ['Negative','Positive']
plt.title('LSTM 60 Confusion Matrix - Retrain')
plt.ylabel('True label')
plt.xlabel('Predicted label')
tick_marks = np.arange(len(classNames))
plt.xticks(tick_marks, classNames, rotation=45)
plt.yticks(tick_marks, classNames)
s = [['TN','FP'], ['FN', 'TP']]
for i in range(2):
    for j in range(2):
        plt.text(j,i, str(s[i][j])+" = "+str(confmat[i][j]), horizontalalignment='center')
plt.show()
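
The retraining cell above weights the two classes with scikit-learn's `'balanced'` heuristic, which scikit-learn documents as `n_samples / (n_classes * np.bincount(y))`. The sketch below re-implements that formula in plain Python to show its effect; `balanced_class_weights` is a hypothetical helper and the labels are toy values.

```python
from collections import Counter


def balanced_class_weights(y):
    """Return {class: n_samples / (n_classes * count_of_class)}."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in sorted(counts.items())}


# Toy labels: three negatives and one positive.
weights = balanced_class_weights([0, 0, 0, 1])
print(weights)
```

The rarer class receives the larger weight, so errors on it contribute more to the training loss.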

In [None]:
# ROC curve for the retrained LSTM 60 model, scored with the predicted
# probabilities rather than the thresholded labels.
scores = np.ravel(prob)
r_auc = roc_auc_score(y_test, scores)
r_fpr, r_tpr, _ = roc_curve(y_test, scores)

plt.plot(r_fpr, r_tpr, linestyle='--', label='LSTM 60 retrained (AUROC = %0.3f)' % r_auc)
plt.title('ROC Plot')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()

#### Train your own model