#David Barnett
#Predicting the Speed of Coronal Mass Ejections
#Data Collection and Cleaning


From NASA:

"*The Space Weather Database Of Notifications, Knowledge, Information (DONKI) is a comprehensive on-line tool for space weather forecasters, scientists, and the general space science community. DONKI provides chronicles the daily interpretations of space weather observations, analysis, models, forecasts, and notifications provided by the Space Weather Research Center (SWRC), comprehensive knowledge-base search functionality to support anomaly resolution and space science research, intelligent linkages, relationships, cause-and-effects between space weather activities and comprehensive webservice API access to information stored in DONKI.*"

In summary, this API provides data on Coronal Mass Ejections, and I will use the existing data to try to predict the speed of future CMEs.

API: https://api.nasa.gov/

Definition from NOAA: "Coronal Mass Ejections (CMEs) are large expulsions of plasma and magnetic field from the Sun’s corona". CMEs occur anywhere from several times a day to once or twice a week.

![](https://cosmos-images1.imgix.net/file/spina/photo/18036/190218-sun-full.jpg?ixlib=rails-2.1.4&auto=format&ch=Width%2CDPR&fit=max&w=1920)

In this project, I'm attempting to predict a CME's speed (in km/s) from the other variables provided, such as the direction of the CME and the time since the last CME. Unfortunately, the link to the README for this API doesn't work, so the meaning of some variables is unclear.

Now to the collection and cleaning. The next few cells read in the dataset from the NASA API.

In [0]:
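import os
import time

import numpy as np
import pandas as pd
import requests

#Reading the key from the environment instead of hard-coding it in the notebook.
#The variable name NASA_API_KEY is an assumption; "DEMO_KEY" is NASA's
#rate-limited public demo key
api_key = os.environ.get("NASA_API_KEY", "DEMO_KEY")
prefix = "https://api.nasa.gov/DONKI/"

types = ["CME", "CMEAnalysis"]

#One query string per endpoint, covering 2015-01-01 through 2019-12-31
suffixes = [
  t + "?startDate=2015-01-01&endDate=2019-12-31&api_key=" + api_key
  for t in types
]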
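
The query strings are put together by plain string concatenation. As a minimal sketch of an alternative, the standard library's `urllib.parse.urlencode` can build the same query from a dictionary. The helper name `build_donki_url` and the placeholder key `DEMO_KEY` are assumptions for illustration and are not part of the original notebook.

```python
from urllib.parse import urlencode

PREFIX = "https://api.nasa.gov/DONKI/"

def build_donki_url(endpoint, start_date, end_date, api_key):
    """Return the request URL for one DONKI endpoint (hypothetical helper)."""
    query = urlencode({
        "startDate": start_date,
        "endDate": end_date,
        "api_key": api_key,
    })
    return f"{PREFIX}{endpoint}?{query}"

# Same endpoints and date range as the notebook, with a placeholder key
urls = [build_donki_url(t, "2015-01-01", "2019-12-31", "DEMO_KEY")
        for t in ["CME", "CMEAnalysis"]]
print(urls[0])
```

Keeping the parameters in a dictionary makes it easier to change the date range without editing a long string.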

In [0]:
#Browser-like headers, defined once and shared by both requests
headers = {
  "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
  "Accept-Encoding": "gzip, deflate, br",
  "Accept-Language": "en-US,en;q=0.9",
  "Cache-Control": "max-age=0",
  "Connection": "keep-alive",
  "Cookie": "_ga=GA1.3.952345178.1583203760; _gid=GA1.3.2112114380.1583374584",
  "Host": "api.nasa.gov",
  "Referer": "https://api.nasa.gov/",
  "Sec-Fetch-Dest": "document",
  "Sec-Fetch-Mode": "navigate",
  "Sec-Fetch-Site": "none",
  "Sec-Fetch-User": "?1",
  "Upgrade-Insecure-Requests": "1",
  "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36)"
}

cme_response = requests.get(url = prefix + suffixes[0], headers = headers)

time.sleep(0.5) #Brief pause between the two API calls

analysis_response = requests.get(url = prefix + suffixes[1], headers = headers)

In [0]:
cme_df = pd.DataFrame(cme_response.json())
cme_df.head()

Unnamed: 0,activityID,startTime,sourceLocation,activeRegionNum,instruments,cmeAnalyses,linkedEvents,note,catalog
0,2015-01-01T08:24:00-CME-001,2015-01-01T08:24Z,,,"[{'id': 1, 'displayName': 'SOHO: LASCO/C2'}, {...","[{'time21_5': '2015-01-01T23:14Z', 'latitude':...",,"Eruption visible in SDO 193, starting ~ 2014-1...",SWRC_CATALOG
1,2015-01-02T14:36:00-CME-001,2015-01-02T14:36Z,S07W40,,"[{'id': 1, 'displayName': 'SOHO: LASCO/C2'}, {...","[{'time21_5': '2015-01-02T23:55Z', 'latitude':...",[{'activityID': '2015-01-07T05:24:00-IPS-001'}],Associated with a very gradual eruption in a s...,SWRC_CATALOG
2,2015-01-03T03:24:00-CME-001,2015-01-03T03:24Z,,,"[{'id': 1, 'displayName': 'SOHO: LASCO/C2'}, {...","[{'time21_5': '2015-01-03T23:40Z', 'latitude':...",,,SWRC_CATALOG
3,2015-01-06T18:24:00-CME-001,2015-01-06T18:24Z,,,"[{'id': 1, 'displayName': 'SOHO: LASCO/C2'}, {...","[{'time21_5': '2015-01-07T00:25Z', 'latitude':...",,No source region could be found in SDO.,SWRC_CATALOG
4,2015-01-07T16:24:00-CME-001,2015-01-07T16:24Z,,,"[{'id': 1, 'displayName': 'SOHO: LASCO/C2'}, {...","[{'time21_5': '2015-01-08T03:10Z', 'latitude':...",,SDO 193 shows indication of an eruption off th...,SWRC_CATALOG


Now I'm going to clean up cme_df. Each of the following cells starts with a comment describing what it does.

In [0]:
#Changing linked events to an indicator variable, since relatively few CMEs are 
#linked and the data on the linked events themselves were too sparse to analyze

#Assigning the whole column in one step avoids chained indexing and the
#SettingWithCopyWarning it triggers
cme_df["linkedEvents"] = cme_df["linkedEvents"].notna().astype(int)
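
Chained indexing of the form `df[col][mask] = value` can trigger pandas' `SettingWithCopyWarning`, because the assignment may land on a temporary copy. A minimal sketch of building the same 0/1 indicator in a single step follows; the toy frame is made up for illustration and only mimics the shape of the `linkedEvents` column.

```python
import pandas as pd

# Toy stand-in for linkedEvents: a list of linked activity IDs, or None
toy = pd.DataFrame({
    "linkedEvents": [
        None,
        [{"activityID": "2015-01-07T05:24:00-IPS-001"}],
        None,
    ]
})

# notna() yields a boolean mask; astype(int) converts it to a 0/1 indicator
toy["linkedEvents"] = toy["linkedEvents"].notna().astype(int)

print(toy["linkedEvents"].tolist())
```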


In [0]:
#Formatting instruments variable to be a text variable by joining the 
#displayName of every instrument listed for the CME
#(Series.set_value, used previously, is deprecated and was removed in pandas 1.0)

cme_df["instruments"] = cme_df["instruments"].apply(
  lambda insts: " ".join(inst["displayName"] for inst in insts 
                         if "displayName" in inst)
)
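
To make the instrument formatting concrete, here is a minimal standalone sketch of the flattening step applied to a single record. The function name `instrument_names` is hypothetical, and the sample list is a shortened stand-in for one `instruments` entry (the second record omits its `id`, which is not visible in the truncated output above).

```python
def instrument_names(instruments):
    """Join the displayName of each instrument record into one string."""
    return " ".join(
        inst["displayName"] for inst in instruments if "displayName" in inst
    )

# Stand-in for one row of the instruments column
sample = [
    {"id": 1, "displayName": "SOHO: LASCO/C2"},
    {"displayName": "SOHO: LASCO/C3"},
]

print(instrument_names(sample))  # SOHO: LASCO/C2 SOHO: LASCO/C3
```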


In [0]:
#Date formatting

cme_df["startTime"] = pd.to_datetime(
  cme_df["startTime"].str[:10] + " " + cme_df["startTime"].str[11:16]
)

In [0]:
#Changing source location (e.g. "S07W40") to coordinate columns.
#North and east become positive, south and west negative

locs = cme_df["sourceLocation"][cme_df["sourceLocation"] != ""]

locs = locs.str.replace("N", "", regex = False)
locs = locs.str.replace("S", "-", regex = False)
locs = locs.str.replace("E", " ", regex = False)
locs = locs.str.replace("W", " -", regex = False)

#Splitting into two numeric columns: 0 is N/S, 1 is E/W
locs = locs.str.split(" ", expand = True).astype(float)

#Assignment aligns on the index, so rows without a location become NaN.
#This also avoids the chained indexing that raised SettingWithCopyWarning
cme_df["N/S"] = locs[0]
cme_df["E/W"] = locs[1]

cme_df["N/S"] = cme_df["N/S"].where(cme_df["N/S"] < 90) #Removing outliers
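
The sign convention used for the coordinates can be checked on a single value. The sketch below is a standalone illustration, not the notebook's method: the function name `parse_source_location` is hypothetical, and it assumes every location has the form letter, digits, letter, digits (such as `S07W40`).

```python
import re

# Assumed format: N or S, degrees, then E or W, degrees
LOCATION_PATTERN = re.compile(r"^([NS])(\d+)([EW])(\d+)$")

def parse_source_location(location):
    """Return (north_south, east_west) in degrees, or None if unparseable."""
    match = LOCATION_PATTERN.match(location or "")
    if match is None:
        return None
    ns_sign = 1 if match.group(1) == "N" else -1  # south is negative
    ew_sign = 1 if match.group(3) == "E" else -1  # west is negative
    return ns_sign * int(match.group(2)), ew_sign * int(match.group(4))

print(parse_source_location("S07W40"))  # (-7, -40)
print(parse_source_location(""))        # None
```

The first result matches the `-7.0` and `-40.0` shown for the 2015-01-02 CME in the cleaned table.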


In [0]:
#Dropping unnecessary variables

cme_df.drop(["activityID", "sourceLocation", "cmeAnalyses", "catalog"], axis = 1, inplace = True)

In [0]:
#Dropping duplicate rows so CMEs aren't counted twice

cme_df.drop_duplicates(subset = "startTime", inplace = True)

In [0]:
#Setting the index to be when the CME occurred

cme_df.set_index("startTime", inplace = True)

Now I'll clean up analysis_df.

In [0]:
analysis_df = pd.DataFrame(analysis_response.json())
analysis_df.head()

Unnamed: 0,time21_5,latitude,longitude,halfAngle,speed,type,isMostAccurate,associatedCMEID,note,catalog
0,2015-01-01T23:14Z,31.0,26.0,32.0,350.0,S,True,2015-01-01T08:24:00-CME-001,using both SWPC_cat and STEREO_cat with approx...,SWRC_CATALOG
1,2015-01-02T23:55Z,3.0,34.0,23.0,353.0,S,True,2015-01-02T14:36:00-CME-001,Reanalyzed with more C2 and STEREOA imagery,SWRC_CATALOG
2,2015-01-03T23:40Z,-49.0,-82.0,42.0,210.0,S,True,2015-01-03T03:24:00-CME-001,Source region could not be found. It's possibl...,SWRC_CATALOG
3,2015-01-07T00:25Z,9.0,39.0,12.0,532.0,C,True,2015-01-06T18:24:00-CME-001,,SWRC_CATALOG
4,2015-01-08T03:10Z,67.0,-102.0,20.0,579.0,C,True,2015-01-07T16:24:00-CME-001,,SWRC_CATALOG


In [0]:
#Date formatting

analysis_df["analysis_time"] = pd.to_datetime(
    analysis_df["time21_5"].str[:10] + " " + 
    analysis_df["time21_5"].str[11:16] + ":00"
)
analysis_df["occurrence_time"] = pd.to_datetime(
    analysis_df["associatedCMEID"].str[:10] + " " + 
    analysis_df["associatedCMEID"].str[11:19]
)
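
Both timestamps are recovered by slicing fixed character positions out of strings. As a minimal sketch of what those slices extract, the same values can be parsed with the standard library's `datetime.strptime`; the two function names are hypothetical, and the format strings assume the layouts visible in the table above.

```python
from datetime import datetime

def parse_analysis_time(time21_5):
    """Parse a value such as '2015-01-01T23:14Z' into a naive datetime."""
    return datetime.strptime(time21_5, "%Y-%m-%dT%H:%MZ")

def parse_occurrence_time(cme_id):
    """Parse the leading timestamp of an ID like '2015-01-01T08:24:00-CME-001'."""
    # The first 19 characters hold the date and time; the rest is the suffix
    return datetime.strptime(cme_id[:19], "%Y-%m-%dT%H:%M:%S")

print(parse_analysis_time("2015-01-01T23:14Z"))
print(parse_occurrence_time("2015-01-01T08:24:00-CME-001"))
```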

In [0]:
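#Creating a new variable that measures the time elapsed between the occurrence 
#time of the CME and when the analysis was performed (taken here to be time21_5;
#in DONKI this field likely marks when the CME reaches 21.5 solar radii)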

analysis_df["time_before_analysis"] = (analysis_df["analysis_time"] - 
                            analysis_df["occurrence_time"])

In [0]:
#Additionally, creating a variable for time since the last CME

#diff() subtracts the previous row's occurrence time from each row; the first
#row has no predecessor and is left as NaT
analysis_df["time_since_last_cme"] = analysis_df["occurrence_time"].diff()
  


In [0]:
#Formatting the differences in time to be in number of hours rather than as a 
#timedelta variable. This helps with future analyses

analysis_df["time_before_analysis"] = (
      analysis_df["time_before_analysis"].dt.total_seconds() / 3600
)

analysis_df["time_since_last_cme"] = (
    analysis_df["time_since_last_cme"].dt.total_seconds() / 3600
)
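
As a worked check of the hour conversion, the sketch below redoes the arithmetic for the first two CMEs in the table using only the standard library. The timestamps are taken from the rows shown earlier; the variable names are for illustration.

```python
from datetime import datetime

occurrence = datetime(2015, 1, 1, 8, 24)        # first CME
analysis = datetime(2015, 1, 1, 23, 14)         # its time21_5 value
next_occurrence = datetime(2015, 1, 2, 14, 36)  # second CME

# A timedelta converts to hours via total_seconds() / 3600
time_before_analysis = (analysis - occurrence).total_seconds() / 3600
time_since_last_cme = (next_occurrence - occurrence).total_seconds() / 3600

print(round(time_before_analysis, 6))  # 14.833333
print(round(time_since_last_cme, 6))   # 30.2
```

These agree with the `time_before_analysis` of the first row and the `time_since_last_cme` of the second row in the final table.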

In [0]:
#Dropping unnecessary variables

analysis_df.drop(["time21_5", "isMostAccurate", "associatedCMEID", "catalog"], 
               axis = 1, inplace = True)

In [0]:
#Dropping duplicate CMEs
analysis_df.drop_duplicates(subset = "occurrence_time", inplace = True)

In [0]:
#Setting the index to match the startTime of cme_df

analysis_df.set_index("occurrence_time", inplace = True)
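
Giving both frames the same datetime index presumably allows them to be combined later. The notebook does not show that step, so the following is only a sketch under the assumption that an index join is the intended use. The toy frames carry two rows of values from the tables above.

```python
import pandas as pd

times = pd.to_datetime(["2015-01-01 08:24", "2015-01-02 14:36"])

# Toy versions of the two cleaned frames, indexed by CME start time
cme_toy = pd.DataFrame({"linkedEvents": [0, 1]}, index=times)
analysis_toy = pd.DataFrame({"speed": [350.0, 353.0]}, index=times)

# An inner join keeps only CMEs present in both frames
combined = cme_toy.join(analysis_toy, how="inner")
print(combined)
```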

In [0]:
display(cme_df.head())
print()
display(analysis_df.head())

startTime,activeRegionNum,instruments,linkedEvents,note,N/S,E/W
2015-01-01 08:24:00,,SOHO: LASCO/C2 SOHO: LASCO/C3,0,"Eruption visible in SDO 193, starting ~ 2014-1...",,
2015-01-02 14:36:00,,SOHO: LASCO/C2 SOHO: LASCO/C3,1,Associated with a very gradual eruption in a s...,-7.0,-40.0
2015-01-03 03:24:00,,SOHO: LASCO/C2 SOHO: LASCO/C3,0,,,
2015-01-06 18:24:00,,SOHO: LASCO/C2 SOHO: LASCO/C3,0,No source region could be found in SDO.,,
2015-01-07 16:24:00,,SOHO: LASCO/C2 SOHO: LASCO/C3 STEREO A: SECCHI...,0,SDO 193 shows indication of an eruption off th...,,





occurrence_time,latitude,longitude,halfAngle,speed,type,note,analysis_time,time_before_analysis,time_since_last_cme
2015-01-01 08:24:00,31.0,26.0,32.0,350.0,S,using both SWPC_cat and STEREO_cat with approx...,2015-01-01 23:14:00,14.833333,
2015-01-02 14:36:00,3.0,34.0,23.0,353.0,S,Reanalyzed with more C2 and STEREOA imagery,2015-01-02 23:55:00,9.316667,30.2
2015-01-03 03:24:00,-49.0,-82.0,42.0,210.0,S,Source region could not be found. It's possibl...,2015-01-03 23:40:00,20.266667,12.8
2015-01-06 18:24:00,9.0,39.0,12.0,532.0,C,,2015-01-07 00:25:00,6.016667,87.0
2015-01-07 16:24:00,67.0,-102.0,20.0,579.0,C,,2015-01-08 03:10:00,10.766667,22.0


In [0]:
#Downloading the resulting data frames to csv files

from google.colab import files

cme_df.to_csv("cme.csv")
files.download("cme.csv")

analysis_df.to_csv("cmeanalysis.csv")
files.download("cmeanalysis.csv")