# Preparing data on solar flares

This is one of the Jupyter notebooks I used in my preparation of *Probably Overthinking It: How to Use Data to Answer Questions, Avoid Statistical Traps, and Make Better Decisions*.

The book is scheduled to be published by University of Chicago Press in 2023.
If you would like to get infrequent email announcements about the book, please
[sign up for my mailing list](http://eepurl.com/h0nfbX).



[Click here to run this notebook on Colab](https://colab.research.google.com/github/AllenDowney/ProbablyOverthinkingIt/blob/book/notebooks/clean_flare.ipynb).

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
from os.path import basename, exists

def download(url):
    filename = basename(url)
    if not exists(filename):
        from urllib.request import urlretrieve
        local, _ = urlretrieve(url, filename)
        print('Downloaded ' + local)

Files from before 1997 do not appear to include integrated flux, so the download starts with 1997.

The last file, for 2017, covers only the year to date.

In [28]:
for i in range(1997, 2017):
    if i == 2015:
        # The 2015 file has a nonstandard filename and is downloaded separately.
        continue
    url = f'https://www.ngdc.noaa.gov/stp/space-weather/solar-data/solar-features/solar-flares/x-rays/goes/xrs/goes-xrs-report_{i}.txt'
    download(url)

Two files, for 2015 and 2017, have nonstandard filenames, so I download them separately and copy each one to the standard name.

In [5]:
download('https://www.ngdc.noaa.gov/stp/space-weather/solar-data/solar-features/solar-flares/x-rays/goes/xrs/goes-xrs-report_2015_modifiedreplacedmissingrows.txt')
    

Downloaded goes-xrs-report_2015_modifiedreplacedmissingrows.txt


In [6]:
!cp goes-xrs-report_2015_modifiedreplacedmissingrows.txt goes-xrs-report_2015.txt

In [7]:
download('https://www.ngdc.noaa.gov/stp/space-weather/solar-data/solar-features/solar-flares/x-rays/goes/xrs/goes-xrs-report_2017-ytd.txt')
    

Downloaded goes-xrs-report_2017-ytd.txt


In [8]:
!cp goes-xrs-report_2017-ytd.txt goes-xrs-report_2017.txt
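
The `!cp` cells rely on a Unix shell. As a rough sketch of a portable alternative, the same copies could be made with `shutil.copy` from the standard library. The helper name `copy_to_standard_name` is hypothetical and not part of the original notebook.

```python
import shutil
from os.path import exists

def copy_to_standard_name(src, dst):
    """Copy src to dst if src exists; return True if a copy was made.

    Hypothetical helper, not part of the original notebook.
    """
    if not exists(src):
        return False
    shutil.copy(src, dst)
    return True

# Same effect as the two `!cp` cells, without requiring a Unix shell.
copy_to_standard_name('goes-xrs-report_2015_modifiedreplacedmissingrows.txt',
                      'goes-xrs-report_2015.txt')
copy_to_standard_name('goes-xrs-report_2017-ytd.txt',
                      'goes-xrs-report_2017.txt')
```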

In [9]:
!head goes-xrs-report_2017.txt

31777170104  0234 0250 0239                                B 18    G15  1.2E-04                               
31777170105  1820 1839 1831                                B 14    G15  1.3E-04                               
31777170110  1025 1033 1030                                B 73    G15  1.8E-04                               
31777170110  1256 1304 1300                                B 88    G15  2.3E-04                               
31777170110  1510 1517 1513 N16W81                         B 39    G15  9.8E-05       170104.4                
31777170110  1521 1531 1525                                B 41    G15  1.7E-04                               
31777170110  1636 1649 1646 N17W81                         B 66    G15  2.6E-04       170104.5                
31777170110  1803 1813 1808 N17W80                         B 53    G15  2.0E-04       170104.6                
31777170110  2107 2115 2112                                B 11    G15  3.9E-05                 

The file format is documented in the header of this IDL source file:

https://www.ngdc.noaa.gov/stp/space-weather/solar-data/solar-features/solar-flares/x-rays/goes/documentation/miscellaneous/software/xraydatareports.pro

```
;Output File Specification
;   Column  Format  Description   
;
;    1- 2     I2    Data code: always 31 for x-ray events
;    3- 5     I3    Station Code, 777 for GOES
;    6- 7     I2    Year
;    8- 9     I2    Month
;   10-11     I2    Day
;   12-13     A2    Asterisks mark record with unconfirmed change (What does this mean?)
;   14-17     I4    Start time of x-ray event - SEE NOTE 1
;   18        1X    <space>
;   19-22     I4    End time
;   23        1X    <space>
;   24-27     I4    Max time
;   28        1X    <space>
;   29        A1    N or S for north or south latitude of xray flare if known
;   30-31     I2    Latitude of xray flare, if known
;   32        A1    E or W for east or west of longitude of xray flare, if known
;   33-34     I2    Central meridian distance of x-ray flare, if known
;   35-37     A3    SXI if data are from SXI imagery, blank otherwise
;   38-59    22X    <space>
;   60        A1    X-ray class: C,M,X code - SEE NOTE 2
;   61        1X    <space>
;   62-63     I2    X-ray intensity 10-99 for 1.0-9.9 x xray class
;   64-67     4X    <space>
;   68-71     A4    Station name abbreviation - "Gxx " for GOES
;   72        1X    <space>
;   73-80   E7.1    Integrated flux (units = J/m**2)
;   81-85     I5    NOAA/USAF sunspot region number
;   86        1X    <space>
;   87-88     I2    Year - central meridian passage (CMP)
;   89-90     I2    Month - central meridian passage (CMP)
;   91-94   F4.1    Day - central meridian passage (CMP)
;   95        1X    <space>
;   96-102  F7.1    Total region area in squared arc seconds
;  103        1X    <space>
;  104-110  F7.2    Total intensity (units - TBD) from SXI, if available
```
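
The specification above uses one-based, inclusive column ranges, while `pd.read_fwf` expects zero-based, half-open `(start, stop)` pairs. Here is a minimal sketch of the conversion. The helper `to_colspec` and the names `example_colspecs` and `record` are hypothetical, and the sample record is assembled by hand to follow the layout above rather than copied from a file.

```python
import io
import pandas as pd

def to_colspec(first, last):
    """Convert a one-based inclusive column range to a zero-based half-open pair."""
    return (first - 1, last)

# One-based column ranges taken from the specification above.
fields = {
    'year': (6, 7), 'month': (8, 9), 'day': (10, 11),
    'ns': (29, 29), 'lat': (30, 31), 'flux': (73, 80),
}
example_colspecs = [to_colspec(*rng) for rng in fields.values()]

# A hand-built 80-character record that follows the layout (illustrative only).
record = (
    '31' '777' '17' '01' '10' '  '      # columns 1-13
    '1510' ' ' '1517' ' ' '1513' ' '    # columns 14-28
    'N' '16' 'W' '81'                   # columns 29-34
).ljust(59) + 'B' + ' ' + '39' + '    ' + 'G15 ' + ' ' + ' 9.8E-05'

row = pd.read_fwf(io.StringIO(record), colspecs=example_colspecs,
                  header=None, names=list(fields))
```

The resulting pairs for year, N/S, latitude, and flux match the `colspecs` used later in this notebook.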

From https://www.spaceweather.gov/phenomena/solar-flares-radio-blackouts

```
Radio Blackout   X-ray Flare   Flux (W/m2)   Severity Descriptor
R1               M1            0.00001       Minor
R2               M5            0.00005       Moderate
R3               X1            0.0001        Strong
R4               X10           0.001         Severe
R5               X20           0.002         Extreme
```
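
The class letter and two-digit intensity code in the GOES records encode peak flux in the units of this table. The sketch below decodes them, assuming the usual decade scale for the class letters (A = 1e-8, B = 1e-7, C = 1e-6, M = 1e-5, X = 1e-4 W/m2). The names `CLASS_BASE`, `peak_flux`, and `blackout_scale` are hypothetical. Note that this is peak flux, not the integrated flux (J/m**2) analyzed in the rest of this notebook.

```python
# Peak flux in W/m^2 at the bottom of each class (decade scale; an assumption here).
CLASS_BASE = {'A': 1e-8, 'B': 1e-7, 'C': 1e-6, 'M': 1e-5, 'X': 1e-4}

def peak_flux(xray_class, intensity):
    """Decode a class letter and an intensity code (10-99 meaning 1.0-9.9)."""
    return CLASS_BASE[xray_class] * intensity / 10

def blackout_scale(flux):
    """Return the radio blackout level (R1-R5) for a peak flux, or None below M1."""
    thresholds = [(2e-3, 'R5'), (1e-3, 'R4'), (1e-4, 'R3'), (5e-5, 'R2'), (1e-5, 'R1')]
    for threshold, level in thresholds:
        if flux >= threshold:
            return level
    return None

# An M7.2 flare falls between the M5 and X1 thresholds.
level_m72 = blackout_scale(peak_flux('M', 72))
# A B1.8 flare is below the M1 threshold, so it has no blackout level.
level_b18 = blackout_scale(peak_flux('B', 18))
```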

In [10]:
i = 2017
filename = f'goes-xrs-report_{i}.txt'
filename

'goes-xrs-report_2017.txt'

In [11]:
!head goes-xrs-report_2017.txt

31777170104  0234 0250 0239                                B 18    G15  1.2E-04                               
31777170105  1820 1839 1831                                B 14    G15  1.3E-04                               
31777170110  1025 1033 1030                                B 73    G15  1.8E-04                               
31777170110  1256 1304 1300                                B 88    G15  2.3E-04                               
31777170110  1510 1517 1513 N16W81                         B 39    G15  9.8E-05       170104.4                
31777170110  1521 1531 1525                                B 41    G15  1.7E-04                               
31777170110  1636 1649 1646 N17W81                         B 66    G15  2.6E-04       170104.5                
31777170110  1803 1813 1808 N17W80                         B 53    G15  2.0E-04       170104.6                
31777170110  2107 2115 2112                                B 11    G15  3.9E-05                 

In [12]:
# Zero-based, half-open column ranges for:
# year, N/S, latitude, integrated flux, month, day
colspecs = [(5, 7), (28, 29), (29, 31), (72, 80), (7, 9), (9, 11)]
df = pd.read_fwf(filename, colspecs=colspecs, header=None)
df.head()

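|   | 0 | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|---|
| 0 | 17 | NaN | NaN | 0.00012 | 1 | 4 |
| 1 | 17 | NaN | NaN | 0.00013 | 1 | 5 |
| 2 | 17 | NaN | NaN | 0.00018 | 1 | 10 |
| 3 | 17 | NaN | NaN | 0.00023 | 1 | 10 |
| 4 | 17 | N | 16.0 | 9.8e-05 | 1 | 10 |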


In [13]:
# Read each yearly report with the column ranges defined above.
df_seq = []

for i in range(1997, 2018):
    filename = f'goes-xrs-report_{i}.txt'
    df = pd.read_fwf(filename, colspecs=colspecs, header=None)
    print(i, len(df), df.shape)
    df_seq.append(df)

1997 1141 (1141, 6)
1998 2248 (2248, 6)
1999 2425 (2425, 6)
2000 2661 (2661, 6)
2001 2706 (2706, 6)
2002 2718 (2718, 6)
2003 2394 (2394, 6)
2004 2369 (2369, 6)
2005 2171 (2171, 6)
2006 1339 (1339, 6)
2007 649 (649, 6)
2008 86 (86, 6)
2009 256 (256, 6)
2010 1255 (1255, 6)
2011 2171 (2171, 6)
2012 2037 (2037, 6)
2013 2051 (2051, 6)
2014 2258 (2258, 6)
2015 1963 (1963, 6)
2016 1194 (1194, 6)
2017 510 (510, 6)


In [14]:
flares = pd.concat(df_seq, ignore_index=True)
flares.columns = ['year', 'ns', 'lat', 'flux', 'month', 'day']
flares.head()

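|   | year | ns | lat | flux | month | day |
|---|---|---|---|---|---|---|
| 0 | 97 | NaN | NaN | 0.00043 | 1 | 5 |
| 1 | 97 | S | 1.0 | 1.5e-05 | 1 | 7 |
| 2 | 97 | NaN | NaN | 4.1e-05 | 1 | 16 |
| 3 | 97 | NaN | NaN | NaN | 1 | 16 |
| 4 | 97 | NaN | NaN | 1e-05 | 1 | 19 |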


In [15]:
flares['year'] += np.where(flares['year'] > 50, 1900, 2000)
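
The two-digit years are expanded with a pivot at 50: values above 50 are treated as 19xx and the rest as 20xx. The sketch below applies that rule to made-up rows. It then shows one way the three date columns could be combined into a single timestamp with `pd.to_datetime`. The `date` column is an illustration only and is not used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration; not taken from the GOES files.
sample = pd.DataFrame({'year': [97, 5, 17], 'month': [1, 9, 1], 'day': [5, 7, 10]})

# Same pivot rule as above: two-digit years above 50 belong to the 1900s.
sample['year'] += np.where(sample['year'] > 50, 1900, 2000)

# to_datetime can assemble dates from columns named year, month, and day.
sample['date'] = pd.to_datetime(sample[['year', 'month', 'day']])
```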

In [16]:
flares['year'].describe()

count    36602.000000
mean      2005.935441
std          6.030805
min       1997.000000
25%       2001.000000
50%       2004.000000
75%       2012.000000
max       2017.000000
Name: year, dtype: float64

In [17]:
flares['month'].describe()

count    36602.000000
mean         6.567592
std          3.462545
min          1.000000
25%          4.000000
50%          7.000000
75%         10.000000
max         12.000000
Name: month, dtype: float64

In [18]:
flares['day'].describe()

count    36602.000000
mean        15.741052
std          8.728193
min          1.000000
25%          8.000000
50%         16.000000
75%         23.000000
max         31.000000
Name: day, dtype: float64

In [19]:
# idxmax returns the index label, which is what .loc expects
i = flares['flux'].idxmax()
flares.loc[i]

year     2005
ns          S
lat      11.0
flux      2.6
month       9
day         7
Name: 20312, dtype: object

In [20]:
flux = flares['flux'].replace(0, np.nan).dropna()
flux.describe()

count    36560.000000
mean         0.004608
std          0.034690
min          0.000010
25%          0.000290
50%          0.000830
75%          0.002400
max          2.600000
Name: flux, dtype: float64

In [21]:
flux.to_csv('flares.csv', index=False, header=False)
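
Because the file is written without a header or index, it holds one flux value per line. Here is a sketch of reading such a file back into a Series. The values are made up, and the temporary path stands in for `flares.csv`.

```python
import os
import tempfile
import pandas as pd

# Made-up values standing in for the flux Series.
example_flux = pd.Series([1.2e-4, 9.8e-5, 2.6], name='flux')

path = os.path.join(tempfile.mkdtemp(), 'flares_example.csv')
example_flux.to_csv(path, index=False, header=False)

# header=None tells read_csv that the first line is data, not column names.
restored = pd.read_csv(path, header=None, names=['flux'])['flux']
```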

In [22]:
!ls -lh flares.csv

-rw-rw-r-- 1 downey downey 264K Mar  1 20:22 flares.csv


In [23]:
mags = np.log10(flux)
mags.describe()

count    36560.000000
mean        -3.058657
std          0.691503
min         -5.022276
25%         -3.537602
50%         -3.080922
75%         -2.619789
max          0.414973
Name: flux, dtype: float64

In [24]:
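# Assign the result instead of replacing in place on a column selection,
# and copy after dropna so that adding a column does not modify a view.
flares['flux'] = flares['flux'].replace(0, np.nan)
flares = flares.dropna(subset=['flux']).copy()
flares['mag'] = np.log10(flares['flux'])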
flares.shape

(36560, 7)

In [25]:
bins = pd.qcut(flares['lat'], 5)
bins

0                  NaN
1        (-0.001, 9.0]
2                  NaN
4                  NaN
5        (-0.001, 9.0]
             ...      
36597              NaN
36598              NaN
36599              NaN
36600     (16.0, 20.0]
36601              NaN
Name: lat, Length: 36560, dtype: category
Categories (5, interval[float64, right]): [(-0.001, 9.0] < (9.0, 12.0] < (12.0, 16.0] < (16.0, 20.0] < (20.0, 86.0]]
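
`pd.qcut` splits the non-missing latitudes into five bins with roughly equal counts and leaves rows with missing latitude as `NaN`. The notebook does not use `bins` again. As a sketch of one way such bins could be used, here is a group-by on quantile bins with invented data. The names `toy`, `toy_bins`, and `by_lat` are hypothetical.

```python
import numpy as np
import pandas as pd

# Invented data: latitude is missing for some rows, as in the flare records.
toy = pd.DataFrame({
    'lat': [2, 5, 9, 11, 12, 15, 16, 19, 25, 40, np.nan, np.nan],
    'mag': [-3.1, -2.9, -3.4, -2.7, -3.0, -3.3, -2.8, -3.2, -2.6, -3.5, -3.0, -2.5],
})

# Five quantile bins; rows with missing latitude get NaN and are dropped by groupby.
toy_bins = pd.qcut(toy['lat'], 5)
by_lat = toy.groupby(toy_bins, observed=True)['mag'].agg(['count', 'mean'])
```

With ten non-missing latitudes and five bins, each bin here holds two rows.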

In [26]:
summary = flares.groupby('year')['mag'].agg(['count', 'mean', 'std'])
summary

| year | count | mean | std |
|---|---|---|---|
| 1997 | 1139 | -3.477092 | 0.625527 |
| 1998 | 2238 | -3.104455 | 0.633378 |
| 1999 | 2413 | -2.823423 | 0.567794 |
| 2000 | 2659 | -2.692885 | 0.520485 |
| 2001 | 2704 | -2.647219 | 0.587652 |
| 2002 | 2714 | -2.708178 | 0.524542 |
| 2003 | 2393 | -2.983089 | 0.657991 |
| 2004 | 2369 | -3.227574 | 0.662144 |
| 2005 | 2169 | -3.350025 | 0.734765 |
| 2006 | 1339 | -3.733364 | 0.614207 |


In [27]:
cv = summary.std() / summary.mean()
cv

count    0.477711
mean    -0.125217
std      0.093619
dtype: float64
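
Here `summary.std() / summary.mean()` is the coefficient of variation of each summary column across years. The `mean` entry is negative because the yearly mean magnitudes are themselves negative. A worked sketch with invented numbers (the names `toy_summary` and `toy_cv` are hypothetical):

```python
import pandas as pd

# Invented yearly summaries, shaped like the `summary` table above.
toy_summary = pd.DataFrame(
    {'count': [100, 200, 300], 'mean': [-3.0, -2.0, -4.0]},
    index=pd.Index([2001, 2002, 2003], name='year'),
)

# Column-wise sample standard deviation (ddof=1) divided by the column-wise mean.
toy_cv = toy_summary.std() / toy_summary.mean()
```

For the `count` column the standard deviation is 100 and the mean is 200, giving 0.5. For the `mean` column the standard deviation is 1 and the mean is -3, giving about -0.33.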

*Probably Overthinking It*

Copyright 2022 Allen Downey

License: [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-nc-sa/4.0/)