# Analytics Base Table Construction
---
Begin by exploring the [SWAN-SF Dataset](https://doi.org/10.7910/DVN/EBCFKM).

Below are the steps you must complete before you can start.

---

## Step 1: Downloading the Data
---

This assignment uses only [Partition 1](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/BMXYCB), but we will use more than one partition by the end of the semester. In later steps, you will need to access the uncompressed files from these partitions, so remember where you put them.

A paper describing the construction of the dataset can be found [here](https://doi.org/10.1038/s41597-020-0548-x).

---

Individual partitions of the dataset can be accessed through the following links:
- [Partition 1](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/BMXYCB) - 1.2 GB
- [Partition 2](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/TCRPUD) - 1.4 GB
- [Partition 3](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/PTPGQT) - 702.1 MB
- [Partition 4](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/FIFLFU) - 844.4 MB
- [Partition 5](https://dataverse.harvard.edu/api/access/datafile/:persistentId?persistentId=doi:10.7910/DVN/EBCFKM/QC2C3X) - 1.2 GB
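
If you prefer to script the download, the following is a minimal sketch using only the Python standard library. The helper name `download_partition` and the destination path are assumptions made for illustration; the URL is the Partition 1 link from the list above.

```python
import urllib.request
from pathlib import Path

# URL of Partition 1, copied from the list above
PARTITION1_URL = (
    "https://dataverse.harvard.edu/api/access/datafile/"
    ":persistentId?persistentId=doi:10.7910/DVN/EBCFKM/BMXYCB"
)

def download_partition(url: str, destination: str) -> Path:
    """Download url to destination unless that file already exists."""
    target = Path(destination)
    if not target.exists():
        # Create the destination folder if needed, then stream the file to disk
        target.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(target))
    return target

# Example call (downloads about 1.2 GB, so it is left commented out).
# The destination path is hypothetical; pick any location you will remember.
# download_partition(PARTITION1_URL, "data/partition1_instances.tar.gz")
```

Skipping the download when the file already exists keeps repeated notebook runs from fetching the archive again.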

---

### Dataset Attributes:

Each file in the dataset holds one multivariate time series (MVTS) sample, and each of the following attributes is stored as a single variate (column) of that sample.

|              |                  |             |
|--------------|------------------|-------------|
| 1. Timestamp | 2. TOTUSJH | 3. TOTBSQ |
| 4. TOTPOT | 5. TOTUSJZ | 6. ABSNJZH |
| 7. SAVNCPP | 8. USFLUX | 9. TOTFZ |
| 10. MEANPOT | 11. EPSZ | 12. MEANSHR |
| 13. SHRGT45 | 14. MEANGAM | 15. MEANGBT |
| 16. MEANGBZ | 17. MEANGBH | 18. MEANJZH |
| 19. TOTFY | 20. MEANJZD | 21. MEANALP |
| 22. TOTFX | 23. EPSY | 24. EPSX |
| 25. R_VALUE | 26. CRVAL1 | 27. CRLN_OBS |
| 28. CRLT_OBS | 29. CRVAL2 | 30. HC_ANGLE |
| 31. SPEI | 32. LAT_MIN | 33. LON_MIN |
| 34. LAT_MAX | 35. LON_MAX | 36. QUALITY |
| 37. BFLARE | 38. BFLARE_LABEL | 39. CFLARE |
| 40. CFLARE_LABEL | 41. MFLARE | 42. MFLARE_LABEL |
| 43. XFLARE | 44. XFLARE_LABEL | 45. BFLARE_LOC |
| 46. BFLARE_LABEL_LOC | 47. CFLARE_LOC | 48. CFLARE_LABEL_LOC |
| 49. MFLARE_LOC | 50. MFLARE_LABEL_LOC | 51. FLARE_LOC |
| 52. XFLARE_LABEL_LOC | 53. XR_MAX | 54. XR_QUAL |
| 55. IS_TMFI | | |

---


## Step 2: Unpacking the data
---

The partitions come as tar.gz archive files. They can be opened on all current operating systems with the same terminal command.

- On Windows 10: open cmd.exe and run `tar xf partition1_instances.tar.gz`
- On Linux: in the terminal, run `tar xf partition1_instances.tar.gz`
- On Mac: in the terminal, run `tar xf partition1_instances.tar.gz`

These commands assume that you are in the directory that contains the tar.gz file and that you want to unpack it into that same directory. Look up the tar options if you want to do something else.

[Instruction Manual for Tar](https://man7.org/linux/man-pages/man1/tar.1.html)
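
The archive can also be unpacked from Python. The sketch below uses the standard `tarfile` module; the helper name `unpack_partition` and its `target_dir` parameter are assumptions for illustration, not part of the assignment.

```python
import tarfile
from pathlib import Path

def unpack_partition(archive_path: str, target_dir: str = ".") -> list:
    """Extract a .tar.gz archive into target_dir and return the member names."""
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    # "r:gz" opens the archive for reading with gzip decompression
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(target_dir)
    return names

# Example call, assuming the archive sits in the current directory:
# unpack_partition("partition1_instances.tar.gz")
```

Only extract archives from a source you trust, since `extractall` writes whatever paths the archive contains.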

---

## About the data
---

The __partition1__ directory contains two subdirectories, __FL__ and __NF__. These subdirectories represent the two classes of our target feature in the solar flare prediction problem we will be attempting to solve this semester.

- __FL__: Represents the multivariate time series samples that have a Solar Flare occur within 24 hours of the observation.
- __NF__: Represents the multivariate time series samples that do not have a Solar Flare occur within 24 hours of the observation.

Each multivariate time series sample is stored in its own .csv file. Each file name encodes several pieces of information that we want to keep for our prediction task, so they should be part of your Analytics Base Table. Below are examples of the naming for each sample type.

- __FL__ file name example: `M1.0@265:Primary_ar115_s2010-08-06T06:36:00_e2010-08-06T18:24:00.csv`
- __NF__ file name example: `FQ_ar99_s2010-08-01T19:00:00_e2010-08-02T06:48:00.csv` or `B1.9@909:Primary_ar325_s2011-01-04T02:36:00_e2011-01-04T14:24:00.csv`

Let's look at these formats, starting with those that contain an `@` symbol (we will use the __FL__ file as an example, but note that the __NF__ data also has files with this naming):
- __M1.0@265:Primary__: An M1.0-class flare occurs within 24 hours of our sample. The flare is number 265 in the integrated flare dataset that accompanies this dataset as a supplementary file. "Primary" indicates that the intersection with this active region was verified through the primary method described in the paper.
- __\_ar115__: This indicates which active region (`_ar`) the sample comes from in the original unsampled dataset.
- __\_s2010-08-06T06:36:00__: This is the start time (`_s`) of the sample.
- __\_e2010-08-06T18:24:00__: This is the end time (`_e`) of the sample.

The files that don't contain the `@` symbol begin with `FQ` and do not have any flare occurring within 24 hours of the sample in the file. __Note__ that both the __FL__ and __NF__ directories contain files that have flares within 24 hours, but the __NF__ ones are smaller flares that we treat as unimportant, so they fall into the non-flaring class.
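
The naming scheme described above can be checked with a short parser. This is a minimal sketch, assuming every file name follows the `<label>_ar<number>_s<start>_e<end>.csv` layout of the three examples; the function name `parse_sample_name` and the dictionary keys are hypothetical choices made for illustration.

```python
import re
from datetime import datetime

# Assumed layout: "<label>_ar<number>_s<start>_e<end>.csv"
NAME_PATTERN = re.compile(
    r"^(?P<label>.+?)_ar(?P<region>\d+)_s(?P<start>[^_]+)_e(?P<end>[^_]+)\.csv$"
)
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"

def parse_sample_name(file_name: str) -> dict:
    """Split a sample file name into the pieces described above."""
    match = NAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"Unexpected file name: {file_name}")
    label = match.group("label")
    info = {
        "active_region": int(match.group("region")),
        "start_time": datetime.strptime(match.group("start"), TIME_FORMAT),
        "end_time": datetime.strptime(match.group("end"), TIME_FORMAT),
        # Defaults for flare-quiet (FQ) files, which carry no flare details
        "flare_type": "FQ", "flare_class": None, "flare_id": None, "method": None,
    }
    if "@" in label:
        # e.g. "M1.0@265:Primary" -> class "M1.0", id 265, method "Primary"
        flare_class, rest = label.split("@", 1)
        flare_id, method = rest.split(":", 1)
        info.update(flare_type=flare_class[0], flare_class=flare_class,
                    flare_id=int(flare_id), method=method)
    return info

print(parse_sample_name("FQ_ar99_s2010-08-01T19:00:00_e2010-08-02T06:48:00.csv"))
```

Raising an error on an unexpected name makes a stray file in the directory visible immediately instead of producing a silently wrong row.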

---


## Reading the flare and non-flare data 
---

Now that you have an understanding of the data, you will develop a function that reads the flaring and non-flaring data and returns an object containing the data from the csv file and some of the information contained in the file name.

Below is the object you will return.

Notice that it takes in several objects: a [string](https://docs.python.org/3/library/string.html), two [datetime](https://docs.python.org/3/library/datetime.html) objects, and a [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) object.


In [None]:
from pandas import DataFrame
from datetime import datetime

In [None]:
class MVTSSample:
    
    def __init__(self, flare_type:str, start_time:datetime, end_time:datetime, data:DataFrame):
        self._flare_type = flare_type
        self._start_time = start_time
        self._end_time = end_time
        self._data = data
    
    def get_flare_type(self):
        return self._flare_type
    
    def get_start_time(self):
        return self._start_time
    
    def get_end_time(self):
        return self._end_time
    
    def get_data(self):
        return self._data

### About the MVTSSample class
---

The above class represents the data contained in one file. You are to return one of these objects for each call to your method(s). 

- The __flare_type__ is to be one of the following selections (__X__, __M__, __C__, __B__, __FQ__), and these labels will be derived from the information in the file name. The label __FQ__ should be assigned manually to file names starting with the character `F`.
- __start_time__ is the start time in the file name
- __end_time__ is the end time in the file name
- __data__ is a `Pandas DataFrame` which you will load from the csv using the [pandas.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) method.  

---

### About your method
---

Your method is to take in the path and name of the file to open, and it is to return one `MVTSSample` for that file.

Below is a definition for that method. Use it and write the code that completes the tasks necessary to return the specified information. You can use a method call in another code block to test that your method works as required.

Some useful methods/functions to use for this question are:

* [str.find](https://www.w3schools.com/python/ref_string_find.asp)

* [datetime.strptime](https://docs.python.org/3/library/datetime.html#strftime-strptime-behavior)

* [pandas.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) (__Note:__ the csv files are tab delimited so you will need to use `sep="\t"` to read them properly.)

* [os.path.join](https://docs.python.org/3/library/os.path.html#os.path.join)
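
To see why the `sep="\t"` argument matters, here is a small sketch that reads the same tab-delimited text with and without it. The two data rows and the path components are made up for illustration and are not taken from the dataset.

```python
import io
import os
import pandas as pd

# os.path.join builds the path with the separator of the current system
# (the directory and file names here are hypothetical)
path = os.path.join("data", "partition1", "FL", "sample.csv")

# Two rows of made-up, tab-delimited values standing in for a sample file
text = (
    "Timestamp\tTOTUSJH\tTOTBSQ\n"
    "2010-08-06 06:36:00\t1.5\t2.5\n"
    "2010-08-06 06:48:00\t2.0\t3.0\n"
)

wrong = pd.read_csv(io.StringIO(text))            # default sep="," gives one column
right = pd.read_csv(io.StringIO(text), sep="\t")  # three separate columns

print(wrong.shape, right.shape)
```

If your DataFrame ends up with a single wide column, the separator is the first thing to check.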

In [None]:
import os
import pandas as pd

In [None]:
def read_mvts_instance(data_dir:str, file_name:str) -> MVTSSample:
    # Get flare type from file name: 'FQ' for names starting with 'F',
    # otherwise the leading flare class letter (X, M, C, or B)
    flare_type = "FQ" if file_name.startswith("F") else file_name[0]

    # Get start time from file name (the 19 characters that follow '_s')
    start = file_name.find('_s')
    start_time = datetime.strptime(file_name[start+2: start+21], "%Y-%m-%dT%H:%M:%S")

    # Get end time from file name (the 19 characters that follow '_e')
    end = file_name.find('_e')
    end_time = datetime.strptime(file_name[end+2: end+21], "%Y-%m-%dT%H:%M:%S")

    # Get data from the tab-delimited csv file
    data = pd.read_csv(os.path.join(data_dir, file_name), sep="\t")

    # Make mvts object
    mvts = MVTSSample(flare_type, start_time, end_time, data)
    return mvts

In [None]:
data_dir = "data/partition1/FL"  # change the path to where your data is stored.
file_name = "M1.0@265:Primary_ar115_s2010-08-06T06:36:00_e2010-08-06T18:24:00.csv"
results = read_mvts_instance(data_dir, file_name)

## Processing the DataFrame 
---

Now that you can read individual files to get the multivariate time series for a sample period, it is time to start building the analytics base table (ABT).

The machine learning methods that we will cover are generally applied to tabular data with a set of descriptive features that are used to learn to classify or predict a target feature. To accomplish this with our raw input multivariate time series, we must produce a set of descriptive features from each of the variates of the time series.

In this section you will process the DataFrame that was returned from your `read_mvts_instance` method to construct a set of descriptive features for each MVTS sample.

---

### DataFrame Attributes:

Above, you saw the attributes stored in each multivariate time series sample file. All of those columns should be present in the DataFrame returned by your `read_mvts_instance` method. For the next question, however, we will use only a fraction of them. The method description below gives you more information about which ones we will use.

---

### About your method
---
We will compute descriptive features for the following variates.

|              |                  |             |
|--------------|------------------|-------------|
| 1. R_VALUE | 2. TOTUSJH | 3. TOTBSQ |
| 4. TOTPOT | 5. TOTUSJZ | 6. ABSNJZH |
| 7. SAVNCPP | 8. USFLUX | 9. TOTFZ |
| 10. MEANPOT | 11. EPSZ | 12. MEANSHR |
| 13. SHRGT45 | 14. MEANGAM | 15. MEANGBT |
| 16. MEANGBZ | 17. MEANGBH | 18. MEANJZH |
| 19. TOTFY | 20. MEANJZD | 21. MEANALP |
| 22. TOTFX | | |

For each of these variates you will calculate two descriptive features: 

- Median 
- Standard Deviation

Note:
* Computing these 2 descriptive features on the 22 variates listed above should yield a dataframe of 44 columns. Make sure your implementation of `calculate_descriptive_features` has all those columns. We will add more later, but for now, this will be sufficient to demonstrate the analytics base table construction process.
* The column names of your new dataframe should have both the variate name and the descriptive feature name (e.g., `TOTPOT_MEDIAN`).

Below is a function definition. Complete it to return the information specified above. You can use a method call in another code block to test that your method works as required.

Some useful methods/functions for this question are:

* [numpy.median](https://numpy.org/doc/stable/reference/generated/numpy.median.html)

* [numpy.std](https://numpy.org/doc/stable/reference/generated/numpy.std.html)

* [pandas.DataFrame.to_numpy](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_numpy.html#pandas.DataFrame.to_numpy) (__Note:__ this should be used to get your selected column into a format that the numpy functions above require.)
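
The sketch below applies these functions to one made-up column. Note that `numpy.std` computes the population standard deviation by default (`ddof=0`), while the pandas `Series.std` method defaults to the sample standard deviation (`ddof=1`), so the two give different numbers unless `ddof` is set explicitly. The values in `toy` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for one variate of a sample
toy = pd.DataFrame({"TOTPOT": [1.0, 2.0, 3.0, 4.0]})
values = toy["TOTPOT"].to_numpy()

median = np.median(values)           # middle of the sorted values
std_population = np.std(values)      # ddof=0, divides by n
std_sample = np.std(values, ddof=1)  # divides by n - 1, like toy["TOTPOT"].std()

# One row with columns named "<variate>_<feature>"
features = pd.DataFrame(
    [{"TOTPOT_MEDIAN": median, "TOTPOT_STDDEV": std_population}]
)
print(features)
```

Whichever convention you choose, use the same one for every variate so that the features stay comparable.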


In [None]:
import numpy as np

In [None]:
def calculate_descriptive_features(data:DataFrame) -> DataFrame:
    variates_to_calc_on = [ 'R_VALUE','TOTUSJH','TOTBSQ','TOTPOT','TOTUSJZ','ABSNJZH','SAVNCPP',
                           'USFLUX','TOTFZ','MEANPOT','EPSZ','MEANSHR','SHRGT45','MEANGAM','MEANGBT',
                           'MEANGBZ','MEANGBH','MEANJZH','TOTFY','MEANJZD','MEANALP','TOTFX']
    features_to_return = [ 'R_VALUE_MEDIAN','R_VALUE_STDDEV',
                          'TOTUSJH_MEDIAN','TOTUSJH_STDDEV',
                          'TOTBSQ_MEDIAN','TOTBSQ_STDDEV',
                          'TOTPOT_MEDIAN','TOTPOT_STDDEV',
                          'TOTUSJZ_MEDIAN','TOTUSJZ_STDDEV',
                          'ABSNJZH_MEDIAN','ABSNJZH_STDDEV',
                          'SAVNCPP_MEDIAN','SAVNCPP_STDDEV',
                          'USFLUX_MEDIAN','USFLUX_STDDEV',
                          'TOTFZ_MEDIAN','TOTFZ_STDDEV',
                          'MEANPOT_MEDIAN','MEANPOT_STDDEV',
                          'EPSZ_MEDIAN','EPSZ_STDDEV',
                          'MEANSHR_MEDIAN','MEANSHR_STDDEV',
                          'SHRGT45_MEDIAN','SHRGT45_STDDEV',
                          'MEANGAM_MEDIAN','MEANGAM_STDDEV',
                          'MEANGBT_MEDIAN','MEANGBT_STDDEV',
                          'MEANGBZ_MEDIAN','MEANGBZ_STDDEV',
                          'MEANGBH_MEDIAN','MEANGBH_STDDEV',
                          'MEANJZH_MEDIAN','MEANJZH_STDDEV',
                          'TOTFY_MEDIAN','TOTFY_STDDEV',
                          'MEANJZD_MEDIAN','MEANJZD_STDDEV',
                          'MEANALP_MEDIAN','MEANALP_STDDEV',
                          'TOTFX_MEDIAN','TOTFX_STDDEV']
    # Compute the median and standard deviation of each variate, in the
    # same order as the column names in features_to_return
    row = []
    for variate in variates_to_calc_on:
        values = data[variate].to_numpy()
        row.append(np.median(values))
        row.append(np.std(values))

    # Return a single-row DataFrame with the 44 named feature columns
    return pd.DataFrame([row], columns=features_to_return)

In [None]:
calculate_descriptive_features(results.get_data()) # Use to test the calculate_descriptive_features function

---

## Putting the pieces together 

---

Now that you have the tools to read the data and process descriptive features, it is time to put this all together to produce an analytics base table for all of the data in Partition 1.

In this question, you shall construct a method that will process a partition by extracting features for each sample in both the __FL__ and __NF__ subdirectories of that partition. The extracted descriptive features (e.g., `TOTPOT_MEDIAN`) are to be placed into your analytics base table DataFrame as columns, with the addition of the `FLARE_TYPE` target feature.

Note:
* Your method should take in the partition location and assume that there will be __FL__ and __NF__ subdirectories to process.

* Remember that your analytics base table should contain 5 flare types (`X`, `M`, `C`, `B`, and `FQ`).

* Your method shall also take in the name under which to store the analytics base table. This should be the full file name, including the absolute or relative path to the location where the table is stored.

__Suggestion__: It would be a good idea to debug your function on a much smaller version of one partition (often called a "pet dataset") and run it on the entire Partition 1 only when you are confident that it is error-free.
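
One way to build such a pet dataset is to cap the number of files taken from each subdirectory. The following is a minimal sketch under that assumption; the helper name `list_sample_files` and its `limit_per_class` parameter are hypothetical and not part of the required method signature.

```python
import os

def list_sample_files(partition_location: str, limit_per_class=None) -> list:
    """Return (directory, file_name) pairs for the FL and NF samples."""
    samples = []
    for label_dir in ("FL", "NF"):
        sample_dir = os.path.join(partition_location, label_dir)
        # Keep only the .csv sample files, in a stable (sorted) order
        names = sorted(n for n in os.listdir(sample_dir) if n.endswith(".csv"))
        if limit_per_class is not None:
            names = names[:limit_per_class]
        samples.extend((sample_dir, name) for name in names)
    return samples

# Example call for a pet dataset of at most 50 files per class
# (the path is an assumption; use wherever you unpacked the partition):
# pet = list_sample_files("data/partition1", limit_per_class=50)
```

Each returned pair can be passed straight to a reader that takes a directory and a file name.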

Below you will find a method definition. Complete it to perform the tasks specified above. You can use a method call in another code block to test that your method works as required.


In [None]:
from os import listdir

In [None]:
def process_partition(partition_location:str, abt_name:str):
    abt_header = [ 'FLARE_TYPE', 'R_VALUE_MEDIAN','R_VALUE_STDDEV',
                          'TOTUSJH_MEDIAN','TOTUSJH_STDDEV',
                          'TOTBSQ_MEDIAN','TOTBSQ_STDDEV',
                          'TOTPOT_MEDIAN','TOTPOT_STDDEV',
                          'TOTUSJZ_MEDIAN','TOTUSJZ_STDDEV',
                          'ABSNJZH_MEDIAN','ABSNJZH_STDDEV',
                          'SAVNCPP_MEDIAN','SAVNCPP_STDDEV',
                          'USFLUX_MEDIAN','USFLUX_STDDEV',
                          'TOTFZ_MEDIAN','TOTFZ_STDDEV',
                          'MEANPOT_MEDIAN','MEANPOT_STDDEV',
                          'EPSZ_MEDIAN','EPSZ_STDDEV',
                          'MEANSHR_MEDIAN','MEANSHR_STDDEV',
                          'SHRGT45_MEDIAN','SHRGT45_STDDEV',
                          'MEANGAM_MEDIAN','MEANGAM_STDDEV',
                          'MEANGBT_MEDIAN','MEANGBT_STDDEV',
                          'MEANGBZ_MEDIAN','MEANGBZ_STDDEV',
                          'MEANGBH_MEDIAN','MEANGBH_STDDEV',
                          'MEANJZH_MEDIAN','MEANJZH_STDDEV',
                          'TOTFY_MEDIAN','TOTFY_STDDEV',
                          'MEANJZD_MEDIAN','MEANJZD_STDDEV',
                          'MEANALP_MEDIAN','MEANALP_STDDEV',
                          'TOTFX_MEDIAN','TOTFX_STDDEV']
    
    # Collect one row per sample, then build the DataFrame once at the end
    # (concatenating inside the loop gets slow as the table grows)
    rows = []
    for label_dir in ('FL', 'NF'):
        sample_dir = os.path.join(partition_location, label_dir)
        for file_name in sorted(os.listdir(sample_dir)):
            # Skip anything that is not a sample file
            if not file_name.endswith('.csv'):
                continue

            # Read the sample and compute its descriptive features
            mvs = read_mvts_instance(sample_dir, file_name)
            features = calculate_descriptive_features(mvs.get_data())

            # Row layout: flare type followed by the 44 descriptive features
            rows.append([mvs.get_flare_type()] + features.iloc[0].tolist())

    abt = pd.DataFrame(rows, columns=abt_header)

    # Store the analytics base table under the passed-in name
    abt.to_csv(abt_name, index=False)

    # return the completed analytics base table
    return abt

In [None]:
data_dir = "data/partition1"  # change the path to where your data is stored.
abt_name = "partition1_features.csv"  # file in which the analytics base table is stored
abt = process_partition(data_dir, abt_name)
print(abt)

Using the `pandas.DataFrame.describe` function, you can check a few things, including the total number of samples. Use that to ensure you have processed all MVTS samples of Partition 1.
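
As a small illustration of that check, the sketch below runs `describe` on a made-up three-row table. The `count` row reports non-missing values per numeric column, so it is worth comparing it with `len(...)`, which gives the number of rows regardless of missing values. The table `toy_abt` and its values are invented for this example.

```python
import pandas as pd

# Made-up stand-in for an analytics base table with three samples
toy_abt = pd.DataFrame({
    "FLARE_TYPE": ["M", "FQ", "FQ"],
    "TOTPOT_MEDIAN": [1.0, 2.0, 3.0],
    "TOTPOT_STDDEV": [0.1, 0.2, None],  # one missing value on purpose
})

summary = toy_abt.describe()                 # numeric columns only, by default
print(summary.loc["count"])                  # non-missing values per column
print(len(toy_abt))                          # total number of rows (samples)
print(toy_abt["FLARE_TYPE"].value_counts())  # samples per flare type
```

A `count` lower than the number of rows points to missing values in that feature, which is useful to know before training a model.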

---

## Visualizing the distribution of flares 
---

What does the distribution of our 5 flare classes look like? This is the question we want to answer using a simple visualization. You can use the [pyplot.bar](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html) function from the [matplotlib](https://matplotlib.org/stable/index.html) library. The x-axis of the plot should represent the flare types and the y-axis should represent the counts of samples.

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

# Create dictionary for counts of each flare type
flare_types = {"X": 0, "M": 0, "C": 0, "B": 0, "FQ": 0}

# Sort each sample into X, M, C, B, or FQ; skip anything else
for flare_type in abt['FLARE_TYPE']:
    # Labels starting with 'F' are flare-quiet; otherwise the first letter is the class
    label = "FQ" if flare_type.startswith("F") else flare_type[0]
    if label in flare_types:
        flare_types[label] += 1

# Bar chart: flare types on the x-axis, sample counts on the y-axis
plt.bar(list(flare_types.keys()), list(flare_types.values()))
plt.xlabel("Flare type")
plt.ylabel("Number of samples")
