# Reading Tabular Data into DataFrames

## Getting data from CDAWeb using `sunpy`

The Coordinated Data Analysis Web (CDAWeb) stores data from current and past space physics missions, and is full of heliospheric in situ datasets.

First, we need to install `sunpy` and a couple of other dependencies. In most Python environments the command would be `pip install <module>`, but we need to modify that slightly for it to work correctly in a Jupyter Notebook.

In [None]:
# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install sunpy drms cdflib zeep h5netcdf

Now, we can import the modules we need from `sunpy`:

In [None]:
from sunpy.net import Fido
from sunpy.net import attrs as a
from sunpy.timeseries import TimeSeries

`sunpy.net.Fido` is the primary interface for searching for and downloading data, and it will automatically search CDAWeb when the `cdaweb.Dataset` attribute is provided to the search. To look up the different dataset IDs available, you can use the form at https://cdaweb.gsfc.nasa.gov/index.html/.

In [None]:
date_range = a.Time('2021/07/01', '2021/07/08')
dataset = a.cdaweb.Dataset('SOLO_L2_MAG-RTN-NORMAL-1-MINUTE')
result = Fido.search(date_range, dataset)

Let's inspect the results. We can see that there are seven files, one for each day within the query.

In [None]:
print(result)

We have something that looks a bit like a list of files from different providers. In our particular case, there is only one provider, so we can get the files from that:

In [None]:
print(result[0])

Using a higher index results in an error, because there are no files from any other providers.

In [None]:
print(result[1])

We can look at the individual files in the set:

In [None]:
print(result[0,0])

In [None]:
print(result[0,1])

We can use a slice to view a subset of the files:

In [None]:
print(result[0,0:2])

We can use `Fido.fetch()` to download the contents of the specified files:

In [None]:
downloaded_files = Fido.fetch(result[0, 0:2])

We can then concatenate the contents of those files into a more readily usable form using `TimeSeries`:

In [None]:
solo_mag = TimeSeries(downloaded_files, concatenate=True)

Looking at the type of `solo_mag` we can see that it is of a type which is defined within sunpy. We can use `help()` to find out a bit more about it.

In [None]:
print(type(solo_mag))

In [None]:
import sunpy.timeseries
help(sunpy.timeseries.timeseriesbase.GenericTimeSeries)

One of the methods on `GenericTimeSeries` is `to_dataframe()`, which returns the data contained in the timeseries as a _Pandas dataframe_.

Pandas is a very widely used data analysis library.

In [None]:
df = solo_mag.to_dataframe()
type(df)

A [DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) is a collection of [Series](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html). The DataFrame is the way Pandas represents a table, and a Series is the data structure Pandas uses to represent a column.
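
To make the table/column relationship concrete, here is a minimal sketch that builds a small DataFrame by hand. The column names (`b_r`, `b_t`) and the values are hypothetical, chosen only for illustration; they are not the columns returned by `to_dataframe()` above.

```python
import pandas as pd

# Hypothetical values and column names, used only to illustrate the structure.
table = pd.DataFrame(
    {
        "b_r": [1.5, -0.3, 2.1],
        "b_t": [0.2, 0.4, -1.0],
    },
    index=pd.to_datetime(
        ["2021-07-01 00:00", "2021-07-01 00:01", "2021-07-01 00:02"]
    ),
)

# Selecting one column returns a Series that shares the table's index.
column = table["b_r"]

print(type(table).__name__)
print(type(column).__name__)
print(column.index.equals(table.index))
```

Each column of `table` is a Series, and all of them are aligned on the same index, which is what lets Pandas treat them together as one table.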

Pandas is built on top of the [NumPy](https://www.numpy.org/) library, which in practice means that most of the methods defined for NumPy arrays apply to Pandas Series/DataFrames.
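
As a rough illustration of that NumPy compatibility, the sketch below applies a NumPy function to Pandas columns and converts a DataFrame back to a plain array. The column names `x` and `y` and their values are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical vector components, in arbitrary units.
components = pd.DataFrame({"x": [3.0, 6.0, 0.0], "y": [4.0, 8.0, 5.0]})

# NumPy functions work element-wise on Series and return a Series,
# so the index is preserved.
magnitude = np.sqrt(components["x"] ** 2 + components["y"] ** 2)

# Reductions familiar from NumPy are available as methods;
# on a DataFrame they are applied per column by default.
column_means = components.mean()

# The underlying values can be extracted as a NumPy array when needed.
raw = components.to_numpy()

print(magnitude)
print(column_means)
print(raw.shape)
```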

What makes Pandas so attractive is its powerful interface for accessing individual records of a table, its proper handling of missing values, and its support for relational-database-style operations between DataFrames.
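
The three features just mentioned can be sketched on a pair of small, synthetic tables. Everything here is an assumption made for illustration: the column names `b_total` and `v`, the values, and the one-minute time index are not taken from the Solar Orbiter files downloaded above.

```python
import numpy as np
import pandas as pd

times = pd.date_range("2021-07-01", periods=4, freq="min")

# Hypothetical measurements; NaN marks a missing sample.
field = pd.DataFrame({"b_total": [5.0, np.nan, 7.0, 6.0]}, index=times)
speed = pd.DataFrame({"v": [400.0, 410.0, 405.0]}, index=times[:3])

# Record access: look up a single value by row label and column name.
first_value = field.loc[times[0], "b_total"]

# Missing values: reductions skip NaN by default,
# and gaps can be dropped or filled by linear interpolation.
mean_field = field["b_total"].mean()
complete = field.dropna()
filled = field.interpolate()

# Relational-style operation: an inner join on the shared time index
# keeps only the timestamps present in both tables.
joined = field.join(speed, how="inner")

print(first_value)
print(mean_field)
print(joined)
```

Note that the join keeps the row with the missing `b_total` value, since the timestamp itself exists in both tables; whether to drop or fill such gaps before combining tables depends on the analysis.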