# Pandas DataFrames

In [1]:
# First, we load the data we want to work with.
# The code below pulls together the process we stepped through in the previous part.

# Install a pip package in the current Jupyter kernel
import sys
!{sys.executable} -m pip install sunpy drms cdflib zeep h5netcdf matplotlib

from sunpy.net import Fido
from sunpy.net import attrs as a
from sunpy.timeseries import TimeSeries

date_range = a.Time('2021/07/01', '2021/07/08')
dataset = a.cdaweb.Dataset('SOLO_L2_MAG-RTN-NORMAL-1-MINUTE')
result = Fido.search(date_range, dataset)

downloaded_files = Fido.fetch(result[0, 0:2])
solo_mag = TimeSeries(downloaded_files, concatenate=True)
solo_mag_data = solo_mag.to_dataframe()
print(solo_mag_data.info())



Files Downloaded: 100%|██████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  4.80file/s]

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2880 entries, 2021-07-01 00:00:29.999998 to 2021-07-02 23:59:30.000001
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   B_RTN_0                 2880 non-null   float32
 1   B_RTN_1                 2880 non-null   float32
 2   B_RTN_2                 2880 non-null   float32
 3   QUALITY_BITMASK         2880 non-null   uint16 
 4   QUALITY_FLAG            2880 non-null   uint8  
 5   VECTOR_RANGE            2880 non-null   uint8  
 6   VECTOR_TIME_RESOLUTION  2880 non-null   float32
dtypes: float32(4), uint16(1), uint8(2)
memory usage: 78.8 KB
None





## Note about Pandas DataFrames/Series

A [DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) is a collection of [Series](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html). The DataFrame is the way Pandas represents a table, and a Series is the data structure Pandas uses to represent a column.

Pandas is built on top of the [NumPy](https://www.numpy.org/) library, which in practice means that most of the methods defined for NumPy arrays also apply to Pandas Series and DataFrames.

What makes Pandas so attractive is its powerful interface for accessing individual records of the table, its proper handling of missing values, and its support for relational-database-style operations between DataFrames.
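
As a minimal sketch of the relationship described above, the snippet below builds a small DataFrame from made-up values, checks that a single column is a Series, and applies a NumPy function to it. The frame `toy` and its column names `bx` and `by` are hypothetical and are not part of the Solar Orbiter data.

```python
import numpy as np
import pandas as pd

# Hypothetical values, for illustration only.
toy = pd.DataFrame({"bx": [1.0, 4.0, 9.0], "by": [2.0, np.nan, 6.0]})

# Selecting one column of a DataFrame gives a Series.
column = toy["bx"]
print(type(column).__name__)

# NumPy functions accept a Series and return a Series with the same index.
roots = np.sqrt(column)
print(roots.tolist())

# Pandas reductions skip missing values by default.
by_mean = toy["by"].mean()
print(by_mean)
```

Here the mean of `by` is computed from the two non-missing values only, which is the missing-value handling mentioned above.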

## Selecting values

To access the value at position `[i, j]` of a DataFrame, we have two options, depending on whether `i` and `j` are positions or labels. Remember that a DataFrame provides an _index_ as a way to identify the rows of the table; a row therefore has a _position_ inside the table as well as a _label_, which identifies its _entry_ in the DataFrame.

### Use `DataFrame.iloc[..., ...]` to select values by their (entry) position

We can specify a location by numerical index, analogous to a 2D version of character selection in strings.

In [2]:
print(solo_mag_data.iloc[0,0])

1.2058102


### Use `DataFrame.loc[..., ...]` to select values by their (entry) label.

We can specify a location by row and/or column name.

In [3]:
print(solo_mag_data.loc['2021-07-01 00:00:29.999998', 'B_RTN_0'])

1.2058102
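
To make the difference between position and label concrete without downloading any data, here is a small sketch on a hypothetical three-row table with a time index. The names `toy`, `bx`, and `by` are made up for illustration, and the label is passed as an explicit `pd.Timestamp` rather than a string to keep the lookup unambiguous.

```python
import pandas as pd

# Hypothetical three-row table with a time index, for illustration only.
index = pd.to_datetime(
    ["2021-07-01 00:00:30", "2021-07-01 00:01:30", "2021-07-01 00:02:30"]
)
toy = pd.DataFrame({"bx": [1.0, 2.0, 3.0], "by": [10.0, 20.0, 30.0]}, index=index)

# iloc: second row and first column, counted from zero.
by_position = toy.iloc[1, 0]

# loc: the same cell, addressed by its row label and column name.
by_label = toy.loc[pd.Timestamp("2021-07-01 00:01:30"), "bx"]

print(by_position, by_label)
```

Both expressions reach the same cell; only the way the cell is addressed differs.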


### Use `:` on its own to mean all columns or all rows.

Just like Python’s usual slicing notation.

In [None]:
print(solo_mag_data.loc['2021-07-01 00:00:29.999998', :])

We would get the same result by printing `solo_mag_data.loc['2021-07-01 00:00:29.999998']` (without a second index).

In [None]:
print(solo_mag_data.loc[:, 'B_RTN_0'])

- We would get the same result by printing `solo_mag_data['B_RTN_0']`.
- We would also get the same result by printing `solo_mag_data.B_RTN_0` (not recommended, because it is easily confused with the `.` notation for methods).
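
The equivalence of these three spellings, and one reason to avoid attribute access, can be checked on a small made-up frame. The frames `toy` and `clash` and their column names are hypothetical; this is a sketch, not part of the original analysis.

```python
import pandas as pd

# Hypothetical values, for illustration only.
toy = pd.DataFrame({"bx": [1.0, 2.0], "by": [3.0, 4.0]}, index=["r0", "r1"])

via_loc = toy.loc[:, "bx"]
via_brackets = toy["bx"]
via_attribute = toy.bx

# All three spellings return the same column.
print(via_loc.equals(via_brackets), via_loc.equals(via_attribute))

# Attribute access breaks down when a column name clashes with a method name.
clash = pd.DataFrame({"max": [1.0, 2.0]})
print(callable(clash.max))     # this is the DataFrame.max method, not the column
print(clash["max"].tolist())   # brackets always reach the column
```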

### Select multiple columns or rows using `DataFrame.loc` and a named slice.

In [4]:
print(solo_mag_data.loc['2021-07-01 00:01:30.000004':'2021-07-01 00:04:30.000004', 'B_RTN_0':'B_RTN_2'])

                             B_RTN_0   B_RTN_1   B_RTN_2
EPOCH                                                   
2021-07-01 00:01:30.000004  0.410739  9.685264 -0.023507
2021-07-01 00:02:30.000001  0.758328  9.928966 -2.149603
2021-07-01 00:03:29.999997 -0.541748  9.544032 -3.647228
2021-07-01 00:04:30.000004  3.109208  8.734163  3.252406


The output above shows that **slicing using `loc` is inclusive at both ends**, which differs from **slicing using `iloc`**, where the slice covers everything up to but not including the final index.
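
The two slicing conventions can be compared side by side on a small frame. The frame `toy` and its single-letter row labels are hypothetical, chosen so that the difference is easy to see.

```python
import pandas as pd

# Hypothetical values, for illustration only.
toy = pd.DataFrame(
    {"bx": [0.0, 1.0, 2.0, 3.0, 4.0]}, index=["a", "b", "c", "d", "e"]
)

# Label-based slice: the end label "d" is included.
label_slice = toy.loc["b":"d"]

# Position-based slice: the end position 3 is excluded, as in ordinary Python slicing.
position_slice = toy.iloc[1:3]

print(label_slice.index.tolist())     # ['b', 'c', 'd']
print(position_slice.index.tolist())  # ['b', 'c']
```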

## Result of slicing can be used in further operations.

- Usually don’t just print a slice.
- All the statistical operators that work on entire dataframes work the same way on slices.
- E.g., calculate max of a slice.

In [None]:
print(solo_mag_data.loc['2021-07-01 00:01:30.000004':'2021-07-01 00:04:30.000004', 'B_RTN_0':'B_RTN_2'].max())

In [None]:
print(solo_mag_data.loc['2021-07-01 00:01:30.000004':'2021-07-01 00:04:30.000004', 'B_RTN_0':'B_RTN_2'].min())

## Use comparisons to select data based on value.

- Comparison is applied element by element.
- Returns a similarly-shaped dataframe of `True` and `False`.

In [None]:
# Use a subset of data to keep output readable.
subset = solo_mag_data.loc['2021-07-01 00:01:30.000004':'2021-07-01 00:04:30.000004', 'B_RTN_0':'B_RTN_2']
print('Subset of data:\n', subset)

# Which values are positive?
print('\nWhere are values positive?\n', subset > 0)

### Select values or NaN using a Boolean mask.

A frame full of Booleans is sometimes called a mask because of how it can be used.

In [None]:
mask = subset > 0
print(subset[mask])

- Get the value where the mask is true, and NaN (Not a Number) where it is false.
- Useful because NaNs are ignored by operations like max, min, average, etc.

In [None]:
print(subset[subset > 0].describe())
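
To see what "ignored" means in practice, the sketch below masks a small made-up frame and compares the default mean with one that is told not to skip NaN. The frame `toy` and its values are hypothetical. `skipna` is a standard parameter of Pandas reductions.

```python
import pandas as pd

# Hypothetical values, for illustration only.
toy = pd.DataFrame({"bx": [-1.0, 2.0, 4.0], "by": [3.0, -5.0, 7.0]})

# Entries that fail the comparison become NaN.
masked = toy[toy > 0]
print(masked)

# By default, reductions skip NaN, so each mean uses the positive values only.
positive_mean = masked.mean()
print(positive_mean)

# With skipna=False, a single NaN makes the column's result NaN.
strict_mean = masked.mean(skipna=False)
print(strict_mean)
```

Indexing with a Boolean frame gives the same result as `toy.where(toy > 0)`, which some readers may find more explicit.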

## Key Points

- Use `DataFrame.iloc[..., ...]` to select values by integer location.
- Use `DataFrame.loc[..., ...]` to select values by label.
- Use `:` on its own to mean all columns or all rows.
- Select multiple columns or rows using `DataFrame.loc` and a named slice.
- Result of slicing can be used in further operations.
- Use comparisons to select data based on value.
- Select values or NaN using a Boolean mask.