# Pandas

## Introduction <a name="introduction"></a>

The way we control, manipulate, and analyze data is important across all scientific disciplines, and exoplanet research is no different. The astronomical community has generally rallied behind **Python** (if you couldn't already tell from our use of Jupyter notebooks!) as the primary language for developing new data structures, computational methods, and robust analytical packages. These are amazing tools, but using them efficiently requires a solid foundation. This lesson gives a brief overview of the important capabilities of `pandas`, some interactive activities to visualize data, and how these tools are used in exoplanet research.

 <div class="alert alert-block alert-warning">

**IMPORTANT**: If you have limited exposure to or practice with Python, we encourage you to first learn its basic syntax and the proper use of variables. [Here](https://programming-23.mooc.fi/) is a self-paced, free course offered by the University of Helsinki that covers most of the important concepts in the first few modules.

</div>

## Overview

`pandas` is a Python library for data manipulation and analysis whose primary data structures are `Series` and `DataFrames`. `Series` and `DataFrames` are similar to 1D and 2D `NumPy` arrays, respectively, but support a broader range of functionality at the cost of more complex syntax. By the end of this lesson, you should be able to:
- **Create, manipulate, and analyze data using `DataFrames` and `Series`**
- **Use `pandas` to import and export various types of data**
- **Visualize and interpret different types of data using `pandas` methods**

[Here](https://pandas.pydata.org/docs/user_guide/index.html) is the official documentation for `pandas`!

To start, we generally import `NumPy` and `pandas` together.

In [3]:
import numpy as np
import pandas as pd
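
As a rough sketch of the 1D/2D analogy from the overview, the block below wraps a 1D and a 2D `NumPy` array in a `Series` and a `DataFrame`. The names `radius`, `a`, and `b` are hypothetical labels chosen only for illustration.

```python
import numpy as np
import pandas as pd

# A 1D array becomes a Series; 'radius' is a hypothetical name for illustration
arr1d = np.array([1.0, 2.0, 3.0])
ser = pd.Series(arr1d, name='radius')

# A 2D array becomes a DataFrame; the column labels 'a' and 'b' are hypothetical
arr2d = np.arange(6).reshape(3, 2)
df = pd.DataFrame(arr2d, columns=['a', 'b'])

# Both structures report their shape the same way NumPy arrays do
print(ser.shape, df.shape)
```

Both objects add an index on top of the underlying array, which is what most of the extra functionality in the rest of this lesson relies on.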

## Series
[Here](https://pandas.pydata.org/docs/reference/series.html) is the official documentation for `Series`.
### Basic Properties
`Series` are 1D data structures that are very similar to 1D `NumPy` arrays, with a few key differences. `Series`:
- Can store mixed data types (the `dtype` then becomes `object`)
- Are size immutable (fixed size when created), value mutable (elements can change)
- Are indexed, and custom indexes can be passed
- Can do operations between different `Series`, which do **not** need to be the same size because values are matched by index label

We can explore basic properties in the code below.

In [96]:

#Construct the Series by calling pd.Series and building a list inside
data = pd.Series([0, 0.1, 0.4, 0.7, 10, 100, 'Test'])
print(data, '\n')
#Note the two columns when printed: the first is the index column, the second is our series. 
#See how we can have ints, floats, and strings!

#Can be indexed and sliced
print(data[5], '\n', data[4:], '\n')

#Can customize the indices and call that index to get the corresponding values
#Similar to the key:value structure of Python dictionaries. Positional access is still
#available through .iloc; integer keys in plain brackets (e.g. indexed_data[0]) are
#deprecated for label-indexed Series in recent pandas releases.
indexed_data = pd.Series([15, 'Exo', 'Planets', 100], index=['a', 'b', 'c', 'd'])
if indexed_data.iloc[0] == indexed_data['a']:
    print('Identical index', '\n')
else:
    print('Not identical')

#You can also construct Series with custom indices by passing a dictionary.
#Note: If you also pass index= to pd.Series, only the dictionary keys that match it are kept,
#and index labels missing from the dictionary are filled with NaN.
example_dict = {'a': 1, 'b': 2, 'c': 100}
ser = pd.Series(data=example_dict)
print(ser['a'], ser['b'], ser.iloc[2], '\n')

#Can't use positional indexing with plain brackets if the passed index contains an integer; use the .iloc indexer instead
indexed_data2 = pd.Series([15, 'Exo', 'Planets', 100], index=[1, 'b', 'c', 'd'])
try:
    if indexed_data2[0] == indexed_data2[1]:
        print('Identical Index')
    else:
        print('Not identical')
except KeyError:
    print('Cannot use default index if integers are assigned as a custom index!')

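#Series.reindex conforms a Series to a new set of labels (reordering, dropping, or adding NaN rows).
#It returns a new Series, so reassign the result to keep it.
#You can also reset the index to the default by using Series.reset_index
indexed_data = indexed_data.reindex(index=['a', 'b', 'c', 'd'])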
print(indexed_data, '\n')
reset_data = indexed_data.reset_index(drop = True)
print(reset_data)


0       0
1     0.1
2     0.4
3     0.7
4      10
5     100
6    Test
dtype: object 

100 
 4      10
5     100
6    Test
dtype: object 

Identical index 

1 2 100 

Cannot use default index if integers are assigned as a custom index!
a         15
b        Exo
c    Planets
d        100
dtype: object 

0         15
1        Exo
2    Planets
3        100
dtype: object
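
A short sketch of how a dictionary and an explicit `index` interact when both are passed to `pd.Series`. The label `'z'` is a hypothetical one, chosen because it is not a key in the dictionary.

```python
import pandas as pd

example_dict = {'a': 1, 'b': 2, 'c': 100}

# 'z' is a hypothetical label that is not a key in example_dict
partial = pd.Series(example_dict, index=['c', 'a', 'z'])
print(partial)

# .loc selects by label and .iloc by position; here both reach the same element
print(partial.loc['c'], partial.iloc[0])
```

The key `'b'` is dropped because it is not in the requested index, and `'z'` comes out as `NaN`, which also turns the integers into floats.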


### Series Methods
Now that we have seen how the basic syntax of `Series` compares to Python lists, dicts, and `NumPy` arrays, we will highlight some of the methods available to `Series`. [This page](https://pandas.pydata.org/docs/reference/api/pandas.Series.html#pandas.Series) lists all of them.

In [100]:
example = pd.Series([-10, 4, -0.44444, 18])
example2 = pd.Series([12, 14, 200, 3, -12, 72, np.nan])
example3 = pd.Series([2, 4, 6, 8])
#Returns the absolute value of each element. Works on numeric data; a Series of strings raises a TypeError
#NOTE: These methods return a new Series; the original is unchanged unless you reassign the result
print(example.abs(), '\n')

#Addition and subtraction of Series, including two Series of different lengths.
#Values are matched by index label, so the result covers every label found in either series.
#Labels that exist in only one series (here, those past the end of the shorter one)
#have no partner value and come out as NaN.
example4 = example.add(example2)
example5 = example.sub(example3)
print(example4, '\n\n', example5, '\n')

0    10.00000
1     4.00000
2     0.44444
3    18.00000
dtype: float64 

0      2.00000
1     18.00000
2    199.55556
3     21.00000
4          NaN
5          NaN
6          NaN
dtype: float64 

 0   -12.00000
1     0.00000
2    -6.44444
3    10.00000
dtype: float64 
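
The `NaN` rows in the sum above come from labels that have no partner in the shorter series. As a minimal sketch, the `fill_value` argument of `Series.add` substitutes a value for the missing side, and a second pair of series shows that alignment follows labels rather than positions. The names `short`, `long`, `a`, and `b` and the labels `'w'` to `'z'` are hypothetical, chosen for illustration.

```python
import numpy as np
import pandas as pd

short = pd.Series([-10, 4, -0.44444, 18])
long = pd.Series([12, 14, 200, 3, -12, 72, np.nan])

# Labels 4-6 exist only in `long`; fill_value=0 stands in for the missing side.
# Label 6 is missing in `short` and NaN in `long`, so it stays NaN.
filled = short.add(long, fill_value=0)
print(filled, '\n')

# Alignment is by label, not by position (hypothetical labels 'w' to 'z')
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['z', 'y', 'w'])
total = a + b
print(total)
```

Only `'y'` and `'z'` appear in both `a` and `b`, so those are the only labels in `total` with a numeric result.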



In [101]:
#Can combine operations using Series.agg
print(example2.agg(['min', 'max', 'mean', 'product']), '\n')

#Clip values to a floor/ceiling if they are below or above that threshold
print(example2.clip(lower = 0, upper = 10), '\n')

#Compare values of two series and highlight differences
print(example.compare(example3), '\n')

#Generate descriptive statistics with Series.describe. Series.info instead summarizes dtype, non-null count, and memory usage
print(example3.describe(), '\n')

min       -1.200000e+01
max        2.000000e+02
mean       4.816667e+01
product   -8.709120e+07
dtype: float64 

0    10.0
1    10.0
2    10.0
3     3.0
4     0.0
5    10.0
6     NaN
dtype: float64 

       self  other
0 -10.00000    2.0
2  -0.44444    6.0
3  18.00000    8.0 

count    4.000000
mean     5.000000
std      2.581989
min      2.000000
25%      3.500000
50%      5.000000
75%      6.500000
max      8.000000
dtype: float64 
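
The mean of about 48.17 reported by `agg` above is taken over six values, not seven, because pandas reductions skip `NaN` by default. A small sketch of that behavior, using the `skipna` argument and `Series.count` (the name `values` is just a stand-in for `example2`):

```python
import numpy as np
import pandas as pd

values = pd.Series([12, 14, 200, 3, -12, 72, np.nan])

# NaN is skipped by default, so the mean is taken over the six valid entries
print(values.mean())

# With skipna=False a single NaN makes the whole result NaN
print(values.mean(skipna=False))

# count() reports non-missing entries only, while len() counts every element
print(values.count(), len(values))
```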



In [110]:
#Element-wise comparisons between series: greater than or equal to (ge) and less than or equal to (le), respectively
print(example.ge(example3), example.le(example3), '\n')

#Masking methods.
#Series.mask replaces any element that meets the condition with the given value.
#Series.notnull returns a boolean Series that is False wherever a value is missing (NaN)
print(example2.mask(example2.isna(), 0), example2.notnull(), '\n')

#Can send contents of a Series to a list
listed_data = example2.tolist()
print(listed_data, '\n')

#Find the index label of the first element that matches a value
print(example2[example2 == 200].index[0], '\n')

0    False
1     True
2    False
3     True
dtype: bool 0     True
1     True
2     True
3    False
dtype: bool 

0     12.0
1     14.0
2    200.0
3      3.0
4    -12.0
5     72.0
6      0.0
dtype: float64 0     True
1     True
2     True
3     True
4     True
5     True
6    False
dtype: bool 

[12.0, 14.0, 200.0, 3.0, -12.0, 72.0, nan] 

2 
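
A few related calls, sketched under the assumption that the data matches `example2` above (the name `values` is a stand-in): `Series.fillna` as a shorthand for the `mask` pattern, `Series.where` as the inverse of `mask`, and boolean indexing to collect every matching label rather than only the first.

```python
import numpy as np
import pandas as pd

values = pd.Series([12, 14, 200, 3, -12, 72, np.nan])

# fillna(0) gives the same result as masking the NaN entries with 0
filled = values.fillna(0)
masked = values.mask(values.isna(), 0)
print(filled.equals(masked))

# where keeps elements that satisfy the condition and sets the rest to NaN
positive = values.where(values > 0)
print(positive, '\n')

# Boolean indexing returns every matching label, not just the first one
labels = values[values > 50].index.tolist()
print(labels)

# idxmax returns the label of the largest value, skipping NaN
print(values.idxmax())
```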

