# Pandas Basics

In [None]:
import pandas as pd
import numpy as np
pd.__version__

## Pandas Series 

In [None]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data

In [None]:
data.values

In [None]:
data.index

A Pandas `Series` behaves much like a Python dictionary: it maps index labels to values.

In [None]:
data_dict = {'a': 0.25, 'b': 0.5, 'c': 0.75, 'd': 1.0}
data = pd.Series(data_dict)
data

A `Series` can be viewed as a dictionary with typed values and typed indices. The typing makes many operations more efficient than on a plain Python dictionary.
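
The dictionary analogy can be sketched as follows. This is a minimal illustration that reuses the values from the cell above; the variable names `has_b`, `keys`, and `value_dtype` are introduced here only for the example.

```python
import pandas as pd

# Same values as the Series defined above
data = pd.Series({'a': 0.25, 'b': 0.5, 'c': 0.75, 'd': 1.0})

# Dictionary-like behaviour: membership tests and key lookup use the index
has_b = 'b' in data
keys = list(data.keys())

# Unlike a dict, all values share a single dtype
value_dtype = data.dtype

# Adding a new key works like dict assignment
data['e'] = 1.25
```

Here the membership test and `keys()` operate on the index labels, while `dtype` describes the values.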

## Pandas DataFrame 

In [None]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                          'New York': 141297, 'Florida': 170312,
                          'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                         'New York': 19651127, 'Florida': 19552860,
                         'Illinois': 12882135})

In [None]:
table = pd.DataFrame({'area':area, 'pop':pop})
table

In [None]:
table.index

In [None]:
table.columns

In [None]:
type(table.values)
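
A `DataFrame` built from several `Series` aligns them on their index labels. The sketch below assumes a deliberately mismatched pair of `Series` (a subset of the values above, with the hypothetical name `toy`) to show what happens when a label is missing from one of them.

```python
import numpy as np
import pandas as pd

# Toy data: the two Series do not share all index labels
area = pd.Series({'California': 423967, 'Texas': 695662})
pop = pd.Series({'Texas': 26448193, 'New York': 19651127})

# Columns are aligned on the union of the index labels;
# labels missing from one Series are filled with NaN
toy = pd.DataFrame({'area': area, 'pop': pop})

n_rows = len(toy)                         # one row per label in the union
missing_area = toy['area'].isna().sum()   # New York has no area value
is_array = isinstance(toy.values, np.ndarray)
```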

## Indexing & Selection

In [None]:
data

In [None]:
data['b':'d']
# explicit index; final index included

In [None]:
data[0:2]
# implicit (positional) index in a slice; final position excluded

In [None]:
data[(data > 0.3) & (data < 0.8)]
# masking

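Integer indices can cause confusion: plain indexing (`data_2[1]`) uses the explicit labels, while slicing (`data_2[1:3]`) uses the implicit positions.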

In [None]:
data_2 = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data_2


In [None]:
data_2[1]

In [None]:
data_2[1:3]

To make the choice explicit, we use `loc` (label-based) and `iloc` (position-based).

In [None]:
data_2.loc[1:3]

In [None]:
data_2.iloc[1:3]

Explicit is better than implicit.
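
The difference between the two accessors can be checked directly. This is a small sketch on the same `data_2` series; the names `by_label` and `by_position` are introduced only for illustration.

```python
import pandas as pd

data_2 = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc: label-based; the slice includes both endpoints (labels 1 and 3)
by_label = data_2.loc[1:3]

# iloc: position-based; the slice excludes the end (positions 1 and 2)
by_position = data_2.iloc[1:3]
```

`by_label` holds the entries labelled 1 and 3, whereas `by_position` holds the second and third entries (labels 3 and 5).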

In [None]:
table

In [None]:
table.iloc[:3, :2]

In [None]:
table.loc[:'Illinois', :'pop']

In [None]:
table['area']

In [None]:
table[['area']]

In [None]:
table[table['pop'] > 20000000]
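
Masking and column selection can also be combined in a single `loc` call. The following is one possible sketch, rebuilding the same table so that it runs on its own; `big_areas` is a name chosen here for the example.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
table = pd.DataFrame({'area': area, 'pop': pop})

# Rows where the mask is True, restricted to the 'area' column
big_areas = table.loc[table['pop'] > 20000000, 'area']
```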

## Aggregation & Grouping

We use the Planets dataset, available in the Seaborn package, as an example.

In [None]:
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape

In [None]:
planets.head()

### Simple aggregation

In [None]:
planets.dropna().describe()

In [None]:
planets['number'].mean(skipna=True)
# What's wrong?

The following table summarizes some other built-in Pandas aggregations:

| Aggregation              | Description                                     |
|--------------------------|-------------------------------------------------|
| ``count()``              | Number of non-null items                        |
| ``first()``, ``last()``  | First and last item                             |
| ``mean()``, ``median()`` | Mean and median                                 |
| ``min()``, ``max()``     | Minimum and maximum                             |
| ``std()``, ``var()``     | Standard deviation and variance                 |
| ``mad()``                | Mean absolute deviation (removed in pandas 2.0) |
| ``prod()``               | Product of all items                            |
| ``sum()``                | Sum of all items                                |
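
Several of these aggregations can be requested at once with `agg`. The sketch below uses a small made-up `Series` (not taken from the Planets dataset) and also computes the mean absolute deviation by hand, since `mad()` is no longer available in pandas 2.0 and later.

```python
import pandas as pd

# Hypothetical toy values, including one missing entry
s = pd.Series([1.0, 2.0, 3.0, 4.0, None])

# Several aggregations at once; missing values are skipped by default
summary = s.agg(['count', 'mean', 'median', 'min', 'max'])

# Mean absolute deviation computed by hand instead of calling mad()
mean_abs_dev = (s - s.mean()).abs().mean()
```

`summary` is itself a `Series` indexed by the aggregation names, so an individual result is available as, for example, `summary['mean']`.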

### Groupby: conditional aggregation

Often the variable you want to group by does not exist in the raw data and has to be constructed first. This is part of data 'cleaning'.

<img src="./split-apply-combine.png" alt="drawing" width="600"/>
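
As an illustration of constructing a variable before grouping, the sketch below derives a decade from a year column. The rows are hypothetical and only mimic two columns of the Planets dataset; `toy` and `per_decade` are names chosen for the example.

```python
import pandas as pd

# Hypothetical rows mimicking two columns of the Planets dataset
toy = pd.DataFrame({
    'year': [1989, 1995, 2006, 2008, 2011, 2014],
    'number': [1, 1, 2, 1, 3, 1],
})

# Construct the grouping variable: the decade each year falls in
toy['decade'] = 10 * (toy['year'] // 10)

# Group by the constructed variable and aggregate
per_decade = toy.groupby('decade')['number'].sum()
```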

In [None]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
df

In [None]:
df.groupby('key').sum()
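
To see what `groupby` does, the split, apply, and combine steps can be written out by hand. This is only an illustrative sketch of the idea, not how Pandas implements it internally; `manual` and `grouped` are names introduced here.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

manual = {}
for key in df['key'].unique():
    subset = df[df['key'] == key]        # split: rows belonging to one key
    manual[key] = subset['data'].sum()   # apply: aggregate within the group
manual = pd.Series(manual, name='data')  # combine: collect the results

# The same result in one line
grouped = df.groupby('key')['data'].sum()
```

Both `manual` and `grouped` contain one sum per key.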

In [None]:
planets.head()

In [None]:
planets.groupby('method')['orbital_period'].median()

In [None]:
planets.groupby('method')

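The result is a `DataFrameGroupBy` object, not a `DataFrame`: for efficiency, nothing is computed until an aggregation is applied. In many respects you can still treat it like a `DataFrame`, for example when selecting columns.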
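
A short sketch of working with the grouped object, using the small `df` from earlier rather than the Planets data so that it runs without a download; `sizes` and `maxima` are names chosen for the example.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)})

grouped = df.groupby('key')   # no aggregation has been computed yet

# Iterating yields (group label, sub-DataFrame) pairs
sizes = {label: len(group) for label, group in grouped}

# Column selection works as on a DataFrame; the aggregation runs on demand
maxima = grouped['data'].max()
```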

## Afterword 

- The material in this notebook is mainly drawn from Chapter 3 of [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/index.html) by Jake VanderPlas. There is much more to learn!
- For similar functions and tools in R, see [R for Data Science](https://r4ds.had.co.nz/index.html). You may find the tidyverse in R a better interface for data cleaning than Pandas (I do).