# [Pandas](https://pandas.pydata.org/docs/getting_started/index.html)([Manual](https://pandas.pydata.org/docs/user_guide/index.html))
 
- There are two primary data structures in Pandas: Series and DataFrame

## Series
- A Series pairs an index with the corresponding data values
- Originally designed to work with **time-series data**.
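
To make the time-series point concrete, here is a minimal sketch of a Series indexed by dates. The dates and readings are made up for illustration and are not part of the original notes.

```python
import pandas as pd

# Hypothetical daily readings (illustrative values only).
dates = pd.date_range("2024-01-01", periods=4, freq="D")
ts = pd.Series(index=dates, data=[10, 12, 11, 15])

# The index is a DatetimeIndex, so values can be looked up by timestamp.
third_day = ts.loc[pd.Timestamp("2024-01-03")]
```

The only difference from the integer-indexed examples below is the kind of index; the data values are stored the same way.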

In [1]:
import pandas as pd

x = pd.Series(index = range(5), data = [1, 3, 5, 7, 9])
x

0    1
1    3
2    5
3    7
4    9
dtype: int64

In [5]:
# Note the duplicated index "c"
y = pd.Series(index = ["a", "b", "c", "c"], data = [1, 2, 3, 4])
y.a, y.b, y.c

(1,
 2,
 c    3
 c    4
 dtype: int64)

In [7]:
# We can also retrieve the data by slicing
y.iloc[:4]

a    1
b    2
c    3
c    4
dtype: int64

#### Label-based slicing (unlike positional `iloc` slicing) *includes* both endpoints:

In [11]:
y.loc["a" : "c"], y["a" : "c"]  # equal to y.iloc[:4]

(a    1
 b    2
 c    3
 c    4
 dtype: int64,
 a    1
 b    2
 c    3
 c    4
 dtype: int64)
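
The difference between the two slicing styles is easier to see on a Series with unique labels. The following is a small sketch; the Series `s` is introduced here only for illustration.

```python
import pandas as pd

s = pd.Series(index=["a", "b", "c", "d"], data=[1, 2, 3, 4])

by_label = s.loc["a":"c"]   # label slice: the stop label "c" is included
by_position = s.iloc[0:2]   # positional slice: the stop position 2 is excluded
```

Here `by_label` holds three elements (`a`, `b`, `c`), while `by_position` holds only two (`a`, `b`).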

### Set Group Rules and Separate into Groups

In [12]:
x = pd.Series(index = range(5), data = [1, 3, 5, 7, 9])
grp = x.groupby(lambda i : i%2) 
grp.get_group(0) # return the elements with index satisfying i%2 == 0
grp.get_group(1) # return the elements with index satisfying i%2 == 1

1    3
3    7
dtype: int64

`groupby()` splits the Series into two groups, and `get_group()` retrieves one of them by its key.
We can also perform operations on all groups at once:

In [13]:
grp.max() # get max data in each group. Return value is also a Series object with index [0, 1]

0    9
1    7
dtype: int64
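
As a further sketch of what can be done with the grouped object, the example below iterates over the groups and applies several reductions in one call. The names `groups` and `summary` are introduced here for illustration.

```python
import pandas as pd

x = pd.Series(index=range(5), data=[1, 3, 5, 7, 9])
grp = x.groupby(lambda i: i % 2)

# Iterating over a grouped object yields (group key, sub-Series) pairs.
groups = {key: values.tolist() for key, values in grp}

# agg with a list of function names returns a DataFrame:
# one row per group, one column per function.
summary = grp.agg(["min", "max", "sum"])
```
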

## DataFrame
- A DataFrame extends the Series to two dimensions: each column is a Series, and all columns share one index
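
One way to picture this relationship is to build a DataFrame from two Series and pull a column back out. This is a minimal sketch; the variable names are chosen here for illustration.

```python
import pandas as pd

a = pd.Series([1, 3, 5, 7])
b = pd.Series([2, 4, 6, 8])

# Each dict key becomes a column name; the Series are aligned on their index.
frame = pd.DataFrame({"col1": a, "col2": b})

# Selecting a single column gives back a Series that shares the frame's index.
col = frame["col1"]
```
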


In [14]:
df = pd.DataFrame({"col1" : [1, 3, 5, 7], "col2" : [2, 4, 6, 8]})
df

   col1  col2
0     1     2
1     3     4
2     5     6
3     7     8


In [15]:
df.iloc[:2, :1] # get the first two rows of the first column

   col1
0     1
1     3


In [16]:
df["col1"] # using the col name

0    1
1    3
2    5
3    7
Name: col1, dtype: int64

In [17]:
df.col1

0    1
1    3
2    5
3    7
Name: col1, dtype: int64

In [18]:
df.sum() # operate on each col

col1    16
col2    20
dtype: int64
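
By default the reduction runs down each column. As a small sketch, passing `axis=1` to the same method sums across each row instead:

```python
import pandas as pd

df = pd.DataFrame({"col1": [1, 3, 5, 7], "col2": [2, 4, 6, 8]})

col_totals = df.sum()        # default axis=0: one value per column
row_totals = df.sum(axis=1)  # axis=1: one value per row
```
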

### Grouping
- We can group rows by the values in one column

In [19]:
df = pd.DataFrame({"col1" : [0, 0, 1, 1], "col2" : [1, 3, 5, 7]})
grp = df.groupby("col1")
grp.get_group(0) # get the rows whose col1 == 0
grp.get_group(1) # get the rows whose col1 == 1

   col1  col2
2     1     5
3     1     7


In [20]:
# get sum according to the grouping rules
grp.sum()

      col2
col1
0        4
1       12


#### Create new columns

In [22]:
df["sum"] = df.eval("col1 + col2")
df

   col1  col2  sum
0     0     1    1
1     0     3    3
2     1     5    6
3     1     7    8


We can group by multiple columns by passing a list of column names:

In [23]:
grp = df.groupby(["sum", "col1"])
grp.sum()

          col2
sum col1
1   0        1
3   0        3
6   1        5
8   1        7


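We can also `unstack` the result, which pivots the innermost index level (`col1`) into columns. Combinations with no data become NaN: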

In [24]:
grp.sum().unstack()

     col2
col1    0    1
sum
1     1.0  NaN
3     3.0  NaN
6     NaN  5.0
8     NaN  7.0
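
If the missing combinations should not show up as NaN, `unstack` accepts a `fill_value` argument. The sketch below rebuilds the same frame (using plain column addition in place of `df.eval`) and fills the gaps with 0; the name `wide` is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"col1": [0, 0, 1, 1], "col2": [1, 3, 5, 7]})
df["sum"] = df["col1"] + df["col2"]  # same result as df.eval("col1 + col2")

# Pivot the col1 index level into columns, using 0 for missing combinations.
wide = df.groupby(["sum", "col1"]).sum().unstack(fill_value=0)
```

The columns of `wide` form a two-level index, so a single cell is addressed with a tuple such as `wide.loc[6, ("col2", 1)]`.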
