# Pandas - Basic

`Pandas` is a package built on top of `NumPy`. It is organized around the Series and DataFrame objects and provides an efficient implementation of the DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

Compared with NumPy, Pandas offers more flexibility (e.g., attaching labels to data, working with missing data) and is better suited to operations that do not map well to element-wise broadcasting (e.g., groupings, pivots). Each of these is an important piece of analyzing the less structured data available in many forms in the world around us.

* Pandas series
* Pandas DataFrame
* Quick checking DataFrame
* Descriptive stats on DataFrame
* Indexing, slicing, conditional subsetting
* Basic operations

In [2]:
import numpy as np
import pandas as pd

## 1.1. Series Object

The Series object is basically interchangeable with a one-dimensional NumPy array. The essential difference is the presence of the index: while the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.

In [3]:
s = pd.Series([0.25, 0.5, 0.75, 1.0], index=["alice", "bob", "charles", "darwin"])
s

alice      0.25
bob        0.50
charles    0.75
darwin     1.00
dtype: float64

The `values` and `index` attributes give access to the underlying data as a NumPy array and to the labels as an Index object:

In [4]:
s.values

array([0.25, 0.5 , 0.75, 1.  ])

In [5]:
s.index

Index(['alice', 'bob', 'charles', 'darwin'], dtype='object')
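
To make the difference between an implicit and an explicit index concrete, here is a minimal sketch. The variable names are illustrative and not part of the notebook; the values mirror the Series above.

```python
import numpy as np
import pandas as pd

arr = np.array([0.25, 0.5, 0.75, 1.0])
ser = pd.Series(arr, index=["alice", "bob", "charles", "darwin"])

# NumPy: only the implicit integer position is available
second_by_position = arr[1]

# pandas: the explicit label can be used instead
second_by_label = ser["bob"]

# Label-based slices include the end label; positional slices do not
label_slice = ser["alice":"charles"]   # alice, bob, charles
position_slice = ser.iloc[0:2]         # alice, bob

print(second_by_position, second_by_label)
print(len(label_slice), len(position_slice))
```

The inclusive end point of label slices is a common source of off-by-one surprises when moving between `iloc` and label-based access.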

The Pandas Series is also like a specialization of a Python dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure which maps typed keys to a set of typed values. This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than Python dictionaries for certain operations.

In [6]:
# Using a pre-defined Dictionary object
s = {'color': ['black'],
     'size': ['S'],
     'data': pd.date_range('1/1/2019', periods=1, freq='W'),
     'a': 0,
     'b': 1}
pd.Series(s)

color                                              [black]
size                                                   [S]
data     DatetimeIndex(['2019-01-06'], dtype='datetime6...
a                                                        0
b                                                        1
dtype: object

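The index of the resulting Series is drawn from the dictionary keys. As the output above shows, the keys appear in insertion order here (some older pandas versions sorted them instead). Note that `s` still refers to the original dictionary, so the lookup below is plain dictionary access: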

In [7]:
s['size']

['S']

## 1.2. DataFrame Object

The DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary. A DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.

In [8]:
data = {'color': ['black', 'white', 'black', 'white', 'black', 'white', 'black', 'white', 'black', 'white'],
        'size': ['S', 'M', 'L', 'M', 'L', 'S', 'S', 'XL', 'XL', 'M'],
        'date': pd.date_range('1/1/2019', periods=10, freq='W'),
        'a': np.random.randn(10),
        'b': np.random.normal(0.5, 2, 10)}

df = pd.DataFrame(data)
df

Unnamed: 0,color,size,date,a,b
0,black,S,2019-01-06,0.804242,-2.063441
1,white,M,2019-01-13,1.150411,-1.980491
2,black,L,2019-01-20,-0.91326,3.782389
3,white,M,2019-01-27,0.761324,0.417888
4,black,L,2019-02-03,-0.372933,0.34441
5,white,S,2019-02-10,1.208735,-0.081171
6,black,S,2019-02-17,0.910037,1.05309
7,white,XL,2019-02-24,-0.148918,-0.712895
8,black,XL,2019-03-03,1.368588,-2.362638
9,white,M,2019-03-10,-0.533759,-0.234717


In [9]:
df.values

array([['black', 'S', Timestamp('2019-01-06 00:00:00'),
        0.8042415677144027, -2.063440976715521],
       ['white', 'M', Timestamp('2019-01-13 00:00:00'),
        1.1504109167698633, -1.9804909900155727],
       ['black', 'L', Timestamp('2019-01-20 00:00:00'),
        -0.9132604855010013, 3.7823891526280993],
       ['white', 'M', Timestamp('2019-01-27 00:00:00'),
        0.7613244578365003, 0.41788830359343615],
       ['black', 'L', Timestamp('2019-02-03 00:00:00'),
        -0.3729326803937552, 0.34440960058121156],
       ['white', 'S', Timestamp('2019-02-10 00:00:00'),
        1.2087345376490573, -0.08117118384957878],
       ['black', 'S', Timestamp('2019-02-17 00:00:00'),
        0.9100371802601188, 1.0530898585902602],
       ['white', 'XL', Timestamp('2019-02-24 00:00:00'),
        -0.14891839375247912, -0.7128947738728315],
       ['black', 'XL', Timestamp('2019-03-03 00:00:00'),
        1.3685877667975572, -2.362637985860463],
       ['white', 'M', Timestamp('2019-03-10

Like the Series object, the DataFrame has an index attribute that gives access to the index labels:

In [10]:
df.index

RangeIndex(start=0, stop=10, step=1)

Additionally, the DataFrame has a columns attribute, which is an Index object holding the column labels:

In [11]:
df.columns

Index(['color', 'size', 'date', 'a', 'b'], dtype='object')
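
The "specialization of a dictionary" view can be sketched as follows: a DataFrame maps each column name to a Series, and the Series are aligned on their index labels. The names `height`, `weight`, and `people` are made up for this illustration.

```python
import pandas as pd

# Two Series with partially overlapping labels
height = pd.Series({"alice": 1.60, "bob": 1.75})
weight = pd.Series({"bob": 70.0, "charles": 82.5})

# The DataFrame index is the union of both Series indexes;
# entries missing from one Series are filled with NaN
people = pd.DataFrame({"height": height, "weight": weight})

# Dictionary-style lookup by column name returns a Series
height_column = people["height"]

print(people)
```

This alignment on labels, rather than on position, is what distinguishes the DataFrame from a plain two-dimensional array.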

A DataFrame can also be created by reading directly from a CSV or an Excel file with `pd.read_csv` and `pd.read_excel`:

In [12]:
pd.read_csv('../Data/Test_Scores.csv').head()

Unnamed: 0,ACT,FinalExam,QuizAvg,TestAvg
0,33,181,95,89
1,31,169,81,89
2,21,176,65,68
3,25,181,66,90
4,29,169,89,81
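
Since `../Data/Test_Scores.csv` may not be available, the following sketch reads the same kind of data from an in-memory string instead. The use of `io.StringIO` and the variable names are assumptions made for this example; the three rows are copied from the output above.

```python
import io
import pandas as pd

csv_text = """ACT,FinalExam,QuizAvg,TestAvg
33,181,95,89
31,169,81,89
21,176,65,68
"""

# read_csv accepts any file-like object, not only a path
scores = pd.read_csv(io.StringIO(csv_text))

# Write back to CSV without the row index, then read it again
buffer = io.StringIO()
scores.to_csv(buffer, index=False)
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))

print(scores.dtypes)
```

Passing `index=False` to `to_csv` avoids writing the row index as an extra unnamed column, which would otherwise reappear as `Unnamed: 0` on the next read.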


## 1.3. Index Object

This Index object is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set (technically a multi-set, as Index objects may contain repeated values). Index objects also have many of the attributes familiar from NumPy arrays:

In [13]:
index = [['A', 'B', 'B', 'B', 'C', 'A', 'B', 'A', 'C', 'C'], ['JP', 'CN', 'US', 'US', 'US', 'CN', 'CN', 'CA', 'JP', 'CA']]

ind = pd.Index(index)
ind

Index([['A', 'B', 'B', 'B', 'C', 'A', 'B', 'A', 'C', 'C'], ['JP', 'CN', 'US', 'US', 'US', 'CN', 'CN', 'CA', 'JP', 'CA']], dtype='object')

In [14]:
print(ind[0])

print(ind[0][0])

['A', 'B', 'B', 'B', 'C', 'A', 'B', 'A', 'C', 'C']
A


In [15]:
print('size:', ind.size)
print('shape:', ind.shape)
print('dim:', ind.ndim)
print('type:', ind.dtype)

size: 2
shape: (2,)
dim: 1
type: object


One difference between Index objects and NumPy arrays is that Index objects are immutable: unlike array elements, their entries cannot be modified by item assignment:

In [16]:
ind[0] = 1

TypeError: Index does not support mutable operations

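The Index object follows many of the conventions used by Python's built-in `set` data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way. The operator forms used below work in the pandas version this notebook was run with; later releases deprecated them in favor of the `intersection`, `union`, and `symmetric_difference` methods: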

In [17]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# intersection
print(indA & indB)

# union
print(indA | indB)

# symmetric difference
print(indA ^ indB)

Int64Index([3, 5, 7], dtype='int64')
Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
Int64Index([1, 2, 9, 11], dtype='int64')
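
The same set operations can be written with the named Index methods, which do not depend on how a given pandas version interprets `&`, `|`, and `^`. This is a sketch reusing `indA` and `indB` from the cell above; the result variable names are illustrative.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)            # labels in both
either = indA.union(indB)                   # labels in at least one
only_one = indA.symmetric_difference(indB)  # labels in exactly one
only_a = indA.difference(indB)              # labels in indA but not indB

print(list(common))
print(list(either))
print(list(only_one))
print(list(only_a))
```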


A MultiIndex can be attached to a DataFrame to give each row a hierarchical label:

In [18]:
index = pd.MultiIndex.from_arrays(index, names=['class', 'country'])

df = pd.DataFrame(data, index=index)
df

Unnamed: 0_level_0,Unnamed: 1_level_0,color,size,date,a,b
class,country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A,JP,black,S,2019-01-06,0.804242,-2.063441
B,CN,white,M,2019-01-13,1.150411,-1.980491
B,US,black,L,2019-01-20,-0.91326,3.782389
B,US,white,M,2019-01-27,0.761324,0.417888
C,US,black,L,2019-02-03,-0.372933,0.34441
A,CN,white,S,2019-02-10,1.208735,-0.081171
B,CN,black,S,2019-02-17,0.910037,1.05309
A,CA,white,XL,2019-02-24,-0.148918,-0.712895
C,JP,black,XL,2019-03-03,1.368588,-2.362638
C,CA,white,M,2019-03-10,-0.533759,-0.234717


## 2. Quick checking DataFrames

In [19]:
# select top 3
df.head(3)

Unnamed: 0_level_0,Unnamed: 1_level_0,color,size,date,a,b
class,country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A,JP,black,S,2019-01-06,0.804242,-2.063441
B,CN,white,M,2019-01-13,1.150411,-1.980491
B,US,black,L,2019-01-20,-0.91326,3.782389


In [20]:
# select last 3
df.tail(3)

Unnamed: 0_level_0,Unnamed: 1_level_0,color,size,date,a,b
class,country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A,CA,white,XL,2019-02-24,-0.148918,-0.712895
C,JP,black,XL,2019-03-03,1.368588,-2.362638
C,CA,white,M,2019-03-10,-0.533759,-0.234717


In [21]:
# sample data
df.sample(3)

Unnamed: 0_level_0,Unnamed: 1_level_0,color,size,date,a,b
class,country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
B,CN,white,M,2019-01-13,1.150411,-1.980491
C,JP,black,XL,2019-03-03,1.368588,-2.362638
B,US,black,L,2019-01-20,-0.91326,3.782389


In [22]:
# check data information
df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 10 entries, (A, JP) to (C, CA)
Data columns (total 5 columns):
color    10 non-null object
size     10 non-null object
date     10 non-null datetime64[ns]
a        10 non-null float64
b        10 non-null float64
dtypes: datetime64[ns](1), float64(2), object(2)
memory usage: 586.0+ bytes


## 3. Basic descriptive statistics

In [23]:
# basic statistical analysis
df.describe()

Unnamed: 0,a,b
count,10.0,10.0
mean,0.423447,-0.183758
std,0.829942,1.816686
min,-0.91326,-2.362638
25%,-0.316929,-1.663592
50%,0.782783,-0.157944
75%,1.090317,0.399519
max,1.368588,3.782389


In [24]:
print('feature mean')
print(df.mean())
print('feature variance')
print(df.var())
print('feature standard deviation')
print(df.std())
print('feature min')
print(df.min())
print('feature max')
print(df.max())

feature mean
a    0.423447
b   -0.183758
dtype: float64
feature variance
a    0.688804
b    3.300349
dtype: float64
feature standard deviation
a    0.829942
b    1.816686
dtype: float64
feature min
color                  black
size                       L
date     2019-01-06 00:00:00
a                   -0.91326
b                   -2.36264
dtype: object
feature max
color                  white
size                      XL
date     2019-03-10 00:00:00
a                    1.36859
b                    3.78239
dtype: object


In [25]:
# 95th percentile of column a (the threshold for the top 5% of values)
np.percentile(df.a,95)

1.2966538136807322
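
pandas offers the same calculation through `Series.quantile`, which takes the level on a 0 to 1 scale rather than 0 to 100. A small sketch on made-up values (the Series `values` is not part of the notebook):

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

p95_numpy = np.percentile(values, 95)   # level given as a percentage
p95_pandas = values.quantile(0.95)      # level given as a fraction

# Rows at or above the threshold form the top 5% of values
top = values[values >= p95_pandas]

print(p95_numpy, p95_pandas)
```

Both calls use linear interpolation between neighboring data points by default, so they agree on the result.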

## 4. Indexing, slicing columns and rows

### 4.1. Select rows

In [26]:
# select row by integer position, return Series
df.iloc[0]

color                  black
size                       S
date     2019-01-06 00:00:00
a                   0.804242
b                   -2.06344
Name: (A, JP), dtype: object

In [27]:
# select row by integer position (passed as a list), return DataFrame
df.iloc[[0]]

Unnamed: 0_level_0,Unnamed: 1_level_0,color,size,date,a,b
class,country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A,JP,black,S,2019-01-06,0.804242,-2.063441


In [28]:
# select rows by label on the first index level ('class'), return DataFrame
df.loc['B']

Unnamed: 0_level_0,color,size,date,a,b
country,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
CN,white,M,2019-01-13,1.150411,-1.980491
US,black,L,2019-01-20,-0.91326,3.782389
US,white,M,2019-01-27,0.761324,0.417888
CN,black,S,2019-02-17,0.910037,1.05309
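
With a MultiIndex, a label can address the outer level, both levels, or only the inner level. The sketch below uses a small stand-in frame (`small` and its values are hypothetical) with unique index pairs, so selecting a full tuple returns a single row; in the notebook's `df`, where `('B', 'US')` occurs twice, the same call would return two rows.

```python
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [["A", "B", "B", "C"], ["JP", "CN", "US", "US"]],
    names=["class", "country"],
)
small = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0]}, index=idx)

# Outer level only: all 'B' rows, indexed by the remaining level
class_b = small.loc["B"]

# Both levels as a tuple: one row, returned as a Series
b_us = small.loc[("B", "US")]

# Inner level only: xs takes a cross-section on the named level
all_us = small.xs("US", level="country")

print(class_b)
print(all_us)
```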


### 4.2. Select columns

In [29]:
# select column by integer position, return Series
print(df.iloc[:, 0])

class  country
A      JP         black
B      CN         white
       US         black
       US         white
C      US         black
A      CN         white
B      CN         black
A      CA         white
C      JP         black
       CA         white
Name: color, dtype: object


In [30]:
# select column by name, return Series
df.loc[:, 'color']

class  country
A      JP         black
B      CN         white
       US         black
       US         white
C      US         black
A      CN         white
B      CN         black
A      CA         white
C      JP         black
       CA         white
Name: color, dtype: object

In [31]:
# select column by integer position (passed as a list), return DataFrame
df.iloc[:, [0]].head()

Unnamed: 0_level_0,Unnamed: 1_level_0,color
class,country,Unnamed: 2_level_1
A,JP,black
B,CN,white
B,US,black
B,US,white
C,US,black


In [32]:
# select column by name, return dataframe
df[['color']].head()

Unnamed: 0_level_0,Unnamed: 1_level_0,color
class,country,Unnamed: 2_level_1
A,JP,black
B,CN,white
B,US,black
B,US,white
C,US,black


### 4.3. Conditional subsetting

In [33]:
df[df.a > 0]

Unnamed: 0_level_0,Unnamed: 1_level_0,color,size,date,a,b
class,country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A,JP,black,S,2019-01-06,0.804242,-2.063441
B,CN,white,M,2019-01-13,1.150411,-1.980491
B,US,white,M,2019-01-27,0.761324,0.417888
A,CN,white,S,2019-02-10,1.208735,-0.081171
B,CN,black,S,2019-02-17,0.910037,1.05309
C,JP,black,XL,2019-03-03,1.368588,-2.362638


In [34]:
df.loc[(df.color == 'white') & (df['size'] == 'M'), ['a', 'b']]

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b
class,country,Unnamed: 2_level_1,Unnamed: 3_level_1
B,CN,1.150411,-1.980491
B,US,0.761324,0.417888
C,CA,-0.533759,-0.234717
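
Boolean masks can be combined with `&` and `|` (each condition in parentheses), and `query` or `isin` offer alternative spellings. The frame `shirts` below is a made-up stand-in for `df`. It also shows why the notebook writes `df['size']` rather than `df.size`: the attribute `size` is the number of elements, not the column.

```python
import pandas as pd

shirts = pd.DataFrame({
    "color": ["black", "white", "white", "black"],
    "size": ["S", "M", "M", "XL"],
    "a": [0.8, 1.2, -0.5, 1.4],
})

# Boolean mask: each condition needs its own parentheses
mask = (shirts["color"] == "white") & (shirts["a"] > 0)
by_mask = shirts[mask]

# The same filter written as a query string
by_query = shirts.query("color == 'white' and a > 0")

# Membership test against several allowed values
small_or_xl = shirts[shirts["size"].isin(["S", "XL"])]

# Attribute access collides with DataFrame.size (element count)
element_count = shirts.size

print(by_mask)
print(element_count)
```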


## 5. Basic operations
### 5.1. Mathematical operations stored in a new column

In [35]:
df['c'] = df.a + df.b
df

Unnamed: 0_level_0,Unnamed: 1_level_0,color,size,date,a,b,c
class,country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
A,JP,black,S,2019-01-06,0.804242,-2.063441,-1.259199
B,CN,white,M,2019-01-13,1.150411,-1.980491,-0.83008
B,US,black,L,2019-01-20,-0.91326,3.782389,2.869129
B,US,white,M,2019-01-27,0.761324,0.417888,1.179213
C,US,black,L,2019-02-03,-0.372933,0.34441,-0.028523
A,CN,white,S,2019-02-10,1.208735,-0.081171,1.127563
B,CN,black,S,2019-02-17,0.910037,1.05309,1.963127
A,CA,white,XL,2019-02-24,-0.148918,-0.712895,-0.861813
C,JP,black,XL,2019-03-03,1.368588,-2.362638,-0.99405
C,CA,white,M,2019-03-10,-0.533759,-0.234717,-0.768476


### 5.2. Evaluating an expression

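pandas can evaluate string expressions against a DataFrame with `eval`. When the optional `numexpr` library is installed it is used as the evaluation engine; otherwise pandas falls back to plain Python evaluation. A Python variable can be used in an expression by prefixing it with `@`. An assignment such as `c = a + b` returns a new DataFrame with the extra column; pass `inplace=True` to modify the DataFrame directly.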

In [36]:
# quick evaluate
print(df.eval('a + b'))

class  country
A      JP        -1.259199
B      CN        -0.830080
       US         2.869129
       US         1.179213
C      US        -0.028523
A      CN         1.127563
B      CN         1.963127
A      CA        -0.861813
C      JP        -0.994050
       CA        -0.768476
dtype: float64


In [37]:
threshold = 0
df.eval('c = a + b > @threshold')

Unnamed: 0_level_0,Unnamed: 1_level_0,color,size,date,a,b,c
class,country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
A,JP,black,S,2019-01-06,0.804242,-2.063441,False
B,CN,white,M,2019-01-13,1.150411,-1.980491,False
B,US,black,L,2019-01-20,-0.91326,3.782389,True
B,US,white,M,2019-01-27,0.761324,0.417888,True
C,US,black,L,2019-02-03,-0.372933,0.34441,False
A,CN,white,S,2019-02-10,1.208735,-0.081171,True
B,CN,black,S,2019-02-17,0.910037,1.05309,True
A,CA,white,XL,2019-02-24,-0.148918,-0.712895,False
C,JP,black,XL,2019-03-03,1.368588,-2.362638,False
C,CA,white,M,2019-03-10,-0.533759,-0.234717,False
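
The following sketch, on a small hypothetical frame, illustrates that an assignment inside `eval` returns a new DataFrame and leaves the original untouched unless the result is stored.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1.0, -2.0, 3.0], "b": [0.5, 0.5, -4.0]})
threshold = 0

# eval returns a new DataFrame with column c; frame is not modified
flagged = frame.eval("c = a + b > @threshold")
original_has_c = "c" in frame.columns

# Plain assignment is the explicit way to keep the result
frame["c"] = (frame["a"] + frame["b"]) > threshold

print(flagged)
print(original_has_c)
```

This matches the notebook: the later cells still show the numeric `c` from section 5.1, because the `eval` result above was only displayed, not stored.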


### 5.3. Deleting columns or rows 

In [55]:
# Notice the axis=1 option for columns
df.drop('c', axis = 1)

Unnamed: 0_level_0,Unnamed: 1_level_0,color,size,date,a,b
class,country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A,JP,black,S,2019-01-06,0.804242,-2.063441
B,CN,white,M,2019-01-13,1.150411,-1.980491
B,US,black,L,2019-01-20,-0.91326,3.782389
B,US,white,M,2019-01-27,0.761324,0.417888
C,US,black,L,2019-02-03,-0.372933,0.34441
A,CN,white,S,2019-02-10,1.208735,-0.081171
B,CN,black,S,2019-02-17,0.910037,1.05309
A,CA,white,XL,2019-02-24,-0.148918,-0.712895
C,JP,black,XL,2019-03-03,1.368588,-2.362638
C,CA,white,M,2019-03-10,-0.533759,-0.234717


In [57]:
# axis = 0 is default for rows
df.drop(('B', 'US'), axis = 0)

Unnamed: 0_level_0,Unnamed: 1_level_0,color,size,date,a,b,c
class,country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
A,JP,black,S,2019-01-06,0.804242,-2.063441,-1.259199
B,CN,white,M,2019-01-13,1.150411,-1.980491,-0.83008
C,US,black,L,2019-02-03,-0.372933,0.34441,-0.028523
A,CN,white,S,2019-02-10,1.208735,-0.081171,1.127563
B,CN,black,S,2019-02-17,0.910037,1.05309,1.963127
A,CA,white,XL,2019-02-24,-0.148918,-0.712895,-0.861813
C,JP,black,XL,2019-03-03,1.368588,-2.362638,-0.99405
C,CA,white,M,2019-03-10,-0.533759,-0.234717,-0.768476


### 5.4. Counting and sorting

In [38]:
df.color.value_counts()

white    5
black    5
Name: color, dtype: int64

In [39]:
df.sort_values(by='date', ascending=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,color,size,date,a,b,c
class,country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
C,CA,white,M,2019-03-10,-0.533759,-0.234717,-0.768476
C,JP,black,XL,2019-03-03,1.368588,-2.362638,-0.99405
A,CA,white,XL,2019-02-24,-0.148918,-0.712895,-0.861813
B,CN,black,S,2019-02-17,0.910037,1.05309,1.963127
A,CN,white,S,2019-02-10,1.208735,-0.081171,1.127563
C,US,black,L,2019-02-03,-0.372933,0.34441,-0.028523
B,US,white,M,2019-01-27,0.761324,0.417888,1.179213
B,US,black,L,2019-01-20,-0.91326,3.782389,2.869129
B,CN,white,M,2019-01-13,1.150411,-1.980491,-0.83008
A,JP,black,S,2019-01-06,0.804242,-2.063441,-1.259199


==================================================================================

## 2.4. Appending

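There are two ways to append data records: `append` a dictionary or `concat` another DataFrame. Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions only the `concat` route is available.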

In [40]:
df2 = pd.DataFrame(data)
df2.append({'color': 'green', 'size': 'XS', 'data': '2019-02-01 00:00:00', 'a': 1, 'b': -3}, ignore_index=True)

Unnamed: 0,color,size,date,a,b,data
0,black,S,2019-01-06,0.804242,-2.063441,
1,white,M,2019-01-13,1.150411,-1.980491,
2,black,L,2019-01-20,-0.91326,3.782389,
3,white,M,2019-01-27,0.761324,0.417888,
4,black,L,2019-02-03,-0.372933,0.34441,
5,white,S,2019-02-10,1.208735,-0.081171,
6,black,S,2019-02-17,0.910037,1.05309,
7,white,XL,2019-02-24,-0.148918,-0.712895,
8,black,XL,2019-03-03,1.368588,-2.362638,
9,white,M,2019-03-10,-0.533759,-0.234717,


Using `concat` is a more efficient way to append two DataFrames:

In [41]:
temp = dict({'color': ['green'], 'size': ['XS'], 'data': ['2019-02-01 00:00:00'], 'a': [1], 'b': [-3]})

# append row wise
pd.concat([df2, pd.DataFrame(temp)], axis=0, ignore_index=True)

# append column wise
temp = pd.Series(np.linspace(4,20,10))
pd.concat([df2, temp], axis=1)


Unnamed: 0,color,size,date,a,b,0
0,black,S,2019-01-06,0.804242,-2.063441,4.0
1,white,M,2019-01-13,1.150411,-1.980491,5.777778
2,black,L,2019-01-20,-0.91326,3.782389,7.555556
3,white,M,2019-01-27,0.761324,0.417888,9.333333
4,black,L,2019-02-03,-0.372933,0.34441,11.111111
5,white,S,2019-02-10,1.208735,-0.081171,12.888889
6,black,S,2019-02-17,0.910037,1.05309,14.666667
7,white,XL,2019-02-24,-0.148918,-0.712895,16.444444
8,black,XL,2019-03-03,1.368588,-2.362638,18.222222
9,white,M,2019-03-10,-0.533759,-0.234717,20.0
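
A common pattern is to collect new records in a list and concatenate once. The sketch below uses a small hypothetical frame (`base`) and also shows what happens when a key does not match an existing column name, as with `'data'` versus `'date'` above: `concat` aligns on column names and creates a new column.

```python
import pandas as pd

base = pd.DataFrame({"color": ["black", "white"], "a": [0.8, 1.2]})

# Row-wise: build one frame from all new records, then concat once
new_rows = [{"color": "green", "a": 1.0}, {"color": "red", "a": -3.0}]
extended = pd.concat([base, pd.DataFrame(new_rows)], axis=0, ignore_index=True)

# A misspelled key becomes a new column; unmatched cells are NaN
misspelled = pd.DataFrame([{"colour": "blue", "a": 2.0}])
with_typo = pd.concat([base, misspelled], axis=0, ignore_index=True)

# Column-wise: a named Series becomes a named column, aligned on the index
extra = pd.Series([4.0, 20.0], name="c")
wider = pd.concat([base, extra], axis=1)

print(extended)
print(with_typo.columns.tolist())
```

Giving the Series a `name` avoids the unlabeled `0` column seen in the output above.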


In [42]:
size_lvl = df.groupby('size')

for i in size_lvl:
    print(i)

('L',                color size       date         a         b         c
class country                                                     
B     US       black    L 2019-01-20 -0.913260  3.782389  2.869129
C     US       black    L 2019-02-03 -0.372933  0.344410 -0.028523)
('M',                color size       date         a         b         c
class country                                                     
B     CN       white    M 2019-01-13  1.150411 -1.980491 -0.830080
      US       white    M 2019-01-27  0.761324  0.417888  1.179213
C     CA       white    M 2019-03-10 -0.533759 -0.234717 -0.768476)
('S',                color size       date         a         b         c
class country                                                     
A     JP       black    S 2019-01-06  0.804242 -2.063441 -1.259199
      CN       white    S 2019-02-10  1.208735 -0.081171  1.127563
B     CN       black    S 2019-02-17  0.910037  1.053090  1.963127)
('XL',                color size       da
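
Each item produced by iterating over a grouped object is a `(key, sub-DataFrame)` pair, so it is usually clearer to unpack it in the loop. A minimal sketch on a hypothetical frame:

```python
import pandas as pd

shirts = pd.DataFrame({"size": ["S", "M", "S", "L"], "a": [1.0, 2.0, 3.0, 4.0]})

group_lengths = {}
for key, group in shirts.groupby("size"):
    # key is the group label, group is the matching sub-DataFrame
    group_lengths[key] = len(group)

# The built-in size() gives the same counts without an explicit loop
counts = shirts.groupby("size").size().to_dict()

print(group_lengths)
```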

## 3.2. Select specific group

In [43]:
size_lvl.get_group('M')

Unnamed: 0_level_0,Unnamed: 1_level_0,color,size,date,a,b,c
class,country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
B,CN,white,M,2019-01-13,1.150411,-1.980491,-0.83008
B,US,white,M,2019-01-27,0.761324,0.417888,1.179213
C,CA,white,M,2019-03-10,-0.533759,-0.234717,-0.768476


## 4.1. Aggregation

In [44]:
size_lvl.sum().add_prefix('sum_')

Unnamed: 0_level_0,sum_a,sum_b,sum_c
size,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
L,-1.286193,4.126799,2.840606
M,1.377977,-1.79732,-0.419343
S,2.923013,-1.091522,1.831491
XL,1.219669,-3.075533,-1.855863


In [45]:
df.groupby(['size', 'color']).agg({'a': np.min, 'b': np.mean})

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b
size,color,Unnamed: 2_level_1,Unnamed: 3_level_1
L,black,-0.91326,2.063399
M,white,-0.533759,-0.599107
S,black,0.804242,-0.505176
S,white,1.208735,-0.081171
XL,black,1.368588,-2.362638
XL,white,-0.148918,-0.712895
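
An alternative spelling, available in more recent pandas versions, is named aggregation: each keyword names an output column and maps it to a `(column, function)` pair, with the function given as a string. The frame and the output names `a_min` and `b_mean` are chosen only for this sketch.

```python
import pandas as pd

shirts = pd.DataFrame({
    "size": ["S", "S", "M", "M", "M"],
    "color": ["black", "black", "white", "white", "black"],
    "a": [1.0, 3.0, 2.0, 6.0, 5.0],
    "b": [10.0, 20.0, 30.0, 50.0, 70.0],
})

summary = shirts.groupby(["size", "color"]).agg(
    a_min=("a", "min"),     # minimum of column a per group
    b_mean=("b", "mean"),   # mean of column b per group
)

print(summary)
```

Compared with the dictionary form, the result columns carry descriptive names, which helps when the same column is aggregated in several ways.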


## 5.1. Applying a custom function

In [46]:
# Transform
data_range = lambda x: x.max() - x.min()
df.groupby('size').transform(data_range)

Unnamed: 0_level_0,Unnamed: 1_level_0,date,a,b,c
class,country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
A,JP,42 days,0.404493,3.116531,3.222326
B,CN,56 days,1.684169,2.398379,2.009293
B,US,14 days,0.540328,3.43798,2.897652
B,US,56 days,1.684169,2.398379,2.009293
C,US,14 days,0.540328,3.43798,2.897652
A,CN,42 days,0.404493,3.116531,3.222326
B,CN,42 days,0.404493,3.116531,3.222326
A,CA,7 days,1.517506,1.649743,0.132237
C,JP,7 days,1.517506,1.649743,0.132237
C,CA,56 days,1.684169,2.398379,2.009293


In [47]:
# Apply
df.groupby('size')['a'].apply(lambda x: x.max() - x.min())

size
L     0.540328
M     1.684169
S     0.404493
XL    1.517506
Name: a, dtype: float64
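
The difference between the two calls is the shape of the result: an aggregation returns one value per group, while `transform` broadcasts that value back to every row of the group. A sketch with a hypothetical frame and a helper named `value_range`:

```python
import pandas as pd

shirts = pd.DataFrame({"size": ["S", "M", "S", "M"], "a": [1.0, 2.0, 4.0, 7.0]})

def value_range(x):
    """Spread between the largest and smallest value of a group."""
    return x.max() - x.min()

grouped = shirts.groupby("size")["a"]

per_group = grouped.agg(value_range)       # indexed by group label
per_row = grouped.transform(value_range)   # indexed like the original rows

# transform results line up with the original frame, e.g. for scaling
shirts["a_scaled"] = shirts["a"] / per_row

print(per_group)
print(per_row)
```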

## 6.1. Rolling

`rolling(n)` creates a sliding window of n rows and aggregates the values inside it, here separately for each group. Positions where fewer than n rows are available return `NaN`:

In [48]:
df.groupby('color').rolling(2).a.sum()

color  class  country
black  A      JP              NaN
       B      US        -0.109019
       C      US        -1.286193
       B      CN         0.537104
       C      JP         2.278625
white  B      CN              NaN
              US         1.911735
       A      CN         1.970059
              CA         1.059816
       C      CA        -0.682677
Name: a, dtype: float64
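
The `NaN` at the start of each group comes from the default requirement that the window be full. A short sketch on a plain Series (made-up values) shows the default next to `min_periods=1`, which accepts partial windows:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Default: the first window holds only one row, so the result is NaN
strict = s.rolling(2).sum()

# min_periods=1 aggregates whatever rows are available
lenient = s.rolling(2, min_periods=1).sum()

print(strict.tolist())
print(lenient.tolist())
```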

## 7.1. Expanding

`expanding` also aggregates over a window within each group, but the window grows by one row at every step, so it always covers all rows seen so far:

In [49]:
# Cumulative sum
df.groupby('color').expanding(1).a.sum()

color  class  country
black  A      JP         0.804242
       B      US        -0.109019
       C      US        -0.481952
       B      CN         0.428086
       C      JP         1.796673
white  B      CN         1.150411
              US         1.911735
       A      CN         3.120470
              CA         2.971552
       C      CA         2.437793
Name: a, dtype: float64
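
On a single Series, an expanding sum is the same as a cumulative sum, and other aggregations such as the mean give running statistics. A minimal sketch with made-up values:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

expanding_sum = s.expanding(1).sum()   # window grows from 1 row to all rows
cumulative = s.cumsum()                # same values via the dedicated method
running_mean = s.expanding().mean()    # mean of all rows seen so far

print(expanding_sum.tolist())
print(running_mean.tolist())
```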

## 8.1. Filter

In [50]:
df.groupby('class').filter(lambda x: len(x) > 3)

Unnamed: 0_level_0,Unnamed: 1_level_0,color,size,date,a,b,c
class,country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
B,CN,white,M,2019-01-13,1.150411,-1.980491,-0.83008
B,US,black,L,2019-01-20,-0.91326,3.782389,2.869129
B,US,white,M,2019-01-27,0.761324,0.417888,1.179213
B,CN,black,S,2019-02-17,0.910037,1.05309,1.963127
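
`filter` keeps or drops whole groups: the function receives each group's sub-DataFrame and returns a single boolean. The sketch below uses a hypothetical frame with a regular column `cls` instead of an index level.

```python
import pandas as pd

frame = pd.DataFrame({
    "cls": ["A", "B", "B", "B", "C"],
    "a": [1, 2, 3, 4, 5],
})

# Keep only the rows of groups with more than two members
large_groups = frame.groupby("cls").filter(lambda g: len(g) > 2)

# The condition can also depend on the group's values
high_mean = frame.groupby("cls").filter(lambda g: g["a"].mean() >= 3)

print(large_groups)
print(high_mean)
```

The rows that survive keep their original index and order, so the result can be compared directly with the unfiltered frame.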


## 9.1 Evaluating an expression