# Pandas - Basic

`Pandas` is a package built on top of `NumPy`. It is organized around the Series and DataFrame objects and provides an efficient implementation of the DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

Compared with NumPy, Pandas offers more flexibility (e.g., attaching labels to data, working with missing data) and supports operations that do not map well to element-wise broadcasting (e.g., groupings, pivots). Each of these is an important part of analyzing the less structured data found in many forms in the world around us.

* Pandas series, DataFrame
* Quick checking DataFrame
* Descriptive stats on DataFrame
* Indexing, slicing, conditional subsetting
* Basic operations

In [1]:
import numpy as np
import pandas as pd

## 1.1. Series Object

The Series object is basically interchangeable with a one-dimensional NumPy array. The essential difference is the presence of the index: while the NumPy array has an implicitly defined integer index used to access the values, the Pandas Series has an explicitly defined index associated with the values.

In [2]:
s = pd.Series([0.25, 0.5, 0.75, 1.0], index=["alice", "bob", "charles", "darwin"])
s

alice      0.25
bob        0.50
charles    0.75
darwin     1.00
dtype: float64

The `values` and `index` attributes give access to the underlying data as a NumPy array and to the labels as an Index object:

In [3]:
s.values

array([0.25, 0.5 , 0.75, 1.  ])

In [4]:
s.index

Index(['alice', 'bob', 'charles', 'darwin'], dtype='object')
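As a minimal sketch of what the explicit index makes possible (the variable names below are illustrative and not part of the original notebook), label-based lookup, label slicing, and automatic alignment all work off the index:

```python
import pandas as pd

prices = pd.Series([0.25, 0.5, 0.75, 1.0],
                   index=["alice", "bob", "charles", "darwin"])

# Label-based lookup, much like a dictionary
bob_value = prices["bob"]

# Label slices include BOTH endpoints, unlike positional slices
label_slice = prices["bob":"charles"]
pos_slice = prices.iloc[1:2]

# Arithmetic aligns on labels, not positions; unmatched labels give NaN
other = pd.Series([10.0, 20.0], index=["darwin", "alice"])
aligned_sum = prices + other

print(bob_value)
print(label_slice)
print(aligned_sum)
```

Alignment is the behavior that most clearly separates a Series from a plain NumPy array, where addition would pair values purely by position.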

The Pandas Series is also like a specialization of a Python dictionary. A dictionary is a structure that maps arbitrary keys to a set of arbitrary values, and a Series is a structure which maps typed keys to a set of typed values. This typing is important: just as the type-specific compiled code behind a NumPy array makes it more efficient than a Python list for certain operations, the type information of a Pandas Series makes it much more efficient than Python dictionaries for certain operations.

In [5]:
# Using a pre-defined Dictionary object
s = {'color': ['black'],
     'size': ['S'],
     'data': pd.date_range('1/1/2019', periods=1, freq='W'),
     'a': 0,
     'b': 1}
pd.Series(s)

color                                              [black]
size                                                   [S]
data     DatetimeIndex(['2019-01-06'], dtype='datetime6...
a                                                        0
b                                                        1
dtype: object

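By default, the index of the resulting Series is drawn from the dictionary keys. Recent pandas versions keep the keys in insertion order, as the output above shows, rather than sorting them.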

In [6]:
# `s` is still the dictionary here; pd.Series(s)['size'] returns the same value
s['size']

['S']
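The dictionary analogy can be sketched with a small example. The `stock` data and variable names are assumptions made for illustration only:

```python
import pandas as pd

stock = {"pears": 3, "apples": 10, "kiwis": 7}
ser = pd.Series(stock)

# Keys become the index, in insertion order (pandas >= 0.23, Python >= 3.6)
index_order = list(ser.index)

# Dictionary-style membership test and lookup work on the index
has_apples = "apples" in ser
n_apples = ser["apples"]

# Unlike a dict, the values share one dtype and support vectorized math
doubled = ser * 2

print(ser.dtype)
print(doubled)
```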

## 1.2. DataFrame Object

The DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary. A DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names.

In [115]:
data = {'color': ['black', 'white', 'black', 'white', 'black', 'white', 'black', 'white', 'black', 'white'],
        'size': ['S', 'M', 'L', 'M', 'L', 'S', 'S', 'XL', 'XL', 'M'],
        'date': pd.date_range('1/1/2019', periods=10, freq='W'),
        'a': np.random.randn(10),
        'b': np.random.normal(0.5, 2, 10)}

df = pd.DataFrame(data)
df

,color,size,date,a,b
0,black,S,2019-01-06,-0.267885,-3.319572
1,white,M,2019-01-13,1.595304,-0.154607
2,black,L,2019-01-20,-0.453008,-2.344712
3,white,M,2019-01-27,2.578424,0.674871
4,black,L,2019-02-03,-1.734742,-0.399142
5,white,S,2019-02-10,2.20564,2.354891
6,black,S,2019-02-17,-0.156361,0.208433
7,white,XL,2019-02-24,0.123158,0.953606
8,black,XL,2019-03-03,0.828522,0.418054
9,white,M,2019-03-10,-0.636645,0.189614


In [116]:
df.values

array([['black', 'S', Timestamp('2019-01-06 00:00:00'),
        -0.2678850367290185, -3.319571703296763],
       ['white', 'M', Timestamp('2019-01-13 00:00:00'),
        1.595304200084292, -0.15460663822651677],
       ['black', 'L', Timestamp('2019-01-20 00:00:00'),
        -0.45300764826498896, -2.3447118975976147],
       ['white', 'M', Timestamp('2019-01-27 00:00:00'),
        2.5784237778957135, 0.6748707695079044],
       ['black', 'L', Timestamp('2019-02-03 00:00:00'),
        -1.7347421509921124, -0.3991420518495744],
       ['white', 'S', Timestamp('2019-02-10 00:00:00'),
        2.2056395254232366, 2.354891014600936],
       ['black', 'S', Timestamp('2019-02-17 00:00:00'),
        -0.15636058189967875, 0.20843310628830036],
       ['white', 'XL', Timestamp('2019-02-24 00:00:00'),
        0.12315833988704607, 0.9536060935904505],
       ['black', 'XL', Timestamp('2019-03-03 00:00:00'),
        0.828521625364857, 0.41805434634604294],
       ['white', 'M', Timestamp('2019-03-10

Like the Series object, the DataFrame has an index attribute that gives access to the index labels:

In [117]:
df.index

RangeIndex(start=0, stop=10, step=1)

Additionally, the DataFrame has a columns attribute, which is an Index object holding the column labels:

In [118]:
df.columns

Index(['color', 'size', 'date', 'a', 'b'], dtype='object')
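To illustrate the two views of a DataFrame described in this section, here is a small sketch with made-up data (`height`, `weight`, and `people` are hypothetical names). It builds a DataFrame from a dictionary of Series and then reads it back both dictionary-style and array-style:

```python
import pandas as pd

height = pd.Series({"ann": 165, "ben": 180})
weight = pd.Series({"ben": 75.5, "ann": 58.0})

# Each column is a Series; rows are aligned on the shared index labels
people = pd.DataFrame({"height": height, "weight": weight})

# Dictionary-style access returns one column as a Series
col = people["weight"]

# Membership tests check column names, as with dict keys
has_height = "height" in people

# .values gives the NumPy array; mixed numeric dtypes are upcast to a common one
arr = people.values

print(people)
print(arr.dtype)
```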

A DataFrame can also be created directly from a CSV or an Excel file using `pd.read_csv` and `pd.read_excel`:

In [119]:
pd.read_csv('../Data/Test_Scores.csv').head()

,ACT,FinalExam,QuizAvg,TestAvg
0,33,181,95,89
1,31,169,81,89
2,21,176,65,68
3,25,181,66,90
4,29,169,89,81
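When no file is at hand, `pd.read_csv` can be tried on an in-memory buffer. The sketch below assumes a small CSV string whose column names mirror the output above; `usecols` and `index_col` are standard `read_csv` parameters:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """ACT,FinalExam,QuizAvg,TestAvg
33,181,95,89
31,169,81,89
21,176,65,68
"""

# read_csv accepts a path or any file-like object
scores = pd.read_csv(io.StringIO(csv_text))

# Parse only some columns and use one of them as the index
subset = pd.read_csv(io.StringIO(csv_text),
                     usecols=["ACT", "FinalExam"],
                     index_col="ACT")

print(scores.dtypes)
print(subset)
```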


## 1.3. Index Object

The Index object is an interesting structure in itself. It can be thought of either as an immutable array or as an ordered set (technically a multiset, as Index objects may contain repeated values). Index objects have many of the attributes familiar from NumPy arrays, but their entries cannot be modified through the usual assignment syntax.

In [120]:
index = [['A', 'B', 'B', 'B', 'C', 'A', 'B', 'A', 'C', 'C'], ['JP', 'CN', 'US', 'US', 'US', 'CN', 'CN', 'CA', 'JP', 'CA']]

ind = pd.Index(index)
ind

Index([['A', 'B', 'B', 'B', 'C', 'A', 'B', 'A', 'C', 'C'], ['JP', 'CN', 'US', 'US', 'US', 'CN', 'CN', 'CA', 'JP', 'CA']], dtype='object')

In [121]:
print(ind[0])

print(ind[0][0])

['A', 'B', 'B', 'B', 'C', 'A', 'B', 'A', 'C', 'C']
A


In [122]:
print('size:', ind.size)
print('shape:', ind.shape)
print('dim:', ind.ndim)
print('type:', ind.dtype)

size: 2
shape: (2,)
dim: 1
type: object
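A short sketch of the immutability and multiset properties described in this section, using a made-up integer Index:

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# Indexing and slicing work as for a NumPy array
first = ind[0]
every_other = ind[::2]

# Item assignment is rejected because an Index is immutable
try:
    ind[0] = 99
    mutated = True
except TypeError:
    mutated = False

# Repeated values are allowed, which is why an Index behaves like a multiset
dup = pd.Index(["a", "b", "a"])

print(every_other)
print(mutated, dup.is_unique)
```

Immutability makes it safe to share one Index between several Series and DataFrames without side effects.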


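The Index object follows many of the conventions used by Python's built-in `set` data structure, so unions, intersections, differences, and other combinations can be computed in a familiar way. The `&`, `|`, and `^` operators performed these set operations in older pandas releases, but that behavior was deprecated and no longer applies in pandas 2.0 and later, so the explicit methods are the portable choice:

In [123]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# intersection
print(indA.intersection(indB))

# union
print(indA.union(indB))

# symmetric difference
print(indA.symmetric_difference(indB))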

Int64Index([3, 5, 7], dtype='int64')
Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')
Int64Index([1, 2, 9, 11], dtype='int64')


Adding a MultiIndex to a DataFrame:

In [124]:
index = pd.MultiIndex.from_arrays(index, names=['class', 'country'])

df = pd.DataFrame(data, index=index)
df

class,country,color,size,date,a,b
A,JP,black,S,2019-01-06,-0.267885,-3.319572
B,CN,white,M,2019-01-13,1.595304,-0.154607
B,US,black,L,2019-01-20,-0.453008,-2.344712
B,US,white,M,2019-01-27,2.578424,0.674871
C,US,black,L,2019-02-03,-1.734742,-0.399142
A,CN,white,S,2019-02-10,2.20564,2.354891
B,CN,black,S,2019-02-17,-0.156361,0.208433
A,CA,white,XL,2019-02-24,0.123158,0.953606
C,JP,black,XL,2019-03-03,0.828522,0.418054
C,CA,white,M,2019-03-10,-0.636645,0.189614
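Once a MultiIndex is in place, rows can be addressed by one level or by the full key. A minimal sketch on a small, made-up frame (the names `idx` and `frame` are illustrative):

```python
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [["A", "A", "B", "B"], ["CN", "JP", "JP", "US"]],
    names=["class", "country"],
)
frame = pd.DataFrame({"value": [1, 2, 3, 4]}, index=idx)

# Partial indexing on the first level returns all matching rows
class_a = frame.loc["A"]

# A full tuple addresses a single row
b_us = frame.loc[("B", "US"), "value"]

# xs() selects on an inner level by name
japan = frame.xs("JP", level="country")

print(class_a)
print(japan)
```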


## 2. Quick checking DataFrames

In [17]:
# select top 3
df.head(3)

class,country,color,size,date,a,b
A,JP,black,S,2019-01-06,-0.364302,2.214871
B,CN,white,M,2019-01-13,-0.627125,-1.581838
B,US,black,L,2019-01-20,0.429098,0.111413


In [18]:
# select last 3
df.tail(3)

class,country,color,size,date,a,b
A,CA,white,XL,2019-02-24,0.05332,0.201993
C,JP,black,XL,2019-03-03,0.383699,-1.152456
C,CA,white,M,2019-03-10,0.645334,-1.620715


In [19]:
# sample data
df.sample(3)

class,country,color,size,date,a,b
C,JP,black,XL,2019-03-03,0.383699,-1.152456
B,US,white,M,2019-01-27,-0.218822,-2.907143
A,JP,black,S,2019-01-06,-0.364302,2.214871


In [126]:
# check data type
print(df.dtypes)

color            object
size             object
date     datetime64[ns]
a               float64
b               float64
dtype: object


In [127]:
# change data type
print(df.a.astype(int))

class  country
A      JP         0
B      CN         1
       US         0
       US         2
C      US        -1
A      CN         2
B      CN         0
A      CA         0
C      JP         0
       CA         0
Name: a, dtype: int32
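The output above shows that `astype(int)` discards the fractional part rather than rounding. A small sketch with made-up values makes the difference explicit. The nullable `"Int64"` dtype used at the end assumes a reasonably recent pandas release:

```python
import numpy as np
import pandas as pd

vals = pd.Series([-1.7, -0.2, 0.9, 2.6])

# astype(int) drops the fractional part (truncates toward zero)
truncated = vals.astype(int)

# Round first if nearest-integer behavior is wanted
rounded = vals.round().astype(int)

# NaN cannot be stored in a plain int column; the nullable "Int64" dtype can hold it
with_nan = pd.Series([1.0, np.nan]).astype("Int64")

print(truncated.tolist(), rounded.tolist())
print(with_nan)
```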


In [20]:
# check data information
df.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 10 entries, (A, JP) to (C, CA)
Data columns (total 5 columns):
color    10 non-null object
size     10 non-null object
date     10 non-null datetime64[ns]
a        10 non-null float64
b        10 non-null float64
dtypes: datetime64[ns](1), float64(2), object(2)
memory usage: 586.0+ bytes


## 3. Basic descriptive statistics

In [21]:
# basic statistical analysis
df.describe()

,a,b
count,10.0,10.0
mean,-0.248564,-0.053448
std,0.722796,1.776664
min,-1.75386,-2.907143
25%,-0.561419,-1.474493
50%,-0.148091,0.156703
75%,0.301104,0.753462
max,0.645334,2.73832


In [22]:
print('feature mean')
print(df.mean(numeric_only=True))
print('feature variance')
print(df.var(numeric_only=True))
print('feature standard deviation')
print(df.std(numeric_only=True))
print('feature min')
print(df.min())
print('feature max')
print(df.max())

feature mean
a   -0.248564
b   -0.053448
dtype: float64
feature variance
a    0.522434
b    3.156535
dtype: float64
feature standard deviation
a    0.722796
b    1.776664
dtype: float64
feature min
color                  black
size                       L
date     2019-01-06 00:00:00
a                   -1.75386
b                   -2.90714
dtype: object
feature max
color                  white
size                      XL
date     2019-03-10 00:00:00
a                   0.645334
b                    2.73832
dtype: object


In [23]:
# 95th percentile of column a (the cutoff for the top 5% of values)
np.percentile(df.a,95)

0.5480278713127451
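The same cutoff is available without leaving pandas through `Series.quantile`. The sketch below uses made-up values and checks that both routes agree, since both default to linear interpolation:

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Both use linear interpolation by default, so they agree
p95_numpy = np.percentile(x, 95)
p95_pandas = x.quantile(0.95)

# Rows at or above the threshold form the "top 5%"
top = x[x >= p95_pandas]

print(p95_numpy, p95_pandas)
print(top)
```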

## 4. Indexing, slicing columns and rows

### 4.1. select rows

In [24]:
# select a row by integer position; returns a Series
df.iloc[0]

color                  black
size                       S
date     2019-01-06 00:00:00
a                  -0.364302
b                    2.21487
Name: (A, JP), dtype: object

In [25]:
# select a row by integer position using a list; returns a DataFrame
df.iloc[[0]]

class,country,color,size,date,a,b
A,JP,black,S,2019-01-06,-0.364302,2.214871


In [26]:
# select rows by label on the first index level; returns a DataFrame
df.loc['B']

country,color,size,date,a,b
CN,white,M,2019-01-13,-0.627125,-1.581838
US,black,L,2019-01-20,0.429098,0.111413
US,white,M,2019-01-27,-0.218822,-2.907143
CN,black,S,2019-02-17,-1.75386,2.73832
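The difference between `loc` (labels) and `iloc` (integer positions) is easiest to see when the labels are themselves integers that do not match the positions. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Integer labels that differ from positions make the distinction visible
frame = pd.DataFrame({"x": [1, 2, 3]}, index=[10, 20, 30])

by_label = frame.loc[10, "x"]      # row labelled 10
by_position = frame.iloc[2, 0]     # third row, first column

# loc slices include the end label; iloc slices exclude the end position
loc_slice = frame.loc[10:20]
iloc_slice = frame.iloc[0:1]

# A scalar selector returns a Series, a list selector returns a DataFrame
row_series = frame.iloc[0]
row_frame = frame.iloc[[0]]

print(loc_slice)
print(type(row_series).__name__, type(row_frame).__name__)
```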


### 4.2. select columns

In [27]:
# select a column by integer position; returns a Series
print(df.iloc[:, 0])

class  country
A      JP         black
B      CN         white
       US         black
       US         white
C      US         black
A      CN         white
B      CN         black
A      CA         white
C      JP         black
       CA         white
Name: color, dtype: object


In [28]:
# select column by name, return Series
df.loc[:, 'color']

class  country
A      JP         black
B      CN         white
       US         black
       US         white
C      US         black
A      CN         white
B      CN         black
A      CA         white
C      JP         black
       CA         white
Name: color, dtype: object

In [29]:
# select a column by integer position using a list; returns a DataFrame
df.iloc[:, [0]].head()

class,country,color
A,JP,black
B,CN,white
B,US,black
B,US,white
C,US,black


In [30]:
# select column by name, return dataframe
df[['color']].head()

class,country,color
A,JP,black
B,CN,white
B,US,black
B,US,white
C,US,black


### 4.3. conditional subsetting

In [31]:
df[df.a > 0]

class,country,color,size,date,a,b
B,US,black,L,2019-01-20,0.429098,0.111413
A,CA,white,XL,2019-02-24,0.05332,0.201993
C,JP,black,XL,2019-03-03,0.383699,-1.152456
C,CA,white,M,2019-03-10,0.645334,-1.620715


In [91]:
df[df.color.isin(['white', 'black'])]

class,country,color,size,date,a,b
A,JP,black,S,2019-01-06,-1.903429,-1.200664
B,CN,white,M,2019-01-13,-0.999367,-1.3093
B,US,black,L,2019-01-20,0.488169,0.271709
B,US,white,M,2019-01-27,-0.745023,-1.449695
C,US,black,L,2019-02-03,1.088622,-1.894711
A,CN,white,S,2019-02-10,0.63002,-1.489885
B,CN,black,S,2019-02-17,-0.753279,1.964937
A,CA,white,XL,2019-02-24,0.849969,-2.896975
C,JP,black,XL,2019-03-03,-0.743981,-0.267226
C,CA,white,M,2019-03-10,-0.567835,2.931386


In [88]:
df.loc[(df.color == 'white') & (df['size'] == 'M'), ['a', 'b']]

class,country,a,b
B,CN,-0.999367,-1.3093
B,US,-0.745023,-1.449695
C,CA,-0.567835,2.931386


In [33]:
df.query("color == 'white' & size == 'M'")

class,country,color,size,date,a,b
B,CN,white,M,2019-01-13,-0.627125,-1.581838
B,US,white,M,2019-01-27,-0.218822,-2.907143
C,CA,white,M,2019-03-10,0.645334,-1.620715
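The boolean-mask and `query` styles used in this section select the same rows. A small sketch on a made-up frame (`shirts` and `wanted` are illustrative names) shows the parentheses that masks need and how `query` can read a Python variable with `@`:

```python
import pandas as pd

shirts = pd.DataFrame({
    "color": ["black", "white", "white", "black"],
    "size": ["S", "M", "M", "L"],
    "a": [0.5, -1.0, 2.0, 1.5],
})

# Each comparison needs its own parentheses because & binds tighter than ==
mask = (shirts["color"] == "white") & (shirts["a"] > 0)
by_mask = shirts[mask]

# query() evaluates the same condition from a string; @ reads a Python variable
wanted = "white"
by_query = shirts.query("color == @wanted and a > 0")

print(by_mask)
```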


## 5. Basic operations
### 5.1. mathematical operations and store a new column

In [34]:
df['c'] = df.a + df.b
df

class,country,color,size,date,a,b,c
A,JP,black,S,2019-01-06,-0.364302,2.214871,1.850569
B,CN,white,M,2019-01-13,-0.627125,-1.581838,-2.208963
B,US,black,L,2019-01-20,0.429098,0.111413,0.540511
B,US,white,M,2019-01-27,-0.218822,-2.907143,-3.125965
C,US,black,L,2019-02-03,-0.955619,0.68468,-0.270939
A,CN,white,S,2019-02-10,-0.07736,0.77639,0.69903
B,CN,black,S,2019-02-17,-1.75386,2.73832,0.984461
A,CA,white,XL,2019-02-24,0.05332,0.201993,0.255313
C,JP,black,XL,2019-03-03,0.383699,-1.152456,-0.768757
C,CA,white,M,2019-03-10,0.645334,-1.620715,-0.975381


### 5.2. Evaluating an expression

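Pandas supports expression evaluation through `eval` and `query`. Evaluation uses the `numexpr` library when it is installed and otherwise falls back to a plain Python engine. An expression containing an assignment returns a new DataFrame with the added column (pass `inplace=True` to modify the DataFrame directly), and a local Python variable can be used in an expression by prefixing it with `@`.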

In [35]:
# quick evaluate
print(df.eval('a + b'))

class  country
A      JP         1.850569
B      CN        -2.208963
       US         0.540511
       US        -3.125965
C      US        -0.270939
A      CN         0.699030
B      CN         0.984461
A      CA         0.255313
C      JP        -0.768757
       CA        -0.975381
dtype: float64


In [36]:
threshold = 0
df.eval('c = a + b > @threshold')

class,country,color,size,date,a,b,c
A,JP,black,S,2019-01-06,-0.364302,2.214871,True
B,CN,white,M,2019-01-13,-0.627125,-1.581838,False
B,US,black,L,2019-01-20,0.429098,0.111413,True
B,US,white,M,2019-01-27,-0.218822,-2.907143,False
C,US,black,L,2019-02-03,-0.955619,0.68468,False
A,CN,white,S,2019-02-10,-0.07736,0.77639,True
B,CN,black,S,2019-02-17,-1.75386,2.73832,True
A,CA,white,XL,2019-02-24,0.05332,0.201993,True
C,JP,black,XL,2019-03-03,0.383699,-1.152456,False
C,CA,white,M,2019-03-10,0.645334,-1.620715,False
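Whether `eval` changes the DataFrame depends on `inplace`. The sketch below, on a made-up frame, shows the three cases: a plain expression, an assignment that returns a copy, and an in-place assignment:

```python
import pandas as pd

frame = pd.DataFrame({"a": [1.0, -2.0, 3.0], "b": [0.5, 0.5, -4.0]})
threshold = 0

# An expression without assignment returns a Series
total = frame.eval("a + b")

# An assignment returns a NEW DataFrame; the original is unchanged
with_flag = frame.eval("flag = a + b > @threshold")
unchanged = "flag" not in frame.columns

# inplace=True writes the new column into the original DataFrame
frame.eval("total = a + b", inplace=True)

print(with_flag)
print(frame.columns.tolist())
```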


### 5.3. Deleting columns or rows 

In [37]:
# axis=1 drops a column; drop() returns a new DataFrame and leaves df unchanged
df.drop('c', axis = 1)

class,country,color,size,date,a,b
A,JP,black,S,2019-01-06,-0.364302,2.214871
B,CN,white,M,2019-01-13,-0.627125,-1.581838
B,US,black,L,2019-01-20,0.429098,0.111413
B,US,white,M,2019-01-27,-0.218822,-2.907143
C,US,black,L,2019-02-03,-0.955619,0.68468
A,CN,white,S,2019-02-10,-0.07736,0.77639
B,CN,black,S,2019-02-17,-1.75386,2.73832
A,CA,white,XL,2019-02-24,0.05332,0.201993
C,JP,black,XL,2019-03-03,0.383699,-1.152456
C,CA,white,M,2019-03-10,0.645334,-1.620715


In [38]:
# axis=0 (the default) drops rows by index label
df.drop(('B', 'US'), axis = 0)

class,country,color,size,date,a,b,c
A,JP,black,S,2019-01-06,-0.364302,2.214871,1.850569
B,CN,white,M,2019-01-13,-0.627125,-1.581838,-2.208963
C,US,black,L,2019-02-03,-0.955619,0.68468,-0.270939
A,CN,white,S,2019-02-10,-0.07736,0.77639,0.69903
B,CN,black,S,2019-02-17,-1.75386,2.73832,0.984461
A,CA,white,XL,2019-02-24,0.05332,0.201993,0.255313
C,JP,black,XL,2019-03-03,0.383699,-1.152456,-0.768757
C,CA,white,M,2019-03-10,0.645334,-1.620715,-0.975381


### 5.4. Counting and sorting

In [39]:
df.color.value_counts()

white    5
black    5
Name: color, dtype: int64

In [40]:
df.sort_values(by='date', ascending=False)

class,country,color,size,date,a,b,c
C,CA,white,M,2019-03-10,0.645334,-1.620715,-0.975381
C,JP,black,XL,2019-03-03,0.383699,-1.152456,-0.768757
A,CA,white,XL,2019-02-24,0.05332,0.201993,0.255313
B,CN,black,S,2019-02-17,-1.75386,2.73832,0.984461
A,CN,white,S,2019-02-10,-0.07736,0.77639,0.69903
C,US,black,L,2019-02-03,-0.955619,0.68468,-0.270939
B,US,white,M,2019-01-27,-0.218822,-2.907143,-3.125965
B,US,black,L,2019-01-20,0.429098,0.111413,0.540511
B,CN,white,M,2019-01-13,-0.627125,-1.581838,-2.208963
A,JP,black,S,2019-01-06,-0.364302,2.214871,1.850569


### 5.5. Re-setting and Setting Index

In [41]:
df.reset_index()

,class,country,color,size,date,a,b,c
0,A,JP,black,S,2019-01-06,-0.364302,2.214871,1.850569
1,B,CN,white,M,2019-01-13,-0.627125,-1.581838,-2.208963
2,B,US,black,L,2019-01-20,0.429098,0.111413,0.540511
3,B,US,white,M,2019-01-27,-0.218822,-2.907143,-3.125965
4,C,US,black,L,2019-02-03,-0.955619,0.68468,-0.270939
5,A,CN,white,S,2019-02-10,-0.07736,0.77639,0.69903
6,B,CN,black,S,2019-02-17,-1.75386,2.73832,0.984461
7,A,CA,white,XL,2019-02-24,0.05332,0.201993,0.255313
8,C,JP,black,XL,2019-03-03,0.383699,-1.152456,-0.768757
9,C,CA,white,M,2019-03-10,0.645334,-1.620715,-0.975381


In [42]:
df.reset_index(drop=True)

,color,size,date,a,b,c
0,black,S,2019-01-06,-0.364302,2.214871,1.850569
1,white,M,2019-01-13,-0.627125,-1.581838,-2.208963
2,black,L,2019-01-20,0.429098,0.111413,0.540511
3,white,M,2019-01-27,-0.218822,-2.907143,-3.125965
4,black,L,2019-02-03,-0.955619,0.68468,-0.270939
5,white,S,2019-02-10,-0.07736,0.77639,0.69903
6,black,S,2019-02-17,-1.75386,2.73832,0.984461
7,white,XL,2019-02-24,0.05332,0.201993,0.255313
8,black,XL,2019-03-03,0.383699,-1.152456,-0.768757
9,white,M,2019-03-10,0.645334,-1.620715,-0.975381


In [43]:
df.set_index('date')

date,color,size,a,b,c
2019-01-06,black,S,-0.364302,2.214871,1.850569
2019-01-13,white,M,-0.627125,-1.581838,-2.208963
2019-01-20,black,L,0.429098,0.111413,0.540511
2019-01-27,white,M,-0.218822,-2.907143,-3.125965
2019-02-03,black,L,-0.955619,0.68468,-0.270939
2019-02-10,white,S,-0.07736,0.77639,0.69903
2019-02-17,black,S,-1.75386,2.73832,0.984461
2019-02-24,white,XL,0.05332,0.201993,0.255313
2019-03-03,black,XL,0.383699,-1.152456,-0.768757
2019-03-10,white,M,0.645334,-1.620715,-0.975381
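`set_index` and `reset_index` are inverse operations, and neither modifies the DataFrame it is called on. A minimal round-trip sketch with made-up data:

```python
import pandas as pd

frame = pd.DataFrame({"key": ["x", "y", "z"], "val": [1, 2, 3]})

# set_index moves a column into the index and returns a new DataFrame
keyed = frame.set_index("key")

# reset_index moves it back into a column with a fresh RangeIndex
restored = keyed.reset_index()

# drop=True discards the old index instead of keeping it as a column
dropped = keyed.reset_index(drop=True)

print(keyed)
print(dropped)
```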


### 5.6. Dropping/filling missing values

In [95]:
df.iloc[0, 4] = np.nan
df.iloc[5, 3] = np.nan
df

class,country,color,size,date,a,b
A,JP,black,S,2019-01-06,-1.903429,
B,CN,white,M,2019-01-13,-0.999367,-1.3093
B,US,black,L,2019-01-20,0.488169,0.271709
B,US,white,M,2019-01-27,-0.745023,-1.449695
C,US,black,L,2019-02-03,1.088622,-1.894711
A,CN,white,S,2019-02-10,,-1.489885
B,CN,black,S,2019-02-17,-0.753279,1.964937
A,CA,white,XL,2019-02-24,0.849969,-2.896975
C,JP,black,XL,2019-03-03,-0.743981,-0.267226
C,CA,white,M,2019-03-10,-0.567835,2.931386


In [112]:
# drop every row that contains an NA value (axis=1 would drop columns instead)
df.dropna(axis=0)

class,country,color,size,date,a,b
B,CN,white,M,2019-01-13,-0.999367,-1.3093
B,US,black,L,2019-01-20,0.488169,0.271709
B,US,white,M,2019-01-27,-0.745023,-1.449695
C,US,black,L,2019-02-03,1.088622,-1.894711
B,CN,black,S,2019-02-17,-0.753279,1.964937
A,CA,white,XL,2019-02-24,0.849969,-2.896975
C,JP,black,XL,2019-03-03,-0.743981,-0.267226
C,CA,white,M,2019-03-10,-0.567835,2.931386


In [108]:
# drop only the rows where a specific column (here a) is NA
df[df.a.notna()]

class,country,color,size,date,a,b
A,JP,black,S,2019-01-06,-1.903429,
B,CN,white,M,2019-01-13,-0.999367,-1.3093
B,US,black,L,2019-01-20,0.488169,0.271709
B,US,white,M,2019-01-27,-0.745023,-1.449695
C,US,black,L,2019-02-03,1.088622,-1.894711
B,CN,black,S,2019-02-17,-0.753279,1.964937
A,CA,white,XL,2019-02-24,0.849969,-2.896975
C,JP,black,XL,2019-03-03,-0.743981,-0.267226
C,CA,white,M,2019-03-10,-0.567835,2.931386


In [113]:
# fill NA in each numeric column with that column's mean
df.fillna(df.mean(numeric_only=True))

class,country,color,size,date,a,b
A,JP,black,S,2019-01-06,-1.903429,-0.459973
B,CN,white,M,2019-01-13,-0.999367,-1.3093
B,US,black,L,2019-01-20,0.488169,0.271709
B,US,white,M,2019-01-27,-0.745023,-1.449695
C,US,black,L,2019-02-03,1.088622,-1.894711
A,CN,white,S,2019-02-10,-0.365128,-1.489885
B,CN,black,S,2019-02-17,-0.753279,1.964937
A,CA,white,XL,2019-02-24,0.849969,-2.896975
C,JP,black,XL,2019-03-03,-0.743981,-0.267226
C,CA,white,M,2019-03-10,-0.567835,2.931386
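The dropping and filling patterns from this section can be combined in a few ways. The sketch below uses a made-up frame with one text column (`tag`) to show `dropna(subset=...)`, mean filling restricted to numeric columns, and per-column fill values:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                      "b": [np.nan, 5.0, 6.0],
                      "tag": ["u", "v", "w"]})

# Drop rows only when a chosen column is missing
a_present = frame.dropna(subset=["a"])

# Fill each numeric column with its own mean; numeric_only skips "tag"
filled = frame.fillna(frame.mean(numeric_only=True))

# A dict gives a different fill value per column
custom = frame.fillna({"a": 0.0, "b": -1.0})

print(filled)
print(custom)
```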


### 5.7. Mapping Functions

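Mapping transforms an initial set of values into another set of values through a function. `Series.map` applies the function (or a dictionary lookup) to each element of a single Series. `DataFrame.apply` is similar, except that it passes each whole column (or each row, with `axis=1`) to the function.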

In [136]:
# np.mean of a single value is that value, so this map returns the column unchanged
df.a.map(lambda x: np.mean(x))

class  country
A      JP        -0.267885
B      CN         1.595304
       US        -0.453008
       US         2.578424
C      US        -1.734742
A      CN         2.205640
B      CN        -0.156361
A      CA         0.123158
C      JP         0.828522
       CA        -0.636645
Name: a, dtype: float64

In [140]:
df[['a','b']].apply(lambda x: x.mean())

a    0.408241
b   -0.141856
dtype: float64
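A slightly more telling sketch of the difference between `map` and `apply`, on a made-up frame. The size-to-rank dictionary is an assumption chosen purely for illustration:

```python
import pandas as pd

frame = pd.DataFrame({"size": ["S", "M", "XL"],
                      "a": [1.0, 2.0, 3.0],
                      "b": [10.0, 20.0, 30.0]})

# Series.map works element by element; a dict acts as a lookup table
size_rank = frame["size"].map({"S": 0, "M": 1, "L": 2, "XL": 3})

# map also accepts a function applied to each value
squared = frame["a"].map(lambda v: v ** 2)

# DataFrame.apply passes whole columns (axis=0) or rows (axis=1) to the function
col_range = frame[["a", "b"]].apply(lambda col: col.max() - col.min())
row_sum = frame[["a", "b"]].apply(lambda row: row["a"] + row["b"], axis=1)

print(size_rank.tolist())
print(col_range)
```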

==================================================================================

In [47]:
size_lvl = df.groupby('size')

for i in size_lvl:
    print(i)

('L',                color size       date         a         b         c
class country                                                     
B     US       black    L 2019-01-20  0.429098  0.111413  0.540511
C     US       black    L 2019-02-03 -0.955619  0.684680 -0.270939)
('M',                color size       date         a         b         c
class country                                                     
B     CN       white    M 2019-01-13 -0.627125 -1.581838 -2.208963
      US       white    M 2019-01-27 -0.218822 -2.907143 -3.125965
C     CA       white    M 2019-03-10  0.645334 -1.620715 -0.975381)
('S',                color size       date         a        b         c
class country                                                    
A     JP       black    S 2019-01-06 -0.364302      NaN  1.850569
      CN       white    S 2019-02-10       NaN  0.77639  0.699030
B     CN       black    S 2019-02-17 -1.753860  2.73832  0.984461)
('XL',                color size       date   

## 3.2. Select specific group

In [48]:
size_lvl.get_group('M')

class,country,color,size,date,a,b,c
B,CN,white,M,2019-01-13,-0.627125,-1.581838,-2.208963
B,US,white,M,2019-01-27,-0.218822,-2.907143,-3.125965
C,CA,white,M,2019-03-10,0.645334,-1.620715,-0.975381


## 4.1. Aggregation

In [49]:
# numeric_only=True restricts the sum to the numeric columns a, b and c
size_lvl.sum(numeric_only=True).add_prefix('sum_')

size,sum_a,sum_b,sum_c
L,-0.526521,0.796093,0.269573
M,-0.200613,-6.109696,-6.310309
S,-2.118162,3.51471,3.53406
XL,0.437019,-0.950463,-0.513444


In [50]:
df.groupby(['size', 'color']).agg({'a': 'min', 'b': 'mean'})

size,color,a,b
L,black,-0.955619,0.398047
M,white,-0.627125,-2.036565
S,black,-1.75386,2.73832
S,white,,0.77639
XL,black,0.383699,-1.152456
XL,white,0.05332,0.201993


## 5.1. Apply a custom function

In [51]:
# Transform: the per-group range is broadcast back to every row of the group
data_range = lambda x: x.max() - x.min()
df.groupby('size')[['date', 'a', 'b', 'c']].transform(data_range)

class,country,date,a,b,c
A,JP,42 days,1.389557,1.96193,1.151539
B,CN,56 days,1.272459,1.325305,2.150584
B,US,14 days,1.384717,0.573267,0.81145
B,US,56 days,1.272459,1.325305,2.150584
C,US,14 days,1.384717,0.573267,0.81145
A,CN,42 days,1.389557,1.96193,1.151539
B,CN,42 days,1.389557,1.96193,1.151539
A,CA,7 days,0.330379,1.354449,1.024069
C,JP,7 days,0.330379,1.354449,1.024069
C,CA,56 days,1.272459,1.325305,2.150584


In [52]:
# Apply
df.groupby('size')['a'].apply(lambda x: x.max() - x.min())

size
L     1.384717
M     1.272459
S     1.389557
XL    0.330379
Name: a, dtype: float64
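The two outputs above differ in shape: a reducing function gives one value per group, while `transform` gives one value per original row. A compact sketch on made-up data:

```python
import pandas as pd

frame = pd.DataFrame({"size": ["S", "S", "M", "M"],
                      "a": [1.0, 4.0, 10.0, 16.0]})
grouped = frame.groupby("size")["a"]

# agg with a reducing function: one value per group
per_group = grouped.agg(lambda x: x.max() - x.min())

# transform: the group result is broadcast back to every original row
per_row = grouped.transform(lambda x: x.max() - x.min())

# A common use is centering each value on its group mean
centered = frame["a"] - grouped.transform("mean")

print(per_group)
print(per_row.tolist())
```

Because `transform` keeps the original index, its result can be assigned straight back as a new column.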

## 6.1. Rolling

Rolling creates an n-row window that slides over the rows of each group and aggregates the values inside it. Positions where the window holds fewer than n non-missing values return `NaN`.

In [53]:
df.groupby('color').rolling(2).a.sum()

color  class  country
black  A      JP              NaN
       B      US         0.064796
       C      US        -0.526521
       B      CN        -2.709478
       C      JP        -1.370160
white  B      CN              NaN
              US        -0.845947
       A      CN              NaN
              CA              NaN
       C      CA         0.698654
Name: a, dtype: float64
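The leading `NaN` values come from the default requirement that a window be full. A small sketch on a made-up Series shows the default next to `min_periods=1`, which accepts partial windows:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Window of 2 rows: the first row has no full window, so it is NaN
full_window = s.rolling(2).sum()

# min_periods=1 allows partial windows at the start
partial_ok = s.rolling(2, min_periods=1).sum()

print(full_window.tolist())
print(partial_ok.tolist())
```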

## 7.1. Expanding

Expanding also aggregates within each group, but the window starts at the first row of the group and grows by one row at every step, which gives cumulative statistics.

In [54]:
# Cumulative sum
df.groupby('color').expanding(1).a.sum()

color  class  country
black  A      JP        -0.364302
       B      US         0.064796
       C      US        -0.890823
       B      CN        -2.644683
       C      JP        -2.260983
white  B      CN        -0.627125
              US        -0.845947
       A      CN        -0.845947
              CA        -0.792627
       C      CA        -0.147293
Name: a, dtype: float64
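For a plain running total, a grouped `cumsum` gives the same numbers as an expanding sum but keeps the original row order. A sketch on made-up data:

```python
import pandas as pd

frame = pd.DataFrame({"color": ["black", "white", "black", "white"],
                      "a": [1.0, 10.0, 2.0, 20.0]})

# Expanding sum inside each group; the result is indexed by (group, original row)
expanding_sum = frame.groupby("color")["a"].expanding(1).sum()

# The same running totals via cumsum, in the original row order
running = frame.groupby("color")["a"].cumsum()

print(expanding_sum)
print(running.tolist())
```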

## 8.1. Filter

In [55]:
df.groupby('class').filter(lambda x: len(x) > 3)

class,country,color,size,date,a,b,c
B,CN,white,M,2019-01-13,-0.627125,-1.581838,-2.208963
B,US,black,L,2019-01-20,0.429098,0.111413,0.540511
B,US,white,M,2019-01-27,-0.218822,-2.907143,-3.125965
B,CN,black,S,2019-02-17,-1.75386,2.73832,0.984461
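`filter` keeps or discards whole groups based on a predicate that receives each group as a DataFrame. A minimal sketch on made-up data, with one predicate on group size and one on a group statistic:

```python
import pandas as pd

frame = pd.DataFrame({"class": ["A", "B", "B", "C", "C", "C"],
                      "a": [5.0, 1.0, 2.0, 3.0, 4.0, 5.0]})

# Keep every row of the groups that have at least 2 members
big_groups = frame.groupby("class").filter(lambda g: len(g) >= 2)

# The predicate can use group statistics too
high_mean = frame.groupby("class").filter(lambda g: g["a"].mean() > 3)

print(big_groups)
print(high_mean)
```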


## 9.1 Evaluating an expression