# Data Mangling with pandas
Mark Santcroos, Department of Human Genetics, Leiden University Medical Center

Examples and ideas taken from: [pandas documentation, "10 minutes to pandas"](https://pandas.pydata.org/pandas-docs/stable/10min.html)

# Introduction

## Powerful Python data analysis toolkit

pandas is a Python package aiming to provide
- **fast**
- **flexible**
- **expressive**

data structures designed to make working with

- **relational**
- **labeled**

data both
- **easy**
- **intuitive**

## Suitable for data of all sorts

- Tabular data with columns of different data types (as in an SQL table or Excel spreadsheet)
- Ordered and unordered time series data (not necessarily fixed-frequency)
- Arbitrary matrix data with row and column labels (homogeneously typed or heterogeneous)
- Any other form of observational / statistical data sets (the data need not be labeled to be placed into a pandas data structure)

## Primary data structures

- Series (1-dimensional)
- DataFrame (2-dimensional)

For R users, DataFrame provides everything that R’s data.frame provides and much more.

pandas is built on top of NumPy and is intended to integrate well with the many other third-party libraries of a scientific computing environment.
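
As a minimal sketch of how the two structures relate (the names `heights` and `people` and their values are made up for illustration): a Series is a labeled one-dimensional array backed by NumPy, and a DataFrame is a set of Series that share one index.

```python
import numpy as np
import pandas as pd

# A Series is a one-dimensional array of values plus an index of labels.
heights = pd.Series([1.72, 1.85, 1.64], index=["ann", "bob", "cy"])

# The underlying data is available as a NumPy array.
values = heights.to_numpy()

# A DataFrame is a collection of columns that share one index.
# The plain list is matched to the Series index by position.
people = pd.DataFrame({"height": heights, "age": [31, 45, 27]})

# Selecting one column gives back a Series with the same labels.
age = people["age"]
print(type(values).__name__, type(age).__name__, list(age.index))
```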

# Getting started

In [1]:
import pandas as pd
import numpy as np

In [2]:
pd.__version__

'2.1.3'

# Object creation

## Series

In [3]:
# Create Series with missing data
s = pd.Series([1,3,5,np.nan,6,8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

## DataFrame

In [4]:
# Create DatetimeIndex for 6 days
dates = pd.date_range('20170901', periods=6)
dates

DatetimeIndex(['2017-09-01', '2017-09-02', '2017-09-03', '2017-09-04',
               '2017-09-05', '2017-09-06'],
              dtype='datetime64[ns]', freq='D')

In [5]:
# Create a 6x4 NumPy array of random values (standard normal distribution)
ran_values = np.random.randn(6,4)
ran_values

array([[ 0.44057999, -0.14813443, -0.64273978, -0.79652794],
       [ 0.9941771 , -0.90039051,  0.45449045,  2.17529027],
       [-0.41381434,  1.21695023, -1.03746123,  1.08426416],
       [-0.51187413, -0.33306356, -1.16664576, -1.1092606 ],
       [ 0.79378938,  0.33526963,  0.2428412 ,  0.80630805],
       [-0.38244712,  0.55489813,  1.66328568, -0.47015855]])

In [6]:
df = pd.DataFrame(ran_values, columns=list('ABDC'))

In [7]:
df

Unnamed: 0,A,B,D,C
0,0.44058,-0.148134,-0.64274,-0.796528
1,0.994177,-0.900391,0.45449,2.17529
2,-0.413814,1.21695,-1.037461,1.084264
3,-0.511874,-0.333064,-1.166646,-1.109261
4,0.793789,0.33527,0.242841,0.806308
5,-0.382447,0.554898,1.663286,-0.470159


In [8]:
df.set_index(dates, inplace=True)

In [9]:
df

Unnamed: 0,A,B,D,C
2017-09-01,0.44058,-0.148134,-0.64274,-0.796528
2017-09-02,0.994177,-0.900391,0.45449,2.17529
2017-09-03,-0.413814,1.21695,-1.037461,1.084264
2017-09-04,-0.511874,-0.333064,-1.166646,-1.109261
2017-09-05,0.793789,0.33527,0.242841,0.806308
2017-09-06,-0.382447,0.554898,1.663286,-0.470159


In [10]:
# Create DataFrame by using a dict of series-like objects.
df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20170920'),
                     'C' : pd.Series(1, index=list(range(4)), dtype='float32'),
                     'D' : np.array([3] * 4, dtype='int32'),
                     'E' : pd.Categorical(["LUMC","EMC","LUMC","EMC"]),
                     'F' : 'researcher'
                   })
df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2017-09-20,1.0,3,LUMC,researcher
1,1.0,2017-09-20,1.0,3,EMC,researcher
2,1.0,2017-09-20,1.0,3,LUMC,researcher
3,1.0,2017-09-20,1.0,3,EMC,researcher


In [11]:
df2.dtypes

A          float64
B    datetime64[s]
C          float32
D            int32
E         category
F           object
dtype: object

# Tab completion

In [12]:
# df.<TAB> shows attributes and column names

# Exploring data

## Top (head) and bottom (tail) of data

In [13]:
df.head()

Unnamed: 0,A,B,D,C
2017-09-01,0.44058,-0.148134,-0.64274,-0.796528
2017-09-02,0.994177,-0.900391,0.45449,2.17529
2017-09-03,-0.413814,1.21695,-1.037461,1.084264
2017-09-04,-0.511874,-0.333064,-1.166646,-1.109261
2017-09-05,0.793789,0.33527,0.242841,0.806308


In [14]:
df.tail(2)

Unnamed: 0,A,B,D,C
2017-09-05,0.793789,0.33527,0.242841,0.806308
2017-09-06,-0.382447,0.554898,1.663286,-0.470159


## Metadata

In [15]:
df.index

DatetimeIndex(['2017-09-01', '2017-09-02', '2017-09-03', '2017-09-04',
               '2017-09-05', '2017-09-06'],
              dtype='datetime64[ns]', freq='D')

In [16]:
df.columns

Index(['A', 'B', 'D', 'C'], dtype='object')

In [17]:
df.values

array([[ 0.44057999, -0.14813443, -0.64273978, -0.79652794],
       [ 0.9941771 , -0.90039051,  0.45449045,  2.17529027],
       [-0.41381434,  1.21695023, -1.03746123,  1.08426416],
       [-0.51187413, -0.33306356, -1.16664576, -1.1092606 ],
       [ 0.79378938,  0.33526963,  0.2428412 ,  0.80630805],
       [-0.38244712,  0.55489813,  1.66328568, -0.47015855]])

## Basic statistics

In [18]:
df.describe()

Unnamed: 0,A,B,D,C
count,6.0,6.0,6.0,6.0
mean,0.153402,0.120922,-0.081038,0.281653
std,0.670959,0.742581,1.081032,1.278096
min,-0.511874,-0.900391,-1.166646,-1.109261
25%,-0.405973,-0.286831,-0.938781,-0.714936
50%,0.029066,0.093568,-0.199949,0.168075
75%,0.705487,0.499991,0.401578,1.014775
max,0.994177,1.21695,1.663286,2.17529


## Transposing data

In [19]:
df.T

Unnamed: 0,2017-09-01,2017-09-02,2017-09-03,2017-09-04,2017-09-05,2017-09-06
A,0.44058,0.994177,-0.413814,-0.511874,0.793789,-0.382447
B,-0.148134,-0.900391,1.21695,-0.333064,0.33527,0.554898
D,-0.64274,0.45449,-1.037461,-1.166646,0.242841,1.663286
C,-0.796528,2.17529,1.084264,-1.109261,0.806308,-0.470159


## Sorting

In [20]:
# Sort by column labels (axis=1)
df.sort_index(axis=1)

Unnamed: 0,A,B,C,D
2017-09-01,0.44058,-0.148134,-0.796528,-0.64274
2017-09-02,0.994177,-0.900391,2.17529,0.45449
2017-09-03,-0.413814,1.21695,1.084264,-1.037461
2017-09-04,-0.511874,-0.333064,-1.109261,-1.166646
2017-09-05,0.793789,0.33527,0.806308,0.242841
2017-09-06,-0.382447,0.554898,-0.470159,1.663286


In [21]:
# Sort by value
df.sort_values(by='B')

Unnamed: 0,A,B,D,C
2017-09-02,0.994177,-0.900391,0.45449,2.17529
2017-09-04,-0.511874,-0.333064,-1.166646,-1.109261
2017-09-01,0.44058,-0.148134,-0.64274,-0.796528
2017-09-05,0.793789,0.33527,0.242841,0.806308
2017-09-06,-0.382447,0.554898,1.663286,-0.470159
2017-09-03,-0.413814,1.21695,-1.037461,1.084264


# Data Selection

## Label based

In [22]:
# Select column, which returns a series
df['A']

2017-09-01    0.440580
2017-09-02    0.994177
2017-09-03   -0.413814
2017-09-04   -0.511874
2017-09-05    0.793789
2017-09-06   -0.382447
Freq: D, Name: A, dtype: float64

In [23]:
# Slicing with integers selects rows by position
df[1:4]

Unnamed: 0,A,B,D,C
2017-09-02,0.994177,-0.900391,0.45449,2.17529
2017-09-03,-0.413814,1.21695,-1.037461,1.084264
2017-09-04,-0.511874,-0.333064,-1.166646,-1.109261


In [24]:
# Slicing with index labels selects rows by label (the end label is included)
df['20170902':'20170904']

Unnamed: 0,A,B,D,C
2017-09-02,0.994177,-0.900391,0.45449,2.17529
2017-09-03,-0.413814,1.21695,-1.037461,1.084264
2017-09-04,-0.511874,-0.333064,-1.166646,-1.109261


In [25]:
# Cross section using a label
df.loc['2017-09-02']

A    0.994177
B   -0.900391
D    0.454490
C    2.175290
Name: 2017-09-02 00:00:00, dtype: float64

In [26]:
# Multi-axis selection based on label
df.loc[:,['A','B']]

Unnamed: 0,A,B
2017-09-01,0.44058,-0.148134
2017-09-02,0.994177,-0.900391
2017-09-03,-0.413814,1.21695
2017-09-04,-0.511874,-0.333064
2017-09-05,0.793789,0.33527
2017-09-06,-0.382447,0.554898


In [27]:
# Multi dimension label slicing
df.loc['20170902':'20170904',['B','C']]

Unnamed: 0,B,C
2017-09-02,-0.900391,2.17529
2017-09-03,1.21695,1.084264
2017-09-04,-0.333064,-1.109261


In [28]:
# Reduced dimension of return object for single rows
df.loc['20170902',['A','B']]

A    0.994177
B   -0.900391
Name: 2017-09-02 00:00:00, dtype: float64

In [29]:
# Scalar values
df.loc['20170902','A']

0.9941770977562511

## Position based
The semantics closely follow Python and NumPy slicing.

In [30]:
# Row
df.iloc[3]

A   -0.511874
B   -0.333064
D   -1.166646
C   -1.109261
Name: 2017-09-04 00:00:00, dtype: float64

In [31]:
# Multi dimension
df.iloc[3:5,2:4]

Unnamed: 0,D,C
2017-09-04,-1.166646,-1.109261
2017-09-05,0.242841,0.806308


In [32]:
# Select rows only
df.iloc[1:3,:]

Unnamed: 0,A,B,D,C
2017-09-02,0.994177,-0.900391,0.45449,2.17529
2017-09-03,-0.413814,1.21695,-1.037461,1.084264


In [33]:
# Select columns only
df.iloc[:,1:3]

Unnamed: 0,B,D
2017-09-01,-0.148134,-0.64274
2017-09-02,-0.900391,0.45449
2017-09-03,1.21695,-1.037461
2017-09-04,-0.333064,-1.166646
2017-09-05,0.33527,0.242841
2017-09-06,0.554898,1.663286


In [34]:
# Scalar value by position (row 1, column 1)
df.iloc[1,1]

-0.9003905086095164
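
The main difference from label-based slicing can be sketched as follows, using a small made-up frame `demo`: position slices with `iloc` exclude the end position, as in Python and NumPy, while label slices with `loc` include the end label.

```python
import pandas as pd

# Illustrative data; the index labels 'a'..'d' sit at positions 0..3.
demo = pd.DataFrame({"x": [10, 20, 30, 40]}, index=list("abcd"))

# Positions 1 and 2 only: the end position 3 is excluded.
by_position = demo.iloc[1:3]

# Labels 'b', 'c' and 'd': the end label is included.
by_label = demo.loc["b":"d"]

print(list(by_position.index), list(by_label.index))
```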

## Boolean indexing

In [35]:
# Using a single column’s values to select data.
df[df.A > 0]

Unnamed: 0,A,B,D,C
2017-09-01,0.44058,-0.148134,-0.64274,-0.796528
2017-09-02,0.994177,-0.900391,0.45449,2.17529
2017-09-05,0.793789,0.33527,0.242841,0.806308


In [36]:
# Selecting values from a DataFrame where a boolean condition is met.
df[df < 0]

Unnamed: 0,A,B,D,C
2017-09-01,,-0.148134,-0.64274,-0.796528
2017-09-02,,-0.900391,,
2017-09-03,-0.413814,,-1.037461,
2017-09-04,-0.511874,-0.333064,-1.166646,-1.109261
2017-09-05,,,,
2017-09-06,-0.382447,,,-0.470159


In [37]:
# Create new copy and add extra column
df3 = df.copy()
df3['E'] = ['one', 'one','two','three','four','three']
df3

Unnamed: 0,A,B,D,C,E
2017-09-01,0.44058,-0.148134,-0.64274,-0.796528,one
2017-09-02,0.994177,-0.900391,0.45449,2.17529,one
2017-09-03,-0.413814,1.21695,-1.037461,1.084264,two
2017-09-04,-0.511874,-0.333064,-1.166646,-1.109261,three
2017-09-05,0.793789,0.33527,0.242841,0.806308,four
2017-09-06,-0.382447,0.554898,1.663286,-0.470159,three


In [38]:
# Use isin() filtering
df3[df3['E'].isin(['two','four'])]

Unnamed: 0,A,B,D,C,E
2017-09-03,-0.413814,1.21695,-1.037461,1.084264,two
2017-09-05,0.793789,0.33527,0.242841,0.806308,four
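
Conditions can also be combined. A small sketch with a made-up frame `demo`, using the element-wise operators `&`, `|` and `~` rather than Python's `and`, `or` and `not`:

```python
import pandas as pd

demo = pd.DataFrame({"A": [0.4, -0.5, 0.8, -0.3],
                     "B": [-0.1, 1.2, 0.3, -0.9]})

# Each condition needs its own parentheses because & and | bind
# more tightly than comparison operators.
both_positive = demo[(demo["A"] > 0) & (demo["B"] > 0)]
either_positive = demo[(demo["A"] > 0) | (demo["B"] > 0)]
neither_positive = demo[~((demo["A"] > 0) | (demo["B"] > 0))]
```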


# Modifying data

In [39]:
s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20170902', periods=6))

s1

2017-09-02    1
2017-09-03    2
2017-09-04    3
2017-09-05    4
2017-09-06    5
2017-09-07    6
Freq: D, dtype: int64

In [40]:
# Add column F; values are aligned on df's index (2017-09-01 gets NaN, 2017-09-07 is dropped)
df['F'] = s1
df

Unnamed: 0,A,B,D,C,F
2017-09-01,0.44058,-0.148134,-0.64274,-0.796528,
2017-09-02,0.994177,-0.900391,0.45449,2.17529,1.0
2017-09-03,-0.413814,1.21695,-1.037461,1.084264,2.0
2017-09-04,-0.511874,-0.333064,-1.166646,-1.109261,3.0
2017-09-05,0.793789,0.33527,0.242841,0.806308,4.0
2017-09-06,-0.382447,0.554898,1.663286,-0.470159,5.0


In [41]:
# Setting values by label
df.at['20170902','A'] = 0
df

Unnamed: 0,A,B,D,C,F
2017-09-01,0.44058,-0.148134,-0.64274,-0.796528,
2017-09-02,0.0,-0.900391,0.45449,2.17529,1.0
2017-09-03,-0.413814,1.21695,-1.037461,1.084264,2.0
2017-09-04,-0.511874,-0.333064,-1.166646,-1.109261,3.0
2017-09-05,0.793789,0.33527,0.242841,0.806308,4.0
2017-09-06,-0.382447,0.554898,1.663286,-0.470159,5.0


In [42]:
# Set value at two dimensional location
df.iat[0,1] = 0
df

Unnamed: 0,A,B,D,C,F
2017-09-01,0.44058,0.0,-0.64274,-0.796528,
2017-09-02,0.0,-0.900391,0.45449,2.17529,1.0
2017-09-03,-0.413814,1.21695,-1.037461,1.084264,2.0
2017-09-04,-0.511874,-0.333064,-1.166646,-1.109261,3.0
2017-09-05,0.793789,0.33527,0.242841,0.806308,4.0
2017-09-06,-0.382447,0.554898,1.663286,-0.470159,5.0


In [43]:
# Setting a column based on a numpy array
df.loc[:,'D'] = np.array([5] * len(df))
df

Unnamed: 0,A,B,D,C,F
2017-09-01,0.44058,0.0,5.0,-0.796528,
2017-09-02,0.0,-0.900391,5.0,2.17529,1.0
2017-09-03,-0.413814,1.21695,5.0,1.084264,2.0
2017-09-04,-0.511874,-0.333064,5.0,-1.109261,3.0
2017-09-05,0.793789,0.33527,5.0,0.806308,4.0
2017-09-06,-0.382447,0.554898,5.0,-0.470159,5.0


## Setting with matching rule

In [44]:
# Work on a copy and negate every positive value
df2 = df.copy()
df2[df2 > 0] = -df2
df2

Unnamed: 0,A,B,D,C,F
2017-09-01,-0.44058,0.0,-5.0,-0.796528,
2017-09-02,0.0,-0.900391,-5.0,-2.17529,-1.0
2017-09-03,-0.413814,-1.21695,-5.0,-1.084264,-2.0
2017-09-04,-0.511874,-0.333064,-5.0,-1.109261,-3.0
2017-09-05,-0.793789,-0.33527,-5.0,-0.806308,-4.0
2017-09-06,-0.382447,-0.554898,-5.0,-0.470159,-5.0
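
A non-mutating alternative, sketched here on a made-up frame `demo`, is `DataFrame.mask` (or its complement `where`), which returns a new object instead of assigning in place:

```python
import pandas as pd

demo = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})

# mask() replaces values where the condition is True.
flipped = demo.mask(demo > 0, -demo)

# where() keeps values where the condition is True and replaces the rest.
same = demo.where(demo <= 0, -demo)

print(flipped)
```

The original `demo` is left unchanged by both calls.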


# Missing data

- pandas uses the value np.nan to represent missing data
- Missing values are excluded from computations by default
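
A minimal sketch of what this means for a reduction such as `mean`, using a made-up Series `values`; the `skipna` parameter switches the default behaviour off:

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0])

# Default: the NaN is skipped, so the mean is taken over 1.0 and 3.0.
mean_default = values.mean()

# With skipna=False the NaN propagates into the result.
mean_strict = values.mean(skipna=False)

# count() reports the number of non-missing values.
n_valid = values.count()
```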

In [45]:
# reindex (copy) a subset of the data and add an empty column E
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
# Set E to 1 for first two rows
df1.loc[dates[0]:dates[1],'E'] = 1
df1

Unnamed: 0,A,B,D,C,F,E
2017-09-01,0.44058,0.0,5.0,-0.796528,,1.0
2017-09-02,0.0,-0.900391,5.0,2.17529,1.0,1.0
2017-09-03,-0.413814,1.21695,5.0,1.084264,2.0,
2017-09-04,-0.511874,-0.333064,5.0,-1.109261,3.0,


In [46]:
# Drop all rows that have any unknown values
df1.dropna(how='any')

Unnamed: 0,A,B,D,C,F,E
2017-09-02,0.0,-0.900391,5.0,2.17529,1.0,1.0


In [47]:
# Replace NA with value
df1.fillna(value=42)

Unnamed: 0,A,B,D,C,F,E
2017-09-01,0.44058,0.0,5.0,-0.796528,42.0,1.0
2017-09-02,0.0,-0.900391,5.0,2.17529,1.0,1.0
2017-09-03,-0.413814,1.21695,5.0,1.084264,2.0,42.0
2017-09-04,-0.511874,-0.333064,5.0,-1.109261,3.0,42.0


In [48]:
# Show the boolean mask
pd.isnull(df1)

Unnamed: 0,A,B,D,C,F,E
2017-09-01,False,False,False,False,True,False
2017-09-02,False,False,False,False,False,False
2017-09-03,False,False,False,False,False,True
2017-09-04,False,False,False,False,False,True


# Operations

## Basic (stat) operators

In [49]:
# Mean per column
df.mean() # default, same as axis=0

A   -0.012294
B    0.145611
D    5.000000
C    0.281653
F    3.000000
dtype: float64

In [50]:
# Mean per row
df.mean(axis=1)

2017-09-01    1.161013
2017-09-02    1.454980
2017-09-03    1.777480
2017-09-04    1.209160
2017-09-05    2.187073
2017-09-06    1.940458
Freq: D, dtype: float64

## Apply

In [51]:
# Create my own function that returns the negated value
def my_func(val):
    return -val
# Apply my function to each column (every column is passed in as a Series)
df.apply(my_func)

Unnamed: 0,A,B,D,C,F
2017-09-01,-0.44058,-0.0,-5.0,0.796528,
2017-09-02,-0.0,0.900391,-5.0,-2.17529,-1.0
2017-09-03,0.413814,-1.21695,-5.0,-1.084264,-2.0
2017-09-04,0.511874,0.333064,-5.0,1.109261,-3.0
2017-09-05,-0.793789,-0.33527,-5.0,-0.806308,-4.0
2017-09-06,0.382447,-0.554898,-5.0,0.470159,-5.0
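
Note that `DataFrame.apply` hands the function one whole column (or row) at a time rather than single values. A sketch with a hypothetical helper `value_range` on a made-up frame `demo`:

```python
import pandas as pd

demo = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 40.0]})

def value_range(values):
    # `values` is a whole Series, not a single number.
    return values.max() - values.min()

# Default axis=0: the function is called once per column.
per_column = demo.apply(value_range)

# axis=1: the function is called once per row.
per_row = demo.apply(value_range, axis=1)
```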


## Histogramming

In [52]:
s = pd.Series(np.random.randint(0, 7, size=10))
s

0    2
1    0
2    6
3    5
4    1
5    0
6    6
7    1
8    5
9    3
dtype: int64

In [53]:
s.value_counts()

0    2
6    2
5    2
1    2
2    1
3    1
Name: count, dtype: int64

## Concatenating data

In [54]:
# Create 10x4 table with random numbers
df = pd.DataFrame(np.random.randn(10, 4))
df

Unnamed: 0,0,1,2,3
0,-1.032198,-0.887591,-1.463087,-2.050719
1,0.54408,-1.138772,-0.902064,-0.911823
2,0.541855,-1.523351,-0.209272,-1.610346
3,-0.627728,1.108326,-1.050366,-0.566356
4,-1.687683,1.139958,-1.611219,-1.009924
5,-0.646679,1.548301,0.676723,-0.693801
6,-0.094278,-0.076768,-0.493994,1.250234
7,-1.957639,0.148645,-1.016835,-0.541509
8,-0.349715,-0.89064,1.32932,-0.174841
9,-0.343248,0.883219,-1.65405,0.349884


In [55]:
# Split them into 3 chunks (row-based)
chunks = [df[:3], df[3:7], df[7:]]
chunks

[          0         1         2         3
 0 -1.032198 -0.887591 -1.463087 -2.050719
 1  0.544080 -1.138772 -0.902064 -0.911823
 2  0.541855 -1.523351 -0.209272 -1.610346,
           0         1         2         3
 3 -0.627728  1.108326 -1.050366 -0.566356
 4 -1.687683  1.139958 -1.611219 -1.009924
 5 -0.646679  1.548301  0.676723 -0.693801
 6 -0.094278 -0.076768 -0.493994  1.250234,
           0         1         2         3
 7 -1.957639  0.148645 -1.016835 -0.541509
 8 -0.349715 -0.890640  1.329320 -0.174841
 9 -0.343248  0.883219 -1.654050  0.349884]

In [56]:
# add them back together
pd.concat(chunks)

Unnamed: 0,0,1,2,3
0,-1.032198,-0.887591,-1.463087,-2.050719
1,0.54408,-1.138772,-0.902064,-0.911823
2,0.541855,-1.523351,-0.209272,-1.610346
3,-0.627728,1.108326,-1.050366,-0.566356
4,-1.687683,1.139958,-1.611219,-1.009924
5,-0.646679,1.548301,0.676723,-0.693801
6,-0.094278,-0.076768,-0.493994,1.250234
7,-1.957639,0.148645,-1.016835,-0.541509
8,-0.349715,-0.89064,1.32932,-0.174841
9,-0.343248,0.883219,-1.65405,0.349884


## Joining data
There are many ways to combine multiple DataFrames; `pd.merge` performs SQL-style joins on a key column.

In [57]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

In [58]:
left

Unnamed: 0,key,lval
0,foo,1
1,foo,2


In [59]:
right

Unnamed: 0,key,rval
0,foo,4
1,foo,5


In [60]:
# Duplicate keys: each 'foo' row on the left is paired with each 'foo' row on the right
pd.merge(left, right, on='key')

Unnamed: 0,key,lval,rval
0,foo,1,4
1,foo,1,5
2,foo,2,4
3,foo,2,5


In [61]:
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

In [62]:
left

Unnamed: 0,key,lval
0,foo,1
1,bar,2


In [63]:
right

Unnamed: 0,key,rval
0,foo,4
1,bar,5


In [64]:
pd.merge(left, right, on='key')

Unnamed: 0,key,lval,rval
0,foo,1,4
1,bar,2,5
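
The examples above use the default inner join. As a sketch of other join types, with made-up frames `left_demo` and `right_demo` whose keys only partly overlap, the `how` parameter of `pd.merge` controls which keys are kept:

```python
import pandas as pd

left_demo = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right_demo = pd.DataFrame({"key": ["foo", "baz"], "rval": [4, 5]})

# Inner join (default): only keys present in both frames.
inner = pd.merge(left_demo, right_demo, on="key")

# Left join: all keys from the left frame; missing rval becomes NaN.
left_join = pd.merge(left_demo, right_demo, on="key", how="left")

# Outer join: all keys from both frames; indicator=True adds a
# '_merge' column telling where each row came from.
outer = pd.merge(left_demo, right_demo, on="key", how="outer", indicator=True)
```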


## Appending rows

In [65]:
# Create an 8x4 DataFrame of random values
df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
df

Unnamed: 0,A,B,C,D
0,1.144167,-1.168476,0.249483,-1.734513
1,-0.70955,0.504567,-1.146155,-0.405963
2,-0.157156,-1.52509,0.704099,0.647796
3,-0.114077,-0.843899,0.331256,-1.461805
4,-0.215538,1.166669,-0.425822,1.092165
5,-1.386906,0.121748,-1.278508,-0.763579
6,-0.381231,0.007251,0.928618,-0.54172
7,0.54859,0.600198,-0.402174,0.411751


In [66]:
# extract a row
s = df.iloc[3]
s

A   -0.114077
B   -0.843899
C    0.331256
D   -1.461805
Name: 3, dtype: float64

In [67]:
# Append the extracted row at the end
# DataFrame.append() was removed in pandas 2.0.0; use pd.concat instead
df = pd.concat([df, pd.DataFrame([s])], ignore_index=True)
df

Unnamed: 0,A,B,C,D
0,1.144167,-1.168476,0.249483,-1.734513
1,-0.70955,0.504567,-1.146155,-0.405963
2,-0.157156,-1.52509,0.704099,0.647796
3,-0.114077,-0.843899,0.331256,-1.461805
4,-0.215538,1.166669,-0.425822,1.092165
5,-1.386906,0.121748,-1.278508,-0.763579
6,-0.381231,0.007251,0.928618,-0.54172
7,0.54859,0.600198,-0.402174,0.411751
8,-0.114077,-0.843899,0.331256,-1.461805


## Grouping

Group rows by value in order to apply a method (such as `sum`) to each group.

In [68]:
# Create a DataFrame with two label columns (A, B) and two numeric columns (C, D)
df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                              'two', 'two', 'one', 'three'],
                   'C' : np.random.randn(8),
                   'D' : np.random.randn(8)})
df

Unnamed: 0,A,B,C,D
0,foo,one,0.631038,-0.022213
1,bar,one,-0.413756,-0.334146
2,foo,two,-1.698029,0.804681
3,bar,three,1.334126,0.163851
4,foo,two,-0.122999,0.510452
5,bar,two,-1.698212,0.670997
6,foo,one,0.785731,-0.144679
7,foo,three,-0.666275,0.038802


In [69]:
# Sum per group of A; note that the strings in column B are concatenated
df.groupby('A').sum()

Unnamed: 0_level_0,B,C,D
A,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,onethreetwo,-0.777843,0.500702
foo,onetwotwoonethree,-1.070535,1.187045


In [70]:
df.groupby(['A','B']).sum()

Unnamed: 0_level_0,Unnamed: 1_level_0,C,D
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,-0.413756,-0.334146
bar,three,1.334126,0.163851
bar,two,-1.698212,0.670997
foo,one,1.416769,-0.166891
foo,three,-0.666275,0.038802
foo,two,-1.821028,1.315134
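
In the single-key example above, the string column B is concatenated by `sum`. A sketch of two ways to restrict the aggregation to numeric data, on a made-up frame `demo`:

```python
import pandas as pd

demo = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"],
                     "B": ["one", "one", "two", "two"],
                     "C": [1.0, 2.0, 3.0, 4.0]})

# Option 1: ask sum() to ignore non-numeric columns.
numeric_sum = demo.groupby("A").sum(numeric_only=True)

# Option 2: select the column first, then compute several statistics at once.
summary = demo.groupby("A")["C"].agg(["sum", "mean", "count"])
```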


## Reshaping using pivot_table

In [71]:
# Create a flat table with duplicated entries
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['X', 'Y', 'Z'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df

Unnamed: 0,A,B,C,D,E
0,one,X,foo,0.120945,0.899031
1,one,Y,foo,-1.41281,-1.84877
2,two,Z,foo,-1.253503,2.625549
3,three,X,bar,1.049642,0.150325
4,one,Y,bar,-0.680348,-0.117214
5,one,Z,bar,-0.776564,-1.251835
6,two,X,foo,-0.411001,-1.54937
7,three,Y,foo,-0.635409,-0.170142
8,one,Z,foo,0.031961,-0.123966
9,one,X,bar,-1.048441,0.30167


In [72]:
# Create a pivot table using A and B as the index, making C columns, and using D as the values (E is not used)
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])

Unnamed: 0_level_0,C,bar,foo
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
one,X,-1.048441,0.120945
one,Y,-0.680348,-1.41281
one,Z,-0.776564,0.031961
three,X,1.049642,
three,Y,,-0.635409
three,Z,1.223787,
two,X,,-0.411001
two,Y,-0.334869,
two,Z,,-1.253503
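
When several rows fall into the same cell, `pivot_table` aggregates them, by default with the mean. A small sketch on a made-up frame `demo`, where the `aggfunc` parameter selects a different aggregation:

```python
import pandas as pd

# Two rows share the combination A='one', C='foo'.
demo = pd.DataFrame({"A": ["one", "one", "two"],
                     "C": ["foo", "foo", "bar"],
                     "D": [1.0, 3.0, 5.0]})

# Default aggregation: mean of the duplicated entries.
mean_table = pd.pivot_table(demo, values="D", index="A", columns="C")

# Explicit aggregation: sum of the duplicated entries.
sum_table = pd.pivot_table(demo, values="D", index="A", columns="C",
                           aggfunc="sum")
```

Combinations that do not occur in the data, such as A='one' with C='bar', show up as NaN.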


# Further information
- https://pandas.pydata.org/pandas-docs/stable/index.html
- https://stackoverflow.com (questions tagged `pandas`)