# Pandas

This is a short introduction to pandas.

In [89]:
import numpy as np

import pandas as pd

## Basic data structures in pandas
Pandas provides two types of classes for handling data:

1. Series: a one-dimensional labeled array holding data of any type, such as integers, strings, Python objects etc.

2. DataFrame: a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns.

In [90]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])

s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

In [91]:
dates = pd.date_range("20130101", periods=6)

print(dates)

dates_series = pd.Series(dates)

dates_series

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')


0   2013-01-01
1   2013-01-02
2   2013-01-03
3   2013-01-04
4   2013-01-05
5   2013-01-06
dtype: datetime64[ns]

In [92]:
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

df

Unnamed: 0,A,B,C,D
2013-01-01,-1.834331,0.225912,-0.136538,0.608553
2013-01-02,-0.231736,0.709885,-0.372238,0.43267
2013-01-03,0.938392,-0.46687,0.019246,0.148294
2013-01-04,0.827991,0.532792,0.320971,0.63198
2013-01-05,-2.389382,-0.559091,-0.55791,-0.620393
2013-01-06,-0.463736,1.597879,-0.673202,0.030351


Creating a DataFrame by passing a dictionary of objects where the keys are the column labels and the values are the column values.

In [93]:
df2 = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)


df2

Unnamed: 0,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo


In [94]:
df2.dtypes

A          float64
B    datetime64[s]
C          float32
D            int32
E         category
F           object
dtype: object

## Viewing data
See the Essential basic functionality section.

Use DataFrame.head() and DataFrame.tail() to view the top and bottom rows of the frame respectively:

In [95]:
df.head()

Unnamed: 0,A,B,C,D
2013-01-01,-1.834331,0.225912,-0.136538,0.608553
2013-01-02,-0.231736,0.709885,-0.372238,0.43267
2013-01-03,0.938392,-0.46687,0.019246,0.148294
2013-01-04,0.827991,0.532792,0.320971,0.63198
2013-01-05,-2.389382,-0.559091,-0.55791,-0.620393


In [96]:
df.tail(3)

Unnamed: 0,A,B,C,D
2013-01-04,0.827991,0.532792,0.320971,0.63198
2013-01-05,-2.389382,-0.559091,-0.55791,-0.620393
2013-01-06,-0.463736,1.597879,-0.673202,0.030351


In [97]:
df.index

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [98]:
df.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [99]:
df.to_numpy()

array([[-1.83433127,  0.22591151, -0.13653792,  0.60855274],
       [-0.23173552,  0.70988513, -0.37223821,  0.43266996],
       [ 0.9383921 , -0.46686967,  0.01924631,  0.14829428],
       [ 0.82799116,  0.53279159,  0.32097085,  0.63198028],
       [-2.38938181, -0.55909074, -0.55791035, -0.62039279],
       [-0.46373596,  1.59787876, -0.67320181,  0.03035106]])

In [100]:
df2.dtypes

A          float64
B    datetime64[s]
C          float32
D            int32
E         category
F           object
dtype: object

In [101]:
df2.to_numpy()

array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

describe() shows a quick statistical summary of your data:

In [102]:
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.525467,0.340084,-0.233279,0.205243
std,1.360303,0.803575,0.373777,0.471591
min,-2.389382,-0.559091,-0.673202,-0.620393
25%,-1.491682,-0.293674,-0.511492,0.059837
50%,-0.347736,0.379352,-0.254388,0.290482
75%,0.563059,0.665612,-0.0197,0.564582
max,0.938392,1.597879,0.320971,0.63198


Transposing your data:

In [103]:
df.T

Unnamed: 0,2013-01-01,2013-01-02,2013-01-03,2013-01-04,2013-01-05,2013-01-06
A,-1.834331,-0.231736,0.938392,0.827991,-2.389382,-0.463736
B,0.225912,0.709885,-0.46687,0.532792,-0.559091,1.597879
C,-0.136538,-0.372238,0.019246,0.320971,-0.55791,-0.673202
D,0.608553,0.43267,0.148294,0.63198,-0.620393,0.030351


DataFrame.sort_index() sorts by an axis:

In [104]:
df.sort_index(axis=1, ascending=False)

Unnamed: 0,D,C,B,A
2013-01-01,0.608553,-0.136538,0.225912,-1.834331
2013-01-02,0.43267,-0.372238,0.709885,-0.231736
2013-01-03,0.148294,0.019246,-0.46687,0.938392
2013-01-04,0.63198,0.320971,0.532792,0.827991
2013-01-05,-0.620393,-0.55791,-0.559091,-2.389382
2013-01-06,0.030351,-0.673202,1.597879,-0.463736


DataFrame.sort_values() sorts by values:

In [105]:
df.sort_values(by="B")

Unnamed: 0,A,B,C,D
2013-01-05,-2.389382,-0.559091,-0.55791,-0.620393
2013-01-03,0.938392,-0.46687,0.019246,0.148294
2013-01-01,-1.834331,0.225912,-0.136538,0.608553
2013-01-04,0.827991,0.532792,0.320971,0.63198
2013-01-02,-0.231736,0.709885,-0.372238,0.43267
2013-01-06,-0.463736,1.597879,-0.673202,0.030351


## Getitem ([])
For a DataFrame, passing a single label selects a column and yields a Series equivalent to df.A:

In [106]:
df["A"]

2013-01-01   -1.834331
2013-01-02   -0.231736
2013-01-03    0.938392
2013-01-04    0.827991
2013-01-05   -2.389382
2013-01-06   -0.463736
Freq: D, Name: A, dtype: float64

For a DataFrame, passing a slice : selects matching rows:

In [107]:
df[0:3]

Unnamed: 0,A,B,C,D
2013-01-01,-1.834331,0.225912,-0.136538,0.608553
2013-01-02,-0.231736,0.709885,-0.372238,0.43267
2013-01-03,0.938392,-0.46687,0.019246,0.148294


In [108]:
df["20130102":"20130104"]

Unnamed: 0,A,B,C,D
2013-01-02,-0.231736,0.709885,-0.372238,0.43267
2013-01-03,0.938392,-0.46687,0.019246,0.148294
2013-01-04,0.827991,0.532792,0.320971,0.63198


## Selection by label
See more in Selection by Label using DataFrame.loc() or DataFrame.at().

Selecting a row matching a label:

In [109]:
df.loc[dates[0]]

A   -1.834331
B    0.225912
C   -0.136538
D    0.608553
Name: 2013-01-01 00:00:00, dtype: float64

Selecting all rows (:) with selected column labels:

In [110]:
df.loc[:, ["A", "B"]]

Unnamed: 0,A,B
2013-01-01,-1.834331,0.225912
2013-01-02,-0.231736,0.709885
2013-01-03,0.938392,-0.46687
2013-01-04,0.827991,0.532792
2013-01-05,-2.389382,-0.559091
2013-01-06,-0.463736,1.597879


For label slicing, both endpoints are included:

In [111]:
df.loc["20130102":"20130104", ["A", "B"]]

Unnamed: 0,A,B
2013-01-02,-0.231736,0.709885
2013-01-03,0.938392,-0.46687
2013-01-04,0.827991,0.532792
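
The contrast with position-based slicing can be sketched as follows. This is a minimal illustration on a made-up frame (`demo` and its integer values are assumptions, not the tutorial's random data): a label slice with .loc keeps both endpoints, while an integer slice with .iloc drops the stop position.

```python
import numpy as np
import pandas as pd

# Illustrative frame with the same kind of daily DatetimeIndex as above.
dates = pd.date_range("20130101", periods=6)
demo = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Label-based slice: both "20130102" and "20130104" are included (3 rows).
by_label = demo.loc["20130102":"20130104"]

# Position-based slice: the stop position 4 is excluded (positions 1, 2, 3).
by_position = demo.iloc[1:4]

print(len(by_label), len(by_position))  # 3 3
print(by_label.equals(by_position))     # True
```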


Selecting a single row and column label returns a scalar:

In [112]:
df.loc[dates[0], "A"]

np.float64(-1.8343312672195884)

For getting fast access to a scalar (equivalent to the prior method):

In [113]:
df.at[dates[0], "A"]

np.float64(-1.8343312672195884)

## Selection by position
See more in Selection by Position using DataFrame.iloc() or DataFrame.iat().

Select via the position of the passed integers:

In [114]:
df.iloc[3]

A    0.827991
B    0.532792
C    0.320971
D    0.631980
Name: 2013-01-04 00:00:00, dtype: float64

Integer slices act similarly to NumPy/Python:

In [115]:
df.iloc[3:5, 0:2]

Unnamed: 0,A,B
2013-01-04,0.827991,0.532792
2013-01-05,-2.389382,-0.559091


Lists of integer position locations:

In [116]:
df.iloc[[1, 2, 4], [0, 2]]

Unnamed: 0,A,C
2013-01-02,-0.231736,-0.372238
2013-01-03,0.938392,0.019246
2013-01-05,-2.389382,-0.55791


For slicing rows explicitly:

In [117]:
df.iloc[1:3, :]

Unnamed: 0,A,B,C,D
2013-01-02,-0.231736,0.709885,-0.372238,0.43267
2013-01-03,0.938392,-0.46687,0.019246,0.148294


For getting a value explicitly:

In [118]:
df.iloc[1, 1]

np.float64(0.7098851281780919)

For getting fast access to a scalar (equivalent to the prior method):

In [119]:
df.iat[1, 1]

np.float64(0.7098851281780919)

## Boolean indexing
Select rows where df.A is greater than 0.

In [120]:
df[df["A"] > 0]

Unnamed: 0,A,B,C,D
2013-01-03,0.938392,-0.46687,0.019246,0.148294
2013-01-04,0.827991,0.532792,0.320971,0.63198


Selecting values from a DataFrame where a boolean condition is met:

In [121]:
df[df > 0]

Unnamed: 0,A,B,C,D
2013-01-01,,0.225912,,0.608553
2013-01-02,,0.709885,,0.43267
2013-01-03,0.938392,,0.019246,0.148294
2013-01-04,0.827991,0.532792,0.320971,0.63198
2013-01-05,,,,
2013-01-06,,1.597879,,0.030351
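
Several conditions can be combined with & (and), | (or) and ~ (not). The sketch below uses a small made-up frame (`demo` is an assumption for illustration). Each comparison is wrapped in parentheses because & and | bind more tightly than > and <.

```python
import pandas as pd

# Illustrative data, not the tutorial's random frame.
demo = pd.DataFrame({"A": [-1.0, 0.5, 2.0, -0.3], "B": [1.0, -2.0, 3.0, 4.0]})

# Rows where both columns are positive.
both_positive = demo[(demo["A"] > 0) & (demo["B"] > 0)]

# Rows where at least one column is negative.
either_negative = demo[(demo["A"] < 0) | (demo["B"] < 0)]

print(both_positive.index.tolist())    # [2]
print(either_negative.index.tolist())  # [0, 1, 3]
```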


Using isin() method for filtering:

In [122]:
df2 = df.copy()
df2

Unnamed: 0,A,B,C,D
2013-01-01,-1.834331,0.225912,-0.136538,0.608553
2013-01-02,-0.231736,0.709885,-0.372238,0.43267
2013-01-03,0.938392,-0.46687,0.019246,0.148294
2013-01-04,0.827991,0.532792,0.320971,0.63198
2013-01-05,-2.389382,-0.559091,-0.55791,-0.620393
2013-01-06,-0.463736,1.597879,-0.673202,0.030351


In [123]:
df2["E"] = ["one", "one", "two", "three", "four", "three"]

df2[df2["E"].isin(["two", "four"])]

## Setting
Setting a new column automatically aligns the data by the indexes:

In [124]:
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
s1

2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

In [125]:
df["F"] = s1

Setting values by label:

In [126]:
df.at[dates[0], "A"] = 0

Setting values by position:

In [127]:
df.iat[0, 1] = 0

Setting by assigning with a NumPy array:

In [128]:
df.loc[:, "D"] = np.array([5] * len(df))

The result of the prior setting operations:

In [129]:
df

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,-0.136538,5.0,
2013-01-02,-0.231736,0.709885,-0.372238,5.0,1.0
2013-01-03,0.938392,-0.46687,0.019246,5.0,2.0
2013-01-04,0.827991,0.532792,0.320971,5.0,3.0
2013-01-05,-2.389382,-0.559091,-0.55791,5.0,4.0
2013-01-06,-0.463736,1.597879,-0.673202,5.0,5.0
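
The NaN in the first row of column F comes from index alignment: s1 starts one day later than df, and its last label has no matching row. A smaller sketch of the same effect, with made-up names and values (`demo`, `shifted`):

```python
import pandas as pd

dates = pd.date_range("20130101", periods=3)
demo = pd.DataFrame({"A": [10, 20, 30]}, index=dates)

# This Series starts one day later, so only two of its labels exist in demo.
shifted = pd.Series([1, 2, 3], index=pd.date_range("20130102", periods=3))

# Assignment matches on index labels, not on position.
demo["F"] = shifted

print(demo)
```

The first row gets NaN because no label matches, and the value for 2013-01-04 is dropped because demo has no such row.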


A where operation with setting:

In [130]:
df2 = df.copy()

df2[df2 > 0] = -df2

df2

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,-0.136538,-5.0,
2013-01-02,-0.231736,-0.709885,-0.372238,-5.0,-1.0
2013-01-03,-0.938392,-0.46687,-0.019246,-5.0,-2.0
2013-01-04,-0.827991,-0.532792,-0.320971,-5.0,-3.0
2013-01-05,-2.389382,-0.559091,-0.55791,-5.0,-4.0
2013-01-06,-0.463736,-1.597879,-0.673202,-5.0,-5.0
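
The same result can be obtained without in-place assignment by using DataFrame.mask(), which replaces values where the condition is True and returns a new frame. A minimal sketch on made-up data (`demo` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"A": [1.0, -2.0, 3.0], "B": [-4.0, 5.0, np.nan]})

# Boolean-mask assignment, as in the cell above, modifies the copy in place.
assigned = demo.copy()
assigned[assigned > 0] = -assigned

# mask() leaves demo untouched and returns the modified frame.
masked = demo.mask(demo > 0, -demo)

print(masked)
print(assigned.equals(masked))  # True
```

Missing values compare as False against 0, so they are left as NaN in both versions.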


## Missing data
For NumPy data types, np.nan represents missing data. It is by default not included in computations. See the Missing Data section.

Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data:

In [131]:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])

df1.loc[dates[0] : dates[1], "E"] = 1

df1

Unnamed: 0,A,B,C,D,F,E
2013-01-01,0.0,0.0,-0.136538,5.0,,1.0
2013-01-02,-0.231736,0.709885,-0.372238,5.0,1.0,1.0
2013-01-03,0.938392,-0.46687,0.019246,5.0,2.0,
2013-01-04,0.827991,0.532792,0.320971,5.0,3.0,


DataFrame.dropna() drops any rows that have missing data:

In [132]:
df1.dropna(how="any")

Unnamed: 0,A,B,C,D,F,E
2013-01-02,-0.231736,0.709885,-0.372238,5.0,1.0,1.0
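
The how and subset parameters control which rows count as "missing". A short sketch on made-up data (`demo` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {
        "A": [1.0, np.nan, 3.0, np.nan],
        "B": [1.0, 2.0, np.nan, np.nan],
    }
)

any_missing = demo.dropna(how="any")    # drop rows with at least one NaN
all_missing = demo.dropna(how="all")    # drop rows where every value is NaN
only_col_a = demo.dropna(subset=["A"])  # only look at column A

print(any_missing.index.tolist())  # [0]
print(all_missing.index.tolist())  # [0, 1, 2]
print(only_col_a.index.tolist())   # [0, 2]
```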


DataFrame.fillna() fills missing data:

In [133]:
df1.fillna(value=5)

Unnamed: 0,A,B,C,D,F,E
2013-01-01,0.0,0.0,-0.136538,5.0,5.0,1.0
2013-01-02,-0.231736,0.709885,-0.372238,5.0,1.0,1.0
2013-01-03,0.938392,-0.46687,0.019246,5.0,2.0,5.0
2013-01-04,0.827991,0.532792,0.320971,5.0,3.0,5.0


isna() gets the boolean mask where values are nan:

In [134]:
pd.isna(df1)

Unnamed: 0,A,B,C,D,F,E
2013-01-01,False,False,False,False,True,False
2013-01-02,False,False,False,False,False,False
2013-01-03,False,False,False,False,False,True
2013-01-04,False,False,False,False,False,True


## Operations
See the Basic section on Binary Ops.

### Stats
Operations in general exclude missing data.

Calculate the mean value for each column:

In [135]:
df.mean()

A   -0.219745
B    0.302433
C   -0.233279
D    5.000000
F    3.000000
dtype: float64

Calculate the mean value for each row:

In [136]:
df.mean(axis=1)

2013-01-01    1.215866
2013-01-02    1.221182
2013-01-03    1.498154
2013-01-04    1.936351
2013-01-05    1.098723
2013-01-06    2.092188
Freq: D, dtype: float64
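
What "exclude missing data" means can be seen on a small Series. In this sketch the values are made up for illustration. Passing skipna=False turns the default behavior off.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, np.nan, 5.0])

default_mean = values.mean()             # NaN is skipped: (1 + 2 + 5) / 3
strict_mean = values.mean(skipna=False)  # NaN propagates into the result
n_present = values.count()               # number of non-missing values

print(default_mean, strict_mean, n_present)
```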

Operating with another Series or DataFrame with a different index or column will align the result with the union of the index or column labels. In addition, pandas automatically broadcasts along the specified dimension and will fill unaligned labels with np.nan.

In [137]:
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)

s

2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64

In [138]:
df.sub(s, axis="index")

Unnamed: 0,A,B,C,D,F
2013-01-01,,,,,
2013-01-02,,,,,
2013-01-03,-0.061608,-1.46687,-0.980754,4.0,1.0
2013-01-04,-2.172009,-2.467208,-2.679029,2.0,0.0
2013-01-05,-7.389382,-5.559091,-5.55791,0.0,-1.0
2013-01-06,,,,,
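
The alignment rule described above can be sketched with two Series whose indexes overlap only partly. The names `left_s` and `right_s` are made up for illustration. The result carries the union of the labels, and fill_value is one way to avoid NaN for labels present on only one side.

```python
import pandas as pd

left_s = pd.Series([1, 2, 3], index=["a", "b", "c"])
right_s = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Labels "a" and "d" exist on one side only, so the sum is NaN there.
total = left_s + right_s

# With fill_value=0 the missing side is treated as 0 instead.
filled = left_s.add(right_s, fill_value=0)

print(total)
print(filled)
```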


## User defined functions
DataFrame.agg() and DataFrame.transform() apply a user-defined function that reduces or broadcasts its result, respectively.

In [139]:
df.agg(lambda x: np.mean(x) * 5.6)

A    -1.230572
B     1.693622
C    -1.306360
D    28.000000
F    16.800000
dtype: float64

In [140]:
df.transform(lambda x: x * 101.2)

Unnamed: 0,A,B,C,D,F
2013-01-01,0.0,0.0,-13.817637,506.0,
2013-01-02,-23.451635,71.840375,-37.670507,506.0,101.2
2013-01-03,94.96528,-47.247211,1.947727,506.0,202.4
2013-01-04,83.792706,53.918509,32.48225,506.0,303.6
2013-01-05,-241.805439,-56.579983,-56.460528,506.0,404.8
2013-01-06,-46.930079,161.705331,-68.128023,506.0,506.0
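
The difference in output shape is easier to see on a tiny frame. In this sketch `demo` and both lambdas are made up for illustration: agg() collapses each column to one value, while transform() returns a frame with the same shape as its input.

```python
import pandas as pd

demo = pd.DataFrame({"A": [1.0, 2.0, 3.0], "B": [10.0, 20.0, 30.0]})

# Reduce: one number per column (here, the range of each column).
reduced = demo.agg(lambda col: col.max() - col.min())

# Broadcast: same shape as demo (here, each column centered on its mean).
centered = demo.transform(lambda col: col - col.mean())

print(reduced)
print(centered)
```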


## Value Counts
See more at Histogramming and Discretization.

In [141]:
s = pd.Series(np.random.randint(0, 7, size=10))

s

0    3
1    1
2    6
3    2
4    3
5    3
6    1
7    5
8    5
9    6
dtype: int32

In [142]:
s.value_counts()

3    3
1    2
6    2
5    2
2    1
Name: count, dtype: int64
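
Two common variations, sketched on the ten values printed above (hard-coded here so the example does not depend on the random draw): normalize=True returns relative frequencies, and sort_index() orders the result by value rather than by count.

```python
import pandas as pd

values = pd.Series([3, 1, 6, 2, 3, 3, 1, 5, 5, 6])

shares = values.value_counts(normalize=True)   # fractions that sum to 1
by_value = values.value_counts().sort_index()  # ordered by the value itself

print(shares)
print(by_value)
```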

## String Methods
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. See more at Vectorized String Methods.

In [143]:
s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])

s.str.lower()

0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
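
Missing values pass through the str methods unchanged, which matters when the result is used as a boolean mask. A small sketch with made-up data: str.contains() accepts an na argument that sets the result for missing entries.

```python
import numpy as np
import pandas as pd

words = pd.Series(["A", "Aaba", np.nan, "dog"])

lengths = words.str.len()                  # the missing entry stays missing
has_a = words.str.contains("a", na=False)  # treat missing as "no match"

print(lengths)
print(has_a.tolist())  # [False, True, False, False]
```

The match is case-sensitive, so "A" does not contain "a".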

## Merge
### Concat
pandas provides various facilities for easily combining together Series and DataFrame objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.

See the Merging section.

Concatenating pandas objects together row-wise with concat():

In [144]:
df = pd.DataFrame(np.random.randn(10, 4))

df

Unnamed: 0,0,1,2,3
0,-1.369321,0.89306,-1.512647,0.374745
1,0.345276,1.315974,0.697619,0.274798
2,-0.095145,0.083045,-0.172043,-0.6438
3,1.452156,0.246755,-1.722196,-0.972701
4,-0.586752,-0.216849,-1.469916,-0.443152
5,0.435042,1.827516,-0.95191,0.670721
6,0.794047,1.664529,-0.272685,-1.26862
7,1.398667,-0.460835,-0.766781,-1.565662
8,-0.150253,1.28855,-0.296526,0.160859
9,-0.845485,-0.086543,1.265979,0.312342


In [145]:
pieces = [df[:3], df[3:7], df[7:]]

pd.concat(pieces)

Unnamed: 0,0,1,2,3
0,-1.369321,0.89306,-1.512647,0.374745
1,0.345276,1.315974,0.697619,0.274798
2,-0.095145,0.083045,-0.172043,-0.6438
3,1.452156,0.246755,-1.722196,-0.972701
4,-0.586752,-0.216849,-1.469916,-0.443152
5,0.435042,1.827516,-0.95191,0.670721
6,0.794047,1.664529,-0.272685,-1.26862
7,1.398667,-0.460835,-0.766781,-1.565662
8,-0.150253,1.28855,-0.296526,0.160859
9,-0.845485,-0.086543,1.265979,0.312342
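
In the example above the pieces came from one frame, so the row labels were already unique. When the inputs have overlapping labels, concat() keeps them as they are unless told otherwise. A sketch with two made-up frames (`first` and `second`):

```python
import pandas as pd

first = pd.DataFrame({"x": [1, 2]})
second = pd.DataFrame({"x": [3, 4]})

kept = pd.concat([first, second])                                # labels repeat
renumbered = pd.concat([first, second], ignore_index=True)       # fresh 0..n-1 labels
labelled = pd.concat([first, second], keys=["first", "second"])  # MultiIndex by source

print(kept.index.tolist())        # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
print(labelled)
```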


### Join
merge() enables SQL style join types along specific columns. See the Database style joining section.

In [146]:
left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})

right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})

left

Unnamed: 0,key,lval
0,foo,1
1,foo,2


In [147]:
right

Unnamed: 0,key,rval
0,foo,4
1,foo,5


In [148]:
pd.merge(left, right, on="key")

Unnamed: 0,key,lval,rval
0,foo,1,4
1,foo,1,5
2,foo,2,4
3,foo,2,5


merge() on unique keys:

In [149]:
left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})

right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})

left

Unnamed: 0,key,lval
0,foo,1
1,bar,2


In [150]:
right

Unnamed: 0,key,rval
0,foo,4
1,bar,5


In [151]:
pd.merge(left, right, on="key")

Unnamed: 0,key,lval,rval
0,foo,1,4
1,bar,2,5
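
Both examples above use the default inner join, so only keys present in both frames survive. The sketch below uses made-up frames (`left_demo`, `right_demo`) whose keys only partly overlap. It shows how="outer" together with indicator=True, which adds a _merge column recording where each row came from.

```python
import pandas as pd

left_demo = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right_demo = pd.DataFrame({"key": ["foo", "baz"], "rval": [4, 5]})

# Default how="inner": only "foo" appears in both frames.
inner = pd.merge(left_demo, right_demo, on="key")

# how="outer" keeps every key; unmatched cells become NaN.
outer = pd.merge(left_demo, right_demo, on="key", how="outer", indicator=True)

print(inner)
print(outer)
```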


## Grouping
By “group by” we are referring to a process involving one or more of the following steps:

- **Splitting** the data into groups based on some criteria

- **Applying** a function to each group independently

- **Combining** the results into a data structure

See the Grouping section.

In [152]:
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)


df

Unnamed: 0,A,B,C,D
0,foo,one,-0.574507,-1.889548
1,bar,one,-0.263759,-0.88521
2,foo,two,0.164342,0.913474
3,bar,three,0.427279,0.078786
4,foo,two,-0.025906,-1.817447
5,bar,two,0.357855,1.02915
6,foo,one,-0.594578,0.254955
7,foo,three,0.708244,0.500073


Grouping by a column label, selecting column labels, and then applying the DataFrameGroupBy.sum() function to the resulting groups:

In [153]:
df.groupby("A")[["C", "D"]].sum()

A,C,D

bar,0.521375,0.222726
foo,-0.322405,-2.038492


Grouping by multiple column labels forms a MultiIndex:

In [154]:
df.groupby(["A", "B"]).sum()

A,B,C,D
bar,one,-0.263759,-0.88521
bar,three,0.427279,0.078786
bar,two,0.357855,1.02915
foo,one,-1.169085,-1.634593
foo,three,0.708244,0.500073
foo,two,0.138436,-0.903973
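
The three steps listed at the top of this section (splitting, applying, combining) can be written out by hand to see what groupby() does. This is only an illustrative sketch on made-up data, not how pandas implements it internally.

```python
import pandas as pd

demo = pd.DataFrame({"A": ["foo", "bar", "foo", "bar"], "C": [1.0, 2.0, 3.0, 4.0]})

# Split: iterating a groupby yields one (name, sub-frame) pair per group.
# Apply: sum column "C" within each sub-frame.
pieces = {name: group["C"].sum() for name, group in demo.groupby("A")}

# Combine: collect the per-group results into a single Series.
manual = pd.Series(pieces, name="C").rename_axis("A")

# The built-in equivalent.
builtin = demo.groupby("A")["C"].sum()

print(manual)
print(builtin)
```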


## Reshaping
See the sections on Hierarchical Indexing and Reshaping.

### Stack

In [155]:
arrays = [
   ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ["one", "two", "one", "two", "one", "two", "one", "two"],
]


index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])

df2 = df[:4]

df2

first,second,A,B
bar,one,-1.661152,-0.923383
bar,two,-1.026469,0.967119
baz,one,1.058012,-1.169852
baz,two,0.49958,-0.588362


The stack() method “compresses” a level in the DataFrame’s columns:

In [156]:
stacked = df2.stack(future_stack=True)

stacked

first  second   
bar    one     A   -1.661152
               B   -0.923383
       two     A   -1.026469
               B    0.967119
baz    one     A    1.058012
               B   -1.169852
       two     A    0.499580
               B   -0.588362
dtype: float64

With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:

In [157]:
stacked.unstack()

first,second,A,B
bar,one,-1.661152,-0.923383
bar,two,-1.026469,0.967119
baz,one,1.058012,-1.169852
baz,two,0.49958,-0.588362


In [158]:
stacked.unstack(1)

Unnamed: 0_level_0,second,one,two
first,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,A,-1.661152,-1.026469
bar,B,-0.923383,0.967119
baz,A,1.058012,0.49958
baz,B,-1.169852,-0.588362


In [159]:
stacked.unstack(0)

Unnamed: 0_level_0,first,bar,baz
second,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
one,A,-1.661152,1.058012
one,B,-0.923383,-1.169852
two,A,-1.026469,0.49958
two,B,0.967119,-0.588362


## Pivot tables
See the section on Pivot Tables.

In [160]:
df = pd.DataFrame(
    {
        "A": ["one", "one", "two", "three"] * 3,
        "B": ["A", "B", "C"] * 4,
        "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2,
        "D": np.random.randn(12),
        "E": np.random.randn(12),
    }
)


df

Unnamed: 0,A,B,C,D,E
0,one,A,foo,-0.046587,0.495517
1,one,B,foo,0.411997,-0.578962
2,two,C,foo,0.356833,1.067128
3,three,A,bar,1.28572,0.16158
4,one,B,bar,0.386293,1.038785
5,one,C,bar,1.426516,1.898244
6,two,A,foo,0.543833,0.493683
7,three,B,foo,1.273832,-1.604933
8,one,C,foo,1.090763,-1.16944
9,one,A,bar,2.120226,-0.905402


pivot_table() pivots a DataFrame specifying the values, index and columns:

In [161]:
pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])

Unnamed: 0_level_0,C,bar,foo
A,B,Unnamed: 2_level_1,Unnamed: 3_level_1
one,A,2.120226,-0.046587
one,B,0.386293,0.411997
one,C,1.426516,1.090763
three,A,1.28572,
three,B,,1.273832
three,C,-0.269687,
two,A,,0.543833
two,B,-1.075269,
two,C,,0.356833
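
When several rows fall into the same cell, pivot_table() aggregates them, using the mean by default. Combinations that never occur show up as NaN unless fill_value is given. A sketch on made-up data (`demo` is an assumption for illustration):

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "A": ["one", "one", "one", "two"],
        "C": ["foo", "foo", "bar", "foo"],
        "D": [1.0, 3.0, 5.0, 7.0],
    }
)

# Default aggregation is the mean: ("one", "foo") holds (1 + 3) / 2.
mean_table = pd.pivot_table(demo, values="D", index="A", columns="C")

# A different aggregation, with empty cells filled by 0 instead of NaN.
sum_table = pd.pivot_table(
    demo, values="D", index="A", columns="C", aggfunc="sum", fill_value=0
)

print(mean_table)
print(sum_table)
```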


## Time series
pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications. See the Time Series section.

In [162]:
rng = pd.date_range("1/1/2012", periods=100, freq="s")

ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

ts.resample("5Min").sum()

2012-01-01    24639
Freq: 5min, dtype: int32
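
The 100 seconds above fit into a single 5-minute bin, which is why the result has one row. A sketch with a longer, deterministic series makes the binning visible. The names `long_rng` and `ones` are made up, and the data is all ones so that each sum equals the number of samples in the bin.

```python
import numpy as np
import pandas as pd

# Ten minutes of secondly data: 00:00:00 through 00:09:59.
long_rng = pd.date_range("1/1/2012", periods=600, freq="s")
ones = pd.Series(np.ones(len(long_rng), dtype="int64"), index=long_rng)

per_5min = ones.resample("5min").sum()

print(per_5min)
```

Each bin is labeled by its left edge and contains 300 samples.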

Series.tz_localize() localizes a time series to a time zone:

In [163]:
rng = pd.date_range("3/6/2012 00:00", periods=5, freq="D")

ts = pd.Series(np.random.randn(len(rng)), rng)

ts

2012-03-06   -0.709186
2012-03-07   -0.528161
2012-03-08   -0.624848
2012-03-09   -0.574481
2012-03-10   -0.021891
Freq: D, dtype: float64

In [164]:
ts_utc = ts.tz_localize("UTC")

ts_utc

2012-03-06 00:00:00+00:00   -0.709186
2012-03-07 00:00:00+00:00   -0.528161
2012-03-08 00:00:00+00:00   -0.624848
2012-03-09 00:00:00+00:00   -0.574481
2012-03-10 00:00:00+00:00   -0.021891
Freq: D, dtype: float64

Series.tz_convert() converts a timezone-aware time series to another time zone:

In [165]:
ts_utc.tz_convert("US/Eastern")

2012-03-05 19:00:00-05:00   -0.709186
2012-03-06 19:00:00-05:00   -0.528161
2012-03-07 19:00:00-05:00   -0.624848
2012-03-08 19:00:00-05:00   -0.574481
2012-03-09 19:00:00-05:00   -0.021891
Freq: D, dtype: float64
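
Converting between time zones changes how a timestamp is displayed, not the instant it refers to. A minimal check on a single Timestamp (the variable names are made up for illustration):

```python
import pandas as pd

utc_time = pd.Timestamp("2012-03-06 00:00", tz="UTC")
eastern_time = utc_time.tz_convert("US/Eastern")

print(eastern_time)              # 2012-03-05 19:00:00-05:00
print(utc_time == eastern_time)  # True
```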

Adding a non-fixed duration (BusinessDay) to a time series:

In [166]:
rng

DatetimeIndex(['2012-03-06', '2012-03-07', '2012-03-08', '2012-03-09',
               '2012-03-10'],
              dtype='datetime64[ns]', freq='D')

In [167]:
rng + pd.offsets.BusinessDay(5)

DatetimeIndex(['2012-03-13', '2012-03-14', '2012-03-15', '2012-03-16',
               '2012-03-16'],
              dtype='datetime64[ns]', freq=None)
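
The last two entries of the result are both 2012-03-16 because 2012-03-10 is a Saturday: counting five business days from a weekend date starts at the following Monday. The two cases side by side, in a short sketch with made-up variable names:

```python
import pandas as pd

friday = pd.Timestamp("2012-03-09")
saturday = pd.Timestamp("2012-03-10")

friday_plus = friday + pd.offsets.BusinessDay(5)
saturday_plus = saturday + pd.offsets.BusinessDay(5)

print(friday.day_name(), friday_plus.date())      # Friday 2012-03-16
print(saturday.day_name(), saturday_plus.date())  # Saturday 2012-03-16
```

This also explains why the resulting index has freq=None: the shifted dates are no longer evenly spaced.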

## Categoricals
pandas can include categorical data in a DataFrame. For full docs, see the categorical introduction and the API documentation.

In [168]:
df = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
)

# Converting the raw grades to a categorical data type:

df["grade"] = df["raw_grade"].astype("category")

df["grade"]

0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']

Rename the categories to more meaningful names:

In [169]:
new_categories = ["very good", "good", "very bad"]

df["grade"] = df["grade"].cat.rename_categories(new_categories)

Reorder the categories and simultaneously add the missing categories (methods under Series.cat() return a new Series by default):

In [170]:
df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)


df["grade"]

0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']

Sorting is per order in the categories, not lexical order:

In [171]:
df.sort_values(by="grade")

Unnamed: 0,id,raw_grade,grade
5,6,e,very bad
1,2,b,good
2,3,b,good
0,1,a,very good
3,4,a,very good
4,5,a,very good


Grouping by a categorical column with observed=False also shows empty categories:

In [172]:
df.groupby("grade", observed=False).size()

grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64
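
For comparison, observed=True restricts the result to categories that actually occur in the data. A sketch on a small made-up frame (`demo` and its categories are assumptions for illustration):

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "grade": pd.Categorical(
            ["good", "bad", "good"], categories=["bad", "medium", "good"]
        )
    }
)

with_empty = demo.groupby("grade", observed=False).size()  # includes "medium" with 0
only_seen = demo.groupby("grade", observed=True).size()    # "medium" is left out

print(with_empty)
print(only_seen)
```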