---
### Fast Pandas
---

A quick tour of the pandas multiverse

In [2]:
import numpy as np, pandas as pd 

---
#### Object creation
---

The foundation of pandas is the Series, created here from a list of values

In [3]:
s = pd.Series([1, 3, 5, 7, np.nan, 0.1])
s 

0    1.0
1    3.0
2    5.0
3    7.0
4    NaN
5    0.1
dtype: float64
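
The float64 dtype above comes from the missing value. A minimal sketch of that behaviour, with made-up names `ints` and `with_nan`: np.nan is a float, so a Series that contains it is stored as float64 even when every other value is a whole number.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 3, 5])            # no missing values: integer dtype
with_nan = pd.Series([1, 3, np.nan])   # np.nan is a float, so the dtype is float64

print(ints.dtype, with_nan.dtype)
print(with_nan.isna().sum())           # number of missing entries: 1
```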

Create a DataFrame by passing a NumPy array, with a datetime index and labeled columns

In [4]:
rows = 8
cols = 4

dates = pd.date_range('20200401', periods = rows)

df = pd.DataFrame(np.random.randn(rows, cols), index = dates, columns = list('WXYZ'))
df

Unnamed: 0,W,X,Y,Z
2020-04-01,-0.673543,0.817844,-0.647671,-0.685011
2020-04-02,-1.028921,1.251291,-0.76178,-2.144773
2020-04-03,-0.383745,1.228993,0.051092,1.667557
2020-04-04,0.706284,2.508205,1.122956,0.047652
2020-04-05,-1.029733,-0.111493,1.191038,-0.087079
2020-04-06,-1.14506,-1.713272,-1.424586,-0.250127
2020-04-07,-0.782393,0.635984,-0.269792,-0.375141
2020-04-08,-1.280881,-1.377416,-0.41589,-0.721982


Create a DataFrame by passing a dict of objects of different types

In [5]:
df2 = pd.DataFrame(
    {
        "Aa": 1 / np.sqrt(3),
        "Bb": pd.Timestamp("20200101"),
        "Cc": pd.Series(3, index = list(range(4)), dtype = "float32"),
        "Dd": np.array([8] * 4, dtype = "int32"),
        "Ee": pd.Categorical(["test", "train", "eval", "train"]),
        "Ff": "foofigthers",
    }
)

df2.dtypes

Aa           float64
Bb    datetime64[ns]
Cc           float32
Dd             int32
Ee          category
Ff            object
dtype: object

---
#### Viewing Data
---

View the top and bottom n rows, the index, and the column labels

In [6]:
df.head(3)

Unnamed: 0,W,X,Y,Z
2020-04-01,-0.673543,0.817844,-0.647671,-0.685011
2020-04-02,-1.028921,1.251291,-0.76178,-2.144773
2020-04-03,-0.383745,1.228993,0.051092,1.667557


In [7]:
df.tail(3)

Unnamed: 0,W,X,Y,Z
2020-04-06,-1.14506,-1.713272,-1.424586,-0.250127
2020-04-07,-0.782393,0.635984,-0.269792,-0.375141
2020-04-08,-1.280881,-1.377416,-0.41589,-0.721982


In [8]:
df.index

DatetimeIndex(['2020-04-01', '2020-04-02', '2020-04-03', '2020-04-04',
               '2020-04-05', '2020-04-06', '2020-04-07', '2020-04-08'],
              dtype='datetime64[ns]', freq='D')

In [9]:
df.columns

Index(['W', 'X', 'Y', 'Z'], dtype='object')

to_numpy() returns the data as a NumPy array. This works best when all columns share one type, as they do here

In [10]:
df.to_numpy()

array([[-0.67354286,  0.81784398, -0.64767051, -0.68501105],
       [-1.02892135,  1.25129135, -0.7617799 , -2.14477325],
       [-0.38374493,  1.22899275,  0.05109194,  1.66755651],
       [ 0.70628351,  2.50820513,  1.12295605,  0.04765234],
       [-1.02973291, -0.11149286,  1.19103792, -0.08707934],
       [-1.14506043, -1.71327246, -1.42458556, -0.25012727],
       [-0.78239302,  0.63598384, -0.2697925 , -0.37514145],
       [-1.28088139, -1.37741605, -0.41589005, -0.72198151]])
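
A rough sketch of why the column types matter, using two small made-up frames (`homogeneous` and `mixed` are illustrative names): a NumPy array has a single dtype, so a frame with mixed column types falls back to a generic object array.

```python
import pandas as pd

homogeneous = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
mixed = pd.DataFrame({"a": [1.0, 2.0], "b": ["x", "y"]})

print(homogeneous.to_numpy().dtype)  # float64
print(mixed.to_numpy().dtype)        # object
```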

Quick statistical analysis

In [11]:
df.describe()

Unnamed: 0,W,X,Y,Z
count,8.0,8.0,8.0,8.0
mean,-0.702249,0.405017,-0.144329,-0.318613
std,0.636678,1.412864,0.909517,1.053723
min,-1.280881,-1.713272,-1.424586,-2.144773
25%,-1.058565,-0.427974,-0.676198,-0.694254
50%,-0.905657,0.726914,-0.342841,-0.312634
75%,-0.601093,1.234567,0.319058,-0.053396
max,0.706284,2.508205,1.191038,1.667557


Transposing data

In [12]:
df.T

Unnamed: 0,2020-04-01,2020-04-02,2020-04-03,2020-04-04,2020-04-05,2020-04-06,2020-04-07,2020-04-08
W,-0.673543,-1.028921,-0.383745,0.706284,-1.029733,-1.14506,-0.782393,-1.280881
X,0.817844,1.251291,1.228993,2.508205,-0.111493,-1.713272,0.635984,-1.377416
Y,-0.647671,-0.76178,0.051092,1.122956,1.191038,-1.424586,-0.269792,-0.41589
Z,-0.685011,-2.144773,1.667557,0.047652,-0.087079,-0.250127,-0.375141,-0.721982


Sorting by axis

In [13]:
df.sort_index(axis = 1, ascending = False)

Unnamed: 0,Z,Y,X,W
2020-04-01,-0.685011,-0.647671,0.817844,-0.673543
2020-04-02,-2.144773,-0.76178,1.251291,-1.028921
2020-04-03,1.667557,0.051092,1.228993,-0.383745
2020-04-04,0.047652,1.122956,2.508205,0.706284
2020-04-05,-0.087079,1.191038,-0.111493,-1.029733
2020-04-06,-0.250127,-1.424586,-1.713272,-1.14506
2020-04-07,-0.375141,-0.269792,0.635984,-0.782393
2020-04-08,-0.721982,-0.41589,-1.377416,-1.280881


Sorting by values

In [14]:
df.sort_values(by = 'X')

Unnamed: 0,W,X,Y,Z
2020-04-06,-1.14506,-1.713272,-1.424586,-0.250127
2020-04-08,-1.280881,-1.377416,-0.41589,-0.721982
2020-04-05,-1.029733,-0.111493,1.191038,-0.087079
2020-04-07,-0.782393,0.635984,-0.269792,-0.375141
2020-04-01,-0.673543,0.817844,-0.647671,-0.685011
2020-04-03,-0.383745,1.228993,0.051092,1.667557
2020-04-02,-1.028921,1.251291,-0.76178,-2.144773
2020-04-04,0.706284,2.508205,1.122956,0.047652


---
#### Selection
---

Selecting a single column returns a Series

In [15]:
df['W']

2020-04-01   -0.673543
2020-04-02   -1.028921
2020-04-03   -0.383745
2020-04-04    0.706284
2020-04-05   -1.029733
2020-04-06   -1.145060
2020-04-07   -0.782393
2020-04-08   -1.280881
Freq: D, Name: W, dtype: float64

Using [] to slice rows

In [16]:
df[0:3]

Unnamed: 0,W,X,Y,Z
2020-04-01,-0.673543,0.817844,-0.647671,-0.685011
2020-04-02,-1.028921,1.251291,-0.76178,-2.144773
2020-04-03,-0.383745,1.228993,0.051092,1.667557


In [17]:
df['20200401':'20200404']

Unnamed: 0,W,X,Y,Z
2020-04-01,-0.673543,0.817844,-0.647671,-0.685011
2020-04-02,-1.028921,1.251291,-0.76178,-2.144773
2020-04-03,-0.383745,1.228993,0.051092,1.667557
2020-04-04,0.706284,2.508205,1.122956,0.047652


Selection by label

Getting a cross section using a label

In [18]:
df.loc[dates[0]]

W   -0.673543
X    0.817844
Y   -0.647671
Z   -0.685011
Name: 2020-04-01 00:00:00, dtype: float64

Selecting on multiple axes by label

In [19]:
df.loc[:, ['X', 'Y']]

Unnamed: 0,X,Y
2020-04-01,0.817844,-0.647671
2020-04-02,1.251291,-0.76178
2020-04-03,1.228993,0.051092
2020-04-04,2.508205,1.122956
2020-04-05,-0.111493,1.191038
2020-04-06,-1.713272,-1.424586
2020-04-07,0.635984,-0.269792
2020-04-08,-1.377416,-0.41589


With label slicing, both endpoints are included

In [20]:
df.loc[dates[0]:dates[3], ['X', 'Z']]

Unnamed: 0,X,Z
2020-04-01,0.817844,-0.685011
2020-04-02,1.251291,-2.144773
2020-04-03,1.228993,1.667557
2020-04-04,2.508205,0.047652


In [21]:
df.loc[dates[0], 'Z'].round(3)

-0.685

---
#### Selection by position
---

Select a row by its integer position

In [22]:
df.iloc[3]

W    0.706284
X    2.508205
Y    1.122956
Z    0.047652
Name: 2020-04-04 00:00:00, dtype: float64

By integer slices, which behave like NumPy / Python slices

In [23]:
df.iloc[3:5, 0:2]

Unnamed: 0,W,X
2020-04-04,0.706284,2.508205
2020-04-05,-1.029733,-0.111493


By lists of int positions

In [24]:
df.iloc[[1,2,4],[0,2]]

Unnamed: 0,W,Y
2020-04-02,-1.028921,-0.76178
2020-04-03,-0.383745,0.051092
2020-04-05,-1.029733,1.191038


Slicing rows only, slicing columns only, and getting a single value

In [25]:
df.iloc[1:4,:]

Unnamed: 0,W,X,Y,Z
2020-04-02,-1.028921,1.251291,-0.76178,-2.144773
2020-04-03,-0.383745,1.228993,0.051092,1.667557
2020-04-04,0.706284,2.508205,1.122956,0.047652


In [26]:
df.iloc[:, 1:3]

Unnamed: 0,X,Y
2020-04-01,0.817844,-0.647671
2020-04-02,1.251291,-0.76178
2020-04-03,1.228993,0.051092
2020-04-04,2.508205,1.122956
2020-04-05,-0.111493,1.191038
2020-04-06,-1.713272,-1.424586
2020-04-07,0.635984,-0.269792
2020-04-08,-1.377416,-0.41589


In [27]:
df.iloc[3,3]

0.04765234122360332
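
The two selection styles treat slice endpoints differently. A small sketch on a made-up frame `demo` (filled with np.arange so the values are predictable): loc slices by label and includes the stop label, iloc slices by position and excludes the stop position.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20200401", periods=8)
demo = pd.DataFrame(np.arange(32).reshape(8, 4), index=dates, columns=list("WXYZ"))

by_label = demo.loc["20200402":"20200404"]   # stop label included: 3 rows
by_position = demo.iloc[1:4]                 # stop position excluded: 3 rows

print(by_label.equals(by_position))          # True
print(len(demo.loc[dates[0]:dates[3]]))      # 4 rows, labels 0 through 3
print(len(demo.iloc[0:3]))                   # 3 rows, positions 0 through 2
```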

---
#### Boolean Indexing
---

Use a single column's values to select rows, or apply a condition to the DataFrame as a whole

In [28]:
df[df['W'] > 0]

Unnamed: 0,W,X,Y,Z
2020-04-04,0.706284,2.508205,1.122956,0.047652


In [29]:
df[df > 0]

Unnamed: 0,W,X,Y,Z
2020-04-01,,0.817844,,
2020-04-02,,1.251291,,
2020-04-03,,1.228993,0.051092,1.667557
2020-04-04,0.706284,2.508205,1.122956,0.047652
2020-04-05,,,1.191038,
2020-04-06,,,,
2020-04-07,,0.635984,,
2020-04-08,,,,


Using the isin() method

In [30]:
df2 = df.copy()
df2['E'] = ['one', 'one', 'two','three','eight','five','six','four']
df2

Unnamed: 0,W,X,Y,Z,E
2020-04-01,-0.673543,0.817844,-0.647671,-0.685011,one
2020-04-02,-1.028921,1.251291,-0.76178,-2.144773,one
2020-04-03,-0.383745,1.228993,0.051092,1.667557,two
2020-04-04,0.706284,2.508205,1.122956,0.047652,three
2020-04-05,-1.029733,-0.111493,1.191038,-0.087079,eight
2020-04-06,-1.14506,-1.713272,-1.424586,-0.250127,five
2020-04-07,-0.782393,0.635984,-0.269792,-0.375141,six
2020-04-08,-1.280881,-1.377416,-0.41589,-0.721982,four


In [31]:
df2[df2['E'].isin(['eight', 'one'])]

Unnamed: 0,W,X,Y,Z,E
2020-04-01,-0.673543,0.817844,-0.647671,-0.685011,one
2020-04-02,-1.028921,1.251291,-0.76178,-2.144773,one
2020-04-05,-1.029733,-0.111493,1.191038,-0.087079,eight
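
Conditions can also be combined. A minimal sketch on a made-up frame `demo`: use & for and, | for or, and ~ for not, with each comparison wrapped in parentheses because these operators bind tighter than the comparisons.

```python
import pandas as pd

demo = pd.DataFrame(
    {"W": [-1.0, 0.5, 2.0, -0.3], "E": ["one", "two", "two", "three"]}
)

both = demo[(demo["W"] > 0) & (demo["E"] == "two")]         # rows 1 and 2
either = demo[(demo["W"] > 1) | demo["E"].isin(["three"])]  # rows 2 and 3
negated = demo[~demo["E"].isin(["two"])]                    # rows 0 and 3

print(both.index.tolist(), either.index.tolist(), negated.index.tolist())
```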


---
#### Setting
---

A Series that shares the DataFrame's index can be set as a new column; pandas aligns it to the index automatically

In [32]:
s1 = pd.Series([1,2,3,4,5,6,7,8], index = dates)
s1


2020-04-01    1
2020-04-02    2
2020-04-03    3
2020-04-04    4
2020-04-05    5
2020-04-06    6
2020-04-07    7
2020-04-08    8
Freq: D, dtype: int64
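
A short sketch of that alignment, assuming a made-up frame `demo` and a new column name "F": the Series is matched to the frame by index label, not by position, so labels missing from the Series become NaN and labels missing from the frame are dropped.

```python
import pandas as pd

dates = pd.date_range("20200401", periods=4)
demo = pd.DataFrame({"W": [0.1, 0.2, 0.3, 0.4]}, index=dates)

# This Series starts one day later and runs one day past the end of demo
s1 = pd.Series([10, 20, 30, 40], index=pd.date_range("20200402", periods=4))

demo["F"] = s1   # aligned on the index: first row gets NaN, the value 40 is dropped
print(demo)
```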

Setting values by label, by position, and by assigning a NumPy array

In [33]:
df.at[dates[0], 'W'] = 0

df.iat[0,1] = 0

df.loc[:, 'X'] = np.array([8] * len(df))

df

Unnamed: 0,W,X,Y,Z
2020-04-01,0.0,8,-0.647671,-0.685011
2020-04-02,-1.028921,8,-0.76178,-2.144773
2020-04-03,-0.383745,8,0.051092,1.667557
2020-04-04,0.706284,8,1.122956,0.047652
2020-04-05,-1.029733,8,1.191038,-0.087079
2020-04-06,-1.14506,8,-1.424586,-0.250127
2020-04-07,-0.782393,8,-0.269792,-0.375141
2020-04-08,-1.280881,8,-0.41589,-0.721982


A where operation with setting

In [34]:
df2 = df.copy()

df2[df2 > 0] = -df2

df2

Unnamed: 0,W,X,Y,Z
2020-04-01,0.0,-8,-0.647671,-0.685011
2020-04-02,-1.028921,-8,-0.76178,-2.144773
2020-04-03,-0.383745,-8,-0.051092,-1.667557
2020-04-04,-0.706284,-8,-1.122956,-0.047652
2020-04-05,-1.029733,-8,-1.191038,-0.087079
2020-04-06,-1.14506,-8,-1.424586,-0.250127
2020-04-07,-0.782393,-8,-0.269792,-0.375141
2020-04-08,-1.280881,-8,-0.41589,-0.721982
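
The same result can be had without assigning in place. A sketch with a tiny made-up frame `demo`, comparing the boolean assignment with DataFrame.where, which keeps values where the condition holds and takes the replacement elsewhere.

```python
import pandas as pd

demo = pd.DataFrame({"W": [1.0, -2.0], "X": [-3.0, 4.0]})

# Boolean assignment on a copy, as in the cell above
flipped = demo.copy()
flipped[flipped > 0] = -flipped

# where(): keep values that are <= 0, use -demo for the rest
via_where = demo.where(demo <= 0, -demo)

print(flipped.equals(via_where))  # True
```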


---
#### Missing Data
---

Pandas primarily uses the value np.nan to represent missing data. By default it is excluded from computations. See the Missing Data section of the pandas user guide.

Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.

In [35]:
df1 = df.reindex(index = dates[0:4], columns = list(df.columns) + ['E'])
df1.loc[dates[0] : dates[1], 'E'] = 1
df1

Unnamed: 0,W,X,Y,Z,E
2020-04-01,0.0,8,-0.647671,-0.685011,1.0
2020-04-02,-1.028921,8,-0.76178,-2.144773,1.0
2020-04-03,-0.383745,8,0.051092,1.667557,
2020-04-04,0.706284,8,1.122956,0.047652,


To drop any rows with NaN values

In [36]:
df1.dropna(how = 'any')

Unnamed: 0,W,X,Y,Z,E
2020-04-01,0.0,8,-0.647671,-0.685011,1.0
2020-04-02,-1.028921,8,-0.76178,-2.144773,1.0


To fill missing values, here with the mean of each column

In [37]:
df1.fillna(value = df1.mean())

Unnamed: 0,W,X,Y,Z,E
2020-04-01,0.0,8,-0.647671,-0.685011,1.0
2020-04-02,-1.028921,8,-0.76178,-2.144773,1.0
2020-04-03,-0.383745,8,0.051092,1.667557,1.0
2020-04-04,0.706284,8,1.122956,0.047652,1.0


To get the boolean mask of where the values are NaN

In [38]:
pd.isna(df1)

Unnamed: 0,W,X,Y,Z,E
2020-04-01,False,False,False,False,False
2020-04-02,False,False,False,False,False
2020-04-03,False,False,False,False,True
2020-04-04,False,False,False,False,True
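
The three operations above, put together on a small made-up frame `demo` so the numbers are easy to check by hand. This is only a sketch; filling with the column mean is one choice among several.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    {"W": [1.0, 2.0, np.nan, 4.0], "E": [1.0, np.nan, np.nan, 3.0]}
)

missing_per_column = demo.isna().sum()   # W: 1, E: 2
filled = demo.fillna(demo.mean())        # each NaN replaced by its column mean
complete_rows = demo.dropna(how="any")   # only rows 0 and 3 have no NaN

print(missing_per_column)
print(filled)
print(complete_rows)
```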


---
#### Stats
---

In [39]:
# Mean of each column (axis 0 is the default)
print(df.mean(),'\n')

# Mean of each column, with the axis given explicitly
print(df.mean(0),'\n')

# Mean of each row
print(df.mean(1),'\n')

W   -0.618056
X    8.000000
Y   -0.144329
Z   -0.318613
dtype: float64 

W   -0.618056
X    8.000000
Y   -0.144329
Z   -0.318613
dtype: float64 

2020-04-01    1.666830
2020-04-02    1.016131
2020-04-03    2.333726
2020-04-04    2.469223
2020-04-05    2.018556
2020-04-06    1.295057
2020-04-07    1.643168
2020-04-08    1.395312
Freq: D, dtype: float64 
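
A small sketch of how the axis argument and missing values interact, on a made-up frame `demo`. The single overall mean is computed here with np.nanmean on the underlying array, which is one possible way to do it and an assumption of this example.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"W": [1.0, 2.0, 3.0], "X": [4.0, np.nan, 8.0]})

col_means = demo.mean()            # one value per column, NaN skipped
row_means = demo.mean(axis=1)      # one value per row, NaN skipped
overall = np.nanmean(demo.to_numpy())  # one value for the whole frame

print(col_means.tolist())   # [2.0, 6.0]
print(row_means.tolist())   # [2.5, 2.0, 5.5]
print(overall)
```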



When operating on objects that have different dimensionality and need alignment, pandas automatically broadcasts along the specified dimension.

In [40]:
s = pd.Series([1,2,3,np.nan,8,12,21,88], index = dates).shift(2)
s

2020-04-01     NaN
2020-04-02     NaN
2020-04-03     1.0
2020-04-04     2.0
2020-04-05     3.0
2020-04-06     NaN
2020-04-07     8.0
2020-04-08    12.0
Freq: D, dtype: float64

In [41]:
df.sub(s, axis = 'index')

Unnamed: 0,W,X,Y,Z
2020-04-01,,,,
2020-04-02,,,,
2020-04-03,-1.383745,7.0,-0.948908,0.667557
2020-04-04,-1.293716,6.0,-0.877044,-1.952348
2020-04-05,-4.029733,5.0,-1.808962,-3.087079
2020-04-06,,,,
2020-04-07,-8.782393,0.0,-8.269792,-8.375141
2020-04-08,-13.280881,-4.0,-12.41589,-12.721982
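
A sketch of what the axis argument changes, with made-up names `row_offsets` and `col_offsets`: axis='index' matches the Series to the row labels and subtracts across each row, while axis='columns' matches it to the column labels.

```python
import pandas as pd

demo = pd.DataFrame(
    {"W": [1.0, 2.0, 3.0], "X": [10.0, 20.0, 30.0]}, index=list("abc")
)
row_offsets = pd.Series([1.0, 2.0, 3.0], index=list("abc"))
col_offsets = pd.Series({"W": 1.0, "X": 10.0})

by_row = demo.sub(row_offsets, axis="index")      # one offset per row
by_col = demo.sub(col_offsets, axis="columns")    # one offset per column

print(by_row)
print(by_col)
```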


---
#### Apply
---

Apply a function to the data

In [42]:
df.apply(np.cumsum)

Unnamed: 0,W,X,Y,Z
2020-04-01,0.0,8,-0.647671,-0.685011
2020-04-02,-1.028921,16,-1.40945,-2.829784
2020-04-03,-1.412666,24,-1.358358,-1.162228
2020-04-04,-0.706383,32,-0.235402,-1.114575
2020-04-05,-1.736116,40,0.955636,-1.201655
2020-04-06,-2.881176,48,-0.46895,-1.451782
2020-04-07,-3.663569,56,-0.738743,-1.826924
2020-04-08,-4.944451,64,-1.154633,-2.548905


In [43]:
df.apply(lambda x: x.max() - x.min())

W    1.987165
X    0.000000
Y    2.615623
Z    3.812330
dtype: float64
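
By default apply hands each column to the function. A brief sketch on a made-up frame `demo` showing the same range calculation per column and, with axis=1, per row.

```python
import pandas as pd

demo = pd.DataFrame({"W": [1.0, 5.0], "X": [4.0, 2.0], "Y": [9.0, 3.0]})

col_range = demo.apply(lambda col: col.max() - col.min())          # per column
row_range = demo.apply(lambda row: row.max() - row.min(), axis=1)  # per row

print(col_range.tolist())  # [4.0, 2.0, 6.0]
print(row_range.tolist())  # [8.0, 3.0]
```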

---
#### Merge
---

#### Concat
pandas provides various facilities for easily combining together Series and DataFrame objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.


In [44]:
df = pd.DataFrame(np.random.randn(10,4))
df

Unnamed: 0,0,1,2,3
0,0.222762,0.014405,1.574961,-1.069147
1,-0.785593,0.300978,0.361302,-0.016889
2,0.623643,1.810705,-0.386034,0.456199
3,-0.481856,0.003001,-0.92077,-2.325746
4,0.267837,-0.57448,0.00195,2.125956
5,-0.43243,0.304838,-1.087905,-1.8508
6,0.312702,-1.864743,0.38492,1.365417
7,1.146239,1.701228,-0.980069,-0.205967
8,-0.905118,1.130917,-0.324341,-0.048605
9,-3.158523,0.482618,-0.80265,-0.066128


In [45]:
pieces = [df[:3], df[3:7], df[7:]]
pieces 

[          0         1         2         3
 0  0.222762  0.014405  1.574961 -1.069147
 1 -0.785593  0.300978  0.361302 -0.016889
 2  0.623643  1.810705 -0.386034  0.456199,
           0         1         2         3
 3 -0.481856  0.003001 -0.920770 -2.325746
 4  0.267837 -0.574480  0.001950  2.125956
 5 -0.432430  0.304838 -1.087905 -1.850800
 6  0.312702 -1.864743  0.384920  1.365417,
           0         1         2         3
 7  1.146239  1.701228 -0.980069 -0.205967
 8 -0.905118  1.130917 -0.324341 -0.048605
 9 -3.158523  0.482618 -0.802650 -0.066128]

In [46]:
pd.concat(pieces)

Unnamed: 0,0,1,2,3
0,0.222762,0.014405,1.574961,-1.069147
1,-0.785593,0.300978,0.361302,-0.016889
2,0.623643,1.810705,-0.386034,0.456199
3,-0.481856,0.003001,-0.92077,-2.325746
4,0.267837,-0.57448,0.00195,2.125956
5,-0.43243,0.304838,-1.087905,-1.8508
6,0.312702,-1.864743,0.38492,1.365417
7,1.146239,1.701228,-0.980069,-0.205967
8,-0.905118,1.130917,-0.324341,-0.048605
9,-3.158523,0.482618,-0.80265,-0.066128
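
A small sketch of how concat treats the index, using a made-up one-column frame `demo`: the original row labels are kept unless ignore_index=True is passed, in which case the result is renumbered from 0.

```python
import pandas as pd

demo = pd.DataFrame({"v": range(6)})
pieces = [demo[:2], demo[2:5], demo[5:]]

rebuilt = pd.concat(pieces)   # labels 0..5 kept, same as the original
reordered = pd.concat([pieces[2], pieces[0]], ignore_index=True)  # relabelled 0..2

print(rebuilt.equals(demo))          # True
print(reordered["v"].tolist())       # [5, 0, 1]
print(reordered.index.tolist())      # [0, 1, 2]
```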


#### Join

In [47]:
left = pd.DataFrame({'key':['foo', 'fighters'], 'lval':[1,2]})
right = pd.DataFrame({'key':['foo', 'fighters'], 'rval':[8,66]})
pd.merge(left,right, on = 'key')

Unnamed: 0,key,lval,rval
0,foo,1,8
1,fighters,2,66
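
In the example above every key appears on both sides. A sketch of what happens when they do not, with the extra keys "bar" and "baz" invented for illustration: the default inner join keeps only shared keys, while how='outer' keeps all of them and indicator=True records where each row came from.

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "fighters", "bar"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["foo", "fighters", "baz"], "rval": [8, 66, 7]})

inner = pd.merge(left, right, on="key")                              # shared keys only
outer = pd.merge(left, right, on="key", how="outer", indicator=True)  # all keys

print(inner)
print(outer)   # the _merge column reads both, left_only or right_only
```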


---
#### Grouping
---

By “group by” we are referring to a process involving one or more of the following steps:

* Splitting the data into groups based on some criteria
* Applying a function to each group independently
* Combining the results into a data structure

In [48]:
df = pd.DataFrame(
       {
            "A": ["foo", "fighters", "foo", "fighters", "foo", "fighters", "foo", "foo"],
            "B": ["one", "six", "two", "three", "two", "two", "one", "three"],
            "C": np.random.randn(8),
            "D": np.random.randint(0,1,8)  # the upper bound is exclusive, so every value is 0
        }
    )
df

Unnamed: 0,A,B,C,D
0,foo,one,1.709025,0
1,fighters,six,-0.868673,0
2,foo,two,0.690457,0
3,fighters,three,-0.692454,0
4,foo,two,0.235426,0
5,fighters,two,-1.532986,0
6,foo,one,0.40109,0
7,foo,three,0.003271,0


In [49]:
df.groupby('A').std()

A,C,D
fighters,0.443256,0.0
foo,0.664525,0.0


In [50]:
df.groupby(['A', 'B']).mean()

A,B,C,D
fighters,six,-0.868673,0
fighters,three,-0.692454,0
fighters,two,-1.532986,0
foo,one,1.055058,0
foo,three,0.003271,0
foo,two,0.462942,0


In [51]:
df.groupby(['A', 'B', 'D']).sum()

A,B,D,C
fighters,six,0,-0.868673
fighters,three,0,-0.692454
fighters,two,0,-1.532986
foo,one,0,2.110115
foo,three,0,0.003271
foo,two,0,0.925883
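
The split, apply, and combine steps listed above can be sketched by hand on a small made-up frame `demo` and compared with what groupby does in one call. The dictionary comprehension is only an illustration of the steps, not how pandas implements them.

```python
import pandas as pd

demo = pd.DataFrame(
    {"A": ["foo", "fighters", "foo", "fighters"], "C": [1.0, 2.0, 3.0, 4.0]}
)

# Split into groups, apply mean() to each, combine into a dict
manual = {name: group["C"].mean() for name, group in demo.groupby("A")}

# The same idea in one call, with several aggregations at once
summary = demo.groupby("A")["C"].agg(["mean", "sum", "count"])

print(manual)    # {'fighters': 3.0, 'foo': 2.0}
print(summary)
```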


#### Stack

In [52]:
tuples = list(
        zip(
            *[
                ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
                ["one", "two", "one", "two", "one", "two", "one", "two"],
            ]
        )
    )

In [53]:
index = pd.MultiIndex.from_tuples(tuples, names = ['1st', '2nd'])
df = pd.DataFrame(np.random.randn(8,2), index = index, columns = ['A','B'])
df2 = df[:4]
df2

1st,2nd,A,B
bar,one,-2.214677,2.433334
bar,two,0.102401,-0.309848
baz,one,0.208497,-2.151535
baz,two,0.93036,-1.458873


In [54]:
stacked = df2.stack()
stacked

1st  2nd   
bar  one  A   -2.214677
          B    2.433334
     two  A    0.102401
          B   -0.309848
baz  one  A    0.208497
          B   -2.151535
     two  A    0.930360
          B   -1.458873
dtype: float64

In [55]:
stacked.unstack()

1st,2nd,A,B
bar,one,-2.214677,2.433334
bar,two,0.102401,-0.309848
baz,one,0.208497,-2.151535
baz,two,0.93036,-1.458873


In [56]:
stacked.unstack(1)

1st,,one,two
bar,A,-2.214677,0.102401
bar,B,2.433334,-0.309848
baz,A,0.208497,0.93036
baz,B,-2.151535,-1.458873


In [57]:
stacked.unstack(0)

2nd,,bar,baz
one,A,-2.214677,0.208497
one,B,2.433334,-2.151535
two,A,0.102401,0.93036
two,B,-0.309848,-1.458873
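
A compact sketch of the round trip on a made-up frame `demo` with fixed values: stack moves the column labels into the innermost index level, unstack with no argument moves that level back, and unstack with a level name moves that level into the columns.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["1st", "2nd"]
)
demo = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0, 4.0], "B": [5.0, 6.0, 7.0, 8.0]}, index=index
)

stacked = demo.stack()                # Series with a three-level index, 8 values
round_trip = stacked.unstack()        # innermost level back to columns A, B
by_first = stacked.unstack("1st")     # level "1st" becomes the columns bar, baz

print(stacked.shape, round_trip.shape, by_first.shape)
print(by_first.loc[("one", "A"), "baz"])   # 3.0
```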


#### Pivot Tables

In [58]:
df = pd.DataFrame(
        {
            "A": ["one", "one", "two", "three"] * 3,
            "B": ["A", "B", "C"] * 4,
            "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2,
            "D": np.random.randn(12),
            "E": np.random.randn(12),
        }
    )
df

Unnamed: 0,A,B,C,D,E
0,one,A,foo,0.868173,-1.367693
1,one,B,foo,-0.399962,0.527132
2,two,C,foo,-1.195178,-0.151998
3,three,A,bar,0.166032,-0.491186
4,one,B,bar,-0.326275,1.339558
5,one,C,bar,-0.706879,-0.040288
6,two,A,foo,-0.355317,0.149001
7,three,B,foo,1.078953,2.424316
8,one,C,foo,0.301134,-0.356991
9,one,A,bar,0.118463,0.546092


In [59]:
pd.pivot_table(df, values = 'D', index = ['A','B'], columns = ['C'])

A,B,bar,foo
one,A,0.118463,0.868173
one,B,-0.326275,-0.399962
one,C,-0.706879,0.301134
three,A,0.166032,
three,B,,1.078953
three,C,-2.255886,
two,A,,-0.355317
two,B,-0.192469,
two,C,,-1.195178
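
A sketch of how pivot_table fills its cells, on a made-up frame `demo`: when several rows land in the same cell they are averaged by default, combinations that never occur come out as NaN, and another aggfunc such as 'count' can be used together with fill_value.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "A": ["one", "one", "one", "two"],
        "C": ["foo", "foo", "bar", "foo"],
        "D": [1.0, 3.0, 5.0, 7.0],
    }
)

# Default aggregation is the mean: the two one/foo rows average to 2.0
table = pd.pivot_table(demo, values="D", index="A", columns="C")

# Count the rows per cell instead, showing 0 for empty combinations
counts = pd.pivot_table(
    demo, values="D", index="A", columns="C", aggfunc="count", fill_value=0
)

print(table)
print(counts)
```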


---
#### Time Series
---

Pandas has powerful and efficient functionality for resampling during frequency conversion, for example converting one-second data into 5-minute data

In [60]:
rng = pd.date_range('1/1/2020', periods = 100, freq = 'S')
ts = pd.Series(np.random.randint(0,500,len(rng)), index = rng)
ts.resample('5Min').sum()

2020-01-01    26667
Freq: 5T, dtype: int32
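
All 100 seconds above fall into a single 5-minute bin, which is why the result has one row. A sketch with a shorter bin and a series of ones (both choices are assumptions made so the sums can be checked by eye): 100 seconds split into one full minute and one partial minute.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2020-01-01", periods=100, freq="s")
ts = pd.Series(np.ones(len(rng), dtype="int64"), index=rng)

per_minute = ts.resample("1min").sum()   # bins start at 00:00 and 00:01
print(per_minute.tolist())               # [60, 40]
```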

In [61]:
rng = pd.date_range("3/6/2020 00:00", periods = 5, freq = "D")
ts = pd.Series(np.random.randn(len(rng)), rng)
ts

2020-03-06   -0.895780
2020-03-07    0.257814
2020-03-08   -1.456436
2020-03-09   -1.419469
2020-03-10   -0.347814
Freq: D, dtype: float64

In [62]:
ts_utc = ts.tz_localize('UTC')
ts_utc

2020-03-06 00:00:00+00:00   -0.895780
2020-03-07 00:00:00+00:00    0.257814
2020-03-08 00:00:00+00:00   -1.456436
2020-03-09 00:00:00+00:00   -1.419469
2020-03-10 00:00:00+00:00   -0.347814
Freq: D, dtype: float64

Converting between time span representations:

In [63]:
ps = ts.to_period()
ps

2020-03-06   -0.895780
2020-03-07    0.257814
2020-03-08   -1.456436
2020-03-09   -1.419469
2020-03-10   -0.347814
Freq: D, dtype: float64

In [64]:
ps.to_timestamp()

2020-03-06   -0.895780
2020-03-07    0.257814
2020-03-08   -1.456436
2020-03-09   -1.419469
2020-03-10   -0.347814
Freq: D, dtype: float64

Convert a quarterly frequency with year ending in November to 9am on the first day of the month following the quarter end

In [65]:
prng = pd.period_range("2000Q1", "2020Q4", freq="Q-NOV")
ts = pd.Series(np.random.randn(len(prng)), prng)
ts.index = (prng.asfreq("M", "e") + 1).asfreq("H", "s") + 9
ts.head()

2000-03-01 09:00    1.021150
2000-06-01 09:00   -0.214123
2000-09-01 09:00    1.180847
2000-12-01 09:00    0.041109
2001-03-01 09:00   -1.576260
Freq: H, dtype: float64
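
The conversion can be followed step by step for a single period. This sketch uses made-up names and, for the last step, swaps the hourly asfreq for to_timestamp plus a Timedelta, which is a different route to the same moment in time.

```python
import pandas as pd

quarter = pd.Period("2000Q1", freq="Q-NOV")    # Dec 1999 through Feb 2000
last_month = quarter.asfreq("M", how="end")    # last month of the quarter
next_month = last_month + 1                    # the month after the quarter end

# Start of that month plus nine hours (swapped in for asfreq to hours + 9)
stamp = next_month.to_timestamp(how="start") + pd.Timedelta(hours=9)

print(last_month, next_month, stamp)
```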

---
#### Categoricals
---

In [66]:
df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["A", "B", "A", "C", "C", "D"]})
df["grade"] = df["raw_grade"].astype("category")
df['grade']

0    A
1    B
2    A
3    C
4    C
5    D
Name: grade, dtype: category
Categories (4, object): ['A', 'B', 'C', 'D']
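
One reason to use a category dtype is that it can carry an order that differs from alphabetical order. A sketch with invented labels ("excellent", "good", "fair", "poor"): the categories are renamed, given an explicit order, and sorting then follows that order.

```python
import pandas as pd

grades = pd.Series(["A", "B", "A", "C", "C", "D"]).astype("category")

# Hypothetical, more descriptive labels for the four grades
renamed = grades.cat.rename_categories(
    {"A": "excellent", "B": "good", "C": "fair", "D": "poor"}
)

# Declare the order from worst to best
ordered = renamed.cat.reorder_categories(
    ["poor", "fair", "good", "excellent"], ordered=True
)

sorted_values = ordered.sort_values()   # sorted by category order, not by spelling
print(sorted_values.tolist())
print(ordered.min(), ordered.max())
```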

---
#### Getting / Saving Data
---

CSV / Excel

In [67]:
df.to_csv('FF.csv')
pd.read_csv('FF.csv')

Unnamed: 0.1,Unnamed: 0,id,raw_grade,grade
0,0,1,A,A
1,1,2,B,B
2,2,3,A,A
3,3,4,C,C
4,4,5,C,C
5,5,6,D,D


In [68]:
df.to_excel('FF.xlsx')
pd.read_excel('FF.xlsx')

Unnamed: 0.1,Unnamed: 0,id,raw_grade,grade
0,0,1,A,A
1,1,2,B,B
2,2,3,A,A
3,3,4,C,C
4,4,5,C,C
5,5,6,D,D
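
The extra unnamed column in the outputs above is the index, which to_csv and to_excel write by default. A sketch of two ways around it for CSV, using an in-memory io.StringIO buffer instead of a file (an assumption made to keep the example self-contained): leave the index out with index=False, or read it back as the index with index_col=0.

```python
import io
import pandas as pd

demo = pd.DataFrame({"id": [1, 2, 3], "raw_grade": ["A", "B", "A"]})

# Default: the index is written as an unnamed first column
buffer = io.StringIO()
demo.to_csv(buffer)
buffer.seek(0)
with_index = pd.read_csv(buffer)

# Option 1: do not write the index at all
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)
without_index = pd.read_csv(buffer)

# Option 2: write it, then tell read_csv that column 0 is the index
buffer = io.StringIO()
demo.to_csv(buffer)
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(with_index.columns))     # ['Unnamed: 0', 'id', 'raw_grade']
print(list(without_index.columns))  # ['id', 'raw_grade']
print(list(restored.columns))       # ['id', 'raw_grade']
```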
