## Pandas

While pandas adopts many coding idioms from NumPy, the biggest difference is that
pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast,
is best suited for working with homogeneous numerical array data.
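
To make the contrast concrete, here is a minimal sketch that stores the same mixed data in a NumPy array and in a DataFrame. The variable names and values are illustrative assumptions, not taken from the text above.

```python
import numpy as np
import pandas as pd

# A NumPy array has one dtype, so mixing numbers and text forces
# every element into a common type.
arr = np.array([[1, 'x'], [2, 'y']])
print(arr.dtype.kind)  # U (every element became a string)

# A DataFrame keeps a separate dtype for each column.
df = pd.DataFrame({'num': [1, 2], 'txt': ['x', 'y']})
print(df.dtypes)
```

The integer column stays numeric in the DataFrame, while the array has to represent everything as text.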

In [1]:
import pandas as pd
import numpy as np

In [2]:
obj = pd.Series([1,2,3,4])

In [3]:
obj

0    1
1    2
2    3
3    4
dtype: int64

In [4]:
obj.values

array([1, 2, 3, 4], dtype=int64)

In [5]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [6]:
obj2 = pd.Series([1,2,3,4], index=['d', 'a', 'c', 'v'])

In [7]:
obj2

d    1
a    2
c    3
v    4
dtype: int64

In [8]:
obj2.index

Index(['d', 'a', 'c', 'v'], dtype='object')

In [9]:
obj2[['a', 'c']]

a    2
c    3
dtype: int64

In [10]:
obj2[obj2 > 2]

c    3
v    4
dtype: int64

In [11]:
np.exp(obj2)

d     2.718282
a     7.389056
c    20.085537
v    54.598150
dtype: float64

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping
of index values to data values. It can be used in many contexts where you might
use a dict:


In [12]:
'a' in obj2

True

In [13]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [14]:
obj3 = pd.Series(sdata)

In [15]:
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

When only a dict is passed, the index of the resulting Series follows the dict’s keys.
You can override this by passing the dict keys in the order you
want them to appear in the resulting Series:

In [16]:
states = ['California', 'Ohio', 'Oregon', 'Texas']

In [17]:
obj4 = pd.Series(sdata, index=states)

In [18]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [19]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [20]:
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [21]:
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [22]:
obj3, obj4

(Ohio      35000
 Texas     71000
 Oregon    16000
 Utah       5000
 dtype: int64,
 California        NaN
 Ohio          35000.0
 Oregon        16000.0
 Texas         71000.0
 dtype: float64)

In [23]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

Both the Series object itself and its index have a name attribute, which integrates with
other key areas of pandas functionality:

In [24]:
obj4.name = 'population'

In [25]:
obj4.index.name = 'state'

In [26]:
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

In [27]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

In [28]:
obj

Bob      1
Steve    2
Jeff     3
Ryan     4
dtype: int64

**DataFrame**

A DataFrame represents a rectangular table of data and contains an ordered collection
of columns, each of which can be a different value type (numeric, string,
boolean, etc.). The DataFrame has both a row and column index; it can be thought of
as a dict of Series all sharing the same index. Under the hood, the data is stored as one
or more two-dimensional blocks rather than a list, dict, or some other collection of
one-dimensional arrays. The exact details of DataFrame’s internals are outside the
scope of this book.
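
The "dict of Series sharing the same index" picture can be sketched as follows. The names ages, names, and people and their values are made up for illustration.

```python
import pandas as pd

# Two Series that share the same index labels.
ages = pd.Series([31, 45], index=['ann', 'bob'])
names = pd.Series(['Ann', 'Bob'], index=['ann', 'bob'])

people = pd.DataFrame({'age': ages, 'name': names})

# Each column comes back as a Series carrying the frame's row index.
col = people['age']
print(type(col).__name__)              # Series
print(col.index.equals(people.index))  # True
print(people.dtypes)                   # one dtype per column
```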

There are many ways to construct a DataFrame, though one of the most common is
from a dict of equal-length lists or NumPy arrays:
    

In [29]:
data = { 
         'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
         'year': [2000, 2001, 2002, 2001, 2002, 2003],
         'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
       }
frame = pd.DataFrame(data)


In [30]:
frame

|  | state | year | pop |
| --- | --- | --- | --- |
| 0 | Ohio | 2000 | 1.5 |
| 1 | Ohio | 2001 | 1.7 |
| 2 | Ohio | 2002 | 3.6 |
| 3 | Nevada | 2001 | 2.4 |
| 4 | Nevada | 2002 | 2.9 |
| 5 | Nevada | 2003 | 3.2 |


In [31]:
pd.DataFrame(data, columns=['year', 'state', 'pop'])

|  | year | state | pop |
| --- | --- | --- | --- |
| 0 | 2000 | Ohio | 1.5 |
| 1 | 2001 | Ohio | 1.7 |
| 2 | 2002 | Ohio | 3.6 |
| 3 | 2001 | Nevada | 2.4 |
| 4 | 2002 | Nevada | 2.9 |
| 5 | 2003 | Nevada | 3.2 |


In [32]:
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four','five', 'six'])

In [33]:
frame2

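|  | year | state | pop | debt |
| --- | --- | --- | --- | --- |
| one | 2000 | Ohio | 1.5 | NaN |
| two | 2001 | Ohio | 1.7 | NaN |
| three | 2002 | Ohio | 3.6 | NaN |
| four | 2001 | Nevada | 2.4 | NaN |
| five | 2002 | Nevada | 2.9 | NaN |
| six | 2003 | Nevada | 3.2 | NaN |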


In [34]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [35]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

In [36]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

In [37]:
frame2['debt']

one      NaN
two      NaN
three    NaN
four     NaN
five     NaN
six      NaN
Name: debt, dtype: object

In [38]:
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

In [39]:
frame2['debt'] = 16.5

In [40]:
frame2

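|  | year | state | pop | debt |
| --- | --- | --- | --- | --- |
| one | 2000 | Ohio | 1.5 | 16.5 |
| two | 2001 | Ohio | 1.7 | 16.5 |
| three | 2002 | Ohio | 3.6 | 16.5 |
| four | 2001 | Nevada | 2.4 | 16.5 |
| five | 2002 | Nevada | 2.9 | 16.5 |
| six | 2003 | Nevada | 3.2 | 16.5 |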


In [41]:
frame2['debt'] = np.arange(6.)

In [42]:
frame2

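|  | year | state | pop | debt |
| --- | --- | --- | --- | --- |
| one | 2000 | Ohio | 1.5 | 0.0 |
| two | 2001 | Ohio | 1.7 | 1.0 |
| three | 2002 | Ohio | 3.6 | 2.0 |
| four | 2001 | Nevada | 2.4 | 3.0 |
| five | 2002 | Nevada | 2.9 | 4.0 |
| six | 2003 | Nevada | 3.2 | 5.0 |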


In [43]:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

In [44]:
frame2['debt'] = val

In [45]:
frame2

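|  | year | state | pop | debt |
| --- | --- | --- | --- | --- |
| one | 2000 | Ohio | 1.5 | NaN |
| two | 2001 | Ohio | 1.7 | -1.2 |
| three | 2002 | Ohio | 3.6 | NaN |
| four | 2001 | Nevada | 2.4 | -1.5 |
| five | 2002 | Nevada | 2.9 | -1.7 |
| six | 2003 | Nevada | 3.2 | NaN |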


Assigning a column that doesn’t exist will create a new column. The del keyword will
delete columns as with a dict.

In [46]:
frame2['eastern'] = frame2.state == 'Ohio'

In [47]:
frame2

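|  | year | state | pop | debt | eastern |
| --- | --- | --- | --- | --- | --- |
| one | 2000 | Ohio | 1.5 | NaN | True |
| two | 2001 | Ohio | 1.7 | -1.2 | True |
| three | 2002 | Ohio | 3.6 | NaN | True |
| four | 2001 | Nevada | 2.4 | -1.5 | False |
| five | 2002 | Nevada | 2.9 | -1.7 | False |
| six | 2003 | Nevada | 3.2 | NaN | False |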


In [48]:
del frame2['eastern']

In [49]:
frame2

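|  | year | state | pop | debt |
| --- | --- | --- | --- | --- |
| one | 2000 | Ohio | 1.5 | NaN |
| two | 2001 | Ohio | 1.7 | -1.2 |
| three | 2002 | Ohio | 3.6 | NaN |
| four | 2001 | Nevada | 2.4 | -1.5 |
| five | 2002 | Nevada | 2.9 | -1.7 |
| six | 2003 | Nevada | 3.2 | NaN |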


The column returned from indexing a DataFrame is a view on the
underlying data, not a copy. Thus, any in-place modifications to the
Series will be reflected in the DataFrame. The column can be
explicitly copied with the Series’s copy method.
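
As a hedged illustration of the copy method mentioned above: whether an uncopied column shares memory with its DataFrame can depend on the pandas version and its copy-on-write settings, but an explicit copy is independent in every case. The frame below is a made-up stand-in.

```python
import pandas as pd

df = pd.DataFrame({'pop': [1.5, 1.7, 3.6]})

# An explicit copy is independent of the DataFrame it came from.
col = df['pop'].copy()
col.iloc[0] = 99.0

print(df['pop'].iloc[0])  # 1.5
print(col.iloc[0])        # 99.0
```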


In [50]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9}, 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [51]:
frame3 = pd.DataFrame(pop)

If a nested dict is passed to DataFrame, pandas interprets the outer dict keys
as the columns and the inner keys as the row indices:

In [52]:
frame3.T

|  | 2001 | 2002 | 2000 |
| --- | --- | --- | --- |
| Nevada | 2.4 | 2.9 | NaN |
| Ohio | 1.7 | 3.6 | 1.5 |


In [53]:
pd.DataFrame(pop, index=[2001, 2002, 2003])

|  | Nevada | Ohio |
| --- | --- | --- |
| 2001 | 2.4 | 1.7 |
| 2002 | 2.9 | 3.6 |
| 2003 | NaN | NaN |


In [54]:
pdata = {'Ohio': frame3['Ohio'][:-1], 'Nevada': frame3['Nevada'][:2]}

In [55]:
pd.DataFrame(pdata)

|  | Ohio | Nevada |
| --- | --- | --- |
| 2001 | 1.7 | 2.4 |
| 2002 | 3.6 | 2.9 |


In [56]:
frame3.index.name = 'year'
frame3.columns.name = 'state'

In [57]:
frame3

| state | Nevada | Ohio |
| --- | --- | --- |
| **year** |  |  |
| 2001 | 2.4 | 1.7 |
| 2002 | 2.9 | 3.6 |
| 2000 | NaN | 1.5 |


In [58]:
frame3.values

array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])

In [59]:
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)

**Index Objects**

In [60]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])

In [61]:
index = obj.index

In [62]:
index[1:]

Index(['b', 'c'], dtype='object')

In [63]:
obj

a    0
b    1
c    2
dtype: int64

Index objects are immutable and thus can’t be modified by the user:

In [64]:
index[1] = 'd'

TypeError: Index does not support mutable operations

Immutability makes it safer to share Index objects among data structures:

In [65]:
labels = pd.Index(np.arange(3))
labels

Int64Index([0, 1, 2], dtype='int64')

In [66]:
obj2 = pd.Series([1.5, -2.5, 0], index=labels) 

In [67]:
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64

In [68]:
obj2.index is labels

True

Some users will not often take advantage of the capabilities provided
by indexes, but because some operations will yield results
containing indexed data, it’s important to understand how they
work.

In [69]:
frame3

| state | Nevada | Ohio |
| --- | --- | --- |
| **year** |  |  |
| 2001 | 2.4 | 1.7 |
| 2002 | 2.9 | 3.6 |
| 2000 | NaN | 1.5 |


In [70]:
frame3.columns

Index(['Nevada', 'Ohio'], dtype='object', name='state')

In [71]:
'Ohio' in frame3.columns

True

In [72]:
2003 in frame3.index

False

Unlike Python sets, a pandas Index can contain duplicate labels:

In [73]:
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])
dup_labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

In [74]:
pd.Index(['foo', 'foo', 'bar', 'bar'])

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

In [75]:
dup_labels

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

In [76]:
dup_labels.isin(['foo'])

array([ True,  True, False, False])
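
Although an Index may hold duplicates, it still supports set-style operations. The sketch below uses made-up labels (left and right are illustrative names) to show a few of them; it is an illustration, not an exhaustive list.

```python
import pandas as pd

left = pd.Index(['a', 'b', 'c'])
right = pd.Index(['b', 'c', 'd'])

print(left.union(right))         # labels found in either index
print(left.intersection(right))  # labels found in both
print(left.difference(right))    # labels found only in left

# is_unique reports whether any label is repeated.
print(pd.Index(['foo', 'foo', 'bar']).is_unique)  # False
```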

**Reindexing**

In [77]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

In [78]:
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [79]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

In [80]:
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

For ordered data like time series, it may be desirable to do some interpolation or filling
of values when reindexing. The method option allows us to do this, using a
method such as ffill, which forward-fills the values:

In [81]:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0,2,4])
obj3

0      blue
2    purple
4    yellow
dtype: object

In [82]:
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

In [83]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])

In [84]:
frame

|  | Ohio | Texas | California |
| --- | --- | --- | --- |
| a | 0 | 1 | 2 |
| c | 3 | 4 | 5 |
| d | 6 | 7 | 8 |


In [85]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])

In [86]:
frame2

|  | Ohio | Texas | California |
| --- | --- | --- | --- |
| a | 0.0 | 1.0 | 2.0 |
| b | NaN | NaN | NaN |
| c | 3.0 | 4.0 | 5.0 |
| d | 6.0 | 7.0 | 8.0 |


In [87]:
states = ['Texas', 'Utah', 'California']

In [88]:
frame.reindex(columns=states)

|  | Texas | Utah | California |
| --- | --- | --- | --- |
| a | 1 | NaN | 2 |
| c | 4 | NaN | 5 |
| d | 7 | NaN | 8 |


In [89]:
frame = frame.reindex(['a', 'b', 'c', 'd'])

In [90]:
frame

|  | Ohio | Texas | California |
| --- | --- | --- | --- |
| a | 0.0 | 1.0 | 2.0 |
| b | NaN | NaN | NaN |
| c | 3.0 | 4.0 | 5.0 |
| d | 6.0 | 7.0 | 8.0 |


**Dropping Entries from an Axis**

In [91]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

In [92]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [93]:
new_obj = obj.drop(['c', 'd'])

In [94]:
new_obj

a    0.0
b    1.0
e    4.0
dtype: float64

In [95]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

In [96]:
data

|  | one | two | three | four |
| --- | --- | --- | --- | --- |
| Ohio | 0 | 1 | 2 | 3 |
| Colorado | 4 | 5 | 6 | 7 |
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |


In [97]:
data.drop(['Colorado', 'Ohio'])

|  | one | two | three | four |
| --- | --- | --- | --- | --- |
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |


In [98]:
data.drop('two', axis=1)

|  | one | three | four |
| --- | --- | --- | --- |
| Ohio | 0 | 2 | 3 |
| Colorado | 4 | 6 | 7 |
| Utah | 8 | 10 | 11 |
| New York | 12 | 14 | 15 |


In [99]:
data.drop(['two', 'four'], axis='columns')

|  | one | three |
| --- | --- | --- |
| Ohio | 0 | 2 |
| Colorado | 4 | 6 |
| Utah | 8 | 10 |
| New York | 12 | 14 |


In [100]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [101]:
obj.drop('c', inplace=True)

In [102]:
obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

Be careful with inplace=True, as it modifies the object directly and destroys any data that is dropped.
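
A minimal sketch of the non-destructive alternative: without inplace=True, drop returns a new object and leaves the original untouched. The name trimmed is illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# drop returns a new Series; obj itself keeps all five entries.
trimmed = obj.drop('c')

print(len(obj))                    # 5
print(len(trimmed))                # 4
print('c' in obj, 'c' in trimmed)  # True False
```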

**Indexing, Selection, and Filtering**

In [103]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

In [104]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [105]:
obj['b']

1.0

In [106]:
obj[1]

1.0

In [107]:
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [108]:
obj[['b', 'a', 'd']]

b    1.0
a    0.0
d    3.0
dtype: float64

In [109]:
obj[[1, 3]]

b    1.0
d    3.0
dtype: float64

In [110]:
obj[obj < 2]

a    0.0
b    1.0
dtype: float64

Slicing with labels behaves differently than normal Python slicing in that the endpoint
is inclusive:

In [111]:
obj['b':'c']

b    1.0
c    2.0
dtype: float64

Setting using these methods modifies the corresponding section of the Series:

In [112]:
obj['b': 'c'] = 5

In [113]:
obj

a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

Indexing into a DataFrame is for retrieving one or more columns either with a single
value or sequence:

In [114]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
     index=['Ohio', 'Colorado', 'Utah', 'New York'],
     columns=['one', 'two', 'three', 'four'])

In [115]:
data

|  | one | two | three | four |
| --- | --- | --- | --- | --- |
| Ohio | 0 | 1 | 2 | 3 |
| Colorado | 4 | 5 | 6 | 7 |
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |


In [116]:
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

In [117]:
data[:2]

|  | one | two | three | four |
| --- | --- | --- | --- | --- |
| Ohio | 0 | 1 | 2 | 3 |
| Colorado | 4 | 5 | 6 | 7 |


In [118]:
data[data['three'] > 5]

|  | one | two | three | four |
| --- | --- | --- | --- | --- |
| Colorado | 4 | 5 | 6 | 7 |
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |


In [119]:
data < 5

|  | one | two | three | four |
| --- | --- | --- | --- | --- |
| Ohio | True | True | True | True |
| Colorado | True | False | False | False |
| Utah | False | False | False | False |
| New York | False | False | False | False |


In [120]:
data

|  | one | two | three | four |
| --- | --- | --- | --- | --- |
| Ohio | 0 | 1 | 2 | 3 |
| Colorado | 4 | 5 | 6 | 7 |
| Utah | 8 | 9 | 10 | 11 |
| New York | 12 | 13 | 14 | 15 |


**Selection with loc and iloc**

For DataFrame label-indexing on the rows, I introduce the special indexing operators
loc and iloc. They enable you to select a subset of the rows and columns from a
DataFrame with NumPy-like notation using either axis labels (loc) or integers
(iloc).

In [121]:
data.loc['Colorado', ['two', 'three']]

two      5
three    6
Name: Colorado, dtype: int32

In [122]:
data.iloc[2, [3,0,1]]

four    11
one      8
two      9
Name: Utah, dtype: int32

In [123]:
data.iloc[[1,2], [3,0,1]]

|  | four | one | two |
| --- | --- | --- | --- |
| Colorado | 7 | 4 | 5 |
| Utah | 11 | 8 | 9 |


In [124]:
data.loc[:'Utah', 'two']

Ohio        1
Colorado    5
Utah        9
Name: two, dtype: int32

In [125]:
data.iloc[:, :3][data.three>5]

|  | one | two | three |
| --- | --- | --- | --- |
| Colorado | 4 | 5 | 6 |
| Utah | 8 | 9 | 10 |
| New York | 12 | 13 | 14 |


**Integer Indexes**

In [126]:
ser = pd.Series(np.arange(3.))

In [127]:
ser

0    0.0
1    1.0
2    2.0
dtype: float64

In [128]:
ser[-1]

KeyError: -1

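In the example above, ser[-1] raises a KeyError: because the index contains integers, pandas treats -1 as a label rather than a position, and no such label exists. On the other hand, with a non-integer index, there is no potential for ambiguity: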

In [129]:
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])

In [130]:
ser2

a    0.0
b    1.0
c    2.0
dtype: float64

In [131]:
ser2[-1]

2.0

In [132]:
ser[:1]

0    0.0
dtype: float64

In [133]:
ser.loc[:1]

0    0.0
1    1.0
dtype: float64

In [134]:
ser.iloc[:1]

0    0.0
dtype: float64
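
To keep label-based and position-based access apart on an integer index, loc and iloc can be used explicitly. The following sketch assumes a made-up Series whose integer labels deliberately differ from the positions.

```python
import numpy as np
import pandas as pd

# Integer labels 10, 20, 30 sit at positions 0, 1, 2.
ser = pd.Series(np.arange(3.), index=[10, 20, 30])

by_label = ser.loc[:20]     # label-based: the endpoint 20 is included
by_position = ser.iloc[:1]  # position-based: the end position is excluded
last = ser.iloc[-1]         # unambiguous way to get the last element

print(len(by_label), len(by_position), last)  # 2 1 2.0
```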

**Arithmetic and Data Alignment**

In [135]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])

In [136]:
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

In [137]:
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [138]:
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [139]:
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

The internal data alignment introduces missing values in the label locations that don’t
overlap. Missing values will then propagate in further arithmetic computations.
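
The propagation can be seen in a small sketch with two made-up Series that overlap on a single label:

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=['a', 'b'])
s2 = pd.Series([10.0, 20.0], index=['b', 'c'])

total = s1 + s2     # only 'b' exists in both, so 'a' and 'c' become NaN
scaled = total * 2  # NaN stays NaN in later arithmetic

print(total)
print(scaled)
print(scaled.sum())  # 24.0, since reductions skip NaN by default
```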

In [140]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)),
                   columns=list('bcd'), 
                   index=['Ohio', 'Texas', 'Colorado'])

In [141]:
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                   columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])


In [142]:
df1

|  | b | c | d |
| --- | --- | --- | --- |
| Ohio | 0.0 | 1.0 | 2.0 |
| Texas | 3.0 | 4.0 | 5.0 |
| Colorado | 6.0 | 7.0 | 8.0 |


In [143]:
df2

|  | b | d | e |
| --- | --- | --- | --- |
| Utah | 0.0 | 1.0 | 2.0 |
| Ohio | 3.0 | 4.0 | 5.0 |
| Texas | 6.0 | 7.0 | 8.0 |
| Oregon | 9.0 | 10.0 | 11.0 |


In [144]:
df1 + df2

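|  | b | c | d | e |
| --- | --- | --- | --- | --- |
| Colorado | NaN | NaN | NaN | NaN |
| Ohio | 3.0 | NaN | 6.0 | NaN |
| Oregon | NaN | NaN | NaN | NaN |
| Texas | 9.0 | NaN | 12.0 | NaN |
| Utah | NaN | NaN | NaN | NaN |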


In [145]:
df1 = pd.DataFrame({'A': [1, 2]})

In [146]:
df2 = pd.DataFrame({'B': [3, 4]})

In [147]:
df1

|  | A |
| --- | --- |
| 0 | 1 |
| 1 | 2 |


In [148]:
df2

|  | B |
| --- | --- |
| 0 | 3 |
| 1 | 4 |


In [149]:
df1 - df2

|  | A | B |
| --- | --- | --- |
| 0 | NaN | NaN |
| 1 | NaN | NaN |


**Arithmetic methods with fill values**

In [150]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))

In [151]:
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

In [152]:
df2.loc[1, 'b'] = np.nan

In [153]:
df1

|  | a | b | c | d |
| --- | --- | --- | --- | --- |
| 0 | 0.0 | 1.0 | 2.0 | 3.0 |
| 1 | 4.0 | 5.0 | 6.0 | 7.0 |
| 2 | 8.0 | 9.0 | 10.0 | 11.0 |


In [154]:
df2

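|  | a | b | c | d | e |
| --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 1.0 | 2.0 | 3.0 | 4.0 |
| 1 | 5.0 | NaN | 7.0 | 8.0 | 9.0 |
| 2 | 10.0 | 11.0 | 12.0 | 13.0 | 14.0 |
| 3 | 15.0 | 16.0 | 17.0 | 18.0 | 19.0 |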


In [155]:
df1 + df2

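|  | a | b | c | d | e |
| --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 2.0 | 4.0 | 6.0 | NaN |
| 1 | 9.0 | NaN | 13.0 | 15.0 | NaN |
| 2 | 18.0 | 20.0 | 22.0 | 24.0 | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN |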


In [156]:
df1.add(df2, fill_value=0)

|  | a | b | c | d | e |
| --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 2.0 | 4.0 | 6.0 | 4.0 |
| 1 | 9.0 | 5.0 | 13.0 | 15.0 | 9.0 |
| 2 | 18.0 | 20.0 | 22.0 | 24.0 | 14.0 |
| 3 | 15.0 | 16.0 | 17.0 | 18.0 | 19.0 |


In [157]:
1/df1

|  | a | b | c | d |
| --- | --- | --- | --- | --- |
| 0 | inf | 1.0 | 0.5 | 0.333333 |
| 1 | 0.25 | 0.2 | 0.166667 | 0.142857 |
| 2 | 0.125 | 0.111111 | 0.1 | 0.090909 |


In [158]:
df1.rdiv(1)

|  | a | b | c | d |
| --- | --- | --- | --- | --- |
| 0 | inf | 1.0 | 0.5 | 0.333333 |
| 1 | 0.25 | 0.2 | 0.166667 | 0.142857 |
| 2 | 0.125 | 0.111111 | 0.1 | 0.090909 |


Relatedly, when reindexing a Series or DataFrame, you can also specify a different fill
value:

In [159]:
df1.reindex(columns=df2.columns, fill_value=0)

|  | a | b | c | d | e |
| --- | --- | --- | --- | --- | --- |
| 0 | 0.0 | 1.0 | 2.0 | 3.0 | 0 |
| 1 | 4.0 | 5.0 | 6.0 | 7.0 | 0 |
| 2 | 8.0 | 9.0 | 10.0 | 11.0 | 0 |


**Operations between DataFrame and Series**

In [160]:
arr = np.arange(12.).reshape((3, 4))

In [161]:
arr

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

When we subtract arr[0] from arr, the subtraction is performed once for each row.
This is referred to as broadcasting, a general feature of NumPy arrays:

In [162]:
arr - arr[0]

array([[0., 0., 0., 0.],
       [4., 4., 4., 4.],
       [8., 8., 8., 8.]])

In [163]:
arr

array([[ 0.,  1.,  2.,  3.],
       [ 4.,  5.,  6.,  7.],
       [ 8.,  9., 10., 11.]])

In [164]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), 
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [165]:
series = frame.iloc[0]

In [166]:
frame

|  | b | d | e |
| --- | --- | --- | --- |
| Utah | 0.0 | 1.0 | 2.0 |
| Ohio | 3.0 | 4.0 | 5.0 |
| Texas | 6.0 | 7.0 | 8.0 |
| Oregon | 9.0 | 10.0 | 11.0 |


In [167]:
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

In [168]:
frame - series

|  | b | d | e |
| --- | --- | --- | --- |
| Utah | 0.0 | 0.0 | 0.0 |
| Ohio | 3.0 | 3.0 | 3.0 |
| Texas | 6.0 | 6.0 | 6.0 |
| Oregon | 9.0 | 9.0 | 9.0 |


In [169]:
series2 = pd.Series(range(3), index=['b', 'e', 'f'])

In [170]:
frame + series2

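|  | b | d | e | f |
| --- | --- | --- | --- | --- |
| Utah | 0.0 | NaN | 3.0 | NaN |
| Ohio | 3.0 | NaN | 6.0 | NaN |
| Texas | 6.0 | NaN | 9.0 | NaN |
| Oregon | 9.0 | NaN | 12.0 | NaN |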


In [171]:
series3 = frame['d']

In [172]:
frame

|  | b | d | e |
| --- | --- | --- | --- |
| Utah | 0.0 | 1.0 | 2.0 |
| Ohio | 3.0 | 4.0 | 5.0 |
| Texas | 6.0 | 7.0 | 8.0 |
| Oregon | 9.0 | 10.0 | 11.0 |


In [173]:
series3

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

If you want to instead broadcast over the columns, matching on the rows, you have to
use one of the arithmetic methods. For example:

In [174]:
frame.sub(series3, axis='index')

|  | b | d | e |
| --- | --- | --- | --- |
| Utah | -1.0 | 0.0 | 1.0 |
| Ohio | -1.0 | 0.0 | 1.0 |
| Texas | -1.0 | 0.0 | 1.0 |
| Oregon | -1.0 | 0.0 | 1.0 |


The axis that you pass is the axis to match on. In this case we mean to match
on the DataFrame’s row index (axis='index' or axis=0) and broadcast across the columns.
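
One way to check this reading of axis is to compare the result with plain NumPy broadcasting. The sketch below rebuilds the same frame as above; the names by_rows and same are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Match on the row labels and subtract column 'd' from every column.
by_rows = frame.sub(frame['d'], axis='index')

# The same computation in NumPy: a (4, 1) column is stretched across
# the three columns of the (4, 3) array.
arr = frame.to_numpy()
same = arr - arr[:, [1]]

print(np.array_equal(by_rows.to_numpy(), same))  # True
```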

**Function Application and Mapping**

In [175]:
frame = pd.DataFrame(np.random.randn(4, 3),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

|  | b | d | e |
| --- | --- | --- | --- |
| Utah | 1.372975 | 0.801887 | 1.334447 |
| Ohio | -0.582438 | -0.70799 | -0.415254 |
| Texas | 0.135311 | 1.711898 | 1.216171 |
| Oregon | -1.354899 | -0.962641 | 0.518327 |


In [176]:
np.abs(frame)

|  | b | d | e |
| --- | --- | --- | --- |
| Utah | 1.372975 | 0.801887 | 1.334447 |
| Ohio | 0.582438 | 0.70799 | 0.415254 |
| Texas | 0.135311 | 1.711898 | 1.216171 |
| Oregon | 1.354899 | 0.962641 | 0.518327 |


In [177]:
f = lambda x: x.max() - x.min()

In [178]:
frame.apply(f)

b    2.727874
d    2.674539
e    1.749701
dtype: float64

Here the function f, which computes the difference between the maximum and minimum
of a Series, is invoked once on each column in frame. The result is a Series having
the columns of frame as its index.

In [179]:
# If you pass axis='columns' to apply, the function will be invoked once per row
# instead:
frame.apply(f, axis='columns')

Utah      0.571088
Ohio      0.292736
Texas     1.576587
Oregon    1.873226
dtype: float64
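
Simple reductions such as f often do not need apply at all, because DataFrame already offers them as methods. A minimal sketch, assuming randomly generated data (seeded here only so the run is repeatable), compares the two routes:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for repeatability
frame = pd.DataFrame(rng.standard_normal((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

f = lambda x: x.max() - x.min()

via_apply = frame.apply(f)               # f is called once per column
via_methods = frame.max() - frame.min()  # built-in column-wise reductions

print(np.allclose(via_apply, via_methods))  # True
```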

In [180]:
frame

|  | b | d | e |
| --- | --- | --- | --- |
| Utah | 1.372975 | 0.801887 | 1.334447 |
| Ohio | -0.582438 | -0.70799 | -0.415254 |
| Texas | 0.135311 | 1.711898 | 1.216171 |
| Oregon | -1.354899 | -0.962641 | 0.518327 |


In [181]:
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])

frame.apply(f)

|  | b | d | e |
| --- | --- | --- | --- |
| min | -1.354899 | -0.962641 | -0.415254 |
| max | 1.372975 | 1.711898 | 1.334447 |


In [182]:
format = lambda x: '%.2f' % x

In [183]:
frame.applymap(format)

|  | b | d | e |
| --- | --- | --- | --- |
| Utah | 1.37 | 0.80 | 1.33 |
| Ohio | -0.58 | -0.71 | -0.42 |
| Texas | 0.14 | 1.71 | 1.22 |
| Oregon | -1.35 | -0.96 | 0.52 |


In [184]:
frame['e'].map(format)

Utah       1.33
Ohio      -0.42
Texas      1.22
Oregon     0.52
Name: e, dtype: object

**Sorting and Ranking**

In [185]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])

In [186]:
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [187]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'], columns=['d', 'a', 'b', 'c'])

In [188]:
frame.sort_index()

|  | d | a | b | c |
| --- | --- | --- | --- | --- |
| one | 4 | 5 | 6 | 7 |
| three | 0 | 1 | 2 | 3 |


In [189]:
frame.sort_index(axis=1)

|  | a | b | c | d |
| --- | --- | --- | --- | --- |
| three | 1 | 2 | 3 | 0 |
| one | 5 | 6 | 7 | 4 |


The data is sorted in ascending order by default, but can be sorted in descending
order, too:

In [190]:
frame.sort_index(axis=1, ascending=False)

|  | d | c | b | a |
| --- | --- | --- | --- | --- |
| three | 0 | 3 | 2 | 1 |
| one | 4 | 7 | 6 | 5 |


In [191]:
obj = pd.Series([4, 7, -3, 2])

In [192]:
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

Any missing values are sorted to the end of the Series by default:

In [193]:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64
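
Where missing values end up can be changed with the na_position parameter of sort_values. A short sketch on the same data (nan_first and descending are illustrative names):

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

# Put the missing values at the start instead of the end.
nan_first = obj.sort_values(na_position='first')

# Sorting in descending order still leaves NaN at the end by default.
descending = obj.sort_values(ascending=False)

print(nan_first)
print(descending)
```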

In [194]:
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

In [195]:
frame

|  | b | a |
| --- | --- | --- |
| 0 | 4 | 0 |
| 1 | 7 | 1 |
| 2 | -3 | 0 |
| 3 | 2 | 1 |


In [196]:
frame.sort_values(by='b')

|  | b | a |
| --- | --- | --- |
| 2 | -3 | 0 |
| 3 | 2 | 1 |
| 0 | 4 | 0 |
| 1 | 7 | 1 |


In [197]:
frame.sort_values(by=['a', 'b'])

|  | b | a |
| --- | --- | --- |
| 2 | -3 | 0 |
| 0 | 4 | 0 |
| 3 | 2 | 1 |
| 1 | 7 | 1 |


Ranking assigns ranks from one through the number of valid data points in an array.
The rank methods for Series and DataFrame are the place to look; by default rank
breaks ties by assigning each group the mean rank:

In [198]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

In [199]:
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [200]:
obj.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [201]:
obj.rank(ascending=False, method='max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
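
Besides the default averaging and the 'first' and 'max' options shown above, rank also accepts 'min' and 'dense' as tie-breaking methods. A small sketch on the same Series (lowest and dense are illustrative names):

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

lowest = obj.rank(method='min')   # tied values share the lowest rank in the group
dense = obj.rank(method='dense')  # like 'min', but ranks increase without gaps

print(lowest.tolist())  # [6.0, 1.0, 6.0, 4.0, 3.0, 2.0, 4.0]
print(dense.tolist())   # [5.0, 1.0, 5.0, 4.0, 3.0, 2.0, 4.0]
```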

In [202]:
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1], 'c': [-2, 5, 8, -2.5]})

In [203]:
frame

|  | b | a | c |
| --- | --- | --- | --- |
| 0 | 4.3 | 0 | -2.0 |
| 1 | 7.0 | 1 | 5.0 |
| 2 | -3.0 | 0 | 8.0 |
| 3 | 2.0 | 1 | -2.5 |


In [204]:
frame.rank(axis='columns')

|  | b | a | c |
| --- | --- | --- | --- |
| 0 | 3.0 | 2.0 | 1.0 |
| 1 | 3.0 | 1.0 | 2.0 |
| 2 | 1.0 | 2.0 | 3.0 |
| 3 | 3.0 | 2.0 | 1.0 |


**Axis Indexes with Duplicate Labels**

In [205]:
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

In [206]:
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [207]:
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])

In [208]:
df

|  | 0 | 1 | 2 |
| --- | --- | --- | --- |
| a | -1.501786 | 0.560845 | -1.751286 |
| a | -1.373868 | -1.881644 | -0.313102 |
| b | 2.580049 | 0.824563 | -1.716245 |
| b | 0.659571 | 0.784268 | 0.134106 |
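
Duplicate labels change what selection returns. The following sketch reuses the Series defined above (dup and single are illustrative names): a label that occurs more than once yields a Series, while a label that occurs once yields a scalar.

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

# is_unique reports whether every label occurs only once.
print(obj.index.is_unique)  # False

dup = obj['a']     # label appears twice: the result is a Series
single = obj['c']  # label appears once: the result is a scalar

print(type(dup).__name__, len(dup))  # Series 2
print(single)                        # 4
```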


**Summarizing and Computing Descriptive Statistics**

In [213]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5],
 [np.nan, np.nan], [0.75, -1.3]],
  index=['a', 'b', 'c', 'd'],
  columns=['one', 'two'])

In [214]:
df

|  | one | two |
| --- | --- | --- |
| a | 1.4 | NaN |
| b | 7.1 | -4.5 |
| c | NaN | NaN |
| d | 0.75 | -1.3 |


In [215]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [216]:
df.sum(axis='columns')

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [218]:
df.mean(axis='columns', skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

In [219]:
df.idxmax()

one    b
two    d
dtype: object

In [220]:
df.cumsum()

|  | one | two |
| --- | --- | --- |
| a | 1.4 | NaN |
| b | 8.5 | -4.5 |
| c | NaN | NaN |
| d | 9.25 | -5.8 |


In [222]:
df.describe()

|  | one | two |
| --- | --- | --- |
| count | 3.0 | 2.0 |
| mean | 3.083333 | -2.9 |
| std | 3.493685 | 2.262742 |
| min | 0.75 | -4.5 |
| 25% | 1.075 | -3.7 |
| 50% | 1.4 | -2.9 |
| 75% | 4.25 | -2.1 |
| max | 7.1 | -1.3 |


On non-numeric data, describe produces alternative summary statistics:

In [223]:
obj = pd.Series(['a', 'a', 'b', 'c'] * 4)

In [224]:
obj.describe()

count     16
unique     3
top        a
freq       8
dtype: object

**Correlation and Covariance**

In [225]:
!pip install pandas-datareader


In [228]:
import pandas_datareader.data as web
all_data = {ticker: web.get_data_yahoo(ticker) for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}

In [230]:
price = pd.DataFrame({ticker: data['Adj Close'] for ticker, data in all_data.items()})

In [231]:
volume = pd.DataFrame({ticker: data['Volume'] for ticker, data in all_data.items()})

In [232]:
returns = price.pct_change()

In [233]:
returns.tail()

| Date | AAPL | IBM | MSFT | GOOG |
| --- | --- | --- | --- | --- |
| 2020-05-08 | 0.023802 | 0.014518 | 0.005882 | 0.011519 |
| 2020-05-11 | 0.015735 | -0.003252 | 0.011154 | 0.010725 |
| 2020-05-12 | -0.011428 | -0.019006 | -0.022652 | -0.019611 |
| 2020-05-13 | -0.012074 | -0.037668 | -0.015122 | -0.019197 |
| 2020-05-14 | 0.006143 | 0.010542 | 0.004339 | 0.00504 |


In [235]:
returns['MSFT'].corr(returns['IBM'])

0.6004855284577707

In [236]:
returns['MSFT'].cov(returns['IBM'])

0.00016228809404216508

In [238]:
returns.MSFT.corr(returns.IBM)

0.6004855284577707

In [240]:
returns.corr()


|  | AAPL | IBM | MSFT | GOOG |
| --- | --- | --- | --- | --- |
| AAPL | 1.0 | 0.532697 | 0.710988 | 0.642625 |
| IBM | 0.532697 | 1.0 | 0.600486 | 0.530369 |
| MSFT | 0.710988 | 0.600486 | 1.0 | 0.750899 |
| GOOG | 0.642625 | 0.530369 | 0.750899 | 1.0 |


In [241]:
returns.cov()

|  | AAPL | IBM | MSFT | GOOG |
| --- | --- | --- | --- | --- |
| AAPL | 0.000328 | 0.000152 | 0.000222 | 0.000199 |
| IBM | 0.000152 | 0.000247 | 0.000162 | 0.000143 |
| MSFT | 0.000222 | 0.000162 | 0.000296 | 0.000221 |
| GOOG | 0.000199 | 0.000143 | 0.000221 | 0.000293 |


Using DataFrame’s corrwith method, you can compute pairwise correlations
between a DataFrame’s columns or rows with another Series or DataFrame. Passing a
Series returns a Series with the correlation value computed for each column:


In [242]:
returns.corrwith(returns.IBM)

AAPL    0.532697
IBM     1.000000
MSFT    0.600486
GOOG    0.530369
dtype: float64

Passing a DataFrame computes the correlations of matching column names. Here I
compute correlations of percent changes with volume:


In [243]:
returns.corrwith(volume)

AAPL   -0.141765
IBM    -0.105764
MSFT   -0.067060
GOOG   -0.039164
dtype: float64

Passing axis='columns' does things row-by-row instead. In all cases, the data points
are aligned by label before the correlation is computed.
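
Because the stock data above requires a network download, here is a self-contained sketch of corrwith on synthetic data. The column names X, Y, Z and all values are made up and are not market returns; the point is only to show the Series case, the DataFrame case, and the alignment by label.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # synthetic data, seeded for repeatability
dates = pd.date_range('2020-01-01', periods=50)
base = rng.standard_normal(50)

returns = pd.DataFrame({'X': base,
                        'Y': base + 0.5 * rng.standard_normal(50),
                        'Z': -base},
                       index=dates)

# Series argument: one correlation per column of returns.
with_x = returns.corrwith(returns['X'])
print(with_x)

# DataFrame argument: columns are paired by name and rows aligned by
# label, so shuffling the other frame's rows does not change the result.
shuffled = returns.sample(frac=1, random_state=0)
with_shuffled = returns.corrwith(shuffled)
print(with_shuffled)
```

Column Z is the exact negative of X, so its correlation with X is -1, and every column correlates perfectly with its own shuffled copy once the rows are aligned.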

**Unique Values, Value Counts, and Membership**

In [248]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

In [249]:
uniques = obj.unique()

In [251]:
uniques.sort()

In [253]:
obj.value_counts()

a    3
c    3
b    2
d    1
dtype: int64

In [256]:
pd.value_counts(obj.values, sort=False)

b    2
c    3
d    1
a    3
dtype: int64

In [257]:
obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [258]:
mask = obj.isin(['b', 'c'])

In [259]:
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [260]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object

In [261]:
to_match = pd.Series(['c', 'a', 'b', 'b', 'c', 'a'])

In [262]:
unique_vals = pd.Series(['c', 'b', 'a'])

In [263]:
pd.Index(unique_vals).get_indexer(to_match)

array([0, 2, 1, 1, 0, 2], dtype=int64)

In [264]:
data = pd.DataFrame({'Qu1': [1, 3, 4, 3, 4],
                     'Qu2': [2, 3, 1, 2, 3],
                     'Qu3': [1, 5, 2, 4, 4]})

In [265]:
data

|  | Qu1 | Qu2 | Qu3 |
| --- | --- | --- | --- |
| 0 | 1 | 2 | 1 |
| 1 | 3 | 3 | 5 |
| 2 | 4 | 1 | 2 |
| 3 | 3 | 2 | 4 |
| 4 | 4 | 3 | 4 |


In [266]:
result = data.apply(pd.value_counts).fillna(0)

In [267]:
result

|  | Qu1 | Qu2 | Qu3 |
| --- | --- | --- | --- |
| 1 | 1.0 | 1.0 | 1.0 |
| 2 | 0.0 | 2.0 | 1.0 |
| 3 | 2.0 | 2.0 | 0.0 |
| 4 | 2.0 | 0.0 | 2.0 |
| 5 | 0.0 | 0.0 | 1.0 |
