# Chapter 5: Getting Started with Pandas


pandas contains data structures and data manipulation tools that make data cleaning and
analysis fast and easy. pandas is usually used in tandem with other libraries.

The biggest difference between pandas and NumPy is that pandas is designed for
working with heterogeneous, tabular data while NumPy is designed for homogeneous
numerical array data.

In [1]:
import pandas as pd
import numpy as np

## 5.1: Introduction to pandas Data Structures


The two primary data structures are the *Series* and the *DataFrame*.

### Series

A Series is a one-dimensional array-like object containing an indexed sequence
of values (of a single dtype). The index acts as an entry's label and can be 
anything from an integer to a datetime object.

In [2]:
obj = pd.Series([4,7,-5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

A default index of integers from 0 to N-1 is provided for a Series with an 
unspecified index.

To get the values or the index of a Series, use the `values` or `index` attributes
of the Series object, respectively.

`values` returns a NumPy array and `index` returns a pandas Index object.

In [3]:
obj.values

array([ 4,  7, -5,  3])

In [4]:
obj.index

RangeIndex(start=0, stop=4, step=1)

To create a Series with a custom index, pass a list-like or Index object
through the `index` parameter.

In [5]:
obj2 = pd.Series([4,7,-5,3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [6]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

You can use these labels to select values in the Series.

In [7]:
obj2['a']

-5

In [8]:
obj2[['c', 'a', 'd']]

c    3
a   -5
d    4
dtype: int64

You can also use NumPy-like operations such as array-wide operations and 
boolean array indexing

In [9]:
obj2*2

d     8
b    14
a   -10
c     6
dtype: int64

In [10]:
np.exp(obj2)

d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

In [11]:
obj2[obj2 > 0]

d    4
b    7
c    3
dtype: int64

Another way to think about a Series is as a fixed-length, ordered dict, since
it is a mapping of index values to data values. It can be used in many contexts
where you would use a dict, such as membership tests with `in`.

Additionally, a Series object can be made from a Python dict rather than a pair of
lists.

In [12]:
'b' in obj2

True

In [13]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

Notice that when passing a dict, the resulting Series keeps the keys in the dict's
insertion order (older pandas versions sorted the keys instead).

You can override this by passing an ordered list of index labels through the `index`
argument.

In [14]:
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

Note that since California was not given a value in sdata, it appears as NaN 
(not a number) which is used in pandas to mark missing or NA values.

Additionally, since Utah was not in states, it was omitted from obj4.

In [15]:
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [16]:
obj4.notnull()

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

Chapter 7 will cover working with missing data in more detail

Another useful feature of Series is that arithmetic between two Series objects
automatically aligns by index label before the element-wise operation is applied.

In [17]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

Both the Series object and its Index also have names.

In [18]:
obj4.name = 'population'
obj4.index.name = 'state'
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

### DataFrame

A DataFrame is a rectangular table of data with an ordered collection of 
columns, each with its own data type.

The DataFrame has both a row and a column index; it can be thought of as a dict
of Series, all sharing the same index.

While a DataFrame is physically two-dimensional, it can be used to represent
higher-dimensional data through hierarchical indexing (see Chapter 8).

There are many ways to construct a DataFrame. You can either pass a rectangular
array-like along with an index and column names or you can pass a dict of
equal-length array-likes.

In [19]:
data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002, 2003],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]
}

frame = pd.DataFrame(data)
frame

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


In [20]:
# Selects first 5 rows
frame.head()

Unnamed: 0,state,year,pop
0,Ohio,2000,1.5
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9


In [21]:
# Selects last 5 rows
frame.tail()

Unnamed: 0,state,year,pop
1,Ohio,2001,1.7
2,Ohio,2002,3.6
3,Nevada,2001,2.4
4,Nevada,2002,2.9
5,Nevada,2003,3.2


As with a Series index, passing a list of columns specifies their order, and
a column name that is not in the data is filled with NaN.

In [22]:
pd.DataFrame(data, columns=['year', 'state', 'pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9
5,2003,Nevada,3.2


In [23]:
frame2 = pd.DataFrame(data,
                      columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five', 'six'])
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,
six,2003,Nevada,3.2,


In [24]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

A column in a DataFrame can be retrieved as a Series either by a dict-like 
notation or by attribute

In [25]:
frame2['year']

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

In [26]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

Rows can also be retrieved by label with the `loc` attribute or by integer
position with the `iloc` attribute.

In [27]:
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

In [28]:
frame2.iloc[2]

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

When assigning lists or arrays to a column, the value's length must exactly
match the length of the DataFrame. If you assign a Series, its labels will be
realigned exactly to the DataFrame's index, inserting missing values (NaN) for
any index labels the Series does not contain.

In [29]:
frame2['debt'] = 16.5
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,16.5
two,2001,Ohio,1.7,16.5
three,2002,Ohio,3.6,16.5
four,2001,Nevada,2.4,16.5
five,2002,Nevada,2.9,16.5
six,2003,Nevada,3.2,16.5


In [30]:
frame2['debt'] = np.arange(6.)
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0.0
two,2001,Ohio,1.7,1.0
three,2002,Ohio,3.6,2.0
four,2001,Nevada,2.4,3.0
five,2002,Nevada,2.9,4.0
six,2003,Nevada,3.2,5.0


In [31]:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
frame2['debt'] = val
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
six,2003,Nevada,3.2,
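
The two assignment rules can be contrasted in a small sketch. The frame and the label `'ten'` below are made up for illustration and are not part of the chapter's data; the point is only that a mismatched list is rejected while a Series is aligned by label.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002], 'pop': [1.5, 1.7, 3.6]},
                     index=['one', 'two', 'three'])

# A list of the wrong length is rejected outright.
try:
    frame['debt'] = [1.0, 2.0]
    length_error = None
except ValueError as exc:
    length_error = exc

# A Series is aligned on labels instead: rows without a match become NaN,
# and labels that are absent from the frame ('ten') are dropped.
frame['debt'] = pd.Series([-1.2, 9.9], index=['two', 'ten'])
print(frame)
```

Only row `'two'` receives a value here; the other rows hold NaN.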


Assigning a column that doesn't exist will create a new column and `del` will
delete a column.

In [32]:
frame2['eastern'] = frame2['state'] == 'Ohio'
frame2

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,,True
two,2001,Ohio,1.7,-1.2,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-1.5,False
five,2002,Nevada,2.9,-1.7,False
six,2003,Nevada,3.2,,False


In [33]:
del frame2['eastern']
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
six,2003,Nevada,3.2,


Another common form of data is a nested dict of dicts. pandas interprets the outer keys as the columns and the inner keys as the row index:

In [34]:
pop = {
    'Nevada': {2001: 2.4, 2002: 2.9},
    'Ohio': {2000: 1.5, 2001: 1.7, 2002:3.6}
}
frame3 = pd.DataFrame(pop)
frame3

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


Similar to a NumPy array, you can transpose a DataFrame with `.T`.

In [35]:
frame3.T

Unnamed: 0,2001,2002,2000
Nevada,2.4,2.9,
Ohio,1.7,3.6,1.5


Dicts of Series are treated similarly

In [36]:
pdata = {
    'Ohio': frame3['Ohio'][:-1],
    'Nevada': frame3['Nevada'][:2]
}
pd.DataFrame(pdata)

Unnamed: 0,Ohio,Nevada
2001,1.7,2.4
2002,3.6,2.9


You can also set the `name` attributes of a DataFrame's index and columns.

In [37]:
frame3.index.name = 'year'
frame3.columns.name = 'state'
frame3

state,Nevada,Ohio
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


For a DataFrame, the `values` attribute returns a two-dimensional NumPy array.

In [38]:
frame3.values

array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])

Possible data inputs to a DataFrame constructor
- 2D ndarray
- dict of list-likes
- NumPy structured / record array
- dict of Series or dicts
- array-like of dicts or Series
- array-like of array-likes
- Another DataFrame
- NumPy MaskedArray
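
A few of the inputs listed above can be tried side by side. This is a minimal sketch with made-up labels (`r1`, `p`, `x`, and so on are placeholders), not an exhaustive tour of the constructor.

```python
import numpy as np
import pandas as pd

# 2D ndarray with explicit row and column labels
from_array = pd.DataFrame(np.arange(6).reshape((2, 3)),
                          index=['r1', 'r2'], columns=['a', 'b', 'c'])

# List of dicts: each dict is one row, and the keys become columns
from_records = pd.DataFrame([{'a': 1, 'b': 2}, {'a': 3, 'c': 4}])

# Dict of Series: the indexes are unioned and the values aligned by label
from_series = pd.DataFrame({
    'x': pd.Series([1.0, 2.0], index=['p', 'q']),
    'y': pd.Series([3.0, 4.0], index=['q', 'r']),
})

print(from_array, from_records, from_series, sep='\n\n')
```

In the last two cases, any cell with no corresponding input value is filled with NaN.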

### Index Objects

pandas Index objects are responsible for holding axis labels and other metadata.

Any array or other sequence of labels is internally converted to an Index.

In [39]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index
index

Index(['a', 'b', 'c'], dtype='object')

In [40]:
index[1:]

Index(['b', 'c'], dtype='object')

Index objects are immutable and cannot be modified directly (only replaced)

This makes it safer to share Index objects among multiple data structures

In [41]:
# index[1] = 'd'  # would raise a TypeError because an Index is immutable
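
The immutability can be checked directly. The sketch below catches the error from an in-place assignment and then builds a modified copy with `delete` and `insert`, which return new Index objects rather than changing the original.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

try:
    index[1] = 'd'          # in-place modification is not allowed
    mutated = True
except TypeError:
    mutated = False

# Methods such as delete and insert return a *new* Index instead.
new_index = index.delete(1).insert(1, 'd')
print(mutated, list(index), list(new_index))
```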

In [42]:
labels = pd.Index(np.arange(3))
labels

Int64Index([0, 1, 2], dtype='int64')

In [43]:
obj2 = pd.Series([1.5, -2.5, 0], index=labels)
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64

In [44]:
obj2.index is labels

True

In [45]:
frame3

state,Nevada,Ohio
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2001,2.4,1.7
2002,2.9,3.6
2000,,1.5


In [46]:
'Ohio' in frame3.columns

True

In [47]:
2003 in frame3.columns

False

Unlike Python sets, an Index can contain duplicate labels.

Some Index methods and properties
- append
- difference
- intersection
- union
- isin
- delete
- drop
- insert
- is_monotonic
- is_unique
- unique

## 5.2 Essential Functionality

This section walks through the fundamental mechanics of interacting with the
data in a Series or DataFrame. It is not exhaustive documentation of the
pandas library.

### Reindexing

Reindexing creates a new object with the data conformed to the new index

In [48]:
obj = pd.DataFrame([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

Unnamed: 0,0
d,4.5
b,7.2
a,-5.3
c,3.6


In [49]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

Unnamed: 0,0
a,-5.3
b,7.2
c,3.6
d,4.5
e,


The reindexing method also allows for interpolation or filling of empty values

In [50]:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3 = obj3.reindex(range(6), method='ffill')
obj3

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

With a DataFrame, reindex can alter either the row index, columns, or both.

To reindex columns, pass the new labels to the `columns` argument.

reindex function args:
- index
- columns
- method: 'ffill', 'bfill', 'nearest'
- fill_value
- limit
- tolerance
- level
- copy

### Dropping Entries from an Axis

The drop method returns a new object with the indicated value or values deleted
from an axis

In [51]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [52]:
obj.drop('c')

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [53]:
obj.drop(['d', 'c'])

a    0.0
b    1.0
e    4.0
dtype: float64

With a DataFrame, index values can be dropped from either axis.

To drop values from the columns, just pass axis=1 or axis='columns'

In [54]:
data = pd.DataFrame(np.arange(16).reshape((4,4)), 
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [55]:
data.drop('two', axis=1)

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


### Indexing, Selection, and Filtering

Series indexing works similarly to NumPy array indexing, except that you can use
the Series's index values instead of only integers.

In [56]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
dtype: float64

In [57]:
obj['b']

1.0

In [58]:
obj[1]

1.0

In [59]:
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [60]:
obj[['b', 'a', 'd']]

b    1.0
a    0.0
d    3.0
dtype: float64

In [61]:
obj[[1, 3]]

b    1.0
d    3.0
dtype: float64

In [62]:
obj[obj < 2]

a    0.0
b    1.0
dtype: float64

Slicing with labels includes the end point, unlike ordinary Python slicing, which excludes it.

In [63]:
obj['b':'c']

b    1.0
c    2.0
dtype: float64

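For DataFrames, indexing with `[]` retrieves one or more columns when given a single label or a list of labels. Slices and boolean arrays passed to `[]` select rows instead.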

In [64]:
data = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [65]:
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [66]:
data[['three', 'one']]

Unnamed: 0,three,one
Ohio,2,0
Colorado,6,4
Utah,10,8
New York,14,12


In [67]:
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [68]:
data['three'] > 5

Ohio        False
Colorado     True
Utah         True
New York     True
Name: three, dtype: bool

In [69]:
data[data['three'] > 5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [70]:
data[data < 5]

Unnamed: 0,one,two,three,four
Ohio,0.0,1.0,2.0,3.0
Colorado,4.0,,,
Utah,,,,
New York,,,,


#### Selection with loc and iloc

For DataFrame label-indexing on the rows, pandas includes `loc` and `iloc` which
enable you to select a subset of the rows and columns using labels (`loc`) or 
integers (`iloc`)

In [71]:
data.loc['Colorado']

one      4
two      5
three    6
four     7
Name: Colorado, dtype: int64

In [72]:
data.loc['Colorado', ['two', 'three']]

two      5
three    6
Name: Colorado, dtype: int64

In [73]:
data.iloc[2, 2:]

three    10
four     11
Name: Utah, dtype: int64

In [74]:
data.iloc[:, :3]

Unnamed: 0,one,two,three
Ohio,0,1,2
Colorado,4,5,6
Utah,8,9,10
New York,12,13,14


In [75]:
data.iloc[:, :3][data['three'] > 5]

Unnamed: 0,one,two,three
Colorado,4,5,6
Utah,8,9,10
New York,12,13,14


Indexing options with DataFrame
| Type                  | Description                               | 
| --------------------- | ----------------------------------------- |
| df[val]               | Select columns from df                    |
| df.loc[val]           | Select rows from df by label              |
| df.loc[:, val]        | Select columns from df by label           |
| df.loc[val1, val2]    | Select both rows and columns by label     |
| df.iloc[where]        | Select rows by integer position           |
| df.iloc[:, where]     | Select columns by int position            |
| df.iloc[w1, w2]       | Select both rows and cols by int position |
| df.at[val1, val2]     | Select single value by labels             |
| df.iat[w1, w2]        | Select single value by int position       |
| reindex               | Select either rows or columns by labels   |
| get_value, set_value  | Single value by labels (removed in 1.0)   |
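
A few rows of the table can be exercised on the same 4x4 frame used earlier. This is a sketch of the label-based and position-based forms side by side; the variable names are arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                  index=['Ohio', 'Colorado', 'Utah', 'New York'],
                  columns=['one', 'two', 'three', 'four'])

col = df['two']                                     # one column as a Series
block = df.loc[['Ohio', 'Utah'], ['one', 'four']]   # rows and columns by label
same_block = df.iloc[[0, 2], [0, 3]]                # the same cells by position
by_label = df.at['Utah', 'three']                   # single value by label
by_pos = df.iat[2, 2]                               # single value by position
print(block)
```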

### Integer Indexes

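For integer-indexed objects, the semantics differ from those of Python
list-likes.

For example, on a Series with the default integer index, `ser[-1]` raises a
`KeyError` because -1 is looked up as a label, and no such label exists.

If an axis index contains integers, selecting a single value with `[]` is
label-oriented rather than position-based. Slicing with `[]` remains
position-based, so use `loc` (labels) or `iloc` (positions) to be explicit.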

In [76]:
ser = pd.Series(np.arange(3.))
ser

0    0.0
1    1.0
2    2.0
dtype: float64

In [77]:
ser[:1]

0    0.0
dtype: float64

In [78]:
ser.loc[:1]

0    0.0
1    1.0
dtype: float64

In [79]:
ser.iloc[:1]

0    0.0
dtype: float64
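
The negative-index case can be reproduced in a short sketch. It assumes the default integer index; `iloc` is used as the explicit position-based alternative.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))     # default integer index 0, 1, 2

try:
    ser[-1]                        # -1 is looked up as a label, which is absent
    found = True
except KeyError:
    found = False

last = ser.iloc[-1]                # position-based: the last element

# iloc works the same way regardless of the kind of index.
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
last2 = ser2.iloc[-1]
print(found, last, last2)
```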

### Arithmetic and Data Alignment

When combining two pandas objects with an arithmetic operator, if the indexes
are not the same, the result's index will be the union of the indexes of the two
operands.

The element-wise operation then takes place between pairs with matching labels;
labels present in only one operand yield NaN.

In [80]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [81]:
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [82]:
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

In DataFrames, alignment is performed on both rows and columns

In [83]:
df1 = pd.DataFrame(np.arange(9.).reshape((3,3)),
                   columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4,3)),
                   columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])

df1

Unnamed: 0,b,c,d
Ohio,0.0,1.0,2.0
Texas,3.0,4.0,5.0
Colorado,6.0,7.0,8.0


In [84]:
df2

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [85]:
df1 + df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


If you add DataFrame objects with no column or row labels in common, the result
will contain all nulls.
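
A minimal sketch of that case, using two made-up frames that share neither row nor column labels:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]}, index=['x', 'y'])
df2 = pd.DataFrame({'B': [3, 4]}, index=['z', 'w'])

# Rows and columns are both unioned, but no cell has a value on both sides.
total = df1 + df2
print(total)
```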

#### Arithmetic methods with fill values

| Method                | Description       |
| --------------------- | ----------------- |
| add, radd             | Addition          |
| sub, rsub             | Subtraction       |
| div, rdiv             | Division          |
| floordiv, rfloordiv   | Floor Division    |
| mul, rmul             | Multiply          |
| pow, rpow             | Exponentiation    |

Each operation has a counterpart starting with r that flips the arguments.

These functions also allow you to choose custom fill values rather than NaN

In [86]:
df1.add(df2, fill_value=0)

Unnamed: 0,b,c,d,e
Colorado,6.0,7.0,8.0,
Ohio,3.0,1.0,6.0,5.0
Oregon,9.0,,10.0,11.0
Texas,9.0,4.0,12.0,8.0
Utah,0.0,,1.0,2.0


In [87]:
df1.rdiv(1)

Unnamed: 0,b,c,d
Ohio,inf,1.0,0.5
Texas,0.333333,0.25,0.2
Colorado,0.166667,0.142857,0.125
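
The flipped-argument behavior can be confirmed against the plain operators. A small sketch on an invented frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1., 7.).reshape((2, 3)), columns=list('bcd'))

flipped = df.rdiv(1)   # computes 1 / df
direct = 1 / df

minus = df.rsub(10)    # computes 10 - df
print(flipped.equals(direct), minus.equals(10 - df))
```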


#### Operations between DataFrame and Series

Arithmetic between DataFrame and Series objects is also defined.

In [88]:
frame = pd.DataFrame(np.arange(12.).reshape((4,3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.iloc[0]

In [89]:
frame

Unnamed: 0,b,d,e
Utah,0.0,1.0,2.0
Ohio,3.0,4.0,5.0
Texas,6.0,7.0,8.0
Oregon,9.0,10.0,11.0


In [90]:
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

In [91]:
frame - series

Unnamed: 0,b,d,e
Utah,0.0,0.0,0.0
Ohio,3.0,3.0,3.0
Texas,6.0,6.0,6.0
Oregon,9.0,9.0,9.0


By default, arithmetic between DataFrame and Series matches the index of the
Series to the DataFrame's columns, broadcasting down the rows.

If an index value is not found in either the DataFrame's columns or the Series's
index, the objects will be reindexed to form the union.

If you want to broadcast across the columns, matching on the rows, you have to
use one of the arithmetic methods listed above with `axis='index'` or `axis=0`.
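
Both behaviors can be sketched on the frame from this section. `series2` and `series3` are illustrative names: the first has a label (`f`) that the frame's columns lack, and the second is one of the frame's own columns.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Series indexed by b, e, f: matched against the columns, union taken.
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
by_columns = frame + series2

# A column of the frame is indexed by the row labels, so match on rows.
series3 = frame['d']
by_rows = frame.sub(series3, axis='index')
print(by_columns, by_rows, sep='\n\n')
```

In `by_columns`, the `d` and `f` columns are entirely NaN because each appears in only one operand.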

### Function Application and Mapping

NumPy ufuncs (element-wise array methods) also work on pandas objects.

In [92]:
frame = pd.DataFrame(np.random.randn(4,3),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

Unnamed: 0,b,d,e
Utah,1.196559,-0.159124,1.313126
Ohio,0.485028,-0.201602,0.30772
Texas,1.214056,0.140903,-0.182716
Oregon,0.044526,0.15579,-1.119833


In [93]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,1.196559,0.159124,1.313126
Ohio,0.485028,0.201602,0.30772
Texas,1.214056,0.140903,0.182716
Oregon,0.044526,0.15579,1.119833


DataFrames have `apply` and `applymap` methods, which apply a function across
the DataFrame.

`apply` applies the function to each column or row (choose which with the `axis`
parameter).

`applymap` applies the function to every value in the DataFrame.

Series objects have a `map` method, which applies a function to each element.
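
As a sketch of the difference between the column-wise and element-wise forms, the example below uses `apply` with a made-up helper, `spread`, and `Series.map` on a single column. The frame holds fixed values rather than random ones so that the results are predictable.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

def spread(x):
    """Range (max minus min) of a one-dimensional Series."""
    return x.max() - x.min()

per_column = frame.apply(spread)                 # one value per column
per_row = frame.apply(spread, axis='columns')    # one value per row

# Series.map applies a function to each element of one column.
formatted = frame['e'].map(lambda v: f'{v:.2f}')
print(per_column, per_row, formatted, sep='\n\n')
```

A function passed to `apply` receives a whole Series at a time, while the function passed to `map` receives one scalar at a time.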

### Sorting and Ranking

pandas has built in sorting.

To sort an index lexicographically, use `sort_index`

In [94]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [95]:
frame = pd.DataFrame(np.arange(8).reshape((2,4)),
                     index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])
frame

Unnamed: 0,d,a,b,c
three,0,1,2,3
one,4,5,6,7


In [96]:
frame.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [98]:
frame.sort_index(axis=1, ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


You can also sort a Series by its values using `sort_values`

In [99]:
obj = pd.Series([4, 7, -3, 2])
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

When sorting a DataFrame, you can use the data in one or more columns as the
sort keys by passing the column names to the `by` argument.

In [100]:
frame = pd.DataFrame({'b':[4, 7, -3, 2], 'a':[0, 1, 0, 1]})
frame

Unnamed: 0,b,a
0,4,0
1,7,1
2,-3,0
3,2,1


In [101]:
frame.sort_values(by='b')

Unnamed: 0,b,a
2,-3,0
3,2,1
0,4,0
1,7,1


In [102]:
frame.sort_values(by=['a', 'b'])

Unnamed: 0,b,a
2,-3,0
0,4,0
3,2,1
1,7,1


Ranking assigns ranks from one through N (breaks ties by assigning mean rank)

In [103]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj

0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

In [104]:
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

Tie-breaking methods for rank
- average
- min
- max
- first
- dense: like min, but ranks always increase by 1 between groups

In [105]:
obj.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64
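
The remaining tie-breaking methods can be compared on the same Series. This is a small sketch; the variable names are arbitrary.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

lowest = obj.rank(method='min')     # ties share the lowest rank in the group
highest = obj.rank(method='max')    # ties share the highest rank in the group
dense = obj.rank(method='dense')    # like 'min', but ranks have no gaps
descending = obj.rank(ascending=False, method='max')

print(pd.DataFrame({'value': obj, 'min': lowest, 'max': highest,
                    'dense': dense, 'desc_max': descending}))
```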

In [106]:
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a':[0, 1, 0, 1], 'c':[-2, 5, 8, -2.5]})
frame

Unnamed: 0,b,a,c
0,4.3,0,-2.0
1,7.0,1,5.0
2,-3.0,0,8.0
3,2.0,1,-2.5


In [107]:
frame.rank(axis=1)

Unnamed: 0,b,a,c
0,3.0,2.0,1.0
1,3.0,1.0,2.0
2,1.0,2.0,3.0
3,3.0,2.0,1.0


### Axis Indexes with Duplicate Labels

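Many pandas functions (like `reindex`) require that the labels be unique, but
uniqueness is not mandatory. Indexing a label with multiple entries returns a
Series, while a label with a single entry returns a scalar value.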

In [109]:
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

In [110]:
obj.index.is_unique

False

In [111]:
obj['a']

a    0
a    1
dtype: int64
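
A sketch of how duplicate labels change behavior, including the `reindex` failure mentioned above. The deduplication step at the end is one possible approach (keeping the first row per label), not the only one.

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

many = obj['a']     # duplicated label: the result is a Series
one = obj['c']      # unique label: the result is a scalar

try:
    obj.reindex(['a', 'b', 'c'])
    reindexed = True
except ValueError:
    reindexed = False   # reindex refuses an axis with duplicate labels

# One way to restore uniqueness: keep the first row for each label.
deduped = obj[~obj.index.duplicated(keep='first')]
print(deduped)
```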

## 5.3 Summarizing and Computing Descriptive Statistics

pandas objects come with a set of *reductions* or *summary statistics*, methods
that extract a single value from a Series or a Series of values from a DataFrame

In [112]:
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], 
                   [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])
df

Unnamed: 0,one,two
a,1.4,
b,7.1,-4.5
c,,
d,0.75,-1.3


In [113]:
df.sum()

one    9.25
two   -5.80
dtype: float64

In [114]:
df.sum(axis='columns')

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [115]:
df.mean(axis=1, skipna=False)

a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64

Reduction methods parameters
- axis
- skipna
- level: for multiindex

In [117]:
df.idxmax()

one    b
two    d
dtype: object

Other methods are *accumulations* like cumsum

In [118]:
df.cumsum()

Unnamed: 0,one,two
a,1.4,
b,8.5,-4.5
c,,
d,9.25,-5.8


In [119]:
df.describe()

Unnamed: 0,one,two
count,3.0,2.0
mean,3.083333,-2.9
std,3.493685,2.262742
min,0.75,-4.5
25%,1.075,-3.7
50%,1.4,-2.9
75%,4.25,-2.1
max,7.1,-1.3


Descriptive and summary statistics
| Method            | Description                               |
| ----------------- | ----------------------------------------- |
| count             | Number of non-NA values                   |
| describe          | Compute set of summary statistics         |
| min, max          | min and max values                        |
| argmin, argmax    | Integer positions of the min/max values   |
| idxmin, idxmax    | Index labels of the min/max values        |
| quantile          | Compute quantile from 0 to 1              |
| sum               | Sum of values                             |
| mean              | Mean of values                            |
| median            | Arithmetic median of values               |
| mad               | Mean-Absolute-Deviation from mean         |
| prod              | Product of values                         |
| var               | Sample variance of values                 |
| std               | Sample standard deviation of values       |
| skew              | Sample skewness of values                 |
| kurt              | Sample kurtosis of values                 |
| cumsum            | Cumulative sum of values                  |
| cummin, cummax    | Cumulative min or max of values           |
| cumprod           | Cumulative product of values              |
| diff              | Compute first arithmetic difference       |
| pct_change        | Compute percent changes                   |
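
Some of the methods in the table that are not shown above can be tried on a small Series. The values are made up so that the results are easy to verify by hand.

```python
import pandas as pd

s = pd.Series([1.0, 3.0, 6.0, 10.0], index=list('abcd'))

step = s.diff()            # difference from the previous element
growth = s.pct_change()    # relative change from the previous element
median = s.quantile(0.5)   # 50% quantile, linearly interpolated by default
pos = s.argmax()           # integer position of the maximum
label = s.idxmax()         # index label of the maximum

print(step, growth, median, pos, label, sep='\n')
```

The first element of `diff` and `pct_change` is NaN because it has no predecessor.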

### Correlation and Covariance

Correlation and Covariance are computed from pairs of values.

In [120]:
import pandas_datareader.data as web

In [121]:
all_data = {ticker: web.get_data_yahoo(ticker) 
            for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']}
price = pd.DataFrame({ticker: data['Adj Close']
                      for ticker, data in all_data.items()})
volume = pd.DataFrame({ticker: data['Volume']
                      for ticker, data in all_data.items()})

In [122]:
returns = price.pct_change()
returns.tail()

Unnamed: 0_level_0,AAPL,IBM,MSFT,GOOG
Date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2022-05-09,-0.033189,-0.011616,-0.036945,-0.022272
2022-05-10,0.016112,-0.039497,0.018596,0.013269
2022-05-11,-0.051841,0.012545,-0.03321,-0.005441
2022-05-12,-0.026894,0.016444,-0.019958,-0.00702
2022-05-13,0.029822,-0.001881,0.020443,0.036651


You can then use the `corr` and `cov` methods to compute the correlation and 
covariance of two of these time series.

In [123]:
returns['MSFT'].corr(returns['IBM'])

0.4736944686944108

In [124]:
returns['MSFT'].cov(returns['IBM'])

0.00014652929677818388

You can also compute these values for each column pair

In [125]:
returns.corr()

Unnamed: 0,AAPL,IBM,MSFT,GOOG
AAPL,1.0,0.426369,0.751613,0.676421
IBM,0.426369,1.0,0.473694,0.450091
MSFT,0.751613,0.473694,1.0,0.785623
GOOG,0.676421,0.450091,0.785623,1.0


In [126]:
returns.cov()

Unnamed: 0,AAPL,IBM,MSFT,GOOG
AAPL,0.000395,0.000143,0.000273,0.000242
IBM,0.000143,0.000287,0.000147,0.000137
MSFT,0.000273,0.000147,0.000334,0.000258
GOOG,0.000242,0.000137,0.000258,0.000324


In [127]:
returns.corrwith(returns['IBM'])

AAPL    0.426369
IBM     1.000000
MSFT    0.473694
GOOG    0.450091
dtype: float64

You can pass axis=1 to compute these values row by row
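
Because the stock data requires a network download, the same methods can be sketched offline on synthetic data. The columns `X`, `Y`, and `Z` below are invented stand-ins for return series, generated with a fixed seed; the check at the end relies on the standard definition of Pearson correlation as covariance divided by the product of the standard deviations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
returns = pd.DataFrame({
    'X': base,
    'Y': 0.5 * base + rng.normal(scale=0.5, size=200),   # related to X
    'Z': rng.normal(size=200),                           # independent noise
})

corr_xy = returns['X'].corr(returns['Y'])
cov_xy = returns['X'].cov(returns['Y'])

# Pearson correlation is covariance divided by both standard deviations.
manual = cov_xy / (returns['X'].std() * returns['Y'].std())

matrix = returns.corr()                  # all column pairs at once
vs_x = returns.corrwith(returns['X'])    # each column against one Series
print(round(corr_xy, 4), round(manual, 4))
```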

### Unique Values, Value Counts, and Membership

In [128]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
obj

0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object

In [129]:
obj.unique()

array(['c', 'a', 'd', 'b'], dtype=object)

In [131]:
obj.value_counts()

c    3
a    3
b    2
d    1
dtype: int64

In [132]:
obj.isin(['b', 'c'])

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [135]:
pd.Index(obj.unique()).get_indexer(obj)

array([0, 1, 2, 1, 1, 3, 3, 0, 0])

Mentioned methods:
- isin
- get_indexer
- unique
- value_counts
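
These methods are often combined, for example using the boolean mask from `isin` as a filter. A short sketch; the `uniques` Index and the values passed to `get_indexer` are made up for illustration.

```python
import pandas as pd

obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

subset = obj[obj.isin(['b', 'c'])]          # keep only rows whose value is b or c
counts = obj.value_counts().sort_index()    # counts ordered by value, not frequency

# get_indexer maps each value to its position in an Index of distinct values.
uniques = pd.Index(['c', 'b', 'a'])
codes = uniques.get_indexer(pd.Series(['c', 'a', 'b', 'b', 'c', 'a']))
print(subset.tolist(), counts.to_dict(), codes.tolist())
```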