## 1 Introduction to pandas Data Structures

NumPy is best suited for working with homogeneous numerical array data;
pandas is designed for working with tabular or heterogeneous data.

In [71]:
import numpy as np

In [1]:
import pandas as pd

In [3]:
from pandas import Series, DataFrame

### 1.1 Series

A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its ___index___.

In [4]:
obj = pd.Series([4, 6, -1, 2])

In [5]:
obj

0    4
1    6
2   -1
3    2
dtype: int64

In [8]:
obj.values  # array representation of the Series

array([ 4,  6, -1,  2])

In [9]:
obj.index

RangeIndex(start=0, stop=4, step=1)

Create a Series with index labels

In [22]:
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [23]:
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [24]:
obj2.index

Index([u'd', u'b', u'a', u'c'], dtype='object')

In [25]:
obj2['a']

-5

In [26]:
obj2[['c', 'a', 'd']]

c    3
a   -5
d    4
dtype: int64

In [27]:
obj2[obj2 > 0]

d    4
b    7
c    3
dtype: int64

In [28]:
obj2 * 2

d     8
b    14
a   -10
c     6
dtype: int64

You can also think of a Series as a fixed-length, ordered dict: a mapping of index labels to data values.

In [30]:
'b' in obj2

True
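
The dict analogy can be pushed a little further. The sketch below is illustrative rather than part of the original session; it rebuilds `obj2` and uses the `get` and `to_dict` methods of Series to show that membership tests look at the labels, not at the values.

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Membership tests check the index labels, not the stored values
has_b = 'b' in obj2   # 'b' is a label
has_7 = 7 in obj2     # 7 is a value, not a label

# get() returns a default instead of raising KeyError for a missing label
missing = obj2.get('z', 0)

# to_dict() converts the Series back to a plain Python dict
as_dict = obj2.to_dict()

print(has_b, has_7, missing)
print(as_dict)
```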

Create a Series from a Python dict

In [32]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [33]:
obj3 = pd.Series(sdata)

In [34]:
obj3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

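When only a dict is passed, the index of the resulting Series is built from the dict's keys. In the pandas version used here the keys appear in sorted order (newer pandas versions keep the dict's insertion order instead). You can override this by passing the keys in the order you want them to appear. Labels with no matching dict key get NaN, and dict entries that are not listed are dropped: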

In [35]:
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

In [36]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

NaN (not a number) marks missing or NA values. Use the ___isnull___ and ___notnull___ functions, which are also available as Series methods, to detect missing data:

In [37]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [38]:
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
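
Only `isnull` is demonstrated above. As a small sketch of the complementary `notnull`, the following rebuilds a Series like `obj4` (the values are copied from the output above) and uses the boolean mask to keep only the entries that are present.

```python
import numpy as np
import pandas as pd

obj4 = pd.Series([np.nan, 35000.0, 16000.0, 71000.0],
                 index=['California', 'Ohio', 'Oregon', 'Texas'])

mask = obj4.notnull()   # True where a value is present
present = obj4[mask]    # boolean indexing keeps only the non-missing entries

print(mask)
print(present)
```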

A Series automatically aligns by index label in arithmetic operations, much like a join operation in a database:

In [40]:
obj3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [41]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In [42]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
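
Labels found in only one operand produce NaN, as `Utah` does above. If that is not wanted, one option is the `add` method with `fill_value`, sketched here on the same data. Note that a label whose value is missing on both sides (here `California`) still yields NaN.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

# The + operator gives NaN wherever a label is missing from either side
plain = obj3 + obj4

# add() with fill_value treats a value missing on one side as 0
filled = obj3.add(obj4, fill_value=0)

print(plain)
print(filled)
```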

Both the Series object itself and its index have a ___name___ attribute.

In [43]:
obj4.name = 'population'

In [48]:
obj4.index.name = 'state'

In [49]:
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

A Series's index can be altered in-place by assignment

In [51]:
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']

In [52]:
obj

Bob      4
Steve    6
Jeff    -1
Ryan     2
dtype: int64

### 1.2 DataFrame

Construct a DataFrame from a dict of equal-length lists or NumPy arrays

In [54]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
       'year': [2000, 2001, 2002, 2001, 2002, 2003],
       'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

In [55]:
frame = pd.DataFrame(data)

In [56]:
frame

   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
5  3.2  Nevada  2003


In [58]:
frame.head(3)

   pop state  year
0  1.5  Ohio  2000
1  1.7  Ohio  2001
2  3.6  Ohio  2002


Arrange the columns by specifying a sequence of columns

In [59]:
pd.DataFrame(data, columns=['year','state', 'pop'])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2


If you pass a column that isn't contained in the dict, it appears in the result with missing values:

In [61]:
frame2 = pd.DataFrame(data, columns=['year','state', 'pop', 'debt'],
                     index=['one', 'two', 'three', 'four','five', 'six'])

In [62]:
frame2

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN


In [63]:
frame2.columns

Index([u'year', u'state', u'pop', u'debt'], dtype='object')

Retrieve a column

In [64]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

In [65]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64

Retrieve a row

In [67]:
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

Assign columns

In [72]:
frame2['debt'] = np.arange(6.)

In [73]:
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0
six    2003  Nevada  3.2   5.0


When assigning lists or arrays to a column, the value's length must match the length of the DataFrame. If you assign a Series, its labels are realigned exactly to the DataFrame's index, inserting missing values in any holes:

In [74]:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

In [90]:
frame2['debt'] = val

In [91]:
frame2

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN
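
The length rule is easiest to see when it fails. This is a minimal sketch on a smaller, made-up frame (not the `frame2` from the session): a list of the wrong length raises `ValueError`, while a shorter Series is accepted because it is aligned by label.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                      'pop': [1.5, 1.7, 2.4]},
                     index=['one', 'two', 'three'])

error = None
try:
    frame['debt'] = [1.0, 2.0]   # two values for three rows
except ValueError as exc:
    error = str(exc)
    print('ValueError:', error)

# A Series is aligned on the index, so it may cover only some of the rows
frame['debt'] = pd.Series([-1.2], index=['two'])
print(frame)
```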


Assigning to a column that doesn't exist creates a new column:

In [92]:
frame2['eastern'] = frame2.state == 'Ohio'  # new columns cannot be created with the frame2.eastern attribute syntax

In [93]:
frame2

       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
six    2003  Nevada  3.2   NaN    False
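
To make the remark about attribute syntax concrete, here is a small sketch on a made-up two-row frame. Attribute assignment only sets a plain Python attribute on the object. Recent pandas versions also emit a `UserWarning` in that case, which is silenced below.

```python
import warnings
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

# Attribute assignment does not add a column to the DataFrame
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    frame.eastern = frame['state'] == 'Ohio'
created_by_attribute = 'eastern' in frame.columns

# Bracket assignment is what actually creates the column
frame['eastern'] = frame['state'] == 'Ohio'
created_by_brackets = 'eastern' in frame.columns

print(created_by_attribute, created_by_brackets)
```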


Remove a column with the ___del___ keyword, as with a dict:

In [96]:
del frame2['eastern']

In [97]:
frame2.columns

Index([u'year', u'state', u'pop', u'debt'], dtype='object')
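
`del` modifies the DataFrame in place. When the original should stay intact, the `drop` method is an alternative, sketched here on a made-up frame. The `columns=` keyword assumes a reasonably recent pandas (0.21 or later).

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'],
                      'pop': [1.5, 2.4],
                      'eastern': [True, False]})

# drop() returns a new DataFrame and leaves the original untouched
trimmed = frame.drop(columns=['eastern'])
still_in_original = 'eastern' in frame.columns

# del removes the column from the original object itself
del frame['eastern']
in_original_after_del = 'eastern' in frame.columns

print(still_in_original, in_original_after_del)
```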

If a nested dict is passed to the DataFrame, pandas will interpret the outer dict keys as the columns and the inner keys as the row indices

In [100]:
pop = {'Nevada': {2001:2.4, 2002: 2.9},
      'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [101]:
frame3 = pd.DataFrame(pop)

In [102]:
frame3

      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6


Transpose the DataFrame

In [103]:
frame3.T

        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6


Dicts of Series are treated in much the same way:

In [105]:
pdata = {'Ohio': frame3['Ohio'][:-1],
        'Nevada': frame3['Nevada'][:2]}

In [106]:
pd.DataFrame(pdata)

      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7


<img src="img/5_1_1.png">

If a DataFrame's index and columns have their ___name___ attributes set, these are displayed as well:

In [107]:
frame3.index.name = 'year'; frame3.columns.name = 'state'

In [108]:
frame3

state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6


The ___values___ attribute returns the data contained in the DataFrame as a two-dimensional ndarray:

In [109]:
frame3.values

array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])

If the columns have different dtypes, the dtype of the values array is chosen to accommodate all of the columns:

In [111]:
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7],
       [2003, 'Nevada', 3.2, nan]], dtype=object)

In [110]:
frame2.values[0].dtype

dtype('O')
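
The dtype selection can be checked on two small, made-up frames. An int column next to a float column is upcast to `float64`, whereas a numeric column next to a string column falls back to `object`.

```python
import numpy as np
import pandas as pd

numeric = pd.DataFrame({'a': [1, 2], 'b': [1.5, 2.5]})   # int64 and float64 columns
mixed = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})     # int64 and string columns

numeric_dtype = numeric.values.dtype   # common numeric type
mixed_dtype = mixed.values.dtype       # generic Python-object type

print(numeric_dtype, mixed_dtype)
```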

### 1.3 Index Objects

pandas's Index objects hold the axis labels and other metadata, such as the axis name. Any array or other sequence of labels used to construct a Series or DataFrame is internally converted to an Index:

In [112]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])

In [113]:
index = obj.index

In [114]:
index

Index([u'a', u'b', u'c'], dtype='object')

In [115]:
index[1:]

Index([u'b', u'c'], dtype='object')

Index objects are immutable

In [116]:
index[1] = 'd'

TypeError: Index does not support mutable operations
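
Because an Index cannot be modified, changing a label means building a new Index. The sketch below catches the `TypeError` and then derives a modified copy with the `delete` and `insert` methods of Index. This is one possible approach, not the only one.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

raised = False
try:
    index[1] = 'd'   # item assignment is not supported
except TypeError as exc:
    raised = True
    print('TypeError:', exc)

# delete() and insert() both return new Index objects
new_index = index.delete(1).insert(1, 'd')

print(list(index))
print(list(new_index))
```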

An Index behaves like a fixed-size set:

In [123]:
frame3

state,Nevada,Ohio
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [125]:
frame3.columns

Index([u'Nevada', u'Ohio'], dtype='object', name=u'state')

In [126]:
'Ohio' in frame3.columns

True

Unlike a Python set, however, a pandas Index can contain duplicate labels. Selections with duplicate labels will select all occurrences of that label.

In [168]:
dup_labels = pd.Index(['foo', 'foo', 'bar', 'bar'])

In [169]:
dup_labels

Index([u'foo', u'foo', u'bar', u'bar'], dtype='object')

In [170]:
dup_labels2 = pd.Index(['bar', 'a'])

In [171]:
dup_labels.intersection(dup_labels2)

Index([u'bar', u'bar'], dtype='object')
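
The selection behaviour itself is not shown in the session above. As a minimal sketch on a made-up Series, selecting a duplicated label returns every matching entry as a Series, while a unique label returns a scalar.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3], index=['foo', 'foo', 'bar', 'baz'])

unique = s.index.is_unique   # False, because 'foo' appears twice

foo = s['foo']   # duplicated label: a Series with both entries
baz = s['baz']   # unique label: a single scalar value

print(unique)
print(foo)
print(baz)
```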

<img src="img/5_1_2.png">

## 2 Essential Functionality