# Pandas Basics

## Importing the pandas library

In [1]:
import pandas as pd
pd.__version__

'0.23.4'

We start with the three fundamental pandas data structures: Series, DataFrame, and Index.

## The Pandas Series object

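A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or an array.

Besides its values, a Series also holds a sequence of indexes. The two parts are available through the values and index attributes.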

In [2]:
data = pd.Series([0.25, 0.5, 0.75, 1])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

In [3]:
data.values

array([0.25, 0.5 , 0.75, 1.  ])

In [4]:
data.index

RangeIndex(start=0, stop=4, step=1)

In [5]:
print(data[1],'\n')
print(data[0:3])

0.5 

0    0.25
1    0.50
2    0.75
dtype: float64


### Series as a NumPy array

A NumPy array uses an implicit integer index to access its values, whereas a Series has an **explicitly defined** index associated with its values.

In [6]:
data = pd.Series( [0.25, 0.5, 0.75, 1] ,
                  index = ['a','b','c','d'])

In [7]:
data['b']

0.5

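### Series as a dictionary

As shown above, the result looks a lot like a Python dictionary: a mapping from keys to values.

However, a Series carries type information (all of its values share one dtype), which makes it more efficient than a Python dictionary for certain operations.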

In [8]:
people_count = {'China':160000,
                'Taiwan':2300,
                'Japan':12000,
                'French':9000,
                'Russia':23000}

In [9]:
population = pd.Series(people_count)
population

China     160000
Taiwan      2300
Japan      12000
French      9000
Russia     23000
dtype: int64

In [10]:
population['Taiwan']

2300

Unlike a Python dictionary, however, a Series also supports array-style operations such as slicing. Note that a slice by label includes its end label.

In [11]:
population['China':'French']

China     160000
Taiwan      2300
Japan      12000
French      9000
dtype: int64
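
The contrast with a plain dictionary can be sketched as follows. This is a minimal illustration that reuses the figures above. The names `as_dict`, `as_series`, `doubled`, and `subset` are hypothetical and do not come from the original notebook.

```python
import pandas as pd

as_dict = {'China': 160000, 'Taiwan': 2300, 'Japan': 12000,
           'French': 9000, 'Russia': 23000}
as_series = pd.Series(as_dict)

# Dictionary-style lookup works on both objects.
assert as_dict['Taiwan'] == as_series['Taiwan']

# Array-style operations are only available on the Series.
doubled = as_series * 2               # element-wise arithmetic
subset = as_series['China':'French']  # label slice, end label included

print(doubled['Japan'])
print(list(subset.index))
```

The same arithmetic on the dictionary would need an explicit loop or comprehension.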

### Creating a Series

The basic form passes the data first and the index second: pd.Series(data, index=index).

In [12]:
pd.Series([2,3,4,6,7], index=[1,2,3,4,5])

1    2
2    3
3    4
4    6
5    7
dtype: int64

If the data is a scalar, it is repeated to fill the given index, so the index determines the length of the Series.

In [13]:
pd.Series(3,index=[12,23,45])

12    3
23    3
45    3
dtype: int64

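A Series can also be created from a dictionary. If an index is passed explicitly, it takes precedence over the dictionary keys: only the listed keys are kept, in the order given.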

In [14]:
pd.Series( {'a':2, 'b':4, 'c':6} )

a    2
b    4
c    6
dtype: int64

In [15]:
pd.Series( {'a':2, 'b':4, 'c':6} , index = ['a','c'])

a    2
c    6
dtype: int64
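
One case the cells above do not show is an index label that is missing from the dictionary. The sketch below assumes a hypothetical label `'z'` and a hypothetical variable `source`. It illustrates that pandas fills the missing entry with NaN and therefore stores the values as floats.

```python
import pandas as pd

source = {'a': 2, 'b': 4, 'c': 6}

# 'z' is not a key of the dictionary, so its value is missing (NaN).
# The order of the result follows the index, not the dictionary.
s = pd.Series(source, index=['c', 'a', 'z'])
print(s)
```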

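## The Pandas DataFrame object

If a Series is a one-dimensional array with flexible indexes, a DataFrame is its two-dimensional counterpart, with both flexible row indexes and flexible column names.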

In [16]:
people_count = {'China':160000,
                'Taiwan':2300,
                'Japan':12000,
                'French':9000,
                'Russia':23000}
population = pd.Series(people_count)
population

China     160000
Taiwan      2300
Japan      12000
French      9000
Russia     23000
dtype: int64

In [17]:
death_raio = {'China':4,
              'Taiwan':2.1,
              'Japan':3,
              'French':2,
              'Russia':2.7}
death = pd.Series(death_raio)
death

China     4.0
Taiwan    2.1
Japan     3.0
French    2.0
Russia    2.7
dtype: float64

In [18]:
states = pd.DataFrame({'population':population, 'death':death})
states

        population  death
China       160000    4.0
Taiwan        2300    2.1
Japan        12000    3.0
French        9000    2.0
Russia       23000    2.7


In [19]:
states.columns

Index(['population', 'death'], dtype='object')

In [20]:
states.index

Index(['China', 'Taiwan', 'Japan', 'French', 'Russia'], dtype='object')

A DataFrame can also be treated as a dictionary of Series: selecting a column by name returns that column as a Series.

In [21]:
states['population']

China     160000
Taiwan      2300
Japan      12000
French      9000
Russia     23000
Name: population, dtype: int64

### Creating a DataFrame

### From a single Series

In [22]:
pd.DataFrame( population, columns=['population'])

        population
China       160000
Taiwan        2300
Japan        12000
French        9000
Russia       23000


### From a list of dictionaries

In [23]:
data = [{'a':i, 'b':i**2} for i in range(4) ]
pd.DataFrame(data)

   a  b
0  0  0
1  1  1
2  2  4
3  3  9


In [24]:
pd.DataFrame( [{'a':1,'b':2},{'b':3,'c':4}])

     a  b    c
0  1.0  2  NaN
1  NaN  3  4.0
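
When some dictionaries lack a key, pandas fills the gap with NaN, which is why columns a and c are shown as floats. A minimal sketch of inspecting and filling those gaps follows. The variable names `df` and `filled` and the fill value 0 are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

print(df.isna())   # True where a key was missing in that row
print(df.dtypes)   # columns that contain NaN are stored as float64

# Replace the missing entries with a chosen default value.
filled = df.fillna(0)
print(filled)
```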


### From a dictionary of Series objects

In [25]:
population = pd.Series(people_count)
death = pd.Series(death_raio)

pd.DataFrame( {'population':population,'death':death})

        population  death
China       160000    4.0
Taiwan        2300    2.1
Japan        12000    3.0
French        9000    2.0
Russia       23000    2.7


### From a two-dimensional NumPy array

In [26]:
import numpy as np

pd.DataFrame(np.random.rand(3,2),
             columns=['foo','bar'],
             index=['a','b','c'])

        foo       bar
a  0.806188  0.464061
b  0.263528  0.493774
c  0.594618  0.162159


### From a NumPy structured array

In [27]:
A = np.zeros(3, dtype=[('A','i8'),('B','f8')])     # each dtype entry is a (field name, format) pair
A

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

In [28]:
print('A: ',A['A'])
print('B: ',A['B'])

A:  [0 0 0]
B:  [0. 0. 0.]


In [29]:
pd.DataFrame(A)

   A    B
0  0  0.0
1  0  0.0
2  0  0.0


## The Pandas Index object

An Index can be thought of as an immutable array.

In [30]:
ind = pd.Index([2,3,4,7,11])
ind

Int64Index([2, 3, 4, 7, 11], dtype='int64')

In [31]:
ind[1]

3

In [32]:
# an Index supports the usual indexing and slicing operations
ind[::2]

Int64Index([2, 4, 11], dtype='int64')

In [33]:
# a pandas Index cannot be modified in place
ind[2] = 2

TypeError: Index does not support mutable operations
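
The array-like but immutable nature of an Index can be sketched as below. The name `new_ind` and the route through a plain list are illustrative assumptions about one way to obtain a changed copy, not something the original notebook prescribes.

```python
import pandas as pd

ind = pd.Index([2, 3, 4, 7, 11])

# An Index exposes many of the attributes of a NumPy array.
print(ind.size, ind.shape, ind.ndim, ind.dtype)

# In-place modification is rejected with a TypeError.
try:
    ind[2] = 2
except TypeError as err:
    print('TypeError:', err)

# To get different contents, build a new Index instead.
values = ind.tolist()
values[2] = 2
new_ind = pd.Index(values)
print(new_ind)
```

Immutability makes it safe to share one Index between several Series and DataFrames.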

### Index as an ordered set

In [34]:
indA = pd.Index([1,12,3,4,5])
indB = pd.Index([4,5,13,17,8])

In [35]:
# intersection
indA & indB

Int64Index([4, 5], dtype='int64')

In [36]:
# union
indA | indB

Int64Index([1, 3, 4, 5, 8, 12, 13, 17], dtype='int64')

In [37]:
# symmetric difference (elements that appear in exactly one of the two indexes)
indA ^ indB

Int64Index([1, 3, 8, 12, 13, 17], dtype='int64')
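
The same set operations are also available as explicit methods, sketched below. The outputs above come from pandas 0.23.4. Recent pandas releases are understood to have moved away from the operator forms for set operations on an Index; treat that as an assumption and check the release notes for the version in use. The names `both`, `either`, `only_one`, and `only_a` are hypothetical.

```python
import pandas as pd

indA = pd.Index([1, 12, 3, 4, 5])
indB = pd.Index([4, 5, 13, 17, 8])

both = indA.intersection(indB)              # in indA and in indB
either = indA.union(indB)                   # in indA or in indB
only_one = indA.symmetric_difference(indB)  # in exactly one of them
only_a = indA.difference(indB)              # in indA but not in indB

print(both, either, only_one, only_a, sep='\n')
```

Note that `difference` is the plain set difference, while the `^` operator shown above corresponds to `symmetric_difference`.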

## Data indexing and selection

### Series as a dictionary

In [38]:
data = pd.Series([0.25,0.5,0.75,1],
                 index=['a','b','c','d'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [39]:
# ordinary item selection by label
data['a']

0.25

In [40]:
# check whether a label exists in the Series (this tests the index, not the values)
'c' in data

True

In [41]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [42]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [43]:
# Like a dictionary, a Series can be extended by assignment:
# assigning to a label that does not exist yet adds a new entry
data['e'] = 1.5
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.50
dtype: float64

### Series as a one-dimensional array

In [44]:
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [45]:
data[0:2]

a    0.25
b    0.50
dtype: float64

In [46]:
data[ data>0.75 ]

d    1.0
e    1.5
dtype: float64

In [47]:
# fancy indexing
data[ ['a','e'] ]

a    0.25
e    1.50
dtype: float64

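**A source of confusion**

When a Series has an explicit integer index, selecting a single item (data[3]) uses the labels we defined, whereas slicing (data[0:3]) uses the implicit Python-style positional index.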

In [48]:
data = pd.Series(['a','b','c','d'], index=[1,3,5,7])

In [49]:
data

1    a
3    b
5    c
7    d
dtype: object

In [50]:
data[3]

'b'

In [51]:
data[0:3]

1    a
3    b
5    c
dtype: object

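To avoid this confusion, pandas provides the indexer attributes loc and iloc, which state explicitly which indexing scheme is meant.

#### loc

loc always refers to the explicit index, that is, the labels we defined.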

In [52]:
data.loc[3]

'b'

In [53]:
data.loc[1:3]

1    a
3    b
dtype: object

#### iloc

iloc always refers to the implicit Python-style positional index.

In [54]:
data.iloc[3]

'd'

In [55]:
data.iloc[1:3]

3    b
5    c
dtype: object
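
The two slices above also differ in how they treat the stop value, which the following sketch makes explicit. The names `by_label` and `by_position` are hypothetical.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c', 'd'], index=[1, 3, 5, 7])

by_label = data.loc[1:3]      # labels 1 through 3, stop label included
by_position = data.iloc[1:3]  # positions 1 and 2, stop position excluded

print(list(by_label.index))
print(list(by_position.index))
```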

### DataFrame as a dictionary

In [56]:
area = pd.Series({'Taipei':42399, 'Taoyuan':71221, 'Taicheng':45553, 'Chiayi':12232, 'Tainan':29932})
pop  = pd.Series({'Taipei':83382, 'Taoyuan':42321, 'Taicheng':65543, 'Chiayi':21121, 'Tainan':53321})

data = pd.DataFrame({'area':area, 'pop':pop})
data

           area    pop
Taipei    42399  83382
Taoyuan   71221  42321
Taicheng  45553  65543
Chiayi    12232  21121
Tainan    29932  53321


In [57]:
data['area']

Taipei      42399
Taoyuan     71221
Taicheng    45553
Chiayi      12232
Tainan      29932
Name: area, dtype: int64

In [58]:
data.area

Taipei      42399
Taoyuan     71221
Taicheng    45553
Chiayi      12232
Tainan      29932
Name: area, dtype: int64
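
Attribute-style access is convenient, but it does not work for every column. The sketch below uses a shortened, illustrative copy of the table. It shows that the column named pop cannot be reached this way, because the name is already taken by the DataFrame.pop method.

```python
import pandas as pd

data = pd.DataFrame({'area': [42399, 71221], 'pop': [83382, 42321]},
                    index=['Taipei', 'Taoyuan'])

# 'area' does not clash with any DataFrame attribute, so both forms agree.
print(data.area.equals(data['area']))

# 'pop' clashes with the DataFrame.pop method, so attribute access
# returns the method rather than the column.
print(callable(data.pop))
print(type(data['pop']).__name__)
```

Dictionary-style access with square brackets works for every column name, so it is the safer default, especially when assigning.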

In [59]:
# like a dictionary, a DataFrame gets a new column by assignment
data['density'] = data['pop']/data['area']
data

           area    pop   density
Taipei    42399  83382  1.966603
Taoyuan   71221  42321  0.594221
Taicheng  45553  65543  1.438829
Chiayi    12232  21121  1.726700
Tainan    29932  53321  1.781405


In [60]:
# view the underlying data as a two-dimensional array
data.values

array([[4.23990000e+04, 8.33820000e+04, 1.96660299e+00],
       [7.12210000e+04, 4.23210000e+04, 5.94220806e-01],
       [4.55530000e+04, 6.55430000e+04, 1.43882950e+00],
       [1.22320000e+04, 2.11210000e+04, 1.72670046e+00],
       [2.99320000e+04, 5.33210000e+04, 1.78140452e+00]])
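
Every entry in this array is a float, including the integer area and pop figures. A NumPy array holds a single dtype, so columns of mixed types are converted to a common one. A minimal sketch on a shortened, illustrative copy of the table:

```python
import pandas as pd

data = pd.DataFrame({'area': [42399, 71221], 'pop': [83382, 42321]},
                    index=['Taipei', 'Taoyuan'])
data['density'] = data['pop'] / data['area']

print(data.dtypes)        # area and pop are integers, density is float
print(data.values.dtype)  # the combined array uses a single dtype
```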

In [61]:
# transpose the DataFrame (swap rows and columns)
data.T

           Taipei   Taoyuan  Taicheng    Chiayi    Tainan
area      42399.0   71221.0   45553.0   12232.0   29932.0
pop       83382.0   42321.0   65543.0   21121.0   53321.0
density  1.966603  0.594221  1.438829  1.726700  1.781405


In [62]:
data

           area    pop   density
Taipei    42399  83382  1.966603
Taoyuan   71221  42321  0.594221
Taicheng  45553  65543  1.438829
Chiayi    12232  21121  1.726700
Tainan    29932  53321  1.781405


#### DataFrames also support iloc and loc

In [63]:
data.iloc[:2,:2]

          area    pop
Taipei   42399  83382
Taoyuan  71221  42321


In [64]:
data.loc['Taipei':'Chiayi']

           area    pop   density
Taipei    42399  83382  1.966603
Taoyuan   71221  42321  0.594221
Taicheng  45553  65543  1.438829
Chiayi    12232  21121  1.726700
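
The loc indexer can also combine a boolean mask on the rows with a list of column labels. The following sketch rebuilds the table from the figures above. The threshold 1.5 and the name `dense` are illustrative assumptions.

```python
import pandas as pd

area = pd.Series({'Taipei': 42399, 'Taoyuan': 71221, 'Taicheng': 45553,
                  'Chiayi': 12232, 'Tainan': 29932})
pop = pd.Series({'Taipei': 83382, 'Taoyuan': 42321, 'Taicheng': 65543,
                 'Chiayi': 21121, 'Tainan': 53321})
data = pd.DataFrame({'area': area, 'pop': pop})
data['density'] = data['pop'] / data['area']

# Rows are chosen by a boolean mask, columns by a list of labels.
dense = data.loc[data['density'] > 1.5, ['pop', 'density']]
print(dense)
```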


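#### The ix indexer accepts both styles (deprecated)

ix mixes positional and label-based indexing in a single call. As the warning in the output below says, it is deprecated in favor of loc and iloc, and it was removed entirely in pandas 1.0.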

In [65]:
data.ix[:3, :'pop']

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated


           area    pop
Taipei    42399  83382
Taoyuan   71221  42321
Taicheng  45553  65543
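
The same selection can be written without ix. The sketch below shows two possible translations, one through loc and one through iloc. The names `via_loc`, `via_iloc`, and `stop` are hypothetical, and either form is only one of several reasonable ways to express it.

```python
import pandas as pd

data = pd.DataFrame(
    {'area': [42399, 71221, 45553, 12232, 29932],
     'pop': [83382, 42321, 65543, 21121, 53321]},
    index=['Taipei', 'Taoyuan', 'Taicheng', 'Chiayi', 'Tainan'])

# Label-based: translate the row positions into row labels first.
via_loc = data.loc[data.index[:3], :'pop']

# Position-based: translate the column label into a column position.
stop = data.columns.get_loc('pop') + 1   # +1 because iloc excludes the stop
via_iloc = data.iloc[:3, :stop]

print(via_loc.equals(via_iloc))
```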
