# Chapter 5: Using pandas

## pandas: designed for working with tabular and heterogeneous data sets.

### pandas data structures: Series & DataFrame

### Series: a one-dimensional array-like object holding a sequence of values together with an array of data labels, called its index.

In [1]:
import pandas as pd

In [2]:
from pandas import Series, DataFrame

In [3]:
obj= pd.Series([4,5,-1,2,0])
obj

0    4
1    5
2   -1
3    2
4    0
dtype: int64

### As the output shows, a Series displays its index on the left and its values on the right.

### values, index: attributes that return the values and the index of a Series.

In [4]:
obj.values

array([ 4,  5, -1,  2,  0])

In [5]:
obj.index

RangeIndex(start=0, stop=5, step=1)

### Instead of the default integer index, you can supply your own index labels.

In [6]:
obj2=pd.Series([5,6,-3,7,2], index=['d','b','a','c','e'])
obj2

d    5
b    6
a   -3
c    7
e    2
dtype: int64

In [7]:
obj2.index

Index(['d', 'b', 'a', 'c', 'e'], dtype='object')

### Use [label] to retrieve a value from a Series.

In [8]:
obj2['a']

-3

In [9]:
obj2[['a','b','c']]  # To select several values, pass a list of labels inside the brackets.

a   -3
b    6
c    7
dtype: int64

### Filter values with a boolean condition.

In [10]:
obj2[obj2>0]

d    5
b    6
c    7
e    2
dtype: int64

### Arithmetic with a scalar is applied to every value; the index is preserved.

In [11]:
obj2*2

d    10
b    12
a    -6
c    14
e     4
dtype: int64

### NumPy functions can also be applied to a Series.

In [12]:
import numpy as np
np.exp(obj2)  # exp computes e raised to the power of each value.

d     148.413159
b     403.428793
a       0.049787
c    1096.633158
e       7.389056
dtype: float64

### in: checks whether a label exists in the index of a Series (it tests index labels, not values).

In [13]:
'f' in obj2

False

In [14]:
'e' in obj2

True
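
Because `in` looks at labels, it is easy to mistake it for a value test. The following is a minimal sketch, assuming the same data as `obj2`; the names `s`, `label_present`, `value_as_label`, and `value_present` are only illustrative.

```python
import pandas as pd

# Same data as obj2 above; the name `s` is used only in this sketch.
s = pd.Series([5, 6, -3, 7, 2], index=['d', 'b', 'a', 'c', 'e'])

label_present = 'e' in s         # True: 'e' is an index label
value_as_label = 5 in s          # False: 5 is a value, not a label
value_present = 5 in s.values    # True: search the underlying array instead

print(label_present, value_as_label, value_present)
```

To test for a value, search `s.values` as shown, rather than the Series itself.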

### A Python dict can be used to create a Series.

In [15]:
a_dict={'Orange':1000, 'Apple':2000, 'Banana':3000, 'Grape':4000, 'Watermelon':5000}

In [16]:
obj3= pd.Series(a_dict)
obj3

Orange        1000
Apple         2000
Banana        3000
Grape         4000
Watermelon    5000
dtype: int64

### Passing index= together with a dict sets the order of the labels; a label that is missing from the dict gets NaN.

In [17]:
a_list=['Apple','Banana','Grape','Orange','Peach','Watermelon']

In [18]:
obj4=pd.Series(a_dict, index=a_list)
obj4

Apple         2000.0
Banana        3000.0
Grape         4000.0
Orange        1000.0
Peach            NaN
Watermelon    5000.0
dtype: float64

### NaN (not a number) marks a missing or not-available (NA) value.

### isnull and notnull detect missing values.

In [19]:
pd.isnull(obj4)

Apple         False
Banana        False
Grape         False
Orange        False
Peach          True
Watermelon    False
dtype: bool

In [20]:
pd.notnull(obj4)

Apple          True
Banana         True
Grape          True
Orange         True
Peach         False
Watermelon     True
dtype: bool

In [21]:
obj4.isnull()

Apple         False
Banana        False
Grape         False
Orange        False
Peach          True
Watermelon    False
dtype: bool
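
The boolean results of `isnull` and `notnull` are usually fed into something else, such as a count or a filter. The sketch below uses hypothetical data (`prices`, `labels`) that is not part of the notebook.

```python
import pandas as pd

# Hypothetical data for illustration only.
prices = {'Orange': 1000, 'Apple': 2000, 'Banana': 3000}
labels = ['Apple', 'Orange', 'Peach']

# 'Banana' is not in labels, so it is dropped;
# 'Peach' has no price, so its value is NaN.
s = pd.Series(prices, index=labels)

n_missing = int(s.isnull().sum())   # number of missing entries
present = s[s.notnull()]            # keep only the entries that have a value

print(n_missing)
print(present)
```

Newer code often uses the equivalent names `isna` and `notna`.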

### Arithmetic between two Series aligns on index labels; a label that is absent or NaN on either side yields NaN.

In [22]:
obj3+obj4

Apple          4000.0
Banana         6000.0
Grape          8000.0
Orange         2000.0
Peach             NaN
Watermelon    10000.0
dtype: float64
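
Alignment is easier to see when the two indexes only partly overlap. This is a small sketch with made-up Series `a` and `b`; `add` with `fill_value` is one way to avoid the NaN results, shown here as an option rather than a recommendation.

```python
import pandas as pd

# Two hypothetical Series whose labels only partly overlap.
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# Values are matched by label, not by position.
# 'x' and 'w' each exist on one side only, so they become NaN.
total = a + b

# add() with fill_value treats a label missing on one side as 0.
filled = a.add(b, fill_value=0)

print(total)
print(filled)
```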

### A Series has a name attribute, and so does its index.

In [23]:
obj4.name='Price'

In [24]:
obj4.index.name='Fruit'

In [25]:
obj4

Fruit
Apple         2000.0
Banana        3000.0
Grape         4000.0
Orange        1000.0
Peach            NaN
Watermelon    5000.0
Name: Price, dtype: float64

### The index of a Series can be replaced in place by assignment.

In [26]:
obj

0    4
1    5
2   -1
3    2
4    0
dtype: int64

In [27]:
obj.index=['Math', 'English', 'Chinese', 'PE', 'Science']
obj

Math       4
English    5
Chinese   -1
PE         2
Science    0
dtype: int64

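### DataFrame: a rectangular table of data made up of an ordered collection of columns, each of which can hold a different type (boolean, string, numeric). It has both a row index and a column index.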

In [28]:
data={'fruit':['Apple','Banana','Grape','Orange','Peach','Watermelon'],
     'weekday':['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday'],
     'price':[1000,2000,3000,4000,5000,6000]}
frame= pd.DataFrame(data)
frame

        fruit    weekday  price
0       Apple     Monday   1000
1      Banana    Tuesday   2000
2       Grape  Wednesday   3000
3      Orange   Thursday   4000
4       Peach     Friday   5000
5  Watermelon   Saturday   6000


### head(): for a large data set, returns only the first 5 rows.

In [29]:
frame.head()

    fruit    weekday  price
0   Apple     Monday   1000
1  Banana    Tuesday   2000
2   Grape  Wednesday   3000
3  Orange   Thursday   4000
4   Peach     Friday   5000


### columns=[]: sets the order of the columns.

In [30]:
pd.DataFrame(data,columns=['weekday','price','fruit'])

     weekday  price       fruit
0     Monday   1000       Apple
1    Tuesday   2000      Banana
2  Wednesday   3000       Grape
3   Thursday   4000      Orange
4     Friday   5000       Peach
5   Saturday   6000  Watermelon


### index=[]: sets the row labels. A column named in columns= that is not in the data (here 'farmer') is filled with NaN.

In [31]:
frame2= pd.DataFrame(data,columns=['weekday','price','fruit','farmer'], index=['a','b','c','d','e','f'])
frame2

     weekday  price       fruit  farmer
a     Monday   1000       Apple     NaN
b    Tuesday   2000      Banana     NaN
c  Wednesday   3000       Grape     NaN
d   Thursday   4000      Orange     NaN
e     Friday   5000       Peach     NaN
f   Saturday   6000  Watermelon     NaN


### frame['column_name'] or frame.column_name: retrieves a whole column of a DataFrame as a Series.

In [32]:
frame2['weekday']

a       Monday
b      Tuesday
c    Wednesday
d     Thursday
e       Friday
f     Saturday
Name: weekday, dtype: object

In [33]:
frame2.fruit

a         Apple
b        Banana
c         Grape
d        Orange
e         Peach
f    Watermelon
Name: fruit, dtype: object

### loc[]: retrieves a row by its index label.

In [34]:
frame2.loc['b']

weekday    Tuesday
price         2000
fruit       Banana
farmer         NaN
Name: b, dtype: object

### Assign to a column to fill in its values.

In [35]:
frame2['farmer']=['Nicole','Eason','Juan','Teresa','Johnny','Patrick']
frame2

     weekday  price       fruit   farmer
a     Monday   1000       Apple   Nicole
b    Tuesday   2000      Banana    Eason
c  Wednesday   3000       Grape     Juan
d   Thursday   4000      Orange   Teresa
e     Friday   5000       Peach   Johnny
f   Saturday   6000  Watermelon  Patrick


In [36]:
frame2=pd.DataFrame(data,columns=['rank','farmer','weekday','fruit','price'],
                   index=['a','b','c','d','e','f'])
frame2['farmer']=['Nicole','Eason','Juan','Teresa','Johnny','Patrick']
frame2

   rank   farmer    weekday       fruit  price
a   NaN   Nicole     Monday       Apple   1000
b   NaN    Eason    Tuesday      Banana   2000
c   NaN     Juan  Wednesday       Grape   3000
d   NaN   Teresa   Thursday      Orange   4000
e   NaN   Johnny     Friday       Peach   5000
f   NaN  Patrick   Saturday  Watermelon   6000


### np.arange() can be used to fill a column with a sequence of values.

In [37]:
import numpy as np
frame2['rank']=np.arange(1,7)
frame2

   rank   farmer    weekday       fruit  price
a     1   Nicole     Monday       Apple   1000
b     2    Eason    Tuesday      Banana   2000
c     3     Juan  Wednesday       Grape   3000
d     4   Teresa   Thursday      Orange   4000
e     5   Johnny     Friday       Peach   5000
f     6  Patrick   Saturday  Watermelon   6000


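### When a Series is assigned to a column, its labels are aligned with the DataFrame index; rows without a matching label become NaN.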

In [38]:
val=pd.Series([7000,8000,9000], index=['b','d','f'])

In [39]:
frame2['price']=val
frame2

   rank   farmer    weekday       fruit   price
a     1   Nicole     Monday       Apple     NaN
b     2    Eason    Tuesday      Banana  7000.0
c     3     Juan  Wednesday       Grape     NaN
d     4   Teresa   Thursday      Orange  8000.0
e     5   Johnny     Friday       Peach     NaN
f     6  Patrick   Saturday  Watermelon  9000.0


### Assigning to a column that does not exist creates a new column.

In [40]:
frame2['taste']=frame2.farmer=='Teresa'
frame2

   rank   farmer    weekday       fruit   price  taste
a     1   Nicole     Monday       Apple     NaN  False
b     2    Eason    Tuesday      Banana  7000.0  False
c     3     Juan  Wednesday       Grape     NaN  False
d     4   Teresa   Thursday      Orange  8000.0   True
e     5   Johnny     Friday       Peach     NaN  False
f     6  Patrick   Saturday  Watermelon  9000.0  False


### del removes a column.

In [41]:
del frame2['taste']

In [42]:
frame2.columns

Index(['rank', 'farmer', 'weekday', 'fruit', 'price'], dtype='object')
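
`del` changes the DataFrame itself. When the original should stay intact, `drop(columns=...)` returns a copy instead. The sketch below uses a small hypothetical frame `df` and column name `cheap` to contrast the two.

```python
import pandas as pd

# Hypothetical frame for illustration only.
df = pd.DataFrame({'fruit': ['Apple', 'Banana'], 'price': [1000, 2000]})

df['cheap'] = df['price'] < 1500        # bracket assignment creates the column
without = df.drop(columns=['cheap'])    # returns a copy; df still has 'cheap'
still_there = 'cheap' in df.columns     # True at this point

del df['cheap']                         # removes the column from df itself
gone = 'cheap' not in df.columns        # True after del

print(list(without.columns), still_there, gone)
```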

### A dict of dicts: when passed to DataFrame, the outer dict keys become the column labels and the inner keys become the row labels.

In [43]:
farmer={'Nicole':{'age':30, 'gender':'Female'}, 
        'Eason':{'age':40, 'gender':'Male'},
        'Juan':{'age':50, 'gender':'Male'},
        'Teresa':{'age':60, 'gender':'Female'},
        'Johnny':{'age':70, 'gender':'Male'},
        'Patrick':{'age':80, 'gender':'Male'}}

In [44]:
frame3=pd.DataFrame(farmer)
frame3

        Nicole  Eason  Juan  Teresa  Johnny  Patrick
age         30     40    50      60      70       80
gender  Female   Male  Male  Female    Male     Male


### T: transposes a DataFrame, swapping its rows and columns.

In [45]:
frame3.T

         age  gender
Nicole    30  Female
Eason     40    Male
Juan      50    Male
Teresa    60  Female
Johnny    70    Male
Patrick   80    Male
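
If the outer keys are meant to be rows from the start, `DataFrame.from_dict` with `orient='index'` builds that layout directly, without transposing afterwards. This is a sketch on a shortened, hypothetical version of the `farmer` dict, here called `people`.

```python
import pandas as pd

# A shortened, hypothetical version of the farmer dict above.
people = {'Nicole': {'age': 30, 'gender': 'Female'},
          'Eason': {'age': 40, 'gender': 'Male'}}

by_column = pd.DataFrame(people)                           # outer keys -> columns
by_row = pd.DataFrame.from_dict(people, orient='index')    # outer keys -> rows

print(by_column)
print(by_row)
```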


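### If index= includes a label that is not among the inner keys, the new DataFrame shows that row filled with NaN; the existing frame3 is not changed.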

In [46]:
pd.DataFrame(farmer, index=['age','gender','experience'])

            Nicole  Eason  Juan  Teresa  Johnny  Patrick
age             30     40    50      60      70       80
gender      Female   Male  Male  Female    Male     Male
experience     NaN    NaN   NaN     NaN     NaN      NaN


In [47]:
frame3

        Nicole  Eason  Juan  Teresa  Johnny  Patrick
age         30     40    50      60      70       80
gender  Female   Male  Male  Female    Male     Male


### A dict of Series, here built with [] and slices, can also be turned into a DataFrame.

In [48]:
pdata={'Nicole':frame3['Nicole'][:1],
      'Eason':frame3['Eason'][:2]}

In [49]:
pd.DataFrame(pdata)

        Nicole  Eason
age       30.0     40
gender     NaN   Male


### The index and columns can each be given a name.

In [50]:
frame3.index.name='personal_information' ; frame3.columns.name='name'

In [51]:
frame3

name                  Nicole  Eason  Juan  Teresa  Johnny  Patrick
personal_information
age                       30     40    50      60      70       80
gender                Female   Male  Male  Female    Male     Male


### values returns the data as a two-dimensional ndarray.

In [52]:
frame3.values

array([[30, 40, 50, 60, 70, 80],
       ['Female', 'Male', 'Male', 'Female', 'Male', 'Male']], dtype=object)

In [53]:
frame2.values

array([[1, 'Nicole', 'Monday', 'Apple', nan],
       [2, 'Eason', 'Tuesday', 'Banana', 7000.0],
       [3, 'Juan', 'Wednesday', 'Grape', nan],
       [4, 'Teresa', 'Thursday', 'Orange', 8000.0],
       [5, 'Johnny', 'Friday', 'Peach', nan],
       [6, 'Patrick', 'Saturday', 'Watermelon', 9000.0]], dtype=object)

### Index objects: the labels you pass when building a Series or DataFrame are converted into an Index object.

In [54]:
obj=pd.Series(range(3), index=['a','b','c'])

In [55]:
index=obj.index

In [56]:
index

Index(['a', 'b', 'c'], dtype='object')

In [57]:
index[1:]

Index(['b', 'c'], dtype='object')

### An Index is immutable: its elements cannot be modified.

In [58]:
index[1]='d'

TypeError: Index does not support mutable operations
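
Since an element of an Index cannot be overwritten, relabeling means producing a new object. The sketch below, using a throwaway Series `s`, shows the failed assignment and one possible alternative, `Series.rename` with a mapping.

```python
import pandas as pd

s = pd.Series(range(3), index=['a', 'b', 'c'])

# Item assignment on an Index raises TypeError.
try:
    s.index[1] = 'd'
    mutated = True
except TypeError:
    mutated = False

# rename() returns a new Series with a relabeled index; s keeps its labels.
s2 = s.rename({'b': 'd'})

print(mutated, list(s.index), list(s2.index))
```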

### Because they are immutable, Index objects can safely be shared between data structures.

In [59]:
labels=pd.Index(np.arange(3))

In [60]:
labels

Int64Index([0, 1, 2], dtype='int64')

In [61]:
obj2=pd.Series([2,4,6],index=labels)

In [62]:
obj2

0    2
1    4
2    6
dtype: int64

### An Index behaves much like a fixed-size set. Use in to check whether a row or column label exists.

In [63]:
frame3

name                  Nicole  Eason  Juan  Teresa  Johnny  Patrick
personal_information
age                       30     40    50      60      70       80
gender                Female   Male  Male  Female    Male     Male


In [64]:
frame3.columns

Index(['Nicole', 'Eason', 'Juan', 'Teresa', 'Johnny', 'Patrick'], dtype='object', name='name')

In [65]:
'Johnny' in frame3.columns

True

In [66]:
'experience' in frame3.index

False

### Unlike a Python set, a pandas Index can contain duplicate labels.

In [67]:
dup_labels=pd.Index(['a','a','b','b','c','c'])
dup_labels

Index(['a', 'a', 'b', 'b', 'c', 'c'], dtype='object')

In [68]:
dup_labels.unique()  # Returns the distinct labels.

Index(['a', 'b', 'c'], dtype='object')

In [69]:
labels2=pd.Index(['d','e','f'])
dup_labels.append(labels2)  # append() expects an Index (or a list of Index objects) and returns a new Index.

Index(['a', 'a', 'b', 'b', 'c', 'c', 'd', 'e', 'f'], dtype='object')
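
Duplicate labels change what a lookup returns, which is worth checking before relying on a scalar result. This is a minimal sketch with a hypothetical Series `s`; `is_unique` reports whether any label repeats.

```python
import pandas as pd

# Hypothetical Series with repeated labels.
s = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

unique_flag = s.index.is_unique   # False: 'a' and 'b' each appear twice
one = s['c']                      # unique label -> a single scalar value
many = s['a']                     # repeated label -> a Series with both rows

print(unique_flag, one)
print(many)
```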

### Reindexing

### reindex(): creates a new object with the data rearranged to match a new index.

In [70]:
obj=pd.Series([2,4,-6,8,-10],index=['e','d','c','b','a'])
obj

e     2
d     4
c    -6
b     8
a   -10
dtype: int64

### With reindex, a value whose label is left out of the new index is dropped, and a label that was not in the original index gets NaN.

In [71]:
obj2=obj.reindex(['b','c','d','e','f'])
obj2

b    8.0
c   -6.0
d    4.0
e    2.0
f    NaN
dtype: float64

### method=: an option of reindex that fills in values for newly introduced labels. 'ffill' carries the previous value forward.

In [72]:
obj3=pd.Series(['Red','Orange','Green','Blue'],index=[0,1,3,4])
obj3

0       Red
1    Orange
3     Green
4      Blue
dtype: object

In [73]:
obj3.reindex(range(6),method='ffill')

0       Red
1    Orange
2    Orange
3     Green
4      Blue
5      Blue
dtype: object
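
To see what `method` changes, it helps to compare the same reindex with no fill, with forward fill, and with backward fill. The sketch reuses the colors data under the illustrative name `colors`; `'bfill'` is the backward-filling counterpart of `'ffill'`.

```python
import pandas as pd

colors = pd.Series(['Red', 'Orange', 'Green', 'Blue'], index=[0, 1, 3, 4])

plain = colors.reindex(range(6))                      # labels 2 and 5 become NaN
forward = colors.reindex(range(6), method='ffill')    # take the previous value
backward = colors.reindex(range(6), method='bfill')   # take the next value

print(plain)
print(forward)
print(backward)
```

With `'bfill'`, label 5 stays NaN because no later value exists to copy from.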

### reindex can change the rows, the columns, or both. When only a sequence is passed, the rows are reindexed.

In [74]:
frame=pd.DataFrame(np.arange(9).reshape((3,3)),index=['a','c','d'],columns=['Tree','Forest','Grass'])
frame

   Tree  Forest  Grass
a     0       1      2
c     3       4      5
d     6       7      8


In [75]:
frame2=frame.reindex(['a','b','c','d'])  # With only a sequence passed, the rows are reindexed.
frame2

   Tree  Forest  Grass
a   0.0     1.0    2.0
b   NaN     NaN    NaN
c   3.0     4.0    5.0
d   6.0     7.0    8.0


In [76]:
nature=['Flower','Wind','Seed']
frame.reindex(columns=nature)  # Pass columns= to reindex the columns. None of these labels exist in frame, so every value is NaN.

   Flower  Wind  Seed
a     NaN   NaN   NaN
c     NaN   NaN   NaN
d     NaN   NaN   NaN
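
The all-NaN result above is a reminder that `reindex` selects by label and does not rename anything. The following sketch contrasts it with `rename`, which is probably what is wanted when the goal is to change a column's name. The frame is the same as `frame`; the new labels are arbitrary.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'c', 'd'],
                     columns=['Tree', 'Forest', 'Grass'])

# reindex: reorders existing columns; an unknown label ('Wind') is all NaN.
reindexed = frame.reindex(columns=['Grass', 'Tree', 'Wind'])

# rename: keeps the data and changes only the label.
renamed = frame.rename(columns={'Tree': 'Flower'})

print(reindexed)
print(renamed)
```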


### loc selects rows of a DataFrame by label.

In [77]:
frame.loc[['a','c','d']]

   Tree  Forest  Grass
a     0       1      2
c     3       4      5
d     6       7      8


### Dropping entries from an axis

### drop(): returns a new object with the given labels removed.

In [78]:
obj=pd.Series(np.arange(5),index=['a','b','c','d','e'])
obj

a    0
b    1
c    2
d    3
e    4
dtype: int64

In [79]:
new_obj=obj.drop('c')

In [80]:
new_obj

a    0
b    1
d    3
e    4
dtype: int64

In [81]:
obj.drop(['d','e'])

a    0
b    1
c    2
dtype: int64

In [82]:
obj

a    0
b    1
c    2
d    3
e    4
dtype: int64

In [83]:
data=pd.DataFrame(np.arange(16).reshape(4,4),index=['Tea','Milktea','Juice','Cola']
                  ,columns=['one','two','three','four'])
data

         one  two  three  four
Tea        0    1      2     3
Milktea    4    5      6     7
Juice      8    9     10    11
Cola      12   13     14    15


In [84]:
data.drop(['Milktea','Juice'])

      one  two  three  four
Tea     0    1      2     3
Cola   12   13     14    15


### To drop columns, pass axis=1 or axis='columns'. The default, axis=0, drops rows.

In [85]:
data.drop('two',axis=1)

         one  three  four
Tea        0      2     3
Milktea    4      6     7
Juice      8     10    11
Cola      12     14    15


In [86]:
data.drop('four',axis='columns')

         one  two  three
Tea        0    1      2
Milktea    4    5      6
Juice      8    9     10
Cola      12   13     14


In [87]:
data

         one  two  three  four
Tea        0    1      2     3
Milktea    4    5      6     7
Juice      8    9     10    11
Cola      12   13     14    15


### inplace=True: removes the data from the original object instead of returning a new one.

In [88]:
obj.drop('c',inplace=True)

In [89]:
obj

a    0
b    1
d    3
e    4
dtype: int64
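
The difference between the two calling styles comes down to what `drop` returns and what it leaves behind. This sketch uses a fresh Series `s` with the same shape as `obj`; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), index=['a', 'b', 'c', 'd', 'e'])

# Default: a new Series comes back and s is left untouched.
smaller = s.drop('c')
kept_c = 'c' in s                     # True: s still has 'c'

# inplace=True: s itself is modified and the call returns None.
result = s.drop('d', inplace=True)

print(list(smaller.index), kept_c, result, list(s.index))
```

Because the in-place call returns `None`, writing `s = s.drop('d', inplace=True)` would discard the data, which is one reason to be careful with `inplace`.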

### Indexing, selection, and filtering

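### Unlike a NumPy array, a Series can be indexed by its non-integer labels as well as by integers. In recent pandas versions, integer-position access such as obj[1] on a label index is deprecated in favor of obj.iloc[1].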

In [90]:
obj=pd.Series(np.arange(4),index=['a','b','c','d'])
obj

a    0
b    1
c    2
d    3
dtype: int64

In [91]:
obj['b']

1

In [92]:
obj[1]

1

In [93]:
obj[2:4]

c    2
d    3
dtype: int64

In [94]:
obj[['b','a','d']]

b    1
a    0
d    3
dtype: int64

In [95]:
obj[[1,3]]

b    1
d    3
dtype: int64

In [96]:
obj[obj<2]

a    0
b    1
dtype: int64

### Slicing with labels differs from ordinary Python slicing: the endpoint is included.

In [97]:
obj['b':'c']

b    1
c    2
dtype: int64
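
The inclusive endpoint applies to label slices only; positional slices follow the usual Python rule. A short sketch, using a Series `s` with the same data as `obj`, makes the contrast explicit with `loc` and `iloc`.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])

by_label = s.loc['b':'c']    # label slice: 'c' is included
by_position = s.iloc[1:2]    # positional slice: position 2 is excluded

print(by_label)
print(by_position)
```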

### Assigning to a label slice modifies the corresponding section of the Series.

In [98]:
obj['b':'c']=5

In [99]:
obj

a    0
b    5
c    5
d    3
dtype: int64

In [100]:
data=pd.DataFrame(np.arange(16).reshape(4,4),
                  index=['Apple','Sony','Samsung','HTC'], columns=['one','two','three','four'])

In [101]:
data

         one  two  three  four
Apple      0    1      2     3
Sony       4    5      6     7
Samsung    8    9     10    11
HTC       12   13     14    15


In [102]:
data['two']

Apple       1
Sony        5
Samsung     9
HTC        13
Name: two, dtype: int64

In [103]:
data[['three','one']]

         three  one
Apple        2    0
Sony         6    4
Samsung     10    8
HTC         14   12


In [104]:
data[:2]

       one  two  three  four
Apple    0    1      2     3
Sony     4    5      6     7


In [105]:
data[data['three']>5]

         one  two  three  four
Sony       4    5      6     7
Samsung    8    9     10    11
HTC       12   13     14    15


### Indexing with booleans

In [106]:
data<5

           one    two  three   four
Apple     True   True   True   True
Sony      True  False  False  False
Samsung  False  False  False  False
HTC      False  False  False  False


In [107]:
data[data<5]=0

In [108]:
data

         one  two  three  four
Apple      0    0      0     0
Sony       0    5      6     7
Samsung    8    9     10    11
HTC       12   13     14    15
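
Assigning through a boolean mask overwrites the original frame. When the original values are still needed, `DataFrame.mask` gives the same replacement on a copy. The sketch below rebuilds the frame from scratch under the same name `data`; `zeroed` is an illustrative name.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Apple', 'Sony', 'Samsung', 'HTC'],
                    columns=['one', 'two', 'three', 'four'])

# mask() replaces the cells where the condition is True and returns a new frame.
zeroed = data.mask(data < 5, 0)

print(zeroed)
print(data)   # the original values are unchanged
```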


### Selection with loc and iloc

### The order is loc[rows, columns], and both must be given as index and column labels.

In [109]:
data.loc['Apple',['two','three']]

two      0
three    0
Name: Apple, dtype: int64

### The order is iloc[rows, columns], and both must be given as integer positions.

In [110]:
data.iloc[3,[3,0,1]]

four    15
one     12
two     13
Name: HTC, dtype: int64

In [111]:
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Samsung, dtype: int64

In [112]:
data.iloc[[1,2],[3,0,1]]

         four  one  two
Sony        7    0    5
Samsung    11    8    9


### Slices (:) also work with loc and iloc.

In [113]:
data.loc[:'Sony','two']

Apple    0
Sony     5
Name: two, dtype: int64

In [114]:
data.iloc[:,:3][data.three>5]

         one  two  three
Sony       0    5      6
Samsung    8    9     10
HTC       12   13     14
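
The same selection can usually be written with either accessor. The sketch below, on a freshly built copy of the frame, pairs a `loc` call with its `iloc` equivalent and also shows a boolean row filter combined with a label slice in a single `loc` call.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Apple', 'Sony', 'Samsung', 'HTC'],
                    columns=['one', 'two', 'three', 'four'])

by_label = data.loc[['Sony', 'Samsung'], ['four', 'one']]   # names
by_position = data.iloc[[1, 2], [3, 0]]                     # integer positions
same = by_label.equals(by_position)                         # True

# Boolean row filter plus a label slice for the columns, in one step.
rows = data.loc[data['three'] > 5, :'three']

print(same)
print(rows)
```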


### Integer indexes

In [115]:
ser=pd.Series(np.arange(3))

In [116]:
ser

0    0
1    1
2    2
dtype: int64

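### With an integer index, ser[-1] is ambiguous: it could mean the label -1 or the last position. pandas treats it as a label, so the lookup fails.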

In [117]:
ser[-1]

KeyError: -1

### With a non-integer index there is no such ambiguity.

In [120]:
ser2=pd.Series(np.arange(3),index=['a','b','c'])
ser2

a    0
b    1
c    2
dtype: int64

In [121]:
ser2[-1]

2

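### For consistency, when the index contains integers, prefer loc (label-based) or iloc (position-based) to state explicitly which kind of lookup you mean.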

In [118]:
ser.iloc[-1]

2

In [119]:
ser.loc[:1]

0    0
1    1
dtype: int64
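
Putting the pieces of this section together, the sketch below shows the failed plain lookup next to the explicit `iloc` and `loc` forms on the same Series. The helper names (`raised`, `last`, `up_to_label`, `up_to_position`) are illustrative.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3))    # default integer index 0, 1, 2

# Plain brackets look for the label -1, which does not exist.
try:
    ser[-1]
    raised = False
except KeyError:
    raised = True

last = ser.iloc[-1]              # position-based: the last element
up_to_label = ser.loc[:1]        # label-based: labels 0 and 1, endpoint included
up_to_position = ser.iloc[:1]    # position-based: position 0 only

print(raised, last, len(up_to_label), len(up_to_position))
```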