# Ufuncs

### Pandas Ufuncs
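Because pandas is designed to work with NumPy, any NumPy ufunc can be applied to a Series or a DataFrame.  
Unary operations (exponentials, logarithms, trigonometric functions, and so on) preserve the index and column labels in their output.  
Binary operations (addition, multiplication, and so on) automatically align the indices of their operands before computing the result.  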

In [1]:
import numpy as np
import pandas as pd

In [3]:
rng = np.random.RandomState(50)
df = pd.DataFrame(rng.randint(0,10,(3,4)) ,
                 columns=['A','B','C','D']
                 )
df

   A  B  C  D
0  0  0  1  4
1  6  5  6  6
2  5  2  7  4


### Ufuncs preserve the index and columns

In [5]:
np.exp(df)

            A           B            C           D
0         1.0         1.0     2.718282    54.59815
1  403.428793  148.413159   403.428793  403.428793
2  148.413159    7.389056  1096.633158    54.59815


In [6]:
np.sin(df*np.pi/4)

          A         B         C             D
0       0.0       0.0  0.707107  1.224647e-16
1      -1.0 -0.707107      -1.0          -1.0
2 -0.707107       1.0 -0.707107  1.224647e-16


### Ufunc index alignment

In [7]:
# Series (alignment under division)
area = pd.Series({'A':105,'T':99,'C':73} , name='AREA')
popu = pd.Series({'C':21,'T':264,'N':42} , name='POPU')
area / popu

A        NaN
C    3.47619
N        NaN
T    0.37500
dtype: float64

In [8]:
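# Older pandas treats `|` between Index objects as a set union (with a FutureWarning);
# pandas 2.0 and later no longer do, so prefer the explicit union() call in the next cell.
area.index | popu.index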

Index(['A', 'C', 'N', 'T'], dtype='object')

In [10]:
pd.Index.union(area.index ,popu.index)

Index(['A', 'C', 'N', 'T'], dtype='object')

In [11]:
# Series (alignment under addition)
a = pd.Series([2,4,6] , index=[0,1,2], name='A')
b = pd.Series([1,3,5] , index=[1,2,3], name='B')
a+b

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

In [14]:
# The add() method works as well; its fill_value argument supplies a stand-in for labels missing from one operand, which would otherwise produce NaN
a.add(b , fill_value = 999)

0    1001.0
1       5.0
2       9.0
3    1004.0
dtype: float64
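
One detail of `fill_value` is easy to miss: it is applied only where exactly one of the two operands is missing. The sketch below uses made-up Series (`left` and `right` are hypothetical names, not part of the notebook above) in which label 1 is NaN on both sides, so that position stays NaN even with a fill value.

```python
import numpy as np
import pandas as pd

# Hypothetical data: label 1 is present in both Series but NaN in both.
left = pd.Series([2.0, np.nan, 6.0], index=[0, 1, 2])
right = pd.Series([np.nan, 3.0, 5.0], index=[1, 2, 3])

total = left.add(right, fill_value=0)
print(total)
# Label 0 exists only in `left`  -> 2.0 + 0 = 2.0
# Label 1 is NaN on both sides   -> stays NaN
# Label 2 exists in both         -> 6.0 + 3.0 = 9.0
# Label 3 exists only in `right` -> 0 + 5.0 = 5.0
```

If every gap has to be closed, a follow-up `fillna()` on the result is one option.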

In [15]:
# DataFrame
A = pd.DataFrame(rng.randint(0,20,(2,2)) , columns=list('AB'))
B = pd.DataFrame(rng.randint(0,10,(3,3)) , columns=list('BAC'))

In [16]:
A

    A   B
0  14   3
1   6  11


In [17]:
B

   B  A  C
0  1  5  9
1  0  6  3
2  2  9  3


In [18]:
A+B

      A     B   C
0  19.0   4.0 NaN
1  12.0  11.0 NaN
2   NaN   NaN NaN


In [23]:
# Compute the mean of all values in A
fill = A.stack().mean()
fill

8.5

In [24]:
# Use the overall mean of A as the stand-in for missing entries
A.add(B,fill_value=fill)

      A     B     C
0  19.0   4.0  17.5
1  12.0  11.0  11.5
2  17.5  10.5  11.5


### Python arithmetic operators and their pandas method equivalents

| Python operator | pandas method(s) |
|---|---|
| `+` | `add()` |
| `-` | `sub()`, `subtract()` |
| `*` | `mul()`, `multiply()` |
| `/` | `truediv()`, `div()`, `divide()` |
| `//` | `floordiv()` |
| `%` | `mod()` |
| `**` | `pow()` |
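
As a quick check of the correspondence listed above, the sketch below compares each operator with its method form on two small Series. The data and the `pairs` dictionary are made up for illustration only.

```python
import pandas as pd

a = pd.Series([2, 4, 6], index=[0, 1, 2])
b = pd.Series([1, 3, 5], index=[0, 1, 2])

# Each entry pairs the operator result with the method result.
pairs = {
    "+": (a + b, a.add(b)),
    "-": (a - b, a.sub(b)),
    "*": (a * b, a.mul(b)),
    "/": (a / b, a.div(b)),
    "//": (a // b, a.floordiv(b)),
    "%": (a % b, a.mod(b)),
    "**": (a ** b, a.pow(b)),
}

for op, (by_operator, by_method) in pairs.items():
    # Series.equals compares both values and dtype
    print(op, by_operator.equals(by_method))
```

The method forms are mainly useful because they accept extra arguments such as `fill_value` and `axis`, which the bare operators cannot take.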

In [26]:
# Numpy broadcasting
A = rng.randint(10,size=(3,4))
A

array([[3, 3, 2, 0],
       [3, 2, 0, 3],
       [0, 0, 7, 3]])

In [27]:
A-A[0]

array([[ 0,  0,  0,  0],
       [ 0, -1, -2,  3],
       [-3, -3,  5,  3]])

In [28]:
df= pd.DataFrame(A , columns=list('QRST'))
df

   Q  R  S  T
0  3  3  2  0
1  3  2  0  3
2  0  0  7  3


In [29]:
# Broadcasting in pandas: row 1 is subtracted from every row
df - df.iloc[1]

   Q  R  S  T
0  0  1  2 -3
1  0  0  0  0
2 -3 -2  7  0


In [32]:
# Subtract along a chosen axis: axis=0 matches on the row index, so column R is subtracted from every column
df['R'] , df.subtract(df['R'] , axis=0)

(0    3
 1    2
 2    0
 Name: R, dtype: int32,
    Q  R  S  T
 0  0  0 -1 -3
 1  1  0 -2  1
 2  0  0  7  3)

In [33]:
df.iloc[0,::2]

Q    3
S    2
Name: 0, dtype: int32

In [34]:
df - df.iloc[0,::2]

     Q   R    S   T
0  0.0 NaN  0.0 NaN
1  0.0 NaN -2.0 NaN
2 -3.0 NaN  5.0 NaN


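### NaN and None
Python code commonly represents missing data with either None or NaN. None is a Python object, so a NumPy array that contains it gets dtype object, and aggregations such as min() or sum() over that array raise a TypeError.  
Operations on object arrays are also much slower than operations on native dtypes, which is why NaN is the more common sentinel for missing values.  
NaN is an ordinary floating-point value, so an array that contains it keeps a native float dtype and is processed far faster than an object array.  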
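
The dtype difference described above can be checked directly. This is a minimal sketch with made-up values; `with_none`, `with_nan`, and `s` are illustrative names.

```python
import numpy as np
import pandas as pd

with_none = np.array([1, None, 3])    # None forces a generic Python-object array
with_nan = np.array([1, np.nan, 3])   # NaN fits in a native float array

print(with_none.dtype)   # object
print(with_nan.dtype)    # float64

# pandas converts None to NaN when the remaining values are numeric,
# so the Series keeps a native float dtype.
s = pd.Series([1, None, 3])
print(s.dtype)           # float64
print(s.sum())           # 4.0, because pandas aggregations skip NaN by default
```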


In [40]:
np.array([0,15,20]).min() 

0

In [41]:
# Aggregating over an array that contains None raises an error
# Note the message: comparison between int and NoneType is not defined
np.array([0,15,20,None]).min() 

TypeError: '<=' not supported between instances of 'int' and 'NoneType'

In [36]:
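# Speed difference between dtypes
# (the loop variable is named dtype so it does not shadow the built-in type)
for dtype in ['object','int']:
    print("dtype =",dtype)
    %timeit np.arange(1e6 , dtype=dtype).sum()
    print()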

dtype = object
40 ms ± 901 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

dtype = int
1.57 ms ± 9.82 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)



In [43]:
# Working with NaN: note that the array's dtype becomes float64
vals2 = np.array([1,np.nan,3,5])
vals2.dtype

dtype('float64')

In [44]:
# As in JavaScript, any arithmetic involving NaN yields NaN
1 + np.nan , 0*np.nan

(nan, nan)

In [46]:
# NumPy provides NaN-aware functions that skip NaN values
np.nanmin(vals2)

1.0

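### Handling missing values in pandas
When an NA (missing) value or None is stored, pandas converts the data according to its dtype:  

| dtype | Conversion when storing NA | NA sentinel value |
|---|---|---|
| float | no change | np.nan |
| object | no change | None or np.nan |
| integer | cast to float64 | np.nan |
| boolean | cast to object | None or np.nan |

Note that pandas has traditionally stored string data with the object dtype.
Four methods are available for handling null values:
1. isnull(): returns a boolean mask marking the missing entries
2. notnull(): the inverse of isnull()
3. dropna(): returns the data with NA entries filtered out
4. fillna(): returns a copy of the data with NA entries replaced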
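
The conversion rules above can be observed by building small Series with and without a missing value. The sketch below is illustrative only: the variable names are made up, and the behaviour shown is that of the default NumPy-backed dtypes, not the nullable extension dtypes.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
ints_na = pd.Series([1, None, 3])         # integer data plus a missing value
bools = pd.Series([True, False])
bools_na = pd.Series([True, None, False]) # boolean data plus a missing value

print(ints.dtype.kind)   # 'i' (a native integer dtype)
print(ints_na.dtype)     # float64: integers are upcast so NaN can be stored
print(bools.dtype)       # bool
print(bools_na.dtype)    # object: booleans fall back to Python objects

# The four null-handling methods applied to the upcast Series
print(ints_na.isnull().tolist())    # [False, True, False]
print(ints_na.notnull().tolist())   # [True, False, True]
print(ints_na.dropna().tolist())    # [1.0, 3.0]
print(ints_na.fillna(0).tolist())   # [1.0, 0.0, 3.0]
```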

In [48]:
# Detect null values
data = pd.Series([1,np.nan,'hello',None])
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

In [49]:
# Use the boolean mask from null detection as an index
data[data.notnull()]

0        1
2    hello
dtype: object

In [50]:
# Drop null values
data.dropna()

0        1
2    hello
dtype: object

In [51]:
# dropna() on a DataFrame has several variants; by default it drops every row that contains any NaN
from numpy import nan
data = pd.DataFrame([[1., 6.5, 3.], [1., nan, nan],
                     [nan, nan, nan], [nan, 6.5, 3.]])
cleaned = data.dropna()
print(f'{data}\n\n'
      f'{cleaned}')

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

     0    1    2
0  1.0  6.5  3.0


In [53]:
# Drop only the rows in which every value is NaN
data, data.dropna(how='all')

(     0    1    2
 0  1.0  6.5  3.0
 1  1.0  NaN  NaN
 2  NaN  NaN  NaN
 3  NaN  6.5  3.0,
      0    1    2
 0  1.0  6.5  3.0
 1  1.0  NaN  NaN
 3  NaN  6.5  3.0)

In [60]:
# Choose the axis to operate on
# Note that the how argument accepts 'any' or 'all'
data[4] = nan    # column 4
data, data.dropna(axis=1, how='all') 

(     0    1    2   4
 0  1.0  6.5  3.0 NaN
 1  1.0  NaN  NaN NaN
 2  NaN  NaN  NaN NaN
 3  NaN  6.5  3.0 NaN,
      0    1    2
 0  1.0  6.5  3.0
 1  1.0  NaN  NaN
 2  NaN  NaN  NaN
 3  NaN  6.5  3.0)

In [55]:
# thresh=2 keeps only the rows that have at least 2 non-NaN values
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = nan
df.iloc[:2, 2] = nan
df.iloc[4, :] = nan
print(f'{df}\n\n'
      f'{df.dropna()}\n\n'
      f'{df.dropna(thresh=2)}') 

          0         1         2
0 -0.114399       NaN       NaN
1 -0.595751       NaN       NaN
2 -0.391346       NaN  0.699021
3 -0.869674       NaN  0.204392
4       NaN       NaN       NaN
5  3.431717 -0.108276  1.587292
6 -1.128167 -0.288163  0.013278

          0         1         2
5  3.431717 -0.108276  1.587292
6 -1.128167 -0.288163  0.013278

          0         1         2
2 -0.391346       NaN  0.699021
3 -0.869674       NaN  0.204392
5  3.431717 -0.108276  1.587292
6 -1.128167 -0.288163  0.013278


In [56]:
# Fill null values
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = nan
df.iloc[:2, 2] = nan
df, df.fillna(0)

(          0         1         2
 0  1.255824       NaN       NaN
 1  0.504171       NaN       NaN
 2 -1.257268       NaN  0.712701
 3  0.570519       NaN  0.201638
 4  0.015863  0.996492 -0.033933
 5 -0.108160  0.208272  0.189202
 6 -0.365714  0.196896  1.414097,
           0         1         2
 0  1.255824  0.000000  0.000000
 1  0.504171  0.000000  0.000000
 2 -1.257268  0.000000  0.712701
 3  0.570519  0.000000  0.201638
 4  0.015863  0.996492 -0.033933
 5 -0.108160  0.208272  0.189202
 6 -0.365714  0.196896  1.414097)

In [59]:
# Fill with a different value for each column
df['AAA'] = nan 
df, df.fillna({1: 0.5, 2: 0 , 'AAA':0})

(          0         1         2  AAA
 0  1.255824       NaN       NaN  NaN
 1  0.504171       NaN       NaN  NaN
 2 -1.257268       NaN  0.712701  NaN
 3  0.570519       NaN  0.201638  NaN
 4  0.015863  0.996492 -0.033933  NaN
 5 -0.108160  0.208272  0.189202  NaN
 6 -0.365714  0.196896  1.414097  NaN,
           0         1         2  AAA
 0  1.255824  0.500000  0.000000  0.0
 1  0.504171  0.500000  0.000000  0.0
 2 -1.257268  0.500000  0.712701  0.0
 3  0.570519  0.500000  0.201638  0.0
 4  0.015863  0.996492 -0.033933  0.0
 5 -0.108160  0.208272  0.189202  0.0
 6 -0.365714  0.196896  1.414097  0.0)

In [61]:
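# Missing entries can also be filled from neighbouring values
# ffill/bfill = forward/backward fill
# (newer pandas deprecates fillna(method='ffill') in favour of the equivalent ffill())
# Nothing changes here: each NaN run starts at row 0, so there is no earlier value to carry forward
df.ffill()

          0         1         2  AAA
0  1.255824       NaN       NaN  NaN
1  0.504171       NaN       NaN  NaN
2 -1.257268       NaN  0.712701  NaN
3  0.570519       NaN  0.201638  NaN
4  0.015863  0.996492 -0.033933  NaN
5 -0.108160  0.208272  0.189202  NaN
6 -0.365714  0.196896  1.414097  NaN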


In [64]:
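# axis defaults to 0 (fill along each column); axis=1 fills along each row instead
# (bfill() is the non-deprecated equivalent of fillna(method='bfill'))
df.bfill(axis=1)

          0         1         2  AAA
0  1.255824       NaN       NaN  NaN
1  0.504171       NaN       NaN  NaN
2 -1.257268  0.712701  0.712701  NaN
3  0.570519  0.201638  0.201638  NaN
4  0.015863  0.996492 -0.033933  NaN
5 -0.108160  0.208272  0.189202  NaN
6 -0.365714  0.196896  1.414097  NaN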


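### Binning, de-duplicating, and replacing data
cut() and qcut() bin data into intervals, either by explicit edges or by quantiles  
duplicated() and drop_duplicates() flag and remove duplicate rows, and replace() substitutes values  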

##### Duplicated

In [3]:
data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})
data

    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4


In [5]:
# The last row is True because (two, 4) has already appeared
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [6]:
# Drop duplicate rows
data.drop_duplicates()

    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4


In [9]:
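# You can choose whether to keep the first or the last of the duplicates
# (column v1 is added so the retained row can be identified)
data['v1'] = range(7)
data

    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
5  two   4   5
6  two   4   6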


In [12]:
# Drop duplicates based on column k1 only; the first occurrence is kept by default
data.drop_duplicates(['k1'])

    k1  k2  v1
0  one   1   0
1  two   1   1


In [11]:
# Keep the last occurrence of each duplicate
data.drop_duplicates(['k1', 'k2'], keep='last')

    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6


##### Replace

In [17]:
data = pd.Series([1., -999., 2., -999., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [18]:
# Replace -999 with NaN
data.replace(-999, np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [20]:
# Replace both -999 and -1000 with NaN
data.replace([-999, -1000], np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [22]:
# Per-value replacements can be given either as a pair of lists or as a dict
data.replace([-999, -1000], [np.nan, 0]), data.replace({-999: np.nan, -1000: 0})

(0    1.0
 1    NaN
 2    2.0
 3    NaN
 4    0.0
 5    3.0
 dtype: float64,
 0    1.0
 1    NaN
 2    2.0
 3    NaN
 4    0.0
 5    3.0
 dtype: float64)

##### cut & qcut

In [23]:
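# Bin the data using explicit bin edges
# Note that the first two values come back as NaN: 12 is below the lowest edge, and 18 sits on the open left edge of (18, 25]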
ages = [12, 18, 20, 28, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)
cats

[NaN, NaN, (18.0, 25.0], (25.0, 35.0], (18.0, 25.0], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 14
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
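
Whether an edge value such as 18 lands in a bin depends on which side of each interval is closed. The sketch below uses a shortened, made-up `ages` list with the same bin edges to compare the default with the `include_lowest` and `right` parameters of `pd.cut`; the variable names are illustrative.

```python
import pandas as pd

ages = [12, 18, 20, 28, 25]
bins = [18, 25, 35, 60, 100]

default = pd.cut(ages, bins)                          # (18, 25], (25, 35], ...
closed_low = pd.cut(ages, bins, include_lowest=True)  # first bin also includes 18
left_closed = pd.cut(ages, bins, right=False)         # [18, 25), [25, 35), ...

# A code of -1 marks a value that fell outside every bin
print(default.codes.tolist())      # [-1, -1, 0, 1, 0]
print(closed_low.codes.tolist())   # [-1, 0, 0, 1, 0]
print(left_closed.codes.tolist())  # [-1, 0, 0, 1, 1]
```

Note how 25 moves from the first bin to the second once the intervals are closed on the left instead of the right.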

In [33]:
# The codes attribute gives the index of the bin each value falls into
# The first two values are NaN, so their code is -1
print(cats.codes, '\n')
print(cats.categories, '\n')
print(pd.value_counts(cats))

[-1 -1  0  1  0  1  0  0  2  1  3  2  2  1] 

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], dtype='interval[int64, right]') 

(18, 25]     4
(25, 35]     4
(35, 60]     3
(60, 100]    1
dtype: int64


In [26]:
# Assign names to the resulting categories
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
print(f'ages\t : {ages}\n\n'
      f'bins\t : {bins}\n\n'
      f'pd.cut\t :\n{pd.cut(ages, bins, labels=group_names)}')

ages	 : [12, 18, 20, 28, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

bins	 : [18, 25, 35, 60, 100]

pd.cut	 :
[NaN, NaN, 'Youth', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'Senior', 'MiddleAged', 'MiddleAged', 'YoungAdult']
Length: 14
Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']


In [34]:
# Split into 4 equal-sized quantile bins
data = np.random.randn(1000)  # Normally distributed
cats = pd.qcut(data, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])  # Cut into quartiles
print(f'data[:20] :\n{data[:20]}\n\n'
      f'cats :\n{cats}\n\n'
      f'cats.codes[:50] :\n{cats.codes[:50]}\n\n'
      f'pd.value_counts(cats) :\n{pd.value_counts(cats)}')

data[:20] :
[-0.76529235  1.01057972  2.19724265  1.00936626  0.77427903 -0.24245216
  0.62585299 -0.31176085  0.31793325  0.24502484  0.05073809  0.44774607
 -3.87935164 -0.67466071 -1.12835202 -0.09279127  1.13196589  1.57864755
  2.27146676 -0.02522283]

cats :
['Q1', 'Q4', 'Q4', 'Q4', 'Q4', ..., 'Q1', 'Q4', 'Q4', 'Q1', 'Q2']
Length: 1000
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']

cats.codes[:50] :
[0 3 3 3 3 1 2 1 2 2 2 2 0 0 0 1 3 3 3 1 0 3 2 3 1 1 1 1 0 2 2 2 1 1 1 1 3
 0 1 0 2 3 0 2 1 0 3 3 2 0]

pd.value_counts(cats) :
Q1    250
Q2    250
Q3    250
Q4    250
dtype: int64


In [35]:
# Split at custom, unequal quantiles
cats = pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.], precision=5)
print(f'{data[:20]}\n\n'
      f'{cats}\n\n'
      f'{pd.value_counts(cats)}')

[-0.76529235  1.01057972  2.19724265  1.00936626  0.77427903 -0.24245216
  0.62585299 -0.31176085  0.31793325  0.24502484  0.05073809  0.44774607
 -3.87935164 -0.67466071 -1.12835202 -0.09279127  1.13196589  1.57864755
  2.27146676 -0.02522283]

[(-1.25405, 0.0053047], (0.0053047, 1.30091], (1.30091, 3.22979], (0.0053047, 1.30091], (0.0053047, 1.30091], ..., (-1.25405, 0.0053047], (1.30091, 3.22979], (0.0053047, 1.30091], (-1.25405, 0.0053047], (-1.25405, 0.0053047]]
Length: 1000
Categories (4, interval[float64, right]): [(-3.87936, -1.25405] < (-1.25405, 0.0053047] < (0.0053047, 1.30091] < (1.30091, 3.22979]]

(-1.25405, 0.0053047]    400
(0.0053047, 1.30091]     400
(-3.87936, -1.25405]     100
(1.30091, 3.22979]       100
dtype: int64
