## Python, Pandas, and NumPy data types
|Data type|Python type|Pandas dtype|NumPy type|
|-|-|-|-|
|String / non-numeric|str or mixed|object|string_, unicode_, mixed types|
|Integer|int|int64|int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64|
|Floating point|float|float64|float_, float16, float32, float64|
|Boolean|bool|bool|bool_|
|Date and time|datetime|datetime64[ns]|datetime64|
|Time difference|NA|timedelta64[ns]|NA|
|Categorical|NA|category|NA|
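
As a quick check of the table above, the sketch below builds a few small Series and prints their `dtype`. The sample values and variable names are illustrative assumptions, not data from the notebook, and the exact unit suffix on the datetime and timedelta dtypes may vary with the pandas version.

```python
import pandas as pd

# Illustrative sample values; they are assumptions, not data from the notebook.
text = pd.Series(["a", "b"], dtype=object)            # strings stored as object
ints = pd.Series([1, 2, 3])                           # int64
floats = pd.Series([1.5, 2.5])                        # float64
flags = pd.Series([True, False])                      # bool
dates = pd.to_datetime(pd.Series(["2021-01-01", "2021-01-02"]))  # datetime64[...]
deltas = dates - dates.iloc[0]                        # timedelta64[...]
cats = pd.Series(["x", "y", "x"], dtype="category")   # category

for name, ser in [("text", text), ("ints", ints), ("floats", floats),
                  ("flags", flags), ("dates", dates), ("deltas", deltas),
                  ("cats", cats)]:
    print(f"{name:7s} {ser.dtype}")
```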

In [8]:
import numpy as np
import pandas as pd

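## Creating a Series
* Create a Series from a list
* Create a Series from a tuple
* Create a Series from a dict: dict keys are unique, so if the same key appears more than once, the later value overwrites the earlier one
* Create a Series from an array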
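
The four construction routes listed above can be sketched side by side. The values and the names `from_list`, `from_tuple`, `from_dict`, and `from_array` are illustrative assumptions; the repeated key `10` is there only to show the overwrite behaviour of a dict.

```python
import numpy as np
import pandas as pd

from_list = pd.Series([11, 22, 33])
from_tuple = pd.Series((11, 22, 33))

# A dict literal keeps only the last value for a repeated key,
# so the resulting Series has a single entry for key 10.
from_dict = pd.Series({10: 66, 11: 77, 10: 99})

from_array = pd.Series(np.array([1.5, 2.5, 3.5]))

print(from_dict)
```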

In [45]:
s1 = pd.Series([11,22,33,44,55], index = list("adcbe"))
print(s1)
s2 = pd.Series({10:66, 11:77, 12:88})
print(s2)
s = pd.Series(np.arange(10), index = range(0, 10))
print(s)

a    11
d    22
c    33
b    44
e    55
dtype: int64
10    66
11    77
12    88
dtype: int64
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
dtype: int64


## Selecting elements from a Series

In [46]:
print(s[5])
print()
print(s[1:9:2])
print()
print(s[list(range(1, 9, 2))])

5

1    1
3    3
5    5
7    7
dtype: int64

1    1
3    3
5    5
7    7
dtype: int64


In [47]:
print(s[s<7])
print()
print(s[(2 < s) & (s < 7)])
print()
print(s[2:7][s<7])

0    0
1    1
2    2
3    3
4    4
5    5
6    6
dtype: int64

3    3
4    4
5    5
6    6
dtype: int64

2    2
3    3
4    4
5    5
6    6
dtype: int64
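
The cells above index with plain `[]` on a Series whose labels happen to equal its positions. When labels and positions differ, label-based and position-based lookups return different elements. The sketch below uses `.loc` and `.iloc`, which the original notebook does not use, on a small hypothetical Series `s_demo`.

```python
import pandas as pd

# Hypothetical Series whose integer labels are the reverse of the positions.
s_demo = pd.Series([10, 20, 30, 40], index=[3, 2, 1, 0])

by_label = s_demo.loc[0]       # element whose label is 0
by_position = s_demo.iloc[0]   # first element by position

# Boolean masks combine with & and need parentheses around each comparison.
masked = s_demo[(s_demo > 10) & (s_demo < 40)]

print(by_label, by_position)
print(masked)
```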


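## Adding elements to a Series
Appending does not modify the original Series; it returns a new Series. The original notebook used `Series.append`, which was deprecated in pandas 1.4 and removed in pandas 2.0; `pd.concat` gives the same result.

In [29]:
# On pandas < 2.0 this was written as s1.append(s2, ignore_index=True)
pd.concat([s1, s2], ignore_index=True)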

0    11
1    22
2    33
3    44
4    55
5    66
6    77
7    88
dtype: int64
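
To confirm that combining two Series returns a new object and leaves the inputs untouched, here is a small sketch with `pd.concat`. The names `left`, `right`, and `combined` are illustrative assumptions.

```python
import pandas as pd

left = pd.Series([11, 22, 33])
right = pd.Series([66, 77])

# ignore_index=True renumbers the result 0..n-1 instead of keeping
# the original labels of each input.
combined = pd.concat([left, right], ignore_index=True)

print(len(left), len(right), len(combined))
```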

## Updating elements in a Series

In [48]:
s[1] = 10
s[8:] = 80
s

0     0
1    10
2     2
3     3
4     4
5     5
6     6
7     7
8    80
9    80
dtype: int64

## Deleting elements from a Series

In [49]:
print(s.drop(3))
s = s.drop([1, 2, 5])
print(s)

0     0
1    10
2     2
4     4
5     5
6     6
7     7
8    80
9    80
dtype: int64
0     0
3     3
4     4
6     6
7     7
8    80
9    80
dtype: int64
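
As the two printed results suggest, `drop` returns a new Series and only changes `s` when the result is assigned back. A minimal sketch with a hypothetical string-labelled Series `s_demo`:

```python
import pandas as pd

s_demo = pd.Series([0, 10, 2, 3], index=list("abcd"))

# drop takes labels (not positions) and returns a new Series.
dropped = s_demo.drop(["b", "c"])

print(list(s_demo.index))   # original labels are all still present
print(list(dropped.index))
```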


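## Basic Series operations
- Series.describe(): descriptive statistics
- Series.mean(): mean
- Series.median(): median
- Series.sum(): sum
- Series.std(): standard deviation
- Series.mode(): mode
- Series.max(): maximum value
- Series.idxmax(): index label of the maximum value
- Series.value_counts(): number of occurrences of each value
- Series.agg(): aggregation with one or more functions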
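
Because the notebook cells draw random numbers, their results change on every run. The sketch below applies a few of the listed methods to fixed values (the nine values printed in the first random example of this section), so the results are reproducible. The names `scores` and `summary` are illustrative.

```python
import pandas as pd

scores = pd.Series([29, 25, 27, 9, 12, 1, 12, 13, 11])

print(scores.sum())            # total of all values
print(scores.median())         # middle value after sorting
print(scores.idxmax())         # index label of the largest value
print(scores.mode().tolist())  # mode() returns a Series; there may be ties

# agg accepts a list of function names and returns one row per function.
summary = scores.agg(["sum", "max", "min"])
print(summary)
```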

In [61]:
s = pd.Series(np.random.randint(30, size = 9))
print(s)
print()
print(s.describe())
print()
print(s.mean(), "\t", s.median(), "\t", s.sum(), "\t", s.std(), "\t", s.mode(), "\t", s.max(), "\t", s.idxmax())
print()
print(s.value_counts())

0    29
1    25
2    27
3     9
4    12
5     1
6    12
7    13
8    11
dtype: int64

count     9.000000
mean     15.444444
std       9.408920
min       1.000000
25%      11.000000
50%      12.000000
75%      25.000000
max      29.000000
dtype: float64

15.444444444444445 	 12.0 	 139 	 9.408920117514963 	 0    12
dtype: int64 	 29 	 0

12    2
29    1
25    1
27    1
9     1
1     1
13    1
11    1
dtype: int64


In [52]:
print(s.agg(['sum', 'max', 'mean', 'median', 'std', 'min', 'count']))

sum       82.000000
max       23.000000
mean       9.111111
median     7.000000
std        7.490735
min        0.000000
count      9.000000
dtype: float64


## Series arithmetic

In [59]:
s1 = pd.Series(np.random.randint(20, size = 5))
s2 = pd.Series(np.random.randint(10, size = 5))
print(s1)
print()
print(s2)
print()
print(s1 + s2)
print()
print(s1 - s2)
print()
print(s1 * s2)
print()
print(s1 / s2)
#print(s1 / 2)
#print(s1 % 2)
#print(s1**2)
#print(np.sqrt(s1))
#print(np.log(s1))

0    11
1     1
2    18
3     1
4     7
dtype: int64

0    4
1    4
2    4
3    8
4    9
dtype: int64

0    15
1     5
2    22
3     9
4    16
dtype: int64

0     7
1    -3
2    14
3    -7
4    -2
dtype: int64

0    44
1     4
2    72
3     8
4    63
dtype: int64

0    2.750000
1    0.250000
2    4.500000
3    0.125000
4    0.777778
dtype: float64
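
In the cell above both Series share the labels 0 to 4, so the operations look element-wise by position. Arithmetic between Series actually aligns on index labels, which the following sketch illustrates with two hypothetical Series `a` and `b` whose labels only partly overlap.

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

# Labels present in only one operand produce NaN in the result.
total = a + b

# The add method with fill_value treats a missing operand as 0 instead.
filled = a.add(b, fill_value=0)

print(total)
print(filled)
```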


## Sorting a Series
`s.sort_values()` returns a sorted copy and leaves `s` unchanged. To overwrite the original data, write `s = s.sort_values()` or `s.sort_values(inplace=True)`.
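
The difference between the returned copy and an in-place sort can be sketched as follows; `s_demo`, `sorted_copy`, and `before` are illustrative names.

```python
import pandas as pd

s_demo = pd.Series([15, 18, 11, 2])

sorted_copy = s_demo.sort_values()   # new Series; s_demo keeps its order
before = s_demo.tolist()

s_demo.sort_values(inplace=True)     # now s_demo itself is reordered
after = s_demo.tolist()

print(before, after)
```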

In [62]:
s = pd.Series(np.random.randint(30, size = 8))
print(s)
print()
print(s.sort_values())
print()
print(s.sort_values(ascending=False))

0    15
1    18
2    11
3     2
4     8
5     4
6    11
7    17
dtype: int64

3     2
5     4
4     8
2    11
6    11
0    15
7    17
1    18
dtype: int64

1    18
7    17
0    15
2    11
6    11
4     8
5     4
3     2
dtype: int64


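## Handling missing values in a Series
- Values treated as missing (NA): None, numpy.nan
- Values not treated as missing: "", False, np.inf
- To check whether data contains missing values in pandas, use isnull() or isna(). Alternatively, count the non-missing values with notnull() or notna() and compare that count with the total length.
- If the data contains NA values that need to be filled, use fillna(value).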
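
The rules above can be checked on a small fixed example. This is a minimal sketch; the Series `values` is a hypothetical mix of missing and non-missing entries stored with `dtype=object`.

```python
import numpy as np
import pandas as pd

# None and np.nan count as missing; "", False and np.inf do not.
values = pd.Series([19.0, None, "", False, np.inf, np.nan], dtype=object)

mask = values.isna()                  # True where the value is missing
n_missing = int(mask.sum())
n_present = int(values.notna().sum())

filled = values.fillna(0)             # replace every missing value with 0

print(mask.tolist())
print(n_missing, n_present)
```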

In [66]:
s = pd.Series(np.random.randint(30, size = 10))
s[1] = None
s[3] = ""
s[5] = False
s[9] = np.nan
s[12] = np.inf

print(s.values, len(s))
print()
print("isnull (is each value NA?):", s.isnull().values)
print()
print("isna (is each value NA?):", s.isna().values)
print()
print("Non-NA count via isnull: {}, isna: {}".format(len(s) - s.isnull().sum(), len(s) - s.isna().sum()))
print()
print("Non-NA count via notnull: {}, notna: {}".format(s.notnull().sum(), s.notna().sum()))

[19.0 nan 23.0 '' 9.0 False 17.0 19.0 18.0 nan inf] 11

isnull (is each value NA?): [False  True False False False False False False False  True False]

isna (is each value NA?): [False  True False False False False False False False  True False]

Non-NA count via isnull: 9, isna: 9

Non-NA count via notnull: 9, notna: 9


In [67]:
print(s.fillna(0))
print()
print(s.fillna({1: 0, 3: 1, 5: 2, 7: 3, 9: 4, 11: 5}))

0      19.0
1         0
2      23.0
3          
4       9.0
5     False
6      17.0
7      19.0
8      18.0
9         0
12      inf
dtype: object

0      19.0
1       0.0
2      23.0
3          
4       9.0
5     False
6      17.0
7      19.0
8      18.0
9       4.0
12      inf
dtype: object


## Concatenating Series vertically and horizontally

In [70]:
s1 = pd.Series(np.random.rand(10), index = range(0,10))
s2 = pd.Series(np.random.rand(10), index = range(10,20))
s3 = pd.concat([s1, s2])
print(s3)
s4 = pd.concat([s1, s2],axis=1)
print(s4)

0     0.872792
1     0.886681
2     0.616811
3     0.933825
4     0.452563
5     0.383057
6     0.663677
7     0.901769
8     0.950067
9     0.525637
10    0.154941
11    0.557717
12    0.682394
13    0.810360
14    0.526413
15    0.076177
16    0.771774
17    0.045212
18    0.649675
19    0.996822
dtype: float64
           0         1
0   0.872792       NaN
1   0.886681       NaN
2   0.616811       NaN
3   0.933825       NaN
4   0.452563       NaN
5   0.383057       NaN
6   0.663677       NaN
7   0.901769       NaN
8   0.950067       NaN
9   0.525637       NaN
10       NaN  0.154941
11       NaN  0.557717
12       NaN  0.682394
13       NaN  0.810360
14       NaN  0.526413
15       NaN  0.076177
16       NaN  0.771774
17       NaN  0.045212
18       NaN  0.649675
19       NaN  0.996822
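
The NaN blocks in the horizontal result come from index alignment: `s1` and `s2` share no labels, so each row has a value in only one column. A sketch with two hypothetical Series whose indexes partly overlap shows the default outer join next to `join="inner"`.

```python
import pandas as pd

left = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])
right = pd.Series([10.0, 20.0, 30.0], index=[1, 2, 3])

# axis=1 places each Series in its own column and aligns rows by label.
wide = pd.concat([left, right], axis=1)                 # union of labels
inner = pd.concat([left, right], axis=1, join="inner")  # shared labels only

print(wide)
print(inner)
```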


## Converting a Series to a DataFrame

In [69]:
df = s.to_frame()
df

| |0|
|-|-|
|0|19.0|
|1|NaN|
|2|23.0|
|3| |
|4|9.0|
|5|False|
|6|17.0|
|7|19.0|
|8|18.0|
|9|NaN|
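
An unnamed Series becomes a single column labelled `0`. Passing `name` to `to_frame` gives the column a readable label; the Series `s_demo` and the column name `"value"` below are illustrative assumptions.

```python
import pandas as pd

s_demo = pd.Series([19.0, 23.0, 9.0])

# Without name, the single column would be labelled 0.
df_demo = s_demo.to_frame(name="value")

print(df_demo)
```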
