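NumPy is the most fundamental package for scientific computing in Python. Its functionality centers on the **ndarray (n-dimensional array)** object.<br>
Because an ndarray can represent data with more than two dimensions, this part walks through the features and functions that NumPy provides.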

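# Introduction to NumPy



Fast array operations are indispensable for many downstream applications.<br>Every package in the figure below uses NumPy, which shows how important NumPy is to scientific computing.
![picture](https://drive.google.com/uc?id=1ENCioG_oQBfa2UdjVghAA9MTILllFw2z)

Among machine learning projects on GitHub, NumPy is imported at a very high rate, which shows how popular it is.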

![picture](https://drive.google.com/uc?id=1JU1w9sz6sx3nq04_2ceAweVZit3RkkAV)

## Loading the package

In [None]:
# !pip install numpy
import numpy as np

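# Basic NumPy operations

NumPy's main job is to build multidimensional arrays and run fast numerical operations on them. The figure below illustrates what a multidimensional array is.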

![picture](https://drive.google.com/uc?id=1tgREdFA8EeNvrQIg4azzCqd8MoadCWZB)

## 1 Creating an ndarray

### Creation

In [None]:
# 1. Convert a list to a numpy.ndarray to use the features NumPy provides
a = np.array([1, 2, 3])
print(type(a))
print(a.shape)

b = np.array([[1,2,3]])
print(type(b))
print(b.shape)

c = np.array([[1,2,3,98],[4,5,6,7],[9,10,11,45]])
print(type(c))
print(c.shape)

d = np.array([[[1,2,3],[4,5,6]]])
print(type(d))
print(d.shape)

<class 'numpy.ndarray'>
(3,)
<class 'numpy.ndarray'>
(1, 3)
<class 'numpy.ndarray'>
(3, 4)
<class 'numpy.ndarray'>
(1, 2, 3)


# *Quiz
Build an ndarray whose shape is (3, 2).

In [None]:
#@title *Answer

a = np.array([[1,2], [3,4], [5,6]])
print(f"a shape : \n{a.shape}", end = '\n-------------\n')

a shape : 
(3, 2)
-------------


In [None]:
# 2. NumPy functions can create many kinds of numpy.ndarray

# All zeros
a = np.zeros((2,2), dtype=int)
print(f"zeros:\n{a}", end = '\n-------------\n')

# All ones
b = np.ones((1,2))
print(f"ones:\n{b}", end = '\n-------------\n')

# Every element set to a value you choose
c = np.full((2,2), 7)
print(f"full:\n{c}", end = '\n-------------\n')

# An array with 1 only on the main diagonal (identity matrix)
d = np.eye(7)
print(f"eye:\n{d}", end = '\n-------------\n')

# A range of values
e = np.arange(10, 100, 10) # np.arange(start, stop, step); stop is excluded
print(f"arange:\n{e}", end = '\n-------------\n')


zeros:
[[0 0]
 [0 0]]
-------------
ones:
[[1. 1.]]
-------------
full:
[[7 7]
 [7 7]]
-------------
eye:
[[1. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 1.]]
-------------
arange:
[10 20 30 40 50 60 70 80 90]
-------------
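
As a complement to `np.arange`, the minimal sketch below contrasts it with `np.linspace`, which takes a number of points instead of a step and includes the end value by default. The variable names `by_step` and `by_count` are illustrative only.

```python
import numpy as np

# np.arange takes a step and excludes the stop value
by_step = np.arange(10, 100, 10)
print(by_step)

# np.linspace takes a point count and includes the stop value by default
by_count = np.linspace(10, 90, 9)
print(by_count)

# The values match, but linspace returns floats
print(by_count.dtype)
```

`np.arange` is usually the natural choice for integer steps, while `np.linspace` avoids rounding surprises when a fixed number of evenly spaced floats is needed.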


In [None]:
# 3. numpy.random can create random numpy.ndarray objects

# Uniform distribution over [0, 1)
a = np.random.rand(2, 3) # np.random.rand(d0, d1, ...)
print(f"rand:\n{a}", end = '\n-------------\n')

# Normal distribution with mean 0 and variance 1
b = np.random.randn(3, 2) # np.random.randn(d0, d1, ...)
print(f"randn:\n{b}", end = '\n-------------\n')

# Uniformly distributed integers in [low, high)
c = np.random.randint(50, 100, size=(2, 2)) # np.random.randint(low, high, size)
print(f"randint:\n{c}", end = '\n-------------\n')


rand:
[[0.93707112 0.27019498 0.8557732 ]
 [0.01754468 0.68041822 0.77668835]]
-------------
randn:
[[ 0.25926792  0.43590788]
 [-0.76153446  0.8987624 ]
 [ 0.07738535 -0.74974202]]
-------------
randint:
[[97 83]
 [55 83]]
-------------
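
The random values above change on every run. When reproducible numbers are needed, one option is a seeded generator from `np.random.default_rng`. The sketch below assumes an arbitrary seed of 42 and illustrative variable names.

```python
import numpy as np

# Two generators created with the same seed produce the same sequence
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

# integers(low, high) draws from [low, high), like np.random.randint
sample_a = rng_a.integers(50, 100, size=(2, 2))
sample_b = rng_b.integers(50, 100, size=(2, 2))

print(np.array_equal(sample_a, sample_b))  # True
```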


### Data types

In [None]:
a = np.arange(3)

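# type(a) is the type of the object a
print(f"type(a): {type(a)}")
# a.dtype is the data type of the ndarray's elements
print(f"a.dtype: {a.dtype}")

# dtype operations: the dtype argument and .astype set an array's data type

# Specify the data type with dtype when creating the ndarray
b = np.arange(3, dtype='float32')
print(f"b.dtype: {b.dtype}")

# Convert the ndarray to another data type with .astype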
b = b.astype('int8')
print(f"b.dtype: {b.dtype}")


type(a): <class 'numpy.ndarray'>
a.dtype: int64
b.dtype: float32
b.dtype: int8
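
Converting with `.astype` can change the values, not just the type. The short sketch below (with made-up sample values) shows that a float-to-integer conversion drops the fractional part, and that rounding first is one way to get the nearest integer instead.

```python
import numpy as np

values = np.array([1.7, -1.7, 2.0])

# Converting float to int truncates toward zero
as_int = values.astype('int32')
print(as_int)   # [ 1 -1  2]

# Rounding before the conversion keeps the nearest integer
rounded = np.round(values).astype('int32')
print(rounded)  # [ 2 -2  2]
```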


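### Dimensions

In [None]:
# The .shape, .ndim, and .size attributes report an array's dimensions and size
a = np.array([[1, 3, 5], [2, 4, 6]])

# .shape gives the length of each dimension
print(f"a.shape: {a.shape}")

# .ndim gives the number of dimensions
print(f"a.ndim: {a.ndim}")

# .size gives the total number of elements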
print(f"a.size: {a.size}")

a.shape: (2, 3)
a.ndim: 2
a.size: 6


In [None]:
# Dimension operations: reshape

# Create a one-dimensional array with 6 elements
a = np.arange(6)
print(a, a.shape, end = '\n-------------\n')

# Use reshape to change the array to shape (3, 2)
new_a = a.reshape(3, 2)
print(new_a, new_a.shape, end = '\n-------------\n')

# Use reshape to change the array to shape (2, 3, 1)
new_a2 = np.arange(6).reshape(2, 3, 1)
print(new_a2, new_a2.shape, end = '\n-------------\n')

# To skip working out one dimension, pass -1 and NumPy computes it
new_a3 = a.reshape(2, -1)
print(new_a3, new_a3.shape)

[0 1 2 3 4 5] (6,)
-------------
[[0 1]
 [2 3]
 [4 5]] (3, 2)
-------------
[[[0]
  [1]
  [2]]

 [[3]
  [4]
  [5]]] (2, 3, 1)
-------------
[[0 1 2]
 [3 4 5]] (2, 3)
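
One detail worth knowing about `reshape`: for a contiguous array like the one above it normally returns a view that shares memory with the original, so writing through the reshaped array also changes the source. The following is a minimal sketch with illustrative names.

```python
import numpy as np

a = np.arange(6)
view = a.reshape(2, 3)

# The reshaped array shares memory with the original
print(np.shares_memory(a, view))  # True

# Writing through the view changes the original array
view[0, 0] = 99
print(a[0])  # 99

# .copy() gives an independent array
independent = a.reshape(2, 3).copy()
independent[0, 1] = -1
print(a[1])  # 1
```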


## 2 Selecting from an ndarray

### Indexing

In [None]:
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
print(f"a:\n{a}", end = '\n-------------\n')

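# Index an ndarray with integers in square brackets; for a 2-D array, the index before the comma selects the row and the one after it selects the column
print(f"a[1]:\n{a[1]}", end = '\n-------------\n')
print(f"a[1, 1]:\n{a[1, 1]}", end = '\n-------------\n')
print(f"a[1][1]:\n{a[1][1]}", end = '\n-------------\n')

# A colon selects a range of values
# Here rows 0 to 1 and columns 1 to 2 are selected (the end of a slice is excluded)
b = a[0:2, 1:3]
print(f"a[0:2, 1:3]:\n{b}", end = '\n-------------\n')

# A list of indices can pick non-adjacent entries
# Here rows 0 and 2 and columns 1 to 2 are selected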
c = a[[0, 2], 1:3]
print(f"a[[0, 2], 1:3]:\n{c}")

a:
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
-------------
a[1]:
[5 6 7 8]
-------------
a[1, 1]:
6
-------------
a[1][1]:
6
-------------
a[0:2, 1:3]:
[[2 3]
 [6 7]]
-------------
a[[0, 2], 1:3]:
[[ 2  3]
 [10 11]]
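
The two selection styles above differ in one practical way: a slice returns a view of the original data, while indexing with a list returns a copy. The sketch below illustrates this with the same array; the names `sliced` and `picked` are illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

sliced = a[0:2, 1:3]     # basic slicing: a view of a
picked = a[[0, 2], 1:3]  # list indexing: a copy

sliced[0, 0] = 100       # also changes a[0, 1]
picked[0, 1] = -1        # does not change a[0, 2]

print(a[0, 1])  # 100
print(a[0, 2])  # 3
```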


### Conditional selection

In [None]:
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
print(f"a:\n{a}", end = '\n-------------\n')

# An ndarray can also be indexed with a condition (boolean expression)
condition = (a < 7) & (a > 5)
d = a[condition]
print(f"a[(a < 7) & (a > 5)]:\n{d}", end = '\n-------------\n')

# The elements that satisfy a condition can be thought of as a boolean array used as a mask
mask = np.array([[0, 0, 0, 0], [0, 1, 1, 1], [1, 1, 1, 1]], dtype=bool)
print(f"a[mask]:\n{a[mask]}", end = '\n-------------\n')

# The same condition can also replace the matching elements directly
a[condition] = 0
print(f"a after a[condition] = 0:\n{a}", end = '\n-------------\n')


a:
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
-------------
a[(a < 7) & (a > 5)]:
[6]
-------------
a[mask]:
[ 6  7  8  9 10 11 12]
-------------
a after a[condition] = 0:
[[ 1  2  3  4]
 [ 5  0  7  8]
 [ 9 10 11 12]]
-------------


In [None]:
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
print(f"a:\n{a}", end = '\n-------------\n')

# Search an ndarray and return the indices of the matching elements
e = np.where(a > 5)
print(f"search a > 5 index from a: \n{e}")
print(f"search a > 5:\n{a[e]}")

# np.where can also build a new array whose values are chosen by the condition
np.where(a < 3, 0, a) # np.where(condition, value if True, value if False)

a:
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
-------------
search a > 5 index from a: 
(array([1, 1, 1, 2, 2, 2, 2]), array([1, 2, 3, 0, 1, 2, 3]))
search a > 5:
[ 6  7  8  9 10 11 12]


array([[ 0,  0,  3,  4],
       [ 5,  6,  7,  8],
       [ 9, 10, 11, 12]])
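
Note that the three-argument form of `np.where` returns a new array and leaves the input untouched. The sketch below, using a smaller made-up array, contrasts it with in-place assignment through a boolean mask.

```python
import numpy as np

a = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])

# np.where builds a new array; a itself is unchanged
replaced = np.where(a < 3, 0, a)
print(a[0, 0])         # 1
print(replaced[0, 0])  # 0

# Assigning through a boolean mask modifies a in place
a[a < 3] = 0
print(np.array_equal(a, replaced))  # True
```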

# *Quiz
Generate a random integer ndarray of shape (4, 2) with values between 10 and 30.
Change every number greater than 15 to 999 and leave the others unchanged.

In [None]:
#@title *Answer
c = np.random.randint(10, 30, size=(4, 2)) # np.random.randint(low, high, size); high is excluded
print(f"randint:\n{c}", end = '\n-------------\n')
np.where(c > 15, 999, c) # np.where(condition, value if True, value if False)

randint:
[[18 26]
 [29 13]
 [14 15]
 [15 29]]
-------------


array([[999, 999],
       [999,  13],
       [ 14,  15],
       [ 15, 999]])

### Extreme values

In [None]:
# Create a two-dimensional array
a = np.array([[3, 1, 6, 4, 2, 5],[3, 1, 6, 4, 2, 10]])
print(f"a:\n{a}", end = '\n-------------\n')

a:
[[ 3  1  6  4  2  5]
 [ 3  1  6  4  2 10]]
-------------


In [None]:
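# Use np.max to find the largest element in the array
max_elm = np.max(a)
print('Largest value in array a: ', max_elm)

# Use np.min to find the smallest element in the array
min_elm = np.min(a)
print('Smallest value in array a: ', min_elm)

Largest value in array a:  10
Smallest value in array a:  1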


In [None]:
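# Use np.argmax to find the index of the largest element
# np.argmax returns an index into the flattened array, so np.unravel_index converts it to a (row, column) index
max_idx = np.unravel_index(np.argmax(a), a.shape)
print('Index of the largest value in array a:', max_idx)
print('Largest value in array a: ', a[max_idx])

# Use np.argmin to find the index of the smallest element, converted the same way
min_idx = np.unravel_index(np.argmin(a), a.shape)
print('Index of the smallest value in array a:', min_idx)
print('Smallest value in array a: ', a[min_idx])

Index of the largest value in array a: (1, 5)
Largest value in array a:  10
Index of the smallest value in array a: (0, 1)
Smallest value in array a:  1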
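
The functions above search the whole array. Passing `axis` restricts the search to columns or rows instead. A minimal sketch on the same data, with illustrative variable names:

```python
import numpy as np

a = np.array([[3, 1, 6, 4, 2, 5], [3, 1, 6, 4, 2, 10]])

# axis=0 compares down the rows: one result per column
col_max = np.max(a, axis=0)
print(col_max)     # [ 3  1  6  4  2 10]

# axis=1 compares across the columns: one result per row
row_argmax = np.argmax(a, axis=1)
print(row_argmax)  # [2 5]
```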


## 3 ndarray operations

In [None]:
# Ordinary arithmetic (+, -, *, /) between two arrays is applied element by element
x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])

print(f"x:\n{x}", end = '\n-------------\n')

print(f"y:\n{y}", end = '\n-------------\n')

print(f"x + y:\n{x + y}", end = '\n-------------\n')

print(f"x - y:\n{x - y}", end = '\n-------------\n')

print(f"x * y:\n{x * y}", end = '\n-------------\n')

print(f"x / y:\n{x / y}", end = '\n-------------\n')

# np.sqrt(): take the square root of each element
print(f"sqrt(x):\n{np.sqrt(x)}", end = '\n-------------\n')

# .T: transpose the array
print(f"x.T:\n{x.T}")

x:
[[1 2]
 [3 4]]
-------------
y:
[[5 6]
 [7 8]]
-------------
x + y:
[[ 6  8]
 [10 12]]
-------------
x - y:
[[-4 -4]
 [-4 -4]]
-------------
x * y:
[[ 5 12]
 [21 32]]
-------------
x / y:
[[0.2        0.33333333]
 [0.42857143 0.5       ]]
-------------
sqrt(x):
[[1.         1.41421356]
 [1.73205081 2.        ]]
-------------
x.T:
[[1 3]
 [2 4]]


In [None]:
# Arrays also support the dot product
v = np.array([9,10])
w = np.array([11,12])

x = np.array([[1,2],[3,4]])
y = np.array([[5,6],[7,8]])

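# Note: for the dot product of 2-D arrays, the last dimension of the first array must match the first dimension of the second array
# For example, if the first array is 1 x 2, the second must be 2 x N, where N can be any number
# If the first array is N-D and the second is 1-D, the result is the sum of products over the last axis of the first array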
print(f"v:\n{v}", end = '\n-------------\n')

print(f"w:\n{w}", end = '\n-------------\n')

print(f"x:\n{x}", end = '\n-------------\n')

print(f"y:\n{y}", end = '\n-------------\n')

print(f"np.dot(v, w):\n{np.dot(v, w)}", end = '\n-------------\n')

print(f"np.dot(x, y):\n{np.dot(x, y)}", end = '\n-------------\n')

print(f"np.dot(v, x):\n{np.dot( v, x)}", end = '\n-------------\n')

print(f"np.dot(x, v):\n{np.dot( x, v)}")

v:
[ 9 10]
-------------
w:
[11 12]
-------------
x:
[[1 2]
 [3 4]]
-------------
y:
[[5 6]
 [7 8]]
-------------
np.dot(v, w):
219
-------------
np.dot(x, y):
[[19 22]
 [43 50]]
-------------
np.dot(v, x):
[39 58]
-------------
np.dot(x, v):
[29 67]


In [None]:
x = np.array([[1,2,3],[4,5,6]])
y = np.array([[7,8,9],[10,11,12]])
print(x.shape, end = '\n-------------\n')
print(y.shape, end = '\n-------------\n')

print(f"y.T:\n{y.T}", end = '\n-------------\n')
print(f"np.dot(x, y.T):\n{np.dot( x, y.T)}")

(2, 3)
-------------
(2, 3)
-------------
y.T:
[[ 7 10]
 [ 8 11]
 [ 9 12]]
-------------
np.dot(x, y.T):
[[ 50  68]
 [122 167]]
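
For 1-D and 2-D arrays, the `@` operator gives the same result as `np.dot` and is often easier to read. A short sketch reusing the values from the cells above:

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])
v = np.array([9, 10])

# Matrix product written with the @ operator
print(np.array_equal(x @ y, np.dot(x, y)))  # True

# Matrix-vector product
matvec = x @ v
print(matvec)  # [29 67]
```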


In [None]:
# Arithmetic between two arrays does not require identical shapes; NumPy broadcasts them when the shapes are compatible
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10,11,12]])
y = np.array([1,0,1])

print(f"x:\n{x}", end = '\n-------------\n')

print(f"y:\n{y}", end = '\n-------------\n')

print(f"x + y:\n{x + y}")

x:
[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]
-------------
y:
[1 0 1]
-------------
x + y:
[[ 2  2  4]
 [ 5  5  7]
 [ 8  8 10]
 [11 11 13]]
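
Broadcasting compares shapes from the last dimension backwards: two dimensions are compatible when they are equal or when one of them is 1. The sketch below shows a row vector, a column vector, and an incompatible shape; the array values are made up for illustration.

```python
import numpy as np

x = np.arange(1, 13).reshape(4, 3)        # shape (4, 3)
row = np.array([1, 0, 1])                 # shape (3,): added to every row
col = np.array([[10], [20], [30], [40]])  # shape (4, 1): added to every column

print((x + row).shape)  # (4, 3)
print((x + col).shape)  # (4, 3)

# Shape (4,) does not match the last dimension (3), so broadcasting fails
try:
    x + np.array([1, 2, 3, 4])
    mismatch_failed = False
except ValueError:
    mismatch_failed = True
print(mismatch_failed)  # True
```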


In [None]:
# Mathematical functions with NumPy
x = np.array([10,10,10])

# y = exp(x)
y = np.exp(x)
print(f"exp({x}): { y }", end="\n-------------\n")

# y = log2(x)
y = np.log2(x)
print(f"log2({x}): { y }", end="\n-------------\n")

# y = log10(x)
y = np.log10(x)
print(f"log10({x}): { y }")

exp([10 10 10]): [22026.46579481 22026.46579481 22026.46579481]
-------------
log2([10 10 10]): [3.32192809 3.32192809 3.32192809]
-------------
log10([10 10 10]): [1. 1. 1.]


---
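We often open tabular data in a spreadsheet tool such as Excel to manipulate it, run calculations, and draw charts. Python has a very handy package for tabular data as well: **pandas**. Its main data structures are `Series` and `DataFrame`. By default both are backed by NumPy arrays, so many of the NumPy functions used earlier also work on pandas objects. Let's try out what pandas can do.

# Introduction to pandas

pandas is designed for tabular data. For example, student records can be stored as a table like the one at the lower right of the figure below. In pandas such a table is called a DataFrame: each record is a row and each feature is a column.

![picture](https://drive.google.com/uc?id=1N8Z3gA-rRwmJT-I31i9Qk9jkxla-R0we)

It reads and writes many data formats, from the common CSV to databases, which is very convenient.

![picture](https://drive.google.com/uc?id=1JHLXXPZi7l7AkjwfQhaiKBxJ-tdKdq2w)

Built-in visualization helps users explore data quickly.

![picture](https://drive.google.com/uc?id=1NhHkjOqU2BA6sGN9HV7EJJrFA-0t4BJl)

The share of Stack Overflow questions that mention pandas shows how popular it is.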

![picture](https://drive.google.com/uc?id=18af5ElNnWuyUEKspFBbRVFtQ1GFTt-BQ)

## Loading the package

In [None]:
# !pip install pandas
import pandas as pd

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Basic pandas operations



## Creating a DataFrame


In [None]:
# 1. Build a DataFrame from a dictionary
df = pd.DataFrame({                     # each key names a feature (column); each value holds all the data for that feature
    "Name": ["Jerry", "Mary", "Tom"], 
    "Age": [26, 20, 30], 
    "Sex": ["Male", "Female", "Male"]}
    )
print(type(df))
print(df)

<class 'pandas.core.frame.DataFrame'>
    Name  Age     Sex
0  Jerry   26    Male
1   Mary   20  Female
2    Tom   30    Male


## Selecting a specific column

In [None]:
# Put the feature name in square brackets after the DataFrame to get all the values of that feature
# A single column taken out this way is a Series, which shows that a DataFrame is made up of several Series
print(type(df["Name"]))
df["Name"]

<class 'pandas.core.series.Series'>


0    Jerry
1     Mary
2      Tom
Name: Name, dtype: object

## Creating a Series and adding it to a DataFrame

In [None]:
# Pass pd.Series a list and a feature name. A Series is much like a list; converting a list to a Series mainly gives access to pandas features
height = pd.Series([172, None, 188], name="height")
print(type(height))
height

<class 'pandas.core.series.Series'>


0    172.0
1      NaN
2    188.0
Name: height, dtype: float64

In [None]:
# To add a feature to a DataFrame, write DataFrame["feature_name"] = some_series
df["height"] = height
df

Unnamed: 0,Name,Age,Sex,height
0,Jerry,26,Male,172.0
1,Mary,20,Female,
2,Tom,30,Male,188.0


In [None]:
import os
folder = '/content/drive/MyDrive/Colab Notebooks/TWM/test'
# os.makedirs also creates missing parent folders; exist_ok=True avoids an error if the folder already exists
os.makedirs(folder, exist_ok=True)

## Saving a DataFrame to CSV

In [None]:
df.to_csv("/content/drive/MyDrive/Colab Notebooks/TWM/sample.csv", index = False)

## Reading a CSV

In [None]:
# Load the CSV file

df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/TWM/sample.csv")  # pass index_col to use a specific column as the index
df

Unnamed: 0,Name,Age,Sex,height
0,Jerry,26,Male,172.0
1,Mary,20,Female,
2,Tom,30,Male,188.0
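
To try the CSV round trip without Google Drive, the sketch below writes to an in-memory `io.StringIO` buffer instead of a file (an assumption made only for this example). It also shows what `index = False` is for: without it, the index is written as an unnamed first column and comes back as `Unnamed: 0`.

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["Jerry", "Mary"], "Age": [26, 20], "height": [172.0, None]})

# Write without the index, then read the text back
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)
print(list(restored.columns))

# Writing with the index adds an unnamed first column on read
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
restored_with_index = pd.read_csv(with_index)
print(list(restored_with_index.columns))
```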


# *Quiz
Read the sample.csv created above, extend it from the original 3 people to 10 (choose the names, sexes, and heights yourself), and save it as a CSV again.

In [None]:
#@title *Answer
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/TWM/sample.csv")

df_new = pd.DataFrame({                     # each key names a feature (column); each value holds all the data for that feature
    "Name": ["Booker", "Ayton", "Crowder", "Paul", "Bridges","Craig", "Payne"], 
    "Age": [26, 20, 30, 35, 25, 26, 27], 
    "Sex": ["Male", "Male", "Male", "Male", "Male", "Male", "Male"],
    "height": [190, 188, 191, 189, 199, 178, 198]}
    )

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
df = pd.concat([df, df_new], ignore_index=True)
print(df)
df.to_csv("/content/drive/MyDrive/Colab Notebooks/TWM/sample.csv", index = False)

      Name  Age     Sex  height
0    Jerry   26    Male   172.0
1     Mary   20  Female     NaN
2      Tom   30    Male   188.0
3   Booker   26    Male   190.0
4    Ayton   20    Male   188.0
5  Crowder   30    Male   191.0
6     Paul   35    Male   189.0
7  Bridges   25    Male   199.0
8    Craig   26    Male   178.0
9    Payne   27    Male   198.0


### DataFrame information

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    10 non-null     object
 1   Age     10 non-null     int64 
 2   Sex     10 non-null     object
 3   height  10 non-null     int64 
dtypes: int64(2), object(2)
memory usage: 448.0+ bytes


In [None]:
df.dtypes

Name      object
Age        int64
Sex       object
height     int64
dtype: object

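## Selecting from a DataFrame

A DataFrame is made of columns and rows, and data can be selected by specifying particular columns or rows.

1. Head and tail selection: `.head()` / `.tail()`
2. Column selection: `[]`
3. Row selection: `.loc[]` / `.iloc[]`
4. Conditional selection
5. Selection by [row, column]

In [None]:
# Head and tail selection

# .head(n) selects the first n rows (default 5)
print(df.head(3), end = "\n----------\n")

# .tail(n) selects the last n rows (default 5)
print(df.tail(3), end = "\n----------\n")

# Slicing is also supported
print(df[::2])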

    Name  Age     Sex  height
0  Jerry   26    Male   172.0
1   Mary   20  Female     NaN
2    Tom   30    Male   188.0
----------
    Name  Age     Sex  height
0  Jerry   26    Male   172.0
1   Mary   20  Female     NaN
2    Tom   30    Male   188.0
----------
    Name  Age   Sex  height
0  Jerry   26  Male   172.0
2    Tom   30  Male   188.0


In [None]:
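# Selecting columns of a DataFrame

# A single column, indexed by its column name
print(df['Name'])
print(type(df['Name']), end = "\n----------\n") # a DataFrame is made up of Series

# Several columns, indexed by a list of column names
print(df[['Name', 'Age']])
print(type(df[['Name', 'Age']]), end = "\n----------\n") # several columns give a DataFrame

# A column can also be accessed as an attribute (when its name contains no spaces)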
print(df.Name.head(3))
print(df.Name)

0    Jerry
1     Mary
2      Tom
Name: Name, dtype: object
<class 'pandas.core.series.Series'>
----------
    Name  Age
0  Jerry   26
1   Mary   20
2    Tom   30
<class 'pandas.core.frame.DataFrame'>
----------
0    Jerry
1     Mary
2      Tom
Name: Name, dtype: object
0    Jerry
1     Mary
2      Tom
Name: Name, dtype: object


In [None]:
# Selecting rows (index) of a DataFrame

# df uses a numeric index by default; .set_index(key) uses the key column as the index labels
nameIdx_df = df.set_index("Name")  # adding inplace = True would modify df itself (the original df is kept here)

# .loc[] selects rows by label
print(nameIdx_df.loc["Paul"], end = "\n----------\n") # single row: Series
print(nameIdx_df.loc[["Paul", "Booker"]], end = "\n----------\n") # multiple rows: DataFrame

# .iloc[] selects rows by position
print(nameIdx_df.iloc[0], end = "\n----------\n") # single row: Series
print(nameIdx_df.iloc[[0, 4]], end = "\n----------\n") # multiple rows: DataFrame


Age         35
Sex       Male
height     189
Name: Paul, dtype: object
----------
        Age   Sex  height
Name                     
Paul     35  Male     189
Booker   26  Male     190
----------
Age         26
Sex       Male
height     172
Name: Jerry, dtype: object
----------
       Age   Sex  height
Name                    
Jerry   26  Male     172
Ayton   20  Male     188
----------
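
A difference between `.loc` and `.iloc` that often surprises newcomers is slicing: a label slice with `.loc` includes both ends, while a position slice with `.iloc` excludes the end, as in plain Python. The following sketch uses a small made-up DataFrame.

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Jerry", "Mary", "Tom", "Paul"], "Age": [26, 20, 30, 35]}
).set_index("Name")

# Label slice: both "Mary" and "Tom" are included
by_label = df.loc["Mary":"Tom"]
print(by_label)

# Position slice: position 2 is excluded, so only "Mary" is returned
by_position = df.iloc[1:2]
print(by_position)
```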


In [None]:
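# Conditional selection

# Select rows (records) by condition
condition = df['Age'] < 26
print(condition, end = "\n----------\n") # the result is a boolean Series

print(df[condition], end = "\n----------\n") # applying it to df returns only the rows where the condition is True

condition = (df.Age < 26) & (df.height > 170) # wrap each comparison in parentheses and combine them with & or |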
print(df[condition], end = "\n----------\n")

0    False
1     True
2    False
3    False
4     True
5    False
6    False
7     True
8    False
9    False
Name: Age, dtype: bool
----------
      Name  Age     Sex  height
1     Mary   20  Female     180
4    Ayton   20    Male     188
7  Bridges   25    Male     199
----------
      Name  Age     Sex  height
1     Mary   20  Female     180
4    Ayton   20    Male     188
7  Bridges   25    Male     199
----------
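
The same filter can also be written as a string with `DataFrame.query`, which some people find easier to read. The sketch below uses made-up rows and also shows that a comparison against NaN is False, so a row with a missing height never passes a `height > 170` test.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Jerry", "Mary", "Tom", "Bridges"],
    "Age": [26, 20, 30, 25],
    "height": [172.0, None, 188.0, 199.0],
})

# Boolean mask, as in the cell above
mask = (df["Age"] < 26) & (df["height"] > 170)
by_mask = df[mask]

# Equivalent filter written as a query string
by_query = df.query("Age < 26 and height > 170")

# Mary is under 26, but her missing height fails the comparison
print(by_mask["Name"].tolist())
print(by_query["Name"].tolist())
```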


In [None]:
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/TWM/sample.csv")
# 5. Selection by [row, column] / [row][column]
# .loc selects by condition or label
condition = (df.Age < 26) & (df.height > 170)
print(df.loc[condition, 'Name'], end = "\n----------\n")
print(df.loc[condition][['Name', 'Age']], end = "\n----------\n")

# .iloc selects by integer position (slicing is supported)
print(df.iloc[:10, ::2])
print(df.iloc[:10][::2])

4      Ayton
7    Bridges
Name: Name, dtype: object
----------
      Name  Age
4    Ayton   20
7  Bridges   25
----------
      Name     Sex
0    Jerry    Male
1     Mary  Female
2      Tom    Male
3   Booker    Male
4    Ayton    Male
5  Crowder    Male
6     Paul    Male
7  Bridges    Male
8    Craig    Male
9    Payne    Male
    Name  Age   Sex  height
0  Jerry   26  Male   172.0
2    Tom   30  Male   188.0
4  Ayton   20  Male   188.0
6   Paul   35  Male   189.0
8  Craig   26  Male   178.0


### Working with a DataFrame

1. Index operations
2. Column and row operations

### Index operations

In [None]:
# .index: the index object of the DataFrame
print(df, end = "\n-----\n")
print(df.index, end = "\n----------\n")

# .set_index(): use a column (feature) as the index (inplace=True modifies df itself)
df.set_index('Age', inplace = True)
print(df, end = "\n-----\n")
print(df.index, end = "\n----------\n")
print('sort',df.sort_index())
# .reset_index(): reset the index to the default integer index
df.reset_index(inplace = True)
print(df, end = "\n-----\n")
print(df.index, end = "\n----------\n")

# .reindex(): conform the DataFrame to a new index
reindexed = df.reindex(index=np.arange(1, 10 ,2)) # labels that do not exist are filled with NaN
print(reindexed, end = "\n-----\n")
print(reindexed.index, end = "\n----------\n")

   Age     Name     Sex  height
0   26    Jerry    Male   172.0
1   20     Mary  Female     NaN
2   30      Tom    Male   188.0
3   26   Booker    Male   190.0
4   20    Ayton    Male   188.0
5   30  Crowder    Male   191.0
6   35     Paul    Male   189.0
7   25  Bridges    Male   199.0
8   26    Craig    Male   178.0
9   27    Payne    Male   198.0
-----
RangeIndex(start=0, stop=10, step=1)
----------
        Name     Sex  height
Age                         
26     Jerry    Male   172.0
20      Mary  Female     NaN
30       Tom    Male   188.0
26    Booker    Male   190.0
20     Ayton    Male   188.0
30   Crowder    Male   191.0
35      Paul    Male   189.0
25   Bridges    Male   199.0
26     Craig    Male   178.0
27     Payne    Male   198.0
-----
Int64Index([26, 20, 30, 26, 20, 30, 35, 25, 26, 27], dtype='int64', name='Age')
----------
sort         Name     Sex  height
Age                         
20      Mary  Female     NaN
20     Ayton    Male   188.0
25   Bridges    Male   199.0

### Column and row operations

In [None]:
# .columns: the column (feature) labels of the DataFrame
print(reindexed, end = "\n-----\n")
print(reindexed.columns, end = "\n----------\n")

# .rename(): rename columns
reindexed =  reindexed.rename(columns={'height': 'H'})
print(reindexed.columns)  # .columns: the column (feature) labels of the DataFrame

   Age     Name     Sex  height
1   20     Mary  Female     NaN
3   26   Booker    Male   190.0
5   30  Crowder    Male   191.0
7   25  Bridges    Male   199.0
9   27    Payne    Male   198.0
-----
Index(['Age', 'Name', 'Sex', 'height'], dtype='object')
----------
Index(['Age', 'Name', 'Sex', 'H'], dtype='object')


In [None]:
# Reload the data and add a new column, weight
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/TWM/sample.csv")
weight = pd.Series([80, 60, 70], name="weight")
df["weight"] = weight
print(df)

      Name  Age     Sex  height  weight
0    Jerry   26    Male   172.0    80.0
1     Mary   20  Female     NaN    60.0
2      Tom   30    Male   188.0    70.0
3   Booker   26    Male   190.0     NaN
4    Ayton   20    Male   188.0     NaN
5  Crowder   30    Male   191.0     NaN
6     Paul   35    Male   189.0     NaN
7  Bridges   25    Male   199.0     NaN
8    Craig   26    Male   178.0     NaN
9    Payne   27    Male   198.0     NaN
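
The NaN values in the `weight` column above come from index alignment: when a Series is assigned as a column, pandas matches values by index label, not by position, and rows without a matching label get NaN. A minimal sketch with made-up data (the `shifted` column is purely illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Jerry", "Mary", "Tom", "Booker"]})

# This Series has index labels 0, 1, 2, so row 3 has no match
weight = pd.Series([80, 60, 70], name="weight")
df["weight"] = weight

# With labels 1, 2, 3 the same values land one row lower
shifted = pd.Series([80, 60, 70], index=[1, 2, 3])
df["shifted"] = shifted

print(df)
```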


In [None]:
# Adding columns
df_copy = df.copy()
# Assigning to a new name with [] appends the new column at the end
df_copy['BMI'] = df.weight / ((df.height/100)**2)
print(df_copy.head(3), end = '\n----------\n')

# .insert() adds a new column at a specific position
df_copy2 = df.copy()
df_copy2.insert(4, 'BMI', df.weight / ((df.height/100)**2))
print(df_copy2.head(3), end = "\n----------\n")

# pd.concat() joins rows by default (axis=0); set axis=1 to join columns
appended = pd.concat([df, df_copy['BMI'].round(2)], axis=1)
print(appended, end = "\n----------\n")

    Name  Age     Sex  height  weight        BMI
0  Jerry   26    Male   172.0    80.0  27.041644
1   Mary   20  Female     NaN    60.0        NaN
2    Tom   30    Male   188.0    70.0  19.805342
----------
    Name  Age     Sex  height        BMI  weight
0  Jerry   26    Male   172.0  27.041644    80.0
1   Mary   20  Female     NaN        NaN    60.0
2    Tom   30    Male   188.0  19.805342    70.0
----------
      Name  Age     Sex  height  weight    BMI
0    Jerry   26    Male   172.0    80.0  27.04
1     Mary   20  Female     NaN    60.0    NaN
2      Tom   30    Male   188.0    70.0  19.81
3   Booker   26    Male   190.0     NaN    NaN
4    Ayton   20    Male   188.0     NaN    NaN
5  Crowder   30    Male   191.0     NaN    NaN
6     Paul   35    Male   189.0     NaN    NaN
7  Bridges   25    Male   199.0     NaN    NaN
8    Craig   26    Male   178.0     NaN    NaN
9    Payne   27    Male   198.0     NaN    NaN
----------


In [None]:
# Deletion
# .drop() removes rows by default (axis=0)
dropRow = df_copy.drop([3, 4])
print(dropRow[:10], end = "\n-----\n")

# .drop() with axis=1 removes columns
dropCol = df_copy.drop(['height', 'weight'], axis = 1)
print(dropCol)

# Remove the rows that contain NaN
df.dropna(inplace=True)
print(df)

      Name  Age     Sex  height  weight        BMI
0    Jerry   26    Male   172.0    80.0  27.041644
1     Mary   20  Female     NaN    60.0        NaN
2      Tom   30    Male   188.0    70.0  19.805342
5  Crowder   30    Male   191.0     NaN        NaN
6     Paul   35    Male   189.0     NaN        NaN
7  Bridges   25    Male   199.0     NaN        NaN
8    Craig   26    Male   178.0     NaN        NaN
9    Payne   27    Male   198.0     NaN        NaN
-----
      Name  Age     Sex        BMI
0    Jerry   26    Male  27.041644
1     Mary   20  Female        NaN
2      Tom   30    Male  19.805342
3   Booker   26    Male        NaN
4    Ayton   20    Male        NaN
5  Crowder   30    Male        NaN
6     Paul   35    Male        NaN
7  Bridges   25    Male        NaN
8    Craig   26    Male        NaN
9    Payne   27    Male        NaN
    Name  Age   Sex  height  weight
0  Jerry   26  Male   172.0    80.0
2    Tom   30  Male   188.0    70.0
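
`dropna()` with no arguments removes every row that has a NaN in any column, which discarded most of the table above. Two gentler options are sketched below on made-up data: limiting the check to chosen columns with `subset`, or filling the gaps with `fillna` (here with the column mean, one possible choice among many).

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Jerry", "Mary", "Tom"],
    "height": [172.0, None, 188.0],
    "weight": [80.0, 60.0, None],
})

# Drop rows with NaN in any column: only Jerry remains
any_missing = df.dropna()

# Only require height to be present: Jerry and Tom remain
height_only = df.dropna(subset=["height"])

# Fill missing weights with the mean of the known weights
filled = df.fillna({"weight": df["weight"].mean()})

print(any_missing["Name"].tolist())
print(height_only["Name"].tolist())
print(filled["weight"].tolist())
```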


## DataFrame statistics

In [None]:
df_copy

Unnamed: 0,Name,Age,Sex,height,weight,BMI
0,Jerry,26,Male,172.0,80.0,27.041644
1,Mary,20,Female,,60.0,
2,Tom,30,Male,188.0,70.0,19.805342
3,Booker,26,Male,190.0,,
4,Ayton,20,Male,188.0,,
5,Crowder,30,Male,191.0,,
6,Paul,35,Male,189.0,,
7,Bridges,25,Male,199.0,,
8,Craig,26,Male,178.0,,
9,Payne,27,Male,198.0,,


In [None]:
# .count(): number of non-NaN entries
print(df_copy['weight'].count(), end = "\n----------\n")

# .unique(): returns an array of the unique values
print(df_copy['Sex'].unique(), end = "\n----------\n")

# .value_counts(): number of occurrences of each unique value
print(df_copy['height'].value_counts(), end = "\n----------\n")

# .min() / .max(): smallest / largest value
print(df_copy['weight'].min(), end = "\n----------\n")
print(df_copy['height'].max())

3
----------
['Male' 'Female']
----------
188.0    2
198.0    1
178.0    1
199.0    1
189.0    1
191.0    1
190.0    1
172.0    1
Name: height, dtype: int64
----------
60.0
----------
199.0


In [None]:
# Summary statistics of the data
df_copy.describe()

Unnamed: 0,Age,height,weight,BMI
count,10.0,10.0,3.0,3.0
mean,26.5,187.6,70.0,21.582744
std,4.527693,8.579044,10.0,4.739384
min,20.0,172.0,60.0,18.518519
25%,25.25,182.0,65.0,18.853295
50%,26.0,189.5,70.0,19.188071
75%,29.25,191.0,75.0,23.114857
max,35.0,199.0,80.0,27.041644
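
The individual statistics reported by `describe()` can also be computed one at a time. The sketch below uses the three known weights plus one NaN (made-up to mirror the table) to show that pandas skips NaN by default and that `.std()` is the sample standard deviation (`ddof=1`), which differs from NumPy's default.

```python
import numpy as np
import pandas as pd

weight = pd.Series([80.0, 60.0, 70.0, np.nan])

# NaN is ignored by count, mean, and std
n = weight.count()
mean = weight.mean()
sample_std = weight.std()  # ddof=1 by default in pandas

# NumPy's std uses ddof=0 by default, so the result is smaller
population_std = np.std(weight.dropna().to_numpy())

print(n, mean, sample_std, population_std)
```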


# Hands-on example

In [None]:
titanic = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/TWM/titanic_data_train.csv")
titanic

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [None]:
# Check the non-null count of every column
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [None]:
# Put a list inside the square brackets to select several columns at once
sex_survived = titanic[["Sex", "Survived"]]
sex_survived.head(10)

Unnamed: 0,Sex,Survived
0,male,0
1,female,1
2,female,1
3,female,1
4,male,0
5,male,0
6,male,0
7,male,0
8,female,1
9,female,1


In [None]:
# A condition written this way filters for specific rows
female_survived = sex_survived[sex_survived["Sex"] == "female"]
female_survived.head(10)

Unnamed: 0,Sex,Survived
1,female,1
2,female,1
3,female,1
8,female,1
9,female,1
10,female,1
11,female,1
14,female,0
15,female,1
18,female,0


In [None]:
# The two approaches below are equivalent
# With the second approach, wrap each comparison in parentheses, and use | for "or" and & for "and"

# Approach 1
# pclass_1_3 = titanic[titanic["Pclass"].isin([1, 3])]

# Approach 2
pclass_1_3 = titanic[(titanic["Pclass"] == 1) | (titanic["Pclass"] == 3)]
pclass_1_3.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S


In [None]:
# .loc accepts boolean conditions, labels, and so on
pclass_age = titanic.loc[titanic["Pclass"] == 1, "Age"]
pclass_age.head(10)

1     38.0
3     35.0
6     54.0
11    58.0
23    28.0
27    19.0
30    40.0
31     NaN
34    28.0
35    42.0
Name: Age, dtype: float64

In [None]:
# .iloc selects by integer position
some_data = titanic.iloc[5:15, 1:3]
some_data

Unnamed: 0,Survived,Pclass
5,0,3
6,0,1
7,0,3
8,1,3
9,1,2
10,1,3
11,1,1
12,0,3
13,0,3
14,0,3


In [None]:
# To assign values, use .loc or .iloc to specify the location
titanic.iloc[0:5, 3] = "Jerry"
titanic.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,Jerry,male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,Jerry,female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,Jerry,female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,Jerry,female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,Jerry,male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C
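
Assignment also works with a condition. The sketch below, on a small made-up DataFrame, puts the row condition and the column name in a single `.loc` call. Writing it as one call is generally safer than chaining two selections such as `df[mask]["Name"] = ...`, which may assign to a temporary copy instead of the original.

```python
import pandas as pd

df = pd.DataFrame({"Pclass": [1, 3, 1], "Name": ["A", "B", "C"]})

# Select rows by condition and the column by label in one .loc call
df.loc[df["Pclass"] == 1, "Name"] = "Jerry"

print(df)
```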


# Cheat Sheet

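For many popular packages, community members put together cheat sheets like these. They summarize commonly used features and are handy when you forget how something works.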


![picture](https://drive.google.com/uc?id=18Hu82RWdQ_zpfJjdxvEcZ0o0TooHg0A4)

![picture](https://drive.google.com/uc?id=1c_WiNQRu_KdDKy59pQQBysIyAJLAk64C)

# References

https://venturebeat.com/2019/01/24/github-numpy-and-scipy-are-the-most-popular-packages-for-machine-learning-projects/

https://numpy.org/

https://s3.amazonaws.com/assets.datacamp.com/blog_assets/Numpy_Python_Cheat_Sheet.pdf

https://pandas.pydata.org/docs/getting_started/intro_tutorials/

https://www.sqlshack.com/getting-started-with-pandas-in-python/

http://datacamp-community-prod.s3.amazonaws.com/dbed353d-2757-4617-8206-8767ab379ab3

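# Supplement

### Concatenating arrays

`concatenate` joins several arrays along one dimension. The result has the same number of dimensions as the inputs; no new dimension is added.

Note that the arrays must have the same length in every dimension except the one being joined.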

In [None]:
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[7, 8, 9]])

print(a)
print('--------')
print(b)

[[1 2 3]
 [4 5 6]]
--------
[[7 8 9]]


In [None]:
# Join the two arrays along the first dimension; the shapes of the three arrays show what happened
c = np.concatenate((a, b), axis=0)
print(c)
print(a.shape, b.shape, c.shape)

[[1 2 3]
 [4 5 6]
 [7 8 9]]
(2, 3) (1, 3) (3, 3)


In [None]:
# Create another two-dimensional array
d = np.array([[0], [0]])

# Join a and d along the second dimension
e = np.concatenate((a, d), axis=1)

print(d)
print(a.shape, d.shape, e.shape)

[[0]
 [0]]
(2, 3) (2, 1) (2, 4)
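
To see the shape rule in action, the sketch below tries to join the same `a` and `d` along the first dimension instead. Their second dimensions differ (3 versus 1), so NumPy raises a `ValueError`. The flag name `failed` is illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])  # shape (2, 3)
d = np.array([[0], [0]])              # shape (2, 1)

# axis=1 works: the first dimensions match (2 and 2)
ok = np.concatenate((a, d), axis=1)
print(ok.shape)  # (2, 4)

# axis=0 fails: the second dimensions differ (3 and 1)
try:
    np.concatenate((a, d), axis=0)
    failed = False
except ValueError:
    failed = True
print(failed)  # True
```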


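### Stacking arrays

Unlike `concatenate`, `stack` requires the arrays to have the same number of dimensions and the same length in every dimension, and the result gains one new dimension.

`vstack` (vertical stacking) and `hstack` (horizontal stacking) work much like `concatenate`, except that the axis to join along does not need to be specified.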

In [None]:
# Create three two-dimensional arrays
a = np.array([[0, 1],
        [2, 3]])

b = np.array([[4, 5],
        [6, 7]])

c = np.array([[8,  9],
        [10, 11]])
print(a.shape, b.shape, c.shape)

(2, 2) (2, 2) (2, 2)


In [None]:
# Stack the three arrays along a new first dimension
s = np.stack([a, b, c], axis=0)
print(s)
print(s.shape)

[[[ 0  1]
  [ 2  3]]

 [[ 4  5]
  [ 6  7]]

 [[ 8  9]
  [10 11]]]
(3, 2, 2)


In [None]:
# Join the three arrays along the first dimension
v = np.vstack([a, b, c])
print(v)
print(v.shape)

[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]]
(6, 2)


In [None]:
# Join the three arrays along the second dimension
h = np.hstack([a, b, c])
print(h)
print(h.shape)
# (2, 6)

[[ 0  1  4  5  8  9]
 [ 2  3  6  7 10 11]]
(2, 6)
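
The relationship between these functions can be checked directly. The sketch below, using two of the arrays above, confirms that for 2-D inputs `vstack` and `hstack` match `concatenate` along axis 0 and axis 1, and shows that `stack` can also place its new dimension last with `axis=-1`.

```python
import numpy as np

a = np.array([[0, 1], [2, 3]])
b = np.array([[4, 5], [6, 7]])

# For 2-D arrays, vstack and hstack match concatenate on axis 0 and 1
same_v = np.array_equal(np.vstack([a, b]), np.concatenate([a, b], axis=0))
same_h = np.array_equal(np.hstack([a, b]), np.concatenate([a, b], axis=1))
print(same_v, same_h)  # True True

# stack adds a new dimension; axis=-1 puts it last
stacked_last = np.stack([a, b], axis=-1)
print(stacked_last.shape)  # (2, 2, 2)
print(stacked_last[0, 0])  # [0 4]
```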
