# Input data

TensorFlow can import external data, which is useful for managing complex pipelines.

## Process

1. Load the data with pandas
2. Convert the data to a NumPy array
3. The NumPy array can then be used in TensorFlow
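The three steps above can be sketched on a tiny in-memory CSV. This is only an illustration: the three rows and the names csv_text, frame and price_array are made up for the example and are not part of the housing dataset used in these notes. TensorFlow itself is left out so the sketch runs with pandas and NumPy alone.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for a CSV file on disk.
csv_text = (
    "price,sqft_living,waterfront\n"
    "221900.0,1180,0\n"
    "538000.0,2570,0\n"
    "180000.0,770,1\n"
)

# Step 1: read the data with pandas.
frame = pd.read_csv(io.StringIO(csv_text))

# Step 2: convert one column to a NumPy array with an explicit dtype.
price_array = frame["price"].to_numpy(dtype=np.float32)

# Step 3: the result is a plain numeric array, the kind of input that
# TensorFlow functions such as tf.constant or tf.cast accept.
print(price_array.dtype, price_array.shape)
```

Fixing the dtype at conversion time avoids the float64 default, which would otherwise have to be cast again before being mixed with float32 variables.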


Parameters of pandas read_csv()

![](Image/Image5.jpg)

In [1]:
# import numpy and pandas
import numpy as np
import pandas as pd

In [4]:
# load dataset
housing = pd.read_csv("Datasets/kc_house_data.csv")

# inspect
housing.head()

| | id | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | ... | grade | sqft_above | sqft_basement | yr_built | yr_renovated | zipcode | lat | long | sqft_living15 | sqft_lot15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.0 | 1180 | 5650 | 1.0 | 0 | 0 | ... | 7 | 1180 | 0 | 1955 | 0 | 98178 | 47.5112 | -122.257 | 1340 | 5650 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2570 | 7242 | 2.0 | 0 | 0 | ... | 7 | 2170 | 400 | 1951 | 1991 | 98125 | 47.721 | -122.319 | 1690 | 7639 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.0 | 770 | 10000 | 1.0 | 0 | 0 | ... | 6 | 770 | 0 | 1933 | 0 | 98028 | 47.7379 | -122.233 | 2720 | 8062 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.0 | 1960 | 5000 | 1.0 | 0 | 0 | ... | 7 | 1050 | 910 | 1965 | 0 | 98136 | 47.5208 | -122.393 | 1360 | 5000 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.0 | 1680 | 8080 | 1.0 | 0 | 0 | ... | 8 | 1680 | 0 | 1987 | 0 | 98074 | 47.6168 | -122.045 | 1800 | 7503 |


In [5]:
# convert to a NumPy array
housing_array = np.array(housing)

# inspect
print(housing_array)

[[7129300520 '20141013T000000' 221900.0 ... -122.257 1340 5650]
 [6414100192 '20141209T000000' 538000.0 ... -122.319 1690 7639]
 [5631500400 '20150225T000000' 180000.0 ... -122.233 2720 8062]
 ...
 [1523300141 '20140623T000000' 402101.0 ... -122.299 1020 2007]
 [291310100 '20150116T000000' 400000.0 ... -122.069 1410 1287]
 [1523300157 '20141015T000000' 325000.0 ... -122.299 1020 1357]]


**Converting the imported NumPy data to data types that TensorFlow can use**

Suppose only the price (float) and waterfront (boolean) columns are needed. There are two ways to convert them.

Method 1 (use NumPy arrays directly):

In [11]:
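# Define the price NumPy array
price = np.array(housing['price'], dtype = np.float32)
print(price)

# Define the waterfront NumPy array
# (the np.bool alias was removed in NumPy 1.24; the built-in bool works in every version)
waterfront = np.array(housing['waterfront'], dtype = bool)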
print(waterfront)

[221900. 538000. 180000. ... 402101. 400000. 325000.]
[False False False ... False False False]


Method 2 (use TensorFlow's cast function):

In [12]:
# import tensorflow
import tensorflow as tf

# Define price
price2 = tf.cast(housing['price'], dtype = tf.float32)
print(price2)

# Define waterfront
waterfront2 = tf.cast(housing['waterfront'], dtype = tf.bool)
print(waterfront2)

tf.Tensor([221900. 538000. 180000. ... 402101. 400000. 325000.], shape=(21613,), dtype=float32)
tf.Tensor([False False False ... False False False], shape=(21613,), dtype=bool)


# Loss Function

Training a model means minimising the model's loss function.

Common types of loss function:

1. Mean square error (MSE)
2. Mean absolute error (MAE)
3. Huber loss

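Each loss function behaves differently. For example, mean square error contains a squared term, so it amplifies large errors and penalises them heavily, which makes it a good choice when large errors must be avoided.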

![](Image/Image6.jpg)
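To make the difference between the three losses concrete, here is a minimal NumPy sketch that computes each one by hand on four made-up points, one of which is off by 10. The functions mse, mae and huber below are hand-written illustrations, not the TensorFlow implementations; the Huber version assumes the usual definition with threshold delta = 1.0 (quadratic below the threshold, linear above it).

```python
import numpy as np

def mse(targets, predictions):
    # Mean of the squared errors.
    return np.mean((targets - predictions) ** 2)

def mae(targets, predictions):
    # Mean of the absolute errors.
    return np.mean(np.abs(targets - predictions))

def huber(targets, predictions, delta=1.0):
    # Quadratic for small errors, linear for errors larger than delta.
    error = np.abs(targets - predictions)
    quadratic = 0.5 * error ** 2
    linear = delta * (error - 0.5 * delta)
    return np.mean(np.where(error <= delta, quadratic, linear))

targets = np.array([1.0, 2.0, 3.0, 4.0])
predictions = np.array([1.0, 2.0, 3.0, 14.0])  # a single error of size 10

print(mse(targets, predictions))    # 25.0
print(mae(targets, predictions))    # 2.5
print(huber(targets, predictions))  # 2.375
```

The single outlier contributes 100 to the MSE sum but only about 10 to the MAE and Huber sums, which is the amplification effect described above.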

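All three loss functions are available in TensorFlow's tf.keras.losses module, for example:

1. tf.keras.losses.mse(TARGETS, PREDICTIONS)
2. tf.keras.losses.mae(TARGETS, PREDICTIONS)
3. tf.keras.losses.huber(TARGETS, PREDICTIONS)

Each of these functions returns a tensor holding the loss value; for 1-D inputs this is a single scalar. (tf.keras.losses.Huber with a capital H is a class: it has to be instantiated first and then called on the targets and predictions.)

MSE usage example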

In [13]:
import tensorflow as tf

In [None]:
# For illustration only; no need to run
# (slope, features and targets are assumed to be defined elsewhere)
# Define a linear regression function
def linear_regression(intercept, slope = slope, features = features):
    return intercept + slope * features

# Define a function that computes the loss
def loss_function(intercept, slope, features = features, targets = targets):
    predictions = linear_regression(intercept, slope, features)
    return tf.keras.losses.mse(targets, predictions)

# Finally, call loss_function to compute the loss

## Linear Regression

Univariate regression:  <br/>
Target = intercept + slope * input + error
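As a quick check of what the univariate model means, the sketch below generates synthetic data from a known intercept and slope plus random error, then recovers both with NumPy's closed-form least-squares fit. The data, the random seed and the variable names are assumptions made for the illustration; this is not the housing dataset and not the TensorFlow approach used in the rest of this section.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated from a known line plus noise.
true_intercept, true_slope = 2.0, 3.0
inputs = rng.uniform(0.0, 10.0, size=200)
error = rng.normal(0.0, 0.5, size=200)
targets = true_intercept + true_slope * inputs + error

# np.polyfit with deg=1 returns [slope, intercept] of the least-squares line.
fitted_slope, fitted_intercept = np.polyfit(inputs, targets, deg=1)

# Predictions use only the fitted line; the error term is what is left over.
predictions = fitted_intercept + fitted_slope * inputs
print(fitted_intercept, fitted_slope)
```

The fitted values land close to 2.0 and 3.0 but not exactly on them, because the error term is part of the data and not of the prediction.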

### Implement linear regression in Tensorflow

In [15]:
# import
import tensorflow as tf

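# Define the dependent variable
price = np.array(housing["price"], np.float32)

# Define the independent variable
size = np.array(housing["sqft_living"], np.float32)

# Initialise the intercept and slope
# (dtype must be passed by keyword: the second positional argument of tf.Variable is trainable)
intercept = tf.Variable(0.1, dtype = tf.float32)
slope = tf.Variable(0.1, dtype = tf.float32)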

Define the linear regression function

In [16]:
def linear_regression(intercept, slope, features = size):
    return intercept + slope * features

Define the loss function

In [17]:
def loss_function(intercept, slope, features = size, target = price):
    predictions = linear_regression(intercept, slope, features)
    return tf.keras.losses.mse(target, predictions)

In [23]:
print(loss_function(0.1, 0.1, size, price).numpy())
print(loss_function(0.1, 0.5, size, price).numpy())

426199250000.0
425112540000.0


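Define the optimisation operation (using Adam)

Running this operation changes the slope and intercept defined earlier and lowers the loss

In [24]:
opt = tf.keras.optimizers.Adam(0.5)    # 0.5 is the learning rate

Run the optimisation

minimize() takes two main arguments: the first is the function to minimise, and the second (var_list) is the list of variables to adjust in order to minimise it.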
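To show roughly what one optimisation step involves, here is a hand-written NumPy sketch of the Adam update rule applied to a linear regression on synthetic data. It is not TensorFlow's implementation: the data, the learning rate of 0.05 and the hyperparameters beta1 = 0.9, beta2 = 0.999 and eps = 1e-7 are assumptions chosen for the illustration, and the gradient is written out analytically instead of being computed automatically.

```python
import numpy as np

# Synthetic data lying exactly on the line 1 + 2 * x.
features = np.linspace(0.0, 1.0, 50)
targets = 1.0 + 2.0 * features

params = np.array([0.1, 0.1])  # [intercept, slope]
m = np.zeros(2)                # running mean of the gradients
v = np.zeros(2)                # running mean of the squared gradients
learning_rate, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-7

def loss(p):
    return np.mean((targets - (p[0] + p[1] * features)) ** 2)

def gradient(p):
    # Analytic gradient of the MSE with respect to intercept and slope.
    residual = (p[0] + p[1] * features) - targets
    return np.array([2.0 * np.mean(residual), 2.0 * np.mean(residual * features)])

initial_loss = loss(params)
for step in range(1, 2001):
    g = gradient(params)
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g ** 2
    m_hat = m / (1.0 - beta1 ** step)  # bias correction
    v_hat = v / (1.0 - beta2 ** step)
    params = params - learning_rate * m_hat / (np.sqrt(v_hat) + eps)

final_loss = loss(params)
print(initial_loss, final_loss, params)
```

Because the update divides by the root of the squared-gradient average, each step moves a parameter by roughly the learning rate at most, no matter how large the raw gradient is.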

In [25]:
for j in range(1000):
    opt.minimize(lambda: loss_function(intercept, slope), var_list = [intercept, slope])
    print(loss_function(intercept, slope))

tf.Tensor(422132900000.0, shape=(), dtype=float32)
tf.Tensor(420781980000.0, shape=(), dtype=float32)
tf.Tensor(419433780000.0, shape=(), dtype=float32)
tf.Tensor(418088300000.0, shape=(), dtype=float32)
tf.Tensor(416745550000.0, shape=(), dtype=float32)
tf.Tensor(415405670000.0, shape=(), dtype=float32)
tf.Tensor(414068700000.0, shape=(), dtype=float32)
tf.Tensor(412734600000.0, shape=(), dtype=float32)
tf.Tensor(411403450000.0, shape=(), dtype=float32)
tf.Tensor(410075300000.0, shape=(), dtype=float32)
tf.Tensor(408750200000.0, shape=(), dtype=float32)
tf.Tensor(407428170000.0, shape=(), dtype=float32)
tf.Tensor(406109220000.0, shape=(), dtype=float32)
tf.Tensor(404793500000.0, shape=(), dtype=float32)
tf.Tensor(403480870000.0, shape=(), dtype=float32)
tf.Tensor(402171430000.0, shape=(), dtype=float32)
tf.Tensor(400865230000.0, shape=(), dtype=float32)
tf.Tensor(399562380000.0, shape=(), dtype=float32)
tf.Tensor(398262730000.0, shape=(), dtype=float32)
tf.Tensor(396966460000.0, shape

tf.Tensor(220207940000.0, shape=(), dtype=float32)
tf.Tensor(219488670000.0, shape=(), dtype=float32)
tf.Tensor(218772080000.0, shape=(), dtype=float32)
tf.Tensor(218058150000.0, shape=(), dtype=float32)
tf.Tensor(217346850000.0, shape=(), dtype=float32)
tf.Tensor(216638230000.0, shape=(), dtype=float32)
tf.Tensor(215932190000.0, shape=(), dtype=float32)
tf.Tensor(215228840000.0, shape=(), dtype=float32)
tf.Tensor(214528130000.0, shape=(), dtype=float32)
tf.Tensor(213830030000.0, shape=(), dtype=float32)
tf.Tensor(213134540000.0, shape=(), dtype=float32)
tf.Tensor(212441680000.0, shape=(), dtype=float32)
tf.Tensor(211751370000.0, shape=(), dtype=float32)
tf.Tensor(211063720000.0, shape=(), dtype=float32)
tf.Tensor(210378650000.0, shape=(), dtype=float32)
tf.Tensor(209696110000.0, shape=(), dtype=float32)
tf.Tensor(209016240000.0, shape=(), dtype=float32)
tf.Tensor(208338860000.0, shape=(), dtype=float32)
tf.Tensor(207664090000.0, shape=(), dtype=float32)
tf.Tensor(206991870000.0, shape

tf.Tensor(125730770000.0, shape=(), dtype=float32)
tf.Tensor(125401645000.0, shape=(), dtype=float32)
tf.Tensor(125074080000.0, shape=(), dtype=float32)
tf.Tensor(124748055000.0, shape=(), dtype=float32)
tf.Tensor(124423580000.0, shape=(), dtype=float32)
tf.Tensor(124100660000.0, shape=(), dtype=float32)
tf.Tensor(123779250000.0, shape=(), dtype=float32)
tf.Tensor(123459396000.0, shape=(), dtype=float32)
tf.Tensor(123141054000.0, shape=(), dtype=float32)
tf.Tensor(122824230000.0, shape=(), dtype=float32)
tf.Tensor(122508935000.0, shape=(), dtype=float32)
tf.Tensor(122195120000.0, shape=(), dtype=float32)
tf.Tensor(121882830000.0, shape=(), dtype=float32)
tf.Tensor(121572024000.0, shape=(), dtype=float32)
tf.Tensor(121262700000.0, shape=(), dtype=float32)
tf.Tensor(120954880000.0, shape=(), dtype=float32)
tf.Tensor(120648560000.0, shape=(), dtype=float32)
tf.Tensor(120343670000.0, shape=(), dtype=float32)
tf.Tensor(120040270000.0, shape=(), dtype=float32)
tf.Tensor(119738335000.0, shape

tf.Tensor(86586290000.0, shape=(), dtype=float32)
tf.Tensor(86461645000.0, shape=(), dtype=float32)
tf.Tensor(86337724000.0, shape=(), dtype=float32)
tf.Tensor(86214560000.0, shape=(), dtype=float32)
tf.Tensor(86092140000.0, shape=(), dtype=float32)
tf.Tensor(85970440000.0, shape=(), dtype=float32)
tf.Tensor(85849470000.0, shape=(), dtype=float32)
tf.Tensor(85729230000.0, shape=(), dtype=float32)
tf.Tensor(85609690000.0, shape=(), dtype=float32)
tf.Tensor(85490910000.0, shape=(), dtype=float32)
tf.Tensor(85372810000.0, shape=(), dtype=float32)
tf.Tensor(85255440000.0, shape=(), dtype=float32)
tf.Tensor(85138780000.0, shape=(), dtype=float32)
tf.Tensor(85022810000.0, shape=(), dtype=float32)
tf.Tensor(84907550000.0, shape=(), dtype=float32)
tf.Tensor(84792980000.0, shape=(), dtype=float32)
tf.Tensor(84679110000.0, shape=(), dtype=float32)
tf.Tensor(84565926000.0, shape=(), dtype=float32)
tf.Tensor(84453430000.0, shape=(), dtype=float32)
tf.Tensor(84341620000.0, shape=(), dtype=float32)


tf.Tensor(72665770000.0, shape=(), dtype=float32)
tf.Tensor(72632164000.0, shape=(), dtype=float32)
tf.Tensor(72598815000.0, shape=(), dtype=float32)
tf.Tensor(72565730000.0, shape=(), dtype=float32)
tf.Tensor(72532870000.0, shape=(), dtype=float32)
tf.Tensor(72500270000.0, shape=(), dtype=float32)
tf.Tensor(72467910000.0, shape=(), dtype=float32)
tf.Tensor(72435800000.0, shape=(), dtype=float32)
tf.Tensor(72403935000.0, shape=(), dtype=float32)
tf.Tensor(72372300000.0, shape=(), dtype=float32)
tf.Tensor(72340910000.0, shape=(), dtype=float32)
tf.Tensor(72309740000.0, shape=(), dtype=float32)
tf.Tensor(72278830000.0, shape=(), dtype=float32)
tf.Tensor(72248140000.0, shape=(), dtype=float32)
tf.Tensor(72217680000.0, shape=(), dtype=float32)
tf.Tensor(72187460000.0, shape=(), dtype=float32)
tf.Tensor(72157460000.0, shape=(), dtype=float32)
tf.Tensor(72127700000.0, shape=(), dtype=float32)
tf.Tensor(72098160000.0, shape=(), dtype=float32)
tf.Tensor(72068850000.0, shape=(), dtype=float32)


tf.Tensor(69362350000.0, shape=(), dtype=float32)
tf.Tensor(69355420000.0, shape=(), dtype=float32)
tf.Tensor(69348560000.0, shape=(), dtype=float32)
tf.Tensor(69341760000.0, shape=(), dtype=float32)
tf.Tensor(69335015000.0, shape=(), dtype=float32)
tf.Tensor(69328350000.0, shape=(), dtype=float32)
tf.Tensor(69321720000.0, shape=(), dtype=float32)
tf.Tensor(69315170000.0, shape=(), dtype=float32)
tf.Tensor(69308670000.0, shape=(), dtype=float32)
tf.Tensor(69302230000.0, shape=(), dtype=float32)
tf.Tensor(69295860000.0, shape=(), dtype=float32)
tf.Tensor(69289530000.0, shape=(), dtype=float32)
tf.Tensor(69283275000.0, shape=(), dtype=float32)
tf.Tensor(69277065000.0, shape=(), dtype=float32)
tf.Tensor(69270910000.0, shape=(), dtype=float32)
tf.Tensor(69264835000.0, shape=(), dtype=float32)
tf.Tensor(69258800000.0, shape=(), dtype=float32)
tf.Tensor(69252820000.0, shape=(), dtype=float32)
tf.Tensor(69246900000.0, shape=(), dtype=float32)
tf.Tensor(69241020000.0, shape=(), dtype=float32)


Print the final intercept and slope

In [26]:
print(intercept.numpy(), slope.numpy())

247.12434 253.93784


## Batch Training

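When the dataset is too large to fit into memory at once, it has to be split into several batches that are trained on one after another (this matters especially for image analysis, because image data is usually large).

Epoch: one complete pass over all batches

Besides handling large datasets, batch training updates the parameters after each batch instead of waiting until the whole dataset has been processed, which can speed up training.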
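The relationship between batches, epochs and parameter updates can be sketched with plain mini-batch gradient descent in NumPy. Everything here is an assumption made for the illustration: the synthetic data, the batch size of 100, the 50 epochs and the fixed learning rate of 0.1. It uses a simple gradient step, not the Adam optimiser used elsewhere in these notes.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data standing in for a dataset that is processed piece by piece.
features = rng.uniform(0.0, 1.0, size=1000)
targets = 0.5 + 4.0 * features + rng.normal(0.0, 0.1, size=1000)

intercept, slope = 0.1, 0.1
learning_rate, batch_size, epochs = 0.1, 100, 50
updates = 0

for epoch in range(epochs):  # one epoch = one pass over all batches
    for start in range(0, len(features), batch_size):
        x = features[start:start + batch_size]
        y = targets[start:start + batch_size]
        residual = (intercept + slope * x) - y
        # Gradient of the batch MSE with respect to each parameter.
        grad_intercept = 2.0 * np.mean(residual)
        grad_slope = 2.0 * np.mean(residual * x)
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
        updates += 1  # the parameters change after every batch

print(updates, intercept, slope)
```

With 1000 rows and a batch size of 100, each epoch performs 10 updates, so 50 epochs give 500 updates, compared with only 50 if the parameters were updated once per full pass.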

### Load data in batches

pandas read_csv can do this: passing the chunksize parameter sets the number of rows in each chunk.

In [27]:
import pandas as pd
import numpy as np

In [28]:
for batch in pd.read_csv('Datasets/kc_house_data.csv', chunksize = 100):
    price = np.array(batch["price"], dtype = np.float32)
    size = np.array(batch["sqft_living"], dtype = np.float32)
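To see how chunksize splits a file, the following sketch reads a made-up 250-row CSV held in memory and records the size of each chunk. The in-memory buffer and the names rows, csv_buffer and batch_sizes are hypothetical stand-ins for a real file on disk.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical in-memory CSV with 250 data rows.
rows = ["price,sqft_living"] + [f"{1000.0 * i},{10 * i}" for i in range(250)]
csv_buffer = io.StringIO("\n".join(rows))

batch_sizes = []
for batch in pd.read_csv(csv_buffer, chunksize=100):
    # Each batch is an ordinary DataFrame holding at most 100 rows.
    price = np.array(batch["price"], dtype=np.float32)
    batch_sizes.append(len(price))

print(batch_sizes)  # [100, 100, 50]
```

The last chunk is smaller whenever the row count is not a multiple of chunksize, so code that depends on a fixed batch shape has to handle that case.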

### Hands-on example

In [29]:
# import the required packages
import tensorflow as tf
import pandas as pd
import numpy as np

In [30]:
# Define the trainable parameters
intercept = tf.Variable(0.1, dtype = tf.float32)
slope = tf.Variable(0.1, dtype = tf.float32)

In [31]:
# Define the regression model
def linear_regression(intercept, slope, features):
    return intercept + slope * features

In [32]:
# Define the loss function
def loss_function(intercept, slope, features, actuals):
    predictions = linear_regression(intercept, slope, features)
    return tf.keras.losses.mse(actuals, predictions)

In [33]:
# Define the optimiser
opt = tf.keras.optimizers.Adam()

Start batch training

In [35]:
for batch in pd.read_csv('Datasets/kc_house_data.csv', chunksize = 100):
    price = np.array(batch["price"], dtype = np.float32)
    size = np.array(batch["sqft_living"], dtype = np.float32)
    
    opt.minimize(lambda:loss_function(intercept, slope, size, price), var_list = [intercept, slope])

In [36]:
# Print the intercept and slope after one pass over the batches
print(intercept.numpy(), slope.numpy())

0.31799173 0.31615734


## Batch training vs full training

![](Image/Image7.jpg)