# 深度學習Pytorch手把手實作-資料庫: 私有資料庫存取方式

<br>**<font color = blue size=4 face=雅黑>1. When the data is structured</font>**<br/>
<br>**<font color = blue size=4 face=雅黑>2. When the data is unstructured (images)</font>**<br/>
___

<br>**<font color = Blue size=5>1. When the data is structured</font>**<br/>

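><font color = black size=4>Structured data can be stored in many formats, such as txt, xml, json, pickle, and csv. Whatever the format, all that matters is that the data can be read into memory and arranged into the form the model will be trained on.</font>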


<br><font color = black size=4>In this lesson we use the IRIS data as an example of a private dataset. I have already converted iris.data to CSV format (iris.csv).</font><br/>
<font color = black size=4>iris.csv is stored at [/dataset/iris.csv](./dataset/iris.csv). Opened in Excel, the file looks like this:</font>
<img src="Image/irisdata.png" width="40%">


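<br><font color = black size=4>Here we read the CSV with the numpy module, but numpy is not required; use whichever method you find convenient.</font><br/>
<br><font color = black size=4>Because this CSV contains text fields, we read every value as a string (str).</font><br/>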

In [1]:
import numpy as np
csv_iris_filepath = './dataset/iris.csv'
# dtype=str: the np.str alias was removed in recent NumPy releases
csvdata_iris = np.loadtxt(csv_iris_filepath, dtype=str, delimiter=',')
# Print a few rows to inspect the data
print(csvdata_iris[0:150:20,:])

[['5.1' '3.5' '1.4' '0.2' 'Iris-setosa']
 ['5.4' '3.4' '1.7' '0.2' 'Iris-setosa']
 ['5.0' '3.5' '1.3' '0.3' 'Iris-setosa']
 ['5.0' '2.0' '3.5' '1.0' 'Iris-versicolor']
 ['5.5' '2.4' '3.8' '1.1' 'Iris-versicolor']
 ['6.3' '3.3' '6.0' '2.5' 'Iris-virginica']
 ['6.9' '3.2' '5.7' '2.3' 'Iris-virginica']
 ['6.7' '3.1' '5.6' '2.4' 'Iris-virginica']]
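
As noted above, numpy is not the only way to read the file. The following is a minimal sketch using Python's built-in `csv` module. The helper name `read_csv_rows` and the in-memory sample rows are illustrative assumptions, not part of the original notebook.

```python
import csv
import io

def read_csv_rows(fileobj):
    """Read every non-empty row of a comma-separated file as a list of strings."""
    reader = csv.reader(fileobj)
    return [row for row in reader if row]

# Illustrative in-memory sample with the same column layout as iris.csv.
# For the real file, pass open('./dataset/iris.csv', newline='') instead.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
)
rows = read_csv_rows(sample)
print(rows[0])
```

Like the numpy version, every field comes back as a string, so the numeric conversion still has to happen in a later step.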


<font color = black size=4>We convert the string numpy array to float and map each class name to an integer label:</font>
><font color = blue size=3>Iris-setosa is class 0</font><br/>
<font color = blue size=3>Iris-versicolor is class 1</font><br/>
<font color = blue size=3>Iris-virginica is class 2</font><br/>


In [2]:
label_names=['Iris-setosa','Iris-versicolor','Iris-virginica']
x_iris=[]
y =[]
for line in csvdata_iris:
    tmp_data = []
    for tmp in line:
        if tmp in label_names:
            # The position in label_names is the class index (0, 1, 2)
            y.append(label_names.index(tmp))
        else:
            # Every other field is a measurement; use the built-in float
            # (the np.float alias was removed in recent NumPy releases)
            tmp_data.append(float(tmp))
    x_iris.append(tmp_data)

x_iris = np.array(x_iris)
y = np.array(y)
print(x_iris[0:150:20,:])
print(y[0:150:15])

[[5.1 3.5 1.4 0.2]
 [5.4 3.4 1.7 0.2]
 [5.  3.5 1.3 0.3]
 [5.  2.  3.5 1. ]
 [5.5 2.4 3.8 1.1]
 [6.3 3.3 6.  2.5]
 [6.9 3.2 5.7 2.3]
 [6.7 3.1 5.6 2.4]]
[0 0 0 0 1 1 1 2 2 2]
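
The same conversion can also be written without explicit nested loops. The sketch below assumes that the class name is always in the last of five columns. The names `LABEL_TO_INDEX`, `split_features_labels`, and the two demo rows are hypothetical and only serve the illustration.

```python
import numpy as np

# Class name -> integer label, following the mapping stated above
LABEL_TO_INDEX = {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}

def split_features_labels(str_rows):
    """Split an (N, 5) string array into float features and integer labels."""
    str_rows = np.asarray(str_rows)
    # Assumption: the first four columns are the measurements
    features = str_rows[:, :4].astype(float)
    # Assumption: the fifth column holds the class name
    labels = np.array([LABEL_TO_INDEX[name] for name in str_rows[:, 4]])
    return features, labels

# Illustrative rows; in the notebook this would be csvdata_iris
demo = np.array([
    ['5.1', '3.5', '1.4', '0.2', 'Iris-setosa'],
    ['6.3', '3.3', '6.0', '2.5', 'Iris-virginica'],
])
x_demo, y_demo = split_features_labels(demo)
print(x_demo.shape, y_demo)
```

A dictionary lookup raises a `KeyError` on an unexpected class name, which makes a typo in the CSV visible instead of silently dropping the label.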


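<font color = black size=4>Once the data is arranged into numpy arrays, decide whether to save it again in numpy, json, or pickle format so that this step can be skipped when training the model later. If the preprocessing is fast enough that saving is unnecessary, simply rerun it before each training run.</font><br>
><font color = red size=4>Saving is recommended: with ten thousand records or more, the preprocessing becomes genuinely time-consuming.</font><br>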

<font color = black size=4>Here we store the data in numpy format, as follows:</font><br>

In [3]:
np.save("./dataset/iris_x.npy", x_iris)
np.save("./dataset/iris_y.npy", y)

x = np.load( "./dataset/iris_x.npy" )
y = np.load( "./dataset/iris_y.npy" )
print(x[0:150:20,:])
print(y[0:150:20])

[[5.1 3.5 1.4 0.2]
 [5.4 3.4 1.7 0.2]
 [5.  3.5 1.3 0.3]
 [5.  2.  3.5 1. ]
 [5.5 2.4 3.8 1.1]
 [6.3 3.3 6.  2.5]
 [6.9 3.2 5.7 2.3]
 [6.7 3.1 5.6 2.4]]
[0 0 0 1 1 2 2 2]
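
If keeping features and labels in two separate files is inconvenient, `np.savez` can bundle them into one archive. This is a sketch under assumptions: the helper names, the key names `x` and `y`, and the temporary demo file are all hypothetical and not part of the original notebook.

```python
import os
import tempfile
import numpy as np

def save_dataset(path, x, y):
    """Store features and labels together in one .npz archive."""
    np.savez(path, x=x, y=y)

def load_dataset(path):
    """Load both arrays back; the keys match the keyword names used when saving."""
    with np.load(path) as archive:
        return archive['x'], archive['y']

# Illustrative round trip in a temporary folder
x_demo = np.array([[5.1, 3.5, 1.4, 0.2], [6.3, 3.3, 6.0, 2.5]])
y_demo = np.array([0, 2])
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, 'iris_demo.npz')
    save_dataset(demo_path, x_demo, y_demo)
    x_back, y_back = load_dataset(demo_path)
print(np.array_equal(x_demo, x_back), np.array_equal(y_demo, y_back))
```

Keeping both arrays in one file avoids the risk of loading features and labels that come from different preprocessing runs.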


<br>**<font color = Blue size=5>2. When the data is unstructured (images)</font>**<br/>

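><br>**<font color = black size=4>When the data is private and unstructured, we usually do not load all of it into memory at once, because there may be tens of thousands or even millions of images, as in a dataset like ImageNet.</font>**<br/>
<br>**<font color = black size=4>So we usually work only with the file paths: write the paths of all images into a format that is convenient for you to read, and later, in PyTorch, read that file back in the `__init__` of the dataset class.
This part therefore only gives a brief introduction to how we typically prepare such data.</font>**<br/>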
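
To make the idea concrete, here is a minimal sketch of a map-style dataset that holds only paths and labels and touches an image file only when a sample is requested. It assumes the index is a JSON file with the keys `imagepaths` and `labels`, the layout used later in this section. The class name `ImagePathDataset` is hypothetical, PyTorch is deliberately not imported so the sketch runs on its own, and raw bytes stand in for real image decoding. In actual PyTorch code such a class would typically subclass `torch.utils.data.Dataset` and return a decoded image tensor.

```python
import json
import os
import tempfile

class ImagePathDataset:
    """Map-style dataset sketch: keeps only paths and labels in memory."""

    def __init__(self, json_path):
        # Only the index file is read here; no image is opened yet
        with open(json_path) as f:
            index = json.load(f)
        self.imagepaths = index['imagepaths']
        self.labels = index['labels']

    def __len__(self):
        return len(self.imagepaths)

    def __getitem__(self, idx):
        # The file is read only when this sample is requested.
        # Raw bytes stand in for image decoding in this sketch.
        with open(self.imagepaths[idx], 'rb') as f:
            raw = f.read()
        return raw, self.labels[idx]

# Illustrative check with two tiny fake "image" files
with tempfile.TemporaryDirectory() as tmpdir:
    paths = []
    for name in ('a.jpg', 'b.jpg'):
        p = os.path.join(tmpdir, name)
        with open(p, 'wb') as f:
            f.write(name.encode())
        paths.append(p)
    index_path = os.path.join(tmpdir, 'index.json')
    with open(index_path, 'w') as f:
        json.dump({'imagepaths': paths, 'labels': [1, 0]}, f)
    dataset = ImagePathDataset(index_path)
    n_samples = len(dataset)
    first_raw, first_label = dataset[0]
print(n_samples, first_label)
```

Memory use depends on the number of paths, not on the size of the images, which is why this layout scales to very large image collections.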

<br>**<font color = black size=4>This example uses Car Brands Images, an image dataset that someone else compiled on Kaggle.</font>**<br/>
Car Brands Images dataset link: https://www.kaggle.com/yamaerenay/100-images-of-top-50-car-brands

<br><font color = black size=3>The downloaded Car Brands Images dataset contains the image files and a meta-information file (in .csv format).</font><br/>

<img src="Image/car_file.png" width="80%">

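<br><font color = Red size=4>The goal with this dataset is to decide from the image whether the car is a luxury car or an ordinary car.</font><br/>
<br><font color = black size=3>So we need the meta information provided with the dataset to sort the brands into luxury and ordinary cars.</font><br/>
#### Reading the meta information in companies.csv, again using numpy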

In [2]:
import numpy as np
csv_filename='./dataset/kaggle/CarBrandsImages/companies.csv'
# dtype=str: the np.str alias was removed in recent NumPy releases
csvdata = np.loadtxt(csv_filename, dtype=str, delimiter=',')
print(csvdata[0:3,:])

[['rank' 'logo_link' 'origin' 'name' 'segment']
 ['1' 'https://www.carlogos.org/car-logos/toyota-logo.png' 'Japan'
  'Toyota' 'Mass-Market Cars']
 ['2' 'https://www.carlogos.org/car-logos/honda-logo.png' 'Japan' 'Honda'
  'Mass-Market Cars']]


### Processing based on the meta information in companies.csv

The images are organized into 50 folders, one per brand.

>By origin, the brands come from countries including
Japan, United States, Germany, Italy, Sweden, United Kingdom, South Korea, and France.

>By segment, the brands fall into:
Mass-Market Cars; 
Luxury Vehicles; 
Sport Utility Vehicles; 
Luxury Sports Cars; 
Pickup Trucks; 
Ultra-luxury Cars; 
Performance Cars; 
Luxury Sport Utility Vehicles; 
Luxury Electric Vehicles; 
Pickup Trucks, Vans; 
Super Luxury Sports Cars; 
Luxury Supercars; 
Small Cars; 
Luxury Small Cars; 
Automobiles; 
Economy Cars; 
Vehicles; 

<br><font color = Red size=4>For this application we use two classes: ordinary cars (class 0) and luxury cars (class 1).</font><br/>

<br><font color = black size=4>Ordinary cars: 
Mass-Market Cars;
Pickup Trucks; 
Pickup Trucks, Vans; 
Small Cars; 
Automobiles; 
Economy Cars; 
Vehicles; </font><br/>
<br><font color = black size=4>Luxury cars: 
Luxury Vehicles;
Sport Utility Vehicles; 
Luxury Sports Cars; 
Ultra-luxury Cars; 
Performance Cars; 
Luxury Sport Utility Vehicles; 
Luxury Electric Vehicles; 
Super Luxury Sports Cars; 
Luxury Supercars; 
Luxury Small Cars; </font><br/>
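
The grouping above can be turned into a single lookup table so that each segment name maps directly to its class. The sketch below is illustrative: the names `ordinary_segments`, `luxury_segments`, `segment_to_label`, and `label_for_segment` are hypothetical, and placing 'Pickup Trucks, Vans' with the ordinary cars is an assumption.

```python
# Segment lists following the grouping above.
# Assumption: 'Pickup Trucks, Vans' belongs with the ordinary cars.
ordinary_segments = ['Mass-Market Cars', 'Pickup Trucks', 'Pickup Trucks, Vans',
                     'Small Cars', 'Automobiles', 'Economy Cars', 'Vehicles']
luxury_segments = ['Luxury Vehicles', 'Sport Utility Vehicles', 'Luxury Sports Cars',
                   'Ultra-luxury Cars', 'Performance Cars',
                   'Luxury Sport Utility Vehicles', 'Luxury Electric Vehicles',
                   'Super Luxury Sports Cars', 'Luxury Supercars', 'Luxury Small Cars']

# One lookup table: segment name -> class label
segment_to_label = {name: 0 for name in ordinary_segments}
segment_to_label.update({name: 1 for name in luxury_segments})

def label_for_segment(segment):
    """Return 0 (ordinary) or 1 (luxury); fail loudly on an unlisted segment."""
    if segment not in segment_to_label:
        raise ValueError('Unlisted segment: {!r}'.format(segment))
    return segment_to_label[segment]

print(label_for_segment('Mass-Market Cars'), label_for_segment('Luxury Supercars'))
```

Raising an error for an unlisted segment prevents a brand from silently inheriting the label of whichever brand was processed before it.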


In [5]:
import os
import numpy as np

# Segments treated as ordinary cars (label 0) and as luxury cars (label 1)
massmark = ['Mass-Market Cars','Pickup Trucks','Pickup Trucks, Vans', 'Small Cars','Automobiles', 'Economy Cars', 'Vehicles']  
luxury = ['Luxury Vehicles', 'Sport Utility Vehicles','Luxury Sports Cars','Ultra-luxury Cars', 'Performance Cars',
        'Luxury Sport Utility Vehicles','Luxury Electric Vehicles','Super Luxury Sports Cars','Luxury Supercars','Luxury Small Cars']
imagepaths, labels=[],[]
for dirname, _, filenames in os.walk('./dataset/kaggle/CarBrandsImages/imgs'):
    if len(filenames)!=0:
        # The folder name is the brand; basename uses the separator of the current OS
        brand = os.path.basename(dirname)
        # Column 3 of the CSV is the brand name, column 4 is its segment
        pos=np.where(csvdata[:,3]== brand)[0]
        segment = csvdata[pos,4][0]
        if segment in massmark:
            label = 0
        elif segment in luxury:
            label = 1
        else:
            # Without this branch an unlisted segment would reuse the previous label
            raise ValueError('Unknown segment: {}'.format(segment))
        print('{} : {}'.format(brand, label))
        for filename in filenames:
            imagepaths.append(os.path.join(dirname, filename))
            labels.append(label)

Acura : 1
Alfa Romeo : 1
Aston Martin : 1
Audi : 1
Bentley : 1
BMW : 1
Bugatti : 1
Buick : 0
Cadillac : 1
Chevrolet : 0
Chrysler : 1
Citroen : 0
Daewoo : 0
Dodge : 1
Ferrari : 1
Fiat : 0
Ford : 0
Genesis : 1
GMC : 0
Honda : 0
Hudson : 0
Hyundai : 0
Infiniti : 1
Jaguar : 1
Jeep : 1
Kia : 0
Land Rover : 1
Lexus : 1
Lincoln : 1
Maserati : 1
Mazda : 0
Mercedes-Benz : 1
MG : 0
Mini : 1
Mitsubishi : 0
Nissan : 0
Oldsmobile : 0
Peugeot : 0
Pontiac : 1
Porsche : 1
Ram Trucks : 0
Renault : 0
Saab : 0
Studebaker : 0
Subaru : 0
Suzuki : 0
Tesla : 1
Toyota : 0
Volkswagen : 0
Volvo : 1


<br><font color = black size=4>Save the file paths and labels to a JSON file</font><br/>

In [6]:
import json
data={}
data['imagepaths']=imagepaths
data['labels']=labels
with open('./dataset/kaggle/CarBrandsImages/carbrand.json', 'w', newline='') as jsonfile:
    json.dump(data, jsonfile)
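
When the index is built on Windows, `os.path.join` joins folder and file names with backslashes, so the stored paths may not resolve if the JSON file is later used on Linux or macOS. One possible precaution, shown here as a sketch, is to convert the separators to forward slashes before saving. The helper name `to_portable_paths` and the demo path are illustrative assumptions.

```python
import json

def to_portable_paths(paths):
    """Replace Windows backslashes with forward slashes."""
    return [p.replace('\\', '/') for p in paths]

# Illustrative path in the style produced on Windows
demo_paths = ['./dataset/kaggle/CarBrandsImages/imgs\\Acura\\Acura_000.jpg']
portable = to_portable_paths(demo_paths)
# json.dumps shows the text that json.dump would write to the file
print(json.dumps({'imagepaths': portable, 'labels': [1]}))
```

In the cell above this would amount to storing `to_portable_paths(imagepaths)` under the `imagepaths` key instead of the raw list.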
    

<br><font color = black size=4>Read the JSON file back</font><br/>

In [7]:
with open('./dataset/kaggle/CarBrandsImages/carbrand.json') as jsonfile:
    data_load = json.load(jsonfile)
for imagepath,label in zip(data_load['imagepaths'],data_load['labels']):
    print("filepath:{}, label: {}".format(imagepath,label))

filepath:./dataset/kaggle/CarBrandsImages/imgs\Acura\Acura_000.jpg, label: 1
filepath:./dataset/kaggle/CarBrandsImages/imgs\Acura\Acura_001.jpg, label: 1
filepath:./dataset/kaggle/CarBrandsImages/imgs\Acura\Acura_002.jpg, label: 1
filepath:./dataset/kaggle/CarBrandsImages/imgs\Acura\Acura_003.jpg, label: 1
filepath:./dataset/kaggle/CarBrandsImages/imgs\Acura\Acura_004.jpg, label: 1
filepath:./dataset/kaggle/CarBrandsImages/imgs\Acura\Acura_005.jpg, label: 1
filepath:./dataset/kaggle/CarBrandsImages/imgs\Acura\Acura_006.jpg, label: 1
filepath:./dataset/kaggle/CarBrandsImages/imgs\Acura\Acura_007.jpg, label: 1
filepath:./dataset/kaggle/CarBrandsImages/imgs\Acura\Acura_008.jpg, label: 1
filepath:./dataset/kaggle/CarBrandsImages/imgs\Acura\Acura_009.jpg, label: 1
filepath:./dataset/kaggle/CarBrandsImages/imgs\Acura\Acura_010.jpg, label: 1
filepath:./dataset/kaggle/CarBrandsImages/imgs\Acura\Acura_011.jpg, label: 1
filepath:./dataset/kaggle/CarBrandsImages/imgs\Acura\Acura_012.jpg, label: 1

filepath:./dataset/kaggle/CarBrandsImages/imgs\Renault\Renault_079.jpg, label: 0
filepath:./dataset/kaggle/CarBrandsImages/imgs\Renault\Renault_080.jpg, label: 0
filepath:./dataset/kaggle/CarBrandsImages/imgs\Renault\Renault_081.jpg, label: 0
filepath:./dataset/kaggle/CarBrandsImages/imgs\Renault\Renault_082.jpg, label: 0
filepath:./dataset/kaggle/CarBrandsImages/imgs\Renault\Renault_083.jpg, label: 0
filepath:./dataset/kaggle/CarBrandsImages/imgs\Renault\Renault_084.jpg, label: 0
filepath:./dataset/kaggle/CarBrandsImages/imgs\Renault\Renault_085.jpg, label: 0
filepath:./dataset/kaggle/CarBrandsImages/imgs\Renault\Renault_086.jpg, label: 0
filepath:./dataset/kaggle/CarBrandsImages/imgs\Saab\Saab_000.jpg, label: 0
filepath:./dataset/kaggle/CarBrandsImages/imgs\Saab\Saab_001.jpg, label: 0
filepath:./dataset/kaggle/CarBrandsImages/imgs\Saab\Saab_002.jpg, label: 0
filepath:./dataset/kaggle/CarBrandsImages/imgs\Saab\Saab_003.jpg, label: 0
filepath:./dataset/kaggle/CarBrandsImages/imgs\Saab\