### Reading the data
First, we use pandas to read the main data file, application_train.csv (remember to download it from https://www.kaggle.com/c/home-credit-default-risk/data).

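Note: `data/application_train.csv` means that `application_train.csv` is located relative to this `.ipynb` as shown below (the code in In [2] sets `dir_data = './Data/'`, so on a case-sensitive file system the folder name must match that spelling exactly)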
```
data
    /application_train.csv
Day_004_first_EDA.ipynb
```

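# [Learning objectives]
- Get a first feel for reading data and performing simple operations in Python

# [Key points of this example]
- How to read data with pandas.read_csv (In[3], Out[3])
- How to take a quick look at the data pandas has read in (In[5], Out[5])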

In [1]:
import os
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [2]:
# Set the data path
dir_data = './Data/'

#### Read the data with pd.read_csv

In [3]:
f_app = os.path.join(dir_data, 'application_train.csv')
print('Path of read in data: %s' % (f_app))
app_train = pd.read_csv(f_app)

Path of read in data: ./Data/application_train.csv
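
When a file is large, it can help to preview only part of it first. The sketch below is not part of the original notebook: it uses a small in-memory CSV with made-up values (via `io.StringIO`) in place of `application_train.csv`, and illustrates the `nrows` and `usecols` parameters of `pd.read_csv`.

```python
import io
import pandas as pd

# Made-up sample standing in for application_train.csv (illustrative values only)
csv_text = """SK_ID_CURR,TARGET,AMT_CREDIT
1,1,1000.0
2,0,2500.5
3,0,300.0
"""

# Read only the first two data rows
preview = pd.read_csv(io.StringIO(csv_text), nrows=2)

# Read only the selected columns
subset = pd.read_csv(io.StringIO(csv_text), usecols=["SK_ID_CURR", "AMT_CREDIT"])

print(preview.shape)
print(list(subset.columns))
```

With the real file, the same parameters could be passed alongside the path, for example `pd.read_csv(f_app, nrows=1000)`.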


#### Note: in Jupyter Notebook, you can use `?` to look up a function's signature and documentation

In [4]:
# for example
?pd.read_csv
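
The `?` syntax is specific to Jupyter/IPython. In a plain Python script, a rough equivalent is the built-in `help()` function or the `__doc__` attribute. A minimal sketch:

```python
import pandas as pd

# help(pd.read_csv) would print the full documentation to the console;
# here we only look at the first line of the docstring.
doc = pd.read_csv.__doc__
first_line = doc.strip().splitlines()[0]
print(first_line)
```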

#### Next, we can use the .head() method to look at the first 5 rows

In [5]:
app_train.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,FLAG_DOCUMENT_18,FLAG_DOCUMENT_19,FLAG_DOCUMENT_20,FLAG_DOCUMENT_21,AMT_REQ_CREDIT_BUREAU_HOUR,AMT_REQ_CREDIT_BUREAU_DAY,AMT_REQ_CREDIT_BUREAU_WEEK,AMT_REQ_CREDIT_BUREAU_MON,AMT_REQ_CREDIT_BUREAU_QRT,AMT_REQ_CREDIT_BUREAU_YEAR
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,1.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0,0,0,0,,,,,,
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0,0,0,0,0.0,0.0,0.0,0.0,0.0,0.0
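
`.head()` also accepts the number of rows to show, and `.tail()` does the same from the end. The following sketch uses a small made-up DataFrame (not the Home Credit data) to illustrate both:

```python
import pandas as pd

# Made-up data for illustration only
df = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3, 4, 5, 6, 7],
    "TARGET": [1, 0, 0, 0, 1, 0, 0],
})

first_three = df.head(3)  # first 3 rows instead of the default 5
last_two = df.tail(2)     # last 2 rows

print(first_three)
print(last_two)
```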


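## Practice time
There are many ways to manipulate data, and the rest of the marathon will introduce the most commonly used ones. Before that, participants may want to ask themselves: when we see a dataset for the first time, what do we usually want to know?

#### Ex: how many rows and columns the data has, which columns exist and how many there are, how to extract part of the data, and so on

Once we are curious about the data, how do we satisfy that curiosity with code?

#### See this [basic tutorial](https://bookdata.readthedocs.io/en/latest/base/01_pandas.html#DataFrame-%E5%85%A5%E9%97%A8) or search on Google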

In [6]:
# Wrap the data in a DataFrame and print its basic information
DF_src = DataFrame(app_train)
DF_src.info()
# Unpack the shape into the number of rows and the number of columns
(DF_rows, DF_columns) = DF_src.shape
print('Number of columns = ', DF_columns)
print('Number of rows = ', DF_rows)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 307511 entries, 0 to 307510
Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR
dtypes: float64(65), int64(41), object(16)
memory usage: 286.2+ MB
Number of columns =  122
Number of rows =  307511
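
The row count and the per-type column counts reported by `.info()` can also be obtained directly. A minimal sketch on a made-up DataFrame (the column names are borrowed from the dataset, the values are invented):

```python
import pandas as pd

# Made-up data for illustration only
df = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "AMT_CREDIT": [1000.0, 2500.5, 300.0],
    "CODE_GENDER": ["M", "F", "M"],
})

n_rows, n_cols = df.shape                # (rows, columns)
dtype_counts = df.dtypes.value_counts()  # number of columns per dtype

print(n_rows, n_cols)
print(len(df))        # len() of a DataFrame is its number of rows
print(dtype_counts)
```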


In [7]:
# Get the row labels and the column labels
(row, col) = DF_src.axes
print('columns = ', col)
print('rows = ', row)

columns =  Index(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE', 'CODE_GENDER',
       'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
       'AMT_CREDIT', 'AMT_ANNUITY',
       ...
       'FLAG_DOCUMENT_18', 'FLAG_DOCUMENT_19', 'FLAG_DOCUMENT_20',
       'FLAG_DOCUMENT_21', 'AMT_REQ_CREDIT_BUREAU_HOUR',
       'AMT_REQ_CREDIT_BUREAU_DAY', 'AMT_REQ_CREDIT_BUREAU_WEEK',
       'AMT_REQ_CREDIT_BUREAU_MON', 'AMT_REQ_CREDIT_BUREAU_QRT',
       'AMT_REQ_CREDIT_BUREAU_YEAR'],
      dtype='object', length=122)
rows =  RangeIndex(start=0, stop=307511, step=1)
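
The same labels are available through the `.index` and `.columns` attributes, which are often easier to read than unpacking `.axes`. A short sketch on made-up data:

```python
import pandas as pd

# Made-up data for illustration only
df = pd.DataFrame({"SK_ID_CURR": [1, 2], "TARGET": [0, 1]})

column_names = df.columns.tolist()  # column labels as a plain Python list
row_labels = df.index.tolist()      # row labels as a plain Python list
has_target = "TARGET" in df.columns # check whether a column exists

print(column_names)
print(row_labels)
print(has_target)
```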


In [33]:
# Select one column by label; with a single label, .loc returns a Series rather than a DataFrame
Sample_DF = DF_src.loc[:, 'AMT_CREDIT']
#?DF_src.loc
print(Sample_DF)

0          406597.5
1         1293502.5
2          135000.0
3          312682.5
4          513000.0
5          490495.5
6         1560726.0
7         1530000.0
8         1019610.0
9          405000.0
10         652500.0
11         148365.0
12          80865.0
13         918468.0
14         773680.5
15         299772.0
16         509602.5
17         270000.0
18         157500.0
19         544491.0
20         427500.0
21        1132573.5
22         497520.0
23         239850.0
24         247500.0
25         225000.0
26         979992.0
27         327024.0
28         790830.0
29         180000.0
            ...    
307481     297000.0
307482     500566.5
307483     247275.0
307484     545040.0
307485     180000.0
307486     355536.0
307487    1071909.0
307488     135000.0
307489     521280.0
307490     135000.0
307491    1078200.0
307492    1575000.0
307493     946764.0
307494     479700.0
307495     808650.0
307496     337500.0
307497     270126.0
307498    1312110.0
307499     225000.0
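
Selecting one column with a single label gives a Series, while passing a one-element list of labels gives a DataFrame. The sketch below shows the difference on a made-up DataFrame:

```python
import pandas as pd

# Made-up data for illustration only
df = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "AMT_CREDIT": [1000.0, 2500.5, 300.0],
})

as_series = df.loc[:, "AMT_CREDIT"]   # single label -> Series
as_frame = df.loc[:, ["AMT_CREDIT"]]  # list of labels -> DataFrame
shorthand = df["AMT_CREDIT"]          # same Series via bracket indexing

print(type(as_series).__name__)
print(type(as_frame).__name__)
```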


In [36]:
# Select several columns by label; a list of labels returns a DataFrame
#Samples_DF = DF_src[['SK_ID_CURR', 'AMT_INCOME_TOTAL', 'AMT_CREDIT']]
Samples_DF = DF_src.loc[:, ['AMT_CREDIT', 'SK_ID_CURR']]
print(Samples_DF)

        AMT_CREDIT  SK_ID_CURR
0         406597.5      100002
1        1293502.5      100003
2         135000.0      100004
3         312682.5      100006
4         513000.0      100007
5         490495.5      100008
6        1560726.0      100009
7        1530000.0      100010
8        1019610.0      100011
9         405000.0      100012
10        652500.0      100014
11        148365.0      100015
12         80865.0      100016
13        918468.0      100017
14        773680.5      100018
15        299772.0      100019
16        509602.5      100020
17        270000.0      100021
18        157500.0      100022
19        544491.0      100023
20        427500.0      100024
21       1132573.5      100025
22        497520.0      100026
23        239850.0      100027
24        247500.0      100029
25        225000.0      100030
26        979992.0      100031
27        327024.0      100032
28        790830.0      100033
29        180000.0      100034
...            ...         ...
307481  

In [45]:
# Select rows at positions 0, 2 and 5, keeping only the first three columns
Samples_DF = DF_src.iloc[[0, 2, 5], 0:3]
print(Samples_DF)

   SK_ID_CURR  TARGET NAME_CONTRACT_TYPE
0      100002       1         Cash loans
2      100004       0    Revolving loans
5      100008       0         Cash loans
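
`.iloc` selects by integer position, whereas `.loc` selects by label. With the default `RangeIndex` the two look the same, so the sketch below uses a made-up DataFrame with a non-default index to make the difference visible:

```python
import pandas as pd

# Made-up data with row labels 10, 20, 30 (illustration only)
df = pd.DataFrame({"AMT_CREDIT": [1000.0, 2500.5, 300.0]}, index=[10, 20, 30])

by_position = df.iloc[0]  # first row, regardless of its label
by_label = df.loc[10]     # row whose label is 10

pos_slice = df.iloc[0:1]    # positional slice excludes the end position
label_slice = df.loc[10:20] # label slice includes the end label

print(by_position["AMT_CREDIT"], by_label["AMT_CREDIT"])
print(len(pos_slice), len(label_slice))
```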


In [30]:
# Select rows at positions 0, 2 and 9; .iloc requires integer positions, not strings
Samples_DF = DF_src.iloc[[0, 2, 9]]
print(Samples_DF)

   SK_ID_CURR  TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR  \
0      100002       1         Cash loans           M            N   
2      100004       0    Revolving loans           M            Y   
9      100012       0    Revolving loans           M            N   

  FLAG_OWN_REALTY  CNT_CHILDREN  AMT_INCOME_TOTAL  AMT_CREDIT  AMT_ANNUITY  \
0               Y             0          202500.0    406597.5      24700.5   
2               Y             0           67500.0    135000.0       6750.0   
9               Y             0          135000.0    405000.0      20250.0   

   ...  FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21  \
0  ...                 0                0                0                0   
2  ...                 0                0                0                0   
9  ...                 0                0                0                0   

  AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY  \
0                        0.0       

In [20]:
?DF_src.iloc
DF_src.iloc[0]

SK_ID_CURR                                           100002
TARGET                                                    1
NAME_CONTRACT_TYPE                               Cash loans
CODE_GENDER                                               M
FLAG_OWN_CAR                                              N
FLAG_OWN_REALTY                                           Y
CNT_CHILDREN                                              0
AMT_INCOME_TOTAL                                     202500
AMT_CREDIT                                           406598
AMT_ANNUITY                                         24700.5
AMT_GOODS_PRICE                                      351000
NAME_TYPE_SUITE                               Unaccompanied
NAME_INCOME_TYPE                                    Working
NAME_EDUCATION_TYPE           Secondary / secondary special
NAME_FAMILY_STATUS                     Single / not married
NAME_HOUSING_TYPE                         House / apartment
REGION_POPULATION_RELATIVE              