Objective
=== 
Understand how to handle the format of CSV data and how to analyze it.

## EDA: Introduction to Data Types (Format Handling)

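**Variable types:**
1. Discrete variables: variables that can only be counted in whole units
    - Number of houses; categorical labels such as gender or country
2. Continuous variables: variables that can take any value within a given interval
    - Measured height, the time a plane takes from takeoff to landing, vehicle speed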
    
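**Data types**
1. float64: floating-point numbers (may hold discrete or continuous values)
2. int64: integers (may hold discrete or continuous values)
3. object: contains strings (categorical variables)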
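
As a minimal sketch of how these three dtypes show up in pandas, the toy DataFrame below uses made-up column names and values (assumptions for illustration, not taken from the dataset). The text column is created with `dtype=object` explicitly so the result does not depend on the pandas version's default string dtype.

```python
import pandas as pd

# Toy data (hypothetical columns and values, for illustration only)
df = pd.DataFrame({
    "height_cm": [172.5, 160.2, 181.0],   # continuous measurement -> float64
    "n_houses": [1, 0, 2],                # discrete count -> int64
    "country": pd.Series(["France", "Spain", "Germany"], dtype=object),  # text -> object
})

# One dtype per column
print(df.dtypes)

# Names of the columns that hold strings (candidates for encoding)
categorical_cols = df.select_dtypes(include=["object"]).columns.tolist()
print(categorical_cols)
```

Only the `object` column needs encoding before a model can use it; the numeric columns can be passed through as they are.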

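**How should the data be handled?**
For further analysis (e.g. training a model), string/categorical columns have to be encoded:
- Label encoding: typically used when the categories are ordered. For example, if the column is an age group with the categories child, young adult, and senior, encoding them as 0, 1, 2 is reasonable, because in age senior > young adult and young adult > child.
- One-hot encoding: typically used when the categories are unordered, for example countries.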
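
The two cases above can be sketched with plain pandas. The DataFrame, the column names, and the `age_order` mapping are hypothetical and only illustrate the idea: an explicit mapping keeps the intended order for the age groups, while `pd.get_dummies` creates one indicator column per country.

```python
import pandas as pd

# Toy data (hypothetical, for illustration only)
df = pd.DataFrame({
    "age_group": ["child", "senior", "young adult", "child"],
    "country": ["France", "Spain", "Germany", "France"],
})

# Ordered categories: spell out the order so that child < young adult < senior
age_order = {"child": 0, "young adult": 1, "senior": 2}
df["age_group_code"] = df["age_group"].map(age_order)

# Unordered categories: one indicator column per country
country_dummies = pd.get_dummies(df["country"], prefix="country")

encoded = pd.concat([df[["age_group_code"]], country_dummies], axis=1)
print(encoded)
```

Writing the mapping by hand is a deliberate choice here: it guarantees that the integer codes follow the real-world order rather than, say, alphabetical order.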

## Resources 

0. Machine Learning 30-Day Ironman Challenge

[IT幫幫忙](https://ithelp.ithome.com.tw/users/20112568/ironman)

1. Label Encoder & One Hot Encoder

[Label Encoder vs. One Hot Encoder in Machine Learning - Medium](https://medium.com/@contactsunny/label-encoder-vs-one-hot-encoder-in-machine-learning-3fc273365621)

[Data Preprocessing: Label Encoder vs. One Hot Encoder - Medium](https://medium.com/@PatHuang/%E5%88%9D%E5%AD%B8python%E6%89%8B%E8%A8%98-3-%E8%B3%87%E6%96%99%E5%89%8D%E8%99%95%E7%90%86-label-encoding-one-hot-encoding-85c983d63f87)

2. fit function in scikit learn: 

[fit func - scikit learn document](https://scikit-learn.org/stable/tutorial/basic/tutorial.html)

[fit func in elaboration - StackOverflow](https://stackoverflow.com/questions/45704226/what-does-fit-method-in-scikit-learn-do)

3. Unique func: 

[Unique func - GeeksforGeeks](https://www.geeksforgeeks.org/python-get-unique-values-list/)

4. pandas `get_dummies` func:

[pd.get_dummies() - pandas documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html)

5. fit, transform, fit_transform func in scikit learn:

[fit, transform, fit_transform func - Kaggle](https://www.kaggle.com/questions-and-answers/58368)

### Raw data 
![](https://i.imgur.com/z0JkKwD.png)

### 1. Label Encoder (the program may misread the order of the data)
We can't have text in our data if we're going to run any kind of model on it, so we have to convert all kinds of categorical text data into numerical data that the model can understand.

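```python
from sklearn.preprocessing import LabelEncoder

# `x` is assumed to be a NumPy object array whose first column holds the country names
labelencoder = LabelEncoder()
# Replace the text labels in column 0 with integer codes
x[:, 0] = labelencoder.fit_transform(x[:, 0])
```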
The result
![](https://i.imgur.com/lkeaXNA.png)


#### Disadvantage:
Since there are different numbers in the same column, the model will ***misunderstand the data to be in some kind of order, 0 < 1 < 2 (for different countries).*** But this isn't the case at all. To overcome this problem, we use One Hot Encoder.
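
The spurious ordering can be seen in a small sketch. The `countries` list is illustrative data (an assumption, modeled on the France/Germany/Spain example); `LabelEncoder` sorts the labels and numbers them, so the codes reflect alphabetical order and nothing else.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels (an assumption, not the original dataset)
countries = ["France", "Spain", "Germany", "Spain", "France"]

le = LabelEncoder()
codes = le.fit_transform(countries)

# The learned categories are stored in sorted order; the index is the code
print(list(le.classes_))
print(codes.tolist())

# A model that treats the codes as numbers would see Spain (2) > Germany (1) > France (0),
# although the countries have no such order.
decoded = le.inverse_transform(codes)
```

`inverse_transform` recovers the original labels, which shows that the integers are only identifiers and carry no magnitude.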

## Program (Label Encoding)

Label encoding imposes an order (0 < 1 < 2 < ...) on the categories within the same column.

```python
from sklearn.preprocessing import LabelEncoder
import pandas as pd

app_train = pd.read_csv('application_train.csv')
app_test = pd.read_csv('application_test.csv')
```

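So here we demonstrate label encoding only on categorical columns that have 2 or fewer categories. This does not mean it is the best treatment; it all depends on which representation suits the meaning of the column itself. We then use `fit` and `transform`: for a `LabelEncoder`, `fit` learns the set of categories from the training data and `transform` maps each category to its integer code. (For a model, `fit` instead estimates the function that best represents the data points, which could be a line, a polynomial, or discrete borders.)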

```python
# Excerpt from the body of the loop shown below:
# `col` is the current column name and `le` is a LabelEncoder instance.
if len(list(app_train[col].unique())) <= 2:
    # Print the column as a list to inspect its raw labels (e.g. 'Y' / 'N')
    print(list(app_train[col]))

    # Train on the training data
    le.fit(app_train[col])
    # Transform both training and testing data
    app_train[col] = le.transform(app_train[col])
    app_test[col] = le.transform(app_test[col])
```
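
The reason for fitting on the training data only and then transforming both sets can be sketched on toy data. `FLAG_OWN_CAR` is one of the dataset's columns, but the `'Y'`/`'N'` values and the tiny `train`/`test` frames are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-ins for the training and testing sets (hypothetical values)
train = pd.DataFrame({"FLAG_OWN_CAR": ["Y", "N", "N", "Y"]})
test = pd.DataFrame({"FLAG_OWN_CAR": ["N", "Y"]})

le = LabelEncoder()
le.fit(train["FLAG_OWN_CAR"])                      # learn the label -> code mapping once
train_codes = le.transform(train["FLAG_OWN_CAR"])  # apply it to the training data
test_codes = le.transform(test["FLAG_OWN_CAR"])    # reuse the same mapping on the test data

print(train_codes.tolist(), test_codes.tolist())

# A label that was not present during fit cannot be transformed
try:
    le.transform(["Maybe"])
    unseen_accepted = True
except ValueError:
    unseen_accepted = False
```

Reusing one fitted encoder keeps the codes consistent between the two sets; fitting a second encoder on the test data could assign different integers to the same label.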

```python
# Create a label encoder object
le = LabelEncoder()
le_count = 0

# Iterate through the columns
for col in app_train:
    if app_train[col].dtype == 'object':
        # If 2 or fewer unique categories
        if len(list(app_train[col].unique())) <= 2:
            # Optionally print the column as a list to inspect its raw labels
            # print(list(app_train[col]))

            # Train on the training data
            le.fit(app_train[col])
            # Transform both training and testing data
            app_train[col] = le.transform(app_train[col])
            app_test[col] = le.transform(app_test[col])

            # Keep track of how many columns were label encoded
            le_count += 1

print('%d columns were label encoded.' % le_count)
```

```
3 columns were label encoded.
```


```python
# `col` still holds the name of the last column visited by the loop above
print(list(app_train[col]))
```

## Why using fit and 

## 2. One Hot Encoder (the program is not misled about the order of the data)
It takes a column of categorical data that has been label encoded and splits it into multiple columns, one per category.

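```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder's `categorical_features` argument was removed in scikit-learn 0.22;
# ColumnTransformer now selects the column to encode (column 0 here).
ct = ColumnTransformer([('country', OneHotEncoder(), [0])],
                       remainder='passthrough', sparse_threshold=0)
x = ct.fit_transform(x)
```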

After running this code, we get three new columns, one for each country: France, Germany, and Spain.

For rows which have the first column value as France, the ‘France’ column will have a ‘1’ and the other two columns will have ‘0’s. Similarly, for rows which have the first column value as Germany, the ‘Germany’ column will have a ‘1’ and the other two columns will have ‘0’s.
![](https://i.imgur.com/mzyEosW.png)
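
The layout described above can be reproduced in a small sketch. The `countries` array is illustrative data (an assumption based on the France/Germany/Spain example), and the encoder is applied directly to the text labels.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative single-column input (hypothetical rows)
countries = np.array([["France"], ["Spain"], ["Germany"], ["France"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse result by default; convert it to a dense array
one_hot = encoder.fit_transform(countries).toarray()

# Column order follows the sorted categories
print(encoder.categories_[0])
print(one_hot)
```

Each row contains exactly one 1, placed in the column of that row's country, so no country is numerically larger than another.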

## Program (One Hot Encoding - pandas method)

```python
import pandas as pd
import numpy as np

app_train = pd.read_csv('application_train.csv')
```

### Count the columns of each data type

```python
# 1. Check the data types
app_train.dtypes.value_counts()
```

```
float64    65
int64      41
object     16
dtype: int64
```

### Count the categories in each categorical column

Inspect the categorical columns, then use One Hot Encoding to check whether "NAME_HOUSING_TYPE" really has 6 distinct categories.

```python
# Number of unique categories in each object (categorical) column
ob_num = app_train.select_dtypes(include=["object"]).apply(pd.Series.nunique, axis=0)
print(ob_num, '\n')
print(app_train['NAME_HOUSING_TYPE'], '\n')

# One Hot Encoding
app_train = pd.read_csv('application_train.csv')
name_train = pd.DataFrame(app_train['NAME_HOUSING_TYPE'])
data_name = pd.get_dummies(name_train)
print(data_name)
```

```
NAME_CONTRACT_TYPE             2
CODE_GENDER                    3
FLAG_OWN_CAR                   2
FLAG_OWN_REALTY                2
NAME_TYPE_SUITE                7
NAME_INCOME_TYPE               8
NAME_EDUCATION_TYPE            5
NAME_FAMILY_STATUS             6
NAME_HOUSING_TYPE              6
OCCUPATION_TYPE               18
WEEKDAY_APPR_PROCESS_START     7
ORGANIZATION_TYPE             58
FONDKAPREMONT_MODE             4
HOUSETYPE_MODE                 3
WALLSMATERIAL_MODE             7
EMERGENCYSTATE_MODE            2
dtype: int64 

0         House / apartment
1         House / apartment
2         House / apartment
3         House / apartment
4         House / apartment
                ...        
307506         With parents
307507    House / apartment
307508    House / apartment
307509    House / apartment
307510    House / apartment
Name: NAME_HOUSING_TYPE, Length: 307511, dtype: object 
```

        NAME_HOUSING_TYPE_Co-op apartment  \
0                                       0   
1 