
| Task (write the corresponding code on the right) |                  train                   |                test                |
| :----------------------------------------------: | :--------------------------------------: | :--------------------------------: |
|                   Read the CSV                   |    `train = pd.read_csv('train.csv')`    |  `test = pd.read_csv('test.csv')`  |
|                  Number of rows                  |            `len(train.index)`            |         `len(test.index)`          |
|                 Number of columns                |           `len(train.columns)`           |        `len(test.columns)`         |
|             Total number of elements             |               `train.size`               |            `test.size`             |
|            Contains a `target` column?           |       `'target' in train.columns`        |     `'target' in test.columns`     |
|            How many float64 variables?           |   `(train.dtypes == 'float64').sum()`    | `(test.dtypes == 'float64').sum()` |
|             How many int64 variables?            |    `(train.dtypes == 'int64').sum()`     |  `(test.dtypes == 'int64').sum()`  |
|                  Show data info                  |              `train.info()`              |           `test.info()`            |
|               Show the first 3 rows              |             `train.head(3)`              |           `test.head(3)`           |
|               Show the last 8 rows               |             `train.tail(8)`              |           `test.tail(8)`           |
|                   Column names                   |             `train.columns`              |           `test.columns`           |
|                Dimensions (shape)                |              `train.shape`               |            `test.shape`            |
|             Data type of each column             |              `train.dtypes`              |           `test.dtypes`            |
|     Change settings to display at most 2 rows    |  `pd.set_option('display.max_rows',2)`   |            same as left            |
|   Change settings to display at most 5 columns   | `pd.set_option('display.max_columns',5)` |            same as left            |
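
The checks in the table can be tried without the CSV files. The sketch below uses a small made-up frame in place of `train.csv` (the column names `id` and `feature_a` are assumptions for illustration). It also uses `pd.option_context`, which applies the display limits only inside the `with` block instead of changing them globally as `pd.set_option` does.

```python
import pandas as pd

# Hypothetical toy data standing in for train.csv; column names are made up.
train = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "feature_a": [0.5, 1.5, 2.5, 3.5],
    "target": [0, 1, 0, 1],
}).astype({"id": "int64", "feature_a": "float64", "target": "int64"})

n_rows, n_cols = train.shape                    # rows and columns in one call
has_target = "target" in train.columns          # membership test on column labels
n_float = (train.dtypes == "float64").sum()     # count float64 columns
n_int = (train.dtypes == "int64").sum()         # count int64 columns

# The display limits apply only inside this block and are restored afterwards.
with pd.option_context("display.max_rows", 2, "display.max_columns", 5):
    print(train)
```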


# Feature Engineering

## Encoding (Discrete -> Continuous)

- OneHotEncoder (pivot table)
  - Not needed for tree models (they do not rely on distances)
  - Not suitable for a column with very many categories (cardinality too high): it easily produces high-dimensional sparse data (dimension explosion)
- Target Encoding
  - Data leakage (fit on train only): otherwise the encoder may learn the target from validation or test data
  - Unknown category: the test set may contain a new, unseen category in the column
  - Rare categories (apply smoothing): with 100 apples and 2 oranges, orange is rare, so its target mean is not representative
  - Category loss (add noise): after target mean encoding, originally different categories can end up with the same value
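
A minimal sketch of smoothed target mean encoding that covers the first three points above. The toy data, the column names, and the smoothing weight `m` are assumptions for illustration. The smoothing formula used here, `(sum + m * global_mean) / (count + m)`, is one common choice and not the only one.

```python
import pandas as pd

# Hypothetical toy data; column names and the weight m are assumptions.
train = pd.DataFrame({
    "fruit": ["apple", "apple", "apple", "apple", "orange"],
    "target": [1, 0, 1, 1, 0],
})
test = pd.DataFrame({"fruit": ["apple", "orange", "kiwi"]})  # "kiwi" is unseen

m = 2.0  # smoothing strength: larger m pulls rare categories toward the global mean
global_mean = train["target"].mean()

# Fit on the training split only, so validation/test targets never leak in.
stats = train.groupby("fruit")["target"].agg(["sum", "count"])
encoding = (stats["sum"] + m * global_mean) / (stats["count"] + m)

train["fruit_te"] = train["fruit"].map(encoding)
# Unseen categories have no entry in `encoding`; fall back to the global mean.
test["fruit_te"] = test["fruit"].map(encoding).fillna(global_mean)
```

With a single observation, the raw mean for `orange` would be 0.0. Smoothing moves it toward the global mean of 0.6, and the unseen `kiwi` receives the global mean itself.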

- Binary feature or target: LabelEncoder (good for tree models, not linear models)
- Ordinal feature: OrdinalEncoder
- Nominal feature: OneHotEncoder (not for tree models)
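
The three cases can be sketched with pandas alone. This is a stand-in for the scikit-learn encoders named above, not their API. The column names and the category order are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data; column names and values are made up.
df = pd.DataFrame({
    "is_member": ["yes", "no", "yes"],      # binary
    "size": ["small", "large", "medium"],   # ordinal
    "color": ["red", "green", "red"],       # nominal
})

# Binary: map the two values to 0/1.
df["is_member_code"] = df["is_member"].map({"no": 0, "yes": 1})

# Ordinal: state the order explicitly so the integer codes preserve it.
size_order = ["small", "medium", "large"]
df["size_code"] = pd.Categorical(
    df["size"], categories=size_order, ordered=True
).codes

# Nominal: one indicator column per category, with no implied order.
color_dummies = pd.get_dummies(df["color"], prefix="color")

encoded = pd.concat([df[["is_member_code", "size_code"]], color_dummies], axis=1)
```

Passing the category order explicitly matters for the ordinal case: without it the codes would follow alphabetical order (`large`, `medium`, `small`), which does not match the intended ranking.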

## Binning (Continuous -> Discrete)

- Unsupervised
  - Equal Width
  - Equal Frequency
- Supervised

```python
import pandas as pd

value_list = [0, 10, 20, 59, 61, 79, 80, 90, 99, 100]
```

```python
# Equal Frequency
value_freq_bins = pd.qcut(value_list, q=5)
value_freq_bins
```

```
[(-0.001, 18.0], (-0.001, 18.0], (18.0, 60.2], (18.0, 60.2], (60.2, 79.4], (60.2, 79.4], (79.4, 91.8], (79.4, 91.8], (91.8, 100.0], (91.8, 100.0]]
Categories (5, interval[float64, right]): [(-0.001, 18.0] < (18.0, 60.2] < (60.2, 79.4] < (79.4, 91.8] < (91.8, 100.0]]
```

```python
# Equal Width
value_dis_bins = pd.cut(value_list, bins=5)
value_dis_bins
```

```
[(-0.1, 20.0], (-0.1, 20.0], (-0.1, 20.0], (40.0, 60.0], (60.0, 80.0], (60.0, 80.0], (60.0, 80.0], (80.0, 100.0], (80.0, 100.0], (80.0, 100.0]]
Categories (5, interval[float64, right]): [(-0.1, 20.0] < (20.0, 40.0] < (40.0, 60.0] < (60.0, 80.0] < (80.0, 100.0]]
```
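
Counting the values per bin makes the difference between the two methods easier to see. The sketch below reuses the same `value_list`; the names `freq_counts` and `width_counts` are introduced here only for illustration.

```python
import pandas as pd

value_list = [0, 10, 20, 59, 61, 79, 80, 90, 99, 100]

# sort=False keeps the bins in interval order; empty bins are reported as 0.
freq_counts = pd.Series(pd.qcut(value_list, q=5)).value_counts(sort=False)
width_counts = pd.Series(pd.cut(value_list, bins=5)).value_counts(sort=False)

print(freq_counts.tolist())   # → [2, 2, 2, 2, 2]
print(width_counts.tolist())  # → [3, 0, 1, 3, 3]
```

Equal frequency puts two values in every bin, while equal width leaves the `(20.0, 40.0]` bin empty because no value falls in that range.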