## Background
While reviewing the code, 大妈 found the following problem:

> Run:
$ python LCh1saledata.py

This produces:

- raw/ZQ/
    + LCh1_test_featuretest.csv
    + LCh1_trainSaleDatetest.csv

In particular: LCh1_trainSaleDatetest.csv
Check:

_tmp = train.loc[:,['class_id','sale_date','sale_quantity']]
print(_tmp[_tmp.class_id == 2])

...


      class_id  sale_date  sale_quantity
119        2.0        0.0            0.0
259        2.0        0.0            0.0
399        2.0        0.0            0.0
539        2.0        0.0            0.0
679        2.0        0.0            0.0
...

So there clearly are rows whose class_id is 2...

But a class_id of 2 is certainly bad data. Where this error was introduced needs to be tracked down.

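## Investigation

### Checking saledata.py
LCh1saledata.py is derived from src/_yancheng4lichuan/saledata.py
- Verify that saledata.py itself is fine
    - Confirmed that the CSV files generated by saledata.py contain no class_id of 2
- Compare saledata.py with LCh1saledata.py
    - The only difference is the feature-engineering set they use
    - The feature-engineering set file is produced by cleandata.py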

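### Checking cleandata.py
- Running cleandata.py once to regenerate the feature-engineering set shows that rows with class_id equal to 2 do exist, which confirms that the problem comes from cleandata.py
- cleandata.py is under 80 lines in total, so the approach is to go through it segment by segment and find which piece of code introduces the 2 into class_id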
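The segment-by-segment approach can be sketched as a small helper that applies each cleaning step in turn and checks class_id after it. This is only a minimal sketch on a toy DataFrame, not the actual cleandata.py: the names `has_bad_class_id`, `first_offending_step`, `fix_power`, `buggy_displacement`, and all the values are hypothetical.

```python
import pandas as pd


def has_bad_class_id(df, bad_value=2):
    """Return True if any row has class_id equal to bad_value."""
    return bool((df["class_id"] == bad_value).any())


def first_offending_step(df, steps, bad_value=2):
    """Apply the steps in order and return the name of the first step
    after which bad_value shows up in class_id, or None if it never does."""
    for name, step in steps:
        df = step(df)
        if has_bad_class_id(df, bad_value):
            return name
    return None


# Toy data (hypothetical values, not taken from the real training set).
toy = pd.DataFrame({
    "class_id": [103507.0, 124140.0, 125403.0],
    "displacement": [1.6, 0.0, 2.0],
})


def fix_power(df):
    # Stand-in for a harmless cleaning step.
    return df.copy()


def buggy_displacement(df):
    # Reproduces the row-wide assignment pattern under investigation.
    out = df.copy()
    out[out["displacement"] == 0] = 2.0
    return out


steps = [("fix_power", fix_power), ("buggy_displacement", buggy_displacement)]
print(first_offending_step(toy, steps))
```

Each step works on a copy, so the input frame stays clean and the check can be repeated with a different list of steps.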

The step-by-step run follows. Between segments, a check for whether any class_id equals 2 is inserted.

In [7]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import MinMaxScaler

# readfile
train =pd.read_csv('../../raw/CarsSaleForecast/[new] yancheng_train_20171226.csv')
test = pd.read_csv('../../raw/CarsSaleForecast/yancheng_testA_20171225.csv')

In [8]:
# readfile
train =pd.read_csv('../../raw/CarsSaleForecast/[new] yancheng_train_20171226.csv')
test = pd.read_csv('../../raw/CarsSaleForecast/yancheng_testA_20171225.csv')

# duplicate removal 
labels = ['sale_date', 'class_id', 'brand_id', 'compartment', 'type_id', 'level_id', 'department_id', 'TR', 'gearbox_type', 'displacement', 'if_charging', 'price_level', 'price', 'driven_type_id', 'fuel_type_id', 'newenergy_type_id', 'emission_standards_id', 'if_MPV_id', 'if_luxurious_id', 'power', 'cylinder_number', 'engine_torque', 'car_length', 'car_width', 'car_height', 'total_quality', 'equipment_quality', 'rated_passenger', 'wheelbase', 'front_track', 'rear_track']
train = train.groupby(labels).sum().reset_index()

In [9]:
# exceptions
train['power'][train['power']=='81/70'] = 81
train['power'] = train['power'].astype('float32')



In [10]:
train[train['class_id'] == 2]

Unnamed: 0,sale_date,class_id,brand_id,compartment,type_id,level_id,department_id,TR,gearbox_type,displacement,...,car_length,car_width,car_height,total_quality,equipment_quality,rated_passenger,wheelbase,front_track,rear_track,sale_quantity


In [11]:
train['engine_torque'][train['engine_torque']=='155/140'] = 155
train['engine_torque'][train['engine_torque']=='-'] = 201.8
train['engine_torque'] = train['engine_torque'].astype('float32')

In [12]:
train[train['class_id'] == 2]

Unnamed: 0,sale_date,class_id,brand_id,compartment,type_id,level_id,department_id,TR,gearbox_type,displacement,...,car_length,car_width,car_height,total_quality,equipment_quality,rated_passenger,wheelbase,front_track,rear_track,sale_quantity


In [13]:
train['fuel_type_id'][train['fuel_type_id']==1] = '1'
train['fuel_type_id'][train['fuel_type_id']==2] = '2'
train['fuel_type_id'][train['fuel_type_id']==3] = '3'

In [14]:
train[train['class_id'] == 2]

Unnamed: 0,sale_date,class_id,brand_id,compartment,type_id,level_id,department_id,TR,gearbox_type,displacement,...,car_length,car_width,car_height,total_quality,equipment_quality,rated_passenger,wheelbase,front_track,rear_track,sale_quantity


In [15]:
train[train['displacement'] == 0] = 2.

In [41]:
train[train['class_id'] == 2]

Unnamed: 0,sale_date,class_id,brand_id,compartment,type_id,level_id,department_id,TR,gearbox_type,displacement,...,car_length,car_width,car_height,total_quality,equipment_quality,rated_passenger,wheelbase,front_track,rear_track,sale_quantity
16294,2.0,2.0,2.0,2.0,2.0,2,2.0,2,2,2.0,...,2.0,2.0,2.0,2.0,2.0,2,2.0,2.0,2.0,2.0
16663,2.0,2.0,2.0,2.0,2.0,2,2.0,2,2,2.0,...,2.0,2.0,2.0,2.0,2.0,2,2.0,2.0,2.0,2.0
16664,2.0,2.0,2.0,2.0,2.0,2,2.0,2,2,2.0,...,2.0,2.0,2.0,2.0,2.0,2,2.0,2.0,2.0,2.0
17018,2.0,2.0,2.0,2.0,2.0,2,2.0,2,2,2.0,...,2.0,2.0,2.0,2.0,2.0,2,2.0,2.0,2.0,2.0
17386,2.0,2.0,2.0,2.0,2.0,2,2.0,2,2,2.0,...,2.0,2.0,2.0,2.0,2.0,2,2.0,2.0,2.0,2.0
17754,2.0,2.0,2.0,2.0,2.0,2,2.0,2,2,2.0,...,2.0,2.0,2.0,2.0,2.0,2,2.0,2.0,2.0,2.0
17755,2.0,2.0,2.0,2.0,2.0,2,2.0,2,2,2.0,...,2.0,2.0,2.0,2.0,2.0,2,2.0,2.0,2.0,2.0
18144,2.0,2.0,2.0,2.0,2.0,2,2.0,2,2,2.0,...,2.0,2.0,2.0,2.0,2.0,2,2.0,2.0,2.0,2.0
18145,2.0,2.0,2.0,2.0,2.0,2,2.0,2,2,2.0,...,2.0,2.0,2.0,2.0,2.0,2,2.0,2.0,2.0,2.0
18521,2.0,2.0,2.0,2.0,2.0,2,2.0,2,2,2.0,...,2.0,2.0,2.0,2.0,2.0,2,2.0,2.0,2.0,2.0


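class_id picks up the value 2 when this statement runs: train[train['displacement'] == 0] = 2.

The author meant to change the entries of the displacement column that are 0 to 2, but forgot to specify the displacement column. As a result, every column of the rows satisfying train['displacement'] == 0 was assigned 2, and class_id was no exception.
The correct form is as follows: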

In [None]:
train['displacement'][train['displacement'] == 0] = 2.
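To make the difference concrete, here is a minimal sketch on a toy DataFrame (hypothetical values, not the real data). It contrasts the row-wide assignment with a column-scoped one. The column-scoped form is written with `.loc`, which is an alternative to the chained indexing used in the script; chained assignment is flagged with SettingWithCopyWarning in pandas and does not update the original frame when Copy-on-Write is enabled.

```python
import pandas as pd

# Toy data (hypothetical values) with one row whose displacement is 0.
df = pd.DataFrame({
    "class_id": [103507.0, 124140.0],
    "displacement": [1.6, 0.0],
    "sale_quantity": [15.0, 30.0],
})
mask = df["displacement"] == 0

# Buggy form: a boolean mask on its own selects whole rows, so every
# column of the matching rows is overwritten.
buggy = df.copy()
buggy[mask] = 2.0

# Column-scoped form: .loc with a row mask and a column label only
# touches the displacement column.
fixed = df.copy()
fixed.loc[mask, "displacement"] = 2.0

print(buggy)
print(fixed)
```

In `buggy`, the second row has class_id and sale_quantity overwritten as well; in `fixed`, only displacement changes.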

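After fixing this bug, cleandata.py was reviewed again. train_feature_test.csv is the training feature set file generated by the latest code, and train_feature.csv is the original feature set file. Comparing the two shows that the original feature file has one more feature than the latest one.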

In [16]:
import numpy as np
import scipy as sp
import pandas as pd

In [17]:
train = pd.read_csv('../../raw/caijun/trainSaleDate.csv')
train_feature = pd.read_csv('../../raw/LiChuan/train_feature.csv')
train_feature_test = pd.read_csv('../../raw/caijun/train_feature_test.csv')

In [18]:
train_feature_test.columns

Index(['class_id', 'brand_id_12', 'brand_id_49', 'brand_id_68', 'brand_id_75',
       'brand_id_76', 'brand_id_98', 'brand_id_106', 'brand_id_126',
       'brand_id_236',
       ...
       'equipment_quality_mean', 'wheelbase_max', 'wheelbase_min',
       'wheelbase_mean', 'front_track_max', 'front_track_min',
       'front_track_mean', 'rear_track_max', 'rear_track_min',
       'rear_track_mean'],
      dtype='object', length=152)

In [19]:
train_feature.columns

Index(['class_id', 'brand_id_12', 'brand_id_49', 'brand_id_68', 'brand_id_75',
       'brand_id_76', 'brand_id_98', 'brand_id_106', 'brand_id_126',
       'brand_id_236',
       ...
       'rated_passenger_4', 'rated_passenger_4-5', 'rated_passenger_5',
       'rated_passenger_5-7', 'rated_passenger_5-8', 'rated_passenger_6-7',
       'rated_passenger_6-8', 'rated_passenger_7', 'rated_passenger_7-8',
       'rated_passenger_9'],
      dtype='object', length=153)

In [20]:
for i in train_feature.columns:
    if i not in train_feature_test.columns:
        print (i)

no_displacement
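The same column comparison can also be done in both directions with `Index.difference`, which catches columns missing from either file. This is a minimal sketch on empty toy frames; the column lists are hypothetical stand-ins for the two feature files.

```python
import pandas as pd

# Toy frames (hypothetical columns) standing in for the two feature files.
old_features = pd.DataFrame(columns=["class_id", "brand_id_12", "no_displacement"])
new_features = pd.DataFrame(columns=["class_id", "brand_id_12"])

# Columns present only in the old file, and only in the new file.
only_in_old = old_features.columns.difference(new_features.columns)
only_in_new = new_features.columns.difference(old_features.columns)

print(list(only_in_old))
print(list(only_in_new))
```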
