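Records that share the same class ID (car model ID) are processed together.

1. Special-value handling

1.1 Multiple records with identical attributes in the same period
Sum their sales quantities.

1.2 Power
Where power is '81/70', set it to 81.

1.3 Engine torque
Where torque is '155/140', set it to 155.
Where torque is '-', set it to the mean value, 201.8.

1.4 Fuel type
Convert the integer fuel types 1, 2, 3 to the strings '1', '2', '3' so that each type forms a single category.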

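2. Feature processing

2.1 Sales volume
For records with the same class ID in the same period, sum the sales quantities.

2.2 Brand ID and other nominal features
Collect every brand ID associated with each class ID.
First one-hot encode the brand ID, then group by class ID and sum the encoded columns. The result marks every brand ID that a class ID has.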
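
The encode-then-sum step repeats for every nominal column, so it can be wrapped in one function. The sketch below is a minimal illustration on toy data; the helper name `multi_hot_by_class` and its signature are assumptions, not part of the notebook.

```python
import pandas as pd

def multi_hot_by_class(df, col, key='class_id'):
    """Return one row per `key` with one indicator column per value of `col`.

    Hypothetical helper; the name and signature are not from the notebook.
    """
    # Keep each (key, value) pair once so the summed indicators stay 0 or 1.
    pairs = df[[key, col]].drop_duplicates()
    # astype(int) keeps the columns numeric even when get_dummies returns booleans.
    dummies = pd.get_dummies(pairs[col], prefix=col).astype(int)
    return pd.concat([pairs[[key]], dummies], axis=1).groupby(key).sum().reset_index()

# Toy data: class 1 appears with two gearbox types, class 2 with one.
toy = pd.DataFrame({
    'class_id': [1, 1, 1, 2],
    'gearbox_type': ['AT', 'MT', 'AT', 'MT'],
})
result = multi_hot_by_class(toy, 'gearbox_type')
print(result)
```

With such a helper, each nominal column could be encoded in a loop and merged into the per-class table on `class_id`.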

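2.3 Displacement and other numerical features
Displacement has one special value, 0, so create an indicator feature that records whether displacement is 0.
Then apply min-max scaling to the non-special values so that displacement lies between 0 and 1.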
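
A minimal sketch of this idea on hypothetical values follows. It scales only the non-zero rows, which is one reading of the description above. The notebook's own cell instead applies fixed bounds (1.0 and 3.6) to every row, so treat the variable names and the masking approach here as assumptions.

```python
import pandas as pd

# Hypothetical displacement values; 0 is the special value.
disp = pd.DataFrame({'displacement': [0.0, 1.0, 2.3, 3.6]})

# Flag the special value before scaling.
disp['no_displacement'] = (disp['displacement'] == 0).astype(int)

# Min-max scale only the non-zero values; zero rows keep 0.
nonzero = disp['displacement'] != 0
lo = disp.loc[nonzero, 'displacement'].min()
hi = disp.loc[nonzero, 'displacement'].max()
disp['displacement_scaled'] = 0.0
disp.loc[nonzero, 'displacement_scaled'] = (disp.loc[nonzero, 'displacement'] - lo) / (hi - lo)
print(disp)
```

In this sketch both the zero rows and the smallest non-zero value end up at 0, so the `no_displacement` flag is what tells them apart.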

In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

%matplotlib inline
plt.style.use('fivethirtyeight')

import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')

import warnings
warnings.filterwarnings('ignore')  # suppress library warnings in the notebook output

from scipy import stats
from scipy.stats import norm, skew


pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))


from subprocess import check_output
print(check_output(["ls", "../../raw/CarsSaleForecast"]).decode("utf8"))

[new] yancheng_train_20171226.csv
yancheng_testA_20171225.csv
yancheng_testB_20180224.csv



In [2]:
train = pd.read_csv('../../raw/CarsSaleForecast/[new] yancheng_train_20171226.csv')
test = pd.read_csv('../../raw/CarsSaleForecast/yancheng_testA_20171225.csv')

In [3]:
train.head()

Unnamed: 0,sale_date,class_id,sale_quantity,brand_id,compartment,type_id,level_id,department_id,TR,gearbox_type,...,engine_torque,car_length,car_width,car_height,total_quality,equipment_quality,rated_passenger,wheelbase,front_track,rear_track
0,201609,289403,94,12,2,1,1,1,6,MT,...,170.0,4440,1833,1545,1695,1320,5,2700,1556,1562
1,201609,745137,435,637,3,2,1,2,6,DCT,...,159.0,4534,1823,1483,1711,1336,5,2648,1553,1544
2,201609,714860,180,831,3,2,2,3,6,AT,...,176.0,4720,1815,1465,1860,1459,5,2770,1579,1589
3,201609,175962,40,750,3,2,1,4,6,AT,...,155.0,4475,1706,1469,1625,1145,5,2603,1460,1500
4,201609,270690,19,98,2,3,3,1,5,MT,...,146.5,4415,1685,1850,1825,1236,5,2720,1420,1440


In [5]:
test.head()

Unnamed: 0,predict_date,class_id,predict_quantity
0,201711,103507,
1,201711,124140,
2,201711,125403,
3,201711,136916,
4,201711,169673,


In [6]:
print("The train data size before dropping Id feature is : {} ".format(train.shape))
print("The test data size before dropping Id feature is : {} ".format(test.shape))

The train data size before dropping Id feature is : (20157, 32) 
The test data size before dropping Id feature is : (140, 3) 


## Anomaly handling

### Deduplication
The same car model has multiple records in the same month; sum their sales quantities.
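
As a small illustration of this summing step, the sketch below groups on every column except the sales quantity. The toy rows are hypothetical and are not taken from the competition files.

```python
import pandas as pd

# Toy data: two records for the same model and month, one for the next month.
toy = pd.DataFrame({
    'sale_date': [201601, 201601, 201602],
    'class_id': [175962, 175962, 175962],
    'power': [81.0, 81.0, 81.0],
    'sale_quantity': [8, 5, 32],
})

# Group on every column except the target, then sum the target.
key_cols = [c for c in toy.columns if c != 'sale_quantity']
deduped = toy.groupby(key_cols, as_index=False)['sale_quantity'].sum()
print(deduped)
```

One caveat worth keeping in mind: by default `groupby` drops rows whose key columns contain NaN, so missing attribute values should be checked before grouping on many columns.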

In [7]:
labels = ['sale_date','class_id','brand_id','compartment','type_id','level_id','department_id','TR','gearbox_type','displacement','if_charging',
          'price_level','driven_type_id','fuel_type_id','newenergy_type_id','emission_standards_id','if_MPV_id','if_luxurious_id','power',
          'cylinder_number','engine_torque','car_length','car_width','car_height','total_quality','equipment_quality','rated_passenger',
          'wheelbase','front_track','rear_track']

train = train.groupby(labels).agg('sum').reset_index()
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20106 entries, 0 to 20105
Data columns (total 31 columns):
sale_date                20106 non-null int64
class_id                 20106 non-null int64
brand_id                 20106 non-null int64
compartment              20106 non-null int64
type_id                  20106 non-null int64
level_id                 20106 non-null object
department_id            20106 non-null int64
TR                       20106 non-null object
gearbox_type             20106 non-null object
displacement             20106 non-null float64
if_charging              20106 non-null object
price_level              20106 non-null object
driven_type_id           20106 non-null int64
fuel_type_id             20106 non-null object
newenergy_type_id        20106 non-null int64
emission_standards_id    20106 non-null int64
if_MPV_id                20106 non-null int64
if_luxurious_id          20106 non-null int64
power                    20106 non-null object
cylinder

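### Power anomalies

* The three entries with a power of '81/70' cover two car models
* One model has recorded power values of 66 and 81, so we set the value to 81
* The other model has recorded power values of 66, 70, 81 and 96; we also set the value to 81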
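
The `power` column mixes floats, numeric strings and the compound string '81/70'. One possible way to clean it in a single pass is sketched below. The helper `first_number` is hypothetical, and taking the value before the slash is an assumption that happens to match the choice of 81 made above.

```python
import pandas as pd

# Hypothetical sample mimicking the mixed float/str values in `power`.
power = pd.Series([123.0, '96', '81/70', 66.0], dtype=object)

def first_number(value):
    """Return the part before the first '/' as a float (hypothetical helper)."""
    return float(str(value).split('/')[0])

power_clean = power.map(first_number).astype('float32')
print(power_clean.tolist())
```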

In [8]:
train['power'].unique()

array([123.0, 190.0, 108.0, 109.0, 135.0, 45.0, 50.0, 60.0, 63.0, 104.0,
       78.0, 80.0, 144.0, 167.0, 211.0, 250.0, 78.7, 90.4, 96.0, 118.0,
       147.0, 110.0, 125.0, 166.0, 137.0, 103.0, 90.0, 74.0, 86.0, 89.0,
       105.0, 126.0, 98.0, '96', 77.0, 88.0, 112.0, 160.0, '115', 132.0,
       115.0, 155.0, 150.0, 180.0, 171.0, 83.0, 82.0, 81.58, 93.0, 140.0,
       121.0, '125', 130.0, 162.0, 213.0, '77', '73', 73.0, 114.0, 120.0,
       128.0, 106.0, 70.0, '45', 68.0, 184.0, 113.0, '160', '93', '132',
       '213', '63', '70', '108', '78.7', '103', '112', '155', '104',
       119.0, '118', '106', '135', '81.58', '89', '140', '147', '121',
       '50', '250', '83', '162', 122.0, '92', '120', '114', '113', '137',
       195.0, '130', 92.0, '74', '86', 220.0, '88', '123', '90.4', 102.0,
       '126', '150', '90', 93.8, 107.0, 178.0, '110', '93.8', '166', '78',
       '122', 81.0, '102', 200.0, 66.0, '167', '109', '66', '107', '119',
       '184', '220', 198.0, '80', '200', 227.0, '81

In [9]:
train[train['power']=='81/70']

Unnamed: 0,sale_date,class_id,brand_id,compartment,type_id,level_id,department_id,TR,gearbox_type,displacement,...,car_length,car_width,car_height,total_quality,equipment_quality,rated_passenger,wheelbase,front_track,rear_track,sale_quantity
12389,201601,175962,750,3,2,1,4,5,MT,1.6,...,4473,1706,1469,1735,1275,5,2603,1460,1500,8
13457,201603,961962,750,3,2,1,4,5,MT,1.6,...,4487,1706,1470,1740,1260,5,2603,1460,1500,5
13485,201604,175962,750,3,2,1,4,5,MT,1.6,...,4473,1706,1469,1735,1275,5,2603,1460,1500,32


In [10]:
# Select the rows by value rather than by hard-coded index, so the fix survives re-indexing
train.loc[train['power'] == '81/70', 'power'] = 81
train['power'] = train['power'].astype('float32')

### Engine torque

Three entries have a torque of '155/140'; treat them as 155.

One car model has '-' as its torque; fill it with the mean value.
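
A hedged sketch of mean imputation on hypothetical values: here the mean is computed from the remaining entries rather than hard-coded, so the result will generally differ from the 201.8 used in this notebook.

```python
import pandas as pd

# Hypothetical sample; '-' marks a missing torque value.
torque = pd.Series(['225', '155', '-', '290'], dtype=object)

# Non-numeric strings become NaN, then NaN is replaced by the column mean.
torque_num = pd.to_numeric(torque, errors='coerce')
torque_filled = torque_num.fillna(torque_num.mean()).astype('float32')
print(torque_filled.tolist())
```

Note that `errors='coerce'` would also turn '155/140' into NaN, so compound values need to be resolved before this step.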

In [11]:
train['engine_torque'].unique()

array(['225', '290', '190', '224', '235', '85', '90', '103', '108', '184',
       '142', '140', '230', '250', '310', '400', '135', '155', '220',
       '280', '350', '240', '173', '154', '145', '150', '177', '189',
       '153', '200', '296', '186', '180', '270', '300', '232', '138',
       '162.68', '168', '198', '231', '380', '420', '127', '320', '174',
       '194', '199', '226', '132', '130', '120', '222', '215', '187',
       '195', '159', '202', '210', '440', '213', '160', '155.5', '175',
       '355', '234', '360', '330', '172', '146.5', '170', '340', '243',
       '176', '192', '297', '123', '116', '148', '185', '141', '146',
       '245', '353', '150.7', '169', '265', '143', '315', '115', '203',
       '193', '370', '155/140', '252', '157', '285', '210.8', '165',
       '205', '178', '250.3', '132.4', '253', '233', '-', '174.9', '201',
       '275', '227.5', '147'], dtype=object)

In [12]:
train[train['engine_torque']=='155/140']

Unnamed: 0,sale_date,class_id,brand_id,compartment,type_id,level_id,department_id,TR,gearbox_type,displacement,...,car_length,car_width,car_height,total_quality,equipment_quality,rated_passenger,wheelbase,front_track,rear_track,sale_quantity
12389,201601,175962,750,3,2,1,4,5,MT,1.6,...,4473,1706,1469,1735,1275,5,2603,1460,1500,8
13457,201603,961962,750,3,2,1,4,5,MT,1.6,...,4487,1706,1470,1740,1260,5,2603,1460,1500,5
13485,201604,175962,750,3,2,1,4,5,MT,1.6,...,4473,1706,1469,1735,1275,5,2603,1460,1500,32


In [13]:
# Select the rows by value rather than by hard-coded index
train.loc[train['engine_torque'] == '155/140', 'engine_torque'] = 155

In [14]:
train[train['engine_torque']=='-']

Unnamed: 0,sale_date,class_id,brand_id,compartment,type_id,level_id,department_id,TR,gearbox_type,displacement,...,car_length,car_width,car_height,total_quality,equipment_quality,rated_passenger,wheelbase,front_track,rear_track,sale_quantity
16294,201612,527765,236,2,2,5,1,1,AT,0.0,...,3675,1630,1518,1360,1050,4,2360,1405,1400,195
16663,201701,527765,236,2,2,5,1,1,AT,0.0,...,3675,1630,1518,1360,1050,4,2360,1405,1400,6
16664,201701,527765,236,2,2,5,1,1,AT,0.0,...,3675,1630,1518,1360,1058,4,2360,1405,1400,6
17018,201702,527765,236,2,2,5,1,1,AT,0.0,...,3675,1630,1518,1360,1050,4,2360,1405,1400,35
17386,201703,527765,236,2,2,5,1,1,AT,0.0,...,3675,1630,1518,1360,1050,4,2360,1405,1400,59
17754,201704,527765,236,2,2,5,1,1,AT,0.0,...,3675,1630,1518,1360,1050,4,2360,1405,1400,30
17755,201704,527765,236,2,2,5,1,1,AT,0.0,...,3675,1630,1518,1360,1058,4,2360,1405,1400,18
18144,201705,527765,236,2,2,5,1,1,AT,0.0,...,3675,1630,1518,1360,1050,4,2360,1405,1400,6
18145,201705,527765,236,2,2,5,1,1,AT,0.0,...,3675,1630,1518,1360,1058,4,2360,1405,1400,24
18521,201706,527765,236,2,2,5,1,1,AT,0.0,...,3675,1630,1518,1360,1050,4,2360,1405,1400,53


In [15]:
# .loc avoids chained assignment, which may fail to modify `train` itself
train.loc[train['engine_torque'] == '-', 'engine_torque'] = 201.8
train['engine_torque'] = train['engine_torque'].astype('float32')

### Fuel type
Convert the integer fuel types 1, 2, 3 to the strings '1', '2', '3' so that each type forms a single category.
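
An alternative sketch, on a hypothetical sample, casts the whole column to strings in one step. Unlike the notebook's cell, this also converts the integer 4 to '4', so it is a variation under that assumption rather than the same transformation.

```python
import pandas as pd

# Hypothetical sample with the same mix of ints and strings as `fuel_type_id`.
fuel = pd.Series([1, '1', 3, 2, '2', '3', '-', 4], dtype=object)

# Casting everything to str merges 1 with '1', 2 with '2', and so on.
fuel_str = fuel.astype(str)
categories = sorted(fuel_str.unique())
print(categories)
```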

In [16]:
train['fuel_type_id'].unique()

array([1, '1', 3, 2, '2', '3', '-', 4], dtype=object)

In [17]:
# .loc avoids chained assignment when rewriting the integer codes as strings
train.loc[train['fuel_type_id'] == 1, 'fuel_type_id'] = '1'
train.loc[train['fuel_type_id'] == 2, 'fuel_type_id'] = '2'
train.loc[train['fuel_type_id'] == 3, 'fuel_type_id'] = '3'

In [18]:
train['fuel_type_id'].unique()

array(['1', '3', '2', '-', 4], dtype=object)

## Feature processing

### Class ID

In [19]:
classLabels = ['class_id']
trainClass = train[classLabels].drop_duplicates().reset_index(drop=True)
                                                                 
trainClass.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 140 entries, 0 to 139
Data columns (total 1 columns):
class_id    140 non-null int64
dtypes: int64(1)
memory usage: 1.2 KB


### Brand

In [22]:
train['brand_id'].unique()

array([761, 106,  98, 836,  12, 814, 831, 750, 537, 450, 692, 985, 841,
       638, 872, 953, 304, 783, 637,  75, 923, 497, 813, 290, 807, 864,
       498, 236, 542, 512, 294,  49, 126, 682,  68,  76])

In [20]:
trainBrand = train[['class_id', 'brand_id']].drop_duplicates()
brand_id_dummies = pd.get_dummies(trainBrand['brand_id'], prefix='brand_id') # one-hot encode the categorical labels
trainBrand = pd.concat([trainBrand,brand_id_dummies],axis=1)
trainBrand = trainBrand.drop(['brand_id'],axis=1)
trainBrand = trainBrand.groupby('class_id').agg('sum').reset_index()
trainClass = pd.merge(trainClass, trainBrand, on='class_id')
trainClass.head()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,brand_id_813,brand_id_814,brand_id_831,brand_id_836,brand_id_841,brand_id_864,brand_id_872,brand_id_923,brand_id_953,brand_id_985
0,125403,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,136916,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,178529,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,194450,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
4,198427,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [23]:
train[['class_id', 'brand_id']]

Unnamed: 0,class_id,brand_id
0,125403,761
1,125403,761
2,125403,761
3,136916,106
4,136916,106
5,136916,106
6,136916,106
7,136916,106
8,136916,106
9,136916,106


In [24]:
trainBrand = train[['class_id', 'brand_id']].drop_duplicates()

In [26]:
brand_id_dummies = pd.get_dummies(trainBrand['brand_id'], prefix='brand_id') # one-hot encode the categorical labels
trainBrand = pd.concat([trainBrand,brand_id_dummies],axis=1)
trainBrand = trainBrand.drop(['brand_id'],axis=1)

In [27]:
trainBrand

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,brand_id_813,brand_id_814,brand_id_831,brand_id_836,brand_id_841,brand_id_864,brand_id_872,brand_id_923,brand_id_953,brand_id_985
0,125403,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,136916,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
10,178529,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
18,194450,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
20,198427,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
22,209945,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
26,248352,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
29,281301,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
34,290854,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
40,291086,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [30]:
trainBrand.groupby('class_id').agg('sum').reset_index()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,brand_id_813,brand_id_814,brand_id_831,brand_id_836,brand_id_841,brand_id_864,brand_id_872,brand_id_923,brand_id_953,brand_id_985
0,103507,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,124140,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,125403,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,136916,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
4,169673,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,175962,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,178529,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,186250,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
8,194201,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
9,194450,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


### Number of compartments

In [21]:
train['compartment'].unique()

array([2, 3, 1])

In [22]:
trainCompartment = train[['class_id', 'compartment']].drop_duplicates()
compartment_dummies = pd.get_dummies(trainCompartment['compartment'], prefix='compartment')
trainCompartment = pd.concat([trainCompartment,compartment_dummies],axis=1)
trainCompartment = trainCompartment.drop(['compartment'],axis=1)
trainCompartment = trainCompartment.groupby('class_id').agg('sum').reset_index()
trainClass = pd.merge(trainClass, trainCompartment, on='class_id')
trainClass.head()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,brand_id_836,brand_id_841,brand_id_864,brand_id_872,brand_id_923,brand_id_953,brand_id_985,compartment_1,compartment_2,compartment_3
0,125403,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
1,136916,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
2,178529,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,194450,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,1,0
4,198427,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


### Vehicle type

In [23]:
train['type_id'].unique()

array([3, 2, 4, 1])

In [24]:
trainType = train[['class_id', 'type_id']].drop_duplicates()
type_id_dummies = pd.get_dummies(trainType['type_id'], prefix='type_id')
trainType = pd.concat([trainType,type_id_dummies],axis=1)
trainType = trainType.drop(['type_id'],axis=1)
trainType = trainType.groupby('class_id').agg('sum').reset_index()
trainClass = pd.merge(trainClass, trainType, on='class_id')
trainClass.head()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,brand_id_923,brand_id_953,brand_id_985,compartment_1,compartment_2,compartment_3,type_id_1,type_id_2,type_id_3,type_id_4
0,125403,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
1,136916,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,1,0,0
2,178529,0,0,0,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,1
3,194450,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,1,0,0,0
4,198427,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,1,0,0


### Vehicle level

In [25]:
train['level_id'].unique()

array(['2', '-', '1', '4', '3', '5'], dtype=object)

In [26]:
trainLevel = train[['class_id', 'level_id']].drop_duplicates()
level_id_dummies = pd.get_dummies(trainLevel['level_id'], prefix='level_id')
trainLevel = pd.concat([trainLevel,level_id_dummies],axis=1)
trainLevel = trainLevel.drop(['level_id'],axis=1)
trainLevel = trainLevel.groupby('class_id').agg('sum').reset_index()
trainClass = pd.merge(trainClass,trainLevel, on='class_id')
trainClass.head()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,type_id_1,type_id_2,type_id_3,type_id_4,level_id_-,level_id_1,level_id_2,level_id_3,level_id_4,level_id_5
0,125403,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,1,0,0,0
1,136916,0,0,0,0,0,0,1,0,0,...,0,1,0,0,0,0,1,0,0,0
2,178529,0,0,0,0,0,1,0,0,0,...,0,0,0,1,1,0,0,0,0,0
3,194450,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,1,0,0,0,0
4,198427,1,0,0,0,0,0,0,0,0,...,0,1,0,0,0,1,0,0,0,0


### Vehicle series (department)

In [27]:
train['department_id'].unique()

array([2, 5, 1, 3, 4, 6, 7])

In [28]:
trainDepartment = train[['class_id', 'department_id']].drop_duplicates()
department_id_dummies = pd.get_dummies(trainDepartment['department_id'], prefix='department_id')
trainDepartment = pd.concat([trainDepartment,department_id_dummies],axis=1)
trainDepartment = trainDepartment.drop(['department_id'],axis=1)
trainDepartment = trainDepartment.groupby('class_id').agg('sum').reset_index()
trainClass = pd.merge(trainClass, trainDepartment, on='class_id')
trainClass.head()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,level_id_3,level_id_4,level_id_5,department_id_1,department_id_2,department_id_3,department_id_4,department_id_5,department_id_6,department_id_7
0,125403,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,136916,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
2,178529,0,0,0,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
3,194450,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,198427,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


### Number of gears (TR)

In [29]:
train['TR'].unique()

array(['6', '4', '5', '8', '7', '0', '5;4', '8;7', '9', '1'], dtype=object)

In [30]:
trainTR = train[['class_id', 'TR']].drop_duplicates()
TR_dummies = pd.get_dummies(trainTR['TR'], prefix='TR')
trainTR = pd.concat([trainTR,TR_dummies],axis=1)
trainTR = trainTR.drop(['TR'],axis=1)
trainTR = trainTR.groupby('class_id').agg('sum').reset_index()
trainClass = pd.merge(trainClass, trainTR, on='class_id')
trainClass.head()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,TR_0,TR_1,TR_4,TR_5,TR_5;4,TR_6,TR_7,TR_8,TR_8;7,TR_9
0,125403,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
1,136916,0,0,0,0,0,0,1,0,0,...,1,0,1,1,0,1,0,0,0,0
2,178529,0,0,0,0,0,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
3,194450,0,0,0,0,0,0,0,0,0,...,0,0,1,1,0,1,1,0,0,0
4,198427,1,0,0,0,0,0,0,0,0,...,0,0,1,1,0,1,0,0,0,0


### Gearbox type

In [31]:
train['gearbox_type'].unique()

array(['AT', 'MT', 'DCT', 'CVT', 'MT;AT', 'AT;DCT', 'AMT'], dtype=object)

In [32]:
trainGearbox = train[['class_id', 'gearbox_type']].drop_duplicates()
gearbox_type_dummies = pd.get_dummies(trainGearbox['gearbox_type'], prefix='gearbox_type')
trainGearbox = pd.concat([trainGearbox,gearbox_type_dummies],axis=1)
trainGearbox = trainGearbox.drop(['gearbox_type'],axis=1)
trainGearbox = trainGearbox.groupby('class_id').agg('sum').reset_index()
trainClass = pd.merge(trainClass, trainGearbox, on='class_id')
trainClass.head()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,TR_8,TR_8;7,TR_9,gearbox_type_AMT,gearbox_type_AT,gearbox_type_AT;DCT,gearbox_type_CVT,gearbox_type_DCT,gearbox_type_MT,gearbox_type_MT;AT
0,125403,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1,136916,0,0,0,0,0,0,1,0,0,...,0,0,0,0,1,0,1,0,0,0
2,178529,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,194450,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,1,1,0
4,198427,1,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0


### Displacement

In [33]:
train['displacement'].unique()

array([ 2.4,  3. ,  2. ,  2.5,  1. ,  1.1,  1.2,  1.5,  1.4,  1.6,  1.8,
        2.3,  1.9,  2.7,  2.8,  1.3,  3.6,  3.1,  0. ])

In [34]:
trainDisplacement = train[['class_id', 'displacement']].drop_duplicates().reset_index(drop=True)
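trainDisplacement['no_displacement'] = 0
# .loc avoids chained assignment when flagging zero displacement
trainDisplacement.loc[trainDisplacement['displacement'] == 0, 'no_displacement'] = 1
# Scale with fixed bounds 1.0 and 3.6 (the smallest non-zero and the largest observed values);
# a displacement of 0 maps to about -0.385 and is identified by no_displacement
trainDisplacement['displacement'] = trainDisplacement['displacement'].apply(lambda x: (x-1.0)/(3.6-1.0))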
displacement_max = trainDisplacement.groupby(trainDisplacement['class_id'])[['displacement']].agg('max').reset_index()
displacement_max.rename(columns={'displacement':'displacement_max'},inplace=True)
displacement_min = trainDisplacement.groupby(trainDisplacement['class_id'])[['displacement']].agg('min').reset_index()
displacement_min.rename(columns={'displacement':'displacement_min'},inplace=True)
displacement_mean = trainDisplacement.groupby(trainDisplacement['class_id'])[['displacement']].agg('mean').reset_index()
displacement_mean.rename(columns={'displacement':'displacement_mean'},inplace=True)
trainDisplacement = trainDisplacement.drop(['displacement'],axis=1)
trainDisplacement = trainDisplacement.drop_duplicates().reset_index(drop=True)
trainDisplacement = pd.merge(trainDisplacement, displacement_max, on='class_id')
trainDisplacement = pd.merge(trainDisplacement, displacement_min, on='class_id')
trainDisplacement = pd.merge(trainDisplacement, displacement_mean, on='class_id')
trainClass = pd.merge(trainClass, trainDisplacement, on='class_id')
trainClass.head()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,gearbox_type_AT,gearbox_type_AT;DCT,gearbox_type_CVT,gearbox_type_DCT,gearbox_type_MT,gearbox_type_MT;AT,no_displacement,displacement_max,displacement_min,displacement_mean
0,125403,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0.769,0.385,0.567
1,136916,0,0,0,0,0,0,1,0,0,...,1,0,1,0,0,0,0,0.577,0.385,0.5
2,178529,0,0,0,0,0,1,0,0,0,...,0,0,0,0,1,0,0,0.077,0.0,0.038
3,194450,0,0,0,0,0,0,0,0,0,...,1,0,0,1,1,0,0,0.385,0.231,0.308
4,198427,1,0,0,0,0,0,0,0,0,...,1,0,0,0,1,0,0,0.192,0.115,0.154


### Forced induction (if_charging)

In [35]:
train['if_charging'].unique()

array(['L', 'T'], dtype=object)

In [36]:
trainCharging = train[['class_id', 'if_charging']].drop_duplicates()
if_charging_dummies = pd.get_dummies(trainCharging['if_charging'], prefix='if_charging')
trainCharging = pd.concat([trainCharging,if_charging_dummies],axis=1)
trainCharging = trainCharging.drop(['if_charging'],axis=1)
trainCharging = trainCharging.groupby('class_id').agg('sum').reset_index()
trainClass = pd.merge(trainClass, trainCharging, on='class_id')
trainClass.head()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,gearbox_type_CVT,gearbox_type_DCT,gearbox_type_MT,gearbox_type_MT;AT,no_displacement,displacement_max,displacement_min,displacement_mean,if_charging_L,if_charging_T
0,125403,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0.769,0.385,0.567,1,1
1,136916,0,0,0,0,0,0,1,0,0,...,1,0,0,0,0,0.577,0.385,0.5,1,0
2,178529,0,0,0,0,0,1,0,0,0,...,0,0,1,0,0,0.077,0.0,0.038,1,0
3,194450,0,0,0,0,0,0,0,0,0,...,0,1,1,0,0,0.385,0.231,0.308,1,1
4,198427,1,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0.192,0.115,0.154,1,1


### Price level

In [37]:
train['price_level'].unique()

array(['35-50W', '15-20W', '5WL', '5-8W', '8-10W', '25-35W', '20-25W',
       '10-15W', '50-75W'], dtype=object)

In [38]:
trainPrice = train[['class_id', 'price_level']].drop_duplicates()
price_level_dummies = pd.get_dummies(trainPrice['price_level'], prefix='price_level')
trainPrice = pd.concat([trainPrice,price_level_dummies],axis=1)
trainPrice = trainPrice.drop(['price_level'],axis=1)
trainPrice = trainPrice.groupby('class_id').agg('sum').reset_index()
trainClass = pd.merge(trainClass, trainPrice, on='class_id')
trainClass.head()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,if_charging_T,price_level_10-15W,price_level_15-20W,price_level_20-25W,price_level_25-35W,price_level_35-50W,price_level_5-8W,price_level_50-75W,price_level_5WL,price_level_8-10W
0,125403,0,0,0,0,0,0,0,0,0,...,1,0,0,1,1,1,0,0,0,0
1,136916,0,0,0,0,0,0,1,0,0,...,0,0,1,1,1,0,0,0,0,0
2,178529,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,1,0
3,194450,0,0,0,0,0,0,0,0,0,...,1,0,1,0,0,0,0,0,0,0
4,198427,1,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0


### Drive type

In [39]:
train['driven_type_id'].unique()

array([1, 2, 3])

In [40]:
trainDriven = train[['class_id', 'driven_type_id']].drop_duplicates()
driven_type_id_dummies = pd.get_dummies(trainDriven['driven_type_id'], prefix='driven_type_id')
trainDriven = pd.concat([trainDriven,driven_type_id_dummies],axis=1)
trainDriven = trainDriven.drop(['driven_type_id'],axis=1)
trainDriven = trainDriven.groupby('class_id').agg('sum').reset_index()
trainClass = pd.merge(trainClass, trainDriven, on='class_id')
trainClass.head()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,price_level_20-25W,price_level_25-35W,price_level_35-50W,price_level_5-8W,price_level_50-75W,price_level_5WL,price_level_8-10W,driven_type_id_1,driven_type_id_2,driven_type_id_3
0,125403,0,0,0,0,0,0,0,0,0,...,1,1,1,0,0,0,0,1,0,0
1,136916,0,0,0,0,0,0,1,0,0,...,1,1,0,0,0,0,0,1,0,0
2,178529,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,1,0
3,194450,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,1
4,198427,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,1,0,0


### Fuel type

In [41]:
train['fuel_type_id'].unique()

array(['1', '3', '2', '-', 4], dtype=object)

In [42]:
trainFuel = train[['class_id', 'fuel_type_id']].drop_duplicates()
fuel_type_id_dummies = pd.get_dummies(trainFuel['fuel_type_id'], prefix='fuel_type_id')
trainFuel = pd.concat([trainFuel,fuel_type_id_dummies],axis=1)
trainFuel = trainFuel.drop(['fuel_type_id'],axis=1)
trainFuel = trainFuel.groupby('class_id').agg('sum').reset_index()
trainClass = pd.merge(trainClass, trainFuel, on='class_id')
trainClass.head()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,price_level_5WL,price_level_8-10W,driven_type_id_1,driven_type_id_2,driven_type_id_3,fuel_type_id_4,fuel_type_id_-,fuel_type_id_1,fuel_type_id_2,fuel_type_id_3
0,125403,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,1,0,0
1,136916,0,0,0,0,0,0,1,0,0,...,0,0,1,0,0,0,0,1,1,0
2,178529,0,0,0,0,0,1,0,0,0,...,1,0,0,1,0,0,0,1,0,0
3,194450,0,0,0,0,0,0,0,0,0,...,0,0,1,0,1,0,0,1,0,0
4,198427,1,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,1,0,0


### New-energy type

In [43]:
train['newenergy_type_id'].unique()

array([1, 2, 3, 4])

In [44]:
trainNewenergy = train[['class_id', 'newenergy_type_id']].drop_duplicates()
newenergy_type_id_dummies = pd.get_dummies(trainNewenergy['newenergy_type_id'], prefix='newenergy_type_id')
trainNewenergy = pd.concat([trainNewenergy,newenergy_type_id_dummies],axis=1)
trainNewenergy = trainNewenergy.drop(['newenergy_type_id'],axis=1)
trainNewenergy = trainNewenergy.groupby('class_id').agg('sum').reset_index()
trainClass = pd.merge(trainClass, trainNewenergy, on='class_id')
trainClass.head()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,driven_type_id_3,fuel_type_id_4,fuel_type_id_-,fuel_type_id_1,fuel_type_id_2,fuel_type_id_3,newenergy_type_id_1,newenergy_type_id_2,newenergy_type_id_3,newenergy_type_id_4
0,125403,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,1,0,0,0
1,136916,0,0,0,0,0,0,1,0,0,...,0,0,0,1,1,0,1,1,0,0
2,178529,0,0,0,0,0,1,0,0,0,...,0,0,0,1,0,0,1,0,0,0
3,194450,0,0,0,0,0,0,0,0,0,...,1,0,0,1,0,0,1,0,0,0
4,198427,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,1,0,0,0


### Emission standard

In [45]:
train['emission_standards_id'].unique()

array([3, 5, 1, 2])

In [46]:
trainEmission = train[['class_id', 'emission_standards_id']].drop_duplicates()
emission_standards_id_dummies = pd.get_dummies(trainEmission['emission_standards_id'], prefix='emission_standards_id')
trainEmission = pd.concat([trainEmission,emission_standards_id_dummies],axis=1)
trainEmission = trainEmission.drop(['emission_standards_id'],axis=1)
trainEmission = trainEmission.groupby('class_id').agg('sum').reset_index()
trainClass = pd.merge(trainClass, trainEmission, on='class_id')
trainClass.head()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,fuel_type_id_2,fuel_type_id_3,newenergy_type_id_1,newenergy_type_id_2,newenergy_type_id_3,newenergy_type_id_4,emission_standards_id_1,emission_standards_id_2,emission_standards_id_3,emission_standards_id_5
0,125403,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,1,0,1,0
1,136916,0,0,0,0,0,0,1,0,0,...,1,0,1,1,0,0,1,0,1,0
2,178529,0,0,0,0,0,1,0,0,0,...,0,0,1,0,0,0,1,0,1,0
3,194450,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,1,0,1,0
4,198427,1,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,1,0,1,0


### Whether minivan/MPV

In [47]:
train['if_MPV_id'].unique()

array([2, 1])

In [48]:
trainMPV = train[['class_id', 'if_MPV_id']].drop_duplicates()
if_MPV_id_dummies = pd.get_dummies(trainMPV['if_MPV_id'], prefix='if_MPV_id')
trainMPV = pd.concat([trainMPV,if_MPV_id_dummies],axis=1)
trainMPV = trainMPV.drop(['if_MPV_id'],axis=1)
trainMPV = trainMPV.groupby('class_id').agg('sum').reset_index()
trainClass = pd.merge(trainClass, trainMPV, on='class_id')
trainClass.head()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,newenergy_type_id_1,newenergy_type_id_2,newenergy_type_id_3,newenergy_type_id_4,emission_standards_id_1,emission_standards_id_2,emission_standards_id_3,emission_standards_id_5,if_MPV_id_1,if_MPV_id_2
0,125403,0,0,0,0,0,0,0,0,0,...,1,0,0,0,1,0,1,0,0,1
1,136916,0,0,0,0,0,0,1,0,0,...,1,1,0,0,1,0,1,0,0,1
2,178529,0,0,0,0,0,1,0,0,0,...,1,0,0,0,1,0,1,0,0,1
3,194450,0,0,0,0,0,0,0,0,0,...,1,0,0,0,1,0,1,0,0,1
4,198427,1,0,0,0,0,0,0,0,0,...,1,0,0,0,1,0,1,0,0,1


### Whether luxury

In [49]:
train['if_luxurious_id'].unique()

array([1, 2])

In [50]:
trainLuxurious = train[['class_id', 'if_luxurious_id']].drop_duplicates()
if_luxurious_id_dummies = pd.get_dummies(trainLuxurious['if_luxurious_id'], prefix='if_luxurious_id')
trainLuxurious = pd.concat([trainLuxurious,if_luxurious_id_dummies],axis=1)
trainLuxurious = trainLuxurious.drop(['if_luxurious_id'],axis=1)
trainLuxurious = trainLuxurious.groupby('class_id').agg('sum').reset_index()
trainClass = pd.merge(trainClass, trainLuxurious, on='class_id')
trainClass.head()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,newenergy_type_id_3,newenergy_type_id_4,emission_standards_id_1,emission_standards_id_2,emission_standards_id_3,emission_standards_id_5,if_MPV_id_1,if_MPV_id_2,if_luxurious_id_1,if_luxurious_id_2
0,125403,0,0,0,0,0,0,0,0,0,...,0,0,1,0,1,0,0,1,1,0
1,136916,0,0,0,0,0,0,1,0,0,...,0,0,1,0,1,0,0,1,1,0
2,178529,0,0,0,0,0,1,0,0,0,...,0,0,1,0,1,0,0,1,1,0
3,194450,0,0,0,0,0,0,0,0,0,...,0,0,1,0,1,0,0,1,1,0
4,198427,1,0,0,0,0,0,0,0,0,...,0,0,1,0,1,0,0,1,1,0


### Power

In [51]:
train['power'].unique()

array([ 123.        ,  190.        ,  108.        ,  109.        ,
        135.        ,   45.        ,   50.        ,   60.        ,
         63.        ,  104.        ,   78.        ,   80.        ,
        144.        ,  167.        ,  211.        ,  250.        ,
         78.69999695,   90.40000153,   96.        ,  118.        ,
        147.        ,  110.        ,  125.        ,  166.        ,
        137.        ,  103.        ,   90.        ,   74.        ,
         86.        ,   89.        ,  105.        ,  126.        ,
         98.        ,   77.        ,   88.        ,  112.        ,
        160.        ,  115.        ,  132.        ,  155.        ,
        150.        ,  180.        ,  171.        ,   83.        ,
         82.        ,   81.58000183,   93.        ,  140.        ,
        121.        ,  130.        ,  162.        ,  213.        ,
         73.        ,  114.        ,  120.        ,  128.        ,
        106.        ,   70.        ,   68.        ,  184.     

In [52]:
trainPower = train[['class_id', 'power']].drop_duplicates().reset_index(drop=True)
# Series.reshape is unavailable in current pandas; pass a 2-D frame and flatten the result
trainPower['power'] = MinMaxScaler().fit_transform(trainPower[['power']]).ravel()
power_max = trainPower.groupby(trainPower['class_id'])[['power']].agg('max').reset_index()
power_max.rename(columns={'power':'power_max'},inplace=True)
power_min = trainPower.groupby(trainPower['class_id'])[['power']].agg('min').reset_index()
power_min.rename(columns={'power':'power_min'},inplace=True)
power_mean = trainPower.groupby(trainPower['class_id'])[['power']].agg('mean').reset_index()
power_mean.rename(columns={'power':'power_mean'},inplace=True)
trainPower = trainPower.drop(['power'],axis=1)
trainPower = trainPower.drop_duplicates().reset_index(drop=True)
trainPower = pd.merge(trainPower, power_max, on='class_id')
trainPower = pd.merge(trainPower, power_min, on='class_id')
trainPower = pd.merge(trainPower, power_mean, on='class_id')
trainClass = pd.merge(trainClass, trainPower, on='class_id')
trainClass.head()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,emission_standards_id_2,emission_standards_id_3,emission_standards_id_5,if_MPV_id_1,if_MPV_id_2,if_luxurious_id_1,if_luxurious_id_2,power_max,power_min,power_mean
0,125403,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,1,0,0.732,0.409,0.576
1,136916,0,0,0,0,0,0,1,0,0,...,0,1,0,0,1,1,0,0.477,0.355,0.396
2,178529,0,0,0,0,0,1,0,0,0,...,0,1,0,0,1,1,0,0.15,0.068,0.121
3,194450,0,0,0,0,0,0,0,0,0,...,0,1,0,0,1,1,0,0.455,0.336,0.402
4,198427,1,0,0,0,0,0,0,0,0,...,0,1,0,0,1,1,0,0.309,0.218,0.252
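
The cell above min-max scales one numeric column over the unique (`class_id`, value) pairs and then keeps the per-`class_id` max, min and mean. The numeric features in the following sections repeat the same steps, so the pattern could be wrapped in a helper. The sketch below is one possible way to do that, not the notebook's own code: the function name `scaled_group_stats` and the toy data are hypothetical, and the scaling is written out with plain pandas arithmetic, which should match `MinMaxScaler` with its default (0, 1) range.

```python
import pandas as pd

def scaled_group_stats(df, col, key='class_id'):
    """Min-max scale `col` over unique (key, col) pairs, then return per-key max/min/mean."""
    # Hypothetical helper: mirrors the drop_duplicates -> scale -> aggregate steps of the cell above
    pairs = df[[key, col]].drop_duplicates().reset_index(drop=True)
    lo, hi = pairs[col].min(), pairs[col].max()
    span = hi - lo
    # A constant column is mapped to 0.0 instead of dividing by zero
    pairs[col] = (pairs[col] - lo) / span if span != 0 else 0.0
    stats = pairs.groupby(key)[col].agg(['max', 'min', 'mean'])
    stats.columns = [f'{col}_{name}' for name in stats.columns]
    return stats.reset_index()

# Toy data (made up for illustration, not taken from the competition files)
toy = pd.DataFrame({
    'class_id': [1, 1, 1, 2, 2],
    'power':    [50.0, 100.0, 100.0, 150.0, 150.0],
})
toy_stats = scaled_group_stats(toy, 'power')
print(toy_stats)
```

With such a helper, each numeric feature would reduce to a single `pd.merge(trainClass, scaled_group_stats(train, '<column>'), on='class_id')` call. Note that duplicates are dropped before averaging, so the mean is taken over distinct values per `class_id`, not over all sales records.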


### Number of cylinders

In [53]:
train['cylinder_number'].unique()

array([4, 6, 3, 0])

In [54]:
trainCylinder = train[['class_id', 'cylinder_number']].drop_duplicates()
cylinder_number_dummies = pd.get_dummies(trainCylinder['cylinder_number'], prefix='cylinder_number')
trainCylinder = pd.concat([trainCylinder,cylinder_number_dummies],axis=1)
trainCylinder = trainCylinder.drop(['cylinder_number'],axis=1)
trainCylinder = trainCylinder.groupby('class_id').agg('sum').reset_index()
trainClass = pd.merge(trainClass, trainCylinder, on='class_id')
trainClass.head()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,if_MPV_id_2,if_luxurious_id_1,if_luxurious_id_2,power_max,power_min,power_mean,cylinder_number_0,cylinder_number_3,cylinder_number_4,cylinder_number_6
0,125403,0,0,0,0,0,0,0,0,0,...,1,1,0,0.732,0.409,0.576,0,0,1,1
1,136916,0,0,0,0,0,0,1,0,0,...,1,1,0,0.477,0.355,0.396,0,0,1,0
2,178529,0,0,0,0,0,1,0,0,0,...,1,1,0,0.15,0.068,0.121,0,0,1,0
3,194450,0,0,0,0,0,0,0,0,0,...,1,1,0,0.455,0.336,0.402,0,0,1,0
4,198427,1,0,0,0,0,0,0,0,0,...,1,1,0,0.309,0.218,0.252,0,0,1,0
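
Because the (`class_id`, `cylinder_number`) pairs are de-duplicated before the one-hot encoding, summing the dummy columns per `class_id` gives a multi-hot row: a 1 for every cylinder count that the model ever appears with (the first row above has both 4 and 6). A compact sketch of the same idea on toy data follows. The helper name `multi_hot` and the data are hypothetical, and `dtype=int` is passed so the dummies are integers regardless of the pandas default.

```python
import pandas as pd

def multi_hot(df, col, key='class_id'):
    """One-hot encode `col` on unique (key, col) pairs, then sum per key."""
    # Hypothetical helper, assuming `col` is a nominal feature
    pairs = df[[key, col]].drop_duplicates()
    dummies = pd.get_dummies(pairs, columns=[col], prefix=col, dtype=int)
    return dummies.groupby(key).sum().reset_index()

# Toy data (made up): class 1 is sold with 4- and 6-cylinder engines, class 2 only with 4
toy = pd.DataFrame({
    'class_id':        [1, 1, 1, 2],
    'cylinder_number': [4, 6, 4, 4],
})
toy_hot = multi_hot(toy, 'cylinder_number')
print(toy_hot)
```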


### Engine torque

In [55]:
trainEngine = train[['class_id', 'engine_torque']].drop_duplicates().reset_index(drop=True)
trainEngine['engine_torque'] = MinMaxScaler().fit_transform(trainEngine[['engine_torque']]).ravel()
engine_torque_max = trainEngine.groupby(trainEngine['class_id'])[['engine_torque']].agg('max').reset_index()
engine_torque_max.rename(columns={'engine_torque':'engine_torque_max'},inplace=True)
engine_torque_min = trainEngine.groupby(trainEngine['class_id'])[['engine_torque']].agg('min').reset_index()
engine_torque_min.rename(columns={'engine_torque':'engine_torque_min'},inplace=True)
engine_torque_mean = trainEngine.groupby(trainEngine['class_id'])[['engine_torque']].agg('mean').reset_index()
engine_torque_mean.rename(columns={'engine_torque':'engine_torque_mean'},inplace=True)
trainEngine = trainEngine.drop(['engine_torque'],axis=1)
trainEngine = trainEngine.drop_duplicates().reset_index(drop=True)
trainEngine = pd.merge(trainEngine, engine_torque_max, on='class_id')
trainEngine = pd.merge(trainEngine, engine_torque_min, on='class_id')
trainEngine = pd.merge(trainEngine, engine_torque_mean, on='class_id')
trainClass = pd.merge(trainClass, trainEngine, on='class_id')
trainClass.head()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,power_max,power_min,power_mean,cylinder_number_0,cylinder_number_3,cylinder_number_4,cylinder_number_6,engine_torque_max,engine_torque_min,engine_torque_mean
0,125403,0,0,0,0,0,0,0,0,0,...,0.732,0.409,0.576,0,0,1,1,0.746,0.394,0.526
1,136916,0,0,0,0,0,0,1,0,0,...,0.477,0.355,0.396,0,0,1,0,0.423,0.287,0.346
2,178529,0,0,0,0,0,1,0,0,0,...,0.15,0.068,0.121,0,0,1,0,0.087,0.0,0.05
3,194450,0,0,0,0,0,0,0,0,0,...,0.455,0.336,0.402,0,0,1,0,0.507,0.279,0.373
4,198427,1,0,0,0,0,0,0,0,0,...,0.309,0.218,0.252,0,0,1,0,0.282,0.155,0.199


### Car length

In [56]:
trainLength = train[['class_id', 'car_length']].drop_duplicates().reset_index(drop=True)
trainLength['car_length'] = MinMaxScaler().fit_transform(trainLength[['car_length']]).ravel()
car_length_max = trainLength.groupby(trainLength['class_id'])[['car_length']].agg('max').reset_index()
car_length_max.rename(columns={'car_length':'car_length_max'},inplace=True)
car_length_min = trainLength.groupby(trainLength['class_id'])[['car_length']].agg('min').reset_index()
car_length_min.rename(columns={'car_length':'car_length_min'},inplace=True)
car_length_mean = trainLength.groupby(trainLength['class_id'])[['car_length']].agg('mean').reset_index()
car_length_mean.rename(columns={'car_length':'car_length_mean'},inplace=True)
trainLength = trainLength.drop(['car_length'],axis=1)
trainLength = trainLength.drop_duplicates().reset_index(drop=True)
trainLength = pd.merge(trainLength, car_length_max, on='class_id')
trainLength = pd.merge(trainLength, car_length_min, on='class_id')
trainLength = pd.merge(trainLength, car_length_mean, on='class_id')
trainClass = pd.merge(trainClass, trainLength, on='class_id')
trainClass.head()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,cylinder_number_0,cylinder_number_3,cylinder_number_4,cylinder_number_6,engine_torque_max,engine_torque_min,engine_torque_mean,car_length_max,car_length_min,car_length_mean
0,125403,0,0,0,0,0,0,0,0,0,...,0,0,1,1,0.746,0.394,0.526,1.0,0.96,0.98
1,136916,0,0,0,0,0,0,1,0,0,...,0,0,1,0,0.423,0.287,0.346,0.739,0.723,0.731
2,178529,0,0,0,0,0,1,0,0,0,...,0,0,1,0,0.087,0.0,0.05,0.371,0.035,0.167
3,194450,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0.507,0.279,0.373,0.503,0.409,0.456
4,198427,1,0,0,0,0,0,0,0,0,...,0,0,1,0,0.282,0.155,0.199,0.583,0.548,0.569


### Car width

In [57]:
trainWidth = train[['class_id', 'car_width']].drop_duplicates().reset_index(drop=True)
trainWidth['car_width'] = MinMaxScaler().fit_transform(trainWidth[['car_width']]).ravel()
car_width_max = trainWidth.groupby(trainWidth['class_id'])[['car_width']].agg('max').reset_index()
car_width_max.rename(columns={'car_width':'car_width_max'},inplace=True)
car_width_min = trainWidth.groupby(trainWidth['class_id'])[['car_width']].agg('min').reset_index()
car_width_min.rename(columns={'car_width':'car_width_min'},inplace=True)
car_width_mean = trainWidth.groupby(trainWidth['class_id'])[['car_width']].agg('mean').reset_index()
car_width_mean.rename(columns={'car_width':'car_width_mean'},inplace=True)
trainWidth = trainWidth.drop(['car_width'],axis=1)
trainWidth = trainWidth.drop_duplicates().reset_index(drop=True)
trainWidth = pd.merge(trainWidth, car_width_max, on='class_id')
trainWidth = pd.merge(trainWidth, car_width_min, on='class_id')
trainWidth = pd.merge(trainWidth, car_width_mean, on='class_id')
trainClass = pd.merge(trainClass, trainWidth, on='class_id')
trainClass.head()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,cylinder_number_6,engine_torque_max,engine_torque_min,engine_torque_mean,car_length_max,car_length_min,car_length_mean,car_width_max,car_width_min,car_width_mean
0,125403,0,0,0,0,0,0,0,0,0,...,1,0.746,0.394,0.526,1.0,0.96,0.98,0.852,0.78,0.816
1,136916,0,0,0,0,0,0,1,0,0,...,0,0.423,0.287,0.346,0.739,0.723,0.731,0.729,0.718,0.723
2,178529,0,0,0,0,0,1,0,0,0,...,0,0.087,0.0,0.05,0.371,0.035,0.167,0.243,0.0,0.15
3,194450,0,0,0,0,0,0,0,0,0,...,0,0.507,0.279,0.373,0.503,0.409,0.456,0.787,0.66,0.723
4,198427,1,0,0,0,0,0,0,0,0,...,0,0.282,0.155,0.199,0.583,0.548,0.569,0.519,0.498,0.508


### Car height

In [58]:
trainHeight = train[['class_id', 'car_height']].drop_duplicates().reset_index(drop=True)
trainHeight['car_height'] = MinMaxScaler().fit_transform(trainHeight[['car_height']]).ravel()
car_height_max = trainHeight.groupby(trainHeight['class_id'])[['car_height']].agg('max').reset_index()
car_height_max.rename(columns={'car_height':'car_height_max'},inplace=True)
car_height_min = trainHeight.groupby(trainHeight['class_id'])[['car_height']].agg('min').reset_index()
car_height_min.rename(columns={'car_height':'car_height_min'},inplace=True)
car_height_mean = trainHeight.groupby(trainHeight['class_id'])[['car_height']].agg('mean').reset_index()
car_height_mean.rename(columns={'car_height':'car_height_mean'},inplace=True)
trainHeight = trainHeight.drop(['car_height'],axis=1)
trainHeight = trainHeight.drop_duplicates().reset_index(drop=True)
trainHeight = pd.merge(trainHeight, car_height_max, on='class_id')
trainHeight = pd.merge(trainHeight, car_height_min, on='class_id')
trainHeight = pd.merge(trainHeight, car_height_mean, on='class_id')
trainClass = pd.merge(trainClass, trainHeight, on='class_id')
trainClass.head()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,engine_torque_mean,car_length_max,car_length_min,car_length_mean,car_width_max,car_width_min,car_width_mean,car_height_max,car_height_min,car_height_mean
0,125403,0,0,0,0,0,0,0,0,0,...,0.526,1.0,0.96,0.98,0.852,0.78,0.816,0.727,0.615,0.683
1,136916,0,0,0,0,0,0,1,0,0,...,0.346,0.739,0.723,0.731,0.729,0.718,0.723,0.128,0.108,0.119
2,178529,0,0,0,0,0,1,0,0,0,...,0.05,0.371,0.035,0.167,0.243,0.0,0.15,0.817,0.743,0.774
3,194450,0,0,0,0,0,0,0,0,0,...,0.373,0.503,0.409,0.456,0.787,0.66,0.723,0.486,0.44,0.463
4,198427,1,0,0,0,0,0,0,0,0,...,0.199,0.583,0.548,0.569,0.519,0.498,0.508,0.128,0.101,0.115


### Total mass

In [59]:
trainTQuality = train[['class_id', 'total_quality']].drop_duplicates().reset_index(drop=True)
trainTQuality['total_quality'] = MinMaxScaler().fit_transform(trainTQuality[['total_quality']]).ravel()
total_quality_max = trainTQuality.groupby(trainTQuality['class_id'])[['total_quality']].agg('max').reset_index()
total_quality_max.rename(columns={'total_quality':'total_quality_max'},inplace=True)
total_quality_min = trainTQuality.groupby(trainTQuality['class_id'])[['total_quality']].agg('min').reset_index()
total_quality_min.rename(columns={'total_quality':'total_quality_min'},inplace=True)
total_quality_mean = trainTQuality.groupby(trainTQuality['class_id'])[['total_quality']].agg('mean').reset_index()
total_quality_mean.rename(columns={'total_quality':'total_quality_mean'},inplace=True)
trainTQuality = trainTQuality.drop(['total_quality'],axis=1)
trainTQuality = trainTQuality.drop_duplicates().reset_index(drop=True)
trainTQuality = pd.merge(trainTQuality, total_quality_max, on='class_id')
trainTQuality = pd.merge(trainTQuality, total_quality_min, on='class_id')
trainTQuality = pd.merge(trainTQuality, total_quality_mean, on='class_id')
trainClass = pd.merge(trainClass, trainTQuality, on='class_id')
trainClass.head()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,car_length_mean,car_width_max,car_width_min,car_width_mean,car_height_max,car_height_min,car_height_mean,total_quality_max,total_quality_min,total_quality_mean
0,125403,0,0,0,0,0,0,0,0,0,...,0.98,0.852,0.78,0.816,0.727,0.615,0.683,1.0,0.88,0.945
1,136916,0,0,0,0,0,0,1,0,0,...,0.731,0.729,0.718,0.723,0.128,0.108,0.119,0.667,0.556,0.621
2,178529,0,0,0,0,0,1,0,0,0,...,0.167,0.243,0.0,0.15,0.817,0.743,0.774,0.385,0.171,0.269
3,194450,0,0,0,0,0,0,0,0,0,...,0.456,0.787,0.66,0.723,0.486,0.44,0.463,0.684,0.462,0.56
4,198427,1,0,0,0,0,0,0,0,0,...,0.569,0.519,0.498,0.508,0.128,0.101,0.115,0.231,0.222,0.227


### Equipment mass (curb weight)

In [60]:
trainEQuality = train[['class_id', 'equipment_quality']].drop_duplicates().reset_index(drop=True)
trainEQuality['equipment_quality'] = MinMaxScaler().fit_transform(trainEQuality[['equipment_quality']]).ravel()
equipment_quality_max = trainEQuality.groupby(trainEQuality['class_id'])[['equipment_quality']].agg('max').reset_index()
equipment_quality_max.rename(columns={'equipment_quality':'equipment_quality_max'},inplace=True)
equipment_quality_min = trainEQuality.groupby(trainEQuality['class_id'])[['equipment_quality']].agg('min').reset_index()
equipment_quality_min.rename(columns={'equipment_quality':'equipment_quality_min'},inplace=True)
equipment_quality_mean = trainEQuality.groupby(trainEQuality['class_id'])[['equipment_quality']].agg('mean').reset_index()
equipment_quality_mean.rename(columns={'equipment_quality':'equipment_quality_mean'},inplace=True)
trainEQuality = trainEQuality.drop(['equipment_quality'],axis=1)
trainEQuality = trainEQuality.drop_duplicates().reset_index(drop=True)
trainEQuality = pd.merge(trainEQuality, equipment_quality_max, on='class_id')
trainEQuality = pd.merge(trainEQuality, equipment_quality_min, on='class_id')
trainEQuality = pd.merge(trainEQuality, equipment_quality_mean, on='class_id')
trainClass = pd.merge(trainClass, trainEQuality, on='class_id')
trainClass.head()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,car_width_mean,car_height_max,car_height_min,car_height_mean,total_quality_max,total_quality_min,total_quality_mean,equipment_quality_max,equipment_quality_min,equipment_quality_mean
0,125403,0,0,0,0,0,0,0,0,0,...,0.816,0.727,0.615,0.683,1.0,0.88,0.945,0.977,0.847,0.91
1,136916,0,0,0,0,0,0,1,0,0,...,0.723,0.128,0.108,0.119,0.667,0.556,0.621,0.69,0.468,0.551
2,178529,0,0,0,0,0,1,0,0,0,...,0.15,0.817,0.743,0.774,0.385,0.171,0.269,0.134,0.0,0.053
3,194450,0,0,0,0,0,0,0,0,0,...,0.723,0.486,0.44,0.463,0.684,0.462,0.56,0.65,0.519,0.58
4,198427,1,0,0,0,0,0,0,0,0,...,0.508,0.128,0.101,0.115,0.231,0.222,0.227,0.294,0.255,0.271


### Wheelbase

In [61]:
trainWheel = train[['class_id', 'wheelbase']].drop_duplicates().reset_index(drop=True)
trainWheel['wheelbase'] = MinMaxScaler().fit_transform(trainWheel[['wheelbase']]).ravel()
wheelbase_max = trainWheel.groupby(trainWheel['class_id'])[['wheelbase']].agg('max').reset_index()
wheelbase_max.rename(columns={'wheelbase':'wheelbase_max'},inplace=True)
wheelbase_min = trainWheel.groupby(trainWheel['class_id'])[['wheelbase']].agg('min').reset_index()
wheelbase_min.rename(columns={'wheelbase':'wheelbase_min'},inplace=True)
wheelbase_mean = trainWheel.groupby(trainWheel['class_id'])[['wheelbase']].agg('mean').reset_index()
wheelbase_mean.rename(columns={'wheelbase':'wheelbase_mean'},inplace=True)
trainWheel = trainWheel.drop(['wheelbase'],axis=1)
trainWheel = trainWheel.drop_duplicates().reset_index(drop=True)
trainWheel = pd.merge(trainWheel, wheelbase_max, on='class_id')
trainWheel = pd.merge(trainWheel, wheelbase_min, on='class_id')
trainWheel = pd.merge(trainWheel, wheelbase_mean, on='class_id')
trainClass = pd.merge(trainClass, trainWheel, on='class_id')
trainClass.head()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,car_height_mean,total_quality_max,total_quality_min,total_quality_mean,equipment_quality_max,equipment_quality_min,equipment_quality_mean,wheelbase_max,wheelbase_min,wheelbase_mean
0,125403,0,0,0,0,0,0,0,0,0,...,0.683,1.0,0.88,0.945,0.977,0.847,0.91,0.973,0.973,0.973
1,136916,0,0,0,0,0,0,1,0,0,...,0.119,0.667,0.556,0.621,0.69,0.468,0.551,0.555,0.555,0.555
2,178529,0,0,0,0,0,1,0,0,0,...,0.774,0.385,0.171,0.269,0.134,0.0,0.053,0.521,0.187,0.354
3,194450,0,0,0,0,0,0,0,0,0,...,0.463,0.684,0.462,0.56,0.65,0.519,0.58,0.414,0.361,0.388
4,198427,1,0,0,0,0,0,0,0,0,...,0.115,0.231,0.222,0.227,0.294,0.255,0.271,0.324,0.321,0.322


### Front track

In [62]:
trainFTrack = train[['class_id', 'front_track']].drop_duplicates().reset_index(drop=True)
trainFTrack['front_track'] = MinMaxScaler().fit_transform(trainFTrack[['front_track']]).ravel()
front_track_max = trainFTrack.groupby(trainFTrack['class_id'])[['front_track']].agg('max').reset_index()
front_track_max.rename(columns={'front_track':'front_track_max'},inplace=True)
front_track_min = trainFTrack.groupby(trainFTrack['class_id'])[['front_track']].agg('min').reset_index()
front_track_min.rename(columns={'front_track':'front_track_min'},inplace=True)
front_track_mean = trainFTrack.groupby(trainFTrack['class_id'])[['front_track']].agg('mean').reset_index()
front_track_mean.rename(columns={'front_track':'front_track_mean'},inplace=True)
trainFTrack = trainFTrack.drop(['front_track'],axis=1)
trainFTrack = trainFTrack.drop_duplicates().reset_index(drop=True)
trainFTrack = pd.merge(trainFTrack, front_track_max, on='class_id')
trainFTrack = pd.merge(trainFTrack, front_track_min, on='class_id')
trainFTrack = pd.merge(trainFTrack, front_track_mean, on='class_id')
trainClass = pd.merge(trainClass, trainFTrack, on='class_id')
trainClass.head()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,total_quality_mean,equipment_quality_max,equipment_quality_min,equipment_quality_mean,wheelbase_max,wheelbase_min,wheelbase_mean,front_track_max,front_track_min,front_track_mean
0,125403,0,0,0,0,0,0,0,0,0,...,0.945,0.977,0.847,0.91,0.973,0.973,0.973,0.885,0.816,0.849
1,136916,0,0,0,0,0,0,1,0,0,...,0.621,0.69,0.468,0.551,0.555,0.555,0.555,0.787,0.787,0.787
2,178529,0,0,0,0,0,1,0,0,0,...,0.269,0.134,0.0,0.053,0.521,0.187,0.354,0.027,0.0,0.013
3,194450,0,0,0,0,0,0,0,0,0,...,0.56,0.65,0.519,0.58,0.414,0.361,0.388,0.907,0.72,0.813
4,198427,1,0,0,0,0,0,0,0,0,...,0.227,0.294,0.255,0.271,0.324,0.321,0.322,0.56,0.539,0.549


### Rear track

In [63]:
# Use a separate name (trainRTrack) so the front-track frame is not overwritten
trainRTrack = train[['class_id', 'rear_track']].drop_duplicates().reset_index(drop=True)
trainRTrack['rear_track'] = MinMaxScaler().fit_transform(trainRTrack[['rear_track']]).ravel()
rear_track_max = trainRTrack.groupby(trainRTrack['class_id'])[['rear_track']].agg('max').reset_index()
rear_track_max.rename(columns={'rear_track':'rear_track_max'},inplace=True)
rear_track_min = trainRTrack.groupby(trainRTrack['class_id'])[['rear_track']].agg('min').reset_index()
rear_track_min.rename(columns={'rear_track':'rear_track_min'},inplace=True)
rear_track_mean = trainRTrack.groupby(trainRTrack['class_id'])[['rear_track']].agg('mean').reset_index()
rear_track_mean.rename(columns={'rear_track':'rear_track_mean'},inplace=True)
trainRTrack = trainRTrack.drop(['rear_track'],axis=1)
trainRTrack = trainRTrack.drop_duplicates().reset_index(drop=True)
trainRTrack = pd.merge(trainRTrack, rear_track_max, on='class_id')
trainRTrack = pd.merge(trainRTrack, rear_track_min, on='class_id')
trainRTrack = pd.merge(trainRTrack, rear_track_mean, on='class_id')
trainClass = pd.merge(trainClass, trainRTrack, on='class_id')
trainClass.head()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,equipment_quality_mean,wheelbase_max,wheelbase_min,wheelbase_mean,front_track_max,front_track_min,front_track_mean,rear_track_max,rear_track_min,rear_track_mean
0,125403,0,0,0,0,0,0,0,0,0,...,0.91,0.973,0.973,0.973,0.885,0.816,0.849,0.882,0.803,0.83
1,136916,0,0,0,0,0,0,1,0,0,...,0.551,0.555,0.555,0.555,0.787,0.787,0.787,0.709,0.709,0.709
2,178529,0,0,0,0,0,1,0,0,0,...,0.053,0.521,0.187,0.354,0.027,0.0,0.013,0.0,0.0,0.0
3,194450,0,0,0,0,0,0,0,0,0,...,0.58,0.414,0.361,0.388,0.907,0.72,0.813,0.895,0.682,0.789
4,198427,1,0,0,0,0,0,0,0,0,...,0.271,0.324,0.321,0.322,0.56,0.539,0.549,0.472,0.451,0.462


### Rated passenger capacity

In [64]:
train['rated_passenger'].unique()

array(['7', '5', '7-8', '6-8', '6-7', '5-8', '5-7', '4-5', '9', '4'], dtype=object)

In [65]:
trainPassenger = train[['class_id', 'rated_passenger']].drop_duplicates()
rated_passenger_dummies = pd.get_dummies(trainPassenger['rated_passenger'], prefix='rated_passenger')
trainPassenger = pd.concat([trainPassenger,rated_passenger_dummies],axis=1)
trainPassenger = trainPassenger.drop(['rated_passenger'],axis=1)
trainPassenger = trainPassenger.groupby('class_id').agg('sum').reset_index()
trainClass = pd.merge(trainClass, trainPassenger, on='class_id')
trainClass.head()

Unnamed: 0,class_id,brand_id_12,brand_id_49,brand_id_68,brand_id_75,brand_id_76,brand_id_98,brand_id_106,brand_id_126,brand_id_236,...,rated_passenger_4,rated_passenger_4-5,rated_passenger_5,rated_passenger_5-7,rated_passenger_5-8,rated_passenger_6-7,rated_passenger_6-8,rated_passenger_7,rated_passenger_7-8,rated_passenger_9
0,125403,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,136916,0,0,0,0,0,0,1,0,0,...,0,0,1,0,0,0,0,0,0,0
2,178529,0,0,0,0,0,1,0,0,0,...,0,0,1,0,0,0,0,1,0,0
3,194450,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,198427,1,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
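
The encoding above treats range strings such as `'7-8'` as opaque categories, so `'7'` and `'7-8'` end up in unrelated columns. As a possible complement (an alternative idea, not what the notebook does), the strings could also be parsed into numeric lower and upper seat counts. The sketch below assumes every value is either a single integer or two integers joined by a hyphen, as in the `unique()` output above; the names `seat_bounds`, `rated_passenger_min` and `rated_passenger_max` are hypothetical.

```python
import pandas as pd

def seat_bounds(value):
    """Parse '5' -> (5, 5) and '7-8' -> (7, 8)."""
    # Assumes the value is 'N' or 'N-M' with integer parts
    parts = str(value).split('-')
    return int(parts[0]), int(parts[-1])

# A few of the values listed by train['rated_passenger'].unique()
toy = pd.Series(['7', '5', '7-8', '4-5'], name='rated_passenger')
bounds = pd.DataFrame(
    [seat_bounds(v) for v in toy],
    columns=['rated_passenger_min', 'rated_passenger_max'],
)
print(bounds)
```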


### Merged features

In [66]:
trainClass.to_csv('../../raw/LiChuan/train_feature.csv', index=False)
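
Every feature block is attached with `pd.merge(..., on='class_id')`, which defaults to an inner join, so a `class_id` missing from any one block would silently drop out of `trainClass`. Before writing the file it may be worth checking that the table still has exactly one row per `class_id` and no missing values. The following is a minimal sketch of such a check on toy data; the function name `check_feature_table` and the data are hypothetical.

```python
import pandas as pd

def check_feature_table(features, source, key='class_id'):
    """Return a list of human-readable problems found in the merged feature table."""
    problems = []
    if features[key].duplicated().any():
        problems.append('duplicate keys')
    # Keys present in the raw data but absent after the chain of inner merges
    missing = set(source[key]) - set(features[key])
    if missing:
        problems.append(f'{len(missing)} keys lost in merges')
    if features.isna().any().any():
        problems.append('NaN values present')
    return problems

# Toy data (made up): class_id 3 exists in the source but not in the feature table
toy_source = pd.DataFrame({'class_id': [1, 1, 2, 3]})
toy_features = pd.DataFrame({'class_id': [1, 2], 'power_max': [0.5, 1.0]})
toy_problems = check_feature_table(toy_features, toy_source)
print(toy_problems)
```

In the notebook this would be called as `check_feature_table(trainClass, train)`, with an empty list meaning none of the three problems was found.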