https://www.lintcode.com/ai/car-insurance-risk/overview
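# Problem Description
An insurance company sells a car insurance product and needs to assess the condition of each car. Your task is to design a model that scores a car's insurance risk from its attributes. The risk score is an integer from 0 to 70; a higher value means higher risk.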

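# Hints
This is a typical regression problem.
First extract each car's features and label from train.csv, then train the model on them.
The features are either numerical or categorical. Scale the numerical features to a common range, and convert the categorical features to one-hot encoding.
Traditional regression models such as SVR (support vector regression) or CART (classification and regression trees) can be used. However, ensemble methods that combine many simple models into a stronger one, such as Random Forest or AdaBoost, are strongly recommended. When using ensemble learning, add regularization to keep the model from overfitting.
Finally, use the trained model to predict the risk score of each car in test.csv.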
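
The hints above can be sketched as a single scikit-learn pipeline. This is a minimal illustration on a made-up toy frame, not the notebook's own method: the column names `mileage`, `age`, and `brand`, the data, and the hyperparameter values are all assumptions. The random forest is regularized here by limiting tree size (`max_depth`, `min_samples_leaf`) rather than by an explicit penalty term.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data; all column names and values are hypothetical.
toy = pd.DataFrame({
    "mileage": [10, 50, 30, 80, 20, 60],
    "age": [1, 5, 3, 9, 2, 7],
    "brand": ["a", "b", "a", "c", "b", "c"],
    "Score": [2, 10, 5, 30, 4, 18],
})
num_cols = ["mileage", "age"]
cat_cols = ["brand"]

# Scale numerical columns to [0, 1] and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

# Shallow trees and a minimum leaf size act as regularization.
model = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestRegressor(n_estimators=50, max_depth=4,
                                 min_samples_leaf=2, random_state=0)),
])

model.fit(toy[num_cols + cat_cols], toy["Score"])
pred = model.predict(toy[num_cols + cat_cols])
```

Tree-based models do not strictly need scaled inputs; the scaler is kept so that the same preprocessing could also feed a scale-sensitive model such as SVR.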
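# Prerequisites
Understand the principles and usage of basic machine learning regression models, such as SVR (support vector regression) and CART (classification and regression trees).
Understand ensemble learning algorithms, such as Random Forest and AdaBoost.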
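# Glossary
SVR (support vector regression): a regression method based on SVM (support vector machines). It fits the data using the support-vector idea together with Lagrange multipliers.
CART (classification and regression trees): a special kind of decision tree. It assumes the tree is binary, with each internal node testing a feature and branching on "yes" or "no". Such a tree is equivalent to recursively bisecting each feature, partitioning the input (feature) space into finitely many cells and assigning a predicted probability distribution to each cell, that is, the conditional distribution of the output given the input. It is commonly used for classification and regression.
Ensemble learning: a machine learning approach that trains a series of learners and combines their results by some rule, aiming for better performance than any single learner.
Random Forest: a model that uses many trees to train on and predict samples. In machine learning, a random forest is a classifier containing multiple decision trees, whose output class is the mode of the classes output by the individual trees. For regression, as in this problem, the trees' outputs are averaged instead.
AdaBoost: an iterative algorithm whose core idea is to train different classifiers (weak classifiers) on the same training set and then combine them into a stronger final classifier (strong classifier). It works by changing the data distribution: the weight of each sample is set according to whether that sample was classified correctly in the current round and to the overall accuracy of the previous round. The reweighted dataset is passed to the next classifier for training, and the classifiers from all rounds are finally combined into the final decision classifier.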
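
The glossary describes random forests in their classification form. As a small check of how the regression variant behaves, the sketch below shows that scikit-learn's `RandomForestRegressor` predicts the mean of its individual trees' predictions. The synthetic data and hyperparameters are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (hypothetical, for illustration only).
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = 10 * X[:, 0] + rng.rand(200)

forest = RandomForestRegressor(n_estimators=20, max_depth=3, random_state=0).fit(X, y)

# Average the per-tree predictions by hand and compare with the forest's output.
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_mean = per_tree.mean(axis=0)
forest_pred = forest.predict(X)
```

`np.allclose(manual_mean, forest_pred)` should hold, since the forest's prediction is the average over its trees.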

# Goal
Given a car's performance indicators, design an algorithm that scores the car's insurance risk.

# Evaluation
Submitted files are evaluated with RMSE (Root Mean Square Error).
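
RMSE is the square root of the mean of the squared differences between true and predicted scores. A minimal sketch in plain Python follows; the function name `rmse` and the sample scores are assumptions, not part of the competition's tooling.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical scores: the errors are 1, 0 and -2, so RMSE = sqrt(5 / 3).
example = rmse([3, 5, 2], [2, 5, 4])
```

Because the errors are squared before averaging, a few large misses on high-risk cars raise the metric more than many small ones.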

In [2]:
import pandas as pd

train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
submission = pd.read_csv('data/submission.csv')

In [4]:
train.shape, test.shape, submission.shape

((32000, 34), (8000, 33), (8000, 2))

In [3]:
train.head()

Unnamed: 0,Id,Score,Col_1,Col_2,Col_3,Col_4,Col_5,Col_6,Col_7,Col_8,...,Col_23,Col_24,Col_25,Col_26,Col_27,Col_28,Col_29,Col_30,Col_31,Col_32
0,1,4,2,1,a,b,1,1,12,e,...,1,b,b,12,j,a,b,15,15,b
1,2,2,17,10,a,a,0,3,8,c,...,2,b,a,3,f,a,b,15,3,b
2,3,4,11,1,a,a,0,2,6,e,...,3,b,a,19,f,a,d,10,18,b
3,4,1,11,6,a,a,0,3,1,e,...,1,b,a,16,b,a,b,15,18,b
4,5,5,13,1,a,a,0,3,3,c,...,2,b,a,10,f,a,b,5,9,b


In [5]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32000 entries, 0 to 31999
Data columns (total 34 columns):
Id        32000 non-null int64
Score     32000 non-null int64
Col_1     32000 non-null int64
Col_2     32000 non-null int64
Col_3     32000 non-null object
Col_4     32000 non-null object
Col_5     32000 non-null int64
Col_6     32000 non-null int64
Col_7     32000 non-null int64
Col_8     32000 non-null object
Col_9     32000 non-null object
Col_10    32000 non-null int64
Col_11    32000 non-null object
Col_12    32000 non-null object
Col_13    32000 non-null int64
Col_14    32000 non-null int64
Col_15    32000 non-null object
Col_16    32000 non-null int64
Col_17    32000 non-null object
Col_18    32000 non-null int64
Col_19    32000 non-null int64
Col_20    32000 non-null object
Col_21    32000 non-null int64
Col_22    32000 non-null object
Col_23    32000 non-null int64
Col_24    32000 non-null object
Col_25    32000 non-null object
Col_26    32000 non-null int64
Col_27    3

In [10]:
def MissAnalysis(df):
    stats = []
    for col in df.columns:
        stats.append((col, df[col].nunique(), df[col].isnull().sum() * 100 / df.shape[0], df[col].value_counts(normalize=True, dropna=False).values[0] * 100, df[col].dtype))

    stats_df = pd.DataFrame(stats, columns=['Feature', 'Unique_values', 'Percentage of missing values', 'Percentage of values in the biggest category', 'type'])
    df1 = stats_df.sort_values('Percentage of missing values', ascending=False)[:10]
    df2 = stats_df.sort_values('Percentage of values in the biggest category', ascending=False)[:10]
    return df1, df2
df1, df2 = MissAnalysis(train)
df1

Unnamed: 0,Feature,Unique_values,Percentage of missing values,Percentage of values in the biggest category,type
0,Id,32000,0.0,0.003125,int64
25,Col_24,4,0.0,91.9875,object
19,Col_18,7,0.0,44.8875,int64
20,Col_19,12,0.0,42.19375,int64
21,Col_20,2,0.0,55.7375,object
22,Col_21,7,0.0,20.878125,int64
23,Col_22,2,0.0,67.746875,object
24,Col_23,5,0.0,65.65,int64
26,Col_25,2,0.0,81.73125,object
1,Score,43,0.0,37.315625,int64


In [8]:
df2

Unnamed: 0,Feature,Unique_values,Percentage of missing values,Percentage of values in the biggest category,type
6,Col_5,3,0.0,97.109375,int64
10,Col_9,4,0.0,94.05,object
25,Col_24,4,0.0,91.9875,object
30,Col_29,4,0.0,91.24375,object
29,Col_28,8,0.0,89.48125,object
26,Col_25,2,0.0,81.73125,object
5,Col_4,2,0.0,80.73125,object
33,Col_32,2,0.0,72.5625,object
23,Col_22,2,0.0,67.746875,object
4,Col_3,6,0.0,66.253125,object


In [11]:
df1, df2 = MissAnalysis(test)
df1

Unnamed: 0,Feature,Unique_values,Percentage of missing values,Percentage of values in the biggest category,type
0,Id,8000,0.0,0.0125,int64
17,Col_17,10,0.0,29.3375,object
31,Col_31,7,0.0,27.925,int64
30,Col_30,4,0.0,34.2875,int64
29,Col_29,4,0.0,90.8,object
28,Col_28,8,0.0,89.6,object
27,Col_27,12,0.0,32.8375,object
26,Col_26,22,0.0,7.1375,int64
25,Col_25,2,0.0,81.95,object
24,Col_24,4,0.0,91.9375,object


In [12]:
df2

Unnamed: 0,Feature,Unique_values,Percentage of missing values,Percentage of values in the biggest category,type
5,Col_5,3,0.0,97.3875,int64
9,Col_9,4,0.0,94.15,object
24,Col_24,4,0.0,91.9375,object
29,Col_29,4,0.0,90.8,object
28,Col_28,8,0.0,89.6,object
25,Col_25,2,0.0,81.95,object
4,Col_4,2,0.0,80.8875,object
32,Col_32,2,0.0,72.4625,object
22,Col_22,2,0.0,67.4125,object
3,Col_3,6,0.0,67.1,object


In [None]:
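1. There are no missing values, but several features are dominated by a single value. A common rule is to drop features whose most frequent value covers more than 90% of the rows (Col_5, Col_9, Col_24, Col_29).
2. Check whether Score falls within the 0-70 range.
3. Extract the categorical features.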
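
The range check in note 2 can be sketched as follows. The helper names `scores_in_range` and `clip_to_range` and the sample values are hypothetical; clipping predictions to the valid range is an optional extra step assumed here, not something the notebook does.

```python
import numpy as np
import pandas as pd

SCORE_MIN, SCORE_MAX = 0, 70  # range stated in the problem description

def scores_in_range(scores: pd.Series) -> bool:
    """Return True if every score lies in [SCORE_MIN, SCORE_MAX]."""
    return bool(scores.between(SCORE_MIN, SCORE_MAX).all())

def clip_to_range(raw_pred) -> np.ndarray:
    """Clip raw model outputs into the valid score range."""
    return np.clip(raw_pred, SCORE_MIN, SCORE_MAX)

labels = pd.Series([4, 2, 69, 1])      # hypothetical labels
raw = np.array([-1.2, 3.6, 75.0])      # hypothetical raw predictions
clipped = clip_to_range(raw)
```

Clipping can only lower RMSE or leave it unchanged when the true scores are known to lie in the range, since it moves out-of-range predictions closer to any valid target.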

In [16]:
good_features = list(train.columns)
for feat in train.columns:
    rate = train[feat].value_counts(normalize=True, dropna=False).values[0]
    if rate > 0.9:
        good_features.remove(feat)
len(good_features)

30

In [13]:
train['Score'].unique()

array([ 4,  2,  1,  5,  6, 12, 16, 11,  8, 17,  3, 10, 13,  7, 18, 23,  9,
       19, 15, 14, 20, 26, 28, 25, 32, 24, 22, 42, 46, 21, 31, 34, 29, 37,
       30, 35, 52, 64, 69, 38, 40, 41, 36], dtype=int64)

In [19]:
# Merge the train and test sets
good_features.remove('Score')
target = train['Score']
del train['Score']  # Score must be removed before the two frames can be concatenated
data = pd.concat([train, test], axis=0, ignore_index=True)
data = data.fillna(-1)
data = data[good_features]
data.shape

(40000, 29)

In [20]:
categorical_features = [feat for feat in data.columns if data[feat].dtype == object]
numerical_features = [feat for feat in data.columns if feat not in categorical_features]
categorical_features

['Col_3',
 'Col_4',
 'Col_8',
 'Col_11',
 'Col_12',
 'Col_15',
 'Col_17',
 'Col_20',
 'Col_22',
 'Col_25',
 'Col_27',
 'Col_28',
 'Col_32']

In [22]:
for feat in categorical_features:
    print(data[feat][0], data[feat].nunique())

a 6
b 2
e 5
c 6
f 8
i 18
f 10
a 2
a 2
b 2
j 12
a 8
b 2


In [None]:
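# Imports needed by this cell
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Label encoder: map each categorical value to an integer 0..n-1
for f in categorical_features:
    data[f] = data[f].map(dict(zip(data[f].unique(), range(0, data[f].nunique()))))
# Split back into train and test rows (train rows come first after the concat)
n_train = train.shape[0]
train = data[:n_train]
test = data[n_train:]

# One-hot: start from the numerical columns (note that these still include Id),
# then append each encoded categorical column
X_train = train[numerical_features].values
X_test = test[numerical_features].values
enc = OneHotEncoder()
for f in categorical_features:
    # Fit on train + test together so both share the same encoded columns
    enc.fit(data[f].values.reshape(-1, 1))
    X_train = sparse.hstack((X_train, enc.transform(train[f].values.reshape(-1, 1))), 'csr')
    X_test = sparse.hstack((X_test, enc.transform(test[f].values.reshape(-1, 1))), 'csr')

y_train = target.values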
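
As a possible alternative to the per-column loop, `OneHotEncoder` can encode several categorical columns in one call and accepts string categories directly, so the separate label-encoding step is not required. The sketch below uses a toy frame; the column names `num_1`, `cat_1`, `cat_2` and the split size `n_train_toy` are assumptions.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the combined train + test frame; names and values are hypothetical.
toy = pd.DataFrame({
    "num_1": [1, 2, 3, 4],
    "cat_1": ["a", "b", "a", "c"],
    "cat_2": ["x", "x", "y", "y"],
})
n_train_toy = 3
cat_cols = ["cat_1", "cat_2"]
num_cols = ["num_1"]

# Fit once on all rows so train and test share the same encoded columns.
enc = OneHotEncoder()
encoded = enc.fit_transform(toy[cat_cols])  # sparse matrix, one block per column

# Stack the numerical columns with the encoded block, then split by row.
X_all = sparse.hstack(
    [sparse.csr_matrix(toy[num_cols].values), encoded],
    format="csr",
)
X_train_toy, X_test_toy = X_all[:n_train_toy], X_all[n_train_toy:]
```

Here the result has one numerical column plus 3 + 2 one-hot columns, matching what the loop would produce column by column.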