# Core Competency Improvement Class, Business Intelligence Track, Cohort 004, Week 6

### Thinking 1: What is the difference between XGBoost and GBDT?

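XGBoost is an engineering implementation of GBDT. The main difference lies in the objective function: XGBoost adds a regularization term and approximates the loss with a second-order Taylor expansion, keeping the second-order term. The regularization term has two parts: the number of leaf nodes in the tree, and the squared L2 norm of the leaf scores. XGBoost also optimizes node splitting. It uses a greedy algorithm that, instead of evaluating every possible grouping of samples, sorts the samples by the value of a feature and scans the sorted values for split points. For large datasets, it buckets feature values into histograms (Histogram), which greatly reduces the amount of computation.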
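
The split search described above can be sketched in a few lines of plain Python. This is a minimal illustration, not XGBoost's actual implementation: the function names `leaf_score` and `best_split`, the toy data, and the use of squared-error gradients are all assumptions made for the example. It uses the standard gain formula, in which each leaf contributes G²/(H + λ) and every extra leaf costs γ.

```python
def leaf_score(g_sum, h_sum, lam):
    """Structure score G^2 / (H + lambda) of a single leaf."""
    return g_sum * g_sum / (h_sum + lam)


def best_split(values, grads, hess, lam=1.0, gamma=0.0):
    """Exact greedy search over one feature.

    Sort samples by feature value, scan left to right while accumulating
    gradient/hessian sums, and score every boundary between distinct values.
    Returns (best_gain, threshold); threshold is None if no split has gain > 0.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    g_total, h_total = sum(grads), sum(hess)
    parent = leaf_score(g_total, h_total, lam)

    best_gain, best_threshold = 0.0, None
    g_left = h_left = 0.0
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        g_left += grads[i]
        h_left += hess[i]
        if values[i] == values[nxt]:
            continue  # cannot split between equal feature values
        g_right, h_right = g_total - g_left, h_total - h_left
        gain = 0.5 * (leaf_score(g_left, h_left, lam)
                      + leaf_score(g_right, h_right, lam)
                      - parent) - gamma
        if gain > best_gain:
            best_gain = gain
            best_threshold = (values[i] + values[nxt]) / 2
    return best_gain, best_threshold


# Toy data (hypothetical): squared-error loss with initial prediction 0,
# so the gradient is g = pred - y = -y and the hessian is h = 1.
x = [1.0, 2.0, 3.0, 4.0]
y = [0.0, 0.0, 1.0, 1.0]
g = [-t for t in y]
h = [1.0] * len(y)
gain, threshold = best_split(x, g, h, lam=1.0, gamma=0.0)
print(gain, threshold)
```

On this toy data the best boundary falls between 2.0 and 3.0, where the labels change. A histogram-based variant would scan bucket boundaries instead of every distinct value.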

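### Thinking 2: Describe a prediction project you have done (which model you used and what problem it solved; for example, "I used an LR model to predict employee attrition, and the results were..." Please share it in the course WeChat group)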

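I previously worked on predicting the value of second-hand phones: the recycling value had to be predicted from basic attributes (memory, storage capacity, purchase channel, and so on) and depreciation attributes (body appearance, screen appearance, screen display, disassembly and repair history, and so on). I used two GBDT models to extract features from the basic attributes and the depreciation attributes respectively, and then used LR to predict the value.  
For the employee attrition prediction done in class, I used an LR model. I first separated the features into discrete and continuous ones, applied OneHot encoding to the discrete features and Z-Score standardization to the continuous features, concatenated the two groups, and then used LR to predict the attrition probability. I am currently ranked seventh.<img src="./rank.PNG"></img>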
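
The OneHot plus Z-Score plus LR workflow described above could look roughly like the sketch below. The DataFrame, its column names, and the variable names are hypothetical placeholders rather than the course dataset or the code actually submitted; only the scikit-learn classes are real.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; the column names are placeholders, not the course dataset.
df = pd.DataFrame({
    "department": ["sales", "rd", "sales", "hr", "rd", "hr", "sales", "rd"],
    "overtime": ["yes", "no", "yes", "no", "no", "yes", "yes", "no"],
    "age": [25, 41, 29, 35, 50, 23, 31, 45],
    "monthly_income": [3000, 9000, 3500, 5200, 12000, 2800, 4000, 10000],
    "attrition": [1, 0, 1, 0, 0, 1, 1, 0],
})
discrete_cols = ["department", "overtime"]
continuous_cols = ["age", "monthly_income"]

# OneHot-encode the discrete columns, Z-Score the continuous ones,
# and concatenate both groups into a single feature matrix.
preprocess = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown="ignore"), discrete_cols),
    ("zscore", StandardScaler(), continuous_cols),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("lr", LogisticRegression(max_iter=1000)),
])
model.fit(df[discrete_cols + continuous_cols], df["attrition"])

# Predicted probability of attrition for each row
attrition_proba = model.predict_proba(df[discrete_cols + continuous_cols])[:, 1]
print(attrition_proba.round(3))
```

Wrapping the preprocessing in a `Pipeline` means the encoder and scaler are fit only on whatever data is passed to `fit`, which keeps the same transformation for training and prediction.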

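### Thinking 3: Think about which features you need to build in your own work (for example user profiles, item features...) and which dimensions these features cover (you are encouraged to share and discuss in the WeChat group)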

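In my current work I am doing a sequential recommendation task: from users' click or purchase history, I need to infer their purchasing habits or interests and predict which items they will be interested in next. The features to build include user features (gender, age group, occupation, postal code), the time (timestamp) at which the user clicked each item, and the attributes of the items.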

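### Action 1: Male/female voice recognition
Dataset: voice.csv  
3,168 recorded voice samples (from male and female speakers), collected over a frequency range of 0 Hz to 280 Hz; the data have already been preprocessed  
There are 21 attributes in total. Determine whether each voice is male or female.  
Accuracy is used as the evaluation metric  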


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.metrics import accuracy_score

In [2]:
data = pd.read_csv("data/voice.csv")
data.head()

| | meanfreq | sd | median | Q25 | Q75 | IQR | skew | kurt | sp.ent | sfm | ... | centroid | meanfun | minfun | maxfun | meandom | mindom | maxdom | dfrange | modindx | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.059781 | 0.064241 | 0.032027 | 0.015071 | 0.090193 | 0.075122 | 12.863462 | 274.402906 | 0.893369 | 0.491918 | ... | 0.059781 | 0.084279 | 0.015702 | 0.275862 | 0.007812 | 0.007812 | 0.007812 | 0.0 | 0.0 | male |
| 1 | 0.066009 | 0.06731 | 0.040229 | 0.019414 | 0.092666 | 0.073252 | 22.423285 | 634.613855 | 0.892193 | 0.513724 | ... | 0.066009 | 0.107937 | 0.015826 | 0.25 | 0.009014 | 0.007812 | 0.054688 | 0.046875 | 0.052632 | male |
| 2 | 0.077316 | 0.083829 | 0.036718 | 0.008701 | 0.131908 | 0.123207 | 30.757155 | 1024.927705 | 0.846389 | 0.478905 | ... | 0.077316 | 0.098706 | 0.015656 | 0.271186 | 0.00799 | 0.007812 | 0.015625 | 0.007812 | 0.046512 | male |
| 3 | 0.151228 | 0.072111 | 0.158011 | 0.096582 | 0.207955 | 0.111374 | 1.232831 | 4.177296 | 0.963322 | 0.727232 | ... | 0.151228 | 0.088965 | 0.017798 | 0.25 | 0.201497 | 0.007812 | 0.5625 | 0.554688 | 0.247119 | male |
| 4 | 0.13512 | 0.079146 | 0.124656 | 0.07872 | 0.206045 | 0.127325 | 1.101174 | 4.333713 | 0.971955 | 0.783568 | ... | 0.13512 | 0.106398 | 0.016931 | 0.266667 | 0.712812 | 0.007812 | 5.484375 | 5.476562 | 0.208274 | male |


In [3]:
feature_label = data.columns.tolist()[:-1]
target_label = data.columns.tolist()[-1]
feature_label, target_label

(['meanfreq',
  'sd',
  'median',
  'Q25',
  'Q75',
  'IQR',
  'skew',
  'kurt',
  'sp.ent',
  'sfm',
  'mode',
  'centroid',
  'meanfun',
  'minfun',
  'maxfun',
  'meandom',
  'mindom',
  'maxdom',
  'dfrange',
  'modindx'],
 'label')

In [4]:
target = data[target_label].values

In [5]:
feature = data[feature_label].values

In [6]:
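# Apply Z-Score standardization
# (note: the scaler is fit on all rows before the train/test split, so test-set statistics influence the scaling)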
ss = preprocessing.StandardScaler()
feature = ss.fit_transform(feature)

In [7]:
LE = preprocessing.LabelEncoder()
target = LE.fit_transform(target).astype(float)

In [8]:
feature, target

(array([[-4.04924806,  0.4273553 , -4.22490077, ..., -1.43142165,
         -1.41913712, -1.45477229],
        [-3.84105325,  0.6116695 , -3.99929342, ..., -1.41810716,
         -1.4058184 , -1.01410294],
        [-3.46306647,  1.60384791, -4.09585052, ..., -1.42920257,
         -1.41691733, -1.06534356],
        ...,
        [-1.29877326,  2.32272355, -0.05197279, ..., -0.5992661 ,
         -0.58671739,  0.17588664],
        [-1.2452018 ,  2.012196  , -0.01772849, ..., -0.41286326,
         -0.40025537,  1.14916112],
        [-0.51474626,  2.14765111, -0.07087873, ..., -1.27608595,
         -1.2637521 ,  1.47567886]]), array([1., 1., 1., ..., 0., 0., 0.]))

In [9]:
# Split the data: 20% is held out as the test set, the rest is used for training
train_x, test_x, train_y, test_y = train_test_split(feature, target, test_size=0.2, random_state=33)
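
The cells above fit `StandardScaler` on every row before splitting, which lets the test rows contribute to the mean and standard deviation. A leakage-free ordering might look like the sketch below; the synthetic arrays and the `_demo`, `X_tr`, `X_te` names are stand-ins introduced for illustration, and the accuracies reported in this notebook were not produced this way.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # synthetic stand-in for the features
y_demo = rng.integers(0, 2, size=100)                   # synthetic binary labels

# Split first, then scale
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=33)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # statistics come from the training rows only
X_te_scaled = scaler.transform(X_te)      # test rows reuse the training mean and std

print(X_tr_scaled.mean(axis=0).round(6), X_te_scaled.mean(axis=0).round(3))
```

The training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately centered, which is the expected behavior.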

In [10]:
from sklearn.svm import SVC
svc = SVC(gamma='auto')
svc.fit(train_x, train_y)
predict_y = svc.predict(test_x)
print('SVM accuracy: %0.4f' % accuracy_score(test_y, predict_y))

SVM accuracy: 0.9795


In [11]:
from sklearn.tree import DecisionTreeClassifier
CART = DecisionTreeClassifier()
CART.fit(train_x, train_y)
predict_y = CART.predict(test_x)
print('CART accuracy: %0.4f' % accuracy_score(test_y, predict_y))

CART accuracy: 0.9574


In [12]:
from sklearn.linear_model import LogisticRegression
LogR = LogisticRegression(solver='lbfgs')
LogR.fit(train_x, train_y)
predict_y = LogR.predict(test_x)
print('Logistic regression accuracy: %0.4f' % accuracy_score(test_y, predict_y))

Logistic regression accuracy: 0.9669


### Prediction with RF+LR

In [13]:
# Split the training set again into two halves: one to fit the RF, the other to fit the LR
X_train, X_train_lr, y_train, y_train_lr = train_test_split(train_x, train_y, test_size=0.5)

In [14]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

In [15]:
n_estimator = 10
# Supervised feature transformation based on a random forest
rf = RandomForestClassifier(max_depth=3, n_estimators=n_estimator)
rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=3, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [16]:
# Fit the OneHot encoder on the leaf indices returned by rf.apply
rf_enc = OneHotEncoder(categories='auto')
rf_enc.fit(rf.apply(X_train))

OneHotEncoder(categorical_features=None, categories='auto',
       dtype=<class 'numpy.float64'>, handle_unknown='error',
       n_values=None, sparse=True)

In [17]:
# Train the LR using the OneHot-encoded leaf indices as features
rf_lm = LogisticRegression(solver='lbfgs', max_iter=1000)
rf_lm.fit(rf_enc.transform(rf.apply(X_train_lr)), y_train_lr)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=1000, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

In [18]:
# Predict with the LR
y_pred_rf_lm = rf_lm.predict_proba(rf_enc.transform(rf.apply(test_x)))[:, 1]
print('RF+LR accuracy: %0.4f' % accuracy_score(test_y, y_pred_rf_lm.round()))

RF+LR accuracy: 0.9700
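
To make the intermediate representation in the RF+LR cells concrete, the sketch below shows what `apply` returns and how the leaf indices turn into OneHot features. It runs on synthetic data from `make_classification`, and the `_demo` names are illustrative; `handle_unknown="ignore"` is an assumption added here so that a leaf never seen while fitting the encoder does not raise an error, whereas the notebook keeps the default.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

# Synthetic binary classification data (not voice.csv)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
rf_demo.fit(X_demo, y_demo)

# apply() returns, for every sample, the index of the leaf it lands in for each tree
leaves = rf_demo.apply(X_demo)  # shape (n_samples, n_estimators)

# Each (tree, leaf) pair becomes one binary column
enc_demo = OneHotEncoder(handle_unknown="ignore")
leaf_features = enc_demo.fit_transform(leaves)  # sparse matrix

print(leaves.shape, leaf_features.shape)
```

Every row of `leaf_features` contains exactly one 1 per tree, so the LR effectively learns one weight per leaf.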


### Prediction with GBDT+LR

In [19]:
from sklearn.ensemble import GradientBoostingClassifier
n_estimator = 10
# Supervised feature transformation based on GBDT
gbdt = GradientBoostingClassifier(max_depth=3, n_estimators=n_estimator)
gbdt.fit(X_train, y_train)
# Fit the OneHot encoder; gbdt.apply returns shape (n_samples, n_estimators, 1) for binary targets
gbdt_enc = OneHotEncoder(categories='auto')
gbdt_enc.fit(gbdt.apply(X_train).reshape(-1, n_estimator))
# Train the LR using the OneHot-encoded leaf indices as features
gbdt_lm = LogisticRegression(solver='lbfgs', max_iter=1000)
gbdt_lm.fit(gbdt_enc.transform(gbdt.apply(X_train_lr).reshape(-1, n_estimator)), y_train_lr)
# Predict with the LR
y_pred_gbdt_lm = gbdt_lm.predict_proba(gbdt_enc.transform(gbdt.apply(test_x).reshape(-1, n_estimator)))[:, 1]
print('GBDT+LR accuracy: %0.4f' % accuracy_score(test_y, y_pred_gbdt_lm.round()))

GBDT+LR accuracy: 0.9653
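
The reshape in the GBDT+LR cell is needed because `GradientBoostingClassifier.apply` returns a three-dimensional array. The sketch below, run on synthetic data with illustrative `_demo` names, shows the shape and an alternative that drops the trailing axis without hard-coding the number of trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic binary classification data (not voice.csv)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
gbdt_demo = GradientBoostingClassifier(n_estimators=10, max_depth=3, random_state=0)
gbdt_demo.fit(X_demo, y_demo)

raw_leaves = gbdt_demo.apply(X_demo)  # (n_samples, n_estimators, 1) for binary targets
leaves_2d = raw_leaves[:, :, 0]       # drop the trailing axis; no tree count needed
print(raw_leaves.shape, leaves_2d.shape)
```

Slicing with `[:, :, 0]` gives the same result as `reshape(-1, 10)` here, and it keeps working if `n_estimators` is changed later.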


### Prediction with XGBoost

In [25]:
import xgboost as xgb

In [26]:
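param = {'boosting_type':'gbdt', # LightGBM parameter; XGBoost does not use it (see the warning in the training log)
         'objective' : 'binary:logistic', # learning objective
         'eval_metric' : 'auc', # evaluation metric
         'eta' : 0.01, # learning rate
         'max_depth' : 15, # maximum tree depth
         'colsample_bytree':0.8, # fraction of features sampled for each tree
         'subsample': 0.9, # fraction of training samples drawn for each tree
         'subsample_freq': 8, # LightGBM bagging frequency; XGBoost does not use it
         'alpha': 0.6, # L1 regularization
         'lambda': 0, # L2 regularization
        }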
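
The dictionary above mixes in two keys, `boosting_type` and `subsample_freq`, that come from LightGBM, which is why the training log reports that they "might not be used". A version restricted to XGBoost's own parameter names might look like the sketch below. The name `xgb_native_param` is introduced here for illustration, and mapping `boosting_type='gbdt'` to `booster='gbtree'` (XGBoost's default) is an assumption about the intent.

```python
# Keys below are all documented XGBoost parameters; values are copied from the cell above.
xgb_native_param = {
    "booster": "gbtree",            # XGBoost's counterpart of LightGBM's boosting_type
    "objective": "binary:logistic", # learning objective
    "eval_metric": "auc",           # evaluation metric
    "eta": 0.01,                    # learning rate
    "max_depth": 15,                # maximum tree depth
    "colsample_bytree": 0.8,        # fraction of features sampled for each tree
    "subsample": 0.9,               # fraction of samples drawn for each tree
    "alpha": 0.6,                   # L1 regularization
    "lambda": 0,                    # L2 regularization
}

# Sanity check: none of the LightGBM-only keys remain
lightgbm_only_keys = {"boosting_type", "subsample_freq"}
print(lightgbm_only_keys.isdisjoint(xgb_native_param))
```

Since the two dropped keys were not used by XGBoost in the first place, removing them should silence the warning without changing the trained model.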


In [28]:
train_data = xgb.DMatrix(train_x, train_y)
test_data = xgb.DMatrix(test_x, test_y)

In [31]:
model = xgb.train(param, train_data, evals=[(train_data, 'train'), (test_data, 'valid')], num_boost_round=300, early_stopping_rounds=25, verbose_eval=25)

Parameters: { boosting_type, subsample_freq } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


[0]	train-auc:0.99119	valid-auc:0.98693
Multiple eval metrics have been passed: 'valid-auc' will be used for early stopping.

Will train until valid-auc hasn't improved in 25 rounds.
[25]	train-auc:0.99873	valid-auc:0.99568
[50]	train-auc:0.99884	valid-auc:0.99614
Stopping. Best iteration:
[28]	train-auc:0.99878	valid-auc:0.99649



In [32]:
predict_y = model.predict(test_data)
print('XGBoost accuracy: %0.4f' % accuracy_score(test_y, predict_y.round()))

XGBoost accuracy: 0.9716
