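## Overview

This notebook uses PU learning (positive-unlabeled learning) to handle a dataset with very few labeled negative examples.

There are three main approaches to PU learning; this notebook uses two of them, PU bagging and the two-step method. For a detailed introduction, see the article below.

Reference article: https://roywright.me/2017/11/16/positive-unlabeled-learning/

The `baggingPU.py` module imported below comes from: https://github.com/roywright/pu_learning/blob/master/baggingPU.py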

In [1]:
import pymysql
import pandas as pd
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from collections import Counter
from sklearn.model_selection import train_test_split, StratifiedKFold, KFold
from xgboost import XGBClassifier
from baggingPU import BaggingClassifierPU
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, precision_recall_fscore_support
from sklearn.ensemble import RandomForestClassifier
from math import isnan
from numpy import nan
import pickle
import json

import warnings
warnings.filterwarnings("ignore")

## 1. Load the data

In [2]:
connection = pymysql.Connect(
    host="localhost",
    port=3306,
    user="root",
    passwd="root",
    charset="utf8",
    db="project_researchers"
)

In [3]:
def getData(connection):
    """
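    Query the data, including features and labels.
    :param connection: open pymysql connection
    :return: DataFrame with the id, feature, and label columns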
    """
    sql_select = """
        SELECT teac_id, bys_cn, hindex_cn,a_conf+a_journal as a_paper, b_conf + b_journal as b_paper,c_conf + c_journal as c_paper,papernum2017, papernum2016, papernum2015, papernum2014, papernum2013,num_journal,num_conference, project_num, degree, pagerank,degree_centrality,last_year - first_year as diff_year , coauthors_top10000, coauthors_top20000, coauthors_top30000, category, CASE WHEN label is null THEN 0 ELSE label END label
        FROM classifier_isTeacher_xgbc WHERE (label=1 or label=0 or label is null ) and category is not null
    """
    df = pd.read_sql_query(sql_select, connection)
    all_features = ['teac_id', 'bys_cn', 'hindex_cn', 'a_paper', 'b_paper', 'c_paper', 'papernum2017', 'papernum2016', 'papernum2015', 'papernum2014', 'papernum2013', 'num_journal', 'num_conference',  'degree', 'pagerank', 'degree_centrality', 'diff_year', 'coauthors_top10000', 'coauthors_top20000', 'coauthors_top30000', 'category', 'label']
    data = df[all_features]
    return data

data = getData(connection)
print("shape of data:", data.shape)
print("data.info():", data.info())

shape of data: (199751, 22)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199751 entries, 0 to 199750
Data columns (total 22 columns):
teac_id               199751 non-null int64
bys_cn                195253 non-null float64
hindex_cn             199181 non-null float64
a_paper               199751 non-null int64
b_paper               199751 non-null int64
c_paper               199751 non-null int64
papernum2017          199751 non-null int64
papernum2016          199751 non-null int64
papernum2015          199751 non-null int64
papernum2014          199751 non-null int64
papernum2013          199751 non-null int64
num_journal           199751 non-null int64
num_conference        199751 non-null int64
degree                199470 non-null float64
pagerank              199470 non-null float64
degree_centrality     199470 non-null float64
diff_year             199470 non-null float64
coauthors_top10000    199751 non-null int64
coauthors_top20000    199751 non-null int64
coauthors_top

## 2. Preprocess the data

In [4]:
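# Handle missing values
# Method 1: drop every row that has a missing field
# data = data.dropna()
# print("shape of data::", data.shape)
# print("data.info()::", data.info())
# Method 2 (used here): fill missing values in these columns with 0
columns_name_zero = ['bys_cn', 'hindex_cn', 'degree', 'pagerank', 'degree_centrality', 'diff_year']
# Assign back instead of calling fillna(inplace=True) on a column selection,
# which may not modify `data` under pandas copy-on-write
data = data.copy()
data[columns_name_zero] = data[columns_name_zero].fillna(0)
print("info of data::", data.info())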

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199751 entries, 0 to 199750
Data columns (total 22 columns):
teac_id               199751 non-null int64
bys_cn                199751 non-null float64
hindex_cn             199751 non-null float64
a_paper               199751 non-null int64
b_paper               199751 non-null int64
c_paper               199751 non-null int64
papernum2017          199751 non-null int64
papernum2016          199751 non-null int64
papernum2015          199751 non-null int64
papernum2014          199751 non-null int64
papernum2013          199751 non-null int64
num_journal           199751 non-null int64
num_conference        199751 non-null int64
degree                199751 non-null float64
pagerank              199751 non-null float64
degree_centrality     199751 non-null float64
diff_year             199751 non-null float64
coauthors_top10000    199751 non-null int64
coauthors_top20000    199751 non-null int64
coauthors_top30000    199751 non-null int

In [5]:
# Separate the continuous features, the discrete features, and y
continuous_features = ['bys_cn', 'hindex_cn', 'a_paper', 'b_paper', 'c_paper', 'papernum2017', 'papernum2016', 'papernum2015', 'papernum2014', 'papernum2013', 'num_journal', 'num_conference',  'degree', 'pagerank', 'degree_centrality', 'diff_year', 'coauthors_top10000', 'coauthors_top20000', 'coauthors_top30000']
discrete_features = ['category']
X_continous = data[continuous_features]
X_discrete = data[discrete_features]
y = data['label']
ids = data['teac_id']
print("info of X_continuous::", X_continous.info())
print("info of X_discrete::", X_discrete.info())
print("len of ids::", len(ids))
print("y::", Counter(y))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199751 entries, 0 to 199750
Data columns (total 19 columns):
bys_cn                199751 non-null float64
hindex_cn             199751 non-null float64
a_paper               199751 non-null int64
b_paper               199751 non-null int64
c_paper               199751 non-null int64
papernum2017          199751 non-null int64
papernum2016          199751 non-null int64
papernum2015          199751 non-null int64
papernum2014          199751 non-null int64
papernum2013          199751 non-null int64
num_journal           199751 non-null int64
num_conference        199751 non-null int64
degree                199751 non-null float64
pagerank              199751 non-null float64
degree_centrality     199751 non-null float64
diff_year             199751 non-null float64
coauthors_top10000    199751 non-null int64
coauthors_top20000    199751 non-null int64
coauthors_top30000    199751 non-null int64
dtypes: float64(6), int64(13)
memory usag

In [6]:
# One-hot encode the discrete feature
# (scikit-learn >= 1.2 renames this argument to `sparse_output`)
X_discrete_oneHot = OneHotEncoder(sparse=False).fit_transform(X_discrete)
print(X_discrete_oneHot)

X_all = np.hstack((X_continous, X_discrete_oneHot))
print("shape of X_all::", X_all.shape)

[[0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]
 ...
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 1. 0.]]
shape of X_all:: (199751, 22)


## 3. Get the ground-truth labels

In [7]:
def getGroundTruth(connection):
    """
    Fetch the ground-truth labels. Training used the first 17605 positives, so validation uses
    positives whose id comes after 17605, plus the labeled negatives.
    """
    sql_select_zeros = "SELECT teac_id, label FROM classifier_isTeacher_label WHERE label = 0 and category is not null"
    sql_select_ones = "SELECT teac_id, label FROM classifier_isTeacher_label WHERE label = 1 and teac_id>17606 and category is not null ORDER BY xmpy LIMIT 10, 1700"
    cursor = connection.cursor()
    ids = []
    labels =[]
    cursor.execute(sql_select_zeros)
    results1 = cursor.fetchall() 
    for elem in results1:
        ids.append(elem[0])
        labels.append(elem[1])
    cursor.execute(sql_select_ones)
    results2 = cursor.fetchall() 
    for elem in results2:
        ids.append(elem[0])
        labels.append(elem[1])
    return ids, labels

label_ids, label_labels = getGroundTruth(connection)
print("len of label_ids::", len(label_ids))
print("len of label_labels::", len(label_labels))

len of label_ids:: 3409
len of label_labels:: 3409


## 4. Evaluation function

In [8]:
def getEvaluaton(ids, predict, label_ids, label_labels):
    """
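    Evaluate the predictions against the ground truth.
    ids: ids of all people
    predict: predicted scores for all people
    label_ids: ids of the ground-truth records used for evaluation
    label_labels: labels of the ground-truth records used for evaluation
    """
    predict_labels = []    # predicted label for each id in the ground truth
    res_map = {}           # id -> thresholded prediction
    predict_lis = []    # thresholded prediction for every id, in input order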
    for i in range(0, len(ids)):
        if isnan(predict[i]):
            res_map[ids[i]] = 1
            predict_lis.append(1)
        else:
            if predict[i] > 0.6:   # threshold set to 0.6
                res_map[ids[i]] = 1
                predict_lis.append(1)
            else:
                res_map[ids[i]] = 0
                predict_lis.append(0)
    consistent = 0
    consistent_zeros = 0
    for i in range(0, len(label_ids)):
        predict_label = res_map.get(label_ids[i])
        predict_labels.append(predict_label)
        if predict_label== label_labels[i]:
            consistent += 1
            if label_labels[i] == 0:
                consistent_zeros += 1
#     accuracy = consistent / len(label_ids)
    print("consistent::%d" % consistent)
    print("correct of consistent_zeros::%d" % consistent_zeros)
    return consistent, consistent_zeros, predict_labels, predict_lis
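
The thresholding and scoring logic of `getEvaluaton` can also be written in vectorized form. The following is a minimal sketch, not the notebook's implementation: the helper name `evaluate_scores` and the toy ids and scores are hypothetical. It keeps the 0.6 threshold and the rule that NaN counts as positive, and it passes the ground truth as the first argument of `classification_report`, which scikit-learn defines as `(y_true, y_pred)`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report


def evaluate_scores(ids, scores, label_ids, label_labels, threshold=0.6):
    """Threshold the scores, then compare them with the ground truth (hypothetical helper)."""
    scores = np.asarray(scores, dtype=float)
    # NaN scores count as positive, mirroring getEvaluaton.
    predicted = pd.Series(
        (np.isnan(scores) | (scores > threshold)).astype(int),
        index=pd.Index(ids),
    )
    # Look up the prediction for every ground-truth id (raises KeyError if an id is missing).
    y_pred = predicted.loc[list(label_ids)].to_numpy()
    y_true = np.asarray(label_labels)
    accuracy = float((y_true == y_pred).mean())
    # Ground truth first, predictions second.
    report = classification_report(y_true, y_pred, output_dict=True)
    return accuracy, report


# Toy data, made up for illustration.
accuracy, report = evaluate_scores(
    ids=[10, 11, 12, 13, 14],
    scores=[0.9, 0.2, float("nan"), 0.7, 0.1],
    label_ids=[11, 12, 13, 14],
    label_labels=[0, 1, 0, 0],
)
print(accuracy)
print(report["1"])
```

With the arguments in this order, the `support` column of the report counts ground-truth rows per class rather than predicted rows.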

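## 5. PU learning

### 5.1 PU bagging

PU bagging borrows the idea of bagging. The steps are:

(1) Sample unlabeled data, as many rows as there are positives, and treat the sample as negatives.

(2) Train a classifier on the positives and these pseudo-negatives, then predict labels for the data outside this training set.

(3) Repeat many times and average the predictions.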
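
The three steps above can be sketched in a few lines. This is a minimal illustration, not the `BaggingClassifierPU` implementation: the function name `pu_bagging_scores`, the `n_rounds` parameter, and the synthetic two-cluster data are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def pu_bagging_scores(X, y, n_rounds=20, seed=0):
    """Average out-of-bag positive-class scores (hypothetical helper).

    X: (n_samples, n_features) array; y: 1 = labeled positive, 0 = unlabeled.
    """
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    unl_idx = np.flatnonzero(y == 0)
    score_sum = np.zeros(len(y))
    score_cnt = np.zeros(len(y))

    for _ in range(n_rounds):
        # (1) Draw as many unlabeled rows as there are positives; treat them as negatives.
        bag = rng.choice(unl_idx, size=len(pos_idx), replace=True)
        train_idx = np.concatenate([pos_idx, bag])
        train_y = np.concatenate([np.ones(len(pos_idx)), np.zeros(len(bag))])

        # (2) Fit on positives + pseudo-negatives, then score the unlabeled rows left out.
        clf = DecisionTreeClassifier(random_state=0).fit(X[train_idx], train_y)
        oob = np.setdiff1d(unl_idx, bag)
        score_sum[oob] += clf.predict_proba(X[oob])[:, 1]
        score_cnt[oob] += 1

    # (3) Average over the rounds in which each row was out of bag (0/0 gives NaN).
    with np.errstate(invalid="ignore", divide="ignore"):
        return score_sum / score_cnt


# Synthetic data: 60 true positives, of which only the first 30 are labeled.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(3, 1, (60, 2)), rng.normal(-3, 1, (140, 2))])
y = np.zeros(200, dtype=int)
y[:30] = 1
scores = pu_bagging_scores(X, y)
print(np.nanmean(scores[30:60]), np.nanmean(scores[60:]))
```

Rows that are never left out of a bag (here, the labeled positives) have no out-of-bag score and come back as NaN, which is why `getEvaluaton` needs an explicit rule for NaN values.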

In [9]:
# Observation: with RFE and n_features_to_select=15, the F1 score peaks at 0.96038, the best result among ANOVA, RFE, and RFECV.
def trainAndTestXGBCrfePuBagging(X_all, y, ids, label_ids, label_labels, n_features_to_select=15):
    
    # RFE (recursive feature elimination) with an XGBoost estimator
    estimator = XGBClassifier()
    selector = RFE(estimator=estimator, n_features_to_select=n_features_to_select)
    X_all_rfe = selector.fit_transform(X_all, y) 
    print("Optimal number of features::%d" % selector.n_features_)
    print("Ranking of features:: %s" % selector.ranking_)
    
    selected_idx = np.where(pd.Series(selector.support_)==True)[0]   # indices of the n_features_to_select chosen features (support_ is True for each)
    print("selector.support_::", selector.support_)
    
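    # The wrapper searches the feature subset over discrete and continuous features together,
    # but the discrete features should not be standardized, so they are separated here.
    # Caution: this set difference yields the discrete columns that were NOT selected;
    # the selected ones would be set([19, 20, 21]) & set(selected_idx).
    discrete_idx = list(set([19, 20, 21]) - set(selected_idx))   # the last 3 columns are discrete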
    X_continuous_tmp = pd.DataFrame(X_all_rfe)[list(range(0, len(selected_idx)-len(discrete_idx)))]
    X_discreate_tmp = pd.DataFrame(X_all_rfe)[list(range(len(selected_idx)-len(discrete_idx), len(selected_idx)))]
    
    # Standardize
    ss = StandardScaler()
    X_continuous_new = ss.fit_transform(X_continuous_tmp)
    print("type of X_continuous_new::", type(X_continuous_new))
    print("shape of X_continuous_new::", X_continuous_new.shape)

    # Concatenate the continuous and discrete features
    X_all_new = np.hstack((X_continuous_new, X_discreate_tmp))
    print("shape of X_all::", X_all.shape)
    
    y_origin = y.copy()

    bc = BaggingClassifierPU(
        DecisionTreeClassifier(),
        n_estimators=1,  # a single tree in this run, rather than the usual 1000
        max_samples=sum(y),  # Balance the positives and unlabeled in each bag
    )
    bc.fit(X_all_new, y)
    
    rf = RandomForestClassifier(
        n_estimators = 1,  # a single tree in this run, rather than 1000
    )
    rf.fit(X_all_new, y)
    
    # Store the scores assigned by this approach
    results = pd.DataFrame({
        'truth'      : y_origin,   # The true labels
        'label'      : y,        # The labels to be shown to models in experiment
        'output_std' : rf.predict_proba(X_all_new)[:,1]   # The random forest's scores
    }, columns = ['truth', 'label', 'output_std'])

    results['output_skb'] = bc.oob_decision_function_[:, 1]
    print(bc.oob_decision_function_[:, 1])
    res = bc.oob_decision_function_[:, 1]
    count_Nan = 0
    count_one = 0
    for i in range(0, len(res)):
        if isnan(res[i]):
            count_Nan += 1
            count_one += 1
        if res[i] > 0.6:
            count_one += 1
    print("Number of NaN scores::%d" % count_Nan)
    print("Number of elements predicted as 1::%d" % count_one)
    print("Total length::%d" % len(res))
    consistent, consistent_zeros, predict_labels, predict_lis = getEvaluaton(ids, res, label_ids, label_labels)
    print("consistent::%d" %consistent)
    print("correct of consistent_zeros::%d" % consistent_zeros)
    if len(predict_labels) != len(label_labels):
        print("Ground truth and predictions differ in length")
#     print("predict_labels::", predict_labels)
#     print("type of predict_labels::", type(predict_labels))
#     print("label_labels::", label_labels)
#     print("type of label_labels::", type(label_labels))
    score = len(np.where((pd.Series(predict_labels) == pd.Series(label_labels)) == True)[0])/len(predict_labels)
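    print("Accuracy:", score)
    # Note: classification_report expects (y_true, y_pred); the arguments are swapped here,
    # so precision and recall in the printed report are interchanged.
    print(classification_report(predict_labels, label_labels))   # target_names=['1', '0']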

# Call the prediction function
X_all_copy = X_all.copy()
y_copy = y.copy()
trainAndTestXGBCrfePuBagging(X_all_copy, y_copy, ids, label_ids, label_labels)

Optimal number of features::15
Ranking of features:: [1 1 4 1 1 3 1 1 1 1 1 1 1 1 7 1 5 6 1 1 8 2]
selector.support_:: [ True  True False  True  True False  True  True  True  True  True  True
  True  True False  True False False  True  True False False]
type of X_continuous_new:: <class 'numpy.ndarray'>
shape of X_continuous_new:: (199751, 13)
shape of X_all:: (199751, 22)
[nan nan nan ...  1.  0. nan]
Number of NaN scores::33191
Number of elements predicted as 1::74227
Total length::199751
consistent::3101
correct of consistent_zeros::1401
consistent::3101
correct of consistent_zeros::1401
Accuracy: 0.9096509240246407
              precision    recall  f1-score   support

           0       0.82      1.00      0.90      1401
           1       1.00      0.85      0.92      2008

   micro avg       0.91      0.91      0.91      3409
   macro avg       0.91      0.92      0.91      3409
weighted avg       0.93      0.91      0.91      3409
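
The column split in `trainAndTestXGBCrfePuBagging` depends on which of the selected columns are one-hot columns. In the recorded `selector.support_`, 14 of the 15 selected columns are continuous, while the cell reports 13 standardized columns. A minimal sketch of the split based on a boolean mask follows; the helper name `scale_selected_continuous` and the toy matrix are hypothetical, and this is one possible way to write the step rather than the notebook's code.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def scale_selected_continuous(X_selected, support, discrete_cols):
    """Standardize only the continuous columns of a feature-selected matrix.

    X_selected: the columns kept by the selector, in their original order.
    support: boolean mask over the original columns (like selector.support_).
    discrete_cols: original indices of the one-hot columns.
    """
    selected_idx = np.flatnonzero(support)
    # True where a kept column is one of the one-hot (discrete) columns.
    is_discrete = np.isin(selected_idx, discrete_cols)
    X_out = np.asarray(X_selected, dtype=float).copy()
    X_out[:, ~is_discrete] = StandardScaler().fit_transform(X_out[:, ~is_discrete])
    return X_out, is_discrete


# Toy matrix: columns 0-1 are continuous, columns 2-3 are one-hot.
X = np.array([[1.0, 10.0, 0.0, 1.0],
              [2.0, 20.0, 1.0, 0.0],
              [3.0, 30.0, 0.0, 1.0]])
support = np.array([True, False, True, True])   # pretend the selector dropped column 1
X_selected = X[:, support]
X_out, is_discrete = scale_selected_continuous(X_selected, support, discrete_cols=[2, 3])
print(is_discrete)
print(X_out)
```

Because the mask is computed per selected column, the one-hot columns stay untouched wherever they land after selection.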



### 5.2 Two-step

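The two-step method works as follows:

(1) Treat all unlabeled data as negatives and train a classifier on them together with all positives. Use it to identify reliable negatives among the unlabeled samples and treat those as true negatives.

(2) Train a classifier on the positives and the reliable negatives from step 1, select reliable negatives again, and keep iterating (10 iterations in this experiment).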
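
The two steps above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: the names `fit_and_score` and `two_step_pu` and the synthetic data are hypothetical. It also makes two choices of its own: rows that are still unlabeled are marked with -1 so that reliable negatives (0) remain distinguishable from them, and training rows are scored with the forest's out-of-bag estimates, because in-sample forest scores tend to reproduce the training labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def fit_and_score(X, labels, train_mask, seed=0):
    """Fit a forest on the rows in train_mask and score every row.

    Training rows get out-of-bag scores; all other rows get ordinary predictions.
    """
    clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=seed)
    clf.fit(X[train_mask], labels[train_mask])
    scores = clf.predict_proba(X)[:, 1]
    scores[train_mask] = clf.oob_decision_function_[:, 1]
    return scores


def two_step_pu(X, y, n_iter=10, seed=0):
    """y: 1 = labeled positive, 0 = unlabeled. Returns (scores, final label state)."""
    state = np.where(y == 1, 1, -1)   # -1 = still unlabeled
    # Step 1: treat every unlabeled row as a negative.
    everyone = np.ones(len(y), dtype=bool)
    pred = fit_and_score(X, (state == 1).astype(int), everyone, seed)

    for _ in range(n_iter):
        # Score range observed on the known positives only.
        lo, hi = pred[state == 1].min(), pred[state == 1].max()
        new_neg = (state == -1) & (pred < lo)   # reliable negatives
        new_pos = (state == -1) & (pred > hi)   # reliable positives
        state[new_neg] = 0
        state[new_pos] = 1
        # Stop when nothing new was labeled or no negative is available for training.
        if not (new_neg.any() or new_pos.any()) or not (state == 0).any():
            break
        # Step 2: retrain on the positives and reliable negatives only.
        pred = fit_and_score(X, state, state >= 0, seed)
    return pred, state


# Synthetic data: 60 true positives, of which only the first 30 are labeled.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(3, 1, (60, 2)), rng.normal(-3, 1, (140, 2))])
y = np.zeros(200, dtype=int)
y[:30] = 1
pred, state = two_step_pu(X, y)
print(pred[30:60].mean(), pred[60:].mean())
```

Keeping a separate marker for unlabeled rows is what lets the loop terminate: once no unlabeled row falls outside the positives' score range, no new labels are produced and iteration stops.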

In [10]:
# Observation: with RFE and n_features_to_select=15, the F1 score peaks at 0.96038, the best result among ANOVA, RFE, and RFECV.
def trainAndTestXGBCrfePuTwoStep(X_all, y, ids, label_ids, label_labels, n_features_to_select=15):
    
    # RFE (recursive feature elimination) with an XGBoost estimator
    estimator = XGBClassifier()
    selector = RFE(estimator=estimator, n_features_to_select=n_features_to_select)
    X_all_rfe = selector.fit_transform(X_all, y) 
    print("Optimal number of features::%d" % selector.n_features_)
    print("Ranking of features:: %s" % selector.ranking_)
    
    selected_idx = np.where(pd.Series(selector.support_)==True)[0]   # indices of the n_features_to_select chosen features (support_ is True for each)
    print("selector.support_::", selector.support_)
    
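    # The wrapper searches the feature subset over discrete and continuous features together,
    # but the discrete features should not be standardized, so they are separated here.
    # Caution: this set difference yields the discrete columns that were NOT selected;
    # the selected ones would be set([19, 20, 21]) & set(selected_idx).
    discrete_idx = list(set([19, 20, 21]) - set(selected_idx))   # the last 3 columns are discrete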
    X_continuous_tmp = pd.DataFrame(X_all_rfe)[list(range(0, len(selected_idx)-len(discrete_idx)))]
    X_discreate_tmp = pd.DataFrame(X_all_rfe)[list(range(len(selected_idx)-len(discrete_idx), len(selected_idx)))]
    
    # Standardize
    ss = StandardScaler()
    X_continuous_new = ss.fit_transform(X_continuous_tmp)
    print("type of X_continuous_new::", type(X_continuous_new))
    print("shape of X_continuous_new::", X_continuous_new.shape)

    # Concatenate the continuous and discrete features
    X_all_new = np.hstack((X_continuous_new, X_discreate_tmp))
    print("shape of X_all::", X_all.shape)
    
    # Get the scores from RandomForestClassifier
    rf = RandomForestClassifier(n_estimators = 10)   # 10 trees in this run, rather than 1000
    rf.fit(X_all_new, y)
    pred = rf.predict_proba(X_all_new)[:,1]
    
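    # Find the range of scores given to positive data points
    # (the mask multiplication sets every non-positive row to 0, so the minimum is 0 whenever such rows exist)
    range_P = [min(pred * (y == 1)), max(pred * (y == 1))]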

    # STEP 1
    # If any unlabeled point has a score above all known positives, 
    # or below all known positives, label it accordingly
    iP_new = y[(y == 0) & (pred >= range_P[1])].index
    iN_new = y[(y == 0) & (pred <= range_P[0])].index
    y.loc[iP_new] = 1
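    # Unlabeled rows and reliable negatives share the label 0 here, so this assignment leaves y unchanged
    y.loc[iN_new] = 0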
    
    
    # Classifier to be used for step 2
    rf2 = RandomForestClassifier(n_estimators = 10)

    # Limit to 10 iterations (this is arbitrary, but 
    # otherwise this approach can take a very long time)
    for i in range(10):
        # If step 1 didn't find new labels, we're done
        if len(iP_new) + len(iN_new) == 0 and i > 0:
            break

        print('Step 1 labeled %d new positives and %d new negatives.' % (len(iP_new), len(iN_new)))
        print('Doing step 2... ', end = '')

        # STEP 2
        # Retrain on new labels and get new scores
        rf2.fit(X_all_new, y)
        pred = rf2.predict_proba(X_all_new)[:,-1]

        # Find the range of scores given to positive data points
        range_P = [min(pred * (y == 1)), max(pred * (y == 1))]

        # Repeat step 1
        iP_new = y[(y == 0) & (pred >= range_P[1])].index
        iN_new = y[(y == 0) & (pred <= range_P[0])].index
        y.loc[iP_new] = 1
        y.loc[iN_new] = 0


    # Lastly, get the scores assigned by this approach    
    print("pred::", pred)
    res = pred
    count_Nan = 0
    count_one = 0
    for i in range(0, len(res)):
        if isnan(res[i]):
            count_Nan += 1
            count_one += 1
        if res[i] > 0.6:
            count_one += 1
    print("Number of NaN scores::%d" % count_Nan)
    print("Number of elements predicted as 1::%d" % count_one)
    print("Total length::%d" % len(res))
    consistent, consistent_zeros, predict_labels, predict_lis = getEvaluaton(ids, res, label_ids, label_labels)
    print("consistent::%d" %consistent)
    print("correct of consistent_zeros::%d" % consistent_zeros)
    if len(predict_labels) != len(label_labels):
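        print("Ground truth and predictions differ in length")
    # Compare element-wise via pandas Series; comparing two Python lists with == yields a single bool
    score = len(np.where((pd.Series(predict_labels) == pd.Series(label_labels)) == True)[0])/len(predict_labels)
    print("Accuracy:", score)
    # Note: classification_report expects (y_true, y_pred); the arguments are swapped here,
    # so precision and recall in the printed report are interchanged.
    print(classification_report(predict_labels, label_labels))   # target_names=['1', '0']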

# Call the prediction function
X_all_copy = X_all.copy()
y_copy = y.copy()
trainAndTestXGBCrfePuTwoStep(X_all_copy, y_copy, ids, label_ids, label_labels)

Optimal number of features::15
Ranking of features:: [1 1 4 1 1 3 1 1 1 1 1 1 1 1 7 1 5 6 1 1 8 2]
selector.support_:: [ True  True False  True  True False  True  True  True  True  True  True
  True  True False  True False False  True  True False False]
type of X_continuous_new:: <class 'numpy.ndarray'>
shape of X_continuous_new:: (199751, 13)
shape of X_all:: (199751, 22)
Step 1 labeled 0 new positives and 139733 new negatives.
Doing step 2... Step 1 labeled 0 new positives and 139480 new negatives.
Doing step 2... Step 1 labeled 0 new positives and 139613 new negatives.
Doing step 2... Step 1 labeled 0 new positives and 139752 new negatives.
Doing step 2... Step 1 labeled 0 new positives and 139635 new negatives.
Doing step 2... Step 1 labeled 0 new positives and 139850 new negatives.
Doing step 2... Step 1 labeled 0 new positives and 139980 new negatives.
Doing step 2... Step 1 labeled 0 new positives and 139638 new negatives.
Doing step 2... Step 1 labeled 0 new positives and 13982

### Analysis

The results above show that PU bagging performs slightly better than the two-step method.