# 方案概述
比赛链接：[系统访问风险识别](https://www.datafountain.cn/competitions/580/datasets)

## 赛题介绍
### 赛题背景

随着国家、企业对安全和效率越来越重视，作为安全基础设施之一——统一身份管理(IAM，Identity and Access Management)系统也得到越来越多的关注。 在IAM领域中，其主要安全防护手段是身份鉴别，身份鉴别主要包括账密验证、扫码验证、短信验证、人脸识别及指纹验证等方式。这些身份鉴别方式一般可分为三类，即用户所知(如口令)、所有(如身份证)、特征(如人脸识别及指纹验证)。这些鉴别方式都有其各自的缺点——比如口令，强度高了不容易记住，强度低了又容易丢；又比如人脸识别，做活体验证用户体验不好，静默检测又容易被照片、视频、人脸模型绕过。也因此，在等保2.0中对于三级以上系统要求必须使用两种及以上的鉴别方式对用户进行身份鉴别，以提高身份鉴别的可信度，这种鉴别方式也被称为双因素认证。

对用户来说，双因素认证在一定程度上提高了安全性，但也极大地降低了用户体验。也因此，IAM厂商开始参考用户实体行为分析(UEBA,User and Entity Behavior Analytics)、用户画像等行为分析技术，来探索一种既能确保用户体验，又能提高身份鉴别可信度的方法。而在当前IAM的探索过程中，目前最容易落地的方法是基于规则的行为分析技术，因为它可理解性较高，且容易与其它身份鉴别方式进行联动。
但基于规则的行为分析技术局限性也很明显，首先这种技术是基于经验的，有“宁错杀一千，不放过一个”的特点，其次它也缺少从数据层面来证明是否有人正在尝试窃取或验证非法获取的身份信息，又或者正在使用窃取的身份信息。鉴于此，我们举办这次竞赛，希望各个参赛团队利用竞赛数据和行业知识，建立机器学习、人工智能或数据挖掘模型，来弥补传统方法的缺点，从而解决这一行业难题。

### 赛题任务

本赛题中，参赛团队将基于用户历史的系统访问日志及是否存在风险标记等数据，结合行业知识，构建必要的特征工程，建立机器学习、人工智能或数据挖掘模型，并用该模型预测将来的系统访问是否存在风险。

### 数据简介
本赛题数据是从竹云日志库中抽取某公司一定比例的员工从2022年1月到6月的系统访问日志数据，主要涉及认证日志与风险日志数据。部分字段经过一一对应脱敏处理，供参赛队伍使用。其中认证日志是用户在访问应用系统时产生的行为数据，包括用户名、认证时间、认证城市、接入系统、访问URL等关键信息。

### 数据说明
• 文档说明

| 名称                    | 说明     |
|-----------------------|--------|
| train.csv             | 训练集数据  |
| evaluation_public.csv | 测试集数据  |
| submit_sample.csv     | 提交样例数据 |

• 变量含义说明

| 变量名称                 | 业务含义       | 说明                            |
|----------------------|------------|-------------------------------|
| id                   | 样本ID       |                               |
| user_name            | 用户名        | 若该变量为空，则说明该条日志为用户登录系统前产生      |
| department           | 用户所在部门     |                               |
| ip_transform         | 认证IP(加密后)  | 真实认证IP与加密字符一一对应脱敏处理           |
| device_num_transform | 认证设备号(加密后) | 真实认证设备号与加密字符一一对应脱敏处理          |
| browser_version      | 浏览器版本      |                               |
| browser              | 浏览器        |                               |
| os_type              | 操作系统类型     |                               |
| os_version           | 操作系统版本     |                               |
| op_datetime          | 认证日期时间     |                               |
| ip_type              | IP类型       |                               |
| http_status_code     | HTTP类型码    |                               |
| op_city              | 认证城市       |                               |
| log_system_transform | 接入系统(加密后)  | 真实接入系统与加密字符一一对应脱敏处理           |
| url                  | 访问URL      |                               |
| op_month             | 认证月份       |                               |
| is_risk              | 是否存在风险     | 1：有风险；0：无风险。仅train.csv数据包含该字段 |



In [1]:
import os
import re
import numpy as np
import pandas as pd
import datetime

import matplotlib.pyplot as plt
import seaborn as sns


import scorpyo as sp
from null_importance import get_null_importance
from time_sequence_feats import get_time_base, get_sequence_statis, get_sequence_groupby_statis

from gensim.models import word2vec

import warnings
warnings.filterwarnings("ignore")

pd.set_option('max_rows', 320, 'max_columns',100)

In [2]:
path_project = r'/Users/liliangshan/workspace/python/01_datasets/ccf_system_access_risk_identification'

# path dir
path_row_data = os.path.join(path_project, 'row_data')
path_new_data = os.path.join(path_project, 'new_data')
path_results  = os.path.join(path_project, 'results')
path_results_jupyter  = os.path.join(path_results, 'jupyter')

# path row_data
path_train = os.path.join(path_row_data, 'train.csv')
path_test  = os.path.join(path_row_data, 'evaluation_public.csv')
path_sample_submission = os.path.join(path_row_data, 'submit_example.csv')


path_new_train = os.path.join(path_new_data, 'train_lightgbm_20221014.csv')
path_new_test  = os.path.join(path_new_data, 'test_lightgbm_20221014.csv')

## results
path_output_report = os.path.join(path_results, '01_原始数据探察_20221014.xlsx')

y_label = "is_risk"

In [3]:
df_row_train = pd.read_csv(path_train)
df_row_val  = pd.read_csv(path_test)

# 方案一-时间序列特征

## 时间序列特征
参考自：[时序特征挖掘的奇技淫巧](https://www.6aiq.com/article/1594474995881)

In [4]:
df = pd.concat([df_row_train, df_row_val]).reset_index(drop=True)
df = df.sort_values(by='op_datetime')

# 按客户进行统计这次认证和上次认证的时间差
df['op_second'] = pd.to_datetime(df['op_datetime'])
df['op_second1'] = df.groupby('device_num_transform')['op_second'].shift(1)
df['op_second2'] = df.groupby('device_num_transform')['op_second'].shift(2)
df['op_diff_second1'] = (df['op_second'] - df['op_second1']).map(lambda x: x.total_seconds())
df['op_diff_second2'] = (df['op_second'] - df['op_second2']).map(lambda x: x.total_seconds())

# 系统层面的一段时间
df['system_op_second'] = pd.to_datetime(df['op_datetime'])
df['system_op_second1'] = df['system_op_second'].shift(1)
df['system_op_second2'] = df['system_op_second'].shift(2)
df['system_op_diff_second1'] = (df['system_op_second'] - df['system_op_second1']).map(lambda x: x.total_seconds())
df['system_op_diff_second2'] = (df['system_op_second'] - df['system_op_second2']).map(lambda x: x.total_seconds())


df = df.drop(columns=['op_second','op_second1', 'op_second2'
                      'system_op_second', 'system_op_second1','system_op_second2'])
df.head()

Unnamed: 0,id,user_name,department,ip_transform,device_num_transform,browser_version,browser,os_type,os_version,op_datetime,ip_type,http_status_code,op_city,log_system_transform,url,op_month,is_risk,op_diff_second1,system_op_diff_second1
44477,44477,xiongkai3397,rd,6H1iPLgBB,GCgxrFb69up7,chrome_93,chrome,win,win10,2022-01-07 02:44:29,内网,200,深圳,nHrKgKdJ1Mzt,xxx.com/github,2022-01,1.0,,
45489,45489,zhengguiying7117,rd,0mjaEf4SB,8ftsXFm5I1Ej,safari_13,safari,macos,macos_big_sur_11,2022-01-07 02:54:32,内网,200,成都,nHrKgKdJ1Mzt,xxx.com/github,2022-01,1.0,,603.0
45706,45706,yuanjun5870,hr,1Vk2kEa4X,W1Cstajd8x1s,firefox_78,firefox,win,win7,2022-01-07 03:00:56,内网,200,深圳,a5G25puBl9xj,hr.xxx.com/,2022-01,1.0,,384.0
45901,45901,zhoutingting3694,rd,4Wj6uxLx3,H8NAVsdws95G,edge_93,edge,win,win10,2022-01-07 04:29:34,内网,200,杭州,nHrKgKdJ1Mzt,xxx.com/github,2022-01,1.0,,5318.0
43827,43827,yanglin6562,sales,eK12oQmm8,GnkVqPSy5nnl,ie_9,ie,win,win10,2022-01-07 05:17:44,内网,200,重庆,sW0whYIx8LFM,work.xxx.com/task,2022-01,1.0,,2890.0


In [5]:
cate_cols = ['ip_transform', 'user_name', 'device_num_transform', 'department', 'browser_version', 'browser', 'os_type','os_version',
              'ip_type','http_status_code', 'op_city', 'log_system_transform', 'url']

df = get_time_base(df, cols='op_datetime')
df = get_sequence_statis(df, col='system_op_diff_second1', n=5, freq=3 )
df = get_sequence_groupby_statis(df, col='system_op_diff_second1',cate_cols= cate_cols, n=5, freq=3)
df = get_sequence_statis(df, col='system_op_diff_second2', n=5, freq=3 )
df = get_sequence_groupby_statis(df, col='system_op_diff_second2',cate_cols= cate_cols, n=5, freq=3)
df.head()

Unnamed: 0,id,user_name,department,ip_transform,device_num_transform,browser_version,browser,os_type,os_version,op_datetime,ip_type,http_status_code,op_city,log_system_transform,url,op_month,is_risk,op_diff_second1,system_op_diff_second1,op_datetime_year,op_datetime_month,op_datetime_day,op_datetime_hour,op_datetime_minute,op_datetime_second,op_datetime_quarter,op_datetime_dayofweek,op_datetime_is_year_start,op_datetime_is_month_start,op_datetime_is_month_end,op_datetime_second_sin,op_datetime_second_cos,op_datetime_minute_sin,op_datetime_minute_cos,op_datetime_hour_sin,op_datetime_hour_cos,op_datetime_day_sin,op_datetime_day_cos,op_datetime_dayofweek_sin,op_datetime_dayofweek_cos,op_datetime_month_sin,op_datetime_month_cos,avg_3_system_op_diff_second1,median_3_system_op_diff_second1,max_3_system_op_diff_second1,min_3_system_op_diff_second1,std_3_system_op_diff_second1,skew_3_system_op_diff_second1,kurt_3_system_op_diff_second1,avg_6_system_op_diff_second1,...,kurt_log_system_transform_system_op_diff_second1_9,avg_log_system_transform_system_op_diff_second1_12,median_log_system_transform_system_op_diff_second1_12,max_log_system_transform_system_op_diff_second1_12,min_log_system_transform_system_op_diff_second1_12,std_log_system_transform_system_op_diff_second1_12,skew_log_system_transform_system_op_diff_second1_12,kurt_log_system_transform_system_op_diff_second1_12,avg_log_system_transform_system_op_diff_second1_15,median_log_system_transform_system_op_diff_second1_15,max_log_system_transform_system_op_diff_second1_15,min_log_system_transform_system_op_diff_second1_15,std_log_system_transform_system_op_diff_second1_15,skew_log_system_transform_system_op_diff_second1_15,kurt_log_system_transform_system_op_diff_second1_15,avg_url_system_op_diff_second1_3,median_url_system_op_diff_second1_3,max_url_system_op_diff_second1_3,min_url_system_op_diff_second1_3,std_url_system_op_diff_second1_3,skew_url_system_op_diff_second1_3,kurt_url_system_op_diff_second1_3,avg_url_system_op_diff_second1_6,median_url_system_op_diff_second1_6,max_url_system_op_diff_second1_6,min_url_system_op_diff_second1_6,std_url_system_op_diff_second1_6,skew_url_system_op_diff_second1_6,kurt_url_system_op_diff_second1_6,avg_url_system_op_diff_second1_9,median_url_system_op_diff_second1_9,max_url_system_op_diff_second1_9,min_url_system_op_diff_second1_9,std_url_system_op_diff_second1_9,skew_url_system_op_diff_second1_9,kurt_url_system_op_diff_second1_9,avg_url_system_op_diff_second1_12,median_url_system_op_diff_second1_12,max_url_system_op_diff_second1_12,min_url_system_op_diff_second1_12,std_url_system_op_diff_second1_12,skew_url_system_op_diff_second1_12,kurt_url_system_op_diff_second1_12,avg_url_system_op_diff_second1_15,median_url_system_op_diff_second1_15,max_url_system_op_diff_second1_15,min_url_system_op_diff_second1_15,std_url_system_op_diff_second1_15,skew_url_system_op_diff_second1_15,kurt_url_system_op_diff_second1_15
44477,44477,xiongkai3397,rd,6H1iPLgBB,GCgxrFb69up7,chrome_93,chrome,win,win10,2022-01-07 02:44:29,内网,200,深圳,nHrKgKdJ1Mzt,xxx.com/github,2022-01,1.0,,,2022,1,7,2,44,29,1,4,False,False,False,0.104528,-0.994522,-0.994522,-0.104528,0.5,0.866025,0.988468,0.151428,-0.433884,-0.900969,0.5,0.866025,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
45489,45489,zhengguiying7117,rd,0mjaEf4SB,8ftsXFm5I1Ej,safari_13,safari,macos,macos_big_sur_11,2022-01-07 02:54:32,内网,200,成都,nHrKgKdJ1Mzt,xxx.com/github,2022-01,1.0,,603.0,2022,1,7,2,54,32,1,4,False,False,False,-0.207912,-0.978148,-0.587785,0.809017,0.5,0.866025,0.988468,0.151428,-0.433884,-0.900969,0.5,0.866025,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
45706,45706,yuanjun5870,hr,1Vk2kEa4X,W1Cstajd8x1s,firefox_78,firefox,win,win7,2022-01-07 03:00:56,内网,200,深圳,a5G25puBl9xj,hr.xxx.com/,2022-01,1.0,,384.0,2022,1,7,3,0,56,1,4,False,False,False,-0.406737,0.913545,0.0,1.0,0.707107,0.707107,0.988468,0.151428,-0.433884,-0.900969,0.5,0.866025,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
45901,45901,zhoutingting3694,rd,4Wj6uxLx3,H8NAVsdws95G,edge_93,edge,win,win10,2022-01-07 04:29:34,内网,200,杭州,nHrKgKdJ1Mzt,xxx.com/github,2022-01,1.0,,5318.0,2022,1,7,4,29,34,1,4,False,False,False,-0.406737,-0.913545,0.104528,-0.994522,0.866025,0.5,0.988468,0.151428,-0.433884,-0.900969,0.5,0.866025,2101.666667,603.0,5318.0,384.0,2787.577861,1.720032,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
43827,43827,yanglin6562,sales,eK12oQmm8,GnkVqPSy5nnl,ie_9,ie,win,win10,2022-01-07 05:17:44,内网,200,重庆,sW0whYIx8LFM,work.xxx.com/task,2022-01,1.0,,2890.0,2022,1,7,5,17,44,1,4,False,False,False,-0.994522,-0.104528,0.978148,-0.207912,0.965926,0.258819,0.988468,0.151428,-0.433884,-0.900969,0.5,0.866025,2864.0,2890.0,5318.0,384.0,2467.102754,-0.047419,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


## 特征筛选

In [6]:
remove_cols = ['ip_transform', 'user_name', 'device_num_transform', 'department', 'browser_version', 'browser', 'os_type','os_version',
              'ip_type','http_status_code', 'op_city', 'log_system_transform', 'url', 'op_datetime', 'op_month']

df = df.drop(columns=remove_cols)

In [7]:
df_row_train = df[df[y_label].notna()].reset_index(drop=True)
df_row_val = df[df[y_label].isna()].reset_index(drop=True)

df_train, df_test, convert_cols = sp.transform_data_detail(df_row_train, df_row_val, y_label, excel_path=path_output_report)
df_train.head()

sheet05.可能为数值类型的object类型数据统计在/Users/liliangshan/workspace/python/01_datasets/ccf_system_access_risk_identification/results/01_原始数据探察_20221013.xlsx中已经存在，我们将对原文件进行覆盖
sheet06.数据预处理在/Users/liliangshan/workspace/python/01_datasets/ccf_system_access_risk_identification/results/01_原始数据探察_20221013.xlsx中已经存在，我们将对原文件进行覆盖


Unnamed: 0,id,is_risk,op_diff_second1,system_op_diff_second1,op_datetime_month,op_datetime_day,op_datetime_hour,op_datetime_minute,op_datetime_second,op_datetime_quarter,op_datetime_dayofweek,op_datetime_second_sin,op_datetime_second_cos,op_datetime_minute_sin,op_datetime_minute_cos,op_datetime_hour_sin,op_datetime_hour_cos,op_datetime_day_sin,op_datetime_day_cos,op_datetime_dayofweek_sin,op_datetime_dayofweek_cos,op_datetime_month_sin,op_datetime_month_cos,avg_3_system_op_diff_second1,median_3_system_op_diff_second1,max_3_system_op_diff_second1,min_3_system_op_diff_second1,std_3_system_op_diff_second1,skew_3_system_op_diff_second1,avg_6_system_op_diff_second1,median_6_system_op_diff_second1,max_6_system_op_diff_second1,min_6_system_op_diff_second1,std_6_system_op_diff_second1,skew_6_system_op_diff_second1,kurt_6_system_op_diff_second1,avg_9_system_op_diff_second1,median_9_system_op_diff_second1,max_9_system_op_diff_second1,min_9_system_op_diff_second1,std_9_system_op_diff_second1,skew_9_system_op_diff_second1,kurt_9_system_op_diff_second1,avg_12_system_op_diff_second1,median_12_system_op_diff_second1,max_12_system_op_diff_second1,min_12_system_op_diff_second1,std_12_system_op_diff_second1,skew_12_system_op_diff_second1,kurt_12_system_op_diff_second1,...,skew_log_system_transform_system_op_diff_second1_9,kurt_log_system_transform_system_op_diff_second1_9,avg_log_system_transform_system_op_diff_second1_12,median_log_system_transform_system_op_diff_second1_12,max_log_system_transform_system_op_diff_second1_12,min_log_system_transform_system_op_diff_second1_12,std_log_system_transform_system_op_diff_second1_12,skew_log_system_transform_system_op_diff_second1_12,kurt_log_system_transform_system_op_diff_second1_12,avg_log_system_transform_system_op_diff_second1_15,median_log_system_transform_system_op_diff_second1_15,max_log_system_transform_system_op_diff_second1_15,min_log_system_transform_system_op_diff_second1_15,std_log_system_transform_system_op_diff_second1_15,skew_log_system_transform_system_op_diff_second1_15,kurt_log_system_transform_system_op_diff_second1_15,avg_url_system_op_diff_second1_3,median_url_system_op_diff_second1_3,max_url_system_op_diff_second1_3,min_url_system_op_diff_second1_3,std_url_system_op_diff_second1_3,skew_url_system_op_diff_second1_3,avg_url_system_op_diff_second1_6,median_url_system_op_diff_second1_6,max_url_system_op_diff_second1_6,min_url_system_op_diff_second1_6,std_url_system_op_diff_second1_6,skew_url_system_op_diff_second1_6,kurt_url_system_op_diff_second1_6,avg_url_system_op_diff_second1_9,median_url_system_op_diff_second1_9,max_url_system_op_diff_second1_9,min_url_system_op_diff_second1_9,std_url_system_op_diff_second1_9,skew_url_system_op_diff_second1_9,kurt_url_system_op_diff_second1_9,avg_url_system_op_diff_second1_12,median_url_system_op_diff_second1_12,max_url_system_op_diff_second1_12,min_url_system_op_diff_second1_12,std_url_system_op_diff_second1_12,skew_url_system_op_diff_second1_12,kurt_url_system_op_diff_second1_12,avg_url_system_op_diff_second1_15,median_url_system_op_diff_second1_15,max_url_system_op_diff_second1_15,min_url_system_op_diff_second1_15,std_url_system_op_diff_second1_15,skew_url_system_op_diff_second1_15,kurt_url_system_op_diff_second1_15
0,44477,1.0,,,1,7,2,44,29,1,4,0.104528,-0.994522,-0.994522,-0.104528,0.5,0.866025,0.988468,0.151428,-0.433884,-0.900969,0.5,0.866025,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1,45489,1.0,,603.0,1,7,2,54,32,1,4,-0.207912,-0.978148,-0.587785,0.809017,0.5,0.866025,0.988468,0.151428,-0.433884,-0.900969,0.5,0.866025,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2,45706,1.0,,384.0,1,7,3,0,56,1,4,-0.406737,0.913545,0.0,1.0,0.707107,0.707107,0.988468,0.151428,-0.433884,-0.900969,0.5,0.866025,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
3,45901,1.0,,5318.0,1,7,4,29,34,1,4,-0.406737,-0.913545,0.104528,-0.994522,0.866025,0.5,0.988468,0.151428,-0.433884,-0.900969,0.5,0.866025,2101.666667,603.0,5318.0,384.0,2787.577861,1.720032,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,43827,1.0,,2890.0,1,7,5,17,44,1,4,-0.994522,-0.104528,0.978148,-0.207912,0.965926,0.258819,0.988468,0.151428,-0.433884,-0.900969,0.5,0.866025,2864.0,2890.0,5318.0,384.0,2467.102754,-0.047419,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [8]:
df_train, df_test, bins = sp.binning_data_detail(df_train, df_test, y_label)

特征op_datetime_minute_cos, 单调分箱调整失败
特征op_datetime_minute, 单调分箱调整失败
特征op_datetime_second_sin, 单调分箱调整失败
特征op_datetime_second_cos, 单调分箱调整失败


[INFO] converting into woe values ...
Woe transformating on 47660 rows and 494 columns in 00:00:50
[INFO] converting into woe values ...
Woe transformating on 25710 rows and 494 columns in 00:00:27


In [11]:
iv_list, iv_drop_var, df_train = sp.feature_selection_iv(df_train, bins, y_label, min_threshold=0.02, max_threshold=3.0)

In [12]:
corr_matrix, corr_drop_var, df_train = sp.feature_selection_corr(df_train, y_label)

In [14]:
feats, categorical_feats = get_null_importance(df_train.drop(columns=[y_label]).copy(),
                                               df_train[y_label].copy(), 
                                               thresholds=15)

## modeling

In [15]:

import time
from sklearn.metrics import roc_auc_score as auc
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold, KFold

params = {
    'learning_rate': 0.05,
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'num_leaves': 31,
    'verbose': -1,
    'seed': 2222,
    'n_jobs': -1,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.9,
    'bagging_freq': 4,
    # 'min_child_weight': 10,
}

fold_num = 5
seeds = [2022]
oof = np.zeros(len(df_train))
importance = 0
pred_y = pd.DataFrame()
score = []
for seed in seeds:
    kf = StratifiedKFold(n_splits=fold_num, shuffle=True, random_state=seed)
    # kf = KFold(n_splits=fold_num, shuffle=True, random_state=seed)
    for fold, (train_idx, val_idx) in enumerate(kf.split(df_train[feats], df_train[y_label])):
        print('-----------', fold)
        train = lgb.Dataset(df_train.loc[train_idx, feats],
                            df_train.loc[train_idx, y_label],
                           # categorical_feature=categorical_feats
                           )
        val = lgb.Dataset(df_train.loc[val_idx, feats],
                          df_train.loc[val_idx, y_label],
                          #categorical_feature=categorical_feats
                         )
        model = lgb.train(params, train, valid_sets=[val], 
                          num_boost_round=20000, early_stopping_rounds=100)

        oof[val_idx] += model.predict(df_train.loc[val_idx, feats]) / len(seeds)
        pred_y['fold_%d_seed_%d' % (fold, seed)] = model.predict(df_test[feats])
        importance += model.feature_importance(importance_type='gain') / fold_num
        score.append(auc(df_train.loc[val_idx, y_label], model.predict(df_train.loc[val_idx, feats])))
feats_importance = pd.DataFrame()
feats_importance['name'] = feats
feats_importance['importance'] = importance
display(feats_importance.sort_values('importance', ascending=False)[:30])

df_train['oof'] = oof
display(np.mean(score), np.std(score))

score = np.mean(score)
df_test[y_label] = pred_y.mean(axis=1).values
df_test = df_test.sort_values('id').reset_index(drop=True)

sub = pd.read_csv(path_sample_submission)
sub[y_label] = df_test[y_label].values
sub.to_csv(os.path.join(path_results_jupyter,time.strftime('lgb_%Y%m%d%H%M_')+'%.5f.csv'%score), index=False)

----------- 0
[1]	valid_0's auc: 0.926104
Training until validation scores don't improve for 100 rounds
[2]	valid_0's auc: 0.928411
[3]	valid_0's auc: 0.934373
[4]	valid_0's auc: 0.934504
[5]	valid_0's auc: 0.935857
[6]	valid_0's auc: 0.936274
[7]	valid_0's auc: 0.936334
[8]	valid_0's auc: 0.936462
[9]	valid_0's auc: 0.936462
[10]	valid_0's auc: 0.936576
[11]	valid_0's auc: 0.936434
[12]	valid_0's auc: 0.93655
[13]	valid_0's auc: 0.936745
[14]	valid_0's auc: 0.936727
[15]	valid_0's auc: 0.936772
[16]	valid_0's auc: 0.936887
[17]	valid_0's auc: 0.936927
[18]	valid_0's auc: 0.937104
[19]	valid_0's auc: 0.936902
[20]	valid_0's auc: 0.936923
[21]	valid_0's auc: 0.936893
[22]	valid_0's auc: 0.936856
[23]	valid_0's auc: 0.936932
[24]	valid_0's auc: 0.936899
[25]	valid_0's auc: 0.936876
[26]	valid_0's auc: 0.936011
[27]	valid_0's auc: 0.935895
[28]	valid_0's auc: 0.93589
[29]	valid_0's auc: 0.935888
[30]	valid_0's auc: 0.935893
[31]	valid_0's auc: 0.936014
[32]	valid_0's auc: 0.936407
[33]	va

[107]	valid_0's auc: 0.939645
[108]	valid_0's auc: 0.939971
[109]	valid_0's auc: 0.940184
[110]	valid_0's auc: 0.940348
[111]	valid_0's auc: 0.940919
[112]	valid_0's auc: 0.940935
[113]	valid_0's auc: 0.941107
[114]	valid_0's auc: 0.941183
[115]	valid_0's auc: 0.941207
[116]	valid_0's auc: 0.941044
[117]	valid_0's auc: 0.940988
[118]	valid_0's auc: 0.940947
[119]	valid_0's auc: 0.940841
[120]	valid_0's auc: 0.94088
[121]	valid_0's auc: 0.94095
[122]	valid_0's auc: 0.940945
[123]	valid_0's auc: 0.940963
[124]	valid_0's auc: 0.941017
[125]	valid_0's auc: 0.941155
[126]	valid_0's auc: 0.941419
[127]	valid_0's auc: 0.941418
[128]	valid_0's auc: 0.941451
[129]	valid_0's auc: 0.941457
[130]	valid_0's auc: 0.941456
[131]	valid_0's auc: 0.941289
[132]	valid_0's auc: 0.941317
[133]	valid_0's auc: 0.941243
[134]	valid_0's auc: 0.941081
[135]	valid_0's auc: 0.941088
[136]	valid_0's auc: 0.940936
[137]	valid_0's auc: 0.940997
[138]	valid_0's auc: 0.940887
[139]	valid_0's auc: 0.940434
[140]	valid_

----------- 3
[1]	valid_0's auc: 0.92508
Training until validation scores don't improve for 100 rounds
[2]	valid_0's auc: 0.926703
[3]	valid_0's auc: 0.934062
[4]	valid_0's auc: 0.932506
[5]	valid_0's auc: 0.935387
[6]	valid_0's auc: 0.936169
[7]	valid_0's auc: 0.936374
[8]	valid_0's auc: 0.937326
[9]	valid_0's auc: 0.937498
[10]	valid_0's auc: 0.937678
[11]	valid_0's auc: 0.937212
[12]	valid_0's auc: 0.937321
[13]	valid_0's auc: 0.937367
[14]	valid_0's auc: 0.93743
[15]	valid_0's auc: 0.937455
[16]	valid_0's auc: 0.937506
[17]	valid_0's auc: 0.937556
[18]	valid_0's auc: 0.937432
[19]	valid_0's auc: 0.937541
[20]	valid_0's auc: 0.937668
[21]	valid_0's auc: 0.938908
[22]	valid_0's auc: 0.93837
[23]	valid_0's auc: 0.938846
[24]	valid_0's auc: 0.939359
[25]	valid_0's auc: 0.939261
[26]	valid_0's auc: 0.9389
[27]	valid_0's auc: 0.93884
[28]	valid_0's auc: 0.94138
[29]	valid_0's auc: 0.940256
[30]	valid_0's auc: 0.940327
[31]	valid_0's auc: 0.940297
[32]	valid_0's auc: 0.940629
[33]	valid_0

Unnamed: 0,name,importance
56,avg_ip_type_system_op_diff_second1_3_woe,74618.086189
39,avg_ip_transform_system_op_diff_second1_6_woe,65005.797317
60,avg_os_type_system_op_diff_second1_15_woe,8961.918511
40,std_browser_version_system_op_diff_second1_6_woe,8656.634652
35,op_diff_second1_woe,8050.362024
51,op_datetime_hour_cos_woe,4381.677269
46,max_ip_transform_system_op_diff_second1_3_woe,4367.828151
14,median_ip_transform_system_op_diff_second1_12_woe,3571.206396
52,min_ip_type_system_op_diff_second1_6_woe,2968.27635
8,skew_ip_transform_system_op_diff_second1_3_woe,1778.307487


0.9412049006482046

0.002474145375367134

KeyError: 'id'

In [16]:
feats_importance.sort_values('importance', ascending=False)[:50]

Unnamed: 0,name,importance
56,avg_ip_type_system_op_diff_second1_3_woe,74618.086189
39,avg_ip_transform_system_op_diff_second1_6_woe,65005.797317
60,avg_os_type_system_op_diff_second1_15_woe,8961.918511
40,std_browser_version_system_op_diff_second1_6_woe,8656.634652
35,op_diff_second1_woe,8050.362024
51,op_datetime_hour_cos_woe,4381.677269
46,max_ip_transform_system_op_diff_second1_3_woe,4367.828151
14,median_ip_transform_system_op_diff_second1_12_woe,3571.206396
52,min_ip_type_system_op_diff_second1_6_woe,2968.27635
8,skew_ip_transform_system_op_diff_second1_3_woe,1778.307487


In [17]:
feats_importance.shape

(70, 2)

In [18]:
feats = feats_importance.name
feats = [i[:-4] for i in feats]

In [19]:
feats

['op_datetime_hour',
 'median_user_name_system_op_diff_second1_15',
 'min_department_system_op_diff_second1_15',
 'min_log_system_transform_system_op_diff_second1_3',
 'skew_user_name_system_op_diff_second1_3',
 'median_op_city_system_op_diff_second1_6',
 'max_device_num_transform_system_op_diff_second1_9',
 'median_op_city_system_op_diff_second1_15',
 'skew_ip_transform_system_op_diff_second1_3',
 'min_op_city_system_op_diff_second1_15',
 'max_user_name_system_op_diff_second1_3',
 'median_browser_system_op_diff_second1_15',
 'min_device_num_transform_system_op_diff_second1_6',
 'max_browser_version_system_op_diff_second1_9',
 'median_ip_transform_system_op_diff_second1_12',
 'skew_log_system_transform_system_op_diff_second1_9',
 'min_url_system_op_diff_second1_9',
 'skew_os_type_system_op_diff_second1_6',
 'skew_log_system_transform_system_op_diff_second1_3',
 'min_url_system_op_diff_second1_15',
 'median_device_num_transform_system_op_diff_second1_6',
 'skew_ip_type_system_op_diff_se

# 方案一-简化

## 特征生成

In [4]:
df = pd.concat([df_row_train, df_row_val]).reset_index(drop=True)
df = df.sort_values(by='op_datetime')

# 按客户进行统计这次认证和上次认证的时间差
df['op_second'] = pd.to_datetime(df['op_datetime'])
df['op_second1'] = df.groupby('device_num_transform')['op_second'].shift(1)
df['op_diff_second1'] = (df['op_second'] - df['op_second1']).map(lambda x: x.total_seconds())

# 系统层面的一段时间
df['system_op_second'] = pd.to_datetime(df['op_datetime'])
df['system_op_second1'] = df['system_op_second'].shift(1)
df['system_op_diff_second1'] = (df['system_op_second'] - df['system_op_second1']).map(lambda x: x.total_seconds())

df = df.drop(columns=['op_second','op_second1', 
                      'system_op_second', 'system_op_second1'])
df.head()

Unnamed: 0,id,user_name,department,ip_transform,device_num_transform,browser_version,browser,os_type,os_version,op_datetime,ip_type,http_status_code,op_city,log_system_transform,url,op_month,is_risk,op_diff_second1,system_op_diff_second1
44477,44477,xiongkai3397,rd,6H1iPLgBB,GCgxrFb69up7,chrome_93,chrome,win,win10,2022-01-07 02:44:29,内网,200,深圳,nHrKgKdJ1Mzt,xxx.com/github,2022-01,1.0,,
45489,45489,zhengguiying7117,rd,0mjaEf4SB,8ftsXFm5I1Ej,safari_13,safari,macos,macos_big_sur_11,2022-01-07 02:54:32,内网,200,成都,nHrKgKdJ1Mzt,xxx.com/github,2022-01,1.0,,603.0
45706,45706,yuanjun5870,hr,1Vk2kEa4X,W1Cstajd8x1s,firefox_78,firefox,win,win7,2022-01-07 03:00:56,内网,200,深圳,a5G25puBl9xj,hr.xxx.com/,2022-01,1.0,,384.0
45901,45901,zhoutingting3694,rd,4Wj6uxLx3,H8NAVsdws95G,edge_93,edge,win,win10,2022-01-07 04:29:34,内网,200,杭州,nHrKgKdJ1Mzt,xxx.com/github,2022-01,1.0,,5318.0
43827,43827,yanglin6562,sales,eK12oQmm8,GnkVqPSy5nnl,ie_9,ie,win,win10,2022-01-07 05:17:44,内网,200,重庆,sW0whYIx8LFM,work.xxx.com/task,2022-01,1.0,,2890.0


In [5]:
cate_cols = ['ip_transform', 'user_name', 'device_num_transform', 'department', 'browser_version', 'browser', 'os_type','os_version',
              'ip_type','http_status_code', 'op_city', 'log_system_transform', 'url']

df = get_time_base(df, cols='op_datetime')
df = get_sequence_statis(df, col='system_op_diff_second1', n=5, freq=3 )
df = get_sequence_groupby_statis(df, col='system_op_diff_second1',cate_cols= cate_cols, n=5, freq=3)
df.head()

Unnamed: 0,id,user_name,department,ip_transform,device_num_transform,browser_version,browser,os_type,os_version,op_datetime,ip_type,http_status_code,op_city,log_system_transform,url,op_month,is_risk,op_diff_second1,system_op_diff_second1,op_datetime_year,op_datetime_month,op_datetime_day,op_datetime_hour,op_datetime_minute,op_datetime_second,op_datetime_quarter,op_datetime_dayofweek,op_datetime_is_year_start,op_datetime_is_month_start,op_datetime_is_month_end,op_datetime_second_sin,op_datetime_second_cos,op_datetime_minute_sin,op_datetime_minute_cos,op_datetime_hour_sin,op_datetime_hour_cos,op_datetime_day_sin,op_datetime_day_cos,op_datetime_dayofweek_sin,op_datetime_dayofweek_cos,op_datetime_month_sin,op_datetime_month_cos,avg_3_system_op_diff_second1,median_3_system_op_diff_second1,max_3_system_op_diff_second1,min_3_system_op_diff_second1,std_3_system_op_diff_second1,skew_3_system_op_diff_second1,kurt_3_system_op_diff_second1,avg_6_system_op_diff_second1,...,kurt_log_system_transform_system_op_diff_second1_9,avg_log_system_transform_system_op_diff_second1_12,median_log_system_transform_system_op_diff_second1_12,max_log_system_transform_system_op_diff_second1_12,min_log_system_transform_system_op_diff_second1_12,std_log_system_transform_system_op_diff_second1_12,skew_log_system_transform_system_op_diff_second1_12,kurt_log_system_transform_system_op_diff_second1_12,avg_log_system_transform_system_op_diff_second1_15,median_log_system_transform_system_op_diff_second1_15,max_log_system_transform_system_op_diff_second1_15,min_log_system_transform_system_op_diff_second1_15,std_log_system_transform_system_op_diff_second1_15,skew_log_system_transform_system_op_diff_second1_15,kurt_log_system_transform_system_op_diff_second1_15,avg_url_system_op_diff_second1_3,median_url_system_op_diff_second1_3,max_url_system_op_diff_second1_3,min_url_system_op_diff_second1_3,std_url_system_op_diff_second1_3,skew_url_system_op_diff_second1_3,kurt_url_system_op_diff_second1_3,avg_url_system_op_diff_second1_6,median_url_system_op_diff_second1_6,max_url_system_op_diff_second1_6,min_url_system_op_diff_second1_6,std_url_system_op_diff_second1_6,skew_url_system_op_diff_second1_6,kurt_url_system_op_diff_second1_6,avg_url_system_op_diff_second1_9,median_url_system_op_diff_second1_9,max_url_system_op_diff_second1_9,min_url_system_op_diff_second1_9,std_url_system_op_diff_second1_9,skew_url_system_op_diff_second1_9,kurt_url_system_op_diff_second1_9,avg_url_system_op_diff_second1_12,median_url_system_op_diff_second1_12,max_url_system_op_diff_second1_12,min_url_system_op_diff_second1_12,std_url_system_op_diff_second1_12,skew_url_system_op_diff_second1_12,kurt_url_system_op_diff_second1_12,avg_url_system_op_diff_second1_15,median_url_system_op_diff_second1_15,max_url_system_op_diff_second1_15,min_url_system_op_diff_second1_15,std_url_system_op_diff_second1_15,skew_url_system_op_diff_second1_15,kurt_url_system_op_diff_second1_15
44477,44477,xiongkai3397,rd,6H1iPLgBB,GCgxrFb69up7,chrome_93,chrome,win,win10,2022-01-07 02:44:29,内网,200,深圳,nHrKgKdJ1Mzt,xxx.com/github,2022-01,1.0,,,2022,1,7,2,44,29,1,4,False,False,False,0.104528,-0.994522,-0.994522,-0.104528,0.5,0.866025,0.988468,0.151428,-0.433884,-0.900969,0.5,0.866025,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
45489,45489,zhengguiying7117,rd,0mjaEf4SB,8ftsXFm5I1Ej,safari_13,safari,macos,macos_big_sur_11,2022-01-07 02:54:32,内网,200,成都,nHrKgKdJ1Mzt,xxx.com/github,2022-01,1.0,,603.0,2022,1,7,2,54,32,1,4,False,False,False,-0.207912,-0.978148,-0.587785,0.809017,0.5,0.866025,0.988468,0.151428,-0.433884,-0.900969,0.5,0.866025,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
45706,45706,yuanjun5870,hr,1Vk2kEa4X,W1Cstajd8x1s,firefox_78,firefox,win,win7,2022-01-07 03:00:56,内网,200,深圳,a5G25puBl9xj,hr.xxx.com/,2022-01,1.0,,384.0,2022,1,7,3,0,56,1,4,False,False,False,-0.406737,0.913545,0.0,1.0,0.707107,0.707107,0.988468,0.151428,-0.433884,-0.900969,0.5,0.866025,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
45901,45901,zhoutingting3694,rd,4Wj6uxLx3,H8NAVsdws95G,edge_93,edge,win,win10,2022-01-07 04:29:34,内网,200,杭州,nHrKgKdJ1Mzt,xxx.com/github,2022-01,1.0,,5318.0,2022,1,7,4,29,34,1,4,False,False,False,-0.406737,-0.913545,0.104528,-0.994522,0.866025,0.5,0.988468,0.151428,-0.433884,-0.900969,0.5,0.866025,2101.666667,603.0,5318.0,384.0,2787.577861,1.720032,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
43827,43827,yanglin6562,sales,eK12oQmm8,GnkVqPSy5nnl,ie_9,ie,win,win10,2022-01-07 05:17:44,内网,200,重庆,sW0whYIx8LFM,work.xxx.com/task,2022-01,1.0,,2890.0,2022,1,7,5,17,44,1,4,False,False,False,-0.994522,-0.104528,0.978148,-0.207912,0.965926,0.258819,0.988468,0.151428,-0.433884,-0.900969,0.5,0.866025,2864.0,2890.0,5318.0,384.0,2467.102754,-0.047419,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [6]:
df_train = df[df[y_label].notna()].reset_index(drop=True)
df_test = df[df[y_label].isna()].reset_index(drop=True)

In [7]:
feats = ['op_datetime_hour',
 'median_user_name_system_op_diff_second1_15',
 'min_department_system_op_diff_second1_15',
 'min_log_system_transform_system_op_diff_second1_3',
 'skew_user_name_system_op_diff_second1_3',
 'median_op_city_system_op_diff_second1_6',
 'max_device_num_transform_system_op_diff_second1_9',
 'median_op_city_system_op_diff_second1_15',
 'skew_ip_transform_system_op_diff_second1_3',
 'min_op_city_system_op_diff_second1_15',
 'max_user_name_system_op_diff_second1_3',
 'median_browser_system_op_diff_second1_15',
 'min_device_num_transform_system_op_diff_second1_6',
 'max_browser_version_system_op_diff_second1_9',
 'median_ip_transform_system_op_diff_second1_12',
 'skew_log_system_transform_system_op_diff_second1_9',
 'min_url_system_op_diff_second1_9',
 'skew_os_type_system_op_diff_second1_6',
 'skew_log_system_transform_system_op_diff_second1_3',
 'min_url_system_op_diff_second1_15',
 'median_device_num_transform_system_op_diff_second1_6',
 'skew_ip_type_system_op_diff_second1_6',
 'min_user_name_system_op_diff_second1_6',
 'max_department_system_op_diff_second1_9',
 'kurt_department_system_op_diff_second1_15',
 'min_browser_system_op_diff_second1_15',
 'median_browser_version_system_op_diff_second1_6',
 'skew_http_status_code_system_op_diff_second1_15',
 'min_department_system_op_diff_second1_6',
 'skew_log_system_transform_system_op_diff_second1_15',
 'median_device_num_transform_system_op_diff_second1_3',
 'median_department_system_op_diff_second1_15',
 'kurt_ip_type_system_op_diff_second1_12',
 'max_device_num_transform_system_op_diff_second1_6',
 'avg_user_name_system_op_diff_second1_9',
 'op_diff_second1',
 'kurt_ip_transform_system_op_diff_second1_6',
 'skew_http_status_code_system_op_diff_second1_3',
 'op_datetime_dayofweek_sin',
 'avg_ip_transform_system_op_diff_second1_6',
 'std_browser_version_system_op_diff_second1_6',
 'kurt_http_status_code_system_op_diff_second1_12',
 'max_log_system_transform_system_op_diff_second1_15',
 'skew_url_system_op_diff_second1_6',
 'kurt_http_status_code_system_op_diff_second1_9',
 'min_op_city_system_op_diff_second1_6',
 'max_ip_transform_system_op_diff_second1_3',
 'skew_department_system_op_diff_second1_9',
 'median_user_name_system_op_diff_second1_9',
 'min_device_num_transform_system_op_diff_second1_15',
 'min_url_system_op_diff_second1_6',
 'op_datetime_hour_cos',
 'min_ip_type_system_op_diff_second1_6',
 'avg_browser_system_op_diff_second1_9',
 'kurt_url_system_op_diff_second1_6',
 'min_http_status_code_system_op_diff_second1_12',
 'avg_ip_type_system_op_diff_second1_3',
 'kurt_http_status_code_system_op_diff_second1_6',
 'min_op_city_system_op_diff_second1_3',
 'kurt_department_system_op_diff_second1_6',
 'avg_os_type_system_op_diff_second1_15',
 'skew_log_system_transform_system_op_diff_second1_6',
 'skew_browser_version_system_op_diff_second1_9',
 'max_op_city_system_op_diff_second1_9',
 'kurt_url_system_op_diff_second1_15',
 'min_browser_system_op_diff_second1_3',
 'skew_department_system_op_diff_second1_12',
 'avg_http_status_code_system_op_diff_second1_15',
 'avg_url_system_op_diff_second1_6',
 'min_ip_type_system_op_diff_second1_15']

## modeling

In [12]:
feats = feats_importance.sort_values('importance', ascending=False)[:30]['name'].values

In [13]:

import time
from sklearn.metrics import roc_auc_score as auc
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold, KFold

params = {
    'learning_rate': 0.05,
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'num_leaves': 31,
    'verbose': -1,
    'seed': 2222,
    'n_jobs': -1,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.9,
    'bagging_freq': 4,
    # 'min_child_weight': 10,
}

fold_num = 5
seeds = [2022]
oof = np.zeros(len(df_train))
importance = 0
pred_y = pd.DataFrame()
score = []
for seed in seeds:
    kf = StratifiedKFold(n_splits=fold_num, shuffle=True, random_state=seed)
    # kf = KFold(n_splits=fold_num, shuffle=True, random_state=seed)
    for fold, (train_idx, val_idx) in enumerate(kf.split(df_train[feats], df_train[y_label])):
        print('-----------', fold)
        train = lgb.Dataset(df_train.loc[train_idx, feats],
                            df_train.loc[train_idx, y_label],
                           # categorical_feature=categorical_feats
                           )
        val = lgb.Dataset(df_train.loc[val_idx, feats],
                          df_train.loc[val_idx, y_label],
                          #categorical_feature=categorical_feats
                         )
        model = lgb.train(params, train, valid_sets=[val], 
                          num_boost_round=20000, early_stopping_rounds=100)

        oof[val_idx] += model.predict(df_train.loc[val_idx, feats]) / len(seeds)
        pred_y['fold_%d_seed_%d' % (fold, seed)] = model.predict(df_test[feats])
        importance += model.feature_importance(importance_type='gain') / fold_num
        score.append(auc(df_train.loc[val_idx, y_label], model.predict(df_train.loc[val_idx, feats])))
feats_importance = pd.DataFrame()
feats_importance['name'] = feats
feats_importance['importance'] = importance
display(feats_importance.sort_values('importance', ascending=False)[:30])

df_train['oof'] = oof
display(np.mean(score), np.std(score))

score = np.mean(score)
df_test[y_label] = pred_y.mean(axis=1).values
df_test = df_test.sort_values('id').reset_index(drop=True)

sub = pd.read_csv(path_sample_submission)
sub[y_label] = df_test[y_label].values
sub.to_csv(os.path.join(path_results_jupyter,time.strftime('lgb_%Y%m%d%H%M_')+'%.5f.csv'%score), index=False)

----------- 0
[1]	valid_0's auc: 0.930407
Training until validation scores don't improve for 100 rounds
[2]	valid_0's auc: 0.930293
[3]	valid_0's auc: 0.930255
[4]	valid_0's auc: 0.937198
[5]	valid_0's auc: 0.937009
[6]	valid_0's auc: 0.936946
[7]	valid_0's auc: 0.936932
[8]	valid_0's auc: 0.939514
[9]	valid_0's auc: 0.939627
[10]	valid_0's auc: 0.939473
[11]	valid_0's auc: 0.939493
[12]	valid_0's auc: 0.939497
[13]	valid_0's auc: 0.939564
[14]	valid_0's auc: 0.93975
[15]	valid_0's auc: 0.939677
[16]	valid_0's auc: 0.939785
[17]	valid_0's auc: 0.939604
[18]	valid_0's auc: 0.939581
[19]	valid_0's auc: 0.939581
[20]	valid_0's auc: 0.939455
[21]	valid_0's auc: 0.939459
[22]	valid_0's auc: 0.940199
[23]	valid_0's auc: 0.940669
[24]	valid_0's auc: 0.940734
[25]	valid_0's auc: 0.940769
[26]	valid_0's auc: 0.940169
[27]	valid_0's auc: 0.939943
[28]	valid_0's auc: 0.940454
[29]	valid_0's auc: 0.940456
[30]	valid_0's auc: 0.940436
[31]	valid_0's auc: 0.940354
[32]	valid_0's auc: 0.940436
[33]	v

[64]	valid_0's auc: 0.936435
[65]	valid_0's auc: 0.936247
[66]	valid_0's auc: 0.936462
[67]	valid_0's auc: 0.936704
[68]	valid_0's auc: 0.93592
[69]	valid_0's auc: 0.935815
[70]	valid_0's auc: 0.936255
[71]	valid_0's auc: 0.936432
[72]	valid_0's auc: 0.936045
[73]	valid_0's auc: 0.93726
[74]	valid_0's auc: 0.937022
[75]	valid_0's auc: 0.936998
[76]	valid_0's auc: 0.936894
[77]	valid_0's auc: 0.937668
[78]	valid_0's auc: 0.937611
[79]	valid_0's auc: 0.937523
[80]	valid_0's auc: 0.937908
[81]	valid_0's auc: 0.938058
[82]	valid_0's auc: 0.938066
[83]	valid_0's auc: 0.937773
[84]	valid_0's auc: 0.937949
[85]	valid_0's auc: 0.937996
[86]	valid_0's auc: 0.937971
[87]	valid_0's auc: 0.937626
[88]	valid_0's auc: 0.937512
[89]	valid_0's auc: 0.937661
[90]	valid_0's auc: 0.937336
[91]	valid_0's auc: 0.93716
[92]	valid_0's auc: 0.93709
[93]	valid_0's auc: 0.937249
[94]	valid_0's auc: 0.937277
[95]	valid_0's auc: 0.937337
[96]	valid_0's auc: 0.937397
[97]	valid_0's auc: 0.937267
[98]	valid_0's auc

[96]	valid_0's auc: 0.940247
[97]	valid_0's auc: 0.940176
[98]	valid_0's auc: 0.940455
[99]	valid_0's auc: 0.940341
[100]	valid_0's auc: 0.940399
[101]	valid_0's auc: 0.94043
[102]	valid_0's auc: 0.940654
[103]	valid_0's auc: 0.940992
[104]	valid_0's auc: 0.940928
[105]	valid_0's auc: 0.940938
[106]	valid_0's auc: 0.940998
[107]	valid_0's auc: 0.940935
[108]	valid_0's auc: 0.94094
[109]	valid_0's auc: 0.941041
[110]	valid_0's auc: 0.940885
[111]	valid_0's auc: 0.940814
[112]	valid_0's auc: 0.940737
[113]	valid_0's auc: 0.940714
[114]	valid_0's auc: 0.940981
[115]	valid_0's auc: 0.941137
[116]	valid_0's auc: 0.941336
[117]	valid_0's auc: 0.941314
[118]	valid_0's auc: 0.941344
[119]	valid_0's auc: 0.941385
[120]	valid_0's auc: 0.941391
[121]	valid_0's auc: 0.941523
[122]	valid_0's auc: 0.941432
[123]	valid_0's auc: 0.941715
[124]	valid_0's auc: 0.941791
[125]	valid_0's auc: 0.941836
[126]	valid_0's auc: 0.941708
[127]	valid_0's auc: 0.941551
[128]	valid_0's auc: 0.941588
[129]	valid_0's 

Unnamed: 0,name,importance
0,avg_ip_type_system_op_diff_second1_3,72829.95092
1,avg_ip_transform_system_op_diff_second1_6,60321.841697
4,max_ip_transform_system_op_diff_second1_3,11582.268004
2,avg_http_status_code_system_op_diff_second1_15,9551.359972
6,op_datetime_hour_cos,4645.363409
3,kurt_ip_transform_system_op_diff_second1_6,3479.557182
5,max_device_num_transform_system_op_diff_second1_6,3219.450278
7,op_diff_second1,1833.778993
11,median_ip_transform_system_op_diff_second1_12,1817.228476
12,std_browser_version_system_op_diff_second1_6,1787.135911


0.9405378137051301

0.0019773987465749755

In [14]:
feats_importance.sort_values('importance', ascending=False)[:50]

Unnamed: 0,name,importance
0,avg_ip_type_system_op_diff_second1_3,72829.95092
1,avg_ip_transform_system_op_diff_second1_6,60321.841697
4,max_ip_transform_system_op_diff_second1_3,11582.268004
2,avg_http_status_code_system_op_diff_second1_15,9551.359972
6,op_datetime_hour_cos,4645.363409
3,kurt_ip_transform_system_op_diff_second1_6,3479.557182
5,max_device_num_transform_system_op_diff_second1_6,3219.450278
7,op_diff_second1,1833.778993
11,median_ip_transform_system_op_diff_second1_12,1817.228476
12,std_browser_version_system_op_diff_second1_6,1787.135911


# 方案二-交叉累积特征

In [4]:
df = pd.concat([df_row_train, df_row_val]).reset_index(drop=True)
df = df.sort_values(by='op_datetime')

# 认证日期时间
df['op_datetime'] = pd.to_datetime(df['op_datetime'])
# 将数据分为每一天
df['op_days'] = df['op_datetime'].map(lambda x: x.strftime('%Y-%m-%d'))

# 按客户进行统计这次认证和上次认证的时间差
df['op_second'] = df['op_datetime']
df['op_second1'] = df.groupby('device_num_transform')['op_second'].shift(1)
df['op_diff_second1'] = (df['op_second'] - df['op_second1']).map(lambda x: x.total_seconds())

df['op_diff_second1_tmp'] = df['op_diff_second1']>10
# 客户第几次登录
df['op_times_groups'] = df.groupby('device_num_transform')['op_diff_second1_tmp'].apply(lambda x: x.cumsum())

# 系统层面的一段时间
df['system_op_second'] = df['op_datetime']
df['system_op_second1'] = df['system_op_second'].shift(1)
df['system_op_diff_second1'] = (df['system_op_second'] - df['system_op_second1']).map(lambda x: x.total_seconds())

df['system_op_diff_second1_tmp'] = df['system_op_diff_second1']>400
# 客户第几次登录
df['system_op_times_groups'] = df['system_op_diff_second1_tmp'].cumsum()

df = df.drop(columns=['op_second','op_second1','op_diff_second1_tmp', 'op_diff_second1', 'system_op_diff_second1',
                      'system_op_second', 'system_op_second1', 'system_op_diff_second1_tmp'])

## 环境层面

In [5]:

time_feats = ['system_op_times_groups', 'op_days', 'op_month']

cate_feats = ['ip_transform', 'user_name', 'device_num_transform', 'department', 'browser_version', 'browser', 'os_type','os_version',
              'ip_type','http_status_code', 'op_city', 'log_system_transform', 'url']

# 客户+时间+按时间cumsum/cumunique
# 累计量统计
df['helper'] = 1
# 是否为异常状态码
df['http_status_code_helper'] = df['http_status_code'].map(lambda x: 1 if x in [400, 500, 502, 404] else 0) 
# 给样本编号
df['sampler_index_helper'] = df['helper'].cumsum()

for i in time_feats:
    i_tmp = df.groupby([i])
    # 系统往前看，处理了多少事
    df['system_{}_cumsum'.format(i)] = i_tmp['helper'].cumsum()
    # 系统往前看，处理了多少坏事
    df['system_{}_error_code_cumsum'.format(i)] = i_tmp['http_status_code_helper'].cumsum()
    
    for j in cate_feats:
        index_set = set(df.groupby([i, j],as_index=False).first()['sampler_index_helper'].values)
        df['tmp_helper'] = df['sampler_index_helper'].map(lambda x: 1 if x in index_set else 0)
        j_tmp = df.groupby([i, j])
        # 系统往前看，不同维度的处理了多少情况
        df['system_{}_{}_cumunique'.format(i, j)] = j_tmp['tmp_helper'].cumsum()
    

        if j not in ['ip_transform', 'user_name', 'device_num_transform']:
            for k in df[j].unique():
                tmp = df[df[j]==k].groupby([i])

                # 系统往前看，不同维度不同情况分别处理了多少次
                df['system_{}_{}_{}_cumsum'.format(i,j,k)] = tmp['helper'].cumsum()
                # 系统往前看，不同维度不同情况error_code分别处理了多少次
                df['system_{}_{}_{}_error_code_cumsum'.format(i,j,k)] = tmp['http_status_code_helper'].cumsum()

remove_cols = [x for x in df.columns if x[-6:]=='helper']

df = df.drop(columns=remove_cols)

## 对象层面

In [6]:

time_feats = ['op_times_groups', 'op_days', 'op_month']

object_feats = ['ip_transform', 'user_name', 'device_num_transform']

cate_feats = ['ip_transform', 'user_name', 'device_num_transform', 'department', 'browser_version', 'browser', 'os_type','os_version',
              'ip_type','http_status_code', 'op_city', 'log_system_transform', 'url']

# 客户+时间+按时间cumsum/cumunique
# 累计量统计
df['helper'] = 1
# 是否为异常状态码
df['http_status_code_helper'] = df['http_status_code'].map(lambda x: 1 if x in [400, 500, 502, 404] else 0) 
# 给样本编号
df['sampler_index_helper'] = df['helper'].cumsum()


for i in time_feats:
    for j in object_feats:
        j_tmp = df.groupby([i,j])
        df['{}_{}_cumsum'.format(i,j)] = j_tmp['helper'].cumsum()
        df['{}_{}_error_code_cumsum'.format(i,j)] = j_tmp['http_status_code_helper'].cumsum()
        
        for k in cate_feats:
            if k == j: continue
            index_set = set(df.groupby([i,j,k], as_index=False).first()['sampler_index_helper'].values)            
            df['tmp_helper'] = df['sampler_index_helper'].map(lambda x: 1 if x in index_set else 0)
            k_tmp = df.groupby([i,j,k])
            df['{}_{}_{}_cumunique'.format(i,j,k)] = k_tmp['tmp_helper'].cumsum()

            if k not in ['ip_transform', 'user_name', 'device_num_transform']:
                for v in df[k].unique():
                    v_tmp = df[df[k]==v].groupby([i,j])
                    df['{}_{}_{}_{}_cumsum'.format(i,j,k,v)] = v_tmp['helper'].cumsum()
                    df['{}_{}_{}_{}_error_code_cumsum'.format(i,j,k,v)] = v_tmp['http_status_code_helper'].cumsum()

remove_cols = [x for x in df.columns if x[-6:]=='helper']

df = df.drop(columns=remove_cols)

In [7]:
remove_cols = ['ip_transform', 'user_name', 'device_num_transform', 'department', 'browser_version', 'browser', 'os_type','os_version',
              'ip_type','http_status_code', 'op_city', 'log_system_transform', 'url', 'op_datetime', 'op_month']

df = df.drop(columns=remove_cols)

In [8]:
df_row_train = df[df[y_label].notna()].reset_index(drop=True)
df_row_val = df[df[y_label].isna()].reset_index(drop=True)

df_train, df_test, convert_cols = sp.transform_data_detail(df_row_train, df_row_val, y_label, excel_path=path_output_report)
df_train, df_test, bins = sp.binning_data_detail(df_train, df_test, y_label)

iv_list, iv_drop_var, df_train = sp.feature_selection_iv(df_train, bins, y_label, min_threshold=0.02, max_threshold=3.0)
corr_matrix, corr_drop_var, df_train = sp.feature_selection_corr(df_train, y_label)
feats, categorical_feats = get_null_importance(df_train.drop(columns=[y_label]).copy(),
                                               df_train[y_label].copy(), 
                                               thresholds=15)

sheet05.可能为数值类型的object类型数据统计在/Users/liliangshan/workspace/python/01_datasets/ccf_system_access_risk_identification/results/01_原始数据探察_20221014.xlsx中已经存在，我们将对原文件进行覆盖
sheet06.数据预处理在/Users/liliangshan/workspace/python/01_datasets/ccf_system_access_risk_identification/results/01_原始数据探察_20221014.xlsx中已经存在，我们将对原文件进行覆盖
There are 132 variables have only one binning intervals，please check the binning result. 
 (ColumnNames: op_times_groups_user_name_department_sales_cumsum, op_days_user_name_os_version_win7_cumsum, op_days_ip_transform_browser_version_edge_93_error_code_cumsum, op_days_ip_transform_op_city_杭州_error_code_cumsum, op_days_device_num_transform_os_version_win7_error_code_cumsum, op_times_groups_user_name_url_xxx.com/github_cumsum, op_times_groups_user_name_os_type_win_cumsum, op_times_groups_user_name_http_status_code_200_error_code_cumsum, op_times_groups_user_name_op_city_杭州_error_code_cumsum, op_times_groups_ip_transform_browser_chrome_error_code_cumsum, op_times_groups_ip_transfo

sheet07.初始分箱结果在06_初始分箱结果.xlsx中已经存在，我们将对原文件进行覆盖
特征op_times_groups_user_name_department_sales_cumsum，除特殊值分箱外，分箱箱数少于2箱，无法调整, 建议删除
特征op_days_user_name_os_version_win7_cumsum，除特殊值分箱外，分箱箱数少于2箱，无法调整, 建议删除
特征op_days_ip_transform_browser_version_edge_93_error_code_cumsum，除特殊值分箱外，分箱箱数少于2箱，无法调整, 建议删除
特征op_days_ip_transform_op_city_杭州_error_code_cumsum，除特殊值分箱外，分箱箱数少于2箱，无法调整, 建议删除
特征op_days_device_num_transform_os_version_win7_error_code_cumsum，除特殊值分箱外，分箱箱数少于2箱，无法调整, 建议删除
特征op_times_groups_user_name_url_xxx.com/github_cumsum，除特殊值分箱外，分箱箱数少于2箱，无法调整, 建议删除
特征op_times_groups_user_name_os_type_win_cumsum，除特殊值分箱外，分箱箱数少于2箱，无法调整, 建议删除
特征op_times_groups_user_name_http_status_code_200_error_code_cumsum，除特殊值分箱外，分箱箱数少于2箱，无法调整, 建议删除
特征op_times_groups_user_name_op_city_杭州_error_code_cumsum，除特殊值分箱外，分箱箱数少于2箱，无法调整, 建议删除
特征op_times_groups_ip_transform_browser_chrome_error_code_cumsum，除特殊值分箱外，分箱箱数少于2箱，无法调整, 建议删除
特征op_times_groups_ip_transform_os_version_win10_error_code_cumsum，除特殊值分箱外，分箱箱数少于2箱，无法调整, 建议删除
特征op_times_gr

特征op_times_groups_user_name_os_version_win7_cumsum，除特殊值分箱外，分箱箱数少于2箱，无法调整, 建议删除
特征op_times_groups_user_name_browser_version_edge_93_error_code_cumsum，除特殊值分箱外，分箱箱数少于2箱，无法调整, 建议删除
特征op_days_ip_transform_os_version_win7_cumsum，除特殊值分箱外，分箱箱数少于2箱，无法调整, 建议删除
特征op_times_groups_ip_transform_op_city_成都_error_code_cumsum，除特殊值分箱外，分箱箱数少于2箱，无法调整, 建议删除
特征op_times_groups_ip_transform_department_sales_cumsum，除特殊值分箱外，分箱箱数少于2箱，无法调整, 建议删除
特征system_op_month_log_system_transform_nHrKgKdJ1Mzt_error_code_cumsum，除特殊值分箱外，分箱箱数少于2箱，无法调整, 建议删除
特征op_times_groups_ip_transform_department_rd_error_code_cumsum，除特殊值分箱外，分箱箱数少于2箱，无法调整, 建议删除
特征op_times_groups_user_name_op_city_深圳_error_code_cumsum，除特殊值分箱外，分箱箱数少于2箱，无法调整, 建议删除
特征op_times_groups_ip_transform_ip_type_内网_error_code_cumsum，除特殊值分箱外，分箱箱数少于2箱，无法调整, 建议删除
特征op_times_groups_device_num_transform_url_xxx.com/github_error_code_cumsum，除特殊值分箱外，分箱箱数少于2箱，无法调整, 建议删除
特征op_days_device_num_transform_browser_edge_error_code_cumsum，除特殊值分箱外，分箱箱数少于2箱，无法调整, 建议删除
特征op_days_ip_transform

[INFO] converting into woe values ...
Woe transformating on 47660 rows and 334 columns in 00:00:29
[INFO] converting into woe values ...
Woe transformating on 25710 rows and 334 columns in 00:00:16


sheet09.调整后分箱结果在06_初始分箱结果.xlsx中已经存在，我们将对原文件进行覆盖


## modeling

In [9]:

import time
from sklearn.metrics import roc_auc_score as auc
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold, KFold

params = {
    'learning_rate': 0.05,
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'num_leaves': 31,
    'verbose': -1,
    'seed': 2222,
    'n_jobs': -1,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.9,
    'bagging_freq': 4,
    # 'min_child_weight': 10,
}

fold_num = 5
seeds = [2022]
oof = np.zeros(len(df_train))
importance = 0
pred_y = pd.DataFrame()
score = []
for seed in seeds:
    kf = StratifiedKFold(n_splits=fold_num, shuffle=True, random_state=seed)
    # kf = KFold(n_splits=fold_num, shuffle=True, random_state=seed)
    for fold, (train_idx, val_idx) in enumerate(kf.split(df_train[feats], df_train[y_label])):
        print('-----------', fold)
        train = lgb.Dataset(df_train.loc[train_idx, feats],
                            df_train.loc[train_idx, y_label],
                           # categorical_feature=categorical_feats
                           )
        val = lgb.Dataset(df_train.loc[val_idx, feats],
                          df_train.loc[val_idx, y_label],
                          #categorical_feature=categorical_feats
                         )
        model = lgb.train(params, train, valid_sets=[val], 
                          num_boost_round=20000, early_stopping_rounds=100)

        oof[val_idx] += model.predict(df_train.loc[val_idx, feats]) / len(seeds)
        pred_y['fold_%d_seed_%d' % (fold, seed)] = model.predict(df_test[feats])
        importance += model.feature_importance(importance_type='gain') / fold_num
        score.append(auc(df_train.loc[val_idx, y_label], model.predict(df_train.loc[val_idx, feats])))
feats_importance = pd.DataFrame()
feats_importance['name'] = feats
feats_importance['importance'] = importance
display(feats_importance.sort_values('importance', ascending=False)[:30])

df_train['oof'] = oof
display(np.mean(score), np.std(score))

score = np.mean(score)
df_test[y_label] = pred_y.mean(axis=1).values
df_test = df_test.sort_values('id').reset_index(drop=True)

sub = pd.read_csv(path_sample_submission)
sub[y_label] = df_test[y_label].values
sub.to_csv(os.path.join(path_results_jupyter,time.strftime('lgb_%Y%m%d%H%M_')+'%.5f.csv'%score), index=False)

----------- 0
[1]	valid_0's auc: 0.923045
Training until validation scores don't improve for 100 rounds
[2]	valid_0's auc: 0.924755
[3]	valid_0's auc: 0.92591
[4]	valid_0's auc: 0.926979
[5]	valid_0's auc: 0.926943
[6]	valid_0's auc: 0.927341
[7]	valid_0's auc: 0.927609
[8]	valid_0's auc: 0.930244
[9]	valid_0's auc: 0.929873
[10]	valid_0's auc: 0.929693
[11]	valid_0's auc: 0.929365
[12]	valid_0's auc: 0.929509
[13]	valid_0's auc: 0.924587
[14]	valid_0's auc: 0.924664
[15]	valid_0's auc: 0.924767
[16]	valid_0's auc: 0.924848
[17]	valid_0's auc: 0.924932
[18]	valid_0's auc: 0.925096
[19]	valid_0's auc: 0.925777
[20]	valid_0's auc: 0.925705
[21]	valid_0's auc: 0.925775
[22]	valid_0's auc: 0.928089
[23]	valid_0's auc: 0.928107
[24]	valid_0's auc: 0.928171
[25]	valid_0's auc: 0.928167
[26]	valid_0's auc: 0.930038
[27]	valid_0's auc: 0.930332
[28]	valid_0's auc: 0.930806
[29]	valid_0's auc: 0.930774
[30]	valid_0's auc: 0.930966
[31]	valid_0's auc: 0.930256
[32]	valid_0's auc: 0.930157
[33]	v

[104]	valid_0's auc: 0.934237
[105]	valid_0's auc: 0.934058
[106]	valid_0's auc: 0.933915
[107]	valid_0's auc: 0.933775
[108]	valid_0's auc: 0.933713
[109]	valid_0's auc: 0.933586
[110]	valid_0's auc: 0.933596
[111]	valid_0's auc: 0.93363
[112]	valid_0's auc: 0.933565
[113]	valid_0's auc: 0.933556
[114]	valid_0's auc: 0.933615
[115]	valid_0's auc: 0.933712
[116]	valid_0's auc: 0.933635
[117]	valid_0's auc: 0.933814
[118]	valid_0's auc: 0.933972
[119]	valid_0's auc: 0.934193
[120]	valid_0's auc: 0.934361
[121]	valid_0's auc: 0.934265
[122]	valid_0's auc: 0.934324
[123]	valid_0's auc: 0.934409
[124]	valid_0's auc: 0.934408
[125]	valid_0's auc: 0.934409
[126]	valid_0's auc: 0.934439
[127]	valid_0's auc: 0.934381
[128]	valid_0's auc: 0.934338
[129]	valid_0's auc: 0.934352
[130]	valid_0's auc: 0.934262
[131]	valid_0's auc: 0.934204
[132]	valid_0's auc: 0.934097
[133]	valid_0's auc: 0.934152
[134]	valid_0's auc: 0.934176
[135]	valid_0's auc: 0.934234
[136]	valid_0's auc: 0.934224
[137]	valid

[67]	valid_0's auc: 0.930361
[68]	valid_0's auc: 0.930362
[69]	valid_0's auc: 0.930517
[70]	valid_0's auc: 0.930419
[71]	valid_0's auc: 0.930341
[72]	valid_0's auc: 0.930477
[73]	valid_0's auc: 0.930363
[74]	valid_0's auc: 0.930381
[75]	valid_0's auc: 0.93042
[76]	valid_0's auc: 0.930413
[77]	valid_0's auc: 0.930719
[78]	valid_0's auc: 0.93075
[79]	valid_0's auc: 0.931325
[80]	valid_0's auc: 0.931474
[81]	valid_0's auc: 0.931618
[82]	valid_0's auc: 0.931601
[83]	valid_0's auc: 0.931294
[84]	valid_0's auc: 0.931343
[85]	valid_0's auc: 0.931607
[86]	valid_0's auc: 0.931527
[87]	valid_0's auc: 0.931361
[88]	valid_0's auc: 0.931501
[89]	valid_0's auc: 0.931622
[90]	valid_0's auc: 0.931686
[91]	valid_0's auc: 0.931731
[92]	valid_0's auc: 0.93149
[93]	valid_0's auc: 0.931275
[94]	valid_0's auc: 0.931296
[95]	valid_0's auc: 0.931116
[96]	valid_0's auc: 0.930953
[97]	valid_0's auc: 0.930809
[98]	valid_0's auc: 0.930951
[99]	valid_0's auc: 0.930967
[100]	valid_0's auc: 0.930966
[101]	valid_0's 

Unnamed: 0,name,importance
38,system_system_op_times_groups_http_status_code...,66262.463128
25,op_month_ip_transform_cumsum_woe,29930.62893
8,op_times_groups_device_num_transform_url_wpsdo...,20654.046046
44,op_days_ip_transform_http_status_code_200_cums...,13688.7705
22,op_month_device_num_transform_cumsum_woe,7823.45211
19,system_op_days_ip_type_内网_cumsum_woe,7478.421022
10,op_month_user_name_http_status_code_200_error_...,7362.788845
9,system_op_days_ip_type_内网_error_code_cumsum_woe,4947.873127
29,system_system_op_times_groups_log_system_trans...,3192.853545
12,op_times_groups_device_num_transform_departmen...,3084.421558


0.9328122491250795

0.0020107975873669267

KeyError: 'id'

In [10]:
feats_importance.sort_values('importance', ascending=False)[:50]

Unnamed: 0,name,importance
38,system_system_op_times_groups_http_status_code...,66262.463128
25,op_month_ip_transform_cumsum_woe,29930.62893
8,op_times_groups_device_num_transform_url_wpsdo...,20654.046046
44,op_days_ip_transform_http_status_code_200_cums...,13688.7705
22,op_month_device_num_transform_cumsum_woe,7823.45211
19,system_op_days_ip_type_内网_cumsum_woe,7478.421022
10,op_month_user_name_http_status_code_200_error_...,7362.788845
9,system_op_days_ip_type_内网_error_code_cumsum_woe,4947.873127
29,system_system_op_times_groups_log_system_trans...,3192.853545
12,op_times_groups_device_num_transform_departmen...,3084.421558


In [11]:
feats_importance.shape

(45, 2)

In [12]:
feats = feats_importance.name
feats = [i[:-4] for i in feats]

In [13]:
feats

['op_month_ip_transform_browser_version_edge_93_cumsum',
 'system_system_op_times_groups_browser_version_edge_93_cumsum',
 'op_times_groups_device_num_transform_op_city_成都_cumsum',
 'op_month_ip_transform_op_city_成都_cumsum',
 'system_op_days_log_system_transform_nHrKgKdJ1Mzt_cumsum',
 'system_op_days_browser_chrome_cumsum',
 'system_system_op_times_groups_op_city_深圳_cumsum',
 'op_days_user_name_browser_version_edge_93_cumsum',
 'op_times_groups_device_num_transform_url_wpsdoc.xxx.com/download_cumsum',
 'system_op_days_ip_type_内网_error_code_cumsum',
 'op_month_user_name_http_status_code_200_error_code_cumsum',
 'system_op_days_os_version_win10_error_code_cumsum',
 'op_times_groups_device_num_transform_department_rd_cumsum',
 'op_days_ip_transform_department_sales_cumsum',
 'op_month_device_num_transform_browser_version_chrome_90_error_code_cumsum',
 'op_month_ip_transform_op_city_深圳_cumsum',
 'op_times_groups_device_num_transform_browser_edge_cumsum',
 'op_month_ip_transform_op_city_北京_

# 方案二-简化

In [6]:
df = pd.concat([df_row_train, df_row_val]).reset_index(drop=True)
df = df.sort_values(by='op_datetime')

# 认证日期时间
df['op_datetime'] = pd.to_datetime(df['op_datetime'])
# 将数据分为每一天
df['op_days'] = df['op_datetime'].map(lambda x: x.strftime('%Y-%m-%d'))

# 按客户进行统计这次认证和上次认证的时间差
df['op_second'] = df['op_datetime']
df['op_second1'] = df.groupby('device_num_transform')['op_second'].shift(1)
df['op_diff_second1'] = (df['op_second'] - df['op_second1']).map(lambda x: x.total_seconds())

df['op_diff_second1_tmp'] = df['op_diff_second1']>10
# 客户第几次登录
df['op_times_groups'] = df.groupby('device_num_transform')['op_diff_second1_tmp'].apply(lambda x: x.cumsum())

# 系统层面的一段时间
df['system_op_second'] = df['op_datetime']
df['system_op_second1'] = df['system_op_second'].shift(1)
df['system_op_diff_second1'] = (df['system_op_second'] - df['system_op_second1']).map(lambda x: x.total_seconds())

df['system_op_diff_second1_tmp'] = df['system_op_diff_second1']>400
# 客户第几次登录
df['system_op_times_groups'] = df['system_op_diff_second1_tmp'].cumsum()

df = df.drop(columns=['op_second','op_second1','op_diff_second1_tmp', 'op_diff_second1', 'system_op_diff_second1',
                      'system_op_second', 'system_op_second1', 'system_op_diff_second1_tmp'])

## 特征生成

In [7]:

time_feats = ['system_op_times_groups', 'op_days', 'op_month']

cate_feats = ['ip_transform', 'user_name', 'device_num_transform', 'department', 'browser_version', 'browser', 'os_type','os_version',
              'ip_type','http_status_code', 'op_city', 'log_system_transform', 'url']

# 客户+时间+按时间cumsum/cumunique
# 累计量统计
df['helper'] = 1
# 是否为异常状态码
df['http_status_code_helper'] = df['http_status_code'].map(lambda x: 1 if x in [400, 500, 502, 404] else 0) 
# 给样本编号
df['sampler_index_helper'] = df['helper'].cumsum()

for i in time_feats:
    i_tmp = df.groupby([i])
    # 系统往前看，处理了多少事
    df['system_{}_cumsum'.format(i)] = i_tmp['helper'].cumsum()
    # 系统往前看，处理了多少坏事
    df['system_{}_error_code_cumsum'.format(i)] = i_tmp['http_status_code_helper'].cumsum()
    
    for j in cate_feats:
        index_set = set(df.groupby([i, j],as_index=False).first()['sampler_index_helper'].values)
        df['tmp_helper'] = df['sampler_index_helper'].map(lambda x: 1 if x in index_set else 0)
        j_tmp = df.groupby([i, j])
        # 系统往前看，不同维度的处理了多少情况
        df['system_{}_{}_cumunique'.format(i, j)] = j_tmp['tmp_helper'].cumsum()
    

        if j not in ['ip_transform', 'user_name', 'device_num_transform']:
            for k in df[j].unique():
                tmp = df[df[j]==k].groupby([i])

                # 系统往前看，不同维度不同情况分别处理了多少次
                df['system_{}_{}_{}_cumsum'.format(i,j,k)] = tmp['helper'].cumsum()
                # 系统往前看，不同维度不同情况error_code分别处理了多少次
                df['system_{}_{}_{}_error_code_cumsum'.format(i,j,k)] = tmp['http_status_code_helper'].cumsum()

remove_cols = [x for x in df.columns if x[-6:]=='helper']

df = df.drop(columns=remove_cols)

In [8]:

time_feats = ['op_times_groups', 'op_days', 'op_month']

object_feats = ['ip_transform', 'user_name', 'device_num_transform']

cate_feats = ['ip_transform', 'user_name', 'device_num_transform', 'department', 'browser_version', 'browser', 'os_type','os_version',
              'ip_type','http_status_code', 'op_city', 'log_system_transform', 'url']

# 客户+时间+按时间cumsum/cumunique
# 累计量统计
df['helper'] = 1
# 是否为异常状态码
df['http_status_code_helper'] = df['http_status_code'].map(lambda x: 1 if x in [400, 500, 502, 404] else 0) 
# 给样本编号
df['sampler_index_helper'] = df['helper'].cumsum()


for i in time_feats:
    for j in object_feats:
        j_tmp = df.groupby([i,j])
        df['{}_{}_cumsum'.format(i,j)] = j_tmp['helper'].cumsum()
        df['{}_{}_error_code_cumsum'.format(i,j)] = j_tmp['http_status_code_helper'].cumsum()
        
        for k in cate_feats:
            if k == j: continue
            index_set = set(df.groupby([i,j,k], as_index=False).first()['sampler_index_helper'].values)            
            df['tmp_helper'] = df['sampler_index_helper'].map(lambda x: 1 if x in index_set else 0)
            k_tmp = df.groupby([i,j,k])
            df['{}_{}_{}_cumunique'.format(i,j,k)] = k_tmp['tmp_helper'].cumsum()

            if k not in ['ip_transform', 'user_name', 'device_num_transform']:
                for v in df[k].unique():
                    v_tmp = df[df[k]==v].groupby([i,j])
                    df['{}_{}_{}_{}_cumsum'.format(i,j,k,v)] = v_tmp['helper'].cumsum()
                    df['{}_{}_{}_{}_error_code_cumsum'.format(i,j,k,v)] = v_tmp['http_status_code_helper'].cumsum()

remove_cols = [x for x in df.columns if x[-6:]=='helper']

df = df.drop(columns=remove_cols)

In [9]:
remove_cols = ['ip_transform', 'user_name', 'device_num_transform', 'department', 'browser_version', 'browser', 'os_type','os_version',
              'ip_type','http_status_code', 'op_city', 'log_system_transform', 'url', 'op_datetime', 'op_month']

df = df.drop(columns=remove_cols)

In [10]:
df_train = df[df[y_label].notna()].reset_index(drop=True)
df_test = df[df[y_label].isna()].reset_index(drop=True)

In [11]:
feats = ['op_month_ip_transform_browser_version_edge_93_cumsum',
 'system_system_op_times_groups_browser_version_edge_93_cumsum',
 'op_times_groups_device_num_transform_op_city_成都_cumsum',
 'op_month_ip_transform_op_city_成都_cumsum',
 'system_op_days_log_system_transform_nHrKgKdJ1Mzt_cumsum',
 'system_op_days_browser_chrome_cumsum',
 'system_system_op_times_groups_op_city_深圳_cumsum',
 'op_days_user_name_browser_version_edge_93_cumsum',
 'op_times_groups_device_num_transform_url_wpsdoc.xxx.com/download_cumsum',
 'system_op_days_ip_type_内网_error_code_cumsum',
 'op_month_user_name_http_status_code_200_error_code_cumsum',
 'system_op_days_os_version_win10_error_code_cumsum',
 'op_times_groups_device_num_transform_department_rd_cumsum',
 'op_days_ip_transform_department_sales_cumsum',
 'op_month_device_num_transform_browser_version_chrome_90_error_code_cumsum',
 'op_month_ip_transform_op_city_深圳_cumsum',
 'op_times_groups_device_num_transform_browser_edge_cumsum',
 'op_month_ip_transform_op_city_北京_cumsum',
 'op_month_device_num_transform_op_city_北京_error_code_cumsum',
 'system_op_days_ip_type_内网_cumsum',
 'op_times_groups_user_name_browser_chrome_cumsum',
 'op_month_user_name_op_city_杭州_error_code_cumsum',
 'op_month_device_num_transform_cumsum',
 'op_days_ip_transform_cumsum',
 'op_days_user_name_ip_type_内网_cumsum',
 'op_month_ip_transform_cumsum',
 'system_op_days_op_city_杭州_cumsum',
 'op_month_user_name_op_city_深圳_error_code_cumsum',
 'op_days_device_num_transform_op_city_成都_cumsum',
 'system_system_op_times_groups_log_system_transform_nHrKgKdJ1Mzt_cumsum',
 'op_days_ip_transform_url_xxx.com/github_cumsum',
 'op_month_device_num_transform_ip_type_内网_error_code_cumsum',
 'op_days_user_name_department_rd_error_code_cumsum',
 'op_days_ip_transform_op_city_北京_cumsum',
 'op_days_device_num_transform_op_city_杭州_cumsum',
 'system_system_op_times_groups_op_city_杭州_cumsum',
 'system_op_days_op_city_北京_cumsum',
 'system_system_op_times_groups_browser_chrome_cumsum',
 'system_system_op_times_groups_http_status_code_200_cumsum',
 'system_op_days_op_city_深圳_cumsum',
 'op_days_user_name_browser_chrome_cumsum',
 'op_month_ip_transform_ip_type_内网_cumsum',
 'system_system_op_times_groups_department_rd_error_code_cumsum',
 'op_days_device_num_transform_op_city_深圳_cumsum',
 'op_days_ip_transform_http_status_code_200_cumsum']

## modeling

In [14]:

import time
from sklearn.metrics import roc_auc_score as auc
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold, KFold

params = {
    'learning_rate': 0.05,
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'num_leaves': 31,
    'verbose': -1,
    'seed': 2222,
    'n_jobs': -1,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.9,
    'bagging_freq': 4,
    # 'min_child_weight': 10,
}

fold_num = 5
seeds = [2022]
oof = np.zeros(len(df_train))
importance = 0
pred_y = pd.DataFrame()
score = []
for seed in seeds:
    kf = StratifiedKFold(n_splits=fold_num, shuffle=True, random_state=seed)
    # kf = KFold(n_splits=fold_num, shuffle=True, random_state=seed)
    for fold, (train_idx, val_idx) in enumerate(kf.split(df_train[feats], df_train[y_label])):
        print('-----------', fold)
        train = lgb.Dataset(df_train.loc[train_idx, feats],
                            df_train.loc[train_idx, y_label],
                           # categorical_feature=categorical_feats
                           )
        val = lgb.Dataset(df_train.loc[val_idx, feats],
                          df_train.loc[val_idx, y_label],
                          #categorical_feature=categorical_feats
                         )
        model = lgb.train(params, train, valid_sets=[val], 
                          num_boost_round=20000, early_stopping_rounds=100)

        oof[val_idx] += model.predict(df_train.loc[val_idx, feats]) / len(seeds)
        pred_y['fold_%d_seed_%d' % (fold, seed)] = model.predict(df_test[feats])
        importance += model.feature_importance(importance_type='gain') / fold_num
        score.append(auc(df_train.loc[val_idx, y_label], model.predict(df_train.loc[val_idx, feats])))
feats_importance = pd.DataFrame()
feats_importance['name'] = feats
feats_importance['importance'] = importance
display(feats_importance.sort_values('importance', ascending=False)[:30])

df_train['oof'] = oof
display(np.mean(score), np.std(score))

score = np.mean(score)
df_test[y_label] = pred_y.mean(axis=1).values
df_test = df_test.sort_values('id').reset_index(drop=True)

sub = pd.read_csv(path_sample_submission)
sub[y_label] = df_test[y_label].values
sub.to_csv(os.path.join(path_results_jupyter,time.strftime('lgb_%Y%m%d%H%M_')+'%.5f.csv'%score), index=False)

----------- 0
[1]	valid_0's auc: 0.929129
Training until validation scores don't improve for 100 rounds
[2]	valid_0's auc: 0.929328
[3]	valid_0's auc: 0.930972
[4]	valid_0's auc: 0.930954
[5]	valid_0's auc: 0.931506
[6]	valid_0's auc: 0.934219
[7]	valid_0's auc: 0.933651
[8]	valid_0's auc: 0.930657
[9]	valid_0's auc: 0.93012
[10]	valid_0's auc: 0.930171
[11]	valid_0's auc: 0.932121
[12]	valid_0's auc: 0.932142
[13]	valid_0's auc: 0.932619
[14]	valid_0's auc: 0.932567
[15]	valid_0's auc: 0.932581
[16]	valid_0's auc: 0.932562
[17]	valid_0's auc: 0.932596
[18]	valid_0's auc: 0.932849
[19]	valid_0's auc: 0.933922
[20]	valid_0's auc: 0.933738
[21]	valid_0's auc: 0.933502
[22]	valid_0's auc: 0.932484
[23]	valid_0's auc: 0.932479
[24]	valid_0's auc: 0.932682
[25]	valid_0's auc: 0.932984
[26]	valid_0's auc: 0.93279
[27]	valid_0's auc: 0.932916
[28]	valid_0's auc: 0.932943
[29]	valid_0's auc: 0.932897
[30]	valid_0's auc: 0.933619
[31]	valid_0's auc: 0.933487
[32]	valid_0's auc: 0.933365
[33]	va

[109]	valid_0's auc: 0.936483
[110]	valid_0's auc: 0.936336
[111]	valid_0's auc: 0.936472
[112]	valid_0's auc: 0.936262
[113]	valid_0's auc: 0.936144
[114]	valid_0's auc: 0.936141
[115]	valid_0's auc: 0.936132
[116]	valid_0's auc: 0.936016
[117]	valid_0's auc: 0.935864
[118]	valid_0's auc: 0.935784
[119]	valid_0's auc: 0.936114
[120]	valid_0's auc: 0.936032
[121]	valid_0's auc: 0.936067
[122]	valid_0's auc: 0.936111
[123]	valid_0's auc: 0.936119
[124]	valid_0's auc: 0.93607
[125]	valid_0's auc: 0.93589
[126]	valid_0's auc: 0.935696
[127]	valid_0's auc: 0.935419
[128]	valid_0's auc: 0.935359
[129]	valid_0's auc: 0.93532
[130]	valid_0's auc: 0.935383
[131]	valid_0's auc: 0.935644
[132]	valid_0's auc: 0.93558
[133]	valid_0's auc: 0.935699
[134]	valid_0's auc: 0.935817
[135]	valid_0's auc: 0.935724
[136]	valid_0's auc: 0.935816
[137]	valid_0's auc: 0.935805
[138]	valid_0's auc: 0.935784
[139]	valid_0's auc: 0.935744
[140]	valid_0's auc: 0.93569
[141]	valid_0's auc: 0.936237
[142]	valid_0's

[110]	valid_0's auc: 0.934949
[111]	valid_0's auc: 0.934905
[112]	valid_0's auc: 0.935081
[113]	valid_0's auc: 0.935354
[114]	valid_0's auc: 0.93542
[115]	valid_0's auc: 0.935274
[116]	valid_0's auc: 0.934963
[117]	valid_0's auc: 0.934774
[118]	valid_0's auc: 0.934562
[119]	valid_0's auc: 0.934322
[120]	valid_0's auc: 0.934129
[121]	valid_0's auc: 0.934621
[122]	valid_0's auc: 0.93501
[123]	valid_0's auc: 0.935034
[124]	valid_0's auc: 0.934914
[125]	valid_0's auc: 0.935034
[126]	valid_0's auc: 0.934923
[127]	valid_0's auc: 0.934939
[128]	valid_0's auc: 0.934883
[129]	valid_0's auc: 0.934808
[130]	valid_0's auc: 0.934747
[131]	valid_0's auc: 0.934548
[132]	valid_0's auc: 0.934263
[133]	valid_0's auc: 0.934392
[134]	valid_0's auc: 0.934348
[135]	valid_0's auc: 0.934317
[136]	valid_0's auc: 0.934139
[137]	valid_0's auc: 0.933863
[138]	valid_0's auc: 0.933816
[139]	valid_0's auc: 0.933755
[140]	valid_0's auc: 0.933834
[141]	valid_0's auc: 0.933753
[142]	valid_0's auc: 0.933699
[143]	valid_

[47]	valid_0's auc: 0.935362
[48]	valid_0's auc: 0.935519
[49]	valid_0's auc: 0.935409
[50]	valid_0's auc: 0.935012
[51]	valid_0's auc: 0.935169
[52]	valid_0's auc: 0.934316
[53]	valid_0's auc: 0.93382
[54]	valid_0's auc: 0.93346
[55]	valid_0's auc: 0.932959
[56]	valid_0's auc: 0.93528
[57]	valid_0's auc: 0.93554
[58]	valid_0's auc: 0.935866
[59]	valid_0's auc: 0.935858
[60]	valid_0's auc: 0.935738
[61]	valid_0's auc: 0.935873
[62]	valid_0's auc: 0.936038
[63]	valid_0's auc: 0.936131
[64]	valid_0's auc: 0.934878
[65]	valid_0's auc: 0.934675
[66]	valid_0's auc: 0.93477
[67]	valid_0's auc: 0.934402
[68]	valid_0's auc: 0.934061
[69]	valid_0's auc: 0.934025
[70]	valid_0's auc: 0.934061
[71]	valid_0's auc: 0.934043
[72]	valid_0's auc: 0.933734
[73]	valid_0's auc: 0.934146
[74]	valid_0's auc: 0.934404
[75]	valid_0's auc: 0.933763
[76]	valid_0's auc: 0.934067
[77]	valid_0's auc: 0.93433
[78]	valid_0's auc: 0.934214
[79]	valid_0's auc: 0.934167
[80]	valid_0's auc: 0.933767
[81]	valid_0's auc: 

Unnamed: 0,name,importance
38,system_system_op_times_groups_http_status_code...,91885.027613
25,op_month_ip_transform_cumsum,21363.149442
8,op_times_groups_device_num_transform_url_wpsdo...,19078.437135
19,system_op_days_ip_type_内网_cumsum,17813.608928
24,op_days_user_name_ip_type_内网_cumsum,9218.264851
41,op_month_ip_transform_ip_type_内网_cumsum,8795.538688
44,op_days_ip_transform_http_status_code_200_cumsum,7197.421087
10,op_month_user_name_http_status_code_200_error_...,5785.966296
12,op_times_groups_device_num_transform_departmen...,4695.032086
22,op_month_device_num_transform_cumsum,4441.231924


0.9376834078230324

0.0013897659538466263

In [15]:
feats_importance.sort_values('importance', ascending=False)[:50]

Unnamed: 0,name,importance
38,system_system_op_times_groups_http_status_code...,91885.027613
25,op_month_ip_transform_cumsum,21363.149442
8,op_times_groups_device_num_transform_url_wpsdo...,19078.437135
19,system_op_days_ip_type_内网_cumsum,17813.608928
24,op_days_user_name_ip_type_内网_cumsum,9218.264851
41,op_month_ip_transform_ip_type_内网_cumsum,8795.538688
44,op_days_ip_transform_http_status_code_200_cumsum,7197.421087
10,op_month_user_name_http_status_code_200_error_...,5785.966296
12,op_times_groups_device_num_transform_departmen...,4695.032086
22,op_month_device_num_transform_cumsum,4441.231924


# 方案三-类别特征编码

In [4]:
df_train = df_row_train
df_test = df_row_val

In [5]:
df_train.head()

Unnamed: 0,id,user_name,department,ip_transform,device_num_transform,browser_version,browser,os_type,os_version,op_datetime,ip_type,http_status_code,op_city,log_system_transform,url,op_month,is_risk
0,0,guojianping9672,rd,GVhZtW4i1,rqRxAjAL1RYC,firefox_78,firefox,win,win10,2022-01-18 19:10:41,内网,200,成都,2umVQwhiiwNJ,xxx.com/mail,2022-01,0
1,1,yangtao1740,sales,l3MuTMPoQ,iKPTa3su50y7,chrome_93,chrome,win,win11,2022-04-01 17:04:00,内网,200,深圳,RwHe8Q1R7AlB,business.xxx.com/,2022-04,0
2,2,wangying9098,rd,4uHWcskWv,1baNbqxMWcCu,ie_11,ie,win,win10,2022-03-01 15:53:49,内网,200,成都,dwS3cdn15GK4,wpsdoc.xxx.com/kdocs,2022-03,0
3,3,liguixiang3860,rd,mQh3NwtY7,C04Llg4lKl4C,edge_93,edge,win,win10,2022-02-07 19:46:25,内网,200,北京,nHrKgKdJ1Mzt,xxx.com/github,2022-02,0
4,4,guanyu9205,sales,C2QtgDKAZ,kSscjiRSz1aD,edge_93,edge,win,win10,2022-04-12 10:05:19,内网,200,成都,RwHe8Q1R7AlB,business.xxx.com/,2022-04,0


In [6]:
df_feats = sp.detect(df_train)
df_feats.head()

Unnamed: 0,feat_name_row,type,size,missing,unique,zero_ratio,negative_ratio,top1_all_value,top1_all_ratio,mean_or_top1,std_or_top2,min_or_top3,1%_or_top4,10%_or_top5,50%_or_bottom5,75%_or_bottom4,90%_or_bottom3,99%_or_bottom2,max_or_bottom1
0,id,int64,47660,0.0,47660,0.0,0.0,0,0.0,23829.5,13758.401,0.0,476.59,4765.9,23829.5,35744.25,42893.1,47182.41,47659.0
1,user_name,object,47660,0.08,187,0.0,,xuxiuying8050,0.007,xuxiuying8050:0.65%,hongchang3029:0.63%,tanliu3173:0.62%,liuhong6350:0.62%,lufan2545:0.62%,zhouxiumei4433:0.38%,chenjian4844:0.37%,wanggang1192:0.36%,ranxiuzhen6780:0.33%,xujie9775:0.30%
2,department,object,47660,0.08,5,0.0,,rd,0.654,rd:65.36%,sales:17.26%,other:4.07%,accounting:3.56%,hr:1.75%,rd:65.36%,sales:17.26%,other:4.07%,accounting:3.56%,hr:1.75%
3,ip_transform,object,47660,0.0,2105,0.0,,w2CfuqTz3,0.007,w2CfuqTz3:0.68%,u9diCFdYZ:0.66%,pPgzIf3S4:0.65%,7YnPN3fqd:0.65%,DhTMwbtS5:0.64%,948U9MQcB:0.00%,h75YAkAAL:0.00%,m7512MutA:0.00%,ADL8GwW32:0.00%,g3dWezpzT:0.00%
4,device_num_transform,object,47660,0.0,844,0.0,,O54DfqjlCrhL,0.007,O54DfqjlCrhL:0.70%,kUa61ygA6gI3:0.68%,Rfv57YyO3vny:0.67%,5DmlITfRNR36:0.66%,TzmgdvYq3Kx0:0.66%,aUECyyFo55Zy:0.00%,cREgOG9x3d9X:0.00%,NGfeE42d1yHY:0.00%,T4hueKNccs7X:0.00%,A0TLDctT8OUR:0.00%


In [7]:
num_miss = list(df_feats[df_feats['type']!='object'].index)
char_miss = list(df_feats[df_feats['type']=='object'].index)
print('数值型变量个数：%d'%len(num_miss))
print('字符型变量个数：%d'%len(char_miss))

数值型变量个数：3
字符型变量个数：14


In [9]:
char_miss

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15]

In [12]:
df_train = df_train.fillna(' ')

In [13]:

from itertools import combinations
from catboost import CatBoostClassifier

# Keep list of all categorical features in dataset to specify this for CatBoost
cat_features_ids = char_miss
# Train the model:
clf = CatBoostClassifier(learning_rate=0.1, iterations=1000, random_seed=0, logging_level='Silent')
clf.fit(df_train.drop(columns=[y_label]), df_train[y_label], cat_features=cat_features_ids)

<catboost.core.CatBoostClassifier at 0x7fac0deb77d0>

In [15]:

import time
from sklearn.metrics import roc_auc_score as auc
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold, KFold



In [17]:
auc(df_train[y_label], clf.predict_proba(df_train.drop(columns=[y_label]))[:,1])

0.9755946184466229

In [21]:
feats_importance = pd.DataFrame()

feats_importance['name'] = df_train.drop(columns=[y_label]).columns
feats_importance['importance'] = clf.feature_importances_
feats_importance.sort_values(by='importance', ascending=False)

Unnamed: 0,name,importance
0,id,25.598064
14,url,11.080769
15,op_month,9.181134
3,ip_transform,8.50868
4,device_num_transform,7.931312
13,log_system_transform,7.79724
1,user_name,6.282338
12,op_city,5.953113
11,http_status_code,4.235067
5,browser_version,3.997807


# 方案四-Embeding

## Embeding

In [4]:
df = pd.concat([df_row_train, df_row_val]).reset_index(drop=True)
df = df.sort_values(by='op_datetime')

# 认证日期时间
df['op_datetime'] = pd.to_datetime(df['op_datetime'])
# 将数据分为每一天
df['op_days'] = df['op_datetime'].map(lambda x: x.strftime('%Y-%m-%d'))

In [5]:
cate_feats = ['ip_transform', 'user_name', 'device_num_transform', 'department', 'browser_version', 'browser', 'os_type','os_version',
              'ip_type','http_status_code', 'op_city', 'log_system_transform', 'url']



In [6]:
for col in cate_feats:
    df_cols = df.groupby(['op_days'])[col].agg(lambda x: " ".join([str(i) for i in list(x)]))
    df_cols.to_csv(os.path.join(path_new_data, '{}.txt'.format(col)), index=False, header=False, sep='\t')

In [7]:
def train_embedding(path_corpus, path_save_models, path_save_txt, col):
    sentences = word2vec.Text8Corpus(path_corpus)  # 原始语料路径,已分词
    # 训练代码
    model = word2vec.Word2Vec(sentences, sg=1, vector_size=5, window=12, min_count=1,
                              hs=0,  workers=10, epochs=10)
    # save
    path_embedding_model = os.path.join(path_save_models, 'models_{}.model'.format(str(col)))
    path_embedding_vocab = os.path.join(path_save_txt, 'models_{}_embedding.txt'.format(str(col)))

    model.save(path_embedding_model)
    model.wv.save_word2vec_format(path_embedding_vocab)
    print('词向量训练完成：{}'.format(str(col)))

In [8]:
for col in cate_feats:
    train_embedding(
        path_corpus = os.path.join(path_new_data, '{}.txt'.format(col)),
        path_save_models = os.path.join(path_new_data, 'corpus_models'),
        path_save_txt = os.path.join(path_new_data, 'corpus_txt'),
        col = col,
    )

词向量训练完成：ip_transform
词向量训练完成：user_name
词向量训练完成：device_num_transform
词向量训练完成：department
词向量训练完成：browser_version
词向量训练完成：browser
词向量训练完成：os_type
词向量训练完成：os_version
词向量训练完成：ip_type
词向量训练完成：http_status_code
词向量训练完成：op_city
词向量训练完成：log_system_transform
词向量训练完成：url


## 特征合成

In [9]:
path_embeding = os.path.join(path_new_data, 'corpus_txt')

res_cols = []
for col in cate_feats:
    df_tmp = pd.read_csv(os.path.join(path_embeding, 'models_{}_embedding.txt'.format(col)), skiprows=1, header=None, sep=' ')
    df_tmp.columns = ['{}_{}'.format(col, i) for i in df_tmp.columns]
    df = pd.merge(left=df, right=df_tmp, how='left', left_on=col, right_on='{}_0'.format(col))
    res_cols.append('{}_0'.format(col))
    

remove_cols = ['ip_transform', 'user_name', 'device_num_transform', 'department', 'browser_version', 'browser', 'os_type','os_version',
              'ip_type','http_status_code', 'op_city', 'log_system_transform', 'url', 'op_datetime', 'op_month'] + res_cols

df = df.drop(columns=remove_cols)

In [10]:
df.head()

Unnamed: 0,id,is_risk,op_days,ip_transform_1,ip_transform_2,ip_transform_3,ip_transform_4,ip_transform_5,user_name_1,user_name_2,user_name_3,user_name_4,user_name_5,device_num_transform_1,device_num_transform_2,device_num_transform_3,device_num_transform_4,device_num_transform_5,department_1,department_2,department_3,department_4,department_5,browser_version_1,browser_version_2,browser_version_3,browser_version_4,browser_version_5,browser_1,browser_2,browser_3,browser_4,browser_5,os_type_1,os_type_2,os_type_3,os_type_4,os_type_5,os_version_1,os_version_2,os_version_3,os_version_4,os_version_5,ip_type_1,ip_type_2,ip_type_3,ip_type_4,ip_type_5,http_status_code_1,http_status_code_2,http_status_code_3,http_status_code_4,http_status_code_5,op_city_1,op_city_2,op_city_3,op_city_4,op_city_5,log_system_transform_1,log_system_transform_2,log_system_transform_3,log_system_transform_4,log_system_transform_5,url_1,url_2,url_3,url_4,url_5
0,44477,1.0,2022-01-07,-0.33038,1.129871,0.961583,-1.036296,-0.018143,1.267496,-0.132999,0.487008,-0.110869,0.024866,0.142125,-0.025883,1.171935,-0.236819,-0.863699,0.680698,-0.176669,0.453668,-0.426449,0.808273,0.316753,0.677824,0.507325,0.19064,-0.238577,-0.43749,-0.178135,-0.048388,0.660779,-0.583008,0.036222,0.13268,0.362773,-0.600175,-0.015433,-0.113136,0.587809,0.919612,0.293686,-0.609237,0.230143,-0.307836,-0.46708,0.365286,0.258894,0.138753,-0.556985,-0.593201,0.194739,1.012353,-0.394216,0.205529,-0.166808,0.574525,-0.393799,0.036253,0.201734,0.616837,0.101823,-0.538526,-0.325827,0.443498,0.516953,-0.199497,-0.37483
1,45489,1.0,2022-01-07,0.480117,0.466897,0.863924,-1.034336,-0.120813,0.385874,-0.35464,0.478967,-0.423299,-0.112615,0.727712,-0.286462,0.782063,-0.337248,-0.621135,0.680698,-0.176669,0.453668,-0.426449,0.808273,0.2033,0.735199,0.337215,0.260974,-0.339925,-0.433921,-0.162566,-0.030453,0.598946,-0.647217,0.123968,0.0388,0.089271,-0.686136,0.270111,-0.124564,0.728578,0.888321,0.234774,-0.535743,0.230143,-0.307836,-0.46708,0.365286,0.258894,0.138753,-0.556985,-0.593201,0.194739,1.012353,-0.337376,0.221963,-0.235764,0.59532,-0.367792,0.036253,0.201734,0.616837,0.101823,-0.538526,-0.325827,0.443498,0.516953,-0.199497,-0.37483
2,45706,1.0,2022-01-07,0.496152,-0.269469,1.23773,-1.125736,-0.032863,-0.223427,-0.173802,0.540786,-0.86274,0.21626,1.194992,-0.502227,0.836918,0.210922,-0.526698,0.43326,0.066907,0.743641,0.194118,1.090568,0.129662,0.807761,0.191058,0.129936,-0.48289,-0.775097,0.04896,0.001714,0.580933,-0.423884,0.036222,0.13268,0.362773,-0.600175,-0.015433,-0.101174,0.557539,0.965694,0.276071,-0.550529,0.230143,-0.307836,-0.46708,0.365286,0.258894,0.138753,-0.556985,-0.593201,0.194739,1.012353,-0.394216,0.205529,-0.166808,0.574525,-0.393799,0.002649,0.183322,0.585861,0.226186,-0.557642,-0.520717,0.423581,0.463295,-0.260969,-0.252226
3,45901,1.0,2022-01-07,-0.103946,0.79422,1.304546,-0.875434,-0.128924,0.757518,0.016061,0.366299,-0.259814,-0.851974,-0.04697,0.145993,1.017507,-0.853762,-0.856155,0.680698,-0.176669,0.453668,-0.426449,0.808273,0.362684,0.625959,0.352978,0.233141,-0.346408,-0.469713,-0.114776,-0.037209,0.702632,-0.5298,0.036222,0.13268,0.362773,-0.600175,-0.015433,-0.113136,0.587809,0.919612,0.293686,-0.609237,0.230143,-0.307836,-0.46708,0.365286,0.258894,0.138753,-0.556985,-0.593201,0.194739,1.012353,-0.398405,0.131198,-0.081122,0.612046,-0.356016,0.036253,0.201734,0.616837,0.101823,-0.538526,-0.325827,0.443498,0.516953,-0.199497,-0.37483
4,43827,1.0,2022-01-07,-0.212897,0.367712,0.325107,-1.549298,-0.302923,0.461826,-0.103717,0.30773,-0.893549,0.66008,0.357756,-0.572041,1.161138,-0.448016,-0.253548,0.653394,-0.123547,0.506458,-0.375551,0.842778,0.446013,0.208226,0.276145,0.48668,-0.779232,-0.231549,0.080135,0.003704,0.67525,-0.751565,0.036222,0.13268,0.362773,-0.600175,-0.015433,-0.113136,0.587809,0.919612,0.293686,-0.609237,0.230143,-0.307836,-0.46708,0.365286,0.258894,0.138753,-0.556985,-0.593201,0.194739,1.012353,-0.05034,0.373271,-0.481796,0.590685,-0.474255,0.050245,0.206227,0.564998,0.035857,-0.574978,-0.352987,0.514602,0.445111,-0.262223,-0.296265


In [11]:
df_row_train = df[df[y_label].notna()].reset_index(drop=True)
df_row_val = df[df[y_label].isna()].reset_index(drop=True)

df_train, df_test, convert_cols = sp.transform_data_detail(df_row_train, df_row_val, y_label, excel_path=path_output_report)

feats, categorical_feats = get_null_importance(df_train.drop(columns=[y_label,'id']).copy(),
                                               df_train[y_label].copy(), 
                                               thresholds=15)

sheet05.可能为数值类型的object类型数据统计在/Users/liliangshan/workspace/python/01_datasets/ccf_system_access_risk_identification/results/01_原始数据探察_20221014.xlsx中已经存在，我们将对原文件进行覆盖
sheet06.数据预处理在/Users/liliangshan/workspace/python/01_datasets/ccf_system_access_risk_identification/results/01_原始数据探察_20221014.xlsx中已经存在，我们将对原文件进行覆盖


## modeling

In [14]:
feats = feats_importance.sort_values('importance', ascending=False)[:20]['name'].values

In [15]:

import time
from sklearn.metrics import roc_auc_score as auc
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold, KFold

params = {
    'learning_rate': 0.05,
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'num_leaves': 31,
    'verbose': -1,
    'seed': 2222,
    'n_jobs': -1,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.9,
    'bagging_freq': 4,
    # 'min_child_weight': 10,
}

fold_num = 5
seeds = [2022]
oof = np.zeros(len(df_train))
importance = 0
pred_y = pd.DataFrame()
score = []
for seed in seeds:
    kf = StratifiedKFold(n_splits=fold_num, shuffle=True, random_state=seed)
    # kf = KFold(n_splits=fold_num, shuffle=True, random_state=seed)
    for fold, (train_idx, val_idx) in enumerate(kf.split(df_train[feats], df_train[y_label])):
        print('-----------', fold)
        train = lgb.Dataset(df_train.loc[train_idx, feats],
                            df_train.loc[train_idx, y_label],
                           # categorical_feature=categorical_feats
                           )
        val = lgb.Dataset(df_train.loc[val_idx, feats],
                          df_train.loc[val_idx, y_label],
                          #categorical_feature=categorical_feats
                         )
        model = lgb.train(params, train, valid_sets=[val], 
                          num_boost_round=20000, early_stopping_rounds=100)

        oof[val_idx] += model.predict(df_train.loc[val_idx, feats]) / len(seeds)
        pred_y['fold_%d_seed_%d' % (fold, seed)] = model.predict(df_test[feats])
        importance += model.feature_importance(importance_type='gain') / fold_num
        score.append(auc(df_train.loc[val_idx, y_label], model.predict(df_train.loc[val_idx, feats])))
feats_importance = pd.DataFrame()
feats_importance['name'] = feats
feats_importance['importance'] = importance
display(feats_importance.sort_values('importance', ascending=False)[:30])

df_train['oof'] = oof
display(np.mean(score), np.std(score))

score = np.mean(score)
df_test[y_label] = pred_y.mean(axis=1).values
df_test = df_test.sort_values('id').reset_index(drop=True)

sub = pd.read_csv(path_sample_submission)
sub[y_label] = df_test[y_label].values
sub.to_csv(os.path.join(path_results_jupyter,time.strftime('lgb_%Y%m%d%H%M_')+'%.5f.csv'%score), index=False)

----------- 0
[1]	valid_0's auc: 0.707502
Training until validation scores don't improve for 100 rounds
[2]	valid_0's auc: 0.713697
[3]	valid_0's auc: 0.722641
[4]	valid_0's auc: 0.722227
[5]	valid_0's auc: 0.724225
[6]	valid_0's auc: 0.725652
[7]	valid_0's auc: 0.724389
[8]	valid_0's auc: 0.725143
[9]	valid_0's auc: 0.723947
[10]	valid_0's auc: 0.72366
[11]	valid_0's auc: 0.722941
[12]	valid_0's auc: 0.722152
[13]	valid_0's auc: 0.722714
[14]	valid_0's auc: 0.7217
[15]	valid_0's auc: 0.722575
[16]	valid_0's auc: 0.721779
[17]	valid_0's auc: 0.721769
[18]	valid_0's auc: 0.721241
[19]	valid_0's auc: 0.721309
[20]	valid_0's auc: 0.721201
[21]	valid_0's auc: 0.721755
[22]	valid_0's auc: 0.721631
[23]	valid_0's auc: 0.721691
[24]	valid_0's auc: 0.721846
[25]	valid_0's auc: 0.721881
[26]	valid_0's auc: 0.72234
[27]	valid_0's auc: 0.722104
[28]	valid_0's auc: 0.72245
[29]	valid_0's auc: 0.722888
[30]	valid_0's auc: 0.722693
[31]	valid_0's auc: 0.722755
[32]	valid_0's auc: 0.72277
[33]	valid_

[109]	valid_0's auc: 0.728864
[110]	valid_0's auc: 0.728795
[111]	valid_0's auc: 0.7287
[112]	valid_0's auc: 0.728584
[113]	valid_0's auc: 0.728587
[114]	valid_0's auc: 0.728575
[115]	valid_0's auc: 0.728632
[116]	valid_0's auc: 0.728199
[117]	valid_0's auc: 0.728464
[118]	valid_0's auc: 0.728505
[119]	valid_0's auc: 0.728537
[120]	valid_0's auc: 0.72862
[121]	valid_0's auc: 0.72893
[122]	valid_0's auc: 0.729035
[123]	valid_0's auc: 0.729083
[124]	valid_0's auc: 0.729071
[125]	valid_0's auc: 0.729052
[126]	valid_0's auc: 0.729083
[127]	valid_0's auc: 0.729037
[128]	valid_0's auc: 0.72915
[129]	valid_0's auc: 0.729203
[130]	valid_0's auc: 0.728969
Early stopping, best iteration is:
[30]	valid_0's auc: 0.730202
----------- 3
[1]	valid_0's auc: 0.719062
Training until validation scores don't improve for 100 rounds
[2]	valid_0's auc: 0.716168
[3]	valid_0's auc: 0.727636
[4]	valid_0's auc: 0.727099
[5]	valid_0's auc: 0.729333
[6]	valid_0's auc: 0.729394
[7]	valid_0's auc: 0.731978
[8]	valid

[141]	valid_0's auc: 0.73791
[142]	valid_0's auc: 0.738013
[143]	valid_0's auc: 0.737931
[144]	valid_0's auc: 0.737973
[145]	valid_0's auc: 0.737861
[146]	valid_0's auc: 0.737798
[147]	valid_0's auc: 0.737829
[148]	valid_0's auc: 0.737859
[149]	valid_0's auc: 0.737728
[150]	valid_0's auc: 0.737745
[151]	valid_0's auc: 0.73774
[152]	valid_0's auc: 0.737716
[153]	valid_0's auc: 0.737717
[154]	valid_0's auc: 0.737775
[155]	valid_0's auc: 0.737755
[156]	valid_0's auc: 0.737714
[157]	valid_0's auc: 0.737689
[158]	valid_0's auc: 0.737579
[159]	valid_0's auc: 0.73777
[160]	valid_0's auc: 0.737755
[161]	valid_0's auc: 0.737732
[162]	valid_0's auc: 0.737736
[163]	valid_0's auc: 0.737458
[164]	valid_0's auc: 0.73746
[165]	valid_0's auc: 0.737565
[166]	valid_0's auc: 0.737639
[167]	valid_0's auc: 0.737509
[168]	valid_0's auc: 0.737548
[169]	valid_0's auc: 0.737419
[170]	valid_0's auc: 0.737502
[171]	valid_0's auc: 0.737479
[172]	valid_0's auc: 0.737457
[173]	valid_0's auc: 0.737572
[174]	valid_0'

Unnamed: 0,name,importance
0,url_1,10775.820746
1,ip_transform_4,8030.752723
2,ip_transform_3,5387.786316
3,ip_transform_2,5307.437761
7,device_num_transform_2,3846.886076
6,ip_transform_5,3752.416171
9,url_2,2821.175287
5,ip_transform_1,2314.138867
10,user_name_3,1435.089698
8,user_name_1,1399.562236


0.732181023401934

0.004321110200142313

In [16]:
feats_importance.sort_values('importance', ascending=False)[:50]

Unnamed: 0,name,importance
0,url_1,10775.820746
1,ip_transform_4,8030.752723
2,ip_transform_3,5387.786316
3,ip_transform_2,5307.437761
7,device_num_transform_2,3846.886076
6,ip_transform_5,3752.416171
9,url_2,2821.175287
5,ip_transform_1,2314.138867
10,user_name_3,1435.089698
8,user_name_1,1399.562236
