# Solution Overview
Competition link: [System Access Risk Identification](https://www.datafountain.cn/competitions/580/datasets)

## Competition Introduction
### Background

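As governments and enterprises place growing emphasis on security and efficiency, unified identity management (IAM, Identity and Access Management) systems, one part of the security infrastructure, are receiving more and more attention. In the IAM field the main security control is identity authentication, which chiefly includes password verification, QR-code verification, SMS verification, face recognition, and fingerprint verification. These methods generally fall into three categories: what the user knows (such as a password), what the user has (such as an ID card), and what the user is (such as face recognition and fingerprint verification). Each has its own weaknesses. A strong password is hard to remember, while a weak one is easily compromised; for face recognition, liveness detection hurts user experience, while silent detection is easily bypassed with photos, videos, or face models. For this reason, the Classified Protection of Cybersecurity 2.0 standard (等保2.0) requires systems at level three and above to authenticate users with two or more authentication methods in order to raise the credibility of authentication. This approach is also known as two-factor authentication.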

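For users, two-factor authentication improves security to some extent, but it also greatly degrades user experience. IAM vendors have therefore begun to draw on behavior-analysis techniques such as User and Entity Behavior Analytics (UEBA) and user profiling, looking for an approach that preserves user experience while raising the credibility of authentication. In current IAM practice, the easiest approach to deploy is rule-based behavior analysis, because it is highly interpretable and easy to link with other authentication methods.
Rule-based behavior analysis also has clear limitations, however. First, it is experience-based and tends to follow the principle of "better to wrongly block a thousand than let one slip through". Second, it cannot show from the data whether someone is trying to steal identity information, verifying illegally obtained identity information, or using stolen identity information. In view of this, we are holding this competition in the hope that participating teams will use the competition data and domain knowledge to build machine learning, artificial intelligence, or data mining models that make up for the shortcomings of traditional methods and solve this industry problem.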

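### Task

In this competition, teams use users' historical system access logs and the accompanying risk labels, combined with domain knowledge, to carry out the necessary feature engineering, build machine learning, artificial intelligence, or data mining models, and use those models to predict whether future system accesses are risky.

### Data Overview
The competition data is sampled from the Bamboocloud (竹云) log repository: system access logs from January to June 2022 for a proportion of one company's employees, mainly covering authentication logs and risk logs. Some fields have been desensitized through a one-to-one mapping before being released to teams. Authentication logs record the behavior data generated when a user accesses an application system, including key information such as user name, authentication time, authentication city, accessed system, and accessed URL.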

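### Data Description
• Documentation

## Solution
### Machine Learning Solution
#### Feature Derivation
1. In the feature derivation step, design derived indicators from the different types of available data according to a fixed processing logic. For example, indicators can be derived by combining "main dimension + operator function + measure + condition dimension + time dimension".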
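
The combination logic above can be sketched as follows. This is a minimal illustration on a toy log, not the feature set actually used in this notebook: the main dimension is `user_name`, the operator is a count, the condition is a non-200 `http_status_code`, and the time dimension is a trailing 7-day window. The helper name `count_failed_last_7d` and the toy rows are assumptions made for the example.

```python
import pandas as pd

# Toy access log; column names follow the competition data, values are made up.
logs = pd.DataFrame({
    "user_name": ["a", "a", "a", "b", "b"],
    "op_datetime": pd.to_datetime([
        "2022-01-01 09:00:00", "2022-01-03 10:00:00", "2022-01-12 11:00:00",
        "2022-01-02 09:30:00", "2022-01-04 18:00:00",
    ]),
    "http_status_code": [404, 200, 404, 200, 500],
})


def count_failed_last_7d(df: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: per user, count non-200 requests in the
    trailing 7 days, current row included."""
    df = df.sort_values(["user_name", "op_datetime"])
    parts = []
    for _, group in df.groupby("user_name"):            # main dimension
        failed = (group["http_status_code"] != 200).astype(int)  # condition
        failed.index = pd.DatetimeIndex(group["op_datetime"])    # time index
        rolled = failed.rolling("7D").sum()             # time window + operator
        parts.append(pd.Series(rolled.to_numpy(), index=group.index))
    return pd.concat(parts).sort_index()


logs["user_failed_cnt_7d"] = count_failed_last_7d(logs)
print(logs)
```

Swapping the operator (`sum`, `mean`, `nunique`), the condition, or the window length yields a family of indicators from the same template.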

#### Model Selection
1. Scorecard model
2. LightGBM

### Deep Learning Solution
1. Convert the data into sequence data
2. Model with ESIM
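
The first step, turning logs into sequences, might look like the sketch below. It is only an assumption about one possible encoding: each row gets the integer ids of the same user's most recent `url_page` values, left-padded to a fixed length. The helper `build_sequences`, the padding id, and the toy rows are hypothetical, and how such sequences would be fed to ESIM is not described in the source, so it is not shown.

```python
import pandas as pd

# Toy log; values are made up for illustration.
logs = pd.DataFrame({
    "user_name": ["a", "a", "a", "b", "b"],
    "op_datetime": pd.to_datetime([
        "2022-01-01 09:00:00", "2022-01-01 10:00:00", "2022-01-02 11:00:00",
        "2022-01-01 09:30:00", "2022-01-03 18:00:00",
    ]),
    "url_page": ["mail", "oa", "download", "github", "mail"],
})

PAD = 0  # id reserved for padding (assumption)
vocab = {tok: i + 1 for i, tok in enumerate(sorted(logs["url_page"].unique()))}


def build_sequences(df: pd.DataFrame, seq_len: int = 3) -> pd.Series:
    """Hypothetical helper: for every row, the ids of the current and
    previous events of the same user, left-padded to seq_len."""
    df = df.sort_values(["user_name", "op_datetime"])
    sequences = {}
    for _, group in df.groupby("user_name"):
        ids = [vocab[tok] for tok in group["url_page"]]
        for pos, row_idx in enumerate(group.index):
            window = ids[max(0, pos - seq_len + 1): pos + 1]
            sequences[row_idx] = [PAD] * (seq_len - len(window)) + window
    return pd.Series(sequences).sort_index()


logs["url_page_seq"] = build_sequences(logs)
print(logs[["user_name", "url_page", "url_page_seq"]])
```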

# Exploratory Data Analysis (EDA)

In [1]:
import os
import re
import numpy as np
import pandas as pd
import datetime

import matplotlib.pyplot as plt
import seaborn as sns

import scorpyo as sp


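# Use the fully qualified option name; the bare 'max_rows' alias is ambiguous in newer pandas
pd.set_option('display.max_rows', 320)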

In [2]:
path_project = r'/Users/liliangshan/workspace/python/01_datasets/ccf_system_access_risk_identification'

# path dir
path_row_data = os.path.join(path_project, 'row_data')
path_new_data = os.path.join(path_project, 'new_data')
path_results  = os.path.join(path_project, 'results')

# path row_data
path_train = os.path.join(path_row_data, 'train.csv')
path_test  = os.path.join(path_row_data, 'evaluation_public.csv')
path_sample_submission = os.path.join(path_row_data, 'submit_example.csv')


## results
path_output_report = os.path.join(path_results, '01_原始数据探察_20221013.xlsx')

y_label = "is_risk"

In [3]:
df_row_train = sp.read_data(path_train)
df_row_val  = sp.read_data(path_test)

## Descriptive Statistics

In [4]:
_ = sp.excel_label(df_row_train, y=y_label, excel_path=path_output_report, show=True)

------------------------------------------------------------------------------------------
Label distribution:

Dataset samples: 47660, feature columns: 16, label column: is_risk


Unnamed: 0,label value,label count,label share
0,0,39964,83.85%
1,1,7696,16.15%


Unnamed: 0,id,user_name,department,ip_transform,device_num_transform,browser_version,browser,os_type,os_version,op_datetime,ip_type,http_status_code,op_city,log_system_transform,url,op_month,is_risk
0,0,guojianping9672,rd,GVhZtW4i1,rqRxAjAL1RYC,firefox_78,firefox,win,win10,2022-01-18 19:10:41,内网,200,成都,2umVQwhiiwNJ,xxx.com/mail,2022-01,0
1,1,yangtao1740,sales,l3MuTMPoQ,iKPTa3su50y7,chrome_93,chrome,win,win11,2022-04-01 17:04:00,内网,200,深圳,RwHe8Q1R7AlB,business.xxx.com/,2022-04,0
2,2,wangying9098,rd,4uHWcskWv,1baNbqxMWcCu,ie_11,ie,win,win10,2022-03-01 15:53:49,内网,200,成都,dwS3cdn15GK4,wpsdoc.xxx.com/kdocs,2022-03,0


Sheet 01.标签分布分析 already exists in /Users/liliangshan/workspace/python/01_datasets/ccf_system_access_risk_identification/results/01_原始数据探察_20221013.xlsx; the original file will be overwritten


------------------------------------------------------------------------------------------


In [5]:
sp.excel_detect(df_row_train, sheet_name='02.训练集-数据描述性统计',excel_path=path_output_report)

Sheet 02.训练集-数据描述性统计 already exists in /Users/liliangshan/workspace/python/01_datasets/ccf_system_access_risk_identification/results/01_原始数据探察_20221013.xlsx; the original file will be overwritten


Unnamed: 0,feat_name_row,type,size,missing,unique,zero_ratio,negative_ratio,top1_all_value,top1_all_ratio,mean_or_top1,std_or_top2,min_or_top3,1%_or_top4,10%_or_top5,50%_or_bottom5,75%_or_bottom4,90%_or_bottom3,99%_or_bottom2,max_or_bottom1
0,id,int64,47660,0.0,47660,0.0,0.0,0,0.0,23829.5,13758.401,0.0,476.59,4765.9,23829.5,35744.25,42893.1,47182.41,47659.0
1,user_name,object,47660,0.08,187,0.0,,xuxiuying8050,0.007,xuxiuying8050:0.65%,hongchang3029:0.63%,tanliu3173:0.62%,liuhong6350:0.62%,lufan2545:0.62%,zhouxiumei4433:0.38%,chenjian4844:0.37%,wanggang1192:0.36%,ranxiuzhen6780:0.33%,xujie9775:0.30%
2,department,object,47660,0.08,5,0.0,,rd,0.654,rd:65.36%,sales:17.26%,other:4.07%,accounting:3.56%,hr:1.75%,rd:65.36%,sales:17.26%,other:4.07%,accounting:3.56%,hr:1.75%
3,ip_transform,object,47660,0.0,2105,0.0,,w2CfuqTz3,0.007,w2CfuqTz3:0.68%,u9diCFdYZ:0.66%,pPgzIf3S4:0.65%,7YnPN3fqd:0.65%,DhTMwbtS5:0.64%,948U9MQcB:0.00%,h75YAkAAL:0.00%,m7512MutA:0.00%,ADL8GwW32:0.00%,g3dWezpzT:0.00%
4,device_num_transform,object,47660,0.0,844,0.0,,O54DfqjlCrhL,0.007,O54DfqjlCrhL:0.70%,kUa61ygA6gI3:0.68%,Rfv57YyO3vny:0.67%,5DmlITfRNR36:0.66%,TzmgdvYq3Kx0:0.66%,aUECyyFo55Zy:0.00%,cREgOG9x3d9X:0.00%,NGfeE42d1yHY:0.00%,T4hueKNccs7X:0.00%,A0TLDctT8OUR:0.00%
5,browser_version,object,47660,0.0,8,0.0,,edge_93,0.372,edge_93:37.16%,chrome_90:33.38%,safari_13:9.48%,chrome_77:5.32%,firefox_78:4.82%,chrome_77:5.32%,firefox_78:4.82%,chrome_93:4.44%,ie_11:3.37%,ie_9:2.02%
6,browser,object,47660,0.0,5,0.0,,chrome,0.431,chrome:43.15%,edge:37.16%,safari:9.48%,ie:5.39%,firefox:4.82%,chrome:43.15%,edge:37.16%,safari:9.48%,ie:5.39%,firefox:4.82%
7,os_type,object,47660,0.0,2,0.0,,win,0.905,win:90.52%,macos:9.48%,,,,,,,win:90.52%,macos:9.48%
8,os_version,object,47660,0.0,4,0.0,,win10,0.758,win10:75.84%,win7:10.81%,macos_big_sur_11:9.48%,win11:3.87%,,,win10:75.84%,win7:10.81%,macos_big_sur_11:9.48%,win11:3.87%
9,op_datetime,object,47660,0.0,47343,0.0,,2022-04-02 17:02:24,0.0,2022-04-02 17:02:24:0.01%,2022-02-16 11:36:12:0.01%,2022-04-28 09:33:53:0.01%,2022-03-04 17:13:36:0.01%,2022-03-23 14:47:36:0.00%,2022-04-29 09:58:59:0.00%,2022-04-20 19:02:48:0.00%,2022-01-19 15:11:24:0.00%,2022-01-19 10:53:46:0.00%,2022-04-08 19:30:10:0.00%


In [6]:
sp.excel_detect(df_row_val, sheet_name='03.测试集-数据描述性统计',excel_path=path_output_report)

Sheet 03.测试集-数据描述性统计 already exists in /Users/liliangshan/workspace/python/01_datasets/ccf_system_access_risk_identification/results/01_原始数据探察_20221013.xlsx; the original file will be overwritten


Unnamed: 0,feat_name_row,type,size,missing,unique,zero_ratio,negative_ratio,top1_all_value,top1_all_ratio,mean_or_top1,std_or_top2,min_or_top3,1%_or_top4,10%_or_top5,50%_or_bottom5,75%_or_bottom4,90%_or_bottom3,99%_or_bottom2,max_or_bottom1
0,id,int64,25710,0.0,25710,0.0,0.0,0,0.0,12854.5,7421.982,0.0,257.09,2570.9,12854.5,19281.75,23138.1,25451.91,25709.0
1,user_name,object,25710,0.079,187,0.0,,yuyuzhen3194,0.007,yuyuzhen3194:0.72%,fengying9449:0.68%,pengfan5076:0.67%,linbin8358:0.67%,lijing7913:0.66%,likun8302:0.33%,maohaiyan4824:0.33%,chengli6873:0.32%,yanglin6562:0.32%,chenying2872:0.32%
2,department,object,25710,0.079,5,0.0,,rd,0.656,rd:65.60%,sales:17.08%,other:4.33%,accounting:3.08%,hr:1.99%,rd:65.60%,sales:17.08%,other:4.33%,accounting:3.08%,hr:1.99%
3,ip_transform,object,25710,0.0,1192,0.0,,H0TKapkPL,0.008,H0TKapkPL:0.76%,YBCE8ld50:0.70%,2qWPkWg5V:0.70%,8J3dZCu0A:0.69%,4hPUiX1CK:0.68%,88aHOvoHa:0.00%,358EfARvQ:0.00%,4L9vshrCi:0.00%,9RzuSrhOL:0.00%,LWv4Mjkys:0.00%
4,device_num_transform,object,25710,0.0,505,0.0,,K8Ith9mjHsKo,0.008,K8Ith9mjHsKo:0.79%,3wDqyLqvVCn1:0.72%,4BWxjoSreaOm:0.71%,uRYWimJ18UEk:0.70%,sdN7y26qL30M:0.70%,E9qxuBiAo3Ju:0.00%,AF9IY4wFm5vY:0.00%,jI1j9ekI0wfW:0.00%,IN0qCiPw2eHv:0.00%,MmIkTEts5OIC:0.00%
5,browser_version,object,25710,0.0,8,0.0,,edge_93,0.367,edge_93:36.69%,chrome_90:33.55%,safari_13:9.37%,chrome_77:5.57%,firefox_78:5.11%,chrome_77:5.57%,firefox_78:5.11%,chrome_93:4.20%,ie_11:3.50%,ie_9:2.02%
6,browser,object,25710,0.0,5,0.0,,chrome,0.433,chrome:43.31%,edge:36.69%,safari:9.37%,ie:5.52%,firefox:5.11%,chrome:43.31%,edge:36.69%,safari:9.37%,ie:5.52%,firefox:5.11%
7,os_type,object,25710,0.0,2,0.0,,win,0.906,win:90.63%,macos:9.37%,,,,,,,win:90.63%,macos:9.37%
8,os_version,object,25710,0.0,4,0.0,,win10,0.765,win10:76.51%,win7:10.50%,macos_big_sur_11:9.37%,win11:3.62%,,,win10:76.51%,win7:10.50%,macos_big_sur_11:9.37%,win11:3.62%
9,op_datetime,object,25710,0.0,25542,0.0,,2022-06-23 10:51:17,0.0,2022-06-23 10:51:17:0.01%,2022-06-01 11:38:12:0.01%,2022-05-25 15:09:56:0.01%,2022-05-06 19:34:25:0.01%,2022-06-07 09:10:38:0.01%,2022-05-24 18:01:53:0.00%,2022-05-25 10:29:21:0.00%,2022-05-07 16:09:19:0.00%,2022-06-30 14:19:20:0.00%,2022-06-28 14:50:21:0.00%


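1. Both the training and test sets contain 187 users; check whether the two groups of users overlap.
2. The share of logs generated before login (rows with a missing user_name) in the test set differs slightly from that in the training set (0.079 vs 0.08).
3. The numbers of distinct IPs and devices (MAC addresses) differ (2105 vs 844 in the training set), so a single user can have multiple IPs.
4. Browser, OS type, IP type, and authentication city are generally expected to stay stable for a given user; consider building features around this, plus one-hot encoding.
5. The data is described as covering January to June 2022, yet op_month appears to hold only two months, which is odd. The per-split check further down shows January to April in the training set and May to June in the test set.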
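
Observations 3 and 4 suggest features that measure how stable a user's attributes are. The sketch below is one hedged way to do it on toy data: count distinct values per user, then flag rows whose value differs from that user's most frequent one. The `*_is_unusual` column names and the toy rows are assumptions, and the statistics here use all rows at once, so in practice they may need to be restricted to past records to avoid leaking future information.

```python
import pandas as pd

# Toy log; values are made up for illustration.
logs = pd.DataFrame({
    "user_name": ["a", "a", "a", "b", "b"],
    "ip_transform": ["ip1", "ip1", "ip2", "ip3", "ip3"],
    "browser": ["chrome", "chrome", "chrome", "edge", "edge"],
})

cols = ["ip_transform", "browser"]

# Number of distinct values each user has for every attribute
per_user_nunique = logs.groupby("user_name")[cols].nunique()
print(per_user_nunique)

# Flag rows whose value differs from the user's most frequent value
for col in cols:
    usual = logs.groupby("user_name")[col].agg(lambda s: s.mode().iloc[0])
    logs[f"{col}_is_unusual"] = (logs[col] != logs["user_name"].map(usual)).astype(int)

print(logs)
```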

## Basic Data Analysis

Number of historical records per user

In [7]:
df_row_train['user_name'].value_counts()

xuxiuying8050       311
hongchang3029       300
tanliu3173          297
liuhong6350         296
lufan2545           296
qiuyan8450          292
liulin3167          291
sunzhiqiang8616     290
yuanwei8501         289
chenghaiyan1579     286
zhanglihua7105      285
zhaoxiang7127       283
pantingting3662     283
pengxia7510         283
jingbo3416          281
maohaiyan4824       280
xiexiaohong5806     280
huanglei6824        278
mayang4022          277
cenglili3725        275
luyan5353           275
lifan7769           273
xujia4357           271
fengying9449        270
gaofeng5184         269
yuandan8814         269
yangyong8917        269
yangtao1740         268
liuyang8834         268
wangshuhua1453      267
heyuhua2679         266
youzhiqiang3249     266
fangxiurong4573     266
huanghui5940        265
liuchunmei3912      265
lichen1456          265
jiangtao2581        262
chenguizhi2238      262
suping3694          260
wanghongmei7436     260
lixia2119           259
wangchang5581   

In [8]:
df_row_val['user_name'].value_counts()

yuyuzhen3194        185
fengying9449        174
pengfan5076         171
linbin8358          171
lijing7913          170
chenxiaohong3284    168
huangning3243       165
liyuzhen4662        162
tanghua6212         162
pantingting3662     161
caili5590           160
fangxiurong4573     159
lufan2545           159
xulanying3873       158
chengjie1656        158
wangxiurong2873     158
caoyu4082           155
gaofeng5184         155
tangguifang4636     154
youzhiqiang3249     153
linbin5576          152
renming5624         152
wanghongmei3888     152
wanghongmei7436     151
wuqian3014          150
shenping7146        150
ligang8428          148
luojun4825          148
yuanwei8501         147
yuanjun5870         147
linyulan9408        147
wangchang5581       146
liguiying8319       146
heyuhua2679         146
genglin9252         146
wangying9098        146
luoxiuzhen8469      146
wangshuhua1453      145
jiangtao2581        144
huting4731          144
duanguiying2657     142
mayang4022      

User overlap between the training and test sets

In [9]:
# Pre-login rows have a missing user_name, so NaN may be counted as one extra distinct value.
train_users = set(df_row_train['user_name'].unique())
test_users = set(df_row_val['user_name'].unique())
print('Users in both training and test sets:', len(train_users & test_users))
print('Users only in the training set:', len(train_users - test_users))
print('Users only in the test set:', len(test_users - train_users))

Users in both training and test sets: 188
Users only in the training set: 0
Users only in the test set: 0


Login months

In [10]:
df_row_train['op_month'].unique()

array(['2022-01', '2022-04', '2022-03', '2022-02'], dtype=object)

In [11]:
df_row_val['op_month'].unique()

array(['2022-05', '2022-06'], dtype=object)

A single user can have multiple labels

In [12]:
tmp = df_row_train.groupby(['user_name']).agg({y_label:'nunique'}).reset_index()
display(tmp.head())
display(tmp[tmp['is_risk']<2])

Unnamed: 0,user_name,is_risk
0,baojianhua2916,2
1,caili5590,2
2,caohui3132,2
3,caoyu4082,2
4,cendandan2851,2


Unnamed: 0,user_name,is_risk


## Per-User Data Analysis

In [13]:
df_row_train['url_sit'] = df_row_train['url'].map(lambda x: x.split('/')[0])
df_row_train['url_page'] = df_row_train['url'].map(lambda x: x.split('/')[1])

df_row_val['url_sit'] = df_row_val['url'].map(lambda x: x.split('/')[0])
df_row_val['url_page'] = df_row_val['url'].map(lambda x: x.split('/')[1])

In [14]:
df_row_train[df_row_train['user_name']=='xuxiuying8050'].sort_values(by='op_datetime')

Unnamed: 0,id,user_name,department,ip_transform,device_num_transform,browser_version,browser,os_type,os_version,op_datetime,ip_type,http_status_code,op_city,log_system_transform,url,op_month,is_risk,url_sit,url_page
17341,17341,xuxiuying8050,rd,w2CfuqTz3,O54DfqjlCrhL,ie_11,ie,win,win10,2022-01-07 19:05:47,内网,200,北京,nHrKgKdJ1Mzt,xxx.com/github,2022-01,0,xxx.com,github
18993,18993,xuxiuying8050,rd,w2CfuqTz3,O54DfqjlCrhL,ie_11,ie,win,win10,2022-01-10 15:04:49,内网,200,北京,2umVQwhiiwNJ,xxx.com/mail,2022-01,0,xxx.com,mail
31374,31374,xuxiuying8050,rd,w2CfuqTz3,O54DfqjlCrhL,ie_11,ie,win,win10,2022-01-10 19:23:04,内网,200,北京,nHrKgKdJ1Mzt,xxx.com/github,2022-01,0,xxx.com,github
3102,3102,xuxiuying8050,rd,w2CfuqTz3,O54DfqjlCrhL,ie_11,ie,win,win10,2022-01-11 08:51:16,内网,200,北京,nHrKgKdJ1Mzt,xxx.com/github,2022-01,0,xxx.com,github
17518,17518,xuxiuying8050,rd,w2CfuqTz3,O54DfqjlCrhL,ie_11,ie,win,win10,2022-01-11 15:50:39,内网,200,北京,nHrKgKdJ1Mzt,xxx.com/github,2022-01,0,xxx.com,github
12741,12741,xuxiuying8050,rd,w2CfuqTz3,O54DfqjlCrhL,ie_11,ie,win,win10,2022-01-11 16:26:46,内网,200,北京,sW0whYIx8LFM,work.xxx.com/task,2022-01,0,work.xxx.com,task
27132,27132,xuxiuying8050,rd,w2CfuqTz3,O54DfqjlCrhL,ie_11,ie,win,win10,2022-01-13 11:37:58,内网,200,北京,nHrKgKdJ1Mzt,xxx.com/github,2022-01,0,xxx.com,github
17951,17951,xuxiuying8050,rd,w2CfuqTz3,O54DfqjlCrhL,ie_11,ie,win,win10,2022-01-13 19:02:14,内网,200,北京,2umVQwhiiwNJ,xxx.com/mail,2022-01,0,xxx.com,mail
20405,20405,xuxiuying8050,rd,w2CfuqTz3,O54DfqjlCrhL,ie_11,ie,win,win10,2022-01-14 11:22:05,内网,200,北京,nHrKgKdJ1Mzt,xxx.com/github,2022-01,0,xxx.com,github
27238,27238,xuxiuying8050,rd,w2CfuqTz3,O54DfqjlCrhL,ie_11,ie,win,win10,2022-01-14 11:29:04,内网,200,北京,nHrKgKdJ1Mzt,xxx.com/github,2022-01,0,xxx.com,github


In [15]:
df_row_train[df_row_train['user_name']=='xujie9775'].sort_values(by='op_datetime')

Unnamed: 0,id,user_name,department,ip_transform,device_num_transform,browser_version,browser,os_type,os_version,op_datetime,ip_type,http_status_code,op_city,log_system_transform,url,op_month,is_risk,url_sit,url_page
3414,3414,xujie9775,other,SJwv4mEe7,PV9ahGuqwn4t,chrome_77,chrome,win,win10,2022-01-07 10:12:01,内网,200,深圳,9RAS6RNfETj5,xxx.com/checkingin,2022-01,0,xxx.com,checkingin
30415,30415,xujie9775,other,SJwv4mEe7,PV9ahGuqwn4t,chrome_77,chrome,win,win10,2022-01-07 15:15:54,内网,200,深圳,fwM6KZKjrzjm,xxx.com/oa,2022-01,0,xxx.com,oa
5340,5340,xujie9775,other,SJwv4mEe7,PV9ahGuqwn4t,chrome_77,chrome,win,win10,2022-01-10 10:28:51,内网,200,深圳,2umVQwhiiwNJ,xxx.com/mail,2022-01,0,xxx.com,mail
24957,24957,xujie9775,other,SJwv4mEe7,PV9ahGuqwn4t,chrome_77,chrome,win,win10,2022-01-10 13:45:22,内网,200,深圳,9RAS6RNfETj5,xxx.com/checkingin,2022-01,0,xxx.com,checkingin
13560,13560,xujie9775,other,SJwv4mEe7,PV9ahGuqwn4t,chrome_77,chrome,win,win10,2022-01-11 11:54:50,内网,200,深圳,9RAS6RNfETj5,xxx.com/checkingin,2022-01,0,xxx.com,checkingin
31230,31230,xujie9775,other,SJwv4mEe7,PV9ahGuqwn4t,chrome_77,chrome,win,win10,2022-01-12 15:17:41,内网,200,深圳,2umVQwhiiwNJ,xxx.com/mail,2022-01,0,xxx.com,mail
4666,4666,xujie9775,other,SJwv4mEe7,PV9ahGuqwn4t,chrome_77,chrome,win,win10,2022-01-12 16:32:11,内网,200,深圳,fwM6KZKjrzjm,xxx.com/oa,2022-01,0,xxx.com,oa
22959,22959,xujie9775,other,SJwv4mEe7,PV9ahGuqwn4t,chrome_77,chrome,win,win10,2022-01-17 11:11:20,内网,200,深圳,fwM6KZKjrzjm,xxx.com/oa,2022-01,0,xxx.com,oa
47293,47293,xujie9775,other,9KreK1Eb3,PV9ahGuqwn4t,chrome_77,chrome,win,win10,2022-01-17 12:02:00,内网,200,深圳,fwM6KZKjrzjm,xxx.com/oa,2022-01,1,xxx.com,oa
3094,3094,xujie9775,other,SJwv4mEe7,PV9ahGuqwn4t,chrome_77,chrome,win,win10,2022-01-17 13:49:36,内网,200,深圳,fwM6KZKjrzjm,xxx.com/oa,2022-01,0,xxx.com,oa


In [16]:
df_row_train[df_row_train['user_name'].isnull()].sort_values(by='op_datetime')

Unnamed: 0,id,user_name,department,ip_transform,device_num_transform,browser_version,browser,os_type,os_version,op_datetime,ip_type,http_status_code,op_city,log_system_transform,url,op_month,is_risk,url_sit,url_page
35393,35393,,,3Qm3OCoLY,xTOamJ5o9Ugy,edge_93,edge,win,win10,2022-01-07 08:43:56,,400,国外,,xxx.com/getVerifyCode,2022-01,1,xxx.com,getVerifyCode
32240,32240,,,C0IxaOdrh,Gk0JoiqhROiD,edge_93,edge,win,win10,2022-01-07 08:47:40,,200,深圳,,xxx.com/loginAuth,2022-01,0,xxx.com,loginAuth
33210,33210,,,C4Wb7HV14,lHZQcsid67md,chrome_90,chrome,win,win10,2022-01-07 09:01:54,,200,深圳,,xxx.com/loginAuth,2022-01,0,xxx.com,loginAuth
34373,34373,,,2GkUZeD9D,DIBl5zjCQg9U,chrome_90,chrome,win,win10,2022-01-07 09:12:24,,200,成都,,xxx.com/loginAuth,2022-01,0,xxx.com,loginAuth
33788,33788,,,5BLwyu5pl,Y0ic4I4cr0UU,edge_93,edge,win,win10,2022-01-07 09:17:51,,200,深圳,,xxx.com/loginAuth,2022-01,0,xxx.com,loginAuth
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
34535,34535,,,uV1paHVG6,t3ts32NPjtG9,edge_93,edge,win,win10,2022-04-29 18:44:15,,200,深圳,,xxx.com/loginAuth,2022-04,0,xxx.com,loginAuth
32905,32905,,,9lcqFeapI,T7mGjWmswm9Z,chrome_90,chrome,win,win11,2022-04-29 19:02:23,,200,深圳,,xxx.com/loginAuth,2022-04,0,xxx.com,loginAuth
35201,35201,,,0mjaEf4SB,8ftsXFm5I1Ej,safari_13,safari,macos,macos_big_sur_11,2022-04-29 19:17:02,,200,成都,,xxx.com/loginAuth,2022-04,0,xxx.com,loginAuth
33168,33168,,,V1OBTNxYA,6NRAoXZogVDX,safari_13,safari,macos,macos_big_sur_11,2022-04-29 19:24:51,,200,杭州,,xxx.com/loginAuth,2022-04,0,xxx.com,loginAuth


In [21]:
df_row_train[df_row_train['http_status_code']==404][['is_risk']].value_counts()

is_risk
1          855
0          219
dtype: int64

In [22]:
df_row_train[df_row_train['url_page']=='download'][['is_risk']].value_counts()

is_risk
0          5143
1          2209
dtype: int64

In [23]:
df_row_train[df_row_train['url_page']=='getVerifyCode'][['is_risk']].value_counts()

is_risk
1          201
0            8
dtype: int64

In [24]:
df_row_train['url_page'].unique()

array(['mail', '', 'kdocs', 'github', 'checkingin', 'oa', 'task',
       'accounting', 'loginAuth', 'getVerifyCode', 'getLoginType',
       'download'], dtype=object)

In [25]:
df_row_train[df_row_train['url_page']==''][['is_risk']].value_counts()

is_risk
0          3396
1           567
dtype: int64

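1. http_status_code, log_system_transform, url_sit, and url_page separate risky from normal accesses reasonably well; consider one-hot or WOE encoding for them.
2. Time-based features are likely to be informative.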
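
As a rough illustration of the WOE encoding mentioned in point 1, the sketch below computes a per-category risk rate and weight of evidence on toy data. It assumes the convention WOE = ln(share of risky rows / share of normal rows) with additive smoothing to avoid division by zero; the helper `woe_table`, the `smooth` parameter, and the toy rows are hypothetical and not part of the notebook. In a real pipeline the table would normally be fitted on training folds only, since it uses the label.

```python
import numpy as np
import pandas as pd


def woe_table(df: pd.DataFrame, col: str, target: str, smooth: float = 0.5) -> pd.DataFrame:
    """Hypothetical helper: risk rate and smoothed WOE for each category of `col`."""
    stats = df.groupby(col)[target].agg(bad="sum", total="count")
    stats["good"] = stats["total"] - stats["bad"]
    n_cat = len(stats)
    # Share of all risky / normal rows that fall into each category (smoothed)
    bad_dist = (stats["bad"] + smooth) / (stats["bad"].sum() + smooth * n_cat)
    good_dist = (stats["good"] + smooth) / (stats["good"].sum() + smooth * n_cat)
    stats["risk_rate"] = stats["bad"] / stats["total"]
    stats["woe"] = np.log(bad_dist / good_dist)
    return stats


# Toy data; values are made up for illustration.
toy = pd.DataFrame({
    "url_page": ["getVerifyCode"] * 4 + ["mail"] * 6,
    "is_risk": [1, 1, 1, 0, 0, 0, 0, 0, 0, 1],
})

table = woe_table(toy, "url_page", "is_risk")
toy["url_page_woe"] = toy["url_page"].map(table["woe"])
print(table)
```

Under this convention a positive WOE marks a category that is over-represented among risky accesses, and a negative WOE marks one that is under-represented.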

# Feature Engineering

In [26]:
df = pd.concat([df_row_train, df_row_val])
df = df.sort_values(by='op_datetime')

# Authentication timestamp
df['op_datetime'] = pd.to_datetime(df['op_datetime'])
# Hour of day
df['hour'] = df['op_datetime'].dt.hour
# Day of week
df['dayofweek'] = df['op_datetime'].dt.dayofweek
# Day of month
df['day'] = df['op_datetime'].dt.day
# Month of year
df['month'] = df['op_datetime'].dt.month

# Sort by user name, then by authentication time
df = df.sort_values(by=['user_name', 'op_datetime']).reset_index(drop=True)
# Convert datetime to an integer Unix timestamp in seconds
df['ts'] = df['op_datetime'].values.astype(np.int64) // 10 ** 9
# Timestamp of the same user's previous authentication
df['ts1'] = df.groupby('user_name')['ts'].shift(1)
# Timestamp of the same user's authentication two records back
df['ts2'] = df.groupby('user_name')['ts'].shift(2)
# Timestamp of the same user's authentication three records back
df['ts3'] = df.groupby('user_name')['ts'].shift(3)
# Time differences in seconds (earlier minus current, so values are <= 0)
df['ts_diff1'] = df['ts1'] - df['ts']
df['ts_diff2'] = df['ts2'] - df['ts']
df['ts_diff3'] = df['ts3'] - df['ts']

df['hour_sin'] = np.sin(df['hour']/24*2*np.pi)
df['hour_cos'] = np.cos(df['hour']/24*2*np.pi)
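
The `hour_sin` / `hour_cos` pair above encodes the hour on a circle, so 23:00 and 00:00 end up close together even though the raw values 23 and 0 are far apart. The short sketch below checks this numerically; the helper names `encode_hour` and `hour_distance` are made up for the example.

```python
import numpy as np


def encode_hour(hour):
    """Map an hour (0-23) to a point on the unit circle, same formula as above."""
    angle = np.asarray(hour) / 24 * 2 * np.pi
    return np.stack([np.sin(angle), np.cos(angle)], axis=-1)


def hour_distance(h1, h2):
    """Euclidean distance between two hours in the (sin, cos) encoding."""
    return float(np.linalg.norm(encode_hour(h1) - encode_hour(h2)))


# Adjacent hours are equally close whether or not they cross midnight;
# opposite hours (0 and 12) are the farthest apart.
print(hour_distance(23, 0), hour_distance(0, 1), hour_distance(0, 12))
```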


In [27]:
un_numeric_f = []
for col in df.columns:
    if not pd.api.types.is_numeric_dtype(df[col]):
        un_numeric_f.append(col)
un_numeric_f

['user_name',
 'department',
 'ip_transform',
 'device_num_transform',
 'browser_version',
 'browser',
 'os_type',
 'os_version',
 'op_datetime',
 'ip_type',
 'op_city',
 'log_system_transform',
 'url',
 'op_month',
 'url_sit',
 'url_page']

In [28]:
df[un_numeric_f].head()

Unnamed: 0,user_name,department,ip_transform,device_num_transform,browser_version,browser,os_type,os_version,op_datetime,ip_type,op_city,log_system_transform,url,op_month,url_sit,url_page
0,baojianhua2916,rd,W4suCwUym,RlZlLWSvh292,chrome_90,chrome,win,win10,2022-01-07 18:55:24,内网,杭州,nHrKgKdJ1Mzt,xxx.com/github,2022-01,xxx.com,github
1,baojianhua2916,rd,W4suCwUym,RlZlLWSvh292,chrome_90,chrome,win,win10,2022-01-07 19:43:28,内网,杭州,nHrKgKdJ1Mzt,xxx.com/github,2022-01,xxx.com,github
2,baojianhua2916,rd,W4suCwUym,RlZlLWSvh292,chrome_90,chrome,win,win10,2022-01-10 11:51:39,内网,杭州,2umVQwhiiwNJ,xxx.com/mail,2022-01,xxx.com,mail
3,baojianhua2916,rd,W4suCwUym,RlZlLWSvh292,chrome_90,chrome,win,win10,2022-01-11 10:18:49,内网,杭州,nHrKgKdJ1Mzt,xxx.com/github,2022-01,xxx.com,github
4,baojianhua2916,rd,W4suCwUym,RlZlLWSvh292,chrome_90,chrome,win,win10,2022-01-11 11:56:43,内网,杭州,nHrKgKdJ1Mzt,xxx.com/github,2022-01,xxx.com,github


In [30]:
cat_f = ['user_name','department','ip_transform','device_num_transform','browser_version','browser',
 'os_type','os_version','ip_type','op_city','log_system_transform','url','url_sit','url_page']

# Drop time columns and high-cardinality categorical features
remove_col = ['op_datetime', 'op_month', 'user_name', 'ip_transform', 'device_num_transform', ]

In [31]:
cat_f = ['user_name', 'department', 'ip_transform', 'device_num_transform', 'browser_version', 'browser',
          'os_type', 'os_version', 'ip_type', 'op_city', 'log_system_transform', 'url',]

for f in cat_f:
    df[f+'_ts_diff_mean'] = df.groupby([f])['ts_diff1'].transform('mean')
    df[f+'_ts_diff_std'] = df.groupby([f])['ts_diff1'].transform('std')
    df[f+'_ts_diff2_mean'] = df.groupby([f])['ts_diff2'].transform('mean')
    df[f+'_ts_diff2_std'] = df.groupby([f])['ts_diff2'].transform('std')
    df[f+'_ts_diff3_mean'] = df.groupby([f])['ts_diff3'].transform('mean')
    df[f+'_ts_diff3_std'] = df.groupby([f])['ts_diff3'].transform('std')


In [32]:
df = df.drop(columns=remove_col)
df.head()

Unnamed: 0,id,department,browser_version,browser,os_type,os_version,ip_type,http_status_code,op_city,log_system_transform,...,log_system_transform_ts_diff2_mean,log_system_transform_ts_diff2_std,log_system_transform_ts_diff3_mean,log_system_transform_ts_diff3_std,url_ts_diff_mean,url_ts_diff_std,url_ts_diff2_mean,url_ts_diff2_std,url_ts_diff3_mean,url_ts_diff3_std
0,29148,rd,chrome_90,chrome,win,win10,内网,200,杭州,nHrKgKdJ1Mzt,...,-95372.721121,101869.684766,-141655.055306,122990.238931,-48312.637199,72520.154742,-95372.721121,101869.684766,-141655.055306,122990.238931
1,21403,rd,chrome_90,chrome,win,win10,内网,200,杭州,nHrKgKdJ1Mzt,...,-95372.721121,101869.684766,-141655.055306,122990.238931,-48312.637199,72520.154742,-95372.721121,101869.684766,-141655.055306,122990.238931
2,2153,rd,chrome_90,chrome,win,win10,内网,200,杭州,2umVQwhiiwNJ,...,-101227.892681,106073.290645,-149558.636775,126903.136146,-50542.50415,75311.114041,-101227.892681,106073.290645,-149558.636775,126903.136146
3,6953,rd,chrome_90,chrome,win,win10,内网,200,杭州,nHrKgKdJ1Mzt,...,-95372.721121,101869.684766,-141655.055306,122990.238931,-48312.637199,72520.154742,-95372.721121,101869.684766,-141655.055306,122990.238931
4,12888,rd,chrome_90,chrome,win,win10,内网,200,杭州,nHrKgKdJ1Mzt,...,-95372.721121,101869.684766,-141655.055306,122990.238931,-48312.637199,72520.154742,-95372.721121,101869.684766,-141655.055306,122990.238931


In [33]:
# One-hot encode the categorical variables
df = pd.get_dummies(df)

In [34]:
df_train = df[df[y_label].notna()].reset_index(drop=True)
df_test = df[df[y_label].isna()].reset_index(drop=True)

In [36]:
feats = df_train.columns.drop(['id', y_label])
feats

Index(['http_status_code', 'hour', 'dayofweek', 'day', 'month', 'ts', 'ts1',
       'ts2', 'ts3', 'ts_diff1',
       ...
       'url_page_checkingin', 'url_page_download', 'url_page_getLoginType',
       'url_page_getVerifyCode', 'url_page_github', 'url_page_kdocs',
       'url_page_loginAuth', 'url_page_mail', 'url_page_oa', 'url_page_task'],
      dtype='object', length=164)

# Modeling

In [37]:
params = {
    'learning_rate': 0.05,
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'num_leaves': 31,
    'verbose': -1,
    'seed': 2222,
    'n_jobs': -1,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.9,
    'bagging_freq': 4,
    # 'min_child_weight': 10,
}

In [41]:

import time
from sklearn.metrics import roc_auc_score as auc
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold, KFold

In [42]:
fold_num = 5
seeds = [2222]
oof = np.zeros(len(df_train))
importance = 0
pred_y = pd.DataFrame()
score = []
for seed in seeds:
    kf = StratifiedKFold(n_splits=fold_num, shuffle=True, random_state=seed)
    # kf = KFold(n_splits=fold_num, shuffle=True, random_state=seed)
    for fold, (train_idx, val_idx) in enumerate(kf.split(df_train[feats], df_train[y_label])):
        print('-----------', fold)
        train = lgb.Dataset(df_train.loc[train_idx, feats],
                            df_train.loc[train_idx, y_label])
        val = lgb.Dataset(df_train.loc[val_idx, feats],
                          df_train.loc[val_idx, y_label])
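        # Early stopping and per-round logging passed as callbacks
        # (the early_stopping_rounds argument was removed from lgb.train in LightGBM 4.0)
        model = lgb.train(params, train, valid_sets=[val],
                          num_boost_round=20000,
                          callbacks=[lgb.early_stopping(stopping_rounds=100),
                                     lgb.log_evaluation(period=1)])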

        oof[val_idx] += model.predict(df_train.loc[val_idx, feats]) / len(seeds)
        pred_y['fold_%d_seed_%d' % (fold, seed)] = model.predict(df_test[feats])
        importance += model.feature_importance(importance_type='gain') / fold_num
        score.append(auc(df_train.loc[val_idx, y_label], model.predict(df_train.loc[val_idx, feats])))
feats_importance = pd.DataFrame()
feats_importance['name'] = feats
feats_importance['importance'] = importance
display(feats_importance.sort_values('importance', ascending=False)[:30])

df_train['oof'] = oof
display(np.mean(score), np.std(score))

score = np.mean(score)
df_test[y_label] = pred_y.mean(axis=1).values
df_test = df_test.sort_values('id').reset_index(drop=True)

sub = pd.read_csv(path_sample_submission)
sub[y_label] = df_test[y_label].values
sub.to_csv(os.path.join(path_results,time.strftime('lgb_%Y%m%d%H%M_')+'%.5f.csv'%score), index=False)

----------- 0
[1]	valid_0's auc: 0.930003
Training until validation scores don't improve for 100 rounds
[2]	valid_0's auc: 0.930121
[3]	valid_0's auc: 0.930372
[4]	valid_0's auc: 0.930982
[5]	valid_0's auc: 0.932712
[6]	valid_0's auc: 0.932634
[7]	valid_0's auc: 0.93278
[8]	valid_0's auc: 0.93273
[9]	valid_0's auc: 0.931996
[10]	valid_0's auc: 0.932045
[11]	valid_0's auc: 0.932193
[12]	valid_0's auc: 0.932171
[13]	valid_0's auc: 0.93177
[14]	valid_0's auc: 0.932042
[15]	valid_0's auc: 0.932916
[16]	valid_0's auc: 0.933183
[17]	valid_0's auc: 0.933349
[18]	valid_0's auc: 0.933626
[19]	valid_0's auc: 0.933295
[20]	valid_0's auc: 0.933543
[21]	valid_0's auc: 0.933617
[22]	valid_0's auc: 0.933968
[23]	valid_0's auc: 0.934006
[24]	valid_0's auc: 0.934126
[25]	valid_0's auc: 0.934173
[26]	valid_0's auc: 0.934328
[27]	valid_0's auc: 0.93426
[28]	valid_0's auc: 0.934122
[29]	valid_0's auc: 0.934632
[30]	valid_0's auc: 0.934946
[31]	valid_0's auc: 0.934738
[32]	valid_0's auc: 0.934712
...

[75]	valid_0's auc: 0.933798
[76]	valid_0's auc: 0.934154
[77]	valid_0's auc: 0.934004
[78]	valid_0's auc: 0.933816
[79]	valid_0's auc: 0.933569
[80]	valid_0's auc: 0.933454
[81]	valid_0's auc: 0.933455
[82]	valid_0's auc: 0.933249
[83]	valid_0's auc: 0.9331
[84]	valid_0's auc: 0.93314
[85]	valid_0's auc: 0.932965
[86]	valid_0's auc: 0.933058
[87]	valid_0's auc: 0.932991
[88]	valid_0's auc: 0.932968
[89]	valid_0's auc: 0.93302
[90]	valid_0's auc: 0.933094
[91]	valid_0's auc: 0.93316
[92]	valid_0's auc: 0.933387
[93]	valid_0's auc: 0.933584
[94]	valid_0's auc: 0.933656
[95]	valid_0's auc: 0.934013
[96]	valid_0's auc: 0.933943
[97]	valid_0's auc: 0.934332
[98]	valid_0's auc: 0.93432
[99]	valid_0's auc: 0.934264
[100]	valid_0's auc: 0.934141
[101]	valid_0's auc: 0.933964
[102]	valid_0's auc: 0.933969
[103]	valid_0's auc: 0.934016
[104]	valid_0's auc: 0.93391
[105]	valid_0's auc: 0.934327
[106]	valid_0's auc: 0.934352
[107]	valid_0's auc: 0.934511
[108]	valid_0's auc: 0.934455
...

[94]	valid_0's auc: 0.940208
[95]	valid_0's auc: 0.940174
[96]	valid_0's auc: 0.940056
[97]	valid_0's auc: 0.940153
[98]	valid_0's auc: 0.940161
[99]	valid_0's auc: 0.940112
[100]	valid_0's auc: 0.940096
[101]	valid_0's auc: 0.940197
[102]	valid_0's auc: 0.94017
[103]	valid_0's auc: 0.940081
[104]	valid_0's auc: 0.940069
[105]	valid_0's auc: 0.940078
[106]	valid_0's auc: 0.940228
[107]	valid_0's auc: 0.9402
[108]	valid_0's auc: 0.940188
[109]	valid_0's auc: 0.940188
[110]	valid_0's auc: 0.940193
[111]	valid_0's auc: 0.940099
[112]	valid_0's auc: 0.940026
[113]	valid_0's auc: 0.940056
[114]	valid_0's auc: 0.940092
[115]	valid_0's auc: 0.940041
[116]	valid_0's auc: 0.94007
[117]	valid_0's auc: 0.940052
[118]	valid_0's auc: 0.939948
[119]	valid_0's auc: 0.939815
[120]	valid_0's auc: 0.939715
[121]	valid_0's auc: 0.939573
[122]	valid_0's auc: 0.939488
[123]	valid_0's auc: 0.93959
[124]	valid_0's auc: 0.93957
[125]	valid_0's auc: 0.939551
[126]	valid_0's auc: 0.939495
...

[76]	valid_0's auc: 0.934594
[77]	valid_0's auc: 0.934891
[78]	valid_0's auc: 0.93491
[79]	valid_0's auc: 0.934758
[80]	valid_0's auc: 0.934568
[81]	valid_0's auc: 0.934578
[82]	valid_0's auc: 0.934784
[83]	valid_0's auc: 0.934803
[84]	valid_0's auc: 0.934806
[85]	valid_0's auc: 0.934697
[86]	valid_0's auc: 0.934996
[87]	valid_0's auc: 0.935141
[88]	valid_0's auc: 0.935111
[89]	valid_0's auc: 0.935095
[90]	valid_0's auc: 0.934745
[91]	valid_0's auc: 0.934837
[92]	valid_0's auc: 0.93477
[93]	valid_0's auc: 0.934806
[94]	valid_0's auc: 0.934858
[95]	valid_0's auc: 0.935075
[96]	valid_0's auc: 0.935213
[97]	valid_0's auc: 0.93519
[98]	valid_0's auc: 0.935129
[99]	valid_0's auc: 0.934982
[100]	valid_0's auc: 0.935222
[101]	valid_0's auc: 0.935294
[102]	valid_0's auc: 0.935412
[103]	valid_0's auc: 0.935434
[104]	valid_0's auc: 0.935508
[105]	valid_0's auc: 0.93547
[106]	valid_0's auc: 0.935461
[107]	valid_0's auc: 0.935486
[108]	valid_0's auc: 0.935652
[109]	valid_0's auc: 0.935656
...

Unnamed: 0,name,importance
13,hour_cos,47414.185803
0,http_status_code,37308.611818
26,ip_transform_ts_diff_mean,31764.402039
1,hour,16959.964554
11,ts_diff3,11722.530132
3,day,8497.749582
29,ip_transform_ts_diff2_std,7382.382659
2,dayofweek,7142.379993
27,ip_transform_ts_diff_std,5098.562416
9,ts_diff1,4488.759857


0.9375430615690267

0.0018823084888451292
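
The two numbers above are the mean and standard deviation of the per-fold validation AUC. For readers unfamiliar with the metric, the sketch below computes AUC from ranks (the Mann-Whitney formulation) and then summarises fold scores the same way. The function name `auc_from_ranks` and the toy labels and scores are assumptions for illustration; the notebook itself uses `sklearn.metrics.roc_auc_score`.

```python
import numpy as np
import pandas as pd


def auc_from_ranks(y_true, y_score):
    """AUC as the probability that a random positive outranks a random
    negative; ties receive the average rank."""
    y_true = np.asarray(y_true)
    ranks = pd.Series(y_score).rank(method="average").to_numpy()
    n_pos = int((y_true == 1).sum())
    n_neg = int((y_true == 0).sum())
    rank_sum_pos = ranks[y_true == 1].sum()
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Two made-up "folds" of labels and predicted scores
fold_scores = [
    auc_from_ranks([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    auc_from_ranks([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.6]),
]
print(np.mean(fold_scores), np.std(fold_scores))
```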