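## 1. Data description
The competition data consists of roughly 620,000 training rows and 200,000 test rows, with 13 fields in total. `uuid` is the unique sample identifier, `eid` is the ID of the access behavior, and `udmap` holds behavior attributes, whose keys `key1` to `key9` denote different attributes such as project name and project ID. `common_ts` is the time at which the app access record occurred (a millisecond timestamp). The remaining fields `x1` to `x8` are anonymized user attributes. The `target` field is the prediction target: whether the user is a new user.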

## 2. Evaluation metric
Submissions are scored with f1_score; the higher the score, the better.
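
As a reminder of what the metric measures, the sketch below computes F1 for the positive class from raw label lists. The helper name `f1_binary` and the toy labels are illustrative assumptions, not the competition's scoring code; in practice `sklearn.metrics.f1_score` is the usual choice.

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are both 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels, made up for illustration.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]
score = f1_binary(y_true, y_pred)
print(round(score, 4))
```

Because F1 here only rewards correctly identified positives, a model that always predicts 0 scores 0 regardless of how many negatives it gets right.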

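## 3. Evaluation and ranking
1. The data for this task is available for download; participants develop and debug their algorithms locally and submit results on the competition page.<br />
2. The leaderboard is sorted by score from high to low and ranks each team by its best historical score.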

In [121]:
import pandas as pd 
import numpy as np 

train_path = "/kaggle/input/Xunfei-dataset/train.csv"
train_df = pd.read_csv(train_path)
print(f"len(train_df):{(len(train_df))}")
train_df.head()


len(train_df):620356


Unnamed: 0,uuid,eid,udmap,common_ts,x1,x2,x3,x4,x5,x6,x7,x8,target
0,0,26,"{""key3"":""67804"",""key2"":""650""}",1689673468244,4,0,41,107,206,1,0,1,0
1,1,26,"{""key3"":""67804"",""key2"":""484""}",1689082941469,4,0,41,24,283,4,8,1,0
2,2,8,unknown,1689407393040,4,0,41,71,288,4,7,1,0
3,3,11,unknown,1689467815688,1,3,41,17,366,1,6,1,0
4,4,26,"{""key3"":""67804"",""key2"":""650""}",1689491751442,0,3,41,92,383,4,8,1,0


The `udmap` column contains missing values, recorded as the string `unknown`.

Handling missing values

In [122]:
# Measure the share of missing values in the 'udmap' column to decide whether to drop it
udmap = train_df['udmap'].values
print(f"len(np.unique(udmap)):{len(np.unique(udmap))}")
np.sum(udmap=="unknown")/len(udmap)

len(np.unique(udmap)):92481


0.4214096422054433

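About 42% of the values in the `udmap` column are missing, so the column is dropped in the hope of getting better results.<br />
In addition, the `uuid` column (the unique sample identifier) is just an increasing sequence number that carries no predictive signal, so it is dropped as well.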

In [123]:
# Get a quick statistical summary of the data
train_df.describe()

Unnamed: 0,uuid,eid,common_ts,x1,x2,x3,x4,x5,x6,x7,x8,target
count,620356.0,620356.0,620356.0,620356.0,620356.0,620356.0,620356.0,620356.0,620356.0,620356.0,620356.0,620356.0
mean,310177.5,22.148287,1689317000000.0,2.675723,1.10635,40.974499,82.86008,224.909096,2.901681,5.86372,0.855459,0.140566
std,179081.496134,12.139122,274686500.0,1.719279,1.174157,1.373016,44.109037,114.305062,1.444797,2.575854,0.351638,0.347574
min,0.0,0.0,1688382000000.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,155088.75,11.0,1689088000000.0,1.0,0.0,41.0,51.0,133.0,1.0,6.0,1.0,0.0
50%,310177.5,26.0,1689377000000.0,4.0,1.0,41.0,86.0,241.0,4.0,7.0,1.0,0.0
75%,465266.25,34.0,1689563000000.0,4.0,2.0,41.0,107.0,313.0,4.0,7.0,1.0,0.0
max,620355.0,42.0,1689696000000.0,4.0,3.0,74.0,151.0,413.0,4.0,9.0,1.0,1.0


In [125]:
print(train_df.columns)


Index(['uuid', 'eid', 'udmap', 'common_ts', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6',
       'x7', 'x8', 'target'],
      dtype='object')


In [126]:
train_df.drop(['uuid','udmap'],axis=1,inplace=True)
train_df.head()

Unnamed: 0,eid,common_ts,x1,x2,x3,x4,x5,x6,x7,x8,target
0,26,1689673468244,4,0,41,107,206,1,0,1,0
1,26,1689082941469,4,0,41,24,283,4,8,1,0
2,8,1689407393040,4,0,41,71,288,4,7,1,0
3,11,1689467815688,1,3,41,17,366,1,6,1,0
4,26,1689491751442,0,3,41,92,383,4,8,1,0


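## Parameter notes (pandas.DataFrame.drop())
axis=1: 
This specifies the axis to operate on. In pandas, axis=0 refers to rows and axis=1 refers to columns.
So axis=1 means the labels passed to drop() are column names: columns are deleted, not rows.

inplace=True: 
This modifies the original DataFrame (train_df here) directly instead of returning a new one.
By default inplace is False, in which case drop() returns a new DataFrame and leaves the original untouched.
With inplace=True the original DataFrame is modified in place and the call returns None.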
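
The behavior described above can be checked on a small frame. This is a minimal sketch; the toy DataFrame `df` and its values are made up for illustration and are not the competition data.

```python
import pandas as pd

# Toy frame, made up for illustration.
df = pd.DataFrame({"uuid": [0, 1], "eid": [26, 8], "target": [0, 1]})

# Default (inplace=False): a new DataFrame is returned and df is unchanged.
dropped = df.drop(["uuid"], axis=1)
print("uuid" in df.columns, "uuid" in dropped.columns)

# inplace=True: df itself is modified and the call returns None.
result = df.drop(["uuid"], axis=1, inplace=True)
print(result)
print(list(df.columns))

# axis=0 drops rows by index label instead of columns.
without_first_row = df.drop([0], axis=0)
print(list(without_first_row.index))
```

An equivalent spelling for the column case is `df.drop(columns=["uuid"])`, which avoids having to remember which axis number means what.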

Next, inspect the millisecond timestamp (the `common_ts` column) to understand what it represents.

In [127]:
import datetime
# Take the first millisecond timestamp from the DataFrame
millisecond_timestamp = train_df['common_ts'].values[0]

# Convert milliseconds to seconds
second_timestamp = millisecond_timestamp / 1000

# Convert to a datetime object (in the machine's local time zone)
dt_object = datetime.datetime.fromtimestamp(second_timestamp)

# Format as year-month-day
formatted_date = dt_object.strftime('%Y-%m-%d')

print(formatted_date)

2023-07-18


The timestamp falls in July 2023, so this is recent data.
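
The same conversion can be applied to a whole column at once with `pd.to_datetime`. The sketch below uses two `common_ts` values copied from the first rows of the table; the `sample` frame and the `hour` column are illustrative assumptions, not part of the original notebook. Note that `pd.to_datetime(..., unit="ms")` interprets the values as UTC, whereas `datetime.fromtimestamp` uses the local time zone.

```python
import pandas as pd

# Two common_ts values copied from the first rows of the training data.
sample = pd.DataFrame({"common_ts": [1689673468244, 1689082941469]})

# unit="ms" tells pandas the integers are milliseconds since the Unix epoch (UTC).
ts = pd.to_datetime(sample["common_ts"], unit="ms")

sample["date"] = ts.dt.strftime("%Y-%m-%d")
sample["hour"] = ts.dt.hour  # hypothetical extra feature, not used in the original
print(sample)
```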

In [128]:
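# Scale the timestamp to a fraction of a 365-day period (seconds since the epoch modulo 31,536,000).
# Note: this ignores leap years, so it is shifted by 13 days relative to true progress through 2023.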
train_df['common_ts'] = ((train_df['common_ts'] / 1000) % 31536000) / 31536000 
train_df.head()

Unnamed: 0,eid,common_ts,x1,x2,x3,x4,x5,x6,x7,x8,target
0,26,0.579194,4,0,41,107,206,1,0,1,0
1,26,0.560469,4,0,41,24,283,4,8,1,0
2,8,0.570757,4,0,41,71,288,4,7,1,0
3,11,0.572673,1,3,41,17,366,1,6,1,0
4,26,0.573432,0,3,41,92,383,4,8,1,0
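
The modulo-based scaling treats every year since 1970 as 365 days long, so the resulting fraction is not exactly the progress through calendar year 2023. The sketch below compares it with a calendar-based computation in UTC; the helper `year_progress` is a hypothetical name introduced here, not part of the original notebook.

```python
from datetime import datetime, timezone


def year_progress(ms):
    """Fraction of the calendar year (UTC) elapsed at a millisecond timestamp."""
    moment = datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
    start = datetime(moment.year, 1, 1, tzinfo=timezone.utc)
    end = datetime(moment.year + 1, 1, 1, tzinfo=timezone.utc)
    # Dividing two timedeltas yields a plain float.
    return (moment - start) / (end - start)


ms = 1689673468244  # first common_ts value in the training data
naive = ((ms / 1000) % 31536000) / 31536000  # the notebook's formula
exact = year_progress(ms)
print(round(naive, 6), round(exact, 6), round(naive - exact, 6))
```

The two values differ by 13/365, one day for each leap year between 1970 and 2023. Since all timestamps in the training set fall within roughly two weeks of July 2023, the shift is the same for every row and does not change their relative order.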


Looking at the summary again, the value 41 appears to dominate the `x3` column, so measure its share.

In [63]:
np.sum(train_df['x3'].values==41)/len(train_df)

0.9959652199704686

In [64]:
# List the distinct values in the 'x3' column
np.unique(train_df['x3'].values)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34,
       35, 36, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52,
       53, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70,
       71, 72, 73, 74])
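
Both checks (the share of the dominant value and the list of distinct values) can also be read off a single `value_counts` call. This is a minimal sketch on a toy Series; the values are made up for illustration and only stand in for `train_df['x3']`.

```python
import pandas as pd

# Toy stand-in for train_df['x3']; values are made up for illustration.
x3 = pd.Series([41, 41, 41, 41, 5, 41, 14, 41, 41, 41], name="x3")

# normalize=True returns relative frequencies, sorted from most to least common.
shares = x3.value_counts(normalize=True)
print(shares)

top_value = shares.index[0]  # most frequent value
top_share = shares.iloc[0]   # its share of all rows
print(top_value, top_share)
```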

In [105]:
# Check whether any value of 'x3' is strongly associated with target=1
for i in range(75):
    print(f"x3={i} and target=1: {np.sum((train_df['x3'].values == i) & (train_df['target'].values == 1))/len(train_df)}")

x3=0 and target=1: 1.6119776386461967e-06
x3=1 and target=1: 0.0
x3=2 and target=1: 8.059888193230983e-06
x3=3 and target=1: 2.095570930240056e-05
x3=4 and target=1: 0.0
x3=5 and target=1: 0.0001015545912347104
x3=6 and target=1: 0.0
x3=7 and target=1: 3.546350805021633e-05
x3=8 and target=1: 0.0
x3=9 and target=1: 0.0
x3=10 and target=1: 0.0
x3=11 and target=1: 8.059888193230983e-06
x3=12 and target=1: 0.0
x3=13 and target=1: 1.6119776386461967e-06
x3=14 and target=1: 9.188272540283322e-05
x3=15 and target=1: 2.2567686941046753e-05
x3=16 and target=1: 1.6119776386461967e-06
x3=17 and target=1: 1.6119776386461967e-06
x3=18 and target=1: 1.6119776386461967e-06
x3=19 and target=1: 0.0
x3=20 and target=1: 2.5791642218339148e-05
x3=21 and target=1: 0.0
x3=22 and target=1: 0.0
x3=23 and target=1: 0.0
x3=24 and target=1: 4.83593291593859e-06
x3=25 and target=1: 3.2239552772923935e-06
x3=26 and target=1: 1.6119776386461967e-06
x3=27 and target=1: 0.0
x3=28 and target=1: 0.0
x3=29 and target=1

The `x8` column also consists mostly of the value 1, so it gets the same treatment as `x3`.

In [66]:
np.sum(train_df['x8'].values==1)/len(train_df)

0.8554588010755115

In [67]:
# List the distinct values in the 'x8' column
np.unique(train_df['x8'].values)

array([0, 1])

In [91]:
# Check whether the value of 'x8' is strongly associated with 'target'
for i in range(2):
    print(f"x8={i} and target=0: {np.sum((train_df['x8'].values == i) & (train_df['target'].values == 0))/len(train_df)}")
    print(f"x8={i} and target=1: {np.sum((train_df['x8'].values == i) & (train_df['target'].values == 1))/len(train_df)}")


x8=0 and target=0: 0.10861505329198073
x8=0 and target=1: 0.03592614563250779
x8=1 and target=0: 0.7508188846404322
x8=1 and target=1: 0.10463991643507921


Apply the same analysis to the `x1` column.

In [101]:
# List the distinct values in the 'x1' column
np.unique(train_df['x1'].values)

array([0, 1, 2, 3, 4])

In [102]:
for i in range(5):
    print(f"x1={i} and target=0: {np.sum((train_df['x1'].values == i) & (train_df['target'].values == 0))/len(train_df)}")
print("\n")
for i in range(5):    
    print(f"x1={i} and target=1: {np.sum((train_df['x1'].values == i) & (train_df['target'].values == 1))/len(train_df)}")


x1=0 and target=0: 0.17263796916609173
x1=1 and target=0: 0.13413588326702733
x1=2 and target=0: 0.02315928273442991
x1=3 and target=0: 0.0006077155697696161
x1=4 and target=0: 0.5288930871950944


x1=0 and target=1: 0.02725370593659125
x1=1 and target=1: 0.020070733578783796
x1=2 and target=1: 0.007544055348864201
x1=3 and target=1: 7.576294901637125e-05
x1=4 and target=1: 0.08562180425433139


Next, analyze the `eid` column.

In [103]:
np.unique(train_df['eid'].values)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41, 42])

In [106]:
for i in range(43):
    print(f"eid={i} and target=0: {np.sum((train_df['eid'].values == i) & (train_df['target'].values == 0))/len(train_df)}")
    print(f"eid={i} and target=1: {np.sum((train_df['eid'].values == i) & (train_df['target'].values == 1))/len(train_df)}")


eid=0 and target=0: 0.008034096551012644
eid=0 and target=1: 0.000598043703937739
eid=1 and target=0: 0.0006109395250469085
eid=1 and target=1: 0.0005754760169966923
eid=2 and target=0: 0.07084802919613899
eid=2 and target=1: 0.011477280787160921
eid=3 and target=0: 0.0020359277576101464
eid=3 and target=1: 0.0011090406153885833
eid=4 and target=0: 0.0006576868765676482
eid=4 and target=1: 0.0006222233685174319
eid=5 and target=0: 0.04726802029802243
eid=5 and target=1: 0.0061319629374101325
eid=6 and target=0: 3.2239552772923935e-06
eid=6 and target=1: 0.0
eid=7 and target=0: 1.7731754025108163e-05
eid=7 and target=1: 0.0
eid=8 and target=0: 0.0750633507211988
eid=8 and target=1: 0.00810018763419714
eid=9 and target=0: 0.0016039177504529657
eid=9 and target=1: 0.0009494548291626098
eid=10 and target=0: 0.0017393238720992462
eid=10 and target=1: 0.00017086962969649686
eid=11 and target=0: 0.07804067341977831
eid=11 and target=1: 0.00851930182024515
eid=12 and target=0: 0.00887393690074