# Analysis of Player In-Game Behavior and Spending
![png](./pics/mind.png)

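## I. Background
An accurate picture of each player's behavior and value matters for product improvement, ad placement, and efficient live operations (such as targeted promotions and bundle recommendations), and it helps give players a more personalized experience. The data comes from Brutal Age (《野蛮时代》), an SLG mobile game that is popular worldwide, and covers each player's in-game behavior during the first 7 days.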

Data source:
http://www.dcjingsai.com/common/cmpt/游戏玩家付费金额预测大赛_竞赛信息.html

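Main fields:
- user_id: unique player ID
- register_time: registration time
- pvp_battle_count: number of PVP battles
- pvp_lanch_count: number of PVP battles the player initiated
- pvp_win_count: number of PVP wins
- pve_battle_count: number of PVE battles
- pve_lanch_count: number of PVE battles the player initiated
- pve_win_count: number of PVE wins
- avg_online_minutes: average online time in minutes
- pay_price: amount paid
- pay_count: number of payments
- plus other in-game data (resources, troops, speed-ups, stronghold level, and so on; see the accompanying file)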

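## II. Objectives
1. Monitor how the game is performing
2. Raise the paid conversion rate
3. Raise the amount paid

Game operations are monitored through registrations, churn rate, paid conversion rate, and average revenue per user and per paying user.

The behavior of different spending groups and the payment behavior of players at different levels are analyzed so that marketing can be tailored to each spending group and placed at specific game levels.

Players are segmented and their spending is predicted so that appropriate measures can be taken.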

In [1]:
import pandas as pd
import numpy as np
from time import time

In [2]:
df = pd.read_csv('tap_fun_train.csv')
pd.set_option("display.max_columns", len(df.columns))
df.head()

Unnamed: 0,user_id,register_time,wood_add_value,wood_reduce_value,stone_add_value,stone_reduce_value,ivory_add_value,ivory_reduce_value,meat_add_value,meat_reduce_value,magic_add_value,magic_reduce_value,infantry_add_value,infantry_reduce_value,cavalry_add_value,cavalry_reduce_value,shaman_add_value,shaman_reduce_value,wound_infantry_add_value,wound_infantry_reduce_value,wound_cavalry_add_value,wound_cavalry_reduce_value,wound_shaman_add_value,wound_shaman_reduce_value,general_acceleration_add_value,general_acceleration_reduce_value,building_acceleration_add_value,building_acceleration_reduce_value,reaserch_acceleration_add_value,reaserch_acceleration_reduce_value,training_acceleration_add_value,training_acceleration_reduce_value,treatment_acceleraion_add_value,treatment_acceleration_reduce_value,bd_training_hut_level,bd_healing_lodge_level,bd_stronghold_level,bd_outpost_portal_level,bd_barrack_level,bd_healing_spring_level,bd_dolmen_level,bd_guest_cavern_level,bd_warehouse_level,bd_watchtower_level,bd_magic_coin_tree_level,bd_hall_of_war_level,bd_market_level,bd_hero_gacha_level,bd_hero_strengthen_level,bd_hero_pve_level,sr_scout_level,sr_training_speed_level,sr_infantry_tier_2_level,sr_cavalry_tier_2_level,sr_shaman_tier_2_level,sr_infantry_atk_level,sr_cavalry_atk_level,sr_shaman_atk_level,sr_infantry_tier_3_level,sr_cavalry_tier_3_level,sr_shaman_tier_3_level,sr_troop_defense_level,sr_infantry_def_level,sr_cavalry_def_level,sr_shaman_def_level,sr_infantry_hp_level,sr_cavalry_hp_level,sr_shaman_hp_level,sr_infantry_tier_4_level,sr_cavalry_tier_4_level,sr_shaman_tier_4_level,sr_troop_attack_level,sr_construction_speed_level,sr_hide_storage_level,sr_troop_consumption_level,sr_rss_a_prod_levell,sr_rss_b_prod_level,sr_rss_c_prod_level,sr_rss_d_prod_level,sr_rss_a_gather_level,sr_rss_b_gather_level,sr_rss_c_gather_level,sr_rss_d_gather_level,sr_troop_load_level,sr_rss_e_gather_level,sr_rss_e_prod_level,sr_outpost_durability_level,sr_outpost_tier_2_level,sr_healing_spa
ce_level,sr_gathering_hunter_buff_level,sr_healing_speed_level,sr_outpost_tier_3_level,sr_alliance_march_speed_level,sr_pvp_march_speed_level,sr_gathering_march_speed_level,sr_outpost_tier_4_level,sr_guest_troop_capacity_level,sr_march_size_level,sr_rss_help_bonus_level,pvp_battle_count,pvp_lanch_count,pvp_win_count,pve_battle_count,pve_lanch_count,pve_win_count,avg_online_minutes,pay_price,pay_count,prediction_pay_price
0,1,2018-02-02 19:47:15,20125.0,3700.0,0.0,0.0,0.0,0.0,16375.0,2000.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,50,0,50,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.333333,0.0,0,0.0
1,1593,2018-01-26 00:01:05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.333333,0.0,0,0.0
2,1594,2018-01-26 00:01:58,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1.166667,0.0,0,0.0
3,1595,2018-01-26 00:02:13,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3.166667,0.0,0,0.0
4,1596,2018-01-26 00:02:46,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,2.333333,0.0,0,0.0


## III. Data Cleaning
### 1. Duplicate check

In [11]:
df.duplicated('user_id').sum()

0

In [12]:
df.shape

(2288007, 109)

### 2. Missing-value check

In [23]:
def null_info(df):
    # One row of null counts and one row of null percentages, per column.
    # Built directly because DataFrame.append was removed in pandas 2.0.
    nulls = df.isnull().sum()
    info = pd.DataFrame({'null values (nb)': nulls,
                         'null values (%)': nulls / df.shape[0] * 100}).T
    display(info)
null_info(df)

(Output: a table with two rows, `null values (nb)` and `null values (%)`, and one column for each of the 109 fields from `user_id` through `prediction_pay_price`. Every entry is 0.0.)


### 3. Outlier check
Inspect the distribution of each attribute.

In [24]:
df.describe()

Unnamed: 0,user_id,wood_add_value,wood_reduce_value,stone_add_value,stone_reduce_value,ivory_add_value,ivory_reduce_value,meat_add_value,meat_reduce_value,magic_add_value,magic_reduce_value,infantry_add_value,infantry_reduce_value,cavalry_add_value,cavalry_reduce_value,shaman_add_value,shaman_reduce_value,wound_infantry_add_value,wound_infantry_reduce_value,wound_cavalry_add_value,wound_cavalry_reduce_value,wound_shaman_add_value,wound_shaman_reduce_value,general_acceleration_add_value,general_acceleration_reduce_value,building_acceleration_add_value,building_acceleration_reduce_value,reaserch_acceleration_add_value,reaserch_acceleration_reduce_value,training_acceleration_add_value,training_acceleration_reduce_value,treatment_acceleraion_add_value,treatment_acceleration_reduce_value,bd_training_hut_level,bd_healing_lodge_level,bd_stronghold_level,bd_outpost_portal_level,bd_barrack_level,bd_healing_spring_level,bd_dolmen_level,bd_guest_cavern_level,bd_warehouse_level,bd_watchtower_level,bd_magic_coin_tree_level,bd_hall_of_war_level,bd_market_level,bd_hero_gacha_level,bd_hero_strengthen_level,bd_hero_pve_level,sr_scout_level,sr_training_speed_level,sr_infantry_tier_2_level,sr_cavalry_tier_2_level,sr_shaman_tier_2_level,sr_infantry_atk_level,sr_cavalry_atk_level,sr_shaman_atk_level,sr_infantry_tier_3_level,sr_cavalry_tier_3_level,sr_shaman_tier_3_level,sr_troop_defense_level,sr_infantry_def_level,sr_cavalry_def_level,sr_shaman_def_level,sr_infantry_hp_level,sr_cavalry_hp_level,sr_shaman_hp_level,sr_infantry_tier_4_level,sr_cavalry_tier_4_level,sr_shaman_tier_4_level,sr_troop_attack_level,sr_construction_speed_level,sr_hide_storage_level,sr_troop_consumption_level,sr_rss_a_prod_levell,sr_rss_b_prod_level,sr_rss_c_prod_level,sr_rss_d_prod_level,sr_rss_a_gather_level,sr_rss_b_gather_level,sr_rss_c_gather_level,sr_rss_d_gather_level,sr_troop_load_level,sr_rss_e_gather_level,sr_rss_e_prod_level,sr_outpost_durability_level,sr_outpost_tier_2_level,sr_healing_space_level,sr_ga
thering_hunter_buff_level,sr_healing_speed_level,sr_outpost_tier_3_level,sr_alliance_march_speed_level,sr_pvp_march_speed_level,sr_gathering_march_speed_level,sr_outpost_tier_4_level,sr_guest_troop_capacity_level,sr_march_size_level,sr_rss_help_bonus_level,pvp_battle_count,pvp_lanch_count,pvp_win_count,pve_battle_count,pve_lanch_count,pve_win_count,avg_online_minutes,pay_price,pay_count,prediction_pay_price
count,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0,2288007.0
mean,1529543.0,454306.9,369843.3,189778.8,137607.4,80756.23,36131.7,585515.5,354810.2,75389.54,47253.99,143.6104,226.7821,128.2639,178.0055,119.5425,156.853,135.3035,91.98413,116.5028,84.34621,110.4308,84.76289,282.5446,192.0313,205.5574,142.7759,132.6712,73.65257,210.8085,65.02153,10.41641,0.3699333,1.306311,1.02675,2.098073,1.764193,1.283829,0.9244596,0.9694551,0.4408474,0.9343398,0.9125252,1.146429,0.1024254,0.7723403,0.2011803,0.179756,0.2919528,0.3310995,0.2992207,0.02388411,0.02294705,0.01969137,0.0238911,0.02248857,0.01851262,0.0002749117,0.0002609258,0.000233828,0.0007141587,0.0004453658,0.0004305057,0.0004034079,0.0006276205,0.0005472011,0.0005004355,1.748246e-05,1.966777e-05,1.486009e-05,2.05419e-05,0.1221509,0.03474815,0.03261528,0.03545531,0.002529713,0.0005773584,0.0424837,0.02153971,0.00170017,0.0004160826,0.01793876,0.02884388,0.001925693,0.007764836,0.1104227,0.04435432,0.02869703,0.03393871,0.02206287,0.0008789309,0.0004077785,0.0006643336,0.0006070786,5.681801e-06,2.185308e-06,1.398597e-05,6.118862e-06,2.148313,1.059639,0.9838589,2.844738,2.832409,2.556749,10.20749,0.5346691,0.05770699,1.793146
std,939939.3,4958667.0,3737720.0,4670620.0,3370166.0,2220540.0,1782499.0,5868629.0,3400632.0,966289.2,881122.3,1781.468,1738.488,1334.977,1347.096,5958.519,5958.508,1333.236,1287.586,1009.972,950.375,5890.129,5841.053,3001.938,2619.487,1427.626,1283.584,1516.142,1339.241,1942.369,1554.042,49.63815,13.7252,1.971849,1.811002,2.520964,2.358619,2.032131,1.900493,2.057987,1.622136,1.960826,1.886617,2.110285,0.4410252,1.821324,0.9732027,0.8920109,1.304187,0.737634,0.9860072,0.1526881,0.1497347,0.1389375,0.2531641,0.2463319,0.2248276,0.01657819,0.01615109,0.01528965,0.06896717,0.05298984,0.05274195,0.05005509,0.06631231,0.06172524,0.05829155,0.004181168,0.004434794,0.003854851,0.01266496,0.5955828,0.267412,0.2689738,0.2994074,0.07521916,0.03492143,0.3108336,0.2309065,0.06119213,0.02830898,0.1939576,0.3072257,0.06361588,0.1408421,0.4808994,0.2058811,0.2384797,0.2508571,0.202167,0.02963374,0.03922663,0.0513667,0.04809365,0.002383647,0.002192641,0.008967674,0.006271804,11.67797,9.074459,8.95128,12.76245,12.7182,11.84737,38.95946,22.63835,0.7090886,88.46303
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,749992.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0
50%,1419095.0,42038.0,9830.0,0.0,0.0,0.0,0.0,34587.0,6470.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,45.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.833333,0.0,0.0,0.0
75%,2299006.0,153118.0,98557.0,0.0,0.0,0.0,0.0,136001.0,66054.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,53.0,0.0,232.0,46.0,50.0,0.0,100.0,0.0,0.0,0.0,2.0,1.0,4.0,3.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,4.833333,0.0,0.0,0.0
max,3190530.0,1239962000.0,799587500.0,1214869000.0,796237800.0,574496100.0,448197200.0,1470644000.0,888953700.0,263722800.0,263722900.0,874918.0,878883.0,353852.0,370375.0,8767537.0,8769162.0,803996.0,803996.0,347967.0,347967.0,8763256.0,8704561.0,795484.0,764637.0,411362.0,306515.0,377796.0,338437.0,671607.0,529587.0,14048.0,14642.0,20.0,19.0,23.0,23.0,21.0,21.0,22.0,20.0,21.0,21.0,21.0,18.0,19.0,19.0,17.0,21.0,9.0,17.0,1.0,1.0,1.0,13.0,11.0,10.0,1.0,1.0,1.0,14.0,11.0,11.0,11.0,11.0,11.0,11.0,1.0,1.0,1.0,11.0,15.0,11.0,11.0,9.0,9.0,9.0,11.0,9.0,9.0,7.0,9.0,10.0,9.0,9.0,10.0,1.0,11.0,10.0,9.0,1.0,10.0,11.0,10.0,1.0,3.0,9.0,8.0,2054.0,2051.0,1904.0,509.0,509.0,488.0,2049.667,7457.95,105.0,32977.81


In [37]:
# Descriptive statistics for amount paid (paying users only)
df[df.pay_price>0].pay_price.describe()

count    41439.000000
mean        29.521143
std        165.655561
min          0.990000
25%          0.990000
50%          1.990000
75%         11.970000
max       7457.950000
Name: pay_price, dtype: float64

In [27]:
# Descriptive statistics for number of payments (paying users only)
df[df.pay_count>0].pay_count.describe()

count    41439.000000
mean         3.186226
std          4.218311
min          1.000000
25%          1.000000
50%          2.000000
75%          4.000000
max        105.000000
Name: pay_count, dtype: float64

In [36]:
# Check for rows where the number of payments is above 0 but the amount paid is 0
len(df[(df.pay_count>0) & (df.pay_price==0)])

0

In [39]:
# Check for rows where the number of payments is 0 but the amount paid is above 0
len(df[(df.pay_count==0) & (df.pay_price>0)])

0

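## IV. Overall Operations and Revenue
### 1. Daily registrations
![png](./pics/register.png)

**Analysis**

The chart shows registrations peaking on February 19, most likely because promotion was stepped up or a marketing campaign ran, which produced a temporary spike.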
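
The notebook shows the registration chart without the code behind it. The following is a minimal sketch of how daily registration counts could be derived from `register_time`, assuming the column parses as a timestamp. The function name `daily_registrations` and the toy rows are illustrative, not part of the original analysis.

```python
import pandas as pd

def daily_registrations(frame: pd.DataFrame) -> pd.Series:
    """Count registrations per calendar day from the register_time column."""
    days = pd.to_datetime(frame["register_time"]).dt.date
    return days.value_counts().sort_index()

# Toy rows for illustration only.
toy = pd.DataFrame({
    "user_id": [1, 2, 3],
    "register_time": [
        "2018-01-26 00:01:05",
        "2018-01-26 00:01:58",
        "2018-02-02 19:47:15",
    ],
})
daily = daily_registrations(toy)
print(daily)
```

Applied to the full frame, `daily_registrations(df).plot()` would give a line chart of the same kind as the one above.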

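### 2. Churn rate (new users)
Churned user: a user whose average online time is below 1 minute.

These users typically registered because a marketing campaign or promotion drew them in, but they are not actually interested in the game.

Churn rate = churned users / registered users

In [59]:
# Count churned and registered users, then compute the churn rate
churned = len(df[df.avg_online_minutes<1])
register = len(df)
Churn_rate = churned/register
print('Churn rate: {:.2%}'.format(Churn_rate))

Churn rate: 36.12%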


### 3. Paid conversion rate
Active user: a user who plays more than 30 minutes per day on average.

Paid conversion rate = paying users / active users

In [53]:
payment = len(df[df.pay_price>0])
active = len(df[df.avg_online_minutes>30])
Conversion_rate = payment/active
print('Paid conversion rate: {:.2%}'.format(Conversion_rate))

Paid conversion rate: 29.06%


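### 4. Average revenue per user
ARPU and ARPPU help gauge how well a mobile game monetizes.

Average revenue per user: ARPU = total amount paid / active users

Average revenue per paying user: ARPPU = total amount paid / paying users

In [60]:
total_price = df.pay_price.sum()
ARPU = total_price/active
print('Average revenue per user: {:.2f}'.format(ARPU))

Average revenue per user: 8.58


In [61]:
ARPPU = total_price/payment
print('Average revenue per paying user: {:.2f}'.format(ARPPU))

Average revenue per paying user: 29.52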


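**Analysis**

A strong mobile game has an ARPU above 5 yuan, a typical one falls between 3 and 5 yuan, and an ARPU below 3 yuan indicates weak performance. With an ARPU of 8.58 measured against active users, this game monetizes well.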

## V. Behavior of Different Spending Groups
(Identify the factors that affect spending, then work out how to raise it.)

Players are split by amount paid into non-paying, low-spending, and high-spending users, labeled 0, 1, and 2 respectively.

Spending mainly goes to resource consumption and speed-ups, so the analysis uses the consumption rate of each resource and the usage rate of each speed-up as metrics, together with the initiation rate and win rate for PVP and PVE, to compare how the spending groups behave.


The metrics are computed as follows:

Resource consumption rate:
- wood_reduce_value/wood_add_value
- stone_reduce_value/stone_add_value
- ivory_reduce_value/ivory_add_value
- meat_reduce_value/meat_add_value
- magic_reduce_value/magic_add_value

Speed-up usage rate:
- general_acceleration_reduce_value/general_acceleration_add_value
- building_acceleration_reduce_value/building_acceleration_add_value
- reaserch_acceleration_reduce_value/reaserch_acceleration_add_value
- training_acceleration_reduce_value/training_acceleration_add_value
- treatment_acceleration_reduce_value/treatment_acceleraion_add_value

PVP initiation rate:
- pvp_lanch_count/pvp_battle_count

PVP win rate:
- pvp_win_count/pvp_battle_count

PVE initiation rate:
- pve_lanch_count/pve_battle_count

PVE win rate:
- pve_win_count/pve_battle_count

### 1. Tiering players by amount paid

In [6]:
# Mean amount paid per paying user
pay_m = df[df.pay_price>0].pay_price.mean()
print(pay_m)

29.521143367347563


In [14]:
# Tier users by amount paid, using the mean above (rounded to 29.52) as the cut-off
# 0: non-paying user
# 1: low-spending user
# 2: high-spending user
def pay_level(x):
    if x == 0:
        return 0
    elif x <= 29.52:
        return 1
    else:
        return 2

df['pay_level'] = df.pay_price.apply(pay_level)

In [19]:
df.to_csv('pay.csv',index=False)

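### 2. Resource consumption rate by player group
![png](./pics/rate_materials.png)

**Analysis**

Low-spending players consume every resource at a higher rate than non-paying players. High-spending players consume most resources at a higher rate than low-spending players, but their consumption rates for stone and magic are lower.
Overall, the ivory consumption rate differs the most between groups: non-paying players consume almost no ivory, while for high-spending players ivory has the highest consumption rate of all resources. This suggests that steering players toward consuming more ivory may help raise the amount they pay.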
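
The notebook does not include the code behind this chart. The sketch below shows one way the per-group consumption rates could be computed, assuming each group's rate is its total consumed divided by its total gained. The original chart may have aggregated differently, for example by averaging per-player rates. The helper name `consumption_rate_by_level` and the toy values are illustrative.

```python
import pandas as pd

MATERIALS = ["wood", "stone", "ivory", "meat", "magic"]

def consumption_rate_by_level(frame: pd.DataFrame) -> pd.DataFrame:
    """Per pay_level, total consumed / total gained for each resource."""
    grouped = frame.groupby("pay_level")
    rates = {}
    for m in MATERIALS:
        gained = grouped[f"{m}_add_value"].sum()
        used = grouped[f"{m}_reduce_value"].sum()
        # A group that never gained a resource gets NaN rather than inf.
        rates[m] = used / gained.where(gained > 0)
    return pd.DataFrame(rates)

# Toy rows for illustration only: two non-paying players and one high spender.
toy = pd.DataFrame({"pay_level": [0, 0, 2]})
for m in MATERIALS:
    toy[f"{m}_add_value"] = [100.0, 300.0, 200.0]
    toy[f"{m}_reduce_value"] = [10.0, 30.0, 150.0]

rates = consumption_rate_by_level(toy)
print(rates)
```

The same pattern applies to the speed-up columns and to the PVP and PVE counts by swapping in the corresponding numerator and denominator columns.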

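### 3. Speed-up usage rate by player group
![png](./pics/rate_ac.png)

**Analysis**

Low-spending players use every type of speed-up at a higher rate than non-paying players. High-spending players use general, research, training, and healing speed-ups at a higher rate than low-spending players; only the building speed-up usage rate is lower. This suggests that building speed-ups are popular early on but do little to raise the amount low-spending users pay.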

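### 4. PVP initiation rate and win rate by player group
![png](./pics/rate_pvp.png)

**Analysis**

Both the PVP initiation rate and the PVP win rate rise from non-paying to low-spending players and again from low-spending to high-spending players. Players who initiate more PVP battles and win more of them are therefore more likely to be high spenders. Strengthening social features and making PVP more engaging should help raise the amount players pay.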

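### 5. PVE initiation rate and win rate by player group
![png](./pics/rate_pve.png)

**Analysis**

PVE initiation and win rates differ little between groups, so PVE has little bearing on how much users pay. It also indicates that PVE currently holds limited appeal for paying users.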

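## VI. Payment Behavior by Game Level
Analyzing how players at different stronghold levels pay makes it possible to set up different marketing plans at different levels, and it provides data to support product improvements.

The metrics, with their formulas, are:
- Players reached: count of user_id
- Number of payments: pay_count
- Total amount paid: pay_price
- Paying players (pay_num): SUM(IF [pay_price]>0 THEN 1 END)
- Conversion rate among all players (pay_rate): [pay_num]/COUNT([user_id])
- Conversion rate among active players (pay_active_rate): [pay_num]/SUM(IF [avg_online_minutes]>30 THEN 1 END)
- Average amount paid per paying player (avg_pay_price): SUM([pay_price])/[pay_num]
- Average number of payments per paying player (avg_pay_count): SUM([pay_count])/[pay_num]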
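
The formulas above are written in a BI-tool style. As a rough pandas equivalent, the sketch below aggregates the same metrics per stronghold level. It assumes `bd_stronghold_level` is the level column used for the chart. The function name `level_summary`, the helper columns `is_payer` and `is_active`, and the toy rows are illustrative.

```python
import pandas as pd

def level_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Aggregate payment metrics per stronghold level."""
    work = frame.assign(
        is_payer=frame["pay_price"] > 0,
        is_active=frame["avg_online_minutes"] > 30,
    )
    out = work.groupby("bd_stronghold_level").agg(
        players=("user_id", "count"),
        pay_count=("pay_count", "sum"),
        pay_price=("pay_price", "sum"),
        pay_num=("is_payer", "sum"),
        active_num=("is_active", "sum"),
    )
    # Zero denominators become NaN so the ratios stay well defined.
    payers = out["pay_num"].where(out["pay_num"] > 0)
    actives = out["active_num"].where(out["active_num"] > 0)
    out["pay_rate"] = out["pay_num"] / out["players"]
    out["pay_active_rate"] = out["pay_num"] / actives
    out["avg_pay_price"] = out["pay_price"] / payers
    out["avg_pay_count"] = out["pay_count"] / payers
    return out

# Toy rows for illustration only.
toy = pd.DataFrame({
    "user_id": [1, 2, 3, 4],
    "bd_stronghold_level": [1, 1, 9, 9],
    "pay_price": [0.0, 0.0, 0.99, 9.99],
    "pay_count": [0, 0, 1, 3],
    "avg_online_minutes": [0.5, 40.0, 50.0, 45.0],
})
summary = level_summary(toy)
print(summary)
```

Because `pay_active_rate` divides all payers at a level by the active players at that level, it can exceed 1 whenever some payers are not active.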

![png](./pics/level_pay.png)

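**Analysis**

As stronghold level rises, the number of players falls steadily: of more than two million registered users, fewer than ten thousand reach level 11 or above. All of the high-spending players are among those fewer than ten thousand.

Level-10 players make the most payments, and players at levels 10 and 11 account for the highest total amount paid. The number of paying players is highest at level 9. The conversion rate rises with level, climbs sharply from level 9 onward, and reaches 100% at level 14. Level 9 is therefore a good point for marketing measures that help players level up and reach the next tier more smoothly.

Average amount paid and average number of payments both peak at level 22, with level 23 below level 22. Normally both would be expected to grow with level, but only 2 players have reached level 22 and only 1 has reached level 23. The sample is too small to support a conclusion, so this needs further observation.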

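## VII. Segmenting Players by Behavioral Preference

Players are segmented by behavioral preference with a quadrant method. First, usage or consumption rates are computed for resources, troops, and speed-ups. Each category is then split by whether its rate is high or low, which determines the class each player belongs to. Finally, each player is tagged with a behavioral preference: resource consumer, troop trainer, or speed-up user. Different marketing measures can then be applied to each group.

- Resources (e.g. wood, stone, ivory, meat, magic)

- Troops (e.g. warriors (`infantry`), beast tamers (`cavalry`), shamans (`shaman`))

- Speed-ups (e.g. general, building, research, training, healing)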

### 1. Computation

In [None]:
# Compute usage/consumption rates for resources, troops, and speed-ups
df['wood_usedRate'] = df['wood_reduce_value']/(df['wood_add_value']+0.0001)
df['stone_usedRate'] = df['stone_reduce_value']/(df['stone_add_value']+0.0001)
df['ivory_usedRate'] = df['ivory_reduce_value']/(df['ivory_add_value']+0.0001)
df['meat_usedRate'] = df['meat_reduce_value']/(df['meat_add_value']+0.0001)
df['magic_usedRate'] = df['magic_reduce_value']/(df['magic_add_value']+0.0001)

df['infantry_rate']=df['infantry_reduce_value']/(df['infantry_add_value']+0.0001)
df['cavalry_rate']=df['cavalry_reduce_value']/(df['cavalry_add_value']+0.0001)
df['shaman_rate']=df['shaman_reduce_value']/(df['shaman_add_value']+0.0001)
df['wound_infantry_rate']=df['wound_infantry_reduce_value']/(df['wound_infantry_add_value']+0.0001)
df['wound_cavalry_rate']=df['wound_cavalry_reduce_value']/(df['wound_cavalry_add_value']+0.0001)
df['wound_shaman_rate']=df['wound_shaman_reduce_value']/(df['wound_shaman_add_value']+0.0001)

df['general_rate']=df['general_acceleration_reduce_value']/(df['general_acceleration_add_value']+0.0001)
df['building_rate']=df['building_acceleration_reduce_value']/(df['building_acceleration_add_value']+0.0001)
df['reaserch_rate']=df['reaserch_acceleration_reduce_value']/(df['reaserch_acceleration_add_value']+0.0001)
df['training_rate']=df['training_acceleration_reduce_value']/(df['training_acceleration_add_value']+0.0001)
df['treatment_rate']=df['treatment_acceleration_reduce_value']/(df['treatment_acceleraion_add_value']+0.0001)

In [6]:
# Total each player's overall usage of resources, troops, and speed-ups
df['materials']=df['wood_usedRate']+df['stone_usedRate']+df['ivory_usedRate']+df['meat_usedRate']+df['magic_usedRate']
df['army']=df['infantry_rate']+df['cavalry_rate']+df['shaman_rate']+df['wound_infantry_rate']+df['wound_cavalry_rate']+df['wound_shaman_rate']
df['acceleration']=df['general_rate']+df['building_rate']+df['reaserch_rate']+df['training_rate']+df['treatment_rate']

### 2. Grouping

In [7]:
# Split players by usage preference: 1 if above the category mean, else 0
MA_mean = df['materials'].mean()
AR_mean = df['army'].mean()
AC_mean = df['acceleration'].mean()

df['MA_class'] = (df['materials'] > MA_mean).astype(int)
df['AR_class'] = (df['army'] > AR_mean).astype(int)
df['AC_class'] = (df['acceleration'] > AC_mean).astype(int)

# Concatenate the three flags into a three-digit class code such as '101'
df['class'] = df['MA_class'].astype(str)+df['AR_class'].astype(str)+df['AC_class'].astype(str)

### 3. Tagging

- 111: "Fast resource consumer|Troop trainer|Speed-up user"
- 110: "Fast resource consumer|Troop trainer"
- 101: "Fast resource consumer|Speed-up user"
- 011: "Troop trainer|Speed-up user"
- 100: "Fast resource consumer"
- 010: "Troop trainer"
- 001: "Speed-up user"
- 000: "No particular preference"

In [None]:
# Map each three-digit class code to its behavior tag
TAGS = {
    '111': "Fast resource consumer|Troop trainer|Speed-up user",
    '110': "Fast resource consumer|Troop trainer",
    '101': "Fast resource consumer|Speed-up user",
    '011': "Troop trainer|Speed-up user",
    '100': "Fast resource consumer",
    '010': "Troop trainer",
    '001': "Speed-up user",
    '000': "No particular preference",
}

df['tag'] = df['class'].map(TAGS)

In [15]:
df[['user_id','MA_class','AR_class','AC_class','class','tag']].head(2)

Unnamed: 0,user_id,MA_class,AR_class,AC_class,class,tag
0,1,0,0,0,0,No particular preference
1,1593,0,0,0,0,No particular preference


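## VIII. Predicting Player Spending (Whether 45-Day Spending Reaches 100)
Predicting how much players will spend lets the operations team target different spending groups with different marketing plans.

### 1. Data preprocessing
#### 1.1 Balancing the sample
Players whose 45-day total spending is at least 100 yuan are far fewer than players who spend less than 100 yuan, so using the data as-is would give an imbalanced sample, and imbalance hurts a classification model.

For example, suppose there are 100 players, only one of whom pays more than 100 while the rest pay less. Predicting "below 100" for all 100 players still yields 99% accuracy, yet such a model is clearly useless.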
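
The 100-player example can be checked with a few lines of plain Python. This is a worked illustration with made-up labels, not code from the original notebook: it shows that the always-negative predictor scores 99% accuracy while finding none of the high spenders.

```python
# 100 players: one spends at least 100 (label 1), the other 99 do not (label 0).
y_true = [1] + [0] * 99
# A degenerate "model" that predicts "below 100" for everyone.
y_pred = [0] * 100

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

accuracy = correct / len(y_true)
recall = tp / (tp + fn)

print("accuracy:", accuracy)
print("recall:", recall)
```

Accuracy alone hides the failure; recall on the positive class exposes it, which is why the sample is balanced before training.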

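In [71]:
# Players whose 45-day spending is at least 100
df_above100 = df[df.prediction_pay_price>=100]
# Balance the sample: draw the same number of players from those below 100
n_above100 = len(df_above100)
df_below100 = df[df.prediction_pay_price<100].sample(n=n_above100)

In [74]:
# Combine the two groups (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df_predict = pd.concat([df_above100, df_below100], ignore_index=False)
df_predict.head(2)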

Unnamed: 0,user_id,register_time,wood_add_value,wood_reduce_value,stone_add_value,stone_reduce_value,ivory_add_value,ivory_reduce_value,meat_add_value,meat_reduce_value,magic_add_value,magic_reduce_value,infantry_add_value,infantry_reduce_value,cavalry_add_value,cavalry_reduce_value,shaman_add_value,shaman_reduce_value,wound_infantry_add_value,wound_infantry_reduce_value,wound_cavalry_add_value,wound_cavalry_reduce_value,wound_shaman_add_value,wound_shaman_reduce_value,general_acceleration_add_value,general_acceleration_reduce_value,building_acceleration_add_value,building_acceleration_reduce_value,reaserch_acceleration_add_value,reaserch_acceleration_reduce_value,training_acceleration_add_value,training_acceleration_reduce_value,treatment_acceleraion_add_value,treatment_acceleration_reduce_value,bd_training_hut_level,bd_healing_lodge_level,bd_stronghold_level,bd_outpost_portal_level,bd_barrack_level,bd_healing_spring_level,bd_dolmen_level,bd_guest_cavern_level,bd_warehouse_level,bd_watchtower_level,bd_magic_coin_tree_level,bd_hall_of_war_level,bd_market_level,bd_hero_gacha_level,bd_hero_strengthen_level,bd_hero_pve_level,sr_scout_level,sr_training_speed_level,sr_infantry_tier_2_level,sr_cavalry_tier_2_level,sr_shaman_tier_2_level,sr_infantry_atk_level,sr_cavalry_atk_level,sr_shaman_atk_level,sr_infantry_tier_3_level,sr_cavalry_tier_3_level,sr_shaman_tier_3_level,sr_troop_defense_level,sr_infantry_def_level,sr_cavalry_def_level,sr_shaman_def_level,sr_infantry_hp_level,sr_cavalry_hp_level,sr_shaman_hp_level,sr_infantry_tier_4_level,sr_cavalry_tier_4_level,sr_shaman_tier_4_level,sr_troop_attack_level,sr_construction_speed_level,sr_hide_storage_level,sr_troop_consumption_level,sr_rss_a_prod_levell,sr_rss_b_prod_level,sr_rss_c_prod_level,sr_rss_d_prod_level,sr_rss_a_gather_level,sr_rss_b_gather_level,sr_rss_c_gather_level,sr_rss_d_gather_level,sr_troop_load_level,sr_rss_e_gather_level,sr_rss_e_prod_level,sr_outpost_durability_level,sr_outpost_tier_2_level,sr_healing_spa
ce_level,sr_gathering_hunter_buff_level,sr_healing_speed_level,sr_outpost_tier_3_level,sr_alliance_march_speed_level,sr_pvp_march_speed_level,sr_gathering_march_speed_level,sr_outpost_tier_4_level,sr_guest_troop_capacity_level,sr_march_size_level,sr_rss_help_bonus_level,pvp_battle_count,pvp_lanch_count,pvp_win_count,pve_battle_count,pve_lanch_count,pve_win_count,avg_online_minutes,pay_price,pay_count,prediction_pay_price
155,1747,2018-01-26 02:36:55,5475019.0,7377434.0,4926938.0,4552369.0,2507392.0,1772039.0,6040642.0,4213417.0,1671444.0,1873795.0,4434,1504,847,1484,813,1534,20,559,13,322,0,288,3375,9350,3540,3527,4178,15610,5426,2301,180,0,7,5,12,12,12,0,12,0,0,11,10,0,0,0,0,0,0,0,0,0,0,4,5,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5,6,1,2,0,0,6,2,0,0,2,4,0,1,3,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,34,34,29,311.166667,156.88,12,166.86
381,1973,2018-01-26 08:02:21,55495770.0,12635052.0,54377106.0,12088437.0,23307953.0,7683341.0,64528810.0,9544950.0,1932278.0,1791441.0,1851,194,1863,201,1864,220,194,231,201,243,220,244,14881,11824,17059,16517,21896,22955,281,2325,2,0,11,10,11,10,0,11,11,10,10,10,10,9,10,10,11,10,0,0,0,0,0,4,4,4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,7,0,0,0,1,0,0,0,3,0,7,4,6,0,0,5,5,4,1,2,2,2,0,0,0,0,2,2,2,16,16,15,143.166667,406.87,13,8982.78


#### 1.2 Feature selection

In [78]:
# Drop variables that have little to do with the amount paid
df_predict = df_predict.drop(['user_id','register_time'],axis=1)

#### 1.3 Separating features and label

In [79]:
features = df_predict.drop('prediction_pay_price',axis=1)
label = df_predict.prediction_pay_price
# Label players who pay at least 100 as 1, all others as 0
label = label.apply(lambda x:1 if x>=100 else 0)

#### 1.4 Feature scaling

Scaling narrows the gaps between feature ranges so that a supervised learner treats every feature on an equal footing.

In [81]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler() # default=(0, 1)

features_final = scaler.fit_transform(features)
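
To make the transformation concrete, here is a minimal NumPy sketch of the min-max formula x' = (x - min) / (max - min). Unlike the cell above, which fits the scaler on all rows, this variant takes the minimum and range from a training array only and reuses them for new rows; that is a common alternative, not what the original notebook does. The function name `min_max_scale` and the arrays are illustrative.

```python
import numpy as np

def min_max_scale(train: np.ndarray, other: np.ndarray):
    """Scale columns using the minimum and range of `train` only."""
    lo = train.min(axis=0)
    span = train.max(axis=0) - lo
    # Constant columns map to 0 instead of dividing by zero.
    span[span == 0] = 1.0
    return (train - lo) / span, (other - lo) / span

# Toy arrays for illustration only.
train = np.array([[0.0, 10.0], [5.0, 30.0], [10.0, 20.0]])
new_rows = np.array([[2.5, 40.0]])

train_scaled, new_scaled = min_max_scale(train, new_rows)
print(train_scaled)
print(new_scaled)
```

Rows scaled with training statistics can fall outside [0, 1], as the second column of `new_scaled` does here.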

#### 1.5 Splitting into training and test sets

In [82]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features_final,
                                                    label,
                                                    test_size = 0.2,
                                                    random_state = 0)

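### 2. Model selection and construction
#### 2.1 Choosing evaluation metrics

Accuracy is one useful measure of model quality, and the F-β score additionally takes precision and recall into account. The aim is to find as many high-spending players as possible, so recall matters somewhat more here.

Precision: TP / (TP + FP), the share of predicted positives that are truly positive. It measures how accurately positives are found.

Recall: TP / (TP + FN), the share of actual positives that are predicted positive. It measures how completely positives are found.

$$F_\beta = \frac{(1+\beta^2)\cdot \text{precision}\cdot \text{recall}}{\beta^2\cdot \text{precision} + \text{recall}}$$

The larger β is, the more the score favors recall, so β = 1.5 is used here.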
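
As a worked check of the formula, the sketch below computes F-β in plain Python from hypothetical confusion-matrix counts (the numbers are made up for illustration). Recall is lower than precision in this example, so moving β from 1 to 1.5 lowers the score.

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    """F-beta score: weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return 0.0 if denom == 0 else (1 + b2) * precision * recall / denom

# Hypothetical counts: 80 true positives, 20 false positives, 40 false negatives.
tp, fp, fn = 80, 20, 40
precision = tp / (tp + fp)
recall = tp / (tp + fn)

f1 = f_beta(precision, recall, beta=1.0)
f15 = f_beta(precision, recall, beta=1.5)
print(round(precision, 4), round(recall, 4), round(f1, 4), round(f15, 4))
```

With β = 1.5 the score moves toward the recall value, which matches the stated goal of not missing high-spending players.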

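#### 2.2 Choosing a model
Based on experience, SVM, LR (logistic regression), and DT (decision tree) are tried as initial models.

They are compared on time cost, accuracy, and fbeta_score at different sample sizes.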

In [83]:
from sklearn.metrics import fbeta_score, accuracy_score
def train_predict(learner, sample_size, X_train, y_train, X_test, y_test): 
    
    results = {}
    
    # Time the training step
    start = time() # Get start time
    learner = learner.fit(X_train[:sample_size], y_train[:sample_size])
    end = time() # Get end time
    results['train_time'] = end - start

    # Time the prediction step
    start = time() # Get start time
    predictions_test = learner.predict(X_test)
    end = time() # Get end time
    results['pred_time'] = end - start

    # Compute accuracy on test set using accuracy_score()
    results['acc_test'] = accuracy_score(y_test, predictions_test)

    # Compute F-score on the test set which is y_test
    results['f_test'] = fbeta_score(y_test, predictions_test, beta=1.5)

    # Success
    print("{} trained on {} samples.".format(learner.__class__.__name__, sample_size))
    print('train time is: {}'.format(results['train_time']))
    print('predition time is: {}'.format(results['pred_time']))
    #print('accuracy on training set is: {}'.format(results['acc_train']))
    print('accuracy on testing set is: {}'.format(results['acc_test']))
    #print('fbeta score on training set is: {}'.format(results['f_train']))
    print('fbata score on testinig set is: {}'.format(results['f_test']))
    print('------------------------------------------------------')

    # Return the metrics so the caller can store them
    return results

In [86]:
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Initialize the three models
clf_A = SVC(random_state=42)
clf_B = LogisticRegression(random_state=42)
clf_C = DecisionTreeClassifier(random_state=42)

# Calculate the number of samples for 50% and 100% of the training data
samples_100 = len(y_train)
samples_50 = int(samples_100 * 0.5)

# Collect results on the learners
results = {}
for clf in [clf_A, clf_B, clf_C]:
    clf_name = clf.__class__.__name__
    results[clf_name] = {}
    for i, samples in enumerate([samples_50, samples_100]):
        results[clf_name][i] = train_predict(clf, samples, X_train, y_train, X_test, y_test)

SVC trained on 3687 samples.
train time is: 0.460521936416626
predition time is: 0.14394903182983398
accuracy on testing set is: 0.9425162689804772
fbata score on testinig set is: 0.9302526079834469
------------------------------------------------------
SVC trained on 7374 samples.
train time is: 1.776181936264038
predition time is: 0.3858299255371094
accuracy on testing set is: 0.9517353579175705
fbata score on testinig set is: 0.9447059830522982
------------------------------------------------------
LogisticRegression trained on 3687 samples.
train time is: 0.3741939067840576
predition time is: 0.0031228065490722656
accuracy on testing set is: 0.9490238611713666
fbata score on testinig set is: 0.9398128273375119
------------------------------------------------------
LogisticRegression trained on 7374 samples.
train time is: 0.06888222694396973
predition time is: 0.0003268718719482422
accuracy on testing set is: 0.9533622559652929
fbata score on testinig set is: 0.945677319852626
----

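Comparing the three models' training time, prediction time, accuracy, and F-beta score at different sample sizes shows:

1. At the same sample size, the three models have nearly the same accuracy and F-beta score on the test set.
2. The support vector machine takes more time than the other two models.
3. As the sample size grows, logistic regression's accuracy and F-beta score rise, whereas the decision tree's fall.

Logistic regression is therefore chosen as the final model.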

#### 2.3 Building and tuning the model
Grid search with cross-validation is used to pick the best hyperparameters for the model.

In [93]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, fbeta_score, accuracy_score
from sklearn.linear_model import LogisticRegression

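# liblinear supports both the 'l1' and 'l2' penalties searched below;
# the current default solver (lbfgs) does not accept 'l1'
clf = LogisticRegression(random_state=42, solver='liblinear')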

parameters = { 'penalty':['l1','l2'],'C':[0.5,1,1.5]}

scorer = make_scorer(fbeta_score, beta=1.5)

grid_obj = GridSearchCV(clf, parameters, scoring=scorer)

grid_fit = grid_obj.fit(X_train, y_train)

best_clf = grid_fit.best_estimator_

predictions  = (clf.fit(X_train, y_train)).predict(X_test)

best_predictions = best_clf.predict(X_test)

print("Unoptimized model\n------")
print("Accuracy score on testing data: {:.4f}".format(accuracy_score(y_test, predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, predictions, beta = 1.5)))
print("\nOptimized Model\n------")
print("Final accuracy score on the testing data: {:.4f}".format(accuracy_score(y_test, best_predictions)))
print("Final F-score on the testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta = 1.5)))


Unoptimized model
------
Accuracy score on testing data: 0.9534
F-score on testing data: 0.9457

Optimized Model
------
Final accuracy score on the testing data: 0.9664
Final F-score on the testing data: 0.9613
