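Being able to analyze members' behavior patterns means we can also predict, to some extent, when a member is likely to cancel, and take preventive measures in advance.

This chapter uses a **decision tree** to analyze the causes of cancellation.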

## Knock 41: Load the data and reshape the usage data

In [1]:
import pandas as pd
customer = pd.read_csv('customer_join.csv')
uselog_months = pd.read_csv('uselog_months.csv')

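We build a dataset that contains only the current month's usage and the usage from one month earlier. If the prediction required the past six months of data, it could not be made for members who cancel within their first five months.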

In [2]:
year_months = list(uselog_months['年月'].unique())
uselog = pd.DataFrame()
for i in range(1,len(year_months)):
    # Usage in the current month (count_0); rename returns a new DataFrame, so no copy warning is raised
    tmp = uselog_months.loc[uselog_months['年月']==year_months[i]].rename(columns={'count':'count_0'})
    # Usage one month earlier (count_1), without the 年月 column so that only customer_id is shared
    tmp_before = uselog_months.loc[uselog_months['年月']==year_months[i-1]].drop(columns='年月')
    tmp_before = tmp_before.rename(columns={'count':'count_1'})
    tmp = pd.merge(tmp,tmp_before,on='customer_id',how='left')
    uselog = pd.concat([uselog,tmp],ignore_index=True)
uselog.head()

Unnamed: 0,年月,customer_id,count_0,count_1
0,201805,AS002855,5,4.0
1,201805,AS009373,4,3.0
2,201805,AS015233,7,
3,201805,AS015315,3,6.0
4,201805,AS015739,5,7.0
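
The loop above pairs each month's usage count with the same customer's count from the previous month. A minimal sketch of that pairing on made-up data follows; the customer IDs and counts are hypothetical and do not come from `uselog_months.csv`.

```python
import pandas as pd

# Hypothetical toy usage log: customer A has two months, customer B only one
toy = pd.DataFrame({
    '年月': [201804, 201805, 201805],
    'customer_id': ['A', 'A', 'B'],
    'count': [4, 5, 7],
})

current = toy.loc[toy['年月'] == 201805].rename(columns={'count': 'count_0'})
previous = (toy.loc[toy['年月'] == 201804, ['customer_id', 'count']]
               .rename(columns={'count': 'count_1'}))

# A left join keeps customers with no record in the previous month (count_1 becomes NaN)
lagged = pd.merge(current, previous, on='customer_id', how='left')
print(lagged)
```

Customer B has no row for the earlier month, so `count_1` is missing for B, just as it is for `AS015233` in the output above.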


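## Knock 42: Build the data for cancelled members in the month before cancellation
At this gym, a member who submits a cancellation request by the end of a month leaves at the end of the following month. For example, a member who left on September 30, 2018 had already submitted the request in August, so September's data cannot be used to prevent the cancellation. We therefore treat August 2018 as the cancellation month and use the data from July, one month earlier, to predict the probability that a request will be submitted in August.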
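
The month arithmetic in this rule can be checked with a short sketch. It uses `pd.DateOffset`, and the variable names are illustrative only.

```python
import pandas as pd

end_date = pd.Timestamp('2018-09-30')                # last day of membership
request_month = end_date - pd.DateOffset(months=1)   # month in which the request was filed
feature_month = end_date - pd.DateOffset(months=2)   # month whose usage feeds the prediction

print(request_month.strftime('%Y%m'))  # 201808
print(feature_month.strftime('%Y%m'))  # 201807
```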

First, narrow the data down to members who have cancelled.

In [3]:
from dateutil.relativedelta import relativedelta

# .copy() gives an independent DataFrame, so the assignments below raise no SettingWithCopyWarning
exit_customer = customer.loc[customer['is_deleted']==1].copy()
exit_customer['end_date'] = pd.to_datetime(exit_customer['end_date'])
# Month of the cancellation request: one month before the last day of membership
exit_customer['exit_date'] = exit_customer['end_date'] - pd.DateOffset(months=1)


In [4]:
exit_customer['年月'] = exit_customer['exit_date'].dt.strftime('%Y%m')
uselog['年月'] = uselog['年月'].astype(str)
exit_uselog = pd.merge(uselog,exit_customer,on=['customer_id','年月'],how='left')
print(len(uselog))
exit_uselog.head()

33851



Unnamed: 0,年月,customer_id,count_0,count_1,name,class,gender,start_date,end_date,campaign_id,...,campaign_name,mean,median,max,min,count,routine_flg,calc_date,membership_period,exit_date
0,201805,AS002855,5,4.0,,,,,NaT,,...,,,,,,,,,,
1,201805,AS009373,4,3.0,,,,,NaT,,...,,,,,,,,,,
2,201805,AS015233,7,,,,,,NaT,,...,,,,,,,,,,
3,201805,AS015315,3,6.0,,,,,NaT,,...,,,,,,,,,,
4,201805,AS015739,5,7.0,,,,,NaT,,...,,,,,,,,,,


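Customer attributes are attached only to rows for a cancelled member's month before cancellation, so most rows of the merged data contain missing values. We drop those rows.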

In [5]:
exit_uselog = exit_uselog.dropna(subset=['name'])
print(len(exit_uselog))
print(len(exit_uselog['customer_id'].unique()))
exit_uselog.head()

1104
1104


Unnamed: 0,年月,customer_id,count_0,count_1,name,class,gender,start_date,end_date,campaign_id,...,campaign_name,mean,median,max,min,count,routine_flg,calc_date,membership_period,exit_date
19,201805,AS055680,3,3.0,XXXXX,C01,M,2018-03-01,2018-06-30,CA1,...,通常,3.0,3.0,3.0,3.0,2.0,0.0,2018-06-30,3.0,2018-05-30 00:00:00
57,201805,AS169823,2,3.0,XX,C01,M,2017-11-01,2018-06-30,CA1,...,通常,3.0,3.0,4.0,2.0,4.0,1.0,2018-06-30,7.0,2018-05-30 00:00:00
110,201805,AS305860,5,3.0,XXXX,C01,M,2017-06-01,2018-06-30,CA1,...,通常,3.333333,3.0,5.0,2.0,1.0,0.0,2018-06-30,12.0,2018-05-30 00:00:00
128,201805,AS363699,5,3.0,XXXXX,C01,M,2018-02-01,2018-06-30,CA1,...,通常,3.333333,3.0,5.0,2.0,2.0,0.0,2018-06-30,4.0,2018-05-30 00:00:00
147,201805,AS417696,1,4.0,XX,C03,F,2017-09-01,2018-06-30,CA1,...,通常,2.0,1.0,4.0,1.0,1.0,0.0,2018-06-30,9.0,2018-05-30 00:00:00
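
The left merge followed by `dropna` keeps only the usage rows that found a matching cancelled member. The sketch below, on hypothetical toy frames, shows that this selects the same rows as an inner join, provided `name` is never missing in the customer table itself.

```python
import pandas as pd

# Hypothetical toy data: only customer B has cancelled
usage = pd.DataFrame({'customer_id': ['A', 'B', 'C'], 'count_0': [5, 4, 7]})
exited = pd.DataFrame({'customer_id': ['B'], 'name': ['XX']})

left_then_drop = (pd.merge(usage, exited, on='customer_id', how='left')
                    .dropna(subset=['name'])
                    .reset_index(drop=True))
inner = pd.merge(usage, exited, on='customer_id', how='inner')

print(left_then_drop)
print(inner)
```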


## Knock 43: Build the data for continuing members
Continuing members have no cancellation month, so data from any month can be used.

In [6]:
conti_customer = customer.loc[customer['is_deleted']==0]
conti_uselog = pd.merge(uselog,conti_customer,on=['customer_id'],how='left')
print(len(conti_uselog))
conti_uselog = conti_uselog.dropna(subset=['name'])
print(len(conti_uselog))

33851
27422


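Any month can be used for continuing members, but there are 1,104 rows for cancelled members against 27,422 rows for continuing members, so using them as they are would give an imbalanced dataset. We therefore undersample the continuing members so that they also have one row per member.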

In [7]:
# The first line shuffles the rows; the second keeps only the first row for each duplicated customer_id
conti_uselog = conti_uselog.sample(frac=1).reset_index(drop=True)
conti_uselog = conti_uselog.drop_duplicates(subset='customer_id')
print(len(conti_uselog))
conti_uselog.head()

2842


Unnamed: 0,年月,customer_id,count_0,count_1,name,class,gender,start_date,end_date,campaign_id,...,price,campaign_name,mean,median,max,min,count,routine_flg,calc_date,membership_period
0,201811,IK612587,4,5.0,XXXX,C01,M,2016-02-01,,CA1,...,10500.0,通常,4.666667,4.5,8.0,2.0,5.0,1.0,2019-04-30,38.0
1,201808,AS275753,4,6.0,XXXXX,C03,M,2017-02-01,,CA1,...,6000.0,通常,4.75,4.5,7.0,3.0,5.0,1.0,2019-04-30,26.0
2,201811,PL863680,6,6.0,XXXXX,C01,M,2016-12-01,,CA1,...,10500.0,通常,5.5,5.5,8.0,3.0,5.0,1.0,2019-04-30,28.0
3,201808,HI835158,8,8.0,XXXXX,C03,M,2017-02-01,,CA1,...,6000.0,通常,5.0,5.0,8.0,2.0,5.0,1.0,2019-04-30,26.0
4,201809,HI994023,7,,XXX,C03,M,2018-09-07,,CA1,...,6000.0,通常,7.428571,7.0,11.0,5.0,5.0,1.0,2019-04-30,7.0
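
Because `sample(frac=1)` is unseeded, the month chosen for each member changes from run to run. A minimal sketch of the same shuffle-then-deduplicate idea with a fixed seed, on hypothetical toy data; passing `random_state` is an addition for reproducibility, not something the cell above does.

```python
import pandas as pd

# Hypothetical log: customer A has three monthly rows, customer B has two
log = pd.DataFrame({
    'customer_id': ['A', 'A', 'A', 'B', 'B'],
    '年月': ['201805', '201806', '201807', '201806', '201807'],
})

# Shuffle with a fixed seed, then keep the first row encountered for each customer
one_per_customer = (log.sample(frac=1, random_state=0)
                       .drop_duplicates(subset='customer_id')
                       .reset_index(drop=True))
print(one_per_customer)
```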


In [8]:
predict_data = pd.concat([conti_uselog,exit_uselog],ignore_index=True)
print(len(predict_data))
predict_data.head()

3946



Unnamed: 0,calc_date,campaign_id,campaign_name,class,class_name,count,count_0,count_1,customer_id,end_date,...,max,mean,median,membership_period,min,name,price,routine_flg,start_date,年月
0,2019-04-30,CA1,通常,C01,オールタイム,5.0,4,5.0,IK612587,,...,8.0,4.666667,4.5,38.0,2.0,XXXX,10500.0,1.0,2016-02-01,201811
1,2019-04-30,CA1,通常,C03,ナイト,5.0,4,6.0,AS275753,,...,7.0,4.75,4.5,26.0,3.0,XXXXX,6000.0,1.0,2017-02-01,201808
2,2019-04-30,CA1,通常,C01,オールタイム,5.0,6,6.0,PL863680,,...,8.0,5.5,5.5,28.0,3.0,XXXXX,10500.0,1.0,2016-12-01,201811
3,2019-04-30,CA1,通常,C03,ナイト,5.0,8,8.0,HI835158,,...,8.0,5.0,5.0,26.0,2.0,XXXXX,6000.0,1.0,2017-02-01,201808
4,2019-04-30,CA1,通常,C03,ナイト,5.0,7,,HI994023,,...,11.0,7.428571,7.0,7.0,5.0,XXX,6000.0,1.0,2018-09-07,201809


## Knock 44: Compute the membership period as of the prediction month

In [9]:
predict_data['period'] = 0
predict_data['now_date'] = pd.to_datetime(predict_data['年月'],format='%Y%m')
predict_data['start_date'] = pd.to_datetime(predict_data['start_date'])

# Whole months from the start date to the prediction month, collected in a list
# and assigned in one step to avoid chained assignment
periods = []
for now_date,start_date in zip(predict_data['now_date'],predict_data['start_date']):
    delta = relativedelta(now_date,start_date)
    periods.append(delta.years*12 + delta.months)
predict_data['period'] = periods
predict_data.head()

Unnamed: 0,calc_date,campaign_id,campaign_name,class,class_name,count,count_0,count_1,customer_id,end_date,...,median,membership_period,min,name,price,routine_flg,start_date,年月,period,now_date
0,2019-04-30,CA1,通常,C01,オールタイム,5.0,4,5.0,IK612587,,...,4.5,38.0,2.0,XXXX,10500.0,1.0,2016-02-01,201811,33,2018-11-01
1,2019-04-30,CA1,通常,C03,ナイト,5.0,4,6.0,AS275753,,...,4.5,26.0,3.0,XXXXX,6000.0,1.0,2017-02-01,201808,18,2018-08-01
2,2019-04-30,CA1,通常,C01,オールタイム,5.0,6,6.0,PL863680,,...,5.5,28.0,3.0,XXXXX,10500.0,1.0,2016-12-01,201811,23,2018-11-01
3,2019-04-30,CA1,通常,C03,ナイト,5.0,8,8.0,HI835158,,...,5.0,26.0,2.0,XXXXX,6000.0,1.0,2017-02-01,201808,18,2018-08-01
4,2019-04-30,CA1,通常,C03,ナイト,5.0,7,,HI994023,,...,7.0,7.0,5.0,XXX,6000.0,1.0,2018-09-07,201809,0,2018-09-01
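
The period is the number of whole months between the start date and the first day of the prediction month. A small sketch that reproduces two of the values shown above; the helper name `months_between` is hypothetical.

```python
from datetime import datetime
from dateutil.relativedelta import relativedelta

def months_between(now_date, start_date):
    """Whole months elapsed from start_date to now_date."""
    delta = relativedelta(now_date, start_date)
    return delta.years * 12 + delta.months

# Started 2016-02-01, predicted for 201811: 2 years and 9 months
print(months_between(datetime(2018, 11, 1), datetime(2016, 2, 1)))  # 33
# Started 2018-09-07, predicted for 201809: less than one whole month
print(months_between(datetime(2018, 9, 1), datetime(2018, 9, 7)))   # 0
```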


## Knock 45: Remove missing values

In [10]:
predict_data.isnull().sum()

calc_date               0
campaign_id             0
campaign_name           0
class                   0
class_name              0
count                   0
count_0                 0
count_1               272
customer_id             0
end_date             2842
exit_date            2842
gender                  0
is_deleted              0
max                     0
mean                    0
median                  0
membership_period       0
min                     0
name                    0
price                   0
routine_flg             0
start_date              0
年月                      0
period                  0
now_date                0
dtype: int64

In [11]:
predict_data = predict_data.dropna(subset=['count_1'])
predict_data.isnull().sum()

calc_date               0
campaign_id             0
campaign_name           0
class                   0
class_name              0
count                   0
count_0                 0
count_1                 0
customer_id             0
end_date             2622
exit_date            2622
gender                  0
is_deleted              0
max                     0
mean                    0
median                  0
membership_period       0
min                     0
name                    0
price                   0
routine_flg             0
start_date              0
年月                      0
period                  0
now_date                0
dtype: int64
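
Only `count_1` is passed to `dropna`, because `end_date` and `exit_date` are missing by design for continuing members. A minimal sketch of that selective drop on hypothetical toy data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'count_1': [5.0, np.nan, 6.0],
    'end_date': [None, None, '2018-06-30'],   # missing for continuing members
})
print(toy.isnull().sum())

# Only rows lacking count_1 are dropped; missing end_date values are left alone
cleaned = toy.dropna(subset=['count_1'])
print(len(cleaned))
```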

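## Knock 46: Reshape string variables so the model can process them
Data that represents categories, such as gender, is called a **categorical variable**. To use such data in a model, each category is converted into a 0/1 flag; these flags are called **dummy variables**.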

In [12]:
# Narrow the data down to the columns used for prediction
target_col = ['campaign_name','class_name','gender','count_1','routine_flg','period','is_deleted']
predict_data = predict_data[target_col]
predict_data.head()

Unnamed: 0,campaign_name,class_name,gender,count_1,routine_flg,period,is_deleted
0,通常,オールタイム,M,5.0,1.0,33,0.0
1,通常,ナイト,M,6.0,1.0,18,0.0
2,通常,オールタイム,M,6.0,1.0,23,0.0
3,通常,ナイト,M,8.0,1.0,18,0.0
5,通常,オールタイム,F,7.0,1.0,5,0.0


In [13]:
predict_data = pd.get_dummies(predict_data)
predict_data.head()

Unnamed: 0,count_1,routine_flg,period,is_deleted,campaign_name_入会費半額,campaign_name_入会費無料,campaign_name_通常,class_name_オールタイム,class_name_デイタイム,class_name_ナイト,gender_F,gender_M
0,5.0,1.0,33,0.0,0,0,1,1,0,0,0,1
1,6.0,1.0,18,0.0,0,0,1,0,0,1,0,1
2,6.0,1.0,23,0.0,0,0,1,1,0,0,0,1
3,8.0,1.0,18,0.0,0,0,1,0,0,1,0,1
5,7.0,1.0,5,0.0,0,0,1,1,0,0,1,0


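Some of the dummy columns are redundant: within each group, one column is fully determined by the others (a member who is not flagged as female must be male). We delete one column from each group.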

In [14]:
del predict_data['campaign_name_通常']
del predict_data['class_name_ナイト']
del predict_data['gender_M']
predict_data.head()

Unnamed: 0,count_1,routine_flg,period,is_deleted,campaign_name_入会費半額,campaign_name_入会費無料,class_name_オールタイム,class_name_デイタイム,gender_F
0,5.0,1.0,33,0.0,0,0,1,0,0
1,6.0,1.0,18,0.0,0,0,0,0,0
2,6.0,1.0,23,0.0,0,0,1,0,0
3,8.0,1.0,18,0.0,0,0,0,0,0
5,7.0,1.0,5,0.0,0,0,1,0,1
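
The redundancy can be verified on a small example. The sketch below uses hypothetical rows and passes `dtype=int` so the flags print as 0/1 regardless of the pandas version.

```python
import pandas as pd

toy = pd.DataFrame({'gender': ['M', 'F', 'F'],
                    'class_name': ['ナイト', 'オールタイム', 'デイタイム']})
dummies = pd.get_dummies(toy, dtype=int)

# Exactly one flag per original column is set, so each group of flags sums to 1
print((dummies['gender_F'] + dummies['gender_M']).tolist())

# Dropping one column per group therefore loses no information
reduced = dummies.drop(columns=['gender_M', 'class_name_ナイト'])
print(reduced.columns.tolist())
```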


## Knock 47: Build a cancellation prediction model with a decision tree

In [16]:
from sklearn.tree import DecisionTreeClassifier
import sklearn.model_selection

exit = predict_data.loc[predict_data['is_deleted']==1]
conti = predict_data.loc[predict_data['is_deleted']==0].sample(len(exit))

X = pd.concat([exit,conti],ignore_index=True)
y =X['is_deleted']
del X['is_deleted']
X_train,X_test,y_train,y_test =sklearn.model_selection.train_test_split(X,y)

model = DecisionTreeClassifier(random_state =0)
model.fit(X_train,y_train)
y_test_pred = model.predict(X_test)
print(y_test_pred)

[1. 1. 0. 1. 0. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 0. 0. 0. 1.
 0. 0. 1. 1. 1. 1. 0. 0. 0. 1. 0. 0. 0. 0. 1. 1. 0. 1. 0. 0. 1. 0. 0. 1.
 1. 0. 1. 0. 0. 0. 0. 1. 0. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 0. 1. 0. 1. 0.
 1. 1. 1. 0. 1. 1. 0. 1. 0. 1. 1. 0. 1. 1. 0. 1. 0. 1. 1. 1. 0. 0. 1. 0.
 1. 0. 0. 0. 0. 1. 0. 0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 1. 0. 1. 0. 1. 0. 1.
 1. 0. 0. 0. 1. 1. 0. 1. 0. 0. 1. 1. 1. 0. 1. 1. 1. 0. 0. 0. 0. 1. 1. 0.
 0. 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1.
 0. 1. 0. 0. 1. 0. 1. 1. 0. 1. 0. 1. 0. 0. 1. 0. 0. 0. 1. 1. 1. 1. 0. 1.
 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 0. 1. 0. 0. 0. 1. 0.
 1. 0. 1. 0. 1. 1. 1. 1. 1. 0. 0. 0. 0. 1. 1. 1. 1. 0. 0. 1. 0. 0. 1. 1.
 0. 0. 0. 0. 0. 1. 1. 1. 0. 1. 0. 1. 0. 1. 1. 1. 1. 0. 0. 1. 0. 1. 1. 1.
 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 0. 1. 1. 0. 0. 1. 0. 0. 1. 1. 1. 0. 0.
 0. 1. 1. 0. 0. 0. 1. 0. 0. 1. 0. 1. 0. 1. 1. 1. 0. 0. 1. 1. 0. 1. 1. 1.
 1. 0. 1. 0. 0. 0. 0. 1. 1. 1. 0. 1. 0. 0. 0. 0. 1.

In [17]:
results_test = pd.DataFrame({'y_test':y_test,'y_pred':y_test_pred})
results_test.head()

Unnamed: 0,y_test,y_pred
1572,0.0,1.0
988,1.0,1.0
1062,0.0,0.0
447,1.0,1.0
1490,0.0,0.0


Every row shown except the first was predicted correctly.
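
As a quick check, the five displayed rows can be compared by hand. The lists below copy only those five rows, so the resulting figure is not the accuracy over the whole test set.

```python
# The five (y_test, y_pred) pairs displayed above
y_test_head = [0.0, 1.0, 0.0, 1.0, 0.0]
y_pred_head = [1.0, 1.0, 0.0, 1.0, 0.0]

matches = [t == p for t, p in zip(y_test_head, y_pred_head)]
accuracy_head = sum(matches) / len(matches)

print(matches)        # [False, True, True, True, True]
print(accuracy_head)  # 0.8
```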

## Knock 48: Evaluate the prediction model and tune it

In [19]:
correct = len(results_test.loc[results_test['y_test']==results_test['y_pred']])
data_count = len(results_test)
score_test = correct / data_count
print(score_test)

0.8935361216730038


In [20]:
print('Test Accuracy: %.3f' % model.score(X_test,y_test))
print('Train Accuracy: %.3f' % model.score(X_train,y_train))

Test Accuracy: 0.894
Train Accuracy: 0.979


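Accuracy on the training data (0.979) is clearly higher than on the test data (0.894), which indicates that the model is **overfitting**.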

In [21]:
X = pd.concat([exit,conti],ignore_index=True)
y = X['is_deleted']
del X['is_deleted']
X_train,X_test,y_train,y_test = sklearn.model_selection.train_test_split(X,y)

model = DecisionTreeClassifier(random_state=0,max_depth=5)
model.fit(X_train,y_train)
print('Test Accuracy: %.3f' % model.score(X_test,y_test))
print('Train Accuracy: %.3f' % model.score(X_train,y_train))

Test Accuracy: 0.913
Train Accuracy: 0.928


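With the tree depth limited to 5, the training accuracy (0.928) and the test accuracy (0.913) are now close to each other, so the overfitting has largely been resolved.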
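
How `max_depth` trades training accuracy against test accuracy can be explored with a small sweep. The sketch below runs on synthetic data from `make_classification`, not on the gym data, and the depth values are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the gym dataset); flip_y adds label noise
X_demo, y_demo = make_classification(n_samples=1000, n_features=8,
                                     flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

scores = {}
for depth in [2, 5, 10, None]:   # None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(random_state=0, max_depth=depth).fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print(depth, 'train=%.3f' % scores[depth][0], 'test=%.3f' % scores[depth][1])
```

A widening gap between the training and test columns as the depth grows is the pattern to look for.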

## Knock 49: Check which variables contribute to the model

In [22]:
importance = pd.DataFrame({'feature_names':X.columns,'coefficient':model.feature_importances_})
importance

Unnamed: 0,feature_names,coefficient
0,count_1,0.342175
1,routine_flg,0.114257
2,period,0.538529
3,campaign_name_入会費半額,0.003956
4,campaign_name_入会費無料,0.0
5,class_name_オールタイム,5.6e-05
6,class_name_デイタイム,0.0
7,gender_F,0.001027


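The values show that the membership period (period), the previous month's usage count (count_1), and regular use (routine_flg) contribute most to the model. Note that these are feature importances rather than coefficients, despite the column name. With a decision tree, visualizing the tree structure also gives an intuitive understanding of the model.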
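
One lightweight way to inspect the tree structure is scikit-learn's `export_text`, which prints the learned rules as plain text. The sketch below fits a tiny tree on hypothetical toy rows rather than on the model trained above.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy rows: [count_1, period]; members with a short period are labelled as leaving
X_toy = [[3, 2], [8, 3], [4, 5], [7, 8], [3, 12], [8, 20], [4, 30], [7, 15]]
y_toy = [1, 1, 1, 1, 0, 0, 0, 0]

toy_tree = DecisionTreeClassifier(random_state=0, max_depth=2).fit(X_toy, y_toy)
rules = export_text(toy_tree, feature_names=['count_1', 'period'])
print(rules)
```

The same call on the trained `model`, with `feature_names=list(X.columns)`, should list the thresholds on period and count_1 that drive its predictions.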

## Knock 50: Predict customer cancellation
We predict cancellation for some arbitrarily made-up customer data.

In [24]:
count_1 = 3
routine_flg = 1
period = 10
campaign_name = '入会費無料'
class_name ='オールタイム'
gender = 'M'

In [25]:
if campaign_name == '入会費半額':
    campaign_name_list = [1,0]
elif campaign_name == '入会費無料':
    campaign_name_list = [0,1]
elif campaign_name == '通常':
    campaign_name_list = [0,0]

# Flags follow the column order class_name_オールタイム, class_name_デイタイム
if class_name == 'オールタイム':
    class_name_list = [1,0]
elif class_name == 'デイタイム':
    class_name_list = [0,1]
elif class_name == 'ナイト':
    class_name_list =[0,0]

if gender == 'F':
    gender_list =[1]
elif gender == 'M':
    gender_list = [0]

In [31]:
input_data = [count_1,routine_flg,period]
input_data.extend(campaign_name_list)
input_data.extend(class_name_list)
input_data.extend(gender_list)

In [32]:
print(model.predict([input_data]))
print(model.predict_proba([input_data]))

[1.]
[[0. 1.]]
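
The if/elif chains can be collapsed into lookup tables. The helper below is a hypothetical sketch; it assumes the feature order shown in the `head()` output after the dummy columns were trimmed (count_1, routine_flg, period, the two campaign flags, the two class flags, gender_F).

```python
# Flag order assumed from the trimmed dummy columns:
# campaign_name_入会費半額, campaign_name_入会費無料 / class_name_オールタイム, class_name_デイタイム / gender_F
CAMPAIGN_FLAGS = {'入会費半額': [1, 0], '入会費無料': [0, 1], '通常': [0, 0]}
CLASS_FLAGS = {'オールタイム': [1, 0], 'デイタイム': [0, 1], 'ナイト': [0, 0]}
GENDER_FLAGS = {'F': [1], 'M': [0]}

def encode_customer(count_1, routine_flg, period, campaign_name, class_name, gender):
    """Hypothetical helper: build one feature row; raises KeyError on an unknown category."""
    return ([count_1, routine_flg, period]
            + CAMPAIGN_FLAGS[campaign_name]
            + CLASS_FLAGS[class_name]
            + GENDER_FLAGS[gender])

print(encode_customer(3, 1, 10, '入会費無料', 'オールタイム', 'M'))
# [3, 1, 10, 0, 1, 1, 0, 0]
```

A dictionary lookup fails loudly on a misspelled category, whereas an if/elif chain with no final `else` leaves the variable unset or stale.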


In [56]:
count_1 = 5
routine_flg = 1
period = 20
campaign_name = '入会費無料'
class_name ='オールタイム'
gender = 'F'

if campaign_name == '入会費半額':
    campaign_name_list = [1,0]
elif campaign_name == '入会費無料':
    campaign_name_list = [0,1]
elif campaign_name == '通常':
    campaign_name_list = [0,0]

# Flags follow the column order class_name_オールタイム, class_name_デイタイム
if class_name == 'オールタイム':
    class_name_list = [1,0]
elif class_name == 'デイタイム':
    class_name_list = [0,1]
elif class_name == 'ナイト':
    class_name_list =[0,0]

if gender == 'F':
    gender_list =[1]
elif gender == 'M':
    gender_list = [0]
    
input_data = [count_1,routine_flg,period]
input_data.extend(campaign_name_list)
input_data.extend(class_name_list)
input_data.extend(gender_list)

In [57]:
print(model.predict([input_data]))
print(model.predict_proba([input_data]))

[0.]
[[0.82692308 0.17307692]]


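After trying various values for the variables, changing the membership period and the previous month's usage count shifted the prediction considerably, which confirms that these two have a large influence on the cancellation prediction.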