In [24]:
# Load the data and check it
import pandas as pd

uselog = pd.read_csv('sample_code/chapter_4/use_log.csv')
uselog.isnull().sum()  # check for missing values

log_id         0
customer_id    0
usedate        0
dtype: int64

In [25]:
customer = pd.read_csv('sample_code/chapter_4/customer_join.csv')
customer.isnull().sum()  # check for missing values

customer_id             0
name                    0
class                   0
gender                  0
start_date              0
end_date             2842
campaign_id             0
is_deleted              0
class_name              0
price                   0
campaign_name           0
mean                    0
median                  0
max                     0
min                     0
routine_flg             0
calc_date               0
membership_period       0
dtype: int64

Confirmed that every column except end_date has 0 missing values; end_date is missing in 2842 rows.
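The same check can be narrowed down to just the columns that actually contain missing values. The following is a minimal sketch on a made-up three-column frame (the values and the `toy` name are assumptions for illustration, not the real customer_join.csv):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the customer table (values are made up for illustration)
toy = pd.DataFrame({
    "customer_id": ["A", "B", "C", "D"],
    "start_date": ["2018-04-01", "2018-05-01", "2018-06-01", "2018-07-01"],
    "end_date": ["2018-12-31", np.nan, np.nan, "2019-01-31"],
})

missing = toy.isnull().sum()                             # missing count per column
cols_with_missing = missing[missing > 0].index.tolist()  # keep only non-zero counts

print(missing)
print(cols_with_missing)
```

Filtering the count Series this way is convenient on wide tables, where scanning a long column of zeros by eye is error-prone.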

Next, group the customers based on their usage history.
The variables used for clustering are each customer's monthly usage statistics, plus the membership period:
- mean
- median
- max
- min
- membership_period

### Grouping customers by clustering

In [26]:
# Narrow down to the required variables
customer_clustering = customer[['mean', 'median', 'max', 'min', 'membership_period']]
customer_clustering.head()

,mean,median,max,min,membership_period
0,4.833333,5.0,8,2,47
1,5.083333,5.0,7,3,47
2,4.583333,5.0,6,3,47
3,4.833333,4.5,7,2,47
4,3.916667,4.0,6,1,47
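In the preview above, the usage columns sit in the single digits while membership_period is 47, so the columns are on quite different scales. k-means relies on Euclidean distance, so a column with a larger spread tends to dominate unless the data is rescaled first. The sketch below uses made-up numbers (the array `X` is an assumption, not real data) and computes a per-column z-score by hand, which is the same quantity `StandardScaler` computes with its default settings:

```python
import numpy as np

# Made-up rows: column 0 mimics monthly mean usage, column 1 mimics membership_period
X = np.array([
    [4.8, 47.0],
    [8.1, 7.0],
    [3.1, 9.0],
    [5.5, 15.0],
])

# z-score per column: subtract the column mean, divide by the column std (ddof=0)
X_sc = (X - X.mean(axis=0)) / X.std(axis=0)

print(X.std(axis=0))     # raw spread differs greatly between the two columns
print(X_sc.std(axis=0))  # after scaling, each column has unit spread
```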


In [27]:
# Clustering: k-means
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Standardize the variables
sc = StandardScaler()
customer_clustering_sc = sc.fit_transform(customer_clustering)

# Build the k-means model
kmeans = KMeans(n_clusters=4, random_state=0)
clusters = kmeans.fit(customer_clustering_sc)
customer_clustering['cluster'] = clusters.labels_
print(customer_clustering['cluster'].unique())
customer_clustering.head()

[1 2 3 0]


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  customer_clustering['cluster'] = clusters.labels_


,mean,median,max,min,membership_period,cluster
0,4.833333,5.0,8,2,47,1
1,5.083333,5.0,7,3,47,1
2,4.583333,5.0,6,3,47,1
3,4.833333,4.5,7,2,47,1
4,3.916667,4.0,6,1,47,1
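The warning printed in the output above appears because customer_clustering was created by selecting columns from customer, and a new column was then assigned to that selection. One common way to avoid it is to take an explicit copy when creating the subset. The sketch below shows the pattern on a made-up frame; the data and the placeholder labels are assumptions for illustration, not real k-means output:

```python
import pandas as pd

# Made-up stand-in for the customer table
customer = pd.DataFrame({
    "name": ["a", "b", "c"],
    "mean": [4.8, 8.1, 3.1],
    "membership_period": [47, 7, 9],
})

# .copy() makes the subset an independent DataFrame,
# so adding a column to it is unambiguous
customer_clustering = customer[["mean", "membership_period"]].copy()
customer_clustering["cluster"] = [1, 0, 2]  # placeholder labels, not real k-means output

print(customer_clustering.columns.tolist())
print("cluster" in customer.columns)  # the original frame is left untouched
```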


### Analyzing the clustering results

In [28]:
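# Rename the columns to Japanese labels
# (monthly mean, monthly median, monthly max, monthly min, membership period)
customer_clustering.columns = ['月内平均値', '月内中央値', '月内最大値', '月内最小値', '会員期間', 'cluster']

# Number of records per cluster
customer_clustering.groupby('cluster').count()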

cluster,月内平均値,月内中央値,月内最大値,月内最小値,会員期間
0,840,840,840,840,840
1,1249,1249,1249,1249,1249
2,771,771,771,771,771
3,1332,1332,1332,1332,1332


Group 3 is the largest (1332 customers), followed by Group 1 (1249), Group 0 (840), and Group 2 (771).
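The size ordering can also be read off directly by counting the labels, rather than counting every column per group. A minimal sketch, assuming a label Series rebuilt from the per-cluster counts in the table above (the `labels` Series is reconstructed for illustration, not the real `cluster` column):

```python
import pandas as pd

# Rebuild a label column from the per-cluster counts reported above
counts = {0: 840, 1: 1249, 2: 771, 3: 1332}
labels = pd.Series([c for c, n in counts.items() for _ in range(n)], name="cluster")

# value_counts() returns one count per label, sorted from largest to smallest
sizes = labels.value_counts()

print(sizes)
print(sizes.index.tolist())  # [3, 1, 0, 2]
```

Because it returns a single Series already sorted by size, `value_counts()` avoids the repeated identical columns that `groupby('cluster').count()` produces.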

In [30]:
# Mean of each variable per group
customer_clustering.groupby('cluster').mean()

cluster,月内平均値,月内中央値,月内最大値,月内最小値,会員期間
0,8.061942,8.047024,10.014286,6.175,7.019048
1,4.677561,4.670937,7.233787,2.153723,36.915933
2,3.065504,2.90013,4.783398,1.649805,9.276265
3,5.539535,5.391141,8.756006,2.702703,14.867868


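Group | Characteristics
--- | ---
Group 0 | Short membership period and high usage
Group 1 | Longer membership period than Group 2. Compared with Group 3, the membership period is longer but usage is slightly lower
Group 2 | Short membership period and the lowest usage
Group 3 | Longer membership period than Group 2, with higher usage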
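The characterizations in the table can be cross-checked by ranking the groups on the per-group means. The sketch below copies two columns from the mean table above into a small frame (only 月内平均値, the monthly mean usage, and 会員期間, the membership period, are used; the `means` frame is rebuilt by hand for illustration):

```python
import pandas as pd

# Per-group means copied from the groupby('cluster').mean() output above
means = pd.DataFrame(
    {
        "月内平均値": [8.061942, 4.677561, 3.065504, 5.539535],  # monthly mean usage
        "会員期間": [7.019048, 36.915933, 9.276265, 14.867868],   # membership period
    },
    index=pd.Index([0, 1, 2, 3], name="cluster"),
)

# Rank the clusters from highest to lowest on each variable
by_usage = means["月内平均値"].sort_values(ascending=False).index.tolist()
by_period = means["会員期間"].sort_values(ascending=False).index.tolist()

print(by_usage)   # [0, 3, 1, 2]
print(by_period)  # [1, 3, 2, 0]
```

Group 0 ranks first on usage and last on membership period, while Group 1 ranks first on membership period, which matches the descriptions in the table.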