
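## RFM

- A customer segmentation technique
- Mainly used for customer targeting
  - R (Recency) : how recently the customer purchased
  - F (Frequency) : how often the customer purchased
  - M (Monetary) : how much the customer spent

- Each feature has a different nature and scale
  - So the scores are usually grouped by percentile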
  

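### Percentile

- With the data sorted from smallest to largest and the whole range scaled to 100, the N-th percentile is the value at position N, i.e. roughly N% of the observations are at or below it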
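As a small illustration of the definition above, the sketch below computes a few percentiles with `np.percentile`. The sample values are hypothetical and chosen so that the cut points land exactly on data points.

```python
import numpy as np

# Hypothetical sample: days since the last order for 11 customers (already sorted).
values = np.array([1, 3, 4, 7, 10, 12, 15, 20, 23, 28, 30])

# The 50th percentile is the median of the sample.
median = np.percentile(values, 50)

# Cut points at the 20th, 40th, and 60th percentiles.
# With 11 sorted values these fall on the 3rd, 5th, and 7th values: 4, 10, 15.
cuts = np.percentile(values, [20, 40, 60])

print(median)
print(cuts)
```

By default `np.percentile` interpolates linearly between neighbouring values when the requested position falls between two data points.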

### Correlation between independent and dependent variables

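## RFM practice

1. Analysis topic : predicting the churn likelihood of e-commerce customers
2. Dependent variable : whether the customer returned within the last month (returned : 1, churned : 0)
3. Independent variables : RFM
  - Set the prediction period according to how precisely RFM should be captured for the prediction
  - Compute R, F, and M scores based on percentiles
    - R : days elapsed since the last order within the train period => Recency score
    - F : total number of orders => Frequency score
    - M : average order amount => Monetary score
  - Total score
    - The total is not a simple sum of the score values
    - Each score is multiplied by its regression coefficient before summing

4. Sample size : 1,000 customers
5. Analysis method
  - Compute RFM scores and classify customers into 5 groups
6. Reference periods
  - Train period (60 days ago ~ 30 days ago) : compute RFM scores
  - Prediction period (30 days ago ~ now) : check whether the customer churned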
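The notebook loads a ready-made mart, so the step that builds it is not shown. As a rough sketch of how such a table could be derived from an order log under the periods described above, assuming a hypothetical `orders` table with `mem_no`, `ord_dt`, and `amount` columns and a hypothetical reference date:

```python
import pandas as pd

# Hypothetical order log; the column names and dates are assumptions for illustration.
orders = pd.DataFrame({
    "mem_no": [1, 1, 1, 2, 3],
    "ord_dt": pd.to_datetime(
        ["2023-01-05", "2023-01-27", "2023-02-10", "2023-01-12", "2023-02-15"]
    ),
    "amount": [20000, 30000, 12000, 15000, 50000],
})

today = pd.Timestamp("2023-03-02")            # hypothetical "now"
train_start = today - pd.Timedelta(days=60)   # 2023-01-01
train_end = today - pd.Timedelta(days=30)     # 2023-01-31

# Train period: used to compute R, F, M. Prediction period: used to label returns.
train = orders[(orders["ord_dt"] >= train_start) & (orders["ord_dt"] <= train_end)]
pred = orders[(orders["ord_dt"] > train_end) & (orders["ord_dt"] <= today)]

rfm = train.groupby("mem_no").agg(
    last_ord_dt=("ord_dt", "max"),   # last order within the train period
    frequency=("ord_dt", "count"),   # total number of orders
    monetary=("amount", "mean"),     # average order amount
).reset_index()

# Recency: days between the last train-period order and the end of the train period.
rfm["recency"] = (train_end - rfm["last_ord_dt"]).dt.days

# is_back: 1 if the customer ordered again during the prediction period, else 0.
rfm["is_back"] = rfm["mem_no"].isin(pred["mem_no"]).astype(int)

print(rfm)
```

Customers with no order in the train period (here `mem_no` 3) do not appear in `rfm`, because no RFM values can be computed for them.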


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


- sample : 1,000 customers

In [5]:
df = pd.read_csv('/content/drive/MyDrive/Study/DA/Data/RFM/rfm_mart.csv')
print(f'dataset : {df.shape[0]}')
df.head()

dataset : 1000


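| | mem_no | last_ord_dt | recency | frequency | monetary | is_back |
|---|---|---|---|---|---|---|
| 0 | 1 | 2023-01-01 | 30 | 1 | 28700 | 0 |
| 1 | 2 | 2023-01-12 | 19 | 7 | 26929 | 0 |
| 2 | 3 | 2023-01-01 | 30 | 1 | 54800 | 0 |
| 3 | 4 | 2023-01-27 | 4 | 3 | 25700 | 1 |
| 4 | 5 | 2023-01-01 | 30 | 2 | 6500 | 0 |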


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   mem_no       1000 non-null   int64 
 1   last_ord_dt  1000 non-null   object
 2   recency      1000 non-null   int64 
 3   frequency    1000 non-null   int64 
 4   monetary     1000 non-null   int64 
 5   is_back      1000 non-null   int64 
dtypes: int64(5), object(1)
memory usage: 47.0+ KB


### Obtaining the logistic regression coefficients

In [11]:
X = df.drop(['mem_no','last_ord_dt','is_back'], axis=1)
y = df['is_back']

In [12]:
# Default split is 75% train / 25% test; no random_state is set, so the score varies between runs
X_train, X_test, y_train, y_test = train_test_split(X, y)

model = LogisticRegression()
model.fit(X_train, y_train)

In [13]:
model.score(X_test, y_test)

0.76

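- Not a bad result, but as a rule of thumb an accuracy of 0.80 or higher is usually expected
- In practice the independent variables (RFM) need careful refinement, because they are the main factor affecting predictive accuracy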
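To judge whether 0.76 is good, it helps to compare it with a trivial baseline. In the group table further down, 418 of the 1,000 customers returned, so always predicting "churned" would already score about 0.58 on the full sample. The sketch below shows this kind of comparison with scikit-learn's `DummyClassifier`. It runs on synthetic data; the data generation, the seed, and the variable names are assumptions, not the notebook's data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: recent buyers (small recency) return more often.
rng = np.random.default_rng(0)
n = 1000
recency = rng.integers(1, 31, size=n)
p_return = 1 / (1 + np.exp(0.3 * (recency - 15)))
y = rng.binomial(1, p_return)
X = recency.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression().fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
model_acc = model.score(X_test, y_test)
print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"model accuracy:    {model_acc:.3f}")
```

The gap between the two numbers is usually more informative than the model accuracy alone.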

In [14]:
coef = pd.DataFrame({'features': X.columns,
                     'coefficient': model.coef_[0]})
coef

| | features | coefficient |
|---|---|---|
| 0 | recency | -0.044259 |
| 1 | frequency | 0.006359 |
| 2 | monetary | 1.8e-05 |


- recency coefficient

In [24]:
coef.iloc[0,1]

-0.044259342940119094

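- The monetary coefficient is close to zero, so monetary appears to have almost no relationship with the dependent variable
- recency shows a weak negative relationship : the more days since the last order, the lower the chance of returning
- Note that the features are not scaled, so the coefficient sizes are not directly comparable; monetary is measured in the tens of thousands, which by itself makes its per-unit coefficient tiny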
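One way to make the coefficients comparable is to standardize the features before fitting. The sketch below does this with `StandardScaler` in a pipeline. It uses synthetic data with the same three column names; the generated values and the seed are assumptions, so the printed coefficients are not the notebook's results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (assumed, for illustration only).
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "recency": rng.integers(1, 31, size=n),
    "frequency": rng.integers(1, 8, size=n),
    "monetary": rng.integers(5000, 60000, size=n),
})
# Only recency drives the synthetic label; frequency and monetary are noise.
p_return = 1 / (1 + np.exp(0.1 * (X["recency"].to_numpy() - 15)))
y = rng.binomial(1, p_return)

# Standardize each feature to mean 0 and unit variance, then fit the model.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X, y)

std_coef = pd.DataFrame({
    "features": X.columns,
    "coefficient": pipe.named_steps["logisticregression"].coef_[0],
})
print(std_coef)
```

After standardization each coefficient describes the effect of a one-standard-deviation change, so the magnitudes can be compared across features.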

### Computing RFM scores from percentiles

In [16]:
a1, a2, a3 = np.percentile(df['recency'], [20, 40, 60])  # cut points at the 20th, 40th, and 60th percentiles
a1, a2, a3

(7.0, 23.0, 30.0)

In [22]:
def percent(x):
  '''
  Higher score is better: fewer days since the last order gives a higher score
  '''
  if x <= a1 :
    return 4
  elif x > a1 and x <=a2 :
    return 3
  elif x > a2 and x <a3 :
    return 2
  elif x >= a3 :
    return 1

In [25]:
df['recency_score'] = df['recency'].apply(percent)*-coef.iloc[0,1]  # the coefficient is negative, so flip its sign to get a positive score
df.head()

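| | mem_no | last_ord_dt | recency | frequency | monetary | is_back | recency_score |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 2023-01-01 | 30 | 1 | 28700 | 0 | 0.044259 |
| 1 | 2 | 2023-01-12 | 19 | 7 | 26929 | 0 | 0.132778 |
| 2 | 3 | 2023-01-01 | 30 | 1 | 54800 | 0 | 0.044259 |
| 3 | 4 | 2023-01-27 | 4 | 3 | 25700 | 1 | 0.177037 |
| 4 | 5 | 2023-01-01 | 30 | 2 | 6500 | 0 | 0.044259 |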
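The row-wise `apply` above can also be written as a vectorized expression. The sketch below is one possible equivalent using `np.select`, which keeps the same boundary handling as `percent` (a value equal to `a3` gets score 1). The cut points, the rounded coefficient, and the five recency values are copied from the outputs above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

a1, a2, a3 = 7.0, 23.0, 30.0   # cut points printed above
recency_coef = -0.044259       # recency coefficient, rounded

# Recency values of the first five rows shown above.
recency = pd.Series([30, 19, 30, 4, 30])

# np.select returns the choice for the first condition that is true,
# so the checks mirror the if/elif chain; everything else (x >= a3) gets 1.
raw_score = np.select(
    [recency <= a1, recency <= a2, recency < a3],
    [4, 3, 2],
    default=1,
)

# Flip the sign of the negative coefficient so the weighted score is positive.
recency_score = raw_score * -recency_coef
print(raw_score)
print(recency_score)
```

`pd.cut` is a common alternative, but its default right-closed bins would place a value equal to `a3` in the bin below, so it would not reproduce `percent` exactly without adjusting the edges.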


In [27]:
b1, b2, b3 = np.percentile(df['frequency'], [10, 50, 80]) # the 20/40/60 cut points were nearly identical, so use more spread-out percentiles
b1, b2, b3

(1.0, 2.0, 3.0)
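Evenly spaced percentiles tend to collide for count data, because many customers share the same small order count. The sketch below illustrates this with a hypothetical distribution of 100 order counts (the counts are assumptions, not the notebook's data).

```python
import numpy as np

# Hypothetical order counts for 100 customers: most ordered only once or twice.
freq = np.array([1] * 45 + [2] * 30 + [3] * 20 + [5] * 3 + [7] * 2)

# Evenly spaced cut points: the 20th and 40th percentiles are both 1.
even_cuts = np.percentile(freq, [20, 40, 60])

# More spread-out cut points give three distinct values.
spread_cuts = np.percentile(freq, [10, 50, 80])

print(even_cuts)
print(spread_cuts)

# A quick check for duplicated cut points before using them as bin edges.
has_duplicates = len(np.unique(even_cuts)) < len(even_cuts)
print(has_duplicates)
```

Checking for duplicated cut points before scoring avoids bins that can never receive a customer.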

In [28]:
def percent_freq(x):
  '''
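  Lower frequency gets a lower score.
  With the cut points above (b2 = 2, b3 = 3) no integer frequency lies strictly
  between b2 and b3, so score 3 is never assigned.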
  '''
  if x <= b1 :
    return 1
  elif x > b1 and x <=b2 :
    return 2
  elif x > b2 and x <b3 :
    return 3
  elif x >= b3 :
    return 4

In [29]:
df['freq_score'] = df['frequency'].apply(percent_freq)*coef.iloc[1,1]
df.head(3)

| | mem_no | last_ord_dt | recency | frequency | monetary | is_back | recency_score | freq_score |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2023-01-01 | 30 | 1 | 28700 | 0 | 0.044259 | 0.006359 |
| 1 | 2 | 2023-01-12 | 19 | 7 | 26929 | 0 | 0.132778 | 0.025437 |
| 2 | 3 | 2023-01-01 | 30 | 1 | 54800 | 0 | 0.044259 | 0.006359 |


In [30]:
c1, c2, c3 = np.percentile(df['monetary'], [20,40,60])
c1, c2, c3

(18500.0, 22299.800000000003, 27000.0)

In [32]:
def percent_mon(x):
  '''
  A higher average order amount gets a higher score
  '''
  if x <= c1 :
    return 1
  elif x > c1 and x <=c2 :
    return 2
  elif x > c2 and x <c3 :
    return 3
  elif x >= c3 :
    return 4

In [33]:
df['monetary_score'] = df['monetary'].apply(percent_mon)*coef.iloc[2,1]
df.head(3)

| | mem_no | last_ord_dt | recency | frequency | monetary | is_back | recency_score | freq_score | monetary_score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2023-01-01 | 30 | 1 | 28700 | 0 | 0.044259 | 0.006359 | 7.1e-05 |
| 1 | 2 | 2023-01-12 | 19 | 7 | 26929 | 0 | 0.132778 | 0.025437 | 5.3e-05 |
| 2 | 3 | 2023-01-01 | 30 | 1 | 54800 | 0 | 0.044259 | 0.006359 | 7.1e-05 |


In [34]:
df['total_score'] = df['recency_score'] + df['freq_score'] + df['monetary_score']
df.head(3)

| | mem_no | last_ord_dt | recency | frequency | monetary | is_back | recency_score | freq_score | monetary_score | total_score |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2023-01-01 | 30 | 1 | 28700 | 0 | 0.044259 | 0.006359 | 7.1e-05 | 0.05069 |
| 1 | 2 | 2023-01-12 | 19 | 7 | 26929 | 0 | 0.132778 | 0.025437 | 5.3e-05 | 0.158269 |
| 2 | 3 | 2023-01-01 | 30 | 1 | 54800 | 0 | 0.044259 | 0.006359 | 7.1e-05 | 0.05069 |
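As a worked check of the weighted total, the sketch below recomputes the value for `mem_no` 2 from its raw scores (recency 3, frequency 4, monetary 3) and the rounded coefficients shown earlier. Because the coefficients are rounded, the result matches the table only to about five decimal places.

```python
import numpy as np

# Coefficient magnitudes from the coefficient table (rounded); recency's sign is flipped.
weights = np.array([0.044259, 0.006359, 1.8e-05])

# Raw scores for mem_no 2: recency 19 -> 3, frequency 7 -> 4, monetary 26929 -> 3.
raw = np.array([3, 4, 3])

simple_sum = int(raw.sum())            # unweighted total, treats R, F, M equally
weighted_total = float(raw @ weights)  # coefficient-weighted total

print(simple_sum)
print(round(weighted_total, 6))
```

With these weights the total is dominated by recency, since its coefficient is far larger than the other two.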


- Split customers into 5 groups by total score, using percentiles

In [36]:
t1, t2, t3, t4  = np.percentile(df['total_score'], [20,50,70,90])  # the 20/40/60/80 cut points were too similar, so they were adjusted
t1, t2, t3, t4

(0.05067204257613926,
 0.05704917586461634,
 0.1582332124529509,
 0.20251033856684153)

In [37]:
def level(x):
  '''
  Level 1 is the highest total score, level 5 the lowest
  '''
  if x < t1 :
    return 5
  elif x >= t1 and x< t2:
    return 4
  elif x >= t2 and x < t3:
    return 3
  elif x >= t3 and x < t4:
    return 2
  elif x >= t4:
    return 1

In [38]:
df['level'] = df['total_score'].apply(level)
df.head(3)

| | mem_no | last_ord_dt | recency | frequency | monetary | is_back | recency_score | freq_score | monetary_score | total_score | level |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2023-01-01 | 30 | 1 | 28700 | 0 | 0.044259 | 0.006359 | 7.1e-05 | 0.05069 | 4 |
| 1 | 2 | 2023-01-12 | 19 | 7 | 26929 | 0 | 0.132778 | 0.025437 | 5.3e-05 | 0.158269 | 2 |
| 2 | 3 | 2023-01-01 | 30 | 1 | 54800 | 0 | 0.044259 | 0.006359 | 7.1e-05 | 0.05069 | 4 |


In [42]:
df['level'].value_counts()

4    293
3    203
2    194
5    186
1    124
Name: level, dtype: int64

### Computing retention by group

In [43]:
pivot = df.groupby('level').agg({'is_back': 'sum', 'mem_no':'count'})
pivot

| level | is_back | mem_no |
|---|---|---|
| 1 | 101 | 124 |
| 2 | 100 | 194 |
| 3 | 87 | 203 |
| 4 | 79 | 293 |
| 5 | 51 | 186 |


In [44]:
pivot['retention'] = pivot.is_back/pivot.mem_no
pivot

| level | is_back | mem_no | retention |
|---|---|---|---|
| 1 | 101 | 124 | 0.814516 |
| 2 | 100 | 194 | 0.515464 |
| 3 | 87 | 203 | 0.428571 |
| 4 | 79 | 293 | 0.269625 |
| 5 | 51 | 186 | 0.274194 |
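Because `is_back` is a 0/1 flag, its group mean is the retention rate, so the two steps above can be collapsed into a single aggregation. The sketch below rebuilds a customer-level frame from the counts in the table (only `level` and `is_back` are reconstructed; the loop is an illustration device) and computes all three columns at once.

```python
import pandas as pd

# (returned, total) per level, taken from the table above.
counts = {1: (101, 124), 2: (100, 194), 3: (87, 203), 4: (79, 293), 5: (51, 186)}

# Rebuild one row per customer with the level and the 0/1 return flag.
rows = []
for lvl, (returned, total) in counts.items():
    rows += [(lvl, 1)] * returned + [(lvl, 0)] * (total - returned)
df = pd.DataFrame(rows, columns=["level", "is_back"])

# sum = returned customers, count = group size, mean = retention rate.
retention = df.groupby("level")["is_back"].agg(
    returned="sum", customers="count", retention="mean"
)
print(retention.round(6))

# Overall return rate across all 1,000 customers.
overall = df["is_back"].mean()
print(overall)
```

The result also shows that retention is not strictly decreasing by level: level 5 (about 0.274) is marginally higher than level 4 (about 0.270).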


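- Level 1 : most customers are still active the following month (retention about 0.81)
- Level 2 : lower, but about half remain (about 0.52)
- Level 3 : many customers churn (retention about 0.43)
- Levels 4 and 5 : retention is lowest, about 0.27 in both groups
- Decision options
  - Care for levels 3 to 5
  - Care for levels 4 and 5 only
- Within each group, other techniques can be used to analyze churned versus retained customers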