# **User Behavior Log Data - Funnel Analysis**

## **Exploring the Data**

In [None]:
import pandas as pd
import plotly.express as px

In [None]:
data = pd.read_csv('/content/ecommerce_behavior.csv')

In [None]:
data.head()

Unnamed: 0.1,Unnamed: 0,event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session
0,0,2020-01-01 00:00:00 UTC,view,5809910,1602943681873052386,,grattol,5.24,595414620,4adb70bb-edbd-4981-b60f-a05bfd32683a
1,1,2020-01-01 00:00:09 UTC,view,5812943,1487580012121948301,,kinetics,3.97,595414640,c8c5205d-be43-4f1d-aa56-4828b8151c8a
2,2,2020-01-01 00:00:19 UTC,view,5798924,1783999068867920626,,zinger,3.97,595412617,46a5010f-bd69-4fbe-a00d-bb17aa7b46f3
3,3,2020-01-01 00:00:24 UTC,view,5793052,1487580005754995573,,,4.92,420652863,546f6af3-a517-4752-a98b-80c4c5860711
4,4,2020-01-01 00:00:25 UTC,view,5899926,2115334439910245200,,,3.92,484071203,cff70ddf-529e-4b0c-a4fc-f43a749c0acb


In [None]:
data.drop('Unnamed: 0', axis=1, inplace=True)

In [None]:
data.head()

Unnamed: 0,event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session
0,2020-01-01 00:00:00 UTC,view,5809910,1602943681873052386,,grattol,5.24,595414620,4adb70bb-edbd-4981-b60f-a05bfd32683a
1,2020-01-01 00:00:09 UTC,view,5812943,1487580012121948301,,kinetics,3.97,595414640,c8c5205d-be43-4f1d-aa56-4828b8151c8a
2,2020-01-01 00:00:19 UTC,view,5798924,1783999068867920626,,zinger,3.97,595412617,46a5010f-bd69-4fbe-a00d-bb17aa7b46f3
3,2020-01-01 00:00:24 UTC,view,5793052,1487580005754995573,,,4.92,420652863,546f6af3-a517-4752-a98b-80c4c5860711
4,2020-01-01 00:00:25 UTC,view,5899926,2115334439910245200,,,3.92,484071203,cff70ddf-529e-4b0c-a4fc-f43a749c0acb


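- event_time: time when the event occurred
- event_type: type of event
    - view: product viewed
    - cart: product added to the cart
    - remove_from_cart: product removed from the cart
    - purchase: product purchased
- product_id: product ID
- category_id: category ID
- category_code: category name
- brand: brand name
- price: product price
- user_id: customer ID
- user_session: session ID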

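## **Framing the Questions**

- What is the DAU (daily active users) trend?
    - Which day of the week has the most visits?
- What is the average time on site?
    - How does time on site differ among users who only viewed, users who added to cart, and users who went on to purchase?
- Funnel analysis
    - At which step do the most users drop off?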

![Image](https://ifdo.co.kr/viewHelpImage.apz?MTY4MTg1OTUwMCU3QiU3RA)

## **Data Preprocessing**

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3851293 entries, 0 to 3851292
Data columns (total 9 columns):
 #   Column         Dtype  
---  ------         -----  
 0   event_time     object 
 1   event_type     object 
 2   product_id     int64  
 3   category_id    int64  
 4   category_code  object 
 5   brand          object 
 6   price          float64
 7   user_id        int64  
 8   user_session   object 
dtypes: float64(1), int64(3), object(5)
memory usage: 264.4+ MB


- Convert data types

In [None]:
data['event_time'] = pd.to_datetime(data['event_time'], format='%Y-%m-%d %H:%M:%S UTC')

In [None]:
data.head()

Unnamed: 0,event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session
0,2020-01-01 00:00:00,view,5809910,1602943681873052386,,grattol,5.24,595414620,4adb70bb-edbd-4981-b60f-a05bfd32683a
1,2020-01-01 00:00:09,view,5812943,1487580012121948301,,kinetics,3.97,595414640,c8c5205d-be43-4f1d-aa56-4828b8151c8a
2,2020-01-01 00:00:19,view,5798924,1783999068867920626,,zinger,3.97,595412617,46a5010f-bd69-4fbe-a00d-bb17aa7b46f3
3,2020-01-01 00:00:24,view,5793052,1487580005754995573,,,4.92,420652863,546f6af3-a517-4752-a98b-80c4c5860711
4,2020-01-01 00:00:25,view,5899926,2115334439910245200,,,3.92,484071203,cff70ddf-529e-4b0c-a4fc-f43a749c0acb


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3851293 entries, 0 to 3851292
Data columns (total 9 columns):
 #   Column         Dtype         
---  ------         -----         
 0   event_time     datetime64[ns]
 1   event_type     object        
 2   product_id     int64         
 3   category_id    int64         
 4   category_code  object        
 5   brand          object        
 6   price          float64       
 7   user_id        int64         
 8   user_session   object        
dtypes: datetime64[ns](1), float64(1), int64(3), object(4)
memory usage: 264.4+ MB
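
The conversion above drops the literal `UTC` suffix, so `event_time` becomes a time-zone-naive column and every later daily grouping follows UTC calendar days. If the shop's customers live in another zone, the day boundaries shift. The following is a minimal sketch of keeping the zone information, using toy values; the target zone `Asia/Seoul` is only an assumption for illustration and is not stated anywhere in the dataset.

```python
import pandas as pd

# Toy timestamps in the same "... UTC" text format as the event_time column.
raw = pd.Series(["2020-01-01 00:00:00 UTC", "2020-01-01 23:30:00 UTC"])

# Parse with the explicit format, then attach the UTC time zone.
utc_time = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S UTC").dt.tz_localize("UTC")

# Convert to a local zone before deriving calendar dates.
# "Asia/Seoul" is an assumed zone, used here only as an example.
local_time = utc_time.dt.tz_convert("Asia/Seoul")
print(local_time.dt.date.tolist())
```

In this toy example the second event falls on the next calendar day in local time, which is the kind of shift that could change a DAU count.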


- Handle missing values

In [None]:
data.isna().sum()

event_time             0
event_type             0
product_id             0
category_id            0
category_code    3781761
brand            1601839
price                  0
user_id                0
user_session           0
dtype: int64
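
Raw missing counts are easier to judge as a share of all rows. As a minimal sketch on a toy frame (the values are illustrative, not taken from the dataset), `isna().mean()` returns the fraction of missing values per column:

```python
import pandas as pd

# Toy frame standing in for `data`; the values are illustrative only.
toy = pd.DataFrame({
    "brand": ["grattol", None, "zinger", None],
    "price": [5.24, 3.97, 3.97, 4.92],
})

# isna() marks missing cells as True; the mean of booleans is the missing fraction.
missing_ratio = toy.isna().mean().sort_values(ascending=False)
print(missing_ratio)
```

Applied to the counts above, this corresponds to about 98% missing for `category_code` and about 42% for `brand`.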

In [None]:
# category_code (about 98% missing) and brand (about 42% missing) have too many missing values,
# and no analysis by category or brand is planned, so drop both columns.
data.drop(['category_code','brand'], axis=1, inplace=True)
data.head()

Unnamed: 0,event_time,event_type,product_id,category_id,price,user_id,user_session
0,2020-01-01 00:00:00,view,5809910,1602943681873052386,5.24,595414620,4adb70bb-edbd-4981-b60f-a05bfd32683a
1,2020-01-01 00:00:09,view,5812943,1487580012121948301,3.97,595414640,c8c5205d-be43-4f1d-aa56-4828b8151c8a
2,2020-01-01 00:00:19,view,5798924,1783999068867920626,3.97,595412617,46a5010f-bd69-4fbe-a00d-bb17aa7b46f3
3,2020-01-01 00:00:24,view,5793052,1487580005754995573,4.92,420652863,546f6af3-a517-4752-a98b-80c4c5860711
4,2020-01-01 00:00:25,view,5899926,2115334439910245200,3.92,484071203,cff70ddf-529e-4b0c-a4fc-f43a749c0acb


- Add a date column

In [None]:
data['date_ymd'] = data['event_time'].dt.date
data.head()

Unnamed: 0,event_time,event_type,product_id,category_id,price,user_id,user_session,date_ymd
0,2020-01-01 00:00:00,view,5809910,1602943681873052386,5.24,595414620,4adb70bb-edbd-4981-b60f-a05bfd32683a,2020-01-01
1,2020-01-01 00:00:09,view,5812943,1487580012121948301,3.97,595414640,c8c5205d-be43-4f1d-aa56-4828b8151c8a,2020-01-01
2,2020-01-01 00:00:19,view,5798924,1783999068867920626,3.97,595412617,46a5010f-bd69-4fbe-a00d-bb17aa7b46f3,2020-01-01
3,2020-01-01 00:00:24,view,5793052,1487580005754995573,4.92,420652863,546f6af3-a517-4752-a98b-80c4c5860711,2020-01-01
4,2020-01-01 00:00:25,view,5899926,2115334439910245200,3.92,484071203,cff70ddf-529e-4b0c-a4fc-f43a749c0acb,2020-01-01


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3851293 entries, 0 to 3851292
Data columns (total 8 columns):
 #   Column        Dtype         
---  ------        -----         
 0   event_time    datetime64[ns]
 1   event_type    object        
 2   product_id    int64         
 3   category_id   int64         
 4   price         float64       
 5   user_id       int64         
 6   user_session  object        
 7   date_ymd      object        
dtypes: datetime64[ns](1), float64(1), int64(3), object(3)
memory usage: 235.1+ MB


In [None]:
data['date_ymd'] = pd.to_datetime(data['date_ymd'], format='%Y-%m-%d')

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3851293 entries, 0 to 3851292
Data columns (total 8 columns):
 #   Column        Dtype         
---  ------        -----         
 0   event_time    datetime64[ns]
 1   event_type    object        
 2   product_id    int64         
 3   category_id   int64         
 4   price         float64       
 5   user_id       int64         
 6   user_session  object        
 7   date_ymd      datetime64[ns]
dtypes: datetime64[ns](2), float64(1), int64(3), object(2)
memory usage: 235.1+ MB
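
The two steps above (`.dt.date`, then `pd.to_datetime`) pass through an `object` column before returning to `datetime64`. As an alternative sketch on toy data, `Series.dt.normalize()` sets the time part to midnight and keeps the `datetime64` dtype in a single step:

```python
import pandas as pd

# Toy frame standing in for `data`.
toy = pd.DataFrame({
    "event_time": pd.to_datetime(["2020-01-01 00:00:09", "2020-01-02 13:45:00"]),
})

# normalize() truncates each timestamp to midnight without changing the dtype.
toy["date_ymd"] = toy["event_time"].dt.normalize()
print(toy.dtypes)
```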


## **Analysis**

### [1] What is the DAU (daily active users) trend?
- Which day of the week has the most visits?

In [None]:
dau = data.groupby('date_ymd')[['user_id']].nunique().reset_index().rename({'user_id':'dau'}, axis=1)
dau

Unnamed: 0,date_ymd,dau
0,2020-01-01,11765
1,2020-01-02,14039
2,2020-01-03,15396
3,2020-01-04,16044
4,2020-01-05,16511
5,2020-01-06,15707
6,2020-01-07,17099
7,2020-01-08,18580
8,2020-01-09,19879
9,2020-01-10,18878


In [None]:
fig = px.line(data_frame = dau, x='date_ymd', y='dau', title='DAU trend')
fig.show()

In [None]:
dau['day_of_week'] = dau['date_ymd'].dt.day_name()
dau['day_of_week1'] = dau['date_ymd'].dt.day_of_week
dau.head()

Unnamed: 0,date_ymd,dau,day_of_week,day_of_week1
0,2020-01-01,11765,Wednesday,2
1,2020-01-02,14039,Thursday,3
2,2020-01-03,15396,Friday,4
3,2020-01-04,16044,Saturday,5
4,2020-01-05,16511,Sunday,6


In [None]:
avg_dau_by_dow = dau.groupby(['day_of_week','day_of_week1'])[['dau']].mean().reset_index()
avg_dau_by_dow.sort_values('day_of_week1', inplace=True)
avg_dau_by_dow

Unnamed: 0,day_of_week,day_of_week1,dau
1,Monday,0,19284.75
5,Tuesday,1,19855.5
6,Wednesday,2,18425.2
4,Thursday,3,18477.8
0,Friday,4,18195.8
2,Saturday,5,17041.0
3,Sunday,6,18146.25


In [None]:
fig = px.bar(data_frame = avg_dau_by_dow, x='day_of_week', y='dau', title='Average DAU by day of week', width=700, height=500)
fig.show()
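
The helper column `day_of_week1` exists only to sort the weekday names. Another option, sketched below on toy data, is an ordered categorical, which makes `groupby` return the days in calendar order without a second column. The names `toy_dau` and `weekday_order` are illustrative.

```python
import pandas as pd

weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Toy DAU table covering two full weeks; the dau values are made up.
toy_dau = pd.DataFrame({
    "date_ymd": pd.date_range("2020-01-01", periods=14, freq="D"),
    "dau": range(100, 114),
})

# An ordered categorical sorts by the given category order, not alphabetically.
toy_dau["day_of_week"] = pd.Categorical(
    toy_dau["date_ymd"].dt.day_name(), categories=weekday_order, ordered=True
)

avg_by_dow = toy_dau.groupby("day_of_week", observed=True)["dau"].mean().reset_index()
print(avg_by_dow)
```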

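### [2] What is the average time on site?
- How does time on site differ among users who only viewed, users who added to cart, and users who went on to purchase?

- Time on site is defined per session as the time of the session's last event minus the time of its first event, so a session with a single event has a duration of 0.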

In [None]:
data.query('user_session == "2806ff10-08bc-4811-9ab7-af074fe22a88"').sort_values('event_time')

Unnamed: 0,event_time,event_type,product_id,category_id,price,user_id,user_session,date_ymd
3850000,2020-01-31 23:12:25,view,5813496,1487580005553668971,11.03,583267679,2806ff10-08bc-4811-9ab7-af074fe22a88,2020-01-31
3850011,2020-01-31 23:12:44,view,5813496,1487580005553668971,11.03,583267679,2806ff10-08bc-4811-9ab7-af074fe22a88,2020-01-31
3850169,2020-01-31 23:17:36,view,5713113,1783999064069636330,9.79,583267679,2806ff10-08bc-4811-9ab7-af074fe22a88,2020-01-31
3850184,2020-01-31 23:18:01,view,5713113,1783999064069636330,9.79,583267679,2806ff10-08bc-4811-9ab7-af074fe22a88,2020-01-31
3850190,2020-01-31 23:18:11,view,5739989,1783999064069636330,3.49,583267679,2806ff10-08bc-4811-9ab7-af074fe22a88,2020-01-31
3850197,2020-01-31 23:18:27,view,5813858,1783999064069636330,3.97,583267679,2806ff10-08bc-4811-9ab7-af074fe22a88,2020-01-31
3850229,2020-01-31 23:19:08,view,5838649,1487580011283087468,19.05,583267679,2806ff10-08bc-4811-9ab7-af074fe22a88,2020-01-31
3850245,2020-01-31 23:19:35,cart,5838649,1487580011283087468,19.05,583267679,2806ff10-08bc-4811-9ab7-af074fe22a88,2020-01-31
3850368,2020-01-31 23:23:23,view,5809859,1783999064136745198,5.71,583267679,2806ff10-08bc-4811-9ab7-af074fe22a88,2020-01-31
3850371,2020-01-31 23:23:29,view,5809858,1783999064136745198,15.08,583267679,2806ff10-08bc-4811-9ab7-af074fe22a88,2020-01-31


In [None]:
# Select the session's event times once and reuse them.
session_times = data.query('user_session == "2806ff10-08bc-4811-9ab7-af074fe22a88"')['event_time']
print(session_times.max())
print(session_times.min())
print(session_times.max() - session_times.min())

2020-01-31 23:59:50
2020-01-31 23:12:25
0 days 00:47:25


In [None]:
duration = data.groupby('user_session')[['event_time']].agg(['max','min']).reset_index()
duration['duration'] = duration['event_time']['max'] - duration['event_time']['min']

In [None]:
duration.columns = ['user_session', 'max', 'min', 'duration']
duration

Unnamed: 0,user_session,max,min,duration
0,0000061d-f3e9-484b-8c73-e54f355032a3,2020-01-16 03:30:41,2020-01-16 03:30:41,0 days 00:00:00
1,00000ac8-0015-4f12-996a-be2896323738,2020-01-24 22:22:20,2020-01-24 22:22:20,0 days 00:00:00
2,00001ca1-f2df-4572-b0b8-e752e2064aae,2020-01-01 19:09:23,2020-01-01 19:09:23,0 days 00:00:00
3,00002db7-16b6-4db2-bf8b-7a1cb6bd0e7f,2020-01-22 16:51:50,2020-01-22 16:51:50,0 days 00:00:00
4,00002f68-09b8-4db3-a092-aeff45fd13ad,2020-01-25 07:17:58,2020-01-25 07:17:58,0 days 00:00:00
...,...,...,...,...
911569,ffff7b96-9751-4eaa-806e-fe979cc00dc8,2020-01-25 11:32:02,2020-01-24 16:57:30,0 days 18:34:32
911570,ffff80e2-ad33-4704-9ffe-d6c612e9641f,2020-01-21 18:07:47,2020-01-21 18:07:47,0 days 00:00:00
911571,ffff8da3-b79a-48f2-888c-117f2d1a7793,2020-01-26 10:53:09,2020-01-26 10:53:09,0 days 00:00:00
911572,ffff9422-39ba-4cdf-afd1-a9d87bb3d79b,2020-01-13 09:55:09,2020-01-13 09:55:09,0 days 00:00:00


- Compute the average session duration

In [None]:
duration['duration'].mean()

Timedelta('0 days 00:59:16.683693260')
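
The duration table above contains many single-event sessions with a duration of 0 as well as sessions spanning many hours, so the mean alone may not describe a typical session. The toy sketch below (made-up values) shows how the median and the share of zero-length sessions can be read alongside the mean:

```python
import pandas as pd

# Toy durations: mostly single-event sessions (0 s) plus one very long session.
toy_duration = pd.Series(pd.to_timedelta([0, 0, 0, 120, 64800], unit="s"))

mean_duration = toy_duration.mean()
median_duration = toy_duration.median()
zero_share = (toy_duration == pd.Timedelta(0)).mean()

print(mean_duration)    # pulled up by the single long session
print(median_duration)  # unaffected by the outlier
print(zero_share)       # fraction of sessions with only one timestamp
```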

- How does time on site differ among users who only viewed, users who added to cart, and users who went on to purchase? (measured per session)

In [None]:
session_pivot = pd.pivot_table(data=data, index='user_session', columns='event_type', values='event_time', aggfunc='count').reset_index().fillna(0)
session_pivot

event_type,user_session,cart,purchase,remove_from_cart,view
0,0000061d-f3e9-484b-8c73-e54f355032a3,0.0,0.0,0.0,1.0
1,00000ac8-0015-4f12-996a-be2896323738,0.0,0.0,0.0,1.0
2,00001ca1-f2df-4572-b0b8-e752e2064aae,0.0,0.0,0.0,1.0
3,00002db7-16b6-4db2-bf8b-7a1cb6bd0e7f,0.0,0.0,0.0,1.0
4,00002f68-09b8-4db3-a092-aeff45fd13ad,0.0,0.0,0.0,1.0
...,...,...,...,...,...
911569,ffff7b96-9751-4eaa-806e-fe979cc00dc8,1.0,0.0,2.0,10.0
911570,ffff80e2-ad33-4704-9ffe-d6c612e9641f,0.0,0.0,0.0,1.0
911571,ffff8da3-b79a-48f2-888c-117f2d1a7793,0.0,0.0,0.0,1.0
911572,ffff9422-39ba-4cdf-afd1-a9d87bb3d79b,0.0,0.0,0.0,1.0


In [None]:
cart_session = list(session_pivot.query('cart > 0')['user_session'])
purchase_session = list(session_pivot.query('purchase > 0')['user_session'])

In [None]:
view_session_avg_duration = duration.query('user_session not in @cart_session and user_session not in @purchase_session')['duration'].mean()
cart_session_avg_duration = duration.query('user_session in @cart_session')['duration'].mean()
purchase_session_avg_duration = duration.query('user_session in @purchase_session')['duration'].mean()

print(f'Average duration of view-only sessions: {view_session_avg_duration}')
print(f'Average duration of sessions with a cart event: {cart_session_avg_duration}')
print(f'Average duration of sessions with a purchase: {purchase_session_avg_duration}')

Average duration of view-only sessions: 0 days 00:38:30.953374025
Average duration of sessions with a cart event: 0 days 02:39:48.642760643
Average duration of sessions with a purchase: 0 days 06:42:21.679333566
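
In the comparison above, a session that both carts and purchases is counted in the cart group and in the purchase group. If mutually exclusive groups are preferred, one possible sketch is to label each session once and then group by that label. The frames `toy_pivot` and `toy_duration` and the segment names are illustrative stand-ins, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for session_pivot and duration.
toy_pivot = pd.DataFrame({
    "user_session": ["a", "b", "c", "d"],
    "cart": [0, 2, 1, 0],
    "purchase": [0, 0, 1, 0],
})
toy_duration = pd.DataFrame({
    "user_session": ["a", "b", "c", "d"],
    "duration": pd.to_timedelta([0, 600, 3600, 60], unit="s"),
})

# np.select uses the first matching condition, so "purchase" wins over "cart_only".
toy_pivot["segment"] = np.select(
    [toy_pivot["purchase"] > 0, toy_pivot["cart"] > 0],
    ["purchase", "cart_only"],
    default="view_only",
)

merged = toy_duration.merge(toy_pivot[["user_session", "segment"]], on="user_session")
# Average in seconds to keep the aggregation purely numeric.
merged["duration_sec"] = merged["duration"].dt.total_seconds()
avg_by_segment = merged.groupby("segment")["duration_sec"].mean()
print(avg_by_segment)
```

Merging on `user_session` also avoids the list membership checks inside `query`, which can be slow with several hundred thousand sessions.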


### [3] Funnel analysis: at which step do the most users drop off?

In [None]:
funnel = session_pivot[['view','cart','remove_from_cart','purchase']].sum().to_frame().reset_index()
funnel.columns = ['event_type','count']
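# .copy() makes an independent frame so a column can be added later without SettingWithCopyWarning.
funnel = funnel.query('event_type != "remove_from_cart"').copy()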
funnel

Unnamed: 0,event_type,count
0,view,2035188.0
1,cart,957169.0
3,purchase,184619.0


In [None]:
fig = px.funnel(data_frame=funnel, x='event_type', y='count')
fig.update_traces(texttemplate="%{value:,.0f}")
fig.show()

In [None]:
view_to_cart_rate = list(funnel['count'])[1] / list(funnel['count'])[0]
view_to_purchase_rate = list(funnel['count'])[2] / list(funnel['count'])[0]

In [None]:
funnel['retain_rate'] = [1, view_to_cart_rate, view_to_purchase_rate]
funnel

Unnamed: 0,event_type,count,retain_rate
0,view,2035188.0,1.0
1,cart,957169.0,0.47031
3,purchase,184619.0,0.090713
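
`retain_rate` measures every step against the first step (view). To see the loss between adjacent steps, each count can also be divided by the count of the previous step. The sketch below rebuilds the table from the counts shown above; the column names `overall_rate` and `step_rate` are illustrative.

```python
import pandas as pd

# Event counts copied from the funnel table above.
funnel_counts = pd.DataFrame({
    "event_type": ["view", "cart", "purchase"],
    "count": [2035188, 957169, 184619],
})

# Share of the first step that remains at each step (same idea as retain_rate).
funnel_counts["overall_rate"] = funnel_counts["count"] / funnel_counts["count"].iloc[0]

# Share of the previous step that remains; the first row has no previous step (NaN).
funnel_counts["step_rate"] = funnel_counts["count"] / funnel_counts["count"].shift(1)

print(funnel_counts.round(4))
```

Read this way, roughly 47% of view events are followed by cart events, and roughly 19% of cart events by purchase events.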


In [None]:
fig = px.funnel(data_frame=funnel, x='event_type', y='retain_rate')
fig.update_traces(texttemplate="%{value:,.2%}")
fig.show()
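
The funnel above sums event counts, so a session that views ten products contributes ten to the view step. A funnel over unique sessions is one alternative; the following is a minimal sketch on toy data, and the frame `toy` and column name `sessions` are assumptions for illustration.

```python
import pandas as pd

# Toy event log: session "c" reaches purchase, "a" stops at cart, "b" only views.
toy = pd.DataFrame({
    "user_session": ["a", "a", "a", "b", "b", "c", "c", "c", "c"],
    "event_type": ["view", "view", "cart", "view", "view",
                   "view", "cart", "cart", "purchase"],
})

steps = ["view", "cart", "purchase"]

# Count distinct sessions per step, then put the steps in funnel order.
session_funnel = (
    toy[toy["event_type"].isin(steps)]
    .groupby("event_type")["user_session"]
    .nunique()
    .reindex(steps)
    .rename_axis("event_type")
    .rename("sessions")
    .reset_index()
)
print(session_funnel)
```

The same pattern with `user_id` in place of `user_session` would give a user-level funnel.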

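## **Summary**

[1] What is the DAU (daily active users) trend?
- DAU rises from the start of the month through mid-month, then levels off.
- Visits peak on Tuesday (average DAU of about 19,900) and are lowest on Saturday (about 17,000).

[2] What is the average time on site?
- The average session duration is about 1 hour.
- View-only sessions last about 40 minutes, sessions with a cart event about 2 hours 40 minutes, and sessions with a purchase about 6 hours 40 minutes.

[3] Funnel analysis
- The rates below are based on event counts, not unique users.
- From product view to add-to-cart, about 47.0% remain.
- From add-to-cart to purchase, about 19.3% remain, which is about 9.1% of product views.
- Drop-off is steepest between cart and purchase (about 81% of cart events do not lead to a purchase), so a strategy to raise conversion at this step is needed.
- For example, drill down into the order form, benefits, or sign-up flow to check for problems.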