# Discovering knowledge in customer shopping behaviors
### Course: DAMI330484_22_2_01
### Instructor: M.Sc. Nguyen Van Thanh
| Group 19         |          |
|:-----------------|:---------|
| Đỗ Hoàng Thịnh   | 20133122 |
| Nguyễn Minh Tiến | 20133093 |
| Huỳnh Nguyễn Tín | 20133094 |
| Bùi Lê Hải Triều | 20133101 |

### 1. Dataset
The group uses a dataset of customer transactions from 10 major shopping malls in the city of Istanbul, covering 2021 up to the present (2023), published on [Kaggle](https://www.kaggle.com/datasets/mehmettahiraslan/customer-shopping-dataset). Besides transaction details, the dataset also provides customer age and gender, which makes it suitable for data mining.

In [30]:
import matplotlib.pyplot as plt
import pandas as pd

In [52]:
transactions = pd.read_csv("./transactions.csv")
transactions.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99457 entries, 0 to 99456
Data columns (total 10 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   invoice_no      99457 non-null  object 
 1   customer_id     99457 non-null  object 
 2   gender          99457 non-null  object 
 3   age             99457 non-null  int64  
 4   category        99457 non-null  object 
 5   quantity        99457 non-null  int64  
 6   price           99457 non-null  float64
 7   payment_method  99457 non-null  object 
 8   invoice_date    99457 non-null  object 
 9   shopping_mall   99457 non-null  object 
dtypes: float64(1), int64(2), object(7)
memory usage: 7.6+ MB


The dataset has 99457 transactions and 10 columns.

| Attribute      | Description                               | Example                       | Data type   |
|:---------------|:------------------------------------------|:------------------------------|:------------|
| invoice_no     | Invoice number                            | I138884                       | Categorical |
| customer_id    | Customer ID                               | C241288                       | Categorical |
| gender         | Gender                                    | Male, Female                  | Categorical |
| age            | Age                                       | 18, 69                        | Numerical   |
| category       | Product category                          | Clothing                      | Categorical |
| quantity       | Number of products in the transaction     | 1, 5                          | Numerical   |
| price          | Unit price of the product                 | 1500.4                        | Numerical   |
| payment_method | Payment method                            | Cash, Credit Card, Debit Card | Categorical |
| invoice_date   | Date of the transaction                   | 5/8/2022                      | Categorical |
| shopping_mall  | Shopping mall where the transaction occurred | Kanyon                     | Categorical |
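
The `invoice_date` column is loaded as plain strings (`object`). If date-based analysis is needed, it could be parsed into a datetime column. The sketch below assumes the dates are day-first (suggested by values such as `16/07/2021` in the sample output); the toy frame `sample_dates` is hypothetical and only illustrates the call.

```python
import pandas as pd

# Hypothetical toy frame holding dates in the same style as the dataset
sample_dates = pd.DataFrame({"invoice_date": ["16/07/2021", "1/8/2022", "5/8/2022"]})

# Assumption: day comes first, then month, then a four-digit year
sample_dates["invoice_date"] = pd.to_datetime(sample_dates["invoice_date"], format="%d/%m/%Y")

# Once parsed, date parts such as the month become available
sample_dates["month"] = sample_dates["invoice_date"].dt.month
print(sample_dates)
```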

In [32]:
transactions.sample(5)

Unnamed: 0,invoice_no,customer_id,gender,age,category,quantity,price,payment_method,invoice_date,shopping_mall
64847,I224433,C475590,Male,48,Toys,1,35.84,Cash,16/07/2021,Viaport Outlet
79935,I183804,C258724,Female,38,Clothing,3,900.24,Credit Card,27/06/2021,Kanyon
54611,I153727,C119752,Female,40,Clothing,4,1200.32,Credit Card,24/03/2022,Istinye Park
32856,I302841,C249686,Male,30,Technology,2,2100.0,Credit Card,25/01/2022,Mall of Istanbul
2614,I286335,C542440,Female,66,Cosmetics,5,203.3,Cash,1/8/2022,Forum Istanbul


In [33]:
transactions.isnull().sum()

invoice_no        0
customer_id       0
gender            0
age               0
category          0
quantity          0
price             0
payment_method    0
invoice_date      0
shopping_mall     0
dtype: int64

In [34]:
transactions.duplicated().sum()

0

The dataset contains no null values in any column and no duplicate transactions.

### 2. Data preparation
To support the later mining steps, the group creates a new column holding the total amount paid in each transaction.

In [35]:
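# Total amount paid per transaction
transactions['total'] = transactions['quantity'] * transactions['price']
transactions.sample(5)
# Note: this assignment creates an alias, not a copy, so later in-place changes to transactions also apply to Kmeans_df
Kmeans_df = transactions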

The group also bins customer ages into 6 age groups to reduce noise in the dataset: 18 to 24, 25 to 34, 35 to 44, 45 to 54, 55 to 64, and 65 to 70.

In [36]:
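# pd.cut uses right-closed intervals, so the first edge is 17 to keep age 18 inside '18-24'
bins = [17, 24, 34, 44, 54, 64, 70]
labels = ['18-24', '25-34', '35-44', '45-54', '55-64', '65-70']
transactions['AgeGroup'] = pd.cut(transactions['age'], bins=bins, labels=labels)
# Make the ordering of the age groups explicit
AgeGroup_type = pd.CategoricalDtype(labels, ordered=True)
transactions['AgeGroup'] = transactions['AgeGroup'].astype(AgeGroup_type)
# The raw age is no longer needed once the age group is available
transactions.drop('age', axis=1, inplace=True)
transactions.sample(5)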

Unnamed: 0,invoice_no,customer_id,gender,category,quantity,price,payment_method,invoice_date,shopping_mall,total,AgeGroup
63999,I268656,C213175,Female,Food & Beverage,2,10.46,Credit Card,14/03/2022,Metrocity,20.92,35-44
75356,I268008,C160492,Female,Food & Beverage,5,26.15,Credit Card,15/08/2021,Metrocity,130.75,45-54
41915,I422015,C216247,Female,Clothing,2,600.16,Cash,26/01/2023,Emaar Square Mall,1200.32,18-24
24606,I101915,C112646,Male,Clothing,1,300.08,Credit Card,22/09/2022,Kanyon,300.08,35-44
62821,I244110,C267494,Female,Clothing,5,1500.4,Cash,25/10/2021,Metropol AVM,7502.0,55-64
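
As a small illustration of how `pd.cut` treats bin edges: intervals are right-closed by default, so a value equal to the lowest edge is left unbinned unless `include_lowest=True` is passed (or the lowest edge is moved down). The series `ages` below is a hypothetical toy input, not taken from the dataset.

```python
import pandas as pd

# Hypothetical ages chosen to sit on or near bin edges
ages = pd.Series([18, 24, 25, 69])
edges = [18, 24, 34, 44, 54, 64, 70]
labels = ['18-24', '25-34', '35-44', '45-54', '55-64', '65-70']

# Default behaviour: the first interval is (18, 24], so age 18 falls outside every bin
default_cut = pd.cut(ages, bins=edges, labels=labels)

# include_lowest=True makes the first interval include its left edge
inclusive_cut = pd.cut(ages, bins=edges, labels=labels, include_lowest=True)

print(default_cut.isna().tolist())
print(inclusive_cut.astype(str).tolist())
```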


The group can reduce the amount of data by removing columns that carry no value for mining, such as the invoice number and the customer ID.

In [37]:
transactions.duplicated(subset=['invoice_no']).any()

False

In [38]:
transactions.duplicated(subset=['customer_id']).any()

False

The dataset has no transactions that share the same invoice number or the same customer ID. This means each customer made only one transaction. Therefore, the group can drop these two columns.
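
The same uniqueness check can also be expressed with `Series.is_unique` and `value_counts`. This is only a sketch on a hypothetical toy frame (`toy_ids`), not a replacement for the cells above.

```python
import pandas as pd

# Hypothetical toy identifiers in the same style as the dataset
toy_ids = pd.DataFrame({
    "invoice_no": ["I138884", "I317333", "I127801"],
    "customer_id": ["C241288", "C111565", "C266599"],
})

# True only when no identifier is repeated in either column
ids_are_unique = toy_ids["invoice_no"].is_unique and toy_ids["customer_id"].is_unique

# Largest number of transactions made by a single customer
max_tx_per_customer = toy_ids["customer_id"].value_counts().max()

print(ids_are_unique, max_tx_per_customer)
```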

In [39]:
transactions.drop(['invoice_no', 'customer_id'], axis=1, inplace=True)
transactions.sample(5)

Unnamed: 0,gender,category,quantity,price,payment_method,invoice_date,shopping_mall,total,AgeGroup
63761,Female,Souvenir,1,11.73,Debit Card,2/3/2023,Viaport Outlet,11.73,25-34
39471,Female,Shoes,3,1800.51,Cash,4/2/2022,Istinye Park,5401.53,45-54
88720,Female,Food & Beverage,2,10.46,Cash,21/12/2021,Istinye Park,20.92,35-44
82914,Female,Cosmetics,4,162.64,Credit Card,14/02/2021,Metrocity,650.56,45-54
92106,Male,Food & Beverage,2,10.46,Debit Card,21/10/2021,Kanyon,20.92,45-54


Check the number of duplicate transactions after removing the two columns above.

In [40]:
transactions.duplicated().sum()

1111

In [41]:
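# drop_duplicates returns a new DataFrame; without an assignment, transactions itself keeps all rows
transactions.drop_duplicates(keep='first')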

Unnamed: 0,gender,category,quantity,price,payment_method,invoice_date,shopping_mall,total,AgeGroup
0,Female,Clothing,5,1500.40,Credit Card,5/8/2022,Kanyon,7502.00,25-34
1,Male,Shoes,3,1800.51,Debit Card,12/12/2021,Forum Istanbul,5401.53,18-24
2,Male,Clothing,1,300.08,Cash,9/11/2021,Metrocity,300.08,18-24
3,Female,Shoes,5,3000.85,Credit Card,16/05/2021,Metropol AVM,15004.25,65-70
4,Female,Books,4,60.60,Cash,24/10/2021,Kanyon,242.40,45-54
...,...,...,...,...,...,...,...,...,...
99452,Female,Souvenir,5,58.65,Credit Card,21/09/2022,Kanyon,293.25,45-54
99453,Male,Food & Beverage,2,10.46,Cash,22/09/2021,Forum Istanbul,20.92,25-34
99454,Male,Food & Beverage,2,10.46,Debit Card,28/03/2021,Metrocity,20.92,55-64
99455,Male,Technology,4,4200.00,Cash,16/03/2021,Istinye Park,16800.00,55-64
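
A brief sketch of how `drop_duplicates` behaves: it returns a new DataFrame and leaves the original untouched unless the result is assigned back. The frame `toy_rows` is hypothetical.

```python
import pandas as pd

# Hypothetical toy frame where the first two rows are identical
toy_rows = pd.DataFrame({
    "gender": ["Female", "Female", "Male"],
    "total": [20.92, 20.92, 300.08],
})

# The call returns a de-duplicated copy
deduplicated = toy_rows.drop_duplicates(keep="first")

# The original frame still holds every row
print(len(toy_rows), len(deduplicated))
```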


### 3. EDA
Before mining the data, the group performs a preliminary analysis of the current dataset through visualizations to better understand the business context.

In [42]:
import seaborn as sns
import plotly.express as px

#### 3.1. Category wise
First, the most popular product categories by total quantity of products across transactions.

In [43]:
category = transactions.groupby('category')['quantity'].sum()
category = pd.DataFrame({'category': category.index, 'quantity': category.values})
category['categories'] = 'categories'

fig = px.treemap(category, path=['categories', 'category'], values='quantity', color='quantity',
                 hover_data=['category'], color_continuous_scale='Blues')
fig.update_layout(width=1000, height=600, paper_bgcolor='LightSteelBlue')
fig.show()

Thus, products in the Clothing, Cosmetics, and Food & Beverage categories appear most often across all transactions.

#### 3.2. Gender wise
Notably, Clothing and Cosmetics are two categories that in practice are often bought by women, so the number of female customers may be higher than the number of male customers.

In [44]:
transactions['gender'].value_counts()

Female    59482
Male      39975
Name: gender, dtype: int64

With nearly 20000 more female customers than male customers, most of the revenue may come from female customers.

In [45]:
gender = transactions.groupby('gender')['total'].sum()
gender = pd.DataFrame({'gender': gender.index, 'total': gender.values})

fig = px.pie(gender, values='total', names='gender')
fig.update_layout(paper_bgcolor='LightSteelBlue')
fig.show()

As expected, nearly 60% of revenue comes from female customers.

In [46]:
gender_category = transactions.groupby(['gender', 'category'])['total'].sum().unstack().reset_index()

fig = px.bar(gender_category,
             x=['Books', 'Clothing', 'Cosmetics', 'Food & Beverage', 'Shoes', 'Souvenir', 'Technology', 'Toys'],
             y='gender')
fig.update_layout(width=1000, height=600, plot_bgcolor='LightSteelBlue', paper_bgcolor='LightSteelBlue',
                  legend=dict(title='category'))
fig.show()

In every product category, female customers spend more in total than male customers. However, this may simply be because there are more female customers.
Therefore, the group cannot rely on visualizations like the one above to make marketing decisions or to build a recommendation system. Instead, to devise an accurate and effective strategy for maintaining customer relationships, the group needs to carry out data mining.
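
One way to separate the effect of customer count from spending behavior is to look at the average amount per transaction next to the total. The sketch below uses a hypothetical toy frame (`toy_sales`) with made-up amounts; it only illustrates the aggregation and is not a result from the dataset.

```python
import pandas as pd

# Hypothetical toy data: more female transactions, same average spend
toy_sales = pd.DataFrame({
    "gender": ["Female", "Female", "Female", "Male", "Male"],
    "category": ["Clothing"] * 5,
    "total": [300.0, 600.0, 300.0, 300.0, 500.0],
})

# Total, number of transactions, and average per transaction for each group
summary = toy_sales.groupby(["gender", "category"])["total"].agg(["sum", "count", "mean"])
print(summary)
```

In this toy case the totals differ (1200 versus 800) while the means are equal (400), which is the situation the paragraph above warns about.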

### 4. Data mining
The group's main goal is to identify loyal customer segments or products with high business value using clustering and classification algorithms. In addition, association rule mining will be used to analyze customer purchasing behavior and to find trends and patterns that are useful for business decisions.

#### 4.1 FP Growth
FP-Growth is an algorithm for mining frequent patterns from large datasets.

FP-Growth uses a recursive divide-and-conquer approach: it recursively builds conditional FP-trees and derives the frequent itemsets from them.
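
To make the notion of a frequent itemset concrete, here is a minimal sketch that finds them by brute-force enumeration. It is deliberately not FP-Growth (no FP-tree is built); it only shows what output such an algorithm is expected to produce on a tiny input. The helper `frequent_itemsets` and the baskets in `toy_baskets` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    n = len(transactions)
    counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))
        # Count every non-empty subset of the basket (fine only for tiny baskets)
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    return {itemset: count / n for itemset, count in counts.items() if count / n >= min_support}


# Hypothetical baskets: one age group and one category per transaction
toy_baskets = [
    ["AgeGroup_20-30", "category_Clothing"],
    ["AgeGroup_20-30", "category_Shoes"],
    ["AgeGroup_60-70", "category_Clothing"],
    ["AgeGroup_20-30", "category_Clothing"],
]

result = frequent_itemsets(toy_baskets, min_support=0.5)
for itemset, support in result.items():
    print(sorted(itemset), support)
```

FP-Growth reaches the same kind of result without enumerating every subset, which is what makes it practical on large datasets.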

In [47]:
from mlxtend.frequent_patterns import fpgrowth, apriori
from mlxtend.frequent_patterns import association_rules
import plotly.express as px

#### Purchasing trends by age group and product category

##### Data Preparation

In [53]:
# Reload the raw data: 'age', 'invoice_no', and 'customer_id' were dropped from transactions above
Shopping = pd.read_csv("./transactions.csv").assign(AgeGroup=None)

# Assign each customer to a 10-year age band (lower bound inclusive, upper bound exclusive)
Shopping.loc[(Shopping['age'] > 0) & (Shopping['age'] < 10), 'AgeGroup'] = '1-10'
Shopping.loc[(Shopping['age'] >= 10) & (Shopping['age'] < 20), 'AgeGroup'] = '10-20'
Shopping.loc[(Shopping['age'] >= 20) & (Shopping['age'] < 30), 'AgeGroup'] = '20-30'
Shopping.loc[(Shopping['age'] >= 30) & (Shopping['age'] < 40), 'AgeGroup'] = '30-40'
Shopping.loc[(Shopping['age'] >= 40) & (Shopping['age'] < 50), 'AgeGroup'] = '40-50'
Shopping.loc[(Shopping['age'] >= 50) & (Shopping['age'] < 60), 'AgeGroup'] = '50-60'
Shopping.loc[(Shopping['age'] >= 60) & (Shopping['age'] < 70), 'AgeGroup'] = '60-70'
Shopping.loc[Shopping['age'] >= 70, 'AgeGroup'] = 'Elderly'
Shopping

Unnamed: 0,invoice_no,customer_id,gender,age,category,quantity,price,payment_method,invoice_date,shopping_mall,AgeGroup
0,I138884,C241288,Female,28,Clothing,5,1500.40,Credit Card,5/8/2022,Kanyon,20-30
1,I317333,C111565,Male,21,Shoes,3,1800.51,Debit Card,12/12/2021,Forum Istanbul,20-30
2,I127801,C266599,Male,20,Clothing,1,300.08,Cash,9/11/2021,Metrocity,20-30
3,I173702,C988172,Female,66,Shoes,5,3000.85,Credit Card,16/05/2021,Metropol AVM,60-70
4,I337046,C189076,Female,53,Books,4,60.60,Cash,24/10/2021,Kanyon,50-60
...,...,...,...,...,...,...,...,...,...,...,...
99452,I219422,C441542,Female,45,Souvenir,5,58.65,Credit Card,21/09/2022,Kanyon,40-50
99453,I325143,C569580,Male,27,Food & Beverage,2,10.46,Cash,22/09/2021,Forum Istanbul,20-30
99454,I824010,C103292,Male,63,Food & Beverage,2,10.46,Debit Card,28/03/2021,Metrocity,60-70
99455,I702964,C800631,Male,56,Technology,4,4200.00,Cash,16/03/2021,Istinye Park,50-60


In [54]:
AgeGroup_Category_trends = Shopping.loc[:, ['AgeGroup','category']]
AgeGroup_Category_trends.head(10)

Unnamed: 0,AgeGroup,category
0,20-30,Clothing
1,20-30,Shoes
2,20-30,Clothing
3,60-70,Shoes
4,50-60,Books
5,20-30,Clothing
6,40-50,Cosmetics
7,30-40,Clothing
8,60-70,Clothing
9,60-70,Clothing


In [55]:
one_hot = pd.get_dummies(AgeGroup_Category_trends[['AgeGroup', 'category']])
AgeGroup_Category_binary = pd.concat([AgeGroup_Category_trends, one_hot], axis=1)
AgeGroup_Category_binary.drop(['AgeGroup',  'category'], axis=1, inplace=True)
AgeGroup_Category_trends_binary = AgeGroup_Category_binary.astype(bool)
AgeGroup_Category_trends_binary.head(10)

Unnamed: 0,AgeGroup_10-20,AgeGroup_20-30,AgeGroup_30-40,AgeGroup_40-50,AgeGroup_50-60,AgeGroup_60-70,category_Books,category_Clothing,category_Cosmetics,category_Food & Beverage,category_Shoes,category_Souvenir,category_Technology,category_Toys
0,False,True,False,False,False,False,False,True,False,False,False,False,False,False
1,False,True,False,False,False,False,False,False,False,False,True,False,False,False
2,False,True,False,False,False,False,False,True,False,False,False,False,False,False
3,False,False,False,False,False,True,False,False,False,False,True,False,False,False
4,False,False,False,False,True,False,True,False,False,False,False,False,False,False
5,False,True,False,False,False,False,False,True,False,False,False,False,False,False
6,False,False,False,True,False,False,False,False,True,False,False,False,False,False
7,False,False,True,False,False,False,False,True,False,False,False,False,False,False
8,False,False,False,False,False,True,False,True,False,False,False,False,False,False
9,False,False,False,False,False,True,False,True,False,False,False,False,False,False
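
As a compact illustration of the encoding step, `pd.get_dummies` turns each distinct value into its own column, prefixed with the source column name, so every transaction ends up with exactly one age-group flag and one category flag. The frame `toy_trends` is a hypothetical three-row input.

```python
import pandas as pd

# Hypothetical toy input with one age group and one category per row
toy_trends = pd.DataFrame({
    "AgeGroup": ["20-30", "60-70", "20-30"],
    "category": ["Clothing", "Shoes", "Shoes"],
})

# One-hot encode both columns and store the flags as booleans
encoded = pd.get_dummies(toy_trends).astype(bool)

print(list(encoded.columns))
print(encoded.sum(axis=1).tolist())  # number of True flags per row
```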


##### Build Model

In [56]:
# Pass the boolean DataFrame built above, which avoids mlxtend's non-bool warning
res = fpgrowth(AgeGroup_Category_trends_binary, min_support=0.02, use_colnames=True)
res.head(10)

Unnamed: 0,support,itemsets
0,0.346753,(category_Clothing)
1,0.193682,(AgeGroup_20-30)
2,0.100888,(category_Shoes)
3,0.19147,(AgeGroup_60-70)
4,0.190344,(AgeGroup_50-60)
5,0.050082,(category_Books)
6,0.192576,(AgeGroup_40-50)
7,0.151794,(category_Cosmetics)
8,0.193923,(AgeGroup_30-40)
9,0.148567,(category_Food & Beverage)


##### Show Rules in dataset

In [57]:
AgeGroup_Category_Rules = association_rules(res, metric="lift")
AgeGroup_Category_Rules = AgeGroup_Category_Rules.sort_values("confidence",ascending=False)
#Rules
AgeGroup_Category_Rules.head(10)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
6,(AgeGroup_40-50),(category_Clothing),0.192576,0.346753,0.067165,0.34877,1.005818,0.000389,1.003098,0.007164
1,(AgeGroup_20-30),(category_Clothing),0.193682,0.346753,0.067466,0.348336,1.004566,0.000307,1.00243,0.005637
3,(AgeGroup_60-70),(category_Clothing),0.19147,0.346753,0.066391,0.346742,0.999967,-2e-06,0.999983,-4e-05
5,(AgeGroup_50-60),(category_Clothing),0.190344,0.346753,0.065667,0.34499,0.994915,-0.000336,0.997308,-0.006273
18,(AgeGroup_30-40),(category_Clothing),0.193923,0.346753,0.066662,0.343755,0.991354,-0.000581,0.995432,-0.010704
30,(category_Toys),(AgeGroup_20-30),0.101421,0.193682,0.020451,0.201646,1.041119,0.000808,1.009976,0.043953
20,(category_Food & Beverage),(AgeGroup_30-40),0.148567,0.193923,0.02948,0.19843,1.023241,0.00067,1.005623,0.026676
17,(category_Cosmetics),(AgeGroup_30-40),0.151794,0.193923,0.029691,0.195602,1.008657,0.000255,1.002087,0.010119
14,(category_Cosmetics),(AgeGroup_50-60),0.151794,0.190344,0.029561,0.194741,1.023101,0.000667,1.00546,0.02662
0,(category_Clothing),(AgeGroup_20-30),0.346753,0.193682,0.067466,0.194566,1.004566,0.000307,1.001098,0.006958


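**Conclusion**: **Clothing** is purchased across all age groups, with a confidence of roughly 0.34 to 0.35 for every age band shown.

In addition, customers who buy **Toys** tend to be in the **20-30** age group, and **Food & Beverage** tends to be bought by the **30-40** age group. The lift values for these rules (about 1.04 and 1.02) are close to 1, so the associations are weak.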
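
The confidence, lift, and leverage columns in the rules table can be recomputed from the three support columns. The sketch below uses the values of the rule `(category_Toys) -> (AgeGroup_20-30)` copied from the table; because those inputs are already rounded, the results match the table only approximately.

```python
# Support values copied from the rule (category_Toys) -> (AgeGroup_20-30)
support_both = 0.020451        # support of antecedent and consequent together
support_antecedent = 0.101421  # support of category_Toys
support_consequent = 0.193682  # support of AgeGroup_20-30

# confidence = P(consequent | antecedent)
confidence = support_both / support_antecedent

# lift = confidence relative to the baseline frequency of the consequent
lift = confidence / support_consequent

# leverage = observed joint support minus the support expected under independence
leverage = support_both - support_antecedent * support_consequent

print(round(confidence, 4), round(lift, 4), round(leverage, 6))
```

A lift of 1 means the two items occur together exactly as often as independence would predict, which is why values this close to 1 indicate only a weak association.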

#### Purchasing trends by shopping mall and category

##### Data Preparation

In [58]:
Mall_Category_trends = Shopping.loc[:, ['shopping_mall','category']]
Mall_Category_trends.head(10)

Unnamed: 0,shopping_mall,category
0,Kanyon,Clothing
1,Forum Istanbul,Shoes
2,Metrocity,Clothing
3,Metropol AVM,Shoes
4,Kanyon,Books
5,Forum Istanbul,Clothing
6,Istinye Park,Cosmetics
7,Mall of Istanbul,Clothing
8,Metrocity,Clothing
9,Kanyon,Clothing


In [59]:
one_hot = pd.get_dummies(Mall_Category_trends[['shopping_mall', 'category']])
Mall_Category_binary = pd.concat([Mall_Category_trends, one_hot], axis=1)
Mall_Category_binary.drop(['shopping_mall',  'category'], axis=1, inplace=True)
Mall_Category_binary = Mall_Category_binary.astype(bool)
Mall_Category_binary.head(10)

Unnamed: 0,shopping_mall_Cevahir AVM,shopping_mall_Emaar Square Mall,shopping_mall_Forum Istanbul,shopping_mall_Istinye Park,shopping_mall_Kanyon,shopping_mall_Mall of Istanbul,shopping_mall_Metrocity,shopping_mall_Metropol AVM,shopping_mall_Viaport Outlet,shopping_mall_Zorlu Center,category_Books,category_Clothing,category_Cosmetics,category_Food & Beverage,category_Shoes,category_Souvenir,category_Technology,category_Toys
0,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False
1,False,False,True,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
2,False,False,False,False,False,False,True,False,False,False,False,True,False,False,False,False,False,False
3,False,False,False,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False
4,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False
5,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False
6,False,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False
7,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False
8,False,False,False,False,False,False,True,False,False,False,False,True,False,False,False,False,False,False
9,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False


##### Build Model

In [60]:
res=fpgrowth(Mall_Category_binary,min_support=0.02, use_colnames=True)

Mall_Category_Rules = association_rules(res, metric="lift")
Mall_Category_Rules = Mall_Category_Rules.sort_values("confidence",ascending=False)
#Rules
Mall_Category_Rules.head(10)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
7,(shopping_mall_Metrocity),(category_Clothing),0.15093,0.346753,0.052968,0.350943,1.012083,0.000632,1.006455,0.014061
18,(shopping_mall_Mall of Istanbul),(category_Clothing),0.200519,0.346753,0.069608,0.347139,1.001115,7.7e-05,1.000592,0.001393
10,(shopping_mall_Metropol AVM),(category_Clothing),0.102165,0.346753,0.035442,0.346915,1.000467,1.7e-05,1.000248,0.000519
0,(shopping_mall_Kanyon),(category_Clothing),0.199312,0.346753,0.068773,0.345054,0.9951,-0.000339,0.997406,-0.006112
17,(shopping_mall_Istinye Park),(category_Clothing),0.098344,0.346753,0.033713,0.342807,0.988622,-0.000388,0.993997,-0.012603
13,(category_Cosmetics),(shopping_mall_Mall of Istanbul),0.151794,0.200519,0.030667,0.202027,1.007521,0.000229,1.00189,0.008801
5,(category_Shoes),(shopping_mall_Mall of Istanbul),0.100888,0.200519,0.02034,0.201615,1.005464,0.000111,1.001372,0.006044
3,(category_Shoes),(shopping_mall_Kanyon),0.100888,0.199312,0.02028,0.201017,1.008551,0.000172,1.002133,0.00943
19,(category_Clothing),(shopping_mall_Mall of Istanbul),0.346753,0.200519,0.069608,0.200742,1.001115,7.7e-05,1.00028,0.001704
15,(category_Cosmetics),(shopping_mall_Kanyon),0.151794,0.199312,0.030395,0.200238,1.004647,0.000141,1.001158,0.005453


**Conclusion**: **Clothing** is the most purchased category at most of the shopping malls, which suggests it is a best-selling category bought widely across all age groups. **Shoes** are also bought frequently at the Kanyon shopping mall. From this we can see that this mall is a trusted place to buy apparel, although the lift values near 1 indicate that these associations are weak.

#### 4.2 Apriori

In [61]:
# Pass the boolean DataFrame built above, which avoids mlxtend's non-bool warning
res = apriori(AgeGroup_Category_trends_binary, min_support=0.02, use_colnames=True)
res.head(10)


Unnamed: 0,support,itemsets
0,0.038006,(AgeGroup_10-20)
1,0.193682,(AgeGroup_20-30)
2,0.193923,(AgeGroup_30-40)
3,0.192576,(AgeGroup_40-50)
4,0.190344,(AgeGroup_50-60)
5,0.19147,(AgeGroup_60-70)
6,0.050082,(category_Books)
7,0.346753,(category_Clothing)
8,0.151794,(category_Cosmetics)
9,0.148567,(category_Food & Beverage)


In [62]:

AgeGroup_Category_Rules = association_rules(res, metric="lift")
AgeGroup_Category_Rules = AgeGroup_Category_Rules.sort_values("confidence",ascending=False)
#Rules
AgeGroup_Category_Rules.head(10)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
14,(AgeGroup_40-50),(category_Clothing),0.192576,0.346753,0.067165,0.34877,1.005818,0.000389,1.003098,0.007164
1,(AgeGroup_20-30),(category_Clothing),0.193682,0.346753,0.067466,0.348336,1.004566,0.000307,1.00243,0.005637
27,(AgeGroup_60-70),(category_Clothing),0.19147,0.346753,0.066391,0.346742,0.999967,-2e-06,0.999983,-4e-05
21,(AgeGroup_50-60),(category_Clothing),0.190344,0.346753,0.065667,0.34499,0.994915,-0.000336,0.997308,-0.006273
8,(AgeGroup_30-40),(category_Clothing),0.193923,0.346753,0.066662,0.343755,0.991354,-0.000581,0.995432,-0.010704
6,(category_Toys),(AgeGroup_20-30),0.101421,0.193682,0.020451,0.201646,1.041119,0.000808,1.009976,0.043953
12,(category_Food & Beverage),(AgeGroup_30-40),0.148567,0.193923,0.02948,0.19843,1.023241,0.00067,1.005623,0.026676
11,(category_Cosmetics),(AgeGroup_30-40),0.151794,0.193923,0.029691,0.195602,1.008657,0.000255,1.002087,0.010119
22,(category_Cosmetics),(AgeGroup_50-60),0.151794,0.190344,0.029561,0.194741,1.023101,0.000667,1.00546,0.02662
0,(category_Clothing),(AgeGroup_20-30),0.346753,0.193682,0.067466,0.194566,1.004566,0.000307,1.001098,0.006958


In [63]:
res=apriori(Mall_Category_binary,min_support=0.02, use_colnames=True)
res.head(10)

Unnamed: 0,support,itemsets
0,0.050182,(shopping_mall_Cevahir AVM)
1,0.048373,(shopping_mall_Emaar Square Mall)
2,0.04974,(shopping_mall_Forum Istanbul)
3,0.098344,(shopping_mall_Istinye Park)
4,0.199312,(shopping_mall_Kanyon)
5,0.200519,(shopping_mall_Mall of Istanbul)
6,0.15093,(shopping_mall_Metrocity)
7,0.102165,(shopping_mall_Metropol AVM)
8,0.049408,(shopping_mall_Viaport Outlet)
9,0.051027,(shopping_mall_Zorlu Center)


In [64]:

Mall_Category_Rules = association_rules(res, metric="lift")
Mall_Category_Rules = Mall_Category_Rules.sort_values("confidence",ascending=False)
#Rules
Mall_Category_Rules.head(10)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
21,(shopping_mall_Metrocity),(category_Clothing),0.15093,0.346753,0.052968,0.350943,1.012083,0.000632,1.006455,0.014061
10,(shopping_mall_Mall of Istanbul),(category_Clothing),0.200519,0.346753,0.069608,0.347139,1.001115,7.7e-05,1.000592,0.001393
26,(shopping_mall_Metropol AVM),(category_Clothing),0.102165,0.346753,0.035442,0.346915,1.000467,1.7e-05,1.000248,0.000519
2,(shopping_mall_Kanyon),(category_Clothing),0.199312,0.346753,0.068773,0.345054,0.9951,-0.000339,0.997406,-0.006112
1,(shopping_mall_Istinye Park),(category_Clothing),0.098344,0.346753,0.033713,0.342807,0.988622,-0.000388,0.993997,-0.012603
13,(category_Cosmetics),(shopping_mall_Mall of Istanbul),0.151794,0.200519,0.030667,0.202027,1.007521,0.000229,1.00189,0.008801
17,(category_Shoes),(shopping_mall_Mall of Istanbul),0.100888,0.200519,0.02034,0.201615,1.005464,0.000111,1.001372,0.006044
9,(category_Shoes),(shopping_mall_Kanyon),0.100888,0.199312,0.02028,0.201017,1.008551,0.000172,1.002133,0.00943
11,(category_Clothing),(shopping_mall_Mall of Istanbul),0.346753,0.200519,0.069608,0.200742,1.001115,7.7e-05,1.00028,0.001704
5,(category_Cosmetics),(shopping_mall_Kanyon),0.151794,0.199312,0.030395,0.200238,1.004647,0.000141,1.001158,0.005453


**Comparison**: Cross-checking with the Apriori algorithm gives results equivalent to those of FP-Growth: the top 10 rules and their metrics are identical in both cases.
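
The two algorithms return their itemsets in a different row order, so a direct comparison needs to ignore ordering. The sketch below assumes result frames with `support` and `itemsets` columns, as in the outputs above; the helper `as_support_map` is hypothetical, and the two small frames reuse a couple of values from those outputs rather than the full results.

```python
import pandas as pd


def as_support_map(frequent_itemsets, decimals=6):
    """Map each itemset to its rounded support so that row order no longer matters."""
    return {
        frozenset(items): round(support, decimals)
        for items, support in zip(frequent_itemsets["itemsets"], frequent_itemsets["support"])
    }


# Two rows in FP-Growth order
fp_result = pd.DataFrame({
    "support": [0.346753, 0.193682],
    "itemsets": [frozenset({"category_Clothing"}), frozenset({"AgeGroup_20-30"})],
})

# The same two rows in Apriori order
ap_result = pd.DataFrame({
    "support": [0.193682, 0.346753],
    "itemsets": [frozenset({"AgeGroup_20-30"}), frozenset({"category_Clothing"})],
})

same_itemsets = as_support_map(fp_result) == as_support_map(ap_result)
print(same_itemsets)
```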