# 👩‍💻 BBC CUSTOMER ANALYSIS PROJECT
---

# INTRODUCTION
This project examines whether BBC customers of different BBC ages (the time between a customer's first and last service use) use different clusters of `serviceid`s, and which services we can cross-sell to them.

## Objectives
1. The first goal of this project is to use SQL to:
    - Find the first two `serviceid`s each customer used and the dates on which they were used.
    - Find the last `serviceid` each customer used and the date on which it was used.
    - Count the distinct `serviceid`s each customer used.
    - Present the findings in one table with the following columns:
 
    
| User_id  | FirstServiceid | SecondServiceid | FirstServiceDate | SecondServiceDate | LastServiceid | LastServiceDate | TotalService |
| -------- | -------------- | --------------- | ---------------- | ----------------- | ------------- | --------------- | ------------ |
| 407901   | 12  |  45 | 1-Jan-18 | 2-Jan-18 | 667 | 22-Jul-18 | 6 |
    
   
    
2. The second goal is to identify patterns using Python to answer the business questions below.

## Business questions
1. Do customers of different BBC ages use different `serviceid` clusters?
2. Which `serviceid`s can we cross-sell to customers? Give recommendations based on the findings.

## Database description
We have one table, `test`, with three columns: `User_id`, `Date`, and `Serviceid`.

- **User_id**: The identity of the individual customer.
- **Serviceid**: The identity of the service used in the transaction.
- **Date**: The date on which the transaction was performed.

_Note: This is open data._

# PROCESS
To answer the first goal in SQL, we combine three `CTE`s to find the first two `serviceid`s and the last `serviceid` each customer used, together with the corresponding `Date`s.

We then use `COUNT(DISTINCT ...)` to find the total number of distinct `serviceid`s each customer used, and join everything to produce the required table above.

The tables are joined implicitly: there is no `JOIN` keyword, but listing the tables in `FROM` and matching them in `WHERE` is equivalent to an `INNER JOIN` (it is not a `NATURAL JOIN`). As a consequence, customers with only one transaction have no second service and do not appear in the result.

```sql
-- create a ranked_cte to sort data with ROW_NUMBER
WITH ranked_cte AS( 
SELECT User_Id, date, serviceid,
ROW_NUMBER() OVER(PARTITION BY User_Id
ORDER BY Date) RN,
ROW_NUMBER() OVER(PARTITION BY User_Id
ORDER BY Date DESC) NR
FROM test
),
-- first_cte: each customer's first service (RN = 1)
first_cte AS(
SELECT r.User_Id, r.serviceid AS FirstServiceid, r.date AS FirstServiceDate
FROM ranked_cte AS r
WHERE r.RN = 1
),
-- second_cte: add each customer's second service (RN = 2)
second_cte AS(
SELECT f.*, r.serviceid AS SecondServiceid, r.date AS SecondServiceDate
FROM first_cte AS f,
     ranked_cte AS r
WHERE r.RN = 2
AND r.User_Id = f.User_Id
)
-- Join everything; DATE_FORMAT renders each date as day-month-year, e.g. 7-Jan-18
SELECT r.User_Id AS User_id, s.FirstServiceid,
    s.SecondServiceid, DATE_FORMAT(s.FirstServiceDate, '%e-%b-%y') AS FirstServiceDate,
    DATE_FORMAT(s.SecondServiceDate, '%e-%b-%y') AS SecondServiceDate,
    r.serviceid AS LastServiceid, DATE_FORMAT(r.date, '%e-%b-%y') AS LastServiceDate,
    COUNT(DISTINCT t.serviceid) AS TotalService -- total distinct serviceid each customer used
FROM second_cte AS s,
     ranked_cte AS r,
     test AS t
WHERE r.NR = 1 -- NR = 1 marks each customer's last service
AND s.User_Id = r.User_Id
AND t.User_Id = r.User_Id
GROUP BY r.User_Id;
```
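For readers who prefer to check the query logic outside the database, the same summary can be sketched in pandas. This is only an illustration under assumptions: the sample rows are invented, and the variable names (`ordered`, `summary`) are not part of the project.

```python
import pandas as pd

# Hypothetical rows in the shape of the `test` table (not the project's data).
test = pd.DataFrame({
    "User_id": [1, 1, 1, 1, 2, 2],
    "Date": ["01-Jan-18", "02-Jan-18", "05-Jan-18", "09-Jan-18", "03-Feb-18", "04-Feb-18"],
    "Serviceid": [12, 45, 12, 667, 20, 666],
})
test["Date"] = pd.to_datetime(test["Date"], format="%d-%b-%y")

# Order each user's transactions by date and number them, like ROW_NUMBER().
ordered = test.sort_values(["User_id", "Date"])
ordered["rn"] = ordered.groupby("User_id").cumcount() + 1

first = ordered[ordered["rn"] == 1].set_index("User_id")
second = ordered[ordered["rn"] == 2].set_index("User_id")
last = ordered.groupby("User_id").tail(1).set_index("User_id")

# Assemble one row per user; the Series align on the User_id index.
summary = pd.DataFrame({
    "FirstServiceid": first["Serviceid"],
    "SecondServiceid": second["Serviceid"],
    "FirstServiceDate": first["Date"],
    "SecondServiceDate": second["Date"],
    "LastServiceid": last["Serviceid"],
    "LastServiceDate": last["Date"],
    "TotalService": test.groupby("User_id")["Serviceid"].nunique(),
}).rename_axis("User_id").reset_index()

print(summary)
```

Unlike the SQL above, this sketch would keep a user with a single transaction and leave the second-service columns empty for them.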

Now that we have all the data in a single result table, let's download it as a `CSV` file.

The next step is to use `Python` to prepare and process the data.

## Importing Libraries


In [1]:
# Data processing
import numpy as np # data arrays
import pandas as pd # data structure and data analysis
import functools # reduce() for merging DataFrames
import types # MethodType, used by the data-loading cells
import datetime as dt # date time

!pip install mlxtend
from mlxtend.frequent_patterns import apriori # Data pattern exploration
from mlxtend.frequent_patterns import association_rules # Association rules conversion

Collecting mlxtend
  Downloading mlxtend-0.19.0-py2.py3-none-any.whl (1.3 MB)
Installing collected packages: mlxtend
Successfully installed mlxtend-0.19.0


## Data Exploration
The file "BBC_analysis.csv" was extracted by querying the dataset "User_data.csv" and contains the per-customer summary used for further analysis. A sample is shown below.

In [3]:

# The storage client and the `__iter__` helper are not defined in the cells shown here; presumably they come from a hidden cell
body = client_b7209d1e410e48e798bf539a04efcb4c.get_object(Bucket='pythonbasicsfordataproject-donotdelete-pr-2lygblpvjr9idu',Key='BBC_analysis.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )


In [4]:
# Read in data in CSV format
df1 = pd.read_csv(body)
df1.head() 

| | User_id | FirstServiceid | SecondServiceid | FirstServiceDate | SecondServiceDate | LastServiceid | LastServiceDate | TotalService |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2464231 | 18 | 667 | 7-Jan-18 | 16-Jan-18 | 982 | 23-Jul-18 | 10 |
| 1 | 3676235 | 273 | 481 | 5-Jan-18 | 6-Jan-18 | 355 | 12-Apr-18 | 9 |
| 2 | 4958104 | 946 | 310 | 22-Jan-18 | 22-Jan-18 | 268 | 30-Jun-18 | 7 |
| 3 | 11642209 | 20 | 666 | 1-Jan-18 | 2-Jan-18 | 666 | 29-Jun-18 | 14 |
| 4 | 18246539 | 984 | 269 | 22-Mar-18 | 22-Mar-18 | 666 | 1-Apr-18 | 4 |


In [11]:
df1.info() # print to check the dtype

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 43 entries, 0 to 42
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   User_id            43 non-null     int64 
 1   FirstServiceid     43 non-null     int64 
 2   SecondServiceid    43 non-null     int64 
 3   FirstServiceDate   43 non-null     object
 4   SecondServiceDate  43 non-null     object
 5   LastServiceid      43 non-null     int64 
 6   LastServiceDate    43 non-null     object
 7   TotalService       43 non-null     int64 
dtypes: int64(5), object(3)
memory usage: 2.8+ KB


**From the summary, we can see that**

- There are a total of 43 rows and 8 columns in the data set.
- Data types are assigned correctly, apart from the three service date columns, which are stored as `object`. We will convert them to a datatype suitable for analysis in the next section.
- There are no null values.

In [5]:
# @Hidden_Cell
body = client_b7209d1e410e48e798bf539a04efcb4c.get_object(Bucket='pythonbasicsfordataproject-donotdelete-pr-2lygblpvjr9idu',Key='Test_TTS Phan tich Du lieu.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

# @Hidden_cell

In [6]:
# Read in data in CSV format
df2 = pd.read_csv(body)
df2.head()

| | Serviceid | Date | User_Id |
| --- | --- | --- | --- |
| 0 | 2 | 23-May-17 | 38411702 |
| 1 | 2 | 12-Jun-17 | 38411702 |
| 2 | 2 | 24-Aug-17 | 38411702 |
| 3 | 2 | 25-Sep-17 | 38411702 |
| 4 | 667 | 01-Jan-18 | 20831230 |


In [12]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1464 entries, 0 to 1463
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Serviceid  1464 non-null   int64 
 1   Date       1464 non-null   object
 2   User_id    1464 non-null   int64 
dtypes: int64(2), object(1)
memory usage: 34.4+ KB


**From the summary, we can see that**

- There are a total of 1464 rows and 3 columns in the data set.
- Data types are assigned correctly, apart from `Date`, which is stored as `object`. We will convert it to a datatype suitable for analysis in the next section.
- There are no null values.

In [7]:
# Rename the column to match the table above
df2.rename(columns = {'User_Id':'User_id'}, inplace=True)
df2.head()

| | Serviceid | Date | User_id |
| --- | --- | --- | --- |
| 0 | 2 | 23-May-17 | 38411702 |
| 1 | 2 | 12-Jun-17 | 38411702 |
| 2 | 2 | 24-Aug-17 | 38411702 |
| 3 | 2 | 25-Sep-17 | 38411702 |
| 4 | 667 | 01-Jan-18 | 20831230 |


## 🚩 Analysis question 1

We would like to know whether customers of different BBC ages use different `serviceid` clusters.

First, we merge the two tables to get the full picture: the `serviceid` and date of every service transaction for each customer. This also lets us categorize customers into age groups and count service transactions for each group.

In [13]:
# Merge df1 and df2 into one DataFrame (UserTable) for further analysis
dfs = [df1, df2]
UserTable = functools.reduce(lambda left,right: pd.merge(left,right,on='User_id', how='outer'), dfs)
UserTable.dropna(inplace=True) # drop rows left incomplete by the outer merge
UserTable.head() # print to confirm 

| | User_id | FirstServiceid | SecondServiceid | FirstServiceDate | SecondServiceDate | LastServiceid | LastServiceDate | TotalService | Serviceid | Date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2464231 | 18.0 | 667.0 | 7-Jan-18 | 16-Jan-18 | 982.0 | 23-Jul-18 | 10.0 | 18 | 07-Jan-18 |
| 1 | 2464231 | 18.0 | 667.0 | 7-Jan-18 | 16-Jan-18 | 982.0 | 23-Jul-18 | 10.0 | 667 | 16-Jan-18 |
| 2 | 2464231 | 18.0 | 667.0 | 7-Jan-18 | 16-Jan-18 | 982.0 | 23-Jul-18 | 10.0 | 333 | 16-Jan-18 |
| 3 | 2464231 | 18.0 | 667.0 | 7-Jan-18 | 16-Jan-18 | 982.0 | 23-Jul-18 | 10.0 | 667 | 26-Jan-18 |
| 4 | 2464231 | 18.0 | 667.0 | 7-Jan-18 | 16-Jan-18 | 982.0 | 23-Jul-18 | 10.0 | 333 | 26-Jan-18 |
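The outer merge followed by `dropna` silently discards rows that exist in only one table (the row count goes from 1464 to 1463 in this notebook). A minimal sketch of how such rows could be inspected before dropping them, using the `indicator` and `validate` options of `DataFrame.merge`; the two small frames are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for df1 (one row per user) and df2 (one row per transaction).
summary = pd.DataFrame({"User_id": [1, 2], "TotalService": [3, 2]})
transactions = pd.DataFrame({"User_id": [1, 1, 2, 3], "Serviceid": [12, 45, 20, 99]})

# validate raises if User_id is not unique on the left side;
# indicator adds a `_merge` column saying where each row came from.
merged = summary.merge(transactions, on="User_id", how="outer",
                       validate="one_to_many", indicator=True)

# Rows that would be lost by dropna(): present in only one of the two tables.
unmatched = merged[merged["_merge"] != "both"]
print(unmatched[["User_id", "Serviceid", "_merge"]])
```

Here user 3 has transactions but no summary row, so it shows up as `right_only`.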


**Convert** the date columns to dtype `datetime64` so that dates are handled correctly in the next section, then print the info to double-check.

In [14]:
#convert all service dates to datetime64 dtype
UserTable['FirstServiceDate'] = pd.to_datetime(UserTable['FirstServiceDate'])
UserTable['SecondServiceDate'] = pd.to_datetime(UserTable['SecondServiceDate'])
UserTable['LastServiceDate'] = pd.to_datetime(UserTable['LastServiceDate'])
UserTable['Date'] = pd.to_datetime(UserTable['Date'])

# re-print information to confirm
UserTable.info()

# print 1st 5 rows to confirm
UserTable.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1463 entries, 0 to 1462
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   User_id            1463 non-null   int64         
 1   FirstServiceid     1463 non-null   float64       
 2   SecondServiceid    1463 non-null   float64       
 3   FirstServiceDate   1463 non-null   datetime64[ns]
 4   SecondServiceDate  1463 non-null   datetime64[ns]
 5   LastServiceid      1463 non-null   float64       
 6   LastServiceDate    1463 non-null   datetime64[ns]
 7   TotalService       1463 non-null   float64       
 8   Serviceid          1463 non-null   int64         
 9   Date               1463 non-null   datetime64[ns]
dtypes: datetime64[ns](4), float64(4), int64(2)
memory usage: 125.7 KB


| | User_id | FirstServiceid | SecondServiceid | FirstServiceDate | SecondServiceDate | LastServiceid | LastServiceDate | TotalService | Serviceid | Date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2464231 | 18.0 | 667.0 | 2018-01-07 | 2018-01-16 | 982.0 | 2018-07-23 | 10.0 | 18 | 2018-01-07 |
| 1 | 2464231 | 18.0 | 667.0 | 2018-01-07 | 2018-01-16 | 982.0 | 2018-07-23 | 10.0 | 667 | 2018-01-16 |
| 2 | 2464231 | 18.0 | 667.0 | 2018-01-07 | 2018-01-16 | 982.0 | 2018-07-23 | 10.0 | 333 | 2018-01-16 |
| 3 | 2464231 | 18.0 | 667.0 | 2018-01-07 | 2018-01-16 | 982.0 | 2018-07-23 | 10.0 | 667 | 2018-01-26 |
| 4 | 2464231 | 18.0 | 667.0 | 2018-01-07 | 2018-01-16 | 982.0 | 2018-07-23 | 10.0 | 333 | 2018-01-26 |
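The conversion above lets pandas infer the date format. As a hedged alternative, the format can be stated explicitly so that a string such as `7-Jan-18` cannot be misread; the format string `%d-%b-%y` is an assumption based on the samples shown in this notebook.

```python
import pandas as pd

# Sample strings copied from the FirstServiceDate / LastServiceDate columns above.
dates = pd.Series(["7-Jan-18", "16-Jan-18", "23-Jul-18"])

# %d = day of month, %b = abbreviated month name, %y = two-digit year.
parsed = pd.to_datetime(dates, format="%d-%b-%y")

print(parsed.dt.strftime("%Y-%m-%d").tolist())
# ['2018-01-07', '2018-01-16', '2018-07-23']
```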


**Calculate the User_age**

To find out whether customers of different BBC ages use different `serviceid` clusters, we first compute each customer's `User_age` (the number of days between the first and last service dates, counting both endpoints) and then assign it to the corresponding `Age_bin`. The results are as follows:

In [16]:
#Calculate the User_age
UserTable['User_age'] = 1 + (UserTable['LastServiceDate']-UserTable['FirstServiceDate']).dt.days

# Split users into 5 categories of 30_age, 100_age, 200_age, 300_age, 300_plus_age
bins = [-1, 30, 100, 200, 300, np.inf]
names = ['30_age', '100_age', '200_age', '300_age', '300_plus_age']
UserTable['Age_bin'] = pd.cut(UserTable['User_age'], bins, labels = names)

#Print the first 10 rows to confirm
UserTable.head(10)


| | User_id | FirstServiceid | SecondServiceid | FirstServiceDate | SecondServiceDate | LastServiceid | LastServiceDate | TotalService | Serviceid | Date | User_age | Age_bin |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2464231 | 18.0 | 667.0 | 2018-01-07 | 2018-01-16 | 982.0 | 2018-07-23 | 10.0 | 18 | 2018-01-07 | 198 | 200_age |
| 1 | 2464231 | 18.0 | 667.0 | 2018-01-07 | 2018-01-16 | 982.0 | 2018-07-23 | 10.0 | 667 | 2018-01-16 | 198 | 200_age |
| 2 | 2464231 | 18.0 | 667.0 | 2018-01-07 | 2018-01-16 | 982.0 | 2018-07-23 | 10.0 | 333 | 2018-01-16 | 198 | 200_age |
| 3 | 2464231 | 18.0 | 667.0 | 2018-01-07 | 2018-01-16 | 982.0 | 2018-07-23 | 10.0 | 667 | 2018-01-26 | 198 | 200_age |
| 4 | 2464231 | 18.0 | 667.0 | 2018-01-07 | 2018-01-16 | 982.0 | 2018-07-23 | 10.0 | 333 | 2018-01-26 | 198 | 200_age |
| 5 | 2464231 | 18.0 | 667.0 | 2018-01-07 | 2018-01-16 | 982.0 | 2018-07-23 | 10.0 | 1366 | 2018-02-06 | 198 | 200_age |
| 6 | 2464231 | 18.0 | 667.0 | 2018-01-07 | 2018-01-16 | 982.0 | 2018-07-23 | 10.0 | 667 | 2018-03-26 | 198 | 200_age |
| 7 | 2464231 | 18.0 | 667.0 | 2018-01-07 | 2018-01-16 | 982.0 | 2018-07-23 | 10.0 | 667 | 2018-04-26 | 198 | 200_age |
| 8 | 2464231 | 18.0 | 667.0 | 2018-01-07 | 2018-01-16 | 982.0 | 2018-07-23 | 10.0 | 667 | 2018-05-25 | 198 | 200_age |
| 9 | 2464231 | 18.0 | 667.0 | 2018-01-07 | 2018-01-16 | 982.0 | 2018-07-23 | 10.0 | 268 | 2018-06-22 | 198 | 200_age |
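To make the binning concrete, the sketch below recomputes the `User_age` of the customer shown above and applies the same `bins` and `names` to a few hand-picked ages. By default `pd.cut` uses intervals that are closed on the right, so an age of exactly 30 falls in `30_age` and 31 falls in `100_age`. The list of test ages is invented for illustration.

```python
import numpy as np
import pandas as pd

# User_age for user 2464231: first service 2018-01-07, last service 2018-07-23.
first, last = pd.Timestamp("2018-01-07"), pd.Timestamp("2018-07-23")
age = 1 + (last - first).days
print(age)  # 198

# Same bins and labels as in the cell above.
bins = [-1, 30, 100, 200, 300, np.inf]
names = ['30_age', '100_age', '200_age', '300_age', '300_plus_age']

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([1, 30, 31, age, 300, 301])
labels = pd.cut(ages, bins, labels=names)
print(labels.tolist())
# ['30_age', '30_age', '100_age', '200_age', '300_age', '300_plus_age']
```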


This step **groups** and **counts** the rows of `UserTable` by `Age_bin` and `Serviceid` to find which services each group used most, then keeps the five highest counts per `Age_bin`. Because `UserTable` has one row per transaction, each count is a number of service transactions, not a number of distinct customers.

In [19]:
UserAgg = UserTable.groupby(['Age_bin','Serviceid'])[['User_id']].count().dropna()
UserAgg

# Group by the first level of the index (Age_bin):
g = UserAgg['User_id'].groupby(level=0, group_keys=False)

# Within each group, take the five largest counts:
g.nlargest(5)


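| Age_bin | Serviceid | User_id (count) |
| --- | --- | --- |
| 30_age | 487 | 9 |
| 30_age | 1014 | 9 |
| 30_age | 299 | 4 |
| 30_age | 47 | 2 |
| 30_age | 65 | 2 |
| 100_age | 333 | 24 |
| 100_age | 667 | 24 |
| 100_age | 981 | 13 |
| 100_age | 1014 | 11 |
| 100_age | 487 | 9 |
| 200_age | 981 | 75 |
| 200_age | 20 | 70 |
| 200_age | 1014 | 58 |
| 200_age | 271 | 46 |
| 200_age | 326 | 42 |
| 300_age | 18 | 93 |
| 300_age | 667 | 65 |
| 300_age | 333 | 64 |
| 300_age | 981 | 60 |
| 300_age | 268 | 49 |
| 300_plus_age | 981 | 9 |
| 300_plus_age | 19 | 8 |
| 300_plus_age | 1014 | 5 |
| 300_plus_age | 2 | 4 |
| 300_plus_age | 271 | 4 |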
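The counts above are row counts. If the number of distinct customers per service is wanted instead, `nunique` can replace `count`. The sketch below shows both on a small invented table; the column names follow `UserTable`, but the rows and the group labels `a` and `b` are hypothetical.

```python
import pandas as pd

# Hypothetical transactions: user 1 uses service 10 twice in group "a".
tx = pd.DataFrame({
    "Age_bin":   ["a", "a", "a", "a", "b", "b", "b"],
    "Serviceid": [10, 10, 10, 20, 10, 30, 30],
    "User_id":   [1, 1, 2, 1, 3, 3, 4],
})

grouped = tx.groupby(["Age_bin", "Serviceid"])["User_id"]
rows_per_service = grouped.count()     # transactions per (Age_bin, Serviceid)
users_per_service = grouped.nunique()  # distinct customers per (Age_bin, Serviceid)

# Top service per age group by transaction count, as in the cell above.
top = rows_per_service.groupby(level=0, group_keys=False).nlargest(1)

print(rows_per_service)
print(users_per_service)
print(top)
```

In group `a`, service 10 has three transactions but only two distinct customers, which is the difference the two aggregations expose.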

### **Observation**

From the above analysis, we can see that:
- Customers in 30_age group tend to use Serviceid of 487, 1014, 299, 47, 65.
- Customers in 100_age group tend to use Serviceid of 333, 667, 981, 1014, 487.
- Customers in 200_age group tend to use Serviceid of 981, 20, 1014, 271, 326.
- Customers in 300_age group tend to use Serviceid of 18, 667, 333, 981, 268.
- Customers in 300_plus_age group tend to use Serviceid of 981, 19, 1014, 2, 271.


## 🚩 Analysis question 2:
We would like to know which `serviceid`s we can cross-sell to users.

First, what is cross-selling? 
Cross-selling is the practice of selling related or additional products to existing customers. To know which `serviceid`s we can cross-sell to a customer, we need to find which items (`serviceid`s) customers usually buy together, then offer these items as a combo.

The analysis below is based on [Association Rules](https://towardsdatascience.com/association-rules-2-aa9a77241654) 
 and the [Apriori principle](https://towardsdatascience.com/complete-guide-to-association-rules-2-2-c92072b56c84).
 
The rules do not extract an individual’s preference; rather, they find **relationships between the sets of elements of every distinct transaction**. 

In [21]:
# Present the UserTable again to find pattern for cross-sell analysis
UserTable.head()


| | User_id | FirstServiceid | SecondServiceid | FirstServiceDate | SecondServiceDate | LastServiceid | LastServiceDate | TotalService | Serviceid | Date | User_age | Age_bin |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2464231 | 18.0 | 667.0 | 2018-01-07 | 2018-01-16 | 982.0 | 2018-07-23 | 10.0 | 18 | 2018-01-07 | 198 | 200_age |
| 1 | 2464231 | 18.0 | 667.0 | 2018-01-07 | 2018-01-16 | 982.0 | 2018-07-23 | 10.0 | 667 | 2018-01-16 | 198 | 200_age |
| 2 | 2464231 | 18.0 | 667.0 | 2018-01-07 | 2018-01-16 | 982.0 | 2018-07-23 | 10.0 | 333 | 2018-01-16 | 198 | 200_age |
| 3 | 2464231 | 18.0 | 667.0 | 2018-01-07 | 2018-01-16 | 982.0 | 2018-07-23 | 10.0 | 667 | 2018-01-26 | 198 | 200_age |
| 4 | 2464231 | 18.0 | 667.0 | 2018-01-07 | 2018-01-16 | 982.0 | 2018-07-23 | 10.0 | 333 | 2018-01-26 | 198 | 200_age |


**Create** the DataFrame `Cross_sell` for further analysis and **build** a basket key that combines each `User_id` with the `Date`, so that the services a customer used on the same day form one transaction.

In [24]:
#create the df cross_sell for further analysis  
Cross_sell = pd.DataFrame(UserTable, columns =['User_id', 'Date', 'Serviceid']) 

#Build the basket key: one transaction per User_id and Date
Cross_sell['User_id & Date'] = Cross_sell['User_id'].astype('str') + ' ' + Cross_sell['Date'].astype('str')
Cross_sell.drop(['User_id', 'Date'], axis=1, inplace=True) 
Cross_sell.head(10) #print the first ten rows to check 

| | Serviceid | User_id & Date |
| --- | --- | --- |
| 0 | 18 | 2464231 2018-01-07 |
| 1 | 667 | 2464231 2018-01-16 |
| 2 | 333 | 2464231 2018-01-16 |
| 3 | 667 | 2464231 2018-01-26 |
| 4 | 333 | 2464231 2018-01-26 |
| 5 | 1366 | 2464231 2018-02-06 |
| 6 | 667 | 2464231 2018-03-26 |
| 7 | 667 | 2464231 2018-04-26 |
| 8 | 667 | 2464231 2018-05-25 |
| 9 | 268 | 2464231 2018-06-22 |


Use the `drop_duplicates` function to eliminate duplicate rows. For example, if a customer uses one particular `serviceid` twice in a day, we only need to know the unique `serviceid`s per day for that customer.

As another example of cross-selling, if customers usually buy a razor along with blades, we can cross-sell the two items as a combo.

In [25]:
Cross_sell.drop_duplicates(inplace = True)
Cross_sell.set_index('User_id & Date', inplace=True)
Cross_sell.head(10) #print the first ten rows to check 

| User_id & Date | Serviceid |
| --- | --- |
| 2464231 2018-01-07 | 18 |
| 2464231 2018-01-16 | 667 |
| 2464231 2018-01-16 | 333 |
| 2464231 2018-01-26 | 667 |
| 2464231 2018-01-26 | 333 |
| 2464231 2018-02-06 | 1366 |
| 2464231 2018-03-26 | 667 |
| 2464231 2018-04-26 | 667 |
| 2464231 2018-05-25 | 667 |
| 2464231 2018-06-22 | 268 |


**Convert** the categorical `serviceid` to indicator (0/1) columns with `get_dummies` to see which `serviceid`s each customer bought on each day.

In [27]:
Cross_sell['Serviceid'] = Cross_sell['Serviceid'].astype('str')

basket = pd.get_dummies(Cross_sell)
basket.head()

| User_id & Date | Serviceid_10 | Serviceid_1014 | Serviceid_1046 | Serviceid_1086 | Serviceid_11 | Serviceid_1136 | Serviceid_1166 | Serviceid_1186 | Serviceid_12 | Serviceid_125 | ... | Serviceid_82 | Serviceid_904 | Serviceid_905 | Serviceid_91 | Serviceid_946 | Serviceid_981 | Serviceid_982 | Serviceid_983 | Serviceid_984 | Serviceid_985 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2464231 2018-01-07 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2464231 2018-01-16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2464231 2018-01-16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2464231 2018-01-26 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2464231 2018-01-26 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


**Convert** the service baskets into a pivot table, summing the indicator rows so that each `User_id & Date` basket becomes a single row.

In [28]:
basket_sets = pd.pivot_table(basket, index='User_id & Date', aggfunc='sum')
basket_sets

| User_id & Date | Serviceid_10 | Serviceid_1014 | Serviceid_1046 | Serviceid_1086 | Serviceid_11 | Serviceid_1136 | Serviceid_1166 | Serviceid_1186 | Serviceid_12 | Serviceid_125 | ... | Serviceid_82 | Serviceid_904 | Serviceid_905 | Serviceid_91 | Serviceid_946 | Serviceid_981 | Serviceid_982 | Serviceid_983 | Serviceid_984 | Serviceid_985 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 11642209 2018-01-01 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 11642209 2018-01-02 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 11642209 2018-01-03 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 11642209 2018-01-09 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 11642209 2018-01-11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4958104 2018-03-04 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4958104 2018-03-08 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4958104 2018-03-11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4958104 2018-06-14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
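Because the printed matrix is mostly zeros, it is hard to see what the encoding does. The sketch below builds the same kind of basket matrix on a tiny invented example. Note that it swaps in `pd.crosstab` for the `get_dummies` plus `pivot_table` steps used above; this is an alternative route to the same one-row-per-basket layout, not the notebook's method, and the `basket` keys are hypothetical.

```python
import pandas as pd

# Hypothetical transactions: user u1 uses service 667 twice on the same day.
tx = pd.DataFrame({
    "basket": ["u1 2018-01-16", "u1 2018-01-16", "u1 2018-01-16", "u2 2018-01-02"],
    "Serviceid": [667, 333, 667, 20],
})

# Keep each (basket, service) pair once, as drop_duplicates does above.
tx = tx.drop_duplicates()

# One row per basket, one column per service; True where the service was used.
basket_sets = pd.crosstab(tx["basket"], tx["Serviceid"]).astype(bool)
print(basket_sets)
```

The resulting frame has two rows (one per basket) and three columns (services 20, 333 and 667), with `True` marking the services present in each basket.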


**Create** the `frequent_itemsets` with `apriori` and add a new column that stores the length of each itemset. Frequent itemsets are those that occur at least a minimum number of times in the transactions. Technically, these are the itemsets whose support value (the fraction of transactions containing the itemset) is at or above a minimum threshold (minsup).

As an example, the number of transactions containing the items {Bread, Egg} is greater than or equal to the number of transactions containing {Bread, Egg, Vegetables}. If the latter occurs in 30 transactions, the former occurs in all 30 of them and possibly in some more. So if the support value of {Bread, Egg, Vegetables}, i.e. 30/100 = 0.3, is above minsup, then we can be sure that the support value of {Bread, Egg}, i.e. at least 30/100 = 0.3, is above minsup too.

For details on support, see the [mlxtend `apriori` documentation](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/).
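The Bread and Egg argument can be checked with a few lines of plain Python. This is a sketch on five invented transactions; the helper name `support` is chosen here for illustration and is not an mlxtend function.

```python
# Hypothetical shopping baskets, each a set of items.
transactions = [
    {"Bread", "Egg", "Vegetables"},
    {"Bread", "Egg"},
    {"Bread", "Milk"},
    {"Egg", "Vegetables", "Bread"},
    {"Milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)

pair = support({"Bread", "Egg"}, transactions)                  # 3 of 5 baskets
triple = support({"Bread", "Egg", "Vegetables"}, transactions)  # 2 of 5 baskets

print(pair, triple)  # 0.6 0.4
# Adding an item can never increase support, which is what lets Apriori prune.
assert pair >= triple
```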

In [29]:
frequent_itemsets = apriori(basket_sets, min_support = 0.03, use_colnames = True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets

| | support | itemsets | length |
| --- | --- | --- | --- |
| 0 | 0.216071 | (Serviceid_1014) | 1 |
| 1 | 0.094643 | (Serviceid_13) | 1 |
| 2 | 0.2125 | (Serviceid_18) | 1 |
| 3 | 0.132143 | (Serviceid_20) | 1 |
| 4 | 0.123214 | (Serviceid_268) | 1 |
| 5 | 0.125 | (Serviceid_271) | 1 |
| 6 | 0.030357 | (Serviceid_273) | 1 |
| 7 | 0.032143 | (Serviceid_299) | 1 |
| 8 | 0.101786 | (Serviceid_326) | 1 |
| 9 | 0.198214 | (Serviceid_333) | 1 |


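Then, we can filter the results to those that satisfy our desired criteria. In this case we require `length` > 1 (meaning that there is more than one item in the itemset) and `support` >= 0.02. Since `apriori` was already run with `min_support = 0.03`, the support condition does not exclude anything further.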

In [30]:
frequent_itemsets[ (frequent_itemsets['length'] > 1) &
                   (frequent_itemsets['support'] >= 0.02)]

| | support | itemsets | length |
| --- | --- | --- | --- |
| 16 | 0.060714 | (Serviceid_18, Serviceid_1014) | 2 |
| 17 | 0.171429 | (Serviceid_981, Serviceid_1014) | 2 |
| 18 | 0.073214 | (Serviceid_333, Serviceid_13) | 2 |
| 19 | 0.073214 | (Serviceid_667, Serviceid_13) | 2 |
| 20 | 0.075 | (Serviceid_268, Serviceid_18) | 2 |
| 21 | 0.033929 | (Serviceid_18, Serviceid_271) | 2 |
| 22 | 0.096429 | (Serviceid_18, Serviceid_981) | 2 |
| 23 | 0.076786 | (Serviceid_982, Serviceid_18) | 2 |
| 24 | 0.057143 | (Serviceid_326, Serviceid_20) | 2 |
| 25 | 0.057143 | (Serviceid_666, Serviceid_20) | 2 |


### Generating the association rule: 
- Association Rule: e.g. {X → Y} represents finding Y in a basket that contains X.
- Itemset: e.g. {X, Y} is the list of all items that form the association rule.
- Support: The fraction of transactions containing the itemset.
- Confidence: Measures how often items in Y appear in transactions that contain X.
- Lift: The ratio of confidence to the baseline probability of occurrence of {Y}. A lift above 1 means that X and Y occur together more often than would be expected if they were independent.
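The three measures defined above reduce to simple ratios of basket counts. In the sketch below, the counts are assumptions picked so that the ratios match the rounded figures reported for the rule Serviceid_326 → Serviceid_666 in the rules table of this notebook; the notebook itself does not report raw counts.

```python
# Assumed counts (not reported in the notebook), with X = Serviceid_326, Y = Serviceid_666.
n_total = 560  # assumed number of baskets
n_x = 57       # assumed baskets containing X
n_y = 59       # assumed baskets containing Y
n_xy = 57      # assumed baskets containing both X and Y

support_xy = n_xy / n_total                       # fraction of baskets with X and Y
confidence_x_to_y = n_xy / n_x                    # how often Y appears when X does
lift_x_to_y = confidence_x_to_y / (n_y / n_total) # confidence relative to Y's baseline

print(round(support_xy, 6), round(confidence_x_to_y, 6), round(lift_x_to_y, 6))
# 0.101786 1.0 9.491525
```

A lift of about 9.5 says that Serviceid_666 is roughly 9.5 times as likely in baskets that contain Serviceid_326 as in baskets overall.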

**Selecting** the important parameters for analysis

In [32]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].sort_values('support', ascending=False).head(10)

| | antecedents | consequents | support | confidence | lift |
| --- | --- | --- | --- | --- | --- |
| 26 | (Serviceid_333) | (Serviceid_667) | 0.196429 | 0.990991 | 4.624625 |
| 27 | (Serviceid_667) | (Serviceid_333) | 0.196429 | 0.916667 | 4.624625 |
| 2 | (Serviceid_981) | (Serviceid_1014) | 0.171429 | 0.619355 | 2.866436 |
| 3 | (Serviceid_1014) | (Serviceid_981) | 0.171429 | 0.793388 | 2.866436 |
| 20 | (Serviceid_268) | (Serviceid_982) | 0.114286 | 0.927536 | 7.991081 |
| 21 | (Serviceid_982) | (Serviceid_268) | 0.114286 | 0.984615 | 7.991081 |
| 24 | (Serviceid_666) | (Serviceid_326) | 0.101786 | 0.966102 | 9.491525 |
| 23 | (Serviceid_981) | (Serviceid_271) | 0.101786 | 0.367742 | 2.941935 |
| 22 | (Serviceid_271) | (Serviceid_981) | 0.101786 | 0.814286 | 2.941935 |
| 25 | (Serviceid_326) | (Serviceid_666) | 0.101786 | 1.0 | 9.491525 |


# 📌 Observations:
- If the customer buys serviceid_667, we can cross-sell serviceid_333 (support ~20%, confidence ~92%, lift ~4.6).
- If the customer buys serviceid_333, we can cross-sell serviceid_667 (support ~20%, confidence ~99%, lift ~4.6).
- If the customer buys serviceid_981, we can cross-sell serviceid_1014 (support ~17%, confidence ~62%, lift ~2.9).
- If the customer buys serviceid_1014, we can cross-sell serviceid_981 (support ~17%, confidence ~79%, lift ~2.9).
- If the customer buys serviceid_982, we can cross-sell serviceid_268 (support ~11%, confidence ~98%, lift ~8.0).

...

- If the customer buys serviceid_326, we can cross-sell serviceid_666 (support ~10%, confidence of 100%, lift ~9.5).
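One possible way to turn these observations into a shortlist is to keep only high-confidence rules and rank them by lift. The sketch below re-enters the ten rules from the table above by hand; the confidence cut-off of 0.9 is an arbitrary assumption for illustration, not a threshold used in the analysis.

```python
import pandas as pd

# Rules copied from the table above (antecedent -> consequent).
rules = pd.DataFrame({
    "antecedent": [333, 667, 981, 1014, 268, 982, 666, 981, 271, 326],
    "consequent": [667, 333, 1014, 981, 982, 268, 326, 271, 981, 666],
    "support": [0.196429, 0.196429, 0.171429, 0.171429, 0.114286,
                0.114286, 0.101786, 0.101786, 0.101786, 0.101786],
    "confidence": [0.990991, 0.916667, 0.619355, 0.793388, 0.927536,
                   0.984615, 0.966102, 0.367742, 0.814286, 1.0],
    "lift": [4.624625, 4.624625, 2.866436, 2.866436, 7.991081,
             7.991081, 9.491525, 2.941935, 2.941935, 9.491525],
})

MIN_CONFIDENCE = 0.9  # hypothetical cut-off

# Keep confident rules, strongest association first.
strong = (rules[rules["confidence"] >= MIN_CONFIDENCE]
          .sort_values(["lift", "confidence"], ascending=False)
          .reset_index(drop=True))
print(strong)
```

With this cut-off, six rules remain: the pairs 326/666, 268/982 and 333/667 in both directions, while the rules involving serviceid_981 drop out.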


## REFERENCES
- [Pandas Groupby](https://stackoverflow.com/questions/27842613/pandas-groupby-sort-within-groups)
- [Association Rules](https://towardsdatascience.com/association-rules-2-aa9a77241654)
- [Apriori Algorithm](https://towardsdatascience.com/complete-guide-to-association-rules-2-2-c92072b56c84)
- [Frequent Itemsets via Apriori Algorithm](http://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/)
- [Cross-sell](https://www.youtube.com/watch?v=VMavY0pBo2o&ab_channel=AngossSoftware)


Thank you for reading! Give a 🌟 if you like this analysis.