# TP5 - Sequential Pattern Mining


In this practical work, you are given a dataset of customers' shopping baskets. Each basket contains different products, and each customer may have one or more baskets. The first objective is to prepare a dataset that represents, for each customer, all the baskets they purchased as a sequence of events. The second is to compute the frequent patterns in these sequences.

The dataset format:

- An event is a list of strings.
- A sequence is a list of events.
- A dataset is a list of sequences.

Thus, a dataset is a list of lists of lists of strings.

E.g.

dataset =  [
  
  [["a"], ["a", "b", "c"], ["a", "c"], ["c"]],
  
  [["a"], ["c"], ["b", "c"]],
  
  [["a", "b"], ["d"], ["c"], ["b"], ["c"]],
  
  [["a"], ["c"], ["b"], ["c"]] ]
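
The nesting described above can be checked mechanically. The following is a minimal sketch, not part of the original hand-out: the type aliases and the helper name `is_valid_dataset` are illustrative assumptions.

```python
from typing import List

# Illustrative aliases mirroring the three levels of nesting.
Event = List[str]
Sequence = List[Event]
Dataset = List[Sequence]

def is_valid_dataset(obj) -> bool:
    """Return True if obj is a list of lists of lists of strings."""
    return isinstance(obj, list) and all(
        isinstance(seq, list) and all(
            isinstance(event, list)
            and all(isinstance(item, str) for item in event)
            for event in seq
        )
        for seq in obj
    )

dataset: Dataset = [
    [["a"], ["a", "b", "c"], ["a", "c"], ["c"]],
    [["a"], ["c"], ["b", "c"]],
    [["a", "b"], ["d"], ["c"], ["b"], ["c"]],
    [["a"], ["c"], ["b"], ["c"]],
]

print(is_valid_dataset(dataset))
```

A structure with one level of nesting missing, such as `[["a"]]`, is rejected because its events are bare strings rather than lists.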

**Step1** Load the dataset `df.csv`. Pass `index_col=0` to use the first column as the index.

In [1]:
import pandas as pd

In [14]:
df = pd.read_csv('dataset/df.csv', index_col=0)
df.head()

Unnamed: 0,BasketID,BasketDate,Sale,CustomerID,CustomerCountry,ProdID,ProdDescr,Qta,Sale_per_Qta
0,536365,2010-01-12 08:26:00,2.55,17850.0,United Kingdom,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,15.3
1,536365,2010-01-12 08:26:00,3.39,17850.0,United Kingdom,71053,WHITE METAL LANTERN,6,20.34
2,536365,2010-01-12 08:26:00,2.75,17850.0,United Kingdom,84406B,CREAM CUPID HEARTS COAT HANGER,8,22.0
3,536365,2010-01-12 08:26:00,3.39,17850.0,United Kingdom,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,20.34
4,536365,2010-01-12 08:26:00,3.39,17850.0,United Kingdom,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,20.34


## Modelling sequences:

Slightly modify the shape of the dataframe so that it can be used as input to the apriori function.

**Step2** First, model each customer as a sequence of baskets. Group the rows by customer (`CustomerID`) and collect the unique baskets (`BasketID`) associated with each one, applying `list` to every group.

In [70]:
basket = df.groupby(['CustomerID', 'BasketID']).apply(list)
basket

CustomerID  BasketID
12347.0     537626      [BasketID, BasketDate, Sale, CustomerID, Custo...
            542237      [BasketID, BasketDate, Sale, CustomerID, Custo...
            549222      [BasketID, BasketDate, Sale, CustomerID, Custo...
            556201      [BasketID, BasketDate, Sale, CustomerID, Custo...
            562032      [BasketID, BasketDate, Sale, CustomerID, Custo...
                                              ...                        
18288.0     553148      [BasketID, BasketDate, Sale, CustomerID, Custo...
            557675      [BasketID, BasketDate, Sale, CustomerID, Custo...
            564087      [BasketID, BasketDate, Sale, CustomerID, Custo...
            571652      [BasketID, BasketDate, Sale, CustomerID, Custo...
            573154      [BasketID, BasketDate, Sale, CustomerID, Custo...
Length: 18867, dtype: object
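
The output above lists column labels rather than basket contents, because calling `list` on a DataFrame iterates over its column names. As a hedged sketch of one way to obtain the unique baskets per customer, the snippet below uses a small made-up frame; only the column names are taken from the dataset, and the variable names `toy` and `baskets` are assumptions.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the real dataset.
toy = pd.DataFrame({
    "CustomerID": [1.0, 1.0, 1.0, 2.0, 3.0, 3.0],
    "BasketID":   [10, 10, 11, 20, 30, 31],
    "ProdID":     ["A", "B", "A", "C", "B", "C"],
})

# list(DataFrame) yields the column labels, not the rows.
column_labels = list(toy)
print(column_labels)

# Select the BasketID column first, then collect its distinct values
# per customer and turn each array into a plain list.
baskets = toy.groupby("CustomerID")["BasketID"].unique().apply(list)
print(baskets)
```

Selecting the column before aggregating means `list` is applied to values rather than to a whole sub-frame.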

In [114]:
basket.apply(len)>1

CustomerID  BasketID
12347.0     537626      True
            542237      True
            549222      True
            556201      True
            562032      True
                        ... 
18288.0     553148      True
            557675      True
            564087      True
            571652      True
            573154      True
Length: 18867, dtype: bool

**Step3** Next, drop the customers who had only one shopping session, i.e. those for which `.apply(len) == 1`.

In [110]:
import numpy as np
basket.apply(np.unique)

CustomerID  BasketID
12347.0     537626      [BasketDate, BasketID, CustomerCountry, Custom...
            542237      [BasketDate, BasketID, CustomerCountry, Custom...
            549222      [BasketDate, BasketID, CustomerCountry, Custom...
            556201      [BasketDate, BasketID, CustomerCountry, Custom...
            562032      [BasketDate, BasketID, CustomerCountry, Custom...
                                              ...                        
18288.0     553148      [BasketDate, BasketID, CustomerCountry, Custom...
            557675      [BasketDate, BasketID, CustomerCountry, Custom...
            564087      [BasketDate, BasketID, CustomerCountry, Custom...
            571652      [BasketDate, BasketID, CustomerCountry, Custom...
            573154      [BasketDate, BasketID, CustomerCountry, Custom...
Length: 18867, dtype: object
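
One way to express this filter is sketched below on a made-up Series that maps each customer to a list of basket IDs. The values and the names `baskets` and `multi` are illustrative assumptions, not results from the real data.

```python
import pandas as pd

# Hypothetical per-customer basket lists (customer 2.0 has a single basket).
baskets = pd.Series(
    {1.0: [10, 11], 2.0: [20], 3.0: [30, 31]},
    name="BasketID",
).rename_axis("CustomerID")

# Boolean mask: True for customers with more than one basket.
has_several = baskets.apply(len) > 1

# Keep only those customers.
multi = baskets[has_several]
print(multi)
```

The mask is computed on the per-customer lists, so `len` counts baskets rather than columns.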

**Step4** Now compute a dataframe in which each row holds a basket ID and the products bought during that transaction. Consider only the `CustomerID` values that are in `baskets`. Apply the `unique()` and `apply(list)` functions.

In [109]:
basket.apply(np.unique).apply(list)

CustomerID  BasketID
12347.0     537626      [BasketDate, BasketID, CustomerCountry, Custom...
            542237      [BasketDate, BasketID, CustomerCountry, Custom...
            549222      [BasketDate, BasketID, CustomerCountry, Custom...
            556201      [BasketDate, BasketID, CustomerCountry, Custom...
            562032      [BasketDate, BasketID, CustomerCountry, Custom...
                                              ...                        
18288.0     553148      [BasketDate, BasketID, CustomerCountry, Custom...
            557675      [BasketDate, BasketID, CustomerCountry, Custom...
            564087      [BasketDate, BasketID, CustomerCountry, Custom...
            571652      [BasketDate, BasketID, CustomerCountry, Custom...
            573154      [BasketDate, BasketID, CustomerCountry, Custom...
Length: 18867, dtype: object
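
A possible shape for this step is sketched below on made-up rows. It assumes `ProdID` is the column that identifies a product, and the list `kept_customers` is a hypothetical stand-in for the customers retained in the previous step.

```python
import pandas as pd

# Hypothetical toy data; basket 10 contains product "A" twice.
toy = pd.DataFrame({
    "CustomerID": [1.0, 1.0, 1.0, 1.0, 2.0, 3.0, 3.0],
    "BasketID":   [10, 10, 10, 11, 20, 30, 31],
    "ProdID":     ["A", "B", "A", "A", "C", "B", "C"],
})
kept_customers = [1.0, 3.0]  # assumed result of the previous filter

# Restrict to the kept customers, then list the distinct products per basket.
products = (
    toy[toy["CustomerID"].isin(kept_customers)]
    .groupby("BasketID")["ProdID"]
    .unique()
    .apply(list)
    .reset_index()
)
print(products)
```

`unique()` removes the repeated product inside a basket, and `reset_index()` turns the result into a two-column dataframe (`BasketID`, `ProdID`).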

**Step5** Now combine the two dataframes to obtain, for each customer, the list of products bought during each of their sessions. Keep two columns, `['CustomerID', 'basket_list']`, and set `CustomerID` as the index.

In [111]:
basket.reset_index().set_index('CustomerID')

CustomerID,BasketID,0
12347.0,537626,"[BasketID, BasketDate, Sale, CustomerID, Custo..."
12347.0,542237,"[BasketID, BasketDate, Sale, CustomerID, Custo..."
12347.0,549222,"[BasketID, BasketDate, Sale, CustomerID, Custo..."
12347.0,556201,"[BasketID, BasketDate, Sale, CustomerID, Custo..."
12347.0,562032,"[BasketID, BasketDate, Sale, CustomerID, Custo..."
...,...,...
18288.0,553148,"[BasketID, BasketDate, Sale, CustomerID, Custo..."
18288.0,557675,"[BasketID, BasketDate, Sale, CustomerID, Custo..."
18288.0,564087,"[BasketID, BasketDate, Sale, CustomerID, Custo..."
18288.0,571652,"[BasketID, BasketDate, Sale, CustomerID, Custo..."


In [112]:
df2 = basket
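
As a hedged sketch of what the combined result could look like, the snippet below builds one row per customer whose `basket_list` is a list of baskets, each basket being a list of products. It runs on made-up rows and assumes that `ProdID` identifies a product and that baskets should be ordered by `BasketDate`; the explicit loop is one possible approach, not necessarily the intended one.

```python
import pandas as pd

# Hypothetical toy data; basket 11 is earlier than basket 10 on purpose.
toy = pd.DataFrame({
    "CustomerID": [1.0, 1.0, 1.0, 2.0, 3.0, 3.0],
    "BasketID":   [10, 10, 11, 20, 30, 31],
    "BasketDate": pd.to_datetime([
        "2010-01-02", "2010-01-02", "2010-01-01",
        "2010-01-03", "2010-01-04", "2010-01-05",
    ]),
    "ProdID":     ["A", "B", "A", "C", "B", "C"],
})

# Keep customers with more than one distinct basket.
n_baskets = toy.groupby("CustomerID")["BasketID"].nunique()
kept = toy[toy["CustomerID"].isin(n_baskets[n_baskets > 1].index)]

# Visit baskets in chronological order and append each one to its customer.
ordered = kept.sort_values("BasketDate", kind="mergesort")  # stable sort
sequences = {}
for (customer, _basket), rows in ordered.groupby(
    ["CustomerID", "BasketID"], sort=False
):
    sequences.setdefault(customer, []).append(list(rows["ProdID"].unique()))

df2 = pd.DataFrame({
    "CustomerID": list(sequences),
    "basket_list": list(sequences.values()),
}).set_index("CustomerID")
print(df2)
```

With `sort=False`, iterating over the groups follows their first appearance in the date-sorted frame, which is what keeps each customer's baskets in purchase order.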

**Step6** Print out the resulting dataframe (df2).

In [113]:
df2

CustomerID  BasketID
12347.0     537626      [BasketID, BasketDate, Sale, CustomerID, Custo...
            542237      [BasketID, BasketDate, Sale, CustomerID, Custo...
            549222      [BasketID, BasketDate, Sale, CustomerID, Custo...
            556201      [BasketID, BasketDate, Sale, CustomerID, Custo...
            562032      [BasketID, BasketDate, Sale, CustomerID, Custo...
                                              ...                        
18288.0     553148      [BasketID, BasketDate, Sale, CustomerID, Custo...
            557675      [BasketID, BasketDate, Sale, CustomerID, Custo...
            564087      [BasketID, BasketDate, Sale, CustomerID, Custo...
            571652      [BasketID, BasketDate, Sale, CustomerID, Custo...
            573154      [BasketID, BasketDate, Sale, CustomerID, Custo...
Length: 18867, dtype: object

**Step7** The dataframe now has the shape required as input to the apriori function. Define a list named `dataset` and append to it each row of the dataframe (the sequences).
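
A minimal sketch of this conversion, assuming `df2` has a `basket_list` column holding one sequence per customer (the two rows below are made up):

```python
import pandas as pd

# Hypothetical df2 with the assumed layout: index CustomerID, column basket_list.
df2 = pd.DataFrame(
    {"basket_list": [[["A"], ["A", "B"]], [["B"], ["C"]]]},
    index=pd.Index([1.0, 3.0], name="CustomerID"),
)

# Append each customer's sequence to a plain Python list.
dataset = []
for sequence in df2["basket_list"]:
    dataset.append(sequence)

print(dataset)
```

The loop is equivalent to `df2["basket_list"].tolist()`; it is written out to match the wording of the step.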

**Step8** Count the total number of sequences and events contained within the dataset (a sequence is composed of multiple events; an event is a list of strings).
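
The counts can be read directly off the nested lists. The sketch below uses the four-sequence example from the beginning of this document rather than the real data; the variable names are illustrative.

```python
dataset = [
    [["a"], ["a", "b", "c"], ["a", "c"], ["c"]],
    [["a"], ["c"], ["b", "c"]],
    [["a", "b"], ["d"], ["c"], ["b"], ["c"]],
    [["a"], ["c"], ["b"], ["c"]],
]

# Each top-level element is one sequence.
n_sequences = len(dataset)

# Each element of a sequence is one event.
n_events = sum(len(sequence) for sequence in dataset)

print(f"{n_sequences} sequences, {n_events} events")
```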

## Frequent Patterns Computation:

**Step9** Compute the frequent patterns with minimum support equal to 5%, 10% and 15% of the dataset. An `apriori` algorithm is needed for the computation: `apriori` computes the frequent sequences in a sequence dataset for a given minimum support (Generalized Sequential Pattern Mining approach).

Args:
   - dataset: A list of sequences, for which the frequent (sub-)sequences are computed
   - minSupport: The minimum support that makes a sequence frequent
   
Returns: list of tuples (s, c), where s is a frequent sequence, and c is the count for that sequence
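
To make the interface above concrete, here is a simplified level-wise search with the same arguments and return shape. It is a sketch under stated assumptions, not the `apriori` implementation supplied with the course: it grows each frequent pattern by one item at a time instead of using the full GSP join-and-prune step, it treats `minSupport` as an absolute number of sequences, and the helpers `contains` and `support` are hypothetical names.

```python
from collections import Counter

def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence` as an ordered subsequence,
    each pattern event being a subset of a distinct sequence event."""
    pos = 0
    for event in pattern:
        needed = set(event)
        # Greedy scan: take the earliest remaining event holding all items.
        while pos < len(sequence) and not needed.issubset(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def support(dataset, pattern):
    """Number of sequences in `dataset` that contain `pattern`."""
    return sum(contains(seq, pattern) for seq in dataset)

def apriori(dataset, minSupport):
    """Level-wise search; minSupport is assumed to be an absolute count."""
    if minSupport < 1:
        raise ValueError("minSupport must be at least 1")
    # Count each item at most once per sequence.
    item_counts = Counter()
    for seq in dataset:
        item_counts.update({item for event in seq for item in event})
    items = sorted(i for i, c in item_counts.items() if c >= minSupport)

    level = [(((i,),), item_counts[i]) for i in items]
    frequent = []
    while level:
        frequent.extend(level)
        next_level = []
        for pattern, _ in level:
            last = pattern[-1]
            for i in items:
                # Extension 1: start a new event holding item i.
                candidates = [pattern + ((i,),)]
                # Extension 2: add i to the last event (sorted, no duplicates).
                if i > last[-1]:
                    candidates.append(pattern[:-1] + (last + (i,),))
                for cand in candidates:
                    count = support(dataset, cand)
                    if count >= minSupport:
                        next_level.append((cand, count))
        level = next_level
    return [([list(event) for event in p], c) for p, c in frequent]

dataset = [
    [["a"], ["a", "b", "c"], ["a", "c"], ["c"]],
    [["a"], ["c"], ["b", "c"]],
    [["a", "b"], ["d"], ["c"], ["b"], ["c"]],
    [["a"], ["c"], ["b"], ["c"]],
]
patterns = apriori(dataset, 4)
for sequence, count in patterns:
    print(sequence, count)
```

Only extensions of frequent patterns are tested, which relies on the fact that a pattern cannot be more frequent than any of its prefixes.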

**Step10** Print how many patterns have been identified for each percentage of support.
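
A hedged sketch of the reporting loop follows. It assumes the mining function expects an absolute count, so each percentage is converted by rounding up; that convention, the helper `count_patterns` and the stand-in `single_item_miner` (which only finds single-item patterns) are all assumptions made so that the example runs on its own. In practice the real `apriori` would be passed in instead.

```python
import math

def count_patterns(dataset, miner, percents=(5, 10, 15)):
    """Return {percent: number of patterns found at that minimum support}."""
    counts = {}
    for pct in percents:
        # Convert the relative support into a number of sequences (rounded up).
        min_count = max(1, math.ceil(pct * len(dataset) / 100))
        counts[pct] = len(miner(dataset, min_count))
    return counts

def single_item_miner(dataset, min_count):
    """Stand-in for `apriori`: reports frequent single items only."""
    seen = {}
    for seq in dataset:
        for item in {i for event in seq for i in event}:
            seen[item] = seen.get(item, 0) + 1
    return [([[item]], c) for item, c in seen.items() if c >= min_count]

# Made-up data: "y" is in all 20 sequences, "x" in two, "z" in one.
toy = [[["y"]] for _ in range(20)]
toy[0] = [["y", "z"], ["x"]]
toy[1] = [["x"], ["y"]]

for pct, n in count_patterns(toy, single_item_miner).items():
    print(f"min support {pct}%: {n} patterns")
```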