# TP5 - Sequential Pattern Mining


In this practical work, you are given a dataset of customers' shopping baskets. Each basket contains different products, and each customer may have one or more baskets. The first objective is to prepare a dataset that represents, for each customer, all of their baskets as a sequence of events. The second is to compute the frequent patterns in these sequences.

The dataset format:
- An event is a list of strings.
- A sequence is a list of events.
- A dataset is a list of sequences.

Thus, a dataset is a list of lists of lists of strings.

E.g.

dataset =  [
  
  [["a"], ["a", "b", "c"], ["a", "c"], ["c"]],
  
  [["a"], ["c"], ["b", "c"]],
  
  [["a", "b"], ["d"], ["c"], ["b"], ["c"]],
  
  [["a"], ["c"], ["b"], ["c"]] ]
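The nesting described above can be checked mechanically. The following is a minimal sketch; `is_valid_dataset` and `toy_dataset` are hypothetical names introduced only for illustration.

```python
def is_valid_dataset(dataset):
    """Return True if `dataset` is a list of sequences, each sequence a list
    of events, and each event a list of strings."""
    return isinstance(dataset, list) and all(
        isinstance(sequence, list) and all(
            isinstance(event, list)
            and all(isinstance(item, str) for item in event)
            for event in sequence
        )
        for sequence in dataset
    )


# Same toy dataset as in the example above.
toy_dataset = [
    [["a"], ["a", "b", "c"], ["a", "c"], ["c"]],
    [["a"], ["c"], ["b", "c"]],
    [["a", "b"], ["d"], ["c"], ["b"], ["c"]],
    [["a"], ["c"], ["b"], ["c"]],
]

print(is_valid_dataset(toy_dataset))    # True
print(is_valid_dataset([["a", "b"]]))   # False: the events are bare strings, not lists
```

A check of this kind is handy later on, once the sequences are built from the dataframe rather than typed by hand.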

**Step1** Load the dataset `df.csv`. Pass `index_col=0` to use the first column as the index.

In [2]:
import pandas as pd
data=pd.read_csv('df.csv',index_col=0)
data.head()

Unnamed: 0,BasketID,BasketDate,Sale,CustomerID,CustomerCountry,ProdID,ProdDescr,Qta,Sale_per_Qta
0,536365,2010-01-12 08:26:00,2.55,17850.0,United Kingdom,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,15.3
1,536365,2010-01-12 08:26:00,3.39,17850.0,United Kingdom,71053,WHITE METAL LANTERN,6,20.34
2,536365,2010-01-12 08:26:00,2.75,17850.0,United Kingdom,84406B,CREAM CUPID HEARTS COAT HANGER,8,22.0
3,536365,2010-01-12 08:26:00,3.39,17850.0,United Kingdom,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,20.34
4,536365,2010-01-12 08:26:00,3.39,17850.0,United Kingdom,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,20.34
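To see what `index_col=0` changes, here is a small sketch on an in-memory CSV. The three rows are a reduced stand-in for `df.csv` (only a few of its columns are kept), not the real file.

```python
import io

import pandas as pd

# Toy CSV whose first column has no header, like a saved dataframe index.
csv_text = """,BasketID,CustomerID,ProdID
0,536365,17850.0,85123A
1,536365,17850.0,71053
2,536366,17850.0,22633
"""

with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)
without_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column becomes the row index.
print(list(with_index.columns))
# Without it, pandas keeps that column as data and names it 'Unnamed: 0'.
print(list(without_index.columns))
```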


## Modelling sequences

Slightly modify the shape of the dataframe so that it fits the input requirements of the `apriori` function.

**Step2** First, model each customer as a sequence of baskets. Group the rows by customer (`CustomerID`), collect the unique baskets (`BasketID`) of each customer, and apply `list` to every group.

In [3]:
baskets=data.groupby(['CustomerID'])['BasketID'].unique().apply(list)
baskets.head()


CustomerID
12347.0    [537626, 542237, 549222, 556201, 562032, 57351...
12348.0                             [539318, 541998, 548955]
12349.0                                             [577609]
12350.0                                             [543037]
12352.0    [544156, 545323, 546869, 547390, 567505, 56869...
Name: BasketID, dtype: object
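`unique()` keeps values in order of first appearance, so the order of baskets in each sequence follows the row order of the dataframe. If the rows are not guaranteed to be chronological (this is not stated for `df.csv`), sorting by `BasketDate` first may be safer. A minimal sketch on a toy frame with hypothetical values:

```python
import pandas as pd

# Toy rows deliberately out of chronological order for customer 1.0.
toy = pd.DataFrame({
    "CustomerID": [1.0, 1.0, 1.0, 2.0],
    "BasketID": [20, 10, 20, 30],
    "BasketDate": pd.to_datetime(
        ["2011-03-01", "2010-12-01", "2011-03-01", "2011-01-15"]
    ),
})

ordered = (
    toy.sort_values("BasketDate")          # oldest basket first
       .groupby("CustomerID")["BasketID"]
       .unique()                           # keeps order of first appearance
       .apply(list)
)

print([int(b) for b in ordered.loc[1.0]])  # [10, 20]
```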

**Step3** Next, drop the customers who performed only one shopping session (`baskets.apply(len)==1`).

In [4]:
baskets.drop(baskets[baskets.apply(len)==1].index,inplace=True)
baskets.shape

(2720,)
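The same filter can also be written as a boolean mask that keeps customers with more than one basket instead of dropping the others in place. A sketch on a toy Series with hypothetical IDs:

```python
import pandas as pd

# Toy stand-in for `baskets`: customer ID -> list of basket IDs.
toy_baskets = pd.Series(
    {1.0: [10, 20], 2.0: [30], 3.0: [40, 50, 60]},
    name="BasketID",
)

# Keep only customers with at least two baskets; the original is untouched.
repeat_customers = toy_baskets[toy_baskets.apply(len) > 1]

print(repeat_customers.index.tolist())  # [1.0, 3.0]
```

The mask version returns a new Series, which makes it easier to re-run the cell without side effects.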

**Step4** Now compute a Series in which each row holds a basket ID and the products bought during that transaction. Keep only the rows whose `CustomerID` appears in `baskets`, then apply `unique()` and `apply(list)`.

In [5]:
transactions=data[data.CustomerID.isin(baskets.index)].groupby(['BasketID'])['ProdID'].unique().apply(list)
transactions.head()

BasketID
536365       [85123A, 71053, 84406B, 84029G, 84029E, 21730]
536366                                              [22633]
536367    [22745, 22748, 22749, 22310, 84969, 22623, 217...
536368                         [22960, 22913, 22912, 22914]
536369                                              [21756]
Name: ProdID, dtype: object
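The effect of the filter and of `unique()` is easier to see on a few rows. In this sketch the frame and the `kept_customers` index are toy stand-ins for `data` and `baskets.index`:

```python
import pandas as pd

toy = pd.DataFrame({
    "CustomerID": [1.0, 1.0, 1.0, 2.0, 1.0],
    "BasketID":   [10, 10, 10, 30, 20],
    "ProdID":     ["A", "B", "A", "C", "D"],
})
kept_customers = pd.Index([1.0])  # stands in for baskets.index

toy_transactions = (
    toy[toy["CustomerID"].isin(kept_customers)]  # drops customer 2.0
    .groupby("BasketID")["ProdID"]
    .unique()                                    # product "A" is listed once in basket 10
    .apply(list)
)

print(toy_transactions.to_dict())  # {10: ['A', 'B'], 20: ['D']}
```

Because of `unique()`, an event records which products were bought in a basket, not how many times each product line appears.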

**Step5** Now combine the two results to build, for each customer, the list of products bought during each of their sessions. Use two columns, `['CustomerID', 'basket_list']`, and set the column `CustomerID` as the index.

In [6]:
df2 = pd.DataFrame(columns=['CustomerID', 'basket_list'])
for CustomerID in baskets.index:
    customer_list = []
    for BasketID in baskets.loc[CustomerID]:
        customer_list.append(transactions.loc[BasketID])
    current_customer = pd.DataFrame(data= {'CustomerID': CustomerID, 'basket_list': [customer_list]})
    df2 = pd.concat([df2, current_customer])

In [7]:
df2.set_index('CustomerID')  # returns an indexed copy; df2 itself is left unchanged

CustomerID,basket_list
12347.0,"[[85116, 22375, 71477, 22771, 22772, 22773, 22..."
12348.0,"[[84991, 21213, 22952, 21977], [21726], [22437]]"
12352.0,"[[21380, 22064, 21232, 22646, 22779, 22654, 21..."
12356.0,"[[22138, 22062, 22066, 22131, 22195, 22937, 84..."
12358.0,"[[15060B, 22059, 37447, 15056P, 15056N, 20679,..."
...,...
18273.0,"[[79302M], [79302M]]"
18282.0,"[[23295, 22089, 21108, 21109], [22699, 22818, ..."
18283.0,"[[22356, 20726, 22384, 22386, 20717, 20718, 85..."
18287.0,"[[22755, 22754, 22753, 22756, 22758, 22757, 22..."
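Calling `pd.concat` inside the loop copies the accumulated frame at every iteration. One possible alternative, sketched here on toy data with hypothetical IDs, maps each customer's basket IDs to their product lists in a single `apply`:

```python
import pandas as pd

# Toy stand-ins for `baskets` and `transactions`.
toy_baskets = pd.Series({1.0: [10, 20], 2.0: [30, 40]})
toy_transactions = pd.Series({10: ["A", "B"], 20: ["D"], 30: ["C"], 40: ["A"]})

toy_df2 = (
    toy_baskets
    # Replace every basket ID by the list of products in that basket.
    .apply(lambda basket_ids: [toy_transactions.loc[b] for b in basket_ids])
    .rename("basket_list")
    .rename_axis("CustomerID")
    .reset_index()
)

print(toy_df2["basket_list"].tolist())
# [[['A', 'B'], ['D']], [['C'], ['A']]]
```

This yields the same two columns as the loop, with `CustomerID` as a regular column.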


**Step6** Print out the resulting dataframe (df2).

In [8]:
df2.head()

Unnamed: 0,CustomerID,basket_list
0,12347.0,"[[85116, 22375, 71477, 22771, 22772, 22773, 22..."
0,12348.0,"[[84991, 21213, 22952, 21977], [21726], [22437]]"
0,12352.0,"[[21380, 22064, 21232, 22646, 22779, 22654, 21..."
0,12356.0,"[[22138, 22062, 22066, 22131, 22195, 22937, 84..."
0,12358.0,"[[15060B, 22059, 37447, 15056P, 15056N, 20679,..."


**Step7** Now restructure the dataframe to fit the input format of the `apriori` function. Define a list named `dataset` and append to it the sequence stored in each row of the dataframe.

In [9]:
dataset=[]
for row in df2.values:
    dataset.append(row[1])
dataset

[[['85116',
   '22375',
   '71477',
   '22771',
   '22772',
   '22773',
   '22774',
   '22775',
   '22805',
   '22725',
   '22726',
   '22727',
   '22728',
   '22729',
   '22212',
   '21171',
   '22195',
   '84969',
   '84997C',
   '84997B',
   '84997D',
   '22494',
   '22497',
   '85232D',
   '21064',
   '21731',
   '84558A',
   '20780',
   '20782'],
  ['84625A',
   '84625C',
   '85116',
   '20719',
   '22375',
   '22376',
   '20966',
   '22725',
   '22726',
   '22727',
   '22728',
   '22729',
   '22196',
   '84992',
   '84991',
   '21976',
   '22417',
   '47559B',
   '21154',
   '21041',
   '21035',
   '84969',
   '22134',
   '21832',
   '22422',
   '22497',
   '21731',
   '84558A'],
  ['22376',
   '22374',
   '22371',
   '22375',
   '20665',
   '21791',
   '22550',
   '23177',
   '22432',
   '22774',
   '22195',
   '22196',
   '21975',
   '21041',
   '22699',
   '21731',
   '84559A',
   '84559B',
   '16008',
   '22821',
   '22497'],
  ['23084',
   '23162',
   '23171',
   '23172',
  

**Step8** Count the total number of sequences and events contained within the dataset (a sequence is composed of multiple events; an event is a list of strings).

In [10]:
nb_sequences=len(dataset)
nb_sequences

2720

In [11]:
nb_events=(sum([len(seq) for seq in dataset]))
nb_events

17378
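With the figures above, a customer has on average 17378 / 2720 ≈ 6.39 baskets. The same counts, plus the number of distinct items, can be sketched on the toy dataset from the introduction (the `toy_` names are illustrative only):

```python
from itertools import chain

toy_dataset = [
    [["a"], ["a", "b", "c"], ["a", "c"], ["c"]],
    [["a"], ["c"], ["b", "c"]],
    [["a", "b"], ["d"], ["c"], ["b"], ["c"]],
    [["a"], ["c"], ["b"], ["c"]],
]

toy_nb_sequences = len(toy_dataset)
toy_nb_events = sum(len(sequence) for sequence in toy_dataset)
# Flatten two levels (sequences, then events) to reach the items.
toy_items = set(chain.from_iterable(chain.from_iterable(toy_dataset)))

print(toy_nb_sequences)                  # 4
print(toy_nb_events)                     # 16
print(sorted(toy_items))                 # ['a', 'b', 'c', 'd']
print(toy_nb_events / toy_nb_sequences)  # 4.0
```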

## Frequent Patterns Computation

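**Step9** Compute the frequent patterns with a minimum support equal to 5%, 10% and 15% of the dataset. An `apriori` function is needed for the computation; it computes the frequent sequences in a sequence dataset for a given minimum support (Generalized Sequential Pattern Mining approach). Judging from the call and the counts in its output, `minSupport` is an absolute number of sequences, so each percentage has to be converted to a count out of the 2720 sequences.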

Args:
   - dataset: A list of sequences, for which the frequent (sub-)sequences are computed
   - minSupport: The minimum support that makes a sequence frequent
   
Returns: list of tuples (s, c), where s is a frequent sequence, and c is the count for that sequence

The computation relies on a recursive helper that checks whether a candidate subsequence is contained in a main sequence.
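To make the notions of containment and support concrete, here is a minimal sketch of a recursive subsequence check and a support count. It is not the code of the `gsp` module: the names `is_subsequence` and `count_support` and their argument order are assumptions made for illustration.

```python
def is_subsequence(main_sequence, pattern):
    """Return True if `pattern` is contained in `main_sequence`: every pattern
    event must be a subset of some event of the main sequence, with the
    matched events appearing in the same order."""
    if not pattern:
        return True  # the empty pattern is contained in everything
    wanted = set(pattern[0])
    for i, event in enumerate(main_sequence):
        if wanted.issubset(event):
            # Matching at the earliest position leaves the longest remainder,
            # so there is no need to try later positions.
            return is_subsequence(main_sequence[i + 1:], pattern[1:])
    return False


def count_support(dataset, pattern):
    """Number of sequences of `dataset` that contain `pattern`."""
    return sum(is_subsequence(sequence, pattern) for sequence in dataset)


toy_dataset = [
    [["a"], ["a", "b", "c"], ["a", "c"], ["c"]],
    [["a"], ["c"], ["b", "c"]],
    [["a", "b"], ["d"], ["c"], ["b"], ["c"]],
    [["a"], ["c"], ["b"], ["c"]],
]

print(count_support(toy_dataset, [["a"], ["c"]]))         # 4
print(count_support(toy_dataset, [["b", "c"]]))           # 2
print(count_support(toy_dataset, [["a"], ["b"], ["c"]]))  # 3
```

A pattern is frequent when this count reaches the minimum support; note that each sequence contributes at most once, however many times the pattern occurs in it.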


In [20]:
from gsp import apriori

In [21]:
itemsets_5 = apriori(dataset, minSupport=272)  # note: 272 is 10% of the 2720 sequences; 5% would be 136

In [22]:
itemsets_5

[([['20724']], 276),
 ([['20725']], 465),
 ([['20726']], 336),
 ([['20727']], 410),
 ([['20728']], 411),
 ([['20914']], 288),
 ([['21034']], 373),
 ([['21080']], 360),
 ([['21175']], 286),
 ([['21181']], 295),
 ([['21212']], 520),
 ([['21485']], 321),
 ([['21733']], 309),
 ([['21754']], 324),
 ([['21755']], 296),
 ([['21790']], 365),
 ([['21791']], 282),
 ([['21889']], 282),
 ([['21915']], 283),
 ([['21931']], 282),
 ([['21977']], 327),
 ([['22077']], 335),
 ([['22086']], 476),
 ([['22111']], 352),
 ([['22112']], 317),
 ([['22114']], 280),
 ([['22138']], 460),
 ([['22139']], 405),
 ([['22144']], 272),
 ([['22149']], 282),
 ([['22178']], 319),
 ([['22197']], 308),
 ([['22382']], 426),
 ([['22383']], 378),
 ([['22384']], 395),
 ([['22386']], 306),
 ([['22411']], 312),
 ([['22457']], 487),
 ([['22469']], 462),
 ([['22470']], 407),
 ([['22551']], 286),
 ([['22554']], 299),
 ([['22558']], 293),
 ([['22621']], 300),
 ([['22629']], 283),
 ([['22662']], 276),
 ([['22666']], 409),
 ([['22697']]
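The three thresholds requested in this step can be derived from the number of sequences rather than typed by hand. The sketch below uses a hypothetical helper, `min_support_count`, and then ranks a few `(sequence, count)` tuples copied from the output above by decreasing count.

```python
import math


def min_support_count(n_sequences, percent):
    """Smallest number of sequences that reaches `percent` % of the dataset."""
    return math.ceil(n_sequences * percent / 100)


# 2720 is the number of sequences counted in Step8.
support_counts = {p: min_support_count(2720, p) for p in (5, 10, 15)}
print(support_counts)  # {5: 136, 10: 272, 15: 408}

# First three results of the run above, in the (sequence, count) format.
results = [([['20724']], 276), ([['20725']], 465), ([['20726']], 336)]
ranked = sorted(results, key=lambda pair: pair[1], reverse=True)
print(ranked[0])  # ([['20725']], 465)
```

Each value of `support_counts` could then be passed as `minSupport` to obtain the 5%, 10% and 15% pattern sets.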

**Step10** Time constraints: take the date of each transaction (basket) into account by storing it alongside the products of the basket.

In [32]:
data['BasketDate'] = pd.to_datetime(data['BasketDate'])  # parse the timestamps
data['BasketDate'] = data['BasketDate'].dt.year  # keep only the year of each basket

In [33]:
df2 = pd.DataFrame(columns=['CustomerID', 'basket_list'])
for CustomerID in baskets.index:
    customer_list = []
    for BasketID in baskets.loc[CustomerID]:
        transaction_date = data[data.BasketID == BasketID].BasketDate.unique()[0]
        basket_event = [transaction_date, transactions.loc[BasketID]]
        customer_list.append(basket_event)
    current_customer = pd.DataFrame(data= {'CustomerID': CustomerID, 'basket_list': [customer_list]})
    df2 = pd.concat([df2, current_customer])

In [34]:
df2.set_index('CustomerID', inplace=True)
df2

CustomerID,basket_list
12347.0,"[[2010, [85116, 22375, 71477, 22771, 22772, 22..."
12348.0,"[[2010, [84991, 21213, 22952, 21977]], [2011, ..."
12352.0,"[[2011, [21380, 22064, 21232, 22646, 22779, 22..."
12356.0,"[[2011, [22138, 22062, 22066, 22131, 22195, 22..."
12358.0,"[[2011, [15060B, 22059, 37447, 15056P, 15056N,..."
...,...
18273.0,"[[2011, [79302M]], [2011, [79302M]]]"
18282.0,"[[2011, [23295, 22089, 21108, 21109]], [2011, ..."
18283.0,"[[2011, [22356, 20726, 22384, 22386, 20717, 20..."
18287.0,"[[2011, [22755, 22754, 22753, 22756, 22758, 22..."


In [36]:
dataset2=[]
for row in df2.values:
    # CustomerID is now the index, so basket_list is the only column (position 0)
    dataset2.append(row[0])
itemsets_5 = apriori(dataset2, minSupport=272)

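TypeError: unhashable type: 'list'

The call fails because each event is now a pair `[year, [products]]`: one of the items of the event is itself a list, and Python lists are not hashable. The sequences in `dataset2` therefore no longer follow the format described at the start, where an event is a list of strings, so `apriori` cannot be applied to them directly.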
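One way to use the dates without changing the item format is to keep the timestamp next to each event and enforce the constraint during the containment check. The sketch below is an assumption-laden illustration, not part of the `gsp` module: it represents a timed sequence as a list of `(time, items)` pairs sorted by time, and `contains_with_max_gap` is a hypothetical helper that allows at most `max_gap` time units between two consecutive matched events.

```python
def contains_with_max_gap(timed_sequence, pattern, max_gap, previous_time=None):
    """Return True if `pattern` occurs in `timed_sequence` with at most
    `max_gap` time units between consecutive matched events.
    Assumes `timed_sequence` is sorted by time."""
    if not pattern:
        return True
    wanted = set(pattern[0])
    for i, (time, items) in enumerate(timed_sequence):
        if previous_time is not None and time - previous_time > max_gap:
            break  # later events are even further away
        # Unlike the unconstrained case, an early match may fail because of
        # the gap, so later candidate positions must be tried as well.
        if wanted.issubset(items) and contains_with_max_gap(
            timed_sequence[i + 1:], pattern[1:], max_gap, time
        ):
            return True
    return False


# Toy timed sequence: (year, products bought that year).
timed = [(2010, ["a"]), (2011, ["b"]), (2015, ["a", "c"]), (2016, ["b"])]

print(contains_with_max_gap(timed, [["a"], ["b"]], max_gap=1))  # True
print(contains_with_max_gap(timed, [["b"], ["c"]], max_gap=1))  # False
print(contains_with_max_gap(timed, [["b"], ["c"]], max_gap=4))  # True
```

Since `BasketDate` was reduced to the year above, a gap expressed in years is very coarse; keeping the full timestamp would allow finer constraints.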