## LCM
LCM mines closed itemsets whose support is at least a user-supplied minimum support threshold
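
To make the notion of a closed itemset concrete, here is a minimal brute-force sketch in plain Python. It is not how LCM works internally and it does not use `skmine`. The toy `transactions`, the `support` helper and the threshold are all hypothetical, chosen only for illustration. An itemset is kept as closed when no strict superset has the same support.

```python
from itertools import combinations

# Toy transactions (hypothetical data, not the chess dataset)
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"a", "b", "c"},
]
min_supp = 2  # hypothetical minimum support, as an absolute count


def support(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


items = sorted(set().union(*transactions))

# Enumerate every non-empty itemset and keep the frequent ones
frequent = {}
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        s = support(set(combo))
        if s >= min_supp:
            frequent[frozenset(combo)] = s

# An itemset is closed if no strict superset has the same support
closed = {
    itemset: s
    for itemset, s in frequent.items()
    if not any(itemset < other and s == s_other
               for other, s_other in frequent.items())
}

for itemset, s in sorted(closed.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), s)
```

On this toy data, 7 itemsets are frequent but only 4 of them are closed: `b` always appears together with `a`, so `{b}` carries no more information than `{a, b}`.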

#### load the chess dataset

In [1]:
from skmine.datasets.fimi import fetch_chess
chess = fetch_chess()
chess.head()

0    [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25...
1    [1, 3, 5, 7, 9, 12, 13, 15, 17, 19, 21, 23, 25...
2    [1, 3, 5, 7, 9, 12, 13, 16, 17, 19, 21, 23, 25...
3    [1, 3, 5, 7, 9, 11, 13, 15, 17, 20, 21, 23, 25...
4    [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25...
Name: chess, dtype: object

In [2]:
chess.shape

(3196,)

#### fit_discover()
`fit_discover` makes pattern discovery more user-friendly by returning patterns in a readable format,
instead of the traditional tabular format used in the `scikit` community

In [3]:
from skmine.itemsets import LCM
# minimum support of 2000, running on 4 processes
lcm = LCM(min_supp=2000, n_jobs=4)
%time patterns = lcm.fit_discover(chess)

CPU times: user 449 ms, sys: 362 ms, total: 811 ms
Wall time: 9.98 s


In [4]:
patterns.shape

(68967, 2)

The format in which patterns are rendered makes post hoc analysis easier

Here we keep only the patterns containing strictly more than 3 items

In [5]:
patterns[patterns.itemset.map(len) > 3]

| | itemset | support |
|---|---|---|
| 14 | (17, 25, 29, 58) | 2179 |
| 17 | (17, 29, 3, 58) | 2272 |
| 18 | (25, 29, 3, 58) | 2520 |
| 22 | (25, 29, 31, 58) | 2220 |
| 23 | (29, 3, 31, 58) | 2360 |
| ... | ... | ... |
| 68958 | (40, 5, 52, 54) | 2034 |
| 68959 | (29, 40, 5, 52, 54) | 2027 |
| 68964 | (29, 40, 52, 70) | 2001 |
| 68965 | (29, 40, 58, 70) | 2006 |


`Note`

Even with a very high minimum support threshold, we discovered more than 60K patterns from only 3196 original transactions.
This is a good illustration of the so-called **pattern explosion problem**
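
The gap between the number of transactions and the number of patterns can be reproduced on synthetic data. The sketch below is a brute-force count in plain Python, not LCM and not `skmine`. The helper `count_closed_itemsets` and the constructed dataset are hypothetical: transaction `i` contains every item except item `i`, a deliberately adversarial case in which every non-empty proper subset of the items is a closed itemset.

```python
from itertools import combinations


def count_closed_itemsets(transactions, min_supp):
    """Brute-force count of closed itemsets with support >= min_supp."""
    items = sorted(set().union(*transactions))
    count = 0
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            itemset = set(combo)
            covering = [t for t in transactions if itemset <= t]
            if len(covering) < min_supp:
                continue
            # Closed <=> the itemset equals the intersection of
            # all the transactions that contain it
            if set.intersection(*covering) == itemset:
                count += 1
    return count


counts = {}
for n in (4, 6, 8, 10):
    # Hypothetical adversarial data: transaction i holds every item except i
    transactions = [set(range(n)) - {i} for i in range(n)]
    counts[n] = count_closed_itemsets(transactions, min_supp=1)

print(counts)  # {4: 14, 6: 62, 8: 254, 10: 1022}
```

With `n` transactions this construction yields `2**n - 2` closed itemsets, so 10 transactions already produce 1022 patterns. Real datasets are rarely this extreme, but the growth is of the same exponential nature.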

------------
We can also get the top-k patterns by support with a single line of code

In [6]:
patterns.nlargest(10, columns=['support'])  # top 10 patterns

| | itemset | support |
|---|---|---|
| 0 | (58,) | 3195 |
| 3749 | (52,) | 3185 |
| 1118 | (52, 58) | 3184 |
| 4498 | (29,) | 3181 |
| 8 | (29, 58) | 3180 |
| 3757 | (29, 52) | 3170 |
| 4506 | (40,) | 3170 |
| 88 | (40, 58) | 3169 |
| 1126 | (29, 52, 58) | 3169 |
| 3800 | (40, 52) | 3159 |


---------------
#### fit_transform()
LCM can also be used as a preprocessing step to work on tabular data

In [7]:
lcm = LCM(min_supp=2000, n_jobs=4)
lcm.fit_transform(chess)



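| | 11 | 15 | 17 | 21 | 25 | 27 | 29 | 3 | 31 | 34 | ... | 58 | 60 | 62 | 64 | 66 | 7 | 70 | 72 | 74 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2129 | 2026 | 2500 | 2225 | 2003 | 2205 | 2000 | 2003 | 2008 | 2000 | ... | 2000 | 2005 | 2043 | 2005 | 2005 | 2005 | 2000 | 2005 | 2029 | 2001 |
| 1 | 0 | 2026 | 2500 | 2225 | 2003 | 2205 | 2000 | 2003 | 2008 | 2000 | ... | 2000 | 2005 | 2043 | 2005 | 2005 | 2005 | 2000 | 2005 | 2029 | 2001 |
| 2 | 0 | 0 | 2500 | 2225 | 2003 | 2205 | 2000 | 2003 | 2008 | 2000 | ... | 2000 | 2005 | 2043 | 2005 | 2005 | 2005 | 2000 | 2005 | 2029 | 2001 |
| 3 | 2129 | 2026 | 2500 | 2225 | 2003 | 2205 | 2000 | 2003 | 2008 | 2000 | ... | 2000 | 2005 | 2043 | 2005 | 2005 | 2005 | 2000 | 2005 | 2029 | 2001 |
| 4 | 2129 | 2026 | 2500 | 2225 | 2003 | 2205 | 2000 | 2003 | 2008 | 2000 | ... | 2000 | 2005 | 2043 | 2005 | 2005 | 2005 | 2000 | 2005 | 2029 | 2001 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3191 | 2129 | 0 | 2500 | 2225 | 0 | 2205 | 2000 | 0 | 0 | 2000 | ... | 2000 | 0 | 2006 | 2007 | 0 | 2006 | 2000 | 0 | 2006 | 2009 |
| 3192 | 2129 | 0 | 2500 | 2225 | 0 | 2205 | 2000 | 0 | 0 | 2000 | ... | 2000 | 0 | 2006 | 2007 | 0 | 2006 | 2000 | 0 | 2006 | 2009 |
| 3193 | 2129 | 0 | 2500 | 2225 | 0 | 2205 | 2000 | 0 | 2001 | 2000 | ... | 2000 | 0 | 2006 | 2007 | 0 | 2006 | 2000 | 0 | 2006 | 2009 |
| 3194 | 2129 | 0 | 2500 | 2225 | 0 | 2205 | 0 | 0 | 0 | 0 | ... | 2025 | 0 | 2025 | 2112 | 0 | 0 | 0 | 0 | 2025 | 2077 |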


Setting a lower threshold will fill some of the cells currently filled with 0s,
but will also lead to longer run times