In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Frequent Pattern Mining

> __Note:__ marked as _experimental_

Mining frequent items, itemsets, subsequences, or other substructures is usually among the first steps in analyzing a large-scale dataset, and it has been an active research topic in data mining for years.  
See the Wikipedia article on association rule learning for more information:  
https://en.wikipedia.org/wiki/Association_rule_learning
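
To make the terms concrete, here is a minimal plain-Python sketch of the two quantities that association rule learning is built on: the support of an itemset and the confidence of a rule. The toy transactions and the helper names (`support`, `confidence`, `frequent_itemsets`) are assumptions made for illustration only. The brute-force enumeration is not how FP-growth works internally; it only shows what "frequent" means.

```python
from itertools import combinations

# Hypothetical toy transactions, used only for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    joint = support(set(antecedent) | set(consequent), transactions)
    return joint / support(antecedent, transactions)


def frequent_itemsets(transactions, min_support):
    """Brute-force enumeration of all itemsets; only viable for toy data."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_support:
                result[candidate] = s
    return result


freq = frequent_itemsets(transactions, min_support=0.5)
for itemset, s in freq.items():
    print(itemset, s)

# Confidence of the rule {butter} -> {bread}
print(confidence(["butter"], ["bread"], transactions))
```

An itemset is frequent when its support reaches the chosen threshold, and a rule is kept when its confidence reaches a second threshold. These correspond to the `minSupport` and `minConfidence` parameters used later in this notebook.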


## FPGrowth

### `pyspark.ml.fpm.FPGrowth` and `pyspark.ml.fpm.FPGrowthModel`
Spark's official documentation explains this algorithm well. To quote the API guide:
> A parallel FP-growth algorithm to mine frequent itemsets. The algorithm is described in [Li et al., PFP: Parallel FP-Growth for Query Recommendation LI2008](http://dx.doi.org/10.1145/1454008.1454027). PFP distributes computation in such a way that each worker executes an independent group of mining tasks. The FP-Growth algorithm is described in [Han et al., Mining frequent patterns without candidate generation HAN2000](http://dx.doi.org/10.1145/335191.335372)

MLlib Main Guide: https://spark.apache.org/docs/2.4.3/ml-frequent-pattern-mining.html#fp-growth  
Official API documentation: https://spark.apache.org/docs/2.4.3/api/python/pyspark.ml.html#module-pyspark.ml.fpm.FPGrowth

Please refer to the API guide and the MLlib Main Guide to learn more about FPGrowth and its syntax.

### Example:

In [12]:
from pyspark.ml.fpm import FPGrowth

data = [
    (0, [1, 2, 5]),
    (1, [1, 2, 3, 5]),
    (2, [1, 2]),
    (3, [1, 2, 5]),
    (4, [1, 2, 3, 5]),
    (5, [1, 2, 7]),
    (6, [1, 2, 5]),
    (7, [1, 2, 3, 5]),
    (8, [1, 2, 4]),
]

df = spark.createDataFrame(data, ["id", "items"])

fp_growth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fp_growth.fit(df)

# Display frequent itemsets.
print("Frequent itemsets:")
model.freqItemsets.show()

# Display generated association rules.
print("Generated association rules")
model.associationRules.show()

# transform() checks each row's items against all the association rules and
# summarizes the consequents of the applicable rules as the prediction.
print("Summary/Prediction:")
model.transform(df).show()
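
The `transform` step described in the comment above can be sketched in plain Python. This is a simplified illustration, not Spark's implementation: the rule list is hand-written for the sketch (it is not copied from `model.associationRules`), and the helper name `predict` is hypothetical. It assumes that a rule applies when all of its antecedent items are present in the row, and that items the row already contains are left out of the prediction.

```python
# Illustrative rules as (antecedent, consequent) pairs. These are hand-written
# assumptions for this sketch, not the actual output of model.associationRules.
rules = [
    ({1}, {2}),
    ({2}, {1}),
    ({1, 2}, {5}),
    ({5}, {1}),
]


def predict(items, rules):
    """Collect consequents of every rule whose antecedent is contained in `items`."""
    items = set(items)
    predicted = set()
    for antecedent, consequent in rules:
        if antecedent <= items:
            # Only suggest items the row does not already contain.
            predicted |= consequent - items
    return sorted(predicted)


print(predict([1, 2], rules))     # [5]
print(predict([1, 2, 5], rules))  # []
print(predict([5], rules))        # [1]
```

Note that rules are applied once against the original items and are not chained: `[5]` yields `[1]`, but the newly predicted `1` does not trigger the rule `{1} -> {2}`.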

## PrefixSpan
A parallel PrefixSpan algorithm to mine frequent sequential patterns. The PrefixSpan algorithm is described in [J. Pei, et al., PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth](http://doi.org/10.1109/ICDE.2001.914830)

### `pyspark.ml.fpm.PrefixSpan`

> __Note:__ This class is not yet an Estimator/Transformer; use the `.findFrequentSequentialPatterns()` method to run the `PrefixSpan` algorithm. This feature was added in Spark 2.4.0.

Sequential Pattern Mining (Wikipedia): https://en.wikipedia.org/wiki/Sequential_Pattern_Mining  
MLlib Main Guide: https://spark.apache.org/docs/2.4.3/ml-frequent-pattern-mining.html#prefixspan  
Official API documentation: https://spark.apache.org/docs/2.4.3/api/python/pyspark.ml.html#pyspark.ml.fpm.PrefixSpan  

The input should be a DataFrame with a column of type array of arrays, where each row is a sequence of itemsets. The example below shows what the data should look like.

### Example:

In [31]:
from pyspark.ml.fpm import PrefixSpan
from pyspark.sql import Row

data = [
    Row(sequence=[[1, 2], [3]]),
    Row(sequence=[[1], [3, 2], [1, 2]]),
    Row(sequence=[[1, 2], [5]]),
    Row(sequence=[[6]]),
]

df = spark.createDataFrame(data, ["sequence"])
print("Find Frequent Sequential Patterns:")
print("in:")
df.show(4, False)

prefixSpan = PrefixSpan(minSupport=0.5, maxPatternLength=5, maxLocalProjDBSize=32000000)

# Find frequent sequential patterns.
# Note: the class is not yet an Estimator/Transformer; call .findFrequentSequentialPatterns() to run it.
print("out:")
prefixSpan.findFrequentSequentialPatterns(df).show()

Find Frequent Sequential Patterns:
in:
+---------------------+
|sequence             |
+---------------------+
|[[1, 2], [3]]        |
|[[1], [3, 2], [1, 2]]|
|[[1, 2], [5]]        |
|[[6]]                |
+---------------------+

out:
+----------+----+
|  sequence|freq|
+----------+----+
|     [[2]]|   3|
|     [[3]]|   2|
|     [[1]]|   3|
|  [[1, 2]]|   3|
|[[1], [3]]|   2|
+----------+----+
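
The `freq` column above can be checked by hand with a small plain-Python sketch. It assumes the usual definition of sequential pattern containment: a pattern occurs in a sequence if each of its itemsets is a subset of a distinct itemset of the sequence, in the same order. The helper names `contains` and `frequency` are hypothetical, and the sketch only counts occurrences of a given pattern; it does not implement the PrefixSpan search itself.

```python
def contains(sequence, pattern):
    """True if `pattern` is a subsequence of `sequence`.

    Each pattern itemset must be a subset of a distinct sequence itemset,
    and the matches must appear in the same order as in the pattern.
    """
    pos = 0
    for wanted in pattern:
        wanted = set(wanted)
        # Advance to the earliest itemset that covers the wanted items.
        while pos < len(sequence) and not wanted <= set(sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern itemset must match a later position
    return True


def frequency(sequences, pattern):
    """Number of sequences that contain `pattern`."""
    return sum(contains(s, pattern) for s in sequences)


# Same input sequences as in the example above.
sequences = [
    [[1, 2], [3]],
    [[1], [3, 2], [1, 2]],
    [[1, 2], [5]],
    [[6]],
]

for pattern in ([[2]], [[3]], [[1]], [[1, 2]], [[1], [3]]):
    print(pattern, frequency(sequences, pattern))
```

With `minSupport=0.5` and four input sequences, a pattern has to occur in at least two sequences to be reported. A pattern such as `[[2], [3]]` occurs in only one sequence, so it does not appear in the output.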

