
# Colab 2
## Frequent Pattern Mining in Spark

In [1]:
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

Collecting pyspark
  Downloading pyspark-3.5.0.tar.gz (316.9 MB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.0-py2.py3-none-any.whl size=317425345 sha256=5e68cc996d6560556686cfc7e6c649527d17124a37032f0b14b8c99fa2267438
  Stored in directory: /root/.cache/pip/wheels/41/4e/10/c2cf2467f71c678cfc8a6b9ac9241e5e44a01940da8fbb17fc
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.0
The following additional packages will be installed:
  libxtst6 openjdk-8-jre-headless
Suggested packages:
  openjdk-8-demo openjdk-8-source libnss-mdns fonts-dejavu-extra fonts-nanum fonts-ipafont-gothic
  fonts-ipafont-mincho fonts-wqy-microhei fonts-wqy-zenhei fonts-indic

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


The cell above mounts Google Drive so that the Spark job can read the dataset files stored there.

**Make sure to follow the interactive instructions.**

Once the cells above have run, the dataset we need for this Colab should be visible under the "Files" tab in the left panel.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import pyspark
# Explicit imports avoid shadowing Python builtins (sum, min, max, ...)
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list
from pyspark import SparkContext, SparkConf

Let's initialize the Spark context.

In [4]:
# configure Spark (serve the web UI on port 4050)
conf = SparkConf().set("spark.ui.port", "4050")

# create the context, then get the session that wraps it
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()

### Your task

If the setup stage ran successfully, you are ready to work with the **3 Million Instacart Orders** dataset. To read more about it, see the [official Instacart blog post](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2), a concise [schema description](https://gist.github.com/jeremystan/c3b39d947d9b88b3ccff3147dbcf6c6b) of the dataset, and the [download page](https://www.instacart.com/datasets/grocery-shopping-2017).

In this Colab, we will be working only with a small training dataset (~131K orders) to perform fast Frequent Pattern Mining with the FP-Growth algorithm.

In [5]:
products = spark.read.csv('/content/drive/MyDrive/Data/products.csv', header=True, inferSchema=True)
orders = spark.read.csv('/content/drive/MyDrive/Data/order_products__train.csv', header=True, inferSchema=True)

In [6]:
products.printSchema()

root
 |-- product_id: integer (nullable = true)
 |-- product_name: string (nullable = true)
 |-- aisle_id: string (nullable = true)
 |-- department_id: string (nullable = true)



In [7]:
orders.printSchema()

root
 |-- order_id: integer (nullable = true)
 |-- product_id: integer (nullable = true)
 |-- add_to_cart_order: integer (nullable = true)
 |-- reordered: integer (nullable = true)



Use the Spark DataFrame API to join 'products' and 'orders', so that you can see the product names in each transaction (and not only their ids). Then group the rows by 'order_id' to obtain one row per basket (i.e., the set of products purchased together by one customer).

In [8]:
# Join 'products' and 'orders' on 'product_id'
combined_df = orders.join(products, 'product_id')

# Group by 'order_id' to obtain one row per basket
grouped_baskets = combined_df.groupBy('order_id').agg(
    collect_list("product_name").alias("basket_items")
)

# Show the first few rows of the resulting DataFrame
grouped_baskets.show(truncate=False)

+--------+------------+
|order_id|basket_items|

In this Colab we will explore [MLlib](https://spark.apache.org/mllib/), Apache Spark's scalable machine learning library. Specifically, you can use its implementation of the [FP-Growth](https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html#fp-growth) algorithm to perform Frequent Pattern Mining efficiently in Spark.
Following the Python example in the documentation, train a model with

```minSupport=0.01``` and ```minConfidence=0.5```
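
To make the threshold concrete: `minSupport` is a fraction of all baskets, so it corresponds to a minimum absolute count. The sketch below assumes the cutoff is the ceiling of `minSupport * n_baskets`, which we believe matches how Spark derives it, but treat that as an assumption. `min_count` is a hypothetical helper, and 131,000 is only the approximate basket count quoted above.

```python
import math

def min_count(min_support: float, n_baskets: int) -> int:
    """Smallest number of baskets an itemset must appear in to be frequent.

    Assumption: the cutoff is ceil(min_support * n_baskets).
    """
    return math.ceil(min_support * n_baskets)

n_baskets = 131_000  # approximate size of the training set (~131K orders)
for threshold in (0.01, 0.001):
    print(f"minSupport={threshold}: at least {min_count(threshold, n_baskets)} baskets")
```

With roughly 131K baskets, `minSupport=0.01` requires an itemset to appear in about 1,310 of them, while `minSupport=0.001` lowers the bar to about 131.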



In [9]:
from pyspark.ml.fpm import FPGrowth

fp_growth = FPGrowth(minSupport=0.01, minConfidence=0.5, itemsCol="basket_items")
model = fp_growth.fit(grouped_baskets)

# Display frequent itemsets
frequent_itemsets = model.freqItemsets
frequent_itemsets.show()

# Display association rules
association_rules = model.associationRules
association_rules.show()

# Make predictions (optional)
predictions = model.transform(grouped_baskets)
predictions.show()

+--------------------+-----+
|               items| freq|
+--------------------+-----+
|      [Green Onions]| 1445|
|   [Red Raspberries]| 1493|
|    [Organic Banana]| 2332|
|  [Jalapeno Peppers]| 1899|
|[Organic Large Ex...| 2891|
|[Organic Whole St...| 1993|
|[Organic Peeled W...| 2460|
|             [Limes]| 6033|
|[Limes, Large Lemon]| 1595|
|     [Limes, Banana]| 1331|
|       [Raspberries]| 3279|
|      [Hass Avocado]| 1633|
|[Organic Broccoli...| 1361|
|[Uncured Genoa Sa...| 1788|
|      [Spring Water]| 2225|
|[Michigan Organic...| 2627|
|     [Yellow Onions]| 3762|
|[Organic Strawber...|10894|
|[Organic Strawber...| 3074|
|[Organic Strawber...| 2174|
+--------------------+-----+
only showing top 20 rows

+----------+----------+----------+----+-------+
|antecedent|consequent|confidence|lift|support|
+----------+----------+----------+----+-------+
+----------+----------+----------+----+-------+

+--------+--------------------+----------+
|order_id|        basket_items|prediction|

Compute how many frequent itemsets and association rules FP-Growth generated, and display the top frequent itemsets and association rules.


In [10]:
# Get the frequent itemsets
frequent_itemsets = model.freqItemsets

# Get the association rules
association_rules = model.associationRules

# Compute the number of frequent itemsets and association rules
num_frequent_itemsets = frequent_itemsets.count()
num_association_rules = association_rules.count()

print("Number of Frequent Itemsets:", num_frequent_itemsets)
print("Number of Association Rules:", num_association_rules)

# Display the top frequent itemsets
print("\nTop Frequent Itemsets:")
frequent_itemsets.sort("freq", ascending=False).show(5, truncate=False)

# Display the top association rules
print("\nTop Association Rules:")
association_rules.sort("confidence", ascending=False).show(5, truncate=False)


Number of Frequent Itemsets: 120
Number of Association Rules: 0

Top Frequent Itemsets:
+------------------------+-----+
|items                   |freq |
+------------------------+-----+
|[Banana]                |18726|
|[Bag of Organic Bananas]|15480|
|[Organic Strawberries]  |10894|
|[Organic Baby Spinach]  |9784 |
|[Large Lemon]           |8135 |
+------------------------+-----+
only showing top 5 rows


Top Association Rules:
+----------+----------+----------+----+-------+
|antecedent|consequent|confidence|lift|support|
+----------+----------+----------+----+-------+
+----------+----------+----------+----+-------+
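
The empty rule table is consistent with the frequencies printed above. As a rough check, the sketch below recomputes the confidence of the rules that can be formed from the two multi-item itemsets visible in the output, using only counts copied from this notebook. It covers just those visible pairs, not all 120 itemsets, and `rule_confidence` is a hypothetical helper.

```python
# Frequencies copied from the minSupport=0.01 output shown in this notebook.
freq = {
    ("Limes",): 6033,
    ("Large Lemon",): 8135,
    ("Banana",): 18726,
    ("Limes", "Large Lemon"): 1595,
    ("Limes", "Banana"): 1331,
}

def rule_confidence(antecedent, itemset):
    """confidence(antecedent -> rest) = freq(itemset) / freq(antecedent)."""
    return freq[itemset] / freq[antecedent]

confidences = {
    "Limes -> Large Lemon": rule_confidence(("Limes",), ("Limes", "Large Lemon")),
    "Large Lemon -> Limes": rule_confidence(("Large Lemon",), ("Limes", "Large Lemon")),
    "Limes -> Banana": rule_confidence(("Limes",), ("Limes", "Banana")),
    "Banana -> Limes": rule_confidence(("Banana",), ("Limes", "Banana")),
}

for rule, value in confidences.items():
    print(f"{rule}: {value:.3f}")
```

Every one of these confidences falls well below `minConfidence=0.5`, so none of these candidate rules survives the filter.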



Now retrain the FP-growth model changing only
```minSupport=0.001```
and compute how many frequent itemsets and association rules were generated.


In [11]:
# Create and fit the FP-Growth model with minSupport=0.001
fp_growth_updated = FPGrowth(minSupport=0.001, minConfidence=0.5, itemsCol="basket_items")
model_updated = fp_growth_updated.fit(grouped_baskets)

# Get the frequent itemsets
frequent_itemsets_updated = model_updated.freqItemsets

# Get the association rules
association_rules_updated = model_updated.associationRules

# Compute the number of frequent itemsets and association rules
num_frequent_itemsets_updated = frequent_itemsets_updated.count()
num_association_rules_updated = association_rules_updated.count()

print("Number of Frequent Itemsets (Updated):", num_frequent_itemsets_updated)
print("Number of Association Rules (Updated):", num_association_rules_updated)

# Display the top frequent itemsets
print("\nTop Frequent Itemsets (Updated):")
frequent_itemsets_updated.sort("freq", ascending=False).show(5, truncate=False)

# Display the top association rules
print("\nTop Association Rules (Updated):")
association_rules_updated.sort("confidence", ascending=False).show(5, truncate=False)


Number of Frequent Itemsets (Updated): 4444
Number of Association Rules (Updated): 11

Top Frequent Itemsets (Updated):
+------------------------+-----+
|items                   |freq |
+------------------------+-----+
|[Banana]                |18726|
|[Bag of Organic Bananas]|15480|
|[Organic Strawberries]  |10894|
|[Organic Baby Spinach]  |9784 |
|[Large Lemon]           |8135 |
+------------------------+-----+
only showing top 5 rows


Top Association Rules (Updated):
+-----------------------------------------------------------------+------------------------+------------------+------------------+---------------------+
|antecedent                                                       |consequent              |confidence        |lift              |support              |
+-----------------------------------------------------------------+------------------------+------------------+------------------+---------------------+
|[Organic Raspberries, Organic Hass Avocado, Organic Strawberries

The output above shows the top association rules generated by the FP-Growth algorithm. Let's break down the columns in the table:

- **antecedent:** The item or set of items that precede or are on the left side of the rule.

- **consequent:** The item or set of items that follow or are on the right side of the rule.

- **confidence:** The conditional probability of the consequent given the antecedent. It is the ratio of the support of the rule to the support of the antecedent. Confidence measures the strength of the rule.

- **lift:** Lift is a measure of how much more likely the consequent is given the antecedent compared to its overall likelihood. It is calculated as the ratio of the confidence of the rule to the support of the consequent.

- **support:** The support of a rule is the proportion of transactions in the dataset that contain both the antecedent and the consequent. It indicates the frequency of occurrence of the rule.
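
The three metrics above can be sketched on a toy dataset in plain Python. The baskets and the helper functions `support`, `confidence`, and `lift` are made up for illustration. This is a direct transcription of the definitions, not how Spark computes them internally.

```python
def support(baskets, itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(baskets, antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)

def lift(baskets, antecedent, consequent):
    """confidence(antecedent -> consequent) / support(consequent)."""
    return confidence(baskets, antecedent, consequent) / support(baskets, consequent)

# Toy data (hypothetical): four baskets, each a set of product names.
baskets = [
    {"limes", "lemon"},
    {"limes", "banana"},
    {"limes", "lemon", "banana"},
    {"banana"},
]

print(support(baskets, {"limes", "lemon"}))      # rule support
print(confidence(baskets, {"lemon"}, {"limes"})) # P(limes | lemon)
print(lift(baskets, {"lemon"}, {"limes"}))       # > 1 means positive association
```

A lift above 1 means the consequent shows up more often in baskets containing the antecedent than in baskets overall; a lift of exactly 1 would indicate no association.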

Now, let's interpret one of the association rules from the output:

```plaintext
antecedent: [Organic Raspberries, Organic Hass Avocado, Organic Strawberries]
consequent: [Bag of Organic Bananas]
confidence: 0.5984 (59.84%)
lift: 5.0723
support: 0.0017 (0.17%)
```

Interpretation:

- **If a customer buys Organic Raspberries, Organic Hass Avocado, and Organic Strawberries (antecedent), they are 59.84% likely to buy Bag of Organic Bananas (consequent).**

- **The lift value of 5.0723 suggests that the likelihood of buying Bag of Organic Bananas is about 5 times higher when the antecedent items are present compared to the overall likelihood of buying Bag of Organic Bananas.**

- **The support value of 0.0017 indicates that this association rule is present in approximately 0.17% of all transactions.**
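
As a consistency check on these figures, the definitions can be inverted: dividing confidence by lift should recover the overall support of the consequent, and combining that with `freq([Bag of Organic Bananas]) = 15480` from the itemset table should give back roughly the ~131K baskets in the dataset. The sketch below uses only the rounded values quoted above, so the results are approximate.

```python
# Values quoted in this notebook (rounded as displayed).
confidence = 0.5984
lift = 5.0723
rule_support = 0.0017
consequent_freq = 15480  # freq([Bag of Organic Bananas])

# lift = confidence / support(consequent)  =>  support(consequent) = confidence / lift
implied_consequent_support = confidence / lift

# support(consequent) = freq / n_baskets   =>  n_baskets = freq / support(consequent)
implied_n_baskets = consequent_freq / implied_consequent_support

# Approximate number of baskets containing all four items of the rule.
implied_rule_count = rule_support * implied_n_baskets

print(f"implied support of consequent: {implied_consequent_support:.4f}")
print(f"implied number of baskets:     {implied_n_baskets:.0f}")
print(f"implied baskets matching rule: {implied_rule_count:.0f}")
```

The implied basket count lands close to 131K, which suggests the reported confidence, lift, and item frequency agree with each other. In absolute terms, the rule is backed by only a couple of hundred baskets.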

This rule and the others in the output provide insights into patterns of co-occurrence and associations between different items in the dataset, helping to understand customer behavior and potential recommendations for product placements or promotions. Adjusting the support and confidence thresholds can influence the number and strength of generated association rules.
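
The effect of the support threshold can be seen on a toy example. The sketch below enumerates itemsets by brute force, which is not FP-Growth and only works for tiny inputs, and counts how many pass two different thresholds. The baskets and the helper `frequent_itemsets` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(baskets, min_support):
    """Brute-force enumeration (NOT FP-Growth): count every sub-itemset of
    every basket, then keep those whose support reaches min_support."""
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    n = len(baskets)
    return {itemset: c for itemset, c in counts.items() if c / n >= min_support}

# Toy data (hypothetical).
baskets = [
    ["banana", "limes"],
    ["banana", "spinach"],
    ["banana", "limes", "lemon"],
    ["spinach"],
]

strict = frequent_itemsets(baskets, 0.5)
loose = frequent_itemsets(baskets, 0.25)
print(len(strict), "itemsets at min_support=0.5")
print(len(loose), "itemsets at min_support=0.25")
```

Lowering the threshold can only add itemsets, never remove them, which is the same direction of change seen above when going from 120 to 4444 frequent itemsets.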

As a sanity check, for `minSupport=0.01` you should obtain `freq([Limes]) = 6033`

In [12]:
# Create and fit the FP-Growth model with minSupport=0.01
fp_growth_sanity_check = FPGrowth(minSupport=0.01, minConfidence=0.5, itemsCol="basket_items")
model_sanity_check = fp_growth_sanity_check.fit(grouped_baskets)

# Get the frequent itemsets
frequent_itemsets_sanity_check = model_sanity_check.freqItemsets

# Keep every frequent itemset that contains 'Limes'
limes_frequency = frequent_itemsets_sanity_check.filter("array_contains(items, 'Limes')")

# The single-item row [Limes] should show freq = 6033
limes_frequency.show()

+--------------------+----+
|               items|freq|
+--------------------+----+
|             [Limes]|6033|
|[Limes, Large Lemon]|1595|
|     [Limes, Banana]|1331|
+--------------------+----+

