# Problem Statement
Market basket analysis is a technique retailers use to identify relationships between items that customers frequently purchase together. Using association rules, retailers can estimate how likely certain products are to be purchased together, which can inform product placement, cross-selling, and personalized recommendations.

In this use case, we will perform Market Basket Analysis on a transactional dataset using association rule mining. The goal is to discover patterns or associations between items that are frequently bought together. We will use the Apriori Algorithm to generate association rules from the transactional data.

# Dataset Overview
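The dataset consists of transactional data from a retail store, loaded from `groceries_dataset.csv`. Each row records one purchased item in three columns:

- `Member_number`: An identifier for the customer.
- `Date`: The purchase date.
- `itemDescription`: The name of the purchased item.

There is no explicit transaction ID. A transaction is taken to be all items that one member bought on one date, so the item list for each transaction is built by grouping rows on `Member_number` and `Date`.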

# Steps to be Covered

### 1. Problem Definition
Define the problem of identifying frequent itemsets and generating association rules to recommend or understand items frequently purchased together.

### 2. Loading and Exploring the Dataset
- Load the transactional dataset using `pandas`.
- Explore the dataset structure, count of transactions, and unique items involved.
- Identify the columns needed to group rows into transactions for association rule mining.

### 3. Data Preprocessing
- Convert the transactional data into a suitable structure for the Apriori algorithm (e.g., converting the transactional data into a one-hot encoded format).
- Ensure the data is clean and correctly formatted for the analysis.

### 4. Apply Apriori Algorithm
- Use the **Apriori algorithm** to identify frequent itemsets with a minimum support threshold.
- Analyze and interpret the results, identifying the most frequently occurring item combinations.

### 5. Generate Association Rules
- Apply the **Association Rule Mining** technique on the frequent itemsets using a confidence threshold.
- Calculate key metrics such as support, confidence, and lift for each rule.
- Extract meaningful insights from the rules, such as which items are frequently purchased together.

### 6. Interpretation of Results
- Interpret the association rules, focusing on rules with high support and confidence and using lift to check whether the association is stronger than chance.
- Highlight any interesting patterns, such as which items are often bought together or which product combinations can drive cross-selling opportunities.

### 7. Conclusion and Business Impact
- Summarize the findings from the analysis.
- Suggest potential strategies for product placement, bundling, or recommendation systems based on the discovered rules.
- Discuss how the retail store can use these insights to improve sales or customer satisfaction.


# Import Libraries and Load Dataset

In [None]:
! pip install mlxtend

In [None]:
# Import necessary libraries
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Load dataset
data = pd.read_csv('groceries_dataset.csv')

# Preview the dataset
print(data.head())
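
Before preprocessing, it can help to check the basic shape of the data. The sketch below uses a small hand-made DataFrame (hypothetical rows, not taken from `groceries_dataset.csv`) with the same three columns the later cells rely on, `Member_number`, `Date`, and `itemDescription`. It counts rows, unique items, transactions, and missing values; the same expressions should apply to `data` if the real file has this layout.

```python
import pandas as pd

# Hypothetical sample rows; the real file is assumed to have the same columns.
sample = pd.DataFrame({
    "Member_number": [1001, 1001, 1001, 1002, 1002],
    "Date": ["2015-01-05", "2015-01-05", "2015-02-10", "2015-01-05", "2015-01-05"],
    "itemDescription": ["whole milk", "yogurt", "soda", "whole milk", "rolls/buns"],
})

n_rows = len(sample)                            # one row per purchased item
n_items = sample["itemDescription"].nunique()   # distinct products
# A transaction is every item one member bought on one date.
n_transactions = sample.groupby(["Member_number", "Date"]).ngroups
n_missing = int(sample.isna().sum().sum())      # missing values across all columns

print(n_rows, n_items, n_transactions, n_missing)
```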

# Data Preprocessing

In [None]:
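# Convert the 'Date' column to datetime.
# Note: errors='coerce' turns unparseable dates into NaT, and the groupby below
# silently drops rows whose key is NaT. If the file stores dates day-first
# (e.g. DD-MM-YYYY), passing dayfirst=True avoids misparsed or coerced dates.
data['Date'] = pd.to_datetime(data['Date'], errors='coerce')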

# Group items by Member_number and Date to create a list of transactions
transactions = data.groupby(['Member_number', 'Date'])['itemDescription'].apply(list).values.tolist()

# Use TransactionEncoder to transform the list of transactions into a one-hot encoded DataFrame
te = TransactionEncoder()
te_data = te.fit(transactions).transform(transactions)
df_transactions = pd.DataFrame(te_data, columns=te.columns_)

# Preview the one-hot encoded DataFrame
print(df_transactions.head())

`TransactionEncoder`: A tool from the mlxtend library that one-hot encodes a list of transactions.

`One-hot encoding`: Converts each transaction's item list into a binary row, where:

- Each unique item is a column in the DataFrame.
- If an item is present in a transaction, the corresponding column is `True` (1); otherwise it is `False` (0). `TransactionEncoder` returns boolean values, so the preview prints `True`/`False` rather than 1/0.

This transformation makes the data suitable for algorithms like Apriori, which require transactional data in binary form.

Example: If the full item vocabulary is Milk, Bread, Butter, and Beer, a transaction containing ['Milk', 'Bread'] is encoded as the following row:

| Milk | Bread | Butter | Beer |
|------|-------|--------|------|
|  1   |   1   |    0   |   0  |
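
The row above can be reproduced with a few lines of plain Python. This is only a conceptual sketch of one-hot encoding, not how `TransactionEncoder` is implemented; the helper name `one_hot` and the four-item vocabulary are assumptions made for illustration.

```python
# Item vocabulary, in the column order used in the example table (assumed for illustration).
columns = ["Milk", "Bread", "Butter", "Beer"]

def one_hot(transaction, columns):
    """Return one 0/1 flag per column: 1 if the item is in the transaction."""
    items = set(transaction)
    return [1 if col in items else 0 for col in columns]

row = one_hot(["Milk", "Bread"], columns)
print(row)  # [1, 1, 0, 0]
```

Converting the transaction to a set first means an item listed twice in the same basket still produces a single 1.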



# Model Building (Apriori Algorithm)

In [None]:
# Mine frequent itemsets with a minimum support of 0.001 (0.1% of transactions)
frequent_itemsets = apriori(df_transactions, min_support=0.001, use_colnames=True)

if frequent_itemsets.empty:
    # association_rules raises a ValueError on an empty input, so skip rule generation
    print("No frequent itemsets found. Try lowering the support threshold.")
    rules = pd.DataFrame()
else:
    print("Frequent Itemsets")
    print(frequent_itemsets.head())

    # Generate association rules with a minimum confidence of 0.1
    rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.1)

    if rules.empty:
        print("No association rules found. Try lowering the confidence threshold.")
    else:
        print("Association Rules")
        print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].head())


#### Table 1: Frequent Itemsets

This table shows the first five rows of the **frequent itemsets** returned by the Apriori algorithm (the output of `frequent_itemsets.head()`); the rows are not sorted by support. Each itemset is an item, or a combination of items, that appears in at least the minimum fraction of transactions, listed with its **support** value. The **support** is the proportion of transactions in which that itemset appears.

| Support  | Itemsets                  |
|----------|---------------------------|
| 0.004010 | (Instant food products)    |
| 0.021386 | (UHT-milk)                 |
| 0.001470 | (abrasive cleaner)         |
| 0.001938 | (artificial sweetener)     |
| 0.000807 | (baking powder)            |

### Explanation of Columns:
- **Support**: The percentage (or fraction) of total transactions in which an itemset occurs. It is a measure of how often an item or itemset appears in the dataset.
  - Example: `UHT-milk` has a support value of **0.021386**, meaning it appears in about **2.14%** of all transactions in the dataset.

- **Itemsets**: The item or group of items whose support meets the minimum threshold. An itemset can contain a single item: for example, **UHT-milk** on its own appears in enough transactions to count as a frequent itemset.
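
To make the definition of support concrete, here is a minimal sketch that computes it by hand. The toy transactions and the helper name `support` are assumptions for illustration and are not taken from the groceries dataset.

```python
# Hypothetical toy transactions, not taken from the groceries dataset.
transactions = [
    {"whole milk", "yogurt"},
    {"whole milk", "soda"},
    {"rolls/buns"},
    {"whole milk", "yogurt", "soda"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

print(support({"whole milk"}, transactions))            # 0.75
print(support({"whole milk", "yogurt"}, transactions))  # 0.5
```

Adding an item to an itemset can only keep its support the same or lower it, which is the property Apriori relies on to discard candidates early.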

### Insights from Frequent Itemsets:
- **High support values** indicate that an item or combination of items is popular among customers.
  - Example: Among the five rows shown, **UHT-milk** has the highest support value (**0.021386**), so it is the most commonly bought of these items. Items outside this preview can have higher support.

- **Low support values** indicate items that are bought less often; they can still matter when they combine with other items to form meaningful associations.

### Importance of Frequent Itemsets:
- Frequent itemsets serve as the foundation for generating **association rules**. Once frequent itemsets are identified, we can analyze them further to discover rules that help in understanding customer purchasing behavior.
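
For intuition about how frequent itemsets are found, the following is a simplified pure-Python sketch of the level-wise Apriori idea: build candidates of size k from frequent itemsets of size k-1, prune any candidate with an infrequent subset, then count support. It is not mlxtend's implementation, and the function name `apriori_sketch` and the toy data are assumptions for illustration.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {frozenset: support} for every itemset meeting min_support."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while current:
        frequent.update({s: support(s) for s in current})
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Hypothetical toy transactions, not taken from the groceries dataset.
toy = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"milk"},
    {"bread", "butter"},
]
result = apriori_sketch(toy, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With a threshold of 0.5, the pair (milk, butter) is dropped because it appears in only one of four transactions, so the triple (milk, bread, butter) is pruned without its support ever being counted.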


#### Table 2: Association Rules

This table presents the first five **association rules** generated from the frequent itemsets. Each rule states that when a certain item (or set of items, called the **antecedent**) is bought, another item or set of items (called the **consequent**) is also bought with some measured frequency. The table also provides key metrics for evaluating these rules:

| Antecedents | Consequents       | Support | Confidence | Lift    |
|-------------|-------------------|---------|------------|---------|
| (UHT-milk)  | (other vegetables) | 0.002139 | 0.100000   | 0.818993 |
| (UHT-milk)  | (whole milk)       | 0.002540 | 0.118750   | 0.751949 |
| (beef)      | (whole milk)       | 0.004678 | 0.137795   | 0.872548 |
| (berries)   | (other vegetables) | 0.002673 | 0.122699   | 1.004899 |
| (berries)   | (whole milk)       | 0.002272 | 0.104294   | 0.660414 |

### Key Metrics:
- **Support**: The percentage of transactions that contain both the antecedent and the consequent.
  - Example: The rule `(UHT-milk) -> (whole milk)` has a support of 0.002540, meaning both items are bought together in 0.25% of all transactions.
  
- **Confidence**: The probability that the consequent is purchased when the antecedent is purchased.
  - Example: The rule `(beef) -> (whole milk)` has a confidence of 0.137795, meaning that when beef is purchased, there is about a 13.78% chance that whole milk is also purchased.
  
- **Lift**: The ratio of the observed support to the support expected if the antecedent and consequent were independent. A lift greater than 1 implies a positive association (the items are bought together more often than chance would predict), a lift of about 1 implies independence, and a lift below 1 implies the items are bought together less often than chance would predict.
  - Example: The rule `(berries) -> (other vegetables)` has a lift of 1.004899, which is so close to 1 that the two items are bought together about as often as independence would predict. The other four rules in the table have lift below 1.
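
The three metrics can be computed directly from transaction counts. The sketch below is a hand-rolled illustration, not mlxtend's code; the function name `rule_metrics` and the toy transactions are assumptions, and the function assumes the antecedent and consequent each occur at least once.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= t)
    n_c = sum(1 for t in transactions if c <= t)
    n_both = sum(1 for t in transactions if (a | c) <= t)
    support = n_both / n
    confidence = n_both / n_a       # P(consequent | antecedent)
    lift = confidence / (n_c / n)   # confidence relative to the consequent's base rate
    return support, confidence, lift

# Hypothetical toy transactions, not taken from the groceries dataset.
toy = [
    {"beef", "whole milk"},
    {"beef"},
    {"whole milk"},
    {"whole milk", "soda"},
]
s, conf, lift = rule_metrics(toy, {"beef"}, {"whole milk"})
print(s, conf, round(lift, 3))  # support 0.25, confidence 0.5, lift about 0.667

# Because lift = confidence / support(consequent), the (UHT-milk) -> (whole milk)
# row of Table 2 implies the support of whole milk: 0.118750 / 0.751949.
implied_whole_milk_support = 0.118750 / 0.751949
print(round(implied_whole_milk_support, 3))  # 0.158
```

The last two lines show why most rules with whole milk as the consequent have lift below 1: whole milk appears in roughly 16% of transactions on its own, so a confidence of 10% to 14% is below its base rate.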

### Summary:
- **Frequent itemsets** show which items are often bought together.
- **Association rules** help to understand the likelihood of certain items being bought together based on historical data.
- **Support**, **confidence**, and **lift** are key metrics to assess how often items are bought together and how strong the associations are.

These tables help businesses make decisions about product placement, bundling, and recommendations based on customer purchasing patterns.

# Interpretation:
- Higher support means that the itemset is more frequent in the transactions.
- Higher confidence means that when the antecedent is purchased, the consequent is also purchased more frequently.
- Lift > 1 suggests that the items are positively associated; lift < 1 suggests they are bought together less often than expected by chance.
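
As a sketch of how these guidelines translate into a filter, the snippet below re-enters the five rules from Table 2 by hand and keeps only those with lift above 1. For simplicity the antecedents and consequents are plain strings here, whereas the `rules` DataFrame produced by mlxtend stores them as frozensets; the variable names `table2` and `positive` are assumptions for illustration.

```python
import pandas as pd

# The five rules from Table 2, typed in by hand for illustration.
table2 = pd.DataFrame({
    "antecedents": ["UHT-milk", "UHT-milk", "beef", "berries", "berries"],
    "consequents": ["other vegetables", "whole milk", "whole milk",
                    "other vegetables", "whole milk"],
    "support": [0.002139, 0.002540, 0.004678, 0.002673, 0.002272],
    "confidence": [0.100000, 0.118750, 0.137795, 0.122699, 0.104294],
    "lift": [0.818993, 0.751949, 0.872548, 1.004899, 0.660414],
})

# Keep only positively associated rules and rank the strongest first.
positive = table2[table2["lift"] > 1].sort_values("lift", ascending=False)
print(positive[["antecedents", "consequents", "lift"]])
```

Only the `berries` and `other vegetables` rule survives the filter, and its lift is barely above 1, so none of the five previewed rules shows a strong positive association.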


# Model Saving

In [24]:
# Saving the frequent itemsets and rules for later use
import pickle

# Save the frequent itemsets
with open('frequent_itemsets.pkl', 'wb') as f:
    pickle.dump(frequent_itemsets, f)

# Save the association rules
with open('association_rules.pkl', 'wb') as f:
    pickle.dump(rules, f)


# Predictions Using Saved Model

In [25]:
# Load the saved association rules
import pickle

with open('association_rules.pkl', 'rb') as f:
    loaded_rules = pickle.load(f)

# Example: find products that are likely to be bought together with 'whole milk'.
# Each antecedent is a frozenset of item names, so test membership directly
# instead of searching its string representation.
milk_related_rules = loaded_rules[loaded_rules['antecedents'].apply(lambda items: 'whole milk' in items)]

# Display the rules related to 'whole milk'
print(milk_related_rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])


| Index | Antecedents           | Consequents        | Support  | Confidence | Lift     |
|-------|-----------------------|--------------------|----------|------------|----------|
| 119   | (yogurt, whole milk)  | (other vegetables) | 0.001136 | 0.101796   | 0.833705 |
| 120   | (whole milk, sausage) | (rolls/buns)       | 0.001136 | 0.126866   | 1.153275 |
| 123   | (yogurt, whole milk)  | (rolls/buns)       | 0.001337 | 0.119760   | 1.088685 |
| 126   | (whole milk, sausage) | (soda)             | 0.001069 | 0.119403   | 1.229612 |
| 127   | (yogurt, whole milk)  | (sausage)          | 0.001470 | 0.131737   | 2.182917 |
| 129   | (whole milk, sausage) | (yogurt)           | 0.001470 | 0.164179   | 1.911760 |
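
One way to turn such rules into recommendations is to match a customer's basket against the rule antecedents. The sketch below hard-codes the six rules from the output above as plain tuples so it runs on its own; the function name `recommend` and its `min_lift` parameter are hypothetical and not part of mlxtend.

```python
# The six rules from the output above, typed in by hand as
# (antecedent, consequent, lift) tuples.
RULES = [
    (frozenset({"yogurt", "whole milk"}), "other vegetables", 0.833705),
    (frozenset({"whole milk", "sausage"}), "rolls/buns", 1.153275),
    (frozenset({"yogurt", "whole milk"}), "rolls/buns", 1.088685),
    (frozenset({"whole milk", "sausage"}), "soda", 1.229612),
    (frozenset({"yogurt", "whole milk"}), "sausage", 2.182917),
    (frozenset({"whole milk", "sausage"}), "yogurt", 1.911760),
]

def recommend(basket, rules, min_lift=1.0):
    """Suggest items whose rule antecedent is fully contained in the basket."""
    basket = set(basket)
    matches = [
        (lift, item)
        for antecedent, item, lift in rules
        if antecedent <= basket and item not in basket and lift > min_lift
    ]
    # Strongest association first.
    return [item for lift, item in sorted(matches, reverse=True)]

suggestions = recommend({"whole milk", "yogurt"}, RULES)
print(suggestions)  # ['sausage', 'rolls/buns']
```

Ranking by lift rather than confidence keeps very popular items from dominating the suggestions just because they are common; `other vegetables` is excluded here because its lift for this basket is below 1.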
