# Association Rule Mining and Analysis

This notebook covers data preprocessing, exploratory analysis, association rule mining with the Apriori algorithm, and interpretation of the resulting rules.

## Task 1: Data Preprocessing

In [2]:

# Load the dataset
import pandas as pd

data = pd.read_csv('Online retail.csv', encoding='ISO-8859-1')

# Drop rows with missing values, then remove exact duplicate rows
data.dropna(inplace=True)
data.drop_duplicates(inplace=True)

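# Keep only rows with a positive quantity (drops returns and zero-quantity lines).
# NOTE: this raises KeyError if the CSV has no 'Quantity' column (see the output
# below); inspect data.columns to confirm the file's actual schema first.
# .copy() avoids a SettingWithCopyWarning when columns are assigned later.
data = data[data['Quantity'] > 0].copy()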

# Convert InvoiceNo to string for processing
data['InvoiceNo'] = data['InvoiceNo'].astype('str')

# Check the first few rows of the cleaned dataset
data.head()


KeyError: 'Quantity'

## Task 2: Exploratory Data Analysis (EDA)

In [None]:

import matplotlib.pyplot as plt
import seaborn as sns

# Visualize the distribution of products purchased
plt.figure(figsize=(10, 6))
sns.histplot(data['Quantity'], bins=50, kde=True)
plt.title('Distribution of Product Quantities Purchased')
plt.show()

# Visualize the total revenue per country
data['Revenue'] = data['Quantity'] * data['UnitPrice']
country_revenue = data.groupby('Country')['Revenue'].sum().sort_values(ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x=country_revenue.values, y=country_revenue.index)
plt.title('Total Revenue by Country')
plt.show()


## Task 3: Association Rule Mining with Apriori (To be implemented locally)

In [None]:

# NOTE: You'll need to install mlxtend using pip: 
# !pip install mlxtend
# Then, you can proceed with the Apriori algorithm as follows:

# from mlxtend.frequent_patterns import apriori, association_rules

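# Prepare data for association rule mining: one row per invoice, one column per product
# basket = (data.groupby(['InvoiceNo', 'Description'])['Quantity'].sum().unstack().reset_index().fillna(0).set_index('InvoiceNo'))
# Encode as booleans: True if the product appears on the invoice at least once.
# (A vectorized comparison replaces DataFrame.applymap, which is deprecated in pandas >= 2.1.)
# basket_encoded = basket >= 1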

# Apply Apriori algorithm with min support of 0.01
# frequent_itemsets = apriori(basket_encoded, min_support=0.01, use_colnames=True)

# Generate association rules with min confidence of 0.2
# rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.2)

# Sort rules by lift and display the top 10
# rules = rules.sort_values('lift', ascending=False)
# rules.head(10)
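
Since the mlxtend cell above is left commented out, the following is a minimal pure-Python sketch of the level-wise idea behind Apriori, run on a made-up four-transaction dataset. It is not mlxtend's implementation; the function name `apriori_frequent_itemsets` and the toy items are assumptions introduced only for illustration.

```python
from itertools import combinations


def apriori_frequent_itemsets(transactions, min_support):
    """Return {frozenset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item in the itemset
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        # Count support only for the surviving candidates
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy data, not taken from the retail dataset
toy = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent = apriori_frequent_itemsets(toy, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

The prune step is what keeps the search tractable: `{bread, milk, butter}` is never counted here because its subset `{milk, butter}` is already known to be infrequent.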


## Task 4: Analysis of Association Rules (To be implemented locally)

In [None]:

# After running the Apriori algorithm, you can visualize the top association rules as follows:

# import matplotlib.pyplot as plt
# import seaborn as sns

# Visualize the top 10 association rules based on lift
# plt.figure(figsize=(10, 6))
# sns.barplot(x=rules['lift'].head(10), y=rules['consequents'].head(10).astype(str))
# plt.title('Top 10 Association Rules by Lift')
# plt.xlabel('Lift')
# plt.ylabel('Consequent Items')
# plt.show()


## Task 5: Interview Questions and Answers


**1. What is Lift and why is it important in Association Rules?**

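**Answer**: Lift measures how much more often the antecedent and consequent occur together than would be expected if they were statistically independent: Lift(X -> Y) = Support(X and Y) / (Support(X) * Support(Y)). A lift greater than 1 indicates a positive association, meaning that the occurrence of the antecedent increases the likelihood of the consequent; a lift of 1 indicates independence; and a lift below 1 indicates that the items tend not to occur together. Unlike confidence, lift accounts for how common the consequent already is, so it helps separate informative rules from rules that merely point at popular items.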
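
As a rough illustration of the answer above, lift can be computed by hand on a few toy transactions. The data and the helper names `support` and `lift` are made up for this sketch and are not part of the notebook's dataset or of mlxtend.

```python
# Hypothetical toy transactions
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "milk"},
    {"bread"},
    {"milk"},
    {"butter", "bread"},
]


def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def lift(antecedent, consequent, transactions):
    """Observed co-occurrence divided by the co-occurrence expected under independence."""
    joint = support(set(antecedent) | set(consequent), transactions)
    expected = support(antecedent, transactions) * support(consequent, transactions)
    return joint / expected


lift_bread_butter = lift({"bread"}, {"butter"}, transactions)  # above 1: bought together more than expected
lift_bread_milk = lift({"bread"}, {"milk"}, transactions)      # below 1: bought together less than expected
print(round(lift_bread_butter, 3), round(lift_bread_milk, 3))
```

Lift is symmetric in its two arguments, so swapping antecedent and consequent gives the same value; direction only matters for confidence.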

**2. What is Support and Confidence? How do you calculate them?**

**Answer**: Support is the proportion of transactions that contain a particular itemset, so it indicates how frequently an item or set of items appears in the dataset. Confidence estimates the conditional probability that the consequent occurs given that the antecedent has occurred. Support filters out rare combinations, while confidence measures how reliable a rule is; note that confidence is directional, so Confidence(X -> Y) and Confidence(Y -> X) generally differ.

- **Support**: Support(X) = (Transactions containing X) / (Total Transactions)
- **Confidence**: Confidence(X -> Y) = Support(X and Y) / Support(X), where Support(X and Y) counts transactions containing both X and Y
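
The two formulas can be worked through on a small example. The sketch below uses invented transactions and exact fractions so the arithmetic is easy to follow; the helper names are illustrative only.

```python
from fractions import Fraction

# Hypothetical toy transactions
transactions = [
    {"tea", "sugar"},
    {"tea", "sugar", "lemon"},
    {"tea"},
    {"tea"},
    {"coffee", "sugar"},
]


def support(itemset, transactions):
    """Support(X) = (transactions containing X) / (total transactions)."""
    count = sum(1 for t in transactions if set(itemset) <= t)
    return Fraction(count, len(transactions))


def confidence(antecedent, consequent, transactions):
    """Confidence(X -> Y) = Support(X and Y) / Support(X)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


support_tea = support({"tea"}, transactions)                 # 4 of 5 transactions
conf_tea_sugar = confidence({"tea"}, {"sugar"}, transactions)  # (2/5) / (4/5)
conf_sugar_tea = confidence({"sugar"}, {"tea"}, transactions)  # (2/5) / (3/5)
print(support_tea, conf_tea_sugar, conf_sugar_tea)
```

The two confidence values differ even though they describe the same pair of items, which is why a rule's direction has to be stated.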

**3. What are some limitations or challenges of Association Rules mining?**

**Answer**: Some challenges include:
- Handling large datasets: Frequent itemset generation can be computationally expensive.
- Selecting appropriate thresholds: It can be difficult to set meaningful thresholds for support, confidence, and lift.
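- Too many rules: Low thresholds can produce thousands of rules, many of them redundant or the result of chance co-occurrence, so the output may not generalize to new transactions or be practically useful.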
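
To give a sense of the computational cost mentioned in the first point, the sketch below counts how many itemsets exist before any support-based pruning. The item counts are arbitrary illustrative values, and `candidate_itemsets` is a name introduced here, not a library function.

```python
from math import comb


def candidate_itemsets(n_items, max_size):
    """Number of distinct itemsets with between 1 and max_size items."""
    return sum(comb(n_items, k) for k in range(1, max_size + 1))


# With 100 distinct products, itemsets of up to 3 items
up_to_three = candidate_itemsets(100, 3)

# With only 20 distinct products, itemsets of every possible size
all_sizes = candidate_itemsets(20, 20)

print(up_to_three, all_sizes)
```

This growth is why a minimum support threshold matters in practice: it lets Apriori discard most of these candidates without ever counting them.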
