# Midterm Project (CS634)

Name - Ashot Kirakosyan<br>
NJIT ID - ak2095<br>
Email - ak2995@njit.edu<br>
Date: 10/13/2024<br>
Professor - Yasser Abduallah

### Abstract


This report discusses a Python implementation for performing association rule mining using three different algorithms: Brute Force, Apriori, and FP-Growth. The goal is to analyze transactions from selected datasets to identify frequently purchased itemsets and generate association rules based on user-defined parameters.

### Dataset Selection

The available datasets are:<br>
1. Amazon (created from an example in Canvas)<br>
2. Farmmarket (created with ChatGPT; experimental)<br>
3. Wholefoods (the largest dataset, with 1000 transactions and a total of 20 items; downloaded from GitHub user luoyetx: https://github.com/luoyetx/Apriori/blob/master/data.csv)<br>
4. Bestbuy (created from an example in Canvas)<br>
5. Kmart (created from an example in Canvas)<br>

The user selects a dataset and enters two parameters: minimum support and minimum confidence. Minimum support sets how often an itemset must appear to count as frequent, and minimum confidence sets how strong an association rule must be to be reported.
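
For reference, the two parameters are usually defined as shown below. The symbols are notation introduced here for illustration and do not appear in the code: $D$ is the list of transactions, and $X$ and $Y$ are disjoint itemsets.

```latex
\mathrm{supp}(X) = \frac{\left|\{\, t \in D : X \subseteq t \,\}\right|}{|D|},
\qquad
\mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}
```

An itemset is frequent when its support is at least the minimum support, and a rule is kept when its confidence is at least the minimum confidence.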

### User Input Functionality

The function `select_store_and_input_parameters` allows users to:<br>
1. Select a dataset.<br>
2. Confirm their selection.<br>
3. Input the minimum support and confidence, with validation checks to ensure values are within the acceptable range (0 to 1).
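
The range check in step 3 can be sketched as a small helper that is separate from the `input()` call, which makes it easy to test. The name `parse_fraction` is hypothetical and is not part of the project code; this is only a minimal sketch of the same validation idea.

```python
def parse_fraction(text, name):
    """Convert user text to a float in [0, 1], or raise ValueError."""
    value = float(text)  # raises ValueError for non-numeric text
    if not 0 <= value <= 1:
        raise ValueError(f"{name} must be between 0 and 1.")
    return value


# Example: valid input is returned as a float, invalid input is reported.
print(parse_fraction("0.3", "Minimum support"))
try:
    parse_fraction("1.5", "Minimum confidence")
except ValueError as error:
    print(f"Input Error: {error}")
```

Keeping the check in one function means the same rule is applied to both parameters.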

### Data Preparation

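Once the dataset is selected, the corresponding CSV files (the transaction file and the item list) are read into Pandas DataFrames. The transactions are preprocessed to:<br>
1. Replace stray characters left by the file encoding (such as \x92, which becomes an apostrophe).<br>
2. Ensure items are in a consistent string format, free of duplicates, and sorted according to the alphabetically sorted item list of the store.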
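
The preprocessing steps above can be sketched on a single transaction string as follows. The helper name `clean_transaction` and the sample items are assumptions made for illustration; the project code performs the equivalent steps inline.

```python
def clean_transaction(line, order):
    """Split a comma-separated transaction, repair \x92, drop duplicates, sort by `order`."""
    items = {
        item.strip().replace("\x92", "'")
        for item in line.split(",")
        if item.strip()
    }
    # Precompute each item's position so sorting does not rescan the list.
    rank = {name: position for position, name in enumerate(order)}
    # Items missing from `order` are placed last, in alphabetical order.
    return sorted(items, key=lambda name: (rank.get(name, len(rank)), name))


order = ["apples", "cracker", "heineken"]
print(clean_transaction(" heineken, apples ,cracker, apples", order))
```

Using a dictionary for the item positions avoids calling `list.index` for every item, which matters only for long item lists.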

### Frequent Itemset Generation

The implementation provides three approaches to generate frequent itemsets:

1. Brute Force Approach: This method enumerates every combination of items within each transaction and counts how often each one occurs. It then keeps the itemsets whose support meets the minimum support.<br>
2. Apriori Algorithm: Using the apriori_python library, the algorithm reduces the search space through the downward-closure property of support: an itemset can be frequent only if all of its subsets are frequent.<br>
3. FP-Growth Algorithm: Using the pyfpgrowth library, this algorithm constructs a frequent pattern tree (FP-tree) to discover frequent itemsets without candidate generation, which typically makes it faster on larger datasets.
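
To illustrate how the Apriori pruning idea shrinks the search space, here is a minimal level-wise sketch on toy data. It is not the apriori_python implementation; the function name `apriori_sketch` and the toy transactions are assumptions made for this example.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Level-wise search: size-k candidates come only from frequent (k-1)-itemsets."""
    tsets = [frozenset(t) for t in transactions]
    n = len(tsets)

    def support(itemset):
        return sum(itemset <= t for t in tsets) / n

    items = {item for t in tsets for item in t}
    level = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(itemset)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
        level = {c for c in candidates if support(c) >= min_support}
    return frequent


toy = [["bread", "milk"], ["bread", "butter"], ["bread", "milk", "butter"], ["milk"]]
for itemset, value in apriori_sketch(toy, 0.5).items():
    print(sorted(itemset), value)
```

In this toy run the candidate {bread, butter, milk} is discarded in the prune step without scanning the transactions, because its subset {butter, milk} is not frequent.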

### Association Rule Generation

For each frequent itemset, association rules are generated based on the specified minimum confidence. The process is:<br>
1. For each frequent itemset, every non-empty proper subset is taken as a possible antecedent, and the consequent is the rest of the itemset.<br>
2. The confidence for each rule is calculated as the ratio of the support of the itemset to the support of the antecedent.<br>
3. Rules that meet or exceed the minimum confidence are retained.
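
The steps above can be sketched with a table of itemset counts. The counts below are taken from the sample run later in this report (cracker: 488, heineken: 600, both together: 366). The function name `rules_from_counts` is hypothetical, and looking up antecedent counts in the table is a design choice of this sketch, not necessarily how the project code does it.

```python
from itertools import combinations


def rules_from_counts(counts, min_confidence):
    """Derive rules from a dict that maps frozenset itemsets to transaction counts."""
    rules = []
    for itemset, count in counts.items():
        for k in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), k)):
                # Every subset of a frequent itemset is frequent, so its count
                # is already in the table and no dataset rescan is needed.
                confidence = count / counts[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, confidence))
    return rules


counts = {
    frozenset({"cracker"}): 488,
    frozenset({"heineken"}): 600,
    frozenset({"cracker", "heineken"}): 366,
}
for antecedent, consequent, confidence in rules_from_counts(counts, 0.3):
    print(sorted(antecedent), "->", sorted(consequent), f"Confidence: {confidence:.4f}")
```

For the rule cracker -> heineken, the confidence is 366 / 488 = 0.75, which is above the 0.3 threshold used in the sample run.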

### Results

1. Brute Force Frequent Itemsets: The results list frequent itemsets along with their counts and support values.<br>
2. Brute Force Association Rules: The rules are displayed in the format of antecedent -> consequent, showing their support and confidence. <br>
3. Apriori Frequent Itemsets and Rules: The same analysis is run with the Apriori algorithm, so its itemsets and rules can be checked against the brute-force results.<br>
4. FP-Growth Frequent Itemsets and Rules: The FP-Growth results list each frequent itemset with its count and each rule with its confidence.

### Requirements

To run this program, the following software is required:<br>
Python: Version 3.6 or higher.<br>
Libraries:<br>
1. pandas<br>
2. numpy<br>
3. pyfpgrowth<br>
4. apriori_python

### Installation Instructions

1. Install Python: Download and install Python from the official website: https://www.python.org/downloads/<br>
2. Open a command-line interface (CLI):<br>
    . Windows: Search for "cmd" in the Start menu.<br>
    . macOS/Linux: Open the Terminal application.<br>
3. Install the required libraries with the following command:<br>
    pip install pandas numpy pyfpgrowth apriori_python
    

### Download the Repository

Link to the repository: https://github.com/Ash-K-97/Kirakosyan_Ashot.Midtermproject<br>
Download the zip file and extract all files into one folder.<br>
Read the README file and follow the instructions.

### How to Run the Program

1. After downloading the repository and extracting the files, move the folder to a directory of your choice.<br>
2. In the CLI, navigate to the directory containing the script:<br>
    . Example: cd C:\Users\YourName\Documents\Kirakosyan_Ashot.Midtermproject<br>
    . Here, YourName is the name of the user.<br>
3. After running the cd command, you should be in the directory that contains the script.<br>
4. To check which Python files are in this directory, use the command: dir (on macOS/Linux, use ls)<br>
5. Once you see the Python file you want to run, execute it by typing: python Midtermproject_code.py



### Program Workflow

1. Store Selection: The program prompts the user to select a store from a list.<br>
2. Parameter Input: Users are prompted to input the minimum support and confidence values.<br>
3. Data Loading: The program reads the selected store's transaction and item list CSV files.<br>
4. Algorithm Execution:<br>
    . Brute Force: Computes frequent itemsets and generates association rules.<br>
    . Apriori: Uses the Apriori algorithm to find frequent itemsets and rules.<br>
    . FP-Growth: Uses the FP-Growth algorithm for the same purpose.<br>
5. Output Results: The program displays frequent itemsets and association rules for each algorithm, along with their execution times.
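
The execution-time measurement in step 5 can be sketched with a small wrapper. The helper name `timed` is hypothetical, and `time.perf_counter` is used here as one reasonable choice of clock for short intervals; the project code may measure time differently.

```python
import time


def timed(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Example with a stand-in workload; any of the three algorithms could be passed instead.
total, seconds = timed(sum, range(1_000_000))
print(f"sum returned {total} in {seconds:.4f} seconds")
```

Timing only the function call, and printing afterwards, keeps console output from inflating the measured time.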

### Program Code and Sample Run

import pandas as pd
import itertools
import numpy as np
from collections import defaultdict
import pyfpgrowth
from apriori_python.apriori import apriori
import time  
print("Hello, welcome to my Midterm project")
# Create Dataset
datasetlist = ['Amazon', 'Farmmarket','Wholefood','Bestbuy','Kmart']

# Function to confirm the selected store and input min_support and min_confidence
def select_store_and_input_parameters():
    while True:
        try:
            selected_file = int(input("Please enter the index of the store you want to check (0 for Amazon, 1 for Farmmarket, 2 for Wholefood, 3 for Bestbuy, 4 for Kmart): "))
            if selected_file < 0 or selected_file >= len(datasetlist):
                print("Invalid selection. Please select a valid store.")
                continue
            print(f"You selected store: {datasetlist[selected_file]}")
            confirmation = input("Is this correct? (yes/no): ").strip().lower()
            if confirmation == 'yes':
                min_support = float(input("Enter minimum support (as a decimal between 0 and 1): "))
                if not (0 <= min_support <= 1):
                    raise ValueError("Minimum support must be between 0 and 1.")
                min_confidence = float(input("Enter minimum confidence (as a decimal between 0 and 1): "))
                if not (0 <= min_confidence <= 1):
                    raise ValueError("Minimum confidence must be between 0 and 1.")
                return selected_file, min_support, min_confidence
            elif confirmation == 'no':
                print("Returning to store selection...\n")
                continue
            else:
                print("Invalid input. Please type 'yes' or 'no'.")
        except ValueError as e:
            print(f"Input Error: {e}")
            continue

# Call the function to get user input for store selection, min_support, and min_confidence
selected_file, min_support, min_confidence = select_store_and_input_parameters()

print(f"Proceeding with store: {datasetlist[selected_file]}")
print(f"Minimum support: {min_support}, Minimum confidence: {min_confidence}")

# Open and read the corresponding CSV file
file_name = 'data_' + datasetlist[selected_file] + '.csv'
list_name = 'datalist_' + datasetlist[selected_file] + '.csv'
print(file_name)

# Load the data into a DataFrame
df = pd.read_csv(file_name, encoding='ISO-8859-1')
df_list = pd.read_csv(list_name, encoding='ISO-8859-1')

# Prepare the order list and dataset
order = sorted(df_list['Item.name'].astype(str))
dataset = []

for lines in df['Transaction']:
    trans = [str(item.strip().replace('\x92', "'")) for item in lines.strip().split(',')]
    trans_1 = sorted(np.unique(trans), key=lambda x: order.index(x) if x in order else float('inf'))
    dataset.append(trans_1)

# Brute Force: Function to get frequent itemsets using brute force
def get_frequent_itemsets(dataset, min_support):
    itemset_counts = defaultdict(int)
    num_transactions = len(dataset)

    for transaction in dataset:
        for k in range(1, len(transaction) + 1):  
            for itemset in itertools.combinations(transaction, k):
                itemset_counts[itemset] += 1

    frequent_itemsets = {itemset: count for itemset, count in itemset_counts.items() if count / num_transactions >= min_support}
    return frequent_itemsets

# Function to generate association rules from frequent itemsets
def generate_association_rules(frequent_itemsets, dataset, min_confidence):
    rules = []
    num_transactions = len(dataset)

    for itemset, count in frequent_itemsets.items():
        for k in range(1, len(itemset)):
            for antecedent in itertools.combinations(itemset, k):
                antecedent = set(antecedent)
                consequent = set(itemset) - antecedent

                if len(consequent) > 0:
                    antecedent_count = sum(1 for transaction in dataset if antecedent.issubset(set(transaction)))
                    rule_support = count
                    rule_confidence = rule_support / antecedent_count if antecedent_count > 0 else 0

                    if rule_confidence >= min_confidence:
                        rules.append((antecedent, consequent, rule_support, rule_confidence))

    return rules

# Measure the performance of Brute Force and execute the function.
start_time = time.time()
frequent_itemsets = get_frequent_itemsets(dataset, min_support)
brute_force_rules = generate_association_rules(frequent_itemsets, dataset, min_confidence)


print("\nBrute Force Frequent Itemsets:")
for itemset, count in frequent_itemsets.items():
    print(f"Itemset: {set(map(str, itemset))}, Count: {count}, Support: {count / len(dataset):.4f}")

print("\nBrute Force Association Rules (Antecedent -> Consequent):")
for antecedent, consequent, support, confidence in brute_force_rules:
    antecedent_str = set(map(str, antecedent))
    consequent_str = set(map(str, consequent))
    print(f"{antecedent_str} -> {consequent_str}, Support: {support / len(dataset):.4f}, Confidence: {confidence:.4f}")
brute_force_time = time.time() - start_time
# Measure the performance of Apriori algorithm and execute the function.
start_time = time.time()
frequent_itemsets_apriori, apriori_rules = apriori(dataset, minSup=min_support, minConf=min_confidence)

print("\nApriori Frequent Itemsets:")
# apriori_python keys its frequent-itemset dict by itemset size, not by support.
for size, itemsets in frequent_itemsets_apriori.items():
    for itemset in itemsets:
        itemset_str = set(map(str, itemset))
        print(f"Itemset: {itemset_str}, Size: {size}")

if isinstance(apriori_rules, list) and len(apriori_rules) > 0:
    print("\nApriori Association Rules using apriori-python (Antecedent -> Consequent):")
    for i, rule in enumerate(apriori_rules):
        antecedent = set(map(str, rule[0]))
        consequent = set(map(str, rule[1]))
        confidence = rule[2]
        print(f"Rule {i + 1}: {antecedent} -> {consequent}, Confidence: {confidence:.4f}")
apriori_time = time.time() - start_time
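# Measure the performance of FP-Growth and execute the function.
start_time = time.time()
def run_fp_growth(dataset, min_support, min_confidence):
    # pyfpgrowth expects an absolute count. Round up so the threshold is not
    # weaker than the fractional min_support used by the other two methods.
    min_support_count = int(np.ceil(min_support * len(dataset)))
    patterns = pyfpgrowth.find_frequent_patterns(dataset, min_support_count)
    # min_confidence is passed in explicitly instead of being read as a global.
    rules = pyfpgrowth.generate_association_rules(patterns, min_confidence)
    return patterns, rules


patterns_fp_growth, rules_fp_growth = run_fp_growth(dataset, min_support, min_confidence)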

print("\nFP-Growth Frequent Itemsets:")
for pattern, count in patterns_fp_growth.items():
    pattern_str = set(map(str, pattern))
    print(f"Itemset: {pattern_str}, Count: {count}")

print("\nFP-Growth Association Rules (Antecedent -> Consequent):")
for antecedent, (consequent, confidence) in rules_fp_growth.items():
    antecedent_str = set(map(str, antecedent))
    consequent_str = set(map(str, consequent))
    print(f"{antecedent_str} -> {consequent_str}, Confidence: {confidence:.4f}")
fp_growth_time = time.time() - start_time

# Performance Summary
print("\nPerformance Summary:")
print(f"Brute Force Execution Time: {brute_force_time:.4f} seconds")
print(f"Apriori Execution Time: {apriori_time:.4f} seconds")
print(f"FP-Growth Execution Time: {fp_growth_time:.4f} seconds")


Hello, welcome to my Midterm project


Please enter the index of the store you want to check (0 for Amazon, 1 for Farmmarket, 2 for Wholefood, 3 for Bestbuy, 4 for Kmart):  2


You selected store: Wholefood


Is this correct? (yes/no):  yes
Enter minimum support (as a decimal between 0 and 1):  0.3
Enter minimum confidence (as a decimal between 0 and 1):  0.3


Proceeding with store: Wholefood
Minimum support: 0.3, Minimum confidence: 0.3
data_Wholefood.csv

Brute Force Frequent Itemsets:
Itemset: {'apples'}, Count: 314, Support: 0.3137
Itemset: {'bourbon'}, Count: 403, Support: 0.4026
Itemset: {'chicken'}, Count: 315, Support: 0.3147
Itemset: {'corned_b'}, Count: 391, Support: 0.3906
Itemset: {'cracker'}, Count: 488, Support: 0.4875
Itemset: {'baguette'}, Count: 392, Support: 0.3916
Itemset: {'ham'}, Count: 305, Support: 0.3047
Itemset: {'ice_crea'}, Count: 313, Support: 0.3127
Itemset: {'olives'}, Count: 473, Support: 0.4725
Itemset: {'hering'}, Count: 486, Support: 0.4855
Itemset: {'avocado'}, Count: 363, Support: 0.3626
Itemset: {'heineken'}, Count: 600, Support: 0.5994
Itemset: {'soda'}, Count: 318, Support: 0.3177
Itemset: {'cracker', 'heineken'}, Count: 366, Support: 0.3656
Itemset: {'artichok'}, Count: 305, Support: 0.3047

Brute Force Association Rules (Antecedent -> Consequent):
{'cracker'} -> {'heineken'}, Support: 0.3656, Confiden

### Conclusion

This report presents a Python implementation of association rule mining using three algorithms. Brute force is the simplest but enumerates every subset of every transaction, Apriori prunes candidates using the support property, and FP-Growth avoids candidate generation by building an FP-tree, so the best choice depends on dataset size and complexity. The ability to select datasets and set the support and confidence thresholds makes the tool flexible for market basket analysis and similar tasks.