# Case Study Assignment

## Problem statement

An organization wants to mine association rules for items that are frequently bought together in its stores and use them to make recommendations to its customers.
As a data scientist, you are required to recognize patterns in the available data and evaluate the efficacy of the methods used to obtain those patterns.

Your activities should include:
1. Preparing the dataset for analysis
2. Investigating the relationships in the dataset with visualization
3. Identifying frequent patterns
4. Formulating association rules
5. Evaluating the quality of the rules

## Importing necessary libraries

In [43]:
# Import "os" library as it provides functions for interacting with the operating system
import os
# Import "numpy" library as it supports large, multi-dimensional arrays & matrices
import numpy as np
# Import "pandas" library as it supports data manipulation and analysis
import pandas as pd
# Import "TransactionEncoder" - Encodes database transaction data in form of a Python list of lists into a NumPy array
from mlxtend.preprocessing import TransactionEncoder
# Import "apriori" - algorithms for frequent itemset generation
# Import "fpgrowth"- frequent pattern generation algorithm that inserts items into a pattern search tree
# Import "fpmax" - variant of FP-Growth, which focuses on obtaining maximal itemsets
from mlxtend.frequent_patterns import apriori, fpmax, fpgrowth
# Import "matplotlib" -  comprehensive library for creating static, animated, and interactive visualizations
import matplotlib

In [53]:
# getcwd() returns current working directory of a process
# listdir() returns a list containing the names of the entries in the directory given by path
os.listdir(os.getcwd())

['.ipynb_checkpoints',
 'Assignment 3',
 'Assignment3_G013.ipynb',
 'Dataset.xlsx',
 'desktop.ini',
 'DM_GROUP013.docx',
 '~$oblem Bank 13.docx',
 '~$_GROUP013.docx']

## Perform exploratory data analysis

## Reading data from Dataset.xlsx

In [54]:
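# Read the input file Dataset.xlsx, which holds all the transactions, into a pandas DataFrame
# openpyxl is needed as the engine to read .xlsx files
# Note: read_excel treats the first row as the header by default, so the first
# transaction ends up as the column names; pass header=None to keep it as data
df = pd.read_excel("Dataset.xlsx", engine="openpyxl")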

In [4]:
# Install openpyxl, a Python library for reading and writing Excel 2010 xlsx/xlsm/xltx/xltm files
!pip install openpyxl



In [55]:
# info() function is used to print a concise summary of a DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 20 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   shrimp             7500 non-null   object 
 1   almonds            5746 non-null   object 
 2   avocado            4388 non-null   object 
 3   vegetables mix     3344 non-null   object 
 4   green grapes       2528 non-null   object 
 5   whole weat flour   1863 non-null   object 
 6   yams               1368 non-null   object 
 7   cottage cheese     980 non-null    object 
 8   energy drink       653 non-null    object 
 9   tomato juice       394 non-null    object 
 10  low fat yogurt     255 non-null    object 
 11  green tea          153 non-null    object 
 12  honey              86 non-null     object 
 13  salad              46 non-null     object 
 14  mineral water      24 non-null     object 
 15  salmon             7 non-null      object 
 16  antioxydant juice  3 non

## Preprocess the data. Identify relevant & irrelevant attributes for the problem

In [56]:
# "display.max_columns" sets the maximum number of columns displayed when a frame is pretty-printed
pd.set_option('display.max_columns', None)

# Display every floating-point number with five decimal places
pd.options.display.float_format = '{:.5f}'.format

In [57]:
# Returns the first 5 rows of the dataframe 
df.head(5)

Unnamed: 0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
0,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
1,chutney,,,,,,,,,,,,,,,,,,,
2,turkey,avocado,,,,,,,,,,,,,,,,,,
3,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,
4,low fat yogurt,,,,,,,,,,,,,,,,,,,


In [65]:
data = []
# Convert each DataFrame row into a list of its non-null items (one list per transaction)
for rowIndex in range(df.shape[0]):
    data.append(list(df.iloc[rowIndex][df.iloc[rowIndex].notnull()]))
data[:25]

[['burgers', 'meatballs', 'eggs'],
 ['chutney'],
 ['turkey', 'avocado'],
 ['mineral water', 'milk', 'energy bar', 'whole wheat rice', 'green tea'],
 ['low fat yogurt'],
 ['whole wheat pasta', 'french fries'],
 ['soup', 'light cream', 'shallot'],
 ['frozen vegetables', 'spaghetti', 'green tea'],
 ['french fries'],
 ['eggs', 'pet food'],
 ['cookies'],
 ['turkey', 'burgers', 'mineral water', 'eggs', 'cooking oil'],
 ['spaghetti', 'champagne', 'cookies'],
 ['mineral water', 'salmon'],
 ['mineral water'],
 ['shrimp',
  'chocolate',
  'chicken',
  'honey',
  'oil',
  'cooking oil',
  'low fat yogurt'],
 ['turkey', 'eggs'],
 ['turkey',
  'fresh tuna',
  'tomatoes',
  'spaghetti',
  'mineral water',
  'black tea',
  'salmon',
  'eggs',
  'chicken',
  'extra dark chocolate'],
 ['meatballs', 'milk', 'honey', 'french fries', 'protein bar'],
 ['red wine', 'shrimp', 'pasta', 'pepper', 'eggs', 'chocolate', 'shampoo'],
 ['rice', 'sparkling water'],
 ['spaghetti', 'mineral water', 'ham', 'body spray',

In [10]:
# To find the length of the data
len(data)

7500
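
The `df.info()` output already hints at basket sizes: assuming items are packed from the left (as the `head()` output suggests), the second column has 5746 non-null values out of 7500 rows, so 1754 transactions contain exactly one item. The sketch below counts basket sizes directly from a list of transactions. It uses toy data and a hypothetical helper name, `basket_size_distribution`, and is meant as an illustration rather than part of the original analysis.

```python
from collections import Counter

def basket_size_distribution(transactions):
    """Map basket size -> number of transactions with that many items."""
    return Counter(len(t) for t in transactions)

# Toy transactions in the same list-of-lists shape as `data`
toy = [
    ["burgers", "meatballs", "eggs"],
    ["chutney"],
    ["turkey", "avocado"],
    ["low fat yogurt"],
]

sizes = basket_size_distribution(toy)
for size in sorted(sizes):
    print(size, sizes[size])
```

Running the same function on the full `data` list would show how many baskets are too small to contribute to multi-item rules.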

In [12]:
# "TransactionEncoder" one-hot encodes transaction data given as a Python list of lists into a boolean NumPy array
te = TransactionEncoder()
# fit() learns the unique item labels; transform() encodes each transaction as a row of True/False values
te_ary = te.fit(data).transform(data)
# shape returns (number of transactions, number of unique items)
te_ary.shape

(7500, 119)
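
To make the encoding step concrete, the sketch below reproduces the idea on three toy transactions using only the standard library. It illustrates one-hot encoding in general and is not mlxtend's actual implementation; `one_hot_encode` is a hypothetical helper name.

```python
def one_hot_encode(transactions):
    """Return (columns, rows): sorted unique items and one boolean row per transaction."""
    columns = sorted({item for t in transactions for item in t})
    rows = []
    for t in transactions:
        present = set(t)
        # True where the transaction contains the column's item
        rows.append([item in present for item in columns])
    return columns, rows

toy = [["eggs", "milk"], ["milk"], ["eggs", "spaghetti", "milk"]]
columns, rows = one_hot_encode(toy)
print(columns)
for row in rows:
    print(row)
```

Each row has one entry per unique item, which is why the encoded array above has 119 columns for 7500 transactions.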

In [13]:
df_1 = pd.DataFrame(te_ary, columns = te.columns_)
# Returns the first 5 rows of the dataframe 
df_1.head(5)

Unnamed: 0,almonds,antioxydant juice,asparagus,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,body spray,bramble,brownies,bug spray,burger sauce,burgers,butter,cake,candy bars,carrots,cauliflower,cereals,champagne,chicken,chili,chocolate,chocolate bread,chutney,cider,clothes accessories,cookies,cooking oil,corn,cottage cheese,cream,dessert wine,eggplant,eggs,energy bar,energy drink,escalope,extra dark chocolate,flax seed,french fries,french wine,fresh bread,fresh tuna,fromage blanc,frozen smoothie,frozen vegetables,gluten free bar,grated cheese,green beans,green grapes,green tea,ground beef,gums,ham,hand protein bar,herb & pepper,honey,hot dogs,ketchup,light cream,light mayo,low fat yogurt,magazines,mashed potato,mayonnaise,meatballs,melons,milk,mineral water,mint,mint green tea,muffins,mushroom cream sauce,napkins,nonfat milk,oatmeal,oil,olive oil,pancakes,parmesan cheese,pasta,pepper,pet food,pickles,protein bar,red wine,rice,salad,salmon,salt,sandwich,shallot,shampoo,shrimp,soda,soup,spaghetti,sparkling water,spinach,strawberries,strong cheese,tea,tomato juice,tomato sauce,tomatoes,toothpaste,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


## Discover frequent patterns

In [15]:
# Return itemsets with at least 60% support:
frequent_itemsets = fpgrowth(df_1, min_support=0.6, use_colnames=True)
frequent_itemsets_apriori = apriori(df_1, min_support=0.6, use_colnames=True)
frequent_itemsets_fpmax = fpmax(df_1, min_support=0.6, use_colnames=True)
frequent_itemsets

Unnamed: 0,support,itemsets


In [17]:
frequent_itemsets_apriori

Unnamed: 0,support,itemsets


In [18]:
frequent_itemsets_fpmax

Unnamed: 0,support,itemsets
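
All three results are empty because support is the fraction of transactions that contain an itemset, and no item in this dataset appears in 60% of the baskets (the most frequent item, mineral water, has a support of about 0.238 in the later runs). The sketch below shows the same effect on toy data; `support` is a hypothetical helper, not an mlxtend function.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    needed = set(itemset)
    hits = sum(1 for t in transactions if needed.issubset(t))
    return hits / len(transactions)

toy = [
    {"mineral water", "eggs"},
    {"mineral water", "spaghetti"},
    {"eggs"},
    {"chocolate"},
    {"mineral water", "spaghetti", "chocolate"},
]

items = sorted(set().union(*toy))
# A threshold above the highest single-item support yields nothing
frequent_by_threshold = {
    min_support: [i for i in items if support([i], toy) >= min_support]
    for min_support in (0.8, 0.4)
}
print(frequent_by_threshold)
```

Lowering `min_support` until it falls below the support of the most frequent items is therefore the natural next step.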


## Iterate previous steps by varying parameters (30% minimum support)

In [20]:
# Return the itemsets with at least 30% support:
frequent_itemsets = fpgrowth(df_1, min_support=0.3, use_colnames=True)
frequent_itemsets_apriori = apriori(df_1, min_support=0.3, use_colnames=True)
frequent_itemsets_fpmax = fpmax(df_1, min_support=0.3, use_colnames=True)

In [22]:
frequent_itemsets

Unnamed: 0,support,itemsets


In [24]:
frequent_itemsets_apriori

Unnamed: 0,support,itemsets


In [26]:
frequent_itemsets_fpmax

Unnamed: 0,support,itemsets


## Iterate previous steps by varying parameters (10% minimum support)

In [66]:
# Return the itemsets with at least 10% support:
frequent_itemsets = fpgrowth(df_1, min_support=0.1, use_colnames=True)
frequent_itemsets_apriori = apriori(df_1, min_support=0.1, use_colnames=True)
frequent_itemsets_fpmax = fpmax(df_1, min_support=0.1, use_colnames=True)
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.17973,(eggs)
1,0.23827,(mineral water)
2,0.132,(green tea)
3,0.1296,(milk)
4,0.17093,(french fries)
5,0.17413,(spaghetti)
6,0.16387,(chocolate)


In [30]:
frequent_itemsets_apriori

Unnamed: 0,support,itemsets
0,0.16387,(chocolate)
1,0.17973,(eggs)
2,0.17093,(french fries)
3,0.132,(green tea)
4,0.1296,(milk)
5,0.23827,(mineral water)
6,0.17413,(spaghetti)


In [31]:
frequent_itemsets_fpmax

Unnamed: 0,support,itemsets
0,0.1296,(milk)
1,0.132,(green tea)
2,0.16387,(chocolate)
3,0.17093,(french fries)
4,0.17413,(spaghetti)
5,0.17973,(eggs)
6,0.23827,(mineral water)
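
At 10% minimum support every frequent itemset is a single item, so `fpmax`, which returns only maximal itemsets (those with no frequent superset), yields the same seven itemsets as `apriori` and `fpgrowth`, just in a different order. The brute-force sketch below shows how the results diverge once larger itemsets become frequent. It runs on toy data, and `mine_frequent_itemsets` and `maximal_only` are hypothetical helpers rather than the algorithms mlxtend implements.

```python
from itertools import combinations

def mine_frequent_itemsets(transactions, min_support):
    """Brute-force miner: return {frozenset: support} for all frequent itemsets."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for candidate in combinations(items, size):
            s = sum(1 for t in transactions if set(candidate) <= t) / n
            if s >= min_support:
                result[frozenset(candidate)] = s
                found = True
        if not found:
            # No frequent itemset of this size, so none can exist at larger sizes
            break
    return result

def maximal_only(itemsets):
    """Keep only itemsets that have no frequent proper superset."""
    return {s: v for s, v in itemsets.items()
            if not any(s < other for other in itemsets)}

toy = [
    {"mineral water", "spaghetti"},
    {"mineral water", "spaghetti", "eggs"},
    {"mineral water", "eggs"},
    {"chocolate"},
]

all_frequent = mine_frequent_itemsets(toy, min_support=0.5)
maximal = maximal_only(all_frequent)
print(len(all_frequent), "frequent itemsets,", len(maximal), "of them maximal")
```

Because maximal itemsets drop every frequent subset, they are compact but cannot be used on their own to compute rule confidence, which needs the supports of the subsets.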


## Iterate previous steps by varying parameters (5% minimum support)

In [33]:
# Return the itemsets with at least 5% support:
frequent_itemsets = fpgrowth(df_1, min_support=0.05, use_colnames=True)
frequent_itemsets_apriori = apriori(df_1, min_support=0.05, use_colnames=True)
frequent_itemsets_fpmax = fpmax(df_1, min_support=0.05, use_colnames=True)
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.17973,(eggs)
1,0.0872,(burgers)
2,0.06253,(turkey)
3,0.23827,(mineral water)
4,0.132,(green tea)
5,0.1296,(milk)
6,0.05853,(whole wheat rice)
7,0.0764,(low fat yogurt)
8,0.17093,(french fries)
9,0.05053,(soup)


In [36]:
frequent_itemsets_apriori

Unnamed: 0,support,itemsets
0,0.0872,(burgers)
1,0.08107,(cake)
2,0.06,(chicken)
3,0.16387,(chocolate)
4,0.0804,(cookies)
5,0.05107,(cooking oil)
6,0.17973,(eggs)
7,0.07933,(escalope)
8,0.17093,(french fries)
9,0.0632,(frozen smoothie)


In [37]:
frequent_itemsets_fpmax

Unnamed: 0,support,itemsets
0,0.05053,(soup)
1,0.05107,(cooking oil)
2,0.0524,(grated cheese)
3,0.05853,(whole wheat rice)
4,0.06,(chicken)
5,0.06253,(turkey)
6,0.0632,(frozen smoothie)
7,0.06573,(olive oil)
8,0.0684,(tomatoes)
9,0.07133,(shrimp)


## Formulate Association Rules from Frequent Itemsets

In [38]:
# Rule generation and selection criteria
from mlxtend.frequent_patterns import association_rules
#  "association_rules" lets you specify
#  (1) the metric of interest and
#  (2) the corresponding minimum threshold
# Supported metrics include support, confidence, lift, leverage and conviction
# frequent_itemsets holds the FP-Growth result at 5% minimum support
# Here, rules are derived from the frequent itemsets only if their confidence is at least 5% (min_threshold=0.05);
# the result of this call is not assigned, so it is neither stored nor displayed
association_rules(frequent_itemsets, metric="confidence", min_threshold=0.05)
# Here, rules are kept only if their lift is at least 1.2 (min_threshold=1.2); lift is a ratio, not a percentage
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(spaghetti),(mineral water),0.17413,0.23827,0.05973,0.34303,1.4397,0.01824,1.15947
1,(mineral water),(spaghetti),0.23827,0.17413,0.05973,0.2507,1.4397,0.01824,1.10218
2,(mineral water),(chocolate),0.23827,0.16387,0.05267,0.22104,1.34891,0.01362,1.0734
3,(chocolate),(mineral water),0.16387,0.23827,0.05267,0.3214,1.34891,0.01362,1.12251


In [39]:
# Compute the antecedent length (number of items in each rule's antecedent):
rules["antecedent_len"] = rules["antecedents"].apply(len)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len
0,(spaghetti),(mineral water),0.17413,0.23827,0.05973,0.34303,1.4397,0.01824,1.15947,1
1,(mineral water),(spaghetti),0.23827,0.17413,0.05973,0.2507,1.4397,0.01824,1.10218,1
2,(mineral water),(chocolate),0.23827,0.16387,0.05267,0.22104,1.34891,0.01362,1.0734,1
3,(chocolate),(mineral water),0.16387,0.23827,0.05267,0.3214,1.34891,0.01362,1.12251,1


In [40]:
# We are only interested in rules that satisfy the following criteria:
# at least 1 antecedent
# a confidence > 0.25
# a lift score > 1.2
rules[ (rules['antecedent_len'] >= 1) &
       (rules['confidence'] > 0.25) &
       (rules['lift'] > 1.2) ]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedent_len
0,(spaghetti),(mineral water),0.17413,0.23827,0.05973,0.34303,1.4397,0.01824,1.15947,1
1,(mineral water),(spaghetti),0.23827,0.17413,0.05973,0.2507,1.4397,0.01824,1.10218,1
3,(chocolate),(mineral water),0.16387,0.23827,0.05267,0.3214,1.34891,0.01362,1.12251,1


## Comparison of association rules

We discovered the following 3 rules:

Rule 1 = {spaghetti} => {mineral water}

Rule 2 = {mineral water} => {spaghetti}

Rule 3 = {chocolate} => {mineral water}

By confidence, Rule 1 ranks highest (0.343), followed by Rule 3 (0.321) and Rule 2 (0.251). Rules 1 and 2 have the same lift (1.440) and support (0.060) because they involve the same pair of items, while Rule 3 has the lowest lift (1.349) and support (0.053). We therefore consider Rule 1 to be the best.
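
The ranking can be reproduced with a simple sort. The sketch below orders the three rules by confidence and breaks ties by lift, using rounded values copied from the filtered rules table; the variable names are illustrative, and the choice of confidence as the primary key is an assumption that mirrors the comparison in this section.

```python
# Rounded values copied from the filtered rules table in this notebook
rule_rows = [
    {"rule": "{spaghetti} => {mineral water}", "support": 0.05973, "confidence": 0.34303, "lift": 1.4397},
    {"rule": "{mineral water} => {spaghetti}", "support": 0.05973, "confidence": 0.2507, "lift": 1.4397},
    {"rule": "{chocolate} => {mineral water}", "support": 0.05267, "confidence": 0.3214, "lift": 1.34891},
]

# Sort by confidence first, then by lift, both descending
ranked = sorted(rule_rows, key=lambda r: (r["confidence"], r["lift"]), reverse=True)
for position, r in enumerate(ranked, start=1):
    print(position, r["rule"], r["confidence"], r["lift"])
```

Sorting by lift first would instead put Rules 1 and 2 ahead of Rule 3, so the choice of primary metric matters for the final recommendation.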

Conclusions based on rules:

Place mineral water and spaghetti together in a common area of the physical store, and feature them together on the first page of the online store, so that both items are clearly visible to customers when they visit. This is the recommendation we provide.

Offer combination deals that pair mineral water with spaghetti or with chocolate, the item pairs linked by the discovered rules, to increase sales.

## Importance of discovered rules

In data mining, association rules are useful for analyzing and predicting customer behavior. They play an important part in customer analytics, market basket analysis, product clustering, catalog design, and store layout. Association rules are also used as a building block in machine learning applications such as recommendation systems.