Machine Learning Project : Association rules mining using Apriori algorithm.
1- Introduction
Association rule mining is an important machine learning technique used in market basket analysis. In a store, all vegetables are placed in the same aisle, all dairy items are placed together, and cosmetics form another such group. Investing time and resources in deliberate product placement like this not only reduces a customer's shopping time, but also reminds customers of related items they might be interested in buying, which helps stores cross-sell. Association rules help uncover such relationships between items in large databases. One important point: rules do not capture an individual's preferences; they describe relationships between the items that appear together within transactions.

2- Dataset
We used the grocery dataset, which can be found on Kaggle. It contains 9835 transactions by customers shopping for groceries, covering 169 unique items.

3- Approach
This kernel has 7 main sections :

Importing libraries ;
Reading the dataset ;
Exploratory Data Analysis ;
Data Visualization ;
Data Preprocessing ;
Association rules ;
Conclusion.
1. Importing libraries :
For basic operations, we need 2 libraries : NumPy and pandas.
For visualization, we need matplotlib, seaborn, wordcloud, squarify and networkx.
For preprocessing, we need TransactionEncoder.
For market basket analysis, we need mlxtend.

2. Reading the dataset :
To read our dataset, we use the pd.read_csv method.

3. Exploratory Data Analysis :
To understand our dataset, we look at its shape, its summary information (index dtype, columns, non-null counts and memory usage), the number of missing values, and the list of unique values.

4. Data Visualization :
For visualization, we use 4 methods :

Data visualization by word cloud ;
Data visualization by bar graph ;
Data visualization by tree map ;
Data visualization by networkx.
5. Data Preprocessing
To represent the products in each transaction, we encode every transaction as a boolean array in which each index corresponds to one product and indicates whether that product was purchased in the transaction. But first, the dataset must be converted into the format required by TransactionEncoder. So we need to :

make each customer's shopping items a list of identical length ;
convert it into a NumPy array ;
check the shape of the array.
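The boolean encoding described above can be illustrated in plain Python. This is only a minimal sketch on made-up transactions (the three baskets below are hypothetical, not rows from the grocery dataset); it shows the kind of table TransactionEncoder produces, not how mlxtend implements it.

```python
# Toy transactions (hypothetical data, not taken from the grocery dataset)
transactions = [
    ["whole milk", "yogurt"],
    ["whole milk", "rolls/buns", "soda"],
    ["soda"],
]

# Sorted list of every distinct product; each one becomes a column
columns = sorted({item for basket in transactions for item in basket})

# One boolean row per transaction: True where the product was bought
encoded = [[item in basket for item in columns] for basket in transactions]

print(columns)
for row in encoded:
    print(row)
```

Each row has one entry per product, so every transaction ends up with the same length regardless of how many items were bought.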
6. Association Rules
To solve this case study, we use the Apriori algorithm, so we first explain the approach behind the method.

Association rule mining is a form of unsupervised learning: the algorithm learns without a teacher because the data are not labelled. It is a descriptive rather than a predictive method, generally used to discover interesting relationships hidden in large datasets. These relationships are usually represented as rules or frequent itemsets.

Apriori Concept
Apriori is one of the algorithms that we can use for market basket analysis.
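To give an idea of how Apriori finds frequent itemsets, here is a minimal plain-Python sketch of its level-wise search. It assumes support is the fraction of transactions containing an itemset, and it relies on the property that every subset of a frequent itemset must itself be frequent. The function name `frequent_itemsets` and the toy baskets are hypothetical; this is not the mlxtend implementation used later in the kernel.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori search; returns {itemset: support}."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset
        return sum(itemset <= b for b in baskets) / n

    # Level 1: single items that meet the threshold
    items = {item for b in baskets for item in b}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    result = {}
    k = 1
    while current:
        for s in current:
            result[s] = support(s)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return result

# Toy baskets (hypothetical data)
transactions = [
    ["milk", "bread", "butter"],
    ["milk", "bread"],
    ["milk", "soda"],
    ["bread", "butter"],
]
itemsets = frequent_itemsets(transactions, min_support=0.5)
for itemset, sup in sorted(itemsets.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

With a threshold of 0.5, "soda" is dropped at the first level, so no candidate containing it is ever counted; this pruning is what keeps the search tractable on large datasets.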

Apriori is based on 3 metrics:
a. Support

b. Confidence

c. Lift

a. Support :
Quantifies how often an item or an itemset appears in a set of transactions. In other words, support is the relative frequency of an itemset: the number of transactions that contain it divided by the total number of transactions.
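As a small worked example, support can be computed by hand on a few made-up transactions. The helper `support` and the four baskets are hypothetical and only illustrate the definition, taking support as a fraction of all transactions.

```python
# Toy transactions (hypothetical data)
transactions = [
    {"whole milk", "yogurt"},
    {"whole milk", "rolls/buns"},
    {"soda"},
    {"whole milk", "yogurt", "soda"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

print(support({"whole milk"}, transactions))            # 3 of 4 transactions
print(support({"whole milk", "yogurt"}, transactions))  # 2 of 4 transactions
```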

b. Confidence :
The probability that a transaction containing item X also contains item Y: confidence(X -> Y) = support(X and Y) / support(X).

c. Lift :
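How much more often X and Y are bought together than would be expected if they were bought independently: lift(X -> Y) = confidence(X -> Y) / support(Y). A lift above 1 means that buying X makes buying Y more likely, a lift of 1 means the two are independent, and a lift below 1 means that buying X makes buying Y less likely.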
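Confidence and lift can be checked by hand in the same way. The sketch below uses the standard definitions, confidence(X -> Y) = support(X and Y) / support(X) and lift(X -> Y) = confidence(X -> Y) / support(Y); the helper functions and the toy baskets are hypothetical and serve only as an illustration.

```python
# Toy transactions (hypothetical data)
transactions = [
    {"whole milk", "yogurt"},
    {"whole milk", "rolls/buns"},
    {"soda"},
    {"whole milk", "yogurt", "soda"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """Share of the transactions containing x that also contain y."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

def lift(x, y, transactions):
    """Confidence of x -> y relative to the baseline frequency of y."""
    return confidence(x, y, transactions) / support(y, transactions)

x, y = {"whole milk"}, {"yogurt"}
print("confidence:", confidence(x, y, transactions))
print("lift:", lift(x, y, transactions))
```

Here yogurt appears in 2 of the 3 milk baskets but in only half of all baskets, so the lift is above 1: in this toy data, buying milk is associated with a higher chance of buying yogurt.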

In [None]:
import numpy as np
import pandas as pd


import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
from wordcloud import WordCloud
import squarify

from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [None]:
data = pd.read_csv('groceries - groceries.csv')
data.head()

In [None]:
data.shape


In [None]:
data.info()
data.isnull().sum()

In [None]:
data.sample(10)


In [None]:
data_It1 = data['Item 1'].unique()
data_It2 = data['Item 2'].unique()
data_It3 = data['Item 3'].unique()

print('Unique Product in 1st Item is :', data_It1)
print('')
print('Unique Product in 2nd Item is :', data_It2)
print('')
print('Unique Product in 3rd Item is :', data_It3)

In [None]:
y = data['Item 1'].value_counts().head(100).to_frame()
y.index

In [None]:
plt.rcParams['figure.figsize'] = (40, 40)

# 1st Item

plt.subplot2grid ((2,3),(0,0))

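# join all values of the column into one text; str(data['Item 1']) would only give the truncated Series printout
wordcloud = WordCloud(background_color = 'white', width = 1200,  height = 1200, max_words = 50).generate(' '.join(data['Item 1'].astype(str)))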
plt.imshow(wordcloud)
plt.axis('off')
plt.title('Most Popular_1st Items',fontsize = 20)
plt.show()

In [None]:

# looking at the frequency of most popular items

plt.rcParams['figure.figsize'] = (30, 15)
color = plt.cm.copper(np.linspace(0, 1, 40))
data['Item 1'].value_counts().head(20).plot.bar(color = color)
plt.title('frequency of most popular items', fontsize = 20)
plt.xticks(rotation = 90 )
plt.grid()
plt.show()

In [None]:
# plotting a tree map

plt.rcParams['figure.figsize'] = (20, 20)
color = plt.cm.cool(np.linspace(0, 1, 50))
# flatten the counts to a 1-D array before passing them as rectangle sizes
squarify.plot(sizes = y.values.ravel(), label = y.index, alpha=.8, color = color)
plt.title('Tree Map for Popular Items')
plt.axis('off')
plt.show()

In [None]:
data['food'] = 'Food'
food = data.truncate(before = -1, after = 15)


import networkx as nx

food = nx.from_pandas_edgelist(food, source = 'food', target = 'Item 1', edge_attr = True)

In [None]:
import warnings
warnings.filterwarnings('ignore')

plt.rcParams['figure.figsize'] = (20, 20)
pos = nx.spring_layout(food)
color = plt.cm.Wistia(np.linspace(0, 15, 1))
nx.draw_networkx_nodes(food, pos, node_size = 15000, node_color = color)
nx.draw_networkx_edges(food, pos, width = 3, alpha = 0.6, edge_color = 'black')
nx.draw_networkx_labels(food, pos, font_size = 20, font_family = 'sans-serif')
plt.axis('off')
plt.grid()
plt.title('Top 15 First Choices', fontsize = 40)
plt.show()

In [None]:
data['secondchoice'] = 'Second Choice'
secondchoice = data.truncate(before = -1, after = 15)
# use the 'secondchoice' column as the central node of the graph
secondchoice = nx.from_pandas_edgelist(secondchoice, source = 'secondchoice', target = 'Item 2', edge_attr = True)

In [None]:
import warnings
warnings.filterwarnings('ignore')

plt.rcParams['figure.figsize'] = (20, 20)
pos = nx.spring_layout(secondchoice)
color = plt.cm.Blues(np.linspace(0, 15, 1))
nx.draw_networkx_nodes(secondchoice, pos, node_size = 15000, node_color = color)
nx.draw_networkx_edges(secondchoice, pos, width = 3, alpha = 0.6, edge_color = 'brown')
nx.draw_networkx_labels(secondchoice, pos, font_size = 20, font_family = 'sans-serif')
plt.axis('off')
plt.grid()
plt.title('Top 15 Second Choices', fontsize = 40)
plt.show()

In [None]:
data['thirdchoice'] = 'Third Choice'
thirdchoice = data.truncate(before = -1, after = 15)
# use the 'thirdchoice' column as the central node of the graph
thirdchoice = nx.from_pandas_edgelist(thirdchoice, source = 'thirdchoice', target = 'Item 3', edge_attr = True)

In [None]:
import warnings
warnings.filterwarnings('ignore')

plt.rcParams['figure.figsize'] = (20, 20)
pos = nx.spring_layout(thirdchoice)
color = plt.cm.Reds(np.linspace(0, 15, 1))
nx.draw_networkx_nodes(thirdchoice, pos, node_size = 15000, node_color = color)
nx.draw_networkx_edges(thirdchoice, pos, width = 3, alpha = 0.6, edge_color = 'pink')
nx.draw_networkx_labels(thirdchoice, pos, font_size = 20, font_family = 'sans-serif')
plt.axis('off')
plt.grid()
plt.title('Top 15 Third Choices', fontsize = 40)
plt.show()

In [None]:
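# making each customer's shopping items a list of identical length
# (missing entries become the string 'nan')
values = data.values  # extract the underlying array once instead of on every access
trans = []
for i in range(0, 9835):
    trans.append([str(values[i,j]) for j in range(0, 33)])

# converting it into a NumPy array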
trans = np.array(trans)

# checking the shape of the array
print(trans.shape)

In [None]:
# one-hot encode the transactions: one boolean column per distinct value
te = TransactionEncoder()
data = te.fit_transform(trans)
data = pd.DataFrame(data, columns = te.columns_)

# getting the shape of the data
data.shape

In [None]:
import warnings
warnings.filterwarnings('ignore')

# I'm going to use just 100 items for my analysis

data = data.loc[:, ['sausage', 'whole milk', 'frankfurter', 'tropical fruit',
       'other vegetables', 'citrus fruit', 'pork', 'rolls/buns', 'chicken',
       'canned beer', 'beef', 'soda', 'root vegetables', 'pip fruit', 'yogurt',
       'ham', 'bottled beer', 'meat', 'bottled water', 'hamburger meat',
       'pastry', 'berries', 'curd', 'ice cream', 'beverages', 'coffee',
       'whipped/sour cream', 'butter', 'dessert', 'onions', 'UHT-milk',
       'grapes', 'brown bread', 'newspapers', 'domestic eggs', 'frozen meals',
       'finished products', 'misc. beverages', 'turkey', 'shopping bags',
       'chocolate', 'butter milk', 'salty snack', 'fruit/vegetable juice',
       'liver loaf', 'frozen vegetables', 'cream cheese',
       'specialty chocolate', 'packaged fruit/vegetables', 'waffles', 'herbs',
       'oil', 'photo/film', 'white bread', 'chewing gum', 'margarine',
       'white wine', 'condensed milk', 'pet care', 'specialty bar', 'cat food',
       'sugar', 'napkins', 'fish', 'hard cheese', 'long life bakery product',
       'semi-finished bread', 'processed cheese', 'frozen fish',
       'hygiene articles', 'nuts/prunes', 'liquor', 'detergent',
       'sliced cheese', 'candy', 'spread cheese', 'zwieback', 'red/blush wine',
       'pasta', 'frozen dessert', 'potted plants', 'sparkling wine',
       'organic sausage', 'dishes', 'seasonal products', 'dog food',
       'baking powder', 'frozen potato products', 'soft cheese', 'curd cheese',
       'salt', 'sweet spreads', 'mayonnaise', 'canned vegetables',
       'specialty cheese', 'chocolate marshmallow', 'flower soil/fertilizer',
       'cookware', 'dish cleaner', 'instant coffee']]

# checking the shape
data.shape

In [None]:
# let's check the columns

data.columns

In [None]:
# getting the head of the data

data.head()