## Apriori Association with Groceries Market Basket Dataset

**Name - Surname** = Ozan Can Demir

**Student Number** = 402533

**Department**     = Data Science and Business Analytics

**Project**        = Association Project 
    
### Purpose

The purpose of this project is to analyze customer orders from grocery stores and to apply association rule mining (the Apriori algorithm) to this data set.

**Apriori** is an algorithm for frequent item set mining and association rule learning over relational databases. It proceeds by identifying the frequent individual items in the database and extending them to larger and larger item sets as long as those item sets appear sufficiently often in the database. The frequent item sets determined by Apriori can be used to determine association rules which highlight general trends in the database: this has applications in domains such as market basket analysis ([Wikipedia](https://en.wikipedia.org/wiki/Apriori_algorithm)).
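
The level-wise idea described above can be illustrated with a minimal pure-Python sketch. It is not the `mlxtend` implementation used later in this project; the function name `apriori_sketch` and the toy baskets are assumptions made for illustration only.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets.

    Hypothetical helper for illustration; not the mlxtend implementation.
    """
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset
        return sum(itemset <= b for b in baskets) / n

    # Level 1: keep the single items that are frequent enough
    items = {item for b in baskets for item in b}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 2
    while current:
        # Build size-k candidates by joining frequent (k-1)-itemsets
        prev = list(current)
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Toy baskets (made up for this example, not taken from the data set)
toy = [
    ["whole milk", "rolls/buns"],
    ["whole milk", "yogurt"],
    ["whole milk", "rolls/buns", "yogurt"],
    ["soda"],
]
result = apriori_sketch(toy, min_support=0.5)
for itemset, s in sorted(result.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), s)
```

In this toy example the pair of rolls/buns and yogurt occurs in only one of four baskets, so it is dropped, and the three-item candidate is pruned without being counted because one of its subsets is infrequent.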

**Market Basket Analysis** is one of the most widely used association techniques. It searches for combinations of items that occur together in transactions; more precisely, it uncovers how items are associated with each other. With this technique, order patterns can be described in a meaningful way, for example that sausage is often bought together with white cheese. In large data sets it can reveal significant relations between items.
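
How strongly two items are associated is usually summarized with support, confidence and lift. The sketch below computes these three measures for a single rule on a handful of made-up baskets; the helper name `rule_metrics` and the toy data are assumptions for illustration, not part of the project's data set.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent.

    Hypothetical helper for illustration only.
    """
    baskets = [set(t) for t in transactions]
    n = len(baskets)
    a, c = set(antecedent), set(consequent)

    support_a = sum(a <= b for b in baskets) / n          # P(A)
    support_c = sum(c <= b for b in baskets) / n          # P(C)
    support_ac = sum((a | c) <= b for b in baskets) / n   # P(A and C)

    if support_a == 0 or support_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")

    confidence = support_ac / support_a   # P(C | A)
    lift = confidence / support_c         # > 1 suggests a positive association
    return {"support": support_ac, "confidence": confidence, "lift": lift}


# Toy baskets (made up for this example)
toy = [
    ["sausage", "white cheese"],
    ["sausage", "white cheese", "rolls/buns"],
    ["sausage"],
    ["white cheese"],
    ["whole milk"],
]
metrics = rule_metrics(toy, ["sausage"], ["white cheese"])
print(metrics)
```

A lift close to 1 means the two items appear together about as often as would be expected if they were bought independently.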


  

### Packages

In [25]:
!pip install -U scikit-learn
!pip install mlxtend



In [26]:
import numpy as np
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

import warnings
warnings.filterwarnings("ignore")

### Data
   [Groceries Market Basket Dataset](https://www.kaggle.com/irfanasrullah/groceries)

In [27]:
df_apriori = pd.read_csv('C:\\Users\\ozanc\\Desktop\\UW- Data Science and Business Analytics\\First Year\\First Semester\\11- Unsupervised Learning\\UL Projects\\Apriori\\Groceries_dataset.csv')

### Details of Data Set

In [28]:
df_apriori.head()

Unnamed: 0,Member_number,Date,itemDescription
0,1808,21-07-2015,tropical fruit
1,2552,05-01-2015,whole milk
2,2300,19-09-2015,pip fruit
3,1187,12-12-2015,other vegetables
4,3037,01-02-2015,whole milk


In [29]:
df_apriori.tail()

Unnamed: 0,Member_number,Date,itemDescription
38760,4471,08-10-2014,sliced cheese
38761,2022,23-02-2014,candy
38762,1097,16-04-2014,cake bar
38763,1510,03-12-2014,fruit/vegetable juice
38764,1521,26-12-2014,cat food


A random sample of rows from the data set:

In [30]:
df_apriori.sample(5)

Unnamed: 0,Member_number,Date,itemDescription
15056,2341,23-03-2014,roll products
654,1946,25-01-2015,cream cheese
7139,4177,13-04-2015,whole milk
3738,2856,21-08-2015,meat
19067,2236,03-02-2015,whole milk


In [31]:
df_apriori.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38765 entries, 0 to 38764
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Member_number    38765 non-null  int64 
 1   Date             38765 non-null  object
 2   itemDescription  38765 non-null  object
dtypes: int64(1), object(2)
memory usage: 908.7+ KB


In [32]:
print(df_apriori.columns)

Index(['Member_number', 'Date', 'itemDescription'], dtype='object')


### Analysis

In [33]:
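# Index the raw data by date (this frame is overwritten in a later cell)
df = df_apriori.set_index(['Date'])
# Wrap the raw data in a DataFrame used for the rest of the analysis
data = pd.DataFrame(df_apriori)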

data.values

array([[1808, '21-07-2015', 'tropical fruit'],
       [2552, '05-01-2015', 'whole milk'],
       [2300, '19-09-2015', 'pip fruit'],
       ...,
       [1097, '16-04-2014', 'cake bar'],
       [1510, '03-12-2014', 'fruit/vegetable juice'],
       [1521, '26-12-2014', 'cat food']], dtype=object)

In [34]:
data['itemDescription'].value_counts()

whole milk               2502
other vegetables         1898
rolls/buns               1716
soda                     1514
yogurt                   1334
                         ... 
frozen chicken              5
bags                        4
baby cosmetics              3
preservation products       1
kitchen utensil             1
Name: itemDescription, Length: 167, dtype: int64

In [35]:
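# Note: summing strings concatenates them without a separator, so item
# boundaries are lost (see the output below); the next cell builds proper lists.
df = data.groupby(['Member_number', 'Date'])['itemDescription'].apply(sum)
df.values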

array(['sausagewhole milksemi-finished breadyogurt',
       'whole milkpastrysalty snack', 'canned beermisc. beverages', ...,
       'fruit/vegetable juiceonions',
       'sodaroot vegetablessemi-finished bread',
       'bottled beerother vegetables'], dtype=object)

In [36]:
# One list of items per (date, member) pair, i.e. one list per basket
cols = [group['itemDescription'].tolist()
        for _, group in data.groupby(['Date', 'Member_number'])]
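
To make the grouping step concrete, here is a small pure-Python sketch of the same idea: rows of (member, date, item) are collected into one list of items per basket. The rows and the names `rows`, `baskets` and `transactions` are made up for illustration and are not taken from the data set.

```python
from collections import defaultdict

# Toy rows in the same shape as the data set: (member, date, item).
# These values are hypothetical.
rows = [
    (1000, "01-01-2015", "whole milk"),
    (1000, "01-01-2015", "pastry"),
    (1000, "02-01-2015", "soda"),
    (2000, "01-01-2015", "yogurt"),
]

# One basket per (member, date) pair; items stay separate list elements
baskets = defaultdict(list)
for member, date, item in rows:
    baskets[(member, date)].append(item)

transactions = list(baskets.values())
print(transactions)
```

Keeping each item as its own list element is what allows the encoder in the next step to treat items individually, which concatenated strings would not.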

Next, we transform the transactions into a one-hot encoded boolean array with `TransactionEncoder`:

In [37]:
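encoder = TransactionEncoder()
# Learn the unique items, then encode each basket as a row of True/False flags
te_data = encoder.fit(cols).transform(cols)
# Column labels (one per unique item)
encoder.columns_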



['Instant food products',
 'UHT-milk',
 'abrasive cleaner',
 'artif. sweetener',
 'baby cosmetics',
 'bags',
 'baking powder',
 'bathroom cleaner',
 'beef',
 'berries',
 'beverages',
 'bottled beer',
 'bottled water',
 'brandy',
 'brown bread',
 'butter',
 'butter milk',
 'cake bar',
 'candles',
 'candy',
 'canned beer',
 'canned fish',
 'canned fruit',
 'canned vegetables',
 'cat food',
 'cereals',
 'chewing gum',
 'chicken',
 'chocolate',
 'chocolate marshmallow',
 'citrus fruit',
 'cleaner',
 'cling film/bags',
 'cocoa drinks',
 'coffee',
 'condensed milk',
 'cooking chocolate',
 'cookware',
 'cream',
 'cream cheese ',
 'curd',
 'curd cheese',
 'decalcifier',
 'dental care',
 'dessert',
 'detergent',
 'dish cleaner',
 'dishes',
 'dog food',
 'domestic eggs',
 'female sanitary products',
 'finished products',
 'fish',
 'flour',
 'flower (seeds)',
 'flower soil/fertilizer',
 'frankfurter',
 'frozen chicken',
 'frozen dessert',
 'frozen fish',
 'frozen fruits',
 'frozen meals',
 'froze
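
Conceptually, the encoding step turns each basket into a row of boolean flags, one per unique item. The following pure-Python sketch mimics that idea on toy data; `one_hot` is a hypothetical helper, not the `TransactionEncoder` implementation, and sorting the column labels is an assumption chosen to match the ordering seen in the output above.

```python
def one_hot(transactions):
    """Encode baskets as rows of booleans, one column per unique item.

    Hypothetical helper for illustration only.
    """
    # Sorted unique items become the column labels (assumed ordering)
    columns = sorted({item for t in transactions for item in t})
    rows = []
    for t in transactions:
        present = set(t)
        rows.append([col in present for col in columns])
    return columns, rows


# Toy baskets (made up for this example)
toy = [["whole milk", "yogurt"], ["UHT-milk"]]
columns, encoded = one_hot(toy)
print(columns)
print(encoded)
```

Each row has the same length as the list of column labels, which is why the result can be loaded directly into a DataFrame with one column per item.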

In [38]:
# One row per basket, one boolean column per item
df = pd.DataFrame(te_data, columns=encoder.columns_)
df

Unnamed: 0,Instant food products,UHT-milk,abrasive cleaner,artif. sweetener,baby cosmetics,bags,baking powder,bathroom cleaner,beef,berries,...,turkey,vinegar,waffles,whipped/sour cream,whisky,white bread,white wine,whole milk,yogurt,zwieback
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14958,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
14959,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
14960,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,True,False
14961,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


Now we extract the itemsets with a minimum support of 0.10 (that is, present in at least 10% of transactions):

In [39]:
results = apriori(df, min_support=0.10, use_colnames=True)
results.sort_values(ascending=False, axis=0,by='support')


Unnamed: 0,support,itemsets
2,0.157923,(whole milk)
0,0.122101,(other vegetables)
1,0.110005,(rolls/buns)


### Results


As mentioned above, we set the minimum support to 0.10 (10%) in order to identify items and item combinations that occur frequently in the data set. According to the result, the most frequently purchased product is whole milk.

Based on the result, our most frequent single items are:

1- Whole milk (0.157923)

2- Other vegetables (0.122101)

3- Rolls/buns (0.110005)

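At this threshold only single items are frequent: no combination of two or more items reaches 10% support, so this result alone does not show that whole milk and other vegetables are associated with each other. Finding such associations would require a lower minimum support, followed by generating association rules from the resulting itemsets.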
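
A quick arithmetic check, using the supports reported above, suggests why no pair of items appears at the 0.10 threshold. The support of a pair can never exceed the support of its rarer item, and if the two items were bought independently (an assumption made purely for illustration, not a finding from the data) the expected pair support would be the product of the two supports.

```python
# Supports taken from the Apriori output above
support_whole_milk = 0.157923
support_other_vegetables = 0.122101
min_support = 0.10

# A pair cannot be more frequent than its rarer member
upper_bound = min(support_whole_milk, support_other_vegetables)

# Expected pair support IF the two items were independent (illustrative assumption)
expected_if_independent = support_whole_milk * support_other_vegetables

print(f"upper bound on pair support: {upper_bound:.4f}")
print(f"expected under independence: {expected_if_independent:.4f}")
print(f"reaches min_support under independence: {expected_if_independent >= min_support}")
```

Under that assumption the pair would fall well below 0.10, so a much lower minimum support would be needed before pairs of items could show up among the frequent itemsets.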

### REFERENCES

https://en.wikipedia.org/wiki/Apriori_algorithm

https://www.softwaretestinghelp.com/apriori-algorithm/

https://github.com/topics/apriori-algorithm

https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/