#### Source:

Dr Daqing Chen, Director: Public Analytics group. chend '@' lsbu.ac.uk, School of Engineering, London South Bank University, London SE1 0AA, UK.

#### Data Set Information:

This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.



#### Attribute Information:

- InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.
- StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
- Description: Product (item) name. Nominal.
- Quantity: The quantities of each product (item) per transaction. Numeric.
- InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.
- UnitPrice: Unit price. Numeric, Product price per unit in sterling.
- CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
- Country: Country name. Nominal, the name of the country where each customer resides.

In [1]:
import sys
sys.path.append("/Users/kumarchk/Library/Python/3.7/lib/python/site-packages")

In [2]:
# libraries

import pandas as pd
import numpy as np
from apyori import apriori
import time
pd.options.display.max_colwidth = 100

In [3]:
df = pd.read_excel("Online Retail.xlsx")
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      541909 non-null object
StockCode      541909 non-null object
Description    540455 non-null object
Quantity       541909 non-null int64
InvoiceDate    541909 non-null datetime64[ns]
UnitPrice      541909 non-null float64
CustomerID     406829 non-null float64
Country        541909 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 33.1+ MB


- Clearly, two columns have missing values: Description and CustomerID.

In [5]:
df.Description.isna().sum(), df.CustomerID.isna().sum()

(1454, 135080)

- InvoiceNo should normally be numeric, but on examination some invoice numbers start with 'C'; these are cancellations/returns.

- Let's make a subset of the data that excludes cancellations/returns.

In [6]:
#df_without_can = df[~df.InvoiceNo.str.startwith("C")]

df_without_can = df.loc[~df.InvoiceNo.str.startswith('C', na=False)]
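
The `na=False` argument matters when a column mixes numbers and strings, as InvoiceNo appears to do here (its dtype is `object`). A minimal sketch on four made-up invoice values, not taken from the file:

```python
import pandas as pd

# Illustrative values only: two numeric invoices and two cancellations.
invoices = pd.Series([536365, "C536379", 536380, "C536383"], dtype=object)

# .str methods return NaN for non-string entries; na=False turns those
# NaN results into False so the mask is purely boolean.
is_cancellation = invoices.str.startswith("C", na=False)

kept = invoices[~is_cancellation]
print(kept.tolist())
```

Without `na=False` the mask would contain NaN for the numeric invoices, and negating it with `~` would not give a clean boolean filter.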


In [7]:
df.shape[0]-df_without_can.shape[0]

9288

In [8]:
df.shape, df_without_can.shape

((541909, 8), (532621, 8))

In [9]:
df_without_can.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 532621 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      532621 non-null object
StockCode      532621 non-null object
Description    531167 non-null object
Quantity       532621 non-null int64
InvoiceDate    532621 non-null datetime64[ns]
UnitPrice      532621 non-null float64
CustomerID     397924 non-null float64
Country        532621 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 36.6+ MB


- Some data are still missing. Rows without a Description cannot be used for Market Basket Analysis, so let's drop them.

In [10]:
# Work on explicit copies so the assignments below do not trigger SettingWithCopyWarning
df_without_can_and_nan = df_without_can.copy()
df_without_can_and_nan['Description'] = df_without_can_and_nan['Description'].str.strip()
df_without_can_and_nan = df_without_can_and_nan.loc[df_without_can_and_nan['Description'].notnull()].copy()
# Remove leading and trailing full stops from the descriptions
df_without_can_and_nan['Description'] = df_without_can_and_nan['Description'].str.strip('.')
df_without_can_and_nan.shape


(531166, 8)

In [11]:
df_without_can_and_nan.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 531166 entries, 0 to 541908
Data columns (total 8 columns):
InvoiceNo      531166 non-null object
StockCode      531166 non-null object
Description    531166 non-null object
Quantity       531166 non-null int64
InvoiceDate    531166 non-null datetime64[ns]
UnitPrice      531166 non-null float64
CustomerID     397924 non-null float64
Country        531166 non-null object
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 36.5+ MB


In [12]:
len(df_without_can_and_nan[df_without_can_and_nan.Quantity == 0])

0

In [13]:
df_without_can_and_nan.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom


##### Clean the Description column


In [14]:
df_without_can_and_nan.Description.unique()

array(['WHITE HANGING HEART T-LIGHT HOLDER', 'WHITE METAL LANTERN',
       'CREAM CUPID HEARTS COAT HANGER', ..., 'lost',
       'CREAM HANGING HEART T-LIGHT HOLDER',
       'PAPER CRAFT , LITTLE BIRDIE'], dtype=object)

There are many junk entries in this column:

- lost
- lost in space
- POSSIBLE DAMAGES OR LOST?
- lost??
- ?lost
- Damaged
- wet damaged
- samples/damages
- damages/dotcom?
- damages?
- missing
- ?missing
- wrongly marked. 23343 in box
- stock creditted wrongly
- wrongly sold sets
- sold with wrong barcode
- smashed, and so on


We can create a list of all such items and filter them out of the data, or we can just make a short list of root words and eliminate all the data that contains them.
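
The root-word option could look roughly like the sketch below. The list of roots is hypothetical, picked by eye from the junk entries listed above, and `is_junk` is an illustrative helper rather than part of this notebook.

```python
import re

# Hypothetical root words, chosen by eye from the junk entries listed above.
JUNK_ROOTS = ["lost", "damage", "missing", "wrong", "smashed"]
JUNK_PATTERN = re.compile("|".join(re.escape(root) for root in JUNK_ROOTS), re.IGNORECASE)


def is_junk(description):
    """Return True if the description contains any junk root word."""
    return bool(JUNK_PATTERN.search(description))


descriptions = ["WHITE METAL LANTERN", "?lost", "wet damaged",
                "wrongly sold sets", "samples/damages"]
clean = [d for d in descriptions if not is_junk(d)]
print(clean)
```

Substring matching is blunt: a root such as "lost" or "wrong" would also remove any genuine product whose name happens to contain it, so the matches are worth reviewing before dropping rows.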

In [15]:
# Convert the entire column to lowercase for consistency (whitespace was already stripped above)

df_without_can_and_nan.Description = df_without_can_and_nan.Description.str.lower()


In [16]:
# Create a sorted list of all unique descriptions and write it to an Excel file for further examination

unique_desc_list = df_without_can_and_nan.Description.unique()
unique_desc_list.sort()

unique_desc = pd.DataFrame(unique_desc_list)
# The context manager saves and closes the file; writer.save() is not available in recent pandas
with pd.ExcelWriter('items_100.xlsx', engine='xlsxwriter') as writer:
    unique_desc.to_excel(writer, sheet_name='unique list of items', index=False)

print(len(unique_desc_list))

4174


In [17]:
# Load the annotated item list (the junk words were identified by reviewing the descriptions in Excel)

unique_desc_cleaned = pd.read_excel("items.xlsx")
unique_desc_cleaned.head(2)

Unnamed: 0,items,junk,item_class
0,*boombox ipod classic,,
1,*usb office mirror ball,,


- We found that many invoice items are gift wraps, so we marked them in the 'item_class' column, and we marked 'Yes' in the 'junk' column for the items that need to be dropped.

In [18]:
# Drop the junk and gift-wrap items from our data by keeping only the remaining descriptions

keep_list = unique_desc_cleaned[~((unique_desc_cleaned.junk == 'Yes') | (unique_desc_cleaned.item_class == 'Wrap'))]['items']
print("Total number of unique items to be considered :",len(keep_list))

new_data = df_without_can_and_nan[df_without_can_and_nan.Description.isin(keep_list)]
new_data.shape


Total number of unique items to be considered : 3943


(521289, 8)
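
The filter above relies on blank annotation cells comparing as False, so unannotated items are kept. A minimal sketch on a made-up annotation sheet; "example gift wrap" is a hypothetical item name, and the column names mirror the ones shown for `items.xlsx`:

```python
import pandas as pd

# Made-up annotation sheet; "example gift wrap" is a hypothetical item name.
annotated = pd.DataFrame({
    "items": ["white metal lantern", "lost", "example gift wrap"],
    "junk": [None, "Yes", None],
    "item_class": [None, None, "Wrap"],
})

# A blank cell never equals 'Yes' or 'Wrap', so unannotated items are kept.
flagged = (annotated["junk"] == "Yes") | (annotated["item_class"] == "Wrap")
keep = annotated.loc[~flagged, "items"]

# Made-up transaction lines filtered against the keep list.
lines = pd.DataFrame({"Description": ["lost", "white metal lantern", "example gift wrap"]})
filtered = lines[lines["Description"].isin(keep)]
print(filtered)
```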

- Let's check the Quantity column for values of 0 or less (these may be returned items).

In [19]:
new_data[new_data.Quantity <= 0]

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
151033,549527,21620,mystery! only ever imported 1800,-1479,2011-04-08 16:03:00,0.0,,United Kingdom
323872,565370,82494L,crushed ctn,-33,2011-09-02 15:04:00,0.0,,United Kingdom
381679,569878,90057,crushed boxes,-380,2011-10-06 15:12:00,0.0,,United Kingdom
418065,572686,23118,breakages,-30,2011-10-25 14:03:00,0.0,,United Kingdom


In [20]:
# Drop these rows: keep only strictly positive quantities

data_clean = new_data.loc[new_data.Quantity > 0].copy()
data_clean.shape

(521285, 8)
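
Operator precedence matters when negating a comparison. A filter written as `~new_data.Quantity <= 0` is parsed as `(~new_data.Quantity) <= 0`, because `~` binds more tightly than `<=`. On integers `~q` is the bitwise NOT, so that expression keeps every quantity of -1 or more. The sketch below shows the difference on plain Python integers with made-up quantities; pandas applies the same operators element by element to an int64 column.

```python
# Made-up quantities covering negative, zero and positive cases.
quantities = [-1479, -2, -1, 0, 1, 6]

# ~q is the bitwise NOT, i.e. -q - 1, so (~q) <= 0 is true whenever q >= -1.
unparenthesised = [q for q in quantities if ~q <= 0]

# The intended filter: keep only strictly positive quantities.
intended = [q for q in quantities if q > 0]

print(unparenthesised)
print(intended)
```

The two filters agree only when the data contains no rows with a quantity of exactly -1 or 0, which is why the row count here is the same either way.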

In [21]:
# Change the Date format - remove the time stamp

data_clean.InvoiceDate = data_clean["InvoiceDate"].dt.strftime("%m-%d-%y")
data_clean.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,white hanging heart t-light holder,6,12-01-10,2.55,17850.0,United Kingdom
1,536365,71053,white metal lantern,6,12-01-10,3.39,17850.0,United Kingdom
2,536365,84406B,cream cupid hearts coat hanger,8,12-01-10,2.75,17850.0,United Kingdom
3,536365,84029G,knitted union flag hot water bottle,6,12-01-10,3.39,17850.0,United Kingdom
5,536365,22752,set 7 babushka nesting boxes,2,12-01-10,7.65,17850.0,United Kingdom


In [22]:
data_clean.InvoiceNo.nunique()

19763

#### Create a new data set in which each record is the list of items sold together in one transaction

In [23]:
'''

data_t = [['Geeks', '10'], ['for', '15'], ['geeks', '20'], ['num'], ['ram', 'lax', 'ramesh', 'suresh']]  
  
# Create the pandas DataFrame  
df_t = pd.DataFrame(data_t)  
df_t.fillna(value=0, inplace=True)
#df_t.fillna(value=pd.np.nan, inplace=True)
# print dataframe.  
df_t 

'''

In [24]:
data_prep = data_clean.groupby('InvoiceNo')['Description'].apply(list).reset_index(name='Items')
data_prep.head(10)

Unnamed: 0,InvoiceNo,Items
0,536365,"[white hanging heart t-light holder, white metal lantern, cream cupid hearts coat hanger, knitte..."
1,536366,"[hand warmer union jack, hand warmer red polka dot]"
2,536367,"[assorted colour bird ornament, poppy's playhouse bedroom, poppy's playhouse kitchen, feltcraft ..."
3,536368,"[jam making set with jars, red coat rack paris fashion, yellow coat rack paris fashion, blue coa..."
4,536369,[bath building block word]
5,536370,"[alarm clock bakelike pink, alarm clock bakelike red, alarm clock bakelike green, panda and bunn..."
6,536371,[paper chain kit 50's christmas]
7,536372,"[hand warmer red polka dot, hand warmer union jack]"
8,536373,"[white hanging heart t-light holder, white metal lantern, cream cupid hearts coat hanger, edward..."
9,536374,[victorian sewing box large]
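
The grouping step can be checked on a small frame. This is a minimal sketch with three made-up rows that reuse invoice numbers and item names from the table above:

```python
import pandas as pd

# Three made-up transaction lines across two invoices.
lines = pd.DataFrame({
    "InvoiceNo": [536366, 536366, 536369],
    "Description": ["hand warmer union jack",
                    "hand warmer red polka dot",
                    "bath building block word"],
})

# Collect the descriptions of each invoice into one Python list per row.
baskets = lines.groupby("InvoiceNo")["Description"].apply(list).reset_index(name="Items")
print(baskets)
```

Each row of `baskets` is one transaction, which is the list-of-lists shape that `apriori` receives later as `invoice_records`.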


In [25]:
data_prep.shape

(19763, 2)

In [27]:
invoice_records = list(data_prep.Items)
print("Size of list :",len(invoice_records))

data = pd.DataFrame(invoice_records)
data.fillna(value=0, inplace=True)
data.head()

Size of list : 19763


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,1099,1100,1101,1102,1103,1104,1105,1106,1107,1108
0,white hanging heart t-light holder,white metal lantern,cream cupid hearts coat hanger,knitted union flag hot water bottle,set 7 babushka nesting boxes,glass star frosted t-light holder,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,hand warmer union jack,hand warmer red polka dot,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,assorted colour bird ornament,poppy's playhouse bedroom,poppy's playhouse kitchen,feltcraft princess charlotte doll,ivory knitted mug cosy,box of 6 assorted colour teaspoons,box of vintage jigsaw blocks,box of vintage alphabet blocks,home building block word,love building block word,...,0,0,0,0,0,0,0,0,0,0
3,jam making set with jars,red coat rack paris fashion,yellow coat rack paris fashion,blue coat rack paris fashion,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,bath building block word,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


#### Applying the Algorithm

In [28]:
# Check the first item set in the list 

invoice_records[0]

['white hanging heart t-light holder',
 'white metal lantern',
 'cream cupid hearts coat hanger',
 'knitted union flag hot water bottle',
 'set 7 babushka nesting boxes',
 'glass star frosted t-light holder']

In [37]:
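# Call apriori with thresholds for minimum support, confidence and lift.
# apyori reads the keyword min_confidence; keywords it does not recognise, such as
# the misspelling min_confidance or min_length, are silently ignored.
# The outputs recorded below come from a run with the misspelled keyword, so they
# still contain rules with confidence below 0.2.

rules = apriori(invoice_records, min_support=0.005, min_confidence=0.2, min_lift=3)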
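
The thresholds passed to `apriori` are limits on three standard measures. As a worked example, the sketch below computes them by hand for one rule on four made-up baskets; the item names come from the invoices shown earlier, but the baskets and the `support` helper are illustrative only.

```python
from fractions import Fraction

# Toy baskets, made up for the arithmetic.
baskets = [
    {"hand warmer union jack", "hand warmer red polka dot"},
    {"hand warmer union jack", "hand warmer red polka dot"},
    {"hand warmer union jack"},
    {"white metal lantern"},
]


def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return Fraction(hits, len(baskets))


lhs = {"hand warmer union jack"}
rhs = {"hand warmer red polka dot"}

rule_support = support(lhs | rhs)          # 2 of 4 baskets contain both items
confidence = rule_support / support(lhs)   # (2/4) / (3/4)
lift = confidence / support(rhs)           # confidence relative to how common rhs is

print(rule_support, confidence, lift)
```

A lift above 1 means the two items appear together more often than they would if they were bought independently.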

In [38]:
# apriori returns a generator, so the rules are only computed when it is consumed

rules

<generator object apriori at 0x1280ada50>

In [39]:
# Convert the generator to a list; this is where the rules are actually computed
start_time_1 = time.time()
results = list(rules)
end_time_1 = time.time()
print("--- %s minutes ---" % ((end_time_1 - start_time_1) / 60))
results

--- 16.9437269171079 minutes ---


[RelationRecord(items=frozenset({'12 pencils small tube red retrospot', '12 pencil small tube woodland'}), support=0.006729747507969438, ordered_statistics=[OrderedStatistic(items_base=frozenset({'12 pencil small tube woodland'}), items_add=frozenset({'12 pencils small tube red retrospot'}), confidence=0.37464788732394366, lift=20.453497782273754), OrderedStatistic(items_base=frozenset({'12 pencils small tube red retrospot'}), items_add=frozenset({'12 pencil small tube woodland'}), confidence=0.3674033149171271, lift=20.453497782273754)]),
 RelationRecord(items=frozenset({'12 pencils small tube skull', '12 pencil small tube woodland'}), support=0.0055153569802155545, ordered_statistics=[OrderedStatistic(items_base=frozenset({'12 pencil small tube woodland'}), items_add=frozenset({'12 pencils small tube skull'}), confidence=0.3070422535211268, lift=17.09317198968459), OrderedStatistic(items_base=frozenset({'12 pencils small tube skull'}), items_add=frozenset({'12 pencil small tube woodl
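
The `apriori` call returns at once while `list(rules)` takes about 17 minutes, because a generator does no work until it is consumed. A minimal sketch of that behaviour, using a hypothetical `slow_rules` generator as a stand-in for the rule mining:

```python
import time


def slow_rules(n):
    """Hypothetical stand-in for a rule generator: yields n items, slowly."""
    for i in range(n):
        time.sleep(0.01)
        yield i


start = time.perf_counter()
rules = slow_rules(5)            # returns immediately, nothing is computed yet
creation_seconds = time.perf_counter() - start

start = time.perf_counter()
results = list(rules)            # the work happens here
consumption_seconds = time.perf_counter() - start

leftover = list(rules)           # a generator can only be consumed once
print(results, leftover)
```

Because the generator is exhausted after the first pass, the list has to be kept if the rules are needed again.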

In [40]:
# Convert the results into a DataFrame for further processing

df_results = pd.DataFrame(results)
df_results.head(10)

Unnamed: 0,items,support,ordered_statistics
0,"(12 pencils small tube red retrospot, 12 pencil small tube woodland)",0.00673,"[((12 pencil small tube woodland), (12 pencils small tube red retrospot), 0.37464788732394366, 2..."
1,"(12 pencils small tube skull, 12 pencil small tube woodland)",0.005515,"[((12 pencil small tube woodland), (12 pencils small tube skull), 0.3070422535211268, 17.0931719..."
2,"(lunch bag red retrospot, 12 pencil small tube woodland)",0.005161,"[((12 pencil small tube woodland), (lunch bag red retrospot), 0.28732394366197184, 3.63067973055..."
3,"(pack of 72 retrospot cake cases, 12 pencil small tube woodland)",0.005262,"[((12 pencil small tube woodland), (pack of 72 retrospot cake cases), 0.2929577464788733, 4.3861..."
4,"(paper chain kit 50's christmas, 12 pencil small tube woodland)",0.005819,"[((12 pencil small tube woodland), (paper chain kit 50's christmas), 0.323943661971831, 5.519050..."
5,"(plasters in tin woodland animals, 12 pencil small tube woodland)",0.00506,"[((12 pencil small tube woodland), (plasters in tin woodland animals), 0.28169014084507044, 7.90..."
6,"(set 12 colour pencils spaceboy, 12 pencil small tube woodland)",0.005465,"[((12 pencil small tube woodland), (set 12 colour pencils spaceboy), 0.3042253521126761, 21.3205..."
7,"(vintage snap cards, 12 pencil small tube woodland)",0.005111,"[((12 pencil small tube woodland), (vintage snap cards), 0.2845070422535212, 6.072043926626716),..."
8,"(12 pencils small tube red retrospot, 12 pencils small tube skull)",0.00759,"[((12 pencils small tube red retrospot), (12 pencils small tube skull), 0.4143646408839779, 23.0..."
9,"(12 pencils tall tube skulls, 12 pencils tall tube red retrospot)",0.005819,"[((12 pencils tall tube red retrospot), (12 pencils tall tube skulls), 0.4291044776119403, 33.12..."


In [41]:
df_results.shape

(11448, 3)

In [42]:
# Keep support in a separate Series so it can be used later

support = df_results.support

In [43]:
'''
Convert ordered_statistics into a flat format.
Each entry can hold several rules for the same item set (for example lhs => rhs as well as rhs => lhs).
For convenience only the first one is kept: df_results['ordered_statistics'][i][0]
''' 

# Four empty lists that will hold lhs, rhs, confidence and lift respectively.

first_values = []
second_values = []
third_values = []
fourth_value = []

# Loop over the rows and append each value to its list.
# The first and second elements are frozensets, so convert them to lists.

for i in range(df_results.shape[0]):
    single_list = df_results['ordered_statistics'][i][0]
    first_values.append(list(single_list[0]))
    second_values.append(list(single_list[1]))
    third_values.append(single_list[2])
    fourth_value.append(single_list[3])
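
An alternative that keeps every rule, not only the first one per item set, is sketched below. The two namedtuples are stand-ins that mirror the field names visible in the printed results, defined here so the sketch runs without apyori; `flatten` is a hypothetical helper, and the values are copied from the first record shown above.

```python
from collections import namedtuple

# Stand-ins mirroring the field names visible in the printed results.
RelationRecord = namedtuple("RelationRecord", ["items", "support", "ordered_statistics"])
OrderedStatistic = namedtuple("OrderedStatistic", ["items_base", "items_add", "confidence", "lift"])

record = RelationRecord(
    items=frozenset({"12 pencils small tube red retrospot", "12 pencil small tube woodland"}),
    support=0.006729747507969438,
    ordered_statistics=[
        OrderedStatistic(frozenset({"12 pencil small tube woodland"}),
                         frozenset({"12 pencils small tube red retrospot"}),
                         0.37464788732394366, 20.453497782273754),
        OrderedStatistic(frozenset({"12 pencils small tube red retrospot"}),
                         frozenset({"12 pencil small tube woodland"}),
                         0.3674033149171271, 20.453497782273754),
    ],
)


def flatten(records):
    """Return one dict per rule, with lhs and rhs joined into single strings."""
    rows = []
    for rec in records:
        for stat in rec.ordered_statistics:
            rows.append({
                "lhs": ", ".join(sorted(stat.items_base)),
                "rhs": ", ".join(sorted(stat.items_add)),
                "support": rec.support,
                "confidence": stat.confidence,
                "lift": stat.lift,
            })
    return rows


rows = flatten([record])
print(len(rows), rows[0]["lhs"], "=>", rows[0]["rhs"])
```

Passing `rows` to `pd.DataFrame` would give a table with a fixed set of five columns, however many items a rule contains.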

In [44]:
# Convert all four lists into DataFrames for further processing

lhs = pd.DataFrame(first_values)
rhs= pd.DataFrame(second_values)
confidance=pd.DataFrame(third_values,columns=['Confidance'])
lift=pd.DataFrame(fourth_value,columns=['lift'])


In [45]:
# Concatenate all the parts into a single DataFrame

df_final = pd.concat([lhs,rhs,support,confidance,lift], axis=1)
df_final

Unnamed: 0,0,0.1,1,2,3,4,support,Confidance,lift
0,12 pencil small tube woodland,12 pencils small tube red retrospot,,,,,0.006730,0.374648,20.453498
1,12 pencil small tube woodland,12 pencils small tube skull,,,,,0.005515,0.307042,17.093172
2,12 pencil small tube woodland,lunch bag red retrospot,,,,,0.005161,0.287324,3.630680
3,12 pencil small tube woodland,pack of 72 retrospot cake cases,,,,,0.005262,0.292958,4.386155
4,12 pencil small tube woodland,paper chain kit 50's christmas,,,,,0.005819,0.323944,5.519051
...,...,...,...,...,...,...,...,...,...
11443,charlotte bag suki design,strawberry charlotte bag,woodland charlotte bag,red retrospot charlotte bag,pack of 72 retrospot cake cases,regency cakestand 3 tier,0.005364,0.120181,20.300386
11444,green regency teacup and saucer,regency tea plate roses,regency tea plate pink,pink regency teacup and saucer,regency tea plate green,roses regency teacup and saucer,0.006831,0.133005,18.642385
11445,herb marker basil,herb marker chives,herb marker thyme,herb marker mint,herb marker parsley,herb marker rosemary,0.007944,0.648760,75.866570
11446,jumbo bag baroque black white,jumbo storage bag suki,jumbo shopper vintage red paisley,jumbo bag woodland animals,jumbo bag red retrospot,jumbo bag pink polkadot,0.005009,0.106109,12.051946
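
Two of the rows shown above (confidence 0.120181 and 0.106109) fall below the 0.2 confidence requested in the `apriori` call. A threshold can also be applied after mining. A minimal sketch on four rows copied from the table, reduced to the first lhs item and the three measures:

```python
# Four rows copied from the table above: (lhs, support, confidence, lift).
rows = [
    ("12 pencil small tube woodland", 0.006730, 0.374648, 20.453498),
    ("charlotte bag suki design", 0.005364, 0.120181, 20.300386),
    ("herb marker basil", 0.007944, 0.648760, 75.866570),
    ("jumbo bag baroque black white", 0.005009, 0.106109, 12.051946),
]

MIN_CONFIDENCE = 0.2

# Keep only the rules whose confidence meets the threshold.
strong = [row for row in rows if row[2] >= MIN_CONFIDENCE]
print([row[0] for row in strong])
```

On the full frame the equivalent filter would be `df_final[df_final['Confidance'] >= 0.2]`, using the column name as it is spelled in this notebook.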


In [46]:
'''
 Some rules have only 1 item on a side and others have 3 or more, so the shorter rows contain None in the extra columns.
 Replace None with ' ' so that, when the item columns are combined into one, only the real items remain, padded with spaces rather than None.
 Example: coffee,None,None is converted to coffee, ,
'''

df_final.fillna(value=' ', inplace=True)

In [48]:
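# Set column names. lhs and rhs were expanded into one column per item, so build
# the names from their actual widths instead of assuming 7 columns in total.

lhs_cols = ['lhs_%d' % (i + 1) for i in range(lhs.shape[1])]
rhs_cols = ['rhs_%d' % (i + 1) for i in range(rhs.shape[1])]
df_final.columns = lhs_cols + rhs_cols + ['support', 'confidance', 'lift']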