# **Market Basket Analysis - Exploring eCommerce data**
Source: https://www.kaggle.com/code/ostrowski/market-basket-analysis-exploring-e-commerce-data

**This notebook is structured as follows:**

    1. Loading libraries and data
    A: Analysis for Students
      A2. Frequent sets and association rules with apriori
      A3. Conclusions
    B: Analysis for Teachers
      B2. Frequent sets and association rules with apriori
      B3. Conclusions
    C: Analysis for Parents
      C2. Frequent sets and association rules with apriori
      C3. Conclusions


# 1. Loading libraries and data:

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed

# Loading libraries for python
import numpy as np # Linear algebra
import pandas as pd # Data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # Data visualization
import seaborn as sns # Advanced data visualization
import re # Regular expressions for advanced string selection
from mlxtend.frequent_patterns import apriori # Data pattern exploration
from mlxtend.frequent_patterns import association_rules # Association rules conversion
from mlxtend.preprocessing import TransactionEncoder # Encoding transaction lists for apriori (OnehotTransactions was deprecated and removed from recent mlxtend releases)
import missingno as msno # Advanced missing values handling
%matplotlib inline

# **Analysis of responses from Students**

**A2 Loading data**

In [None]:
# Reading input, converting Timestamp to datetime, and setting index: df
# Import the complete CSV file
df = pd.read_csv('../content/MB_analysisStudentOnlyBaskets1.csv')
df.Timestamp = pd.to_datetime(df.Timestamp)
# Set this column as the index column
df.set_index('Timestamp', inplace=True)
df.sample(5, random_state=42)
df.Features = df.Features.astype('category')
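
A minimal sketch of the same loading step on made-up data, assuming the CSV is in long format with one row per answer and one shared `Timestamp` per respondent. The in-memory CSV and its values are hypothetical and not taken from the survey files. Passing `parse_dates` and `index_col` to `pd.read_csv` handles the datetime conversion and the indexing in a single call:

```python
import io
import pandas as pd

# Hypothetical stand-in for the survey CSV: one row per (respondent, answer).
toy_csv = io.StringIO(
    "Timestamp,Features\n"
    "2022-10-16 21:48:42,College or University\n"
    "2022-10-16 21:48:42,Mobile phone\n"
    "2022-10-16 21:48:42,Power cuts\n"
    "2022-10-21 16:48:50,Desktop\n"
    "2022-10-21 16:48:50,Power cuts\n"
)

# Parse the Timestamp column as datetime and use it as the index in one step.
toy_df = pd.read_csv(toy_csv, parse_dates=["Timestamp"], index_col="Timestamp")

# Each distinct timestamp is treated as one basket (one respondent).
answers_per_respondent = toy_df.groupby(level=0).size()
print(answers_per_respondent)
```

Because the timestamp doubles as the basket identifier, two respondents who submitted in the same second would be merged into one basket; a dedicated respondent ID column, if available, would avoid that.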

**A3 Frequent sets and association rules with apriori:**

Generating the market basket list

In [None]:
print(df)
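# Preparing df for association mining
# Strip stray whitespace from Features so identical answers aggregate together
# (plain column assignment replaces the categorical column instead of writing into it)
df['Features'] = df['Features'].str.strip()
print(df)
print(df.loc[:,'Features'])

# Dummy coding and creation of basket_sets, indexed by timestamp, with 1 marking every answer present in a respondent's basket
basket = pd.get_dummies(df.reset_index().loc[:, ['Timestamp', 'Features']])
basket_sets = pd.pivot_table(basket, index='Timestamp', aggfunc='sum')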
display(basket_sets)

                                                                    Features
Timestamp                                                                   
2022-10-16 21:48:42-05:30                              College or University
2022-10-16 21:48:42-05:30                                       Mobile phone
2022-10-16 21:48:42-05:30                      Attended but not very regular
2022-10-16 21:48:42-05:30   More convenient to contact teachers from home...
2022-10-16 21:48:42-05:30                            Internet unavailability
...                                                                      ...
2022-11-25 13:18:19-05:30                            Internet unavailability
2022-11-25 13:18:19-05:30                                         Power cuts
2022-11-25 13:18:19-05:30   Hybrid classes should be a regular provision ...
2022-11-25 13:18:19-05:30                                                  4
2022-11-25 13:18:19-05:30   Hybrid mode (There is option to either attend...

Timestamp,Features_1,Features_2,Features_3,Features_4,Features_5,Features_Attended but not very regular,Features_College or University,Features_Desktop,Features_Distraction was very high,Features_Distractions,...,Features_Network issue,"Features_Offline learning ( Without any of video conferencing , recorded lectures, and online exams )","Features_Online classes ( Through video conferencing or recorded lectures, and online exams )",Features_Power cuts,Features_Primary or Middle School (till 5 th grade),Features_Rarely,Features_Recorded lectures made daily schedule flexible,Features_Regularly or most often,Features_Syllabus was reduced,Features_Unfamiliarity with technology
2022-10-16 21:48:42-05:30,0,0,1,0,0,1,1,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2022-10-21 16:48:50-05:30,0,0,0,1,0,0,0,1,0,0,...,0,0,0,0,0,1,1,0,0,1
2022-10-21 16:49:24-05:30,0,1,0,0,0,1,0,0,0,0,...,0,1,0,1,0,0,1,0,0,0
2022-10-21 16:50:01-05:30,0,1,0,0,0,1,1,0,0,0,...,0,0,1,1,0,0,1,0,0,1
2022-10-26 20:16:42-05:30,0,0,0,1,0,1,1,0,0,0,...,0,0,1,1,0,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-11-01 21:39:38-05:30,0,0,1,0,0,0,1,0,0,0,...,0,0,0,0,0,1,1,0,0,0
2022-11-02 01:06:39-05:30,0,0,0,1,0,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2022-11-08 13:16:32-05:30,1,0,0,0,0,1,1,0,0,0,...,0,1,0,0,0,0,0,0,1,0
2022-11-12 16:06:46-05:30,0,0,0,0,1,1,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
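
The reshaping from long format to a basket matrix can be sketched on a toy frame. The timestamps and answers below are made up for illustration, and `pd.crosstab` is a swapped-in alternative to the notebook's `get_dummies` plus `pivot_table` combination, not the method used above. The `> 0` comparison turns counts into booleans, so an answer that appears twice for the same respondent still yields a single presence flag:

```python
import pandas as pd

# Hypothetical long-format survey data: one row per (respondent, answer).
toy = pd.DataFrame({
    "Timestamp": ["t1", "t1", "t1", "t2", "t2", "t3"],
    "Features": ["Mobile phone", "Power cuts", "Power cuts ",
                 "Laptop", "Power cuts", "Mobile phone"],
})

# Strip whitespace so "Power cuts " and "Power cuts" count as the same answer.
toy["Features"] = toy["Features"].str.strip()

# Rows are baskets (respondents), columns are answers, cells are occurrence counts.
counts = pd.crosstab(toy["Timestamp"], toy["Features"])

# Presence/absence matrix: True where the answer occurs at least once.
toy_basket = counts > 0
print(toy_basket)
```

A boolean matrix of this shape is the kind of input the frequent-itemset step expects: one row per basket, one column per item.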


In [None]:
# Apriori application: frequent_itemsets
# Note: a low min_support (0.1 here) admits itemsets backed by few responses, which can produce spurious rules; more on this in the conclusions section
frequent_itemsets = apriori(basket_sets, min_support=0.1, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)

# Keep only itemsets with more than one item
# (the support >= 0.02 condition is already implied by min_support=0.1)
frequent_itemsets[ (frequent_itemsets['length'] > 1) &
                   (frequent_itemsets['support'] >= 0.02) ]

index,support,itemsets,length
24,0.102564,"(Features_2, Features_Internet unavailability)",2
25,0.102564,(Features_Recorded lectures made daily schedul...,2
26,0.166667,"(Features_College or University, Features_3)",2
27,0.102564,(Features_Hybrid classes should be a regular p...,2
28,0.141026,(Features_Hybrid mode (There is option to eith...,2
...,...,...,...
593,0.115385,(Features_More study resources helped prepare ...,5
594,0.102564,(Features_Recorded lectures made daily schedul...,5
595,0.115385,(Features_Recorded lectures made daily schedul...,6
596,0.102564,(Features_Recorded lectures made daily schedul...,6
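
The `support` column is the fraction of baskets that contain every item of an itemset. A minimal sketch of that calculation on a toy boolean matrix follows; the helper `itemset_support` and the columns `A`, `B`, `C` are hypothetical names introduced for illustration:

```python
import pandas as pd

def itemset_support(basket, items):
    """Fraction of baskets (rows) that contain every item in `items`."""
    return basket[list(items)].all(axis=1).mean()

# Hypothetical presence/absence matrix: 4 baskets, 3 items.
toy_basket = pd.DataFrame(
    {"A": [1, 1, 0, 1], "B": [1, 0, 0, 1], "C": [0, 1, 1, 1]}
).astype(bool)

# A and B occur together in the first and last basket: 2 of 4.
support_ab = itemset_support(toy_basket, ["A", "B"])

# All three items occur together only in the last basket: 1 of 4.
support_abc = itemset_support(toy_basket, ["A", "B", "C"])
print(support_ab, support_abc)
```

Read this way, the printed supports above are consistent with 78 student baskets (0.102564 ≈ 8/78, 0.166667 ≈ 13/78). That count is an inference from the printed values, not something the notebook states.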


In [None]:
# Generating the association_rules: rules
# Selecting the important parameters for analysis
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
#print(rules)
rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']].sort_values('confidence', ascending=False)

index,antecedents,consequents,support,confidence,lift
3734,(Features_Recorded lectures made daily schedul...,(Features_Internet unavailability),0.102564,1.000000,1.625000
224,"(Features_High School (till 12 th grade), Feat...",(Features_Recorded lectures made daily schedul...,0.102564,1.000000,1.392857
3595,(Features_More study resources helped prepare ...,(Features_Hybrid mode (There is option to eith...,0.102564,1.000000,1.392857
3645,(Features_Recorded lectures made daily schedul...,(Features_Hybrid mode (There is option to eith...,0.102564,1.000000,1.392857
1967,(Features_Hybrid classes should be conducted o...,(Features_Hybrid mode (There is option to eith...,0.102564,1.000000,1.392857
...,...,...,...,...,...
599,(Features_College or University),"(Features_Syllabus was reduced, Features_Lapto...",0.102564,0.126984,1.100529
1912,(Features_College or University),(Features_More study resources helped prepare ...,0.102564,0.126984,1.100529
3391,(Features_College or University),(Features_Recorded lectures made daily schedul...,0.102564,0.126984,1.238095
993,(Features_College or University),"(Features_Internet unavailability, Features_4,...",0.102564,0.126984,1.100529
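
For a rule "antecedent → consequent", confidence is support(antecedent and consequent) divided by support(antecedent), and lift is confidence divided by support(consequent). The sketch below re-derives the first row of the table. The basket count of 78 and the consequent support of 48/78 are assumptions back-calculated from the printed values (0.102564 ≈ 8/78 and 1/1.625 ≈ 48/78), not figures reported by the notebook:

```python
def confidence(support_both, support_antecedent):
    """Share of baskets with the antecedent that also contain the consequent."""
    return support_both / support_antecedent

def lift(support_both, support_antecedent, support_consequent):
    """Confidence relative to the consequent's baseline frequency."""
    return confidence(support_both, support_antecedent) / support_consequent

# Assumed values, back-derived from the printed rule (support 0.102564, confidence 1.0, lift 1.625).
support_both = 8 / 78         # antecedent and consequent together
support_antecedent = 8 / 78   # confidence 1.0 implies the antecedent never occurs alone
support_consequent = 48 / 78  # "Internet unavailability" on its own

rule_confidence = confidence(support_both, support_antecedent)
rule_lift = lift(support_both, support_antecedent, support_consequent)
print(round(rule_confidence, 6), round(rule_lift, 6))
```

A lift above 1 means the consequent is more common among baskets containing the antecedent than in the data overall; a confidence of 1.0 that rests on 8 baskets is still a small sample.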


# **Analysing Teachers' responses**

**B2 Loading data**

In [None]:
# Reading input, converting Timestamp to datetime, and setting index: df2
# Import the complete CSV file
df2 = pd.read_csv('../content/MB_analysisTeacherOnlyBaskets1.csv')

df2.Timestamp = pd.to_datetime(df2.Timestamp)
# Set this column as the index column
df2.set_index('Timestamp', inplace=True)
# Checking df2.sample() for quick evaluation of entries
df2.sample(5, random_state=42)
df2.Features = df2.Features.astype('category')

**B3 Frequent sets and association rules with apriori:**

Generating the market basket list

In [None]:
print(df2)
# Preparing df2 for association mining
# Strip stray whitespace from Features so identical answers aggregate together
# (plain column assignment replaces the categorical column instead of writing into it)
df2['Features'] = df2['Features'].str.strip()
print(df2)
print(df2.loc[:,'Features'])

# Dummy coding and creation of basket_sets2, indexed by timestamp, with 1 marking every answer present in a respondent's basket
# Note that only the presence or absence of an answer is considered
basket2 = pd.get_dummies(df2.reset_index().loc[:, ['Timestamp', 'Features']])
basket_sets2 = pd.pivot_table(basket2, index='Timestamp', aggfunc='sum')
display(basket_sets2)

                                                                    Features
Timestamp                                                                   
2022-10-29 04:09:26-05:30                              College or university
2022-10-29 04:09:26-05:30  Online resources helped demonstrate concepts m...
2022-10-29 04:09:26-05:30                                   Poor interaction
2022-10-29 04:09:26-05:30                            They asked fewer doubts
2022-10-29 04:09:26-05:30  Hybrid classes should be conducted occasionall...
...                                                                      ...
2022-11-22 14:20:13-05:30  Offline learning ( Without any of video confer...
2022-11-28 11:47:12-05:30                              College or university
2022-11-28 11:47:12-05:30                            They asked fewer doubts
2022-11-28 11:47:12-05:30  Hybrid classes should be conducted only during...
2022-11-28 11:47:12-05:30  Offline learning ( Without any of video confer...

Timestamp,Features_A constant difficulty to judge and evaluate learning from what's is being taught. Practicals are not possible,Features_Can't say,Features_College or university,Features_Flexibility in conducting classes,Features_High School (till 12 th grade),Features_Hybrid classes should be a regular provision to incorporate digital literacy,Features_Hybrid classes should be conducted occasionally to supplement regular modes of teaching,Features_Hybrid classes should be conducted only during emergencies,"Features_Hybrid mode (There is option to either attend live lectures in person or view recordings , exams conducted at centers)",Features_Internet Banwidth Issues,...,"Features_Offline learning ( Without any of video conferencing , recorded lectures, and online exams )",Features_Online assignments and tests are easier to evaluate,Features_Online resources helped demonstrate concepts more effectively,Features_Poor interaction,Features_Power cuts,Features_Primary or Middle School (till 5 th grade),Features_Saved time and effort during lectures,Features_They asked fewer doubts,Features_They asked more doubts,Features_Unfamiliarity with technology
2022-10-29 04:09:26-05:30,0,0,1,0,0,0,1,0,1,0,...,0,0,1,1,0,0,0,1,0,0
2022-10-29 11:20:26-05:30,0,0,1,1,0,0,1,0,1,0,...,0,0,0,1,0,0,0,1,0,0
2022-10-29 15:20:35-05:30,0,0,0,1,1,0,1,0,0,0,...,1,1,0,1,0,0,1,0,0,0
2022-10-29 15:42:36-05:30,0,1,0,0,0,0,0,1,0,0,...,1,0,1,1,1,1,0,0,0,1
2022-10-29 17:51:21-05:30,0,0,0,1,1,0,1,0,1,0,...,0,1,1,0,1,0,1,0,0,0
2022-10-29 18:33:52-05:30,0,0,1,1,0,0,0,1,0,0,...,1,0,0,1,0,0,1,0,0,0
2022-10-29 21:57:25-05:30,0,0,0,1,0,0,1,0,0,0,...,1,0,0,0,0,1,0,1,0,0
2022-10-30 03:55:48-05:30,0,0,1,0,0,0,0,1,0,0,...,1,0,1,0,0,0,0,0,0,0
2022-10-30 08:04:06-05:30,1,0,1,0,0,0,0,1,0,0,...,1,0,0,1,0,0,1,1,0,0
2022-10-30 09:11:38-05:30,0,0,0,0,1,0,0,1,1,0,...,0,0,1,0,1,0,0,0,1,0


**B4 Association rules**

In [None]:
# Apriori application: frequent_itemsets
# Note: a low min_support admits itemsets backed by few responses, which can produce spurious rules (0.20 is used here); more on this in the conclusions section
frequent_itemsets2 = apriori(basket_sets2, min_support=0.20, use_colnames=True)
frequent_itemsets2['length'] = frequent_itemsets2['itemsets'].apply(len)

# Keep only itemsets with more than one item
# (the support >= 0.02 condition is already implied by min_support=0.20)
frequent_itemsets2[ (frequent_itemsets2['length'] > 1) &
                   (frequent_itemsets2['support'] >= 0.02) ]

index,support,itemsets,length
13,0.47619,"(Features_College or university, Features_Flex...",2
14,0.333333,"(Features_College or university, Features_Hybr...",2
15,0.333333,"(Features_College or university, Features_Hybr...",2
16,0.285714,"(Features_College or university, Features_It w...",2
17,0.285714,"(Features_College or university, Features_Low ...",2
18,0.333333,(Features_Offline learning ( Without any of vi...,2
19,0.285714,"(Features_College or university, Features_Onli...",2
20,0.428571,"(Features_Poor interaction, Features_College o...",2
21,0.238095,"(Features_College or university, Features_Save...",2
22,0.380952,"(Features_College or university, Features_They...",2
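
A relative `min_support` is easier to judge once it is translated into a number of respondents. The sketch below does that conversion. The helper `min_basket_count` is hypothetical, and the basket counts (78 students, 21 teachers, 68 parents) are assumptions inferred from the printed supports in this notebook (for example 0.47619 ≈ 10/21), not values the notebook reports:

```python
import math

def min_basket_count(min_support, n_baskets):
    """Smallest number of baskets an itemset needs in order to reach min_support."""
    # The small tolerance guards against floating-point noise when the product is a whole number.
    return math.ceil(min_support * n_baskets - 1e-9)

# Assumed basket counts, inferred from the printed support values.
students_needed = min_basket_count(0.10, 78)
teachers_needed = min_basket_count(0.20, 21)
parents_needed = min_basket_count(0.03, 68)
print(students_needed, teachers_needed, parents_needed)
```

Under these assumptions an itemset counts as frequent with as few as 8 student, 5 teacher, or 3 parent responses, which is worth keeping in mind when reading rules with confidence 1.0.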


In [None]:
# Generating the association_rules: rules
# Selecting the important parameters for analysis
rules2 = association_rules(frequent_itemsets2, metric="lift", min_threshold=1)
#print(rules)
rules2[['antecedents', 'consequents', 'support', 'confidence', 'lift']].sort_values('confidence', ascending=False)

index,antecedents,consequents,support,confidence,lift
97,"(Features_Poor interaction, Features_Hybrid mo...",(Features_College or university),0.238095,1.000000,1.500000
122,(Features_Saved time and effort during lecture...,(Features_Flexibility in conducting classes),0.238095,1.000000,1.400000
130,"(Features_Low attendance, Features_Saved time ...",(Features_Flexibility in conducting classes),0.238095,1.000000,1.400000
68,"(Features_College or university, Features_Low ...",(Features_Flexibility in conducting classes),0.285714,1.000000,1.400000
108,"(Features_Low attendance, Features_Hybrid mode...",(Features_Flexibility in conducting classes),0.238095,1.000000,1.400000
...,...,...,...,...,...
113,(Features_Flexibility in conducting classes),"(Features_Low attendance, Features_Hybrid mode...",0.238095,0.333333,1.400000
135,(Features_Flexibility in conducting classes),"(Features_Low attendance, Features_Saved time ...",0.238095,0.333333,1.400000
66,(Features_Flexibility in conducting classes),"(Features_College or university, Features_It w...",0.238095,0.333333,1.166667
139,(Features_Flexibility in conducting classes),"(Features_Poor interaction, Features_Offline l...",0.238095,0.333333,1.000000
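
Rule tables like the one above are hard to read because the itemsets are frozensets with a `Features_` prefix on every item. A minimal sketch of tidying and filtering such a table follows; the two-row `toy_rules` frame, the helper `tidy`, and the 0.8 confidence cutoff are illustrative assumptions rather than part of the original analysis:

```python
import pandas as pd

# Hypothetical rules frame with the same column layout as the association rules output.
toy_rules = pd.DataFrame({
    "antecedents": [
        frozenset({"Features_Low attendance"}),
        frozenset({"Features_Flexibility in conducting classes"}),
    ],
    "consequents": [
        frozenset({"Features_Flexibility in conducting classes"}),
        frozenset({"Features_Low attendance", "Features_Poor interaction"}),
    ],
    "confidence": [1.0, 0.333333],
    "lift": [1.4, 1.4],
})

def tidy(itemset, prefix="Features_"):
    """Render an itemset as a sorted, comma-separated string without the dummy prefix."""
    names = (i[len(prefix):] if i.startswith(prefix) else i for i in itemset)
    return ", ".join(sorted(names))

# Replace the frozensets with readable strings.
readable = toy_rules.assign(
    antecedents=toy_rules["antecedents"].map(tidy),
    consequents=toy_rules["consequents"].map(tidy),
)

# Keep only high-confidence rules (the cutoff is an arbitrary example value).
strong = readable[readable["confidence"] >= 0.8]
print(strong)
```

Sorting the item names makes the strings stable across runs, since frozensets have no guaranteed iteration order.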


# **Analysing Parents' responses, with a focus on the level of concern about academic competence**

**C2 Loading data**

In [None]:
# Reading input, converting Timestamp to datetime, and setting index: df3
# Import the complete CSV file
df3 = pd.read_csv('../content/MB_analysisParentsCon.csv')

df3.Timestamp = pd.to_datetime(df3.Timestamp)
# Set this column as the index column
df3.set_index('Timestamp', inplace=True)
# Checking df3.sample() for quick evaluation of entries
df3.sample(5, random_state=42)
df3.Features = df3.Features.astype('category')

**C3 Frequent sets and association rules with apriori:**

Generating the market basket list

In [None]:
print(df3)
# Preparing df3 for association mining
# Strip stray whitespace from Features so identical answers aggregate together
# (plain column assignment replaces the categorical column instead of writing into it)
df3['Features'] = df3['Features'].str.strip()
print(df3)
print(df3.loc[:,'Features'])

# Dummy coding and creation of basket_sets3, indexed by timestamp, with 1 marking every answer present in a respondent's basket
# Note that only the presence or absence of an answer is considered
basket3 = pd.get_dummies(df3.reset_index().loc[:, ['Timestamp', 'Features']])
basket_sets3 = pd.pivot_table(basket3, index='Timestamp', aggfunc='sum')
display(basket_sets3)

                                                                    Features
Timestamp                                                                   
2022-10-16 21:19:36-05:30                             College or University 
2022-10-16 21:19:36-05:30                            No ,we adjusted easily 
2022-10-16 21:19:36-05:30                      Communication gap with peers 
2022-10-16 21:19:36-05:30                     Not having proper environment 
2022-10-16 21:19:36-05:30        Internet problem during studies and exams. 
...                                                                      ...
2022-12-12 10:17:09-05:30                                 Lack of Attention 
2022-12-12 10:17:09-05:30                     Not having proper environment 
2022-12-12 10:17:09-05:30                                                 4 
2022-12-12 10:17:09-05:30  Hybrid classes should be a regular provision t...
2022-12-12 10:17:09-05:30  Hybrid mode (There is option to either attend ...

Timestamp,Features_1,Features_2,Features_3,Features_4,Features_5,Features_College or University,Features_Communication gap with peers,Features_High School (till 12 th grade),Features_Hybrid classes should be a regular provision to incorporate digital literacy,Features_Hybrid classes should be conducted occasionally to supplement regular teaching,...,Features_Lack of Attention,Features_Network issues,"Features_No ,we adjusted easily",Features_Not getting coffee frequently,Features_Not having proper environment,"Features_Offline learning ( Without any of video conferencing , recorded lectures, and online exams )","Features_Online classes ( Through video conferencing or recorded lectures, and online exams )",Features_Primary or Middle School (till 5th grade),Features_Teachers were not available to discuss doubts,"Features_Yes ,it had a considerable impact"
2022-10-16 21:19:36-05:30,0,0,1,0,0,1,1,0,1,0,...,0,0,1,0,1,0,0,0,0,0
2022-10-20 22:37:03-05:30,0,0,1,0,0,0,1,1,1,0,...,0,0,1,0,0,0,0,0,0,0
2022-10-26 20:47:16-05:30,0,1,0,0,0,1,0,0,1,0,...,0,0,1,0,1,0,0,0,0,0
2022-10-28 23:21:54-05:30,0,0,1,0,0,1,1,0,1,0,...,1,0,1,0,0,1,0,0,0,0
2022-10-28 23:22:15-05:30,0,0,1,0,0,1,1,0,0,1,...,0,1,1,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2022-11-01 07:19:56-05:30,0,1,0,0,0,0,1,0,0,0,...,1,0,0,0,1,1,0,1,0,1
2022-11-01 07:49:35-05:30,0,0,0,0,1,0,1,0,0,1,...,1,0,0,0,1,1,0,1,0,1
2022-11-01 09:08:53-05:30,0,0,0,0,1,0,0,0,0,1,...,1,0,1,0,1,0,0,1,0,0
2022-11-01 10:49:17-05:30,0,0,1,0,0,0,1,1,0,1,...,1,0,1,0,1,1,0,0,0,0


In [None]:
# Apriori application: frequent_itemsets
# Note: min_support is set to a very low value (0.03), which admits itemsets backed by few responses and can produce spurious rules; more on this in the conclusions section
frequent_itemsets3 = apriori(basket_sets3, min_support=0.03, use_colnames=True)
frequent_itemsets3['length'] = frequent_itemsets3['itemsets'].apply(len)

# Keep only itemsets with more than one item
# (the support >= 0.02 condition is already implied by min_support=0.03)
frequent_itemsets3[ (frequent_itemsets3['length'] > 1) &
                   (frequent_itemsets3['support'] >= 0.02) ]

index,support,itemsets,length
20,0.073529,"(Features_No ,we adjusted easily, Features_1)",2
21,0.044118,"(Features_College or University, Features_2)",2
22,0.058824,"(Features_2, Features_Communication gap with p...",2
23,0.044118,"(Features_2, Features_Hybrid classes should be...",2
24,0.058824,(Features_Hybrid mode (There is option to eith...,2
...,...,...,...
1009,0.044118,(Features_Offline learning ( Without any of vi...,6
1010,0.058824,(Features_Offline learning ( Without any of vi...,6
1011,0.044118,(Features_Offline learning ( Without any of vi...,6
1012,0.044118,(Features_Offline learning ( Without any of vi...,6
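
With a threshold this low the number of frequent itemsets grows quickly (the index above runs past 1000), and long itemsets account for much of that growth. The sketch below uses a brute-force enumeration, not the Apriori pruning strategy itself, to show how capping the itemset length shrinks the result. The function `frequent_itemsets_bruteforce` and the toy baskets are hypothetical; mlxtend's `apriori` exposes a comparable cap through its `max_len` parameter:

```python
from itertools import combinations

def frequent_itemsets_bruteforce(baskets, min_support, max_len):
    """Enumerate every itemset of up to max_len items and keep those reaching min_support."""
    n = len(baskets)
    items = sorted(set().union(*baskets))
    result = {}
    for k in range(1, max_len + 1):
        for candidate in combinations(items, k):
            # Count the baskets that contain every item of the candidate.
            count = sum(1 for b in baskets if set(candidate) <= b)
            if count / n >= min_support:
                result[candidate] = count / n
    return result

# Hypothetical baskets: four respondents, three possible answers.
toy_baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]

short = frequent_itemsets_bruteforce(toy_baskets, min_support=0.25, max_len=2)
full = frequent_itemsets_bruteforce(toy_baskets, min_support=0.25, max_len=3)
print(len(short), len(full))
```

Brute force is only practical for a handful of items, since the number of candidates grows combinatorially; it is used here purely to make the effect of the length cap visible.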


In [None]:
# Generating the association_rules: rules
# Selecting the important parameters for analysis
rules3 = association_rules(frequent_itemsets3, metric="lift", min_threshold=1)
#print(rules)
rules3[['antecedents', 'consequents', 'support', 'confidence', 'lift']].sort_values('confidence', ascending=False)

index,antecedents,consequents,support,confidence,lift
9010,(Features_Offline learning ( Without any of vi...,(Features_Communication gap with peers),0.058824,1.000000,1.743590
10506,(Features_Offline learning ( Without any of vi...,(Features_Lack of Attention),0.058824,1.000000,1.789474
9283,(Features_Primary or Middle School (till 5th g...,(Features_Hybrid mode (There is option to eith...,0.058824,1.000000,3.238095
11278,(Features_Offline learning ( Without any of vi...,(Features_Not having proper environment),0.044118,1.000000,1.581395
9286,(Features_Primary or Middle School (till 5th g...,(Features_Hybrid mode (There is option to eith...,0.058824,1.000000,2.518519
...,...,...,...,...,...
7842,"(Features_No ,we adjusted easily)",(Features_Hybrid classes should be conducted o...,0.044118,0.054545,1.236364
2070,"(Features_No ,we adjusted easily)",(Features_Primary or Middle School (till 5th g...,0.044118,0.054545,1.236364
7401,"(Features_No ,we adjusted easily)","(Features_5, Features_Offline learning ( Witho...",0.044118,0.054545,1.236364
2003,"(Features_No ,we adjusted easily)","(Features_High School (till 12 th grade), Feat...",0.044118,0.054545,1.236364
