## Data Mining / Prospeção de Dados

## Diogo Soares and Sara C. Madeira, 2020/21

# Project 1 - Pattern Mining

## Logistics 
**_Read Carefully_**

**Students should work in teams of 2 or 3 people**. 

**TASK 3 - Spring vs Summer Purchases** must be done only by groups of 3 people.

Individual projects might be allowed (with valid justification), but will not have better grades for this reason. 

The quality of the project will dictate its grade, not the number of people working.

**The project's solution should be uploaded in Moodle before the end of `March, 28th (23:59)`.** 

Students should **upload a `.zip` file** containing all the files necessary for project evaluation. 
Groups should be registered in [Moodle](https://moodle.ciencias.ulisboa.pt/mod/groupselect/view.php?id=139096) and the zip file should be identified as `PDnn.zip` where `nn` is the number of your group.

**It is mandatory to produce a Jupyter notebook containing code and text/images/tables/etc describing the solution and the results. Projects not delivered in this format will not be graded. You can use `PD_202021_P1.ipynb`as template. In your `.zip` folder you should also include an HTML version of your notebook with all the outputs** (File > Download as > HTML).

**Decisions should be justified and results should be critically discussed.** 

_Project solutions containing only code and outputs without discussions will achieve a maximum grade 10 out of 20._

## Dataset and Tools



In this project you will analyse data from an online Store collected over 4 months (April - July 2014). The folder `data` contains three files that you should use to obtain the dataset to be used in pattern mining. 

The file `store-buys.dat` comprises the buy events of the users over the items. It contains **318.444 sessions**. Each record/line in the file has the following fields (with this order): 

* **Session ID** - the id of the session. In one session there are one or many buying events. Could be represented as an integer number.
* **Timestamp** - the time when the buy occurred. Format of YYYY-MM-DDThh:mm:ss.SSSZ
* **Item ID** – the unique identifier of item that has been bought. Could be represented as an integer number. 
* **Price** – the price of the item. Could be represented as an integer number.
* **Quantity** – the quantity in this buying.  Could be represented as an integer number.

The file `store-clicks.dat` comprises the clicks of the users over the items. It contains **5.613.499 sessions**.  Each record/line in the file has the following fields (with this order):

* **Session ID** – the id of the session. In one session there are one or many clicks. Could be represented as an integer number.
* **Timestamp** – the time when the click occurred. Format of YYYY-MM-DDThh:mm:ss.SSSZ
* **Item ID** – the unique identifier of the item that has been clicked. Could be represented as an integer number.
* **Context** – the context of the click. The value "S" indicates a special offer, "0" indicates  a missing value, a number between 1 to 12 indicates a real category identifier,
any other number indicates a brand. E.g. if an item has been clicked in the context of a promotion or special offer then the value will be "S", if the context was a brand i.e BOSCH,
then the value will be an 8-10 digits number. If the item has been clicked under regular category, i.e. sport, then the value will be a number between 1 to 12. 
 
The file `products.csv` comprises the list of products sold by the online store. It contains **46.294 different products** associated with **123 different subcategories**. Each record/line in the file has the following fields:

* **Item ID** - the unique identifier of the item. Could be represented as an integer number. 
* **Product Categories** - the category and subcategories of the item. It is a string containing the category and subcategories of the item. Eg. `appliances.kitchen.juice`


In this project you should use [Python 3](https://www.python.org), [Jupyter Notebook](http://jupyter.org) and **[MLxtend](http://rasbt.github.io/mlxtend/)**. When using MLxtend, frequent patterns can either be discovered using `Apriori` and `FP-Growth`. **Choose the pattern mining algorithm to be used.** 


## Team Identification

**GROUP NNN**

Students:

* Student 1 - n_student1
* Student 2 - n_student2
* Student 3 - n_student3

## 1. Mining Frequent Itemsets and Association Rules


In this first part of the project you should load and preprocess the dataset  in order to compute frequent itemsets and generate association rules considering all the sessions.

**In what follows keep the following question in mind and be creative!**

1. What are the most interesting products?
2. What are the most bought products?
3. Which products are bought together?
4. Can you find associations between the clicked products? 
5. Can you find associations highliting that when people buy a product/set of products also buy other product(s)?
6. Can you find associations highliting that when people click in a product/set of products also buy this product(s)?
7. Can you find relevant associated categories? 

### 1.1. Load and Preprocess Data

 **Product quantities should not be considered.**

Aqui fizemos os imports

In [1]:
import pandas as pd
import numpy as np
from mlxtend.preprocessing import  TransactionEncoder
from mlxtend.frequent_patterns import apriori
import mlxtend.preprocessing
import mlxtend.frequent_patterns
from mlxtend.frequent_patterns import association_rules

### 1.1.1 Product

The file `products.csv` comprises the list of products sold by the online store. It contains **46.294 different products** associated with **123 different subcategories**. Each record/line in the file has the following fields:

* **Item ID** - the unique identifier of the item. Could be represented as an integer number. 
* **Product Categories** - the category and subcategories of the item. It is a string containing the category and subcategories of the item. Eg. `appliances.kitchen.juice`

In [2]:
product_df = pd.read_csv("products.csv")
product_df.columns = ["Item_ID","Product_Categories"]
product_df.head(5)

Unnamed: 0,Item_ID,Product_Categories
0,214536500,electronics.tablet
1,214536506,electronics.tablet
2,214577561,electronics.audio.headphone
3,214662742,furniture.kitchen.table
4,214662742,furniture.kitchen.table


In [3]:
product_df.shape

(20704558, 2)

In [4]:
pdf = product_df.drop_duplicates()

In [5]:
pdf.shape

(46294, 2)

### 1.1.2 Store-buys

The file `store-buys.dat` comprises the buy events of the users over the items. It contains **318.444 sessions**. Each record/line in the file has the following fields (with this order): 

* **Session ID** - the id of the session. In one session there are one or many buying events. Could be represented as an integer number.
* **Timestamp** - the time when the buy occurred. Format of YYYY-MM-DDThh:mm:ss.SSSZ
* **Item ID** – the unique identifier of item that has been bought. Could be represented as an integer number. 
* **Price** – the price of the item. Could be represented as an integer number.
* **Quantity** – the quantity in this buying.  Could be represented as an integer number.

In [7]:
store_buys = pd.read_csv("store-buys.dat")

store_buys.columns = ["Session_ID","Timestamp","Item_ID", "Price", "Quantity"]

store_buys.head(5)

Unnamed: 0,Session_ID,Timestamp,Item_ID,Price,Quantity
0,420374,2014-04-06T18:44:58.325Z,214537850,10471,1
1,281626,2014-04-06T09:40:13.032Z,214535653,1883,1
2,420368,2014-04-04T06:13:28.848Z,214530572,6073,1
3,420368,2014-04-04T06:13:28.858Z,214835025,2617,1
4,140806,2014-04-07T09:22:28.132Z,214668193,523,1


**Drop quantities**

In [8]:
store_buys = store_buys.drop("Quantity",axis=1)
store_buys = store_buys.drop("Price",axis=1)
store_buys

Unnamed: 0,Session_ID,Timestamp,Item_ID
0,420374,2014-04-06T18:44:58.325Z,214537850
1,281626,2014-04-06T09:40:13.032Z,214535653
2,420368,2014-04-04T06:13:28.848Z,214530572
3,420368,2014-04-04T06:13:28.858Z,214835025
4,140806,2014-04-07T09:22:28.132Z,214668193
...,...,...,...
679483,6926714,2014-07-27T15:35:40.221Z,214665277
679484,6645086,2014-07-28T10:08:58.076Z,214567057
679485,6740437,2014-07-25T19:02:58.252Z,214708044
679486,6926707,2014-07-27T13:58:54.040Z,214848986


### 1.1.3 Store-clicks(falta processar melhor os dados)


The file `store-clicks.dat` comprises the clicks of the users over the items. It contains **5.613.499 sessions**.  Each record/line in the file has the following fields (with this order):

* **Session ID** – the id of the session. In one session there are one or many clicks. Could be represented as an integer number.
* **Timestamp** – the time when the click occurred. Format of YYYY-MM-DDThh:mm:ss.SSSZ
* **Item ID** – the unique identifier of the item that has been clicked. Could be represented as an integer number.
* **Context** – the context of the click. The value "S" indicates a special offer, "0" indicates  a missing value, a number between 1 to 12 indicates a real category identifier,
any other number indicates a brand. E.g. if an item has been clicked in the context of a promotion or special offer then the value will be "S", if the context was a brand i.e BOSCH,
then the value will be an 8-10 digits number. If the item has been clicked under regular category, i.e. sport, then the value will be a number between 1 to 12. 
 

In [9]:
store_clicks = pd.read_csv("store-clicks.dat")

store_clicks.columns = ["Session_ID","Timestamp","Item_ID", "Context"]

store_clicks.head(5)

  interactivity=interactivity, compiler=compiler, result=result)


Unnamed: 0,Session_ID,Timestamp,Item_ID,Context
0,1,2014-04-07T10:54:09.868Z,214536500,0
1,1,2014-04-07T10:54:46.998Z,214536506,0
2,1,2014-04-07T10:57:00.306Z,214577561,0
3,2,2014-04-07T13:56:37.614Z,214662742,0
4,2,2014-04-07T13:57:19.373Z,214662742,0


In [10]:
store_clicks.shape

(20704558, 4)

In [4]:
store_clicks = store_clicks.drop_duplicates()

### 1.2 Merge store-buys and product_df 

To get a transaction list with the storage buys with the correspondent product description, we did a merge with the ghsfnofdks

In [11]:
masterdf = pd.merge(pdf.set_index('Item_ID'), store_buys.set_index('Item_ID'),on='Item_ID')
masterdf = masterdf.reset_index()

In [12]:
masterdf = masterdf[['Session_ID','Item_ID','Product_Categories','Timestamp']]

In [13]:
masterdf

Unnamed: 0,Session_ID,Item_ID,Product_Categories,Timestamp
0,2859734,214536500,electronics.tablet,2014-05-14T12:06:02.717Z
1,4276371,214536500,electronics.tablet,2014-06-04T13:44:10.725Z
2,4440056,214536500,electronics.tablet,2014-06-14T19:14:00.581Z
3,70532,214536506,electronics.tablet,2014-04-06T09:59:03.143Z
4,691119,214536506,electronics.tablet,2014-04-14T15:56:30.514Z
...,...,...,...,...
679483,6649378,214851182,furniture.universal.light,2014-07-28T15:33:57.709Z
679484,6649378,214851182,furniture.universal.light,2014-07-28T15:53:12.209Z
679485,7001599,214571152,stationery.paper,2014-07-24T20:16:42.857Z
679486,6961172,214571152,stationery.paper,2014-07-28T19:39:09.688Z


In [15]:
buy_transactions= []
#for i in list(masterdf.Session_ID[:50].unique()):
for i in list(masterdf.Session_ID[:500].unique()):
    buy_transactions.append(masterdf[masterdf.Session_ID==i].Product_Categories.values.tolist())
    
buy_transactions 


[[' electronics.tablet'],
 [' electronics.tablet', ' computers.peripherals.monitor'],
 [' electronics.tablet', ' sport.tennis'],
 [' electronics.tablet'],
 [' electronics.tablet',
  ' appliances.kitchen.grill',
  ' appliances.kitchen.grill',
  ' furniture.living_room.cabinet'],
 [' furniture.kitchen.table', ' computers.components.memory'],
 [' furniture.kitchen.table'],
 [' furniture.kitchen.table',
  ' computers.components.memory',
  ' kids.toys',
  ' computers.components.memory',
  ' computers.components.memory',
  ' electronics.video.tv'],
 [' furniture.kitchen.table', ' electronics.audio.dictaphone'],
 [' furniture.kitchen.table'],
 [' furniture.kitchen.table'],
 [' furniture.kitchen.table', ' furniture.kitchen.table'],
 [' furniture.kitchen.table'],
 [' furniture.kitchen.table'],
 [' furniture.kitchen.table'],
 [' furniture.kitchen.table'],
 [' furniture.kitchen.table',
  ' computers.components.videocards',
  ' furniture.kitchen.table',
  ' country_yard.lawn_mower'],
 [' electroni

Now that we have a transactions list, ggfsdafs

The Apriori implementation at MLxtend receives a binary database

In [15]:
tr_enc = TransactionEncoder()
trans_array = tr_enc.fit(buy_transactions).transform(buy_transactions)
binary_database = pd.DataFrame(trans_array, columns=tr_enc.columns_)
binary_database

Unnamed: 0,accessories.bag,apparel.costume,apparel.glove,apparel.shirt,apparel.shoes,apparel.sock,apparel.trousers,apparel.tshirt,appliances.environment.air_heater,appliances.environment.climate,...,furniture.living_room.sofa,kids.toys,medicine.tools.tonometer,sport.bicycle,sport.ski,sport.tennis,stationery.battery,stationery.cartrige,stationery.paper,stationery.stapler
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
443,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
444,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
445,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
446,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


### 1.3  Merge store-clicks and product_df

In [16]:
clickprod_df = pd.merge(pdf.set_index('Item_ID'), store_clicks.set_index('Item_ID'),on='Item_ID')
clickprod_df = clickprod_df.reset_index()


In [17]:
clickprod_df = clickprod_df[['Session_ID','Item_ID','Product_Categories','Timestamp']]

In [18]:
clickprod_df

Unnamed: 0,Session_ID,Item_ID,Product_Categories,Timestamp
0,1,214536500,electronics.tablet,2014-04-07T10:54:09.868Z
1,21133,214536500,electronics.tablet,2014-04-06T11:07:41.937Z
2,26623,214536500,electronics.tablet,2014-04-06T12:54:28.549Z
3,25964,214536500,electronics.tablet,2014-04-02T00:37:45.985Z
4,33429,214536500,electronics.tablet,2014-04-03T19:49:41.979Z
...,...,...,...,...
20704553,6928129,214535055,kids.dolls,2014-07-28T09:46:10.486Z
20704554,6927372,214818485,furniture.living_room.chair,2014-07-23T19:32:47.960Z
20704555,6927372,214818485,furniture.living_room.chair,2014-07-23T19:33:33.705Z
20704556,6926506,214646096,apparel.glove,2014-07-27T10:12:04.929Z


In [19]:
#teste
buyers_list = list(store_buys.Session_ID.unique())
new_store_clicks = store_clicks.loc[store_clicks.Session_ID.isin(buyers_list)]
new_store_clicks

Unnamed: 0,Session_ID,Timestamp,Item_ID,Context
23,11,2014-04-03T10:44:35.672Z,214821275,0
24,11,2014-04-03T10:45:01.674Z,214821275,0
25,11,2014-04-03T10:45:29.873Z,214821371,0
26,11,2014-04-03T10:46:12.162Z,214821371,0
27,11,2014-04-03T10:46:57.355Z,214821371,0
...,...,...,...,...
20704553,6926707,2014-07-27T13:46:14.563Z,214848757,S
20704554,6926707,2014-07-27T13:47:47.168Z,214848986,S
20704555,6926707,2014-07-27T13:49:35.200Z,214848945,S
20704556,6926707,2014-07-27T13:52:29.177Z,214561477,3


(Só meti 500 transanctions q tem tipo 2M e demora imenso)

In [20]:
click_transactions= []
#for i in list(masterdf.Session_ID[:50].unique()):
for i in list(clickprod_df.Session_ID[:500].unique()):
    click_transactions.append(clickprod_df[clickprod_df.Session_ID==i].Product_Categories.values.tolist())
    
click_transactions 


[[' electronics.tablet',
  ' electronics.tablet',
  ' electronics.audio.headphone'],
 [' electronics.tablet', ' computers.network.router', ' accessories.bag'],
 [' electronics.tablet',
  ' electronics.tablet',
  ' auto.accessories.videoregister',
  ' appliances.personal.scales',
  ' computers.network.router',
  ' computers.peripherals.scanner',
  ' electronics.calculator',
  ' electronics.smartphone',
  ' appliances.kitchen.hood',
  ' appliances.kitchen.dishwasher',
  ' appliances.kitchen.washer',
  ' construction.tools.drill',
  ' computers.peripherals.scanner',
  ' kids.fmcg.diapers',
  ' computers.gaming',
  ' furniture.bedroom.bed',
  ' appliances.kitchen.mixer',
  ' stationery.paper',
  ' electronics.smartphone',
  ' medicine.tools.tonometer',
  ' apparel.trousers',
  ' electronics.audio.headphone',
  ' furniture.living_room.shelving',
  ' electronics.smartphone',
  ' electronics.clocks',
  ' appliances.kitchen.coffee_machine',
  ' construction.tools.drill',
  ' computers.ebooks',

In [21]:
tr_enc = TransactionEncoder()
trans_array = tr_enc.fit(click_transactions).transform(click_transactions)
binary_database2 = pd.DataFrame(trans_array, columns=tr_enc.columns_)
binary_database2

Unnamed: 0,accessories.bag,accessories.umbrella,apparel.costume,apparel.glove,apparel.shirt,apparel.shoes,apparel.sock,apparel.trousers,apparel.tshirt,appliances.environment.air_conditioner,...,sport.bicycle,sport.diving,sport.ski,sport.snowboard,sport.tennis,sport.trainer,stationery.battery,stationery.cartrige,stationery.paper,stationery.stapler
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,True,False,False,...,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
4,False,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,True,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
408,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
409,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
410,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
411,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


## 1.2. Compute Frequent Itemsets

* Compute frequent itemsets considering a minimum support of X%. 
* Present frequent itemsets organized by length (number of items). 
* List frequent 1-itemsets, 2-itemsets, 3-itemsets, etc with support of at least Y%.
* Change X and Y when it makes sense and discuss the results.

### 1.2.1 Store Buys

In [22]:
frequent_itemsets = apriori(binary_database, min_support=0.1,  use_colnames=True)
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.113839,( appliances.personal.scales)
1,0.116071,( computers.components.cpu)
2,0.180804,( computers.notebook)
3,0.46875,( electronics.video.tv)


**What are the most bought products?**

The most bought product is sport.tennis



In [23]:
frequent_itemsets = apriori(binary_database, min_support=0.05, use_colnames=True)
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.058036,( appliances.kitchen.blender)
1,0.113839,( appliances.personal.scales)
2,0.116071,( computers.components.cpu)
3,0.180804,( computers.notebook)
4,0.082589,( computers.peripherals.monitor)
5,0.46875,( electronics.video.tv)
6,0.051339,"( appliances.kitchen.blender, appliances.pers..."


In [24]:
frequent_itemsets = apriori(binary_database, min_support=0.01, use_colnames=True)
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.03125,( accessories.bag)
1,0.011161,( apparel.trousers)
2,0.017857,( appliances.environment.vacuum)
3,0.017857,( appliances.environment.water_heater)
4,0.058036,( appliances.kitchen.blender)
5,0.020089,( appliances.kitchen.kettle)
6,0.011161,( appliances.kitchen.meat_grinder)
7,0.026786,( appliances.kitchen.refrigerators)
8,0.024554,( appliances.kitchen.toster)
9,0.113839,( appliances.personal.scales)


In [25]:
frequent_itemsets = apriori(binary_database, min_support=0.01, use_colnames=True)

frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))

frequent_itemsets

Unnamed: 0,support,itemsets,length
0,0.03125,( accessories.bag),1
1,0.011161,( apparel.trousers),1
2,0.017857,( appliances.environment.vacuum),1
3,0.017857,( appliances.environment.water_heater),1
4,0.058036,( appliances.kitchen.blender),1
5,0.020089,( appliances.kitchen.kettle),1
6,0.011161,( appliances.kitchen.meat_grinder),1
7,0.026786,( appliances.kitchen.refrigerators),1
8,0.024554,( appliances.kitchen.toster),1
9,0.113839,( appliances.personal.scales),1


**Which products are bought together?**

In [26]:
frequent_itemsets = frequent_itemsets[ (frequent_itemsets['support'] >= 0.01) & (frequent_itemsets['length'] == 2)]
frequent_itemsets

Unnamed: 0,support,itemsets,length
21,0.011161,"( accessories.bag, computers.notebook)",2
22,0.013393,"( appliances.environment.vacuum, appliances.p...",2
23,0.011161,"( appliances.environment.water_heater, applia...",2
24,0.051339,"( appliances.kitchen.blender, appliances.pers...",2
25,0.013393,"( computers.components.cpu, appliances.kitche...",2
26,0.011161,"( appliances.kitchen.toster, computers.periph...",2
27,0.013393,"( computers.components.cpu, computers.notebook)",2
28,0.03125,"( computers.components.cpu, sport.tennis)",2


### 1.2.2 Store Clicks (Só tá com 500 transactions)

In [27]:
frequent_itemsets = apriori(binary_database2, min_support=0.2,  use_colnames=True)
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.382567,( appliances.personal.scales)
1,0.992736,( electronics.tablet)
2,0.380145,"( electronics.tablet, appliances.personal.sca..."


### 1.3. Generate Association Rules from Frequent Itemsets

* Generate association rules with a choosed value (C) for minimum confidence. 
* Generate association rules with a choosed value (L) for minimum lift. 
* Generate association rules with both confidence >= C% and lift >= L.
* Change C and L when it makes sense and discuss the results.

### 1.3.1 Store Buys

In [28]:
frequent_itemsets = apriori(binary_database, min_support=0.01, use_colnames=True)

rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.2)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,( accessories.bag),( computers.notebook),0.03125,0.180804,0.011161,0.357143,1.975309,0.005511,1.274306
1,( appliances.environment.vacuum),( appliances.personal.scales),0.017857,0.113839,0.013393,0.75,6.588235,0.01136,3.544643
2,( appliances.environment.water_heater),( appliances.personal.scales),0.017857,0.113839,0.011161,0.625,5.490196,0.009128,2.363095
3,( appliances.kitchen.blender),( appliances.personal.scales),0.058036,0.113839,0.051339,0.884615,7.770739,0.044733,7.68006
4,( appliances.personal.scales),( appliances.kitchen.blender),0.113839,0.058036,0.051339,0.45098,7.770739,0.044733,1.715721
5,( appliances.kitchen.toster),( computers.components.cpu),0.024554,0.116071,0.013393,0.545455,4.699301,0.010543,1.944643
6,( appliances.kitchen.toster),( computers.peripherals.monitor),0.024554,0.082589,0.011161,0.454545,5.503686,0.009133,1.68192
7,( computers.components.cpu),( sport.tennis),0.116071,0.046875,0.03125,0.269231,5.74359,0.025809,1.304276
8,( sport.tennis),( computers.components.cpu),0.046875,0.116071,0.03125,0.666667,5.74359,0.025809,2.651786



* When people buy appliances.environment.fan, computers.components.memory appears in 36% of the transactions. 

* When people buy appliances.kitchen.grill, country_yard.lawn_mowe appears in 23% of the transactions. 

* When people buy country_yard.lawn_mower, sport.tennis appears in 25% of the transactions. 


In [29]:
lift_rules = association_rules(frequent_itemsets, metric="lift", min_threshold=2.0)
lift_rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,( appliances.environment.vacuum),( appliances.personal.scales),0.017857,0.113839,0.013393,0.75,6.588235,0.01136,3.544643
1,( appliances.personal.scales),( appliances.environment.vacuum),0.113839,0.017857,0.013393,0.117647,6.588235,0.01136,1.113095
2,( appliances.environment.water_heater),( appliances.personal.scales),0.017857,0.113839,0.011161,0.625,5.490196,0.009128,2.363095
3,( appliances.personal.scales),( appliances.environment.water_heater),0.113839,0.017857,0.011161,0.098039,5.490196,0.009128,1.088898
4,( appliances.kitchen.blender),( appliances.personal.scales),0.058036,0.113839,0.051339,0.884615,7.770739,0.044733,7.68006
5,( appliances.personal.scales),( appliances.kitchen.blender),0.113839,0.058036,0.051339,0.45098,7.770739,0.044733,1.715721
6,( computers.components.cpu),( appliances.kitchen.toster),0.116071,0.024554,0.013393,0.115385,4.699301,0.010543,1.102679
7,( appliances.kitchen.toster),( computers.components.cpu),0.024554,0.116071,0.013393,0.545455,4.699301,0.010543,1.944643
8,( appliances.kitchen.toster),( computers.peripherals.monitor),0.024554,0.082589,0.011161,0.454545,5.503686,0.009133,1.68192
9,( computers.peripherals.monitor),( appliances.kitchen.toster),0.082589,0.024554,0.011161,0.135135,5.503686,0.009133,1.12786


Lift - how likely item Y is to be purchased when item X is purchased, while controlling for how popular item Y is.
A lift value greater than 1 means that item Y is likely to be bought if item X is bought

Can you find associations highliting that when people buy a product/set of products also buy other product(s)?

Resposta: gfsdfsddfsvssvgrsfff

### 1.2.3 Store Clicks(só tá com 500 transactions)

In [30]:
frequent_itemsets = apriori(binary_database2, min_support=0.15, use_colnames=True)

rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.1)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,( electronics.tablet),( appliances.personal.scales),0.992736,0.382567,0.380145,0.382927,1.000942,0.000358,1.000584
1,( appliances.personal.scales),( electronics.tablet),0.382567,0.992736,0.380145,0.993671,1.000942,0.000358,1.1477
2,( electronics.audio.headphone),( electronics.tablet),0.188862,0.992736,0.181598,0.961538,0.968574,-0.005892,0.188862
3,( electronics.tablet),( electronics.audio.headphone),0.992736,0.188862,0.181598,0.182927,0.968574,-0.005892,0.992736
4,( electronics.tablet),( furniture.living_room.sofa),0.992736,0.179177,0.179177,0.180488,1.007317,0.001302,1.0016
5,( furniture.living_room.sofa),( electronics.tablet),0.179177,0.992736,0.179177,1.0,1.007317,0.001302,inf


In [31]:
lift_rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.0)
lift_rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,( electronics.tablet),( appliances.personal.scales),0.992736,0.382567,0.380145,0.382927,1.000942,0.000358,1.000584
1,( appliances.personal.scales),( electronics.tablet),0.382567,0.992736,0.380145,0.993671,1.000942,0.000358,1.1477
2,( electronics.tablet),( furniture.living_room.sofa),0.992736,0.179177,0.179177,0.180488,1.007317,0.001302,1.0016
3,( furniture.living_room.sofa),( electronics.tablet),0.179177,0.992736,0.179177,1.0,1.007317,0.001302,inf


### 1.4. Take a Look at Maximal Patterns: Compute Maximal Frequent Itemsets

In [32]:
%timeit FI_apriori = apriori(binary_database, min_support=0.6, use_colnames=True)

3.52 ms ± 40.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### 1.5. Conclusions 

# 2. Week vs Weekend Purchases

In this part of the project you should analyse the consumption patterns during the week vs during the weekeed.

**In what follows keep the following question in mind and be creative!**

1. The most interesting products are the same during the week and the weekend? 
2. What are the most bought products during the week? And during the weekend?
3. There are differences between the sets of products bought during the week and the weekend?
4. Can you find different associations highliting that when people click in a product/set of products also buy this product(s) during the week vs the weekend?
5. Discuss the results obtained for the week sessions vs weekend sessions.

### 2.1. Load and Preprocess Data

 **Product quantities should not be considered.**
 


### 2.1.1Store Buys df

We have to separate this df into 2 different df one with week days and the other with weekend days

In [33]:
masterdf.head(5)


Unnamed: 0,Session_ID,Item_ID,Product_Categories,Timestamp
0,2859734,214536500,electronics.tablet,2014-05-14T12:06:02.717Z
1,4276371,214536500,electronics.tablet,2014-06-04T13:44:10.725Z
2,4440056,214536500,electronics.tablet,2014-06-14T19:14:00.581Z
3,70532,214536506,electronics.tablet,2014-04-06T09:59:03.143Z
4,691119,214536506,electronics.tablet,2014-04-14T15:56:30.514Z


In [20]:
weeksdf = masterdf
time = weeksdf["Timestamp"].str.split("T", n = 1, expand = True)
weeksdf["Timestamp"] = time[0]
weeksdf.head(5)

Unnamed: 0,Session_ID,Item_ID,Product_Categories,Timestamp
0,2859734,214536500,electronics.tablet,2014-05-14
1,4276371,214536500,electronics.tablet,2014-06-04
2,4440056,214536500,electronics.tablet,2014-06-14
3,70532,214536506,electronics.tablet,2014-04-06
4,691119,214536506,electronics.tablet,2014-04-14


In [35]:
teste = weeksdf
teste["Timestamp"] = pd.to_datetime(teste['Timestamp'])
teste['Semana'] = teste['Timestamp'].dt.day_name()

buys_weeknd_df =  teste.loc[teste['Semana'].isin(['Saturday', 'Sunday'])] 
buys_weekday_df = teste.loc[~teste['Semana'].isin(['Saturday', 'Sunday'])] 


## Weekend

In [36]:
buys_weeknd_df

Unnamed: 0,Session_ID,Item_ID,Product_Categories,Timestamp,Semana
2,4440056,214536500,electronics.tablet,2014-06-14,Saturday
3,70532,214536506,electronics.tablet,2014-04-06,Sunday
5,405567,214662742,furniture.kitchen.table,2014-04-06,Sunday
6,763567,214662742,furniture.kitchen.table,2014-04-13,Sunday
7,842649,214662742,furniture.kitchen.table,2014-04-13,Sunday
...,...,...,...,...,...
679464,6724562,214789372,sport.tennis,2014-07-27,Sunday
679467,6940131,214816352,appliances.ironing_board,2014-07-27,Sunday
679475,6797042,214854759,electronics.calculator,2014-07-27,Sunday
679476,6711771,214851605,furniture.universal.light,2014-07-27,Sunday


In [37]:
buy_weeknd_transactions= []

for i in list(buys_weeknd_df.Session_ID[:100].unique()):
    buy_weeknd_transactions.append(buys_weeknd_df[buys_weeknd_df.Session_ID==i].Product_Categories.values.tolist())
    
buy_weeknd_transactions 


[[' electronics.tablet', ' sport.tennis'],
 [' electronics.tablet'],
 [' furniture.kitchen.table', ' computers.components.memory'],
 [' furniture.kitchen.table'],
 [' furniture.kitchen.table',
  ' computers.components.memory',
  ' kids.toys',
  ' computers.components.memory',
  ' computers.components.memory',
  ' electronics.video.tv'],
 [' electronics.smartphone', ' computers.components.memory'],
 [' appliances.kitchen.refrigerators',
  ' appliances.iron',
  ' appliances.environment.water_heater',
  ' construction.tools.heater'],
 [' appliances.kitchen.refrigerators',
  ' sport.tennis',
  ' computers.peripherals.scanner'],
 [' appliances.kitchen.refrigerators'],
 [' appliances.kitchen.refrigerators',
  ' appliances.kitchen.toster',
  ' computers.peripherals.monitor'],
 [' appliances.kitchen.refrigerators', ' medicine.tools.tonometer'],
 [' appliances.personal.scales'],
 [' appliances.personal.scales'],
 [' appliances.personal.scales', ' appliances.kitchen.blender'],
 [' appliances.per

In [38]:
tr_enc = TransactionEncoder()
trans_array = tr_enc.fit(buy_weeknd_transactions).transform(buy_weeknd_transactions)
buy_weeknd_binary_db = pd.DataFrame(trans_array, columns=tr_enc.columns_)
buy_weeknd_binary_db

Unnamed: 0,accessories.bag,apparel.trousers,appliances.environment.vacuum,appliances.environment.water_heater,appliances.iron,appliances.kitchen.blender,appliances.kitchen.coffee_machine,appliances.kitchen.dishwasher,appliances.kitchen.hood,appliances.kitchen.microwave,...,electronics.video.projector,electronics.video.tv,furniture.bedroom.blanket,furniture.kitchen.table,furniture.living_room.cabinet,furniture.living_room.shelving,kids.toys,medicine.tools.tonometer,sport.ski,sport.tennis
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,True,False,True,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
88,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
89,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
90,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
91,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


## WeekDay

In [39]:
buys_weekday_df

Unnamed: 0,Session_ID,Item_ID,Product_Categories,Timestamp,Semana
0,2859734,214536500,electronics.tablet,2014-05-14,Wednesday
1,4276371,214536500,electronics.tablet,2014-06-04,Wednesday
4,691119,214536506,electronics.tablet,2014-04-14,Monday
8,573948,214662742,furniture.kitchen.table,2014-04-10,Thursday
9,877203,214662742,furniture.kitchen.table,2014-04-14,Monday
...,...,...,...,...,...
679483,6649378,214851182,furniture.universal.light,2014-07-28,Monday
679484,6649378,214851182,furniture.universal.light,2014-07-28,Monday
679485,7001599,214571152,stationery.paper,2014-07-24,Thursday
679486,6961172,214571152,stationery.paper,2014-07-28,Monday


In [40]:
buy_weekday_transactions= []

for i in list(buys_weekday_df.Session_ID[:500].unique()):
    buy_weekday_transactions.append(buys_weekday_df[buys_weekday_df.Session_ID==i].Product_Categories.values.tolist())
    
buy_weekday_transactions 


[[' electronics.tablet'],
 [' electronics.tablet', ' computers.peripherals.monitor'],
 [' electronics.tablet',
  ' appliances.kitchen.grill',
  ' appliances.kitchen.grill',
  ' furniture.living_room.cabinet'],
 [' furniture.kitchen.table', ' electronics.audio.dictaphone'],
 [' furniture.kitchen.table'],
 [' furniture.kitchen.table'],
 [' furniture.kitchen.table', ' furniture.kitchen.table'],
 [' furniture.kitchen.table'],
 [' furniture.kitchen.table'],
 [' furniture.kitchen.table'],
 [' furniture.kitchen.table'],
 [' furniture.kitchen.table',
  ' computers.components.videocards',
  ' furniture.kitchen.table',
  ' country_yard.lawn_mower'],
 [' appliances.kitchen.refrigerators'],
 [' appliances.kitchen.refrigerators',
  ' appliances.kitchen.toster',
  ' computers.peripherals.monitor',
  ' electronics.clocks'],
 [' appliances.kitchen.refrigerators',
  ' sport.tennis',
  ' country_yard.lawn_mower',
  ' computers.notebook'],
 [' appliances.kitchen.refrigerators', ' appliances.sewing_machin

In [41]:
tr_enc = TransactionEncoder()
trans_array = tr_enc.fit(buy_weekday_transactions).transform(buy_weekday_transactions)
buy_weekday_binary_db = pd.DataFrame(trans_array, columns=tr_enc.columns_)
buy_weekday_binary_db

Unnamed: 0,accessories.bag,accessories.umbrella,apparel.costume,apparel.glove,apparel.shirt,apparel.shoes,apparel.sock,apparel.trousers,apparel.tshirt,appliances.environment.air_heater,...,furniture.living_room.sofa,furniture.universal.light,medicine.tools.tonometer,sport.bicycle,sport.ski,sport.tennis,stationery.battery,stationery.cartrige,stationery.paper,stationery.stapler
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
444,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
445,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
446,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,True,False,False
447,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False


### 2.1.2 Clicks Buys df

The same has to be done for the clicks df

In [42]:
clickprod_df.head(5)

Unnamed: 0,Session_ID,Item_ID,Product_Categories,Timestamp
0,1,214536500,electronics.tablet,2014-04-07T10:54:09.868Z
1,21133,214536500,electronics.tablet,2014-04-06T11:07:41.937Z
2,26623,214536500,electronics.tablet,2014-04-06T12:54:28.549Z
3,25964,214536500,electronics.tablet,2014-04-02T00:37:45.985Z
4,33429,214536500,electronics.tablet,2014-04-03T19:49:41.979Z


In [43]:
weeksdf = clickprod_df[:500]
time = weeksdf["Timestamp"].str.split("T", n = 1, expand = True)
weeksdf["Timestamp"] = time[0]
weeksdf.head(5)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


Unnamed: 0,Session_ID,Item_ID,Product_Categories,Timestamp
0,1,214536500,electronics.tablet,2014-04-07
1,21133,214536500,electronics.tablet,2014-04-06
2,26623,214536500,electronics.tablet,2014-04-06
3,25964,214536500,electronics.tablet,2014-04-02
4,33429,214536500,electronics.tablet,2014-04-03


In [44]:
teste = weeksdf
teste["Timestamp"] = pd.to_datetime(teste['Timestamp'])
teste['Semana'] = teste['Timestamp'].dt.day_name()

clicks_weeknd_df =  teste.loc[teste['Semana'].isin(['Saturday', 'Sunday'])] 
clicks_weekday_df = teste.loc[~teste['Semana'].isin(['Saturday', 'Sunday'])] 


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  This is separate from the ipykernel package so we can avoid doing imports until


## Weekend

In [45]:
clicks_weeknd_df

Unnamed: 0,Session_ID,Item_ID,Product_Categories,Timestamp,Semana
1,21133,214536500,electronics.tablet,2014-04-06,Sunday
2,26623,214536500,electronics.tablet,2014-04-06,Sunday
10,141376,214536500,electronics.tablet,2014-04-06,Sunday
17,483849,214536500,electronics.tablet,2014-04-06,Sunday
34,666396,214536500,electronics.tablet,2014-04-12,Saturday
...,...,...,...,...,...
485,859688,214536506,electronics.tablet,2014-04-13,Sunday
486,859688,214536506,electronics.tablet,2014-04-13,Sunday
487,865129,214536506,electronics.tablet,2014-04-12,Saturday
490,1899381,214536506,electronics.tablet,2014-05-03,Saturday


In [46]:
clicks_weeknd_transactions= []
for i in list(buys_weeknd_df.Session_ID[:150].unique()):
    clicks_weeknd_transactions.append(clicks_weeknd_df[clicks_weeknd_df.Session_ID==i].Product_Categories.values.tolist())
    
#clicks_weeknd_transactions 


In [47]:
tr_enc = TransactionEncoder()
trans_array = tr_enc.fit(clicks_weeknd_transactions).transform(clicks_weeknd_transactions)
click_weeknd_binary_db = pd.DataFrame(trans_array, columns=tr_enc.columns_)
click_weeknd_binary_db

Unnamed: 0,electronics.tablet
0,True
1,True
2,False
3,False
4,False
...,...
133,False
134,False
135,False
136,False


## WeekDay

In [48]:
clicks_weekday_df

Unnamed: 0,Session_ID,Item_ID,Product_Categories,Timestamp,Semana
0,1,214536500,electronics.tablet,2014-04-07,Monday
3,25964,214536500,electronics.tablet,2014-04-02,Wednesday
4,33429,214536500,electronics.tablet,2014-04-03,Thursday
5,53412,214536500,electronics.tablet,2014-04-03,Thursday
6,73463,214536500,electronics.tablet,2014-04-03,Thursday
...,...,...,...,...,...
494,6861812,214536506,electronics.tablet,2014-07-23,Wednesday
495,6861812,214536506,electronics.tablet,2014-07-23,Wednesday
496,1,214577561,electronics.audio.headphone,2014-04-07,Monday
497,56398,214577561,electronics.audio.headphone,2014-04-02,Wednesday


In [49]:
click_weekday_transactions= []
for i in list(clicks_weekday_df.Session_ID.unique()):
    click_weekday_transactions.append(clicks_weekday_df[clicks_weekday_df.Session_ID==i].Product_Categories.values.tolist())
    
click_weekday_transactions 


[[' electronics.tablet',
  ' electronics.tablet',
  ' electronics.audio.headphone'],
 [' electronics.tablet'],
 [' electronics.tablet'],
 [' electronics.tablet'],
 [' electronics.tablet'],
 [' electronics.tablet'],
 [' electronics.tablet'],
 [' electronics.tablet'],
 [' electronics.tablet'],
 [' electronics.tablet'],
 [' electronics.tablet'],
 [' electronics.tablet'],
 [' electronics.tablet'],
 [' electronics.tablet'],
 [' electronics.tablet'],
 [' electronics.tablet',
  ' electronics.tablet',
  ' electronics.tablet',
  ' electronics.tablet',
  ' electronics.tablet'],
 [' electronics.tablet'],
 [' electronics.tablet'],
 [' electronics.tablet'],
 [' electronics.tablet', ' electronics.tablet'],
 [' electronics.tablet'],
 [' electronics.tablet'],
 [' electronics.tablet'],
 [' electronics.tablet'],
 [' electronics.tablet', ' electronics.tablet'],
 [' electronics.tablet'],
 [' electronics.tablet'],
 [' electronics.tablet', ' electronics.tablet'],
 [' electronics.tablet'],
 [' electronics.ta

In [50]:
tr_enc = TransactionEncoder()
trans_array = tr_enc.fit(click_weekday_transactions).transform(click_weekday_transactions)
click_weekday_binary_db = pd.DataFrame(trans_array, columns=tr_enc.columns_)
click_weekday_binary_db

Unnamed: 0,electronics.audio.headphone,electronics.tablet
0,True,True
1,False,True
2,False,True
3,False,True
4,False,True
...,...,...
283,False,True
284,False,True
285,False,True
286,True,False


### 2.2. Compute Frequent Itemsets

* Compute frequent itemsets considering a minimum support of X%. 
* Present frequent itemsets organized by length (number of items). 
* List frequent 1-itemsets, 2-itemsets, 3-itemsets, etc with support of at least Y%.
* Change X and Y when it makes sense and discuss the results.

### 2.2.1 Buys Week Day

In [51]:
frequent_itemsets = apriori(buy_weekday_binary_db, min_support=0.1,  use_colnames=True)
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.25167,( computers.components.memory)
1,0.140312,( computers.notebook)
2,0.380846,( electronics.video.tv)


In [52]:
frequent_itemsets = apriori(buy_weekday_binary_db, min_support=0.02, use_colnames=True)

frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))

frequent_itemsets

Unnamed: 0,support,itemsets,length
0,0.024499,( accessories.bag),1
1,0.020045,( appliances.environment.fan),1
2,0.020045,( appliances.iron),1
3,0.073497,( appliances.kitchen.blender),1
4,0.03118,( appliances.kitchen.kettle),1
5,0.048998,( appliances.kitchen.meat_grinder),1
6,0.022272,( appliances.kitchen.toster),1
7,0.075724,( appliances.personal.scales),1
8,0.08686,( computers.components.cpu),1
9,0.25167,( computers.components.memory),1


In [53]:
frequent_itemsets = frequent_itemsets[ (frequent_itemsets['support'] >= 0.02) & (frequent_itemsets['length'] == 2)]
frequent_itemsets

Unnamed: 0,support,itemsets,length
18,0.028953,"( appliances.kitchen.blender, appliances.pers...",2
19,0.046771,"( computers.components.memory, appliances.kit...",2
20,0.037862,"( computers.components.memory, appliances.kit...",2
21,0.022272,"( computers.components.cpu, sport.tennis)",2
22,0.03118,"( computers.components.memory, country_yard.w...",2
23,0.028953,"( computers.components.memory, sport.tennis)",2


### 2.2.2 Buys Weekend

In [54]:
frequent_itemsets = apriori(buy_weeknd_binary_db, min_support=0.1,  use_colnames=True)
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.139785,( appliances.kitchen.blender)
1,0.193548,( appliances.personal.scales)
2,0.150538,( computers.components.cpu)
3,0.129032,( computers.notebook)
4,0.430108,( electronics.video.tv)
5,0.11828,"( appliances.kitchen.blender, appliances.pers..."


### 2.2.3 Clicks WeekDay

In [55]:
frequent_itemsets = apriori(click_weekday_binary_db, min_support=0.1,  use_colnames=True)
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.993056,( electronics.tablet)


### 2.2.4 Clicks Weekend

In [56]:
frequent_itemsets = apriori(click_weeknd_binary_db, min_support=0.01,  use_colnames=True)
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.014493,( electronics.tablet)


### 2.3. Generate Association Rules from Frequent Itemsets

* Generate association rules with a choosed value (C) for minimum confidence. 
* Generate association rules with a choosed value (L) for minimum lift. 
* Generate association rules with both confidence >= C% and lift >= L.
* Change C and L when it makes sense and discuss the results.

### 2.3.1 Buys Week Day

In [57]:
frequent_itemsets = apriori(buy_weekday_binary_db, min_support=0.02, use_colnames=True)

rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.2)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,( appliances.kitchen.blender),( appliances.personal.scales),0.073497,0.075724,0.028953,0.393939,5.202317,0.023388,1.525056
1,( appliances.personal.scales),( appliances.kitchen.blender),0.075724,0.073497,0.028953,0.382353,5.202317,0.023388,1.500053
2,( appliances.kitchen.blender),( computers.components.memory),0.073497,0.25167,0.046771,0.636364,2.52856,0.028274,2.057906
3,( appliances.kitchen.meat_grinder),( computers.components.memory),0.048998,0.25167,0.037862,0.772727,3.070394,0.025531,3.29265
4,( computers.components.cpu),( sport.tennis),0.08686,0.053452,0.022272,0.25641,4.797009,0.017629,1.272944
5,( sport.tennis),( computers.components.cpu),0.053452,0.08686,0.022272,0.416667,4.797009,0.017629,1.565383
6,( country_yard.watering),( computers.components.memory),0.033408,0.25167,0.03118,0.933333,3.708555,0.022773,11.224944
7,( sport.tennis),( computers.components.memory),0.053452,0.25167,0.028953,0.541667,2.152286,0.015501,1.632719


In [58]:
lift_rules = association_rules(frequent_itemsets, metric="lift", min_threshold=2.0)
lift_rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,( appliances.kitchen.blender),( appliances.personal.scales),0.073497,0.075724,0.028953,0.393939,5.202317,0.023388,1.525056
1,( appliances.personal.scales),( appliances.kitchen.blender),0.075724,0.073497,0.028953,0.382353,5.202317,0.023388,1.500053
2,( computers.components.memory),( appliances.kitchen.blender),0.25167,0.073497,0.046771,0.185841,2.52856,0.028274,1.137988
3,( appliances.kitchen.blender),( computers.components.memory),0.073497,0.25167,0.046771,0.636364,2.52856,0.028274,2.057906
4,( computers.components.memory),( appliances.kitchen.meat_grinder),0.25167,0.048998,0.037862,0.150442,3.070394,0.025531,1.119409
5,( appliances.kitchen.meat_grinder),( computers.components.memory),0.048998,0.25167,0.037862,0.772727,3.070394,0.025531,3.29265
6,( computers.components.cpu),( sport.tennis),0.08686,0.053452,0.022272,0.25641,4.797009,0.017629,1.272944
7,( sport.tennis),( computers.components.cpu),0.053452,0.08686,0.022272,0.416667,4.797009,0.017629,1.565383
8,( computers.components.memory),( country_yard.watering),0.25167,0.033408,0.03118,0.123894,3.708555,0.022773,1.103282
9,( country_yard.watering),( computers.components.memory),0.033408,0.25167,0.03118,0.933333,3.708555,0.022773,11.224944


### 2.3.2 Buys Weekend

In [59]:
frequent_itemsets = apriori(buy_weeknd_binary_db, min_support=0.1, use_colnames=True)

rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.2)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,( appliances.kitchen.blender),( appliances.personal.scales),0.139785,0.193548,0.11828,0.846154,4.371795,0.091224,5.241935
1,( appliances.personal.scales),( appliances.kitchen.blender),0.193548,0.139785,0.11828,0.611111,4.371795,0.091224,2.211982


In [60]:
lift_rules = association_rules(frequent_itemsets, metric="lift", min_threshold=2.0)
lift_rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,( appliances.kitchen.blender),( appliances.personal.scales),0.139785,0.193548,0.11828,0.846154,4.371795,0.091224,5.241935
1,( appliances.personal.scales),( appliances.kitchen.blender),0.193548,0.139785,0.11828,0.611111,4.371795,0.091224,2.211982


### 2.3.3 Clicks WeekDay

In [61]:
frequent_itemsets = apriori(click_weekday_binary_db, min_support=0.02, use_colnames=True)

rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.2)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction


In [62]:
lift_rules = association_rules(frequent_itemsets, metric="lift", min_threshold=2.0)
lift_rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction


### 2.3.4 Clicks Weekend

In [63]:
frequent_itemsets = apriori(click_weeknd_binary_db, min_support=0.01, use_colnames=True)

rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.2)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction


In [64]:
lift_rules = association_rules(frequent_itemsets, metric="lift", min_threshold=2.0)
lift_rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction


### 2.4. Conclusions 

# 3. [Only Groups of 3] Spring vs Summer Purchases

In this part of the project you should analyse the consumption patterns during the Spring months (April and May) vs Summer months (June and July).

**In what follows keep the following question in mind and be creative!**

1. The most interesting products are the same during the Spring and the Summer? 
2. What are the most bought products during the Spring? And during the Summer?
3. There are differences between the sets of products bought during the Spring and the Summer?
4. Can you find different associations highliting that when people click in a product/set of products also buy this product(s) during the Spring vs the Summer?
5. Discuss the results obtained for the Spring sessions vs Summer sessions.

### 3.1. Load and Preprocess Data

 **Product quantities should not be considered.**

### 3.1.1Store Buys df

We have to separate this df into 2 different df one with summer months and the other with Spring months

In [21]:
masterdf.head(5)

Unnamed: 0,Session_ID,Item_ID,Product_Categories,Timestamp
0,2859734,214536500,electronics.tablet,2014-05-14
1,4276371,214536500,electronics.tablet,2014-06-04
2,4440056,214536500,electronics.tablet,2014-06-14
3,70532,214536506,electronics.tablet,2014-04-06
4,691119,214536506,electronics.tablet,2014-04-14


In [66]:
time = masterdf["Timestamp"].astype(str).str.split("-", n = 2, expand = True)
masterdf["Timestamp"] = time[1]
masterdf.head(5)

Unnamed: 0,Session_ID,Item_ID,Product_Categories,Timestamp,Semana
0,2859734,214536500,electronics.tablet,5,Wednesday
1,4276371,214536500,electronics.tablet,6,Wednesday
2,4440056,214536500,electronics.tablet,6,Saturday
3,70532,214536506,electronics.tablet,4,Sunday
4,691119,214536506,electronics.tablet,4,Monday


In [67]:
teste = masterdf
teste = teste.drop("Semana",axis=1)
spring_df =  teste.loc[teste['Timestamp'].isin(['04', '05'])] 
summer_df = teste.loc[teste['Timestamp'].isin(['06', '07'])] 


In [68]:
spring_df.head(10)

Unnamed: 0,Session_ID,Item_ID,Product_Categories,Timestamp
0,2859734,214536500,electronics.tablet,5
3,70532,214536506,electronics.tablet,4
4,691119,214536506,electronics.tablet,4
5,405567,214662742,furniture.kitchen.table,4
6,763567,214662742,furniture.kitchen.table,4
7,842649,214662742,furniture.kitchen.table,4
8,573948,214662742,furniture.kitchen.table,4
9,877203,214662742,furniture.kitchen.table,4
10,655821,214662742,furniture.kitchen.table,4
11,1002166,214662742,furniture.kitchen.table,4


In [69]:
summer_df.head(10)

Unnamed: 0,Session_ID,Item_ID,Product_Categories,Timestamp
1,4276371,214536500,electronics.tablet,6
2,4440056,214536500,electronics.tablet,6
16,6355708,214662742,furniture.kitchen.table,7
25,3770444,214757390,appliances.kitchen.refrigerators,6
26,4040279,214757390,appliances.kitchen.refrigerators,6
27,4343874,214757390,appliances.kitchen.refrigerators,6
28,4468136,214757390,appliances.kitchen.refrigerators,6
78,3703813,214551617,appliances.personal.scales,6
79,4212793,214551617,appliances.personal.scales,6
80,4212793,214551617,appliances.personal.scales,6


### 2.1.2 Clicks Buys df

The same has to be done for the clicks df

### 3.2. Compute Frequent Itemsets

* Compute frequent itemsets considering a minimum support of X%. 
* Present frequent itemsets organized by length (number of items). 
* List frequent 1-itemsets, 2-itemsets, 3-itemsets, etc with support of at least Y%.
* Change X and Y when it makes sense and discuss the results.

### 3.3. Generate Association Rules from Frequent Itemsets

* Generate association rules with a choosed value (C) for minimum confidence. 
* Generate association rules with a choosed value (L) for minimum lift. 
* Generate association rules with both confidence >= C% and lift >= L.
* Change C and L when it makes sense and discuss the results.

### 3.4. Conclusions 

## 4. Conclusions
Draw some conclusions about this project work.