## Data Mining / Prospeção de Dados

## Diogo Soares and Sara C. Madeira, 2020/21

# Project 1 - Pattern Mining

## Logistics 
**_Read Carefully_**

**Students should work in teams of 2 or 3 people**. 

**TASK 3 - Spring vs Summer Purchases** must be done only by groups of 3 people.

Individual projects might be allowed (with valid justification), but will not have better grades for this reason. 

The quality of the project will dictate its grade, not the number of people working.

**The project's solution should be uploaded in Moodle before the end of `March, 28th (23:59)`.** 

Students should **upload a `.zip` file** containing all the files necessary for project evaluation. 
Groups should be registered in [Moodle](https://moodle.ciencias.ulisboa.pt/mod/groupselect/view.php?id=139096) and the zip file should be identified as `PDnn.zip` where `nn` is the number of your group.

**It is mandatory to produce a Jupyter notebook containing code and text/images/tables/etc. describing the solution and the results. Projects not delivered in this format will not be graded. You can use `PD_202021_P1.ipynb` as a template. In your `.zip` folder you should also include an HTML version of your notebook with all the outputs** (File > Download as > HTML).

**Decisions should be justified and results should be critically discussed.** 

_Project solutions containing only code and outputs without discussion will achieve a maximum grade of 10 out of 20._

## Dataset and Tools



In this project you will analyse data from an online Store collected over 4 months (April - July 2014). The folder `data` contains three files that you should use to obtain the dataset to be used in pattern mining. 

The file `store-buys.dat` comprises the buy events of the users over the items. It contains **318.444 sessions**. Each record/line in the file has the following fields (with this order): 

* **Session ID** - the id of the session. In one session there are one or many buying events. Could be represented as an integer number.
* **Timestamp** - the time when the buy occurred. Format of YYYY-MM-DDThh:mm:ss.SSSZ
* **Item ID** – the unique identifier of the item that has been bought. Could be represented as an integer number. 
* **Price** – the price of the item. Could be represented as an integer number.
* **Quantity** – the quantity bought in this event. Could be represented as an integer number.

The file `store-clicks.dat` comprises the clicks of the users over the items. It contains **5.613.499 sessions**.  Each record/line in the file has the following fields (with this order):

* **Session ID** – the id of the session. In one session there are one or many clicks. Could be represented as an integer number.
* **Timestamp** – the time when the click occurred. Format of YYYY-MM-DDThh:mm:ss.SSSZ
* **Item ID** – the unique identifier of the item that has been clicked. Could be represented as an integer number.
* **Context** – the context of the click. The value "S" indicates a special offer, "0" indicates a missing value, a number between 1 and 12 indicates a real category identifier, and any other number indicates a brand. E.g. if an item has been clicked in the context of a promotion or special offer, the value will be "S"; if the context was a brand (e.g. BOSCH), the value will be an 8-10 digit number; if the item has been clicked under a regular category (e.g. sport), the value will be a number between 1 and 12.
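
The encoding of the **Context** field can be captured in a small helper. The sketch below is only an illustration: the function name `classify_context` and the labels it returns are our own choices, not part of the dataset.

```python
def classify_context(value) -> str:
    """Map a raw Context value to one of the four cases described above."""
    text = str(value).strip()
    if text == "S":
        return "special_offer"
    if text.isdigit():
        number = int(text)
        if number == 0:
            return "missing"
        # 1 to 12 are category identifiers; any other number is a brand.
        return "category" if number <= 12 else "brand"
    # Anything else is not covered by the field description.
    return "unknown"


examples = ["S", "0", "7", "2089046220"]
labels = [classify_context(v) for v in examples]
```

When loading `store-clicks.dat` with pandas, reading this column as text (for example with `dtype={"Context": str}` in `pd.read_csv`) keeps "S" and the numeric codes in one consistent type.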
 
The file `products.csv` comprises the list of products sold by the online store. It contains **46.294 different products** associated with **123 different subcategories**. Each record/line in the file has the following fields:

* **Item ID** - the unique identifier of the item. Could be represented as an integer number. 
* **Product Categories** - the category and subcategories of the item. It is a string containing the category and subcategories of the item. Eg. `appliances.kitchen.juice`


In this project you should use [Python 3](https://www.python.org), [Jupyter Notebook](http://jupyter.org) and **[MLxtend](http://rasbt.github.io/mlxtend/)**. When using MLxtend, frequent patterns can be discovered using either `Apriori` or `FP-Growth`. **Choose the pattern mining algorithm to be used.** 


## Team Identification

**GROUP NNN**

Students:

* Student 1 - n_student1
* Student 2 - n_student2
* Student 3 - n_student3

## 1. Mining Frequent Itemsets and Association Rules


In this first part of the project you should load and preprocess the dataset  in order to compute frequent itemsets and generate association rules considering all the sessions.

**In what follows keep the following questions in mind and be creative!**

1. What are the most interesting products?
2. What are the most bought products?
3. Which products are bought together?
4. Can you find associations between the clicked products? 
5. Can you find associations highlighting that when people buy a product/set of products they also buy other product(s)?
6. Can you find associations highlighting that when people click on a product/set of products they also buy this product(s)?
7. Can you find relevant associated categories? 

### 1.1. Load and Preprocess Data

 **Product quantities should not be considered.**

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

### 1.1.1 Product


In [7]:
product_df = pd.read_csv("products.csv")
product_df.columns = ["Item_ID","Product_Categories"]
product_df.head(5)

Unnamed: 0,Item_ID,Product_Categories
0,214536500,electronics.tablet
1,214536506,electronics.tablet
2,214577561,electronics.audio.headphone
3,214662742,furniture.kitchen.table
4,214662742,furniture.kitchen.table


In [8]:
product_df.shape

(20704558, 2)

In [9]:
pdf = product_df.drop_duplicates()

In [10]:
pdf.shape

(46294, 2)

### 1.1.2 Store-buys


In [11]:
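# NOTE: read_csv uses the first line as the header by default. If the .dat file
# has no header row, pass header=None and names=[...] so the first record is kept.
store_buys = pd.read_csv("store-buys.dat")

store_buys.columns = ["Session_ID","Timestamp","Item_ID", "Price", "Quantity"]

store_buys.head(5)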

Unnamed: 0,Session_ID,Timestamp,Item_ID,Price,Quantity
0,420374,2014-04-06T18:44:58.325Z,214537850,10471,1
1,281626,2014-04-06T09:40:13.032Z,214535653,1883,1
2,420368,2014-04-04T06:13:28.848Z,214530572,6073,1
3,420368,2014-04-04T06:13:28.858Z,214835025,2617,1
4,140806,2014-04-07T09:22:28.132Z,214668193,523,1


**Drop quantities and prices**

In [12]:
# Neither column is used for pattern mining, so both are dropped in one call.
store_buys = store_buys.drop(columns=["Quantity", "Price"])
store_buys

Unnamed: 0,Session_ID,Timestamp,Item_ID
0,420374,2014-04-06T18:44:58.325Z,214537850
1,281626,2014-04-06T09:40:13.032Z,214535653
2,420368,2014-04-04T06:13:28.848Z,214530572
3,420368,2014-04-04T06:13:28.858Z,214835025
4,140806,2014-04-07T09:22:28.132Z,214668193
...,...,...,...
679483,6926714,2014-07-27T15:35:40.221Z,214665277
679484,6645086,2014-07-28T10:08:58.076Z,214567057
679485,6740437,2014-07-25T19:02:58.252Z,214708044
679486,6926707,2014-07-27T13:58:54.040Z,214848986


### 1.1.3 Store-clicks (data still needs further processing)


 

In [13]:
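# NOTE: read_csv uses the first line as the header by default. If the .dat file
# has no header row, pass header=None and names=[...] so the first record is kept.
# The Context column mixes "S" with numbers; dtype={"Context": str} keeps one type.
store_clicks = pd.read_csv("store-clicks.dat")

store_clicks.columns = ["Session_ID","Timestamp","Item_ID", "Context"]

store_clicks.head(5)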



Unnamed: 0,Session_ID,Timestamp,Item_ID,Context
0,1,2014-04-07T10:54:09.868Z,214536500,0
1,1,2014-04-07T10:54:46.998Z,214536506,0
2,1,2014-04-07T10:57:00.306Z,214577561,0
3,2,2014-04-07T13:56:37.614Z,214662742,0
4,2,2014-04-07T13:57:19.373Z,214662742,0


In [14]:
store_clicks.shape

(20704558, 4)

In [15]:
cstore_clicks = store_clicks.drop_duplicates()
print(cstore_clicks.shape)

(20704512, 4)


### 1.1.4 Merge store-buys and product_df 

To obtain a transaction list in which every store buy carries its product category, we merged `store_buys` with the deduplicated product table (`pdf`) on `Item_ID`.

In [34]:
masterdf = pd.merge(pdf.set_index('Item_ID'), store_buys.set_index('Item_ID'),on='Item_ID')
masterdf = masterdf.reset_index()

In [35]:
masterdf = masterdf[['Session_ID','Item_ID','Product_Categories','Timestamp']]

In [36]:
masterdf

Unnamed: 0,Session_ID,Item_ID,Product_Categories,Timestamp
0,2859734,214536500,electronics.tablet,2014-05-14T12:06:02.717Z
1,4276371,214536500,electronics.tablet,2014-06-04T13:44:10.725Z
2,4440056,214536500,electronics.tablet,2014-06-14T19:14:00.581Z
3,70532,214536506,electronics.tablet,2014-04-06T09:59:03.143Z
4,691119,214536506,electronics.tablet,2014-04-14T15:56:30.514Z
...,...,...,...,...
679483,6649378,214851182,furniture.universal.light,2014-07-28T15:33:57.709Z
679484,6649378,214851182,furniture.universal.light,2014-07-28T15:53:12.209Z
679485,7001599,214571152,stationery.paper,2014-07-24T20:16:42.857Z
679486,6961172,214571152,stationery.paper,2014-07-28T19:39:09.688Z


In [37]:
# One list of product categories per session. A single groupby avoids filtering
# the whole DataFrame once per session; sort=False keeps first-appearance order.
buy_transactions = (
    masterdf.groupby("Session_ID", sort=False)["Product_Categories"]
    .apply(list)
    .tolist()
)

buy_transactions 


[[' electronics.tablet'],
 [' electronics.tablet', ' computers.peripherals.monitor'],
 [' electronics.tablet', ' sport.tennis'],
 [' electronics.tablet'],
 [' electronics.tablet',
  ' appliances.kitchen.grill',
  ' appliances.kitchen.grill',
  ' furniture.living_room.cabinet'],
 [' furniture.kitchen.table', ' computers.components.memory'],
 [' furniture.kitchen.table'],
 [' furniture.kitchen.table',
  ' computers.components.memory',
  ' kids.toys',
  ' computers.components.memory',
  ' computers.components.memory',
  ' electronics.video.tv'],
 [' furniture.kitchen.table', ' electronics.audio.dictaphone'],
 [' furniture.kitchen.table'],
 [' furniture.kitchen.table'],
 [' furniture.kitchen.table', ' furniture.kitchen.table'],
 [' furniture.kitchen.table'],
 [' furniture.kitchen.table'],
 [' furniture.kitchen.table'],
 [' furniture.kitchen.table'],
 [' furniture.kitchen.table',
  ' computers.components.videocards',
  ' furniture.kitchen.table',
  ' country_yard.lawn_mower'],
 [' electroni
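
The category strings in the output above carry a leading space, and some sessions list the same category more than once. A minimal sketch of one way to clean this before encoding is shown below, on a hand-made toy frame (the `toy` DataFrame and its values are illustrative only). Repeated categories do not change the binary encoding, so removing them is mainly for readability.

```python
import pandas as pd

# Toy stand-in for masterdf: two sessions, with padded and repeated categories.
toy = pd.DataFrame({
    "Session_ID": [1, 1, 1, 2],
    "Product_Categories": [
        " electronics.tablet",
        " appliances.kitchen.grill",
        " appliances.kitchen.grill",
        " furniture.kitchen.table",
    ],
})

# Remove the surrounding whitespace from every category label.
toy["Product_Categories"] = toy["Product_Categories"].str.strip()

# One list per session; dict.fromkeys drops repeats while keeping order.
clean_transactions = (
    toy.groupby("Session_ID", sort=False)["Product_Categories"]
    .apply(lambda s: list(dict.fromkeys(s)))
    .tolist()
)
```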

Now that we have a transaction list, we encode it as a binary database: the Apriori implementation in MLxtend expects a boolean DataFrame with one row per transaction and one column per item.

In [38]:
tr_enc = TransactionEncoder()
trans_array = tr_enc.fit(buy_transactions).transform(buy_transactions)
binary_database = pd.DataFrame(trans_array, columns=tr_enc.columns_)
binary_database

Unnamed: 0,accessories.bag,accessories.umbrella,apparel.costume,apparel.glove,apparel.shirt,apparel.shoes,apparel.sock,apparel.trousers,apparel.tshirt,appliances.environment.air_conditioner,...,sport.bicycle,sport.diving,sport.ski,sport.snowboard,sport.tennis,sport.trainer,stationery.battery,stationery.cartrige,stationery.paper,stationery.stapler
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
318439,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
318440,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
318441,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
318442,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


### 1.1.5 Merge store-clicks and product_df

In [39]:
clickprod_df = pd.merge(pdf.set_index('Item_ID'), cstore_clicks.set_index('Item_ID'),on='Item_ID')
clickprod_df = clickprod_df.reset_index()


We keep only the columns needed to build the click transactions.

In [40]:
clickprod_df = clickprod_df[['Session_ID','Item_ID','Product_Categories','Timestamp']]

In [41]:
clickprod_df

Unnamed: 0,Session_ID,Item_ID,Product_Categories,Timestamp
0,1,214536500,electronics.tablet,2014-04-07T10:54:09.868Z
1,21133,214536500,electronics.tablet,2014-04-06T11:07:41.937Z
2,26623,214536500,electronics.tablet,2014-04-06T12:54:28.549Z
3,25964,214536500,electronics.tablet,2014-04-02T00:37:45.985Z
4,33429,214536500,electronics.tablet,2014-04-03T19:49:41.979Z
...,...,...,...,...
20704507,6928129,214535055,kids.dolls,2014-07-28T09:46:10.486Z
20704508,6927372,214818485,furniture.living_room.chair,2014-07-23T19:32:47.960Z
20704509,6927372,214818485,furniture.living_room.chair,2014-07-23T19:33:33.705Z
20704510,6926506,214646096,apparel.glove,2014-07-27T10:12:04.929Z


(Only 500 transactions were used here, because there are about 2M of them and processing them all takes a very long time.)

In [42]:
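# Sessions that appear in the first 50 click rows (a small sample, for speed).
sample_ids = clickprod_df.Session_ID[:50].unique()
sample_clicks = clickprod_df[clickprod_df.Session_ID.isin(sample_ids)]

# One list of clicked product categories per sampled session.
click_transactions = (
    sample_clicks.groupby("Session_ID", sort=False)["Product_Categories"]
    .apply(list)
    .tolist()
)

click_transactions 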


[[' electronics.tablet',
  ' electronics.tablet',
  ' electronics.audio.headphone'],
 [' electronics.tablet', ' computers.network.router', ' accessories.bag'],
 [' electronics.tablet',
  ' electronics.tablet',
  ' auto.accessories.videoregister',
  ' appliances.personal.scales',
  ' computers.network.router',
  ' computers.peripherals.scanner',
  ' electronics.calculator',
  ' electronics.smartphone',
  ' appliances.kitchen.hood',
  ' appliances.kitchen.dishwasher',
  ' appliances.kitchen.washer',
  ' construction.tools.drill',
  ' computers.peripherals.scanner',
  ' kids.fmcg.diapers',
  ' computers.gaming',
  ' furniture.bedroom.bed',
  ' appliances.kitchen.mixer',
  ' stationery.paper',
  ' electronics.smartphone',
  ' medicine.tools.tonometer',
  ' apparel.trousers',
  ' electronics.audio.headphone',
  ' furniture.living_room.shelving',
  ' electronics.smartphone',
  ' electronics.clocks',
  ' appliances.kitchen.coffee_machine',
  ' construction.tools.drill',
  ' computers.ebooks',

In [43]:
tr_enc = TransactionEncoder()
trans_array = tr_enc.fit(click_transactions).transform(click_transactions)
binary_database2 = pd.DataFrame(trans_array, columns=tr_enc.columns_)
binary_database2

Unnamed: 0,accessories.bag,apparel.shoes,apparel.sock,apparel.trousers,apparel.tshirt,appliances.iron,appliances.ironing_board,appliances.kitchen.blender,appliances.kitchen.coffee_machine,appliances.kitchen.dishwasher,...,kids.fmcg.diapers,kids.swing,medicine.tools.tonometer,sport.diving,sport.ski,sport.tennis,sport.trainer,stationery.battery,stationery.cartrige,stationery.paper
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,True,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,True,False,False,False,False,True,True,...,True,False,True,False,False,False,False,False,False,True
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,False,False
4,False,False,True,False,False,True,False,False,True,False,...,True,False,False,False,False,False,True,True,False,False
5,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
6,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7,False,True,False,True,False,True,False,True,False,False,...,False,False,False,True,False,False,False,False,True,False
8,False,False,False,False,False,False,False,False,False,False,...,False,False,True,False,False,False,False,False,False,False
9,False,False,False,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


### 1.2. Compute Frequent Itemsets

* Compute frequent itemsets considering a minimum support of X%. 
* Present frequent itemsets organized by length (number of items). 
* List frequent 1-itemsets, 2-itemsets, 3-itemsets, etc with support of at least Y%.
* Change X and Y when it makes sense and discuss the results.

### 1.2.1 Store Buys

In [32]:
frequent_itemsets = apriori(binary_database, min_support=0.12,  use_colnames=True)
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.129602,( sport.tennis)


**What are the most bought products?**

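The most frequently bought product category is `sport.tennis`: it is the only itemset with support of at least 12%, appearing in about 13% of the buying sessions (support 0.1296).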



In [45]:
frequent_itemsets = apriori(binary_database, min_support=0.05, use_colnames=True)
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.065977,( appliances.kitchen.blender)
1,0.050015,( appliances.kitchen.grill)
2,0.061741,( appliances.kitchen.toster)
3,0.055894,( appliances.steam_cleaner)
4,0.083848,( computers.components.memory)
5,0.057407,( computers.peripherals.monitor)
6,0.0765,( country_yard.lawn_mower)
7,0.054075,( medicine.tools.tonometer)
8,0.129602,( sport.tennis)


In [46]:
frequent_itemsets = apriori(binary_database, min_support=0.01, use_colnames=True)
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.025672,( accessories.bag)
1,0.030448,( appliances.environment.fan)
2,0.047336,( appliances.environment.vacuum)
3,0.012658,( appliances.environment.water_heater)
4,0.024296,( appliances.iron)
5,0.065977,( appliances.kitchen.blender)
6,0.050015,( appliances.kitchen.grill)
7,0.043822,( appliances.kitchen.meat_grinder)
8,0.017592,( appliances.kitchen.microwave)
9,0.024023,( appliances.kitchen.mixer)


In [48]:
frequent_itemsets = apriori(binary_database, min_support=0.01, use_colnames=True)

frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))

frequent_itemsets

Unnamed: 0,support,itemsets,length
0,0.025672,( accessories.bag),1
1,0.030448,( appliances.environment.fan),1
2,0.047336,( appliances.environment.vacuum),1
3,0.012658,( appliances.environment.water_heater),1
4,0.024296,( appliances.iron),1
5,0.065977,( appliances.kitchen.blender),1
6,0.050015,( appliances.kitchen.grill),1
7,0.043822,( appliances.kitchen.meat_grinder),1
8,0.017592,( appliances.kitchen.microwave),1
9,0.024023,( appliances.kitchen.mixer),1


**Which products are bought together?**

In [49]:
frequent_itemsets = frequent_itemsets[ (frequent_itemsets['support'] >= 0.01) & (frequent_itemsets['length'] == 2)]
frequent_itemsets

Unnamed: 0,support,itemsets,length
39,0.011,"( computers.components.memory, appliances.env...",2
40,0.011519,"( country_yard.lawn_mower, appliances.kitchen...",2
41,0.010859,"( computers.components.memory, appliances.kit...",2
42,0.01203,"( sport.tennis, appliances.kitchen.toster)",2
43,0.010256,"( appliances.steam_cleaner, country_yard.lawn...",2
44,0.014116,"( computers.components.memory, sport.tennis)",2
45,0.019146,"( sport.tennis, country_yard.lawn_mower)",2


### 1.2.2 Store Clicks (only 500 transactions)

In [44]:
frequent_itemsets_clicks = apriori(binary_database2, min_support=0.2,  use_colnames=True)
frequent_itemsets_clicks

Unnamed: 0,support,itemsets
0,0.382979,( appliances.personal.scales)
1,0.297872,( electronics.audio.headphone)
2,1.0,( electronics.tablet)
3,0.255319,( furniture.living_room.sofa)
4,0.382979,"( appliances.personal.scales, electronics.tab..."
5,0.297872,"( electronics.tablet, electronics.audio.headp..."
6,0.255319,"( furniture.living_room.sofa, electronics.tab..."


In [45]:
frequent_itemsets_clicks = apriori(binary_database2, min_support=0.05, use_colnames=True)
frequent_itemsets_clicks

Unnamed: 0,support,itemsets
0,0.085106,( apparel.trousers)
1,0.085106,( appliances.iron)
2,0.106383,( appliances.kitchen.meat_grinder)
3,0.063830,( appliances.kitchen.mixer)
4,0.085106,( appliances.personal.hair_cutter)
...,...,...
136,0.085106,"( medicine.tools.tonometer, appliances.person..."
137,0.063830,"( furniture.living_room.shelving, computers.e..."
138,0.063830,"( furniture.living_room.sofa, electronics.tab..."
139,0.063830,"( electronics.tablet, electronics.audio.headp..."


In [46]:
frequent_itemsets_clicks = apriori(binary_database2, min_support=0.01, use_colnames=True)

# Add the itemset length to the clicks result (not to the store-buys result).
frequent_itemsets_clicks['length'] = frequent_itemsets_clicks['itemsets'].apply(lambda x: len(x))

frequent_itemsets_clicks

MemoryError: Unable to allocate 37.5 GiB for an array with shape (95099350, 9, 47) and data type bool

**Which products are clicked together?**

In [49]:
# Filter the clicks itemsets (not the store-buys ones) down to pairs.
pairs_mask = (frequent_itemsets_clicks['support'] >= 0.01) & (frequent_itemsets_clicks['itemsets'].apply(len) == 2)
frequent_itemsets_clicks = frequent_itemsets_clicks[pairs_mask]
frequent_itemsets_clicks


### 1.3. Generate Association Rules from Frequent Itemsets

* Generate association rules with a chosen value (C) for minimum confidence. 
* Generate association rules with a chosen value (L) for minimum lift. 
* Generate association rules with both confidence >= C% and lift >= L.
* Change C and L when it makes sense and discuss the results.
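
`association_rules` filters on a single metric per call, so the combined condition (confidence >= C and lift >= L) can be applied afterwards as a boolean mask on the resulting DataFrame. The sketch below uses a small hand-written table with the same column names; the four rows are copied from the store-buys rules in section 1.3.1, and the thresholds `C = 0.2` and `L = 2.0` are example values only.

```python
import pandas as pd

# Hand-written stand-in for the DataFrame returned by association_rules.
rules_example = pd.DataFrame({
    "antecedents": [
        frozenset({"appliances.environment.fan"}),
        frozenset({"appliances.kitchen.grill"}),
        frozenset({"country_yard.lawn_mower"}),
        frozenset({"computers.components.memory"}),
    ],
    "consequents": [
        frozenset({"computers.components.memory"}),
        frozenset({"country_yard.lawn_mower"}),
        frozenset({"sport.tennis"}),
        frozenset({"appliances.environment.fan"}),
    ],
    "confidence": [0.361283, 0.230301, 0.250277, 0.131194],
    "lift": [4.308768, 3.010463, 1.931120, 4.308768],
})

C, L = 0.2, 2.0  # example thresholds, to be tuned and discussed

# Keep only the rules that satisfy both thresholds at once.
strong_rules = rules_example[
    (rules_example["confidence"] >= C) & (rules_example["lift"] >= L)
]
```

With these thresholds the third rule is dropped for its lift and the fourth for its confidence.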

### 1.3.1 Store Buys

In [65]:
from mlxtend.frequent_patterns import association_rules


frequent_itemsets = apriori(binary_database, min_support=0.01, use_colnames=True)

rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.2)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,( appliances.environment.fan),( computers.components.memory),0.030448,0.083848,0.011,0.361283,4.308768,0.008447,1.434362
1,( appliances.kitchen.grill),( country_yard.lawn_mower),0.050015,0.0765,0.011519,0.230301,3.010463,0.007692,1.199819
2,( country_yard.lawn_mower),( sport.tennis),0.0765,0.129602,0.019146,0.250277,1.93112,0.009232,1.160959



* In 36% of the sessions in which `appliances.environment.fan` is bought, `computers.components.memory` is bought as well (confidence 0.36). 

* In 23% of the sessions in which `appliances.kitchen.grill` is bought, `country_yard.lawn_mower` is bought as well (confidence 0.23). 

* In 25% of the sessions in which `country_yard.lawn_mower` is bought, `sport.tennis` is bought as well (confidence 0.25). 


In [54]:
lift_rules = association_rules(frequent_itemsets, metric="lift", min_threshold=2.0)
lift_rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,( computers.components.memory),( appliances.environment.fan),0.083848,0.030448,0.011,0.131194,4.308768,0.008447,1.115959
1,( appliances.environment.fan),( computers.components.memory),0.030448,0.083848,0.011,0.361283,4.308768,0.008447,1.434362
2,( country_yard.lawn_mower),( appliances.kitchen.grill),0.0765,0.050015,0.011519,0.150569,3.010463,0.007692,1.118377
3,( appliances.kitchen.grill),( country_yard.lawn_mower),0.050015,0.0765,0.011519,0.230301,3.010463,0.007692,1.199819
4,( computers.components.memory),( appliances.kitchen.toster),0.083848,0.061741,0.010859,0.129508,2.097611,0.005682,1.07785
5,( appliances.kitchen.toster),( computers.components.memory),0.061741,0.083848,0.010859,0.175881,2.097611,0.005682,1.111674
6,( appliances.steam_cleaner),( country_yard.lawn_mower),0.055894,0.0765,0.010256,0.183493,2.398604,0.00598,1.131038
7,( country_yard.lawn_mower),( appliances.steam_cleaner),0.0765,0.055894,0.010256,0.134067,2.398604,0.00598,1.090276


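Lift measures how much more likely item Y is to be purchased when item X is purchased, while controlling for how popular item Y is: lift(X → Y) = confidence(X → Y) / support(Y).
A lift greater than 1 means that Y is bought together with X more often than its overall popularity would predict; a lift of 1 indicates independence, and a lift below 1 indicates that X and Y tend not to be bought together.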
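
As a worked check, the metrics of the rule `appliances.environment.fan` → `computers.components.memory` can be recomputed by hand from the three supports shown in the table. The helper name `rule_metrics` is ours, and because the inputs are the rounded values printed above, the results match the table only to about three decimal places.

```python
def rule_metrics(support_xy: float, support_x: float, support_y: float) -> dict:
    """Recompute the standard rule metrics for X -> Y from three supports."""
    confidence = support_xy / support_x
    lift = confidence / support_y
    leverage = support_xy - support_x * support_y
    # Conviction is infinite when the rule never fails (confidence == 1).
    conviction = (
        float("inf") if confidence == 1 else (1 - support_y) / (1 - confidence)
    )
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Supports copied from the rules table: fan -> memory.
metrics = rule_metrics(support_xy=0.011, support_x=0.030448, support_y=0.083848)
```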

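Can you find associations highlighting that when people buy a product/set of products they also buy other product(s)?

Answer: Yes, at the level of product categories. The rules above show, for instance, that sessions containing `appliances.environment.fan` also contain `computers.components.memory` with confidence 0.36 and lift 4.3, and that sessions containing `appliances.kitchen.grill` also contain `country_yard.lawn_mower` with confidence 0.23 and lift 3.0. Both lifts are well above 1, so these categories are bought together more often than their individual popularity would predict.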

### 1.3.2 Store Clicks (only 500 transactions)

In [86]:
frequent_itemsets = apriori(binary_database2, min_support=0.15, use_colnames=True)

rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.1)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,( appliances.personal.scales),( electronics.tablet),0.382567,0.992736,0.380145,0.993671,1.000942,0.000358,1.1477
1,( electronics.tablet),( appliances.personal.scales),0.992736,0.382567,0.380145,0.382927,1.000942,0.000358,1.000584
2,( electronics.audio.headphone),( electronics.tablet),0.188862,0.992736,0.181598,0.961538,0.968574,-0.005892,0.188862
3,( electronics.tablet),( electronics.audio.headphone),0.992736,0.188862,0.181598,0.182927,0.968574,-0.005892,0.992736
4,( electronics.tablet),( furniture.living_room.sofa),0.992736,0.179177,0.179177,0.180488,1.007317,0.001302,1.0016
5,( furniture.living_room.sofa),( electronics.tablet),0.179177,0.992736,0.179177,1.0,1.007317,0.001302,inf


In [88]:
lift_rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.0)
lift_rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,( appliances.personal.scales),( electronics.tablet),0.382567,0.992736,0.380145,0.993671,1.000942,0.000358,1.1477
1,( electronics.tablet),( appliances.personal.scales),0.992736,0.382567,0.380145,0.382927,1.000942,0.000358,1.000584
2,( electronics.tablet),( furniture.living_room.sofa),0.992736,0.179177,0.179177,0.180488,1.007317,0.001302,1.0016
3,( furniture.living_room.sofa),( electronics.tablet),0.179177,0.992736,0.179177,1.0,1.007317,0.001302,inf


### 1.4. Take a Look at Maximal Patterns: Compute Maximal Frequent Itemsets
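
A frequent itemset is maximal when none of its proper supersets is frequent. A minimal sketch of this filter is shown below; the helper `maximal_itemsets` is our own, and the toy table reuses a few store-buys supports from section 1.2.1 with the leading spaces stripped. MLxtend also provides `fpmax`, which mines maximal itemsets directly from the binary database.

```python
import pandas as pd


def maximal_itemsets(frequent: pd.DataFrame) -> pd.DataFrame:
    """Keep the itemsets that have no frequent proper superset."""
    itemsets = list(frequent["itemsets"])
    # For frozensets, `a < b` is True when a is a proper subset of b.
    keep = [not any(s < other for other in itemsets) for s in itemsets]
    return frequent.loc[keep].reset_index(drop=True)


# Hand-copied frequent itemsets (supports taken from the tables above).
frequent_example = pd.DataFrame({
    "support": [0.129602, 0.0765, 0.083848, 0.019146, 0.014116],
    "itemsets": [
        frozenset({"sport.tennis"}),
        frozenset({"country_yard.lawn_mower"}),
        frozenset({"computers.components.memory"}),
        frozenset({"sport.tennis", "country_yard.lawn_mower"}),
        frozenset({"computers.components.memory", "sport.tennis"}),
    ],
})

maximal_example = maximal_itemsets(frequent_example)
```

In this toy table the three single categories are each contained in a frequent pair, so only the two pairs remain.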

### 1.5. Conclusions 

# 2. Week vs Weekend Purchases

In this part of the project you should analyse the consumption patterns during the week vs during the weekend.

**In what follows keep the following questions in mind and be creative!**

1. Are the most interesting products the same during the week and the weekend? 
2. What are the most bought products during the week? And during the weekend?
3. Are there differences between the sets of products bought during the week and the weekend?
4. Can you find different associations highlighting that when people click on a product/set of products they also buy this product(s) during the week vs the weekend?
5. Discuss the results obtained for the week sessions vs weekend sessions.

### 2.1. Load and Preprocess Data

 **Product quantities should not be considered.**
 


### 2.1.1 Store Buys df

We have to split this DataFrame in two: one with the purchases made on weekdays and another with the purchases made on weekend days.
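
One way to make this split is to parse the `Timestamp` column and test the day of the week. The sketch below uses five rows hard-coded from `masterdf`; it assumes that timestamps are interpreted as UTC and that Saturday and Sunday count as the weekend, which are our assumptions rather than part of the assignment.

```python
import pandas as pd

# Five rows copied from masterdf, for illustration only.
buys_sample = pd.DataFrame({
    "Session_ID": [2859734, 4276371, 4440056, 70532, 691119],
    "Timestamp": [
        "2014-05-14T12:06:02.717Z",
        "2014-06-04T13:44:10.725Z",
        "2014-06-14T19:14:00.581Z",
        "2014-04-06T09:59:03.143Z",
        "2014-04-14T15:56:30.514Z",
    ],
})

# Parse the ISO 8601 strings; the trailing "Z" marks UTC.
parsed = pd.to_datetime(buys_sample["Timestamp"], utc=True)

# dayofweek: Monday is 0, ..., Saturday is 5, Sunday is 6.
is_weekend = parsed.dt.dayofweek >= 5

week_buys = buys_sample[~is_weekend]
weekend_buys = buys_sample[is_weekend]
```

The same mask can be applied to the clicks DataFrame. A session that spans midnight between Friday and Saturday would be split by this row-level rule, so assigning each session to the day of its first event may be preferable.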

In [96]:
masterdf.head(5)

Unnamed: 0,Session_ID,Item_ID,Product_Categories,Timestamp
0,2859734,214536500,electronics.tablet,2014-05-14T12:06:02.717Z
1,4276371,214536500,electronics.tablet,2014-06-04T13:44:10.725Z
2,4440056,214536500,electronics.tablet,2014-06-14T19:14:00.581Z
3,70532,214536506,electronics.tablet,2014-04-06T09:59:03.143Z
4,691119,214536506,electronics.tablet,2014-04-14T15:56:30.514Z


### 2.1.2 Store Clicks df

The same has to be done for the clicks df

In [97]:
clickprod_df.head(5)

Unnamed: 0,Session_ID,Item_ID,Product_Categories,Timestamp
0,1,214536500,electronics.tablet,2014-04-07T10:54:09.868Z
1,21133,214536500,electronics.tablet,2014-04-06T11:07:41.937Z
2,26623,214536500,electronics.tablet,2014-04-06T12:54:28.549Z
3,25964,214536500,electronics.tablet,2014-04-02T00:37:45.985Z
4,33429,214536500,electronics.tablet,2014-04-03T19:49:41.979Z


### 2.2. Compute Frequent Itemsets

* Compute frequent itemsets considering a minimum support of X%. 
* Present frequent itemsets organized by length (number of items). 
* List frequent 1-itemsets, 2-itemsets, 3-itemsets, etc with support of at least Y%.
* Change X and Y when it makes sense and discuss the results.

### 2.3. Generate Association Rules from Frequent Itemsets

* Generate association rules with a chosen value (C) for minimum confidence. 
* Generate association rules with a chosen value (L) for minimum lift. 
* Generate association rules with both confidence >= C% and lift >= L.
* Change C and L when it makes sense and discuss the results.

### 2.4. Conclusions 

# 3. [Only Groups of 3] Spring vs Summer Purchases

In this part of the project you should analyse the consumption patterns during the Spring months (April and May) vs Summer months (June and July).

**In what follows keep the following questions in mind and be creative!**

1. Are the most interesting products the same during the Spring and the Summer? 
2. What are the most bought products during the Spring? And during the Summer?
3. Are there differences between the sets of products bought during the Spring and the Summer?
4. Can you find different associations highlighting that when people click on a product/set of products they also buy this product(s) during the Spring vs the Summer?
5. Discuss the results obtained for the Spring sessions vs Summer sessions.

### 3.1. Load and Preprocess Data

 **Product quantities should not be considered.**
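
The Spring/Summer split can follow the month of each event, with April and May mapped to Spring and June and July to Summer as defined above. The sketch below uses five rows hard-coded from `masterdf`; the `season_of_month` mapping and the variable names are illustrative.

```python
import pandas as pd

# Five rows copied from masterdf, for illustration only.
season_sample = pd.DataFrame({
    "Session_ID": [2859734, 4276371, 4440056, 70532, 691119],
    "Timestamp": [
        "2014-05-14T12:06:02.717Z",
        "2014-06-04T13:44:10.725Z",
        "2014-06-14T19:14:00.581Z",
        "2014-04-06T09:59:03.143Z",
        "2014-04-14T15:56:30.514Z",
    ],
})

# Month number of each event, with timestamps interpreted as UTC.
month = pd.to_datetime(season_sample["Timestamp"], utc=True).dt.month

# Season definition taken from the task statement.
season_of_month = {4: "spring", 5: "spring", 6: "summer", 7: "summer"}
season = month.map(season_of_month)

spring_buys = season_sample[season == "spring"]
summer_buys = season_sample[season == "summer"]
```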

### 3.2. Compute Frequent Itemsets

* Compute frequent itemsets considering a minimum support of X%. 
* Present frequent itemsets organized by length (number of items). 
* List frequent 1-itemsets, 2-itemsets, 3-itemsets, etc with support of at least Y%.
* Change X and Y when it makes sense and discuss the results.

### 3.3. Generate Association Rules from Frequent Itemsets

* Generate association rules with a chosen value (C) for minimum confidence. 
* Generate association rules with a chosen value (L) for minimum lift. 
* Generate association rules with both confidence >= C% and lift >= L.
* Change C and L when it makes sense and discuss the results.

### 3.4. Conclusions 

## 4. Conclusions
Draw some conclusions about this project work.