# Homework 2. Frequent itemset

- Name: 이재혁
- Student ID: 201502552
- Submission date: 2020/03/30

*Remark. Do not import numpy, pandas, sklearn, or any module implementing the solution directly*

## Frequent itemset
- ***Support*** indicates how frequently the itemset $X$ appears in the dataset $T$.
- The support of $X$ with respect to $T$ is the proportion of transactions $t$ in the dataset that contain the itemset $X$.

$$
{\displaystyle \mathrm {supp} (X)={\frac {|\{t\in T;X\subseteq t\}|}{|T|}}} 
$$

- A frequent itemset is an itemset whose support is $\ge$ ***min_sup***.
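
As a rough illustration of the definition above, the following sketch computes the support of an itemset directly from the formula. The function name `support` and the toy transactions are assumptions made for this example only; they are not part of the assignment.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    target = set(itemset)
    # Count the transactions t with X ⊆ t, then divide by |T|.
    hits = sum(1 for t in transactions if target <= set(t))
    return hits / len(transactions)

# Toy data, chosen only for illustration.
toy = [['a', 'b'], ['a', 'c'], ['a', 'b', 'c'], ['c']]

print(support(['a'], toy))       # 0.75
print(support(['a', 'b'], toy))  # 0.5
```

With `min_sup = 0.5`, both `{a}` and `{a, b}` would count as frequent in this toy data, since their supports are at least 0.5.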

## Data set

- Each line of the following data can be thought of as a market basket containing the items you want to buy.

In [183]:
# DO NOT EDIT THIS CELL
data_str = 'apple,beer,rice,chicken\n'
data_str += 'apple,beer,rice\n'
data_str += 'apple,beer\n'
data_str += 'apple,mango\n'
data_str += 'milk,beer,rice,chicken\n'
data_str += 'milk,beer,rice\n'
data_str += 'milk,beer\n'
data_str += 'milk,mango'

In [184]:
data_str

'apple,beer,rice,chicken\napple,beer,rice\napple,beer\napple,mango\nmilk,beer,rice,chicken\nmilk,beer,rice\nmilk,beer\nmilk,mango'

## Problem 1 (2 pts)

- Define a function ***gen_record*** that yields one list of items on each call to ***next***.
- It must be a generator.
- Use ***yield*** instead of ***return***
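
To make the difference between ***yield*** and ***return*** concrete, here is a small sketch of how a generator behaves. The names `countdown` and `calls` are hypothetical and unrelated to the homework data; the point is only that the function body runs lazily, one `next` at a time.

```python
calls = []  # records when the generator body actually starts running

def countdown(n):
    calls.append('started')
    while n > 0:
        yield n  # pause here and hand n to the caller
        n -= 1

gen = countdown(2)
started_early = bool(calls)  # False: calling the function has not run the body yet

first = next(gen)         # runs the body up to the first yield
second = next(gen)        # resumes after the yield and runs to the next one
done = next(gen, 'done')  # the default is returned instead of raising StopIteration

print(started_early, first, second, done)  # False 2 1 done
```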

In [185]:
# YOUR CODE MUST BE HERE

def gen_record(s):
    records = s.split('\n')  # split the input on '\n'; for data_str this gives 8 lines
    for record in records:  # visit each line in turn
        # split the line on ',' so each record is a list of items, not a string
        yield record.split(',')  # a generator uses yield instead of return

In [186]:
# DO NOT EDIT THIS CELL
test = gen_record(data_str)
next(test)

['apple', 'beer', 'rice', 'chicken']

**Your output must be:**
```
['apple', 'beer', 'rice', 'chicken']
```

In [187]:
# DO NOT EDIT THIS CELL
next(test)

['apple', 'beer', 'rice']

**Your output must be:**
```
['apple', 'beer', 'rice']
```

## Problem 2 (10 pts)

- Define a function ***gen_frequent_1_itemset*** that generates the frequent 1-itemsets.
- It must be a generator.
- We want to find the frequent 1-itemsets (itemsets containing exactly 1 item).

In [188]:
dataset = list(gen_record(data_str))
dataset

[['apple', 'beer', 'rice', 'chicken'],
 ['apple', 'beer', 'rice'],
 ['apple', 'beer'],
 ['apple', 'mango'],
 ['milk', 'beer', 'rice', 'chicken'],
 ['milk', 'beer', 'rice'],
 ['milk', 'beer'],
 ['milk', 'mango']]

In [189]:
# YOUR CODE MUST BE HERE
from functools import reduce
from collections import Counter

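def gen_frequent_1_itemset(dataset, min_sup=0.5):
    # Flatten all transactions into one list of item occurrences.
    occurrences = reduce(lambda x, y: x + y, dataset)
    items = list(set(occurrences))  # distinct items; set order is arbitrary
    # Occurrences per item (assumes no item repeats within a transaction).
    count = Counter(occurrences)
    leng = len(dataset)  # number of transactions, |T|
    for st in items:
        # Keep the item when its count reaches min_sup * |T|.
        if count[st] >= leng * min_sup:
            yield st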

In [190]:
# DO NOT EDIT THIS CELL
dataset = list(gen_record(data_str))
for item in gen_frequent_1_itemset(dataset, 0.5):
    print(item)
print('No more items')

rice
apple
beer
milk
No more items


**Your output must be:**
```
rice
beer
milk
apple
No more items
```

In [191]:
# DO NOT EDIT THIS CELL
dataset = list(gen_record(data_str))
for item in gen_frequent_1_itemset(dataset, 0.7):
    print(item)
print('No more items')

beer
No more items


**Your output must be:**
```
beer
No more items
```

In [192]:
# DO NOT EDIT THIS CELL
dataset = list(gen_record(data_str))
for item in gen_frequent_1_itemset(dataset, 0.2):
    print(item)
print('No more items')

chicken
rice
apple
beer
mango
milk
No more items


**Your output must be:**
```
rice
chicken
beer
mango
milk
apple
No more items
```
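
The printed order can differ from the expected order because the solution iterates over a `set`, whose iteration order is not guaranteed and can change between runs for strings. If a reproducible order is wanted, one option is to sort the items before yielding them. The sketch below is a hypothetical variant (the name `frequent_1_itemset_sorted` is an assumption, not part of the assignment); it also counts each item at most once per transaction.

```python
from collections import Counter

def frequent_1_itemset_sorted(transactions, min_sup=0.5):
    """Yield frequent single items in alphabetical order."""
    counts = Counter()
    for t in transactions:
        counts.update(set(t))  # count each item at most once per transaction
    threshold = min_sup * len(transactions)
    for item in sorted(counts):  # sorting makes the output order reproducible
        if counts[item] >= threshold:
            yield item

# Same baskets as data_str, rebuilt here so the example is self-contained.
raw = ('apple,beer,rice,chicken\napple,beer,rice\napple,beer\napple,mango\n'
       'milk,beer,rice,chicken\nmilk,beer,rice\nmilk,beer\nmilk,mango')
baskets = [line.split(',') for line in raw.split('\n')]

print(list(frequent_1_itemset_sorted(baskets, 0.5)))  # ['apple', 'beer', 'milk', 'rice']
print(list(frequent_1_itemset_sorted(baskets, 0.7)))  # ['beer']
```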

## Problem 3 (10 pts)

- Define a function ***gen_frequent_2_itemset*** that generates the frequent 2-itemsets.
- It must be a generator.
- We want to find the frequent 2-itemsets (itemsets containing exactly 2 items).

In [193]:
dataset = list(gen_record(data_str))
dataset

[['apple', 'beer', 'rice', 'chicken'],
 ['apple', 'beer', 'rice'],
 ['apple', 'beer'],
 ['apple', 'mango'],
 ['milk', 'beer', 'rice', 'chicken'],
 ['milk', 'beer', 'rice'],
 ['milk', 'beer'],
 ['milk', 'mango']]

In [194]:
# YOUR CODE MUST BE HERE
from functools import reduce
from collections import Counter

def gen_frequent_2_itemset(dataset, min_sup=0.5):
    items = list(set(reduce(lambda x, y: x + y, dataset)))  # distinct items
    # Pair up the items as tuples (i, j) with i < j so each pair appears once.
    pairs = [(i, j) for i in items for j in items if i < j]
    # For each pair, count the transactions that contain both items (same index as pairs).
    counts = [sum(set(pair) <= set(t) for t in dataset) for pair in pairs]
    leng = len(dataset)  # number of transactions, |T|
    for i in range(len(counts)):
        if counts[i] >= min_sup * leng:
            yield pairs[i]

In [195]:
# DO NOT EDIT THIS CELL
data = list(gen_record(data_str))
for item in gen_frequent_2_itemset(data, 0.5):
    print(item)
print('No more items')

('beer', 'rice')
No more items


**Your output must be:**
```
('beer', 'rice')
No more items
```

In [196]:
# DO NOT EDIT THIS CELL
data = list(gen_record(data_str))
for item in gen_frequent_2_itemset(data, 0.3):
    print(item)
print('No more items')

('apple', 'beer')
('beer', 'rice')
('beer', 'milk')
No more items


**Your output must be:**
```
('beer', 'rice')
('beer', 'milk')
('apple', 'beer')
No more items
```

In [197]:
# DO NOT EDIT THIS CELL
dataset = list(gen_record(data_str))
for item in gen_frequent_2_itemset(dataset, 0.2):
    print(item)
print('No more items')

('chicken', 'rice')
('apple', 'rice')
('apple', 'beer')
('beer', 'chicken')
('beer', 'rice')
('beer', 'milk')
('milk', 'rice')
No more items


**Your output must be:**
```
('chicken', 'rice')
('beer', 'rice')
('beer', 'chicken')
('beer', 'milk')
('milk', 'rice')
('apple', 'rice')
('apple', 'beer')
No more items
```
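
An alternative to testing every possible pair is to build candidate pairs only from items that are themselves frequent, because a pair cannot be frequent unless both of its items are (the Apriori principle). The sketch below swaps in that technique together with `itertools.combinations`; it is not the method used in the solution above, and the name `frequent_2_itemset_pruned` is an assumption made for this example.

```python
from collections import Counter
from itertools import combinations

def frequent_2_itemset_pruned(transactions, min_sup=0.5):
    """Yield frequent pairs as sorted tuples, pruning with frequent single items."""
    threshold = min_sup * len(transactions)
    # Pass 1: support counts of single items.
    singles = Counter()
    for t in transactions:
        singles.update(set(t))
    frequent = {item for item, n in singles.items() if n >= threshold}
    # Pass 2: count only the pairs whose members are both frequent.
    pairs = Counter()
    for t in transactions:
        kept = sorted(set(t) & frequent)  # sorted, so each tuple is (a, b) with a < b
        pairs.update(combinations(kept, 2))
    for pair in sorted(pairs):  # sorted for a reproducible output order
        if pairs[pair] >= threshold:
            yield pair

# Same baskets as data_str, rebuilt here so the example is self-contained.
raw = ('apple,beer,rice,chicken\napple,beer,rice\napple,beer\napple,mango\n'
       'milk,beer,rice,chicken\nmilk,beer,rice\nmilk,beer\nmilk,mango')
baskets = [line.split(',') for line in raw.split('\n')]

print(list(frequent_2_itemset_pruned(baskets, 0.5)))  # [('beer', 'rice')]
print(list(frequent_2_itemset_pruned(baskets, 0.3)))
# [('apple', 'beer'), ('beer', 'milk'), ('beer', 'rice')]
```

The pruning matters little on eight baskets, but it shrinks the number of candidate pairs when many items are rare.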

## Ethics:
If you cheat, you will receive the negative of the total points.
If the homework total is 22 and you cheat, you get -22.

## What to submit
- Run **all cells**
- Go to "File -> Print Preview"
- Print the page as a PDF
- Submit the PDF file in Google Classroom
- No late homework is accepted
- Your homework will be graded on the basis of correctness and programming skills