<a href="https://colab.research.google.com/github/MasahiroAraki/MachineLearning/blob/master/Python/chap12.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 12: Pattern Mining

In [1]:
!pip list | grep mlxtend

mlxtend                          0.22.0


In [2]:
# Update mlxtend to the latest version, 0.23.0 (as of 2023-10-23)
!pip install -U mlxtend

Collecting mlxtend
  Downloading mlxtend-0.23.0-py3-none-any.whl (1.4 MB)
Installing collected packages: mlxtend
  Attempting uninstall: mlxtend
    Found existing installation: mlxtend 0.22.0
    Uninstalling mlxtend-0.22.0:
      Successfully uninstalled mlxtend-0.22.0
Successfully installed mlxtend-0.23.0


## Frequent Itemset Mining

Implementation of the Apriori algorithm:

http://rasbt.github.io/mlxtend/api_subpackages/mlxtend.frequent_patterns/

### Example 12.1

Load the data for Example 12.1 and convert it to a pandas DataFrame. Note that the items (columns) end up sorted alphabetically.

In [3]:
import numpy as np
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules, fpgrowth

In [4]:
dataset = [
    ['Milk', 'Bread', 'Butter'],
    ['Milk', 'Bread', 'Jam'],
    ['Milk', 'Magazine'],
    ['Bread', 'Butter'],
    ['Milk', 'Bread', 'Butter', 'Jam'],
    ['Magazine'],
    ['Milk', 'Bread', 'Jam', 'Magazine'],
    ['Jam']]



In [5]:
# If warnings appear, suppress them as follows
import warnings
warnings.simplefilter('ignore')



In [6]:
# Convert the sparse (list-of-transactions) representation into a matrix of boolean values
te = TransactionEncoder()
te_ary = te.fit_transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
df

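| | Bread | Butter | Jam | Magazine | Milk |
|---|---|---|---|---|---|
| 0 | True | True | False | False | True |
| 1 | True | False | True | False | True |
| 2 | False | False | False | True | True |
| 3 | True | True | False | False | False |
| 4 | True | True | True | False | True |
| 5 | False | False | False | True | False |
| 6 | True | False | True | True | True |
| 7 | False | False | True | False | False |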
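As a rough illustration of what the encoding step produces, the sketch below rebuilds the same boolean matrix in plain Python. It is not mlxtend's implementation; the helper name `one_hot_encode` is hypothetical and exists only for this example.

```python
# Hypothetical helper for illustration only; not part of mlxtend.
def one_hot_encode(transactions):
    """Return (sorted item names, boolean rows) for a list of transactions."""
    # Collect every distinct item and sort the names alphabetically.
    columns = sorted({item for t in transactions for item in t})
    # One row per transaction: True where the transaction contains the item.
    rows = [[item in t for item in columns] for t in transactions]
    return columns, rows

dataset = [
    ['Milk', 'Bread', 'Butter'],
    ['Milk', 'Bread', 'Jam'],
    ['Milk', 'Magazine'],
    ['Bread', 'Butter'],
    ['Milk', 'Bread', 'Butter', 'Jam'],
    ['Magazine'],
    ['Milk', 'Bread', 'Jam', 'Magazine'],
    ['Jam']]

columns, rows = one_hot_encode(dataset)
print(columns)
for row in rows:
    print(row)
```

Sorting the item names is what puts the columns in alphabetical order, regardless of the order in which items appear in the transactions.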


Use the Apriori algorithm to extract the itemsets that appear in at least 3 transactions.

In [7]:
freq = apriori(df, min_support= 3/len(df), use_colnames=True)
freq

| | support | itemsets |
|---|---|---|
| 0 | 0.625 | (Bread) |
| 1 | 0.375 | (Butter) |
| 2 | 0.5 | (Jam) |
| 3 | 0.375 | (Magazine) |
| 4 | 0.625 | (Milk) |
| 5 | 0.375 | (Butter, Bread) |
| 6 | 0.375 | (Bread, Jam) |
| 7 | 0.5 | (Milk, Bread) |
| 8 | 0.375 | (Milk, Jam) |
| 9 | 0.375 | (Milk, Bread, Jam) |
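The level-wise search behind Apriori can be sketched in plain Python. This is a simplified illustration, not mlxtend's implementation; the names `support`, `frequent`, and `level` are chosen for this example only.

```python
from itertools import combinations

dataset = [
    ['Milk', 'Bread', 'Butter'],
    ['Milk', 'Bread', 'Jam'],
    ['Milk', 'Magazine'],
    ['Bread', 'Butter'],
    ['Milk', 'Bread', 'Butter', 'Jam'],
    ['Magazine'],
    ['Milk', 'Bread', 'Jam', 'Magazine'],
    ['Jam']]

transactions = [set(t) for t in dataset]
n = len(transactions)
min_support = 3 / n  # at least 3 of the 8 transactions

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / n

items = sorted({i for t in transactions for i in t})
frequent = {}                                # itemset -> support
level = [frozenset([i]) for i in items]      # candidates of size 1
k = 1
while level:
    # Keep only the candidates that reach the support threshold.
    survivors = {s: support(s) for s in level}
    survivors = {s: v for s, v in survivors.items() if v >= min_support}
    frequent.update(survivors)
    # Build candidates of size k+1 and prune any candidate that has an
    # infrequent subset of size k (the Apriori property).
    k += 1
    candidates = {a | b for a in survivors for b in survivors if len(a | b) == k}
    level = [c for c in candidates
             if all(frozenset(sub) in survivors for sub in combinations(c, k - 1))]

for itemset, value in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(value, sorted(itemset))
```

The pruning step is what keeps the search tractable: an itemset can only be frequent if all of its subsets are frequent, so most larger candidates are never counted.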


## Rule Extraction


### Example 12.2

Extract the rules whose confidence is at least 0.7.

In [8]:
ar = association_rules(freq, metric='confidence', min_threshold=0.7)

In [9]:
ar[['antecedents', 'consequents', 'support', 'confidence', 'lift']]

| | antecedents | consequents | support | confidence | lift |
|---|---|---|---|---|---|
| 0 | (Butter) | (Bread) | 0.375 | 1.0 | 1.6 |
| 1 | (Jam) | (Bread) | 0.375 | 0.75 | 1.2 |
| 2 | (Milk) | (Bread) | 0.5 | 0.8 | 1.28 |
| 3 | (Bread) | (Milk) | 0.5 | 0.8 | 1.28 |
| 4 | (Jam) | (Milk) | 0.375 | 0.75 | 1.2 |
| 5 | (Milk, Bread) | (Jam) | 0.375 | 0.75 | 1.5 |
| 6 | (Milk, Jam) | (Bread) | 0.375 | 1.0 | 1.6 |
| 7 | (Bread, Jam) | (Milk) | 0.375 | 1.0 | 1.6 |
| 8 | (Jam) | (Milk, Bread) | 0.375 | 0.75 | 1.5 |
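To see where the confidence and lift columns come from, the sketch below recomputes them for two rules from the support values listed in the frequent itemset table. It uses the standard definitions, confidence(A → C) = support(A ∪ C) / support(A) and lift(A → C) = confidence(A → C) / support(C); the function names are assumptions made for this example, not mlxtend API.

```python
# Support values copied from the frequent itemset table.
support = {
    frozenset(['Bread']): 0.625,
    frozenset(['Butter']): 0.375,
    frozenset(['Milk']): 0.625,
    frozenset(['Butter', 'Bread']): 0.375,
    frozenset(['Milk', 'Bread']): 0.5,
}

def confidence(antecedent, consequent):
    """support(A and C together) divided by support(A)."""
    a, c = frozenset(antecedent), frozenset(consequent)
    return support[a | c] / support[a]

def lift(antecedent, consequent):
    """Confidence divided by the support of the consequent."""
    return confidence(antecedent, consequent) / support[frozenset(consequent)]

for a, c in [(['Butter'], ['Bread']), (['Milk'], ['Bread'])]:
    print(a, '->', c,
          'confidence:', round(confidence(a, c), 2),
          'lift:', round(lift(a, c), 2))
```

A lift above 1 means the consequent appears more often in transactions containing the antecedent than it does overall.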


To visualize the extracted rules with plotly, convert the data format. Note that in this example several rules share the same scores, so their points overlap in the plot.

In [10]:
alist = []
clist = []
for a, c in zip(ar['antecedents'], ar['consequents']):
  alist.append(','.join(a))
  clist.append(','.join(c))
ar2 = ar.drop(['antecedents','consequents'], axis=1)
ar2['antecedents'] = alist
ar2['consequents'] = clist
ar2

| | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric | antecedents | consequents |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.375 | 0.625 | 0.375 | 1.0 | 1.6 | 0.140625 | inf | 0.6 | Butter | Bread |
| 1 | 0.5 | 0.625 | 0.375 | 0.75 | 1.2 | 0.0625 | 1.5 | 0.333333 | Jam | Bread |
| 2 | 0.625 | 0.625 | 0.5 | 0.8 | 1.28 | 0.109375 | 1.875 | 0.583333 | Milk | Bread |
| 3 | 0.625 | 0.625 | 0.5 | 0.8 | 1.28 | 0.109375 | 1.875 | 0.583333 | Bread | Milk |
| 4 | 0.5 | 0.625 | 0.375 | 0.75 | 1.2 | 0.0625 | 1.5 | 0.333333 | Jam | Milk |
| 5 | 0.5 | 0.5 | 0.375 | 0.75 | 1.5 | 0.125 | 2.0 | 0.666667 | Milk,Bread | Jam |
| 6 | 0.375 | 0.625 | 0.375 | 1.0 | 1.6 | 0.140625 | inf | 0.6 | Milk,Jam | Bread |
| 7 | 0.375 | 0.625 | 0.375 | 1.0 | 1.6 | 0.140625 | inf | 0.6 | Bread,Jam | Milk |
| 8 | 0.5 | 0.5 | 0.375 | 0.75 | 1.5 | 0.125 | 2.0 | 0.666667 | Jam | Milk,Bread |


In [11]:
import plotly.express as px
fig = px.scatter(ar2, x = 'support', y = 'confidence', color='lift', hover_data=['antecedents','consequents'], range_x=[0.3, 0.6])
fig.show()
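The overlap in the scatter plot can be checked without plotting. The sketch below groups the nine rules of Example 12.2 by their (support, confidence) pair; the values are copied from the rule table, and the variable names are chosen for this example.

```python
from collections import defaultdict

# (antecedents, consequents, support, confidence) copied from the rule table.
rules = [
    ('Butter', 'Bread', 0.375, 1.0),
    ('Jam', 'Bread', 0.375, 0.75),
    ('Milk', 'Bread', 0.5, 0.8),
    ('Bread', 'Milk', 0.5, 0.8),
    ('Jam', 'Milk', 0.375, 0.75),
    ('Milk,Bread', 'Jam', 0.375, 0.75),
    ('Milk,Jam', 'Bread', 0.375, 1.0),
    ('Bread,Jam', 'Milk', 0.375, 1.0),
    ('Jam', 'Milk,Bread', 0.375, 0.75),
]

# Rules with identical coordinates are drawn on top of each other.
points = defaultdict(list)
for antecedent, consequent, support, confidence in rules:
    points[(support, confidence)].append(f'{antecedent} -> {consequent}')

for (support, confidence), labels in sorted(points.items()):
    print(f'support={support}, confidence={confidence}: {len(labels)} rule(s)')
    for label in labels:
        print('   ', label)
```

With only a handful of distinct coordinates, most markers hide other rules, so the hover text of a point shows just one of the rules drawn there.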

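### Example 12.3

Try FPGrowth (a faster alternative to Apriori that avoids candidate generation) on supermarket.arff. Download the data and read the Weka ARFF file with scipy's loadarff. Then, in a pandas DataFrame, convert the characters t/? to the booleans True/False and drop the final column, total.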

In [12]:
!wget https://raw.githubusercontent.com/fracpete/wekamooc/master/dataminingwithweka/data/supermarket.arff

--2023-10-23 02:00:29--  https://raw.githubusercontent.com/fracpete/wekamooc/master/dataminingwithweka/data/supermarket.arff
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2025871 (1.9M) [text/plain]
Saving to: ‘supermarket.arff’


2023-10-23 02:00:30 (24.5 MB/s) - ‘supermarket.arff’ saved [2025871/2025871]



In [13]:
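from scipy.io import arff
# loadarff returns the records and the metadata; nominal values arrive as byte strings
data, meta = arff.loadarff('supermarket.arff')
df = pd.DataFrame(data)
# Map b't' to True and b'?' (missing) to False
df2 = df.replace({b'?':False, b't':True})
# Drop the final column, total
df2 = df2.drop('total', axis=1)
df2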

| | department1 | department2 | department3 | department4 | department5 | department6 | department7 | department8 | department9 | grocery misc | ... | department207 | department208 | department209 | department210 | department211 | department212 | department213 | department214 | department215 | department216 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | True | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | True | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4622 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4623 | False | False | False | True | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4624 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4625 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |


Find the frequent itemsets with FPGrowth, then extract rules with association_rules.

In [14]:
freq2 = fpgrowth(df2, min_support=0.3, use_colnames=True)

In [15]:
ar = association_rules(freq2, metric='lift', min_threshold=1.2)

Convert the data for visualization with plotly.

In [16]:
alist = []
clist = []
for a, c in zip(ar['antecedents'], ar['consequents']):
  alist.append(','.join(a))
  clist.append(','.join(c))
ar2 = ar.drop(['antecedents','consequents'], axis=1)
ar2['antecedents'] = alist
ar2['consequents'] = clist
ar2

| | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric | antecedents | consequents |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.49665 | 0.640156 | 0.387076 | 0.779373 | 1.217475 | 0.069143 | 1.631011 | 0.354878 | bread and cake,vegetables | fruit |
| 1 | 0.502485 | 0.639939 | 0.387076 | 0.770323 | 1.203743 | 0.065516 | 1.567679 | 0.340207 | bread and cake,fruit | vegetables |
| 2 | 0.639939 | 0.502485 | 0.387076 | 0.604863 | 1.203743 | 0.065516 | 1.259095 | 0.470081 | vegetables | bread and cake,fruit |
| 3 | 0.640156 | 0.49665 | 0.387076 | 0.604659 | 1.217475 | 0.069143 | 1.273204 | 0.496403 | fruit | bread and cake,vegetables |
| 4 | 0.437649 | 0.640156 | 0.339529 | 0.775802 | 1.211897 | 0.059366 | 1.605033 | 0.310922 | milk-cream,vegetables | fruit |
| 5 | 0.440458 | 0.639939 | 0.339529 | 0.770854 | 1.204573 | 0.057662 | 1.571313 | 0.303517 | milk-cream,fruit | vegetables |
| 6 | 0.639939 | 0.440458 | 0.339529 | 0.530564 | 1.204573 | 0.057662 | 1.191945 | 0.471671 | vegetables | milk-cream,fruit |
| 7 | 0.640156 | 0.437649 | 0.339529 | 0.530385 | 1.211897 | 0.059366 | 1.197473 | 0.485897 | fruit | milk-cream,vegetables |
| 8 | 0.410633 | 0.639939 | 0.321807 | 0.783684 | 1.224622 | 0.059026 | 1.664513 | 0.311218 | fruit,baking needs | vegetables |
| 9 | 0.639939 | 0.410633 | 0.321807 | 0.502871 | 1.224622 | 0.059026 | 1.18554 | 0.509419 | vegetables | fruit,baking needs |


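Each rule is plotted in the two-dimensional space of support and confidence, and its lift is shown by the brightness of the color. Hovering over a point displays the rule it represents and that rule's scores.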

In [17]:
import plotly.express as px
fig = px.scatter(ar2, x = 'support', y = 'confidence', color='lift', hover_data=['antecedents','consequents'])
fig.show()

## Matrix Factorization

Perform matrix factorization on a small movie-rating dataset.



Load the libraries.

In [18]:
import numpy as np
from sklearn.decomposition import NMF

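Use the example data from the <a href="http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/">reference page</a>. Rows are users (5 people), columns are movies (4 titles), and the values are ratings on a 5-point scale from 1 to 5, with 0 meaning no rating.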

In [19]:
X = np.array([
    [5,3,0,1],
    [4,0,0,1],
    [1,1,0,5],
    [1,0,0,4],
    [0,1,5,4]
])

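Here we use non-negative matrix factorization, [NMF](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html). NMF factorizes X into two non-negative matrices, W and H. The n_components argument sets the number of latent dimensions.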

In [20]:
model = NMF(n_components = 2)
W = model.fit_transform(X)
H = model.components_
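To make the factorization X ≈ WH concrete, here is a minimal NumPy sketch that uses multiplicative updates for the squared (Frobenius) error. This technique is swapped in for illustration and is not necessarily the solver scikit-learn runs; the seed, the iteration count, and the small constant `eps` are assumptions. Note that this objective treats the 0 entries as ordinary values to be reconstructed, not as missing ratings.

```python
import numpy as np

X = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4]
], dtype=float)

rng = np.random.default_rng(0)      # fixed seed (assumption) for repeatability
k = 2                               # number of latent dimensions
W = rng.random((X.shape[0], k))     # users x latent dimensions, non-negative
H = rng.random((k, X.shape[1]))     # latent dimensions x movies, non-negative
eps = 1e-9                          # guards against division by zero

def frobenius_error(X, W, H):
    """Euclidean distance between X and its reconstruction W @ H."""
    return np.linalg.norm(X - W @ H)

initial_error = frobenius_error(X, W, H)
for _ in range(500):
    # Each factor is multiplied by a non-negative ratio, so it stays non-negative.
    H *= (W.T @ X) / (W.T @ W @ H + eps)
    W *= (X @ H.T) / (W @ H @ H.T + eps)
final_error = frobenius_error(X, W, H)

print('error before:', initial_error)
print('error after: ', final_error)
print(np.round(W @ H, 2))
```

Because the result depends on the random initialization, the factors from this sketch will generally differ from the W and H returned by the NMF model in the notebook.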

Predict the values of the empty cells. Judge whether the filled-in values are plausible by comparing them with the ratings of similar users.

In [21]:
np.set_printoptions(suppress=True)
np.dot(W,H)

array([[5.25583567, 1.99314227, 0.        , 1.4551097 ],
       [3.50429816, 1.32891613, 0.        , 0.97018601],
       [1.31291755, 0.94415648, 1.94957379, 3.94613668],
       [0.98127107, 0.72179686, 1.5276022 , 3.07887938],
       [0.        , 0.65008604, 2.83999054, 5.21892798]])

Display the two-dimensional vectors that represent the users. Check that users who rate similarly have similar vectors.

In [22]:
W

array([[0.        , 1.84358059],
       [0.        , 1.22919674],
       [0.33623743, 0.46052987],
       [0.2634612 , 0.34419879],
       [0.48980507, 0.        ]])
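One way to quantify "similar vectors" is cosine similarity. The sketch below applies it to the user vectors printed above; the choice of cosine similarity and the helper name `cosine` are assumptions made for this illustration.

```python
import math

# User vectors copied from the W output above.
W = [
    [0.0,        1.84358059],
    [0.0,        1.22919674],
    [0.33623743, 0.46052987],
    [0.2634612,  0.34419879],
    [0.48980507, 0.0],
]

def cosine(u, v):
    """Cosine of the angle between two 2-dimensional vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Compare every pair of users.
for i in range(len(W)):
    for j in range(i + 1, len(W)):
        print(f'user {i} vs user {j}: {cosine(W[i], W[j]):.3f}')
```

Users 0 and 1 point in the same direction, as do users 2 and 3, while user 4 is orthogonal to users 0 and 1.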

Display the two-dimensional vectors that represent the movies. Because no two movies are rated alike, check that each movie has a distinct vector.

In [23]:
H.T

array([[ 0.        ,  2.85088469],
       [ 1.3272342 ,  1.08112566],
       [ 5.79820565,  0.        ],
       [10.65511214,  0.78928456]])

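## Exercises

1. How could you find rules with a higher lift than the rules obtained in the rule-extraction code example that uses the Apriori algorithm?
1. Following the steps of Exercise 12-3 on p. 226 of the textbook, run NMF on the MovieLens dataset with scikit-surprise. If you have time to spare, change the hyperparameters and observe their effect on performance.

### Sample Answers

#### Exercise 1

Lower min_support in FPGrowth to obtain more frequent itemsets, then evaluate the rules with association_rules. Note, however, that lowering min_support too far produces itemsets that carry little meaning.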

In [24]:
freq2 = fpgrowth(df2, min_support=0.2, use_colnames=True)
association_rules(freq2, metric="lift", min_threshold=1.5)

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | (bread and cake, biscuits, vegetables) | (frozen foods, fruit) | 0.321375 | 0.402204 | 0.200778 | 0.624748 | 1.553309 | 0.07152 | 1.59305 | 0.524904 |
| 1 | (bread and cake, frozen foods, vegetables) | (biscuits, fruit) | 0.334558 | 0.397018 | 0.200778 | 0.600129 | 1.511594 | 0.067953 | 1.507943 | 0.508604 |
| 2 | (bread and cake, frozen foods, fruit) | (biscuits, vegetables) | 0.334558 | 0.381241 | 0.200778 | 0.600129 | 1.574148 | 0.073231 | 1.547398 | 0.548111 |
| 3 | (biscuits, vegetables) | (bread and cake, frozen foods, fruit) | 0.381241 | 0.334558 | 0.200778 | 0.526644 | 1.574148 | 0.073231 | 1.405796 | 0.589463 |
| 4 | (biscuits, fruit) | (bread and cake, frozen foods, vegetables) | 0.397018 | 0.334558 | 0.200778 | 0.505716 | 1.511594 | 0.067953 | 1.346274 | 0.561288 |
| 5 | (frozen foods, fruit) | (bread and cake, biscuits, vegetables) | 0.402204 | 0.321375 | 0.200778 | 0.499194 | 1.553309 | 0.07152 | 1.355067 | 0.595878 |


#### Exercise 2

In [25]:
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading scikit-surprise-1.1.3.tar.gz (771 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.3-cp310-cp310-linux_x86_64.whl size=3163346 sha256=1af5ab073d9225043abfc384679216757a348cd29054b88cdc0109185f0bc9d6
  Stored in directory: /root/.cache/pip/wheels/a5/ca/a8/4e28def53797fdc4363ca4af740db15a9c2f1595ebc51fb445
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.3


In [26]:
from surprise import NMF
from surprise import Dataset
from surprise.model_selection import cross_validate

# Load the movielens-100k data
X = Dataset.load_builtin('ml-100k', prompt=False)

Trying to download dataset from https://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /root/.surprise_data/ml-100k


In [27]:
# Report the root mean squared error (RMSE) and the mean absolute error (MAE) with 5-fold CV.
cross_validate(NMF(), X, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm NMF on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9626  0.9627  0.9702  0.9572  0.9689  0.9643  0.0048  
MAE (testset)     0.7567  0.7590  0.7601  0.7504  0.7641  0.7580  0.0045  
Fit time          2.35    3.20    5.31    4.42    4.93    4.04    1.11    
Test time         0.12    1.14    0.29    0.74    0.10    0.48    0.40    


{'test_rmse': array([0.96262759, 0.96266913, 0.97022461, 0.95715881, 0.96894975]),
 'test_mae': array([0.7566633 , 0.75901532, 0.76010045, 0.7503938 , 0.76407088]),
 'fit_time': (2.346341848373413,
  3.196976661682129,
  5.311063766479492,
  4.4247729778289795,
  4.928091049194336),
 'test_time': (0.12276649475097656,
  1.1410958766937256,
  0.28762292861938477,
  0.7448732852935791,
  0.10474467277526855)}
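For reference, the two measures reported above can be written out in a few lines. This sketch uses the standard definitions of RMSE and MAE rather than scikit-surprise's own code, and the ratings in it are made-up toy values for illustration only.

```python
import math

def rmse(true_values, predictions):
    """Root mean squared error: square root of the mean squared difference."""
    squared = [(t - p) ** 2 for t, p in zip(true_values, predictions)]
    return math.sqrt(sum(squared) / len(squared))

def mae(true_values, predictions):
    """Mean absolute error: mean of the absolute differences."""
    absolute = [abs(t - p) for t, p in zip(true_values, predictions)]
    return sum(absolute) / len(absolute)

# Toy values (not taken from MovieLens).
true_ratings = [4, 3, 5, 1]
predicted_ratings = [3.5, 3.0, 4.0, 2.0]

print('RMSE:', rmse(true_ratings, predicted_ratings))
print('MAE: ', mae(true_ratings, predicted_ratings))
```

RMSE penalizes large errors more heavily than MAE, so on the same predictions RMSE is never smaller than MAE.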

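The default number of latent factors (n_factors) in scikit-surprise's NMF is 15. Lowering it to 5 shortens the mean fit time (from 4.04 to 1.98 in the runs shown here), but accuracy gets worse (the mean RMSE rises from 0.9643 to 1.0606).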

In [28]:
cross_validate(NMF(n_factors=5), X, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm NMF on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0498  1.0751  1.0643  1.0552  1.0586  1.0606  0.0086  
MAE (testset)     0.8618  0.8853  0.8741  0.8651  0.8714  0.8715  0.0081  
Fit time          1.85    1.78    1.84    2.54    1.87    1.98    0.29    
Test time         0.10    0.10    0.11    0.11    0.11    0.11    0.00    


{'test_rmse': array([1.04983641, 1.07506587, 1.06431584, 1.05516099, 1.05864262]),
 'test_mae': array([0.8617609 , 0.88525715, 0.87409731, 0.86510022, 0.87138518]),
 'fit_time': (1.8481297492980957,
  1.7820937633514404,
  1.837031364440918,
  2.5428943634033203,
  1.8678643703460693),
 'test_time': (0.10293149948120117,
  0.10376191139221191,
  0.1088571548461914,
  0.10780072212219238,
  0.10529041290283203)}