## Features based on Association Rules 

In this notebook, we will try to understand machine failure using sensor data. The failures are recorded as codes (all codes are fake; the data was anonymised for this purpose). The codes can have different meanings (such as a full stop of the engine, warnings, or communication problems). Some codes lead to longer failures (10 hours), but most errors won't even stop the machine.

My first intuition was that a certain set of warnings or errors might precede a prolonged failure. If the company knew which error codes tend to precede a full stop, this could be tracked as a KPI (e.g. critical warnings per week) to better anticipate failures.

## ENTER the Association Rule Miner! 

Association Rule Mining, also known as Market Basket Analysis, is a technique used in marketing to find which products are frequently bought together. For a pair of items A and B, it calculates support (the share of all transactions that contain both A and B) and confidence (the share of transactions containing A that also contain B) to reveal patterns among items.
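To make these metrics concrete, here is a minimal sketch in plain Python. The transactions and the helper names (`support`, `confidence`, `lift`) are made up for illustration and are not part of mlxtend.

```python
# Toy transactions (hypothetical): each set holds the codes seen in one cycle.
transactions = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A"},
    {"B", "C"},
    {"C"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Share of transactions with the antecedent that also hold the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def lift(antecedent, consequent, transactions):
    """Confidence divided by the baseline support of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)

sup_ab = support({"A", "B"}, transactions)        # 2 of 5 transactions
conf_ab = confidence({"A"}, {"B"}, transactions)  # 2 of the 3 transactions with A
lift_ab = lift({"A"}, {"B"}, transactions)        # above 1: A and B co-occur more than by chance
```

A lift above 1 suggests that the two items appear together more often than they would if they were independent.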

With this approach, we will try to find which error codes seem to happen the week before a failure! 

In [1]:
#!pip install mlxtend

In [3]:
import pandas as pd 
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

In [4]:
import os
os.chdir('/Users/JohanLg/Documents/My Documents/ESCP/Kurser/Vår/Hackathon')

In [9]:
data = pd.read_csv('AllTurbData', encoding='iso-8859-1')



## Cleaning

The dataset has a "Countdown" feature that counts the number of instances until the next important failure of the turbine. As the data is divided into 10-minute cycles, we keep only the rows with a countdown below 1000 (1000 x 10 min ~= 1 week).

In [10]:
Apriori = data[data.Countdown < 1000]

In [11]:
New = Apriori.iloc[:,[52,54,55,56]]

In [12]:
New.dtypes

Code         float64
Comment       object
Countdown    float64
Group        float64
dtype: object

Each countdown to failure is also assigned a "Group". This value is arbitrary and is just a way for the association rule miner to recognise "transactions", which in this case are the errors in a 1-week cycle.
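The notebook does not show how the Group column was built. One plausible way, sketched here under the assumption that Countdown falls towards 0 and jumps back up after each failure, is to start a new group at every upward jump. The toy values are hypothetical.

```python
import pandas as pd

# Hypothetical countdown series: it falls towards 0 and jumps back up after a failure.
toy = pd.DataFrame({"Countdown": [3, 2, 1, 0, 2, 1, 0, 4, 3]})

# A new group starts whenever the countdown increases compared with the previous row.
new_cycle = toy["Countdown"].diff() > 0

# The running count of cycle starts gives an arbitrary but unique id per cycle.
toy["Group"] = new_cycle.cumsum()
```

The ids themselves carry no meaning; they only need to differ from one cycle to the next.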

In [15]:
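# Drop rows with missing values; .copy() avoids a SettingWithCopyWarning on later assignments
New1 = New.dropna().copy()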

In [21]:
# Anonymise codes
New1['Code'] = New1['Code'].astype('category').cat.codes



In [22]:
New1.groupby('Group')['Code'].count()

# We see that some cycles, e.g. 2240, had 4 errors in total

Group
22.0      1
23.0      1
24.0      1
26.0      1
42.0      3
         ..
2240.0    4
2242.0    1
2243.0    1
2248.0    6
2251.0    3
Name: Code, Length: 771, dtype: int64

## Mining

For the rule miner to work, we need "transactional" data. 

In [24]:
df = New1.groupby(['Group','Code']).size().reset_index(name='count')

# Count the number of occurrences of every code in every group,
# then reset the index so that Group and Code become regular columns

In [25]:
basket = (
    # Sum the counts per (Group, Code) pair
    df.groupby(['Group', 'Code'])['count'].sum()
    # Pivot the table with groups as rows and codes as columns
    .unstack()
    # Fill the gaps with 0, meaning that a code
    # did not happen during that period
    .reset_index().fillna(0)
    .set_index('Group')
)

In [26]:
basket.head()

Code,0,1,2,3,4,5,6,7,8,9,...,63,64,65,66,67,68,69,70,71,72
Group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
22.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
23.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
24.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
26.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
42.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [27]:
def encode_units(x):
    # One-hot encode a count: 1 if the code occurred at least once, else 0
    return 1 if x >= 1 else 0

In [28]:
basket_sets = basket.applymap(encode_units)
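The counting, pivoting and encoding steps above can also be condensed with `pd.crosstab`. The following is a minimal sketch on a made-up event log, not the notebook's data; the name `basket_bool` is chosen here only for illustration.

```python
import pandas as pd

# Hypothetical event log: one row per error occurrence.
events = pd.DataFrame({
    "Group": [1, 1, 1, 2, 2, 3],
    "Code":  [8, 8, 36, 6, 36, 8],
})

# Count occurrences per (Group, Code); combinations that never occur become 0.
counts = pd.crosstab(events["Group"], events["Code"])

# Convert counts to presence/absence flags, one row per transaction.
basket_bool = counts > 0
```

Each row of `basket_bool` is one "transaction" (a cycle), and each column says whether that code occurred at least once in it.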

## Rules

And below we get our rules! 

Confidence is high: whenever code 53 occurs during a cycle, code 52 occurs in the same cycle 100% of the time. But the support shows that this combination appears in only about 0.13% of the cycles (1 of 771). Not very useful!

Part of the problem is that some codes are labelled as warnings rather than stops. 52 might be a warning and not a full stop! Therefore, we will keep only rules whose CONSEQUENTS are labelled as STOP.

In [29]:
code_rules = apriori(basket_sets, min_support=0.001, use_colnames=True)
rules = association_rules(code_rules, metric="lift")
rules.sort_values('confidence', ascending = False, inplace = True)
rules.head(5)



| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 155 | (53) | (52) | 0.001297 | 0.075227 | 0.001297 | 1.0 | 13.293103 | 0.001199 | inf |
| 260 | (57) | (8, 52) | 0.001297 | 0.001297 | 0.001297 | 1.0 | 771.0 | 0.001295 | inf |
| 263 | (8, 54) | (66) | 0.001297 | 0.001297 | 0.001297 | 1.0 | 771.0 | 0.001295 | inf |
| 264 | (66, 54) | (8) | 0.001297 | 0.016861 | 0.001297 | 1.0 | 59.307692 | 0.001275 | inf |
| 266 | (66) | (8, 54) | 0.001297 | 0.001297 | 0.001297 | 1.0 | 771.0 | 0.001295 | inf |


In [30]:
Rules = pd.DataFrame(rules)

In [31]:
# Create a list of unique stop codes to use as filtering argument

Stop_Codes = Apriori[Apriori.Status == 'Stop']
Stop_Code_list = Stop_Codes.Code.unique()
Stop = pd.DataFrame(Stop_Code_list)

In [34]:
type(Rules.consequents[0])

frozenset

In [35]:
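# Convert each consequent from a frozenset to a plain list
rules['consequents'] = [list(c) for c in rules['consequents']]

# Flag the rules whose consequent consists only of stop codes
stop_codes = set(Stop_Code_list)
rules['S'] = [all(code in stop_codes for code in c)
              for c in rules['consequents']]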
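As an alternative sketch, the consequents can stay as frozensets, which makes the stop check a simple subset test. The rules and the stop-code set below are illustrative assumptions, not the output of the miner.

```python
import pandas as pd

# Illustrative rules; a real run would take these from association_rules().
toy_rules = pd.DataFrame({
    "antecedents": [frozenset({53}), frozenset({8}), frozenset({18})],
    "consequents": [frozenset({52}), frozenset({36}), frozenset({8, 62})],
})

# Assumed set of stop codes for this example.
stop_codes = frozenset({36, 62})

# A rule qualifies when every code in its consequent is a stop code.
toy_rules["S"] = toy_rules["consequents"].apply(lambda c: c <= stop_codes)
```

Here only the second rule is flagged: 52 is not in the assumed stop set, and the third consequent mixes a stop code (62) with a non-stop code (8).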

  


In [36]:
a= list(rules.consequents)
print(a[1:15])

[[8, 52], [66], [8], [8, 54], [8, 66], [15], [38], [43], [13], [21], [13], [21], [49], [62]]


In [37]:
a= [list(i) for i in a]
rules.consequents=a

In [38]:
#Top rules with STOPs as rhs
stop = rules[(rules['S']== True) & (rules['lift']>1.2)]
stop.sort_values('support', ascending=False).head(5)

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | S |
|---|---|---|---|---|---|---|---|---|---|---|
| 38 | (8) | [36] | 0.016861 | 0.11284 | 0.006485 | 0.384615 | 3.408488 | 0.004582 | 1.441634 | True |
| 90 | (18) | [62] | 0.003891 | 0.006485 | 0.002594 | 0.666667 | 102.8 | 0.002569 | 2.980545 | True |
| 31 | (6) | [36] | 0.003891 | 0.11284 | 0.002594 | 0.666667 | 5.908046 | 0.002155 | 2.661479 | True |
| 79 | (15) | [38] | 0.068742 | 0.009079 | 0.002594 | 0.037736 | 4.156334 | 0.00197 | 1.029781 | True |
| 149 | (61) | [49] | 0.003891 | 0.006485 | 0.001297 | 0.333333 | 51.4 | 0.001272 | 1.490272 | True |


Now that we have the correct codes as consequents, we see that support for our theory is quite low. The confidence in some combinations is pretty high, but we sadly cannot find universally useful rules to apply to the machine failures. Still, in a world where every machine failure might cost thousands of euros in damage or lost production, any model that can lead to lower failure rates can be useful. We see that the rules 18 -> 62 and 6 -> 36 each hold in exactly two thirds of the cycles where the antecedent occurs (confidence 0.667). Although these combinations are rare (support of about 0.26%), the chance that such a warning announces a stop is high, so acting on it could prevent a failure!
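To connect this back to the KPI idea from the introduction, here is a rough sketch of a "critical warnings per week" count. The event log, its column names and its timestamps are made up; only the antecedent codes 18 and 6 come from the rules above.

```python
import pandas as pd

# Hypothetical event log; the column names and timestamps are assumptions.
log = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2021-01-04 08:00", "2021-01-05 12:30", "2021-01-06 09:10",
        "2021-01-12 14:00", "2021-01-13 16:45",
    ]),
    "Code": [18, 6, 40, 18, 41],
})

# Antecedent codes of the two rules with confidence 2/3.
critical_codes = [18, 6]

# Keep only the critical warnings and count them per calendar week.
critical = log[log["Code"].isin(critical_codes)]
weekly_kpi = critical.set_index("Timestamp").resample("W").size()
```

A week in which `weekly_kpi` rises above its usual level could then be flagged for inspection.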