# Machine Learning Project
## CTC-17 Artificial Intelligence

The dataset consists of accident records from 12 different plants in three different countries. Each row is one
accident and has the following columns:


**Columns description**


Data: timestamp or time/date information


Countries: the country where the accident occurred (anonymized)


Local: the city where the manufacturing plant is located (anonymized)


Industry sector: which sector the plant belongs to (Mining, Metals, Others)


Accident level: from I to VI, it records how severe the accident was (I means not severe ... VI means most severe)


Potential Accident Level: Depending on the Accident Level, the database also registers how severe the accident could have been (due to other factors involved in the accident)


Genre: whether the person is male or female


Employee or Third Party: if the injured person is an employee or a third party


Critical Risk: some description of the risk involved in the accident

Using the provided dataset, build a decision-tree classifier that predicts the accident level (Accident Level) from the information in the other columns. Set aside 80% of the rows for training and the rest for testing. Discuss which variables are or are not worth including in the tree, and drop the variables that you are sure do not contribute to the classification. Describe the preprocessing used to prepare the data for the algorithms. Use the ID3 algorithm or an improved version of it, implemented without frameworks that provide decision trees; a framework that provides a tree data structure may be used.

In [1]:
import pandas as pd
import numpy as np
import math
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

In [2]:
accident_data = pd.read_csv('accident_data.csv')
accident_data.head()

Unnamed: 0,Data,Countries,Local,Industry Sector,Accident Level,Potential Accident Level,Genre,Employee ou Terceiro,Risco Critico
0,2016-01-01 00:00:00,Country_01,Local_01,Mining,I,IV,Male,Third Party,Pressed
1,2016-01-02 00:00:00,Country_02,Local_02,Mining,I,IV,Male,Employee,Pressurized Systems
2,2016-01-06 00:00:00,Country_01,Local_03,Mining,I,III,Male,Third Party (Remote),Manual Tools
3,2016-01-08 00:00:00,Country_01,Local_04,Mining,I,I,Male,Third Party,Others
4,2016-01-10 00:00:00,Country_01,Local_04,Mining,IV,IV,Male,Third Party,Others


In [3]:
accident_data.describe()

Unnamed: 0,Data,Countries,Local,Industry Sector,Accident Level,Potential Accident Level,Genre,Employee ou Terceiro,Risco Critico
count,439,439,439,439,439,439,439,439,439
unique,287,3,12,3,5,6,2,3,34
top,2016-02-26 00:00:00,Country_01,Local_03,Mining,I,IV,Male,Third Party,Others
freq,13,263,90,241,328,155,417,189,232


In [4]:
accident_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 439 entries, 0 to 438
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Data                      439 non-null    object
 1   Countries                 439 non-null    object
 2   Local                     439 non-null    object
 3   Industry Sector           439 non-null    object
 4   Accident Level            439 non-null    object
 5   Potential Accident Level  439 non-null    object
 6   Genre                     439 non-null    object
 7   Employee ou Terceiro      439 non-null    object
 8   Risco Critico             439 non-null    object
dtypes: object(9)
memory usage: 31.0+ KB


#### Selecting useful variables:

In [5]:
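def entropia(data, variavel):
    """Shannon entropy (in bits) of the column `variavel` in `data`."""
    # Count how many rows fall into each class of the column
    tipos = data[variavel].unique()
    total_tipos = {}
    total = 0
    for tipo in tipos:
        total_tipos[tipo] = len(data[data[variavel] == tipo])
        total += total_tipos[tipo]
    
    # Sum -p * log2(p) over the classes that actually occur
    resultado = 0
    for tipo in tipos:
        if total_tipos[tipo] != 0:
            resultado += ((-total_tipos[tipo]/(total)) * (math.log(total_tipos[tipo]/(total), 2))) 

    return resultado

def ganho(data, atributo, variavel):
    """Information gain of splitting `data` on `atributo` with respect to `variavel`."""
    total = len(data)
    resultado = entropia(data, variavel)
    tipos_atributo = data[atributo].unique()
    
    for tipo in tipos_atributo:
        # Subtract the entropy of each subset, weighted by its share of the rows
        data_temp = data[data[atributo] == tipo]
        resultado -= (len(data_temp)/total)*entropia(data_temp, variavel)
    
    return resultado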
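
In formula form, the two functions above correspond to the standard entropy and information gain used by ID3. The symbols are introduced here for illustration and do not appear in the notebook: $S$ is the set of rows, $S_c$ the rows of class $c$ of the target, $A$ the candidate attribute, and $S_v$ the rows where $A = v$.

$$H(S) = -\sum_{c} p_c \log_2 p_c, \qquad p_c = \frac{|S_c|}{|S|}$$

$$\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)$$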
        

In [6]:
data_treino, data_teste = train_test_split(accident_data, test_size=0.2, random_state=42)

In [7]:
entropia(accident_data, 'Accident Level')

1.284127977901661

In [8]:
ganho(accident_data, 'Risco Critico', 'Accident Level')

0.24533813782859124

In [9]:
# Dropping the discarded variables and splitting into train and test:
drop = ['Data', 'Employee ou Terceiro', 'Countries', 'Genre']

data_treino, data_teste = train_test_split(accident_data.drop(drop, axis=1), test_size=0.2, random_state=42)
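
One way to back up a selection like this is to rank the candidate attributes numerically. The sketch below is not part of the original notebook: the names `entropy`, `gain_and_ratio`, and the toy frame are hypothetical. It computes information gain together with the gain ratio used by C4.5, which divides the gain by the entropy of the attribute itself. Plain gain tends to favor attributes with many distinct values (such as a date or an ID), and the ratio penalizes them.

```python
import math
import pandas as pd

def entropy(series):
    """Shannon entropy (bits) of the values in a pandas Series."""
    probs = series.value_counts(normalize=True)
    return -sum(p * math.log2(p) for p in probs if p > 0)

def gain_and_ratio(df, attr, target):
    """Return (information gain, gain ratio) of splitting df on attr."""
    base = entropy(df[target])
    # Weighted entropy of the target inside each group of attr
    remainder = sum(len(g) / len(df) * entropy(g[target])
                    for _, g in df.groupby(attr))
    gain = base - remainder
    split_info = entropy(df[attr])  # entropy of the attribute itself
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio

# Toy data for illustration only (not the accident dataset).
toy = pd.DataFrame({
    'id': ['a', 'b', 'c', 'd'],                          # unique per row
    'sector': ['Mining', 'Mining', 'Metals', 'Metals'],
    'level': ['I', 'I', 'IV', 'IV'],
})
for attr in ['id', 'sector']:
    print(attr, gain_and_ratio(toy, attr, 'level'))
```

In the toy frame both columns separate the classes perfectly, so their gain is identical, but the unique `id` column gets a lower gain ratio. The same comparison could be run on `accident_data` for each column before deciding what to drop.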

In [10]:
data_treino

Unnamed: 0,Local,Industry Sector,Accident Level,Potential Accident Level,Risco Critico
266,Local_06,Metals,I,II,Plates
294,Local_03,Mining,I,III,Others
31,Local_03,Mining,I,II,Others
84,Local_08,Metals,I,III,Pressed
301,Local_03,Mining,I,III,Others
...,...,...,...,...,...
106,Local_05,Metals,I,IV,Cut
270,Local_04,Mining,I,II,Others
348,Local_04,Mining,IV,IV,Vehicles and Mobile Equipment
435,Local_03,Mining,I,II,Others


In [11]:
data_teste

Unnamed: 0,Local,Industry Sector,Accident Level,Potential Accident Level,Risco Critico
265,Local_06,Metals,IV,IV,Others
78,Local_03,Mining,I,II,Others
347,Local_01,Mining,I,IV,Projection
255,Local_05,Metals,I,III,Pressurized Systems
327,Local_10,Others,I,II,Projection/Choco
...,...,...,...,...,...
57,Local_03,Mining,I,III,Others
137,Local_10,Others,IV,IV,Fall
24,Local_06,Metals,I,II,Others
17,Local_06,Metals,I,II,Others


## 2.2 Decision-tree classifier

In [18]:
def best_gain(data):
    col_names = data.columns.to_list()
    col_names.remove('Accident Level')
    gains = []
    for attr in col_names:
        gains.append((attr, ganho(data, attr, 'Accident Level')))
    max_gain = max(gains,key=lambda item:item[1])
    return max_gain

# Each tree node is a list: [node_attrib, answer, value, children]
# node_attrib: the attribute tested at this node
# answer: the predicted class (set only on leaves)
# value: the value of the parent node's attribute that leads to this node
# children: list of child nodes (empty for a leaf)
# Note: a leaf tests no attribute, so its node_attrib keeps the parent's attribute name
def ID3(examples, attribs, target_attr, padrao, node_attrib, value):
    root = [node_attrib, '', value, []]
    if len(examples) == 0:
        # No examples: fall back to the default class passed down by the parent
        root[1] = padrao
        return root
    elif len(examples[target_attr].unique()) == 1:
        # All examples share one class: make a leaf
        root[1] = examples[target_attr].mode().apply(str)[0]
    elif len(attribs) == 0:
        # No attributes left: make a leaf with the majority class
        root[1] = examples[target_attr].mode().apply(str)[0]
    else:
        best = best_gain(examples)
        root[0] = best[0]  # best[0] = attribute name, best[1] = gain value
        m = examples[target_attr].mode().apply(str)[0]  # majority class
        attrib_types = examples[best[0]].unique()
        for vi in attrib_types:
            # examples_i <- {elements of examples with best = vi}
            examples_new = examples[examples[best[0]] == vi]
            # attributes - best
            attribs_temp = attribs.copy()
            attribs_temp.remove(best[0])
            # axis must be passed by keyword in recent pandas versions
            examples_new = examples_new.drop(best[0], axis=1)
            # subtree
            subtree = ID3(examples_new, attribs_temp, target_attr, m, best[0], vi)
            root[3].append(subtree)
    return root

def print_tree(examples, root, parent_name):
    if len(root[3]) == 0:
        print(parent_name, "-->", " ", root[1])
    else:
        for node in root[3]:
            parent_name = root[0]
            if len(node[3]) != 0:
                print(parent_name, "-->", node[0])
            print_tree(examples, node, parent_name)
            
def walk_decision_tree(data, decision_tree, row_index, padrao):
    # Leaf: return its stored answer
    if len(decision_tree[3]) == 0:
        return decision_tree[1]
    else:
        # Follow the child whose value matches this row's value for the tested attribute
        for node in decision_tree[3]:
            if node[2] == data[decision_tree[0]].iloc[row_index]:
                return walk_decision_tree(data, node, row_index, padrao)
    # Value not seen during training: fall back to the default class
    return padrao
            

def predict_decision(data, tree, padrao):
    predicted = []
    for i in range(len(data)):
        predicted.append(walk_decision_tree(data, tree, i, padrao))
        
    return predicted
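
To make the node layout described in the comments above concrete, here is a small hand-written tree in the same `[node_attrib, answer, value, children]` format. The tree contents and the helper `classify` are made up for illustration and are not output of `ID3`; `classify` is a simplified stand-in for `walk_decision_tree` that takes a plain dict instead of a DataFrame row index.

```python
# Hand-written toy tree (illustrative only, not produced by ID3 on the dataset).
toy_tree = ['Industry Sector', '', '', [
    # Leaf: keeps the parent's attribute name, stores the answer 'I'
    ['Industry Sector', 'I', 'Metals', []],
    # Internal node reached when Industry Sector == 'Mining'
    ['Potential Accident Level', '', 'Mining', [
        ['Potential Accident Level', 'I', 'II', []],
        ['Potential Accident Level', 'IV', 'IV', []],
    ]],
]]

def classify(row, node, default):
    """Follow the child whose value matches row[node attribute] down to a leaf."""
    if not node[3]:
        return node[1]
    for child in node[3]:
        if child[2] == row[node[0]]:
            return classify(row, child, default)
    return default  # attribute value never seen in the tree

row = {'Industry Sector': 'Mining', 'Potential Accident Level': 'IV'}
print(classify(row, toy_tree, 'I'))
```

A row whose `Industry Sector` is neither `Metals` nor `Mining` matches no child of the root, so the default class is returned, mirroring the `padrao` fallback above.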


In [15]:
col_names = data_treino.columns.to_list()
col_names.remove('Accident Level')
examples = data_treino.copy()
tree = ID3(examples, col_names, 'Accident Level', 'I', 'Accident Level', '')
print(tree) 

In [19]:
prediction = predict_decision(data_teste, tree, 'I')
id3predicted_real = []
for i in range(len(data_teste)):
    predicted = prediction[i]
    real = data_teste['Accident Level'].iloc[i]
    id3predicted_real.append((predicted, real))

print(id3predicted_real)

[('I', 'IV'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('III', 'II'), ('III', 'IV'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'IV'), ('I', 'I'), ('I', 'I'), ('I', 'II'), ('III', 'IV'), ('I', 'IV'), ('I', 'I'), ('I', 'V'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('III', 'III'), ('I', 'IV'), ('I', 'I'), ('I', 'II'), ('IV', 'II'), ('II', 'I'), ('III', 'I'), ('I', 'I'), ('I', 'II'), ('I', 'III'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'IV'), ('I', 'I'), ('IV', 'V'), ('I', 'I'), ('I', 'III'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'II'), ('I', 'I'), ('IV', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'II'), ('I', 'I'), ('I', 'IV'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'III'), ('I', 'II'), ('I', 'I'), ('I', 'II'), 

## 2.3 A priori classifier

In [23]:
# Calculates mode
mode = data_treino['Accident Level'].mode().apply(str)[0]
# Test predicted values
prioripredicted_real = []
for i in range(len(data_teste)):
    predicted = mode
    real = data_teste['Accident Level'].iloc[i]
    prioripredicted_real.append((predicted, real))
print(prioripredicted_real)

[('I', 'IV'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'II'), ('I', 'IV'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'IV'), ('I', 'I'), ('I', 'I'), ('I', 'II'), ('I', 'IV'), ('I', 'IV'), ('I', 'I'), ('I', 'V'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'III'), ('I', 'IV'), ('I', 'I'), ('I', 'II'), ('I', 'II'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'II'), ('I', 'III'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'IV'), ('I', 'I'), ('I', 'V'), ('I', 'I'), ('I', 'III'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'II'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'II'), ('I', 'I'), ('I', 'IV'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'I'), ('I', 'III'), ('I', 'II'), ('I', 'I'), ('I', 'II'), ('I', 'III'), 

## 2.4 Comparative analysis

### 2.4.1 Confusion matrix

In [54]:
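def get_confusion_matrix(pred_real):
    # Rows are real classes, columns are predicted classes; returns (matrix, diagonal sum).
    # Class labels are taken from the global training frame `examples`.
    attrib_types = list(examples['Accident Level'].unique())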
    attrib_types.sort()
    attrib_types.append('Total')
    attr_map = dict(zip(attrib_types, range(len(attrib_types))))
    matrix = pd.DataFrame(0, index=attrib_types, columns=attrib_types)
    diag = 0
    for pr in pred_real:
        pred_idx = attr_map[pr[0]]
        real_idx = attr_map[pr[1]]
        total_idx = attr_map['Total']
        matrix.iloc[real_idx, pred_idx] += 1
        matrix.iloc[real_idx, total_idx] += 1
        matrix.iloc[total_idx, pred_idx] += 1
        matrix.iloc[total_idx, total_idx] += 1
        if (pred_idx == real_idx):
            diag += 1
    return (matrix, diag)

mid3 = get_confusion_matrix(id3predicted_real)
mpriori = get_confusion_matrix(prioripredicted_real)
print("Decision tree confusion matrix")
print(mid3[0])
print("----------------------------------------")
print("A priori classifier confusion matrix")
print(mpriori[0])

Decision tree confusion matrix
        I  II  III  IV  V  Total
I      59   1    1   1  0     62
II      7   0    1   1  0      9
III     5   0    1   0  0      6
IV      6   0    2   1  0      9
V       1   0    0   1  0      2
Total  78   1    5   4  0     88
----------------------------------------
A priori classifier confusion matrix
        I  II  III  IV  V  Total
I      62   0    0   0  0     62
II      9   0    0   0  0      9
III     6   0    0   0  0      6
IV      9   0    0   0  0      9
V       2   0    0   0  0      2
Total  88   0    0   0  0     88
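
As an alternative sketch, a table with the same layout (real classes in rows, predicted classes in columns, plus `Total` margins) can be built with `pd.crosstab`. The function name `confusion_table` and the toy pairs are hypothetical. The `reindex` step is there because `crosstab` only lists classes that actually occur, while the matrices above also show classes that were never predicted.

```python
import pandas as pd

def confusion_table(pred_real, labels):
    """Rows are real classes, columns are predicted classes, plus 'Total' margins."""
    pred = pd.Series([p for p, _ in pred_real], name='predicted')
    real = pd.Series([r for _, r in pred_real], name='real')
    table = pd.crosstab(real, pred)
    # Keep classes that never occur so every matrix has the same shape
    table = table.reindex(index=labels, columns=labels, fill_value=0)
    table['Total'] = table.sum(axis=1)       # row totals (real counts)
    table.loc['Total'] = table.sum(axis=0)   # column totals (predicted counts)
    return table

# Toy (predicted, real) pairs for illustration only.
toy_pairs = [('I', 'I'), ('II', 'I'), ('II', 'II')]
print(confusion_table(toy_pairs, ['I', 'II', 'III']))
```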


### 2.4.2 Accuracy

In [59]:
# The diagonal sum was saved by the get_confusion_matrix() call
id3_rate = mid3[1] / mid3[0]['Total']['Total']
priori_rate = mpriori[1] / mpriori[0]['Total']['Total']
print("Decision tree accuracy")
print(id3_rate)
print("A priori classifier accuracy")
print(priori_rate)

Decision tree accuracy
0.6931818181818182
A priori classifier accuracy
0.7045454545454546


### 2.4.3 Mean squared error
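
A minimal sketch of how this metric could be computed from `(predicted, real)` pairs such as `id3predicted_real`. It assumes the accident levels are encoded as consecutive integers (I = 1 through VI = 6); the notebook does not define this encoding, and the names `LEVEL_TO_INT` and `mean_squared_error` are hypothetical.

```python
# Assumed ordinal encoding of the accident levels (not defined in the notebook).
LEVEL_TO_INT = {'I': 1, 'II': 2, 'III': 3, 'IV': 4, 'V': 5, 'VI': 6}

def mean_squared_error(pred_real):
    """Mean of (predicted - real)^2 over (predicted, real) label pairs."""
    if not pred_real:
        raise ValueError("pred_real must not be empty")
    squared = [(LEVEL_TO_INT[p] - LEVEL_TO_INT[r]) ** 2 for p, r in pred_real]
    return sum(squared) / len(squared)

# Toy pairs for illustration only (not the notebook's results).
toy_pairs = [('I', 'I'), ('I', 'IV'), ('III', 'II')]
print(mean_squared_error(toy_pairs))
```

Unlike accuracy, this measure penalizes a prediction of I for a real level IV more heavily than a prediction of III, so it reflects how far off the misclassifications are. In the notebook it would be called as `mean_squared_error(id3predicted_real)` and `mean_squared_error(prioripredicted_real)`.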

### 2.4.4 Kappa statistic
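
A minimal sketch of Cohen's kappa over `(predicted, real)` pairs, assuming the usual definition $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement (the accuracy) and $p_e$ is the agreement expected by chance from the marginal class frequencies. The function name `cohen_kappa` and the toy pairs are hypothetical.

```python
from collections import Counter

def cohen_kappa(pred_real):
    """Cohen's kappa for a list of (predicted, real) label pairs."""
    n = len(pred_real)
    if n == 0:
        raise ValueError("pred_real must not be empty")
    # Observed agreement: fraction of pairs where prediction equals reality
    observed = sum(1 for p, r in pred_real if p == r) / n
    pred_counts = Counter(p for p, _ in pred_real)
    real_counts = Counter(r for _, r in pred_real)
    # Chance agreement from the product of the marginal frequencies
    expected = sum(pred_counts[c] * real_counts[c] for c in real_counts) / (n * n)
    if expected == 1.0:
        return float('nan')  # undefined when only one class occurs on both sides
    return (observed - expected) / (1 - expected)

# Toy pairs for illustration only (not the notebook's results).
toy_pairs = [('I', 'I'), ('I', 'I'), ('II', 'II'), ('I', 'II')]
print(cohen_kappa(toy_pairs))
```

A classifier that always predicts the same class has an observed agreement equal to its chance agreement, so its kappa is 0 whenever the real labels contain more than one class. This makes kappa a useful complement to accuracy when one class dominates, as level I does here.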