# Machine Learning Project
## CTC-17 Artificial Intelligence

The dataset consists of accident records from 12 different plants in three different countries. Each row is one
accident occurrence and has the following columns:


**Column descriptions**


Data: timestamp or time/date information


Countries: the country where the accident occurred (anonymized)


Local: the city where the manufacturing plant is located (anonymized)


Industry sector: which sector the plant belongs to (Mining, Metals, Others)


Accident level: from I to VI, it registers how severe the accident was (I means not severe ... VI means most severe)


Potential Accident Level: Depending on the Accident Level, the database also registers how severe the accident could have been (due to other factors involved in the accident)


Genre: whether the person is male or female


Employee or Third Party: whether the injured person is an employee or a third party


Critical Risk: some description of the risk involved in the accident

Using the provided dataset, build a decision-tree classifier that predicts the accident level (Accident Level) from the information available in the other columns. Set aside 80% of the rows for training and the rest for testing. Discuss which variables are or are not worth including
in the tree, and eliminate the variables you are certain do not contribute to the classification. Describe the processing
used to prepare the data for the algorithms. Use the ID3 algorithm or an improved version of it. Implement it without
frameworks that implement decision trees, although you may use a framework that provides data structures for
trees.
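
To make the task concrete, here is a minimal sketch of ID3 written with the standard library only. It is not the notebook's implementation: the names `entropy`, `information_gain`, `id3`, `predict`, and the toy rows are hypothetical, and it assumes every attribute is categorical and that rows are plain dicts (for example, what `data_treino.to_dict('records')` would return).

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy reduction obtained by splitting `rows` on `attribute`."""
    gain = entropy([r[target] for r in rows])
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    return gain

def id3(rows, attributes, target):
    """Build a tree as nested dicts; a leaf is a plain class label."""
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # Stop when the node is pure or there is nothing left to split on.
    if len(set(labels)) == 1 or not attributes:
        return majority
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    node = {"attribute": best, "default": majority, "children": {}}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        node["children"][value] = id3(subset, remaining, target)
    return node

def predict(tree, row):
    """Walk the tree; use the node's majority class for unseen attribute values."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attribute"]], tree["default"])
    return tree

# Toy rows (made up for illustration, not taken from the dataset).
toy_rows = [
    {"Sector": "Mining", "Risk": "Others", "Level": "I"},
    {"Sector": "Mining", "Risk": "Fall", "Level": "IV"},
    {"Sector": "Metals", "Risk": "Others", "Level": "I"},
    {"Sector": "Metals", "Risk": "Fall", "Level": "IV"},
    {"Sector": "Mining", "Risk": "Others", "Level": "I"},
]
toy_tree = id3(toy_rows, ["Sector", "Risk"], "Level")
```

Storing a `default` class in each node is one possible way to handle attribute values that appear in the test set but not in the training set. Other choices, such as falling back to the global majority class, would work as well.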

In [109]:
import pandas as pd
import numpy as np
import math
from sklearn.model_selection import train_test_split
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

In [2]:
accident_data = pd.read_csv('accident_data.csv')
accident_data.head()

| | Data | Countries | Local | Industry Sector | Accident Level | Potential Accident Level | Genre | Employee ou Terceiro | Risco Critico |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2016-01-01 00:00:00 | Country_01 | Local_01 | Mining | I | IV | Male | Third Party | Pressed |
| 1 | 2016-01-02 00:00:00 | Country_02 | Local_02 | Mining | I | IV | Male | Employee | Pressurized Systems |
| 2 | 2016-01-06 00:00:00 | Country_01 | Local_03 | Mining | I | III | Male | Third Party (Remote) | Manual Tools |
| 3 | 2016-01-08 00:00:00 | Country_01 | Local_04 | Mining | I | I | Male | Third Party | Others |
| 4 | 2016-01-10 00:00:00 | Country_01 | Local_04 | Mining | IV | IV | Male | Third Party | Others |


In [3]:
accident_data.describe()

| | Data | Countries | Local | Industry Sector | Accident Level | Potential Accident Level | Genre | Employee ou Terceiro | Risco Critico |
|---|---|---|---|---|---|---|---|---|---|
| count | 439 | 439 | 439 | 439 | 439 | 439 | 439 | 439 | 439 |
| unique | 287 | 3 | 12 | 3 | 5 | 6 | 2 | 3 | 34 |
| top | 2016-02-26 00:00:00 | Country_01 | Local_03 | Mining | I | IV | Male | Third Party | Others |
| freq | 13 | 263 | 90 | 241 | 328 | 155 | 417 | 189 | 232 |


In [7]:
accident_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 439 entries, 0 to 438
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Data                      439 non-null    object
 1   Countries                 439 non-null    object
 2   Local                     439 non-null    object
 3   Industry Sector           439 non-null    object
 4   Accident Level            439 non-null    object
 5   Potential Accident Level  439 non-null    object
 6   Genre                     439 non-null    object
 7   Employee ou Terceiro      439 non-null    object
 8   Risco Critico             439 non-null    object
dtypes: object(9)
memory usage: 31.0+ KB
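
The `Data` column is stored as a string and, according to `describe()`, has 287 distinct values in 439 rows, so splitting on the raw timestamp would mostly memorize individual days. One possible preprocessing step, shown here only as an assumption and not as something the notebook does, is to derive coarser categorical features from it. The helper name `date_features` and the choice of features are hypothetical.

```python
from datetime import datetime

def date_features(timestamp):
    """Derive coarse features from a 'YYYY-MM-DD HH:MM:SS' string."""
    moment = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "Year": moment.year,
        "Month": moment.month,
        "Weekday": moment.weekday(),  # Monday is 0, Sunday is 6
    }

# First timestamp shown in the head() output above.
example = date_features("2016-01-01 00:00:00")
```

With pandas, a comparable month column could be obtained with `pd.to_datetime(accident_data['Data']).dt.month`. Whether any date-derived feature actually helps the classifier would have to be checked through its information gain.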


#### Selecting useful variables:

In [116]:
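def entropia(data, variavel):
    """Shannon entropy (in bits) of column `variavel` of the DataFrame `data`."""
    tipos = data[variavel].unique()
    total_tipos = {}
    total = 0
    # Count how many rows hold each distinct value of the column.
    for tipo in tipos:
        total_tipos[tipo] = len(data[data[variavel] == tipo])
        total += total_tipos[tipo]

    # Accumulate -p * log2(p) over the values that actually occur.
    resultado = 0
    for tipo in tipos:
        if total_tipos[tipo] != 0:
            resultado += ((-total_tipos[tipo]/(total)) * (math.log(total_tipos[tipo]/(total), 2)))

    return resultado

def ganho(data, atributo, variavel):
    """Information gain on `variavel` obtained by splitting `data` on `atributo`."""
    total = len(data)
    resultado = entropia(data, variavel)
    tipos_atributo = data[atributo].unique()

    # Subtract the entropy of each partition, weighted by its share of the rows.
    for tipo in tipos_atributo:
        data_temp = data[data[atributo] == tipo]
        resultado -= (len(data_temp)/total)*entropia(data_temp, variavel)

    return resultado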
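
For reference, the two functions above follow the usual definitions of entropy and information gain. The notation is an assumption made for this note: $S$ is the set of rows, $p_c$ is the fraction of rows in $S$ whose target value is $c$, and $S_v$ is the subset of $S$ where attribute $A$ takes the value $v$.

$$
H(S) = -\sum_{c} p_c \log_2 p_c
$$

$$
\mathrm{Gain}(S, A) = H(S) - \sum_{v \in \mathrm{values}(A)} \frac{|S_v|}{|S|}\, H(S_v)
$$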
        

In [117]:
data_treino, data_teste = train_test_split(accident_data, test_size=0.2, random_state=42)

In [118]:
entropia(accident_data, 'Accident Level')

1.284127977901661

In [119]:
ganho(accident_data, 'Risco Critico', 'Accident Level')

0.24533813782859124
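
Plain information gain tends to favor attributes with many distinct values, such as `Data` (287 unique values) or `Risco Critico` (34). One common refinement, used by C4.5 as an improvement over ID3, is the gain ratio, which divides the gain by the entropy of the attribute itself. The sketch below is standard-library only and is not part of the notebook. The names `gain_ratio` and `toy_rows` are hypothetical, and the rows are assumed to be dicts (for example from `data_treino.to_dict('records')`).

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a list of hashable values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())

def gain_ratio(rows, attribute, target):
    """Information gain of `attribute` divided by its split information."""
    gain = entropy([r[target] for r in rows])
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        gain -= len(subset) / len(rows) * entropy(subset)
    # Split information is the entropy of the attribute's own value distribution.
    split_info = entropy([r[attribute] for r in rows])
    return gain / split_info if split_info > 0 else 0.0

# Toy rows (made up): "Id" is unique per row, "Risk" is a genuinely useful attribute.
toy_rows = [
    {"Id": "a", "Risk": "Fall", "Level": "IV"},
    {"Id": "b", "Risk": "Fall", "Level": "IV"},
    {"Id": "c", "Risk": "Others", "Level": "I"},
    {"Id": "d", "Risk": "Others", "Level": "I"},
]
ratio_id = gain_ratio(toy_rows, "Id", "Level")
ratio_risk = gain_ratio(toy_rows, "Risk", "Level")
```

In this toy case both attributes have the same raw gain, but the unique identifier is penalized by its larger split information, so `Risk` ranks higher.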

In [120]:
# Dropping the discarded variables and splitting into training and test sets:
drop = 'Potential Accident Level'

data_treino, data_teste = train_test_split(accident_data.drop([drop], axis=1), test_size=0.2, random_state=42)

In [121]:
data_treino

| | Data | Countries | Local | Industry Sector | Accident Level | Genre | Employee ou Terceiro | Risco Critico |
|---|---|---|---|---|---|---|---|---|
| 266 | 2016-11-13 00:00:00 | Country_01 | Local_06 | Metals | I | Male | Employee | Plates |
| 294 | 2016-12-23 00:00:00 | Country_01 | Local_03 | Mining | I | Male | Third Party | Others |
| 31 | 2016-02-19 00:00:00 | Country_01 | Local_03 | Mining | I | Male | Employee | Others |
| 84 | 2016-03-31 00:00:00 | Country_02 | Local_08 | Metals | I | Male | Third Party (Remote) | Pressed |
| 301 | 2017-01-05 00:00:00 | Country_01 | Local_03 | Mining | I | Male | Employee | Others |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 106 | 2016-04-17 00:00:00 | Country_02 | Local_05 | Metals | I | Male | Employee | Cut |
| 270 | 2016-11-23 00:00:00 | Country_01 | Local_04 | Mining | I | Male | Employee | Others |
| 348 | 2017-02-23 00:00:00 | Country_01 | Local_04 | Mining | IV | Male | Third Party | Vehicles and Mobile Equipment |
| 435 | 2017-07-04 00:00:00 | Country_01 | Local_03 | Mining | I | Female | Employee | Others |


In [122]:
data_teste

| | Data | Countries | Local | Industry Sector | Accident Level | Genre | Employee ou Terceiro | Risco Critico |
|---|---|---|---|---|---|---|---|---|
| 265 | 2016-11-11 00:00:00 | Country_01 | Local_06 | Metals | IV | Male | Employee | Others |
| 78 | 2016-03-22 00:00:00 | Country_01 | Local_03 | Mining | I | Male | Employee | Others |
| 347 | 2017-02-17 00:00:00 | Country_01 | Local_01 | Mining | I | Male | Third Party | Projection |
| 255 | 2016-10-20 00:00:00 | Country_02 | Local_05 | Metals | I | Male | Employee | Pressurized Systems |
| 327 | 2017-02-01 00:00:00 | Country_03 | Local_10 | Others | I | Male | Third Party | Projection/Choco |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 57 | 2016-03-03 00:00:00 | Country_01 | Local_03 | Mining | I | Male | Third Party | Others |
| 137 | 2016-05-24 00:00:00 | Country_03 | Local_10 | Others | IV | Male | Third Party | Fall |
| 24 | 2016-02-14 00:00:00 | Country_01 | Local_06 | Metals | I | Male | Third Party | Others |
| 17 | 2016-02-07 00:00:00 | Country_01 | Local_06 | Metals | I | Female | Third Party | Others |
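
Once a tree produces predictions for `data_teste`, its accuracy is easier to interpret next to a trivial baseline. In the full dataset, 328 of the 439 accidents are level I, so always predicting I is already right for roughly three quarters of the rows. The sketch below is a standard-library illustration of that comparison. The helper names `accuracy` and `majority_baseline` are hypothetical, and the label lists are assumed to be plain Python lists such as `data_teste['Accident Level'].tolist()`.

```python
from collections import Counter

def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction matches the true label."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return hits / len(true_labels)

def majority_baseline(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return accuracy(test_labels, [most_common] * len(test_labels))

# Toy labels (made up for illustration, not taken from the dataset).
toy_train = ["I", "I", "IV"]
toy_test = ["I", "IV", "I", "I"]
toy_baseline = majority_baseline(toy_train, toy_test)
```

A tree that does not beat this baseline on the test set is adding little beyond the class imbalance, which would be a useful point for the discussion of which variables to keep.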
