## Dataset Information

This dataset contains **9358 hourly responses** from an Air Quality Chemical Multisensor Device equipped with an array of **5 metal oxide sensors**. The device was located in a considerably polluted area, near roads, in an Italian city. Data were recorded from March 2004 to February 2005 (**one year**), which makes this the longest freely available recording from an air quality chemical sensor device. Verified hourly concentrations of CO, non-methane hydrocarbons, benzene, total nitrogen oxides (NOx), and nitrogen dioxide (NO2) were provided by a certified analyzer in the area. Evidence of **cross-sensitivities**, as well as **concept and sensor drift**, is present, as described in _De Vito et al., Sens. and Act. B, Vol. 129, 2, 2008_, and may eventually affect the sensors' concentration estimates. **Missing values are tagged with the value "-200"**.
This dataset may be used for research purposes only; commercial use is excluded.
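
Because missing readings are stored as the sentinel `-200` rather than as empty cells, pandas will not report them as null by default. One option is to tell `read_csv` about the sentinel at load time. The snippet below is a minimal sketch on a tiny made-up sample (the inline CSV text and the name `sample` are illustrative assumptions, not the real file):

```python
import io

import pandas as pd

# Tiny made-up sample in the same layout as the dataset (values are illustrative).
csv_text = """Date,Time,CO(GT),NMHC(GT)
10/03/04,18:00:00,2.6,150
10/03/04,19:00:00,-200,112
10/03/04,20:00:00,2.2,-200
"""

# na_values tells pandas to read the sentinel -200 as NaN while parsing.
sample = pd.read_csv(io.StringIO(csv_text), na_values=[-200])

# The sentinel readings now show up in the usual null counts.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

With the sentinel converted to `NaN`, methods such as `isna()`, `dropna()`, and `fillna()` treat those readings as missing without any extra filtering.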

### Attribute details
 - 0 Date (DD/MM/YYYY)
 - 1 Time (HH.MM.SS)
 - 2 Hourly averaged CO concentration in mg/m^3 (reference analyzer)
 - 3 PT08.S1 (tin oxide) hourly averaged sensor response (nominally CO targeted)
 - 4 Hourly averaged non-methane hydrocarbon (NMHC) concentration in microg/m^3 (reference analyzer)
 - 5 Hourly averaged benzene concentration in microg/m^3 (reference analyzer)
 - 6 PT08.S2 (titania) hourly averaged sensor response (nominally NMHC targeted)
 - 7 Hourly averaged NOx concentration in ppb (reference analyzer)
 - 8 PT08.S3 (tungsten oxide) hourly averaged sensor response (nominally NOx targeted)
 - 9 Hourly averaged NO2 concentration in microg/m^3 (reference analyzer)
 - 10 PT08.S4 (tungsten oxide) hourly averaged sensor response (nominally NO2 targeted)
 - 11 PT08.S5 (indium oxide) hourly averaged sensor response (nominally O3 targeted)
 - 12 Temperature in °C
 - 13 Relative humidity (%)
 - 14 Absolute humidity (AH)

# Importing the Data

Viewing the first rows of the DataFrame

In [2]:
# Importing the data analysis library
import pandas as pd
df = pd.read_csv("data05.csv")
df.head()

Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,classe
0,10/03/04,18:00:00,2.6,1360,150.0,11.9,1046.0,166.0,1056.0,113.0,1692.0,1268.0,13.6,48.9,0.7578,0
1,10/03/04,19:00:00,2.0,1292,112.0,9.4,955.0,103.0,1174.0,92.0,1559.0,972.0,13.3,47.7,0.7255,0
2,10/03/04,20:00:00,2.2,1402,88.0,9.0,939.0,131.0,1140.0,114.0,1555.0,1074.0,11.9,54.0,0.7502,0
3,10/03/04,21:00:00,2.2,1376,80.0,9.2,948.0,172.0,1092.0,122.0,1584.0,1203.0,11.0,60.0,0.7867,0
4,10/03/04,22:00:00,1.6,1272,51.0,6.5,836.0,131.0,1205.0,116.0,1490.0,1110.0,11.2,59.6,0.7888,0


# Data Preparation

Checking the number of rows and columns of the DataFrame:

In [3]:
df.shape

(9357, 16)

Inspecting the DataFrame and its columns:

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9357 entries, 0 to 9356
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Date           9357 non-null   object 
 1   Time           9357 non-null   object 
 2   CO(GT)         9357 non-null   float64
 3   PT08.S1(CO)    9357 non-null   int64  
 4   NMHC(GT)       9353 non-null   float64
 5   C6H6(GT)       9354 non-null   float64
 6   PT08.S2(NMHC)  9356 non-null   float64
 7   NOx(GT)        9355 non-null   float64
 8   PT08.S3(NOx)   9355 non-null   float64
 9   NO2(GT)        9354 non-null   float64
 10  PT08.S4(NO2)   9354 non-null   float64
 11  PT08.S5(O3)    9356 non-null   float64
 12  T              9356 non-null   float64
 13  RH             9354 non-null   float64
 14  AH             9352 non-null   float64
 15  classe         9357 non-null   int64  
dtypes: float64(12), int64(2), object(2)
memory usage: 1.1+ MB


Computing descriptive statistics for the data:

In [5]:
# include='all' makes describe cover every column, numeric or object
df.describe(include='all')

Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,classe
count,9357,9357,9357.0,9357.0,9353.0,9354.0,9356.0,9355.0,9355.0,9354.0,9354.0,9356.0,9356.0,9354.0,9352.0,9357.0
unique,391,24,,,,,,,,,,,,,,
top,21/09/04,18:00:00,,,,,,,,,,,,,,
freq,24,390,,,,,,,,,,,,,,
mean,,,-34.207524,1048.990061,-159.225382,1.863267,894.6351,168.60759,794.997755,58.138871,1391.466218,975.060389,9.777533,39.48111,-6.84166,0.54665
std,,,77.65717,329.83271,139.648174,41.386475,342.329871,257.452849,322.015253,126.959066,467.262604,456.961218,43.205868,51.223682,38.986694,0.497846
min,,,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,-200.0,0.0
25%,,,0.6,921.0,-200.0,4.0,711.0,50.0,637.0,53.0,1185.0,700.0,10.9,34.1,0.69225,0.0
50%,,,1.5,1053.0,-200.0,7.9,895.0,141.0,794.0,96.0,1446.0,942.0,17.2,48.55,0.9773,1.0
75%,,,2.6,1221.0,-200.0,13.6,1105.0,284.0,960.0,133.0,1662.0,1255.25,24.1,61.9,1.29665,1.0


# Handling missing data

Checking for null values:

In [6]:
# returns a boolean for each cell, True where the value is missing
df.isnull()

Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,classe
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9352,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
9353,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
9354,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
9355,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


Counting the null values in each column:

In [7]:
df.isnull().sum()

Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
NMHC(GT)         4
C6H6(GT)         3
PT08.S2(NMHC)    1
NOx(GT)          2
PT08.S3(NOx)     2
NO2(GT)          3
PT08.S4(NO2)     3
PT08.S5(O3)      1
T                1
RH               3
AH               5
classe           0
dtype: int64
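
The counts above only cover true nulls; readings tagged with the `-200` sentinel are not included. A minimal sketch of how both kinds of gaps could be counted side by side is shown here on a small made-up frame (`demo`, `SENTINEL`, and `summary` are illustrative names):

```python
import pandas as pd

# Made-up frame mixing true nulls (None) and the -200 sentinel.
demo = pd.DataFrame(
    {
        "CO(GT)": [2.6, -200.0, 2.2],
        "NMHC(GT)": [150.0, None, -200.0],
        "classe": [0, 1, 0],
    }
)

SENTINEL = -200

# True nulls, as reported by isnull()/isna().
nan_counts = demo.isna().sum()

# Sentinel readings: compare only numeric columns against -200.
sentinel_counts = (demo.select_dtypes("number") == SENTINEL).sum()

summary = pd.DataFrame({"nan": nan_counts, "sentinel": sentinel_counts})
print(summary)
```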

Using **dropna(how='all')** drops only the rows in which every value is null:

In [8]:
df.dropna(how='all', inplace=True)

In [9]:
df[df.isnull().any(axis=1)]

Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,classe
64,13/03/04,10:00:00,3.2,1447,,12.9,1081.0,250.0,869.0,126.0,1667.0,1465.0,12.4,,0.7335,0
65,13/03/04,11:00:00,4.1,1542,283.0,,1184.0,296.0,808.0,158.0,,1583.0,15.6,42.2,0.7451,0
67,13/03/04,13:00:00,2.8,1328,154.0,12.3,1059.0,153.0,987.0,124.0,1600.0,1101.0,19.4,31.3,,0
69,13/03/04,15:00:00,2.0,1240,108.0,9.2,947.0,119.0,1049.0,,1532.0,947.0,18.4,33.6,0.7042,0
71,13/03/04,17:00:00,2.3,1326,97.0,10.6,1000.0,148.0,976.0,125.0,1602.0,,16.7,37.8,0.7117,1
79,14/03/04,01:00:00,2.8,1484,,11.9,1045.0,174.0,880.0,119.0,1624.0,1530.0,14.6,51.5,0.8536,0
82,14/03/04,04:00:00,-200.0,1130,56.0,5.2,773.0,70.0,1130.0,82.0,1452.0,1051.0,12.1,,0.8603,1
84,14/03/04,06:00:00,1.0,1076,29.0,2.5,618.0,44.0,1395.0,63.0,,872.0,11.6,62.2,0.8473,1
85,14/03/04,07:00:00,0.9,1028,27.0,2.4,615.0,74.0,1384.0,,1340.0,853.0,10.4,67.6,0.853,1
88,14/03/04,10:00:00,2.2,1332,129.0,8.6,923.0,144.0,,98.0,1614.0,1225.0,14.5,53.1,0.8728,0


Using **dropna(subset=[])** lets you choose which columns are checked when removing rows with missing values:

In [10]:
df.dropna(subset = ['Date','Time','CO(GT)','PT08.S1(CO)','NMHC(GT)','C6H6(GT)','PT08.S2(NMHC)','NOx(GT)','PT08.S3(NOx)','NO2(GT)','PT08.S4(NO2)','PT08.S5(O3)','T','RH','AH','classe'], inplace=True)
df

Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,classe
0,10/03/04,18:00:00,2.6,1360,150.0,11.9,1046.0,166.0,1056.0,113.0,1692.0,1268.0,13.6,48.9,0.7578,0
1,10/03/04,19:00:00,2.0,1292,112.0,9.4,955.0,103.0,1174.0,92.0,1559.0,972.0,13.3,47.7,0.7255,0
2,10/03/04,20:00:00,2.2,1402,88.0,9.0,939.0,131.0,1140.0,114.0,1555.0,1074.0,11.9,54.0,0.7502,0
3,10/03/04,21:00:00,2.2,1376,80.0,9.2,948.0,172.0,1092.0,122.0,1584.0,1203.0,11.0,60.0,0.7867,0
4,10/03/04,22:00:00,1.6,1272,51.0,6.5,836.0,131.0,1205.0,116.0,1490.0,1110.0,11.2,59.6,0.7888,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9352,04/04/05,10:00:00,3.1,1314,-200.0,13.5,1101.0,472.0,539.0,190.0,1374.0,1729.0,21.9,29.3,0.7568,0
9353,04/04/05,11:00:00,2.4,1163,-200.0,11.4,1027.0,353.0,604.0,179.0,1264.0,1269.0,24.3,23.7,0.7119,0
9354,04/04/05,12:00:00,2.4,1142,-200.0,12.4,1063.0,293.0,603.0,175.0,1241.0,1092.0,26.9,18.3,0.6406,0
9355,04/04/05,13:00:00,2.1,1003,-200.0,9.5,961.0,235.0,702.0,156.0,1041.0,770.0,28.3,13.5,0.5139,1
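
The difference between the `dropna` variants used above can be seen on a three-row toy frame. This is only an illustrative sketch; `demo` and the result names are assumptions made for the example:

```python
import numpy as np
import pandas as pd

# Row 0 is complete, row 1 has one null, row 2 is entirely null.
demo = pd.DataFrame({"a": [1.0, np.nan, np.nan], "b": [2.0, 5.0, np.nan]})

# how="all": drop a row only when every value in it is null (drops row 2).
only_empty_rows_removed = demo.dropna(how="all")

# Default how="any": drop a row when at least one value is null (drops rows 1 and 2).
any_null_removed = demo.dropna()

# subset: only the listed columns are checked (drops row 2, keeps row 1).
checked_on_b_only = demo.dropna(subset=["b"])

print(len(only_empty_rows_removed), len(any_null_removed), len(checked_on_b_only))
```

Each call returns a new DataFrame and leaves `demo` untouched unless `inplace=True` is passed as an argument to `dropna` itself.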


In [11]:
df[df.isnull().any(axis=1)]

Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,classe


We use **df.isna().sum()** to check whether any missing data remain in the DataFrame:

In [12]:
df.isna().sum()

Date             0
Time             0
CO(GT)           0
PT08.S1(CO)      0
NMHC(GT)         0
C6H6(GT)         0
PT08.S2(NMHC)    0
NOx(GT)          0
PT08.S3(NOx)     0
NO2(GT)          0
PT08.S4(NO2)     0
PT08.S5(O3)      0
T                0
RH               0
AH               0
classe           0
dtype: int64

Resetting the DataFrame index back to the default:

In [13]:
df = df.reset_index(drop=True)
df

Unnamed: 0,Date,Time,CO(GT),PT08.S1(CO),NMHC(GT),C6H6(GT),PT08.S2(NMHC),NOx(GT),PT08.S3(NOx),NO2(GT),PT08.S4(NO2),PT08.S5(O3),T,RH,AH,classe
0,10/03/04,18:00:00,2.6,1360,150.0,11.9,1046.0,166.0,1056.0,113.0,1692.0,1268.0,13.6,48.9,0.7578,0
1,10/03/04,19:00:00,2.0,1292,112.0,9.4,955.0,103.0,1174.0,92.0,1559.0,972.0,13.3,47.7,0.7255,0
2,10/03/04,20:00:00,2.2,1402,88.0,9.0,939.0,131.0,1140.0,114.0,1555.0,1074.0,11.9,54.0,0.7502,0
3,10/03/04,21:00:00,2.2,1376,80.0,9.2,948.0,172.0,1092.0,122.0,1584.0,1203.0,11.0,60.0,0.7867,0
4,10/03/04,22:00:00,1.6,1272,51.0,6.5,836.0,131.0,1205.0,116.0,1490.0,1110.0,11.2,59.6,0.7888,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9330,04/04/05,10:00:00,3.1,1314,-200.0,13.5,1101.0,472.0,539.0,190.0,1374.0,1729.0,21.9,29.3,0.7568,0
9331,04/04/05,11:00:00,2.4,1163,-200.0,11.4,1027.0,353.0,604.0,179.0,1264.0,1269.0,24.3,23.7,0.7119,0
9332,04/04/05,12:00:00,2.4,1142,-200.0,12.4,1063.0,293.0,603.0,175.0,1241.0,1092.0,26.9,18.3,0.6406,0
9333,04/04/05,13:00:00,2.1,1003,-200.0,9.5,961.0,235.0,702.0,156.0,1041.0,770.0,28.3,13.5,0.5139,1


# Replacing values

Renaming the columns before splitting the DataFrame for the value replacement. Note that the column named `Cobalto` holds the `CO(GT)` readings, that is, carbon monoxide rather than cobalt.

In [14]:
df.columns = ['Data','Hora','Cobalto', 'OxidoEstanho','HidrocarbonetosNãoMetanicos', 'Benzeno','Titania','ConcentraçãoNOx','OxidoTungstênioNOx','ConcentraçãoNO2','OxidoTungstênioNO2','OxidoÍndioO3','Temperatura-C°','UmidadeRelativa (%)','UmidadeAbsoluta','Classe']
df

Unnamed: 0,Data,Hora,Cobalto,OxidoEstanho,HidrocarbonetosNãoMetanicos,Benzeno,Titania,ConcentraçãoNOx,OxidoTungstênioNOx,ConcentraçãoNO2,OxidoTungstênioNO2,OxidoÍndioO3,Temperatura-C°,UmidadeRelativa (%),UmidadeAbsoluta,Classe
0,10/03/04,18:00:00,2.6,1360,150.0,11.9,1046.0,166.0,1056.0,113.0,1692.0,1268.0,13.6,48.9,0.7578,0
1,10/03/04,19:00:00,2.0,1292,112.0,9.4,955.0,103.0,1174.0,92.0,1559.0,972.0,13.3,47.7,0.7255,0
2,10/03/04,20:00:00,2.2,1402,88.0,9.0,939.0,131.0,1140.0,114.0,1555.0,1074.0,11.9,54.0,0.7502,0
3,10/03/04,21:00:00,2.2,1376,80.0,9.2,948.0,172.0,1092.0,122.0,1584.0,1203.0,11.0,60.0,0.7867,0
4,10/03/04,22:00:00,1.6,1272,51.0,6.5,836.0,131.0,1205.0,116.0,1490.0,1110.0,11.2,59.6,0.7888,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9330,04/04/05,10:00:00,3.1,1314,-200.0,13.5,1101.0,472.0,539.0,190.0,1374.0,1729.0,21.9,29.3,0.7568,0
9331,04/04/05,11:00:00,2.4,1163,-200.0,11.4,1027.0,353.0,604.0,179.0,1264.0,1269.0,24.3,23.7,0.7119,0
9332,04/04/05,12:00:00,2.4,1142,-200.0,12.4,1063.0,293.0,603.0,175.0,1241.0,1092.0,26.9,18.3,0.6406,0
9333,04/04/05,13:00:00,2.1,1003,-200.0,9.5,961.0,235.0,702.0,156.0,1041.0,770.0,28.3,13.5,0.5139,1
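
Assigning a full list to `df.columns` depends on the list being in exactly the same order as the existing columns. A less order-sensitive alternative is `rename` with an explicit mapping; the sketch below uses a three-column toy frame, and `name_map` is an illustrative name:

```python
import pandas as pd

# Toy frame with a few of the original column names.
demo = pd.DataFrame({"Date": ["10/03/04"], "CO(GT)": [2.6], "T": [13.6]})

# Each old name is paired with its new name, so column order no longer matters.
name_map = {"Date": "Data", "CO(GT)": "Cobalto", "T": "Temperatura-C°"}

# rename returns a new DataFrame; columns missing from the mapping keep their names.
renamed = demo.rename(columns=name_map)
print(list(renamed.columns))
```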


In [15]:
df_2=df
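
Plain assignment does not copy a DataFrame: both names point to the same object, so an in-place change made through one is visible through the other. When an independent working copy is wanted, `copy()` is the usual tool. A small sketch with made-up data (`original`, `alias`, and `independent` are illustrative names):

```python
import pandas as pd

original = pd.DataFrame({"x": [1, -200, 3]})

alias = original               # same object under a second name
independent = original.copy()  # separate object holding its own data

# A change made through the alias also shows up in the original...
alias.loc[1, "x"] = 0

# ...while a change made to the copy does not.
independent.loc[2, "x"] = 99

print(original["x"].tolist())
print(independent["x"].tolist())
```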

Here we use drop to remove the columns that will not be modified:

In [16]:
df_2 = df.drop(columns=['Data','Hora','Temperatura-C°','UmidadeRelativa (%)','UmidadeAbsoluta'])
df_2.head()

Unnamed: 0,Cobalto,OxidoEstanho,HidrocarbonetosNãoMetanicos,Benzeno,Titania,ConcentraçãoNOx,OxidoTungstênioNOx,ConcentraçãoNO2,OxidoTungstênioNO2,OxidoÍndioO3,Classe
0,2.6,1360,150.0,11.9,1046.0,166.0,1056.0,113.0,1692.0,1268.0,0
1,2.0,1292,112.0,9.4,955.0,103.0,1174.0,92.0,1559.0,972.0,0
2,2.2,1402,88.0,9.0,939.0,131.0,1140.0,114.0,1555.0,1074.0,0
3,2.2,1376,80.0,9.2,948.0,172.0,1092.0,122.0,1584.0,1203.0,0
4,1.6,1272,51.0,6.5,836.0,131.0,1205.0,116.0,1490.0,1110.0,0


Here we replace the -200 values, which the dataset description identifies as missing, with 0 using replace:

In [17]:
df_2.replace([-200], 0, inplace=True)
df_2

Unnamed: 0,Cobalto,OxidoEstanho,HidrocarbonetosNãoMetanicos,Benzeno,Titania,ConcentraçãoNOx,OxidoTungstênioNOx,ConcentraçãoNO2,OxidoTungstênioNO2,OxidoÍndioO3,Classe
0,2.6,1360,150.0,11.9,1046.0,166.0,1056.0,113.0,1692.0,1268.0,0
1,2.0,1292,112.0,9.4,955.0,103.0,1174.0,92.0,1559.0,972.0,0
2,2.2,1402,88.0,9.0,939.0,131.0,1140.0,114.0,1555.0,1074.0,0
3,2.2,1376,80.0,9.2,948.0,172.0,1092.0,122.0,1584.0,1203.0,0
4,1.6,1272,51.0,6.5,836.0,131.0,1205.0,116.0,1490.0,1110.0,0
...,...,...,...,...,...,...,...,...,...,...,...
9330,3.1,1314,0.0,13.5,1101.0,472.0,539.0,190.0,1374.0,1729.0,0
9331,2.4,1163,0.0,11.4,1027.0,353.0,604.0,179.0,1264.0,1269.0,0
9332,2.4,1142,0.0,12.4,1063.0,293.0,603.0,175.0,1241.0,1092.0,0
9333,2.1,1003,0.0,9.5,961.0,235.0,702.0,156.0,1041.0,770.0,1


# Test

In [18]:
df_2.shape

(9335, 11)

In [19]:
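# Note: 'Classe' is also the target y; keeping it in the feature list leaks the label into x,
# so the scores reported below are likely optimistic.
feature_columns = ['Cobalto', 'OxidoEstanho','HidrocarbonetosNãoMetanicos','Benzeno','Titania','ConcentraçãoNOx','OxidoTungstênioNOx','ConcentraçãoNO2','OxidoTungstênioNO2','OxidoÍndioO3','Classe']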
x = df_2[feature_columns].values
y = df_2['Classe'].values
y

array([0, 0, 0, ..., 0, 1, 1], dtype=int64)
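
The feature list in the cell above includes the target column `Classe`. A minimal sketch of building the feature matrix without the target is shown below on a small made-up frame; `demo`, `target`, `X`, and the values are illustrative assumptions, and no scores are implied for the real data:

```python
import pandas as pd

# Made-up rows with two sensor columns and the target column.
demo = pd.DataFrame(
    {
        "Benzeno": [11.9, 9.4, 9.0, 9.2],
        "Titania": [1046.0, 955.0, 939.0, 948.0],
        "Classe": [0, 0, 1, 1],
    }
)

target = "Classe"

# Every column except the target becomes a feature.
feature_columns = [column for column in demo.columns if column != target]

X = demo[feature_columns].to_numpy()
y = demo[target].to_numpy()
print(feature_columns, X.shape, y.shape)
```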

In [20]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
y

array([0, 0, 0, ..., 0, 1, 1], dtype=int64)

Splitting the data into x_train, x_test, y_train, and y_test:

In [21]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2, random_state=0)
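
The split above is purely random. If keeping the class proportions equal in both parts matters, `train_test_split` accepts a `stratify` argument. This is an optional variation rather than something the notebook does, and the labels below are synthetic:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (70 zeros, 30 ones) standing in for the real target.
y = np.array([0] * 70 + [1] * 30)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the 70/30 class ratio in both the training and the test part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print("share of class 1 in train:", y_train.mean())
print("share of class 1 in test:", y_test.mean())
```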

Checking the number of rows and columns of the training and test sets:

In [22]:
print(x_train.shape)
print(x_test.shape)

(7468, 11)
(1867, 11)


In [23]:
print(y_train.shape)
print(y_test.shape)

(7468,)
(1867,)


Importing `KNN` and testing with 3 neighbors

In [24]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=3)
model.fit(x_train, y_train)

KNeighborsClassifier(n_neighbors=3)
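
KNN classifies by distance, so columns with large numeric ranges (sensor responses in the thousands) can outweigh columns with small ranges (concentrations of a few units). A common remedy, shown here only as a sketch on synthetic data and not as part of the original experiment, is to standardize the features in a pipeline before the classifier:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: one small-scale feature and one large-scale feature.
X = np.column_stack(
    [rng.normal(2.0, 1.0, 200), rng.normal(1000.0, 300.0, 200)]
)
# The made-up label depends only on the small-scale feature.
y = (X[:, 0] > 2.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The scaler is fitted on the training data only, then applied before KNN.
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
scaled_knn.fit(X_train, y_train)
y_pred = scaled_knn.predict(X_test)
print(y_pred[:10])
```

Wrapping both steps in one pipeline means the test set is scaled with statistics learned from the training set, which avoids leaking test information into the scaler.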

In [25]:
y_pred = model.predict(x_test)
y_pred

array([1, 1, 1, ..., 1, 0, 1], dtype=int64)

In [26]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)*100
print('Accuracy: ', str(round(accuracy,2)) + "%")

Accuracy:  98.29%


In [27]:
y_pred = model.predict(x_test)

In [28]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred) 
cm

array([[859,  19],
       [ 13, 976]], dtype=int64)
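
The scores reported in this section can be recomputed by hand from the confusion matrix above, which is a useful sanity check. In scikit-learn's layout the rows are the true classes and the columns are the predicted classes, so `ravel()` yields TN, FP, FN, TP for a binary problem:

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[859, 19], [13, 976]])

# Rows are true labels, columns are predicted labels.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of everything predicted as class 1, how much was right
recall = tp / (tp + fn)      # of all true class 1 samples, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2%} precision={precision:.2%} "
      f"recall={recall:.2%} f1={f1:.2%}")
```

Formatting with `:.2%` keeps two decimal places, whereas rounding to two decimals before multiplying by 100 reduces each score to a whole percentage.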

In [29]:
print(pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True))

Predicted    0    1   All
Actual                   
0          859   19   878
1           13  976   989
All        872  995  1867


In [39]:
from sklearn.metrics import precision_score
precision = precision_score(y_test, y_pred)
print("Precision: " + str(round(precision,2)*100) + "%")

Precision: 98.0%


In [40]:
from sklearn.metrics import recall_score
recall = recall_score(y_test, y_pred)
print("Recall: " + str(round(recall,2)*100)+ "%")


Recall: 99.0%


In [41]:
from sklearn.metrics import f1_score
f1 = f1_score(y_test, y_pred)
print("f1: " + str(round(f1,2)*100)+ "%")

f1: 98.0%


# New Treatment

Replacing the **-200** values with the column **median**

In [33]:
df_3 = df

In [34]:
df_3.shape

(9335, 16)

In [35]:
df_3 = df_3.drop(columns=['Data','Hora','Temperatura-C°','UmidadeRelativa (%)','UmidadeAbsoluta'])
df_3.head()

Unnamed: 0,Cobalto,OxidoEstanho,HidrocarbonetosNãoMetanicos,Benzeno,Titania,ConcentraçãoNOx,OxidoTungstênioNOx,ConcentraçãoNO2,OxidoTungstênioNO2,OxidoÍndioO3,Classe
0,2.6,1360,150.0,11.9,1046.0,166.0,1056.0,113.0,1692.0,1268.0,0
1,2.0,1292,112.0,9.4,955.0,103.0,1174.0,92.0,1559.0,972.0,0
2,2.2,1402,88.0,9.0,939.0,131.0,1140.0,114.0,1555.0,1074.0,0
3,2.2,1376,80.0,9.2,948.0,172.0,1092.0,122.0,1584.0,1203.0,0
4,1.6,1272,51.0,6.5,836.0,131.0,1205.0,116.0,1490.0,1110.0,0


In [36]:
df_3Cobalto=df_3[df_3['Cobalto']>-200.0]
df_3Cobalto

Unnamed: 0,Cobalto,OxidoEstanho,HidrocarbonetosNãoMetanicos,Benzeno,Titania,ConcentraçãoNOx,OxidoTungstênioNOx,ConcentraçãoNO2,OxidoTungstênioNO2,OxidoÍndioO3,Classe
0,2.6,1360,150.0,11.9,1046.0,166.0,1056.0,113.0,1692.0,1268.0,0
1,2.0,1292,112.0,9.4,955.0,103.0,1174.0,92.0,1559.0,972.0,0
2,2.2,1402,88.0,9.0,939.0,131.0,1140.0,114.0,1555.0,1074.0,0
3,2.2,1376,80.0,9.2,948.0,172.0,1092.0,122.0,1584.0,1203.0,0
4,1.6,1272,51.0,6.5,836.0,131.0,1205.0,116.0,1490.0,1110.0,0
...,...,...,...,...,...,...,...,...,...,...,...
9330,3.1,1314,-200.0,13.5,1101.0,472.0,539.0,190.0,1374.0,1729.0,0
9331,2.4,1163,-200.0,11.4,1027.0,353.0,604.0,179.0,1264.0,1269.0,0
9332,2.4,1142,-200.0,12.4,1063.0,293.0,603.0,175.0,1241.0,1092.0,0
9333,2.1,1003,-200.0,9.5,961.0,235.0,702.0,156.0,1041.0,770.0,1


In [46]:
df_3Cobalto['Cobalto'].median()

1.8

In [47]:
df_3['Cobalto']= df_3['Cobalto'].replace([-200],1.8)
df_3['Cobalto'].value_counts()

1.8    1866
1.0     304
1.4     279
1.6     274
1.5     272
       ... 
9.9       1
7.0       1
9.3       1
7.6       1
8.5       1
Name: Cobalto, Length: 96, dtype: int64

In [48]:
df_3OxidoEstanho=df_3[df_3['OxidoEstanho']>-200.0]
df_3OxidoEstanho

Unnamed: 0,Cobalto,OxidoEstanho,HidrocarbonetosNãoMetanicos,Benzeno,Titania,ConcentraçãoNOx,OxidoTungstênioNOx,ConcentraçãoNO2,OxidoTungstênioNO2,OxidoÍndioO3,Classe
0,2.6,1360,150.0,11.9,1046.0,166.0,1056.0,113.0,1692.0,1268.0,0
1,2.0,1292,112.0,9.4,955.0,103.0,1174.0,92.0,1559.0,972.0,0
2,2.2,1402,88.0,9.0,939.0,131.0,1140.0,114.0,1555.0,1074.0,0
3,2.2,1376,80.0,9.2,948.0,172.0,1092.0,122.0,1584.0,1203.0,0
4,1.6,1272,51.0,6.5,836.0,131.0,1205.0,116.0,1490.0,1110.0,0
...,...,...,...,...,...,...,...,...,...,...,...
9330,3.1,1314,-200.0,13.5,1101.0,472.0,539.0,190.0,1374.0,1729.0,0
9331,2.4,1163,-200.0,11.4,1027.0,353.0,604.0,179.0,1264.0,1269.0,0
9332,2.4,1142,-200.0,12.4,1063.0,293.0,603.0,175.0,1241.0,1092.0,0
9333,2.1,1003,-200.0,9.5,961.0,235.0,702.0,156.0,1041.0,770.0,1


In [49]:
df_3OxidoEstanho['OxidoEstanho'].median()

1063.0

In [50]:
# Note: the median printed by the previous cell is 1063.0, but the value hardcoded here is 1140.5.
df_3['OxidoEstanho']= df_3['OxidoEstanho'].replace([-200],1140.5)
df_3['OxidoEstanho'].value_counts()

1140.5    366
973.0      30
1100.0     28
969.0      26
925.0      26
         ... 
1696.0      1
1576.0      1
728.0       1
1715.0      1
1519.0      1
Name: OxidoEstanho, Length: 1040, dtype: int64

In [51]:
df_3HidrocarbonetosNãoMetanicos=df_3[df_3['HidrocarbonetosNãoMetanicos']>-200.0]
df_3HidrocarbonetosNãoMetanicos

Unnamed: 0,Cobalto,OxidoEstanho,HidrocarbonetosNãoMetanicos,Benzeno,Titania,ConcentraçãoNOx,OxidoTungstênioNOx,ConcentraçãoNO2,OxidoTungstênioNO2,OxidoÍndioO3,Classe
0,2.6,1360.0,150.0,11.9,1046.0,166.0,1056.0,113.0,1692.0,1268.0,0
1,2.0,1292.0,112.0,9.4,955.0,103.0,1174.0,92.0,1559.0,972.0,0
2,2.2,1402.0,88.0,9.0,939.0,131.0,1140.0,114.0,1555.0,1074.0,0
3,2.2,1376.0,80.0,9.2,948.0,172.0,1092.0,122.0,1584.0,1203.0,0
4,1.6,1272.0,51.0,6.5,836.0,131.0,1205.0,116.0,1490.0,1110.0,0
...,...,...,...,...,...,...,...,...,...,...,...
1208,4.4,1449.0,501.0,19.5,1282.0,254.0,625.0,133.0,2100.0,1569.0,0
1209,3.1,1363.0,234.0,15.1,1152.0,189.0,684.0,110.0,1951.0,1495.0,0
1210,3.0,1371.0,212.0,14.6,1136.0,174.0,689.0,102.0,1927.0,1471.0,0
1211,3.1,1406.0,275.0,13.7,1107.0,167.0,718.0,108.0,1872.0,1384.0,0


In [52]:
df_3HidrocarbonetosNãoMetanicos['HidrocarbonetosNãoMetanicos'].median()

150.5

In [53]:
df_3['HidrocarbonetosNãoMetanicos']= df_3['HidrocarbonetosNãoMetanicos'].replace([-200],150.5)
df_3['HidrocarbonetosNãoMetanicos'].value_counts()

150.5     8437
66.0        14
40.0         9
88.0         8
93.0         8
          ... 
599.0        1
1084.0       1
776.0        1
247.0        1
434.0        1
Name: HidrocarbonetosNãoMetanicos, Length: 428, dtype: int64

In [54]:
df_3Benzeno=df_3[df_3['Benzeno']>-200.0]
df_3Benzeno

Unnamed: 0,Cobalto,OxidoEstanho,HidrocarbonetosNãoMetanicos,Benzeno,Titania,ConcentraçãoNOx,OxidoTungstênioNOx,ConcentraçãoNO2,OxidoTungstênioNO2,OxidoÍndioO3,Classe
0,2.6,1360.0,150.0,11.9,1046.0,166.0,1056.0,113.0,1692.0,1268.0,0
1,2.0,1292.0,112.0,9.4,955.0,103.0,1174.0,92.0,1559.0,972.0,0
2,2.2,1402.0,88.0,9.0,939.0,131.0,1140.0,114.0,1555.0,1074.0,0
3,2.2,1376.0,80.0,9.2,948.0,172.0,1092.0,122.0,1584.0,1203.0,0
4,1.6,1272.0,51.0,6.5,836.0,131.0,1205.0,116.0,1490.0,1110.0,0
...,...,...,...,...,...,...,...,...,...,...,...
9330,3.1,1314.0,150.5,13.5,1101.0,472.0,539.0,190.0,1374.0,1729.0,0
9331,2.4,1163.0,150.5,11.4,1027.0,353.0,604.0,179.0,1264.0,1269.0,0
9332,2.4,1142.0,150.5,12.4,1063.0,293.0,603.0,175.0,1241.0,1092.0,0
9333,2.1,1003.0,150.5,9.5,961.0,235.0,702.0,156.0,1041.0,770.0,1


In [55]:
df_3Benzeno['Benzeno'].median()

8.2

In [56]:
df_3['Benzeno']= df_3['Benzeno'].replace([-200],8.2)
df_3['Benzeno'].value_counts()

8.2     416
3.6      84
2.8      82
3.8      79
4.0      78
       ... 
43.7      1
39.2      1
38.8      1
36.5      1
35.5      1
Name: Benzeno, Length: 407, dtype: int64

In [57]:
df_3Titania=df_3[df_3['Titania']>-200.0]
df_3Titania['Titania'].median()

909.0

In [None]:
df_3['Titania']= df_3['Titania'].replace([-200],909.0)
df_3['Titania'].value_counts()

In [None]:
df_3ConcentraçãoNOx=df_3[df_3['ConcentraçãoNOx']>-200.0]
df_3ConcentraçãoNOx['ConcentraçãoNOx'].median()

In [None]:
df_3['ConcentraçãoNOx']= df_3['ConcentraçãoNOx'].replace([-200],180.0)
df_3['ConcentraçãoNOx'].value_counts()

In [None]:
df_3OxidoTungstênioNOx=df_3[df_3['OxidoTungstênioNOx']>-200.0]
df_3OxidoTungstênioNOx['OxidoTungstênioNOx'].median()

In [None]:
df_3['OxidoTungstênioNOx']= df_3['OxidoTungstênioNOx'].replace([-200],805.0)
df_3['OxidoTungstênioNOx'].value_counts()

In [None]:
df_3ConcentraçãoNO2=df_3[df_3['ConcentraçãoNO2']>-200.0]
df_3ConcentraçãoNO2['ConcentraçãoNO2'].median()

In [None]:
df_3['ConcentraçãoNO2']= df_3['ConcentraçãoNO2'].replace([-200],109.0)
df_3['ConcentraçãoNO2'].value_counts()

In [None]:
df_3OxidoTungstênioNO2=df_3[df_3['OxidoTungstênioNO2']>-200.0]
df_3OxidoTungstênioNO2['OxidoTungstênioNO2'].median()

In [None]:
df_3['OxidoTungstênioNO2']= df_3['OxidoTungstênioNO2'].replace([-200],1463.0)
df_3['OxidoTungstênioNO2'].value_counts()

In [None]:
df_3OxidoÍndioO3=df_3[df_3['OxidoÍndioO3']>-200.0]
df_3OxidoÍndioO3['OxidoÍndioO3'].median()

In [None]:
df_3['OxidoÍndioO3']= df_3['OxidoÍndioO3'].replace([-200],963.0)
df_3['OxidoÍndioO3'].value_counts()
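
The column-by-column steps above all follow the same pattern: filter out the `-200` rows, take the median of what remains, and substitute it for the sentinel. That pattern can be written as a loop so the median is never typed by hand. The sketch below runs on a small made-up frame (`demo`, `cleaned`, `value_columns`, and `SENTINEL` are illustrative names):

```python
import pandas as pd

# Made-up rows with one sentinel value in each sensor column.
demo = pd.DataFrame(
    {
        "Benzeno": [11.9, -200.0, 9.0, 9.2],
        "Titania": [1046.0, 955.0, -200.0, 939.0],
        "Classe": [0, 1, 0, 1],
    }
)

SENTINEL = -200
value_columns = ["Benzeno", "Titania"]  # columns to clean; the target is left alone

cleaned = demo.copy()
for column in value_columns:
    # Median of the valid readings only, so the sentinel does not distort it.
    valid = cleaned.loc[cleaned[column] != SENTINEL, column]
    cleaned[column] = cleaned[column].replace(SENTINEL, valid.median())

print(cleaned)
```

Computing the median inside the loop keeps the replacement value tied to the data, which avoids mismatches between a printed median and a hand-typed constant.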

# Testing

In [67]:
df_3.shape

(9335, 11)

In [68]:
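# Note: 'Classe' is also the target y; keeping it in the feature list leaks the label into x,
# so the scores reported below are likely optimistic.
feature_columns = ['Cobalto', 'OxidoEstanho','HidrocarbonetosNãoMetanicos','Benzeno','Titania','ConcentraçãoNOx','OxidoTungstênioNOx','ConcentraçãoNO2','OxidoTungstênioNO2','OxidoÍndioO3','Classe']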
x = df_3[feature_columns].values
y = df_3['Classe'].values
y

array([0, 0, 0, ..., 0, 1, 1], dtype=int64)

In [69]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
y

array([0, 0, 0, ..., 0, 1, 1], dtype=int64)

In [70]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size=0.2, random_state=0)

In [71]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(7468, 11)
(1867, 11)
(7468,)
(1867,)


In [72]:
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier (n_neighbors=3)
model.fit(x_train, y_train)

KNeighborsClassifier(n_neighbors=3)

In [73]:
y_pred = model.predict(x_test)
y_pred

array([1, 1, 1, ..., 1, 0, 1], dtype=int64)

In [74]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)*100
print('Accuracy: ', str(round(accuracy,2)) + "%")

Accuracy:  98.23%


In [75]:
y_pred = model.predict(x_test)

In [76]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred) 

In [77]:
print(pd.crosstab(y_test, y_pred, rownames=['Actual'], colnames=['Predicted'], margins=True))

Predicted    0    1   All
Actual                   
0          859   19   878
1           14  975   989
All        873  994  1867


In [81]:
from sklearn.metrics import precision_score
precision = precision_score(y_test, y_pred)
print("Precision: " + str(round(precision,2)*100) + "%")

Precision: 98.0%


In [82]:
from sklearn.metrics import recall_score
recall = recall_score(y_test, y_pred)
print("Recall: " + str(round(recall,2)*100) + "%")


Recall: 99.0%


In [83]:
from sklearn.metrics import f1_score
f1 = f1_score(y_test, y_pred)
print("f1: " + str(round(f1,2)*100) + "%")

f1: 98.0%


# New Treatment

In [None]:
df_4 = df

In [None]:
df_4.shape