<a href="https://colab.research.google.com/github/GuilhermeDeAssis123/DataScience/blob/master/Modelo_SVM_COVID19(Dados_P%C3%BAblicos).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>





# <font color = 'red'> Predicting Discharges and Deaths Related to COVID-19 with Mass Spectrometry Data </font>

> Brazil has been in a pandemic state for more than a year. Beyond the countless deaths and infections caused by the novel coronavirus and its variants, the pandemic has brought severe economic losses to the country.


<center>

  <img src = 'https://www2.ufjf.br/progepe/wp-content/uploads/sites/28/2020/03/Covid-19-1024x576.jpg'/>

</center>

> Brazil has already passed 11 million infections and more than 290 thousand deaths from the novel coronavirus. These numbers show how uncontrolled the spread of the disease has been.

> In this notebook I therefore build a simple model that predicts whether an infected patient is discharged or dies, based on the molecules present in the patient's body and on the intensity of each one (the amount present).


## <font color = 'red'> 1. Getting the Data: </font>

> The data used in this notebook comes from the following [link](https://zenodo.org/record/4329382#.YFYqcq9KhPZ). It was released by a UNICAMP project to share the results of a study that is also related to the coronavirus. Here we work with the file in the `Covid_Control_X_LowRisk` folder that is designated as the `Fitting` file for the model.

## <font color = 'red'> 2. Understanding the Data: </font>

> This section explains the data we are dealing with and what the *features* of this *dataset* actually mean. As the title says, we work with mass spectrometry data. For a deeper treatment of this kind of data, see this [link](https://www.news-medical.net/life-sciences/What-is-Mass-Spectrometry-(Portuguese).aspx). In short, mass spectrometry scans a substance with a spectrometer, which returns the molecules present in that substance and the amount of each one, revealing, for example, what a blood sample contains.

### <font color = 'red'> 2.1 Data Dictionary </font>
> This section briefly describes the columns and rows of our *dataset*:

* `Columns`: Each column, named `Covid19_XXXXX_XX`, corresponds to one spectrum of one patient. The first group of digits is the patient number, and the second identifies which of that patient's spectra it is.

* `Index_0(# individual)`: This row identifies the patient, like a name.

* `Index_1(# indiv number)`: This row holds the patient's identification number.

* `Index_2(# obsv number)`: The identification number of the collected spectrum. For example, 3 means this was the third spectrum acquired for that patient.

* `Index_3(# categ number)`: The group the patient belongs to:
  * Groups 1, 2, 3, 4, 5, 6 and 7 --> control groups.
  * Groups 8 and 12 --> groups with a mild form of the disease.
  * Groups 9, 10, 11, 13 and 14 --> groups with a severe form of the disease.

* `Index_4(# class)`: This is our target. It indicates whether the patient was discharged or died: 0 means the patient was discharged, 1 means the patient died.

* `Index_5 onward`: From here on the indices have the form `XXX.XXX` (for example `200.013`). Each number identifies a different molecule.
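
> The naming scheme described in the data dictionary can be unpacked programmatically. The sketch below is only an illustration: it assumes every spectrum column follows the `Covid19_XXXXX_XX` pattern, and the helper `parse_spectrum_column` is a hypothetical name that does not appear in the original notebook.

```python
import re

# Pattern assumed from the data dictionary: "Covid19_<patient>_<spectrum>".
COLUMN_PATTERN = re.compile(r"^Covid19_(\d+)_(\d+)$")


def parse_spectrum_column(name):
    """Split a column name into (patient_id, spectrum_number).

    Returns None when the name does not follow the assumed pattern,
    for example the descriptive '# vector name' column.
    """
    match = COLUMN_PATTERN.match(name)
    if match is None:
        return None
    patient_id, spectrum_number = match.groups()
    # Keep the patient id as text so leading zeros are preserved.
    return patient_id, int(spectrum_number)


print(parse_spectrum_column("Covid19_01021_03"))
print(parse_spectrum_column("# vector name"))
```

> Keeping the patient identifier as a string preserves the leading zeros, which makes it easier to match the value against the `# individual` row.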

## <font color = 'red'> 3. Importing the Data: </font>

> In this step we take a first look at the dataset: its format and composition.

> My initial hypothesis is that the dataset is quite sparse, since patients are unlikely to share the same molecules or the same amounts. As seen in the data dictionary, some fields are purely descriptive. This kind of data is not fed to the model and will have to be removed later.

> To begin, we import the libraries used in the project.

In [24]:
# data handling libraries
import pandas as pd
import numpy as np

# models
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# model evaluation metrics
from sklearn.metrics import roc_auc_score, roc_curve, accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# training and model selection utilities
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold

# data visualization libraries
import seaborn as sns
import matplotlib.pyplot as plt

# suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [25]:
# load the file we will work with
df = pd.read_csv('/content/FeatureVectors_Cov19_Control_x_Covid_Fitting.csv')

In [26]:
# DataFrame dimensions
df.shape

(3939, 8005)

> As we can see, the *dataset* has 3939 rows and 8005 columns.

In [27]:
df.head(10)

Unnamed: 0,# vector name,Covid19_01021_01,Covid19_01021_02,Covid19_01021_03,Covid19_01021_04,Covid19_01021_05,Covid19_01021_06,Covid19_01021_07,Covid19_01021_08,Covid19_01021_09,Covid19_01021_10,Covid19_01024_01,Covid19_01024_02,Covid19_01024_03,Covid19_01024_04,Covid19_01024_05,Covid19_01024_06,Covid19_01024_07,Covid19_01024_08,Covid19_01024_09,Covid19_01024_10,Covid19_01025_01,Covid19_01025_02,Covid19_01025_03,Covid19_01025_04,Covid19_01025_05,Covid19_01025_06,Covid19_01025_07,Covid19_01025_08,Covid19_01025_09,Covid19_01025_10,Covid19_01026_01,Covid19_01026_02,Covid19_01026_03,Covid19_01026_04,Covid19_01026_05,Covid19_01026_06,Covid19_01026_07,Covid19_01026_08,Covid19_01026_09,...,Covid19_00859_01,Covid19_00859_02,Covid19_00859_03,Covid19_00859_04,Covid19_00859_05,Covid19_00859_06,Covid19_00859_07,Covid19_00859_08,Covid19_00859_09,Covid19_00859_10,Covid19_00940_01,Covid19_00940_02,Covid19_00940_03,Covid19_00940_04,Covid19_00940_05,Covid19_00940_06,Covid19_00940_07,Covid19_00940_08,Covid19_00940_09,Covid19_00940_10,Covid19_00941_01,Covid19_00941_02,Covid19_00941_03,Covid19_00941_04,Covid19_00941_05,Covid19_00941_06,Covid19_00941_07,Covid19_00941_08,Covid19_00941_09,Covid19_00941_10,Covid19_00986_01,Covid19_00986_02,Covid19_00986_03,Covid19_00986_04,Covid19_00986_05,Covid19_00986_06,Covid19_00986_07,Covid19_00986_08,Covid19_00986_09,Covid19_00986_10
0,# individual,Covid19_01021,Covid19_01021,Covid19_01021,Covid19_01021,Covid19_01021,Covid19_01021,Covid19_01021,Covid19_01021,Covid19_01021,Covid19_01021,Covid19_01024,Covid19_01024,Covid19_01024,Covid19_01024,Covid19_01024,Covid19_01024,Covid19_01024,Covid19_01024,Covid19_01024,Covid19_01024,Covid19_01025,Covid19_01025,Covid19_01025,Covid19_01025,Covid19_01025,Covid19_01025,Covid19_01025,Covid19_01025,Covid19_01025,Covid19_01025,Covid19_01026,Covid19_01026,Covid19_01026,Covid19_01026,Covid19_01026,Covid19_01026,Covid19_01026,Covid19_01026,Covid19_01026,...,Covid19_00859,Covid19_00859,Covid19_00859,Covid19_00859,Covid19_00859,Covid19_00859,Covid19_00859,Covid19_00859,Covid19_00859,Covid19_00859,Covid19_00940,Covid19_00940,Covid19_00940,Covid19_00940,Covid19_00940,Covid19_00940,Covid19_00940,Covid19_00940,Covid19_00940,Covid19_00940,Covid19_00941,Covid19_00941,Covid19_00941,Covid19_00941,Covid19_00941,Covid19_00941,Covid19_00941,Covid19_00941,Covid19_00941,Covid19_00941,Covid19_00986,Covid19_00986,Covid19_00986,Covid19_00986,Covid19_00986,Covid19_00986,Covid19_00986,Covid19_00986,Covid19_00986,Covid19_00986
1,# indiv number,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,...,798,798,798,798,798,798,798,798,798,798,799,799,799,799,799,799,799,799,799,799,800,800,800,800,800,800,800,800,800,800,801,801,801,801,801,801,801,801,801,801
2,# obsv number,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,...,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10
3,# categ number,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,...,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24
4,# class,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
5,200.013,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
6,200.0136,0,0,0,0,0,0,0,0,0,0,0,0,114144,132095,172123,182801,155038,168713,191552,211219,0,0,0,0,0,116491,152840,169172,176412,164951,0,0,0,0,102690,125636,137545,149265,159056,...,78310,139536,184030,184176,200302,214937,222172,235106,243181,281695,0,0,0,0,125261,114351,168535,189095,179519,175259,0,0,0,0,175242,212260,196141,215744,202470,222470,0,184836,237233,209538,270561,280675,293155,275972,304217,286410
7,200.164,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
8,200.165,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
9,200.172,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


## <font color = 'red'> 4. Brief Data Exploration: </font>

> In this section we check whether the data needs any changes or cleaning. As the previous code cell showed, the dataset really is very sparse, which may affect the model's performance later on.
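
> One way to put a number on that sparsity is to measure the share of zero intensities. The sketch below uses a small invented table (the frame `toy` and its values are illustrative, not taken from the dataset) and assumes the intensities are already numeric.

```python
import pandas as pd

# Toy intensity table: rows are m/z features, columns are spectra.
# The numbers are invented for illustration only.
toy = pd.DataFrame(
    {
        "Covid19_00001_01": [0, 114144, 0, 0],
        "Covid19_00001_02": [0, 132095, 0, 0],
        "Covid19_00002_01": [0, 0, 0, 215336],
    },
    index=["200.013", "200.0136", "200.164", "1572.19"],
)

# Fraction of cells equal to zero across the whole table.
zero_fraction = (toy == 0).to_numpy().mean()

# Features (rows) that are zero in every spectrum carry no signal.
all_zero_mask = (toy == 0).all(axis=1).to_numpy()
all_zero_features = toy.index[all_zero_mask].tolist()

print(f"zero fraction: {zero_fraction:.2f}")
print("all-zero features:", all_zero_features)
```

> On the real frame the values would first have to be converted to numbers, because the metadata rows make every column load as text.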

> Let's now check for null values:

In [28]:
# Fraction of null values per column
(df.isnull().sum() / df.shape[0]).sort_values(ascending = False)

Covid19_00986_10    0.0
Covid19_00500_04    0.0
Covid19_00499_01    0.0
Covid19_00499_02    0.0
Covid19_00499_03    0.0
                   ... 
Covid19_00288_05    0.0
Covid19_00288_06    0.0
Covid19_00288_07    0.0
Covid19_00288_08    0.0
# vector name       0.0
Length: 8005, dtype: float64

> Oddly, every column shows the same fraction of null values, which might be explained by a spectrometer failure while the data was being collected.

> Let's take a closer look at which rows these are.

In [29]:
# which rows are null?
#df[df['Covid19_01090_10'].isnull()]

> Rows 3950 to 3961 have null values in every column. Because spectrometry values are hard to impute given how specific the data is, we choose to remove these rows.

In [30]:
# Removing rows with NaN values
df.dropna(inplace=True)
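
> The call to `dropna` relies on its default behaviour, which removes any row containing at least one `NaN`. The sketch below, on an invented frame named `toy`, contrasts that default with `how="all"`, which removes only rows that are entirely null. It is an illustration of the two options, not a change to the notebook's approach.

```python
import numpy as np
import pandas as pd

# Invented frame: row 1 and row 3 are entirely null, row 2 is partly null.
toy = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0, np.nan], "b": [4.0, np.nan, np.nan, np.nan]}
)

# Default (how="any"): a row is dropped if it has at least one NaN.
dropped_any = toy.dropna()

# how="all": a row is dropped only when every value in it is NaN.
dropped_all = toy.dropna(how="all")

print(len(dropped_any), len(dropped_all))
```

> When the null rows are null in every column, as described above, both options remove the same rows.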

In [31]:
# Showing missing data
(df.isnull().sum() / df.shape[0]).sort_values(ascending = False)

Covid19_00986_10    0.0
Covid19_00500_04    0.0
Covid19_00499_01    0.0
Covid19_00499_02    0.0
Covid19_00499_03    0.0
                   ... 
Covid19_00288_05    0.0
Covid19_00288_06    0.0
Covid19_00288_07    0.0
Covid19_00288_08    0.0
# vector name       0.0
Length: 8005, dtype: float64

> As the previous cell shows, there are no null values left.

In [32]:
df.head()

Unnamed: 0,# vector name,Covid19_01021_01,Covid19_01021_02,Covid19_01021_03,Covid19_01021_04,Covid19_01021_05,Covid19_01021_06,Covid19_01021_07,Covid19_01021_08,Covid19_01021_09,Covid19_01021_10,Covid19_01024_01,Covid19_01024_02,Covid19_01024_03,Covid19_01024_04,Covid19_01024_05,Covid19_01024_06,Covid19_01024_07,Covid19_01024_08,Covid19_01024_09,Covid19_01024_10,Covid19_01025_01,Covid19_01025_02,Covid19_01025_03,Covid19_01025_04,Covid19_01025_05,Covid19_01025_06,Covid19_01025_07,Covid19_01025_08,Covid19_01025_09,Covid19_01025_10,Covid19_01026_01,Covid19_01026_02,Covid19_01026_03,Covid19_01026_04,Covid19_01026_05,Covid19_01026_06,Covid19_01026_07,Covid19_01026_08,Covid19_01026_09,...,Covid19_00859_01,Covid19_00859_02,Covid19_00859_03,Covid19_00859_04,Covid19_00859_05,Covid19_00859_06,Covid19_00859_07,Covid19_00859_08,Covid19_00859_09,Covid19_00859_10,Covid19_00940_01,Covid19_00940_02,Covid19_00940_03,Covid19_00940_04,Covid19_00940_05,Covid19_00940_06,Covid19_00940_07,Covid19_00940_08,Covid19_00940_09,Covid19_00940_10,Covid19_00941_01,Covid19_00941_02,Covid19_00941_03,Covid19_00941_04,Covid19_00941_05,Covid19_00941_06,Covid19_00941_07,Covid19_00941_08,Covid19_00941_09,Covid19_00941_10,Covid19_00986_01,Covid19_00986_02,Covid19_00986_03,Covid19_00986_04,Covid19_00986_05,Covid19_00986_06,Covid19_00986_07,Covid19_00986_08,Covid19_00986_09,Covid19_00986_10
0,# individual,Covid19_01021,Covid19_01021,Covid19_01021,Covid19_01021,Covid19_01021,Covid19_01021,Covid19_01021,Covid19_01021,Covid19_01021,Covid19_01021,Covid19_01024,Covid19_01024,Covid19_01024,Covid19_01024,Covid19_01024,Covid19_01024,Covid19_01024,Covid19_01024,Covid19_01024,Covid19_01024,Covid19_01025,Covid19_01025,Covid19_01025,Covid19_01025,Covid19_01025,Covid19_01025,Covid19_01025,Covid19_01025,Covid19_01025,Covid19_01025,Covid19_01026,Covid19_01026,Covid19_01026,Covid19_01026,Covid19_01026,Covid19_01026,Covid19_01026,Covid19_01026,Covid19_01026,...,Covid19_00859,Covid19_00859,Covid19_00859,Covid19_00859,Covid19_00859,Covid19_00859,Covid19_00859,Covid19_00859,Covid19_00859,Covid19_00859,Covid19_00940,Covid19_00940,Covid19_00940,Covid19_00940,Covid19_00940,Covid19_00940,Covid19_00940,Covid19_00940,Covid19_00940,Covid19_00940,Covid19_00941,Covid19_00941,Covid19_00941,Covid19_00941,Covid19_00941,Covid19_00941,Covid19_00941,Covid19_00941,Covid19_00941,Covid19_00941,Covid19_00986,Covid19_00986,Covid19_00986,Covid19_00986,Covid19_00986,Covid19_00986,Covid19_00986,Covid19_00986,Covid19_00986,Covid19_00986
1,# indiv number,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,...,798,798,798,798,798,798,798,798,798,798,799,799,799,799,799,799,799,799,799,799,800,800,800,800,800,800,800,800,800,800,801,801,801,801,801,801,801,801,801,801
2,# obsv number,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,...,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10
3,# categ number,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,...,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24
4,# class,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1


> Now we rearrange the data so that the information useful to the model is easier to see. We do this by removing some columns, replacing the index, and transposing the table.

In [33]:
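# Use the '# vector name' column as the index
df.index = df['# vector name']

# Preview the frame without the duplicated column. The result is not assigned,
# so df itself is unchanged; the duplicate is removed after the transposition.
df.drop(columns=['# vector name'])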

Unnamed: 0_level_0,Covid19_01021_01,Covid19_01021_02,Covid19_01021_03,Covid19_01021_04,Covid19_01021_05,Covid19_01021_06,Covid19_01021_07,Covid19_01021_08,Covid19_01021_09,Covid19_01021_10,Covid19_01024_01,Covid19_01024_02,Covid19_01024_03,Covid19_01024_04,Covid19_01024_05,Covid19_01024_06,Covid19_01024_07,Covid19_01024_08,Covid19_01024_09,Covid19_01024_10,Covid19_01025_01,Covid19_01025_02,Covid19_01025_03,Covid19_01025_04,Covid19_01025_05,Covid19_01025_06,Covid19_01025_07,Covid19_01025_08,Covid19_01025_09,Covid19_01025_10,Covid19_01026_01,Covid19_01026_02,Covid19_01026_03,Covid19_01026_04,Covid19_01026_05,Covid19_01026_06,Covid19_01026_07,Covid19_01026_08,Covid19_01026_09,Covid19_01026_10,...,Covid19_00859_01,Covid19_00859_02,Covid19_00859_03,Covid19_00859_04,Covid19_00859_05,Covid19_00859_06,Covid19_00859_07,Covid19_00859_08,Covid19_00859_09,Covid19_00859_10,Covid19_00940_01,Covid19_00940_02,Covid19_00940_03,Covid19_00940_04,Covid19_00940_05,Covid19_00940_06,Covid19_00940_07,Covid19_00940_08,Covid19_00940_09,Covid19_00940_10,Covid19_00941_01,Covid19_00941_02,Covid19_00941_03,Covid19_00941_04,Covid19_00941_05,Covid19_00941_06,Covid19_00941_07,Covid19_00941_08,Covid19_00941_09,Covid19_00941_10,Covid19_00986_01,Covid19_00986_02,Covid19_00986_03,Covid19_00986_04,Covid19_00986_05,Covid19_00986_06,Covid19_00986_07,Covid19_00986_08,Covid19_00986_09,Covid19_00986_10
# vector name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1,Unnamed: 55_level_1,Unnamed: 56_level_1,Unnamed: 57_level_1,Unnamed: 58_level_1,Unnamed: 59_level_1,Unnamed: 60_level_1,Unnamed: 61_level_1,Unnamed: 62_level_1,Unnamed: 63_level_1,Unnamed: 64_level_1,Unnamed: 65_level_1,Unnamed: 66_level_1,Unnamed: 67_level_1,Unnamed: 68_level_1,Unnamed: 69_level_1,Unnamed: 70_level_1,Unnamed: 71_level_1,Unnamed: 72_level_1,Unnamed: 73_level_1,Unnamed: 74_level_1,Unnamed: 75_level_1,Unnamed: 76_level_1,Unnamed: 77_level_1,Unnamed: 78_level_1,Unnamed: 79_level_1,Unnamed: 80_level_1,Unnamed: 81_level_1
# individual,Covid19_01021,Covid19_01021,Covid19_01021,Covid19_01021,Covid19_01021,Covid19_01021,Covid19_01021,Covid19_01021,Covid19_01021,Covid19_01021,Covid19_01024,Covid19_01024,Covid19_01024,Covid19_01024,Covid19_01024,Covid19_01024,Covid19_01024,Covid19_01024,Covid19_01024,Covid19_01024,Covid19_01025,Covid19_01025,Covid19_01025,Covid19_01025,Covid19_01025,Covid19_01025,Covid19_01025,Covid19_01025,Covid19_01025,Covid19_01025,Covid19_01026,Covid19_01026,Covid19_01026,Covid19_01026,Covid19_01026,Covid19_01026,Covid19_01026,Covid19_01026,Covid19_01026,Covid19_01026,...,Covid19_00859,Covid19_00859,Covid19_00859,Covid19_00859,Covid19_00859,Covid19_00859,Covid19_00859,Covid19_00859,Covid19_00859,Covid19_00859,Covid19_00940,Covid19_00940,Covid19_00940,Covid19_00940,Covid19_00940,Covid19_00940,Covid19_00940,Covid19_00940,Covid19_00940,Covid19_00940,Covid19_00941,Covid19_00941,Covid19_00941,Covid19_00941,Covid19_00941,Covid19_00941,Covid19_00941,Covid19_00941,Covid19_00941,Covid19_00941,Covid19_00986,Covid19_00986,Covid19_00986,Covid19_00986,Covid19_00986,Covid19_00986,Covid19_00986,Covid19_00986,Covid19_00986,Covid19_00986
# indiv number,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,...,798,798,798,798,798,798,798,798,798,798,799,799,799,799,799,799,799,799,799,799,800,800,800,800,800,800,800,800,800,800,801,801,801,801,801,801,801,801,801,801
# obsv number,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,...,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10
# categ number,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,...,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24,24
# class,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1572.19,0,0,0,0,0,0,0,0,0,0,0,215336,273532,283580,231535,277430,253294,211580,239022,219762,193155,208265,266549,222876,0,167066,205250,0,0,0,161238,222441,255869,260702,235303,215969,176526,282061,195336,194900,...,0,0,212934,194645,219571,266327,210152,240950,225761,196099,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1592.16,0,0,0,0,0,0,0,0,0,0,0,220352,217192,202257,275127,256174,224945,192017,254331,238720,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1594.17,0,0,0,0,0,0,0,0,0,0,0,233458,224833,347460,208796,226394,322378,242620,276170,231266,0,0,0,0,0,0,0,0,0,0,0,170558,195439,198956,189368,0,0,152298,163880,0,...,0,0,0,203336,205400,213555,0,195468,182790,210960,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1610.94,0,0,0,0,0,0,0,0,0,0,0,186669,206890,203694,232309,232358,233605,272574,218964,243386,374237,550455,566326,613878,475095,495418,497584,465857,503585,521741,221802,229533,275422,250114,249539,261469,237488,236316,250104,261961,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [34]:
# Transposing the data
df = df.T

> Above we transposed the data, so rows become columns and columns become rows. This helps because each row (entry) now represents one spectrum of a patient, and the columns represent that patient's attributes (molecules).
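
> The effect of the transposition is easier to see on a miniature table. The sketch below uses invented values and the hypothetical names `raw` and `tidy`. Unlike the notebook, it uses `set_index`, which moves the label column into the index in a single step.

```python
import pandas as pd

# Miniature version of the raw layout: one row per attribute,
# one column per spectrum. Values are invented for illustration.
raw = pd.DataFrame(
    {
        "# vector name": ["# class", "200.013", "200.0136"],
        "Covid19_00001_01": ["0", "0", "114144"],
        "Covid19_00002_01": ["1", "0", "78310"],
    }
)

# Use the attribute names as the index, then transpose so that
# each row is a spectrum and each column is an attribute.
tidy = raw.set_index("# vector name").T

print(tidy.shape)
print(tidy.columns.tolist())
```

> After this step the spectrum names are the row labels and the attribute names are the column labels.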

In [35]:
# Dropping columns that will not be used by the model.
df.drop(columns=['# individual', '# indiv number', '# obsv number', '# categ number'], inplace=True)

In [36]:
# resetting the index to a numeric value
df.reset_index(drop= True,inplace=True)

In [37]:
# Dropping the first row, which duplicates the '# vector name' labels
df.drop(df.index[0], axis=0, inplace=True)

In [38]:
df.head()

# vector name,# class,200.013,200.0136,200.164,200.165,200.172,200.173,200.2,200.201,200.974,200.984,201.007,201.008,201.009,201.021,201.0435,201.0523,201.072,201.073,201.112,201.148,201.149,201.184,201.185,201.9873,202.012,202.013,202.016,202.029,202.032,202.033,202.409,202.41,202.573,202.574,202.874,203.0002,203.02,203.022,203.023,...,1273.6,1274.61,1279.59,1280.59,1281.93,1282.94,1293.94,1294.95,1295.58,1295.96,1296.59,1296.96,1388.84,1394.87,1395.87,1418.86,1516.13,1517.13,1518.14,1519.15,1520.15,1540.13,1541.13,1542.14,1543.15,1544.16,1545.16,1546.17,1547.18,1566.14,1567.15,1568.16,1569.16,1570.17,1571.18,1572.19,1592.16,1594.17,1610.94,1611.94
1,0,0,0,0,0,0,145175,0,130077,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,151311,0,0,101473,0,137426,76797,363182,0,0,0,0,305619,...,214367,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,225660,209644,223157,0,0,0,0,177233,0,274379,210346,201002,0,0,0,158379,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,185876,0,138983,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,124023,0,0,134213,0,162093,104872,457867,0,0,0,0,271729,...,357159,259290,0,0,0,0,0,0,0,0,0,0,0,265674,0,0,350216,276312,269380,0,0,0,0,289378,0,392541,300620,265609,0,0,0,234588,0,0,0,0,0,0,0,230907
3,0,0,0,0,0,0,141013,0,169534,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,149288,0,0,153733,0,179534,113629,469759,0,0,0,0,261227,...,356517,253741,0,0,0,0,0,0,0,0,0,0,0,250476,0,0,325553,284100,317882,0,0,0,0,283096,0,317325,291887,284862,0,0,0,249985,0,0,0,0,0,0,0,257409
4,0,0,0,0,0,0,176866,0,178534,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,200633,0,211865,121020,565167,0,0,0,0,332796,...,360227,280950,0,0,0,0,0,0,0,0,0,0,0,0,0,0,262506,0,0,0,0,0,0,0,0,292156,303447,0,0,0,0,259521,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,140745,0,185735,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,211762,0,229125,131018,582512,0,0,0,0,319535,...,328065,297247,0,0,0,0,0,0,0,0,0,0,0,0,0,0,294388,0,281874,0,0,0,0,273560,0,278679,297732,256705,0,0,0,0,0,0,0,0,0,0,0,0


> As we can see, only the columns relevant to the model remain. The data is also easier to read in this layout, which makes the relationship between the values clearer.

> Let's now check the column types:

In [39]:
df.dtypes.unique()

array([dtype('O')], dtype=object)

> The only type present is `object`, that is, strings. The model cannot consume data typed this way, so we convert every column to `int` or `float`.

In [None]:
# Converting columns to numeric types
for coluna in df:
  df[coluna] = pd.to_numeric(df[coluna])

In [None]:
df.dtypes.unique()

> Now the only types present in the *dataset* are `int` and `float`.
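
> The same conversion can be written without an explicit loop. The sketch below is an equivalent formulation on an invented frame named `toy`, not the notebook's own code: it applies `pd.to_numeric` to every column with `DataFrame.apply`.

```python
import pandas as pd

# Text-typed toy frame, mimicking what the transposition leaves behind.
toy = pd.DataFrame({"# class": ["0", "1"], "200.0136": ["114144", "78310.5"]})

# Convert every column at once; an entry that cannot be parsed
# raises a ValueError instead of being silently ignored.
numeric = toy.apply(pd.to_numeric)

print(numeric.dtypes)
```

> Columns that hold only whole numbers become integers, while columns with a decimal value become floats.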

## <font color = 'red'> 5. Preparing and Building the Model: </font>

> In this step we first split the data into a training *dataset* and a test *dataset*. The training set is used throughout for parameter tuning and for improving the model, while the test set is used only at the end to check the performance of the trained model.

In [None]:
# Separating the target column from the rest of the data.
X = df.drop(['# class'],axis=1)
y = df['# class']

In [None]:
# Splitting the data into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)
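
> Because each patient contributes several spectra, a purely random split can place spectra from the same patient in both the training and the test set, which tends to inflate test scores. The sketch below is not what the notebook does. It shows, on synthetic data, how scikit-learn's `GroupShuffleSplit` could keep every patient on one side of the split. The arrays `X_demo`, `y_demo` and `groups_demo` are invented; in the notebook the patient identifier would have to be saved from the `# indiv number` column before that column is dropped.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)

# Synthetic stand-in: 20 patients with 10 spectra each, 5 features.
n_patients, spectra_per_patient = 20, 10
groups_demo = np.repeat(np.arange(n_patients), spectra_per_patient)
X_demo = rng.normal(size=(n_patients * spectra_per_patient, 5))
# One label per patient, repeated for each of that patient's spectra.
y_demo = np.repeat(rng.integers(0, 2, size=n_patients), spectra_per_patient)

# Split by group (patient) rather than by individual spectrum.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(X_demo, y_demo, groups=groups_demo))

train_patients = set(groups_demo[train_idx])
test_patients = set(groups_demo[test_idx])

# The intersection is empty: no patient appears on both sides.
print(train_patients & test_patients)
```

> Fixing `random_state` also makes the split reproducible from one run to the next.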

> Here we tune the model's hyperparameters. Because this can be very time-consuming, we do it in stages to speed up training: the hyperparameters are tuned separately, one at a time.

> For this we use the SVM (Support Vector Machine) learning model. This is a personal choice, since I am more familiar with it than with other models.

> We set the parameters in the following order:

*  `Kernel`: the type of kernel function
*  `C`: the regularization parameter, which controls the penalty for misclassified training points

> The metric we optimize for is the `f1-score`.
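
> The F1-score is the harmonic mean of precision and recall. The sketch below computes it by hand on invented labels, to make the definition concrete. The function name `f1_from_labels` and the example labels are hypothetical and not part of the notebook.

```python
def f1_from_labels(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Invented labels, for illustration only.
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_demo = [1, 1, 0, 0, 0, 1, 0, 1]

precision, recall, f1 = f1_from_labels(y_true_demo, y_pred_demo)
print(precision, recall, f1)
```

> In this example there are 3 true positives, 1 false positive and 1 false negative, so precision, recall and F1 are all 0.75.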

In [None]:

svc = SVC(C=1)
param_grid = {
 'kernel':['rbf','poly','sigmoid','linear']
}

# find the best parameter
kfold = StratifiedKFold(n_splits=5, shuffle=True)
grid_search = GridSearchCV(svc, param_grid, scoring='f1', n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X_train, y_train)

# show results
print("Melhor: {} para {}".format(grid_result.best_score_, grid_result.best_params_))

> To choose the kernel we arbitrarily fixed `C` at 1. The best kernel turned out to be the **linear** one, with a cross-validated F1-score on the training set of **99.77%**.

> Now that we know the best kernel, let's search for the best value of `C`:

In [None]:
svc = SVC(kernel='linear')

param_grid = {
 'C':[0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]
}

# find the best parameter
kfold = StratifiedKFold(n_splits=5, shuffle=True)
grid_search = GridSearchCV(svc, param_grid, scoring='f1', n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X_train, y_train)

# show results
print("Melhor: {} para {}".format(grid_result.best_score_, grid_result.best_params_))

> With the kernel fixed to linear and several values of `C` tested, the best result came from `C = 0.00001`, which raised the model's score to **99.79%**.
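
> Tuning one hyperparameter at a time is fast, but the best `C` can depend on the kernel. As a hedged alternative, the sketch below searches both at once on synthetic data generated with `make_classification`. The `StandardScaler` step, the reduced grid and the names `X_demo`, `y_demo`, `pipeline` and `search` are assumptions made for this illustration and are not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the spectra (not the real dataset).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# Scaling inside the pipeline means it is refit on each training fold.
pipeline = Pipeline([("scale", StandardScaler()), ("svc", SVC())])

# Every kernel is tried with every value of C (2 x 4 = 8 candidates).
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.01, 0.1, 1, 10],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 4))
```

> The cost grows with the product of the grid sizes, which is the trade-off the one-at-a-time search avoids.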

> With both parameters chosen, we now train the model with the values found and check its performance:


In [None]:
# Final model with the kernel and C selected by the searches above
model = SVC(kernel='linear', C=0.00001)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

# Classification report
print(classification_report(y_test, y_pred))

# print the area under the ROC curve
print("AUC: {:.4f}\n".format(roc_auc_score(y_test, y_pred)))
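
> The cell above passes hard 0/1 predictions to `roc_auc_score`. The ROC curve is usually built from continuous scores, which an SVM exposes through `decision_function`. The sketch below compares the two on synthetic data and also prints a confusion matrix. The data and the names `clf`, `scores` and `labels` are invented for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the spectra (not the real dataset).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

clf = SVC(kernel="linear").fit(X_tr, y_tr)

# Continuous scores: signed value of the decision function per sample.
scores = clf.decision_function(X_te)
auc_from_scores = roc_auc_score(y_te, scores)

# Hard 0/1 predictions, which collapse the ROC curve to a single point.
labels = clf.predict(X_te)
auc_from_labels = roc_auc_score(y_te, labels)

print(f"AUC from scores: {auc_from_scores:.4f}")
print(f"AUC from labels: {auc_from_labels:.4f}")
print(confusion_matrix(y_te, labels))
```

> The confusion matrix shows how many samples of each class were classified correctly, which complements the single-number summaries.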

## <font color = 'red'> 6. Conclusion: </font>
 
 Main takeaways:

  * Contrary to the initial hypothesis that the sparsity of the data could hurt the model, it did not: the model performed very well, with a cross-validated F1-score of 99.79% on the training set.

  * The SVM proved to be a great model for this project.

  * A follow-up study is needed to understand the difference between this data and the data used earlier in the project, since in theory they are the same kind of data. My hypothesis is that the data was organized differently.