<a href="https://colab.research.google.com/github/Lucas-Buk/IMT/blob/main/Cancer_KMeans_%2B_TSNE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Libraries and installation**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import pickle

In [2]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# **Variable dictionary**

*   **ESCOLARI**: Code for the patient's education level (int = 1).

      1 – ILLITERATE

      2 – ELEMENTARY EDUCATION, INCOMPLETE

      3 – ELEMENTARY EDUCATION, COMPLETE

      4 – HIGH SCHOOL

      5 – HIGHER EDUCATION

      9 – UNKNOWN
*   **IDADE**: Patient's age (int = 3).
*   **SEXO**: Patient's sex (int = 1).

      1 – MALE

      2 – FEMALE
*   **UFNASC**: State (UF) of birth (char = 2). Other options: SI - No information; OP - Other country.
*   **UFRESID**: State (UF) of residence (char = 2). Other options: OP - Other country.
*   **IBGE**: IBGE code of the patient's city of residence, including the check digit (char = 7).
*   **CIDADE**: Patient's city of residence (char = 200).
*   **CATEATEND**: Care category at diagnosis (int = 1).

      1 - HEALTH INSURANCE PLAN (CONVENIO)

      2 - SUS

      3 – PRIVATE

      9 – NO INFORMATION
*   **DTCONSULT**: Date of the first consultation (date = 10). Format: DD/MM/YYYY
*   **CLINICA**: Clinic code (int = 2).

      1 – ALLERGY/IMMUNOLOGY

      2 – CARDIAC SURGERY

      3 – HEAD AND NECK SURGERY

      4 – GENERAL SURGERY

      5 – PEDIATRIC SURGERY

      6 – PLASTIC SURGERY

      7 – THORACIC SURGERY

      8 – VASCULAR SURGERY

      9 – INTERNAL MEDICINE

      10 – DERMATOLOGY

      11 – ENDOCRINOLOGY

      12 – GASTROINTESTINAL SURGERY

      13 – GASTROENTEROLOGY

      14 – GERIATRICS

      15 – GYNECOLOGY

      16 – GYNECOLOGY / OBSTETRICS

      17 – HEMATOLOGY

      18 – INFECTIOUS DISEASES

      19 – NEPHROLOGY

      20 – NEUROSURGERY

      21 – NEUROLOGY

      22 – OPHTHALMOLOGY

      23 – SURGICAL ONCOLOGY

      24 – CLINICAL ONCOLOGY

      25 – PEDIATRIC ONCOLOGY

      26 – ORTHOPEDICS

      27 – OTORHINOLARYNGOLOGY

      28 – PEDIATRICS

      29 – PULMONOLOGY

      30 – PROCTOLOGY

      31 – RADIOTHERAPY

      32 – UROLOGY

      33 – MASTOLOGY

      34 – CUTANEOUS ONCOLOGY

      35 – PELVIC SURGERY

      36 – ABDOMINAL SURGERY

      37 – DENTISTRY

      38 – LIVER TRANSPLANT

      99 – UNKNOWN
*   **DIAGPREV**: Previous diagnosis and treatment (int = 1).

      1 – NO DIAGNOSIS / NO TREATMENT

      2 – WITH DIAGNOSIS / NO TREATMENT

      3 – WITH DIAGNOSIS / WITH TREATMENT

      4 – OTHER
*   **DTDIAG**: Date of diagnosis (date = 10). Format: DD/MM/YYYY
*   **BASEDIAG**: Code for the basis of diagnosis (int = 1).

      1 – CLINICAL EXAMINATION

      2 – NON-MICROSCOPIC AUXILIARY METHODS

      3 – MICROSCOPIC CONFIRMATION

      4 – NO INFORMATION
*   **TOPO**: Topography code (char = 4). Format: C999
*   **TOPOGRUP**: Topography group (char = 3). Format: C99
*   **DESCTOPO**: Topography description (char = 80).
*   **MORFO**: Morphology code (char = 5). Format: 99999
*   **DESCMORFO**: Morphology description (char = 80).
*   **EC**: Clinical stage (char = 5).
*   **ECGRUP**: Clinical staging group (char = 3).

      0 - Primary tumors classified as in situ

      I - Localized tumors

      II - Tumors with regional involvement by direct extension

      III - Tumors with regional lymph node involvement

      IV - Tumors with distant metastasis

      X - Tumors not assessed by the responsible professional, or with no staging information recorded in the medical record

      Y - Tumors to which the TNM classification does not apply, i.e. non-solid tumors (for example, leukemias)
*   **T**: TNM classification - T (char = 5).
*   **N**: TNM classification - N (char = 5).
*   **M**: TNM classification - M (char = 3).
*   **PT**: Post-surgical staging, pT (char = 5).
*   **PN**: Post-surgical staging, pN (char = 5).
*   **PM**: Post-surgical staging, pM (char = 3).
*   **S**: TNM classification - S (int = 1). Domain: 0; 1; 2; 3; 8 – NOT APPLICABLE; 9 – X
*   **G**: TNM classification – G (Grade) (char = 5).

      Domain (except C40, C41, C381, C382, C383, C47, C48 and C49):
      0; 1; 2; 3; 4; 8 – NOT APPLICABLE; 9 – X

      Domain (only C40, C41, C381, C382, C383, C47, C48 and C49):
      HIGH; LOW; 8 – NOT APPLICABLE; 9 – X

*   **LOCALTNM**: TNM classification - Location (int = 1).

      1 – UPPER

      2 – MIDDLE

      3 – LOWER

      8 – NOT APPLICABLE

      9 – X
*   **IDMITOTIC**: TNM classification – Mitotic index (int = 1).

      1 – HIGH

      2 – LOW

      8 – NOT APPLICABLE

      9 – X
*   **PSA**: TNM classification - PSA (int = 1).

      1 – LESS THAN 10

      2 – GREATER THAN OR EQUAL TO 10 AND LESS THAN 20

      3 – GREATER THAN OR EQUAL TO 20

      8 – NOT APPLICABLE

      9 – X
*   **GLEASON**: TNM classification - Gleason (int = 1).

      1 – LESS THAN OR EQUAL TO 6

      2 – EQUAL TO 7

      3 – GREATER THAN OR EQUAL TO 8

      8 – NOT APPLICABLE

      9 – X
*   **OUTRACLA**: Other staging classification (char = 20).
*   **META01**: Metastasis (char = 3). Format: C99
*   **META02**: Metastasis (char = 3). Format: C99
*   **META03**: Metastasis (char = 3). Format: C99
*   **META04**: Metastasis (char = 3). Format: C99
*   **DTTRAT**: Treatment start date (date = 10). Format: DD/MM/YYYY
*   **NAOTRAT**: Code for the reason treatment was not performed (int = 1).

      1 – TREATMENT REFUSED

      2 – ADVANCED DISEASE, POOR CLINICAL CONDITION

      3 – OTHER ASSOCIATED DISEASES

      4 – TREATMENT ABANDONED

      5 – DEATH FROM CANCER

      6 – DEATH FROM OTHER CAUSES, NOT OTHERWISE SPECIFIED

      7 – OTHER

      8 – NOT APPLICABLE (TREATMENT WAS GIVEN)

      9 – NO INFORMATION
*   **TRATAMENTO**: Code for the combination of treatments performed (char = 1).

      A – Surgery

      B – Radiotherapy

      C – Chemotherapy

      D – Surgery + Radiotherapy

      E – Surgery + Chemotherapy

      F – Radiotherapy + Chemotherapy

      G – Surgery + Radio + Chemo

      H – Surgery + Radio + Chemo + Hormone

      I – Other treatment combinations

      J – No treatment performed
*   **TRATHOSP**: Code for the combination of treatments performed at the hospital (char = 1).

      A – Surgery

      B – Radiotherapy

      C – Chemotherapy

      D – Surgery + Radiotherapy

      E – Surgery + Chemotherapy

      F – Radiotherapy + Chemotherapy

      G – Surgery + Radio + Chemo

      H – Surgery + Radio + Chemo + Hormone

      I – Other treatment combinations

      J – No treatment performed
*   **TRATFANTES**: Code for the combination of treatments performed outside the hospital before/during admission (char = 1).

      A – Surgery

      B – Radiotherapy

      C – Chemotherapy

      D – Surgery + Radiotherapy

      E – Surgery + Chemotherapy

      F – Radiotherapy + Chemotherapy

      G – Surgery + Radio + Chemo

      H – Surgery + Radio + Chemo + Hormone

      I – Other treatment combinations

      J – No treatment performed

      K – No information
*   **TRATFAPOS**: Code for the combination of treatments performed outside the hospital after admission (char = 1).

      A – Surgery

      B – Radiotherapy

      C – Chemotherapy

      D – Surgery + Radiotherapy

      E – Surgery + Chemotherapy

      F – Radiotherapy + Chemotherapy

      G – Surgery + Radio + Chemo

      H – Surgery + Radio + Chemo + Hormone

      I – Other treatment combinations

      J – No treatment performed

      K – No information
*   **NENHUM**: Treatment received at the hospital = none (int = 1). 0 – NO; 1 – YES
*   **CIRURGIA**: Treatment received at the hospital = surgery (int = 1). 0 – NO; 1 – YES
*   **RADIO**: Treatment received at the hospital = radiotherapy (int = 1). 0 – NO; 1 – YES
*   **QUIMIO**: Treatment received at the hospital = chemotherapy (int = 1). 0 – NO; 1 – YES
*   **HORMONIO**: Treatment received at the hospital = hormone therapy (int = 1). 0 – NO; 1 – YES
*   **TMO**: Treatment received at the hospital = bone marrow transplant (TMO) (int = 1). 0 – NO; 1 – YES
*   **IMUNO**: Treatment received at the hospital = immunotherapy (int = 1). 0 – NO; 1 – YES
*   **OUTROS**: Treatment received at the hospital = other (int = 1). 0 – NO; 1 – YES
*   **NENHUMANT**: Treatment received outside the hospital and before admission = none (int = 1). 0 – NO; 1 – YES
*   **CIRURANT**: Treatment received outside the hospital and before admission = surgery (int = 1). 0 – NO; 1 – YES
*   **RADIOANT**: Treatment received outside the hospital and before admission = radiotherapy (int = 1). 0 – NO; 1 – YES
*   **QUIMIOANT**: Treatment received outside the hospital and before admission = chemotherapy (int = 1). 0 – NO; 1 – YES
*   **HORMOANT**: Treatment received outside the hospital and before admission = hormone therapy (int = 1). 0 – NO; 1 – YES
*   **TMOANT**: Treatment received outside the hospital and before admission = bone marrow transplant (TMO) (int = 1). 0 – NO; 1 – YES
*   **IMUNOANT**: Treatment received outside the hospital and before admission = immunotherapy (int = 1). 0 – NO; 1 – YES
*   **OUTROANT**: Treatment received outside the hospital and before admission = other (int = 1). 0 – NO; 1 – YES
*   **NENHUMAPOS**: Treatment received outside the hospital and during/after admission = none (int = 1). 0 – NO; 1 – YES
*   **CIRURAPOS**: Treatment received outside the hospital and during/after admission = surgery (int = 1). 0 – NO; 1 – YES
*   **RADIOAPOS**: Treatment received outside the hospital and during/after admission = radiotherapy (int = 1). 0 – NO; 1 – YES
*   **QUIMIOAPOS**: Treatment received outside the hospital and during/after admission = chemotherapy (int = 1). 0 – NO; 1 – YES
*   **HORMOAPOS**: Treatment received outside the hospital and during/after admission = hormone therapy (int = 1). 0 – NO; 1 – YES
*   **TMOAPOS**: Treatment received outside the hospital and during/after admission = bone marrow transplant (TMO) (int = 1). 0 – NO; 1 – YES
*   **IMUNOAPOS**: Treatment received outside the hospital and during/after admission = immunotherapy (int = 1). 0 – NO; 1 – YES
*   **OUTROAPOS**: Treatment received outside the hospital and during/after admission = other (int = 1). 0 – NO; 1 – YES
*   **DTULTINFO**: Date of the last information on the patient (date = 10). Format: DD/MM/YYYY
*   **ULTINFO**: Last information on the patient (int = 1).

      1 – ALIVE, WITH CANCER

      2 – ALIVE, NOT OTHERWISE SPECIFIED

      3 – DEATH FROM CANCER

      4 – DEATH FROM OTHER CAUSES, NOT OTHERWISE SPECIFIED
*   **CONSDIAG**: Difference in days between the consultation and diagnosis dates (num = days).
*   **TRATCONS**: Difference in days between the consultation and treatment dates (num = days).
*   **DIAGTRAT**: Difference in days between the treatment and diagnosis dates (num = days).
*   **ANODIAG**: Year of diagnosis (int = 4). Format: 9999
*   **CICI**: Childhood tumor (char = 5).
*   **CICIGRUP**: Childhood tumor – Group (char = 80).
*   **CICISUBGRU**: Childhood tumor – Subgroup (char = 80).
*   **FAIXAETAR**: Patient's age group (char = 5).
*   **LATERALI**: Laterality (int = 1).

      1 – RIGHT

      2 – LEFT

      3 – BILATERAL

      8 - NOT APPLICABLE
*   **INSTORIG**: Institution of origin (char = 200). Required only if DIAGPREV = 03 – WITH DIAGNOSIS / WITH TREATMENT
*   **DRS**: Regional Health Departments (Departamentos Regionais de Saúde) (char = 200).
*   **RRAS**: RRAS (char = 200).
*   **PERDASEG**: Loss to follow-up (int = 1).

      0 – No

      1 – Yes

      8 – Not applicable (excluded from the calculation of the loss-to-follow-up indicator)
*   **ERRO**: Admission with error (int = 1). 0 – Without; 1 – With
*   **DTRECIDIVA**: Date of the most recent recurrence (date = 10). Format: DD/MM/YYYY
*   **RECNENHUM**: No recurrence (int = 1). 0 - No; 1 - Yes
*   **RECLOCAL**: Local recurrence (int = 1). 0 - No; 1 - Yes
*   **RECREGIO**: Regional recurrence (int = 1). 0 - No; 1 - Yes
*   **RECDIST**: Distant recurrence / metastasis (int = 1). 0 - No; 1 - Yes
*   **REC01**: Site of recurrence/metastasis (char = 3). Format: C99
*   **REC02**: Site of recurrence/metastasis (char = 3). Format: C99
*   **REC03**: Site of recurrence/metastasis (char = 3). Format: C99
*   **REC04**: Site of recurrence/metastasis (char = 3). Format: C99
*   **IBGEATEN**: IBGE code of the institution (int = 7).
*   **CIDO**: Morphology code, 3rd Edition (int = 5). Format: 99999
*   **DSCCIDO**: Morphology description, 3rd Edition (char = 89).




# **Data**

In [80]:
df = pd.read_csv('/content/drive/MyDrive/Trabalho/Cancer/Datasets/cancer_preprocessing.csv')
print(df.shape)
df.head(3)

(943659, 73)


Unnamed: 0,ESCOLARI,IDADE,SEXO,IBGE,CATEATEND,CLINICA,DIAGPREV,BASEDIAG,TOPO,TOPOGRUP,MORFO,EC,ECGRUP,T,N,M,PT,PN,PM,G,LOCALTNM,IDMITOTIC,PSA,GLEASON,META01,META02,META03,META04,NAOTRAT,TRATAMENTO,TRATHOSP,TRATFANTES,TRATFAPOS,NENHUM,CIRURGIA,RADIO,QUIMIO,HORMONIO,TMO,IMUNO,OUTROS,NENHUMANT,CIRURANT,RADIOANT,NENHUMAPOS,CIRURAPOS,RADIOAPOS,QUIMIOAPOS,HORMOAPOS,TMOAPOS,IMUNOAPOS,OUTROAPOS,ULTINFO,CONSDIAG,TRATCONS,DIAGTRAT,ANODIAG,CICI,CICIGRUP,FAIXAETAR,LATERALI,DRS,RRAS,PERDASEG,RECNENHUM,RECLOCAL,RECREGIO,RECDIST,REC01,REC02,REC03,REC04,IBGEATEN
0,4,40.0,2,3530805,9,15,1,3,222,45,81402,0,0,25,15,0,30,0,0,8,8,8,8,8,0,0,0,0,8,0,0,2,9,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,2,7,95.0,88.0,2000,23,5,4,8,14,15,1,1,0,0,0,0,0,0,0,3509502
1,9,45.0,2,3509502,9,15,1,3,222,45,80703,19,3,25,15,0,41,22,7,8,8,8,8,8,0,0,0,0,8,5,5,2,9,0,0,1,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,3,12,15.0,3.0,2000,23,5,4,8,7,15,0,1,0,0,0,0,0,0,0,3509502
2,2,63.0,2,3509502,9,15,1,3,222,45,80703,19,3,25,15,0,41,22,7,8,8,8,8,8,0,0,0,0,8,1,1,2,9,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,3,6,15.0,9.0,2000,23,5,6,8,7,15,0,1,0,0,0,0,0,0,0,3509502


In [81]:
df.isna().sum().sort_values(ascending=False).head()

IBGEATEN     0
RADIO        0
G            0
LOCALTNM     0
IDMITOTIC    0
dtype: int64

In [82]:
col = df.columns
X = df[col].values
X.shape

(943659, 73)

## **Normalization**

In [83]:
scaler = StandardScaler()
X_norm = scaler.fit_transform(X)

In [84]:
pd.DataFrame(X_norm, columns=col).describe()

Unnamed: 0,ESCOLARI,IDADE,SEXO,IBGE,CATEATEND,CLINICA,DIAGPREV,BASEDIAG,TOPO,TOPOGRUP,MORFO,EC,ECGRUP,T,N,M,PT,PN,PM,G,LOCALTNM,IDMITOTIC,PSA,GLEASON,META01,META02,META03,META04,NAOTRAT,TRATAMENTO,TRATHOSP,TRATFANTES,TRATFAPOS,NENHUM,CIRURGIA,RADIO,QUIMIO,HORMONIO,TMO,IMUNO,OUTROS,NENHUMANT,CIRURANT,RADIOANT,NENHUMAPOS,CIRURAPOS,RADIOAPOS,QUIMIOAPOS,HORMOAPOS,TMOAPOS,IMUNOAPOS,OUTROAPOS,ULTINFO,CONSDIAG,TRATCONS,DIAGTRAT,ANODIAG,CICI,CICIGRUP,FAIXAETAR,LATERALI,DRS,RRAS,PERDASEG,RECNENHUM,RECLOCAL,RECREGIO,RECDIST,REC01,REC02,REC03,REC04,IBGEATEN
count,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0,943659.0
mean,-1.130038e-13,-1.080222e-14,-1.537774e-13,1.360224e-15,6.779968e-13,-7.516858e-16,-7.868857e-14,-5.30686e-14,-5.120349e-14,6.222969e-14,3.868816e-14,-3.518428e-14,1.156337e-13,-1.236059e-14,-3.104793e-13,1.803285e-13,1.593043e-13,-4.103677e-13,-2.304149e-13,-6.434715e-13,8.263831e-14,-9.542169e-15,-3.035037e-13,-1.49315e-14,1.421304e-13,2.221803e-13,-2.244972e-13,-2.79125e-13,-4.523499e-13,-2.713785e-14,-2.710381e-14,-1.347933e-14,9.824199e-14,9.139386e-14,-1.864898e-13,-2.963625e-14,3.033464e-14,1.809813e-13,9.093163e-14,1.771232e-13,7.649916e-14,2.100906e-13,-8.255507e-15,-8.255748e-15,2.04065e-13,-1.885738e-13,-9.541332e-14,-1.642661e-13,5.78317e-15,-2.095035e-14,-4.271488e-14,5.884254e-14,-5.514632e-13,3.696699e-14,7.931577e-15,2.190875e-14,-4.831541e-14,-2.35395e-14,1.675945e-13,3.523164e-13,1.251109e-13,-4.931823e-13,-1.359398e-13,-6.673524e-13,1.427914e-13,3.468548e-13,-6.341165e-13,-8.040598e-13,-2.720097e-13,-3.067903e-14,3.201143e-13,1.166343e-13,-5.947767e-13
std,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001,1.000001
min,-1.146847,-3.546734,-1.000733,-6.635249,-1.108539,-1.621715,-0.7686313,-8.091746,-2.618457,-2.590543,-0.7943764,-1.467136,-1.519387,-1.282202,-0.7610688,-0.5246297,-2.170617,-1.74747,-1.739749,-7.190407,-19.67641,-51.99395,-6.047915,-6.262922,-0.3572571,-0.1894448,-0.1055113,-0.05373153,-7.590048,-1.01171,-0.9935248,-868.8649,-5.995664,-0.3015171,-1.288288,-0.5928398,-0.7343145,-0.3728359,-0.06248441,-0.0818441,-0.2601739,-38.68942,-0.00102942,-0.00102942,-4.035393,-0.06617186,-0.173399,-0.0905237,-0.04821916,-0.01588312,-0.012311,-0.08189658,-1.669236,-0.3140949,-0.4578583,-0.4404552,-1.959214,-8.571645,-7.315449,-3.469124,-2.400017,-1.036384,-0.6266715,-0.4584023,-3.330464,-0.2049891,-0.158096,-0.1508652,-0.2406065,-0.1248493,-0.06865956,-0.03526423,-1.609258
25%,-0.8068051,-0.5235727,-1.000733,-0.07703426,-0.8261599,-0.7502067,-0.7686313,0.04967718,-0.5771264,-0.4934091,-0.6284939,-1.167958,-0.9600535,-0.9718903,-0.7610688,-0.5246297,-0.7494843,-0.9778342,0.135876,0.1554213,0.05378481,0.02025255,0.1744396,0.177784,-0.3572571,-0.1894448,-0.1055113,-0.05373153,0.2277508,-1.01171,-0.9935248,0.001381113,0.1987396,-0.3015171,-1.288288,-0.5928398,-0.7343145,-0.3728359,-0.06248441,-0.0818441,-0.2601739,0.02584686,-0.00102942,-0.00102942,0.2478073,-0.06617186,-0.173399,-0.0905237,-0.04821916,-0.01588312,-0.012311,-0.08189658,-0.5070269,-0.2961048,-0.3615204,-0.4404552,-0.8195577,0.07120664,0.04719412,-0.2731294,0.4524972,-0.8566377,-0.418276,-0.4584023,0.3002585,-0.2049891,-0.158096,-0.1508652,-0.2406065,-0.1248493,-0.06865956,-0.03526423,-1.257803
50%,-0.4667634,0.1284817,0.999268,-0.01436541,-0.8261599,0.20053,-0.7686313,0.04967718,-0.05960597,-0.0198628,-0.4887689,0.02875203,-0.4007203,-0.2478303,-0.7610688,-0.5246297,0.4347932,0.561437,0.6717688,0.1554213,0.05378481,0.02025255,0.1744396,0.177784,-0.3572571,-0.1894448,-0.1055113,-0.05373153,0.2277508,-0.4228016,-0.4072018,0.001381113,0.1987396,-0.3015171,0.7762241,-0.5928398,-0.7343145,-0.3728359,-0.06248441,-0.0818441,-0.2601739,0.02584686,-0.00102942,-0.00102942,0.2478073,-0.06617186,-0.173399,-0.0905237,-0.04821916,-0.01588312,-0.012311,-0.08189658,-0.5070269,-0.1821675,-0.1894885,-0.2176216,0.1301562,0.07120664,0.04719412,0.3660696,0.4524972,-0.3173993,-0.2515596,-0.4584023,0.3002585,-0.2049891,-0.158096,-0.1508652,-0.2406065,-0.1248493,-0.06865956,-0.03526423,0.5209868
75%,1.573487,0.7212585,0.999268,0.01195437,1.150492,0.5966703,1.301014,0.04967718,0.6016701,0.521333,0.2959577,0.9262847,0.717946,1.303727,1.422635,-0.05914003,0.4347932,0.561437,0.6717688,0.1554213,0.05378481,0.02025255,0.1744396,0.177784,-0.3572571,-0.1894448,-0.1055113,-0.05373153,0.2277508,1.04947,1.058606,0.001381113,0.1987396,-0.3015171,0.7762241,1.686796,1.361814,-0.3728359,-0.06248441,-0.0818441,-0.2601739,0.02584686,-0.00102942,-0.00102942,0.2478073,-0.06617186,-0.173399,-0.0905237,-0.04821916,-0.01588312,-0.012311,-0.08189658,0.6551821,0.003730048,0.0651188,0.1027016,0.8899273,0.07120664,0.04719412,1.005268,0.4524972,0.7610774,-0.1265222,-0.4584023,0.3002585,-0.2049891,-0.158096,-0.1508652,-0.2406065,-0.1248493,-0.06865956,-0.03526423,0.8833563
max,1.573487,3.151643,0.999268,17.5088,1.150492,6.142634,1.301014,24.47395,2.12548,2.077271,3.541674,1.424914,1.836612,1.407164,1.568215,2.268308,1.14536,0.7813328,0.6717688,1.204825,2.872384,7.450853,1.063347,1.097885,4.587511,7.966953,13.61829,26.56658,1.344579,1.638378,1.644929,0.001381113,0.1987396,3.316562,0.7762241,1.686796,1.361814,2.682145,16.00399,12.21835,3.843583,0.02584686,971.4206,971.4206,0.2478073,15.11216,5.767045,11.04683,20.73864,62.95994,81.22816,12.21052,1.817391,149.4294,171.5465,47.54535,1.839641,9.465611,10.35489,1.005268,0.4524972,2.0193,3.457881,2.18149,0.3002585,4.878308,6.32527,6.628434,6.550362,12.32954,21.87934,40.23854,1.129553
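
The `std` row above reads 1.000001 rather than exactly 1. This is expected: `StandardScaler` divides by the population standard deviation (`ddof=0`), while `DataFrame.describe()` reports the sample standard deviation (`ddof=1`). The sketch below reproduces the effect on synthetic data. The array `demo` and its size are illustrative assumptions, not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic data standing in for the real feature matrix
rng = np.random.default_rng(0)
demo = rng.normal(loc=50.0, scale=12.0, size=(10_000, 3))

demo_norm = StandardScaler().fit_transform(demo)

# StandardScaler uses the population std (ddof=0), so this is 1 up to rounding
std_population = demo_norm.std(axis=0, ddof=0)

# pandas std()/describe() default to the sample std (ddof=1)
std_sample = pd.DataFrame(demo_norm).std().to_numpy()

# The ratio between the two estimates is sqrt(n / (n - 1))
n = demo_norm.shape[0]
expected_ratio = np.sqrt(n / (n - 1))

print(std_population)
print(std_sample)
print(expected_ratio)
```

For the 943,659 rows used here the factor is about 1.0000005, which is displayed as 1.000001.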


# **KMeans**

In [None]:
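n_kmax = 35  # exclusive upper bound: k runs from 1 to 34
lista_inercia = []

# Fit KMeans for each candidate k and record the inertia
# (within-cluster sum of squared distances) for the elbow plot
for i in range(1, n_kmax):
    km = KMeans(n_clusters=i, init='random', max_iter=150, n_init=3, random_state=0)
    km.fit(X_norm)
    lista_inercia.append(km.inertia_)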

In [None]:
fig = px.line(x=range(1, n_kmax), y=lista_inercia,
              labels=dict(x='Number of clusters', y='Distortion')
              )
fig.update_traces(mode='lines+markers')

fig.show()
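
The quantity plotted on the y-axis is `inertia_`: the sum of squared distances from each sample to its assigned cluster center. A minimal sketch on synthetic blobs checks that definition by hand. The data and the names `X_demo`, `km_demo` and `manual_inertia` are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical toy data: 300 points around 4 centers
X_demo, _ = make_blobs(n_samples=300, centers=4, random_state=0)

inertias = []
for n_clusters in range(1, 8):
    km_demo = KMeans(n_clusters=n_clusters, init='random', n_init=3,
                     max_iter=150, random_state=0).fit(X_demo)
    inertias.append(km_demo.inertia_)

# Recompute the inertia of the last model by hand:
# squared distance of every point to its assigned center, summed
assigned_centers = km_demo.cluster_centers_[km_demo.labels_]
manual_inertia = np.sum((X_demo - assigned_centers) ** 2)

print(inertias)
print(manual_inertia, km_demo.inertia_)
```

Inertia generally keeps falling as k grows, so the elbow is read as the point where the decrease flattens rather than as a minimum.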

**k = 21**

In [None]:
k = 21
# n_init is set explicitly; 10 is the value shown in the fitted model's repr below
km = KMeans(n_clusters=k, init='random', max_iter=150, n_init=10, random_state=0)
km.fit(X_norm)

KMeans(algorithm='auto', copy_x=True, init='random', max_iter=150,
       n_clusters=21, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)

In [None]:
y_cluster = km.predict(X_norm)
y_cluster

array([17,  6,  6, ..., 10,  2, 11], dtype=int32)

In [None]:
df['GRUPO'] = y_cluster
df.head()

Unnamed: 0,ESCOLARI,IDADE,SEXO,IBGE,CATEATEND,CLINICA,DIAGPREV,BASEDIAG,TOPO,TOPOGRUP,MORFO,EC,ECGRUP,T,N,M,PT,PN,PM,G,LOCALTNM,IDMITOTIC,PSA,GLEASON,META01,META02,META03,META04,NAOTRAT,TRATAMENTO,TRATHOSP,TRATFANTES,TRATFAPOS,NENHUM,CIRURGIA,RADIO,QUIMIO,HORMONIO,TMO,IMUNO,OUTROS,NENHUMANT,CIRURANT,RADIOANT,NENHUMAPOS,CIRURAPOS,RADIOAPOS,QUIMIOAPOS,HORMOAPOS,TMOAPOS,IMUNOAPOS,OUTROAPOS,ULTINFO,CONSDIAG,TRATCONS,DIAGTRAT,ANODIAG,CICI,CICIGRUP,FAIXAETAR,LATERALI,DRS,RRAS,PERDASEG,RECNENHUM,RECLOCAL,RECREGIO,RECDIST,REC01,REC02,REC03,REC04,IBGEATEN,GRUPO
0,4,40.0,2,3530805,9,15,1,3,222,45,81402,0,0,25,15,0,30,0,0,8,8,8,8,8,0,0,0,0,8,0,0,2,9,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,2,7,95.0,88.0,2000,23,5,4,8,14,15,1,1,0,0,0,0,0,0,0,3509502,17
1,9,45.0,2,3509502,9,15,1,3,222,45,80703,19,3,25,15,0,41,22,7,8,8,8,8,8,0,0,0,0,8,5,5,2,9,0,0,1,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,3,12,15.0,3.0,2000,23,5,4,8,7,15,0,1,0,0,0,0,0,0,0,3509502,6
2,2,63.0,2,3509502,9,15,1,3,222,45,80703,19,3,25,15,0,41,22,7,8,8,8,8,8,0,0,0,0,8,1,1,2,9,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,3,6,15.0,9.0,2000,23,5,6,8,7,15,0,1,0,0,0,0,0,0,0,3509502,6
3,9,64.0,2,3545803,9,15,1,3,222,45,80703,19,3,25,15,0,41,22,7,8,8,8,8,8,0,0,0,0,8,1,1,2,9,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,3,6,33.0,27.0,2000,23,5,6,8,7,15,0,1,0,0,0,0,0,0,0,3509502,6
4,1,48.0,2,3530805,9,15,2,3,222,45,80703,19,3,25,15,0,41,22,7,8,8,8,8,8,0,0,0,0,8,1,1,2,9,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,2,0,132.0,132.0,2000,23,5,4,8,14,15,1,1,0,0,0,0,0,0,0,3509502,17


## **Saving the models**

In [None]:
with open('/content/drive/MyDrive/Trabalho/Cancer/Modelos/scaler_kmeans.pkl', 'wb') as handle:
    pickle.dump({'scaler': scaler}, handle)

In [None]:
with open('/content/drive/MyDrive/Trabalho/Cancer/Modelos/kmeans_full.pkl', 'wb') as handle:
    pickle.dump({'kmeans': km}, handle)
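
To reuse the saved objects later, the pickles can be loaded back and applied in the same order: scale first, then assign clusters. The sketch below is a minimal round trip under stated assumptions: a temporary directory replaces the Google Drive path, synthetic data replaces the real features, and both objects are stored in one file for brevity (the notebook uses two).

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the notebook's data and fitted objects
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
scaler_demo = StandardScaler().fit(X_demo)
km_demo = KMeans(n_clusters=3, n_init=10, random_state=0).fit(
    scaler_demo.transform(X_demo))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'scaler_kmeans_demo.pkl')  # illustrative file name

    # Save both objects in one dictionary
    with open(path, 'wb') as handle:
        pickle.dump({'scaler': scaler_demo, 'kmeans': km_demo}, handle)

    # Load them back
    with open(path, 'rb') as handle:
        loaded = pickle.load(handle)

# Apply the restored pipeline: scale first, then assign clusters
labels_restored = loaded['kmeans'].predict(loaded['scaler'].transform(X_demo))
labels_original = km_demo.predict(scaler_demo.transform(X_demo))
print(np.array_equal(labels_restored, labels_original))
```

Pickled scikit-learn objects are best loaded with the same library version that wrote them, so it is worth recording the version alongside the files.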

## **Saving the data to CSV**

In [None]:
df.to_csv('/content/drive/MyDrive/Trabalho/Cancer/Datasets/kmeans_preprocessing.csv', encoding='utf-8', index=False)

# **Subset**

In [3]:
df = pd.read_csv('/content/drive/MyDrive/Trabalho/Cancer/Datasets/kmeans_preprocessing.csv')
print(df.shape)
df.head(3)

(943659, 74)


Unnamed: 0,ESCOLARI,IDADE,SEXO,IBGE,CATEATEND,CLINICA,DIAGPREV,BASEDIAG,TOPO,TOPOGRUP,MORFO,EC,ECGRUP,T,N,M,PT,PN,PM,G,LOCALTNM,IDMITOTIC,PSA,GLEASON,META01,META02,META03,META04,NAOTRAT,TRATAMENTO,TRATHOSP,TRATFANTES,TRATFAPOS,NENHUM,CIRURGIA,RADIO,QUIMIO,HORMONIO,TMO,IMUNO,OUTROS,NENHUMANT,CIRURANT,RADIOANT,NENHUMAPOS,CIRURAPOS,RADIOAPOS,QUIMIOAPOS,HORMOAPOS,TMOAPOS,IMUNOAPOS,OUTROAPOS,ULTINFO,CONSDIAG,TRATCONS,DIAGTRAT,ANODIAG,CICI,CICIGRUP,FAIXAETAR,LATERALI,DRS,RRAS,PERDASEG,RECNENHUM,RECLOCAL,RECREGIO,RECDIST,REC01,REC02,REC03,REC04,IBGEATEN,GRUPO
0,4,40.0,2,3530805,9,15,1,3,222,45,81402,0,0,25,15,0,30,0,0,8,8,8,8,8,0,0,0,0,8,0,0,2,9,0,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,2,7,95.0,88.0,2000,23,5,4,8,14,15,1,1,0,0,0,0,0,0,0,3509502,17
1,9,45.0,2,3509502,9,15,1,3,222,45,80703,19,3,25,15,0,41,22,7,8,8,8,8,8,0,0,0,0,8,5,5,2,9,0,0,1,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,3,12,15.0,3.0,2000,23,5,4,8,7,15,0,1,0,0,0,0,0,0,0,3509502,6
2,2,63.0,2,3509502,9,15,1,3,222,45,80703,19,3,25,15,0,41,22,7,8,8,8,8,8,0,0,0,0,8,1,1,2,9,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,3,6,15.0,9.0,2000,23,5,6,8,7,15,0,1,0,0,0,0,0,0,0,3509502,6


In [4]:
k = len(df.GRUPO.unique())
k # number of clusters

21

In [5]:
df.GRUPO.value_counts()

0     140385
2     115872
6     115482
14     92842
12     73438
17     69336
18     58709
1      43255
7      34455
16     30368
8      26956
11     26899
19     26133
15     22627
13     20611
5      17285
10      9994
9       7529
20      4527
4       3762
3       3194
Name: GRUPO, dtype: int64

In [6]:
n_samples = 10000  # number of rows to draw from the full dataset

df_subset = df.sample(n_samples, random_state=7).sort_index().copy()
df_subset.GRUPO.value_counts()

0     1487
6     1234
2     1214
14    1018
12     777
17     692
18     620
1      427
7      381
8      314
19     288
16     287
11     281
15     241
13     207
5      203
10     114
9       78
20      57
3       42
4       38
Name: GRUPO, dtype: int64
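
Simple random sampling roughly preserves the cluster proportions, but the smallest clusters end up with only a few dozen rows (38, 42 and 57 above). If the share of each cluster should be kept exactly, a stratified draw is one alternative. The sketch below uses `DataFrame.groupby(...).sample` (available in pandas 1.1 and later) on a hypothetical toy frame. It is not what the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with an imbalanced cluster label
rng = np.random.default_rng(7)
df_demo = pd.DataFrame({
    'feature': rng.normal(size=10_000),
    'GRUPO': rng.choice([0, 1, 2], size=10_000, p=[0.7, 0.25, 0.05]),
})

# Draw 10% of each cluster so every group keeps its share
df_strat = (df_demo.groupby('GRUPO')
                   .sample(frac=0.1, random_state=7)
                   .sort_index())

full_share = df_demo.GRUPO.value_counts(normalize=True).sort_index()
strat_share = df_strat.GRUPO.value_counts(normalize=True).sort_index()
print(pd.DataFrame({'full': full_share, 'stratified': strat_share}))
```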

In [7]:
cols = df.columns
cols = cols.drop(['GRUPO'])
len(cols)

73

In [8]:
df_subset[cols].shape

(10000, 73)

## **Normalization**



In [9]:
X = df_subset[cols].values
X.shape

(10000, 73)

In [10]:
scaler = StandardScaler()
X_norm = scaler.fit_transform(X)

In [11]:
pd.DataFrame(X_norm, columns=cols).describe()

Unnamed: 0,ESCOLARI,IDADE,SEXO,IBGE,CATEATEND,CLINICA,DIAGPREV,BASEDIAG,TOPO,TOPOGRUP,MORFO,EC,ECGRUP,T,N,M,PT,PN,PM,G,LOCALTNM,IDMITOTIC,PSA,GLEASON,META01,META02,META03,META04,NAOTRAT,TRATAMENTO,TRATHOSP,TRATFANTES,TRATFAPOS,NENHUM,CIRURGIA,RADIO,QUIMIO,HORMONIO,TMO,IMUNO,OUTROS,NENHUMANT,CIRURANT,RADIOANT,NENHUMAPOS,CIRURAPOS,RADIOAPOS,QUIMIOAPOS,HORMOAPOS,TMOAPOS,IMUNOAPOS,OUTROAPOS,ULTINFO,CONSDIAG,TRATCONS,DIAGTRAT,ANODIAG,CICI,CICIGRUP,FAIXAETAR,LATERALI,DRS,RRAS,PERDASEG,RECNENHUM,RECLOCAL,RECREGIO,RECDIST,REC01,REC02,REC03,REC04,IBGEATEN
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,-8.859358e-16,8.159029e-16,1.181544e-15,7.986162e-16,9.289391e-15,1.514527e-15,1.353451e-15,5.957457e-16,-4.098541e-16,-6.483439e-16,-2.82383e-15,1.645439e-15,5.773382e-16,3.623235e-15,1.709188e-15,-1.274536e-15,-5.033829e-15,2.23761e-15,6.8896e-15,-3.73846e-15,-5.035861e-16,-1.247846e-15,4.381501e-15,-8.059226e-15,2.427059e-16,-2.996808e-15,6.332518e-16,-1.884272e-15,1.011347e-15,3.197886e-16,-2.236877e-16,0.0,-1.535e-15,-3.478018e-15,-3.352696e-15,1.726685e-15,1.437095e-15,-1.05892e-15,-1.457365e-15,2.18999e-15,-6.115242e-15,-3.262036e-16,0.0,0.0,2.898137e-15,1.484401e-15,-3.244571e-16,-3.158251e-16,3.606962e-16,-1.523347e-16,1.971477e-15,-3.160716e-15,-3.462941e-15,2.861406e-16,-5.567213e-17,-7.970624e-16,-1.346643e-14,-9.750534000000001e-17,-1.205075e-15,1.680722e-15,-6.789014e-17,1.403575e-14,1.730444e-15,1.952074e-14,7.771428e-15,-7.851914e-15,2.155442e-16,1.363338e-14,-3.794182e-15,-2.505463e-15,-3.549522e-16,5.807591e-16,1.125388e-14
std,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,0.0,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,0.0,0.0,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005,1.00005
min,-1.13387,-3.550412,-1.005415,-6.788184,-1.109469,-1.613128,-0.7793963,-7.546369,-2.63134,-2.601522,-0.7947671,-1.479956,-1.525293,-1.297038,-0.7714408,-0.5342507,-2.180944,-1.760598,-1.769725,-7.127178,-17.55974,-48.64311,-5.792416,-5.990581,-0.3589617,-0.1852981,-0.1102052,-0.05402606,-7.612654,-1.021117,-0.9999033,0.0,-5.970787,-0.3075239,-1.278146,-0.5862805,-0.746423,-0.3713695,-0.06797983,-0.0783418,-0.2533168,-37.78322,0.0,0.0,-3.986464,-0.07229925,-0.1755607,-0.08923512,-0.05754065,-0.01414355,-0.01414355,-0.08025724,-1.663895,-0.328121,-0.5020728,-0.4333664,-1.979241,-8.553143,-7.256808,-3.47978,-2.415753,-1.045299,-0.6290751,-0.453852,-3.328257,-0.1960143,-0.1620882,-0.1564775,-0.2487893,-0.1281075,-0.07581707,-0.03404664,-1.595557
25%,-0.7944184,-0.528462,-1.005415,-0.07855127,-0.8270032,-0.7414669,-0.7793963,0.04097164,-0.5957886,-0.5118486,-0.62995,-1.181419,-0.9702979,-0.8850835,-0.7714408,-0.5342507,-0.7574598,-0.9884735,0.1232025,0.1583031,0.0588966,0.02155149,0.1832832,0.1852757,-0.3589617,-0.1852981,-0.1102052,-0.05402606,0.2273604,-1.021117,-0.9999033,0.0,0.1995061,-0.3075239,-1.278146,-0.5862805,-0.746423,-0.3713695,-0.06797983,-0.0783418,-0.2533168,0.02646678,0.0,0.0,0.2508489,-0.07229925,-0.1755607,-0.08923512,-0.05754065,-0.01414355,-0.01414355,-0.08025724,-0.5020383,-0.3094153,-0.4033595,-0.4333664,-0.8363711,0.07586019,0.04791901,-0.282687,0.4471436,-0.8665417,-0.4217756,-0.453852,0.3004576,-0.1960143,-0.1620882,-0.1564775,-0.2487893,-0.1281075,-0.07581707,-0.03404664,-1.24499
50%,-0.4549669,0.1233311,0.9946145,-0.01388332,-0.8270032,0.2094365,-0.7793963,0.04097164,-0.0797334,-0.03998691,-0.4950608,0.01272761,-0.4153028,-0.2671523,-0.7714408,-0.5342507,0.4287773,0.555775,0.6640389,0.1583031,0.0588966,0.02155149,0.1832832,0.1852757,-0.3589617,-0.1852981,-0.1102052,-0.05402606,0.2273604,-0.4316085,-0.4138622,0.0,0.1995061,-0.3075239,0.782383,-0.5862805,-0.746423,-0.3713695,-0.06797983,-0.0783418,-0.2533168,0.02646678,0.0,0.0,0.2508489,-0.07229925,-0.1755607,-0.08923512,-0.05754065,-0.01414355,-0.01414355,-0.08025724,-0.5020383,-0.1909462,-0.2059328,-0.2231321,0.1160203,0.07586019,0.04791901,0.3567317,0.4471436,-0.3302712,-0.2559361,-0.453852,0.3004576,-0.1960143,-0.1620882,-0.1564775,-0.2487893,-0.1281075,-0.07581707,-0.03404664,0.5293036
75%,1.581742,0.7158703,0.9946145,0.01277046,1.150257,0.6056462,1.283044,0.04097164,0.63701,0.4992836,0.2745131,0.9083376,0.6946874,1.277676,1.398779,-0.0764131,0.4287773,0.555775,0.6640389,0.1583031,0.0588966,0.02155149,0.1832832,0.1852757,-0.3589617,-0.1852981,-0.1102052,-0.05402606,0.2273604,1.042162,1.05124,0.0,0.1995061,-0.3075239,0.782383,1.705668,1.339723,-0.3713695,-0.06797983,-0.0783418,-0.2533168,0.02646678,0.0,0.0,0.2508489,-0.07229925,-0.1755607,-0.08923512,-0.05754065,-0.01414355,-0.01414355,-0.08025724,0.6598184,0.008580911,0.07502063,0.09221926,0.8779334,0.07586019,0.04791901,0.9961503,0.4471436,0.74227,-0.1315564,-0.453852,0.3004576,-0.1960143,-0.1620882,-0.1564775,-0.2487893,-0.1281075,-0.07581707,-0.03404664,0.8907569
max,1.581742,2.493488,0.9946145,17.91432,1.150257,6.152583,1.283044,22.80299,2.099167,2.049686,3.457556,1.405899,1.804678,1.380664,1.54346,2.212775,1.14052,0.776382,0.6640389,1.199086,2.575845,6.973646,1.036955,1.067541,4.463218,7.976557,13.31335,24.39216,1.347362,1.631671,1.637282,0.0,0.1995061,3.25178,0.782383,1.705668,1.339723,2.692736,14.71024,12.76458,3.947626,0.02646678,0.0,0.0,0.2508489,13.8314,5.696035,11.20635,17.37902,70.70361,70.70361,12.45994,1.821675,27.54955,28.49687,28.9403,1.830325,9.455211,10.27454,0.9961503,0.4471436,1.993568,3.433994,2.203361,0.3004576,5.101669,6.169481,6.390697,6.323989,12.37781,18.90217,47.19193,1.136332


# **PCA and t-SNE**

In [12]:
pca = PCA(n_components=2)
pca_result = pca.fit_transform(X_norm)

df_subset['pca_one'] = pca_result[:,0]
df_subset['pca_two'] = pca_result[:,1]

print('Cumulative explained variation for 2 principal components: {}'.format(np.sum(pca.explained_variance_ratio_)))

Cumulative explained variation for 2 principal components: 0.13996969796723632


In [37]:
# First two principal components, colored by GRUPO (cast to str so the legend is categorical).
fig = px.scatter(df_subset, x="pca_one", y="pca_two", color=df_subset.GRUPO.astype(str),
                 height=700, width=1000)
fig.update_traces(marker=dict(size=7, line=dict(width=0.7, color='white')),
                  selector=dict(mode='markers'))

fig.show()

In [14]:
pca_50 = PCA(n_components=50)

pca_result_50 = pca_50.fit_transform(X_norm)

print('Cumulative explained variation for 50 principal components: {}'.format(np.sum(pca_50.explained_variance_ratio_)))

Cumulative explained variation for 50 principal components: 0.9560360734221055
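The two outputs above show that 2 components keep about 14% of the variance while 50 keep about 96%. As a minimal sketch of how the number of components could be chosen from a target ratio instead of being fixed at 50, the cell below fits a full PCA and reads the cumulative curve. `X_demo`, `X_demo_norm`, `target` and `n_needed` are hypothetical names, and the data is synthetic rather than the notebook's `X_norm`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_norm (assumption: not the cancer dataset).
rng = np.random.default_rng(7)
X_demo = rng.normal(size=(300, 20)) @ rng.normal(size=(20, 20))
X_demo_norm = StandardScaler().fit_transform(X_demo)

# Fit a full PCA once and accumulate the explained variance ratios.
pca_full = PCA().fit(X_demo_norm)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the target.
target = 0.95
n_needed = int(np.searchsorted(cum_var, target) + 1)
print(f'{n_needed} components explain {cum_var[n_needed - 1]:.3f} of the variance')
```

scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components`, which selects the number of components from the variance ratio directly.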


In [15]:
tsne = TSNE(n_components=2, random_state=7, perplexity=30)
tsne_pca_results = tsne.fit_transform(pca_result_50)

df_subset['tsne_pca50_one'] = tsne_pca_results[:,0]
df_subset['tsne_pca50_two'] = tsne_pca_results[:,1]

In [16]:
# t-SNE embedding (perplexity=30) of the 50 principal components, colored by GRUPO.
fig = px.scatter(df_subset, x="tsne_pca50_one", y="tsne_pca50_two", color=df_subset.GRUPO.astype(str),
                 height=700, width=1000)
fig.update_traces(marker=dict(size=7, line=dict(width=0.7, color='white')),
                  selector=dict(mode='markers'))

fig.show()

In [17]:
tsne = TSNE(n_components=2, random_state=7, perplexity=50)
tsne_pca_results = tsne.fit_transform(pca_result_50)

df_subset['tsne_pca50_one'] = tsne_pca_results[:,0]
df_subset['tsne_pca50_two'] = tsne_pca_results[:,1]

In [18]:
# t-SNE embedding (perplexity=50) of the 50 principal components, colored by GRUPO.
fig = px.scatter(df_subset, x="tsne_pca50_one", y="tsne_pca50_two", color=df_subset.GRUPO.astype(str),
                 height=700, width=1000)
fig.update_traces(marker=dict(size=7, line=dict(width=0.7, color='white')),
                  selector=dict(mode='markers'))

fig.show()

In [19]:
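# Collect the t-SNE coordinates (perplexity=50 run above) together with the K-Means group labels.
df_kmeans = pd.DataFrame()
df_kmeans['tsne_pca50_one'] = df_subset['tsne_pca50_one'].copy()
df_kmeans['tsne_pca50_two'] = df_subset['tsne_pca50_two'].copy()
df_kmeans['grupo_kmeans'] = df_subset['GRUPO'].copy()
# Keep the original row labels as an explicit 'index' column.
df_kmeans.reset_index(inplace=True)
df_kmeans.head()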

Unnamed: 0,index,tsne_pca50_one,tsne_pca50_two,grupo_kmeans
0,47,32.359104,-8.554454,0
1,84,-10.798744,-22.420107,6
2,113,-19.763067,64.418961,7
3,176,17.245131,21.340729,12
4,232,20.317442,32.953377,12


In [20]:
df_kmeans.to_csv('/content/drive/MyDrive/Trabalho/Cancer/Datasets/kmeans_tsne.csv', encoding='utf-8', index=False)

In [21]:
tsne = TSNE(n_components=2, random_state=7, perplexity=100)
tsne_pca_results = tsne.fit_transform(pca_result_50)

df_subset['tsne_pca50_one'] = tsne_pca_results[:,0]
df_subset['tsne_pca50_two'] = tsne_pca_results[:,1]

In [22]:
# t-SNE embedding (perplexity=100) of the 50 principal components, colored by GRUPO.
fig = px.scatter(df_subset, x='tsne_pca50_one', y='tsne_pca50_two', color=df_subset.GRUPO.astype(str),
                 height=700, width=1000)
fig.update_traces(marker=dict(size=7, line=dict(width=0.7, color='white')),
                  selector=dict(mode='markers'))

fig.show()
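The three t-SNE runs above differ only in `perplexity` (30, 50, 100) and write to the same `tsne_pca50_one` / `tsne_pca50_two` columns, so each run overwrites the previous one. A possible way to keep all runs side by side is sketched below. The `embeddings` dictionary and the synthetic `X_demo` array (standing in for `pca_result_50`) are assumptions for illustration only.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(7)
# Hypothetical stand-in for pca_result_50: three Gaussian blobs in 10 dimensions.
X_demo = np.vstack([rng.normal(loc=c, size=(80, 10)) for c in (0.0, 5.0, 10.0)])

# One embedding per perplexity, stored under its own key instead of overwritten.
embeddings = {}
for perplexity in (30, 50, 100):
    tsne = TSNE(n_components=2, random_state=7, perplexity=perplexity)
    embeddings[perplexity] = tsne.fit_transform(X_demo)
    # kl_divergence_ is the final value of the t-SNE objective for this run.
    print(perplexity, embeddings[perplexity].shape, tsne.kl_divergence_)
```

Each array in `embeddings` could then be written to its own pair of columns (for example with the perplexity in the column name) before plotting.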

# **Balanced**

In [26]:
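# Build a balanced sample: 500 rows from each of the 21 groups (GRUPO 0 to 20).
df_test = df[df.GRUPO == 3].sample(500, random_state=7)
n = df_test.shape[0]
for i in range(21):
  if i != 3:
    df_aux = df[df.GRUPO == i].sample(n, random_state=7)
    df_test = pd.concat([df_test, df_aux])

# Restore the original row order and check the per-group counts.
df_test = df_test.sort_index()
df_test.GRUPO.value_counts().sort_index()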

0     500
1     500
2     500
3     500
4     500
5     500
6     500
7     500
8     500
9     500
10    500
11    500
12    500
13    500
14    500
15    500
16    500
17    500
18    500
19    500
20    500
Name: GRUPO, dtype: int64
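The loop above hard-codes the number of groups (21) and the seed group (3). As a hedged alternative, pandas' `DataFrameGroupBy.sample` can draw the same number of rows from every group in one call. The sketch below uses a small synthetic frame (`df_demo`, hypothetical) and is not guaranteed to select the same rows as the loop, even with the same `random_state`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Hypothetical imbalanced frame: three groups with very different sizes.
df_demo = pd.DataFrame({
    'GRUPO': np.repeat([0, 1, 2], [1000, 200, 50]),
    'IDADE': rng.integers(0, 100, size=1250),
})

# Draw the same number of rows from every group, capped by the smallest group.
n_per_group = int(min(50, df_demo.GRUPO.value_counts().min()))
df_balanced = (
    df_demo.groupby('GRUPO')
    .sample(n=n_per_group, random_state=7)
    .sort_index()
)
print(df_balanced.GRUPO.value_counts().sort_index())
```

Capping `n_per_group` by the smallest group avoids the error that `sample` raises when a group has fewer rows than requested.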

In [27]:
# Feature columns: every column except the GRUPO label.
cols = df_test.columns
cols = cols.drop(['GRUPO'])
len(cols)

73

In [28]:
X = df_test[cols]
X.shape

(10500, 73)

In [29]:
s = StandardScaler()
X_norm = s.fit_transform(X)

In [30]:
pd.DataFrame(data=X_norm, columns=cols).describe()

Unnamed: 0,ESCOLARI,IDADE,SEXO,IBGE,CATEATEND,CLINICA,DIAGPREV,BASEDIAG,TOPO,TOPOGRUP,MORFO,EC,ECGRUP,T,N,M,PT,PN,PM,G,LOCALTNM,IDMITOTIC,PSA,GLEASON,META01,META02,META03,META04,NAOTRAT,TRATAMENTO,TRATHOSP,TRATFANTES,TRATFAPOS,NENHUM,CIRURGIA,RADIO,QUIMIO,HORMONIO,TMO,IMUNO,OUTROS,NENHUMANT,CIRURANT,RADIOANT,NENHUMAPOS,CIRURAPOS,RADIOAPOS,QUIMIOAPOS,HORMOAPOS,TMOAPOS,IMUNOAPOS,OUTROAPOS,ULTINFO,CONSDIAG,TRATCONS,DIAGTRAT,ANODIAG,CICI,CICIGRUP,FAIXAETAR,LATERALI,DRS,RRAS,PERDASEG,RECNENHUM,RECLOCAL,RECREGIO,RECDIST,REC01,REC02,REC03,REC04,IBGEATEN
count,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0,10500.0
mean,4.06514e-15,-1.406282e-16,4.068069e-15,-6.759461e-16,1.132267e-14,8.557996e-16,-4.4366630000000003e-17,-1.793883e-15,4.186281e-16,1.582121e-15,-1.473076e-15,4.780938e-16,-2.088594e-16,4.579881e-16,-1.397093e-15,1.543946e-15,-2.956291e-15,2.342623e-15,7.932702e-16,9.15805e-15,2.718503e-15,-2.957434e-15,-2.360096e-15,-3.598502e-15,9.203918e-15,1.765202e-15,-1.58054e-15,-3.987279e-15,6.04518e-15,3.092341e-16,-4.353237e-16,0.0,-4.228152e-16,-4.916575e-15,2.003244e-15,-3.347354e-15,-3.481659e-16,-2.808938e-15,1.723886e-15,1.831083e-15,3.986324e-15,6.314565e-15,0.0,0.0,-1.011857e-15,3.241235e-15,3.133039e-15,-1.588227e-15,-2.21203e-15,-9.982802e-16,0.0,-1.857985e-15,3.081133e-17,-5.913788e-16,4.104177e-16,-6.933858e-16,-2.284179e-14,-2.051692e-15,1.846269e-15,-4.342473e-15,3.583012e-15,-3.631021e-15,1.027411e-15,-1.989349e-14,-9.034858e-15,1.266955e-15,-4.557069e-15,1.892099e-14,1.006925e-14,-6.084694e-15,3.961487e-15,1.276347e-15,5.611268e-15
std,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,0.0,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,0.0,0.0,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,0.0,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048,1.000048
min,-1.138183,-3.133441,-0.9615024,-4.931477,-0.9957704,-1.733983,-0.8758321,-7.487967,-2.54823,-2.523728,-0.841566,-1.70597,-1.689557,-1.421232,-0.8115424,-0.560724,-2.581527,-2.027879,-2.007535,-3.802002,-5.185783,-100.4357,-4.573592,-4.788707,-0.4098543,-0.271656,-0.2218896,-0.1146416,-6.89666,-1.234124,-1.167931,0.0,-3.684934,-0.3305042,-1.062319,-0.6286203,-0.8398667,-0.416784,-0.1014662,-0.08013695,-0.2842585,-45.81484,0.0,0.0,-2.617901,-0.1110897,-0.2629402,-0.2275699,-0.06986308,-0.02182699,0.0,-0.1052315,-1.648417,-0.2881619,-0.3060262,-0.323552,-2.138612,-5.584279,-5.019104,-2.9948,-2.300583,-0.9544647,-0.6595538,-0.4355076,-2.271202,-0.2791381,-0.2147774,-0.2496456,-0.3438272,-0.2440119,-0.2089406,-0.1146686,-1.592594
25%,-0.7936073,-0.463819,-0.9615024,-0.09784411,-0.7023708,-0.6438847,-0.8758321,0.042313,-0.6222383,-0.4971448,-0.6971093,-0.5478065,-0.5371749,-0.7778356,-0.8115424,-0.560724,0.3639273,0.4840999,0.009605431,0.3062461,0.2205949,0.006832827,0.2352746,0.2359181,-0.4098543,-0.271656,-0.2218896,-0.1146416,0.2402266,-1.234124,-1.167931,0.0,0.3395934,-0.3305042,-1.062319,-0.6286203,-0.8398667,-0.416784,-0.1014662,-0.08013695,-0.2842585,0.02182699,0.0,0.0,0.3819855,-0.1110897,-0.2629402,-0.2275699,-0.06986308,-0.02182699,0.0,-0.1052315,-0.4861569,-0.2778203,-0.2715356,-0.317894,-0.7772056,0.1682158,0.1492421,-0.7007294,0.4719763,-0.7767624,-0.4985457,-0.4355076,0.4402955,-0.2791381,-0.2147774,-0.2496456,-0.3438272,-0.2440119,-0.2089406,-0.1146686,-1.243704
50%,-0.4490315,0.135484,-0.9615024,-0.04723456,-0.7023708,0.1107987,-0.8758321,0.042313,0.278405,0.2219653,-0.5530642,0.08391926,0.03901638,-0.3489047,-0.6677247,-0.560724,0.3639273,0.4840999,0.5859313,0.3062461,0.2205949,0.006832827,0.2352746,0.2359181,-0.4098543,-0.271656,-0.2218896,-0.1146416,0.2402266,-0.0312741,0.004102115,0.0,0.3395934,-0.3305042,0.9413366,-0.6286203,-0.8398667,-0.416784,-0.1014662,-0.08013695,-0.2842585,0.02182699,0.0,0.0,0.3819855,-0.1110897,-0.2629402,-0.2275699,-0.06986308,-0.02182699,0.0,-0.1052315,-0.4861569,-0.2364539,-0.2099454,-0.2273658,0.1952275,0.1682158,0.1492421,0.446306,0.4719763,-0.59906,-0.3697391,-0.4355076,0.4402955,-0.2791381,-0.2147774,-0.2496456,-0.3438272,-0.2440119,-0.2089406,-0.1146686,0.5484067
75%,1.618423,0.6803048,1.040039,-0.02860989,1.351427,0.613921,1.141771,0.042313,0.8465031,0.8757017,0.2309529,0.8209327,0.6152076,1.259586,1.345722,-0.104992,0.3639273,0.4840999,0.5859313,0.3062461,0.2205949,0.006832827,0.2352746,0.2359181,-0.4098543,-0.271656,-0.2218896,-0.1146416,0.2402266,1.171576,1.176135,0.0,0.3395934,-0.3305042,0.9413366,1.590786,1.190665,-0.416784,-0.1014662,-0.08013695,-0.2842585,0.02182699,0.0,0.0,0.3819855,-0.1110897,-0.2629402,-0.2275699,-0.06986308,-0.02182699,0.0,-0.1052315,0.6761034,-0.1599261,-0.1114009,-0.06611251,0.7786874,0.1682158,0.1492421,1.019824,0.4719763,0.6448564,-0.208731,-0.4355076,0.4402955,-0.2791381,-0.2147774,-0.2496456,-0.3438272,-0.2440119,-0.2089406,-0.1146686,0.8818264
max,1.618423,2.423732,1.040039,12.87678,1.351427,6.399827,1.141771,22.63315,2.024268,1.987054,3.251785,1.347371,1.76759,1.366819,1.48954,2.173668,1.167233,0.723336,0.5859313,0.8931387,0.9929346,14.35577,0.9222556,0.9537218,3.968499,5.44489,6.486012,12.37743,1.259782,1.472289,1.469143,0.0,0.3395934,3.02568,0.9413366,1.590786,1.190665,2.399324,9.855498,12.47864,3.517925,0.02182699,0.0,0.0,0.3819855,9.001736,3.803146,4.394255,14.31371,45.81484,0.0,9.50286,1.838364,11.41233,13.63062,13.14251,1.75112,6.170819,7.384926,1.019824,0.4719763,2.066475,2.496206,2.296171,0.4402955,3.582456,4.655984,4.005678,4.649107,6.203359,7.18324,12.05859,1.126226
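In the summary above, a few columns (for example TRATFANTES, CIRURANT and RADIOANT) have a standard deviation of 0.0, meaning they are constant in this sample and contribute nothing to PCA or t-SNE. One possible way to drop them before scaling is scikit-learn's `VarianceThreshold`. The toy frame `df_demo` below is hypothetical and only mimics that situation.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

# Hypothetical frame with one constant column, mimicking the std = 0.0 case.
df_demo = pd.DataFrame({
    'IDADE': [30, 45, 60, 75],
    'SEXO': [1, 2, 2, 1],
    'CIRURANT': [0, 0, 0, 0],  # constant: carries no information
})

# VarianceThreshold with its default threshold (0.0) removes constant columns.
selector = VarianceThreshold()
X_kept = selector.fit_transform(df_demo)
kept_cols = df_demo.columns[selector.get_support()]

# Scale only the columns that survived the filter.
X_demo_norm = StandardScaler().fit_transform(X_kept)
print(list(kept_cols), X_demo_norm.shape)
```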


In [31]:
pca = PCA(n_components=2)
pca_result = pca.fit_transform(X_norm)

df_test['pca_one'] = pca_result[:,0]
df_test['pca_two'] = pca_result[:,1]

print('Cumulative explained variation for 2 principal components: {}'.format(np.sum(pca.explained_variance_ratio_)))

Cumulative explained variation for 2 principal components: 0.15341369070188265


In [36]:
# First two principal components of the balanced sample, colored by GRUPO.
fig = px.scatter(df_test, x="pca_one", y="pca_two", color=df_test.GRUPO.astype(str),
                 height=700, width=1000)
fig.update_traces(marker=dict(size=7, line=dict(width=0.7, color='white')),
                  selector=dict(mode='markers'))

fig.show()

In [33]:
pca_50 = PCA(n_components=50)

pca_result_50 = pca_50.fit_transform(X_norm)

print('Cumulative explained variation for 50 principal components: {}'.format(np.sum(pca_50.explained_variance_ratio_)))

Cumulative explained variation for 50 principal components: 0.9649897170084584


In [34]:
tsne = TSNE(n_components=2, random_state=7, perplexity=50)
tsne_pca_results = tsne.fit_transform(pca_result_50)

df_test['tsne_pca50_one'] = tsne_pca_results[:,0]
df_test['tsne_pca50_two'] = tsne_pca_results[:,1]

In [35]:
# t-SNE embedding (perplexity=50) of the balanced sample, colored by GRUPO.
fig = px.scatter(df_test, x="tsne_pca50_one", y="tsne_pca50_two", color=df_test.GRUPO.astype(str),
                 height=700, width=1000)
fig.update_traces(marker=dict(size=7, line=dict(width=0.7, color='white')),
                  selector=dict(mode='markers'))

fig.show()
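The scatter plots above are judged by eye. As a rough numerical complement, scikit-learn's `trustworthiness` score measures how well a 2-D embedding preserves each point's nearest neighbors from the original space (values lie between 0 and 1, higher is better). The sketch below compares PCA and t-SNE on synthetic data. `X_demo`, `emb_pca` and `emb_tsne` are hypothetical names, and the scores say nothing about the notebook's own data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE, trustworthiness

rng = np.random.default_rng(7)
# Hypothetical stand-in for the scaled features: three Gaussian blobs.
X_demo = np.vstack([rng.normal(loc=c, size=(60, 10)) for c in (0.0, 5.0, 10.0)])

# Two 2-D embeddings of the same data.
emb_pca = PCA(n_components=2).fit_transform(X_demo)
emb_tsne = TSNE(n_components=2, random_state=7, perplexity=30).fit_transform(X_demo)

# Fraction-like score of how well local neighborhoods are preserved.
score_pca = trustworthiness(X_demo, emb_pca, n_neighbors=5)
score_tsne = trustworthiness(X_demo, emb_tsne, n_neighbors=5)
print(f'PCA: {score_pca:.3f}  t-SNE: {score_tsne:.3f}')
```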

# **References**

https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html

https://medium.com/@violante.andre/an-introduction-to-t-sne-with-python-example-47e6ae7dc58f

https://towardsdatascience.com/t-sne-python-example-1ded9953f26

https://towardsdatascience.com/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b

https://distill.pub/2016/misread-tsne/