<a href="https://colab.research.google.com/github/TrabalhosPUCPR/ClassificacaoTumores/blob/main/Classifica%C3%A7%C3%A3o_de_Tumores.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Formative Assessment - Tumor Classification

The ICMR (Indian Council of Medical Research) wants to analyze different types of cancer, such as breast, kidney, colon, lung, and prostate cancer, which have become a cause for concern in recent years. It would like to identify the probable cause of these cancers in terms of the genes responsible for each type. This would enable earlier identification of each cancer type and thereby reduce the fatality rate.

Dataset details:
The input dataset contains 802 samples, one for each of the 802 people diagnosed with different types of cancer. Each sample contains expression values for more than 20 thousand genes. Each sample belongs to one of the following tumor types: BRCA, KIRC, COAD, LUAD, or PRAD.


More details can be found at [Kaggle ICMR](https://www.kaggle.com/datasets/shibumohapatra/icmr-data).



Students must use the knowledge acquired in the course to develop a machine learning model that classifies the tumors.

The assignment consists of:

*   Preprocessing the data (converting categorical data, normalization, etc.)
*   Exploratory analysis (analyzing class imbalance and the class distribution)
*   If necessary, applying balancing and dimensionality-reduction techniques (PCA)
*   Training the models (KNN, NB, DT) and critically analyzing the results
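
The preprocessing step in the list above can be illustrated with a minimal sketch. It assumes scikit-learn's `LabelEncoder` and `StandardScaler`; the toy arrays `X_toy` and `y_toy` are hypothetical and are not taken from the ICMR files. Note that scikit-learn classifiers also accept string labels directly, so the encoding step is optional.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Hypothetical toy data: 4 samples x 2 "genes" (not from the ICMR dataset)
X_toy = np.array([[1.0, 200.0],
                  [2.0, 220.0],
                  [3.0, 180.0],
                  [4.0, 200.0]])
y_toy = np.array(["BRCA", "KIRC", "BRCA", "PRAD"])

# Convert the categorical class names to integer codes (sorted alphabetically)
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_toy)

# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)

print(encoder.classes_)
print(y_encoded)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

Scaling matters most for distance-based models such as KNN, where features with large ranges would otherwise dominate the distance.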


In [4]:
!pip install -q kaggle
from google.colab import files

# Upload the Kaggle API token (kaggle.json)
files.upload()
# Copy the token into place first, then restrict its permissions
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
!kaggle datasets download 'shibumohapatra/icmr-data'
!unzip icmr-data.zip

Saving kaggle.json to kaggle.json
Downloading icmr-data.zip to /content
 85% 60.0M/70.6M [00:00<00:00, 92.6MB/s]
100% 70.6M/70.6M [00:00<00:00, 107MB/s] 
Archive:  icmr-data.zip
  inflating: data.csv                
  inflating: labels.csv              


In [32]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay
from imblearn.over_sampling import (SMOTE, ADASYN,RandomOverSampler)
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

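# Load the gene-expression values and the tumor-type labels
data = pd.read_csv('data.csv')
labels = pd.read_csv('labels.csv')

# Drop the first column, which is a sample identifier rather than a gene-expression feature
del data[data.columns[0]]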

##Data Preparation

In [35]:
random_seed = 9999

# 70% train / 30% test
train_data, test_data, train_labels, test_labels = train_test_split(data, labels['Class'],
                       train_size=0.7,
                       test_size=0.3,
                       random_state=random_seed)

# Project all samples onto the first 3 principal components.
# Note: PCA is fitted on the full dataset here, before the train/test split.
pca = PCA(n_components=3)
pca_components = pca.fit_transform(data)

train_data_pca, test_data_pca, train_labels_pca, test_labels_pca = train_test_split(pca_components, labels['Class'],
                       train_size=0.7,
                       test_size=0.3,
                       random_state=random_seed)

# Randomly undersample every class down to the size of the smallest one.
# Note: this is applied before the split, so the test set is undersampled too.
sampler = RandomUnderSampler(random_state=random_seed)
data_under, labels_under = sampler.fit_resample(pca_components, labels['Class'])

train_data_under, test_data_under, train_labels_under, test_labels_under = train_test_split(data_under, labels_under,
                       train_size=0.7,
                       test_size=0.3,
                       random_state=random_seed)

import plotly.graph_objs as go
import plotly.express as px

# 3D scatter plot of the undersampled PCA components, colored by tumor class
fig = px.scatter_3d(x=data_under[:, 0], y=data_under[:, 1], z=data_under[:, 2], color=labels_under)
fig.show()
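
In the cell above, PCA is fitted on all samples before the train/test split, so information from the test samples can leak into the projection. One possible alternative, shown here only as a sketch, is to split first and wrap PCA and the classifier in a scikit-learn pipeline so that PCA is fitted on the training portion alone. The data below is synthetic (`make_classification`), and the names `X_demo`, `y_demo`, and `pipe` are hypothetical; this is not the notebook's method.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the gene-expression matrix (hypothetical data)
X_demo, y_demo = make_classification(n_samples=200, n_features=50,
                                     n_informative=10, n_classes=3,
                                     random_state=0)

# Split first, keeping class proportions similar in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, train_size=0.7, stratify=y_demo, random_state=0)

# PCA is fitted only on the training data when the pipeline is fitted
pipe = make_pipeline(PCA(n_components=3), KNeighborsClassifier(n_neighbors=4))
pipe.fit(X_train, y_train)

# The test data is only transformed with the already-fitted PCA
test_accuracy = pipe.score(X_test, y_test)
print(test_accuracy)
```

If resampling is also needed, the same idea applies: resample only the training portion, so that the test set keeps the original class distribution.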

##Class Imbalance

In [31]:
# Class counts in the training set, the test set, and the full dataset
print(np.unique(train_labels, return_counts = True))
print(np.unique(test_labels, return_counts = True))
print(np.unique(labels['Class'], return_counts = True))

(array(['BRCA', 'COAD', 'KIRC', 'LUAD', 'PRAD'], dtype=object), array([206,  48, 104, 108,  94]))
(array(['BRCA', 'COAD', 'KIRC', 'LUAD', 'PRAD'], dtype=object), array([94, 30, 42, 33, 42]))
(array(['BRCA', 'COAD', 'KIRC', 'LUAD', 'PRAD'], dtype=object), array([300,  78, 146, 141, 136]))
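
To make the imbalance easier to read, the counts printed above for the full dataset can be turned into class shares and a majority-to-minority ratio. The sketch below hard-codes those counts; the variable names are hypothetical. It also computes the dataset size left after random undersampling to the smallest class, assuming the default `RandomUnderSampler` strategy, which reduces every class except the minority to the minority count.

```python
# Class counts copied from the printed output for the full dataset
class_counts = {"BRCA": 300, "COAD": 78, "KIRC": 146, "LUAD": 141, "PRAD": 136}

total = sum(class_counts.values())

# Share of each class in the full dataset
shares = {name: count / total for name, count in class_counts.items()}

# Ratio between the largest and the smallest class
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

# Samples remaining if every class is reduced to the smallest class size
undersampled_total = min(class_counts.values()) * len(class_counts)

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
print(f"total={total}, ratio={imbalance_ratio:.2f}, after undersampling={undersampled_total}")
```

With these counts, BRCA is roughly 3.85 times as frequent as COAD, and undersampling keeps 390 of the 801 samples.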


In [41]:
# KNN with Euclidean distance (Minkowski, p=2) and 4 neighbors, on the original features
knn = KNeighborsClassifier(metric='minkowski', p=2, n_neighbors=4)

knn.fit(train_data,train_labels)
predicts = knn.predict(test_data)

print("KNN, without PCA: ")
print(classification_report(test_labels,predicts))

# Same model on the 3 PCA components
knn = KNeighborsClassifier(metric='minkowski', p=2, n_neighbors=4)
knn.fit(train_data_pca,train_labels_pca)
predicts = knn.predict(test_data_pca)

print("KNN, with PCA: ")
print(classification_report(test_labels_pca,predicts))

# Same model on the undersampled PCA components
knn = KNeighborsClassifier(metric='minkowski', p=2, n_neighbors=4)

knn.fit(train_data_under,train_labels_under)
predicts = knn.predict(test_data_under)

print("KNN, with PCA and undersampling: ")
print(classification_report(test_labels_under,predicts))

KNN, without PCA: 
              precision    recall  f1-score   support

        BRCA       1.00      1.00      1.00        94
        COAD       1.00      1.00      1.00        30
        KIRC       1.00      1.00      1.00        42
        LUAD       1.00      1.00      1.00        33
        PRAD       1.00      1.00      1.00        42

    accuracy                           1.00       241
   macro avg       1.00      1.00      1.00       241
weighted avg       1.00      1.00      1.00       241

KNN, with PCA: 
              precision    recall  f1-score   support

        BRCA       1.00      0.99      0.99        94
        COAD       1.00      1.00      1.00        30
        KIRC       1.00      0.98      0.99        42
        LUAD       0.94      1.00      0.97        33
        PRAD       1.00      1.00      1.00        42

    accuracy                           0.99       241
   macro avg       0.99      0.99      0.99       241
weighted avg       0.99      0.99      0.99
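
The KNN cell above fixes `n_neighbors=4`. One common way to choose this value, sketched here on synthetic data rather than on the ICMR files, is stratified k-fold cross-validation over a few candidates. The candidate list and the names `X_demo`, `y_demo`, and `mean_scores` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data (hypothetical)
X_demo, y_demo = make_classification(n_samples=200, n_features=20,
                                     n_informative=8, n_classes=3,
                                     random_state=0)

# Stratified folds keep the class proportions similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Mean cross-validated accuracy for each candidate number of neighbors
mean_scores = {}
for k in (1, 3, 4, 5, 7):
    knn_demo = KNeighborsClassifier(metric='minkowski', p=2, n_neighbors=k)
    mean_scores[k] = cross_val_score(knn_demo, X_demo, y_demo, cv=cv).mean()

best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print("best k:", best_k)
```

Running the search on the training split only keeps the test set untouched for the final report.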

In [42]:
# Gaussian Naive Bayes on the original features
model = GaussianNB()
model.fit(train_data, train_labels)
nb_predicts = model.predict(test_data)

print("NAIVE BAYES, without PCA: ")
print(classification_report(test_labels,nb_predicts))

# Same model on the 3 PCA components
model = GaussianNB()
model.fit(train_data_pca, train_labels_pca)
nb_predicts = model.predict(test_data_pca)

print("NAIVE BAYES, with PCA: ")
print(classification_report(test_labels_pca,nb_predicts))

# Same model on the undersampled PCA components
model = GaussianNB()
model.fit(train_data_under, train_labels_under)
nb_predicts = model.predict(test_data_under)

print("NAIVE BAYES, with PCA and undersampling: ")
print(classification_report(test_labels_under,nb_predicts))

NAIVE BAYES, without PCA: 
              precision    recall  f1-score   support

        BRCA       0.83      0.89      0.86        94
        COAD       1.00      0.33      0.50        30
        KIRC       0.73      0.86      0.79        42
        LUAD       0.52      0.70      0.60        33
        PRAD       0.97      0.86      0.91        42

    accuracy                           0.78       241
   macro avg       0.81      0.73      0.73       241
weighted avg       0.82      0.78      0.78       241

NAIVE BAYES, with PCA: 
              precision    recall  f1-score   support

        BRCA       0.99      0.97      0.98        94
        COAD       0.94      1.00      0.97        30
        KIRC       1.00      0.98      0.99        42
        LUAD       0.89      0.97      0.93        33
        PRAD       1.00      0.95      0.98        42

    accuracy                           0.97       241
   macro avg       0.96      0.97      0.97       241
weighted avg       0.97      0.97
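
A low per-class recall, such as the 0.33 reported for COAD by Naive Bayes on the original features, says that samples are being missed but not where they end up. A confusion matrix shows that directly. The sketch below uses a small hand-made set of labels, not the notebook's predictions, so the numbers are purely illustrative.

```python
from sklearn.metrics import confusion_matrix

# Hand-made example labels (hypothetical, not the notebook's predictions)
class_names = ["BRCA", "COAD", "LUAD"]
y_true = ["BRCA", "BRCA", "COAD", "COAD", "COAD", "LUAD"]
y_pred = ["BRCA", "BRCA", "COAD", "LUAD", "LUAD", "LUAD"]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=class_names)

# Recall per class = correct predictions / number of true samples of that class
recall_per_class = cm.diagonal() / cm.sum(axis=1)

print(cm)
for name, recall in zip(class_names, recall_per_class):
    print(f"{name}: recall={recall:.2f}")
```

In this toy example two of the three COAD samples are predicted as LUAD, which is visible in the COAD row of the matrix.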

In [43]:
# Decision tree on the original features
model = DecisionTreeClassifier()

model.fit(train_data, train_labels)
dt_predicts = model.predict(test_data)

print("DECISION TREE, without PCA: ")
print(classification_report(test_labels,dt_predicts))

# Same model on the 3 PCA components
model = DecisionTreeClassifier()

model.fit(train_data_pca, train_labels_pca)
dt_predicts = model.predict(test_data_pca)

print("DECISION TREE, with PCA: ")
print(classification_report(test_labels_pca,dt_predicts))

# Same model on the undersampled PCA components
model = DecisionTreeClassifier()

model.fit(train_data_under, train_labels_under)
dt_predicts = model.predict(test_data_under)

print("DECISION TREE, with PCA and undersampling: ")
print(classification_report(test_labels_under,dt_predicts))

DECISION TREE, without PCA: 
              precision    recall  f1-score   support

        BRCA       0.96      0.99      0.97        94
        COAD       0.94      0.97      0.95        30
        KIRC       0.95      0.95      0.95        42
        LUAD       0.94      0.91      0.92        33
        PRAD       1.00      0.93      0.96        42

    accuracy                           0.96       241
   macro avg       0.96      0.95      0.95       241
weighted avg       0.96      0.96      0.96       241

DECISION TREE, with PCA: 
              precision    recall  f1-score   support

        BRCA       0.99      0.98      0.98        94
        COAD       0.94      0.97      0.95        30
        KIRC       1.00      0.95      0.98        42
        LUAD       0.89      1.00      0.94        33
        PRAD       1.00      0.95      0.98        42

    accuracy                           0.97       241
   macro avg       0.96      0.97      0.97       241
weighted avg       0.97
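
The decision trees above are created without a `random_state`. Because scikit-learn permutes the features randomly at each split, repeated runs can produce slightly different trees and therefore slightly different reports. A minimal sketch of fixing the seed is shown below on synthetic data; the names `X_demo`, `tree_a`, and `tree_b` are hypothetical, and the seed value simply reuses the notebook's 9999.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (hypothetical)
X_demo, y_demo = make_classification(n_samples=150, n_features=30,
                                     n_informative=8, n_classes=3,
                                     random_state=0)

# Two trees built with the same seed on the same data
tree_a = DecisionTreeClassifier(random_state=9999).fit(X_demo, y_demo)
tree_b = DecisionTreeClassifier(random_state=9999).fit(X_demo, y_demo)

# With a fixed seed, both trees choose the same split features in the same order
same_structure = (tree_a.tree_.node_count == tree_b.tree_.node_count
                  and np.array_equal(tree_a.tree_.feature, tree_b.tree_.feature))
print("identical trees:", same_structure)
```

Passing the same seed to every `DecisionTreeClassifier` in the notebook would make the three reports reproducible across runs.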