### Project overview
The challenge is to build a classifier that determines which class a genetic mutation found in various types of cancer belongs to.
To do this we will use the provided dataset, which contains the following predictor columns:
* Text ==> Text describing the cancer found in several samples, annotated by human experts
* Gene ==> The gene analyzed
* Variation ==> The variant to which the analyzed sample belongs

Plus the target column:
* Class

### Challenges

To build the classifier we need to:
* Process the provided text, applying techniques that extract its meaning
* Combine the text with the other variables provided (namely Gene and Variation), given that these are elements of a rather different nature (free text and categorical variables)
* Train a classifier on this data
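
The second step above, combining free text with categorical columns, can be sketched with scikit-learn's `ColumnTransformer`. This is only an illustration on made-up rows, not the approach used in the cells below; the toy values and the `combine` name are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Toy rows with the same column names as the dataset (values are made up).
toy = pd.DataFrame({
    "Text": ["kinase mutation activates signaling",
             "truncating mutation causes loss of function",
             "fusion protein drives proliferation"],
    "Gene": ["CBL", "FAM58A", "CBL"],
    "Variation": ["W802*", "Truncating Mutations", "Q249E"],
})

combine = ColumnTransformer([
    # A plain string selects a 1-D column, which is what text vectorizers expect.
    ("text", TfidfVectorizer(), "Text"),
    # A list of names selects a 2-D block for the encoder.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gene", "Variation"]),
])

features = combine.fit_transform(toy)
print(features.shape)
```

Both blocks end up side by side in a single feature matrix, with one row per sample, so row alignment is handled by the transformer rather than by a later merge.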

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from nltk.tokenize import word_tokenize
from sklearn.decomposition import TruncatedSVD, PCA
from sklearn import metrics
import xgboost

import string

In [2]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM, Embedding, Input, RepeatVector, Bidirectional
from keras.optimizer_v1 import SGD
from tensorflow import keras

In [3]:
import warnings
warnings.filterwarnings(action = "ignore")

In [4]:
# Raw string so that the regex separator "\|\|" is not treated as an invalid escape sequence
dfTrain = pd.read_csv("training_text", sep=r"\|\|", engine='python', header=None, skiprows=1, names=["ID","Text"])
dfTrainVariant = pd.read_csv("training_variants")

In [5]:
train = pd.merge(dfTrain,dfTrainVariant, how = "inner", on = "ID")

In [6]:
train.head()

Unnamed: 0,ID,Text,Gene,Variation,Class
0,0,Cyclin-dependent kinases (CDKs) regulate a var...,FAM58A,Truncating Mutations,1
1,1,Abstract Background Non-small cell lung canc...,CBL,W802*,2
2,2,Abstract Background Non-small cell lung canc...,CBL,Q249E,2
3,3,Recent evidence has demonstrated that acquired...,CBL,N454D,3
4,4,Oncogenic mutations in the monomeric Casitas B...,CBL,L399V,4


In [7]:
train.isna().sum()

ID           0
Text         5
Gene         0
Variation    0
Class        0
dtype: int64

After loading the data and joining the two datasets on the reference column (ID), I see that the text column has missing values (5 rows).
Since that number is small compared to the size of the dataset, I will simply drop those rows.
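
One detail worth keeping in mind when dropping rows: pandas keeps the original index labels, so the index ends up with gaps. The sketch below uses a toy frame (all values are made up) to show how a later Series assignment aligns on labels rather than on position.

```python
import pandas as pd

# Toy frame: the row with ID 1 has no text.
df = pd.DataFrame({"ID": [0, 1, 2, 3], "Text": ["a", None, "c", "d"]})
df = df.dropna()
print(df.index.tolist())  # the index keeps a gap where the row was dropped

# A frame built from fresh data gets a new 0..n-1 index.
derived = pd.DataFrame({"feat": [10.0, 20.0, 30.0]})

# Series assignment aligns on index labels, not on position.
derived["ID_aligned"] = df["ID"]
# Assigning the raw values is positional and keeps rows matched.
derived["ID_positional"] = df["ID"].to_numpy()
print(derived)
```

Calling `reset_index(drop=True)` right after `dropna` is another way to avoid the mismatch, at the cost of losing the original labels.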

In [8]:
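# Drop the rows with missing text. The index keeps its gaps (it is not reset)
# and "ID" stays a regular column until cell 18.
train.dropna(inplace=True)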

In [9]:
train.shape

(3316, 5)

In [10]:
stop_words = set(stopwords.words("english"))

In [11]:
def preprocessamento(text):
    text = text.lower()
    text = text.translate(str.maketrans("","", string.punctuation))
    
    return text

As preprocessing, I convert all text to lowercase and remove punctuation.
I do this by creating a new column in the dataset with these changes applied.

In [12]:
train["Text2"] = train["Text"].map(preprocessamento)

There are several options here: we could remove stop words, tokenize the text, and so on.
My choice is the TfidfVectorizer from scikit-learn, which gave good results on this dataset (with stop_words="english" it also drops English stop words).
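
For comparison, the manual route mentioned above (tokenizing and removing stop words by hand) could look like the sketch below. `TOY_STOP_WORDS` and `tokenize_without_stop_words` are hypothetical names, and the stop-word list is deliberately tiny; NLTK's English list is much longer.

```python
import string

# Hypothetical, tiny stop-word list for illustration only.
TOY_STOP_WORDS = {"the", "of", "in", "a", "is", "and"}

def tokenize_without_stop_words(text):
    """Lowercase, strip punctuation, split on whitespace, drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in cleaned.split() if tok not in TOY_STOP_WORDS]

tokens = tokenize_without_stop_words("The role of CBL in lung cancer, revisited.")
print(tokens)
```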

In [13]:
tfidf =TfidfVectorizer(min_df = 1, ngram_range=(1,2), max_features=20000, stop_words="english")

In [14]:
text_train = tfidf.fit_transform(train["Text2"].values).toarray()

In [15]:
train2 = pd.DataFrame(text_train, index = train.index)

The dataset is very high-dimensional: the TF-IDF matrix has up to 20,000 columns, and some rows of text contain more than 100,000 words, which would make training quite expensive. I will use dimensionality reduction to make training feasible.
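
As a side note, `TruncatedSVD` accepts the sparse matrix returned by `TfidfVectorizer` directly, so the dense conversion with `.toarray()` is optional. A minimal sketch on made-up documents (the `docs` list and the 2 components are assumptions chosen to keep the example small):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "kinase mutation activates signaling",
    "truncating mutation causes loss of function",
    "fusion protein drives proliferation",
    "kinase fusion activates proliferation",
]

sparse_tfidf = TfidfVectorizer().fit_transform(docs)  # SciPy sparse matrix
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(sparse_tfidf)             # sparse input is accepted as-is

print(sparse_tfidf.shape, reduced.shape)
print(svd.explained_variance_ratio_.sum())
```

The sum of `explained_variance_ratio_` gives a rough idea of how much of the variance the kept components retain, which can help when choosing `n_components`.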

In [16]:
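svd_truncated = TruncatedSVD(n_components=200)
truncated_train = pd.DataFrame(svd_truncated.fit_transform(train2))
# Caution: truncated_train has a fresh 0..n-1 index while train kept its gaps after
# dropna. This assignment aligns on labels, so missing labels become NaN (hence the
# float IDs below) and rows after the first dropped label no longer carry their own ID.
truncated_train["ID"] = train["ID"]
# set_index returns a new DataFrame; the result is not kept, so "ID" stays a column.
truncated_train.set_index("ID")
truncated_train.head()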

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,191,192,193,194,195,196,197,198,199,ID
0,0.221528,-0.081707,-0.01678,-0.071,-0.000251,-0.009083,0.002798,-0.010887,-0.020465,-0.01884,...,-0.004839,0.028431,0.000813,0.021793,-0.016415,-0.012044,0.004751,-0.001735,-0.012538,0.0
1,0.165491,-0.087215,-0.038926,0.075832,-0.023038,0.028609,0.03884,0.00657,-0.009878,-0.007995,...,0.010726,-0.001199,-0.008188,-0.002768,0.003849,-0.010672,-0.001554,0.002522,0.018431,1.0
2,0.165491,-0.087215,-0.038926,0.075832,-0.023038,0.028609,0.03884,0.00657,-0.009878,-0.007995,...,0.010726,-0.001199,-0.008188,-0.002768,0.003849,-0.010672,-0.001554,0.002522,0.018431,2.0
3,0.182577,-0.072146,-0.025338,0.00903,-0.002064,-0.019587,-0.021849,-0.025466,0.016703,-0.012428,...,0.026159,-0.034974,0.001514,-0.036586,-0.049705,-0.067765,-0.17772,-0.114349,-0.081355,3.0
4,0.241561,-0.075471,-0.001424,0.031511,-0.044637,0.036575,0.001294,-0.013997,-0.026975,-0.021823,...,-0.009963,0.014049,0.006492,0.01616,0.016435,0.037297,0.080705,0.048989,0.039365,4.0


The "Gene" and "Variation" columns also carry important information and need to be included in the final dataset. However, their format differs from that of the text column, which calls for a different approach.
I will use one-hot encoding to make the categorical variables usable.
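
One-hot encoding creates one column per distinct value (minus one per variable with `drop_first=True`), so a column with many distinct values produces a very wide matrix. A small sketch on made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "Gene": ["CBL", "CBL", "FAM58A", "CBL"],
    "Variation": ["W802*", "Q249E", "Truncating Mutations", "N454D"],
})

# Distinct values per column: each becomes a column (minus one with drop_first=True).
print(toy.nunique())

encoded = pd.get_dummies(toy, columns=["Gene", "Variation"], drop_first=True)
print(encoded.shape)
```

Checking `nunique()` before encoding gives the resulting width in advance.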

In [17]:
one_hot_enc_gene_var = pd.get_dummies(train,columns = ["Gene","Variation"],drop_first=True)
one_hot_enc_gene_var.drop(["Text","Text2","Class"], axis = 1, inplace = True)
one_hot_enc_gene_var.head()

Unnamed: 0,ID,Gene_ACVR1,Gene_AGO2,Gene_AKT1,Gene_AKT2,Gene_AKT3,Gene_ALK,Gene_APC,Gene_AR,Gene_ARAF,...,Variation_Y87N,Variation_Y901C,Variation_Y931C,Variation_Y98H,Variation_Y98N,Variation_YAP1-FAM118B Fusion,Variation_YAP1-MAMLD1 Fusion,Variation_ZC3H7B-BCOR Fusion,Variation_ZNF198-FGFR1 Fusion,Variation_p61BRAF
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,3,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,4,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
one_hot_enc_gene_var = one_hot_enc_gene_var.set_index("ID")
train = train.set_index("ID")

In [19]:
one_hot_enc_gene_var.head()

ID,Gene_ACVR1,Gene_AGO2,Gene_AKT1,Gene_AKT2,Gene_AKT3,Gene_ALK,Gene_APC,Gene_AR,Gene_ARAF,Gene_ARID1A,...,Variation_Y87N,Variation_Y901C,Variation_Y931C,Variation_Y98H,Variation_Y98N,Variation_YAP1-FAM118B Fusion,Variation_YAP1-MAMLD1 Fusion,Variation_ZC3H7B-BCOR Fusion,Variation_ZNF198-FGFR1 Fusion,Variation_p61BRAF
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [20]:
# Reduce the one-hot block to 50 components with PCA.
pca_one_hot = PCA(n_components=50)
one_hot_truncated = pd.DataFrame(pca_one_hot.fit_transform(one_hot_enc_gene_var.values))
# Assigning an Index (not a Series) is positional, so each row keeps its own ID here.
one_hot_truncated["ID"] = train.index
one_hot_truncated.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,41,42,43,44,45,46,47,48,49,ID
0,-0.085605,-0.051975,-0.071414,0.010961,-0.176018,-0.28979,-0.709392,0.574529,-0.000615,-0.000306,...,-0.011327,-0.010551,0.005368,-0.00147,0.011597,0.006366,0.004388,-0.00059,-0.001987,0
1,-0.063159,-0.032183,-0.027989,-0.003428,-0.055313,-0.025283,-0.010818,-0.044898,0.006458,0.000118,...,-0.05852,-0.000263,0.041253,-0.00022,-0.003161,-0.000573,-0.027656,-0.002797,-0.00461,1
2,-0.063159,-0.032183,-0.027989,-0.003428,-0.055313,-0.025283,-0.010818,-0.044898,0.006458,0.000118,...,-0.05852,-0.000263,0.041253,-0.00022,-0.003161,-0.000573,-0.027656,-0.002797,-0.00461,2
3,-0.063159,-0.032183,-0.027989,-0.003428,-0.055313,-0.025283,-0.010818,-0.044898,0.006458,0.000118,...,-0.05852,-0.000263,0.041253,-0.00022,-0.003161,-0.000573,-0.027656,-0.002797,-0.00461,3
4,-0.063159,-0.032183,-0.027989,-0.003428,-0.055313,-0.025283,-0.010818,-0.044898,0.006458,0.000118,...,-0.05852,-0.000263,0.041253,-0.00022,-0.003161,-0.000573,-0.027656,-0.002797,-0.00461,4


I will build a new dataframe by merging the dataframe that holds the variables from the text dimensionality reduction with the one that holds the variables from the one-hot encoding.
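
An alternative to merging on a column is to give every block the same explicit index and concatenate along the columns. The sketch below assumes toy arrays and three surviving IDs; the `text_`/`cat_` prefixes are hypothetical and only avoid duplicate column names.

```python
import numpy as np
import pandas as pd

ids = pd.Index([0, 2, 3], name="ID")  # IDs that survived dropna (toy values)

# Both blocks are built positionally from arrays with the same row order,
# so sharing one explicit index keeps every row tied to its ID.
text_part = pd.DataFrame(np.arange(6.0).reshape(3, 2), index=ids).add_prefix("text_")
cat_part = pd.DataFrame(np.arange(3.0).reshape(3, 1), index=ids).add_prefix("cat_")
target = pd.Series([1, 2, 4], index=ids, name="Class")

combined = pd.concat([text_part, cat_part, target], axis=1)
print(combined)
```

Because all three pieces share the same index, no row is lost and no NaN appears in the result.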

In [21]:
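train2 = pd.merge(truncated_train,one_hot_truncated, how = 'inner', on = "ID")
# Caution: after the merge train2 has a fresh 0..n-1 index while train is indexed by ID,
# so Class is matched by label; rows left without a match become NaN and are dropped.
train2["Class"] = train["Class"]
train2.dropna(inplace = True)
train2.head()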

Unnamed: 0,0_x,1_x,2_x,3_x,4_x,5_x,6_x,7_x,8_x,9_x,...,41_y,42_y,43_y,44_y,45_y,46_y,47_y,48_y,49_y,Class
0,0.221528,-0.081707,-0.01678,-0.071,-0.000251,-0.009083,0.002798,-0.010887,-0.020465,-0.01884,...,-0.011327,-0.010551,0.005368,-0.00147,0.011597,0.006366,0.004388,-0.00059,-0.001987,1.0
1,0.165491,-0.087215,-0.038926,0.075832,-0.023038,0.028609,0.03884,0.00657,-0.009878,-0.007995,...,-0.05852,-0.000263,0.041253,-0.00022,-0.003161,-0.000573,-0.027656,-0.002797,-0.00461,2.0
2,0.165491,-0.087215,-0.038926,0.075832,-0.023038,0.028609,0.03884,0.00657,-0.009878,-0.007995,...,-0.05852,-0.000263,0.041253,-0.00022,-0.003161,-0.000573,-0.027656,-0.002797,-0.00461,2.0
3,0.182577,-0.072146,-0.025338,0.00903,-0.002064,-0.019587,-0.021849,-0.025466,0.016703,-0.012428,...,-0.05852,-0.000263,0.041253,-0.00022,-0.003161,-0.000573,-0.027656,-0.002797,-0.00461,3.0
4,0.241561,-0.075471,-0.001424,0.031511,-0.044637,0.036575,0.001294,-0.013997,-0.026975,-0.021823,...,-0.05852,-0.000263,0.041253,-0.00022,-0.003161,-0.000573,-0.027656,-0.002797,-0.00461,4.0


In [22]:
train2.drop(["ID"], axis =1, inplace=True)

Since the Kaggle test datasets do not include the target column and I need data to evaluate on, I will use scikit-learn's train_test_split to split the dataset.
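
With nine unevenly sized classes, it may help to keep the class proportions equal in both parts and to fix the seed so that runs are comparable. A sketch on toy labels (the data and the seed value 42 are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples with imbalanced labels (illustrative values only).
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([1] * 10 + [2] * 5 + [7] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.20,
    stratify=y_toy,    # keep class proportions in both parts
    random_state=42,   # make the split reproducible
)
print(np.unique(y_te, return_counts=True))
```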

In [23]:
X = train2.drop("Class", axis=1)
y = train2.Class

X_train, X_test, y_train, y_test  = train_test_split(X,y, test_size=0.20,shuffle=True)

I will train several models to find the one that best fits the problem, starting with XGBoost.

In [24]:
modelXGB = xgboost.XGBClassifier()
modelXGB.fit(X_train,y_train)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=8,
              num_parallel_tree=1, objective='multi:softprob', predictor='auto',
              random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=None,
              subsample=1, tree_method='exact', validate_parameters=1,
              verbosity=None)

In [25]:
predXGBTest = modelXGB.predict(X_test)
metrics.accuracy_score(y_test,predXGBTest)

0.5151057401812689

In [26]:
metrics.f1_score(y_test,predXGBTest, average=None)

array([0.44545455, 0.3375    , 0.23076923, 0.58778626, 0.24175824,
       0.33333333, 0.67561521, 0.2       , 0.77777778])

In [27]:
predXGBTrain = modelXGB.predict(X_train)
metrics.accuracy_score(y_train,predXGBTrain)

1.0

After training XGBoost, test accuracy is around 51% while training accuracy is 100%, which indicates overfitting. To address this I can try other techniques or other algorithms.
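
One common technique is to limit model capacity and watch the gap between training and test accuracy. In the sketch below a plain decision tree on synthetic data stands in for the boosted model, so that the example only needs scikit-learn. `XGBClassifier` exposes comparable knobs such as `max_depth`, `subsample`, `colsample_bytree` and `reg_lambda`, all visible in the printed parameters above. The `train_test_gap` helper is a hypothetical name.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_syn, y_syn = make_classification(
    n_samples=600, n_features=20, n_informative=5,
    n_classes=3, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=0)

def train_test_gap(model):
    """Fit the model and return (train accuracy, test accuracy)."""
    model.fit(X_tr, y_tr)
    return model.score(X_tr, y_tr), model.score(X_te, y_te)

deep = train_test_gap(DecisionTreeClassifier(random_state=0))
shallow = train_test_gap(DecisionTreeClassifier(max_depth=3, random_state=0))
print("unconstrained:", deep)
print("max_depth=3:  ", shallow)
```

A large gap between the two numbers of a pair is the same symptom seen with XGBoost above. Whether the constrained model also scores better on the test set depends on the data.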

In [45]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()

In [46]:
modelrf = rf.fit(X_train,y_train)

In [47]:
predTest = modelrf.predict(X_test)
metrics.accuracy_score(y_test,predTest)

0.5317220543806647

In [48]:
predTrain = modelrf.predict(X_train)
metrics.accuracy_score(y_train,predTrain)

0.8653555219364599

Random Forest, despite a lower training accuracy (about 87%), reached a slightly better test accuracy (about 53%) and therefore shows less sign of overfitting, although it is still not perfect.

The decision tree model is similar to Random Forest but performs slightly worse (about 48% test accuracy), which is expected.

In [32]:
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier()

In [33]:
modelTree = tree.fit(X_train,y_train)
predTest = modelTree.predict(X_test)
predTrain = modelTree.predict(X_train)
metrics.accuracy_score(y_test, predTest), metrics.accuracy_score(y_train,predTrain)

(0.48338368580060426, 0.8653555219364599)

In [34]:
X_train.shape

(2644, 250)

In [35]:
def baseline_model():
    model = Sequential()
    model.add(Dense(512, input_dim=250, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(512, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(512, activation='relu'))
    model.add(Dropout(0.2))
    model.add(Dense(512, activation='relu'))
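    # Caution: softmax over a single unit always outputs 1.0, so this head cannot
    # separate the 9 classes; that is consistent with the flat history shown below.
    model.add(Dense(1, activation="softmax"))
    model.compile(loss='poisson', optimizer='sgd', metrics=['accuracy'])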
    return model

In [36]:
modelNN = baseline_model()

In [37]:
estimatorNN = modelNN.fit(X_train,y_train,epochs = 10, validation_split=0.2, batch_size=64)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [38]:
estimatorNN.history

{'loss': [0.9999996423721313,
  0.9999996423721313,
  0.9999996423721313,
  0.9999997615814209,
  0.9999996423721313,
  0.9999997615814209,
  0.9999996423721313,
  0.9999996423721313,
  0.9999996423721313,
  0.9999996423721313],
 'accuracy': [0.17446808516979218,
  0.17446808516979218,
  0.17446808516979218,
  0.17446808516979218,
  0.17446808516979218,
  0.17446808516979218,
  0.17446808516979218,
  0.17446808516979218,
  0.17446808516979218,
  0.17446808516979218],
 'val_loss': [0.9999994039535522,
  0.9999994039535522,
  0.9999994039535522,
  0.9999994039535522,
  0.9999994039535522,
  0.9999994039535522,
  0.9999994039535522,
  0.9999994039535522,
  0.9999994039535522,
  0.9999994039535522],
 'val_accuracy': [0.15879017114639282,
  0.15879017114639282,
  0.15879017114639282,
  0.15879017114639282,
  0.15879017114639282,
  0.15879017114639282,
  0.15879017114639282,
  0.15879017114639282,
  0.15879017114639282,
  0.15879017114639282]}
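
The loss stuck just below 1.0 and the constant accuracy are consistent with the output layer: softmax over one unit is always 1.0, so the prediction never changes. The sketch below reproduces this in plain Python, assuming the usual Poisson loss definition `y_pred - y_true * log(y_pred + eps)`. The helper names and `eps=1e-7` are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def poisson_loss(y_true, y_pred, eps=1e-7):
    """Poisson loss for one sample: y_pred - y_true * log(y_pred + eps)."""
    return y_pred - y_true * math.log(y_pred + eps)

# With a single output unit, softmax returns 1.0 whatever the logit is.
for logit in (-5.0, 0.0, 12.3):
    print(softmax([logit]))

# The loss therefore sits just below 1 for every class label from 1 to 9.
losses = [poisson_loss(label, 1.0) for label in range(1, 10)]
print(losses)

# With one unit per class the output becomes a proper distribution.
probs = softmax([0.1 * k for k in range(9)])
print(len(probs), sum(probs))
```

A more conventional head for this task would be a 9-unit softmax trained with sparse categorical cross-entropy on labels shifted to 0..8. That is a suggestion, not something evaluated here.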

In [39]:
max_features = 2000

In [40]:
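# Caution: an Embedding layer expects integer token ids in [0, max_features);
# the SVD/PCA features passed to fit() below are real-valued.
inputs = keras.Input(shape=(None,), dtype="float32")

x = Embedding(max_features, 256)(inputs)

x = Bidirectional(LSTM(128, return_sequences=True))(x)
x = Bidirectional(LSTM(128))(x)

# Caution: a single sigmoid unit models a binary target, not the 9 classes,
# and does not match the categorical_crossentropy loss used in compile().
outputs = Dense(1, activation="sigmoid")(x)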

model = keras.Model(inputs, outputs)
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding (Embedding)       (None, None, 256)         850944    
                                                                 
 bidirectional (Bidirectiona  (None, None, 256)        394240    
 l)                                                              
                                                                 
 bidirectional_1 (Bidirectio  (None, 256)              394240    
 nal)                                                            
                                                                 
 dense_5 (Dense)             (None, 1)                 257       
                                                                 
Total params: 1,639,681
Trainable params: 1,639,681
Non-train
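
An `Embedding` layer looks up one vector per integer token id, so its natural input is a sequence of ids built from the raw text rather than the reduced numeric features. A minimal pure-Python sketch of such an encoding follows; `build_vocab` and `encode` are hypothetical helpers, and reserving id 0 for padding and id 1 for unknown tokens is an assumption.

```python
from collections import Counter

def build_vocab(texts, max_features):
    """Map the most frequent tokens to ids 2..max_features-1 (0 = padding, 1 = unknown)."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    most_common = [tok for tok, _ in counts.most_common(max_features - 2)]
    return {tok: idx + 2 for idx, tok in enumerate(most_common)}

def encode(text, vocab, max_len):
    """Turn a text into a fixed-length list of integer token ids."""
    ids = [vocab.get(tok, 1) for tok in text.lower().split()][:max_len]
    return ids + [0] * (max_len - len(ids))

texts = ["kinase mutation activates signaling", "fusion protein drives signaling"]
vocab = build_vocab(texts, max_features=2000)
encoded = [encode(t, vocab, max_len=6) for t in texts]
print(encoded)
```

Every id stays below `max_features`, which is the first argument of the `Embedding` layer above.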

In [41]:
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics = ["accuracy"])

In [42]:
model.fit(X_train, y_train, batch_size=32, epochs=2, validation_data=(X_test, y_test))

Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x14ee03254c8>

In [43]:
modelXGB.save_model("Modelo1.json")

In [44]:
model.save("modelKeras")



INFO:tensorflow:Assets written to: modelKeras\assets




In [49]:
# The random forest model also performed reasonably, with less overfitting than the XGBoost model,
# so I will save it for later use if needed
from joblib import dump, load
dump(modelrf,"modelorf.joblib")

['modelorf.joblib']

As a final test, I would like to train a model that predicts the classes using only the information from the texts.
At first this may seem wrong. However, since much of the complexity of this project lies precisely in reconciling the text information with the categorical information in the other variables, I will give it a try.

In [56]:
X = truncated_train.values
y = train.Class

X_train, X_test, y_train, y_test  = train_test_split(X,y, test_size=0.20,shuffle=True)

In [57]:
modelXGB = xgboost.XGBClassifier()
modelXGB.fit(X_train,y_train)



XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
              gamma=0, gpu_id=-1, importance_type=None,
              interaction_constraints='', learning_rate=0.300000012,
              max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=100, n_jobs=8,
              num_parallel_tree=1, objective='multi:softprob', predictor='auto',
              random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=None,
              subsample=1, tree_method='exact', validate_parameters=1,
              verbosity=None)

In [58]:
predXGBTest = modelXGB.predict(X_test)
metrics.accuracy_score(y_test,predXGBTest)

0.6746987951807228

In [59]:
predXGBTrain = modelXGB.predict(X_train)
metrics.accuracy_score(y_train,predXGBTrain)

0.9935897435897436

In [60]:
metrics.f1_score(y_test,predXGBTest, average=None)

array([0.61157025, 0.55757576, 0.32      , 0.71586716, 0.56565657,
       0.75247525, 0.77530864, 0.        , 0.61538462])

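Curiously, using only the text-derived features gave a better result than also using the "Gene" and "Variation" information: test accuracy was about 67%, against about 52% for the combined dataset. The comparison is not entirely clean, though: X here still contains the ID column of truncated_train, the two runs use different random splits, and the combined dataset was assembled through index-aligned assignments after dropna, which can leave rows mismatched.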
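
A quick way to check what goes into a feature matrix is to look at its shape before and after removing helper columns. The sketch below uses a made-up frame to show that `set_index` without assignment leaves the frame unchanged, so `.values` still includes the `ID` column.

```python
import pandas as pd

frame = pd.DataFrame({"f0": [0.1, 0.2], "f1": [0.3, 0.4], "ID": [0, 1]})

frame.set_index("ID")                     # returns a new frame; `frame` is unchanged
print(frame.values.shape)                 # the ID column is still among the values

features_only = frame.drop(columns="ID")  # explicit removal before modelling
print(features_only.values.shape)
```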