### Project: Text Analytics
#### Case: Patient eligibility for clinical trials
#### Nicolás Orjuela

For this project, a support vector machine (SVM) model was chosen to classify patients as eligible or not eligible. Before the model can be trained, the text needs thorough preprocessing: each entry is cleaned and reduced to word stems, and the result is then vectorized with a Bag of Words model.

**This preprocessing is the same as the one used by Felipe Bedoya.**

### 0. Importing libraries

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 25)
pd.set_option('display.max_rows', 50)
import numpy as np
np.random.seed(3301)

# Data preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
# To balance the classes
from imblearn.over_sampling import SMOTE
# To split the learning set into training and test sets
from sklearn.model_selection import train_test_split
# To evaluate the model
from sklearn.metrics import confusion_matrix, classification_report, precision_score, recall_score, f1_score, accuracy_score
# Note: plot_confusion_matrix requires scikit-learn < 1.2
from sklearn.metrics import plot_confusion_matrix
# For the hyperparameter search
from sklearn.model_selection import GridSearchCV
# For cross-validation
from sklearn.model_selection import KFold
# Visualization libraries
import matplotlib.pyplot as plt
# Seaborn
import seaborn as sns

import re

from sklearn.preprocessing import FunctionTransformer

import nltk

nltk.download('stopwords')
from nltk.corpus import stopwords

from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer


pd.set_option('display.max_colwidth', None)  # or 199

from sklearn.feature_extraction.text import CountVectorizer

%matplotlib inline

from sklearn.pipeline import Pipeline

from sklearn import svm

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\orjue\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### 1. Preprocessing

First, the data must be loaded and profiled. Once the state and quality of the data are known, cleaning can begin, following what the data dictionary specifies.

In [2]:
# Load the data
df_eleg = pd.read_csv('../Datos/ElegibilidadEstudiantes/clinical_trials_on_cancer_data_clasificacion.csv', sep=',', encoding='utf-8', dtype='unicode')
df_eleg.head()

Unnamed: 0,label,study_and_condition
0,__label__0,study interventions are Saracatinib . recurrent verrucous carcinoma of the larynx diagnosis and patients must agree to use adequate birth control for the duration of study participation and for at least eight weeks after discontinuation of study drug
1,__label__1,study interventions are Stem cell transplantation . hodgkin lymphoma diagnosis and history of congenital hematologic immunologic or metabolic disorder which in the estimation of the pi poses prohibitive risk to the recipient
2,__label__0,study interventions are Lenograstim . recurrent adult diffuse mixed cell lymphoma diagnosis and creatinine clearance crcl greater than fifty ml per minute all tests must be performed within twenty-eight days prior to registration
3,__label__0,study interventions are Doxorubicin . stage iii diffuse large cell lymphoma diagnosis and stages ii bulky disease defined as mass size of more than ten cm stage iii or iv ann_arbor staging patients with stage and stage ii non bulky disease are excluded from this study
4,__label__1,study interventions are Poly I-C . prostate cancer diagnosis and unresolved iraes following prior biological therapy except that stable and managed iraes may be acceptable hypothyroidism or hypopituitarism on appropriate replacement


We check whether the dataset contains null values:

In [3]:
df_eleg.isna().sum()

label                  0
study_and_condition    0
dtype: int64

The data has no nulls, but the entries are formatted inconsistently: some start with quotation marks and others with spaces, so they need to be brought to a uniform shape. The important patient information comes after the first period. The labels are also categorical and must be made numeric. Although a label encoder would handle this easily, we want to keep the category that each label already encodes.

In [4]:
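def preprocessor(df):
    # Note: Series.replace matches whole cell values only, so quotes embedded in longer strings are not removed
    df['study_and_condition'] = df['study_and_condition'].replace('""', '')
    df['study_and_condition'] = df['study_and_condition'].str.strip(' ')
    # Keep the text between the first and second period (the patient's condition)
    df['study_and_condition'] = df['study_and_condition'].str.split('.').str[1]
    # Map the string labels to the integers they already encode
    df.loc[df['label'] == '__label__0', 'label'] = 0
    df.loc[df['label'] == '__label__1', 'label'] = 1
    # Return the frame so the function can also be used as a transformer
    return df
preprocessor(df_eleg)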
print(df_eleg.describe())
df_eleg

        label                    study_and_condition
count   12000                                  12000
unique      2                                  11685
top         0   lymphoma diagnosis and not specified
freq     6000                                     12


Unnamed: 0,label,study_and_condition
0,0,recurrent verrucous carcinoma of the larynx diagnosis and patients must agree to use adequate birth control for the duration of study participation and for at least eight weeks after discontinuation of study drug
1,1,hodgkin lymphoma diagnosis and history of congenital hematologic immunologic or metabolic disorder which in the estimation of the pi poses prohibitive risk to the recipient
2,0,recurrent adult diffuse mixed cell lymphoma diagnosis and creatinine clearance crcl greater than fifty ml per minute all tests must be performed within twenty-eight days prior to registration
3,0,stage iii diffuse large cell lymphoma diagnosis and stages ii bulky disease defined as mass size of more than ten cm stage iii or iv ann_arbor staging patients with stage and stage ii non bulky disease are excluded from this study
4,1,prostate cancer diagnosis and unresolved iraes following prior biological therapy except that stable and managed iraes may be acceptable hypothyroidism or hypopituitarism on appropriate replacement
...,...,...
11995,0,recurrent childhood large cell lymphoma diagnosis and no known hypersensitivity to etanercept
11996,0,"recurrent rectal cancer diagnosis and absolute neutrophil count greater_than equal_than one thousand, five hundred ul"
11997,1,recurrent lymphoblastic lymphoma diagnosis and and intrathecal intraventricular therapy
11998,0,colorectal cancer diagnosis and patients must have received at least one prior chemotherapy regimen for advanced disease
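
The cleaning rules applied in cell 4 can also be expressed on a single string, which makes them easy to check in isolation. The following is a minimal pure-Python sketch, not the notebook's implementation: `clean_text` and `parse_label` are hypothetical helper names, and the sample string is adapted from the first row with quotation marks added for illustration. Unlike `str.split('.').str[1]`, which keeps only the text between the first and second period, this sketch keeps everything after the first period.

```python
def clean_text(raw):
    """Strip quotes and outer spaces, then keep the text after the first period."""
    text = raw.replace('"', '').strip()
    # partition returns (head, separator, tail); separator is empty when there is no period
    head, sep, tail = text.partition('.')
    return tail.strip() if sep else text


def parse_label(raw_label):
    """Turn '__label__0' / '__label__1' into the integer it already encodes."""
    return int(raw_label.rsplit('_', 1)[-1])


# Hypothetical sample, adapted from the first row of the dataset
sample = '"study interventions are Saracatinib . recurrent verrucous carcinoma of the larynx diagnosis"'
print(clean_text(sample))
print(parse_label('__label__1'))
```

Working on plain strings rather than on a DataFrame column also keeps the text cleaning independent of the `label` column, which matters once the features and the target are separated.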


We create the pipeline to make the model easier to use in production.

In [5]:
pre = [('preproc', FunctionTransformer(preprocessor))]

Next, we preprocess the text by splitting it into tokens and stemming them with a Porter stemmer. A Bag of Words model and then TF-IDF weighting are applied afterwards to identify the important words. We use the NLTK stop word list to remove connecting words that add noise.

In [6]:
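porter = PorterStemmer()
# A set makes the stop word membership test fast
stop = set(stopwords.words('english'))

def tokenizer_porter(sentence):
    # Split on whitespace, drop stop words, and stem the remaining tokens
    tokens = sentence.split()
    stemmed_tokens = [porter.stem(token) for token in tokens if token not in stop]
    return ' '.join(stemmed_tokens)

def transformer_tokenizer(df):
    # Modifies the frame in place and returns it so it can be chained
    df['study_and_condition'] = df['study_and_condition'].apply(tokenizer_porter)
    return df

pre += [('porter', FunctionTransformer(transformer_tokenizer))]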

In [7]:
transformer_tokenizer(df_eleg)
df_eleg

Unnamed: 0,label,study_and_condition
0,0,recurr verruc carcinoma larynx diagnosi patient must agre use adequ birth control durat studi particip least eight week discontinu studi drug
1,1,hodgkin lymphoma diagnosi histori congenit hematolog immunolog metabol disord estim pi pose prohibit risk recipi
2,0,recurr adult diffus mix cell lymphoma diagnosi creatinin clearanc crcl greater fifti ml per minut test must perform within twenty-eight day prior registr
3,0,stage iii diffus larg cell lymphoma diagnosi stage ii bulki diseas defin mass size ten cm stage iii iv ann_arbor stage patient stage stage ii non bulki diseas exclud studi
4,1,prostat cancer diagnosi unresolv ira follow prior biolog therapi except stabl manag ira may accept hypothyroid hypopituitar appropri replac
...,...,...
11995,0,recurr childhood larg cell lymphoma diagnosi known hypersensit etanercept
11996,0,"recurr rectal cancer diagnosi absolut neutrophil count greater_than equal_than one thousand, five hundr ul"
11997,1,recurr lymphoblast lymphoma diagnosi intrathec intraventricular therapi
11998,0,colorect cancer diagnosi patient must receiv least one prior chemotherapi regimen advanc diseas


We now use the Bag of Words model to turn the text into a numeric vector that represents the words it contains. Because unimportant words tend to have high frequencies, we use TF-IDF to down-weight them.

In [8]:
pre

[('preproc',
  FunctionTransformer(func=<function preprocessor at 0x000002A2E8BB39D0>)),
 ('porter',
  FunctionTransformer(func=<function transformer_tokenizer at 0x000002A2E8F0E160>))]

In [9]:
tfidf = TfidfVectorizer(strip_accents=None, lowercase=False, preprocessor=None)
pre += [('tfidf', tfidf)]
pre += [('SVM', svm.SVC())]
pipeline = Pipeline(pre)
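
To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch of the computation. It assumes the smoothed IDF formula and L2 row normalization that `TfidfVectorizer` applies by default, but it tokenizes by whitespace only, which is a simplification of the vectorizer's own token pattern. The function name `tfidf_rows` and the three short documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows


# Hypothetical stemmed documents in the style of the dataset
docs = ["lymphoma diagnosi histori", "prostat cancer diagnosi", "colorect cancer diagnosi"]
vocab, rows = tfidf_rows(docs)
for term, weight in zip(vocab, rows[0]):
    print(term, round(weight, 3))
```

In this toy corpus `diagnosi` appears in every document, so it receives the lowest IDF and a smaller weight than `lymphoma` in the first row, which is the correction the paragraph above describes.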

### 2. Modeling and training

With the data preprocessed and ready for analysis, we create the training and validation sets and train the SVM model.

In [10]:
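x = df_eleg.drop('label', axis=1)
# The label column has object dtype after preprocessing; cast it so scikit-learn sees integer classes
y = df_eleg['label'].astype(int)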

In [11]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2,random_state=100)

In [12]:
# Hyperparameter tuning for the SVM step of the pipeline

# Parameter ranges to search
param_grid = {'SVM__C': [0.001, 0.1, 1, 10, 100, 1000, 10000],
              'SVM__gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'SVM__kernel': ['linear', 'poly', 'sigmoid', 'rbf']}

grid = GridSearchCV(pipeline, param_grid, refit=True, verbose=1, cv=15)

# Fit the grid search on the training split
grid_search = grid.fit(x_train, y_train)

Fitting 15 folds for each of 140 candidates, totalling 2100 fits


2100 fits failed out of a total of 2100.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
2100 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\orjue\anaconda3\lib\site-packages\pandas\core\indexes\base.py", line 3621, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas\_libs\index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas\_libs\hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'label'

The above except

KeyError: 'label'
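
The `KeyError: 'label'` above is consistent with the pipeline receiving `x`, from which the `label` column was dropped in cell 10, while `preprocessor` still indexes `df['label']`. Two further points would likely need attention once that is resolved: the data in `df_eleg` has already been cleaned and stemmed in cells 4 and 7, so the pipeline would apply those steps a second time, and `TfidfVectorizer` expects an iterable of strings rather than a DataFrame. The following is a minimal sketch of one possible arrangement, not the author's final pipeline: `clean_text` is a hypothetical helper that works on a Series of raw text and returns a new Series, the stemming step is omitted for brevity, and the four training sentences and their labels are invented for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVC


def clean_text(texts):
    """Return a cleaned copy of a sequence of raw strings; labels are never touched."""
    texts = pd.Series(texts).astype(str)
    texts = texts.str.replace('"', '', regex=False).str.strip()
    # Keep everything after the first period when there is one
    parts = texts.str.split('.', n=1)
    return parts.str[-1].str.strip()


sketch_pipeline = Pipeline([
    ('clean', FunctionTransformer(clean_text)),
    ('tfidf', TfidfVectorizer()),
    ('SVM', SVC(kernel='linear')),
])

# Invented toy data: a Series of raw text and a list of integer labels
train_text = pd.Series([
    'study interventions are Drug A . lymphoma diagnosis and history of metabolic disorder',
    'study interventions are Drug B . prostate cancer diagnosis and unresolved adverse events',
    'study interventions are Drug C . lymphoma diagnosis and adequate birth control',
    'study interventions are Drug D . rectal cancer diagnosis and adequate neutrophil count',
])
train_label = [1, 1, 0, 0]

sketch_pipeline.fit(train_text, train_label)
pred = sketch_pipeline.predict(train_text)
print(pred)
```

With this shape, the raw text column would be passed to the search as a Series (for example `x_train['study_and_condition']`, taken from data that has not been preprocessed yet) and the labels would be converted to integers outside the pipeline.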

In [None]:
print(grid_search.best_params_)
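
The grid searched in cell 12 is the Cartesian product of its three lists, which accounts for the fit count reported in that cell's output. The sketch below recomputes the count with the standard library only. As a suggestion rather than the author's setup, it also shows a split grid: `gamma` has no effect on the linear kernel, and `GridSearchCV` accepts a list of dictionaries, so the linear kernel can be searched over `C` alone. The name `split_grid` is hypothetical.

```python
from itertools import product

C_values = [0.001, 0.1, 1, 10, 100, 1000, 10000]
gamma_values = [1, 0.1, 0.01, 0.001, 0.0001]
kernels = ['linear', 'poly', 'sigmoid', 'rbf']

# Every combination of C, gamma and kernel, as in the original grid
full_grid = list(product(C_values, gamma_values, kernels))
n_folds = 15
total_fits = len(full_grid) * n_folds

# Hypothetical alternative: search gamma only for the kernels that use it
split_grid = [
    {'SVM__kernel': ['linear'], 'SVM__C': C_values},
    {'SVM__kernel': ['poly', 'sigmoid', 'rbf'], 'SVM__C': C_values, 'SVM__gamma': gamma_values},
]
n_split = sum(len(list(product(*g.values()))) for g in split_grid)

print(len(full_grid), total_fits, n_split)  # prints: 140 2100 112
```

The split grid drops 28 redundant candidates, which at 15 folds saves 420 fits.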

In [None]:
# best_score_ is the mean cross-validated score of the best candidate, not the training accuracy
accuracy = grid_search.best_score_ * 100
print("Mean cross-validated accuracy of the best model: {:.2f}%".format(accuracy))

In [None]:
y_test_hat = grid.predict(x_test)
test_accuracy = accuracy_score(y_test, y_test_hat) * 100
print("Accuracy on the test set after tuning: {:.2f}%".format(test_accuracy))

In [None]:
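print(confusion_matrix(y_test, y_test_hat))
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay.from_estimator replaces it
from sklearn.metrics import ConfusionMatrixDisplay
disp = ConfusionMatrixDisplay.from_estimator(grid, x_test, y_test, cmap=plt.cm.Blues)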

In [13]:
df_eleg

Unnamed: 0,label,study_and_condition
0,0,recurr verruc carcinoma larynx diagnosi patient must agre use adequ birth control durat studi particip least eight week discontinu studi drug
1,1,hodgkin lymphoma diagnosi histori congenit hematolog immunolog metabol disord estim pi pose prohibit risk recipi
2,0,recurr adult diffus mix cell lymphoma diagnosi creatinin clearanc crcl greater fifti ml per minut test must perform within twenty-eight day prior registr
3,0,stage iii diffus larg cell lymphoma diagnosi stage ii bulki diseas defin mass size ten cm stage iii iv ann_arbor stage patient stage stage ii non bulki diseas exclud studi
4,1,prostat cancer diagnosi unresolv ira follow prior biolog therapi except stabl manag ira may accept hypothyroid hypopituitar appropri replac
...,...,...
11995,0,recurr childhood larg cell lymphoma diagnosi known hypersensit etanercept
11996,0,"recurr rectal cancer diagnosi absolut neutrophil count greater_than equal_than one thousand, five hundr ul"
11997,1,recurr lymphoblast lymphoma diagnosi intrathec intraventricular therapi
11998,0,colorect cancer diagnosi patient must receiv least one prior chemotherapi regimen advanc diseas
