### Project: Text analytics
#### Case: Patient eligibility for clinical trials
#### Felipe Bedoya

For this project, a Naive Bayes classifier was chosen to categorize patients as eligible or not eligible. Before the model can be trained, the text needs thorough preprocessing: the words are reduced to their stems (the notebook uses Porter stemming rather than true lemmatization), and the text is then vectorized with a Bag of Words model.
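
As a rough picture of the approach, the sketch below chains a bag-of-words vectorizer and a Naive Bayes classifier in scikit-learn. The toy sentences, the labels and the name `toy_model` are invented for illustration and are not taken from the dataset. Using `MultinomialNB` is an assumption, since the text only says "Naive Bayes".

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: two short texts per class (0 and 1)
toy_texts = [
    "patients must agree to use birth control",
    "creatinine clearance greater than fifty",
    "history of congenital disorder",
    "unresolved iraes following prior therapy",
]
toy_labels = [0, 0, 1, 1]

# Bag of words (raw term counts) followed by multinomial Naive Bayes
toy_model = Pipeline([
    ("bow", CountVectorizer()),
    ("nb", MultinomialNB()),
])
toy_model.fit(toy_texts, toy_labels)

# Classify two unseen texts that share words with one class each
toy_preds = toy_model.predict([
    "history of metabolic disorder",
    "patients must use birth control",
])
print(toy_preds)
```

Keeping both steps in one `Pipeline` means the vocabulary learned during `fit` is reused automatically at prediction time.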

### 0. Importing libraries

In [1]:
import pandas as pd
pd.set_option('display.max_columns', 25)
pd.set_option('display.max_rows', 50)
import numpy as np
np.random.seed(3301)

# Data preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
# To balance the classes by oversampling
from imblearn.over_sampling import SMOTE
# To split the learning set into training and test sets
from sklearn.model_selection import train_test_split
# To evaluate the model
from sklearn.metrics import confusion_matrix, classification_report, precision_score, recall_score, f1_score, accuracy_score
# Note: plot_confusion_matrix was removed in scikit-learn 1.2;
# newer versions provide ConfusionMatrixDisplay instead
from sklearn.metrics import plot_confusion_matrix
# For hyperparameter search
from sklearn.model_selection import GridSearchCV
# For cross-validation
from sklearn.model_selection import KFold
# Visualization libraries
import matplotlib.pyplot as plt
import seaborn as sns

import re

from sklearn.preprocessing import FunctionTransformer

import nltk

nltk.download('stopwords')
from nltk.corpus import stopwords

from nltk.stem.porter import PorterStemmer


pd.set_option('display.max_colwidth', None)  # show full text; or a fixed width such as 199

%matplotlib inline

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\felip\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### 1. Preprocessing

First, the data must be loaded and profiled. Once the state and quality of the data are known, cleaning can begin, following the data dictionary.

In [2]:
# Load the data
df_eleg = pd.read_csv('../Datos/ElegibilidadEstudiantes/clinical_trials_on_cancer_data_clasificacion.csv', sep=',', encoding='utf-8', dtype='unicode')
df_eleg.head()

Unnamed: 0,label,study_and_condition
0,__label__0,study interventions are Saracatinib . recurrent verrucous carcinoma of the larynx diagnosis and patients must agree to use adequate birth control for the duration of study participation and for at least eight weeks after discontinuation of study drug
1,__label__1,study interventions are Stem cell transplantation . hodgkin lymphoma diagnosis and history of congenital hematologic immunologic or metabolic disorder which in the estimation of the pi poses prohibitive risk to the recipient
2,__label__0,study interventions are Lenograstim . recurrent adult diffuse mixed cell lymphoma diagnosis and creatinine clearance crcl greater than fifty ml per minute all tests must be performed within twenty-eight days prior to registration
3,__label__0,study interventions are Doxorubicin . stage iii diffuse large cell lymphoma diagnosis and stages ii bulky disease defined as mass size of more than ten cm stage iii or iv ann_arbor staging patients with stage and stage ii non bulky disease are excluded from this study
4,__label__1,study interventions are Poly I-C . prostate cancer diagnosis and unresolved iraes following prior biological therapy except that stable and managed iraes may be acceptable hypothyroidism or hypopituitarism on appropriate replacement


We check whether the dataset contains null values:

In [3]:
df_eleg.isna().sum()

label                  0
study_and_condition    0
dtype: int64
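
Profiling can go a little further than counting nulls. The sketch below checks class balance, repeated texts and text length on a small invented frame. `toy_df` and its rows are hypothetical and only mimic the two columns of the dataset.

```python
import pandas as pd

# Hypothetical rows shaped like the dataset (label + free text)
toy_df = pd.DataFrame({
    "label": ["__label__0", "__label__1", "__label__0", "__label__1"],
    "study_and_condition": [
        "study interventions are X . lymphoma diagnosis and not specified",
        "study interventions are Y . prostate cancer diagnosis",
        "study interventions are X . lymphoma diagnosis and not specified",
        "study interventions are Z . colorectal cancer diagnosis",
    ],
})

# How many rows fall in each class
class_counts = toy_df["label"].value_counts()

# How many rows repeat an earlier text verbatim
n_duplicates = int(toy_df["study_and_condition"].duplicated().sum())

# Number of whitespace-separated tokens per row
text_lengths = toy_df["study_and_condition"].str.split().str.len()

print(class_counts)
print(n_duplicates)
print(text_lengths.describe())
```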

The data contains no nulls, but the entries are formatted inconsistently: some begin with quotation marks and others with spaces, so they need to be brought to a uniform format. In addition, the relevant patient information comes after the first period. The labels are categorical and must be converted to numeric values. Although a label encoder would handle this easily, we want to preserve the class that each label already encodes (`__label__0` becomes 0 and `__label__1` becomes 1).
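
The cleaning steps described above can be sketched on a single record using only the standard library. The function name `clean_record` and the sample string are illustrative assumptions, not part of the notebook.

```python
def clean_record(raw_text, raw_label):
    """Clean one (text, label) pair following the steps described above."""
    # Drop stray double quotes and surrounding whitespace
    text = raw_text.replace('"', '').strip()
    # Keep only what follows the first period
    # (if there is no period, partition leaves this part empty)
    _, _, after_period = text.partition('.')
    # '__label__0' -> 0 and '__label__1' -> 1, reusing the digit in the label
    label = int(raw_label.rsplit('_', 1)[-1])
    return after_period.strip(), label


sample_text = '" study interventions are Saracatinib . recurrent verrucous carcinoma "'
cleaned_text, cleaned_label = clean_record(sample_text, '__label__0')
print(cleaned_text, cleaned_label)
```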

In [4]:
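def preprocessor(df):
    # Remove stray double quotes inside the text
    # (Series.replace would only match cells that are exactly '""')
    df['study_and_condition'] = df['study_and_condition'].str.replace('"', '', regex=False)
    # Trim surrounding spaces
    df['study_and_condition'] = df['study_and_condition'].str.strip(' ')
    # Keep the text after the first period (anything after a second period is dropped)
    df['study_and_condition'] = df['study_and_condition'].str.split('.').str[1]
    # Map each label to the number it already carries
    df.loc[df['label'] == '__label__0', 'label'] = 0
    df.loc[df['label'] == '__label__1', 'label'] = 1
    # Return the frame so the function can be wrapped in a FunctionTransformer
    return df
preprocessor(df_eleg)
print(df_eleg.describe())
df_eleg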

        label                    study_and_condition
count   12000                                  12000
unique      2                                  11685
top         0   lymphoma diagnosis and not specified
freq     6000                                     12


Unnamed: 0,label,study_and_condition
0,0,recurrent verrucous carcinoma of the larynx diagnosis and patients must agree to use adequate birth control for the duration of study participation and for at least eight weeks after discontinuation of study drug
1,1,hodgkin lymphoma diagnosis and history of congenital hematologic immunologic or metabolic disorder which in the estimation of the pi poses prohibitive risk to the recipient
2,0,recurrent adult diffuse mixed cell lymphoma diagnosis and creatinine clearance crcl greater than fifty ml per minute all tests must be performed within twenty-eight days prior to registration
3,0,stage iii diffuse large cell lymphoma diagnosis and stages ii bulky disease defined as mass size of more than ten cm stage iii or iv ann_arbor staging patients with stage and stage ii non bulky disease are excluded from this study
4,1,prostate cancer diagnosis and unresolved iraes following prior biological therapy except that stable and managed iraes may be acceptable hypothyroidism or hypopituitarism on appropriate replacement
...,...,...
11995,0,recurrent childhood large cell lymphoma diagnosis and no known hypersensitivity to etanercept
11996,0,"recurrent rectal cancer diagnosis and absolute neutrophil count greater_than equal_than one thousand, five hundred ul"
11997,1,recurrent lymphoblastic lymphoma diagnosis and and intrathecal intraventricular therapy
11998,0,colorectal cancer diagnosis and patients must have received at least one prior chemotherapy regimen for advanced disease


We create the pipeline to make the model easier to use in production.

In [5]:
pre = [('preproc', FunctionTransformer(preprocessor))]
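
A `FunctionTransformer` passes its input to the wrapped function and uses whatever that function returns as the transformed data, so the function needs to return the frame rather than only modify it in place. The sketch below shows this on a toy frame. `keep_condition_text`, `toy_df` and `toy_pipeline` are hypothetical names, and the function is a simplified stand-in for the notebook's preprocessor.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def keep_condition_text(df):
    # Work on a copy so the caller's frame is left untouched
    out = df.copy()
    # Keep the text after the first period and trim spaces
    out["study_and_condition"] = (
        out["study_and_condition"].str.split(".", n=1).str[1].str.strip()
    )
    return out


toy_df = pd.DataFrame({
    "study_and_condition": [
        "study interventions are Doxorubicin . stage iii lymphoma diagnosis",
    ],
})

# Same (name, transformer) step format as the `pre` list
toy_pipeline = Pipeline([("preproc", FunctionTransformer(keep_condition_text))])
cleaned = toy_pipeline.fit_transform(toy_df)
print(cleaned)
```

Returning a copy also makes the step safe to run more than once on the same raw frame, since the original text is never overwritten.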

Next we preprocess the text by splitting it into tokens and stemming them with the Porter stemmer. After that, a bag-of-words model is applied and finally tf-idf, to identify the important words.
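
To make the bag-of-words and tf-idf steps concrete, the sketch below computes both by hand on three invented documents. It uses the textbook weighting `tf * log(N / df)`. scikit-learn's `TfidfVectorizer` applies a smoothed idf and normalizes each row by default, so its numbers will differ. All variable names are illustrative.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, already lower-cased
docs = [
    "prostate cancer diagnosis",
    "lymphoma diagnosis",
    "colorectal cancer diagnosis",
]

# Bag of words: one term-count mapping per document
bow = [Counter(doc.split()) for doc in docs]

# Document frequency: in how many documents each term appears
n_docs = len(docs)
doc_freq = Counter(term for counts in bow for term in counts)

# Inverse document frequency: terms found in every document get weight 0
idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# tf-idf: term count scaled by how distinctive the term is
tfidf = [
    {term: tf * idf[term] for term, tf in counts.items()}
    for counts in bow
]

for doc, weights in zip(docs, tfidf):
    print(doc, weights)
```

In this toy corpus "diagnosis" appears in every document and receives zero weight, while "prostate" is the most distinctive word of the first document.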

In [12]:
porter = PorterStemmer()
def tokenizer_porter(text):
    # Split on whitespace and reduce each word to its Porter stem
    return [porter.stem(word) for word in text.split()]
print(tokenizer_porter("ovarian cancer diagnosis and more than four weeks since prior participation in any other investigational study"))

def transformer_tokenizer(df):
    # Apply the tokenizer row by row. Passing the `.str` accessor instead raises
    # AttributeError: 'list' object has no attribute 'lower'
    return df['study_and_condition'].apply(tokenizer_porter)
stemmed_tokens = transformer_tokenizer(df_eleg)

['ovarian', 'cancer', 'diagnosi', 'and', 'more', 'than', 'four', 'week', 'sinc', 'prior', 'particip', 'in', 'ani', 'other', 'investig', 'studi']

In [7]:
stop = stopwords.words('english')
df_eleg

Unnamed: 0,label,study_and_condition
0,0,recurrent verrucous carcinoma of the larynx diagnosis and patients must agree to use adequate birth control for the duration of study participation and for at least eight weeks after discontinuation of study drug
1,1,hodgkin lymphoma diagnosis and history of congenital hematologic immunologic or metabolic disorder which in the estimation of the pi poses prohibitive risk to the recipient
2,0,recurrent adult diffuse mixed cell lymphoma diagnosis and creatinine clearance crcl greater than fifty ml per minute all tests must be performed within twenty-eight days prior to registration
3,0,stage iii diffuse large cell lymphoma diagnosis and stages ii bulky disease defined as mass size of more than ten cm stage iii or iv ann_arbor staging patients with stage and stage ii non bulky disease are excluded from this study
4,1,prostate cancer diagnosis and unresolved iraes following prior biological therapy except that stable and managed iraes may be acceptable hypothyroidism or hypopituitarism on appropriate replacement
...,...,...
11995,0,recurrent childhood large cell lymphoma diagnosis and no known hypersensitivity to etanercept
11996,0,"recurrent rectal cancer diagnosis and absolute neutrophil count greater_than equal_than one thousand, five hundred ul"
11997,1,recurrent lymphoblastic lymphoma diagnosis and and intrathecal intraventricular therapy
11998,0,colorectal cancer diagnosis and patients must have received at least one prior chemotherapy regimen for advanced disease
