# Class 7 - Exercise

In [None]:
## Libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np


## Breast Cancer (METABRIC, Nature 2012 & Nat Commun 2016)




Dataset source (https://www.cbioportal.org/study/clinicalData?id=brca_metabric)

The Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) database is a Canada-UK project containing targeted sequencing data for 1,980 primary breast cancer samples. The clinical and genomic data were downloaded from cBioPortal.

The dataset was collected by Professor Carlos Caldas of the Cambridge Research Institute and Professor Sam Aparicio of the British Columbia Cancer Center in Canada, and was published in Nature Communications (Pereira et al., 2016). It has also appeared in several other articles, including in Nature and elsewhere:
- [Associations between genomic stratification of breast cancer and centrally reviewed tumor pathology in the METABRIC cohort](https://www.nature.com/articles/s41523-018-0056-8)
- [Predicting Outcomes of Hormone and Chemotherapy in the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) Study by Biochemically-inspired Machine Learning](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5461908/)

## From cBioPortal:

- Clinical attributes in the dataset: 31 values
- Genetic attributes in the dataset: mRNA level z-scores for 331 genes and mutations for 175 genes.

### Clinical attributes in the dataset:

| Name | Type | Description |
| --- | --- | --- |
| patient_id | object | Patient ID |
| age_at_diagnosis | float | Age of the patient at the time of diagnosis |
| type_of_breast_surgery | object | Breast cancer surgery type: 1- MASTECTOMY, surgery to remove all breast tissue from a breast as a way to treat or prevent breast cancer. 2- BREAST CONSERVING, surgery in which only the part of the breast that has cancer is removed |
| cancer_type | object | Breast cancer types: 1- Breast Cancer or 2- Breast Sarcoma |
| cancer_type_detailed | object | Detailed breast cancer types: 1- Breast Invasive Ductal Carcinoma 2- Breast Mixed Ductal and Lobular Carcinoma 3- Breast Invasive Lobular Carcinoma 4- Breast Invasive Mixed Mucinous Carcinoma 5- Metaplastic Breast Cancer |
| cellularity | object | Cancer cellularity post chemotherapy, which refers to the amount of tumor cells in the specimen and their arrangement into clusters |
| chemotherapy | int | Whether or not the patient had chemotherapy as a treatment (yes/no) |
| pam50_+_claudin-low_subtype | object | PAM50 is a tumor profiling test that helps show whether some estrogen receptor-positive (ER-positive), HER2-negative breast cancers are likely to metastasize (when breast cancer spreads to other organs). The claudin-low breast cancer subtype is defined by gene expression characteristics, most prominently: low expression of cell–cell adhesion genes, high expression of epithelial–mesenchymal transition (EMT) genes, and stem cell-like/less differentiated gene expression patterns |
| cohort | float | A cohort is a group of subjects who share a defining characteristic (takes a value from 1 to 5) |
| er_status_measured_by_ihc | float | Assesses whether estrogen receptors are expressed on cancer cells using immunohistochemistry, a staining technique used in pathology that targets a specific antigen and produces a color where the antigen is present (positive/negative) |
| er_status | object | Cancer cells are positive or negative for estrogen receptors |
| neoplasm_histologic_grade | int | Determined by pathology by looking at the nature of the cells and whether they look aggressive or not (takes a value from 1 to 3) |
| her2_status_measured_by_snp6 | object | Assesses whether the cancer is positive for HER2 using advanced molecular techniques (type of next generation sequencing) |
| her2_status | object | Whether the cancer is positive or negative for HER2 |
| tumor_other_histologic_subtype | object | Type of the cancer based on microscopic examination of the cancer tissue (takes a value of 'Ductal/NST', 'Mixed', 'Lobular', 'Tubular/ cribriform', 'Mucinous', 'Medullary', 'Other', 'Metaplastic') |
| hormone_therapy | int | Whether or not the patient had hormone therapy as a treatment (yes/no) |
| inferred_menopausal_state | object | Whether the patient is post-menopausal or not (post/pre) |
| integrative_cluster | object | Molecular subtype of the cancer based on gene expression (takes a value from '4ER+', '3', '9', '7', '4ER-', '5', '8', '10', '1', '2', '6') |
| primary_tumor_laterality | object | Whether the tumor involves the right breast or the left breast |
| lymph_nodes_examined_positive | float | Lymph nodes sampled during surgery and found to be involved by the cancer |
| mutation_count | float | Number of genes with relevant mutations |
| nottingham_prognostic_index | float | Used to determine prognosis following surgery for breast cancer. Its value is calculated using three pathological criteria: the size of the tumour, the number of involved lymph nodes, and the grade of the tumour. |
| oncotree_code | object | The OncoTree is an open-source ontology that was developed at Memorial Sloan Kettering Cancer Center (MSK) for standardizing cancer type diagnosis from a clinical perspective by assigning each diagnosis a unique OncoTree code. |
| overall_survival_months | float | Duration from the time of the intervention to death |
| overall_survival | object | Target variable: whether the patient is alive or dead |
| pr_status | object | Cancer cells are positive or negative for progesterone receptors |
| radio_therapy | int | Whether or not the patient had radiotherapy as a treatment (yes/no) |
| 3-gene_classifier_subtype | object | Three-gene classifier subtype. It takes a value from 'ER-/HER2-', 'ER+/HER2- High Prolif', nan, 'ER+/HER2- Low Prolif', 'HER2+' |
| tumor_size | float | Tumor size measured by imaging techniques |
| tumor_stage | float | Stage of the cancer based on the involvement of surrounding structures, lymph nodes and distant spread |
| death_from_cancer | int | Whether the patient's death was due to cancer or not (yes/no) |
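
The `nottingham_prognostic_index` row names three inputs: tumour size, involved lymph nodes, and grade. As a rough sketch of how the index is commonly computed (NPI = 0.2 × tumour size in cm + lymph-node score + grade), the hypothetical helper below assumes the size is given in millimetres and uses the usual node scoring of 1 (no positive nodes), 2 (1 to 3) and 3 (4 or more). The function name and the unit assumption are illustrative and are not taken from the dataset documentation.

```python
def nottingham_prognostic_index(tumor_size_mm, positive_nodes, grade):
    """Hypothetical helper: NPI = 0.2 * size_cm + node_score + grade."""
    size_cm = tumor_size_mm / 10.0  # assumption: size is recorded in mm
    if positive_nodes == 0:
        node_score = 1
    elif positive_nodes <= 3:
        node_score = 2
    else:
        node_score = 3
    return 0.2 * size_cm + node_score + grade


# Made-up patient: 20 mm tumour, 2 positive nodes, grade 3
npi_example = nottingham_prognostic_index(tumor_size_mm=20, positive_nodes=2, grade=3)
print(npi_example)
```

Comparing the result with the `nottingham_prognostic_index` column for a few rows would be one way to check whether the unit assumption holds for this dataset.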


### Genetic attributes in the dataset:
The genetic part of the dataset contains mRNA level z-scores for 331 genes and mutations for 175 genes.

#### What is mRNA?
In a microarray, the DNA molecules attached to each slide act as probes to detect gene expression, that is, the set of messenger RNA (mRNA) transcripts expressed by a group of genes (also known as the transcriptome). To perform a microarray analysis, mRNA molecules are typically collected from both an experimental sample and a reference sample.

#### What are mRNA Z-Scores?
For mRNA expression data, the expression of an individual gene in a tumor is compared with the gene's expression distribution in a reference population. That reference population is all samples in the study. The returned value (Z-score) indicates the number of standard deviations away from the mean expression in the reference population. This measure is useful for determining whether a gene is up- or down-regulated relative to the normal samples or all other tumor samples.

The formula is:
```
z = (expression in tumor sample - mean expression in reference sample) / standard deviation of expression in reference sample
```
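
As a minimal sketch of the formula above, the snippet below computes z-scores for a single gene with NumPy. The expression values are made up for illustration, and using the population standard deviation (`ddof=0`) is an assumption; the source does not say which estimator cBioPortal uses.

```python
import numpy as np

# Made-up expression values for one gene across six samples (illustrative only).
expression = np.array([5.1, 4.8, 6.3, 5.0, 7.9, 5.5])

# The reference population here is all samples, as described above.
mean_ref = expression.mean()
std_ref = expression.std(ddof=0)  # assumption: population standard deviation

z_scores = (expression - mean_ref) / std_ref
print(np.round(z_scores, 2))
```

A sample with a large positive z-score expresses the gene well above the reference mean, and a large negative one well below it.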

## Exploratory Data Analysis (EDA)

### Loading the data

In [None]:
data = pd.read_csv("METABRIC_RNA_Mutation.csv")

In [None]:
## dimensions of the dataset (<rows>, <columns>)
data.shape

In [None]:
## head()
data.head()

### Exploring the clinical data

In [None]:
columnas_data_clinica = data.columns[:31]
print(columnas_data_clinica)

In [None]:
data_clinica = data[columnas_data_clinica].copy()

In [None]:
data_clinica.shape

In [None]:
data_clinica.head()

#### info()
Prints the list of columns, how many non-null values each one contains, and its data type

In [None]:
data_clinica.info()
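
Since `info()` reports non-null counts, it can help to see the missing values directly. The sketch below uses a small made-up frame (`toy`) as a stand-in for `data_clinica`; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for data_clinica (values are made up).
toy = pd.DataFrame({
    "tumor_size": [22.0, np.nan, 10.0, 15.0],
    "tumor_stage": [2.0, np.nan, np.nan, 1.0],
    "chemotherapy": [0, 1, 0, 0],
})

# Count missing values per column, largest first.
missing_per_column = toy.isna().sum().sort_values(ascending=False)
print(missing_per_column)
```

The same two calls applied to `data_clinica` would show which clinical columns have the most gaps.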

#### describe()
Descriptive statistics for the numeric values in the dataframe

In [None]:
data_clinica.describe()

#### Target variable

In [None]:
data_clinica['overall_survival'].unique()

# Classifier

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
data_clinica.dropna(inplace=True)
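
Calling `dropna()` without arguments removes every row that has a missing value in any column, including columns the model never uses. A possible alternative, sketched here on a made-up frame, is to pass `subset=` so that only the columns of interest are checked. The `toy` frame and the `needed` list are illustrative assumptions.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age_at_diagnosis": [75.6, 43.2, 48.9, 60.1],
    "tumor_size": [22.0, 10.0, np.nan, 15.0],
    "3-gene_classifier_subtype": [np.nan, "HER2+", "HER2+", np.nan],  # not used by the model
    "overall_survival": [1, 1, 0, 0],
})

needed = ["age_at_diagnosis", "tumor_size", "overall_survival"]

all_columns = toy.dropna()               # drops any row with a missing value anywhere
only_needed = toy.dropna(subset=needed)  # drops rows only if a needed column is missing

print(len(all_columns), len(only_needed))
```

Whether the extra rows are worth keeping depends on how many are lost in the real data, which `data_clinica.shape` before and after the drop would show.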

In [None]:
# Features used by the model (the duplicated "lymph_nodes_examined_positive" entry was removed)
variables = [
    "age_at_diagnosis", "chemotherapy", "lymph_nodes_examined_positive",
    "radio_therapy", "tumor_size", "tumor_stage"
]

In [None]:
x = data_clinica[variables]
y = data_clinica['overall_survival']

### Model definition

In [None]:
modelo = LogisticRegression()

In [None]:
modelo.fit(x,y)
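
The selected features live on very different scales (age in years, binary treatment flags, tumour size), which can slow down the solver used by `LogisticRegression`. One common option, sketched below on synthetic data, is to standardize the features inside a pipeline. The names `x_demo`, `y_demo` and `modelo_escalado` are hypothetical, and this is not the setup used in the class notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (x, y).
x_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
x_demo[:, 0] *= 1000  # exaggerate one feature's scale

# The scaler is fitted together with the model, so the same transform is reused at predict time.
modelo_escalado = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
modelo_escalado.fit(x_demo, y_demo)

print(modelo_escalado.score(x_demo, y_demo))
```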

### Making a prediction

In [None]:
paciente_random = data_clinica.sample(n=1)

In [None]:
paciente_random

In [None]:
modelo.predict(paciente_random[variables])

In [None]:
paciente_random['overall_survival']
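
`predict` returns only the predicted class. To see how confident the model is, `predict_proba` returns one probability per class, in the order given by `classes_`. A minimal sketch on a tiny synthetic problem (the arrays and `demo_model` are made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic example: one feature, binary label.
x_demo = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_demo = np.array([0, 0, 0, 1, 1, 1])

demo_model = LogisticRegression().fit(x_demo, y_demo)

# Shape (1, 2): the columns follow demo_model.classes_
proba = demo_model.predict_proba([[5.5]])
print(demo_model.classes_, proba)
```

In the notebook, the analogous call would be `modelo.predict_proba(paciente_random[variables])`.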

## Evaluating the model

In [None]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve

In [None]:
predicciones = modelo.predict(data_clinica[variables])

In [None]:
accuracy_score(y, predicciones)  # scikit-learn expects (y_true, y_pred)
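
The accuracy above is measured on the same rows used to fit the model, so it tends to be optimistic. A common remedy is to hold out part of the data for evaluation. The sketch below shows the idea on synthetic data; `x_demo`, `y_demo` and the 25% test size are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

x_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)

# Keep 25% of the rows aside; stratify preserves the class proportions in both parts.
x_train, x_test, y_train, y_test = train_test_split(
    x_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

demo_model = LogisticRegression(max_iter=1000).fit(x_train, y_train)

acc_train = accuracy_score(y_train, demo_model.predict(x_train))
acc_test = accuracy_score(y_test, demo_model.predict(x_test))
print(acc_train, acc_test)
```

The test accuracy is the more honest estimate of how the model would behave on new patients.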

### Confusion matrix

In [None]:
# Rows are actual values and columns are predictions: the argument order is (y_true, y_pred)
matriz_confusion = confusion_matrix(y, predicciones)

In [None]:
matriz_confusion
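
For a binary problem, scikit-learn lays the matrix out with actual classes on the rows and predicted classes on the columns, so flattening it yields TN, FP, FN, TP in that order. A small sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

# ravel() flattens [[TN, FP], [FN, TP]] row by row.
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print(tn, fp, fn, tp)
```

These four counts are the inputs to the precision and recall formulas used in the following sections.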

In [None]:
plt.figure(figsize=(5,5))
sns.heatmap(matriz_confusion, annot=True, fmt="d", square = True, cmap = 'Blues_r', cbar=False);  # counts are integers
plt.ylabel('Actual values');
plt.xlabel('Predictions');

### Precision

$precision =  \frac{TP}{TP+FP} $

In [None]:
precision_score(y, predicciones)  # (y_true, y_pred); swapping them would return recall instead

### Recall (Sensitivity)

$recall =  \frac{TP}{TP+FN} $

In [None]:
recall_score(y, predicciones)  # (y_true, y_pred); swapping them would return precision instead

### F1 score

In [None]:
f1_score(y, predicciones)  # (y_true, y_pred)
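
The F1 score is the harmonic mean of precision and recall, $F_1 = \frac{2 \cdot precision \cdot recall}{precision + recall}$. As a quick check on made-up labels, the sketch below computes it by hand and compares it with `f1_score`:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 1]

precision = precision_score(y_true_demo, y_pred_demo)
recall = recall_score(y_true_demo, y_pred_demo)

# Harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)

print(f1_manual, f1_score(y_true_demo, y_pred_demo))
```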

### ROC Curve

In [None]:
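# roc_curve expects (y_true, y_score); use the predicted probability of the positive class
fpr, tpr, tr = roc_curve(y, modelo.predict_proba(x)[:, 1])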

In [None]:
roc_score = roc_auc_score(y, modelo.predict_proba(x)[:, 1])  # (y_true, y_score)
roc_score

In [None]:
plt.plot(fpr, tpr, linewidth=2, label=f"ROC AUC {roc_score:.3f}")
plt.plot([0,1], [0,1], 'k--')  # diagonal: performance of a random classifier
plt.xlabel('False positive rate')
plt.ylabel('Sensitivity')
plt.legend(loc='best')
plt.grid()
plt.show()

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

In [None]:
clf = DecisionTreeClassifier(max_depth=5)

In [None]:
clf.fit(x,y)

In [None]:
plt.figure(figsize=(15,15))
tree.plot_tree(clf, feature_names=variables, filled=True)  # label each split with its column name
plt.show()
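
A plotted tree of depth 5 can be hard to read. Two text-based alternatives, sketched here on synthetic data, are `export_text`, which prints the split rules, and `feature_importances_`, which summarizes how much each feature contributes. The data, the feature names in `nombres` and `demo_tree` are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

x_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)
nombres = ["f0", "f1", "f2", "f3"]  # hypothetical feature names

demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(x_demo, y_demo)

# Split rules as indented text
print(export_text(demo_tree, feature_names=nombres))

# Importance of each feature (the values sum to 1)
importancias = dict(zip(nombres, demo_tree.feature_importances_))
print(importancias)
```

For the notebook's model, the equivalent calls would use `clf` and `variables`.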