# Machine Learning Training Course for Industry (ML CETAM) - Session 3 (Assignment)

<img src='http://ia.inf.pucp.edu.pe/static/images/logo.svg' width=300px>
<img src='https://dci.pucp.edu.pe/wp-content/uploads/2014/02/logo-color-pucp1.gif' width=200px>


PhD. Edwin Villanueva, PhD. Soledad Espezua, BSc. Daniel Saromo

<font color='#008B72'> Predicting a cancer diagnosis: dimensionality reduction with PCA. </font>

## Case: Cancer diagnosis with the Breast Cancer (Wisconsin) Dataset

Dataset source: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29

Clinical tests were performed on patients, and a physician's diagnosis of whether the tumor is malignant or benign was recorded. The dataset contains features computed from a digitized image of a sample taken by *fine-needle aspiration* ([FNA](https://en.wikipedia.org/wiki/Fine-needle_aspiration)) of a breast mass. The attributes describe characteristics of the cell nuclei present in the image, and are the following:

- radius (mean of the distances from the center to points on the perimeter)
- texture (standard deviation of the gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of the concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension ("coastline approximation" - 1)

The mean, the standard error, and the "worst" or largest value (mean of the three largest values) of each of these features were computed for each image, resulting in 30 attributes. For example, column 0 is the mean radius, column 10 is the standard error of the radius, and column 20 is the worst radius.
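The column layout described above (10 measurements, each with a mean, an error, and a worst value) can be sketched in plain Python. The helper `columns_for` is a hypothetical name introduced here for illustration; the naming pattern follows the `feature_names` that scikit-learn returns for this dataset.

```python
# The ten base measurements, in the order listed above.
base = ["radius", "texture", "perimeter", "area", "smoothness",
        "compactness", "concavity", "concave points", "symmetry",
        "fractal dimension"]

# scikit-learn names the three groups "mean <x>", "<x> error" and "worst <x>".
names = ([f"mean {b}" for b in base]
         + [f"{b} error" for b in base]
         + [f"worst {b}" for b in base])

def columns_for(measurement):
    """Return the (mean, error, worst) column indices of one measurement."""
    i = base.index(measurement)
    return i, i + 10, i + 20

print(len(names))
print([names[i] for i in columns_for("radius")])
```

Each measurement therefore occupies three columns that are exactly 10 positions apart.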

In addition, as loaded by scikit-learn's `load_breast_cancer`, the diagnosis is encoded with `0` for a malignant tumor and `1` for a benign one.
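Because the label encoding is easy to get backwards, it is worth checking it against the dataset object itself. The following is a minimal sketch, assuming scikit-learn is installed; it relies on the `target_names` attribute, whose position `k` holds the class name encoded as integer `k`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

breast = load_breast_cancer()

# target_names[k] is the class name encoded as integer k in breast.target
for code, name in enumerate(breast.target_names):
    count = int(np.sum(breast.target == code))
    print(code, name, count)
```

In scikit-learn's copy of the dataset this lists `malignant` for code 0 and `benign` for code 1.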

## Solution from the ML point of view

Given the case described, the goal of the project is to estimate the diagnosis a physician would make, using the attributes obtained from the clinical tests. The aim is not to replace the expertise of a health professional, but to obtain an estimate of how severe a clinical case may be, so that the available hospital resources can be prioritized.

Fill in the indicated fields with your code and upload your solved notebook to the PAIDEIA platform.

The file name format is: `Desafio3_APELLIDOPATERNO_NOMBRE.ipynb`. Following the file submission guidelines will affect your grade. The deadline for this challenge is posted on PAIDEIA.

## Obtaining the data

In [None]:
#libraries required
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline


In [None]:
from sklearn.datasets import load_breast_cancer

breast = load_breast_cancer() #this function loads the attributes, the target, and the column names

In [None]:
breast_data = breast.data #attributes as a numpy array
breast_data.shape

(569, 30)

In [None]:
breast_labels = breast.target #labels as a numpy array
breast_labels.shape

(569,)

In [None]:
labels = breast_labels.reshape(-1, 1) #column vector, so it can be concatenated with the attributes

In [None]:
features = breast.feature_names #let's look at the attribute names
features

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')

As you can see, the label is not included among the column names. Let's add it:

In [None]:
features_con_labels = np.append(features,'diagnostico')

With both numpy arrays, we will create a dataframe containing the attributes and the label.

In [None]:
full_breast_data = np.concatenate([breast_data,labels],axis=1)
full_breast_data.shape

(569, 31)

We create the dataframe that holds the whole dataset.

In [None]:
breast_dataset = pd.DataFrame(full_breast_data)

Let's name the dataframe's columns:

In [None]:
breast_dataset.columns = features_con_labels
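The same dataframe can also be built in fewer steps. The following is only a sketch on toy arrays that stand in for `breast.data`, `breast.target`, and `breast.feature_names`; the names `data`, `target`, `names`, and `df` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for breast.data, breast.target and breast.feature_names
data = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
target = np.array([0, 1, 1])
names = np.array(["feat_a", "feat_b"])

# Build the frame from the attributes, then attach the label as a new column
df = pd.DataFrame(data, columns=names)
df["diagnostico"] = target

print(df)
print(df.dtypes)
```

One difference from concatenating the arrays first is that the label column keeps its integer type here, instead of being cast to float along with the attributes.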

Let's explore the finished dataset:

In [None]:
breast_dataset.describe()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,diagnostico
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,0.062798,0.405172,1.216853,2.866059,40.337079,0.007041,0.025478,0.031894,0.011796,0.020542,0.003795,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946,0.627417
std,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,0.00706,0.277313,0.551648,2.021855,45.491006,0.003003,0.017908,0.030186,0.00617,0.008266,0.002646,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061,0.483918
min,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,0.04996,0.1115,0.3602,0.757,6.802,0.001713,0.002252,0.0,0.0,0.007882,0.000895,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504,0.0
25%,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,0.0577,0.2324,0.8339,1.606,17.85,0.005169,0.01308,0.01509,0.007638,0.01516,0.002248,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146,0.0
50%,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,0.06154,0.3242,1.108,2.287,24.53,0.00638,0.02045,0.02589,0.01093,0.01873,0.003187,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004,1.0
75%,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,0.06612,0.4789,1.474,3.357,45.19,0.008146,0.03245,0.04205,0.01471,0.02348,0.004558,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208,1.0
max,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,0.09744,2.873,4.885,21.98,542.2,0.03113,0.1354,0.396,0.05279,0.07895,0.02984,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075,1.0


In [None]:
breast_dataset.tail() #print the last 5 samples

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,diagnostico
564,21.56,22.39,142.0,1479.0,0.111,0.1159,0.2439,0.1389,0.1726,0.05623,1.176,1.256,7.673,158.7,0.0103,0.02891,0.05198,0.02454,0.01114,0.004239,25.45,26.4,166.1,2027.0,0.141,0.2113,0.4107,0.2216,0.206,0.07115,0.0
565,20.13,28.25,131.2,1261.0,0.0978,0.1034,0.144,0.09791,0.1752,0.05533,0.7655,2.463,5.203,99.04,0.005769,0.02423,0.0395,0.01678,0.01898,0.002498,23.69,38.25,155.0,1731.0,0.1166,0.1922,0.3215,0.1628,0.2572,0.06637,0.0
566,16.6,28.08,108.3,858.1,0.08455,0.1023,0.09251,0.05302,0.159,0.05648,0.4564,1.075,3.425,48.55,0.005903,0.03731,0.0473,0.01557,0.01318,0.003892,18.98,34.12,126.7,1124.0,0.1139,0.3094,0.3403,0.1418,0.2218,0.0782,0.0
567,20.6,29.33,140.1,1265.0,0.1178,0.277,0.3514,0.152,0.2397,0.07016,0.726,1.595,5.772,86.22,0.006522,0.06158,0.07117,0.01664,0.02324,0.006185,25.74,39.42,184.6,1821.0,0.165,0.8681,0.9387,0.265,0.4087,0.124,0.0
568,7.76,24.54,47.92,181.0,0.05263,0.04362,0.0,0.0,0.1587,0.05884,0.3857,1.428,2.548,19.15,0.007189,0.00466,0.0,0.0,0.02676,0.002783,9.456,30.37,59.16,268.6,0.08996,0.06444,0.0,0.0,0.2871,0.07039,1.0


Let's change the labels from `0` and `1` to `'maligno'` and `'benigno'`. In scikit-learn's copy of the dataset, `0` encodes malignant and `1` encodes benign:

In [None]:
#replace() returns a new column; assigning it back avoids an in-place edit on a column selection
breast_dataset['diagnostico'] = breast_dataset['diagnostico'].replace({0: 'maligno', 1: 'benigno'})

In [None]:
breast_dataset.tail()

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,radius error,texture error,perimeter error,area error,smoothness error,compactness error,concavity error,concave points error,symmetry error,fractal dimension error,worst radius,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,diagnostico
564,21.56,22.39,142.0,1479.0,0.111,0.1159,0.2439,0.1389,0.1726,0.05623,1.176,1.256,7.673,158.7,0.0103,0.02891,0.05198,0.02454,0.01114,0.004239,25.45,26.4,166.1,2027.0,0.141,0.2113,0.4107,0.2216,0.206,0.07115,maligno
565,20.13,28.25,131.2,1261.0,0.0978,0.1034,0.144,0.09791,0.1752,0.05533,0.7655,2.463,5.203,99.04,0.005769,0.02423,0.0395,0.01678,0.01898,0.002498,23.69,38.25,155.0,1731.0,0.1166,0.1922,0.3215,0.1628,0.2572,0.06637,maligno
566,16.6,28.08,108.3,858.1,0.08455,0.1023,0.09251,0.05302,0.159,0.05648,0.4564,1.075,3.425,48.55,0.005903,0.03731,0.0473,0.01557,0.01318,0.003892,18.98,34.12,126.7,1124.0,0.1139,0.3094,0.3403,0.1418,0.2218,0.0782,maligno
567,20.6,29.33,140.1,1265.0,0.1178,0.277,0.3514,0.152,0.2397,0.07016,0.726,1.595,5.772,86.22,0.006522,0.06158,0.07117,0.01664,0.02324,0.006185,25.74,39.42,184.6,1821.0,0.165,0.8681,0.9387,0.265,0.4087,0.124,maligno
568,7.76,24.54,47.92,181.0,0.05263,0.04362,0.0,0.0,0.1587,0.05884,0.3857,1.428,2.548,19.15,0.007189,0.00466,0.0,0.0,0.02676,0.002783,9.456,30.37,59.16,268.6,0.08996,0.06444,0.0,0.0,0.2871,0.07039,benigno


## Standardizing the data

We will apply sklearn's [`StandardScaler`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) class so that each attribute has mean 0 and standard deviation 1. This transformation must be applied to the attribute columns only, not to the label (diagnosis).
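To make the transformation concrete, here is a minimal numpy sketch of the same column-wise z-score on a toy matrix. The data and the names `X`, `mu`, `sigma`, and `Z` are made up for illustration; `ddof=0` (population standard deviation) matches what `StandardScaler` uses.

```python
import numpy as np

# Toy matrix: 4 samples, 2 attributes on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0]])

# Column-wise z-score; ddof=0 is the population standard deviation
mu = X.mean(axis=0)
sigma = X.std(axis=0, ddof=0)
Z = (X - mu) / sigma

print(Z.mean(axis=0))  # per-column means after scaling
print(Z.std(axis=0))   # per-column standard deviations after scaling
```

Checking the mean and standard deviation per column (`axis=0`) is more informative than checking them over the whole array, because a single global value can hide a column that was not scaled.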

In [None]:
#select all rows, but only the columns whose labels are in the variable features
x = breast_dataset.loc[:, features].values

#the next line is a self-check so you can see whether the extraction was done correctly
assert x.shape == (569,30), "The data extraction for StandardScaler was not done correctly"

In [None]:
from sklearn.preprocessing import StandardScaler
x = StandardScaler().fit_transform(x)

In [None]:
np.mean(x),np.std(x)

(-6.826538293184326e-17, 1.0)

## Principal Component Analysis (PCA) 2D

For convenience, let's change the column labels to a more generic form:

In [None]:
feat_cols = ['feature'+str(i) for i in range(x.shape[1])]
feat_cols

['feature0',
 'feature1',
 'feature2',
 'feature3',
 'feature4',
 'feature5',
 'feature6',
 'feature7',
 'feature8',
 'feature9',
 'feature10',
 'feature11',
 'feature12',
 'feature13',
 'feature14',
 'feature15',
 'feature16',
 'feature17',
 'feature18',
 'feature19',
 'feature20',
 'feature21',
 'feature22',
 'feature23',
 'feature24',
 'feature25',
 'feature26',
 'feature27',
 'feature28',
 'feature29']

In [None]:
normalised_dataset = pd.DataFrame(x,columns=feat_cols)

In [None]:
normalised_dataset.tail()

We run a two-dimensional principal component analysis:

In [None]:
from sklearn.decomposition import PCA
# TO DO ::: Complete the code
# Use as many lines of code as you need
pca_breast_dataset = ...
principalComponents_breast_dataset = ...

Let's create a new dataframe with the principal components:

In [None]:
principal_breast_Df = pd.DataFrame(data = principalComponents_breast_dataset
             , columns = ['componente principal 1', 'componente principal 2'])

principal_breast_Df.tail()

Compute the explained variance ratio (`explained_variance_ratio_`). It gives the fraction of the total variance (information) captured by each principal component after projecting the data onto a lower-dimensional subspace.
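As an illustration of where these ratios come from, the sketch below computes them by hand with numpy's SVD on synthetic data (not the cancer dataset). All variable names are assumptions made for this example; the use of `n - 1` in the variance follows the sample-variance convention.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 5 correlated attributes
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

Xc = X - X.mean(axis=0)                    # PCA works on centered data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

var = S**2 / (len(X) - 1)                  # variance along each component
ratio = var / var.sum()                    # explained variance ratio

k = 2
scores = Xc @ Vt[:k].T                     # projection onto the first k components

print(ratio)
print(scores.shape)
```

The ratios are sorted from largest to smallest and sum to 1, so the first few entries tell how much of the total variance a low-dimensional projection keeps.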

In [None]:
#You must show/print the explained_variance_ratio_ for all the principal components

To complete: The first principal component holds ...% of the information, and the second only ...%. Projecting onto a 2-dimensional space loses approximately ...% of the information.

### Graphical interpretation of the PCA

In [None]:
plt.figure(figsize=(12,12)) #a single figure; an extra plt.figure() call would leave an empty one behind
plt.xticks(fontsize=12)
plt.yticks(fontsize=14)
plt.xlabel('Componente Principal 1',fontsize=18)
plt.ylabel('Componente Principal 2',fontsize=18)
plt.title("PCA para el Breast Cancer Dataset",fontsize=20)
targets = ['benigno', 'maligno']
colors = ['g', 'r']
for color, target in zip(colors, targets):
    index = breast_dataset['diagnostico'] == target
    plt.scatter(principal_breast_Df.loc[index, 'componente principal 1'] , principal_breast_Df.loc[index, 'componente principal 2'], c = color, s = 50)

plt.legend(targets,prop={'size': 12})

According to the plot, a straight line could roughly separate the two groups of points. This is a practical illustration of why PCA is useful.

## Principal Component Analysis (PCA) 3D

Proceed in the same way, but this time with a three-dimensional PCA.

We run a 3D principal component analysis:

In [None]:
from sklearn.decomposition import PCA
pca_breast_dataset_3D = PCA(n_components=3)
principalComponents_breast_dataset_3D = pca_breast_dataset_3D.fit_transform(x)

Let's create a new dataframe with the principal components:

In [None]:
principal_breast_Df_3D = pd.DataFrame(...)

principal_breast_Df_3D.tail()

We compute the explained variance ratio (`explained_variance_ratio_`) again. Indicate the percentage of information that is lost when projecting onto 3 dimensions instead of 2.
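The information lost for a given number of components is one minus the cumulative sum of the ratios. The sketch below shows the arithmetic on made-up ratios; these values are for illustration only and are not the ones for this dataset.

```python
import numpy as np

# Made-up explained variance ratios, for illustration only
ratio = np.array([0.50, 0.25, 0.15, 0.10])

kept = np.cumsum(ratio)        # information kept with 1, 2, 3, ... components
lost = 1.0 - kept              # information discarded

for k in (2, 3):
    print(f"{k} components: kept {kept[k-1]:.0%}, lost {lost[k-1]:.0%}")
```

Adding a component can only reduce the loss, by exactly the ratio of the component that was added.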

In [None]:
# Complete the code

The third principal component holds ...% of the information. Projecting onto a 3-dimensional space loses approximately ...% of the information.

### Graphical interpretation of the PCA

In [None]:
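from mpl_toolkits.mplot3d import Axes3D #registers the '3d' projection on older matplotlib versions
fig = plt.figure()

elev = +20
azim = -60

#add_axes attaches the 3D axes to the figure; recent matplotlib versions no longer do this for Axes3D(fig)
ax = fig.add_axes([0, 0, .95, 1], projection='3d', elev=elev, azim=azim)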

ax.set_xlabel('Componente Principal 1')
ax.set_ylabel('Componente Principal 2')
ax.set_zlabel('Componente Principal 3')
ax.set_title("PCA para el Breast Cancer Dataset")


targets = ['benigno', 'maligno']
colors = ['g', 'r']
for color, target in zip(colors, targets):
    index = breast_dataset['diagnostico'] == target
    ax.scatter(principal_breast_Df_3D.loc[index, 'componente principal 1'] , principal_breast_Df_3D.loc[index, 'componente principal 2'], principal_breast_Df_3D.loc[index, 'componente principal 3'], c = color, s = 50)

plt.legend(targets,prop={'size': 10})