### Random Forests Assignment
#### Daniel Espinosa 136981


The goal of this assignment is to compare the performance of two models: **Random Forest** and **Decision Tree**.

Random Forest is expected to perform better because it is a more robust model: an ensemble made up of several decision trees.

The dataset used is the [Breast Cancer Wisconsin (Diagnostic) Data Set](https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29), which contains numeric explanatory variables derived from breast tissue samples taken from different patients. The target variable is **Diagnosis**: whether the tissue was malignant (cancer) or benign (no cancer).

The models will be trained to classify the diagnosis of each tissue sample.

The dataset variables are the following:

1) ID number

2) Diagnosis (M = malignant, B = benign) 

Ten real-valued features are computed for each cell nucleus: 

a) radius (mean of distances from center to points on the perimeter) 

b) texture (standard deviation of gray-scale values) 

c) perimeter 

d) area 

e) smoothness (local variation in radius lengths) 

f) compactness (perimeter^2 / area - 1.0) 

g) concavity (severity of concave portions of the contour) 

h) concave points (number of concave portions of the contour) 

i) symmetry 

j) fractal dimension ("coastline approximation" - 1)
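
The raw `wdbc.csv` file has no header row, so its columns show up as the integers 0 to 31. As a minimal sketch, assuming the layout documented for this dataset (ID, diagnosis, then the mean, standard error, and "worst" value of each of the ten features, in that order), descriptive names could be generated as follows. The `_mean`/`_se`/`_worst` suffixes and the `columns` variable are illustrative choices, not names taken from the file.

```python
# Illustrative column names for the header-less wdbc.csv file.
# Assumed layout: ID, diagnosis, then 10 means, 10 standard errors, 10 "worst" values.
features = ["radius", "texture", "perimeter", "area", "smoothness",
            "compactness", "concavity", "concave_points", "symmetry",
            "fractal_dimension"]
stats = ["mean", "se", "worst"]

columns = ["id", "diagnosis"] + [f"{feat}_{stat}" for stat in stats for feat in features]

print(len(columns))   # 32 columns in total
print(columns[:4])
```

These names could then be passed to `pd.read_csv("wdbc.csv", header=None, names=columns, na_values="?")` if labeled columns are preferred over integer positions.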

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import confusion_matrix

In [3]:
df = pd.read_csv("wdbc.csv",header=None, na_values = "?")
df.describe()

Unnamed: 0,0,2,3,4,5,6,7,8,9,10,...,22,23,24,25,26,27,28,29,30,31
count,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,...,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0,569.0
mean,30371830.0,14.127292,19.289649,91.969033,654.889104,0.09636,0.104341,0.088799,0.048919,0.181162,...,16.26919,25.677223,107.261213,880.583128,0.132369,0.254265,0.272188,0.114606,0.290076,0.083946
std,125020600.0,3.524049,4.301036,24.298981,351.914129,0.014064,0.052813,0.07972,0.038803,0.027414,...,4.833242,6.146258,33.602542,569.356993,0.022832,0.157336,0.208624,0.065732,0.061867,0.018061
min,8670.0,6.981,9.71,43.79,143.5,0.05263,0.01938,0.0,0.0,0.106,...,7.93,12.02,50.41,185.2,0.07117,0.02729,0.0,0.0,0.1565,0.05504
25%,869218.0,11.7,16.17,75.17,420.3,0.08637,0.06492,0.02956,0.02031,0.1619,...,13.01,21.08,84.11,515.3,0.1166,0.1472,0.1145,0.06493,0.2504,0.07146
50%,906024.0,13.37,18.84,86.24,551.1,0.09587,0.09263,0.06154,0.0335,0.1792,...,14.97,25.41,97.66,686.5,0.1313,0.2119,0.2267,0.09993,0.2822,0.08004
75%,8813129.0,15.78,21.8,104.1,782.7,0.1053,0.1304,0.1307,0.074,0.1957,...,18.79,29.72,125.4,1084.0,0.146,0.3391,0.3829,0.1614,0.3179,0.09208
max,911320500.0,28.11,39.28,188.5,2501.0,0.1634,0.3454,0.4268,0.2012,0.304,...,36.04,49.54,251.2,4254.0,0.2226,1.058,1.252,0.291,0.6638,0.2075


Some variables, such as the ID, are not relevant. The data matrix is built below in the form in which it will be used: the features first and the diagnosis as the last column.

In [25]:
data = df.to_numpy()  # DataFrame.as_matrix() was removed in pandas 1.0

data_final = np.column_stack((data[:,2:],data[:,1]))

np.random.seed(923)
X_train, X_test, Y_train, Y_test = train_test_split(data_final[:,:-1],data_final[:,-1], train_size=0.75)
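
The same split can also be written directly on the DataFrame, without stacking columns by hand. The sketch below uses a small synthetic frame standing in for `wdbc.csv` (the values are made up) and passes `random_state` and `stratify` to `train_test_split`; both are optional additions that the original cell does not use.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real file: column 0 = ID, column 1 = diagnosis,
# columns 2..4 = numeric features (the real file has 30 of them).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    0: np.arange(40),
    1: ["M"] * 16 + ["B"] * 24,
    2: rng.normal(size=40),
    3: rng.normal(size=40),
    4: rng.normal(size=40),
})

X = toy.iloc[:, 2:].to_numpy(dtype=float)  # drop ID and diagnosis
y = toy.iloc[:, 1].to_numpy()              # diagnosis labels

# random_state makes the split reproducible; stratify keeps the M/B ratio
# the same in the training and test subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=923, stratify=y
)
print(X_train.shape, X_test.shape)
```

Passing `random_state` to the function ties reproducibility to this one call instead of to the global NumPy seed, which any intermediate cell can change.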

### Decision tree performance

A decision tree with `min_samples_split = 20`, so that the tree does not grow too deep and generalizes better.
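
To see what `min_samples_split` does to the shape of the tree, the following sketch compares the default setting with the value used here. It assumes scikit-learn's bundled `load_breast_cancer` data as a stand-in for `wdbc.csv`; it is the same diagnostic dataset, but the labels are encoded as integers rather than `M`/`B`, and the fixed `random_state` is an addition for reproducibility.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

# scikit-learn ships a copy of the Wisconsin diagnostic data (569 rows, 30 features).
X, y = load_breast_cancer(return_X_y=True)

deep = DecisionTreeClassifier(random_state=0).fit(X, y)  # default min_samples_split=2
shallow = DecisionTreeClassifier(min_samples_split=20, random_state=0).fit(X, y)

for name, tree in [("min_samples_split=2", deep), ("min_samples_split=20", shallow)]:
    print(name, "depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())

# Every node that was actually split holds at least 20 training samples.
is_internal = shallow.tree_.children_left != -1
smallest_split_node = shallow.tree_.n_node_samples[is_internal].min()
print(smallest_split_node)
```

The constraint stops the tree from carving out splits on very small groups of samples, which is where a single tree tends to memorize noise.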

In [46]:
clf = DecisionTreeClassifier(min_samples_split = 20)
clf = clf.fit(X_train, Y_train)
Y_test_prediction = clf.predict(X_test)
cm = confusion_matrix(Y_test, Y_test_prediction)
tn, fp, fn, tp = cm.ravel()  # labels sort as B, M, so M (malignant) is the positive class
cm

array([[80,  3],
       [ 5, 55]])

In [44]:
# Accuracy (correct predictions / total)
accuracy = float(tp + tn) / float(tp + tn + fp + fn)
print("Accuracy: " + str(accuracy))

# Precision
precision = float(tp) / float(tp + fp)
print("Precision: " + str(precision))

# Recall
recall = float(tp) / float(tp + fn)
print("Recall: " + str(recall))

# F1
f1 = (2 * (precision * recall)) / (precision + recall)
print("F1 score: " + str(f1))

Accuracy: 0.937062937063
Precision: 0.947368421053
Recall: 0.9
F1 score: 0.923076923077

Note that, according to the execution counters, these metrics were printed by cell `In [44]`, before the fit displayed in cell `In [46]`. They correspond to a confusion matrix of `[[80, 3], [6, 54]]`, whereas the matrix shown above, `[[80, 3], [5, 55]]`, gives an accuracy of 135/143 ≈ 0.944, a precision of 55/58 ≈ 0.948, and a recall of 55/60 ≈ 0.917. The tree is fit without a fixed `random_state`, so results can presumably vary slightly between runs.


### Random Forest performance

A Random Forest with 15 trees, also with `min_samples_split = 20`.

In [47]:
clf = RandomForestClassifier(n_estimators = 15 ,min_samples_split = 20)
clf = clf.fit(X_train, Y_train)
Y_test_prediction = clf.predict(X_test)
cm = confusion_matrix(Y_test, Y_test_prediction)
tn, fp, fn, tp = cm.ravel()  # labels sort as B, M, so M (malignant) is the positive class
cm

array([[81,  2],
       [ 2, 58]])
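
A fitted forest keeps its individual trees in the `estimators_` attribute, which makes it possible to check how their outputs are combined. The sketch below, again assuming scikit-learn's bundled `load_breast_cancer` data and an added `random_state`, averages the class probabilities of the 15 trees by hand and compares the result with the forest's own `predict_proba`.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

forest = RandomForestClassifier(n_estimators=15, min_samples_split=20, random_state=0)
forest.fit(X, y)

print(len(forest.estimators_))  # one fitted DecisionTreeClassifier per tree

# Average the class probabilities predicted by the individual trees.
tree_probas = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
manual_proba = tree_probas.mean(axis=0)

# The forest's own probabilities should match this average.
print(np.allclose(manual_proba, forest.predict_proba(X)))
```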

In [48]:
# Accuracy (correct predictions / total)
accuracy = float(tp + tn) / float(tp + tn + fp + fn)
print("Accuracy: " + str(accuracy))

# Precision
precision = float(tp) / float(tp + fp)
print("Precision: " + str(precision))

# Recall
recall = float(tp) / float(tp + fn)
print("Recall: " + str(recall))

# F1
f1 = (2 * (precision * recall)) / (precision + recall)
print("F1 score: " + str(f1))

Accuracy: 0.972027972028
Precision: 0.966666666667
Recall: 0.966666666667
F1 score: 0.966666666667
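
As a quick check, the four metrics can be recomputed from the counts in the Random Forest confusion matrix, `[[81, 2], [2, 58]]`. The helper name `metrics_from_confusion` is hypothetical and only serves to avoid repeating the formulas; the sketch assumes Python 3 division.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Return accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts read off the Random Forest confusion matrix [[81, 2], [2, 58]].
accuracy, precision, recall, f1 = metrics_from_confusion(tn=81, fp=2, fn=2, tp=58)
print(round(accuracy, 6), round(precision, 6), round(recall, 6), round(f1, 6))
# 0.972028 0.966667 0.966667 0.966667
```

Precision and recall coincide here because the matrix has the same number of false positives and false negatives, so F1 equals both.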


### Comparison of model performance

|Metric|**Decision tree**|**Random Forest**|
|---|---|---|
|**Accuracy**|0.937062937063|0.972027972028|
|**Precision**|0.947368421053|0.966666666667|
|**Recall**|0.9|0.966666666667|
|**F1 Score**|0.923076923077|0.966666666667|

### Conclusions

Although the decision tree achieved good results, the Random Forest did better on every metric for this test split: accuracy rose from about 0.937 to 0.972 and recall from 0.900 to 0.967. Random Forest is an ensemble meta-model that, in this case, uses bootstrap sampling: it combines the predictions of many randomized trees, each trained on a different sample drawn with replacement from the training set, which helps control *overfitting*.
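
The bootstrap-and-combine idea can be sketched by hand with plain decision trees. This is a simplified illustration, not scikit-learn's implementation: it assumes the bundled `load_breast_cancer` data, fixed seeds, and a hard majority vote, whereas `RandomForestClassifier` averages the trees' predicted probabilities.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=0
)

rng = np.random.default_rng(0)
n_trees = 15  # odd, so a 0/1 majority vote cannot tie
votes = np.zeros((n_trees, len(X_test)), dtype=int)

for i in range(n_trees):
    # Bootstrap: draw len(X_train) row indices with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(
        min_samples_split=20,
        max_features="sqrt",  # random subset of features considered at each split
        random_state=i,
    )
    tree.fit(X_train[idx], y_train[idx])
    votes[i] = tree.predict(X_test)

# Majority vote over the trees (labels are 0/1, so the mean is the vote share).
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)
print("ensemble accuracy:", (ensemble_pred == y_test).mean())
```

Each tree sees a slightly different training set and a different subset of features at each split, so their individual errors are less correlated and tend to cancel out in the vote.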

It is worth keeping in mind that many machine learning techniques can be combined in different ways to build more robust models that perform better when classifying or doing regression on different kinds of data.