<h1 align='center'> Linear Classification Model with TensorFlow 2.0 </h1>

<h3>Authors</h3>

1. Alvaro Mauricio Montenegro Díaz, ammontenegrod@unal.edu.co
2. Daniel Mauricio Montenegro Reyes, dextronomo@gmail.com 

<h3>Fork</h3>

<h3>References</h3>

1. <table class="tfo-notebook-buttons" align="left">
  <tr>
  <td>
    <a target="_blank" href="https://colab.research.google.com/drive/1qNxKmi0QpkunqTDdpXfVLlneG-NFDN9c"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />View this code in Google Colab</a>
  </td>
  </tr>
</table>

<h2> 1. Introduction </h2>

This code was taken and adapted from [Google Colab](https://colab.research.google.com/drive/1qNxKmi0QpkunqTDdpXfVLlneG-NFDN9c). In this exercise we use the well-known *iris* dataset. However, not all of the data are used, because this exercise introduces the classic logistic model, which separates observations into two classes. The observations of the first class (Setosa) are omitted and the remaining labels are recoded so that there are only two classes. A later exercise will use the full dataset.
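
As a minimal sketch of the logistic model mentioned above (not the estimator trained later in this notebook), the probability of the positive class is the sigmoid of a linear score, and the two classes are separated by thresholding that probability at 0.5. The weights `w`, the bias `b` and the sample `x` below are made-up values used only for illustration.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    """P(class = 1 | x) for a linear model with weights w and bias b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(score)

# Hypothetical weights and bias, chosen only to illustrate the computation.
w = [0.5, -0.25, 1.0, 2.0]
b = -0.5
x = [0.0, 0.0, 0.0, 0.0]          # a sample at the origin of the feature space

p = predict_proba(x, w, b)         # equals sigmoid(-0.5)
label = 1 if p >= 0.5 else 0       # threshold at 0.5 to separate the two classes
print(round(p, 4), label)
```

With standardized features, a sample at the origin is an "average" flower, so its predicted class depends only on the bias.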


<h2> 2. Import the required modules </h2>

In [113]:
try:
  %tensorflow_version 2.x
except Exception:
  pass

In [114]:
from __future__ import absolute_import, division, print_function, unicode_literals

import pandas as pd
import seaborn as sb
import tensorflow as tf
from tensorflow import keras
from tensorflow.estimator import LinearClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score

print(tf.__version__)

2.1.0


<h2> 3. Load and configure the Iris Dataset </h2>


In [136]:
# column names of the data
col_names = ['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Species']
target_dimensions = ['Setosa', 'Versicolor', 'Virginica']

# read the data
training_data_path = tf.keras.utils.get_file("iris_training.csv", "https://storage.googleapis.com/download.tensorflow.org/data/iris_training.csv")
test_data_path = tf.keras.utils.get_file("iris_test.csv", "https://storage.googleapis.com/download.tensorflow.org/data/iris_test.csv")
training = pd.read_csv(training_data_path, names=col_names, header=0)
test = pd.read_csv(test_data_path, names=col_names, header=0)

# drop class 0 ("Setosa") and recode the training data: 1 -> 0, 2 -> 1
# (.copy() avoids pandas' SettingWithCopyWarning on the assignment below)
training = training[training['Species'] >= 1].copy()
training['Species'] = training['Species'].replace([1,2], [0,1])

# drop class 0 ("Setosa") and recode the validation data: 1 -> 0, 2 -> 1
test = test[test['Species'] >= 1].copy()
test['Species'] = test['Species'].replace([1,2], [0,1])

# reset the indices of both dataframes so they can be concatenated
training.reset_index(drop=True, inplace=True)
test.reset_index(drop=True, inplace=True)

# concatenate the dataframes
iris_dataset = pd.concat([training, test], axis=0)
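
The filtering and recoding step can be checked on a tiny hand-made frame. This is only a sketch: the `toy` data are invented, and the explicit `map` dictionary is an alternative spelling of the recoding, not the call used in the cell above.

```python
import pandas as pd

# Toy data standing in for the Iris labels (0 = Setosa, 1 = Versicolor, 2 = Virginica).
toy = pd.DataFrame({"PetalLength": [1.4, 4.7, 6.0, 1.3, 5.1],
                    "Species": [0, 1, 2, 0, 2]})

# Keep only classes 1 and 2; .copy() gives an independent frame to modify.
binary = toy[toy["Species"] >= 1].copy()

# Recode with an explicit mapping: Versicolor -> 0, Virginica -> 1.
binary["Species"] = binary["Species"].map({1: 0, 2: 1})
binary = binary.reset_index(drop=True)

print(binary)
```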

<h2> 4. A first descriptive look at the data</h2>

In [137]:
iris_dataset.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SepalLength | 100.0 | 6.262 | 0.662834 | 4.9 | 5.8 | 6.3 | 6.7 | 7.9 |
| SepalWidth | 100.0 | 2.872 | 0.332751 | 2.0 | 2.7 | 2.9 | 3.025 | 3.8 |
| PetalLength | 100.0 | 4.906 | 0.825578 | 3.0 | 4.375 | 4.9 | 5.525 | 6.9 |
| PetalWidth | 100.0 | 1.676 | 0.424769 | 1.0 | 1.3 | 1.6 | 2.0 | 2.5 |
| Species | 100.0 | 0.5 | 0.502519 | 0.0 | 0.0 | 0.5 | 1.0 | 1.0 |


sb.pairplot(iris_dataset, diag_kind="kde")

In [138]:
correlation_data = iris_dataset.corr()
correlation_data.style.background_gradient(cmap='coolwarm', axis=None)

| | SepalLength | SepalWidth | PetalLength | PetalWidth | Species |
|---|---|---|---|---|---|
| SepalLength | 1.0 | 0.553855 | 0.828479 | 0.593709 | 0.494305 |
| SepalWidth | 0.553855 | 1.0 | 0.519802 | 0.566203 | 0.30808 |
| PetalLength | 0.828479 | 0.519802 | 1.0 | 0.823348 | 0.786424 |
| PetalWidth | 0.593709 | 0.566203 | 0.823348 | 1.0 | 0.828129 |
| Species | 0.494305 | 0.30808 | 0.786424 | 0.828129 | 1.0 |


<h2> 5. Separate features and targets </h2>

In [139]:
X_data = iris_dataset[[m for m in iris_dataset.columns if m not in ['Species']]]
Y_data = iris_dataset[['Species']]

<h2> 6. Split the data: training and validation </h2>

In [140]:
training_features, test_features, training_labels, test_labels = train_test_split(X_data, Y_data, test_size=0.2)
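
The split above is random and unseeded, so the rows that land in each set change from run to run. A hedged sketch of a reproducible, class-balanced variant follows; the synthetic `X` and `y`, the seed `42` and the use of `stratify` are assumptions for illustration and were not used to produce the outputs in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 100 recoded Iris rows: 4 features, balanced binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 50 + [1] * 50)

# random_state fixes the shuffle; stratify keeps the class ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(X_train.shape, X_test.shape)
print(np.bincount(y_train), np.bincount(y_test))
```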

In [141]:
print('No. of rows in Training Features: ', training_features.shape[0])
print('No. of rows in Test Features: ', test_features.shape[0])
print('No. of columns in Training Features: ', training_features.shape[1])
print('No. of columns in Test Features: ', test_features.shape[1])

print('No. of rows in Training Label: ', training_labels.shape[0])
print('No. of rows in Test Label: ', test_labels.shape[0])
print('No. of columns in Training Label: ', training_labels.shape[1])
print('No. of columns in Test Label: ', test_labels.shape[1])

No. of rows in Training Features:  80
No. of rows in Test Features:  20
No. of columns in Training Features:  4
No. of columns in Test Features:  4
No. of rows in Training Label:  80
No. of rows in Test Label:  20
No. of columns in Training Label:  1
No. of columns in Test Label:  1


In [142]:
stats = training_features.describe()
stats = stats.transpose()
stats

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SepalLength | 80.0 | 6.26 | 0.68945 | 4.9 | 5.775 | 6.25 | 6.7 | 7.9 |
| SepalWidth | 80.0 | 2.8675 | 0.33367 | 2.0 | 2.7 | 2.9 | 3.0 | 3.8 |
| PetalLength | 80.0 | 4.865 | 0.841503 | 3.0 | 4.275 | 4.8 | 5.5 | 6.9 |
| PetalWidth | 80.0 | 1.64875 | 0.422768 | 1.0 | 1.3 | 1.55 | 2.0 | 2.5 |


In [143]:
stats = test_features.describe()
stats = stats.transpose()
stats

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SepalLength | 20.0 | 6.27 | 0.559229 | 5.4 | 5.95 | 6.3 | 6.475 | 7.7 |
| SepalWidth | 20.0 | 2.89 | 0.337015 | 2.2 | 2.675 | 2.9 | 3.125 | 3.4 |
| PetalLength | 20.0 | 5.07 | 0.756098 | 3.5 | 4.65 | 5.05 | 5.6 | 6.7 |
| PetalWidth | 20.0 | 1.785 | 0.425843 | 1.0 | 1.5 | 1.75 | 2.05 | 2.5 |


<h2> 8. Normalize the data</h2>

In [144]:
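def norm(x):
  # Standardize each column with the mean and standard deviation of x itself.
  stats = x.describe()
  stats = stats.transpose()
  return (x - stats['mean']) / stats['std']

# Note: each set is standardized with its own statistics.
normed_train_features = norm(training_features)
normed_test_features = norm(test_features)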
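
The cell above standardizes the test features with the test set's own mean and standard deviation. A common alternative, sketched here under the assumption that the model should see test data on the same scale as the training data, is to reuse the training statistics. The helper `norm_with` and the toy frames are hypothetical and are not part of the original notebook.

```python
import pandas as pd

def norm_with(reference, x):
    """Standardize x using the column means and stds of the reference frame."""
    return (x - reference.mean()) / reference.std()

# Toy frames with easy statistics: means 5.0 and 1.5, sample stds 1.0 and 0.5.
train = pd.DataFrame({"PetalLength": [4.0, 5.0, 6.0], "PetalWidth": [1.0, 1.5, 2.0]})
test = pd.DataFrame({"PetalLength": [5.0, 7.0], "PetalWidth": [1.5, 2.5]})

normed_train = norm_with(train, train)
normed_test = norm_with(train, test)   # training statistics, not the test set's own
print(normed_test)
```

With this choice, a test value equal to the training mean maps to 0 regardless of what else is in the test set.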

<h2> 9. Build the input pipeline for feeding data to TensorFlow</h2>

In [145]:
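def feed_input(features_dataframe, target_dataframe, num_of_epochs=10, shuffle=True, batch_size=32):
  # Returns a zero-argument input function, which is what the Estimator API expects.
  def input_feed_function():
    # Each element is a (dict of feature columns, label) pair.
    dataset = tf.data.Dataset.from_tensor_slices((dict(features_dataframe), target_dataframe))
    if shuffle:
      # A buffer larger than the dataset gives a full shuffle.
      dataset = dataset.shuffle(2000)
    # Batch first, then repeat, so a batch never mixes rows from two epochs.
    dataset = dataset.batch(batch_size).repeat(num_of_epochs)
    return dataset
  return input_feed_function

# Training feed: shuffled, 10 epochs.
train_feed_input = feed_input(normed_train_features, training_labels)
# Prediction feeds: a single ordered pass, so predictions line up with the labels.
train_feed_input_testing = feed_input(normed_train_features, training_labels, num_of_epochs=1, shuffle=False)
test_feed_input = feed_input(normed_test_features, test_labels, num_of_epochs=1, shuffle=False)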
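
The number of training steps follows from the defaults of `feed_input` and the 80 training rows reported in section 6. The arithmetic below is a small sketch of that calculation; it agrees with the checkpoint saved at step 30 in the training log of section 10.

```python
import math

num_rows = 80        # training rows reported in section 6
batch_size = 32      # default in feed_input
num_of_epochs = 10   # default in feed_input

# batch() is applied before repeat(), so each epoch yields batches of 32, 32 and 16 rows.
batches_per_epoch = math.ceil(num_rows / batch_size)
total_steps = batches_per_epoch * num_of_epochs
print(batches_per_epoch, total_steps)
```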

<h2> 10. Training the model</h2>

In [146]:
feature_columns_numeric = [tf.feature_column.numeric_column(m) for m in training_features.columns]

In [147]:
logistic_model = LinearClassifier(feature_columns=feature_columns_numeric)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmp_9esi7oq', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': ClusterSpec({}), '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}


In [148]:
logistic_model.train(train_feed_input)

INFO:tensorflow:Calling model_fn.


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into /tmp/tmp_9esi7oq/model.ckpt.
INFO:tensorflow:loss = 0.6931472, step = 0
INFO:tensorflow:Saving checkpoints for 30 into /tmp/tmp_9esi7oq/model.ckpt.
INFO:tensorflow:Loss for final step: 0.17787997.


<tensorflow_estimator.python.estimator.canned.linear.LinearClassifierV2 at 0x7f7ff017c310>
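
The initial loss of 0.6931472 in the log above is what a logistic model produces before it has learned anything. Assuming the weights and bias start at zero (an assumption about the estimator's initialization, not something the log states), every sample gets probability 0.5, and the binary cross-entropy of that prediction is ln 2 for either label. A short check:

```python
import math

def binary_cross_entropy(y_true, p):
    """Log loss of predicting probability p for the positive class."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

p_initial = 1.0 / (1.0 + math.exp(-0.0))   # sigmoid of a zero score
loss_pos = binary_cross_entropy(1, p_initial)
loss_neg = binary_cross_entropy(0, p_initial)
print(loss_pos, loss_neg, math.log(2))
```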

<h2> 11. Predictions</h2>

In [150]:
train_predictions = logistic_model.predict(train_feed_input_testing)
test_predictions = logistic_model.predict(test_feed_input)

In [151]:
train_predictions_series = pd.Series([p['classes'][0].decode("utf-8")   for p in train_predictions])
test_predictions_series = pd.Series([p['classes'][0].decode("utf-8")   for p in test_predictions])

INFO:tensorflow:Calling model_fn.


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmp_9esi7oq/model.ckpt-30
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Calling model_fn.


To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.

INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restor
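
Each element yielded by `predict` is a dictionary, and its `'classes'` entry holds the label as a byte string, which is why the cell above calls `.decode("utf-8")` and why the resulting series has `dtype: object`. The sketch below uses hand-written stand-ins for those dictionaries (the values are invented) and also reads the integer `'class_ids'` entry, which avoids the string round trip.

```python
import numpy as np

# Hand-written stand-ins for the dictionaries yielded by predict(); values are invented.
mock_predictions = [
    {"classes": np.array([b"1"], dtype=object), "class_ids": np.array([1])},
    {"classes": np.array([b"0"], dtype=object), "class_ids": np.array([0])},
]

# Same decoding as in the notebook: bytes -> str.
as_strings = [p["classes"][0].decode("utf-8") for p in mock_predictions]

# Alternative: take the integer class id directly.
as_ints = [int(p["class_ids"][0]) for p in mock_predictions]
print(as_strings, as_ints)
```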

In [152]:
train_predictions_df = pd.DataFrame(train_predictions_series, columns=['predictions'])
test_predictions_df = pd.DataFrame(test_predictions_series, columns=['predictions'])

In [153]:
training_labels.reset_index(drop=True, inplace=True)
train_predictions_df.reset_index(drop=True, inplace=True)

test_labels.reset_index(drop=True, inplace=True)
test_predictions_df.reset_index(drop=True, inplace=True)

In [154]:
train_labels_with_predictions_df = pd.concat([training_labels, train_predictions_df], axis=1)
test_labels_with_predictions_df = pd.concat([test_labels, test_predictions_df], axis=1)
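
The `reset_index` calls before the concatenation matter because `pd.concat(..., axis=1)` aligns rows by index label, not by position. After a random split the labels keep their original, shuffled index, while the predictions are numbered from 0. A toy illustration with invented values:

```python
import pandas as pd

# Labels keep a shuffled index after a random split; predictions are numbered 0, 1, 2.
labels = pd.DataFrame({"Species": [0, 1, 1]}, index=[57, 3, 91])
preds = pd.DataFrame({"predictions": ["0", "1", "1"]})

# Without resetting, concat builds the union of both indices and fills gaps with NaN.
misaligned = pd.concat([labels, preds], axis=1)

# With matching 0..n-1 indices, rows pair up by position as intended.
aligned = pd.concat([labels.reset_index(drop=True), preds], axis=1)
print(misaligned.shape, aligned.shape)
```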

<h2> 12. Validation</h2>

In [155]:
def calculate_binary_class_scores(y_true, y_pred):
  accuracy = accuracy_score(y_true, y_pred.astype('int64'))
  precision = precision_score(y_true, y_pred.astype('int64'))
  recall = recall_score(y_true, y_pred.astype('int64'))
  return accuracy, precision, recall

- **accuracy_score**: In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true. In the binary case used here, it reduces to the fraction of samples classified correctly.
- **precision_score**: the ratio $\frac{tp}{tp + fp}$, where $tp$ is the number of true positives and $fp$ the number of false positives. The best value is 1 and the worst value is 0.
- **recall_score**: the ratio $\frac{tp}{tp + fn}$, where $tp$ is the number of true positives and $fn$ the number of false negatives. Recall is intuitively the ability of the classifier to find all the positive samples. The best value is 1 and the worst value is 0.
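
The three definitions above can be written out in plain Python. This is a minimal sketch on invented labels, not a replacement for the scikit-learn functions; it does not handle the case where a denominator is zero.

```python
def binary_scores(y_true, y_pred):
    """Accuracy, precision and recall for 0/1 labels, computed from the four counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)   # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)   # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)   # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)   # true negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

# Invented labels: tp = 3, fn = 1, fp = 1, tn = 1.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]

accuracy, precision, recall = binary_scores(y_true, y_pred)
print(accuracy, precision, recall)
```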

In [156]:
train_accuracy_score, train_precision_score, train_recall_score = calculate_binary_class_scores(training_labels, train_predictions_series)
test_accuracy_score, test_precision_score, test_recall_score = calculate_binary_class_scores(test_labels, test_predictions_series)

print('Training Data Accuracy (%) = ', round(train_accuracy_score*100,2))
print('Training Data Precision (%) = ', round(train_precision_score*100,2))
print('Training Data Recall (%) = ', round(train_recall_score*100,2))
print('-'*50)
print('Test Data Accuracy (%) = ', round(test_accuracy_score*100,2))
print('Test Data Precision (%) = ', round(test_precision_score*100,2))
print('Test Data Recall (%) = ', round(test_recall_score*100,2))


Training Data Accuracy (%) =  98.75
Training Data Precision (%) =  100.0
Training Data Recall (%) =  97.44
--------------------------------------------------
Test Data Accuracy (%) =  90.0
Test Data Precision (%) =  100.0
Test Data Recall (%) =  81.82
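
The reported percentages pin down the underlying counts. The confusion counts below are inferred from the printed scores and the set sizes (80 and 20 rows); they are an assumption, not output from the notebook. A precision of 100% means no false positives, so every error is a missed positive: one in training and two in test.

```python
def scores(tp, fp, fn, tn):
    """Accuracy, precision and recall in percent, rounded as in the printed report."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return tuple(round(100 * v, 2) for v in (accuracy, precision, recall))

# Counts inferred from the reported percentages (an assumption, not notebook output).
train_counts = dict(tp=38, fp=0, fn=1, tn=41)   # 80 training rows
test_counts = dict(tp=9, fp=0, fn=2, tn=9)      # 20 test rows

train_scores = scores(**train_counts)
test_scores = scores(**test_counts)
print(train_scores)
print(test_scores)
```

Under these counts the two sets contain 39 + 11 = 50 positives in total, which matches the mean of 0.5 for `Species` in section 4.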


In [134]:
train_predictions_series

0     1
1     1
2     0
3     1
4     1
     ..
75    1
76    1
77    1
78    1
79    0
Length: 80, dtype: object

In [135]:
train_predictions_df 

| | predictions |
|---|---|
| 0 | 1 |
| 1 | 1 |
| 2 | 0 |
| 3 | 1 |
| 4 | 1 |
| ... | ... |
| 75 | 1 |
| 76 | 1 |
| 77 | 1 |
| 78 | 1 |
