# Project 1 - MLOps

**Team members:**

- Camilo Diaz Granados Nobman
- Luis Carlos Fernandez Vargas
- Daniel Alejandro Ruiz


This project evaluates the ability to build a machine learning development environment that supports data ingestion, validation, and transformation, while demonstrating that both the code and the development environment can be versioned.


The project uses a variant of the Forest Cover Type dataset, as proposed in the assignment. It can be used **to train a model that predicts forest cover type from cartographic variables**.

## 1. Loading the Dataset

We begin by downloading the dataset if it is not already available locally:

In [1]:
import os
import requests

# Directory path pointing to the 'data' folder
_data_root = '../data/covertype'
# Path to the training data file
_data_filepath = os.path.join(_data_root, 'covertype_train.csv')
# Download the data only if it is not already present
os.makedirs(_data_root, exist_ok=True)
if not os.path.isfile(_data_filepath):
    # Dataset URL
    url = 'https://docs.google.com/uc?export=download&confirm={{VALUE}}&id=1lVF1BCWLH4eXXV_YOJzjR7xZjj-wAGj9'
    r = requests.get(url, allow_redirects=True, stream=True)
    # Fail early on an HTTP error instead of saving an error page
    r.raise_for_status()
    # Use a context manager so the file is closed after writing
    with open(_data_filepath, 'wb') as f:
        f.write(r.content)
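
Downloads from file-sharing links can sometimes return an HTML page instead of the file itself, so it may be worth checking the first line of the downloaded file before loading it. The sketch below is one possible check and is not part of the original notebook; `looks_like_csv` and its `required_columns` argument are hypothetical names introduced for illustration.

```python
import csv
import os
import tempfile


def looks_like_csv(path, required_columns):
    """Return True if the first row of `path` contains every required column."""
    # errors="replace" keeps the check from crashing on unexpected bytes
    with open(path, newline="", encoding="utf-8", errors="replace") as f:
        header = next(csv.reader(f), [])
    return set(required_columns).issubset(header)


# Demo on two temporary files: a small CSV and an HTML page
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.csv")
    bad = os.path.join(tmp, "bad.csv")
    with open(good, "w", encoding="utf-8") as f:
        f.write("Elevation,Slope,Cover_Type\n2991,7,1\n")
    with open(bad, "w", encoding="utf-8") as f:
        f.write("<!DOCTYPE html><html><body>confirm</body></html>\n")

    print(looks_like_csv(good, ["Elevation", "Cover_Type"]))  # True
    print(looks_like_csv(bad, ["Elevation", "Cover_Type"]))   # False
```

In the notebook, a call such as `looks_like_csv(_data_filepath, ['Elevation', 'Cover_Type'])` could be run right after the download step.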


## 2. Feature Selection

We import the libraries needed to validate the data:

In [2]:
import os
import requests
import pandas as pd
import tensorflow as tf
import tensorflow_data_validation as tfdv
from sklearn.model_selection import train_test_split
print('TF version:', tf.__version__)
print('TFDV version:', tfdv.version.__version__)

TF version: 2.16.2
TFDV version: 1.16.1


In [3]:
# Load the dataset into a DataFrame from its new location
data_filepath = '../data/covertype/covertype_train.csv'
df = pd.read_csv(data_filepath, index_col=False)
df.head()

| | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | Horizontal_Distance_To_Fire_Points | Wilderness_Area | Soil_Type | Cover_Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2991 | 119 | 7 | 67 | 11 | 1015 | 233 | 234 | 133 | 1570 | Commanche | C7202 | 1 |
| 1 | 2876 | 3 | 18 | 485 | 71 | 2495 | 192 | 202 | 144 | 1557 | Commanche | C7757 | 1 |
| 2 | 3171 | 315 | 2 | 277 | 9 | 4374 | 213 | 237 | 162 | 1052 | Rawah | C7745 | 0 |
| 3 | 3087 | 342 | 13 | 190 | 31 | 4774 | 193 | 221 | 166 | 752 | Rawah | C7745 | 0 |
| 4 | 2835 | 158 | 10 | 212 | 41 | 3596 | 231 | 242 | 141 | 3280 | Rawah | C4744 | 1 |


In [4]:
# Inspect the feature information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116203 entries, 0 to 116202
Data columns (total 13 columns):
 #   Column                              Non-Null Count   Dtype 
---  ------                              --------------   ----- 
 0   Elevation                           116203 non-null  int64 
 1   Aspect                              116203 non-null  int64 
 2   Slope                               116203 non-null  int64 
 3   Horizontal_Distance_To_Hydrology    116203 non-null  int64 
 4   Vertical_Distance_To_Hydrology      116203 non-null  int64 
 5   Horizontal_Distance_To_Roadways     116203 non-null  int64 
 6   Hillshade_9am                       116203 non-null  int64 
 7   Hillshade_Noon                      116203 non-null  int64 
 8   Hillshade_3pm                       116203 non-null  int64 
 9   Horizontal_Distance_To_Fire_Points  116203 non-null  int64 
 10  Wilderness_Area                     116203 non-null  object
 11  Soil_Type                           116

Since two of the features are categorical, we split the features into numeric and categorical subsets in order to assess which ones have the greatest impact on predicting the `Cover_Type` label.

In [5]:
# Create the numeric and categorical subsets
numeric_features = df.select_dtypes(include='number')
categorical_features = df.select_dtypes(include='object')

# info() prints its report directly, so it does not need to be wrapped in print()
print("Numeric features:")
numeric_features.info()

print("\nCategorical features:")
categorical_features.info()

Numeric features:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 116203 entries, 0 to 116202
Data columns (total 11 columns):
 #   Column                              Non-Null Count   Dtype
---  ------                              --------------   -----
 0   Elevation                           116203 non-null  int64
 1   Aspect                              116203 non-null  int64
 2   Slope                               116203 non-null  int64
 3   Horizontal_Distance_To_Hydrology    116203 non-null  int64
 4   Vertical_Distance_To_Hydrology      116203 non-null  int64
 5   Horizontal_Distance_To_Roadways     116203 non-null  int64
 6   Hillshade_9am                       116203 non-null  int64
 7   Hillshade_Noon                      116203 non-null  int64
 8   Hillshade_3pm                       116203 non-null  int64
 9   Horizontal_Distance_To_Fire_Points  116203 non-null  int64
 10  Cover_Type                          116203 non-null  int64
dtypes: int64(11)
memory usage

Following the workshop instructions, we keep only the numeric features and separate out the target variable `Cover_Type`.

In [6]:
# Define the target variable and remove Cover_Type from the input columns
y = numeric_features['Cover_Type']
X = numeric_features.drop('Cover_Type', axis=1, errors='ignore')

print("Shape of X before selection:", X.shape)

Shape of X before selection: (116203, 10)


We import the libraries needed to select the 8 best features with `SelectKBest`:

In [7]:
from sklearn.feature_selection import SelectKBest, f_classif

In [8]:
k = 8

# Create the selector with the chosen scoring function
selector = SelectKBest(score_func=f_classif, k=k)

# Fit the selector and transform X
X_selected = selector.fit_transform(X, y)

# Get the names of the selected columns
selected_cols = X.columns[selector.get_support()]
print(f"Selected columns (k={k}):\n", selected_cols.tolist())

Selected columns (k=8):
 ['Elevation', 'Slope', 'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways', 'Hillshade_9am', 'Hillshade_Noon', 'Horizontal_Distance_To_Fire_Points']
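
`f_classif` scores each feature with a one-way ANOVA F-statistic, that is, the ratio of between-class variance to within-class variance for that feature. The pure-Python sketch below computes that statistic for a single feature on made-up toy values, only to illustrate what the score measures. `anova_f` is a hypothetical helper, and the numbers are not taken from the covertype data.

```python
from collections import defaultdict


def anova_f(values, labels):
    """One-way ANOVA F-statistic of one feature grouped by class label."""
    groups = defaultdict(list)
    for v, label in zip(values, labels):
        groups[label].append(v)

    n = len(values)
    k = len(groups)
    grand_mean = sum(values) / n

    # Between-class sum of squares: distance of each class mean from the grand mean
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Within-class sum of squares: spread of values around their own class mean
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )

    return (ss_between / (k - 1)) / (ss_within / (n - k))


labels = [0, 0, 0, 1, 1, 1]
separating = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]   # class means differ a lot
overlapping = [1.0, 5.0, 9.0, 2.0, 5.0, 8.0]  # class means are identical

print(anova_f(separating, labels))   # 54.0
print(anova_f(overlapping, labels))  # 0.0
```

A higher score means the feature separates the classes better; `SelectKBest` keeps the `k` features with the highest scores.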


This matches the result expected by the workshop: `Aspect` and `Hillshade_3pm` are dropped from the selected features. We then write the selected data to a CSV file.

In [12]:
# Build the DataFrame with the selected columns and the target variable
df_selected = pd.DataFrame(X_selected, columns=selected_cols)
df_selected['target'] = y.values

# Define the external data directory using a relative path
external_data_dir = os.path.abspath(os.path.join(os.getcwd(), '..', 'data'))
# Arrange the inner folders as needed; in this case:
output_dir = os.path.join(external_data_dir, 'selected_dataset')
os.makedirs(output_dir, exist_ok=True)

# Full path where the CSV will be saved
output_path = os.path.join(output_dir, 'selected.csv')
df_selected.to_csv(output_path, index=False)

## 3. Data Pipeline

First we set up the interactive context, which lets us run pipeline components manually from a notebook, by defining the pipeline root path.
The SQLite metadata database is kept in this pipeline's root directory.

In [9]:
# Import the required libraries
from tfx.components import CsvExampleGen
from tfx.components import ExampleValidator
from tfx.components import SchemaGen
from tfx.components import StatisticsGen
from tfx.components import Transform

from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext
from google.protobuf.json_format import MessageToDict

import pprint
pp = pprint.PrettyPrinter()

In [13]:
# Location of the pipeline metadata
_pipeline_root = './pipeline_root/'

# Directory containing the raw data files
_data_root = '../data/selected_dataset/'

# Path to the raw training data
_data_filepath = os.path.join(_data_root, 'selected.csv')

In [14]:
# Initialize the InteractiveContext with a local SQLite file.
context = InteractiveContext(pipeline_root=_pipeline_root)



### Generating the Examples

In [15]:
# Instantiate ExampleGen with the input CSV dataset
example_gen = CsvExampleGen(input_base=_data_root)

In [16]:
# Execute the component
context.run(example_gen)





| Field | Value |
|---|---|
| `.execution_id` | 1 |
| `.component` | CsvExampleGen |
| `.component.inputs` | `{}` |
| `.component.outputs['examples']` | Channel of type 'Examples' (1 artifact) |

Artifact of type 'Examples' (`._artifacts[0]`):

| Field | Value |
|---|---|
| `.type` | `<class 'tfx.types.standard_artifacts.Examples'>` |
| `.uri` | ./pipeline_root/CsvExampleGen/examples/1 |
| `.span` | 0 |
| `.split_names` | `["train", "eval"]` |
| `.version` | 0 |

`.exec_properties`:

| Property | Value |
|---|---|
| `['input_base']` | ../data/selected_dataset/ |
| `['input_config']` | `{"splits": [{"name": "single_split", "pattern": "*"}]}` |
| `['output_config']` | `{"split_config": {"splits": [{"hash_buckets": 2, "name": "train"}, {"hash_buckets": 1, "name": "eval"}]}}` |
| `['output_data_format']` | 6 |
| `['output_file_format']` | 5 |
| `['custom_config']` | None |
| `['range_config']` | None |
| `['span']` | 0 |
| `['version']` | None |
| `['input_fingerprint']` | split:single_split,num_files:1,total_bytes:3945344,xor_checksum:1740891820,sum_checksum:1740891820 |
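
The `output_config` above assigns two hash buckets to `train` and one to `eval`, which gives roughly a 2:1 split. The sketch below illustrates the general idea of hash-bucket splitting under stated assumptions: it uses `hashlib.md5` as a stand-in hash and a hypothetical `assign_split` helper, so it will not reproduce the exact assignment that `CsvExampleGen` makes.

```python
import hashlib


def assign_split(record, buckets=(("train", 2), ("eval", 1))):
    """Map a record to a split by hashing it into one of sum(sizes) buckets."""
    total = sum(size for _, size in buckets)
    digest = hashlib.md5(record.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % total

    # Walk the splits in order; each one owns `size` consecutive buckets
    upper = 0
    for name, size in buckets:
        upper += size
        if bucket < upper:
            return name
    raise AssertionError("unreachable: bucket is always < total")


# Toy records standing in for serialized CSV rows
records = [f"row-{i}" for i in range(3000)]
counts = {"train": 0, "eval": 0}
for r in records:
    counts[assign_split(r)] += 1

print(counts)  # roughly two thirds train, one third eval
```

Because the assignment depends only on the content of each record, rerunning the split over the same data sends every record to the same side.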
