### Pipeline - Python Basic

#### Preparing the Tools

In [12]:
#Loading Libraries
import pandas as pd

###### These settings remove the pandas display limits, so every row, column, and cell value is shown in full when we inspect the dataset.

In [31]:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
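
###### The settings above stay active for the whole session. If you only need the full view for a single inspection, one option is `pd.option_context`, which restores the previous values on exit. This is a minimal sketch; `df_demo` is a made-up frame, not the SECOP dataset.

```python
import pandas as pd

# Toy frame, for illustration only (not the SECOP dataset).
df_demo = pd.DataFrame({"text": ["x" * 80], "n": [1]})

# Remember the current setting so we can compare after the block.
before = pd.get_option("display.max_colwidth")

# The limits are lifted only inside the "with" block.
with pd.option_context(
    "display.max_rows", None,
    "display.max_columns", None,
    "display.max_colwidth", None,
):
    inside = pd.get_option("display.max_colwidth")
    print(df_demo)

# On exit, pandas restores the value that was active before the block.
outside = pd.get_option("display.max_colwidth")
```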

#### Loading the Dataset

In [13]:
# Load the dataset from the datos.gov.co API (the URL caps the request at 5,000 rows)

url = "https://www.datos.gov.co/resource/w8tr-p8mr.json?$limit=5000"
df_main = pd.read_json(url)

###### Summary - Information on Electronic Contracts in SECOP I and II of the Personería Distrital de Santiago de Cali

###### I want to work with real data, the kind we actually meet day to day, so I rely on the Colombian government's open data portal. Much of this data comes from surveys or questionnaires with no standard for the answers that can be given. Names, for example, appear in both lowercase and uppercase. That makes these datasets ideal for applying as many tools as possible.

###### Before working on a dataset, we must be clear about its structure and its unit. Every measure or calculation produces a result, and we need to know what that result is expressed in, for example X units of some product. In this case, the unit of our DataFrame is CONTRACTS.

###### Unit: Contracts
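
###### Since the unit is the contract, one quick check is whether each row corresponds to one distinct contract number. This is a minimal sketch, assuming `numero_del_contrato` works as the identifier; the rows are made up. A repeated number is not necessarily an error, it is only a signal to review.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
df_demo = pd.DataFrame({"numero_del_contrato": ["X-1", "X-2", "X-2"]})

n_rows = len(df_demo)
n_contracts = df_demo["numero_del_contrato"].nunique()

# keep=False marks every occurrence of a repeated number, not only the later ones.
repeated = df_demo[df_demo["numero_del_contrato"].duplicated(keep=False)]

print(f"{n_rows} rows, {n_contracts} distinct contract numbers")
print(repeated)
```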

#### Getting to Know the Dataset

###### Before starting any pipeline, we need to understand the structure, form, and content of the data we are going to handle. This gives us a first glimpse of what we can expect and lets us start ruling out errors that could affect us as the information moves through the pipeline.

In [19]:
df_main.columns

Index(['nivel_entidad', 'codigo_entidad_en_secop', 'nombre_de_la_entidad',
       'nit_de_la_entidad', 'departamento_entidad', 'municipio_entidad',
       'estado_del_proceso', 'modalidad_de_contrataci_n', 'objeto_a_contratar',
       'objeto_del_proceso', 'tipo_de_contrato', 'fecha_de_firma_del_contrato',
       'fecha_fin_ejecuci_n', 'numero_del_contrato', 'numero_de_proceso',
       'valor_contrato', 'nom_raz_social_contratista', 'url_contrato',
       'origen', 'tipo_documento_proveedor', 'documento_proveedor',
       'fecha_inicio_ejecuci_n'],
      dtype='object')

###### The first error I can find is in some of the column names. As data moves between tools, names can change according to each tool's rules. Here the original names contained an accented ó (contratación, ejecución). Some systems do not recognize such characters and replace them with another one, in this case an underscore: `modalidad_de_contrataci_n`, `fecha_fin_ejecuci_n`, and `fecha_inicio_ejecuci_n`.
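
###### One way to repair these names is an explicit mapping passed to `rename`. This is a minimal sketch on an empty toy frame that reuses the column names shown above; the replacement names (unaccented spellings) are my own choice, not something the source defines.

```python
import pandas as pd

# Empty toy frame that reuses column names from the output above.
df_demo = pd.DataFrame(columns=[
    "modalidad_de_contrataci_n",
    "fecha_fin_ejecuci_n",
    "fecha_inicio_ejecuci_n",
    "valor_contrato",
])

# Hypothetical mapping: restore the dropped vowel, without the accent.
fixed_names = {
    "modalidad_de_contrataci_n": "modalidad_de_contratacion",
    "fecha_fin_ejecuci_n": "fecha_fin_ejecucion",
    "fecha_inicio_ejecuci_n": "fecha_inicio_ejecucion",
}

# rename returns a new frame; columns missing from the mapping stay unchanged.
df_renamed = df_demo.rename(columns=fixed_names)
renamed_columns = list(df_renamed.columns)
print(renamed_columns)
```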

In [20]:
df_main.shape

(1941, 22)

###### We must be clear about the total amount of data, because every record that comes in as input must be accounted for in the output, regardless of the transformations applied and apart from any filters we deliberately use.
###### So, our DataFrame has 1,941 rows and 22 columns.
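
###### This input/output rule can be turned into a small reconciliation check: rows kept plus rows filtered out must equal rows received. This is a minimal sketch on made-up rows; the filter condition is only an example.

```python
import pandas as pd

# Hypothetical input, for illustration only.
df_in = pd.DataFrame({
    "numero_del_contrato": ["A-1", "A-2", "A-3", "A-4"],
    "estado_del_proceso": ["Activo", "En ejecución", "En ejecución", "Activo"],
})

# Example filter: keep only contracts that are in execution.
mask = df_in["estado_del_proceso"] == "En ejecución"
df_kept = df_in[mask]
df_dropped = df_in[~mask]

# Every input row must end up either kept or explicitly dropped.
rows_in = len(df_in)
rows_out = len(df_kept) + len(df_dropped)
assert rows_in == rows_out, f"{rows_in - rows_out} rows unaccounted for"
```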

###### Tip: Always keep the main DataFrame as a separate, untouched dataset. Make your changes in other DataFrames created from it and never work directly on the main one. If something goes wrong, you would have to load the data again, and in some cases that process is difficult and slow.
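
###### In pandas, this tip usually means working on a `.copy()`. This is a minimal sketch; `df_main_demo` and `df_work` are hypothetical names and the row is made up.

```python
import pandas as pd

# Stand-in for the main DataFrame (made-up row, for illustration only).
df_main_demo = pd.DataFrame(
    {"nom_raz_social_contratista": ["leonardo delgado lopez"]}
)

# copy() is deep by default, so df_work has its own data.
df_work = df_main_demo.copy()
df_work["nom_raz_social_contratista"] = (
    df_work["nom_raz_social_contratista"].str.upper()
)

print(df_main_demo.iloc[0, 0])  # the original value is untouched
print(df_work.iloc[0, 0])       # only the working copy changed
```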

###### This is a very important step, because here we get to know the dataset and begin to standardize it.

##### First Step: Getting to Know Our Columns

###### In this step, we get to know each column and its values so that we can start the standardization process. We begin by counting the unique values in every column.

In [30]:
df_main.nunique()

nivel_entidad                     1
codigo_entidad_en_secop           1
nombre_de_la_entidad              1
nit_de_la_entidad                 1
departamento_entidad              1
municipio_entidad                 1
estado_del_proceso                9
modalidad_de_contrataci_n         8
objeto_a_contratar             1059
objeto_del_proceso             1057
tipo_de_contrato                  4
fecha_de_firma_del_contrato     332
fecha_fin_ejecuci_n              52
numero_del_contrato            1840
numero_de_proceso              1812
valor_contrato                  373
nom_raz_social_contratista      601
url_contrato                   1807
origen                            1
tipo_documento_proveedor          4
documento_proveedor             601
fecha_inicio_ejecuci_n          327
dtype: int64

###### To see every value in every column, you can run `df_main.apply(lambda x: x.unique())`. Here, however, some columns have more than 1,000 distinct values, and that call would flood the output. It is better to check the number of unique values per column first and then review, one by one, the columns we consider important to validate.
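
###### A middle ground is to list the values only for columns with few distinct values. This is a minimal sketch; `low_cardinality_values` and its `max_unique` threshold are hypothetical helpers of mine, and the rows are made up.

```python
import pandas as pd

def low_cardinality_values(df, max_unique=10):
    """Return {column: value counts} for columns with at most max_unique distinct values."""
    counts = df.nunique()
    selected = counts[counts <= max_unique].index
    # dropna=False also counts missing values, which are worth seeing here.
    return {col: df[col].value_counts(dropna=False) for col in selected}

# Made-up rows, for illustration only.
df_demo = pd.DataFrame({
    "tipo_documento_proveedor": [
        "No Definido", "Cédula de Ciudadanía", "Cédula de Ciudadanía",
    ],
    "numero_del_contrato": ["X-1", "X-2", "X-3"],
})

summary = low_cardinality_values(df_demo, max_unique=2)
for column, values in summary.items():
    print(column)
    print(values)
```

###### With the real DataFrame, the call would be `low_cardinality_values(df_main)`, and the threshold can be adjusted to the column counts shown above.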

#### Cleaning the Dataset

Clean the data

Transform the data

Filter the data

Show the results
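
###### The four steps listed above (clean, transform, filter, show) can be chained with `DataFrame.pipe`. This is a minimal sketch, not the final pipeline: the three functions and the rules inside them are assumptions of mine, and the toy rows only imitate the shape of the dataset.

```python
import pandas as pd

def clean(df):
    """Clean: trim and uppercase contractor names (example rule)."""
    out = df.copy()
    out["nom_raz_social_contratista"] = (
        out["nom_raz_social_contratista"].str.strip().str.upper()
    )
    return out

def transform(df):
    """Transform: parse the end-of-execution date from text."""
    out = df.copy()
    out["fecha_fin_ejecuci_n"] = pd.to_datetime(out["fecha_fin_ejecuci_n"])
    return out

def filter_rows(df):
    """Filter: keep only contracts in execution (example rule)."""
    return df[df["estado_del_proceso"] == "En ejecución"]

# Toy rows, for illustration only.
df_demo = pd.DataFrame({
    "nom_raz_social_contratista": [
        " Juan Carlos Martínez Bonilla",
        "LEONARDO DELGADO LOPEZ",
        "NUEVO DIARIO OCCIDENTE SAS",
    ],
    "estado_del_proceso": ["En ejecución", "En ejecución", "Activo"],
    "fecha_fin_ejecuci_n": [
        "2025-05-31T00:00:00.000",
        "2025-05-31T00:00:00.000",
        "2025-04-30T00:00:00.000",
    ],
})

# Show: each step receives the output of the previous one.
result = df_demo.pipe(clean).pipe(transform).pipe(filter_rows)
print(result)
```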

In [16]:
df_main.head()

| | nivel_entidad | codigo_entidad_en_secop | nombre_de_la_entidad | nit_de_la_entidad | departamento_entidad | municipio_entidad | estado_del_proceso | modalidad_de_contrataci_n | objeto_a_contratar | objeto_del_proceso | ... | fecha_fin_ejecuci_n | numero_del_contrato | numero_de_proceso | valor_contrato | nom_raz_social_contratista | url_contrato | origen | tipo_documento_proveedor | documento_proveedor | fecha_inicio_ejecuci_n |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Territorial | 701511206 | PERSONERIA DISTRITAL DE SANTIAGO DE CALI | 805003895 | Valle del Cauca | Cali | Activo | Mínima cuantía | REALIZAR LA PUBLICACIÓN DE EDICTOS EN UN DIARI... | REALIZAR LA PUBLICACIÓN DE EDICTOS EN UN DIARI... | ... | 2025-04-30T00:00:00.000 | CO1.PCCNTR.7620665 | 131.7.1.2025.MC-154 | 800000 | NUEVO DIARIO OCCIDENTE SAS | https://community.secop.gov.co/Public/Tenderin... | SECOPII | No Definido | 805017188 | |
| 1 | Territorial | 701511206 | PERSONERIA DISTRITAL DE SANTIAGO DE CALI | 805003895 | Valle del Cauca | Cali | En ejecución | No Definido | PRESTAR LOS SERVICIOS DE APOYO A LA GESTIÓN E... | PRESTAR LOS SERVICIOS DE APOYO A LA GESTIÓN E... | ... | 2025-05-31T00:00:00.000 | CO1.PCCNTR.7626699 | 131.7.1.2025.CD-161 | 7737000 | Juan Carlos Martínez Bonilla | https://community.secop.gov.co/Public/Tenderin... | SECOPII | Cédula de Ciudadanía | 1107529934 | 2025-03-12T00:00:00.000 |
| 2 | Territorial | 701511206 | PERSONERIA DISTRITAL DE SANTIAGO DE CALI | 805003895 | Valle del Cauca | Cali | En ejecución | Contratación directa | PRESTAR LOS SERVICIOS PROFESIONALES S EN LA PE... | PRESTAR LOS SERVICIOS PROFESIONALES S EN LA PE... | ... | 2025-05-31T00:00:00.000 | CO1.PCCNTR.7626468 | 131.7.1.2025.CD-123 | 13314000 | Harold Humberto Cairasco Calderón | https://community.secop.gov.co/Public/Tenderin... | SECOPII | Cédula de Ciudadanía | 16804043 | 2025-03-11T00:00:00.000 |
| 3 | Territorial | 701511206 | PERSONERIA DISTRITAL DE SANTIAGO DE CALI | 805003895 | Valle del Cauca | Cali | En ejecución | Contratación directa | PRESTAR LOS SERVICIOS DE APOYO A LA GESTION EN... | PRESTAR LOS SERVICIOS DE APOYO A LA GESTION EN... | ... | 2025-05-31T00:00:00.000 | CO1.PCCNTR.7626688 | 131.7.1.2025.CD-159 | 9840000 | LEONARDO DELGADO LOPEZ | https://community.secop.gov.co/Public/Tenderin... | SECOPII | Cédula de Ciudadanía | 94060109 | 2025-03-11T00:00:00.000 |
| 4 | Territorial | 701511206 | PERSONERIA DISTRITAL DE SANTIAGO DE CALI | 805003895 | Valle del Cauca | Cali | En ejecución | Contratación directa | PRESTAR LOS SERVICIOS PROFESIONALES EN LA PER... | PRESTAR LOS SERVICIOS PROFESIONALES EN LA PER... | ... | 2025-05-31T00:00:00.000 | CO1.PCCNTR.7627306 | 131.7.1.2025.CD-162 | 15693000 | Jaime Portocarrero Banguera | https://community.secop.gov.co/Public/Tenderin... | SECOPII | Cédula de Ciudadanía | 10386217 | 2025-03-12T00:00:00.000 |


In [17]:
# Apply the SOLID model - efficiency
# File size at the end
# Alert on incomplete data
# Standardize names
# Validation: natural person or 