# <a id="tabla_contenidos"></a> 
## Table of Contents

### <a href='#section_repaso'>1. Introduction</a>
- #### <a href='#explicacion_dataset'>1.1 Case presentation</a>
- #### <a href='#metodologia'>1.2 Working methodology</a>

### <a href='#section_import_lib'>2. Importing libraries for Classification cases</a>


### <a href='#importar_dataset'>3. Importing the DataSet</a> 
- #### <a href='#imputacion'>3.1 Data imputation</a>

### <a  href='#preparacion_entrenamiento'>4. DataSet Preparation and Training</a>
- #### <a href='#features'>4.1 Features</a>
- #### <a href='#split'>4.2 Training set split</a>
- #### <a href='#metodo_entrenamiento'>4.3 Selecting the training method</a>

### <a  href='#evaluacion_modelos'>5. Model Evaluation</a>
- #### <a href='#section_matriz_confusion_code'>5.1 Metrics: Confusion matrix</a>
- #### <a href='#section_accuracy'>5.2 Metrics: Accuracy</a>
- #### <a href='#section_error'>5.3 Metrics: Classification Error</a>
- #### <a href='#section_recall'>5.4 Metrics: Sensitivity (or recall)</a>
- #### <a href='#section_specificity'>5.5 Metrics: Specificity</a>
- #### <a href='#section_precision'>5.6 Metrics: Precision</a>
- #### <a href='#section_fpr'>5.7 Metrics: False positive rate (FPR)</a>
- #### <a href='#section_f1_score'>5.8 Metrics: F1-Score</a>

### <a href='#section_curva_roc'>6. ROC Curve</a>
- #### <a href='#section_umbrales'>6.1 Adjusting the thresholds</a>
- #### <a href='#section_imp_croc'>6.2 ROC Curve and AUC Implementation</a>

### <a href='#conclusiones'>7. Conclusions</a>
---

<a id="section_repaso"></a> 
## 1. Introduction
---

<a id="explicacion_dataset"></a> 
### 1.1 Case presentation
Write here... # TO-DO

<a id="metodologia"></a> 
### 1.2 Working methodology
Write here... # TO-DO


<a id="section_import_lib"></a>
## 2. Importing libraries for Classification cases
---

We start by importing the libraries and dependencies used throughout this work.

In [1]:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

<a id="importar_dataset"></a>
## 3. Importing the DataSet
---

In [2]:
ruta_tat = 'VENDROS_TAT.csv'
ruta_vendors = 'VENDORS_NAMES_metrics.csv'

data_tat = pd.read_csv(ruta_tat, encoding='UTF-8', sep=',')
data_vendors = pd.read_csv(ruta_vendors, encoding='UTF-8', sep=',')


We display a random sample of the DataSet to identify the columns and values we need.

In [3]:
data_tat.sample(5)

Unnamed: 0,TAT,TAT7,TAT6,TAT5,TAT4,TAT3,TAT2,TAT1,order_number_id,order_number,...,EIN_destination,EIN_shipper,EIN_flight_date,EIN_arrival_date,EIN_arrival_date_tz,EIN_part_tool,EIN_created_date,EIN_created_date_tz,EIN_airway_bill_date,EIN_airway_bill_date_tz
1565,13.0,,,13.0,13.0,,,0.0,157403,R0221721,...,,,,,,,,,,
9680,42.0,,,,,42.0,41.0,0.0,182558,P0002422,...,,,,,,,,,,
12814,44.0,43.0,44.0,16.0,16.0,11.0,11.0,0.0,162884,R0715221,...,7.0,1.0,2021-06-02,2021-06-03 00:00:00,+00:00,Y,2021-06-01 00:00:00,+00:00,2021-06-01 00:00:00,+00:00
8766,189.0,189.0,,8.0,8.0,,,0.0,165698,R0941421,...,7.0,,0,,,Y,2021-12-09 00:00:00,+00:00,2021-12-02 00:00:00,+00:00
10380,28.0,26.0,28.0,12.0,12.0,9.0,8.0,0.0,158353,R0309021,...,7.0,1.0,2021-03-18,2021-03-19 00:00:00,+00:00,Y,2021-03-16 00:00:00,+00:00,2021-03-16 00:00:00,+00:00


In [4]:
data_tat.dtypes

TAT                        float64
TAT7                       float64
TAT6                       float64
TAT5                       float64
TAT4                       float64
                            ...   
EIN_part_tool               object
EIN_created_date            object
EIN_created_date_tz         object
EIN_airway_bill_date        object
EIN_airway_bill_date_tz     object
Length: 91, dtype: object

In [5]:
data_tat.shape

(14875, 91)

<a id="imputacion"></a> 
### 3.1 Data imputation

This section takes a preliminary, manual look at the state of the DataSet we are going to use.

To do so, we compute the fraction of **null values** in each column.

In [6]:
data_null = data_tat.isnull().mean()  # fraction of null values per column
data_null

TAT                        0.001008
TAT7                       0.608605
TAT6                       0.794084
TAT5                       0.203092
TAT4                       0.203092
                             ...   
EIN_part_tool              0.607126
EIN_created_date           0.607126
EIN_created_date_tz        0.607126
EIN_airway_bill_date       0.608605
EIN_airway_bill_date_tz    0.608605
Length: 91, dtype: float64
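
As a quick sanity check of the null-fraction calculation, the sketch below builds a small toy DataFrame (the column names and values are made up for illustration and are not taken from the real DataSet) and compares the explicit "null count divided by number of rows" form with the `isnull().mean()` shortcut. Both should give the same per-column fraction.

```python
import numpy as np
import pandas as pd

# Toy data: made-up values, NOT taken from the real TAT DataSet
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],        # 2 of 4 values missing
    "b": [np.nan, np.nan, np.nan, np.nan],  # entirely missing
    "c": ["x", "y", "z", "w"],              # no missing values
})

# Explicit form: null count divided by the number of rows
frac_apply = toy.apply(lambda col: col.isnull().sum() / toy.shape[0], axis=0)

# Shortcut: the mean of a boolean mask is the fraction of True values
frac_mean = toy.isnull().mean()

print(frac_mean)
```

A column with a fraction of 1.0 contains no usable values at all, which is the case that `dropna(how='all', axis=1)` targets.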

Next, we drop the columns that are **entirely null** from the DataSet, using *dropna(how='all', axis=1)*.

In [7]:
data_tat = data_tat.dropna(how='all', axis=1)  # dropna returns a new DataFrame, so reassign to keep the result
data_tat

Unnamed: 0,TAT,TAT7,TAT6,TAT5,TAT4,TAT3,TAT2,TAT1,order_number_id,order_number,...,EIN_destination,EIN_shipper,EIN_flight_date,EIN_arrival_date,EIN_arrival_date_tz,EIN_part_tool,EIN_created_date,EIN_created_date_tz,EIN_airway_bill_date,EIN_airway_bill_date_tz
0,9.0,,,9.0,9.0,,,1.0,87862,R2054619,...,,,,,,,,,,
1,31.0,31.0,,4.0,4.0,,,4.0,116048,R4093319,...,7.0,,0,,,Y,2019-11-22 00:00:00,+00:00,2019-11-14 00:00:00,+00:00
2,45.0,45.0,,8.0,8.0,,,0.0,176087,R1788221,...,7.0,,0,,,Y,2021-12-09 00:00:00,+00:00,2021-12-02 00:00:00,+00:00
3,19.0,18.0,19.0,,,,,6.0,123425,P0724819,...,7.0,1.0,2019-12-21,2019-12-22 00:00:00,+00:00,Y,2019-12-20 00:00:00,+00:00,2019-12-20 00:00:00,+00:00
4,9.0,,,9.0,9.0,,,0.0,170863,R1361821,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14870,13.0,13.0,13.0,,,,,1.0,88966,R2132319,...,100.0,1.0,2019-04-24,,,Y,2019-04-24 00:00:00,+00:00,2019-04-24 00:00:00,+00:00
14871,29.0,29.0,29.0,15.0,15.0,14.0,12.0,10.0,127458,R0014920,...,7.0,1.0,2020-01-31,2020-02-01 00:00:00,+00:00,Y,2020-01-31 00:00:00,+00:00,2020-01-31 00:00:00,+00:00
14872,11.0,,,11.0,11.0,,,0.0,136119,R0680920,...,,,,,,,,,,
14873,19.0,,,8.0,8.0,19.0,18.0,0.0,229035,R0369923,...,,,,,,,,,,
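
The behaviour of `dropna(how='all', axis=1)` can be illustrated on a toy DataFrame. This is a minimal sketch with made-up column names (`full`, `partial`, `empty`), not the real DataSet: only the column whose values are all null is removed, partially null columns are kept, and the original DataFrame is left unchanged unless the result is assigned back.

```python
import numpy as np
import pandas as pd

# Toy data: hypothetical columns used only for illustration
toy = pd.DataFrame({
    "full": [1, 2, 3],                     # no nulls
    "partial": [np.nan, 5.0, np.nan],      # some nulls
    "empty": [np.nan, np.nan, np.nan],     # all nulls
})

# Drop only the columns where every value is null
dropped = toy.dropna(how="all", axis=1)

print(list(toy.columns))      # the original still has all three columns
print(list(dropped.columns))  # the returned copy no longer has "empty"
```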


At this point we keep only the categorical or numerical columns that can contribute to the classification objective. To do so, we create a new DataFrame with a **column filter**.

In [8]:
data_tat_filter = data_tat[[
    'TAT',
    'order_number_id',
    'order_number',
    'order_type',
    'order_date',
    'entry_date',
    'address_ship',
    'vendor_code',
    'shipment',
    'priority',
    'created_date',
    'od_target_date',
    'od_confirmed_date',
    'od_req_condition']]

In [9]:
data_tat_filter.head(5)

Unnamed: 0,TAT,order_number_id,order_number,order_type,order_date,entry_date,address_ship,vendor_code,shipment,priority,created_date,od_target_date,od_confirmed_date,od_req_condition
0,9.0,87862,R2054619,R,2019-04-08 00:00:00,2019-04-05 00:00:00,ARG,AR0011,,RTN,2019-04-06 00:00:00,0,0,CA
1,31.0,116048,R4093319,R,2019-10-14 00:00:00,2019-10-10 00:00:00,AR0650,AR0011,,RTN,2019-10-14 00:00:00,0,2019-12-12 00:00:00,CA
2,45.0,176087,R1788221,R,2021-10-18 00:00:00,2021-10-18 00:00:00,AR0650,AR0011,,RTN,2021-10-18 00:00:00,0,2021-12-03 00:00:00,CA
3,19.0,123425,P0724819,P,2019-11-26 00:00:00,2019-11-26 00:00:00,ADUANA-EZE,381AB,F3,USR,2019-12-02 00:00:00,2019-12-01 00:00:00,2019-12-01 00:00:00,N
4,9.0,170863,R1361821,R,2021-08-10 00:00:00,2021-08-10 00:00:00,AR0650,AR0011,,RTN,2021-08-10 00:00:00,0,2021-09-30 00:00:00,CA


In [10]:
data_tat_filter.shape[1]

14

We went from 91 columns in the original DataSet to **14 columns** in the filtered one.

In [11]:
data_tat_filter.isnull().mean()  # fraction of null values per column

TAT                  0.001008
order_number_id      0.000000
order_number         0.000000
order_type           0.000000
order_date           0.000000
entry_date           0.037445
address_ship         0.007731
vendor_code          0.000000
shipment             0.530487
priority             0.032605
created_date         0.000000
od_target_date       0.000000
od_confirmed_date    0.000000
od_req_condition     0.020370
dtype: float64

Once the TAT data has been prepared, we **identify the vendors** using the "data_vendors" DataSet, which identifies each vendor by a code.

With that code we can attach the vendor names, which are easier to read. We create a new DataSet that keeps only the vendor names and codes.

In [12]:
data_vendors.head(5)

Unnamed: 0,vendor_name,vendor_code,COUNT,MEAN_of_TAT,STDEV_of_TAT
0,"BERNOULLI AEROSPACE, LLC",US0942,1061,73.393968,96.565499
1,DASTEC SRL USA LLC,US0192,1,33.0,
2,EDACI S.R.L.,AR0310,444,45.380631,58.214414
3,DANIELS MANUFACTURING CORP.,11851,9,58.333333,45.634417
4,SCHENCK ROTEC GMBH,CE619,1,193.0,


In [13]:
data_vendors_filtrado = data_vendors.drop(['COUNT','MEAN_of_TAT','STDEV_of_TAT'], axis=1)

We merge the vendor names into the filtered TAT data.

In [14]:
ds_complete = data_tat_filter.merge(data_vendors_filtrado, how='left', on='vendor_code')
ds_complete.sample(3)

Unnamed: 0,TAT,order_number_id,order_number,order_type,order_date,entry_date,address_ship,vendor_code,shipment,priority,created_date,od_target_date,od_confirmed_date,od_req_condition,vendor_name
11018,14.0,72522,L0003218,L,2018-12-19 00:00:00,2018-12-19 00:00:00,ADUANA-EZE,3Z9K5,,,2018-12-19 00:00:00,2018-12-26 00:00:00,0,,AIRBUS NORTHAMERICA CS INC.
8158,26.0,204894,R1789222,R,2022-07-18 00:00:00,2022-07-18 00:00:00,ADUANA-EZE,US0962,F3,RTN,2022-07-18 00:00:00,0,2022-08-13 00:00:00,CA,UMT CALIBRATION LABORATORY
14285,41.0,185027,R0213722,R,2022-01-28 00:00:00,2022-01-28 00:00:00,ADUANA-EZE,US0942,F3,RTN,2022-01-28 00:00:00,0,2022-03-08 00:00:00,CA,"BERNOULLI AEROSPACE, LLC"
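
A left merge keeps every order row even when its `vendor_code` has no match in the vendor table, in which case `vendor_name` is left null. The sketch below, on toy data with hypothetical codes and names, shows one possible way to check this using the `indicator` and `validate` parameters of pandas `merge`. It is an optional check under these assumptions, not part of the original workflow.

```python
import pandas as pd

# Toy data: hypothetical vendor codes and names, used only for illustration
orders = pd.DataFrame({
    "order_number": ["R001", "R002", "P003"],
    "vendor_code": ["V1", "V2", "V9"],   # "V9" has no entry in the vendor table
})
vendors = pd.DataFrame({
    "vendor_code": ["V1", "V2"],
    "vendor_name": ["Vendor One", "Vendor Two"],
})

merged = orders.merge(
    vendors,
    how="left",
    on="vendor_code",
    validate="many_to_one",  # raises if a vendor_code is duplicated in vendors
    indicator=True,          # adds a "_merge" column describing each row's origin
)

# Rows present only on the left side did not find a vendor name
unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["order_number", "vendor_code"]])
```

With `validate="many_to_one"`, duplicated codes in the vendor table raise an error instead of silently multiplying order rows.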


<a id="preparacion_entrenamiento"></a>
## 4. DataSet Preparation and Training
---

<a id="features"></a> 
### 4.1 Features
Write here... **# TO-DO**

**Categorical features**

In [15]:
# TO-DO

**Numerical features**

In [16]:
# TO-DO

<a id="split"></a> 
### 4.2 Training set split
Write here... **# TO-DO**

In [17]:
# TO-DO

<a id="metodo_entrenamiento"></a> 
### 4.3 Selecting the training method
Write here...  **# TO-DO**

**4.3.1 Logistic Regression**

In [18]:
# TO-DO

**4.3.2 Gaussian Naive Bayes**

In [19]:
# TO-DO

**4.3.3 K-Nearest Neighbors**

In [20]:
# TO-DO

<a id="evaluacion_modelos"></a>
## 5. Model Evaluation
---

In [21]:
# TO-DO

<a id="section_matriz_confusion_code"></a> 
### 5.1 Metrics: Confusion matrix

In [22]:
# TO-DO

<a id="section_accuracy"></a> 
### 5.2 Metrics: Accuracy

In [23]:
# TO-DO

<a id="section_error"></a> 
### 5.3 Metrics: Classification Error

In [24]:
# TO-DO

<a id="section_recall"></a> 
### 5.4 Metrics: Sensitivity (or recall)

In [25]:
# TO-DO

<a id="section_specificity"></a> 
### 5.5 Metrics: Specificity

In [26]:
# TO-DO

<a id="section_precision"></a> 
### 5.6 Metrics: Precision

In [27]:
# TO-DO

<a id="section_fpr"></a> 
### 5.7 Metrics: False positive rate (FPR)

In [28]:
# TO-DO

<a id="section_f1_score"></a> 
### 5.8 Metrics: F1-Score

In [29]:
# TO-DO

<a id="section_curva_roc"></a> 
## 6. ROC Curve
---

In [30]:
# TO-DO

<a id="section_umbrales"></a> 
### 6.1 Adjusting the thresholds

In [31]:
# TO-DO

<a id="section_imp_croc"></a> 
### 6.2 ROC Curve and AUC Implementation

In [32]:
# TO-DO

<a id="conclusiones"></a> 
## 7. Conclusions
---

In [33]:
# TO-DO