# Data Quality Audit: DogDayCare
### Data quality diagnosis before any analysis

This document presents a full audit of the *bookings*, *customers*, and *payments* datasets.  
The goal is to assess their quality, consistency, and risk level before building models, reports, or integrations.

In [4]:
import pandas as pd

In [5]:
# Load the three raw exports unchanged; nothing is parsed or coerced at this stage.
bookings = pd.read_csv("../data/raw/bookings_raw.csv")
customers = pd.read_csv("../data/raw/customers_raw.csv")
payments = pd.read_csv("../data/raw/payments_raw.csv")


## 1. Dataset overview

In this section we review:

- column structure  
- data types  
- first rows  
- early signs of problems  

No transformation is applied.  
The goal is to **understand what we are working with**.


In [6]:
bookings.head()

Unnamed: 0,booking_id,external_booking_ref,customer_name,customer_email,service,service_code,booking_date,start_time,end_time,checkin_time,...,price,tax_rate,is_cancelled,cancel_reason,is_repeat_customer,customer_rating,customer_feedback,created_at,updated_at,notes
0,1,,OSCAR BERG,oscarber650@outlook.com,nail trim,NT,12 Feb 2024,01:00 PM,21.00,,...,15.0,25.0,0,,Y,3.0,,2024/02/04,2024/02/15,
1,1,,Elin Svensson,elinsven661@email.com,Daycare,DC,20 Apr 2024,16:00,17:00,,...,25.0,0.25,yes,,N,,,17 Apr 2024,,first visit
2,3,,Emma Johansson,,vet shuttle,VS,08 Apr 2024,10:00 AM,11:00,,...,,0.25,,,Y,,,2024-04-08,12 Apr 2024,
3,4,,ELIN SVENSSON,elinsven653@email.com,,VS,28/03/2024,13.00,14:00,,...,20.0,,1,,,4.0,ok,2024-25-03,01/04/2024,
4,5,,Frida Karlsson,fridakar823@gmail.com,Nail Trim,,20/01/2024,,18:00,,...,15.0,25.0,,,,2.0,ok,Jan 17 2024,20-01-2024,


In [9]:
bookings.info()


<class 'pandas.DataFrame'>
RangeIndex: 1225 entries, 0 to 1224
Data columns (total 35 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   booking_id            1225 non-null   int64  
 1   external_booking_ref  537 non-null    str    
 2   customer_name         1189 non-null   str    
 3   customer_email        1076 non-null   str    
 4   service               1193 non-null   str    
 5   service_code          1103 non-null   str    
 6   booking_date          1199 non-null   str    
 7   start_time            1166 non-null   str    
 8   end_time              1164 non-null   str    
 9   checkin_time          507 non-null    str    
 10  checkout_time         507 non-null    str    
 11  location              1044 non-null   str    
 12  channel               1033 non-null   str    
 13  source_system         1009 non-null   str    
 14  staff_assigned        897 non-null    str    
 15  dog_name              844 non-nu

In [7]:
customers.head()

Unnamed: 0,customer_id,full_name,email,phone,country,city,postal_code,address_line1,address_line2,signup_date,...,primary_pet_dob,emergency_contact,vet_provider,vaccination_status,waiver_signed,payment_terms,language,last_contacted,source_system,notes
0,1001,mariam ali,mariamal755@email.com,+46 70 658 21 85,SE,Täby,81426,Kungsgatan 58,,2024/01/20,...,2020-12-19,,AniCura,,1,,es,2024/10/05,excel,
1,1002,DAVID STONE,davidsto898@outlook.com,+34 671586053,NO,Täby,151 56,Birger Jarlsgatan 12,Floor 7,04-02-2024,...,,Karin 65707197,,UPTODATE,N,pay_on_arrival,,06/26/2024,export_v1,
2,1003,Erik Holm,erikholm707@gmail.com,+44 7655742829,España,Stockholm,,Kungsgatan 1,,2023/12/01,...,22 Apr 2023,,,unknown,N,,,2024-01-09,excel,
3,1004,ahmed ali,ahmedali594@email.com,+34 673993471,Sweden,Täby,40828,,Floor 1,30/10/2023,...,23 Dec 2020,Karin 65033115,AniCura,,N,,es,11/01/2024,export_v2,prefers morning drop-off
4,1005,Ana Garcia,anagarci342@gmail.com,+46 70 368 30 66,Portugal,Solna,,Parkvägen 6,,27/11/2023,...,,,,unknown,FALSE,net_14,SV,19 Jan 2024,excel,moved address


In [10]:
customers.info()

<class 'pandas.DataFrame'>
RangeIndex: 238 entries, 0 to 237
Data columns (total 28 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   customer_id         238 non-null    int64  
 1   full_name           231 non-null    str    
 2   email               221 non-null    str    
 3   phone               171 non-null    str    
 4   country             224 non-null    str    
 5   city                225 non-null    str    
 6   postal_code         147 non-null    str    
 7   address_line1       153 non-null    str    
 8   address_line2       120 non-null    str    
 9   signup_date         229 non-null    str    
 10  marketing_opt_in    223 non-null    str    
 11  preferred_channel   211 non-null    str    
 12  customer_type       208 non-null    str    
 13  risk_flag           164 non-null    str    
 14  lifetime_value      54 non-null     float64
 15  pets_count          196 non-null    str    
 16  primary_pet_name   

In [8]:
payments.head()

Unnamed: 0,payment_id,booking_id,customer_email,amount_gross,currency,tax_amount,fee_amount,amount_net,status,payment_method,...,refunded_at,refund_ref,country_of_card,card_brand,last4,installments,chargeback_flag,created_at,source_system,notes
0,900001,648.0,saralind@gmail.com,35,sek,,2.52,,FAILED,card,...,,,,,7525.0,1.0,,2024/03/15 11:22,manual_sheet,duplicate
1,900002,310.0,user3786@gmail.com,20,€,5.0,150.0,,failed,bank transfer,...,,,SE,mastercard,,,,,square_export,
2,900003,714.0,erikholm288@gmail.com,60,sek,,3.24,56.76,payd,invoice,...,,,,,,,no,,,partial refund
3,900004,1078.0,,40,SEK,10.0,2.66,,paid,bank transfer,...,,,SE,visa,7868.0,1.0,,2024-03-13 08:43,square_export,
4,900005,831.0,davidsto777@company.se,60,€,,,56.76,Paid,SWISH,...,,,,amex,9456.0,1.0,yes,2024-02-01 20:35,square_export,


In [11]:
payments.info()

<class 'pandas.DataFrame'>
RangeIndex: 1390 entries, 0 to 1389
Data columns (total 25 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   payment_id           1390 non-null   int64  
 1   booking_id           1360 non-null   float64
 2   customer_email       1187 non-null   str    
 3   amount_gross         1255 non-null   str    
 4   currency             1315 non-null   str    
 5   tax_amount           533 non-null    float64
 6   fee_amount           566 non-null    str    
 7   amount_net           357 non-null    float64
 8   status               1342 non-null   str    
 9   payment_method       1306 non-null   str    
 10  gateway              1300 non-null   str    
 11  gateway_payment_ref  613 non-null    str    
 12  invoice_no           358 non-null    str    
 13  receipt_no           388 non-null    str    
 14  paid_at              1097 non-null   str    
 15  refunded_at          24 non-null     str    
 16 

### Initial observations

- All three datasets have **incorrect types**: dates are stored as text, and numeric fields such as `amount_gross` and `fee_amount` are stored as strings.
- Many **columns have a significant share of null values**.
- Missing data is **encoded inconsistently**: `NaN`, `NULL`, `n/a`, empty values.
- Some fields contain **impossible values** (times such as `99:99`, invalid dates).
- The datasets come from **multiple systems** (see the `source_system` column), which explains the heterogeneity.

These signals indicate that the data **cannot be used directly** for analysis or reporting.
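
The placeholder strings mentioned above hide missing data from `isna()`, so null percentages are underestimated until they are mapped to real missing values. A minimal sketch, assuming the token list below (an assumption to be confirmed against each column's distinct values) and a hypothetical helper name `normalize_missing`:

```python
import pandas as pd

# Assumed placeholder tokens; confirm against the actual distinct values per column.
NULL_TOKENS = {"", "nan", "null", "n/a", "na", "none"}

def normalize_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df where placeholder strings are replaced by NaN."""

    def is_placeholder(value) -> bool:
        # Non-string values (numbers, real NaN) are never placeholders.
        return isinstance(value, str) and value.strip().lower() in NULL_TOKENS

    out = df.copy()
    for col in out.columns:
        mask = out[col].map(is_placeholder).astype(bool)
        # where() keeps values where the condition holds and writes NaN elsewhere.
        out[col] = out[col].where(~mask)
    return out
```

Running the null-percentage cells on `normalize_missing(bookings)` instead of `bookings` would show how much missing data the placeholders were hiding.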


In [12]:
bookings.isna().mean().sort_values(ascending=False) * 100

customer_feedback       62.775510
special_instructions    62.775510
cancel_reason           61.795918
checkout_time           58.612245
checkin_time            58.612245
promo_code              58.285714
medication              56.734694
food_preference         56.571429
external_booking_ref    56.163265
notes                   51.591837
discount                51.428571
dog_weight              51.183673
customer_rating         48.979592
is_repeat_customer      46.612245
tax_rate                38.857143
is_cancelled            33.061224
dog_name                31.102041
temperament             28.571429
staff_assigned          26.775510
dog_breed               20.244898
source_system           17.632653
channel                 15.673469
location                14.775510
updated_at              12.897959
customer_email          12.163265
service_code             9.959184
price                    8.653061
created_at               8.408163
dog_gender               7.346939
end_time      

In [13]:
customers.isna().mean().sort_values(ascending=False) * 100


lifetime_value        77.310924
emergency_contact     52.521008
vaccination_status    50.000000
address_line2         49.579832
vet_provider          49.579832
payment_terms         48.739496
notes                 41.176471
primary_pet_name      39.915966
postal_code           38.235294
address_line1         35.714286
language              34.873950
risk_flag             31.092437
phone                 28.151261
last_contacted        25.630252
primary_pet_dob       18.067227
primary_pet_breed     18.067227
pets_count            17.647059
customer_type         12.605042
preferred_channel     11.344538
waiver_signed          7.983193
email                  7.142857
marketing_opt_in       6.302521
country                5.882353
city                   5.462185
source_system          5.462185
signup_date            3.781513
full_name              2.941176
customer_id            0.000000
dtype: float64

In [14]:
payments.isna().mean().sort_values(ascending=False) * 100

refunded_at            98.273381
refund_ref             91.223022
amount_net             74.316547
invoice_no             74.244604
last4                  73.956835
receipt_no             72.086331
tax_amount             61.654676
fee_amount             59.280576
chargeback_flag        57.050360
gateway_payment_ref    55.899281
installments           54.748201
notes                  53.812950
card_brand             48.057554
country_of_card        44.028777
created_at             39.784173
source_system          27.985612
paid_at                21.079137
customer_email         14.604317
amount_gross            9.712230
gateway                 6.474820
payment_method          6.043165
currency                5.395683
status                  3.453237
booking_id              2.158273
payment_id              0.000000
dtype: float64

### Null values: impact

- Many columns are between 20% and 60% empty, and key fields are affected: `customer_email` is missing in 12% of bookings and 15% of payments, `paid_at` in 21% of payments, and `tax_rate` in 39% of bookings.
- This limits the ability to:
  - identify unique customers  
  - reconstruct bookings  
  - calculate revenue  
  - produce reliable reports  

Nulls are not just a technical problem:  
**they represent operational and financial risk.**
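
To separate business-critical fields from optional ones (such as `notes` or `promo_code`), the null percentage can be restricted to a short list of columns. A small sketch with a hypothetical helper `missing_share`; which columns count as critical is an assumption that the business owner should confirm:

```python
import pandas as pd

def missing_share(df: pd.DataFrame, columns: list) -> pd.Series:
    """Percentage of missing values for the given columns, highest first."""
    # Ignore requested columns that are absent from this dataset.
    present = [c for c in columns if c in df.columns]
    return (df[present].isna().mean() * 100).round(1).sort_values(ascending=False)
```

For example, `missing_share(bookings, ["customer_email", "booking_date", "price"])` reports only the fields needed to link a booking to a customer and a revenue amount.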


In [15]:
bookings.duplicated().sum(), customers.duplicated().sum(), payments.duplicated().sum()

(np.int64(25), np.int64(0), np.int64(40))

In [16]:
customers['email'].duplicated().sum()

np.int64(33)

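### Duplicates: impact

- Exact duplicate rows exist in bookings (25) and payments (40), but not in customers.
- Customers are duplicated by email: 33 rows are flagged. Because `duplicated()` treats missing values as equal and 17 emails are missing, 16 of those flagged rows are repeated nulls, which leaves 17 rows that repeat an actual address.
- Duplicate emails create multiple profiles for the same person.
- This distorts metrics such as:
  - repeat rate  
  - lifetime value  
  - segmentation  
  - customer history  

Conclusion: **customer metrics cannot be trusted without deduplication.**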
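
The raw `duplicated()` count mixes repeated missing emails with repeated real addresses, and it also misses addresses that differ only in case or surrounding spaces. A minimal sketch of a stricter count, using a hypothetical helper `duplicate_email_rows`; lowercasing and trimming is an assumed normalization rule, not something the source data guarantees is sufficient:

```python
import pandas as pd

def duplicate_email_rows(emails: pd.Series) -> int:
    """Count rows whose normalized, non-missing email already appeared earlier."""
    # Drop missing values first so repeated NaN is not counted as a duplicate.
    normalized = emails.dropna().astype(str).str.strip().str.lower()
    # Empty strings carry no identity either.
    normalized = normalized[normalized != ""]
    return int(normalized.duplicated().sum())
```

Calling `duplicate_email_rows(customers["email"])` gives the number of extra profiles that share an address with an earlier row.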


In [17]:
bookings['booking_date'].head(20)

0     12 Feb 2024
1     20 Apr 2024
2     08 Apr 2024
3      28/03/2024
4      20/01/2024
5      02/07/2024
6      2024-23-04
7      2024/02/27
8      2024-03-23
9      2024-02-01
10     2024-01-21
11    30 Mar 2024
12    19 Feb 2024
13            NaN
14     2024/03/02
15     2024.03.09
16     19-03-2024
17    Feb 03 2024
18     2024.04.14
19     02/18/2024
Name: booking_date, dtype: str

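### Inconsistent dates

The first 20 rows of `booking_date` alone contain nine distinct formats, including:

- `2024-03-23` (year-month-day)
- `2024-23-04` (year-day-month)
- `12 Feb 2024` and `Feb 03 2024`
- `28/03/2024` (day first) and `02/18/2024` (month first)
- `2024.03.09`

Values such as `02/07/2024` are ambiguous: they can be read as 2 July or as 7 February.

This prevents:

- sorting bookings  
- computing durations  
- producing monthly reports  
- analyzing seasonality  

Without normalization, **reliable time-based analysis is not possible**.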
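
One way to normalize such a column is to try a fixed list of explicit formats in order and keep the first one that fits. The sketch below uses a hypothetical helper `parse_mixed_dates`; the format list and its order are assumptions. In particular, placing `%d/%m/%Y` before `%m/%d/%Y` resolves ambiguous values as day-first, which has to be validated against the source systems.

```python
import pandas as pd

# Assumed candidate formats, tried in order. Earlier entries win for ambiguous values.
DATE_FORMATS = [
    "%Y-%m-%d", "%Y/%m/%d", "%Y.%m.%d",
    "%d/%m/%Y", "%d-%m-%Y", "%d %b %Y", "%b %d %Y",
    "%m/%d/%Y", "%Y-%d-%m",
]

def parse_mixed_dates(raw: pd.Series) -> pd.Series:
    """Parse each value with the first format that fits; the rest become NaT."""
    # Trim surrounding whitespace on strings, leave missing values untouched.
    text = raw.map(lambda v: v.strip() if isinstance(v, str) else v)
    parsed = pd.to_datetime(text, format=DATE_FORMATS[0], errors="coerce")
    for fmt in DATE_FORMATS[1:]:
        # Only fill positions that earlier formats could not parse.
        parsed = parsed.fillna(pd.to_datetime(text, format=fmt, errors="coerce"))
    return parsed
```

Values that stay `NaT` after `parse_mixed_dates(bookings["booking_date"])` are either missing or in a format not covered by the list, and are worth reviewing by hand.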


In [18]:
payments['status'].value_counts(dropna=False)

status
FAILED      143
paid        137
Pending     130
Paid        127
pending     124
failed      118
payed       118
payd        117
Refunded    111
PAID        111
refunded    106
NaN          48
Name: count, dtype: int64

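### Inconsistent payment statuses

The `status` column holds 11 different spellings of four underlying states, plus 48 missing values:

- `paid`, `Paid`, `PAID`
- `payed`, `payd` (likely misspellings of `paid`)
- `pending`, `Pending`
- `failed`, `FAILED`
- `refunded`, `Refunded`

This directly affects:

- financial reconciliation  
- revenue calculation  
- accounting reports  

Without standardization, **revenue figures are not reliable**.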
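
The spellings in the `value_counts()` output can be collapsed with a lowercase lookup table. A minimal sketch with a hypothetical helper `normalize_status`; treating `payed` and `payd` as `paid` is an assumption that finance should confirm before revenue is recomputed:

```python
import pandas as pd

# Assumed canonical labels; "payed" and "payd" are treated as misspellings of "paid".
STATUS_MAP = {
    "paid": "paid", "payed": "paid", "payd": "paid",
    "pending": "pending",
    "failed": "failed",
    "refunded": "refunded",
}

def normalize_status(status: pd.Series) -> pd.Series:
    """Map raw status spellings to canonical labels; unknown values become NaN."""
    cleaned = status.str.strip().str.lower()
    # Values absent from STATUS_MAP are left as NaN so they surface for review.
    return cleaned.map(STATUS_MAP)
```

Summing the counts shown above under this mapping gives 610 paid, 261 failed, 254 pending, and 217 refunded, with 48 rows still missing a status.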


In [19]:
payments['booking_id'].isin(bookings['booking_id']).value_counts()


booking_id
True     1337
False      53
Name: count, dtype: int64

In [20]:
bookings['booking_id'].isin(payments['booking_id']).value_counts()


booking_id
True     808
False    417
Name: count, dtype: int64

### Broken relationships between datasets

- 53 of 1,390 payments do not match any booking. Of these, 30 have no `booking_id` at all and 23 reference a `booking_id` that does not exist in bookings → **orphan revenue**.
- 417 of 1,225 bookings (34%) have no payment → **potential losses or incomplete records**.

This points to failures in:

- operational processes  
- integrations  
- manual exports  

Direct impact: **financial risk and reporting errors**.
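
Because `isin` returns `False` for a missing key, the count of unmatched payments mixes two different problems: payments with no `booking_id` and payments that point to an unknown booking. A minimal sketch that separates them, using a hypothetical helper `split_unmatched_payments`:

```python
import pandas as pd

def split_unmatched_payments(payments: pd.DataFrame, bookings: pd.DataFrame):
    """Return (payments without booking_id, payments with an unknown booking_id)."""
    missing_ref = payments["booking_id"].isna()
    known_ids = bookings["booking_id"].dropna().unique()
    # A reference is unknown only if it is present but absent from bookings.
    unknown_ref = ~missing_ref & ~payments["booking_id"].isin(known_ids)
    return payments[missing_ref], payments[unknown_ref]
```

The first group usually calls for recovering the key from another field (for example `customer_email` and date), while the second points to bookings that were deleted or never exported.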


In [21]:
audit_summary = pd.DataFrame({
    "dataset": ["bookings", "customers", "payments"],
    "issues_detected": [
        "Inconsistent dates, null values, mixed formats",
        "Duplicates, missing emails, incomplete addresses",
        "Mixed statuses, orphan payments, text amounts"
    ],
    "risk_level": ["High", "High", "Medium"]
})

audit_summary


Unnamed: 0,dataset,issues_detected,risk_level
0,bookings,"Inconsistent dates, null values, mixed formats",High
1,customers,"Duplicates, missing emails, incomplete addresses",High
2,payments,"Mixed statuses, orphan payments, text amounts",Medium


## Business impact

- Actual revenue cannot be calculated with confidence.  
- Unique customers and their history cannot be identified.  
- Current reports may be duplicating or losing information.  
- Decisions are being made on incomplete or inconsistent data.  
- Financial reconciliation requires manual work and is error-prone.  
- Customer experience suffers from duplicated or incorrect profiles.  

**Conclusion:** Before analyzing, reporting, or automating, a cleaning and normalization process is essential.
