# Data Quality Exercise - Profiling
## Evaluate the data quality of product sales

We want to assess the data quality of the sales and payments datasets. This requires analyzing the following points:
- Data quality
- Primary key selection
- Cardinality identification
- Mean, variance, standard deviation, covariance, and correlation
- Data quality improvements

**Reference**: “Estadística Descriptiva con Python y Pandas”: https://coderhook.github.io/Descriptive%20Statistics

- Columns in sales: orderNumber, orderLineNumber, orderDate, shippedDate, requiredDate, customerNumber, employeeNumber, productCode, status, comments, quantityOrdered, priceEach, sales_amount, origin

- Columns in payments: customerNumber, checkNumber, paymentDate, amount

In [25]:
import pandas as pd
import numpy as np
from tabulate import tabulate
import warnings
warnings.filterwarnings('ignore')

## Load the files

In [2]:
sales_df = pd.read_csv('../Pandas/datos/company_sales/sales.csv')

In [3]:
payments_df = pd.read_csv('../Pandas/datos/company_sales/payments.csv')

In [14]:
# Show the first rows of each dataset
sales_df.head(), payments_df.head()

(       0  0.1  0000-00-00 0000-00-00.1 0000-00-00.2  0.2   0.3 productCode  \
 0  10100    1  0000-00-00   0000-00-00   0000-00-00  363  1216    S24_3969   
 1  10100    2  0000-00-00   0000-00-00   0000-00-00  363  1216    S18_2248   
 2  10100    3  0000-00-00   0000-00-00   0000-00-00  363  1216    S18_1749   
 3  10100    4  0000-00-00   0000-00-00   0000-00-00  363  1216    S18_4409   
 4  10101    1  0000-00-00   0000-00-00   0000-00-00  128  1504    S18_2795   
 
     status                comments  0.4    0.00   0.00.1 origin  
 0  Shipped                     NaN   49   35.29  1729.21  spain  
 1  Shipped                     NaN   50   55.09  2754.50  spain  
 2  Shipped                     NaN   30  136.00  4080.00  spain  
 3  Shipped                     NaN   22   75.46  1660.12  spain  
 4  Shipped  Check on availability.   26  167.06  4343.56  spain  ,
      0 checkNumber  0000-00-00      0.00
 0  103    HQ336336  2004-10-19   6066.78
 1  103    JM555205  2003-06-05  1457

In [22]:
# Rename the sales.csv columns with more suitable names
sales_df.columns = [
"orderNumber", "orderLineNumber", "orderDate", "shippedDate", "requiredDate",
"customerNumber", "EmployeeNumber", "productCode", "status", "comments",
"quantityOrdered", "priceEach", "sales_amount", "origin"
]
# Rename the payments.csv columns
payments_df.columns = ["customerNumber", "checkNumber", "paymentDate", "amount"]

sales_df.head(), payments_df.head()

(   orderNumber  orderLineNumber   orderDate shippedDate requiredDate  \
 0        10100                1  0000-00-00  0000-00-00   0000-00-00   
 1        10100                2  0000-00-00  0000-00-00   0000-00-00   
 2        10100                3  0000-00-00  0000-00-00   0000-00-00   
 3        10100                4  0000-00-00  0000-00-00   0000-00-00   
 4        10101                1  0000-00-00  0000-00-00   0000-00-00   
 
    customerNumber  EmployeeNumber productCode   status  \
 0             363            1216    S24_3969  Shipped   
 1             363            1216    S18_2248  Shipped   
 2             363            1216    S18_1749  Shipped   
 3             363            1216    S18_4409  Shipped   
 4             128            1504    S18_2795  Shipped   
 
                  comments  quantityOrdered  priceEach  sales_amount origin  
 0                     NaN               49      35.29       1729.21  spain  
 1                     NaN               50     
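
The placeholder column labels in the first `head()` output (`0`, `0.1`, `0000-00-00`, ...) suggest that the first line of each CSV is not a usable header. As a minimal sketch, and assuming that first line can be discarded as a header, the names could instead be supplied at load time through the `header` and `names` parameters of `pd.read_csv`. The inline CSV text below is made-up sample data, not the contents of the real file.

```python
import io
import pandas as pd

# Hypothetical two-row sample that mimics the odd header seen in payments.csv.
sample_csv = io.StringIO(
    "0,checkNumber,0000-00-00,0.00\n"
    "103,AB111111,2004-01-15,100.50\n"
    "112,CD222222,2003-06-05,250.00\n"
)

payment_columns = ["customerNumber", "checkNumber", "paymentDate", "amount"]

# header=0 marks the first line as the header row to be replaced;
# names= supplies the column names we actually want.
payments_sample = pd.read_csv(sample_csv, header=0, names=payment_columns)
print(payments_sample.columns.tolist())
```

Compared with assigning `df.columns` afterwards, this keeps the naming next to the load step, at the cost of hard-coding the column order in the call.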

In [23]:
# Check the data types
sales_info = sales_df.dtypes
payments_info = payments_df.dtypes
sales_info, payments_info

(orderNumber          int64
 orderLineNumber      int64
 orderDate           object
 shippedDate         object
 requiredDate        object
 customerNumber       int64
 EmployeeNumber       int64
 productCode         object
 status              object
 comments            object
 quantityOrdered      int64
 priceEach          float64
 sales_amount       float64
 origin              object
 dtype: object,
 customerNumber      int64
 checkNumber        object
 paymentDate        object
 amount            float64
 dtype: object)

In [26]:
# Convert the numeric and date columns to the correct type
for col in ["orderNumber", "orderLineNumber", "customerNumber",
            "EmployeeNumber", "quantityOrdered", "priceEach", "sales_amount"]:
    sales_df[col] = pd.to_numeric(sales_df[col], errors="coerce")

for col in ["orderDate", "requiredDate", "shippedDate"]:
    sales_df[col] = pd.to_datetime(sales_df[col], errors="coerce")

payments_df["customerNumber"] = pd.to_numeric(payments_df["customerNumber"],
                                              errors="coerce")
payments_df["amount"] = pd.to_numeric(payments_df["amount"], errors="coerce")
payments_df["paymentDate"] = pd.to_datetime(payments_df["paymentDate"],
                                            errors="coerce")
# Show the corrected information
sales_df.info(), payments_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3001 entries, 0 to 3000
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   orderNumber      3001 non-null   int64         
 1   orderLineNumber  3001 non-null   int64         
 2   orderDate        0 non-null      datetime64[ns]
 3   shippedDate      3 non-null      datetime64[ns]
 4   requiredDate     3 non-null      datetime64[ns]
 5   customerNumber   3001 non-null   int64         
 6   EmployeeNumber   3001 non-null   int64         
 7   productCode      3001 non-null   object        
 8   status           3001 non-null   object        
 9   comments         759 non-null    object        
 10  quantityOrdered  3001 non-null   int64         
 11  priceEach        3001 non-null   float64       
 12  sales_amount     3001 non-null   float64       
 13  origin           3001 non-null   object        
dtypes: datetime64[ns](3), float64(2), int64(

(None, None)

In [34]:
# Convert the dates in sales.csv
date_columns_sales = ["orderDate", "shippedDate", "requiredDate"]
for col in date_columns_sales:
    sales_df[col] = pd.to_datetime(sales_df[col], errors='coerce') # Converts and sets NaT for invalid values

# Convert the date in payments.csv
payments_df["paymentDate"] = pd.to_datetime(payments_df["paymentDate"],
                                            errors='coerce')

# Count the missing (NaT) dates in each dataset
missing_dates_sales = sales_df[date_columns_sales].isnull().sum()
missing_dates_payments = payments_df["paymentDate"].isnull().sum()

missing_dates_sales, missing_dates_payments

(orderDate       3001
 shippedDate     2998
 requiredDate    2998
 dtype: int64,
 np.int64(0))
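
The counts above show that almost every sales date fails to parse, which matches the `0000-00-00` placeholder visible in the `head()` output. The following is a small sketch on made-up data (the `toy` frame is hypothetical) of how `errors="coerce"` turns that placeholder into `NaT`, and how the missing values can be expressed as a share of rows instead of a raw count.

```python
import pandas as pd

# Hypothetical sample: two placeholder dates and one valid date.
toy = pd.DataFrame({"orderDate": ["0000-00-00", "2004-05-07", "0000-00-00"]})

# errors="coerce" maps unparseable strings such as "0000-00-00" to NaT.
toy["orderDate"] = pd.to_datetime(toy["orderDate"], errors="coerce")

# Share of missing values per column, as a percentage.
missing_pct = toy.isna().mean() * 100
print(missing_pct.round(1))
```

A percentage makes columns of different lengths comparable, which helps when deciding whether a column is still usable for analysis.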

In [37]:
# Evaluate the uniqueness of candidate primary keys in each dataset

# For sales.csv, candidate key: the (orderNumber, orderLineNumber) combination
sales_unique_order = sales_df["orderNumber"].nunique()
sales_total_rows = len(sales_df)
# Note: despite its name, this holds the number of duplicated combinations
sales_unique_combination = sales_df[["orderNumber",
                                     "orderLineNumber"]].duplicated().sum()

# For payments.csv, candidate key: checkNumber (expected to be unique)
payments_unique_check = payments_df["checkNumber"].nunique()
payments_total_rows = len(payments_df)
payments_duplicated_checks = payments_total_rows - payments_unique_check

print(sales_unique_order, sales_unique_combination, payments_unique_check,
      payments_duplicated_checks)

326 5 273 5
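
The output reports 5 duplicated (orderNumber, orderLineNumber) pairs and 5 repeated checkNumber values, so neither candidate key is strictly unique as loaded. One way to inspect the offending rows is sketched below; `key_report` is a hypothetical helper and `toy_sales` is made-up data, not part of the original notebook.

```python
import pandas as pd

def key_report(df, key_columns):
    """Return (is_valid_key, offending_rows) for a candidate key."""
    # keep=False flags every row that shares its key with another row.
    mask = df.duplicated(subset=key_columns, keep=False)
    # A usable primary key must also be free of nulls.
    has_nulls = df[key_columns].isna().any().any()
    return (not mask.any()) and (not has_nulls), df[mask]

# Hypothetical sample with one repeated (orderNumber, orderLineNumber) pair.
toy_sales = pd.DataFrame({
    "orderNumber": [10100, 10100, 10101, 10101],
    "orderLineNumber": [1, 2, 1, 1],
})

is_valid_key, offenders = key_report(toy_sales, ["orderNumber", "orderLineNumber"])
print(is_valid_key)
print(offenders)
```

Using `keep=False` returns every row in a duplicated group, which makes it easier to judge whether the repeats are exact copies or conflicting records.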


In [38]:
# Evaluate the relationship between sales and payments through customerNumber

# Count the unique customers in each dataset
unique_customers_sales = sales_df["customerNumber"].nunique()
unique_customers_payments = payments_df["customerNumber"].nunique()

# Count the customers common to both datasets
common_customers = len(set(sales_df["customerNumber"]).intersection(set(payments_df["customerNumber"])))
unique_customers_sales, unique_customers_payments, common_customers

(98, 98, 98)
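
The three matching counts show that every customer appears in both datasets, but they do not yet give the cardinality of the relationship. A possible next step, sketched here on made-up data, is to check how many rows each customerNumber has on each side; `side_cardinality` is a hypothetical helper introduced only for illustration.

```python
import pandas as pd

def side_cardinality(df, key):
    """Return "one" if every key value appears once in df, else "many"."""
    return "one" if df[key].value_counts().max() == 1 else "many"

# Hypothetical samples: customers repeat on both sides.
toy_sales = pd.DataFrame({"customerNumber": [103, 103, 112, 112, 112]})
toy_payments = pd.DataFrame({"customerNumber": [103, 112, 112]})

left = side_cardinality(toy_sales, "customerNumber")
right = side_cardinality(toy_payments, "customerNumber")
print(f"sales-to-payments via customerNumber: {left}-to-{right}")
```

If both sides come out as "many", joining the two tables directly on customerNumber would multiply rows, so an aggregation per customer would be needed first.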

In [39]:
# Compute descriptive statistics for the numeric variables
stats_sales = sales_df[["quantityOrdered", "priceEach", "sales_amount"]].describe().T
stats_payments = payments_df[["amount"]].describe().T

# Compute the variance and standard deviation
variance_sales = sales_df[["quantityOrdered", "priceEach", "sales_amount"]].var()
std_dev_sales = sales_df[["quantityOrdered", "priceEach", "sales_amount"]].std()
variance_payments = payments_df[["amount"]].var()
std_dev_payments = payments_df[["amount"]].std()

# Compute the covariance between the sales.csv variables
covariance_sales = sales_df[["quantityOrdered", "priceEach", "sales_amount"]].cov()

# Compute the correlation between the sales.csv variables
correlation_sales = sales_df[["quantityOrdered", "priceEach", "sales_amount"]].corr()

# Print the results to the console
print("=== Sales Statistics ===")
print(stats_sales)
print("\n=== Payments Statistics ===")
print(stats_payments)
print("\n=== Sales Covariance ===")
print(covariance_sales)
print("\n=== Sales Correlation ===")
print(correlation_sales)

=== Sales Statistics ===
                  count         mean          std     min     25%      50%  \
quantityOrdered  3001.0    35.211929     9.828957    6.00    27.0    35.00   
priceEach        3001.0    90.765831    36.579368   26.55    62.0    85.76   
sales_amount     3001.0  3204.908437  1631.356967  481.50  1988.7  2880.48   

                     75%       max  
quantityOrdered    43.00     97.00  
priceEach         114.65    214.30  
sales_amount     4093.60  11503.14  

=== Payments Statistics ===
        count          mean           std     min        25%       50%  \
amount  278.0  31827.944281  21096.143249  615.45  15144.135  31369.15   

             75%        max  
amount  45036.97  120166.58  

=== Sales Covariance ===
                 quantityOrdered     priceEach  sales_amount
quantityOrdered        96.608404      8.892374  9.220015e+03
priceEach               8.892374   1338.050184  4.793471e+04
sales_amount         9220.015265  47934.709695  2.6613