##Ejercicio Data Quality - Perfilado
 ## Evaluar la calidad de datos de las ventas de productos

Se quiere hacer una evaluación de calidad de datos sobre las ventas (sales) y pagos (payments). Para ello se requiere hacer un análisis de los siguientes puntos:
```

Calidad de los datos
Selección de clave principal
Identificación de cardinalidad
Obtener media, varianza y desviacion Estandar, covarianza, correlacion
Mejorar la calidad.
Referencia: “Estadística Descriptiva con Python y Pandas”: https://coderhook.github.io/Descriptive%20Statistics

Columnas sales:, orderNumber, orderLineNumber, orderDate, shippedDate, requiredDate, customerNumber, employeeNumber, productCode, status, comments, quantityOrdered, priceEach, sales_amount, origin

Columnas payments:, customerNumber, checkNumber, paymentDate, amount

In [35]:
#Carga datos
import pandas as pd
import numpy as np
 
payments = pd.read_csv(r'https://raw.githubusercontent.com/ricardoahumada/DataScienceBasics/refs/heads/main/data/company_sales/payments.csv')
payments

sales = pd.read_csv(r'https://raw.githubusercontent.com/ricardoahumada/DataScienceBasics/refs/heads/main/data/company_sales/sales.csv')
sales

Unnamed: 0,0,0.1,0000-00-00,0000-00-00.1,0000-00-00.2,0.2,0.3,productCode,status,comments,0.4,0.00,0.00.1,origin
0,10100,1,0000-00-00,0000-00-00,0000-00-00,363,1216,S24_3969,Shipped,,49,35.29,1729.21,spain
1,10100,2,0000-00-00,0000-00-00,0000-00-00,363,1216,S18_2248,Shipped,,50,55.09,2754.50,spain
2,10100,3,0000-00-00,0000-00-00,0000-00-00,363,1216,S18_1749,Shipped,,30,136.00,4080.00,spain
3,10100,4,0000-00-00,0000-00-00,0000-00-00,363,1216,S18_4409,Shipped,,22,75.46,1660.12,spain
4,10101,1,0000-00-00,0000-00-00,0000-00-00,128,1504,S18_2795,Shipped,Check on availability.,26,167.06,4343.56,spain
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2996,10425,9,0000-00-00,,0000-00-00,119,1370,S24_2300,In Process,,49,127.79,6261.71,spain
2997,10425,10,0000-00-00,,0000-00-00,119,1370,S18_2432,In Process,,19,48.62,923.78,spain
2998,10425,11,0000-00-00,,0000-00-00,119,1370,S32_1268,In Process,,41,83.79,3435.39,spain
2999,10425,12,0000-00-00,,0000-00-00,119,1370,S10_4962,In Process,,38,131.49,4996.62,spain


In [14]:
payments.info()


#Diccionario de mapeo de nombres de columnas
mapeo_columnas = {'0': 'customerNumber','checkNumber':'checkNumber','0000-00-00':'paymentDate','0.00':'amount'}

# Renombrar las columnas del DataFrame
payments.rename(columns=mapeo_columnas, inplace=True)

print(payments.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278 entries, 0 to 277
Data columns (total 4 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   customerNumber  278 non-null    int64  
 1   checkNumber     278 non-null    object 
 2   paymentDate     278 non-null    object 
 3   amount          278 non-null    float64
dtypes: float64(1), int64(1), object(2)
memory usage: 8.8+ KB
   customerNumber checkNumber paymentDate    amount
0             103    HQ336336  2004-10-19   6066.78
1             103    JM555205  2003-06-05  14571.44
2             103    OM314933  2004-12-18   1676.14
3             112    BO864823  2004-12-17  14191.12
4             112     HQ55022  2003-06-06  32641.98


In [16]:
sales.info()

#Diccionario de mapeo de nombres de columnas
mapeo_columnas_sales = {
    '0': 'orderNumber',
    '0.1': 'orderLineNumber',
    '0000-00-00': 'orderDate',
    '0000-00-00.1': 'shippedDate',
    '0000-00-00.2': 'requiredDate',
    '0.2': 'customerNumber',
    '0.3': 'employeeNumber',
    'productCode': 'productCode',
    'status': 'status',
    'comments': 'comments',
    '0.4': 'quantityOrdered',
    '0.00': 'priceEach',
    '0.00.1': 'sales_amount',
    'origin': 'origin'
}

# Renombrar las columnas del DataFrame
sales.rename(columns=mapeo_columnas_sales, inplace=True)

print(sales.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3001 entries, 0 to 3000
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   orderNumber      3001 non-null   int64  
 1   orderLineNumber  3001 non-null   int64  
 2   orderDate        3001 non-null   object 
 3   shippedDate      2859 non-null   object 
 4   requiredDate     3001 non-null   object 
 5   customerNumber   3001 non-null   int64  
 6   employeeNumber   3001 non-null   int64  
 7   productCode      3001 non-null   object 
 8   status           3001 non-null   object 
 9   comments         759 non-null    object 
 10  quantityOrdered  3001 non-null   int64  
 11  priceEach        3001 non-null   float64
 12  sales_amount     3001 non-null   float64
 13  origin           3001 non-null   object 
dtypes: float64(2), int64(5), object(7)
memory usage: 328.4+ KB
   orderNumber  orderLineNumber   orderDate shippedDate requiredDate  \
0        10100       

In [24]:
# Selección de clave principal

# Combinar los DataFrames utilizando una clave compuesta
merged_pay_sale = pd.merge(sales, payments, on='customerNumber', how = 'outer', indicator = True)

# Imprimir el DataFrame combinado

print(merged_pay_sale.head())

print(merged_pay_sale.tail())



   orderNumber  orderLineNumber   orderDate shippedDate requiredDate  \
0        10123                1  0000-00-00  0000-00-00   0000-00-00   
1        10123                1  0000-00-00  0000-00-00   0000-00-00   
2        10123                1  0000-00-00  0000-00-00   0000-00-00   
3        10123                2  0000-00-00  0000-00-00   0000-00-00   
4        10123                2  0000-00-00  0000-00-00   0000-00-00   

   customerNumber  employeeNumber productCode   status comments  \
0             103            1370    S24_1628  Shipped      NaN   
1             103            1370    S24_1628  Shipped      NaN   
2             103            1370    S24_1628  Shipped      NaN   
3             103            1370    S18_1589  Shipped      NaN   
4             103            1370    S18_1589  Shipped      NaN   

   quantityOrdered  priceEach  sales_amount origin checkNumber paymentDate  \
0               50      43.27       2163.50  spain    HQ336336  2004-10-19   
1       

In [32]:
merged_pay_sale.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12135 entries, 0 to 12134
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype   
---  ------           --------------  -----   
 0   orderNumber      12135 non-null  int64   
 1   orderLineNumber  12135 non-null  int64   
 2   orderDate        12135 non-null  object  
 3   shippedDate      11566 non-null  object  
 4   requiredDate     12135 non-null  object  
 5   customerNumber   12135 non-null  int64   
 6   employeeNumber   12135 non-null  int64   
 7   productCode      12135 non-null  object  
 8   status           12135 non-null  object  
 9   comments         3064 non-null   object  
 10  quantityOrdered  12135 non-null  int64   
 11  priceEach        12135 non-null  float64 
 12  sales_amount     12135 non-null  float64 
 13  origin           12135 non-null  object  
 14  checkNumber      12135 non-null  object  
 15  paymentDate      12135 non-null  object  
 16  amount           12135 non-null  float64

In [26]:
# Identificación de cardinalidad


# Contar las ocurrencias de cada tipo de relación
cardinalidad = merged_pay_sale['_merge'].value_counts()

print(cardinalidad)

_merge
both          12135
left_only         0
right_only        0
Name: count, dtype: int64


In [29]:
merged_pay_sale.describe()

Unnamed: 0,orderNumber,orderLineNumber,customerNumber,employeeNumber,quantityOrdered,priceEach,sales_amount,amount
count,12135.0,12135.0,12135.0,12135.0,12135.0,12135.0,12135.0,12135.0
mean,10268.261557,6.480429,215.378904,1312.386321,35.350639,90.489844,3207.886866,43310.415009
std,97.138924,4.210831,111.267488,289.218619,9.641869,36.800733,1640.172089,27281.45705
min,10100.0,1.0,103.0,0.0,6.0,26.55,481.5,615.45
25%,10182.0,3.0,141.0,1216.0,27.0,60.9,1979.58,25080.96
50%,10272.0,6.0,145.0,1370.0,35.0,85.86,2864.16,39440.59
75%,10358.0,9.0,298.0,1401.0,43.0,115.03,4091.9,51619.02
max,10425.0,18.0,496.0,1702.0,97.0,214.3,11503.14,120166.58


In [47]:
merged_pay_sale_del = merged_pay_sale.dropna()

merged_pay_sale_del = merged_pay_sale.drop_duplicates()
merged_pay_sale_del.duplicated().sum()

print(merged_pay_sale_del.head())

   orderNumber  orderLineNumber  customerNumber  employeeNumber  \
0        10123                1             103            1370   
1        10123                1             103            1370   
2        10123                1             103            1370   
3        10123                2             103            1370   
4        10123                2             103            1370   

   quantityOrdered  priceEach  sales_amount    amount  
0               50      43.27       2163.50   6066.78  
1               50      43.27       2163.50  14571.44  
2               50      43.27       2163.50   1676.14  
3               26     120.71       3138.46   6066.78  
4               26     120.71       3138.46  14571.44  


In [48]:
merged_pay_sale = pd.concat([merged_pay_sale, merged_pay_sale.iloc[[1,60,6]]])
duplicate_rows = merged_pay_sale[merged_pay_sale.duplicated()]
print(duplicate_rows)


       orderNumber  orderLineNumber  customerNumber  employeeNumber  \
457          10425                3             119            1370   
458          10425                3             119            1370   
459          10425                3             119            1370   
2328         10111                1             129            1165   
2332         10111                2             129            1165   
...            ...              ...             ...             ...   
11926        10219                3             487            1165   
11929        10219                4             487            1165   
1            10123                1             103            1370   
60           10278                1             112            1166   
6            10123                3             103            1370   

       quantityOrdered  priceEach  sales_amount    amount  
457                 28     147.36       4126.08  19501.82  
458                 28     

In [43]:
import warnings 
warnings.filterwarnings('ignore')

from datetime import datetime, timedelta




merged_pay_sale_del['orderDate'] = pd.to_datetime(merged_pay_sale_del['orderDate'], errors='coerce')
merged_pay_sale_del['shippedDate'] = pd.to_datetime(merged_pay_sale_del['shippedDate'], errors='coerce')
merged_pay_sale_del['requiredDate'] = pd.to_datetime(merged_pay_sale_del['requiredDate'], errors='coerce')

merged_pay_sale_del.info()




<class 'pandas.core.frame.DataFrame'>
Index: 2668 entries, 21 to 12056
Data columns (total 18 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   orderNumber      2668 non-null   int64         
 1   orderLineNumber  2668 non-null   int64         
 2   orderDate        0 non-null      datetime64[ns]
 3   shippedDate      0 non-null      datetime64[ns]
 4   requiredDate     0 non-null      datetime64[ns]
 5   customerNumber   2668 non-null   int64         
 6   employeeNumber   2668 non-null   int64         
 7   productCode      2668 non-null   object        
 8   status           2668 non-null   object        
 9   comments         2668 non-null   object        
 10  quantityOrdered  2668 non-null   int64         
 11  priceEach        2668 non-null   float64       
 12  sales_amount     2668 non-null   float64       
 13  origin           2668 non-null   object        
 14  checkNumber      2668 non-null   object    

In [45]:
merged_pay_sale = merged_pay_sale.select_dtypes(np.number)
merged_pay_sale.columns

Index(['orderNumber', 'orderLineNumber', 'customerNumber', 'employeeNumber',
       'quantityOrdered', 'priceEach', 'sales_amount', 'amount'],
      dtype='object')

In [46]:
merged_pay_sale.corr(method = 'pearson')

Unnamed: 0,orderNumber,orderLineNumber,customerNumber,employeeNumber,quantityOrdered,priceEach,sales_amount,amount
orderNumber,1.0,-0.043601,-0.055823,0.093933,0.061193,-0.000631,0.037534,0.075594
orderLineNumber,-0.043601,1.0,-0.046181,-0.02534,-0.030471,0.003338,-0.023161,0.0704
customerNumber,-0.055823,-0.046181,1.0,0.047658,-0.0082,-0.009333,-0.008878,-0.315141
employeeNumber,0.093933,-0.02534,0.047658,1.0,-0.011477,-0.024298,-0.026997,-0.022254
quantityOrdered,0.061193,-0.030471,-0.0082,-0.011477,1.0,0.025449,0.567394,0.016842
priceEach,-0.000631,0.003338,-0.009333,-0.024298,0.025449,1.0,0.808453,-0.004082
sales_amount,0.037534,-0.023161,-0.008878,-0.026997,0.567394,0.808453,1.0,0.004644
amount,0.075594,0.0704,-0.315141,-0.022254,0.016842,-0.004082,0.004644,1.0


## CORRECCIÓN

In [51]:
import pandas as pd
import numpy as np
from tabulate import tabulate

payments = pd.read_csv(r'https://raw.githubusercontent.com/ricardoahumada/DataScienceBasics/refs/heads/main/data/company_sales/payments.csv')

sales = pd.read_csv(r'https://raw.githubusercontent.com/ricardoahumada/DataScienceBasics/refs/heads/main/data/company_sales/sales.csv')


In [55]:
sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3001 entries, 0 to 3000
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   0             3001 non-null   int64  
 1   0.1           3001 non-null   int64  
 2   0000-00-00    3001 non-null   object 
 3   0000-00-00.1  2859 non-null   object 
 4   0000-00-00.2  3001 non-null   object 
 5   0.2           3001 non-null   int64  
 6   0.3           3001 non-null   int64  
 7   productCode   3001 non-null   object 
 8   status        3001 non-null   object 
 9   comments      759 non-null    object 
 10  0.4           3001 non-null   int64  
 11  0.00          3001 non-null   float64
 12  0.00.1        3001 non-null   float64
 13  origin        3001 non-null   object 
dtypes: float64(2), int64(5), object(7)
memory usage: 328.4+ KB


In [56]:
sales.head()

Unnamed: 0,0,0.1,0000-00-00,0000-00-00.1,0000-00-00.2,0.2,0.3,productCode,status,comments,0.4,0.00,0.00.1,origin
0,10100,1,0000-00-00,0000-00-00,0000-00-00,363,1216,S24_3969,Shipped,,49,35.29,1729.21,spain
1,10100,2,0000-00-00,0000-00-00,0000-00-00,363,1216,S18_2248,Shipped,,50,55.09,2754.5,spain
2,10100,3,0000-00-00,0000-00-00,0000-00-00,363,1216,S18_1749,Shipped,,30,136.0,4080.0,spain
3,10100,4,0000-00-00,0000-00-00,0000-00-00,363,1216,S18_4409,Shipped,,22,75.46,1660.12,spain
4,10101,1,0000-00-00,0000-00-00,0000-00-00,128,1504,S18_2795,Shipped,Check on availability.,26,167.06,4343.56,spain


In [59]:
mapeo_columnas_sales = {
    '0': 'orderNumber',
    '0.1': 'orderLineNumber',
    '0000-00-00': 'orderDate',
    '0000-00-00.1': 'shippedDate',
    '0000-00-00.2': 'requiredDate',
    '0.2': 'customerNumber',
    '0.3': 'employeeNumber',
    'productCode': 'productCode',
    'status': 'status',
    'comments': 'comments',
    '0.4': 'quantityOrdered',
    '0.00': 'priceEach',
    '0.00.1': 'sales_amount',
    'origin': 'origin'
}

# Renombrar las columnas del DataFrame
sales.rename(columns=mapeo_columnas_sales, inplace=True)

sales.head()

Unnamed: 0,orderNumber,orderLineNumber,orderDate,shippedDate,requiredDate,customerNumber,employeeNumber,productCode,status,comments,quantityOrdered,priceEach,sales_amount,origin
0,10100,1,0000-00-00,0000-00-00,0000-00-00,363,1216,S24_3969,Shipped,,49,35.29,1729.21,spain
1,10100,2,0000-00-00,0000-00-00,0000-00-00,363,1216,S18_2248,Shipped,,50,55.09,2754.5,spain
2,10100,3,0000-00-00,0000-00-00,0000-00-00,363,1216,S18_1749,Shipped,,30,136.0,4080.0,spain
3,10100,4,0000-00-00,0000-00-00,0000-00-00,363,1216,S18_4409,Shipped,,22,75.46,1660.12,spain
4,10101,1,0000-00-00,0000-00-00,0000-00-00,128,1504,S18_2795,Shipped,Check on availability.,26,167.06,4343.56,spain


In [61]:
sales['orderDate'].unique()

array(['0000-00-00', '2038-09-00'], dtype=object)

In [65]:
sales_clean = sales.drop(columns = ['orderDate','shippedDate','requiredDate','comments'])
sales_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3001 entries, 0 to 3000
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   orderNumber      3001 non-null   int64  
 1   orderLineNumber  3001 non-null   int64  
 2   customerNumber   3001 non-null   int64  
 3   employeeNumber   3001 non-null   int64  
 4   productCode      3001 non-null   object 
 5   status           3001 non-null   object 
 6   quantityOrdered  3001 non-null   int64  
 7   priceEach        3001 non-null   float64
 8   sales_amount     3001 non-null   float64
 9   origin           3001 non-null   object 
dtypes: float64(2), int64(5), object(3)
memory usage: 234.6+ KB


In [67]:
sales_clean.isna().sum()


orderNumber        0
orderLineNumber    0
customerNumber     0
employeeNumber     0
productCode        0
status             0
quantityOrdered    0
priceEach          0
sales_amount       0
origin             0
dtype: int64

In [37]:
#Valores atípicos: zscore, iqr
zscores = (sales_clean - sales_clean.mean(numeric_only = True)) / sales_clean.std(numeric_only=True)

zscores_abs = zscores.apply(np.abs)

print(tabulate(zscores))


----  -----------  ----------  ---------  -----------  ---  ------------  ---  ----------  ------------  ---
   0   0.872955    -0.312397   -1.29252   -1.73299     nan  -1.51659      nan   1.4028     -0.904583     nan
   1   0.872955    -0.312397   -1.05424   -1.73299     nan  -0.975299     nan   1.50454    -0.276094     nan
   2   0.872955    -0.312397   -0.815971  -1.73299     nan   1.2366       nan  -0.530263    0.536419     nan
   3   0.872955    -0.312397   -0.577698  -1.73299     nan  -0.418428     nan  -1.34418    -0.946935     nan
   4  -1.11178      0.570109   -1.29252   -1.72219     nan   2.08572      nan  -0.937223    0.697978     nan
   5  -1.11178      0.570109   -1.05424   -1.72219     nan  -1.26891      nan   1.09758    -0.714012     nan
   6  -1.11178      0.570109   -0.815971  -1.72219     nan  -1.59204      nan   0.99584    -1.06725      nan
   7  -1.11178      0.570109   -0.577698  -1.72219     nan   0.472785     nan  -1.03896    -0.308583     nan
   8  -0.664162    

In [39]:
umbral = 3

out_mask = ~zscores[zscores_abs > umbral]-isna()
print('\nOutliers per column:\n')
print(out_mask.sum())



TypeError: ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

In [40]:
outliers = sales_clean['quantityOrdered'][out_mask['quantityOrdered']]
print('Outliers:\n', outliers)

NameError: name 'out_mask' is not defined

In [32]:
sales_clean.describe()

Unnamed: 0,orderNumber,orderLineNumber,customerNumber,employeeNumber,quantityOrdered,priceEach,sales_amount
count,3001.0,3001.0,3001.0,3001.0,3001.0,3001.0,3001.0
mean,10260.509164,6.424525,259.63912,1317.948684,35.211929,90.765831,3204.908437
std,92.61975,4.19687,118.403435,326.343575,9.828957,36.579368,1631.356967
min,10100.0,1.0,103.0,0.0,6.0,26.55,481.5
25%,10181.0,3.0,145.0,1216.0,27.0,62.0,1988.7
50%,10263.0,6.0,240.0,1370.0,35.0,85.76,2880.48
75%,10339.0,9.0,353.0,1501.0,43.0,114.65,4093.6
max,10425.0,18.0,496.0,1702.0,97.0,214.3,11503.14


In [42]:
#Duplicados

sales_clean.duplicated().sum()

5

In [44]:
sales_clean[sales_clean.duplicated()]
#La clave, sería una combinación de orderNumber y orderLineNumber

Unnamed: 0,orderNumber,orderLineNumber,customerNumber,employeeNumber,productCode,status,quantityOrdered,priceEach,sales_amount,origin
28,10104,2,141,1370,S50_1514,Shipped,32,53.31,1705.92,spain
2861,10410,2,357,1612,S18_3136,Shipped,34,84.82,2883.88,spain
2895,10413,6,175,1323,S32_3207,Shipped,24,56.55,1357.2,spain
2945,10419,1,382,1401,S18_1589,Shipped,37,100.8,3729.6,spain
2990,10425,3,119,1370,S18_2238,In Process,28,147.36,4126.08,spain


In [50]:
complete_ordNum = sales_clean['completeOrderNumber'] = sales_clean['orderNumber'].astype('str') + '_' + sales_clean['orderLineNumber'].astype('str')
complete_ordNum.values

array(['10100_1', '10100_2', '10100_3', ..., '10425_11', '10425_12',
       '10425_13'], dtype=object)

In [52]:
#Eliminar lso duplicados
sales_celan = sales_clean.drop_duplicates()
sales_clean.duplicated().sum()

5

In [54]:
#Incoherencia

sales_celan.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2996 entries, 0 to 3000
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   orderNumber          2996 non-null   int64  
 1   orderLineNumber      2996 non-null   int64  
 2   customerNumber       2996 non-null   int64  
 3   employeeNumber       2996 non-null   int64  
 4   productCode          2996 non-null   object 
 5   status               2996 non-null   object 
 6   quantityOrdered      2996 non-null   int64  
 7   priceEach            2996 non-null   float64
 8   sales_amount         2996 non-null   float64
 9   origin               2996 non-null   object 
 10  completeOrderNumber  2996 non-null   object 
dtypes: float64(2), int64(5), object(4)
memory usage: 280.9+ KB


In [56]:
sales_clean[['productCode','status','origin']] = sales_clean[['productCode','status','origin']].astype('category')
sales_clean.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3001 entries, 0 to 3000
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   orderNumber          3001 non-null   int64   
 1   orderLineNumber      3001 non-null   int64   
 2   customerNumber       3001 non-null   int64   
 3   employeeNumber       3001 non-null   int64   
 4   productCode          3001 non-null   category
 5   status               3001 non-null   category
 6   quantityOrdered      3001 non-null   int64   
 7   priceEach            3001 non-null   float64 
 8   sales_amount         3001 non-null   float64 
 9   origin               3001 non-null   category
 10  completeOrderNumber  3001 non-null   object  
dtypes: category(3), float64(2), int64(5), object(1)
memory usage: 201.7+ KB


In [59]:
#Cardinalidad

print(sales_clean['status'].unique())
print(sales_clean['status'].value_counts())

['Shipped', 'Resolved', 'Cancelled', 'On Hold', 'Disputed', 'In Process']
Categories (6, object): ['Cancelled', 'Disputed', 'In Process', 'On Hold', 'Resolved', 'Shipped']
Shipped       2775
Cancelled       79
Resolved        47
On Hold         44
In Process      42
Disputed        14
Name: status, dtype: int64


In [61]:
def calc_cardinalidad(adf):
    result = {}
    for col in adf.columns:
        unique_values = adf[col].nunique()
        result[col] = unique_values
        print(f'\n-Valores únicos para {col}: {unique_values}')
    return result


In [63]:
sales_card = calc_cardinalidad(sales_clean)
print(sales_card)


-Valores únicos para orderNumber: 326

-Valores únicos para orderLineNumber: 18

-Valores únicos para customerNumber: 98

-Valores únicos para employeeNumber: 15

-Valores únicos para productCode: 109

-Valores únicos para status: 6

-Valores únicos para quantityOrdered: 61

-Valores únicos para priceEach: 1573

-Valores únicos para sales_amount: 2885

-Valores únicos para origin: 2

-Valores únicos para completeOrderNumber: 2996
{'orderNumber': 326, 'orderLineNumber': 18, 'customerNumber': 98, 'employeeNumber': 15, 'productCode': 109, 'status': 6, 'quantityOrdered': 61, 'priceEach': 1573, 'sales_amount': 2885, 'origin': 2, 'completeOrderNumber': 2996}


In [68]:
#Frecuencias
for col in sales_clean.columns:
    print('\n- Frecuencias para "{0}"'.format(col), '\n')
    print(sales_clean[col].value_counts())


- Frecuencias para "orderNumber" 

10360    18
10168    18
10386    18
10222    18
10275    18
         ..
10286     1
10376     1
10277     1
10132     1
10345     1
Name: orderNumber, Length: 326, dtype: int64

- Frecuencias para "orderLineNumber" 

1     327
2     311
3     288
4     273
5     256
6     239
7     212
8     201
9     177
10    149
11    134
12    114
13    101
14     82
15     57
16     43
17     26
18     11
Name: orderLineNumber, dtype: int64

- Frecuencias para "customerNumber" 

141    260
124    180
114     55
119     54
187     51
      ... 
381      8
198      8
473      8
103      7
219      3
Name: customerNumber, Length: 98, dtype: int64

- Frecuencias para "employeeNumber" 

1370    398
1165    331
1401    273
1501    236
1504    220
1323    212
1612    186
1611    185
1337    177
1216    152
1286    142
0       137
1188    124
1702    114
1166    114
Name: employeeNumber, dtype: int64

- Frecuencias para "productCode" 

S18_3232    53
S18_3136    29
S18_

In [71]:
sales_clean.describe(include='category')

Unnamed: 0,productCode,status,origin
count,3001,3001,3001
unique,109,6,2
top,S18_3232,Shipped,spain
freq,53,2775,2864


Análisis Multivariante

In [74]:
#Correlaciones
sales_corr = sales_clean.corr()
print(sales_corr)

                 orderNumber  orderLineNumber  customerNumber  employeeNumber  \
orderNumber         1.000000        -0.050853       -0.000871        0.112239   
orderLineNumber    -0.050853         1.000000       -0.043279       -0.012187   
customerNumber     -0.000871        -0.043279        1.000000        0.069613   
employeeNumber      0.112239        -0.012187        0.069613        1.000000   
quantityOrdered     0.077880        -0.020396        0.019246       -0.023855   
priceEach          -0.003917        -0.018829       -0.027937       -0.009872   
sales_amount        0.041960        -0.034630       -0.008249       -0.021265   

                 quantityOrdered  priceEach  sales_amount  
orderNumber             0.077880  -0.003917      0.041960  
orderLineNumber        -0.020396  -0.018829     -0.034630  
customerNumber          0.019246  -0.027937     -0.008249  
employeeNumber         -0.023855  -0.009872     -0.021265  
quantityOrdered         1.000000   0.024733      0.

In [78]:
sales_corr[(np.abs(sales_corr) >= 0.7) & (np.abs(sales_corr)!= 1)]

Unnamed: 0,orderNumber,orderLineNumber,customerNumber,employeeNumber,quantityOrdered,priceEach,sales_amount
orderNumber,,,,,,,
orderLineNumber,,,,,,,
customerNumber,,,,,,,
employeeNumber,,,,,,,
quantityOrdered,,,,,,,
priceEach,,,,,,,0.803276
sales_amount,,,,,,0.803276,


In [80]:
#Sesgo
sales_skw =sales_clean.skew(numeric_only=True)
print(sales_skw)

orderNumber        0.013978
orderLineNumber    0.603652
customerNumber     0.455611
employeeNumber    -2.847769
quantityOrdered    0.424534
priceEach          0.643025
sales_amount       1.104946
dtype: float64


In [82]:
sales_skw[np.abs(sales_skw)>2]

employeeNumber   -2.847769
dtype: float64

In [None]:
#Jurtosis
sales_kurt =sales_clean.kurt(numeric_only=True)
print(sales_kurt)