### **Understanding**

#### **General visualization of the raw data**

This section provides a quick overview of the dataset to:

* See the structure of the table.
* Identify data types.
* Detect missing values.
* Review basic statistics.
* Guide the next steps for cleaning and analysis.


In [1]:
import re
from pathlib import Path

from pandas import DataFrame, read_csv

# DATA LOADING
path = Path("../data/raw/")

# Load the raw CSV with the PyArrow parser and Arrow-backed dtypes
# (low_memory is a C-parser option, so it is omitted here)
data: DataFrame = read_csv(
    path / "Sales.csv",
    engine="pyarrow",
    dtype_backend="pyarrow",
)

# Pattern to detect special characters, control characters, and emoji
re_pattern = re.compile(
    r"[#{}\[\]%@*<>¿?=+~|^…¡!()「」『』]|[\x00-\x1F\x7F]|[\U0001F300-\U0001F6FF]",
)

# DATA PREVIEW
print(f"\n{'=' * 10} First 5 Rows {'=' * 10}")
display(data.head())

# EXPLORATORY DATA ANALYSIS
print(f"\n{'=' * 10} Exploratory Data Analysis {'=' * 10}")
display(data.describe(include="all"))

# DATASET OVERVIEW
print(f"\n{'=' * 10} Dataset Overview {'=' * 10}")
display(
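    DataFrame(
        {
            "values_count": data.count(),
            # Find the unique special characters in each column.
            # Caveat: astype(str) likely renders nulls as "<NA>", so "<" and ">"
            # can be reported even when they never occur in the actual values.
            "symbols_found": data.apply(
                lambda col: list(set(re_pattern.findall(col.astype(str).str.cat())))
            ),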
            "null_counts": data.isnull().sum(),
            "dirty_data_types": data.dtypes,
        }
    )
)

# Display shape and memory usage
print(
    f"Shape: {data.shape[0]} x {data.shape[1]}\n"
    f"Memory usage: {data.memory_usage(deep=True).sum() / (1024**2):.2f} MB"
)





| | Fecha | Producto | Tipo_Producto | Cantidad | Precio_Unitario | Ciudad | Pais | Tipo_Venta | Tipo_Cliente | Descuento | Costo_Envio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Santiago | 2025-10-30 | Arepa | Abarrotes | 2.0 | 3681.0 | Online | Minorista | 0.2 | 0.0 | 5889.0 |
| 1 | Córdoba | 2025-11-17 | Arepa | Abarrotes | 7.0 | 2321.0 | Distribuidor | Gobierno | 0.15 | 0.0 | 13809.0 |
| 2 | Barranquilla | 2025-10-22 | Leche | Lácteo | 9.0 | 3540.0 | Distribuidor | Gobierno | 0.2 | 0.0 | 25488.0 |
| 3 | New York | 2025-10-20 | Cereal | Lácteo | 3.0 | 3287.0 | Tienda_Física | Gobierno | 0.05 | 0.0 | 9367.0 |
| 4 | Madrid | 2025-10-20 | Leche | Hogar | 2.0 | 3414.0 | Distribuidor | Mayorista | 0.0 | 0.0 | 6828.0 |





| | Fecha | Producto | Tipo_Producto | Cantidad | Precio_Unitario | Ciudad | Pais | Tipo_Venta | Tipo_Cliente | Descuento | Costo_Envio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1248858 | 1248861 | 1248851 | 1248897 | 1248837.0 | 1248851 | 1248824 | 1248869 | 1248917.0 | 1248912.0 | 1248835 |
| unique | 188 | 146 | 72 | 36 | 34.0 | 5185 | 24 | 24 | 6.0 | 9.0 | 50009 |
| top | Trujillo | 2025-10-31 | Café | Abarrotes | 10.0 | ??? | Distribuidor | Gobierno | 0.0 | 0.0 | ??? |
| freq | 44849 | 42062 | 104476 | 208459 | 125348.0 | 853 | 312611 | 312302 | 250412.0 | 831378.0 | 812 |





| | values_count | symbols_found | null_counts | dirty_data_types |
|---|---|---|---|---|
| Fecha | 1248858 | `[@, >, *, <, #]` | 1142 | string[pyarrow] |
| Producto | 1248861 | `[@, >, *, <, #]` | 1139 | string[pyarrow] |
| Tipo_Producto | 1248851 | `[@, >, *, <, #]` | 1149 | string[pyarrow] |
| Cantidad | 1248897 | `[@, >, *, <, #]` | 1103 | string[pyarrow] |
| Precio_Unitario | 1248837 | `[<, ?, >]` | 1163 | string[pyarrow] |
| Ciudad | 1248851 | `[<, ?, >]` | 1149 | string[pyarrow] |
| Pais | 1248824 | `[@, >, *, <, #]` | 1176 | string[pyarrow] |
| Tipo_Venta | 1248869 | `[@, >, *, <, #]` | 1131 | string[pyarrow] |
| Tipo_Cliente | 1248917 | `[<, >, ?]` | 1083 | string[pyarrow] |
| Descuento | 1248912 | `[<, >, ?]` | 1088 | string[pyarrow] |


Shape: 1250000 x 11
Memory usage: 144.78 MB
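
In the overview, `<` and `>` are reported for every column, including the date-like and numeric ones. That is consistent with nulls being stringified as `<NA>` before the regex runs, although the notebook does not confirm it. A minimal sketch of a null-safe variant follows; the helper name `find_symbols`, the simplified pattern, and the sample frame are assumptions for illustration, not part of the notebook.

```python
import re

import pandas as pd

# Simplified, illustrative pattern (an assumption): a subset of the notebook's symbols.
SYMBOL_PATTERN = re.compile(r"[#@*<>?]")


def find_symbols(col: pd.Series) -> list[str]:
    """Return the sorted special characters found in a column's non-null values."""
    # Drop nulls first so they are never rendered as the literal text "<NA>".
    values = col.dropna().astype(str).unique()
    found: set[str] = set()
    for value in values:
        found.update(SYMBOL_PATTERN.findall(value))
    return sorted(found)


# Made-up rows for illustration only.
sample = pd.DataFrame(
    {
        "Producto": ["Arepa", "Le#che", None, "Caf@é"],
        "Cantidad": ["2.0", "???", None, "7.0"],
    }
)

symbols = {name: find_symbols(sample[name]) for name in sample.columns}
print(symbols)
```

Scanning only the distinct non-null values also avoids concatenating every row into one large string, which may matter at 1,250,000 rows.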


### **Conclusion**

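- Realign column labels with their contents first: in the preview, the column labeled Fecha holds city names, Producto holds dates, and the numeric values sit under Precio_Unitario, Ciudad, Tipo_Cliente, Descuento, and Costo_Envio.
- Normalize and convert types (e.g., Fecha → datetime, Cantidad/Precio/Costo_Envio/Descuento → numeric). Every column currently loads as `string[pyarrow]`.
- Handle nulls (imputation or deletion depending on the case). Each column is missing roughly 1,080 to 1,180 of 1,250,000 values, which is under 0.1%.
- Remove or correct special/inconsistent characters in the affected columns: `@`, `*`, and `#` in the text-like columns and `?` (e.g., `???`) in the numeric ones. The `<` and `>` reported for every column likely come from nulls rendered as `<NA>` by `astype(str)` rather than from the data.
- Review outliers in Costo_Envio and Precio_Unitario before analysis or modeling.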
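
The steps above can be sketched as follows. This is a minimal illustration on a tiny made-up frame, not the notebook's actual cleaning code. It assumes the column labels already match their contents, and the helper names `strip_symbols` and `iqr_outliers`, the simplified pattern, the 1.5 × IQR rule, and median imputation are all assumptions chosen for the example.

```python
import pandas as pd

# Simplified, illustrative pattern (an assumption, not the notebook's full regex).
SYMBOLS = r"[#@*<>?]"


def strip_symbols(col: pd.Series) -> pd.Series:
    """Remove special characters and surrounding whitespace from a text column."""
    return col.str.replace(SYMBOLS, "", regex=True).str.strip()


def iqr_outliers(col: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    return (col < q1 - k * iqr) | (col > q3 + k * iqr)


# Made-up rows for illustration only.
raw = pd.DataFrame(
    {
        "Fecha": ["2025-10-30", "2025-11-17", "???", "2025-10-20", "2025-10-22"],
        "Precio_Unitario": ["3681.0", "23#21.0", "3540.0", None, "999999.0"],
    }
)

# 1. Remove special characters from every column.
clean = raw.apply(strip_symbols)

# 2. Convert types; values that cannot be parsed become missing (NaT / NaN).
clean["Fecha"] = pd.to_datetime(clean["Fecha"], format="%Y-%m-%d", errors="coerce")
clean["Precio_Unitario"] = pd.to_numeric(clean["Precio_Unitario"], errors="coerce")

# 3. Flag outliers before imputing, so filled values do not shift the quartiles.
outliers = iqr_outliers(clean["Precio_Unitario"])

# 4. Impute missing prices with the median (one option among several).
median_price = clean["Precio_Unitario"].median()
clean["Precio_Unitario"] = clean["Precio_Unitario"].fillna(median_price)

print(clean)
print(outliers)
```

Whether to impute or drop, and whether a stripped value such as `23#21.0` should be trusted as `2321.0`, depends on the case; the sketch only shows the mechanics.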
