# Analysis of simulated digital product sales
## Phase 1. Data preparation
Goal: ensure the quality and structure of the dataset before analysis.

```mermaid 
    graph LR
    A[1 <br> Load data]
    B[2 <br> Basic validation]
    C[3 <br> Cleaning]
    D[4 <br> Temporal enrichment]
    
    A-->B
    B-->C
    C-->D

    style A fill: #bc942e
    style B fill: #b59c5c
    style C fill: #ac965c
    style D fill: #aea07f
```


In [1]:
# 1. Load data
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt

data = pd.read_csv('C:/Users/nat27/Desktop/Desktop/Proyectos/CienciaDatos/digital-sales-analytics/data/digital_products_sales_simulated.csv')
data.head()

| | order_id | customer_id | product_name | category | price_usd | quantity | discount_rate | gross_amount_usd | net_revenue_usd | purchase_dt | region | channel | payment_method |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6824 | 2 | Musica-44 | Musica | 53.61 | 1 | 0.0 | 53.61 | 53.61 | 2024-09-23 00:32:55 | NaN | Email | PayPal |
| 1 | 6387 | 1 | Curso-27 | Curso | 118.73 | 1 | 0.05 | 118.73 | 112.79 | 2024-09-23 01:00:03 | LATAM | Marketplace | Card |
| 2 | 1623 | 126 | Curso-03 | Curso | 98.43 | 1 | 0.0 | 98.43 | 98.43 | 2024-09-23 03:01:31 | EU | SocialAds | Card |
| 3 | 1146 | 2 | Plantilla-06 | Plantilla | 10.32 | 1 | 0.0 | 10.32 | 10.32 | 2024-09-23 03:03:24 | EU | Marketplace | Card |
| 4 | 2146 | 2 | Plantilla-46 | Plantilla | 23.09 | 2 | 0.05 | 46.19 | 43.88 | 2024-09-23 03:27:10 | LATAM | Website | Card |
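
The absolute Windows path in the loading cell ties the notebook to one machine. A minimal sketch of a more portable variant follows, assuming (hypothetically) that the CSV sits in a `data/` folder next to the notebook. The `load_sales` helper, the `RAW_CSV` constant, and the in-memory fallback sample are illustrative names, not part of the original project.

```python
from io import StringIO
from pathlib import Path

import pandas as pd

# Assumed layout (hypothetical): the CSV sits in a data/ folder next to the notebook.
RAW_CSV = Path("data") / "digital_products_sales_simulated.csv"

# Two rows copied from the preview above, used as a fallback when the file is absent.
SAMPLE_CSV = (
    "order_id,customer_id,product_name,category,price_usd,quantity,"
    "discount_rate,gross_amount_usd,net_revenue_usd,purchase_dt,"
    "region,channel,payment_method\n"
    "6824,2,Musica-44,Musica,53.61,1,0.0,53.61,53.61,"
    "2024-09-23 00:32:55,,Email,PayPal\n"
    "6387,1,Curso-27,Curso,118.73,1,0.05,118.73,112.79,"
    "2024-09-23 01:00:03,LATAM,Marketplace,Card\n"
)


def load_sales(source):
    """Read the sales CSV and parse purchase_dt as a datetime column."""
    return pd.read_csv(source, parse_dates=["purchase_dt"])


# Use the real file when it exists, otherwise the small in-memory sample.
source = RAW_CSV if RAW_CSV.exists() else StringIO(SAMPLE_CSV)
sales = load_sales(source)
print(sales[["order_id", "purchase_dt", "region"]])
```

Parsing `purchase_dt` at load time is optional; the notebook converts it later in the enrichment step, and either approach should give the same timestamps.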


In [2]:
# 2. Basic validation
print("Dataset shape:", data.shape)
print("Column data types:\n", data.dtypes)

Dataset shape: (8000, 13)
Column data types:
 order_id              int64
customer_id           int64
product_name         object
category             object
price_usd           float64
quantity              int64
discount_rate       float64
gross_amount_usd    float64
net_revenue_usd     float64
purchase_dt          object
region               object
channel              object
payment_method       object
dtype: object
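
The dtype listing shows `purchase_dt` stored as `object`, that is, as plain strings. One possible extra validation, sketched here on a small made-up sample, is to check that every value parses as a timestamp before relying on it. The `YYYY-MM-DD HH:MM:SS` format is an assumption taken from the rows previewed earlier.

```python
import pandas as pd

# Made-up sample: two valid timestamps and one malformed value.
raw = pd.Series(["2024-09-23 00:32:55", "2024-09-23 01:00:03", "not a date"])

# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

n_bad = int(parsed.isna().sum())
print(f"Unparseable timestamps: {n_bad}")
```

On the real dataset the same check would be `pd.to_datetime(data['purchase_dt'], errors='coerce').isna().sum()`; a nonzero count would point to rows that need attention before the temporal enrichment step.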


In [3]:
print("Duplicate rows:", data.duplicated().sum())
print("Null values per column:\n", data.isnull().sum())

Duplicate rows: 0
Null values per column:
 order_id               0
customer_id            0
product_name           0
category               0
price_usd              0
quantity               0
discount_rate          0
gross_amount_usd       0
net_revenue_usd        0
purchase_dt            0
region              2828
channel                0
payment_method         0
dtype: int64
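
Beyond duplicates and nulls, the monetary columns can be checked against each other. The sketch below assumes, based only on the previewed rows, that `gross_amount_usd` is roughly `price_usd * quantity` and that `net_revenue_usd` is roughly `gross_amount_usd * (1 - discount_rate)`. Both formulas and the `check_amounts` helper are assumptions for illustration; the source does not document how the amounts were generated.

```python
import numpy as np
import pandas as pd

# Three rows copied from the dataset preview.
sample = pd.DataFrame({
    "order_id": [6824, 6387, 2146],
    "price_usd": [53.61, 118.73, 23.09],
    "quantity": [1, 1, 2],
    "discount_rate": [0.0, 0.05, 0.05],
    "gross_amount_usd": [53.61, 118.73, 46.19],
    "net_revenue_usd": [53.61, 112.79, 43.88],
})


def check_amounts(df, tol=0.02):
    """Return a boolean Series: True where amounts match the assumed formulas."""
    expected_gross = df["price_usd"] * df["quantity"]
    expected_net = df["gross_amount_usd"] * (1 - df["discount_rate"])
    gross_ok = np.isclose(df["gross_amount_usd"], expected_gross, atol=tol)
    net_ok = np.isclose(df["net_revenue_usd"], expected_net, atol=tol)
    return pd.Series(gross_ok & net_ok, index=df.index)


ok = check_amounts(sample)
print("Rows with inconsistent amounts:", int((~ok).sum()))
print("order_id is unique:", sample["order_id"].is_unique)
```

The tolerance of 0.02 absorbs rounding to cents (for example, 23.09 × 2 = 46.18 against a stored gross of 46.19); a different tolerance may suit the full dataset better.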


In [4]:
# 3. Data cleaning
# Label missing regions explicitly ('Desconocido' = 'Unknown') instead of dropping the rows
data['region'] = data['region'].fillna('Desconocido')
print("Null values per column after cleaning:\n", data.isnull().sum())

Null values per column after cleaning:
 order_id            0
customer_id         0
product_name        0
category            0
price_usd           0
quantity            0
discount_rate       0
gross_amount_usd    0
net_revenue_usd     0
purchase_dt         0
region              0
channel             0
payment_method      0
dtype: int64
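
In the real dataset, `region` is missing in 2828 of 8000 rows (about 35.35%), so dropping those rows would discard roughly a third of the orders. The sketch below uses a small made-up frame to contrast filling with a placeholder against dropping; the frame and its values are illustrative only.

```python
import pandas as pd

# Made-up sample with the same kind of gap as the real 'region' column.
df = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "region": [None, "LATAM", "EU", None],
})

# Fraction of rows with a missing region.
missing_share = df["region"].isna().mean()

# Option A: keep every row and label the gap explicitly.
filled = df.assign(region=df["region"].fillna("Desconocido"))

# Option B: drop rows with a missing region (loses orders).
dropped = df.dropna(subset=["region"])

print(f"Missing share: {missing_share:.2%}")
print("Rows kept when filling:", len(filled))
print("Rows kept when dropping:", len(dropped))
```

Filling keeps revenue totals intact, at the cost of a `Desconocido` bucket that has to be read with care in any per-region breakdown.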


In [5]:
# 4. Temporal enrichment
data['order_date'] = pd.to_datetime(data['purchase_dt'])
data['year'] = data['order_date'].dt.year
data['month'] = data['order_date'].dt.month
data['day'] = data['order_date'].dt.day
data['day_of_week'] = data['order_date'].dt.dayofweek  # Monday=0 ... Sunday=6
data['week_of_year'] = data['order_date'].dt.isocalendar().week  # ISO 8601 week number
data['is_weekend'] = np.where(data['day_of_week'].isin([5, 6]), 1, 0)  # 1 for Saturday/Sunday
data['trimester'] = data['order_date'].dt.quarter  # calendar quarter (1-4)
data.head()

| | order_id | customer_id | product_name | category | price_usd | quantity | discount_rate | gross_amount_usd | net_revenue_usd | purchase_dt | ... | channel | payment_method | order_date | year | month | day | day_of_week | week_of_year | is_weekend | trimester |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6824 | 2 | Musica-44 | Musica | 53.61 | 1 | 0.0 | 53.61 | 53.61 | 2024-09-23 00:32:55 | ... | Email | PayPal | 2024-09-23 00:32:55 | 2024 | 9 | 23 | 0 | 39 | 0 | 3 |
| 1 | 6387 | 1 | Curso-27 | Curso | 118.73 | 1 | 0.05 | 118.73 | 112.79 | 2024-09-23 01:00:03 | ... | Marketplace | Card | 2024-09-23 01:00:03 | 2024 | 9 | 23 | 0 | 39 | 0 | 3 |
| 2 | 1623 | 126 | Curso-03 | Curso | 98.43 | 1 | 0.0 | 98.43 | 98.43 | 2024-09-23 03:01:31 | ... | SocialAds | Card | 2024-09-23 03:01:31 | 2024 | 9 | 23 | 0 | 39 | 0 | 3 |
| 3 | 1146 | 2 | Plantilla-06 | Plantilla | 10.32 | 1 | 0.0 | 10.32 | 10.32 | 2024-09-23 03:03:24 | ... | Marketplace | Card | 2024-09-23 03:03:24 | 2024 | 9 | 23 | 0 | 39 | 0 | 3 |
| 4 | 2146 | 2 | Plantilla-46 | Plantilla | 23.09 | 2 | 0.05 | 46.19 | 43.88 | 2024-09-23 03:27:10 | ... | Website | Card | 2024-09-23 03:27:10 | 2024 | 9 | 23 | 0 | 39 | 0 | 3 |
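
The enrichment step can also be packaged as a function that returns a new frame instead of mutating `data`, which makes it easier to test. This is a sketch under that design choice; `add_time_features` is a hypothetical helper name, and the column names mirror the ones used in the cell above.

```python
import pandas as pd


def add_time_features(df, source_col="purchase_dt"):
    """Return a copy of df with calendar features derived from source_col."""
    out = df.copy()
    ts = pd.to_datetime(out[source_col])
    out["order_date"] = ts
    out["year"] = ts.dt.year
    out["month"] = ts.dt.month
    out["day"] = ts.dt.day
    out["day_of_week"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6
    out["week_of_year"] = ts.dt.isocalendar().week.astype(int)  # ISO week number
    out["is_weekend"] = (ts.dt.dayofweek >= 5).astype(int)  # 1 for Saturday/Sunday
    out["trimester"] = ts.dt.quarter  # calendar quarter (1-4)
    return out


# One timestamp from the dataset preview (a Monday) plus a made-up Saturday.
demo = pd.DataFrame({"purchase_dt": ["2024-09-23 00:32:55", "2024-09-28 10:00:00"]})
enriched = add_time_features(demo)
print(enriched[["order_date", "day_of_week", "week_of_year", "is_weekend", "trimester"]])
```

The `.astype(int)` cast on the ISO week assumes there are no missing timestamps; with missing values the nullable `UInt32` dtype returned by `isocalendar()` would need to be kept.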


In [6]:
# Save the cleaned dataset
data.to_csv('C:/Users/nat27/Desktop/Desktop/Proyectos/CienciaDatos/digital-sales-analytics/data/clean/digital_products_sales_cleaned.csv', index=False)
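
`DataFrame.to_csv` does not create missing folders, so the save fails if `data/clean` does not exist yet. The sketch below shows one way to guard against that and to confirm that the file reloads as expected. A temporary directory stands in for the project folder, and the one-row frame is illustrative only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative one-row frame with a datetime column, like order_date in the notebook.
df = pd.DataFrame({
    "order_id": [6824],
    "order_date": pd.to_datetime(["2024-09-23 00:32:55"]),
})

# A temporary directory stands in for the project's data/clean folder.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "clean" / "digital_products_sales_cleaned.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    df.to_csv(out_path, index=False)

    # CSV stores timestamps as text, so they must be parsed again on load.
    reloaded = pd.read_csv(out_path, parse_dates=["order_date"])

print(reloaded.dtypes)
```

Because CSV keeps no type information, any later notebook that reads the cleaned file needs `parse_dates` (or a `pd.to_datetime` call) to get `order_date` back as a datetime column.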