# 01: Collect & Merge (Data Ingestion)

**Goal:** I collect raw data from multiple sources, align the column names (the schema), and save **intermediate** results in Parquet format.  
**Why:** keeping ingestion separate from cleaning makes the steps reproducible and faster to rerun.

**Sources:**
- `data/raw/products_master.csv`
- `data/raw/promotions.csv`
- `data/raw/customers.csv`
- `data/raw/transactions_systemA.csv`
- `data/raw/transactions_systemB.xlsx` (sheet: `sales`)

**Output (interim):**
- `data/interim/products.parquet`
- `data/interim/promotions.parquet`
- `data/interim/customers.parquet`
- `data/interim/tx_systemA.parquet`
- `data/interim/tx_systemB.parquet`

In [1]:
from pathlib import Path

In [2]:
import pandas as pd

In [3]:
BASE = Path(".")
RAW = BASE/"data"/"raw"
INTERIM = BASE/"data"/"interim"
INTERIM.mkdir(parents=True, exist_ok=True)

In [4]:
list(RAW.glob("*"))

[WindowsPath('data/raw/customers.csv'),
 WindowsPath('data/raw/products_master.csv'),
 WindowsPath('data/raw/promotions.csv'),
 WindowsPath('data/raw/transactions_systemA.csv'),
 WindowsPath('data/raw/transactions_systemB.xlsx')]

**Explanation:** I define the working folders. `RAW` holds the original files;  
`INTERIM` will store the standardized results (unified schema), with no cleaning applied yet.
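
Before loading anything, it can help to confirm that every expected source file is actually in the raw folder. The following is a minimal sketch of such a guard, assuming the five file names from the Sources list; `EXPECTED_RAW_FILES` and `missing_raw_files` are hypothetical names introduced here, not part of the notebook.

```python
from pathlib import Path

# File names copied from the "Sources" list at the top of this notebook.
EXPECTED_RAW_FILES = [
    "products_master.csv",
    "promotions.csv",
    "customers.csv",
    "transactions_systemA.csv",
    "transactions_systemB.xlsx",
]


def missing_raw_files(raw_dir, expected=EXPECTED_RAW_FILES):
    """Return the expected file names that are not present in raw_dir."""
    raw_dir = Path(raw_dir)
    return [name for name in expected if not (raw_dir / name).is_file()]


# Same location as the RAW folder defined above.
missing = missing_raw_files(Path("data") / "raw")
print("Missing raw files:", missing or "none")
```

Returning the list instead of raising lets the caller decide whether a missing source should stop the run or only produce a warning.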

In [5]:
products = pd.read_csv(RAW/"products_master.csv")
promos = pd.read_csv(RAW/"promotions.csv")
customers = pd.read_csv(RAW/"customers.csv")
tx_a = pd.read_csv(RAW/"transactions_systemA.csv")
# Reading .xlsx files requires an Excel engine such as openpyxl to be installed.
tx_b = pd.read_excel(RAW/"transactions_systemB.xlsx", sheet_name="sales")

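# Preview each table: first rows, then dtypes and non-null counts.
frames = {"products": products, "promos": promos, "customers": customers, "tx_a": tx_a, "tx_b": tx_b}
for name, df in frames.items():
    print(f"\n== {name} ==")
    display(df.head())
    # df.info() prints its summary and returns None, so display() also shows `None` after it.
    display(df.info())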


== products ==


|   | product_id | product_name | category | unit | list_price | vat |
|---|---|---|---|---|---|---|
| 0 | P1000 | apa light | Baza | buc | 53.38 |  |
| 1 | P1001 | Biscuiti Max | Lactate | kg | 6.52 | 0.19 |
| 2 | P1002 | Ulei Max | Dulciuri | kg | 58.82 | 0.09 |
| 3 | P1003 | Suc Zero | Dulciuri | set | 16.36 |  |
| 4 | P1004 | Ulei Light | Bauturi | kg | 45.67 | 0.05 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   product_id    60 non-null     object 
 1   product_name  60 non-null     object 
 2   category      60 non-null     object 
 3   unit          60 non-null     object 
 4   list_price    60 non-null     float64
 5   vat           49 non-null     float64
dtypes: float64(2), object(4)
memory usage: 2.9+ KB


None


== promos ==


|   | product_id | start_date | end_date | promo_price | discount_type |
|---|---|---|---|---|---|
| 0 | P1031 | 25/11/2024 | 2024/12/13 | 18 |  |
| 1 | P1020 | 29/10/2024 | 2024/11/04 | 3846 |  |
| 2 | P1059 | 31/10/2024 | 2024/11/05 | 122 | fixed |
| 3 | P1017 | 22/10/2024 | 2024/11/03 | 2393 | fixed |
| 4 | P1010 | 29/11/2024 | 2024/12/12 | 2842 | % |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30 entries, 0 to 29
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   product_id     30 non-null     object
 1   start_date     30 non-null     object
 2   end_date       30 non-null     object
 3   promo_price    30 non-null     object
 4   discount_type  21 non-null     object
dtypes: object(5)
memory usage: 1.3+ KB


None


== customers ==


|   | customer_id | customer_name | email | city | segment |
|---|---|---|---|---|---|
| 0 | C2000 | Ana Popescu | user5049@example.ro | Brasov | B2B |
| 1 | C2001 | Elena Radu |  | Timisoara | b2c |
| 2 | C2002 | Andrei Marin | user4382@example.ro | Bucuresti | Retail |
| 3 | C2003 | Maria Popescu | user5849@example.ro | Timisoara | B2 C |
| 4 | C2004 | George Stan | user3462@example.ro | Brasov | B2B |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 253 entries, 0 to 252
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   customer_id    253 non-null    object
 1   customer_name  253 non-null    object
 2   email          236 non-null    object
 3   city           253 non-null    object
 4   segment        253 non-null    object
dtypes: object(5)
memory usage: 10.0+ KB


None


== tx_a ==


|   | trans_id | timestamp | customer_id | product_id | quantity | unit_price | store |
|---|---|---|---|---|---|---|---|
| 0 | T313725 | 2025-07-14 11:28:00 | C2127 | P1010 | 2.0 | 21.22 |  |
| 1 | T426596 | 2025-04-23 10:55:00 | C2189 | P1003 | 3.0 | 14.02 | Store-01 |
| 2 | T797849 | 2025-08-16 01:08:00 | C2057 | P1053 | 3.0 | 38.66 | shop_03 |
| 3 | T611504 | 2025-01-12 04:49:00 | C2097 | P1028 | 1.0 | 12.6 | Store-01 |
| 4 | T661334 | 2025-04-04 23:30:00 | C2123 | P1012 | 1.0 | 1049.0 | Online |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 900 entries, 0 to 899
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   trans_id     900 non-null    object 
 1   timestamp    900 non-null    object 
 2   customer_id  900 non-null    object 
 3   product_id   900 non-null    object 
 4   quantity     781 non-null    float64
 5   unit_price   900 non-null    object 
 6   store        900 non-null    object 
dtypes: float64(1), object(6)
memory usage: 49.3+ KB


None


== tx_b ==


|   | id | cust_id | sku | qty | date | location | payment_method |
|---|---|---|---|---|---|---|---|
| 0 | T964964 | C2089 | P1051 | 1.0 | 17.02.2025 | Magazine | card |
| 1 | T273350 | C2118 | P1047 | 1.0 | 22.07.2025 | Magazine | voucher |
| 2 | T107111 | C2121 | P1021 | 1.0 | 23.05.2025 | Depozit1 | card |
| 3 | T566879 | C2072 | P1029 | -1.0 | 05.02.2025 | Online | voucher |
| 4 | T282912 | C2182 | P1011 | 10.0 | 16.03.2025 | Online |  |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700 entries, 0 to 699
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   id              700 non-null    object 
 1   cust_id         700 non-null    object 
 2   sku             700 non-null    object 
 3   qty             616 non-null    float64
 4   date            700 non-null    object 
 5   location        700 non-null    object 
 6   payment_method  544 non-null    object 
dtypes: float64(1), object(6)
memory usage: 38.4+ KB


None

**Explanation:** I load all the raw tables into DataFrames and check:
- the data types (text/number/date),
- obvious missing values,
- whether the column names are consistent between systems.
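
The three checks above can also be run programmatically rather than read off the `info()` output. This is a rough sketch on toy frames that reuse the column names of systems A and B; the values are illustrative, and `summarize` and `compare_columns` are hypothetical helpers, not functions from the notebook.

```python
import pandas as pd


def summarize(df):
    """One row per column: its dtype and how many values are missing."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
    })


def compare_columns(left, right):
    """Column names that appear in only one of the two frames."""
    l, r = set(left.columns), set(right.columns)
    return {"only_left": sorted(l - r), "only_right": sorted(r - l)}


# Toy stand-ins for tx_a and tx_b (same column names, made-up rows).
a = pd.DataFrame({
    "trans_id": ["T1"], "timestamp": ["2025-07-14 11:28:00"],
    "customer_id": ["C1"], "product_id": ["P1"],
    "quantity": [2.0], "unit_price": ["21.22"], "store": ["Online"],
})
b = pd.DataFrame({
    "id": ["T2"], "cust_id": ["C2"], "sku": ["P2"],
    "qty": [float("nan")], "date": ["17.02.2025"],
    "location": ["Online"], "payment_method": ["card"],
})

print(summarize(b))
print(compare_columns(a, b))
```

With the real frames, `compare_columns(tx_a, tx_b)` would list every column of both systems, since none of the names match before renaming.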

In [6]:
tx_b = tx_b.rename(columns={
    "id":"trans_id",
    "cust_id":"customer_id",
    "sku":"product_id",
    "qty":"quantity",
    "date":"timestamp",
    "location":"store"
})
tx_b.head()

|   | trans_id | customer_id | product_id | quantity | timestamp | store | payment_method |
|---|---|---|---|---|---|---|---|
| 0 | T964964 | C2089 | P1051 | 1.0 | 17.02.2025 | Magazine | card |
| 1 | T273350 | C2118 | P1047 | 1.0 | 22.07.2025 | Magazine | voucher |
| 2 | T107111 | C2121 | P1021 | 1.0 | 23.05.2025 | Depozit1 | card |
| 3 | T566879 | C2072 | P1029 | -1.0 | 05.02.2025 | Online | voucher |
| 4 | T282912 | C2182 | P1011 | 10.0 | 16.03.2025 | Online |  |


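**Explanation:** the two systems (A/B) use different column names.  
I rename system B's columns to a **standard name set** (the common schema) so the tables can be concatenated and analyzed together:
- `trans_id`, `customer_id`, `product_id`, `quantity`, `timestamp`, `store`.

Even after the rename the two tables are not identical: system B has no `unit_price` column and carries an extra `payment_method` column.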
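
To confirm that the rename covers the whole common schema, the result can be compared against the list of standard columns. The sketch below uses one-row toy frames with the same column names as the real tables; `STANDARD_COLUMNS`, `B_TO_STANDARD`, and `schema_gaps` are hypothetical names, and the concatenation at the end only illustrates how `pd.concat` aligns columns by name (the actual concatenation belongs to the cleaning step).

```python
import pandas as pd

STANDARD_COLUMNS = ["trans_id", "customer_id", "product_id",
                    "quantity", "timestamp", "store"]

# Same mapping as in the rename cell above.
B_TO_STANDARD = {"id": "trans_id", "cust_id": "customer_id", "sku": "product_id",
                 "qty": "quantity", "date": "timestamp", "location": "store"}


def schema_gaps(df, standard=STANDARD_COLUMNS):
    """Return (standard columns missing from df, df columns outside the standard)."""
    missing = [c for c in standard if c not in df.columns]
    extra = [c for c in df.columns if c not in standard]
    return missing, extra


# One-row toy frames (made-up values, real column names).
toy_a = pd.DataFrame({
    "trans_id": ["T1"], "timestamp": ["2025-07-14 11:28:00"],
    "customer_id": ["C1"], "product_id": ["P1"],
    "quantity": [2.0], "unit_price": ["21.22"], "store": ["Online"],
})
toy_b = pd.DataFrame({
    "id": ["T2"], "cust_id": ["C2"], "sku": ["P2"], "qty": [1.0],
    "date": ["17.02.2025"], "location": ["Online"], "payment_method": ["card"],
})

renamed = toy_b.rename(columns=B_TO_STANDARD)
print(schema_gaps(renamed))

# Columns present in only one frame are filled with NaN for the other frame's rows.
combined = pd.concat([toy_a, renamed], ignore_index=True)
print(combined)
```

An empty "missing" list means every standard column is present; the "extra" list shows which source-specific columns will survive into the combined table.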


In [7]:
products.to_parquet(INTERIM/"products.parquet", index=False)
promos.to_parquet(INTERIM/"promotions.parquet", index=False)
customers.to_parquet(INTERIM/"customers.parquet", index=False)
tx_a.to_parquet(INTERIM/"tx_systemA.parquet", index=False)
tx_b.to_parquet(INTERIM/"tx_systemB.parquet", index=False)
print("Interim files saved in:", INTERIM)

Interim files saved in: data\interim


**Explanation:** I save the results as **Parquet**:
- files are typically smaller than CSV thanks to columnar compression,
- reads and writes are fast,
- column data types are stored in the file, whereas CSV keeps only text.

These files are the starting point for step **02_Cleaning**.
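
The point about data types is easiest to see with a round trip. This is a small illustration on a made-up two-row frame, not the notebook's data; the Parquet half needs an optional engine (`pyarrow` or `fastparquet`), so it is wrapped in a guard and skipped when none is installed.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2025-07-14 11:28:00", "2025-04-23 10:55:00"]),
    "quantity": [2.0, 3.0],
})

# CSV round trip: the datetime column comes back as text
# unless it is parsed again explicitly (e.g. with parse_dates).
csv_back = pd.read_csv(io.StringIO(df.to_csv(index=False)))
print("CSV dtypes:")
print(csv_back.dtypes)

# Parquet round trip: with no path, to_parquet returns the file content as bytes.
try:
    pq_back = pd.read_parquet(io.BytesIO(df.to_parquet(index=False)))
    print("Parquet dtypes:")
    print(pq_back.dtypes)
except ImportError:
    print("No Parquet engine installed; skipping the Parquet half of the demo.")
```

At this stage most columns in the interim files are still text, so the benefit mainly shows up after the cleaning step converts them to proper numeric and datetime types.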


### Conclusion 01
- All sources loaded successfully.
- I aligned the schema of the system B transactions to the common standard.
- I saved the intermediate tables in `data/interim/`.

**Next step (02_Cleaning):**
- price conversion (handling `,` and `.` separators and the "RON" label),
- standardization of calendar dates,
- handling of missing/negative values,
- deduplication,
- concatenating transactions A+B and joining with `products`/`customers`,
- (optional) attaching the promotion active on the transaction date.
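
As a preview of the date standardization item, the `head()` outputs above already show four different date layouts. The sketch below parses each with an explicit format; the formats are inferred only from the first rows shown, so the full files may contain others, and `DATE_FORMATS` and `parse_dates` are hypothetical names rather than the cleaning notebook's actual code.

```python
import pandas as pd

# Formats observed in the first rows above (an assumption about the full files).
DATE_FORMATS = {
    "tx_a.timestamp": "%Y-%m-%d %H:%M:%S",
    "tx_b.timestamp": "%d.%m.%Y",
    "promos.start_date": "%d/%m/%Y",
    "promos.end_date": "%Y/%m/%d",
}


def parse_dates(values, fmt):
    """Parse with an explicit format; entries that do not match become NaT."""
    return pd.to_datetime(values, format=fmt, errors="coerce")


sample = pd.Series(["17.02.2025", "22.07.2025", "not a date"])
parsed = parse_dates(sample, DATE_FORMATS["tx_b.timestamp"])
print(parsed)
print("Unparseable entries:", int(parsed.isna().sum()))
```

Using `errors="coerce"` turns unexpected values into `NaT`, so they can be counted and inspected instead of stopping the run.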
