# Análisis de Ventas - Carga inicial completada

Este notebook trabaja sobre una base de datos MySQL ppopulada con información contenida en archivos CSV.

Los datos fueron cargados desde los archivos ubicados en el directorio `./data` y corresponden a distintas entidades del negocio: categorías, productos, clientes, empleados, países y ventas.

⚠️ **Nota:** Este notebook **no vuelve a cargar** los datos. Solo se conecta a la base para analizarlos y resolver las consignas del proyecto.

---

## Archivos disponibles:

- `categories.csv`
- `products.csv`
- `customers.csv`
- `employees.csv`
- `countries.csv`
- `sales.csv`

## Estructura esperada en la base de datos:

- `categories`
- `products`
- `customers`
- `employees`
- `countries`
- `sales`

---


In [None]:
import pandas as pd
from pathlib import Path
import utils.notebook_utils as notebook_utils

data_path = Path("./data")
csv_files = data_path.glob("*.csv")

print("📂 Summary of CSV files found:\n")

for file in csv_files:
    colored = notebook_utils.print_colored(text='File:', color='blue')
    print(f" {file.name}\n")

    try:
        df = pd.read_csv(file, nrows=5)
        with open(file, encoding="utf-8") as f:
            total_rows = sum(1 for _ in f) - 1  # excluye th

        print(df.to_string(index=False))
        print(f"\n Rows: {total_rows} | Columns: {len(df.columns)}\n")
    except Exception as e:
        print(f" Error reading file {file.name}: {e}")
    
    #print("=" * 60)
    separator = notebook_utils.print_colored_separator()


Comparamos los datos recibidos con los que efectivmente fueron a la db

In [1]:
import pandas as pd
from pathlib import Path
import utils.sql_utils as sql_utils
import utils.notebook_utils as notebook_utils

data_path = Path("./data")

csv_files = sorted(data_path.glob("*.csv"))

print("📂 Summary of csv files found and database created:\n")


for file in csv_files:
    try:
        df = pd.read_csv(file, nrows=5)  
        total_rows = sum(1 for _ in open(file)) - 1  
        total_columns = len(df.columns)

        table_name = file.stem
        colored = notebook_utils.print_colored(text='Table name:', color='blue')
        print(f" {table_name}\n")

        print(f"First 5 records from tabe {table_name}:\n")
        sql_utils.run_query("""SELECT * FROM sales LIMIT 5;""")

        colored_title = notebook_utils.print_colored(text='File:', color='blue')
        print(f" {file.name}\n")
        print(df.to_string(index=False))
        print(f"\n Rows: {total_rows} || Columns : {total_columns}\n")
        
    except Exception as e:
        print(f" Error while readig file {file.name}: {e}")

# verificar que los registros (rows) totales coincidan entre csv y base de datos

# mosytrar los primeros 5 registros de cada csv

📂 Summary of csv files found and database created:



 categories

First 5 records from tabe categories:



 categories.csv

 CategoryID CategoryName
          1  Confections
          2   Shell fish
          3      Cereals
          4        Dairy
          5    Beverages

 Rows: 11 || Columns : 2



 cities

First 5 records from tabe cities:



 cities.csv

 CityID       CityName  Zipcode  CountryID
      1         Dayton    80563         32
      2        Buffalo    17420         32
      3        Chicago    44751         32
      4        Fremont    20641         32
      5 Virginia Beach    62389         32

 Rows: 96 || Columns : 4



 countries

First 5 records from tabe countries:



 countries.csv

 CountryID CountryName CountryCode
         1     Armenia          AN
         2      Canada          FO
         3      Belize          MK
         4      Uganda          LV
         5    Thailand          VI

 Rows: 206 || Columns : 3



 customers

First 5 records from tabe customers:



 customers.csv

 CustomerID FirstName MiddleInitial LastName  CityID                      Address
          1  Stefanie             Y     Frye      79                97 Oak Avenue
          2     Sandy             T    Kirby      96       52 White First Freeway
          3       Lee             T    Zhang      55      921 White Fabien Avenue
          4    Regina             S    Avery      40                75 Old Avenue
          5    Daniel             S   Mccann       2 283 South Green Hague Avenue

 Rows: 98759 || Columns : 6



 employees

First 5 records from tabe employees:



 employees.csv

 EmployeeID FirstName MiddleInitial LastName               BirthDate Gender  CityID                HireDate
          1    Nicole             T   Fuller 1981-03-07 00:00:00.000      F      80 2011-06-20 07:15:36.920
          2 Christine             W   Palmer 1968-01-25 00:00:00.000      F       4 2011-04-27 04:07:56.930
          3     Pablo             Y    Cline 1963-02-09 00:00:00.000      M      70 2012-03-30 18:55:23.270
          4   Darnell             O  Nielsen 1989-02-06 00:00:00.000      M      39 2014-03-06 06:55:02.780
          5   Desiree             L   Stuart 1963-05-03 00:00:00.000      F      23 2014-11-16 22:59:54.720

 Rows: 23 || Columns : 8



 products

First 5 records from tabe products:



 products.csv

 ProductID                ProductName   Price  CategoryID  Class ModifyDate Resistant IsAllergic  VitalityDays
         1        Flour - Whole Wheat 74.2988           3 Medium    21:49.2   Durable    Unknown             0
         2 Cookie Chocolate Chip With 91.2329           3 Medium    39:11.0   Unknown    Unknown             0
         3         Onions - Cippolini  9.1379           9 Medium    11:51.6      Weak      FALSE           111
         4 Sauce - Gravy; Au Jus; Mix 54.3055           9 Medium    46:28.9   Durable    Unknown             0
         5     Artichokes - Jerusalem 65.4771           2    Low    13:35.4   Durable       TRUE            27

 Rows: 452 || Columns : 9



 sales

First 5 records from tabe sales:



 sales.csv

 SalesID  SalesPersonID  CustomerID  ProductID  Quantity  Discount  TotalPrice               SalesDate    TransactionNumber
       1              6       27039        381         7       0.0         0.0 2018-02-05 07:38:25.430 FQL4S94E4ME1EZFTG42G
       2             16       25011         61         7       0.0         0.0 2018-02-02 16:03:31.150 12UGLX40DJ1A5DTFBHB8
       3             13       94024         23        24       0.0         0.0 2018-05-03 19:31:56.880 5DT8RCPL87KI5EORO7B0
       4              8       73966        176        19       0.2         0.0 2018-04-07 14:43:55.420 R3DR9MLD5NR76VO17ULE
       5             10       32653        310         9       0.0         0.0 2018-02-12 15:37:03.940 4BGS0Z5OMAZ8NDAFHHP3

 Rows: 6758125 || Columns : 9



In [None]:
import pandas as pd
from pathlib import Path
from utils.sql_utils import run_query
import utils.notebook_utils as notebook_utils

data_path = Path("./data")
csv_files = sorted(data_path.glob("*.csv"))

print("Checking data loaded from csv files and what finally was loaded on db. \n")

for file in csv_files:
    try:
        table_name = file.stem

        df_csv_sample = pd.read_csv(file, nrows=5)
        total_csv_rows = sum(1 for _ in open(file)) - 1  # sin header
        csv_columns = list(df_csv_sample.columns)

        db_sample = run_query(f"SELECT * FROM {table_name} LIMIT 5;")
        result = run_query(f"SELECT COUNT(*) as total FROM {table_name};")
        
        total_db_rows = result.iloc[0]['total']

        colored_filename = notebook_utils.print_colored(text='Filename -> table_name', color='blue', tag='p')
        print(f" {file.name} -> {table_name}  \n")

        print(f" CSV: {total_csv_rows} rows | DB: {total_db_rows} rows \n")
        
        # Comparación de columnas
        db_columns = list(db_sample.columns)
        if csv_columns != db_columns:
            notebook_utils.print_colored(text='Columnas no coinciden: ', color='red', tag='p')
            print(f"CSV: {csv_columns}")
            print(f"DB : {db_columns}")
        else:
            notebook_utils.print_colored(text='Columnas coinciden: ', color='green', tag='p')

            notebook_utils.print_colored(text='first 5 rows (CSV) ', color='blue', tag='p', weight='normal')
            print(df_csv_sample.to_string(index=False))

            notebook_utils.print_colored(text='first 5 rows (DB) ', color='blue', weight='normal')

            print(sql_utils.run_query(f"SELECT * FROM {table_name} LIMIT 5;"))

    except Exception as e:
        print(f" Error al procesar {file.name}: {e}")
    
    # print("─" * 60)
    notebook_utils.print_colored_separator()
