# Danny's Pizza — Dataset Scaffold (OscarF Datasets generator)


**Created:** 2025-08-22 00:39

This notebook sets up the **small tables** in-notebook and expects the **large order tables** to be loaded from CSV/XLS files.
It strictly follows the schema from class:

- `pizza_names(pizza_id INT, pizza_name TEXT)`
- `pizza_toppings(topping_id INT, topping_name TEXT)`
- `pizza_recipes(pizza_id INT, toppings TEXT)` where `toppings` is a comma-separated list of `topping_id`s
- `runners(runner_id INT, registration_date DATE)`
- `customer_orders(order_id INT, customer_id INT, pizza_id INT, exclusions VARCHAR(4), extras VARCHAR(4), order_date TIMESTAMP)`
- `runner_orders(order_id INT, runner_id INT, pickup_time VARCHAR(19), distance VARCHAR(7), duration VARCHAR(10), cancellation VARCHAR(23))`
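Because `pizza_recipes.toppings` stores several `topping_id`s in one comma-separated string, it usually has to be split into one row per topping before it can be joined to `pizza_toppings`. The following is a minimal sketch of that step on toy copies of the two tables; the names `recipe_rows` and `recipe_long` are illustrative and not part of the class schema.

```python
import pandas as pd

# Toy copies of two tables from the schema above (values are illustrative).
pizza_recipes = pd.DataFrame({
    "pizza_id": [1, 6],
    "toppings": ["1,2,14", "1,2,7"],
})
pizza_toppings = pd.DataFrame({
    "topping_id": [1, 2, 7, 14],
    "topping_name": ["Tomato Sauce", "Mozzarella", "Pepperoni", "Fresh Basil"],
})

# Split the comma-separated string into a list, then emit one row per topping.
recipe_rows = (
    pizza_recipes
    .assign(topping_id=pizza_recipes["toppings"].str.split(","))
    .explode("topping_id")
    .drop(columns="toppings")
)
# Split values are strings; strip stray spaces and cast so the join keys match.
recipe_rows["topping_id"] = recipe_rows["topping_id"].str.strip().astype(int)

# Left join keeps every recipe row and attaches the topping name.
recipe_long = recipe_rows.merge(pizza_toppings, on="topping_id", how="left")
print(recipe_long)
```

The long format makes questions about extras and exclusions easier to answer, since each topping can be counted or filtered as an ordinary row.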



In [None]:
import pandas as pd
from sklearn.cluster import KMeans
import plotly.graph_objects as go
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

## Create Small Tables (Dims)

We define a **minimal, realistic** pizza catalog.


In [None]:
# --- pizza_names ---
pizza_names = pd.DataFrame({
    'pizza_id': [1, 2, 3, 4, 5, 6, 7],
    'pizza_name': [
        'Margherita',        # 1
        'Vegetarian',        # 2
        'Meat Lovers',       # 3
        'BBQ Chicken',       # 4
        'Hawaiian',          # 5
        'Pepperoni',         # 6
        'Vegan Veggie'       # 7  <-- vegan option (no cheese by default)
    ]
})
pizza_names.to_csv('pizza_names.csv', index=False)
pizza_names

In [None]:
# --- pizza_toppings ---
# Concise list so students can reason about extras/exclusions clearly
pizza_toppings = pd.DataFrame({
    'topping_id': list(range(1, 16)),
    'topping_name': [
        'Tomato Sauce',  # 1
        'Mozzarella',    # 2
        'Mushroom',      # 3
        'Onion',         # 4
        'Bell Pepper',   # 5
        'Olives',        # 6
        'Pepperoni',     # 7
        'Bacon',         # 8
        'Beef',          # 9
        'Chicken',       # 10
        'Pineapple',     # 11
        'BBQ Sauce',     # 12
        'Jalapeno',      # 13
        'Fresh Basil',   # 14
        'Garlic'         # 15
    ]
})
pizza_toppings.to_csv('pizza_toppings.csv', index=False)
pizza_toppings

In [None]:
# --- pizza_recipes ---
# Define base recipes as comma-separated topping_id strings (order does not matter)
recipes_map = {
    1: [1,2,14],               # Margherita: sauce, mozzarella, basil
    2: [1,2,3,4,5,6],          # Vegetarian: sauce, mozzarella, mushroom, onion, bell pepper, olives
    3: [1,2,7,8,9],            # Meat Lovers: sauce, mozzarella, pepperoni, bacon, beef
    4: [1,2,10,12],            # BBQ Chicken: sauce, mozzarella, chicken, bbq sauce
    5: [1,2,11,6],             # Hawaiian: sauce, mozzarella, pineapple, olives
    6: [1,2,7],                # Pepperoni: sauce, mozzarella, pepperoni
    7: [1,3,4,5,6,15]          # Vegan Veggie: sauce, mushroom, onion, bell pepper, olives, garlic (no cheese)
}

pizza_recipes = pd.DataFrame({
    'pizza_id': list(recipes_map.keys()),
    'toppings': [','.join(map(str, v)) for v in recipes_map.values()]
})
pizza_recipes.to_csv('pizza_recipes.csv', index=False)
pizza_recipes

In [None]:
# --- runners ---

dates = pd.date_range('2021-01-03', periods=15, freq='7D')
runners = pd.DataFrame({
    'runner_id': range(1, 16),
    'registration_date': dates.date
})
runners.to_csv('runners.csv', index=False)
runners

## Load Orders

This cell loads the orders from CSV first; if not present, it tries XLSX. Adjust file paths if needed.
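The CSV-first, XLSX-second behavior described above could look roughly like the sketch below. The helper name `load_table` and its `base_dir` parameter are hypothetical, and the XLSX branch assumes an Excel engine such as `openpyxl` is installed.

```python
from pathlib import Path

import pandas as pd


def load_table(name, base_dir="."):
    """Load `<name>.csv` if it exists, otherwise fall back to `<name>.xlsx`."""
    base = Path(base_dir)
    csv_path = base / f"{name}.csv"
    xlsx_path = base / f"{name}.xlsx"
    if csv_path.exists():
        return pd.read_csv(csv_path)
    if xlsx_path.exists():
        # read_excel needs an Excel engine (for example openpyxl) to be installed.
        return pd.read_excel(xlsx_path)
    raise FileNotFoundError(
        f"No CSV or XLSX file found for table '{name}' in {base}"
    )
```

A call such as `load_table("customer_orders")` would then return the order table from whichever file is present in the working directory.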


## Cleaning & Normalization Helpers

- Parse `distance` (to float km) and `duration` (to minutes).
- Normalize `cancellation` labels (lowercase, strip).
- Enforce FK integrity and logical constraints.
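The helpers listed above are not defined in this part of the notebook. A minimal sketch of what they might look like follows; the function names (`parse_number`, `normalize_cancellation`, `orphan_keys`) and the set of "no cancellation" labels are assumptions, and the sample rows are invented for illustration. Logical constraints (for example, a pickup time that precedes the order time) are not covered here.

```python
import re

import pandas as pd

_NUMBER = re.compile(r"\d+(?:\.\d+)?")
# Labels treated as "no cancellation" (an assumption; adjust to the real data).
_NO_CANCEL = {"", "null", "none", "nan"}


def parse_number(value):
    """Return the first number found in `value` as a float, or None if absent."""
    if pd.isna(value):
        return None
    match = _NUMBER.search(str(value))
    return float(match.group()) if match else None


def normalize_cancellation(value):
    """Lowercase and strip the label; map blank or null-like values to None."""
    if pd.isna(value):
        return None
    label = str(value).strip().lower()
    return None if label in _NO_CANCEL else label


def orphan_keys(child, parent, key):
    """Return key values present in `child` but missing from `parent` (FK check)."""
    return sorted(set(child[key].dropna()) - set(parent[key].dropna()))


# Invented sample rows showing the kinds of raw values the helpers handle.
raw = pd.DataFrame({
    "order_id": [1, 2, 3],
    "distance": ["20km", "13.4 km", None],
    "duration": ["32 minutes", "20mins", "null"],
    "cancellation": [None, " NULL ", "Restaurant Cancellation"],
})
clean = raw.assign(
    distance_km=raw["distance"].map(parse_number),
    duration_min=raw["duration"].map(parse_number),
    cancellation=raw["cancellation"].map(normalize_cancellation),
)
print(clean)
```

Extracting the numeric part with a regular expression, rather than stripping a fixed suffix, means unit variants such as `km`, `mins`, and `minutes` are all handled by the same function.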


In [None]:

import numpy as np

# Files are read from the working directory (no Google Drive base path is set)

# --- Load small dimension tables ---
pizza_names     = pd.read_csv('pizza_names.csv')
pizza_toppings  = pd.read_csv('pizza_toppings.csv')
pizza_recipes   = pd.read_csv('pizza_recipes.csv')
runners         = pd.read_csv('runners.csv')

# --- Load big fact tables ---
customer_orders = pd.read_csv('customer_orders.csv')
runner_orders   = pd.read_csv('runner_orders.csv')

print("pizza_names:", pizza_names.shape)
print("pizza_toppings:", pizza_toppings.shape)
print("pizza_recipes:", pizza_recipes.shape)
print("runners:", runners.shape)
print("customer_orders:", customer_orders.shape)
print("runner_orders:", runner_orders.shape)


In [None]:
customer_orders.head()





# Case Study Questions
A. Pizza Metrics

    How many pizzas were ordered?
    How many unique customer orders were made?
    How many successful orders were delivered by each runner?
    How many of each type of pizza was delivered?
    How many Vegetarian and Meatlovers were ordered by each customer?
    What was the maximum number of pizzas delivered in a single order?
    For each customer, how many delivered pizzas had at least 1 change and how many had no changes?
    How many pizzas were delivered that had both exclusions and extras?
    What was the total volume of pizzas ordered for each hour of the day?
    What was the volume of orders for each day of the week?

B. Runner and Customer Experience

    How many runners signed up for each 1 week period? (i.e. week starts 2021-01-01)
    What was the average time in minutes it took for each runner to arrive at the Pizza Runner HQ to pickup the order?
    Is there any relationship between the number of pizzas and how long the order takes to prepare?
    What was the average distance travelled for each customer?
    What was the difference between the longest and shortest delivery times for all orders?
    What was the average speed for each runner for each delivery and do you notice any trend for these values?
    What is the successful delivery percentage for each runner?

🍕 section c — customer & business intelligence

C1. total customer spend
 which customers bring in the most revenue?

use to argue for vip memberships or spend-based rewards.

C2. customer frequency (distinct days of orders)
 who orders regularly vs. one-off customers?

segment into loyal customers vs. occasional customers.

hint at frequency-based discounts (e.g., 5th order free).

C3. first pizza ordered by each customer
 what attracts customers initially?

good to identify entry-point pizzas (the hook item that brings people in).

hint at discounts on first-order pizzas to acquire new customers.

C4. overall best-seller pizza
 which pizza keeps the lights on?

highlight as a flagship product to promote.

use for seasonal bundles (“summer deal with our #1 pizza”).

C5. most popular pizza by customer
 can we personalize offers?

recommend personalized “customer favorites” discounts.

hint toward AI/BI-driven recommender systems.

C6. regulars with ≥30 orders and their go-to pizzas
 who are the heavy hitters and what do they like?

obvious loyalty program candidates.

pitch: “keep them happy with exclusive rewards so they don’t churn.”

C7. customers with very consistent habits (always order the same pizza)
 creatures of habit = stable recurring revenue.

membership idea: “pizza subscription” (weekly plan with their pizza auto-delivered).

C8. the “perfect pair”

great marketing story: “find your pizza soulmate.”

pitch: social media campaign + 2-for-1  perfect pizza couples’ promo.

C9. peak order times
👉 what hours & days matter most?

operational: staff scheduling.

marketing: happy hour discounts in slow periods, premium pricing at peak times.

C10. best candidates for loyalty program
👉 combine spend + frequency + consistency.

identify top 5–10% customers.

suggest tiered memberships: silver/gold/platinum.

seasonal perks: double points in winter when sales slow.

# Deliverable

final presentation = a business intelligence pitch deck:

customer segmentation (loyal vs occasional vs perfect pair).

menu insights (flagship pizza, first-order hook, personal favorites).

time insights (peak hours, seasonal discounts).

strategic recommendations:

loyalty program design,

subscription/membership tiers,

seasonal & time-based promos,

“perfect pair” marketing campaign.

In [None]:
import sqlite3

db_path = "dannys_pizza.sqlite"
conn = sqlite3.connect(db_path)
c = conn.cursor()

# drop existing tables (clean slate)
for t in [
    'pizza_names','pizza_toppings','pizza_recipes',
    'runners','customer_orders','runner_orders'
]:
    c.execute(f"DROP TABLE IF EXISTS {t};")

# create empty tables with the canonical column names
c.execute("""CREATE TABLE pizza_names (
  pizza_id INTEGER,
  pizza_name TEXT
);""")

c.execute("""CREATE TABLE pizza_toppings (
  topping_id INTEGER,
  topping_name TEXT
);""")

c.execute("""CREATE TABLE pizza_recipes (
  pizza_id INTEGER,
  toppings TEXT
);""")

c.execute("""CREATE TABLE runners (
  runner_id INTEGER,
  registration_date TEXT
);""")

# NOTE: keep your current column names exactly as they are in the DataFrame
# If your DF uses 'order_date', keep it; if it's 'order_time', keep that.
# Below uses 'order_date'—change to 'order_time' if that’s your DF.
c.execute("""CREATE TABLE customer_orders (
  order_id INTEGER,
  customer_id INTEGER,
  pizza_id INTEGER,
  exclusions TEXT,
  extras TEXT,
  order_date TEXT
);""")

c.execute("""CREATE TABLE runner_orders (
  order_id INTEGER,
  runner_id INTEGER,
  pickup_time TEXT,
  distance TEXT,
  duration TEXT,
  cancellation TEXT
);""")

conn.commit()

# append DataFrames exactly as-is (no cleaning)
pizza_names.to_sql('pizza_names', conn, if_exists='append', index=False)
pizza_toppings.to_sql('pizza_toppings', conn, if_exists='append', index=False)
pizza_recipes.to_sql('pizza_recipes', conn, if_exists='append', index=False)
runners.to_sql('runners', conn, if_exists='append', index=False)
customer_orders.to_sql('customer_orders', conn, if_exists='append', index=False)
runner_orders.to_sql('runner_orders', conn, if_exists='append', index=False)

conn.commit()
print("SQLite ready at:", db_path)

* Q1. How many pizzas were ordered?

In [None]:
pd.read_sql("""
SELECT COUNT(*) AS total_pizzas
FROM customer_orders;
""", conn)


In [None]:
pd.read_sql(
    '''SELECT * 
    FROM customer_orders''',conn
)

* Q2. How many unique customer orders were made?


In [None]:
pd.read_sql(
    ''' SELECT COUNT(DISTINCT order_id) unique_orders FROM customer_orders''',conn
)

In [None]:
pd.read_sql(
    '''SELECT * 
    FROM runner_orders''',conn
)

* Q3. How many successful orders were delivered by each runner?


In [None]:
pd.read_sql(
    '''SELECT 
             runner_id,
             COUNT(order_id) AS successful_orders
        FROM runner_orders
        WHERE cancellation IS NULL
        GROUP BY 1
    ''', conn
)

* Q4 How many of each type of pizza was delivered?

In [None]:
pd.read_sql(
    '''
SELECT 
    p.pizza_name,
    COUNT(p.pizza_id) AS total_delivered
FROM customer_orders co
JOIN pizza_names p 
    ON co.pizza_id = p.pizza_id
JOIN runner_orders r 
    ON co.order_id = r.order_id 
   AND r.cancellation IS NULL
GROUP BY 1;
    ''',conn
)

* Q5 .How many Vegetarian and Meatlovers were ordered by each customer?


In [None]:
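# One column per pizza type, so the two counts are reported separately
pd.read_sql(
    '''
SELECT
     co.customer_id,
     SUM(pn.pizza_name = 'Vegetarian')  AS vegetarian_ordered,
     SUM(pn.pizza_name = 'Meat Lovers') AS meat_lovers_ordered
FROM customer_orders co
JOIN pizza_names pn ON 
co.pizza_id = pn.pizza_id AND pn.pizza_name IN ('Meat Lovers', 'Vegetarian')
GROUP BY 1
ORDER BY 1
''',conn
)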

* Q6 .What was the maximum number of pizzas delivered in a single order?


In [None]:
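# The question asks about delivered pizzas, so cancelled orders are excluded
pd.read_sql(
   '''
SELECT 
      MAX(pizza_count) AS max_count
FROM (
SELECT 
      co.order_id,
      COUNT(co.pizza_id) AS pizza_count
FROM customer_orders co
JOIN runner_orders r
  ON co.order_id = r.order_id
 AND r.cancellation IS NULL
GROUP BY 1)
''',conn
)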

* Q7 .For each customer, how many delivered pizzas had at least 1 change and how many had no changes?


In [None]:
# "At least 1 change" means the pizza has an exclusion or an extra
pd.read_sql(
    '''SELECT 
             co.customer_id,
             COUNT(co.pizza_id) AS total_pizzas_change
       FROM customer_orders co
       JOIN runner_orders r ON co.order_id = r.order_id
       WHERE r.cancellation IS NULL 
       AND (co.exclusions IS NOT NULL OR co.extras IS NOT NULL)
       GROUP BY 1
       ''',conn
)

In [None]:
# "No changes" means neither an exclusion nor an extra
pd.read_sql(
    '''SELECT 
             co.customer_id,
             COUNT(co.pizza_id) AS total_pizzas_no_change
       FROM customer_orders co
       JOIN runner_orders r USING(order_id) 
       WHERE r.cancellation IS NULL 
       AND co.exclusions IS NULL
       AND co.extras IS NULL
       GROUP BY 1''',conn
)

* Q8 .How many pizzas were delivered that had both exclusions and extras?

In [None]:
pd.read_sql(
    '''SELECT COUNT(co.pizza_id) AS total_pizza_ee
      FROM customer_orders co
      JOIN runner_orders r USING(order_id)
      WHERE r.cancellation IS NULL
      AND co.exclusions IS NOT NULL
      AND co.extras IS NOT NULL
      ''',conn
)

* Q9 .What was the total volume of pizzas ordered for each hour of the day?


In [None]:
pd.read_sql(
    '''SELECT 
             strftime('%H',order_date) AS hour,
             COUNT(pizza_id) AS total_pizza
       FROM customer_orders
       GROUP BY 1
       ''',conn
)

* Q10 .What was the volume of orders for each day of the week?

In [None]:
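# strftime('%w') returns 0 (Sunday) through 6 (Saturday);
# DISTINCT counts orders rather than individual pizzas
pd.read_sql(
    '''SELECT 
           strftime('%w', order_date) AS day,
           COUNT(DISTINCT order_id) AS total_orders
        FROM customer_orders
        GROUP BY 1
  ''', conn
)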

* Q11 .How many runners signed up for each 1 week period? (i.e. week starts 2021-01-01)

In [None]:
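# Week 1 starts on 2021-01-01; each later week is a 7-day block counted from that date
pd.read_sql(
    '''SELECT 
   CAST((julianday(registration_date) - julianday('2021-01-01')) / 7 AS INTEGER) + 1 AS week,
   COUNT(*) AS runner_count
   FROM runners
   GROUP BY 1 
    ''',conn
)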

* Q12 .What was the average time in minutes it took for each runner to arrive at the Pizza Runner HQ to pickup the order?


In [None]:
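# Average minutes from order placement to pickup, per runner.
# DISTINCT keeps one row per order so multi-pizza orders are not over-weighted.
pd.read_sql(
    '''WITH order_times AS (
    SELECT DISTINCT r.order_id, r.runner_id, co.order_date, r.pickup_time
    FROM runner_orders r JOIN customer_orders co USING(order_id)
    WHERE r.pickup_time IS NOT NULL
)
SELECT
    runner_id,
    ROUND(AVG((julianday(pickup_time) - julianday(order_date)) * 24 * 60), 2) AS avg_time
FROM order_times
GROUP BY 1
    ''',conn
)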

* Q13 .Is there any relationship between the number of pizzas and how long the order takes to prepare?

In [None]:
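# Preparation time = pickup_time - order_date in minutes;
# julianday() handles intervals that cross an hour or day boundary
pd.read_sql(
    '''SELECT 
    co.order_id,
    COUNT(co.pizza_id) AS count_pizzas,
    ROUND((julianday(MIN(r.pickup_time)) - julianday(MIN(co.order_date))) * 24 * 60, 1) AS prepare_time
    FROM customer_orders co
    JOIN runner_orders r USING(order_id)
    WHERE r.pickup_time IS NOT NULL
    GROUP BY 1 
    ORDER BY 3 DESC
    LIMIT 30
    ''',conn
)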
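To answer the "is there any relationship" part, the per-order result can be summarized with a correlation coefficient and a per-size average. The sketch below uses invented values in place of the query output; the column name `prepare_minutes` is hypothetical.

```python
import pandas as pd

# Illustrative per-order values (not taken from the real dataset).
prep = pd.DataFrame({
    "count_pizzas":    [1, 1, 2, 2, 3, 3],
    "prepare_minutes": [10, 12, 18, 21, 29, 30],
})

# Pearson correlation between order size and preparation time.
corr = prep["count_pizzas"].corr(prep["prepare_minutes"])

# Average preparation time per order size, easier to read than raw rows.
avg_by_size = prep.groupby("count_pizzas")["prepare_minutes"].mean()

print(round(corr, 3))
print(avg_by_size)
```

A correlation near 1 together with a steadily rising average would suggest that preparation time grows with the number of pizzas; on the real data the same two lines can be applied to the DataFrame returned by the query.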

* Q14 .What was the average distance travelled for each customer?


In [None]:
pd.read_sql('''
SELECT 
    co.customer_id,
    AVG(CAST(ro.distance AS REAL)) AS avg_distance
FROM customer_orders co 
JOIN runner_orders ro USING(order_id)
GROUP BY 1
''', conn)

* Q15 .What was the difference between the longest and shortest delivery times for all orders?


In [None]:
pd.read_sql('''
SELECT 
    MAX(CAST(duration AS INTEGER)) - MIN(CAST(duration AS INTEGER)) AS diff
FROM runner_orders
WHERE duration IS NOT NULL
AND (cancellation IS NULL OR cancellation = 'None')
''', conn)

* Q16 .What was the average speed for each runner for each delivery and do you notice any trend for these values?


In [None]:
# Speed in km/h: distance (km) divided by duration converted from minutes to hours
pd.read_sql(
    '''
SELECT 
    order_id,
    runner_id,
    ROUND(AVG(CAST(distance AS REAL) / (CAST(duration AS REAL) / 60.0)), 2) AS avg_speed_kmh
FROM runner_orders
WHERE cancellation IS NULL 
AND distance IS NOT NULL
AND duration IS NOT NULL
GROUP BY 1, 2
ORDER BY 2,1
LIMIT 40
''',conn
)

* Q17 .What is the successful delivery percentage for each runner?

In [None]:
pd.read_sql('''
WITH delivered AS (
SELECT 
    runner_id,
    COUNT(order_id) AS delivery_count
FROM runner_orders
WHERE cancellation IS NULL
GROUP BY runner_id
), totals AS (
SELECT 
      runner_id,
      COUNT( order_id) AS total_deliveries
FROM runner_orders
GROUP BY runner_id
)
            
SELECT 
     runner_id,
    ROUND(100.0 * d.delivery_count / t.total_deliveries, 2) AS delivery_percentage
FROM delivered d
JOIN totals t USING(runner_id)
ORDER BY 2 DESC
''', conn)

C1. total customer spend
 which customers bring in the most revenue?

use to argue for vip memberships or spend-based rewards.



In [None]:
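# The schema has no price column, so the number of pizzas ordered
# is used here as a proxy for customer spend
pd.read_sql(
'''
SELECT 
        customer_id,
        COUNT(order_id) AS total_orders
FROM customer_orders
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
''', conn
)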

C2. customer frequency (distinct days of orders)
 who orders regularly vs. one-off customers?

segment into loyal customers vs. occasional customers.

hint at frequency-based discounts (e.g., 5th order free).



In [None]:
# DATE() drops the time part so that distinct days, not distinct timestamps, are counted
pd.read_sql(
    '''
    WITH total_dates AS (
        SELECT
                customer_id,
                COUNT(DISTINCT DATE(order_date)) AS total_order_dates
        FROM customer_orders
        GROUP BY customer_id
    )

    SELECT
            customer_id,
            total_order_dates,
            CASE
                WHEN total_order_dates BETWEEN 1 AND 4 THEN 'occasional customer'
                ELSE 'regular customer'
            END AS type_customer
    FROM total_dates
    ORDER BY 2 DESC
      ''',conn
)

C3. first pizza ordered by each customer
 what attracts customers initially?

good to identify entry-point pizzas (the hook item that brings people in).

hint at discounts on first-order pizzas to acquire new customers.



In [None]:
pd.read_sql('''
WITH ranks AS (
    SELECT 
        co.customer_id,
        pn.pizza_name,
        co.order_date,
        DENSE_RANK() OVER (PARTITION BY co.customer_id ORDER BY co.order_date) AS rank_date
    FROM customer_orders co
    JOIN pizza_names pn USING(pizza_id)
), first_orders AS (
    SELECT 
        * 
    FROM ranks
    WHERE rank_date = 1
)
            
SELECT
      pizza_name,
      COUNT(DISTINCT customer_id) total_customer_count
FROM first_orders
GROUP BY 1 
ORDER BY 2 DESC
''',conn)
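C4 (overall best-seller pizza) has no query in this notebook. A minimal sketch of one way to answer it is shown below on an in-memory toy database; the connection name `demo` and the sample rows are illustrative only, and the same `SELECT` can be run against `conn`.

```python
import sqlite3

import pandas as pd

# In-memory toy database; the rows are illustrative, not the real order data.
demo = sqlite3.connect(":memory:")
pd.DataFrame({
    "pizza_id": [1, 2, 3],
    "pizza_name": ["Margherita", "Vegetarian", "Meat Lovers"],
}).to_sql("pizza_names", demo, index=False)
pd.DataFrame({
    "order_id": [1, 1, 2, 3, 4],
    "customer_id": [10, 10, 11, 12, 10],
    "pizza_id": [3, 1, 3, 2, 3],
}).to_sql("customer_orders", demo, index=False)

# Count rows per pizza (one row = one pizza ordered) and keep the top one.
best_seller = pd.read_sql("""
SELECT pn.pizza_name,
       COUNT(*) AS times_ordered
FROM customer_orders co
JOIN pizza_names pn USING(pizza_id)
GROUP BY pn.pizza_name
ORDER BY times_ordered DESC
LIMIT 1
""", demo)
print(best_seller)
```

`LIMIT 1` returns a single row even when two pizzas tie for first place; dropping it shows the full ranking.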

C5. most popular pizza by customer
 can we personalize offers?

recommend personalized “customer favorites” discounts.

hint toward AI/BI-driven recommender systems.



In [None]:
pd.read_sql(

    '''
    WITH pizza_count AS (
        SELECT 
                co.customer_id,
                p.pizza_name,
                COUNT(*) pizza_per_customer
        FROM customer_orders co
        JOIN pizza_names p USING(pizza_id)
        GROUP BY 1,2
    ), pizza_rank AS (
        SELECT 
             *,
             DENSE_RANK() OVER (PARTITION BY customer_id ORDER BY pizza_per_customer DESC) AS rn
        FROM pizza_count 
    )

    SELECT 
          customer_id,
          pizza_name
    FROM pizza_rank
    WHERE rn = 1
    
    ''',conn
)


C6. regulars with ≥30 orders and their go-to pizzas
 who are the heavy hitters and what do they like?

obvious loyalty program candidates.

pitch: “keep them happy with exclusive rewards so they don’t churn.”



In [None]:
pd.read_sql(

    '''
    WITH regulars AS (
        SELECT 
                customer_id,
                COUNT(order_id) AS total_orders
        FROM customer_orders
        GROUP BY 1 
        HAVING total_orders >= 30
    ), pizza_rank AS (
        SELECT 
             co.customer_id,
                p.pizza_name,
                DENSE_RANK() OVER (PARTITION BY co.customer_id ORDER BY COUNT(*) DESC) AS rn,
                COUNT(*) pizza_per_customer
        FROM customer_orders co
        JOIN pizza_names p USING(pizza_id)
        WHERE co.customer_id IN (
            SELECT 
                  customer_id
            FROM regulars
        )
        GROUP BY 1,2 
    )

    SELECT 
         customer_id,
         pizza_name
    FROM pizza_rank
    WHERE rn = 1
    
    ''',conn
)


C7. customers with very consistent habits (always order the same pizza)
 creatures of habit = stable recurring revenue.

membership idea: “pizza subscription” (weekly plan with their pizza auto-delivered).



In [None]:
pd.read_sql(
'''SELECT 
        customer_id,
        COUNT(DISTINCT pizza_id) AS count_pizzas 
   FROM customer_orders
   GROUP BY 1
   HAVING count_pizzas = 1
''',conn

)


C8. the “perfect pair”

great marketing story: “find your pizza soulmate.”

pitch: social media campaign + 2-for-1  perfect pizza couples’ promo.



In [None]:
pd.read_sql(
    '''
    WITH pizza_order AS (
        SELECT
                order_id,
                COUNT(pizza_id) AS pizza_count
        FROM customer_orders
        GROUP BY 1
        HAVING pizza_count > 1
    ), items AS (                         
        SELECT DISTINCT
            co.order_id,
            co.pizza_id
        FROM customer_orders co
        WHERE co.order_id IN (SELECT order_id FROM pizza_order)
    ), pairs AS (                          
        SELECT
            a.pizza_id AS pizza_a,
            b.pizza_id AS pizza_b,
            COUNT(*)   AS pair_orders
        FROM items a
        JOIN items b
            ON a.order_id = b.order_id
        AND a.pizza_id < b.pizza_id      
        GROUP BY a.pizza_id, b.pizza_id
    )
    SELECT
          pa.pizza_name AS pizza_a,
          pb.pizza_name AS pizza_b,
          pair_orders
    FROM pairs
    JOIN pizza_names pa ON pa.pizza_id = pairs.pizza_a
    JOIN pizza_names pb ON pb.pizza_id = pairs.pizza_b
    ORDER BY pair_orders DESC, pizza_a, pizza_b;
   
    ''',conn
)


C9. peak order times
👉 what hours & days matter most?

operational: staff scheduling.

marketing: happy hour discounts in slow periods, premium pricing at peak times.



In [None]:
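# '%w' gives the day of week (0 = Sunday), matching the days-and-hours question;
# DISTINCT counts orders rather than individual pizzas
pd.read_sql(
    '''
    SELECT
          strftime('%w', order_date) AS day_order,
          strftime('%H', order_date) AS hour_order,
          COUNT(DISTINCT order_id) AS count_orders
    FROM customer_orders
    GROUP BY 1,2
    ORDER BY 3 DESC
    ''',conn
)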

C10. best candidates for loyalty program
👉 combine spend + frequency + consistency.

identify top 5–10% customers.

suggest tiered memberships: silver/gold/platinum.

seasonal perks: double points in winter when sales slow.

In [None]:
##pd.read_sql("YOUR QUERY HERE", conn)  # TODO
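C10 is left as a TODO. One possible approach, sketched below in pandas on invented rows, is to rank customers on volume (a stand-in for spend, since the schema has no prices), frequency, and consistency, then average the ranks. The equal weighting, the tier cut-offs, and the names `per_customer`, `loyalty_score`, and `tier` are all assumptions.

```python
import pandas as pd

# Illustrative orders; one row per pizza, as in customer_orders.
orders = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 2, 2, 3],
    "pizza_id":    [3, 3, 3, 2, 1, 5, 4],
    "order_date":  pd.to_datetime([
        "2025-01-01", "2025-01-08", "2025-01-15", "2025-01-22",
        "2025-01-05", "2025-01-05", "2025-02-01",
    ]),
})

per_customer = orders.groupby("customer_id").agg(
    # Volume: number of pizzas, used as a stand-in for spend.
    pizzas=("pizza_id", "size"),
    # Frequency: number of distinct calendar days with an order.
    order_days=("order_date", lambda s: s.dt.normalize().nunique()),
    # Consistency: share of the customer's pizzas that are their top pizza.
    top_share=("pizza_id", lambda s: s.value_counts(normalize=True).iloc[0]),
)

# Percentile-rank each metric and average with equal weights (an assumption).
score = per_customer.rank(pct=True).mean(axis=1)
per_customer["loyalty_score"] = score
# Tier cut-offs are hypothetical and should be tuned on the real distribution.
per_customer["tier"] = pd.cut(
    score, bins=[0, 0.5, 0.8, 1.0], labels=["silver", "gold", "platinum"]
)
print(per_customer.sort_values("loyalty_score", ascending=False))
```

Percentile ranks keep the three metrics on a common 0 to 1 scale, so no single metric dominates the score just because its raw values are larger.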


In [None]:
pd.read_sql("""
SELECT runner_id,
       COUNT(runner_id) AS order_count
FROM runner_orders
WHERE distance IS NOT NULL
  AND TRIM(distance) <> ''
  AND cancellation = ''
GROUP BY runner_id
ORDER BY order_count DESC;
""", conn)


In [None]:
pd.read_sql("""
SELECT DISTINCT cancellation
FROM runner_orders
""", conn)

Notice how our query returned empty results? That is a clue that something is off in our filter. We wrote

WHERE distance IS NOT NULL
  AND cancellation = ''

but in this dataset the cancellation column does not only use a blank string to mean "no cancellation". Sometimes it holds the literal word 'null', sometimes it is NULL (the SQL null value), and sometimes the casing differs (Null, NULL). Because of that, our `= ''` condition excluded almost everything.
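One mechanism that can produce this, assuming the tables were loaded with `pd.read_csv` using its default settings, is that pandas parses both empty cells and the literal text `null` as missing values, which `to_sql` then stores as SQL `NULL`. The sketch below demonstrates this on a three-row toy CSV; the connection name `demo` is illustrative.

```python
import io
import sqlite3

import pandas as pd

# Toy CSV: a blank cell, the literal text 'null', and a real cancellation reason.
csv_text = (
    "order_id,cancellation\n"
    "1,\n"
    "2,null\n"
    "3,Restaurant Cancellation\n"
)

default_read = pd.read_csv(io.StringIO(csv_text))
# With default settings, '' and 'null' are both parsed as missing (NaN).
print(default_read["cancellation"].isna().tolist())

demo = sqlite3.connect(":memory:")
default_read.to_sql("runner_orders", demo, index=False)

# NaN is stored as SQL NULL, so an equality test against '' matches no rows.
n_blank = pd.read_sql(
    "SELECT COUNT(*) AS n FROM runner_orders WHERE cancellation = ''", demo
)["n"].iloc[0]
n_null = pd.read_sql(
    "SELECT COUNT(*) AS n FROM runner_orders WHERE cancellation IS NULL", demo
)["n"].iloc[0]
print(n_blank, n_null)

# keep_default_na=False keeps the raw strings, useful for inspecting the file.
raw_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(raw_read["cancellation"].tolist())
```

In SQL, `NULL = ''` evaluates to NULL rather than true, which is why such rows only show up with `IS NULL` or a `COALESCE` wrapper.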

In [None]:
pd.read_sql("""
SELECT DISTINCT TRIM(cancellation) AS cancellation_value,
       COUNT(*) AS n
FROM runner_orders
GROUP BY 1
ORDER BY n DESC;
""", conn)


In [None]:
# 3a) Successful orders delivered by each runner (ignore distance presence)
pd.read_sql("""
SELECT runner_id,
       COUNT(*) AS order_count
FROM runner_orders
WHERE COALESCE(TRIM(LOWER(cancellation)),'') IN ('', 'null')
GROUP BY runner_id
ORDER BY order_count DESC;
""", conn)


# 🍕 Customer & Business Intelligence Questions

## C1. Total Customer Spend
**What is the total amount each customer spent at pizza runner?**

*(Hint: use pizza prices from pizza_names and sum across delivered pizzas)*

## C2. Customer Order Frequency
**How many different days has each customer placed an order?**

*(Hint: count distinct dates from order_time / order_date)*

## C3. First Pizza Ordered
**What was the first pizza ordered by each customer?**

*(Hint: find the earliest order_time per customer)*

## C4. Most Popular Pizza Overall
**What is the most purchased pizza overall and how many times was it ordered?**

## C5. Customer's Favorite Pizza
**Which pizza is the most popular for each customer?**

*(Hint: group by customer_id + pizza_id, take the max count)*

## C6. Regular Customer Preferences
**Among the regular customers (≥30 orders), which pizzas do they mostly stick to?**

*(Hint: helps suggest loyalty-program discounts)*

## C7. Consistent Order Habits
**Which customers have very consistent order habits (always order the same pizza)?**

*(Hint: look for customers whose COUNT(DISTINCT pizza_id)=1)*

## C8. The "Perfect Pair"
**Can you find the "perfect pair" — two customers who often order the same pizza at the same time?**

*(Hint: think customers 30 and 543 😉)*
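The question as worded here asks for pairs of customers, whereas a query over pizza combinations within one order answers a different question. A minimal sketch of the customer-pair version follows, on invented rows; it assumes "same time" means the same calendar day and hour, and the names `demo`, `slots`, and `shared_orders` are illustrative.

```python
import sqlite3

import pandas as pd

# Invented rows: customers 30 and 543 order the same pizza in the same hour twice.
demo = sqlite3.connect(":memory:")
pd.DataFrame({
    "order_id":    [1, 2, 3, 4, 5, 6],
    "customer_id": [30, 543, 30, 543, 77, 30],
    "pizza_id":    [5, 5, 2, 2, 2, 1],
    "order_date":  ["2025-01-03 19:05:00", "2025-01-03 19:40:00",
                    "2025-01-10 20:10:00", "2025-01-10 20:15:00",
                    "2025-02-01 12:00:00", "2025-02-02 13:00:00"],
}).to_sql("customer_orders", demo, index=False)

pairs = pd.read_sql("""
WITH slots AS (
    -- One row per customer, pizza, and hour bucket.
    SELECT DISTINCT customer_id,
           pizza_id,
           strftime('%Y-%m-%d %H', order_date) AS order_hour
    FROM customer_orders
)
SELECT a.customer_id AS customer_a,
       b.customer_id AS customer_b,
       COUNT(*)      AS shared_orders
FROM slots a
JOIN slots b
  ON a.pizza_id = b.pizza_id
 AND a.order_hour = b.order_hour
 AND a.customer_id < b.customer_id   -- each pair once, no self-pairs
GROUP BY a.customer_id, b.customer_id
ORDER BY shared_orders DESC
""", demo)
print(pairs)
```

Widening or narrowing the time bucket (for example, same day instead of same hour) only requires changing the `strftime` format string.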

## C9. Peak Order Times
**What are the peak order times for the restaurant?**

*(Hint: group by day-of-week and hour-of-day)*


In [None]:
# =============================================================================
# DATA INITIALIZATION FOR MACHINE LEARNING
# =============================================================================

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')

# Load data
print("Loading data...")
customer_orders = pd.read_csv('customer_orders.csv')
pizza_names = pd.read_csv('pizza_names.csv')
pizza_recipes = pd.read_csv('pizza_recipes.csv')
pizza_toppings = pd.read_csv('pizza_toppings.csv')
runners = pd.read_csv('runners.csv')
runner_orders = pd.read_csv('runner_orders.csv')

# Clean and prepare the base data
print("Cleaning data...")

# Convert dates
customer_orders['order_date'] = pd.to_datetime(customer_orders['order_date'])
runner_orders['pickup_time'] = pd.to_datetime(runner_orders['pickup_time'])

# Clean cancellation (convert to binary: 0=delivered, 1=cancelled)
runner_orders['is_cancelled'] = runner_orders['cancellation'].fillna('').apply(
    lambda x: 1 if x.strip().lower() not in ['', 'null', 'none'] else 0
)

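# Clean distance and duration: keep only the numeric part, so unit variants
# such as 'km', 'mins' or 'minutes' (and already-numeric columns) are handled
runner_orders['distance_km'] = runner_orders['distance'].astype(str).str.extract(r'(\d+(?:\.\d+)?)')[0].astype(float)
runner_orders['duration_min'] = runner_orders['duration'].astype(str).str.extract(r'(\d+(?:\.\d+)?)')[0].astype(float)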

# Build the main combined dataset
print("Combining datasets...")
main_data = customer_orders.merge(pizza_names, on='pizza_id', how='left')
main_data = main_data.merge(runner_orders[['order_id', 'is_cancelled', 'distance_km', 'duration_min']], 
                           on='order_id', how='left')

# Keep only delivered orders for analysis
delivered_orders = main_data[main_data['is_cancelled'] == 0].copy()

print(f"Data loaded: {len(main_data)} total orders, {len(delivered_orders)} delivered")
print(f"Date range: {delivered_orders['order_date'].min()} to {delivered_orders['order_date'].max()}")

## Model 1: Logistic Regression - Customer Segmentation

### Feature Preparation for Segmentation

In [None]:
# =============================================================================
# MODEL 1: LOGISTIC REGRESSION - CUSTOMER SEGMENTATION
# =============================================================================

print("MODEL 1: Customer Segmentation (Loyal vs Occasional)")

# Build features for segmentation
customer_features = delivered_orders.groupby('customer_id').agg({
    'order_id': 'count',  # total_orders
    'order_date': ['min', 'max'],  # first_order, last_order
    'pizza_id': 'nunique',  # pizza_variety
    # Count only real values: NaN compares as != '' and would otherwise be counted
    'exclusions': lambda x: (x.notna() & (x.astype(str).str.strip() != '')).sum(),  # orders_with_exclusions
    'extras': lambda x: (x.notna() & (x.astype(str).str.strip() != '')).sum()  # orders_with_extras
}).reset_index()

# Flatten columns
customer_features.columns = ['customer_id', 'total_orders', 'first_order', 'last_order', 
                           'pizza_variety', 'orders_with_exclusions', 'orders_with_extras']

print(f"Features created for {len(customer_features)} customers")
print(customer_features.head())

#### Feature Preparation for Segmentation

Here is the profile of our first 5 customers, and clear patterns in their behavior are already visible.

Customer 5 is an ideal example of what we consider a "star customer": 31 orders, all 7 available pizza varieties tried, and modifications requested on practically every order (31 exclusions and 31 extras). This type of customer is highly valuable to the business. Counts that exactly equal the order total deserve a second look, since missing values can be counted as modifications if they are not filtered out.

In contrast, customers 3 and 4 behave completely differently: each has placed only 1 order. They are trial customers who could become regulars if we apply the right strategies.

The range of behaviors is notable. Customer 5 has been ordering from us from November 2024 to July 2025, while other customers show shorter periods of activity.

We also see that some customers are very specific about their orders, requesting many exclusions and extras, which suggests that they know the menu well and have well-defined preferences. This information is valuable for personalizing offers and improving the customer experience.

With this data we can start to identify which customers are loyal and which are occasional, and design specific strategies for each segment.

### Customer Metrics Calculation

In [None]:
# Compute customer lifetime in days
customer_features['customer_lifetime_days'] = (
    customer_features['last_order'] - customer_features['first_order']
).dt.days

# Compute average frequency (days between orders)
customer_features['avg_days_between_orders'] = (
    customer_features['customer_lifetime_days'] / customer_features['total_orders']
).fillna(0)

# Define a loyal customer (>=10 orders AND <=30 days on average between orders)
customer_features['is_loyal'] = (
    (customer_features['total_orders'] >= 10) & 
    (customer_features['avg_days_between_orders'] <= 30)
).astype(int)

print("Metrics computed:")
print(customer_features[['customer_id', 'total_orders', 'customer_lifetime_days', 
                        'avg_days_between_orders', 'is_loyal']].head(10))

#### Customer Metrics Calculation

Here we compute additional metrics that help us understand customer behavior in more detail.

Customer 5 has a lifetime of 240 days, meaning they have been with us for about 8 months. Their average order frequency is 7.7 days (240 days / 31 orders), which corresponds to roughly one order per week. With 31 total orders, they clearly meet our loyal-customer criteria.

In contrast, customers 3 and 4 have a lifetime of 0 days, which means they have placed only one order. Their average frequency is also 0, since they have not returned.

Customer 1 shows an interesting pattern: 1 order in 0 days, which may indicate a very recent customer.

These metrics let us clearly separate loyal customers (those with ≥10 orders and ≤30 days on average between orders) from occasional ones. Customer 5 is the only one in this sample who meets both criteria, marked with a 1 in the 'is_loyal' column.

This segmentation will help us design specific strategies to retain loyal customers and convert occasional ones into regulars.
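The figures quoted for customer 5 can be checked by hand. The short sketch below reproduces the notebook's definition (lifetime divided by the number of orders) and, for comparison, an alternative that divides by the number of gaps between orders; the alternative is an assumption and not what the cell above computes.

```python
# Values quoted above for customer 5.
lifetime_days = 240
total_orders = 31

# The notebook's definition: lifetime divided by the number of orders.
avg_days_between_orders = lifetime_days / total_orders
print(round(avg_days_between_orders, 1))  # 7.7

# Alternative (an assumption, not the notebook's definition): divide by the
# number of gaps between consecutive orders, which is total_orders - 1.
avg_gap_days = lifetime_days / (total_orders - 1)
print(round(avg_gap_days, 1))  # 8.0

# Loyalty rule from the cell above: >= 10 orders and <= 30 days on average.
is_loyal = int(total_orders >= 10 and avg_days_between_orders <= 30)
print(is_loyal)  # 1
```

The two definitions differ only slightly for frequent customers, but they diverge for customers with few orders, so the choice matters most near the loyalty threshold.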

### Exploratory Analysis and Charts

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Configure chart style
plt.style.use('default')
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Chart 1: Distribution of total orders
axes[0,0].hist(customer_features['total_orders'], bins=20, alpha=0.7, color='skyblue')
axes[0,0].set_title('Distribution of Total Orders per Customer')
axes[0,0].set_xlabel('Total Orders')
axes[0,0].set_ylabel('Frequency')

# Chart 2: Distribution of days between orders
axes[0,1].hist(customer_features['avg_days_between_orders'], bins=20, alpha=0.7, color='lightgreen')
axes[0,1].set_title('Distribution of Average Days between Orders')
axes[0,1].set_xlabel('Average Days between Orders')
axes[0,1].set_ylabel('Frequency')

# Chart 3: Loyal vs occasional customers
# reindex fixes the order to [0, 1] so the labels always match the slices
loyal_counts = customer_features['is_loyal'].value_counts().reindex([0, 1], fill_value=0)
axes[1,0].pie(loyal_counts.values, labels=['Occasional', 'Loyal'], autopct='%1.1f%%', 
              colors=['lightcoral', 'lightblue'])
axes[1,0].set_title('Loyal vs Occasional Customers')

# Chart 4: Scatter plot of total orders vs days between orders
# One scatter call per segment so each segment gets its own legend entry
for flag, color, label in [(0, 'red', 'Occasional'), (1, 'blue', 'Loyal')]:
    segment = customer_features[customer_features['is_loyal'] == flag]
    axes[1,1].scatter(segment['total_orders'], segment['avg_days_between_orders'], 
                      c=color, alpha=0.6, label=label)
axes[1,1].set_title('Total Orders vs Days between Orders')
axes[1,1].set_xlabel('Total Orders')
axes[1,1].set_ylabel('Average Days between Orders')
axes[1,1].legend()

plt.tight_layout()
plt.show()

print("Segmentation statistics:")
print(f"Loyal customers: {customer_features['is_loyal'].sum()}")
print(f"Occasional customers: {(customer_features['is_loyal'] == 0).sum()}")
print(f"Percentage of loyal customers: {customer_features['is_loyal'].mean()*100:.1f}%")

#### Exploratory Analysis and Plots

The exploratory analysis reveals clear patterns in customer behavior.

The order distribution shows a situation that is typical of the food business: the vast majority of our customers (more than 400) have placed no more than about 2 orders. This indicates that we have many trial or occasional customers. Only a small group has placed more than 50 orders, confirming that truly loyal customers are a minority.

The plot of average days between orders confirms this pattern. More than 300 customers have a very low average interval, which suggests that many of them have placed only one order or several orders very close together in time.

The most striking result is the split between loyal and occasional customers: only 3.2% of our customers qualify as loyal, while 96.8% are occasional. This proportion underlines the importance of retention strategies.

The scatter plot is especially revealing. Loyal customers (blue points) are concentrated in the lower right, showing a high number of orders with few days between them. Occasional customers (red points) cluster in the lower left, with few orders and high variability in frequency.

These insights confirm that we need to focus on converting occasional customers into loyal ones, since occasional customers make up the vast majority of our customer base.

### Data Preparation for the Model

In [None]:
from sklearn.model_selection import train_test_split

# Model features
X = customer_features[['total_orders', 'pizza_variety', 'orders_with_exclusions', 
                      'orders_with_extras', 'avg_days_between_orders']]
y = customer_features['is_loyal']

print("Selected features:")
print(X.columns.tolist())
print(f"Shape of X: {X.shape}")
print(f"Shape of y: {y.shape}")

# Split the data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training data: {X_train.shape[0]} samples")
print(f"Test data: {X_test.shape[0]} samples")

#### Data Split and Model Preparation

Here we prepare the data for training the logistic regression model that will help us segment customers.

We selected **5 key features** for each customer: `total_orders`, `pizza_variety`, `orders_with_exclusions`, `orders_with_extras`, and `avg_days_between_orders`. These are the variables the model uses to learn to distinguish loyal customers from occasional ones.

The feature matrix `X` has shape **(557, 5)**: **557 customers** with **5 features** each.

The target `y` (the `is_loyal` flag) has shape **(557,)**, which confirms that we have one loyalty label for each of the 557 customers.

Finally, we split the data into training and test sets:
*   **445 samples** are used to **train** the model. From these, the model learns the patterns.
*   **112 samples** are held out to **test** how well the model performs on data it has never seen. This is crucial to make sure the model does not simply memorize the training data but generalizes to new customers.

The next step is to scale these features and train the model.
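Because loyal customers are rare, a purely random split can leave very few of them in the test set. One option is a stratified split, which keeps the class proportions roughly equal in both partitions. The sketch below uses hypothetical stand-in data (`X_demo`, `y_demo`) with the class counts reported for this dataset (539 occasional, 18 loyal); it is not the split used in the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 557 customers, 5 features, 18 loyal customers
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(557, 5))
y_demo = np.array([0] * 539 + [1] * 18)

# stratify=y_demo keeps the share of loyal customers similar in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Train: {len(y_tr)} samples, {int(y_tr.sum())} loyal")
print(f"Test:  {len(y_te)} samples, {int(y_te.sum())} loyal")
```

The partition sizes stay at 445 and 112, but the loyal customers are spread proportionally instead of landing wherever the random shuffle happens to put them.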

### Scaling and Training the Model

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Scale features: fit the scaler on the training set only, then apply it to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the model
lr_model = LogisticRegression(random_state=42)
lr_model.fit(X_train_scaled, y_train)

print("Logistic Regression model trained successfully")

### Evaluation and Predictions

In [None]:
from sklearn.metrics import accuracy_score

# Predictions
y_pred = lr_model.predict(X_test_scaled)
y_pred_proba = lr_model.predict_proba(X_test_scaled)[:, 1]

# Metrics
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.3f}")
print(f"Predicted loyal customers: {y_pred.sum()}/{len(y_pred)}")
print(f"Actual distribution (all customers): {y.value_counts().to_dict()}")
print(f"Actual distribution (test set): {y_test.value_counts().to_dict()}")

# Show coefficients, sorted by absolute value
feature_names = ['total_orders', 'pizza_variety', 'orders_with_exclusions', 
                'orders_with_extras', 'avg_days_between_orders']
coefficients = pd.DataFrame({
    'feature': feature_names,
    'coefficient': lr_model.coef_[0]
}).sort_values('coefficient', key=abs, ascending=False)

print("\nMost important coefficients:")
print(coefficients)

#### Logistic Regression Model Results

Here we analyze the results of the Logistic Regression model built to identify loyal customers.

We observe an **accuracy of 1.000**, which looks excellent at first glance. The **predicted loyal customers (1/112)** line means that the model labeled 1 of the 112 test customers as loyal; 112 is the size of the test set, not the number of loyal customers expected in it. The **actual distribution across all customers ({0: 539, 1: 18})** shows a strong class imbalance: only 18 of 557 customers (3.2%) are loyal. With so few positives, accuracy alone is not informative, because a model that labeled every customer as "not loyal" would still be right for 96.8% of them. The perfect score deserves scrutiny for a second reason: `is_loyal` is defined directly from `total_orders` and `avg_days_between_orders`, both of which are model inputs, so the model is largely recovering a rule that is already encoded in its features.

The **most important coefficients** indicate which features weigh most in the loyalty prediction:

*   **`total_orders` (0.788324)**: A higher number of total orders is strongly associated with loyalty. This is intuitive, since loyal customers order more.
*   **`orders_with_exclusions` (0.788324)** and **`orders_with_extras` (0.788324)**: Customers who customize their pizzas with exclusions or extras also show a strong positive association with loyalty, which could indicate greater engagement with the brand. Note, however, that these two coefficients match the `total_orders` coefficient to six decimal places. A regularized logistic regression produces identical coefficients when columns are identical after scaling, so it is worth checking whether these counts simply equal `total_orders` for every customer before reading them as evidence of customization.
*   **`pizza_variety` (0.596387)**: Trying a wider variety of pizzas also contributes positively to loyalty, with a somewhat smaller coefficient than the features above.
*   **`avg_days_between_orders` (-0.119452)**: This is the only negative coefficient. More days on average between orders (that is, ordering less often) lowers the probability of being loyal, which is logical because loyal customers tend to have shorter intervals between orders.

In summary, the model identifies **order volume**, **pizza customization**, and **variety of pizzas ordered** as the main drivers of loyalty, while **long intervals between orders** indicate lower loyalty. The accuracy of 1.000 must be interpreted with caution because of the class imbalance, and we need additional metrics such as `recall` or `F1-score` to judge how well the model identifies loyal customers.
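To see why accuracy alone is not enough here, the following sketch compares two sets of predictions on a hypothetical test set with the same makeup as the one discussed above (111 occasional customers, 1 loyal). The arrays and the helper `summarize` are illustrative assumptions, not output from the notebook's model.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 111 occasional customers (0) and 1 loyal customer (1)
y_true = np.array([0] * 111 + [1])

# A "model" that never predicts loyal, and one that predicts every label correctly
y_all_occasional = np.zeros_like(y_true)
y_perfect = y_true.copy()

def summarize(y_hat):
    """Return accuracy, precision, recall and F1 for the loyal class."""
    return {
        "accuracy": accuracy_score(y_true, y_hat),
        "precision": precision_score(y_true, y_hat, zero_division=0),
        "recall": recall_score(y_true, y_hat, zero_division=0),
        "f1": f1_score(y_true, y_hat, zero_division=0),
    }

baseline_metrics = summarize(y_all_occasional)
perfect_metrics = summarize(y_perfect)

for name, metrics in [("always occasional", baseline_metrics), ("perfect", perfect_metrics)]:
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

The always-occasional baseline still reaches about 99% accuracy while its recall for loyal customers is zero, which is why recall and F1 separate the two cases and accuracy barely does.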

### Model Evaluation Plots

In [None]:
# Gráficas de evaluación
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Gráfica 1: Matriz de confusión
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0,0])
axes[0,0].set_title('Matriz de Confusión')
axes[0,0].set_xlabel('Predicción')
axes[0,0].set_ylabel('Real')

# Gráfica 2: Distribución de probabilidades
axes[0,1].hist(y_pred_proba[y_test == 0], bins=20, alpha=0.7, label='Ocasionales', color='red')
axes[0,1].hist(y_pred_proba[y_test == 1], bins=20, alpha=0.7, label='Leales', color='blue')
axes[0,1].set_title('Distribución de Probabilidades Predichas')
axes[0,1].set_xlabel('Probabilidad de ser Leal')
axes[0,1].set_ylabel('Frecuencia')
axes[0,1].legend()

# Gráfica 3: Importancia de features
coefficients_abs = coefficients.copy()
coefficients_abs['abs_coefficient'] = abs(coefficients_abs['coefficient'])
coefficients_abs = coefficients_abs.sort_values('abs_coefficient', ascending=True)
axes[1,0].barh(coefficients_abs['feature'], coefficients_abs['coefficient'])
axes[1,0].set_title('Coeficientes del Modelo')
axes[1,0].set_xlabel('Valor del Coeficiente')

# Gráfica 4: Comparación real vs predicho
comparison = pd.DataFrame({
    'Real': y_test.values,
    'Predicho': y_pred
})
comparison_counts = comparison.groupby(['Real', 'Predicho']).size().unstack(fill_value=0)
sns.heatmap(comparison_counts, annot=True, fmt='d', cmap='Greens', ax=axes[1,1])
axes[1,1].set_title('Comparación Real vs Predicho')
axes[1,1].set_xlabel('Predicción')
axes[1,1].set_ylabel('Real')

plt.tight_layout()
plt.show()

#### Analysis of the Logistic Regression Results

Here we take a detailed look at how the Logistic Regression model performed at segmenting customers into loyal and occasional.

In the **confusion matrix** (top left), the model correctly predicted 111 customers as "occasional" (class 0) and 1 customer as "loyal" (class 1), with no false positives and no false negatives. Every test customer was therefore classified correctly (112 out of 112), including the single loyal customer in the test set. The limitation is the sample rather than an observed bias: with only one positive example in the test set, this result says very little about how reliably the model detects loyal customers in general, which is a common problem with imbalanced datasets.

The **distribution of predicted probabilities** (top right) shows that the vast majority of predicted loyalty probabilities cluster around 0 (indicating "occasional"), with only a small bar at 1.0 (indicating "loyal"). This mirrors the makeup of the test set and shows that the model separates the two groups with high confidence.

The **model coefficients** (bottom left) show which features weigh most in the loyalty prediction:
*   **`orders_with_extras`**, **`orders_with_exclusions`** and **`total_orders`** have the highest positive coefficients (about 0.79). This suggests that customers who order more pizzas, or who customize their orders with extras or exclusions, are much more likely to be classified as loyal.
*   **`pizza_variety`** also has a sizeable positive coefficient (about 0.6), indicating that trying different types of pizza is associated with loyalty.
*   **`avg_days_between_orders`** has a negative coefficient (about -0.1), which is logical: the more days pass between orders, the less likely the customer is to be loyal.

Finally, the **actual vs predicted comparison** (bottom right) is another view of the confusion matrix, drawn with a green color scale. It confirms the same results: 111 customers correctly classified as 0 and 1 customer correctly classified as 1, with no misclassifications.

In summary, the model classifies this test set perfectly, but because of the class imbalance and the single loyal customer in the test set, its ability to detect loyal customers remains largely unverified. Features related to order volume and customization carry the largest coefficients.
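One way to get a more stable estimate for the minority class is stratified cross-validation scored on recall, optionally with `class_weight="balanced"`. The sketch below is an illustration on synthetic data: `X_demo`, `y_demo` and the loyalty rule used to generate them are assumptions, not the notebook's `customer_features`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: order counts and variety for 557 hypothetical customers
rng = np.random.default_rng(0)
n_customers = 557
total_orders = rng.poisson(3, size=n_customers) + 1
variety = rng.integers(1, 8, size=n_customers)
X_demo = np.column_stack([total_orders, variety]).astype(float)
y_demo = (total_orders >= 7).astype(int)  # hypothetical minority "loyal" class

# The pipeline refits the scaler inside each fold, so no test-fold statistics leak in
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", random_state=42),
)

# Stratified folds keep some loyal customers in every validation fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
recall_scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="recall")

print("Recall per fold:", np.round(recall_scores, 3))
print("Mean recall:", round(float(recall_scores.mean()), 3))
```

Averaging recall over several folds uses every loyal customer for validation once, instead of relying on the one or two that happen to fall into a single test split.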

### Prediction Function and Use Cases

In [None]:
# Función para predecir si un cliente es leal
def predict_customer_loyalty(customer_id):
    if customer_id not in customer_features['customer_id'].values:
        return "Cliente no encontrado"
    
    customer_data = customer_features[customer_features['customer_id'] == customer_id]
    features = customer_data[['total_orders', 'pizza_variety', 'orders_with_exclusions', 
                             'orders_with_extras', 'avg_days_between_orders']]
    
    features_scaled = scaler.transform(features)
    prediction = lr_model.predict(features_scaled)[0]
    probability = lr_model.predict_proba(features_scaled)[0][1]
    
    return {
        'customer_id': customer_id,
        'is_loyal': bool(prediction),
        'loyalty_probability': probability,
        'total_orders': customer_data['total_orders'].iloc[0],
        'avg_days_between_orders': customer_data['avg_days_between_orders'].iloc[0]
    }

# Ejemplo de predicción
sample_customer = customer_features['customer_id'].iloc[0]
result = predict_customer_loyalty(sample_customer)
print(f"Ejemplo de predicción para cliente {sample_customer}:")
print(result)

print("\nModelo 1 completado exitosamente!")

#### Logistic Regression Prediction Example

Here we have a concrete example of how Model 1 (Logistic Regression) makes a prediction for an individual customer.

For the **customer with `customer_id: 1`**, the model returns **`is_loyal: False`**, meaning it classifies this customer as occasional. The associated **`loyalty_probability`** is extremely low: **0.0006205164395327097** (about 0.06%).

This result is consistent with the customer's profile. Customer 1 has placed a single order, far below the 10-order loyalty threshold, so a probability close to zero is the expected outcome for this customer rather than evidence of bias on its own.

Finally, the message "Modelo 1 completado exitosamente!" confirms that this part of the notebook ran without errors.
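For a binary logistic regression, the value returned by `predict_proba` is the sigmoid of a linear score (the log-odds) computed from the scaled features. The short sketch below converts the reported probability back to log-odds to show how far it sits from the 0.5 decision boundary; the helper names `sigmoid` and `logit` are introduced here for illustration only.

```python
import math

def sigmoid(z):
    """Map a log-odds score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the sigmoid: map a probability to log-odds."""
    return math.log(p / (1.0 - p))

p_reported = 0.0006205164395327097  # loyalty_probability reported for customer 1
z = logit(p_reported)

print(f"log-odds: {z:.2f}")
print(f"back to probability: {sigmoid(z):.6f}")
print(f"decision boundary (p = 0.5) corresponds to log-odds {logit(0.5):.1f}")
```

A strongly negative score like this one means the customer would need a much larger weighted feature total before the prediction flipped to loyal.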

## Model 2: Linear Regression - Expense Prediction

### Price Configuration and Basic Features

In [None]:
# =============================================================================
# MODEL 2: LINEAR REGRESSION - EXPENSE PREDICTION
# =============================================================================

print("MODEL 2: Expense Prediction per Order")

# Assumed base prices (in practice they would come from pizza_names or a price table)
pizza_prices = {1: 12, 2: 10, 3: 15, 4: 14, 5: 13, 6: 11, 7: 9}  # base prices
delivered_orders['base_price'] = delivered_orders['pizza_id'].map(pizza_prices)

# Flag orders with extras/exclusions. Missing values and placeholder strings
# ('', 'null', 'nan', 'none') count as "no customization"; a plain `!= ''`
# comparison would flag NaN as customized.
def has_customization(column):
    cleaned = column.fillna('').astype(str).str.strip().str.lower()
    return (~cleaned.isin(['', 'null', 'nan', 'none'])).astype(int)

delivered_orders['has_extras'] = has_customization(delivered_orders['extras'])
delivered_orders['has_exclusions'] = has_customization(delivered_orders['exclusions'])

# Final price = base + $2 if the order has extras - $1 if it has exclusions
delivered_orders['final_price'] = (
    delivered_orders['base_price']
    + delivered_orders['has_extras'] * 2
    - delivered_orders['has_exclusions'] * 1
)

print("Prices configured:")
print(f"Base prices per pizza: {pizza_prices}")
print(f"Average final price: ${delivered_orders['final_price'].mean():.2f}")
print(f"Price range: ${delivered_orders['final_price'].min():.2f} - ${delivered_orders['final_price'].max():.2f}")
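The pricing rule can be checked by hand on a few rows. The toy table below is hypothetical and also shows a pitfall to watch for: comparing a column against `''` treats missing values (`NaN`) and placeholder strings such as `'null'` as if the order had extras. The helper `flag_customization` is an illustrative name, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical orders covering the cases: real value, NaN, empty string, 'null'
toy = pd.DataFrame({
    "pizza_id": [3, 3, 1, 2],
    "extras": ["1", np.nan, "", "null"],
    "exclusions": [np.nan, "4", "", "2, 6"],
})
base_prices = {1: 12, 2: 10, 3: 15}

# Naive flag: NaN != '' evaluates to True, so missing extras count as extras
naive_has_extras = (toy["extras"] != "").astype(int)

def flag_customization(column):
    """1 if the cell holds a real value, 0 for NaN or placeholder strings."""
    cleaned = column.fillna("").astype(str).str.strip().str.lower()
    return (~cleaned.isin(["", "null", "nan", "none"])).astype(int)

toy["has_extras"] = flag_customization(toy["extras"])
toy["has_exclusions"] = flag_customization(toy["exclusions"])

# Final price = base + $2 for extras - $1 for exclusions
toy["final_price"] = (
    toy["pizza_id"].map(base_prices) + 2 * toy["has_extras"] - toy["has_exclusions"]
)

print("Naive has_extras: ", naive_has_extras.tolist())
print("Robust has_extras:", toy["has_extras"].tolist())
print(toy[["pizza_id", "has_extras", "has_exclusions", "final_price"]])
```

For example, the first row is a $15 pizza with extras and no exclusions, so its final price is 15 + 2 = 17.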

### K-Means Clustering of the Data

In [None]:
print("Loading data...")
customer_orders = pd.read_csv('customer_orders.csv')
pizza_names = pd.read_csv('pizza_names.csv')
pizza_recipes = pd.read_csv('pizza_recipes.csv')
pizza_toppings = pd.read_csv('pizza_toppings.csv')
runners = pd.read_csv('runners.csv')
runner_orders = pd.read_csv('runner_orders.csv')

# Clean and prepare the base data
print("Cleaning data...")

# Convert dates (unparseable pickup times become NaT)
customer_orders['order_date'] = pd.to_datetime(customer_orders['order_date'])
runner_orders['pickup_time'] = pd.to_datetime(runner_orders['pickup_time'], errors='coerce')

# Clean cancellation (binary: 0 = delivered, 1 = cancelled)
runner_orders['is_cancelled'] = runner_orders['cancellation'].fillna('').apply(
    lambda x: 1 if str(x).strip().lower() not in ['', 'null', 'none', 'nan'] else 0
)

# Clean distance and duration: extract the numeric part so that unit variants
# such as 'km', 'mins' or 'minutes' do not break the conversion
number_pattern = r'(\d+(?:\.\d+)?)'
runner_orders['distance_km'] = pd.to_numeric(
    runner_orders['distance'].astype(str).str.extract(number_pattern)[0], errors='coerce')
runner_orders['duration_min'] = pd.to_numeric(
    runner_orders['duration'].astype(str).str.extract(number_pattern)[0], errors='coerce')

# Build the main combined dataset
print("Merging datasets...")
main_data = customer_orders.merge(pizza_names, on='pizza_id', how='left')
main_data = main_data.merge(runner_orders[['order_id', 'is_cancelled', 'distance_km', 'duration_min']], 
                           on='order_id', how='left')

# Keep only delivered orders for the analysis
delivered_orders = main_data[main_data['is_cancelled'] == 0].copy()

print(f"Data loaded: {len(main_data)} total orders, {len(delivered_orders)} delivered")
print(f"Date range: {delivered_orders['order_date'].min()} to {delivered_orders['order_date'].max()}")

In [None]:
delivered_orders['day_of_week'] = delivered_orders['order_date'].dt.dayofweek
delivered_orders['hour'] = delivered_orders['order_date'].dt.hour
delivered_orders['is_weekend'] = (delivered_orders['day_of_week'] >= 5).astype(int)
delivered_orders['month'] = delivered_orders['order_date'].dt.month
delivered_orders['is_lunch'] = ((delivered_orders['hour'] >= 11) & (delivered_orders['hour'] <= 14)).astype(int)
delivered_orders['is_dinner'] = ((delivered_orders['hour'] >= 18) & (delivered_orders['hour'] <= 21)).astype(int)

print("Temporal features created.")

In [None]:
print("MODEL 2: Expense Prediction per Order")

# Assumed base prices (in practice they would come from pizza_names or a price table)
pizza_prices = {1: 12, 2: 10, 3: 15, 4: 14, 5: 13, 6: 11, 7: 9}  # base prices
delivered_orders['base_price'] = delivered_orders['pizza_id'].map(pizza_prices)

# Flag orders with extras/exclusions. After read_csv, empty cells and 'null'
# arrive as NaN, and `NaN != ''` is True, so test for real values explicitly.
def has_customization(column):
    cleaned = column.fillna('').astype(str).str.strip().str.lower()
    return (~cleaned.isin(['', 'null', 'nan', 'none'])).astype(int)

delivered_orders['has_extras'] = has_customization(delivered_orders['extras'])
delivered_orders['has_exclusions'] = has_customization(delivered_orders['exclusions'])

# Final price = base + $2 if the order has extras - $1 if it has exclusions
delivered_orders['final_price'] = (
    delivered_orders['base_price']
    + delivered_orders['has_extras'] * 2
    - delivered_orders['has_exclusions'] * 1
)

print("Prices configured:")
print(f"Base prices per pizza: {pizza_prices}")
print(f"Average final price: ${delivered_orders['final_price'].mean():.2f}")
print(f"Price range: ${delivered_orders['final_price'].min():.2f} - ${delivered_orders['final_price'].max():.2f}")

In [None]:
numeric_columns = ["customer_id", "is_cancelled", "distance_km", "duration_min", "base_price",
                    "has_extras", "has_exclusions", "final_price","day_of_week",
                    "hour", "is_weekend", "month", "is_lunch", "is_dinner"]

kmeans_data = delivered_orders[numeric_columns]

In [None]:
kmeans_data = kmeans_data.fillna(kmeans_data.mean())

In [None]:
kmeans_data = kmeans_data.groupby("customer_id").agg(
    {"is_cancelled" : "sum", "distance_km" : "mean", "duration_min" : "mean", "base_price" : "mean",
     "has_extras" : "sum", "has_exclusions" : "sum", "final_price" : "mean", "day_of_week" : "mean",
     "hour" : "mean", "is_weekend" : "sum", "month" : "mean", "is_lunch" : "sum", "is_dinner" : "sum"}
).reset_index()
kmeans_data

In [None]:
standard_scaler = StandardScaler()
kmeans_standar = standard_scaler.fit_transform(kmeans_data.drop(columns=["customer_id"]))

In [None]:
fig = go.Figure()
k_values = list(range(1,11))
elbow = []
for k in k_values:
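    # Fixed seed and explicit n_init so the inertia curve is reproducible between runs
    kmeans_model = KMeans(n_clusters=k, n_init=10, random_state=42)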
    kmeans_model.fit(kmeans_standar)
    elbow.append(kmeans_model.inertia_)

fig.add_trace(
    go.Scatter(
        x = k_values,
        y =elbow
    )
)
fig.update_layout(
    title = "Elbow method",
    xaxis = dict(
        title = "K value"
    ),
    yaxis = dict(
        title = "Inertia"
    )
)

fig.show()
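The elbow in an inertia curve can be hard to read by eye. A common complement is the silhouette score, which is highest when clusters are compact and well separated. The sketch below is an illustration on synthetic points with three obvious groups; `points` and `best_k` are hypothetical names, and the data is not the notebook's `kmeans_standar` matrix.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data: three well-separated groups of 50 points each
rng = np.random.default_rng(0)
centers = np.array([[-10.0, -10.0], [0.0, 0.0], [10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])
points_scaled = StandardScaler().fit_transform(points)

# The silhouette score needs at least 2 clusters, so start at k = 2
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(points_scaled)
    scores[k] = silhouette_score(points_scaled, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"Best k by silhouette: {best_k}")
```

Running the same loop on the scaled customer matrix would give a second opinion on whether the chosen number of clusters is supported by the data.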

In [None]:
k = 3
kmeans_model = KMeans(n_clusters=k, n_init=10, random_state=42)  # fixed seed for reproducible cluster labels
kmeans_model.fit(kmeans_standar)

kmeans_data["cluster"] = kmeans_model.labels_
print(f"Clusters creados: {k}")

kmeans_data['cluster'].value_counts()

In [None]:
pca = PCA(n_components=2)
kmeans_pca = pca.fit_transform(kmeans_standar)

fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x = kmeans_pca[:,0],
        y = kmeans_pca[:, 1],
        mode = "markers",
        marker = dict(
            color = kmeans_data['cluster'],
            colorscale = "Viridis",
            showscale = False,
        ),
    )
)

fig.update_layout(
    title = "Clusters de clientes con PCA",
    xaxis = dict(
        title = "PCA 1"
    ),
    yaxis = dict(
        title = "PCA 2"
    )
)

fig.show()

In [None]:
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR


def apply_models(data, min_samples=10):
    """Fit three regressors per cluster and return R2 and RMSE on a held-out split."""
    # Factories so that each cluster gets freshly initialized models
    model_factories = {
        "RandomForest": lambda: RandomForestRegressor(n_estimators=100, random_state=42, max_depth=10),
        "LinearRegression": lambda: LinearRegression(),
        "SVM": lambda: SVR(kernel="rbf"),
    }
    results = {}

    for cluster_id in sorted(data["cluster"].unique()):
        # Subset data by cluster
        cluster_data = data[data["cluster"] == cluster_id]

        # Skip clusters too small for a meaningful train/test split
        # (min_samples is an arbitrary guard; adjust as needed)
        if len(cluster_data) < min_samples:
            print(f"Cluster {cluster_id}: only {len(cluster_data)} customers, skipped")
            continue

        # Features (exclude target + id + cluster)
        X = cluster_data.drop(columns=["customer_id", "final_price", "cluster"])
        y = cluster_data["final_price"]

        # Train/test split
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, random_state=42
        )

        cluster_results = {}
        for name, make_model in model_factories.items():
            model = make_model()
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
            cluster_results[f"{name}_R2"] = r2_score(y_test, y_pred)
            cluster_results[f"{name}_RMSE"] = np.sqrt(mean_squared_error(y_test, y_pred))

        # Save results per cluster
        results[f"Cluster {cluster_id}"] = cluster_results

    return results


results = apply_models(kmeans_data)

# Print one line per metric instead of the raw nested dictionary
for cluster, metrics in results.items():
    print(f"\n{cluster}:")
    for model, score in metrics.items():
        print(f"  {model}: {score:.4f}")
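Support vector regression with an RBF kernel is sensitive to feature scale, and the per-cluster features above mix sums, means and prices with different ranges. A possible refinement is to wrap the SVR in a pipeline with a scaler. The sketch below uses synthetic data and hypothetical names (`X_demo`, `svr_scaled`); it only illustrates the pattern and makes no claim about the results on the notebook's clusters.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic features on very different scales (e.g. a distance and a share)
rng = np.random.default_rng(1)
X_demo = np.column_stack([rng.uniform(0, 30, 200), rng.uniform(0, 1, 200)])
y_demo = 10 + 0.1 * X_demo[:, 0] + 3 * X_demo[:, 1] + rng.normal(scale=0.2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Unscaled SVR vs SVR inside a pipeline that standardizes the features first
svr_raw = SVR(kernel="rbf").fit(X_tr, y_tr)
svr_scaled = make_pipeline(StandardScaler(), SVR(kernel="rbf")).fit(X_tr, y_tr)

preds_raw = svr_raw.predict(X_te)
preds_scaled = svr_scaled.predict(X_te)

rmse_raw = float(np.sqrt(mean_squared_error(y_te, preds_raw)))
rmse_scaled = float(np.sqrt(mean_squared_error(y_te, preds_scaled)))
print(f"RMSE without scaling: {rmse_raw:.3f}")
print(f"RMSE with scaling:    {rmse_scaled:.3f}")
```

Because the scaler lives inside the pipeline, it is fitted on the training portion only and reapplied automatically at prediction time.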




### Creating Temporal Features

In [None]:
# Features temporales
delivered_orders['day_of_week'] = delivered_orders['order_date'].dt.dayofweek
delivered_orders['hour'] = delivered_orders['order_date'].dt.hour
delivered_orders['is_weekend'] = (delivered_orders['day_of_week'] >= 5).astype(int)
delivered_orders['month'] = delivered_orders['order_date'].dt.month
delivered_orders['is_lunch'] = ((delivered_orders['hour'] >= 11) & (delivered_orders['hour'] <= 14)).astype(int)
delivered_orders['is_dinner'] = ((delivered_orders['hour'] >= 18) & (delivered_orders['hour'] <= 21)).astype(int)

print("Features temporales creadas:")
print(delivered_orders[['order_date', 'day_of_week', 'hour', 'is_weekend', 'month', 'is_lunch', 'is_dinner']].head())
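A quick check on a few hand-picked timestamps helps confirm the conventions used above: pandas numbers weekdays from Monday = 0, so values 5 and 6 are Saturday and Sunday. The toy frame `demo` below is hypothetical and only mirrors the column definitions of the cell above.

```python
import pandas as pd

# Three hypothetical orders: Friday lunch, Saturday dinner, Monday morning
demo = pd.DataFrame({
    "order_date": pd.to_datetime([
        "2025-08-22 12:30:00",
        "2025-08-23 19:05:00",
        "2025-08-25 09:00:00",
    ])
})

demo["day_of_week"] = demo["order_date"].dt.dayofweek          # Monday = 0 ... Sunday = 6
demo["hour"] = demo["order_date"].dt.hour
demo["is_weekend"] = (demo["day_of_week"] >= 5).astype(int)    # Saturday or Sunday
demo["is_lunch"] = demo["hour"].between(11, 14).astype(int)    # 11:00 to 14:59
demo["is_dinner"] = demo["hour"].between(18, 21).astype(int)   # 18:00 to 21:59

print(demo)
```

Note that the windows are defined on the hour component, so an order at 14:59 still counts as lunch while one at 15:00 does not.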

### Customer Statistics

In [None]:
# Features del cliente (historial)
customer_stats = delivered_orders.groupby('customer_id').agg({
    'final_price': ['mean', 'std', 'count'],
    'has_extras': 'mean',
    'has_exclusions': 'mean',
    'pizza_id': 'nunique'
}).reset_index()

# Aplanar columnas
customer_stats.columns = ['customer_id', 'avg_order_value', 'std_order_value', 
                         'total_orders', 'avg_has_extras', 'avg_has_exclusions', 'pizza_variety']

# Combinar con datos principales
expense_data = delivered_orders.merge(customer_stats, on='customer_id', how='left')

print("Estadísticas del cliente calculadas:")
print(customer_stats.head())
print(f"Clientes únicos: {len(customer_stats)}")
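One caveat with these statistics: `avg_order_value` is computed over all of a customer's delivered orders, including the order whose `final_price` the model later tries to predict, so the feature contains part of the target. A possible alternative, sketched below on hypothetical data, is to use only the orders placed before the current one. The column name `avg_order_value_prior` is an assumption introduced for this example.

```python
import pandas as pd

# Hypothetical order history for two customers
toy = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "order_date": pd.to_datetime(
        ["2025-01-01", "2025-01-08", "2025-01-15", "2025-01-03", "2025-01-10"]
    ),
    "final_price": [10.0, 12.0, 14.0, 15.0, 17.0],
}).sort_values(["customer_id", "order_date"])

# Average over all orders: includes the current order's own price
toy["avg_order_value_all"] = toy.groupby("customer_id")["final_price"].transform("mean")

# Average over earlier orders only: shift by one, then take the running mean
toy["avg_order_value_prior"] = toy.groupby("customer_id")["final_price"].transform(
    lambda s: s.shift().expanding().mean()
)

print(toy)
```

The first order of each customer has no history, so its prior average is missing and needs an explicit fill strategy (for example, the global mean of the training set).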

### Exploratory Analysis and Plots

In [None]:
# Análisis exploratorio
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# Gráfica 1: Distribución de precios
axes[0,0].hist(expense_data['final_price'], bins=30, alpha=0.7, color='skyblue')
axes[0,0].set_title('Distribución de Precios Finales')
axes[0,0].set_xlabel('Precio Final ($)')
axes[0,0].set_ylabel('Frecuencia')

# Plot 2: average price by day of the week
# Reindex to all 7 days so the bars line up with day_names even if a day has no orders
day_names = ['Lun', 'Mar', 'Mié', 'Jue', 'Vie', 'Sáb', 'Dom']
price_by_day = expense_data.groupby('day_of_week')['final_price'].mean().reindex(range(7))
axes[0,1].bar(day_names, price_by_day.values, color='lightgreen')
axes[0,1].set_title('Precio Promedio por Día de la Semana')
axes[0,1].set_ylabel('Precio Promedio ($)')

# Gráfica 3: Precio por hora
price_by_hour = expense_data.groupby('hour')['final_price'].mean()
axes[0,2].plot(price_by_hour.index, price_by_hour.values, marker='o', color='orange')
axes[0,2].set_title('Precio Promedio por Hora del Día')
axes[0,2].set_xlabel('Hora')
axes[0,2].set_ylabel('Precio Promedio ($)')

# Gráfica 4: Precio por tipo de pizza
price_by_pizza = expense_data.groupby('pizza_name')['final_price'].mean().sort_values(ascending=False)
axes[1,0].barh(price_by_pizza.index, price_by_pizza.values, color='lightcoral')
axes[1,0].set_title('Precio Promedio por Tipo de Pizza')
axes[1,0].set_xlabel('Precio Promedio ($)')

# Plot 5: effect of extras on price
# Reindex to both groups so the two labels always match the two bars
extras_effect = expense_data.groupby('has_extras')['final_price'].mean().reindex([0, 1])
axes[1,1].bar(['Sin Extras', 'Con Extras'], extras_effect.values, color=['lightblue', 'darkblue'])
axes[1,1].set_title('Efecto de Extras en el Precio')
axes[1,1].set_ylabel('Precio Promedio ($)')

# Gráfica 6: Scatter plot: Precio vs Total de pedidos del cliente
axes[1,2].scatter(expense_data['total_orders'], expense_data['final_price'], alpha=0.5, color='purple')
axes[1,2].set_title('Precio vs Total de Pedidos del Cliente')
axes[1,2].set_xlabel('Total de Pedidos del Cliente')
axes[1,2].set_ylabel('Precio del Pedido ($)')

plt.tight_layout()
plt.show()

print("Análisis exploratorio completado")

### Preparing Features for the Model

In [None]:
# Features para el modelo
X_expense = expense_data[['pizza_id', 'day_of_week', 'hour', 'is_weekend', 
                         'has_extras', 'has_exclusions', 'avg_order_value', 
                         'total_orders', 'avg_has_extras', 'pizza_variety']]
y_expense = expense_data['final_price']

print("Features seleccionadas para el modelo:")
print(X_expense.columns.tolist())
print(f"Shape de X: {X_expense.shape}")
print(f"Shape de y: {y_expense.shape}")

# Verificar valores faltantes
print(f"Valores faltantes en X: {X_expense.isnull().sum().sum()}")
print(f"Valores faltantes en y: {y_expense.isnull().sum()}")

# Llenar valores faltantes si los hay
X_expense = X_expense.fillna(X_expense.mean())

### Data Split and Scaling

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the data
X_train_exp, X_test_exp, y_train_exp, y_test_exp = train_test_split(
    X_expense, y_expense, test_size=0.2, random_state=42)

print(f"Datos de entrenamiento: {X_train_exp.shape[0]} muestras")
print(f"Datos de prueba: {X_test_exp.shape[0]} muestras")

# Escalar features
scaler_exp = StandardScaler()
X_train_exp_scaled = scaler_exp.fit_transform(X_train_exp)
X_test_exp_scaled = scaler_exp.transform(X_test_exp)

print("Datos escalados exitosamente")

### Model Training

In [None]:
from sklearn.linear_model import LinearRegression

# Train the model
lr_expense = LinearRegression()
lr_expense.fit(X_train_exp_scaled, y_train_exp)

print("Modelo Linear Regression entrenado exitosamente")
print(f"Coeficiente de determinación (R²) en entrenamiento: {lr_expense.score(X_train_exp_scaled, y_train_exp):.3f}")

### Evaluation and Predictions

In [None]:
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Predictions
y_pred_exp = lr_expense.predict(X_test_exp_scaled)
mse = mean_squared_error(y_test_exp, y_pred_exp)
r2 = r2_score(y_test_exp, y_pred_exp)
rmse = np.sqrt(mse)

print(f"R² Score: {r2:.3f}")
print(f"RMSE: ${rmse:.2f}")
print(f"Precio promedio real: ${y_test_exp.mean():.2f}")
print(f"Precio promedio predicho: ${y_pred_exp.mean():.2f}")

# Mostrar importancia de features
feature_names_exp = ['pizza_id', 'day_of_week', 'hour', 'is_weekend', 
                    'has_extras', 'has_exclusions', 'avg_order_value', 
                    'total_orders', 'avg_has_extras', 'pizza_variety']
importance = pd.DataFrame({
    'feature': feature_names_exp,
    'coefficient': lr_expense.coef_
}).sort_values('coefficient', key=abs, ascending=False)

print("\nFeatures más importantes (por coeficiente):")
print(importance)
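To make the two reported metrics concrete, the sketch below computes RMSE and R² by hand on four hypothetical prices and checks them against scikit-learn. RMSE is the square root of the mean squared residual (in dollars), and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The arrays are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted order prices
y_true = np.array([12.0, 15.0, 10.0, 17.0])
y_hat = np.array([11.0, 15.0, 11.0, 16.0])

residuals = y_true - y_hat

# RMSE: typical size of a prediction error, in the same unit as the price
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R²: share of the variance in y_true that the predictions account for
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(f"RMSE (manual):  {rmse_manual:.4f}")
print(f"RMSE (sklearn): {np.sqrt(mean_squared_error(y_true, y_hat)):.4f}")
print(f"R2 (manual):    {r2_manual:.4f}")
print(f"R2 (sklearn):   {r2_score(y_true, y_hat):.4f}")
```

An R² close to 1 on the real data should be read together with how `final_price` was constructed, since several model features enter its formula directly.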

### Model Evaluation Plots

In [None]:
# Gráficas de evaluación
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Gráfica 1: Predicciones vs Valores reales
axes[0,0].scatter(y_test_exp, y_pred_exp, alpha=0.5, color='blue')
axes[0,0].plot([y_test_exp.min(), y_test_exp.max()], [y_test_exp.min(), y_test_exp.max()], 'r--', lw=2)
axes[0,0].set_title('Predicciones vs Valores Reales')
axes[0,0].set_xlabel('Precio Real ($)')
axes[0,0].set_ylabel('Precio Predicho ($)')

# Gráfica 2: Residuos
residuals = y_test_exp - y_pred_exp
axes[0,1].scatter(y_pred_exp, residuals, alpha=0.5, color='green')
axes[0,1].axhline(y=0, color='r', linestyle='--')
axes[0,1].set_title('Residuos vs Predicciones')
axes[0,1].set_xlabel('Precio Predicho ($)')
axes[0,1].set_ylabel('Residuos ($)')

# Gráfica 3: Distribución de errores
axes[1,0].hist(residuals, bins=30, alpha=0.7, color='orange')
axes[1,0].set_title('Distribución de Errores de Predicción')
axes[1,0].set_xlabel('Error ($)')
axes[1,0].set_ylabel('Frecuencia')

# Gráfica 4: Importancia de features
importance_abs = importance.copy()
importance_abs['abs_coefficient'] = abs(importance_abs['coefficient'])
importance_abs = importance_abs.sort_values('abs_coefficient', ascending=True)
axes[1,1].barh(importance_abs['feature'], importance_abs['coefficient'])
axes[1,1].set_title('Coeficientes del Modelo')
axes[1,1].set_xlabel('Valor del Coeficiente')

plt.tight_layout()
plt.show()

### Prediction Function and Use Cases

In [None]:
# Function to predict the price of an order
def predict_order_price(pizza_id, day_of_week, hour, has_extras=False, has_exclusions=False, 
                       customer_avg_order_value=12, customer_total_orders=5, 
                       customer_avg_has_extras=0.2, customer_pizza_variety=3):
    """
    Predict the price of an order from the order and customer characteristics.
    """
    is_weekend = 1 if day_of_week >= 5 else 0
    
    # One-row DataFrame with the training column names, so the scaler receives
    # the features in the same order (and with the same names) it was fitted on
    features = pd.DataFrame(
        [[pizza_id, day_of_week, hour, is_weekend,
          int(has_extras), int(has_exclusions), customer_avg_order_value,
          customer_total_orders, customer_avg_has_extras, customer_pizza_variety]],
        columns=X_train_exp.columns)
    
    features_scaled = scaler_exp.transform(features)
    predicted_price = lr_expense.predict(features_scaled)[0]
    
    # Prices cannot be negative; round to cents
    return max(0, round(float(predicted_price), 2))

# Ejemplos de predicción
print("Ejemplos de predicción de precios:")
print(f"Pizza Margherita, Lunes 12:00, sin extras: ${predict_order_price(1, 0, 12)}")
print(f"Pizza Meat Lovers, Viernes 19:00, con extras: ${predict_order_price(3, 4, 19, True)}")
print(f"Pizza Vegetarian, Sábado 20:00, con extras y exclusiones: ${predict_order_price(2, 5, 20, True, True)}")

# Análisis de casos de uso
print("\nCasos de uso del modelo:")
print("1. Estimar LTV (Lifetime Value) de clientes")
print("2. Optimizar descuentos y promociones")
print("3. Planificar inventario basado en demanda esperada")
print("4. Personalizar ofertas por cliente")
print("5. Análisis de rentabilidad por horario")

print("\nModelo 2 completado exitosamente!")
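
`predict_order_price` depends on two global objects, `scaler_exp` and `lr_expense`, staying in sync. One way to reduce that coupling is to bundle the scaler and the regressor in a scikit-learn pipeline. The sketch below is only an illustration on synthetic data: `price_pipeline`, `X_toy`, `y_toy`, and the pricing rule are assumptions, not the notebook's trained model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the order features (NOT the notebook's data):
# columns are pizza_id, hour, has_extras.
rng = np.random.default_rng(0)
X_toy = np.column_stack([
    rng.integers(1, 8, size=200),    # pizza_id in 1..7
    rng.integers(10, 23, size=200),  # hour of day
    rng.integers(0, 2, size=200),    # has_extras flag
]).astype(float)
# Hypothetical price rule: base 10, +2 with extras, plus small noise.
y_toy = 10.0 + 2.0 * X_toy[:, 2] + rng.normal(0, 0.1, size=200)

# The pipeline scales and predicts in one call, so a caller cannot forget
# to apply the scaler before predicting.
price_pipeline = make_pipeline(StandardScaler(), LinearRegression())
price_pipeline.fit(X_toy, y_toy)

with_extras = price_pipeline.predict(np.array([[3.0, 19.0, 1.0]]))[0]
without_extras = price_pipeline.predict(np.array([[3.0, 19.0, 0.0]]))[0]
print(f"with extras: {with_extras:.2f}, without: {without_extras:.2f}")
```

With a pipeline, a prediction function only needs one fitted object, and that object can be saved and reloaded as a unit.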

## Model 3: KNN - Pizza Recommendations

### Building the Customer-Pizza Matrix

In [None]:
# =============================================================================
# MODEL 3: KNN - PIZZA RECOMMENDATIONS
# =============================================================================

print("MODEL 3: Pizza Recommendation System")

# Build the customer-pizza matrix (implicit ratings based on order frequency)
pizza_matrix = delivered_orders.groupby(['customer_id', 'pizza_id']).size().unstack(fill_value=0)

print(f"Customer-pizza matrix created: {pizza_matrix.shape}")
print(f"Unique customers: {pizza_matrix.shape[0]}")
print(f"Unique pizzas: {pizza_matrix.shape[1]}")

# Show a sample of the matrix
print("\nSample of the customer-pizza matrix:")
print(pizza_matrix.head())

### Normalization and Pattern Analysis

In [None]:
# Normalize per customer (relative frequency: each row sums to 1)
pizza_matrix_norm = pizza_matrix.div(pizza_matrix.sum(axis=1), axis=0)

print("Normalized matrix (relative frequency per customer):")
print(pizza_matrix_norm.head())

# Order pattern analysis
customer_order_patterns = pizza_matrix.sum(axis=1)
pizza_popularity = pizza_matrix.sum(axis=0)

print(f"\nPattern statistics:")
print(f"Average orders per customer: {customer_order_patterns.mean():.2f}")
print(f"Most popular pizza: {pizza_names[pizza_names['pizza_id'] == pizza_popularity.idxmax()]['pizza_name'].iloc[0]}")
print(f"Least popular pizza: {pizza_names[pizza_names['pizza_id'] == pizza_popularity.idxmin()]['pizza_name'].iloc[0]}")
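
The two steps above (counting orders per customer and pizza, then row-normalizing) can be checked by hand on a tiny table. This is a minimal sketch with made-up orders; `toy_orders` and the customer ids are hypothetical.

```python
import pandas as pd

# Toy orders (hypothetical data, one row per pizza ordered).
toy_orders = pd.DataFrame({
    "customer_id": [101, 101, 101, 102, 102, 103],
    "pizza_id":    [1,   1,   3,   2,   3,   1],
})

# Count orders per (customer, pizza) pair, then pivot pizzas into columns.
toy_matrix = (
    toy_orders.groupby(["customer_id", "pizza_id"])
    .size()
    .unstack(fill_value=0)
)
print(toy_matrix)

# Row-normalize so each customer's row sums to 1 (relative frequency).
toy_matrix_norm = toy_matrix.div(toy_matrix.sum(axis=1), axis=0)
print(toy_matrix_norm.round(2))
```

Customer 101 ordered pizza 1 twice and pizza 3 once, so that row becomes 2/3, 0, and 1/3 after normalization. Note that normalization discards order volume: a customer with 3 orders and one with 30 can end up with identical rows.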

### Exploratory Analysis and Plots

In [None]:
# Exploratory analysis
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# Plot 1: Distribution of orders per customer
axes[0,0].hist(customer_order_patterns, bins=20, alpha=0.7, color='skyblue')
axes[0,0].set_title('Distribution of Orders per Customer')
axes[0,0].set_xlabel('Total Orders')
axes[0,0].set_ylabel('Frequency')

# Plot 2: Pizza popularity
pizza_names_dict = dict(zip(pizza_names['pizza_id'], pizza_names['pizza_name']))
pizza_popularity_named = pizza_popularity.rename(pizza_names_dict)
axes[0,1].barh(pizza_popularity_named.index, pizza_popularity_named.values, color='lightgreen')
axes[0,1].set_title('Pizza Popularity')
axes[0,1].set_xlabel('Total Orders')

# Plot 3: Heatmap of the customer-pizza matrix (first 10 customers)
sample_customers = pizza_matrix_norm.head(10)
sns.heatmap(sample_customers, annot=True, fmt='.2f', cmap='YlOrRd', ax=axes[0,2])
axes[0,2].set_title('Heatmap: Preferences per Customer (Sample)')
axes[0,2].set_xlabel('Pizza ID')
axes[0,2].set_ylabel('Customer ID')

# Plot 4: Distribution of relative frequency
all_ratings = pizza_matrix_norm.values.flatten()
all_ratings = all_ratings[all_ratings > 0]  # Positive ratings only
axes[1,0].hist(all_ratings, bins=20, alpha=0.7, color='orange')
axes[1,0].set_title('Distribution of Relative Frequency')
axes[1,0].set_xlabel('Relative Frequency')
axes[1,0].set_ylabel('Frequency')

# Plot 5: Pizza variety per customer
customer_variety = (pizza_matrix > 0).sum(axis=1)
# One bin per possible variety value (1 .. number of pizzas)
variety_bins = range(1, pizza_matrix.shape[1] + 2)
axes[1,1].hist(customer_variety, bins=variety_bins, alpha=0.7, color='purple')
axes[1,1].set_title('Distribution of Pizza Variety per Customer')
axes[1,1].set_xlabel('Number of Different Pizzas Ordered')
axes[1,1].set_ylabel('Frequency')

# Plot 6: Correlation between pizzas
pizza_corr = pizza_matrix.corr()
sns.heatmap(pizza_corr, annot=True, fmt='.2f', cmap='coolwarm', ax=axes[1,2])
axes[1,2].set_title('Correlation between Pizzas')

plt.tight_layout()
plt.show()

print("Exploratory analysis completed")

### Data Preparation for KNN

In [None]:
# Build the dataset for KNN: one row per (customer, pizza) pair with a positive rating
knn_data = []
for customer_id in pizza_matrix_norm.index:
    for pizza_id in pizza_matrix_norm.columns:
        rating = pizza_matrix_norm.loc[customer_id, pizza_id]
        if rating > 0:  # Only include pizzas the customer has ordered
            knn_data.append({
                'customer_id': customer_id,
                'pizza_id': pizza_id,
                'rating': rating
            })

knn_df = pd.DataFrame(knn_data)
print(f"KNN dataset created: {len(knn_df)} records")
print(f"Unique customers in dataset: {knn_df['customer_id'].nunique()}")
print(f"Unique pizzas in dataset: {knn_df['pizza_id'].nunique()}")

# Customer features (order pattern)
customer_patterns = pizza_matrix_norm.reset_index()
customer_features_knn = customer_patterns.drop('customer_id', axis=1)

print(f"Customer features: {customer_features_knn.shape}")

### Train/Test Split and Training

In [None]:
# For KNN, the customer's order pattern is used as the feature vector.
# X_knn must have the same number of samples as y_knn.
# Caveat: each target rating is also one component of its own feature vector,
# and a customer's rows can land in both train and test, so the scores below
# are optimistic.
X_knn = []
y_knn = []

for _, row in knn_df.iterrows():
    customer_id = row['customer_id']
    pizza_id = row['pizza_id']
    rating = row['rating']

    # Get the customer's pattern
    customer_pattern = pizza_matrix_norm.loc[customer_id].values
    X_knn.append(customer_pattern)
    y_knn.append(rating)

X_knn = np.array(X_knn)
y_knn = np.array(y_knn)

print(f"Shape of X: {X_knn.shape}")
print(f"Shape of y: {y_knn.shape}")

# Split the data
X_train_knn, X_test_knn, y_train_knn, y_test_knn = train_test_split(
    X_knn, y_knn, test_size=0.2, random_state=42)

print(f"Training data: {X_train_knn.shape[0]} samples")
print(f"Test data: {X_test_knn.shape[0]} samples")

# Train the KNN model
knn_model = KNeighborsRegressor(n_neighbors=5)
knn_model.fit(X_train_knn, y_train_knn)

print("KNN model trained successfully")

### Model Evaluation

In [None]:
# Predictions
y_pred_knn = knn_model.predict(X_test_knn)
mse_knn = mean_squared_error(y_test_knn, y_pred_knn)
r2_knn = r2_score(y_test_knn, y_pred_knn)
rmse_knn = np.sqrt(mse_knn)

print(f"R² Score: {r2_knn:.3f}")
print(f"RMSE: {rmse_knn:.3f}")
print(f"Mean actual rating: {y_test_knn.mean():.3f}")
print(f"Mean predicted rating: {y_pred_knn.mean():.3f}")

# Error analysis
errors = y_test_knn - y_pred_knn
print(f"Mean error: {errors.mean():.3f}")
print(f"Mean absolute error: {abs(errors).mean():.3f}")
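
R² and RMSE are easier to judge next to a trivial baseline that always predicts the training mean. The sketch below shows that comparison on synthetic data; the data-generating rule and all variable names are assumptions for illustration, not results from this notebook.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression problem (hypothetical): a smooth signal plus noise.
rng = np.random.default_rng(42)
X_toy = rng.uniform(0, 1, size=(300, 2))
y_toy = np.sin(4 * X_toy[:, 0]) + X_toy[:, 1] + rng.normal(0, 0.05, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42)

# Baseline: always predict the mean of the training targets.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
knn = KNeighborsRegressor(n_neighbors=5).fit(X_tr, y_tr)

rmse_baseline = np.sqrt(mean_squared_error(y_te, baseline.predict(X_te)))
rmse_knn = np.sqrt(mean_squared_error(y_te, knn.predict(X_te)))
print(f"Baseline RMSE: {rmse_baseline:.3f}")
print(f"KNN RMSE:      {rmse_knn:.3f}")
```

If a model's RMSE is not clearly below the baseline's, its predictions add little beyond the average rating.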

### Model Evaluation Plots

In [None]:
# Evaluation plots
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Plot 1: Predictions vs actual values
axes[0,0].scatter(y_test_knn, y_pred_knn, alpha=0.5, color='blue')
axes[0,0].plot([y_test_knn.min(), y_test_knn.max()], [y_test_knn.min(), y_test_knn.max()], 'r--', lw=2)
axes[0,0].set_title('Predictions vs Actual Values')
axes[0,0].set_xlabel('Actual Rating')
axes[0,0].set_ylabel('Predicted Rating')

# Plot 2: Error distribution
axes[0,1].hist(errors, bins=30, alpha=0.7, color='green')
axes[0,1].set_title('Distribution of Prediction Errors')
axes[0,1].set_xlabel('Error')
axes[0,1].set_ylabel('Frequency')

# Plot 3: Residuals vs predictions
axes[1,0].scatter(y_pred_knn, errors, alpha=0.5, color='orange')
axes[1,0].axhline(y=0, color='r', linestyle='--')
axes[1,0].set_title('Residuals vs Predictions')
axes[1,0].set_xlabel('Predicted Rating')
axes[1,0].set_ylabel('Residuals')

# Plot 4: Comparison of distributions
axes[1,1].hist(y_test_knn, bins=20, alpha=0.5, label='Actual', color='blue')
axes[1,1].hist(y_pred_knn, bins=20, alpha=0.5, label='Predicted', color='red')
axes[1,1].set_title('Rating Distribution: Actual vs Predicted')
axes[1,1].set_xlabel('Rating')
axes[1,1].set_ylabel('Frequency')
axes[1,1].legend()

plt.tight_layout()
plt.show()

### Recommendation Function

In [None]:
# Function to recommend pizzas to a customer
def recommend_pizzas(customer_id, top_n=3):
    """
    Recommend pizzas to a customer based on similar customers.
    """
    if customer_id not in pizza_matrix_norm.index:
        return "Customer not found"

    # Get the customer's pattern
    customer_pattern = pizza_matrix_norm.loc[customer_id].values

    # Get the pizzas the customer has not ordered yet
    customer_orders = pizza_matrix.loc[customer_id]
    unrated_pizzas = customer_orders[customer_orders == 0].index

    if len(unrated_pizzas) == 0:
        return "Customer has ordered every pizza"

    # For each pizza not yet ordered, predict the rating
    recommendations = []
    for pizza_id in unrated_pizzas:
        # Build a modified pattern that gives this pizza a placeholder rating
        test_pattern = customer_pattern.copy()
        pizza_idx = pizza_matrix.columns.get_loc(pizza_id)
        test_pattern[pizza_idx] = 0.5  # Assumed mid-range rating for this pizza

        # Predict the rating
        predicted_rating = knn_model.predict([test_pattern])[0]

        pizza_name = pizza_names[pizza_names['pizza_id'] == pizza_id]['pizza_name'].iloc[0]
        recommendations.append((pizza_name, predicted_rating))

    # Sort by predicted rating
    recommendations.sort(key=lambda x: x[1], reverse=True)
    return recommendations[:top_n]

# Simpler alternative function
def recommend_pizzas_simple(customer_id, top_n=3):
    """
    Simplified version: recommend by overall popularity among pizzas not yet ordered.
    """
    if customer_id not in pizza_matrix_norm.index:
        return "Customer not found"

    # Get the pizzas the customer has not ordered yet
    customer_orders = pizza_matrix.loc[customer_id]
    unrated_pizzas = customer_orders[customer_orders == 0].index

    if len(unrated_pizzas) == 0:
        return "Customer has ordered every pizza"

    # Recommend based on overall popularity
    recommendations = []
    for pizza_id in unrated_pizzas:
        popularity = pizza_matrix[pizza_id].sum()  # Total orders of this pizza
        pizza_name = pizza_names[pizza_names['pizza_id'] == pizza_id]['pizza_name'].iloc[0]
        recommendations.append((pizza_name, popularity))

    # Sort by popularity
    recommendations.sort(key=lambda x: x[1], reverse=True)
    return recommendations[:top_n]

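# Function to find similar customers
from sklearn.neighbors import NearestNeighbors

def find_similar_customers(customer_id, top_n=5):
    """
    Find customers with similar ordering patterns.
    """
    if customer_id not in pizza_matrix_norm.index:
        return "Customer not found"

    # knn_model was fitted on shuffled (customer, pizza) training rows, so its
    # neighbor indices do not correspond to positions in pizza_matrix_norm.index.
    # Fit a separate neighbor index with exactly one row per customer instead.
    nn_customers = NearestNeighbors().fit(pizza_matrix_norm.values)
    customer_pattern = pizza_matrix_norm.loc[customer_id].values.reshape(1, -1)
    k = min(top_n + 1, len(pizza_matrix_norm))
    distances, indices = nn_customers.kneighbors(customer_pattern, n_neighbors=k)

    similar_customers = []
    for idx, distance in zip(indices[0], distances[0]):
        similar_customer_id = pizza_matrix_norm.index[idx]
        if similar_customer_id == customer_id:  # Skip the customer themselves
            continue
        similar_customers.append((similar_customer_id, distance))

    return similar_customers[:top_n]

print("Recommendation functions created")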
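
The docstring of `recommend_pizzas` describes recommending "based on similar customers". One direct way to do that is to find a customer's nearest neighbors in the normalized matrix and score the pizzas the customer has not ordered by the neighbors' average preference. This is a minimal sketch under assumptions: `toy_norm`, `recommend_from_neighbors`, and the cosine metric are illustrative choices, not the notebook's method.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Toy relative-frequency matrix (hypothetical): rows = customers, columns = pizza ids.
toy_norm = pd.DataFrame(
    [[0.5, 0.5, 0.0, 0.0],
     [0.4, 0.4, 0.2, 0.0],
     [0.0, 0.0, 0.5, 0.5],
     [0.0, 0.1, 0.4, 0.5]],
    index=[101, 102, 103, 104],
    columns=[1, 2, 3, 4],
)

def recommend_from_neighbors(customer_id, matrix, n_neighbors=1, top_n=2):
    """Score not-yet-ordered pizzas by the mean preference of the nearest customers."""
    nn = NearestNeighbors(metric="cosine").fit(matrix.values)
    # Ask for one extra neighbor because the customer is normally their own nearest neighbor.
    k = min(n_neighbors + 1, len(matrix))
    _, idx = nn.kneighbors(matrix.loc[[customer_id]].values, n_neighbors=k)
    neighbor_ids = [cid for cid in matrix.index[idx[0]] if cid != customer_id][:n_neighbors]
    scores = matrix.loc[neighbor_ids].mean(axis=0)
    row = matrix.loc[customer_id]
    not_ordered = row[row == 0].index
    return scores.loc[not_ordered].sort_values(ascending=False).head(top_n)

recs = recommend_from_neighbors(101, toy_norm)
print(recs)
```

Here customer 101's closest neighbor is 102, who also orders pizza 3, so pizza 3 ranks ahead of pizza 4. Unlike the popularity-based version, the ranking depends on who the customer resembles.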

### Recommendation Examples

In [None]:
# Recommendation examples
print("Recommendation examples:")

# Customer 1
sample_customer_1 = pizza_matrix_norm.index[0]
recommendations_1 = recommend_pizzas_simple(sample_customer_1)
print(f"\nRecommendations for customer {sample_customer_1}:")
if isinstance(recommendations_1, list):
    for pizza, rating in recommendations_1:
        print(f"  - {pizza}: {rating:.0f} total orders")
else:
    print(f"  {recommendations_1}")

# Customer 2 (sixth customer, or the last one if there are fewer than six)
sample_customer_2 = pizza_matrix_norm.index[min(5, len(pizza_matrix_norm) - 1)]
recommendations_2 = recommend_pizzas_simple(sample_customer_2)
print(f"\nRecommendations for customer {sample_customer_2}:")
if isinstance(recommendations_2, list):
    for pizza, rating in recommendations_2:
        print(f"  - {pizza}: {rating:.0f} total orders")
else:
    print(f"  {recommendations_2}")

# Similar customers
similar_customers = find_similar_customers(sample_customer_1)
print(f"\nCustomers similar to customer {sample_customer_1}:")
for customer_id, distance in similar_customers:
    print(f"  - Customer {customer_id}: distance {distance:.3f}")

### Use-Case Analysis

In [None]:
# Use-case analysis
print("\nUse cases for the recommendation system:")
print("1. Cross-selling: Recommend complementary pizzas")
print("2. Personalization: Customer-specific offers")
print("3. Retention: Suggest new options to regular customers")
print("4. Acquisition: Identify 'bridge' pizzas for new customers")
print("5. Market analysis: Understand preference patterns")

# System statistics
total_recommendations = 0
customers_with_recommendations = 0

sample_ids = pizza_matrix_norm.index[:10]  # Sample of up to 10 customers
for customer_id in sample_ids:
    recommendations = recommend_pizzas_simple(customer_id)
    if isinstance(recommendations, list):
        total_recommendations += len(recommendations)
        customers_with_recommendations += 1

print(f"\nSystem statistics (sample of {len(sample_ids)} customers):")
print(f"Customers with recommendations: {customers_with_recommendations}/{len(sample_ids)}")
print(f"Total recommendations generated: {total_recommendations}")

# Pizza popularity analysis
print(f"\nPizza popularity analysis:")
pizza_popularity = pizza_matrix.sum(axis=0).sort_values(ascending=False)
for pizza_id, popularity in pizza_popularity.items():
    pizza_name = pizza_names[pizza_names['pizza_id'] == pizza_id]['pizza_name'].iloc[0]
    print(f"  - {pizza_name}: {popularity} orders")

print("\nModel 3 completed successfully!")

In [None]:
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# ===============================
# K-MEANS on Danny's Pizza (version without prices)
# ===============================

# === 1. Customers: segmentation by order frequency and pizza variety ===
query_customers = """
SELECT c.customer_id,
       COUNT(DISTINCT c.order_id) AS total_orders,
       COUNT(DISTINCT c.pizza_id) AS unique_pizzas
FROM customer_orders c
GROUP BY c.customer_id
"""
customers = pd.read_sql_query(query_customers, conn)

X_cust = customers[['total_orders', 'unique_pizzas']]
scaler = StandardScaler()
X_cust_scaled = scaler.fit_transform(X_cust)

kmeans_cust = KMeans(n_clusters=3, random_state=42, n_init=10)
customers['cluster'] = kmeans_cust.fit_predict(X_cust_scaled)

plt.scatter(X_cust_scaled[:, 0], X_cust_scaled[:, 1],
            c=customers['cluster'], cmap='viridis', s=100)
plt.xlabel("Orders (scaled)")
plt.ylabel("Pizza variety (scaled)")
plt.title("Customer Clusters")
plt.show()

print("Customer summary by cluster:")
print(customers.groupby('cluster')[['total_orders', 'unique_pizzas']].mean())


# === 2. Pizza types: segmentation by popularity ===
query_pizzas = """
SELECT p.pizza_name,
       COUNT(c.order_id) AS times_ordered
FROM customer_orders c
JOIN pizza_names p ON c.pizza_id = p.pizza_id
GROUP BY p.pizza_name
"""
pizza_stats = pd.read_sql_query(query_pizzas, conn)

X_pizza = pizza_stats[['times_ordered']]
X_pizza_scaled = scaler.fit_transform(X_pizza)  # Refits the scaler on this table

kmeans_pizza = KMeans(n_clusters=3, random_state=42, n_init=10)
pizza_stats['cluster'] = kmeans_pizza.fit_predict(X_pizza_scaled)

# One-dimensional data, so all points are drawn on the line y = 0
plt.scatter(X_pizza_scaled[:, 0], [0]*len(X_pizza_scaled),
            c=pizza_stats['cluster'], cmap='plasma', s=120)
plt.xlabel("Popularity (scaled)")
plt.title("Pizza Clusters")
plt.show()

print("Pizza summary by cluster:")
print(pizza_stats.groupby('cluster')['times_ordered'].mean())


# === 3. Orders: segmentation by order size ===
query_orders = """
SELECT c.order_id,
       COUNT(c.pizza_id) AS items_count
FROM customer_orders c
GROUP BY c.order_id
"""
orders_stats = pd.read_sql_query(query_orders, conn)

X_orders = orders_stats[['items_count']]
X_orders_scaled = scaler.fit_transform(X_orders)  # Refits the scaler on this table

kmeans_orders = KMeans(n_clusters=3, random_state=42, n_init=10)
orders_stats['cluster'] = kmeans_orders.fit_predict(X_orders_scaled)

# One-dimensional data, so all points are drawn on the line y = 0
plt.scatter(X_orders_scaled[:, 0], [0]*len(X_orders_scaled),
            c=orders_stats['cluster'], cmap='cool', s=120)
plt.xlabel("Pizzas per order (scaled)")
plt.title("Order Clusters")
plt.show()

print("Order summary by cluster:")
print(orders_stats.groupby('cluster')['items_count'].mean())
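
All three segmentations above fix `n_clusters=3`. A common sanity check is to compare silhouette scores across several values of k and see whether 3 is supported by the data. The sketch below runs that check on synthetic customers; the cluster centers, noise level, and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic customers (hypothetical): three well-separated groups in
# (total_orders, unique_pizzas) space.
rng = np.random.default_rng(0)
centers = np.array([[2.0, 1.0], [12.0, 4.0], [22.0, 7.0]])
X_toy = np.vstack([c + rng.normal(0, 0.3, size=(30, 2)) for c in centers])
X_toy_scaled = StandardScaler().fit_transform(X_toy)

# Silhouette score for each candidate k (higher means tighter, better-separated clusters).
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_toy_scaled)
    scores[k] = silhouette_score(X_toy_scaled, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print(f"Best k by silhouette: {best_k}")
```

`silhouette_score` needs at least two clusters and fewer clusters than samples, so on a table with only a handful of rows (such as the pizza table) the range of usable k values is small.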

## Model 4: Random Forest - Demand Prediction

### Building the Hourly Demand Dataset

In [None]:
# =============================================================================
# MODEL 4: RANDOM FOREST - DEMAND PREDICTION
# =============================================================================

print("MODEL 4: Demand Prediction by Time Slot")

# Build the hourly demand dataset
demand_data = delivered_orders.copy()
demand_data['date'] = demand_data['order_date'].dt.date
demand_data['hour'] = demand_data['order_date'].dt.hour

# Group by date and hour (only hours with at least one order produce a row)
demand_data = demand_data.groupby(['date', 'hour']).size().reset_index()
demand_data.columns = ['date', 'hour', 'order_count']

print(f"Demand dataset created: {len(demand_data)} records")
print(f"Date range: {demand_data['date'].min()} to {demand_data['date'].max()}")
print(f"Unique hours: {demand_data['hour'].nunique()}")
print(demand_data.head())
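
Because `groupby(['date', 'hour']).size()` only returns rows for hours that had at least one order, hours with zero demand never appear, and averages computed from the result overstate demand. One possible remedy is to merge the counts onto a full date-by-hour grid. The sketch below assumes opening hours of 10:00 to 22:59, which is a made-up schedule, and uses toy timestamps.

```python
import pandas as pd

# Toy orders (hypothetical timestamps).
toy = pd.DataFrame({
    "order_date": pd.to_datetime([
        "2021-01-01 12:05", "2021-01-01 12:40", "2021-01-01 19:10",
        "2021-01-02 19:30",
    ])
})
toy["date"] = toy["order_date"].dt.date
toy["hour"] = toy["order_date"].dt.hour

# Observed counts: only (date, hour) pairs with at least one order.
observed = toy.groupby(["date", "hour"]).size().rename("order_count").reset_index()

# Full grid of every date and every assumed opening hour.
open_hours = range(10, 23)  # Assumption: open 10:00 to 22:59
grid = pd.MultiIndex.from_product(
    [sorted(toy["date"].unique()), open_hours], names=["date", "hour"]
).to_frame(index=False)

# Left-merge so hours without orders are kept, then fill them with 0.
demand_full = grid.merge(observed, on=["date", "hour"], how="left")
demand_full["order_count"] = demand_full["order_count"].fillna(0).astype(int)

print(f"Mean over observed hours only: {observed['order_count'].mean():.2f}")
print(f"Mean over all open hours:      {demand_full['order_count'].mean():.2f}")
```

Whether zero-demand hours belong in the training data depends on the question being asked: staffing decisions usually need them, while "how busy is a busy hour" does not.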

### Creating Temporal Features

In [None]:
# Temporal features
demand_data['day_of_week'] = pd.to_datetime(demand_data['date']).dt.dayofweek
demand_data['is_weekend'] = (demand_data['day_of_week'] >= 5).astype(int)
demand_data['month'] = pd.to_datetime(demand_data['date']).dt.month
demand_data['is_lunch'] = ((demand_data['hour'] >= 11) & (demand_data['hour'] <= 14)).astype(int)
demand_data['is_dinner'] = ((demand_data['hour'] >= 18) & (demand_data['hour'] <= 21)).astype(int)
demand_data['is_breakfast'] = ((demand_data['hour'] >= 7) & (demand_data['hour'] <= 10)).astype(int)
demand_data['is_late_night'] = ((demand_data['hour'] >= 22) | (demand_data['hour'] <= 2)).astype(int)

# Additional features
demand_data['hour_squared'] = demand_data['hour'] ** 2  # Captures non-linearity in hour
demand_data['weekend_hour'] = demand_data['is_weekend'] * demand_data['hour']  # Interaction term

print("Temporal features created:")
print(demand_data[['date', 'hour', 'day_of_week', 'is_weekend', 'month',
                  'is_lunch', 'is_dinner', 'is_breakfast', 'is_late_night']].head())

### Exploratory Analysis and Plots

In [None]:
# Exploratory analysis
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# Plot 1: Demand distribution
axes[0,0].hist(demand_data['order_count'], bins=30, alpha=0.7, color='skyblue')
axes[0,0].set_title('Distribution of Hourly Demand')
axes[0,0].set_xlabel('Number of Orders')
axes[0,0].set_ylabel('Frequency')

# Plot 2: Demand by day of week (reindex so days without data plot as 0)
day_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
demand_by_day = demand_data.groupby('day_of_week')['order_count'].mean().reindex(range(7), fill_value=0)
axes[0,1].bar(day_names, demand_by_day, color='lightgreen')
axes[0,1].set_title('Average Demand by Day of Week')
axes[0,1].set_ylabel('Average Orders')

# Plot 3: Demand by hour of day
demand_by_hour = demand_data.groupby('hour')['order_count'].mean()
axes[0,2].plot(demand_by_hour.index, demand_by_hour.values, marker='o', color='orange')
axes[0,2].set_title('Average Demand by Hour of Day')
axes[0,2].set_xlabel('Hour')
axes[0,2].set_ylabel('Average Orders')

# Plot 4: Demand by month
demand_by_month = demand_data.groupby('month')['order_count'].mean()
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
axes[1,0].bar(range(1, 13), [demand_by_month.get(i, 0) for i in range(1, 13)], color='lightcoral')
axes[1,0].set_xticks(range(1, 13))
axes[1,0].set_xticklabels(month_names)
axes[1,0].set_title('Average Demand by Month')
axes[1,0].set_xlabel('Month')
axes[1,0].set_ylabel('Average Orders')

# Plot 5: Weekend vs weekday comparison (reindex so both bars always exist)
weekend_comparison = demand_data.groupby('is_weekend')['order_count'].mean().reindex([0, 1], fill_value=0)
axes[1,1].bar(['Weekday', 'Weekend'], weekend_comparison.values, color=['lightblue', 'darkblue'])
axes[1,1].set_title('Demand: Weekday vs Weekend')
axes[1,1].set_ylabel('Average Orders')

# Plot 6: Heatmap of demand by day and hour
pivot_demand = demand_data.groupby(['day_of_week', 'hour'])['order_count'].mean().unstack()
sns.heatmap(pivot_demand, annot=True, fmt='.1f', cmap='YlOrRd', ax=axes[1,2])
axes[1,2].set_title('Heatmap: Demand by Day and Hour')
axes[1,2].set_xlabel('Hour')
axes[1,2].set_ylabel('Day of Week')

plt.tight_layout()
plt.show()

print("Exploratory analysis completed")

### Preparing Features for the Model

In [None]:
# Features for the model
X_demand = demand_data[['hour', 'day_of_week', 'is_weekend', 'month',
                       'is_lunch', 'is_dinner', 'is_breakfast', 'is_late_night',
                       'hour_squared', 'weekend_hour']]
y_demand = demand_data['order_count']

print("Features selected for the model:")
print(X_demand.columns.tolist())
print(f"Shape of X: {X_demand.shape}")
print(f"Shape of y: {y_demand.shape}")

# Check for missing values
print(f"Missing values in X: {X_demand.isnull().sum().sum()}")
print(f"Missing values in y: {y_demand.isnull().sum()}")

# Basic statistics
print(f"\nDemand statistics:")
print(f"Mean demand: {y_demand.mean():.2f}")
print(f"Maximum demand: {y_demand.max()}")
print(f"Minimum demand: {y_demand.min()}")
print(f"Standard deviation: {y_demand.std():.2f}")

### Train/Test Split and Training

In [None]:
# Split the data
# Tree-based models do not need feature scaling, so the raw features are used.
# The target also stays in order counts, which keeps RMSE interpretable and
# lets predict_demand pass unscaled inputs to the model.
X_train_demand, X_test_demand, y_train_demand, y_test_demand = train_test_split(
    X_demand, y_demand, test_size=0.2, random_state=42)

print(f"Training data: {X_train_demand.shape[0]} samples")
print(f"Test data: {X_test_demand.shape[0]} samples")

# Train the Random Forest model
rf_model = RandomForestRegressor(n_estimators=100, random_state=42, max_depth=10)
rf_model.fit(X_train_demand, y_train_demand)

print("Random Forest model trained successfully")
print(f"R² Score on training data: {rf_model.score(X_train_demand, y_train_demand):.3f}")
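
The training R² printed above is measured on data the forest has already seen, so it tends to be optimistic. K-fold cross-validation gives a steadier estimate than a single split. The following sketch uses synthetic hourly demand; the demand rule, sample size, and names such as `X_toy` are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic hourly demand (hypothetical): dinner peak plus weekend uplift.
rng = np.random.default_rng(0)
hours = rng.integers(10, 23, size=300)
dow = rng.integers(0, 7, size=300)
X_toy = np.column_stack([hours, dow])
expected = 2 + 3 * ((hours >= 18) & (hours <= 21)) + 1 * (dow >= 5)
y_toy = rng.poisson(expected)  # Poisson noise around the expected demand

model = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# One R² value per fold, each computed on data held out from training.
scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="r2")
print(f"R² per fold: {np.round(scores, 3)}")
print(f"Mean R²: {scores.mean():.3f} (std {scores.std():.3f})")
```

For data with a strong time order, a chronological split such as scikit-learn's `TimeSeriesSplit` may be a better fit than shuffled folds.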

### Model Evaluation

In [None]:
# Predictions
y_pred_demand = rf_model.predict(X_test_demand)
mse_demand = mean_squared_error(y_test_demand, y_pred_demand)
r2_demand = r2_score(y_test_demand, y_pred_demand)
rmse_demand = np.sqrt(mse_demand)

print(f"R² Score: {r2_demand:.3f}")
print(f"RMSE: {rmse_demand:.2f} orders")
print(f"Mean actual demand: {y_test_demand.mean():.2f}")
print(f"Mean predicted demand: {y_pred_demand.mean():.2f}")

# Error analysis
errors = y_test_demand - y_pred_demand
print(f"Mean error: {errors.mean():.2f}")
print(f"Mean absolute error: {abs(errors).mean():.2f}")

# Show feature importance
feature_names_demand = ['hour', 'day_of_week', 'is_weekend', 'month',
                       'is_lunch', 'is_dinner', 'is_breakfast', 'is_late_night',
                       'hour_squared', 'weekend_hour']
importance_demand = pd.DataFrame({
    'feature': feature_names_demand,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("\nMost important features:")
print(importance_demand)

### Model Evaluation Plots

In [None]:
# Evaluation plots
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Plot 1: Predictions vs actual values
axes[0,0].scatter(y_test_demand, y_pred_demand, alpha=0.5, color='blue')
axes[0,0].plot([y_test_demand.min(), y_test_demand.max()], [y_test_demand.min(), y_test_demand.max()], 'r--', lw=2)
axes[0,0].set_title('Predictions vs Actual Values')
axes[0,0].set_xlabel('Actual Demand')
axes[0,0].set_ylabel('Predicted Demand')

# Plot 2: Error distribution
axes[0,1].hist(errors, bins=30, alpha=0.7, color='green')
axes[0,1].set_title('Distribution of Prediction Errors')
axes[0,1].set_xlabel('Error')
axes[0,1].set_ylabel('Frequency')

# Plot 3: Residuals vs predictions
axes[1,0].scatter(y_pred_demand, errors, alpha=0.5, color='orange')
axes[1,0].axhline(y=0, color='r', linestyle='--')
axes[1,0].set_title('Residuals vs Predictions')
axes[1,0].set_xlabel('Predicted Demand')
axes[1,0].set_ylabel('Residuals')

# Plot 4: Feature importance
importance_abs = importance_demand.copy()
importance_abs = importance_abs.sort_values('importance', ascending=True)
axes[1,1].barh(importance_abs['feature'], importance_abs['importance'])
axes[1,1].set_title('Feature Importance')
axes[1,1].set_xlabel('Importance')

plt.tight_layout()
plt.show()

### Prediction Function and Use Cases

In [None]:
# Function to predict demand for a specific time slot
def predict_demand(hour, day_of_week, is_weekend=None, month=6):
    """
    Predict the number of orders for a specific time slot.

    day_of_week follows the pandas convention (0 = Monday, 6 = Sunday).
    """
    if is_weekend is None:
        is_weekend = 1 if day_of_week >= 5 else 0

    is_lunch = 1 if 11 <= hour <= 14 else 0
    is_dinner = 1 if 18 <= hour <= 21 else 0
    is_breakfast = 1 if 7 <= hour <= 10 else 0
    is_late_night = 1 if hour >= 22 or hour <= 2 else 0
    hour_squared = hour ** 2
    weekend_hour = is_weekend * hour

    # Column names and order must match the X_demand columns used for training
    features = pd.DataFrame([{
        'hour': hour, 'day_of_week': day_of_week, 'is_weekend': is_weekend, 'month': month,
        'is_lunch': is_lunch, 'is_dinner': is_dinner, 'is_breakfast': is_breakfast,
        'is_late_night': is_late_night, 'hour_squared': hour_squared, 'weekend_hour': weekend_hour
    }])

    prediction = rf_model.predict(features)[0]
    return max(0, round(prediction))

# Prediction examples
print("Demand prediction examples:")
print(f"Friday 19:00: {predict_demand(19, 4, 0, 6)} orders")
print(f"Saturday 12:00: {predict_demand(12, 5, 1, 6)} orders")
print(f"Monday 09:00: {predict_demand(9, 0, 0, 6)} orders")
print(f"Sunday 20:00: {predict_demand(20, 6, 1, 6)} orders")

# Use-case analysis
print("\nUse cases for the demand prediction model:")
print("1. Staff scheduling: Optimize staffing by time slot")
print("2. Inventory management: Predict ingredient demand")
print("3. Marketing: Campaigns during low-demand hours")
print("4. Operations: Kitchen and delivery planning")
print("5. Profitability analysis: Identify the most profitable time slots")

### Temporal Pattern Analysis

In [None]:
# Temporal pattern analysis
print("\nTemporal pattern analysis:")

# Peak hours
peak_hours = demand_data.groupby('hour')['order_count'].mean().sort_values(ascending=False)
print(f"Top 5 peak hours:")
for hour, demand in peak_hours.head().items():
    print(f"  - {hour}:00 - {demand:.1f} average orders")

# Busiest days
peak_days = demand_data.groupby('day_of_week')['order_count'].mean().sort_values(ascending=False)
day_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
print(f"\nBusiest days:")
for day, demand in peak_days.items():
    print(f"  - {day_names[day]}: {demand:.1f} average orders")

# Weekend vs weekday comparison
weekday_avg = demand_data[demand_data['is_weekend'] == 0]['order_count'].mean()
weekend_avg = demand_data[demand_data['is_weekend'] == 1]['order_count'].mean()
print(f"\nComparison:")
print(f"  - Weekday: {weekday_avg:.1f} average orders")
print(f"  - Weekend: {weekend_avg:.1f} average orders")
print(f"  - Difference: {((weekend_avg - weekday_avg) / weekday_avg * 100):.1f}%")

print("\nModel 4 completed successfully!")