# Binance Producer

This document explains each section of `src/binance_producer.py`.  
The script is a WebSocket producer that connects to Binance's API and streams closed candlestick data to a Kafka topic.

**Idea:** The producer automatically receives Bitcoin price data from the Binance exchange as 30-minute candlesticks and transmits each completed candle to a data processing system (Kafka) for further analysis.

**Plan:**   
1. Connect to Binance via WebSocket.  
2. Receive completed 30-minute candlesticks.  
3. Extract important data (prices, volumes, etc.).  
4. Send them to Kafka for storage and further analysis.  
5. Automatically reconnect upon errors.  

---

## 1. Preprocessing

### 1.1 Library Imports

Standard library and third-party imports.

In [None]:
import json
import asyncio
import websockets
from kafka import KafkaProducer
from datetime import datetime, timezone
from typing import Dict, List, Optional
import logging

Each library serves a specific purpose in the data pipeline:

- **json**: Binance sends data in JSON format, so we need to parse incoming messages and serialize outgoing records before sending to Kafka.

- **asyncio**: An asynchronous framework that lets the program wait for WebSocket messages without blocking other operations. While waiting for data from Binance, the event loop can run other tasks or simply idle efficiently.

- **websockets**: Unlike HTTP requests, WebSocket maintains a persistent bidirectional connection. Once connected to Binance, we receive a continuous stream of price updates as they happen in real time.

- **KafkaProducer**: The Kafka client for sending messages to a Kafka topic. 

- **datetime, timezone**: Used to timestamp each record with the exact moment it was received. 

- **typing.Dict, List, Optional**: Type hints for better code readability.

- **logging**: Logging framework for diagnostics.
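
To illustrate the non-blocking wait that `asyncio` provides, here is a minimal sketch that runs two coroutines concurrently. The `fake_stream` coroutine and its delays are hypothetical stand-ins for a real WebSocket feed, not part of the producer.

```python
import asyncio
from typing import List


async def fake_stream(name: str, delay: float, log: List[str]) -> None:
    # Stand-in for awaiting a WebSocket message: while this coroutine sleeps,
    # the event loop is free to run other coroutines.
    await asyncio.sleep(delay)
    log.append(name)


async def demo() -> List[str]:
    log: List[str] = []
    # Both coroutines wait at the same time, so the total runtime is roughly
    # the longer delay rather than the sum of the two.
    await asyncio.gather(
        fake_stream("slow", 0.1, log),
        fake_stream("fast", 0.01, log),
    )
    return log
```

Running `asyncio.run(demo())` records the faster coroutine first even though it was started second, because neither wait blocks the other.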

---

### 1.2 Logging configuration

Sets up the root logger so that every log message shows the timestamp at which the entry was created, the name of the logger, the severity level (INFO, WARNING, ERROR, etc.), and the message.  
The module-level logger is used throughout the producer.

In [None]:
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

Reasons why we use logging:  

- **Troubleshooting**: When something goes wrong, timestamps help identify when the issue started.  

- **Monitoring**: Structured log entries can feed dashboards and alerts.

- **Debugging**: We can see the exact sequence of events leading to an error. 
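
To see what an entry produced by this format looks like, the sketch below attaches the same format string to a handler that writes to an in-memory buffer instead of the console. The logger name `demo.producer` and the messages are illustrative only.

```python
import io
import logging

LOG_FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"

# Write to an in-memory buffer so the formatted output can be inspected.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter(LOG_FORMAT))

demo_logger = logging.getLogger("demo.producer")  # hypothetical logger name
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # keep the demo output out of the root logger
demo_logger.addHandler(handler)

demo_logger.info("Connected to websocket")
demo_logger.debug("Filtered out because the level is INFO")

# The captured line has the shape:
# "<timestamp> - demo.producer - INFO - Connected to websocket"
line = buffer.getvalue().strip()
```

Only the INFO entry reaches the buffer; the DEBUG entry is dropped by the level filter.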

---

## 2. BinanceKlineProducer Class

The core producer component that bridges Binance's real-time market data with Kafka infrastructure. It manages WebSocket connections, message parsing, data validation, and reliable message delivery with built-in error recovery.

### 2.1 Core Parameters

The `__init__` method initializes the producer with configuration needed to connect to both Binance and Kafka.

In [None]:
def __init__(
    self,
    symbols: List[str],
    kafka_topic: str = "binance_kline",
    kafka_bootstrap_servers: str = "localhost:9092",
    interval: str = "30m"
):
    self.symbols = [s.lower() for s in symbols]
    self.kafka_topic = kafka_topic
    self.kafka_bootstrap_servers = kafka_bootstrap_servers
    self.interval = interval

Each parameter has a specific purpose:

- **symbols: List[str]**: Trading pairs to monitor, like ["BTCUSDT", "ETHUSDT"].

- **kafka_topic: str = "binance_kline"**: The Kafka topic to which messages are published and from which consumers later read the candlestick data.

- **kafka_bootstrap_servers: str = "localhost:9092"**: The Kafka broker address.

- **interval: str = "30m"**: Candlestick time interval; each candlestick aggregates all trades within this period. 

All symbols are normalized to lowercase for the Binance stream URL.

---

### 2.2 Interval Validation

In [None]:
    if interval not in ["1m", "30m"]:
        raise ValueError("Interval must be '1m' or '30m'")

The project accepts only the 1m and 30m intervals, both of which Binance provides natively. Using native intervals rather than aggregating smaller candles manually simplifies the code and guarantees proper candle alignment.  
The 1-minute option exists for potential future use cases; the current pipeline is optimized for 30-minute candles.

---

### 2.3 WebSocket URL and Kafka Producer Initialization

In [None]:
    streams = "/".join([f"{symbol}@kline_{interval}" for symbol in self.symbols])
    self.ws_url = f"wss://stream.binance.com/stream?streams={streams}"

    self.producer = KafkaProducer(
        bootstrap_servers=kafka_bootstrap_servers,
        value_serializer=lambda v: json.dumps(v).encode('utf-8'),
        acks='all',
        retries=3
    )

    logger.info(f"Initialized producer for {len(symbols)} symbols: {', '.join(self.symbols)}")
    logger.info(f"WebSocket URL: {self.ws_url}")
    logger.info(f"Kafka topic: {kafka_topic}")

Binance allows subscribing to multiple streams simultaneously through a single WebSocket connection.    
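
As a small illustration of how the combined stream URL is assembled, the sketch below extracts the same string logic into a standalone helper. The function name `build_stream_url` is hypothetical; the URL pattern is the one used in the cell above.

```python
from typing import List


def build_stream_url(symbols: List[str], interval: str) -> str:
    # Each stream is named "<symbol in lowercase>@kline_<interval>";
    # multiple streams are joined with "/" in a single query parameter.
    streams = "/".join(f"{s.lower()}@kline_{interval}" for s in symbols)
    return f"wss://stream.binance.com/stream?streams={streams}"


url = build_stream_url(["BTCUSDT", "ETHUSDT"], "30m")
# url == "wss://stream.binance.com/stream?streams=btcusdt@kline_30m/ethusdt@kline_30m"
```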

Each `KafkaProducer` parameter configures how messages are sent to Kafka:

- **bootstrap_servers**: Initial connection point to the Kafka cluster.

- **value_serializer**: Automatically converts Python data to JSON and encodes it as bytes. Kafka requires all messages to be in byte format, so this function handles the serialization transparently before each send operation.

- **acks='all'**: Acknowledgment level for durability. The leader broker confirms a write only after all in-sync replicas have acknowledged it, which is the strongest delivery guarantee Kafka offers. We choose it because losing financial data is unacceptable.

- **retries=3**: Retries a failed send up to 3 times.
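
To make the serialization step concrete, this sketch applies the same transformation as the `value_serializer` lambda to a small dictionary and then reverses it, as a consumer would. The sample values are made up.

```python
import json


def serialize(value: dict) -> bytes:
    # Same transformation as the value_serializer lambda: dict -> JSON text -> UTF-8 bytes
    return json.dumps(value).encode("utf-8")


payload = serialize({"symbol": "BTCUSDT", "close": 42000.5})  # illustrative values
restored = json.loads(payload.decode("utf-8"))  # what a consumer would do
```

The round trip returns the original dictionary, which is why the consumer side only needs the inverse `decode` and `json.loads` calls.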

---

## 3. Message parsing

`parse_kline` parses a WebSocket message from Binance and, for **closed** candles only, returns a clean, structured record ready for storage and analysis.

### 3.1 Extracting Stream Data

In [None]:
def parse_kline(self, message: Dict) -> Optional[Dict]:
    try:
        data = message.get("data", message)
        k = data.get("k")
        
        if not k:
            return None
            
        if not k.get("x"):
            return None

When several symbols share one connection (a combined stream), Binance wraps the actual candlestick event inside a **"data"** key. Because `message.get("data", message)` falls back to the message itself, unwrapped single-stream payloads are handled as well.  
The actual candlestick information is always nested within the **"k"** (kline) key, which contains all the price data.  
If there's no "k" field, this isn't a candlestick message, so the method returns `None` to skip processing.  
**Important**: We only process **closed** candles ("x": true). Open candles are constantly updating and aren't suitable for analysis.     
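
The filtering rules above can be tried out on hand-written payloads. This is a minimal sketch: `extract_closed_kline` is a hypothetical helper that mirrors the first lines of `parse_kline`, and the payloads contain made-up values with only the fields needed here.

```python
from typing import Dict, Optional


def extract_closed_kline(message: Dict) -> Optional[Dict]:
    # Combined streams wrap the event in "data"; single streams do not.
    data = message.get("data", message)
    k = data.get("k")
    # Skip non-kline messages and candles that are still open.
    if not k or not k.get("x"):
        return None
    return k


# Illustrative payloads (made-up values, trimmed to the relevant fields).
combined_closed = {
    "stream": "btcusdt@kline_30m",
    "data": {"e": "kline", "s": "BTCUSDT", "k": {"x": True, "c": "100.0"}},
}
single_open = {"e": "kline", "s": "BTCUSDT", "k": {"x": False, "c": "99.5"}}
not_a_kline = {"result": None, "id": 1}
```

Only `combined_closed` yields a kline dictionary; the other two messages return `None` and are skipped.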

---

### 3.2 Building the record

In [None]:
        record = {
            "open_time": k["t"],
            "open": float(k["o"]),
            "high": float(k["h"]),
            "low": float(k["l"]),
            "close": float(k["c"]),
            "volume": float(k["v"]),
            "quote_asset_volume": float(k["q"]),
            "number_of_trades": int(k["n"]),
            "taker_buy_base_asset_volume": float(k["V"]),
            "taker_buy_quote_asset_volume": float(k["Q"]),
            "symbol": data.get("s", "").upper(),
            "interval": k["i"],
            "ingested_at": datetime.now(timezone.utc).isoformat()
        }
        
        return record

Each field provides essential information for technical analysis and machine learning models:

- **open_time**: Timestamp in milliseconds, which represents when the candlestick period started.

- **open**: First trade price in the period.

- **high**: Highest trade price reached during the period; it marks the peak of buying enthusiasm and a potential resistance level. 

- **low**: Lowest trade price during the period; it marks a potential support level or a point of panic selling. 

- **close**: Last trade price in the period. For trend analysis, the key comparison is whether close > open (price rose) or close < open (price fell). 

- **volume**: Total amount of the base asset traded. High volume indicates strong conviction behind a price movement. 

- **quote_asset_volume**: Total amount of the quote asset exchanged in trades. For BTCUSDT, `volume` is the BTC amount and this field is its value in USDT.

- **number_of_trades**: Count of individual trades.

- **taker_buy_base_asset_volume**: Portion of the base asset volume (e.g. BTC) in which the buyer was the taker, meaning the buy order executed immediately against resting sell orders. Comparing it with `volume` indicates how much of the activity was aggressive buying.

- **taker_buy_quote_asset_volume**: The same taker-buy activity measured in the quote asset (e.g. USDT).

- **symbol**: Trading pair in uppercase. 

- **interval**: Confirms the candlestick interval, in our case 30m. 

- **ingested_at**: Timestamp of when we received this data. 
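
As a sketch of how a downstream consumer might use these fields, the helper below derives a readable open time, the candle direction, and the taker-buy share from a record. The function `describe_candle` and the sample values are hypothetical and not part of the producer.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def describe_candle(record: Dict) -> Dict:
    # open_time is expressed in milliseconds since the Unix epoch.
    open_dt = datetime.fromtimestamp(record["open_time"] / 1000, tz=timezone.utc)

    if record["close"] > record["open"]:
        direction = "up"
    elif record["close"] < record["open"]:
        direction = "down"
    else:
        direction = "flat"

    volume = record["volume"]
    # Share of the base volume bought by takers; undefined for an empty candle.
    taker_buy_ratio: Optional[float] = (
        record["taker_buy_base_asset_volume"] / volume if volume else None
    )
    return {
        "open_time_utc": open_dt.isoformat(),
        "direction": direction,
        "taker_buy_ratio": taker_buy_ratio,
    }


# Made-up record containing only the fields used above.
sample = {
    "open_time": 1700000000000,
    "open": 100.0,
    "close": 110.0,
    "volume": 8.0,
    "taker_buy_base_asset_volume": 6.0,
}
summary = describe_candle(sample)
```

For this sample the candle closed above its open, and three quarters of the base volume came from taker buys.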

---

### 3.3 Error handling wrapper

In [None]:
try:
    # ... parsing logic ...
except (KeyError, ValueError, TypeError) as e:
    logger.error(f"Error parsing {e}")
    return None

The entire method is wrapped in a **try-except** block because:

- **KeyError**: If Binance changes their message format or a required field is missing. 

- **ValueError**: If a string can't be converted to float. 

- **TypeError**: If a value has the wrong type, for example when `float()` receives `None` or a field that should be a dictionary is `None`. 
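
The three exception types can be reproduced with a single conversion. In this sketch, `classify_failure` is a hypothetical helper that attempts the same `float(k["c"])` conversion used when building the record and reports which exception, if any, was raised.

```python
from typing import Dict


def classify_failure(k: Dict) -> str:
    try:
        float(k["c"])
        return "ok"
    except KeyError:
        return "KeyError"    # the "c" field is missing
    except ValueError:
        return "ValueError"  # the string is not a valid number
    except TypeError:
        return "TypeError"   # the value is None or another non-numeric type


results = [
    classify_failure({"c": "101.5"}),
    classify_failure({}),
    classify_failure({"c": "not-a-number"}),
    classify_failure({"c": None}),
]
```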

---

## 4. Main Producer Loop

Async method that runs until interrupted. It connects to the WebSocket with a 20s ping interval and 10s ping timeout to detect dead connections. 

**Processing flow:**
1. Receive raw message
2. Parse JSON
3. Validate and transform into record
4. Send to Kafka
5. Log successful operation

### 4.1 WebSocket Connection

In [None]:
async def run(self):
    logger.info("Starting the producer")
    
    while True:
        try:
            async with websockets.connect(
                self.ws_url,
                ping_interval=20,
                ping_timeout=10
            ) as ws:
                logger.info(f"Connected to websocket for {len(self.symbols)} symbols")

The WebSocket connection is configured with a **ping interval** of 20 seconds, meaning the client sends a ping frame every 20 seconds to verify the connection is still active. If the server doesn't respond within 10 seconds (**ping timeout**), the `websockets` library considers the connection dead and closes it; the resulting exception is then handled by the reconnection logic in section 4.3. This mechanism ensures that broken or unresponsive connections are detected quickly rather than waiting indefinitely for data that will never arrive.

---

### 4.2 Message Processing

In [None]:
                candle_count = 0
                async for raw_message in ws:
                    message = json.loads(raw_message)
                    record = self.parse_kline(message)
    
                    if record:
                        self.producer.send(self.kafka_topic, value=record)
                        candle_count += 1
        
                        logger.info(
                            f"Sent candle #{candle_count}: {record['symbol']} "
                            f"@ {record['open_time']} | Close: {record['close']:.2f}"
                        )

The producer initializes a counter at zero to track the number of candlesticks processed during the session. As each raw WebSocket message arrives, it's immediately parsed from JSON into a Python dictionary. This dictionary is then passed to the `parse_kline` method, which validates the data and transforms it into our standardized record format.  
Only valid, closed candlesticks are sent to Kafka; if `parse_kline` returns `None`, the loop continues without sending anything.  
When a valid record is returned, it's handed to the Kafka producer, which queues it for delivery to the topic. We then increment the counter to keep an accurate count of successfully processed candlesticks, which is useful for monitoring throughput.  
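
The processing flow can be exercised without a network connection or a Kafka broker. The sketch below is a simplified model under stated assumptions: `StubProducer`, `parse_closed`, and `fake_ws` are hypothetical stand-ins for `KafkaProducer`, `parse_kline`, and the WebSocket connection, and the messages carry made-up prices.

```python
import asyncio
import json
from typing import AsyncIterator, Dict, List, Optional


class StubProducer:
    """Hypothetical stand-in for KafkaProducer that records sends in memory."""

    def __init__(self) -> None:
        self.sent: List[Dict] = []

    def send(self, topic: str, value: Dict) -> None:
        self.sent.append({"topic": topic, "value": value})


def parse_closed(message: Dict) -> Optional[Dict]:
    # Simplified version of parse_kline: keep only closed candles.
    k = message.get("data", message).get("k")
    if not k or not k.get("x"):
        return None
    return {"symbol": k["s"], "close": float(k["c"])}


async def fake_ws() -> AsyncIterator[str]:
    # Two updates of an open candle followed by its closing message.
    for closed, price in [(False, "100.0"), (False, "101.0"), (True, "102.0")]:
        yield json.dumps({"data": {"k": {"s": "BTCUSDT", "x": closed, "c": price}}})


async def process(ws: AsyncIterator[str], producer: StubProducer, topic: str) -> int:
    candle_count = 0
    async for raw_message in ws:
        record = parse_closed(json.loads(raw_message))
        if record:
            producer.send(topic, value=record)
            candle_count += 1
    return candle_count
```

Calling `asyncio.run(process(fake_ws(), StubProducer(), "binance_kline"))` forwards one record out of three messages, since only the last message describes a closed candle.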

---

### 4.3 Automatic Reconnection

In [None]:
            except websockets.exceptions.ConnectionClosed as e:
                logger.warning(f"Connection closed: {e}, reconnecting")
                await asyncio.sleep(2)
            except Exception as e:
                logger.error(f"Unexpected error: {e}, reconnecting")
                await asyncio.sleep(5)

We handle two types of failures:    
1. When the WebSocket connection closes (which Binance does routinely every 24 hours), the producer logs a warning and waits 2 seconds before reconnecting.  
2. For unexpected errors such as network issues or malformed JSON, it waits 5 seconds before retrying. The longer pause helps prevent rapid retry loops during more serious problems.  

Both handlers sit inside the infinite loop, so the producer automatically reconnects and continues streaming without manual intervention.
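
The two-tier retry behaviour can be simulated without a real connection. In this sketch, `ConnectionDropped` is a hypothetical stand-in for `websockets.exceptions.ConnectionClosed`, the loop is bounded by `max_attempts` (the real loop is `while True`), and the sleep function is injectable so the demonstration runs instantly.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional, Tuple


class ConnectionDropped(Exception):
    """Hypothetical stand-in for websockets.exceptions.ConnectionClosed."""


async def run_with_reconnect(
    connect: Callable[[], Awaitable[str]],
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
    max_attempts: int = 5,
) -> Tuple[Optional[str], List[int]]:
    delays: List[int] = []
    for _ in range(max_attempts):
        try:
            return await connect(), delays
        except ConnectionDropped:
            delays.append(2)  # routine disconnect: short pause
            await sleep(2)
        except Exception:
            delays.append(5)  # anything else: longer pause
            await sleep(5)
    return None, delays


def make_flaky_connect() -> Callable[[], Awaitable[str]]:
    # Fails once with each error type, then succeeds.
    failures: List[Exception] = [ConnectionDropped("closed"), RuntimeError("network down")]

    async def connect() -> str:
        if failures:
            raise failures.pop(0)
        return "connected"

    return connect


async def no_sleep(seconds: float) -> None:
    # Test double that skips the actual waiting.
    return None
```

With `make_flaky_connect()` and `no_sleep`, the loop pauses once for 2 seconds and once for 5 seconds (both skipped here) before the third attempt succeeds.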

---

## 5. Graceful Shutdown

This method ensures graceful shutdown without data loss. It's critical for data integrity.

In [None]:
def close(self):
    self.producer.flush()
    self.producer.close()
    logger.info("Producer closed")

Kafka producers use batching for efficiency: instead of sending each message individually, they accumulate messages in an internal buffer and send them in batches. This dramatically improves throughput.  
Without flushing, any messages still in the buffer when the program exits can be lost. The flush() method forces the producer to send all buffered messages immediately and waits for Kafka to acknowledge receipt before continuing. This ensures that every candlestick we processed actually reaches Kafka, even the last few messages that arrived right before shutdown.  
After flushing, the producer is properly closed to release network sockets, memory, and file descriptors. 
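
Why the final flush matters can be shown with a toy model of a batching sender. `BufferedSender` below is a deliberately simplified illustration and does not reflect how kafka-python is implemented internally.

```python
from typing import Any, List


class BufferedSender:
    """Toy model of a batching producer; not the kafka-python implementation."""

    def __init__(self, batch_size: int = 3) -> None:
        self.batch_size = batch_size
        self.buffer: List[Any] = []
        self.delivered: List[Any] = []

    def send(self, value: Any) -> None:
        # Messages accumulate until a full batch is available.
        self.buffer.append(value)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Deliver everything that is still waiting in the buffer.
        self.delivered.extend(self.buffer)
        self.buffer.clear()


sender = BufferedSender(batch_size=3)
for i in range(4):
    sender.send(i)

pending_before_flush = list(sender.buffer)  # the fourth message is still buffered
sender.flush()
```

After four sends with a batch size of three, one message is still waiting in the buffer; it is delivered only once `flush()` is called.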

---

## 6. Starting the producer

This is where the producer begins its work. When the script is executed, it initializes with BTCUSDT as the trading symbol, connects to a local Kafka server at localhost:9092, and starts streaming 30-minute candlestick data to the binance_kline topic.

In [None]:
async def main():
    SYMBOLS = ["BTCUSDT"]
    KAFKA_TOPIC = "binance_kline"
    KAFKA_SERVERS = "localhost:9092"

    producer = BinanceKlineProducer(
        symbols=SYMBOLS,
        kafka_topic=KAFKA_TOPIC,
        kafka_bootstrap_servers=KAFKA_SERVERS,
        interval="30m"
    )

    try:
        await producer.run()
    except KeyboardInterrupt:
        logger.info("Shutting down")
    finally:
        producer.close()


if __name__ == "__main__":
    asyncio.run(main())

The producer runs continuously until it is manually stopped (for example with Ctrl+C, which raises KeyboardInterrupt), at which point it shuts down gracefully and ensures all buffered messages are sent to Kafka before closing.