
### Overview
This guide outlines the steps for generating test data and ingesting it into Kafka, using the Confluent platform and Databricks. It demonstrates how to simulate real-world data by generating custom records with the `faker` library and how to manage Kafka topics with the Confluent CLI. The setup is intended for testing and development, so that data pipelines can be exercised with realistic data before they are deployed to production.

### Prerequisites
Before generating test data for Kafka, ensure the following prerequisites are met:
- **Confluent Account:** Sign up for an account with Confluent if you haven't already.
- **Confluent CLI:** Install and set up the Confluent CLI to manage Kafka topics and inspect the data in them from the command line.
- **Kafka Topic:** Create a Kafka topic named `users` in Confluent Kafka without a schema.
- **Library Installations:** Install the `confluent_kafka` and `faker` libraries on your Databricks cluster.
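
Before running the rest of the notebook, it can help to confirm that both libraries are importable on the cluster. The following is a minimal sketch using only the standard library; the helper name `check_libraries` is hypothetical.

```python
import importlib.util

def check_libraries(names=("confluent_kafka", "faker")):
    """Return {module_name: True/False} depending on whether each module can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

# Report which of the required libraries are available in this environment
status = check_libraries()
for name, available in status.items():
    print(f"{name}: {'installed' if available else 'missing'}")
```

If either library is reported as missing, install it on the cluster before continuing.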

### Agenda
This guide covers the following:
1. **Test Data Generation:** Use the `faker` library to generate test data.
2. **Data Ingestion into Kafka:** Write the generated test data to a Kafka topic named `users`.

In [17]:
%pip install faker

In [18]:
%pip install confluent_kafka


In [19]:
from faker import Faker
import random
from datetime import datetime, timedelta

# Initialize Faker instance
faker = Faker()

# Function to generate random timestamps
def random_timestamp(start, end):
    return start + timedelta(
        seconds=random.randint(0, int((end - start).total_seconds())),
    )

# Generate data
def generate_user_data(num_users=10):
    users = []
    for _ in range(num_users):
        user_id = faker.uuid4()
        user_first_name = faker.first_name()
        user_last_name = faker.last_name()
        user_email = faker.email()
        created_ts = faker.date_time_this_decade()
        last_updated_ts = random_timestamp(created_ts, datetime.now())

        users.append({
            'user_id': user_id,
            'user_first_name': user_first_name,
            'user_last_name': user_last_name,
            'user_email': user_email,
            'created_ts': created_ts,
            'last_updated_ts': last_updated_ts
        })
    return users

# Generate and print user data
user_data = generate_user_data(10)
for user in user_data:
    print(user)

{'user_id': 'db7d4014-d821-4eaa-931f-ad8b3a04b4cc', 'user_first_name': 'Lynn', 'user_last_name': 'Barnes', 'user_email': 'garrettshah@example.net', 'created_ts': datetime.datetime(2023, 12, 20, 13, 6, 51, 596762), 'last_updated_ts': datetime.datetime(2024, 6, 15, 1, 33, 36, 596762)}
{'user_id': '6e70b034-75a1-40eb-b669-fdf383602744', 'user_first_name': 'Shelby', 'user_last_name': 'Brown', 'user_email': 'ialexander@example.com', 'created_ts': datetime.datetime(2023, 7, 13, 13, 41, 16, 185968), 'last_updated_ts': datetime.datetime(2024, 1, 8, 19, 19, 58, 185968)}
{'user_id': 'ef5f1342-825f-4bdb-beca-d18389f4486f', 'user_first_name': 'Deborah', 'user_last_name': 'Poole', 'user_email': 'omason@example.com', 'created_ts': datetime.datetime(2025, 5, 3, 21, 9, 45, 398515), 'last_updated_ts': datetime.datetime(2025, 6, 14, 13, 57, 48, 398515)}
{'user_id': '1c88479a-baf4-4788-8179-2fa121969e41', 'user_first_name': 'William', 'user_last_name': 'Austin', 'user_email': 'danielle68@example.net', 'c
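
The records printed above hold `datetime` objects, which `json.dumps` cannot serialize by default. One way to handle this before writing to Kafka is to pass a `default` function that converts them to ISO 8601 strings. The sketch below uses a hand-written record; the sample values and the helper name `to_json` are illustrative and not part of the notebook.

```python
import json
from datetime import datetime

def to_json(record):
    """Serialize a record, converting datetime values to ISO 8601 strings."""
    def encode(value):
        if isinstance(value, datetime):
            return value.isoformat()
        # Anything else that json cannot handle is reported as an error
        raise TypeError(f"Cannot serialize {type(value).__name__}")
    return json.dumps(record, default=encode)

# Illustrative record with a fixed timestamp
sample = {
    "user_id": "example-id",
    "created_ts": datetime(2024, 1, 2, 3, 4, 5),
}
payload = to_json(sample)
print(payload)
```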

In [20]:

from confluent_kafka import Producer, KafkaError

Go to the Confluent website and create a topic named `users`.

To find the bootstrap server, go to Cluster Overview -> Cluster Settings -> Endpoints.


In [21]:

# Kafka and Confluent Cloud Configuration
# Replace the placeholders with your own API key and secret; never publish real credentials.
kafka_bootstrap_servers = "pkc-n98pk.us-west-2.aws.confluent.cloud:9092"
kafka_topic = "users"
kafka_api_key = "<YOUR_CONFLUENT_API_KEY>"
kafka_api_secret = "<YOUR_CONFLUENT_API_SECRET>"
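
Hard-coding an API key and secret in a notebook makes them easy to leak. A hedged alternative is to read them from environment variables. The variable names `CONFLUENT_API_KEY` and `CONFLUENT_API_SECRET` and the helper `load_kafka_credentials` are assumptions made for illustration.

```python
import os

def load_kafka_credentials(env=os.environ):
    """Read the API key and secret from environment variables (names are assumed)."""
    required = ("CONFLUENT_API_KEY", "CONFLUENT_API_SECRET")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return env["CONFLUENT_API_KEY"], env["CONFLUENT_API_SECRET"]

# Usage (raises RuntimeError if the variables are not set):
# kafka_api_key, kafka_api_secret = load_kafka_credentials()
```

On Databricks, a secret scope read through `dbutils.secrets.get` is another option for keeping credentials out of the notebook.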

In [22]:
# Producer configuration
conf = {
    'bootstrap.servers': kafka_bootstrap_servers,
    'sasl.mechanisms': 'PLAIN',
    'security.protocol': 'SASL_SSL',
    'sasl.username': kafka_api_key,
    'sasl.password': kafka_api_secret,
}

In [23]:
producer = Producer(conf)

In [24]:
# Redefine generate_user_data to return a single record whose timestamps
# are ISO 8601 strings, so the record can be passed to json.dumps
def generate_user_data():
    user_id = faker.uuid4()
    user_first_name = faker.first_name()
    user_last_name = faker.last_name()
    user_email = faker.email()
    created_ts = faker.date_time_this_decade()
    last_updated_ts = faker.date_time_between(start_date=created_ts, end_date='now')

    return {
        'user_id': user_id,
        'user_first_name': user_first_name,
        'user_last_name': user_last_name,
        'user_email': user_email,
        'created_ts': created_ts.isoformat(),
        'last_updated_ts': last_updated_ts.isoformat()
    }

In [25]:
import json
import time

In [31]:
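# Produce messages to Kafka in batches of rate_per_sec, one batch per second.
# Note: any remainder (num_messages % rate_per_sec) is not sent.
def produce_messages(num_messages=1000, rate_per_sec=10):
    for _ in range(num_messages // rate_per_sec):
        for _ in range(rate_per_sec):
            user_data = generate_user_data()
            producer.produce(kafka_topic, key=user_data['user_id'], value=json.dumps(user_data))
        # Wait for the batch to be delivered, then pause to hold the target rate
        producer.flush()
        time.sleep(1)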
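
`produce()` is asynchronous, so a failed send is only visible through a delivery callback. The sketch below passes one through the `on_delivery` argument and counts successes and failures. The names `DeliveryStats` and `send_records` are hypothetical, and the producer is passed in as a parameter so the function works with any object that exposes `produce`, `poll`, and `flush`.

```python
import json

class DeliveryStats:
    """Counts delivery outcomes reported through the producer's callback."""
    def __init__(self):
        self.delivered = 0
        self.errors = []

    def callback(self, err, msg):
        # The client calls this with err=None when the message was delivered
        if err is not None:
            self.errors.append(str(err))
        else:
            self.delivered += 1

def send_records(producer, topic, records, stats):
    """Queue each record with a delivery callback, then wait for delivery."""
    for record in records:
        producer.produce(
            topic,
            key=record["user_id"],
            value=json.dumps(record),
            on_delivery=stats.callback,
        )
        producer.poll(0)  # serve callbacks for messages already delivered
    producer.flush()      # block until every queued message has been reported
    return stats
```

With the notebook's objects, a call might look like `send_records(producer, kafka_topic, [generate_user_data()], DeliveryStats())`, after which `stats.errors` lists any failed deliveries.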

In [33]:
# Produce 10 messages at a rate of 1 message per second
produce_messages(10, 1)
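
Because `produce_messages` uses integer division to count batches, a message count that is not a multiple of the rate loses its remainder: 25 messages at 10 per second would send only 20. A small helper that plans the batch sizes, including a final partial batch, could look like this; `plan_batches` is a hypothetical name.

```python
def plan_batches(num_messages, rate_per_sec):
    """Split num_messages into per-second batch sizes, keeping the remainder."""
    if rate_per_sec <= 0:
        raise ValueError("rate_per_sec must be positive")
    full, remainder = divmod(num_messages, rate_per_sec)
    batches = [rate_per_sec] * full
    if remainder:
        batches.append(remainder)  # final partial batch
    return batches

print(plan_batches(25, 10))  # [10, 10, 5]
print(plan_batches(10, 1))   # ten batches of one message each
```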

In [39]:
# Send a value-only message (no key)
producer.produce(kafka_topic, value="hiii")

In [40]:
producer.produce(kafka_topic, key="demo1", value="Demo first")

In [41]:
producer.produce(kafka_topic, key="demo2", value="Demo second")

In [42]:
producer.produce(kafka_topic, key="demo3", value="Demo third")

In [43]:
producer.produce(kafka_topic, key="demo4", value="Demo fourth")
# produce() only queues messages; flush() blocks until they are delivered
producer.flush()
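
To confirm that the messages arrived, one option is to read a few of them back. The sketch below assumes a consumer object with the `subscribe`, `poll`, and `close` methods of `confluent_kafka.Consumer`. The helper name `read_messages` and the group id in the usage comment are hypothetical.

```python
def read_messages(consumer, topic, max_messages=5, timeout=5.0):
    """Poll up to max_messages from topic and return (key, value) pairs as strings."""
    results = []
    consumer.subscribe([topic])
    try:
        while len(results) < max_messages:
            msg = consumer.poll(timeout)
            if msg is None:      # nothing arrived within the timeout
                break
            if msg.error():      # stop on the first error reported by the client
                raise RuntimeError(str(msg.error()))
            key = msg.key().decode("utf-8") if msg.key() is not None else None
            value = msg.value().decode("utf-8") if msg.value() is not None else None
            results.append((key, value))
    finally:
        consumer.close()         # always release the consumer
    return results

# Usage with a real cluster (requires confluent_kafka and the producer config `conf`):
# from confluent_kafka import Consumer
# consumer = Consumer({**conf, "group.id": "users-check", "auto.offset.reset": "earliest"})
# for key, value in read_messages(consumer, "users"):
#     print(key, value)
```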