
### Overview
This guide walks through generating test data and ingesting it into Kafka, using the Confluent platform and Databricks. It uses the `faker` library to generate synthetic user records that resemble real-world data, and the Confluent CLI to manage Kafka topics. The setup is intended for development and testing, so that data pipelines can be exercised with realistic data before they are deployed to a production environment.

### Prerequisites
Before generating test data and writing it to Kafka, make sure the following prerequisites are met:
- **Confluent Account:** Sign up for an account with Confluent if you haven't already.
- **Confluent CLI:** Install and set up the Confluent CLI so that you can manage Kafka topics and inspect the data in them from the command line.
- **Kafka Topic:** Create a Kafka topic named `users` in Confluent Kafka without a schema, for example with `confluent kafka topic create users`.
- **Library Installations:** Install the `confluent_kafka` and `faker` libraries on your Databricks cluster.
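
As a quick check of the last prerequisite, the sketch below reports which of the two libraries cannot be imported in the current environment. The helper name `missing_packages` is hypothetical and not part of either library; it relies only on the standard library's `importlib`.

```python
import importlib.util

def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# Module names as they are imported later in this guide
required = ["confluent_kafka", "faker"]
missing = missing_packages(required)
if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```

If anything is reported as missing, install it on the cluster before running the remaining cells.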

### Agenda
This guide covers the following:
1. **Test Data Generation:** Use the `faker` library to generate test data.
2. **Data Ingestion into Kafka:** Write the generated test data to a Kafka topic named `users`.

In [0]:
from faker import Faker
import random
from datetime import datetime, timedelta

# Initialize Faker instance
faker = Faker()

# Function to generate random timestamps
def random_timestamp(start, end):
    return start + timedelta(
        seconds=random.randint(0, int((end - start).total_seconds())),
    )

# Generate data
def generate_user_data(num_users=10):
    users = []
    for _ in range(num_users):
        user_id = faker.uuid4()
        user_first_name = faker.first_name()
        user_last_name = faker.last_name()
        user_email = faker.email()
        created_ts = faker.date_time_this_decade()
        last_updated_ts = random_timestamp(created_ts, datetime.now())
        
        users.append({
            'user_id': user_id,
            'user_first_name': user_first_name,
            'user_last_name': user_last_name,
            'user_email': user_email,
            'created_ts': created_ts,
            'last_updated_ts': last_updated_ts
        })
    return users

# Generate and print user data
user_data = generate_user_data(10)
for user in user_data:
    print(user)

{'user_id': '9507b6dd-cf80-427a-85f4-3e05938f5f2d', 'user_first_name': 'Brian', 'user_last_name': 'Jackson', 'user_email': 'anthony33@example.org', 'created_ts': datetime.datetime(2021, 1, 18, 15, 49, 1, 765332), 'last_updated_ts': datetime.datetime(2021, 8, 19, 14, 35, 33, 765332)}
{'user_id': '7a6788c3-43da-42ce-b4c3-46254443a921', 'user_first_name': 'Jamie', 'user_last_name': 'Maxwell', 'user_email': 'kenneth35@example.org', 'created_ts': datetime.datetime(2022, 2, 24, 16, 10, 47, 394201), 'last_updated_ts': datetime.datetime(2023, 4, 21, 19, 6, 18, 394201)}
{'user_id': 'd04262b9-cc5e-4784-9d37-b5c646efa405', 'user_first_name': 'Melody', 'user_last_name': 'Simon', 'user_email': 'alyssarice@example.com', 'created_ts': datetime.datetime(2020, 8, 2, 11, 6, 19, 937776), 'last_updated_ts': datetime.datetime(2022, 4, 25, 7, 45, 36, 937776)}
{'user_id': 'c6008136-6b78-45f1-8d8f-026842b55ec0', 'user_first_name': 'Tyler', 'user_last_name': 'Allen', 'user_email': 'avelez@example.org', 'create
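
The `random_timestamp` helper above adds a whole number of seconds to `start`, which is why `created_ts` and `last_updated_ts` share the same microsecond value in the output. The sketch below restates the helper using only the standard library and checks that every sample falls inside the requested range. The explicit `end < start` guard is an addition for illustration and is not in the original cell.

```python
import random
from datetime import datetime, timedelta

def random_timestamp(start, end):
    """Return a random datetime in [start, end], offset from start by whole seconds."""
    if end < start:
        # Added guard (assumption): fail with a clear message instead of a randint error
        raise ValueError("end must not be earlier than start")
    span_seconds = int((end - start).total_seconds())
    return start + timedelta(seconds=random.randint(0, span_seconds))

start = datetime(2020, 1, 1)
end = datetime(2020, 1, 2)
samples = [random_timestamp(start, end) for _ in range(1000)]

# Every sample should fall inside the closed interval [start, end]
in_range = all(start <= ts <= end for ts in samples)
print(in_range)  # True
```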

In [0]:
from confluent_kafka import Producer, KafkaError

In [0]:
# Kafka and Confluent Cloud configuration
# Do not hard-code credentials in a notebook; replace the placeholders below
# with values loaded from a secret store or environment variables.
kafka_bootstrap_servers = "pkc-rgm37.us-west-2.aws.confluent.cloud:9092"
kafka_topic = "users"
kafka_api_key = "<CONFLUENT_API_KEY>"
kafka_api_secret = "<CONFLUENT_API_SECRET>"

In [0]:
# Producer configuration
conf = {
    'bootstrap.servers': kafka_bootstrap_servers,
    'sasl.mechanisms': 'PLAIN',
    'security.protocol': 'SASL_SSL',
    'sasl.username': kafka_api_key,
    'sasl.password': kafka_api_secret,
}
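
One way to keep credentials out of the notebook is to build the same configuration from environment variables. This is a minimal sketch, not the only option: the function name `build_producer_config` and the variable names `CONFLUENT_BOOTSTRAP_SERVERS`, `CONFLUENT_API_KEY`, and `CONFLUENT_API_SECRET` are hypothetical and should be adapted to your environment.

```python
import os

# Hypothetical environment variable names (assumptions, not a Confluent convention)
REQUIRED_VARS = (
    "CONFLUENT_BOOTSTRAP_SERVERS",
    "CONFLUENT_API_KEY",
    "CONFLUENT_API_SECRET",
)

def build_producer_config(env=os.environ):
    """Build the producer configuration dict from a mapping of environment variables."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        'bootstrap.servers': env["CONFLUENT_BOOTSTRAP_SERVERS"],
        'sasl.mechanisms': 'PLAIN',
        'security.protocol': 'SASL_SSL',
        'sasl.username': env["CONFLUENT_API_KEY"],
        'sasl.password': env["CONFLUENT_API_SECRET"],
    }

# Demonstration with a stand-in mapping; in practice, call build_producer_config()
demo_env = {
    "CONFLUENT_BOOTSTRAP_SERVERS": "broker.example.com:9092",
    "CONFLUENT_API_KEY": "demo-key",
    "CONFLUENT_API_SECRET": "demo-secret",
}
demo_conf = build_producer_config(demo_env)
print(demo_conf['bootstrap.servers'])  # broker.example.com:9092
```

On Databricks, the same values could instead come from a secret scope through `dbutils.secrets.get`; the scope and key names would depend on how your workspace is set up.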

In [0]:
producer = Producer(conf)

In [0]:
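# Generate a single random user record with ISO 8601 timestamp strings,
# so that the record can be serialized to JSON.
# Note: this redefines the earlier generate_user_data(num_users), which returned a list.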
def generate_user_data():
    user_id = faker.uuid4()
    user_first_name = faker.first_name()
    user_last_name = faker.last_name()
    user_email = faker.email()
    created_ts = faker.date_time_this_decade()
    last_updated_ts = faker.date_time_between(start_date=created_ts, end_date='now')
    
    return {
        'user_id': user_id,
        'user_first_name': user_first_name,
        'user_last_name': user_last_name,
        'user_email': user_email,
        'created_ts': created_ts.isoformat(),
        'last_updated_ts': last_updated_ts.isoformat()
    }
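
The timestamps are converted with `isoformat()` because `json.dumps` cannot serialize `datetime` objects directly. The sketch below illustrates this with a hand-written sample record; the helper `to_json_ready` and the sample values are hypothetical, and `datetime.fromisoformat` shows how a consumer could restore the original value.

```python
import json
from datetime import datetime

# Hand-written sample record (hypothetical values)
record = {'user_id': 'abc', 'created_ts': datetime(2021, 1, 18, 15, 49, 1)}

# json.dumps raises TypeError for datetime values
try:
    json.dumps(record)
    raw_serializable = True
except TypeError:
    raw_serializable = False

def to_json_ready(rec):
    """Return a copy of rec with datetime values replaced by ISO 8601 strings."""
    return {k: v.isoformat() if isinstance(v, datetime) else v for k, v in rec.items()}

payload = json.dumps(to_json_ready(record))

# A consumer can parse the string back into a datetime
restored = datetime.fromisoformat(json.loads(payload)['created_ts'])
print(raw_serializable)                 # False
print(restored == record['created_ts'])  # True
```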

In [0]:
import json
import time

In [0]:
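# Report the delivery result of each message; invoked from flush(), err is None on success
def delivery_report(err, msg):
    if err is not None:
        print(f"Delivery failed: {err}")

# Produce messages to Kafka in batches of rate_per_sec, pausing one second between batches
def produce_messages(num_messages=1000, rate_per_sec=10):
    # Only full batches are sent: num_messages is rounded down to a multiple of rate_per_sec
    for _ in range(num_messages // rate_per_sec):
        for _ in range(rate_per_sec):
            user_data = generate_user_data()
            producer.produce(kafka_topic, key=user_data['user_id'], value=json.dumps(user_data), on_delivery=delivery_report)
        producer.flush()  # Block until every message in the batch is delivered or has failed
        time.sleep(1)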
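
When `num_messages` is not a multiple of `rate_per_sec`, a loop over `num_messages // rate_per_sec` full batches leaves the remainder unsent. The sketch below shows one way to plan batch sizes so that the final, smaller batch is included. The helper `batch_sizes` is hypothetical and does not talk to Kafka; it only illustrates the arithmetic.

```python
def batch_sizes(num_messages, rate_per_sec):
    """Split num_messages into per-second batches, including a final partial batch."""
    if rate_per_sec <= 0:
        raise ValueError("rate_per_sec must be positive")
    full_batches, remainder = divmod(num_messages, rate_per_sec)
    sizes = [rate_per_sec] * full_batches
    if remainder:
        sizes.append(remainder)
    return sizes

print(batch_sizes(25, 10))         # [10, 10, 5]
print(sum(batch_sizes(1000, 10)))  # 1000
```

A producer loop could iterate over these sizes instead of a fixed batch count, sending one batch per second as before.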

In [0]:
# Produce 1000 messages at a rate of 10 messages per second
produce_messages(1000, 10)