To read from and write to Kafka using PySpark in a Jupyter Notebook, follow these steps:

### Step 1: Install Required Libraries

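Ensure you have `pyspark` installed. Spark talks to Kafka through its own connector (added in Step 4), so `kafka-python` is optional; it is only useful for producing or inspecting test messages from Python. You can install both using pip.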

```bash
# In a notebook cell, prefix the command with "!" (or use the %pip magic)
pip install pyspark kafka-python
```

### Step 2: Set Up Your Kafka Environment

Make sure you have a Kafka broker running and accessible. For local development, you can use tools like Docker to set up Kafka.
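
Before starting Spark, it can help to confirm that the broker address is reachable at all. The following is a minimal sketch using only the standard library; `broker_reachable` is a hypothetical helper name, and `localhost:9092` is the address assumed throughout this guide. A successful TCP connection only shows that the port is open, not that a healthy Kafka broker is behind it.

```python
import socket


def broker_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures
        return False


print(broker_reachable("localhost", 9092))
```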

### Step 3: Import Required Libraries

In your Jupyter Notebook, import the necessary libraries from PySpark.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import StringType
```

### Step 4: Create a SparkSession

Create a SparkSession and add the Spark-Kafka integration package (`spark-sql-kafka-0-10`) through `spark.jars.packages`. The package version should match your Spark version.

```python
spark = SparkSession.builder \
    .appName("KafkaPySparkIntegration") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1") \
    .getOrCreate()
```
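
A mismatch between the connector version and the installed Spark version is a common cause of startup failures. As a sketch, the coordinate can be built from a version string instead of being hard-coded. The helper name `kafka_connector_coordinate` is hypothetical, and the default Scala suffix `2.12` is an assumption that holds for the coordinate used above; check which Scala version your Spark build uses.

```python
def kafka_connector_coordinate(spark_version: str, scala_version: str = "2.12") -> str:
    """Build the Maven coordinate of the Spark-Kafka connector for a given Spark version."""
    return f"org.apache.spark:spark-sql-kafka-0-10_{scala_version}:{spark_version}"


print(kafka_connector_coordinate("3.5.1"))
```

In a notebook you could pass `pyspark.__version__` as `spark_version` and hand the result to `.config("spark.jars.packages", ...)`.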

### Step 5: Read Data from Kafka

Configure the Kafka parameters and read the stream.

```python
kafka_bootstrap_servers = "localhost:9092"  # Update with your Kafka broker address
kafka_topic = "your_input_topic"

df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
  .option("subscribe", kafka_topic) \
  .load()

# Select the key and value columns and cast them to strings
kafka_df = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```

### Step 6: Process the Stream (Optional)

You can apply any DataFrame transformation to the stream before writing it out. As an example, the following adds an uppercased copy of `value`.

```python
from pyspark.sql.functions import upper

processed_df = kafka_df.withColumn("uppercase_value", upper(col("value")))
```

### Step 7: Write Data to Kafka

Configure the Kafka parameters and write the stream.

```python
output_kafka_topic = "your_output_topic"

query = processed_df.selectExpr("CAST(key AS STRING)", "CAST(uppercase_value AS STRING) AS value") \
    .writeStream \
    .format("kafka") \
    .outputMode('update')\
    .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
    .option("topic", output_kafka_topic) \
    .option("checkpointLocation", "/path/to/checkpoint/dir") \
    .start()

query.awaitTermination()
```
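
Note that `query.awaitTermination()` blocks the notebook cell until the query stops. For experiments it can be more convenient to let the query run for a bounded time and then stop it. The sketch below assumes a `StreamingQuery`-like object exposing `awaitTermination(timeout)`, `isActive`, and `stop()`; the helper name `run_for` is hypothetical.

```python
def run_for(query, seconds: int) -> bool:
    """Let a streaming query run for up to `seconds`, then stop it if still active.

    Returns True if the query terminated on its own within the timeout.
    """
    terminated = query.awaitTermination(seconds)
    if query.isActive:
        query.stop()
    return bool(terminated)


# Usage with the query started above:
# run_for(query, 30)
```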

### Complete Example

Here is the complete example in one place:

```python
!pip install pyspark kafka-python

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper

# Create SparkSession with the Spark-Kafka connector on the classpath
# (the connector version should match your Spark version)
spark = SparkSession.builder \
    .appName("KafkaPySparkIntegration") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1") \
    .getOrCreate()

# Kafka configurations
kafka_bootstrap_servers = "localhost:9092"
kafka_topic = "your_input_topic"

# Read from Kafka
df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
  .option("subscribe", kafka_topic) \
  .load()

# Process the data: cast the binary key/value to strings, then uppercase the value
kafka_df = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
processed_df = kafka_df.withColumn("uppercase_value", upper(col("value")))

# Write to Kafka
output_kafka_topic = "your_output_topic"

query = processed_df.selectExpr("CAST(key AS STRING)", "CAST(uppercase_value AS STRING) AS value") \
    .writeStream \
    .format("kafka") \
    .outputMode('update')\
    .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
    .option("topic", output_kafka_topic) \
    .option("checkpointLocation", "/path/to/checkpoint/dir") \
    .start()

query.awaitTermination()
```

### Notes:

1. Replace `"localhost:9092"`, `"your_input_topic"`, and `"your_output_topic"` with your actual Kafka server address and topic names.
2. The Kafka sink requires `checkpointLocation`; Spark uses it to track the progress of the stream so the query can recover after a restart. Ensure the path is accessible and writable.
3. The transformations and processing steps are optional and can be customized based on your use case.
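
For a local setup, the checkpoint requirement in note 2 can be checked up front. This is a minimal sketch that assumes the checkpoint lives on the local filesystem (on a cluster it would normally be on HDFS-compatible storage instead); `prepare_checkpoint_dir` is a hypothetical helper name. Each streaming query should get its own checkpoint directory.

```python
import os
from pathlib import Path


def prepare_checkpoint_dir(path: str) -> str:
    """Create the checkpoint directory if needed and verify that it is writable."""
    directory = Path(path).expanduser().resolve()
    directory.mkdir(parents=True, exist_ok=True)
    if not os.access(directory, os.W_OK):
        raise PermissionError(f"Checkpoint directory is not writable: {directory}")
    return str(directory)


# Usage:
# .option("checkpointLocation", prepare_checkpoint_dir("/path/to/checkpoint/dir"))
```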

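You need to start ZooKeeper and the Kafka broker before creating the SparkSession and running your PySpark application. In ZooKeeper mode, Kafka depends on ZooKeeper for cluster coordination, so ZooKeeper must be running first. (Step 1 below applies only to ZooKeeper-based installations; Kafka running in KRaft mode does not use ZooKeeper.) Here are the detailed steps: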

### Step 1: Start ZooKeeper

First, start the ZooKeeper server. If you are using the Kafka binaries, you can start ZooKeeper with the following command:

```bash
# Navigate to your Kafka installation directory
cd /path/to/kafka

# Start ZooKeeper
bin/zookeeper-server-start.sh config/zookeeper.properties
```

### Step 2: Start Kafka Broker

Once ZooKeeper is running, start the Kafka broker:

```bash
# Navigate to your Kafka installation directory
cd /path/to/kafka

# Start Kafka broker
bin/kafka-server-start.sh config/server.properties
```

### Step 3: Create Kafka Topics

You need to create the input and output topics that your Spark application will use. You can create a Kafka topic using the following command:

```bash
# Create input topic
bin/kafka-topics.sh --create --topic your_input_topic --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1

# Create output topic
bin/kafka-topics.sh --create --topic your_output_topic --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1
```

### Step 4: Verify Kafka Topics

Ensure the topics are created correctly:

```bash
bin/kafka-topics.sh --list --bootstrap-server localhost:9092
```
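
With the topics in place, the streaming job needs some input to work on. The sketch below sends a few test messages from Python. It is written against a producer object that behaves like `kafka-python`'s `KafkaProducer` (`send(topic, key=..., value=...)` with byte payloads, and `flush()`); the helper name `send_test_messages` and the sample messages are hypothetical.

```python
from typing import Iterable, Tuple


def send_test_messages(producer, topic: str, messages: Iterable[Tuple[str, str]]) -> int:
    """Send (key, value) string pairs to `topic` and return how many were sent."""
    count = 0
    for key, value in messages:
        # Kafka stores raw bytes, so encode both key and value
        producer.send(topic, key=key.encode("utf-8"), value=value.encode("utf-8"))
        count += 1
    producer.flush()  # block until buffered messages are delivered
    return count


# Usage against a running broker (requires kafka-python):
# from kafka import KafkaProducer
# producer = KafkaProducer(bootstrap_servers="localhost:9092")
# send_test_messages(producer, "your_input_topic", [("k1", "hello"), ("k2", "world")])
```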

### Step 5: Start Jupyter Notebook and Create SparkSession

With ZooKeeper and Kafka running, you can now start Jupyter Notebook and create the SparkSession.

### Step 6: Follow the PySpark Kafka Integration Steps

Follow the previously mentioned steps to read from and write to Kafka using PySpark in Jupyter Notebook:

```python
!pip install pyspark kafka-python

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper

# Create SparkSession with the Spark-Kafka connector on the classpath
# (the connector version should match your Spark version)
spark = SparkSession.builder \
    .appName("KafkaPySparkIntegration") \
    .config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1") \
    .getOrCreate()

# Kafka configurations
kafka_bootstrap_servers = "localhost:9092"
kafka_topic = "your_input_topic"

# Read from Kafka
df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
  .option("subscribe", kafka_topic) \
  .load()

# Process the data: cast the binary key/value to strings, then uppercase the value
kafka_df = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
processed_df = kafka_df.withColumn("uppercase_value", upper(col("value")))

# Write to Kafka
output_kafka_topic = "your_output_topic"

query = processed_df.selectExpr("CAST(key AS STRING)", "CAST(uppercase_value AS STRING) AS value") \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
    .option("topic", output_kafka_topic) \
    .option("checkpointLocation", "/path/to/checkpoint/dir") \
    .start()

query.awaitTermination()
```

### Additional Notes

- Ensure you have set the correct paths to your Kafka installation directory in the commands.
- The `localhost:9092` address is used for local setups. For remote Kafka clusters, replace this with the appropriate address.
- The `checkpointLocation` is necessary for stateful stream processing. Ensure this path is accessible and writable.

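### Subscription Options

The Kafka source can subscribe to a single topic, to several topics, or to a topic-name pattern:

```python
# Subscribe to 1 topic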
df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
  .option("subscribe", "topic1") \
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# Subscribe to 1 topic, with headers
df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
  .option("subscribe", "topic1") \
  .option("includeHeaders", "true") \
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers")

# Subscribe to multiple topics
df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
  .option("subscribe", "topic1,topic2") \
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# Subscribe to a pattern
df = spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
  .option("subscribePattern", "topic.*") \
  .load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
```
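
The variants above differ only in their options, and the Kafka source accepts only one subscription option at a time. As a sketch, the options can be assembled and validated in plain Python before being passed to the reader. The helper name `kafka_source_options` and its parameters are hypothetical; the option keys are the ones used above.

```python
from typing import Dict, List, Optional


def kafka_source_options(
    bootstrap_servers: str,
    topics: Optional[List[str]] = None,
    pattern: Optional[str] = None,
    include_headers: bool = False,
) -> Dict[str, str]:
    """Build Kafka source options, requiring exactly one of `topics` or `pattern`."""
    if (topics is None) == (pattern is None):
        raise ValueError("Specify exactly one of `topics` or `pattern`")
    options = {"kafka.bootstrap.servers": bootstrap_servers}
    if topics is not None:
        options["subscribe"] = ",".join(topics)  # comma-separated topic list
    else:
        options["subscribePattern"] = pattern
    if include_headers:
        options["includeHeaders"] = "true"
    return options


print(kafka_source_options("host1:port1,host2:port2", topics=["topic1", "topic2"]))
```

The resulting dictionary can then be applied in one call, for example `spark.readStream.format("kafka").options(**opts).load()`.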