### A summary of how to read from and write to different streaming data sources and sinks using Spark

#### **1. Files:** CSV, JSON, Parquet, ORC, text, etc.
 
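    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, IntegerType, StringType

    csv_file_dir = '/path/to/csv_file'       # directory of CSV files that share one schema
    output_dir = '/path/to/output_files'     # directory where the output files are saved
    check_dir = '/path/to/checkpoint_dir'    # checkpoint location

    # Placeholder schema (hypothetical columns); replace it with the schema of your CSV files.
    schema_file = StructType([
        StructField('id', IntegerType(), True),
        StructField('name', StringType(), True),
    ])

    spark = SparkSession.builder\
            .appName('files-stream-csv')\
            .getOrCreate()

    # Read every CSV file that appears in csv_file_dir as a stream
    df = spark.readStream.format('csv')\
              .schema(schema_file)\
              .option('path', csv_file_dir)\
              .load()

    # Write the stream back out as CSV files; the file sink needs a checkpoint location
    csv_stream_query = df.writeStream.format('csv')\
                         .option('path', output_dir)\
                         .option('checkpointLocation', check_dir)\
                         .outputMode('append')\
                         .start()
    csv_stream_query.awaitTermination()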
    
**Remarks:**

- All input files must be in the same format and are expected to share the same schema.
- Structured Streaming can write streaming query output to files in the same formats it can read.
- The file sink supports only append mode.
- For CSV and JSON files, the schema must be specified explicitly (schema inference is off by default for streaming reads unless `spark.sql.streaming.schemaInference` is set to `true`).
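
The file source expects each file to appear in the watched directory all at once, which the Spark documentation suggests achieving with a file move. The following is a minimal sketch of that idea using only the Python standard library; the function name `drop_csv_atomically`, the staging directory, and the sample rows are assumptions made for illustration, and the rename is only atomic when both directories are on the same local filesystem.

```python
import csv
import os
import tempfile


def drop_csv_atomically(rows, staging_dir, input_dir, file_name):
    """Write rows to a CSV file in staging_dir, then move it into input_dir.

    The move is a rename, so a reader watching input_dir never sees a
    half-written file (assuming both directories share a filesystem).
    """
    os.makedirs(staging_dir, exist_ok=True)
    os.makedirs(input_dir, exist_ok=True)

    # Write the complete file outside the watched directory first.
    fd, tmp_path = tempfile.mkstemp(suffix='.csv', dir=staging_dir)
    with os.fdopen(fd, 'w', newline='') as f:
        csv.writer(f).writerows(rows)

    # Publish the finished file under its final name in a single step.
    final_path = os.path.join(input_dir, file_name)
    os.replace(tmp_path, final_path)
    return final_path


if __name__ == '__main__':
    base = tempfile.mkdtemp()
    path = drop_csv_atomically(
        [(1, 'alice'), (2, 'bob')],            # hypothetical rows: id, name
        os.path.join(base, 'staging'),
        os.path.join(base, 'input'),           # would play the role of csv_file_dir
        'batch-0001.csv',
    )
    print(path)
```

Pointing `input_dir` at the directory the streaming query reads from lets you feed the query one file at a time while testing.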
    

**2. Apache Kafka:** a popular publish/subscribe system. Apache Kafka is an open-source, distributed event-streaming platform, originally developed at LinkedIn and now maintained by the Apache Software Foundation. It is designed to handle real-time data feeds with high throughput and low latency. Its key features and components are summarized below.

#### Key Features:
1. **Scalability**: Kafka is designed to scale horizontally, allowing you to add more brokers to handle increased data loads.
2. **Fault Tolerance**: Data is replicated across multiple brokers, ensuring that even if one broker fails, the data remains available.
3. **High Throughput**: Kafka can handle millions of messages per second with very low latency, making it suitable for high-throughput use cases.
4. **Durability**: Messages are stored on disk and replicated across the cluster to prevent data loss.
5. **Real-time Processing**: Kafka provides real-time stream processing capabilities, making it ideal for applications that require immediate data analysis.

#### Core Components:
1. **Broker**: A Kafka cluster is made up of one or more servers, each called a broker. Brokers handle the storage and retrieval of messages.
2. **Topic**: Messages in Kafka are categorized into topics. A topic is a logical channel to which producers send messages and from which consumers read messages.
3. **Producer**: Producers are clients that send messages to Kafka topics. They can publish data to one or more topics.
4. **Consumer**: Consumers read messages from Kafka topics. They can subscribe to one or more topics and process the incoming messages.
5. **Partition**: Each topic is divided into partitions, which allow Kafka to parallelize the processing of messages. Partitions are the unit of parallelism, and replicating them across brokers provides fault tolerance.
6. **ZooKeeper**: Kafka has traditionally used Apache ZooKeeper to manage and coordinate the brokers, including controller election and configuration metadata. Newer Kafka releases can instead run in KRaft mode, which does not need ZooKeeper; the steps below assume a ZooKeeper-based setup.

#### Use Cases:
1. **Log Aggregation**: Kafka is commonly used to collect and aggregate log data from various systems for monitoring and analysis.
2. **Real-time Analytics**: Companies use Kafka to process streaming data in real time for analytics and decision-making.
3. **Data Integration**: Kafka serves as a data integration layer, allowing different systems to share and process data in real-time.
4. **Event Sourcing**: Kafka can be used to store and process events in an event-driven architecture.

#### How Kafka Works:
- **Producers** send data to Kafka topics.
- **Brokers** store the data in partitions within the topics.
- **Consumers** subscribe to topics and read the data from the partitions.
- Kafka ensures the data is replicated and distributed across multiple brokers to provide fault tolerance and high availability.
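
The flow above can be sketched with a toy in-memory model. This is an illustration only, not the Kafka client API: the classes `ToyTopic` and `ToyConsumer` are hypothetical, the CRC32 hash merely stands in for Kafka's real partitioner, and replication across brokers is left out.

```python
import zlib
from collections import defaultdict


class ToyTopic:
    """In-memory stand-in for one Kafka topic (illustration only)."""

    def __init__(self, name, num_partitions):
        self.name = name
        # Each partition is an append-only list; the list index is the offset.
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        """Append a message; the same key always maps to the same partition."""
        partition = zlib.crc32(key.encode('utf-8')) % len(self.partitions)
        self.partitions[partition].append((key, value))
        offset = len(self.partitions[partition]) - 1
        return partition, offset


class ToyConsumer:
    """Reads a topic and remembers, per partition, the next offset to read."""

    def __init__(self, topic):
        self.topic = topic
        self.next_offset = defaultdict(int)

    def poll(self):
        """Return all messages not yet seen, in per-partition order."""
        messages = []
        for p, log in enumerate(self.topic.partitions):
            start = self.next_offset[p]
            messages.extend(log[start:])
            self.next_offset[p] = len(log)
        return messages


if __name__ == '__main__':
    topic = ToyTopic('input-events', num_partitions=5)
    for i in range(4):
        topic.produce(key='user-1', value=f'click-{i}')
    consumer = ToyConsumer(topic)
    print(consumer.poll())   # the four messages, in the order they were produced
    print(consumer.poll())   # an empty list, because nothing new has arrived
```

The model shows two properties that matter when reading Kafka from Spark: ordering is guaranteed only within a partition, and a consumer's progress is just an offset per partition.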

Kafka's architecture and design make it a powerful tool for building robust, scalable, and real-time data processing pipelines.

**step 1:** Start ZooKeeper and Kafka (run from the Kafka installation directory, each in its own terminal):

     (a) bin/zookeeper-server-start.sh config/zookeeper.properties
     (b) bin/kafka-server-start.sh config/server.properties
 
**step 2:** Create topics (a replication factor of 3 needs at least three brokers; use `--replication-factor 1` on a single local broker):

     (a) bin/kafka-topics.sh --create --topic input-events --bootstrap-server localhost:9092 \
         --replication-factor 3 --partitions 5
     (b) bin/kafka-topics.sh --create --topic output-events --bootstrap-server localhost:9092 \
         --replication-factor 3 --partitions 5
     (c) bin/kafka-topics.sh --list --bootstrap-server localhost:9092
     
**step 3:** Produce messages to the input topic and consume messages from the output topic:

      (a) bin/kafka-console-producer.sh --topic input-events --bootstrap-server localhost:9092
      (b) bin/kafka-console-consumer.sh --topic output-events --from-beginning \
          --bootstrap-server localhost:9092
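
Typing test messages by hand gets tedious, so a small generator can feed the console producer, which sends each line of standard input as one message. This is a sketch under assumptions: the script name `gen_events.py`, the function `make_events`, and the event fields are all hypothetical.

```python
import json
import sys
import time


def make_events(n, start_id=0):
    """Yield n JSON strings, one event each (hypothetical fields)."""
    for i in range(start_id, start_id + n):
        event = {'id': i, 'name': f'user-{i}', 'ts': int(time.time())}
        yield json.dumps(event)


if __name__ == '__main__':
    # Usage (assuming this file is saved as gen_events.py):
    #   python gen_events.py | bin/kafka-console-producer.sh \
    #       --topic input-events --bootstrap-server localhost:9092
    for line in make_events(10):
        sys.stdout.write(line + '\n')
```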
    



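**step 4:** Read from `input-events` and write to `output-events` with Spark:

    from pyspark.sql import SparkSession

    output_kafka_topic = 'output-events'   # topic created in step 2

    # The connector version must match your Spark and Scala versions.
    spark = SparkSession.builder\
            .appName('kafka-stream')\
            .config('spark.jars.packages','org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.1')\
            .getOrCreate()

    # 'kafka.bootstrap.servers' takes a comma-separated list, e.g. 'host1:port1,host2:port2'
    df = spark.readStream.format('kafka')\
              .option('kafka.bootstrap.servers','localhost:9092')\
              .option('subscribe','input-events')\
              .load()

    # Kafka keys and values arrive as binary, so cast them to strings before writing
    query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
              .writeStream \
              .format("kafka") \
              .option("kafka.bootstrap.servers", 'localhost:9092') \
              .option("topic", output_kafka_topic) \
              .outputMode('update')\
              .option("checkpointLocation", "/path/to/checkpoint/dir") \
              .start()

    query.awaitTermination()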

**3. Writing to and reading from a storage system: Cassandra**

**step 1:** Start Cassandra and cqlsh (run from the Cassandra installation directory):

            (a) bin/cassandra -f
            (b) bin/cqlsh
            
**step 2:** Create the keyspace and table (Cassandra requires a primary key; `id` is assumed to be the key here):

       (a) cqlsh> create keyspace myspace with replication={'class':'SimpleStrategy','replication_factor':1};
       (b) cqlsh> create table if not exists myspace.my_tbl(id int primary key, name text, age int, nation text);
       
**step 3:** Start a SparkSession, then read from and write to the table:


       from pyspark.sql import SparkSession

       # The connection.port setting is optional; 9042 is Cassandra's default native port.
       spark = SparkSession.builder\
                           .appName('spark-cassandra')\
                           .config('spark.jars.packages',
                                   'com.datastax.spark:spark-cassandra-connector_2.12:3.5.1')\
                           .config('spark.cassandra.connection.host','localhost')\
                           .config('spark.cassandra.connection.port','9042')\
                           .getOrCreate()


       # Read the table into a DataFrame
       df = spark.read.format('org.apache.spark.sql.cassandra')\
                 .options(table='my_tbl',keyspace='myspace')\
                 .load()


       # Write a DataFrame to the table, appending to the existing rows
       df.write.format('org.apache.spark.sql.cassandra')\
               .options(table='my_tbl',keyspace='myspace')\
               .mode('append')\
               .save()
         
**Remarks:**
        
- The keyspace and table must exist before the Spark application runs.
- The table must have the same schema as the DataFrame.
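
The read and write above are batch operations. One common way to send a streaming DataFrame to Cassandra is `foreachBatch`, which hands each micro-batch to a function as an ordinary DataFrame, so the batch writer can be reused. The sketch below assumes the `my_tbl` table and `myspace` keyspace from step 2; the function names `write_batch_to_cassandra` and `start_cassandra_stream` are hypothetical, and `stream_df` stands for any streaming DataFrame whose columns match the table.

```python
def write_batch_to_cassandra(batch_df, batch_id):
    """Write one micro-batch to the Cassandra table.

    batch_df is a regular (non-streaming) DataFrame, so the same
    writer used for batch jobs applies. batch_id is unused here.
    """
    (batch_df.write
        .format('org.apache.spark.sql.cassandra')
        .options(table='my_tbl', keyspace='myspace')
        .mode('append')
        .save())


def start_cassandra_stream(stream_df, checkpoint_dir):
    """Start a streaming query that sends every micro-batch to Cassandra."""
    return (stream_df.writeStream
            .foreachBatch(write_batch_to_cassandra)
            .option('checkpointLocation', checkpoint_dir)
            .start())
```

A call such as `start_cassandra_stream(df, '/path/to/checkpoint/dir').awaitTermination()` would then run the query until it is stopped.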
            