## Streaming Data Sources and Sinks
- You can create DataFrames from streaming sources using
<b>SparkSession.readStream</b> and write the output from a result DataFrame using
<b>DataFrame.writeStream</b>. In PySpark both are properties, so they are accessed without parentheses.
- In each case, you specify the source or sink type with the
<b>format()</b> method.

## Files
- Structured Streaming supports reading and writing data streams to and from files in
the same formats as the ones supported in batch processing: <b>plain text, CSV, JSON,
Parquet, ORC, etc.</b>
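
As a rough illustration of the writing side, the sketch below starts a query that appends a streaming DataFrame to Parquet files. The function name `start_parquet_sink` and its parameters are hypothetical and chosen only for this example. It assumes `df` is a streaming DataFrame, and it sets a checkpoint location because the file sink needs one.

```python
def start_parquet_sink(df, output_dir, checkpoint_dir):
    """Start a streaming query that appends the rows of df to Parquet files.

    df is assumed to be a streaming DataFrame (for example, one returned by
    spark.readStream...load()). output_dir and checkpoint_dir are
    illustrative directory paths chosen by the caller.
    """
    return (
        df.writeStream
        .format("parquet")                             # file sink, Parquet format
        .outputMode("append")                          # the file sink supports append mode
        .option("path", output_dir)                    # where the output files are written
        .option("checkpointLocation", checkpoint_dir)  # progress tracking for the query
        .start()
    )
```

Calling `start_parquet_sink(df, "MyOutputStream/", "chkpnt")` would return a query handle that can later be stopped with `stop()`; both directory names here are placeholders.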

### Reading from files
- Structured Streaming can treat files written into a directory as a data stream.
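
One possible way to feed such a directory from Python is to write each file in a separate staging directory and then move it into the watched directory with a single rename, so that the file shows up in the directory listing already complete. The sketch below uses only the standard library. The helper name `drop_file_atomically` and the staging directory are assumptions made for illustration, and the rename is atomic only when both directories are on the same filesystem.

```python
import os
import tempfile


def drop_file_atomically(lines, input_dir, filename, staging_dir):
    """Write lines to a staging file, then move it into input_dir in one step.

    staging_dir is assumed to be outside input_dir but on the same
    filesystem, so that os.replace() is a single atomic rename.
    """
    os.makedirs(input_dir, exist_ok=True)
    os.makedirs(staging_dir, exist_ok=True)

    # Write the complete content where the stream reader is not looking.
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir)
    with os.fdopen(fd, "w") as f:
        for line in lines:
            f.write(line + "\n")

    # Publish the finished file under its final name.
    final_path = os.path.join(input_dir, filename)
    os.replace(tmp_path, final_path)
    return final_path
```

For example, `drop_file_atomically(["date,delay,distance,origin,destination", "1121215,-5,602,ABE,ATL"], "MyInputStream/", "part1.csv", "MyStaging/")` would publish one small file; the file name and `MyStaging/` are placeholders.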

In [1]:
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Read lines from a file stream").getOrCreate()

In [1]:
inputDirectory = 'MyInputStream/'

In [2]:
df = spark.readStream.format("text") \
    .load(inputDirectory)

In [3]:
writer = df.writeStream.outputMode("append") \
    .format("console") \
    .option("truncate", False) \
    .option("numRows", 100)
#     .option('checkpointLocation','chkpnt')
    

In [10]:
query = writer.start()

24/04/07 14:32:42 WARN ResolveWriteToStream: Temporary checkpoint location created which is deleted normally when the query didn't fail: /tmp/temporary-1bec4a0e-e686-45ad-a435-3029abf163fc. If it's required to delete it under any circumstances, please set spark.sql.streaming.forceDeleteTempCheckpointLocation to true. Important to know deleting temp checkpoint folder is best effort.
24/04/07 14:32:42 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.


-------------------------------------------
Batch: 0
-------------------------------------------
+--------------------------------------+
|value                                 |
+--------------------------------------+
|date,delay,distance,origin,destination|
|1121215,-5,602,ABE,ATL                |
|1121725,-1,602,ABE,ATL                |
|1131215,14,602,ABE,ATL                |
|1130600,-7,369,ABE,DTW                |
|1131725,-6,602,ABE,ATL                |
|1131230,-13,369,ABE,DTW               |
|1130625,29,602,ABE,ATL                |
|1131219,-8,569,ABE,ORD                |
|1140600,-9,369,ABE,DTW                |
|1141725,-9,602,ABE,ATL                |
|1141230,-8,369,ABE,DTW                |
|1140625,-5,602,ABE,ATL                |
|1141219,-10,569,ABE,ORD               |
|1150600,0,369,ABE,DTW                 |
|1151725,-6,602,ABE,ATL                |
|1151230,0,369,ABE,DTW                 |
|1150625,0,602,ABE,ATL                 |
|1150607,0,569,ABE,ORD                 |
|

24/04/07 14:33:05 WARN HadoopFSUtils: The directory file:/home/hatem/PySpark/Ubuntu_Final_Spark_Intake_43/L5_StructuredStreaming/MyInputStream/lu530236y701.tmp was not found. Was it deleted very recently?


-------------------------------------------
Batch: 1
-------------------------------------------
+-----+
|value|
+-----+
+-----+



In [12]:
query.stop()

- The returned streaming DataFrame will have the specified schema.
- Here are a few key points to remember when using files:
    - All the files must be of the same format and are expected to have the same schema. For example, if the format is "json", all the files must be in the JSON format with one JSON record per line. The schema of each JSON record must match the schema specified on readStream. Violation of these assumptions can lead to incorrect parsing (e.g., unexpected null values) or query failures.
    - Each file must appear in the directory listing atomically: the whole file must be available at once for reading, and once it is available, the file must not be updated or modified. This is because Structured Streaming processes the file when the engine finds it (using directory listing) and internally marks it as processed. Any later changes to that file will not be processed.
    - When there are multiple new files to process but the engine can only pick some of them in the next micro-batch (e.g., because of rate limits), it will select the files with the earliest timestamps. Within the micro-batch, however, there is no predefined order for reading the selected files; all of them will be read in parallel.
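
The points above can be sketched for the flight-delay files shown earlier, read as CSV instead of plain text. This is a minimal sketch under stated assumptions: the function name `read_flights_stream` is hypothetical, the column types in `FLIGHT_SCHEMA` are guesses based on the header row in the console output, and `maxFilesPerTrigger` is used here as one example of a rate limit on how many new files a micro-batch picks up.

```python
# Assumed schema for the flight-delay files, written as a DDL string.
# The column names come from the header row; the types are assumptions.
FLIGHT_SCHEMA = (
    "date STRING, delay INT, distance INT, origin STRING, destination STRING"
)


def read_flights_stream(spark, input_dir, max_files_per_trigger=1):
    """Return a streaming DataFrame over the CSV files found in input_dir.

    spark is assumed to be an existing SparkSession.
    """
    return (
        spark.readStream
        .format("csv")                                        # every file must be CSV
        .option("header", True)                               # skip the header line in each file
        .option("maxFilesPerTrigger", max_files_per_trigger)  # rate limit per micro-batch
        .schema(FLIGHT_SCHEMA)                                # every file must match this schema
        .load(input_dir)
    )
```

With the session and directory from the notebook cells, `read_flights_stream(spark, inputDirectory)` would give a DataFrame with five typed columns rather than a single `value` column. Fields that cannot be parsed with the assumed types would typically surface as null values.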