
In [2]:
#@title ###### Licensed to the Apache Software Foundation (ASF), Version 2.0 (the "License")

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

In [2]:
# # Install apache-beam with pip.
# !pip install --quiet apache-beam
# Create a directory for our data files.
!mkdir -p data

In [3]:
%%writefile ./data/my-text-file-1.txt
This is just a plain text file, UTF-8 strings are allowed 🎉.
Each line in the file is one element in the PCollection.

Writing ./data/my-text-file-1.txt


In [4]:
%%writefile data/my-text-file-2.txt
There are no guarantees on the order of the elements.
ฅ^•ﻌ•^ฅ

Writing data/my-text-file-2.txt


In [5]:
%%writefile data/penguins.csv

species,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g
0,0.2545454545454545,0.6666666666666666,0.15254237288135594,0.2916666666666667
0,0.26909090909090905,0.5119047619047618,0.23728813559322035,0.3055555555555556
1,0.5236363636363636,0.5714285714285713,0.3389830508474576,0.2222222222222222
1,0.6509090909090909,0.7619047619047619,0.4067796610169492,0.3333333333333333
2,0.509090909090909,0.011904761904761862,0.6610169491525424,0.5
2,0.6509090909090909,0.38095238095238104,0.9830508474576272,0.8333333333333334

Writing data/penguins.csv


## Reading from text files

We can use the
[`ReadFromText`](https://beam.apache.org/releases/pydoc/current/apache_beam.io.textio.html#apache_beam.io.textio.ReadFromText)
transform to read text files into `str` elements.

It takes a
[_glob pattern_](https://en.wikipedia.org/wiki/Glob_%28programming%29)
as an input, and reads all the files that match that pattern.
It returns one element for each line in the file.

For example, in the pattern `data/*.txt`, the `*` is a wildcard that matches anything. This pattern matches all the files in the `data/` directory with a `.txt` extension.
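
To get a feel for which paths a pattern like `data/*.txt` selects, here is a minimal sketch using Python's standard `fnmatch` module. The list of candidate paths is made up for illustration, and Beam resolves patterns through its own filesystem layer, so edge cases (such as whether `*` crosses directory separators) may differ from `fnmatch`.

```python
from fnmatch import fnmatchcase

# Hypothetical listing, used only to illustrate the matching rule.
candidates = [
    'data/my-text-file-1.txt',
    'data/my-text-file-2.txt',
    'data/penguins.csv',
    'output/file-00000-of-00001.txt',
]

pattern = 'data/*.txt'

# Keep only the paths that match the glob pattern.
matches = [path for path in candidates if fnmatchcase(path, pattern)]
print(matches)
```

Only the two `.txt` files under `data/` match; the CSV file and the file under `output/` are left out.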

In [6]:
import apache_beam as beam

# A glob pattern, not a directory: it matches every .txt file under ./data/.
file_pattern = './data/*.txt'

with beam.Pipeline() as pipeline:
  (
      pipeline
      | 'Reading files' >> beam.io.ReadFromText(file_pattern)
      | 'Printing contents' >> beam.Map(print)
  )



This is just a plain text file, UTF-8 strings are allowed 🎉.
Each line in the file is one element in the PCollection.
There are no guarantees on the order of the elements.
ฅ^•ﻌ•^ฅ
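
Because every line arrives as a plain `str`, a structured file such as `data/penguins.csv` would need an extra parsing step after `ReadFromText`. The sketch below shows one possible way to do that with the standard `csv` module. The helper name `parse_csv_line` is hypothetical, and it assumes every column is numeric, as in the sample file. It is written as a generator that yields nothing for blank or header lines, since the file as written above starts with an empty line.

```python
import csv

# Column names copied from the header of data/penguins.csv.
COLUMNS = [
    'species',
    'culmen_length_mm',
    'culmen_depth_mm',
    'flipper_length_mm',
    'body_mass_g',
]


def parse_csv_line(line):
  """Yields one dict for a data line, nothing for a blank or header line."""
  if not line.strip() or line.startswith(COLUMNS[0]):
    return
  values = next(csv.reader([line]))
  # Assumption: all columns hold numbers, so everything is parsed as float.
  yield {name: float(value) for name, value in zip(COLUMNS, values)}


sample_lines = [
    '',
    'species,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g',
    '2,0.509090909090909,0.011904761904761862,0.6610169491525424,0.5',
]

for line in sample_lines:
  print(list(parse_csv_line(line)))
```

A function shaped like this could be applied to each line of the file inside a pipeline; here it is only exercised on plain strings.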


## Writing to text files

We can use the
[`WriteToText`](https://beam.apache.org/releases/pydoc/2.27.0/apache_beam.io.textio.html#apache_beam.io.textio.WriteToText) transform to write `str` elements into text files.

It takes a _file path prefix_ as an input, and it writes all the `str` elements into one or more files with filenames starting with that prefix. You can optionally pass a `file_name_suffix` as well, usually used for the file extension. Each element goes on its own line in the output files.

### Example 1

In [8]:
import apache_beam as beam

output_filename_prefix = 'output/file'

with beam.Pipeline() as pipeline:
  (
      pipeline
      | 'Create text' >> beam.Create([
          'Each element must be a string.',
          'It writes one element per line.',
          'There are no guarantees on the line order.',
          'The data might be written into multiple files.',
      ])
      | 'Write text' >> beam.io.WriteToText(
          output_filename_prefix, file_name_suffix='.txt')
  )

In [10]:
!ls './output'

file-00000-of-00001.txt


In [17]:
!cat ./output/file*.txt 

Each element must be a string.
It writes one element per line.
There are no guarantees on the line order.
The data might be written into multiple files.
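
The `ls` output above shows how the prefix and suffix end up in the final filename, with a shard index and shard count in between. The sketch below mimics that naming in plain Python to make the pattern explicit. The helper `shard_filename` is hypothetical and not part of Beam, and the shard count of 3 is an arbitrary value chosen for illustration; in a real run the runner decides how many files to write.

```python
def shard_filename(prefix, shard_index, num_shards, suffix=''):
  """Builds a name like 'output/file-00000-of-00001.txt' (illustration only)."""
  return f'{prefix}-{shard_index:05d}-of-{num_shards:05d}{suffix}'


# The single file produced by Example 1.
print(shard_filename('output/file', 0, 1, '.txt'))

# What the names would look like if the same data were split into 3 files.
names = [shard_filename('output/file', i, 3, '.txt') for i in range(3)]
print(names)
```

This is also why the examples read the results back with a wildcard such as `output/file*.txt` rather than a fixed filename.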


### Example 2

In [19]:
with beam.Pipeline() as pipe:
  (pipe
   | 'Reading data' >> beam.io.ReadFromText('./data/*.txt')
   | 'Writing data' >> beam.io.WriteToText(file_path_prefix='./output/sample', file_name_suffix='.txt')
   )

In [22]:
! cat output/sample*.txt

This is just a plain text file, UTF-8 strings are allowed 🎉.
Each line in the file is one element in the PCollection.
There are no guarantees on the order of the elements.
ฅ^•ﻌ•^ฅ


## Reading from an `iterable`

The easiest way to create elements is using
[`FlatMap`](https://beam.apache.org/documentation/transforms/python/elementwise/flatmap).

A common way is to write a [`generator`](https://docs.python.org/3/glossary.html#term-generator) function. It can take an input and _expand_ it into a large number of elements. The nice thing about `generator`s is that they don't have to hold everything in memory like a `list`; they simply
[`yield`](https://docs.python.org/3/reference/simple_stmts.html#yield)
elements as they produce them.

For example, let's define a `generator` called `count` that `yield`s the numbers from `0` to `n - 1`. We use `Create` for the initial `n` value(s) and then expand them with `FlatMap`.

In [35]:
def count(n):
  for i in range(n):
    yield i

In [40]:
n = 5
with beam.Pipeline() as pipe:
  (
      pipe
      | 'Create inputs' >> beam.Create([2, n, 3])
      | 'Generate Elements' >> beam.FlatMap(count)
      | 'Printing' >> beam.Map(print)
  )

0
1
0
1
2
3
4
0
1
2
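
To make the memory argument concrete, here is a small sketch in plain Python, outside of any pipeline. It reuses the `count` generator from above and relies only on the standard library. The sizes reported by `sys.getsizeof` vary between Python versions, so only the comparison is shown, not absolute numbers.

```python
import sys
from itertools import islice


def count(n):
  for i in range(n):
    yield i


# A generator produces values on demand, so taking the first three elements
# of a very long sequence does not build the whole sequence in memory.
first_three = list(islice(count(10**12), 3))
print(first_three)

# A list holds all of its elements at once; a generator object only holds
# its current state.
as_list = list(range(100_000))
as_generator = count(100_000)
print(sys.getsizeof(as_list) > sys.getsizeof(as_generator))
```

Inside a pipeline the elements are eventually all produced, but the function itself never has to build the full collection first.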


## Creating an input transform

For a nicer interface, we could abstract the `Create` and the `FlatMap` into a custom `PTransform`. This would give a more intuitive way to use it, while hiding the inner workings.

Ordinarily this means writing a new class that inherits from `beam.PTransform`. The
[`beam.ptransform_fn`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.ptransform.html#apache_beam.transforms.ptransform.ptransform_fn) decorator offers a more convenient route: it turns a plain function into a `PTransform`.

The decorated function takes the input `PCollection` as its first argument. Any other inputs that the generator function needs, like `n`, become additional arguments to the `PTransform`. The original generator function can be defined locally within it.
Finally, we apply the `Create` and `FlatMap` transforms and return the resulting `PCollection`.

We can also, optionally, add type hints with the [`with_input_types`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.ptransform.html#apache_beam.transforms.ptransform.PTransform.with_input_types) and [`with_output_types`](https://beam.apache.org/releases/pydoc/current/apache_beam.transforms.ptransform.html#apache_beam.transforms.ptransform.PTransform.with_output_types) decorators. They serve as documentation and help ensure that data types stay consistent throughout your pipeline. This becomes more useful as the pipeline grows in complexity.

Since our `PTransform` is expected to be the first transform in the pipeline, it doesn't receive any inputs. We can mark it as the beginning with the [`PBegin`](https://beam.apache.org/releases/pydoc/current/_modules/apache_beam/pvalue.html) type hint.

Finally, to enable type checking, you can pass `--type_check_additional=all` when running your pipeline. Alternatively, you can pass it directly to `PipelineOptions` if you want it enabled by default. To learn more about pipeline options, see [Configuring pipeline options](https://beam.apache.org/documentation/programming-guide/#configuring-pipeline-options).

In [54]:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from typing import Iterable


@beam.ptransform_fn
@beam.typehints.with_input_types(beam.pvalue.PBegin)
@beam.typehints.with_output_types(int)
def Count(pbegin: beam.pvalue.PBegin, n: int) -> beam.PCollection[int]:

  # Generator defined locally: yields 0, 1, ..., n - 1.
  def count(n: int) -> Iterable[int]:
    for i in range(n):
      yield i
  
  return (
      pbegin
      |'Create elements' >> beam.Create([n])
      |'Generate elements' >> beam.FlatMap(count)
  )


n = 5
options = PipelineOptions(flags=[], type_check_additional='all')

with beam.Pipeline(options=options) as pipe:
  (
      pipe 
      | f'Count up to {n}' >> Count(n)
      | 'Print output' >> beam.Map(print)
  )

0
1
2
3
4
