# Reading and writing data -- _Tour of Beam_

So far we've learned some of the basic transforms like
[`Map`](https://beam.apache.org/documentation/transforms/python/elementwise/map),
[`FlatMap`](https://beam.apache.org/documentation/transforms/python/elementwise/flatmap),
[`Filter`](https://beam.apache.org/documentation/transforms/python/elementwise/filter),
[`Combine`](https://beam.apache.org/documentation/transforms/python/aggregation/combineglobally), and
[`GroupByKey`](https://beam.apache.org/documentation/transforms/python/aggregation/groupbykey).
These allow us to transform data in any way, but so far we've used
[`Create`](https://beam.apache.org/documentation/transforms/python/other/create)
to get data from an in-memory
[`iterable`](https://docs.python.org/3/glossary.html#term-iterable), like a `list`.

This works well for experimenting with small datasets. For larger datasets, we can use `Source` transforms to read data and `Sink` transforms to write data.
If no built-in `Source` or `Sink` transform fits our needs, we can also create our own custom I/O transforms.

Let's create some data files and see how we can read them in Beam.

In [1]:
%mkdir -p data

In [2]:
%%writefile data/my-text-file-1.txt
This is just a plain text file, UTF-8 strings are allowed 🎉.
Each line in the file is one element in the PCollection.

Writing data/my-text-file-1.txt


In [3]:
%%writefile data/my-text-file-2.txt
There are no guarantees on the order of the elements.
ฅ^•ﻌ•^ฅ

Writing data/my-text-file-2.txt


In [4]:
%%writefile data/penguins.csv
species,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g
0,0.2545454545454545,0.6666666666666666,0.15254237288135594,0.2916666666666667
0,0.26909090909090905,0.5119047619047618,0.23728813559322035,0.3055555555555556
1,0.5236363636363636,0.5714285714285713,0.3389830508474576,0.2222222222222222
1,0.6509090909090909,0.7619047619047619,0.4067796610169492,0.3333333333333333
2,0.509090909090909,0.011904761904761862,0.6610169491525424,0.5
2,0.6509090909090909,0.38095238095238104,0.9830508474576272,0.8333333333333334

Writing data/penguins.csv


# Reading from text files

We can use the
[`ReadFromText`](https://beam.apache.org/releases/pydoc/current/apache_beam.io.textio.html#apache_beam.io.textio.ReadFromText)
transform to read text files into `str` elements.

It takes a
[_glob pattern_](https://en.wikipedia.org/wiki/Glob_%28programming%29)
as an input, and reads all the files that match that pattern.
It produces one element for each line of every matching file.

For example, in the pattern `data/*.txt`, the `*` is a wildcard that matches anything. This pattern matches all the files in the `data/` directory with a `.txt` extension.
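To make the wildcard concrete, here is a minimal sketch that checks the three file names created above against the pattern. It uses Python's standard `fnmatch` module only as a stand-in: Beam resolves patterns through its own filesystem layer, so edge cases may differ. The `paths` list is written out by hand for illustration.

```python
from fnmatch import fnmatch

# File names written by the earlier cells (listed by hand for illustration).
paths = [
    'data/my-text-file-1.txt',
    'data/my-text-file-2.txt',
    'data/penguins.csv',
]

pattern = 'data/*.txt'

# Keep only the names that the wildcard pattern matches.
matches = [path for path in paths if fnmatch(path, pattern)]
print(matches)
```

Only the two `.txt` files match, so `data/penguins.csv` is not read by a pipeline that uses this pattern.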

In [10]:
import apache_beam as beam

files = 'data/*.txt'

with beam.Pipeline() as pipe:
    result = (
        pipe
        | "Read files" >> beam.io.ReadFromText(files)
        | "Print contents" >> beam.Map(print)
    )

There are no guarantees on the order of the elements.
ฅ^•ﻌ•^ฅ
This is just a plain text file, UTF-8 strings are allowed 🎉.
Each line in the file is one element in the PCollection.
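Since a CSV file is plain text, `ReadFromText` can read `data/penguins.csv` as well, but each element is still a raw `str` line that we have to parse ourselves. The following is a minimal sketch under a few assumptions: `parse_csv_line` and `run` are hypothetical names introduced here, the column list is copied by hand from the file's header row, and the `skip_header_lines` argument of `ReadFromText` is used to drop that header row.

```python
import csv

# Column names copied from the header row of data/penguins.csv.
COLUMNS = ['species', 'culmen_length_mm', 'culmen_depth_mm',
           'flipper_length_mm', 'body_mass_g']


def parse_csv_line(line):
    """Hypothetical helper: turn one CSV line (a str element) into a dict."""
    values = next(csv.reader([line]))
    row = dict(zip(COLUMNS, values))
    row['species'] = int(row['species'])
    for name in COLUMNS[1:]:
        row[name] = float(row[name])
    return row


def run(file_pattern='data/penguins.csv'):
    # Imported here so parse_csv_line can be tried without Beam installed.
    import apache_beam as beam

    with beam.Pipeline() as pipe:
        (
            pipe
            # skip_header_lines=1 drops the column-name row of each file.
            | 'Read CSV lines' >> beam.io.ReadFromText(
                file_pattern, skip_header_lines=1)
            | 'Parse lines' >> beam.Map(parse_csv_line)
            | 'Print rows' >> beam.Map(print)
        )
```

Calling `run()` prints one dictionary per data row. As with the text files, the order of the rows is not guaranteed.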


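# Writing to text files

We can use the
[`WriteToText`](https://beam.apache.org/releases/pydoc/current/apache_beam.io.textio.html#apache_beam.io.textio.WriteToText)
transform to write `str` elements into text files.
It takes a file path prefix as an input, and writes one element per line.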

In [11]:
output_file_prefix = 'output/file'

text = [
    'Each element must be a string.',
    'It writes one element per line.',
    'There are no guarantees on the line order.',
    'The data might be written into multiple files.',
]

with beam.Pipeline() as pipe:
    (
        pipe
        | "Create file lines" >> beam.Create(text)
        | "Write to files" >> beam.io.WriteToText(
            output_file_prefix,
            file_name_suffix=".txt"
        )
    )
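To check what the pipeline wrote, we can read the output files back. By default `WriteToText` inserts a shard suffix between the prefix and `file_name_suffix`, so the files get names such as `output/file-00000-of-00001.txt`, and a glob pattern is a convenient way to find them. This is a minimal sketch using only the standard library; `read_output_files` is a hypothetical helper, and the number of files you see depends on how the runner sharded the output.

```python
import glob


def read_output_files(pattern):
    """Hypothetical helper: map each matching file name to its lines."""
    contents = {}
    for file_name in sorted(glob.glob(pattern)):
        with open(file_name, encoding='utf-8') as f:
            contents[file_name] = f.read().splitlines()
    return contents


# Print every output file followed by its (indented) lines.
for file_name, lines in read_output_files('output/file*.txt').items():
    print(file_name)
    for line in lines:
        print('  ' + line)
```

If nothing matches the pattern, the loop prints nothing, which usually means the pipeline has not been run yet or used a different prefix.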
