# Reading Text Files

In this notebook we show how to read text files that numpy's `loadtxt` function can read.

These are essentially column-based text files.

The notebook also shows how Kosh lets you add metadata to the file, which in turn helps the loader (and potentially helps Kosh users pinpoint the actual text file they need).

## Reading in the whole text file

We will be using the text files in [this directory](../tests/baselines/npy/).

### Raw numpy

In [1]:
import numpy

filename = "../tests/baselines/npy/example_columns_no_header.txt"
data = numpy.loadtxt(filename)
print(data.shape)

(25, 6)
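If the test files are not at hand, the same call can be tried on a small file generated on the fly. The sketch below is only an illustration: the file name and the 5x3 array are made up for this example and are not part of the Kosh test data.

```python
import os
import tempfile

import numpy

# Hypothetical stand-in for the example file: 5 rows, 3 columns.
rows = numpy.arange(15, dtype=float).reshape(5, 3)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "columns_no_header.txt")
    # savetxt writes one row per line, with values separated by spaces
    numpy.savetxt(path, rows)
    loaded = numpy.loadtxt(path)

print(loaded.shape)  # (5, 3)
```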


### Kosh

Let's set up a Kosh store, create a dataset, and associate this file with it. Numpy's `loadtxt` is used via the `numpy/txt` mime_type.

In [2]:
import kosh

store = kosh.connect("numpy_loadtxt.sql", delete_all_contents=True)
dataset = store.create(name="example1")
dataset.associate(filename, mime_type="numpy/txt")
print("Features:", dataset.list_features())
print(dataset["features"][:].shape)

Features: ['features']
(25, 6)


## Slicing

While it is nice to be able to read the whole file, doing so can be very time consuming if the file gets big, and the file may not even fit into memory.

Kosh's loader can slice the data appropriately and read only the necessary part of the file, which avoids both problems:

In [3]:
print(dataset["features"][2:4, 1:5].shape)

(2, 4)
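For comparison, a partial read can also be sketched in plain numpy with the `skiprows`, `max_rows`, and `usecols` keywords of `loadtxt`. This is not necessarily how Kosh's loader works internally; the temporary file and its contents are assumptions made for the example.

```python
import os
import tempfile

import numpy

# Hypothetical data with the same shape as the example file above.
full = numpy.arange(150, dtype=float).reshape(25, 6)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "columns.txt")
    numpy.savetxt(path, full)
    # Rough equivalent of [2:4, 1:5]: skip 2 rows, read 2 rows,
    # and keep only columns 1 through 4.
    part = numpy.loadtxt(path, skiprows=2, max_rows=2, usecols=(1, 2, 3, 4))

print(part.shape)  # (2, 4)
```

Only the requested rows and columns are kept in the resulting array, which is what makes this approach attractive for large files.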


## Header rows

It is possible that the text file actually has a few header lines.

A good example is [example_non_hashed_header_rows.txt](../tests/baselines/npy/example_non_hashed_header_rows.txt).

Numpy's `loadtxt` cannot read the file as is (although you could pass the `skiprows` keyword):


In [4]:
filename = "../tests/baselines/npy/example_non_hashed_header_rows.txt"
try:
    data = numpy.loadtxt(filename)
except ValueError:
    print("Numpy cannot read this text file")

Numpy cannot read this text file
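The `skiprows` route can be sketched in plain numpy as follows. The file content here is a made-up miniature with two header lines that do not start with `#`; it is not the test file used in this notebook.

```python
import os
import tempfile

import numpy

# Hypothetical file: two non-hashed header lines followed by numeric rows.
content = "Some experiment\ntime value\n0.0 1.0\n1.0 2.0\n2.0 3.0\n"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "with_header.txt")
    with open(path, "w") as f:
        f.write(content)

    # Without skiprows, the header text cannot be converted to floats.
    try:
        numpy.loadtxt(path)
        failed = False
    except ValueError:
        failed = True

    # Skipping the two header lines leaves only numeric rows.
    data = numpy.loadtxt(path, skiprows=2)

print(failed, data.shape)  # True (3, 2)
```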


Similarly, Kosh's loader cannot read the file as is:

In [5]:
dataset = store.create(name="example_headers_rows")
associated = dataset.associate(filename, mime_type="numpy/txt", id_only=False)
try:
    print(dataset["features"][:].shape)
except ValueError:
    print("Cannot read as is")

Cannot read as is


Fortunately, we can add metadata to our Kosh-associated object to tell the loader what to do:

In [6]:
associated.skiprows = 6
print(dataset["features"][:].shape)

(25, 6)


## Column Headers

It is quite common for one of the header rows to contain the column names.

Let's add some metadata informing the loader which line contains the features.

In [7]:
associated.features_line = 5
# we'll need to clear the features cache
print(dataset.list_features(use_cache=False))

['time', 'zeros', 'ones', 'twos', 'threes', 'fours']


We can now access each feature/column separately, by name. This can be useful if you're reading data from text files that are organized differently but contain the same column names.

In [8]:
zeros = dataset["zeros"][:4]
print(zeros)

[0.65485361 0.04917816 0.20506388 0.24302516]
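To illustrate the idea of name-based access outside of Kosh, here is a minimal sketch in plain numpy. It reads the names from a header line and uses the position of a name to pick the column. The file content and the local `features_line` variable are assumptions for this example, and this is not presented as Kosh's actual implementation.

```python
import os
import tempfile

import numpy

# Hypothetical file: one line of column names, then numeric rows.
content = "time zeros ones\n0.0 0.1 1.1\n1.0 0.2 1.2\n2.0 0.3 1.3\n"

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "named_columns.txt")
    with open(path, "w") as f:
        f.write(content)

    features_line = 0  # index of the line holding the column names
    with open(path) as f:
        names = f.readlines()[features_line].split()

    # Look up the column position by name and read only that column.
    column = numpy.loadtxt(
        path, skiprows=features_line + 1, usecols=names.index("zeros")
    )

print(names)
print(column)
```

Because the column is located by name rather than by position, the same lookup works on files that store the columns in a different order.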


In some cases the column headers are laid out with a fixed width, which can cause two names to touch each other.

A good example is [../tests/baselines/npy/example_column_names_in_header_via_constant_width.txt](../tests/baselines/npy/example_column_names_in_header_via_constant_width.txt):

In [9]:
filename = "../tests/baselines/npy/example_column_names_in_header_via_constant_width.txt"
dataset = store.create(name="example_constant_width")
associated = dataset.associate(filename, mime_type="numpy/txt", id_only=False)
associated.skiprows = 1
associated.features_line = 0
associated.columns_width = 10
print(dataset.list_features())


['time', 'zeros col', 'ones  col', 'twos col', 'threes col', 'fours']
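The reason a fixed width is needed can be shown with a small pure-Python sketch. The helper `split_fixed_width` and the header line are hypothetical and only illustrate the idea; they are not Kosh's code.

```python
def split_fixed_width(line, width):
    """Cut a header line into consecutive chunks of `width` characters."""
    line = line.rstrip("\n")
    chunks = [line[i:i + width] for i in range(0, len(line), width)]
    # Strip the padding left over inside each chunk.
    return [chunk.strip() for chunk in chunks]


# Hypothetical header: every name is padded to 10 characters, so the
# 10-character name "threes col" touches the next name, "fours".
header = "".join(
    name.ljust(10) for name in ["time", "zeros col", "threes col", "fours"]
)

print(header.split())                 # whitespace splitting mangles the names
print(split_fixed_width(header, 10))  # fixed-width splitting recovers them
```

Splitting on whitespace breaks `zeros col` into two names and fuses `col` with `fours`, while cutting every 10 characters returns the four intended names.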
