## Python Ingestion
This notebook demonstrates how to use the demo Python ingestion code and discusses possible next steps as we work toward something that is ready to show clients.

### Overview
My idea is to split the overall task of data ingestion into two processes. The first process is handled by `DataParsers`, which are responsible for locating the files that contain data to ingest and turning that data into `Streams`. A `Stream` contains arrays of timestamps and values, as well as metadata (collection name, tags, annotations). The second process is handled by `DataIngestors`, which receive `Stream` objects, map each one to a BTrDB stream (creating a new stream if it doesn't exist yet), and insert its points 50k at a time.
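
To make the hand-off between the two processes concrete, here is a minimal sketch of what a `Stream`-like container and the 50k batching step could look like. The `Stream` dataclass and the `iter_batches` helper below are hypothetical stand-ins written for illustration; they are not the demo's actual classes, and the nanosecond timestamp unit is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class Stream:
    """Hypothetical stand-in for the demo's Stream object."""
    collection: str
    times: List[int]      # timestamps (assumed here to be nanoseconds since the epoch)
    values: List[float]
    tags: Dict[str, str] = field(default_factory=dict)
    annotations: Dict[str, str] = field(default_factory=dict)


def iter_batches(stream: Stream, batch_size: int = 50_000) -> Iterator[List[Tuple[int, float]]]:
    """Yield (time, value) pairs in chunks of at most batch_size points."""
    points = list(zip(stream.times, stream.values))
    for start in range(0, len(points), batch_size):
        yield points[start:start + batch_size]


# Toy data: 120,000 points should split into two full batches and one partial batch.
demo = Stream("test_ingest/sensor1", list(range(120_000)), [0.0] * 120_000, tags={"unit": "volts"})
batch_sizes = [len(batch) for batch in iter_batches(demo)]
print(batch_sizes)  # [50000, 50000, 20000]
```

In a real ingestor, each batch would be handed to the database client's insert call rather than just counted.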

This example uses `CSVParser`, which is an implementation of the `DataParser` interface. In my opinion, most ingestions will require the user to write a new `DataParser` class, because it has to contain bespoke code to find and parse files that will almost certainly have unique formats and oddities. Writing a valid `DataParser` will be the responsibility of the user, whereas the `DataIngestor` should be suitable for all use cases.
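
As a rough illustration of what writing a custom parser might involve, the sketch below defines a toy parser for headerless `timestamp;value` text files. Everything here is an assumption made for the example: the `DataParser` base class, the `FileInfo` record, and the method signatures are hypothetical and only loosely mirror the `collect_files` / `create_streams` calls used in this notebook. The demo's real interface may differ.

```python
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator, List, Tuple


@dataclass
class FileInfo:
    path: Path
    count: int  # number of data points found in the file


class DataParser(ABC):
    """Hypothetical interface; the demo's real base class may differ."""

    @abstractmethod
    def collect_files(self) -> List[FileInfo]: ...

    @abstractmethod
    def create_streams(self, files: List[FileInfo]) -> Iterator[Tuple[str, List[int], List[float]]]: ...


class SemicolonParser(DataParser):
    """Toy parser for headerless 'timestamp;value' files (an invented format)."""

    def __init__(self, fpath: str, collection_prefix: str):
        self.root = Path(fpath)
        self.collection_prefix = collection_prefix

    def collect_files(self) -> List[FileInfo]:
        files = []
        for path in sorted(self.root.glob("*.txt")):
            with path.open() as fh:
                count = sum(1 for line in fh if line.strip())
            files.append(FileInfo(path, count))
        return files

    def create_streams(self, files):
        for info in files:
            times, values = [], []
            with info.path.open() as fh:
                for line in fh:
                    line = line.strip()
                    if not line:
                        continue  # skip blank lines
                    t, v = line.split(";")
                    times.append(int(t))
                    values.append(float(v))
            # One collection per file, named after the file stem.
            yield f"{self.collection_prefix}/{info.path.stem}", times, values


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "meter_a.txt").write_text("1;0.5\n2;0.7\n3;0.9\n")
    parser = SemicolonParser(tmp, "test_ingest")
    found = parser.collect_files()
    parsed = list(parser.create_streams(found))
```

The point of the split is that only the file discovery and line parsing are format-specific; the shape of what `create_streams` yields stays the same for every parser.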

In [1]:
import os
import btrdb
from csv_parser import CSVParser
from ingest import DataIngestor

In [2]:
# instantiate CSVParser with path for stream data and collection prefix
cp = CSVParser(fpath="test_csvs/", collection_prefix="test_ingest")

# locate files and calculate total number of points
files = cp.collect_files(total=True)
total_points = sum(f.count for f in files)
total_points

6480000

In [3]:
# Connect to BTrDB, instantiate ingestor and insert data
conn = btrdb.connect(profile=os.environ["BTRDB_PROFILE"])

ingestor = DataIngestor(conn, total_points=total_points)
for streams in cp.create_streams(files):
    ingestor.ingest(streams)

6480060it [04:10, 28483.51it/s]                             

## Potential Next Steps
Here are a few ideas for improving this tool and making it ready for users:
* It would be really nice to set this up in a producer/consumer pattern, with multiple `DataParsers` parsing files into `Stream` objects and passing them to `DataIngestors` via a queue or some form of shared storage.
* We likely need to make this compatible with S3.
* We may want to provide a few strategies to help clients match metadata with streams. I took a very simple approach here, but it might be nice to let clients provide a YAML/JSON file specifying their metadata schema. Importman does this, but how it works is a bit confusing for new users, so we'll need to thoroughly document whatever solution we come up with.
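
The producer/consumer idea from the first bullet could be sketched roughly as follows, using only the standard library. The `parse_worker` and `ingest_worker` functions are hypothetical stand-ins for a `DataParser` and a `DataIngestor`: the producer fabricates toy points instead of reading files, and the consumer counts points instead of inserting them into BTrDB.

```python
import queue
import threading

SENTINEL = None  # pushed by each producer to signal that it has finished


def parse_worker(file_names, out_q):
    """Producer: stand-in for a DataParser turning files into Stream-like dicts."""
    for name in file_names:
        points = [(i, float(i)) for i in range(1000)]  # toy data, not read from disk
        out_q.put({"collection": f"test_ingest/{name}", "points": points})
    out_q.put(SENTINEL)


def ingest_worker(in_q, n_producers, inserted):
    """Consumer: stand-in for a DataIngestor; records point counts per collection."""
    finished = 0
    while finished < n_producers:
        item = in_q.get()
        if item is SENTINEL:
            finished += 1
            continue
        inserted[item["collection"]] = len(item["points"])


q = queue.Queue(maxsize=8)  # bounded so parsers cannot run far ahead of ingestion
file_groups = [["a.csv", "b.csv"], ["c.csv"]]
inserted = {}

producers = [threading.Thread(target=parse_worker, args=(group, q)) for group in file_groups]
consumer = threading.Thread(target=ingest_worker, args=(q, len(file_groups), inserted))
for t in producers + [consumer]:
    t.start()
for t in producers + [consumer]:
    t.join()

total = sum(inserted.values())
print(total)  # 3000
```

Threads are used here only to keep the sketch short. If parsing turns out to be CPU-bound, separate processes with a `multiprocessing` queue may be a better fit, and the bounded queue size would need tuning against the memory footprint of a `Stream`.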