# Uploading data to InfluxDB

## Requirements

1. Install [InfluxDB](https://www.influxdata.com/downloads/).
2. Export the InfluxDB API token as the `INFLUXDB_TOKEN` environment variable.
3. Download the preprocessed CSV data using `../scripts/fetch_data.py`.
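
Requirement 2 can be verified before any data is loaded. The following is a minimal sketch, assuming a hypothetical helper named `require_token` (it is not part of the notebook) that fails early when the variable is missing or blank:

```python
import os


def require_token(env=os.environ, name="INFLUXDB_TOKEN"):
    """Return the token stored under `name`, or raise if it is missing or blank."""
    # `env` is any mapping; it defaults to the process environment.
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(f"{name} is not set; export it before running the notebook")
    return token
```

Calling `require_token()` at the top of the notebook would turn a later authentication failure into an immediate, readable error.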

## Processing

In [1]:
import pandas as pd
import os
import influxdb_client
from influxdb_client.client.write_api import SYNCHRONOUS
import csv

### Setting up the InfluxDB connection

In [2]:
DATA_PREROCESSED_DIR = "../data_processed"
DATA_PREPROCESSED_FILE = DATA_PREROCESSED_DIR + "/preprocessed.csv"

bucket = "radem"
org = "radem"
token = os.environ.get("INFLUXDB_TOKEN")
url = "http://localhost:8086"

client = influxdb_client.InfluxDBClient(
    url=url,
    token=token,
    org=org
)

write_api = client.write_api(write_options=SYNCHRONOUS)
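
Before converting millions of rows, it may be worth confirming that the server is reachable. The sketch below assumes a hypothetical helper `check_connection` and relies only on the client's `ping()` method, which returns `True` when the InfluxDB instance responds. A successful ping most likely shows only that the server is up, not that the token is valid.

```python
def check_connection(client):
    """Raise ConnectionError unless the InfluxDB server answers the client's ping."""
    # `client` is expected to be an influxdb_client.InfluxDBClient (or anything with ping()).
    if not client.ping():
        raise ConnectionError(
            "InfluxDB did not respond; check the URL and that the server is running"
        )
```

In the notebook this would be called as `check_connection(client)` right after the client is created.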

### Reading preprocessed CSV data

In [3]:
df = pd.read_csv(DATA_PREPROCESSED_FILE)

# Convert time to ns for InfluxDB
df['time_ns'] = pd.to_datetime(df['time'], format="%Y-%m-%d %H:%M:%S.%f").astype('int64')

df

| Unnamed: 0 | time | event_type | channel | value | time_ns |
| --- | --- | --- | --- | --- | --- |
| 0 | 2023-10-11 00:00:32.787951 | e | 1 | 6 | 1696982432787951000 |
| 1 | 2023-10-11 00:00:32.787951 | e | 2 | 12 | 1696982432787951000 |
| 2 | 2023-10-11 00:00:32.787951 | e | 3 | 14 | 1696982432787951000 |
| 3 | 2023-10-11 00:00:32.787951 | e | 4 | 16 | 1696982432787951000 |
| 4 | 2023-10-11 00:00:32.787951 | e | 5 | 15 | 1696982432787951000 |
| ... | ... | ... | ... | ... | ... |
| 21613307 | 2023-10-06 23:59:23.156658 | d | 27 | 1 | 1696636763156658000 |
| 21613308 | 2023-10-06 23:59:23.156658 | d | 28 | 3 | 1696636763156658000 |
| 21613309 | 2023-10-06 23:59:23.156658 | d | 29 | 0 | 1696636763156658000 |
| 21613310 | 2023-10-06 23:59:23.156658 | d | 30 | 1 | 1696636763156658000 |
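
The `time_ns` column holds nanoseconds since the Unix epoch. To make the conversion concrete, here is a standard-library sketch for a single value. The helper name `to_epoch_ns` is hypothetical, and the timestamps are assumed to be UTC, which matches how pandas treats timezone-naive values when casting to `int64`.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ns(text, fmt="%Y-%m-%d %H:%M:%S.%f"):
    """Convert a timestamp string (assumed UTC) to integer nanoseconds since the epoch."""
    delta = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc) - EPOCH
    # Integer arithmetic only, so no precision is lost to floating point.
    whole_seconds = delta.days * 86_400 + delta.seconds
    return whole_seconds * 1_000_000_000 + delta.microseconds * 1_000


print(to_epoch_ns("2023-10-11 00:00:32.787951"))  # 1696982432787951000
```

The printed value matches the `time_ns` entry of the first row above.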


### Converting to InfluxDB Line Protocol

In [6]:
batch_size = 1000000

print(f"BATCH_SIZE = {batch_size}")
print(f"INPUT_SIZE = {len(df)}")

time_start = pd.Timestamp.now()
count = 0
for batch in range(0, len(df), batch_size):    
    batch_end = min(batch+batch_size-1, len(df)-1)
    batch_indices = slice(batch, batch_end)

    print(f"Processing batch of {batch_indices.stop - batch_indices.start + 1} records, from {batch_indices.start} to {batch_indices.stop}.")

    # The nanosecond timestamp is already available in `time_ns`, computed when the CSV was loaded.

    # Use vectorized operations to construct the line
    df.loc[batch_indices, 'line'] = (
        "my_measurement," +
        "event_type=" + df.loc[batch_indices, 'event_type'] + "," +
        "channel=" + df.loc[batch_indices, 'channel'].astype(str) + " " +
        "value=" + df.loc[batch_indices, 'value'].astype(str) + "i " +
        df.loc[batch_indices, 'time_ns'].astype(str)
    )

    count += len(df.loc[batch_indices, 'line'])

time_total = pd.Timestamp.now() - time_start
print(f"Processed {count} records in {time_total.total_seconds()} seconds")
print("SUCCESS")

BATCH_SIZE = 1000000
INPUT_SIZE = 21613312
Processing batch of 1000000 records, from 0 to 999999.
Processing batch of 1000000 records, from 1000000 to 1999999.
Processing batch of 1000000 records, from 2000000 to 2999999.
Processing batch of 1000000 records, from 3000000 to 3999999.
Processing batch of 1000000 records, from 4000000 to 4999999.
Processing batch of 1000000 records, from 5000000 to 5999999.
Processing batch of 1000000 records, from 6000000 to 6999999.
Processing batch of 1000000 records, from 7000000 to 7999999.
Processing batch of 1000000 records, from 8000000 to 8999999.
Processing batch of 1000000 records, from 9000000 to 9999999.
Processing batch of 1000000 records, from 10000000 to 10999999.
Processing batch of 1000000 records, from 11000000 to 11999999.
Processing batch of 1000000 records, from 12000000 to 12999999.
Processing batch of 1000000 records, from 13000000 to 13999999.
Processing batch of 1000000 records, from 14000000 to 14999999.
Processing batch of 1000
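
The vectorized expression above builds one line per row in the form `measurement,tags fields timestamp`. The same layout for a single record can be sketched in plain Python. The helpers `escape_tag` and `to_line` are hypothetical names. The escaping step is an addition the notebook does not need for this data set, since its tag values contain no commas, equals signs, or spaces.

```python
def escape_tag(text):
    """Backslash-escape the characters that are special in tag keys and tag values."""
    for special in (",", "=", " "):
        text = text.replace(special, "\\" + special)
    return text


def to_line(measurement, tags, value, time_ns):
    """Build one line-protocol record with a single integer field named `value`."""
    tag_part = ",".join(
        f"{escape_tag(str(key))}={escape_tag(str(val))}" for key, val in tags.items()
    )
    # The trailing `i` marks the field as an integer rather than a float.
    return f"{measurement},{tag_part} value={int(value)}i {time_ns}"


print(to_line("my_measurement", {"event_type": "e", "channel": 1}, 6, 1696982432787951000))
# my_measurement,event_type=e,channel=1 value=6i 1696982432787951000
```

A per-row function like this is far slower than the vectorized version on 21 million rows. It is meant only to show the structure of a single line.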

### Save data to InfluxDB Line Protocol file

Example line: `my_measurement,event_type=e,channel=0 value=123i 1556813561098000000`


In [7]:
df['line'].to_csv(DATA_PREROCESSED_DIR + "/influx_line_protocol.line", index=False, header=False, quoting=csv.QUOTE_NONE, sep='\n')

### Read data from InfluxDB Line Protocol file

In [8]:
df_lines = pd.read_csv(DATA_PREROCESSED_DIR + "/influx_line_protocol.line", header=None, sep='\0', names=['line'])
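
The two cells above use CSV separators that never occur in the data, so each line is written and read back as one unquoted field. An alternative that avoids this workaround is to treat the file as plain text. The following is a sketch with hypothetical helpers `write_lines` and `read_lines`, not the approach the notebook takes:

```python
from pathlib import Path


def write_lines(path, lines):
    """Write one line-protocol record per line, each terminated by a newline."""
    Path(path).write_text("".join(f"{line}\n" for line in lines), encoding="utf-8")


def read_lines(path):
    """Read the records back as a list of strings, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]
```

With the notebook's variables, `write_lines(DATA_PREROCESSED_DIR + "/influx_line_protocol.line", df['line'])` would write the same file, and `read_lines` returns a plain list that can be sliced into batches.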

### Upload data to InfluxDB

In [9]:
batch_size = 1000000
for batch in range(0, len(df_lines), batch_size):
    batch_end = min(batch+batch_size-1, len(df_lines)-1)
    batch_indices = slice(batch, batch_end)

    print(f"Uploading batch of {batch_indices.stop - batch_indices.start + 1} records, from {batch_indices.start} to {batch_indices.stop}.")

    write_api.write(bucket, org, df_lines.loc[batch_indices, 'line'])

write_api.flush()

Uploading batch of 1000000 records, from 0 to 999999.
Uploading batch of 1000000 records, from 1000000 to 1999999.
Uploading batch of 1000000 records, from 2000000 to 2999999.
Uploading batch of 1000000 records, from 3000000 to 3999999.
Uploading batch of 1000000 records, from 4000000 to 4999999.
Uploading batch of 1000000 records, from 5000000 to 5999999.
Uploading batch of 1000000 records, from 6000000 to 6999999.
Uploading batch of 1000000 records, from 7000000 to 7999999.
Uploading batch of 1000000 records, from 8000000 to 8999999.
Uploading batch of 1000000 records, from 9000000 to 9999999.
Uploading batch of 1000000 records, from 10000000 to 10999999.
Uploading batch of 1000000 records, from 11000000 to 11999999.
Uploading batch of 1000000 records, from 12000000 to 12999999.
Uploading batch of 1000000 records, from 13000000 to 13999999.
Uploading batch of 1000000 records, from 14000000 to 14999999.
Uploading batch of 1000000 records, from 15000000 to 15999999.
Uploading batch of 
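
The batching logic in the upload loop can be isolated and tested without a running server. The sketch below assumes two hypothetical helpers, `batch_ranges` and `upload`, and a `lines` argument that is a plain list of line-protocol strings. The default of 5,000 lines per request follows the batch size InfluxDB's write guidance generally suggests. It is an assumption here, not the notebook's setting of 1,000,000.

```python
def batch_ranges(total, batch_size):
    """Yield half-open (start, stop) index pairs that together cover range(total)."""
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)


def upload(write_api, bucket, org, lines, batch_size=5_000):
    """Send `lines` to InfluxDB in batches and return the number of lines sent."""
    sent = 0
    for start, stop in batch_ranges(len(lines), batch_size):
        # A list of line-protocol strings is one of the record types write() accepts.
        write_api.write(bucket=bucket, org=org, record=lines[start:stop])
        sent += stop - start
    return sent


print(list(batch_ranges(10, 4)))  # [(0, 4), (4, 8), (8, 10)]
```

Half-open ranges pair naturally with list slicing, so the last batch is simply shorter and no `- 1` adjustments are needed.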