<a href="https://colab.research.google.com/github/AlisonDavey/tinybird_examples/blob/main/wiki.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stream data from a Jupyter Notebook

### Using pandas DataFrames of recent changes to Wikipedia

- create a Data Source from 15 minutes of data in `df_wiki`

- append 5 minutes of data to the Data Source from `df_wiki_new`

Based on
https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams#When_not_to_use_EventStreams

Options for ingesting data:

1. Tinybird API
2. Tinybird UI
3. Tinybird CLI

## Create pandas DataFrames

In [1]:
!pip install sseclient
!pip install fsspec
!pip install ndjson



In [2]:
import json
import ndjson
from sseclient import SSEClient as EventSource

import fsspec
import time
from google.colab import files

import pandas as pd

In [3]:
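def create_df_wiki(url='https://stream.wikimedia.org/v2/stream/recentchange', n=5):
  """Collect `n` minutes of recent-change events into a DataFrame."""
  frames = []
  t_end = time.time() + 60 * n
  for event in EventSource(url):
    if time.time() > t_end:
      break
    if event.event != 'message':
      continue
    try:
      change = json.loads(event.data)
    except ValueError:
      continue  # skip events whose payload is not valid JSON
    if change['type'] != 'log':
      df = pd.DataFrame.from_dict(change)
      # Nested fields expand into several rows; keep the one labelled 'domain'.
      frames.append(df[df.index == 'domain'])
  # DataFrame.append was removed in pandas 2.0; concatenate once at the end.
  return pd.concat(frames) if frames else pd.DataFrame()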
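
To make the flattening step in `create_df_wiki` concrete, here is a minimal sketch using a hand-written event. The `sample_change` dictionary is hypothetical and heavily trimmed. It only assumes that `meta` is a nested object with a `domain` key, which the row filter in the function implies.

```python
import pandas as pd

# Hypothetical, heavily trimmed event; real recentchange events carry more fields.
sample_change = {
    "meta": {"domain": "en.wikipedia.org", "id": "abc-123"},
    "id": 1,
    "type": "edit",
    "title": "Example",
    "length": {"old": 10, "new": 12},
}

# Keys of nested dicts become index labels; scalar values repeat on every row.
df = pd.DataFrame.from_dict(sample_change)

# Keeping only the 'domain' row leaves one record whose `meta` column holds
# the domain string. Nested columns such as `length` are NaN in that row.
row = df[df.index == "domain"]
print(row[["meta", "id", "type", "title"]])
```

In this sketch the nested `length` column carries no value in the kept row, which is consistent with such columns being dropped after collection.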

Collect 15 minutes of data into a DataFrame to create the Data Source.

In [4]:
df_wiki = create_df_wiki(n=15)
df_wiki.drop(columns=['$schema','length','revision'], inplace=True)

In [5]:
df_wiki.info()

<class 'pandas.core.frame.DataFrame'>
Index: 22793 entries, domain to domain
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   meta                22793 non-null  object
 1   id                  22793 non-null  int64 
 2   type                22793 non-null  object
 3   namespace           22793 non-null  int64 
 4   title               22793 non-null  object
 5   comment             22793 non-null  object
 6   timestamp           22793 non-null  int64 
 7   user                22793 non-null  object
 8   bot                 22793 non-null  bool  
 9   server_url          22793 non-null  object
 10  server_name         22793 non-null  object
 11  server_script_path  22793 non-null  object
 12  wiki                22793 non-null  object
 13  parsedcomment       22793 non-null  object
 14  minor               14531 non-null  object
 15  patrolled           9545 non-null   object
dtypes: bool(1), int64(3), object(12)

Collect 5 more minutes of data into a DataFrame to append to the Data Source.

In [6]:
df_wiki_new = create_df_wiki(n=5)
df_wiki_new.drop(columns=['$schema','length','revision'], inplace=True)

## Option 1: Stream to Tinybird from the API

In [7]:
import csv
import requests

from io import StringIO
from requests.adapters import HTTPAdapter

from urllib3.util.retry import Retry
from urllib.parse import urlencode

In [8]:
token = '{TOKEN}'

# The placeholder above must be replaced with a real token before running.
if token in ('', '{TOKEN}'):
   print("Get your token from your Tinybird workspace.")

In [9]:
def ingest_from_array(rows,datasource, token, mode='append', endpoint='https://api.tinybird.co'):
  url = f'{endpoint}/v0/datasources?mode={mode}&name={datasource}'

  retry = Retry(total=5, backoff_factor=0.2)
  adapter = HTTPAdapter(max_retries=retry)
  _session = requests.Session()
  _session.mount('http://', adapter)
  _session.mount('https://', adapter)

  csv_chunk = StringIO()
  writer = csv.writer(csv_chunk, delimiter=',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC)

  max_wait_records = 5000
  max_wait_bytes = 32 * 1024 ** 2

  records = 0
  for row in rows:
    writer.writerow(row)
    records += 1

    if (records > max_wait_records and csv_chunk.tell() > max_wait_bytes) or len(rows) == records:
        data = csv_chunk.getvalue()
        headers = {
            'Authorization': f'Bearer {token}',
            'X-TB-Client': 'pltx-0.1',
        }

        ok = False
        try:
            response = _session.post(url, headers=headers, files=dict(csv=data))
            result = response.json()

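            ok = response.status_code < 400
            if ok:
                # Start a fresh buffer for the next chunk.
                csv_chunk = StringIO()
                writer = csv.writer(csv_chunk, delimiter=',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC)
                print(f"Flushed {len(data)} bytes, datasource={datasource}, response={response.status_code}")
                print(f"Result id={result.get('import_id', None)}, error={result.get('error', False)}")
            else:
                # Report failed requests instead of dropping them silently.
                print(f"Request failed, datasource={datasource}, response={response.status_code}, error={result.get('error', None)}")
        except Exception as e:
            print(e)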

  print('Done')
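
To see what the request body looks like, this small sketch serializes a header and one record with the same `csv.writer` settings used in `ingest_from_array`. The `sample_rows` values are made up for illustration. With `csv.QUOTE_NONNUMERIC`, strings are quoted and numbers are written bare.

```python
import csv
from io import StringIO

# Hypothetical sample rows: a header followed by one record.
sample_rows = [["id", "title"], [1, 'a "quoted" title']]

buffer = StringIO()
writer = csv.writer(buffer, delimiter=",", quotechar='"', quoting=csv.QUOTE_NONNUMERIC)
for sample_row in sample_rows:
    writer.writerow(sample_row)

# The buffer now holds the CSV text that would be posted as one chunk.
payload = buffer.getvalue()
print(payload)
```

Embedded double quotes are escaped by doubling them, so titles and comments containing quotes stay within a single field.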

### Create Data Source and Ingest
Column names are read from the header row (the first row sent), and column data types are inferred from the data.

In [10]:
datasource = 'wiki'
endpoint = 'https://api.tinybird.co'

mode = 'create'
rows= df_wiki.values.tolist()
rows.insert(0, df_wiki.columns.tolist())

ingest_from_array(rows, datasource, token, mode, endpoint)

Flushed 10152580 bytes, datasource=wiki, response=200
Result id=2e194aa0-f450-452a-8cc3-803d14604b9c, error=False
Done


### Append to Data Source

In [11]:
mode = 'append'
rows= df_wiki_new.values.tolist()
rows.insert(0, df_wiki_new.columns.tolist())

ingest_from_array(rows, datasource, token, mode, endpoint)

Flushed 3514165 bytes, datasource=wiki, response=200
Result id=02264dbb-69ea-4261-8fe9-15a15413c35a, error=False
Done


## Option 2: Download to a local file then ingest to Tinybird through the UI, from:

- CSV
- NDJSON

The column names and types can be changed in the preview in the UI, for example, the column `type` can be changed to `LowCardinality(String)`.

### Format CSV

In [25]:
df_wiki.to_csv("wiki_ui_csv.csv", index=False)
files.download('wiki_ui_csv.csv')


### Format NDJSON

In [26]:
df_wiki.to_json("wiki_ui_ndjson.ndjson", orient="records", lines=True, force_ascii=False)
files.download("wiki_ui_ndjson.ndjson")
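
For readers unfamiliar with NDJSON, the layout is one JSON object per line. The sketch below builds that layout with the standard `json` module from hypothetical `sample_records`. It illustrates the file shape rather than reproducing what pandas writes, so exact whitespace may differ. `ensure_ascii=False` plays the same role here as `force_ascii=False` in `to_json`.

```python
import json

# Hypothetical records standing in for DataFrame rows.
sample_records = [
    {"id": 1, "title": "Café", "bot": False},
    {"id": 2, "title": "Example", "bot": True},
]

# One JSON object per line; non-ASCII characters are kept as-is.
ndjson_text = "\n".join(
    json.dumps(record, ensure_ascii=False) for record in sample_records
) + "\n"
print(ndjson_text)
```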


## Option 3: Ingest to Tinybird from the CLI, from:
- CSV
- NDJSON

To generate the schema from a file, use
```
!tb --token=$token datasource generate wiki_cli_csv.csv
```

or
```
!tb --token=$token datasource generate wiki_cli_ndjson.ndjson
```

or define it directly (as shown here) with data types, sorting key, etc.

In [27]:
!pip install tinybird-cli -q -U

In [28]:
token = '{TOKEN}'

# The placeholder above must be replaced with a real token before running.
if token in ('', '{TOKEN}'):
   print("Get your token from your Tinybird workspace.")

In [29]:
def write_text_to_file(filename, text):
  with open(filename, 'w') as f: f.write(text)

### Format CSV

In [30]:
df_wiki.to_csv("wiki_cli_csv.csv", index=False)

In [31]:
filename = 'wiki_cli_csv.datasource'
text='''
SCHEMA >
    `meta` LowCardinality(String),
    `id` Int64,
    `type` String,
    `namespace` Int16,
    `title` String,
    `comment` Nullable(String),
    `timestamp` Int64,
    `user` String,
    `bot` String,
    `minor` Nullable(String),
    `patrolled` Nullable(String),
    `server_url` LowCardinality(String),
    `server_name` LowCardinality(String),
    `server_script_path` String,
    `wiki` LowCardinality(String),
    `parsedcomment` Nullable(String)

ENGINE "MergeTree"
ENGINE_SORTING_KEY "timestamp"
'''

write_text_to_file(filename, text)
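
If the column list is already available in Python, the same kind of definition could be rendered from a mapping instead of being typed by hand. This is only a sketch: `build_datasource_text` and `sample_columns` are hypothetical names, not part of the Tinybird CLI, and the output simply mirrors the layout of the hand-written text above.

```python
def build_datasource_text(columns, sorting_key, engine="MergeTree"):
    """Hypothetical helper: render a .datasource definition from a {name: type} mapping."""
    column_lines = ",\n".join(
        f"    `{name}` {col_type}" for name, col_type in columns.items()
    )
    return (
        "SCHEMA >\n"
        f"{column_lines}\n"
        "\n"
        f'ENGINE "{engine}"\n'
        f'ENGINE_SORTING_KEY "{sorting_key}"\n'
    )

# Illustrative subset of the columns used in this notebook.
sample_columns = {"id": "Int64", "type": "String", "timestamp": "Int64"}
datasource_text = build_datasource_text(sample_columns, sorting_key="timestamp")
print(datasource_text)
```

The resulting string can be passed to `write_text_to_file` in the same way as the literal definition.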

In [32]:
!tb --token=$token push wiki_cli_csv.datasource
!tb --token=$token datasource append wiki_cli_csv wiki_cli_csv.csv

** Processing wiki_cli_csv.datasource
** Building dependencies
** Running wiki_cli_csv
** 'wiki_cli_csv' created
** Not pushing fixtures
** 🥚 starting import process
** 🐥 done
** Total rows in wiki_cli_csv: 22793
** Data appended to Data Source 'wiki_cli_csv' successfully!
** Data pushed to wiki_cli_csv


### Format NDJSON

In [33]:
df_wiki.to_json("wiki_cli_ndjson.ndjson", orient="records", lines=True, force_ascii=False)

In [34]:
filename = 'wiki_cli_ndjson.datasource'
text='''
SCHEMA >

    bot UInt8 `json:$.bot`,
    comment Nullable(String) `json:$.comment`,
    id Int64 `json:$.id`,
    meta LowCardinality(String) `json:$.meta`,
    minor Nullable(UInt8) `json:$.minor`,
    namespace Int16 `json:$.namespace`,
    parsedcomment Nullable(String) `json:$.parsedcomment`,
    patrolled Nullable(UInt8) `json:$.patrolled`,
    server_name String `json:$.server_name`,
    server_script_path String `json:$.server_script_path`,
    server_url String `json:$.server_url`,
    timestamp Int64 `json:$.timestamp`,
    title String `json:$.title`,
    type String `json:$.type`,
    user String `json:$.user`,
    wiki LowCardinality(String) `json:$.wiki`
    
ENGINE "MergeTree"
ENGINE_SORTING_KEY "timestamp"
'''

write_text_to_file(filename, text)

In [35]:
!tb --token=$token push wiki_cli_ndjson.datasource
!tb --token=$token datasource append wiki_cli_ndjson wiki_cli_ndjson.ndjson

** Processing wiki_cli_ndjson.datasource
** Building dependencies
** Running wiki_cli_ndjson
** 'wiki_cli_ndjson' created
** Not pushing fixtures
** 🥚 starting import process
** 🐥 done
** Appended 0 new rows
** Total rows in wiki_cli_ndjson: 22793
** Data appended to Data Source 'wiki_cli_ndjson' successfully!
** Data pushed to wiki_cli_ndjson
