<a href="https://colab.research.google.com/github/AlisonDavey/tinybird_examples/blob/main/wiki.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Stream data from a Jupyter Notebook

### Using pandas DataFrames of recent changes to Wikipedia

- create a Data Source from 15 minutes of data in `df_wiki`

- append 5 minutes of data to the Data Source from `df_wiki_new`

Based on
https://wikitech.wikimedia.org/wiki/Event_Platform/EventStreams#When_not_to_use_EventStreams

Options for ingesting data:

- API
- UI
- CLI

## Create pandas DataFrames

In [47]:
!pip install sseclient
!pip install fsspec
!pip install ndjson



In [2]:
import json
import ndjson
from sseclient import SSEClient as EventSource

import fsspec
import time
from google.colab import files

import pandas as pd

In [3]:
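def create_df_wiki(url='https://stream.wikimedia.org/v2/stream/recentchange', n=5):
  """Collect n minutes of recent-change events into a DataFrame."""
  frames = []
  t_end = time.time() + 60 * n
  for event in EventSource(url):
    if time.time() > t_end:
      break
    if event.event != 'message':
      continue
    try:
      change = json.loads(event.data)
    except ValueError:
      continue  # skip events whose payload is not valid JSON
    if change['type'] != 'log':
      # Nested fields (meta, length, revision) expand into one row per key;
      # keep only the row where the meta column holds the wiki domain.
      df = pd.DataFrame.from_dict(change)
      frames.append(df[df.index == 'domain'])
  # DataFrame.append was removed in pandas 2.0; concatenate once instead
  return pd.concat(frames) if frames else pd.DataFrame()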
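
The row selection in `create_df_wiki` is compact but not obvious: keeping the `domain` row leaves `meta` holding only the wiki domain, while the nested `length` and `revision` values end up empty, which is why they are dropped afterwards. A minimal sketch of the same flattening without pandas follows. The helper `flatten_change` and the abbreviated sample event are hypothetical and only illustrate the idea; real events carry more fields.

```python
import json

# Hypothetical, abbreviated recentchange event (real events carry more fields)
SAMPLE_EVENT = json.dumps({
    "$schema": "/mediawiki/recentchange/1.0.0",
    "meta": {"domain": "en.wikipedia.org", "stream": "mediawiki.recentchange"},
    "type": "edit",
    "title": "Example",
    "bot": False,
    "length": {"old": 100, "new": 120},
    "revision": {"old": 1, "new": 2},
})

def flatten_change(change):
    """Return a flat record: meta becomes its domain, other nested objects are dropped."""
    record = {}
    for key, value in change.items():
        if key == "meta":
            record[key] = value.get("domain")
        elif not isinstance(value, dict):
            record[key] = value
    return record

record = flatten_change(json.loads(SAMPLE_EVENT))
print(record)
```

A list of such records can be turned into a DataFrame in one step with `pd.DataFrame(records)`.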

DataFrame of 15 minutes of data

In [46]:
df_wiki = create_df_wiki(n=15)
df_wiki.drop(columns=['$schema','length','revision'], inplace=True) # drop unwanted columns

In [5]:
df_wiki.info()

<class 'pandas.core.frame.DataFrame'>
Index: 22310 entries, domain to domain
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   meta                22310 non-null  object
 1   id                  22310 non-null  int64 
 2   type                22310 non-null  object
 3   namespace           22310 non-null  int64 
 4   title               22310 non-null  object
 5   comment             22310 non-null  object
 6   timestamp           22310 non-null  int64 
 7   user                22310 non-null  object
 8   bot                 22310 non-null  bool  
 9   server_url          22310 non-null  object
 10  server_name         22310 non-null  object
 11  server_script_path  22310 non-null  object
 12  wiki                22310 non-null  object
 13  parsedcomment       22310 non-null  object
 14  minor               14452 non-null  object
 15  patrolled           9921 non-null   object
dtypes: bool(1), int64(3),

DataFrame of 5 minutes of data

In [48]:
df_wiki_new = create_df_wiki(n=5)
df_wiki_new.drop(columns=['$schema','length','revision'], inplace=True) # drop unwanted columns

## Option 1: Stream to Tinybird using the API

In [49]:
import csv
import requests

from io import StringIO
from requests.adapters import HTTPAdapter

from urllib3.util.retry import Retry
from urllib.parse import urlencode

In [50]:
def ingest_from_array(rows, datasource, token, mode='append', endpoint='https://api.tinybird.co'):
  # Build the query string with urlencode so the parameters are escaped
  params = urlencode({'mode': mode, 'name': datasource})
  url = f'{endpoint}/v0/datasources?{params}'

  # Retry up to 5 times with backoff (urllib3 defaults decide which failures qualify)
  retry = Retry(total=5, backoff_factor=0.2)
  adapter = HTTPAdapter(max_retries=retry)
  _session = requests.Session()
  _session.mount('http://', adapter)
  _session.mount('https://', adapter)

  headers = {
    'Authorization': f'Bearer {token}',
    'X-TB-Client': 'pltx-0.1',
  }

  csv_chunk = StringIO()
  writer = csv.writer(csv_chunk, delimiter=',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC)

  max_wait_records = 5000
  max_wait_bytes = 32 * 1024 ** 2

  records = 0
  for row in rows:
    writer.writerow(row)
    records += 1

    # Flush when both thresholds are exceeded, or after the last row
    if (records > max_wait_records and csv_chunk.tell() > max_wait_bytes) or len(rows) == records:
      data = csv_chunk.getvalue()
      try:
        response = _session.post(url, headers=headers, files=dict(csv=data))
        result = response.json()
        if response.status_code < 400:
          # Start a fresh buffer only after a successful upload
          csv_chunk = StringIO()
          writer = csv.writer(csv_chunk, delimiter=',', quotechar='"', quoting=csv.QUOTE_NONNUMERIC)
          print(f"Flushed {len(data)} bytes, datasource={datasource}, response={response.status_code}")
          print(f"Result id={result.get('import_id', None)}, error={result.get('error', False)}")
        else:
          # Report failures instead of dropping them silently
          print(f"Upload failed, response={response.status_code}, error={result.get('error', False)}")
      except Exception as e:
        print(e)

  print('Done')
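
The buffering inside `ingest_from_array` can be looked at separately from the HTTP call. The following is a minimal sketch under stated assumptions, not the function above: the hypothetical generator `iter_csv_chunks` starts a new chunk when either limit is reached (the original condition requires both), resets its counter per chunk, and uses the character count of the buffer as a proxy for bytes.

```python
import csv
from io import StringIO

def iter_csv_chunks(rows, max_records=5000, max_chars=32 * 1024 ** 2):
    """Yield CSV strings, starting a new chunk once either limit is reached."""
    buffer = StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC)
    records = 0
    for row in rows:
        writer.writerow(row)
        records += 1
        # tell() counts characters, used here as a rough stand-in for bytes
        if records >= max_records or buffer.tell() >= max_chars:
            yield buffer.getvalue()
            buffer = StringIO()
            writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC)
            records = 0
    if records:
        # Emit whatever is left after the last full chunk
        yield buffer.getvalue()

rows = [["id", "title"], [1, "a"], [2, "b"], [3, "c"], [4, "d"]]
chunks = list(iter_csv_chunks(rows, max_records=2))
print(len(chunks))
```

Each yielded string could then be posted on its own, which keeps the upload code free of buffer bookkeeping.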

### Create Data Source and Ingest

In [51]:
datasource = 'wiki'
# token = '{TOKEN}'  # uncomment and set your Tinybird token before running the cells below
endpoint = 'https://api.tinybird.co'

In [52]:
mode = 'create'
rows= df_wiki.values.tolist()
rows.insert(0, df_wiki.columns.tolist())

ingest_from_array(rows, datasource, token, mode, endpoint)

Flushed 10297052 bytes, datasource=wiki, response=200
Result id=9b4c777a-cea4-49ba-a05e-dcf77b9e9542, error=False
Done


### Append to Data Source

In [53]:
mode = 'append'
rows= df_wiki_new.values.tolist()
rows.insert(0, df_wiki_new.columns.tolist())

ingest_from_array(rows, datasource, token, mode, endpoint)

Flushed 4036348 bytes, datasource=wiki, response=200
Result id=987bb19c-e52f-4aff-a7c8-f240cb64acae, error=False
Done


## Option 2: Download to a local file then ingest to Tinybird through the UI

- CSV
- NDJSON

In the UI preview, check the column names and types before importing; the column `type` can be changed to `LowCardinality(String)`.
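
`LowCardinality(String)` suits columns that repeat a small set of values. A quick way to judge whether a column qualifies is to compare its number of distinct values with its number of rows. The sketch below uses a hypothetical handful of records and a hypothetical helper `distinct_ratio`; on the DataFrame from this notebook, `df_wiki['type'].nunique()` gives the distinct count directly.

```python
from collections import Counter

# Hypothetical sample of flattened change records
records = [
    {"type": "edit", "title": "A"},
    {"type": "edit", "title": "B"},
    {"type": "new", "title": "C"},
    {"type": "categorize", "title": "D"},
    {"type": "edit", "title": "E"},
]

def distinct_ratio(records, column):
    """Fraction of rows holding a distinct value in `column` (lower means more repetition)."""
    values = [r[column] for r in records]
    return len(set(values)) / len(values)

type_ratio = distinct_ratio(records, "type")
title_ratio = distinct_ratio(records, "title")

# Frequency of each value in the candidate column
print(Counter(r["type"] for r in records).most_common())
```

On a real sample of thousands of rows, a column like `type` would show a ratio close to zero, while `title` would stay close to one.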

### Format CSV

In [54]:
df_wiki.to_csv("wiki_ui_csv.csv", index=False)
files.download('wiki_ui_csv.csv')


### Format NDJSON

In [55]:
df_wiki.to_json("wiki_ui_ndjson.ndjson", orient="records", lines=True, force_ascii=False)
files.download("wiki_ui_ndjson.ndjson")
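
NDJSON stores one JSON object per line, which is what `orient="records", lines=True` produces. As a minimal sketch of the format using only the standard library (the sample records are hypothetical stand-ins for rows of `df_wiki`):

```python
import json

# Hypothetical records standing in for rows of df_wiki
records = [
    {"title": "Café", "type": "edit", "bot": False},
    {"title": "Example", "type": "new", "bot": True},
]

# ensure_ascii=False keeps non-ASCII characters readable,
# comparable to disabling force_ascii in DataFrame.to_json
ndjson_text = "\n".join(json.dumps(r, ensure_ascii=False) for r in records) + "\n"

# Reading it back: parse each non-empty line independently
parsed = [json.loads(line) for line in ndjson_text.splitlines() if line]
print(ndjson_text)
```

Because every line is a complete document, a file in this format can be split or appended to at any line boundary.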


## Option 3: Ingest to Tinybird from the CLI
- CSV
- NDJSON

In [56]:
!pip install tinybird-cli -q -U

In [15]:
# token = '{TOKEN}'  # uncomment and set your Tinybird token before running the cells below
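
Outside a notebook, the `tb` commands used in the cells of this section can be assembled in a Python script. The sketch below only builds the argument lists; the helper `tb_commands` is hypothetical, and it assumes the generated Data Source takes the name of the data file without its extension, as the outputs in this notebook suggest.

```python
import shlex
from pathlib import Path

def tb_commands(token, data_file):
    """Build the generate, push, and append commands for one data file."""
    # Assumption: the Data Source is named after the file stem,
    # e.g. wiki_cli_csv for wiki_cli_csv.csv
    name = Path(data_file).stem
    base = ["tb", f"--token={token}"]
    return [
        base + ["datasource", "generate", data_file],
        base + ["push", f"{name}.datasource"],
        base + ["datasource", "append", name, data_file],
    ]

commands = tb_commands("{TOKEN}", "wiki_cli_csv.csv")
for cmd in commands:
    # To execute a command, pass it to subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

Keeping the commands as lists avoids shell quoting problems when file names contain spaces.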

### Format CSV

In [57]:
df_wiki.to_csv("wiki_cli_csv.csv", index=False)
!tb --token=$token datasource generate 'wiki_cli_csv.csv'

** Generated wiki_cli_csv.datasource
** => Create it on the server running: $ tb push wiki_cli_csv.datasource
** => Append data using: $ tb datasource append wiki_cli_csv wiki_cli_csv.csv


In [58]:
!tb --token=$token push wiki_cli_csv.datasource

** Processing wiki_cli_csv.datasource
** Building dependencies
** Running wiki_cli_csv
** 'wiki_cli_csv' created
** Not pushing fixtures


In [59]:
!tb --token=$token datasource append wiki_cli_csv wiki_cli_csv.csv

** 🥚 starting import process
** 🐥 done
** Total rows in wiki_cli_csv: 22744
** Data appended to Data Source 'wiki_cli_csv' successfully!
** Data pushed to wiki_cli_csv


### Format NDJSON

In [60]:
df_wiki.to_json("wiki_cli_ndjson.ndjson", orient="records", lines=True, force_ascii=False)
!tb --token=$token datasource generate 'wiki_cli_ndjson.ndjson'

** Generated wiki_cli_ndjson.datasource
** => Create it on the server running: $ tb push wiki_cli_ndjson.datasource
** => Append data using: $ tb datasource append wiki_cli_ndjson wiki_cli_ndjson.ndjson


In [61]:
!tb --token=$token push wiki_cli_ndjson.datasource

** Processing wiki_cli_ndjson.datasource
** Building dependencies
** Running wiki_cli_ndjson
** 'wiki_cli_ndjson' created
** Not pushing fixtures


In [62]:
!tb --token=$token datasource append wiki_cli_ndjson wiki_cli_ndjson.ndjson

** 🥚 starting import process
** 🐥 done
** Appended 0 new rows
** Total rows in wiki_cli_ndjson: 22744
** Data appended to Data Source 'wiki_cli_ndjson' successfully!
** Data pushed to wiki_cli_ndjson
