# Get data using the DataHub catalog

DataHub URL: [https://datahub.aiengineer.polytech.sandbox-atos.com/](https://datahub.aiengineer.polytech.sandbox-atos.com/)

## Retrieve taxi trips and load them into ClickHouse

In [24]:
# Download the Taxi Trips dataset as CSV (January 2022, at most 10,000 rows).
# The ./chicagodata directory must already exist.
!curl --get 'https://data.cityofchicago.org/resource/wrvz-psew.csv' \
  --data-urlencode '$limit=10000' \
  --data-urlencode '$where=trip_start_timestamp >= "2022-01-01" AND trip_start_timestamp < "2022-02-01"' \
  --data-urlencode '$select=tips,trip_start_timestamp,trip_seconds,trip_miles,pickup_community_area,pickup_centroid_latitude,pickup_centroid_longitude,dropoff_community_area,fare,tolls,extras,trip_total' \
  | tr -d '"' > "./chicagodata/trip.csv"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  994k    0  994k    0     0  88032      0 --:--:--  0:00:11 --:--:--  263k
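
The same request can be assembled in Python, which makes the three SoQL parameters (`$limit`, `$where`, `$select`) easier to read and reuse. The sketch below only builds the URL and makes no network call. `build_soql_url` is a hypothetical helper name; the endpoint, column list and date range are taken from the curl call above.

```python
from urllib.parse import urlencode

BASE_URL = "https://data.cityofchicago.org/resource/wrvz-psew.csv"

# Same columns as the $select parameter of the curl call.
COLUMNS = [
    "tips", "trip_start_timestamp", "trip_seconds", "trip_miles",
    "pickup_community_area", "pickup_centroid_latitude",
    "pickup_centroid_longitude", "dropoff_community_area",
    "fare", "tolls", "extras", "trip_total",
]


def build_soql_url(start, end, limit=10000, columns=COLUMNS, base_url=BASE_URL):
    """Hypothetical helper: URL-encoded SoQL query for trips in [start, end)."""
    params = {
        "$limit": limit,
        "$where": f'trip_start_timestamp >= "{start}" AND trip_start_timestamp < "{end}"',
        "$select": ",".join(columns),
    }
    # urlencode percent-encodes the "$" prefixes, spaces and quotes for us.
    return f"{base_url}?{urlencode(params)}"


url = build_soql_url("2022-01-01", "2022-02-01")
print(url)
```

Assuming the notebook has outbound network access, `pd.read_csv(url)` would then load the result without going through an intermediate file.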


In [32]:
#pip install pandahouse minio
import pandahouse as ph
import pandas as pd

In [28]:
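### Helper for running statements that return no rows with this Python client
def write_clickhouse(query, connection):
    """Print a statement (e.g. DDL) and run it through pandahouse."""
    print(query)
    try:
        ph.read_clickhouse(query, connection=connection)
    except KeyError:
        # Raised here when the statement produces no result set to parse
        print("Nothing to return")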

In [29]:
## Name your database after your username, replacing "-" with "_"
dbname = ''

## The connection dict needs a default database
connection = dict(database='default',
                  host='http://clickhouse-install.clickhouse.svc.cluster.local:8123',
                  user='admin',
                  password='B1gdata-demo')


write_clickhouse(f"create database {dbname}",connection)

connection['database'] = f"{dbname}"

print(connection)

create database guillaume_etevenard
Nothing to return
{'database': 'guillaume_etevenard', 'host': 'http://clickhouse-install.clickhouse.svc.cluster.local:8123', 'user': 'admin', 'password': 'B1gdata-demo'}
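
The naming rule above can be applied programmatically. This is a minimal sketch: `username_to_dbname` is a hypothetical helper, the username `guillaume-etevenard` is an assumption inferred from the output above, and the identifier check is deliberately stricter than what ClickHouse itself accepts.

```python
import re


def username_to_dbname(username):
    """Hypothetical helper: derive a database name by replacing '-' with '_'."""
    dbname = username.replace("-", "_")
    # Conservative check (an assumption, not a ClickHouse rule): letters,
    # digits and underscores only, and no leading digit. This also guards
    # the f-string that builds the "create database" statement.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", dbname):
        raise ValueError(f"unsafe database name: {dbname!r}")
    return dbname


example_dbname = username_to_dbname("guillaume-etevenard")
print(example_dbname)
```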


In [39]:
## get data
dbtable='chicago_taxi'
data = pd.read_csv("./chicagodata/trip.csv")
### select features
features = data[[
    "tips",
    "trip_start_timestamp",
    "trip_seconds",
    "trip_miles",
    "pickup_community_area" ,
    "dropoff_community_area" ,
    "fare",
    "tolls",
    "extras",
    "trip_total"
]]

In [40]:
### Create the table that will receive the taxi trip dataset
## ClickHouse table definition
# Using the DataFrame's dtypes and the ClickHouse documentation, write the CREATE TABLE statement
taxitable = f"""
CREATE TABLE {dbname}.{dbtable}
(
    `tips` Float32,
    `trip_start_timestamp` DateTime,
    `trip_seconds` Float32,
    `trip_miles` Float32,
    `pickup_community_area` Float32,
    `dropoff_community_area` Float32,
    `fare` Float32,
    `tolls` Float32,
    `extras` Float32,
    `trip_total` Float32
) 
ENGINE = MergeTree
PARTITION BY toYYYYMM(trip_start_timestamp)
ORDER BY trip_start_timestamp;
"""

In [41]:
write_clickhouse(taxitable,connection)



CREATE TABLE guillaume_etevenard.chicago_taxi
(
    `tips` Float32,
    `trip_start_timestamp` DateTime,
    `trip_seconds` Float32,
    `trip_miles` Float32,
    `pickup_community_area` Float32,
    `dropoff_community_area` Float32,
    `fare` Float32,
    `tolls` Float32,
    `extras` Float32,
    `trip_total` Float32
) 
ENGINE = MergeTree
PARTITION BY toYYYYMM(trip_start_timestamp)
ORDER BY trip_start_timestamp;

Nothing to return
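
Writing the statement by hand works for ten columns; for wider tables the column list could be derived from the DataFrame's dtypes. The following is a sketch under assumptions: `create_table_sql` and the `PANDAS_TO_CLICKHOUSE` mapping are hypothetical, and the mapping follows the `Float32` choice of the table above (`Float64` would preserve the full pandas precision).

```python
# Assumed mapping from pandas dtype names to ClickHouse types; adjust as needed.
PANDAS_TO_CLICKHOUSE = {
    "float64": "Float32",
    "float32": "Float32",
    "int64": "Int64",
    "object": "String",
    "datetime64[ns]": "DateTime",
}


def create_table_sql(dbname, table, dtypes, order_by, overrides=None):
    """Hypothetical helper: build a MergeTree CREATE TABLE statement.

    dtypes maps column name -> pandas dtype name; overrides maps
    column name -> ClickHouse type and wins over the default mapping.
    """
    overrides = overrides or {}
    columns = []
    for name, dtype in dtypes.items():
        ch_type = overrides.get(name) or PANDAS_TO_CLICKHOUSE[dtype]
        columns.append(f"    `{name}` {ch_type}")
    body = ",\n".join(columns)
    return (
        f"CREATE TABLE {dbname}.{table}\n(\n{body}\n)\n"
        f"ENGINE = MergeTree\n"
        f"PARTITION BY toYYYYMM({order_by})\n"
        f"ORDER BY {order_by};"
    )


# read_csv loads the timestamp as text ("object"), so it is overridden here.
dtypes = {"tips": "float64", "trip_start_timestamp": "object", "trip_miles": "float64"}
ddl = create_table_sql(
    "guillaume_etevenard", "chicago_taxi", dtypes,
    order_by="trip_start_timestamp",
    overrides={"trip_start_timestamp": "DateTime"},
)
print(ddl)
```

In the notebook, `features.dtypes.astype(str).to_dict()` would produce the `dtypes` argument.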


In [42]:
## The values must match the ClickHouse DateTime text format,
## so we force '%Y-%m-%d %H:%M:%S' with pandas.
## assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
## raised when writing into `features`, a slice of `data`.
features = features.assign(
    trip_start_timestamp=pd.to_datetime(data["trip_start_timestamp"]).dt.strftime('%Y-%m-%d %H:%M:%S')
)
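
To see what the conversion does to a single value, here is a standard-library sketch. It assumes the CSV carries ISO 8601 timestamps such as `2022-01-01T00:15:00.000` (an assumption about the API output, not shown above); `to_clickhouse_datetime` is a hypothetical helper.

```python
from datetime import datetime


def to_clickhouse_datetime(value):
    """Hypothetical helper: ISO 8601 text -> 'YYYY-MM-DD hh:mm:ss' text."""
    parsed = datetime.fromisoformat(value)
    # Drop the "T" separator and the fractional seconds.
    return parsed.strftime("%Y-%m-%d %H:%M:%S")


formatted = to_clickhouse_datetime("2022-01-01T00:15:00.000")
print(formatted)  # 2022-01-01 00:15:00
```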


In [43]:
### insert using the to_clickhouse function
ph.to_clickhouse(features, dbtable, index=False, chunksize=100000, connection=connection)

10000

### Browse the UI to find our taxi trips dataset

![datahub](./images/datahub.png)

### Create a transformation view on the data

Here we create a view that contains only the last week of available data.

In [45]:
dbview='chicago_data_oneweek'

In [48]:
### Create a view on the chicago_taxi table
# This view keeps only the last week of available data
taxiview = f"""
CREATE view {dbname}.{dbview} as Select * ...
"""

In [49]:
write_clickhouse(...,connection)


CREATE view guillaume_etevenard.chicago_data_oneweek as Select * from guillaume_etevenard.chicago_taxi  where trip_start_timestamp >  (toDateTime('2022-02-01') - INTERVAL 7 DAY)

Nothing to return
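
The statement printed above can be generated from the database, table and view names. This is one possible sketch, not the only way to express the filter: `one_week_view_sql` is a hypothetical helper, and the `2022-02-01` end date is copied from the printed statement.

```python
from datetime import datetime, timedelta


def one_week_view_sql(dbname, table, view, end_date="2022-02-01", days=7):
    """Hypothetical helper: view over the last `days` days before end_date."""
    return (
        f"CREATE VIEW {dbname}.{view} AS "
        f"SELECT * FROM {dbname}.{table} "
        f"WHERE trip_start_timestamp > (toDateTime('{end_date}') - INTERVAL {days} DAY)"
    )


sql = one_week_view_sql("guillaume_etevenard", "chicago_taxi", "chicago_data_oneweek")
print(sql)

# The cutoff the WHERE clause evaluates to, computed locally for illustration.
cutoff = datetime.fromisoformat("2022-02-01") - timedelta(days=7)
print(cutoff)  # 2022-01-25 00:00:00
```

Because the comparison is strict (`>`), a trip starting exactly at the cutoff is excluded from the view.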


### Browse the UI to see the lineage link between the table and the view

![lineage](./images/lineage.png)