# https://github.com/LanLi2017/OpenRefine_client_Tutorial#openrefine-python-client-library

To ask for communicating with OpenRefine server.
Notes:
1. Ensure you have OpenRefine.[http://127.0.0.1:3333/] running 
2. Server and Project id are required for sending request to OpenRefine server.
3. OR.py has encapsulated and extended functions based on OR-Client library .[https://github.com/LanLi2017/OpenRefineClientPy3] 
4. After applying each operation, a 'history id' will be generated and corresponding history meta-info will be Auto-stored in your local machine. You might check it from worksapce. 
5. Update: crsf_token is needed when you send request, OpenRefine 3.3+. Check more from Openrefine API[https://github.com/OpenRefine/OpenRefine/wiki/Changes-for-3.3#csrf-protection-changes]
6. Distinguish 'operation history' and 'history'
7. For more interesting information: Notebook [https://github.com/ouseful-PR/openrefineder/tree/master/notebooks]

In [23]:
from OpenRefineClientPy3.google_refine.refine import refine
from OpenRefineClientPy3.google_refine.refine import facet
from OR import Refineop
import pandas as pd
import json
from pprint import pprint

# Project

In [24]:
# use refine.py to directly connect with server
projects = refine.Refine(refine.RefineServer()).list_projects()
p = refine.Refine(refine.RefineServer()).open_project(1956267142800)

In [25]:
# use OR.py to connect with server 
# more functions supported, and refined outputs
# Params: (server, project id)
server = refine.RefineServer()
projectID = 1956267142800
refineop = Refineop(server,projectID)

# History

In [26]:
# List history (From internal package)
# Output: 'past' stands for current operation history; 'future' stands for 'undo' part 
# Description+id+time: metadata for each data cleaning step
history_id = refineop.list_history()
pprint(history_id)
hid = [hid['id'] for hid in history_id['past']]
print(hid)

{'future': [{'description': 'Mass edit 739 cells in column scientificName',
             'id': 1602135606758,
             'time': '2020-10-08T05:32:10Z'},
            {'description': 'Text transform on 21 cells in column country: '
                            'value.toUppercase()',
             'id': 1605762843888,
             'time': '2020-11-19T05:05:09Z'},
            {'description': 'Mass edit 303 cells in column scientificName',
             'id': 1605763237463,
             'time': '2020-11-19T05:17:35Z'},
            {'description': 'Text transform on 0 cells in column country: '
                            'value.toUppercase()',
             'id': 1605764001677,
             'time': '2020-11-19T05:17:39Z'},
            {'description': 'Edit single cell on row 9, column country',
             'id': 1605763442258,
             'time': '2020-11-19T05:17:40Z'},
            {'description': 'Create new column mm/dd based on column mo by '
                            'filling 35549 

In [27]:
# Redo step 2: hid ==
# Params: history id for step 2 (3-1)
refinesever =refine.RefineServer()
csrf_t = refinesever.get_csrf()['token']
refineop.undo_redo_proj(lastDone_id= 1605762843888,token=csrf_t)

True

In [28]:
# Check if we've successfully redo step 3
# Return: past add up one step; future minus one step 
history_id = refineop.list_history()
pprint(history_id)

{'future': [{'description': 'Mass edit 303 cells in column scientificName',
             'id': 1605763237463,
             'time': '2020-11-19T05:17:35Z'},
            {'description': 'Text transform on 0 cells in column country: '
                            'value.toUppercase()',
             'id': 1605764001677,
             'time': '2020-11-19T05:17:39Z'},
            {'description': 'Edit single cell on row 9, column country',
             'id': 1605763442258,
             'time': '2020-11-19T05:17:40Z'},
            {'description': 'Create new column mm/dd based on column mo by '
                            'filling 35549 rows with '
                            'cells.mo.value+"/"+cells.dy.value',
             'id': 1605763734208,
             'time': '2020-11-19T05:17:41Z'},
            {'description': 'Split 20231 cell(s) in column scientificName into '
                            'several columns by separator',
             'id': 1605763394814,
             'time': '2020-11-19

In [29]:
# List operation history 
operation_history = refineop.get_operations()
pprint(operation_history)

[{'columnName': 'scientificName',
  'description': 'Mass edit cells in column scientificName',
  'edits': [{'from': ['Amphispiza bilineata',
                      'Amphespiza bilineata',
                      'Emphispiza bilinata',
                      'Amphispiza bilineatus',
                      'Amphispizo bilineata'],
             'fromBlank': False,
             'fromError': False,
             'to': 'Amphispiza bilineata'},
            {'from': ['Ammospermophilus harrisi',
                      'Ammospermophilis harrisi',
                      'Ammospermophilus harrisii'],
             'fromBlank': False,
             'fromError': False,
             'to': 'Ammospermophilus harrisi'}],
  'engineConfig': {'facets': [], 'mode': 'row-based'},
  'expression': 'value',
  'op': 'core/mass-edit'},
 {'columnName': 'country',
  'description': 'Text transform on cells in column country using expression '
                 'value.toUppercase()',
  'engineConfig': {'facets': [], 'mode': 'ro

Difference between operation history and history!
1. Content: operation history returns prospective provenance; history returns retrospective provenance
2. Operation history only returns "past"; history returns "past" and "future"

# Cell (Inner-Column)

In [30]:
# Text Transform 
# Params: column name, expression, csrf required
# Return: status, history

refinesever =refine.RefineServer()
csrf_t = refinesever.get_csrf()['token']

refineop.text_transform('country',expression = 'value.toUppercase()', token= csrf_t)

{'historyEntry': {'id': 1605764855941,
  'description': 'Text transform on 0 cells in column country: value.toUppercase()',
  'time': '2020-11-19T05:31:38Z'},
 'code': 'ok'}

In [31]:
# Single edit (8,18) from (@KIMBA707) to (KIMBA707)
# Params: row, cell, type, value
# is there any ops recorded?
csrf_t = refinesever.get_csrf()['token']
refineop.single_edit(8,18,'text','United States of America',csrf_t)

{'code': 'ok',
 'historyEntry': {'id': 1605764171672,
  'description': 'Edit single cell on row 9, column country',
  'time': '2020-11-19T05:32:57Z'},
 'cell': {'v': 'United States of America'},
 'pool': {'recons': {}}}

# Clustering
Open Refine supports several clustering methods:

- clusterer_type: binning; refine_function: fingerprint|metaphone3|cologne-phonetic
- clusterer_type: binning; refine_function: ngram-fingerprint; params: {'ngram-size': INT}
- clusterer_type: knn; refine_function: levenshtein|ppm; params: {'radius': FLOAT,'blocking-ngram-size': INT}

In [32]:
# csrf token is needed as far as you need to 
# Post request to OpenRefine server
csrf_t = refinesever.get_csrf()['token']

In [33]:
from collections import OrderedDict
# Cluter and Edit 
# 1. compute clusters: return a list of clusters; 
#    params: column, clusterer_type='binning', function=None, params=None
# Mass_edit
# 2. apply (edits_from, edits_to) 
# Return: status, history

clusters = refineop.compute_clusters('scientificName', csrf_t,clusterer_type='knn', function='levenshtein')
clusters

# choose cluster
from_edit=refineop.getFromValue(clusters)
to_edit=refineop.getToValue(clusters)
default_edits=OrderedDict()
default_edits['fromBlank']='false'
default_edits['fromError']='false'
for f1,t1 in zip(from_edit, to_edit):
    default_edits['from']=f1
    default_edits['to']=t1
edits=[default_edits]
# mass edit : column, edits
refineop.mass_edit('scientificName', edits, csrf_t)

{'historyEntry': {'id': 1605765161328,
  'description': 'Mass edit 303 cells in column scientificName',
  'time': '2020-11-19T05:36:20Z'},
 'code': 'ok'}

In [35]:
# Column

{'historyEntry': {'id': 1605764476617,
  'description': 'Split 20231 cell(s) in column scientificName into several columns by separator',
  'time': '2020-11-19T05:38:19Z'},
 'code': 'ok'}

In [36]:
# Rename Column
# Params: old column name, new column name
csrf_t = refinesever.get_csrf()['token']
refineop.rename_column('scientificName 1', 'scientific_Name_1', csrf_t)

{'historyEntry': {'id': 1605764506603,
  'description': 'Rename column scientificName 1 to scientific_Name_1',
  'time': '2020-11-19T05:39:05Z'},
 'code': 'ok'}

# Row level

In [37]:
# Annotate Row (star/flag)
# Row index, flagged:default='True'
csrf_t_1 = refinesever.get_csrf()['token']
refineop.flag_row(1,csrf_t_1)

csrf_t_2 = refinesever.get_csrf()['token']
refineop.star_row(2, csrf_t_2)

{'historyEntry': {'id': 1605764719016,
  'description': 'Star row 3',
  'time': '2020-11-19T05:39:37Z'},
 'code': 'ok'}

# Download Operation History 

In [38]:
# Load operation history
# Params: load file name
# Return: auto load operation history in your local machine
file_name = 'load_recipe'
refineop.load_ops(file_name)

# Download Cleaned Dataset

In [39]:
# Load cleaned dataset
# params: file name
# return: auto load dataset in your local machine
filename = 'CleanedData.csv'
refineop.load_data(filename)

check https://github.com/tmcphillips/openrefine-provenance/blob/master/models/openrefine%20data%20and%20project%20model.md for OpenRefine taxonomy in OpenRefine