<img src=https://raw.githubusercontent.com/superwise-ai/quickstart/f395a719ac93377005c6ce8bacebc425bf667cf1/docs/images/white_mode_logo.svg 
width="400" alt="Superwise">

This notebook provides a quickstart experience for extracting and enriching meta-features from plain text using the Elemeta open source package. Elemeta has two main use cases: 
* Engineering new features from extracted meta-features to build improved models.
* Monitoring NLP use cases (here, with Superwise), which is the use case this notebook walks through.
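
To make the term "meta-feature" concrete, here is a hand-rolled sketch that derives a few numeric descriptors from one tweet-like string. It is an illustration only and not Elemeta's implementation: the function name `toy_meta_features` and the regular expressions are assumptions, and only the key names are borrowed from the columns Elemeta produces later in this notebook.

```python
import re


def toy_meta_features(text: str) -> dict:
    """Compute a few simple descriptors of a text (illustrative, not Elemeta's logic)."""
    return {
        "word_count": len(text.split()),                        # whitespace-separated tokens
        "hashtag_count": len(re.findall(r"#\w+", text)),        # tokens such as #NLP
        "mention_count": len(re.findall(r"@\w+", text)),        # tokens such as @user
        "link_count": len(re.findall(r"https?://\S+", text)),   # http/https URLs
        "text_length": len(text),                               # number of characters
    }


features = toy_meta_features("Loving the new #NLP tools from @superwise https://example.com")
print(features)
```

Each value is a plain number, so it can be fed to a model as an engineered feature or tracked over time as a monitored signal.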

The notebook includes:
* [Installation](#installation)
* [Monitor NLP with Superwise and Elemeta](#monitor)
    * [Simulation preparation](#simulation_preperation)
    * [Create a project](#create_project)
    * [Training pipeline](#training_pipeline)
    * [Inference pipeline](#inference_pipeline)
    * [Ground truth pipeline](#ground_truth_pipeline)
---

# <a name="installation"></a>Installation

---

## Install packages 
Elemeta can be installed directly from the Superwise internal package repository (**DO NOT SHARE WITH OTHERS**).
Install PyDrive to access Google Drive data directly.

In [1]:
!pip install --extra-index-url "https://_json_key_base64:ewogICJ0eXBlIjogInNlcnZpY2VfYWNjb3VudCIsCiAgInByb2plY3RfaWQiOiAiZGV2ZWxvcG1lbnQxMTI1MzlkZiIsCiAgInByaXZhdGVfa2V5X2lkIjogIjRmYTEyM2NmODBjOTc0MWMwMzNlYTBmYTE1MWRjZTFhMzhiMTBmY2QiLAogICJwcml2YXRlX2tleSI6ICItLS0tLUJFR0lOIFBSSVZBVEUgS0VZLS0tLS1cbk1JSUV2Z0lCQURBTkJna3Foa2lHOXcwQkFRRUZBQVNDQktnd2dnU2tBZ0VBQW9JQkFRQ3k3SmVuR0pxM0ZpK0RcblhnUkNzbzVRRDJkYU1na01ZQ05SS0JOanFCL05yRHd5UGtrRGRST0VqclZuRTlVemtYTS81cS9IdE1KZkpkZUJcbk5wSkV5eS82UU9NZklUTlZmQ1RVakJ5dWlheVloRnE4Y2xUMGp1SXQ4WEk3S1VwYzJjcWRKVDhZVCtYdWZCWHZcbjQ4cUdmTVV2QmxSZ2pMaTNSOWsvVFpkSXBhSkdWbytacjZ5SWIwdVllS09HNmZGOUJ3YnlvSW9iNUlPbkVMVUJcbnM0ZFQ2VllGeTJkdnl2d3JlS0ZFbHhsekdaM2tlZDhHWUpEUDljaXNmMDRWTGxTbVdTcStubXJBOWJaUHFWRldcbk11SnpNRHJOeFhQczVKbjNtcU8wNTVCZG5NTHVUdDBpRGdvOEFpdmpHSkVkWTFqVzVPWWNVd1YvVHFRNjVYQ1hcbjdXNjZyZ0NIQWdNQkFBRUNnZ0VBTUFtMGg4RGozUThnVDVGZzdIVmJIeVNibDR4Q2dLZVpJOU55TFRuNDIvUUhcbnIyZW9tN3lGc252TU9YSUtOayt4VlRFKzlZdlMrYy9EcGVYOGJHcnZKUzNockx4eHQxeGUyUkFMTFZNNld6S0JcbjJBR01US0xHR0Jhd21EQzBUZXlOYVJhVWM1Y1VBUzBnaUtrc2VXSXJZTDQrempOSjdxOWtKUXBVZVVVN0pjM29cbk5BakVmWDAzRjRZa0dHVm9iNlBZWnAvdi9Uc2M5Z2ZrcGxIb3lsN1Z0SU9SWFBIeS9obmZoZ2F0bm44L3hkUUxcblRLRXAzazRnL3pKSnVva2xRSS9IT0dNZFBuM2FhckIrUk0wYTBhbDg2dG9CNW0wWnVwdkxTRzVGWDlkK08yVUtcbjU5cEZjQmgvS2VmL3UvUjVCRHZRNmErSHFaQmJldlRyNDNJcSs1cnJvUUtCZ1FEOUFIemRFYXI4Y1ZVdXNyWFdcbjJwRkNkZ0k3dXp1cmM4d2JIY2RYaFp2MGMxS2ZYQzluOG5yQTRHMjZKNklVVlhXSnFOM1ZBMmpINFpqQmVMSk5cbkdmeHJKZlRseWxSWnlheHdKZDc2YTd1a0trdmZsTXBzSEwxS3dKWXBFVDJhOW1DNG5mcG9iTHh6Zm9xRmhBY0dcbjJ3ZUwweUVqeU4rZ3JFek5ScWxETHdVNHB3S0JnUUMxQzJGOXR6V2NGZDlkVDBTR2RUVS9wVjk3TGVjOFlqL1hcblJ3ZmRiS09jN1g2blovN2tmemozV0JOWmxzWi93amZYeTgwZHhDOWY4SCt3REdCeWhLRUVMY0hidGhpV0FhM09cbk5JRlZuQkJCbEVQTlU3TjlJVjZlVkM2RWRlUERkUWIrMjE2QlVzTHZHVjZncERmMGV3YmhsQ056ZkptelZlZ2tcbjZLdmtaMFFWSVFLQmdRQ090bis0bEFiSGI0YUZXUG1Kd0xDL3RLRjk0QmZBbHdsRElvRVh0WjVMUGVJVlVvTExcbk45UldpRUpkQjQ0OXVoY2JGODVLSWlvdzFlaTgrY0JhRFpaOU9tUHlXemRKanFGZWdYNU45QWRjaXg5UmR6VFhcbjF6NVB0R0wvdDJId3o1bXZpaTErU2h
mamJqWGxLcHhzR1pFZ0puQkRKMFE5OWZNOHQrY3lwb3RqY1FLQmdEUjVcbjNBUkllbTJIbVhxK1l5cG1CczB2N2dFU1NSZ2pra1dmL1JPZFRiOUt4NDlXZ1hkUnVQMVl0aU1kcE9PYk0veUVcbndpdUNsZ2pFK1AzYVdJcFpEeUxhOEhueXlpV1F6d1FhQy9MNGpXMjB3QUpmNUlLOGpXUnZHaHlpM3lYa1llYTFcblJ5dE5CZHV3Q3RHZFIrckJUamxNYXdvcWI1S2ZyKzRpMHRBZGJvcUJBb0dCQUtZMGlic0hIbkx4V1BQck0zQ25cbkFXRXlnZUZDYUdMbmxZTFZQWk5hZkQvUmVmVDQxT2lmNHFUVkpUWHNGaVNSNGt5RUFUTUNQK2Q4a0hnUlZEL3hcblhCajNGUHYwWU9SajlEcC96VVArZ1lKa0ZMT3k5cWZ2aFc0N1dhZTUxUXhqV3dVQWc4N29yZ2lZUXZKbFJjNmpcbktSTGtHR3AwT1Q2OEdIN0V5c3BSRHd0L1xuLS0tLS1FTkQgUFJJVkFURSBLRVktLS0tLVxuIiwKICAiY2xpZW50X2VtYWlsIjogImFydGlmYWN0cy10ZXN0cHlwaS1yZWFkQGRldmVsb3BtZW50MTEyNTM5ZGYuaWFtLmdzZXJ2aWNlYWNjb3VudC5jb20iLAogICJjbGllbnRfaWQiOiAiMTA2NTQzMDk5ODI0MjY3OTk5OTcxIiwKICAiYXV0aF91cmkiOiAiaHR0cHM6Ly9hY2NvdW50cy5nb29nbGUuY29tL28vb2F1dGgyL2F1dGgiLAogICJ0b2tlbl91cmkiOiAiaHR0cHM6Ly9vYXV0aDIuZ29vZ2xlYXBpcy5jb20vdG9rZW4iLAogICJhdXRoX3Byb3ZpZGVyX3g1MDlfY2VydF91cmwiOiAiaHR0cHM6Ly93d3cuZ29vZ2xlYXBpcy5jb20vb2F1dGgyL3YxL2NlcnRzIiwKICAiY2xpZW50X3g1MDlfY2VydF91cmwiOiAiaHR0cHM6Ly93d3cuZ29vZ2xlYXBpcy5jb20vcm9ib3QvdjEvbWV0YWRhdGEveDUwOS9hcnRpZmFjdHMtdGVzdHB5cGktcmVhZCU0MGRldmVsb3BtZW50MTEyNTM5ZGYuaWFtLmdzZXJ2aWNlYWNjb3VudC5jb20iCn0K@us-central1-python.pkg.dev/development112539df/testpypi/simple/" elemeta
!pip install -U -q PyDrive
!pip install superwise

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/, https://_json_key_base64:****@us-central1-python.pkg.dev/development112539df/testpypi/simple/
Collecting elemeta
  Downloading https://us-central1-python.pkg.dev/development112539df/testpypi/elemeta/elemeta-1.1.2-py3-none-any.whl (28.2 MB)
Collecting better-profanity<0.8.0,>=0.7.0
  Downloading better_profanity-0.7.0-py3-none-any.whl (46 kB)
Collecting autocorrect<3.0.0,>=2.6.1
  Downloading autocorrect-2.6.1.tar.gz (622 kB)
  Preparing metadata (setup.py) ... done
Collecting vadersentiment<4.0.0,>=3.3.2
  Do

## Restart the kernel

After installing everything, restart the notebook kernel so it can find the packages.
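
One way to confirm that the restart was effective is to check whether the installed packages can be located before importing them. The sketch below uses only the standard library; the helper name `missing_packages` is hypothetical, and the top-level module names are assumed to match the imports used later in this notebook.

```python
import importlib.util


def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Top-level module names assumed for the packages installed above.
required = ["elemeta", "pydrive", "superwise"]
missing = missing_packages(required)
if missing:
    print("Not importable yet (restart the kernel or reinstall):", missing)
else:
    print("All packages are importable.")
```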


In [2]:
import os

# Automatically restart kernel after installs
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

## Imports
Import the relevant packages into the session.

In [1]:
from elemeta.nlp.metadata_extractor_runner import MetadataExtractorsRunner
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from superwise import Superwise
from superwise.models.project import Project
from superwise.models.model import Model
from superwise.models.version import Version
from superwise.models.dataset import Dataset
from superwise.resources.superwise_enums import NotifyUpon, ScheduleCron



## Read data
Authenticate with Google Drive in order to access the data file.

In [2]:
# Authenticate
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

Read the Twitter tweets dataset from Oren's shareable link. We will use only three fields: the tweet itself, its timestamp, and the number of likes (used later as the label for our prediction task).

In [3]:
# sharable link: https://drive.google.com/file/d/1nvIfGWwd4WkJyRqH4fuUMvuDm1xdo6Sr/view?usp=share_link
fileDownloaded = drive.CreateFile({'id':'1nvIfGWwd4WkJyRqH4fuUMvuDm1xdo6Sr'})
fileDownloaded.GetContentFile('tweets_kaggle.csv')
df_full = pd.read_csv('tweets_kaggle.csv')[["content","date_time","number_of_likes"]]
df_origin = df_full[:200]


# <a name="monitor"></a>Monitor NLP with Superwise and Elemeta
This is a quickstart example of how Elemeta can be used to monitor NLP use cases with Superwise. Please ensure that you have an active Superwise account, and if you don't have one, [please create one](https://portal.superwise.ai/account/sign-up).

---



## <a name="simulation_preperation"></a>Simulation preparation
We will split the original Twitter data into three parts to simulate training, inference, and ground truth data pipelines.

In [4]:
from sklearn.model_selection import train_test_split
if "id" not in df_full.columns:
  df_full = df_full.reset_index().rename({"index":"id"},axis=1)
X_train, X_inference, y_train, y_inference = train_test_split(df_full, df_full.loc[:,'number_of_likes'], test_size=0.33, random_state=42)

sample_size = 200
train_sampled = X_train[:sample_size]
inference_sampled = X_inference[:sample_size]
ground_truth_sampled = pd.DataFrame(y_inference[:sample_size]).assign(id=inference_sampled["id"])

print(X_train.shape)
print(X_inference.shape)
print(ground_truth_sampled.shape)

(35203, 4)
(17339, 4)
(200, 2)


## <a name="create_project"></a>Create a project
We will programmatically create our project and model using the Superwise SDK.

### Generate tokens
Please enter your API token or user token here. You can find out how to generate them or import them [here](https://docs.superwise.ai/docs/authentication).
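
Because the cell below ships with `[CLIENT_ID]` and `[SECRET]` placeholders, a small guard can catch a forgotten substitution before any request is sent. This is a minimal sketch: the helper `read_credentials` is hypothetical and not part of the Superwise SDK, and it only assumes the two environment variable names used in this notebook.

```python
import os


def read_credentials(names=("SUPERWISE_CLIENT_ID", "SUPERWISE_SECRET")):
    """Return the named environment variables, failing fast on missing or placeholder values."""
    values = {}
    for name in names:
        value = os.environ.get(name, "")
        # Bracketed values such as "[CLIENT_ID]" are treated as unfilled placeholders.
        if not value or (value.startswith("[") and value.endswith("]")):
            raise RuntimeError(f"Environment variable {name} is not set to a real value")
        values[name] = value
    return values


# Usage, once the variables hold real values:
# credentials = read_credentials()
```

The debug log printed when the project is created echoes the client ID and secret, so it is worth clearing cell outputs before sharing the notebook.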

In [5]:
import os
os.environ['SUPERWISE_CLIENT_ID'] = '[CLIENT_ID]'
os.environ['SUPERWISE_SECRET'] = '[SECRET]'

### Create a new project

In [6]:
sw = Superwise()
project = Project(
    name="My NLP Project",
    description="Natural Language Processing"
)

project = sw.project.create(project)
print(f"New project Created - {project.id}")

DEBUG:superwise:POST:  https://auth.superwise.ai/identity/resources/auth/v1/api-token params: {'clientId': '9411927f-892f-4673-8e15-ca8c35b49792', 'secret': 'abd70071-bda3-46c5-8478-c461a65a9f3e'}
DEBUG:superwise:GET https://portal.superwise.ai/aa639d7248a8342bd9/admin/v1/settings query params: None
INFO:superwise:POST model/v1/projects 
DEBUG:superwise:POST:  https://portal.superwise.ai/aa639d7248a8342bd9/model/v1/projects params: {'created_at': None, 'created_by': None, 'description': 'Natural Language Processing', 'id': None, 'name': 'My NLP Project'}


New project Created - 13


### Create a new model

In [8]:
nlp_model = Model(
    project_id=project.id,
    name="Tweeter Likes NLP Model",
    description="Regression model with simulated data"
)

nlp_model = sw.model.create(nlp_model)

INFO:superwise:POST admin/v1/models 
DEBUG:superwise:POST:  https://portal.superwise.ai/aa639d7248a8342bd9/admin/v1/models params: {'active_version_id': None, 'description': 'Regression model with simulated data', 'external_id': None, 'id': None, 'is_archived': None, 'name': 'Tweeter Likes NLP Model12', 'project_id': 13, 'time_units': None}


## <a name="training_pipeline"></a>Training pipeline
In order to predict the number of likes per tweet, we will train a regression model and log the training data into Superwise after Elemeta enrichment.

### Train a new model
Based on the training dataset, build a simple regression pipeline: word counts, TF-IDF weighting, and a linear regressor trained with stochastic gradient descent.

In [9]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDRegressor  # LogisticRegression was imported but never used

pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('sgdr',SGDRegressor(max_iter=3000))
    ])
pipe.fit(X_train["content"],y_train)

Maximum number of iteration reached before convergence. Consider increasing max_iter to improve the fit.


### Log training data
Prepare the training dataset in Superwise format, extend it with Elemeta, and send it to Superwise.

In [10]:
train_sampled

Unnamed: 0,id,content,date_time,number_of_likes
10299,10299,I sincerely enjoy this and every moment I get ...,12/11/2014 12:43,4828
49940,49940,Like. Love. Affection. Romance. A double tap. ...,14/02/2016 20:41,1545
45822,45822,"With the opening of these 2 centers, @Movimien...",03/07/2015 16:02,1661
52156,52156,@WValderrama I know this won't mean anything t...,23/05/2015 15:33,12645
44558,44558,ありがとうございます Summersonic Tokyo!! だいすき！,15/08/2015 09:55,32410
...,...,...,...,...
12272,12272,@__glitterDICK Happy #cake day to my glitter d...,28/12/2012 18:31,224
17036,17036,"A night of magical moments, rocknroll and too ...",05/05/2015 17:30,5858
764,764,When you could go anywhere for your bday dinne...,27/10/2015 19:37,9421
48415,48415,Mmm...pasta. 🍝 Worldwide InstaMeet is coming s...,10/09/2016 15:00,916


##### Preparation for Superwise format and Enrichment


In [11]:
train_sampled["predicted_number_of_likes"] = pipe.predict(train_sampled["content"]).astype(int)

# Enrich the training dataset with Elemeta
metadata_extractors_runner = MetadataExtractorsRunner()
print("The original dataset had {} columns".format(train_sampled.shape[1]))

# The enrichment process
print("Processing...")
train_sampled = metadata_extractors_runner.run_on_dataframe(dataframe=train_sampled,text_column='content')
print("The transformed dataset has {} columns".format(train_sampled.shape[1]))


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


The original dataset had 5 columns
Processing...
The transformed dataset has 31 columns


In [14]:
train_sampled.columns

Index(['id', 'content', 'date_time', 'number_of_likes',
       'predicted_number_of_likes', 'detect_langauge', 'emoji_count',
       'text_complexity', 'unique_word_ratio', 'unique_word_count',
       'word_regex_matches_count', 'number_count', 'out_of_vocabulary_count',
       'must_appear_words_ratio', 'sentence_count', 'sentence_avg_length',
       'word_count', 'avg_word_length', 'text_length', 'stop_words_count',
       'punctuation_count', 'special_chars_count', 'capital_letters_ratio',
       'regex_match_count', 'email_count', 'link_count', 'hashtag_count',
       'mention_count', 'syllable_count', 'acronym_count', 'date_count'],
      dtype='object')

In [15]:
from superwise.models.dataset import Dataset
from superwise.resources.superwise_enums import DataEntityRole,FeatureType


dataset = Dataset.generate_dataset_from_dataframe(
    name="Tweeter Likes Dataset",
    project_id=project.id,
    dataframe=train_sampled,
    roles={
        DataEntityRole.METADATA.value: ["content"],
        DataEntityRole.PREDICTION_VALUE.value: ["predicted_number_of_likes"],
        DataEntityRole.TIMESTAMP.value: "date_time",
        DataEntityRole.LABEL.value: ["number_of_likes"],
        DataEntityRole.ID.value: "id",
    },
)

# Create the dataset in Superwise, may take some time to process
dataset = sw.dataset.create(dataset)

new_version = Version(
    model_id=nlp_model.id,
    name="1.0.0",
    dataset_id=dataset.id
)

new_version = sw.version.create(new_version)
sw.version.activate(new_version.id)

INFO:superwise:Uploading dataset files...
DEBUG:superwise.utils.storage.internal_storage.internal_storage:Upload file to Superwise's internal blob storage datasets/project_id=13/Tweeter Likes Dataset_from_dataframe-o4xvi_is_68393069-0248-4de3-958d-a073c26d1092.parquet
DEBUG:superwise:Upload file to superwise bucket datasets/project_id=13/Tweeter Likes Dataset_from_dataframe-o4xvi_is_68393069-0248-4de3-958d-a073c26d1092.parquet
INFO:superwise:Finished uploading, start processing dataset...
INFO:superwise:POST model/v1/datasets 
DEBUG:superwise:POST:  https://portal.superwise.ai/aa639d7248a8342bd9/model/v1/datasets params: {'_from_dataframe': False, '_tempfile_path': None, 'created_at': None, 'created_by': None, 'dtypes': None, 'files': ['file://Tweeter Likes Dataset_from_dataframe-o4xvi_is.parquet'], 'full_flow': True, 'head': None, 'id': None, 'internal_files': ['gs://superwise-aa639d7248a8342bd9-production/datasets/project_id=13/Tweeter Likes Dataset_from_dataframe-o4xvi_is_68393069-0

<Response [200]>
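
Since dataset creation can take a while, it may help to verify locally that every column named in the role mapping exists before uploading. The sketch below is an assumption-laden illustration: `missing_role_columns` is a hypothetical helper, and the plain-string role keys stand in for the `DataEntityRole` values used above.

```python
def missing_role_columns(roles, columns):
    """Return a mapping of role -> column names that are absent from the dataframe columns."""
    available = set(columns)
    missing = {}
    for role, cols in roles.items():
        # A role may map to a single column name or to a list of names.
        names = [cols] if isinstance(cols, str) else list(cols)
        absent = [c for c in names if c not in available]
        if absent:
            missing[role] = absent
    return missing


# Plain-string role keys are placeholders for the DataEntityRole values used above.
roles = {
    "metadata": ["content"],
    "prediction_value": ["predicted_number_of_likes"],
    "timestamp": "date_time",
    "label": ["number_of_likes"],
    "id": "id",
}
columns = ["id", "content", "date_time", "number_of_likes", "predicted_number_of_likes"]
print(missing_role_columns(roles, columns))  # {} when every role column is present
```

With the real dataframe, `columns` would be `list(train_sampled.columns)`.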

## <a name="inference_pipeline"></a>Inference pipeline
Produce model inference predictions and log them in Superwise for monitoring. Inference logs will be sent in batches once Elemeta has enriched them.
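
The batching idea can be sketched on a plain list before applying it to a dataframe. This stand-alone version mirrors the `chunks` helper defined in the next cell, with `len()` in place of `df.shape[0]`; the `records` list is a made-up stand-in for the 200 inference rows.

```python
def chunks(items, n):
    """Yield successive n-sized slices from a sequence; the last slice may be shorter."""
    for i in range(0, len(items), n):
        yield items[i:i + n]


records = [{"id": i} for i in range(200)]  # stand-in for 200 inference rows
batch_sizes = [len(batch) for batch in chunks(records, 50)]
print(batch_sizes)  # [50, 50, 50, 50]
```

Because `chunks` is a generator, each batch is produced only when the loop asks for it, so enrichment and logging happen one batch at a time.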

In [16]:
inference_sampled.loc[:,"predicted_number_of_likes"] = pipe.predict(inference_sampled["content"]).astype(int)

# prep for Superwise format
prediction_time_vector = pd.Timestamp.now().floor('h') - \
    pd.TimedeltaIndex(inference_sampled.reset_index(drop=True).index // int(inference_sampled.shape[0] // 30), unit='D')

ongoing_predictions = inference_sampled.assign(
    date_time=prediction_time_vector,
)

#util function 
def chunks(df, n):
    """Yield successive n-sized chunks from df."""
    for i in range(0, df.shape[0], n):
        yield df[i:i + n]

# break the inference data into chunks
ongoing_predictions_chunks = chunks(ongoing_predictions, 50) # batches of 50

transaction_ids = list()
# for each chunk
for ongoing_predictions_chunk in ongoing_predictions_chunks:
    
    # enrich with Elemeta
    ongoing_predictions_chunk = metadata_extractors_runner.run_on_dataframe(dataframe=ongoing_predictions_chunk,text_column="content")
    
    # send to Superwise
    transaction_id = sw.transaction.log_records(
        model_id=nlp_model.id, 
        version_id=new_version.id, 
        records=ongoing_predictions_chunk.to_dict(orient="records")
    )
    transaction_ids.append(transaction_id)
    print(transaction_id)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Passing version name will be deprecated soon, pass version ID instead
INFO:superwise:Send records with params : model_id=24, version_id=23
DEBUG:superwise:POST:  https://portal.superwise.ai/aa639d7248a8342bd9/gateway/v1/transaction/records params: {'records': [{'id': 28263, 'content': '#TheBFFIssue @sorbetmag @riccardotisci17 http://t.co/VmVqbZ0OtZ', 'date_time': '2023-04-02T11:00:00.000Z', 'number_of_likes': 2891, 'predicted_number_of_likes': 5871, 'detect_langauge': 'en', 'emoji_count': 0, 'text_complexity': -93.33, 'unique_word_ratio': 0.8571428571, 'unique_word_count': 6, 'word_regex_matches_count': 9, 'number_count': 0, 'out_of_vocabulary_count': 9, 'must_appear_words_ratio': 0, 'sentence_count': 1, 'sentence_avg_length':

{'transaction_id': '7a3b58e6-d14a-11ed-b02b-566d729718d6'}


Passing version name will be deprecated soon, pass version ID instead
INFO:superwise:Send records with params : model_id=24, version_id=23
DEBUG:superwise:POST:  https://portal.superwise.ai/aa639d7248a8342bd9/gateway/v1/transaction/records params: {'records': [{'id': 33608, 'content': 'We had a great game and result yesterday against a difficult team like Lyon. Thank you for the support and enjoy the... http://t.co/C7SQjKbI', 'date_time': '2023-03-25T11:00:00.000Z', 'number_of_likes': 219, 'predicted_number_of_likes': 2194, 'detect_langauge': 'en', 'emoji_count': 0, 'text_complexity': 76.72, 'unique_word_ratio': 0.8571428571, 'unique_word_count': 18, 'word_regex_matches_count': 27, 'number_count': 0, 'out_of_vocabulary_count': 6, 'must_appear_words_ratio': 0, 'sentence_count': 2, 'sentence_avg_length': 69.5, 'word_count': 24, 'avg_word_length': 4.7083333333, 'text_length': 140, 'stop_words_count': 10, 'punctuation_count': 3, 'special_chars_count': 1, 'capital_letters_ratio': 0.07407407

{'transaction_id': '7a7c7448-d14a-11ed-b02b-566d729718d6'}


Passing version name will be deprecated soon, pass version ID instead
INFO:superwise:Send records with params : model_id=24, version_id=23
DEBUG:superwise:POST:  https://portal.superwise.ai/aa639d7248a8342bd9/gateway/v1/transaction/records params: {'records': [{'id': 13059, 'content': '🇭🇹 🇭🇹 🇭🇹\r\n\r\n100 Years of Haitian Beauty → https://t.co/T6hlP7xPSq https://t.co/udInFXzuz9', 'date_time': '2023-03-17T11:00:00.000Z', 'number_of_likes': 954, 'predicted_number_of_likes': 9937, 'detect_langauge': 'en', 'emoji_count': 3, 'text_complexity': 5.15, 'unique_word_ratio': 0.8, 'unique_word_count': 8, 'word_regex_matches_count': 15, 'number_count': 1, 'out_of_vocabulary_count': 13, 'must_appear_words_ratio': 0, 'sentence_count': 1, 'sentence_avg_length': 89.0, 'word_count': 13, 'avg_word_length': 5.6923076923, 'text_length': 89, 'stop_words_count': 1, 'punctuation_count': 2, 'special_chars_count': 0, 'capital_letters_ratio': 0.1886792453, 'regex_match_count': 3, 'email_count': 0, 'link_count':

{'transaction_id': '7b03afda-d14a-11ed-bfe9-527a181536c2'}


Passing version name will be deprecated soon, pass version ID instead
INFO:superwise:Send records with params : model_id=24, version_id=23
DEBUG:superwise:POST:  https://portal.superwise.ai/aa639d7248a8342bd9/gateway/v1/transaction/records params: {'records': [{'id': 7161, 'content': 'Get the facts, not the fluff—join the OFA Truth Team: https://t.co/ssz4MCTj7s', 'date_time': '2023-03-08T11:00:00.000Z', 'number_of_likes': 1470, 'predicted_number_of_likes': 6472, 'detect_langauge': 'en', 'emoji_count': 0, 'text_complexity': 77.23, 'unique_word_ratio': 0.9, 'unique_word_count': 9, 'word_regex_matches_count': 15, 'number_count': 0, 'out_of_vocabulary_count': 8, 'must_appear_words_ratio': 0, 'sentence_count': 1, 'sentence_avg_length': 77.0, 'word_count': 12, 'avg_word_length': 5.3333333333, 'text_length': 77, 'stop_words_count': 4, 'punctuation_count': 3, 'special_chars_count': 0, 'capital_letters_ratio': 0.1578947368, 'regex_match_count': 1, 'email_count': 0, 'link_count': 0, 'hashtag_cou

{'transaction_id': '7b55531c-d14a-11ed-bfe9-527a181536c2'}
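
The `date_time` values in the logs above step backwards in time because the cell assigns simulated timestamps: the row position is integer-divided by `200 // 30 == 6`, so every six consecutive records share a day offset. The following standard-library sketch reproduces that arithmetic; `spread_timestamps` is a hypothetical name, and the fixed `now` value is chosen only to match the dates visible in the logs.

```python
from datetime import datetime, timedelta


def spread_timestamps(n_records, n_days=30, now=None):
    """Give each record a timestamp, stepping back one day every n_records // n_days records.

    Assumes n_records >= n_days; otherwise the per-day count would be zero.
    """
    now = now or datetime.now()
    base = now.replace(minute=0, second=0, microsecond=0)  # floor to the hour
    per_day = n_records // n_days                          # 200 // 30 == 6
    return [base - timedelta(days=i // per_day) for i in range(n_records)]


stamps = spread_timestamps(200, now=datetime(2023, 4, 2, 11, 30))
# First record of each 50-row batch: positions 0, 50, 100, 150
print([stamps[i].date().isoformat() for i in (0, 50, 100, 150)])
# ['2023-04-02', '2023-03-25', '2023-03-17', '2023-03-08']
```

Note that integer division spreads 200 records over 34 distinct days rather than exactly 30.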


## <a name="ground_truth_pipeline"></a>Ground truth pipeline
Simulate ground truth collection and log it to Superwise for monitoring.

In [17]:
# prep for Superwise format: ground truth records carry only the label and the record id,
# so no timestamp vector is needed here
ongoing_labels = ground_truth_sampled

# break the label data into chunks
ongoing_labels_chunks = chunks(ongoing_labels, 50)

transaction_ids = list()
# for each chunk
for ongoing_labels_chunk in ongoing_labels_chunks:
    # send to Superwise
    transaction_id = sw.transaction.log_records(
        model_id=nlp_model.id, 
        version_id=new_version.id, 
        records=ongoing_labels_chunk.to_dict(orient="records")
    )
    transaction_ids.append(transaction_id)
    print(transaction_id)

Passing version name will be deprecated soon, pass version ID instead
INFO:superwise:Send records with params : model_id=24, version_id=23
DEBUG:superwise:POST:  https://portal.superwise.ai/aa639d7248a8342bd9/gateway/v1/transaction/records params: {'records': [{'number_of_likes': 2891, 'id': 28263}, {'number_of_likes': 9121, 'id': 28996}, {'number_of_likes': 8624, 'id': 32910}, {'number_of_likes': 44867, 'id': 3834}, {'number_of_likes': 56, 'id': 23410}, {'number_of_likes': 660, 'id': 6946}, {'number_of_likes': 1354, 'id': 41897}, {'number_of_likes': 7, 'id': 13758}, {'number_of_likes': 43, 'id': 23443}, {'number_of_likes': 31, 'id': 25506}, {'number_of_likes': 12088, 'id': 34789}, {'number_of_likes': 1159, 'id': 12759}, {'number_of_likes': 6183, 'id': 708}, {'number_of_likes': 2231, 'id': 7613}, {'number_of_likes': 1439, 'id': 38125}, {'number_of_likes': 315, 'id': 32904}, {'number_of_likes': 1569, 'id': 46480}, {'number_of_likes': 15557, 'id': 31983}, {'number_of_likes': 3334, 'id': 

{'transaction_id': '8166ab52-d14a-11ed-9b41-566d729718d6'}
{'transaction_id': '81717c6c-d14a-11ed-acb0-527a181536c2'}
{'transaction_id': '817a8d70-d14a-11ed-acb0-527a181536c2'}
{'transaction_id': '81853e00-d14a-11ed-acb0-527a181536c2'}
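
The ground truth records contain only `number_of_likes` and `id`, which suggests that labels are matched to earlier predictions through the shared `id` (an inference from the logged payloads, not a statement about Superwise internals). The sketch below shows that kind of join locally, using four id, prediction, and label values taken from the logs above; the helper name is hypothetical.

```python
# id -> predicted likes, from the first record of each inference batch in the logs
predictions = {28263: 5871, 33608: 2194, 13059: 9937, 7161: 6472}
# id -> actual likes for the same records
ground_truth = {28263: 2891, 33608: 219, 13059: 954, 7161: 1470}


def mean_absolute_error_by_id(predicted, actual):
    """Join predictions and labels on their shared ids and average the absolute errors."""
    shared_ids = predicted.keys() & actual.keys()
    if not shared_ids:
        raise ValueError("No ids in common between predictions and ground truth")
    return sum(abs(predicted[i] - actual[i]) for i in shared_ids) / len(shared_ids)


print(mean_absolute_error_by_id(predictions, ground_truth))  # 4735.0
```

An error of this size on four samples is consistent with the convergence warning raised during training, so the model here serves as a monitoring demo rather than an accurate predictor.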
