![giskard_logo.png](https://raw.githubusercontent.com/Giskard-AI/giskard/main/readme/Logo_full_darkgreen.png)

# About Giskard

Open-Source CI/CD platform for ML teams. Deliver ML products, better & faster. 

*   Collaborate faster with feedback from business stakeholders.
*   Deploy automated tests to eliminate regressions, errors & biases.

🏡 [Website](https://giskard.ai/)

📗 [Documentation](https://docs.giskard.ai/)

In [2]:
!pip install giskard



You should consider upgrading via the '/Users/rak/Documents/giskard/python-client/.venv/bin/python3.7 -m pip install --upgrade pip' command.[0m[33m
[0m

## Installing `giskard` and other packages

In [3]:
!pip install giskard transformers torch nltk

Collecting transformers
  Using cached transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting torch
  Using cached torch-1.13.1-cp37-none-macosx_10_9_x86_64.whl (135.3 MB)
Collecting nltk
  Using cached nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting regex!=2019.12.17
  Using cached regex-2022.10.31-cp37-cp37m-macosx_10_9_x86_64.whl (294 kB)
Collecting huggingface-hub<1.0,>=0.10.0
  Using cached huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Using cached tokenizers-0.13.2-cp37-cp37m-macosx_10_11_x86_64.whl (3.8 MB)


Installing collected packages: tokenizers, torch, regex, huggingface-hub, transformers, nltk
Successfully installed huggingface-hub-0.11.1 nltk-3.8.1 regex-2022.10.31 tokenizers-0.13.2 torch-1.13.1 transformers-4.25.1
You should consider upgrading via the '/Users/rak/Documents/giskard/python-client/.venv/bin/python3.7 -m pip install --upgrade pip' command.[0m[33m
[0m

In [1]:
import giskard
giskard.__version__

'1.7.0'

## Connect the external worker in daemon mode

In [None]:
!giskard worker start -d

In [4]:
!pip uninstall giskard -y
!pip install -e "/Users/rak/Documents/giskard/python-client"
import giskard
giskard.__version__

Found existing installation: giskard 1.7.0
Uninstalling giskard-1.7.0:
  Successfully uninstalled giskard-1.7.0
Obtaining file:///Users/rak/Documents/giskard/python-client
  Installing build dependencies ... [?25l|^C
[?25canceled
[31mERROR: Operation cancelled by user[0m[31m
You should consider upgrading via the '/Users/rak/Documents/giskard/python-client/.venv/bin/python3.7 -m pip install --upgrade pip' command.[0m[33m
[0m

'1.7.0'

# Start by creating a ML model 🚀🚀🚀

Download the categorized email files from Berkeley.

In [1]:
!wget http://bailando.sims.berkeley.edu/enron/enron_with_categories.tar.gz
!tar zxf enron_with_categories.tar.gz
!rm enron_with_categories.tar.gz

--2023-01-12 12:34:30--  http://bailando.sims.berkeley.edu/enron/enron_with_categories.tar.gz
Resolving bailando.sims.berkeley.edu (bailando.sims.berkeley.edu)... 128.32.78.19
Connecting to bailando.sims.berkeley.edu (bailando.sims.berkeley.edu)|128.32.78.19|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://bailando.berkeley.edu/enron/enron_with_categories.tar.gz [following]
--2023-01-12 12:34:31--  https://bailando.berkeley.edu/enron/enron_with_categories.tar.gz
Resolving bailando.berkeley.edu (bailando.berkeley.edu)... 128.32.78.19
Connecting to bailando.berkeley.edu (bailando.berkeley.edu)|128.32.78.19|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4523350 (4,3M) [application/x-gzip]
Saving to: ‘enron_with_categories.tar.gz’


2023-01-12 12:34:34 (1,91 MB/s) - ‘enron_with_categories.tar.gz’ saved [4523350/4523350]



In [1]:
import email
import glob

from collections import defaultdict

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from string import punctuation

import pandas as pd
from dateutil import parser
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer


from sklearn.linear_model import LogisticRegression
from sklearn import model_selection

Various imports and the list of categories from http://bailando.sims.berkeley.edu/enron/enron_categories.txt.

In [2]:
nltk.download('punkt')
nltk.download('stopwords')

stoplist = set(stopwords.words('english') + list(punctuation))
stemmer = PorterStemmer()


# http://bailando.sims.berkeley.edu/enron/enron_categories.txt
idx_to_cat = {
    1: 'REGULATION',
    2: 'INTERNAL',
    3: 'INFLUENCE',
    4: 'INFLUENCE',
    5: 'INFLUENCE',
    6: 'CALIFORNIA CRISIS',
    7: 'INTERNAL',
    8: 'INTERNAL',
    9: 'INFLUENCE',
    10: 'REGULATION',
    11: 'talking points',
    12: 'meeting minutes',
    13: 'trip reports'}

idx_to_cat2 = {
    1: 'regulations and regulators (includes price caps)',
    2: 'internal projects -- progress and strategy',
    3: ' company image -- current',
    4: 'company image -- changing / influencing',
    5: 'political influence / contributions / contacts',
    6: 'california energy crisis / california politics',
    7: 'internal company policy',
    8: 'internal company operations',
    9: 'alliances / partnerships',
    10: 'legal advice',
    11: 'talking points',
    12: 'meeting minutes',
    13: 'trip reports'}


LABEL_CAT = 3  # we'll be using the 2nd-level category "Primary topics" because the two first levels provide categories that are not mutually exclusive. see : https://bailando.berkeley.edu/enron/enron_categories.txt

# get_labels returns a dictionary representation of these labels.
def get_labels(filename):
    with open(filename + '.cats') as f:
        labels = defaultdict(dict)
        line = f.readline()
        while line:
            line = line.split(',')
            top_cat, sub_cat, freq = int(line[0]), int(line[1]), int(line[2])
            labels[top_cat][sub_cat] = freq
            line = f.readline()
    return dict(labels)


email_files = [f.replace('.cats', '') for f in glob.glob('enron_with_categories/*/*.cats')]

[nltk_data] Downloading package punkt to /Users/rak/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /Users/rak/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Build dataframe

In [3]:
columns_name = ['Target', 'Subject', 'Content', 'Week_day', 'Year', 'Month', 'Hour', 'Nb_of_forwarded_msg']


data = pd.DataFrame(columns=columns_name)

for email_file in email_files:
    values_to_add = {}

    # Target is the sub-category with maximum frequency
    if LABEL_CAT in get_labels(email_file):
      sub_cat_dict = get_labels(email_file)[LABEL_CAT]
      target_int = max(sub_cat_dict, key=sub_cat_dict.get)
      values_to_add['Target'] = str(idx_to_cat[target_int])

    # Features are metadata from the email object
    filename = email_file+'.txt'
    with open(filename) as f:

      message = email.message_from_string(f.read())
  
      values_to_add['Subject'] = str(message['Subject'])
      values_to_add['Content'] = str(message.get_payload())
     
      date_time_obj = parser.parse(message['Date'])
      values_to_add['Week_day'] = date_time_obj.strftime("%A")
      values_to_add['Year'] = date_time_obj.strftime("%Y")
      values_to_add['Month'] = date_time_obj.strftime("%B")
      values_to_add['Hour'] = int(date_time_obj.strftime("%H"))

      # Count number of forwarded mails
      number_of_messages = 0
      for line in message.get_payload().split('\n'):
        if ('forwarded' in line.lower() or 'original' in line.lower()) and '--' in line:
            number_of_messages += 1
      values_to_add['Nb_of_forwarded_msg'] = number_of_messages
    
    row_to_add = pd.Series(values_to_add)
    data = data.append(row_to_add, ignore_index=True)

## Filter Dataframe

In [4]:
# We filter 879 rows (if Primary topics exists (i.e. if coarse genre 1.1 is selected) )
data_filtered = data[data["Target"].notnull()]

#Exclude target category with very few rows ; 812 rows remains
excluded_category = [idx_to_cat[i] for i in [11,12,13]]
data_filtered = data_filtered[data_filtered["Target"].isin(excluded_category) == False]
num_classes = len(data_filtered["Target"].value_counts())

In [5]:
column_types={       
        'Target': "category",
        "Subject": "text",
        "Content": "text",
        "Week_day": "category",
        "Month": "category",
        "Hour": "numeric",
        "Nb_of_forwarded_msg": "numeric",
        "Year": "numeric"
    }

## Training with scikit learn pipeline

In [6]:
feature_types = {i:column_types[i] for i in column_types if i!='Target'}

columns_to_scale = [key for key in feature_types.keys() if feature_types[key]=="numeric"]

numeric_transformer = Pipeline([('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])


columns_to_encode = [key for key in feature_types.keys() if feature_types[key]=="category"]

categorical_transformer = Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore',sparse=False)) ])
text_transformer = Pipeline([
                      ('vect', CountVectorizer(stop_words=stoplist)),
                      ('tfidf', TfidfTransformer())
                     ])
preprocessor = ColumnTransformer(
    transformers=[
      ('num', numeric_transformer, columns_to_scale),
      ('cat', categorical_transformer, columns_to_encode),
      ('text_Mail', text_transformer, "Content")
    ]
)

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(max_iter =1000))])

## Split train/test

In [7]:
feature_types = {i:column_types[i] for i in column_types if i!="Target"}
Y = data_filtered["Target"]
X = data_filtered.drop(columns=["Target"])
X_train,X_test,Y_train,Y_test = model_selection.train_test_split(X, Y,test_size=0.20, random_state = 30, stratify = Y)

# Learning phase

In [8]:
clf.fit(X_train, Y_train)
print("model score: %.3f" % clf.score(X_test, Y_test))

model score: 0.500


In [9]:
train_data = pd.concat([X_train, Y_train], axis=1)
test_data = pd.concat([X_test, Y_test ], axis=1)

# Upload the model in Giskard 🚀🚀🚀

## Initiate a project


In [10]:
from giskard import GiskardClient

url = "http://localhost:9000" # If Giskard is installed locally (for installation, see: https://docs.giskard.ai/start/guides/installation)
#url = "http://app.giskard.ai" # If you want to upload on giskard URL 
token = "eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJhZG1pbiIsInRva2VuX3R5cGUiOiJBUEkiLCJhdXRoIjoiUk9MRV9BRE1JTiIsImV4cCI6MTY4MTI5OTUyMH0.V7nqfl54iNTLKunwhNyUobSkgURQ2GYAkdK59bdyzLc" # you can generate your API token in the Admin tab of the Giskard application (for installation, see: https://docs.giskard.ai/start/guides/installation)

client = GiskardClient(url, token)

# your_project = client.create_project("project_key", "PROJECT_NAME", "DESCRIPTION")
# Choose the arguments you want. But "project_key" should be unique and in lower case
#enron = client.create_project("enron", "Email Classification", "Email Classification")

# If you've already created a project with the key "enron_demo" use
enron = client.get_project("enron_demo")

GiskardError: Project not found by key: enron_demo: no details

### Old way to upload your model and dataset

In [12]:
#enron.upload_model_and_df(
#    prediction_function=clf.predict_proba, # Python function which takes pandas dataframe as input and returns probabilities for classification model OR returns predictions for regression model
#    model_type='classification', # "classification" for classification model OR "regression" for regression model
#    df=test_data, # The dataset you want to use to inspect your model
#    column_types=column_types, # # A dictionary with columns names of df as key and types(category, numeric, text) of columns as values
#    target='Target', # The column name in df corresponding to the actual target variable (ground truth).
#    feature_names=list(feature_types.keys()), # List of the feature names of prediction_function
#    model_name='logistic_regression_model', # Name of the model
#    dataset_name='test_data', # Name of the dataset
#    classification_labels=clf.classes_ # List of the classification labels of your prediction
#)

### New way to upload your model and dataset

#### preprocessor and classifier as a pipeline clf

In [14]:
from giskard import Model, SKLearnModel, GiskardClient, Dataset

# Wrap your clf with SKLearnModel from Giskard
my_model = SKLearnModel(clf=clf, model_type="classification")

# Wrap your dataset with Dataset from Giskard
my_test_dataset = Dataset(test_data, name="test dataset", target="Target", column_meanings=column_types)

# save model and dataset to Giskard server
mid = my_model.save(client, "enron", validate_ds=my_test_dataset)
did = my_test_dataset.save(client, "enron")

2023-01-12 16:54:52,936 pid:1241 MainThread giskard.ml_worker.core.dataset INFO     Casting dataframe columns from {'Subject': 'object', 'Content': 'object', 'Week_day': 'object', 'Year': 'object', 'Month': 'object', 'Hour': 'object', 'Nb_of_forwarded_msg': 'object'} to {'Subject': 'object', 'Content': 'object', 'Week_day': 'object', 'Year': 'object', 'Month': 'object', 'Hour': 'object', 'Nb_of_forwarded_msg': 'object'}
2023-01-12 16:54:52,975 pid:1241 MainThread giskard.ml_worker.utils.logging INFO     Predicted dataset with shape (10, 8) executed in 0:00:00.041494
2023-01-12 16:54:52,977 pid:1241 MainThread giskard.ml_worker.core.dataset INFO     Casting dataframe columns from {'Subject': 'object', 'Content': 'object', 'Week_day': 'object', 'Year': 'object', 'Month': 'object', 'Hour': 'object', 'Nb_of_forwarded_msg': 'object'} to {'Subject': 'object', 'Content': 'object', 'Week_day': 'object', 'Year': 'object', 'Month': 'object', 'Hour': 'object', 'Nb_of_forwarded_msg': 'object'}
202

#### preprocessor as data_preparation_function

In [13]:
from giskard import Model, SKLearnModel, GiskardClient, Dataset

# Wrap your clf with SKLearnModel from Giskard
def data_preparation_function(df):
    return preprocessor.transform(df)

my_model = SKLearnModel(name="LogisticRegression",
                        clf=clf[-1],
                        model_type="classification",
                        classification_labels=list(clf.classes_),
                        data_preparation_function=data_preparation_function)

# Wrap your dataset with Dataset from Giskard
my_test_dataset = Dataset(test_data, name="test dataset", target="Target", column_meanings=column_types)

# save model and dataset to Giskard server
mid = my_model.save(client, "enron_demo", validate_ds=my_test_dataset)
did = my_test_dataset.save(client, "enron_demo")

2023-01-12 16:45:56,856 pid:319 MainThread giskard.ml_worker.core.dataset INFO     Casting dataframe columns from {'Subject': 'object', 'Content': 'object', 'Week_day': 'object', 'Year': 'object', 'Month': 'object', 'Hour': 'object', 'Nb_of_forwarded_msg': 'object'} to {'Subject': 'object', 'Content': 'object', 'Week_day': 'object', 'Year': 'object', 'Month': 'object', 'Hour': 'object', 'Nb_of_forwarded_msg': 'object'}
2023-01-12 16:45:56,903 pid:319 MainThread giskard.ml_worker.utils.logging INFO     Predicted dataset with shape (10, 8) executed in 0:00:00.048102
2023-01-12 16:45:56,905 pid:319 MainThread giskard.ml_worker.core.dataset INFO     Casting dataframe columns from {'Subject': 'object', 'Content': 'object', 'Week_day': 'object', 'Year': 'object', 'Month': 'object', 'Hour': 'object', 'Nb_of_forwarded_msg': 'object'} to {'Subject': 'object', 'Content': 'object', 'Week_day': 'object', 'Year': 'object', 'Month': 'object', 'Hour': 'object', 'Nb_of_forwarded_msg': 'object'}
2023-0

### 🌟 If you want to upload a dataset without a model


### For example, let's upload the train set in Giskard, this is key to create drift tests in Giskard.

#### Old way to upload dataset

In [None]:
#enron.upload_df(
#    df=train_data, # The dataset you want to upload
#    column_types=column_types, # All the column types of df
#    target="Target", # Do not pass this parameter if dataset doesn't contain target column
#    name="train_data" # Name of the dataset
#)

#### New way to upload dataset

In [16]:
my_train_dataset = Dataset(train_data, name="train dataset", target="Target", column_meanings=column_types)
did = my_train_dataset.save(client, "enron")

Dataset successfully uploaded to project key 'enron' with ID = dd639e32-8e8c-4b36-aa73-4b5f6114b3c2


### You can also upload new production data to use it as a validatation set for your existing model. In that case, you might not have the ground truth target variable

In [11]:

production_data = data.drop(columns="Target")


#### Old way to upload dataset

In [12]:
#enron.upload_df(
#    df=production_data, # The dataset you want to upload
#    column_types=feature_types, # All the column types without the target
#    name="production_data" # Name of the dataset
#)

#### New way to upload dataset

In [15]:
from giskard import Model, SKLearnModel, GiskardClient, Dataset
column_types.pop("Target")
my_prod_dataset = Dataset(production_data, name="prod dataset", column_meanings=column_types)
did = my_prod_dataset.save(client, "enron")

Dataset successfully uploaded to project key 'enron' with ID = b185a29a-2c68-4d5a-b4d7-d7a603fffd74


### 🌟 If you just want to upload a model without a dataframe
#### Lets try a Hugging Face pytorch model

In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
import torch
from transformers import TrainingArguments, Trainer
from transformers import BertTokenizer, BertForSequenceClassification

np.random.seed(112)

# Read data
data = data_filtered

# Define pretrained tokenizer and model
# model_name = "bert-base-uncased" # For large BERT Model
model_name = "cross-encoder/ms-marco-TinyBERT-L-2" # For tiny BERT Model

tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=4, ignore_mismatched_sizes=True)

for param in model.base_model.parameters():
    param.requires_grad = False

# ----- 1. Preprocess data -----#
# Preprocess data

X = list(data["Content"])
classification_labels_mapping = {'REGULATION': 0,'INTERNAL': 1,'CALIFORNIA CRISIS': 2,'INFLUENCE': 3}
y = list(data_filtered['Target'].map(classification_labels_mapping))

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
X_train_tokenized = tokenizer(X_train, padding=True, truncation=True, max_length=128)
X_val_tokenized = tokenizer(X_val, padding=True, truncation=True, max_length=128)

In [None]:
# Create torch dataset
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        if self.labels:
            item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])

train_dataset = Dataset(X_train_tokenized, y_train)
val_dataset = Dataset(X_val_tokenized, y_val)

# ----- 2. Fine-tune pretrained model -----#
# Define Trainer parameters
def compute_metrics(p):
    pred, labels = p
    pred = np.argmax(pred, axis=1)

    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    recall = recall_score(y_true=labels, y_pred=pred, average='macro')
    precision = precision_score(y_true=labels, y_pred=pred, average='macro')
    f1 = f1_score(y_true=labels, y_pred=pred, average='macro')

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}



In [None]:
# Define Trainer
args = TrainingArguments(
    output_dir="output",
    evaluation_strategy="steps",
    eval_steps=500,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    seed=0,
    load_best_model_at_end=True,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
    )

# Train pre-trained model
trainer.train()

In [None]:
trainer.evaluate()

In [None]:
def predict(test_dataset):
    test_dataset= test_dataset.squeeze(axis=1)
    X_test = list(test_dataset)
    X_test_tokenized = tokenizer(X_test, padding=True, truncation=True, max_length=512)

    # Create torch dataset
    test_dataset = Dataset(X_test_tokenized)

    # Define test trainer
    test_trainer = Trainer(model)

    # Make prediction
    raw_pred, _, _ = test_trainer.predict(test_dataset)
    predictions = torch.nn.functional.softmax(torch.from_numpy(raw_pred), dim=-1)
    predictions = predictions.cpu().detach().numpy()

    return predictions

In [None]:
feature_names = ['Content']
test_df = data_filtered[feature_names][:5]
predict(test_df)

In [None]:
enron.upload_model(
    prediction_function=predict, # Python function which takes pandas dataframe as input and returns probabilities for classification model OR returns predictions for regression model
    model_type='classification', # "classification" for classification model OR "regression" for regression model
    validate_df=data_filtered.head(), # Optional. Validation df is not uploaded in the app, it's only used to check whether the model has the good format
    target='Target', # Optional. target should be a column of validate_df. Pass this parameter only if validate_df is being passed
    feature_names=feature_names,# List of the feature names of prediction_function
    name='bert_pytorch_model', # Name of the model
    classification_labels=list(classification_labels_mapping.keys()) # List of the classification labels of your prediction
)