![giskard_logo.png](https://raw.githubusercontent.com/Giskard-AI/giskard/main/readme/Logo_full_darkgreen.png)

# About Giskard

Open-Source CI/CD platform for ML teams. Deliver ML products, better & faster. 

*   Collaborate faster with feedback from business stakeholders.
*   Deploy automated tests to eliminate regressions, errors & biases.

🏡 [Website](https://giskard.ai/)

📗 [Documentation](https://docs.giskard.ai/)

# Start by creating an ML model 🚀🚀🚀

Download the categorized email files from Berkeley.

In [1]:
!wget http://bailando.sims.berkeley.edu/enron/enron_with_categories.tar.gz
!tar zxf enron_with_categories.tar.gz
!rm enron_with_categories.tar.gz

--2022-07-19 16:28:55--  http://bailando.sims.berkeley.edu/enron/enron_with_categories.tar.gz
Resolving bailando.sims.berkeley.edu (bailando.sims.berkeley.edu)... 128.32.78.19
Connecting to bailando.sims.berkeley.edu (bailando.sims.berkeley.edu)|128.32.78.19|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://bailando.berkeley.edu/enron/enron_with_categories.tar.gz [following]
--2022-07-19 16:28:56--  https://bailando.berkeley.edu/enron/enron_with_categories.tar.gz
Resolving bailando.berkeley.edu (bailando.berkeley.edu)... 128.32.78.19
Connecting to bailando.berkeley.edu (bailando.berkeley.edu)|128.32.78.19|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4523350 (4.3M) [application/x-gzip]
Saving to: ‘enron_with_categories.tar.gz’


2022-07-19 16:28:59 (1.77 MB/s) - ‘enron_with_categories.tar.gz’ saved [4523350/4523350]
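
If `wget` and `tar` are not available (for example on Windows), the same download step can be sketched with the Python standard library. The function name `download_and_extract` is made up for this example; the URL is the redirect target shown in the log above.

```python
import tarfile
import tempfile
import urllib.request
from pathlib import Path

DATA_URL = "https://bailando.berkeley.edu/enron/enron_with_categories.tar.gz"


def download_and_extract(url, dest_dir="."):
    """Download a .tar.gz archive to a temporary file and unpack it into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp) / "archive.tar.gz"
        with urllib.request.urlopen(url) as response, open(archive, "wb") as out:
            out.write(response.read())
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(dest)  # only unpack archives from a source you trust
    return dest


# Uncomment to fetch the Enron archive into the current directory:
# download_and_extract(DATA_URL)
```

Like the shell cell, this leaves only the extracted `enron_with_categories` folder behind, since the archive itself lives in a temporary directory.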



In [4]:
!pip install transformers



In [5]:
!pip install giskard



In [6]:
import email
import glob
import time
import math
from collections import defaultdict, namedtuple

import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from string import punctuation

import numpy as np

import pandas as pd
import datetime
from dateutil import parser

from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import TruncatedSVD

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn import model_selection

Various imports and the list of categories from http://bailando.sims.berkeley.edu/enron/enron_categories.txt.

In [7]:
nltk.download('punkt')
nltk.download('stopwords')

stoplist = set(stopwords.words('english') + list(punctuation))
stemmer = PorterStemmer()


# http://bailando.sims.berkeley.edu/enron/enron_categories.txt
idx_to_cat = {
    1: 'REGULATION',
    2: 'INTERNAL',
    3: 'INFLUENCE',
    4: 'INFLUENCE',
    5: 'INFLUENCE',
    6: 'CALIFORNIA CRISIS',
    7: 'INTERNAL',
    8: 'INTERNAL',
    9: 'INFLUENCE',
    10: 'REGULATION',
    11: 'talking points',
    12: 'meeting minutes',
    13: 'trip reports'}

idx_to_cat2 = {
    1: 'regulations and regulators (includes price caps)',
    2: 'internal projects -- progress and strategy',
    3: ' company image -- current',
    4: 'company image -- changing / influencing',
    5: 'political influence / contributions / contacts',
    6: 'california energy crisis / california politics',
    7: 'internal company policy',
    8: 'internal company operations',
    9: 'alliances / partnerships',
    10: 'legal advice',
    11: 'talking points',
    12: 'meeting minutes',
    13: 'trip reports'}


# We use the "Primary topics" category (number 3) for the labels, because the first two
# levels provide categories that are not mutually exclusive.
# See https://bailando.berkeley.edu/enron/enron_categories.txt
LABEL_CAT = 3

# get_labels returns the labels of one email as a nested dictionary: {top_cat: {sub_cat: freq}}
def get_labels(filename):
    with open(filename + '.cats') as f:
        labels = defaultdict(dict)
        line = f.readline()
        while line:
            line = line.split(',')
            top_cat, sub_cat, freq = int(line[0]), int(line[1]), int(line[2])
            labels[top_cat][sub_cat] = freq
            line = f.readline()
    return dict(labels)


email_files = [f.replace('.cats', '') for f in glob.glob('enron_with_categories/*/*.cats')]

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/princyiakov/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/princyiakov/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
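
To make the label format concrete, here is a small sketch of how a `.cats` file turns into a target. Each line of the file is a `top_cat,sub_cat,freq` triple. The sample contents below are invented for illustration, and `parse_cats` is a stand-in that follows the same logic as `get_labels`.

```python
import os
import tempfile
from collections import defaultdict


def parse_cats(path):
    """Parse a .cats file into {top_category: {sub_category: frequency}}."""
    labels = defaultdict(dict)
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            top_cat, sub_cat, freq = (int(x) for x in line.split(','))
            labels[top_cat][sub_cat] = freq
    return dict(labels)


# Hypothetical file contents, invented for illustration only.
sample = "1,1,1\n3,6,2\n3,1,1\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.cats")
    with open(path, "w") as f:
        f.write(sample)
    parsed = parse_cats(path)

primary_topics = parsed[3]  # LABEL_CAT = 3 in the notebook
target_int = max(primary_topics, key=primary_topics.get)
print(parsed)      # {1: {1: 1}, 3: {6: 2, 1: 1}}
print(target_int)  # 6
```

With the `idx_to_cat` mapping above, sub-category 6 would become the target `'CALIFORNIA CRISIS'`.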


## Build dataframe

In [8]:
columns_name = ['Target', 'Subject', 'Content', 'Week_day', 'Year', 'Month', 'Hour', 'Nb_of_forwarded_msg']


data = pd.DataFrame(columns=columns_name)

for email_file in email_files:
    values_to_add = {}

    #Target is the sub-category with maximum frequency
    if LABEL_CAT in get_labels(email_file):
      sub_cat_dict = get_labels(email_file)[LABEL_CAT]
      target_int = max(sub_cat_dict, key=sub_cat_dict.get)
      values_to_add['Target'] = str(idx_to_cat[target_int])

    #Features are metadata from the email object
    filename = email_file+'.txt'
    with open(filename) as f:

      message = email.message_from_string(f.read())
  
      values_to_add['Subject'] = str(message['Subject'])
      values_to_add['Content'] = str(message.get_payload())
     
      date_time_obj = parser.parse(message['Date'])
      values_to_add['Week_day'] = date_time_obj.strftime("%A")
      values_to_add['Year'] = date_time_obj.strftime("%Y")
      values_to_add['Month'] = date_time_obj.strftime("%B")
      values_to_add['Hour'] = int(date_time_obj.strftime("%H"))

      # Count number of forwarded mails
      number_of_messages = 0
      for line in message.get_payload().split('\n'):
        if ('forwarded' in line.lower() or 'original' in line.lower()) and '--' in line:
            number_of_messages += 1
      values_to_add['Nb_of_forwarded_msg'] = number_of_messages
    
    # DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
    data = pd.concat([data, pd.DataFrame([values_to_add])], ignore_index=True)
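
The feature extraction can be tried on a single message without the dataset. This is a minimal sketch: the email text is invented, `count_forwarded` is a name made up here, and the standard library's `parsedate_to_datetime` is swapped in for `dateutil.parser` so the example has no extra dependency.

```python
import email
from email.utils import parsedate_to_datetime

# Hypothetical message, invented for illustration only.
raw = (
    "Date: Mon, 14 May 2001 16:39:00 -0700\n"
    "Subject: Rate case update\n"
    "\n"
    "Please see below.\n"
    "---------------------- Forwarded by Jane Doe on 05/14/2001 ----------------------\n"
    "Earlier text.\n"
    "-----Original Message-----\n"
    "Oldest text.\n"
)


def count_forwarded(payload):
    """Count separator lines that mark a forwarded or quoted message."""
    count = 0
    for line in payload.split('\n'):
        lowered = line.lower()
        if ('forwarded' in lowered or 'original' in lowered) and '--' in line:
            count += 1
    return count


message = email.message_from_string(raw)
sent = parsedate_to_datetime(message['Date'])

features = {
    'Subject': str(message['Subject']),
    'Week_day': sent.strftime("%A"),
    'Year': sent.strftime("%Y"),
    'Month': sent.strftime("%B"),
    'Hour': int(sent.strftime("%H")),
    'Nb_of_forwarded_msg': count_forwarded(message.get_payload()),
}
print(features)
```

Both separator lines match the heuristic, so `Nb_of_forwarded_msg` is 2 for this message. A body line that merely mentions "original" next to a double dash would also be counted, so the feature is a rough proxy.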

Filter Dataframe

In [9]:
# Keep the rows where a "Primary topics" label exists (i.e. coarse genre 1.1 is selected): 879 rows
data_filtered = data[data["Target"].notnull()]

# Exclude target categories with very few rows; 812 rows remain
excluded_category = [idx_to_cat[i] for i in [11, 12, 13]]
# .copy() avoids a SettingWithCopyWarning when the 'Labels' column is added later
data_filtered = data_filtered[~data_filtered["Target"].isin(excluded_category)].copy()



In [10]:
classification_labels_mapping = {'REGULATION': 0,'INTERNAL': 1,'CALIFORNIA CRISIS': 2,'INFLUENCE': 3}
data_filtered['Labels'] = data_filtered['Target'].map(classification_labels_mapping)
num_classes = len(data_filtered["Target"].value_counts())

In [11]:
column_types={       
        'Target': "category",
        "Subject": "text",
        "Content": "text",
        "Week_day": "category",
        "Month": "category",
        "Hour": "numeric",
        "Nb_of_forwarded_msg": "numeric",
        "Year": "numeric"
    }

Training with scikit learn pipeline

In [12]:
feature_types = {i:column_types[i] for i in column_types if i!='Target'}

columns_to_scale = [key for key in feature_types.keys() if feature_types[key]=="numeric"]

numeric_transformer = Pipeline([('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])


columns_to_encode = [key for key in feature_types.keys() if feature_types[key]=="category"]

categorical_transformer = Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore',sparse=False)) ])
text_transformer = Pipeline([
                      # recent scikit-learn versions expect a list (not a set) for stop_words
                      ('vect', CountVectorizer(stop_words=list(stoplist))),
                      ('tfidf', TfidfTransformer())
                     ])
preprocessor = ColumnTransformer(
    transformers=[
      ('num', numeric_transformer, columns_to_scale),
      ('cat', categorical_transformer, columns_to_encode),
      ('text_Mail', text_transformer, "Content")
    ]
)
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(max_iter =1000))])
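
To see what the text branch of the pipeline computes, here is a rough pure-Python sketch of counting followed by TF-IDF weighting. It assumes scikit-learn's default settings as I understand them (smoothed idf, L2 row normalisation); the mini-corpus and the two-word stop list are invented for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus, invented for illustration only.
docs = [
    "price caps on energy",
    "energy crisis in california",
    "internal energy policy",
]
stop_words = {"on", "in"}  # stand-in for the NLTK stop list


def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return [t for t in re.findall(r"\b\w\w+\b", text.lower()) if t not in stop_words]


counts = [Counter(tokenize(d)) for d in docs]
vocab = sorted({term for c in counts for term in c})
n_docs = len(docs)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
df = {t: sum(1 for c in counts if t in c) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}


def tfidf_row(count):
    """Weight raw counts by idf, then scale the row to unit L2 norm."""
    weights = [count[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


matrix = [tfidf_row(c) for c in counts]
for term, weight in zip(vocab, matrix[0]):
    print(f"{term:>10s} {weight:.3f}")
```

A word that appears in every document ("energy" here) gets the smallest idf, so it weighs less than rarer words such as "price" in the same row.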

Split train/test

In [13]:
feature_types = {i:column_types[i] for i in column_types if i!="Target"}
Y = data_filtered["Target"]
X = data_filtered.drop(columns=["Target","Labels"])
X_train,X_test,Y_train,Y_test = model_selection.train_test_split(X, Y,test_size=0.20, random_state = 30, stratify = Y)
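
The `stratify = Y` argument keeps the class proportions roughly the same in the train and test sets. The sketch below shows the idea in plain Python. It is not scikit-learn's actual algorithm, and the label list and the `stratified_split` helper are invented for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=30):
    """Return (train_idx, test_idx) keeping each class's share roughly equal."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label, indices in by_class.items():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical, imbalanced label list invented for illustration only.
labels = ['REGULATION'] * 50 + ['INTERNAL'] * 30 + ['INFLUENCE'] * 20
train_idx, test_idx = stratified_split(labels)

print(Counter(labels[i] for i in test_idx))
```

Each class contributes 20% of its rows to the test set, so a rare class cannot end up missing from it by chance.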

Learning phase

In [14]:
clf.fit(X_train, Y_train)
print("model score: %.3f" % clf.score(X_test, Y_test))

model score: 0.500


In [15]:
train_data = pd.concat([X_train, Y_train], axis=1)
test_data = pd.concat([X_test, Y_test ], axis=1)

# Upload the model in Giskard 🚀🚀🚀

### Initiate a project


In [17]:
from giskard.giskard_client import GiskardClient

url = "http://localhost:9000"  # if the Giskard Docker image is running on your local machine
# url = "http://app.giskard.ai"  # if you want to upload to the Giskard URL
# url = 'http://gsk1.giskard.ai:10000'
# Find your token in the Admin tab of your app (login: admin; password: admin)
token = "XXX"  # replace with your own API token; avoid committing a real token to a notebook
client = GiskardClient(url, token)

enron = client.create_project("enron_demo", "Enron Mails Classification", "Project to classify enron mails")

# If you've already created a project with the key "enron_demo", use this instead:
# enron = client.get_project("enron_demo")

### Upload your model and a dataset (see [documentation](https://docs.giskard.ai/start/guides/upload-your-model))

In [18]:
enron.upload_model_and_df(
    prediction_function=clf.predict_proba, 
    model_type='classification',
    df=test_data, #the dataset you want to use to inspect your model
    column_types=column_types, #all the column types of df
    target='Target', #the column name in df corresponding to the actual target variable (ground truth).
    feature_names=list(feature_types.keys()),#list of the feature names of prediction_function
    model_name='logistic_regression_model',
    dataset_name='test_data',
    classification_labels=clf.classes_
)

Model successfully uploaded to project key 'enron_demo' and is available at http://localhost:9000 
Dataset successfully uploaded to project key 'enron_demo' and is available at http://localhost:9000 


### 🌟 If you want to upload a dataset without a model


For example, let's upload the train set to Giskard. This is key to creating drift tests in Giskard.

In [19]:
enron.upload_df(
    df=train_data,
    column_types=column_types, #all the column types of df
    target="Target",  # do not pass this parameter if the dataset doesn't contain a target column
    name="train_data"
)

Dataset successfully uploaded to project key 'enron_demo' and is available at http://localhost:9000 


<Response [200]>

You can also upload new production data to use as a validation set for your existing model. In that case, you might not have the ground truth target variable.
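
A small helper can prepare the arguments for such an upload. This is only a sketch: `upload_kwargs` is a hypothetical function, and it assumes that `column_types` should describe only the columns actually present in the dataframe. It uses only the `upload_df` parameters shown in the cell above.

```python
def upload_kwargs(df_columns, column_types, name, target=None):
    """Build keyword arguments for upload_df, omitting the target if it is absent."""
    present = set(df_columns)
    kwargs = {
        # Assumption: only columns present in the dataframe should be described.
        "column_types": {c: t for c, t in column_types.items() if c in present},
        "name": name,
    }
    if target is not None and target in present:
        kwargs["target"] = target
    return kwargs


column_types = {"Target": "category", "Content": "text", "Hour": "numeric"}

# Hypothetical production dataframe columns: no ground-truth "Target" yet.
production_columns = ["Content", "Hour"]
kwargs = upload_kwargs(production_columns, column_types, name="production_data", target="Target")
print(kwargs)

# With a live project this could then be passed on as:
# enron.upload_df(df=production_df, **kwargs)
```

Because the `Target` column is missing from the production data, the helper drops both its type entry and the `target` argument.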

# Upload an additional PyTorch model in Giskard 🚀🚀🚀


In [20]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
import torch
from transformers import TrainingArguments, Trainer
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import EarlyStoppingCallback

np.random.seed(112)
# Read data
data = data_filtered

# Define pretrained tokenizer and model
# model_name = "bert-base-uncased"
model_name = "cross-encoder/ms-marco-TinyBERT-L-2"

tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=4, ignore_mismatched_sizes=True)

for param in model.base_model.parameters():
    param.requires_grad = False

# ----- 1. Preprocess data -----#
# Preprocess data
# le = preprocessing.LabelEncoder()
# y = list(le.fit_transform(data_filtered.Target))
X = list(data["Content"])
y = list(data["Labels"])

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
X_train_tokenized = tokenizer(X_train, padding=True, truncation=True, max_length=128)
X_val_tokenized = tokenizer(X_val, padding=True, truncation=True, max_length=128)

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at cross-encoder/ms-marco-TinyBERT-L-2 and are newly initialized because the shapes did not match:
- classifier.weight: found shape torch.Size([1, 128]) in the checkpoint and torch.Size([4, 128]) in the model instantiated
- classifier.bias: found shape torch.Size([1]) in the checkpoint and torch.Size([4]) in the model instantiated
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [21]:
# Create torch dataset
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        if self.labels is not None:  # explicit check; also safe if labels is an array
            item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])

train_dataset = Dataset(X_train_tokenized, y_train)
val_dataset = Dataset(X_val_tokenized, y_val)

# ----- 2. Fine-tune pretrained model -----#
# Define Trainer parameters
def compute_metrics(p):
    pred, labels = p
    pred = np.argmax(pred, axis=1)

    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    recall = recall_score(y_true=labels, y_pred=pred, average='macro')
    precision = precision_score(y_true=labels, y_pred=pred, average='macro')
    f1 = f1_score(y_true=labels, y_pred=pred, average='macro')

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
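
The `average='macro'` option computes each metric per class and then takes the unweighted mean, so small classes count as much as large ones. Here is a plain-Python sketch of that calculation on invented labels. Undefined ratios are counted as 0, which is also what scikit-learn does by default (with a warning).

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return (precision, recall, f1), each averaged equally over classes."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    def safe_div(num, den):
        return num / den if den else 0.0  # undefined ratios count as 0

    precisions, recalls, f1s = [], [], []
    for c in classes:
        p = safe_div(tp[c], tp[c] + fp[c])
        r = safe_div(tp[c], tp[c] + fn[c])
        precisions.append(p)
        recalls.append(r)
        f1s.append(safe_div(2 * p * r, p + r))
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical labels: the model never predicts class 3.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 0, 1, 2, 2, 2, 2]
precision, recall, f1 = macro_scores(y_true, y_pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.542 0.625 0.533
```

A class that is never predicted pulls all three macro scores down, which is why macro F1 can sit well below accuracy for a weak model.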



In [27]:
# Define Trainer
args = TrainingArguments(
    output_dir="output",
    evaluation_strategy="steps",
    eval_steps=500,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    seed=0,
    load_best_model_at_end=True,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
    )

# Train pre-trained model
trainer.train()

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running training *****
  Num examples = 679
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 85


Step,Training Loss,Validation Loss




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=85, training_loss=1.4155163933249082, metrics={'train_runtime': 2.6551, 'train_samples_per_second': 255.737, 'train_steps_per_second': 32.014, 'total_flos': 215799714816.0, 'train_loss': 1.4155163933249082, 'epoch': 1.0})
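
The numbers in the training log can be checked by hand. The sketch below assumes a single device and no gradient accumulation, as the log reports.

```python
import math

num_examples = 679        # "Num examples" in the training log
batch_size = 8            # per_device_train_batch_size
num_epochs = 1            # num_train_epochs
eval_steps = 500          # eval_steps

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
intermediate_evals = total_steps // eval_steps

print(total_steps)         # 85
print(intermediate_evals)  # 0
```

Since training stops after 85 steps and evaluation is only scheduled every 500 steps, no evaluation runs during training. That is probably why the `Step,Training Loss,Validation Loss` header in the output above has no rows under it.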

In [28]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 170
  Batch size = 8


  _warn_prf(average, modifier, msg_start, len(result))


{'eval_loss': 1.3968162536621094,
 'eval_accuracy': 0.2529411764705882,
 'eval_precision': 0.2430161943319838,
 'eval_recall': 0.2691769114456643,
 'eval_f1': 0.16330063431721686,
 'eval_runtime': 0.4811,
 'eval_samples_per_second': 353.393,
 'eval_steps_per_second': 45.733,
 'epoch': 1.0}
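
It helps to put the accuracy in context. The quick check below uses only the numbers from the evaluation output and compares them with uniform random guessing over four classes.

```python
num_eval_examples = 170          # "Num examples" in the evaluation log
eval_accuracy = 0.2529411764705882
num_classes = 4

correct = round(eval_accuracy * num_eval_examples)
chance_level = 1 / num_classes

print(correct)                               # 43
print(eval_accuracy - chance_level < 0.01)   # True
```

So 43 of 170 validation emails are classified correctly, which is barely above the 25% expected from guessing. That is not surprising here: the base model's parameters are frozen and only the new classification head is trained, for a single epoch.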

In [None]:
# saving the fine tuned model & tokenizer
# model_path = "saved_bert_model"
# model.save_pretrained(model_path)
# tokenizer.save_pretrained(model_path)
# model = BertForSequenceClassification.from_pretrained(model_path, num_labels=4)



In [29]:
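def predict(test_dataset):
    """Return class probabilities, one row per text in test_dataset."""
    # Iterating a DataFrame yields its column names, so select the text column explicitly
    # (assumption: Giskard may pass a DataFrame restricted to feature_names)
    if isinstance(test_dataset, pd.DataFrame):
        test_dataset = test_dataset['Content']

    X_test = list(test_dataset)
    X_test_tokenized = tokenizer(X_test, padding=True, truncation=True, max_length=512)

    # Wrap the encodings in a torch dataset (no labels at prediction time)
    torch_dataset = Dataset(X_test_tokenized)

    # A Trainer without arguments is enough to run inference
    test_trainer = Trainer(model)

    # Convert the raw logits to probabilities with a softmax over the class axis
    raw_pred, _, _ = test_trainer.predict(torch_dataset)
    predictions = torch.nn.functional.softmax(torch.from_numpy(raw_pred), dim=-1)
    return predictions.numpy()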
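
The softmax step is what turns the model's raw scores (logits) into the probabilities Giskard expects. Here is a dependency-free sketch of the same operation on one row; the logits are invented for illustration.

```python
import math


def softmax(logits):
    """Turn one row of raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for the four classes, invented for illustration only.
row = [0.2, 0.0, 0.1, -0.1]
probs = softmax(row)
print([round(p, 3) for p in probs])
print(sum(probs))
```

The ordering of the scores is preserved, so the class with the largest logit also gets the largest probability.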

In [30]:
predict(data_filtered['Content'])

No `TrainingArguments` passed, using `output_dir=tmp_trainer`.
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running Prediction *****
  Num examples = 849
  Batch size = 8


array([[0.2768008 , 0.23026586, 0.2449776 , 0.2479557 ],
       [0.29211152, 0.24161239, 0.25983658, 0.20643948],
       [0.2935181 , 0.21655953, 0.25857565, 0.23134674],
       ...,
       [0.305729  , 0.20803015, 0.28140607, 0.20483476],
       [0.3236544 , 0.20453224, 0.24126975, 0.2305436 ],
       [0.29627258, 0.21610978, 0.2564075 , 0.23121022]], dtype=float32)
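
Each row of this array holds one probability per class, in the order defined by `classification_labels_mapping`. A short sketch of reading the predicted label from the first three rows printed above:

```python
classification_labels_mapping = {
    'REGULATION': 0, 'INTERNAL': 1, 'CALIFORNIA CRISIS': 2, 'INFLUENCE': 3,
}
index_to_label = {idx: name for name, idx in classification_labels_mapping.items()}

# First three rows of the array printed above.
probabilities = [
    [0.2768008, 0.23026586, 0.2449776, 0.2479557],
    [0.29211152, 0.24161239, 0.25983658, 0.20643948],
    [0.2935181, 0.21655953, 0.25857565, 0.23134674],
]

predicted = []
for row in probabilities:
    best = max(range(len(row)), key=row.__getitem__)  # index of the largest probability
    predicted.append(index_to_label[best])
    print(index_to_label[best], round(row[best], 3))
```

All three rows pick `REGULATION` with a probability under 0.3, which fits a model that is still close to guessing.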

In [31]:
column_types['Labels'] = column_types.pop('Target')

In [32]:
enron.upload_model_and_df(
    prediction_function=predict, 
    model_type='classification',
    df=data_filtered.drop('Target', axis=1), #the dataset you want to use to inspect your model
    column_types=column_types, #all the column types of df
    target='Labels', #the column name in df corresponding to the actual target variable (ground truth).
    feature_names=['Content'],#list of the feature names of prediction_function
    model_name='bert_pytorch_model',
    dataset_name='data_with_labels_for_bert',
    classification_labels=list(classification_labels_mapping.values())
)

No `TrainingArguments` passed, using `output_dir=tmp_trainer`.
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
***** Running Prediction *****
  Num examples = 1
  Batch size = 8


Model successfully uploaded to project key 'enron_demo' and is available at http://localhost:9000 
Dataset successfully uploaded to project key 'enron_demo' and is available at http://localhost:9000 
