# About Giskard

Open-source CI/CD platform for ML teams. Deliver ML products better and faster.

*   Collaborate faster with feedback from business stakeholders.
*   Deploy automated tests to eliminate regressions, errors & biases.

🏡 [Website](https://giskard.ai/)

📗 [Documentation](https://docs.giskard.ai/)

# Install the giskard preview package

In [1]:
!pip install ./preview-dist/giskard-1.8.0-py3-none-any.whl
!pip install great_expectations

Processing ./preview-dist/giskard-1.8.0-py3-none-any.whl
Collecting mlflow@ git+https://github.com/Giskard-AI/mlflow.git
  Cloning https://github.com/Giskard-AI/mlflow.git to /private/var/folders/nf/w2h_y58n4sxdyjmq4qlwcxym0000gn/T/pip-install-yypu3ryu/mlflow_be501ba7e34942f5984e1cd82bd29da2
  Running command git clone --filter=blob:none --quiet https://github.com/Giskard-AI/mlflow.git /private/var/folders/nf/w2h_y58n4sxdyjmq4qlwcxym0000gn/T/pip-install-yypu3ryu/mlflow_be501ba7e34942f5984e1cd82bd29da2
  Resolved https://github.com/Giskard-AI/mlflow.git to commit 7580281a40c27aeb8f962f392e171619d6574c37
  Running command git submodule update --init --recursive -q
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Installing collected packages: giskard
  Attempting uninstall: giskard
    Found existing installation: giskard 1.8.0
    Uninstalling giskard-1.8.0:
 

## Installing other packages

In [2]:
!pip install transformers torch nltk



In [3]:
import giskard
giskard.__version__

UDF repository is not available because the 'GIT_REPOSITORY' environment variable is not set


'1.8.0'

## Connect the external worker in daemon mode

In [4]:
!giskard worker start -u http://localhost:19000 -k 'YOUR_API_KEY_HERE' -d

UDF repository is not available because the 'GIT_REPOSITORY' environment variable is not set


# Start by creating an ML model 🚀🚀🚀

Download the categorized email files from Berkeley.

In [5]:
!wget http://bailando.sims.berkeley.edu/enron/enron_with_categories.tar.gz
!tar zxf enron_with_categories.tar.gz
!rm enron_with_categories.tar.gz

--2023-03-02 12:18:29--  http://bailando.sims.berkeley.edu/enron/enron_with_categories.tar.gz
Resolving bailando.sims.berkeley.edu (bailando.sims.berkeley.edu)... 128.32.78.19
Connecting to bailando.sims.berkeley.edu (bailando.sims.berkeley.edu)|128.32.78.19|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://bailando.berkeley.edu/enron/enron_with_categories.tar.gz [following]
--2023-03-02 12:18:32--  https://bailando.berkeley.edu/enron/enron_with_categories.tar.gz
Resolving bailando.berkeley.edu (bailando.berkeley.edu)... 128.32.78.19
Connecting to bailando.berkeley.edu (bailando.berkeley.edu)|128.32.78.19|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4523350 (4.3M) [application/x-gzip]
Saving to: ‘enron_with_categories.tar.gz’


2023-03-02 12:18:39 (778 KB/s) - ‘enron_with_categories.tar.gz’ saved [4523350/4523350]



In [6]:
import email
import glob

from collections import defaultdict

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from string import punctuation

import pandas as pd
from dateutil import parser
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer


from sklearn.linear_model import LogisticRegression
from sklearn import model_selection

Various imports and the list of categories from http://bailando.sims.berkeley.edu/enron/enron_categories.txt.

In [8]:
nltk.download('punkt')
nltk.download('stopwords')

stoplist = set(stopwords.words('english') + list(punctuation))
stemmer = PorterStemmer()


# http://bailando.sims.berkeley.edu/enron/enron_categories.txt
idx_to_cat = {
    1: 'REGULATION',
    2: 'INTERNAL',
    3: 'INFLUENCE',
    4: 'INFLUENCE',
    5: 'INFLUENCE',
    6: 'CALIFORNIA CRISIS',
    7: 'INTERNAL',
    8: 'INTERNAL',
    9: 'INFLUENCE',
    10: 'REGULATION',
    11: 'talking points',
    12: 'meeting minutes',
    13: 'trip reports'}

idx_to_cat2 = {
    1: 'regulations and regulators (includes price caps)',
    2: 'internal projects -- progress and strategy',
    3: 'company image -- current',
    4: 'company image -- changing / influencing',
    5: 'political influence / contributions / contacts',
    6: 'california energy crisis / california politics',
    7: 'internal company policy',
    8: 'internal company operations',
    9: 'alliances / partnerships',
    10: 'legal advice',
    11: 'talking points',
    12: 'meeting minutes',
    13: 'trip reports'}


LABEL_CAT = 3  # We use top-level category 3, "Primary topics", because the first two levels provide categories that are not mutually exclusive. See: https://bailando.berkeley.edu/enron/enron_categories.txt

# get_labels returns a dictionary representation of these labels.
def get_labels(filename):
    with open(filename + '.cats') as f:
        labels = defaultdict(dict)
        line = f.readline()
        while line:
            line = line.split(',')
            top_cat, sub_cat, freq = int(line[0]), int(line[1]), int(line[2])
            labels[top_cat][sub_cat] = freq
            line = f.readline()
    return dict(labels)


email_files = [f.replace('.cats', '') for f in glob.glob('enron_with_categories/*/*.cats')]

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/kevinmessiaen/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/kevinmessiaen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
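
As `get_labels` shows, each `.cats` file holds one `top_category,sub_category,frequency` triple per line. The sketch below runs the same parsing on an in-memory sample and picks the most frequent "Primary topics" sub-category. The sample values are made up for illustration, and `parse_cats` is a hypothetical helper rather than part of this notebook.

```python
import io
from collections import defaultdict


def parse_cats(lines):
    """Parse 'top,sub,freq' lines into {top_cat: {sub_cat: freq}}."""
    labels = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        top_cat, sub_cat, freq = (int(part) for part in line.split(','))
        labels[top_cat][sub_cat] = freq
    return dict(labels)


# Hypothetical file content: category 3 ("Primary topics") has two sub-categories
sample = "1,1,2\n3,6,2\n3,1,1\n4,10,1\n"
labels = parse_cats(io.StringIO(sample))

# The target is the sub-category of category 3 with the highest frequency
primary_topics = labels[3]
target = max(primary_topics, key=primary_topics.get)
print(labels)
print(target)
```

An email with no line for category 3 has no entry in the returned dictionary, which is why the notebook checks `LABEL_CAT in ...` before reading it.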


## Build dataframe

In [9]:
columns_name = ['Target', 'Subject', 'Content', 'Week_day', 'Year', 'Month', 'Hour', 'Nb_of_forwarded_msg']

# Collect one dict per email, then build the dataframe once at the end
rows = []

for email_file in email_files:
    values_to_add = {}

    # Target is the sub-category with maximum frequency
    labels = get_labels(email_file)
    if LABEL_CAT in labels:
        sub_cat_dict = labels[LABEL_CAT]
        target_int = max(sub_cat_dict, key=sub_cat_dict.get)
        values_to_add['Target'] = str(idx_to_cat[target_int])

    # Features are metadata from the email object
    filename = email_file + '.txt'
    with open(filename) as f:
        message = email.message_from_string(f.read())

    values_to_add['Subject'] = str(message['Subject'])
    values_to_add['Content'] = str(message.get_payload())

    date_time_obj = parser.parse(message['Date'])
    values_to_add['Week_day'] = date_time_obj.strftime("%A")
    values_to_add['Year'] = date_time_obj.strftime("%Y")
    values_to_add['Month'] = date_time_obj.strftime("%B")
    values_to_add['Hour'] = int(date_time_obj.strftime("%H"))

    # Count number of forwarded mails
    number_of_messages = 0
    for line in message.get_payload().split('\n'):
        if ('forwarded' in line.lower() or 'original' in line.lower()) and '--' in line:
            number_of_messages += 1
    values_to_add['Nb_of_forwarded_msg'] = number_of_messages

    rows.append(values_to_add)

# DataFrame.append was removed in pandas 2.0; emails without a Target get NaN in that column
data = pd.DataFrame(rows, columns=columns_name)
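
To make the feature extraction above concrete, here is a minimal sketch on a single in-memory message. The raw email text is invented for illustration, and the standard-library `email.utils.parsedate_to_datetime` is swapped in for `dateutil.parser.parse` so the snippet has no third-party dependency.

```python
import email
from email.utils import parsedate_to_datetime

# Hypothetical message; real files come from enron_with_categories/*/*.txt
raw = """Date: Mon, 14 May 2001 16:39:00 -0700
Subject: Demo subject

Hello,
---------------------- Forwarded by Someone on 05/14/2001 ----------------------
Body text
-----Original Message-----
More text
"""

message = email.message_from_string(raw)
sent_at = parsedate_to_datetime(message['Date'])
body = message.get_payload()

# A line counts as a forwarded/original marker when it mentions either word
# and also contains a dashed separator
nb_forwarded = sum(
    1
    for line in body.split('\n')
    if ('forwarded' in line.lower() or 'original' in line.lower()) and '--' in line
)

features = {
    'Subject': str(message['Subject']),
    'Week_day': sent_at.strftime('%A'),
    'Year': sent_at.strftime('%Y'),
    'Month': sent_at.strftime('%B'),
    'Hour': sent_at.hour,
    'Nb_of_forwarded_msg': nb_forwarded,
}
print(features)
```

The marker count is a heuristic: any body line that happens to contain "original" and a double dash is counted, whether or not it is a real forwarding header.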

## Filter Dataframe

In [10]:
# Filter on rows where Primary topics exists (i.e. coarse genre 1.1 is selected): 879 rows
data_filtered = data[data["Target"].notnull()]

# Exclude target categories with very few rows; 812 rows remain
excluded_category = [idx_to_cat[i] for i in [11, 12, 13]]
data_filtered = data_filtered[~data_filtered["Target"].isin(excluded_category)]
num_classes = data_filtered["Target"].nunique()

In [11]:
column_types={       
        'Target': "category",
        "Subject": "text",
        "Content": "text",
        "Week_day": "category",
        "Month": "category",
        "Hour": "numeric",
        "Nb_of_forwarded_msg": "numeric",
        "Year": "numeric"
    }

## Training with a scikit-learn pipeline

In [12]:
feature_types = {i:column_types[i] for i in column_types if i!='Target'}

columns_to_scale = [key for key in feature_types.keys() if feature_types[key]=="numeric"]

numeric_transformer = Pipeline([('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])


columns_to_encode = [key for key in feature_types.keys() if feature_types[key]=="category"]

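categorical_transformer = Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        # `sparse` was renamed to `sparse_output` in scikit-learn 1.2; use that name on recent versions
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse=False))])
text_transformer = Pipeline([
                      # pass a list: recent scikit-learn versions reject a set for stop_words
                      ('vect', CountVectorizer(stop_words=list(stoplist))),
                      ('tfidf', TfidfTransformer())
                     ])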
preprocessor = ColumnTransformer(
    transformers=[
      ('num', numeric_transformer, columns_to_scale),
      ('cat', categorical_transformer, columns_to_encode),
      ('text_Mail', text_transformer, "Content")
    ]
)

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(max_iter =1000))])

## Split train/test

In [13]:
feature_types = {i:column_types[i] for i in column_types if i!="Target"}
Y = data_filtered["Target"]
X = data_filtered.drop(columns=["Target"])
X_train,X_test,Y_train,Y_test = model_selection.train_test_split(X, Y,test_size=0.20, random_state = 30, stratify = Y)

# Learning phase

In [14]:
clf.fit(X_train, Y_train)
print("model score: %.3f" % clf.score(X_test, Y_test))

model score: 0.500


In [15]:
train_data = pd.concat([X_train, Y_train], axis=1)
test_data = pd.concat([X_test, Y_test ], axis=1)

# Upload the model to Giskard 🚀🚀🚀

## Initiate a project


In [16]:
from giskard import GiskardClient
from giskard.ml_worker.core.suite import Suite

url = "http://localhost:19000" # If Giskard is installed locally (for installation, see: https://docs.giskard.ai/start/guides/installation)
#url = "http://app.giskard.ai" # If you want to upload to the Giskard-hosted URL
token = "YOUR_API_TOKEN_HERE" # Generate your API token in the Admin tab of the Giskard application (for installation, see: https://docs.giskard.ai/start/guides/installation)

client = GiskardClient(url, token)

# your_project = client.create_project("project_key", "PROJECT_NAME", "DESCRIPTION")
# Choose the arguments you want. But "project_key" should be unique and in lower case
enron = client.create_project("enron_test", "Email Classification", "Email Classification")

# If you've already created a project with the key "enron_test", use
#enron = client.get_project("enron_test")
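
Rather than pasting the API token into the notebook, the connection settings can be read from the environment. The sketch below is one way to do that; the variable names `GISKARD_URL` and `GISKARD_TOKEN` and the helper `load_giskard_settings` are assumptions made for this example, not names defined by Giskard.

```python
import os


def load_giskard_settings(env=None):
    """Return (url, token) read from environment variables.

    GISKARD_URL and GISKARD_TOKEN are hypothetical names chosen for this sketch.
    """
    env = os.environ if env is None else env
    url = env.get("GISKARD_URL", "http://localhost:19000")
    token = env.get("GISKARD_TOKEN")
    if not token:
        raise RuntimeError("Set the GISKARD_TOKEN environment variable first")
    return url, token


# Usage with the client created above (left commented out so the sketch
# does not need a running server):
# url, token = load_giskard_settings()
# client = GiskardClient(url, token)
```

Keeping the token out of the notebook means it is not saved with the cell source when the notebook is shared.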

### New way to upload your model and dataset

#### Preprocessor and classifier as a pipeline `clf`

In [17]:
from giskard import Model, SKLearnModel, GiskardClient, Dataset

# Wrap your clf with SKLearnModel from Giskard
my_model = SKLearnModel(clf=clf, model_type="classification")

# Wrap your dataset with Dataset from Giskard
my_test_dataset = Dataset(test_data, name="test dataset", target="Target", feature_types=column_types)

# save model and dataset to Giskard server
mid = my_model.upload(client, "enron_test", validate_ds=my_test_dataset)
did = my_test_dataset.save(client, "enron_test")



Model successfully uploaded to project key 'enron_test' with ID = 7bce419e-88e9-4bfa-bc91-ee506a736066
Dataset successfully uploaded to project key 'enron_test' with ID = eec31fa1-97bd-41bf-9927-6dbf7fdf38b7


# Create uniqueness and data quality test classes

In [18]:
from giskard.ml_worker.core.test_result import TestResult
from giskard.ml_worker.testing.registry.giskard_test import GiskardTest
import great_expectations as ge

class UniquenessTest(GiskardTest):

    def __init__(self):
        super().__init__()

    def set_params(self, dataset: Dataset = None, column_name: str = None):
        self.dataset = dataset
        self.column_name = column_name

        return self

    def execute(self) -> TestResult:
        dataframe = ge.from_pandas(self.dataset.df)
        uniqueness = dataframe.expect_column_values_to_be_unique(column=self.column_name)
        passed = uniqueness["success"]
        metric = uniqueness["result"]["element_count"]
        return TestResult(passed=passed, metric=metric)

class DataQuality(GiskardTest):

    def __init__(self):
        super().__init__()

    def set_params(self,
                   dataset: Dataset = None,
                   threshold: float = 0.5,
                   column_name: str = None,
                   category: str = None):
        self.dataset = dataset
        self.threshold = threshold
        self.column_name = column_name
        self.category = category

        return self

    def execute(self) -> TestResult:
        freq_of_cat = self.dataset.df[self.column_name].value_counts()[self.category]/ (len(self.dataset.df))
        passed = freq_of_cat < self.threshold

        return TestResult(passed=passed, metric=freq_of_cat)
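
Stripped of the Giskard and Great Expectations wrappers, the two checks reduce to a uniqueness test and a category-frequency test. The sketch below reproduces that logic on plain lists; the helper names and the sample data are hypothetical.

```python
from collections import Counter


def is_unique(values):
    """True when no value appears more than once."""
    return len(set(values)) == len(values)


def category_frequency(values, category):
    """Share of entries equal to `category` (0.0 if it never appears)."""
    return Counter(values)[category] / len(values)


# Made-up column values for illustration
subjects = ["Hello", "Re: Hello", "Hello"]
months = ["August", "May", "August", "June"]

unique_ok = is_unique(subjects)
august_share = category_frequency(months, "August")
quality_ok = august_share < 0.5  # same strict comparison as DataQuality.execute

print(unique_ok, august_share, quality_ok)
```

One difference from the class above: `value_counts()[category]` raises a `KeyError` when the category is absent from the column, whereas this sketch returns a frequency of 0.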

# Run a suite with those two tests

In [19]:
suite = (
    Suite()
    .add_test(UniquenessTest().set_params(column_name='Subject'), "uniq")
    .add_test(DataQuality().set_params(column_name='Month', category='August'), "quality")
)
passed, results = suite.run(dataset=my_test_dataset)

print(f"Result: {passed}")
print(f"UniquenessTest: {results['uniq'].passed} {results['uniq'].metric}")
print(f"DataQuality: {results['quality'].passed} {results['quality'].metric}")


Result: False
UniquenessTest: False 170
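
The run above returns an overall flag together with per-test results keyed by the names given to `add_test`. The sketch below imitates that shape with a small stand-in class. `MiniSuite` and `Result` are hypothetical, and the rule that the overall flag is true only when every test passes is an assumption for this illustration, not a description of Giskard's implementation.

```python
from dataclasses import dataclass


@dataclass
class Result:
    passed: bool
    metric: float


class MiniSuite:
    """Hypothetical stand-in for a chained test suite."""

    def __init__(self):
        self._tests = []

    def add_test(self, test, name):
        self._tests.append((name, test))
        return self  # returning self is what allows the chained calls

    def run(self, data):
        results = {name: test(data) for name, test in self._tests}
        # Assumed aggregation rule: every test must pass
        return all(r.passed for r in results.values()), results


def uniqueness(rows):
    return Result(len(set(rows)) == len(rows), len(rows))


def hello_share_below_90_percent(rows):
    share = rows.count("Hello") / len(rows)
    return Result(share < 0.9, share)


subjects = ["Hello", "Re: Hello", "Hello"]
suite_passed, suite_results = (
    MiniSuite()
    .add_test(uniqueness, "uniq")
    .add_test(hello_share_below_90_percent, "quality")
    .run(subjects)
)
print(suite_passed, suite_results["uniq"].passed, suite_results["quality"].passed)
```

Looking up each result under its own name keeps the per-test report distinct even when the overall flag is false because of a single failing test.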


# Now let's upload the test suite

In [20]:
(
    Suite()
    .add_test(UniquenessTest().set_params(column_name='Subject'), "uniq")
    .add_test(DataQuality().set_params(column_name='Month', category='August'), "quality")
    .save(client, 'enron_test')
)

<giskard.ml_worker.core.suite.Suite at 0x177059760>

Now we can go to the project tab "Test suite new" to see and execute the test suite. The two tests will also be available in the Giskard catalog.
