# ENRON email classification [sklearn]

<div class="alert alert-info">
What is Giskard?

Giskard is an open-source testing framework dedicated to ML models, ranging from tabular to LLM. [To know more about Giskard, click here](https://docs.giskard.ai/en/latest/getting-started/index.html).
</div>

By running this notebook, you'll create a whole test suite in a few lines of code. The model used here is a scikit-learn classification model. It is used to predict categories of emails in the ENRON dataset.

You'll learn how to:

- Detect vulnerabilities by scanning the model

- Generate a test suite with domain-specific tests

- Customize your test suite by loading a test from the Giskard catalog

- Upload your model to the Giskard server to:

    - Compare models to decide which one to promote

    - Debug your tests to diagnose issues

    - Share your results and collect business feedback from your team


## Install Giskard

To see the list of Python requirements, please refer to [the documentation](https://docs.giskard.ai/en/latest/guides/installation_library/index.html).

In [None]:
pip install "giskard>=2.0.0b" -U

## Import libraries

In [5]:
import email
import glob
from collections import defaultdict
from string import punctuation

import nltk
import pandas as pd
from dateutil import parser
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn import model_selection
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

import giskard

## Import data and load it into Giskard

### Import data

In [None]:
!wget http://bailando.sims.berkeley.edu/enron/enron_with_categories.tar.gz
!tar zxf enron_with_categories.tar.gz
!rm enron_with_categories.tar.gz
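On systems without `wget` or `tar` (for example, Windows), the same download can be sketched with the Python standard library. The helper names `download_archive` and `extract_archive` are hypothetical and not part of the original notebook; the URL is the one used above.

```python
import tarfile
import urllib.request
from pathlib import Path

ENRON_URL = "http://bailando.sims.berkeley.edu/enron/enron_with_categories.tar.gz"


def download_archive(url, destination):
    """Download `url` to `destination` unless the file already exists."""
    destination = Path(destination)
    if not destination.exists():
        urllib.request.urlretrieve(url, str(destination))
    return destination


def extract_archive(archive, target_dir):
    """Extract a .tar.gz archive into `target_dir` and return the member names."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(archive), "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(str(target_dir))
    return names
```

Calling `extract_archive(download_archive(ENRON_URL, "enron_with_categories.tar.gz"), ".")` should have the same effect as the first two shell commands above; unlike the `rm` step, this sketch keeps the archive on disk.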

### Pre-process and filter data

In [None]:
nltk.download('punkt')
nltk.download('stopwords')

stoplist = set(stopwords.words('english') + list(punctuation))
stemmer = PorterStemmer()

# http://bailando.sims.berkeley.edu/enron/enron_categories.txt
idx_to_cat = {
    1: 'REGULATION',
    2: 'INTERNAL',
    3: 'INFLUENCE',
    4: 'INFLUENCE',
    5: 'INFLUENCE',
    6: 'CALIFORNIA CRISIS',
    7: 'INTERNAL',
    8: 'INTERNAL',
    9: 'INFLUENCE',
    10: 'REGULATION',
    11: 'talking points',
    12: 'meeting minutes',
    13: 'trip reports'}

idx_to_cat2 = {
    1: 'regulations and regulators (includes price caps)',
    2: 'internal projects -- progress and strategy',
    3: 'company image -- current',
    4: 'company image -- changing / influencing',
    5: 'political influence / contributions / contacts',
    6: 'california energy crisis / california politics',
    7: 'internal company policy',
    8: 'internal company operations',
    9: 'alliances / partnerships',
    10: 'legal advice',
    11: 'talking points',
    12: 'meeting minutes',
    13: 'trip reports'}

# We use top-level category 3 ("Primary topics") as the label source because the first two
# top-level categories provide labels that are not mutually exclusive.
# See: https://bailando.berkeley.edu/enron/enron_categories.txt
LABEL_CAT = 3


# get_labels returns the labels of one email as {top_category: {sub_category: frequency}}.
def get_labels(filename):
    with open(filename + '.cats') as f:
        labels = defaultdict(dict)
        line = f.readline()
        while line:
            line = line.split(',')
            top_cat, sub_cat, freq = int(line[0]), int(line[1]), int(line[2])
            labels[top_cat][sub_cat] = freq
            line = f.readline()
    return dict(labels)


email_files = [f.replace('.cats', '') for f in glob.glob('enron_with_categories/*/*.cats')]

columns_name = ['Target', 'Subject', 'Content', 'Week_day', 'Year', 'Month', 'Hour', 'Nb_of_forwarded_msg']

data = pd.DataFrame(columns=columns_name)

for email_file in email_files:
    values_to_add = {}

    # Target is the sub-category with the maximum frequency
    labels = get_labels(email_file)
    if LABEL_CAT in labels:
        sub_cat_dict = labels[LABEL_CAT]
        target_int = max(sub_cat_dict, key=sub_cat_dict.get)
        values_to_add['Target'] = str(idx_to_cat[target_int])

    #Features are metadata from the email object
    filename = email_file + '.txt'
    with open(filename) as f:

        message = email.message_from_string(f.read())

        values_to_add['Subject'] = str(message['Subject'])
        values_to_add['Content'] = str(message.get_payload())

        date_time_obj = parser.parse(message['Date'])
        values_to_add['Week_day'] = date_time_obj.strftime("%A")
        values_to_add['Year'] = date_time_obj.strftime("%Y")
        values_to_add['Month'] = date_time_obj.strftime("%B")
        values_to_add['Hour'] = int(date_time_obj.strftime("%H"))

        # Count number of forwarded mails
        number_of_messages = 0
        for line in message.get_payload().split('\n'):
            if ('forwarded' in line.lower() or 'original' in line.lower()) and '--' in line:
                number_of_messages += 1
        values_to_add['Nb_of_forwarded_msg'] = number_of_messages

    # DataFrame.append was removed in pandas 2.0; concatenate a one-row frame instead
    data = pd.concat([data, pd.DataFrame([values_to_add])], ignore_index=True)

# Keep only the rows that have a "Primary topics" label (i.e. coarse genre 1.1 is selected): 879 rows
data_filtered = data[data["Target"].notnull()]

# Exclude target categories with very few rows; 812 rows remain
excluded_category = [idx_to_cat[i] for i in [11, 12, 13]]
data_filtered = data_filtered[~data_filtered["Target"].isin(excluded_category)]
num_classes = len(data_filtered["Target"].value_counts())

# Keep only the email column and the target
data_filtered = data_filtered[["Content", "Target"]]
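To make the label and feature extraction above easier to follow, here is a minimal, self-contained sketch run on in-memory samples instead of the downloaded files. The sample `.cats` content, the sample message, and the helper names `parse_cats` and `count_forwarded` are hypothetical. It also uses the standard library's `parsedate_to_datetime` rather than `dateutil`, which is a substitution for illustration only.

```python
import email
from collections import defaultdict
from email.utils import parsedate_to_datetime

# Excerpt of the idx_to_cat mapping defined above
IDX_TO_CAT = {1: "REGULATION", 6: "CALIFORNIA CRISIS"}

# Hypothetical `.cats` content: one "top_category,sub_category,frequency" per line
SAMPLE_CATS = "1,1,2\n3,6,2\n3,1,1\n"

# Hypothetical raw message, shaped like the `.txt` files in the archive
SAMPLE_EMAIL = (
    "Date: Mon, 14 May 2001 16:39:00 -0700\n"
    "Subject: Price caps\n"
    "\n"
    "FYI.\n"
    "---------------------- Forwarded by Jane Doe on 05/14/2001 ----------------------\n"
    "Please see the note below.\n"
    "-----Original Message-----\n"
    "The regulator is considering price caps.\n"
)


def parse_cats(text):
    """Return {top_category: {sub_category: frequency}} for one `.cats` file."""
    labels = defaultdict(dict)
    for line in text.splitlines():
        if line.strip():
            top_cat, sub_cat, freq = (int(x) for x in line.split(","))
            labels[top_cat][sub_cat] = freq
    return dict(labels)


def count_forwarded(payload):
    """Count separator lines that mention a forwarded or original message."""
    return sum(
        1
        for line in payload.split("\n")
        if ("forwarded" in line.lower() or "original" in line.lower()) and "--" in line
    )


# Target: the most frequent sub-category within top-level category 3
sub_cats = parse_cats(SAMPLE_CATS)[3]
target = IDX_TO_CAT[max(sub_cats, key=sub_cats.get)]

# Features: metadata read from the parsed message
message = email.message_from_string(SAMPLE_EMAIL)
sent_at = parsedate_to_datetime(message["Date"])
features = {
    "Subject": str(message["Subject"]),
    "Hour": sent_at.hour,
    "Year": sent_at.strftime("%Y"),
    "Nb_of_forwarded_msg": count_forwarded(message.get_payload()),
}
```

Note that the forwarded-message count is a heuristic: any line that contains `--` together with the word "forwarded" or "original" is counted, whether or not it is a real separator.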

### Wrap your dataset into Giskard

In [18]:
column_types = {"Content": "text"}

dataset = giskard.Dataset(df=data_filtered, target="Target", column_types=column_types)

## Create your model & wrap it into Giskard

### Train your model

In [20]:
# feature_types is used to declare the features the model is trained on
feature_types = {i: column_types[i] for i in column_types if i != "Target"}

# Train/test split
Y = data_filtered["Target"]
X = data_filtered.drop(columns=["Target"])[list(feature_types.keys())]
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=0.20, random_state=30, stratify=Y)

# Train model

# Pipeline for text transformer
text_transformer = Pipeline([
    ('vect', CountVectorizer(stop_words=stoplist)),
    ('tfidf', TfidfTransformer())
])
preprocessor = ColumnTransformer(
    transformers=[
        ('text_Mail', text_transformer, "Content")
    ]
)

# Pipeline for the model Logistic Regression
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(max_iter=1000))])

# Fit and score your model
clf.fit(X_train, Y_train)
print("Global model score: %.3f" % clf.score(X_test, Y_test))

Global model score: 0.559
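The `vect` and `tfidf` steps of the pipeline turn each email into a weighted bag of words. As a rough illustration, the pure-Python sketch below applies the weighting that `TfidfTransformer` uses by default: smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. The toy corpus and the function name `tfidf` are hypothetical, and tokenization is simplified to lowercasing and whitespace splitting, which differs from `CountVectorizer`.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


# Toy corpus, for illustration only
corpus = [
    "energy price caps",
    "energy meeting agenda",
    "energy crisis california",
]
vectors = tfidf(corpus)
```

In this toy corpus, "energy" appears in every document, so it receives a lower weight than terms such as "price" that appear in only one.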


### Wrap your model in Giskard

In [21]:
model = giskard.Model(
    model=clf,
    model_type="classification",
    name="enron_email_classification",
    classification_threshold=0.5
)

## Scan your model to find vulnerabilities

With the Giskard scan feature, you can detect vulnerabilities in your model, including *performance biases*, *unrobustness*, *data leakage*, *stochasticity*, *underconfidence*, *ethical issues*, and *more*. For detailed information about the scan feature, please refer to our scan [documentation](https://docs.giskard.ai/en/latest/guides/scan/index.html).

In [None]:
results = giskard.scan(model, dataset)

In [23]:
display(results)  # in your notebook

As you can see above, the scan may detect various vulnerabilities and displays:

* Data slices showing underperformance, underconfidence, overconfidence or spurious correlations
* Data transformations creating robustness or ethical issues
* Examples making some tests fail


## Generate a test suite from the Scan

The objects produced by the scan can be used as fixtures to generate a test suite that integrates domain-specific issues. To create custom tests, refer to the [Test your ML Model](https://docs.giskard.ai/en/latest/guides/test-suite/index.html) page.

In [24]:
test_suite = results.generate_test_suite("My first test suite")

# You can run the test suite locally to verify that it reproduces the issues
test_suite.run()

Executed 'Invariance to “Punctuation Removal”' with arguments {'model': <giskard.models.sklearn.SKLearnModel object at 0x28eefb7f0>, 'dataset': <giskard.datasets.base.Dataset object at 0x176e7bcd0>, 'transformation_function': <giskard.scanner.robustness.text_transformations.TextPunctuationRemovalTransformation object at 0x2b8596bc0>, 'threshold': 0.95, 'output_sensitivity': 0.05}: 
               Test failed
               Metric: 0.94
                - [TestMessageLevel.INFO] 848 rows were perturbed
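One plausible reading of the output above is that the metric is the share of perturbed rows whose prediction stays the same after punctuation removal, and that the test fails because 0.94 is below the 0.95 threshold. The sketch below illustrates that reading only; it is not Giskard's implementation. The helper `invariance_rate` and the stand-in model `toy_predict` are hypothetical.

```python
from string import punctuation


def remove_punctuation(text):
    """Strip ASCII punctuation characters from `text`."""
    return text.translate(str.maketrans("", "", punctuation))


def invariance_rate(predict, texts, transform):
    """Share of perturbed texts whose prediction is unchanged by `transform`."""
    # Only rows that the transformation actually modifies are considered
    perturbed = [(t, transform(t)) for t in texts if transform(t) != t]
    if not perturbed:
        return 1.0
    unchanged = sum(predict(original) == predict(changed) for original, changed in perturbed)
    return unchanged / len(perturbed)


def toy_predict(text):
    """Hypothetical stand-in for the model: keys on a literal '?' character."""
    return "INTERNAL" if "?" in text else "REGULATION"


texts = ["Price caps approved.", "Can we meet today?", "No punctuation here", "Draft, v2"]
rate = invariance_rate(toy_predict, texts, remove_punctuation)
passed = rate >= 0.95  # same threshold as in the output above
```

Here three of the four texts are perturbed and one of them changes prediction, so the toy model fails the check in the same way the real model does.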
               


## Customize your suite by loading objects from the Giskard catalog

The Giskard open-source catalog lets you load:

* **Tests** such as metamorphic, performance, prediction & data drift, statistical tests, etc
* **Slicing functions** such as detectors of toxicity, hate, emotion, etc
* **Transformation functions** such as generators of typos, paraphrase, style tune, etc

For demo purposes, we will load a simple unit test ([test_right_label](https://docs.giskard.ai/en/latest/reference/tests/statistic.html#giskard.testing.test_right_label)) that checks if a given row (the first example) has the right label. For more examples of tests and functions, refer to the [Giskard catalog](https://docs.giskard.ai/en/latest/guides/catalog/index.html).

In [None]:
# For the test_right_label test we are adding, all the parameters are specified except model.
# This means that we need to specify model every time we run the suite: model is a global parameter of the suite.
# classification_label must be one of the model's labels (here, one of the categories in idx_to_cat).
suite = test_suite \
    .add_test(
    giskard.testing.test_right_label(dataset=dataset.iloc[[1]], classification_label="INTERNAL", threshold=1)).run()
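Based on the description above, a right-label test can be thought of as checking that the share of rows predicted as the expected label reaches the threshold. The sketch below shows that logic on plain Python lists. It is an assumption about the behavior, not Giskard's code, and the function name `check_right_label` is hypothetical.

```python
def check_right_label(predictions, expected_label, threshold):
    """Pass when the share of predictions equal to `expected_label` reaches `threshold`."""
    rate = sum(p == expected_label for p in predictions) / len(predictions)
    return {"passed": rate >= threshold, "metric": rate}


# With a single row and threshold=1, the check passes only if that row gets the expected label
single_row_ok = check_right_label(["INTERNAL"], "INTERNAL", threshold=1)
single_row_ko = check_right_label(["REGULATION"], "INTERNAL", threshold=1)

# With several rows, the threshold sets the minimum share of matching predictions
several_rows = check_right_label(["INTERNAL", "INTERNAL", "INFLUENCE", "INTERNAL"], "INTERNAL", threshold=0.7)
```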

## Upload your suite to the Giskard server

<div class="alert alert-warning">
Install Giskard Server

To upload your suite to the Giskard Server you must first run the Giskard Server. Refer to the [documentation](https://docs.giskard.ai/en/latest/guides/installation_app/index.html).
</div>

Upload your suite to the Giskard server to:

- Compare models to decide which model to promote
- Debug your tests to diagnose the issues
- Create more domain-specific tests that are integrating business feedback
- Share your results

In [None]:
# Uploading the test suite will automatically save the model, dataset, tests, slicing & transformation functions inside the Giskard UI server
# Create a Giskard client after installing the Giskard server (see documentation)
token = "API_TOKEN"  # Find it in Settings in the Giskard server
client = giskard.GiskardClient(
    url="http://localhost:19000",  # URL of your Giskard instance
    token=token
)

my_project = client.create_project("my_project", "PROJECT_NAME", "DESCRIPTION")

# Upload to the current project
test_suite.upload(client, "my_project")

<div class="alert alert-info">
Connecting Google Colab with the Giskard server

If you are using Google Colab and you want to install the Giskard server **locally**, you can run the Giskard server by executing this line in the terminal of your **local** machine (see the [documentation](https://docs.giskard.ai/en/latest/guides/installation_app/index.html)):

> giskard server start

Once the Giskard server is running, from the same terminal on your **local** machine, you can run:

> giskard server expose

This will provide you with the code snippets that you can copy and paste into your Colab notebook to establish a connection with your locally installed Giskard server.
</div>