# Technical exercise - Data scientist intern @ Giskard

Hi! As part of our recruitment process, we’d like you to complete the following technical test in 10 days. Once you finish the exercise, you can send your notebook or share your code repository by email (matteo@giskard.ai). If you want to share a private GitHub repository, make sure you give read access to `mattbit`.

If you have problems running the notebook, get in touch with Matteo at matteo@giskard.ai.

In [1]:
%pip install numpy pandas scikit-learn datasets transformers torch "giskard>=2.0.0b"


## Exercise 1: Code review

Your fellow intern is working on securing our API and wrote some code to generate secure tokens. You have been asked to review their code and make sure it is secure and robust. Can you spot the problem and write a short feedback?

In [2]:
import random

ALPHABET = "abcdefghijklmnopqrstuvxyz0123456789"


def generate_secret_key(size: int = 20):
    """Generates a cryptographically secure random token."""
    token = "".join(random.choice(ALPHABET) for _ in range(size))
    return token


Python's `random` module is not suitable for generating tokens or keys: it uses a deterministic pseudo-random generator (the Mersenne Twister) whose output can be predicted once enough values have been observed. For security-sensitive values, use the `secrets` module, which draws from the operating system's cryptographically secure random source.

A corrected version of the function:

In [26]:
import secrets
import string

ALPHABET = string.ascii_lowercase + string.digits


def modified_generate_secret_key(size: int = 20):
    """Generates a cryptographically secure random token."""
    token = ''.join(secrets.choice(ALPHABET) for _ in range(size))
    return token

This version uses `secrets.choice()` from the `secrets` module, which is designed for generating cryptographically secure random values for passwords, account authentication, and tokens.

Additionally, `ALPHABET` is now built from `string.ascii_lowercase` and `string.digits` rather than typed by hand, so it is guaranteed to contain every lowercase letter and digit. This matters here: the hand-written string in the original code omits the letter `w`.
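
As a quick check of the typo risk, the hand-typed alphabet from the first version can be compared with the one built from `string`. This is a minimal sketch; `ORIGINAL_ALPHABET` and `EXPECTED_ALPHABET` are names introduced here for illustration only.

```python
import string

# Alphabet exactly as typed in the original function under review.
ORIGINAL_ALPHABET = "abcdefghijklmnopqrstuvxyz0123456789"
# Alphabet built from the standard library constants.
EXPECTED_ALPHABET = string.ascii_lowercase + string.digits

# Characters that should be present but are missing from the hand-typed string.
missing = sorted(set(EXPECTED_ALPHABET) - set(ORIGINAL_ALPHABET))
print(missing)  # ['w']
print(len(ORIGINAL_ALPHABET), len(EXPECTED_ALPHABET))  # 35 36
```

A smaller alphabet slightly reduces the number of possible tokens, but the use of `random` remains the more serious problem.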

In [45]:
print(modified_generate_secret_key())

yqsnti1cvv79txnmogr6


## Exercise 2: High dimensions

Matteo, our ML researcher, is struggling with a dataset of 40-dimensional points. He’s sure there are some clusters in there, but he does not know how many. Can you help him find the correct number of clusters in this dataset?

In [None]:
import numpy as np

x = np.load("points_1.npy")


# ...

print("It looks like there are ??? clusters.")
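
One possible approach, sketched here on synthetic data rather than on `points_1.npy`, is to fit k-means for several candidate values of k and keep the value with the highest mean silhouette score. The blob layout, the candidate range, and the variable names below are assumptions made for illustration. Other criteria, such as the elbow method or a Gaussian mixture with BIC, could give a different answer on real data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the real data: 3 well-separated blobs in 40 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(3, 40))
points = np.vstack([c + rng.normal(size=(100, 40)) for c in centers])

# Score each candidate number of clusters with the mean silhouette coefficient.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# Keep the candidate with the highest score.
best_k = max(scores, key=scores.get)
print(f"Best silhouette score at k={best_k}")
```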


Matteo is grateful for your help with the clusters, and he has another problem for you. He has another high-dimensional dataset, and he thinks its points could be represented in a lower-dimensional space. Can you help him determine how many dimensions are enough to represent the data well?

In [None]:
import numpy as np

x = np.load("points_2.npy")

# ...

print("It looks like the data is ???-dimensional")
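
A common way to estimate the intrinsic dimension is to look at how much variance each principal component explains. The sketch below runs PCA through an SVD on synthetic data, not on `points_2.npy`. The 5 latent dimensions, the noise level, and the 99% variance threshold are all assumptions chosen for illustration. This linear approach would not detect a curved, nonlinear manifold.

```python
import numpy as np

# Synthetic stand-in: 500 points with 5 latent dimensions embedded in 40-D, plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
embedding = rng.normal(size=(5, 40))
points = latent @ embedding + 0.01 * rng.normal(size=(500, 40))

# PCA via SVD of the centered data: squared singular values are proportional
# to the variance captured by each principal component.
centered = points - points.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)
explained = singular_values**2 / np.sum(singular_values**2)

# Smallest number of components explaining at least 99% of the variance.
n_dims = int(np.searchsorted(np.cumsum(explained), 0.99) + 1)
print(f"{n_dims} components explain at least 99% of the variance")
```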


## Exercise 3: Mad GPT

Matteo is a good guy but he is a bit messy: he fine-tuned a GPT-2 model, but it seems that something went wrong during the process and the model became obsessed with early Romantic literature.

Could you check how the model would continue a sentence starting with “Ty”? Could you recover the logit of the next best token? And its probability?

You can get the model from the HuggingFace Hub as `mattbit/gpt2wb`.


In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("mattbit/gpt2wb")

# ...
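
The relationship between a logit and a probability is the softmax function: each logit is exponentiated and divided by the sum over the whole vocabulary. The sketch below shows this on made-up logits for a 4-token vocabulary, not on the output of `mattbit/gpt2wb`. With `transformers`, the next-token logits are typically read from the last position of the model output, for example `model(**inputs).logits[0, -1]`.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    # Subtract the maximum logit for numerical stability; the result is unchanged.
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a 4-token vocabulary (not taken from the real model).
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits)

# The most likely next token is the one with the highest logit.
best = max(range(len(logits)), key=lambda i: logits[i])
print(best, logits[best], probs[best])
```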


## Exercise 4: Not bad reviews


We trained a random forest model to predict if a film review is positive or negative. Here is the training code:

In [None]:
import datasets

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


# Load training data
train_data = datasets.load_dataset("sst2", split="train[:20000]").to_pandas()
valid_data = datasets.load_dataset("sst2", split="validation").to_pandas()

# Prepare model
with open("stopwords.txt", "r") as f:
    stopwords = [w.strip() for w in f.readlines()]

preprocessor = TfidfVectorizer(stop_words=stopwords, max_features=5000, lowercase=False)
classifier = RandomForestClassifier(n_estimators=400, n_jobs=-1)

model = Pipeline([("preprocessor", preprocessor), ("classifier", classifier)])

# Train
X = train_data.sentence
y = train_data.label

model.fit(X, y)

print(
    "Training complete.",
    "Accuracy:",
    model.score(valid_data.sentence, valid_data.label),
)


Overall, it works quite well, but we noticed it has some problems with reviews containing negations, for example:

In [None]:
# Class labels are:
# 1 = Positive, 0 = Negative

# this returns positive, that’s right!
assert model.predict(["This movie is good"]) == [1]

# negative! bingo!
assert model.predict(["This movie is bad"]) == [0]

# WHOOPS! this ↓ is predicted as negative?! uhm…
assert model.predict(["This movie is not bad at all!"]) == [1]

# WHOOPS! this ↓ is predicted as negative?! why?
assert model.predict(["This movie is not perfect, but very good!"]) == [1]


Can you help us understand what is going on? Do you have any idea on how to fix it?
You can edit the code above.
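
One hypothesis worth checking, stated as an assumption because the contents of `stopwords.txt` are not shown: if the stop-word list contains negation words such as "not", they are removed before the classifier sees them, so "not bad" reduces to "bad". Even when "not" is kept, unigram TF-IDF features ignore word order. The following pure-Python sketch illustrates the effect. It is a simplification and not scikit-learn's tokenizer; `tokenize`, `WITH_NOT`, and `WITHOUT_NOT` are illustrative names.

```python
def tokenize(text, stop_words, ngram_max=1):
    """Lowercase, split on whitespace, drop stop words, and build n-grams."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    words = [w for w in words if w and w not in stop_words]
    features = []
    for n in range(1, ngram_max + 1):
        features += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return features


# Hypothetical stop-word lists; the real stopwords.txt is not shown here.
WITH_NOT = {"this", "is", "at", "all", "not"}
WITHOUT_NOT = {"this", "is", "at", "all"}

review = "This movie is not bad at all!"
print(tokenize(review, WITH_NOT))                  # ['movie', 'bad']
print(tokenize(review, WITHOUT_NOT, ngram_max=2))  # ['movie', 'not', 'bad', 'movie not', 'not bad']
```

In the pipeline above, the corresponding changes would be to remove negations from the stop-word list and to pass `ngram_range=(1, 2)` to `TfidfVectorizer`, so that "not bad" becomes a feature of its own.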

## Exercise 5: Model weaknesses


The Giskard Python library provides an automatic scanner to find weaknesses and vulnerabilities in ML models.

Using this tool, could you identify some issues in the movie classification model above? Can you propose hypotheses about what is causing these issues?

Then, choose one of the issues you just found and try to improve the model to mitigate or resolve it — just one, no need to spend the whole weekend over it!

You can find a quickstart here: https://docs.giskard.ai/en/latest/getting-started/quickstart.html