# Technical exercise - Data scientist intern @ Giskard

Hi! As part of our recruitment process, we’d like you to complete the following technical test in 10 days. Once you finish the exercise, you can send your notebook or share your code repository by email (matteo@giskard.ai). If you want to share a private GitHub repository, make sure you give read access to `mattbit`.

If you have problems running the notebook, get in touch with Matteo at matteo@giskard.ai.

In [1]:
%pip install numpy pandas scikit-learn datasets transformers torch "giskard>=2.0.0b"

Note: you may need to restart the kernel to use updated packages.


## Exercise 1: Code review

Your fellow intern is working on securing our API and wrote some code to generate secure tokens. You have been asked to review their code and make sure it is secure and robust. Can you spot the problem and write some short feedback?

In [2]:
import random

ALPHABET = "abcdefghijklmnopqrstuvxyz0123456789"


def generate_secret_key(size: int = 20):
    """Generates a cryptographically secure random token."""
    token = "".join(random.choice(ALPHABET) for _ in range(size))
    return token


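There are two problems. The minor one is that the letter "w" is missing from `ALPHABET`, which slightly reduces the number of possible tokens.

The main one is that the `random` module is not suitable for security purposes. It uses a deterministic pseudo-random generator (Mersenne Twister), so someone who recovers the initial seed, or observes enough of its output, could regenerate the token. The docstring claims the token is cryptographically secure, but it is not.
The `secrets` module, which draws from the operating system's secure randomness source, should be used instead.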

In [3]:
import secrets

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

def generate_secret_key(size: int = 20):
    """Generates a cryptographically secure random token."""
    token = ''.join(secrets.choice(ALPHABET) for _ in range(size))
    return token
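
The predictability concern can be illustrated with a minimal sketch. The helper `weak_token` and the seed value are hypothetical, introduced only for this demonstration: two `random.Random` generators created with the same seed produce the same token, whereas `secrets` offers no seed to replay.

```python
import random
import secrets

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"


def weak_token(seed: int, size: int = 20) -> str:
    """Hypothetical helper: builds a token from a seeded (predictable) generator."""
    rng = random.Random(seed)
    return "".join(rng.choice(ALPHABET) for _ in range(size))


# Same seed, same token: anyone who learns the seed can replay the output
first = weak_token(seed=1234)
replayed = weak_token(seed=1234)
print(first == replayed)  # True

# secrets draws from the OS randomness source and cannot be seeded
strong = "".join(secrets.choice(ALPHABET) for _ in range(20))
print(len(strong))  # 20
```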


## Exercise 2: High dimensions

Matteo, our ML researcher, is struggling with a dataset of 40-dimensional points. He’s sure there are some clusters in there, but he does not know how many. Can you help him find the correct number of clusters in this dataset?

In [4]:
# Silhouette Score

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

x = np.load("points_1.npy")

limit = int((x.shape[0]//2)**0.5)
scoreList = []

for k in range(2, limit+1):
    model = KMeans(n_clusters=k, n_init=10)
    model.fit(x)
    pred = model.predict(x)
    score = silhouette_score(x, pred)
    scoreList.append(score)
    
print("It looks like there are {} clusters.".format(scoreList.index(max(scoreList))+2))


It looks like there are 11 clusters.
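
One way to sanity-check this procedure is to run it on synthetic data where the number of clusters is known. The sketch below assumes 5 well-separated Gaussian blobs in 40 dimensions generated with `make_blobs`; the sample size, the range of `k`, and the random seeds are arbitrary choices, not values from the exercise.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for points_1.npy: 5 known clusters in 40 dimensions
points, _ = make_blobs(n_samples=300, n_features=40, centers=5, random_state=0)

# Score each candidate number of clusters with the mean silhouette coefficient
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores[k] = silhouette_score(points, labels)

# The candidate with the highest silhouette score is taken as the estimate
best_k = max(scores, key=scores.get)
print("Best k by silhouette score:", best_k)
```

If the procedure recovers the known number of clusters here, that lends some confidence to the result on the real dataset, though it does not prove it.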


Matteo is grateful for your help with finding the clusters, and he has another problem for you. He has another high-dimensional dataset, but he thinks that those points could be represented in a lower-dimensional space. Can you help him determine how many dimensions would be enough to represent the data well?

In [5]:
# Principal Component Analysis

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

x = np.load("points_2.npy")
xScaled = StandardScaler().fit_transform(x)

pca = PCA()
pca.fit(xScaled)

print(pca.explained_variance_)
print(np.cumsum(pca.explained_variance_ratio_))


[12.0712989   8.69575874  6.18598189  4.22223949  0.93416246  0.58704625
  0.48948339  0.46406602  0.40220908  0.39272047  0.35745547  0.34110437
  0.31023658  0.29937102  0.28709567  0.28278631  0.2671307   0.25265965
  0.24244703  0.22468491  0.21587419  0.20995291  0.20514762  0.19230804
  0.18833865  0.17475562  0.16809787  0.16193516  0.14479163  0.14299042
  0.13401393  0.1208438   0.10968856  0.10352973  0.09752006  0.09012386
  0.0839506   0.07465291  0.0671299   0.0444562 ]
[0.30148069 0.51865726 0.67315216 0.77860259 0.8019333  0.81659478
 0.82881963 0.84040968 0.85045485 0.86026304 0.86919049 0.87770958
 0.88545773 0.89293452 0.90010474 0.90716733 0.91383892 0.92014909
 0.92620421 0.93181571 0.93720717 0.94245074 0.94757431 0.9523772
 0.95708096 0.96144548 0.96564372 0.96968805 0.97330422 0.97687541
 0.98022241 0.98324048 0.98597995 0.98856561 0.99100117 0.99325201
 0.99534868 0.99721314 0.99888971 1.        ]


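As we can see, the first 5 principal components explain around 80% of the variance.
If we want to explain around 95% of the variance, we would need 24 principal components.
Note, however, that the explained variance drops sharply after the fourth component (from 4.22 to 0.93): the first 4 components already explain about 78% of the variance, and the remaining ones each contribute very little, so they may mostly be noise.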
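
The component counts can be read off programmatically rather than by eye. The sketch below reuses the first values printed above; `components_needed` is a hypothetical helper name. It also computes an "elbow" heuristic (the largest relative drop between consecutive eigenvalues), which is an addition for illustration rather than part of the original analysis.

```python
import numpy as np

# First 8 eigenvalues and first 24 cumulative ratios, copied from the output above
explained_variance = np.array([12.0712989, 8.69575874, 6.18598189, 4.22223949,
                               0.93416246, 0.58704625, 0.48948339, 0.46406602])
cumulative_ratio = np.array([
    0.30148069, 0.51865726, 0.67315216, 0.77860259, 0.8019333, 0.81659478,
    0.82881963, 0.84040968, 0.85045485, 0.86026304, 0.86919049, 0.87770958,
    0.88545773, 0.89293452, 0.90010474, 0.90716733, 0.91383892, 0.92014909,
    0.92620421, 0.93181571, 0.93720717, 0.94245074, 0.94757431, 0.9523772,
])


def components_needed(cumulative: np.ndarray, threshold: float) -> int:
    """Smallest number of components whose cumulative ratio reaches the threshold.

    Assumes the threshold is reached somewhere within the given array.
    """
    return int(np.searchsorted(cumulative, threshold, side="left")) + 1


# The largest relative drop between consecutive eigenvalues marks the "elbow"
drop_ratios = explained_variance[:-1] / explained_variance[1:]
elbow = int(np.argmax(drop_ratios)) + 1

print(components_needed(cumulative_ratio, 0.80))  # 5
print(components_needed(cumulative_ratio, 0.95))  # 24
print(elbow)  # 4
```

scikit-learn can apply the threshold approach directly: passing a float between 0 and 1 as `n_components` to `PCA` selects the number of components needed to reach that fraction of explained variance.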

In [6]:
# pca = PCA(n_components=5)
# xReduced = pca.fit_transform(xScaled) 

# print(xReduced.shape)

## Exercise 3: Mad GPT

Matteo is a good guy but he is a bit messy: he fine-tuned a GPT-2 model, but it seems that something went wrong during the process and the model became obsessed with early Romantic literature.

Could you check how the model would continue a sentence starting with “Ty”? Could you recover the logit of the next best token? And its probability?

You can get the model from the HuggingFace Hub as `mattbit/gpt2wb`.


In [7]:
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("mattbit/gpt2wb")

inputIds = tokenizer.encode("Ty", return_tensors="pt")
output = model.generate(inputIds, max_length=100)
outputText = tokenizer.decode(output[0], skip_special_tokens=True)

print(outputText)  

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Tyger Tyger, burning bright, 
In the forests of the night; 
What immortal hand or eye, 
Could frame thy fearful symmetry?
In what distant deeps or skies. 
Burnt the fire of thine eyes?
On what wings dare he aspire?
What the hand, dare seize the fire?
And what shoulder, & what art,
Could twist the sinews of thy heart?
And when thy heart began to beat.



In [8]:
import torch

inputs = tokenizer("Ty", return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Logits for the token that follows the prompt (last position in the sequence)
nextTokenLogits = outputs.logits[0, -1]
probabilities = torch.softmax(nextTokenLogits, dim=-1)

# Top 2 candidates: index 0 is the best token, index 1 is the next best one
topValues, topIndices = torch.topk(probabilities, k=2)
outputText = tokenizer.decode(topIndices[1], skip_special_tokens=True)

print("logit : {}, output text: {}, probability : {}".format(nextTokenLogits[topIndices[1]].item(), outputText, topValues[1].item()))

logit : -22.899076461791992, output text: gers, probability : 0.0013438640162348747
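
The relationship between a logit and its probability can be sketched with a toy example. The four logits below are made-up values, not outputs of `mattbit/gpt2wb`, and NumPy is used instead of PyTorch to keep the example small. It also shows why a logit as negative as the one above can still belong to a top-ranked token: softmax depends only on the differences between logits.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Convert logits to probabilities, subtracting the max for numerical stability."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


toy_logits = np.array([2.0, 1.0, 0.1, -1.0])  # hypothetical 4-token vocabulary
probs = softmax(toy_logits)

# Rank tokens from most to least likely; position 1 is the second-best token
ranking = np.argsort(probs)[::-1]
second_best = int(ranking[1])
print(second_best, toy_logits[second_best], probs[second_best])

# Shifting every logit by the same constant leaves the probabilities unchanged
shifted_probs = softmax(toy_logits - 25.0)
print(np.allclose(probs, shifted_probs))  # True
```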


## Exercise 4: Not bad reviews


We trained a random forest model to predict if a film review is positive or negative. Here is the training code:

In [24]:
import datasets

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline


# Load training data
train_data = datasets.load_dataset("sst2", split="train[:20000]").to_pandas()
valid_data = datasets.load_dataset("sst2", split="validation").to_pandas()

# Prepare model
with open("stopwords.txt", "r") as f:
    stopwords = [w.strip() for w in f.readlines()]

preprocessor = TfidfVectorizer(stop_words=stopwords, max_features=5000, lowercase=False) #look into this 
classifier = RandomForestClassifier(n_estimators=400, n_jobs=-1)

model = Pipeline([("preprocessor", preprocessor), ("classifier", classifier)])

# Train
X = train_data.sentence
y = train_data.label

model.fit(X, y)

print(
    "Training complete.",
    "Accuracy:",
    model.score(valid_data.sentence, valid_data.label),
)


Training complete. Accuracy: 0.7431192660550459


Overall, it works quite well, but we noticed it has some problems with reviews containing negations, for example:

In [25]:
# Class labels are:
# 1 = Positive, 0 = Negative
testPreprocesser = TfidfVectorizer(stop_words=stopwords, max_features=5000, lowercase=False)

# this returns positive, that’s right!
testPreprocesser.fit_transform(["this movie is good"])
print(testPreprocesser.get_feature_names_out())
assert model.predict(["this movie is good"]) == [1]

# negative! bingo!
testPreprocesser.fit_transform(["this movie is bad"])
print(testPreprocesser.get_feature_names_out())
assert model.predict(["this movie is bad"]) == [0]

# WHOOPS! this ↓ is predicted as negative?! uhm…
testPreprocesser.fit_transform(["this movie is not bad at all!"])
print(testPreprocesser.get_feature_names_out())
assert model.predict(["this movie is not bad at all!"]) == [1]

# WHOOPS! this ↓ is predicted as negative?! why?
testPreprocesser.fit_transform(["this movie is not perfect, but very good!"])
print(testPreprocesser.get_feature_names_out())
assert model.predict(["this movie is not perfect, but very good!"]) == [1]


['good']
['bad']
['bad' 'not']


AssertionError: 

Can you help us understand what is going on? Do you have any idea on how to fix it?
You can edit the code above.

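One of the problems comes from the model itself. Decision trees, and by extension random forests, split on one feature at a time, starting with the features that best separate the classes. Understanding a negation requires context: the trees would have to learn an interaction between "not" and the word that follows it, which is hard to do reliably with this type of model.

The preprocessing makes this worse. The TF-IDF vectorizer keeps the most frequent words (up to `max_features`) and weights them by frequency, treating each sentence as a bag of individual words, so word order is lost: "not bad" becomes the two unrelated features "not" and "bad". We could try to take the position of the words into account, for example by adding n-grams (`ngram_range=(1, 2)`) so that "not bad" becomes a feature of its own, and by checking that negation words are neither in the stop-word list nor cut by `max_features`.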

## Exercise 5: Model weaknesses


The Giskard python library provides an automatic scanner to find weaknesses and vulnerabilities in ML models.

Using this tool, could you identify some issues in the movie classification model above? Can you propose hypotheses about what is causing these issues?

Then, choose one of the issues you just found and try to improve the model to mitigate or resolve it — just one, no need to spend the whole weekend over it!

You can find a quickstart here: https://docs.giskard.ai/en/latest/getting-started/quickstart.html