In this notebook we will go over:
1. Creating a TextData object and auto calculating properties
2. Data integrity checks
3. Drift and model evaluation checks

## Load Data

In [1]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
from deepchecks.nlp.text_data import TextData

In this tutorial we will use the tweet emotion dataset, which contains tweets and metadata on the users who wrote them. </br>
Our goal is to build a model that, given a tweet, classifies its emotion into one of 4 categories: 'happiness', 'anger', 'optimism' and 'sadness'.

In [2]:
from deepchecks.nlp.datasets.classification import tweet_emotion

train, test = tweet_emotion.load_data(data_format='DataFrame')
train.head(3)

Unnamed: 0,text,user_age,gender,days_on_platform,user_region,label
2,No but that's so cute. Atsu was probably shy a...,24.97,Male,2729,Middle East/Africa,happiness
3,Rooneys fucking untouchable isn't he? Been fuc...,21.66,Male,1376,Asia Pacific,anger
7,Tiller and breezy should do a collab album. Ra...,37.29,Female,3853,Americas,happiness


## Create TextData Objects (A Deepchecks' Artifact)

Deepchecks' TextData object contains the text samples, labels and, optionally, properties and metadata. </br>
It caches results to save time on repeated computations and provides functionality for input validation and sampling.

In [3]:
train = TextData(train.text, label=train['label'], task_type='text_classification',
                 index=train.index, metadata=train.drop(columns=['label', 'text']))
test = TextData(test.text, label=test['label'], task_type='text_classification',
                index=test.index, metadata=test.drop(columns=['label', 'text']))

## Calculating Properties

Some of Deepchecks' checks use properties of the text samples for various calculations. </br>
Deepchecks has a wide variety of such properties, some simple and some that rely on external models and are heavier to run. </br>
For Deepchecks' checks to be able to access the properties, they must be stored within the TextData object.

In [4]:
# Properties can either be calculated directly by Deepchecks or imported from other sources in the appropriate format

# train.calculate_default_properties(include_long_calculation_properties=True)
# test.calculate_default_properties(include_long_calculation_properties=True)

train.set_properties(pd.read_csv('train_properties.csv', index_col=0))
test.set_properties(pd.read_csv('test_properties.csv', index_col=0))

train.properties.head(2)

Unnamed: 0,Text Length,Average Word Length,Max Word Length,% Special Characters,Language,Sentiment,Subjectivity,Toxicity,Fluency,Formality
2,94,4.277778,8,0.021277,en,0.0,0.75,0.009497,0.349153,0.204132
3,102,6.923077,18,0.04902,en,-0.8,0.9,0.995803,0.176892,0.036638
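
To make the simpler properties more concrete, here is a minimal sketch of how a few of them could be computed in plain Python. It is an illustration only: the function name `simple_text_properties` is hypothetical, and the exact definitions (for example, what counts as a "special" character) are assumptions that may differ from Deepchecks' own implementation.

```python
def simple_text_properties(text: str) -> dict:
    """Compute a few cheap, surface-level properties of one text sample."""
    words = text.split()
    word_lengths = [len(w) for w in words]
    # Assumption: a "special" character is anything that is neither
    # alphanumeric nor whitespace. Deepchecks may define this differently.
    special = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return {
        "Text Length": len(text),
        "Average Word Length": sum(word_lengths) / len(words) if words else 0.0,
        "Max Word Length": max(word_lengths, default=0),
        "% Special Characters": special / len(text) if text else 0.0,
    }


props = simple_text_properties("So happy today!! #blessed")
print(props)
```

Properties such as Toxicity, Fluency or Formality cannot be computed this way; they rely on external models, which is why they take longer to run.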


# Data Integrity Checks

We will start with some preliminary integrity checks to validate the text formatting. </br>
It is recommended to do this step before model training, as it may reveal that additional data engineering is required. </br>

We will run the TextPropertyOutliers check, which aims to detect outliers based on the different properties Deepchecks calculates for each text sample.

### Integrity #1: Text outliers

From the result we can derive several insights: </br>
    1. Hashtags ('#...') are usually several words written together without spaces - we might consider splitting them before feeding the tweet to a model</br>
    2. In some instances users deliberately misspell words, for example '!' instead of the letter 'l' or 'okayyyyyyyyyy'</br>
    3. The majority of the data is in English but not all. If we want a multilingual classifier we should collect more data, otherwise we may consider </br>
       dropping tweets in other languages from our dataset before training our model.
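
As an illustration of the first insight, the sketch below splits CamelCase hashtags into separate words with a regular expression. The helper `split_hashtags` is hypothetical and not part of Deepchecks, and it only handles hashtags whose words start with a capital letter; all-lowercase hashtags would need a proper word segmentation approach.

```python
import re


def split_hashtags(text: str) -> str:
    """Replace each CamelCase hashtag with its space-separated words."""

    def _split(match):
        tag = match.group(1)
        # Insert a space at every lowercase/digit -> uppercase boundary.
        return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", tag)

    # The '#' itself is dropped; only the tag text is kept.
    return re.sub(r"#(\w+)", _split, text)


cleaned = split_hashtags("Best day ever #SoHappy #blessed")
print(cleaned)
```

Whether this kind of preprocessing actually helps depends on the tokenizer and model used downstream, so it is worth validating on held-out data.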

In [5]:
from deepchecks.nlp.checks import TextPropertyOutliers

check = TextPropertyOutliers(iqr_scale=3)
res = check.run(train)
res

VBox(children=(HTML(value='<h4><b>Text Property Outliers</b></h4>'), HTML(value='<p>Find outliers images with …
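
To build intuition for the `iqr_scale` argument, the following is a minimal sketch of a quartile-based outlier rule on a single numeric property. It assumes the common convention of flagging values further than `iqr_scale` times the interquartile range from the first or third quartile; the function `iqr_outliers` and the toy values are hypothetical, and the check's internal computation may differ in its details.

```python
from statistics import quantiles


def iqr_outliers(values, iqr_scale=3.0):
    """Return (lower, upper, outliers) using a quartile-based rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - iqr_scale * iqr
    upper = q3 + iqr_scale * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


# Hypothetical text lengths; the last sample is much longer than the rest.
text_lengths = [40, 45, 50, 55, 60, 65, 70, 75, 80, 400]
lower, upper, outliers = iqr_outliers(text_lengths, iqr_scale=3)
print(lower, upper, outliers)
```

A larger `iqr_scale` widens the accepted range, so fewer samples are reported as outliers.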

### Integrity #2: Property-Label Correlation (Shortcut Learning)

The next integrity check verifies that the data does not contain any shortcuts the model can fixate on during the learning process. </br>
For more information about shortcut learning see: https://towardsdatascience.com/shortcut-learning-how-and-why-models-cheat-1b37575a159

In [6]:
from deepchecks.nlp.checks import PropertyLabelCorrelation

check = PropertyLabelCorrelation(properties_to_ignore=['Sentiment'])
check.run(train)

VBox(children=(HTML(value='<h4><b>Property-Label Correlation</b></h4>'), HTML(value='<p>Return the PPS (Predic…
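
The idea behind a shortcut can be illustrated with a much simpler measure than the one the check reports. The sketch below compares how well a single property predicts the label (by always guessing the most common label for each property value) against a baseline that always guesses the overall most common label. This is a simplified stand-in for intuition only, not the PPS computation; the function names and the toy data are hypothetical, and the accuracy is measured in-sample.

```python
from collections import Counter, defaultdict


def majority_rule_accuracy(feature_values, labels):
    """Accuracy of predicting, for each feature value, its most common label."""
    by_value = defaultdict(Counter)
    for value, label in zip(feature_values, labels):
        by_value[value][label] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
    return correct / len(labels)


def baseline_accuracy(labels):
    """Accuracy of always predicting the overall most common label."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


# Hypothetical toy data: does the tweet contain an exclamation mark?
has_exclamation = [True, True, True, False, False, False, False, True]
labels = ["anger", "anger", "anger", "sadness",
          "sadness", "happiness", "sadness", "happiness"]

rule_acc = majority_rule_accuracy(has_exclamation, labels)
base_acc = baseline_accuracy(labels)
print(rule_acc, base_acc)
```

A large gap between the two numbers suggests that the property alone carries much of the label information, which is exactly the situation a shortcut-learning check is meant to surface.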

# Drift & Model Evaluation Checks

### Loading precalculated model predictions

The checks below require model predictions, which can be supplied via the relevant arguments of the ``run`` function.

In [7]:
train_preds = tweet_emotion.load_precalculated_predictions(pred_format='predictions')[train.index]
test_preds = tweet_emotion.load_precalculated_predictions(pred_format='predictions')[test.index]

train_probas = tweet_emotion.load_precalculated_predictions(pred_format='probabilities')[train.index]
test_probas = tweet_emotion.load_precalculated_predictions(pred_format='probabilities')[test.index]

When deploying a trained model into production, it is crucial to verify that the data environment is similar to the one the model was trained in. </br>
This can be done by monitoring for drift in the data, predictions and labels.
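
As a rough illustration of what a drift score over a categorical variable (labels or predicted classes) measures, the sketch below compares class shares between two splits using the total variation distance. This metric is chosen here only because it is easy to follow; it is an assumption for the example and not necessarily the drift measure used by the Deepchecks checks. The label lists are hypothetical.

```python
from collections import Counter


def class_shares(labels):
    """Fraction of samples belonging to each class."""
    total = len(labels)
    return {cls: n / total for cls, n in Counter(labels).items()}


def total_variation_distance(train_labels, test_labels):
    """Half the sum of absolute differences between the class shares."""
    p = class_shares(train_labels)
    q = class_shares(test_labels)
    classes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in classes)


# Hypothetical label lists: 'optimism' is rarer in train than in test.
train_labels = ["anger"] * 5 + ["sadness"] * 4 + ["optimism"] * 1
test_labels = ["anger"] * 4 + ["sadness"] * 3 + ["optimism"] * 3

drift = total_variation_distance(train_labels, test_labels)
print(round(drift, 3))
```

A score of 0 means the two distributions are identical and 1 means they share no classes; a condition such as "drift score less than 0.1" simply places a threshold on a score of this kind.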

### Drift & Model Evaluation #1: Prediction Drift

In [8]:
from deepchecks.nlp.checks import PredictionDrift

check = PredictionDrift().add_condition_drift_score_less_than(0.1)
check.run(train, test, train_predictions=list(train_preds), test_predictions=list(test_preds))



VBox(children=(HTML(value='<h4><b>Prediction Drift</b></h4>'), HTML(value='<p>    Calculate prediction drift b…

### Drift & Model Evaluation #2: Label Drift

In [9]:
from deepchecks.nlp.checks import LabelDrift

check = LabelDrift().add_condition_drift_score_less_than(0.1)
check.run(train, test)

VBox(children=(HTML(value='<h4><b>Train Test Label Drift</b></h4>'), HTML(value='<p>    Calculate label drift …

We can see that in our test set, 16% of the data belongs to the optimism class, which makes up only 3% of the training data. </br>
The Prediction Drift check tells us that, from our model's point of view, there wasn't a significant change in the data distribution, </br>
meaning we are most likely dealing with a case of concept drift.

Since the 'optimism' class is rare in our training set, it is possible that the kinds of optimism tweets found in the test set are underrepresented in the training set </br>
and that our model therefore fails to classify them. </br>

We can verify this assumption by looking at our model's confusion matrix:

### Drift & Model Evaluation #3: Label Confusion Matrix

In [10]:
from deepchecks.nlp.checks import ConfusionMatrixReport

check = ConfusionMatrixReport(normalize_display=False)
result = check.run(test, predictions=list(test_preds))
result



VBox(children=(HTML(value='<h4><b>Confusion Matrix Report</b></h4>'), HTML(value='<p>Calculate the confusion m…

As we can see, our model performs poorly when classifying tweets from the optimism class.
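
To show how such a conclusion can be read off a confusion matrix, here is a minimal sketch that counts (true label, predicted label) pairs and derives the per-class recall. The helper names and the toy labels are hypothetical and unrelated to the actual predictions loaded above.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))


def per_class_recall(y_true, y_pred):
    """Share of each true class that the model classified correctly."""
    counts = confusion_counts(y_true, y_pred)
    support = Counter(y_true)
    return {cls: counts[(cls, cls)] / n for cls, n in support.items()}


# Hypothetical toy labels: most 'optimism' tweets are misclassified.
y_true = ["optimism", "optimism", "optimism", "anger", "anger", "sadness"]
y_pred = ["happiness", "anger", "optimism", "anger", "anger", "sadness"]

recall = per_class_recall(y_true, y_pred)
print(recall)
```

A low recall for one class, together with high recall for the others, is the pattern we would expect if that class is underrepresented in the training data.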

# Metadata Based Segmentation: Looking for Weak Segments

In [11]:
# In our use case the metadata columns are information on the user that wrote the tweet

test.metadata.head(3)

Unnamed: 0,user_age,gender,days_on_platform,user_region
0,30.73,Male,5614,Americas
1,42.29,Female,4308,Europe
4,35.07,Female,4631,Europe


Next, we will use the metadata columns of user-related information to try to **automatically** detect significant data segments on which our model performs badly.

### Drift & Model Evaluation #4: Metadata Segments Performance

In [12]:
from deepchecks.nlp.checks import MetadataSegmentsPerformance

check = MetadataSegmentsPerformance()

res = check.run(test, predictions=list(test_preds), probabilities=test_probas)
res



VBox(children=(HTML(value='<h4><b>Metadata Segments Performance</b></h4>'), HTML(value='<p>Search for segments…
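
The basic idea of a weak segment can be sketched by grouping samples on a single metadata column and comparing accuracy between the groups. The check itself does considerably more than this, so the code below is only a simplified illustration; the function names, the `min_size` parameter and the toy data are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_segment(segments, y_true, y_pred):
    """Return {segment: (accuracy, size)} for every segment value."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for seg, truth, pred in zip(segments, y_true, y_pred):
        totals[seg] += 1
        hits[seg] += int(truth == pred)
    return {seg: (hits[seg] / totals[seg], totals[seg]) for seg in totals}


def weakest_segment(segments, y_true, y_pred, min_size=2):
    """Lowest-accuracy segment among those with at least min_size samples."""
    stats = accuracy_by_segment(segments, y_true, y_pred)
    eligible = {s: acc for s, (acc, size) in stats.items() if size >= min_size}
    return min(eligible, key=eligible.get) if eligible else None


# Hypothetical toy data using the 'user_region' metadata column.
regions = ["Europe", "Europe", "Europe",
           "Americas", "Americas", "Americas", "Asia Pacific"]
y_true = ["anger", "sadness", "optimism",
          "anger", "happiness", "sadness", "optimism"]
y_pred = ["anger", "anger", "happiness",
          "anger", "happiness", "sadness", "sadness"]

weakest = weakest_segment(regions, y_true, y_pred, min_size=2)
print(weakest)
```

The minimum size filter matters: a segment with a single misclassified sample has 0% accuracy but says little about the model, so very small segments are ignored here.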

# Properties Based Segmentation: Looking for Weak Segments

### Drift & Model Evaluation #5: Properties Segments Performance

In [13]:
from deepchecks.nlp.checks import PropertySegmentsPerformance

# Optional arguments to experiment with: segment_minimum_size_ratio=0.3, categorical_aggregation_threshold=0.1
check = PropertySegmentsPerformance()
check.run(test, predictions=list(test_preds), probabilities=test_probas)



VBox(children=(HTML(value='<h4><b>Property Segments Performance</b></h4>'), HTML(value='<p>Search for segments…