# Text classification using FHE for the *20 Newsgroups* dataset

## Introduction
This tutorial uses the 20 Newsgroups dataset to classify a text snippet into its relevant category. Each snippet is represented as a bag-of-words vector with TF-IDF scores. The tutorial has two parts: the first shows how to train the unencrypted model, and the second shows how to convert the plaintext model into an encrypted one and run FHE inference with it. The classifier is a neural network (NN) with a single hidden layer and a polynomial activation. This tutorial uses only 4 of the 20 available categories.

#### This demo uses the 20 Newsgroups dataset, originally taken from: http://kdd.ics.uci.edu/databases/20newsgroups/20newsgroups.html

The full dataset consists of 20,000 messages taken from 20 newsgroups.

## Use case
A potential use case for text classification is sentiment analysis. For example, consider a call center that holds the texts of customer calls and would like to categorize them (e.g., loans, general complaints, investments), but the calls are considered sensitive data. With FHE, you can encrypt the text and categorize the calls in order to analyze customer sentiment, all while preserving customer privacy.

<br>

## Step 1. Load and train the unencrypted model
In this step, you load the data and train the model without using FHE; this is standard data science. You load the dataset as a training set and a test set. FHE is used in Step 2 to perform the inference.


#### 1a. Load the training and test datasets

In [None]:
from sklearn.datasets import fetch_20newsgroups
import utils 

utils.verify_memory()

# Load some categories from the training set
categories = [
    'alt.atheism',
    'soc.religion.christian', 
    'comp.graphics',
    'sci.space',
]

remove = ('headers','footers','signatures')
print(f"Loading 20 newsgroups dataset for categories: {categories}")

data_train = fetch_20newsgroups(subset='train', categories=categories,
                                shuffle=True, random_state=42,
                                remove=remove)

data_test = fetch_20newsgroups(subset='test', categories=categories,
                               shuffle=True, random_state=42,
                               remove=remove)
print('Data loaded')

# order of labels in `target_names` can be different from `categories`
target_names = data_train.target_names

# extract the integer class labels of the training set and the test set
y_train, y_test = data_train.target, data_test.target



<br>

#### 1b. Preprocess
Here you convert the text into TF-IDF vectors and select the best features for the model using a chi-squared test.
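
To make the TF-IDF weighting concrete, the following minimal sketch reproduces it by hand on a toy count matrix. It assumes the sublinear term frequency enabled in the cell below (`sublinear_tf=True`) together with scikit-learn's default smoothed IDF and L2 normalization; the `counts` matrix is made up for illustration, and the `max_df` and stop-word filtering are not reproduced.

```python
import numpy as np

# Toy term-count matrix (illustrative only): 3 documents (rows) x 4 terms (columns).
counts = np.array([
    [2, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 1, 3],
], dtype=float)

n_docs = counts.shape[0]

# Sublinear term frequency: replace each nonzero count tf with 1 + ln(tf).
tf = np.zeros_like(counts)
nonzero = counts > 0
tf[nonzero] = 1.0 + np.log(counts[nonzero])

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1,
# where df is the number of documents that contain the term.
df = nonzero.sum(axis=0)
idf = np.log((1.0 + n_docs) / (1.0 + df)) + 1.0

# Weight each term and L2-normalize every document vector.
tfidf = tf * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)

print(np.round(tfidf, 3))
```

A term that appears in every document (the third column here) gets the lowest IDF, so it contributes less to each vector than rarer terms with the same count.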

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
import tensorflow as tf
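# NOTE: this rebinds the name `utils` to tensorflow.keras.utils
# (the local `utils` module was imported under the same name in 1a)
from tensorflow.keras import utils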

print("Extracting features from the training data using a sparse vectorizer")
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')

X_train = vectorizer.fit_transform(data_train.data)

print("Extracting features from the test data using the same vectorizer")
X_test = vectorizer.transform(data_test.data)
print()

# mapping from integer feature index to original token string
feature_names = vectorizer.get_feature_names_out()

if len(feature_names) > 0:
    feature_names = np.asarray(feature_names)

n_features = 3000
print(f"Extracting {n_features} best features by a chi-squared test")
ch2 = SelectKBest(chi2, k=n_features)
X_train = ch2.fit_transform(X_train, y_train)
X_test = ch2.transform(X_test)

# Convert class vector to binary class matrices
num_classes = len(categories)
y_train_cat = utils.to_categorical(y_train, num_classes)
y_test_cat = utils.to_categorical(y_test, num_classes)
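
The conversion to binary class matrices at the end of the cell above is a one-hot encoding. As a rough illustration of what it produces, here is a NumPy-only sketch; `one_hot` is a hypothetical helper written for this example, not part of Keras or of the notebook's `utils` module.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Return a (len(labels), num_classes) matrix with a single 1 per row."""
    labels = np.asarray(labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype=np.float32)[labels]

y = [0, 2, 3, 1]
print(one_hot(y, 4))
```

Each row has a 1 in the column of its class and 0 elsewhere, which is the target format expected by the categorical cross-entropy loss used in Step 1c.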

In [None]:
# serialize test data that will be used for evaluation during FHE inference 
from utils import save_data_set, serialize_model
import os 

PATH = os.path.join('.', 'data', 'text_classification')
# .toarray() converts the sparse matrix to a dense NumPy array (same result as .A)
save_data_set(X_test.toarray(), y_test_cat, data_type='test', path=PATH)

<br>

#### 1c. Define the model architecture
Here we define the model architecture used for training, including the activation function. To make the model HE-friendly, we use a scaled second-degree polynomial: $activation(x) = 0.01 \cdot x^2 + x$. Unlike previous examples, this activation function is not a plain square activation; it is used here to show that different activation functions can be used with FHE.
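
A polynomial is HE-friendly because it can be evaluated with additions and multiplications only. The sketch below evaluates the activation in plain NumPy using Horner's rule. It assumes the coefficient list is ordered from the highest degree down, which matches the formula above; `poly_activation` is a hypothetical stand-in, not the `PolyActivation` layer itself.

```python
import numpy as np

def poly_activation(x, coeffs=(0.01, 1.0, 0.0)):
    """Evaluate coeffs[0]*x**2 + coeffs[1]*x + coeffs[2] with Horner's rule.

    Only additions and multiplications are used, which are the operations
    that FHE schemes evaluate natively on encrypted values.
    """
    x = np.asarray(x, dtype=float)
    result = np.zeros_like(x)
    for c in coeffs:
        result = result * x + c
    return result

print(poly_activation([-2.0, 0.0, 2.0]))
```

For small inputs the linear term dominates, so the function behaves almost like the identity, while the quadratic term supplies the non-linearity.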

In [None]:
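# Sequential, Dense, PolyActivation and SGD are imported in the cross-validation
# cell in 1d, which must run before create_model() is called.
def create_model():
    model = Sequential()
    # Single hidden layer with 300 units; input size is the number of selected features
    model.add(Dense(300, input_shape=(X_train.shape[1],)))
    # Polynomial activation 0.01*x^2 + x
    model.add(PolyActivation([0.01, 1., 0.]))
    # Output layer: one raw score (logit) per category
    model.add(Dense(4))
    sgd = SGD(learning_rate=0.1)

    # from_logits=True because the output layer has no softmax
    model.compile(loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
                  optimizer=sgd, metrics=['accuracy'])
    return model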

<br>

#### 1d. Train the model and determine accuracy

In [None]:
def train_and_evaluate_model(model, X_train, y_train, X_val, y_val, num_classes, mode='val'):
    y_train_cat = utils.to_categorical(y_train, num_classes)
    y_val_cat = utils.to_categorical(y_val, num_classes)
    
    model.fit(X_train, y_train_cat,
              batch_size=200,
              epochs=75,
              verbose=0,
              validation_data=(X_val, y_val_cat),
              shuffle=True,
              )
    score = model.evaluate(X_val, y_val_cat, verbose=0)
    print(f'{mode} loss: {score[0]:.3f}')
    print(f'{mode} accuracy: {score[1] * 100:.3f}%')
    
    return score[1]

In [None]:
# Cross validation
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from data_gen.activations import PolyActivation
from tensorflow.keras.optimizers import SGD

n_folds = 5
skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
res = []

# Convert the sparse matrix to a dense array once, outside the loop
X_train_dense = X_train.toarray()

for i, (train_index, val_index) in enumerate(skf.split(X_train_dense, y_train)):
    print(f"Running Fold {i+1}/{n_folds}")

    # Build a fresh, untrained model for each fold
    model = create_model()
    res.append(train_and_evaluate_model(model,
                                        X_train_dense[train_index], y_train[train_index],
                                        X_train_dense[val_index], y_train[val_index],
                                        num_classes))

print(f"Validation results: {res}")
print(f"Mean validation accuracy: {np.mean(res):.3f}, standard deviation: {np.std(res):.3f}")

<br>

#### 1e. Evaluate the model

In [None]:
from sklearn import metrics

model = create_model()
acc = train_and_evaluate_model(model, X_train.toarray(), y_train, X_test.toarray(), y_test, num_classes, mode='test')
# Use a dense NumPy array (not a numpy.matrix from .todense()) for prediction
y_pred = np.argmax(model.predict(X_test.toarray()), axis=-1)

print(metrics.classification_report(y_test, y_pred, target_names=target_names))
print(metrics.confusion_matrix(y_test, y_pred))
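
To help read the confusion matrix printed above, here is a minimal NumPy sketch of how such a matrix is built and how accuracy and per-class recall follow from it. It assumes the same convention as scikit-learn (rows are true classes, columns are predicted classes); the labels are made up for illustration.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the true class, columns index the predicted class."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Illustrative labels for 6 samples and 4 classes.
y_true = [0, 0, 1, 2, 2, 3]
y_pred = [0, 1, 1, 2, 2, 0]

cm = confusion_matrix(y_true, y_pred, 4)
accuracy = np.trace(cm) / cm.sum()       # correct predictions lie on the diagonal
recall = np.diag(cm) / cm.sum(axis=1)    # per-class fraction of true samples recovered

print(cm)
print(f"accuracy: {accuracy:.3f}")
print(f"recall per class: {recall}")
```

Off-diagonal entries show which categories are confused with each other, which is often more informative than the overall accuracy alone.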

In [None]:
# Model serialization
from utils import save_data_set, serialize_model
import os 


PATH = os.path.join('.', 'data', 'text_classification')
serialize_model(model, PATH, s='')

<br>


## Step 2. FHE inference


#### 2a. Load the plain model and the model weights that we just trained, and encrypt them in a trusted environment
We now load the model files into HElayers. Internally, this runs an optimization process that finds the best parameters for this model. Various additional parameters can be provided as input; the only one we tune here is the batch size, that is, how many samples are provided to the model in each inference call. Here we set it to 8.

In [None]:
import numpy as np
import h5py
import pyhelayers
import utils

he_run_req = pyhelayers.HeRunRequirements()
he_run_req.set_he_context_options([pyhelayers.DefaultContext()])
he_run_req.optimize_for_batch_size(8)

nn = pyhelayers.NeuralNet()
nn.encode_encrypt([PATH + "/model.json", PATH + "/model.h5"], he_run_req)
context = nn.get_created_he_context()
batch_size = nn.get_profile().get_optimal_batch_size()
print("model loaded and encrypted")

<br>

#### 2b. Load and encrypt the test data

In [None]:
with h5py.File(PATH + "/x_test.h5") as f:
    x_test = np.array(f["x_test"])
with h5py.File(PATH + "/y_test.h5") as f:
    y_test = np.array(f["y_test"])
    
plain_samples, labels = utils.extract_batch(x_test, y_test, batch_size, 0)
print('Batch of size',batch_size,'loaded')

model_io_encoder = pyhelayers.ModelIoEncoder(nn)
samples = pyhelayers.EncryptedData(context)
model_io_encoder.encode_encrypt(samples, [plain_samples])
print('Test data encrypted')
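
The cell above takes a single batch of `batch_size` samples from the test set before encrypting it. The implementation of `utils.extract_batch` is not shown in this notebook, so the following is only a sketch of what such slicing could look like; `take_batch` is a hypothetical helper, and reading the last argument as a batch index is an assumption.

```python
import numpy as np

def take_batch(x, y, batch_size, batch_index):
    """Return rows [batch_index*batch_size, (batch_index+1)*batch_size) of x and y."""
    start = batch_index * batch_size
    stop = start + batch_size
    return x[start:stop], y[start:stop]

# Illustrative data: 20 samples with 2 features each.
x = np.arange(40, dtype=float).reshape(20, 2)
y = np.arange(20)

xb, yb = take_batch(x, y, batch_size=8, batch_index=0)
print(xb.shape, yb.shape)
```

With 20 samples and a batch size of 8, the third batch would hold only 4 samples; whether the real helper pads, drops, or returns such a partial batch is not specified here.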

<br>

#### 2c. Perform the FHE inference

In [None]:
predictions = pyhelayers.EncryptedData(context)
utils.start_timer()
nn.predict(predictions, samples)
duration = utils.end_timer('predict')
utils.report_duration('predict per sample', duration / batch_size)

<br>

#### 2d. Decrypt and print the predictions

In [None]:
plain_predictions = model_io_encoder.decrypt_decode_output(predictions)
print('predictions',plain_predictions)
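
The model defined in Step 1c ends in a dense layer without softmax (the loss is configured with `from_logits=True`), so the decrypted values are presumably raw scores rather than probabilities. The sketch below shows how such scores could be turned into class indices. It assumes the decrypted output is an array with one row of 4 scores per sample; the numbers are made up for illustration.

```python
import numpy as np

# Hypothetical decrypted output: one row of 4 raw scores (logits) per sample.
logits = np.array([
    [ 2.1, -0.3,  0.4, -1.0],
    [-0.5,  0.2,  3.3,  0.1],
])

# The predicted class is the index of the largest score in each row.
predicted = np.argmax(logits, axis=1)

# Optional: a numerically stable softmax yields probabilities. It preserves
# the ordering within each row, so the predicted class does not change.
shifted = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(predicted)
print(np.round(probs, 3))
```

Applying softmax after decryption keeps the non-polynomial exponential out of the encrypted computation.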

<br>

#### 2e. Evaluate the model

In [None]:
utils.assess_results(labels, plain_predictions)