## Developing the artificial benchmark for federated learning spam detection

This section develops a synthetic benchmark for spam email detection using federated learning. Each generated file
carries one of the two classification categories in its name: spam (`spam_`) or not spam (`email_`). The benchmark is
parameterized so that its difficulty can be quantified.

### What is the alphabet used?
The English alphabet.

### How are words composed?
Words are represented by embedding vectors of length $n$. For simplicity, we start with $n=10$.

### How are sentences composed?
Sentences are composed of $m$ word embeddings arranged according to a defined grammar. For simplicity, we start with
$m = 3$ words arranged in a $noun + verb + noun$ configuration. The configuration is defined by a grammar $G$.

### What is the grammar?
The grammar $G$ is the concatenation of the word vectors.
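
As a minimal sketch of this grammar, assuming a sentence is given as $m$ word vectors of length $n$, concatenation yields one flat vector of length $m \cdot n$. The helper name `concat_sentence` and the placeholder values are hypothetical and not part of the notebook.

```python
import numpy as np

def concat_sentence(words):
    """Concatenate word vectors (noun, verb, noun) into one flat sentence vector."""
    return np.concatenate([np.asarray(w, dtype=float) for w in words])

# Three toy word embeddings of length n = 10 (arbitrary placeholder values)
noun_a = np.full(10, 0.1)
verb = np.full(10, -0.5)
noun_b = np.full(10, 0.9)

sentence_vec = concat_sentence([noun_a, verb, noun_b])
print(sentence_vec.shape)  # (30,)
```

Note that `gen_sentence` in the cells below keeps the words as a nested list rather than a flat vector; flattening as above is one possible reading of the grammar.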

### How are emails composed?
Emails are considered paragraphs made of sentences. An email will be made of $k$ sentences.

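### How is spam defined?
An email is labeled spam if at least one of its word embeddings has a mean component value greater than $0.4$.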

In [86]:
import numpy as np

In [87]:
def gen_embedding(n=10):
    '''
    Generate a random float vector of length n with values between -1 and 1
    '''
    res = np.random.rand(n)
    res = res * 2 - 1
    # convert to list
    res = res.tolist()
    return res
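
Because `np.random.rand` draws from NumPy's global random state, each run produces a different dataset. If a reproducible benchmark is wanted, one option is to pass an explicitly seeded generator. The sketch below assumes this design; `gen_embedding_seeded` is a hypothetical name, not a function from the notebook.

```python
import numpy as np

def gen_embedding_seeded(rng, n=10):
    """Return a list of n floats drawn uniformly from [-1, 1) using `rng`."""
    return rng.uniform(-1.0, 1.0, size=n).tolist()

# Two generators created with the same seed yield identical embeddings
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)
same = gen_embedding_seeded(rng_a) == gen_embedding_seeded(rng_b)
print(same)  # True
```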


In [88]:
def gen_sentence(m=3):
    '''
    Generate a sentence of length m
    '''
    return [gen_embedding() for _ in range(m)]


In [89]:
def gen_paragraph(k=5):
    '''
    Generate a paragraph of k sentences
    '''
    return [gen_sentence() for _ in range(k)]

In [90]:
def check_spam(paragraph):
    '''
    Check if a paragraph is spam
    input: paragraph
    '''
    for sentence in paragraph:
        for word in sentence:
            if sum(word)/len(word) > .4:
                return True
    return False
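
Since the benchmark is meant to be parameterized, it can help to see how the share of spam changes with the threshold. The following is a rough Monte Carlo sketch, assuming the same rule as `check_spam` (any word with mean above the threshold) and the same uniform sampling in $[-1, 1)$. The function `spam_fraction` and its parameters `trials` and `seed` are hypothetical additions.

```python
import numpy as np

def spam_fraction(threshold=0.4, n=10, m=3, k=5, trials=2000, seed=0):
    """Estimate the share of random emails flagged as spam.

    Each email is an array of shape (k, m, n) with entries uniform in [-1, 1);
    it counts as spam if any word's mean exceeds `threshold`.
    """
    rng = np.random.default_rng(seed)
    emails = rng.uniform(-1.0, 1.0, size=(trials, k, m, n))
    word_means = emails.mean(axis=-1)  # shape (trials, k, m)
    is_spam = (word_means > threshold).any(axis=(1, 2))
    return float(is_spam.mean())

for t in (0.2, 0.4, 0.6):
    print(f"threshold={t}: spam fraction ~ {spam_fraction(t):.3f}")
```

A lower threshold flags more emails as spam, so the threshold, together with $n$, $m$ and $k$, controls the class balance of the generated dataset.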

In [91]:
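import os

def gen_files(n=10, out_dir='./test'):
    '''
    Generate n files in out_dir, named spam_<i>.txt or email_<i>.txt
    according to the label returned by check_spam
    '''
    # Create the output folder if needed so that open() does not fail
    os.makedirs(out_dir, exist_ok=True)
    for i in range(n):
        paragraph = gen_paragraph()
        # The class label is embedded in the file name
        prefix = 'spam' if check_spam(paragraph) else 'email'
        with open(os.path.join(out_dir, f'{prefix}_{i}.txt'), 'w') as f:
            f.write(str(paragraph))
            f.write('\n')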

In [92]:
# Generate 20 files
gen_files(20)
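
To use the generated files, the label has to be recovered from the file name and the nested list parsed back from text. The sketch below shows one way to do this, assuming the `spam_` / `email_` prefixes and the `str(paragraph)` file format used above. `load_dataset` is a hypothetical helper, and the demonstration writes two hand-made files to a temporary folder instead of reading `./test`.

```python
import ast
import tempfile
from pathlib import Path

def load_dataset(folder):
    """Read every .txt file in `folder` and return (emails, labels).

    The label is 1 when the file name starts with 'spam_', else 0.
    """
    emails, labels = [], []
    for path in sorted(Path(folder).glob("*.txt")):
        # Each file holds the str() of a nested list, which literal_eval parses safely
        emails.append(ast.literal_eval(path.read_text().strip()))
        labels.append(1 if path.name.startswith("spam_") else 0)
    return emails, labels

# Tiny demonstration with two hand-written files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "spam_0.txt").write_text(str([[[0.9, 0.8]]]) + "\n")
    Path(tmp, "email_1.txt").write_text(str([[[-0.2, 0.1]]]) + "\n")
    emails, labels = load_dataset(tmp)

print(labels)  # [0, 1]
```

Files are read in sorted name order, so `email_1.txt` comes before `spam_0.txt` in this example.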

## Evaluation on Real Data