# Email Spam detection
This tutorial shows how to build a very simple learning program that also uses the Gurobi solver to apply constraints to a classification problem with two classes, `spam` and `regular`.

## The Graph
First we define the graph, which encodes the domain knowledge for this problem.

In [57]:
import sys
sys.path.append('/home/hfaghihi/Framework/DomiKnowS/')

from regr.graph import Graph, Concept # importing basic graph classes
from regr.graph.logicalConstrain import orL, andL, notL # importing basic constraint classes

Graph.clear()
Concept.clear()

with Graph('global') as graph:
    email = Concept(name='email')

    Spam = email(name='spam')

    Regular = email(name='regular')

    # Constraint: each email is exactly one of spam or regular, never both
    orL(andL(notL(Spam, ('x', )), Regular, ('x', )), andL(notL(Regular, ('x', )), Spam, ('x', )))
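
To make the logical form of this constraint concrete, the following plain-Python sketch evaluates the same expression, (not spam and regular) or (not regular and spam), over all four truth assignments. It is independent of the framework, the helper name `exactly_one_label` is hypothetical, and it only illustrates the logic, not how the solver enforces it.

```python
from itertools import product

def exactly_one_label(spam: bool, regular: bool) -> bool:
    """Plain-Python mirror of the logical form of the graph constraint."""
    return (not spam and regular) or (not regular and spam)

# Enumerate every assignment and keep the ones the constraint allows.
allowed = [(s, r) for s, r in product([False, True], repeat=2)
           if exactly_one_label(s, r)]
print(allowed)  # [(False, True), (True, False)]
```

Only the two assignments with exactly one label set survive, so the constraint behaves like an exclusive or.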



## Data and Data Reader
As our data is spread across text files in different folders, we have to write a reader class that reads these entries into a list of Python dictionaries. Here we extend the framework's default reader class, `RegrReader`.


In [58]:
import os
from regr.data.reader import RegrReader

class EmailSpamReader(RegrReader):
    def parse_file(self, ):
        folder = self.file
        data_spam = []
        data_ham = []
        for file in [f for f in os.listdir(folder + "/spam") if os.path.isfile(os.path.join(folder + "/spam", f)) and f.endswith('.txt')]:
            with open(folder + "/spam/" + file, "r") as f:
                x = []
                for i in f:
                    x.append(i)
            data_spam.append(x)
        for file in [f for f in os.listdir(folder + "/ham") if os.path.isfile(os.path.join(folder + "/ham", f)) and f.endswith('.txt')]:
            with open(folder + "/ham/" + file, "r") as f:
                x = []
                for i in f:
                    x.append(i)
            data_ham.append(x)
        final_data = []
        for dat in data_spam:
            item = {'subject': dat[0].split(":")[1]}
            index = [i for i, v in enumerate(dat) if v.startswith('- - - - - - - - -')]
            if len(index):
                index = index[0]
                item['body'] = "".join(dat[1:index])
                sub = [(i, v) for i, v in enumerate(dat[index:]) if v.startswith('subject')][0]
                item['forward_subject'] = sub[1].split(":")[1]
                item['forward_body'] = "".join(dat[index + sub[0] + 1:])
            else:
                item['body'] = "".join(dat[1:])
            item['label'] = "spam"
            final_data.append(item)

        for dat in data_ham:
            item = {'subject': dat[0].split(":")[1]}
            index = [i for i, v in enumerate(dat) if v.startswith('- - - - - - - - -')]
            if len(index):
                index = index[0]
                item['body'] = "".join(dat[1:index])
                sub = [(i, v) for i, v in enumerate(dat[index:]) if v.startswith('subject')][0]
                item['forward_subject'] = sub[1].split(":")[1]
                item['forward_body'] = "".join(dat[index + sub[0] + 1:])
            else:
                item['body'] = "".join(dat[1:])
            item['label'] = "ham"
            final_data.append(item)
        return final_data

    def getSubjectval(self, item):
        return item['subject']

    def getBodyval(self, item):
        return item['body']

    def getForwardSubjectval(self, item):
        if 'forward_subject' in item:
            return item['forward_subject']
        else:
            return None

    def getForwardBodyval(self, item):
        if 'forward_body' in item:
            return item['forward_body']
        else:
            return None

    def getSpamval(self, item):
        if item['label'] == "spam":
            return 1
        else:
            return 0

    def getRegularval(self, item):
        if item['label'] == "ham":
            return 1
        else:
            return 0

This class overrides the `parse_file` function to parse the data into a list of dictionaries, and then defines the keywords that `ReaderSensor` will use later in our program to connect the data with our knowledge graph. Next we create an instance of this class on the training samples.

In [59]:
train_reader = EmailSpamReader(file='/home/hfaghihi/Framework/DomiKnowS/examples/Email_Spam/data/train', type="folder").run()
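
To see what a parsed item looks like without touching the file system, the per-email logic of `parse_file` can be sketched on an in-memory list of lines. The function name `parse_email` and the sample email are hypothetical. The sketch also splits on the first colon only (`split(":", 1)`), a deliberate deviation from the class above so that subjects containing a colon are not truncated.

```python
def parse_email(lines):
    """Split one email (a list of text lines) into the fields used by the reader."""
    # The first line is expected to look like "Subject: ...".
    item = {'subject': lines[0].split(":", 1)[1]}
    # A line starting with a run of spaced dashes marks a forwarded message.
    marks = [i for i, v in enumerate(lines) if v.startswith('- - - - - - - - -')]
    if marks:
        start = marks[0]
        item['body'] = "".join(lines[1:start])
        # Find the forwarded subject line, searching from the separator onward.
        offset, sub_line = next((i, v) for i, v in enumerate(lines[start:])
                                if v.startswith('subject'))
        item['forward_subject'] = sub_line.split(":", 1)[1]
        item['forward_body'] = "".join(lines[start + offset + 1:])
    else:
        item['body'] = "".join(lines[1:])
    return item

# Hypothetical sample email with a forwarded part.
sample = [
    "Subject: meeting notes\n",
    "see below\n",
    "- - - - - - - - - forwarded by someone\n",
    "subject : agenda\n",
    "item one\n",
]
parsed = parse_email(sample)
print(parsed)
```

Emails without a separator line get only `subject` and `body`, which is why the reader's `getForwardSubjectval` and `getForwardBodyval` fall back to `None`.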

You can check the very first instance by calling `next` on your reader.
Note: `next` consumes that instance, so make sure to re-create the reader afterwards if you only called it for testing.

In [60]:
print(next(train_reader))
train_reader = EmailSpamReader(file='/home/hfaghihi/Framework/DomiKnowS/examples/Email_Spam/data/train', type="folder").run()

{'Body': 'hi ,\nwe have a new offer for you . buy cheap viagra through our online store .\n- private online ordering\n- no prescription required\n- world wide shipping\norder your drugs offshore and save over 70 % !\nclick here : http : / / aamedical . net / meds /\nbest regards ,\ndonald cunfingham\nno thanks : http : / / aamedical . net / rm . html', 'ForwardBody': None, 'ForwardSubject': None, 'Regular': 0, 'Spam': 1, 'Subject': ' buy cheap viagra through us .\n'}
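
The keys in this output (`Body`, `ForwardBody`, `Spam`, and so on) match the middle part of the `get<Keyword>val` method names defined in the reader. One way such a mapping could be built is sketched below. `ToyReader` and `make_object` are hypothetical names, and this is an assumption about the mechanism, not the actual `RegrReader` implementation.

```python
import re

class ToyReader:
    """Hypothetical stand-in for a reader with get<Keyword>val methods."""

    def getSubjectval(self, item):
        return item['subject']

    def getSpamval(self, item):
        return 1 if item['label'] == "spam" else 0

    def make_object(self, item):
        # Collect every method named get<Keyword>val and call it on the item,
        # storing the result under <Keyword>.
        result = {}
        for name in dir(self):
            match = re.fullmatch(r"get(\w+)val", name)
            if match:
                result[match.group(1)] = getattr(self, name)(item)
        return result

row = ToyReader().make_object({'subject': ' hello\n', 'label': 'spam'})
print(row)
```

Under this assumption, adding a new `get<Keyword>val` method is all it takes to expose another keyword to `ReaderSensor`.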


## Model Declaration
Now we start to connect the reader output data with our formatted domain knowledge defined in the graph.

In [61]:
from regr.sensor.pytorch.sensors import ReaderSensor

email['subject'] = ReaderSensor(keyword='Subject')
email['body'] = ReaderSensor(keyword="Body")
email['forward_subject'] = ReaderSensor(keyword="ForwardSubject")
email['forward_body'] = ReaderSensor(keyword="ForwardBody")

Next we read the labels for the `spam` and `regular` concepts.

In [62]:
email[Spam] = ReaderSensor(keyword='Spam', label=True)
email[Regular] = ReaderSensor(keyword='Regular', label=True)

### Define a new sensor
Here we use spaCy to define a new sensor that returns the average GloVe embedding of a sentence as a tensor.

In [63]:
from regr.sensor.pytorch.sensors import TorchSensor
import spacy
from typing import Any
import torch

class SentenceRepSensor(TorchSensor):
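    def __init__(self, *pres, edges=None, label=False):
        # Forward the caller's arguments instead of overriding them with defaults.
        super().__init__(*pres, edges=edges, label=label)
        # Store the spaCy pipeline on the instance so that forward() can use it.
        self.nlp = spacy.load('en_core_web_lg')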

    def forward(self,) -> Any:
        email = self.nlp(self.inputs[0])
        return torch.from_numpy(email.vector)

The input to this sensor would be a sentence. You can find the usage of this sensor in the following sections.
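
The averaging behind a sentence vector can be illustrated without spaCy. The sketch below uses a hypothetical three-dimensional embedding table (`toy_table`) and a hypothetical helper `average_vector`. Treating unknown tokens as zero vectors is an assumption of this sketch.

```python
def average_vector(tokens, table, dim):
    """Mean of the token vectors; unknown tokens count as zero vectors."""
    total = [0.0] * dim
    for tok in tokens:
        vec = table.get(tok, [0.0] * dim)
        total = [t + v for t, v in zip(total, vec)]
    # Guard against an empty sentence to avoid dividing by zero.
    return [t / max(len(tokens), 1) for t in total]

# Toy embedding table standing in for real word vectors.
toy_table = {"cheap": [1.0, 0.0, 2.0], "offer": [3.0, 2.0, 0.0]}
sentence_vec = average_vector("cheap offer".split(), toy_table, dim=3)
print(sentence_vec)  # [2.0, 1.0, 1.0]
```

The result has the same width as a single word vector regardless of sentence length, which is what lets the subject and the body share one fixed-size representation.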

Next, we want to define a new sensor which gives us a tensor indicating whether the email has a forwarded message or not.

In [64]:
class ForwardPresenceSensor(TorchSensor):
    def forward(self,) -> Any:
        if self.inputs[0]:
            return torch.ones(1)
        else:
            return torch.zeros(1)
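
The check `if self.inputs[0]` relies on Python truthiness, so both `None` and an empty string count as "no forwarded message". A framework-free sketch of the same rule, using plain lists instead of tensors and the hypothetical name `forward_presence`:

```python
def forward_presence(forward_body):
    """Return [1.0] when a forwarded body is present, else [0.0]."""
    return [1.0] if forward_body else [0.0]

# None and the empty string are both falsy; any non-empty text is truthy.
flags = [forward_presence(v) for v in (None, "", "fwd text\n")]
print(flags)  # [[0.0], [0.0], [1.0]]
```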

### Connecting new sensors to the graph 
We connect these sensors to the graph to create new properties on the concept `email`. We want new representations of the `subject` and `body` of the email, which is why those properties are passed as inputs to the sensors defined above.

In [65]:
email['subject_rep'] = SentenceRepSensor('subject')
email['body_rep'] = SentenceRepSensor('body')
email['forward_presence'] = ForwardPresenceSensor('forward_body')

### Preparing input features for the learner
Now we concatenate all the generated features to make a new property on the graph which will provide input for the classifier of `spam` and `regular` concepts.

In [66]:
from regr.sensor.pytorch.sensors import ConcatSensor

email['features'] = ConcatSensor('subject_rep', 'body_rep', 'forward_presence')
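
The width of the concatenated feature vector determines the input size of the classifier. As a rough check with plain lists, assuming each sentence embedding has 300 dimensions (an assumption about the spaCy model loaded above):

```python
EMBED_DIM = 300  # assumed vector width of the spaCy model used above

subject_rep = [0.0] * EMBED_DIM   # placeholder for the subject embedding
body_rep = [0.0] * EMBED_DIM      # placeholder for the body embedding
forward_presence = [1.0]          # single flag from ForwardPresenceSensor

# Concatenate along the feature axis, as a concatenation sensor would.
features = subject_rep + body_rep + forward_presence
print(len(features))  # 601
```

With a different embedding width, the input size passed to the learner would need to change accordingly.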

### Define the learner
Here we define a learner for each of the `spam` and `regular` concepts. Each learner is a simple PyTorch module consisting of a single linear layer.

In [67]:
from regr.sensor.pytorch.learners import ModuleLearner
from torch import nn

email[Spam] = ModuleLearner('features', module=nn.Linear(601, 2))
email[Regular] = ModuleLearner('features', module=nn.Linear(601, 2))
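
For intuition, a linear layer with 601 inputs and 2 outputs computes two scores per email as a weighted sum plus a bias. The stdlib-only sketch below reproduces that arithmetic with random weights and adds a softmax to turn the scores into probabilities. The helper names, the random initialization, and the reading of the two outputs as negative and positive scores are assumptions made for illustration.

```python
import math
import random

def linear(x, weight, bias):
    """Compute weight @ x + bias for one input vector."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def softmax(logits):
    """Turn logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - shift) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
in_dim, out_dim = 601, 2
# One weight row per output score, randomly initialized for the sketch.
weight = [[random.uniform(-0.1, 0.1) for _ in range(in_dim)]
          for _ in range(out_dim)]
bias = [0.0] * out_dim
x = [0.5] * in_dim  # stand-in for one email's feature vector

logits = linear(x, weight, bias)
probs = softmax(logits)
print(len(logits), round(sum(probs), 6))
```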

### Make the learning model from the updated graph
Here we make an executable version of this graph that is able to trace the dependencies of the sensors and fill the data from the reader to run examples on the declared model.

In [68]:
from regr.program import LearningBasedProgram
from regr.program.model.pytorch import PoiModel

program = LearningBasedProgram(graph, PoiModel)

## Run the graph
Here we use `populate` to run the graph on the data provided by the reader.

In [69]:
for datanode in program.populate(dataset=train_reader, inference=True):
    print(datanode)

Error during updating data item with sensor global/email/<spam>/readersensor-16


TypeError: len() of a 0-d tensor