SpamScope can analyze about 5 million mails per day (without Apache Tika analysis) on a server with 4 cores and 4 GB of RAM. With Apache Tika enabled, it can analyze about 1 million mails per day.
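To put these daily figures in perspective, the following sketch converts them into average per-second rates. It is plain arithmetic on the numbers quoted above, assumes the load is spread evenly over 24 hours, and is not a benchmark.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def mails_per_second(mails_per_day: int) -> float:
    """Convert a daily mail volume into an average per-second rate."""
    return mails_per_day / SECONDS_PER_DAY


without_tika = mails_per_second(5_000_000)
with_tika = mails_per_second(1_000_000)

print(f"without Tika: ~{without_tika:.0f} mails/s")
print(f"with Tika:    ~{with_tika:.0f} mails/s")
```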
SpamScope uses Apache Storm, which allows you to start small and scale horizontally as you grow. Simply add more workers.
You can choose your mail input sources (with spouts) and your functionalities (with bolts). SpamScope comes with a tokenizer (which splits a mail into tokens: headers, body, attachments), attachment and phishing analyzers (What is the target of the mail? Is there malware in the attachment?), and JSON output.
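As an illustration of what tokenizing a mail means, here is a minimal sketch that uses only Python's standard `email` package. The function name `tokenize` and the shape of the returned dictionary are assumptions made for this example; they are not SpamScope's actual tokenizer bolt or output format.

```python
from email import message_from_string
from email.message import EmailMessage


def tokenize(raw_mail: str) -> dict:
    """Split a raw mail into headers, plain-text body and attachments."""
    msg = message_from_string(raw_mail)
    tokens = {"headers": dict(msg.items()), "body": "", "attachments": []}

    for part in msg.walk():
        if part.is_multipart():
            continue  # container parts carry no payload of their own
        filename = part.get_filename()
        payload = part.get_payload(decode=True) or b""
        if filename:
            # Anything with a filename is treated as an attachment.
            tokens["attachments"].append(
                {"filename": filename, "size": len(payload)}
            )
        elif part.get_content_type() == "text/plain":
            charset = part.get_content_charset() or "utf-8"
            tokens["body"] += payload.decode(charset, errors="replace")
    return tokens


# Build a small sample mail with one attachment and tokenize it.
sample = EmailMessage()
sample["From"] = "alice@example.com"
sample["Subject"] = "Invoice"
sample.set_content("Please see the attached file.")
sample.add_attachment(
    b"dummy bytes",
    maintype="application",
    subtype="octet-stream",
    filename="invoice.bin",
)

tokens = tokenize(sample.as_string())
print(tokens["headers"]["Subject"], tokens["attachments"])
```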
Store where you want
You can build your custom output bolts and store your data in Elasticsearch, Mongo, filesystem, etc.
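The core of a filesystem output could look like the sketch below: one JSON file per analyzed mail. The function `store_result` and the `sha256` and `subject` keys are hypothetical names chosen for this example; in a real topology this logic would sit inside a custom output bolt, and the actual result fields may differ.

```python
import json
import tempfile
from pathlib import Path


def store_result(result: dict, out_dir: Path) -> Path:
    """Write one analysis result as a JSON file named after its hash."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{result['sha256']}.json"  # 'sha256' key is assumed
    path.write_text(json.dumps(result, indent=2, sort_keys=True), encoding="utf-8")
    return path


# Usage: store a dummy result in a temporary directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    saved = store_result(
        {"sha256": "abc123", "subject": "Invoice"}, Path(tmp) / "results"
    )
    loaded = json.loads(saved.read_text(encoding="utf-8"))

print(saved.name, loaded)
```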
Build your topology
With streamparse technology you can build your topology in Python and add or remove spouts and bolts.
Apache 2 Open Source License
Fedele Mantuano (Twitter: @fedelemantuano)
For more details please visit the wiki page.
git clone https://github.com/SpamScope/spamscope.git
Install the requirements listed in requirements.txt:
pip install -r requirements.txt
SpamScope can use Tika App to parse every mail attachment.
The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF).
To install it follow the wiki.
To enable Apache Tika analysis, you should set it in
If you want to analyze the attachments with Thug, follow these instructions to install it and enable it in
What is Thug? From the project README:
Thug is a Python low-interaction honeyclient aimed at mimicking the behavior of a web browser in order to detect and emulate malicious contents.
You can see a complete SpamScope report with Thug analysis here.
It's possible to add a VirusTotal report to the results (for mail attachments). You need a private API key.
It's possible to store the results in Elasticsearch. In this case you should install
It's possible to store the results in Redis. In this case you should install
For more details please visit the wiki page or read the comments in the files in
From SpamScope v1.1 you can choose to filter out mails and attachments that have already been analyzed. If you enable filter in the tokenizer section, you enable the RAM database, and SpamScope will check it to decide whether a mail or attachment has already been analyzed. If the mail is in the RAM database, SpamScope will not analyze it again and will store only the hashes.
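The idea behind such a filter can be sketched as a bounded in-memory set of hashes. The class `SeenFilter`, its `max_items` parameter, the use of SHA-256, and the eviction policy are all assumptions made for this example; SpamScope's own RAM database may be implemented differently.

```python
import hashlib
from collections import OrderedDict


class SeenFilter:
    """Bounded in-memory store of digests (hypothetical helper)."""

    def __init__(self, max_items: int = 1000):
        self._seen = OrderedDict()
        self._max_items = max_items

    def check(self, raw: bytes):
        """Return (digest, already_seen) and remember the digest."""
        digest = hashlib.sha256(raw).hexdigest()
        if digest in self._seen:
            self._seen.move_to_end(digest)  # keep recently seen items longer
            return digest, True
        self._seen[digest] = None
        if len(self._seen) > self._max_items:
            self._seen.popitem(last=False)  # evict the oldest digest
        return digest, False


mail_filter = SeenFilter(max_items=2)
_, first_time = mail_filter.check(b"raw mail one")
_, second_time = mail_filter.check(b"raw mail one")

# Only the hash would be stored for the duplicate; analysis is skipped.
print(first_time, second_time)
```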
SpamScope comes with two topologies:
and a general configuration file
To run the topology for debugging:
sparse run --name topology
If you want to submit the topology to Apache Storm:
sparse submit -f --name topology
It's very important to set the main configuration file. The default value is
/etc/spamscope/spamscope.yml, but you can change it by setting the environment variable SPAMSCOPE_CONF_FILE:
$ export SPAMSCOPE_CONF_FILE=/etc/spamscope/spamscope.yml
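The lookup order described above (environment variable first, default path otherwise) can be sketched as follows. The helper `conf_path` is a hypothetical name for this example, and how SpamScope itself reads the variable is an assumption.

```python
import os

DEFAULT_CONF = "/etc/spamscope/spamscope.yml"


def conf_path(environ=os.environ) -> str:
    """Return the configuration path, preferring SPAMSCOPE_CONF_FILE."""
    return environ.get("SPAMSCOPE_CONF_FILE", DEFAULT_CONF)


print(conf_path({}))  # falls back to the default path
print(conf_path({"SPAMSCOPE_CONF_FILE": "/tmp/spamscope.yml"}))
```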
If you use the Elasticsearch output, I suggest using the Elasticsearch template that comes with SpamScope.
Apache Storm settings
It's possible to change the default settings for all Apache Storm options. For SpamScope I suggest these options:
- topology.tick.tuple.freq.secs: how often (in seconds) all bolts reload their configuration
- topology.max.spout.pending: Apache Storm framework will then throttle your spout as needed to meet the
- topology.sleep.spout.wait.strategy.time.ms: maximum sleep time before emitting a new tuple (mail)
If you don't enable Apache Tika, Thug, or VirusTotal, you could use:
topology.tick.tuple.freq.secs: 60
topology.max.spout.pending: 200
topology.sleep.spout.wait.strategy.time.ms: 10
If Apache Tika is enabled:
To submit the options above, use:
sparse submit -f --name topology -o "topology.tick.tuple.freq.secs=60" -o "topology.max.spout.pending=100" -o "topology.sleep.spout.wait.strategy.time.ms=10"
Thug analysis can be very slow, depending on the attachment. To avoid Apache Storm timeouts, you should use these two switches when you submit the topology:
As you can see, both timeouts are set to 600 seconds, which is the default timeout of Thug.
The complete command is:
sparse submit -f --name topology -o "topology.tick.tuple.freq.secs=60" -o "topology.max.spout.pending=50" -o "topology.sleep.spout.wait.strategy.time.ms=10" -o "supervisor.worker.timeout.secs=600" -o "topology.message.timeout.secs=600"
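If you script your deployments, the commands above can be assembled from the analyzers you have enabled. The following is a minimal sketch: the function `submit_args` and its `tika` and `thug` flags are hypothetical, the option values are copied from the commands in this section, and treating 100 as the value for the Tika case is an assumption.

```python
import shlex


def submit_args(name: str, tika: bool = False, thug: bool = False) -> list:
    """Build the argument list for `sparse submit` from enabled analyzers."""
    # 200 with no extra analyzers, 100 with Tika (assumed), 50 with Thug.
    pending = 50 if thug else 100 if tika else 200
    options = {
        "topology.tick.tuple.freq.secs": 60,
        "topology.max.spout.pending": pending,
        "topology.sleep.spout.wait.strategy.time.ms": 10,
    }
    if thug:
        # Match Thug's default timeout so Storm does not give up first.
        options["supervisor.worker.timeout.secs"] = 600
        options["topology.message.timeout.secs"] = 600

    args = ["sparse", "submit", "-f", "--name", name]
    for key, value in options.items():
        args += ["-o", f"{key}={value}"]
    return args


thug_args = submit_args("topology", thug=True)
print(shlex.join(thug_args))
```

The resulting list can be passed to `subprocess.run` as is, which avoids shell quoting issues.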
For more details you can refer here.
It's possible to use complete Docker images with Apache Storm and SpamScope. Take the following images: