- What is the problem to solve?
If you have exported emails from MS Outlook, you need MS Outlook to read them. And if there are a lot of pst files, it becomes even more tedious.
- What can we do about it?
No clear idea yet. There are tools for reading or converting pst files, such as readpst and Apache Tika.
Using readpst, I will try to convert a batch of pst files into plain text together with their attachments. The emails will then be parsed and loaded into an sqlite file, so that any subsequent search can be run as an SQL query against that file.
- User puts the pst archives into folder 1.
- The readpst tool extracts all data into folder 2.
- Program 1 reads all data in folder 2, parses it and converts it to an sqlite file (why sqlite? Because of its simplicity and the flexibility of SQL, which should also provide better performance).
- Program 2, based on a search query, selects from the sqlite file and outputs all matching emails and attachments to an output folder (a separate folder for each query).
Let’s try an example. We will use part of the Enron data set - it’s big (~50 GB) and publicly available.
Assuming golang and readpst are installed:

    mkdir /tmp/test-pst
    cd /tmp/test-pst
    curl -O https://s3.amazonaws.com/edrm.download.nuix.com/RevisedEDRMv1/albert_meyers.zip
    unzip albert_meyers.zip
    mv albert_meyers/albert_meyers_000_1_1.pst .
    rm -fr albert_meyers
    rm albert_meyers.zip
    #extracting archive to files
    mkdir extract
    readpst -S -D -o extract albert_meyers_000_1_1.pst
    git clone https://github.com/tonna/search-pst
    cd search-pst
    go get github.com/mattn/go-sqlite3
    go build -o reader.so main1.go && go build -o search.so main2.go
    cd -
    ./search-pst/reader.so -input=/tmp/test-pst/extract -output=db
    mkdir found
    echo "select path, content from email where content like '%1%' limit 5;" > query-dummy.sql
    ./search-pst/search.so -input=/tmp/test-pst/db -output=/tmp/test-pst/found -query=query-dummy.sql
    ls /tmp/test-pst/found -R

readpst exports files in such a way that emails and attachments are laid out as:

    folder1/folder2/email1
    folder1/folder2/email2
    folder1/folder2/email2-attachment1
    folder1/folder2/email2-attachment2
    folder1/folder3/email1

Building

    git clone https://github.com/tonna/search-pst
    cd search-pst
    #dependency
    go get github.com/mattn/go-sqlite3
    go build -o reader.so main1.go && gofmt -s -w *.go
    go build -o search.so main2.go && gofmt -s -w *.go

Every email that has attachments, if matched by a search, should be extracted together with its attachments.
How and when should emails be matched with their attachments? While reading, I think. Primary and foreign keys?
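One way to do the matching at read time is sketched below. It is only an assumption about how this could work, not the project's actual code: a file named `<email>-<rest>` is treated as an attachment when a file named `<email>` exists in the same folder, and everything else is treated as an email. The `email` and `attachment` types, their fields, and the `link` function are hypothetical. `attachment.EmailID` plays the role of a foreign key to `email.ID`.

```go
package main

import (
	"fmt"
	"path"
	"sort"
	"strings"
)

// email and attachment mirror two hypothetical tables.
type email struct {
	ID   int // primary key
	Path string
}

type attachment struct {
	ID      int // primary key
	EmailID int // foreign key to email.ID
	Path    string
}

// parentPath returns the would-be email path for a file named
// "<email>-<rest>", or "" if the base name has no such prefix.
// Paths are assumed to be slash-separated.
func parentPath(p string) string {
	base := path.Base(p)
	i := strings.Index(base, "-")
	if i <= 0 {
		return ""
	}
	return path.Join(path.Dir(p), base[:i])
}

// link classifies paths into emails and attachments and assigns IDs.
func link(paths []string) ([]email, []attachment) {
	sorted := append([]string(nil), paths...)
	sort.Strings(sorted) // deterministic IDs regardless of input order

	exists := make(map[string]bool, len(sorted))
	for _, p := range sorted {
		exists[p] = true
	}

	// First pass: every file without an existing parent is an email.
	ids := make(map[string]int)
	var emails []email
	for _, p := range sorted {
		if parent := parentPath(p); parent != "" && exists[parent] {
			continue
		}
		id := len(emails) + 1
		ids[p] = id
		emails = append(emails, email{ID: id, Path: p})
	}

	// Second pass: the remaining files point back at their email's ID.
	var atts []attachment
	for _, p := range sorted {
		if _, isEmail := ids[p]; isEmail {
			continue
		}
		atts = append(atts, attachment{ID: len(atts) + 1, EmailID: ids[parentPath(p)], Path: p})
	}
	return emails, atts
}

func main() {
	emails, atts := link([]string{
		"folder1/folder2/email1",
		"folder1/folder2/email2",
		"folder1/folder2/email2-attachment1",
		"folder1/folder2/email2-attachment2",
		"folder1/folder3/email1",
	})
	for _, e := range emails {
		fmt.Println("email", e.ID, e.Path)
	}
	for _, a := range atts {
		fmt.Println("attachment", a.ID, "of email", a.EmailID, a.Path)
	}
}
```

With IDs assigned this way, a search that matches an email can look up its attachments with a simple join on the foreign key. The naming rule is a heuristic: an email whose own name contains a hyphen could be misclassified if a file with the matching prefix sits next to it.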