FAQ

  • What is the problem to solve?

    If you have exported emails from MS Outlook, you need MS Outlook to read them. With a lot of pst files, this becomes even more tedious.

  • What can we do about it?

    No clear idea yet. There are tools for reading or converting pst files, such as readpst and Apache Tika.

    Using readpst, I will try to convert a batch of pst files into plain text together with their attachments. The emails will then be parsed and stored in an sqlite file, so that subsequent search queries can run against it.

How does it work?

  1. The user puts the pst archives into folder 1.
  2. The readpst tool extracts all data to folder 2.
  3. Program 1 reads all data in folder 2, parses it, and converts it to an sqlite file. (Why sqlite? Because of its simplicity and the flexibility of SQL, which should also provide better performance.)
  4. Program 2, based on a search query, selects from the sqlite file and writes all matching emails and attachments to an output folder (a separate folder for each query).

Let's try an example. We will use part of the Enron data set, which is big (~50 GB) and publicly available.

Assuming golang and readpst are installed:

mkdir /tmp/test-pst
cd /tmp/test-pst

curl -O https://s3.amazonaws.com/edrm.download.nuix.com/RevisedEDRMv1/albert_meyers.zip
unzip albert_meyers.zip
mv albert_meyers/albert_meyers_000_1_1.pst .
rm -fr albert_meyers
rm albert_meyers.zip

# extract the archive to files
mkdir extract
readpst -S -D -o extract albert_meyers_000_1_1.pst

git clone https://github.com/tonna/search-pst
cd search-pst

go get github.com/mattn/go-sqlite3
go build -o reader.so main1.go && go build -o search.so main2.go
cd -

./search-pst/reader.so -input=/tmp/test-pst/extract -output=db

mkdir found
echo "select path, content from email where content like '%1%' limit 5;" > query-dummy.sql
./search-pst/search.so -input=/tmp/test-pst/db -output=/tmp/test-pst/found -query=query-dummy.sql
ls -R /tmp/test-pst/found

Assumptions

readpst exports files so that emails and attachments are laid out as:
folder1/folder2/email1
folder1/folder2/email2
folder1/folder2/email2-attachment1
folder1/folder2/email2-attachment2
folder1/folder3/email1

Misc

Building
git clone https://github.com/tonna/search-pst

cd search-pst

#dependency
go get github.com/mattn/go-sqlite3

go build -o reader.so main1.go && gofmt -s -w *.go
go build -o search.so main2.go && gofmt -s -w *.go

todo-list

NEXT Search accepts SQL queries

Match email and attachment files

Every email that has attachments should, if matched by a search, be extracted together with its attachments.

How and when should emails and attachments be matched? While reading, I think. Primary and foreign keys?

Come up with a naming scheme for found files