Skip to content

MrParagon/Filter

Repository files navigation

COMP 472 Project Deliverable 3

by
Daniel Miller - 6602002
Yash Lalwani - 6531857

We certify that this submission is the original work of members of the group and meets the
Faculty's Expectations of Originality.

Signed Daniel Miller and Yash Lalwani, 1st December, 2014.

Instructions to run Deliverable 1:

1) Extract the datasets and ensure that the folders which contain the emails,
"20030228_easy_ham_2.tar" and "20030228_spam_2.tar" are in the 'Filter' project folder,
along with the 'src' and 'build' folders.

2) The required files are 'ProbabilityCalculator.java' and 'Tokenizer.java', and are 
part of the 'Filter' package.

3) Run the 'Tokenizer.java' file as it contains the main method. It will run create
'Model.txt' in the project folder.

Description of assumptions made for Tokenisation:

1) All words are converted to lowercase as soon as being read.

2) Words only 2 letters long or shorter are ignored.

3) Words consisting of digits are ignored.

4) Hyphenated words are treated as separate words.

5) Words with apostrophes originally in them are split, usually leaving only 
the section before the apostrophe.

6) Anything between "<>" is assumed to be an HTML tag and not part of the message content.

7) A list of words commonly found in the email header are ignored, using a switch case, 
such as "mime" and "delivered".

8) We assume that the emails obey general grammar and spelling rules as much as possible.

9) We are aware that the spam folder is missing 3 files before the "01000.xxx" file, but decided
to run it for a 1000 spam files in total anyway.

=================================================================================================

Instructions to run Deliverable 2:

1) Extract the datasets and ensure that the folders which contain the emails,
"20030228_easy_ham_2.tar" and "20030228_spam_2.tar" are in the 'Filter' project folder,
along with the 'src' and 'build' folders.

2) The required files are 'Tokenizer.java', 'Model.java', 'Classifier.java', 
'Word.java' and 'model.txt', and are part of the 'Filter' package.

3) Run the 'Classifier.java' as it contains the main method.

The only changes we made to our Deliverable 1 code was to move the main from 
'Tokenizer.java' to 'Classifier.java', and delegate the tokenization to a generic method.

Please note that we took the instructions to be literal and wrote our code so that it
checks files named '01001.xxx to 01400.xxx' from the 'spam' folder and files named 
'01000.xxx to 01400.xxx' from the 'ham' folder.

==================================================================================================

Deliverable 3 Notes:

1) Minimum length of words is now changed from 3 to 4.

2) Smoothing technique is now changed from add 0.5 to add 1.0.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages