Commit

prepare for project submission
Dan Wolf committed Feb 11, 2020
1 parent 952f622 commit a18b316
Showing 4 changed files with 48 additions and 25 deletions.
61 changes: 48 additions & 13 deletions README.md
@@ -7,21 +7,60 @@ classifier in Go. This supervised learning activity is accomplished using a
Naive Bayes classifier. This simple algorithm assumes conditional independence,
does a lot of counts, calculates some ratios, and multiplies them together.

### Project and Course Name
### Project Student and Course Name

Project 1: Exploratory Data Analysis
*Project 1:* Exploratory Data Analysis

CIS 678, Winter 2020
*Student:* Daniel Wolf

*Course:* CIS 678, Winter 2020

### Implementation

#### Problem and Approach

*Problem:* Determine what class (C) a new message (M) belongs to.
There are two classes, ‘ham’ and ‘spam’.

*Approach:*
- Each word position in a message is an attribute
- The value each attribute takes on is the word in that position
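
A minimal sketch of the approach above, assuming messages are split on whitespace. The function name `Attributes` is hypothetical and not taken from this project:

```go
package main

import (
	"fmt"
	"strings"
)

// Attributes splits a message on whitespace. Each index of the returned
// slice is a word position (the attribute) and each element is the word
// found at that position (the attribute's value).
// Assumption: whitespace splitting; the project may tokenize differently.
func Attributes(message string) []string {
	return strings.Fields(message)
}

func main() {
	for position, word := range Attributes("free entry in a weekly comp") {
		fmt.Printf("attribute %d = %q\n", position, word)
	}
}
```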

*Smoothing:* A word never observed in a class during training would have
a probability of zero and would block classification into that class. To
prevent this, a smoothing value of 1 is added to the frequency of each word.
The denominator is adjusted to match: the length of the vocabulary is added
to the length of the class's map of word frequencies.

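```
// Probability returns the smoothed probability of every vocabulary word
// for the class whose word frequencies are held in wf.
func (wf WordFrequency) Probability(v Vocabulary) Probability {
	p := make(map[string]float64)
	for _, vocabWord := range v {
		// Add 1 to the count so that unseen words never get a probability of zero.
		p[vocabWord] = float64(wf[vocabWord]+1) / float64(len(wf)+len(v))
	}
	return p
}
```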
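
As a worked example of the smoothing arithmetic, the sketch below runs the method on a tiny class. The three type definitions are assumptions (the README shows only the method), and the counts are made up for illustration:

```go
package main

import "fmt"

// Assumed type definitions; only the method body appears in the README.
type (
	Vocabulary    []string
	WordFrequency map[string]int
	Probability   map[string]float64
)

// Probability mirrors the README's method, including its denominator.
func (wf WordFrequency) Probability(v Vocabulary) Probability {
	p := make(map[string]float64)
	for _, vocabWord := range v {
		p[vocabWord] = float64(wf[vocabWord]+1) / float64(len(wf)+len(v))
	}
	return p
}

func main() {
	vocab := Vocabulary{"free", "win", "hello"}
	spam := WordFrequency{"free": 3, "win": 1} // "hello" never seen in spam

	p := spam.Probability(vocab)
	// Denominator: len(spam) + len(vocab) = 2 + 3 = 5.
	for _, w := range vocab {
		fmt.Printf("%s: %.2f\n", w, p[w]) // free: 0.80, win: 0.40, hello: 0.20
	}
}
```

The unseen word "hello" gets 1/5 rather than 0, so it can no longer zero out a class score. Textbook add-one smoothing usually divides by the total number of word occurrences in the class plus the vocabulary size; the sketch keeps the denominator as the README describes it.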

*Common structures and patterns:* Lookups of any type use Go's map type, including word frequency and probability. Floats are stored as 64-bit values to ensure sufficient precision,
and they are truncated only when printed to the terminal.

#### Classification Formula

This formula selects the class with the larger (maximum) probability for the
SMS message. Each class's score is the overall probability that a message is
in that class multiplied by the probability of each of the message's words
appearing in a message of that class. The general form
of this formula is below:

![classification formula](classification.png)
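
The formula can be sketched in Go as follows. This is an illustration under stated assumptions, not the project's implementation: the `Model` type, its fields, `LogScore`, and `Classify` are hypothetical names, and the example probabilities are made up. Scores are summed as base-10 logarithms instead of multiplied, which keeps long products from underflowing to zero; the base is a guess based on the `math.Pow(10, spamScore)` line visible in a later hunk.

```go
package main

import (
	"fmt"
	"math"
)

// Model holds what one trained class needs for scoring.
// Hypothetical type: the names are not taken from the project.
type Model struct {
	Prior    float64            // P(class)
	WordProb map[string]float64 // smoothed P(word | class)
}

// LogScore returns log10(P(class)) plus the sum of log10(P(word | class)).
// Words missing from WordProb are skipped, an assumption of this sketch.
func (m Model) LogScore(words []string) float64 {
	score := math.Log10(m.Prior)
	for _, w := range words {
		if p, ok := m.WordProb[w]; ok {
			score += math.Log10(p)
		}
	}
	return score
}

// Classify returns the name of the class with the highest score.
func Classify(models map[string]Model, words []string) string {
	best, bestScore := "", math.Inf(-1)
	for name, m := range models {
		if s := m.LogScore(words); s > bestScore {
			best, bestScore = name, s
		}
	}
	return best
}

func main() {
	models := map[string]Model{
		"ham":  {Prior: 0.8, WordProb: map[string]float64{"free": 0.1, "lunch": 0.4}},
		"spam": {Prior: 0.2, WordProb: map[string]float64{"free": 0.6, "lunch": 0.05}},
	}
	fmt.Println(Classify(models, []string{"free", "free"})) // spam: 0.072 beats 0.008
	fmt.Println(Classify(models, []string{"lunch"}))        // ham: 0.32 beats 0.01
}
```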

#### Organization

This project is organized into an experiment, analysis, and the main package and
tests. Theses areas of the application have the following purposes:
This project is organized into an experiment, analysis, main package and
tests. These areas of the application have the following purposes:

- Experiment: holding the shuffled SMS messages, divided by set (train or test) and
class (spam or ham).
- Experiment: holding the shuffled SMS messages, organized separately by set (train or test) and class (spam or ham).
- Analysis: holding calculations and statistics as well as a trained model for
training classes and finally test results.
- main: connecting options to experiment code, making copies of the original
@@ -62,7 +101,7 @@ spamScore = math.Pow(10, spamScore)

##### Preprocessing: Remove Punctuation

Removing punctuation has almost no affect on the results. This might be due to
Removing punctuation has almost no effect on the results. This might be due to
the two classes tending to have similar punctuation. Also, many text
messages don't use any punctuation.
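
For reference, a punctuation-removal preprocessor could look like the sketch below. `RemovePunctuation` is a hypothetical name, and relying on Unicode's punctuation category is an assumption; the project's preprocessor may define punctuation differently.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// RemovePunctuation drops every rune that Unicode classifies as punctuation.
// Symbols such as '$' are not in that category and are kept.
func RemovePunctuation(message string) string {
	return strings.Map(func(r rune) rune {
		if unicode.IsPunct(r) {
			return -1 // a negative value tells strings.Map to drop the rune
		}
		return r
	}, message)
}

func main() {
	fmt.Println(RemovePunctuation("Hey, are you free tonight?")) // Hey are you free tonight
}
```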

@@ -85,7 +124,7 @@ accuracy uplift from the stemmer means that the combination was also not helpful
##### Preprocessing: Remove 100 Most Common

This preprocessor removes the 100 most common English words from all messages.
While the accuracy did get a small list from this adjustment, the difference
While the accuracy did get a small lift from this adjustment, the difference
was very small, and the improvement was mostly to ham identification whereas
more ham was incorrectly identified as spam. For this reason, I would not use
this in a production environment.
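
A sketch of how such a filter might work, assuming the common words are held in a map for constant-time lookup. `commonWords` is a four-word stand-in for the real 100-word list, and `RemoveCommon` is a hypothetical name:

```go
package main

import (
	"fmt"
	"strings"
)

// commonWords is a tiny stand-in for the 100 most common English words.
var commonWords = map[string]bool{"the": true, "to": true, "a": true, "you": true}

// RemoveCommon returns the words that are not in commonWords.
// Matching is case-insensitive, which is an assumption of this sketch.
func RemoveCommon(words []string) []string {
	kept := make([]string, 0, len(words))
	for _, w := range words {
		if !commonWords[strings.ToLower(w)] {
			kept = append(kept, w)
		}
	}
	return kept
}

func main() {
	words := strings.Fields("Claim the prize to win a holiday")
	fmt.Println(RemoveCommon(words)) // [Claim prize win holiday]
}
```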
@@ -222,7 +261,3 @@ The purpose of this project was academic in nature. I don't recommend using
this code in production. I have licensed this code with an MIT license, so
reuse is permissible. If you are in an academic institution, you might have
additional guidelines to follow.

### Author Details

Daniel Wolf
4 changes: 0 additions & 4 deletions analysis/analysis.go
@@ -124,10 +124,6 @@ type Class struct {

type Analyses []Analysis

func (a Analyses) WriteToFile() error {
	return nil
}

func Run(ex experiment.Experiment) Analyses {
	var analyses Analyses

Binary file added classification.png
8 changes: 0 additions & 8 deletions main.go
@@ -15,8 +15,6 @@ func main() {
	flag.StringVar(&flagFilename, "file", "textMsgs.data", "filename")
	var flagDelimiter string
	flag.StringVar(&flagDelimiter, "delimiter", "\t", "delimiter between class and words in data (default is tab)")
	var flagWriteToFile bool
	flag.BoolVar(&flagWriteToFile, "write", false, "write classes to files")
	flag.Parse()

	exp, err := parse.FromFile(flagFilename, flagDelimiter)
@@ -27,12 +25,6 @@ func main() {

	analyses := analysis.Run(exp)

	if flagWriteToFile {
		if err := analyses.WriteToFile(); err != nil {
			fmt.Printf("could not write analysis to file: %s", err)
		}
	}

	for _, a := range analyses {
		c := color.New(color.FgCyan).Add(color.Underline)
		c.Printf("Analysis: %s\n", a.Name)