This is my attempt at writing a spam detector with Naive Bayes and Fisher classifiers. The source code of both classifiers is in this [GitHub repo](https://github.com/kopylovvlad/text_classifier_kv).

The main goal is to tune the classifiers for maximum accuracy. A classifier should filter spam correctly and make as few mistakes as possible on the 'ham' category.

The file spam.csv has 5,572 messages. For each experiment, I divide the data into two subsets: a train subset with 3,000 messages and a test subset with 2,572 messages. Let's look at how many 'ham' and 'spam' messages are in each subset.

In [None]:
import text_classifier_kv as tc # package with naivebayes and fisher classifiers
from typing import List, Dict, Tuple
import csv
import os
print(os.listdir("../input"))

row_limit: int = 3000
# (category, text)
train_data: List[Tuple[str, str]] = []
train_data_ham: int = 0
train_data_spam: int = 0
test_data: List[Tuple[str, str]] = []
test_data_ham: int = 0
test_data_spam: int = 0
# Skip the header row; the first `row_limit` rows go to the train subset
# and the remaining rows go to the test subset.
with open('../input/spam.csv', newline='', encoding='latin-1') as csv_file:
    reader = csv.reader(csv_file)
    next(reader, None)
    for i, row in enumerate(reader):
        if i < row_limit:
            train_data.append((row[0], row[1]))
            if row[0] == 'ham':
                train_data_ham += 1
            else:
                train_data_spam += 1
        else:
            test_data.append((row[0], row[1]))
            if row[0] == 'ham':
                test_data_ham += 1
            else:
                test_data_spam += 1

print('Data overview:')
print('Train data size: %d' % len(train_data))
print('Train data ham: %d' % train_data_ham)
print('Train data spam: %d' % train_data_spam)
print('')
print('Test data size: %d' % len(test_data))
print('Test data ham: %d' % test_data_ham)
print('Test data spam: %d' % test_data_spam)


The train subset has 2591 'ham' messages and 409 'spam' messages, a ratio of about 6.3:1.
The test subset has 2234 'ham' messages and 338 'spam' messages, a ratio of about 6.6:1.
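
The balance of the two subsets can be recomputed from these counts. The sketch below is only a quick check; the helper name `class_balance` is made up for this illustration and is not part of `text_classifier_kv`.

```python
from typing import Tuple


def class_balance(ham: int, spam: int) -> Tuple[float, float]:
    """Return (ham-to-spam ratio, spam share in percent) for one subset."""
    total = ham + spam
    return ham / spam, 100 * spam / total


# Counts reported above for the train and test subsets
for name, ham, spam in [('train', 2591, 409), ('test', 2234, 338)]:
    ratio, spam_share = class_balance(ham, spam)
    print('%s: %.1f ham per spam, %.1f%% spam' % (name, ratio, spam_share))
```

Both subsets contain roughly 13% spam, so the split keeps the class balance close between train and test.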

Experiment #1: train and test the Naive Bayes and Fisher classifiers.

In [None]:
import text_classifier_kv as tc # package with naivebayes and fisher classifiers
from typing import List, Dict, Tuple
import csv
import os

row_limit: int = 3000
# (category, text)
train_data: List[Tuple[str, str]] = []
test_data: List[Tuple[str, str]] = []
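# Skip the header row; the first `row_limit` rows go to the train subset
# and the remaining rows go to the test subset.
with open('../input/spam.csv', newline='', encoding='latin-1') as csv_file:
    reader = csv.reader(csv_file)
    next(reader, None)
    for i, row in enumerate(reader):
        if i < row_limit:
            train_data.append((row[0], row[1]))
        else:
            test_data.append((row[0], row[1]))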


def experiment(
    classifier,
    train_data: List[Tuple[str, str]],
    test_data: List[Tuple[str, str]],
    show_not_equal_ham: bool = False,
    show_not_equal_spam: bool = False
) -> None:
    # Train on the train subset, classify the test subset, and print the result
    for cat, text in train_data:
        classifier.train(text, cat)
    stat: Dict[str, int] = {
        'equal': 0,
        'not_equal': 0,
        'not_equal_ham': 0,
        'not_equal_spam': 0
    }
    for (cat, text) in test_data:
        predict = classifier.classify(text)
        if cat == predict:
            stat['equal'] += 1
        else:
            stat['not_equal'] += 1
            if cat == 'ham':
                stat['not_equal_ham'] += 1
                if show_not_equal_ham:
                    print(text)
            else:
                stat['not_equal_spam'] += 1
                if show_not_equal_spam:
                    print(text)
    print(stat)
    # Accuracy in percent: correct predictions over all predictions
    ac: float = 100 * stat['equal'] / (stat['equal'] + stat['not_equal'])
    print('Accuracy: %f' % ac)


print('Experiment #1.1 - naivebayes')
cl = tc.naivebayes()
experiment(cl, train_data, test_data)

print('Experiment #1.2 - fisherclassifier')
cl = tc.fisherclassifier()
experiment(cl, train_data, test_data)



**Results:**
The Naive Bayes classifier has an accuracy of 94.86%. The Fisher classifier has an accuracy of 95.37%.
The accuracy is good, but each classifier makes mistakes on the 'ham' category. The test subset has 2234 'ham' messages.
The Naive Bayes classifier misclassifies 5% of them, and the Fisher classifier misclassifies 4.7%.
On the 'spam' category, the Naive Bayes classifier misclassifies 5.6% of the messages and the Fisher classifier 4.1%.

How can we improve this? There are two ways:
1. Change the algorithm of the .train method
2. Customize the .classify method by setting thresholds for each category

**Change the algorithm of the .train method**
By default, the .train method takes a text message, transforms the text into a Dict of unique words, and increases the counts in the Dict of features and the Dict of categories. We can try to improve the Dict of unique words by ignoring words that are too common or too rare. For example, we can take all messages in the train subset and use only the words that occur in more than 10% and fewer than 50% of them. Experiment #2 uses narrower limits: more than 0.03% and fewer than 5%.
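
To make the description of `.train` concrete, here is a minimal sketch of that bookkeeping. It is not the code from `text_classifier_kv`: the class `TinyCounter`, the helper `unique_words`, and the tokenizing regex are all assumptions made for this illustration.

```python
import re
from typing import Dict


def unique_words(text: str) -> Dict[str, int]:
    """Hypothetical tokenizer: map each distinct lowercase word to 1."""
    return {word: 1 for word in re.findall(r'[a-z]+', text.lower())}


class TinyCounter:
    """Hypothetical stand-in that only mimics the counting done by .train."""

    def __init__(self) -> None:
        # feature -> category -> number of messages containing the feature
        self.feature_counts: Dict[str, Dict[str, int]] = {}
        # category -> number of messages seen
        self.category_counts: Dict[str, int] = {}

    def train(self, text: str, cat: str) -> None:
        for word in unique_words(text):
            per_cat = self.feature_counts.setdefault(word, {})
            per_cat[cat] = per_cat.get(cat, 0) + 1
        self.category_counts[cat] = self.category_counts.get(cat, 0) + 1


counter = TinyCounter()
counter.train('Win a free prize now', 'spam')
counter.train('Free entry, win cash', 'spam')
counter.train('Are you free for lunch', 'ham')
print(counter.feature_counts['free'])
print(counter.category_counts)
```

In this toy data the word 'free' is counted in both categories, which is the kind of word a frequency limit is meant to filter out when it appears in too many messages.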

**Customize the .classify method by setting thresholds for each category**
For the Fisher classifier, we can set a minimum threshold for the 'spam' category. In other words, the classifier labels a message as 'spam' only when it is really sure.
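
The effect of a per-category minimum can be sketched without the library. The function below is an illustration only: `classify_with_minimums`, the score values, and the default category are assumptions, and the real `.classify` in `text_classifier_kv` may differ in detail.

```python
from typing import Dict, Optional


def classify_with_minimums(
    scores: Dict[str, float],
    minimums: Dict[str, float],
    default: Optional[str] = None,
) -> Optional[str]:
    """Pick the best-scoring category whose score exceeds its minimum."""
    best = default
    best_score = 0.0
    for cat, score in scores.items():
        # A category is eligible only if it clears its own threshold
        if score > minimums.get(cat, 0.0) and score > best_score:
            best = cat
            best_score = score
    return best


# Hypothetical scores for one message: 'spam' scores higher than 'ham'
scores = {'ham': 0.40, 'spam': 0.90}
print(classify_with_minimums(scores, {}))               # no thresholds
print(classify_with_minimums(scores, {'spam': 0.949}))  # strict spam threshold
```

Without thresholds the message is labeled 'spam'. With a minimum of 0.949 for 'spam', the same message falls back to 'ham', which trades some missed spam for fewer false alarms on legitimate messages.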

Experiment #2: run the Naive Bayes and Fisher classifiers while ignoring common words.

In [None]:
import text_classifier_kv as tc # package with naivebayes and fisher classifiers
from typing import List, Dict, Tuple, Callable
import csv
import os

row_limit: int = 3000
# (category, text)
train_data: List[Tuple[str, str]] = []
test_data: List[Tuple[str, str]] = []
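# Skip the header row; the first `row_limit` rows go to the train subset
# and the remaining rows go to the test subset.
with open('../input/spam.csv', newline='', encoding='latin-1') as csv_file:
    reader = csv.reader(csv_file)
    next(reader, None)
    for i, row in enumerate(reader):
        if i < row_limit:
            train_data.append((row[0], row[1]))
        else:
            test_data.append((row[0], row[1]))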

        
def generate_text_vector(
    text_list: List[str],
    min_frequency: float=0.1,
    max_frequency: float=0.5,
    getfeatures: Callable[[str], Dict[str, int]] = tc.getwords,
) -> Callable[[str], Dict[str, int]]:
    '''
    Build a feature extractor that keeps only the words whose frequency
    across text_list lies strictly between min_frequency and max_frequency
    '''
    text_list_len: int = len(text_list)
    # word -> number of texts that contain the word
    apcount: Dict[str, int] = {}

    for text in text_list:
        for word, count in getfeatures(text).items():
            apcount.setdefault(word, 0)
            if count > 0:
                apcount[word] += 1

    # Words whose frequency is between the min and max limits.
    # A set keeps the membership test in get_match_words fast.
    text_vector = {
        word for word, count in apcount.items()
        if min_frequency < count / text_list_len < max_frequency
    }

    def get_match_words(text: str) -> Dict[str, int]:
        # Keep only the features that passed the frequency filter
        return {
            word: count
            for word, count in getfeatures(text).items()
            if word in text_vector
        }
    return get_match_words


def experiment(
    classifier,
    train_data: List[Tuple[str, str]],
    test_data: List[Tuple[str, str]],
    show_not_equal_ham: bool = False,
    show_not_equal_spam: bool = False
) -> None:
    # Train on the train subset, classify the test subset, and print the result
    for cat, text in train_data:
        classifier.train(text, cat)
    stat: Dict[str, int] = {
        'equal': 0,
        'not_equal': 0,
        'not_equal_ham': 0,
        'not_equal_spam': 0
    }
    for (cat, text) in test_data:
        predict = classifier.classify(text)
        if cat == predict:
            stat['equal'] += 1
        else:
            stat['not_equal'] += 1
            if cat == 'ham':
                stat['not_equal_ham'] += 1
                if show_not_equal_ham:
                    print(text)
            else:
                stat['not_equal_spam'] += 1
                if show_not_equal_spam:
                    print(text)
    print(stat)
    # Accuracy in percent: correct predictions over all predictions
    ac: float = 100 * stat['equal'] / (stat['equal'] + stat['not_equal'])
    print('Accuracy: %f' % ac)

train_texts: List[str] = [text for _c, text in train_data]
# Feature extractor that ignores words outside the frequency limits
getfeatures_ignoring_common_words: Callable[[str], Dict[str, int]]
getfeatures_ignoring_common_words = generate_text_vector(
    train_texts,
    min_frequency=0.0003,
    max_frequency=0.05
)
print('Experiment #2.1 - naivebayes with words frequency limit')
cl = tc.naivebayes(getfeatures=getfeatures_ignoring_common_words)
experiment(cl, train_data, test_data)

print('Experiment #2.2 - fisherclassifier with words frequency limit')
cl = tc.fisherclassifier(getfeatures=getfeatures_ignoring_common_words)
experiment(cl, train_data, test_data)


**Result:**

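The results are a little better, but not excellent. Let's try thresholds. Unfortunately, thresholds are not useful for Naive Bayes, so we will try them only with the Fisher classifier.

Experiment #3: run the Fisher classifier with a minimum threshold for the 'spam' category.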

In [None]:
import text_classifier_kv as tc # package with naivebayes and fisher classifiers
from typing import List, Dict, Tuple, Callable
import csv
import os

row_limit: int = 3000
# (category, text)
train_data: List[Tuple[str, str]] = []
test_data: List[Tuple[str, str]] = []
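# Skip the header row; the first `row_limit` rows go to the train subset
# and the remaining rows go to the test subset.
with open('../input/spam.csv', newline='', encoding='latin-1') as csv_file:
    reader = csv.reader(csv_file)
    next(reader, None)
    for i, row in enumerate(reader):
        if i < row_limit:
            train_data.append((row[0], row[1]))
        else:
            test_data.append((row[0], row[1]))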

        
def generate_text_vector(
    text_list: List[str],
    min_frequency: float=0.1,
    max_frequency: float=0.5,
    getfeatures: Callable[[str], Dict[str, int]] = tc.getwords,
) -> Callable[[str], Dict[str, int]]:
    '''
    Build a feature extractor that keeps only the words whose frequency
    across text_list lies strictly between min_frequency and max_frequency
    '''
    text_list_len: int = len(text_list)
    # word -> number of texts that contain the word
    apcount: Dict[str, int] = {}

    for text in text_list:
        for word, count in getfeatures(text).items():
            apcount.setdefault(word, 0)
            if count > 0:
                apcount[word] += 1

    # Words whose frequency is between the min and max limits.
    # A set keeps the membership test in get_match_words fast.
    text_vector = {
        word for word, count in apcount.items()
        if min_frequency < count / text_list_len < max_frequency
    }

    def get_match_words(text: str) -> Dict[str, int]:
        # Keep only the features that passed the frequency filter
        return {
            word: count
            for word, count in getfeatures(text).items()
            if word in text_vector
        }
    return get_match_words


def experiment(
    classifier,
    train_data: List[Tuple[str, str]],
    test_data: List[Tuple[str, str]],
    show_not_equal_ham: bool = False,
    show_not_equal_spam: bool = False
) -> None:
    # Train on the train subset, classify the test subset, and print the result
    for cat, text in train_data:
        classifier.train(text, cat)
    stat: Dict[str, int] = {
        'equal': 0,
        'not_equal': 0,
        'not_equal_ham': 0,
        'not_equal_spam': 0
    }
    for (cat, text) in test_data:
        predict = classifier.classify(text)
        if cat == predict:
            stat['equal'] += 1
        else:
            stat['not_equal'] += 1
            if cat == 'ham':
                stat['not_equal_ham'] += 1
                if show_not_equal_ham:
                    print(text)
            else:
                stat['not_equal_spam'] += 1
                if show_not_equal_spam:
                    print(text)
    print(stat)
    # Accuracy in percent: correct predictions over all predictions
    ac: float = 100 * stat['equal'] / (stat['equal'] + stat['not_equal'])
    print('Accuracy: %f' % ac)

train_texts: List[str] = [text for _c, text in train_data]
# Feature extractor that ignores words outside the frequency limits
getfeatures_ignoring_common_words: Callable[[str], Dict[str, int]]
getfeatures_ignoring_common_words = generate_text_vector(
    train_texts,
    min_frequency=0.0003,
    max_frequency=0.05
)


print('Experiment #3.1 - fisherclassifier with word frequency limit and cat-minimum')
cl = tc.fisherclassifier(getfeatures=getfeatures_ignoring_common_words)
# Set the minimum threshold for the 'spam' category
cl.setminimum('spam', 0.929)
experiment(cl, train_data, test_data)

print('Experiment #3.2 - fisherclassifier with cat-minimum')
cl = tc.fisherclassifier()
# Set the minimum threshold for the 'spam' category
cl.setminimum('spam', 0.949)
experiment(cl, train_data, test_data)


The results are better.
The Fisher classifier with the word frequency limit and cat-minimum has 97.51% accuracy.
It makes 6 mistakes on the 'ham' category and 58 mistakes on 'spam', which is 0.27% of the 'ham' messages and 17.16% of the 'spam' messages.

The Fisher classifier with only cat-minimum has 98.09% accuracy.
It makes 1 mistake on the 'ham' category and 48 mistakes on 'spam', which is 0.04% of the 'ham' messages and 14.2% of the 'spam' messages.
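
The percentages above follow, up to rounding, from the mistake counts and the test subset sizes (2234 'ham' and 338 'spam' messages). The sketch below recomputes them; the helper `error_rates` is a name introduced for this illustration.

```python
from typing import Dict

# Test subset sizes reported in the data overview
TEST_TOTALS: Dict[str, int] = {'ham': 2234, 'spam': 338}


def error_rates(mistakes: Dict[str, int], totals: Dict[str, int]) -> Dict[str, float]:
    """Return per-category error rates and overall accuracy, in percent."""
    rates = {cat: 100 * mistakes[cat] / totals[cat] for cat in totals}
    all_messages = sum(totals.values())
    all_mistakes = sum(mistakes[cat] for cat in totals)
    rates['accuracy'] = 100 * (all_messages - all_mistakes) / all_messages
    return rates


# Mistake counts reported for experiments #3.1 and #3.2
for name, mistakes in [
    ('#3.1', {'ham': 6, 'spam': 58}),
    ('#3.2', {'ham': 1, 'spam': 48}),
]:
    rates = error_rates(mistakes, TEST_TOTALS)
    print('%s: ham %.2f%%, spam %.2f%%, accuracy %.2f%%'
          % (name, rates['ham'], rates['spam'], rates['accuracy']))
```

Reporting the two error rates separately matters here because the classes are imbalanced: overall accuracy stays high even when a sizable share of the 'spam' messages is missed.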