# Document categorization (Binary Classification):

Here we use document sets from two different categories and try to classify them based on their content. We use three algorithms for this classification: `Naive Bayes`, `Support Vector Machines` and `Random Forest`.

The documents are articles from the **Hockey** and **Baseball** newsgroups (`rec.sport.hockey` and `rec.sport.baseball`).

In [1]:
import nltk
from nltk.corpus import PlaintextCorpusReader
hockey_dir = "Data/data_set/mini_newsgroups/mini_newsgroups/rec.sport.hockey"
baseball_dir = "Data/data_set/mini_newsgroups/mini_newsgroups/rec.sport.baseball"

In [41]:
# creating corpus
# changing the encoding from 'utf-8' to 'latin-1', as 'utf-8' gives encoding errors further down
hockey_corpus = PlaintextCorpusReader(hockey_dir,'.*', encoding='latin-1')
baseball_corpus = PlaintextCorpusReader(baseball_dir,'.*',encoding='latin-1')
hockey_corpus.fileids()[:10]

['52550',
 '52555',
 '52561',
 '52569',
 '52587',
 '52589',
 '52599',
 '52600',
 '52616',
 '52618']

In [42]:
baseball_corpus.fileids()[:10]

['102590',
 '102605',
 '102606',
 '102612',
 '102622',
 '102625',
 '102631',
 '102641',
 '102649',
 '102655']

In [43]:
hockey_corpus.raw('52555')

'Newsgroups: rec.sport.hockey\nPath: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!fs7.ece.cmu.edu!europa.eng.gtefsd.com!howland.reston.ans.net!usc!cs.utexas.edu!utnut!torn!watserv2.uwaterloo.ca!watmath!undergrad.math.uwaterloo.ca!leibniz.uwaterloo.ca!ayim\nFrom: ayim@leibniz.uwaterloo.ca (Alfred Yim)\nSubject: And... THEY\'RE OFF!!!!!\nMessage-ID: <C4z808.2v2@undergrad.math.uwaterloo.ca>\nKeywords: Leafs Chicago\nSender: news@undergrad.math.uwaterloo.ca\nOrganization: University of Waterloo\nDate: Sun, 4 Apr 1993 20:38:32 GMT\nLines: 39\n\nWell, I gotta tell ya,\n\nlast night\'s Leafs game vs the Devils was a nail-bitter LET ME TELL YOU!\nIt was a well played game by BOTH teams (I thought) but according to the\nDon and Ron it was the an "off-night" for the Leafs and the Devils \nwere outplaying Toronto. Well, I BEG to differ....\n\nIMHO, Clark deserved to be a first star as much as Gilmour did. His\nfast breaks towards the net and the good opportunites that he\ncreated reminded m

## Preprocessing the data:

In [44]:
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

In [45]:
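import re
def preprocessor(text):
    # replace non-word characters, underscores and digits with spaces
    text = re.sub(r'\W+|_|\d+',' ',text)
    text = word_tokenize(text)
    # note: stop_words is lowercase and the text is not lowercased here,
    # so capitalised tokens such as 'I' or 'The' are not removed
    text = [word for word in text if word not in stop_words]
    # reduce each word to its stem, e.g. 'baseball' -> 'basebal'
    text = [stemmer.stem(word) for word in text]
    return text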
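
One detail worth checking in `preprocessor` is that the stop-word test is case-sensitive, which is why tokens such as `I` survive in the outputs further down. The sketch below illustrates the effect using only the standard library. `TOY_STOP_WORDS` and `filter_tokens` are hypothetical stand-ins for illustration, not part of NLTK or of this notebook.

```python
import re

# Hypothetical, tiny stand-in for a lowercase English stop-word list.
TOY_STOP_WORDS = {"i", "the", "was", "a", "to"}

def filter_tokens(text, lowercase):
    """Replace non-letters with spaces, split on whitespace, drop stop words."""
    cleaned = re.sub(r"[^A-Za-z]+", " ", text)
    if lowercase:
        cleaned = cleaned.lower()
    return [tok for tok in cleaned.split() if tok not in TOY_STOP_WORDS]

sample = "I thought The game was a nail-biter"
case_sensitive = filter_tokens(sample, lowercase=False)
case_folded = filter_tokens(sample, lowercase=True)
print(case_sensitive)
print(case_folded)
```

Lowercasing before the filter would remove those tokens, but it would also change the word counts reported in this notebook, so it is shown here only as an option.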

In [46]:
# preprocessing our corpus
# article = [(data1,type1),(data2,type1)......]
hockey_articles = [(preprocessor(hockey_corpus.raw(fileid)),'hockey') for fileid in hockey_corpus.fileids()]
baseball_articles = [(preprocessor(baseball_corpus.raw(fileid)),'baseball') for fileid in baseball_corpus.fileids()]

In [52]:
print(hockey_articles[0][0][:20],hockey_articles[0][1])

['newsgroup', 'rec', 'sport', 'hockey', 'path', 'cantaloup', 'srv', 'cs', 'cmu', 'edu', 'crabappl', 'srv', 'cs', 'cmu', 'edu', 'fs', 'ece', 'cmu', 'edu', 'europa'] hockey


## Merging our datasets and building featureset:

In [53]:
documents = hockey_articles + baseball_articles
print(type(documents))

<class 'list'>


In [63]:
print(documents[10])
print("__________________________________________________________________________________________________")
print("")
print(documents[11])

(['path', 'cantaloup', 'srv', 'cs', 'cmu', 'edu', 'rochest', 'udel', 'gatech', 'howland', 'reston', 'an', 'net', 'noc', 'near', 'net', 'mozz', 'unh', 'edu', 'kepler', 'unh', 'edu', 'mtt', 'from', 'mtt', 'kepler', 'unh', 'edu', 'matthew', 'T', 'thompson', 'newsgroup', 'rec', 'sport', 'basebal', 'subject', 'music', 'censorship', 'survey', 'pleas', 'fill', 'date', 'apr', 'gmt', 'organ', 'univers', 'new', 'hampshir', 'durham', 'NH', 'line', 'messag', 'ID', 'qhph', 'sk', 'mozz', 'unh', 'edu', 'refer', 'C', 'hna', 'F', 'cs', 'uiuc', 'edu', 'nntp', 'post', 'host', 'kepler', 'unh', 'edu', 'hello', 'I', 'paper', 'censorship', 'music', 'I', 'would', 'appreci', 'took', 'time', 'particip', 'survey', 'pleas', 'answer', 'question', 'ask', 'simpli', 'mean', 'room', 'explain', 'answer', 'chose', 'the', 'last', 'question', 'comment', 'question', 'suggest', 'thank', 'advanc', 'pleas', 'E', 'mail', 'address', 'end', 'I', 'male', 'femal', 'II', 'age', 'iii', 'major', 'occup', 'IV', 'type', 'music', 'liste

#### Randomizing:

In [55]:
#randomizing
import random
random.shuffle(documents)

#### Gathering all words together:

In [64]:
all_words = list()
for tuple_ in documents:
    for word in tuple_[0]:
        all_words.append(word)

In [71]:
all_words_frequency = nltk.FreqDist(all_words)
all_words_frequency.most_common(10)

[('edu', 1571),
 ('I', 791),
 ('apr', 506),
 ('cmu', 423),
 ('cs', 416),
 ('com', 401),
 ('news', 299),
 ('net', 260),
 ('srv', 249),
 ('sport', 248)]

In [73]:
len(set(all_words)),len(all_words)

(5863, 42598)

In [74]:
print(all_words_frequency['game'])

241


The ten most common words across both data sets combined are `edu` (1571), `I` (791), `apr` (506), `cmu` (423), `cs` (416), `com` (401), `news` (299), `net` (260), `srv` (249) and `sport` (248). Several of these, such as `edu`, `cmu` and `srv`, appear in the newsgroup headers (for example the `Path` line) rather than in the article body.

Many words are also repeated across documents: the `all_words` list has 42598 entries, while `set(all_words)` has only 5863.
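
The difference between the total token count and the vocabulary size can be reproduced on a small scale with `collections.Counter`, which offers the same `most_common` method used above. This is a minimal sketch on made-up token lists; `toy_documents` is a hypothetical name, not data from the newsgroup corpus.

```python
from collections import Counter

# Made-up token lists standing in for two preprocessed articles.
toy_documents = [
    ["goal", "ice", "game", "edu", "game"],
    ["base", "pitch", "game", "edu"],
]

# Flatten all documents into a single token list, then count occurrences.
tokens = [tok for doc in toy_documents for tok in doc]
frequency = Counter(tokens)

total_tokens = len(tokens)        # every occurrence counts
vocabulary_size = len(frequency)  # each distinct word counts once
print(total_tokens, vocabulary_size)
print(frequency.most_common(2))
```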

#### Building feature set:
In the feature set, each document is mapped to a dictionary that assigns `True` or `False` to every word in the vocabulary, depending on whether that word is present in the document.

In [83]:
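# use each distinct word once; looping over the full all_words list repeats work for duplicates
word_features = list(all_words_frequency.keys())
# function that maps a document's words to an NLTK-friendly feature dict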
def find_dataset(wordset):
    words = set(wordset)
    features = {}
    for w in word_features:
        features[w] = w in words
    return features

In [85]:
feature_set = [(find_dataset(wordset),cat) for (wordset,cat) in documents]

In [91]:
type(feature_set[0][0])

dict

In [92]:
type(feature_set[0][1])

str

Our feature set is ready with bool values.
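
As a small illustration of the same presence/absence encoding, the sketch below builds features for two made-up documents and restricts the vocabulary to the most frequent words. The names `build_features` and `toy_documents`, and the cut-off of 3 words, are assumptions for illustration only; the notebook itself uses every word it has seen.

```python
from collections import Counter

def build_features(tokens, vocabulary):
    """Map every vocabulary word to True/False for presence in `tokens`."""
    present = set(tokens)
    return {word: (word in present) for word in vocabulary}

# Made-up (tokens, label) pairs in the same shape as `documents`.
toy_documents = [
    (["goal", "ice", "game"], "hockey"),
    (["base", "pitch", "game"], "baseball"),
]

counts = Counter(tok for tokens, _ in toy_documents for tok in tokens)
# Keep only the N most frequent words; N=3 is arbitrary for this toy data.
vocabulary = [word for word, _ in counts.most_common(3)]

toy_feature_set = [(build_features(tokens, vocabulary), label)
                   for tokens, label in toy_documents]
print(vocabulary)
print(toy_feature_set[0])
```

Capping the vocabulary keeps each feature dictionary small, at the cost of ignoring rare words that might still be informative.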

## Creating and training our model:

In [93]:
len(feature_set)

202

In [94]:
#train and test set
train = feature_set[:160]
test = feature_set[160:]
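
The split above depends on the earlier unseeded `random.shuffle`, so the accuracies change from run to run. A minimal sketch of a reproducible shuffle-and-split follows; `split_dataset`, the seed value and the placeholder items are assumptions for illustration.

```python
import random

def split_dataset(items, train_size, seed=42):
    """Shuffle a copy of `items` reproducibly, then split it in two."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]

# Placeholder items standing in for the 202 (features, label) pairs.
toy_items = list(range(202))
toy_train, toy_test = split_dataset(toy_items, train_size=160)
print(len(toy_train), len(toy_test))
```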

In [98]:
# Building Naive Bayes Model
classifier = nltk.NaiveBayesClassifier.train(train)
classifier

<nltk.classify.naivebayes.NaiveBayesClassifier at 0x16f50f61780>

In [101]:
# note: this measures accuracy on the training set, not on the held-out test set
accuracy = nltk.classify.accuracy(classifier,train)
accuracy

0.99375

In [106]:
classifier.show_most_informative_features()

Most Informative Features
                  hockey = False          baseba : hockey =     53.7 : 1.0
                 basebal = False          hockey : baseba =     49.7 : 1.0
                    goal = True           hockey : baseba =     13.7 : 1.0
                 basebal = True           baseba : hockey =     12.2 : 1.0
                    note = True           baseba : hockey =      7.7 : 1.0
                    base = True           baseba : hockey =      7.7 : 1.0
                  cornel = True           baseba : hockey =      7.0 : 1.0
                     cup = True           hockey : baseba =      7.0 : 1.0
                  philli = True           baseba : hockey =      7.0 : 1.0
                     ice = True           hockey : baseba =      7.0 : 1.0
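
Each row of the table above compares how likely a feature value is under one label versus the other. The sketch below shows one way such a ratio can be computed from document counts. The add-0.5 smoothing and the counts are assumptions chosen for illustration, so the result is not meant to reproduce NLTK's numbers.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Add-gamma estimate of P(feature value | label) over `bins` outcomes."""
    return (count + gamma) / (total + bins * gamma)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """How much more likely the feature value is under label A than label B."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)

# Made-up counts: the feature value occurs in 45 of 50 documents of label A
# and in 4 of 50 documents of label B.
ratio = likelihood_ratio(45, 50, 4, 50)
print(f"{ratio:.1f} : 1.0")
```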


In [118]:
# predicting labels for the test set
results = [classifier.classify(x[0]) for x in test]
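
The predictions in `results` are not scored above. One way to score them is to compare them with the true labels, which in this notebook could be taken as `[label for _, label in test]`; calling `nltk.classify.accuracy(classifier, test)` would be the more direct route. The sketch below uses made-up labels, and `accuracy_and_confusion` is a hypothetical helper.

```python
from collections import Counter

def accuracy_and_confusion(predicted, actual):
    """Return the fraction of correct predictions and (actual, predicted) counts."""
    confusion = Counter(zip(actual, predicted))
    correct = sum(n for (a, p), n in confusion.items() if a == p)
    return correct / len(actual), confusion

# Made-up labels standing in for the true test labels and `results`.
toy_actual = ["hockey", "hockey", "baseball", "baseball", "hockey"]
toy_predicted = ["hockey", "baseball", "baseball", "baseball", "hockey"]

toy_accuracy, toy_confusion = accuracy_and_confusion(toy_predicted, toy_actual)
print(toy_accuracy)
print(dict(toy_confusion))
```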

## Support Vector Machines:

In [143]:
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC
svc_classifier = SklearnClassifier(SVC(kernel='rbf'))

In [127]:
svc_classifier.train(train)
accuracy_train = nltk.classify.accuracy(svc_classifier,train)
accuracy_train

0.975

In [129]:
accuracy_test = nltk.classify.accuracy(svc_classifier,test)
accuracy_test

0.9285714285714286

#### Fine tuning:

In [142]:
svc_classifier = SklearnClassifier(SVC(kernel='rbf',C=8))
svc_classifier.train(train)
accuracy_train = nltk.classify.accuracy(svc_classifier,train)
print(accuracy_train)
accuracy_test = nltk.classify.accuracy(svc_classifier,test)
print(accuracy_test)

0.99375
1.0
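
Because `C` was chosen by looking at the test accuracy, the test score reported here may be optimistic. A common alternative is to tune on validation folds carved out of the training data and to use the test set only once at the end. The sketch below generates the index splits for such a k-fold scheme in plain Python; `k_fold_indices` and the choice of 5 folds are assumptions for illustration.

```python
def k_fold_indices(n_items, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        training = list(range(0, start)) + list(range(start + size, n_items))
        yield training, validation
        start += size

# 160 matches the size of the training set used in this notebook.
folds = list(k_fold_indices(n_items=160, k=5))
for training, validation in folds:
    print(len(training), len(validation))
```

Each candidate value of `C` would then be trained on the training indices of a fold and scored on its validation indices, with the scores averaged over the folds.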


## Random Forest:

In [144]:
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.ensemble import RandomForestClassifier

In [149]:
RFclassifier = SklearnClassifier(RandomForestClassifier(n_estimators=50))
RFclassifier.train(train)
accuracy_train = nltk.classify.accuracy(RFclassifier,train)
print(accuracy_train)
accuracy_test = nltk.classify.accuracy(RFclassifier,test)
print(accuracy_test)

0.99375
1.0


## Conclusion:
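We used three classification algorithms here: `Naive Bayes`, `Support Vector Machines` and `Random Forest`. `Naive Bayes` reached 99.4% accuracy on the training set; its accuracy on the test set was not measured above. The `Support Vector Machines` classifier scored 92.9% on the test set with default settings and 100% after setting `C=8`, and `Random Forest` with `n_estimators=50` also scored 100% on the test set. The test set contains only 42 documents, so these figures are rough estimates.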