# Naive Bayes Document Classification

This program, created by Brandon Watts, classifies documents with a Multinomial Naive Bayes classifier.

### Creating the Mapping

To start, create a CSV file with the following structure:

| Label  | Directory|
| -------|:--------:|
| NEWS   | News     |
| COMP   | Comp     |


In this example we have 2 labels (NEWS & COMP), which map to the 2 directories (News & Comp) respectively.
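
As a minimal sketch, the mapping file can also be generated with pandas rather than written by hand. The temporary directory below is an assumption made to keep the example self-contained; the rest of this walkthrough reads the file as `mapping.csv` from the working directory.

```python
import os
import tempfile

import pandas as pd

# Label-to-directory pairs as described in the example above.
mapping = pd.DataFrame({"Label": ["NEWS", "COMP"], "Directory": ["News", "Comp"]})

# A temporary directory keeps this sketch self-contained; in practice the
# file would sit wherever the notebook reads "mapping.csv" from.
csv_path = os.path.join(tempfile.mkdtemp(), "mapping.csv")

# index=False keeps the file to exactly two columns: Label and Directory.
mapping.to_csv(csv_path, index=False)

# Read the file back to confirm its structure.
round_trip = pd.read_csv(csv_path)
print(round_trip)
```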

### Loading the Data

We will first load the dependencies.

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import Tools.TextTools as TextTools
from sklearn import metrics
import sys

Now let's create a list to hold each directory's DataFrame once it is created, so we can concatenate them all together.

In [2]:
documents_dataframes = []

We now need to read in the CSV file we made earlier.

In [3]:
documents_mapping_df = pd.read_csv("mapping.csv")

In [4]:
print(documents_mapping_df)

  Label Directory
0  COMP      Comp
1  NEWS      News


Now we need to read through each row and create a DataFrame with all the files in that directory mapped to the supplied label.

In [6]:
for index, row in documents_mapping_df.iterrows():
    documents_dataframes.append(TextTools.build_data_frame(row["Label"], row['Directory']))
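
`TextTools.build_data_frame` is not shown here, so the following is only a guess at what such a helper might do. The name `build_data_frame_sketch`, the UTF-8 encoding, and the index format are all assumptions; the real helper may label rows and index entries differently.

```python
import os
import tempfile

import pandas as pd


def build_data_frame_sketch(label, directory):
    """Hypothetical stand-in for TextTools.build_data_frame (not the real helper)."""
    rows, index = [], []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip sub-directories
        # Assumption: files are text; undecodable bytes are replaced.
        with open(path, encoding="utf-8", errors="replace") as handle:
            rows.append({"class": label, "text": handle.read()})
        index.append(f"{label}/{name}")
    return pd.DataFrame(rows, index=index, columns=["class", "text"])


# Build a tiny throwaway directory to exercise the sketch.
demo_dir = tempfile.mkdtemp()
for name, body in [("1.txt", "Cone trees"), ("2.txt", "Instance pruning")]:
    with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as handle:
        handle.write(body)

demo_df = build_data_frame_sketch("COMP", demo_dir)
print(demo_df)
```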

In [7]:
print(documents_dataframes)

[               class                                               text
COMP/13259.txt  Comp     Cone Trees in the UGA Graphics System: Sugg...
COMP/7183.txt   Comp     The Challenge of Deep Models, Inference Str...
COMP/25473.txt  Comp     Extracting Multi-Dimensional Signal Feature...
COMP/40879.txt  Comp        Instance Pruning Techniques      Abstrac...
COMP/39955.txt  Comp     Structured Interviews on the Object-Oriente...
COMP/23267.txt  Comp     P++: A Language for Software System Generat...
COMP/16393.txt  Comp     Karin Petersen Kai Li   Department of Compu...
COMP/39172.txt  Comp     Block Edit Models for Approximate String Ma...
COMP/18209.txt  Comp     Mutable Object State for Object-Oriented Lo...
COMP/20782.txt  Comp     High Performance Geographic Information Sys...
COMP/23507.txt  Comp     Models for Computer Generated Parody       ...
COMP/43032.txt  Comp     Observations and Recommendations on the Int...
COMP/10894.txt  Comp     A Safe, Efficient Regression Test Sele

Now let's concatenate all these smaller DataFrames into 1 big one.

In [12]:
documents = pd.concat(documents_dataframes)

In [13]:
print(documents.head())

               class                                               text
COMP/13259.txt  Comp     Cone Trees in the UGA Graphics System: Sugg...
COMP/7183.txt   Comp     The Challenge of Deep Models, Inference Str...
COMP/25473.txt  Comp     Extracting Multi-Dimensional Signal Feature...
COMP/40879.txt  Comp        Instance Pruning Techniques      Abstrac...
COMP/39955.txt  Comp     Structured Interviews on the Object-Oriente...


The last thing we will do is shuffle the DataFrame so that files with the same label are not right next to each other.

In [19]:
documents = documents.reindex(np.random.permutation(documents.index))

In [20]:
print(documents.head())

                     class                                               text
NEWS/cv019_14482.txt  News  there's something about ben stiller that makes...
COMP/23596.txt        Comp     The Effect of Group Size and Communication ...
COMP/20782.txt        Comp     High Performance Geographic Information Sys...
COMP/13259.txt        Comp     Cone Trees in the UGA Graphics System: Sugg...
COMP/23507.txt        Comp     Models for Computer Generated Parody       ...
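
An alternative way to shuffle, shown here as a sketch on made-up toy data, is `DataFrame.sample(frac=1)`. Unlike the `reindex` approach above, it does not require the index labels to be unique, and `random_state` makes the order reproducible.

```python
import pandas as pd

# Toy stand-in for the concatenated documents DataFrame (hypothetical data).
toy = pd.DataFrame(
    {"class": ["Comp", "Comp", "News", "News"], "text": ["a", "b", "c", "d"]},
    index=["COMP/1.txt", "COMP/2.txt", "NEWS/1.txt", "NEWS/2.txt"],
)

# frac=1 samples every row exactly once, which amounts to a shuffle.
shuffled = toy.sample(frac=1, random_state=0)
print(shuffled)
```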


### Cleaning the text

I have created a module called "TextTools" that comes with the standard NLP pipeline components. Right now we clean the text by removing unnecessary spacing, removing stop words, and applying lemmatization.

In [22]:
documents['text'] = documents['text'].map(TextTools.clean_text).map(TextTools.detokenize)

In [24]:
print(documents.head())

                     class                                               text
NEWS/cv019_14482.txt  News  's someth ben stiller make popular choic among...
COMP/23596.txt        Comp  the effect group size commun mode cscw environ...
COMP/20782.txt        Comp  high perform geograph inform system : experi s...
COMP/13259.txt        Comp  cone tree uga graphic system : suggest robust ...
COMP/23507.txt        Comp  model comput gener parodi abstract thi paper o...
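
The internals of `TextTools` are not shown, so the sketch below only illustrates the general shape of a clean-then-detokenize step. The function names, the tiny stop-word set, and the tokenization regex are assumptions, and the word-normalization step (stemming or lemmatization) is deliberately left out.

```python
import re

# Tiny illustrative stop-word set; a real pipeline would use a fuller list.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def clean_text_sketch(text):
    """Lowercase, tokenize (which also drops extra spacing), remove stop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def detokenize_sketch(tokens):
    """Join tokens back into one space-separated string."""
    return " ".join(tokens)


sample = "   Cone Trees in the UGA   Graphics System  "
cleaned = detokenize_sketch(clean_text_sketch(sample))
print(cleaned)  # cone trees uga graphics system
```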


### Creating the testing and training set

In [26]:
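from sklearn.model_selection import train_test_split

# Column 1 holds the cleaned text (features); column 0 holds the class labels.
X = documents.iloc[:, 1].values
y = documents.iloc[:, 0].values
# Hold out 20% of the documents for testing; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)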
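
With a corpus this small, a purely random split can leave one class under-represented in the test set. As a sketch on made-up data, passing `stratify=y` asks `train_test_split` to keep the class proportions roughly equal in both parts; whether that is worthwhile for this dataset is a judgment call.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 6 "Comp" documents and 4 "News" documents.
X = np.array([f"doc {i}" for i in range(10)])
y = np.array(["Comp"] * 6 + ["News"] * 4)

# stratify=y keeps the Comp/News ratio similar in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
print(len(X_train), len(X_test))
```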

To train the classifier we need to turn the text into features. We will use sklearn's 'CountVectorizer' to do this.

In [52]:
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(X_train)
print(counts)

  (0, 320)	1
  (0, 236)	1
  (0, 490)	1
  (0, 6741)	1
  (0, 3245)	1
  (0, 3928)	1
  (0, 6190)	1
  (0, 4674)	1
  (0, 1296)	1
  (0, 6867)	1
  (0, 2418)	1
  (0, 500)	1
  (0, 1015)	1
  (0, 2677)	1
  (0, 1174)	1
  (0, 6673)	1
  (0, 5470)	1
  (0, 5801)	1
  (0, 3723)	1
  (0, 6124)	1
  (0, 4059)	1
  (0, 1753)	1
  (0, 1672)	1
  (0, 5151)	1
  (0, 994)	1
  :	:
  (31, 4833)	2
  (31, 2499)	5
  (31, 1585)	1
  (31, 2504)	2
  (31, 6527)	4
  (31, 4211)	1
  (31, 5817)	1
  (31, 4689)	1
  (31, 1696)	1
  (31, 4401)	2
  (31, 3597)	1
  (31, 4994)	1
  (31, 1535)	1
  (31, 1668)	1
  (31, 6725)	2
  (31, 5281)	3
  (31, 5031)	1
  (31, 6232)	1
  (31, 2806)	1
  (31, 3921)	1
  (31, 2411)	1
  (31, 2564)	1
  (31, 5016)	1
  (31, 1707)	1
  (31, 5239)	1
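
Each line of the sparse output above reads as `(document row, vocabulary index)` followed by the count of that term in that document. The sketch below, on a made-up three-document corpus, shows the same structure at a size small enough to inspect as a dense array.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini corpus of already-cleaned text.
corpus = ["cone tree graphic system", "ben stiller movie", "graphic system system"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)

# vocabulary_ maps each term to its column index in the count matrix.
print(sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1]))

# Dense view: one row per document, one column per vocabulary term.
print(counts.toarray())
```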


### Creating the classifier

In [53]:
classifier = MultinomialNB()
targets = y_train
classifier.fit(counts, targets)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
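
One optional design choice is to wrap the vectorizer and the classifier in a single scikit-learn pipeline, so that new text is vectorized with the training vocabulary automatically. This is a sketch on made-up training strings, not the setup used in this walkthrough.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical cleaned training texts and their labels.
train_texts = [
    "graphic system model",
    "comput system paper",
    "movie actor comedi",
    "film actor plot",
]
train_labels = ["Comp", "Comp", "News", "News"]

# The pipeline fits the vectorizer, then the classifier, in one call.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Raw text goes in; the pipeline vectorizes it before predicting.
prediction = model.predict(["actor film"])
print(prediction)
```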

### Testing the Classifier

In [54]:
preds = classifier.predict(count_vectorizer.transform(X_test))
print(metrics.classification_report(y_test, preds))

             precision    recall  f1-score   support

       Comp       1.00      1.00      1.00         6
       News       1.00      1.00      1.00         3

avg / total       1.00      1.00      1.00         9
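
With only 9 test documents, a perfect report is encouraging but limited evidence. A confusion matrix and a plain accuracy score make individual mistakes easier to spot; the sketch below uses made-up labels and predictions purely to show the calls, not results from this dataset.

```python
from sklearn import metrics

# Hypothetical true labels and predictions, containing one mistake.
y_true = ["Comp", "Comp", "Comp", "News", "News", "News"]
y_pred = ["Comp", "Comp", "News", "News", "News", "News"]
labels = ["Comp", "News"]

# Rows are true classes, columns are predicted classes, in `labels` order.
matrix = metrics.confusion_matrix(y_true, y_pred, labels=labels)
accuracy = metrics.accuracy_score(y_true, y_pred)

print(matrix)
print(accuracy)
```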

