# Final Exam Dataset #3: Document Classification

This task is about classifying text documents as class 0 or class 1. We intentionally hide the meaning of the class labels, but you 
can think of a class label as the category of an article. For example, given an article, we can classify it as being
about Technology or Politics based on the words used in the article. 

For example, the following article can be classified as Technology news:
<blockquote>
<a href="http://technews.acm.org/archives.cfm?fo=2015-02-feb/feb-13-2015.html#773401">
http://technews.acm.org/archives.cfm?fo=2015-02-feb/feb-13-2015.html#773401</a>

<b>With Google Glass App Developed at UCLA, Scientists Can Analyze Plants' Health in Seconds</b>
<br/>
(UCLA Newsroom (CA) (02/09/15) Shaun Mason)

<i>
University of California, Los Angeles (UCLA) researchers have developed a Google Glass application that enables the wearer to quickly analyze the health of a plant without damaging it. The app analyzes the concentration of chlorophyll, which indicates water, soil, and air quality. Conventional methods for measuring chlorophyll concentration involve removing some of the plant's leaves, dissolving them in a chemical solvent, and then performing the chemical analysis. With the new Google Glass app, leaves are examined and then left functional and intact. The system relies on an image captured by the Google Glass camera to measure the chlorophyll's light absorption in the green part of the optical spectrum. The system also has a handheld illuminator unit that can be produced using three-dimensional printing. The user controls the device with the Google Glass touch control pad or with the voice command feature. The system photographs the leaf and wirelessly sends an enhanced image to a remote server, which processes the data from the image and sends back a chlorophyll concentration reading in less than 10 seconds. "This will allow a scientist to get readings walking from plant to plant in a field of crops, or look at many different plants in a drought-plagued area and accumulate plant health data very quickly," says UCLA professor Aydogan Ozcan.
</i>
</blockquote>
One way to convert the above article into a format that we can train classifiers on is to use the bag-of-words model. 
A good explanation of this model can be found here:
<a href="http://en.wikipedia.org/wiki/Bag-of-words_model">
http://en.wikipedia.org/wiki/Bag-of-words_model</a>
        
If we apply the bag of words model to the above article, we get a dictionary:
<pre>
{
    "Google":5,
    "glass":5,
    "app":3,
    ... # There are many more entries in this dictionary; we list only the first 3 words and their frequencies
}
</pre>
This means that the word "Google" appeared in the article 5 times, the word "glass" appeared 5 times, and the word "app" appeared 3 times. We then replace
"Google" by feature number 1, "glass" by feature number 2, and "app" by feature number 3
to get the following entry to represent this text document:
<blockquote>
1:5 2:5 3:3  ... # with many more entries in this document, but we only list the first 3
</blockquote>
If we assign this document the class label 0 (to represent Technology),
then we get the following entry in the training set:
<blockquote>
0 1:5 2:5 3:3 ... # with many more entries in this document, but we only list the first 3
</blockquote>
to represent this article.
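
The conversion described above can be sketched in a few lines of Python. This is only a minimal illustration, not the pipeline that produced the dataset: the tokenizer (lowercasing and keeping runs of letters), the helper name `to_sparse_line`, and the feature numbering (words numbered in order of first appearance) are all assumptions made for this example.

```python
import re
from collections import Counter

def to_sparse_line(text, label, vocab):
    """Turn raw text into a 'label index:count ...' line (hypothetical helper)."""
    # Assumed tokenizer: lowercase the text and keep runs of letters only.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    # Give each previously unseen word the next free feature number.
    for word in counts:
        if word not in vocab:
            vocab[word] = len(vocab) + 1
    # Emit the entries sorted by feature number.
    pairs = sorted((vocab[w], c) for w, c in counts.items())
    return str(label) + " " + " ".join(f"{i}:{c}" for i, c in pairs)

vocab = {}
line = to_sparse_line("Google Glass app: the Google Glass app", 0, vocab)
print(line)  # prints: 0 1:2 2:2 3:2 4:1
```

Passing the same `vocab` dictionary to every call keeps the feature numbers consistent across documents, which is what lets all documents share one set of columns.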

Thus, given the attached training set with 417852 text documents in 2 categories (class 0 and class 1), you are to build a model 
(or an ensemble of models) that classifies the test set (with 32301 text documents) as accurately as possible.


## Tutorial on converting raw text data into a bag of words

If you want to learn how to use Python to convert raw data into a bag of words, please refer to the following link:
https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words

<b>To simplify this assignment, we have already converted the raw data from articles for you.</b>  However, the tutorial linked above explains how this kind of data is generated.

## Sample code to solve this problem

In [2]:
# We first load the data using scikit-learn's load_svmlight_file
from sklearn.datasets import load_svmlight_file
import numpy as np

# You might want to specify the exact path to locate your input files
# Say, "C:\\cs249\\train.txt" for Windows users.

x_train, y_train = load_svmlight_file("train.txt")
x_test, y_test = load_svmlight_file("test.txt")

print(x_train.shape)  # should output (417852, 802934)
print(y_train.shape)  # should output (417852,)
print(x_test.shape)   # should output (32301, 802934)
print(y_test.shape)   # should output (32301,)


(417852, 802934)
(417852,)
(32301, 802934)
(32301,)


x_train holds the training set without the class labels.
y_train holds the class labels for the training set.

x_test holds the test set without the test labels.
y_test holds the class labels for the test set.

However, we've assigned -1 to all the class labels for the test set.  Your job is to develop a classifier that predicts the correct class labels for the test set.

x_train.shape shows the dimensions of the training set. 
In this dataset, the training set has 417852 training examples with 802934 features (one per distinct word). 
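
When the two files are loaded separately, `load_svmlight_file` infers the number of columns from the largest feature index it sees in each file, so the training and test matrices can in principle end up with different widths. The sketch below shows one way to guard against that with the `n_features` parameter. The two tiny in-memory files are made up for illustration and are not part of the dataset.

```python
from io import BytesIO
from sklearn.datasets import load_svmlight_file

# Made-up miniature files in the same "label index:count" format.
train_bytes = b"0 1:5 2:5 3:3\n1 2:1 7:2\n"
test_bytes = b"-1 1:1 3:2\n"

# The training file determines the number of columns.
x_tr, y_tr = load_svmlight_file(BytesIO(train_bytes))

# Force the test matrix to use the same number of columns.
x_te, y_te = load_svmlight_file(BytesIO(test_bytes), n_features=x_tr.shape[1])

print(x_tr.shape, x_te.shape)
```

If both shapes already agree, as in the output above, passing `n_features` changes nothing; it only matters when the test file happens not to contain the highest-numbered word.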

A challenge in this dataset is that it is stored in sparse matrix format instead of dense matrix format. For example, the 
first row of the training set is:

0 188:1 13191:1 118098:1

This means that this text document contains only word 188, word 13191, and word 118098 (after removing stopwords like "a", "the", "as", etc.). If we represent this in dense matrix format, this row will hold 802931 zeros and only 3 ones:

[0,0,...0,1,0,...,0,1,0,...,0,1,0,..,0] 

Storing 802931 zeros and 3 ones wastes a lot of memory, so in practice people solve this problem by storing only the index of each word that is present, together with its frequency, in a sparse matrix format:

0 188:1 13191:1 118098:1
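
To get a feel for the memory savings, the sketch below builds a single row like the one above as a SciPy CSR matrix (the type that `load_svmlight_file` returns) and compares its storage with the size a dense row of 64-bit floats would need. The feature numbers from the example are used directly as column positions here, which is an assumption made only for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix

n_features = 802934

# One document with three nonzero counts (column positions assumed).
cols = np.array([188, 13191, 118098])
rows = np.zeros(3, dtype=int)
vals = np.ones(3)
row = csr_matrix((vals, (rows, cols)), shape=(1, n_features))

# Bytes used by the three arrays that make up the CSR representation.
sparse_bytes = row.data.nbytes + row.indices.nbytes + row.indptr.nbytes

# Bytes a dense float64 row would need (computed, never allocated).
dense_bytes = n_features * np.dtype(np.float64).itemsize

print(row.nnz)                    # number of stored entries
print(sparse_bytes, dense_bytes)  # a few dozen bytes versus several megabytes
```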

Unfortunately, not all the machine learning estimators in scikit-learn support sparse matrices.
Below is a list of estimators that should be able to work with the sparse matrix format:
<pre>
svm.SVR()
svm.NuSVR()
naive_bayes.BernoulliNB()
naive_bayes.MultinomialNB()
neighbors.KNeighborsRegressor()
linear_model.ElasticNet()
linear_model.PassiveAggressiveRegressor()
linear_model.PassiveAggressiveClassifier()
linear_model.Perceptron()
linear_model.Ridge()
linear_model.Lasso()
linear_model.LinearRegression()
linear_model.LogisticRegression()
linear_model.SGDClassifier()
linear_model.SGDRegressor()
</pre>
There might be other estimators that can work with the sparse matrix format, though.
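
As a quick illustration of what "working with sparse matrices" means, the sketch below fits one estimator from the list directly on a SciPy sparse matrix, without converting it to a dense array first. The four documents and six features are invented for this example and have nothing to do with the real dataset.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

# Tiny made-up count matrix: 4 documents, 6 features.
dense = np.array([
    [2, 1, 0, 0, 0, 0],
    [1, 3, 0, 0, 0, 0],
    [0, 0, 0, 2, 1, 0],
    [0, 0, 0, 1, 2, 0],
])
x_small = csr_matrix(dense)        # same kind of object as x_train
y_small = np.array([0, 0, 1, 1])

# The estimator accepts the sparse matrix as-is.
clf = LogisticRegression()
clf.fit(x_small, y_small)
pred = clf.predict(x_small)
print(pred)
```

An estimator that does not support sparse input would typically raise an error at the `fit` call instead, which makes a small trial like this a cheap way to check before training on the full data.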

In [5]:
# Since we have the training data, we can train a Naive Bayes classifier on it.

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

clf_NB = MultinomialNB()
clf_NB.fit(x_train, y_train)
y_train_pred_NB = clf_NB.predict(x_train)
confusion_matrix(y_train, y_train_pred_NB) 


array([[262154,  15223],
       [ 22006, 118469]])

In [6]:
accuracy_score(y_train, y_train_pred_NB) 

0.91090386069708895

You can see that the training accuracy is 91.09%.
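
The accuracy can also be recovered by hand from the confusion matrix printed above. In scikit-learn's convention the rows are the true labels and the columns are the predicted labels, so the diagonal holds the correctly classified documents. The short sketch below redoes that arithmetic and also computes the fraction of each class that was classified correctly; the variable names are chosen for this example only.

```python
# Confusion matrix from the output above:
# rows are true labels, columns are predicted labels.
cm = [[262154, 15223],
      [22006, 118469]]

correct = cm[0][0] + cm[1][1]         # diagonal entries
total = sum(sum(r) for r in cm)       # all training documents
accuracy = correct / total

# Fraction of each true class that was predicted correctly.
recall_0 = cm[0][0] / sum(cm[0])
recall_1 = cm[1][1] / sum(cm[1])

print(total)               # prints: 417852
print(round(accuracy, 4))  # prints: 0.9109
print(round(recall_0, 3), round(recall_1, 3))
```

Keep in mind that these numbers are measured on the same documents the classifier was trained on, so they may overstate how well the model does on the test set.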

## What your program's output should look like

Your program should output 32301 lines, each containing a single 0 or 1, giving the predicted label for each document in the test set, like:
<pre>
0
0
1
...
1
</pre>

## How to format your output correctly for Mooshak

In [None]:
# You can use the following code to generate output that is accepted by Mooshak:

# generate the test labels
y_test_pred_NB = clf_NB.predict(x_test)                    

output_str = "\n".join(map(str,y_test_pred_NB.astype(int))) 

# You might want to specify the exact path to output your text file
# Say, "C:\\cs249\\output_label.txt" for Windows users.
with open('output_label.txt', 'w') as f:
    f.write(output_str)

# Then submit the .txt file to Mooshak
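
Before submitting, it may be worth checking that the file really has the expected shape. The helper below is a hypothetical sanity check, not part of Mooshak: it only verifies the line count and that every line is `0` or `1`. Allowing one trailing newline at the end of the file is an assumption of this sketch.

```python
def check_predictions(text, expected_lines):
    """Return True if text has exactly expected_lines lines, each '0' or '1'."""
    lines = text.split("\n")
    # Tolerate a single trailing newline at the end of the file (assumption).
    if lines and lines[-1] == "":
        lines = lines[:-1]
    return len(lines) == expected_lines and all(l in ("0", "1") for l in lines)

print(check_predictions("0\n0\n1\n1", 4))  # prints: True
print(check_predictions("0\n-1\n1", 3))    # prints: False
```

For the real submission, the contents of `output_label.txt` would be read into a string and passed in together with the expected count of 32301.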