## Machine Learning: Categorization

We use Python's scikit-learn package (`sklearn`, http://scikit-learn.org/) to train the classifier.
We tried four algorithms and picked the best one.

Our strategy is to train on 9k of the 10k labelled 'training data' samples and to predict the categories of the remaining 1k.
Accuracy is then judged by comparing those predictions with the known 'correct answers'.
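
The hold-out evaluation described above can be sketched as follows. This is a minimal illustration on made-up labels, assuming (as the cells further down do) that the first rows are the ones held out. `holdout_split` and `accuracy` are hypothetical helper names, not part of scikit-learn.

```python
import numpy as np

def holdout_split(x, y, n_holdout):
    """Hold out the first n_holdout rows for validation; train on the rest."""
    return x[n_holdout:], y[n_holdout:], x[:n_holdout], y[:n_holdout]

def accuracy(predicted, expected):
    """Fraction of predictions that match the correct answers."""
    predicted = np.asarray(predicted)
    expected = np.asarray(expected)
    return float(np.mean(predicted == expected))

# toy data: 10 samples with 3 features each (values are made up)
x_demo = np.arange(30).reshape(10, 3)
y_demo = np.array([1, 2, 1, 2, 1, 2, 1, 2, 1, 2])

x_tr, y_tr, x_val, y_val = holdout_split(x_demo, y_demo, n_holdout=2)
print(x_tr.shape, x_val.shape)

# a dummy "model" that always predicts category 1
print(accuracy(np.ones_like(y_val), y_val))
```

With the real data, `n_holdout` would be 1000 and the dummy prediction would be replaced by the output of a fitted classifier.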

---

### TRAINING ALGORITHM

    1. Segment the news text into words
    2. Vectorize each article using the frequency table of the D most frequent words (D = dimension of the vector)
    3. Vary the segmentation, e.g. with or without removing punctuation and stop words
    4. Find the best D and segmentation, and apply them to the test data
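
Steps 1 and 2 can be sketched in plain Python on a toy corpus. This is only an illustration, assuming the articles are already segmented into tab-separated words as in the data files used below. `build_vocabulary` and `vectorize` are hypothetical helper names, and the three sample documents are made up.

```python
from collections import Counter

def build_vocabulary(segmented_docs, d):
    """Return the d most frequent tokens across all documents."""
    counts = Counter(tok for doc in segmented_docs for tok in doc.split("\t"))
    return [tok for tok, _ in counts.most_common(d)]

def vectorize(segmented_doc, vocabulary):
    """Count how often each vocabulary token occurs in one document."""
    counts = Counter(segmented_doc.split("\t"))
    # a Counter returns 0 for tokens that never occur in this document
    return [counts[tok] for tok in vocabulary]

# made-up, already segmented documents (tokens separated by tabs)
docs = [
    "stock\tmarket\tstock\trises",
    "team\twins\tmatch",
    "market\tfalls\tstock",
]

vocab = build_vocabulary(docs, d=2)
vectors = [vectorize(doc, vocab) for doc in docs]
print(vocab)
print(vectors)
```

Every document becomes a vector of length D, so the whole corpus can be stacked into one array and passed to a classifier.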

In [1]:
import pandas as pd
import numpy as np

In [2]:
# dimension of the vector
D = 2000

In [3]:
# read-in data for training
train = pd.read_csv("data/train_sep.csv", encoding="utf-8")
# drop the leftover index column if it is present
train = train.drop(columns=["Unnamed: 0"], errors="ignore")

In [4]:
# segmentation without removing stop words, etc. (tested to be the best)
total = "\t".join(train["news"]).split("\t")
total_table = pd.Series(total).value_counts()

In [5]:
# frequency table, the top D words
top_w = total_table[:D].index

In [6]:
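# turn each article into a vector of counts over the top D words
vec = []
for i in range(train.shape[0]):
    d = pd.Series(train["news"][i].split("\t"))
    dc = d.value_counts()
    # reindex instead of dc[top_w]: recent pandas raises a KeyError for
    # missing labels, while reindex fills words absent from the article with NaN
    v = dc.reindex(top_w)
    v = np.array(v)
    v = np.nan_to_num(v)  # NaN (word not in this article) -> 0
    vec.append(v)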

In [7]:
vs = pd.Series(vec)

In [8]:
train["vec"] = vs

In [9]:
# set x = contents of the news
# set y = the correct "category" -> to be trained
x = np.array(vec)
y = np.array(train["category"], dtype=int)  # np.int was removed in NumPy 1.24

---
An example of a poorly performing algorithm: decision tree

In [10]:
from sklearn import tree

In [11]:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(x[1000:], y[1000:])

In [12]:
g = clf.predict(x[:1000])

In [13]:
# accuracy on the held-out first 1000 samples
np.sum(g == y[:1000])/float(len(g))

0.56000000000000005

---
Final chosen algorithm: Support Vector Machine

In [14]:
from sklearn import svm

In [15]:
# note: degree only affects the "poly" kernel; the default RBF kernel ignores it
clf2 = svm.SVC(degree = 3)
clf2 = clf2.fit(x[1000:], y[1000:])

In [16]:
g2 = clf2.predict(x[:1000])

In [17]:
# the best held-out accuracy we obtained
np.sum(g2 == y[:1000])/float(len(g2))

0.69199999999999995

In [18]:
### Other trials
# from sklearn.neighbors import KNeighborsClassifier
# from sklearn.ensemble import RandomForestClassifier
# from sklearn.ensemble import AdaBoostClassifier

In [19]:
# clf3 = RandomForestClassifier()
# clf3 = AdaBoostClassifier()
# clf3 = clf3.fit(x[1000:], y[1000:])

In [20]:
# g3 = clf3.predict(x[:1000])

In [34]:
# np.sum(g3 == y[:1000])/float(len(g3))
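
The trial-and-compare procedure used in the cells above can be written as one loop. The sketch below is an assumption-laden illustration, not the notebook's actual experiment: it runs on synthetic data from `make_classification` (not the news vectors), uses default hyperparameters, and the names `x_demo`, `y_demo`, `candidates` and `scores` are made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the news vectors (NOT the real data)
x_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)
n_holdout = 100  # hold out the first rows, as the notebook does with 1000

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "k-nearest neighbours": KNeighborsClassifier(),
    "random forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

scores = {}
for name, clf in candidates.items():
    clf.fit(x_demo[n_holdout:], y_demo[n_holdout:])
    # score() returns the mean accuracy on the given data
    scores[name] = clf.score(x_demo[:n_holdout], y_demo[:n_holdout])

# list the candidates from best to worst held-out accuracy
for name, acc in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

The ranking on synthetic data says nothing about the ranking on the news data; the point is only that every candidate is fitted and scored on the same split.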

---
### Let's apply it to the test data

In [22]:
test = pd.read_csv("data/test_sep.csv", encoding="utf-8")
# drop the leftover index column if it is present
test = test.drop(columns=["Unnamed: 0"], errors="ignore")

In [23]:
vec_test = []
for i in range(test.shape[0]):
    d = pd.Series(test["news"][i].split("\t"))
    dc = d.value_counts()
    v = dc.reindex(top_w)  # words absent from this article become NaN, set to 0 below
    v = np.array(v)
    v = np.nan_to_num(v)
    vec_test.append(v)

In [24]:
x_test = np.array(vec_test)

In [35]:
x_test.shape

(1000, 2000)

In [26]:
best_clf = svm.SVC()
best_clf = best_clf.fit(x, y)

In [27]:
g_test = best_clf.predict(x_test)

In [28]:
out = pd.DataFrame()

In [29]:
out["id"] = test["id"]

In [30]:
out["category"] = pd.Series(g_test)

In [31]:
# save the result as csv file
out.to_csv("sample_fin2.csv", index=False)

In [32]:
# how well does the final model fit the training data?
# (these rows were used for fitting, so this is not a held-out score)
g2 = best_clf.predict(x[:1000])
np.sum(g2 == y[:1000])/float(len(g2))

0.77500000000000002

In [33]:
# print out the final result
out

Unnamed: 0,id,category
0,10001,3
1,10002,6
2,10003,8
3,10004,6
4,10005,6
5,10006,7
6,10007,2
7,10008,10
8,10009,2
9,10010,10


---
### Some tests regarding different dimensions, parameters, etc.

    Here we present part of our tests of how accuracy depends on
    (1) the dimension of the vector (frequency table)
    (2) the number of training samples

Based on these results, we propose to have MORE training data next time!

<img src='http://i.imgur.com/tvsI4Cz.png'>
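
A dependence on the number of training samples, like the one tested above, can be measured with a simple sweep. The sketch below is a hedged illustration on synthetic data from `make_classification`, so its numbers are unrelated to the plot. The training sizes and the names `x_pool`, `train_sizes` and `accuracies` are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# synthetic stand-in for the news vectors (NOT the real data)
x_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=10, n_classes=3, random_state=0
)
x_val, y_val = x_demo[:100], y_demo[:100]    # fixed validation set
x_pool, y_pool = x_demo[100:], y_demo[100:]  # pool that training data is drawn from

train_sizes = [50, 100, 200, 500]
accuracies = []
for n in train_sizes:
    # fit a fresh classifier on the first n samples of the pool
    clf = SVC().fit(x_pool[:n], y_pool[:n])
    accuracies.append(clf.score(x_val, y_val))

for n, acc in zip(train_sizes, accuracies):
    print(f"{n:4d} training samples -> accuracy {acc:.3f}")
```

The same loop works for the vector dimension: rebuild the vectors for each candidate D and record the validation accuracy.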