## Libraries

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.cluster import KMeans
from sklearn import metrics

Pandas and scikit-learn are used in this section.

## Dataset

In [2]:
data = pd.read_csv('../04_merged_content/dataset.csv', sep = ';',
                   names=['name','cod','words'], index_col='name', header=0)
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 600 entries, 03000000012017_6622932_inicial_lorena.pdf to 02000012462017_8103027_peticao_inicial_0000803-08.2017.5.11.0017.pdf
Data columns (total 2 columns):
cod      600 non-null int64
words    599 non-null object
dtypes: int64(1), object(1)
memory usage: 14.1+ KB


The dataset built in stage 04 was loaded and its columns were renamed: 'Nome do Arquivo' became 'name', 'iCodDocumento' became 'cod', and 'Keyphrases' became 'words'. The 'name' column was also set as the index. The summary printed by data.info() shows that 'words' has only 599 non-null entries out of 600, so one row appears to hold a null value.
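The renaming described above relies on passing 'names' together with 'header=0', which tells pandas to discard the file's own header row and use the supplied names instead. A minimal sketch of that behaviour, assuming a hypothetical two-row sample in place of the real dataset.csv:

```python
import io
import pandas as pd

# Hypothetical sample that mimics the original column headers (not real data)
raw = io.StringIO(
    "Nome do Arquivo;iCodDocumento;Keyphrases\n"
    "a.pdf;5070;contrato trabalho\n"
    "b.pdf;5083;\n"
)

# header=0 drops the original header row; names supplies the replacements
sample = pd.read_csv(raw, sep=';', names=['name', 'cod', 'words'],
                     index_col='name', header=0)

print(sample.columns.tolist())
print(sample.words.isnull().sum())  # the empty field is read as a null value
```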

In [3]:
na = data[data.words.isnull()].index.values
data[data.words.isnull()]

| name | cod | words |
|---|---|---|
| 02000008202017_7611302_20170413_FEDERAL_SEGUROS_-_00820_2017_-_0224726-83.2011.8.04.0001_16.pdf | 5083 | NaN |


In [4]:
data.drop(na, inplace=True)
data.shape

(599, 2)

In [5]:
data.to_csv('clean_data.csv', sep = ';')

The check confirmed that exactly one row has a null 'words' value. That row was discarded because it contains no words to analyze, leaving 599 rows. The cleaned dataset was saved to a csv file as the [final and clean dataset](./clean_data.csv), so it can be used by other programs and machine learning algorithms.
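The same cleanup can also be written with 'dropna', which avoids collecting the index values first. This is an alternative sketch on a hypothetical toy frame, not the code used above:

```python
import pandas as pd

# Hypothetical toy frame with one null 'words' entry
toy = pd.DataFrame(
    {'cod': [5070, 5083, 5070], 'words': ['a b', None, 'c d']},
    index=pd.Index(['x.pdf', 'y.pdf', 'z.pdf'], name='name'),
)

# Drop only the rows whose 'words' value is null; returns a new frame
clean = toy.dropna(subset=['words'])
print(clean.shape)
```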

## Machine Learning

In [6]:
x_train, x_test, y_train, y_test = train_test_split(data.words, data.cod, test_size = 0.2, random_state=42)
print(len(x_train))
print(len(x_test))

479
120


The dataset was split into train and test rows, and its columns into 'x' and 'y', with the 'words' column as features and 'cod' as labels. The train set has 479 rows and the test set has 120.
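When one label is much rarer than the other, a plain random split can leave the test set with very few examples of the rare label. One option, which was not used in the split above, is the 'stratify' argument of train_test_split, which keeps the label proportions equal in both parts. A minimal sketch on hypothetical toy data:

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 16 documents labeled 5070 and 4 labeled 5083
x = ['doc %d' % i for i in range(20)]
y = [5070] * 16 + [5083] * 4

# stratify=y preserves the 4:1 label ratio in the train and test parts
x_tr, x_te, y_tr, y_te = train_test_split(
    x, y, test_size=0.25, random_state=42, stratify=y)

print(len(y_tr), len(y_te))
print(y_te.count(5083))
```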

In [7]:
vec = CountVectorizer()
vec_train = vec.fit_transform(x_train)
vec_train.shape

(479, 40583)

According to the [sklearn documentation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), 'CountVectorizer' converts a collection of text documents to a matrix of token counts: every distinct word becomes a column, every document becomes a row, and each cell holds the number of times that word occurs in that document. As shown above, 40583 distinct words were found and counted in the train set.
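To make the count matrix concrete, here is a small sketch on two hypothetical documents (not taken from the dataset). The column order is recovered from the fitted 'vocabulary_' mapping:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical documents
docs = ["contrato de trabalho", "contrato de seguro seguro"]

vec_demo = CountVectorizer()
counts = vec_demo.fit_transform(docs)

# vocabulary_ maps each token to its column index; sort tokens by that index
vocab = sorted(vec_demo.vocabulary_, key=vec_demo.vocabulary_.get)
print(vocab)             # one column per distinct token
print(counts.toarray())  # one row per document, cells are token counts
```

Each row sums to the number of tokens in the corresponding document, and "seguro" gets a count of 2 in the second row because it appears twice there.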

In [8]:
clf = MultinomialNB().fit(vec_train, y_train)

In [9]:
vec_test = vec.transform(x_test)
predictions = clf.predict(vec_test)
clf.score(vec_test, y_test)

0.9416666666666667

Multinomial Naive Bayes was chosen to build a supervised classification model that predicts a document's 'iCodDocumento' (cod) from the words in its keyphrases. The model reached an accuracy of about 94.2% on the test set.
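The vectorize, fit, and predict steps can also be chained in a single scikit-learn pipeline, so the same vocabulary is applied automatically at prediction time. The sketch below is an alternative formulation on hypothetical toy documents, not the code that produced the score above:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data; the labels reuse the two codes in the dataset
train_docs = [
    "contrato trabalho salario",
    "trabalho horas extras",
    "seguro apolice sinistro",
    "apolice seguro indenizacao",
]
train_labels = [5070, 5070, 5083, 5083]

# The pipeline fits the vectorizer and the classifier in one call
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

pred = model.predict(["seguro apolice"])
print(pred)
```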

## Validation

In [10]:
print(metrics.classification_report(y_test, predictions))
print(metrics.confusion_matrix(y_test, predictions))

             precision    recall  f1-score   support

       5070       0.96      0.97      0.97       109
       5083       0.70      0.64      0.67        11

avg / total       0.94      0.94      0.94       120

[[106   3]
 [  4   7]]


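Some additional metrics were calculated to give a better understanding of the model. As shown, the '5070' label achieved high values (precision 0.96, recall 0.97), while the '5083' label achieved noticeably lower ones (precision 0.70, recall 0.64). This discrepancy was probably caused by the low number of '5083' labels relative to '5070' labels (11 versus 109 in the test set), which suggests that a larger number of documents with the '5083' label could improve the model. The same imbalance means that always predicting '5070' would already score 109/120, about 90.8%, so the overall accuracy should be read alongside the per-label metrics.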
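The values in the classification report can be reproduced by hand from the confusion matrix printed above, where rows are the true labels and columns are the predicted labels. A short sketch of that arithmetic, using only the four numbers from the matrix:

```python
import numpy as np

# Confusion matrix from the output above: rows = true, columns = predicted
cm = np.array([[106, 3],
               [4, 7]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all
precision = np.diag(cm) / cm.sum(axis=0)    # correct / predicted as label
recall = np.diag(cm) / cm.sum(axis=1)       # correct / truly of label
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of always predicting the majority label '5070'
baseline = cm.sum(axis=1).max() / cm.sum()

for label, p, r, f in zip([5070, 5083], precision, recall, f1):
    print(label, round(p, 2), round(r, 2), round(f, 2))
print(round(accuracy, 4), round(baseline, 4))
```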