# Preprocessing
## Manual tokenization
#### Removal of symbols and punctuation: [ ] ) ( @ # $ | , ! . : ; ?
#### Words with an internal hyphen were preserved
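
A minimal sketch of this tokenization step, using the character class that appears in the notebook's parsing code. The function name `tokenize` and the sample sentence are illustrative assumptions, not part of the original notebook.

```python
import re

# Punctuation class taken from the notebook's parsing code; the hyphen is
# deliberately absent, so hyphenated words survive as single tokens.
PUNCTUATION = re.compile(r'[)(@#$|,!.:;?]')

def tokenize(text):
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    cleaned = PUNCTUATION.sub(' ', text.lower())
    # A hyphen surrounded by spaces is not part of a word; drop it as noise.
    return [token for token in cleaned.split() if token != '-']

# Hypothetical sample sentence, for illustration only.
print(tokenize("O guarda-chuva (novo) custa R$ 30, certo? Sim - claro!"))
```

Note that `guarda-chuva` stays a single token, while the stand-alone hyphen is discarded.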

## Stop Word Removal
#### Using the NLTK stop-word corpus for Portuguese: nltk.corpus.stopwords.words('portuguese')
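
The filtering itself can be sketched without the NLTK data installed. In the sketch below, `remove_stop_words` and `SAMPLE_STOP_WORDS` are hypothetical names, and the short hand-picked list only stands in for the full NLTK Portuguese list that the notebook uses.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens whose lowercase form is not a stop word."""
    # A set gives constant-time membership tests.
    stop_set = set(word.lower() for word in stop_words)
    return [token for token in tokens if token.lower() not in stop_set]

# Assumption: a tiny hand-picked sample, not the actual NLTK list.
SAMPLE_STOP_WORDS = ['o', 'a', 'de', 'que', 'e', 'para']

print(remove_stop_words(['O', 'preço', 'de', 'mercado', 'e', 'a', 'bolsa'],
                        SAMPLE_STOP_WORDS))
```

Passing `nltk.corpus.stopwords.words('portuguese')` as the second argument should give the behaviour described above, assuming the corpus has been downloaded.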

## Stemming
#### Using the NLTK stemmer for Portuguese: nltk.stem.RSLPStemmer()


# Classifiers Used
## Naive Bayes
## Maximum Entropy


# Features
## Total: 2κ features
#### The κ most frequent word stems (κ features)
#### The frequency of each of those κ stems (κ features)
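
As a rough illustration of the 2κ-feature layout, the sketch below builds the same `w<i>` / `f<i>` dictionary with `collections.Counter` instead of `nltk.FreqDist`. The name `get_features` and the sample stems are assumptions made for this example; the stems are not real RSLP output.

```python
from collections import Counter

def get_features(stems, k):
    """Build up to 2k features: the k most frequent stems and their counts."""
    features = {}
    for rank, (stem, count) in enumerate(Counter(stems).most_common(k)):
        features['w%d' % rank] = stem    # rank-th most frequent stem
        features['f%d' % rank] = count   # number of times it occurs
    return features

# Hypothetical stems for one news text.
stems = ['merc', 'alta', 'merc', 'bols', 'merc', 'alta']
print(get_features(stems, 2))
```

With κ = 2 this yields four features: `w0`, `f0`, `w1` and `f1`.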


# Difficulties
## The techniques used target Portuguese, but the sample also contains texts in English and texts that are almost entirely numeric, which introduces a lot of noise.


# Results (accuracy on the test set)
## κ = 1
#### Naive Bayes: 0.407673860911
#### Maximum Entropy: 0.428057553957

## κ = 2
#### Naive Bayes: 0.468824940048
#### Maximum Entropy: 0.44964028777

## κ = 5
#### Naive Bayes: 0.465227817746
#### Maximum Entropy: 0.488009592326

## κ = 10
#### Naive Bayes: 0.47721822542
#### Maximum Entropy: 0.509592326139

## κ = 20
#### Naive Bayes: 0.448441247002
#### Maximum Entropy: 0.494004796163
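
To compare the runs at a glance, the accuracies listed above can be put in a small dictionary and searched for the best κ per classifier. This is only a convenience sketch; the names `results` and `best_k` are not part of the original notebook.

```python
# Accuracies copied from the results listed above, keyed by kappa.
results = {
    1:  {'Naive Bayes': 0.407673860911, 'Maximum Entropy': 0.428057553957},
    2:  {'Naive Bayes': 0.468824940048, 'Maximum Entropy': 0.44964028777},
    5:  {'Naive Bayes': 0.465227817746, 'Maximum Entropy': 0.488009592326},
    10: {'Naive Bayes': 0.47721822542,  'Maximum Entropy': 0.509592326139},
    20: {'Naive Bayes': 0.448441247002, 'Maximum Entropy': 0.494004796163},
}

def best_k(results, classifier):
    """Return the kappa with the highest accuracy for the given classifier."""
    return max(results, key=lambda k: results[k][classifier])

for name in ('Naive Bayes', 'Maximum Entropy'):
    k = best_k(results, name)
    print('%s: best k = %d (accuracy %.3f)' % (name, k, results[k][name]))
```

Both classifiers peak at κ = 10 and lose accuracy at κ = 20.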


# Conclusions
## The classifiers are sensitive to the number of features.

## Dropping the word-frequency features drastically reduced classifier accuracy (from roughly 50% to roughly 20%).

## Adding the second most frequent word has a large impact; further words matter much less.

## The κ-most-frequent-words feature improves accuracy as κ grows only up to the point where the number of distinct words in a text falls below κ. Beyond that point the texts' vocabulary is not diverse enough to supply the expected κ stems (in those cases, all the stems obtained were used).
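
One way to check this saturation effect is to measure how many texts have fewer than κ distinct stems. The sketch below assumes a hypothetical helper `share_with_small_vocabulary` and made-up token lists; in the notebook it would be applied to the stemmed `texts` list.

```python
def share_with_small_vocabulary(texts, k):
    """Fraction of texts whose number of distinct tokens is below k."""
    if not texts:
        return 0.0
    short = sum(1 for tokens in texts if len(set(tokens)) < k)
    return short / float(len(texts))

# Hypothetical stemmed texts with 2, 1 and 4 distinct stems.
sample = [['merc', 'alta', 'merc'], ['gol'], ['bols', 'qued', 'dol', 'alta']]
for k in (1, 2, 5):
    print(k, share_with_small_vocabulary(sample, k))
```

A share that grows quickly with κ would suggest that larger values of κ add few new features for most texts.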

In [6]:
# =================================================================================================
#     Text Parsing and Tokenizing
# =================================================================================================
import re

NOT_FOUND = -1

# Parallel lists: categories[i] is the category of the news item in texts[i]
categories = []
texts = []

# Reads the file all at once
with open("news_data.xml", 'r') as newsFile:
    rawText = newsFile.read()

# Keyword length
CATEGORY_LENGTH = len('category="')
TEXT_LENGTH = len("<text>")

# A cursor to read through the file
textPointer = 0                                         # Starts at the first character of file
textPointer = rawText.find("category=", textPointer)    # Searches for the next category

# Iterates through the text searching for categories and non-empty texts
# Stores the results in the lists "categories" and "texts"
while textPointer != NOT_FOUND:
    # Caches the new category
    textPointer += CATEGORY_LENGTH
    categoryEnd = rawText.find('"', textPointer)
    currentCategory = rawText[textPointer : categoryEnd]

    # Searches for the text
    textPointer = rawText.find("<text>", categoryEnd) + TEXT_LENGTH
    textEnd = rawText.find("</text>", textPointer)
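    # Decodes before lowercasing so that accented characters are lowercased too
    currentText = rawText[textPointer : textEnd].decode("utf-8").lower()

    ## Tokenizes
    # Replaces punctuation with spaces (hyphens are kept)
    currentText = re.sub(r'[)(@#$|,!.:;?]', ' ', currentText)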

    # Converts to a list of tokens and drops stand-alone hyphens (noise)
    currentText = [token for token in currentText.split() if token != '-']
    if currentText:
        # Stores results
        categories.append(currentCategory)
        texts.append(currentText)

    # Prepares for next cycle
    textPointer = rawText.find("category=", textPointer)



# =================================================================================================
#     Stop Word Removal and Stemming
# =================================================================================================
import nltk

# N_TOTAL = int(0.2 * len(texts))
N_TOTAL = len(texts)

# Stop Words (compared case-insensitively; a set keeps each lookup cheap)
stopWords = set(word.lower() for word in nltk.corpus.stopwords.words('portuguese'))
for i in range(N_TOTAL):
    texts[i] = [word for word in texts[i] if word.lower() not in stopWords]

# Stemming
ptStemmer = nltk.stem.RSLPStemmer()
for i in range(N_TOTAL):
    texts[i] = [ptStemmer.stem(word) for word in texts[i]]



# =================================================================================================
#     Defining Features
# =================================================================================================
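# Indices into the [word, frequency] pairs built by getFeatures
WORD_COLUMN = 0
FREQUENCY_COLUMN = 1

# kappa: number of most frequent stems used per text (2 * kappa features in total)
FEATURE_COUNT = 1

# The first half of the data is used for training, the rest for testing
N_FIT = int(0.5 * N_TOTAL)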

def getFeatures(text):
    fd = nltk.FreqDist(text)

    # [word, frequency] pairs sorted by frequency (highest first),
    # truncated to the FEATURE_COUNT most frequent words
    wordFreq = sorted(([word, fd[word]] for word in fd),
                      key=lambda pair: -pair[FREQUENCY_COLUMN])[:FEATURE_COUNT]

    # A text with fewer distinct words than FEATURE_COUNT yields fewer features
    features = {}
    for i, pair in enumerate(wordFreq):
        features['w%d' % i] = pair[WORD_COLUMN]
        features['f%d' % i] = pair[FREQUENCY_COLUMN]

    return features

# Builds one (features, category) pair per text
featureSets = [(getFeatures(text), category) for (text, category) in zip(texts, categories)]

# Splits in file order (no shuffling): the first N_FIT items train, the rest test
trainSet, testSet = featureSets[:N_FIT], featureSets[N_FIT:]

naiveBayes = nltk.NaiveBayesClassifier.train(trainSet)
print(nltk.classify.accuracy(naiveBayes, testSet))

maxEntropy = nltk.MaxentClassifier.train(trainSet)
print(nltk.classify.accuracy(maxEntropy, testSet))

0.407673860911
  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -2.63906        0.018
             2          -1.27541        0.659
             3          -1.00982        0.751
             4          -0.85193        0.759
             5          -0.75088        0.765
             6          -0.68183        0.769
             7          -0.63193        0.777
             8          -0.59426        0.779
             9          -0.56480        0.785
            10          -0.54111        0.785
            11          -0.52164        0.787
            12          -0.50532        0.787
            13          -0.49143        0.785
            14          -0.47946        0.787
            15          -0.46903        0.787
            16          -0.45985        0.785
            17          -0.45169        0.785
            18          -0.44440        0.785
            19          -0.43784        0.787
            20          -0.43190        0.788
            21          -0.42649        0.788
            22          -0.42155        0.788
            23          -0.41700        0.788
            24          -0.41282        0.788
            25          -0.40894        0.787
            26          -0.40534        0.787
            27          -0.40199        0.788
            28          -0.39886        0.788
            29          -0.39592        0.788
            30          -0.39317        0.788
            31          -0.39058        0.788
            32          -0.38814        0.789
            33          -0.38584        0.789
            34          -0.38366        0.790
            35          -0.38159        0.790
            36          -0.37962        0.790
            37          -0.37776        0.790
            38          -0.37598        0.790
            39          -0.37428        0.791
            40          -0.37266        0.791
            41          -0.37111        0.791
            42          -0.36963        0.791
            43          -0.36821        0.793
            44          -0.36685        0.791
            45          -0.36554        0.791
            46          -0.36428        0.791
            47          -0.36307        0.791
            48          -0.36191        0.791
            49          -0.36079        0.791
            50          -0.35970        0.791
            51          -0.35866        0.791
            52          -0.35765        0.791
            53          -0.35667        0.791
            54          -0.35573        0.791
            55          -0.35482        0.791
            56          -0.35393        0.791
            57          -0.35307        0.791
            58          -0.35224        0.791
            59          -0.35144        0.791
            60          -0.35065        0.791
            61          -0.34989        0.791
            62          -0.34915        0.791
            63          -0.34843        0.793
            64          -0.34773        0.793
            65          -0.34705        0.793
            66          -0.34639        0.793
            67          -0.34574        0.793
            68          -0.34512        0.793
            69          -0.34450        0.793
            70          -0.34390        0.793
            71          -0.34332        0.793
            72          -0.34275        0.793
            73          -0.34220        0.793
            74          -0.34165        0.793
            75          -0.34112        0.793
            76          -0.34060        0.793
            77          -0.34010        0.793
            78          -0.33960        0.793
            79          -0.33912        0.793
            80          -0.33864        0.793
            81          -0.33818        0.793
            82          -0.33772        0.791
            83          -0.33728        0.791
            84          -0.33684        0.791
            85          -0.33641        0.791
            86          -0.33599        0.791
            87          -0.33558        0.791
            88          -0.33518        0.791
            89          -0.33479        0.793
            90          -0.33440        0.793
            91          -0.33402        0.793
            92          -0.33364        0.793
            93          -0.33328        0.793
            94          -0.33292        0.793
            95          -0.33256        0.793
            96          -0.33222        0.793
            97          -0.33188        0.793
            98          -0.33154        0.793
            99          -0.33121        0.793
         Final          -0.33089        0.793
0.428057553957
