In [1]:
import nltk
import json
import random

## Loading the data

Starting point: a .json file with a list of awarded projects. Each entry is a [description, company] pair.

In [2]:
with open('../tmp/devex-corpus.json') as corpus_file:
    corpus = json.load(corpus_file)

In [3]:
print(type(corpus))
print(corpus[0])

<type 'list'>
[u"GE-Tbilisi: Legislative impact assessment, drafting and representationService contract award noticeGeorgia1.Publication reference:EuropeAid/135035/DH/SER/GE.2.Publication date of the contract notice:30.4.2014.3.Lot number and lot title:1 lot only.4.Contract number and value:ENPI/2015/357121, 1 713 780 EUR.5.Date of award of the contract:23.1.2015.6.Number of tenders received:6.7.Overall score of chosen tender:99,54.8.Name, address and nationality of successful tenderer:IBF International Consulting SA, avenue Louise 209A, 1050 Brussels, BELGIUM.Center for International Legal Cooperation (CILC), Koninginnegracht 7, 2514 AA The Hague, NETHERLANDS.Deutsche Stiftung f\u0171r internationale rechtliche Zusammenarbeit e.V. (IRZ), Ubierstra\xdfe 92, 53173 Bonn, GERMANY.ISO code of country: BE.9.Duration of contract:30 months.10.Contracting authority:European Union, represented by the European Commission, on behalf of and for the account of the partner country, Tbilisi, GEORGIA.

## Preparing the data

First, we need to strip the HTML from the descriptions.

In [4]:
from bs4 import BeautifulSoup

corpus = [(BeautifulSoup(description, 'html.parser').get_text(), company) \
            for description, company in corpus ]

Next, we convert the descriptions into lists of words.

In [5]:
descriptions = []
for description, company in corpus:
    words = description.encode('utf-8').split()
    descriptions.append((words, company))

We define two important functions, **words_filter** and **words_normalizer**:
- **words_filter**: takes a list of words and returns only those we consider useful for classification.
- **words_normalizer**: normalizes the words (e.g. lowercasing, stemming/lemmatization (keeping only the root), etc.). The version below only lowercases and strips punctuation.

In [6]:
def words_filter(words):
    return [w for w in words if len(w) > 4]

In [7]:
def words_normalizer(words):
    normalizer_functions = []
    # convert words to lowercase
    normalizer_functions.append(lambda w: w.lower())
    # remove all these punctuation signs
    for sign in ['.', ',', ';', ':', '!', '?', ')', '(', '[', ']']:
        # str.replace already removes every occurrence, so no recursion is needed;
        # sign=sign binds the current sign to each lambda
        normalizer_functions.append(lambda w, sign=sign: w.replace(sign, ''))
    # apply every normalizer, in order, to every word
    for fn in normalizer_functions:
        words = [fn(w) for w in words]
    return words

In [8]:
print(words_normalizer(["duration:", "189", "days"]))

['duration', '189', 'days']
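
The same normalization can also be done in a single pass per word with a regular expression. The following is a minimal sketch, not part of the original notebook: the names `normalize_words` and `PUNCTUATION` are hypothetical, and the pattern simply mirrors the list of signs used above.

```python
import re

# same signs that words_normalizer removes: . , ; : ! ? ) ( [ ]
PUNCTUATION = re.compile(r'[.,;:!?()\[\]]')

def normalize_words(words):
    """Lowercase each word and drop the punctuation signs listed above."""
    return [PUNCTUATION.sub('', w.lower()) for w in words]

print(normalize_words(["Duration:", "189", "days"]))
```

Either version can be passed wherever a normalizer function is expected, since both take a list of words and return a list of words.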


From the descriptions we need to extract a list of characteristic words ("word features"). It is simply the full list of distinct words that appear in our descriptions, after normalizing and filtering. We use the following functions:

In [9]:
def get_words_in_descriptions(descriptions, words_normalizer, words_filter):
    """
    Given a list of descriptions [([description words], "company name")], join all the
    words into a single list and return it.
    """
    all_words = []
    for (words, company) in descriptions:
        all_words.extend(words_filter(words_normalizer(words)))
    return all_words

In [10]:
def get_word_features(wordlist):
    """Given a list of all the words, return a (dict_keys) list of unique words."""
    wordlist = nltk.FreqDist(wordlist)
    word_features = wordlist.keys()
    return word_features

(A **FreqDist** is simply a count of how many times each word appears.)
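
To make that concrete, here is a small sketch that uses `collections.Counter` from the standard library as a stand-in. In NLTK 3, `FreqDist` is a subclass of `Counter`, so `most_common` and `keys` behave the same way. The word list is made up for illustration.

```python
from collections import Counter

# toy word list, for illustration only
toy_words = ['contract', 'duration', 'contract', 'location', 'contract', 'duration']
counts = Counter(toy_words)

# most_common(n) returns the n most frequent words with their counts
print(counts.most_common(2))  # [('contract', 3), ('duration', 2)]

# the unique words are the keys, which is what get_word_features returns
print(sorted(counts.keys()))  # ['contract', 'duration', 'location']
```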

In [11]:
wordlist = nltk.FreqDist(get_words_in_descriptions(descriptions, words_normalizer, words_filter))
wordlist.most_common(10)

[('contract', 2291),
 ('duration', 1709),
 ('location', 1514),
 ('action', 1446),
 ('theme', 1254),
 ('modification', 1253),
 ('services', 866),
 ('procurement', 836),
 ('management', 743),
 ('country', 718)]

In [12]:
word_features = get_word_features(get_words_in_descriptions(descriptions, words_normalizer, words_filter))
print(word_features)



## Creating the classifier

Our classifier will be a black box. We give it a description and it returns a classification (the company most likely to win that contract).

Our classifier needs a training set. After shuffling, we use the first half of the descriptions (1250) as the training set and the next quarter (625) as the testing set.

In [13]:
# it is important to have homogeneous training and testing groups (similar proportions
# of each company); shuffling before splitting approximates this
random.seed(54965)
random.shuffle(descriptions)

In [14]:
# integer division keeps the slice indices as ints
middle = len(descriptions) // 2
training_set = descriptions[:middle]
testing_set = descriptions[middle:middle + middle // 2]
print(len(descriptions))
print(len(training_set))
print(len(testing_set))

2500
1250
625
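
A random shuffle only gives similar company proportions on average. If exact proportions matter, one option is to split each company's contracts separately. The sketch below is an illustration, not what the notebook does: `stratified_split` is a hypothetical helper, and the toy data is made up.

```python
import random
from collections import defaultdict

def stratified_split(labeled_items, train_fraction=0.5, seed=54965):
    """Split (item, label) pairs so each label keeps roughly the same share in both parts."""
    # group the pairs by label
    by_label = defaultdict(list)
    for item, label in labeled_items:
        by_label[label].append((item, label))
    rng = random.Random(seed)
    train, test = [], []
    # split every group on its own, then merge the pieces
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        cut = int(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# toy data: 6 contracts for company 'A' and 4 for company 'B'
toy = [(['w%d' % i], 'A') for i in range(6)] + [(['w%d' % i], 'B') for i in range(4)]
train, test = stratified_split(toy)
print(len(train), len(test))
```

Note that with this sketch a company with a single contract ends up only in the test part, so very rare companies would need special handling.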


We cannot feed our descriptions to the classifier directly. We need a function (a *feature extractor*) that extracts the features from a description. This function takes just the description and returns our *word features*, indicating whether each one is relevant (**the most basic approach: whether or not it appears in the description**).

In [15]:
def extract_features(description):
    """
    Feature extractor. Given a description (as a list of words), returns our list of word features,
    indicating whether each word is relevant for the description.
    """
    # remove duplicates
    description_words = set(description)
    features = {}
    # word_features is defined outside
    for word in word_features:
        features['contains({})'.format(word)] = (word in description_words)
    return features

In [16]:
# normalize first, then filter: the same order used to build word_features
mod_extract_features = lambda description: extract_features(words_filter(words_normalizer(description)))

In [17]:
mod_extract_features('this is a project made in indonesia'.split())

{'contains(corporate)': False,
 'contains(waste)': False,
 'contains(number16)': False,
 'contains(familiarmente)': False,
 'contains(cocris)': False,
 'contains(their)': False,
 'contains(48410total)': False,
 'contains(2010-08-23)': False,
 'contains(economically)': False,
 'contains(estrada)': False,
 'contains(remained)': False,
 'contains(vehicule)': False,
 'contains(marcos)': False,
 'contains(dgpsa)': False,
 'contains(responsabilities)': False,
 'contains(philanthropic)': False,
 'contains(1999319)': False,
 'contains(cuenca)': False,
 'contains(cambodia)': False,
 'contains(discrimination)': False,
 'contains(415677)': False,
 'contains(regulationthe)': False,
 'contains(raising)': False,
 'contains(tianjin)': False,
 'contains(projects)': False,
 'contains(noticecontract)': False,
 'contains(idanoa)': False,
 'contains(incedencia)': False,
 'contains(madhepura)': False,
 'contains(10/2015)': False,
 'contains(victimas)': False,
 'contains(elections)': False,
 'contains(emisi

We use this function to build the training and testing sets, converting the descriptions into feature sets that the classifier understands:

In [18]:
training_set = nltk.classify.apply_features(mod_extract_features, training_set)
testing_set = nltk.classify.apply_features(mod_extract_features, testing_set)

Finally, we create the classifier, using our training set to train it:

In [19]:
classifier = nltk.NaiveBayesClassifier.train(training_set)

In [20]:
classifier.show_most_informative_features()

Most Informative Features
        contains(method) = True           Indivi : IBF In =     76.9 : 1.0
        contains(amount) = True           Indivi : Moore  =     76.5 : 1.0
      contains(duration) = True           IBF In : Chemon =     75.1 : 1.0
         contains(admin) = True           Chemon : HTSPE  =     72.1 : 1.0
   contains(description) = True           Indivi : IBF In =     69.8 : 1.0
      contains(location) = True           AGRECO : Chemon =     67.7 : 1.0
         contains(total) = True             UNDP : IBF In =     67.4 : 1.0
       contains(funding) = True           Chemon : Transt =     65.7 : 1.0
         contains(order) = True           Chemon : Indivi =     62.5 : 1.0
   contains(procurement) = True           Indivi : AGRECO =     62.5 : 1.0
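
Each row reports how much more likely the feature value is for the first company than for the second (the company names appear truncated). For example, `contains(method) = True` is about 76.9 times more likely for the first label than for the second. The sketch below shows how such a ratio can be computed from raw counts. It is an illustration under assumptions, not NLTK's code: `smoothed_prob` and `likelihood_ratio` are hypothetical helpers, the add-0.5 smoothing over two bins is an assumption, and the counts are made up.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Smoothed estimate of P(feature value | label); gamma avoids zero probabilities."""
    return (count + gamma) / (total + gamma * bins)

def likelihood_ratio(count_a, total_a, count_b, total_b):
    """How many times more likely the feature value is under label A than under label B."""
    return smoothed_prob(count_a, total_a) / smoothed_prob(count_b, total_b)

# made-up counts: the word appears in 30 of 40 contracts of company A
# and in 0 of 60 contracts of company B
ratio = likelihood_ratio(30, 40, 0, 60)
print(round(ratio, 1))
```

Without smoothing, the second probability would be zero and the ratio undefined, which is why some form of smoothing is needed for words never seen with a given company.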


## Application of the classifier

Now we can try out the classifier. Let's classify a single description:

In [21]:
"The first contract was awarded to the company {}".format(testing_set[0][1])

'The first contract was awarded to the company UNICEF'

Let's check whether our classifier gets it right:

In [22]:
classifier.classify(testing_set[0][0])

u'UNICEF'

Hmm! But what about the rest? Let's write a function to check the hits across the **whole** testing set:

In [23]:
def check_accuracy(testing_set, classifier):
    """Given a testing set and a classifier, classify every case and return
    (hits, cases, accuracy), with accuracy as a fraction rounded to 2 decimals."""
    hits = 0
    cases = 0
    for description, company in testing_set:
        cases += 1
        if classifier.classify(description) == company:
            hits += 1
    return (hits, cases, round(float(hits)/float(cases), 2))

In [24]:
# evaluate only the first 100 cases of the testing set
hits, cases, accuracy = check_accuracy(testing_set[:100], classifier)
print("Our classifier is right in {}% of the cases ({} of {})".format(accuracy*100, hits, cases))

Our classifier is right in 39.0% of the cases (39 of 100)
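
To put an accuracy figure like this in context, it can help to compare it with a trivial baseline that always predicts the most frequent company. The following is a minimal sketch with made-up labels: `majority_baseline` is a hypothetical helper, and in the notebook the labels would come from the second element of each (words, company) pair.

```python
from collections import Counter

def majority_baseline(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return most_common_label, hits / float(len(test_labels))

# made-up labels, for illustration only
train_labels = ['UNICEF', 'UNDP', 'UNICEF', 'AGRECO', 'UNICEF']
test_labels = ['UNICEF', 'UNDP', 'UNICEF', 'AGRECO']
label, baseline_accuracy = majority_baseline(train_labels, test_labels)
print(label, baseline_accuracy)
```

A classifier is only adding value to the extent that it beats this baseline on the same testing cases.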
