# NLTK

Download corpora and models.

In [None]:
import nltk
nltk.download()
# install the gutenberg corpus and the punkt model (tokenizer and sentence segmenter)

Alternatively, download only the required resources:

In [None]:
import nltk
nltk.download('punkt')
nltk.download('gutenberg')

In [None]:
from nltk.corpus import gutenberg
gutenberg.fileids()

In [None]:
gutenberg.sents('austen-emma.txt')

# Basic Statistics

Basic version with dictionaries:

In [None]:
count = {}

for sent in gutenberg.sents('austen-emma.txt'):
    for word in sent:
        if word in count:
            count[word] += 1
        else:
            count[word] = 1
count

Improved version with defaultdicts:

In [None]:
from collections import defaultdict

count = defaultdict(int)

for sent in gutenberg.sents('austen-emma.txt'):
    for word in sent:
        count[word] += 1

In [None]:
print('10 most frequent words:', sorted(count.items(), key=lambda x: -x[1])[:10])
print('Vocabulary:', len(count))
print('Tokens:', sum(count.values()))

Version using the Counter class:

In [None]:
from collections import Counter

count = Counter()

for sent in gutenberg.sents('austen-emma.txt'):
    count.update(sent)

In [None]:
print('10 most frequent words:', count.most_common(10))
print('Vocabulary:', len(count))
print('Tokens:', sum(count.values()))
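
As a quick sanity check, the three counting approaches above should yield exactly the same counts. The sketch below verifies this on a small hand-made list of tokenized sentences; `sents` is made-up data standing in for `gutenberg.sents(...)`, so it runs without downloading any corpus.

```python
from collections import Counter, defaultdict

# Hypothetical toy data standing in for gutenberg.sents('austen-emma.txt').
sents = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat', '.'],
    ['the', 'dog', 'sat', '.'],
]

# 1. Plain dictionary: get() supplies 0 for words not seen yet.
count_dict = {}
for sent in sents:
    for word in sent:
        count_dict[word] = count_dict.get(word, 0) + 1

# 2. defaultdict(int): missing keys are created with value 0.
count_default = defaultdict(int)
for sent in sents:
    for word in sent:
        count_default[word] += 1

# 3. Counter: update() adds the counts of a whole sentence at once.
count_counter = Counter()
for sent in sents:
    count_counter.update(sent)

# All three hold the same word -> frequency mapping.
print(count_dict == dict(count_default) == dict(count_counter))
print('Most frequent word:', count_counter.most_common(1))
print('Vocabulary:', len(count_counter))
print('Tokens:', sum(count_counter.values()))
```

Counter is the shortest of the three and also provides `most_common`, which is why the last version above does not need an explicit sort.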

# Plain Text Corpus

- http://www.nltk.org/api/nltk.corpus.reader.html#nltk.corpus.reader.plaintext.PlaintextCorpusReader
- http://www.nltk.org/book/ch02.html

First, create a file named example.txt with this content: "Estimados Sr. y sra. Gómez. Se los cita por el art. 32 de la ley 21.234."
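
The file can also be created from Python instead of by hand. This is a minimal sketch that assumes the notebook's working directory is where the corpus reader will look for the file, and that UTF-8 is the encoding we want for the accented character.

```python
from pathlib import Path

# Sample text from the sentence above; the file name must match the one
# later passed to PlaintextCorpusReader.
text = 'Estimados Sr. y sra. Gómez. Se los cita por el art. 32 de la ley 21.234.'

# Write with an explicit encoding so that 'ó' is stored as UTF-8
# regardless of the platform default.
Path('example.txt').write_text(text, encoding='utf-8')
```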

In [2]:
from nltk.corpus import PlaintextCorpusReader

help(PlaintextCorpusReader)

LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - '/home/pedro/nltk_data'
    - '/home/pedro/.virtualenvs/pln/nltk_data'
    - '/home/pedro/.virtualenvs/pln/share/nltk_data'
    - '/home/pedro/.virtualenvs/pln/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************


In [None]:
corpus = PlaintextCorpusReader('.', 'example.txt')

In [1]:
list(corpus.sents())


# Tokenization

- http://www.nltk.org/api/nltk.tokenize.html#nltk.tokenize.regexp.RegexpTokenizer
- http://www.nltk.org/book/ch03.html#regular-expressions-for-tokenizing-text

We start from a tokenizing regular expression given in the NLTK documentation:

In [None]:
pattern = r'''(?x)         # set flag to allow verbose regexps
     (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
   | \w+(?:-\w+)*          # words with optional internal hyphens
   | \$?\d+(?:\.\d+)?%?    # currency and percentages, e.g. $12.40, 82%
   | \.\.\.                # ellipsis
   | [][.,;"'?():_`-]      # these are separate tokens; includes ], [ ('-' goes last so it is literal, not a range)
'''

Let's try it:

In [None]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(pattern)

corpus = PlaintextCorpusReader('.', 'example.txt', word_tokenizer=tokenizer)
list(corpus.sents())

We can see that it mis-tokenizes every abbreviation ("Sr.", "sra." and "art." are split from their periods) and breaks the number "21.234" into three tokens.
We improve the regular expression and try again:

In [None]:
pattern = r'''(?x)               # set flag to allow verbose regexps
     (?:\d{1,3}(?:\.\d{3})+)     # numbers with '.' as thousands separator
   | (?:[Ss]r\.|[Ss]ra\.|art\.)  # common Spanish abbreviations
   | (?:[A-Z]\.)+                # abbreviations, e.g. U.S.A.
   | \w+(?:-\w+)*                # words with optional internal hyphens
   | \$?\d+(?:\.\d+)?%?          # currency and percentages, e.g. $12.40, 82%
   | \.\.\.                      # ellipsis
   | [][.,;"'?():_`-]            # these are separate tokens; includes ], [ ('-' goes last so it is literal, not a range)
'''
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer(pattern)

corpus = PlaintextCorpusReader('.', 'example.txt', word_tokenizer=tokenizer)
list(corpus.sents())

Now it tokenizes correctly!
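
The difference between the two patterns can also be checked without a corpus reader or any NLTK data. The sketch below applies both with the standard `re` module, on the assumption that `re.findall` is a fair stand-in for what `RegexpTokenizer` does with a pattern that contains no capturing groups. The names `common`, `extra`, `original` and `improved` are only illustrative.

```python
import re

text = 'Estimados Sr. y sra. Gómez. Se los cita por el art. 32 de la ley 21.234.'

# Alternatives shared by both patterns.
common = r'''
     (?:[A-Z]\.)+          # abbreviations, e.g. U.S.A.
   | \w+(?:-\w+)*          # words with optional internal hyphens
   | \$?\d+(?:\.\d+)?%?    # currency and percentages
   | \.\.\.                # ellipsis
   | [][.,;"'?():_`-]      # punctuation as separate tokens
'''

# Extra alternatives; they must come first so they win over \w+.
extra = r'''
     (?:\d{1,3}(?:\.\d{3})+)       # numbers with '.' as thousands separator
   | (?:[Ss]r\.|[Ss]ra\.|art\.)    # common Spanish abbreviations
   |
'''

original = re.compile(common, re.VERBOSE)
improved = re.compile(extra + common, re.VERBOSE)

print(original.findall(text))
print(improved.findall(text))
```

The order of the alternatives matters: the regex engine takes the first alternative that matches at each position, so if the new rules were appended at the end, `\w+` would still consume "Sr" and "21" before they were ever tried.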

(Sentence segmentation is still wrong, but fixing that is beyond the scope of this class.)