Source: http://nbviewer.jupyter.org/github/jdwittenauer/ipython-notebooks/blob/master/notebooks/misc/LanguageVectors.ipynb

### Dataset

1) Download the corpus (if it is not already stored locally)

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn
import nltk
# nltk.download('reuters'); nltk.download('stopwords')  # run once if the corpora are not installed

2) The Reuters corpus is a collection of more than 10,000 news documents published in 1987, categorized into 90 different topics.

In [2]:
from nltk.corpus import reuters
len(reuters.words())

1720901

3) Identifying the unique words, i.e. the vocabulary

In [None]:
vocabulary = set(reuters.words())
len(vocabulary)

4) Frequency distribution of the corpus

In [None]:
fdist = nltk.FreqDist(reuters.words())
print(fdist)

In [None]:
fdist.most_common(10)

5) Cumulative frequency plot

In [None]:
fig, ax = plt.subplots(figsize=(16,12))
ax = fdist.plot(20, cumulative=True)
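
The frequency distribution and its cumulative form can also be sketched without NLTK. The example below is only an illustration on a made-up token list (the tokens are an assumption, not taken from Reuters); it relies on the fact that `nltk.FreqDist` is a subclass of `collections.Counter`.

```python
from collections import Counter
from itertools import accumulate

# Toy token list (hypothetical; not drawn from the Reuters corpus).
tokens = ["the", "oil", "price", "the", "oil", "the", "bank", "."]

# Count occurrences of each token, as FreqDist does.
counts = Counter(tokens)
top = counts.most_common(3)

# Running total over tokens ranked by frequency: the quantity that a
# cumulative frequency plot displays on its y-axis.
ranked = counts.most_common()
cumulative = list(accumulate(count for _, count in ranked))

print(top)
print(cumulative)
```

The last cumulative value always equals the total number of tokens, which is a quick sanity check on the plot.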

### Cleaning

- Conversion to lowercase
- Removal of punctuation marks
- Removal of "stop words"

In [None]:
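# words() with no argument returns the stop words of every language;
# use only the English list, stored as a set for fast membership tests
stopwords = set(nltk.corpus.stopwords.words('english'))
cleansed_words = [w.lower() for w in reuters.words() if w.isalnum() and w.lower() not in stopwords]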
vocabulary = set(cleansed_words)
len(vocabulary)
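
To see what the three cleaning steps do to individual tokens, here is a minimal sketch on a toy sentence. The stop-word set `STOP_WORDS` and the helper `clean` are hypothetical names introduced only for this illustration; the notebook itself uses NLTK's stop-word lists.

```python
# Hypothetical mini stop-word set, for illustration only.
STOP_WORDS = {"the", "of", "in", "a"}

def clean(tokens):
    """Lowercase tokens, drop non-alphanumeric tokens, drop stop words."""
    cleaned = []
    for tok in tokens:
        low = tok.lower()
        # isalnum() is False for punctuation such as "," or "."
        if tok.isalnum() and low not in STOP_WORDS:
            cleaned.append(low)
    return cleaned

sample = ["The", "price", "of", "Oil", ",", "in", "1987", "."]
print(clean(sample))
```

Note that `isalnum()` also rejects tokens that mix letters with punctuation (for example hyphenated words), so this filter is stricter than removing punctuation marks alone.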

In [None]:
fdist = nltk.FreqDist(cleansed_words)
fdist.most_common(20)

### Vector Representation (BoW)

Bag of Words with "sklearn" (scikit-learn)

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words='english')
vectorizer

A list of documents is built from the original (unpreprocessed) corpus, to be used later with the CountVectorizer

In [None]:
files = [f for f in reuters.fileids() if 'training' in f]
corpus = [reuters.raw(fileids=[f]) for f in files]
len(corpus)

In [None]:
corpus[0]

The ("training") corpus is a list of raw-text documents.
<br> This list is passed to the CountVectorizer, which builds our <b>BoW matrix</b>

In [None]:
X = vectorizer.fit_transform(corpus)
X

In [None]:
print(X[0])

Inspecting the contents as a NumPy array:

In [None]:
X.toarray()

We can retrieve the names (terms) of the features (dimensions), which correspond to the columns of our BoW matrix

In [None]:
# get_feature_names() was removed in recent scikit-learn releases
vectorizer.get_feature_names_out()[2000:2015]
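
The structure of a BoW matrix is easier to see on a tiny example built by hand. The sketch below is an assumption-laden simplification: it uses three made-up documents and plain whitespace tokenization, whereas CountVectorizer applies its own tokenizer, lowercasing, and stop-word filtering.

```python
from collections import Counter

# Three toy documents (hypothetical, for illustration only).
docs = ["oil prices rise", "bank raises rates", "oil bank oil"]

# Simplified tokenization: split on whitespace.
tokenized = [doc.split() for doc in docs]

# One column per unique term, in alphabetical order.
vocab = sorted({tok for doc in tokenized for tok in doc})

# One row per document; each cell counts a term's occurrences.
matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts.get(term, 0) for term in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Each row is a document vector and each column index maps back to a term through `vocab`, which plays the role that the feature names play for the real matrix.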

### Vector Representation (TF-IDF)

Another piece of sklearn functionality is used:

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer()
tfidf

In [None]:
X_weighted = tfidf.fit_transform(X)
X_weighted.toarray()
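
As a rough illustration of the weighting, the sketch below recomputes TF-IDF by hand on a toy count matrix. It assumes the TfidfTransformer defaults (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row); the matrix values and the helper `l2_normalize` are made up for this example.

```python
import math

# Toy document-term count matrix (rows: documents, columns: terms).
counts = [
    [0, 1, 1],
    [1, 0, 0],
    [1, 2, 0],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents containing each term.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

def l2_normalize(row):
    """Scale a row to unit Euclidean length (all-zero rows are left as is)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row

tfidf = [
    l2_normalize([row[j] * idf[j] for j in range(n_terms)])
    for row in counts
]

for row in tfidf:
    print([round(v, 3) for v in row])
```

Terms that appear in fewer documents receive a larger IDF, so the third term (present in one document) is weighted more heavily than the first two (present in two documents each).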

## LSA

In [None]:
from nltk.corpus import stopwords
stopset = set(stopwords.words('english'))
from sklearn.feature_extraction.text import TfidfVectorizer
# scikit-learn expects stop_words as a list (or the string 'english'), not a set
vectorizer = TfidfVectorizer(stop_words=list(stopset), use_idf=True, ngram_range=(1,3))

In [None]:
X = vectorizer.fit_transform(corpus)

In [None]:
from sklearn.decomposition import TruncatedSVD
lsa = TruncatedSVD(n_components=27, n_iter=100)
lsa.fit(X)

In [None]:
lsa.components_[0]

In [None]:
terms = vectorizer.get_feature_names_out()
print(len(terms))

In [None]:
terms[100:200]

In [None]:
# For each LSA component ("concept"), list the 10 terms with the largest weights
for i, comp in enumerate(lsa.components_):
    termsInComp = zip(terms, comp)
    sortedTerms = sorted(termsInComp, key=lambda x: x[1], reverse=True)[:10]
    print("Concept %d" % i)
    for term in sortedTerms:
        print(term[0])
    print(" ")
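
The idea behind LSA can be sketched with a plain truncated SVD in NumPy. This is only an illustration on a made-up 4x4 matrix: TruncatedSVD uses a randomized solver by default, so its components may differ in sign from an exact SVD, and the names `components` and `doc_coords` are chosen here by analogy with `lsa.components_` and `lsa.transform(X)`.

```python
import numpy as np

# Toy document-term matrix (hypothetical values): two groups of documents
# that share no terms with each other.
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
])
k = 2  # number of latent concepts to keep

# Exact SVD, then keep only the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
components = Vt[:k]               # term weights per concept
doc_coords = U[:, :k] * s[:k]     # documents projected onto the concepts

# Rank-k reconstruction and its Frobenius error.
A_k = doc_coords @ components
err = np.linalg.norm(A - A_k)

print(s)
print(err)
```

The reconstruction error equals the square root of the sum of the squared discarded singular values, which is one way to judge how much information a given number of components retains.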

### Topics: Non-Negative Matrix Factorization (NMF)

A technique used for topic extraction:

In [None]:
from sklearn.decomposition import NMF
nmf = NMF(n_components=10).fit(X)

feature_names = vectorizer.get_feature_names_out()

for topic_idx, topic in enumerate(nmf.components_):
    print('Topic #%d:' % topic_idx)
    print(' '.join([feature_names[i] for i in topic.argsort()[:-20 - 1:-1]]))
    print('')
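
To make the factorization concrete, the sketch below approximates a small non-negative matrix `V` as `W @ H` using the classic multiplicative update rules. This is an assumption for illustration: sklearn's NMF uses a different solver by default, and the matrix, seed, and iteration count here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy non-negative document-term matrix (hypothetical values).
V = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
    [0.0, 0.0, 2.0, 1.0],
])
k = 2        # number of topics
eps = 1e-9   # guards against division by zero

# Random positive initialization of both factors.
W = rng.random((V.shape[0], k)) + 0.1   # document-topic weights
H = rng.random((k, V.shape[1])) + 0.1   # topic-term weights
initial_error = np.linalg.norm(V - W @ H)

# Multiplicative updates keep W and H non-negative at every step.
for _ in range(200):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

final_error = np.linalg.norm(V - W @ H)

# Indices of the two highest-weighted terms per topic.
top_terms = np.argsort(H, axis=1)[:, ::-1][:, :2]

print(initial_error, final_error)
print(top_terms)
```

The rows of `H` correspond to `nmf.components_`: sorting each row and reading off the largest entries is the same step the loop above uses to print the top terms per topic.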