# Train and visualize a model in TensorFlow - Part 0.5 (optional): Dataset Preprocessing

This notebook is optional and is meant for readers who want to see how the dataset used in the tutorial was prepared. Before running it, complete all the configuration steps explained in **tutorial 0** and install the optional libraries (in particular `scikit-learn` and `gensim`).

## Dataset Download

The task chosen for this tutorial is document classification on the 20 Newsgroups corpus, a standard resource for this task. It is a collection of newsgroup posts, each labeled with the newsgroup (topic) it was posted to. For more information on the 20 Newsgroups corpus, please refer to the [official website of the project](http://qwone.com/~jason/20Newsgroups/).

For this tutorial we will use the version of the dataset with duplicates removed and only the "From" and "Subject" headers kept. It is the file named [20news-18828.tar.gz](http://qwone.com/~jason/20Newsgroups/20news-18828.tar.gz).

For the word embeddings we will use the Word2Vec embeddings pre-trained on the [Google News corpus](https://cs.famaf.unc.edu.ar/~ccardellino/resources/word_vectors/google/GoogleNews-vectors-negative300.bin.gz).

### Extracting the data

For the preprocessing we need to load the 20 Newsgroups data and the Word2Vec model.

In [None]:
from __future__ import absolute_import, print_function, unicode_literals

import fnmatch
import gensim
import numpy as np
import os

from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
model = gensim.models.KeyedVectors.load_word2vec_format('./resources/GoogleNews-vectors-negative300.bin.gz',
                                                        binary=True)

In [None]:
def find_files(path, file_pattern='*'):
    for root, _, filenames in os.walk(path):
        for filename in fnmatch.filter(filenames, file_pattern):
            yield os.path.join(root, filename)
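
To see what `find_files` returns, here is a minimal sketch that runs it on a throwaway directory tree. The newsgroup names and document ids below are made up for illustration, and the layout `<root>/<newsgroup>/<document id>` is an assumption about how the archive is extracted.

```python
import fnmatch
import os
import tempfile


def find_files(path, file_pattern='*'):
    # Same generator as above: walk the tree and yield every matching file path.
    for root, _, filenames in os.walk(path):
        for filename in fnmatch.filter(filenames, file_pattern):
            yield os.path.join(root, filename)


# Build a tiny, hypothetical corpus layout: <root>/<newsgroup>/<document id>
root = tempfile.mkdtemp()
for group, doc_id in [('sci.space', '60001'), ('rec.autos', '10001'), ('sci.space', '60002')]:
    os.makedirs(os.path.join(root, group), exist_ok=True)
    with open(os.path.join(root, group, doc_id), 'w') as fh:
        fh.write('placeholder text')

# Sorting makes the file order (and therefore the label order) reproducible.
found = sorted(find_files(root))
relative = [os.path.relpath(f, root).replace(os.sep, '/') for f in found]
print(relative)

# The optional pattern filters on the file name only, not on the directory.
only_space = sorted(find_files(root, '6*'))
print(len(only_space))
```

Because `os.walk` does not guarantee any particular order, sorting the result is what keeps files and labels aligned between runs.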

In [None]:
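# Collect every document path; each file sits in a directory named after its newsgroup.
files_20ng = sorted(find_files('./resources/20newsgroup/'))
# The name of the parent directory is the document's label.
labels = [os.path.basename(os.path.dirname(fname)) for fname in files_20ng]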
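
The label extraction relies on each document living in a directory named after its newsgroup. A small sketch on hypothetical paths (the document ids are invented) shows how `dirname` and `basename` combine, and how the resulting labels could be counted per class:

```python
import os
from collections import Counter

# Hypothetical file paths following the <newsgroup>/<document id> layout.
paths = [
    './resources/20newsgroup/rec.autos/10001',
    './resources/20newsgroup/sci.space/60001',
    './resources/20newsgroup/sci.space/60002',
]

# dirname drops the file name; basename then keeps only the last directory.
toy_labels = [os.path.basename(os.path.dirname(p)) for p in paths]
toy_label_counts = Counter(toy_labels)

print(toy_labels)
print(toy_label_counts.most_common())
```

Counting labels this way is a quick sanity check that every newsgroup directory was picked up.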

In [None]:
# Read each file from disk and build TF-IDF features, keeping at most 10000 terms.
vectorizer = TfidfVectorizer(input='filename', decode_error='replace', stop_words='english', max_features=10000)
document_matrix = vectorizer.fit_transform(files_20ng)

In [None]:
# One row per vocabulary term, one column per embedding dimension.
embedding_matrix = np.zeros((document_matrix.shape[1], model.vector_size))

In [None]:
# Copy the pre-trained vector of every known term; unknown terms keep a zero row.
for word, idx in vectorizer.vocabulary_.items():
    if word in model:
        embedding_matrix[idx, :] = model[word]
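
The loop above leaves a row of zeros for every vocabulary term that the pre-trained model does not know. The following sketch reproduces the same logic with hypothetical stand-ins: a plain dictionary plays the role of the Word2Vec model and `toy_vocabulary` plays the role of `vectorizer.vocabulary_`. It also computes the fraction of terms that received a vector, which may be a useful check on the real data.

```python
import numpy as np

# Hypothetical 3-dimensional "embedding model" (a dict instead of Word2Vec).
toy_model = {
    'space': np.array([1.0, 0.0, 0.0]),
    'car': np.array([0.0, 1.0, 0.0]),
}
# Hypothetical term -> column index mapping, like `vectorizer.vocabulary_`.
toy_vocabulary = {'car': 0, 'nasa123': 1, 'space': 2}
vector_size = 3

toy_embeddings = np.zeros((len(toy_vocabulary), vector_size))
for word, idx in toy_vocabulary.items():
    if word in toy_model:
        toy_embeddings[idx, :] = toy_model[word]

# Fraction of vocabulary terms that received a pre-trained vector.
coverage = np.any(toy_embeddings != 0, axis=1).mean()
print(toy_embeddings)
print(coverage)
```

Here `nasa123` is missing from the toy model, so its row stays at zero and the coverage is two out of three terms.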

In [None]:
# Each document becomes the TF-IDF-weighted sum of the embeddings of its terms.
document_matrix.dot(embedding_matrix).shape
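
The product of the document-term matrix and the embedding matrix yields one dense vector per document: the sum of the word embeddings weighted by each term's TF-IDF score. The resulting shape is (number of documents, `model.vector_size`). A minimal sketch with made-up numbers, using dense arrays in place of the sparse TF-IDF matrix:

```python
import numpy as np

# Hypothetical TF-IDF weights: 2 documents x 3 vocabulary terms.
toy_documents = np.array([
    [0.5, 0.0, 0.5],
    [0.0, 1.0, 0.0],
])
# Hypothetical embeddings: 3 vocabulary terms x 2 dimensions.
toy_term_embeddings = np.array([
    [1.0, 0.0],
    [0.0, 0.0],  # term without a pre-trained vector, kept as zeros
    [0.0, 2.0],
])

doc_vectors = toy_documents.dot(toy_term_embeddings)
print(doc_vectors.shape)  # (2, 2)
print(doc_vectors)
```

The second toy document contains only the term without a pre-trained vector, so its document vector is all zeros. On the real data, documents made up mostly of such terms would end up close to the origin in the same way.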