# Text Feature Extraction

This Notebook illustrates the translation of complex objects (plain text documents) into a set of features suitable for training a machine learning algorithm.

We will use the simplest of such models, the **Bag of Words**, on the [Reuters 50](https://archive.ics.uci.edu/ml/datasets/Reuter_50_50) collection of texts.

**Note:** If the original link to the source data above is down, you can get a Google Drive copy of the C50 Reuters dataset from [here](https://drive.google.com/drive/folders/1kKFrHulkbxknPDEp9A1f1KwIYzyfcDME?usp=sharing).

In this notebook we will:
1. Discuss a pipeline to transform files into lists of words.
2. Implement a few document similarity metrics.
3. Compare `sklearn` document vectorizer implementations to our own.
4. Save some preprocessed data sets for later reuse.

In practice we will want to use `sklearn` vectorizers, but here we re-implement them ourselves so that we understand all the subtle details of their definition.


## Preliminaries

#### Imports

In [1]:
import string
from collections import Counter
import os
import pickle

import numpy as np
import pandas as pd

# Domain specific libraries to handle text
#
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.preprocessing import normalize


### IMPORTANT: First-Time Use

This notebook downloads some data and generates some pre-processed files we will need later.

Set the `first_time` flag below to `True` **once**; after that, the notebook runs faster
with the flag set to `False`.

In [2]:
first_time=False

#### Data Directories

In [3]:
if first_time: 
    documents=pd.read_csv("/Users/Emily/Desktop/dfg-humanrights/filenames.csv",index_col="document_id")
    documents['label']=documents['file_name'].apply(lambda x: x.split('_')[0])
    documents['file_name']=documents['file_name'].apply(lambda x: "clean_text3/"+ x )
    documents.to_csv('filenames.csv')
else: 
    documents=pd.read_csv("/Users/Emily/Desktop/dfg-humanrights/filenames.csv",index_col="document_id")

## The Bag Of Words Document Model Representation

### Bag of Words Model

One of the simplest models of a document is the **Bag of Words** model:

We **ignore word ordering** and represent a
document by the list of words that it contains.

We still have a lot of choices we could make:

* Does punctuation like `.` and `?` count as words or do we ignore them?
* Is `New York` one word or two? What about `U.S.`?
* Do `car` and `cars` count as different words?
* What about `safe` and `safely`?
* What do we do about misspelled words? Do we try to fix them?
* Do we consider different capitalizations of the same word:  `Car` versus `car`?
* Do we include high frequency, low information content words such as `a` and `the` in our bag of words?

All these choices are **problem dependent**. If we have a particular ML problem to solve, we will use our **domain** knowledge to resolve those questions in the context of that specific task.

Today we just illustrate how to put together a **data pre-processing** pipeline. 

The pipeline is designed in such a way that it will be easier later to change our answer to any of those questions and control how we prepare the text.
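
As a rough illustration of that design idea, here is a minimal sketch in plain Python where each question above becomes a keyword argument. The function name `simple_pipeline`, its flags, and the tiny `TOY_STOP_WORDS` set are assumptions made for this example; the notebook itself relies on NLTK for tokenization, stemming, and stop words.

```python
import string

# Toy stop-word list for illustration only; the notebook uses NLTK's list.
TOY_STOP_WORDS = {"a", "the", "of", "and", "in"}

def simple_pipeline(text, lowercase=True, strip_punctuation=True,
                    stop_words=TOY_STOP_WORDS):
    """Turn raw text into a list of tokens; each flag answers one question."""
    if lowercase:
        text = text.lower()
    if strip_punctuation:
        # Delete every ASCII punctuation character.
        text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()  # whitespace tokenization: "New York" is two tokens
    return [t for t in tokens if t not in stop_words]

print(simple_pipeline("The Car and the cars in New York, U.S."))
# ['car', 'cars', 'new', 'york', 'us']
```

Changing one decision, for example keeping capitalization, then only requires a different argument (`lowercase=False`) rather than a rewrite of the pipeline.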



### Document Preprocessing

For this exercise we will use a default pipeline that is both sensible and simple to implement:
* We remove punctuation
* `New York` counts as two words but `U.S.` counts as one.
* `car` and `cars` are the same word; the same goes for `safe` and `safely`.
* Misspelled words count as different words.
* We remove capitalization so `Car` and `car` count as the same word.
* We remove high frequency words such as `a` and `the`.

We will rely on Python's **NLTK** (Natural Language Toolkit) library to perform tasks that require **domain** knowledge about
text processing:
* **tokenization**: breaking character streams into words
* **stemming**: normalizing words into their roots: `cars` -> `car`, `safely` -> `safe`
* **stop word removal**: high frequency, low information words that we will consider just *noise* and ignore.

We can write a function that goes from text to stems for reuse later.

We also need to run the stop words through the stemmer, so that they are identified and removed from the stemmed text.
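
To see why the stop list has to be stemmed too, consider this small sketch. It uses a hypothetical `toy_stem` function (it only strips a trailing `s`) as a stand-in for a real stemmer, so the exact stems are illustrative assumptions, not what NLTK would return.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 1 else word

stop_words = {"this", "was", "the"}

# Tokens are stemmed first, as in the notebook's pipeline.
tokens = [toy_stem(t) for t in "this was the best of cars".split()]

# Filtering with the raw stop list misses the stemmed forms 'thi' and 'wa'.
raw_filtered = [t for t in tokens if t not in stop_words]

# Stemming the stop list with the same function fixes the mismatch.
stemmed_stops = {toy_stem(w) for w in stop_words}
stem_filtered = [t for t in tokens if t not in stemmed_stops]

print(raw_filtered)   # ['thi', 'wa', 'best', 'of', 'car']
print(stem_filtered)  # ['best', 'of', 'car']
```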

Things to consider adding:

* expanding the stop word list
* removing specific words that appear in more than 80% of documents (an arbitrary threshold)

In [4]:
#import spacy
#from html import unescape

# create a spaCy tokenizer
#spacy.load('en')
#lemmatizer = spacy.lang.en.English()

def stem_tokenizer(text):
    porter_stemmer = PorterStemmer()
    return [porter_stemmer.stem(token) for token in word_tokenize(text.lower())]

# Join the lines of a document into a single string and
# strip zero-width spaces (U+200B).
def my_preprocessor(text):
    if type(text) is not list:
        text = text.split('\n')
    text = " ".join(text).replace("\n", " ").replace('\u200b', '')
    return text

# Tokenize and stem the document (lowercasing happens in stem_tokenizer),
# then drop stop words and punctuation. The stop list is stemmed with the
# same tokenizer so that it matches the stemmed tokens.
def my_tokenizer(text):
    punctuation = list(string.punctuation)
    stop0 = " ".join(stopwords.words("english") + punctuation)
    stop_words = set(stem_tokenizer(stop0))
    stem_list = stem_tokenizer(text)
    used_list = [token for token in stem_list if token not in stop_words]
    return used_list

# create a dataframe from a word matrix
def wm2df(wm, feat_names):
    
    # create an index for each row
    doc_names = ['Doc{:d}'.format(idx) for idx, _ in enumerate(wm)]
    df = pd.DataFrame(data=wm.toarray(), index=doc_names,
                      columns=feat_names)
    return(df)




### Testing: run preprocessor and tokenizer method on document sample

In [5]:

# .iloc[0] picks the first row regardless of the document_id index labels
with open(documents["file_name"].iloc[0], 'rb') as fh:
    doc = fh.read().decode("utf-8")
text=my_preprocessor(doc)
my_tokenizer(text)


['unit',
 'state',
 'secur',
 'exchang',
 'commiss',
 'washington',
 'dc',
 '20549',
 'schedul',
 '14a',
 'proxi',
 'statement',
 'pursuant',
 'section',
 '14',
 'secur',
 'exchang',
 'act',
 '1934',
 'amend',
 'file',
 'registr',
 'ý',
 'file',
 'parti',
 'registr',
 '¨',
 'check',
 'appropri',
 'box',
 'begin',
 'tabl',
 '¨',
 'preliminari',
 'proxi',
 'statement',
 '¨',
 'confidenti',
 'use',
 'commiss',
 'permit',
 'rule',
 '14a-6',
 'e',
 '2',
 'ý',
 'definit',
 'proxi',
 'statement',
 '¨',
 'definit',
 'addit',
 'materi',
 '¨',
 'solicit',
 'materi',
 'rule',
 '14a-12',
 'end',
 'tabl',
 'aaron',
 'inc.',
 'name',
 'registr',
 'specifi',
 'charter',
 'name',
 'person',
 'file',
 'proxi',
 'statement',
 'registr',
 'payment',
 'file',
 'fee',
 'check',
 'appropri',
 'box',
 'begin',
 'tabl',
 'ý',
 'fee',
 'requir',
 '¨',
 'fee',
 'comput',
 'tabl',
 'per',
 'exchang',
 'act',
 'rule',
 '14a-6',
 '1',
 '0-11',
 '1',
 'titl',
 'class',
 'secur',
 'transact',
 'appli',
 '2',
 'aggre

### Apply to Entire Corpus

In [14]:
tfidf_vectorizer=TfidfVectorizer(input='filename', preprocessor=my_preprocessor, tokenizer=my_tokenizer)
X_tfidf=tfidf_vectorizer.fit_transform(documents["file_name"].head(5))
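# L2-normalize each row. TfidfVectorizer already does this by default (norm='l2'), so here it is a no-op.
# Background on why we normalize: https://www.quora.com/What-is-the-benefit-of-normalization-in-the-tf-idf-algorithm
X_tfidf = normalize(X_tfidf)
tfidf_features = tfidf_vectorizer.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
# create a dataframe from the matrix
wm2df(X_tfidf, tfidf_features)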

Unnamed: 0,,'','74,**,+/-,+1,--,"-1,667",-145.0,"-2,500",...,—investors—corpor,—potenti,—who,•,∎,▪,☐,☒,✓,✔
Doc0,0.032877,0.110153,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.002055,0.0,0.063092,0.0,0.019521,0.0,0.0,0.0,0.0
Doc1,0.0,0.049414,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.045516,0.0,0.0,0.0,0.0,0.0,0.016881
Doc2,0.0,0.123771,0.0,0.003591,0.0,0.001197,0.017955,0.0,0.0,0.0,...,0.0,0.0,0.0,0.033044,0.0,0.0,0.0,0.0,0.0,0.0
Doc3,0.0,0.085005,0.002337,0.0,0.003116,0.0,0.0,0.001558,0.000779,0.001558,...,0.000779,0.0,0.000779,0.045643,0.085691,0.0,0.07868,0.002337,0.345878,0.0
Doc4,0.0,0.053197,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
count_vectorizer=CountVectorizer(input='filename', preprocessor=my_preprocessor, tokenizer=my_tokenizer)
X_count=count_vectorizer.fit_transform(documents["file_name"].head(5))
X_count = normalize(X_count) #explains why we normalize: https://www.quora.com/What-is-the-benefit-of-normalization-in-the-tf-idf-algorithm
count_features = count_vectorizer.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
# create a dataframe from the matrix
wm2df(X_count, count_features)

Unnamed: 0,,'','74,**,+/-,+1,--,"-1,667",-145.0,"-2,500",...,—investors—corpor,—potenti,—who,•,∎,▪,☐,☒,✓,✔
Doc0,0.016158,0.11361,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.00101,0.0,0.055038,0.0,0.009594,0.0,0.0,0.0,0.0
Doc1,0.0,0.065449,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.050989,0.0,0.0,0.0,0.0,0.0,0.010654
Doc2,0.0,0.128449,0.0,0.001776,0.0,0.000592,0.008879,0.0,0.0,0.0,...,0.0,0.0,0.0,0.029005,0.0,0.0,0.0,0.0,0.0,0.0
Doc3,0.0,0.098051,0.001285,0.0,0.001713,0.0,0.0,0.000856,0.000428,0.000856,...,0.000428,0.0,0.000428,0.044529,0.047099,0.0,0.043245,0.001285,0.190107,0.0
Doc4,0.0,0.065158,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [11]:
set_vectorizer=CountVectorizer(input='filename', binary=True, preprocessor=my_preprocessor, tokenizer=my_tokenizer)
X_set=set_vectorizer.fit_transform(documents["file_name"].head(5))
X_set = normalize(X_set) #explains why we normalize: https://www.quora.com/What-is-the-benefit-of-normalization-in-the-tf-idf-algorithm
set_features = set_vectorizer.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
# create a dataframe from the matrix
wm2df(X_set, set_features)

Unnamed: 0,,'','74,**,+/-,+1,--,"-1,667",-145.0,"-2,500",...,—investors—corpor,—potenti,—who,•,∎,▪,☐,☒,✓,✔
Doc0,0.018415,0.018415,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.018415,0.0,0.018415,0.0,0.018415,0.0,0.0,0.0,0.0
Doc1,0.0,0.022863,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.022863,0.0,0.0,0.0,0.0,0.0,0.022863
Doc2,0.0,0.020315,0.0,0.020315,0.0,0.020315,0.020315,0.0,0.0,0.0,...,0.0,0.0,0.0,0.020315,0.0,0.0,0.0,0.0,0.0,0.0
Doc3,0.0,0.018458,0.018458,0.0,0.018458,0.0,0.0,0.018458,0.018458,0.018458,...,0.018458,0.0,0.018458,0.018458,0.018458,0.0,0.018458,0.018458,0.018458,0.0
Doc4,0.0,0.02133,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Using Digrams as features

Instead of using *word* counts as features, we can use counts of *pairs of consecutive words*.
These are called **digrams** (also known as bigrams).
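
A minimal sketch of the idea in plain Python, independent of `sklearn`: pair each token with its successor and count the pairs. The helper name `digrams` is an assumption for this example, and the token list is taken from the start of the sample document shown earlier.

```python
from collections import Counter

def digrams(tokens):
    """Return consecutive token pairs, each joined by a single space."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

tokens = ["unit", "state", "secur", "exchang", "commiss"]
digram_counts = Counter(digrams(tokens))

print(list(digram_counts))
# ['unit state', 'state secur', 'secur exchang', 'exchang commiss']
```

A document with `n` tokens yields `n - 1` digrams, which is one reason the digram vocabulary grows much faster than the single-word vocabulary.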


In [12]:
digram_vectorizer=CountVectorizer(input="filename",preprocessor=my_preprocessor, tokenizer=my_tokenizer,ngram_range=(2,2))
X_digram=digram_vectorizer.fit_transform(documents["file_name"].head(5))
X_digram = normalize(X_digram) #explains why we normalize: https://www.quora.com/What-is-the-benefit-of-normalization-in-the-tf-idf-algorithm
digram_features = digram_vectorizer.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0
# create a dataframe from the matrix
wm2df(X_digram, digram_features)

Unnamed: 0, , 1, begin, compon, end, stock, target,'' '','' -actual,'' .1,...,✓ sustain,✓ sustanatm,✓ suzann,✓ timothi,✓ train,✓ ✓,✔ 0,✔ 1,✔ 2,✔ 3
Doc0,0.01999,0.001538,0.010764,0.001538,0.010764,0.003075,0.001538,0.007688,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Doc1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005782,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.007709,0.003855,0.007709,0.007709
Doc2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.005328,0.0,0.001776,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Doc3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.022973,0.001094,0.0,...,0.006564,0.006564,0.002188,0.002188,0.004376,0.018598,0.0,0.0,0.0,0.0
Doc4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [122]:
digrams=list(digram_vectorizer.vocabulary_.keys())[:5] # only get the first 5 digrams
for digram in digrams:
    print( digram,digram_vectorizer.vocabulary_[digram])

unit state 50204
state secur 46507
secur exchang 44417
exchang commiss 22719
commiss washington 14778


In [68]:
V_digram=X_digram.shape[1]
V_digram

93426

### Saving Trained models for Reuse

Training a text model (calling `fit_transform`) is slow because we need to process all the documents in the training corpus.

Sometimes it is convenient to save pre-trained models to disk for reuse later. We can also save the pre-processed document matrices:

In [125]:
data_dir='preprocessing'
os.makedirs(data_dir, exist_ok=True)  # make sure the output directory exists

In [157]:
set_vectorizer_filename=   data_dir+"/set_vectorizer.p"
set_features_filename=     data_dir+"/set_features.p"
# context managers make sure the files are flushed and closed
with open(set_vectorizer_filename, "wb") as fh:
    pickle.dump(set_vectorizer, fh)
with open(set_features_filename, "wb") as fh:
    pickle.dump(X_set, fh)

In [158]:
count_vectorizer_filename=   data_dir+"/count_vectorizer.p"
count_features_filename=     data_dir+"/count_features.p"
# context managers make sure the files are flushed and closed
with open(count_vectorizer_filename, "wb") as fh:
    pickle.dump(count_vectorizer, fh)
with open(count_features_filename, "wb") as fh:
    pickle.dump(X_count, fh)


In [159]:
tfidf_vectorizer_filename=   data_dir+"/tfidf_vectorizer.p"
tfidf_features_filename=     data_dir+"/tfidf_features.p"
# context managers make sure the files are flushed and closed
with open(tfidf_vectorizer_filename, "wb") as fh:
    pickle.dump(tfidf_vectorizer, fh)
with open(tfidf_features_filename, "wb") as fh:
    pickle.dump(X_tfidf, fh)


In [160]:
digram_vectorizer_filename=   data_dir+"/digram_vectorizer.p"
digram_features_filename=     data_dir+"/digram_features.p"
# context managers make sure the files are flushed and closed
with open(digram_vectorizer_filename, "wb") as fh:
    pickle.dump(digram_vectorizer, fh)
with open(digram_features_filename, "wb") as fh:
    pickle.dump(X_digram, fh)
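
Loading goes the other way with `pickle.load`. The sketch below shows the full round trip on a stand-in object; the dictionary `model` and the temporary file name are assumptions for illustration, where the notebook would use a fitted vectorizer and the file names defined above.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be a fitted vectorizer.
model = {"vocabulary": {"car": 0, "safe": 1}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.p")
    # Save: binary write mode, file closed automatically.
    with open(path, "wb") as fh:
        pickle.dump(model, fh)
    # Load: binary read mode.
    with open(path, "rb") as fh:
        restored = pickle.load(fh)

print(restored == model)  # True
```

Because `pickle` stores functions by reference rather than by value, a session that loads one of the vectorizers above also needs `my_preprocessor` and `my_tokenizer` defined under the same names before calling `pickle.load`.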


In [16]:
from sklearn.decomposition import TruncatedSVD

# Truncated SVD (latent semantic analysis) represents documents and terms as low-dimensional vectors
svd_model = TruncatedSVD(n_components=20, algorithm='randomized', n_iter=100, random_state=122)

svd_model.fit(X_tfidf)

terms = tfidf_vectorizer.get_feature_names_out()  # get_feature_names() in scikit-learn < 1.0

for i, comp in enumerate(svd_model.components_):
    terms_comp = zip(terms, comp)
    sorted_terms = sorted(terms_comp, key= lambda x:x[1], reverse=True)[:7]
    print("Topic "+str(i)+": ")
    for t in sorted_terms:
        print(t[0])
        print(" ")

Topic 0: 
committe
 
compens
 
tabl
 
compani
 
share
 
director
 
stock
 
Topic 1: 
award
 
stock
 
share
 
plan
 
shall
 
compani
 
—
 
Topic 2: 
appl
 
rsu
 
sharehold
 
vest
 
grant
 
award
 
cook
 
Topic 3: 
✓
 
alcoa
 

 
committe
 
0
 
∎
 
☐
 
Topic 4: 
aaron
 
robinson
 
director
 
share
 
perform
 
common
 
2015
 


### Additional Notes

#### Mapping new documents  into an already learned vocabulary

When testing a text classifier on new documents we need to keep the vocabulary **fixed**.

We call `transform`, not `fit_transform`, on the `sklearn` vectorizer.

Words in new documents will get mapped to the columns learned on the training set.
New words not in the original vocabulary will be ignored.

If we called `fit_transform` instead, a new vocabulary would be learned and the indexing of words would be all **scrambled up**.
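
The mechanics can be sketched without `sklearn`. The two helpers below, `fit_vocabulary` and `transform`, are hypothetical names for this example and are much simpler than the library's implementation, but they show the same contract: columns are fixed at fit time, and unseen tokens are dropped at transform time.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Assign a column index to every token seen in the training docs."""
    vocab = sorted({tok for doc in docs for tok in doc})
    return {tok: i for i, tok in enumerate(vocab)}

def transform(doc, vocabulary):
    """Count tokens into the fixed columns; unseen tokens are dropped."""
    row = [0] * len(vocabulary)
    for tok, n in Counter(doc).items():
        if tok in vocabulary:
            row[vocabulary[tok]] = n
    return row

train_docs = [["share", "stock", "share"], ["director", "stock"]]
vocabulary = fit_vocabulary(train_docs)
print(vocabulary)  # {'director': 0, 'share': 1, 'stock': 2}

# 'bonus' was never seen during fitting, so it is ignored.
new_doc = ["stock", "bonus", "stock", "share"]
print(transform(new_doc, vocabulary))  # [0, 1, 2]
```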

In [None]:
# test_documents is assumed to be a DataFrame laid out like `documents`
X_test=count_vectorizer.transform(test_documents["file_name"])

Let's check the word counts of a few new documents:

In [None]:
words=["compani","share","director"]  # example stems, assumed to be in the learned vocabulary
print("doc","word"+" "*11,"dim","count",sep="\t")
for i1 in range(2):
    for word in words:
        document=X_test[i1]
        dimension=count_vectorizer.vocabulary_[word]
        print(i1,f"{word:15}",dimension,document[0,dimension],sep="\t")

## Document Similarity Definitions

Given a representation of a document as a stream of tokens (stems), we need to define a document distance metric.

Usually in text processing the concept described is a normalized similarity $0 \le s(t_1,t_2) \le 1$; the translation to a distance is simply

$$
    d(t_1,t_2) = 1-s(t_1,t_2)
$$

Again, many different definitions of similarity are possible; we will consider three here:
* Set intersection similarity.
* Vector of counts similarity.
* TF-IDF (Term Frequency, Inverse Document Frequency) similarity.

There are many more choices and, as usual, which one works best depends on the problem at hand.

### Set Intersection Similarity Measure
#### Definition

We can consider our **bag of words** as just the set of stems contained in the document.

Then the similarity between two documents is just the normalized intersection of the documents' stem sets.

$$
    S_{\textrm{set}}(S_1,S_2)= \frac{|S_1 \cap S_2|}{\sqrt{|S_1|\cdot |S_2|}}
$$
where $|\cdot|$ is the set cardinality (number of elements).

In words: the *set similarity* of two documents is the ratio of the number of words in common to the geometric mean of the documents' vocabulary sizes.

This is the same as considering $S_i$ as a **one-hot-encoded** vector of words: each word in the vocabulary is a dimension, and a component of the vector is 1 if that word is present in the document, 0 otherwise.

With that interpretation
$$
    S_{\textrm{set}}(S_1,S_2)= \frac{S_1 \cdot S_2}{\|S_1\|\,\|S_2\|}=\cos(S_1,S_2)
$$
where $\cdot$ is the regular scalar product of vectors and $\|\cdot\|$ is the Euclidean vector norm. For a one-hot vector the squared norm equals the set's cardinality, $\|S_i\|^2=|S_i|$, so the denominator is again $\sqrt{|S_1|\cdot |S_2|}$.
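
Both readings of the definition can be checked against each other with a short sketch in plain Python. The function names `set_similarity` and `one_hot_cosine` and the two toy token lists are assumptions made for this example.

```python
import math

def set_similarity(tokens_1, tokens_2):
    """|S1 & S2| / sqrt(|S1| * |S2|), using Python sets."""
    s1, s2 = set(tokens_1), set(tokens_2)
    if not s1 or not s2:
        return 0.0  # convention chosen here for empty documents
    return len(s1 & s2) / math.sqrt(len(s1) * len(s2))

def one_hot_cosine(tokens_1, tokens_2):
    """Cosine of the one-hot vectors built over the joint vocabulary."""
    s1, s2 = set(tokens_1), set(tokens_2)
    vocab = sorted(s1 | s2)
    v1 = [1 if w in s1 else 0 for w in vocab]
    v2 = [1 if w in s2 else 0 for w in vocab]
    dot = sum(a * b for a, b in zip(v1, v2))
    norm_1 = math.sqrt(sum(a * a for a in v1))
    norm_2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm_1 * norm_2)

doc_a = ["share", "stock", "director", "share"]
doc_b = ["stock", "award", "share", "plan"]

# 2 shared stems, vocabulary sizes 3 and 4: 2 / sqrt(12)
print(set_similarity(doc_a, doc_b))
print(math.isclose(set_similarity(doc_a, doc_b), one_hot_cosine(doc_a, doc_b)))  # True
```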

Note how documents from the same author are more similar to each other than to documents from a second author.

### Word Count Similarity Measure
#### Definition

We can include a bit more information in the **Bag of Words** model by keeping track of how many times each word appears in a document:
$$
    S_{\textrm{count}}(C_1,C_2) = \frac{C_1 \cdot C_2}{\|C_1\|\,\|C_2\|}=\cos(C_1,C_2)
$$
This is the same cosine similarity used in the set case, but with a **count feature vector**: rather than being only 0 or 1, each dimension $w$ of a document vector $C$ contains the number of times word $w$ appears in the document. Here $\|\cdot\|$ is the Euclidean vector norm.

#### Implementation
Python's `collections` module has a [Counter](https://docs.python.org/3/library/collections.html#collections.Counter) object that computes the counts for us.
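
A minimal sketch of the count similarity built on `Counter` follows. The function name `count_similarity` and the toy token lists are assumptions for this example; looking up a missing key in a `Counter` returns 0, which makes the dot product easy to write.

```python
import math
from collections import Counter

def count_similarity(tokens_1, tokens_2):
    """Cosine similarity between the word-count vectors of two documents."""
    c1, c2 = Counter(tokens_1), Counter(tokens_2)
    # Words missing from c2 contribute 0 because Counter defaults to 0.
    dot = sum(n * c2[w] for w, n in c1.items())
    norm_1 = math.sqrt(sum(n * n for n in c1.values()))
    norm_2 = math.sqrt(sum(n * n for n in c2.values()))
    if norm_1 == 0 or norm_2 == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_1 * norm_2)

doc_a = ["share", "share", "stock"]
doc_b = ["share", "stock", "stock"]

# dot = 2*1 + 1*2 = 4, both norms are sqrt(5), so the similarity is about 0.8
print(count_similarity(doc_a, doc_b))
```

Unlike the set similarity, which would give 1.0 for these two documents because they contain the same stems, the count similarity distinguishes them by how often each stem occurs.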

Again, documents from the same author are closer to each other. But now the differences look more pronounced.