# Text Feature Extraction

This Notebook illustrates the translation of complex objects (plain text documents) into a set of features suitable for training a machine learning algorithm.

We will use the simplest of such models, the **Bag of Words**, on the [Reuters 50](https://archive.ics.uci.edu/ml/datasets/Reuter_50_50) collection of texts.

**Note:** If the original link to the source data above is down, you can get a Google Drive copy of the C50 Reuters dataset from [here](https://drive.google.com/drive/folders/1kKFrHulkbxknPDEp9A1f1KwIYzyfcDME?usp=sharing).

In this notebook we will:
1. Discuss a pipeline to transform files into lists of words.
2. Implement a few document similarity metrics.
3. Compare `sklearn` document vectorizer implementations to our own.
4. Save some preprocessed data sets for later reuse.

In practice we will want to use `sklearn` vectorizers, but here we re-implement them ourselves so that we understand all the subtle details of their definition.


## Preliminaries

#### Imports

In [11]:
import string
from collections import Counter
import os
import pickle

import numpy as np
import pandas as pd

#
# Domain specific libraries to handle text
#
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.preprocessing import normalize


#### First Time Use

This notebook downloads some data and generates some pre-processed files we will need later.

Set the `first_time` flag below to `True` **once**; after that, the notebook will run faster
with the flag set to `False`.

In [2]:
first_time=True

#### Data Directories

In [3]:
documents=pd.read_csv("/Users/Emily/Desktop/dfg-humanrights/filenames.csv",index_col="document_id")
documents['file_name']=documents['file_name'].apply(lambda x: "clean_text/"+ x )
documents.head()

document_id,file_name,label
0,clean_text/AGTC_DEF 14A_20181115_0001193125-18...,AGTC
1,clean_text/AHPI_DEF 14A_20181108_0001144204-18...,AHPI
2,clean_text/AKRX_DEF 14A_20181121_0001308179-18...,AKRX
3,clean_text/AMTY_DEF 14A_20181128_0001185185-18...,AMTY
4,clean_text/AOSL_DEF 14A_20181109_0001387467-18...,AOSL


### Data Set Pre-processing

Because documents can be large, we will just keep a list of their file names in memory.

We will try really hard not to read them all into memory at the same time.

Each document is labeled by its author. We will use that label later in the course, but not today.

## The Bag Of Words Document Model Representation

### Bag of Words Model

One of the simplest models of a document is the **Bag of Words** model:

We **ignore word ordering** and represent a
document by the list of words that it contains.

We still have a lot of choices we could make:

* Does punctuation like `.` and `?` count as words or do we ignore them?
* Is `New York` one word or two? What about `U.S.`?
* Do `car` and `cars` count as different words?
* What about `safe` and `safely`?
* What do we do about misspelled words? Do we try to fix them?
* Do we consider different capitalizations of the same word:  `Car` versus `car`?
* Do we include high frequency, low information content words such as `a` and `the` in our bag of words?

All these choices are **problem dependent**. If we have a particular ML problem to solve, we will use our **domain** knowledge to resolve those questions in the context of that specific task.

Today we just illustrate how to put together a **data pre-processing** pipeline. 

The pipeline is designed in such a way that it will be easier later to change our answer to any of those questions and control how we prepare the text.
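
One way to keep those choices easy to change is to expose each of them as a flag on a single function. The following is a minimal sketch using only the standard library; `simple_pipeline`, its flags, and the tiny `TOY_STOP_WORDS` set are hypothetical names introduced for illustration, and the regular-expression tokenizer is a crude stand-in for a real tokenizer.

```python
import re

# Tiny illustrative stop-word set (an assumption, not NLTK's list)
TOY_STOP_WORDS = {"a", "an", "the", "of", "and"}

def simple_pipeline(text, lowercase=True, keep_punctuation=False,
                    stop_words=TOY_STOP_WORDS):
    """Turn raw text into a list of tokens; each flag answers one modelling question."""
    if lowercase:
        text = text.lower()
    # \w+ matches runs of letters/digits; [^\w\s] matches a single punctuation mark
    pattern = r"\w+|[^\w\s]" if keep_punctuation else r"\w+"
    tokens = re.findall(pattern, text)
    # Drop any token that appears in the stop-word set
    return [t for t in tokens if t not in stop_words]

sample = "The Car and the cars."
default_tokens = simple_pipeline(sample)
cased_tokens = simple_pipeline(sample, lowercase=False, stop_words=set())
print(default_tokens)
print(cased_tokens)
```

Changing an answer to one of the questions above then amounts to changing one argument, rather than rewriting the pipeline.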



### Document Preprocessing

For this exercise we will use a default pipeline that is both sensible and simple to implement:
* We remove punctuation
* `New York` counts as two words but `U.S.` counts as one.
* `car` and `cars` are the same word; the same goes for `safe` and `safely`.
* Misspelled words count as different words.
* We remove capitalization so `Car` and `car` count as the same word.
* We remove high-frequency words such as `a` and `the`.

We will rely on Python's **NLTK** (Natural Language Toolkit) library to perform tasks that require **domain** knowledge about
text processing:
* **tokenization**: breaking character streams into words
* **stemming**: normalizing words into their roots: `cars` -> `car`, `safely` -> `safe`
* **stop word removal**: high frequency, low information words that we will consider just *noise* and ignore.

We can write a function that goes from text to stems for later reuse:

In [16]:
def stem_tokenizer(text):
    porter_stemmer=PorterStemmer()
    return [porter_stemmer.stem(token) for token in word_tokenize(text.lower())]

We also need to run the stop words through the stemmer (so that they will be identified and removed from the text).

In [17]:
# Use a context manager so the file is closed after reading
with open(documents['file_name'].loc[0], "r") as f:
    process=stem_tokenizer(f.read())
process

['unit',
 'state',
 'secur',
 'and',
 'exchang',
 'commiss',
 'washington',
 ',',
 'd.c.',
 '20549',
 'schedul',
 '14a',
 'proxi',
 'statement',
 'pursuant',
 'to',
 'section',
 '14',
 '(',
 'a',
 ')',
 'of',
 'the',
 'secur',
 'exchang',
 'act',
 'of',
 '1934',
 'file',
 'by',
 'the',
 'registr',
 '☒',
 'file',
 'by',
 'a',
 'parti',
 'other',
 'than',
 'the',
 'registr',
 '☐',
 'check',
 'the',
 'appropri',
 'box',
 ':',
 '[',
 'begin',
 'tabl',
 ']',
 '☐',
 'preliminari',
 'proxi',
 'statement',
 '☐',
 'confidenti',
 ',',
 'for',
 'use',
 'of',
 'the',
 'commiss',
 'onli',
 '(',
 'as',
 'permit',
 'by',
 'rule',
 '14a-6',
 '(',
 'e',
 ')',
 '(',
 '2',
 ')',
 ')',
 '☒',
 'definit',
 'proxi',
 'statement',
 '☐',
 'definit',
 'addit',
 'materi',
 '☐',
 'solicit',
 'materi',
 'pursuant',
 'to',
 '§240.14a-12',
 '[',
 'end',
 'tabl',
 ']',
 'appli',
 'genet',
 'technolog',
 'corpor',
 '(',
 'name',
 'of',
 'registr',
 'as',
 'specifi',
 'in',
 'it',
 'charter',
 ')',
 'not',
 'applic',
 

In [53]:
def process_text(filename,stop):
    # Read the whole file, flatten newlines and drop zero-width spaces
    with open(filename) as file:
        text=file.read().replace("\n"," ").replace('\u200b', '')
    stem_list=stem_tokenizer(text)
    # Keep only the stems that are not in the (stemmed) stop-word set passed in
    used_list=[token for token in stem_list if token not in stop]
    return used_list

In [54]:
filename=documents["file_name"][10]
punctuation=list(string.punctuation)
stop0=stopwords.words("english")+punctuation
stop_words=set(stem_tokenizer(" ".join(stop0)))
stems=process_text(filename,stop_words)


In [56]:
# Process each row's own file (x), not the single `filename` defined earlier
documents['clean_text']=documents['file_name'].apply(lambda x: process_text(x, stop_words))
documents

document_id,file_name,label,clean_text
0,clean_text/AGTC_DEF 14A_20181115_0001193125-18...,AGTC,"[unit, state, secur, exchang, commiss, washing..."
1,clean_text/AHPI_DEF 14A_20181108_0001144204-18...,AHPI,"[unit, state, secur, exchang, commiss, washing..."
2,clean_text/AKRX_DEF 14A_20181121_0001308179-18...,AKRX,"[unit, state, secur, exchang, commiss, washing..."
3,clean_text/AMTY_DEF 14A_20181128_0001185185-18...,AMTY,"[unit, state, secur, exchang, commiss, washing..."
4,clean_text/AOSL_DEF 14A_20181109_0001387467-18...,AOSL,"[unit, state, secur, exchang, commiss, washing..."
5,clean_text/APD_DEF 14A_20190124_0001206774-18-...,APD,"[unit, state, secur, exchang, commiss, washing..."
6,clean_text/ARAY_DEF 14A_20181116_0001193125-18...,ARAY,"[unit, state, secur, exchang, commiss, washing..."
7,clean_text/ARCW_DEF 14A_20181204_0001558370-18...,ARCW,"[unit, state, secur, exchang, commiss, washing..."
8,clean_text/ARL_DEF 14A_20181212_0001387131-18-...,ARL,"[unit, state, secur, exchang, commiss, washing..."
9,clean_text/AYI_DEF 14A_20180831_0001144215-18-...,AYI,"[unit, state, secur, exchang, commiss, washing..."


## Document Similarity Definitions

Given a representation of a document as a stream of tokens (stems), we need to define the concept of a document distance metric.

Usually in text processing the concept described is the normalized similarity $0 \le s(t_1,t_2) \le 1$; the translation to distance is simply

$$
    d(t_1,t_2) = 1-s(t_1,t_2)
$$

Again, many different definitions of similarity are possible; we will consider three here:
* Set intersection similarity.
* Vector of counts similarity.
* TF-IDF (Term Frequency, Inverse Document Frequency) similarity.

There are many more choices, and, as usual, which one works best depends on the  problem at hand.

### Set Intersection Similarity Measure
#### Definition

We can consider our **bag of words** as just the set of stems contained in the document.

Then the similarity between two documents is just the normalized intersection of the documents' stem sets.

$$
    S_{\textrm{set}}(S_1,S_2)= \frac{|S_1 \cap S_2|}{\sqrt{|S_1|\cdot |S_2|}}
$$
where $|\cdot|$ is the set cardinality (number of elements).

In words: the *set similarity* of two documents is the ratio of the number of words in common to the geometric mean of the two documents' vocabulary sizes.

This is the same as considering $S_i$ as a **one-hot-encoded** vector of words: each word in the vocabulary is a dimension, and a component of the vector is 1 if that word is present in the document, 0 otherwise.

With that interpretation
$$
    S_{\textrm{set}}(S_1,S_2)= \frac{S_1 \cdot S_2}{\sqrt{|S_1|\cdot |S_2|}}=\cos(S_1,S_2)
$$
where $\cdot$ is the regular scalar product of vectors and $|S_i| = S_i \cdot S_i$ is the squared Euclidean norm of the vector (for a 0/1 vector this equals the number of ones, that is, the set's cardinality).
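
The two formulas above can be checked against each other on a toy example. This is a minimal sketch, not the notebook's original implementation; `set_similarity` and `one_hot_cosine` are hypothetical helper names, the stem lists are made up, and returning 0 for an empty document is a convention chosen here.

```python
import math

def set_similarity(stems_1, stems_2):
    """Shared stems divided by the geometric mean of the two vocabulary sizes."""
    s1, s2 = set(stems_1), set(stems_2)
    if not s1 or not s2:
        return 0.0  # convention chosen here for empty documents
    return len(s1 & s2) / math.sqrt(len(s1) * len(s2))

def one_hot_cosine(stems_1, stems_2):
    """The same quantity computed as a cosine between 0/1 vectors."""
    s1, s2 = set(stems_1), set(stems_2)
    vocabulary = sorted(s1 | s2)  # one dimension per distinct stem
    v1 = [1 if w in s1 else 0 for w in vocabulary]
    v2 = [1 if w in s2 else 0 for w in vocabulary]
    dot = sum(a * b for a, b in zip(v1, v2))
    norm_1 = math.sqrt(sum(a * a for a in v1))
    norm_2 = math.sqrt(sum(b * b for b in v2))
    if norm_1 == 0 or norm_2 == 0:
        return 0.0
    return dot / (norm_1 * norm_2)

doc_a = ["proxi", "statement", "secur", "exchang"]
doc_b = ["proxi", "statement", "annual", "report", "secur"]
similarity = set_similarity(doc_a, doc_b)
distance = 1 - similarity  # similarity translated into a distance
print(similarity, one_hot_cosine(doc_a, doc_b), distance)
```

Here the documents share 3 stems and have 4 and 5 distinct stems respectively, so both functions evaluate $3/\sqrt{20}$.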

Note how documents from the same author are more similar to each other than to documents from a second author.

### Word Count Similarity Measure
#### Definition

We can include a bit more information in the **Bag of Words** model by keeping track of how many times each word appears in a document:
$$
    S_{\textrm{count}} = \frac{C_1 \cdot C_2}{\sqrt{|C_1|\cdot |C_2|}}=\cos(C_1,C_2)
$$
This is the same cosine similarity used in the set case, but applied to a **count feature vector**: rather than being only 0 or 1, each dimension $w$ of the vector $C$ contains the number of times word $w$ appears in the document. Here $|C| = C \cdot C$ denotes the squared Euclidean norm of the count vector, so the denominator is the product of the two vector lengths.

#### Implementation
Python's `collections` module has a [Counter](https://docs.python.org/3/library/collections.html#collections.Counter) object that computes the counts for us.
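
A minimal sketch of the count similarity built on `Counter` could look like the following. The helper name `count_similarity` and the toy stem lists are assumptions made for illustration; returning 0 for an empty document is a convention chosen here.

```python
import math
from collections import Counter

def count_similarity(stems_1, stems_2):
    """Cosine similarity between the word-count vectors of two token lists."""
    c1, c2 = Counter(stems_1), Counter(stems_2)
    # Only words present in both documents contribute to the scalar product
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm_1 = math.sqrt(sum(n * n for n in c1.values()))
    norm_2 = math.sqrt(sum(n * n for n in c2.values()))
    if norm_1 == 0 or norm_2 == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_1 * norm_2)

doc_a = ["proxi", "proxi", "statement"]
doc_b = ["proxi", "statement", "statement"]
similarity = count_similarity(doc_a, doc_b)
print(similarity)
```

Both toy documents contain the same two stems, so their set similarity would be 1, while the count similarity is $4/5$ because the stems occur with different frequencies.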

Again, documents from the same author are closer to each other. But now the differences look more pronounced.

## Text Feature Extraction with `sklearn`

`sklearn` provides convenient methods to convert text into a set of features suitable for machine learning.


### Represent Text as Word Counts

We configure a `CountVectorizer` that will
* use a list of `filename`s as input and read each file to get its text
* process text using `stem_tokenizer`
* remove the stop words in our `stop_words` set

It will return a matrix of word counts arranged by (`document index`) x (`word index`).

Each word is mapped to its own column.

In [49]:
countVectorizer=CountVectorizer(input="filename",tokenizer=stem_tokenizer,stop_words=stop_words)

We have a count vectorizer, but we still need to learn a vocabulary by **fitting** a corpus of documents

In [55]:
X=countVectorizer.fit_transform(documents["file_name"])

X is a **sparse matrix**: it has 2,500 rows, one per document, and 28,131 columns, one for each word in the vocabulary.

In [None]:
X[0]

The first document has 197 distinct words, so only 197 columns out of 28,131 are non-zero.

If we want to know which columns have positive counts, we use the `nonzero` method, which returns the (row, column) indices of the non-zero elements.

In [35]:
X[0].nonzero()[1] # we only want the column, we already know row is zero

array([7921, 7636, 1195, ..., 7817, 8093, 8615], dtype=int32)

Now let's count the **total number of words in the corpus**.

In [36]:
print("Total Words",X.sum())

Total Words 297963


So X is just a **matrix**:
* **rows** are **documents**
* **columns** are **words**, each word has its own column

There are 28k columns, but the representation is **sparse** to save memory, only non-zero entries are stored in X.
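
To make the structure of `X` concrete, here is a minimal sketch of a hand-built count matrix that stores only non-zero entries. It is not how `sklearn` is implemented; `fit_count_matrix` is a hypothetical helper, the toy documents are made up, and assigning columns in alphabetical order is a choice made here for reproducibility.

```python
from collections import Counter

def fit_count_matrix(tokenized_docs):
    """Learn a vocabulary and return (vocabulary, rows).

    vocabulary maps word -> column index; each row maps column -> count,
    so zero entries are never stored.
    """
    all_words = sorted({w for doc in tokenized_docs for w in doc})
    vocabulary = {word: column for column, word in enumerate(all_words)}
    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        rows.append({vocabulary[w]: n for w, n in counts.items()})
    return vocabulary, rows

docs = [["proxi", "statement", "proxi"], ["annual", "report"]]
vocabulary, rows = fit_count_matrix(docs)
# Summing every stored count gives the total number of words in the corpus
total_words = sum(sum(row.values()) for row in rows)
print(vocabulary)
print(rows, total_words)
```

Each row only holds as many entries as the document has distinct words, which is the same reason the sparse representation of `X` saves memory.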

### Represent Text as a Set of Words

To represent text as a set of words (instead of a vector of counts) we pass an extra flag to
`CountVectorizer` so that it only counts up to `one`. 

In [61]:
setVectorizer=CountVectorizer(input="filename",binary=True,tokenizer=stem_tokenizer,stop_words=stop_words)

In [39]:
X_set=setVectorizer.fit_transform(documents["file_name"])

### Represent Text as TF-IDF weighted Counts

`sklearn` also has a vectorizer that weights columns by the inverse document frequencies.

In [94]:
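tf_idf_vectorizor = TfidfVectorizer(input="filename",tokenizer=stem_tokenizer,stop_words=stop_words)
tf_idf = tf_idf_vectorizor.fit_transform(documents["file_name"])
# TfidfVectorizer already L2-normalizes rows by default (norm="l2"), so this call leaves them unchanged
tf_idf_norm = normalize(tf_idf)
tf_idf_array = tf_idf_norm.toarray()
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
pd.DataFrame(tf_idf_array, columns=tf_idf_vectorizor.get_feature_names_out()).head()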

Unnamed: 0,,'','at,'execut,'for,'reason,**,***,*****************,*if,...,—mr,—our,——————————————,†,††,•,●,◻,☐,☒
0,0.0,0.052022,0.0,0.0,0.0,0.0,0.0,0.0,0.004747,0.0,...,0.0,0.0,0.0,0.0,0.0,0.12434,0.0,0.0,0.017055,0.012492
1,0.000475,0.010278,0.0,0.0,0.001036,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.002338,0.001027
2,0.0,0.110367,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.078447,0.0,0.005178,0.0,0.0,0.0,0.0,0.0,0.005178,...,0.0,0.0,0.0,0.0,0.0,0.0,0.026151,0.0,0.058139,0.008176
4,0.0,0.055112,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.068175,0.0,0.0,0.0,0.0
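
To see what the weights mean, here is a minimal pure-Python sketch of TF-IDF weighting. It assumes the smoothed formula used by `TfidfVectorizer` with its default settings, $\textrm{idf}(w) = \ln\frac{1+n}{1+\textrm{df}(w)} + 1$, followed by L2 normalization of each row; `tfidf_rows` is a hypothetical helper and the toy documents are made up.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return (idf, rows), where each row maps word -> L2-normalized tf-idf weight."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter(w for doc in tokenized_docs for w in set(doc))
    # Smoothed inverse document frequency (assumed to match sklearn's defaults)
    idf = {w: math.log((1 + n_docs) / (1 + d)) + 1.0 for w, d in df.items()}
    rows = []
    for doc in tokenized_docs:
        weights = {w: n * idf[w] for w, n in Counter(doc).items()}
        norm = math.sqrt(sum(x * x for x in weights.values()))
        rows.append({w: x / norm for w, x in weights.items()} if norm else {})
    return idf, rows

docs = [["proxi", "statement"], ["proxi", "report"], ["proxi", "vote"]]
idf, rows = tfidf_rows(docs)
print(idf)
print(rows[0])
```

`proxi` appears in every toy document, so it gets the smallest possible weight, while words that appear in a single document are weighted more heavily.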


### Using Digrams as features

Instead of using *word* counts as features, we can use counts of *pairs of consecutive words*.
These are called **digrams** (also known as bigrams).


In [65]:
digramVectorizer=CountVectorizer(input="filename",tokenizer=stem_tokenizer,stop_words=stop_words,ngram_range=(2,2))

In [66]:
X_digram=digramVectorizer.fit_transform(documents["file_name"])

In [67]:
digrams=list(digramVectorizer.vocabulary_.keys())[:5] # only get the first 5 digrams
for digram in digrams:
    print( digram,digramVectorizer.vocabulary_[digram])

unit state 88157
state secur 81484
secur exchang 77564
exchang commiss 37717
commiss washington 24117


In [68]:
V_digram=X_digram.shape[1]
V_digram

93426
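
For intuition, digrams can be generated by pairing each token with its successor. This is a minimal sketch with a hypothetical helper name `digrams`; it illustrates the idea and is not the code `CountVectorizer` runs internally.

```python
def digrams(tokens):
    """Return the list of consecutive token pairs, joined by a space."""
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

stems = ["unit", "state", "secur", "exchang", "commiss"]
pairs = digrams(stems)
print(pairs)
```

A document with $n$ tokens yields $n-1$ digrams, and because far more distinct pairs than distinct words occur in a corpus, the digram vocabulary is much larger than the single-word one.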

### Saving Trained models for Reuse

Training a text model (calling `fit_transform`) is slow because we need to process all the documents in the training corpus.

Sometimes it is convenient to save pre-trained models to disk for reuse later. We can also save the pre-processed test documents:

In [73]:
data_dir='preprocessing'
os.makedirs(data_dir, exist_ok=True) # make sure the output directory exists before writing
set_vectorizer_filename=data_dir+"/set_vectorizer.p"


In [78]:
set_vectorizer_filename=   data_dir+"/set_vectorizer.p"
set_features_filename=     data_dir+"/set_features.p"
pickle.dump(setVectorizer, open( set_vectorizer_filename, "wb" ) )
pickle.dump(X_set,         open( set_features_filename, "wb" ) )

In [79]:
count_vectorizer_filename=   data_dir+"/count_vectorizer.p"
count_features_filename=     data_dir+"/count_features.p"
pickle.dump(countVectorizer, open( count_vectorizer_filename, "wb" ) )
pickle.dump(X,             open( count_features_filename, "wb" ) )


In [80]:
tfidf_vectorizer_filename=   data_dir+"/tfidf_vectorizer.p"
tfidf_features_filename=     data_dir+"/tfidf_features.p"
pickle.dump(tf_idf_vectorizor, open( tfidf_vectorizer_filename, "wb" ) )
pickle.dump(tf_idf,            open( tfidf_features_filename, "wb" ) )


In [81]:
digram_vectorizer_filename=   data_dir+"/digram_vectorizer.p"
digram_features_filename=     data_dir+"/digram_features.p"
pickle.dump(digramVectorizer, open( digram_vectorizer_filename, "wb" ) )
pickle.dump(X_digram,              open( digram_features_filename, "wb" ) )
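
Reading a saved object back is the mirror image of saving it. The sketch below round-trips a small dictionary through a temporary directory; `save_pickle`, `load_pickle`, and `toy_vocabulary` are hypothetical names used for illustration, and the same pattern would apply to the vectorizers and feature matrices saved above.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    """Write obj to path; the context manager closes the file afterwards."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle)

def load_pickle(path):
    """Read back an object previously written with save_pickle."""
    with open(path, "rb") as handle:
        return pickle.load(handle)

toy_vocabulary = {"proxi": 0, "statement": 1}
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_vocabulary.p")
    save_pickle(toy_vocabulary, path)
    restored = load_pickle(path)
print(restored)
```

Only unpickle files that come from a trusted source, since loading a pickle can execute arbitrary code.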


In [None]:
# TODO: implement once we have a train/test split (`test_documents` and `words` are not defined yet)

#### Mapping new documents  into an already learned vocabulary

When testing a text classifier on new documents we need to keep the vocabulary **fixed**.

We call `transform`, not `fit_transform`, on the `sklearn` vectorizer.

Words in new documents will get mapped to the columns learned on the training set.
New words not in the original vocabulary will be ignored.

If we called `fit_transform` instead the indexing of words would be all **scrambled up**.
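
The behaviour described above can be mimicked with a plain dictionary. This is a minimal sketch rather than `sklearn`'s implementation; `transform_counts`, the toy vocabulary, and the toy document are assumptions made for illustration.

```python
from collections import Counter

def transform_counts(tokens, vocabulary):
    """Map tokens to column counts using a fixed vocabulary; unknown words are dropped."""
    counts = Counter(t for t in tokens if t in vocabulary)
    return {vocabulary[w]: n for w, n in counts.items()}

# Vocabulary learned earlier on the training documents (word -> column)
vocabulary = {"proxi": 0, "statement": 1, "vote": 2}
# "merger" was never seen during training, so it has no column
new_doc = ["proxi", "merger", "proxi", "vote"]
row = transform_counts(new_doc, vocabulary)
print(row)
```

Because the vocabulary is never modified, column 0 still means `proxi` for every document, old or new, which is what a trained classifier relies on.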

In [None]:
X_test=countVectorizer.transform(test_documents["file_name"]) # same column name as in `documents`

Let's check the word counts of a few new documents.

In [None]:
print("doc","word"+" "*11,"dim","count",sep="\t")
for i1 in range(2):
    for word in words:
        document=X_test[i1]
        dimension=countVectorizer.vocabulary_[word]
        print(i1,f"{word:15}",dimension,document[0,dimension],sep="\t")