In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


# Text

## Bag of words

Bag of words is a widely used technique in natural language processing (NLP). It's a good starting point for any text-based problem, and it's the basis of many more advanced methods.

### Tokenization and transformation

Splitting text into pieces is known as tokenization. The most common way to split is on words, but in some cases (for example in character-based languages) you may want to split on characters, on pairs or groups of words, or on something more advanced.

Groups of consecutive words are known as n-grams. Two- and three-word combinations are known as bigrams and trigrams. Bigram examples: 'the lazy', 'brown fox'. Trigram examples: 'brown fox jumps', 'jumps over the'.
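
As a rough illustration of word tokenization and n-grams, here is a minimal sketch in plain Python. The helper names `tokenize` and `ngrams` and the sample sentence are assumptions made for this example; real tokenizers handle punctuation, contractions and other languages far more carefully.

```python
import re

def tokenize(text):
    """Split text into word tokens (runs of letters, digits or underscores)."""
    return re.findall(r"\w+", text)

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "The quick brown fox jumps over the lazy dog"
tokens = tokenize(sentence)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)

print(tokens)
print(bigrams)
print(trigrams)
```

A sentence with nine tokens yields eight bigrams and seven trigrams, because each n-gram starts one token later than the previous one.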

#### Transformation

After tokenization you can transform the tokens, for example by reducing all letters to lower case so that 'fox' and 'Fox' are not counted as two separate words.

#### Stemming

Stemming, which strips word suffixes, is another transformation technique. It extracts more signal from different words with similar meanings, e.g. 'jump', 'jumping', 'jumps' and 'jumped' are all expressed as 'jump'.
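
The sketch below shows lowercasing and suffix stripping on a handful of words. The `toy_stem` function is a deliberately crude stand-in written for this example, not a real stemming algorithm; in practice you would likely reach for an established stemmer from an NLP library.

```python
def toy_stem(word):
    """Strip one common English suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of the stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def normalize(tokens):
    """Lowercase each token, then apply the toy stemmer."""
    return [toy_stem(token.lower()) for token in tokens]

words = ["Fox", "fox", "jump", "jumping", "jumps", "jumped"]
normalized = normalize(words)
print(normalized)  # ['fox', 'fox', 'jump', 'jump', 'jump', 'jump']
```

After normalization the six input words collapse into two distinct tokens, which is exactly the effect these transformations are meant to have on the word counts.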

### Vectorization

After defining the dictionary (the set of tokens you keep), you can convert any text to a vector of numbers corresponding to the occurrences of each dictionary word in the text.
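
A minimal sketch of this step in plain Python follows. The function names `build_vocabulary` and `vectorize` and the toy documents are assumptions for illustration; words that are not in the dictionary are simply ignored here.

```python
from collections import Counter

def build_vocabulary(tokenized_docs):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({token for doc in tokenized_docs for token in doc})
    return {token: index for index, token in enumerate(vocab)}

def vectorize(tokens, vocabulary):
    """Count how often each dictionary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[token] for token in vocabulary]

train_docs = [
    ["the", "dog", "barks"],
    ["the", "cat", "and", "the", "dog"],
]
vocabulary = build_vocabulary(train_docs)
vectors = [vectorize(doc, vocabulary) for doc in train_docs]

# A new text is described with the same dictionary; 'bird' is unknown and dropped.
new_vector = vectorize(["the", "bird"], vocabulary)

print(list(vocabulary))
print(vectors)
print(new_vector)
```

Every document ends up as a vector of the same length, one entry per dictionary word, which is what lets downstream models treat text as ordinary numeric features.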

#### Stop Words

Stop words are words that generally carry little meaning on their own, e.g. 'the', 'is', 'and'. Most ML engineers remove stop words, and most libraries ship a predefined stop word list.
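
Filtering stop words can be sketched as a simple set lookup. The tiny `STOP_WORDS` set below is a hypothetical list made up for this example; scikit-learn vectorizers, for instance, accept `stop_words="english"` to use a built-in list instead.

```python
# Hypothetical, deliberately tiny stop word list for illustration only.
STOP_WORDS = {"the", "is", "and", "a", "of"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word list."""
    return [token for token in tokens if token.lower() not in stop_words]

raw_tokens = ["The", "fox", "is", "quick", "and", "clever"]
filtered = remove_stop_words(raw_tokens)
print(filtered)  # ['fox', 'quick', 'clever']
```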

### Bag of words

One problem with bag-of-words models is the nature of simple word counts. If a non-stop-word such as 'data' is common in the corpus, it's not necessarily informative to know that the word also appears in a new text. Instead, you'd do better by focusing on relatively rare words that are more highly predictive of the outcome of interest.

To this end, it's common to scale each word count by the inverse of how often that word occurs across the corpus. A word that is abundant throughout the training corpus says little about any single document, whereas a word that is rare in the corpus but present in a document helps set that document apart. The scaling therefore favors rare words over common ones and looks for meaning in the differences between the rare ones.

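#### Term frequency-inverse document frequency (tf-idf)

The tf-idf algorithm is commonly used to handle this issue. It weights each term by the product of its term frequency (how often the term occurs in a document) and its inverse document frequency (a factor that shrinks as the term appears in more documents).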
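
To make the product concrete, here is a small sketch using one textbook formulation: term frequency as a fraction of the document length, and inverse document frequency as `log(N / df)`. This choice of formula, the function name `tf_idf` and the toy corpus are assumptions for illustration; libraries use variants (scikit-learn's `TfidfVectorizer`, for example, smooths the idf and normalizes each vector by default), so the numbers will not match theirs.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document."""
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents that contain each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

corpus = [
    ["data", "science", "data"],
    ["data", "engineering"],
    ["data", "kennel", "dog"],
]
scores = tf_idf(corpus)
for doc_scores in scores:
    print(doc_scores)
```

In this toy corpus 'data' occurs in every document, so its idf is `log(3 / 3) = 0` and its weight vanishes, while a term such as 'kennel' that occurs in only one document keeps a positive weight.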

#### Latent semantic analysis (LSA) or latent semantic indexing (LSI)

The idea is to use the bag-of-words counts to build a term-document matrix, with a row for each term and a column for each document. The elements of this matrix are then normalized, similarly to the tf-idf process, so that frequent terms do not dominate the matrix.

The value of this is that LSA can pick out themes or concepts. For example, the concept 'dog' may have related words such as 'barking' and 'kennel'.

##### Singular value decomposition (SVD)

You split the term-document matrix into three matrices (T, S, D). T is the term-concept matrix that relates terms ('barking' or 'kennel') to concepts ('dog'), and D is the concept-document matrix that relates individual documents to concepts. D is what you'll later use to extract features from the LSA model.

The S matrix holds the singular values. These denote the relative importance of each concept.
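
The decomposition can be tried on a tiny matrix with NumPy. The term list and counts below are made up for this sketch, and the variable names `T`, `s` and `D` simply mirror the (T, S, D) notation above; `np.linalg.svd` returns the singular values as a 1-D array sorted from largest to smallest.

```python
import numpy as np

# Hypothetical term-document counts: rows are terms, columns are documents.
terms = ["dog", "barking", "kennel", "stock", "market"]
X = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

# Thin SVD: X = T @ diag(s) @ D.
T, s, D = np.linalg.svd(X, full_matrices=False)

k = 2  # number of concepts to keep
# One row per document, one column per retained concept.
doc_features = (np.diag(s[:k]) @ D[:k, :]).T
# Rank-k approximation of the original matrix.
X_approx = T[:, :k] @ np.diag(s[:k]) @ D[:k, :]

print(np.round(s, 3))
print(np.round(doc_features, 3))
```

Keeping only the first `k` singular values and the matching rows of `D` is the truncation step: each document is then described by `k` concept scores instead of one count per term.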

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD


In [3]:
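def latent_semantic_analysis(docs):
    """Return a 100-dimensional LSA feature vector for each document."""
    tfidf = TfidfVectorizer()  # default parameters
    tfidf.fit(docs)  # builds the dictionary and idf weights
    vecs = tfidf.transform(docs)  # uses the dictionary to vectorize documents
    svd = TruncatedSVD(n_components=100)  # keep the 100 strongest concepts
    svd.fit(vecs)  # computes the truncated SVD
    return svd.transform(vecs)  # projects documents onto the concepts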

In [4]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

In [5]:

latent_semantic_analysis(newsgroups_train.data)

array([[ 0.24972705, -0.06943154, -0.01310705, ...,  0.05060385,
         0.02069774,  0.01851646],
       [ 0.1399918 , -0.07671322, -0.03975507, ...,  0.06868958,
         0.00248835,  0.0073129 ],
       [ 0.37184255, -0.04142798, -0.0670953 , ..., -0.03153767,
        -0.00784163, -0.02100757],
       ..., 
       [ 0.18476811, -0.00611318, -0.08038995, ..., -0.02209117,
        -0.00948749, -0.02047383],
       [ 0.18795807, -0.06606492,  0.04157621, ...,  0.0360893 ,
        -0.00985462, -0.00469434],
       [ 0.08231697, -0.09080726,  0.00372898, ...,  0.00083822,
         0.01503654,  0.01007338]])

#### Probabilistic latent semantic analysis (pLSA) or latent Dirichlet allocation (LDA)

LSA is based on linear algebra (math with vectors and matrices), but an analogous analysis can be done using probabilistic methods that model each document as a statistical mixture of topic distributions.

Specific assumptions are made about the distribution of topics. You build on the assumption that a document can be described by a small set of topics and that any term (word) can be attributed to a topic. In practice, LDA can perform well on diverse datasets.
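
For comparison with the LSA function earlier, here is a hedged sketch of the same workflow using scikit-learn's `LatentDirichletAllocation`. The helper name `lda_topics`, the toy documents, the number of topics and the fixed seed are all assumptions chosen for this example. LDA is fitted on raw word counts here rather than on tf-idf weights.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def lda_topics(docs, n_topics=2, seed=0):
    """Fit LDA on raw word counts and return per-document topic mixtures."""
    counts = CountVectorizer(stop_words="english").fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    return lda.fit_transform(counts)

# Hypothetical toy documents for illustration only.
toy_docs = [
    "the dog is barking in the kennel",
    "a barking dog guards the kennel",
    "the stock market fell sharply today",
    "investors watch the stock market",
]
doc_topics = lda_topics(toy_docs)
print(doc_topics.shape)  # (4, 2): one row per document, one column per topic
print(doc_topics.sum(axis=1))  # each row is a mixture that sums to 1
```

Each row can be read as the document's estimated mixture over topics. On a corpus this small the split between topics is not guaranteed to be meaningful, so treat the output as a demonstration of the API rather than a result.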