## NLP
---
#### Elo notes

Natural Language Processing (NLP) is a subfield of machine learning focused on making sense of text. Text is inherently unstructured, so a range of techniques is needed to convert (vectorize) it into a format that a machine learning algorithm can interpret.

#### Information Retrieval 

Information retrieval (IR), for example ranking documents in response to a search query, is the activity of obtaining information resources relevant to an information need from a collection of information resources. It covers searching for information within a document, searching for documents themselves, searching for metadata that describes data, and searching databases of texts, images, or sounds.

Web search engines are the most visible IR applications.

An object is an entity that is represented by information in a content collection or database. User queries are matched against the database information. However, unlike classical SQL queries against a database, the results returned in information retrieval may or may not match the query, so they are typically ranked. This ranking of results is a key difference between information retrieval searching and database searching.

#### Bag of words

The bag-of-words model is an n-gram model with n = 1. It is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. The bag-of-words model has also been used for computer vision.

The bag-of-words model is commonly used in methods of document classification where the (frequency of) occurrence of each word is used as a feature for training a classifier.

The bag-of-words model is mainly used as a tool for feature generation. After transforming the text into a "bag of words", we can calculate various measures to characterize the text. The most common feature calculated from the bag-of-words model is term frequency, namely the number of times a term appears in the text (vectorization).
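As a rough Python 3 sketch of term frequency, a bag of words can be built with a `Counter`. The `bag_of_words` helper and the lowercase, whitespace-based tokenization are assumptions made for illustration, not part of these notes.

```python
from collections import Counter

def bag_of_words(text):
    """Return a multiset (token -> count) for a text; word order is discarded."""
    # Assumption: lowercasing + whitespace split stands in for a real tokenizer.
    return Counter(text.lower().split())

bow = bag_of_words("the cat sat on the mat")
print(bow["the"])            # term frequency of "the"
print(bow.most_common(1))    # the most frequent term and its count
```

Two texts with the same words in a different order produce the same bag, which is exactly the information this representation gives up.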

Common words such as "the", "a", and "to" are almost always the terms with the highest frequency in a text, so a high raw count does not necessarily mean a word is important. To address this, the term frequencies can be **"normalized"** by weighting each term by the __inverse of its document frequency__, which gives **tf–idf**.

#### N-gram model

The bag-of-words model is an orderless document representation: only the counts of words matter. The n-gram model can be used to retain local word-order information from the text. Applying a __bigram__ model parses the text into two-word units and stores the term frequency of each unit as before.
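A small Python 3 sketch of bigram counting, again assuming whitespace tokenization; `bigram_counts` is a hypothetical helper name.

```python
from collections import Counter

def bigram_counts(text):
    """Count adjacent word pairs, which keeps local word order."""
    tokens = text.lower().split()  # assumption: whitespace tokenization
    # zip pairs each token with its successor: (t0, t1), (t1, t2), ...
    return Counter(zip(tokens, tokens[1:]))

counts = bigram_counts("to be or not to be")
print(counts[("to", "be")])
print(counts[("be", "to")])
```

Unlike the unigram bag, `("to", "be")` and `("be", "to")` are different features here.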

#### Sentiment analysis

Sentiment analysis extracts subjective information, usually from a set of documents such as online reviews, to determine the "polarity" of opinion about specific objects. It is especially useful for identifying trends in public opinion on social media, for example for marketing purposes.

#### Spam filter

In Bayesian spam filtering, an e-mail message is modeled as an unordered collection of words selected from one of two probability distributions: one representing spam and one representing legitimate e-mail ("ham").

Imagine that there are two literal bags full of words. One bag is filled with words found in spam messages, and the other bag is filled with words found in legitimate e-mail. While any given word is likely to be found somewhere in both bags, the "spam" bag will contain spam-related words such as "stock", "Viagra", and "buy" much more frequently, while the "ham" bag will contain more words related to the user's friends or workplace.

To classify an e-mail message, the Bayesian spam filter assumes that the message is a pile of words that has been poured out randomly from one of the two bags, and uses Bayesian probability to determine which bag it more likely came from.
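The two-bags picture can be sketched in Python 3 as a tiny naive Bayes classifier. Everything here is an illustrative assumption: the helper names, the add-one smoothing, the equal prior `p_spam=0.5`, and the toy training messages.

```python
import math
from collections import Counter

def train_bag(messages):
    """Build one 'bag': word counts over all messages of a class."""
    return Counter(word for msg in messages for word in msg.lower().split())

def log_likelihood(message, bag, vocab_size):
    """Log-probability that the words were drawn from this bag (add-one smoothing)."""
    total = sum(bag.values())
    return sum(
        math.log((bag[word] + 1) / (total + vocab_size))
        for word in message.lower().split()
    )

def classify(message, spam_bag, ham_bag, p_spam=0.5):
    """Return 'spam' or 'ham', whichever bag is more likely given the message."""
    vocab_size = len(set(spam_bag) | set(ham_bag))
    spam_score = math.log(p_spam) + log_likelihood(message, spam_bag, vocab_size)
    ham_score = math.log(1 - p_spam) + log_likelihood(message, ham_bag, vocab_size)
    return "spam" if spam_score > ham_score else "ham"

# Toy training data, invented for illustration only.
spam_bag = train_bag(["buy stock now", "buy cheap stock"])
ham_bag = train_bag(["lunch with the team", "meeting notes from the team"])

print(classify("buy stock", spam_bag, ham_bag))
print(classify("team lunch", spam_bag, ham_bag))
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.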

#### First dimension: mathematical basis

Set-theoretic models represent documents as sets of words or phrases. Similarities are usually derived from set-theoretic operations on those sets. Common models are:
- Standard Boolean model
- Extended Boolean model
- Fuzzy retrieval

Algebraic models represent documents and queries usually as vectors, matrices, or tuples. The similarity of the query vector and document vector is represented as a scalar value.
- Vector space model
- Generalized vector space model
- (Enhanced) Topic-based Vector Space Model
- Extended Boolean model
- Latent semantic indexing a.k.a. latent semantic analysis
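For the algebraic family, a common scalar similarity is the cosine between the query vector and the document vector. The Python 3 sketch below assumes plain lists of term weights over a shared, hypothetical vocabulary.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical term counts over the vocabulary ["data", "mining", "music"].
query = [1, 1, 0]
doc_a = [2, 3, 0]
doc_b = [0, 0, 4]

print(cosine_similarity(query, doc_a))  # shares terms with the query
print(cosine_similarity(query, doc_b))  # shares no terms with the query
```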

Probabilistic models treat the process of document retrieval as a probabilistic inference. Similarities are computed as probabilities that a document is relevant for a given query. Probabilistic theorems like the Bayes' theorem are often used in these models.
- Binary Independence Model
- Probabilistic relevance model, on which the Okapi (BM25) relevance function is based
- Uncertain inference
- Language models
- Divergence-from-randomness model
- Latent Dirichlet allocation

Feature-based retrieval models view documents as vectors of values of feature functions (or just features) and seek the best way to combine these features into a single relevance score, typically by learning to rank methods. Feature functions are arbitrary functions of document and query, and as such can easily incorporate almost any other retrieval model as just another feature.

#### Second dimension: properties of the model

- Models without term-interdependencies treat different terms/words as independent. This fact is usually represented in vector space models by the orthogonality assumption of term vectors or in probabilistic models by an independency assumption for term variables.


- Models with immanent term interdependencies allow a representation of interdependencies between terms. However, the degree of interdependency between two terms is defined by the model itself. It is usually derived, directly or indirectly (e.g. by dimensionality reduction), from the co-occurrence of those terms in the whole set of documents.


- Models with transcendent term interdependencies allow a representation of interdependencies between terms, but they do not specify how the interdependency between two terms is defined. They rely on an external source for the degree of interdependency between two terms (for example, a human or a sophisticated algorithm).



#### Precision PPV

Precision is also known as positive predictive value (PPV), and recall as true positive rate (TPR).

Precision is the fraction of the documents retrieved that are relevant to the user's information need.

$\text{precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|}$

In binary classification, precision is analogous to positive predictive value. Precision takes all retrieved documents into account. It can also be evaluated at a given cut-off rank, considering only the topmost results returned by the system. 

Note that the meaning and usage of "precision" in the field of information retrieval differs from the definition of accuracy and precision within other branches of science and statistics.
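A minimal Python 3 sketch of precision, both over all retrieved documents and at a cut-off rank. The document ids, the relevance judgments, and the choice to return `0.0` for an empty result list are assumptions for illustration.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    retrieved = list(retrieved)
    if not retrieved:
        return 0.0  # convention chosen here; undefined when nothing is retrieved
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(retrieved)

def precision_at_k(ranked, relevant, k):
    """Precision over only the top-k results of a ranked list."""
    return precision(ranked[:k], relevant)

ranked = ["d1", "d7", "d3", "d9", "d4"]   # hypothetical ranked result ids
relevant = {"d1", "d3", "d5"}             # hypothetical relevance judgments

print(precision(ranked, relevant))          # 2 relevant out of 5 retrieved
print(precision_at_k(ranked, relevant, 2))  # 1 relevant in the top 2
```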


#### Recall TPR

Recall is the fraction of the documents that are relevant to the query that are successfully retrieved.

$\text{recall} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|}$

In binary classification, recall is often called sensitivity. It can therefore be viewed as the probability that a relevant document is retrieved by the query.

It is trivial to achieve 100% recall by returning all documents in response to any query. Recall alone is therefore not enough: one also needs to account for the non-relevant documents that are retrieved, for example by computing the precision.
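The "return everything" case is easy to check with sets. This Python 3 sketch uses a hypothetical six-document collection and an assumed convention of `0.0` recall when there are no relevant documents.

```python
def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # convention chosen here; undefined with no relevant documents
    return len(relevant & set(retrieved)) / len(relevant)

collection = {"d1", "d2", "d3", "d4", "d5", "d6"}  # hypothetical collection
relevant = {"d1", "d3"}

print(recall({"d1", "d4"}, relevant))   # one of two relevant documents found
print(recall(collection, relevant))     # returning everything gives full recall

# ...but the precision of that "return everything" answer is poor:
precision_all = len(relevant & collection) / len(collection)
print(precision_all)
```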


#### Fall-out

The proportion of non-relevant documents that are retrieved, out of all non-relevant documents available:

$\text{fall-out} = \frac{|\{\text{non-relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{non-relevant documents}\}|}$

In binary classification, fall-out is closely related to specificity and is equal to $1 - \text{specificity}$. It can be viewed as the probability that a non-relevant document is retrieved by the query.
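Fall-out needs the whole collection, since it is measured against the non-relevant documents. A Python 3 sketch with hypothetical document ids:

```python
def fall_out(retrieved, relevant, collection):
    """Fraction of non-relevant documents that were retrieved."""
    non_relevant = set(collection) - set(relevant)
    if not non_relevant:
        return 0.0  # convention chosen here; undefined with no non-relevant documents
    return len(non_relevant & set(retrieved)) / len(non_relevant)

collection = {"d1", "d2", "d3", "d4", "d5", "d6"}  # hypothetical collection
relevant = {"d1", "d3"}
retrieved = {"d1", "d4"}

fo = fall_out(retrieved, relevant, collection)

# Specificity: non-relevant documents correctly left out, over all non-relevant.
non_relevant = collection - relevant
specificity = len(non_relevant - retrieved) / len(non_relevant)

print(fo, 1 - specificity)  # the two values agree
```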


#### F-score / F-measure

The traditional F-measure, or balanced F-score, is the harmonic mean of precision and recall:

$F = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$

This is also known as the $F_{1}$ measure, because recall and precision are evenly weighted.

The general formula for non-negative real $\beta$ is:

$F_{\beta} = \frac{(1 + \beta^{2}) \cdot \mathrm{precision} \cdot \mathrm{recall}}{\beta^{2} \cdot \mathrm{precision} + \mathrm{recall}}$

Two other commonly used $F$ measures are the $F_{2}$ measure, which weights recall twice as much as precision, and the $F_{0.5}$ measure, which weights precision twice as much as recall.

The F-measure was derived by van Rijsbergen (1979) so that $F_{\beta}$ "measures the effectiveness of retrieval with respect to a user who attaches $\beta$ times as much importance to recall as precision". It is based on van Rijsbergen's effectiveness measure $E = 1 - \frac{1}{\frac{\alpha}{P} + \frac{1 - \alpha}{R}}$, where $P$ is precision and $R$ is recall. Their relationship is:

$F_{\beta} = 1 - E$ where $\alpha = \frac{1}{1 + \beta^{2}}$

The F-measure can be a better single metric than precision or recall alone: the two capture different information and complement each other when combined. If one of them is much lower than the other, the F-measure reflects it, because a harmonic mean is pulled toward the smaller value.
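The formulas above translate directly into Python 3. The function names and the sample precision and recall values are assumptions for illustration; the last lines check the stated relationship between $F_{\beta}$ and $E$ numerically.

```python
def f_beta(p, r, beta=1.0):
    """General F-measure for precision p and recall r."""
    if p == 0 and r == 0:
        return 0.0  # convention chosen here to avoid dividing by zero
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

def effectiveness(p, r, alpha):
    """van Rijsbergen's E for nonzero precision p and recall r."""
    return 1 - 1 / (alpha / p + (1 - alpha) / r)

p, r = 0.5, 0.8  # hypothetical precision and recall

print(f_beta(p, r))        # F1, between p and r but closer to the smaller one
print(f_beta(p, r, 2.0))   # F2 leans toward recall
print(f_beta(p, r, 0.5))   # F0.5 leans toward precision

beta = 2.0
alpha = 1 / (1 + beta ** 2)
print(abs(f_beta(p, r, beta) - (1 - effectiveness(p, r, alpha))) < 1e-12)
```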

#### Tokenization

Tokenization, known in computer science as lexical analysis, is the process of converting a sequence of characters (such as a computer program, web page, or document) into a sequence of tokens (strings with an assigned and thus identified meaning), producing a tokenized version of the document.
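A rough Python 3 sketch of a tokenizer built on a regular expression. The pattern is an assumption chosen for illustration (runs of word characters, optionally joined by hyphens, or single punctuation marks) and is far simpler than what a library tokenizer does.

```python
import re

# Assumption: a token is a (possibly hyphenated) word or one punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("I love statistics, programming and data science!"))
print(tokenize("commonly known by its acronym CRISP-DM"))
```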

#### Stop Words

In computing, stop words are words which are filtered out before or after processing of natural language data (text). Any group of words can be chosen as the stop words for a given purpose. For some search engines, these are some of the most common, short function words, such as the, is, at, which, and on. In this case, stop words can cause problems when searching for phrases that include them, particularly in names such as "The Who", "The The", or "Take That". Other search engines remove some of the most common words—including lexical words, such as "want"—from a query in order to improve performance.

In information theory, systems are modeled by a transmitter, channel, and receiver. The transmitter produces messages that are sent through the channel. The channel modifies the message in some way. The receiver attempts to infer which message was sent. In this context, entropy (more specifically, Shannon entropy) is the expected value (mean) of the information contained in each message. 'Messages' can be modeled by any flow of information.
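Shannon entropy can be estimated from observed message frequencies. This Python 3 sketch assumes the empirical frequencies stand in for the true message probabilities; `shannon_entropy` is a hypothetical helper name.

```python
import math
from collections import Counter

def shannon_entropy(messages):
    """Expected information per message, in bits, from empirical frequencies."""
    counts = Counter(messages)
    total = sum(counts.values())
    entropy = 0.0
    for count in counts.values():
        prob = count / total
        entropy -= prob * math.log(prob, 2)
    return entropy

print(shannon_entropy(["a", "b", "a", "b"]))  # two equally likely messages
print(shannon_entropy(["a", "a", "a", "a"]))  # a source that always says the same thing
```

A source that always sends the same message has zero entropy: observing it tells the receiver nothing new.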

From the point of view of information theory and decision trees, features that carry little information are not worth keeping. In NLP, such low-information words are treated as stop words.

#### Sentence Segmentation

Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics. The term applies both to mental processes used by humans when reading text, and to artificial processes implemented in computers, which are the subject of natural language processing.
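A naive Python 3 sketch of sentence segmentation: split wherever sentence-ending punctuation is followed by whitespace. This rule is an assumption for illustration and breaks on abbreviations such as "Dr. Smith", which is why trained segmenters are used in practice.

```python
import re

def split_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    # The lookbehind keeps the punctuation attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

text = "Cross Industry Standard Process for Data Mining. commonly known by its acronym CRISP-DM"
print(split_sentences(text))
```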


#### Preprocessing

Text homogenization.

#### NGrams

An n-gram model is a type of probabilistic language model that predicts the next item in a sequence using an (n − 1)-order Markov model. n-gram models are now widely used in probability, communication theory, computational linguistics (for instance, statistical natural language processing), computational biology (for instance, biological sequence analysis), and data compression. Two benefits of n-gram models (and algorithms that use them) are simplicity and scalability: with larger n, a model can store more context with a well-understood space–time tradeoff, enabling small experiments to scale up efficiently.
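The Markov view can be sketched in Python 3 by counting which token follows each (n − 1)-token context and predicting the most frequent follower. The helper names and the toy sentence are assumptions; a real model would also need smoothing for unseen contexts.

```python
from collections import Counter, defaultdict

def train_ngram_model(tokens, n):
    """Map each (n-1)-token context to a Counter of the tokens that follow it."""
    model = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context][tokens[i + n - 1]] += 1
    return model

def predict_next(model, context):
    """Most frequent continuation of a context, or None if it was never seen."""
    followers = model.get(tuple(context))
    if not followers:
        return None
    return followers.most_common(1)[0][0]

tokens = "the cat sat on the mat and the cat slept".split()
model = train_ngram_model(tokens, n=2)  # bigram model: first-order Markov

print(predict_next(model, ["the"]))
print(predict_next(model, ["dog"]))
```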

#### Pipeline





In [1]:
import nltk

In [77]:
# nltk.download('all')

In [24]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/eloisaelias/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [61]:
from sklearn.datasets import fetch_20newsgroups
from pprint import pprint
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.util import ngrams
from sklearn.feature_extraction.text import TfidfVectorizer

In [4]:
news_train = fetch_20newsgroups(subset='train')

In [9]:
news_train.target_names

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [10]:
subset = ['alt.atheism', 'sci.electronics']

In [12]:
X_train = fetch_20newsgroups(subset='train', categories=subset) 

In [13]:
sentence = 'I love statistics, programming and data science!'

In [14]:
document = word_tokenize(sentence)

In [15]:
print(document)

['I', 'love', 'statistics', ',', 'programming', 'and', 'data', 'science', '!']


In [20]:
stopw = stopwords.words('english')
print(stopw)

[u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', u'you', u'your', u'yours', u'yourself', u'yourselves', u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', u'some', u'such', u'no', u'nor', u

In [19]:
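# Filter stop words out of the tokenized sentence (the stop-word list is lowercase).
stop_set = set(stopwords.words('english'))
word_list = document
cleaning = [word for word in word_list if word.lower() not in stop_set]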

#### Segmentation


In [43]:
sentences = 'Cross Industry Standard Process for Data Mining. commonly known by its acronym CRISP-DM'

In [25]:
segment = nltk.data.load('tokenizers/punkt/english.pickle')

In [31]:
segments = segment.tokenize(sentences)

In [32]:
segments

['Cross Industry Standard Process for Data Mining.',
 'commonly known by its acronym CRISP-DM']


#### Documents into Tokens

In [48]:
def documents(segments):
    doc = []
    for seg in segments:
        doc.append(word_tokenize(seg))
    return doc

In [52]:
# Result: two documents
documents(segments)

[['Cross', 'Industry', 'Standard', 'Process', 'for', 'Data', 'Mining', '.'],
 ['commonly', 'known', 'by', 'its', 'acronym', 'CRISP-DM']]

#### NGrams

In [74]:
sentence = 'Cross Industry Standard Process for Data Mining'

In [58]:
n = 3
threegrams = ngrams(sentence.split(), n)
for grams in threegrams:
    print(grams)

('Cross', 'Industry', 'Standard')
('Industry', 'Standard', 'Process')
('Standard', 'Process', 'for')
('Process', 'for', 'Data')
('for', 'Data', 'Mining')


#### Bag of words - Vectorization

In [70]:
# Equivalent to CountVectorizer followed by TfidfTransformer.
vectorizer = TfidfVectorizer(stop_words = 'english', ngram_range=(1, 2))

In [75]:
X = vectorizer.fit_transform([sentence])

In [76]:
print(X)

  (0, 0)	0.301511344578
  (0, 4)	0.301511344578
  (0, 9)	0.301511344578
  (0, 7)	0.301511344578
  (0, 2)	0.301511344578
  (0, 6)	0.301511344578
  (0, 1)	0.301511344578
  (0, 5)	0.301511344578
  (0, 10)	0.301511344578
  (0, 8)	0.301511344578
  (0, 3)	0.301511344578


### TF-IDF Features

TF-IDF is a relevance measure used to identify documents that are related to a search query. The search query is itself treated as a document.

In information retrieval, tf–idf, short for __term frequency–inverse document frequency__, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval, text mining, and user modeling. 

The tf-idf value increases proportionally to the number of times a word appears in the document, but is often offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general. Nowadays, tf-idf is one of the most popular term-weighting schemes. For instance, 83% of text-based recommender systems in the domain of digital libraries use tf-idf.

Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification.

#### Term frequency

The first form of term weighting is due to Hans Peter Luhn (1957) and is based on the Luhn Assumption:

- The weight of a term that occurs in a document is simply proportional to the term frequency.

#### Inverse document frequency

An inverse document frequency factor is incorporated which diminishes the weight of terms that occur very frequently in the document set and increases the weight of terms that occur rarely.

- The specificity of a term can be quantified as an inverse function of the number of documents in which it occurs

__tf–idf__ is the product of two statistics, term frequency and inverse document frequency. Various ways for determining the exact values of both statistics exist.
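One of the simplest variants, sketched in Python 3, uses the raw count as term frequency and $\log(N / df)$ as inverse document frequency, where $N$ is the number of documents and $df$ the number of documents containing the term. This particular choice and the toy corpus are assumptions; scikit-learn's `TfidfVectorizer`, for instance, by default smooths the idf and L2-normalizes each row, so its numbers differ.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # raw term counts in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

corpus = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the cat ran".split(),
]
w = tf_idf(corpus)

print(w[0]["the"])  # appears in every document, so its weight is zero
print(w[1]["dog"])  # appears in one document only, so it gets the highest weight
```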


#### Ranking

Ranking of query results is one of the fundamental problems in information retrieval (IR), the scientific/engineering discipline behind search engines. 

Given a query __q__ and a collection __D__ of documents that match the query, the problem is to rank, that is, sort, the documents in __D__ according to some criterion so that the "best" results appear early in the result list displayed to the user. Classically, ranking criteria are phrased in terms of relevance of documents with respect to an information need expressed in the query.

Ranking is often reduced to the computation of numeric scores on query/document pairs. Baseline score functions for this purpose include the __cosine similarity between tf–idf vectors representing the query and the document in a vector space model__, BM25 scores, and probabilities in a probabilistic IR model. A ranking can then be computed by sorting documents by descending score.

An alternative approach is to define a score function on pairs of documents __d₁, d₂__ that is positive if and only if __d₁__ is more relevant to the query than __d₂__, and to use this information to sort.

Ranking functions are evaluated by a variety of __means__; one of the simplest is determining the precision of the first __k top-ranked__ results for some fixed __k__; for example, the proportion of the top 10 results that are relevant, on average over many queries.

Frequently, computation of ranking functions can be simplified by taking advantage of the observation that only the relative order of scores matters, not their absolute value; hence terms or factors that are independent of the document may be removed, and terms or factors that are independent of the query may be precomputed and stored with the document.
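With cosine similarity, for example, the query norm is the same for every document and can be dropped, while each document norm is independent of the query and can be precomputed. The Python 3 sketch below uses hypothetical vectors to check that the simplified score yields the same ordering.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def rank(scores):
    """Document ids sorted by descending score."""
    return sorted(scores, key=scores.get, reverse=True)

query = [1.0, 2.0, 0.0]  # hypothetical query vector
docs = {                 # hypothetical document vectors
    "d1": [1.0, 1.0, 0.0],
    "d2": [0.0, 3.0, 1.0],
    "d3": [2.0, 0.0, 2.0],
}

# Query-independent: can be computed once and stored with each document.
doc_norms = {name: norm(vec) for name, vec in docs.items()}

# Full cosine similarity.
cosine = {n: dot(query, v) / (norm(query) * doc_norms[n]) for n, v in docs.items()}
# Simplified score: the document-independent factor norm(query) is dropped.
simplified = {n: dot(query, v) / doc_norms[n] for n, v in docs.items()}

print(rank(cosine))
print(rank(simplified))
```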






