## NLP
---
#### Elo notes

Natural Language Processing (NLP) is a subfield of machine learning focused on making sense of text. Text is inherently unstructured, so a range of techniques is needed to convert (vectorize) it into a format that a machine learning algorithm can interpret.

#### Information Retrieval 

Information retrieval (IR), for example the ranking of documents in response to a search query, is the activity of obtaining information resources relevant to an information need from a collection of information resources. It is the science of searching for information in a document, searching for documents themselves, and searching for metadata that describe data, as well as searching databases of texts, images, or sounds.

Web search engines are the most visible IR applications.

An object is an entity that is represented by information in a content collection or database. User queries are matched against the database information. However, unlike classical SQL queries of a database, in information retrieval the results returned may or may not match the query, so results are typically ranked. This ranking of results is a key difference between information retrieval searching and database searching.

#### Bag of words

The bag-of-words model is an n-gram model with n = 1. It is a simplifying representation used in natural language processing and information retrieval (IR). In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. The bag-of-words model has also been used for computer vision.

The bag-of-words model is commonly used in methods of document classification where the (frequency of) occurrence of each word is used as a feature for training a classifier.

The bag-of-words model is mainly used as a tool for feature generation. After transforming the text into a "bag of words", we can calculate various measures to characterize the text. The most common feature calculated from the bag-of-words model is term frequency, namely the number of times a term appears in the text (vectorization).
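
As a minimal sketch of the idea above, the snippet below turns a text into a bag of words with term frequencies. The function name `bag_of_words`, the sample sentence, and the crude regex tokenizer are illustrative assumptions, not part of any particular library.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return a multiset (word -> count) for the text, ignoring word order."""
    # Assumption: lowercase everything and treat runs of letters/apostrophes as words.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

doc = "John likes to watch movies. Mary likes movies too."
bow = bag_of_words(doc)

# Term frequency is simply the count stored for each word.
print(bow["likes"], bow["movies"], bow["john"])  # 2 2 1
```

Word order is lost: any permutation of the same words yields the same `Counter`.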

Common words like "the", "a", and "to" are almost always the terms with the highest frequency in a text, so a high raw count does not necessarily mean that a word is important. To address this problem, a popular way to **"normalize"** the term frequencies is to weight a term by the __inverse of document frequency__, or **tf–idf**.
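
The weighting can be sketched as follows. This assumes the plain variant `tf * log(N / df)`, where `N` is the number of documents and `df` the number of documents containing the term; real libraries often use smoothed or normalized variants. The function name `tf_idf` and the toy corpus are illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
w = tf_idf(docs)

# "the" occurs in every document, so its idf is log(1) = 0 and its weight vanishes.
print(w[0]["the"], w[0]["cat"], w[0]["mat"])
```

In this toy corpus the rare word "mat" ends up with a higher weight than "cat", while "the" drops to zero despite having the highest raw count.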

#### N-gram model

The bag-of-words model is an orderless document representation: only the counts of words matter. The n-gram model can be used to store spatial (word-order) information within the text. Applying a __bigram__ model parses the text into two-word units and stores the term frequency of each unit as before.
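
A short sketch of bigram counting, assuming whitespace tokenization; the helper name `ngrams` and the sample sentence are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "john likes to watch movies and mary likes to read".split()
bigram_counts = Counter(ngrams(tokens, 2))

# Each two-word unit is counted like a single term in a bag of words.
print(bigram_counts[("likes", "to")])  # 2
```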

#### Sentiment analysis

Sentiment analysis extracts subjective information, usually from a set of documents, often using online reviews to determine the "polarity" of opinion about specific objects. It is especially useful for identifying trends of public opinion in social media for marketing purposes.

#### Spam filter

In Bayesian spam filtering, an e-mail message is modeled as an unordered collection of words selected from one of two probability distributions: one representing spam and one representing legitimate e-mail ("ham").

Imagine that there are two literal bags full of words. One bag is filled with words found in spam messages, and the other bag is filled with words found in legitimate e-mail. While any given word is likely to be found somewhere in both bags, the "spam" bag will contain spam-related words such as "stock", "Viagra", and "buy" much more frequently, while the "ham" bag will contain more words related to the user's friends or workplace.

To classify an e-mail message, the Bayesian spam filter assumes that the message is a pile of words that has been poured out randomly from one of the two bags, and uses Bayesian probability to determine which bag it is more likely to have come from.
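
The two-bags picture can be sketched as a tiny naive Bayes classifier. This is an illustration under stated assumptions, not a production filter: the four training messages are made up, words are assumed independent given the class, and add-one (Laplace) smoothing is used so unseen words do not zero out a score.

```python
import math
from collections import Counter

def train(messages):
    """messages: list of (tokens, label) pairs, label is 'spam' or 'ham'."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for tokens, label in messages:
        word_counts[label].update(tokens)   # fill the corresponding "bag"
        label_counts[label] += 1
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, label_counts, vocab

def log_score(tokens, label, model):
    """log P(label) + sum of log P(word | label), with add-one smoothing."""
    word_counts, label_counts, vocab = model
    total_words = sum(word_counts[label].values())
    score = math.log(label_counts[label] / sum(label_counts.values()))
    for t in tokens:
        score += math.log((word_counts[label][t] + 1) / (total_words + len(vocab)))
    return score

def classify(tokens, model):
    """Pick the bag the message is more likely to have been poured from."""
    return max(("spam", "ham"), key=lambda label: log_score(tokens, label, model))

model = train([
    ("buy cheap stock now".split(), "spam"),
    ("cheap viagra buy today".split(), "spam"),
    ("meeting with the team today".split(), "ham"),
    ("lunch with friends tomorrow".split(), "ham"),
])

print(classify("buy stock".split(), model))            # spam
print(classify("lunch with the team".split(), model))  # ham
```

Working in log space avoids multiplying many small probabilities together, which would underflow on long messages.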

#### First dimension: mathematical basis

Set-theoretic models represent documents as sets of words or phrases. Similarities are usually derived from set-theoretic operations on those sets. Common models are:
- Standard Boolean model
- Extended Boolean model
- Fuzzy retrieval
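
As an illustration of the set-theoretic view, here is a minimal sketch of standard Boolean retrieval: each term maps to the set of documents containing it, and AND, OR, and NOT become set intersection, union, and difference. The helper `build_index` and the three toy documents are assumptions made for the example.

```python
def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

docs = {1: "cats chase mice", 2: "dogs chase cats", 3: "mice eat cheese"}
index = build_index(docs)

# Query: cats AND chase AND NOT dogs
and_not = (index["cats"] & index["chase"]) - index["dogs"]
# Query: mice OR cheese
either = index["mice"] | index["cheese"]

print(and_not, either)  # {1} {1, 3}
```

A document either matches or it does not, so this model produces no ranking by itself.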

Algebraic models represent documents and queries usually as vectors, matrices, or tuples. The similarity of the query vector and document vector is represented as a scalar value.
- Vector space model
- Generalized vector space model
- (Enhanced) Topic-based Vector Space Model
- Extended Boolean model
- Latent semantic indexing a.k.a. latent semantic analysis
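
For the vector space model, the scalar similarity is commonly the cosine of the angle between the query vector and the document vector. The sketch below assumes raw term counts as vector components (tf–idf weights could be substituted) and uses made-up documents.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
query = Counter("cat mat".split())
scores = [cosine_similarity(query, Counter(d.split())) for d in docs]

# Rank document indices from most to least similar to the query.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)  # [0, 1, 2]
```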

Probabilistic models treat the process of document retrieval as probabilistic inference. Similarities are computed as probabilities that a document is relevant for a given query. Probabilistic theorems such as Bayes' theorem are often used in these models.
- Binary Independence Model
- Probabilistic relevance model, on which the Okapi (BM25) relevance function is based
- Uncertain inference
- Language models
- Divergence-from-randomness model
- Latent Dirichlet allocation

Feature-based retrieval models view documents as vectors of values of feature functions (or just features) and seek the best way to combine these features into a single relevance score, typically by learning-to-rank methods. Feature functions are arbitrary functions of document and query, and as such can easily incorporate almost any other retrieval model as just another feature.

#### Second dimension: properties of the model

- Models without term interdependencies treat different terms/words as independent. This is usually represented in vector space models by the orthogonality assumption of term vectors, or in probabilistic models by an independence assumption for term variables.


- Models with immanent term interdependencies allow a representation of interdependencies between terms. However, the degree of the interdependency between two terms is defined by the model itself. It is usually directly or indirectly derived (e.g. by dimensional reduction) from the co-occurrence of those terms in the whole set of documents.


- Models with transcendent term interdependencies allow a representation of interdependencies between terms, but they do not specify how the interdependency between two terms is defined. They rely on an external source for the degree of interdependency between two terms (for example, a human or sophisticated algorithms).



#### Precision (PPV)

Precision, also called positive predictive value (PPV), and recall, also called true positive rate (TPR), are two standard measures of retrieval quality.

Precision is the fraction of the documents retrieved that are relevant to the user's information need.

$\text{precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|}$

In binary classification, precision is analogous to positive predictive value. Precision takes all retrieved documents into account. It can also be evaluated at a given cut-off rank, considering only the topmost results returned by the system. 

Note that the meaning and usage of "precision" in the field of information retrieval differs from the definition of accuracy and precision within other branches of science and statistics.
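
A small sketch of the precision formula and of precision at a cut-off rank. The document ids, the relevance judgments, and the function names `precision` and `precision_at_k` are illustrative assumptions.

```python
def precision(relevant, retrieved):
    """Fraction of retrieved documents that are relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0  # convention chosen here for an empty result list
    return len(relevant & retrieved) / len(retrieved)

def precision_at_k(relevant, ranked, k):
    """Precision computed over only the top-k results of a ranked list."""
    return precision(relevant, ranked[:k])

relevant = {"d1", "d2", "d3", "d4"}
ranked = ["d3", "d9", "d4", "d7", "d8", "d6"]  # system output, best first

p_all = precision(relevant, ranked)          # 2 relevant out of 6 retrieved
p_at_3 = precision_at_k(relevant, ranked, 3)  # 2 relevant out of the top 3
print(round(p_all, 3), round(p_at_3, 3))  # 0.333 0.667
```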


#### Recall (TPR)

Recall is the fraction of the documents that are relevant to the query that are successfully retrieved.

$\text{recall} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|}$

In binary classification, recall is often called sensitivity. It can be viewed as the probability that a relevant document is retrieved by the query.

It is trivial to achieve a recall of 100% by returning all documents in response to any query. Recall alone is therefore not enough: one also needs to measure the number of non-relevant documents returned, for example by computing the precision.
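
The following sketch computes recall and illustrates why it cannot be used alone: returning the whole collection gives perfect recall but poor precision. The ten-document collection and the relevance judgments are made up for the example.

```python
def recall(relevant, retrieved):
    """Fraction of relevant documents that were retrieved."""
    if not relevant:
        return 0.0  # convention chosen here when nothing is relevant
    return len(relevant & retrieved) / len(relevant)

collection = {f"d{i}" for i in range(1, 11)}   # d1 .. d10
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d3", "d4", "d7"}

r_normal = recall(relevant, retrieved)        # 2 of 4 relevant documents found
r_everything = recall(relevant, collection)   # returning everything finds all 4
p_everything = len(relevant & collection) / len(collection)

print(r_normal, r_everything, p_everything)  # 0.5 1.0 0.4
```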


#### Fall-out

The proportion of non-relevant documents that are retrieved, out of all non-relevant documents available:

$\text{fall-out} = \frac{|\{\text{non-relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{non-relevant documents}\}|}$

In binary classification, fall-out is closely related to specificity and is equal to $1 - \text{specificity}$. It can be viewed as the probability that a non-relevant document is retrieved by the query.
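
A brief sketch that computes fall-out and checks the relationship to specificity numerically. The collection and relevance judgments are illustrative; specificity is taken here as the fraction of non-relevant documents that were correctly not retrieved.

```python
def fall_out(relevant, retrieved, collection):
    """Fraction of non-relevant documents that were retrieved."""
    non_relevant = collection - relevant
    if not non_relevant:
        return 0.0
    return len(non_relevant & retrieved) / len(non_relevant)

def specificity(relevant, retrieved, collection):
    """Fraction of non-relevant documents that were NOT retrieved."""
    non_relevant = collection - relevant
    if not non_relevant:
        return 1.0
    return len(non_relevant - retrieved) / len(non_relevant)

collection = {f"d{i}" for i in range(1, 11)}   # d1 .. d10
relevant = {"d1", "d2", "d3", "d4"}            # so d5 .. d10 are non-relevant
retrieved = {"d3", "d4", "d7"}

fo = fall_out(relevant, retrieved, collection)        # only d7: 1 of 6
spec = specificity(relevant, retrieved, collection)   # 5 of 6
print(round(fo, 3), round(spec, 3))  # 0.167 0.833
```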


#### F-score / F-measure

The F-measure is the weighted harmonic mean of precision and recall. The traditional F-measure, or balanced F-score, is:

${\displaystyle F={\frac {2\cdot \mathrm {precision} \cdot \mathrm {recall} }{(\mathrm {precision} +\mathrm {recall} )}}}$

This is also known as the ${\displaystyle F_{1}}$ measure, because recall and precision are evenly weighted.

The general formula for non-negative real ${\displaystyle \beta }$ is:

${\displaystyle F_{\beta }={\frac {(1+\beta ^{2})\cdot (\mathrm {precision} \cdot \mathrm {recall} )}{(\beta ^{2}\cdot \mathrm {precision} +\mathrm {recall} )}}\,}$

Two other commonly used $F$ measures are the ${\displaystyle F_{2}}$ measure, which weights recall twice as much as precision, and the ${\displaystyle F_{0.5}}$ measure, which weights precision twice as much as recall.
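
The general formula translates directly into code. In this sketch the function name `f_beta` and the sample precision and recall values are illustrative.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0  # both precision and recall are zero
    return (1 + beta ** 2) * precision * recall / denom

p, r = 0.5, 0.25
f1 = f_beta(p, r)          # balanced
f2 = f_beta(p, r, 2.0)     # leans toward recall (the lower value here)
f05 = f_beta(p, r, 0.5)    # leans toward precision (the higher value here)
print(round(f1, 3), round(f2, 3), round(f05, 3))  # 0.333 0.278 0.417
```

Because recall is the weaker of the two in this example, emphasizing it ($F_2$) lowers the score and emphasizing precision ($F_{0.5}$) raises it.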

The F-measure was derived by van Rijsbergen (1979) so that ${\displaystyle F_{\beta }}$ "measures the effectiveness of retrieval with respect to a user who attaches ${\displaystyle \beta }$ times as much importance to recall as precision". It is based on van Rijsbergen's effectiveness measure ${\displaystyle E=1-{\frac {1}{{\frac {\alpha }{P}}+{\frac {1-\alpha }{R}}}}}.$ Their relationship is:

${\displaystyle F_{\beta }=1-E}$ where ${\displaystyle \alpha ={\frac {1}{1+\beta ^{2}}}}$
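
As a quick check of this relationship, write $P$ for precision and $R$ for recall and substitute $\alpha = \frac{1}{1+\beta^2}$, so that $1-\alpha = \frac{\beta^2}{1+\beta^2}$:

$\frac{\alpha}{P} + \frac{1-\alpha}{R} = \frac{1}{1+\beta^2}\left(\frac{1}{P} + \frac{\beta^2}{R}\right) = \frac{\beta^2 P + R}{(1+\beta^2)\,P R}$

$1 - E = \frac{1}{\frac{\alpha}{P} + \frac{1-\alpha}{R}} = \frac{(1+\beta^2)\,P R}{\beta^2 P + R} = F_\beta$

This matches the general formula for $F_\beta$ term by term.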

The F-measure can be a better single metric than precision or recall alone, because the two give different information that complements each other when combined. Since it is a harmonic mean, the F-measure is pulled toward the smaller of the two values, so a system that excels at one while doing poorly on the other still receives a low score.

#### Tokenization

In computer science, lexical analysis is the process of converting a sequence of characters (such as a computer program, web page, or document) into a sequence of tokens (strings with an assigned and thus identified meaning). Applied to text, the result is a tokenized document.
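
A minimal regex-based sketch of this process, where each token is a string paired with an assigned type. The three token types (`NUMBER`, `WORD`, `PUNCT`) and the pattern are assumptions chosen for illustration; real tokenizers handle many more cases.

```python
import re

# One named group per token type; the first alternative that matches wins.
TOKEN_RE = re.compile(
    r"(?P<NUMBER>\d+)"
    r"|(?P<WORD>[A-Za-z]+(?:'[A-Za-z]+)?)"   # allows contractions like "don't"
    r"|(?P<PUNCT>[^\w\s])"
)

def tokenize(text):
    """Return (type, string) pairs; whitespace is skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]

tokens = tokenize("Don't panic: it's only 42 tokens!")
print(tokens)
# [('WORD', "Don't"), ('WORD', 'panic'), ('PUNCT', ':'), ('WORD', "it's"),
#  ('WORD', 'only'), ('NUMBER', '42'), ('WORD', 'tokens'), ('PUNCT', '!')]
```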

#### Stop Words

In computing, stop words are words which are filtered out before or after processing of natural language data (text). Any group of words can be chosen as the stop words for a given purpose. For some search engines, these are some of the most common, short function words, such as the, is, at, which, and on. In this case, stop words can cause problems when searching for phrases that include them, particularly in names such as "The Who", "The The", or "Take That". Other search engines remove some of the most common words—including lexical words, such as "want"—from a query in order to improve performance.
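
Stop-word filtering can be sketched in a few lines. The stop list below is just the five example words from the paragraph above, not a standard list, and the function name `remove_stop_words` is illustrative.

```python
# Assumption: a tiny stop list taken from the examples in the text.
STOP_WORDS = {"the", "is", "at", "which", "on"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop list."""
    return [t for t in tokens if t.lower() not in stop_words]

filtered = remove_stop_words("The cat is on the mat".split())
band = remove_stop_words("The Who".split())

print(filtered)  # ['cat', 'mat']
print(band)      # ['Who']
```

The second call shows the phrase-search problem: after filtering, the band name "The Who" is reduced to a single ordinary word.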

In information theory, systems are modeled by a transmitter, channel, and receiver. The transmitter produces messages that are sent through the channel. The channel modifies the message in some way. The receiver attempts to infer which message was sent. In this context, entropy (more specifically, Shannon entropy) is the expected value (mean) of the information contained in each message. 'Messages' can be modeled by any flow of information.

In information theory and in decision trees, features that carry little information are not worth keeping. In NLP, such low-information words are called stop words.
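
One rough way to read this is through the Shannon entropy of a word's presence/absence across documents. In the sketch below, the presence vectors are made up for illustration, and treating "zero entropy" as "carries no information for telling documents apart" is a simplification of how decision trees actually score features.

```python
import math
from collections import Counter

def shannon_entropy(symbols):
    """Expected information, in bits, of a symbol drawn from the sequence."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# 1 = word present in the document, 0 = absent (hypothetical four-document corpus).
the_present = [1, 1, 1, 1]     # "the" appears everywhere
stock_present = [1, 0, 1, 0]   # "stock" appears in half of the documents

h_the = shannon_entropy(the_present)
h_stock = shannon_entropy(stock_present)
print(h_the, h_stock)  # 0.0 1.0
```

A word that is present in every document has zero entropy, so observing it tells us nothing; a word present in half of the documents carries a full bit.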

#### Sentence Segmentation

Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics. The term applies both to mental processes used by humans when reading text, and to artificial processes implemented in computers, which are the subject of natural language processing.
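
A naive sentence splitter can be sketched with a single regular expression. The rule used here (split after `.`, `!`, or `?` when followed by whitespace and an uppercase letter) is an assumption for illustration and is known to fail on abbreviations.

```python
import re

# Split at whitespace that follows sentence-final punctuation
# and precedes an uppercase letter.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_sentences(text):
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]

sentences = split_sentences("It works. Does it? Yes! mostly fine.")
print(sentences)  # ['It works.', 'Does it?', 'Yes! mostly fine.']

# Limitation: abbreviations are wrongly treated as sentence ends.
broken = split_sentences("Dr. Smith arrived.")
print(broken)  # ['Dr.', 'Smith arrived.']
```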


#### Preprocessing

Text homogenization.

#### NGrams

An n-gram model is a type of probabilistic language model that predicts the next item in a sequence in the form of an (n − 1)-order Markov model. n-gram models are now widely used in probability, communication theory, computational linguistics (for instance, statistical natural language processing), computational biology (for instance, biological sequence analysis), and data compression. Two benefits of n-gram models (and algorithms that use them) are simplicity and scalability: with larger n, a model can store more context with a well-understood space–time tradeoff, enabling small experiments to scale up efficiently.
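
For n = 2, the model is a first-order Markov chain: the next word depends only on the current one. The sketch below estimates those probabilities by maximum likelihood (plain relative counts, no smoothing) from a made-up sentence; the function names are illustrative.

```python
from collections import Counter, defaultdict

def train_bigram_model(tokens):
    """Count, for every word, which words follow it and how often."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_probs(model, prev):
    """Relative frequencies of the words seen after `prev` (empty if unseen)."""
    following = model.get(prev, Counter())
    total = sum(following.values())
    return {w: c / total for w, c in following.items()}

def predict_next(model, prev):
    """Most likely next word, or None if `prev` was never seen."""
    probs = next_word_probs(model, prev)
    return max(probs, key=probs.get) if probs else None

tokens = "the cat sat on the mat and the cat slept".split()
model = train_bigram_model(tokens)

probs = next_word_probs(model, "the")   # "cat" twice, "mat" once
print(predict_next(model, "the"))  # cat
```

Raising n means conditioning on longer histories, which stores more context but makes the count table grow quickly.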