## Week 07 - Natural Language Processing

#### 7.1/7.2 - Text Mining Analytics
Text mining focuses on the process and analytics focuses more on the result. Text Analytics turns data into high-quality information or actionable knowledge.
- Minimize human effort
- Supplies knowledge for optimal decision making
- Mining focuses on the process
- Analytics emphasizes the result
- Text retrieval is preprocessing for text mining
- Text retrieval is needed for knowledge provenance; turning text data into actionable knowledge

###### Data Mining Problems
- Dealing with text/non-text data (numerical, categorical, video, relational).
	- Non-text data collected from sensors
	- Text data collected from humans
- The problem is turning all of this data into actionable knowledge that can be used to change the world.

The real world is perceived by a human observer, who then expresses those perceptions as text data.
- You can mine knowledge about a language expressed as text data
- You can mine the content of text data observed by the user
- You can mine knowledge about the observer
- You can infer other real-world variables (predictive analytics)

Non-text data can help establish context. We can partition text data by time period, location, or any other metadata that forms an interesting comparison. Non-text data therefore enables context-sensitive analysis of content, language usage, or opinions about the observer or authors of text data.

###### Course Overview
- NLP and Text Representation
- Word Association Mining and Analysis
- Topic Mining and Analysis
- Opinion Mining and Sentiment Analysis
- Text-based Prediction

#### 7.3/7.4 - Natural Language Content Analysis

###### Basic Concepts in NLP
- Lexical analysis is Part-of-speech tagging. Which words are nouns, verbs, adjectives, etc. 
- Syntactical Parsing is finding patterns in lexicon. Verb phrase, noun phrase, prepositional phrase, etc.
- Semantic Analysis is binding words and phrases into symbols. A woman being a type of human for example.
- Inference is about understanding the world given content. A woman in a sentence about sadness would be sad. Someone doing an action like running from another entity would be in fear.
- Pragmatic Analysis is all about the purpose of a sentence.

###### What We Can't Do 
- 100% POS tagging (ambiguity)
- General complete parsing (ambiguity)
- Precise deep semantic analysis (true definition of certain words)

###### Computer Steps
- Segment the words
- Understand the categories (Lexical Analysis; i.e. part-of-speech tagging)
- Understand relationships between the words (Syntactical Parsing; i.e. parsing)
- Make inference from the context of the above steps

Representing all of the above through symbols is known as Semantic Analysis. Once this has been established, we can make inferences from the context of the sentence. Pragmatic Analysis concerns the use of language (why someone would say something).

#### 7.5/7.6 - Text Representation
Take a string of text and transform it into a sequence of words. With this word sequence you can perform part-of-speech (POS) tagging, which gives two layers of representation: the word sequence and the POS tags. From the POS tags we can generate a syntactic structure describing the words and their associations. From that structure we can recognize what words refer to (a name is a person, a place is a location, etc.), which forms entities and relations. After this come logic predicates, which are essentially inference rules.

As this process moves from the first representation to the last, generality is lost: the deeper representations require more human effort and are less accurate, but they are closer to knowledge representation. The final step is understanding speech acts: the reason the sentence exists.

| Text Representation  | Enabled Analysis  | Examples |
|-|-|-|
| String | String Processing | Compression |
| Words | Word relations analysis; topic analysis; sentiment analysis | Thesaurus discovery; topic and opinion related discovery |  
| Syntactic Structures | Syntactic Graph | Stylistic Analysis; structure-based feature extraction |  
| Entities and relations | Knowledge graph analysis; information network analysis  | Discovery of knowledge and opinions about specific entities |  
| Logic and Predicates | Integrative analysis of scattered knowledge; logic inference | Knowledge assistant for biologists |

#### 7.7/7.8 - Word Association Mining
Basic word Relations:
- Paradigmatic: A & B have this if they can be substituted for each other (same class; sentence will still make sense). Words with high context similarity will likely have this relation.
- Syntagmatic: A & B have this if they can be combined with each other (e.g. a noun and a verb). Words with high co-occurrence but relatively low individual occurrence likely have this relation.
- Joint discovery of the two relations: paradigmatically related words tend to have a syntagmatic relation with the same word.

###### Why Mine Word Associations?
- Useful for improving accuracy of NLP
	- POS tagging
	- Parsing
	- Entity Recognition
	- Acronym Expansion
	- Grammar Learning
- Useful for text retrieval and mining
	- Text Retrieval (word associations suggest variation of a query)
	- Automatic construction of topic map for browsing (words as nodes and associations as edges)
	- Compare and summarize opinions

Left context is the word or words that appear before the word being analyzed. Right context is the word or words that appear after it. A general context covers the text around the word being analyzed. Comparing these contexts is how paradigmatic relations are studied.

Syntagmatic relations concern which words appear in a text given the presence or absence of other words. This can be narrowed to the words that appear immediately to the left and right of the word being analyzed.

###### General Ideas of Mining Word Associations
- Paradigmatic
	- Represent each word by its context
	- Compute context similarity
	- Words with high similarity likely have a paradigmatic relation
- Syntagmatic
	- Count how many times two words occur together in a context
	- Compare their co-occurrences with their individual occurrences
	- Words with high co-occurrence but relatively low individual occurrence likely have a syntagmatic relation.
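
The syntagmatic idea above can be sketched in a few lines. This is only an illustration: the sentence-level context, the toy sentences, and the use of a lift ratio $P(a, b) / (P(a) P(b))$ as the comparison are assumptions made here, not a measure taken from these notes.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_stats(sentences):
    """Count, per word and per word pair, how many sentences contain them."""
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.split()))       # presence, not frequency
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))  # pairs in sorted order
    return word_counts, pair_counts

def lift(a, b, word_counts, pair_counts, n_sentences):
    """Co-occurrence relative to individual occurrence (assumed measure).

    Both words must occur at least once, otherwise this divides by zero.
    """
    pair = tuple(sorted((a, b)))
    p_ab = pair_counts[pair] / n_sentences
    p_a = word_counts[a] / n_sentences
    p_b = word_counts[b] / n_sentences
    return p_ab / (p_a * p_b)

# Hypothetical toy collection; each sentence is one context.
sentences = [
    "the cat eats fish",
    "the dog eats meat",
    "the cat sleeps",
    "the dog barks",
]
word_counts, pair_counts = cooccurrence_stats(sentences)
n = len(sentences)

lift_the_eats = lift("the", "eats", word_counts, pair_counts, n)
lift_eats_fish = lift("eats", "fish", word_counts, pair_counts, n)
```

In this toy data "the" co-occurs with "eats" more often than "fish" does, but "the" appears everywhere, so its lift is lower. That is the sense in which co-occurrence has to be compared against individual occurrence.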

A word's context is treated as a pseudo-document. The left context is the words that appear before the word, the right context is the words that appear after it, and a window context is a fixed number of words around it. Each context is essentially a bag of words, and it may contain adjacent or non-adjacent words.
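
As a rough sketch of how these pseudo-documents could be collected, the function below gathers the left-1, right-1, and window contexts of a target word as bags of words. The function name, the whitespace tokenization, and the default window size are assumptions for illustration.

```python
from collections import Counter

def context_bags(tokens, target, window=2):
    """Return (left1, right1, window) bags of words for every use of `target`."""
    left1, right1, win = Counter(), Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        if i > 0:
            left1[tokens[i - 1]] += 1           # word immediately before
        if i + 1 < len(tokens):
            right1[tokens[i + 1]] += 1          # word immediately after
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                          # skip the target itself
                win[tokens[j]] += 1
    return left1, right1, win

# Hypothetical toy text, split on whitespace.
tokens = "my cat eats fish on saturday and my dog eats meat on monday".split()
cat_left, cat_right, cat_window = context_bags(tokens, "cat")
dog_left, dog_right, dog_window = context_bags(tokens, "dog")
```

Here "cat" and "dog" end up with the same left-1 and right-1 contexts ("my" and "eats"), which is the kind of overlap a paradigmatic relation relies on.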

###### Measuring Context Similarity
$$Sim(Cat, Dog) = Sim(Left1(Cat), Left1(Dog)) + Sim(Right1(Cat), Right1(Dog)) + ...$$
If this is high then the words are paradigmatically related.

Treat each bag-of-words context as a vector in a Vector Space Model (VSM). Each word in the vocabulary is a dimension, and the context of a target word is a frequency vector holding the count of each word in its pseudo-document. The similarity of two words is then the similarity of their context vectors.

###### Expected Overlap of Words in Context (EOWC)
$$x_i = \frac{ c( w_i, d_1 ) }{ \mathbf{ | d_1 | } }$$
$$y_i = \frac{ c( w_i, d_2 ) }{ \mathbf{ | d_2 | } }$$
$$Sim(d_1, d_2) = d_1 \cdot d_2 = \sum_{i = 1}^N x_i y_i$$

where $d_1 = (x_1, ..., x_N)$ and $d_2 = (y_1, ..., y_N)$. This is the probability that two randomly picked words, one from $d_1$ and one from $d_2$, are identical. Each weight $x_i$ is the probability that a word picked at random from the context $d_1$ is $w_i$, and $y_i$ is the same for $d_2$. Because the counts are normalized, the weights in each vector sum to one.

The problem is that this favors matching one frequent term very well over matching more distinct terms. It also treats every word equally, which is not ideal for common stop words. Two remedies:

- Sublinear Transformation of Term Frequency (TF)
- Reward Matching a Rare Word (Inverse Document Frequency: IDF)
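
A minimal sketch of EOWC, assuming each context is given as a plain list of words. The example contexts are made up to show the bias described above: sharing one frequent word scores higher than sharing three distinct words.

```python
from collections import Counter

def normalized_counts(context_words):
    """x_i = c(w_i, d) / |d| for a pseudo-document given as a list of words."""
    counts = Counter(context_words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def eowc(d1_words, d2_words):
    """Dot product of the two normalized count vectors."""
    x = normalized_counts(d1_words)
    y = normalized_counts(d2_words)
    # Only words present in both contexts contribute to the sum.
    return sum(x[w] * y[w] for w in x.keys() & y.keys())

# Hypothetical contexts: one shared frequent word vs. three shared distinct words.
d1 = ["the", "the", "the", "eats"]
d2 = ["the", "the", "the", "barks"]
d3 = ["my", "eats", "food", "fish"]
d4 = ["my", "eats", "food", "meat"]

frequent_match = eowc(d1, d2)   # 0.75 * 0.75 = 0.5625
distinct_match = eowc(d3, d4)   # 3 * (0.25 * 0.25) = 0.1875
```

The pair that only shares "the" gets the higher score, which is why the two remedies above (sublinear TF and IDF) are needed.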

#### 7.9 - Paradigmatic Relation Discovery
###### BM25 Transformation
Imagine the x-axis is $x = c(w, d)$ and the y-axis is $y = TF(w, d)$.
$$y = \frac{(k + 1)x}{x + k}$$

where $k$ is a parameter and $x$ is a raw count of a word. The upper bound is $k + 1$. This puts a strict constraint on high frequency terms.
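
The transformation is short enough to try directly. In this sketch the default `k=1.2` is only an assumed value for illustration, and the zero-count guard is added so that `k = 0` does not divide by zero.

```python
def bm25_tf(count, k=1.2):
    """Sublinear TF transformation y = (k + 1) x / (x + k)."""
    if count == 0:
        return 0.0          # an absent word gets no weight
    return (k + 1) * count / (count + k)

# The value grows with the raw count but never reaches k + 1.
curve = [bm25_tf(x) for x in (0, 1, 2, 10, 100, 1000)]
```

Doubling the raw count from 1 to 2 raises the transformed value by much less than a factor of two, and very large counts flatten out just below $k + 1$.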

###### IDF Weighting: Penalize Popular Terms
x-axis is $k$ and the y-axis is $IDF(w)$. $M$ is the total number of docs in collection and $k$ is the total number of docs containing $w$.
$$IDF(w) = \log \frac{M + 1}{k}$$

This function gives a higher value for a lower $k$. The lowest value occurs when $k$ reaches its maximum, $M$: all docs contain the word. The highest value occurs when only one document contains the word, meaning the word is rare.
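
A small sketch of the IDF formula. The natural logarithm and the collection size of 1000 documents are assumptions, since the notes do not fix a log base or a collection.

```python
import math

def idf(doc_freq, num_docs):
    """IDF(w) = log((M + 1) / k), with k = doc_freq >= 1 and M = num_docs."""
    return math.log((num_docs + 1) / doc_freq)

M = 1000                     # hypothetical collection size
rare = idf(1, M)             # word appears in a single document
common = idf(500, M)         # word appears in half the documents
everywhere = idf(M, M)       # word appears in every document
```

The rare word gets the largest weight, and a word that appears in every document gets a weight just above zero.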

###### Adapting BM25 for Paradigmatic Relation Mining
$d_1 = (x_1, ..., x_N)$ and $d_2 = (y_1, ..., y_N)$.
$$BM25( w_i, d_1 ) = \frac{ ( k + 1 ) \cdot c( w_i, d_1 ) }{ c( w_i, d_1 ) + k \left( 1 - b + b \cdot \frac{ | d_1 | }{ AVDL } \right) }$$

$$x_i = \frac{ BM25( w_i, d_1 ) }{ \sum_{ j = 1 }^N BM25( w_j, d_1 ) }$$

where $b \in [ 0, 1 ]$ and $k \in [ 0, + \infty )$. $k$ sets the upper bound and controls how sublinear the TF transformation is. $b$ controls length normalization, and $AVDL$ is the average length of the context pseudo-documents.
$$Sim( d_1, d_2 ) = \sum_{ i = 1 }^N IDF( w_i )x_iy_i$$

$y_i$ is defined similarly. This ensures that high-frequency words obtain a lower weight.
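
The adapted similarity can be sketched as follows. This assumes the per-word BM25 weight uses only the TF part with length normalization, and that IDF enters once, in the similarity sum. The defaults `k=1.2` and `b=0.75`, the contexts, and the IDF values are made up for illustration.

```python
from collections import Counter

def bm25_vector(context_words, avdl, k=1.2, b=0.75):
    """Normalized BM25 weights x_i for one context pseudo-document."""
    counts = Counter(context_words)
    # Length-normalized k: longer-than-average contexts are damped more.
    norm = k * (1 - b + b * len(context_words) / avdl)
    raw = {w: (k + 1) * c / (c + norm) for w, c in counts.items()}
    total = sum(raw.values())
    return {w: v / total for w, v in raw.items()}

def bm25_similarity(d1_words, d2_words, idf, avdl, k=1.2, b=0.75):
    """Sim(d1, d2) = sum over shared words of IDF(w_i) * x_i * y_i."""
    x = bm25_vector(d1_words, avdl, k, b)
    y = bm25_vector(d2_words, avdl, k, b)
    return sum(idf[w] * x[w] * y[w] for w in x.keys() & y.keys())

# Hypothetical contexts and IDF values.
cat_ctx = ["the", "the", "the", "eats", "fish"]
dog_ctx = ["the", "the", "the", "eats", "meat"]
idf = {"the": 0.01, "eats": 2.0, "fish": 3.0, "meat": 3.0}

sim = bm25_similarity(cat_ctx, dog_ctx, idf, avdl=5.0)
```

With these numbers the shared word "eats" contributes far more to the similarity than the shared stop word "the", even though "the" occurs three times in each context.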

BM25 can also discover syntagmatic relations. The highly weighted terms in the context vector of word $w$ are likely syntagmatically related to $w$.

###### BM25 to Discover Syntagmatic Relations
$$\text{IDF-weighted } d_1 = (x_1 \cdot IDF(w_1), \ldots, x_N \cdot IDF(w_N))$$

$x_i$ only reflects how frequently a word occurs in the context. IDF weighting ensures that common words will not be the highest weighted terms. The highest weighted terms are those frequent in the context but not in the collection.
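
A short sketch of ranking the terms of an IDF-weighted context vector. The weights in `x` and the IDF values are hypothetical, and giving words with no IDF entry a weight of zero is a choice made here for simplicity.

```python
def idf_weighted_terms(x, idf, top_n=3):
    """Rank context words by x_i * IDF(w_i), highest first."""
    weighted = {w: x_i * idf.get(w, 0.0) for w, x_i in x.items()}
    ranked = sorted(weighted.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]

# Hypothetical normalized context weights for some target word.
x = {"the": 0.5, "eats": 0.3, "fish": 0.2}
idf = {"the": 0.01, "eats": 2.0, "fish": 2.5}

ranked = idf_weighted_terms(x, idf)
top_words = [w for w, _ in ranked]
```

Although "the" has the largest raw weight in the context, it drops to the bottom after IDF weighting, leaving "eats" and "fish" as the candidate syntagmatic partners.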

###### The main idea for discovering paradigmatic relations
- Collecting the context of a candidate word to form a pseudo document (bag of words)
- Computing similarity of the corresponding context documents of two candidate words
- Highly similar word pairs can be assumed to have paradigmatic relations
