## Week 12 - Text Mining

#### 12.3 - Prediction
The goal is to make predictions about the world based on data gathered through sensors.

###### Data Mining Loop:
- Real world data comes in
- The data is perceived by sensors, which could be people
- This data comes in the form of non-text data and text data
- Perform joint mining of data to generate multiple predictors 
- These predictor variables will be added to a predictive model
- Predict values of real world variables
- Cycle repeats with the machine determining what data should be collected next   

###### People:
- Involved in data mining and generation of the features
- Involved in the predictive model building and testing
- Involved in applying the predictions to make decisions or take action
- Involved in controlling the sensors of data collection

###### Subtasks of Text-Based Prediction
- Mining content of text data
- Mining knowledge about the observer

###### Joint Analysis of Text and Non-Text Data
Non-text data help text-mining
- Provides context, which gives ways to partition the data
- Contextual Text Mining: mine text in the context defined by non-text data

Text data help non-text data mining
- Text data helps interpret patterns discovered from non-text data
- Pattern Annotation: Using text data to interpret patterns found in non-text data

#### 12.4 - Motivation
Text has rich context information
- Direct context: time, location, authors, source (metadata)
- Indirect context: social network of authors, author's age, etc.
- Any related data can be regarded as context

Context can be used to:
- Partition data for comparative analysis
- Provide meaning to discovered topics

Once context is available, the data can be partitioned in many different ways, for example by time or by location. We can then compare topics between years or between locations. We could also partition the data by the author's location or by a specific topic.
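
A minimal sketch of partitioning by context, assuming a hypothetical toy corpus in which each document carries `year` and `location` metadata. The helper names `partition_by` and `word_counts` are made up for illustration, and raw word counts stand in for the topics that a real comparison would use.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: each document has text plus context metadata.
docs = [
    {"text": "battery life is great", "year": 2019, "location": "US"},
    {"text": "battery drains fast", "year": 2020, "location": "US"},
    {"text": "screen is great", "year": 2020, "location": "UK"},
]

def partition_by(docs, key):
    """Group documents by one context variable (e.g. 'year' or 'location')."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[key]].append(doc)
    return dict(groups)

def word_counts(group):
    """Count word occurrences across all documents in one partition."""
    counts = Counter()
    for doc in group:
        counts.update(doc["text"].split())
    return counts

# The same corpus partitioned two different ways.
by_year = {year: word_counts(g) for year, g in partition_by(docs, "year").items()}
by_location = {loc: word_counts(g) for loc, g in partition_by(docs, "location").items()}
```

Comparing `by_year[2019]` with `by_year[2020]`, or `by_location["US"]` with `by_location["UK"]`, is the comparative analysis that context makes possible.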

#### 12.5 - Contextual Probabilistic Latent Semantic Analysis (CPLSA)
Idea: 
- Add context variables into the generative model (enables discovery of contextualized topics)
- This influences both the coverage and the content variation of topics

Extension of PLSA:
- Model conditional likelihood of text given context
- Assume context-dependent views of a topic; allow discovery of variations of the same topic in different contexts
- Assume context-dependent topic coverage; cover topics differently based on time, location, etc.
- EM algorithm still used for parameter estimation
- Estimated parameters naturally contain context variables, enabling contextual text mining

###### Generation Process of CPLSA
Assume multiple topics in the context of an event in time. Also assume different views of each topic. Each view is a different version of the topic's word distribution and is tied to context variables such as location, time, or author. Think of this as a matrix. Theme coverage also varies with context: people in a given location or time period would be more interested in certain topics than in others.

So first choose a coverage, then use the coverage to choose a topic, then draw a word from that topic. For the next word, a different topic may be chosen, which generates a different word. This essentially allows the context to dictate the choice of words.
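
The generation steps can be sketched as follows. This is a simplified illustration, not the full CPLSA model: the `coverage` and `views` tables hold made-up probabilities, and a single context variable (a year) selects both the coverage and the view, whereas the notes allow them to depend on different context variables.

```python
import random

# Hypothetical toy parameters, chosen only for illustration.
# Each context has its own coverage: a distribution over topics.
coverage = {
    "2019": {"battery": 0.8, "screen": 0.2},
    "2020": {"battery": 0.3, "screen": 0.7},
}
# Each (topic, context) pair has its own view: a word distribution.
views = {
    ("battery", "2019"): {"battery": 0.5, "life": 0.3, "long": 0.2},
    ("battery", "2020"): {"battery": 0.5, "drain": 0.3, "fast": 0.2},
    ("screen", "2019"): {"screen": 0.6, "bright": 0.4},
    ("screen", "2020"): {"screen": 0.6, "cracked": 0.4},
}

def sample(dist, rng):
    """Draw one key from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]

def generate_document(context, length, rng):
    """Generate words one at a time: coverage -> topic -> word."""
    topic_dist = coverage[context]                   # 1. choose the coverage
    words = []
    for _ in range(length):
        topic = sample(topic_dist, rng)              # 2. use coverage to choose a topic
        word = sample(views[(topic, context)], rng)  # 3. draw a word from that topic's view
        words.append(word)
    return words

rng = random.Random(0)
doc = generate_document("2020", 8, rng)
```

Because both tables are indexed by context, the same topic produces different words in different years, and different years emphasize different topics.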

#### 12.6 - Mining Topics with Social Networking
The context of a text article can form a network
- Authors of research articles may form collaboration networks
- Authors of social media content may form social networks
- Locations associated with text can be connected to form a geographic network

Benefit of Joint Analysis of text and its network context
- Network imposes constraints on topics in text (authors connected in a network tend to write about similar topics)
- Text helps characterize the content associated with each subnetwork

###### Network Supervised Topic Modeling
$$\Lambda^{*} = argmax_{\Lambda}\ p(TextData | \Lambda)$$

where maximum likelihood estimation returns the parameters $\Lambda^{*}$.

View this as solving an optimization problem: maximize the likelihood over a set of parameters. Constraints on model parameters $\Lambda$ include:
- Text at two adjacent nodes of network tends to cover similar topics
- Topic distributions are smoothed over adjacent nodes
- Add network-induced regularizers to the likelihood objective function

$$\Lambda^{*} = argmax_{\Lambda}\ f(p(TextData | \Lambda), r(\Lambda | Network))$$

The objective function $f$ combines the likelihood $p(TextData | \Lambda)$ with a regularizer $r(\Lambda | Network)$ defined over the parameters $\Lambda$ and the network. This is like imposing a prior on the model. The text model can be any generative model for text, and the network can be any graph that connects text objects.

###### NetPLSA
NetPLSA uses a network-induced prior, which essentially means that neighbors in the network have similar topic distributions. This gives more meaningful topics.
$$O(C, G) = (1 - \lambda) \cdot \left(\sum_{d} \sum_{w} c(w, d) \log \sum_{j = 1}^k p(\theta_j | d)\, p(w | \theta_j)\right) + \lambda \cdot \left(-\frac{1}{2} \sum_{(u, v) \in E} w(u, v) \sum_{j = 1}^k (p(\theta_j | u) - p(\theta_j | v))^2\right)$$

where:
- $O(C, G)$ is the objective function over the text collection $C$ and the network graph $G$
- $\sum_{d} \sum_{w} c(w, d) \log \sum_{j = 1}^k p(\theta_j | d)\, p(w | \theta_j)$ is the PLSA log-likelihood, which we want to maximize over the parameters.
- $\sum_{j = 1}^k (p(\theta_j | u) - p(\theta_j | v))^2$ quantifies the difference in topic coverage between nodes $u$ and $v$ as a sum of squared differences, which we want to minimize. The negative sign in front means that maximizing the objective minimizes this difference, so the estimated parameters both maximize the PLSA likelihood and respect the network constraint.
- $w(u, v)$ is the weight of the edge between $u$ and $v$. This weight indicates the strength of the relationship between the nodes. A high weight means their topic coverage should be similar.
- $\lambda$ controls the influence of the network constraint. When $\lambda = 0$, the objective reverts to the PLSA log-likelihood. The larger $\lambda$ is, the greater the effect of the network constraint.
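
As a rough illustration, the objective can be evaluated directly for a given set of parameters. The sketch below only computes $O(C, G)$; it does not estimate the parameters. The function name `netplsa_objective`, the data layout, and the toy numbers are all assumptions made for this example.

```python
import math

def netplsa_objective(counts, doc_topic, topic_word, edges, lam):
    """Evaluate O(C, G) for fixed parameters.

    counts:     {doc: {word: c(w, d)}}
    doc_topic:  {doc: [p(theta_j | d) for each topic j]}
    topic_word: [{word: p(w | theta_j)} for each topic j]
    edges:      [(u, v, w(u, v))] for each edge in E
    lam:        lambda, the weight of the network regularizer
    """
    k = len(topic_word)
    # PLSA log-likelihood term.
    log_likelihood = 0.0
    for d, word_counts in counts.items():
        for w, c in word_counts.items():
            p_w = sum(doc_topic[d][j] * topic_word[j].get(w, 0.0) for j in range(k))
            log_likelihood += c * math.log(p_w)
    # Network term: weighted squared difference in topic coverage across edges.
    penalty = 0.0
    for u, v, weight in edges:
        penalty += weight * sum(
            (pu - pv) ** 2 for pu, pv in zip(doc_topic[u], doc_topic[v])
        )
    return (1 - lam) * log_likelihood + lam * (-0.5 * penalty)

# Hypothetical toy data: two connected documents and two topics.
counts = {"d1": {"battery": 2, "screen": 1}, "d2": {"screen": 3}}
doc_topic = {"d1": [0.9, 0.1], "d2": [0.2, 0.8]}
topic_word = [{"battery": 0.7, "screen": 0.3}, {"battery": 0.1, "screen": 0.9}]
edges = [("d1", "d2", 1.0)]

plsa_only = netplsa_objective(counts, doc_topic, topic_word, edges, lam=0.0)
blended = netplsa_objective(counts, doc_topic, topic_word, edges, lam=0.5)
```

With `lam=0.0` the result is the plain PLSA log-likelihood. Raising `lam` shifts weight toward the penalty for neighboring documents whose topic coverage differs.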

###### Text Information Network
We can view text data as naturally living in a rich information network together with all other related data. Text data can be associated with:
- Nodes of the network
- Edges of the network
- Paths of the network
- Subnetworks

#### 12.7 - Mining Causal Topics with Time Series Supervision
Input:
- Time series
- Text data produced in a similar time period

Output:
- Topics whose coverage in text stream has strong correlations with time series (causal topics)

###### Iterative Causal Topic Modeling
The idea is to iteratively adjust the topics discovered by a topic model, using the time series to induce a prior. Take the text stream as input, apply topic modeling to generate a collection of topics, then use the external time series to determine which topics are more causally related to it.

Look into the words of the top-ranked topics, that is, the topics most related to the time series. Then figure out which words are correlated with the time series. Separate the words based on positive and negative correlation with the time series. These correlation categories become the prior that is fed back into the topic model. This process repeats in a cycle.
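
The word-separation step can be sketched as follows, assuming Pearson correlation as the measure and a hypothetical cutoff `threshold`. The helper names and the toy series are illustrative only, and a causality test could be substituted for the correlation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)

def split_by_correlation(word_series, external, threshold=0.5):
    """Separate a topic's words into positively and negatively correlated groups."""
    positive, negative = [], []
    for word, series in word_series.items():
        r = pearson(series, external)
        if r >= threshold:
            positive.append(word)
        elif r <= -threshold:
            negative.append(word)
    return positive, negative

# Hypothetical data: per-period word counts and an external time series.
external = [1, 2, 3, 4, 5]
word_series = {
    "gain": [2, 4, 6, 8, 10],  # rises with the external series
    "loss": [5, 4, 3, 2, 1],   # falls as the external series rises
    "the": [3, 3, 3, 3, 3],    # constant, so uncorrelated
}
positive, negative = split_by_correlation(word_series, external)
```

The two resulting word groups are what would be turned into a prior for the next round of topic modeling.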

This process is a heuristic way to optimize causality and coherence. Measuring causality uses a time series $X_t$ and a time series of external information $Y_t$. Does $X_t$ cause $Y_t$? This is the causality question. Correlation, by contrast, only measures how strongly the two are related.

Imagine a plot where topic coherence is the x-axis and topic-time series causality is the y-axis. The ideal causal topics would have both high coherence and high causality. The extreme of coherence is a pure topic model. The extreme of topic-time series causality is a causal model without much context.

###### Granger Causality Test 
Uses the history of $Y$ to predict $Y$ itself. Then add the history of $X$ to see whether it improves the prediction of $Y$. If so, we can say $X$ has a causal relation with $Y$.
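
A minimal sketch of this comparison, assuming linear autoregressive models fitted by ordinary least squares. A restricted model predicts $Y$ from its own history, a full model adds the history of $X$, and an F statistic measures the improvement. The function names and the synthetic data are assumptions for illustration. A real test would also compare the statistic against a critical value.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def residual_sum_of_squares(rows, targets):
    """Fit least squares via the normal equations and return the RSS."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum(
        (t - sum(b * v for b, v in zip(beta, r))) ** 2 for r, t in zip(rows, targets)
    )

def granger_f(x, y, lag=1):
    """F statistic for 'the history of x improves the prediction of y'."""
    times = range(lag, len(y))
    targets = [y[t] for t in times]
    # Restricted model: intercept plus y's own history.
    own = [[1.0] + [y[t - i] for i in range(1, lag + 1)] for t in times]
    # Full model: the same regressors plus x's history.
    full = [row + [x[t - i] for i in range(1, lag + 1)] for row, t in zip(own, times)]
    rss_restricted = residual_sum_of_squares(own, targets)
    rss_full = residual_sum_of_squares(full, targets)
    dof = len(targets) - (1 + 2 * lag)
    return ((rss_restricted - rss_full) / lag) / (rss_full / dof)

# Synthetic data in which x drives y with a one-step delay.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [0.0]
for t in range(1, 200):
    y.append(0.5 * y[t - 1] + 2.0 * x[t - 1] + 0.1 * rng.gauss(0, 1))

f_x_to_y = granger_f(x, y)  # does x help predict y?
f_y_to_x = granger_f(y, x)  # does y help predict x?
```

On data built this way, `f_x_to_y` should be far larger than `f_y_to_x`, which reflects that the history of `x` improves the prediction of `y` but not the reverse.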