<a href="https://colab.research.google.com/github/NadiaHolmlund/BDS_M2_Exam_Notes/blob/main/BDS_M2_Exam_Notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Imports for examples

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()

# Datasets for examples

# Network Analysis

## What is a network?

A network is a system of elements (nodes/vertices) and connections (edges/links) between them. Networks are used to represent relational data and can be applied to many types of relationships between different types of elements.


nodes: system theory jargon

vertices: graph theory jargon

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Unknown)

## Types of networks

The content, meaning and interpretation of networks depend on the elements and relationships displayed. Types of networks include:

Social networks:
- Nodes/vertices represent actors (persons, firms, other socially constructed entities)

- Edges/links represent relationships between actors (friendship, interaction, co-affiliation, similarity, etc)

Other networks:
- Chemistry: Interaction between molecules
- Computer Science: The world-wide-web, inter- and intranet topologies
- Biology: Food-web, ant-hives

The possibilities to depict relational data are manifold, e.g.:

Relations among persons
- Kinship: mother of, wife of…
- Other role based: boss of, supervisor of…
- Affective: likes, trusts…
- Interaction: give advice, talks to, retweets…
- Affiliation: belong to same clubs, shares same interests…

Relations among organizations
- As corporate entities, joint ventures, strategic alliances
- Buy from / sell to, leases to, outsources to
- Owns shares of, subsidiary of
- Via their members (Personnel flows, friendship…)

## Relational data structures

### Edgelist

- A common form of storing relational data
- An edgelist is a dataframe containing at least two columns:
  - column 1: Source node of a connection
  - column 2: Target node of a connection
- Nodes are typically identified by unique IDs
- An edgelist can also contain additional columns that describe **attributes** of the edges, such as the magnitude of an edge. If the edges have a magnitude attribute (e.g. number of interactions, strength of friendship), the graph is considered **weighted**.
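A minimal sketch of an edgelist as a `pandas` dataframe. The names, column labels and weights are made up for illustration and are not taken from the course data.

```python
import pandas as pd

# Hypothetical toy data: who interacts with whom, and how often.
edgelist = pd.DataFrame(
    {
        "source": ["anna", "anna", "ben", "carl"],
        "target": ["ben", "carl", "carl", "dora"],
        "weight": [3, 1, 2, 5],  # edge attribute, makes the network weighted
    }
)

# Every distinct ID that appears in either column is a node.
nodes = sorted(set(edgelist["source"]) | set(edgelist["target"]))
print(nodes)
print(len(edgelist), "edges")
```

Dropping the `weight` column would leave an unweighted edgelist with the same connectivity.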

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/maxresdefault.jpg)


### Adjacency matrix / Socio matrix

- Represented as an n × n matrix, where n is the number of elements (nodes/vertices) whose relationships are represented
- The value in the cell at the intersection of row $i$ and column $j$ indicates whether an edge between node $i$ and node $j$ is present (=1) or absent (=0).
- An adjacency matrix can be produced by cross-tabulating an edgelist

Note the impact of directed vs. undirected vs. weighted networks on adjacency matrices
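The cross-tabulation step can be sketched as follows. The three-node edgelist is a hypothetical example; the `reindex` call is there so that the matrix stays n × n even when a node never appears as a source or as a target.

```python
import pandas as pd

# Hypothetical directed edgelist: a -> b, a -> c, b -> c.
edgelist = pd.DataFrame({"source": ["a", "a", "b"], "target": ["b", "c", "c"]})
nodes = ["a", "b", "c"]

# Directed adjacency matrix: rows are sources, columns are targets.
adj_directed = pd.crosstab(edgelist["source"], edgelist["target"]).reindex(
    index=nodes, columns=nodes, fill_value=0
)

# Undirected version: an edge counts in both directions, so the matrix is symmetric.
adj_undirected = ((adj_directed + adj_directed.T) > 0).astype(int)

print(adj_directed)
print(adj_undirected)
```

For a weighted network, the cells would hold the edge weights instead of 0/1.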

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Different-types-of-graphs-and-their-corresponding-adjacency-matrix-representations-The.ppm.png)

### Nodelist

- Contains information about the nodes themselves (the edgelist and the adjacency matrix only store connectivity patterns ***between*** nodes)
  - e.g. name, gender, age, group etc.

  ![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-29%20at%2012.00.45.png)
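A sketch of how an edgelist and a nodelist can be combined into one `networkx` graph. The IDs, names and ages are hypothetical; `from_pandas_edgelist` and `set_node_attributes` are the `networkx` functions used.

```python
import networkx as nx
import pandas as pd

# Hypothetical edgelist (connectivity) and nodelist (node characteristics).
edgelist = pd.DataFrame(
    {"source": ["n1", "n1", "n2"], "target": ["n2", "n3", "n3"], "weight": [3, 1, 2]}
)
nodelist = pd.DataFrame(
    {"id": ["n1", "n2", "n3"], "name": ["Anna", "Ben", "Carl"], "age": [31, 45, 27]}
)

# Build the graph from the edgelist, keeping the weight as an edge attribute.
G = nx.from_pandas_edgelist(edgelist, source="source", target="target", edge_attr="weight")

# Attach the node characteristics, keyed by node ID.
nx.set_node_attributes(G, nodelist.set_index("id").to_dict(orient="index"))

print(G.nodes["n1"])
print(G.edges["n1", "n2"])
```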

## Network graphs

### Graph objects

Tabular data dependency
- Between-observation dependency: Summary statistics of variables are interdependent between observations (column-wise), meaning that changing a value of some observation will change the corresponding variable's summary statistics.
- Within-observation dependency: Summary statistics of variables are interdependent within observations (row-wise), meaning that changing a variable value might change the summary statistics of the observation.
- Otherwise, values are (at least mathematically) independent

Graph data dependency
- Above holds true, but graph data holds additional dependencies due to the relational structure of data.
- E.g. adding/removing node(s) may imply adding/removing edge(s) and adding/removing edge(s) may change the characteristics of node(s), due to their relational interdependence

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-29%20at%2012.24.25.png)


### Graph concepts and terminology

- The vertices ***u*** and ***v*** are called the end vertices of the edge ***(u,v)***
- If two edges have the same end vertices they are ***Parallel***
- An edge of the form ***(v,v)*** is a ***loop***
- A Graph is ***simple*** if it has no parallel edges and no loops
- A Graph is said to be ***Empty*** if it has no edges, meaning ***E*** is empty
- A Graph is a ***Null Graph*** if it has no vertices, meaning both ***V*** and ***E*** are empty
- Edges are ***Adjacent*** if they have a common vertex. Vertices are ***Adjacent*** if they have a common edge
- The ***degree*** of the vertex ***v***, written as ***d(v)***, is the number of edges with v as an end vertex. By convention, we count a loop twice and parallel edges contribute separately
- ***Isolated*** Vertices are vertices with degree 0.
- A Graph is ***Complete*** if its edge set contains every possible edge between ALL of the vertices
- A ***Walk*** in a Graph ***G = (V,E)*** is a finite, alternating sequence of the form $v_0, e_1, v_1, e_2, \dots, e_k, v_k$ consisting of vertices and edges of the graph ***G***, where each edge $e_i$ joins $v_{i-1}$ and $v_i$
- A ***Walk*** is ***Open*** if the initial and final vertices are different. A ***Walk*** is ***Closed*** if the initial and final vertices are the same
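Some of the terms above can be checked on a small example. This is a sketch on a made-up graph; a `MultiGraph` is used because it allows parallel edges, and `networkx` follows the convention of counting a loop twice in the degree.

```python
import networkx as nx

G = nx.MultiGraph()  # allows parallel edges and loops
G.add_nodes_from(["u", "v", "w", "x"])
G.add_edge("u", "v")
G.add_edge("u", "v")  # parallel to the previous edge
G.add_edge("v", "v")  # loop
G.add_edge("v", "w")

# Degree: parallel edges contribute separately, the loop counts twice.
degrees = dict(G.degree())
print(degrees)

# Isolated vertices have degree 0.
isolated = list(nx.isolates(G))
print(isolated)

# A complete graph on n vertices has n * (n - 1) / 2 edges.
K4 = nx.complete_graph(4)
print(K4.number_of_edges())
```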

#### Self-loops

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2007.48.44.png)

### UPDATE Types of graphs

1. Weighted vs. Unweighted
2. Directed vs. Undirected
3. Unimodal vs. Multimodal
4. Unidimensional vs. Multidimensional

`networkx` graph classes
1. Graph
2. DiGraph
3. MultiGraph
4. MultiDiGraph
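A short sketch of how the four `networkx` classes differ when the same pair of nodes is connected more than once. The node names and attributes are hypothetical.

```python
import networkx as nx

# Graph: undirected, at most one edge per node pair.
G = nx.Graph()
G.add_edge("a", "b", weight=2)
G.add_edge("b", "a", weight=5)  # same edge, the attribute is overwritten

# DiGraph: directed, a -> b and b -> a are different edges.
D = nx.DiGraph()
D.add_edge("a", "b")
D.add_edge("b", "a")

# MultiGraph: undirected, parallel edges allowed (e.g. several relation types).
M = nx.MultiGraph()
M.add_edge("a", "b", relation="friend")
M.add_edge("a", "b", relation="colleague")

# MultiDiGraph: directed and parallel edges allowed.
MD = nx.MultiDiGraph()
MD.add_edge("a", "b", relation="emails")
MD.add_edge("a", "b", relation="calls")
MD.add_edge("b", "a", relation="emails")

for name, graph in [("Graph", G), ("DiGraph", D), ("MultiGraph", M), ("MultiDiGraph", MD)]:
    print(name, graph.number_of_edges())
```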

#### Weighted vs. Unweighted

Weighted: edges have values associated with them

Unweighted: edges either exist or do not

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2007.48.07.png)

#### Directed vs. Undirected

Directed: edges are not necessarily reciprocated

Undirected: edges are always mutual

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2007.46.51.png)

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2007.46.34.png)

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2007.46.24.png)

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2007.46.01.png)

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2007.47.34.png)

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2007.47.05.png)

#### Unimodal vs. Multimodal

Unimodal networks (1-mode): include only one type of node (e.g. all nodes represent people)

![](https://toreopsahl.files.wordpress.com/2009/04/fig1_twomode_simple.png)

Multimodal (2-mode / Bipartite / Bimodal): include more than one type of node (e.g. people and research papers)

![](https://toreopsahl.files.wordpress.com/2009/04/fig1_twomode_half.png)

#### Unidimensional vs. Multidimensional

Unidimensional: includes one type of edges

Multidimensional: Includes multiple types of edges (can be analysed as multiplex network or multiple distinct networks)

Multidimensional networks are a special type of multilayer networks with multiple types of relations. They also consist of nodes and edges, but the nodes exist in separate layers, representing different forms of interactions, which connect to form an aspect. Aspects (or stacks of layers) can be used to represent different types of contacts, spatial locations, subsystems, or points in time

Example:
The multiplex social network of Star Wars saga. Each layer denotes a different episode and two nodes are connected to each other if the corresponding characters acted together in one or more scenes.

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Muxviz_Star_Wars_Social_Network.png)

#### Visualizing networks

##### Matrix plots

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2007.50.20.png)

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2007.50.58.png)

##### Arc plots

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2007.51.33.png)

##### Circos plots

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2007.52.19.png)

#### Irrational vs. rational graphs

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2007.49.42.png)

### NetworkX library

A graph object is a specific data structure which contains the nodelist and edgelist jointly, and enables the application of graph algorithms on them. We work with the [`networkx`](https://networkx.github.io/documentation/stable/index.html) library, which is the standard for network analysis in the Python community.

In [None]:
import networkx as nx # Main network analysis library

In NetworkX, graph data are stored in a dictionary-like fashion.
They are placed under a `Graph` object,
canonically instantiated with the variable `G` as follows:

```python
G = nx.Graph()
```

Of course, you are free to name the graph anything you want!

Nodes are part of the attribute `G.nodes`.
There, the node data are housed in a dictionary-like container,
where the key is the node itself
and the values are a dictionary of attributes. 
Node data are accessible using syntax that looks like:

```python
G.nodes[node1]
```

Edges are part of the attribute `G.edges`,
which is also stored in a dictionary-like container.
Edge data are accessible using syntax that looks like: 

```python
G.edges[node1, node2]
```
Because of the dictionary-like implementation of the graph,
any hashable object can be a node.
This means strings and tuples, but not lists and sets.
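The access patterns above can be tried on a small graph. The node names and attributes below are made up for illustration.

```python
import networkx as nx

G = nx.Graph()
G.add_node("anna", age=31)
G.add_node(("paper", 7), year=2020)  # a tuple is hashable, so it can be a node
G.add_edge("anna", ("paper", 7), role="author")

print(G.nodes["anna"])                # attribute dictionary of the node
print(G.edges["anna", ("paper", 7)])  # attribute dictionary of the edge

# A list is not hashable, so it cannot be used as a node.
try:
    G.add_node(["a", "list"])
    list_accepted = True
except TypeError:
    list_accepted = False
print(list_accepted)
```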

## Local network structure (node-level measures)

Methods to summarise the pattern of node connectivity in order to infer something about the nodes' characteristics.

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-29%20at%2012.38.56.png)

### Degree centrality

- Counts the number of edges adjacent to a node.
- Formally, the degree of node $i$ is the number of existing edges $e_{ij}$ with other nodes $j$ in a network with $n$ nodes:

$$d_{i} =\sum\limits_{j=1}^{n} e_{ij} ~ where: ~ i \neq j$$

**Degree centrality in directed networks**

In directed networks, a node-pair has two different roles:

* **Ego:** The node the edge originates from.
* **Alter:** The node the edge leads to.

Network metrics have to take directionality into account. For example, degree centrality is now differentiated between the
- **in-degree** centrality (how many edges lead to the node)
- **out-degree** centrality (how many edges originate from the node)
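A sketch of degree, in-degree and out-degree on two made-up graphs. The row sums of the adjacency matrix reproduce the degree formula; `nx.degree_centrality` additionally divides by n - 1.

```python
import networkx as nx

# Undirected toy graph.
G = nx.Graph([("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")])
order = ["a", "b", "c", "d"]

# Degree as the row sum of the adjacency matrix (sum of e_ij over j).
A = nx.to_numpy_array(G, nodelist=order)
degree_from_matrix = dict(zip(order, A.sum(axis=1).astype(int).tolist()))
degree_from_nx = dict(G.degree())
normalised = nx.degree_centrality(G)  # degree divided by (n - 1)

# Directed toy graph: in-degree and out-degree differ.
D = nx.DiGraph([("a", "b"), ("a", "c"), ("c", "a"), ("d", "a")])
in_degree = dict(D.in_degree())
out_degree = dict(D.out_degree())

print(degree_from_matrix, normalised)
print(in_degree, out_degree)
```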

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-29%20at%2012.43.47.png)

### Eigenvector centrality

- Weighs a node's degree centrality by the centrality of the nodes adjacent to it (and their centrality in turn by their centrality).

$$x_{v}={\frac {1}{\lambda }}\sum _{t\in M(v)}x_{t}={\frac {1}{\lambda }}\sum _{t\in G}a_{v,t}x_{t}$$
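The formula says that the centrality vector $x$ is an eigenvector of the adjacency matrix, $Ax = \lambda x$. A minimal sketch, assuming a small made-up connected graph, finds it by repeatedly multiplying with $A$ (power iteration) and compares the result with `nx.eigenvector_centrality`.

```python
import networkx as nx
import numpy as np

# Toy graph: a triangle (a, b, c) with one extra node d attached to c.
G = nx.Graph([("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")])
nodes = list(G.nodes)
A = nx.to_numpy_array(G, nodelist=nodes)

# Power iteration: each node repeatedly receives the sum of its neighbours' scores.
x = np.ones(len(nodes))
for _ in range(200):
    x = A @ x
    x = x / np.linalg.norm(x)  # rescale to unit length

lam = x @ A @ x  # estimate of the leading eigenvalue (x has unit length)
by_hand = dict(zip(nodes, x))
from_nx = nx.eigenvector_centrality(G, max_iter=1000, tol=1e-10)

print(by_hand)
print(from_nx)
```

Node `c` gets the highest score: it has the highest degree and its neighbours are themselves well connected.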

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-29%20at%2012.48.08.png)

### Betweenness centrality

- Measures the extent to which a node lies on shortest paths between other nodes.
- A higher betweenness indicates that a node lies on more short paths and hence should somehow be important for traversing between different parts of a network.

In formulaic representation

* The geodesic betweenness $B_{n}(i)$ of a **vertex** in a weighted, undirected network is

$$B_{n}(i) =  \sum_{s,t \in G} \frac{ \Psi_{s,t}(i) }{\Psi_{s,t}}$$
where vertices $s,t,i$ are all different from each other

* $\Psi_{s,t}$ denotes the number of shortest paths (geodesics) between vertices $s$ and $t$
* $\Psi_{s,t}(i)$ denotes the number of shortest paths (geodesics) between vertices $s$ and $t$ **that pass through vertex** $i$.
* The geodesic betweenness $B_n$ of a network is the mean of $B_n(i)$ over all vertices $i$
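The betweenness formula can be sketched directly. The example uses a small made-up, unweighted and connected graph, and the helper `betweenness_by_hand` is an illustration written for these notes, not a library function.

```python
from itertools import combinations

import networkx as nx

# Toy graph: a 4-cycle (0-1-2-3-0) with an extra node 4 attached to node 0.
G = nx.cycle_graph(4)
G.add_edge(0, 4)

def betweenness_by_hand(G, i):
    """Sum over pairs (s, t) of the share of shortest s-t paths passing through i."""
    total = 0.0
    for s, t in combinations(G.nodes, 2):
        if i in (s, t):
            continue
        paths = list(nx.all_shortest_paths(G, s, t))
        total += sum(i in path for path in paths) / len(paths)
    return total

by_hand = {i: betweenness_by_hand(G, i) for i in G.nodes}
from_nx = nx.betweenness_centrality(G, normalized=False)

print(by_hand)
print(from_nx)
```

Node 0 scores highest because every shortest path to node 4 has to pass through it.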

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-29%20at%2012.51.47.png)

### Neighborhood

- Examines the surroundings of a node in terms of the nodes it is connected to, i.e. its neighborhood
- Ego-network of node: How many nodes are in a certain geodesic distance (meaning the shortest path), i.e. how many nodes are not more than x-steps away.
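A sketch of neighborhoods and ego-networks on a made-up path graph 0-1-2-3-4-5, where `radius` and `cutoff` play the role of the "x steps" above.

```python
import networkx as nx

G = nx.path_graph(6)  # 0 - 1 - 2 - 3 - 4 - 5

# Direct neighbors of node 2.
neighbors = sorted(G.neighbors(2))

# Ego-network of node 0: all nodes at most 2 steps away, plus the edges among them.
ego = nx.ego_graph(G, 0, radius=2)

# Geodesic distance from node 0 to every node within 2 steps.
distances = nx.single_source_shortest_path_length(G, 0, cutoff=2)

print(neighbors)
print(sorted(ego.nodes))
print(dict(distances))
```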

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-29%20at%2012.57.16.png)

### UPDATE Clustering (Community detection)
what is within and between network connectivity??

- Group nodes based on graph topology (sometimes referred to as community detection based on its commonality in social network analysis)
- Main logic: Form groups which have a ***maximum within-connectivity*** and a ***minimum between-connectivity***.
- Consequently: Nodes in the same community should have a higher probability of being connected than nodes from different communities.

**Community clustering in directed networks**

Most community detection algorithms implemented in `NetworkX` only work with undirected networks. So, we can do 2 things to handle these:

1. Convert the network into an undirected one.
2. Use the "edge betweenness" algorithm, the only one implemented that can handle directed networks.

There are (just like for clustering of tabular data in UML) many different algorithms and approaches to detect and delineate communities. [Here](https://github.com/benedekrozemberczki/awesome-community-detection) you find a summary of currently used approaches.

Example: The Louvain Algorithm

One of the most widely used community detection algorithms. It usually delivers good results, scales well, and can handle weighted networks. Furthermore, there is an actively maintained, easy to use Python implementation, [`python-louvain`](https://python-louvain.readthedocs.io).

It optimises a quantity called modularity:

$$  \sum_{ij} (A_{ij} - \lambda P_{ij}) \delta(c_i,c_j) $$

$A$ - The adjacency matrix

$P_{ij}$ - The expected connection between $i$ and $j$.

$\lambda$ - Resolution parameter

Can use lots of different forms for $P_{ij}$ but the standard one is the so called configuration model:

$P_{ij} = \frac{k_i k_j}{2m}$, where $k_i$ is the degree of node $i$ and $m$ is the total number of edges.

Loosely speaking, in an iterative process:
- You take a node and try to aggregate it to one of its neighbours.
- You choose the neighbour that maximizes a modularity function.
- Once you have iterated through all the nodes, you will have merged some nodes together and formed some communities.
- This becomes the new input for the algorithm that will treat each community as a node and try to merge them together to create bigger communities.
- The algorithm stops when it’s not possible to improve modularity any more.
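To make the objective concrete, the sketch below computes modularity by hand for three partitions of a made-up graph (two triangles joined by one edge) and checks it against `networkx`. It assumes the configuration model for $P_{ij}$ and adds the usual normalisation by $2m$, which the formula above leaves out. It only evaluates the quantity that Louvain optimises; it does not implement the merging steps.

```python
import networkx as nx
import numpy as np
from networkx.algorithms import community

# Two triangles (0, 1, 2) and (3, 4, 5) joined by the edge 2-3.
G = nx.Graph([(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)])
nodes = sorted(G.nodes)
A = nx.to_numpy_array(G, nodelist=nodes)
k = A.sum(axis=1)  # degrees k_i
two_m = A.sum()    # 2m, twice the number of edges

def modularity_by_hand(partition, resolution=1.0):
    """(1 / 2m) * sum_ij (A_ij - resolution * k_i k_j / 2m) * delta(c_i, c_j)."""
    label = {n: c for c, block in enumerate(partition) for n in block}
    same = np.array([[label[i] == label[j] for j in nodes] for i in nodes])
    P = np.outer(k, k) / two_m
    return ((A - resolution * P) * same).sum() / two_m

partitions = {
    "singletons": [{n} for n in nodes],
    "two triangles": [{0, 1, 2}, {3, 4, 5}],
    "one block": [set(nodes)],
}
scores = {name: modularity_by_hand(p) for name, p in partitions.items()}
print(scores)
```

The two-triangle partition has the highest modularity of the three, which is the kind of grouping a modularity-maximising algorithm is looking for. Recent `networkx` versions also ship a Louvain implementation in `networkx.algorithms.community`.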

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-29%20at%2013.04.56.png)

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-29%20at%2013.06.10.png)

### Assortativity

- Measures if two nodes that share certain characteristics have a higher or lower probability to be connected.
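A sketch with a made-up node attribute `group`: when edges only connect nodes of the same group the coefficient is 1, when edges only connect nodes of different groups it is -1.

```python
import networkx as nx

groups = {"a": "x", "b": "x", "c": "y", "d": "y"}

# Edges only within groups.
G_same = nx.Graph([("a", "b"), ("c", "d")])
nx.set_node_attributes(G_same, groups, "group")

# Edges only between groups.
G_mixed = nx.Graph([("a", "c"), ("b", "d")])
nx.set_node_attributes(G_mixed, groups, "group")

r_same = nx.attribute_assortativity_coefficient(G_same, "group")
r_mixed = nx.attribute_assortativity_coefficient(G_mixed, "group")
print(r_same, r_mixed)
```

`nx.degree_assortativity_coefficient` does the same with the node degree as the characteristic.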


### Reciprocity

- Measures if directed edges are reciprocated, meaning that an edge between `i,j` makes an edge between `j,i` more likely
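A sketch on a made-up directed graph: reciprocity as the share of edges whose reverse edge also exists, computed by hand and with `nx.reciprocity`.

```python
import networkx as nx

# a <-> b is reciprocated, a -> c is not.
D = nx.DiGraph([("a", "b"), ("b", "a"), ("a", "c")])

by_hand = sum(D.has_edge(v, u) for u, v in D.edges()) / D.number_of_edges()
from_nx = nx.reciprocity(D)
print(by_hand, from_nx)
```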

## Global network structure (overall-level measures)

### Density

- The density of a network is the share of all possible connections that are actually present in the network.

### Transitivity / Clustering Coefficient

- Transitivity, also called the clustering coefficient, indicates how much the network tends to be locally clustered. It is measured by the share of closed triplets.

### Diameter

- The diameter is the longest of the shortest paths between two nodes of the network.

### Mean distance / Average path length

- The mean distance / average path length is the mean of the shortest paths between all pairs of nodes. It is a measure of diffusion potential within a network.
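The four global measures can be sketched on one made-up graph: a triangle (a, b, c) with an extra node d attached to c. The values in the comments follow from counting by hand on this graph.

```python
import networkx as nx

G = nx.Graph([("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")])

density = nx.density(G)            # 4 of 6 possible edges exist
transitivity = nx.transitivity(G)  # 3 closed triplets out of 5 connected triplets
diameter = nx.diameter(G)          # longest shortest path, e.g. from a to d
mean_distance = nx.average_shortest_path_length(G)  # (1+1+2+1+2+1) / 6

print(density, transitivity, diameter, mean_distance)
```

Diameter and mean distance are only defined for connected graphs; `networkx` raises an error otherwise.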

## Small worlds

Small worlds are an interesting network structure, combining short path lengths between the nodes with a high clustering coefficient. That means that we have small interconnected clusters, which are in turn connected by **gatekeepers** (the edges we call **bridges** or **structural holes**).

A small-world network is a type of mathematical graph in which most nodes are not neighbors of one another, but the neighbors of any given node are likely to be neighbors of each other and most nodes can be reached from every other node by a small number of hops or steps.

Mathematically, small world networks of size n have an average distance O(log n), meaning that between any two random nodes, the expected distance is O(log n).

⟨L⟩ ∝ log n
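A sketch of the small-world effect using the Watts-Strogatz model, which is one common way to generate such networks and is an assumption here (the notes do not name a model). A ring lattice is compared with a version in which 10% of the edges are rewired at random; `n`, `k`, `p` and the seed are arbitrary choices.

```python
import networkx as nx

n, k = 200, 6  # 200 nodes, each connected to its 6 nearest neighbours

# p = 0: regular ring lattice, highly clustered but with long paths.
lattice = nx.watts_strogatz_graph(n, k, p=0.0, seed=42)
# p = 0.1: a few random shortcuts are enough to shorten the paths a lot.
small_world = nx.connected_watts_strogatz_graph(n, k, p=0.1, seed=42)

c_lattice = nx.average_clustering(lattice)
l_lattice = nx.average_shortest_path_length(lattice)
c_small = nx.average_clustering(small_world)
l_small = nx.average_shortest_path_length(small_world)

print("ring lattice:", round(c_lattice, 3), round(l_lattice, 3))
print("small world: ", round(c_small, 3), round(l_small, 3))
```

The rewired network keeps a high clustering coefficient while its average path length drops well below that of the lattice.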

Small-world network example (hubs are drawn bigger than other nodes):
- Average degree = 3.833
- Average shortest path length = 1.803
- Clustering coefficient = 0.522

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Small-world-network-example.png)

## Similarity networks

- Constructed by mapping similarity between all observations, e.g. using Cosine Distance
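A minimal sketch of a similarity network, assuming three made-up observations with three features each. Cosine similarity (1 minus the cosine distance) is computed for all pairs, and only pairs above an arbitrary threshold of 0.8 become edges.

```python
import networkx as nx
import numpy as np

# Hypothetical feature matrix: one row per observation.
X = np.array(
    [
        [1.0, 0.0, 1.0],
        [1.0, 0.0, 0.9],
        [0.0, 1.0, 0.0],
    ]
)

# Cosine similarity between all rows.
unit = X / np.linalg.norm(X, axis=1, keepdims=True)
S = unit @ unit.T
np.fill_diagonal(S, 0.0)  # no self-loops

# Keep only strong similarities as weighted edges.
W = np.where(S >= 0.8, S, 0.0)
G = nx.from_numpy_array(W)

print(G.edges(data=True))
```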

## Multimodal network analysis

Multi-modal means a network has several "modes", i.e. it connects entities on different conceptual levels. The most common is a **2-mode** (or **bipartite**) network. 

Examples could be:

* Author $\rightarrow$ Paper
* Inventor $\rightarrow$ Patent
* Member $\rightarrow$ Club network. 

Here, elements in different modes represent different things. In real-life research examples you find 2-mode networks in for instance:
- co-occurrence (2 actors mentioned in the same news article)
- co-affiliation (2 actors are members of the same association)
- co-characteristics (2 actors both like to talk about a certain topic on twitter).

### Network Projections

Two-mode networks are rarely analysed in their original form. Although this is preferable, few methods exist for that purpose. As such, these networks are often transformed into one-mode networks (only one type of nodes) to be analysed. This procedure is often referred to as projection. Projection is done by selecting one of the sets of nodes and linking two nodes from that set if they were connected to the same node (of the other kind).

We can analyse the modes separately (and sometimes we should), but often it's helpful to *project* them onto one mode. Here, we create an edge between two nodes of one mode through their joint association with a node of the other mode.
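A sketch of a projection with the `bipartite` module of `networkx`, using a made-up author-paper network. Two authors are linked in the projection if they share at least one paper; the weighted variant counts the shared papers.

```python
import networkx as nx
from networkx.algorithms import bipartite

authors = ["anna", "ben", "carl"]
papers = ["p1", "p2", "p3"]

# Two-mode network: edges only run between authors and papers.
B = nx.Graph()
B.add_nodes_from(authors, bipartite=0)
B.add_nodes_from(papers, bipartite=1)
B.add_edges_from(
    [("anna", "p1"), ("ben", "p1"), ("anna", "p3"), ("ben", "p3"), ("ben", "p2"), ("carl", "p2")]
)

# One-mode projection onto the authors.
P = bipartite.projected_graph(B, authors)
# Weighted projection: the weight is the number of shared papers.
W = bipartite.weighted_projected_graph(B, authors)

print(sorted(P.edges))
print(W.edges(data="weight"))
```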

2-mode

![](https://toreopsahl.files.wordpress.com/2009/04/fig1_twomode_half.png)

1-mode

![](https://toreopsahl.files.wordpress.com/2009/04/fig1_twomode_simple.png)

![](https://www.dropbox.com/s/e4vnq7kh24pyu0t/networks_2mode.png?dl=1)

Particularly in citation networks, we can also use the implicit 2-mode structure of $Publications \rightarrow Citation$

That helps us to apply some interesting metrics, such as:

* direct citations
* Bibliographic coupling
* Co-citations

Interestingly, different projections of this 2-mode network give the whole resulting 1-mode network a different meaning.

![](https://www.dropbox.com/s/f8g8nr83lucvpqx/networks_biblio.png?dl=1)


### Weighted Network Projection

In a similar spirit as the method used by Newman (2001), it is also possible to discount for the number of nodes when projecting weighted two-mode networks.

 
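For example, it could be argued that if many online users post to a thread, their ties should be weaker than if there were few people posting to the thread. A straightforward generalisation is the following function: $w_{ij} = \sum_p \frac{w_{i,p}}{N_p - 1}$, where the sum runs over the nodes $p$ of the other mode that $i$ and $j$ share, $w_{i,p}$ is the weight of the tie between $i$ and $p$, and $N_p$ is the number of nodes connected to $p$.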
 
 This formula would create a directed one-mode network in which the out-strength of a node is equal to the sum of the weights attached to the ties in the two-mode network that originated from that node. For example, node C has a tie with a weight of 5 in the two-mode network and an out-strength of 5 in the one-mode projection.
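The formula can be sketched in plain Python. The users, threads and weights below are hypothetical and are not the values from the figure; the sketch also checks the statement that a node's out-strength in the projection equals the sum of its two-mode tie weights.

```python
from collections import defaultdict

# Hypothetical weighted two-mode network: (user, thread) -> number of posts.
two_mode = {
    ("A", "t1"): 2,
    ("B", "t1"): 1,
    ("C", "t1"): 5,
    ("A", "t2"): 3,
    ("B", "t2"): 4,
}

# Group the ties by thread: thread -> {user: weight}.
members = defaultdict(dict)
for (user, thread), weight in two_mode.items():
    members[thread][user] = weight

# w_ij = sum over shared threads p of w_ip / (N_p - 1).
projected = defaultdict(float)
for thread, ties in members.items():
    n_p = len(ties)
    if n_p < 2:
        continue  # a thread with a single user creates no ties
    for i, w_ip in ties.items():
        for j in ties:
            if i != j:
                projected[(i, j)] += w_ip / (n_p - 1)

# Out-strength of each user in the directed one-mode projection.
out_strength = defaultdict(float)
for (i, j), w in projected.items():
    out_strength[i] += w

print(dict(projected))
print(dict(out_strength))
```

The projection is directed: here the tie from A to B differs from the tie from B to A because the two users posted different amounts.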

![](https://toreopsahl.files.wordpress.com/2009/04/fig1_twomode_forum_newman2001.png)

# Natural Language Processing

Introduction to statistical natural language processing, focusing on NLP in combination with supervised ML as well as topic modelling for unsupervised approaches.

One text vs. many texts, or one long text vs. many short ones

Why does this make a difference, and how? It determines the level of analysis, i.e. what we are looking at.

Long texts:
- Look at entities (e.g. persons, characters, places mentioned in the book)
- Look at terms, i.e. the use of certain types of language

Short texts (what we mostly work with in NLP):
- Compare them
- Find out what they are about

The conclusion is that our analysis goes in one of two directions: either we analyse the text as such (what is the statement in the text) or we analyse elements of the text. This is a somewhat idealised distinction.

Elements of text (old-school NLP):
- Entities: who are the persons mentioned in this long text, e.g. within a window of 15 words
- Language use: what kind of verbs and adjectives are around
- N-grams: combinations of two or three terms that help define what the text is about
- Relations between terms
- Relations between entities

Text as such (linked more to machine learning):
- Sentiment: does a text express a certain sentiment towards what it describes? This can be a regression problem, e.g. a more joyful vs. a less joyful description.
- Predicting a class: given texts describing different types of restaurants, read the description and try to predict take-away, high-end etc. (i.e. classification in SML)
- Topics: what is the topical composition of the texts, e.g. tweets? Which patterns and topics emerge again and again, and how do they change over time?
- Similarity: given a piece of text, find other texts that are semantically similar. Not only in terms of keywords (even though that could be one way), but also texts that talk about the same topics in the same way, going beyond counting keywords.

## Container model

How do we turn a piece of text into something readable for computers?

Scenario: we have the sentence "this is a very nice house". How do we translate this into something we can use for NLP?

One approach is to think about the sentence as a container: it contains words. Meaning is represented through combinations of elements. This is the container model.

If the sentence is a container for terms, then each term carries some meaning, and the combination of terms is the meaning of the sentence. Which terms bear more meaning is debatable: "house" is important, and so are "nice" and "very", but "this is a" is not so interesting.

There are relationships between the entities, so we can represent them as network structures. We can even say they are directed relationships.

How to get there: look at the grammar. What are the subjects, what are the objects, and how are they related? Packages exist for this. This means turning texts into elements and performing analysis on the elements. It is tedious work, but it gives many insights.

This is the pipeline from text to network analysis: represent the text as a relational structure, create an edgelist etc.
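A deliberately simplified sketch of the text-to-edgelist step. Instead of a grammatical parse (subjects, objects), which would need an NLP package, it just links each remaining term to the next one after dropping a hand-picked, hypothetical stop word list.

```python
import pandas as pd

sentence = "this is a very nice house"
stopwords = {"this", "is", "a"}  # hypothetical minimal stop word list

# Keep only the terms that carry meaning.
tokens = [t for t in sentence.lower().split() if t not in stopwords]

# Directed edge from each term to the term that follows it.
edgelist = pd.DataFrame(list(zip(tokens, tokens[1:])), columns=["source", "target"])

print(tokens)
print(edgelist)
```

The resulting dataframe has the same source/target shape as any other edgelist, so it can be fed into the network analysis tools directly.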

## Bag of words model

How do we go from text to the kind of data we can use in an ML approach? We need tabular data: one row per text, with features as columns.

What could the features be? The most intuitive choice is that f1 to fn are the vocabulary contained in the whole collection of texts, i.e. all of the words used across the texts. This is called the ***bag of words representation of text***.

Example:
- the fat cat sits on the mat.
- the cat ate a fat rat
- the dog is sad

The first thing we can do is kick out the words that don't carry much meaning, e.g. articles and other function words (the, is, a etc.), prepositions and full stops.

Normalizing text: stemming or lemmatizing. We normalize the vocabulary by removing gender, declensions and conjugations (which change a word depending on whether it is plural, singular etc.) and by transforming past tense into present.

Reason to normalize: the meaning stays the same, as long as what you want to figure out is related to meaning and not so much to gender, time etc.

This leaves a vocabulary of fat, cat, sit, mat, eat, rat, dog, sad.

Take the vocabulary and make a table out of it. Then go through each sentence and put a 1 or a 0 depending on whether the word is there or not:

| | fat | cat | sit | mat | eat | rat | dog | sad |
|---|---|---|---|---|---|---|---|---|
| the fat cat sits on the mat. | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| the cat ate a fat rat | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
| the dog is sad | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |

The result is a sparse matrix, but it represents the texts in tabular form. From here you can use it in ML. That's the bag of words model in NLP.
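The steps above can be sketched in a few lines. The stop word list and the lemma lookup are tiny hand-made stand-ins for what an NLP library would provide.

```python
import pandas as pd

docs = ["the fat cat sits on the mat.", "the cat ate a fat rat", "the dog is sad"]

stopwords = {"the", "on", "a", "is"}     # hypothetical stop word list
lemmas = {"sits": "sit", "ate": "eat"}   # hypothetical lemma lookup

def preprocess(text):
    """Lowercase, drop full stops and stop words, map words to their lemma."""
    tokens = text.lower().replace(".", "").split()
    return [lemmas.get(t, t) for t in tokens if t not in stopwords]

tokenized = [preprocess(d) for d in docs]

# Vocabulary: unique terms in order of first appearance.
vocab = list(dict.fromkeys(t for doc in tokenized for t in doc))

# Bag of words: 1 if the term occurs in the document, 0 otherwise.
bow = pd.DataFrame(
    [[int(term in doc) for term in vocab] for doc in tokenized],
    columns=vocab,
)
print(bow)
```

Each row of `bow` is one text and each column one vocabulary term, so the table can be used as the feature matrix of an ML model.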


## TF-IDF (term frequency - inverse document frequency)

Unimportant words: some words appear in all texts, e.g. the word "president" in political texts. Such a word is probably not that important for the individual sentences.

What about originality? An original word is probably more important than a word that is very general.

The most famous way to deal with this is TF-IDF.

What does it do? It weights terms: if a word is all over the place, it may not be as important as words that don't appear that much.

The weight of term $x$ in sentence $y$ is the term frequency of $x$ in $y$ times the logarithm of the total number of documents $N$ divided by $df_x$ (the number of documents that contain $x$, which cannot be larger than $N$):

$$w_{x,y} = tf_{x,y} \cdot \log\left(\frac{N}{df_x}\right)$$

The closer $N / df_x$ is to one (i.e. the closer the logarithm is to zero), the lower the importance of the word.

TF-IDF discounts general terms and highlights specific terms. It is really just looking at a frequency distribution.
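A sketch of the weighting on three tiny, already tokenized documents. The word "the" is deliberately kept in to show that a term occurring in every document gets weight 0. This is the plain textbook formula; library implementations often use smoothed variants and give slightly different numbers.

```python
import numpy as np
import pandas as pd

docs = [
    ["the", "fat", "cat", "sit", "the", "mat"],
    ["the", "cat", "eat", "fat", "rat"],
    ["the", "dog", "sad"],
]
vocab = sorted({t for d in docs for t in d})

# Term frequency: how often each term occurs in each document.
tf = pd.DataFrame([[d.count(t) for t in vocab] for d in docs], columns=vocab)

# Document frequency: in how many documents each term occurs.
df = (tf > 0).sum(axis=0)
N = len(docs)

# Weight = term frequency * log(N / document frequency).
tfidf = tf * np.log(N / df)
print(tfidf.round(3))
```

"dog" only occurs in one document and gets the highest weight there, "cat" occurs in two and gets a lower weight, and "the" is discounted to zero.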



## From bag of words (BoW) to topic modelling and embeddings

Unsupervised ML

BoW and TF-IDF are good enough for many supervised ML tasks, e.g. a classifier into different categories such as sentiment analysis. Performance can be quite alright with BoW but will usually be better with TF-IDF.

How can we go from sparse to more dense? With LSI (latent semantic indexing), also called LSA (latent semantic analysis). It is basically based on singular value decomposition (SVD); NMF (non-negative matrix factorization) is an alternative that is similar in terms of mathematics and output.

Simplified example:

We have a matrix of words and documents. It is very sparse (usually the case with texts). We can transform it with a dimensionality reduction approach to make it dense. The result is one matrix of documents vs. topics and one matrix of topics vs. vocabulary.

The dense matrix identifies topics. The topics are similar to the components we know from UML. They are often easy to interpret because the way we speak and write is logical, so linear algebra can help to uncover latent topics in text corpora.

- Each document is represented as a combination of topics.
- At the same time, there is a matrix that shows to which extent terms contribute to topics: what is the relationship between certain terms and certain topics, what are the most prominent topics, and which words represent them?
- The result is a lower-dimensional matrix that we can use for SML and for similarity (e.g. cosine similarity).

The advantage of computing similarity on the reduced matrix compared to BoW/TF-IDF is that two documents can be similar even if they don't share any words but contain related terms such as synonyms. With BoW you need exact common terms. You don't need that in LSI/LSA because it is based on topics, not words. This gets us closer to semantic similarity.
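A sketch of LSA with a plain `numpy` SVD on a made-up document-term matrix. Documents 0 and 1 share no words, but both are linked to document 2, so they end up close together in the reduced topic space. The number of topics (`n_topics = 2`) is an arbitrary choice.

```python
import numpy as np

terms = ["cat", "feline", "meow", "purr", "stock", "market", "bank"]
# Rows are documents, columns are terms (bag of words counts).
X = np.array(
    [
        [1, 0, 1, 0, 0, 0, 0],  # d0: cat meow
        [0, 1, 0, 1, 0, 0, 0],  # d1: feline purr
        [1, 1, 1, 1, 0, 0, 0],  # d2: cat feline meow purr
        [0, 0, 0, 0, 1, 1, 0],  # d3: stock market
        [0, 0, 0, 0, 1, 1, 1],  # d4: stock market bank
    ],
    dtype=float,
)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Truncated SVD: keep only the strongest components ("topics").
U, S, Vt = np.linalg.svd(X, full_matrices=False)
n_topics = 2
doc_topics = U[:, :n_topics] * S[:n_topics]  # documents vs. topics
topic_terms = Vt[:n_topics, :]               # topics vs. vocabulary

bow_similarity = cosine(X[0], X[1])
lsa_similarity = cosine(doc_topics[0], doc_topics[1])
lsa_other_topic = cosine(doc_topics[0], doc_topics[3])

print(bow_similarity, lsa_similarity, lsa_other_topic)
```

On the raw counts d0 and d1 have similarity 0 because they share no terms; in the topic space they are almost identical, while d0 and the finance document d3 stay unrelated.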


## LDA

Another topic modelling approach is LDA (latent Dirichlet allocation). It is a probabilistic approach. I wouldn't use it for SML, but it is great for topic discovery and visualizations.

## CorEx

- Can we make existing approaches more informative?
- Anchor words: they give you a kind of semi-supervised representation, so you can infuse the model with some domain expertise. You know there are certain topics in the corpus and you know which terms belong to the different topics, so the model takes these into account when it makes the groupings automatically.



## Syntax

In simple terms: what is the order of things in a sentence? It makes a difference, e.g.
- dog bites man vs. man bites dog

## word2vec

word2vec (word embeddings) gets us some of the way there; going all the way requires deep learning. The rationale/mechanics:
- Assumption: the distributional hypothesis, "you shall know the meaning of a word by observing the company it keeps", i.e. the meaning of a word is defined by the context in which it tends to occur. Observing the words surrounding a word gives us an understanding of what the word means.

The algorithm is a formalization of that. You look at the first three words and then try to predict the 4th word. Once you're done with that, you move the window by one and try to predict word 5, and so on.

Imagine taking all the words in a text or corpus and creating an SML classification problem where the input x is the first 3 words and y is word number 4, and so on.
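A sketch of how the training pairs described above could be generated with a sliding window of three words. The helper `context_target_pairs` is hypothetical and only builds the (x, y) pairs; the actual word2vec architectures also look at the words on both sides of the target.

```python
def context_target_pairs(tokens, window=3):
    """Slide a window over the tokens: `window` words as input, the next word as target."""
    return [
        (tuple(tokens[i : i + window]), tokens[i + window])
        for i in range(len(tokens) - window)
    ]

tokens = "the fat cat sits on the mat".split()
pairs = context_target_pairs(tokens)

for x, y in pairs:
    print(x, "->", y)
```

Each pair is one training example for the classifier: the tuple is the input x, and the following word is the label y.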

The model is a shallow neural network. What happens is: by doing this over and over again, you identify patterns that map the relationship between a term and all the possible contexts it appears in. You end up with a vector (word embedding) representation of the word that carries its meaning. Once you have these word embeddings, you can use them as follows:
- Similar terms have similar vectors, because similar terms tend to appear in similar contexts, e.g. the word "cat" will appear in similar contexts as "feline".
- $V_{Paris} - V_{Berlin} + V_{beer} \approx V_{wine}$: the vectors capture not only the words but also their context, which allows you to do linear algebra on the words and find the underlying latent structures.
- The average vector of all the words in a sentence can serve as a document representation.
- Input for deep learning applications
  - What if we're not only looking at words in a sentence but feeding models with vectors that represent the words in the right order? We feed the word vectors into specific neural network models, so the model knows for each word what it means.

Run the exercise on a LOT of text, train the model, and you will end up with something that is really good at finding the meaning of words.
