<a href="https://colab.research.google.com/github/NadiaHolmlund/BDS_M2_Exam_Notes/blob/main/BDS_M2_Exam_Notes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Imports for examples

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()

# Datasets for examples

# Network Analysis

## What is a network?

A network is a system of elements (nodes/vertices) and connections (edges/links) between them. Networks are used to present relational data and can be applied to many types of relationships between different types of elements.


nodes: system theory jargon

vertices: graph theory jargon

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Unknown)

## Types of networks

The content, meaning and interpretation of a network depend on the elements and relationships displayed. Types of networks include:

Social networks:
- Nodes/vertices represent actors (persons, firms, other socially constructed entities)

- Edges/links represent relationships between actors (friendship, interaction, co-affiliation, similarity, etc)

Other networks:
- Chemistry: Interaction between molecules
- Computer Science: The world-wide-web, inter- and intranet topologies
- Biology: Food-web, ant-hives

The possibilities to depict relational data are manifold, e.g.:

Relations among persons
- Kinship: mother of, wife of…
- Other role based: boss of, supervisor of…
- Affective: likes, trusts…
- Interaction: give advice, talks to, retweets…
- Affiliation: belong to same clubs, shares same interests…

Relations among organizations
- As corporate entities, joint ventures, strategic alliances
- Buy from / sell to, leases to, outsources to
- Owns shares of, subsidiary of
- Via their members (Personnel flows, friendship…)

## Relational data structures

### Edgelist

- A common form of storing relational data
- An edgelist is a dataframe containing at least two columns:
  - column 1: Source node of a connection
  - column 2: Target node of a connection
- Nodes are typically identified by unique IDs
- An edgelist can also contain additional columns that describe **attributes** of the edges, such as their magnitude. If the edges have a magnitude attribute (e.g., number of interactions, strength of friendship), the graph is considered **weighted**.
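As a minimal sketch of the structure described above, the following builds a small weighted edgelist with pandas and loads it into a graph. The column names and values are made up for illustration.

```python
import pandas as pd
import networkx as nx

# Hypothetical weighted edgelist: one row per connection
edges = pd.DataFrame({
    "source": ["A", "A", "B", "C"],
    "target": ["B", "C", "C", "D"],
    "weight": [3, 1, 2, 5],  # e.g. number of interactions
})

# Build an undirected graph; the extra column becomes an edge attribute
G = nx.from_pandas_edgelist(edges, source="source", target="target", edge_attr="weight")

print(G.number_of_nodes(), G.number_of_edges())
print(G.edges["A", "B"]["weight"])
```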

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/maxresdefault.jpg)


### Adjacency matrix / Socio matrix

- Represented as an n*n matrix, where n is the number of elements (nodes/vertices) whose relationships are represented
- The value in the cell at the intersection of row i and column j indicates whether an edge between nodes i and j is present (=1) or absent (=0).
- An adjacency matrix can be produced by crosstabulating an edgelist
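The crosstabulation can be sketched with `pd.crosstab`. The edgelist below is hypothetical, and the reindexing step is one possible way to force a square matrix when some nodes only appear as source or only as target.

```python
import pandas as pd

# Hypothetical directed edgelist
edges = pd.DataFrame({
    "source": ["A", "A", "B"],
    "target": ["B", "C", "C"],
})
nodes = sorted(set(edges["source"]) | set(edges["target"]))

# Cross-tabulate source against target, then force a square n*n layout
adj = pd.crosstab(edges["source"], edges["target"]).reindex(
    index=nodes, columns=nodes, fill_value=0
)

# For an undirected network, mirror the entries across the diagonal
adj_undirected = ((adj + adj.T) > 0).astype(int)

print(adj)
print(adj_undirected)
```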

Note the impact of directed vs. undirected vs. weighted networks on adjacency matrices

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Different-types-of-graphs-and-their-corresponding-adjacency-matrix-representations-The.ppm.png)

### Nodelist

- Contains information about the nodes, aka attributes (edgelist and adjacency matrix store only connectivity patterns ***between*** nodes)
  - e.g. name, gender, age, group etc.

  ![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-29%20at%2012.00.45.png)

## Network graphs

### Graph objects

Tabular data dependency
- Between observation dependency: Summary statistics of variables are interdependent between observations (column-wise), meaning that changing a value of some observation will change the corresponding variable's summary statistics.
- Within observation dependency: Summary statistics of variables are interdependent within observations (row-wise), meaning that changing a variable value might change the summary statistics of the observation.
- Otherwise, values are (at least mathematically) independent

Graph data dependency
- Above holds true, but graph data holds additional dependencies due to the relational structure of data.
- E.g. adding/removing node(s) may imply adding/removing edge(s) and adding/removing edge(s) may change the characteristics of node(s), due to their relational interdependence

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-29%20at%2012.24.25.png)


### Graph concepts and terminology

- The vertices ***u*** and ***v*** are called the end vertices of the edge ***(u,v)***
- If two edges have the same end vertices they are ***Parallel***
- An edge of the form ***(v,v)*** is a ***loop***
- A Graph is ***simple*** if it has no parallel edges and no loops
- A Graph is said to be ***Empty*** if it has no edges. Meaning ***E*** is empty
- A Graph is a ***Null Graph*** if it has no vertices. Meaning ***V*** and ***E*** are empty
- Edges are ***Adjacent*** if they have a common vertex. Vertices are ***Adjacent*** if they have a common edge
- The ***degree*** of the vertex ***v***, written as ***d(v)***, is the number of edges with v as an end vertex. By convention, we count a loop twice and parallel edges contribute separately
- ***Isolated*** Vertices are vertices with degree 0.
- A Graph is ***Complete*** if its edge set contains every possible edge between ALL of the vertices
- A ***Walk*** in a Graph ***G = (V,E)*** is a finite, alternating sequence of the form $v_0, e_1, v_1, e_2, \dots, e_k, v_k$ consisting of vertices and edges of the graph ***G***, where each edge $e_i$ joins $v_{i-1}$ and $v_i$
- A ***Walk*** is ***Open*** if the initial and final vertices are different. A ***Walk*** is ***Closed*** if the initial and final vertices are the same
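A few of these terms can be checked on a toy graph. This is only a sketch with made-up vertex names; it uses a `MultiGraph` because a plain `Graph` cannot hold parallel edges.

```python
import networkx as nx

# A multigraph allows parallel edges and loops, so it is not a simple graph
G = nx.MultiGraph()
G.add_edges_from([("u", "v"), ("u", "v"), ("v", "v")])  # two parallel edges and one loop
G.add_node("w")  # a vertex without any edges

degree_v = G.degree("v")         # parallel edges count separately, the loop counts twice
isolated = list(nx.isolates(G))  # vertices with no incident edges
n_loops = nx.number_of_selfloops(G)

print(degree_v, isolated, n_loops)
```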

### Types of graphs

1. Weighted vs. Unweighted
2. Directed vs. Undirected
3. Unimodal vs. Multimodal
4. Unidimensional vs. Multidimensional

`networkx` graph classes
1. Graph
2. DiGraph
3. MultiGraph
4. MultiDiGraph

#### Weighted vs. Unweighted

Weighted: edges have values associated with them

Unweighted: edges either exist or do not

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2007.48.07.png)

#### Directed vs. Undirected

Directed: edges are not necessarily reciprocated

Undirected: edges are always mutual

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2007.46.51.png)

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2007.46.34.png)

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2007.46.24.png)

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2007.46.01.png)

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2007.47.34.png)

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2007.47.05.png)

#### Unimodal vs. Multimodal

Unimodal networks (1-mode): include only one type of node (e.g. all nodes represent people)

![](https://toreopsahl.files.wordpress.com/2009/04/fig1_twomode_simple.png)

Multimodal (2-mode / Bipartite / Bimodal): include more than one type of node (e.g. people and research papers)

![](https://toreopsahl.files.wordpress.com/2009/04/fig1_twomode_half.png)

#### Unidimensional vs. Multidimensional

Unidimensional: includes one type of edges

Multidimensional: Includes multiple types of edges (can be analysed as multiplex network or multiple distinct networks)

Multidimensional networks are a special type of multilayer networks with multiple types of relations. They also consist of nodes and edges, but the nodes exist in separate layers, representing different forms of interactions, which connect to form an aspect. Aspects (or stacks of layers) can be used to represent different types of contacts, spatial locations, subsystems, or points in time

Example:
The multiplex social network of Star Wars saga. Each layer denotes a different episode and two nodes are connected to each other if the corresponding characters acted together in one or more scenes.

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Muxviz_Star_Wars_Social_Network.png)

#### Visualizing networks

##### Matrix plots

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2007.50.20.png)

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2007.50.58.png)

##### Arc plots

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2007.51.33.png)

##### Circos plots

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2007.52.19.png)

#### Irrational vs. rational graphs

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2007.49.42.png)

### NetworkX library

A graph object is a specific datastructure which contains node and edgelists jointly, and enables the application of graph algorithms on them. We work with the [`networkx`](https://networkx.github.io/documentation/stable/index.html) library, which is the standard for network analysis in the Python community.

In [None]:
import networkx as nx # Main network analysis library

In NetworkX, graph data are stored in a dictionary-like fashion.
They are placed under a `Graph` object,
canonically instantiated with the variable `G` as follows:

```python
G = nx.Graph()
```

Of course, you are free to name the graph anything you want!

Nodes are part of the attribute `G.nodes`.
There, the node data are housed in a dictionary-like container,
where the key is the node itself
and the values are a dictionary of attributes. 
Node data are accessible using syntax that looks like:

```python
G.nodes[node1]
```

Edges are part of the attribute `G.edges`,
which is also stored in a dictionary-like container.
Edge data are accessible using syntax that looks like: 

```python
G.edges[node1, node2]
```
Because of the dictionary-like implementation of the graph,
any hashable object can be a node.
This means strings and tuples, but not lists and sets.
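A short sketch of the access patterns described above, with hypothetical node names and attributes:

```python
import networkx as nx

G = nx.Graph()
G.add_node("alice", age=31)
G.add_node(("bob", 2), age=27)  # a tuple is hashable, so it can be a node
G.add_edge("alice", ("bob", 2), weight=0.5)

print(G.nodes["alice"])               # attribute dictionary of the node
print(G.edges["alice", ("bob", 2)])   # attribute dictionary of the edge

# A list is not hashable and is rejected as a node
unhashable_rejected = False
try:
    G.add_node(["not", "hashable"])
except TypeError:
    unhashable_rejected = True
print(unhashable_rejected)
```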

## Local network structure (node-level measures)

Methods to summarise the pattern of node connectivity in order to infer something about the nodes' characteristics.

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-29%20at%2012.38.56.png)

### Degree centrality

- Counts the number of edges adjacent to a node.
- Formally, the degree of node $i$ is the number of existing edges $e_{ij}$ with other nodes $j$ in a network with $n$ nodes:

$$d_{i} =\sum\limits_{j=1}^{n} e_{ij} ~ \text{where} ~ i \neq j$$

**Degree centrality in directed networks**

In directed networks, a node-pair has two different roles:

* **Ego:** The node the edge originates from.
* **Alter:** The node the edge leads to.

Network metrics have to take directionality into account. For example, degree centrality is now differentiated between the
- **in-degree** centrality (how many edges lead ***to*** the node)
- **out-degree** centrality (how many edges lead ***from*** the node)
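A minimal sketch of these counts on a hypothetical four-edge directed graph, followed by the undirected degree and the normalized degree centrality as computed by `networkx` (degree divided by n - 1):

```python
import networkx as nx

# Hypothetical directed network
D = nx.DiGraph([("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")])

in_deg = dict(D.in_degree())    # edges leading to each node
out_deg = dict(D.out_degree())  # edges leading from each node

# Ignoring direction collapses a->c and c->a into one undirected edge
U = D.to_undirected()
deg = dict(U.degree())
deg_centrality = nx.degree_centrality(U)

print(in_deg, out_deg, deg, deg_centrality)
```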

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-29%20at%2012.43.47.png)

### Eigenvector centrality

- Weighs a node's degree centrality by the centrality of the nodes adjacent to it (and their centrality in turn by their centrality).

$$x_{v}={\frac {1}{\lambda }}\sum _{t\in M(v)}x_{t}={\frac {1}{\lambda }}\sum _{t\in G}a_{v,t}x_{t}$$
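One common way to solve this equation is power iteration: repeatedly replace each node's score by the sum of its neighbours' scores and renormalize. The sketch below assumes a small hypothetical graph (a triangle 0-1-2 with a tail 2-3); it is an illustration, not necessarily how a library implements the measure.

```python
import numpy as np

# Adjacency matrix of a hypothetical graph: triangle 0-1-2 plus the edge 2-3
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

x = np.ones(A.shape[0])
for _ in range(200):
    x = A @ x                  # each node sums the scores of its neighbours
    x = x / np.linalg.norm(x)  # renormalize so the scores stay bounded

lam = x @ A @ x  # estimate of the leading eigenvalue lambda
print(x, lam)
```

Node 2 ends up with the highest score because it is connected to the two other well-connected nodes, while node 3 only borrows importance from node 2.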

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-29%20at%2012.48.08.png)

### Betweenness centrality

- Measures the extent to which a node lies on the shortest paths between other nodes.
- A higher betweenness indicates that a node lies on more short paths and hence should somehow be important for traversing between different parts of a network.

In formulaic representation

* The geodesic betweenness $B_{n}(i)$ of a **vertex** in a weighted, undirected network is

$$B_{n}(i) =  \sum_{s,t \in G} \frac{ \Psi_{s,t}(i) }{\Psi_{s,t}}$$
where vertices $s,t,i$ are all different from each other

* $\Psi_{s,t}$ denotes the number of shortest paths (geodesics) between vertices $s$ and $t$
* $\Psi_{s,t}(i)$ denotes the number of shortest paths (geodesics) between vertices $s$ and $t$ **that pass through vertex** $i$.
* The geodesic betweenness $B_n$ of a network is the mean of $B_n(i)$ over all vertices $i$
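A small sketch on a path graph, where the counting can be done by hand. Note that for undirected graphs `networkx` counts each unordered pair $(s,t)$ once.

```python
import networkx as nx

# Path 0 - 1 - 2 - 3 - 4: every shortest path is unique
G = nx.path_graph(5)

# Unnormalized: number of (s, t) pairs whose shortest path passes through the node
raw = nx.betweenness_centrality(G, normalized=False)

# Normalized by the number of node pairs that exclude the node itself
norm = nx.betweenness_centrality(G, normalized=True)

print(raw)
print(norm)
```

The middle node lies between the 2 * 2 = 4 pairs that have one endpoint on each side, which is why it gets the highest score.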

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-29%20at%2012.51.47.png)

### Neighborhood

- Examines the surroundings of a node in terms of the nodes it is connected to, i.e. its neighborhood
- Ego-network of node: How many nodes are in a certain geodesic distance (meaning the shortest path), i.e. how many nodes are not more than x-steps away.
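As a sketch, `nx.ego_graph` extracts the ego-network within a given number of steps. The path graph is just a convenient toy example.

```python
import networkx as nx

G = nx.path_graph(5)  # 0 - 1 - 2 - 3 - 4

# Ego-network of node 2: all nodes at most 1 step away (ego included)
ego_1 = nx.ego_graph(G, 2, radius=1)

# Geodesic distances from node 0, limited to 2 steps
within_2 = nx.single_source_shortest_path_length(G, 0, cutoff=2)
n_reachable = len(within_2) - 1  # exclude the ego itself

print(sorted(ego_1.nodes), within_2, n_reachable)
```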

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-29%20at%2012.57.16.png)

### UPDATE Modularity / Clustering (Community detection)
what is within and between network connectivity??

- Group nodes based on graph topology (sometimes referred to as community detection based on its commonality in social network analysis)
- Main logic: Form groups which have a ***maximum within-connectivity*** and a ***minimum between-connectivity***.
- Consequently: Nodes in the same community should have a higher probability of being connected than nodes from different communities.

**Community clustering in directed networks**

Most community detection algorithms implemented in `NetworkX` only work with undirected networks. So, we can do 2 things to handle these:

1. Convert the network into an undirected one.
2. Use the "edge betweenness" algorithm, the only one implemented that can handle directed networks.

There are (just like for clustering of tabular data in UML) many different algorithms and approaches to detect and delineate communities. [Here](https://github.com/benedekrozemberczki/awesome-community-detection) you find a summary of currently used approaches.

Example: The Louvain Algorithm

One of the most widely used community detection algorithms. It usually delivers good results, scales well, and can handle weighted networks. Furthermore, there is an actively maintained, easy to use Python implementation, [`python-louvain`](https://python-louvain.readthedocs.io).

It optimises a quantity called modularity:

$$  \sum_{ij} (A_{ij} - \lambda P_{ij}) \delta(c_i,c_j) $$

$A$ - The adjacency matrix

$P_{ij}$ - The expected connection between $i$ and $j$.

$\lambda$ - Resolution parameter

$\delta(c_i,c_j)$ - Equals 1 if nodes $i$ and $j$ are in the same community and 0 otherwise

Many different forms can be used for $P_{ij}$, but the standard one is the so-called configuration model:

$P_{ij} = \frac{k_i k_j}{2m}$

where $k_i$ is the degree of node $i$ and $m$ is the total number of edges.

Loosely speaking, in an iterative process:
- You take a node and try to aggregate it to one of its neighbours.
- You choose the neighbour that maximizes a modularity function.
- Once you iterate through all the nodes, you will have merged some nodes together and formed some communities.
- This becomes the new input for the algorithm that will treat each community as a node and try to merge them together to create bigger communities.
- The algorithm stops when it’s not possible to improve modularity any more.
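To make the optimised quantity concrete, the sketch below evaluates modularity for a given partition in plain Python, using the configuration model for $P_{ij}$ ($k_i$ is the degree of node $i$, $m$ the number of edges). It assumes the common normalization by $1/(2m)$, which the formula above leaves out, and a hypothetical graph of two triangles joined by one edge. It only scores partitions; it does not perform the iterative merging.

```python
from itertools import product

# Hypothetical graph: triangles {0,1,2} and {3,4,5} joined by the edge 2-3
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]

def modularity(edges, community, resolution=1.0):
    """Sum of (A_ij - resolution * k_i*k_j / 2m) over same-community pairs, divided by 2m."""
    nodes = sorted(community)
    m = len(edges)
    A = {(i, j): 0 for i, j in product(nodes, repeat=2)}
    for u, v in edges:
        A[u, v] += 1
        A[v, u] += 1
    k = {i: sum(A[i, j] for j in nodes) for i in nodes}
    total = 0.0
    for i, j in product(nodes, repeat=2):
        if community[i] == community[j]:  # the delta(c_i, c_j) term
            total += A[i, j] - resolution * k[i] * k[j] / (2 * m)
    return total / (2 * m)

two_triangles = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
one_block = {n: "a" for n in range(6)}

q_split = modularity(edges, two_triangles)
q_one = modularity(edges, one_block)
print(q_split, q_one)
```

Splitting the graph along the bridge scores higher than putting everything in one community, which is the kind of improvement the algorithm searches for.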

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-29%20at%2013.04.56.png)

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-29%20at%2013.06.10.png)

### Assortativity

- Measures if two nodes that share certain characteristics have a higher or lower probability to be connected.


### Reciprocity

- Measures if directed edges are reciprocated, meaning that an edge between `i,j` makes an edge between `j,i` more likely
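Both measures are available in `networkx`. The sketch below uses tiny hypothetical graphs where the result can be checked by hand; the attribute name `group` is made up.

```python
import networkx as nx

# Reciprocity: share of directed edges that are returned
D = nx.DiGraph([("a", "b"), ("b", "a"), ("a", "c")])
rec = nx.reciprocity(D)  # 2 of the 3 edges have a counterpart

# Assortativity: do nodes with the same attribute value connect to each other?
G = nx.Graph([("a", "b"), ("c", "d")])
nx.set_node_attributes(G, {"a": "x", "b": "x", "c": "y", "d": "y"}, name="group")
assort = nx.attribute_assortativity_coefficient(G, "group")

print(rec, assort)
```

Here every edge connects two nodes of the same group, so the assortativity coefficient reaches its maximum of 1.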

## Global network structure (overall-level measures)

### Density

- The density of a network is the share of all possible connections that are actually present in the network.

### Transitivity / Clustering Coefficient

- Transitivity, also called the Clustering Coefficient, indicates how much the network tends to be locally clustered. It is measured by the share of closed triplets.

### Diameter

- The diameter is the longest of the shortest paths between two nodes of the network.

### Mean distance / Average path length

- The mean distance / average path length represents the mean of all shortest paths between all nodes. It is a measure of diffusion potential within a network.
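The four global measures can be sketched on a hypothetical graph small enough to verify by hand (a triangle 0-1-2 with a tail 2-3):

```python
import networkx as nx

G = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3)])

density = nx.density(G)                            # 4 edges out of 6 possible
transitivity = nx.transitivity(G)                  # 3 * triangles / connected triplets = 3 / 5
diameter = nx.diameter(G)                          # longest shortest path, e.g. 0 -> 2 -> 3
mean_distance = nx.average_shortest_path_length(G) # 8 steps spread over 6 node pairs

print(density, transitivity, diameter, mean_distance)
```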

## Small worlds

Small worlds are an interesting network structure, combining short path length between the nodes with a high clustering coefficient. That means that we have small interconnected clusters, which are in turn connected by **gatekeepers** (the edges we call **bridges** or **structural holes**).

A small-world network is a type of mathematical graph in which most nodes are not neighbors of one another, but the neighbors of any given node are likely to be neighbors of each other and most nodes can be reached from every other node by a small number of hops or steps.

Mathematically, small world networks of size n have an average distance O(log n), meaning that between any two random nodes, the expected distance is O(log n).

⟨L⟩ ∝ log n
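One way to see the effect is the Watts-Strogatz model, which is not covered in these notes but is available in `networkx`: start from a ring lattice and rewire a small share of the edges. The parameters below (100 nodes, 6 neighbours, 10% rewiring, fixed seed) are arbitrary choices for illustration.

```python
import networkx as nx

# Ring lattice: every node is linked to its 6 nearest neighbours, no rewiring
ring = nx.watts_strogatz_graph(100, 6, 0, seed=1)

# Same lattice with 10% of the edges rewired at random (retried until connected)
small_world = nx.connected_watts_strogatz_graph(100, 6, 0.1, seed=1)

c_ring = nx.average_clustering(ring)
l_ring = nx.average_shortest_path_length(ring)
c_sw = nx.average_clustering(small_world)
l_sw = nx.average_shortest_path_length(small_world)

print(f"ring lattice: clustering={c_ring:.3f}, mean distance={l_ring:.3f}")
print(f"small world:  clustering={c_sw:.3f}, mean distance={l_sw:.3f}")
```

A few random shortcuts are typically enough to shorten the mean distance considerably, while most of the local clustering survives.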

Small-world network example (hubs are drawn bigger than other nodes):
- Average degree = 3.833
- Average shortest path length = 1.803
- Clustering coefficient = 0.522

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Small-world-network-example.png)

## Similarity networks

Similarity networks are constructed by mapping similarity between all observations, e.g. using
- Cosine Similarity
- Pearson Coefficient
- Euclidean Distance

#### Cosine similarity

Cosine similarity counts how many neighbors two vertices have in common and normalizes this count by their degrees, so it allows for vertices with varying degrees.

Salton proposed that we regard the i-th and j-th rows/columns of the adjacency matrix as two vectors and use the cosine of the angle between them as a similarity measure. The cosine similarity of i and j is the number of common neighbors divided by the geometric mean of their degrees.

Its value lies in the range from 0 to 1. The value of 1 indicates that the two vertices have exactly the same neighbors while the value of zero means that they do not have any common neighbors. Cosine similarity is technically undefined if one or both of the nodes has zero degree, but according to the convention, we say that cosine similarity is 0 in these cases.
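A sketch of this computation with NumPy on a hypothetical five-node graph. The number of common neighbours of $i$ and $j$ is entry $(i,j)$ of $A^2$, and pairs involving a zero-degree node are set to 0 by convention.

```python
import numpy as np

# Hypothetical undirected graph with edges 0-2, 0-3, 1-2, 1-3, 3-4
A = np.array([
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

degrees = A.sum(axis=1)
common = A @ A                              # common[i, j] = number of shared neighbours
denom = np.sqrt(np.outer(degrees, degrees)) # geometric mean of the two degrees

# Divide only where the denominator is non-zero, otherwise keep 0
cosine = np.divide(common, denom, out=np.zeros_like(common), where=denom > 0)

print(cosine.round(3))
```

Nodes 0 and 1 have exactly the same neighbours and reach a similarity of 1, while nodes 2 and 4 share no neighbour and get 0.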

#### Pearson Coefficient

Pearson product-moment correlation coefficient is an alternative method to normalize the count of common neighbors. This method compares the number of common neighbors with the expected value that count would take in a network where vertices are connected randomly. This quantity lies strictly in the range from -1 to 1.

#### Euclidean distance

Euclidean distance is equal to the number of neighbors that differ between two vertices. It is rather a dissimilarity measure, since it is larger for vertices which differ more. It could be normalized by dividing by its maximum value. The maximum means that there are no common neighbors, in which case the distance is equal to the sum of the degrees of the vertices.
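Both measures can be sketched on rows of a hypothetical adjacency matrix. For 0/1 entries, the number of differing neighbours equals the squared Euclidean distance between the rows, which is what the code computes.

```python
import numpy as np

# Hypothetical undirected graph with edges 0-2, 0-3, 1-2, 1-3, 3-4
A = np.array([
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)
degrees = A.sum(axis=1)

# Pearson correlation between every pair of rows
pearson = np.corrcoef(A)

# Number of neighbours that differ between each pair of vertices
differing = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)

# Normalize by the maximum possible value: the sum of the two degrees
normalized = differing / (degrees[:, None] + degrees[None, :])

print(pearson.round(3))
print(differing)
print(normalized.round(3))
```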

## Multimodal network analysis

Multi-modal means a network has several "modes", i.e. it connects entities on different conceptual levels. The most common is a **2-mode** (or **bipartite**) network. 

Examples could be:

* Author $\rightarrow$ Paper
* Inventor $\rightarrow$ Patent
* Member $\rightarrow$ Club network. 

Here, elements in different modes represent different things. In real-life research examples you find 2-mode networks in for instance:
- co-occurrence (2 actors mentioned in the same news-article)
- co-affiliation (2 actors are members of the same association)
- co-characteristics (2 actors both like to talk about a certain topic on twitter).

### Network projection

Two-mode networks are rarely analysed in their original form. Although this is preferable, few methods exist for that purpose. As such, these networks are often transformed into one-mode networks (only one type of nodes) to be analysed. This procedure is often referred to as projection. Projection is done by selecting one of the sets of nodes and linking two nodes from that set if they were connected to the same node (of the other kind).

We can analyse the modes separately (and sometimes we should), but often it's helpful to *project* them onto one mode. Here, we create an edge between two nodes of one mode through their joint association with a node of the other mode.
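As a sketch, `networkx` offers projection helpers in its `bipartite` module. The author and paper names below are made up; `weighted_projected_graph` sets the edge weight to the number of shared neighbours in the other mode.

```python
import networkx as nx
from networkx.algorithms import bipartite

authors = ["ann", "bob", "cat"]
papers = ["p1", "p2", "p3"]

# Hypothetical 2-mode network: author -> paper
B = nx.Graph()
B.add_nodes_from(authors, bipartite=0)
B.add_nodes_from(papers, bipartite=1)
B.add_edges_from([
    ("ann", "p1"), ("ann", "p2"),
    ("bob", "p1"), ("bob", "p2"), ("bob", "p3"),
    ("cat", "p3"),
])

# Project onto the author mode: two authors are linked if they share a paper
P = bipartite.weighted_projected_graph(B, authors)

print(list(P.edges(data=True)))
```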

2-mode

![](https://toreopsahl.files.wordpress.com/2009/04/fig1_twomode_half.png)

1-mode

![](https://toreopsahl.files.wordpress.com/2009/04/fig1_twomode_simple.png)

![](https://www.dropbox.com/s/e4vnq7kh24pyu0t/networks_2mode.png?dl=1)

Particularly in citation networks, we can also use the implicit 2-mode structure of $Publications \rightarrow Citation$

That helps us to apply some interesting metrics, such as:

* direct citations
* Bibliographic coupling
* Co-citations

Interestingly, different projections of this 2-mode network give the whole resulting 1-mode network a different meaning.

![](https://www.dropbox.com/s/f8g8nr83lucvpqx/networks_biblio.png?dl=1)


### UPDATE proper projection

### UPDATE quick and dirty projection

Merge the two-mode edgelist dataframe with itself using one of the node columns as key and delete the self-loops; the remaining rows are the edges of the projection.
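A minimal sketch of this self-merge with pandas. The column names and data are hypothetical; the `<` filter removes self-loops and, as an extra step, the mirrored duplicate of every pair.

```python
import pandas as pd

# Hypothetical 2-mode edgelist: author -> paper
two_mode = pd.DataFrame({
    "author": ["ann", "bob", "bob", "cat", "cat"],
    "paper":  ["p1",  "p1",  "p2",  "p2",  "p3"],
})

# Self-merge on the paper column: every pair of authors sharing a paper
pairs = two_mode.merge(two_mode, on="paper", suffixes=("_x", "_y"))

# Drop self-loops (ann-ann) and mirrored duplicates (bob-ann vs. ann-bob)
pairs = pairs[pairs["author_x"] < pairs["author_y"]]

# Count shared papers per pair to obtain a weighted 1-mode edgelist
edges = pairs.groupby(["author_x", "author_y"]).size().reset_index(name="weight")

print(edges)
```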

### Weighted network projection

It is possible to discount for the number of nodes when projecting weighted two-mode networks.
 
For example, it could be argued that if many online users post to a thread, their ties should be weaker than if only a few people posted to the thread. A straightforward generalisation is the following function: $w_{ij} = \sum_p \frac{w_{i,p}}{N_p - 1}$, where $w_{i,p}$ is the weight of the tie between node $i$ and thread $p$ and $N_p$ is the number of nodes connected to $p$.
 
 This formula would create a directed one-mode network in which the out-strength of a node is equal to the sum of the weights attached to the ties in the two-mode network that originated from that node. For example, node C has a tie with a weight of 5 in the two-mode network and an out-strength of 5 in the one-mode projection.
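The formula can be sketched in plain Python. The users, threads and weights below are made up and are not the data from the figure; $N_p$ is taken to be the number of users posting to thread $p$.

```python
from collections import defaultdict

# Hypothetical weighted 2-mode ties: (user, thread) -> number of posts
two_mode = {
    ("A", "t1"): 2, ("B", "t1"): 1, ("C", "t1"): 5,
    ("A", "t2"): 1, ("B", "t2"): 3,
}

# Group the ties by thread
members = defaultdict(dict)
for (user, thread), w in two_mode.items():
    members[thread][user] = w

# w_ij = sum over shared threads p of w_ip / (N_p - 1)
projected = defaultdict(float)
for thread, users in members.items():
    n_p = len(users)
    if n_p < 2:
        continue  # a thread with a single user creates no tie
    for i, w_ip in users.items():
        for j in users:
            if i != j:
                projected[(i, j)] += w_ip / (n_p - 1)

# Out-strength: sum of the weights of a node's outgoing ties
out_strength = defaultdict(float)
for (i, j), w in projected.items():
    out_strength[i] += w

print(dict(projected))
print(dict(out_strength))
```

In this toy data user C has a single two-mode tie of weight 5 and ends up with an out-strength of 5 in the projection, matching the property described above.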

![](https://toreopsahl.files.wordpress.com/2009/04/fig1_twomode_forum_newman2001.png)

# Natural Language Processing

## Elements of text vs. Text as such

The interest of the analysis goes in one of two directions: either we're analysing the text as such (what is the statement in the text) or elements of the text.

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-11-06%20at%2013.31.59.png)


## Preprocessing concepts

Syntax and semantics. The syntax is the grammatical structure of the text, and semantics is the meaning being conveyed.

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes//main/Images/vURoVPyiqdRaOmec.png)

### Tokenization

- Tokenization separates a piece of text into smaller units called tokens
- Tokenization can be broadly classified into 3 types:
  - word tokens: smarter
  - character tokens: s-m-a-r-t-e-r
  - subword token (n-gram characters): smart-er
- But different methods may be applied when tokenizing
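The three granularities can be sketched with the standard library only. The regular expression and the use of character n-grams as subword units are simple illustrative choices; real tokenizers are more elaborate.

```python
import re

text = "Smarter models aren't always better."

# Word tokens: runs of word characters, keeping contractions together
word_tokens = re.findall(r"\w+(?:'\w+)?", text.lower())

# Character tokens of a single word
char_tokens = list("smarter")

def char_ngrams(word, n):
    """All overlapping character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

trigrams = char_ngrams("smarter", 3)

print(word_tokens)
print(char_tokens)
print(trigrams)
```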

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/tokenize.png)

### Normalization

- Reduces randomness of text, bringing it closer to a predefined 'standard', which improves efficiency
- Before normalizing:
  - expand contractions (by making a dictionary)
  - tokenize
  - remove punctuations
- Normalization techniques:
  - Stemming: reducing words to their word stem or root form (may not be a dictionary word)
    - Over-stemming: more than required is removed, e.g. “university” and “universe” = “univers”.
    - Under-stemming: less than required is removed, e.g. “data” and “datum” = “dat” and “datu” (instead of “dat”).
  - Lemmatization: reducing words to their base word in the language (is in the dictionary)
    - A root word is called lemma. A lemma is the canonical form, dictionary form, or citation form of a set of words.


Should you always normalize?

No, it depends on the problem.
- Lemmatization is needed for topic modeling and for training word vectors, which depend on string matches and accurate word counts
- Sentiment analysis methods, on the other hand, have different ratings depending on the form of the word, and therefore the input should not be stemmed or lemmatized.

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/pLfTViDgRXuHYvWE-2.png)

### Frequency based dictionary filtering

- Removing stop words, i.e. filtering out high-frequency words that add little or no semantic value to a sentence, e.g.:
  - to, for, on, and, the, etc.
  - You can even create custom lists of stopwords to include words that you want to ignore.

### POS tagging (Part-of-Speech) (TAGGER)

- Assigning each word (token) in a sentence the part of speech category that it assumes in that sentence.
- Target is to identify the grammatical group of a given word: noun, pronoun, adjective, verb, adverbs, etc. based on the context.
- POS tagging improves accuracy, e.g. ‘leaves’ without a POS tag would get lemmatized to ‘leaf’, but with a verb tag, its lemma would be ‘leave’.

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/GVCinaTpbVEEOukr.png)

### Dependency Parsing (PARSER)

Dependency grammar refers to the way the words in a sentence are connected. A dependency parser, therefore, analyzes how ‘head words’ are related and modified by other words to understand the syntactic structure of a sentence.

nsubj: nominal subject

dobj: direct object

conj: conjunct

advmod: adverbial modifier

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/aEFHscfTPxihutra.png)

### Named Entity Recognition (NER)

- Extracts entities from text documents, e.g. names, places, organizations, email addresses, etc.
- Relationship extraction takes this one step further and finds relationships between two nouns. e.g. “Lumiere lives in Nice,” a person (Lumiere) is related to a place (Nice) by the semantic category “lives in.”

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/AVySmceRpFKFjcaM.png)

RJ example: From text to network

There may be relationships between entities in the text which can be translated into network structures. These may even be directed relationships.

How to get there:

Look at grammar. What are the subjects, objects, and how are they related. Packages exist for this, turning texts into elements and performing analysis on the elements. I.e. representing text as a relational structure, creating an edgelist, nodelist, etc.

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-11-06%20at%2014.02.49.png)

## Vectorization approaches

### Container model (not vectorization, intro from RJ videos)

Scenario: we have a sentence 'This is a very nice house'

Think about the sentence as a container - it contains words. I.e. it's a way to represent meaning through combinations of elements.

If the sentence is a container of terms, the terms carry meaning and the combination of terms is the meaning of the sentence. I.e. some terms bear more meaning (debatable) e.g. house and nice are important, whereas very, but, this, is, a and . may not be that important.

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-11-06%20at%2013.58.00.png)





### Bag of words model (BoW) (Bag of words representation of text)

In order to apply ML approaches, we need tabular data including text and features.
- text = each sentence
- features = f1 to fn is the vocabulary contained in the whole collection of texts, i.e. all words used in the sentences

Example:

the fat cat sits on the mat

the cat ate a fat rat

the dog is sad

Procedure:
- Kick out words that don't carry much meaning
  
  - e.g. articles (the, a), auxiliary verbs (is), prepositions and full stops.

- Normalize text, stemming or lemmatizing
  - Normalize the vocab by removing gender, declension and conjugation: verb forms that differ by number (plural, singular, etc.) are reduced to one form, and past tense is transformed into present.
  - Reason for normalizing: The meaning is the same in normalized text, as long as what you want to figure out is related to meaning and not to gender, time etc. (in that case do not normalize)

- Turn the vocab into a table
  - Rows: Text (sentences)
  - Columns: Features (words)

- Add 1's and 0's
  - depending if the word is present in the sentence or not. Result is a sparse matrix representing text in tabular form

- Apply ML models
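The procedure can be sketched in plain Python on the three example sentences. The stop-word list is a hand-picked assumption and the normalization step is skipped to keep the example short.

```python
docs = [
    "the fat cat sits on the mat",
    "the cat ate a fat rat",
    "the dog is sad",
]

# Hand-picked stop words for this toy example
stop_words = {"the", "a", "is", "on"}

# Tokenize on whitespace and kick out the stop words
tokenized = [[w for w in doc.split() if w not in stop_words] for doc in docs]

# Vocabulary: all remaining words across the collection (the columns)
vocab = sorted({w for doc in tokenized for w in doc})

# One row per text, 1 if the word is present and 0 otherwise
bow = [[1 if term in doc else 0 for term in vocab] for doc in tokenized]

print(vocab)
for row in bow:
    print(row)
```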

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-29%20at%2016.51.35.png)

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2007.56.59.png)

### UPDATE Embeddings??

### Word2Vec

In simple terms:
- The order of things in a sentence, e.g. dog bites man vs man bites dog

Word2Vec = word embeddings

Rationale/mechanism behind word2vec:
- The distributional hypothesis "You shall know a word by the company it keeps", i.e. the meaning of a word is defined by the context in which it tends to occur. Observing the words surrounding a word gives us an understanding of what the word means.

The word2vec algorithm is a formalization of that. You look at the first three words and try to predict the 4th word. Once you're done with that you move the window by one and try to predict word 5 etc. etc.

Imagine you're taking all your words in a text or corpus, creating a SML classification problem where input is x, the first 3 words, and y is word number 4 etc. etc. etc.
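The sliding window described above can be sketched as follows, with a made-up sentence. This follows the simplified description in these notes; the actual word2vec architectures (CBOW and skip-gram) use context words on both sides of the target word.

```python
tokens = "the quick brown fox jumps over the lazy dog".split()
window = 3

# x = three consecutive words, y = the word that follows them
pairs = [
    (tokens[i:i + window], tokens[i + window])
    for i in range(len(tokens) - window)
]

for x, y in pairs:
    print(x, "->", y)
```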

The model is a shallow neural network. What happens is:
- By running the above again and again you identify patterns that map the relationship between a term and all the contexts it can occur in. You end up with a vector representation (word embedding) of each word that carries the meaning of the word. Once you have these word embeddings you can figure out the following:
- Similar terms have similar vectors, because similar terms tend to appear in similar contexts, e.g. the word cat will appear in similar contexts as feline
- $V_{paris} - V_{berlin} + V_{beer} \approx V_{wine}$: the vectors capture not only the words but also the context, which allows you to do linear algebra on the words and find the underlying latent structures.
- Average vector of all the words in a sentence as a document representation
- Input for deep learning applications
  - What if we're not only looking at words in a sentence but feeding the model with vectors that represent the words in the right order? Word vectors can be fed into specific neural network models, going deeper into meaning structures.

Run the exercise on a LOT of text and train the model and you will end up with something that's really good at finding the meaning of words.

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2008.36.37.png)

The Word2vec model captures both syntactic and semantic similarities between the words. One of the well-known examples of vector algebra on trained word2vec vectors is:

Vector(“King”) - Vector(“Man”) = Vector(“Queen”) - Vector(“Woman”)

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/1*5F4TXdFYwqi-BWTToQPIfg.jpeg)

### TF-IDF (Term Frequency - Inverse Document Frequency)

TF-IDF tries to weigh words, i.e. if a word appears in all texts it may not be as important for the individual texts. Original words are probably more important than general words. e.g. the word 'president' in political tweets.

The weight of term x in sentence y is equal to the term frequency of x in y times the logarithm of N (the total number of documents) divided by $df_x$ (the number of documents the term x appears in, which cannot be larger than N): $w_{x,y} = tf_{x,y} \cdot \log\left(\frac{N}{df_x}\right)$.

The closer $N / df_x$ is to one (i.e. the closer the logarithm is to zero), the lower the importance of the word.

I.e. TF-IDF discounts general terms and highlights specific terms. In practice it still just looks at a frequency distribution.
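A minimal sketch of this weighting in plain Python, using three toy sentences and the natural logarithm. Libraries often use smoothed variants, so their numbers can differ.

```python
import math
from collections import Counter

docs = [
    "the fat cat sits on the mat".split(),
    "the cat ate a fat rat".split(),
    "the dog is sad".split(),
]
N = len(docs)

# df[x]: number of documents that contain term x
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(term, doc):
    """Term frequency in the document times log(N / df)."""
    return doc.count(term) * math.log(N / df[term])

weights = [{term: tf_idf(term, doc) for term in set(doc)} for doc in docs]

print(weights[0]["the"])  # appears in every document
print(weights[0]["cat"])  # appears in two documents
print(weights[2]["dog"])  # appears in one document only
```

The word "the" occurs twice in the first sentence but still gets a weight of zero because it appears in every document.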

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2008.01.35.png)
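
The weighting described above, tf(x, y) * log(N / df_x), can be computed by hand. This is a minimal sketch on a made-up three-document corpus; the `tf_idf` helper is a hypothetical name. Note that libraries such as scikit-learn use smoothed variants of the formula, so their numbers will differ.

```python
import math
from collections import Counter

# Made-up mini corpus
docs = [
    "the president gave a speech",
    "the president met the press",
    "the weather is nice",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)  # total number of documents

# df_x: number of documents in which term x appears at least once
df = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(term, tokens):
    """Raw term frequency in one document times log(N / df_x)."""
    if df[term] == 0:
        return 0.0  # term never seen in the corpus
    tf = tokens.count(term)
    return tf * math.log(N / df[term])

w_the = tf_idf("the", tokenized[1])              # appears in every document
w_president = tf_idf("president", tokenized[1])  # appears in 2 of 3 documents
w_press = tf_idf("press", tokenized[1])          # appears in 1 of 3 documents
print(w_the, w_president, w_press)
```

The word "the" gets weight zero even though it occurs twice in the second document, because N / df_x is one and its logarithm is zero.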



## Dimensionality reduction and topic modeling

- LSA
- SVD
- LDA

### From Bag of words (BoW) to topic modeling and embeddings

SML:
- BoW and TF-IDF are sufficient for many SML approaches, e.g. classification or sentiment analysis. Performance can be quite alright with BoW, but is usually better with TF-IDF

Going from sparse matrix to dense:
- LSI (latent semantic indexing) or LSA (latent semantic analysis).
- Similar to SVD (singular value decomposition) or NMF (non-negative matrix factorization) in terms of mathematics and output

Simplified example using LSI/LSA:
- We have a sparse matrix with words and docs
- Transforming the matrix to dense using dimensionality reduction approaches (LSI or LSA)
- The result is one matrix with documents vs. topics and one matrix with topics vs. vocab
  - documents vs. topics: a dense matrix that identifies topics, similar to the components in UML. Topics are easy to interpret because the way we speak and write is logical. Linear algebra can help uncover latent topics in text corpora. Each document is represented as a combination of topics.
  - topics vs. vocab: a dense matrix that shows to which extent terms contribute to topics, i.e. the relationship between certain terms and certain topics. This is a lower-dimensional matrix that can be used for:
      - SML
      - Similarity (e.g. cosine similarity). The advantage of similarity based on reduced matrices compared to BoW/TF-IDF is that you can get the similarity of two documents that don't share any words but contain certain synonyms. With BoW you need exact common terms; LSI/LSA, on the other hand, is based on topics and not words, i.e. it gets closer to semantic similarity.

![](https://raw.github.com/NadiaHolmlund/BDS_M2_Exam_Notes/main/Images/Screenshot%202022-10-30%20at%2008.14.36.png)
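
The simplified LSI/LSA example can be sketched with plain NumPy. This is an illustration on a made-up corpus, assuming a raw BoW count matrix and a truncated SVD with k = 2 topics; variable names such as `doc_topics` and `topic_terms` are my own. Documents 0 and 2 share no words, but "cat" and "feline" are linked through "kitten" in document 1.

```python
import numpy as np

vocab = ["cat", "kitten", "feline", "stock", "market", "price"]
docs = [
    "cat kitten",
    "kitten feline",
    "feline",
    "stock market",
    "market price",
]

# Sparse-style BoW matrix: rows = documents, columns = vocabulary terms
X = np.array([[doc.split().count(term) for term in vocab] for doc in docs], dtype=float)

# SVD: X = U @ diag(S) @ Vt; keep only the k largest singular values
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
doc_topics = U[:, :k] * S[:k]  # documents vs. topics (dense)
topic_terms = Vt[:k, :]        # topics vs. vocab (dense)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

bow_sim = cosine(X[0], X[2])                    # no shared words
lsa_sim = cosine(doc_topics[0], doc_topics[2])  # same latent topic
print(bow_sim, lsa_sim)
```

In the BoW space the two documents look unrelated, while in the reduced topic space they end up pointing in nearly the same direction.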


LDA (latent dirichlet allocation):
- Another topic modeling approach
- Probabilistic approach
- Wouldn't use it for SML, but great for topic discovery and visualizations

LSI (also known as Latent Semantic Analysis, LSA) learns latent topics by performing a matrix decomposition (SVD) on the term-document matrix.

LDA is a generative probabilistic model that assumes a Dirichlet prior over the latent topics.

In practice, LSI is much faster to train than LDA, but has lower accuracy.

Suppose you have the following set of sentences:

1. I ate a banana and spinach smoothie for breakfast.
2. I like to eat broccoli and bananas.
3. Chinchillas and kittens are cute.
4. My sister adopted a kitten yesterday.
5. Look at this cute hamster munching on a piece of broccoli.

Latent Dirichlet allocation is a way of automatically discovering the topics that these sentences contain. For example, given these sentences and asked for 2 topics, LDA might produce something like:

- Sentences 1 and 2: 100% Topic A
- Sentences 3 and 4: 100% Topic B
- Sentence 5: 60% Topic A, 40% Topic B
- Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, ... (at which point, you could interpret topic A to be about food)
- Topic B: 20% chinchillas, 20% kittens, 20% cute, 15% hamster, ... (at which point, you could interpret topic B to be about cute animals)


CorEx (Correlation Explanation):
- Asks whether we can make existing approaches more informative
- Anchor words give you a kind of semi-supervised representation, so you can infuse the model with some domain expertise. E.g. I know there are certain topics in this corpus and I know which terms belong to the different topics, so the model should take these into account when it makes the groupings automatically



## NLP in supervised pipelines

## Explainability

## Typical applications

Industry applications of NLP:
- News aggregation
- Social media monitoring
- Analytics for marketing
- Automated customer service
- Content filtering (e.g. spam filters)
- Semantic search

Text Classification

Text classification is the process of understanding the meaning of the unstructured text and organizing it into predefined classes (also called labels, or tags). One of the most popular text classification tasks is sentiment analysis, which aims to categorize unstructured text data by sentiment. Other common classification tasks include intent detection, topic modeling, and language detection.



Chatbots

Approach:
- Train a model on variants of a question.
- Take input and predict the type of question asked - this is called “intent”
- Reply with a pre-defined response corresponding to the question asked.
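
The three steps above can be sketched with a TF-IDF plus logistic regression pipeline. This is a minimal illustration assuming scikit-learn is installed; the questions, intents, responses and the `reply` helper are all made up, and a real bot would need far more variants per intent.

```python
import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical FAQ data: variants of each question, labelled with an intent
training_questions = [
    "how do I reset my password",
    "I forgot my password",
    "password reset help",
    "what are your opening hours",
    "when are you open",
    "opening hours on weekends",
]
intents = ["reset_password"] * 3 + ["opening_hours"] * 3

# Pre-defined responses per intent (2 each, to simulate some variation)
responses = {
    "reset_password": [
        "You can reset your password under Settings.",
        "Use the 'Forgot password' link on the login page.",
    ],
    "opening_hours": [
        "We are open 9-17 on weekdays.",
        "Our hours are 9-17, Monday to Friday.",
    ],
}

# Step 1: train a model on variants of each question
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(training_questions, intents)

def reply(text):
    """Steps 2 and 3: predict the intent, then pick a pre-defined response."""
    intent = model.predict([text])[0]
    return intent, random.choice(responses[intent])

intent, answer = reply("I cannot remember my password")
print(intent, "->", answer)
```

The classifier only ever chooses among known intents, so a production system would also need a fallback for inputs that match none of them well.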


Modern bots are more complex. They evaluate the whole dialogue (or large parts of it). In addition, some have the capacity to generate text.

The data needed for building such a system is, for instance, a collection from a company FAQ with variations for each question-answer pair type. For examples with many q-a pairs, like the banking FAQ, you can consider similarity-based approaches, where the input text is matched with the most "similar" predefined questions and the user then picks one upon request ("did you mean xyz-question?"). Alternatively, one could create paraphrased versions of the questions for each question-answer pair to have more training examples. This can be done manually or

The overall architecture is as follows:
- Given a free text/prompt, predict which question is asked (intent)
- Pick a corresponding answer (e.g. at random out of 2-3) to simulate dialogue

In this notebook, we will first use TFIDF-Logit, then standard spaCy vectors and finally Floret (new spaCy vectors that combine a new efficient compression with fastText, which helps overcome typos by including subword elements in the model)

## Word vectors (intuition)