## Chapter 1
# Theoretical Concepts of Network Analysis
- Network
    - <mark>Set of links (ties or edges) and objects (nodes or vertices)</mark>.
    - Can be <u>represented as a graph or matrix to show connections between nodes</u>. $\rightarrow$ Allows applying a variety of mathematical, computational, and statistical techniques to extract and analyze the main features of the network.

## Sociological Meaning of Network Relations
- Sociologically, a network is a <mark>set of relations between actors</mark>.
- Relations between nodes in a network are called <u>ties</u>.
    - Ties can represent kinship, affection, enmity, exchange of favors or loans, membership in a club, or attendance at certain events together.
    - Can have <u>direction</u>.
    
    
- Social networks
    - Theoretical models to analyze and visualize the relations between actors.
    - Relations between individuals may vary in intensity.
    - More than one kind of relation among individuals.
    
    
- Another important concept that is related to relations among actors within social networks is <u>transitivity</u>(추이성):
    - Suppose that we know that actor $a$ knows actor $b$ and actor $b$ knows actor $c$. Does that also imply that actor $a$ also knows actor $c$?


- The relational structure of a social network
    - A graph of nodes $N = \{1, 2, 3, \ldots, n\}$, where pairs of nodes may be connected by an edge.
    - An adjacency matrix $X$ in which $X_{ij}$ is the relationship between node $i$ (sender) and node $j$ (receiver).
    - $X_{ij}$ can take a binary value, 1 or 0, indicating the presence or absence of the relationship between $i$ and $j$.
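
The adjacency-matrix representation can be sketched in a few lines of plain Python. The edge list and the node count below are made-up illustration data, not taken from the chapter; rows are senders and columns are receivers.

```python
# Hypothetical directed edge list (sender, receiver) on nodes 1..4.
edges = [(1, 2), (2, 3), (3, 1), (3, 4)]
n = 4

# X[i][j] = 1 if node i+1 sends a tie to node j+1, otherwise 0.
X = [[0] * n for _ in range(n)]
for sender, receiver in edges:
    X[sender - 1][receiver - 1] = 1

for row in X:
    print(row)
```

Because the ties are directed, the matrix is not symmetric: the entry for (1, 2) is 1 while the entry for (2, 1) is 0.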

## Network Measurements
1. Network connection - transitivity, multiplexity, homophily(동종애), dyads and mutuality, balance and triads, and reciprocity
2. Network distribution - the distance between nodes, degree centrality, closeness centrality, betweenness centrality, eigenvector centrality, PageRank, geodesic distance and shortest path, eccentricity, and density
3. Network segmentation - cohesive subgroups, cliques, clustering coefficient, k-cores, core/periphery, blockmodels, and hierarchical clustering.

### Network Connection
- Also called connectivity.
- Refers to the ability to move from one node to another in a network.
- Can be measured as the ratio between route distance and geodesic distance.
- Can be calculated locally (for a part of the network) and globally (for the entire network).

### Transitivity(추이성)
- A network property that refers to the extent to which a <u>relation between two nodes is transitive</u>.
- In social networks,
    - it is a <u>very important measure</u> but less important in other types of networks.
    - the term transitivity reflects the <u>friend-of-a-friend</u> concept. 
- Sometimes used as a synonym of <u>whole-network [clustering coefficient](https://lsy5518.wordpress.com/2014/02/19/네트워크-집단화-계수-network-clustering-coefficient/)</u>.
<img src="Fig1.1.png" width=500>
$$\textit{Transitivity} = \frac{3 \times \text{number of triangles}}{\text{number of connected triples}}$$

- In real social networks, the transitivity index commonly lies between 0.3 and 0.6.
- The transitivity of a graph is strongly related to the clustering coefficient as each of them measures the relative frequency of triangles. 
- Together, they are the most common measures of how prevalent triangles are in a network.
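
As a rough sketch of the transitivity formula, the snippet below counts connected triples and closed triples in a small undirected graph. The graph (a triangle with one extra node attached) and the function name `transitivity` are illustrative assumptions.

```python
from itertools import combinations

def transitivity(adj):
    """adj maps each node to the set of its neighbors (undirected graph)."""
    closed = 0   # closed triples; equals 3 * number of triangles
    triples = 0  # connected triples, counted once per center node
    for node, neighbors in adj.items():
        for a, b in combinations(neighbors, 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0

# Hypothetical graph: triangle a-b-c, plus node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(transitivity(adj))
```

Here there is one triangle (three closed triples) and five connected triples, so the index is 3/5 = 0.6.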

### Multiplexity
- Ties can have <u>strength</u>. Strength can take <u>the form of the intensity of the relationship</u> or of the number of different types of content that the tie carries.
- Ties with more than one dimension are called “multiplex” ($\leftrightarrow$ “uniplex”).
    - Multiplex ties are considered stronger than uniplex ties.
    
<img src="Fig1.2.png" width=500>
- Sociograms
    - Good way of looking at the picture of the social structure.
    - Not easy to analyze, in particular for large networks.
    - A better way to do the analysis of the above network is by using <u>matrices</u>.
<img src="Table1.1.png" width=500>
<img src="Table1.2.png" width=500>

### Homophily(동종애?)
- Homophily is the tendency of individuals to connect with others who share the same attitudes and beliefs.
- Homophily in the context of online social networking can be understood from the similarity of users who are using the network in terms of age, educational background, region, or profession.

### Dyads and Mutuality
- A dyad is a pair of actors in a network potentially connected by a social relation.
- It is the simplest possible social group in which the transfer of information between the two nodes can be fast and easy.
- Dyads are among the concepts that characterize social networks.
- Each dyad can be in one of three cases: a positive relationship, a negative relationship, or no relationship between the nodes of the dyad (empty dyad).

<img src="Fig1.3.png" width=500>

### Balance and Triads
- A triad
    - A network structure consisting of three actors and three dyads.
    - Four different types of relationships can be identified, depending on the number of negative relationships between nodes:
        1. a friend of my friend is my friend, 
        2. an enemy of my enemy is my friend, 
        3. a friend of my friend is my enemy, and 
        4. an enemy of my enemy is my enemy.
    
- Structural balance theory
    - The key concept in many applications of the signed graph theory.
    - Tries to explain how specific patterns of negative and positive signs result in different kinds of relationships.
    - Triads with an odd number of “+” edges are balanced, while triads with an odd number of “-” edges are unbalanced.
    - Imbalanced graph configurations usually create stress for the individuals located in them.
    - Together with signed graphs, it is very helpful in predicting and explaining changes in friendship and animosity in human communities, which is useful in the analysis of enmity in tribal wars, political conflicts, or international relations.
    
<img src="Fig1.4.png" width=500>
- Examples of signed graphs include friend/foe, trust/distrust or like/dislike, esteem/disesteem, praise/blame, influence/negative influence, etc.
    - In figure a, all the three actors have positive feelings, and there is no place for conflict among them. The configuration is coherent and lacks inner tensions between members.
    - Figure b is also stable since two actors (B and C) share the same negative feeling towards actor A, but they like each other.
    - Figure c is unstable because actors A and B have a negative feeling towards each other, while both have a positive feeling towards actor C, who has to divide its loyalty between the other two actors.
    - Figure d is also unstable and will eventually break down, as it has an odd number of negative signs.
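
The balance rule can be checked mechanically: a triad is balanced when the product of its three edge signs is positive. The sketch below encodes the four configurations described above as sign triples. The encoding of figure d as three negative edges is an assumption based on the description, and `is_balanced` is a hypothetical helper name.

```python
def is_balanced(signs):
    """signs: the +1 / -1 edge signs of one triad."""
    product = 1
    for sign in signs:
        product *= sign
    # A positive product means an even number of negative edges.
    return product > 0

# Assumed sign patterns for the four configurations (a to d).
triads = {
    "a": (+1, +1, +1),  # everyone likes everyone
    "b": (+1, -1, -1),  # B and C like each other, both dislike A
    "c": (+1, +1, -1),  # A and B dislike each other, both like C
    "d": (-1, -1, -1),  # assumed: all three relations negative
}
for name, signs in triads.items():
    print(name, "balanced" if is_balanced(signs) else "unbalanced")
```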

### Reciprocity
- A measure of the tendency towards building mutually directed connections between two actors.
- Can be an indicator of the importance of a relationship between two actors.
- On Facebook, friendship is mutual, so reciprocity is 1 (or 100%); on Twitter, following is one-directional, so reciprocity is generally below 1.
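
One common way to quantify reciprocity, used here as an assumption since the notes give no formula, is the fraction of directed edges whose reverse edge also exists. The two edge lists are invented examples of a follow-style and a friendship-style network.

```python
def reciprocity(edges):
    """Fraction of directed edges (u, v) for which (v, u) is also present."""
    edge_set = set(edges)
    if not edge_set:
        return 0.0
    mutual = sum(1 for (u, v) in edge_set if (v, u) in edge_set)
    return mutual / len(edge_set)

# Hypothetical "follow" network: only ann and bob follow each other.
follows = [("ann", "bob"), ("bob", "ann"), ("ann", "cat"), ("dan", "ann")]
# Hypothetical "friendship" network: every tie is stored in both directions.
friends = [("ann", "bob"), ("bob", "ann"), ("ann", "cat"), ("cat", "ann")]

print(reciprocity(follows))
print(reciprocity(friends))
```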

## Network Distribution

### Distance Between Two Nodes
- Distance is a network metric that allows the calculation of the number of edges between any pair of nodes in a network.
- Measuring distances between nodes in graphs is critical for many implementations like graph clustering and outlier detection.

### Degree Centrality(연결 중심성)
- In the degree centrality metric, the importance of a node is determined by <u>how many nodes it is connected to</u>.
- It is a measurement of <u>the number of direct links</u> to other actors in the network.
- It is a local measure.
- Actors in social networks with high degree centrality serve as <u>hubs</u> and as <u>major channels of information</u>.
- A node $i$’s degree centrality $d(i)$ can be formulated as
$$d(i) = \sum_j m_{ij}$$
where $m_{ij} = 1$ if there is a link between nodes $i$ and $j$ and $m_{ij} = 0$ if there is no such link. 
- For directed networks, it is important to differentiate between the <u>in-degree centrality</u> and the <u>out-degree centrality</u>.
<img src="Fig1.5.png" width=500>


- People with the highest degree centrality are often third parties and deal makers who are able to benefit from this brokerage.
- For directed networks, in-degree is often used as a proxy for popularity.
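
The formula $d(i) = \sum_j m_{ij}$ amounts to a row sum of the adjacency matrix, and for a directed network the column sum gives the in-degree. A minimal sketch on a made-up 4-node directed matrix:

```python
# Hypothetical directed network: m[i][j] = 1 if there is a link from i to j.
m = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
]
n = len(m)

# Out-degree: sum over row i. In-degree: sum over column i.
out_degree = [sum(m[i][j] for j in range(n)) for i in range(n)]
in_degree = [sum(m[j][i] for j in range(n)) for i in range(n)]

print("out-degree:", out_degree)
print("in-degree: ", in_degree)
```

In this example node 0 has the highest out-degree, while node 2 has the highest in-degree and would be read as the most "popular" node.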

### Closeness Centrality(근접 중심성)
- Defined as <u>how close</u> other actors are to a particular actor.
- The sum of the geodesic distances of a node to all other nodes in the network.
- Closeness centrality, $C(i)$,
$$C(i)=\sum_j d_{ij}$$
where $d_{ij}$ is the geodesic distance from node $i$ to node $j$ (number of links in the shortest path from node $i$ to node $j$).
- It is important for <u>understanding information dissemination</u> in networks, because the distance between a particular node and the others affects how that node can receive information from, or send information (e.g., gossip) to, other nodes.
- Because closeness centrality is based on the distance between network nodes, it can be considered the inverse of centrality because <u>large values refer to lower centrality</u>, whereas <u>small values refer to high centrality</u>.
<img src="Fig1.6.png" width=500>
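
A small sketch of $C(i)=\sum_j d_{ij}$ using breadth-first search to obtain geodesic distances. It assumes an unweighted, connected graph; the 4-node path used as input and the function names are illustrative only.

```python
from collections import deque

def geodesic_distances(adj, source):
    """Breadth-first search: number of links on the shortest path to each node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def closeness_sum(adj, i):
    """C(i) as defined above: sum of geodesic distances from i to all others."""
    return sum(geodesic_distances(adj, i).values())

# Hypothetical path graph a - b - c - d.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
for node in adj:
    print(node, closeness_sum(adj, node))
```

The end nodes get the larger sum (6) and the middle nodes the smaller one (4), which matches the reading that small values mean high centrality.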

### Betweenness Centrality(매개 중심성)
- Can be described as <u>how important</u> an actor is as a link between different networks.
- It represents the <u>number of times</u> an actor needs to pass via a given actor to reach another actor.
- Nodes with high betweenness centrality <u>control the flow of information</u> because they <u>form critical bridges</u> between other actors or groups of actors.
- Betweenness centrality of node $i$ is calculated as follows:
$$b(i)=\sum_{j,k}\frac{g_{jik}}{g_{jk}}$$
where $g_{jk}$ is the number of shortest paths from node $j$ to node $k$ ($j, k \neq i$) and $g_{jik}$ is the number of shortest paths from node $j$ to node $k$ passing through node $i$.
<img src="Fig1.7.png" width=500>
- Betweenness centrality shows which nodes are likely <u>pathways of information</u> and can be used to determine how a graph will break apart if nodes are removed.
- In trust networks, the key concept is betweenness centrality, because this measure describes the potential of a particular individual to control the communication among other individuals in a network.
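
The betweenness formula can be evaluated directly on small graphs by counting shortest paths with breadth-first search. The sketch below is a brute-force illustration, not an efficient algorithm; it assumes an undirected, unweighted graph and sums over unordered pairs $(j, k)$, and both example graphs are invented.

```python
from collections import deque
from itertools import combinations

def bfs_counts(adj, s):
    """Return (dist, sigma): distances from s and number of shortest paths."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma

def betweenness(adj):
    info = {s: bfs_counts(adj, s) for s in adj}
    b = {i: 0.0 for i in adj}
    for j, k in combinations(adj, 2):
        dist_j, sigma_j = info[j]
        if k not in dist_j:
            continue  # j and k are not connected
        for i in adj:
            if i in (j, k) or i not in dist_j:
                continue
            dist_i, sigma_i = info[i]
            # i lies on a shortest j-k path iff the distances add up.
            if k in dist_i and dist_j[i] + dist_i[k] == dist_j[k]:
                b[i] += sigma_j[i] * sigma_i[k] / sigma_j[k]  # g_jik / g_jk
    return b

path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
square = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
print(betweenness(path))
print(betweenness(square))
```

On the path, only the two inner nodes lie between others. On the square, each opposite pair has two equally short routes, so every node receives a fractional share.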

### Eigenvector Centrality(고유벡터 중심성)
- Eigenvector centrality measurement <u>describes the centrality of a person</u> with regard to the global structure of the network.
- Assigns relative scores to all nodes in the network based on the concept that connections to high-scoring nodes contribute more to the score of the node in question than connections to low-scoring nodes.
- Measures the extent to which a node is <u>connected to well-connected nodes</u>.
- Computed by taking the principal eigenvector of the adjacency matrix.
<img src="Fig1.8.png" width=500>
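
The principal eigenvector can be approximated with power iteration. The sketch below is one possible approach under stated assumptions: the graph is undirected and connected, the iteration uses $A + I$ instead of $A$ (same eigenvectors, but it avoids oscillation on bipartite graphs), and scores are scaled so that the largest is 1. The 4-node matrix is made up.

```python
def eigenvector_centrality(A, iterations=100):
    """Power iteration on A + I; returns scores scaled to a maximum of 1."""
    n = len(A)
    x = [1.0] * n
    for _ in range(iterations):
        # (A + I) x: each node keeps its own score and adds its neighbors' scores.
        new = [x[i] + sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        largest = max(new)
        x = [value / largest for value in new]
    return x

# Hypothetical graph: triangle 0-1-2, plus node 3 attached to node 2.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
scores = eigenvector_centrality(A)
print([round(s, 3) for s in scores])
```

Node 2 scores highest, and node 3 still receives a noticeable score despite having a single link, because that link goes to the best-connected node.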

### PageRank
- PageRank, which is a <u>variant of the eigenvector centrality measure</u>, calculates the importance of a Web page by considering the probability that a user visits this page based on the hyperlinks.
- The importance of a page is determined <u>by the importance of the pages that link to it</u>.
- For a directed network $G$ with adjacency matrix $A$, the PageRank score of a page $u$ is determined iteratively by the scores of its incoming neighbors:
$$\mathrm{PR}(u)=\frac{1-\alpha}{N}+\alpha\sum_v \frac{A_{vu}\,\mathrm{PR}(v)}{d_{out}(v)}$$
where $\alpha$ is a damping factor that has a value between 0 and 1, $N$ is the total number of nodes, and $d_{out}(v)$ is the out-degree of $v$ (its number of outgoing links).
- In PageRank, nodes are Web pages, links are Web links, and the state is the temporary “importance” of each node. It is calculated by forcibly assigning positive non-zero weights to all pairs of nodes to make the entire network strongly connected. Its coefficient matrix is a transition probability matrix (TPM) that can be obtained by dividing each column of the adjacency matrix by the number of 1’s in that column (this assumes the convention in which column $j$ holds the outgoing links of node $j$).
<img src="Fig1.9.png" width=500>


- Calculation: the TPM of a strongly connected network always has exactly one dominant eigenvector, with eigenvalue $\lambda = 1$. This eigenvector gives the equilibrium distribution of the population over the WWW. So, solving $x = Tx$, where $T$ is the TPM, yields the PageRank of all the Web pages on the World Wide Web.


- The PageRank algorithm gives nontrivial results only for asymmetric networks.
- In its basic form, PageRank is vulnerable to misuse by some people, and hence it needs to be regularly and secretly adjusted.
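
The iterative definition of $\mathrm{PR}(u)$ translates almost line by line into code. This is a minimal sketch under assumptions: $\alpha = 0.85$ is a commonly used value rather than one given in the notes, the 4-page link matrix is invented, and every page is assumed to have at least one outgoing link (otherwise $d_{out}(v)$ would be zero).

```python
def pagerank(A, alpha=0.85, iterations=100):
    """A[v][u] = 1 if page v links to page u. Returns the list of PR scores."""
    n = len(A)
    d_out = [sum(row) for row in A]  # out-degree of every page
    pr = [1.0 / n] * n               # start from a uniform distribution
    for _ in range(iterations):
        pr = [
            (1 - alpha) / n
            + alpha * sum(A[v][u] * pr[v] / d_out[v] for v in range(n))
            for u in range(n)
        ]
    return pr

# Hypothetical Web graph: 0->1, 0->2, 1->2, 2->0, 3->2.
A = [
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
]
pr = pagerank(A)
print([round(p, 4) for p in pr])
```

Page 3 has no incoming links, so its score stays at the baseline $(1-\alpha)/N$, while page 2, which is linked from three pages, ends up with the highest score.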

### Geodesic Distance and Shortest Path
- For <u>directed graphs</u>, the geodesic distance $d$ between two nodes $(u, v)$ is defined as the number of edges on the shortest path between them, where the shortest path is the path from $u$ to $v$ with the minimum number of edges.
- For <u>undirected graphs</u>, 
    - the distance $d$ from $u$ to $v$ is the same distance as from $v$ to $u$ (i.e., $d(u, v) = d(v, u)$) because all paths can be reversed. 
    - the distance from one node to itself is always zero (i.e., $d (u,u) = 0$), and the distance between two adjacent nodes (i.e., there exists only one edge between $u$ and $v$) is one ($d (u,v) = 1$). 
    - If there is no path between $u$ and $v$, then $d(u, v) = \infty$.
- Many important network measurements are based on the concept of geodesic distance and shortest path.

### Eccentricity(이심률, 離心率)
- The <u>maximum distance from a given node to all other nodes</u> in a network.
- A popular <u>path-based measure</u> that gives value to each node in a network based on its direct and indirect connections.
- If we calculate the <u>maximum eccentricity</u>, which is the maximum distance between nodes, we get what is called “<u>diameter</u>.” 
- The eccentricity $E$ of a node $u$ is defined as the maximum distance between $u$ and all other nodes. This means that we need to compute the geodesic distance from $u$ to all other nodes and choose the maximum value.
$$E(u)=\max_{v\in V}d(u,v)$$

- Calculating eccentricity is important for understanding <mark>how information can propagate from one node to other nodes</mark>.
<img src="Fig1.10.png" width=500>
    - The maximum eccentricity (the network diameter) is 4.
    - Node $D$ has eccentricity 2, which means that the furthest nodes from $D$ are located 2 hops away. Information propagated by node $D$ can pass through nodes $E$ and $B$ in the smallest number of steps to reach the furthest nodes in the graph.
    - Node $A$, for example, has eccentricity 4, which means information needs to pass through at least three nodes ($B$, $D$, and $E$) and four edges to reach the furthest nodes.
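
Eccentricity and diameter follow directly from breadth-first search. The sketch below assumes a connected, unweighted graph; the 5-node path is a made-up example and is not the graph from the figure.

```python
from collections import deque

def eccentricity(adj, u):
    """E(u): the largest geodesic distance from u to any other node."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return max(dist.values())

# Hypothetical path graph 1 - 2 - 3 - 4 - 5.
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
ecc = {u: eccentricity(path, u) for u in path}
diameter = max(ecc.values())  # maximum eccentricity

print(ecc)
print("diameter:", diameter)
```

The middle node has the smallest eccentricity (2) and the two end nodes the largest (4), which is also the diameter.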

### Density
- Density is defined as <u>the degree</u> to which network nodes are <u>connected one to another</u>.
- It can be used as a measure of <u>how close a network is to complete</u>.
- In the case of a complete graph, density = 1.
- The density $D(G)$ of a graph $G$ with $n$ nodes takes values in the range $[0, 1]$. Since the possible number of edges is $n(n-1)/2$:
$$D(G) = \frac{2m}{n(n-1)}$$
where $m=|E|$ is the number of edges in $G$.


- In ego-centric networks, consider the links around the focal node. The goal of such an analysis would be to explore the influence of the focal node on its neighbors in the subgroup to which the node belongs.
- In socio-centric networks, consider the whole graph and the network constraints as well.


<img src="Fig1.11.png" width=500>
- Graph a has $m = 4$ edges, resulting in a density $D(G)= (2\times 4)/ (5\times 4) = 0.4$.
- In graph b, $m=6$, resulting in density $D(G)= (2\times6)/(5\times4) = 0.6$. 
- Finally, in graph c, $m = 10$ , resulting in density $D(G)=(2\times10)/(5\times4) = 1$, which means that the graph is <u>fully connected</u>.
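
The three worked values can be reproduced with a one-line function for undirected graphs. The name `density` is a hypothetical helper; $n = 5$ and the edge counts come from the examples above.

```python
def density(n, m):
    """D(G) = 2m / (n (n - 1)) for an undirected graph with n nodes and m edges."""
    return 2 * m / (n * (n - 1))

# Graphs a, b, and c all have n = 5 nodes.
for name, m in [("a", 4), ("b", 6), ("c", 10)]:
    print(name, density(5, m))
```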


- **[Q]** What is the typical density of online social networks?

## Network Segmentation

### Cohesive Subgroups
- Cohesive groups are <u>communities</u> in which the nodes are connected to others in the same group <u>more frequently</u> than they are to those outside the group, allowing all of the members of the group to reach each other.
- Within such a <u>highly cohesive group, members</u> tend to have <u>strong homogeneous beliefs</u>.
- Cohesive group analysis
    - Clique: a maximal complete subgraph.
    - K-core: a subgraph in which every node has degree greater than or equal to $k$.

### Cliques
- A graph (or subgraph) in which <u>every node is connected to every other node</u>. In social networks, a clique is a social grouping in which all individuals know each other.
- A <u>maximal clique</u> is a clique that is <u>not a subset</u> of any other clique in the graph.
<img src="Fig1.12.png" width=500>
- <u>Finding maximum cliques</u> in a network is an interesting and highly applicable problem in <u>analyzing social relationships</u> since it provides the largest set of common friendships.
- By comparing the sizes of the maximum friendship cliques over two social networks, we may <u>get an idea about some aspects of group dynamics, such as teamwork, trust, and productivity</u>.

### K-Cores
- In undirected graphs, a k-core is a connected maximal induced subgraph having a <u>minimum degree greater than or equal to $k$</u>.
- This means that each node has ties to <u>at least $k$ other nodes</u>. 
- K-cores can be used as <u>a way to find cohesive subgroups</u> in graphs.
- K-core can be used by researchers as a sampling technique for <u>collecting data from social networks</u>.
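
A common way to find the k-core, sketched here as an assumption since the notes do not name an algorithm, is to repeatedly remove nodes whose degree is below $k$ until none remain. The function returns the surviving node set, which may consist of more than one connected piece; the example graph is invented.

```python
def k_core(adj, k):
    """Return the set of nodes that survive repeated removal of degree < k nodes."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    changed = True
    while changed:
        changed = False
        for u in list(adj):
            if len(adj[u]) < k:
                for v in adj[u]:
                    adj[v].discard(u)  # removing u lowers its neighbors' degrees
                del adj[u]
                changed = True
    return set(adj)

# Hypothetical graph: triangle a-b-c, with a tail c - d - e.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
print(sorted(k_core(adj, 2)))
```

Removing `e` drops the degree of `d` below 2, so `d` is removed in turn, and only the triangle remains as the 2-core.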

### Clustering Coefficient(결집 계수)
- Clustering coefficient is a measure of <u>how much nodes tend to form dense subgraphs</u>.
- For SN, this can be interpreted as the probability that two friends of a single person are also themselves friends.
- The clustering coefficient of social networks (SN) is typically higher than that of random networks (RN).

<img src="Fig1.13.png" width=500>
- The clustering coefficient $C_i$ of a node $i$, the <u>local (node)</u> clustering coefficient:
$$C_i=\binom{k_i}{2}^{-1} T(i)=\frac{2T(i)}{k_i(k_i-1)}$$
where $T(i)$ is the number of distinct triangles through node $i$, $k_i$ is the degree of node $i$, and $k_i(k_i-1)/2$ is the maximum number of possible connections among the neighbors of $i$.
- <u>A large $C$</u> implies that the network is <u>well connected locally</u> to form a cluster.
- Network average clustering is defined as:
$$C=\frac{1}{n}\sum_{i\in V} C_i$$
- The computation of the node clustering coefficient $C_i$ can be done in time $O(n^2)$ in the worst case via counting <u>all edges connected directly to node $i$</u>.
- The network clustering coefficient can be computed by counting all closed triplets. A brute-force approach examines every combination of three nodes, with time complexity $O(n^3)$.
- Another approach proposes the use of fast matrix multiplication on the adjacency matrix representation of graph $G$ to solve triplet finding, counting, and node counting in $O(n^{2.376})$ time and $O(n^2)$ space.
- In bipartite graphs, where triangles do not exist, clustering coefficient is trivially equal to zero.
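
The local and average clustering coefficients can be computed directly from the definitions. In this sketch, nodes with fewer than two neighbors are given $C_i = 0$, which is a convention assumed here rather than stated in the notes; the example graph is made up.

```python
from itertools import combinations

def local_clustering(adj, i):
    """C_i = 2 T(i) / (k_i (k_i - 1)), with C_i = 0 when k_i < 2 (assumed)."""
    k = len(adj[i])
    if k < 2:
        return 0.0
    # T(i): pairs of neighbors of i that are themselves connected.
    triangles = sum(1 for a, b in combinations(adj[i], 2) if b in adj[a])
    return 2 * triangles / (k * (k - 1))

def average_clustering(adj):
    return sum(local_clustering(adj, i) for i in adj) / len(adj)

# Hypothetical graph: triangle a-b-c, plus node d attached to c.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
for node in sorted(adj):
    print(node, round(local_clustering(adj, node), 3))
print("average:", round(average_clustering(adj), 3))
```

Nodes `a` and `b` have all their neighbors connected ($C_i = 1$), while only one of the three neighbor pairs of `c` is connected ($C_i = 1/3$).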

### Core/Periphery
- <b>The core</b>, in which nodes are embedded in a coherent subgraph
- <b>The periphery</b>, in which nodes are loosely connected
- The study of dynamic behavior of networks that evolve over time is called <u>dynamic network analysis (DNA)</u>, and it differs from traditional social network analysis, which proposes that the network is static during analysis. While DNA is <u>more concerned with the activity of network actors and their interactions</u>, SNA concentrates on the structural properties of a network.

### Blockmodels
- Blockmodeling is <u>an analytic method that uses data partitioning in a social network</u> to classify actors based on their patterns of ties to others.
- The general idea of blockmodels is that <u>nodes are partitioned (based on some equivalence measure) into discrete subgroup positions</u>.
- Several equivalence measures are used for this purpose such as <u>the structural equivalence, the regular equivalence, and the stochastic equivalence</u>.
- Because blockmodeling has <u>the benefit of identifying individuals who can play the same role in the same or a different population</u>, it was used <u>to identify role structures among business managers using social ties</u>. It was also used <u>to understand the structure of co-citation networks and the structure of personality types</u> in small groups.

### Hierarchical Clustering
- One of the earliest methods for clustering data using <u>a similarity matrix</u>.
- It is part of a wider classification criterion for clustering algorithms: hierarchical (e.g., hierarchical clustering) vs. non-hierarchical (e.g., k-means).
- Agglomerative version
    - Starts by placing each data item in its own cluster; clusters are then compared using a distance metric.
    - The closest pair of clusters is then merged by moving up the hierarchy into a single cluster, while the correlation between data items of different clusters decreases as we move from the bottom (leaves) to the top (root). 
    - This step is repeated until all items are clustered into one single cluster.
    - This algorithm is <u>commonly used</u> by the social community <u>for community detection</u>.
    - Finding out which clusters to merge can also be done in different ways.
        - <u>The complete-linkage method</u> 
            - Considers <u>the distance between two clusters to be equal to the greatest distance</u> from any data item in one cluster to any data item in the other cluster. 
            - Not used in general because it is highly sensitive to outliers.
        - <u>The single-linkage method</u> 
            - Considers the distance between any two clusters to be equal to <u>the shortest distance</u> from any data item in the first cluster to any data item in the other cluster. 
            - Although it suffers from the so-called “chain effect” and is less suited to detecting spherical clusters, it is still more commonly used because it can detect elongated and irregular clusters.
        - <u>The average-linkage clustering method</u> considers the distance between one cluster and another to be equal to <u>the average distance</u> from any data item in the first cluster to any data item in the other cluster.
- Divisive version
    - Does the reverse such that it starts with one single cluster and subdivides it into smaller pieces
- The agglomerative version is more commonly used than the divisive one.
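
The agglomerative procedure and the effect of the linkage rule can be sketched on one-dimensional data. Everything here is illustrative: the points are invented, distance is the absolute difference, and passing Python's built-in `min` or `max` as the `linkage` argument selects single or complete linkage.

```python
def agglomerative(points, linkage=min):
    """Merge the closest pair of clusters until one remains.

    linkage=min gives single linkage, linkage=max gives complete linkage.
    Returns the list of (merged cluster, merge distance) in merge order.
    """
    clusters = [[p] for p in points]  # every item starts in its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(abs(x - y) for x in clusters[a] for y in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merged = clusters[a] + clusters[b]
        merges.append((sorted(merged), d))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges

points = [1, 2, 6, 7, 15]  # hypothetical 1-D data
for cluster, height in agglomerative(points, linkage=min):
    print("single  ", cluster, height)
for cluster, height in agglomerative(points, linkage=max):
    print("complete", cluster, height)
```

Both rules merge the same pairs first, but the merge distances differ: single linkage joins `[1, 2]` and `[6, 7]` at distance 4, while complete linkage joins them at distance 6.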

## Recent Developments in Network Analysis

### Community Detection
- Aims to discover subsets of nodes (clusters, or communities) in which nodes have many internal edges and few external edges.
- Identifying communities in graphs is feasible only if the graph is sparse, which means that the number of links $m$ is of the same order as the number of nodes $n$ ($m \approx n$).
- Detecting communities in a graph is NP-complete.
- Formally, the problem of community detection can be given as follows: given a graph $G = (V, E)$, a community $C$ can be defined as a subgraph of $G$ comprising a set $V_C \subseteq V$ of objects that share some similarity.
- <u>K-cliques</u>, <u>modularity optimization</u>, and <u>link communities</u> are some of the popular methods in the stream of community detection.
- The most popular algorithm for identifying communities in networks was proposed by Girvan and Newman who identified three types of betweenness measures: <u>geodesic edge betweenness</u>, <u>random-walk edge betweenness</u>, and <u>current-flow edge betweenness</u>.

### Link Prediction
- About inferring the links between nodes based on their attributes and the global patterns of links in the social network.
- Because links do not form randomly, we can infer that two nodes that share some similarity are more likely to be linked.
- The link prediction problem can be formally given as follows: given a network graph $G = (V, E)$, the task here is to predict whether there will be a link between two nodes $u$ and $v$, where both $u$ and $v\in V$ and $e(u, v)\notin E$.
- Finding node similarities can be done using measures for analyzing the proximity of nodes in a network such as <u>degree distribution</u>, <u>common neighbors</u>, <u>preferential attachment</u>, <u>Jaccard coefficient</u>, <u>Leicht-Holme-Newman Index</u>, and <u>Katz Index</u>.
- Node-based methods score candidate links using such node-similarity measures.
- <u>Likelihood-based methods</u> make an assumption about the network organization and structure and <u>use that assumption to predict missing links</u>.
- Other approaches include matrix factorization and feature-based classification; the latter selects a number of features from the network's topological or non-topological features such that <u>the chosen features effectively describe the link likelihood</u>.
- In general, methods that <u>use features to train a binary classification model</u> consider the link prediction problem <u>as a supervised classification task</u>, while methods that are based on node similarity look at link prediction as an unsupervised task.
- The task of link prediction can be classified into three categories: 
    - predict missing or unobserved links in a network,
    - predict the links that are likely to be formed in the near future,
    - predict whether there will be an interaction between a pair of nodes with a previously observed association.
- Examples
    - The study of the mechanisms that control the formation and deletion of nodes in these networks. 
    - E-commerce systems for building product recommendations.
    - In the healthcare sector, identify the interactions between drugs and proteins or between diseases and proteins.
    - In social networks, using the similarity of users to predict or recommend new friendships.
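
Two of the similarity measures listed above, common neighbors and the Jaccard coefficient, are simple enough to sketch for the unsupervised, node-similarity view of link prediction. The friendship graph and the function name `rank_candidate_links` are hypothetical.

```python
from itertools import combinations

def rank_candidate_links(adj):
    """Score every unlinked pair (u, v) by common neighbors and Jaccard."""
    scores = []
    for u, v in combinations(sorted(adj), 2):
        if v in adj[u]:
            continue  # already linked, nothing to predict
        common = adj[u] & adj[v]
        union = adj[u] | adj[v]
        jaccard = len(common) / len(union) if union else 0.0
        scores.append((u, v, len(common), jaccard))
    # Highest Jaccard coefficient first.
    return sorted(scores, key=lambda s: s[3], reverse=True)

# Hypothetical friendship graph with edges a-b, a-c, b-c, b-d, c-d, d-e.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
ranked = rank_candidate_links(adj)
for u, v, cn, jc in ranked:
    print(u, v, cn, round(jc, 3))
```

The pair `a`-`d` shares two neighbors out of three in total and is ranked as the most likely missing or future link, while `a`-`e` shares none.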

### Spatial Networks
- A graph in which nodes are embedded in a metric space, which means that the nodes are located in a space equipped with a particular metric.
- Examples
    - The Internet, as it consists of a set of nodes (i.e., routers) linked by edges (i.e., cables) with different lengths and latency times.
    - Drawing a map with various types of communication networks (e.g., cars and trains)
- If we want to build a spatial network, we need first to draw a graph (i.e., nodes and edges) of a system and then add the physical representations to the network, which will help better understand the importance of physical distances between elements to the operation of networks.

### Protein-Protein Interaction Networks
- A type of biological network.
- Nodes in these networks represent individual proteins, and the interactions between proteins (PPI) are the edges. Such interactions take the form of two or more proteins binding together to carry out a specific biological function. Analyzing PPI networks is essential for a full understanding of the molecular basis for most of the diseases.

### Recommendation Systems
- Software tools that give suggestions for items that are very likely to be of interest to a particular user.
- Providing recommendations to users can be done in one of two ways: content-based filtering or collaborative filtering. 
    - In content-based filtering (CBF), the user is recommended items similar to the ones he/she preferred in the past. 
    - Collaborative filtering (CF) uses a user’s past behavior (e.g., items previously purchased or rated) and similar decisions made by other users to predict items that the user may have an interest in.
    - CF is based on the premise that similar users like similar items.
- A recommendation system can be modeled as a <mark>two-mode network</mark>, since it has two types of nodes: users and items. 
    - A user-item network can be built according to what every user has bought/liked in the past. 
    - The relationship between two users becomes stronger if they have more things in common.
<img src="Fig1.14.png" width=450>
- A simple analysis process to divide that network into two subnetworks can be applied next to put the items that are similar in one group. 
- Recommendations are given next based on how strong that relationship is. 
- Recommendation systems of this type are also referred to as social recommendation systems, and they have roots in the theory of homophily (Fig. 1.14).
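
The user-item idea can be illustrated with a toy two-mode network. This is only a sketch of the principle, not a production recommender: the users, items, and scoring rule (an item's score is the summed tie strength of the users who liked it, where tie strength is the number of shared items) are assumptions made for the example.

```python
# Hypothetical two-mode network: each user is linked to the items they liked.
likes = {
    "u1": {"book", "pen", "lamp"},
    "u2": {"book", "pen", "desk"},
    "u3": {"lamp", "rug"},
}

def recommend(likes, user):
    """Rank items the user has not seen by the strength of ties to their fans."""
    scores = {}
    for other, items in likes.items():
        if other == user:
            continue
        strength = len(likes[user] & items)  # items in common = tie strength
        for item in items - likes[user]:
            scores[item] = scores.get(item, 0) + strength
    # Strongest first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))

print(recommend(likes, "u1"))
```

User `u1` shares two items with `u2` and one with `u3`, so the item liked by `u2` is ranked ahead of the item liked by `u3`.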

## iGraph
- Tools for social network analysis : <u>MuxViz</u>, <u>NetworkX</u>(used throughout this study), and <u>iGraph</u>.
- iGraph is a software package for the analysis and visualization of large networks. It allows analysts to perform tasks that are related to network analysis such as community discovery, cohesive groups, structural holes, dyad, triad, and motif count. 

In [None]:
# install.packages("igraph")

In [None]:
library(igraph)  # load igraph; stops with an error if it is not installed

In [None]:
# Directed graph on 9 vertices built from pairs of vertex ids (vertex 9 is isolated)
g <- make_graph(c(1,2, 1,3, 2,3, 3,4, 4,5, 4,6, 6,7, 6,8), n=9)

In [None]:
print(g)

In [None]:
plot(g, vertex.color="yellow")

In [None]:
# Regular tree with 30 vertices in which each vertex has 3 children
g <- make_tree(30, 3)
plot(g)

In [None]:
install.packages("rgl")

In [None]:
require(rgl)

In [None]:
require(tcltk)  # tkplot() needs the tcltk package

In [None]:
tkplot(g, layout=layout_with_kk, vertex.color="yellow")

In [None]:
rglplot(g, layout=layout_with_kk, vertex.color="cyan")