## Social Network Analysis

A social network is a set of connections among people. 
Networks can also exist between words in a document or set of documents. 
Networks can also exist between people and work products, 
as in a software version control system or a record of Wikipedia page edits. 

### Nodes

In a social network, "nodes" are people. 
I am a node. 
You are a node. 
In a large network, we need to decide which criteria qualify a person to exist "as a node". 
For example, if a social network is constructed from observations about people being in the same place at the same time, that presence is the criterion. 
If a social network is constructed from communication records, as in a software version control system or discussion forum, the criteria are different. 
For example, I might choose to show only nodes that have dyadic (two-way) communication. 
It is not enough for me to message you; you must message me back. 
If you do not, then I can decide I do not need to show your node in my analysis. 
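
To make the dyadic criterion concrete, here is a minimal sketch in R with igraph. The edge list and the names in it are hypothetical, invented only for illustration, and this is one possible way to apply the rule rather than the only one. It keeps edges that are answered in the opposite direction and then drops any node left without an edge.

```r
library(igraph)

# Hypothetical toy edge list: each row is one message from `from` to `to`.
toy_el <- data.frame(
  from = c("ann", "bob", "ann", "cat", "dan"),
  to   = c("bob", "ann", "cat", "ann", "ann"),
  stringsAsFactors = FALSE
)

g <- graph_from_data_frame(toy_el, directed = TRUE)

# which_mutual() returns TRUE for every edge that is answered by an
# edge in the opposite direction. Remove the edges that are not answered.
g_mutual <- delete_edges(g, E(g)[!which_mutual(g)])

# Drop the nodes that have no remaining edges.
dyadic <- delete_vertices(g_mutual, V(g_mutual)[degree(g_mutual) == 0])

dyadic_nodes <- sort(V(dyadic)$name)
print(dyadic_nodes)
```

In this toy data, "dan" messages "ann" but never gets a reply, so "dan" does not appear in the filtered network.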

### Edges

Edges are the connections between the nodes described above. 
Dyadic communication, as a concept, refers to "edges". 
The number, type, and frequency of edges between nodes underlie most of the measures of social networks and strongly influence how these networks are visualized. 

### Key Social Network Measures

Social network analysis is a substantial field of inquiry and statistical development. 
As with other families of statistical methods, the number of discrete statistics one can calculate is substantial. 
We will not try to boil this ocean. 
There are a few measures that are conceptually core to a lot of social network analysis.  
We will focus on those. 

#### 1. **Degree Centrality**: The number of connections (edges) between a node and other nodes. 
The higher the degree centrality, the more important, or central, a node is in a network. 
There are two kinds: **in degree** and **out degree**. In degree counts the edges that point to a node, and out degree counts the edges that start from it. In degree centrality is often a proxy for popularity. 
If people are frequently interacting with you, then you are likely "popular". 
The reasons for your popularity depend on the type of network. 
A General is likely popular because she has power. 
A famous actor is likely popular because he is interesting. 
Out degree centrality can be thought of as an indicator of gregariousness. 
If you message a lot of people, you might be considered outgoing. 
Outgoing. Out degree. In-portant. In degree. 
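
As a rough illustration of the two kinds of degree, the sketch below builds a small directed network and counts incoming and outgoing edges for each node. The names and edges are hypothetical, chosen so that one node ("eve") receives most of the messages.

```r
library(igraph)

# Hypothetical toy edge list: each row is one message from `from` to `to`.
toy_el <- data.frame(
  from = c("ann", "bob", "cat", "dan", "ann"),
  to   = c("eve", "eve", "eve", "eve", "bob"),
  stringsAsFactors = FALSE
)

g <- graph_from_data_frame(toy_el, directed = TRUE)

# In degree: how many edges point to each node ("popularity").
in_deg <- degree(g, mode = "in")

# Out degree: how many edges start from each node ("gregariousness").
out_deg <- degree(g, mode = "out")

print(data.frame(node = V(g)$name, in_degree = in_deg, out_degree = out_deg))
```

Here "eve" has the highest in degree and "ann" has the highest out degree, which matches the popular versus outgoing reading described above.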

#### 2. **Betweenness Centrality**: A person with high betweenness centrality acts as a broker between two networks. 
Formally, the measure counts how often a node lies on the shortest paths between other pairs of nodes. 
These people are often called "connectors". 
For example, if somebody has a strong network in the defense contractor industry, and a strong set of network connections inside the Pentagon, their betweenness centrality between those two networks will be measured as high compared to a person without strong connections in both networks. 
The image below illustrates a node with strong connections to two networks. 
It's the big red one in the middle.  

You will see there are other networks in the image that are not connected to this one, 
which means they are isolated from each other 
and have no nodes that connect them together. 
In this way, it illustrates how _high betweenness nodes_ are important for healthy social networks. 
We need people who reach across boundaries. 
Social Network Analysis ("SNA") helps us to identify those people. 

![Betweenness Centrality](../images/betweenness.jpg "Betweenness Centrality In a Nutshell")
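
The broker idea can be sketched with a toy network: two tightly knit groups of three, joined only through a single node. The node names (including "broker") are hypothetical, and the graph is treated as undirected to keep the example small.

```r
library(igraph)

# Hypothetical toy network: group A (a1, a2, a3) and group B (b1, b2, b3)
# are each fully linked internally and meet only through "broker".
toy_el <- data.frame(
  from = c("a1", "a1", "a2", "a3",     "broker", "b1", "b1", "b2"),
  to   = c("a2", "a3", "a3", "broker", "b1",     "b2", "b3", "b3"),
  stringsAsFactors = FALSE
)

g <- graph_from_data_frame(toy_el, directed = FALSE)

# betweenness() counts, for each node, the shortest paths between
# other pairs of nodes that pass through it.
btw <- betweenness(g)

print(sort(btw, decreasing = TRUE))
```

Every shortest path from group A to group B has to pass through "broker", so that node gets the highest score, while nodes such as "a1" that sit on no one else's shortest path score zero.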


### Expanding Conceptual and Historical Grounding in What Social Network Analysis "Means"

In some respects, Social Network Analysis can be dangerous and misleading. 
The visualizations you generate with SNA tools can create a compelling and lasting impression about the importance of different individuals in a larger social network. 
If you choose communication instances as your edge, more communication will indicate a more influential person. 
This may be reasonable in a communication network focused on coordinating the distribution of goods. 
In a network like Twitter or Facebook, however, more communication may not necessarily indicate a more influential person. 
Perhaps Uncle Sean just posts too darn much and nobody listens?

The reading below, by Freeman, provides an important conceptual grounding in social network analysis and what it means. **Chapters 1 and 2 are highly recommended because they will ground your thinking about these types of visualization in the original intent of the practice of social network analysis.** 
The remaining chapters walk you through the history of the discipline. 
The sometimes overwhelming number of measures and means of interpreting SNA emerges from this history, which I summarize as: 

####  1. 
In the beginning, sociologists in Iowa wanted to understand the **structure** of society on a local level. 
From this, they developed techniques for mapping relations between people in the real world. 

#### 2. 
In the middle, mathematicians developed sophisticated quantifications and discrete indicators of social network positions for individuals and social systems. 
This is important. 
Most SNA statistics apply to an entire network or to individuals in that network. A few apply to subsets of the network ("Triadic Closure", for example). 
But let's keep our introduction focused on individuals and networks. 

##### [READING (Chapters 1 and 2): The Development of Social Network Analysis](/static/PDF/Freeman-2004.pdf)

####   3. 
The book you are reading was published in 2004, just before online social networking sites (SNSs) took off. 
These days, if you want to understand social networking, social networking sites are at the center. 
With their emergence, the volume of connection types grew significantly. 
At first, it was useful and interesting to examine an individual's position in a single SNS, like Facebook or Twitter. 
Now, as in the days of the sociologists in Iowa, the ways and means people use to connect are diverse and unpredictable. 
We do not know all the locations of social gatherings; 
in this case, the "locations" are online places, and new ones are emerging each day. 
This does not render our work analyzing any one technology-mediated network useless. 
Rather, it highlights the practical importance of the analyst's awareness of missing data. 



### Further Reading

These readings will ground you first in the validity issues that arise when conducting social network analysis against electronic trace data, 
and second, in a way of being systematic in your analysis in the future. 
You will find the visualization tools straightforward to use (though you may sometimes feel overwhelmed by the sheer number of SNA libraries available in R). 
The decisions you make when using these tools should be informed by reflections on these two additional readings. 
This will help you ensure that you are presenting your information consumers with good information. 
Be skeptical and press ahead. 

  * [Validity Issues in Social Network Analysis](/static/PDF/SNA-ValidityIssues.pdf)
  * [A Systematic Methodology and Ontology for SNA on Electronic Trace Data](/static/PDF/SystematicMethodology.pdf)

## Visualizing our First Network

In [None]:
## In R, we first load the libraries that our code will call functions from. 
## In this case, we are using igraph and ggplot2
library(igraph)
library(ggplot2)

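## igraph has a few options we want to set. 
## With these set, printing a graph object also lists its vertex and edge attributes.
## (igraph_options is the current name; older igraph releases call it igraph.options.)
igraph_options(print.vertex.attributes = TRUE)
igraph_options(print.edge.attributes = TRUE)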

## Data Sets

The data set in this example is composed of discussions 
between students in an online course in
Australia, using the Moodle course management system.

The data has been cleaned up to take on a certain format 
that network analysis libraries 
generally understand. 

#### Data Structure
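 - Columns, in order: id,userid_from,created,userid_to,class_course_id
 - id - row id
 - userid_from - the "source" node
 - created - when the post was created (the sample values look like Unix timestamps, in seconds)
 - userid_to - the "target" node
 - class_course_id - A filterable, extra field we will use to narrow the scope of our analysis 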

## Sample Data: Note that it is anonymized. 
 -  769996,369819,1409542094,333567
 -  750705,374598,1408359941,374955



In [None]:
## Read in a list of files
## Sys.glob returns the matching paths without shelling out to "ls"
files <- Sys.glob("/dsa/data/all_datasets/netdata/mdl_forum_posts_scrubbed_snadata_with_courseid.csv")

## Note that the list of files in this case is a single file. Having this code as a foundation
## will enable you to run multiple network analyses
## in the future by iterating over a set of files. 

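for (i in seq_along(files)) {

    ## "i" is the iterator: the index of the current file in "files".
    ## Do not reassign it inside the loop, or every pass will reread the first file.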

    ## name the file you might save later (not this time)
    fileNamer = as.character(i)

    ## "el" is a common abbreviation for "edge list". 
    ## The data structure above, with "source" and "target" values is 
    ## often referred to as an "edge list"
    
    ## Typically, an edge list built from communication traces has one row
    ## for each instance of communication in the social network. 
    ## So, if I email you 10 times, 
    ## there will be 10 rows with me as the source and you as the target.
    ## If you email me back 9 times, there will be 9 additional rows 
    ## where you are the source and I am the target. 
    ## The "source" and "target" map to "out degree" and "in degree" 
    ## centrality measures, respectively. 
    
    ## An alternative network structure that some libraries process 
    ## (but **not** all, so be aware) is the weighted network. In a weighted network,
    ## the network structure contains an aggregation of the strength of 
    ## connection for each pair of nodes with instances of communication
    ## in each direction. Weighted networks are helpful when you have 
    ## multiple communication channels that you want to indicate 
    ## the relative strength and importance of. More on these later. 
    
    ## So, back to "el" : "Edge List" 
    
	el <- read.table(files[i], header=TRUE, sep=",", dec=".", fill=TRUE)
	
	# based on variable values
    ## we subset our edge list to, in this case, show us the network
    ## for a single class. 
	subsetEL <- el[ which(el$class_course_id == 55113),] 
    
    ## This is a defensive programming move. If the line above errors for any
    ## reason I have not over written my original variable. 
	el <- subsetEL

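    ## graph_from_data_frame is a function in the "igraph" library
    ## (older igraph releases call it graph.data.frame). 
    ## You can reference the igraph API documents here:
    ## http://igraph.org/r/doc/igraph.pdf
    
    ## igraph reads the first two columns as source and target, so we pass
    ## only userid_from and userid_to. This graph is directed
	disAll <- graph_from_data_frame(el[, c("userid_from", "userid_to")], directed=TRUE)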

    # I want to label the nodes in my graph, so I assign 
    # the name of each node, which is an igraph internal 
    # variable for all the nodes culled from the edgelist,
    # to the label variable.  The label variable is for 
    # the sociogram (graph) itself. 
	V(disAll)$label <- V(disAll)$name
	
    # There are a number of different layout algorithms
    # The algorithm you choose impacts how the nodes and edges
    # are represented visually. 
    
    # You will find a list of layout_ options in igraph here:
    # http://igraph.org/r/doc/#L 
    
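    # Store the layout on the graph so that plot() picks it up; calling
    # layout_with_sugiyama() without keeping the result has no effect.
	disAll$layout <- layout_with_sugiyama(disAll)$layout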
	
    ## You can set the arrow size to be bigger or smaller
	E(disAll)$arrow.size <-0.4

	# size of sphere calculated based on the betweenness value
	V(disAll)$size <-(3+sqrt(sqrt(betweenness(disAll, v=V(disAll)))))
	g<-disAll

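	# Spinglass community detection only works on a connected graph,
	# so the coloring is skipped when the network has isolated subnetworks.
	if (is_connected(g)){
		com <- cluster_spinglass(g, spins=8)
		V(g)$color <- com$membership+1
	}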

	plot(g, vertex.label.dist=.5, vertex.label.cex=.7, vertex.label.color="black",
	vertex.frame.color="white")
}
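
The comments in the code above mention weighted networks as an alternative to the one-row-per-message edge list. As a minimal sketch, assuming hypothetical user ids and an edge list that reuses the `userid_from` and `userid_to` column names, you can collapse repeated rows into a single weighted edge per ordered pair like this:

```r
library(igraph)

# Hypothetical raw edge list: one row per message.
raw_el <- data.frame(
  userid_from = c(1, 1, 1, 2, 2, 3),
  userid_to   = c(2, 2, 2, 1, 3, 1)
)

# Give every message a weight of 1, then sum the weights for each
# (source, target) pair. Direction is kept: 1 -> 2 and 2 -> 1 stay separate.
raw_el$weight <- 1
weighted_el <- aggregate(weight ~ userid_from + userid_to, data = raw_el, FUN = sum)
print(weighted_el)

# The third column becomes an edge attribute named "weight".
gw <- graph_from_data_frame(weighted_el, directed = TRUE)

# strength() is the weighted counterpart of degree(): it sums edge weights.
out_strength <- strength(gw, mode = "out")
print(out_strength)
```

The weighted graph has fewer edges than the raw edge list has rows, but the total weight still equals the number of messages.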



## Interpreting the sociogram

In the sociogram (social network visualization) produced above, you see several small, disconnected networks. 
What can we discern from such a structure?
- Not all the nodes are connected to each other. This is not what is called a "connected graph". There are subnetworks of interaction that do not cross paths with each other at all in this data set. 
- Given this fact, we know that there must be places in the online course where students interact around parts of the course, but do not feel obligated to participate in everything all the time. Or, it is possible that the course is structured into smaller groups that work together in isolation. To **really** know why there are isolated subnetworks we need to dig into the particular functions of this social and technical arrangement.  We will not do this here; I make this point so that it's clear to you that as an analyst using social network analysis, the visualization is usually not the final product. You will iterate many times and critique your work several times before you are fully confident in your description of what your sociogram represents. 
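
If you want to confirm by counting what the picture suggests, igraph can report the isolated subnetworks directly. The sketch below uses a hypothetical toy edge list, not the course data, and treats two nodes as belonging together whenever any edge links them, regardless of direction (the "weak" mode).

```r
library(igraph)

# Hypothetical toy edge list with three separate groups.
toy_el <- data.frame(
  from = c("a", "b", "d", "f"),
  to   = c("b", "c", "e", "g"),
  stringsAsFactors = FALSE
)

g <- graph_from_data_frame(toy_el, directed = TRUE)

# FALSE when at least one pair of nodes has no path between them.
connected <- is_connected(g, mode = "weak")
print(connected)

# components() labels each node with the subnetwork it belongs to.
comp <- components(g, mode = "weak")
print(comp$no)      # number of isolated subnetworks
print(comp$csize)   # number of nodes in each one

# List the members of each subnetwork.
print(split(V(g)$name, comp$membership))
```

Running the same calls on the graph built from the course data would give you the number and size of the subnetworks, which is a useful check before you start explaining why they exist.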

# SAVE YOUR NOTEBOOK