##### Liam Byrne
##### DATA 620 - Web Analytics
##### Fall - 2017

# Project 1

***

This Project will look at the important interactions within a small accounting firm in 1993. The social and work interactions were tracked by the employees and the researchers. We will be looking at the important people within these circles by measuring the individual’s degree of centrality and eigenvector centrality within these networks. We will look at these centrality measurements using the attributes of gender and job role.

There are four separate networks:
+ Social Interactions observed by the researchers
+ Work Interactions observed by the researchers
+ Social Interactions reported by the employees
+ Work Interactions reported by the employees

We will first load the data:

In [1]:
import networkx as nx
import pandas as pd

def norm_edges(ntwrk):
    '''
    Removes Edge weights from network
    Returns new network
    '''
    tmp_ntwrk = ntwrk.copy()
    for i in range(0, tmp_ntwrk.shape[0]):
        for j in range(0, tmp_ntwrk.shape[1]):
            if i == j:
                tmp_ntwrk[i][j] = 0
            elif tmp_ntwrk[i][j] > 0:
                tmp_ntwrk[i][j] = 1
                
    return(tmp_ntwrk)
                
def adj_to_edge_list(ntwrk, directed = True, self_edge = False):
    '''
    Returns an edge list of tuples from an adjacency matrix
    '''
    tmp_lst = []
    
    for i in range(0, ntwrk.shape[0]):
        for j in range(0 if directed else i, ntwrk.shape[1]):
            if (ntwrk[i][j] != 0) and not (i == j and not self_edge):
                    tmp_lst.append((i, j))
                
    return(tmp_lst)
                
# Repo urls for data
urls = ["https://raw.githubusercontent.com/Liam-O/Data620/master/Project1/webster_sex_job.txt",
        "https://raw.githubusercontent.com/Liam-O/Data620/master/Project1/webster_social_obs.txt",
        "https://raw.githubusercontent.com/Liam-O/Data620/master/Project1/webster_social_rep.txt",
        "https://raw.githubusercontent.com/Liam-O/Data620/master/Project1/webster_work_obs.txt",
        "https://raw.githubusercontent.com/Liam-O/Data620/master/Project1/webster_work_rep.txt"]

web_sex_job = pd.read_csv(urls[0], sep = " ", header = None)
web_soc_obs = pd.read_csv(urls[1], sep = " ", header = None)
web_soc_rep = pd.read_csv(urls[2], sep = " ", header = None)
web_work_obs = pd.read_csv(urls[3], sep = " ", header = None)
web_work_rep = pd.read_csv(urls[4], sep = " ", header = None)

sex_dict = {1: "male", 2 : "female"}
job_dict = ({1: "Partner", 2: "Manager", 3: "Accountant", 4: "Staff member"})

We will create network graphs for the data loaded above. The interactions observed by the researchers is undirected and the interactions recorded by the employees is directed; an undirected and directed graph were used for these networks, respectively. The attributes need to be mapped to these individuals, i.e. their representative nodes, which is challenging due to the strict and non-robust functions associated with *NetworkX* in concert with *Pandas*. Individuals were missing from some networks, which needed to be dealt with carefully in the mapping process.

In [2]:
# Observed Social interaciton is non-directed
soc_obs_g = nx.Graph()
soc_obs_g.add_edges_from(adj_to_edge_list(web_soc_obs, directed = False))
nx.set_node_attributes(soc_obs_g, "sex",
                       dict((i, sex_dict[web_sex_job[0][i]]) for i in soc_obs_g.nodes()))
nx.set_node_attributes(soc_obs_g, "job",
                       dict((i, job_dict[web_sex_job[1][i]]) for i in soc_obs_g.nodes()))

# Observed Work interaciton is non-directed
work_obs_g = nx.Graph()
work_obs_g.add_edges_from(adj_to_edge_list(web_work_obs, directed = False))
nx.set_node_attributes(work_obs_g, "sex",
                       dict((i, sex_dict[web_sex_job[0][i]]) for i in work_obs_g.nodes()))
nx.set_node_attributes(work_obs_g, "job",
                       dict((i, job_dict[web_sex_job[1][i]]) for i in work_obs_g.nodes()))

# Reported Social interaciton is directed
soc_rep_g = nx.DiGraph()
soc_rep_g.add_edges_from(adj_to_edge_list(web_soc_rep, directed = False))
nx.set_node_attributes(soc_rep_g, "sex",
                       dict((i, sex_dict[web_sex_job[0][i]]) for i in soc_rep_g.nodes()))
nx.set_node_attributes(soc_rep_g, "job",
                       dict((i, job_dict[web_sex_job[1][i]]) for i in soc_rep_g.nodes()))

# Reported Work interaciton is directed
work_rep_g = nx.DiGraph()
work_rep_g.add_edges_from(adj_to_edge_list(web_work_rep, directed = False))
nx.set_node_attributes(work_rep_g, "sex",
                       dict((i, sex_dict[web_sex_job[0][i]]) for i in work_rep_g.nodes()))
nx.set_node_attributes(work_rep_g, "job",
                       dict((i, job_dict[web_sex_job[1][i]]) for i in work_rep_g.nodes()))

# Reported is directed
# Need multidigraph to preserve parrallel directed edges, i.e. work and social
web_rep_g = nx.MultiDiGraph()
web_rep_g.add_edges_from(adj_to_edge_list(web_soc_rep, directed = True),
                         interaction = "Social")
web_rep_g.add_edges_from(adj_to_edge_list(web_work_rep, directed = True),
                         interaction = "Work")

The degrees of centrality are calculated for each network and the results are stored in a data frame and printed in a table. A large degree of centrality reflects a lot of network interacitons amongst other employees. Note that *NaN* means that the employee did not have any interaction for the respective metric.

In [3]:
from IPython.display import display

central = pd.DataFrame({"worker_id": range(0, 23)})
central["sex"] = (web_sex_job[0].map(sex_dict))
central["job"] = (web_sex_job[1].map(job_dict))
central["work_observed"] = pd.Series(nx.degree_centrality(work_obs_g))
central["work_reported"] = pd.Series(nx.degree_centrality(work_rep_g))
central["social_observed"] = pd.Series(nx.degree_centrality(soc_obs_g))
central["social_reported"] = pd.Series(nx.degree_centrality(soc_obs_g))
print("Degree of centrality of social and work networks:")
display(central)

Degree of centrality of social and work networks:


Unnamed: 0,worker_id,sex,job,work_observed,work_reported,social_observed,social_reported
0,0,male,Partner,0.681818,0.608696,0.090909,0.090909
1,1,male,Partner,0.181818,0.434783,,
2,2,male,Partner,0.409091,0.782609,0.181818,0.181818
3,3,female,Accountant,,0.304348,0.636364,0.636364
4,4,female,Accountant,0.409091,0.434783,0.636364,0.636364
5,5,male,Manager,0.5,0.521739,0.590909,0.590909
6,6,male,Manager,0.318182,0.521739,0.681818,0.681818
7,7,female,Accountant,0.318182,0.26087,0.681818,0.681818
8,8,male,Accountant,0.318182,0.130435,0.636364,0.636364
9,9,male,Accountant,0.272727,0.217391,0.181818,0.181818


It is important to note that none of the *partners* are *female* and there is only one *female manager*. The top five results for degree centrality in *work observed* are all male partners; one female appears in the top five for *work reported* and she is a *manager*. The top five results for social interaction contain a diverse sample of job titles and gender.

We will look at the largest degrees of centrality for each type of interaction:

In [4]:
print("The worker with the highest degree of centrality for observed work \n"
      "interaction is worker {0}, who is {1} with a position of {2} and a degree centrallity of {3}.".format(
              central.loc[central.work_observed.argmax(), "worker_id"],
              central.loc[central.work_observed.argmax(), "sex"],
              central.loc[central.work_observed.argmax(), "job"],
              central.loc[central.work_observed.argmax(), "work_observed"]))
              
print("\n\nThe worker with the highest degree of centrality for observed social \n"
      "interaction is worker {0}, who is {1} with a position of {2} and a degree centrallity of {3}.".format(
              central.loc[central.social_observed.argmax(), "worker_id"],
              central.loc[central.social_observed.argmax(), "sex"],
              central.loc[central.social_observed.argmax(), "job"],
              central.loc[central.social_observed.argmax(), "social_observed"]))

print("\n\nThe worker with the highest degree of centrality for reported work \n"
      "interaction is worker {0}, who is {1} with a position of {2} and a degree centrallity of {3}.".format(
              central.loc[central.work_reported.argmax(), "worker_id"],
              central.loc[central.work_reported.argmax(), "sex"],
              central.loc[central.work_reported.argmax(), "job"],
              central.loc[central.work_reported.argmax(), "work_reported"]))

print("\n\nThe worker with the highest degree of centrality for reported social \n"
      "interaction is worker {0}, who is {1} with a position of {2} and a degree centrallity of {3}.".format(
              central.loc[central.social_reported.argmax(), "worker_id"],
              central.loc[central.social_reported.argmax(), "sex"],
              central.loc[central.social_reported.argmax(), "job"],
              central.loc[central.social_reported.argmax(), "work_reported"]))

The worker with the highest degree of centrality for observed work 
interaction is worker 12, who is male with a position of Partner and a degree centrallity of 0.727272727273.


The worker with the highest degree of centrality for observed social 
interaction is worker 18, who is female with a position of Staff member and a degree centrallity of 0.818181818182.


The worker with the highest degree of centrality for reported work 
interaction is worker 2, who is male with a position of Partner and a degree centrallity of 0.782608695652.


The worker with the highest degree of centrality for reported social 
interaction is worker 18, who is female with a position of Staff member and a degree centrallity of 0.434782608696.


Looking at the eigenvector centrality, which assigns importance based on the centrality of its neighbors which are incident on it, we can isolate the important individuals who connect highly central individuals.

The output of the work reported eigenvector centrality is unusual. Without going into too much detail, because it is outside the scope of this assignment, when there are multiple eigenvalues with the same (largest) magnitude, the algorithm fails. Further discussion on this topic from StackOverflow can be found [here](https://stackoverflow.com/questions/43208737/using-networkx-to-calculate-eigenvector-centrality).

In [5]:
central_eig = pd.DataFrame({"worker_id": range(0, 23)})
central_eig["sex"] = (web_sex_job[0].map(sex_dict))
central_eig["job"] = (web_sex_job[1].map(job_dict))
central_eig["work_observed"] = pd.Series(nx.eigenvector_centrality(work_obs_g))
central_eig["work_reported"] = pd.Series(nx.eigenvector_centrality_numpy(work_rep_g))
central_eig["social_observed"] = pd.Series(nx.eigenvector_centrality(soc_obs_g))
central_eig["social_reported"] = pd.Series(nx.eigenvector_centrality(soc_obs_g))
print("Eigenvector centrality of social and work networks:")
display(central_eig)

Eigenvector centrality of social and work networks:


Unnamed: 0,worker_id,sex,job,work_observed,work_reported,social_observed,social_reported
0,0,male,Partner,0.301357,-7.128252e-18,0.007166,0.007166
1,1,male,Partner,0.067635,1.461778e-16,,
2,2,male,Partner,0.179513,2.090356e-15,0.043399,0.043399
3,3,female,Accountant,,2.417114e-14,0.258266,0.258266
4,4,female,Accountant,0.223304,3.069961e-13,0.258289,0.258289
5,5,male,Manager,0.253102,3.622086e-12,0.249903,0.249903
6,6,male,Manager,0.161633,3.622158e-12,0.261268,0.261268
7,7,female,Accountant,0.175831,4.245577e-11,0.263584,0.263584
8,8,male,Accountant,0.179247,4.245419e-11,0.255198,0.255198
9,9,male,Accountant,0.167584,4.243169e-11,0.077129,0.077129


Looking at the results from the table, we notice that *male partners* serve as bridges between highly connected individuals. Socially, there is a diverse result of job titles and gender that connect social networks.

### Conclusion

As far as work interactions, it appears that job titles (i.e.*partners* and *managers*) play a significant role in importance within the network at this accounting firm. From the centrality analysis, it appears that the *partners* have a lot of interaction within their groups and that work details flow through these induvial to other parts of the company. This does not seem out of the ordinary of how most companies work.

The social networks within a company can sometimes be very instrumental in getting things accomplished and spreading information that is not possible within the typical work hierarchical structure. These induvial who have a large degree of social interaction and who can bridge large social groups can either be a service or a liability depending on the type of information and ideas they circulate.


***
### References

Source:
    https://icon.colorado.edu/#!/networks

Data Description:
    A multiplex network of interactions among employees at a small accounting firm,
    from 1993. 'Observed' interactions are undirected, while 'reported'
    interactions are directed, indicating that node i reported socializing or
    working with node j. Node metadata gives sex (male or female) and job roles.
    The meaning of the edge weights is unknown.

Citation:
    C. M. Webster. "Detecting context-based constraints in social perception."
    Journal of Quantitative Anthropology, 5(5), 285-303. (1995)
