<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Setting-Everything-Up-and-Running-our-First-Analysis" data-toc-modified-id="Setting-Everything-Up-and-Running-our-First-Analysis-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Setting Everything Up and Running our First Analysis</a></span></li><li><span><a href="#Degree" data-toc-modified-id="Degree-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Degree</a></span></li><li><span><a href="#Out-Degree" data-toc-modified-id="Out-Degree-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Out Degree</a></span></li><li><span><a href="#In-Degree" data-toc-modified-id="In-Degree-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>In Degree</a></span></li><li><span><a href="#Deeper-Dive-into-Malaysia" data-toc-modified-id="Deeper-Dive-into-Malaysia-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Deeper Dive into Malaysia</a></span></li></ul></div>

# Introduction

When we picked this data set we looked forward to the challenges it would bring, but we had no idea how it would turn out for either our business case or our graphs. It turned out to be data for which only some of the centrality measures really make sense. To get data that made sense, we cleaned it and removed the variables that were duplicates with respect to how we wanted to present it.

The data is LinkedIn data on migration between countries. It contained negative and positive values showing the net migration rate between pairs of countries. Each value was positive for one country of the pair and negative for the other, so we kept only the positive ones. We also removed the values equal to 0, because there the net migration between the two countries evened out.
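
The cleaning rule described above can be sketched as follows. This is only a minimal sketch on made-up numbers: the column names `from`, `to` and `weight` mirror the ones used later in this notebook, but the layout of the raw file and the values shown here are assumptions.

```python
import pandas as pd

# Hypothetical raw data: every country pair shows up once per direction,
# with net flows of opposite sign (all values are made up).
raw = pd.DataFrame({
    "from":   ["se", "no", "dk", "se", "fi", "dk"],
    "to":     ["no", "se", "se", "dk", "dk", "fi"],
    "weight": [12, -12, -5, 5, 0, 0],
})

# Keep only rows with a strictly positive net flow. This drops the mirrored
# negative rows as well as the pairs whose net flow is zero.
cleaned = raw[raw["weight"] > 0].reset_index(drop=True)
print(cleaned)
```

After this step every remaining row can be read as one directed edge from the net sender to the net receiver.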

To continue with our analyses we asked ourselves: what would be interesting to investigate further?

1. What countries are less, and what countries are more connected to the rest of the world?
2. What countries does the majority leave from vs. travel to?

Before we get started on any of our analyses, let's import the libraries needed, load our data and draw a basic map. To draw the map we have to treat the graph as undirected, even though the data is directed. We think of it as directed in the sense of the net value of workforce moving between countries: if more people move from, e.g., Sweden to Norway than the other way around, the net value is positive for Sweden -> Norway, and the edge points in that direction.

# Setting Everything Up and Running our First Analysis

In [1]:
#!pip install mplleaflet
#!pip install networkx

In [2]:
import mplleaflet
import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

In [3]:
# importing edge and node pandas dataframe
edges = pd.read_csv("./edges.csv", sep = ";")

nodes = pd.read_csv("./nodes_coordinates.csv", sep = ";")

In [4]:
# assigning nodes to dictionary with a unique identifier
node = nodes.set_index("node").to_dict("index").items()

In [5]:
# adding nodes and edges to a directed graph object
D1 = nx.from_pandas_edgelist(edges, source = "from",
                                   target = "to",
                                   create_using = nx.DiGraph)
D1.add_nodes_from(node)
nx.is_directed(D1)

True

Why directed? The net value of workforce moving between countries indicates a direction. If more people move from, e.g., Sweden to Norway than the other way around, the net value is positive for Sweden -> Norway and negative for Norway -> Sweden. This is also why we removed the negative values: they can't be interpreted as edges of their own.

In [6]:
# adding nodes and edges to an undirected graph object
U1 = nx.from_pandas_edgelist(edges, source = "from",
                                   target = "to",
                                   create_using = nx.Graph)
U1.add_nodes_from(node)
nx.is_directed(U1)

False

We just told you that the graph is directed, so why do we also create an undirected graph object? For the analysis itself it does not make sense, but we discovered a limitation in the package we use to create the maps: to be able to draw the network we have to draw it without the arrows (the direction). However, we will use attributes from the directed graph object to add value to the undirected map we create.
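
To illustrate what is kept and what is lost when the direction is dropped, here is a small sketch on a toy edge list. The three edges are hypothetical and only stand in for the cleaned data, where each country pair is assumed to appear in one direction only.

```python
import networkx as nx

# Toy edge list (hypothetical): one direction per country pair.
toy_edges = [("se", "no"), ("se", "dk"), ("de", "se")]

D = nx.DiGraph(toy_edges)   # keeps the direction of each net flow
U = nx.Graph(toy_edges)     # same pairs, direction dropped

# Direction-aware counts are only available on the directed graph.
print(D.in_degree("se"), D.out_degree("se"))   # flows into / out of "se"
print(U.degree("se"))                          # all connections of "se"
```

The undirected object is enough for drawing, while the in- and out-degree have to come from the directed one.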

In [7]:
print("There are", D1.number_of_edges(), "edges in our network")

There are 2001 edges in our network


In [8]:
print("There are", D1.number_of_nodes(), "nodes in our network")

There are 140 nodes in our network


In [9]:
# setting our coordinates for the map
coordinates_dict = {}
for node in nodes.node:
    coordinates_dict[node] = list(nodes.loc[nodes.node == node ,['lon','lat']].values[0])
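
The loop above filters the whole dataframe once per node. A possible alternative, sketched here on a three-row stand-in for `nodes_coordinates.csv`, builds the same kind of dictionary in one pass. The names `nodes_demo` and `coords_demo` are hypothetical and only used for this illustration.

```python
import pandas as pd

# Hypothetical stand-in for nodes_coordinates.csv (three rows only).
nodes_demo = pd.DataFrame({
    "node": ["ae", "af", "al"],
    "lon":  [53.847818, 67.709953, 20.168331],
    "lat":  [23.424076, 33.93911, 41.153332],
})

# Map each node code to [lon, lat] without filtering the frame once per node.
coords_demo = {
    code: [lon, lat]
    for code, lon, lat in zip(nodes_demo["node"], nodes_demo["lon"], nodes_demo["lat"])
}
print(coords_demo["af"])
```

It also avoids reusing the name `node` as a loop variable, which the earlier cell already uses for the node dictionary.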

In [11]:
coordinates_dict

{'ae': [53.847818000000004, 23.424076],
 'af': [67.709953, 33.93911],
 'al': [20.168331, 41.153332],
 'am': [45.038189, 40.069099],
 'ao': [17.873887, -11.202691999999999],
 'ar': [-63.616671999999994, -38.416097],
 'at': [14.550072, 47.516231],
 'au': [133.775136, -25.274398],
 'az': [47.576927000000005, 40.143105],
 'ba': [17.679076000000002, 43.915886],
 'bd': [90.35633100000001, 23.684994],
 'be': [4.469936, 50.503887],
 'bf': [-1.561593, 12.238333],
 'bg': [25.48583, 42.733883],
 'bh': [50.637772, 25.930414000000003],
 'bj': [2.315834, 9.30769],
 'bo': [-63.588653, -16.290154],
 'br': [-51.92528, -14.235004],
 'bs': [-77.39628, 25.03428],
 'bw': [24.684866, -22.328474],
 'by': [27.953389, 53.709807],
 'ca': [-106.34677099999999, 56.130366],
 'cd': [21.758664000000003, -4.038333000000001],
 'ch': [8.227511999999999, 46.818188],
 'ci': [-5.54708, 7.539989],
 'cl': [-71.542969, -35.675146999999996],
 'cm': [12.354722, 7.369722],
 'cn': [104.195397, 35.86166],
 'co': [-74.297333, 4.57

To plot a graph on a map we need coordinates. We get these from our distinct nodes, and the map below uses them to show only the location of each country and the edges between countries.

In [10]:
# Plotting our first map with mplleaflet
fig, ax = plt.subplots(figsize =(20,10))

nx.draw_networkx_nodes(U1, pos= coordinates_dict, node_size = 10)
nx.draw_networkx_edges(U1, pos = coordinates_dict, edge_color= 'gray', alpha = .1)
# label_pos belongs to draw_networkx_edge_labels, so it is not passed here
nx.draw_networkx_labels(U1, pos = coordinates_dict, font_size=16)
mplleaflet.display(fig=ax.figure)

Now that we have plotted our nodes and edges on a map, we want to get some insight into our data. As our data is year-to-year net migration of workers between countries, it does not really make sense to check all sorts of centrality measures, distance, diameter and average path length. In many cases people probably do move from the United States to Brazil and then on to Spain, but this is impossible to measure because our data does not track individuals, only the net flow (in minus out) between countries as a whole.

For this we dive into our first question, "What countries are less, and what countries are more connected to the rest of the world?". Although it seems obvious that this is a weakly connected graph, we want to look at the degree of all nodes and start from there:
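
Connectivity can be checked rather than assumed. The sketch below uses a toy directed graph with hypothetical edges. On the notebook's own graph the corresponding call would be `nx.is_weakly_connected(D1)`, whose result is not shown here.

```python
import networkx as nx

# Toy directed graph (hypothetical): "fj" only receives, "se" only sends.
G = nx.DiGraph([("se", "no"), ("no", "de"), ("de", "fj")])

# Weakly connected: connected once the edge directions are ignored.
weak = nx.is_weakly_connected(G)
# Strongly connected: every node can reach every other node along the arrows.
strong = nx.is_strongly_connected(G)
print(weak, strong)
```

A graph containing nodes that only send or only receive can be weakly connected but never strongly connected.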

# Degree

In [11]:
# creating a dataframe for total degree
degree = nx.degree_centrality(D1)
degreeDF = pd.DataFrame()
degreeDF["node"] = degree.keys()
degreeDF["degree"] = degree.values()
degreeDF = degreeDF.sort_values("node")

In [12]:
# sorting based on level of degree instead of node name
degreeDF_sorted_degree = degreeDF.sort_values("degree", ascending = False)

# getting our bottom 10 connected countries
bottom_10 = degreeDF_sorted_degree.tail(10).sort_values("degree")
bottom_10

Unnamed: 0,node,degree
134,fj,0.014388
119,bs,0.014388
133,tg,0.021583
109,mg,0.021583
139,uz,0.021583
42,pr,0.021583
54,ml,0.021583
124,mn,0.021583
115,mw,0.021583
98,ht,0.028777


In [13]:
# getting our top 10 connected countries
top_10 = degreeDF_sorted_degree.head(10).sort_values("degree", ascending = False)
top_10

Unnamed: 0,node,degree
94,us,0.863309
16,ca,0.805755
58,gb,0.784173
48,fr,0.76259
38,de,0.719424
0,ae,0.669065
44,es,0.625899
73,in,0.618705
74,it,0.604317
85,nl,0.597122


In [14]:
# importing edges and nodes in pandas dataframe again
edges1 = pd.read_csv("./edges.csv", sep = ";")

nodes1 = pd.read_csv("./nodes_coordinates.csv", sep = ";")

In [15]:
nodes1["degree"] = degreeDF["degree"]
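
The assignment above relies on pandas index alignment, so rows are paired by index label and not by country code. If the row order of `nodes_coordinates.csv` differs from the node order of `D1`, the values end up on the wrong countries. A minimal sketch of a key-based alternative, using hypothetical stand-in data, could look like this:

```python
import networkx as nx
import pandas as pd

# Hypothetical stand-ins: the node table is alphabetical, the graph is not.
nodes_demo = pd.DataFrame({"node": ["ae", "au", "az"]})
G = nx.DiGraph([("au", "ae"), ("az", "ae")])

centrality = nx.degree_centrality(G)   # dict keyed by node code

# Look each code up in the dict, so the row order no longer matters.
nodes_demo["degree"] = nodes_demo["node"].map(centrality)

# For plotting, build the sizes in the graph's own node order.
sizes = [centrality[n] * 100 for n in G.nodes()]
print(nodes_demo)
print(list(G.nodes()), sizes)
```

`draw_networkx_nodes` reads `node_size` in the order of the graph's nodes (when no `nodelist` is given), which is why the sizes are built from `G.nodes()` rather than from the dataframe.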

In [16]:
# assigning nodes to dictionary with a unique identifier
node1 = nodes1.set_index("node").to_dict("index").items()

In [17]:
# adding nodes and edges to a graph object
U3 = nx.from_pandas_edgelist(edges1, source = "from",
                                   target = "to",
                                   create_using = nx.Graph)
U3.add_nodes_from(node1)
nx.is_directed(U3)

False

In [18]:
# adding nodes and edges to a graph object
D3 = nx.from_pandas_edgelist(edges1, source = "from",
                                   target = "to",
                                   create_using = nx.DiGraph)
D3.add_nodes_from(node1)
nx.is_directed(D3)

True

In [19]:
# Plotting with mplleaflet
fig, ax = plt.subplots(figsize =(20,10))

nx.draw_networkx_nodes(U3, pos= coordinates_dict, node_size = (nodes1["degree"]*100))
nx.draw_networkx_edges(U3, pos = coordinates_dict, edge_color= 'gray', alpha = .1)
# label_pos belongs to draw_networkx_edge_labels, so it is not passed here
nx.draw_networkx_labels(U3, pos = coordinates_dict, font_size=16)
mplleaflet.display(fig=ax.figure)

On the map you can see the most connected countries highlighted by size: the bigger the circle, the more connected the country. We can already see that the US, Canada, Australia, China, India and pretty much all of Europe are highly connected, while countries in Africa, some parts of Asia and South/Central America are less connected. The table confirms this, with these as the top 10 connected countries:

	Node	       Degree
	USA	        0.863309
	Canada	     0.805755
	UK             0.784173
	France	     0.762590
	Germany	    0.719424
	UAE	        0.669065
	Spain	      0.625899
	India	      0.618705
	Italy	      0.604317
	Netherlands    0.597122

We also see that some countries are not well connected at all. It will be interesting to see whether they mainly receive or give out workforce when we analyze in- and out-degree, and whether they show up in the bottom list for either of them.

	Node	       Degree
	Fiji	       0.014388
	Bahamas    	0.014388
	Togo	       0.021583
	Madagascar 	0.021583
	Uzbekistan 	0.021583
	Puerto Rico	0.021583
	Mali    	   0.021583
	Mongolia       0.021583
	Malawi  	   0.021583
	Haiti   	   0.028777

This aligns with intuition about the world: the bigger and wealthier the country, the more of its workforce on LinkedIn migrates between countries.
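
The degree values in the tables are normalized: `nx.degree_centrality` divides each node's degree by the number of other nodes. The sketch below checks this on a toy graph with hypothetical edges and then turns two values from the tables back into counts, assuming the 140 nodes reported earlier.

```python
import networkx as nx

# Toy directed graph with 4 nodes (hypothetical data).
G = nx.DiGraph([("a", "b"), ("a", "c"), ("d", "a")])

centrality = nx.degree_centrality(G)
n = G.number_of_nodes()

# networkx divides each node's degree (in + out for a DiGraph) by n - 1.
manual = {node: G.degree(node) / (n - 1) for node in G.nodes()}
print(centrality["a"], manual["a"])

# Reading the notebook's tables back into counts, with n = 140 nodes:
us_links = round(0.863309 * 139)    # US
fiji_links = round(0.014388 * 139)  # Fiji, Bahamas
print(us_links, fiji_links)
```

Read this way, the US is connected to 120 of the 139 other countries, and Fiji and the Bahamas to 2 each.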

# Out Degree

In [20]:
# importing edge and node pandas dataframe
edges2 = pd.read_csv("./edges.csv", sep = ";")

nodes2 = pd.read_csv("./nodes_coordinates.csv", sep = ";")

In [21]:
# creating a dataframe for out-degree
outdegree = nx.out_degree_centrality(D1)
outdegreeDF = pd.DataFrame()
outdegreeDF["node"] = outdegree.keys()
outdegreeDF["outdegree"] = outdegree.values()
outdegreeDF = outdegreeDF.sort_values("node")

In [22]:
# sorting based on level of degree instead of node name
outdegreeDF_sorted = outdegreeDF.sort_values("outdegree", ascending = False)

# getting our bottom 11 connected countries
bottom_11_out = outdegreeDF_sorted.tail(11).sort_values("outdegree")
bottom_11_out # 11 because the bottom 11 has 0.0 as their value

Unnamed: 0,node,outdegree
7,az,0.0
33,ve,0.0
134,fj,0.0
127,md,0.0
109,mg,0.0
114,ni,0.0
59,ba,0.0
12,bj,0.0
97,cu,0.0
101,tt,0.0


In [23]:
# getting our top 10 connected countries
top_10_out = outdegreeDF_sorted.head(10).sort_values("outdegree", ascending = False)
top_10_out

Unnamed: 0,node,outdegree
16,ca,0.755396
38,de,0.654676
48,fr,0.546763
6,au,0.496403
85,nl,0.482014
44,es,0.47482
23,ch,0.453237
58,gb,0.446043
128,se,0.431655
11,be,0.388489


In [24]:
nodes2["outdegree"] = outdegreeDF["outdegree"]

In [25]:
# assigning nodes to dictionary with a unique identifier
node2 = nodes2.set_index("node").to_dict("index").items()

In [26]:
# adding nodes and edges to a graph object
U4 = nx.from_pandas_edgelist(edges2, source = "from",
                                   target = "to",
                                   create_using = nx.Graph)
U4.add_nodes_from(node2)
nx.is_directed(U4)

False

In [27]:
# adding nodes and edges to a graph object
D4 = nx.from_pandas_edgelist(edges2, source = "from",
                                   target = "to",
                                   create_using = nx.DiGraph)
D4.add_nodes_from(node2)
nx.is_directed(D4)

True

In [28]:
# Plotting with mplleaflet
fig, ax = plt.subplots(figsize =(20,10))

nx.draw_networkx_nodes(U4, pos= coordinates_dict, node_size = (nodes2["outdegree"]*150))
nx.draw_networkx_edges(U4, pos = coordinates_dict, edge_color= 'gray', alpha = .1)
# label_pos belongs to draw_networkx_edge_labels, so it is not passed here
nx.draw_networkx_labels(U4, pos = coordinates_dict, font_size=16)
mplleaflet.display(fig=ax.figure)

From the map we can see some big changes compared with total degree: countries such as the UK and the USA have much smaller nodes than before. This must be because their out-degree is lower. Let's look into the data to explain this further, starting with the top 10 list:

	Node	      Out-degree
	Canada	    0.755396
	Germany	   0.654676
	France	    0.546763
	Australia	 0.496403
	Netherlands   0.482014
	Spain	     0.474820
	Switzerland   0.453237
	UK	        0.446043
	Sweden	    0.431655
	Belgium	   0.388489

Looking at our top 10 list we can see that Canada, Germany, France, Spain, the UK and the Netherlands, which were at the top for total degree, are also at the top for out-degree, which is reflected on the map. It will be interesting to see whether the USA, UAE, Italy and India make it into the top 10 list for in-degree.
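
The comparison between the two top 10 lists can be reproduced with plain set operations. The country codes below are copied from the output tables in this notebook, and the variable names are only illustrative.

```python
# Country codes copied from the top-10 tables in this notebook.
top_degree = ["us", "ca", "gb", "fr", "de", "ae", "es", "in", "it", "nl"]
top_out = ["ca", "de", "fr", "au", "nl", "es", "ch", "gb", "se", "be"]

# High total degree and high out-degree.
in_both = sorted(set(top_degree) & set(top_out))
# High total degree, but not among the top senders.
only_total = sorted(set(top_degree) - set(top_out))

print(in_both)
print(only_total)
```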

	Node	      Out-degree
	Azerbaijan	0.0
	Venezuela	 0.0
	Fiji	      0.0
	Moldova	   0.0
	Madagascar	0.0
	Nicaragua	 0.0
	Bos. Herz	 0.0
	Benin	     0.0
	Cuba	      0.0
	Trin Toba	 0.0
	Puerto Rico   0.0

It is interesting to see the 11 countries with the lowest out-degree. An out-degree of zero does not mean that no workers move out of these countries, just that for every country they are connected to, more people migrate in than out. Since no country had zero total degree, we expect to see non-zero in-degree values for these countries in our next analysis.
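
The expectation above follows from how the three measures relate in a directed graph: total degree is the sum of in- and out-degree. A minimal sketch on a toy graph with hypothetical edges:

```python
import networkx as nx

# Toy directed graph (hypothetical): "fj" only receives net flows.
G = nx.DiGraph([("se", "no"), ("se", "fj"), ("no", "fj")])

total = nx.degree_centrality(G)
inc = nx.in_degree_centrality(G)
out = nx.out_degree_centrality(G)

# For a DiGraph, total degree centrality is the sum of the in and out parts.
for node in G.nodes():
    assert abs(total[node] - (inc[node] + out[node])) < 1e-12

# Nodes with no outgoing edge: net receivers in every pair they appear in.
sinks = [node for node in G.nodes() if G.out_degree(node) == 0]
print(sinks)
```

A node with zero out-degree and non-zero total degree therefore has to have a non-zero in-degree.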

# In Degree

In [29]:
# importing edge and node pandas dataframe
edges3 = pd.read_csv("./edges.csv", sep = ";")

nodes3 = pd.read_csv("./nodes_coordinates.csv", sep = ";")

In [30]:
# creating a dataframe for in-degree
indegree = nx.in_degree_centrality(D1)
indegreeDF = pd.DataFrame()
indegreeDF["node"] = indegree.keys()
indegreeDF["indegree"] = indegree.values()
indegreeDF = indegreeDF.sort_values("node")

In [31]:
# sorting based on level of degree instead of node name
indegreeDF_sorted = indegreeDF.sort_values("indegree", ascending = False)

# getting our bottom 10 connected countries
bottom_10_in = indegreeDF_sorted.tail(10).sort_values("indegree")
bottom_10_in

Unnamed: 0,node,indegree
139,uz,0.0
132,pg,0.0
54,ml,0.007194
119,bs,0.007194
115,mw,0.007194
129,cd,0.007194
136,ge,0.007194
138,py,0.007194
137,ga,0.007194
135,lu,0.007194


In [32]:
# getting our top 10 connected countries
top_10_in = indegreeDF_sorted.head(10).sort_values("indegree", ascending = False)
top_10_in

Unnamed: 0,node,indegree
73,in,0.604317
94,us,0.553957
14,br,0.381295
32,tr,0.374101
25,cn,0.366906
58,gb,0.338129
0,ae,0.323741
102,za,0.266187
77,my,0.266187
74,it,0.251799


In [33]:
nodes3["indegree"] = indegreeDF["indegree"]

In [34]:
# assigning nodes to dictionary with a unique identifier
node3 = nodes3.set_index("node").to_dict("index").items()

In [35]:
# adding nodes and edges to a graph object
U5 = nx.from_pandas_edgelist(edges3, source = "from",
                                   target = "to",
                                   create_using = nx.Graph)
U5.add_nodes_from(node3)
nx.is_directed(U5)

False

In [36]:
# adding nodes and edges to a graph object
D5 = nx.from_pandas_edgelist(edges3, source = "from",
                                   target = "to",
                                   create_using = nx.DiGraph)
D5.add_nodes_from(node3)
nx.is_directed(D5)

True

In [37]:
# Plotting with mplleaflet
fig, ax = plt.subplots(figsize =(20,10))

nx.draw_networkx_nodes(U5, pos= coordinates_dict, node_size = (nodes3["indegree"]*150))
nx.draw_networkx_edges(U5, pos = coordinates_dict, edge_color= 'gray', alpha = .1)
nx.draw_networkx_labels(U5, pos = coordinates_dict, label_pos = 10.3, font_size=16)
mplleaflet.display(fig=ax.figure)

Looking at our map we can see that the USA, Brazil, India, the UK and China have pretty big nodes, while the Bahamas, Cuba, Iceland and many other countries have small ones. So where do people migrate for work?

	Node	       Indegree
	India	      0.604317
	USA	        0.553957
	Brazil	     0.381295
	Turkey	     0.374101
	China	      0.366906
	UK	         0.338129
	UAE	        0.323741
	South Africa   0.266187
	Malaysia	   0.266187
	Italy	      0.251799

Looking at our top list we have many interesting findings, but what caught our attention was Malaysia. How come Malaysia has such a high in-degree? We will dive deeper into this after checking out the 10 countries with the lowest in-degree:

	Node	       Indegree
	Uzbekistan	 0.000000
	Papua N.G.	 0.000000
	Mali	       0.007194
	Bahamas	    0.007194
	Malawi	     0.007194
	Congo	      0.007194
	Georgia	    0.007194
	Paraguay	   0.007194
	Gabon	      0.007194
	Luxembourg	 0.007194

No surprises here, so we will continue to pursue the country of Malaysia:

# Deeper Dive into Malaysia

In [38]:
D1.in_degree("my")

37

In [39]:
D1.out_degree("my")

14

From the calculations above we can see that Malaysia has net migration in from 37 countries and net migration out to 14 countries. Before we finish this analysis we want to get a little more information out of our investigation of Malaysia.
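
These raw counts tie back to the centrality table: dividing by the number of other nodes gives the normalized value. A short worked check, assuming the 140 nodes reported earlier in the notebook:

```python
# Assumes the 140 nodes reported earlier in the notebook.
n_nodes = 140
in_count, out_count = 37, 14   # D1.in_degree("my"), D1.out_degree("my")

in_centrality = in_count / (n_nodes - 1)
out_centrality = out_count / (n_nodes - 1)

print(round(in_centrality, 6))    # compare with 0.266187 in the in-degree table
print(round(out_centrality, 6))
```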

In [40]:
malaysian_edges = (
    edges.loc[edges["to_country_name"] == "Malaysia",
              ["from_country_name", "to_country_name", "weight"]]
         .sort_values("weight", ascending = False)
)
malaysian_edges.head(2)

Unnamed: 0,from_country_name,to_country_name,weight
1976,Singapore,Malaysia,1444
1716,Mauritius,Malaysia,141
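
The same lookup could be wrapped in a small helper so that other countries can be inspected in the same way. `top_inbound` is a hypothetical name, not part of the notebook, and the demo frame reuses the two rows shown above plus one made-up row for contrast.

```python
import pandas as pd

def top_inbound(edges_df, country, k=5):
    """Return the k heaviest edges pointing at `country` (hypothetical helper)."""
    inbound = edges_df.loc[edges_df["to_country_name"] == country,
                           ["from_country_name", "to_country_name", "weight"]]
    return inbound.nlargest(k, "weight")

# Two rows copied from the output above plus one made-up row for contrast.
edges_demo = pd.DataFrame({
    "from_country_name": ["Singapore", "Mauritius", "Norway"],
    "to_country_name":   ["Malaysia", "Malaysia", "Sweden"],
    "weight":            [1444, 141, 10],
})
result = top_inbound(edges_demo, "Malaysia", k=2)
print(result)
```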


What is interesting about this finding is that the in/out ratio for Singapore to Malaysia is 14.44, while the next one on the list is only 1.41. Malaysia has a strong economy and attracts student migrants: it has good universities and agreements with companies to sponsor university degrees for foreign students. A quick Google search also shows that "Singapore was one of the 14 states of Malaysia from 1963 to 1965". Considering that they are neighboring countries this all makes sense, but the numbers are very high, and it is information we would not have looked into if it were not for our analysis.

From this analysis we have learned that GDP is not the only factor behind migration between countries when it comes to where people find work. Although we could show off our knowledge of SNA by computing eigenvector centrality, betweenness and other measures and interpreting the numbers, we chose not to, since we don't feel they would give us any real insight into how work migration between countries works. Even though people move from Singapore to Malaysia to work, it does not mean that those people then move on somewhere else, or that such a chain even makes sense in the real world.