This notebook shows the code for visualising the author network of arxiv.org.
First I import everything I need and load the dataframe.

In [0]:
import itertools
import pandas as pd
frame=pd.read_csv('framemai17.csv')
frame=frame.drop(frame.columns[0], axis=1)

Before I can work with the frame I need a pre-processing step that turns each entry of the "Authors" column into a list of names in the "Last Name, First Name" format.

In [0]:
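span=2
authors=[]
for s in range(0,len(frame)):
    print(s)  # progress indicator
    # Split the raw string on commas, then re-join every two pieces so that
    # "Last Name" and "First Name" end up in the same element again.
    words = str(frame["Authors"][s]).strip(",").split(",")
    liste=[",".join(words[i:i+span]) for i in range(0, len(words), span)]
    # Remove surrounding whitespace and the list brackets left over from the CSV.
    liste=[element.strip().strip("[").strip("]") for element in liste]
    authors.append(liste)
# Assign the whole column at once instead of writing cell by cell
# through chained indexing.
frame["Authors"]=authors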
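To make the pairing trick easier to follow, here is a minimal sketch of the same idea on a single string. The function name `split_authors` and the example strings are made up for illustration, and I am assuming the raw cell looks like a bracketed, comma-separated list. The sketch also shows the limitation of the approach: it only works if every name contains exactly one comma.

```python
def split_authors(raw, span=2):
    """Split a raw author string into 'Last Name, First Name' entries.

    Hypothetical helper: assumes each name contributes exactly one comma,
    so re-joining every two comma-separated pieces restores one name.
    """
    words = str(raw).strip("[]").split(",")
    names = [",".join(words[i:i + span]) for i in range(0, len(words), span)]
    return [name.strip() for name in names]


# Assumed input format, not taken from the real CSV.
clean = split_authors("[Ortiz, J. L., Duffard, R., Thirouin, A.]")
print(clean)

# A name without a comma shifts the pairing for everything after it.
shifted = split_authors("[CMS Collaboration, Ortiz, J. L.]")
print(shifted)
```

The second call pairs "CMS Collaboration" with the following last name, so entries without a comma would need special handling if they matter for the analysis.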

This part of the code was actually the toughest nut to crack.
Here I construct a dataframe of author connections. Most papers have several authors, and my goal was to visualise this network of connections, showing who regularly works with whom.
For a paper with the authors A and B I want to draw a connection from A to B, so I constructed a dataframe with the two columns "from" and "to".
For each paper in the frame I look at the authors and then apply some combinatorics. My first try was a naive approach that combined every author with every author. For a paper with authors A and B that would have produced the connections AA, AB, BA and BB. I thought: how hard can it be?
Well, very hard. Progress was painfully slow this way, so I chose to ignore the AA and BB cases. That still didn't work too well, and then I realised that this is a combinatorics problem: I don't need the AA and BB connections, and if I have an AB connection I don't necessarily need the BA connection either. So I found the "itertools" module, which ships with Python's standard library and does exactly that job.
But it still needed a huge amount of time to finish.
I read that it could be done faster with GPU programming, but I thought: "Meh, by the time I have learned that, the computation will be done already", so I decided to just wait for it.
End of story: computing the whole dataframe took about a week, running day and night.
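To illustrate the difference between the naive pairing and the one I ended up with, here is a small sketch with three placeholder authors. For n authors the naive version produces n² ordered pairs, while `itertools.combinations` produces n(n-1)/2 unordered pairs and never pairs an author with themselves.

```python
import itertools

authors = ["A", "B", "C"]  # placeholder names

# Naive approach: every ordered pair, including self-pairs such as ("A", "A").
all_pairs = list(itertools.product(authors, repeat=2))

# combinations() yields each unordered pair exactly once.
unique_pairs = list(itertools.combinations(authors, 2))

print(len(all_pairs))   # 9
print(unique_pairs)     # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```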

In [0]:
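connections=pd.DataFrame({'from':[], 'to':[]})
# Note: the loop starts at row 140000, not at row 0.
for i in range(140000,len(frame)):
    print(i)  # progress indicator
    authors=frame["Authors"][i]
    # .decode("utf-8") converts the Python 2 byte strings to unicode.
    if len(authors)==1:
        # A single-author paper becomes a self-loop (author -> author).
        name=authors[0].decode("utf-8")
        connections=connections.append({"from":name, "to":name}, ignore_index=True)
    else:
        # Every unordered pair of co-authors becomes one edge.
        for element in itertools.combinations(authors,2):
            connections=connections.append({"from":element[0].decode("utf-8"), "to":element[1].decode("utf-8")}, ignore_index=True)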
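A likely reason for the long runtime is that `DataFrame.append` returns a new dataframe on every call, so the loop copies the growing table again and again. The method was also removed in pandas 2.0. The following is a minimal sketch of an alternative, not the code I actually ran: it collects the pairs in a plain Python list and builds the dataframe once at the end. The toy `author_lists` and the name `connections_fast` are made up for illustration.

```python
import itertools
import pandas as pd

# Toy stand-in for frame["Authors"]: one list of names per paper.
author_lists = [
    ["Ortiz, J. L.", "Duffard, R.", "Thirouin, A."],
    ["Ghose, Partha"],
]

rows = []
for authors in author_lists:
    if len(authors) == 1:
        # Keep single-author papers as a self-loop, as in the cell above.
        rows.append((authors[0], authors[0]))
    else:
        # One edge per unordered pair of co-authors.
        rows.extend(itertools.combinations(authors, 2))

# Build the dataframe a single time, after the loop.
connections_fast = pd.DataFrame(rows, columns=["from", "to"])
print(connections_fast)
```

Appending to a list is cheap, so the cost of this version grows with the number of edges rather than with the square of it.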

Now I compute an additional column for the weight of the edges: how many times did a pair of authors work together?

In [0]:
connections["count"]=connections.groupby(["from","to"])["from"].transform("count")
connections=connections.drop_duplicates()
connections=connections.reset_index(drop=True)
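Here is what the counting step does on a tiny made-up edge list, together with an equivalent one-step variant using `groupby(...).size()`. The frame names `raw`, `edges` and `compact` are only for this sketch.

```python
import pandas as pd

# Made-up edges: the pair (A, B) occurs three times, (B, C) once.
raw = pd.DataFrame({
    "from": ["A", "A", "B", "A"],
    "to":   ["B", "B", "C", "B"],
})

# Same steps as the cell above: attach the pair count to every row,
# then drop the repeated rows.
edges = raw.copy()
edges["count"] = edges.groupby(["from", "to"])["from"].transform("count")
edges = edges.drop_duplicates().reset_index(drop=True)
print(edges)

# Equivalent variant: size() already returns one row per pair.
compact = raw.groupby(["from", "to"]).size().reset_index(name="count")
print(compact)
```

One thing to keep in mind: the pairs are ordered, so if one paper lists the authors as A, B and another as B, A, the two edges are counted separately. Sorting each author list before building the pairs would merge them, if that is wanted.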

Now I relabel the columns. To display the graph I used a program called "Gephi". Gephi can read CSV data and visualise it if the data is in a specific format, which is what this step is for.

In [0]:
connections=connections.rename(columns={"from":"Source","to":"Target","count":"Weight"})
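The export itself is not shown in this notebook, so the following is only a sketch of how the renamed frame could be written out as an edge list. The example rows are taken from the output further down, and the sketch writes into an in-memory buffer so that it runs without touching the disk.

```python
import io
import pandas as pd

edges = pd.DataFrame({
    "Source": ["Fedoseev, Gleb", "Fedoseev, Gleb"],
    "Target": ["Ioppolo, Sergio", "Lamberts, Thanja"],
    "Weight": [5, 3],
})

# index=False keeps the pandas row index out of the file; otherwise it
# shows up as an extra, unnamed first column when the CSV is read back.
buffer = io.StringIO()
edges.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

For a real export, pass a file name instead of the buffer, for example `edges.to_csv("edges.csv", index=False)`, where `edges.csv` is a placeholder name. Names that contain a comma are quoted automatically.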

Here I remove the author artifact "...", since the first statistics already showed that it needs to be taken care of. I also removed all edges with a weight of only 1 to make the graph computable. I tried displaying the whole network, but it would just never finish, so I had to compromise.

In [0]:
connections = connections[connections.Target != "..."]
connections = connections[connections.Weight != 1]
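The same filter can be written as a single boolean mask. The sketch below uses made-up rows and, as an addition that is not part of the cell above, also checks the Source column for the "..." placeholder.

```python
import pandas as pd

edges = pd.DataFrame({
    "Source": ["A", "A", "...", "B"],
    "Target": ["B", "...", "C", "C"],
    "Weight": [2, 3, 4, 1],
})

# Flag edges that touch the "..." placeholder on either end.
# (Checking Source as well is an assumption of this sketch.)
is_placeholder = edges["Source"].eq("...") | edges["Target"].eq("...")

# Keep real authors only, and only pairs that occur more than once.
filtered = edges[~is_placeholder & (edges["Weight"] > 1)].reset_index(drop=True)
print(filtered)
```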

In [0]:
connections

Unnamed: 0,Source,Target,Weight
165,"Santos-Sanz, P.","Thirouin, A.",2
188,"Lacerda, P.","Ortiz, J. L.",2
207,"Ortiz, J. L.","Duffard, R.",2
210,"Ortiz, J. L.","Thirouin, A.",2
228,"Duffard, R.","Thirouin, A.",2
234,"Fujimori, Toshiaki","Nitta, Muneto",2
360,"Ghose, Partha","Ghose, Partha",2
451,CMS Collaboration,CMS Collaboration,4
471,"Fedoseev, Gleb","Ioppolo, Sergio",5
472,"Fedoseev, Gleb","Lamberts, Thanja",3


Here you can take a look at the author network computed in Gephi. I used a layout algorithm to make the clusters more visible and coloured the biggest ones.
The biggest clusters, the pink one and the black one, are astrophysicists who write a lot of papers together. That makes sense, since astrophysicists need to share and pool their resources and money to rent time on telescopes.
Another cluster that stood out is the blue one on the left: there is one node in the centre with a lot of nodes grouped around it. I found that the centre node is a professor at a US university and the nodes around her are students whom she supervises while they do experiments.
The orange cluster on the right is interesting because a single node seems to connect two different clusters. Looking into this, I found that the connecting node is a professor at an Italian university, and the nodes in the bigger cluster are fellow professors and students at that university. He is also part of a project in Germany, which forms the other cluster.
As you can see, it can be very interesting to visualise the network and take a deeper look into it.

<a href="https://ibb.co/z6X2r8f"><img src="https://i.ibb.co/CPb9zJ0/Screenshot-2018-12-21-at-11-20-45-AM.png" alt="Screenshot-2018-12-21-at-11-20-45-AM" border="0"></a>