~GSoC2010: Ankit Guglani Social Network Analysis
Clone this wiki locally
To calculate and present in a visually appeasing manner basic network statistics that would be interesting / useful to the end user. Also to provide with the implications of such measures in easy to understand language.
1.] Identifying the measures of interest
2.] Figure out how (more precisely where) to implement the computation code
3.] Integration (updating the crawler to get all the data needed, how to store the data, when to compute it etc).
1.] Before I dive into all the awesome stuff we can compute, we need to take note that even after we fix the crawler, the data is quite limited — for example, we don’t know how often a user’s friends communicate/interact with each other – so we can’t compute tie-strength.
At the same time, we can compute many basic statistics like centrality (based on Degree, Closeness and Betweenness), average path length, shortest path length, number of triads, key-player analysis, graphs for in and out degree-distribution and many many more. A few stats of interest would be:
> Friend Clustering: Think of this as an auto-lists feature for twitter which groups your friends into groups based on mutual friends. Algorithmically there are a few ways to do this.
One is by key-player elimination (where key-player is the node [user] with highest no of paths passing through it OR in other words with the most brokerage), here you continually eliminate the key-player from the network and you are left with well defined groups. The problems of this approach are: A) when to stop B) it excludes the key-players from being in any group and does not allow for over-lapping groups.
Another way of doing this would be to form groups based on number of mutual friends. A variant of this is the k-clique percolation method which has previously been successful in the identification of overlapping communities in large real networks. (more on the method here: http://en.wikipedia.org/wiki/Clique_percolation_method ). This does leave us to define the k, but I am sure we can come up with the k (or a function for k) after some trial and error with the real data.
> Tie strengths for your ego-network: This can help answer the questions: “Who are you close friends?”. Although we can’t compute the tie-strengths for the whole network-system, we can compute the tie strength of the user’s ego-network (all nodes one degree away from the node in question). This will take into account both mutual friends and amount of interactions (replies).
> Baseline comparison of the network-system to your ego-network: This can help answer: “What do you choose your friends based on?”. This can capture deviations of your friend network in comparison to the networks of your friends. This can be based on any characteristic we capture for the profiles for all users (like gender may be available from facebook profiles and so forth).
2.] Depending on which measure we end up actualizing, it ma be as trivial as an SQL prepared statement to writing some real code that access the data makes calculations and saves the results in SQL when called through the web-app. Most of the simpler suggested measures can be self contained within SQL.
3.] I would assume this to be a separate method from the crawler and accessible through the web-app. So additionally GUI to call these methods needs to be created. and some graphical representation of the results may also be considered.
This will require some work on the crawler itself.
> This will require editing the heart of ThinkTank, the crawler itself.
> It may not immediately seem useful, but progressively becomes more useful as more networks are ‘plugged-in’.
This is 1 of 2 projects that I’d like to propose for ThinkTank as part of GSoC. (personally I prefer the this one — the other one is here: http://wiki.github.com/ginatrapani/thinktank/gsoc-ankit-guglani-geospatial-visualizations … but you already knew that)
More about me, here: http://ankitguglani.wordpress.com (in dire need of an update).