## Finding Quincy & Friends

(NOTE: *This kernel was modified in order to make use of the csv prepared by @residentmario*)

This is a quest to find Quincy Larson, the person behind the freeCodeCamp project, and his friends in the freeCodeCamp **main** chat. More specifically, I am thinking of rendering Quincy's [ego-network](http://www.analytictech.com/e-net/pdwhandout.pdf).

More generally the goals are to
* build a preliminary network with the people that connected to Quincy, those who sent messages to him and those who received messages from him (*Quincy's friends*), 
* try to visualize the network and finally 
* see where Quincy is located using interactive tools.


To that end I will make some quick concessions about the data in order to clean it and reduce it to something simpler. We will finish our analyses and sketch the visualizations using `networkx`, `matplotlib` and `BokehJS`.

Let's start by checking the files are there: 

In [None]:
import numpy
import matplotlib.pyplot as plt
import timeit
import pandas
import datetime, calendar

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

The files are ok.  Let's get the dataset of interest:

In [None]:
fulldataset = pandas.read_csv("../input/freecodecamp_casual_chatroom.csv")
#exclude camperbot
fulldataset = fulldataset.loc[fulldataset['fromUser.username']!='camperbot',:]
#there were some usernames as NaN
#their messages could be used for other projects, but not for this one
fulldataset = fulldataset.loc[(~fulldataset['fromUser.username'].isnull()),:]

To stick to our "simplicity rule" we are not going to worry about [time or dataset overlaps](https://www.kaggle.com/free-code-camp/all-posts-public-main-chatroom) for now. 

Now let's see how many people sent messages to Quincy Larson (@QuincyLarson) and how many he sent to others. To find those he sent, we will check the `fromUser.username` column, while for those he received we will inspect the `mentions` column. We will focus on his activity over time, specifically per month, so we will be using the `sent` column as well.

It is important to clarify that this approach won't give us the complete picture. Quincy might have sent messages to specific users without mentioning their usernames, and others might have done the same when writing to him. However, we expect the data to reveal a trend. 

First let's parse the date:

In [None]:
fulldataset['parsed_sent'] = fulldataset['sent'].apply(lambda x: x[:7])
fulldataset['parsed_sent'].head()
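
The slice `x[:7]` works because the `sent` values are timestamps written as text that start with `YYYY-MM`. As a minimal sketch, here is an equivalent but more explicit route through `pandas.to_datetime`. The three timestamps are made-up sample values (an assumption about the format), not rows from the dataset.

```python
import pandas

# Hypothetical sample of ISO-8601 timestamps (not taken from the dataset)
sample = pandas.Series(["2015-01-03T10:15:00.000Z",
                        "2015-01-28T23:59:59.000Z",
                        "2016-06-01T00:00:00.000Z"])

# Option 1: string slicing, as in the cell above
by_slice = sample.str[:7]

# Option 2: parse to datetimes first, then format as year-month
by_parse = pandas.to_datetime(sample).dt.strftime("%Y-%m")

print(by_slice.tolist())
print(by_parse.tolist())
```

Slicing is faster on a large column; parsing has the advantage of failing loudly if a value is not a valid date.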

Now let's make a bar chart to see the number of messages Quincy sent and received per month. Finding the messages that Quincy sent will be easy with `pandas`. However, finding the ones he received will be trickier: we need to unpack the values in the `mentions` column, which are still in a `json`-like format. For the unpacking I will use the `ast` library. The `json` library can also be used.

Let's unpack the existing `screenName` attributes of each `mentions` data point and save them in a list under a new column, `mentions.screennames`: 

In [None]:
import ast

mu = []
def solve_metions(x):
    for i, ms in enumerate(x):
        ous = []
        mtsasts = ast.literal_eval(ms)
        mtsast = [x for x in mtsasts]
        if len(mtsast) > 0:
            for l in mtsast:
                ou = l['screenName']
                ous.append(ou)
        mu.append(ous)
        
fulldataset[['mentions']].apply(solve_metions)
fulldataset['mentions.screennames'] = mu
fulldataset['mentions.screennames'].head()
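
For comparison, here is a minimal sketch of the same unpacking written as a function that handles one cell at a time and returns its result instead of appending to a global list. The name `extract_screennames` and the sample strings are hypothetical; the sketch also skips entries without a `screenName` key, which is an assumption that the cell above does not make.

```python
import ast

def extract_screennames(raw):
    """Return the list of screenName values found in one raw `mentions` cell."""
    mentions = ast.literal_eval(raw)  # string -> list of dicts
    return [m["screenName"] for m in mentions if "screenName" in m]

# Hypothetical cell values, shaped like the `mentions` column
rows = ["[]",
        "[{'screenName': 'QuincyLarson', 'userId': 'abc'}]",
        "[{'screenName': 'alice'}, {'screenName': 'bob'}]"]

print([extract_screennames(r) for r in rows])
```

On the real data this could be applied as `fulldataset['mentions'].apply(extract_screennames)`, which returns a Series already aligned with the DataFrame index.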

Now we create two subsets of data, one with the "sent" and another with the "received".

In [None]:
quincy_sent = fulldataset.loc[fulldataset['fromUser.username']=='QuincyLarson',['fromUser.username','parsed_sent']].groupby('parsed_sent').count().reset_index()
fulldataset['quincy_received_check'] = fulldataset['mentions.screennames'].apply(lambda x: 'QuincyLarson' in x)
quincy_received = fulldataset.loc[fulldataset['quincy_received_check'],['fromUser.username','parsed_sent']].groupby('parsed_sent').count().reset_index()
quincy_sent.rename(columns={'fromUser.username':'msgsOut'}, inplace = True)
quincy_received.rename(columns={'fromUser.username':'msgsIn'}, inplace = True)
quincy_sent.head(), quincy_received.head()

For rendering a chart we might need to work on the dates a bit more...

In [None]:
#check also https://stackoverflow.com/a/7015758 for a elaborated solution to solve the monthly range
firstday = min(min(quincy_sent['parsed_sent']), min(quincy_received['parsed_sent']))
lastday = max(max(quincy_sent['parsed_sent']), max(quincy_received['parsed_sent']))
print(firstday, lastday)
print(datetime.datetime.strptime(firstday,'%Y-%m'),datetime.datetime.strptime(lastday,'%Y-%m'))
print((datetime.datetime.strptime(lastday,'%Y-%m')-datetime.datetime.strptime(firstday,'%Y-%m')).days)

dates = [(datetime.datetime.strptime(firstday,'%Y-%m') + datetime.timedelta(days=x)).strftime('%Y-%m') for x in range((datetime.datetime.strptime(lastday,'%Y-%m')-datetime.datetime.strptime(firstday,'%Y-%m')).days+1) ]
print(len(dates), dates[0], dates[-1])

In [None]:
totaldates = pandas.DataFrame(dates, columns=['parsed_sent'])
totaldates.set_index('parsed_sent', inplace=True)
totaldates.head()

In [None]:
totaldates = totaldates.reset_index().drop_duplicates().set_index('parsed_sent')
totaldates.head()
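
The cells above build one label per day and then drop the duplicates to end up with one row per month. A shorter route, sketched here under the assumption that `pandas.period_range` is acceptable for this purpose, generates the monthly labels directly. The two endpoint months are made-up values.

```python
import pandas

# Hypothetical first and last months, in the same 'YYYY-MM' format as `parsed_sent`
first_month, last_month = "2015-11", "2016-02"

# One label per calendar month, endpoints included
months = pandas.period_range(first_month, last_month, freq="M").strftime("%Y-%m")

# Empty frame indexed by month, ready to be joined with the monthly counts
month_index = pandas.DataFrame(index=pandas.Index(months, name="parsed_sent"))
print(list(month_index.index))
```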

In [None]:
quincy_sent.head(), quincy_sent.tail()

In [None]:
quincy_sent.set_index('parsed_sent', inplace=True)
totaldates = totaldates.join(quincy_sent)
#fill only the joined column; assigning 0 to whole rows would also overwrite the other columns
totaldates['msgsOut'] = totaldates['msgsOut'].fillna(0).astype('int')
quincy_received.set_index('parsed_sent', inplace=True)
totaldates = totaldates.join(quincy_received)
totaldates['msgsIn'] = totaldates['msgsIn'].fillna(0).astype('int')

totaldates.head()

In [None]:
quincy_sent.reset_index(inplace=True), quincy_received.reset_index(inplace=True)

In [None]:
totaldates['msgsOut'] = -1*totaldates['msgsOut']
totaldates['msgsOut'].head()

In [None]:
#msgsIn is drawn upwards and msgsOut (negated in the previous cell) downwards
ax = totaldates[['msgsIn','msgsOut']].plot(kind='bar', stacked=True, title ="Quincy's outs and ins per month", figsize=(15, 10), legend=True, fontsize=12, ylim=(-700,700))
ax.set_xlabel("date", fontsize=12)
ax.set_ylabel("counts", fontsize=12)
ax.xaxis.set_ticks_position('none')
ax.set_xticklabels(['' if i%6!=0 else x for i,x in enumerate(totaldates.index)], fontsize=12, rotation=0)
plt.show()

Based on the figure above, Quincy wrote fewer and fewer messages in the main chatroom over time. He also received fewer messages. However, the number of messages he received stayed close to the number he sent.

Keeping our preference for simplicity, I will only use data from 2015-01 until 2016-06:

In [None]:
partialdataset = fulldataset.loc[fulldataset['parsed_sent'] <= '2016-06',:]

Let's try to get Quincy and his friends out of this partial dataset. Let's get those from whom Quincy received messages and show only the "top 10 senders": 

In [None]:
#interesting site? https://tomaugspurger.github.io/method-chaining.html
senderstoQuincy = partialdataset.loc[partialdataset['quincy_received_check'],['quincy_received_check','fromUser.username']].groupby('fromUser.username').count().rename(columns={'quincy_received_check':'msgsSenders'}).sort_values(by='msgsSenders', ascending=False)
print('Number of senders to Quincy ', len(senderstoQuincy))
senderstoQuincy[:10].plot(kind='bar', figsize=(12,12))

The number of unique users sending messages to Quincy could serve as an estimate of Quincy's unweighted **in-degree**.

Let's do the same as above but for those who received messages from Quincy, and try to make a rough evaluation of his **out-degree**. This will be a bit more elaborate, as it requires going through the `mentions.screennames` lists. There are several ways to solve this. I usually enjoy libraries like `collections` very much, so I will use it.

In [None]:
import collections
receiversfromQuincy = collections.Counter()

def count_receivers(x):
    for sus in x:
        for su in sus:
            receiversfromQuincy[su] += 1
        
partialdataset.loc[partialdataset['fromUser.username'] == 'QuincyLarson',['mentions.screennames']].apply(count_receivers)
print('Number of receivers from Quincy ',len(receiversfromQuincy))
del receiversfromQuincy['QuincyLarson'] #he wrote to himself...
pandas.DataFrame(receiversfromQuincy.most_common(10),columns=['fromUser.username','msgsReceivers']).set_index('fromUser.username').plot(kind='bar', figsize=(12,12))
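
As one of the "several ways" mentioned above, here is a sketch that stays inside `pandas` by flattening the lists with `Series.explode` (available from pandas 0.25 on). The small frame is a hypothetical stand-in for the rows Quincy sent, not real data.

```python
import pandas

# Hypothetical stand-in for the rows Quincy sent, with already-unpacked mentions
sent_by_quincy = pandas.DataFrame({
    "fromUser.username": ["QuincyLarson"] * 4,
    "mentions.screennames": [["alice", "bob"], ["alice"], [], ["QuincyLarson", "bob"]],
})

receivers = (sent_by_quincy["mentions.screennames"]
             .explode()        # one row per mentioned user; empty lists become NaN
             .dropna()
             .value_counts())
receivers = receivers.drop("QuincyLarson", errors="ignore")  # ignore self-mentions

print(receivers.to_dict())
```

The result is a Series sorted by count, so `receivers.head(10).plot(kind='bar')` would give a chart comparable to the one above.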

## Quincy's Net

We have accumulated enough information about Quincy's "friends" to work on a simple chart. I will try an example using `networkx`. To fill the graph I will try simple, brute-force code. The Kaggle system has handled these challenges well so far :-) .

(some references: 
    https://networkx.github.io/documentation/networkx-1.10/tutorial/tutorial.html
    https://networkx.github.io/documentation/stable/release/release_2.0.html
)


I will create a directed graph whose relations record who mentioned whom. Each node is labeled with a username. The relations (edges) represent *messages sent by the sender to that recipient* and are weighted by the number of messages.

But first, let's clean the data a bit more by making additional simplifications. Let's do the following:
* First, let's remove all those who sent fewer than 10 messages to anyone in the whole chat
* Second, after that selection we will reduce the friends list even more by removing those who wrote fewer than 5 messages to anyone in the "Quincy and Friends" group.
* Additionally, we won't include anyone who didn't receive or send at least one message to Quincy.

Let's take the union of those who sent messages to Quincy and those who received messages from him, and then use a dictionary again to track them all.


In [None]:
#0. Union of senders and receivers
#Notice that we are mixing Python datatypes here: a pandas `Series` and a `dict` 
#are combined to create a `set` of senders and receivers.
#That follows from the solution I chose for dealing with nested lists 
#within a pandas Series, which was a dictionary (a Counter).
senders_or_receivers = set(receiversfromQuincy.keys()) | set(senderstoQuincy.index)
print('There were ',len(senders_or_receivers),' Quincys senders or receivers in total')

In [None]:
#1. Removing those who sent less than 10 messages to anyone in the whole chat
partialallcount=partialdataset[['fromUser.username']].groupby(['fromUser.username'])['fromUser.username'].count()
allcountwithlessthan10 = partialallcount[partialallcount < 10]
#allcountwithlessthan10.index
senders_or_receivers.difference_update(set(allcountwithlessthan10.index))
print('There were ',len(senders_or_receivers),' Quincys senders or receivers left after extracting those who sent less than 10 messages to anyone in the chat')

In [None]:
#2. From the resulting list, removing those with fewer than 5 messages within Quincy's circle
#3. Also select only those who are part of Quincy's circle
#Note: as written, the counter below tallies how many times each candidate is *mentioned* in the partial dataset
trackingvalidfriends = dict([(u,0) for u in senders_or_receivers])
def count_friendsmsgs(x):
    for sus in x:
        for su in sus:
            if su in trackingvalidfriends:  #dict membership test; no need to build a list every time
                trackingvalidfriends[su] += 1

partialdataset.loc[:,['mentions.screennames']].apply(count_friendsmsgs)

validfriends = [u for u,v in trackingvalidfriends.items() if v >= 5]

print('Finally, there were ',len(validfriends),' between the Quincys senders and receivers who sent 5 or more messages to those in the same group') 

In [None]:
'QuincyLarson' in validfriends #he has to be there too

We got a cleaner list of friends to work with. We will build the graph directly from them. Because we are going to evaluate exchanges of messages between them, we need the methods for the analysis of [directed graphs](https://www.youtube.com/watch?v=DBRW8nwZV-g). I think the `DiGraph` class from `networkx` should work. We will take the number of *sent* messages as the weight.

There are different ways to build the graph. My approach will be to build a **fully connected weighted digraph**, fill in the weights with one pass over the dataset, and then clean up the digraph by removing the edges whose weight is still 0 (zero). It is important to notice that this is NOT the best way if you have many nodes, but it simplifies the approach. I encourage you to find a better way!

In [None]:
import itertools
import networkx

DG = networkx.DiGraph(name='QuincyNet')
#add all edges at once, weighted
DG.add_weighted_edges_from([(a,b,0) for a,b in itertools.permutations(validfriends,2)])

validfriendsdataset = partialdataset.loc[(partialdataset['fromUser.username'].isin(validfriends)),['fromUser.username','mentions.screennames']]
validfriendsdataset.head()


Note that the users in the final list of Quincy and Friends were responsible for 1,061,154 messages during the selected period. To simplify our search, let's get rid of the messages where `mentions.screennames` is empty, using a simple procedure. Remember that we are not evaluating messages without mentions.

In [None]:
def isempty(x):
    #True when the list of mentioned screen names is empty
    return len(x) == 0
        
validfriendsdataset['mentionsempty'] = validfriendsdataset['mentions.screennames'].apply(isempty)
validfriendsdataset = validfriendsdataset.loc[~validfriendsdataset['mentionsempty'], ['fromUser.username','mentions.screennames']]
validfriendsdataset.head()

Of all the messages they sent, roughly half (i.e. 495,627 messages) mentioned at least one username. The real figure could be lower: some of those messages were sent to themselves, something we are going to correct later.

Now we have enough data to build our network. The following step is to fill up the weight of the edges by counting the number of messages that node A sent to node B (mentions). Keep in mind that there could be a message targeted to several users.

In [None]:
def weight_edges(x):
    cu = x['fromUser.username']
    for mu in x['mentions.screennames']:
        if cu != mu and cu is not None and mu is not None and mu != 'camperbot' and mu in validfriends:
            #print(cu,mu)
            current_weight = DG.get_edge_data(cu, mu)['weight']
            DG[cu][mu]['weight'] = current_weight + 1
    
validfriendsdataset.apply(weight_edges, axis=1)
print()
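
As a possible "better way" for larger node sets, the following sketch skips the fully connected starting point and creates an edge only when a mention is actually seen. The `messages` list and the `friends` set are hypothetical stand-ins for `validfriendsdataset` and `validfriends`; this is an illustration under those assumptions, not the approach used in this kernel.

```python
import networkx

# Hypothetical (sender, mentions) pairs standing in for `validfriendsdataset` rows
messages = [("alice", ["bob", "QuincyLarson"]),
            ("bob", ["alice"]),
            ("alice", ["bob"]),
            ("QuincyLarson", ["QuincyLarson", "carol"])]  # a self-mention and an outsider
friends = {"alice", "bob", "QuincyLarson"}               # a set gives fast lookups

G = networkx.DiGraph(name="QuincyNetSketch")
G.add_nodes_from(friends)  # keep every friend as a node, even without edges
for sender, mentioned in messages:
    for target in mentioned:
        # skip self-mentions and anyone outside the friends group
        if sender == target or sender not in friends or target not in friends:
            continue
        if G.has_edge(sender, target):
            G[sender][target]["weight"] += 1
        else:
            G.add_edge(sender, target, weight=1)

print(sorted(G.edges(data="weight")))
```

Because no zero-weight edges are ever created, there is nothing to clean up afterwards, and memory grows with the number of observed pairs rather than with the square of the number of nodes.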

In [None]:
print("The preliminary list of Quincy's super-friends (nodes) who sent messages to each other is {} ".format(len( DG.nodes() ) ) )
print("The number of connections (edges) between them is {} (fully connected digraph)".format( len( DG.edges() ) ) )

Although the task is still incomplete, we can already get some numbers about the messages Quincy sent or received:

In [None]:
print('Weighted Degree fully connected digraph\nout: ',DG.out_degree('QuincyLarson', weight='weight'), '- in:',DG.in_degree('QuincyLarson', weight='weight'))

The problem is that he still appears to be connected to everyone else in the digraph, when in reality we only want to call him connected to someone if he sent messages, received messages or both, i.e. if the edge has weight (`weight != 0`).

In [None]:
print('Unweighted Degree, fully connected digraph\nout: ',DG.out_degree('QuincyLarson'), '- in:',DG.in_degree('QuincyLarson'))

So now let's get rid of those edges without weight:

In [None]:
#iterate over a static list of the edges, since the graph is modified inside the loop
for a, b in list(DG.edges()):
    if DG[a][b]['weight'] == 0:
        DG.remove_edge(a, b)

print("The final length of the list of Quincy's super-friends (nodes) who sent messages to each other is {} ".format(len( DG.nodes() ) ) )
print("The number of connections between them is {}".format( len( DG.edges() ) ) )


This is likely a better outcome. To confirm we are on the right track, let's check Quincy's in- and out-degrees, weighted and unweighted.

In [None]:
print('Weighted Degree not fully connected digraph\nout: ',DG.out_degree('QuincyLarson', weight='weight'), '- in:',DG.in_degree('QuincyLarson', weight='weight'))

In [None]:
print('Unweighted Degree not fully connected digraph\nout: ',DG.out_degree('QuincyLarson'), '- in:',DG.in_degree('QuincyLarson'))

Quincy was involved in no more than 3% of the total connections in his "Friends" network in the main chatroom (unweighted degree / total unique connections). Moreover, he may be responsible for no more than 0.7% of the total messages sent between the members of that network (weighted degree / total messages).

Now we see that, after cleaning the data of possibly spurious contacts, Quincy has a slightly larger in-degree than out-degree. Such an imbalance is usually associated with higher **prestige**. That seems plausible, as we know Quincy is the leader of freeCodeCamp.

There are many more analyses we could still run on this dataset. However, I will stop here, though not without jumping into a visualization to complete this first exploration of Quincy's network.
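
The two ratios can be sketched on a toy digraph as follows. The node names and weights are invented, and taking "degree" as in-degree plus out-degree for the first ratio and as weighted out-degree for the second is only one possible reading of the figures quoted above.

```python
import networkx

# Hypothetical toy digraph; weights play the role of message counts
G = networkx.DiGraph()
G.add_weighted_edges_from([("q", "a", 3), ("a", "q", 5), ("a", "b", 10), ("b", "a", 2)])

ego = "q"
# Share of distinct connections (edges) that touch the ego node
edge_share = (G.in_degree(ego) + G.out_degree(ego)) / G.number_of_edges()
# Share of all messages (total edge weight) that the ego node sent
sent_share = G.out_degree(ego, weight="weight") / G.size(weight="weight")

print(edge_share, sent_share)
```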

## Quincy's Guide to the Chatroom Galaxy

In this last section, we will keep using `networkx` with `matplotlib` to visualize the graph we just created.



In [None]:
import matplotlib.pyplot as plt
plt.figure(figsize=(15,15))
spring_pos = networkx.spring_layout(DG)
plt.axis("off")
networkx.draw_networkx(DG, pos=spring_pos, with_labels=False, node_size=35)

plt.show()

What looks like a [sea urchin](http://otlibrary.com/wp-content/gallery/sea-urchin/sea-urchin.jpg) could be a rough approximation of how the group interacted. From this preliminary view we could argue that there is a core of users with strong interaction, while there are still some satellites of people around them. It is too soon for conclusions though.

I wanted to evaluate the [**eigenvector centrality**](https://en.wikipedia.org/wiki/Eigenvector_centrality) of Quincy's "closest friends". The eigenvector centrality ranks the users in the network based on how they are linked to other central players. I will hack a nice and simple piece of work by [Can Güney Aksakalli](https://aksakalli.github.io/2017/07/17/network-centrality-measures-and-their-visualization.html) (github: @aksakalli) to render a visualization of the eigenvector centrality of the network we just created:

In [None]:
import matplotlib.colors as mcolors
def draw(G, pos, measures, measure_name):
    plt.figure(figsize=(15,15))
    nodes = networkx.draw_networkx_nodes(G, pos, node_size=25, cmap=plt.cm.plasma, 
                                   node_color=list(measures.values()),
                                   nodelist=list(measures.keys()))
    nodes.set_norm(mcolors.SymLogNorm(linthresh=0.01, linscale=1))
    
    # labels = networkx.draw_networkx_labels(G, pos)
    edges = networkx.draw_networkx_edges(G, pos, edge_color='lightgrey')
    plt.title(measure_name)
    plt.colorbar(nodes)
    plt.axis('off')
    plt.show()
    
#draw(DG, spring_pos, networkx.katz_centrality(DG, alpha=0.1, beta=1.0), 'Graph Katz Centrality')
draw(DG, spring_pos, networkx.eigenvector_centrality(DG), 'Graph Eigenvector Centrality')

In [None]:
sorted(networkx.eigenvector_centrality(DG).items(),key=lambda x:x[1], reverse = True)[:10]

Notice that Quincy gets the highest eigenvector centrality score because the ego-network we constructed is in fact built around him. Apart from that, you can see that the users he communicated with the most and the users central to his network do not fully coincide. For example, 'qmikew1', 'jbmartinez' and 'Aireleslie' were not among his top 10 receivers or senders. They are probably more central to his network and possibly to the chat as a whole.

But wait! Where is Quincy???? To find him I was tempted to pan around the chart and zoom in, to see if I could spot his face or hear his voice. Let's use the `BokehJS` library for that and meanwhile get a closer look at his surroundings. We arranged a dress code: he will be wearing a blue shirt, a very effective choice if you want to be found in a crowd.
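
To see why a user can rank high without exchanging many messages with any particular person, here is a small sketch on an invented four-node digraph. For directed graphs, `networkx.eigenvector_centrality` scores a node through its incoming edges, so being mentioned by well-ranked users counts for more than the raw number of mentions.

```python
import networkx

# Hypothetical toy digraph: everyone mentions 'a', and a chain a -> b -> c -> d exists
G = networkx.DiGraph([("b", "a"), ("c", "a"), ("d", "a"),
                      ("a", "b"), ("b", "c"), ("c", "d")])

# Score each node, then sort from most to least central
scores = networkx.eigenvector_centrality(G)
ranking = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print([name for name, _ in ranking])
```

Here `b`, `c` and `d` each receive exactly one mention, yet `b` outranks the other two because its single mention comes from the most central node.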

(NOTE: in the graph below I used quick code that places the nodes in different positions than in the graph above; there are ways to keep the same positions though)
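
One way to get stable positions, sketched with `networkx` only, is to fix the random seed of the layout, assuming a `networkx` version whose `spring_layout` accepts a `seed` argument. The small path graph is a hypothetical example.

```python
import networkx

G = networkx.path_graph(5)  # hypothetical small graph for illustration

# A fixed seed makes the spring layout reproducible across calls
pos_first = networkx.spring_layout(G, seed=42)
pos_again = networkx.spring_layout(G, seed=42)

# Every node gets the same coordinates in both layouts
same = all((pos_first[n] == pos_again[n]).all() for n in G.nodes())
print(same)
```

Alternatively, a position dictionary computed once (like `spring_pos` earlier) can be reused in every `networkx` drawing call. Whether it can be handed directly to Bokeh depends on the Bokeh version, so that part is left out here.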

In [None]:
##https://bokeh.pydata.org/en/latest/docs/user_guide/quickstart.html
#https://bokeh.github.io/datashader-docs/user_guide/6_Networks.html
##https://bokeh.pydata.org/en/latest/docs/user_guide/graph.html
##https://anaconda.org/Viz-group/graph_edge_and_node_select/notebook
##https://anaconda.org/Viz-group/media/notebook
#https://stackoverflow.com/questions/46397671/using-bokeh-how-does-one-plot-variable-size-nodes-and-node-colors

from bokeh.plotting import figure, output_notebook, show # bokeh plotting library
from bokeh.models import MultiLine, Circle, HoverTool, ColumnDataSource
from bokeh.palettes import Spectral4
from bokeh.models.graphs import from_networkx, NodesAndLinkedEdges, EdgesAndLinkedNodes
# We'll show the plots in the cells of this notebook
output_notebook()

In [None]:
plot_width = int(750)
plot_height = int(plot_width//1.2)

b_plt = figure(title="Where is Quincy??",
               tools='pan, wheel_zoom, hover, reset', 
               plot_width=plot_width, 
               plot_height=plot_height,
               x_range=(-1.1,1.1),
               y_range=(-1.1,1.1),
              )
##Porting attributes from networkx to BokehJS is a bit tricky
##for one option, see:
## https://stackoverflow.com/questions/47210530/adding-node-labels-to-bokeh-network-plots
## I am keeping it simple...
#
renderer = from_networkx(DG, networkx.spring_layout, scale=30, center=(0,0) )
listindex = renderer.node_renderer.data_source.data['index']
#q_x, q_y = networkx.spring_layout(DG)['QuincyLarson']
'''
below I was using a hack that prevented the label to render correctly; now is fine but there is an error?

  q_x, q_y = renderer.layout_provider._property_values['graph_layout']['QuincyLarson']
  b_plt.circle(q_x,q_y,fill_color='blue', radius = 0.02, name='QuincyLarson')

the error is as in: https://stackoverflow.com/questions/46397671/using-bokeh-how-does-one-plot-variable-size-nodes-and-node-colors 
'''
newsource = ColumnDataSource({'index':listindex, 'color_node':['white' if x != 'QuincyLarson' else 'blue' for x in listindex], 'radii':[0.01 if x != 'QuincyLarson' else 0.02 for x in listindex]})
renderer.node_renderer.data_source = newsource
renderer.node_renderer.glyph = Circle( fill_color = {'field':'color_node'}, radius = {'field':'radii'} )
renderer.node_renderer.hover_glyph = Circle(fill_color=Spectral4[1])
renderer.edge_renderer.glyph = MultiLine(line_color="#CCCCCC")
renderer.edge_renderer.hover_glyph = MultiLine(line_color=Spectral4[3])
renderer.inspection_policy=NodesAndLinkedEdges()
hover =b_plt.select({'type': HoverTool})
hover.tooltips = [("index","@index"), ("(x,y)","($x,$y)")]
b_plt.renderers.append(renderer)
show(b_plt)


Just zoom and pan the figure. Can you see the slightly bigger blue point? Hover over it: that is Quincy!

From here it is up to you whether to discover more about Quincy or anyone else in the freeCodeCamp main chatroom. Depending on your expertise, your enthusiasm, the kind of questions you have and what this data can do, it can be a simple trip of a few minutes or one that lasts several months. If you want a starting point, just use this code. No doubt many of you will do much better, if not now then likely in the future!

We only ask: **be kind and professional with the users**. It is public data but not for harm.

Whatever you decide, I wish you the best fun!! Ask questions about the dataset if you need to. And as we usually say in freeCodeCamp: Happy Coding!

In [None]:
#renderer.__dict__
#renderer.layout_provider._property_values['graph_layout']['QuincyLarson']
#THIS CODE WILL BREAK JUST... HERE!