# Compsci 690V 

This homework addresses Mini Challenge 1 (MC1) of the VAST Challenge 2008.

**Description:** The Paraiso movement is controversial and is having considerable social impact in a specific area of the world. We have extracted a segment of the Paraiso (the movement) Wikipedia edits page. Please note this is not the Paraiso Manifesto Wiki page which is part of the background materials, but a related different page. Please use visual analytics to describe the social relationships of the editors (those that have edited/modified the Wikipedia page) as they are reflected in these files.


Problem Statement for MC1:
What are the factions represented in these edit pages?  In other words, describe the individuals or groups that edit the pages with regard to any agendas and goals you hypothesize.  

I parsed the text of each edit into a DataFrame with the following columns:
* **timestamp** - the timestamp of the edit
* **user** - the user name or IP address of the editor
* **minorEdit** - True or False, whether the edit is flagged as minor
* **pageLength** - length of the page after the edit (in bytes)
* **comment** - comment or description of the edit
* **entireEdit** - entire raw edit text
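
As a rough illustration of how one history line maps onto these columns, the sketch below parses a single entry with one regular expression. The sample line is hypothetical: its layout is an assumption reconstructed from the parsing code and the truncated `entireEdit` values in this notebook, and `parse_edit_line` is an illustrative helper, not part of the notebook.

```python
import re
from datetime import datetime

# Assumed layout of one history line (hypothetical sample, not copied from the data file):
# "# (cur) (last) HH:MM, D Month YYYY user (Talk | contribs) [m] (N bytes) (comment)"
EDIT_RE = re.compile(
    r"\(cur\)\s+\(last\)\s+"
    r"(?P<time>\d{2}:\d{2}), (?P<date>\d{1,2} \w+ \d{4})\s+"
    r"(?P<user>\S+)\s+\(Talk.*?\)\s*"
    r"(?P<minor>m\b)?\s*"
    r"\((?P<size>[\d,]+) bytes\)"
    r"(?:\s+\((?P<comment>.*)\))?"
)

def parse_edit_line(line):
    """Return the six columns for one edit line, or None if it does not match."""
    match = EDIT_RE.search(line)
    if match is None:
        return None
    stamp = datetime.strptime(match["time"] + " " + match["date"], "%H:%M %d %B %Y")
    return {
        "timestamp": stamp,
        "user": match["user"],
        "minorEdit": match["minor"] is not None,
        "pageLength": int(match["size"].replace(",", "")),
        "comment": match["comment"] or "",
        "entireEdit": line,
    }

sample = ("# (cur) (last) 23:44, 15 January 2007 Alonzo (Talk | contribs) m "
          "(100,571 bytes) (Scientific criticism of Paraiso beliefs - link)")
row = parse_edit_line(sample)
print(row["user"], row["minorEdit"], row["pageLength"])
```

A single named-group pattern keeps each column tied to one visible piece of the line, and lines that do not fit the assumed layout return `None` instead of raising an index error.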

***Note***:
* The interactive widgets need a running Bokeh server.
* Start the notebook with a raised data-rate limit: `jupyter notebook --NotebookApp.iopub_data_rate_limit=10000000`

In [1]:
import numpy as np
import pandas as pd
import re
from bokeh.plotting import figure
from bokeh.palettes import Spectral4
from bokeh.io import output_notebook, show
from bokeh.models import ColumnDataSource, Select,LabelSet, HoverTool
import nltk
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans,DBSCAN
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from bokeh.models.widgets import DataTable, TableColumn
from bokeh.layouts import Row,widgetbox
from sklearn.manifold import MDS
from bokeh.models import GraphRenderer, StaticLayoutProvider, Oval, ColumnDataSource,Legend,Select
from bokeh.models import Plot, Range1d, MultiLine, Circle, HoverTool, TapTool, BoxSelectTool
from bokeh.models.graphs import from_networkx, NodesAndLinkedEdges, EdgesAndLinkedNodes
from bokeh.io import output_notebook,show,curdoc
#output_notebook()


with open('VASTchallenge08-20080315-Deinosuchus/WIKI EDITS PAGE/Paraiso Edits.txt','r', errors='ignore') as file:
    lines = file.readlines()
#print (lines)

df = pd.DataFrame(columns=['timestamp','user','minorEdit','pageLength','comment','entireEdit'])
i=0
for line in lines[3:]:
        entireEdit = line
        #print (entireEdit)
        #split via brackets
        token = re.split(r'\(|\)',line)
        #print ('token',token)
        #token[4] holds "HH:MM, D Month YYYY username"
        timestamp = re.search(r'(\d)+(.)*(\s)(\d\d\d\d)',token[4])
        timestamp = timestamp.group(0)
        timestamp = pd.to_datetime(timestamp)
        user = re.findall(r'(\d\d\d\d)(.+)',token[4])[0]
        #print ('user',user)
        user = user[1].split(' ')[1]
        #print ('user',user)
        #token[6] is ' m ' when the edit is flagged as minor
        m = (token[6] == ' m ')
        byte = token[7].split(' ')[0]
        #print ('byte',byte)
        try:
            byte = int(byte.replace(',',''))
        except ValueError:
            byte = None
        comment = token[-2].split('?')
        comment = comment[-1]
        if ('byte' in comment ):
            comment =''
        #print ('comment',comment)
        

        df.loc[i]=[timestamp,user,m,byte,comment,entireEdit]
        i+=1

print (df.shape)
print (df.head())

##### There are 1009 edits #####



(1009, 6)
            timestamp           user minorEdit pageLength  \
0 2007-01-15 23:44:00         Alonzo      True     100571   
1 2007-01-14 03:21:00        Alfonso     False     100552   
2 2007-01-13 18:36:00        Adriano     False     102461   
3 2007-01-13 16:45:00  DailosTamanca     False     100959   
4 2007-01-13 15:35:00   Moisescorral     False     102461   

                                          comment  \
0  Scientific criticism of Paraiso beliefs - link   
1                                                   
2                 except for eg BLP violation, et   
3                   ; i support Gustava's changes   
4             You propose FIRST, then you DELETE!   

                                          entireEdit  
0  # (cur) (last) 23:44, 15 January 2007 Alonzo (...  
1  # (cur) (last) 03:21, 14 January 2007 Alfonso ...  
2  # (cur) (last) 18:36, 13 January 2007 Adriano ...  
3  # (cur) (last) 16:45, 13 January 2007 DailosTa...  
4  # (cur) (last) 15:35, 13 J

## Method 1: Term Frequency-Inverse Document Frequency feature extraction and clustering
We will use TF-IDF feature extraction and cluster the users with k-means to see whether any distinct clusters form. To do that, we first create a DataFrame **userWiseComments** that stores, for each user, the comments from all of that user's edits.

In [2]:
def getUserWiseComments(dataframe):
    userWiseComments = pd.DataFrame(columns=['user','allComments'])
    users = list(set(dataframe['user']))
    i = 0
    for user in users:
        userRows = dataframe.loc[dataframe['user'] == user]
        allComments = ''
        for row in userRows.iterrows():
            allComments += row[1]['comment'] + ' '
        userWiseComments.loc[i] = [user,allComments]
        i += 1
    return userWiseComments

userWiseComments = getUserWiseComments(df)

print (userWiseComments.shape)
print (userWiseComments.head(2))


#####THERE ARE 387 DIFFERENT USERS#########

(387, 2)
            user                                        allComments
0  200.119.211.x                                            Origin 
1       Remedios  Reverted 1 edit by 203.59.152.x identified as ...


We will now create a **TF-IDF matrix** with `TfidfVectorizer`, tokenizing each user's combined comments and reducing the tokens to their English stems:

In [3]:
stemmer = SnowballStemmer("english")

#take comments from dataframe
commentsDf = userWiseComments['allComments']

#tokenize the given text and reduce each token to its stem
def tokenizeAndStem(text):
    CommentsTokens = nltk.word_tokenize(text)
    #keep tokens containing a letter or digit, i.e. drop pure punctuation
    filtered_tokens = []
    for token in CommentsTokens:
        if re.search('[a-zA-Z0-9]', token):
            filtered_tokens.append(token)
    stems = [stemmer.stem(t) for t in filtered_tokens]
    return stems


#build the term-weight matrix; with use_idf=False the weights are
#L2-normalised term frequencies (set use_idf=True for full TF-IDF)
tfidf_vectorizer = TfidfVectorizer(max_df=0.8,
                                 min_df=1, stop_words='english',
                                 use_idf=False,tokenizer=tokenizeAndStem, ngram_range=(1,1))  #unigram features only

tfidf_matrix = tfidf_vectorizer.fit_transform(commentsDf)
print ('tf-idf matrix shape',tfidf_matrix.shape)
terms = tfidf_vectorizer.get_feature_names()
#terms are list of features in the matrix
print ('Some features are',terms[500:530])

tf-idf matrix shape (387, 1038)
Some features are ['join', 'jose', 'josecastro79', 'journalist', 'juan', 'juicio', 'junk', 'just', 'justic', 'justicia', 'justif', 'justifi', 'jw', 'kill', 'know', 'ko', 'ku', 'label', 'lack', 'languag', 'layman', 'lc', 'lead', 'leav', 'lectur', 'legal', 'legit', 'let', 'level', 'lie']
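
Because the vectorizer above is created with `use_idf=False`, the matrix holds L2-normalised term frequencies rather than full TF-IDF weights. The plain-Python sketch below shows what the IDF factor would change on three made-up comments. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` that scikit-learn documents as its default; `weight_vectors` is a hypothetical helper, not a library function.

```python
import math
from collections import Counter

def weight_vectors(docs, use_idf=True):
    """Return one {term: weight} dict per document, L2-normalised."""
    counts = [Counter(doc.split()) for doc in docs]
    n_docs = len(docs)
    # document frequency: in how many documents each term occurs
    doc_freq = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        weights = {}
        for term, tf in c.items():
            idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 if use_idf else 1.0
            weights[term] = tf * idf
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

toy_comments = ["revert vandal", "revert link", "revert belief"]  # made-up data
tf_only = weight_vectors(toy_comments, use_idf=False)
tf_idf = weight_vectors(toy_comments, use_idf=True)
print(tf_only[0])
print(tf_idf[0])
```

With term frequency alone, 'revert' and 'vandal' receive the same weight in the first comment. With IDF, 'revert' is down-weighted because it occurs in every comment.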


Next we create **clusters**:

In [4]:
num_clusters = 4
km = KMeans(n_clusters=num_clusters)
km.fit(tfidf_matrix)

clusters = km.labels_.tolist()

def get_colors(clusters):
    colors=[]
    for i in clusters:
        if i==0:
            colors.append('red')
        elif i==1:
            colors.append('blue')
        elif i==2:
            colors.append('green')
        elif i==3:
            colors.append('yellow')
    return colors

colors = get_colors(clusters)
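
K-means starts from randomly chosen centroids, so without a fixed seed the cluster numbering (and hence the colours) can change from run to run. The following is a minimal pure-Python sketch of Lloyd's algorithm on toy 2-D points that takes an explicit seed. It is not the scikit-learn implementation, and `kmeans` and the sample points are assumptions made for illustration. In scikit-learn the analogous control is the `random_state` argument of `KMeans`.

```python
import random

def kmeans(points, k, seed=0, n_iter=20):
    """Lloyd's algorithm on a list of equal-length tuples; returns one label per point."""
    rng = random.Random(seed)
    # initialise centroids with k distinct data points chosen by the seeded generator
    centroids = [points[i] for i in rng.sample(range(len(points)), k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # assignment step: nearest centroid by squared Euclidean distance
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels

toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]  # two made-up blobs
labels = kmeans(toy_points, k=2, seed=1)
print(labels)
```

Calling `kmeans` twice with the same seed returns the same labels, which keeps the colour assigned to each cluster stable between runs.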

Let's look at the **top 10 features** of each cluster:

In [5]:
order_centroids = km.cluster_centers_.argsort()[:,::-1]
topFeatures=[]
for centres in order_centroids:
    topFeature =[]
    for words in centres[:10]:
        topFeature.append(terms[words])
    topFeatures.append(topFeature)
    print (topFeature)
    


['version', 'agustin', 'jibbon7', 'hamsterlopithecus', 'hispa', 'sara', 'honoria', 'snakey', 'rosalind', 'socorro']
['paraiso', 'belief', 'critic', 'link', 'page', 'practic', 'influenc', 'controversi', 'ad', 'replac']
['origin', 'hous', 'mind', 'definit', 'lectur', 'text', 'influenc', 'belief', 'error', 'especi']
['revert', 'revis', 'vandal', 'edit', 'use', '1', 'tw', 'identifi', 'xxxxxxxxx', 'edemir']


After this step, we focus on visualization, which requires **dimension reduction**.

We can try three dimension reduction algorithms:
 1. MDS - Multidimensional scaling (MDS) is a set of data analysis techniques that display the structure of distance-like data as a geometrical picture. [More info](http://www.benfrederickson.com/multidimensional-scaling)
 2. PCA - Principal component analysis applies an orthonormal linear transformation to the mean-centered data so that the leading components capture the maximum variance. It can be computed via an SVD of the centered data.
 3. Truncated SVD - Singular value decomposition (SVD) yields a low-rank approximation of the sparse term matrix (without centering it), so each user is described by a few latent 'concepts' instead of all term columns. [More information](https://nlp.stanford.edu/IR-book/html/htmledition/latent-semantic-indexing-1.html)
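
The practical difference between options 2 and 3 is centering: PCA subtracts the column means before decomposing, while truncated SVD works on the raw (possibly sparse) matrix. The sketch below illustrates this on four made-up 2-D points, using power iteration in plain Python. `top_direction` is a hypothetical helper, not a library function.

```python
import math

def top_direction(rows, center, n_iter=100):
    """Leading right singular vector of `rows` via power iteration on X^T X."""
    n_cols = len(rows[0])
    if center:
        means = [sum(col) / len(rows) for col in zip(*rows)]
        rows = [[x - m for x, m in zip(row, means)] for row in rows]
    # Gram matrix X^T X (n_cols x n_cols)
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(n_cols)] for i in range(n_cols)]
    v = [1.0] + [0.0] * (n_cols - 1)
    for _ in range(n_iter):
        w = [sum(gram[i][j] * v[j] for j in range(n_cols)) for i in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

toy_points = [(0, 10), (1, 11), (2, 12), (3, 13)]   # made-up data on the line y = x + 10
pca_like = top_direction(toy_points, center=True)    # direction of largest variance
svd_like = top_direction(toy_points, center=False)   # dominated by the offset from the origin
print([round(x, 3) for x in pca_like], [round(x, 3) for x in svd_like])
```

On this toy data the centered direction follows the line along which the points vary, while the uncentered one mostly points from the origin toward the data mean. This may be one reason the two layouts can look quite different for the same matrix.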

In [6]:
from sklearn.decomposition import PCA

PCAresult = PCA(n_components=2).fit_transform(tfidf_matrix.toarray())
xs = PCAresult[:,0]
ys = PCAresult[:,1]

In [7]:
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=2, n_iter=10, random_state=42)
svdResult = svd.fit_transform(tfidf_matrix)  

xs  = svdResult[:,0]
ys = svdResult[:,1]

In [8]:
#dimensionality reduction for the TF-IDF matrix

# reduce to two components as we're plotting points in a two-dimensional plane;
# we also specify `random_state` so the plot is reproducible.
mds = MDS(n_components=2, random_state=1)

#toarray() converts sparse array to dense numpy array
pos = mds.fit_transform(tfidf_matrix.toarray())  # shape (n_samples, n_components)

#store the dimensions in xs, ys
xs, ys = pos[:, 0], pos[:, 1]


Create the table used for graph rendering.
   * `posX` and `posY` are the coordinates produced by the dimension reduction algorithm.

In [9]:
userList  = list(userWiseComments['user'])

def update_userNetworkDf(df,xs,ys,clusters):
    
    userNetworkDf = pd.DataFrame(columns = ['user','posX','posY'])
    userNetworkDf['user'] = userList
    userNetworkDf['posX']  = xs
    userNetworkDf['posY']= ys
    userNetworkDf['cluster'] = clusters
    return userNetworkDf

userNetworkDf = update_userNetworkDf(df,xs,ys,clusters)

print (userNetworkDf.shape)
print (userNetworkDf.head())


(387, 4)
            user      posX      posY  cluster
0  200.119.211.x  0.929167  0.200235        2
1       Remedios  0.030112  0.962936        3
2        Eltri85  0.013001  0.006355        1
3         Aluino  0.989880 -0.194505        1
4       Clodoveo -0.096577 -0.762346        1


In [10]:
#supporting data structure for legends: one DataFrame per cluster

def updateSources(userNetworkDf):
    sourceMat=[]
    for name,group in userNetworkDf.groupby('cluster'):
        sourceMat.append(group)
    return sourceMat

sourceMat = updateSources(userNetworkDf)


In [22]:
edges = pd.DataFrame(columns=['from','to','fromName','toName'])

for row in df.iterrows():
    
    for name in userList:
        if name in row[1]['comment'] and name!=' ' and name!=row[1]['user']:
            fromName = row[1]['user']
            toName = name
            edgefrom = userNetworkDf[userNetworkDf['user'] ==row[1]['user'] ].index.tolist()[0]
            edgeto = userNetworkDf[ userNetworkDf['user'] ==name ].index.tolist()[0]
            edges.loc[len(edges)] = [edgefrom,edgeto,fromName,toName]
            
print (edges.shape)
print (edges.head()) 

(176, 4)
  from   to       fromName        toName
0   73  176  DailosTamanca       Gustava
1  176    7        Gustava  Moisescorral
2  344  263         Edemir        Dragon
3  176   99        Gustava    61.9.148.x
4   12  127      Salvatora       Honoria


**Code for plotting the graph.**
  * Nodes are the users who have edited the Wikipedia page.
  * Node colors are derived from the clustering.
  * An edge from one node to another means that the first user mentioned or reverted the second user in an edit comment.
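
The edge-building cell tests `name in comment`, a substring match, which could also fire when a short user name happens to be part of a longer word. A hedged alternative is to require word boundaries around the name. In the sketch below, `find_mention_edges` and the toy edits are hypothetical; only the idea of linking the editor to each user named in the comment comes from the notebook.

```python
import re

def find_mention_edges(edits, users):
    """edits: iterable of (editor, comment); returns (editor, mentioned_user) pairs."""
    # one pattern per user name; the lookarounds reject matches inside longer words
    patterns = {
        name: re.compile(r"(?<!\w)" + re.escape(name) + r"(?!\w)")
        for name in users if name.strip()
    }
    edges = []
    for editor, comment in edits:
        for name, pattern in patterns.items():
            if name != editor and pattern.search(comment):
                edges.append((editor, name))
    return edges

toy_users = ["Gustava", "Dragon", "DailosTamanca", "Edemir", "Remedios"]
toy_edits = [  # made-up edits in the style of the comment column
    ("DailosTamanca", "i support Gustava's changes"),
    ("Edemir", "Reverted good faith edits by Dragon"),
    ("Remedios", "Dragonfly image added"),
]
toy_edges = find_mention_edges(toy_edits, toy_users)
print(toy_edges)
```

`re.escape` matters here because many user names are masked IP addresses containing dots. The third toy edit produces no edge, since 'Dragon' appears only inside 'Dragonfly'.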

In [12]:
#plot
plot = figure(title="Graph of Wiki Users",
              height=700,width=1200, tools=["tap",'pan','wheel_zoom','box_zoom','reset','box_select'],
              x_range = (-1.1,1.1),
              y_range=(-1.1,2.6))


#legend code
source0 = ColumnDataSource(data=dict(x=sourceMat[0]['posX'],y=sourceMat[0]['posY'], 
                                                                    user = list(sourceMat[0]['user'])))
source1 = ColumnDataSource(data=dict(x=sourceMat[1]['posX'],y=sourceMat[1]['posY'], 
                                                                    user = list(sourceMat[1]['user'])))
source2 = ColumnDataSource(data=dict(x=sourceMat[2]['posX'],y=sourceMat[2]['posY'], 
                                                                    user = list(sourceMat[2]['user'])))
source3 = ColumnDataSource(data=dict(x=sourceMat[3]['posX'],y=sourceMat[3]['posY'], 
                                                                    user = list(sourceMat[3]['user'])))

c0 = plot.circle(x='x',y='y',source = source0,size=2,alpha=0.5,fill_color='red',line_color='red',legend = str(topFeatures[0]))
c1 = plot.circle(x='x',y='y',source=source1,size=2,alpha=0.5,fill_color='blue',line_color='blue',legend = str(topFeatures[1]))
c2 = plot.circle(x='x',y='y',source= source2,size=2,alpha=0.5,fill_color='green',line_color='green',legend = str(topFeatures[2]))
c3 = plot.circle(x='x',y='y',source=source3,size=2,alpha=0.5,fill_color='yellow',line_color='yellow',legend = str(topFeatures[3]))



########graph#########
graph = GraphRenderer()
Node  =  graph.node_renderer

########nodes##########
graph.node_renderer.glyph = Circle(size=7, fill_color='fill_color')
graph.node_renderer.selection_glyph = Circle(size=7, fill_color=Spectral4[2])
graph.node_renderer.nonselection_glyph = Circle(size=7, fill_color='fill_color',fill_alpha=0.1,line_alpha=0.1)
graph.node_renderer.hover_glyph = Circle(size=7, fill_color=Spectral4[1])

graph.edge_renderer.glyph = MultiLine(line_color="#CCCCCC", line_alpha=0.8, line_width=1)
graph.edge_renderer.selection_glyph = MultiLine(line_color=Spectral4[2], line_width=1)
graph.edge_renderer.hover_glyph = MultiLine(line_color=Spectral4[1], line_width=5)


#data
node_indices = list(range(0,len(userNetworkDf)))
Node.data_source.data = dict(index=node_indices,fill_color = colors,user = userNetworkDf['user'])
graph.edge_renderer.data_source.data = dict(start=list(edges['from']),end=list(edges['to']),
                                           startName = list(edges['fromName']),endName = list(edges['toName']))


### start of layout code
x = list(userNetworkDf['posX'])
y=list(userNetworkDf['posY'])
graph_layout = dict(zip(node_indices, zip(x, y)))
graph.layout_provider = StaticLayoutProvider(graph_layout=graph_layout)
edge_hover_tool = HoverTool( tooltips=[('revert/mentions',"@startName -> @endName")])
plot.add_tools(edge_hover_tool)
#c = plot.circle(x='x',y='y',source=source,size=15,alpha=0.001,fill_color='color')
plot.legend.label_text_font_size ='7pt'
plot.legend.location = 'top_left'
plot.legend.background_fill_alpha = 0.1
plot.renderers.append(graph)

#interaction policies
graph.inspection_policy = EdgesAndLinkedNodes()
graph.selection_policy = NodesAndLinkedEdges()


show(plot)



E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: fill_color [renderer: GlyphRenderer(id='5b12d53a-3108-470b-b055-167cc3276594', ...)]


The next cell sets up the interaction widget and its callback.

In [13]:
#####################################################################################################
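from bokeh.layouts import row, layout
from sklearn.manifold import TSNE  #used by the 'TSNE' branch of the callback (not offered in the Select options)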
DimReductionList =['MDS',
                  'PCA',
                  'Truncated SVD']

#widget
algo_attribute_select = Select(value='Truncated SVD',
                          title='Select algo for Dimension Reduction:',
                          width=200,
                          options=DimReductionList)



#callback
def update_algo_attribute(attrname,old,new):
    algo = algo_attribute_select.value
    print ('in the func')
    if algo=='MDS':
        mds = MDS(n_components=2, random_state=1)
        pos = mds.fit_transform(tfidf_matrix.toarray())  # shape (n_samples, n_components)
        xs, ys = pos[:, 0], pos[:, 1]
        #print (xs,ys)
    if algo=='PCA':
        PCAresult = PCA(n_components=2).fit_transform(tfidf_matrix.toarray())
        xs = PCAresult[:,0]
        ys = PCAresult[:,1]
        #print (xs,ys)
    if algo=='TSNE':
        TSNEresult = TSNE(n_components=2).fit_transform(tfidf_matrix.toarray())
        xs = TSNEresult[:,0]
        ys = TSNEresult[:,1]
        #print (xs,ys)
    if algo=='Truncated SVD':
        svd = TruncatedSVD(n_components=2, n_iter=10, random_state=42)
        svdResult = svd.fit_transform(tfidf_matrix)
        xs  = svdResult[:,0]
        ys = svdResult[:,1]
        #print (xs,ys)
    
    userNetworkDf = update_userNetworkDf(df,xs,ys,clusters) 
    print ('userNetworkDf updated..')
    sourceMat =updateSources(userNetworkDf)
    
    #data
    node_indices = list(range(0,len(userNetworkDf)))
    Node.data_source.data = dict(index=node_indices,fill_color = colors,user = userNetworkDf['user'])


    ### start of layout code
    x = list(userNetworkDf['posX'])
    y=list(userNetworkDf['posY'])
    graph_layout = dict(zip(node_indices, zip(x, y)))
    graph.layout_provider = StaticLayoutProvider(graph_layout=graph_layout)
    plot.renderers.append(graph)
    
    source0.data = dict(x=sourceMat[0]['posX'],y=sourceMat[0]['posY'], 
                                                                    user = list(sourceMat[0]['user']))
    source1.data = dict(x=sourceMat[1]['posX'],y=sourceMat[1]['posY'], 
                                                                        user = list(sourceMat[1]['user']))
    source2.data = dict(x=sourceMat[2]['posX'],y=sourceMat[2]['posY'], 
                                                                        user = list(sourceMat[2]['user']))
    source3.data = dict(x=sourceMat[3]['posX'],y=sourceMat[3]['posY'], 
                                                                        user = list(sourceMat[3]['user']))


    

print ('\n Red'+str(topFeatures[0])+' \n Blue'+str(topFeatures[1])+'\n Green'+str(topFeatures[2])+ 
               '\n Yellow'+ str(topFeatures[3]))
   

algo_attribute_select.on_change('value', update_algo_attribute)




 Red['version', 'agustin', 'jibbon7', 'hamsterlopithecus', 'hispa', 'sara', 'honoria', 'snakey', 'rosalind', 'socorro'] 
 Blue['paraiso', 'belief', 'critic', 'link', 'page', 'practic', 'influenc', 'controversi', 'ad', 'replac']
 Green['origin', 'hous', 'mind', 'definit', 'lectur', 'text', 'influenc', 'belief', 'error', 'especi']
 Yellow['revert', 'revis', 'vandal', 'edit', 'use', '1', 'tw', 'identifi', 'xxxxxxxxx', 'edemir']


A few users stand out in the group with top features ['belief', 'use', 'vandal', 'revert', 'revis', 'link', 'edit', 'practic', 'critic', 'influenc']:

* VictoriaV
* RyogaNica
* DailosTamanca
* Rm99

In another group, with top features ['paraiso', 'religion', 'replac', 'page', 'critic', 'belief', 'cult', 'scientif', 'celebr', "'s"]:
* Sara
* Edemir
* Agustin
* Amado


To be noted:
- Sara and VictoriaV have both edited or reverted DailosTamanca, so they are possibly allied. No other pair of users in the graph shares a revert target.
- VictoriaV clearly opposes Rm99 and Agustin, and Sara also has edges to Rm99. This supports the hypothesis above that VictoriaV and Sara are allied.
- We cannot yet tell which side Edemir is on, although he has the most edges.


## Method 2: Creating groups via reverts and mentions

In this method we hypothesize that the users split into 4 major groups:
1. Bots.
2. Neutral users (unbiased).
3. Users aligned with the Paraiso movement (pro-Paraiso).
4. Users critical of the Paraiso movement (anti-Paraiso).

Wikipedia bots have "Bot" at the end of their name, so the first group is easy to identify. The other groups are derived from the mentions collected in the cells below.
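
A minimal sketch of that name-based filter is shown below. It assumes a case-insensitive check; `split_bots` is a hypothetical helper, and the sample names are illustrative rather than taken from the data file. Whether any bot accounts actually occur in this data set is not shown in the notebook.

```python
def split_bots(users):
    """Separate user names ending in 'bot' (any case) from all other names."""
    bots = [u for u in users if u.lower().endswith("bot")]
    others = [u for u in users if not u.lower().endswith("bot")]
    return bots, others

# illustrative names only, not read from the edit history
bots, others = split_bots(["SmackBot", "Edemir", "ClueBot", "200.119.211.x"])
print(bots)
print(others)
```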

In [14]:
userWiseComments2 = getUserWiseComments(df)
userMentions = pd.DataFrame(columns=['user','mention','phrase', 'to'])
listOfUserNames = list(userWiseComments2['user'])
#sia = SentimentIntensityAnalyzer()
for row in userWiseComments2.iterrows():
    tokens = row[1]['allComments'].split()
    for name in listOfUserNames:
        indexes = [i for i,token in enumerate(tokens) if token==name]
        for i in indexes:
            start = max(0, i-5)
            phrase = " ".join(tokens[start:i+1])
            printText = ""
            to = False
            for word in tokens[start:i+1]:
                if(word == 'to' or word == 'To'):
                    printText +=word
                    to = True
                else:
                    printText +=word + " "
            #print(printText)
            userMentions.loc[len(userMentions)] = [row[1]['user'],name,phrase,to]
            
          

In [15]:
relationshipMatrix = {}
for name1 in userList:
    relationshipMatrix[name1] = {}
    for name2 in userList:
        relationshipMatrix[name1][name2] = 0
        
for row in userMentions.iterrows():
    if(row[1]['to'] == True):
        if(row[1]['user']<row[1]['mention']):
            relationshipMatrix[row[1]['user']][row[1]['mention']] += 1
        else:
            relationshipMatrix[row[1]['mention']][row[1]['user']] += 1
    else:
        if(row[1]['user']<row[1]['mention']):
            relationshipMatrix[row[1]['user']][row[1]['mention']] -= 1
        else:
            relationshipMatrix[row[1]['mention']][row[1]['user']] -= 1

relationships = pd.DataFrame(columns=['user1','user2','relationship'])
i=0
for key1,dictionary in relationshipMatrix.items():
        for key2,value in dictionary.items():
            if(value != 0):
                relationships.loc[i] = [key1,key2,value]
                i+=1
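
The two cells above can be condensed into a small, sparse version. The sketch assumes the heuristic read off the code: a mention with the word "to" among the five preceding tokens (as in "to last revision by X") counts as +1, any other mention counts as -1, and scores are stored once per unordered pair. `score_mentions` and the sample comment (adapted from one of the edit comments) are hypothetical.

```python
from collections import defaultdict

def score_mentions(comments_by_user, window=5):
    """comments_by_user: {user: concatenated comments}; returns {(userA, userB): score}."""
    names = set(comments_by_user)
    scores = defaultdict(int)
    for user, text in comments_by_user.items():
        tokens = text.split()
        for i, token in enumerate(tokens):
            if token not in names:
                continue
            # look at up to `window` tokens before the mentioned name
            context = tokens[max(0, i - window):i]
            sign = 1 if ("to" in context or "To" in context) else -1
            pair = tuple(sorted((user, token)))   # one entry per unordered pair
            scores[pair] += sign
    return dict(scores)

sample_comments = {
    "Edemir": "Reverted 1 edit by Alberto identified as vandalism to last revision by Sara",
    "Sara": "",
    "Alberto": "",
}
pair_scores = score_mentions(sample_comments)
print(pair_scores)
```

Storing only the pairs that actually occur avoids the dense user-by-user dictionary, most of whose entries stay at zero.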
                


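The relationship matrix is rendered as a heatmap. Each mention preceded by the word "to" (as in "reverted to revision by X") adds 1 to the pair's score, and every other mention subtracts 1.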

In [16]:
from bokeh.models import (
    ColumnDataSource,
    HoverTool,
    LinearColorMapper,
    BasicTicker,
    PrintfTickFormatter,
    ColorBar,
)
from math import pi
import pandas as pd
from bokeh.palettes import Spectral6
from bokeh.io import show

relationdf = pd.DataFrame.from_dict(data=relationshipMatrix)
#print (relationdf.head())


relationdf.index.name = 'user1'
relationdf.columns.name = 'user2'

user1 = list(relationdf.index)
user2 = list(relationdf.columns)

df2 = pd.DataFrame(relationdf.stack(), columns=['coeff']).reset_index()
#colors = [ "#a5bab7", "#c9d9d3", "#e2e2e2", "#dfccce", "#ddb7b1", "#cc7878", "#933b41"]
colorsHeatMap = [ "#a5bab7", "#ddb7b1", "#dfccce","#933b41"]
colorsHeatMap = list(reversed(colorsHeatMap))
mapper = LinearColorMapper(palette=colorsHeatMap, low=df2.coeff.min(), high=df2.coeff.max())
source = ColumnDataSource(df2)

TOOLS = "hover,save,pan,box_zoom,reset,wheel_zoom"

p = figure(title="relationship mat",
           x_range=user1, y_range=user2,
           x_axis_location="above", plot_width=1000, plot_height=700,
           tools=TOOLS, toolbar_location='below')

p.grid.grid_line_color = None
p.axis.axis_line_color = None
p.axis.major_tick_line_color = None
p.axis.major_label_text_font_size = "3pt"
p.axis.major_label_standoff = 1
p.xaxis.major_label_orientation = pi / 3
p.yaxis.major_label_orientation = pi / 3

p.rect(x="user1", y="user2", width=1, height=1,
       source=source,
       fill_color={'field': 'coeff', 'transform': mapper},
       line_color=None)

color_bar = ColorBar(color_mapper=mapper, major_label_text_font_size="5pt",
                     ticker=BasicTicker(desired_num_ticks=len(colorsHeatMap)),
                     formatter=PrintfTickFormatter(format="%d"),
                     label_standoff=6, border_line_color=None, location=(0, 0))
p.add_layout(color_bar, 'right')

p.select_one(HoverTool).tooltips = [
     ('compare', '@user2 -> @user1'),
     ('score', '@coeff'),
]

show(p) 

E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: fill_color [renderer: GlyphRenderer(id='5b12d53a-3108-470b-b055-167cc3276594', ...)]


**Observations with regard to HW5 and HW6:**
* Truncated SVD and PCA give a fairly good idea of the clusters. The MDS layout is not good enough.
* Method 1 works very well for finding 'edit wars' in Wikipedia, i.e. finding two factions and their ideology.
* The group with features such as 'reverts' or 'edits' opposes another group. One faction can be seen as aligned with the Indian state and the other with rebels/Pakistan.
* The relationship matrix reiterates the members of a particular group. The heat map has 289*289 cells, so it is difficult to read on small screens, but it gives a very good idea about a specific user.

**Observations with regard to HW7:**
* Truncated SVD gives poor results this time: many nodes overlap each other. The MDS and PCA layouts are more spread out, so they are easier to inspect.
* Many users who have edits but no edges are possibly in the neutral group.
* The MDS layout has a central node where several users, including many bots, overlap. It is therefore useless for inspection.
* One cluster contains features such as ['belief', 'version', 'vandal', 'use', 'revert', 'revis', 'link', 'edit', 'practic', 'critic']. It may contain both factions of the edit war: pro-Paraiso and anti-Paraiso.
* The heat map is much less useful here than in the previous homework and does not give any good information.

*Using observations from the graph*

Let's look at our predictions and analyse the situation.

In [17]:

print ((userWiseComments[userWiseComments['user']=='Edemir']['allComments'].tolist()))


['Reverted good faith edits by Dragon; No need to dampen it; the reference points to a blanket opinion of psychiatry and psychology.. using TW  - Counselor doesn\'t need quotes here. Reverted 1 edit by 24.195.147.x identified as vandalism to last revision by Diegoob. using TW  Paraiso as a cult - More cleanup, more typos, fixed refs, etc. Texts and Lectures - More vandalism cleanup, and moved a period Origin - Trimming out more vandalism .  - Removing vandalism Minor language - Good job on that pernicious sentence, Amado That darn sentence... I\'m not even sure the intro should be this long. Restored language that actually does reflect the contents of the refs, and removed a ref that didn\'t apply.  Neutral, factually accurate. Says nothing more or less than that the State Department has such reports, and that Paraisos report descrimination. Previous phrasing is more neutral, but the ref is fine. Introduction - Minor language Reverted 2 edits by Alverio identified as vandalism to last 

**In the graph, Edemir notably has edges to:**
- VictoriaV
- Sara
- Amado

**A few highlights from Edemir's edits:**
- Reverted 1 edit by Alberto identified as vandalism to last revision by Sara. 
- Undid revision xxxxxxxxx - VictoriaV, it\'s still not relevant
- addition  Scientific criticism of Paraiso\'s beliefs
- revert Typos Paraiso as a state-recognized religion - If references for these statements aren\'t found post-haste, this paragraph goes.
- **using TW Home Health Care - added confrontation of Paraiso members and Dept of Health Home Health Care - Dept of Health intervention**



In [18]:

print ((userWiseComments[userWiseComments['user']=='Amado']['allComments'].tolist()))


['Morals and Ethics Organizations - moved pics around Influences Paraiso Justice - This whole section is false, there is no such info in Paraiso Ethics Book, I did look! I was going to edit but this Justice is of too little importance to be here. Paraiso Ethics - moved a sentence around The Cielo - wikilink Restored Paraiso Paraiso basic concepts while maintaining "well sourced materials" know you don\'t have an excuse to undue me Paraiso Ethics - The states of existance Ethics in Paraiso Ethics in Paraiso Criticisms of Paraiso Ethics Paraiso Justice Deleted false statement. A person can only be declared afther it has been proven in a Justicia Juicio that he commited a high crime. Intro to C. Ethics 1998 is no longer used refer to 2006 edition only Morals Ethics in Paraiso - Created subsection for Criticisms of Paraiso Ethics The Cielo Right and Wrong morals Survive and Right and Wrong New: Right and wrong  Removed pov pushing from first sentence, the controversy is already covered in 

**In the graph, Amado has reverted/mentioned:**
- Rm99

**Sara, Edemir, VictoriaV and RyogaNica have reverted edits by Amado.**

**Highlights from Amado's edits:**
- there are also Journalists, courts and the governing bodies that are not critical of Paraiso
- Influences - removed uncited opinion
- Stop your POV pushing Rm99!!!!
- Paraiso Justice Deleted false statement

Amado appears to be pro-Paraiso, in the same group as VictoriaV and Sara.

In [19]:
print ((userWiseComments[userWiseComments['user']=='VictoriaV']['allComments'].tolist()))

['WEASEL word, there is no dispute here  Influences - read the ref, there goes the propaganda once more. Influences - missing fact added Paraiso as a state-recognized religion Reverted 1 edit by Edemir; See, GoodD, you actually should read the refs. makes more sense then. . using TW Reverted 1 edit by Edemir; "listed" is factually wrong. each country gets its own report.. using TW added ref and adjusted text  non-RS middle thing it\'s a bit stronger, man. porn link farm removed typo try google site:state.gov +Paraiso before you kill text. just same samples found in 30sec. non-RS needs to be exchanged with real ref original source for ref included  Demography of the United States  anyway, i don\'t insist on the %s but the pop is in the ref.  well, ok, i used a calculator. easy on WP:V  Membership - dunno the methodic reqs but this % is really low, isn\'t it. another one code stuff ". Reverted to revision xxxxxxxxx by Mandos; unreliable source cannot be used. using TW Reverted to revisio

**In the graph, VictoriaV notably has edges to:**
- Edemir
- Rm99
- Agustin
- DailosTamanca
- Amado

**Highlights from VictoriaV's edits:**
- added Paraiso as a state-recognized religion
- Undid revision xxxxxxxxx by Agustin If you don\'t understand something, ask, but don\'t delete references
- Reverted 1 edit by Rm99 identified as vandalism to last revision by VictoriaV
- **There are many instances directed against Agustin and Rm99.**

**Final Observations**
- VictoriaV describes Paraiso as a state-recognized religion, so he or she is pro-Paraiso. Sara and Amado are on the same side.
- The other group consists of Agustin, Edemir and Rm99.
- 'Vandalism' and related words such as 'unvandalised' are mentioned often, particularly by Edemir and VictoriaV. The term even appears among the top 10 features of two clusters.
- Since Edemir is anti-Paraiso, he writes ''using TW Home Health Care - added confrontation of Paraiso members and Dept of Health Home Health Care - Dept of Health intervention''. **This possibly suggests that some elements of the movement might be involved in violence.**

**Things to be noted**
Visualization was used as a tool to guide the analysis, not to give the final answer. Although clustering gave a very good idea of the groups and factions, one assumption was made when building the graph: every mention or revert of another user in a comment was treated as negative (oppositional). This assumption made it possible to draw and visualise the edges. Only after inspecting the comments further could I draw the final conclusions.

**The answer we arrived at was already known, although our visualization differed somewhat. Had the answer not been known, most of the users could still have been grouped correctly by visual inspection alone. The only uncertain case was Edemir.**

In [20]:
l=layout ([
    [plot],
    [algo_attribute_select],
    [p]
])
curdoc().add_root(l)
show(l)

E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: fill_color [renderer: GlyphRenderer(id='5b12d53a-3108-470b-b055-167cc3276594', ...)]
E-1001 (BAD_COLUMN_NAME): Glyph refers to nonexistent column name: fill_color [renderer: GlyphRenderer(id='5b12d53a-3108-470b-b055-167cc3276594', ...)]
W-1004 (BOTH_CHILD_AND_ROOT): Models should not be a document root if they are in a layout box: Figure(id='1a17b0b6-3eeb-4514-8a3a-5d25534ea160', ...)
W-1004 (BOTH_CHILD_AND_ROOT): Models should not be a document root if they are in a layout box: Figure(id='20e33fe2-7b7a-4695-8692-0a151d2930f9', ...)
