# Recommendations: Part 2



In [13]:
from py2neo import Graph
import pandas as pd

import matplotlib 
import matplotlib.pyplot as plt

plt.style.use('fivethirtyeight')
pd.set_option('display.float_format', lambda x: '%.3f' % x)
pd.set_option('display.max_colwidth', 100)

Update the cell below to use the IP Address, Bolt Port, and Password, as you did previously.

In [14]:
# Change the line below to use the Bolt URL and password of your Neo4j database instance.
# graph = Graph("<Bolt URL>", auth=("neo4j", "<Password>"))
 
graph = Graph("bolt://localhost:7687", auth=("neo4j", "1234"))

## PageRank

[PageRank](https://neo4j.com/docs/graph-data-science/current/algorithms/page-rank/) is an algorithm that measures the transitive influence or connectivity of nodes. It can be computed by either iteratively distributing one node’s rank (originally based on degree) over its neighbors or by randomly traversing the graph and counting the frequency of hitting each node during these walks.

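The iterative formulation described above can be sketched in a few lines of plain Python. This is a minimal illustration on a made-up toy graph, not the Graph Data Science implementation: the function name `pagerank`, the toy edge list, and the choice of the unnormalized update rule (each node receives `1 - damping` plus a damped share of the scores of the nodes that cite it) are all assumptions made for the example. Dangling nodes simply pass on no score here.

```python
def pagerank(edges, damping=0.85, max_iterations=20, tolerance=1e-7):
    """Iterative PageRank over a list of (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_degree = {n: 0 for n in nodes}
    incoming = {n: [] for n in nodes}
    for source, target in edges:
        out_degree[source] += 1
        incoming[target].append(source)

    # Every node starts with the baseline score it would keep with no citations.
    scores = {n: 1.0 - damping for n in nodes}
    for _ in range(max_iterations):
        # Each node collects an equal share of the score of every node citing it.
        new_scores = {
            n: (1.0 - damping)
            + damping * sum(scores[s] / out_degree[s] for s in incoming[n])
            for n in nodes
        }
        delta = max(abs(new_scores[n] - scores[n]) for n in nodes)
        scores = new_scores
        if delta < tolerance:
            break  # scores stopped changing, so the iteration has converged
    return scores


# Hypothetical toy graph: A and B both cite C, and C cites D.
toy_edges = [("A", "C"), ("B", "C"), ("C", "D")]
toy_scores = pagerank(toy_edges)
```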
Run this PageRank code over the whole graph to find out the most influential article in terms of citations:

In [15]:
query = """
CALL gds.pageRank.write({
  nodeProjection:'Article', 
  relationshipProjection:'CITED',
  writeProperty:'pagerank'})
"""
graph.run(query).data()

[{'writeMillis': 630,
  'nodePropertiesWritten': 184313,
  'ranIterations': 20,
  'didConverge': False,
  'centralityDistribution': {'p1': 0.14999961853027344,
   'max': 8.435179710388184,
   'p5': 0.14999961853027344,
   'p90': 0.21162700653076172,
   'p50': 0.1627492904663086,
   'p95': 0.24867820739746094,
   'p10': 0.14999961853027344,
   'p75': 0.17645645141601562,
   'p99': 0.4000425338745117,
   'p25': 0.1536426544189453,
   'p100': 8.435179710388184,
   'min': 0.14999961853027344,
   'mean': 0.1774682370197582,
   'stdDev': 0.07776703571044356},
  'postProcessingMillis': 281,
  'createMillis': 299,
  'computeMillis': 2320,
  'configuration': {'maxIterations': 20,
   'writeConcurrency': 4,
   'relationshipWeightProperty': None,
   'cacheWeights': False,
   'concurrency': 4,
   'sourceNodes': [],
   'writeProperty': 'pagerank',
   'nodeLabels': ['*'],
   'sudo': False,
   'dampingFactor': 0.85,
   'relationshipTypes': ['*'],
   'tolerance': 1e-07}}]

This query stores a 'pagerank' property on each node. Execute this code to view the most influential articles:

In [16]:
query = """
MATCH (a:Article)
RETURN a.title as article,
       a.pagerank as score
ORDER BY score DESC 
LIMIT 10
"""
graph.run(query).to_data_frame()

| | article | score |
|---|---|---|
| 0 | | 8.435 |
| 1 | Rough sets | 6.877 |
| 2 | | 5.66 |
| 3 | | 4.24 |
| 4 | | 4.145 |
| 5 | | 4.1 |
| 6 | Revised report on the algorithm language ALGOL 60 | 4.009 |
| 7 | A method for obtaining digital signatures and public-key cryptosystems | 3.89 |
| 8 | | 3.88 |
| 9 | | 3.857 |


## Personalized PageRank

[Personalized PageRank](https://neo4j.com/docs/graph-data-science/current/algorithms/page-rank/#algorithms-page-rank-examples-personalised) is a variant of PageRank that allows us to find influential nodes based on a set of source nodes.

For example, rather than finding the overall most influential articles, you could instead find the most influential articles with respect to a given author.
Execute this code to use a personalized PageRank algorithm:

In [17]:
query = """
MATCH (a:Author {name: $author})<-[:AUTHOR]-(article)-[:CITED]->(other)
WITH collect(article) + collect(other) AS sourceNodes
CALL gds.pageRank.stream({
  nodeProjection:'Article',
  relationshipProjection:'CITED',
  sourceNodes: sourceNodes})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).title AS article, score
ORDER BY score DESC
LIMIT 10
"""

author_name = "Peter G. Neumann"
graph.run(query, {"author": author_name}).to_data_frame()

| | article | score |
|---|---|---|
| 0 | | 0.374 |
| 1 | A technique for software module specification with examples | 0.289 |
| 2 | Public interest and the NII | 0.278 |
| 3 | | 0.278 |
| 4 | | 0.278 |
| 5 | Risks of e-voting | 0.278 |
| 6 | The foresight saga | 0.278 |
| 7 | | 0.278 |
| 8 | | 0.278 |
| 9 | A messy state of the union: taming the composite state machines of TLS | 0.225 |

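To make the role of the source nodes concrete, here is a minimal sketch of one common way to personalize PageRank: only the source nodes receive the baseline `1 - damping` score, so rank can only flow outward from them along citations. The function name `personalized_pagerank` and the toy graph are hypothetical, and the Graph Data Science library may differ in its details.

```python
def personalized_pagerank(edges, source_nodes, damping=0.85, iterations=20):
    """PageRank in which only the source nodes receive the baseline score."""
    nodes = sorted({n for edge in edges for n in edge} | set(source_nodes))
    out_degree = {n: 0 for n in nodes}
    incoming = {n: [] for n in nodes}
    for source, target in edges:
        out_degree[source] += 1
        incoming[target].append(source)

    sources = set(source_nodes)
    # Non-source nodes get no baseline, so they only score if rank reaches them.
    restart = {n: (1.0 - damping) if n in sources else 0.0 for n in nodes}
    scores = dict(restart)
    for _ in range(iterations):
        scores = {
            n: restart[n]
            + damping * sum(scores[s] / out_degree[s] for s in incoming[n])
            for n in nodes
        }
    return scores


# Hypothetical toy graph: A and B cite C, C cites D, and E cites F separately.
toy_citations = [("A", "C"), ("B", "C"), ("C", "D"), ("E", "F")]
toy_personalized = personalized_pagerank(toy_citations, source_nodes=["A"])
```

In this toy graph, E and F end up with a score of zero because nothing connects them to the source node A, which is the behavior that lets the results differ from one author to the next.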

## Topic Sensitive Search

You can also use Personalized PageRank to do 'Topic-Sensitive PageRank'.

When authors search for articles to read, they want the results to take their own publications into account. Two authors using the same search term would expect to see different results depending on their area of research.

Create a full text search index on the 'title' and 'abstract' properties of all nodes that have the label 'Article' by executing this code:

In [6]:
query = """
    CALL db.index.fulltext.createNodeIndex('articles', ['Article'], ['title', 'abstract'])
"""
graph.run(query).data()

[]

Check that the full text index has been created by running the following query:

In [7]:
query = """
CALL db.indexes()
YIELD name, state, populationPercent, uniqueness, type, entityType, labelsOrTypes, properties, provider
WHERE type = "FULLTEXT"
RETURN *
"""
graph.run(query).to_data_frame()

| | entityType | labelsOrTypes | name | populationPercent | properties | provider | state | type | uniqueness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | NODE | [Article] | articles | 100.0 | [title, abstract] | fulltext-1.0 | ONLINE | FULLTEXT | NONUNIQUE |


You can search the full text index like this:

In [8]:
query = """
CALL db.index.fulltext.queryNodes("articles", "open source")
YIELD node, score
RETURN node.title as article, score, [(author)<-[:AUTHOR]-(node) | author.name] AS authors
LIMIT 10
"""
graph.run(query).to_data_frame()

| | article | score | authors |
|---|---|---|---|
| 0 | Progressive open source | 11.74 | [Jamie Dinkelacker, Pankaj K. Garg, Dean Nelson, Rob Miller] |
| 1 | Open source application spaces: the 5th workshop on open source software engineering | 11.15 | [Joseph Feller, Scott A. Hissam, Brian Fitzgerald, Krishna K Lakhani, Walt Scacchi] |
| 2 | Reusing Open-Source Software and Practices: The Impact of Open-Source on Commercial Vendors | 11.016 | [Alan W. Brown, Grady Booch] |
| 3 | From Research Software to Open Source | 10.328 | [Susan L. Graham] |
| 4 | The comment density of open source software code | 10.292 | [Oliver Arafati, Dirk Riehle] |
| 5 | Software architecture in an open source world | 10.25 | [Roy T. Fielding] |
| 6 | Managing a corporate open source software asset | 10.187 | [James D. Herbsleb, Vijay K. Gurbani, Anita Garvert] |
| 7 | Organizational adoption of open source software: barriers and remedies | 9.894 | [Anol Bhattacherjee, Areej M. Yassin, Del Nagy] |
| 8 | IBM's pragmatic embrace of open source | 9.849 | [Pamela Samuelson] |
| 9 | Analysing the reliability of Open Source software projects | 9.683 | [Lerina Aversano, Maria Tortorella] |

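Neo4j full text indexes are backed by Apache Lucene, and the search string passed to `db.index.fulltext.queryNodes` is interpreted as a Lucene query. An unquoted `open source` is generally treated as two separate terms, so an article can match on either word in its title or abstract. If you want the exact phrase instead, one option is to wrap the term in double quotes and pass it as a query parameter. The sketch below assumes that behavior; `phrase_query` is a hypothetical helper, not part of py2neo or Neo4j.

```python
def phrase_query(term):
    """Wrap a search term in double quotes so Lucene treats it as a phrase.

    Hypothetical helper. It refuses terms that already contain a double
    quote rather than guessing how they should be escaped.
    """
    if '"' in term:
        raise ValueError("term must not contain double quotes")
    return '"' + term + '"'


# Same full text lookup as above, with the search string passed as a parameter.
phrase_cypher = """
CALL db.index.fulltext.queryNodes("articles", $searchTerm)
YIELD node, score
RETURN node.title AS article, score
LIMIT 10
"""
phrase_params = {"searchTerm": phrase_query("open source")}

# With a live connection, this would be run as:
# graph.run(phrase_cypher, phrase_params).to_data_frame()
```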

Here is a query that ranks authors by the total full text search score of their articles matching 'open source':

In [10]:
query = """
CALL db.index.fulltext.queryNodes("articles", "open source")
YIELD node, score
MATCH (node)-[:AUTHOR]->(author)
RETURN author.name as author, sum(score) AS totalScore, collect(node.title) AS articles
ORDER BY totalScore DESC
LIMIT 20
"""

graph.run(query).to_data_frame()

| | author | totalScore | articles |
|---|---|---|---|
| 0 | Denys Poshyvanyk | 51.326 | [Machine learning-based detection of open source license exceptions, Recommending source code fo... |
| 1 | Brian Fitzgerald | 49.406 | [Open source application spaces: the 5th workshop on open source software engineering, The 3rd w... |
| 2 | Joseph Feller | 45.596 | [Open source application spaces: the 5th workshop on open source software engineering, The 3rd w... |
| 3 | James D. Herbsleb | 38.699 | [Managing a corporate open source software asset, A case study of a corporate open source develo... |
| 4 | Gail C. Murphy | 36.538 | [Who should fix this bug, Hipikat: recommending pertinent software development artifacts, FEAT a... |
| 5 | Walt Scacchi | 34.336 | [Open source application spaces: the 5th workshop on open source software engineering, Experienc... |
| 6 | Ahmed E. Hassan | 33.8 | [A study of the quality-impacting practices of modern code review at Sony mobile, An empirical s... |
| 7 | Daniel M. German | 33.478 | [Machine learning-based detection of open source license exceptions, Open source-style collabora... |
| 8 | Martin P. Robillard | 33.017 | [Disseminating architectural knowledge on open-source projects: a case study of the book "archit... |
| 9 | Scott A. Hissam | 32.555 | [Open source application spaces: the 5th workshop on open source software engineering, The 3rd w... |

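The aggregation in the query above (`sum(score)` and `collect(node.title)` grouped by author) can be mirrored in plain Python to see what the ranking rewards. The hits below are made-up placeholders rather than rows from the database, and `rank_authors` is a hypothetical helper written for this sketch.

```python
from collections import defaultdict

# Made-up (title, score, authors) hits standing in for full text search results.
hits = [
    ("Article A", 3.0, ["Author X", "Author Y"]),
    ("Article B", 2.0, ["Author X"]),
    ("Article C", 1.5, ["Author Z"]),
]


def rank_authors(hits):
    """Sum search scores per author and collect the matching titles."""
    totals = defaultdict(float)
    titles = defaultdict(list)
    for title, score, authors in hits:
        for author in authors:
            totals[author] += score
            titles[author].append(title)
    # Highest total score first, like ORDER BY totalScore DESC.
    ranked = sorted(totals, key=totals.get, reverse=True)
    return [(author, totals[author], titles[author]) for author in ranked]


author_ranking = rank_authors(hits)
```

Because the scores are summed, an author can rank highly either through many weakly matching articles or through a few strongly matching ones.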

Next, use full text search and Personalized PageRank to find interesting articles for different authors:

In [11]:
query = """
MATCH (a:Author {name: $author})<-[:AUTHOR]-(article)-[:CITED]->(other)
WITH a, collect(article) + collect(other) AS sourceNodes
CALL gds.pageRank.stream({
  nodeQuery: 'CALL db.index.fulltext.queryNodes("articles", $searchTerm)
   YIELD node, score
   RETURN id(node) as id',
  relationshipQuery: 'MATCH (a1:Article)-[:CITED]->(a2:Article) 
   RETURN id(a1) as source,id(a2) as target', 
  sourceNodes: sourceNodes,
  validateRelationships:false,
  parameters: {searchTerm: $searchTerm}})
YIELD nodeId, score
WITH a, gds.util.asNode(nodeId) AS n, score
WHERE NOT exists((a)<-[:AUTHOR]-(n)) AND score > 0
RETURN n.title AS article, score, [(n)-[:AUTHOR]->(author) | author.name][..5] AS authors
ORDER BY score DESC
LIMIT 10
"""

params = {"author": "Tao Xie", "searchTerm": "open source"}
graph.run(query, params).to_data_frame()

| | article | score | authors |
|---|---|---|---|
| 0 | Static detection of cross-site scripting vulnerabilities | 0.386 | [Zhendong Su, Gary Wassermann] |
| 1 | Who should fix this bug | 0.278 | [Gail C. Murphy, John Anvik, Lyndon Hiew] |
| 2 | Automated, contract-based user testing of commercial-off-the-shelf components | 0.278 | [Yvan Labiche, Michal M. Sówka, Lionel C. Briand] |
| 3 | Concern graphs: finding and describing concerns using structural program dependencies | 0.278 | [Martin P. Robillard, Gail C. Murphy] |
| 4 | Characterizing logging practices in open-source software | 0.278 | [Yuanyuan Zhou, Ding Yuan, Soyeon Park] |
| 5 | Conceptual module querying for software reengineering | 0.236 | [Elisa L. A. Baniassad, Gail C. Murphy] |
| 6 | Bandera: extracting finite-state models from Java source code | 0.15 | [James C. Corbett, Shawn Laubach, John Hatcliff, Robby, Matthew B. Dwyer] |
| 7 | AsDroid: detecting stealthy behaviors in Android applications by user interface and program beha... | 0.15 | [Xiangyu Zhang, Bin Liang, Jianjun Huang, Lin Tan, Peng Wang] |
| 8 | Semantics-based code search | 0.15 | [Steven P. Reiss] |
| 9 | EXSYST: search-based GUI testing | 0.128 | [Gordon Fraser, Florian Gross, Andreas Zeller] |

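The projection step in the query above restricts the graph to the articles returned by the full text search, while the relationship query still returns every `CITED` relationship in the database. Setting `validateRelationships:false` is understood here to mean that relationships touching an article outside the search results are skipped instead of causing an error; check the Graph Data Science documentation for your version. The filtering can be sketched in plain Python with made-up node names; `project_topic_subgraph` is a hypothetical helper.

```python
def project_topic_subgraph(edges, topic_nodes):
    """Split edges into those inside the topic node set and those skipped."""
    topic = set(topic_nodes)
    kept = [(s, t) for s, t in edges if s in topic and t in topic]
    skipped = [(s, t) for s, t in edges if not (s in topic and t in topic)]
    return kept, skipped


# Made-up citations between papers P1..P5; only P1, P2, P3 match the search.
citations = [("P1", "P2"), ("P2", "P3"), ("P3", "P4"), ("P5", "P2")]
search_hits = ["P1", "P2", "P3"]
kept_edges, skipped_edges = project_topic_subgraph(citations, search_hits)
```

Personalized PageRank then runs only over the kept edges, so the scores reflect both the search topic and the source nodes chosen for the author.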

Execute the same query with a different author:

In [12]:
params = {"author": "Marco Aurélio Gerosa", "searchTerm": "open source"}
graph.run(query, params).to_data_frame()

| | article | score | authors |
|---|---|---|---|
| 0 | Toward an understanding of the motivation of open source software developers | 0.388 | [Yunwen Ye, Kouichi Kishida] |
| 1 | Hipikat: recommending pertinent software development artifacts | 0.322 | [Gail C. Murphy, Davor Cubranic] |
| 2 | Version Sensitive Editing: Change History as a Programming Tool | 0.274 | [David L. Atkins] |
| 3 | Which bug should I fix: helping new developers onboard a new project | 0.239 | [Jianguo Wang, Anita Sarma] |
| 4 | Tesseract: Interactive visual exploration of socio-technical relationships in software development | 0.203 | [Patrick Wagstrom, Larry Maccherone, James D. Herbsleb, Anita Sarma] |
| 5 | Role Migration and Advancement Processes in OSSD Projects: A Comparative Case Study | 0.175 | [Chris Jensen, Walt Scacchi] |
| 6 | Does the initial environment impact the future of developers | 0.175 | [Audris Mockus, Minghui Zhou] |
| 7 | Unifying artifacts and activities in a visual tool for distributed software development teams | 0.173 | [Jon Froehlich, Paul Dourish] |
| 8 | A case study of open source software development: the Apache server | 0.11 | [Audris Mockus, James D. Herbsleb, Roy Fielding] |
| 9 | A case study of the evolution of Jun: an object-oriented open-source 3D multimedia library | 0.11 | [A. Takasbima, Kouichi Kishida, Yoshiyuki Nishinaka, Kaoru Hayashi, Atsushi Aoki] |

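Since only the parameters change between the two runs, the comparison can be wrapped in a small helper that executes the same query once per author. This is a sketch: `recommend_for_authors` is a hypothetical name, and it assumes `graph` behaves like the py2neo `Graph` used in this notebook, where `graph.run(query, params)` returns a cursor with a `to_data_frame()` method.

```python
def recommend_for_authors(graph, query, authors, search_term):
    """Run one parameterized Cypher query per author and collect the results.

    `graph` is assumed to expose run(query, params) returning an object
    with to_data_frame(), as the py2neo Graph in this notebook does.
    """
    results = {}
    for author in authors:
        params = {"author": author, "searchTerm": search_term}
        results[author] = graph.run(query, params).to_data_frame()
    return results


# With a live connection and the query defined above, usage would look like:
# frames = recommend_for_authors(
#     graph, query, ["Tao Xie", "Marco Aurélio Gerosa"], "open source"
# )
# frames["Tao Xie"]
```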