In [1]:
!pip install py2neo pandas matplotlib



In [2]:
from collections import Counter

import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
from py2neo import Graph

plt.style.use('fivethirtyeight')
pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [3]:
graph = Graph("bolt://localhost:7687", auth=("neo4j", "neo"))

In [4]:
graph.run("""
LOAD CSV WITH HEADERS FROM "file:///responses.csv" as row
CREATE (p:Person)
SET p += row
""").stats()

constraints_added: 0
constraints_removed: 0
contained_updates: True
indexes_added: 0
indexes_removed: 0
labels_added: 1010
labels_removed: 0
nodes_created: 1010
nodes_deleted: 0
properties_set: 151500
relationships_created: 0
relationships_deleted: 0

Most of the answers range from one to five, where five means “Strongly agree” and one means “Strongly disagree”. They are stored as strings in the CSV file, so we first have to convert them to integers.

In [6]:
graph.run("""
MATCH (p:Person)
UNWIND keys(p) as key
WITH p,key where not key in ['Gender',
                'Left - right handed',
                'Lying','Alcohol',
                'Education','Smoking',
                'House - block of flats',
                'Village - town','Punctuality',
                'Internet usage']
CALL apoc.create.setProperty(p, key, toInteger(p[key])) YIELD node
RETURN distinct 'done';
""").stats()

constraints_added: 0
constraints_removed: 0
contained_updates: False
indexes_added: 0
indexes_removed: 0
labels_added: 0
labels_removed: 0
nodes_created: 0
nodes_deleted: 0
properties_set: 0
relationships_created: 0
relationships_deleted: 0
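
To sanity-check the conversion outside the database, the same idea can be sketched in plain Python. This is only an illustration: the `to_int_or_none` and `convert_row` helpers and the sample row are made up, and the sketch assumes that a value which cannot be parsed should become missing rather than raise an error.

```python
# Hypothetical helpers, not part of the notebook: convert string answers
# to integers while leaving the categorical columns untouched.
CATEGORICAL = {"Gender", "Alcohol", "Smoking"}  # subset, for illustration


def to_int_or_none(value):
    """Return an int for numeric strings, otherwise None."""
    try:
        return int(float(value))
    except (TypeError, ValueError, OverflowError):
        return None


def convert_row(row):
    """Convert every non-categorical answer in a row to an integer."""
    return {
        key: value if key in CATEGORICAL else to_int_or_none(value)
        for key, value in row.items()
    }


# Made-up row in the shape LOAD CSV produces (every value is a string).
sample = {"Music": "5", "Movies": "4.0", "Dreams": "", "Gender": "female"}
converted = convert_row(sample)
print(converted)
```

Empty answers end up as `None` here, which is why the later queries wrap property lookups in `coalesce`.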

Some of the answers are categorical. An example is the alcohol question, where the possible answers are “never”, “social drinker” and “drink a lot”.
As we would like to convert some of them to vectors, let’s examine all the possible answers they have.

In [7]:
graph.run("""
MATCH (p:Person)
UNWIND ['Gender',
        'Left - right handed',
        'Lying','Alcohol',
        'Education','Smoking',
        'House - block of flats',
        'Village - town','Punctuality',
        'Internet usage'] as property
RETURN property,collect(distinct(p[property])) as unique_values;
""").to_data_frame()

,property,unique_values
0,Gender,"[female, male]"
1,Left - right handed,"[right handed, left handed]"
2,Lying,"[never, sometimes, only to avoid hurting someo..."
3,Alcohol,"[drink a lot, social drinker, never]"
4,Education,"[college/bachelor degree, secondary school, pr..."
5,Smoking,"[never smoked, tried smoking, former smoker, c..."
6,House - block of flats,"[block of flats, house/bungalow]"
7,Village - town,"[village, city]"
8,Punctuality,"[i am always on time, i am often early, i am o..."
9,Internet usage,"[few hours a day, most of the day, less than a..."


Let’s vectorize the gender, internet and alcohol answers. We will scale them between one and five to match the range of the integer answers.

Gender encoding

In [8]:
graph.run("""
MATCH (p:Person)
WITH p, CASE p['Gender'] WHEN 'female' THEN 1
                         WHEN 'male' THEN 5
                         ELSE 3
                         END as gender
SET p.Gender_vec = gender;
""").stats()

constraints_added: 0
constraints_removed: 0
contained_updates: True
indexes_added: 0
indexes_removed: 0
labels_added: 0
labels_removed: 0
nodes_created: 0
nodes_deleted: 0
properties_set: 1010
relationships_created: 0
relationships_deleted: 0

Internet encoding

constraints_added: 0
constraints_removed: 0
contained_updates: True
indexes_added: 0
indexes_removed: 0
labels_added: 0
labels_removed: 0
nodes_created: 0
nodes_deleted: 0
properties_set: 1010
relationships_created: 0
relationships_deleted: 0

Alcohol encoding

In [10]:
graph.run("""
MATCH (p:Person)
WITH p, CASE p['Alcohol'] WHEN 'never' THEN 1
                          WHEN 'social drinker' THEN 3
                          WHEN 'drink a lot' THEN 5
                          ELSE 3 END as alcohol
SET p.Alcohol_vec = alcohol;
""").stats()

constraints_added: 0
constraints_removed: 0
contained_updates: True
indexes_added: 0
indexes_removed: 0
labels_added: 0
labels_removed: 0
nodes_created: 0
nodes_deleted: 0
properties_set: 1010
relationships_created: 0
relationships_deleted: 0
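
The CASE expressions above are a simple ordinal encoding with a neutral fallback. A minimal Python sketch of the same mapping, assuming the three alcohol answers listed earlier (the `encode` helper and the sample answers are hypothetical):

```python
# Illustrative only: the mapping mirrors the CASE expression above,
# with 3 as the fallback for missing or unexpected answers.
ALCOHOL_SCALE = {"never": 1, "social drinker": 3, "drink a lot": 5}


def encode(answer, scale, default=3):
    """Map a categorical answer onto the one-to-five scale."""
    return scale.get(answer, default)


answers = ["never", "drink a lot", None, "social drinker"]  # made-up sample
encoded = [encode(a, ALCOHOL_SCALE) for a in answers]
print(encoded)
```

Sending missing answers to the midpoint keeps them from pulling a person towards either end of the scale.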

Consider a variable in our dataset where all the observations have the same value, say 1. Would using this variable improve the model we will build? The answer is no, because a variable with zero variance carries no information that distinguishes one observation from another.
We will use the standard deviation as our metric, which is just the square root of the variance.

In [11]:
graph.run("""
MATCH (p:Person)
WITH p LIMIT 1
WITH filter(x in keys(p) where not x in ['Gender','Left - right handed','Lying','Alcohol','Education','Smoking','House - block of flats','Village - town','Punctuality','Internet usage']) as all_keys
UNWIND all_keys as key
MATCH (p:Person)
RETURN key,avg(p[key]) as average,stdev(p[key]) as std 
ORDER BY std ASC LIMIT 10;
""").to_data_frame()

,average,key,std
0,3.292,Personality,0.643
1,4.732,Music,0.664
2,3.297,Dreams,0.683
3,4.614,Movies,0.695
4,4.558,Fun with friends,0.737
5,4.495,Comedy,0.78
6,3.839,Internet_vec,0.821
7,3.706,Happiness in life,0.824
8,3.328,Slow songs or fast songs,0.834
9,3.266,Parents' advice,0.866


We can observe that everybody likes to listen to music, watch movies and have fun with friends.
Due to the low variance, we will eliminate the following questions from our further analysis:
“Personality”
“Music”
“Dreams”
“Movies”
“Fun with friends”
“Comedy”
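
The low variance filter can be sketched in a few lines of Python. The answers below are made up, and the 0.8 threshold is an assumption for illustration, since no explicit cutoff is stated above. `statistics.stdev` computes the sample standard deviation.

```python
from statistics import mean, stdev

# Made-up answers for three hypothetical questions.
answers = {
    "Constant": [1, 1, 1, 1, 1],
    "Music": [5, 5, 4, 5, 5],
    "Politics": [1, 5, 2, 4, 3],
}


def low_variance(columns, threshold):
    """Return names of columns whose sample standard deviation is below threshold."""
    return sorted(
        name for name, values in columns.items() if stdev(values) < threshold
    )


for name, values in answers.items():
    print(name, round(mean(values), 3), round(stdev(values), 3))

dropped = low_variance(answers, threshold=0.8)  # threshold is an assumption
print(dropped)
```

A question that almost everybody answers the same way, like the hypothetical “Music” column here, cannot help separate people into groups.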

High correlation filter
High correlation between two variables means they have similar trends and are likely to carry similar information. This can bring down the performance of some models drastically (linear and logistic regression models, for instance).
We will use the Pearson correlation coefficient for this task. Pearson correlation adjusts for different location and scale of features, so any kind of linear scaling (normalization) is unnecessary.
Let’s find the top 10 correlations for the gender feature.

In [12]:
graph.run("""
MATCH (p:Person)
WITH p LIMIT 1
WITH filter(x in keys(p) where not x in ['Gender','Left - right handed','Lying','Alcohol','Education','Smoking','House - block of flats','Village - town','Punctuality','Internet usage','Personality','Music','Dreams','Movies','Fun with friends','Comedy']) as all_keys
MATCH (p1:Person)
UNWIND ['Gender_vec'] as key_1
UNWIND all_keys as key_2
WITH key_1,key_2, collect(coalesce(p1[key_1],0)) as vector_1,collect(coalesce(p1[key_2] ,0)) as vector_2
WHERE key_1 <> key_2
RETURN key_1,key_2, algo.similarity.pearson(vector_1, vector_2) as pearson
ORDER BY pearson DESC limit 10;
""").to_data_frame()

,key_1,key_2,pearson
0,Gender_vec,Weight,0.542
1,Gender_vec,PC,0.46
2,Gender_vec,Cars,0.438
3,Gender_vec,Action,0.409
4,Gender_vec,War,0.407
5,Gender_vec,Science and technology,0.358
6,Gender_vec,Western,0.348
7,Gender_vec,Sci-fi,0.309
8,Gender_vec,Physics,0.305
9,Gender_vec,Height,0.281


The feature most correlated with gender is weight, which makes sense. The list also includes some other stereotypical gender differences, like the preference for cars, action, and PC.
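
The earlier remark that Pearson correlation adjusts for location and scale is easy to check by hand. Below is a minimal sketch with made-up data and a hand-written `pearson` helper (not the library procedure): rescaling one variable linearly leaves the coefficient unchanged.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up data: weight in kg and an answer on the one-to-five scale.
weight = [55, 62, 70, 81, 90]
cars = [1, 2, 2, 4, 5]

r_raw = pearson(weight, cars)
# Rescale weight linearly onto a 1..5 range; the coefficient should not change.
weight_scaled = [1 + 4 * (w - 55) / (90 - 55) for w in weight]
r_scaled = pearson(weight_scaled, cars)
print(round(r_raw, 3), round(r_scaled, 3))
```

This is why weight and height can be correlated against the one-to-five answers without normalizing them first.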
Let’s now calculate the Pearson correlation between all the features.

In [13]:
graph.run("""
MATCH (p:Person)
WITH p LIMIT 1
WITH filter(x in keys(p) where not x in ['Gender','Left - right handed','Lying','Alcohol','Education','Smoking','House - block of flats','Village - town','Punctuality','Internet usage','Personality','Music','Dreams','Movies','Fun with friends','Comedy']) as all_keys
MATCH (p1:Person)
UNWIND all_keys as key_1
UNWIND all_keys as key_2
WITH key_1,key_2,p1
WHERE key_1 > key_2
WITH key_1,key_2, collect(coalesce(p1[key_1],0)) as vector_1,collect(coalesce(p1[key_2],0)) as vector_2
RETURN key_1,key_2, algo.similarity.pearson(vector_1, vector_2) as pearson
ORDER BY pearson DESC limit 10
""").to_data_frame()

,key_1,key_2,pearson
0,Medicine,Biology,0.675
1,Chemistry,Biology,0.658
2,Fantasy/Fairy tales,Animated,0.651
3,Shopping centres,Shopping,0.644
4,Medicine,Chemistry,0.612
5,Physics,Mathematics,0.587
6,Opera,Classical music,0.581
7,Snakes,Rats,0.568
8,Weight,Gender_vec,0.542
9,Punk,Metal or Hardrock,0.542


The results show nothing surprising. The only one I found interesting was the correlation between snakes and rats.
We will exclude the following questions from further analysis due to high correlation:
“Medicine”
“Chemistry”
“Shopping centres”
“Physics”
“Opera”
“Animated”

Pearson similarity algorithm
Now that we have completed the preprocessing step, we will infer a similarity network between nodes based on the Pearson correlation of the features (answers) of nodes that we haven’t excluded.
In this step, all the features we use need to be on the same one-to-five scale, because we will fit all the features of a node into a single vector and calculate correlations between those vectors.
Min-max normalization
Three of the features are not normalized between one and five. These are
“Height”
“Number of siblings”
“Weight”
We normalize the height property with min-max scaling (the formula below maps the smallest height to zero and the largest to five). We won’t use the other two.

In [14]:
graph.run("""
MATCH (p:Person)
//get the max and min values
WITH max(p.`Height`) as max,min(p.`Height`) as min
MATCH (p1:Person)
//min-max normalize; this maps the smallest height to 0 and the largest to 5
SET p1.Height_nor = 5.0 *(p1.`Height` - min) / (max - min);
""").stats()

constraints_added: 0
constraints_removed: 0
contained_updates: True
indexes_added: 0
indexes_removed: 0
labels_added: 0
labels_removed: 0
nodes_created: 0
nodes_deleted: 0
properties_set: 990
relationships_created: 0
relationships_deleted: 0
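
For reference, here is min-max scaling as a small Python sketch with made-up heights. The `min_max` helper is hypothetical. Calling it with a lower bound of 0 reproduces the formula in the query above, while a lower bound of 1 gives a strict one-to-five range.

```python
def min_max(values, low, high):
    """Linearly rescale values so the minimum maps to low and the maximum to high."""
    lo, hi = min(values), max(values)
    return [low + (high - low) * (v - lo) / (hi - lo) for v in values]


heights = [150, 165, 180, 200]  # made-up heights in cm

as_written = min_max(heights, 0.0, 5.0)   # same formula as the Cypher above
one_to_five = min_max(heights, 1.0, 5.0)  # strict one-to-five alternative
print(as_written)
print(one_to_five)
```

Both variants are linear rescalings, so they differ only in where the shortest person lands on the scale.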

Similarity network
We grab all the features and infer the similarity network. We always want to use the similarityCutoff parameter, and optionally the topK parameter, to avoid ending up with a complete graph in which every node is connected to every other node. Here we use similarityCutoff: 0.75 and topK: 5.

In [15]:
graph.run("""
MATCH (p:Person)
WITH p LIMIT 1
WITH filter(x in keys(p) where not x in ['Gender','Left - right handed','Lying','Alcohol','Education','Smoking','House - block of flats','Village - town','Punctuality','Internet usage','Personality','Music','Dreams','Movies','Fun with friends','Comedy','Medicine','Chemistry','Shopping centres','Physics','Opera','Animated','Height','Weight','Number of siblings']) as all_keys
MATCH (p1:Person)
UNWIND all_keys as key
WITH {item:id(p1), weights: collect(coalesce(p1[key],3))} as personData
WITH collect(personData) as data
CALL algo.similarity.pearson(data, {similarityCutoff: 0.75,topK:5,write:true})
YIELD nodes, similarityPairs
RETURN nodes, similarityPairs
""").to_data_frame()

,nodes,similarityPairs
0,1010,4254
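
To see what the two parameters do, here is a small pure-Python sketch of the same idea, not the library's implementation. The people and their answers are made up, `similarity_edges` is a hypothetical helper, and keeping pairs exactly at the cutoff is an assumption about boundary behaviour.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; assumes neither vector is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def similarity_edges(vectors, cutoff, top_k):
    """For each item keep at most top_k neighbours with similarity >= cutoff."""
    edges = []
    for a, vec_a in vectors.items():
        scored = [
            (pearson(vec_a, vec_b), b)
            for b, vec_b in vectors.items()
            if b != a
        ]
        kept = sorted((pair for pair in scored if pair[0] >= cutoff), reverse=True)
        edges.extend((a, b, round(score, 3)) for score, b in kept[:top_k])
    return edges


# Made-up answer vectors for four hypothetical people.
people = {
    "alice": [5, 4, 1, 2, 5],
    "bob": [4, 5, 1, 1, 4],
    "carol": [1, 2, 5, 5, 1],
    "dave": [2, 1, 4, 5, 2],
}

edges = similarity_edges(people, cutoff=0.75, top_k=1)
print(edges)
```

Without the cutoff, every pair of people would get a relationship, and the community detection step would have no structure to work with.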


Community detection
Now that we have inferred a similarity network in our graph, we will try to find communities of similar persons with the help of the Louvain algorithm.

In [16]:
graph.run("""
CALL algo.louvain('Person','SIMILAR')
YIELD nodes,communityCount
""").to_data_frame()

,communityCount,nodes
0,104,1010
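
Louvain works by greedily maximizing modularity, a score that rewards partitions with many edges inside communities and few between them. The sketch below only computes that score for a made-up toy graph. It is not the Louvain algorithm itself, and the `modularity` helper is hypothetical.

```python
def modularity(edges, community):
    """Modularity Q of an undirected, unweighted graph for a given partition."""
    m = len(edges)
    internal = {}  # number of edges with both ends in the same community
    degree = {}    # summed node degrees per community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (degree[c] / (2 * m)) ** 2 for c in degree
    )


# Two triangles joined by a single edge (made-up toy graph).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
merged = {node: "a" for node in range(6)}

q_split = modularity(edges, split)
q_merged = modularity(edges, merged)
print(round(q_split, 3), round(q_merged, 3))
```

Splitting the toy graph into its two triangles scores higher than lumping everything together, which is the kind of partition Louvain searches for.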


apoc.nodes.group
For a quick overview of community detection results in Neo4j Browser, we can use apoc.nodes.group. We define the labels we want to include and group by a certain property. In the config part, we define which aggregations we want to perform and get returned in the visualization.

In [18]:
graph.run("""
CALL apoc.nodes.group(['Person'],['community'], 
[{`*`:'count', Age:['avg','std'],Alcohol_vec:['avg']}, {`*`:'count'} ])
YIELD nodes, relationships
UNWIND nodes as node 
UNWIND relationships as rel
RETURN node, rel;
""").stats()

constraints_added: 0
constraints_removed: 0
contained_updates: False
indexes_added: 0
indexes_removed: 0
labels_added: 0
labels_removed: 0
nodes_created: 0
nodes_deleted: 0
properties_set: 0
relationships_created: 0
relationships_deleted: 0

Community preferences
To get to know our communities better, we will examine their top three and bottom three preferences by average answer.

In [19]:
graph.run("""
MATCH (p:Person)
WITH p LIMIT 1
WITH filter(x in keys(p) where not x in ['Gender','Left - right handed','Lying','Alcohol','Education','Smoking','House - block of flats','Village - town','Punctuality','Internet usage','Personality','Music','Dreams','Movies','Fun with friends','Height','Number of siblings','Weight','Medicine', 'Chemistry', 'Shopping centres', 'Physics', 'Opera','Age','community','Comedy','Gender_vec','Internet','Height_nor']) as all_keys
MATCH (p1:Person)
UNWIND all_keys as key
WITH p1.community as community,
     count(*) as size,
     SUM(CASE WHEN p1.Gender = 'male' THEN 1 ELSE 0 END) as males,
     key,
     avg(p1[key]) as average,
     stdev(p1[key]) as std
ORDER BY average DESC
WITH community,
     size,
     toFloat(males) / size as male_percentage,
     collect(key) as all_avg
ORDER BY size DESC limit 10
RETURN community,size,male_percentage, 
       all_avg[..3] as top_3,
       all_avg[-3..] as bottom_3;
""").to_data_frame()

,bottom_3,community,male_percentage,size,top_3
0,"[Gardening, Storm, Writing]",4,0.921,229,"[Action, Cheating in school, Judgment calls]"
1,"[Metal or Hardrock, Writing, Western]",5,0.004,228,"[Empathy, Romantic, Compassion to animals]"
2,"[Western, Fake, Hypochondria]",0,0.005,190,"[Fantasy/Fairy tales, Compassion to animals, E..."
3,"[Gardening, Darkness, Storm]",2,0.736,159,"[Keeping promises, Countryside, outdoors, Docu..."
4,"[Rats, Storm, Celebrities]",1,0.553,103,"[Rock, Keeping promises, Compassion to animals]"
5,"[Heights, Western, Storm]",75,0.0,2,"[Reliability, Reading, Countryside, outdoors]"
6,"[Getting up, Spending on gadgets, Western]",44,0.0,2,"[Politics, Reliability, Romantic]"
7,"[Western, Reggae, Ska, Storm]",40,0.0,1,"[Countryside, outdoors, Internet_vec, Reliabil..."
8,"[Alternative, Cheating in school, Western]",30,0.0,1,"[Compassion to animals, Eating to survive, Rel..."
9,"[Western, History, Folk]",15,0.0,1,"[Dangerous dogs, Reliability, Reading]"
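
The aggregation in the query above boils down to grouping by community, averaging every feature and sorting. A minimal Python sketch with made-up people and only four features (the `preferences` helper is hypothetical):

```python
from collections import defaultdict
from statistics import mean

# Made-up people; only the shape of the data matters here.
people = [
    {"community": 4, "Action": 5, "Writing": 1, "Storm": 1, "Rock": 4},
    {"community": 4, "Action": 4, "Writing": 2, "Storm": 1, "Rock": 3},
    {"community": 5, "Action": 2, "Writing": 3, "Storm": 4, "Rock": 5},
]


def preferences(rows, features, n):
    """Rank features by average answer within each community."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["community"]].append(row)
    result = {}
    for community, members in groups.items():
        ranked = sorted(
            features,
            key=lambda f: mean(member[f] for member in members),
            reverse=True,
        )
        result[community] = {
            "size": len(members),
            "top": ranked[:n],
            "bottom": ranked[-n:],
        }
    return result


summary = preferences(people, ["Action", "Writing", "Storm", "Rock"], n=2)
print(summary)
```

As in the query, the bottom list is simply the tail of the ranking, so it is ordered from higher to lower average.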


Results are quite interesting. Just looking at the male percentage, it is safe to say that the largest communities are split mostly by gender.
The biggest community has 229 members, about 92% of them male. They agree most with “Action”, “Cheating in school” and “Judgment calls”, and least with “Gardening”, “Storm” and “Writing”. The second biggest community, 228 members and almost entirely female, agrees most with “Empathy”, “Romantic” and “Compassion to animals”, and least with “Metal or Hardrock”, “Writing” and “Western”. Makes sense as the survey was filled out by students from Slovakia.

Gephi visualization
Let’s finish off with a nice visualization of our communities in Gephi. You need to have the streaming plugin enabled in Gephi and then we can export the graph from Neo4j using the APOC procedure apoc.gephi.add.

In [20]:
graph.run("""
MATCH path = (:Person)-[:SIMILAR]->(:Person)
CALL apoc.gephi.add(null,'workspace1',path,'weight',['community']) yield nodes
return distinct 'done'
""").stats()

Failed to write data to connection ('localhost', 7687) (Address(host='127.0.0.1', port=7687)); ("10054; 'An existing connection was forcibly closed by the remote host'; None; 10054; None")


constraints_added: 0
constraints_removed: 0
contained_updates: False
indexes_added: 0
indexes_removed: 0
labels_added: 0
labels_removed: 0
nodes_created: 0
nodes_deleted: 0
properties_set: 0
relationships_created: 0
relationships_deleted: 0