# Neo4j Assignment
Rafaila Galanopoulou 8160018 

Big Data Management Systems Course 2020 

Professor: Damianos Chatziantoniou

## Download the Dataset

1. Download the data from this [source](https://snap.stanford.edu/data/soc-Pokec.html) 
2. To unzip the files, run the following command in a terminal:

`gzip -d *.gz` 

3. To select the id, gender and age columns, which are the 1st, 4th and 8th columns respectively, run the following command:
 
`cut -f1,4,8 soc-pokec-profiles.txt > profiles.txt`
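For readers without `cut`, the same column selection can be sketched in Python with the standard library. This is a minimal illustration, not part of the original workflow: `select_columns` is a hypothetical helper, and the two sample rows are made up rather than taken from the Pokec file.

```python
import csv
import io

def select_columns(lines, keep=(0, 3, 7)):
    """Yield only the fields at the given 0-based positions of tab-separated lines."""
    # Positions 0, 3 and 7 correspond to `cut -f1,4,8` (cut counts from 1).
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield [row[i] for i in keep]

# Two made-up rows with eight tab-separated fields each (not real Pokec data).
sample = io.StringIO(
    "1\t1\t14\t1\tregion a\t2012-05-25\t2005-04-03\t26\n"
    "2\t1\t62\t0\tregion b\t2012-05-25\t2007-11-30\t0\n"
)
rows = list(select_columns(sample))
print(rows)  # [['1', '1', '26'], ['2', '0', '0']]
```

To apply it to the real file, an open file handle for `soc-pokec-profiles.txt` can be passed in place of `sample`.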



## Data Transformation

```python
import pandas as pd

filename1 = "profiles.txt"
filename2 = "soc-pokec-relationships.txt"

# profiles.txt holds the three columns extracted with cut: id, gender, age
df = pd.read_csv(filename1, sep='\t', header=None)
df.columns = ['user_id', 'gender', 'age']
df
```

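| | user_id | gender | age |
| ---: | ---: | ---: | ---: |
| 0 | 1 | 1.0 | 26.0 |
| 1 | 2 | 0.0 | 0.0 |
| 2 | 16 | 1.0 | 23.0 |
| 3 | 3 | 1.0 | 29.0 |
| 4 | 4 | 0.0 | 26.0 |
| ... | ... | ... | ... |
| 1632798 | 1632799 | 0.0 | 23.0 |
| 1632799 | 1632800 | 1.0 | 33.0 |
| 1632800 | 1632801 | 1.0 | 0.0 |
| 1632801 | 1632802 | 1.0 | 19.0 |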


```python
# Each row of the relationships file is a (user, friend) pair
df2 = pd.read_csv(filename2, sep='\t', header=None)
df2.columns = ['user', 'frienduser']
df2
```

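| | user | frienduser |
| ---: | ---: | ---: |
| 0 | 1 | 13 |
| 1 | 1 | 11 |
| 2 | 1 | 6 |
| 3 | 1 | 3 |
| 4 | 1 | 4 |
| ... | ... | ... |
| 30622559 | 1632798 | 1632578 |
| 30622560 | 1632798 | 865841 |
| 30622561 | 1632802 | 1632637 |
| 30622562 | 1632802 | 1632736 |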


```python
# Write comma-separated files with a header row and without the pandas index
df.to_csv('profiles.csv', index=False)
df2.to_csv('relationships.csv', index=False)
```

## Load data in MySQL

```
mysql> CREATE TABLE profiles (id INT NOT NULL, gender INT NOT NULL, AGE INT NOT NULL, PRIMARY KEY (id));
Query OK, 0 rows affected (0.11 sec)

mysql> LOAD DATA LOCAL INFILE '~/Documents/8th/Big-Data-Management-Systems-Assignments/Neo4jAssignment/profiles.csv' INTO TABLE profiles FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
Query OK, 1632804 rows affected, 329 warnings (12.55 sec)
Records: 1632804  Deleted: 0  Skipped: 0  Warnings: 329

mysql> CREATE TABLE friends (userid  INT NOT NULL, friendsid INT NOT NULL, FOREIGN KEY (userid) REFERENCES profiles (id));
Query OK, 0 rows affected (0.07 sec)

mysql> LOAD DATA LOCAL INFILE '~/Documents/8th/Big-Data-Management-Systems-Assignments/Neo4jAssignment/relationships.csv' INTO TABLE friends FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';
Query OK, 30622565 rows affected, 2 warnings (9 min 35.31 sec)
Records: 30622565  Deleted: 0  Skipped: 0  Warnings: 2

mysql> SELECT userid, COUNT(*) FROM friends GROUP BY userid;

1432694 rows in set (14.45 sec)

mysql> SELECT a.userid, a.friendsid, COUNT(b.friendsid) AS numberoff FROM friends AS a LEFT JOIN friends AS b ON a.friendsid = b.userid GROUP BY a.userid, a.friendsid;


mysql> SELECT id, COUNT(CASE WHEN profiles.age>30 THEN friendsid END) AS friends30 FROM friends LEFT JOIN profiles ON profiles.id = friends.userid GROUP BY id;

1432694 rows in set (2 min 46.77 sec)


mysql> SELECT userid, aprof.gender AS gender, COUNT(friends.friendsid) FROM friends LEFT JOIN profiles as aprof ON aprof.id=friends.userid LEFT JOIN profiles as bprof ON bprof.id=friends.userid GROUP BY userid, aprof.gender, bprof.gender HAVING aprof.gender=1;

715979 rows in set (3 min 40.82 sec)
```
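The queries can be checked on a toy dataset before running them on the full tables. The sketch below uses Python's built-in `sqlite3` module instead of MySQL (an assumption made so that it runs without a server), and the five friendships and four profiles are made up. The first query matches the transcript. The second is a variant of the third query that joins `profiles` on `friendsid`, assuming the intent is to count each user's friends older than 30; the query in the transcript joins on `userid`, so it tests the user's own age.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE profiles (id INT NOT NULL, gender INT NOT NULL, age INT NOT NULL, PRIMARY KEY (id));
    CREATE TABLE friends (userid INT NOT NULL, friendsid INT NOT NULL);
""")
# Made-up toy data, not taken from the Pokec dataset.
conn.executemany("INSERT INTO profiles VALUES (?, ?, ?)",
                 [(1, 1, 26), (2, 0, 35), (3, 1, 40), (4, 0, 20)])
conn.executemany("INSERT INTO friends VALUES (?, ?)",
                 [(1, 2), (1, 3), (1, 4), (2, 1), (3, 4)])

# Query 1 from the transcript: number of friends per user.
friend_counts = conn.execute(
    "SELECT userid, COUNT(*) FROM friends GROUP BY userid ORDER BY userid"
).fetchall()

# Variant of query 3 (assumed intent): friends older than 30 per user,
# obtained by joining each friendship to the friend's profile.
friends_over_30 = conn.execute("""
    SELECT f.userid, COUNT(CASE WHEN p.age > 30 THEN 1 END) AS friends30
    FROM friends AS f
    JOIN profiles AS p ON p.id = f.friendsid
    GROUP BY f.userid
    ORDER BY f.userid
""").fetchall()

print(friend_counts)    # [(1, 3), (2, 1), (3, 1)]
print(friends_over_30)  # [(1, 2), (2, 0), (3, 0)]
```

Timings from SQLite on toy data say nothing about MySQL performance; the sketch only helps confirm what each query counts.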



## Load data in Neo4j

First, we create the nodes by running the following command:

```
:auto USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///profiles.csv" AS row
// profiles.csv is comma-separated (the default) with header user_id,gender,age
CREATE (:Profile {userid: toInteger(row.user_id), age: toInteger(row.age), gender: row.gender});
```

Then, we create the relationships between the nodes that are friends by running the following command:

```
:auto USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///relationships.csv" AS row
// relationships.csv is comma-separated (the default) with header user,frienduser
MATCH (p1:Profile {userid: toInteger(row.user) })
MATCH (p2:Profile {userid: toInteger(row.frienduser) })
CREATE (p1)-[:isfriendwith]->(p2);
```
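With `LOAD CSV WITH HEADERS`, a column name that does not appear in the header row evaluates to null, so a mismatch (for example `friend` instead of `frienduser`) tends to produce no matches rather than an error. A small pre-flight check can catch this before a long import. The sketch below is a hypothetical helper written with the Python standard library; it does not talk to Neo4j.

```python
import csv
import io

def missing_columns(csv_text, required):
    """Return the required column names that are absent from the CSV header row."""
    header = next(csv.reader(io.StringIO(csv_text)))
    return [name for name in required if name not in header]

# First lines of relationships.csv as written by df2.to_csv(..., index=False).
relationships_head = "user,frienduser\n1,13\n1,11\n"

ok = missing_columns(relationships_head, ["user", "frienduser"])
bad = missing_columns(relationships_head, ["user", "friend"])
print(ok)   # []
print(bad)  # ['friend']
```

For the real file, reading only the first line is enough, so the check stays cheap even for the 30-million-row relationships file.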

## Performance Comments

|Description | MySQL | Neo4j |
| :--- | --- | --- |
| Load profiles' data| 12.55 sec | 21204 ms|
| Load relationships' data| 9 min 35.31 sec| |
| Run 1st query | 14.45 sec | |
| Run 2nd query|  | |
| Run 3rd query| 2 min 46.77 sec | |
| Run 4th query| 3 min 40.82 sec | |
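The table mixes units (`sec`, `min ... sec` and `ms`), which makes the two systems harder to compare at a glance. A minimal sketch for normalising the reported values to seconds follows; `to_seconds` is a hypothetical helper and only the timings already listed in the table are used.

```python
import re

# Seconds per unit for the three units that appear in the table.
_UNIT_SECONDS = {"min": 60.0, "sec": 1.0, "ms": 0.001}

def to_seconds(text):
    """Convert strings such as '9 min 35.31 sec' or '21204 ms' to seconds."""
    parts = re.findall(r"(\d+(?:\.\d+)?)\s*(min|sec|ms)", text)
    if not parts:
        raise ValueError(f"no duration found in {text!r}")
    return sum(float(value) * _UNIT_SECONDS[unit] for value, unit in parts)

mysql_profiles = to_seconds("12.55 sec")
neo4j_profiles = to_seconds("21204 ms")
mysql_relationships = to_seconds("9 min 35.31 sec")

print(f"MySQL profiles load:      {mysql_profiles:.2f} s")
print(f"Neo4j profiles load:      {neo4j_profiles:.2f} s")
print(f"MySQL relationships load: {mysql_relationships:.2f} s")
print(f"Neo4j / MySQL (profiles): {neo4j_profiles / mysql_profiles:.2f}")
```

On the one row where both systems have a measurement, loading the profiles took roughly 1.7 times as long in Neo4j as in MySQL.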


