
# Citations

Implement a data generator to assign citation relationships among papers.

Plan for the citation code:

Assume that roughly one-third of the papers in our database cite another paper in our database. (The remaining papers would likely cite papers outside our database, but we do not model those relationships.)

1. Create a citation dataframe as a random sample of the papers dataframe (for example, a fraction of 0.3).
2. Add a column for the cited paper.
3. Use a random generator to choose a previously published paper to cite.
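
The plan above could be sketched roughly as follows. This is a minimal illustration on toy data, not the notebook's final implementation: the `generate_citations` helper, the toy titles and years, and the choice to pick uniformly among earlier rows of the sorted frame are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the papers dataframe (hypothetical titles and years).
toy_papers = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"],
    "year": [2014, 2014, 2015, 2016, 2016, 2017, 2018, 2019, 2020, 2021],
})

def generate_citations(papers, frac=0.3, seed=42):
    """Return one (citing, cited) pair per sampled paper (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Sort by year so that a row's position approximates publication order.
    ordered = papers.sort_values("year", kind="stable").reset_index(drop=True)
    # Step 1: sample the papers that will cite something in the database.
    citing = ordered.sample(frac=frac, random_state=seed)
    rows = []
    for pos, paper in citing.iterrows():
        if pos == 0:
            continue  # the earliest paper has nothing earlier to cite
        # Step 3: pick a random row that appears earlier in the sorted order.
        cited = ordered.iloc[rng.integers(pos)]
        # Step 2: record the cited paper alongside the citing one.
        rows.append({
            "citing": paper["title"],
            "cited": cited["title"],
            "citing_year": paper["year"],
            "cited_year": cited["year"],
        })
    return pd.DataFrame(rows, columns=["citing", "cited", "citing_year", "cited_year"])

citations = generate_citations(toy_papers)
```

Because the cited row always comes from an earlier position in the sorted frame, a paper never cites itself and never cites a paper from a later year. Papers from the same year may still cite each other, depending on how ties are ordered.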

In [1]:
import pandas as pd
import numpy as np

A Cypher query was used to export a CSV file of the papers that are already in the graph database.

- All papers were included whether they were published as part of a journal or a conference. 

- The column "volumeEdition" should be interpreted as the title of the journal volume or conference edition where the paper was published. 

- The column "journalConference" should be interpreted as the title of the conference or journal where the paper was published.

Cypher Query:
```
MATCH(author:Author)-[:WROTE {role:"corresponding"}]->(paper:Paper)-[:PUBLISHED_IN]->(volumeEdition)-[]->(journalConference)
RETURN author.name as author, paper.title as title, volumeEdition.title as volumeEdition, volumeEdition.year as year, journalConference.title as journalConference
```

In [2]:
# path where papers.csv is stored
path = "https://raw.githubusercontent.com/KatBCN/SDMLab1/main/papers.csv"

In [3]:
papers = pd.read_csv(path)
papers.head()

| | author | title | volumeEdition | year | journalConference |
|---|---|---|---|---|---|
| 0 | Kun Wang 0005 | Robust Big Data Analytics for Electricity Pric... | IEEE Trans. Big Data,vol.5-1 | 2019 | IEEE Trans. Big Data |
| 1 | Hongjian Wang 0002 | Non-Stationary Model for Crime Rate Inference ... | IEEE Trans. Big Data,vol.5-2 | 2019 | IEEE Trans. Big Data |
| 2 | Binfeng Wang | Noise-Resistant Statistical Traffic Classifica... | IEEE Trans. Big Data,vol.5-4 | 2019 | IEEE Trans. Big Data |
| 3 | Zheng Xu 0001 | Multi-Modal Description of Public Safety Event... | IEEE Trans. Big Data,vol.5-4 | 2019 | IEEE Trans. Big Data |
| 4 | Zheng Yan 0002 | Heterogeneous Data Storage Management with Ded... | IEEE Trans. Big Data,vol.5-3 | 2019 | IEEE Trans. Big Data |


In [4]:
papers.shape

(1520, 5)

In [5]:
papers.describe(include='all')

| | author | title | volumeEdition | year | journalConference |
|---|---|---|---|---|---|
| count | 1520 | 1520 | 1520 | 1520.0 | 1520 |
| unique | 1364 | 1505 | 77 | | 6 |
| top | Chae-Gyun Lim | Comments to Jean-Claude Burgelman's article Po... | BigComp-2018 | | BigComp |
| freq | 8 | 6 | 139 | | 755 |
| mean | | | | 2018.408553 | |
| std | | | | 2.108525 | |
| min | | | | 2014.0 | |
| 25% | | | | 2017.0 | |
| 50% | | | | 2019.0 | |
| 75% | | | | 2020.0 | |


In [6]:
papers.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1520 entries, 0 to 1519
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   author             1520 non-null   object
 1   title              1520 non-null   object
 2   volumeEdition      1520 non-null   object
 3   year               1520 non-null   int64 
 4   journalConference  1520 non-null   object
dtypes: int64(1), object(4)
memory usage: 59.5+ KB


First, we sort the papers by year in ascending order, breaking ties by journal or conference, volume or edition, title, and author. The sorted order serves as an approximation of publication order.

In [10]:
df = papers.sort_values(by=['year','journalConference','volumeEdition','title','author'], ascending=True).copy().reset_index()
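
As a small sanity check on this kind of multi-key sort, the sketch below uses a three-row toy frame (hypothetical titles, not rows from `papers.csv`) to show two things: the years come out in non-decreasing order, and `reset_index()` without `drop=True` keeps the original row labels in a new `index` column.

```python
import pandas as pd

# Toy frame with hypothetical titles; only the column names mirror the notebook.
toy = pd.DataFrame({
    "title": ["B paper", "A paper", "C paper"],
    "year": [2019, 2014, 2019],
    "journalConference": ["BigComp", "BDC", "BDC"],
})

toy_sorted = (
    toy.sort_values(by=["year", "journalConference", "title"], ascending=True)
       .reset_index()  # the old row labels move into a column named "index"
)

# Years are non-decreasing after the sort.
years_in_order = toy_sorted["year"].is_monotonic_increasing
# Original row labels, listed in the new sorted order.
original_positions = toy_sorted["index"].tolist()
```

Here the 2014 paper comes first, and the two 2019 papers are ordered by `journalConference`, so `original_positions` is `[1, 2, 0]`.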

In [11]:
df.head(10)

| | index | author | title | volumeEdition | year | journalConference |
|---|---|---|---|---|---|---|
| 0 | 332 | Brian Schmidt | A Biologically-Inspired Approach to Network Tr... | BDC-2014 | 2014 | BDC |
| 1 | 326 | Vassilis Kolias | A Covering Classification Rule Induction Appro... | BDC-2014 | 2014 | BDC |
| 2 | 322 | Jong Hoon Ahnn | A Practical Approach to Scalable Big Data Comp... | BDC-2014 | 2014 | BDC |
| 3 | 334 | Jianwu Wang 0001 | A Scalable Data Science Workflow Approach for ... | BDC-2014 | 2014 | BDC |
| 4 | 333 | Blesson Varghese | Are Clouds Ready to Accelerate Ad Hoc Financia... | BDC-2014 | 2014 | BDC |
| 5 | 335 | Justin M. Wozniak | Big Data Staging with MPI-IO for Interactive X... | BDC-2014 | 2014 | BDC |
| 6 | 323 | Pablo Fuentes | Characterizing the Communication Demands of th... | BDC-2014 | 2014 | BDC |
| 7 | 329 | Raghavendra Kune | Genetic Algorithm Based Data-Aware Group Sched... | BDC-2014 | 2014 | BDC |
| 8 | 327 | Jared Koontz | GeoLens - Enabling Interactive Visual Analytic... | BDC-2014 | 2014 | BDC |
| 9 | 328 | Eileen Kuehn | Monitoring Data Streams at Process Level in Sc... | BDC-2014 | 2014 | BDC |


In [12]:
df.tail(10)

| | index | author | title | volumeEdition | year | journalConference |
|---|---|---|---|---|---|---|
| 1510 | 927 | Qingchen Zhang | PPHOPCM - Privacy-Preserving High-Order Possib... | IEEE Trans. Big Data,vol.8-1 | 2022 | IEEE Trans. Big Data |
| 1511 | 921 | Hu Xiong | Revocable Identity-Based Access Control for Bi... | IEEE Trans. Big Data,vol.8-1 | 2022 | IEEE Trans. Big Data |
| 1512 | 920 | Sijie Wu | Shadow - Exploiting the Power of Choice for Ef... | IEEE Trans. Big Data,vol.8-1 | 2022 | IEEE Trans. Big Data |
| 1513 | 909 | Hourieh Khalajzadeh | Survey and Analysis of Current End-User Data A... | IEEE Trans. Big Data,vol.8-1 | 2022 | IEEE Trans. Big Data |
| 1514 | 907 | Chengqiang Huang | Time Series Anomaly Detection for Trustworthy ... | IEEE Trans. Big Data,vol.8-1 | 2022 | IEEE Trans. Big Data |
| 1515 | 917 | Ziyi Su | Toward Architectural and Protocol-Level Founda... | IEEE Trans. Big Data,vol.8-1 | 2022 | IEEE Trans. Big Data |
| 1516 | 912 | Chuishi Meng | Towards the Inference of Travel Purpose with H... | IEEE Trans. Big Data,vol.8-1 | 2022 | IEEE Trans. Big Data |
| 1517 | 922 | Qichao Xu | Trust Based Incentive Scheme to Allocate Big D... | IEEE Trans. Big Data,vol.8-1 | 2022 | IEEE Trans. Big Data |
| 1518 | 916 | Jian Shen 0001 | Trustworthiness Evaluation-Based Routing Proto... | IEEE Trans. Big Data,vol.8-1 | 2022 | IEEE Trans. Big Data |
| 1519 | 918 | Bing Tang | WukaStore - Scalable, Configurable and Reliabl... | IEEE Trans. Big Data,vol.8-1 | 2022 | IEEE Trans. Big Data |



In [13]:
np.random.seed(42)
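
Fixing the seed makes the random draws repeatable across runs. The sketch below illustrates this with NumPy's global generator; the upper bound of 1520 simply mirrors the number of rows in `papers`, and the variable names are made up for the example.

```python
import numpy as np

# Draw five row positions, reseed with the same value, and draw again.
np.random.seed(42)
first_draw = np.random.randint(0, 1520, size=5)

np.random.seed(42)
second_draw = np.random.randint(0, 1520, size=5)

# Same seed, same sequence of draws.
same_draws = np.array_equal(first_draw, second_draw)
```

An alternative that avoids global state is a local generator created with `np.random.default_rng(42)`, which can be passed explicitly to the code that needs it.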