# Dataset Pre-processing

### Source
https://snap.stanford.edu/data/cit-HepTh.html

- The dataset contains an edge list, temporal data for each node, and text metadata for each node
- The edge list and the temporal data are each saved as a single text file, with one line per edge or per node respectively
- The text metadata is saved as one text file in .abs format per node

## Cleaning required
- Text extraction (including reading the .abs file format). Open question: write one text file per year?
- Embedding creation from the extracted text

### Graph Creation
- Split the edge list into a Python list of tuples (source node, target node), which was then used to create a graph object with the NetworkX library

### Text Extraction and Embedding Creation
- Each paper has a separate text file of varying length containing descriptive data (title, number of pages of comments, author email, journal name, date/time of publication, occasionally subject)
- Read the data line by line and built a string for each paper containing just the abstract
- Wrote the abstracts to a single text file, one paper per line, each prefixed by its node ID

### Why dataset is sufficient

### Data statistics
- Number of edges: 352808
- Number of valid edges: 352807
- Number of nodes: 27770 (documents)
- All files are available from the Stanford SNAP website linked under Source

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
project_dir = 'drive/MyDrive/GaTechHw/CSE6240/Project/citationDataset/'

In [3]:
node_fp = 'Cit-HepTh.txt'

In [4]:
test_abs = '9201001.abs'

In [5]:
import io
import os
import pandas as pd

## Text File Processing

Use the function below to extract text from a single .abs file.

In [6]:
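def extract_text(file):
    """Return the abstract from the contents of a single .abs file.

    Lines consisting only of two backslashes act as separators; the abstract
    is the text that follows the second separator.
    """
    sep_count = 0
    text = ''
    for line in file.split('\n'):
        if line == '\\\\':
            sep_count += 1
            continue
        # Keep only the lines after the second separator.
        if sep_count == 2:
            text += line + ' '
    return text.strip()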
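
As a quick sanity check, the function can be run on a small hand-made string. The sample below is invented for illustration and only mimics the layout the function relies on (metadata between the first two separator lines, abstract after the second). The function is repeated in a compact form so the snippet runs on its own.

```python
def extract_text(file):
    # Same logic as the cell above: keep only lines seen after the second separator.
    sep_count = 0
    kept = []
    for line in file.split('\n'):
        if line == '\\\\':
            sep_count += 1
            continue
        if sep_count == 2:
            kept.append(line)
    return ' '.join(kept).strip()

# Invented sample; the paper ID and title are placeholders, not real records.
sample = '\n'.join([
    '-' * 30,
    '\\\\',
    'Paper: hep-th/0000000',
    'Title: A made-up title',
    '\\\\',
    '  First line of the abstract',
    'and the second line.',
    '\\\\',
])

abstract = extract_text(sample)
print(abstract)
```

Only the two lines between the second and third separators should survive, joined by a single space.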

Extract text for a folder of documents and write to a single text file.

In [21]:
abs_folder = 'cit-HepTh-abstracts/'

In [22]:
years = ['1992','1993','1994','1995', '1996','1997','1998','1999',
         '2000','2001','2002','2003']

In [29]:
save_fp = 'abs_text.txt'

In [27]:
text_list = []
node_file_list = []

for folder in years:
    folder_path = os.path.join(project_dir, abs_folder, folder)
    file_list = os.listdir(folder_path)

    for file in file_list:
        fp = os.path.join(folder_path, file)
        with open(fp, "r", encoding="utf-8") as f:
            file_text = f.read()
        text_list.append(extract_text(file_text))
        # Node ID is the file name without its '.abs' extension.
        node_file_list.append(file[:-4])

In [30]:
print(len(text_list))

print(len(node_file_list))

29555
29555


In [31]:
with open(save_fp, "w", encoding="utf-8") as f:
    # One record per line: node ID, a tab, then the abstract.
    for node, text in zip(node_file_list, text_list):
        f.write(node + '\t' + text + '\n')
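
To reuse the saved abstracts later (for example when building embeddings), the file can be read back into a dictionary keyed by node ID. This is a minimal sketch, assuming one tab-separated record per line as produced by the write cell; `load_abstracts` and the demo file name are hypothetical names introduced for illustration, and the two records are invented.

```python
import os
import tempfile

def load_abstracts(path):
    """Map each node ID (as a string) to its abstract text."""
    abstracts = {}
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip('\n')
            if not line:
                continue
            # Split on the first tab only, in case an abstract contains tabs.
            node, _, text = line.partition('\t')
            abstracts[node] = text
    return abstracts

# Demo on a throwaway file with two invented records.
demo_path = os.path.join(tempfile.mkdtemp(), 'abs_text_demo.txt')
with open(demo_path, "w", encoding="utf-8") as f:
    f.write('9201001\tFirst abstract.\n')
    f.write('9201002\tSecond abstract, with a\ttab.\n')

abstracts = load_abstracts(demo_path)
print(len(abstracts))
```

Keeping the node IDs as strings preserves them exactly as they appear in the file names.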

## Edge List Processing

In [None]:
# Note: despite its name, node_fp points at the edge list file (Cit-HepTh.txt).
with open(project_dir + node_fp, "r", encoding="utf-8") as f:
    edge_list = f.read().split('\n')
print(edge_list[0:10])

In [None]:
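edges = []
# Skip the first four lines of the file (header).
for edge in edge_list[4:]:
    from_to = edge.split('\t')
    try:
        from_to = (int(from_to[0]), int(from_to[1]))
    except (ValueError, IndexError):
        # Skip lines that do not hold two integer node IDs (e.g. a blank line).
        continue
    edges.append(from_to)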
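
The fixed offset `edge_list[4:]` assumes a header of exactly four lines. A slightly more defensive sketch is shown below, assuming header lines begin with `#` (as is usual for SNAP edge lists); it skips those lines along with any line that does not hold two integers. `parse_edges` is a hypothetical helper name and the sample lines are invented.

```python
def parse_edges(lines):
    """Return (source, target) integer tuples from tab-separated lines."""
    parsed = []
    for line in lines:
        # Skip blank lines and comment-style header lines.
        if not line or line.startswith('#'):
            continue
        parts = line.split('\t')
        try:
            parsed.append((int(parts[0]), int(parts[1])))
        except (ValueError, IndexError):
            # Not a pair of integers: ignore the line.
            continue
    return parsed

sample_lines = [
    '# header line',
    '# FromNodeId\tToNodeId',
    '1001\t9304045',
    '1001\t9308122',
    'not an edge',
    '',
]
sample_edges = parse_edges(sample_lines)
print(sample_edges)
```

Applied to the real file, this would be called as `parse_edges(edge_list)` without slicing.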

In [None]:
print(len(edges))

In [None]:
print(edges[:10])

In [None]:
import networkx as nx

In [None]:
G = nx.DiGraph()

In [None]:
G.add_edges_from(edges)
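
Once the graph is built, its size can be compared with the figures under Data statistics, and node IDs can be matched against the abstract file names. The sketch below runs on a toy graph with invented edges and file names. It assumes (not verified here) that file names may carry leading zeros that the integer node IDs drop, so both sides are compared as integers.

```python
import networkx as nx

# Toy graph standing in for G; the edges are invented.
toy_edges = [(1001, 9304045), (1001, 9308122), (9308122, 9304045)]
toy_G = nx.DiGraph()
toy_G.add_edges_from(toy_edges)

n_nodes = toy_G.number_of_nodes()
n_edges = toy_G.number_of_edges()

# File-name stems as they would appear in node_file_list (strings).
toy_files = ['0001001', '9304045', '9999999']
file_ids = {int(name) for name in toy_files}

# Nodes that do and do not have a matching abstract file.
nodes_with_abstract = set(toy_G.nodes()) & file_ids
nodes_without_abstract = set(toy_G.nodes()) - file_ids
print(n_nodes, n_edges, len(nodes_with_abstract), len(nodes_without_abstract))
```

On the real data, the same comparison with `G` and `node_file_list` would show how many of the 29555 extracted abstracts correspond to nodes in the graph.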