This notebook:
1. Preprocesses the data from the train and test CSV files (raw data coming from annotation).
2. Prepares the data so that the Text GCN implementation runs smoothly.
3. Builds the corpus for further use.

In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
common_path = "C:/Users/adity/OneDrive/Desktop/nlu_course/Goggles_gcn/"

train_csv = common_path + 'train.csv'
test_csv = common_path + 'test.csv'

In [3]:
def zone_text_converter(val):
    val = re.sub(r'([^\s\w.,]|_)+', '', val).lower()
    val = re.sub(r',+', '.', val)
    val = re.sub(r'\d+', '', val)
    val = re.sub(r' +',' ',val)
    return val
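
To see what the converter does, here is a small check on a made-up string. The sample sentence is an assumption for illustration and is not taken from the dataset; the function is repeated so the snippet runs on its own.

```python
import re

def zone_text_converter(val):
    # Same steps as the cell above, repeated so this snippet is standalone.
    val = re.sub(r'([^\s\w.,]|_)+', '', val).lower()  # drop punctuation except '.' and ','
    val = re.sub(r',+', '.', val)                      # commas become full stops
    val = re.sub(r'\d+', '', val)                      # remove digits
    val = re.sub(r' +', ' ', val)                      # collapse repeated spaces
    return val

sample = "Ohm's Law, V = I*R (see Fig. 12)"  # hypothetical input
cleaned = zone_text_converter(sample)
print(repr(cleaned))
```

The trailing space left behind after the digits are removed is harmless here, since the later steps split the text on whitespace.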

In [4]:
def field_converter(val):
    return val.lower().replace(' ', '')

In [5]:
from nltk.corpus import stopwords

In [6]:
# def preprocess(s, word_remove_list=None):
#     stop_words = set(stopwords.words('english'))
# #     print(stop_words)
#     x = s.split()
#     filtered_words = []
#     for w in x:
#         if w not in stop_words and len(w)>1:
#             filtered_words.append(w)
#     return((' ').join(filtered_words))

In [7]:
train_df = pd.read_csv(train_csv, index_col=0, converters={'zone_text': zone_text_converter, 'subject': field_converter,
                                           'grade': field_converter, 'syllabus': field_converter})
test_df = pd.read_csv(test_csv, index_col=0, converters={'zone_text': zone_text_converter, 'subject': field_converter,
                                           'grade': field_converter, 'syllabus': field_converter})
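
The `converters` argument applies a function to each raw cell of the named column while the file is parsed. A minimal sketch with an in-memory CSV follows; the column values are made up, and only `field_converter` comes from this notebook.

```python
import io
import pandas as pd

def field_converter(val):
    # Same normalisation as in the notebook: lower-case and remove spaces.
    return val.lower().replace(' ', '')

# Hypothetical two-row CSV standing in for train.csv.
csv_text = "id,subject,grade\n0,Physics,Grade 10\n1,Social Science,Grade 9\n"

demo_df = pd.read_csv(io.StringIO(csv_text), index_col=0,
                      converters={'subject': field_converter,
                                  'grade': field_converter})
subjects = demo_df['subject'].tolist()
grades = demo_df['grade'].tolist()
print(subjects, grades)
```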

In [8]:
train_df = train_df[train_df.category_type=='lesson']
test_df = test_df[test_df.category_type=='lesson']

In [9]:
L_train=[]
text_train_list = list(train_df['zone_text'])
label_train_list = list(train_df['annotation_id'])

In [10]:
text_train_list[:3]

['before we solve problems involving direction of current . direction of magnetic field and the direction of force by using fleming s left hand rule . we should keep the following points in mind i by convention . the direction of flow of positive charges is taken to be the direction of flow of current . so . the direction in which the positively charged particles such as protons or alpha particles . etc . . move will be the direction of electric current . ii the direction of electric current is . however . taken to be opposite to the direction of flow of negative charges such as electrons . so . if we are given the direction of flow of electrons . then the direction of electric current will be taken as opposite to the direction of flow of electrons . iii the direction of deflection of a current carrying conductor or a stream of positively charged particles or a stream of negatively charged particles like electrons tells us the direction of force acting on it . let us solve some problem

In [11]:
label_train_list[:3]

[730, 764, 753]

In [12]:
len(text_train_list)

2074

In [13]:
# One 'train' tag per training document.
L_train.extend(['train'] * len(text_train_list))

In [14]:
L_train[:10]

['train',
 'train',
 'train',
 'train',
 'train',
 'train',
 'train',
 'train',
 'train',
 'train']

In [15]:
L_test=[]
text_test_list = list(test_df['zone_text'])
label_test_list = list(test_df['annotation_id'])

In [16]:
text_test_list[:3]

['let us perform an experiment to verify this fact . take a straight copper rod . suspend it horizontally by means of two connecting wires between the poles of a strong horseshoe magnet as shown in figure . a and b . if a current is now passed in the rod as shown in figure . a . you will observe that the rod gets displaced . this displacement is caused by the force acting on the current carrying rod . in accordance with fleming s left hand rule . the magnet exerts a force on the rod directed upwards . with the result that the rod will get deflected upwards . now . reverse the direction of current or interchange the poles of the magnet as shown in figure . b . you will observe that the rod is now displaced downwards i . e . . deflection of the rod has reversed . this indicates that the direction of the force acting on the rod has reversed .',
 'equivalent resistance in parallel connection figure . a shows three resistors of resistances r . r and r connected in parallel across the points

In [17]:
len(text_test_list)

508

In [18]:
# One 'test' tag per test document.
L_test.extend(['test'] * len(text_test_list))

In [19]:
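# Concatenate into a new list so that text_train_list is left unchanged
# and re-running this cell does not append the test set twice.
sentences = text_train_list + text_test_list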

In [20]:
print(len(sentences))

2582


In [21]:
print(sentences[2074])

let us perform an experiment to verify this fact . take a straight copper rod . suspend it horizontally by means of two connecting wires between the poles of a strong horseshoe magnet as shown in figure . a and b . if a current is now passed in the rod as shown in figure . a . you will observe that the rod gets displaced . this displacement is caused by the force acting on the current carrying rod . in accordance with fleming s left hand rule . the magnet exerts a force on the rod directed upwards . with the result that the rod will get deflected upwards . now . reverse the direction of current or interchange the poles of the magnet as shown in figure . b . you will observe that the rod is now displaced downwards i . e . . deflection of the rod has reversed . this indicates that the direction of the force acting on the rod has reversed .


In [22]:
# Concatenate into a new list so that label_train_list is left unchanged.
labels = label_train_list + label_test_list
print(len(labels))

2582


In [23]:
labels[2074]

726

In [24]:
#Converting labels into string

labels = [str(i) for i in labels]
print(labels[:3])
print(type(labels[2074]))

['730', '764', '753']
<class 'str'>


In [25]:
# Concatenate into a new list so that L_train is left unchanged.
train_or_test_list = L_train + L_test
len(train_or_test_list)

2582
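
The three lists built so far (`sentences`, `labels`, `train_or_test_list`) are parallel: index `i` in each refers to the same document, with the training documents first. A small sketch with toy values (all data here is hypothetical) shows one way to build such lists and check that they stay aligned. It uses `+`, which creates a new list, whereas `extend` modifies the list it is called on.

```python
# Toy stand-ins for the notebook's lists (hypothetical values).
toy_train_text = ['doc a', 'doc b', 'doc c']
toy_test_text = ['doc d']
toy_train_labels = [730, 764, 753]
toy_test_labels = [726]

# '+' builds new lists and leaves the inputs untouched.
toy_sentences = toy_train_text + toy_test_text
toy_labels = [str(label) for label in toy_train_labels + toy_test_labels]
toy_split = ['train'] * len(toy_train_text) + ['test'] * len(toy_test_text)

# All three lists must have one entry per document.
assert len(toy_sentences) == len(toy_labels) == len(toy_split)

# The first test document sits right after the last training document.
first_test_index = len(toy_train_text)
print(toy_sentences[first_test_index],
      toy_labels[first_test_index],
      toy_split[first_test_index])
```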

In [26]:
# Creating the dataset: 2 txt files (Goggles.txt in the data folder and Goggles.txt in the data/corpus folder)

dataset_name = 'Goggles'

meta_data_list = []

for i in range(len(sentences)):
    meta = str(i) + '\t' + train_or_test_list[i] + '\t' + labels[i]
    meta_data_list.append(meta)

meta_data_str = '\n'.join(meta_data_list)

f = open('pyG_implement/data/' + dataset_name + '.txt', 'w') 
f.write(meta_data_str)
f.close()


corpus_str = '\n'.join(sentences)

f = open('pyG_implement/data/corpus/' + dataset_name + '.txt', 'w', encoding="utf-8")
f.write(corpus_str)
f.close()
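
Each line of the metadata file has the form `index<TAB>split<TAB>label`, and line `i` of the corpus file holds the text of document `i`. A sketch of writing and parsing that format, using in-memory strings and made-up values instead of files:

```python
# Hypothetical toy values in the same layout as the notebook's lists.
toy_split = ['train', 'train', 'test']
toy_labels = ['730', '764', '726']

# Build one tab-separated metadata line per document.
meta_lines = ['\t'.join([str(i), split, label])
              for i, (split, label) in enumerate(zip(toy_split, toy_labels))]
meta_str = '\n'.join(meta_lines)

# Parse the metadata back into (index, split, label) tuples.
parsed = []
for line in meta_str.split('\n'):
    idx, split, label = line.strip().split('\t')
    parsed.append((int(idx), split, label))

print(parsed)
```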

### Removing stopwords and cleaning data

Output - Goggles.clean.txt file
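
The cleaning cell counts word frequencies over the whole corpus and then drops stop words (and, optionally, rare words) from each document. A compact sketch of that idea follows. The tiny stop-word set, the sample documents, and the `min_freq` parameter are assumptions for illustration; the notebook itself uses NLTK's English stop-word list and `clean_str` from `utils`.

```python
from collections import Counter

toy_stop_words = {'the', 'of', 'is', 'a'}   # stand-in for NLTK's list
toy_docs = [
    'the direction of current is the direction of flow',
    'a magnet exerts a force',
]

# First pass: corpus-wide word frequencies.
word_freq = Counter(word for doc in toy_docs for word in doc.split())

def filter_doc(doc, min_freq=1):
    # Keep words that are not stop words and occur at least min_freq times.
    return ' '.join(w for w in doc.split()
                    if w not in toy_stop_words and word_freq[w] >= min_freq)

clean_docs = [filter_doc(doc) for doc in toy_docs]
print(clean_docs)
```

With a threshold of 1, as in the notebook, no word is removed for being rare; raising the threshold drops infrequent words.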

In [27]:
from nltk.corpus import stopwords
import nltk
from nltk.wsd import lesk
from nltk.corpus import wordnet as wn
from utils import clean_str, loadWord2Vec
import sys


dataset = 'Goggles'

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
print(stop_words)

# Read Word Vectors
# word_vector_file = 'pyG_implement/data/glove.6B/glove.6B.200d.txt'
# vocab, embd, word_vector_map = loadWord2Vec(word_vector_file)
# word_embeddings_dim = len(embd[0])
# dataset = ''

doc_content_list = []
# The corpus file was written with encoding="utf-8" above, so read it back the same way.
f = open('pyG_implement/data/corpus/' + dataset + '.txt', 'r', encoding='utf-8')
# f = open('pyG_implement/data/wiki_long_abstracts_en_text.txt', 'r')
for line in f.readlines():
    doc_content_list.append(line.strip())
f.close()


word_freq = {}  # to remove rare words

for doc_content in doc_content_list:
    temp = clean_str(doc_content)
    words = temp.split()
    for word in words:
        if word in word_freq:
            word_freq[word] += 1
        else:
            word_freq[word] = 1

clean_docs = []
for doc_content in doc_content_list:
    temp = clean_str(doc_content)
    words = temp.split()
    doc_words = []
    for word in words:
        # word not in stop_words and word_freq[word] >= 5
        if dataset == 'mr':
            doc_words.append(word)
        elif word not in stop_words and word_freq[word] >= 1:
            doc_words.append(word)

    doc_str = ' '.join(doc_words).strip()
    #if doc_str == '':
        #doc_str = temp
    clean_docs.append(doc_str)

clean_corpus_str = '\n'.join(clean_docs)

f = open('pyG_implement/data/corpus/' + dataset + '.clean.txt', 'w')
#f = open('pyG_implement/data/wiki_long_abstracts_en_text.clean.txt', 'w')
f.write(clean_corpus_str)
f.close()

#dataset = '20ng'
min_len = 10000
aver_len = 0
max_len = 0 

f = open('pyG_implement/data/corpus/' + dataset + '.clean.txt', 'r')
#f = open('pyG_implement/data/wiki_long_abstracts_en_text.txt', 'r')
lines = f.readlines()
for line in lines:
    line = line.strip()
    temp = line.split()
    aver_len = aver_len + len(temp)
    if len(temp) < min_len:
        min_len = len(temp)
    if len(temp) > max_len:
        max_len = len(temp)
f.close()
aver_len = 1.0 * aver_len / len(lines)
print('min_len : ' + str(min_len))
print('max_len : ' + str(max_len))
print('average_len : ' + str(aver_len))


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\adity\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


{"it's", 'itself', 'wasn', 'here', 'from', 's', 'his', 'am', 'the', 'each', 'couldn', 'hasn', 'ma', 'very', 'has', 't', 'i', 'few', 'did', "couldn't", "weren't", 'ain', 'while', 'doing', 'most', 'down', 'that', 'out', 'mustn', 'won', 'himself', 'these', "you've", 'not', 'over', 'ours', "shouldn't", "aren't", "should've", 'myself', 'be', 'up', 'own', 'only', 'same', 'we', 'can', 'how', 'through', 'about', 'this', 'she', "hadn't", 'didn', 'isn', 'themselves', 'they', 'off', 'more', 'what', 'such', 'who', 'will', 'yours', 'was', 'should', "doesn't", "isn't", 'aren', 'again', 'wouldn', 'after', "needn't", 'with', 'a', 'have', "you're", 'further', "wouldn't", 'on', 'by', 'there', "wasn't", 'does', 'both', 'at', 'd', 'all', "don't", "hasn't", 'haven', 'below', 'too', 'if', 'weren', "you'd", 'our', 'under', "didn't", 'its', 'were', 'it', 'had', 'as', 'during', 'needn', 'than', 'shan', 'an', 'but', 'of', "mustn't", 'because', "she's", 'them', 'other', 'shouldn', 'to', 'do', 'until', 'no', 'onc

### Building graph (word - word relation, word-doc relation) 
Same as build_graph.py

Creates multiple files (index lists, vocabulary, labels, and pickled feature and adjacency matrices) to be used in training and testing
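
The two kinds of edge weights can be sketched on a toy corpus. Word-word edges use pointwise mutual information (PMI) over sliding windows and keep only positive values; document-word edges use the term count multiplied by `log(N / document frequency)`. The function names and the sample documents below are hypothetical, and the sketch uses tuple keys instead of the string keys used in the cell.

```python
from collections import Counter
from itertools import combinations
from math import log

def sliding_windows(words, window_size):
    # A document no longer than the window is used as a single window.
    if len(words) <= window_size:
        return [words]
    return [words[j:j + window_size]
            for j in range(len(words) - window_size + 1)]

def pmi_edges(docs, window_size):
    windows = [w for doc in docs
               for w in sliding_windows(doc.split(), window_size)]
    num_window = len(windows)
    # Number of windows in which each word appears at least once.
    word_window_freq = Counter(word for w in windows for word in set(w))
    # Co-occurrence counts over all position pairs in a window, in both orders.
    pair_count = Counter()
    for w in windows:
        for a, b in combinations(w, 2):
            if a != b:
                pair_count[(a, b)] += 1
                pair_count[(b, a)] += 1
    edges = {}
    for (a, b), count in pair_count.items():
        pmi = log((count / num_window) /
                  (word_window_freq[a] * word_window_freq[b] / num_window ** 2))
        if pmi > 0:  # keep only positively associated pairs
            edges[(a, b)] = pmi
    return edges

def tfidf_edges(docs):
    # Term count in the document times log(N / number of docs containing the word).
    tokenised = [doc.split() for doc in docs]
    doc_freq = Counter(word for words in tokenised for word in set(words))
    edges = {}
    for doc_id, words in enumerate(tokenised):
        for word, tf in Counter(words).items():
            edges[(doc_id, word)] = tf * log(len(docs) / doc_freq[word])
    return edges

toy_docs = ['magnet force rod', 'current rod force', 'magnet field']  # hypothetical
word_edges = pmi_edges(toy_docs, window_size=2)
doc_edges = tfidf_edges(toy_docs)
print(sorted(word_edges))
```

In the notebook, `window_size` is 20 and each word index is offset by `train_size` when the weight is placed in the adjacency matrix.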

In [28]:
import os
import random
import numpy as np
import pickle as pkl
import networkx as nx
import scipy.sparse as sp
from utils import loadWord2Vec, clean_str
from math import log
from sklearn import svm
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
import sys
from scipy.spatial.distance import cosine


# Read Word Vectors
# word_vector_file = 'pyG_implement/data/glove.6B/glove.6B.300d.txt'
# word_vector_file = 'pyG_implement/data/corpus/' + dataset + '_word_vectors.txt'
#_, embd, word_vector_map = loadWord2Vec(word_vector_file)
# word_embeddings_dim = len(embd[0])

word_embeddings_dim = 100  #Word embedding dimension(100 - 300)
word_vector_map = {}

# shuffling
doc_name_list = []
doc_train_list = []
doc_test_list = []

f = open('pyG_implement/data/' + dataset + '.txt', 'r')
lines = f.readlines()
for line in lines:
    doc_name_list.append(line.strip())
    temp = line.split("\t")
    if temp[1].find('test') != -1:
        doc_test_list.append(line.strip())
    elif temp[1].find('train') != -1:
        doc_train_list.append(line.strip())
f.close()
# print(doc_train_list)
# print(doc_test_list)

doc_content_list = []
f = open('pyG_implement/data/corpus/' + dataset + '.clean.txt', 'r')
lines = f.readlines()
for line in lines:
    doc_content_list.append(line.strip())
f.close()
# print(doc_content_list)

train_ids = []
for train_name in doc_train_list:
    train_id = doc_name_list.index(train_name)
    train_ids.append(train_id)
print(train_ids)
random.shuffle(train_ids)

# partial labeled data
#train_ids = train_ids[:int(0.2 * len(train_ids))]

train_ids_str = '\n'.join(str(index) for index in train_ids)
f = open('pyG_implement/data/' + dataset + '.train.index', 'w')
f.write(train_ids_str)
f.close()

test_ids = []
for test_name in doc_test_list:
    test_id = doc_name_list.index(test_name)
    test_ids.append(test_id)
print(test_ids)
random.shuffle(test_ids)

test_ids_str = '\n'.join(str(index) for index in test_ids)
f = open('pyG_implement/data/' + dataset + '.test.index', 'w')
f.write(test_ids_str)
f.close()

ids = train_ids + test_ids
print(ids)
print(len(ids))

shuffle_doc_name_list = []
shuffle_doc_words_list = []
for id in ids:
    shuffle_doc_name_list.append(doc_name_list[int(id)])
    shuffle_doc_words_list.append(doc_content_list[int(id)])
shuffle_doc_name_str = '\n'.join(shuffle_doc_name_list)
shuffle_doc_words_str = '\n'.join(shuffle_doc_words_list)

f = open('pyG_implement/data/' + dataset + '_shuffle.txt', 'w')
f.write(shuffle_doc_name_str)
f.close()

f = open('pyG_implement/data/corpus/' + dataset + '_shuffle.txt', 'w')
f.write(shuffle_doc_words_str)
f.close()

# build vocab
word_freq = {}
word_set = set()
for doc_words in shuffle_doc_words_list:
    words = doc_words.split()
    for word in words:
        word_set.add(word)
        if word in word_freq:
            word_freq[word] += 1
        else:
            word_freq[word] = 1

vocab = list(word_set)
vocab_size = len(vocab)

word_doc_list = {}

for i in range(len(shuffle_doc_words_list)):
    doc_words = shuffle_doc_words_list[i]
    words = doc_words.split()
    appeared = set()
    for word in words:
        if word in appeared:
            continue
        if word in word_doc_list:
            doc_list = word_doc_list[word]
            doc_list.append(i)
            word_doc_list[word] = doc_list
        else:
            word_doc_list[word] = [i]
        appeared.add(word)

word_doc_freq = {}
for word, doc_list in word_doc_list.items():
    word_doc_freq[word] = len(doc_list)

word_id_map = {}
for i in range(vocab_size):
    word_id_map[vocab[i]] = i

vocab_str = '\n'.join(vocab)

f = open('pyG_implement/data/corpus/' + dataset + '_vocab.txt', 'w')
f.write(vocab_str)
f.close()

'''
Word definitions begin
'''
'''
definitions = []

for word in vocab:
    word = word.strip()
    synsets = wn.synsets(clean_str(word))
    word_defs = []
    for synset in synsets:
        syn_def = synset.definition()
        word_defs.append(syn_def)
    word_des = ' '.join(word_defs)
    if word_des == '':
        word_des = '<PAD>'
    definitions.append(word_des)

string = '\n'.join(definitions)


f = open('pyG_implement/data/corpus/' + dataset + '_vocab_def.txt', 'w')
f.write(string)
f.close()

tfidf_vec = TfidfVectorizer(max_features=1000)
tfidf_matrix = tfidf_vec.fit_transform(definitions)
tfidf_matrix_array = tfidf_matrix.toarray()
print(tfidf_matrix_array[0], len(tfidf_matrix_array[0]))

word_vectors = []

for i in range(len(vocab)):
    word = vocab[i]
    vector = tfidf_matrix_array[i]
    str_vector = []
    for j in range(len(vector)):
        str_vector.append(str(vector[j]))
    temp = ' '.join(str_vector)
    word_vector = word + ' ' + temp
    word_vectors.append(word_vector)

string = '\n'.join(word_vectors)

f = open('pyG_implement/data/corpus/' + dataset + '_word_vectors.txt', 'w')
f.write(string)
f.close()

word_vector_file = 'pyG_implement/data/corpus/' + dataset + '_word_vectors.txt'
_, embd, word_vector_map = loadWord2Vec(word_vector_file)
word_embeddings_dim = len(embd[0])
'''

'''
Word definitions end
'''

# label list
label_set = set()
for doc_meta in shuffle_doc_name_list:
    temp = doc_meta.split('\t')
    label_set.add(temp[2])
label_list = list(label_set)

label_list_str = '\n'.join(label_list)
f = open('pyG_implement/data/corpus/' + dataset + '_labels.txt', 'w')
f.write(label_list_str)
f.close()

# x: feature vectors of training docs, no initial features
# select 90% of the training set (the remaining 10% is held out for validation)
train_size = len(train_ids)
val_size = int(0.1 * train_size)
real_train_size = train_size - val_size  # - int(0.5 * train_size)
# different training rates

real_train_doc_names = shuffle_doc_name_list[:real_train_size]
real_train_doc_names_str = '\n'.join(real_train_doc_names)

f = open('pyG_implement/data/' + dataset + '.real_train.name', 'w')
f.write(real_train_doc_names_str)
f.close()

row_x = []
col_x = []
data_x = []
for i in range(real_train_size):
    doc_vec = np.array([0.0 for k in range(word_embeddings_dim)])
    doc_words = shuffle_doc_words_list[i]
    words = doc_words.split()
    doc_len = len(words)
    for word in words:
        if word in word_vector_map:
            word_vector = word_vector_map[word]
            # print(doc_vec)
            # print(np.array(word_vector))
            doc_vec = doc_vec + np.array(word_vector)

    for j in range(word_embeddings_dim):
        row_x.append(i)
        col_x.append(j)
        # np.random.uniform(-0.25, 0.25)
        data_x.append(doc_vec[j] / doc_len)  # doc_vec[j]/ doc_len

# x = sp.csr_matrix((real_train_size, word_embeddings_dim), dtype=np.float32)
x = sp.csr_matrix((data_x, (row_x, col_x)), shape=(
    real_train_size, word_embeddings_dim))

y = []
for i in range(real_train_size):
    doc_meta = shuffle_doc_name_list[i]
    temp = doc_meta.split('\t')
    label = temp[2]
    one_hot = [0 for l in range(len(label_list))]
    label_index = label_list.index(label)
    one_hot[label_index] = 1
    y.append(one_hot)
y = np.array(y)
print(y)

# tx: feature vectors of test docs, no initial features
test_size = len(test_ids)

row_tx = []
col_tx = []
data_tx = []
for i in range(test_size):
    doc_vec = np.array([0.0 for k in range(word_embeddings_dim)])
    doc_words = shuffle_doc_words_list[i + train_size]
    words = doc_words.split()
    doc_len = len(words)
    for word in words:
        if word in word_vector_map:
            word_vector = word_vector_map[word]
            doc_vec = doc_vec + np.array(word_vector)

    for j in range(word_embeddings_dim):
        row_tx.append(i)
        col_tx.append(j)
        # np.random.uniform(-0.25, 0.25)
        data_tx.append(doc_vec[j] / doc_len)  # doc_vec[j] / doc_len

# tx = sp.csr_matrix((test_size, word_embeddings_dim), dtype=np.float32)
tx = sp.csr_matrix((data_tx, (row_tx, col_tx)),
                   shape=(test_size, word_embeddings_dim))

ty = []
for i in range(test_size):
    doc_meta = shuffle_doc_name_list[i + train_size]
    temp = doc_meta.split('\t')
    label = temp[2]
    one_hot = [0 for l in range(len(label_list))]
    label_index = label_list.index(label)
    one_hot[label_index] = 1
    ty.append(one_hot)
ty = np.array(ty)
print(ty)

# allx: the feature vectors of both labeled and unlabeled training instances
# (a superset of x)
# unlabeled training instances -> words

word_vectors = np.random.uniform(-0.01, 0.01,
                                 (vocab_size, word_embeddings_dim))

for i in range(len(vocab)):
    word = vocab[i]
    if word in word_vector_map:
        vector = word_vector_map[word]
        word_vectors[i] = vector

row_allx = []
col_allx = []
data_allx = []

for i in range(train_size):
    doc_vec = np.array([0.0 for k in range(word_embeddings_dim)])
    doc_words = shuffle_doc_words_list[i]
    words = doc_words.split()
    doc_len = len(words)
    for word in words:
        if word in word_vector_map:
            word_vector = word_vector_map[word]
            doc_vec = doc_vec + np.array(word_vector)

    for j in range(word_embeddings_dim):
        row_allx.append(int(i))
        col_allx.append(j)
        # np.random.uniform(-0.25, 0.25)
        data_allx.append(doc_vec[j] / doc_len)  # doc_vec[j]/doc_len
for i in range(vocab_size):
    for j in range(word_embeddings_dim):
        row_allx.append(int(i + train_size))
        col_allx.append(j)
        data_allx.append(word_vectors.item((i, j)))


row_allx = np.array(row_allx)
col_allx = np.array(col_allx)
data_allx = np.array(data_allx)

allx = sp.csr_matrix(
    (data_allx, (row_allx, col_allx)), shape=(train_size + vocab_size, word_embeddings_dim))

ally = []
for i in range(train_size):
    doc_meta = shuffle_doc_name_list[i]
    temp = doc_meta.split('\t')
    label = temp[2]
    one_hot = [0 for l in range(len(label_list))]
    label_index = label_list.index(label)
    one_hot[label_index] = 1
    ally.append(one_hot)

for i in range(vocab_size):
    one_hot = [0 for l in range(len(label_list))]
    ally.append(one_hot)

ally = np.array(ally)

print(x.shape, y.shape, tx.shape, ty.shape, allx.shape, ally.shape)

'''
Doc word heterogeneous graph
'''

# word co-occurrence with context windows
window_size = 20
windows = []

for doc_words in shuffle_doc_words_list:
    words = doc_words.split()
    length = len(words)
    if length <= window_size:
        windows.append(words)
    else:
        # print(length, length - window_size + 1)
        for j in range(length - window_size + 1):
            window = words[j: j + window_size]
            windows.append(window)
            # print(window)


word_window_freq = {}
for window in windows:
    appeared = set()
    for i in range(len(window)):
        if window[i] in appeared:
            continue
        if window[i] in word_window_freq:
            word_window_freq[window[i]] += 1
        else:
            word_window_freq[window[i]] = 1
        appeared.add(window[i])

word_pair_count = {}
for window in windows:
    for i in range(1, len(window)):
        for j in range(0, i):
            word_i = window[i]
            word_i_id = word_id_map[word_i]
            word_j = window[j]
            word_j_id = word_id_map[word_j]
            if word_i_id == word_j_id:
                continue
            word_pair_str = str(word_i_id) + ',' + str(word_j_id)
            if word_pair_str in word_pair_count:
                word_pair_count[word_pair_str] += 1
            else:
                word_pair_count[word_pair_str] = 1
            # two orders
            word_pair_str = str(word_j_id) + ',' + str(word_i_id)
            if word_pair_str in word_pair_count:
                word_pair_count[word_pair_str] += 1
            else:
                word_pair_count[word_pair_str] = 1

row = []
col = []
weight = []

# pmi as weights

num_window = len(windows)

for key in word_pair_count:
    temp = key.split(',')
    i = int(temp[0])
    j = int(temp[1])
    count = word_pair_count[key]
    word_freq_i = word_window_freq[vocab[i]]
    word_freq_j = word_window_freq[vocab[j]]
    pmi = log((1.0 * count / num_window) /
              (1.0 * word_freq_i * word_freq_j/(num_window * num_window)))
    if pmi <= 0:
        continue
    row.append(train_size + i)
    col.append(train_size + j)
    weight.append(pmi)

# word vector cosine similarity as weights

'''
for i in range(vocab_size):
    for j in range(vocab_size):
        if vocab[i] in word_vector_map and vocab[j] in word_vector_map:
            vector_i = np.array(word_vector_map[vocab[i]])
            vector_j = np.array(word_vector_map[vocab[j]])
            similarity = 1.0 - cosine(vector_i, vector_j)
            if similarity > 0.9:
                print(vocab[i], vocab[j], similarity)
                row.append(train_size + i)
                col.append(train_size + j)
                weight.append(similarity)
'''
# doc word frequency
doc_word_freq = {}

for doc_id in range(len(shuffle_doc_words_list)):
    doc_words = shuffle_doc_words_list[doc_id]
    words = doc_words.split()
    for word in words:
        word_id = word_id_map[word]
        doc_word_str = str(doc_id) + ',' + str(word_id)
        if doc_word_str in doc_word_freq:
            doc_word_freq[doc_word_str] += 1
        else:
            doc_word_freq[doc_word_str] = 1

for i in range(len(shuffle_doc_words_list)):
    doc_words = shuffle_doc_words_list[i]
    words = doc_words.split()
    doc_word_set = set()
    for word in words:
        if word in doc_word_set:
            continue
        j = word_id_map[word]
        key = str(i) + ',' + str(j)
        freq = doc_word_freq[key]
        if i < train_size:
            row.append(i)
        else:
            row.append(i + vocab_size)
        col.append(train_size + j)
        idf = log(1.0 * len(shuffle_doc_words_list) /
                  word_doc_freq[vocab[j]])
        weight.append(freq * idf)
        doc_word_set.add(word)

node_size = train_size + vocab_size + test_size
adj = sp.csr_matrix(
    (weight, (row, col)), shape=(node_size, node_size))

# dump objects
f = open("pyG_implement/data/ind.{}.x".format(dataset), 'wb')
pkl.dump(x, f)
f.close()

f = open("pyG_implement/data/ind.{}.y".format(dataset), 'wb')
pkl.dump(y, f)
f.close()

f = open("pyG_implement/data/ind.{}.tx".format(dataset), 'wb')
pkl.dump(tx, f)
f.close()

f = open("pyG_implement/data/ind.{}.ty".format(dataset), 'wb')
pkl.dump(ty, f)
f.close()

f = open("pyG_implement/data/ind.{}.allx".format(dataset), 'wb')
pkl.dump(allx, f)
f.close()

f = open("pyG_implement/data/ind.{}.ally".format(dataset), 'wb')
pkl.dump(ally, f)
f.close()

f = open("pyG_implement/data/ind.{}.adj".format(dataset), 'wb')
pkl.dump(adj, f)
f.close()


[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221,

[990, 1551, 1359, 106, 1566, 1495, 1011, 1055, 1057, 2016, 226, 311, 1576, 954, 320, 2053, 900, 670, 2004, 1876, 505, 864, 173, 1464, 927, 1095, 1056, 91, 1196, 906, 1716, 1550, 1481, 782, 1763, 1085, 1131, 215, 648, 1857, 378, 1702, 1663, 878, 439, 1211, 1794, 1511, 1101, 2062, 1135, 946, 1912, 1915, 19, 1365, 1180, 844, 882, 2002, 1776, 1179, 1153, 1238, 2050, 1717, 707, 1461, 870, 781, 1708, 1958, 1734, 268, 529, 1605, 1332, 484, 930, 933, 287, 4, 1190, 1385, 1674, 1544, 389, 955, 239, 1575, 740, 1675, 243, 218, 572, 1860, 1134, 3, 506, 1826, 6, 1871, 1713, 1139, 1146, 812, 1297, 1866, 1720, 232, 1635, 739, 1977, 1925, 623, 1104, 1286, 1363, 345, 916, 1744, 792, 1867, 298, 398, 1112, 797, 461, 659, 992, 533, 1698, 1618, 577, 1236, 1053, 671, 1175, 908, 1950, 1496, 1100, 1311, 967, 1528, 1350, 2039, 234, 354, 2056, 1645, 1052, 1351, 1833, 435, 1624, 657, 1298, 195, 1218, 164, 1909, 649, 831, 963, 1589, 1834, 976, 1485, 156, 485, 1020, 1700, 1539, 596, 2017, 1352, 564, 1189, 1870, 24,

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
(1867, 100) (1867, 129) (508, 100) (508, 129) (8881, 100) (8881, 129)
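
The printed shapes follow from the split sizes. With 2074 training documents, 10% (207) are held out for validation, leaving 1867 rows in `x`. `allx` stacks all 2074 training documents on top of the vocabulary, so the vocabulary size can be read off as 8881 - 2074 = 6807. The sketch below reproduces that arithmetic and the node ordering used for the adjacency matrix (training documents, then words, then test documents). The helper names `doc_node_index` and `word_node_index` are hypothetical.

```python
train_size = 2074      # len(text_train_list) printed earlier
test_size = 508        # len(text_test_list) printed earlier
allx_rows = 8881       # first dimension of allx in the output above

val_size = int(0.1 * train_size)
real_train_size = train_size - val_size
vocab_size = allx_rows - train_size
node_size = train_size + vocab_size + test_size

def doc_node_index(i):
    # Row of document i (position in the shuffled list) in the adjacency matrix:
    # training docs keep their position, test docs are shifted past the word nodes.
    return i if i < train_size else i + vocab_size

def word_node_index(j):
    # Row/column of vocabulary word j in the adjacency matrix.
    return train_size + j

print(real_train_size, vocab_size, node_size)
```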
