# Node Classification on Facebook Large Page-Page Network

## About dataset

This dataset is a page-page graph of verified Facebook pages. It was collected through the Facebook Graph API in November 2017 and restricted to pages from four categories defined by Facebook: politicians, governmental organizations, television shows, and companies. The associated task is multi-class node classification over these four page categories.

[musae_facebook_target.csv](https://raw.githubusercontent.com/TimKong21/MEMGRAPH-Graph-Analytics-with-NetworkX/main/Asssignments/Assignment%205_Graph%20embeddings%20_machine%20learning/facebook_large/musae_facebook_target.csv)
1. Nodes are official Facebook pages, each with a unique id.
2. Each page is labelled with its page type: 'tvshow', 'government', 'company', or 'politician'.

[musae_facebook_edges.csv](https://raw.githubusercontent.com/benedekrozemberczki/MUSAE/master/input/edges/facebook_edges.csv) 
1. Edges are the mutual likes between the Facebook pages
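
Because an edge records a mutual like, the graph is undirected: if page A is linked to page B, then B is linked to A. A minimal sketch of that idea, using the five rows from the `edges.head()` output in this notebook; the helper `build_adjacency` is a hypothetical name, not part of any library used here:

```python
from collections import defaultdict

# The five (id_1, id_2) rows shown by edges.head() in this notebook.
edge_rows = [(0, 18427), (1, 21708), (1, 22208), (1, 22171), (1, 6829)]

def build_adjacency(pairs):
    """Store every mutual like in both directions (undirected graph)."""
    adjacency = defaultdict(set)
    for u, v in pairs:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency

adjacency = build_adjacency(edge_rows)
print(sorted(adjacency[1]))   # neighbours of page 1
print(1 in adjacency[6829])   # True: the relation is symmetric
```

`nx.from_pandas_edgelist`, used later in the notebook, builds an undirected `nx.Graph` by default, which matches this reading of the data.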

## Import necessary modules

In [1]:
import networkx as nx
import pandas as pd

import random
from tqdm import tqdm

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import accuracy_score

from gensim.models import Word2Vec
from stellargraph import StellarGraph
from stellargraph.data import BiasedRandomWalk

## Data importation and preprocessing

### Edges data 

In [2]:
edges = pd.read_csv('facebook_large/musae_facebook_edges.csv')
print('shape:',edges.shape)
edges.head()

shape: (171002, 2)


| | id_1 | id_2 |
|---|---|---|
| 0 | 0 | 18427 |
| 1 | 1 | 21708 |
| 2 | 1 | 22208 |
| 3 | 1 | 22171 |
| 4 | 1 | 6829 |


In [9]:
edges.shape

(171002, 2)

### Nodes data

In [3]:
nodes = pd.read_csv('facebook_large/musae_facebook_target.csv')
print('shape:',nodes.shape)
nodes.head()

shape: (22470, 4)


| | id | facebook_id | page_name | page_type |
|---|---|---|---|---|
| 0 | 0 | 145647315578475 | The Voice of China 中国好声音 | tvshow |
| 1 | 1 | 191483281412 | U.S. Consulate General Mumbai | government |
| 2 | 2 | 144761358898518 | ESET | company |
| 3 | 3 | 568700043198473 | Consulate General of Switzerland in Montreal | government |
| 4 | 4 | 1408935539376139 | Mark Bailey MP - Labor for Miller | politician |


In [10]:
nodes.shape  # executed after In [5] added 'label_class', hence 5 columns

(22470, 5)

#### Label encoding on page type
The `page_type` column takes 4 distinct values, one per class.

In [4]:
nodes['page_type'].unique()

array(['tvshow', 'government', 'company', 'politician'], dtype=object)

In [5]:
# Fit the encoder on the four known page types, then map each page to an integer class
node_att = preprocessing.LabelEncoder()
node_att.fit(['tvshow', 'government', 'company', 'politician'])
nodes['label_class'] = node_att.transform(nodes['page_type'])

In [6]:
nodes.head()

| | id | facebook_id | page_name | page_type | label_class |
|---|---|---|---|---|---|
| 0 | 0 | 145647315578475 | The Voice of China 中国好声音 | tvshow | 3 |
| 1 | 1 | 191483281412 | U.S. Consulate General Mumbai | government | 1 |
| 2 | 2 | 144761358898518 | ESET | company | 0 |
| 3 | 3 | 568700043198473 | Consulate General of Switzerland in Montreal | government | 1 |
| 4 | 4 | 1408935539376139 | Mark Bailey MP - Labor for Miller | politician | 2 |
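
The integer codes in `label_class` follow the alphabetical order of the class names, not their order of appearance, because `LabelEncoder` sorts the classes it is fitted on. A minimal pure-Python sketch of that mapping; the helper `fit_label_encoder` is hypothetical and only mimics this one aspect of scikit-learn's encoder:

```python
def fit_label_encoder(labels):
    """Map each distinct label to an integer, in sorted order of the labels."""
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}

page_types = ['tvshow', 'government', 'company', 'politician']
encoding = fit_label_encoder(page_types)

print(encoding)
# {'company': 0, 'government': 1, 'politician': 2, 'tvshow': 3}

# Page types of the five rows shown above
sample = ['tvshow', 'government', 'company', 'government', 'politician']
encoded = [encoding[p] for p in sample]
print(encoded)
# [3, 1, 0, 1, 2]
```

The encoded values agree with the `label_class` column in the table above. With the fitted scikit-learn encoder, `node_att.inverse_transform` maps the integers back to the page-type strings.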


## Load and sample graph function

In [48]:
def load_graph(nodes, edges):
    """Build a NetworkX graph from the edge list and attach node attributes."""
    G = nx.from_pandas_edgelist(edges, 'id_1', 'id_2')

    # index the node table by 'id' so the attribute dicts are keyed by node id
    indexed = nodes.set_index('id')

    # set node attribute: page type
    nx.set_node_attributes(G, values=indexed['page_type'].to_dict(), name='page_type')

    # set node attribute: Facebook id
    nx.set_node_attributes(G, values=indexed['facebook_id'].to_dict(), name='facebook_id')

    return G

def sample_graph(G, num_samples, seed):
    """Sample nodes at random and return the induced subgraph as a StellarGraph.

    Note: G is modified in place; all unsampled nodes are removed from it.
    """
    # randomly sample nodes
    random.seed(seed)  # fix the random state so the sample is reproducible
    samples = set(random.sample(list(G.nodes), num_samples))  # set gives O(1) lookups

    # remove unsampled nodes
    dump_nodes = []
    for n in tqdm(G):
        if n not in samples:
            dump_nodes.append(n)
    G.remove_nodes_from(dump_nodes)
    print("number of sampled nodes:", G.number_of_nodes())

    # generate StellarGraph object
    Stellar_G = StellarGraph.from_networkx(G)

    return Stellar_G
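
Removing every unsampled node, as `sample_graph` does, leaves the induced subgraph: only edges whose two endpoints were both sampled survive. When a small fraction of nodes is kept, many sampled nodes may end up with few or no edges. A dependency-free sketch of the same idea on a toy graph; the function name and the ring graph are illustrative assumptions, and a private `random.Random` instance is used instead of seeding the global generator:

```python
import random

def sample_induced_subgraph(nodes, edges, num_samples, seed):
    """Pick num_samples nodes reproducibly and keep only edges between them."""
    rng = random.Random(seed)  # private generator, global random state untouched
    kept = set(rng.sample(sorted(nodes), num_samples))
    kept_edges = [(u, v) for u, v in edges if u in kept and v in kept]
    return kept, kept_edges

# Toy graph: a 6-node ring, a hypothetical stand-in for the Facebook graph.
toy_nodes = range(6)
toy_edges = [(i, (i + 1) % 6) for i in range(6)]

kept_a, edges_a = sample_induced_subgraph(toy_nodes, toy_edges, 4, seed=42)
kept_b, edges_b = sample_induced_subgraph(toy_nodes, toy_edges, 4, seed=42)

print(kept_a == kept_b)            # True: same seed, same sample
print(len(kept_a), len(edges_a))   # nodes kept, edges surviving
```

Unlike `sample_graph`, this sketch returns new objects and leaves its inputs unchanged. In NetworkX the equivalent would be `G.subgraph(samples).copy()`.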

## Graph embeddings function 
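
This section is empty in the notebook. The imports (`BiasedRandomWalk`, `Word2Vec`) suggest a node2vec-style pipeline: generate biased random walks over the graph, then treat each walk as a sentence for Word2Vec. The following is a dependency-free sketch of the walk step only. It is not StellarGraph's implementation, and the function name, toy graph, and parameter values are assumptions made for illustration:

```python
import random

def biased_random_walk(adjacency, start, length, p, q, rng):
    """One node2vec-style walk: p controls returning, q controls exploring."""
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbours = sorted(adjacency[current])
        if not neighbours:
            break                                  # isolated node, stop early
        if len(walk) == 1:
            walk.append(rng.choice(neighbours))    # first step is unbiased
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbours:
            if candidate == previous:
                weights.append(1.0 / p)            # step back to the previous node
            elif candidate in adjacency[previous]:
                weights.append(1.0)                # stay within the previous node's neighbourhood
            else:
                weights.append(1.0 / q)            # move further away
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk

# Hypothetical toy graph: a triangle 0-1-2 with a tail 2-3-4.
adjacency = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
rng = random.Random(42)
walks = [biased_random_walk(adjacency, node, length=6, p=0.5, q=2.0, rng=rng)
         for node in adjacency for _ in range(2)]

# Word2Vec consumes sequences of string tokens, so node ids are cast to str.
sentences = [[str(node) for node in walk] for walk in walks]
print(len(sentences), sentences[0])
```

A small `p` makes the walk revisit nodes and stay local, while a small `q` pushes it outward. The resulting `sentences` would then be passed to a Word2Vec model to obtain one vector per node.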

## Classifier - logistic regression function
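
This section is also empty in the notebook. The imports (`train_test_split`, `LogisticRegressionCV`, `accuracy_score`) point to a hold-out evaluation of a logistic regression classifier trained on node embeddings. The sketch below illustrates that workflow without external dependencies: a hand-written softmax regression trained by gradient descent stands in for scikit-learn's `LogisticRegressionCV` (with no cross-validated regularisation), and the 2-D "embeddings" are synthetic. All names and values are assumptions:

```python
import math
import random

def softmax(scores):
    """Turn raw class scores into probabilities (shifted for numerical stability)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train_softmax_regression(X, y, num_classes, lr=0.1, epochs=100):
    """Fit multinomial logistic regression with plain batch gradient descent."""
    dim = len(X[0])
    weights = [[0.0] * (dim + 1) for _ in range(num_classes)]  # last entry is the bias
    for _ in range(epochs):
        grads = [[0.0] * (dim + 1) for _ in range(num_classes)]
        for features, label in zip(X, y):
            row = list(features) + [1.0]
            probs = softmax([sum(w * v for w, v in zip(ws, row)) for ws in weights])
            for c in range(num_classes):
                error = probs[c] - (1.0 if c == label else 0.0)
                for j, v in enumerate(row):
                    grads[c][j] += error * v
        for c in range(num_classes):
            for j in range(dim + 1):
                weights[c][j] -= lr * grads[c][j] / len(X)
    return weights

def predict(weights, features):
    """Return the index of the class with the highest score."""
    row = list(features) + [1.0]
    scores = [sum(w * v for w, v in zip(ws, row)) for ws in weights]
    return scores.index(max(scores))

# Synthetic 2-D "embeddings": four well-separated clusters, one per class.
rng = random.Random(0)
centres = [(-2, -2), (2, -2), (-2, 2), (2, 2)]
data = [([cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3)], label)
        for label, (cx, cy) in enumerate(centres) for _ in range(25)]
rng.shuffle(data)
split = int(0.75 * len(data))  # 75 % train, 25 % test
train, test = data[:split], data[split:]

weights = train_softmax_regression([x for x, _ in train], [y for _, y in train], 4)
accuracy = sum(predict(weights, x) == y for x, y in test) / len(test)
print(f"test accuracy: {accuracy:.2f}")
```

With real embeddings, the features would be the Word2Vec vectors of the sampled nodes and the labels the `label_class` column. Accuracy on this toy data says nothing about performance on the Facebook graph.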

## Main function

In [49]:
G = load_graph(nodes,edges) #networkx graph object 
Stellar_G = sample_graph(G, 1000, 42) #sampled stellar graph object

100%|█████████████████████████████████████████████████████████████████████████| 22470/22470 [00:01<00:00, 19871.17it/s]


number of sampled nodes: 1000


In [20]:
def main():
    pass

if __name__ == "__main__":
    main()