# Project: Trying to Predict Network Characteristics Using Attribute Data + Machine Learning

## Summary
Network data is generally harder to collect than attribute data, so it is tempting to predict network characteristics with machine learning models trained on attribute data.

The objective of this project is to demonstrate that network characteristics among entities are fundamentally distinct from entity attributes and cannot be effectively predicted from attribute data alone, regardless of the machine learning model employed.

This project uses 'G03B_attribute_centrality_position.csv' and 'G03F_attribute_centrality_position.csv' as examples. Both datasets contain entity attributes and network characteristics (centrality indicators and network position). We train several supervised machine learning models to predict centrality indicators and network position, using entity attributes as the input data.

Our goal is to maximize the predictive power of these models and thereby show the substantial gap between attributes and network characteristics, a gap that conventional machine learning models cannot bridge. Ultimately, this underscores the importance of acquiring network data and adopting graph-based approaches for problems about relational structure.

## Contents
* Preparing Data: cleaning, one-hot encoding, scaling, polynomial features, feature selection, feature extraction
* Regression: Predicting centrality with attributes
* Regression: Predicting centrality with attributes + other centralities
* Classification: Predicting network position with attributes
* Classification: Predicting network position with attributes + centralities
* Conclusion

## Preparing Data

In [1]:
import pandas as pd

In [2]:
F = pd.read_csv(r"C:\Users\user\Documents\G03F_attribute_centrality_position.csv", index_col = 0)
B = pd.read_csv(r"C:\Users\user\Documents\G03B_attribute_centrality_position.csv", index_col = 0)

### cleaning

In [3]:
# add IPC column
F['IPC'] = 'G03F'
B['IPC'] = 'G03B'

# merge two df
df = pd.concat([B,F], axis = 0)

# fill all missing values with 0
df = df.fillna(0)

# in position_class, the 0 left by fillna marks entities absent from the network; relabel it 'not included'
df.loc[df['position_class'] == 0,'position_class'] = 'not included'
# merge 'isolate node' and 'isolate community' into a single class 'isolates'
df.loc[df['position_class'] == 'isolate node','position_class'] = 'isolates'
df.loc[df['position_class'] == 'isolate community','position_class'] = 'isolates'

# Bucket the country column ('國家' = country): keep US, JP, TW, KR, CN and map everything else to 'Others'
allowed_countries = {'US', 'JP', 'TW', 'KR', 'CN'}
df['Country'] = df['國家'].apply(lambda x: x if x in allowed_countries else 'Others')
#df['Country'].value_counts()

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 551 entries, 0 to 185
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   專利權人            551 non-null    object 
 1   專利件數            551 non-null    int64  
 2   他人引證次數          551 non-null    int64  
 3   自我引證次數          551 non-null    int64  
 4   發明人數            551 non-null    int64  
 5   平均專利年齡          551 non-null    int64  
 6   活動年期            551 non-null    int64  
 7   相對研發能力          551 non-null    float64
 8   國家              551 non-null    object 
 9   時期              551 non-null    object 
 10  indegree        551 non-null    float64
 11  closeness       551 non-null    float64
 12  betweenness     551 non-null    float64
 13  harmonic        551 non-null    float64
 14  eigenvector     551 non-null    float64
 15  katz            551 non-null    float64
 16  pagerank        551 non-null    float64
 17  laplacian       551 non-null    float64


In [13]:
df.describe(include = 'all').T

| |count|unique|top|freq|mean|std|min|25%|50%|75%|max|
|---|---|---|---|---|---|---|---|---|---|---|---|
|專利權人|551.0|130.0|ASML NETHERLANDS|8.0| | | | | | | |
|專利件數|551.0| | | |36.495463|72.043847|1.0|5.0|14.0|29.0|638.0|
|他人引證次數|551.0| | | |4.580762|11.2198|0.0|0.0|1.0|4.0|103.0|
|自我引證次數|551.0| | | |6.537205|30.158279|0.0|0.0|0.0|3.0|543.0|
|發明人數|551.0| | | |50.526316|95.860423|1.0|8.0|21.0|53.0|1211.0|
|平均專利年齡|551.0| | | |9.435572|4.793075|1.0|6.0|9.0|13.0|18.0|
|活動年期|551.0| | | |3.591652|1.331521|1.0|3.0|4.0|5.0|5.0|
|相對研發能力|551.0| | | |0.102976|0.183335|0.0|0.01|0.04|0.11|1.0|
|國家|551.0|15.0|US|211.0| | | | | | | |
|時期|551.0|4.0|2009_2013|149.0| | | | | | | |


In [14]:
df.columns

Index(['專利權人', '專利件數', '他人引證次數', '自我引證次數', '發明人數', '平均專利年齡', '活動年期', '相對研發能力',
       '國家', '時期', 'indegree', 'closeness', 'betweenness', 'harmonic',
       'eigenvector', 'katz', 'pagerank', 'laplacian', 'position_class', 'IPC',
       'Country'],
      dtype='object')

### dataset for each section

In [5]:
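# column glossary: 專利件數 = patent count, 他人引證次數 = citations received from others, 自我引證次數 = self-citations,
# 發明人數 = number of inventors, 平均專利年齡 = average patent age, 活動年期 = active years, 時期 = period
# select columns for section 2 'predicting centrality with attributes'
df2 = df.loc[:,['專利件數', '他人引證次數', '自我引證次數', '發明人數', '平均專利年齡', '活動年期', '時期', 'IPC', 'Country', 
                'pagerank']] # for simplicity, pagerank is the only target centrality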

# select columns for section 3 'predicting centrality with attributes and other centralities'
df3 = df.loc[:,['專利件數', '他人引證次數', '自我引證次數', '發明人數', '平均專利年齡', '活動年期', '時期', 'IPC', 'Country', 
                'indegree', 'closeness', 'betweenness', 'harmonic', 'eigenvector', 'katz', 'laplacian', 
                'pagerank']] # pagerank centrality is target

# select columns for section 4 'predicting network position with attributes'
df4 = df.loc[:,['專利件數', '他人引證次數', '自我引證次數', '發明人數', '平均專利年齡', '活動年期', '時期', 'IPC', 'Country', 
                'position_class']] # position_class is target

# select columns for section 5 'predicting network position with attributes and centralities'
df5 = df.loc[:,['專利件數', '他人引證次數', '自我引證次數', '發明人數', '平均專利年齡', '活動年期', '時期', 'IPC', 'Country', 
                'indegree', 'closeness', 'betweenness', 'harmonic', 'eigenvector', 'katz', 'pagerank', 'laplacian', 
                'position_class']] # position_class is target

### preprocessing

1. define features and target
2. train test split
3. numerical features pipeline: scaling, polynomial feature generation, initial feature selection
4. categorical features pipeline: one-hot encoding
5. all features pipeline: feature selection and feature extraction (PCA)
6. fit_transform x_train with pipeline
7. transform x_test with the same pipeline
8. return X_train_processed, X_test_processed, y_train, y_test

The dataset is small: only 551 data points. To keep the evaluation metrics comparable across sections, every section uses the same train, validation, and test set sizes, the same preprocessing, and the same number of final features (20).
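To see why the pipeline needs the selection and PCA steps to land on 20 features, it helps to count how many columns the earlier steps can produce. The sketch below is arithmetic only. It assumes 6 numeric attributes and 12 one-hot columns (4 periods, 2 IPC classes, 6 country buckets, as suggested by the `describe` output and the cleaning code), and it ignores whatever `VarianceThreshold` drops, since that depends on the data. The helper name `poly2_feature_count` is hypothetical.

```python
from math import comb


def poly2_feature_count(n_numeric: int) -> int:
    """Columns produced by PolynomialFeatures(degree=2, include_bias=False)."""
    # All monomials of degree 1 or 2 in n variables: C(n + 2, 2), minus the bias term.
    return comb(n_numeric + 2, 2) - 1


# Assumed one-hot widths: 時期 (4 periods) + IPC (2) + Country (6 buckets).
N_ONE_HOT = 4 + 2 + 6

# Numeric inputs per dataset: 6 attributes, plus 7 centralities in df3 and 8 in df5.
cases = [
    ("df2 / df4 (attributes only)", 6),
    ("df3 (attributes + 7 centralities)", 13),
    ("df5 (attributes + 8 centralities)", 14),
]

for label, n_numeric in cases:
    n_poly = poly2_feature_count(n_numeric)
    total = n_poly + N_ONE_HOT
    print(f"{label}: {n_poly} polynomial + {N_ONE_HOT} one-hot = {total} columns")
```

Under these assumptions the preprocessor can emit up to 39, 116, and 131 columns respectively, which `SelectKBest(k=30)` and `PCA(n_components=20)` then reduce to the same 20 features in every section.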

In [18]:
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, OneHotEncoder, LabelEncoder
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_regression, f_classif
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

In [7]:
# define preprocessing function for regression task
def preprocessing_20features_traintestsplit(df):
    # Define features and target
    X = df.drop(df.columns[-1], axis=1)
    y = df.iloc[:,-1]

    # Split data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle = True, random_state=42)

    # Identify numerical and categorical columns
    numerical_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
    categorical_features = X.select_dtypes(include=['object', 'category']).columns.tolist()

    # Preprocessing for numerical data
    numerical_pipeline = Pipeline([
        ('scaler', StandardScaler()),  # Scaling
        ('polynomial', PolynomialFeatures(degree=2, include_bias=False)),  # Polynomial features
        ('variance_threshold', VarianceThreshold(threshold=0.1)),  # Initial feature selection
    ])

    # Preprocessing for categorical data
    categorical_pipeline = Pipeline([
        ('onehot', OneHotEncoder(handle_unknown='ignore')),  # One-hot encoding
    ])

    # Combine preprocessing steps
    preprocessor = ColumnTransformer([
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features),
    ])

    # Full pipeline including feature selection and extraction
    full_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('select_k_best', SelectKBest(score_func=f_regression, k=30)),  # Feature selection on transformed features
        ('pca', PCA(n_components=20)),  # Feature extraction
    ])

    # Fit and transform the training data
    X_train_processed = full_pipeline.fit_transform(X_train, y_train)
    X_test_processed = full_pipeline.transform(X_test)

    # Output the processed features for verification
    print("Processed training features shape:", X_train_processed.shape)
    print("Processed testing features shape:", X_test_processed.shape)

    return X_train_processed, X_test_processed, y_train, y_test

In [19]:
# define preprocessing function for classification task
def preprocessing_20features_multiclasstarget_traintestsplit(df):
    # Define features and target
    X = df.drop(df.columns[-1], axis=1)
    y = df.iloc[:,-1]

    # Split data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle = True, random_state=42)

    # Identify numerical and categorical columns
    numerical_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
    categorical_features = X.select_dtypes(include=['object', 'category']).columns.tolist()

    # Preprocessing for numerical data
    numerical_pipeline = Pipeline([
        ('scaler', StandardScaler()),  # Scaling
        ('polynomial', PolynomialFeatures(degree=2, include_bias=False)),  # Polynomial features
        ('variance_threshold', VarianceThreshold(threshold=0.1)),  # Initial feature selection
    ])

    # Preprocessing for categorical data
    categorical_pipeline = Pipeline([
        ('onehot', OneHotEncoder(handle_unknown='ignore')),  # One-hot encoding
    ])

    # Combine preprocessing steps
    preprocessor = ColumnTransformer([
        ('num', numerical_pipeline, numerical_features),
        ('cat', categorical_pipeline, categorical_features),
    ])

    # Full pipeline including feature selection and extraction
    full_pipeline = Pipeline([
        ('preprocessor', preprocessor),
        ('select_k_best', SelectKBest(score_func=f_classif, k=30)),  # Feature selection on transformed features
        ('pca', PCA(n_components=20)),  # Feature extraction
    ])

    # Fit and transform the training data
    X_train_processed = full_pipeline.fit_transform(X_train, y_train)
    X_test_processed = full_pipeline.transform(X_test)

    # Label encoding the target variable
    le = LabelEncoder()
    y_train_encoded = le.fit_transform(y_train)
    y_test_encoded = le.transform(y_test)
    
    # Output the processed features for verification
    print("Processed training features shape:", X_train_processed.shape)
    print("Processed testing features shape:", X_test_processed.shape)
    print("Encoded training target shape:", y_train_encoded.shape)
    print("Encoded testing target shape:", y_test_encoded.shape)

    return X_train_processed, X_test_processed, y_train_encoded, y_test_encoded

## Regression

We use PageRank centrality as the target. The table describes it alongside the two simpler centralities it builds on:

|Centrality|Description|
|---|---|
|Indegree|The indegree centrality of a node is the number of neighbors that point to it (its incoming edges). It is the simplest centrality indicator of a node in a network.|
|Eigenvector|Eigenvector centrality computes the centrality of a node by adding up the centralities of its neighbors. It reflects the importance of node i not just by counting how many neighbors it has, but also by taking the importance of those neighbors into account. It can be seen as a more sophisticated version of indegree centrality.|
|Pagerank|Originally developed by Google to rank web pages, PageRank can be seen as a further refinement of eigenvector centrality. Like eigenvector centrality, PageRank derives a node's importance from its neighbors. Unlike eigenvector centrality, it also divides each neighbor's contribution by that neighbor's out-degree, so a neighbor that points to many nodes passes less importance to each of them.|
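The out-degree weighting described in the table can be illustrated with a minimal power-iteration sketch. This is not how the project's centralities were computed (the source does not show that step). The function, the damping factor of 0.85, and the toy graph are assumptions for illustration only.

```python
def pagerank(edges, damping=0.85, tol=1e-12, max_iter=1000):
    """Power-iteration PageRank on a directed edge list [(source, target), ...]."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for src in nodes:
            for dst in out_links[src]:
                # Each source splits its rank equally among its out-links.
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        converged = sum(abs(new_rank[v] - rank[v]) for v in nodes) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Toy graph: x and y each have exactly one incoming edge,
# but x's source has out-degree 1 while y's source has out-degree 2.
edges = [("s1", "x"), ("s2", "y"), ("s2", "z")]

indegree = {node: 0 for edge in edges for node in edge}
for _, dst in edges:
    indegree[dst] += 1

ranks = pagerank(edges)
print("indegree:", indegree["x"], indegree["y"])
print("pagerank x > pagerank y:", ranks["x"] > ranks["y"])
```

Nodes `x` and `y` are indistinguishable by indegree, yet `x` receives the full rank of its single source while `y` receives only half of its source's rank. This is the kind of structural information that attribute columns do not carry.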


In [30]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error, r2_score

In [29]:
def CompareML_Regression(df):
    # train test split
    X_train, X_test, y_train, y_test = preprocessing_20features_traintestsplit(df)

    # models and hyperparameters 
    models = {
        'KNN': (KNeighborsRegressor(), {'n_neighbors': [2,4,8,16,20,24]}),
        'DecisionTree': (DecisionTreeRegressor(), {'max_depth': [2, 3, 5, 7, 9, 11], 'min_samples_leaf':[1,2,4,6,8]}),
        'RandomForest': (RandomForestRegressor(), {'n_estimators':[10, 20, 40, 80, 100], 'max_depth':[2, 3, 5, 7, 9, 11],
                                                 'min_samples_leaf':[1,2,4,6,8]}),
        'GradientBoosting':(GradientBoostingRegressor(), {'learning_rate':[0.001, 0.01, 0.1], 
                                                          'n_estimators':[10, 20, 40, 80, 100]}),
        'Ridge': (Ridge(), {'alpha': [0, 0.1, 1, 10, 100, 300, 500, 700]}),
        'SVM': (SVR(), {'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100], 'gamma':['scale', 'auto'], 
                        'kernel': ['linear', 'poly', 'rbf']})
    }

    # dictionary to store results
    results = {'Model': [], 'Best Params': [],'Best Score':[], 'MSE': [], 'R2': []}

    # perform GridSearchCV for each model
    for model_name, (model, param_grid) in models.items():
        cv = KFold(n_splits=5, shuffle=True, random_state=0)
        grid_search = GridSearchCV(model, param_grid, cv=cv, scoring='r2', n_jobs = -1)
        grid_search.fit(X_train, y_train)
        best_model = grid_search.best_estimator_
        best_score = grid_search.best_score_
        best_params = grid_search.best_params_
    
        # Evaluate the best model on the test set
        y_pred = best_model.predict(X_test)
    
        # Calculate evaluation metrics
        mse = mean_squared_error(y_test, y_pred)
        r2 = r2_score(y_test, y_pred)
    
        # Store the results
        results['Model'].append(model_name)
        results['Best Params'].append(best_params)
        results['Best Score'].append(best_score)
        results['MSE'].append(mse)
        results['R2'].append(r2)

    # store results to DataFrame
    results_df = pd.DataFrame(results)
    
    return results_df

### Predicting centrality with attributes

In [31]:
CompareML_Regression(df2)

Processed training features shape: (440, 20)
Processed testing features shape: (111, 20)


| |Model|Best Params|Best Score|MSE|R2|
|---|---|---|---|---|---|
|0|KNN|{'n_neighbors': 16}|0.40307|0.00041|0.35594|
|1|DecisionTree|{'max_depth': 3, 'min_samples_leaf': 4}|0.232572|0.000508|0.201973|
|2|RandomForest|{'max_depth': 7, 'min_samples_leaf': 2, 'n_est...|0.435975|0.000401|0.369874|
|3|GradientBoosting|{'learning_rate': 0.1, 'n_estimators': 40}|0.26977|0.000323|0.492978|
|4|Ridge|{'alpha': 500}|0.275974|0.00045|0.293966|
|5|SVM|{'C': 0.0001, 'gamma': 'scale', 'kernel': 'lin...|-13.416575|0.007858|-11.339013|


### Predicting centrality with attributes + other centralities

In [32]:
CompareML_Regression(df3)

Processed training features shape: (440, 20)
Processed testing features shape: (111, 20)


| |Model|Best Params|Best Score|MSE|R2|
|---|---|---|---|---|---|
|0|KNN|{'n_neighbors': 4}|0.719631|0.000118|0.814343|
|1|DecisionTree|{'max_depth': 7, 'min_samples_leaf': 2}|0.660801|0.000173|0.728438|
|2|RandomForest|{'max_depth': 5, 'min_samples_leaf': 2, 'n_est...|0.764072|8.8e-05|0.861817|
|3|GradientBoosting|{'learning_rate': 0.1, 'n_estimators': 80}|0.745397|0.000106|0.834083|
|4|Ridge|{'alpha': 300}|0.643556|0.000101|0.840702|
|5|SVM|{'C': 0.0001, 'gamma': 'scale', 'kernel': 'lin...|-13.416575|0.007858|-11.339013|
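The SVM row is identical in both regression runs, which suggests the model ignored the features entirely. One plausible explanation, offered as an assumption that has not been checked against the project's data, is that `SVR`'s default `epsilon=0.1` is wider than the whole spread of the PageRank target. In that case every training point already lies inside the epsilon tube and the fitted model is a constant. The sketch below reproduces that behavior on synthetic data and shows one possible remedy, standardizing the target with `TransformedTargetRegressor`. The data and coefficients are made up.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
# Synthetic target on a small, PageRank-like scale (not the project's data).
y = 0.02 + 0.005 * X[:, 0] + rng.normal(scale=0.0005, size=300)

# Plain SVR: the default epsilon=0.1 exceeds the full range of y,
# so no point violates the tube and the model collapses to a constant.
raw_svr = SVR(kernel="linear", C=1.0).fit(X, y)
raw_pred = raw_svr.predict(X)

# Same SVR, but fitted on a standardized target; predictions are mapped
# back to the original scale automatically.
scaled_svr = TransformedTargetRegressor(
    regressor=SVR(kernel="linear", C=1.0),
    transformer=StandardScaler(),
).fit(X, y)
scaled_pred = scaled_svr.predict(X)

print("raw SVR prediction range:", np.ptp(raw_pred))
print("raw SVR R2:", r2_score(y, raw_pred))
print("scaled-target SVR R2:", r2_score(y, scaled_pred))
```

If this is what happened here, adding a small `epsilon` to the SVR grid or scaling the target would be needed before the SVM row can be compared with the other models.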


## Classification

Network position is a multi-class target variable:

|class|description|
|---|---|
|largest community|nodes located in the largest weakly connected component and also in the largest community found by edge-betweenness community detection|
|other community|nodes located in the largest weakly connected component but not in the largest community found by edge-betweenness community detection|
|isolates|nodes outside the largest weakly connected component, including isolated communities and isolated nodes|
|not included|entities that do not appear in the network|
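The class definitions above translate into a simple decision rule once the largest weakly connected component and the largest community are known. The sketch below is a hypothetical reconstruction in plain Python, not the project's labeling code. It finds weakly connected components with a breadth-first search and takes the largest community as a given input, since the edge-betweenness community detection step is not reproduced here. The toy graph and all names are assumptions.

```python
from collections import defaultdict, deque


def weakly_connected_components(nodes, edges):
    """Connected components of a directed graph with edge direction ignored."""
    neighbors = defaultdict(set)
    for src, dst in edges:
        neighbors[src].add(dst)
        neighbors[dst].add(src)

    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        component, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            component.add(node)
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        components.append(component)
    return components


def position_class(entity, network_nodes, largest_component, largest_community):
    """Map an entity to one of the four network position classes."""
    if entity not in network_nodes:
        return "not included"
    if entity not in largest_component:
        return "isolates"
    if entity in largest_community:
        return "largest community"
    return "other community"


# Hypothetical toy network; "g" is in the network but has no edges.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("e", "f")]
network_nodes = {"a", "b", "c", "d", "e", "f", "g"}
components = weakly_connected_components(sorted(network_nodes), edges)
largest_component = max(components, key=len)
largest_community = {"a", "b"}  # assumed output of a community detection step

labels = {
    entity: position_class(entity, network_nodes, largest_component, largest_community)
    for entity in ["a", "c", "e", "g", "h"]
}
print(labels)
```

Every branch of this rule depends on graph structure (membership in components and communities), not on any per-node score, which is why the classification task is a useful complement to the centrality regression.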

In [9]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, f1_score, log_loss

In [33]:
def CompareML_Classification(df):
    # train test split
    X_train, X_test, y_train, y_test = preprocessing_20features_multiclasstarget_traintestsplit(df)

    # models and hyperparameters 
    models = {
        'KNN': (KNeighborsClassifier(), {'n_neighbors': [2,4,8,16,20,24]}),
        'DecisionTree': (DecisionTreeClassifier(), {'max_depth': [2, 3, 5, 7, 9, 11], 'min_samples_leaf':[1,2,4,6,8]}),
        'RandomForest': (RandomForestClassifier(), {'n_estimators':[10, 20, 40, 80, 100], 'max_depth':[2, 3, 5, 7, 9, 11],
                                                 'min_samples_leaf':[1,2,4,6,8]}),
        'GradientBoosting':(GradientBoostingClassifier(), {'learning_rate':[0.001, 0.01, 0.1], 
                                                          'n_estimators':[10, 20, 40, 80, 100]}),
        'LogisticRegression': (LogisticRegression(), {'C': [0.001, 0.01, 0.1, 1, 10, 100], 'max_iter':[1000]}),
        'SVM': (SVC(), {'C': [0.0001, 0.001, 0.01, 0.1, 1, 10, 100], 'gamma':['scale', 'auto'], 
                        'kernel': ['linear', 'poly', 'rbf']})
    }

    # dictionary to store results
    results = {'Model': [], 'Best Params': [],'Best Score':[], 'Accuracy': [], 'F1 Score': []}

    # perform GridSearchCV for each model
    for model_name, (model, param_grid) in models.items():
        cv = KFold(n_splits=5, shuffle=True, random_state=0)
        grid_search = GridSearchCV(model, param_grid, cv=cv, scoring='accuracy', n_jobs = -1)
        grid_search.fit(X_train, y_train)
        best_model = grid_search.best_estimator_
        best_score = grid_search.best_score_
        best_params = grid_search.best_params_
    
        # Evaluate the best model on the test set
        y_pred = best_model.predict(X_test)
    
        # Calculate evaluation metrics
        all_classes = np.unique(y_test)
        accuracy = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average = 'weighted', labels = all_classes)
    
        # Store the results
        results['Model'].append(model_name)
        results['Best Params'].append(best_params)
        results['Best Score'].append(best_score)
        results['Accuracy'].append(accuracy)
        results['F1 Score'].append(f1)
        

    # store results to DataFrame
    results_df = pd.DataFrame(results)
    
    return results_df

### Predicting network position with attributes

In [34]:
CompareML_Classification(df4)

Processed training features shape: (440, 20)
Processed testing features shape: (111, 20)
Encoded training target shape: (440,)
Encoded testing target shape: (111,)


| |Model|Best Params|Best Score|Accuracy|F1 Score|
|---|---|---|---|---|---|
|0|KNN|{'n_neighbors': 20}|0.575|0.630631|0.619331|
|1|DecisionTree|{'max_depth': 11, 'min_samples_leaf': 4}|0.552273|0.513514|0.513065|
|2|RandomForest|{'max_depth': 9, 'min_samples_leaf': 4, 'n_est...|0.636364|0.621622|0.614176|
|3|GradientBoosting|{'learning_rate': 0.1, 'n_estimators': 40}|0.643182|0.666667|0.665136|
|4|LogisticRegression|{'C': 100, 'max_iter': 1000}|0.638636|0.612613|0.598633|
|5|SVM|{'C': 100, 'gamma': 'scale', 'kernel': 'linear'}|0.647727|0.666667|0.642731|


### Predicting network position with attributes + centralities

In [35]:
CompareML_Classification(df5)

Processed training features shape: (440, 20)
Processed testing features shape: (111, 20)
Encoded training target shape: (440,)
Encoded testing target shape: (111,)


| |Model|Best Params|Best Score|Accuracy|F1 Score|
|---|---|---|---|---|---|
|0|KNN|{'n_neighbors': 2}|0.770455|0.810811|0.779203|
|1|DecisionTree|{'max_depth': 9, 'min_samples_leaf': 1}|0.752273|0.693694|0.694931|
|2|RandomForest|{'max_depth': 9, 'min_samples_leaf': 1, 'n_est...|0.784091|0.837838|0.819824|
|3|GradientBoosting|{'learning_rate': 0.1, 'n_estimators': 80}|0.759091|0.846847|0.843512|
|4|LogisticRegression|{'C': 100, 'max_iter': 1000}|0.797727|0.846847|0.842407|
|5|SVM|{'C': 100, 'gamma': 'scale', 'kernel': 'linear'}|0.790909|0.846847|0.844037|


## Conclusion

In the task of predicting PageRank centrality, models trained solely on entity attributes reach validation and test R² values below 0.5. The highest test R² is 0.49, obtained by gradient boosting. Among the tree-based models, the more sophisticated ones do better on the test set (gradient boosting 0.49 > random forest 0.37 > decision tree 0.20), but overall performance remains unsatisfactory. Adding other network characteristics (the remaining centralities) to the features raises R² substantially: to 0.64-0.76 on validation and 0.73-0.86 on the test set. SVM is excluded from these ranges because it produced the same strongly negative R² (-11.34 on the test set) in both runs.

Some may argue that this improvement stems from centrality indicators such as Katz and eigenvector centrality, which are computed in ways similar to PageRank, and that it therefore says little about the importance of network characteristics in general. The classification section addresses this objection.

In the task of predicting network position, models trained solely on entity attributes reach validation accuracy of 0.55-0.65 and test accuracy of 0.51-0.67. The highest test accuracy (0.67) is shared by gradient boosting and SVM, and gradient boosting has the highest F1 score (0.67). After adding centrality measures, test accuracy rises to 0.81-0.85 for every model except the decision tree (0.69), and F1 scores reach 0.78-0.84 (decision tree 0.69). Because network position classes are defined by components and communities rather than by node-level centrality, the improvement cannot be attributed to features and target measuring the same thing. In addition, preprocessing and model tuning were applied identically across sections, so the differences are not an artifact of the machine learning procedure.

These findings support the central argument of this project: network characteristics are distinct from attributes and, on these datasets, could not be predicted well from attribute data alone by any of the model families tried. Problems that are inherently network-based call for network or graph-based data collection and analytics.
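The size of the gap can be read directly off the result tables. The short sketch below recomputes the per-model test-set gains from the reported numbers. The scores are copied from the four tables above, the SVM regression row is left out because it failed in both runs, and the helper name `gains` is hypothetical.

```python
# Test-set R2 for the PageRank regression (SVM omitted: same negative score in both runs).
r2_attributes = {
    "KNN": 0.35594, "DecisionTree": 0.201973, "RandomForest": 0.369874,
    "GradientBoosting": 0.492978, "Ridge": 0.293966,
}
r2_with_centralities = {
    "KNN": 0.814343, "DecisionTree": 0.728438, "RandomForest": 0.861817,
    "GradientBoosting": 0.834083, "Ridge": 0.840702,
}

# Test-set accuracy for the network position classification.
acc_attributes = {
    "KNN": 0.630631, "DecisionTree": 0.513514, "RandomForest": 0.621622,
    "GradientBoosting": 0.666667, "LogisticRegression": 0.612613, "SVM": 0.666667,
}
acc_with_centralities = {
    "KNN": 0.810811, "DecisionTree": 0.693694, "RandomForest": 0.837838,
    "GradientBoosting": 0.846847, "LogisticRegression": 0.846847, "SVM": 0.846847,
}


def gains(before, after):
    """Per-model improvement after adding network features, rounded to 3 decimals."""
    return {model: round(after[model] - before[model], 3) for model in before}


r2_gain = gains(r2_attributes, r2_with_centralities)
acc_gain = gains(acc_attributes, acc_with_centralities)

for name, table in [("R2 gain", r2_gain), ("accuracy gain", acc_gain)]:
    print(name)
    for model, delta in table.items():
        print(f"  {model:<20}{delta:+.3f}")
```

Every model gains at least 0.34 in R² and at least 0.18 in accuracy once network features are available, whichever model family is used.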