
# Project 1/Data 620

Assignment terms:

- Identify and load a network dataset that has some categorical information available for each node.
- For each of the nodes in the dataset, calculate degree centrality and eigenvector centrality.
- Compare your centrality measures across your categorical groups.

For example, using the dataset from a study of romantic relationships among high school students in Columbus, Ohio [http://researchnews.osu.edu/archive/chains.htm], you might want to determine if there are any differences in your calculated centrality measures between the two sexes. You might use a t-test or another statistical measure to support your findings.




**Data Source:**
[MUSAE Facebook Page-Page Network](https://www.kaggle.com/datasets/rozemberczki/musae-facebook-pagepage-network)

Description: This dataset consists of a network of Facebook pages where nodes represent pages and edges represent mutual likes between pages. Each node has an attribute representing the category of the page.

**High-Level Plan:**
To analyze and compare the degree centrality and eigenvector centrality across different "page_type" categories in the Facebook dataset, we will follow these steps:
- Load the datasets from our public GitHub account for reproducibility.
- Create a graph and add node attributes.
- Calculate degree centrality and eigenvector centrality.
- Compare centrality measures across "page_type" categories.
- Perform hypothesis testing to determine if there are significant differences in centrality measures between different "page_type" categories. The results are interpreted based on the p-values.

**Load the datasets.**

In [1]:
import pandas as pd
import networkx as nx
from scipy.stats import ttest_ind

# Load the datasets
edges_url = 'https://raw.githubusercontent.com/Heleinef/Data-Science-Master_Heleine/main/musae_facebook_edges.csv'
edges = pd.read_csv(edges_url)

node_attrs_url = 'https://raw.githubusercontent.com/Heleinef/Data-Science-Master_Heleine/main/musae_facebook_target.csv'
node_attrs = pd.read_csv(node_attrs_url)

# Print the columns and the first few rows of node attributes to identify the categorical variable
print(node_attrs.columns)
print(node_attrs.head())

Index(['id', 'facebook_id', 'page_name', 'page_type'], dtype='object')
   id       facebook_id                                     page_name  \
0   0   145647315578475                      The Voice of China 中国好声音   
1   1      191483281412                 U.S. Consulate General Mumbai   
2   2   144761358898518                                          ESET   
3   3   568700043198473  Consulate General of Switzerland in Montreal   
4   4  1408935539376139             Mark Bailey MP - Labor for Miller   

    page_type  
0      tvshow  
1  government  
2     company  
3  government  
4  politician  


**Create a graph and add node attributes.**

In [2]:
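# 'page_type' is the categorical variable used for grouping
# Create an undirected graph from the edge list
G = nx.from_pandas_edgelist(edges, 'id_1', 'id_2', create_using=nx.Graph())

# Attach every column of node_attrs to the matching node.
# set_node_attributes skips ids that are not in the graph instead of raising a KeyError.
nx.set_node_attributes(G, node_attrs.set_index('id', drop=False).to_dict('index'))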
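
Before computing centrality, it can help to check that the graph and the attribute table line up. The sketch below uses a small hypothetical edge list and node table (`toy_edges`, `toy_attrs`, and `T` are illustrative names, not part of the notebook) to count self-loops, find nodes without a `page_type`, and find ids in the table that never appear in an edge.

```python
import networkx as nx
import pandas as pd

# Hypothetical toy data standing in for the edge list and node table
toy_edges = pd.DataFrame({'id_1': [0, 0, 1, 2], 'id_2': [1, 2, 2, 2]})  # (2, 2) is a self-loop
toy_attrs = pd.DataFrame({'id': [0, 1, 2, 3],
                          'page_type': ['tvshow', 'company', 'company', 'government']})

T = nx.from_pandas_edgelist(toy_edges, 'id_1', 'id_2', create_using=nx.Graph())
# Ids that are not nodes of T (here id 3) are silently skipped
nx.set_node_attributes(T, toy_attrs.set_index('id')['page_type'].to_dict(), name='page_type')

# Self-loops add to a node's degree, so they also affect degree centrality
self_loops = nx.number_of_selfloops(T)
# Nodes that appear in an edge but have no category
missing_attr = [n for n, data in T.nodes(data=True) if 'page_type' not in data]
# Rows of the attribute table that never appear in an edge
ids_not_in_graph = sorted(set(toy_attrs['id'].tolist()) - set(T.nodes))

print(self_loops, missing_attr, ids_not_in_graph)
```

The same three checks could be run on `G` and `node_attrs`; whether any self-loops should be removed first is a modeling choice that the notebook leaves open.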


**Calculate degree centrality and eigenvector centrality.**


In [9]:
# Calculate degree centrality
degree_centrality = nx.degree_centrality(G)

# Calculate eigenvector centrality
eigenvector_centrality = nx.eigenvector_centrality(G, max_iter=1000)
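
The magnitudes of the two measures are easier to read once their normalization is clear. The sketch below uses NetworkX's built-in karate club graph, chosen only for illustration, to show that degree centrality is the degree divided by n - 1 and that the eigenvector centrality scores are scaled so their squares sum to 1.

```python
import math
import networkx as nx

K = nx.karate_club_graph()  # small built-in graph, used only for illustration
n = K.number_of_nodes()

deg_c = nx.degree_centrality(K)
eig_c = nx.eigenvector_centrality(K, max_iter=1000)

# Degree centrality is the raw degree divided by the maximum possible degree, n - 1
assert all(math.isclose(deg_c[v], K.degree[v] / (n - 1)) for v in K)

# Eigenvector centrality scores are scaled to unit Euclidean norm
norm = math.sqrt(sum(x * x for x in eig_c.values()))
print(round(norm, 6))
```

Because of this scaling, values of both measures shrink as a network grows, so small numbers in a graph with tens of thousands of nodes are expected.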



**Compare centrality measures across "page_type" categories.**

In [5]:
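# Create a DataFrame for centrality measures, looking up both measures by node id
# so the columns stay aligned regardless of dictionary order
centrality_df = pd.DataFrame({
    'id': list(degree_centrality),
    'degree_centrality': [degree_centrality[n] for n in degree_centrality],
    'eigenvector_centrality': [eigenvector_centrality[n] for n in degree_centrality]
})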

# Merge with node attributes to include categorical variables
centrality_df = centrality_df.merge(node_attrs, on='id')
print(centrality_df.head())

# Compare centrality measures by page_type
page_type_centrality = centrality_df.groupby('page_type')[['degree_centrality', 'eigenvector_centrality']].mean()
print(page_type_centrality)

      id  degree_centrality  eigenvector_centrality       facebook_id  \
0      0           0.000045            1.128138e-07   145647315578475   
1  18427           0.002270            1.184549e-05  1563883503866518   
2      1           0.001513            7.961069e-03      191483281412   
3  21708           0.008679            2.562776e-02      367067283779   
4  22208           0.009124            3.451845e-02        9229303148   

                                           page_name   page_type  
0                           The Voice of China 中国好声音      tvshow  
1                                   The Voice Global      tvshow  
2                      U.S. Consulate General Mumbai  government  
3  U.S. Department of State: Bureau of Intl Narco...  government  
4         U.S. Department of State: Consular Affairs  government  
            degree_centrality  eigenvector_centrality
page_type                                            
company              0.000330                0.000116
government           0.001151                0.004060
politician           0.000656                0.000102
tvshow               0.000414                0.000061

# Output interpretation and analysis:
Degree centrality measures the number of connections a node has, which NetworkX normalizes by the maximum possible degree (n - 1). Higher values indicate more connections.
Eigenvector centrality considers both the number of connections and how central the connected nodes are themselves. Higher values indicate more influence within the network.
Among the first five rows, nodes 21708 and 22208 (government pages) have the highest degree and eigenvector centrality, so they are the most connected and influential pages in this sample.
The two TV show pages in the sample have lower degree centrality than those two government pages and the lowest eigenvector centrality of the five rows.
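
To make the difference between the two measures concrete, here is a minimal sketch on a hypothetical seven-node graph (`H` is an illustrative name, not part of the notebook). Nodes 1 and 5 have the same degree, but node 1 is adjacent to the hub, so its eigenvector centrality comes out higher.

```python
import networkx as nx

# Hypothetical toy graph: node 0 is a hub with leaves 1-4,
# and a short chain 1-5-6 hangs off leaf 1
H = nx.Graph([(0, 1), (0, 2), (0, 3), (0, 4), (1, 5), (5, 6)])

deg = dict(H.degree())
eig = nx.eigenvector_centrality(H, max_iter=1000)

# Nodes 1 and 5 both have two neighbors, but only node 1 touches the hub
for node in (1, 5):
    print(node, deg[node], round(eig[node], 3))
```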

The second part of the output shows the average degree and eigenvector centrality for each page type:
- Government pages have the highest average degree centrality (0.001151) and eigenvector centrality (0.004060), indicating that these pages are, on average, the most connected and influential in the network.
- Company pages have the second-highest average eigenvector centrality (0.000116) but the lowest average degree centrality (0.000330), indicating moderate influence despite having the fewest connections on average.
- Politician pages have a higher average degree centrality (0.000656) than company and TV show pages but a lower average eigenvector centrality (0.000102) than company pages, indicating they are fairly well connected but slightly less influential.
- TV show pages have the lowest average eigenvector centrality (0.000061) and a relatively low average degree centrality (0.000414), indicating they are the least influential category, although company pages have fewer connections on average.
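
Group means can be dominated by a few very well connected pages, so it may be worth reporting the group size and median next to the mean. The sketch below uses hypothetical values (the `toy` frame is not the real data) in which a single hub reverses the ranking of the two groups depending on which statistic is used.

```python
import pandas as pd

# Hypothetical centrality values; one hub (0.021) inflates the 'government' mean
toy = pd.DataFrame({
    'page_type': ['government'] * 4 + ['tvshow'] * 4,
    'degree_centrality': [0.001, 0.001, 0.001, 0.021, 0.002, 0.002, 0.003, 0.001],
})

# Report group size, mean, and median side by side
summary = toy.groupby('page_type')['degree_centrality'].agg(['count', 'mean', 'median'])
print(summary)
```

Applied to `centrality_df`, the same `agg` call would show whether the ordering of the page types by mean also holds for the median.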



**Perform hypothesis testing to determine if there are significant differences in centrality measures between different "page_type" categories.**

In [7]:
# Compare two page types with independent two-sample t-tests
page_type1 = 'tvshow'
page_type2 = 'government'

page_type1_deg_cent = centrality_df[centrality_df['page_type'] == page_type1]['degree_centrality']
page_type2_deg_cent = centrality_df[centrality_df['page_type'] == page_type2]['degree_centrality']

page_type1_eig_cent = centrality_df[centrality_df['page_type'] == page_type1]['eigenvector_centrality']
page_type2_eig_cent = centrality_df[centrality_df['page_type'] == page_type2]['eigenvector_centrality']

# Perform t-tests for degree centrality
t_stat_deg, p_value_deg = ttest_ind(page_type1_deg_cent, page_type2_deg_cent)
print(f"T-test for Degree Centrality between {page_type1} and {page_type2}: T-statistic = {t_stat_deg}, P-value = {p_value_deg}")

# Perform t-tests for eigenvector centrality
t_stat_eig, p_value_eig = ttest_ind(page_type1_eig_cent, page_type2_eig_cent)
print(f"T-test for Eigenvector Centrality between {page_type1} and {page_type2}: T-statistic = {t_stat_eig}, P-value = {p_value_eig}")

# Interpret the results
def interpret_t_test(p_value, measure):
    if p_value < 0.05:
        print(f"There is a significant difference in {measure} between {page_type1} and {page_type2}.")
    else:
        print(f"There is no significant difference in {measure} between {page_type1} and {page_type2}.")

interpret_t_test(p_value_deg, 'degree centrality')
interpret_t_test(p_value_eig, 'eigenvector centrality')

T-test for Degree Centrality between tvshow and government: T-statistic = -24.030932285131964, P-value = 3.569438341614403e-124
T-test for Eigenvector Centrality between tvshow and government: T-statistic = -20.461841633756865, P-value = 3.1442877859912323e-91
There is a significant difference in degree centrality between tvshow and government.
There is a significant difference in eigenvector centrality between tvshow and government.
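
The t-tests above use `ttest_ind` with its default equal-variance assumption. Centrality scores are often right-skewed with unequal group variances, so a robustness check could use Welch's t-test and the rank-based Mann-Whitney U test. The sketch below runs both on synthetic lognormal samples; the data and the names `group_a` and `group_b` are assumptions for illustration, not the notebook's results.

```python
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

rng = np.random.default_rng(0)
# Synthetic right-skewed samples standing in for two page types (not the real data)
group_a = rng.lognormal(mean=-8.0, sigma=1.0, size=500)
group_b = rng.lognormal(mean=-7.0, sigma=1.0, size=500)

# Welch's t-test: does not assume equal variances
t_welch, p_welch = ttest_ind(group_a, group_b, equal_var=False)

# Mann-Whitney U: rank-based, makes no normality assumption
u_stat, p_mwu = mannwhitneyu(group_a, group_b, alternative='two-sided')

print(f"Welch: t = {t_welch:.2f}, p = {p_welch:.3g}")
print(f"Mann-Whitney: U = {u_stat:.0f}, p = {p_mwu:.3g}")
```

On the real data, the two series passed to `ttest_ind` in the cell above could be substituted for `group_a` and `group_b`.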


# Output interpretation and analysis:
The output shows that government pages have higher degree centrality and higher eigenvector centrality, while TV show pages have lower degree centrality and lower eigenvector centrality.

**Degree Centrality**

T-statistic: -24.030932285131964
P-value: 3.569438341614403e-124

- The t-statistic of -24.03 indicates that the difference in mean degree centrality between "tvshow" and "government" pages is large relative to its standard error. The negative sign shows that the mean degree centrality of "tvshow" pages is lower than that of "government" pages.
- The p-value of 3.57e-124 is extremely small, far below the typical significance level of 0.05, so the observed difference in degree centrality is highly statistically significant.
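
With thousands of nodes per group, very small p-values are easy to obtain, so an effect size can help show how large the difference is in practical terms. Below is a minimal sketch of Cohen's d using the pooled standard deviation; `cohens_d` is a hypothetical helper and the input values are made up for illustration.

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical values: the means differ by 1 and both samples have standard deviation 1
d = cohens_d([1, 2, 3], [2, 3, 4])
print(d)
```

A negative d means the first group has the lower mean, matching the sign convention of the t-statistic.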

**Eigenvector Centrality**

T-statistic: -20.461841633756865
P-value: 3.1442877859912323e-91

- The t-statistic of -20.46 indicates that the difference in mean eigenvector centrality between "tvshow" and "government" pages is large relative to its standard error. The negative sign shows that the mean eigenvector centrality of "tvshow" pages is lower than that of "government" pages.
- The p-value of 3.14e-91 is extremely small, far below the typical significance level of 0.05, so the observed difference in eigenvector centrality is highly statistically significant.
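
The tests above compare only two of the four page types. One way to compare all categories at once is a Kruskal-Wallis test, sketched here on synthetic data. The frame `toy_df` and its values are assumptions for illustration, and this is a possible extension rather than part of the original analysis.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

rng = np.random.default_rng(1)
# Synthetic stand-in for centrality_df with four categories (not the real data)
toy_df = pd.DataFrame({
    'page_type': np.repeat(['company', 'government', 'politician', 'tvshow'], 200),
    'degree_centrality': np.concatenate([
        rng.lognormal(-8.0, 1.0, 200),
        rng.lognormal(-7.0, 1.0, 200),
        rng.lognormal(-7.5, 1.0, 200),
        rng.lognormal(-8.0, 1.0, 200),
    ]),
})

# One array of centrality values per page type
groups = [g['degree_centrality'].to_numpy() for _, g in toy_df.groupby('page_type')]

# Kruskal-Wallis: rank-based test that at least one group differs from the others
h_stat, p_kw = kruskal(*groups)
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kw:.3g}")
```

A significant result only says that some group differs; pairwise follow-up tests with a multiple-comparison correction would be needed to say which ones.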
