#Data 620-Assignment03

Group members: Mahmud Hasan Al Raji and Warner Alexis.

This assignment is a group effort. It is organized around three requirements, which are addressed in the three parts below.

#Part-1. Identifying a dataset with network structure and at least one categorical variable
We selected the Deezer Europe Social Network dataset for this assignment. The data were collected from the public API in March 2020 and are available as CSV files from Stanford University's dataset collection site (SNAP, link: https://snap.stanford.edu/data/feather-deezer-social.html), where we obtained them. The dataset has 28,281 nodes and 92,752 edges. Nodes represent users from European countries, and edges represent mutual follower relationships. Gender is the categorical variable: in the target file, 0 represents Male and 1 represents Female. This dataset is suitable because it provides both the network structure and a categorical node attribute, which the assignment requires.

The dataset is a large, sparse network with a density of only 0.00023. On average, each user maintains about 6.56 connections, and the clustering coefficient (transitivity) of 0.0959 indicates a modest tendency toward small friendship circles in which friends of friends are also connected. Because path-based measures are most meaningful within a single connected component, the largest connected component (LCC) is extracted as a foundation for further analysis; here it contains all 28,281 nodes, so the network is fully connected. This dataset, with its binary gender labels, offers a valuable opportunity to explore centrality measures across categories, assess homophily, and compare how structural importance differs between groups.
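
The density and average degree quoted above follow directly from the node and edge counts. For an undirected simple graph with n nodes and m edges, density = 2m / (n(n - 1)) and average degree = 2m / n. The short check below is a sketch that uses only the two counts reported for this dataset; the variable names are illustrative.

```python
# Node and edge counts as reported for the Deezer Europe dataset
n_nodes = 28_281
n_edges = 92_752

# Undirected simple graph: density = 2m / (n(n-1)), mean degree = 2m / n
density = 2 * n_edges / (n_nodes * (n_nodes - 1))
avg_degree = 2 * n_edges / n_nodes

print(f"density    = {density:.5f}")
print(f"avg degree = {avg_degree:.2f}")
```

Both values should agree with what `nx.density(G)` and the degree average return once the graph is loaded.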

#Part-2. High-level plan for loading and analyzing the data

*Load the edges file and create a network graph using NetworkX.

*Load node attributes and assign gender to nodes.

*Calculate basic metrics: number of nodes, number of edges, density, average degree, and transitivity (global clustering coefficient).

*Calculate centrality measures: degree centrality, betweenness centrality, and closeness centrality.

*Compare mean degree centrality across gender groups.

*Several visualizations will be incorporated into the analysis:

-A histogram of degree distribution, plotted on a log-log scale, will illustrate how connectivity varies across users while highlighting the existence of highly connected hubs.

-A boxplot of degree centrality by gender will compare connectivity between male and female users, enabling us to assess whether one group tends to maintain more connections.

-A bar chart of class distribution will show the proportion of users in each gender category, providing a quick assessment of balance or skew in the dataset.

-A pie chart of homophily will capture the tendency of users to connect with others of the same gender versus across genders.

-Finally, a subgraph visualization (sampling 200–300 nodes) will be produced, where nodes are colored by gender and sized by degree centrality, offering an intuitive view of the network’s structure and categorical clustering.
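
The homophily pie chart in the plan rests on a simple count: the share of edges whose two endpoints have the same gender versus different genders. The sketch below illustrates that count on a made-up six-node edge list, not on the Deezer data; `toy_edges`, `toy_gender`, and `baseline` are hypothetical names. The baseline assumes random mixing by node proportions and ignores degree differences, so it is only a rough reference point.

```python
from collections import Counter

# Toy data (hypothetical): an undirected edge list and a gender label per node
toy_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 5)]
toy_gender = {0: "Male", 1: "Male", 2: "Female", 3: "Female", 4: "Male", 5: "Male"}

# Count edges that join two nodes of the same gender vs. different genders
same = sum(1 for u, v in toy_edges if toy_gender[u] == toy_gender[v])
cross = len(toy_edges) - same
same_share = same / len(toy_edges)

# Rough baseline: chance that two randomly drawn nodes share a gender
counts = Counter(toy_gender.values())
n = len(toy_gender)
baseline = sum((c / n) ** 2 for c in counts.values())

print(f"same-gender edges: {same}, cross-gender edges: {cross}")
print(f"same-gender share: {same_share:.2f} (random-mixing baseline ~ {baseline:.2f})")
```

A same-gender share well above the baseline would point toward homophily; a share near the baseline would suggest gender plays little role in who connects to whom.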

In [1]:
##Load the edges file and create a network graph using NetworkX.

import networkx as nx
import pandas as pd

edges = pd.read_csv("https://raw.githubusercontent.com/Raji030/data620_assignment03_deezer_edges/refs/heads/main/deezer_europe_edges.csv")
G = nx.from_pandas_edgelist(edges, source="node_1", target="node_2")


In [2]:
##Load node attributes and assign gender to nodes.
node_attr = pd.read_csv("https://raw.githubusercontent.com/Raji030/data620_assignment03_deezer_target/refs/heads/main/deezer_europe_target.csv")
gender_dict = node_attr.set_index("id")["target"].map({0: "Male", 1: "Female"}).to_dict()
nx.set_node_attributes(G, gender_dict, "gender")


In [10]:
# Basic Statistics
n_nodes = G.number_of_nodes()
n_edges = G.number_of_edges()
density = nx.density(G)
avg_degree = sum(dict(G.degree()).values()) / n_nodes
transitivity = nx.transitivity(G)

print(f'Number of nodes: {n_nodes}')
print(f'Number of edges: {n_edges}')
print(f'Density: {density}')
print(f'Average degree: {avg_degree}')
print(f'Transitivity: {transitivity}')

Number of nodes: 28281
Number of edges: 92752
Density: 0.00023194184729358083
Average degree: 6.559315441462466
Transitivity: 0.09592226364671026


In [14]:
# Extract largest connected component
largest_cc_nodes = max(nx.connected_components(G), key=len)
G_lcc = G.subgraph(largest_cc_nodes).copy()
lcc_size = len(largest_cc_nodes)
print(f'largest connected component:{lcc_size}')

largest connected component:28281


In [3]:
##Calculate centrality measures.

degree_centrality = nx.degree_centrality(G)
betweenness_centrality = nx.betweenness_centrality(G)
closeness_centrality = nx.closeness_centrality(G)
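
Exact betweenness centrality runs a shortest-path search from every node, which can be slow on a graph of this size. NetworkX's `betweenness_centrality` accepts a `k` argument that estimates the scores from `k` sampled source nodes instead. The sketch below shows the idea on the small built-in karate club graph as a stand-in for the Deezer network; `G_demo`, `k=20`, and `seed=42` are arbitrary illustrative choices, not values used in this assignment.

```python
import networkx as nx

# Small built-in graph used only as a stand-in for the Deezer network
G_demo = nx.karate_club_graph()

# Exact betweenness: one shortest-path search from every node
exact = nx.betweenness_centrality(G_demo)

# Approximation: shortest-path searches from k sampled source nodes only
approx = nx.betweenness_centrality(G_demo, k=20, seed=42)

# Compare which node each version ranks highest
top_exact = max(exact, key=exact.get)
top_approx = max(approx, key=approx.get)
print("top node (exact):", top_exact)
print("top node (approx):", top_approx)
```

Sampled scores are estimates, so rankings among nodes with similar betweenness can shift; a larger `k` trades run time for accuracy.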


In [8]:
#print("Degree Centrality :", degree_centrality)
#print("Betweenness Centrality:", betweenness_centrality)
#print("Closeness_centrality:",closeness_centrality)

# Combine these three centralities into a data frame
centrality_df = pd.DataFrame({
    "Degree Centrality": pd.Series(degree_centrality),
    "Betweenness Centrality": pd.Series(betweenness_centrality),
    "Closeness Centrality": pd.Series(closeness_centrality)
})

# Reset the index so that node IDs appear as a column
centrality_df = centrality_df.reset_index().rename(columns={"index": "Node"})

# Sort by degree centrality in descending order
centrality_df = centrality_df.sort_values(by="Degree Centrality", ascending=False)

# See top 10 rows
print(centrality_df.head(10))


       Node  Degree Centrality  Betweenness Centrality  Closeness Centrality
508     867           0.006082                0.019998              0.225652
2284    396           0.005375                0.009654              0.217430
509    1878           0.005127                0.011318              0.216956
995   24904           0.004137                0.018064              0.225833
1263   5989           0.003960                0.010288              0.224767
844   24069           0.003890                0.007732              0.216051
737   17963           0.003395                0.008590              0.179218
4322  23143           0.003324                0.019281              0.227648
772   11080           0.003289                0.006729              0.175809
7026  21798           0.003253                0.008633              0.213217


In [4]:
##Compare mean degree centrality across gender groups.

male_nodes = [n for n, attr in G.nodes(data=True) if attr['gender'] == 'Male']
female_nodes = [n for n, attr in G.nodes(data=True) if attr['gender'] == 'Female']

mean_male_degree = sum([degree_centrality[n] for n in male_nodes]) / len(male_nodes)
mean_female_degree = sum([degree_centrality[n] for n in female_nodes]) / len(female_nodes)

print("Mean Degree Centrality (Male):", mean_male_degree)
print("Mean Degree Centrality (Female):", mean_female_degree)


Mean Degree Centrality (Male): 0.00023811575266473027
Mean Degree Centrality (Female): 0.00022418975028783782
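
The two means above differ only slightly, so it is worth asking whether a gap of that size could arise by chance. One option is a permutation test: shuffle the group labels many times and see how often the shuffled difference in means is at least as large as the observed one. The sketch below is a minimal pure-Python version run on made-up degree values; `permutation_test`, the sample lists, and the choice of 5000 permutations are all hypothetical and not part of the original analysis.

```python
import random

def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group membership at random, keeping group sizes fixed
        rng.shuffle(pooled)
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a)
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive
    return observed, (extreme + 1) / (n_permutations + 1)

# Hypothetical degree values for two small groups (not taken from the dataset)
male_degrees = [3, 5, 8, 2, 7, 6, 4, 9]
female_degrees = [4, 6, 3, 5, 2, 7, 5, 4]

diff, p_value = permutation_test(male_degrees, female_degrees)
print(f"difference in means = {diff:.3f}, estimated p-value = {p_value:.3f}")
```

On the real data, the two lists would be the degree centralities of `male_nodes` and `female_nodes`; a vectorized implementation would likely be preferable at that scale.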


#Part-3. Hypothetical demonstration of comparing centrality measures across groups

Objective: Investigate whether gender affects centrality in the Deezer social network.

Hypothesis: Female users may have higher degree centrality than male users.

Result:

Mean Degree Centrality (Male): 0.000238

Mean Degree Centrality (Female): 0.000224

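Interpretation: Across the full network, male users have a slightly higher mean degree centrality (0.000238 versus 0.000224), meaning they hold marginally more connections on average and may have slightly better visibility. The gap is small and has not been tested for statistical significance.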
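
Degree centrality values this small are hard to read directly. NetworkX defines degree centrality as degree divided by (n - 1), so multiplying each group mean by (n - 1) recovers the average number of connections. The sketch below applies that conversion to the two means printed earlier, assuming the reported node count of 28,281; the dictionary names are illustrative.

```python
n_nodes = 28_281  # node count reported for the network

# Mean degree centrality per group, as printed in the comparison output
mean_dc = {"Male": 0.00023811575266473027, "Female": 0.00022418975028783782}

# Degree centrality is degree / (n - 1), so multiplying back recovers degree
mean_degree = {g: c * (n_nodes - 1) for g, c in mean_dc.items()}

for g, d in mean_degree.items():
    print(f"{g}: about {d:.2f} connections on average")
```

Expressed this way, the difference between the groups amounts to well under one connection per user.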

Expected Use: Understanding centrality differences can inform marketing strategies, community engagement, or targeted content delivery. Nodes with higher centrality might be more influential.

Conclusion: Although we hypothesized that female users would have higher degree centrality, the observed data show male users with a slightly higher mean. This demonstrates how centrality measures can be compared across a categorical variable such as gender to explore which groups of nodes are likely to be better or worse positioned in a network.