# Cyberbullying Network Analysis on Twitter Dataset

## Background
With the increasing prevalence of social media, cyberbullying has become a significant issue. This project uses a dataset of roughly 47,000 tweets, each labeled with one of six classes (five cyberbullying types plus `not_cyberbullying`), to build an interaction network and compare centrality measures across classes.

## Objectives
1. **Load Data:** Import the dataset containing tweets and user interactions.
2. **Network Construction:** Build a directed graph representing user interactions.
3. **Centrality Measures:** Calculate degree and eigenvector centrality for each user.
4. **Comparative Analysis:** Compare centrality measures across different cyberbullying categories.

## Step 1: Load Data
I'll start by loading the dataset using pandas.


In [20]:
import pandas as pd

file_path = r'C:\Users\bobbt\Downloads\archive\cyberbullying_tweets.csv'
data = pd.read_csv(file_path)

data.head()


| | tweet_text | cyberbullying_type |
|---|---|---|
| 0 | In other words #katandandre, your food was cra... | not_cyberbullying |
| 1 | Why is #aussietv so white? #MKR #theblock #ImA... | not_cyberbullying |
| 2 | @XochitlSuckkks a classy whore? Or more red ve... | not_cyberbullying |
| 3 | @Jason_Gio meh. :P thanks for the heads up, b... | not_cyberbullying |
| 4 | @RudhoeEnglish This is an ISIS account pretend... | not_cyberbullying |
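
Before building a graph, it may help to confirm which columns the file contains and how many rows fall under each label. The sketch below is only illustrative: `summarize_labels` is a hypothetical helper, and `sample` is a tiny made-up frame standing in for `data`.

```python
import pandas as pd

def summarize_labels(df, label_col="cyberbullying_type"):
    """Return the column names and the number of rows per label."""
    columns = list(df.columns)
    counts = df[label_col].value_counts().to_dict()
    return columns, counts

# Hypothetical stand-in for `data`; the real file has roughly 47,000 rows.
sample = pd.DataFrame({
    "tweet_text": ["@a hi", "no mention here", "@b @c hello"],
    "cyberbullying_type": ["age", "not_cyberbullying", "age"],
})

columns, counts = summarize_labels(sample)
print(columns)
print(counts)
```

In the `data.head()` output above, only `tweet_text` and `cyberbullying_type` appear, so any notion of a "user" has to be inferred from the tweet text or simulated.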


## Step 2: Construct the Network
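Using networkx, I construct a directed graph in which nodes represent users and edges represent interactions such as replies or mentions. The dataset contains only the tweet text and its label, with no user identifiers, so the code below treats each row as a distinct placeholder user (`user_{index}`) and links it to the next row as a stand-in for a mention. The result is a single directed ring, not a graph of observed interactions.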


In [19]:
import networkx as nx

G = nx.DiGraph()

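# The dataset has no user column, so each row is treated as a distinct placeholder user.
for index, row in data.iterrows():
    user = f"user_{index}"
    G.add_node(user, category=row['cyberbullying_type'])

    # Placeholder edge: link each row to the next one (wrapping around at the end),
    # which forms a single directed ring.
    mentioned_user = f"user_{(index + 1) % len(data)}"
    G.add_edge(user, mentioned_user)

# Sanity check: every node should carry a 'category' attribute.
for node in G.nodes(data=True):
    if 'category' not in node[1]:
        print(f"Node {node[0]} is missing the 'category' attribute")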

# Calculate centrality measures
degree_centrality = nx.degree_centrality(G)
eigenvector_centrality = nx.eigenvector_centrality(G)
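
The loop above links row `i` to row `i + 1` as a placeholder. If edges based on the tweets themselves are wanted, one option is to parse `@mentions` out of `tweet_text`. The following is a minimal sketch under stated assumptions: `build_mention_graph` and `MENTION_RE` are hypothetical names, authors are still placeholder ids because the dataset has no author column, and the regex is a simplification of what counts as a handle.

```python
import re

import networkx as nx
import pandas as pd

# Assumption: a handle is '@' followed by letters, digits, or underscores.
MENTION_RE = re.compile(r"@(\w+)")

def build_mention_graph(df, text_col="tweet_text", label_col="cyberbullying_type"):
    """One placeholder author node per row, with an edge to each @mentioned handle."""
    graph = nx.DiGraph()
    for index, row in df.iterrows():
        author = f"user_{index}"  # placeholder id; the data has no author column
        graph.add_node(author, category=row[label_col])
        for handle in MENTION_RE.findall(str(row[text_col])):
            # Lower-case so '@Alice' and '@alice' map to the same node.
            graph.add_edge(author, f"@{handle.lower()}")
    return graph

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "tweet_text": ["@Alice you again", "nothing to see", "@alice and @Bob stop"],
    "cyberbullying_type": ["age", "not_cyberbullying", "age"],
})

mention_graph = build_mention_graph(sample)
print(mention_graph.number_of_nodes(), mention_graph.number_of_edges())
print(dict(mention_graph.in_degree()))
```

Mentioned handles are added as nodes without a `category` attribute, so any later grouping by category would need to skip them (for example with `graph.nodes[node].get('category')`).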


In [23]:
# Display centrality measures summary
print("Degree Centrality (sample):", dict(list(degree_centrality.items())[:10]))  # Show first 10
print("Eigenvector Centrality (sample):", dict(list(eigenvector_centrality.items())[:10]))  # Show first 10

Degree Centrality (sample): {'user_0': 4.193663374640918e-05, 'user_1': 4.193663374640918e-05, 'user_2': 4.193663374640918e-05, 'user_3': 4.193663374640918e-05, 'user_4': 4.193663374640918e-05, 'user_5': 4.193663374640918e-05, 'user_6': 4.193663374640918e-05, 'user_7': 4.193663374640918e-05, 'user_8': 4.193663374640918e-05, 'user_9': 4.193663374640918e-05}
Eigenvector Centrality (sample): {'user_0': 0.0045790694701091095, 'user_1': 0.0045790694701091095, 'user_2': 0.0045790694701091095, 'user_3': 0.0045790694701091095, 'user_4': 0.0045790694701091095, 'user_5': 0.0045790694701091095, 'user_6': 0.0045790694701091095, 'user_7': 0.0045790694701091095, 'user_8': 0.0045790694701091095, 'user_9': 0.0045790694701091095}


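The degree centrality values are identical for every user shown (about 4.19e-05), and so are the eigenvector centrality values (about 0.00458). This follows directly from how the graph was built: each placeholder user has exactly one incoming and one outgoing edge, so the network is a single directed ring and no node is more central than any other. These values therefore describe the placeholder construction, not real interaction patterns in the data.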
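
The identical values can be reproduced on a small graph. The sketch below rebuilds the same "row `i` points to row `i + 1`" construction for a hypothetical `n = 10` and compares networkx's output with the closed-form values for a directed ring: degree centrality `2 / (n - 1)` and eigenvector centrality `1 / sqrt(n)`. The helper name `ring_graph` is an assumption introduced for illustration.

```python
import math

import networkx as nx

def ring_graph(n):
    """Directed ring: user_i -> user_((i + 1) mod n), mirroring the loop in Step 2."""
    graph = nx.DiGraph()
    for i in range(n):
        graph.add_edge(f"user_{i}", f"user_{(i + 1) % n}")
    return graph

n = 10  # small illustrative size, not the size of the real dataset
ring = ring_graph(n)
degree = nx.degree_centrality(ring)
eigen = nx.eigenvector_centrality(ring)

# Each node has in-degree 1 and out-degree 1, normalized by (n - 1).
expected_degree = 2 / (n - 1)
# The uniform vector is the leading eigenvector; networkx scales it to unit length.
expected_eigen = 1 / math.sqrt(n)

print(max(abs(v - expected_degree) for v in degree.values()))
print(max(abs(v - expected_eigen) for v in eigen.values()))
```

Plugging the reported values into the same formulas is consistent with a ring of 47,692 nodes: `2 / 47691` is about 4.19e-05 and `1 / sqrt(47692)` is about 0.00458.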

## Step 3: Group Nodes by Category and Perform Comparative Analysis
I group nodes by their cyberbullying category and compare the degree centrality of each category against `not_cyberbullying` using independent two-sample t-tests.


In [16]:
import scipy.stats as stats

categories = data['cyberbullying_type'].unique()
centrality_measures = {category: [] for category in categories}

for node in G.nodes():
    category = G.nodes[node]['category']
    centrality_measures[category].append(degree_centrality[node])

# Perform t-test for each category against 'not_cyberbullying'
for category in categories:
    if category != 'not_cyberbullying':
        t_stat, p_value = stats.ttest_ind(centrality_measures[category], centrality_measures['not_cyberbullying'])
        print(f"{category} vs not_cyberbullying: t={t_stat}, p={p_value}")

gender vs not_cyberbullying: t=0.0, p=1.0
religion vs not_cyberbullying: t=0.0, p=1.0
other_cyberbullying vs not_cyberbullying: t=39.7977183454363, p=0.0
age vs not_cyberbullying: t=0.0, p=1.0
ethnicity vs not_cyberbullying: t=0.0, p=1.0
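
A t-test is not informative when both samples are constant, which is what a ring graph produces. The sketch below shows one way to guard against that case. It is an assumption-laden variant, not the cell above: `compare_to_baseline` is a hypothetical helper, it swaps in Welch's t-test (`equal_var=False`) in place of the pooled-variance default, and the input values are made up.

```python
import numpy as np
from scipy import stats

def compare_to_baseline(groups, baseline="not_cyberbullying"):
    """Welch t-test of each group against the baseline.

    Returns None for a group when both samples are constant, since the
    test statistic is undefined or meaningless in that case.
    """
    results = {}
    base = np.asarray(groups[baseline], dtype=float)
    for name, values in groups.items():
        if name == baseline:
            continue
        sample = np.asarray(values, dtype=float)
        both_constant = sample.max() == sample.min() and base.max() == base.min()
        if both_constant:
            results[name] = None
            continue
        t_stat, p_value = stats.ttest_ind(sample, base, equal_var=False)
        results[name] = (float(t_stat), float(p_value))
    return results

# Made-up centrality values for illustration only.
toy_groups = {
    "not_cyberbullying": [0.1, 0.1, 0.1],
    "age": [0.1, 0.1, 0.1],
    "other_cyberbullying": [0.2, 0.3, 0.25, 0.35],
}

results = compare_to_baseline(toy_groups)
for name, outcome in results.items():
    print(name, outcome)
```

Returning `None` makes the degenerate comparisons explicit instead of reporting a statistic that looks like a real "no difference" result.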


## Conclusion

I analyzed the centrality measures in the cyberbullying network and performed t-tests to compare each cyberbullying category against the 'not_cyberbullying' category. Here is what I got:

- **Gender vs. Not Cyberbullying:** t=0.0, p=1.0
- **Religion vs. Not Cyberbullying:** t=0.0, p=1.0
- **Other Cyberbullying vs. Not Cyberbullying:** t=39.80, p=0.0
- **Age vs. Not Cyberbullying:** t=0.0, p=1.0
- **Ethnicity vs. Not Cyberbullying:** t=0.0, p=1.0

### Interpretation

The t-tests show that:

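- **Gender, Religion, Age, and Ethnicity:** No significant difference in degree centrality compared to 'not_cyberbullying' (t=0.0, p=1.0).
- **Other Cyberbullying:** A significant difference in degree centrality compared to 'not_cyberbullying' (t=39.80, with a p-value too small to represent, so it prints as 0.0). The positive t-statistic indicates a higher mean degree centrality. However, the ring built in Step 2 gives every node the same degree centrality, so this difference cannot come from that graph. The t-test cell (In [16]) was executed before the graph-building cell (In [19]), so the result likely reflects an earlier graph state and should be re-checked after running the cells in order.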
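
A per-category summary can make it easier to see whether a t-test result reflects a real difference in means. The sketch below is a hedged illustration: `centrality_by_category` is a hypothetical helper, and the four-node graph is made up so the output is easy to check by hand.

```python
import networkx as nx
import pandas as pd

def centrality_by_category(graph, centrality):
    """Mean, standard deviation, and count of a centrality score per node category."""
    rows = [
        {"category": attrs["category"], "centrality": centrality[node]}
        for node, attrs in graph.nodes(data=True)
        if "category" in attrs  # skip nodes that carry no label
    ]
    frame = pd.DataFrame(rows)
    return frame.groupby("category")["centrality"].agg(["mean", "std", "count"])

# Made-up graph: node 'a' receives two edges and sends one.
toy = nx.DiGraph()
toy.add_node("a", category="age")
toy.add_node("b", category="age")
toy.add_node("c", category="not_cyberbullying")
toy.add_node("d", category="not_cyberbullying")
toy.add_edges_from([("a", "b"), ("c", "a"), ("d", "a")])

summary = centrality_by_category(toy, nx.degree_centrality(toy))
print(summary)
```

If every category shows the same mean and zero standard deviation, as a ring graph would, any reported difference between categories deserves a second look.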
