<a href="https://colab.research.google.com/github/AMSUCF/DHProgramming/blob/main/bluesky_visualization/Bluesky_Gemini_Visualization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Please analyze the hashtags common among these bluesky posts contained in the file, and visualize them using interactive HTML network diagrams

Here is all the data you need:
"dh_hashtag_posts.json"

## Data loading

### Subtask:
Load the data from "dh_hashtag_posts.json" into a pandas DataFrame.


**Reasoning**:
Load the data from the JSON file into a pandas DataFrame and display its first few rows and shape.



In [1]:
import pandas as pd

try:
    df_posts = pd.read_json('dh_hashtag_posts.json')
    display(df_posts.head())
    print(df_posts.shape)
except FileNotFoundError:
    print("Error: 'dh_hashtag_posts.json' not found.")
except Exception as e:
    print(f"An error occurred: {e}")

Unnamed: 0,text,created_at,author,uri,has_images,reply_to
0,"""We, Robot"": Are sci-fi movies' notions of art...",2011-02-24T23:00:19.000Z,antkaufman.bsky.social,at://did:plc:4zmrwhpfy74p2xdzlkxbgio4/app.bsky...,False,
1,In line for @xdotai artificial intelligence as...,2016-01-06T23:05:38Z,eric.zip,at://did:plc:4d56hm2dm7khhg4s7e6ocbih/app.bsky...,False,
2,"Marvin Minsky, artificial intelligence pioneer...",2016-01-26T12:26:07.000Z,transmothra.bsky.social,at://did:plc:hoiea5iz3yl7gabcrrzwjpwt/app.bsky...,False,
3,Good read: The next big thing in legal: cartho...,2016-07-08T15:31:07Z,jordanfurlong.com,at://did:plc:cwxwx7lwes5sv524vy24e5or/app.bsky...,False,
4,Artificial intelligence: Why we must learn fro...,2016-11-07T00:57:54Z,floridi.bsky.social,at://did:plc:nwbbwtjbeszeqdendqrfekrp/app.bsky...,False,


(470, 6)


## Data exploration

### Subtask:
Explore the loaded Bluesky post data to understand its structure and identify relevant columns for hashtag analysis.


**Reasoning**:
Inspect the data types, missing values, descriptive statistics, and unique hashtags in the dataframe to understand the data better.



In [2]:
# Check data types
print(df_posts.dtypes)

# Check for missing values
print(df_posts.isnull().sum() / len(df_posts) * 100)

# Descriptive statistics for numerical columns
print(df_posts.describe())

# Examine sample posts
display(df_posts.sample(5))

# Investigate unique hashtags (assuming hashtags start with '#')
# This part needs improvement as it's a naive approach.  A better regex is needed.
hashtags = []
for index, row in df_posts.iterrows():
    text = row['text']
    words = text.split()
    for word in words:
        if word.startswith('#'):
            hashtags.append(word)

unique_hashtags = set(hashtags)
print(f"Number of unique hashtags: {len(unique_hashtags)}")
print("Sample of unique hashtags:", list(unique_hashtags)[:10]) # Print first 10 for brevity

# Print the shape of the dataset
print(f"Shape of the dataset: {df_posts.shape}")

text          object
created_at    object
author        object
uri           object
has_images      bool
reply_to      object
dtype: object
text           0.00000
created_at     0.00000
author         0.00000
uri            0.00000
has_images     0.00000
reply_to      87.87234
dtype: float64
                                                     text  \
count                                                 470   
unique                                                418   
top     New Artificial Intelligence tool generates rea...   
freq                                                   10   

                      created_at                  author  \
count                        470                     470   
unique                       430                     283   
top     2024-11-29T14:51:28.720Z  medaimedia.bsky.social   
freq                          10                      31   

                                                      uri has_images  \
count                       

Unnamed: 0,text,created_at,author,uri,has_images,reply_to
172,Data is Sentient Artificial Intelligence. Curr...,2024-11-28T17:36:30.919Z,jdaviswriter.bsky.social,at://did:plc:hcft7pv52c4rdbn237fdmo7d/app.bsky...,False,at://did:plc:gumknel2w4laai2zb67vhib6/app.bsky...
130,It was wonderful to speak at yesterday's confe...,2024-11-17T07:17:46Z,ericheinze.bsky.social,at://did:plc:zn7epbbz47fevtels4vgxvcx/app.bsky...,True,
394,Artificial intelligence softens the impact of ...,2024-11-29T15:39:34.822Z,richardtol.bsky.social,at://did:plc:eilzy2ul42ho2d35j77dfelm/app.bsky...,False,
243,Number 91: artificial intelligence programs th...,2024-11-29T02:08:39.852Z,brendelbored.bsky.social,at://did:plc:of56nmyuqzvjta7qlf7gwht6/app.bsky...,False,at://did:plc:of56nmyuqzvjta7qlf7gwht6/app.bsky...
435,Incredible «Putin » to be released January 202...,2024-11-30T07:37:35.379Z,saskiadekkers.bsky.social,at://did:plc:smgbyoj2pytwuwptkpcruquj/app.bsky...,False,


Number of unique hashtags: 214
Sample of unique hashtags: ['#OECD,', '#HappyMonday', '#philosophy', '#VaniaDimitrova', '#performativepublics', '#RISK', '#jamendo', '#BTNNB', '#JAMAHealthForum', '#Compliance']
Shape of the dataset: (470, 6)


## Data wrangling

### Subtask:
Extract and clean hashtags from the 'text' column of the `df_posts` DataFrame. Create a new DataFrame with each row representing a unique hashtag and its frequency.


**Reasoning**:
Extract hashtags using regex, clean them, calculate their frequencies, and create the df_hashtags DataFrame.



In [3]:
import re
import pandas as pd

def extract_hashtags(text):
    hashtags = re.findall(r"#\w+", text)
    return hashtags

def clean_hashtag(hashtag):
    cleaned_hashtag = re.sub(r"[^a-zA-Z0-9_]", "", hashtag).lower()
    return cleaned_hashtag

hashtags_list = []
for index, row in df_posts.iterrows():
    hashtags = extract_hashtags(row['text'])
    for hashtag in hashtags:
        hashtags_list.append(clean_hashtag(hashtag))

hashtag_counts = pd.Series(hashtags_list).value_counts()
df_hashtags = pd.DataFrame({'frequency': hashtag_counts})
df_hashtags.index.name = None  # remove the index name
display(df_hashtags.head())

Unnamed: 0,frequency
ai,66
artificialintelligence,32
jama,12
jamanetworkopen,9
technology,8


## Data preparation

### Subtask:
Prepare the hashtag data for network visualization by creating a co-occurrence matrix.


**Reasoning**:
Create a co-occurrence matrix from the extracted hashtags in df_posts.



In [4]:
from collections import defaultdict

cooccurrence_matrix = defaultdict(int)

for index, row in df_posts.iterrows():
    hashtags = extract_hashtags(row['text'])
    cleaned_hashtags = [clean_hashtag(hashtag) for hashtag in hashtags]
    unique_hashtags = set(cleaned_hashtags)

    for i in range(len(unique_hashtags)):
        for j in range(i + 1, len(unique_hashtags)):
            hashtag1, hashtag2 = sorted(list(unique_hashtags)[i:i+2])
            cooccurrence_matrix[(hashtag1, hashtag2)] += 1

df_cooccurrence = pd.DataFrame.from_dict(cooccurrence_matrix, orient='index')
df_cooccurrence = df_cooccurrence.reset_index()
df_cooccurrence = df_cooccurrence.rename(columns={'index': 'hashtag_pair', 0: 'cooccurrence_count'})

df_cooccurrence[['hashtag1', 'hashtag2']] = pd.DataFrame(df_cooccurrence['hashtag_pair'].tolist(), index= df_cooccurrence.index)
df_cooccurrence = df_cooccurrence.pivot(index='hashtag1', columns='hashtag2', values='cooccurrence_count').fillna(0)

display(df_cooccurrence.head())

hashtag2,ai,aiczar,aioxford,aitools,arteyfigura,artificialintelligence,audible,automation,bbc,bigdata,...,tech,technology,tiktok,trump2,truth,ukemplaw,vaniadimitrova,virtualassistants,wordpress,x27
hashtag1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
academia,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
africa,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
age,5.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
agency,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
agriculture,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Data visualization

### Subtask:
Visualize the hashtag co-occurrence network using an interactive HTML network diagram.


**Reasoning**:
Create the interactive network visualization using pyvis, incorporating node sizes based on hashtag frequency and edge widths based on co-occurrence counts.



In [5]:
from pyvis.network import Network
import logging

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

try:
    net = Network(notebook=True, cdn_resources='in_line', height="750px", width="100%")

    default_size = 5  # Default node size if frequency is not found

    for index, row in df_cooccurrence.iterrows():
        # Add nodes
        net.add_node(index, label=index, title=index, size=df_hashtags.loc[index, 'frequency'] if index in df_hashtags.index else default_size)
        for col in df_cooccurrence.columns:
            if row[col] > 0:
                if col in df_hashtags.index:
                  net.add_node(col, label=col, title=col, size=df_hashtags.loc[col, 'frequency'])
                else:
                  net.add_node(col, label=col, title=col, size=default_size)
                  logging.warning(f"Hashtag '{col}' found in co-occurrence matrix but not in frequency DataFrame. Using default size.")
                net.add_edge(index, col, value=row[col], title=f"Co-occurrence: {row[col]}")

    net.show("hashtag_network.html")

except Exception as e:
    logging.error(f"An error occurred during network visualization: {e}")

ModuleNotFoundError: No module named 'pyvis'

**Reasoning**:
Install the missing `pyvis` package using pip.



In [6]:
pip install pyvis

Collecting pyvis
  Downloading pyvis-0.3.2-py3-none-any.whl.metadata (1.7 kB)
Collecting jedi>=0.16 (from ipython>=5.3.0->pyvis)
  Downloading jedi-0.19.2-py2.py3-none-any.whl.metadata (22 kB)
Downloading pyvis-0.3.2-py3-none-any.whl (756 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m756.0/756.0 kB[0m [31m10.4 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading jedi-0.19.2-py2.py3-none-any.whl (1.6 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.6/1.6 MB[0m [31m51.8 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: jedi, pyvis
Successfully installed jedi-0.19.2 pyvis-0.3.2


**Reasoning**:
Re-run the code to generate the network visualization now that the `pyvis` package is installed.



In [7]:
from pyvis.network import Network
import logging

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

try:
    net = Network(notebook=True, cdn_resources='in_line', height="750px", width="100%")

    default_size = 5  # Default node size if frequency is not found

    for index, row in df_cooccurrence.iterrows():
        # Add nodes
        net.add_node(index, label=index, title=index, size=df_hashtags.loc[index, 'frequency'] if index in df_hashtags.index else default_size)
        for col in df_cooccurrence.columns:
            if row[col] > 0:
                if col in df_hashtags.index:
                  net.add_node(col, label=col, title=col, size=df_hashtags.loc[col, 'frequency'])
                else:
                  net.add_node(col, label=col, title=col, size=default_size)
                  logging.warning(f"Hashtag '{col}' found in co-occurrence matrix but not in frequency DataFrame. Using default size.")
                net.add_edge(index, col, value=row[col], title=f"Co-occurrence: {row[col]}")

    net.show("hashtag_network.html")

except Exception as e:
    logging.error(f"An error occurred during network visualization: {e}")

hashtag_network.html


ERROR:root:An error occurred during network visualization: Object of type int64 is not JSON serializable


**Reasoning**:
The error "Object of type int64 is not JSON serializable" indicates that the `size` parameter in `net.add_node` is of type int64, which is not directly serializable to JSON.  We need to convert the size to an integer.



In [8]:
from pyvis.network import Network
import logging

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

try:
    net = Network(notebook=True, cdn_resources='in_line', height="750px", width="100%")

    default_size = 5  # Default node size if frequency is not found

    for index, row in df_cooccurrence.iterrows():
        # Add nodes, converting size to int
        size = int(df_hashtags.loc[index, 'frequency']) if index in df_hashtags.index else default_size
        net.add_node(index, label=index, title=index, size=size)
        for col in df_cooccurrence.columns:
            if row[col] > 0:
                size = int(df_hashtags.loc[col, 'frequency']) if col in df_hashtags.index else default_size
                net.add_node(col, label=col, title=col, size=size)
                net.add_edge(index, col, value=int(row[col]), title=f"Co-occurrence: {int(row[col])}")

    net.show("hashtag_network.html")

except Exception as e:
    logging.error(f"An error occurred during network visualization: {e}")

hashtag_network.html


## Summary:

### Q&A
There were no explicit questions in the task description. However, the analysis implicitly answers the question: "What are the common hashtags among these Bluesky posts and how do they relate to each other?".  The analysis also implicitly addresses "How can we visualize the relationships between these hashtags?".


### Data Analysis Key Findings
* **Hashtag Extraction and Cleaning:** A regular expression `#\w+` was used to extract hashtags, followed by cleaning to remove special characters and converting to lowercase.  A more robust regex could be considered for future improvements.
* **Hashtag Frequency:** The most frequent hashtags were identified, with "ai" and "artificialintelligence" appearing most often.  The top 5 most frequent hashtags were displayed in the summary.
* **Hashtag Co-occurrence:** A co-occurrence matrix was created, showing how often hashtags appear together in the same posts.  The visualization uses the co-occurrence matrix to display relationships.
* **Interactive Network Visualization:** An HTML network diagram was created to visualize the co-occurrence relationships between hashtags. Node size represents hashtag frequency, and edge thickness represents co-occurrence strength.


### Insights or Next Steps
* **Investigate hashtag clusters:** Analyze the network visualization for clusters of related hashtags to understand topical trends and communities within the Bluesky data.
* **Refine hashtag extraction:** Implement a more sophisticated regular expression to improve the accuracy and completeness of hashtag extraction, potentially capturing a wider range of hashtag variations and formats.
