# BlueSky Starter Pack Data Collection

Team name: A5

Team members:
- **Trang Kieu:** Data collection and code review
- **Terresa Tran:** Assisted Trang and Mei with coding and interpretation of results
- **Wynne Tseng:** Report analysis and results
- **Mei Wu:** Thematic categorization code
- **Vivien Wang:** Review and organization of report

### Q1: Qualitative: Concept Definition, Context, and Rationale


#### Context and Motivation

Bluesky starter packs are curated lists of accounts and feeds (up to 150 people and up to 3 custom feeds) designed to help users discover communities and content when joining the platform. Because Bluesky is decentralized and does not rely on traditional algorithmic recommendation systems to guide content discovery, users must depend more on manual or curated tools to navigate the network. As a result, starter packs play a critical role in shaping user discovery and social network formation.

According to "Bootstrapping Social Networks: Lessons from Bluesky Starter Packs", starter packs often circulate within communities, creating clusters or "social bubbles" where users promote and follow others within the same social groups. This suggests that starter packs may reinforce community structures and amplify visibility of certain accounts.

Understanding these patterns helps answer broader questions about influence, discovery, and community formation in decentralized social networks.

#### Concept Definition: Echo Chamber

An echo chamber refers to a social environment in which users are primarily exposed to information, accounts, and viewpoints that align with their existing perspectives, while alternative perspectives remain marginal or absent.

In the context of Bluesky starter packs, echo chambers may form when starter packs repeatedly recommend accounts that are highly similar in terms of social connections, interests, views, or topical focus. This can lead to clustered communities where information circulates within the same group.

Because starter packs function as curated recommendation systems, they tend to introduce users to specific communities rather than to broader, more diverse networks. This may contribute to echo chambers: users can become embedded in relatively homogeneous clusters that reinforce existing perspectives over time.

##### Operational definition
We will measure echo chambers through the follower overlap percentage between accounts in the same pack.
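
The report does not yet fix a formula for follower overlap, so the following is only a minimal sketch under one assumption: overlap between two accounts is the Jaccard index of their follower sets (shared followers divided by all distinct followers), expressed as a percentage, and a pack's score is the mean over all account pairs. The function names and the toy handles are hypothetical and are not part of the collection code.

```python
from itertools import combinations
from typing import Dict, Set


def overlap_percentage(followers_a: Set[str], followers_b: Set[str]) -> float:
    """Jaccard overlap of two follower sets, as a percentage (0-100)."""
    union = followers_a | followers_b
    if not union:
        return 0.0  # two accounts with no followers share nothing
    return 100.0 * len(followers_a & followers_b) / len(union)


def mean_pack_overlap(followers_by_account: Dict[str, Set[str]]) -> float:
    """Average pairwise overlap across all account pairs in one pack."""
    pairs = list(combinations(followers_by_account.values(), 2))
    if not pairs:
        return 0.0  # a pack with fewer than two accounts has no pairs
    return sum(overlap_percentage(a, b) for a, b in pairs) / len(pairs)


# Toy data (hypothetical handles and follower IDs)
toy_pack = {
    "alice.example": {"u1", "u2", "u3"},
    "bob.example": {"u2", "u3", "u4"},
    "carol.example": {"u9"},
}
pack_score = mean_pack_overlap(toy_pack)
```

Other definitions are possible, for example dividing by the smaller follower set instead of the union; the choice matters when accounts differ greatly in size and should be stated alongside any reported percentage.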


#### Hypothesis

There are shared patterns across Bluesky starter packs. Specifically:
- Some accounts appear frequently across many starter packs, indicating higher visibility or influence within the network.
- Certain feeds are repeatedly included, suggesting they play a central role in content discovery.
- Starter packs reflect community clusters or "social bubbles," in which users promote accounts within their own communities.

#### Research Questions

This phase focuses on four primary questions:

RQ1: Which accounts appear most frequently across starter packs?  
RQ2: Which feeds appear most frequently across starter packs?  
RQ3: Do starter pack descriptions suggest themes (art, politics, tech, sports, etc.)?  
RQ4: Do certain creators create many starter packs?

RQ3 rests on the expectation that starter packs may be organized around shared interests or communities.

#### Assumptions and Potential Biases
- Starter packs represent intentional curation and community knowledge
- Frequency of inclusion approximates visibility/importance
- Starter packs may overrepresent certain communities and underrepresent others depending on which URIs we could collect

### Constraints:
The Bluesky API returns details for a starter pack only when given a specific URI, so our approach depends on first collecting a list of starter pack URIs, then fetching each pack individually.

Also, for Phase 1 of this project, we analyze only 1,000 randomized starter packs to test our hypothesis first, because the full dataset contains more than 300,000 starter packs.
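
To make a randomized subset reproducible, the draw can be seeded. The sketch below is an illustration using only the standard library; `sample_uris` and the seed value are hypothetical names chosen here, not part of the notebook. Taking the first rows of the file instead would reflect the file's ordering rather than a random draw.

```python
import random
from typing import List, Sequence


def sample_uris(uris: Sequence[str], n: int, seed: int = 0) -> List[str]:
    """Return up to n URIs drawn uniformly at random without replacement."""
    rng = random.Random(seed)  # fixed seed so the same subset can be re-drawn
    pool = list(uris)
    return rng.sample(pool, min(n, len(pool)))


# Toy usage with placeholder URIs
pool = [f"at://example/{i}" for i in range(50)]
picked = sample_uris(pool, 5, seed=42)
```

With the pandas DataFrame used in this notebook, `df.sample(n=1000, random_state=42)` would be the equivalent call.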

In [153]:
#!pip install atproto --quiet

In [154]:
# Import Libraries
import json

from atproto import Client, models
from atproto import exceptions
from password import BSKY_USERNAME, BSKY_APP_PASSWORD
import pandas as pd
from typing import List, Dict



In [155]:
# Enter your Bluesky username and app password for authentication.
# Note: you can also create a file named password.py in the same directory and store your username as BSKY_USERNAME and your app password as BSKY_APP_PASSWORD.
# This notebook imports BSKY_USERNAME and BSKY_APP_PASSWORD from that file automatically.
USERNAME = BSKY_USERNAME
APP_PASSWORD = BSKY_APP_PASSWORD

# Authenticate steps:
client = Client()
client.login(USERNAME, APP_PASSWORD)

ProfileViewDetailed(did='did:plc:lmc4xbbyqqyui7m6ptolv3lb', handle='tkieu137.bsky.social', associated=ProfileAssociated(activity_subscription=ProfileAssociatedActivitySubscription(allow_subscriptions='followers', py_type='app.bsky.actor.defs#profileAssociatedActivitySubscription'), chat=None, feedgens=0, labeler=False, lists=0, starter_packs=0, py_type='app.bsky.actor.defs#profileAssociated'), avatar='https://cdn.bsky.app/img/avatar/plain/did:plc:lmc4xbbyqqyui7m6ptolv3lb/bafkreig5n2ooeo3ixz4yfygzcuablb4ovfm7fijbq7dorzbsfu2ezxrefy@jpeg', banner=None, created_at='2026-01-13T22:59:12.525Z', debug=None, description=None, display_name='', followers_count=6, follows_count=75, indexed_at='2026-01-13T22:59:52.725Z', joined_via_starter_pack=None, labels=[], pinned_post=None, posts_count=4, pronouns=None, status=None, verification=None, viewer=ViewerState(activity_subscription=None, blocked_by=False, blocking=None, blocking_by_list=None, followed_by=None, following=None, known_followers=KnownFol

In [156]:
# Read the starter packs dataset provided by Martin as a list of SP uri to gather data about the accounts within starter packs
df = pd.read_json("starterpacks.jsonl", lines=True)
df

Unnamed: 0,list,name,$type,createdAt,cid,author,uri,rkey,collection_time,feeds,updatedAt,description,descriptionFacets,image
0,at://did:plc:zdtz65xortdlxi7d6hlyz2j5/app.bsky...,@‪motesandbeams.bsky.soci…'s Starter Pack,app.bsky.graph.starterpack,2024-11-16T18:42:13.188Z,bafyreie3wdnxditj3sbinyctdum23xk3x6luvdy6voy3o...,did:plc:zdtz65xortdlxi7d6hlyz2j5,at://did:plc:zdtz65xortdlxi7d6hlyz2j5/app.bsky...,3lb3k7fkymr2f,2026-02-11 08:00:03.164000+00:00,,,,,
1,at://did:plc:zdtz65xortdlxi7d6hlyz2j5/app.bsky...,@‪motesandbeams.bsky.soci…'s Starter Pack,app.bsky.graph.starterpack,2024-11-16T18:49:24.216Z,bafyreihb7o3slrv5i4uzl3uarlpb4gu2rluqsk6wzxhzv...,did:plc:zdtz65xortdlxi7d6hlyz2j5,at://did:plc:zdtz65xortdlxi7d6hlyz2j5/app.bsky...,3lb3kmamw462k,2026-02-11 08:00:03.164000+00:00,,,,,
2,at://did:plc:oextljnuf4ix335o7aapym55/app.bsky...,Knowledge,app.bsky.graph.starterpack,2024-12-21T16:45:49.695Z,bafyreidb7dfysn45jtudxaro6n6alqiemb5irmv2ztoan...,did:plc:oextljnuf4ix335o7aapym55,at://did:plc:oextljnuf4ix335o7aapym55/app.bsky...,3ldtdzitbxs2z,2026-02-11 08:00:17.963000+00:00,[],2025-01-09T12:32:19.214Z,Education is the most powerful weapon which yo...,,
3,at://did:plc:anumzyo4b5gclvho6uqkrpap/app.bsky...,‪rngmom03.bsky.social‬'s Starter Pack,app.bsky.graph.starterpack,2024-11-21T18:32:11.874Z,bafyreibbdck4dxrmulfafmwai2rul3co52zqedrsxiuad...,did:plc:anumzyo4b5gclvho6uqkrpap,at://did:plc:anumzyo4b5gclvho6uqkrpap/app.bsky...,3lbi3y3hdgx23,2026-02-11 08:00:24.338000+00:00,[{'uri': 'at://did:plc:z72i7hdynmk6r22z27h6tvu...,,,,
4,at://did:plc:d42nr7dwbfh4vfvduimuk7j5/app.bsky...,‪pietve.bsky.social‬'s Starter Pack,app.bsky.graph.starterpack,2024-11-16T11:22:54.573Z,bafyreicggizjorcm6jldsbdrnlysxmwqgij4ywtvd7ljm...,did:plc:d42nr7dwbfh4vfvduimuk7j5,at://did:plc:d42nr7dwbfh4vfvduimuk7j5/app.bsky...,3lb2rnu76p42x,2026-02-11 08:00:25.661000+00:00,[{'uri': 'at://did:plc:5rw2on4i56btlcajojaxwca...,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
354690,at://did:plc:uos7xqdk7ggroepqnlzl7zev/app.bsky...,‪rockmetal68.bsky.social‬'s Starter Pack,app.bsky.graph.starterpack,2025-05-12T21:50:04.900Z,bafyreiecae6zmpsrvn4aspjdccmidq6byxapydls5zw24...,did:plc:uos7xqdk7ggroepqnlzl7zev,at://did:plc:uos7xqdk7ggroepqnlzl7zev/app.bsky...,3loyxac7ew325,2026-02-01 07:59:12.704000+00:00,[],,,,
354691,at://did:plc:huo5yychqhf6ifzoxek7bs4y/app.bsky...,Cen - Creative & Candid …'s Starter Pack,app.bsky.graph.starterpack,2025-05-14T12:31:52.439Z,bafyreihty74mzmeubzcbuj4idgvbnsubwm26rj344wof3...,did:plc:huo5yychqhf6ifzoxek7bs4y,at://did:plc:huo5yychqhf6ifzoxek7bs4y/app.bsky...,3lp4yxxptdh2j,2026-02-01 07:59:13.744000+00:00,,,,,
354692,at://did:plc:n2noxvecqcig4lvhthwcyp3q/app.bsky...,‪mazzystar906.bsky.social‬'s Starter Pack,app.bsky.graph.starterpack,2025-01-08T01:29:23.125Z,bafyreibfc3btykpkhyou74pdto4pv47lrbo4mmvr2zsr6...,did:plc:n2noxvecqcig4lvhthwcyp3q,at://did:plc:n2noxvecqcig4lvhthwcyp3q/app.bsky...,3lf6z7du7dx2e,2026-02-01 07:59:13.932000+00:00,[],,,,
354693,at://did:plc:gaeodbykn5riavylcovou3mh/app.bsky...,Ian Greenberg's Starter Pack,app.bsky.graph.starterpack,2025-01-12T21:17:20.652Z,bafyreidkco5kixpxxkemx3exwuynniprwt4f7ayjwl2x6...,did:plc:gaeodbykn5riavylcovou3mh,at://did:plc:gaeodbykn5riavylcovou3mh/app.bsky...,3lfl5hb5zee23,2026-02-01 07:59:14.393000+00:00,,,,,


### RQ1: Which accounts appear most in Bluesky Starter Packs?

Bluesky starter packs are curated lists of recommended accounts meant to help new users quickly find communities and high-quality content upon joining. Starter packs were created to mitigate the "cold start" problem and were responsible for up to 43% of daily follow operations at their peak. Because each pack is created by a different user and focuses on a different theme or interest group, the accounts that appear most frequently across many starter packs are likely to be:

- broadly influential,

- highly visible across communities,

- central hubs in the network.

Identifying these frequently-included accounts helps us understand:

- What kinds of voices are most prominent on Bluesky,

- Which users cross community boundaries,

- Whether certain media outlets, journalists, organizations, or personalities act as “anchor nodes” in the platform’s social ecosystem.

In [157]:
def get_accounts_info(account_list: List) -> List[Dict]:
    """
    Takes a list of account profiles (the members of one starter pack)
    and extracts just the identity information we care about.

    Parameters
    ----------
    account_list : list
        A list of profiles as returned by get_all_accounts().
        Each profile exposes 'did' and 'handle' directly.

    Returns
    -------
    list[dict]
        A list of simple dictionaries. Each dictionary has:
        - "Account DID": the unique ID for the account
        - "Account Handler": the user's handle (e.g. 'nytimes.com')
    """
    list_accounts = []
    for account in account_list:
        # get_all_accounts() already unwraps each list item's "subject",
        # so the DID and handle are read straight from the profile.
        account_did = account["did"]
        account_handler = account["handle"]
        list_accounts.append({"Account DID": account_did,
                              "Account Handler": account_handler})
    return list_accounts

In [158]:
def get_starter_pack_info(starter_pack) -> dict:
    """Extract pack-level metadata from a get_starter_pack response."""
    sp = starter_pack["starter_pack"]
    starter_pack_dict = {"SP DID": sp["cid"],                           # CID of the starter pack record (stored under the "SP DID" key)
                        "SP Creator DID": sp["creator"]["did"],         # DID of the user who created the pack
                        "SP Creator Handle": sp["creator"]["handle"],   # their handle (username)
                        "SP Description": sp["record"]["description"]}  # text description of the pack
    return starter_pack_dict

In [159]:
def get_feed_info(starting_feed_list: List) -> List[Dict]:
    """Extract identifying fields for each feed attached to a starter pack."""
    feed_list = []
    for feed in starting_feed_list:
        feed_list.append({"Feed DID": feed["did"],
                        "Feed CID": feed["cid"],
                        "Feed Description": feed["description"],
                        "Feed Creator DID": feed["creator"]["did"],
                        "Feed Creator Handle": feed["creator"]["handle"],
                        "Feed Like Count": feed["like_count"]})
    return feed_list

In [160]:
from typing import List, Dict, Any

def get_all_accounts(list_uri: str) -> List[Any]:
    """
    Fetch ALL accounts in a list (app.bsky.graph.list) and return a flat list
    of profile objects (subjects).

    Each returned element is either:
    - a dict with keys like "did", "handle", ...
    - or a ProfileView/ProfileViewDetailed object from atproto_client.
    """
    accounts: List[Any] = []
    cursor = None

    while True:
        params = {"list": list_uri, "limit": 100}
        if cursor:
            params["cursor"] = cursor

        # Call get_list API
        res = client.app.bsky.graph.get_list(params=params)

        # res.items holds this page of list members; an empty page means we are done
        items = res.items
        
        if not items:
            break
        
        for item in items:
            # check if subject is an attribute in the item and returns subject (account info),
            # if not, return items and append info
            subject = getattr(item, 'subject', item)
            accounts.append(subject)
        
        cursor = res.cursor

        if not cursor:
            break

    return accounts
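
The cursor loop above can be checked without touching the network. The sketch below is a simplified, hypothetical stand-in rather than the atproto client: `fetch_all` follows the same pattern (request a page, collect its items, stop when the page is empty or no cursor is returned), and `make_fake_api` simulates a paged endpoint over an in-memory list.

```python
from typing import Any, Callable, List, Optional, Tuple

Page = Tuple[List[Any], Optional[str]]  # (items on this page, next cursor or None)


def fetch_all(fetch_page: Callable[[Optional[str]], Page]) -> List[Any]:
    """Collect items from every page by following the cursor until it runs out."""
    collected: List[Any] = []
    cursor: Optional[str] = None
    while True:
        page_items, cursor = fetch_page(cursor)
        collected.extend(page_items)
        if not page_items or not cursor:
            break
    return collected


def make_fake_api(data: List[Any], page_size: int) -> Callable[[Optional[str]], Page]:
    """Hypothetical paged endpoint: the cursor is simply the next start index."""
    def fetch_page(cursor: Optional[str]) -> Page:
        start = int(cursor) if cursor else 0
        end = start + page_size
        next_cursor = str(end) if end < len(data) else None
        return data[start:end], next_cursor
    return fetch_page


members = [f"did:example:{i}" for i in range(250)]
all_members = fetch_all(make_fake_api(members, 100))
```

Here 250 members with a page size of 100 take three requests, which mirrors how a 150-account starter pack needs two calls at `limit=100`.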

In [161]:
def process_starter_pack_uri(uri: str) -> Dict[str, List[Dict]]:
    """
    Given one starter pack URI, download information about that starter pack,
    the accounts included in it, and the feeds attached to it.

    For the starter pack URI, we:
      1. Call the Bluesky API to get a detailed view of that starter pack.
      2. Extract metadata about the starter pack (who created it, description, etc.).
      3. Fetch all accounts that appear in that starter pack.
      4. Flatten this into one row per (starter pack, account) pair.

    Parameters
    ----------
    uri : str
        A starter pack AT-URI. The URI identifies one starter pack.

    Returns
    -------
    dict
        A dictionary with two keys:
        - "starter_pack_rows": one row per (starter pack, account) pair
        - "feed_rows": one row per (starter pack, feed) pair
        Both lists are empty if the starter pack could not be fetched.
    """
    # these will hold all rows for this starter pack
    starter_pack_accounts_list = []
    feed_list = []

    try:
        # 1. Ask the Bluesky API for detailed information about this starter pack
        starter_pack = client.app.bsky.graph.get_starter_pack(params={"starterPack": uri})
    except exceptions.BadRequestError:
        # If the server says "starter pack not found", we skip this URI and continue.
        print("Skipping URI (starter pack not found):", uri)
        return {"starter_pack_rows": [], "feed_rows": []}
    except Exception as e:
        # catch-all to avoid killing the whole script
        print(f"Error fetching starter pack for {uri}: {e}")
        return {"starter_pack_rows": [], "feed_rows": []}

        
    # 2) Extract some basic metadata about this starter pack with get_starter_pack_info()
    starter_pack_info = get_starter_pack_info(starter_pack)
        
    # 3) Get all the accounts that appear in this starter pack
    # by passing the URI of the pack's underlying list
    try:
        unprocessed_accounts = get_all_accounts(starter_pack["starter_pack"]["list"]["uri"])
        # Use our helper to extract just DID + handle for each account
        proccessed_list = get_accounts_info(unprocessed_accounts)

    except Exception as e:
        print(f"Error getting accounts for {uri}: {e}")
        # Fall back to an empty list so step 4 still runs without a NameError
        proccessed_list = []

    try:
        if len(starter_pack["starter_pack"]["feeds"]) != 0:
            feed_info = get_feed_info(starter_pack["starter_pack"]["feeds"])
            for feed in feed_info:
                feed_list.append({"SP URI": uri,
                            "SP DID": starter_pack_info["SP DID"],
                            "SP Creator DID": starter_pack_info["SP Creator DID"],
                            "SP Creator Handle": starter_pack_info["SP Creator Handle"],
                            "SP Description": starter_pack_info["SP Description"],
                            "Feed DID": feed["Feed DID"],
                            "Feed CID": feed["Feed CID"],
                            "Feed Description": feed["Feed Description"], 
                            "Feed Creator DID": feed["Feed Creator DID"],
                            "Feed Creator Handle": feed["Feed Creator Handle"],
                            "Feed Like Count": feed["Feed Like Count"]
                            })
    except Exception as e:
        print(f"Error getting feeds for {uri}: {e}")

    # 4) For each account in this starter pack's sample, create one flat row
    for account in proccessed_list:
        starter_pack_accounts_list.append({"SP URI": uri,
                        "SP DID": starter_pack_info["SP DID"],
                        "SP Creator DID": starter_pack_info["SP Creator DID"],
                        "SP Creator Handle": starter_pack_info["SP Creator Handle"],
                        "SP Description": starter_pack_info["SP Description"],
                        "Account DID": account["Account DID"],
                        "Account Handler": account["Account Handler"]})
        
    print(starter_pack_accounts_list)
    
    return {"starter_pack_rows": starter_pack_accounts_list,
            "feed_rows": feed_list
    }
    #return [starter_pack_accounts_list,feed_list]

In [162]:
def process_uris_to_jsonl(uris: List[str], sp_accounts_jsonl_path: str, feeds_jsonl_path: str, processed_log_path = "processed_uris.txt"):
    # 1) Keep track of processed URIs so we don't process one URI twice
    processed_uris = set()
    try:
        with open(processed_log_path, "r") as f:
            for line in f:
                processed_uris.add(line.strip())
    except FileNotFoundError:
        pass  # first run

    # 2) Open JSONL files in append mode
    with open(sp_accounts_jsonl_path, "a", encoding="utf-8") as sp_f, \
         open(feeds_jsonl_path, "a", encoding="utf-8") as feed_f, \
         open(processed_log_path, "a", encoding="utf-8") as log_f:

        for uri in uris:
            if uri in processed_uris:
                print(f"Skipping already-processed URI: {uri}")
                continue

            print(f"Processing URI: {uri}")
            result = process_starter_pack_uri(uri)

            # Write SP–account rows
            for row in result["starter_pack_rows"]:
                sp_f.write(json.dumps(row, ensure_ascii=False) + "\n")

            # Write feed rows
            for row in result["feed_rows"]:
                feed_f.write(json.dumps(row, ensure_ascii=False) + "\n")

            # Mark URI as processed (for resume)
            log_f.write(uri + "\n")
            log_f.flush()


### Creating Testing Dataset

In [163]:
test_100 = df[:100]
test_1000 = df[:1000]

test_100.head()

Unnamed: 0,list,name,$type,createdAt,cid,author,uri,rkey,collection_time,feeds,updatedAt,description,descriptionFacets,image
0,at://did:plc:zdtz65xortdlxi7d6hlyz2j5/app.bsky...,@‪motesandbeams.bsky.soci…'s Starter Pack,app.bsky.graph.starterpack,2024-11-16T18:42:13.188Z,bafyreie3wdnxditj3sbinyctdum23xk3x6luvdy6voy3o...,did:plc:zdtz65xortdlxi7d6hlyz2j5,at://did:plc:zdtz65xortdlxi7d6hlyz2j5/app.bsky...,3lb3k7fkymr2f,2026-02-11 08:00:03.164000+00:00,,,,,
1,at://did:plc:zdtz65xortdlxi7d6hlyz2j5/app.bsky...,@‪motesandbeams.bsky.soci…'s Starter Pack,app.bsky.graph.starterpack,2024-11-16T18:49:24.216Z,bafyreihb7o3slrv5i4uzl3uarlpb4gu2rluqsk6wzxhzv...,did:plc:zdtz65xortdlxi7d6hlyz2j5,at://did:plc:zdtz65xortdlxi7d6hlyz2j5/app.bsky...,3lb3kmamw462k,2026-02-11 08:00:03.164000+00:00,,,,,
2,at://did:plc:oextljnuf4ix335o7aapym55/app.bsky...,Knowledge,app.bsky.graph.starterpack,2024-12-21T16:45:49.695Z,bafyreidb7dfysn45jtudxaro6n6alqiemb5irmv2ztoan...,did:plc:oextljnuf4ix335o7aapym55,at://did:plc:oextljnuf4ix335o7aapym55/app.bsky...,3ldtdzitbxs2z,2026-02-11 08:00:17.963000+00:00,[],2025-01-09T12:32:19.214Z,Education is the most powerful weapon which yo...,,
3,at://did:plc:anumzyo4b5gclvho6uqkrpap/app.bsky...,‪rngmom03.bsky.social‬'s Starter Pack,app.bsky.graph.starterpack,2024-11-21T18:32:11.874Z,bafyreibbdck4dxrmulfafmwai2rul3co52zqedrsxiuad...,did:plc:anumzyo4b5gclvho6uqkrpap,at://did:plc:anumzyo4b5gclvho6uqkrpap/app.bsky...,3lbi3y3hdgx23,2026-02-11 08:00:24.338000+00:00,[{'uri': 'at://did:plc:z72i7hdynmk6r22z27h6tvu...,,,,
4,at://did:plc:d42nr7dwbfh4vfvduimuk7j5/app.bsky...,‪pietve.bsky.social‬'s Starter Pack,app.bsky.graph.starterpack,2024-11-16T11:22:54.573Z,bafyreicggizjorcm6jldsbdrnlysxmwqgij4ywtvd7ljm...,did:plc:d42nr7dwbfh4vfvduimuk7j5,at://did:plc:d42nr7dwbfh4vfvduimuk7j5/app.bsky...,3lb2rnu76p42x,2026-02-11 08:00:25.661000+00:00,[{'uri': 'at://did:plc:5rw2on4i56btlcajojaxwca...,,,,


### Run the Script to Get Starter Packs Accounts and Starter Packs Feeds Dataset

In [None]:
sp_output = "test_sp_accounts.jsonl"
feed_output = "test_feeds.jsonl"
log_output = "test_processed.txt"

process_uris_to_jsonl(
    uris=test_100["uri"],
    sp_accounts_jsonl_path=sp_output,
    feeds_jsonl_path=feed_output,
    processed_log_path=log_output
)



Processing URI: at://did:plc:zdtz65xortdlxi7d6hlyz2j5/app.bsky.graph.starterpack/3lb3k7fkymr2f
[{'SP URI': 'at://did:plc:zdtz65xortdlxi7d6hlyz2j5/app.bsky.graph.starterpack/3lb3k7fkymr2f', 'SP DID': 'bafyreie3wdnxditj3sbinyctdum23xk3x6luvdy6voy3o2tp24pzpshud4', 'SP Creator DID': 'did:plc:zdtz65xortdlxi7d6hlyz2j5', 'SP Creator Handle': 'motesandbeams.bsky.social', 'SP Description': None, 'Account DID': 'did:plc:tsksqpjisdvtusxq6ide5tzj', 'Account Handler': 'adamkinzinger.substack.com'}, {'SP URI': 'at://did:plc:zdtz65xortdlxi7d6hlyz2j5/app.bsky.graph.starterpack/3lb3k7fkymr2f', 'SP DID': 'bafyreie3wdnxditj3sbinyctdum23xk3x6luvdy6voy3o2tp24pzpshud4', 'SP Creator DID': 'did:plc:zdtz65xortdlxi7d6hlyz2j5', 'SP Creator Handle': 'motesandbeams.bsky.social', 'SP Description': None, 'Account DID': 'did:plc:nyxviptfyic2lvuorkr5yy3y', 'Account Handler': 'theatlantic.com'}, {'SP URI': 'at://did:plc:zdtz65xortdlxi7d6hlyz2j5/app.bsky.graph.starterpack/3lb3k7fkymr2f', 'SP DID': 'bafyreie3wdnxditj3sbi

### Top 10 Accounts Appearing Most Frequently within Starter Packs

In [165]:
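# Load the (starter pack, account) rows written by the collection step above
accounts_df = pd.read_json(sp_output, lines=True)

# Count the number of appearances within starter packs and store the result in a DataFrame
account_appearance_count = accounts_df["Account Handler"].value_counts()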
df_counts = account_appearance_count.reset_index()
df_counts.columns = ["Account Handler", "Count"]

# Add frequencies column
total = account_appearance_count.sum()
df_counts["Frequency"] = df_counts["Count"] / total
print(len(df_counts))
df_counts.head(10)

6187


Unnamed: 0,Account Handler,Count,Frequency
0,charlietotherescue.bsky.social,49,0.005872
1,donnacrowe.bsky.social,27,0.003236
2,bsky.app,27,0.003236
3,oh-its-chris.bsky.social,27,0.003236
4,evelynelotz.bsky.social,25,0.002996
5,tamiguitar.bsky.social,22,0.002637
6,dannyj73.bsky.social,22,0.002637
7,friendsofbcdogs.bsky.social,21,0.002517
8,pgmpinc.bsky.social,20,0.002397
9,mtncannoligirl.bsky.social,18,0.002157


### Result: Most Frequently Appearing Accounts

The accounts listed above appear most often in the starter packs of our test sample. Their inclusion is far from uniform: the top account appears 49 times, while the 6,187 distinct accounts average fewer than two appearances each. This indicates that some accounts are repeatedly recommended by starter pack creators rather than appearing randomly or evenly across packs.

This supports our hypothesis that certain accounts are **promoted more frequently, suggesting that visibility on Bluesky may be concentrated among a smaller subset of accounts.** Because starter packs function as curated discovery tools, accounts that appear more often are more likely to be discovered and followed by new users. This repeated inclusion increases their visibility and reinforces their presence within the platform's social network.

The pattern is consistent with the "rich get richer" principle, also known as preferential attachment, in which accounts that already appear frequently in starter packs become even more likely to be included in additional ones. As these accounts gain exposure through repeated inclusion, they become more visible to users and are more likely to be followed, recognized, and recommended by others. This would create a positive feedback loop in which popular accounts accumulate visibility faster than less frequently included accounts. Our counts are a single snapshot, so they can show concentration but cannot by themselves demonstrate the feedback loop over time.

This finding supports our hypothesis that starter packs play a role in amplifying certain accounts and suggests that discovery on Bluesky may be influenced by cumulative visibility effects rather than purely decentralized exploration.
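
One way to put a number on "concentrated among a smaller subset" is the share of all inclusions held by the top k accounts. The sketch below is illustrative only: the helper names and toy rows are hypothetical, and it counts distinct packs per account so that a repeated (pack, account) row, for example from a re-run of the collection script, is not counted twice.

```python
from collections import Counter
from typing import Iterable, Tuple


def packs_per_account(rows: Iterable[Tuple[str, str]]) -> Counter:
    """Count the distinct packs that include each account.

    rows: (pack_uri, account_handle) pairs; repeated pairs are counted once.
    """
    unique_pairs = set(rows)
    return Counter(handle for _, handle in unique_pairs)


def top_k_share(counts: Counter, k: int) -> float:
    """Fraction of all inclusions that belong to the k most-included accounts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in counts.most_common(k)) / total


# Toy rows (hypothetical); the last row repeats ("p1", "a") on purpose
toy_rows = [("p1", "a"), ("p2", "a"), ("p3", "a"),
            ("p1", "b"), ("p2", "b"), ("p3", "c"), ("p1", "a")]
toy_counts = packs_per_account(toy_rows)
toy_share = top_k_share(toy_counts, 1)
```

In the toy data, account `a` is in 3 of the 6 distinct inclusions, so the top-1 share is 0.5. Comparing this share across larger samples would show whether the concentration holds beyond the first 100 packs.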

In [166]:
# TODO: The result currently includes only part of the accounts on Bluesky.
# We need to collect the remaining starter packs and add them to the dataset.
# Check the get-all function: number of accounts collected per starter pack
accounts_df.groupby("SP URI")["Account DID"].count()

### RQ2: What feeds appear the most among all starter packs with feeds?

In [None]:
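# Load the (starter pack, feed) rows written by the collection step above
feeds_df = pd.read_json(feed_output, lines=True)

# Count how often each feed creator appears within starter packs and store the result in a DataFrame
feeds_appearance_count = feeds_df["Feed Creator Handle"].value_counts()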
df_counts = feeds_appearance_count.reset_index()
df_counts.columns = ["Feed Creator Handle", "Count"]

# Add frequencies column
total = feeds_appearance_count.sum()
df_counts["Frequency"] = df_counts["Count"] / total
df_counts.head(10)

Unnamed: 0,Feed Creator Handle,Count,Frequency
0,bsky.app,5,0.15625
1,skyfeed.xyz,4,0.125
2,colinbaines15.bsky.social,3,0.09375
3,eepy.bsky.social,2,0.0625
4,timmersionmedia.com,2,0.0625
5,shultzman.com,2,0.0625
6,clarabelle.xyz,2,0.0625
7,aendra.com,2,0.0625
8,bsky.art,2,0.0625
9,bossett.social,1,0.03125


As in the account results, where bsky.app was among the most frequently included accounts, bsky.app also leads here: feeds it created account for 5 of the 32 feed inclusions (15.6%) in our test data, followed by skyfeed.xyz with 4 (12.5%). Note that this table counts feed creators rather than individual feeds. The lead is much smaller than in the account counts, and with only 32 feed inclusions the ranking is tentative. We sampled only a small subset of the data and need to examine the rest of the dataset.

### RQ3: Thematic Categorization

Starter packs on Bluesky often reflect communities built around specific interests such as politics, gaming, news, art, or travel. To better understand the thematic structure of our dataset, we developed a simple rule-based text classifier that assigns each starter pack to a category based on keywords found in its text (the current code scores the description; the name is passed in but not yet used). This allows us to:
- Quantify topics and themes to see which are most common across starter packs.
- Approximate themes well enough to support downstream quantitative analysis such as frequency counts.
- Compare results across categories.

In [None]:
import re
from typing import Dict, List

# 1. Define the set of possible categories

# "Unknown" is the fallback label returned by classify_starter_pack()
STARTER_PACK_CATEGORIES = [
    "gaming", "sports", "music", "politics", "science", "tech", "finance",
    "art", "fashion", "food", "travel", "education", "health", "books",
    "movies", "animals", "religion", "nature", "comedy", "news", "Unknown"
]

# 2. Keyword lists for each category
# A starter pack will be classified by matching
# these keywords to its name/description text.


STARTER_PACK_KEYWORDS: Dict[str, List[str]] = {
    "gaming":   ["game", "gaming", "pc", "console", "xbox", "playstation", "steam", "fps", "rpg", "minecraft", "valorant"],
    "sports":   ["sport", "soccer", "football", "basketball", "baseball", "tennis", "golf", "nba", "nfl", "fifa"],
    "music":    ["music", "band", "album", "song", "playlist", "dj", "concert", "guitar", "piano", "rapper"],
    "politics": ["politic", "election", "vote", "policy", "senate", "congress", "government", "campaign", "mayor", "minister", "mps", "diplomacy"],
    "science":  ["science", "biology", "chemistry", "physics", "genetics", "lab", "research", "paper", "journal"],
    "tech":     ["tech", "software", "developer", "coding", "python", "javascript", "ai", "ml", "data", "cloud", "startup"],
    "finance":  ["finance", "invest", "stock", "trading", "crypto", "bitcoin", "portfolio", "economy", "market"],
    "art":      ["art", "artist", "drawing", "painting", "illustration", "sketch", "gallery", "design"],
    "fashion":  ["fashion", "style", "outfit", "streetwear", "runway", "vogue", "makeup", "beauty"],
    "food":     ["food", "cook", "recipe", "baking", "restaurant", "coffee", "tea", "chef"],
    "travel":   ["travel", "trip", "vacation", "flight", "hotel", "backpack", "itinerary", "tour"],
    "education":["education", "school", "university", "class", "homework", "study", "course", "lecture"],
    "health":   ["health", "fitness", "workout", "gym", "diet", "nutrition", "doctor", "medicine", "therapy"],
    "books":    ["book", "novel", "reading", "literature", "author", "kindle"],
    "movies":   ["movie", "film", "cinema", "director", "actor", "netflix", "tv", "series"],
    "animals":  ["animal", "cat", "dog", "pet", "wildlife", "bird", "fish"],
    "religion": ["religion", "faith", "church", "bible", "islam", "quran", "hindu", "buddha"],
    "nature":   ["nature", "outdoors", "hiking", "camping", "climate", "forest", "mountain", "ocean"],
    "comedy":   ["comedy", "joke", "meme", "funny", "satire"],
    "news":     ["news", "journalism", "reporter", "breaking", "headline"],
}


# 3) Helper: Normalize text into tokens
# Converts text -> lowercase -> word tokens

_WORD_RE = re.compile(r"[a-z0-9]+")

def _normalize(text: str) -> List[str]:
    """
    Lowercase the text and split it into alphanumeric word tokens.
    """
    text = (text or "").lower()
    return _WORD_RE.findall(text)

# 4. Helper: Score every category
# Counts how many keywords from each category appear in the text.
def _score_categories(tokens: List[str]) -> Dict[str, int]:
    """
    Score each category for the given tokens.
    Every keyword found anywhere in the joined text adds 2 points
    to its category. Returns a {category: score} dictionary.
    """
    scores = {cat: 0 for cat in STARTER_PACK_KEYWORDS}
    joined = " ".join(tokens)
    for cat, kws in STARTER_PACK_KEYWORDS.items():
        for kw in kws:
            # Substring match: "art" also matches inside "starter" or "party".
            # A keyword that is not a substring cannot equal any token,
            # so no separate exact-token count is needed.
            if kw in joined:
                scores[cat] += 2
    return scores

# 5. Main classifier function
def classify_starter_pack(name: str, description: str) -> str:
    """
    Takes the name and description of a starter pack and returns its theme.

    Only the description is scored at present; the name is accepted
    but not used. Returns "Unknown" when no keyword matches.
    """
    text = description or ''
    tokens = _normalize(text)

    scores = _score_categories(tokens)
    best_cat = max(scores, key=scores.get)

    return best_cat if scores[best_cat] >= 2 else "Unknown"
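
Because the scorer tests `kw in joined`, a short keyword also matches inside longer words. The sketch below contrasts that with exact-token matching on a made-up description. It is an alternative to consider, not the method used for the results in this notebook; `score_by_token` and `demo_keywords` are hypothetical names.

```python
import re
from typing import Dict, List

_TOKEN_RE = re.compile(r"[a-z0-9]+")


def score_by_token(text: str, keywords: Dict[str, List[str]]) -> Dict[str, int]:
    """Count, per category, the keywords that occur as whole tokens in the text."""
    tokens = set(_TOKEN_RE.findall((text or "").lower()))
    return {cat: sum(1 for kw in kws if kw.lower() in tokens)
            for cat, kws in keywords.items()}


demo_keywords = {"art": ["art", "artist"], "animals": ["cat", "dog"]}
demo_text = "A starter pack for education researchers"

# Substring matching: "art" hides in "starter", "cat" hides in "education"
substring_hits = {cat: [kw for kw in kws if kw in demo_text.lower()]
                  for cat, kws in demo_keywords.items()}

# Whole-token matching finds no art or animal keywords in the same text
token_scores = score_by_token(demo_text, demo_keywords)
```

Exact-token matching has its own cost: it misses plurals and inflections ("artists" does not equal "artist"), so keyword lists would need stems or extra variants. Either way, short keywords such as "ai", "pc", "tea", or "art" deserve a manual check against a sample of classified descriptions.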

In [None]:
# Work on a copy of the two text columns so the full dataset in df stays intact
df2 = test_1000.loc[:, ["name", "description"]].copy()

In [None]:
# 6) Apply classifier to the dataset
df2["name"] = df2["name"].fillna("").astype(str)
df2["description"] = df2["description"].fillna("").astype(str)

df2 = df2.drop_duplicates(subset=["description"])

df2["Starter Pack Classification"] = df2.apply(
    lambda row: classify_starter_pack(row["name"], row["description"]),
    axis=1
)

df2.head()

Unnamed: 0,name,description,Starter Pack Classification
0,@‪motesandbeams.bsky.soci…'s Starter Pack,,Unknown
2,Knowledge,Education is the most powerful weapon which yo...,education
7,We Love Memoirs,For memoir authors and readers to connect!,books
11,UK MPs & peers working for better UK-EU relations,"Nearly 100 MPs, peers, ex-MPs or ex-MEPs on Bl...",politics
14,Greater Manchester politics,"MPs, councillors and others politicos in Great...",politics


In [None]:
df2["Starter Pack Classification"].value_counts()

Starter Pack Classification
Unknown      132
art           61
animals       54
tech          30
politics      14
science       12
gaming        11
music          9
sports         6
news           5
education      5
food           5
books          4
comedy         4
health         4
movies         2
finance        1
Name: count, dtype: int64


### RQ3: Thematic Categorization Analysis

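After removing duplicate descriptions, 359 starter packs remained. Of these, 132 could not be assigned a theme and were labeled Unknown, either because the description was empty or because it matched none of our keywords. The top five themes identified were art, animals, tech, politics, and science.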

#### **Uncertainties**:
- Themes may be misclassified based on the given descriptions and titles.
- Themes may be too broad, especially considering that categories like art include many different mediums (e.g., film, drawing/painting, calligraphy).

#### **New Problem**:
To define echo chambers, we will need to determine the percentage of overlapping followers within a starter pack theme. This will increase both the size of the data and the computational cost.

#### **Next steps**: 
- We will continue collecting starter pack data and increase the dataset size. To identify echo chambers, we will randomly pick 5 starter packs (test_data) to avoid exceeding our API limit. We will then calculate follower overlap within a narrowed set of popular themes and split the data processing among team members to reduce the computational load.
- We can also revise the categorization algorithm and use an LLM instead of rule-based matching.
- We can also fetch the feeds and use them as additional material for categorizing starter packs, alongside the name and description.
