In [None]:
%load_ext autoreload
%autoreload 2

import sys
import os
sys.path.append(".")
import dimen_generation

In [None]:
# Load community embedding
vectors, metadata = dimen_generation.load_embedding()

In [None]:
# Compute all pairs of similar communities
dimen_generator = dimen_generation.DimenGenerator(vectors)

In [None]:
# Find the dimension for each (seed, dimension name) pair in seeds_dimen_name_pairs,
# then save the resulting community scores to scores_file_name.
# Lightly modified from code at https://github.com/CSSLab/social-dimensions
def find_dimensions(seeds_dimen_name_pairs, scores_file_name):
    seeds = [x[0] for x in seeds_dimen_name_pairs]
    dimen_names = [x[1] for x in seeds_dimen_name_pairs]
    
    dimensions = dimen_generator.generate_dimensions_from_seeds(seeds)

    for name, dimen in zip(dimen_names, dimensions):
        print("Dimension %s:" % name)
        print("\tSeed: %s" % dimen["seed"])
        print("\tFound seeds:")
        for c1, c2 in zip(dimen["left_comms"], dimen["right_comms"]):
            print("\t\t%s -> %s" % (c1, c2))

    # Calculate scores for communities
    scores = dimen_generation.score_embedding(vectors, zip(dimen_names, dimensions))
    print(scores.head(5))

    # Save the scores to a csv
    scores.to_csv(scores_file_name)

DBSCAN and OPTICS are clustering algorithms that support uneven cluster sizes.[4] Given a value for eps, a DBSCAN clustering can be extracted from a fitted OPTICS clustering. eps limits how far apart two sample points can be while still being considered part of the same cluster, and it strongly influences the clusters DBSCAN generates. Leaving some communities unlabelled is not much of a concern: an unlabelled community is not similar enough, in terms of its users, to other communities, so any signal about how US-centric or non-US-centric its users are is not particularly strong.
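
The effect of eps can be seen on a toy example. The sketch below is not part of the analysis: it fits OPTICS on ten made-up 2-D points forming two tight groups, then extracts DBSCAN labels at three eps values. The points, `min_samples=3`, and the helper `count_clusters` are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

# Made-up data: two tight groups of five points each, far apart.
group_a = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0], [0.3, 0.0], [0.4, 0.0]]
group_b = [[10.0, 0.0], [10.1, 0.0], [10.2, 0.0], [10.3, 0.0], [10.4, 0.0]]
toy_points = np.array(group_a + group_b)

# Fit OPTICS once; DBSCAN-style labels for any eps come from this fit.
toy_optics = OPTICS(min_samples=3).fit(toy_points)

def count_clusters(optics_model, eps):
    """Hypothetical helper: number of DBSCAN clusters at this eps (noise is -1)."""
    labels = cluster_optics_dbscan(
        reachability=optics_model.reachability_,
        core_distances=optics_model.core_distances_,
        ordering=optics_model.ordering_,
        eps=eps,
    )
    return len(set(labels.tolist()) - {-1})

for eps in (0.01, 1.0, 20.0):
    print(f"eps={eps}: {count_clusters(toy_optics, eps)} cluster(s)")
```

On this toy data, a very small eps leaves every point as noise, a moderate eps separates the two groups, and a very large eps merges everything into a single cluster.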

Maximizing the number of communities labelled gives us a better chance of labelling clusters as US-centric or not, and having more clusters lets us be more precise about whether a given community is US-centric. However, we cannot solely maximize the number of communities labelled; otherwise we would put everything into a single cluster, which is not informative.
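
One way to express this trade-off is to maximize the number of labelled communities only among clusterings that have more than one cluster. The sketch below assumes each candidate clustering is summarized as a dictionary with `eps`, `samples_labelled`, and `cluster_count` keys. The function `pick_clustering`, its `min_clusters` parameter, and the numbers are hypothetical and purely illustrative.

```python
def pick_clustering(summaries, min_clusters=2):
    """Return the summary labelling the most samples among those with
    at least `min_clusters` clusters, or None if no summary qualifies."""
    informative = [s for s in summaries if s["cluster_count"] >= min_clusters]
    if not informative:
        return None
    # max() keeps the first summary on ties.
    return max(informative, key=lambda s: s["samples_labelled"])

# Made-up summaries for three eps values (not results from the data).
toy_summaries = [
    {"eps": 0.1, "samples_labelled": 40, "cluster_count": 12},
    {"eps": 0.5, "samples_labelled": 90, "cluster_count": 7},
    {"eps": 5.0, "samples_labelled": 100, "cluster_count": 1},
]

best = pick_clustering(toy_summaries)
print(best["eps"])  # → 0.5
```

The largest eps labels the most samples, but it is rejected because it collapses everything into one cluster.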



Additional sources:

https://scikit-learn.org/stable/modules/clustering.html

In [None]:
from sklearn.cluster import OPTICS

# Using squared Euclidean to avoid normalizing by variance of vectors
optics_clustering = OPTICS(metric='sqeuclidean')
optics_clustering.fit(vectors)

In [None]:
# Help from https://scikit-learn.org/stable/auto_examples/cluster/plot_optics.html#sphx-glr-auto-examples-cluster-plot-optics-py

from sklearn.cluster import cluster_optics_dbscan

""" Generates DBSCAN cluster and info about it,
    given OPTICS cluster and eps value for DBSCAN """
def cluster_for_eps(optics_cluster, eps):
    # Use the OPTICS clustering passed in, rather than the global one.
    dbscan_labels = cluster_optics_dbscan(
        reachability=optics_cluster.reachability_,
        core_distances=optics_cluster.core_distances_,
        ordering=optics_cluster.ordering_,
        eps=eps,
    )

    # Noise has label -1 and the first cluster has label 0.
    # Count only labels above 0, since we want more than 1 cluster.
    samples_labelled = int((dbscan_labels > 0).sum())
    # Cluster labels are consecutive from 0, so the highest label + 1
    # is the number of clusters (0 when every sample is noise).
    cluster_count = int(dbscan_labels.max()) + 1

    return {
        "eps": eps,
        "labels": dbscan_labels,
        # Samples labelled that are neither in the first cluster
        # nor considered noise.
        "samples_labelled": samples_labelled,
        "cluster_count": cluster_count,
    }

def print_dbscan_cluster_info(dbscan_cluster_info):
    print(f'eps used: {dbscan_cluster_info["eps"]}')
    print(f'\tSamples labelled: {dbscan_cluster_info["samples_labelled"]}')
    print(f'\tCluster count: {dbscan_cluster_info["cluster_count"]}')

In [None]:
ITERATIONS = 100
EPS_RATE = 50
max_sample_labelled_clustering = None

# Try eps values 0.02, 0.04, ..., 1.98 and keep the clustering
# that labels the most samples (the first one wins on ties).
for i in range(1, ITERATIONS):
    eps = i / EPS_RATE
    dbscan_info = cluster_for_eps(optics_clustering, eps)
    if (max_sample_labelled_clustering is None
            or dbscan_info["samples_labelled"] > max_sample_labelled_clustering["samples_labelled"]):
        max_sample_labelled_clustering = dbscan_info

print_dbscan_cluster_info(max_sample_labelled_clustering)

An eps of 0.54 labelled the most communities as part of a cluster, producing a total of 174 clusters.

Some communities are US-centric, such as r/nfl, which focuses on the NFL, a US sports league. At the other end, communities such as r/india and r/vancouver focus on topics unrelated to the US. However, there are also communities that are not specifically focused on US topics, such as r/pics. These are of interest because they more closely represent what Reddit as a whole (or at least its English-speaking demographic) is interested in. If the users of these communities are similar to the users of communities that are focused on US topics, the two kinds of communities will be close together in the vector space. The more such communities we find, the more confidently we can conclude that users are fairly focused on US topics.
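
As a rough illustration of what "close together in the vector space" means, the sketch below compares made-up vectors using cosine similarity. Cosine similarity is only one common closeness measure and is an assumption here (the clustering above uses squared Euclidean distance). The vectors and variable names are hypothetical and do not come from the embedding.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-D vectors standing in for community embeddings.
us_focused = [1.0, 0.0, 0.0]
general_interest = [0.8, 0.6, 0.0]
unrelated = [0.0, 0.0, 1.0]

print(cosine_similarity(us_focused, general_interest))
print(cosine_similarity(us_focused, unrelated))
```

A general-interest community whose vector scores high against US-focused communities would, under this measure, share much of its user base with them.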