In [None]:
%load_ext autoreload
%autoreload 2

import sys
import os
sys.path.append(".")
import dimen_generation

In [None]:
# Load community embedding
vectors, metadata = dimen_generation.load_embedding()

In [None]:
# Compute all pairs of similar communities
dimen_generator = dimen_generation.DimenGenerator(vectors)

In [None]:
# For each (seed, dimension name) pair in seeds_dimen_name_pairs, generates the dimension and prints the seed pairs found,
# then scores every community along each dimension and saves the scores to scores_file_name
# Lightly modified from code from https://github.com/CSSLab/social-dimensions
def find_dimensions(seeds_dimen_name_pairs, scores_file_name):
    seeds = [x[0] for x in seeds_dimen_name_pairs]
    dimen_names = [x[1] for x in seeds_dimen_name_pairs]
    
    dimensions = dimen_generator.generate_dimensions_from_seeds(seeds)

    for name, dimen in zip(dimen_names, dimensions):
        print("Dimension %s:" % name)
        print("\tSeed: %s" % dimen["seed"])
        print("\tFound seeds:")
        for c1, c2 in zip(dimen["left_comms"], dimen["right_comms"]):
            print("\t\t%s -> %s" % (c1, c2))

    # Calculate scores for communities
    scores = dimen_generation.score_embedding(vectors, zip(dimen_names, dimensions))
    print(scores.head(5))

    # Save the scores to a csv
    scores.to_csv(scores_file_name)

Next, we look for seed pairs that yield a dimension measuring how American-focused a subreddit is.

In [None]:
# The following use subreddits not in the dataset used to generate the vectors
#americanness_seed = [("MURICA", "Canadia")]
#americanness_seed = [("AskAnAmerican", "AskACanadian")]
#americanness_seed = [("buildapc", "buildapccanada")]

find_dimensions(
    [(("personalfinance", "PersonalFinanceCanada"), "americanness1")],
    "finance_seeded_americanness_scores.csv"
)

Result: Most of the generated pairs have a Canadian subreddit as the 2nd element, but occasionally the 2nd element isn't specifically Canadian. Ex: ShieldAndroidTV

In [None]:
# All seeds are focused on finding American subreddits
finance_seed = (("personalfinance", "PersonalFinanceCanada"), "finance")
news_seed = (("news", "worldnews"), "news")

find_dimensions(
    [finance_seed, news_seed],
    "finance_news_americanness_scores.csv"
)

Result: The results for the first dimension are the same, which is to be expected since the dimension generator generates the dimension for each seed independently.
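
To make the independence concrete, here is a minimal sketch under stated assumptions. It treats a dimension as the unit vector pointing from the first seed community to the second, and scores a community by its projection onto that direction. The toy vectors and the helpers `seed_dimension` and `score` are hypothetical and are not the `dimen_generation` API; the real generator also augments each seed with similar community pairs (the "Found seeds" printed above), which this sketch leaves out.

```python
import math

# Toy 3-d community vectors (made-up values, NOT taken from the real embedding)
toy_vectors = {
    "personalfinance": [0.9, 0.1, 0.3],
    "PersonalFinanceCanada": [0.2, 0.8, 0.3],
    "news": [0.7, 0.2, 0.6],
    "worldnews": [0.3, 0.5, 0.7],
    "nyc": [0.8, 0.1, 0.1],
}

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def seed_dimension(vectors, seed):
    """Hypothetical helper: unit direction pointing from seed[0] to seed[1]."""
    left, right = (normalize(vectors[name]) for name in seed)
    return normalize([r - l for l, r in zip(left, right)])

def score(vectors, dimension, community):
    """Hypothetical helper: projection of a community onto a dimension."""
    unit = normalize(vectors[community])
    return sum(u * d for u, d in zip(unit, dimension))

toy_finance_seed = ("personalfinance", "PersonalFinanceCanada")
toy_news_seed = ("news", "worldnews")

# Each dimension depends only on its own seed, so adding the news seed
# leaves the finance dimension unchanged.
finance_only = [seed_dimension(toy_vectors, s) for s in [toy_finance_seed]]
finance_and_news = [
    seed_dimension(toy_vectors, s) for s in [toy_finance_seed, toy_news_seed]
]
print(finance_only[0] == finance_and_news[0])

# A negative score means the community sits closer to the first (left) seed element.
print(score(toy_vectors, finance_only[0], "nyc"))
```

In this sketch a community that resembles the first seed element gets a negative score and one that resembles the second gets a positive score, matching the `c1 -> c2` direction printed by `find_dimensions`.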

One last set of seeds before trying another approach.

In [None]:
# Consulting https://www.reddit.com/r/ListOfSubreddits/wiki/listofsubreddits/ to get an idea of subreddits
city1_seed = (("nyc", "toronto"), "city1")

find_dimensions(
    [city1_seed],
    "city_americanness_scores.csv"
)

In [None]:
pokemon_go_seed = (("pokemongoNYC", "pokemongo"), "pokemon_go")

# Use a separate file name so the city1 scores saved above are not overwritten
find_dimensions(
    [pokemon_go_seed],
    "pokemon_go_americanness_scores.csv"
)

# Some results don't seem to be related to how American the subreddits are.
# Ex: CabaloftheBuildsmiths -> pcmasterrace

In [None]:
# r/seattle isn't in the dataset but SeattleWA is
city2_seed = (("SeattleWA", "vancouver"), "city2")

find_dimensions(
    [city2_seed],
    "city2_americanness_scores.csv"
)

# Generated seed onguardforthee -> britishcolumbia despite both being Canadian subreddits

## On the use of dimensions for finding Americanness

One factor making this difficult is that few subreddits focused on non-U.S. content have a U.S.-focused counterpart. For example, r/india could be the 2nd value of a seed, but r/america is a small subreddit and isn't included in the vectors.

Another factor is the erroneous generated seeds, such as `onguardforthee -> britishcolumbia`, where both subreddits are Canadian-focused and so the pair says nothing about being American. This could be the product of a bad seed, but that seems unlikely: Seattle is an American city and Vancouver is a Canadian one, and there is talk of Vancouver sometimes being used as a filming location to give the impression of being in Seattle[2]. Another possibility is that how American-focused a subreddit is isn't ordered along a single dimension, but rather exists in clusters.

I see a couple of ways to go forward:
1. Attempt some kind of clustering or classification method on the vectors, to determine if subreddits are focused on countries that aren't the U.S.
    - Requires manual labelling of at least some of the subreddits. Input features can be the Word2Vec vectors for each subreddit, from "Quantifying social organization and political polarization in online platforms"[1]
2. Focus on individual posts instead of subreddits
    - Requires generating dimensions from seeds, where each seed would pair words typically used by American content or users with words that are not. Not entirely undoable, but finding words typically used by Americans requires analysis of a dataset of word usage.
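
As a rough illustration of the first option, here is a minimal sketch of a nearest-centroid classifier, chosen only because it is about the simplest classifier available and not because it is known to work on this data. It assumes a handful of manually labelled subreddits and uses made-up 2-d vectors in place of the real Word2Vec vectors. The label names `us` and `non_us` and the helpers `fit_centroids` and `predict` are hypothetical.

```python
import math

# Made-up 2-d vectors standing in for the real subreddit vectors
toy_vectors = {
    "nyc": [0.9, 0.1],
    "SeattleWA": [0.8, 0.2],
    "toronto": [0.2, 0.9],
    "vancouver": [0.1, 0.8],
    "personalfinance": [0.7, 0.3],
    "PersonalFinanceCanada": [0.3, 0.7],
}

# Manual labels for a few subreddits (the labelling step mentioned above)
manual_labels = {
    "nyc": "us",
    "SeattleWA": "us",
    "toronto": "non_us",
    "vancouver": "non_us",
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def fit_centroids(vectors, labels):
    """Average the vectors of the labelled subreddits, one centroid per label."""
    grouped = {}
    for name, label in labels.items():
        grouped.setdefault(label, []).append(vectors[name])
    return {
        label: [sum(column) / len(members) for column in zip(*members)]
        for label, members in grouped.items()
    }

def predict(vectors, centroids, name):
    """Assign the label whose centroid is most cosine-similar to the subreddit."""
    return max(centroids, key=lambda label: cosine(vectors[name], centroids[label]))

centroids = fit_centroids(toy_vectors, manual_labels)
for name in ("personalfinance", "PersonalFinanceCanada"):
    print(name, "->", predict(toy_vectors, centroids, name))
```

With real vectors, some of the manually labelled subreddits would need to be held out to check whether the predicted labels are any good.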

Note: In the Word2Vec space, communities play the role of "words" and the users who comment in them play the role of "contexts". A subreddit caters to users who share some attribute if more users with that attribute comment in it. The assumption is that users who are similar by some metric, such as age, will often comment in the same subreddits, while dissimilar users will not.
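
The assumption in the note can be illustrated with a small sketch. It uses a made-up comment log and Jaccard overlap of commenter sets as a simple stand-in for the similarity that the embedding learns. It is not the Word2Vec training procedure, and the user names, the log, and the `jaccard` helper are all hypothetical.

```python
from collections import defaultdict

# Made-up (user, subreddit) comment log
toy_comments = [
    ("alice", "nyc"), ("alice", "SeattleWA"),
    ("bob", "nyc"), ("bob", "SeattleWA"), ("bob", "toronto"),
    ("carol", "toronto"), ("carol", "vancouver"),
    ("dave", "toronto"),
]

# Collect the set of commenters for each subreddit
commenters = defaultdict(set)
for user, subreddit in toy_comments:
    commenters[subreddit].add(user)

def jaccard(a, b):
    """Share of commenters the two subreddits have in common."""
    return len(commenters[a] & commenters[b]) / len(commenters[a] | commenters[b])

# Subreddits with the same commenters look similar; those with few shared
# commenters look dissimilar.
print(jaccard("nyc", "SeattleWA"))
print(jaccard("nyc", "toronto"))
```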