# Analyzing the structure of Media Bridge–submitted comments

This notebook analyzes the comments uploaded by Media Bridge to FCC Docket 17-108, with a focus on understanding the structure behind the algorithmically-generated ones.

# Load the comments

In [1]:
import pandas as pd
import re
import json
import math
from functools import reduce

Media Bridge uploaded 1.9 million comments in total:

In [2]:
mb_comments = (
    pd.read_csv(
        "../data/bulk-uploads-17-108-with-uuids.csv",
        usecols = [ "uploader", "comments", "email_address" ],
        dtype = str,
    )
    .loc[lambda df: df["uploader"] == "shane@mediabridgellc.com"]
    .assign(
        comments = lambda df: df["comments"].str.replace(u"\xa0", " ")
    )
)

len(mb_comments)

1856553

Some, however, are duplicates. There are 1.5 million unique comments, where uniqueness is defined as the combination of the comment text and the email address associated with the comment:

In [3]:
mb_deduped = (
    mb_comments
    .drop_duplicates(subset = [ "comments", "email_address" ])
)

len(mb_deduped)

1501759
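
As a small illustration of that uniqueness rule, the sketch below applies the same `drop_duplicates` call to a hypothetical four-row DataFrame (the `toy` data and the example.com addresses are invented for illustration). A row is dropped only when both the comment text and the email address repeat.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are identical in both columns,
# row 2 repeats the text under a different email address
toy = pd.DataFrame({
    "comments": ["Repeal it.", "Repeal it.", "Repeal it.", "Keep it."],
    "email_address": [
        "a@example.com",
        "a@example.com",
        "b@example.com",
        "a@example.com",
    ],
})

deduped = toy.drop_duplicates(subset = [ "comments", "email_address" ])

# Only the exact (text, email) repeat is removed
print(len(toy), "->", len(deduped))
```

The same text submitted under two different addresses therefore still counts as two comments.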

# Separate randomized vs. non-randomized comments

About 472,000 of the comments have no internal randomization; they come from one of five pre-written variations. (One of those five has two sub-variations that differ only in formatting; as a result, there are six strings listed below.)

In [4]:
non_randomized = [
    "The Title II order created a gaping gap in privacy protections by taking the best cop, the FTC, off the beat. That is reason enough to support Chairman Pai's proposal to restore Internet freedom. Restore privacy by repealing Net Neutrality.",
    "Title II is a Depression-era regulatory framework designed for a telephone monopoly that no longer exists. It was wrong to apply it to the Internet and the FCC should repeal it and go back to the free-market approach that worked so well.",
    "The free-market Internet was an incredible engine of economic growth, innovation, and job creation since the 1990s and has already been substantially slowed by the 2015 Net Neutrality rules. The slowdown in investment is destroying jobs and risks a big future tax hike to make up for lost private investment. Save American jobs by repealing Net Neutrality.",
    "The FCC's Net Neutrality rules were written in the Obama White House by political staff and Tech Industry special interests who overruled the FCC's own experts. The FCC's own chief economist Tim Brennan called the rules \"an economics-free zone.\" They should be repealed.",
    "Obama's Net Neutrality order was the corrupt result of a corrupt process controlled by Silicon Valley special interests. It gives some of the biggest companies in the world a free ride at the expense of consumers and should be immediately repealed!",
    ' "Obama\'s Net Neutrality order was the corrupt result of a corrupt process controlled by Silicon Valley special interests. It gives some of the biggest companies in the world a free ride at the expense of consumers and should be immediately repealed!"',
]

In [5]:
mb_deduped_nonrandom = (
    mb_deduped
    .loc[lambda df: df["comments"].isin(non_randomized)]
)

len(mb_deduped_nonrandom)

471677

In [6]:
(
    mb_deduped_nonrandom
    ["comments"]
    .value_counts()
    .to_frame("count")
)

comments,count
Title II is a Depression-era regulatory framework designed for a telephone monopoly that no longer exists. It was wrong to apply it to the Internet and the FCC should repeal it and go back to the free-market approach that worked so well.,127501
"The Title II order created a gaping gap in privacy protections by taking the best cop, the FTC, off the beat. That is reason enough to support Chairman Pai's proposal to restore Internet freedom. Restore privacy by repealing Net Neutrality.",92884
"The free-market Internet was an incredible engine of economic growth, innovation, and job creation since the 1990s and has already been substantially slowed by the 2015 Net Neutrality rules. The slowdown in investment is destroying jobs and risks a big future tax hike to make up for lost private investment. Save American jobs by repealing Net Neutrality.",83072
Obama's Net Neutrality order was the corrupt result of a corrupt process controlled by Silicon Valley special interests. It gives some of the biggest companies in the world a free ride at the expense of consumers and should be immediately repealed!,74809
"The FCC's Net Neutrality rules were written in the Obama White House by political staff and Tech Industry special interests who overruled the FCC's own experts. The FCC's own chief economist Tim Brennan called the rules ""an economics-free zone."" They should be repealed.",62635
"""Obama's Net Neutrality order was the corrupt result of a corrupt process controlled by Silicon Valley special interests. It gives some of the biggest companies in the world a free ride at the expense of consumers and should be immediately repealed!""",30776


The remaining 1 million comments are, at least on their surface, unique: No two are exactly the same.

In [7]:
mb_deduped_random = (
    mb_deduped
    .loc[lambda df: ~df["comments"].isin(non_randomized)]
)

len(mb_deduped_random)

1030082

In [8]:
# If two or more comments were the same, this cell would throw an error
assert mb_deduped_random["comments"].value_counts().max() == 1

Examples:

In [9]:
print("\n\n".join(
    mb_deduped_random
    ["comments"]
    .sample(3, random_state = 0)
))

Dear Chairman Pai,  I would like to comment on Internet regulation. I strongly recommend Chairman Pai to repeal Obama's scheme to regulate the web. Americans, as opposed to Washington bureaucrats, should purchase the products they prefer. Obama's scheme to regulate the web is a betrayal of the open Internet. It stopped a free-market system that functioned supremely well for decades with broad bipartisan backing.

To the Federal Communications Commission:  I'm concerned about network neutrality regulations. I'd like to request the government to undo The previous administration's order to control the web. Individual citizens, not the FCC, should enjoy whatever products they desire. The previous administration's order to control the web is a exploitation  of net neutrality. It broke a market-based framework that functioned remarkably smoothly for many years with nearly universal backing.

Chairman Pai:  My comments re: regulations on the Internet. I'd like to suggest Ajit Pai to rescind O

# Reverse-engineer the structure of the randomized comments

The following code represents BuzzFeed News' best estimate of how the randomized comments were generated.

Each sub-list contains the possible variations, which appear to be selected (with equal weighting) at random. Sub-lists with only one item are "fixed"; they don't change from comment to comment.

One exception is a repeated phrase at the beginning of the fourth sentence of each comment; it repeats whatever happens to have been randomly selected in a particular part of the second sentence. More details on that below.

In [10]:
segments = [
    [
        "To whom it may concern:  ",
        "To the Federal Communications Commission:  ",
        "FCC:  ",
        "To the FCC:  ",
        "Dear Commissioners:  ",
        "Dear Mr. Pai,  ",
        "Dear Chairman Pai,  ",
        "Dear FCC,  ",
        "Mr Pai:  ",
        "FCC commissioners,  ",
        "Chairman Pai:  ",
        "",
    ],

    [
        "I'm concerned about",
        "I am concerned about",
        "I have concerns about",
        "I'm very concerned about",
        "I'd like to share my thoughts on",
        "Hi, I'd like to comment on",
        "I would like to comment on",
        "I want to give my opinion on",
        "I have thoughts on",
        "I'm contacting you about",
        "I'm very worried about",
        "My comments re:",
        "In reference to",
        "I am a voter worried about",
        "I'm a voter worried about",
        "Regarding",
        "With respect to",
        "In the matter of",
    ],

    [ " " ],
    
    [
        "the FCC's so-called Open Internet order",
        "Internet regulation and net neutrality",
        "the Obama takeover of the Internet",        
        "the FCC regulations on the Internet",
        "network neutrality regulations",
        "the FCC's Open Internet order",
        "the FCC rules on the Internet",
        "net neutrality and Title II",
        "Net Neutrality and Title II",
        "regulations on the Internet",
        "restoring Internet freedom",
        "net neutrality regulations",
        "Title 2 and net neutrality",
        "the future of the Internet",
        "the Open Internet order",
        "internet regulations",
        "net neutrality rules",
        "Internet regulation",
        "Network Neutrality",
        "an open Internet",
        "Internet freedom",
        "Internet Freedom",
        "Net neutrality",
        "net neutrality",
        "NET NEUTRALITY",
        "Title II rules",
    ],
    
    [ ". I" ],

    [
        "'d like to",
        " would like to",
        " want to",
        " strongly",
        "",
    ],
    
    [
        " "
    ],
    
    [
        "implore",
        "ask",
        "request",
        "urge",
        "encourage",
        "recommend",
        "suggest",
        "demand",
        "advocate",
    ],
    
    [ " " ],

    [
        "you",
        "the FCC",
        "the Federal Communications Commission",
        "the commissioners",
        "the commission",
        "Chairman Pai",
        "Ajit Pai",
        "the government"
    ],
    
    [ " to " ],
    
    [
        "undo",
        "reverse",
        "repeal",
        "overturn",
        "rescind",
    ],

    [ " " ],
    
    [
        "The previous administration's",
        "The Obama/Wheeler",
        "President Obama's",
        "Barack Obama's",
        "Tom Wheeler's",
        "Obama's",
    ],

    [ " " ],
    
    [
        "decision",
        "scheme",
        "policy",
        "order",
        "power grab",
        "plan",
    ],
    
    [ " to " ],
    
    [
        "regulate",
        "control",
        "take over",
    ],

    [ " " ],

    
    [
        "broadband",
        "the web",
        "Internet access",
        "the Internet",
    ],
    
    [ ". " ],
    
    [
        "Internet users",
        "Individual citizens",
        "People like me",
        "Citizens",
        "Individual Americans",
        "Americans",
        "Individuals",
    ],
    
    [ ", " ],
    
    [
        "rather than",
        "as opposed to",
        "not",
    ],
    
    [ " " ],
    
    [
        "Washington bureaucrats",
        "Washington",
        "big government",
        "so-called experts",
        "unelected bureaucrats",
        "the FCC Enforcement Bureau",
        "the FCC",
    ],
    
    [ ", " ],
    
    [
        "should be able to",
        "should be empowered to",
        "should be free to",
        "ought to",
        "deserve to",
        "should",
    ],
    
    [
        " ",
    ],
    
    [
        "use",
        "enjoy",
        "purchase",
        "buy",
        "select",
    ],
    
    [ " " ],
    
    [
        "the",
        "whichever",
        "whatever",
        "which",
    ],
    
    [ " " ],
        
    [
        "products",
        "applications",
        "services",
    ],
    
    [ " " ],

    [
        "they",
        "we",
    ],
    
    [ " " ],
    
    [
        "want",
        "desire",
        "prefer",
        "choose",
    ],
    
    [ ". " ],
    
    [
        "The previous administration's",
        "The Obama/Wheeler",
        "President Obama's",
        "Barack Obama's",
        "Tom Wheeler's",
        "Obama's",
    ],

    [ " " ],
    
    [
        "decision",
        "scheme",
        "policy",
        "order",
        "power grab",
        "plan",
    ],
    
    [ " to " ],
    
    [
        "regulate",
        "control",
        "take over",
    ],
    
    [ " " ],
    
    [
        "broadband",
        "the web",
        "Internet access",
        "the Internet",
    ],
    
    [ " is a " ],
    
    [
        "exploitation ",
        "distortion",
        "perversion",
        "corruption",
        "betrayal",
    ],
    
    [ " of " ],
    
    [
        "net neutrality",
        "the open Internet",
    ],
    
    [ ". It " ],
    
    [
        "disrupted",
        "undid",
        "reversed",
        "ended",
        "broke",
        "stopped",
    ],
    
    [ " a " ],
    
    [
        "light-touch",
        "pro-consumer",
        "hands-off",
        "free-market",
        "market-based",
    ],
    
    [ " " ],
    
    [
        "policy",
        "system",
        "approach",
        "framework",
    ],
    
    [ " that " ],
    
    [
        "functioned",
        "performed",
        "worked",
    ],
    
    [ " " ],
    
    [
        "supremely",
        "very, very",
        "very",
        "remarkably",
        "fabulously",
        "exceptionally",
    ],
    
    [ " " ],
    
    [
        "well",
        "successfully",
        "smoothly",
    ],
    
    [ " for " ],
    
    [
        "many years",
        "decades",
        "a long time",
        "two decades",
    ],
    
    [ " with " ],
    
    [
        "nearly universal",
        "broad bipartisan",
        "bipartisan",
        "both parties'",
        "Republican and Democrat",
    ],

    [ " " ],
    
    [
        "support",
        "consensus",
        "approval",
        "backing",
    ],
    
    [ "." ]
    
]
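
To make the hypothesized process concrete, here is a minimal sketch of a generator that would produce comments with this structure. It illustrates the model, not Media Bridge's actual code, which is unknown: `toy_segments`, `REPEATS`, and `generate_comment` are hypothetical names, the segment list is a shortened stand-in for the full list, and the uniform `random.choice` reflects the equal-weighting assumption.

```python
import random

# Shortened, hypothetical stand-in for the full segment list
toy_segments = [
    ["FCC:  ", "Dear FCC,  "],       # 0: greeting
    ["Please "],                      # 1: fixed
    ["repeal", "undo"],               # 2
    [" "],                            # 3: fixed
    ["Obama's", "Tom Wheeler's"],     # 4
    [" "],                            # 5: fixed
    ["plan", "order"],                # 6
    [". "],                           # 7: fixed
    ["Obama's", "Tom Wheeler's"],     # 8: repeats segment 4
    [" "],                            # 9: fixed
    ["plan", "order"],                # 10: repeats segment 6
    [" is a "],                       # 11: fixed
    ["betrayal", "distortion"],       # 12
    ["."],                            # 13: fixed
]

# Maps a repeating segment to the earlier segment whose choice it copies
REPEATS = {8: 4, 10: 6}

def generate_comment(segments, repeats, rng):
    chosen = []
    for i, options in enumerate(segments):
        if i in repeats:
            # Reuse the earlier selection instead of drawing again
            chosen.append(chosen[repeats[i]])
        else:
            # Assumed: every option is equally likely
            chosen.append(rng.choice(options))
    return "".join(chosen)

rng = random.Random(0)
for _ in range(3):
    print(generate_comment(toy_segments, REPEATS, rng))
```

Under this reading of the full model, the repeat map would tie segments 39, 41, 43, and 45 to segments 13, 15, 17, and 19.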

# Check that pattern fully matches comments

Here, we compile the comment segments into a single regular expression, which we use to check whether comments match the reverse-engineered model.

In [11]:
def segments_to_pattern(segments):
    # One capture group per segment: its options, escaped and joined by "|"
    groups = (
        r"(" + r"|".join(re.escape(option) for option in seg) + r")"
        for seg in segments
    )
    return re.compile(r"^" + r"".join(groups) + r"$")

In [12]:
pattern = segments_to_pattern(segments)

All comments match (otherwise, the result would be greater than zero):

In [13]:
sum(re.match(pattern, x) is None for x in mb_deduped_random["comments"].values)

0
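
As a small illustration of how the compiled pattern behaves, the sketch below applies the same construction to a hypothetical three-segment list (`toy_segments` and the example strings are invented for illustration). Each segment becomes one capture group of escaped alternatives, and the `^` and `$` anchors mean the whole comment has to be accounted for by the model.

```python
import re

def segments_to_pattern(segments):
    # Same construction as in the notebook: one capture group per segment
    groups = (
        "(" + "|".join(re.escape(option) for option in seg) + ")"
        for seg in segments
    )
    return re.compile("^" + "".join(groups) + "$")

# Hypothetical toy model: optional greeting, opener, fixed ending
toy_segments = [
    ["FCC:  ", ""],
    ["Regarding", "In reference to"],
    [" net neutrality."],
]
toy_pattern = segments_to_pattern(toy_segments)

# A comment built from listed options matches; the groups show
# which option was used for each segment
matched = toy_pattern.match("FCC:  Regarding net neutrality.")
print(matched.groups())

# Any text outside the listed options makes the match fail
print(toy_pattern.match("FCC:  Regarding net neutrality. Thanks!"))
```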

# Check that there are no superfluous permutations

Although the model above succeeds in matching all comments, so would a model that contained, for example, the entire English language. So here we check whether any individual part of the pattern is superfluous, by replacing each option in turn with a placeholder and seeing whether the comments still match the pattern. (Here we use a random sample of comments to speed up the process.)

In [14]:
sample_comments = (
    mb_deduped_random
    ["comments"]
    .sample(1000, random_state = 0)
)

In [15]:
# A lack of output for this cell is a good thing;
# it means no part of the model is superfluous

for i, segment in enumerate(segments):
    # For each option within each segment ...
    for j, option in enumerate(segment):
        
        # Replace the sub-part with "###", and then test
        # whether the pattern-matching fails. It should fail;
        # if it does not, then the sub-part is superfluous.
        segments_copy = list([ list(o) for o in segments ])
        segments_copy[i][j] = "###"
        new_pattern = segments_to_pattern(segments_copy)
        
        num_nonmatching_comments = sum((re.match(new_pattern, x) is None)
            for x in sample_comments.values)
        
        # If all of the comments still match after the "###" 
        # substitution, then the replaced sub-part isn't necessary
        # to the model.
        if num_nonmatching_comments == 0:
            print(i, j, option)
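
To show what a positive result from this check would look like, here is a minimal sketch on hypothetical data. The toy model deliberately includes one option ("Hello") that no toy comment uses, so the check should flag it. `toy_segments`, `toy_comments`, and `find_superfluous` are names invented for this illustration.

```python
import re

def segments_to_pattern(segments):
    groups = (
        "(" + "|".join(re.escape(option) for option in seg) + ")"
        for seg in segments
    )
    return re.compile("^" + "".join(groups) + "$")

# Hypothetical model with an option ("Hello") that never appears
toy_segments = [
    ["Hi", "Hello", "Dear FCC"],
    [", "],
    ["repeal it.", "undo it."],
]
toy_comments = [
    "Hi, repeal it.",
    "Dear FCC, undo it.",
    "Hi, undo it.",
    "Dear FCC, repeal it.",
]

def find_superfluous(segments, comments):
    unused = []
    for i, segment in enumerate(segments):
        for j, option in enumerate(segment):
            # Swap one option for a placeholder that matches no comment
            trial = [list(seg) for seg in segments]
            trial[i][j] = "###"
            pattern = segments_to_pattern(trial)
            # If every comment still matches, the option was never needed
            if all(pattern.match(c) for c in comments):
                unused.append((i, j, option))
    return unused

print(find_superfluous(toy_segments, toy_comments))
```

Note that the fixed segment `", "` is not flagged: replacing it breaks every comment, just as replacing a fixed segment in the full model would.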

# Check that segments are randomized independently

In some text-generation models, the value of one segment may influence the possible values (or weights for those values) of subsequent segments. Here, we check whether that appears to be true for the actual model that generated these comments.

First, we extract the bits of text that each comment has used for each section, skipping the "fixed" segments. (Here again we use a random sample of comments, to speed things up.)

In [16]:
FIXED_SEGMENT_INDEX = [ i for i, x in enumerate(segments) if len(x) == 1 ]
FIXED_SEGMENT_INDEX[:3]

[2, 4, 6]

In [17]:
def extract_permutations(comment):
    permutations = [ (i, g) for i, g in enumerate(re.match(pattern, comment).groups())
        if i not in FIXED_SEGMENT_INDEX ]
    
    return pd.DataFrame(
        permutations,
        columns = [ "seg_i", "option" ],        
    )

Example, for the first comment in the sample:

In [18]:
extract_permutations(sample_comments.iloc[0])

,seg_i,option
0,0,"Dear Chairman Pai,"
1,1,I would like to comment on
2,3,Internet regulation
3,5,strongly
4,7,recommend
5,9,Chairman Pai
6,11,repeal
7,13,Obama's
8,15,scheme
9,17,regulate


Here, we create a DataFrame of all extracted segments:

In [19]:
extracted = (
    pd.concat([ extract_permutations(x).assign(comment_i = i)
        for i, x in enumerate(sample_comments) ])
    .set_index([
        "comment_i",
        "seg_i",
    ])
    ["option"]
    .unstack()
)

extracted.head()

comment_i \ seg_i,0,1,3,5,7,9,11,13,15,17,...,49,51,53,55,57,59,61,63,65,67
0,"Dear Chairman Pai,",I would like to comment on,Internet regulation,strongly,recommend,Chairman Pai,repeal,Obama's,scheme,regulate,...,the open Internet,stopped,free-market,system,functioned,supremely,well,decades,broad bipartisan,backing
1,To the Federal Communications Commission:,I'm concerned about,network neutrality regulations,'d like to,request,the government,undo,The previous administration's,order,control,...,net neutrality,broke,market-based,framework,functioned,remarkably,smoothly,many years,nearly universal,backing
2,Chairman Pai:,My comments re:,regulations on the Internet,'d like to,suggest,Ajit Pai,rescind,Obama's,scheme,take over,...,the open Internet,stopped,free-market,system,functioned,"very, very",smoothly,decades,both parties',approval
3,"Dear Mr. Pai,","Hi, I'd like to comment on",the FCC rules on the Internet,,ask,Ajit Pai,reverse,The Obama/Wheeler,scheme,regulate,...,the open Internet,reversed,hands-off,policy,functioned,remarkably,smoothly,many years,Republican and Democrat,consensus
4,Mr Pai:,I'm contacting you about,the FCC's Open Internet order,,request,the FCC,repeal,The Obama/Wheeler,plan,take over,...,the open Internet,reversed,light-touch,system,performed,"very, very",smoothly,many years,Republican and Democrat,backing


To test for the independence of randomization, we calculate the correlation between any two segments in a comment:

In [20]:
segment_correlations = (
    # Turn each permutation into a dummy variable
    extracted
    .pipe(pd.get_dummies)
    
    # Calculate the correlations between them
    .corr()
    .reset_index()
    .rename(columns = { "index": "seg_a" })
    
    # Melt the correlation matrix into a long/tidy DataFrame
    .melt(
        id_vars = [ "seg_a" ],
        var_name = "seg_b",
        value_name = "corr",
    )
    .assign(
        seg_int_a = lambda df: df["seg_a"].str.extract(r"^(\d+)", expand = False).astype(int),
        seg_int_b = lambda df: df["seg_b"].str.extract(r"^(\d+)", expand = False).astype(int),
    )
    
    # Take only the first correlation (A•B instead of both A•B and B•A)
    # and ignore self-correlations
    .loc[lambda df: df["seg_a"] < df["seg_b"]]
    
    # Ignore correlations within the same segment, since they are mutually exclusive
    .loc[lambda df: df["seg_int_a"] != df["seg_int_b"]]
)

segment_correlations.head()

,seg_a,seg_b,corr,seg_int_a,seg_int_b
2508,0_,"1_Hi, I'd like to comment on",0.03564,0,1
2509,0_Chairman Pai:,"1_Hi, I'd like to comment on",-0.043523,0,1
2510,"0_Dear Chairman Pai,","1_Hi, I'd like to comment on",-0.030419,0,1
2511,0_Dear Commissioners:,"1_Hi, I'd like to comment on",-0.040624,0,1
2512,"0_Dear FCC,","1_Hi, I'd like to comment on",-0.005508,0,1
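
To illustrate what this correlation test can and cannot detect, the sketch below runs the same dummy-variable approach on hypothetical data in which one column is copied from another and a third is drawn independently. The column labels `"13"`, `"39"`, and `"47"` and the two-option lists are stand-ins chosen for this example, not the real extracted data.

```python
import random
import pandas as pd

rng = random.Random(0)

rows = []
for _ in range(2000):
    who = rng.choice(["Obama's", "Tom Wheeler's"])
    rows.append({
        "13": who,
        "39": who,  # copied from "13", never drawn separately
        "47": rng.choice(["betrayal", "distortion"]),  # independent draw
    })

toy = pd.DataFrame(rows)

# One 0/1 column per (segment, option) pair, then pairwise correlations
corr = pd.get_dummies(toy).astype(float).corr()

# Copied segment: the same option always co-occurs
print(corr.loc["13_Obama's", "39_Obama's"])

# Copied segment: a different option never co-occurs
print(corr.loc["13_Obama's", "39_Tom Wheeler's"])

# Independent segment: correlation should be close to zero
print(corr.loc["13_Obama's", "47_betrayal"])
```

With only two options per segment, the mismatched pair in this toy is perfectly anti-correlated; with more options per segment, the negative correlation is weaker.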


The output below shows that only a handful of pairs have a correlation above 0.15; they are all perfect correlations, meaning that the first segment choice guarantees the second. In this case, whatever is chosen for segments `13-19` is repeated for segments `39-45`. (Segments 14, 16, etc. are fixed segments and don't vary at all.)

In [21]:
(
    segment_correlations
    .loc[lambda df: df["corr"] > 0.15]
    .sort_values("seg_a")
)

,seg_a,seg_b,corr,seg_int_a,seg_int_b
29970,13_Barack Obama's,39_Barack Obama's,1.0,13,39
30180,13_Obama's,39_Obama's,1.0,13,39
30390,13_President Obama's,39_President Obama's,1.0,13,39
30600,13_The Obama/Wheeler,39_The Obama/Wheeler,1.0,13,39
30810,13_The previous administration's,39_The previous administration's,1.0,13,39
31020,13_Tom Wheeler's,39_Tom Wheeler's,1.0,13,39
31230,15_decision,41_decision,1.0,15,41
31440,15_order,41_order,1.0,15,41
31650,15_plan,41_plan,1.0,15,41
31860,15_policy,41_policy,1.0,15,41


The output below shows that no segment pairs have a correlation below -0.15, other than the possibilities inherently excluded by the perfect correlations above.

In [22]:
(
    segment_correlations
    .loc[lambda df: df["corr"] < -0.15]
    .sort_values("seg_a")
)

,seg_a,seg_b,corr,seg_int_a,seg_int_b
31015,13_Barack Obama's,39_Tom Wheeler's,-0.203366,13,39
30806,13_Barack Obama's,39_The previous administration's,-0.191720,13,39
30179,13_Barack Obama's,39_Obama's,-0.204086,13,39
30597,13_Barack Obama's,39_The Obama/Wheeler,-0.204086,13,39
30388,13_Barack Obama's,39_President Obama's,-0.200477,13,39
...,...,...,...,...,...
33331,19_the Internet,45_broadband,-0.327781,19,45
33749,19_the Internet,45_the web,-0.337228,19,45
33123,19_the web,45_Internet access,-0.338160,19,45
33332,19_the web,45_broadband,-0.355864,19,45


In [23]:
(
    segment_correlations
    .loc[lambda df: df["corr"] < -0.15]
    .loc[lambda df: ~df["seg_int_a"].isin([ 13, 15, 17, 19 ])]
)

,seg_a,seg_b,corr,seg_int_a,seg_int_b


# Show the repeated segments

Segments `13-19`:

In [24]:
print(json.dumps(segments[13:20], indent = 2))

[
  [
    "The previous administration's",
    "The Obama/Wheeler",
    "President Obama's",
    "Barack Obama's",
    "Tom Wheeler's",
    "Obama's"
  ],
  [
    " "
  ],
  [
    "decision",
    "scheme",
    "policy",
    "order",
    "power grab",
    "plan"
  ],
  [
    " to "
  ],
  [
    "regulate",
    "control",
    "take over"
  ],
  [
    " "
  ],
  [
    "broadband",
    "the web",
    "Internet access",
    "the Internet"
  ]
]


Segments `39-45`:

In [25]:
print(json.dumps(segments[39:46], indent = 2))

[
  [
    "The previous administration's",
    "The Obama/Wheeler",
    "President Obama's",
    "Barack Obama's",
    "Tom Wheeler's",
    "Obama's"
  ],
  [
    " "
  ],
  [
    "decision",
    "scheme",
    "policy",
    "order",
    "power grab",
    "plan"
  ],
  [
    " to "
  ],
  [
    "regulate",
    "control",
    "take over"
  ],
  [
    " "
  ],
  [
    "broadband",
    "the web",
    "Internet access",
    "the Internet"
  ]
]


# Calculate possible permutations

Below, we calculate the total possible permutations, with care to exclude the perfectly correlated segments (which we do by simply removing them from the calculation).

In [26]:
def calculate_permutations(segments):
    count = reduce(lambda x, y: x * y, map(len, segments))
    print(f"Total permutations: {count:,d}")
    
    log = math.log10(count)
    print(f"Log10: {log:.2f}")

In [27]:
def remove_segments(segments, indices):
    return [ s for i, s in enumerate(segments) if i not in indices ]

In [28]:
calculate_permutations(remove_segments(segments, [ 39, 41, 43, 45 ]))

Total permutations: 9,584,250,725,597,184,000,000
Log10: 21.98
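
The arithmetic behind that figure can be spelled out as a plain product of option counts. In the sketch below, `option_counts` is a hand-transcribed list of the number of options in each independently chosen segment of the model (an assumption of this example; fixed segments contribute a factor of 1 and are left out, as are the repeated segments 39, 41, 43, and 45).

```python
import math

# Number of options in each independently chosen segment
option_counts = [
    12, 18, 26, 5, 9, 8, 5,            # segments 0, 1, 3, 5, 7, 9, 11
    6, 6, 3, 4,                        # segments 13, 15, 17, 19
    7, 3, 7, 6, 5, 4, 3, 2, 4,         # segments 21 through 37
    5, 2, 6, 5, 4, 3, 6, 3, 4, 5, 4,   # segments 47 through 67
]

total = math.prod(option_counts)
print(f"Total permutations: {total:,d}")
print(f"Log10: {math.log10(total):.2f}")

# Counting the repeated segments as if they were independent
# would inflate the total by a factor of 6 * 6 * 3 * 4
overcount_factor = 6 * 6 * 3 * 4
print(f"Overcount factor if repeats were included: {overcount_factor}")
```

This reproduces the total printed above, and shows that treating the repeated phrase as independent would overstate the count by a factor of 432.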

