# Fake News Ad Tracker Analysis

[Please see here for context.](https://github.com/BuzzFeedNews/2017-04-fake-news-ad-trackers)

In [1]:
import pandas as pd

In [2]:
trackers = pd.read_csv("../data/observed-trackers.csv")

In [3]:
trackers.head()

,when,domain,url,tracker
0,before-nov-2016,alertchild.com,http://wayback.archive.org/web/20161004072420i...,DoubleClick
1,before-nov-2016,alertchild.com,http://wayback.archive.org/web/20161004072420i...,RevContent
2,before-nov-2016,alertchild.com,http://wayback.archive.org/web/20161004072420i...,DoubleClick
3,before-nov-2016,alertchild.com,http://wayback.archive.org/web/20161004072420i...,Google Syndication
4,before-nov-2016,alertchild.com,http://wayback.archive.org/web/20161004072420i...,DoubleClick Ad Exchange-Seller


## Selecting before/after-comparable domains

To make the two time frames comparable, we remove two types of sites:

- Sites with no observed trackers in the "before" period
- Sites that had disappeared by the "after" period
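
As a small illustration of the two criteria above, here is a sketch on made-up data. The `*.example` domains and the `toy_dead` list are hypothetical and are not part of the dataset.

```python
import pandas as pd

# Hypothetical toy data: c.example has no "before" observations at all
toy = pd.DataFrame({
    "when": ["before-nov-2016", "march-2017", "before-nov-2016", "march-2017"],
    "domain": ["a.example", "a.example", "b.example", "c.example"],
    "tracker": ["DoubleClick", "DoubleClick", "RevContent", "Quantcast"],
})
toy_dead = ["b.example"]  # hypothetical: gone by the "after" period

# Keep domains seen with a tracker "before" that are not in the dead list
keep = toy[
    (toy["when"] == "before-nov-2016") &
    ~toy["domain"].isin(toy_dead)
]["domain"].unique()

print(list(keep))
```

Only `a.example` survives: `b.example` is dropped because it is listed as dead, and `c.example` because it has no rows in the "before" period.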

In [4]:
DEAD_DOMAINS = [
    "abcnews.com.co", 
    "alynews.com", 
    "areyouasleep.com", 
    "baltimoregazette.com", 
    "channel16news.com", 
    "channel17news.com", 
    "channel18news.com", 
    "christiantimesnewspaper.com", 
    "clancyreport.com", 
    "dailynews11.com", 
    "denverguardian.com", 
    "heaviermetal.net", 
    "kspm33.com", 
    "kupr7.com", 
    "ky6news.com", 
    "kypo6.com", 
    "mbynews.com", 
    "mckenziepost.com", 
    "msnbc.website",
    "newsbuzzdaily.com",
    "newsnow17.com", 
    "newswatch33.com", 
    "oreillypost.com", 
    "scrapetv.com", 
    "thebostontribune.com", 
    "thereporterz.com", 
    "wleb21.com", 
]

In [5]:
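# Domains with at least one tracker observed before Nov. 2016,
# excluding those that had disappeared by March 2017
comparable_domains = trackers[
    (trackers["when"] == "before-nov-2016") &
    ~trackers["domain"].isin(DEAD_DOMAINS)
]["domain"].unique()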

Of the 104 domains in the dataset, we deemed 51 to be before/after-comparable:

In [6]:
len(trackers["domain"].unique())

104

In [7]:
len(comparable_domains)

51

In [8]:
comparable_trackers = trackers[
    trackers["domain"].isin(comparable_domains)
]

## Tracker Matrix

Here we create a boolean matrix, with one row per domain and one column per tracker and time period, indicating which trackers were found on which websites:

In [9]:
tracker_matrix = comparable_trackers.groupby([
    "domain",
    "when",
    "tracker"
]).size().unstack().unstack() > 0
tracker_matrix.head()

tracker,AWeber,AWeber,Acuity Ads,Acuity Ads,Acxiom,Acxiom,AdGear,AdGear,AdMarvel,AdMarvel,...,eyeReturn Marketing,eyeReturn Marketing,gumgum,gumgum,i-Behavior,i-Behavior,myThings,myThings,sovrn (formerly Lijit Networks),sovrn (formerly Lijit Networks)
when,before-nov-2016,march-2017,before-nov-2016,march-2017,before-nov-2016,march-2017,before-nov-2016,march-2017,before-nov-2016,march-2017,...,before-nov-2016,march-2017,before-nov-2016,march-2017,before-nov-2016,march-2017,before-nov-2016,march-2017,before-nov-2016,march-2017
domain,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2,Unnamed: 16_level_2,Unnamed: 17_level_2,Unnamed: 18_level_2,Unnamed: 19_level_2,Unnamed: 20_level_2,Unnamed: 21_level_2
adobochronicles.com,False,False,True,False,False,False,True,False,False,False,...,True,False,True,False,True,False,True,False,True,False
alertchild.com,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
areyousleep.com,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
badcriminals.com,False,False,False,False,True,True,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
bizstandardnews.com,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
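
To make the `groupby(...).size().unstack().unstack() > 0` chain easier to follow, here is a sketch on made-up data (the `*.example` domains are hypothetical). It assumes, as in the data above, that the same tracker can appear more than once for a domain and period.

```python
import pandas as pd

# Hypothetical toy data; DoubleClick is observed twice on a.example "before"
toy = pd.DataFrame({
    "domain": ["a.example", "a.example", "a.example", "b.example"],
    "when": ["before-nov-2016", "before-nov-2016", "march-2017", "march-2017"],
    "tracker": ["DoubleClick", "DoubleClick", "DoubleClick", "Quantcast"],
})

# Count rows per (domain, when, tracker); duplicates simply raise the count
counts = toy.groupby(["domain", "when", "tracker"]).size()

# Move "tracker", then "when", into the columns. Combinations that were
# never observed become NaN, and NaN > 0 evaluates to False.
toy_matrix = counts.unstack().unstack() > 0

print(toy_matrix)
```

The comparison with zero is what collapses repeated observations and missing combinations into a plain present/absent flag.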


In [10]:
pd.DataFrame(
    tracker_matrix.values,
    index=tracker_matrix.index,
    columns=list(map("|".join, tracker_matrix.columns.values))
).to_csv("../output/tracker-matrix.csv")
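
The export above flattens the two-level column index into single strings before writing the CSV. A minimal sketch of that flattening step, using a hand-built two-column index for illustration:

```python
import pandas as pd

# A small (tracker, when) column index, built by hand for illustration
cols = pd.MultiIndex.from_tuples(
    [("DoubleClick", "before-nov-2016"), ("DoubleClick", "march-2017")],
    names=["tracker", "when"],
)

# Each column label is a tuple of strings; join its parts with "|"
flat = list(map("|".join, cols.values))

print(flat)
# ['DoubleClick|before-nov-2016', 'DoubleClick|march-2017']
```

A single header row of this form is easier to read back with tools that do not understand multi-row CSV headers.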

## Tracker Net Changes

Here we calculate, for each tracker, the net change in the number of websites on which it was observed:

In [11]:
tracker_counts = tracker_matrix.sum().unstack()\
    .assign(change=lambda x: x["march-2017"] - x["before-nov-2016"])
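
The calculation relies on the fact that summing a boolean column counts its `True` values. A sketch on a hypothetical three-site matrix (the domains and values are invented for illustration):

```python
import pandas as pd

cols = pd.MultiIndex.from_product(
    [["DoubleClick", "Quantcast"], ["before-nov-2016", "march-2017"]],
    names=["tracker", "when"],
)
# Hypothetical presence matrix: rows are sites, columns are (tracker, when)
toy_matrix = pd.DataFrame(
    [[True, True, False, True],
     [True, False, False, True],
     [True, False, False, False]],
    index=pd.Index(["a.example", "b.example", "c.example"], name="domain"),
    columns=cols,
)

# sum() counts sites per (tracker, when); unstack() puts "when" in the columns
toy_counts = toy_matrix.sum().unstack().assign(
    change=lambda x: x["march-2017"] - x["before-nov-2016"]
)

print(toy_counts)
```

In this toy case DoubleClick goes from 3 sites to 1 (a change of -2) and Quantcast from 0 to 2 (a change of 2).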

### Most common trackers before Nov. 2016

In [12]:
tracker_counts.sort_values("before-nov-2016", ascending=False).head(10)

tracker,before-nov-2016,march-2017,change
DoubleClick,49,45,-4
Google Adsense,49,39,-10
DoubleClick Ad Exchange-Seller,46,37,-9
Google Syndication,46,31,-15
ScoreCard Research Beacon,33,35,2
BlueKai,23,17,-6
RevContent,22,18,-4
Quantcast,18,19,1
eXelate,18,20,2
Aggregate Knowledge,17,10,-7


### Trackers with largest increase

In [13]:
tracker_counts[
    tracker_counts["change"] >= 3
].sort_values(["change", "march-2017"], ascending=False)

tracker,before-nov-2016,march-2017,change
Acxiom,2,8,6
Content.ad,2,7,5
Kixer,14,18,4
AWeber,0,4,4
Spoutable,2,5,3


### Trackers with largest decrease

In [14]:
tracker_counts[
    tracker_counts["change"] <= -5
].sort_values([ "change", "march-2017" ], ascending=True)

tracker,before-nov-2016,march-2017,change
Google Syndication,46,31,-15
Google Adsense,49,39,-10
DoubleClick Ad Exchange-Seller,46,37,-9
Aggregate Knowledge,17,10,-7
Adobe Audience Manager,14,8,-6
BlueKai,23,17,-6
StickyAds,12,7,-5
Tapad,12,7,-5
Krux Digital,13,8,-5
Yahoo Ad Exchange,13,8,-5


In [15]:
tracker_counts.to_csv("../output/tracker-counts.csv")

## Tracker Statuses

Here, we create a matrix that classifies each tracker on each site into one of four categories:

- __Kept__: Had the tracker before and after
- __Removed__: Had the tracker before, but removed it
- __Added__: Didn't have the tracker before, but later added it
- __Never__: Didn't have the tracker before or after

In [16]:
def classify_status(x):
    if x["before-nov-2016"]:
        if x["march-2017"]:
            return "Kept"
        else:
            return "Removed"
    else:
        if x["march-2017"]:
            return "Added"
        else:
            return "Never"
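
For larger tables, the same four-way classification could also be done without a row-by-row `apply`. The following is a sketch of one alternative using `numpy.select`, not the method used in this notebook. The helper name `classify_statuses_vectorized` is hypothetical, and it assumes both period columns are boolean.

```python
import numpy as np
import pandas as pd

def classify_statuses_vectorized(presence):
    """Hypothetical helper: `presence` must have boolean columns
    "before-nov-2016" and "march-2017"."""
    before = presence["before-nov-2016"]
    after = presence["march-2017"]
    labels = np.select(
        [before & after, before & ~after, ~before & after],
        ["Kept", "Removed", "Added"],
        default="Never",  # neither before nor after
    )
    return pd.Series(labels, index=presence.index)

# One row per combination of before/after presence
toy_presence = pd.DataFrame({
    "before-nov-2016": [True, True, False, False],
    "march-2017": [True, False, True, False],
})
toy_labels = classify_statuses_vectorized(toy_presence)

print(list(toy_labels))
```

The four toy rows cover every combination and come out as Kept, Removed, Added, and Never, in that order.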

In [17]:
tracker_statuses = (comparable_trackers.groupby([
    "domain",
    "tracker",
    "when",
]).size().unstack() > 0).apply(classify_status, axis=1).unstack().fillna("Never")
tracker_statuses.head()

domain,AWeber,Acuity Ads,Acxiom,AdGear,AdMarvel,AdRoll,AdScale,Adap.tv,Adblade,AddThis,...,YllixMedia,[x+1],adingo,adsnative,eXelate,eyeReturn Marketing,gumgum,i-Behavior,myThings,sovrn (formerly Lijit Networks)
adobochronicles.com,Never,Removed,Never,Removed,Never,Removed,Removed,Never,Never,Removed,...,Never,Never,Removed,Never,Kept,Removed,Removed,Removed,Removed,Removed
alertchild.com,Never,Never,Never,Never,Never,Never,Never,Never,Never,Never,...,Never,Never,Never,Never,Never,Never,Never,Never,Never,Never
areyousleep.com,Never,Never,Never,Never,Never,Never,Never,Never,Added,Never,...,Never,Never,Never,Never,Never,Never,Never,Never,Never,Never
badcriminals.com,Never,Never,Kept,Never,Never,Never,Never,Never,Never,Never,...,Never,Never,Never,Never,Kept,Never,Never,Never,Never,Never
bizstandardnews.com,Never,Never,Never,Never,Never,Never,Never,Never,Never,Never,...,Never,Removed,Never,Removed,Never,Never,Never,Never,Never,Never


In [18]:
tracker_statuses.to_csv("../output/tracker-statuses.csv")

## Tracker Status Counts

Here, for each tracker, we count how many sites added it, removed it, kept it, or never had it:

In [19]:
tracker_statuses_tidy = pd.melt(tracker_statuses.reset_index(), id_vars=["domain"])
tracker_statuses_tidy.head()

,domain,tracker,value
0,adobochronicles.com,AWeber,Never
1,alertchild.com,AWeber,Never
2,areyousleep.com,AWeber,Never
3,badcriminals.com,AWeber,Never
4,bizstandardnews.com,AWeber,Never


In [20]:
tracker_statuses_pivot = tracker_statuses_tidy.groupby(["tracker", "value"]).size()\
    .unstack().fillna(0).astype(int)\
    [[ "Added", "Removed", "Kept", "Never" ]]
tracker_statuses_pivot.head()

tracker,Added,Removed,Kept,Never
AWeber,4,0,0,47
Acuity Ads,0,2,1,48
Acxiom,6,0,2,43
AdGear,1,1,1,48
AdMarvel,0,1,0,50
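
The same kind of count table can also be produced with `pd.crosstab`. The sketch below uses made-up tidy data (hypothetical `*.example` domains) and is an alternative formulation, not the code used above. It uses `reindex` so that a status absent from the data still gets a zero column, whereas selecting a missing column directly would raise a `KeyError`.

```python
import pandas as pd

statuses = ["Added", "Removed", "Kept", "Never"]

# Hypothetical tidy data in the same shape as tracker_statuses_tidy;
# note that no row has the status "Removed"
toy_tidy = pd.DataFrame({
    "domain": ["a.example", "b.example", "a.example", "b.example"],
    "tracker": ["AWeber", "AWeber", "Acxiom", "Acxiom"],
    "value": ["Added", "Never", "Kept", "Added"],
})

# groupby/size/unstack, as in the notebook, with reindex for missing statuses
by_groupby = (
    toy_tidy.groupby(["tracker", "value"]).size()
    .unstack().fillna(0).astype(int)
    .reindex(columns=statuses, fill_value=0)
)

# Equivalent cross-tabulation
by_crosstab = pd.crosstab(toy_tidy["tracker"], toy_tidy["value"]).reindex(
    columns=statuses, fill_value=0
)

print(by_crosstab)
```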


In [21]:
tracker_statuses_pivot.to_csv("../output/tracker-statuses-pivot.csv")
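
The status counts and the net-change table should agree with each other: a tracker's net change equals its additions minus its removals, and its four status counts should sum to the 51 comparable domains. The following sanity check is a sketch that is not part of the original analysis. It uses the AWeber and Acxiom rows copied from the outputs shown earlier in this notebook, and the names `pivot_rows` and `count_rows` are introduced here for illustration.

```python
import pandas as pd

trackers_shown = pd.Index(["AWeber", "Acxiom"], name="tracker")

# Rows copied from the tracker_statuses_pivot output
pivot_rows = pd.DataFrame(
    {"Added": [4, 6], "Removed": [0, 0], "Kept": [0, 2], "Never": [47, 43]},
    index=trackers_shown,
)
# Rows copied from the tracker_counts output
count_rows = pd.DataFrame(
    {"before-nov-2016": [0, 2], "march-2017": [4, 8], "change": [4, 6]},
    index=trackers_shown,
)

# Net change = additions minus removals
net = pivot_rows["Added"] - pivot_rows["Removed"]
# Sites "before" = Kept + Removed; sites "after" = Kept + Added
before = pivot_rows["Kept"] + pivot_rows["Removed"]
after = pivot_rows["Kept"] + pivot_rows["Added"]
# Every site falls into exactly one status per tracker
total = pivot_rows.sum(axis=1)

print((net == count_rows["change"]).all())
print((total == 51).all())
```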

---
