# Competitor Dashboard: Data Check
**09/13/2017** 
Brainly and Zuoyebang data look aberrant; many data points may have been updated by SimilarWeb _ex post_. I checked the data pipeline on my end and found no issues. The following investigation summarizes my findings thus far.

## Hypothesis 1: SimilarWeb changed data _ex post_

My leading hypothesis is that **SimilarWeb has retroactively updated their inferred traffic and engagement stats**, though the reason is currently unknown.

In [1]:
# Import dependencies
import pandas as pd

# Show all rows
pd.options.display.max_rows = 5000

# Import raw data
df_new = pd.read_csv('./outfiles/2017-09-13_15:20.csv')
df_old = pd.read_csv('./outfiles/2017-08-15_14:59.csv')

print("Current data shape", df_new.shape)
print("Previous data shape", df_old.shape)

Current data shape (8934, 40)
Previous data shape (8499, 40)


I expected the row counts to differ, since I added Mindspark (in India) to the list of tracked competitors in early September. The pipeline performs the transformations and calculations, so I'm checking _only_ the raw data provided by the SimilarWeb API.
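
One way to confirm that a row-count difference comes from newly tracked sites is to compare the set of sites and the per-site row counts between the two pulls. The sketch below uses a small toy frame with made-up site names and dates (`old` and `new` are stand-ins for `df_old` and `df_new`, not the real data):

```python
import pandas as pd

# Toy stand-ins for df_old / df_new (hypothetical sites and dates)
old = pd.DataFrame({
    "group_site": ["Site A", "Site A", "Site B"],
    "date": ["2017-06-01", "2017-07-01", "2017-07-01"],
})
new = pd.DataFrame({
    "group_site": ["Site A", "Site A", "Site A", "Site B", "Site B", "Site C"],
    "date": ["2017-06-01", "2017-07-01", "2017-08-01",
             "2017-07-01", "2017-08-01", "2017-08-01"],
})

# Sites that appear only in the newer pull
added_sites = sorted(set(new["group_site"]) - set(old["group_site"]))

# Row-count change per site; a site missing from one pull counts as 0 rows
row_delta = (
    new["group_site"].value_counts()
    .sub(old["group_site"].value_counts(), fill_value=0)
    .astype(int)
    .sort_index()
)

print(added_sites)
print(row_delta.to_dict())
```

If the per-site deltas sum to the difference in `shape[0]`, the extra rows are fully accounted for by new sites and new months.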

In [2]:
# Checking which columns to index data by
df_new.columns.values

array(['group_site', 'KA_initiative', 'endpoint_category', 'date',
       'average_visit_duration', 'visits', 'LT_mins', 'norm_LT',
       'average_visit_duration_TTM_sum', 'visits_TTM_sum',
       'LT_mins_TTM_sum', 'norm_LT_TTM_sum',
       'average_visit_duration_TTM_mean', 'visits_TTM_mean',
       'LT_mins_TTM_mean', 'norm_LT_TTM_mean',
       'average_visit_duration_pct_yoy', 'visits_pct_yoy',
       'LT_mins_pct_yoy', 'norm_LT_pct_yoy',
       'average_visit_duration_TTM_sum_pct_yoy', 'visits_TTM_sum_pct_yoy',
       'LT_mins_TTM_sum_pct_yoy', 'norm_LT_TTM_sum_pct_yoy',
       'average_visit_duration_TTM_mean_pct_yoy',
       'visits_TTM_mean_pct_yoy', 'LT_mins_TTM_mean_pct_yoy',
       'norm_LT_TTM_mean_pct_yoy', 'average_visit_duration_pct_mom',
       'visits_pct_mom', 'LT_mins_pct_mom', 'norm_LT_pct_mom',
       'average_visit_duration_TTM_sum_pct_mom', 'visits_TTM_sum_pct_mom',
       'LT_mins_TTM_sum_pct_mom', 'norm_LT_TTM_sum_pct_mom',
       'average_visit_duration_TTM_m

## Validating Hypothesis 1

In [3]:
# Designate columns and indices
cols = ["group_site","KA_initiative","endpoint_category","date","visits","average_visit_duration"]
new_index = cols[0:4]

# Pared-down DataFrames
sw_df1 = df_new[cols]
sw_df0 = df_old[cols]

# Set indices (set_index returns a new frame) & merge
sw_df1 = sw_df1.set_index(new_index)
sw_df0 = sw_df0.set_index(new_index)

df = sw_df0.merge(sw_df1, how='outer', right_index=True, left_index=True, suffixes=('_old','_new'))
df

| group_site | KA_initiative | endpoint_category | date | visits_old | average_visit_duration_old | visits_new | average_visit_duration_new |
|---|---|---|---|---|---|---|---|
| ABC Mouse | DDM | mobile-web | 2015-06-01 | 3.560085e+06 | 1.614522 | 3.560085e+06 | 1.614522 |
| ABC Mouse | DDM | mobile-web | 2015-07-01 | 2.979695e+06 | 1.874000 | 2.979695e+06 | 1.874000 |
| ABC Mouse | DDM | mobile-web | 2015-08-01 | 2.575315e+06 | 1.887024 | 2.575315e+06 | 1.887024 |
| ABC Mouse | DDM | mobile-web | 2015-09-01 | 2.133031e+06 | 1.607762 | 2.133031e+06 | 1.607762 |
| ABC Mouse | DDM | mobile-web | 2015-10-01 | 1.684011e+06 | 1.827917 | 1.684011e+06 | 1.827917 |
| ABC Mouse | DDM | mobile-web | 2015-11-01 | 1.423237e+06 | 1.645300 | 1.423237e+06 | 1.645300 |
| ABC Mouse | DDM | mobile-web | 2015-12-01 | 1.206644e+06 | 1.341610 | 1.206644e+06 | 1.341610 |
| ABC Mouse | DDM | mobile-web | 2016-01-01 | 1.884471e+06 | 1.707498 | 1.884471e+06 | 1.707498 |
| ABC Mouse | DDM | mobile-web | 2016-02-01 | 1.709895e+06 | 1.640380 | 1.709895e+06 | 1.640380 |
| ABC Mouse | DDM | mobile-web | 2016-03-01 | 1.626510e+06 | 1.675154 | 2.443440e+06 | 1.674914 |
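
With an outer merge, rows that exist in only one pull show up as `NaN` on the other side, which is easy to confuse with a revised value. A minimal sketch of how pandas' `indicator=True` option can label each row's origin, using toy frames with hypothetical sites and values rather than the real data:

```python
import pandas as pd

idx_cols = ["group_site", "date"]

# Toy stand-ins for the old and new pulls (hypothetical values)
old = pd.DataFrame({
    "group_site": ["Site A", "Site A"],
    "date": ["2016-02-01", "2016-03-01"],
    "visits": [1000.0, 1200.0],
}).set_index(idx_cols)
new = pd.DataFrame({
    "group_site": ["Site A", "Site A", "Site B"],
    "date": ["2016-02-01", "2016-03-01", "2016-03-01"],
    "visits": [1000.0, 1800.0, 50.0],
}).set_index(idx_cols)

# indicator=True adds a "_merge" column: left_only, right_only, or both
merged = old.merge(new, how="outer", left_index=True, right_index=True,
                   suffixes=("_old", "_new"), indicator=True)

origin_counts = merged["_merge"].value_counts().to_dict()
print(origin_counts)
```

Rows marked `right_only` are additions in the newer pull; only rows marked `both` can represent a retroactive change.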


This boolean table compares the data to highlight any inconsistencies:

In [4]:
# Conditions that are True where a value is unchanged between updates
c1 = df["visits_old"] == df["visits_new"]
c2 = df["average_visit_duration_old"] == df["average_visit_duration_new"]

# DataFrame of inconsistencies: rows where both visits and
# average_visit_duration differ (a NaN on either side also compares as different)
df = df.sort_index(axis=1)[~(c1|c2)]
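
Exact equality on floats flags two things that are not revisions: tiny floating-point differences and rows that are missing from one pull. The sketch below shows one possible alternative, assuming a relative tolerance of `1e-6` is acceptable (that threshold is my assumption, as are the toy values):

```python
import numpy as np
import pandas as pd

# Toy merged frame (hypothetical values)
toy = pd.DataFrame({
    "visits_old": [100.0, 200.0, 300.0, np.nan],
    "visits_new": [100.0, 200.0000001, 450.0, 50.0],
})

# Exact comparison: also flags float noise and rows missing from one pull
exact_diff = toy["visits_old"] != toy["visits_new"]

# Tolerant comparison, restricted to rows present in both pulls
in_both = toy[["visits_old", "visits_new"]].notna().all(axis=1)
close = np.isclose(toy["visits_old"].to_numpy(),
                   toy["visits_new"].to_numpy(), rtol=1e-6)
revised = in_both & ~close

print(exact_diff.tolist())
print(revised.tolist())
```

In this toy case only the third row counts as revised; the second differs by float noise and the fourth is simply new.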

At first glance it seems that **the hypothesis is validated**: these _ex post_ changes affected historical data for many websites, amounting to nearly half of the data set:

In [5]:
pct_affected = df.shape[0] / df_new.shape[0]
print("{0:.2f}% of the rows show inconsistencies!".format(pct_affected * 100))

45.03% of the rows show inconsistencies!
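
The share of affected rows says how widespread the changes are, but not how large. As a rough illustration, the relative revision for a single row can be computed as `(new - old) / old`. The values below are the ABC Mouse mobile-web figures for 2016-03-01 as displayed in the merged table above, so they carry only the precision shown there:

```python
import pandas as pd

# One row as displayed in the merged table (ABC Mouse, mobile-web, 2016-03-01)
row = pd.DataFrame({
    "visits_old": [1.626510e06],
    "visits_new": [2.443440e06],
})

# Relative revision: (new - old) / old
row["visits_rel_change"] = (
    (row["visits_new"] - row["visits_old"]) / row["visits_old"]
)

rel_change = row["visits_rel_change"].iloc[0]
print("{:.1%}".format(rel_change))  # prints 50.2%
```

Applied to the whole inconsistencies frame, the same column would show whether revisions are small corrections or large restatements.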


It looks like the _ex post_ changes spanned the entire history of the dataset and every site:

In [6]:
# Summary lists (.levels holds every value in the merged index, even after filtering)
sites = df.index.levels[0]
dates = df.index.levels[3]

print('{} sites were affected:'.format(len(list(sites))))
sorted(sites)

111 sites were affected:


['ABC Mouse',
 'ABCya!',
 'Albert',
 'Amazon TenMarks',
 'Benchprep',
 'Better Lesson',
 'Blackboard',
 'Boundless',
 'BrainPOP',
 'Brainly',
 'BrightBytes',
 'Brilliant',
 'CK12',
 'ClassDojo',
 'Clever',
 'Code.org',
 'CodeHS',
 'Codecademy',
 'Coolmath.com',
 'Course Hero',
 'Coursera',
 'Descomplica',
 'Desmos',
 'Disney',
 'Dreambox',
 'Duolingo',
 'EdX',
 'Edgenuity',
 'Edmodo',
 'Education.com',
 'EngageNY',
 'ExploreLearning Reflex',
 'First in Math',
 'Formative',
 'Front Row',
 'Funbrain',
 'Geekie',
 'GiftedandTalented',
 'Google Classroom',
 'GreatMinds/Eurekamath',
 'IXL',
 'Illustrative Math',
 'Instructure Canvas',
 'K12',
 'KA (SimilarWeb)',
 'Kahoot!',
 'Kaplan',
 'Knewton',
 'LearnZillion',
 'Learning A-Z',
 'Lumosity',
 'Magoosh',
 'Mangahigh',
 'Manhattan Prep',
 'MasteryConnect',
 'Math2Me',
 'MathSpace.co',
 'Mathletics',
 'Mathway',
 'McGraw Hill ALEKS',
 'MeSalva',
 'Membean',
 'Memrise',
 'Mindspark',
 'MobyMax',
 'Moodle',
 'New Classrooms Teach to One: Math',

In [7]:
print('The following dates were affected:')
sorted(dates)

The following dates were affected:


['2015-06-01',
 '2015-07-01',
 '2015-08-01',
 '2015-09-01',
 '2015-10-01',
 '2015-11-01',
 '2015-12-01',
 '2016-01-01',
 '2016-02-01',
 '2016-03-01',
 '2016-04-01',
 '2016-05-01',
 '2016-06-01',
 '2016-07-01',
 '2016-08-01',
 '2016-09-01',
 '2016-10-01',
 '2016-11-01',
 '2016-12-01',
 '2017-01-01',
 '2017-02-01',
 '2017-03-01',
 '2017-04-01',
 '2017-05-01',
 '2017-06-01',
 '2017-07-01',
 '2017-08-01']
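
One caveat when reading site and date lists off a filtered frame: a pandas `MultiIndex` keeps its original `levels` after boolean filtering, so `.levels` can list values that no longer appear in any row. The toy example below (hypothetical sites and a made-up `changed` flag) shows the difference between `.levels`, `get_level_values(...).unique()`, and `remove_unused_levels()`:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("Site A", "2017-07-01"),
     ("Site B", "2017-07-01"),
     ("Site B", "2017-08-01")],
    names=["group_site", "date"],
)
toy = pd.DataFrame({"changed": [False, True, True]}, index=idx)

# Keep only the flagged rows; "Site A" has none
filtered = toy[toy["changed"]]

# .levels still reports every site from before the filter
stale = list(filtered.index.levels[0])

# These two reflect only the rows that remain
present = list(filtered.index.get_level_values("group_site").unique())
pruned = list(filtered.index.remove_unused_levels().levels[0])

print(stale, present, pruned)
```

If the site and date lists are meant to describe only affected rows, either of the last two forms may be the safer choice.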

In [8]:
# Reset indices to do aggregate functions
df.reset_index(inplace=True)
df.drop(["KA_initiative"], axis=1, inplace=True)

The table below counts the inconsistent dates (each date has `visits` and `average_visit_duration` data points) by site and by endpoint: mobile-web only, combined web, or desktop only.

In [33]:
# Inconsistent dates per site and endpoint
desktop_df = df.groupby(["group_site","endpoint_category"]).count()
desktop_df = desktop_df.loc[:,['date']]
desktop_df[desktop_df['date'] > 0]

| group_site | endpoint_category | date (count) |
|---|---|---|
| ABC Mouse | mobile-web | 18 |
| ABC Mouse | total-traffic-and-engagement | 18 |
| ABC Mouse | traffic-and-engagement | 1 |
| ABCya! | mobile-web | 18 |
| ABCya! | total-traffic-and-engagement | 18 |
| ABCya! | traffic-and-engagement | 1 |
| Albert | mobile-web | 18 |
| Albert | total-traffic-and-engagement | 18 |
| Albert | traffic-and-engagement | 1 |
| Amazon TenMarks | mobile-web | 18 |


Most of the changed data appears to be mobile-web (which also affects combined web, since that is the aggregate of mobile-web + desktop), so the cell below isolates changes to desktop-web data only.

(The filter keeps counts greater than `1` because August 2017 is new data for every endpoint, so every site has at least one flagged row.)

In [31]:
# Desktop-web-only traffic issues
desktop_df = df.groupby(["group_site","endpoint_category"]).count()
desktop_df = desktop_df.loc[(slice(None),['traffic-and-engagement']),['date']]
desktop_df[desktop_df['date'] > 1]

| group_site | endpoint_category | date (count) |
|---|---|---|
| Front Row | traffic-and-engagement | 18 |
| Mindspark | traffic-and-engagement | 18 |
| Toppr | traffic-and-engagement | 2 |


The first two are new additions since I last ran the pipeline. Toppr was also added relatively recently.
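
The `> 1` threshold relies on every site having exactly one flagged row for the newest month. A possible alternative, sketched here on toy data with hypothetical sites, is to drop the newest month explicitly before counting, so that any remaining row is a change to historical data:

```python
import pandas as pd

# Toy frame of flagged rows (hypothetical sites and dates)
flagged = pd.DataFrame({
    "group_site": ["Site A", "Site A", "Site B"],
    "endpoint_category": ["traffic-and-engagement"] * 3,
    "date": ["2017-07-01", "2017-08-01", "2017-08-01"],
})

# ISO-formatted date strings sort chronologically, so max() is the newest month
latest = flagged["date"].max()

# Keep only rows older than the newest month, then count per site
historical = flagged[flagged["date"] < latest]
counts = (
    historical[historical["endpoint_category"] == "traffic-and-engagement"]
    .groupby("group_site")["date"]
    .count()
)

print(counts.to_dict())
```

Here only Site A remains, because Site B's single flagged row is the newest month.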