# GLOW - Setup: Collect Baselines - Contest

## Table of Contents  <a class="anchor" id="toc"></a>

* [GLOW Wiki Baselines](#top)

    1. [Editors](#editors)
        1. [Monthly Editors](#editors_monthly)
        2. [Monthly Active Editors](#editors_active)
        3. [Monthly New Editors](#editors_new)
        4. [New editor retention](#new_editor_retention)
    2. [Articles](#articles)
        1. [Articles Count by wiki](#articles_count) 
        2. [New Articles](#new_articles)
        3. [Edits to existing articles](#article_edits)
        4. [New articles: by date/exp/survival](#new_articles_filtered)
    3. [Readers](#readers)
        1. [Pageviews](#pageviews_detailed)
    4. [Geo](#stage1b)
        1. [Monthly Unique Devices](#devices)
        2. [Edits geolocated](#editors_activity_countries)
        3. [Editors geolocated](#editors_activity_countries)
        4. [Pageviews across countries & wikis](#pageviews)

## Devices <a class="anchor" id="article_detail"></a>
[Back to Table of Contents](#toc)

#### Monthly Unique Devices <a class="anchor" id="devices"></a>
[Back to Table of Contents](#toc)

In [4]:
mca_uds_contest_r = '''
SELECT
    regexp_replace(
        regexp_replace(
            regexp_replace(domain, "zero\\\\.", ""),
        '^m\\\\.', ''),
    '\\\\.m\\\\.', '.') AS domain_name,
  SUM(uniques_estimate) / 5 AS monthly_unique_devices
FROM wmf.unique_devices_per_domain_monthly
WHERE 
    CONCAT(year,LPAD(month,2,'0')) >= ({contest_start_dt_pv})
    AND CONCAT(year,LPAD(month,2,'0')) < ({contest_end_dt_next_m_pv})
    AND country_code IN ('IN')
    AND domain IN ({india_domains})
GROUP BY    
    regexp_replace(
        regexp_replace(
            regexp_replace(domain, "zero\\\\.", ""),
        '^m\\\\.', ''),
    '\\\\.m\\\\.', '.')
''' 
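
The nested `regexp_replace` calls in the query fold the Zero-rated and mobile variants of a domain into one canonical name before grouping. A minimal Python sketch of the same three substitutions, handy for checking the patterns locally; `normalize_domain` is a hypothetical helper and the sample domains are illustrative, neither comes from the original notebook:

```python
import re

def normalize_domain(domain: str) -> str:
    """Apply the query's three substitutions in the same order (hypothetical helper)."""
    domain = re.sub(r"zero\.", "", domain)   # drop the "zero." label
    domain = re.sub(r"^m\.", "", domain)     # drop a leading "m."
    domain = re.sub(r"\.m\.", ".", domain)   # collapse an inner ".m." to "."
    return domain

# Illustrative inputs only
for d in ["hi.m.wikipedia.org", "hi.zero.wikipedia.org", "m.wikidata.org", "hi.wikipedia.org"]:
    print(d, "->", normalize_domain(d))
```

In the template the backslashes are doubled twice (`\\\\.`) because the pattern passes through a Python string literal and then a SQL string literal before it reaches the regex engine.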

## Editors<a class="anchor" id="editors"></a>
[Back to Table of Contents](#toc)

In [None]:
#https://hue.wikimedia.org/hue/metastore/table/cchen/editor_month

#### Monthly editors & monthly new editors <a class="anchor" id="editors_monthly"></a>
[Back to Table of Contents](#toc)

In [1]:
#adapted from:
#https://github.com/wikimedia-research/wiki-segmentation
#https://github.com/wikimedia-research/Editing-movement-metrics

mce_contest_r = '''
SELECT
    em.wiki AS database_code,
    COUNT(*) / 5 AS monthly_editors,
    SUM(CAST(TRUNC(em.user_registration, 'MM') = TRUNC(em.month, 'MM') AS INT)) / 5 AS monthly_new_editors
FROM florez.glow_tiger_participants gtp  
LEFT JOIN cchen.editor_month em
    ON gtp.user_name = em.user_name 
WHERE
    em.month >= "{contest_start_dt}"
    AND em.month < "{contest_end_dt_next_m}"
    AND em.wiki IN ({india_wiki_dbs}) 
    AND em.user_id != 0 
    AND em.bot_by_group = FALSE
    AND (em.user_name not regexp "bot\\b" or em.user_name in ("Paucabot", "Niabot", "Marbot"))
GROUP BY em.wiki
'''
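
The query templates leave date bounds and wiki lists as `{placeholders}` that are filled with `str.format` before execution. A minimal sketch of that step, assuming the list placeholders expect comma-separated quoted literals; `sql_in_list`, the shortened template, and the dates are all illustrative assumptions, not values from the notebook:

```python
def sql_in_list(values):
    """Render strings as comma-separated quoted SQL literals (hypothetical helper)."""
    for v in values:
        if "'" in v:
            # Keep the sketch simple: refuse values that would need escaping
            raise ValueError(f"unexpected quote in value: {v!r}")
    return ", ".join(f"'{v}'" for v in values)

# Shortened stand-in for a template such as mce_contest_r
template = '''
SELECT em.wiki AS database_code, COUNT(*) / 5 AS monthly_editors
FROM cchen.editor_month em
WHERE em.month >= "{contest_start_dt}"
  AND em.month < "{contest_end_dt_next_m}"
  AND em.wiki IN ({india_wiki_dbs})
GROUP BY em.wiki
'''

query = template.format(
    contest_start_dt="2019-10-01",        # illustrative, not the real contest start
    contest_end_dt_next_m="2020-03-01",   # illustrative exclusive upper bound
    india_wiki_dbs=sql_in_list(["hiwiki", "tawiki"]),
)
print(query)
```

Date placeholders are already wrapped in double quotes inside the template, so only the list placeholders need quoting on the Python side.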

#### Monthly New Active Editors & monthly active editors <a class="anchor" id="editors_active"></a>
[Back to Table of Contents](#toc)

In [2]:
#monthly active editors
#adapted from:
#https://github.com/wikimedia-research/wiki-segmentation
#https://github.com/wikimedia-research/Editing-movement-metrics

mnae_contest_r = '''
SELECT
    em.wiki AS database_code,
    COUNT(*) / 5 AS monthly_active_editors,
    SUM(
        CAST(TRUNC(em.user_registration, 'MM') = TRUNC(em.month, 'MM') AS INT)
        )/ 5 AS monthly_new_active_editors
FROM florez.glow_tiger_participants gtp  
LEFT JOIN cchen.editor_month em
    ON gtp.user_name = em.user_name 
WHERE
    em.content_edits >= 5 
    AND em.month >= "{contest_start_dt}"
    AND em.month < "{contest_end_dt_next_m}"
    AND em.wiki IN ({india_wiki_dbs}) 
    AND em.user_id != 0 
    AND em.bot_by_group = FALSE
    AND (em.user_name not regexp "bot\\b" or em.user_name in ("Paucabot", "Niabot", "Marbot"))  
GROUP BY em.wiki
'''
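
The "new editor" columns count rows where the registration date, truncated to the month, equals the activity month. A small Python sketch of that comparison, assuming `user_registration` is a timestamp and the activity month is a date; the helper names and sample rows are hypothetical:

```python
from datetime import date, datetime

def month_start(d):
    """Truncate a date or datetime to the first day of its month, like TRUNC(x, 'MM')."""
    return date(d.year, d.month, 1)

def is_new_editor(user_registration, activity_month):
    """True when the account was registered in the same calendar month as the activity row."""
    return month_start(user_registration) == month_start(activity_month)

# Hypothetical (user_registration, month) pairs
rows = [
    (datetime(2019, 10, 14, 8, 30), date(2019, 10, 1)),   # registered in the activity month
    (datetime(2018, 3, 2), date(2019, 10, 1)),            # registered long before
    (datetime(2019, 11, 30, 23, 59), date(2019, 11, 1)),  # registered on the month's last day
]

new_editor_rows = sum(is_new_editor(reg, month) for reg, month in rows)
print(new_editor_rows)
```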

#### Monthly Editors - including group of big wikis <a class="anchor" id="editors_big_wikis"></a>
[Back to Table of Contents](#toc)

In [3]:
mae_contest_r = '''
SELECT
    em.wiki AS database_code,
    COUNT(*) / 5 AS indic_editors_on_big_wikis_m
FROM florez.glow_tiger_participants gtp  
LEFT JOIN cchen.editor_month em
    ON gtp.user_name = em.user_name 
WHERE
    em.month >= "{contest_start_dt}"
    AND em.month < "{contest_end_dt_next_m}"
    AND em.wiki IN {wikis_big} 
    AND em.user_id != 0 
    AND em.bot_by_group = FALSE
    AND (em.user_name not regexp "bot\\b" or em.user_name in ("Paucabot", "Niabot", "Marbot"))  
GROUP BY em.wiki
'''
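
All three editor queries drop accounts whose name matches the pattern `bot\b` unless the name is one of three listed exceptions. A minimal Python sketch of that name filter, assuming the pattern reaches the SQL engine as `bot\b` and is matched case-sensitively anywhere in the name; `passes_bot_name_filter` and the test names other than the three exceptions are hypothetical:

```python
import re

BOT_PATTERN = re.compile(r"bot\b")             # "bot" followed by a word boundary
ALLOWLIST = {"Paucabot", "Niabot", "Marbot"}   # names the queries explicitly keep

def passes_bot_name_filter(user_name: str) -> bool:
    """Keep the user unless the name looks like a bot and is not allowlisted."""
    return BOT_PATTERN.search(user_name) is None or user_name in ALLOWLIST

for name in ["Paucabot", "cleanupbot", "Robotnik", "ArchiveBot"]:
    print(name, passes_bot_name_filter(name))
```

Under these assumptions a name ending in a capitalised `Bot` still passes the name check, which is one reason the queries also filter on `bot_by_group`.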

#### Monthly New Editors <a class="anchor" id="editors_new"></a>
[Back to Table of Contents](#toc)

#### New editor retention <a class="anchor" id="new_editor_retention"></a>
[Back to Table of Contents](#toc)

### Language Switching <a class="anchor" id="language_switching"></a>
[Back to Table of Contents](#toc)

## Readers<a class="anchor" id="readers"></a>
[Back to Table of Contents](#toc)

#### PageViews by referer_class and access_method <a class="anchor" id="pageviews_detailed"></a>
[Back to Table of Contents](#toc)

In [None]:
#https://hue.wikimedia.org/hue/metastore/table/florez/glow_tiger_articles

In [None]:
pv_rc_contest_r = '''
SELECT 
  country_code,
  project,
  SUM(view_count) as view_count,
  referer_class,
  CONCAT(year,LPAD(month,2,'0'),LPAD(day,2,'0')) AS view_date
FROM wmf.pageview_hourly
WHERE
  CONCAT(year,LPAD(month,2,'0')) >= {contest_start_dt_pv}
  AND CONCAT(year,LPAD(month,2,'0')) < {contest_end_dt_next_m_pv}
  AND agent_type='user'
  AND country_code = 'IN'
  AND project IN ({india_wiki_projects})
GROUP BY 
  country_code, project, referer_class, year, month, day
'''
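
The pageview queries compare a `yyyymm` key built with `CONCAT` and `LPAD` against an inclusive start and an exclusive "next month" end. A short Python sketch of building such keys, assuming integer year and month partitions; both helper names and the sample months are hypothetical:

```python
def yyyymm(year: int, month: int) -> str:
    """Build the same key as CONCAT(year, LPAD(month, 2, '0'))."""
    return f"{year}{month:02d}"

def next_month(year: int, month: int):
    """Return the (year, month) after the given one, rolling over in December."""
    return (year + 1, 1) if month == 12 else (year, month + 1)

start_key = yyyymm(2019, 7)                    # illustrative start month
end_key = yyyymm(*next_month(2019, 12))        # exclusive bound after an illustrative last month
print(start_key, end_key)
```

Zero-padding the month keeps the keys the same width, so string and numeric comparisons order them the same way.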

In [None]:
pv_all_contest_r = '''
SELECT 
  country_code,
  project,
  SUM(view_count) as view_count,
  referer_class,
  CONCAT(year,LPAD(month,2,'0'),LPAD(day,2,'0')) AS view_date
FROM wmf.pageview_hourly
WHERE
  CONCAT(year,LPAD(month,2,'0')) >= {contest_start_dt_pv}
  AND CONCAT(year,LPAD(month,2,'0')) < {contest_end_dt_next_m_pv}
  AND agent_type='user'
  AND project IN ({india_wiki_projects})
GROUP BY 
  country_code, project, referer_class, year, month, day
'''

## Articles<a class="anchor" id="articles"></a>
[Back to Table of Contents](#toc)

#### New articles <a class="anchor" id="new_articles"></a>
[Back to Table of Contents](#toc)

In [94]:
#adapted from https://github.com/wikimedia-research/2018-19-Language-annual-plan-metrics/blob/master/Language-metrics.ipynb
#https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
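# revision create = making an edit
# 6 months ≈ 26 weeks = 182 days
# Period below starts 2019/07

# Note: the filter below selects revision-create events, i.e. all edits, not only the first edit to a page.
# The LEFT JOIN to serversideaccountcreation contributes no selected columns.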

m_new_article_counts_contest_r = ''' 
select
    mh.wiki_db AS database_code,
    count(*)/5 AS mon_new_articles
from wmf.mediawiki_history mh
left join event_sanitized.serversideaccountcreation ssac
on
    ssac.event.username = event_user_text and
    ssac.year >= 0
where
    mh.snapshot = "{MWH_SNAPSHOT}"
    AND mh.event_timestamp >= "{contest_start_dt_FULL}"
    AND mh.event_timestamp < "{contest_end_dt_FULL}" 
    AND event_entity = "revision"
    AND event_type = "create"
    AND mh.wiki_db in ({india_wiki_dbs})
GROUP BY wiki_db
''' 

#### avg_num_new_articles_edited <a class="anchor" id="new_articles"></a>
[Back to Table of Contents](#toc)

In [None]:
#adapted from https://github.com/wikimedia-research/2018-19-Language-annual-plan-metrics/blob/master/Language-metrics.ipynb
#https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
# page create = making the first edit to a page
# 6 months ≈ 26 weeks = 182 days
m_new_articles_edited_contest_r = ''' 
SELECT
    wiki_db database_code,
    count(*)/5 AS mon_new_articles_edited
FROM wmf.mediawiki_history mh
WHERE
    mh.snapshot = "{MWH_SNAPSHOT}" 
    AND mh.event_timestamp >= "{contest_start_dt_FULL}"
    AND mh.event_timestamp < "{contest_end_dt_FULL}" 
    AND event_entity = "page"
    AND event_type = "create"
    AND wiki_db in ({india_wiki_dbs})
GROUP BY wiki_db
''' 

#AND ssac.webhost LIKE '%wikipedia.org'
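
The comments in this section distinguish two event kinds: a `page` create (the first edit to a page) and a `revision` create (any edit). A minimal in-memory sketch of counting each kind per wiki, using hypothetical sample rows rather than real `mediawiki_history` data:

```python
from collections import Counter

# Hypothetical sample rows with the fields the queries filter and group on
events = [
    {"wiki_db": "hiwiki", "event_entity": "page", "event_type": "create"},
    {"wiki_db": "hiwiki", "event_entity": "revision", "event_type": "create"},
    {"wiki_db": "hiwiki", "event_entity": "revision", "event_type": "create"},
    {"wiki_db": "tawiki", "event_entity": "page", "event_type": "create"},
]

def count_events(rows, entity, event_type="create"):
    """Count matching events per wiki, like COUNT(*) ... GROUP BY wiki_db."""
    return Counter(
        r["wiki_db"]
        for r in rows
        if r["event_entity"] == entity and r["event_type"] == event_type
    )

new_pages = count_events(events, "page")       # pages created
edits = count_events(events, "revision")       # edits made
print(new_pages, edits)
```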

#### Existing articles, recently edited <a class="anchor" id="article_edits"></a>
[Back to Table of Contents](#toc)

In [None]:
#adapted from https://github.com/wikimedia-research/2018-19-Language-annual-plan-metrics/blob/master/Language-metrics.ipynb
#https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
# 6 months ≈ 26 weeks = 182 days
# Period below starts 2019/07

eae_contest_r = ''' 
select
    wiki_db database_code,
    count(*)/5 as avg_n_existing_articles_edited
from wmf.mediawiki_history mh
where
    mh.snapshot = "{MWH_SNAPSHOT}"  
    AND mh.event_timestamp >= "{contest_start_dt_FULL}"
    AND mh.event_timestamp < "{contest_end_dt_FULL}" 
    AND event_entity = "revision"
    AND event_type = "create" 
    AND wiki_db in ({india_wiki_dbs})
GROUP BY wiki_db
''' 
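
Most queries in this notebook divide their totals by a hard-coded 5, which reads as the number of months in the contest window; that interpretation is an assumption, since the notebook does not state it. A small sketch of deriving the divisor from the window bounds instead, with a hypothetical helper and illustrative dates:

```python
from datetime import date

def months_in_window(start: date, end_exclusive: date) -> int:
    """Whole months between two first-of-month dates (end is exclusive)."""
    # Assumes both dates fall on the first day of a month
    return (end_exclusive.year - start.year) * 12 + (end_exclusive.month - start.month)

n_months = months_in_window(date(2019, 10, 1), date(2020, 3, 1))  # illustrative bounds
print(n_months)
```

A computed divisor could then be passed into the templates as one more placeholder, so the averages stay correct if the window changes.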

#### Daily revisions by wiki <a class="anchor" id="daily_wiki_revisions"></a>
[Back to Table of Contents](#toc)

In [None]:
##Daily revisions by wiki

#`dr` stands for "daily revisions"
dra_contest_r = ''' 
    select
        wiki_db AS database_code,
        (sum(if(metric = "daily_edits", value, 0)) - sum(if(metric = "daily_edits_by_bot_users", value, 0))) / 12 as nonbot_revs
    from wmf.mediawiki_metrics
    where
        snapshot = "{MWH_SNAPSHOT}" 
        AND dt >="{contest_start_dt_FULL}"
        AND dt <"{contest_end_dt_FULL}"
        AND metric in ("daily_edits", "daily_edits_by_bot_users")
        AND wiki_db IN ({india_wiki_dbs})
    group by wiki_db
''' 
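
In SQL, as in Python, `/` binds tighter than `-`, so the placement of parentheses in the `nonbot_revs` expression decides whether the division covers the non-bot difference or only the bot total. A short sketch of the two readings with made-up numbers; `nonbot_average` is a hypothetical helper:

```python
def nonbot_average(total_edits: float, bot_edits: float, periods: int) -> float:
    """Average non-bot edits per period: subtract first, then divide."""
    return (total_edits - bot_edits) / periods

total, bots, periods = 1200, 240, 12   # made-up values for illustration

grouped = nonbot_average(total, bots, periods)   # (1200 - 240) / 12
ungrouped = total - bots / periods               # 1200 - (240 / 12)
print(grouped, ungrouped)
```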