# Table of Contents  <a class="anchor" id="toc"></a>

Guiding Question:
Q: how many articles were created from the suggestions? 

Process:

Prep:
1. Combine and clean suggestion lists
2. groupby type: editing suggestions

    a. get pageids (not including redirects) 
    
    b. get wikidata items using the local language (for use with topics, if time permits)
    
3. groupby type: translation suggestions

    a. get wikidata items using enwiki 
    
    b. get iwlinks
    
4. Read and clean the list of articles created and submitted to GLOW (includes articles ultimately disapproved)


Analysis:
5. groupby type: editing suggestions

    a. count matches by ids - count ids that were edited during the contest period
    
6. groupby type: translation suggestions

    a. count matches by wikidata item in the suggested language
    
    b. count matches by iwlink in the suggested language
    
    c. get a sum of items a+b

NOTES:
1. Analysis notes
    a. If the editor uses Content Translation, it should automatically assign the right QID
    b. If the editor doesn't use CT, either they or someone else has to assign the QID
    c. We will miss articles that were not created via Content Translation and don't have a manually added QID and/or the editor changed the suggested title to something new. 
2. Suggestion notes
    a. From the creators of the suggestions: "As a reminder, we have 2 lists: a list of suggested topics that exist in the local language but could be edited to be more complete based on the corresponding English page, and a list of topics that can be translated from English to the local language.  Based on feedback from the initial Project Tiger, we've separated out the topics by categories so editors can focus on the areas they like to write about.  The lists are ordered by popularity of what local language users are looking for."

In [1]:
import os
from glob import glob
import pandas as pd
import numpy as np

import wmfdata as wmf
from wmfdata import charting, mariadb, hive
from wmfdata.utils import pct_str, pd_display_all
import urllib
from urllib.parse import unquote

In [2]:
#compile all of the lists

f_mask = r'../../../GLOW/data/raw/g_topic_lists/*.xlsx'

#read every sheet of every workbook; tag each row with its source file and sheet name
gtl = pd.concat(
    [sheet_df.assign(file=os.path.splitext(os.path.basename(f))[0],
                     sheet=sheet)
     for f in glob(f_mask)
     for sheet, sheet_df in pd.read_excel(f, sheet_name=None).items()],
    ignore_index=True, sort=True)

full_topic_rec_df = gtl.copy()

In [3]:
#combine Topic & entity_name into article_suggestion (entity_name wins where both exist)
full_topic_rec_df['article_suggestion'] = full_topic_rec_df['entity_name'].combine_first(full_topic_rec_df['Topic'])
#rename columns; 'sheet' becomes 'g_category' (the Google topic category)
full_topic_rec_df = full_topic_rec_df.rename(columns={'sheet':'g_category', 
                                                      'Topic': 'g_suggested_en_title', 
                                                      'entity_name': 'g_suggested_local_title',
                                                      'english_wikipedia':'english_wikipedia_URL', 
                                                      'local_wikipedia':'local_wikipedia_URL'
                                                     })

#extract wiki name & suggestion_type (translation or edit)
#full_topic_rec_df['wiki'] = full_topic_rec_df['file'].str.extract('(^[A-Z_]+([^\(-]+))', expand=True)
full_topic_rec_df[['language_name', 'suggestion_type']] = full_topic_rec_df['file'].str.split(" ", n=1, expand=True)
full_topic_rec_df['suggestion_type'] = full_topic_rec_df['file'].str.rsplit("for ").str[-1]

#extract url title
full_topic_rec_df['local_encoded_title'] = full_topic_rec_df['local_wikipedia_URL'].str.extract(r'([^/]+$)', expand=True)

#extract lang code
#full_topic_rec_df['url_language_code'] = full_topic_rec_df['local_wikipedia_URL'].str.rsplit("http://").str[-1]
#full_topic_rec_df['url_language_code'] = full_topic_rec_df['url_language_code'].str.extract('([^.]+)', expand=True)

#reorder for visual skimming's sake; .copy() so the column assignments below act on an independent frame
full_topic_rec_df = full_topic_rec_df[['article_suggestion', 'local_encoded_title','g_category', 'language_name','suggestion_type', 'file']].copy() #'url_language_code'

#replace double coded translation entries
full_topic_rec_df['suggestion_type']=full_topic_rec_df['suggestion_type'].replace('Translating EXTERNAL', 'Translation EXTERNAL')
full_topic_rec_df['language_name'] = full_topic_rec_df['language_name'].replace('Bengali', 'Bangla')
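The string handling in this cell can be checked on a single value with plain Python. The sketch below assumes a workbook name of the form `<Language> ... for <Type> EXTERNAL` and a local Wikipedia URL; both sample values are hypothetical and only illustrate the parsing.

```python
from urllib.parse import unquote

# Hypothetical inputs; the real file names and URLs are not shown in this notebook.
file_name = "Hindi topics for Editing EXTERNAL"
local_url = "https://hi.wikipedia.org/wiki/%E0%A4%AD%E0%A4%BE%E0%A4%B0%E0%A4%A4"

# First token is the language name (mirrors str.split(" ", n=1)).
language_name = file_name.split(" ", 1)[0]
# Everything after the last "for " is the suggestion type (mirrors str.rsplit("for ")).
suggestion_type = file_name.rsplit("for ", 1)[-1]
# Last path segment is the percent-encoded title; decode it and normalize spaces.
encoded_title = local_url.rsplit("/", 1)[-1]
page_title = unquote(encoded_title).replace(" ", "_")

print(language_name, suggestion_type, page_title)
```

Decoding before querying matters because the page table stores the decoded title, not the percent-encoded form found in the URL.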

In [4]:
full_topic_rec_df[full_topic_rec_df['language_name'].isnull()]

Unnamed: 0,article_suggestion,local_encoded_title,g_category,language_name,suggestion_type,file


In [5]:
#get database_code and language_code, confirm language_code (if needed)
lang_names =tuple(full_topic_rec_df['language_name'].unique())

ci = wmf.hive.run("""
SELECT  language_code, database_code, language_name
FROM canonical_data.wikis
WHERE language_name IN {lang_names} AND database_group = 'wikipedia'
""".format(lang_names=lang_names))

In [6]:
#merge
full_topic_rec_df_ci = full_topic_rec_df.merge(ci, how="left", on=['language_name'])

In [7]:
#change titles from denormalized (spaces) to normalized (underscore) for querying the page table etc.
full_topic_rec_df_ci['article_suggestion'] = full_topic_rec_df_ci['article_suggestion'].str.replace(' ', '_')

In [8]:
#if pnb article_suggestions exist, confirm language_code = url_language_code(see code above to add in url_language_code column)
#pnb	pnbwiki	Western Punjabi 

In [9]:
#filter on suggestion_type to get a separate df for each type (translation here)
translation_topic_rec_df = full_topic_rec_df_ci[full_topic_rec_df_ci['suggestion_type'] == 'Translation EXTERNAL'].copy(deep=False)

#get clean list - drop duplicates
translation_topic_rec_df_CLEAN = translation_topic_rec_df.drop_duplicates(subset=['article_suggestion', 'local_encoded_title','g_category','suggestion_type', 'language_name', 'language_code', 'database_code', 'file'], keep='first').copy(deep=False)

#keep just the duplicates - for checking data later on
translation_topic_rec_df_Dupes = pd.concat([translation_topic_rec_df, translation_topic_rec_df_CLEAN]).loc[translation_topic_rec_df.index.symmetric_difference(translation_topic_rec_df_CLEAN.index)]

In [10]:
#filter on suggestion_type to get a separate df for each type (editing here)
editing_topic_rec_df = full_topic_rec_df_ci.loc[full_topic_rec_df_ci['suggestion_type'] == 'Editing EXTERNAL'].copy(deep=False)

In [11]:
#decode the percent-encoded URL title, then normalize spaces to underscores
editing_topic_rec_df['page_title'] = editing_topic_rec_df['local_encoded_title'].apply(unquote)
editing_topic_rec_df['page_title'] = editing_topic_rec_df['page_title'].str.replace(' ', '_')

In [12]:
#get clean list - drop duplicates
editing_topic_rec_df_CLEAN = editing_topic_rec_df.drop_duplicates(subset=['article_suggestion', 'page_title','local_encoded_title','g_category', 'suggestion_type', 'language_name', 'language_code', 'database_code', 'file'], keep='first')

In [13]:
wd_vars = {}

## Editing

## Get article ids, redirects <a class="anchor" id="get_clean_list"></a>

In [14]:
nnpt = editing_topic_rec_df_CLEAN.loc[editing_topic_rec_df_CLEAN['page_title'].notnull(), ['database_code', 'page_title']]

In [15]:
nnpt['database_code'].unique()

array(['mlwiki', 'pawiki', 'hiwiki', 'orwiki', 'urwiki', 'tawiki',
       'knwiki', 'mrwiki', 'guwiki', 'tewiki', 'bnwiki'], dtype=object)

In [16]:
wikis = tuple(list(nnpt['database_code'].unique()))

wd_vars.update({'wikis': wikis})

In [17]:
pawiki_titles_normalized = tuple(list(nnpt.loc[nnpt['database_code'] == 'pawiki', 'page_title']))
mlwiki_titles_normalized = tuple(list(nnpt.loc[nnpt['database_code'] == 'mlwiki', 'page_title']))
hiwiki_titles_normalized = tuple(list(nnpt.loc[nnpt['database_code'] == 'hiwiki', 'page_title']))
orwiki_titles_normalized = tuple(list(nnpt.loc[nnpt['database_code'] == 'orwiki', 'page_title']))
urwiki_titles_normalized = tuple(list(nnpt.loc[nnpt['database_code'] == 'urwiki', 'page_title']))
tawiki_titles_normalized = tuple(list(nnpt.loc[nnpt['database_code'] == 'tawiki', 'page_title']))
knwiki_titles_normalized = tuple(list(nnpt.loc[nnpt['database_code'] == 'knwiki', 'page_title']))
mrwiki_titles_normalized = tuple(list(nnpt.loc[nnpt['database_code'] == 'mrwiki', 'page_title']))
guwiki_titles_normalized = tuple(list(nnpt.loc[nnpt['database_code'] == 'guwiki', 'page_title']))
tewiki_titles_normalized = tuple(list(nnpt.loc[nnpt['database_code'] == 'tewiki', 'page_title']))
bnwiki_titles_normalized = tuple(list(nnpt.loc[nnpt['database_code'] == 'bnwiki', 'page_title']))

In [18]:
#update the query variable to use it in queries
wd_vars.update({'pawiki_titles_normalized': pawiki_titles_normalized,
                'mlwiki_titles_normalized': mlwiki_titles_normalized,
                'hiwiki_titles_normalized': hiwiki_titles_normalized,
                'orwiki_titles_normalized': orwiki_titles_normalized,
                'urwiki_titles_normalized': urwiki_titles_normalized,
                'tawiki_titles_normalized': tawiki_titles_normalized,
                'knwiki_titles_normalized': knwiki_titles_normalized,
                'mrwiki_titles_normalized': mrwiki_titles_normalized,
                'guwiki_titles_normalized': guwiki_titles_normalized,
                'tewiki_titles_normalized': tewiki_titles_normalized,
                'bnwiki_titles_normalized': bnwiki_titles_normalized,
                })
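The eleven near-identical lines above follow one pattern: collect the titles for each database code. A sketch of the same grouping done in one pass, using plain `(database_code, page_title)` pairs with made-up titles in place of the `nnpt` dataframe; `wd_vars_sketch` is a hypothetical name.

```python
from collections import defaultdict

# Illustrative rows only; in the notebook these come from nnpt.
rows = [
    ("pawiki", "Title_A"),
    ("hiwiki", "Title_B"),
    ("pawiki", "Title_C"),
]

titles_by_wiki = defaultdict(list)
for database_code, page_title in rows:
    titles_by_wiki[database_code].append(page_title)

# One tuple per wiki, keyed the same way as the wd_vars entries.
wd_vars_sketch = {
    f"{wiki}_titles_normalized": tuple(titles)
    for wiki, titles in titles_by_wiki.items()
}
print(wd_vars_sketch)
```

Keying by database code means a wiki added to the suggestion lists later would be picked up without another hand-written line.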

In [19]:
get_ids_query = """
    SELECT 
           DATABASE() AS database_code,
           p1.page_id  AS page_id,
           p1.page_title AS page_title,
           p1.page_is_redirect AS p1_is_redirect,
           p2.page_id AS rpage_id,
           p2.page_title AS rpage_title,
           p2.page_len rpage_len,
           p2.page_is_redirect AS is_double_redirect
    FROM page AS p1 
    LEFT JOIN redirect AS rd 
        ON p1.page_id=rd.rd_from 
    LEFT JOIN page AS p2 
        ON (rd_namespace = p2.page_namespace)
            AND rd_title = p2.page_title  
    WHERE p1.page_namespace = 0
          AND p1.page_title IN {titles_normalized}
    """

In [20]:
pa_ids_r = wmf.mariadb.run(get_ids_query.format(titles_normalized = pawiki_titles_normalized), 'pawiki')
ml_ids_r = wmf.mariadb.run(get_ids_query.format(titles_normalized = mlwiki_titles_normalized), 'mlwiki')
hi_ids_r = wmf.mariadb.run(get_ids_query.format(titles_normalized = hiwiki_titles_normalized), 'hiwiki')
or_ids_r = wmf.mariadb.run(get_ids_query.format(titles_normalized = orwiki_titles_normalized), 'orwiki')
ur_ids_r = wmf.mariadb.run(get_ids_query.format(titles_normalized = urwiki_titles_normalized), 'urwiki')
ta_ids_r = wmf.mariadb.run(get_ids_query.format(titles_normalized = tawiki_titles_normalized), 'tawiki')
kn_ids_r = wmf.mariadb.run(get_ids_query.format(titles_normalized = knwiki_titles_normalized), 'knwiki')
mr_ids_r = wmf.mariadb.run(get_ids_query.format(titles_normalized = mrwiki_titles_normalized), 'mrwiki')
gu_ids_r = wmf.mariadb.run(get_ids_query.format(titles_normalized = guwiki_titles_normalized), 'guwiki')
te_ids_r = wmf.mariadb.run(get_ids_query.format(titles_normalized = tewiki_titles_normalized), 'tewiki')
bn_ids_r = wmf.mariadb.run(get_ids_query.format(titles_normalized = bnwiki_titles_normalized), 'bnwiki')
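Formatting a Python tuple straight into `IN {titles_normalized}` works when the tuple holds at least two plain strings, but a one-element tuple renders as `('x',)` and a title containing a quote breaks the literal. A minimal sketch of a helper that builds the list explicitly; `sql_in_list` is a hypothetical name, not part of wmfdata, and the escaping follows MySQL/MariaDB string rules as an assumption. Parameterized queries are preferable where the client supports them.

```python
def sql_in_list(values):
    """Render an iterable of strings as a quoted SQL list, e.g. ('a', 'b')."""
    quoted = []
    for value in values:
        # Escape backslashes first, then double any single quotes.
        escaped = str(value).replace("\\", "\\\\").replace("'", "''")
        quoted.append("'" + escaped + "'")
    if not quoted:
        raise ValueError("an empty IN list is not valid SQL")
    return "(" + ", ".join(quoted) + ")"

print(str(("Hindi",)))                   # ('Hindi',)  trailing comma, invalid SQL
print(sql_in_list(["Hindi"]))            # ('Hindi')
print(sql_in_list(["Hindi", "Bangla"]))  # ('Hindi', 'Bangla')
```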

In [21]:
nppt_ids = pd.concat([pa_ids_r, 
                      ml_ids_r,
                      hi_ids_r,
                      or_ids_r,
                      ur_ids_r,
                      ta_ids_r,
                      kn_ids_r,
                      mr_ids_r,
                      gu_ids_r,
                      te_ids_r,
                      bn_ids_r,
                     ], sort=True, ignore_index=True)

#ignore_index=True above already gives a fresh 0..n-1 index, so no reset_index is needed

In [22]:
#we do not want any duplicates here; this checks for repeated index labels (nppt_ids.duplicated() would check for repeated rows)
nppt_ids[nppt_ids.index.duplicated()]

Unnamed: 0,database_code,is_double_redirect,p1_is_redirect,page_id,page_title,rpage_id,rpage_len,rpage_title


In [23]:
# &
#check to see if any page is a redirect whose target is itself a redirect (a double redirect)
((nppt_ids['p1_is_redirect']==1) & (nppt_ids['is_double_redirect']==1)).any()

False

In [24]:
# |
#check to see if any of the page_ids are redirects or double redirects
((nppt_ids['p1_is_redirect']==1) | (nppt_ids['is_double_redirect']==1)).any()

True

In [25]:
#act on the results from nppt_ids
#start from every matched page
all_surviving_articles = nppt_ids[['page_id','page_title', 'database_code']] 
#separate the redirects into their own df
redirects = nppt_ids.loc[nppt_ids['p1_is_redirect']==1]
#keep only page_id, page_title and database_code
redirect_df = redirects[['page_id','page_title', 'database_code']] 

In [26]:
#remove the redirects from all_surviving_articles & create the global articles df
#isin() masks the redirect rows to NaN and dropna() discards them (the masking turns page_id into float64)
nppt_articles =  all_surviving_articles[~all_surviving_articles.isin(redirect_df)].dropna(how='all')

#create a new wikicode column using quality_vars['wiki_db']
#ffill could also work here
#articles['wikicode'] = quality_vars['wiki_db']

In [27]:
nppt_articles.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19304 entries, 0 to 19526
Data columns (total 3 columns):
page_id          19304 non-null float64
page_title       19304 non-null object
database_code    19304 non-null object
dtypes: float64(1), object(2)
memory usage: 603.2+ KB
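`page_id` shows up as float64 above because masking with `isin` fills the redirect rows with NaN before `dropna` removes them. A sketch of an alternative that filters with a boolean mask and so keeps the integer dtype; the three rows are invented for illustration.

```python
import pandas as pd

# Invented sample standing in for nppt_ids.
nppt_ids_sample = pd.DataFrame({
    "page_id": [11, 12, 13],
    "page_title": ["A", "B", "C"],
    "database_code": ["pawiki", "pawiki", "hiwiki"],
    "p1_is_redirect": [0, 1, 0],
})

# Keep rows that are not redirects, and only the columns needed downstream.
is_redirect = nppt_ids_sample["p1_is_redirect"] == 1
articles_sample = nppt_ids_sample.loc[
    ~is_redirect, ["page_id", "page_title", "database_code"]
]

print(articles_sample["page_id"].dtype)
print(articles_sample["page_id"].tolist())
```

Integer ids also format cleanly into a later `IN (...)` list, whereas float ids render with a trailing `.0`.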


## Editing articles - edit date

In [28]:
pawiki_ids = tuple(list(nppt_articles.loc[nppt_articles['database_code'] == 'pawiki', 'page_id']))
mlwiki_ids = tuple(list(nppt_articles.loc[nppt_articles['database_code'] == 'mlwiki', 'page_id']))
hiwiki_ids = tuple(list(nppt_articles.loc[nppt_articles['database_code'] == 'hiwiki', 'page_id']))
orwiki_ids = tuple(list(nppt_articles.loc[nppt_articles['database_code'] == 'orwiki', 'page_id']))
urwiki_ids = tuple(list(nppt_articles.loc[nppt_articles['database_code'] == 'urwiki', 'page_id']))
tawiki_ids = tuple(list(nppt_articles.loc[nppt_articles['database_code'] == 'tawiki', 'page_id']))
knwiki_ids = tuple(list(nppt_articles.loc[nppt_articles['database_code'] == 'knwiki', 'page_id']))
mrwiki_ids = tuple(list(nppt_articles.loc[nppt_articles['database_code'] == 'mrwiki', 'page_id']))
guwiki_ids = tuple(list(nppt_articles.loc[nppt_articles['database_code'] == 'guwiki', 'page_id']))
tewiki_ids = tuple(list(nppt_articles.loc[nppt_articles['database_code'] == 'tewiki', 'page_id']))
bnwiki_ids = tuple(list(nppt_articles.loc[nppt_articles['database_code'] == 'bnwiki', 'page_id']))

#update the query variable to use it in queries
wd_vars.update({'pawiki_ids': pawiki_ids,
                     'mlwiki_ids': mlwiki_ids,
                     'hiwiki_ids': hiwiki_ids,
                     'orwiki_ids': orwiki_ids,
                     'urwiki_ids': urwiki_ids,
                     'tawiki_ids': tawiki_ids,
                     'knwiki_ids': knwiki_ids,
                     'mrwiki_ids': mrwiki_ids,
                     'guwiki_ids': guwiki_ids,
                     'tewiki_ids': tewiki_ids,
                     'bnwiki_ids': bnwiki_ids,
                    })

In [29]:
#https://www.mediawiki.org/wiki/Manual:Revision_table#rev_timestamp
#https://www.mediawiki.org/wiki/Manual:Timestamp

#contest dates: 10th oct 2019 to 11th jan 2020; the query only applies the start (20191010000000)
#timestamps are yyyymmddhhmmss, e.g. August 9th, 2010 00:30:06 is 20100809003006

get_edits_query = """
    SELECT 
        DATABASE() AS database_code,
        page_title,
        page_id,
        DATE_FORMAT(rev_timestamp,"%y-%m-%d") AS edit_date
    FROM page
    JOIN revision ON rev_page = page.page_id
    WHERE rev_timestamp > '20191010000000'
        AND (rev_deleted & 4) = 0
        AND rev_page IN {ids}
"""

pa_edits_r= wmf.mariadb.run(get_edits_query.format(ids = pawiki_ids), 'pawiki')
ml_edits_r= wmf.mariadb.run(get_edits_query.format(ids = mlwiki_ids), 'mlwiki')
hi_edits_r= wmf.mariadb.run(get_edits_query.format(ids = hiwiki_ids), 'hiwiki')
or_edits_r= wmf.mariadb.run(get_edits_query.format(ids = orwiki_ids), 'orwiki')
ur_edits_r= wmf.mariadb.run(get_edits_query.format(ids = urwiki_ids), 'urwiki')
ta_edits_r= wmf.mariadb.run(get_edits_query.format(ids = tawiki_ids), 'tawiki')
kn_edits_r= wmf.mariadb.run(get_edits_query.format(ids = knwiki_ids), 'knwiki')
mr_edits_r= wmf.mariadb.run(get_edits_query.format(ids = mrwiki_ids), 'mrwiki')
gu_edits_r= wmf.mariadb.run(get_edits_query.format(ids = guwiki_ids), 'guwiki')
te_edits_r= wmf.mariadb.run(get_edits_query.format(ids = tewiki_ids), 'tewiki')
bn_edits_r= wmf.mariadb.run(get_edits_query.format(ids = bnwiki_ids), 'bnwiki')
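MediaWiki stores `rev_timestamp` as a 14-character `yyyymmddhhmmss` string, which is what the literal in the query encodes. A small sketch of converting to and from that format with the standard library; the helper names are hypothetical.

```python
from datetime import datetime

MW_FORMAT = "%Y%m%d%H%M%S"  # yyyymmddhhmmss

def to_mw_timestamp(dt):
    """Format a datetime as a MediaWiki timestamp string."""
    return dt.strftime(MW_FORMAT)

def from_mw_timestamp(ts):
    """Parse a MediaWiki timestamp string back into a datetime."""
    return datetime.strptime(ts, MW_FORMAT)

print(to_mw_timestamp(datetime(2019, 10, 10)))  # 20191010000000
print(from_mw_timestamp("20100809003006"))      # 2010-08-09 00:30:06
```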


In [30]:
nppt_articles_edits = pd.concat([pa_edits_r, 
                      ml_edits_r,
                      hi_edits_r,
                      or_edits_r,
                      ur_edits_r,
                      ta_edits_r,
                      kn_edits_r,
                      mr_edits_r,
                      gu_edits_r,
                      te_edits_r,
                      bn_edits_r,
                     ], sort=True, ignore_index=True)

#ignore_index=True above already gives a fresh 0..n-1 index, so no reset_index is needed

nppt_articles_edits['edit_date'] = pd.to_datetime(nppt_articles_edits['edit_date'], format="%y-%m-%d")

In [31]:
nppt_articles_edits.groupby(['database_code', 'page_id']).ngroups

7103

In [32]:
print(nppt_articles_edits.groupby('database_code')['page_id'].nunique())

database_code
bnwiki     940
guwiki     123
hiwiki    2401
knwiki     232
mlwiki     377
mrwiki     514
orwiki      16
pawiki     193
tawiki     821
tewiki    1311
urwiki     175
Name: page_id, dtype: int64


In [33]:
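#both bounds are exclusive; the upper bound used here is 11 Feb 2020, whereas the comment in the query cell gives 11 Jan 2020
contest_edits = nppt_articles_edits[(nppt_articles_edits['edit_date'] > '2019-10-10') & (nppt_articles_edits['edit_date'] < '2020-02-11')]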

In [34]:
contest_edits.groupby(['database_code', 'page_id']).ngroups

4608

In [35]:
print(contest_edits.groupby('database_code')['page_id'].nunique())

database_code
bnwiki    859
guwiki    104
hiwiki    909
knwiki    184
mlwiki    317
mrwiki    407
orwiki     15
pawiki    184
tawiki    673
tewiki    797
urwiki    159
Name: page_id, dtype: int64
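The date filter above uses strict comparisons on date-only values, so an edit made on the lower-bound day itself is excluded. A sketch of a window check with an inclusive start and an exclusive end; the bounds below are placeholders based on the dates in the query cell's comment, not a confirmed contest schedule.

```python
from datetime import date

def in_window(edit_date, start, end):
    """True when start <= edit_date < end (inclusive start, exclusive end)."""
    return start <= edit_date < end

# Placeholder bounds; adjust to the actual contest dates.
start, end = date(2019, 10, 10), date(2020, 1, 12)

edits = [date(2019, 10, 10), date(2019, 12, 1), date(2020, 1, 11), date(2020, 2, 1)]
kept = [d for d in edits if in_window(d, start, end)]
print(kept)
```

Setting the exclusive end to the day after the last contest day keeps every edit made on that last day.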


In [36]:
nppt_articles_edits_contest = contest_edits[['page_title', 'page_id', 'database_code']]

In [37]:
nppt_articles_edits_contest_unique = nppt_articles_edits_contest.drop_duplicates(subset=['page_title', 'page_id', 'database_code'], keep='first').copy(deep=False)

## > How many articles from the Google list were edited since the start of the GLOW contest? 

#### 7,103 articles from the Google-provided 'editing' list have been edited since the contest started; 4,608 of these were edited during the contest

In [38]:
test = wmf.mariadb.run("""
SELECT 
    page_id,
    DATE_FORMAT(rev_timestamp,"%y-%m-%d") AS edit_date,
    revactor_actor
FROM revision_actor_temp
JOIN revision ON(revactor_rev = rev_id AND revactor_page = rev_page)
JOIN page ON rev_page = page.page_id
WHERE rev_page = page_id
    AND rev_timestamp > 20191010000000 
    AND (rev_deleted & 4) = 0
GROUP BY revactor_rev
LIMIT 10
""", 'pawiki')

In [39]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 3 columns):
page_id           10 non-null int64
edit_date         10 non-null object
revactor_actor    10 non-null int64
dtypes: int64(2), object(1)
memory usage: 368.0+ bytes


## QIDs & Sitelinks

### wikidata Q item

In [40]:
#https://www.mediawiki.org/wiki/Wikibase/Schema/wb_items_per_site
#https://www.mediawiki.org/wiki/Manual:Page_table
#wb_items_per_site site:quarry.wmflabs.org

## QUERY CLEAN EDITING SUBLIST

In [41]:
#change titles from normalized (underscores) to denormalized (spaces): wb_items_per_site stores titles with spaces
#nppt_articles['article_suggestion'] = nppt_articles['article_suggestion'].str.replace('_', ' ')
nppt_articles['page_title'] = nppt_articles['page_title'].str.replace('_', ' ')
#create tuples of the denormalized page titles and the database codes to use when querying for the wikidata items
editing_titles_denormalized_CLEAN = tuple(editing_topic_rec_df_CLEAN['page_title'].str.replace('_', ' '))
editing_titles_denormalized_database_codes_CLEAN = tuple(editing_topic_rec_df_CLEAN['database_code'].unique())

#set up a dict variable to use with .format when querying
wd_vars.update({
    'editing_titles_denormalized' : editing_titles_denormalized_CLEAN,
    'editing_titles_denormalized_db_codes' : editing_titles_denormalized_database_codes_CLEAN,
})

In [42]:
qid_editing_CLEAN_r = wmf.mariadb.run("""
SELECT
      ips_site_page AS page_title,
      ips_item_id AS QID,
      ips_site_id AS database_code
FROM  wb_items_per_site  
WHERE ips_site_id IN {editing_titles_denormalized_db_codes} AND
      ips_site_page IN {editing_titles_denormalized}
""".format(**wd_vars), "wikidatawiki")

In [43]:
#merge the QID query results with nppt_articles on title and wiki
#editing_topic_rec_df_CLEAN_ids_q = editing_topic_rec_df_CLEAN_ids.merge(qid_r2_editing_CLEAN, how="left", on=['page_title', 'database_code'])
nppt_articles_q = qid_editing_CLEAN_r.merge(nppt_articles, how="left", on=['page_title', 'database_code'])

## QUERY TRANSLATION SUBLIST FOR QITEMS ON ENWIKI

In [44]:
translation_topic_rec_df_CLEAN['article_suggestion'] = translation_topic_rec_df_CLEAN['article_suggestion'].str.replace('_', ' ')
titles_denormalized_translation_CLEAN = tuple(list(translation_topic_rec_df_CLEAN['article_suggestion']))

In [45]:
#get qids for translation articles
qid_en_CLEAN_r = wmf.mariadb.run("""
SELECT
  ips_site_page AS article_suggestion,
  ips_item_id AS QID
FROM  wb_items_per_site  
WHERE ips_site_id = 'enwiki' 
  AND ips_site_page IN {titles_denormalized_translation_CLEAN}
""".format(titles_denormalized_translation_CLEAN=titles_denormalized_translation_CLEAN), "wikidatawiki")

## QUERY translation rec qids for sitelinks

In [46]:
translation_qids = tuple(list(qid_en_CLEAN_r['QID']))

#set up a dict variable to use with .format when querying
wd_vars.update({
    'translation_qids' : translation_qids})

In [47]:
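#https://www.wikidata.org/wiki/Help:Sitelinks
#https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q42&props=sitelinks
#for each suggested QID that already has a sitelink on at least one contest wiki, list and count its sitelinks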

iwl_r = wmf.mariadb.run("""
SELECT
  linked_item.ips_item_id AS QID,
  GROUP_CONCAT(ips_site_id SEPARATOR ', ') AS iwsites,
  COUNT(ips_site_page) AS iwsitelinks
FROM (
      SELECT ips_item_id
      FROM wb_items_per_site
      WHERE ips_item_id IN {translation_qids}
      AND ips_site_id IN {wikis}
    ) AS linked_item
LEFT JOIN wb_items_per_site 
  ON linked_item.ips_item_id = wb_items_per_site.ips_item_id
LEFT JOIN page 
  ON linked_item.ips_item_id = page.page_id
GROUP BY page_id
""".format(**wd_vars), "wikidatawiki")

In [48]:
#merge to get df with article_suggestion, QID, iwsites, iwsitelinks
t_iwl_q = iwl_r.merge(qid_en_CLEAN_r, how="left", on=['QID'])

In [49]:
#merge the translation-interwikilinks-QID df (t_iwl_q) into translation_topic_rec_df_CLEAN
t_rec_iwl_q = t_iwl_q.merge(translation_topic_rec_df_CLEAN, how="left", on=['article_suggestion'])

In [50]:
#check for duplicates
t_rec_iwl_q[t_rec_iwl_q.duplicated()]

Unnamed: 0,QID,iwsites,iwsitelinks,article_suggestion,local_encoded_title,g_category,language_name,suggestion_type,file,language_code,database_code


In [52]:
#the article is sometimes suggested as an article suggestion for more than one wiki
duplicated_translation_recs = t_rec_iwl_q[t_rec_iwl_q.duplicated(['article_suggestion'])]
dupe_check = t_rec_iwl_q[t_rec_iwl_q.duplicated(['article_suggestion', 'database_code', 'local_encoded_title'])]

print("duplicated_translation_recs", len(duplicated_translation_recs))
print("dupes", len(dupe_check))

duplicated_translation_recs 4443
dupes 6


In [53]:
#drop dupes
t_rec_iwl_q = t_rec_iwl_q.drop_duplicates(subset=['article_suggestion', 'database_code', 'local_encoded_title', 'file'], keep='first')

In [54]:
translation_rec_iwl_q = t_rec_iwl_q[['article_suggestion',
                                     'QID','database_code', 
                                     'iwsites',
                                     'iwsitelinks',
                                     'language_code','file',
                                     'suggestion_type',
                                     'language_name',
                                     'g_category',]]

## Translation articles - iwlinks match

In [55]:
#create boolean column: TRUE if 'iwsites' column str contains 'database_code' value
#substring search for the database code in the “iwsites” column, if the database code is set
#[dbcode in iwsites if dbcode is not None else False for (dbcode, iwsites) in zip(translation_rec_iwl_q['database_code'], translation_rec_iwl_q['iwsites'])]
translation_rec_iwl_q['database_code_in_iwsites'] = [x[0] in x[1] if x[0] is not None else False for x in zip(translation_rec_iwl_q['database_code'], translation_rec_iwl_q['iwsites'])]

In [56]:
translation_rec_iwl_q['database_code_in_iwsites'].values.sum()

1894
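The `in` test above is a substring search on the comma-separated `iwsites` string, so a code such as `hiwiki` would also match a sister-project site id like `hiwikibooks`. A sketch of an exact comparison on the split site ids, offered as a possible tightening rather than what produced the count above; `has_sitelink` is a hypothetical helper.

```python
def has_sitelink(database_code, iwsites):
    """True when database_code is exactly one of the comma-separated site ids."""
    if database_code is None or not isinstance(iwsites, str):
        return False
    return database_code in {site.strip() for site in iwsites.split(",")}

iwsites = "enwiki, hiwikibooks, tawiki"
print("hiwiki" in iwsites)              # True  (substring hit)
print(has_sitelink("hiwiki", iwsites))  # False (no exact site id)
print(has_sitelink("tawiki", iwsites))  # True
```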

In [57]:
suggestions_created_a = translation_rec_iwl_q[translation_rec_iwl_q['database_code_in_iwsites'] == True]

In [62]:
count_suggestions_created_a = translation_rec_iwl_q['database_code_in_iwsites'].values.sum()
print(count_suggestions_created_a)

1894


## Translation articles - Qid match

In [58]:
#objective: count matches by wikidata item in the suggested language

In [59]:
#split on the boolean column: 'located' rows have an interwiki link in the suggested wiki;
#'not_yet_located' rows don't, aka they weren't created as far as we know so far

located =  translation_rec_iwl_q[translation_rec_iwl_q['database_code_in_iwsites'] == True]
not_yet_located = translation_rec_iwl_q[translation_rec_iwl_q['database_code_in_iwsites'] == False]

In [60]:
fountain_titles = pd.read_csv("../../data/raw/articles/2019/contest_titles_n_updated.csv", sep=',', encoding = 'utf-8')
fountain_titles_to_cull = fountain_titles[['wiki_db', 'QID']]
fountain_titles_to_cull = fountain_titles_to_cull.rename(columns={'wiki_db': 'database_code'})

In [61]:
translation_sugg_to_cull = not_yet_located[['database_code','QID']]
suggestions_created_b = pd.merge(fountain_titles_to_cull, translation_sugg_to_cull, on=['database_code', 'QID'], how='inner')
count_suggestions_created_b = len(suggestions_created_b)
print(count_suggestions_created_b)

61
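The inner merge above keeps the `(database_code, QID)` pairs present in both the contest submissions and the not-yet-located suggestions. The same idea as a set intersection, with invented pairs:

```python
# Invented pairs; the notebook reads them from fountain_titles and not_yet_located.
submitted = {("hiwiki", 101), ("tawiki", 202), ("bnwiki", 303)}
not_yet_located_pairs = {("hiwiki", 101), ("tawiki", 999), ("bnwiki", 303)}

# Pairs found in both sets are suggestions matched by QID.
matched = submitted & not_yet_located_pairs
print(len(matched), sorted(matched))
```

Unlike a merge, a set ignores repeated pairs, so the two counts can differ if either input contains duplicates.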


## > How many translation articles from the Google list were created since the GLOW contest? 

In [64]:
def percentage(part, whole):
  return 100 * float(part)/float(whole)

In [65]:
translation_recs = len(translation_topic_rec_df_CLEAN)
editing_recs = len(editing_topic_rec_df_CLEAN)
edited_count = nppt_articles_edits.groupby(['database_code', 'page_id']).ngroups
#add the two counts (iwlink matches + QID matches), not the two dataframes
translation_count = count_suggestions_created_a+count_suggestions_created_b
print("Total values in full rec list:", len(full_topic_rec_df_ci))
print('***')
print("total recs in translation:", len(translation_topic_rec_df_CLEAN))
print("total recs in editing:", len(editing_topic_rec_df_CLEAN))
print('***')
print(percentage(edited_count,editing_recs),"% edited from editing list (since the contest start)")
print(percentage(translation_count,translation_recs),"% created from translation list")

Total values in full rec list: 34295
***
total recs in translation: 14155
total recs in editing: 20102
***
35.33479255795443 % edited from editing list (since the contest start)
13.811374072765807 % created from translation list


### > 1955+ articles were picked from the suggestion lists (out of 34k+)

#### 1894+ articles from the Google-provided 'translation' list were created since the contest started and had the related interwiki link (iwl) added

#### 61 articles from the Google-provided 'translation' list were created since the contest started and had a matching QID added

In [None]:
#includes the 1894 with related interwiki links
suggestions_created_a.to_csv("../../data/processed/query_results/topic_lists/suggestions_created_a.csv", sep=',', encoding = 'utf-8', index=False)

In [None]:
#includes the 61
suggestions_created_b.to_csv("../../data/processed/query_results/topic_lists/suggestions_created_b.csv", sep=',', encoding = 'utf-8', index=False)

In [None]:
nppt_articles_edits_contest_unique.to_csv("../../data/processed/query_results/topic_lists/stubs_edited_during_contest.csv", sep=',', encoding = 'utf-8', index=False)

#### To Do in the future, topics

TODO:
1. Get the latest topic model - see Isaac
2. Answer the following questions in the future, if there is time and project-level interest:

    a. Q: which topics did editors write about?

    b. Q: what did editors select from these suggestions?

    c. Q: which topics most resonated or were most popular to write about from these lists? by wiki?

    d. Q: which topics did our partner pass on to us, and which search terms made their way to us?