# Content Translation Article Deletion Ratios Across All Wikipedias

[Task](https://phabricator.wikimedia.org/T286636)

# Background

From the task description:

"Across all languages, Wikipedia articles created with Content Translation are deleted less often than those created from scratch. For example, in 2020, 3% of new translations were deleted, compared to 12% of other new articles. However, this is not the case for all Wikipedias and some specific wikis have a higher deletion rate for translations. For example, for Indonesian ([T219851#5914691](https://phabricator.wikimedia.org/T219851#5914691)) and Telugu ([T244769](https://phabricator.wikimedia.org/T244769)) the deletion ratios for Content Translation were higher compared to other articles created in these wikis."

# Purpose

The purpose of this analysis is to identify and count the wikis where the deletion rate of articles created with Content Translation is higher than the deletion rate of articles created by other means. Specifically, we want to answer the following questions:

* How many wikis have translations deleted more often than regular articles?
* Which are these wikis?
* Has the number of those wikis reduced compared to the previous period?
* How high is the highest deletion ratio a wiki has for translations?

This analysis will be used as a baseline to assess the evolution of deletion rates as improvements are made. 


# Data

Data comes from the [mediawiki_history](https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/MediaWiki_history) table and reflects the deletion ratios of main namespace articles that were created using Content Translation compared to the deletion ratio for main namespace articles created without the tool. Bots were excluded. 

This data is collected quarterly (every three months) to assess the evolution of deletion rates as improvements are made. This timespan was selected to give editors sufficient time to review content and to avoid seasonality effects.

**Wiki size threshold**: We removed wikis where 15 or fewer articles were created with content translation during the reviewed period to reduce noise in the data and focus on wikis with more representative data. 

In [27]:
shhh <- function(expr) suppressPackageStartupMessages(suppressWarnings(suppressMessages(expr)))
shhh({
    library(tidyverse);
     # Tables:
    library(gt);
    library(gtsummary);
})

# Quarterly Comparison

In [None]:
#FIXME: Update with parameters
#FIXME: Investigate ability to add time constraint for when the page was deleted


In [7]:
# Update with time period you wish to review
# Q1: July-September 2021
mw_snapshot <- '2021-10'
start_dt <-  '2021-07-01'
end_dt <-  '2021-09-30'
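
The FIXME above notes that the query should use these parameters instead of hard-coded dates. One possible way to do that is sketched below with base R's `sprintf()`. The helper name `build_time_filter` is hypothetical and is not part of this notebook or of `wmfdata`; it only reproduces the snapshot and timestamp conditions already used in the queries.

```r
# Parameters as defined in the cell above
mw_snapshot <- "2021-10"
start_dt <- "2021-07-01"
end_dt <- "2021-09-30"

# Hypothetical helper: builds the snapshot/time conditions of the WHERE clause
build_time_filter <- function(snapshot, start, end) {
  sprintf(
    "snapshot = '%s'\n    AND event_timestamp BETWEEN '%s' AND '%s'",
    snapshot, start, end
  )
}

time_filter <- build_time_filter(mw_snapshot, start_dt, end_dt)
cat(time_filter, "\n")
```

The resulting string could then be pasted into both CTEs so that the created and deleted counts always cover the same snapshot and date range.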

In [41]:

query <-
"
-- find both cx and non-cx created articles 
WITH created_articles AS (

SELECT
    wiki_db AS wiki,
    SUM(CAST(ARRAY_CONTAINS(revision_tags, 'contenttranslation') AS INT)) AS created_cx,
    COUNT(*) AS created_total
FROM wmf.mediawiki_history
WHERE
    snapshot = '2021-10'
    AND event_timestamp BETWEEN '2021-07-01' and '2021-09-30'
-- interested in main page namespaces
    AND page_namespace = 0
-- only look at new page creations
    AND revision_parent_id = 0
    AND event_entity = 'revision'
    AND event_type = 'create' 
GROUP BY  
  wiki_db
),

--find all deleted articles that were created with cx 

deleted_articles AS (

SELECT
    wiki_db AS wiki,
    SUM(CAST(ARRAY_CONTAINS(revision_tags, 'contenttranslation') AS INT)) AS deleted_cx,
    COUNT(*) AS deleted_total
FROM wmf.mediawiki_history
WHERE
    snapshot = '2021-10'
    AND event_timestamp BETWEEN '2021-07-01' and '2021-09-30'
-- interested in main page namespaces
    AND page_namespace = 0
-- only look at new page creations
    AND revision_parent_id = 0
    AND event_entity = 'revision'
    AND event_type = 'create'
-- find revisions moved to the archive table
    AND revision_is_deleted_by_page_deletion = TRUE
-- remove all bots
    AND SIZE(event_user_is_bot_by_historical) = 0  -- not a bot
GROUP BY  
  wiki_db
)

-- main query to aggregate and join sources above
SELECT
    created_articles.wiki,
    created_cx,
    (created_total - created_cx)  AS created_non_cx,
    deleted_cx,
    (deleted_total - deleted_cx) AS deleted_non_cx
FROM created_articles
JOIN deleted_articles ON 
    created_articles.wiki = deleted_articles.wiki
"

In [42]:
cx_deletion_ratio <- wmfdata::query_hive(query)

Don't forget to authenticate with Kerberos using kinit
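
The main query combines the two CTEs with an inner `JOIN`, so a wiki appears in the result only if it has at least one deleted article in the period. The base R sketch below illustrates that behaviour on made-up data; the wiki names `awiki` and `bwiki` and their counts are hypothetical, and the left-join variant is shown only as an alternative to consider, not as what the notebook does.

```r
# Hypothetical per-wiki counts: bwiki has creations but no deletions
created <- data.frame(wiki = c("awiki", "bwiki"), created_cx = c(20, 30))
deleted <- data.frame(wiki = "awiki", deleted_cx = 2)

# Inner join (what a plain JOIN does): bwiki is dropped
inner <- merge(created, deleted, by = "wiki")

# Left join: bwiki is kept, with a missing deletion count
left <- merge(created, deleted, by = "wiki", all.x = TRUE)
left$deleted_cx[is.na(left$deleted_cx)] <- 0  # treat missing as zero deletions

# Pooled CX deletion ratio under each join
ratio_inner <- sum(inner$deleted_cx) / sum(inner$created_cx)
ratio_left <- sum(left$deleted_cx) / sum(left$created_cx)
print(c(inner = ratio_inner, left = ratio_left))
```

Wikis without deletions cannot have a higher CX deletion ratio, so the list of flagged wikis is unaffected, but the pooled overall ratios can differ between the two joins.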



## Overall Quarterly Deletion Ratio

In [78]:
cx_deletion_ratio_overall <- cx_deletion_ratio %>%
    #filter(created_cx > 15) %>% # remove wikis with 15 or fewer articles created using cx
    summarise(deleted_cx_pct = paste0(round(sum(deleted_cx)/sum(created_cx) * 100, 2), "%"),
           deleted_non_cx_pct = paste0(round(sum(deleted_non_cx)/sum(created_non_cx) * 100, 2), "%"),
           deletion_pct_diff = paste0(round((sum(deleted_non_cx)/sum(created_non_cx)*100)-((sum(deleted_cx)/sum(created_cx))*100), 2),"%")
           )

cx_deletion_ratio_overall


deleted_cx_pct,deleted_non_cx_pct,deletion_pct_diff
<chr>,<chr>,<chr>
2.4%,3.82%,1.42%
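
These overall figures are pooled ratios: deletions are summed across wikis before dividing by the summed creations, so large wikis weigh more than small ones. The difference is expressed in percentage points (3.82 - 2.40 = 1.42). As a rough illustration, the base R sketch below uses two rows copied from the by-wiki table in this notebook (kuwiki and fawiki) to contrast the pooled ratio with an unweighted mean of per-wiki ratios; the variable names are illustrative only.

```r
# CX counts for two wikis, copied from the by-wiki table in this notebook
counts <- data.frame(
  wiki       = c("kuwiki", "fawiki"),
  created_cx = c(21, 1340),
  deleted_cx = c(3, 169)
)

# Pooled ratio: sum first, then divide (the approach used in the cell above)
pooled_cx <- sum(counts$deleted_cx) / sum(counts$created_cx)

# Unweighted mean of per-wiki ratios, for comparison only
mean_cx <- mean(counts$deleted_cx / counts$created_cx)

print(round(c(pooled = pooled_cx, unweighted_mean = mean_cx) * 100, 2))
```

The pooled value sits closer to fawiki's ratio because fawiki contributes far more created articles.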


## By Wiki
 

In [79]:
# Add columns with calculated deletion ratio

cx_deletion_ratio_bywiki <- cx_deletion_ratio  %>%
    #filter(wiki == 'arwiki') %>% # use to find ratios for single wiki
    filter(created_cx > 15) %>% # remove wikis with 15 or fewer articles created using cx
    mutate(deleted_cx_ratio = deleted_cx/created_cx, 
           deleted_non_cx_ratio = deleted_non_cx/created_non_cx, 
           deletion_ratio_diff = ((deleted_non_cx/created_non_cx)-(deleted_cx/created_cx)
           ))
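
Because `deletion_ratio_diff` is computed as the non-CX ratio minus the CX ratio, a negative value means translations were deleted more often than other articles. A worked check in base R, using the jvwiki counts reported in this notebook (105 created and 101 deleted with CX; 434 created and 52 deleted without):

```r
# jvwiki counts for July-September 2021, as reported in this notebook
created_cx <- 105
deleted_cx <- 101
created_non_cx <- 434
deleted_non_cx <- 52

deleted_cx_ratio <- deleted_cx / created_cx
deleted_non_cx_ratio <- deleted_non_cx / created_non_cx

# Same sign convention as the mutate() call above: non-CX minus CX
deletion_ratio_diff <- deleted_non_cx_ratio - deleted_cx_ratio

# TRUE when CX articles were deleted more often
cx_deleted_more_often <- deletion_ratio_diff < 0

print(round(deletion_ratio_diff * 100, 2))  # -84.21
```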

## How many wikis have translations deleted more often than regular articles?

In [80]:
cx_deletion_higher <- cx_deletion_ratio_bywiki  %>%
    filter(deletion_ratio_diff < 0) %>% #find wikis with higher cx deletion ratio
    summarise(total_wikis = n())


In [81]:
print(paste0("Across all wikis where more than 15 articles have been created with content translation in Q1, there were ", 
             cx_deletion_higher[1], 
             " wikis where articles created with content translation were deleted more than articles created without cx"))

[1] "Across all wikis where more than 15 articles have been created with content translation in Q1, there were 13 wikis where articles created with content translation were deleted more than articles created without cx"


## Which are these wikis?

In [82]:
cx_deletion_higher_list <- cx_deletion_ratio_bywiki %>%
    filter(deletion_ratio_diff < 0)%>% # only wikis where cx deletion ratio is higher
    arrange(deletion_ratio_diff) #sort by highest deletion ratio difference
    

In [83]:
# reformat into table

cx_deletion_higher_list_tbl <- cx_deletion_higher_list %>%
    gt() %>%
    tab_header(
            title = "Wikis with higher deletion ratios for articles created with Content Translation",
            subtitle = "Reviewed Time Period: July 2021 through September 2021 (Q1)") %>%
    fmt_percent(
        columns = 6:8
    ) %>%

    cols_label(wiki = "Wiki project",
               created_cx = "Created CX Articles", 
               created_non_cx = "Created non-CX Articles",
               deleted_cx = "Deleted CX Articles",
               deleted_non_cx = "Deleted non-CX Articles",
               deleted_cx_ratio = "CX Articles Deletion Ratio",
               deleted_non_cx_ratio = "Non-CX Articles Deletion Ratio",
               deletion_ratio_diff = "Deletion Ratio Difference") %>%
     tab_spanner("Created Articles", 2:3) %>%
     tab_spanner("Deleted Articles", 4:5) %>%
    tab_spanner("Deletion Ratios", 6:8) %>%
    tab_footnote(
    footnote = "Excludes wikis with 15 or fewer articles created with Content Translation
            during the reviewed time period",
    locations = cells_column_labels(
      columns = 'wiki'
    )) %>%
      gtsave(
    "cx_deletion_higher_wikis_current.html", inline_css = TRUE) 


IRdisplay::display_html(data = cx_deletion_higher_list_tbl, file = "cx_deletion_higher_wikis_current.html")

**Wikis with higher deletion ratios for articles created with Content Translation**

Reviewed Time Period: July 2021 through September 2021 (Q1)

| Wiki project¹ | Created CX Articles | Created non-CX Articles | Deleted CX Articles | Deleted non-CX Articles | CX Articles Deletion Ratio | Non-CX Articles Deletion Ratio | Deletion Ratio Difference |
|---|---|---|---|---|---|---|---|
| jvwiki | 105 | 434 | 101 | 52 | 96.19% | 11.98% | −84.21% |
| kuwiki | 21 | 12573 | 3 | 69 | 14.29% | 0.55% | −13.74% |
| skwiki | 34 | 1341 | 9 | 266 | 26.47% | 19.84% | −6.63% |
| fawiki | 1340 | 57805 | 169 | 3789 | 12.61% | 6.55% | −6.06% |
| lvwiki | 70 | 2224 | 10 | 232 | 14.29% | 10.43% | −3.85% |
| lawiki | 17 | 958 | 2 | 77 | 11.76% | 8.04% | −3.73% |
| mrwiki | 66 | 5294 | 2 | 58 | 3.03% | 1.10% | −1.93% |
| thwiki | 34 | 7722 | 2 | 316 | 5.88% | 4.09% | −1.79% |
| fiwiki | 33 | 8375 | 3 | 662 | 9.09% | 7.90% | −1.19% |
| aswiki | 25 | 1666 | 1 | 57 | 4.00% | 3.42% | −0.58% |

¹ Excludes wikis with 15 or fewer articles created with Content Translation during the reviewed time period


## How high is the highest deletion ratio a wiki has for translations?

In [49]:
cx_deletion_ration_highest <- cx_deletion_ratio_bywiki %>%
    arrange(desc(deleted_cx_ratio)) %>% #sort by highest to lowest cx deletion ratio
    mutate(deleted_cx_ratio = paste0(round(deleted_cx_ratio *100,2),"%") ,
          deleted_non_cx_ratio = paste0(round(deleted_non_cx_ratio *100,2),"%") ,
          deletion_ratio_diff = paste0(round(deletion_ratio_diff * 100,2),"%") )

head(cx_deletion_ration_highest, 5)

wiki,created_cx,created_non_cx,deleted_cx,deleted_non_cx,deleted_cx_ratio,deleted_non_cx_ratio,deletion_ratio_diff
<chr>,<int>,<int>,<int>,<int>,<chr>,<chr>,<chr>
jvwiki,105,434,101,52,96.19%,11.98%,-84.21%
skwiki,34,1341,9,266,26.47%,19.84%,-6.63%
viwiki,375,11212,65,3365,17.33%,30.01%,12.68%
kuwiki,21,12573,3,69,14.29%,0.55%,-13.74%
lvwiki,70,2224,10,232,14.29%,10.43%,-3.85%


## Has the number of those wikis reduced compared to the previous period?

In [67]:
# Deletion ratios from Q3 (January-March 2021)

query <-
"
-- find all created articles 
WITH created_articles AS (

SELECT
    wiki_db AS wiki,
    SUM(CAST(ARRAY_CONTAINS(revision_tags, 'contenttranslation') AS INT)) AS created_cx,
    COUNT(*) AS created_total
FROM wmf.mediawiki_history
WHERE
     snapshot = '2021-08'
    AND event_timestamp BETWEEN '2021-01-01' and '2021-03-31' 
-- interested in main page namespaces
    AND page_namespace = 0
-- only look at new page creations
    AND revision_parent_id = 0
    AND event_entity = 'revision'
    AND event_type = 'create'
-- remove bots
    AND SIZE(event_user_is_bot_by_historical) = 0 
GROUP BY  
  wiki_db
),

--find all deleted articles 

deleted_articles AS (

SELECT
    wiki_db AS wiki,
    SUM(CAST(ARRAY_CONTAINS(revision_tags, 'contenttranslation') AS INT)) AS deleted_cx,
    COUNT(*) AS deleted_total
FROM wmf.mediawiki_history
WHERE
     snapshot = '2021-08'
    AND event_timestamp BETWEEN '2021-01-01' and '2021-03-31' 
-- interested in main page namespaces
    AND page_namespace = 0
-- only look at new page creations
    AND revision_parent_id = 0
    AND event_entity = 'revision'
    AND event_type = 'create'
-- find revisions moved to the archive table
    AND revision_is_deleted_by_page_deletion = TRUE
-- remove bots
    AND SIZE(event_user_is_bot_by_historical) = 0 
GROUP BY  
  wiki_db
)

-- main query 
SELECT
    created_articles.wiki,
    created_cx,
    (created_total - created_cx)  AS created_non_cx,
    deleted_cx,
    (deleted_total - deleted_cx) AS deleted_non_cx
FROM created_articles
JOIN deleted_articles ON 
    created_articles.wiki = deleted_articles.wiki
"

In [68]:
cx_deletion_ratio_previous <- wmfdata::query_hive(query)




## Overall Previous Quarter Deletion Ratio

In [75]:
cx_deletion_ratio_overall_previous <- cx_deletion_ratio_previous %>%
    #filter(created_cx > 15) %>% # remove wikis with 15 or fewer articles created using cx
    summarise(deleted_cx_pct = paste0(round(sum(deleted_cx)/sum(created_cx) * 100, 2), "%"),
           deleted_non_cx_pct = paste0(round(sum(deleted_non_cx)/sum(created_non_cx) * 100, 2), "%"),
           deletion_pct_diff = paste0(round((sum(deleted_non_cx)/sum(created_non_cx)*100)-((sum(deleted_cx)/sum(created_cx))*100), 2),"%")
           )

cx_deletion_ratio_overall_previous

deleted_cx_pct,deleted_non_cx_pct,deletion_pct_diff
<chr>,<chr>,<chr>
3.35%,8.75%,5.39%


## By Wiki Previous Quarter Deletion Ratios

In [71]:
cx_deletion_ratio_previous_bywiki <- cx_deletion_ratio_previous %>%
    #filter(wiki == 'idwiki') %>%
    filter(created_cx > 15) %>%
    mutate(deleted_cx_ratio = deleted_cx/created_cx,
           deleted_non_cx_ratio = deleted_non_cx/created_non_cx,
           deletion_ratio_diff = ((deleted_non_cx/created_non_cx)-(deleted_cx/created_cx)
           ))


In [72]:
cx_deletion_higher_previous <- cx_deletion_ratio_previous_bywiki  %>%
    filter(deletion_ratio_diff < 0) %>%
    summarise(total_wikis = n())


In [73]:
print(paste0("Across all wikis where more than 15 articles have been created with content translation in the previous quarter, there were ", 
             cx_deletion_higher_previous[1], 
             " wikis where articles created with content translation were deleted more than articles created without cx"))

[1] "Across all wikis where more than 15 articles have been created with content translation in the previous quarter, there were 13 wikis where articles created with content translation were deleted more than articles created without cx"


In [74]:
cx_deletion_higher_list_previous <- cx_deletion_ratio_previous_bywiki   %>%
    filter(deletion_ratio_diff < 0) %>%
    arrange(deletion_ratio_diff)

cx_deletion_higher_list_previous

wiki,created_cx,created_non_cx,deleted_cx,deleted_non_cx,deleted_cx_ratio,deleted_non_cx_ratio,deletion_ratio_diff
<chr>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>
hawwiki,64,85,25,1,0.390625,0.011764706,-0.3788602941
kuwiki,204,4011,34,69,0.16666667,0.017202693,-0.1494639741
lawiki,25,923,3,48,0.12,0.052004334,-0.0679956663
ltwiki,23,2272,13,1135,0.56521739,0.499559859,-0.0656575321
fiwiki,83,9448,12,787,0.14457831,0.083298052,-0.0612802608
fiu_vrowiki,16,131,1,4,0.0625,0.030534351,-0.0319656489
eowiki,123,5221,5,62,0.04065041,0.01187512,-0.0287752868
kawiki,122,5434,23,889,0.18852459,0.163599558,-0.0249250318
arzwiki,110,37033,3,335,0.02727273,0.009045986,-0.0182267413
thwiki,17,4635,1,208,0.05882353,0.044875944,-0.0139475855


In [84]:
# reformat into table

cx_deletion_higher_list_tbl_previous <- cx_deletion_higher_list_previous %>%
    gt() %>%
    tab_header(
            title = "Wikis with higher deletion ratios for articles created with Content Translation",
            subtitle = "Reviewed Time Period: January 2021 through March 2021 (Q3)") %>%
    fmt_percent(
        columns = 6:8
    ) %>%

    cols_label(wiki = "Wiki project",
               created_cx = "Created CX Articles", 
               created_non_cx = "Created non-CX Articles",
               deleted_cx = "Deleted CX Articles",
               deleted_non_cx = "Deleted non-CX Articles",
               deleted_cx_ratio = "CX Articles Deletion Ratio",
               deleted_non_cx_ratio = "Non-CX Articles Deletion Ratio",
               deletion_ratio_diff = "Deletion Ratio Difference") %>%
     tab_spanner("Created Articles", 2:3) %>%
     tab_spanner("Deleted Articles", 4:5) %>%
    tab_spanner("Deletion Ratios", 6:8) %>%
    tab_footnote(
    footnote = "Excludes wikis with 15 or fewer articles created with Content Translation
            during the reviewed time period",
    locations = cells_column_labels(
      columns = 'wiki'
    )) %>%
      gtsave(
    "cx_deletion_higher_wikis_previous.html", inline_css = TRUE) 


IRdisplay::display_html(data = cx_deletion_higher_list_tbl_previous, file = "cx_deletion_higher_wikis_previous.html")

**Wikis with higher deletion ratios for articles created with Content Translation**

Reviewed Time Period: January 2021 through March 2021 (Q3)

| Wiki project¹ | Created CX Articles | Created non-CX Articles | Deleted CX Articles | Deleted non-CX Articles | CX Articles Deletion Ratio | Non-CX Articles Deletion Ratio | Deletion Ratio Difference |
|---|---|---|---|---|---|---|---|
| hawwiki | 64 | 85 | 25 | 1 | 39.06% | 1.18% | −37.89% |
| kuwiki | 204 | 4011 | 34 | 69 | 16.67% | 1.72% | −14.95% |
| lawiki | 25 | 923 | 3 | 48 | 12.00% | 5.20% | −6.80% |
| ltwiki | 23 | 2272 | 13 | 1135 | 56.52% | 49.96% | −6.57% |
| fiwiki | 83 | 9448 | 12 | 787 | 14.46% | 8.33% | −6.13% |
| fiu_vrowiki | 16 | 131 | 1 | 4 | 6.25% | 3.05% | −3.20% |
| eowiki | 123 | 5221 | 5 | 62 | 4.07% | 1.19% | −2.88% |
| kawiki | 122 | 5434 | 23 | 889 | 18.85% | 16.36% | −2.49% |
| arzwiki | 110 | 37033 | 3 | 335 | 2.73% | 0.90% | −1.82% |
| thwiki | 17 | 4635 | 1 | 208 | 5.88% | 4.49% | −1.39% |

¹ Excludes wikis with 15 or fewer articles created with Content Translation during the reviewed time period


## How many wikis had higher deletion ratios for cx translated articles both quarters?

In [58]:
intersect(cx_deletion_higher_list_previous[1], cx_deletion_higher_list[1])

wiki
<chr>
jvwiki
mrwiki
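
The same overlap check can be sketched in base R on plain character vectors. The vectors below contain only the ten wikis displayed for each quarter in this notebook, so the result is illustrative rather than a replacement for the full comparison. On those displayed rows the overlap is kuwiki, lawiki, fiwiki and thwiki, which differs from the recorded output above; that cell carries an earlier execution count (`In [58]`) than the cells that build the two lists, so it may predate them.

```r
# Wikis displayed in the Q3 (January-March 2021) table
previous_wikis <- c("hawwiki", "kuwiki", "lawiki", "ltwiki", "fiwiki",
                    "fiu_vrowiki", "eowiki", "kawiki", "arzwiki", "thwiki")

# Wikis displayed in the Q1 (July-September 2021) table
current_wikis <- c("jvwiki", "kuwiki", "skwiki", "fawiki", "lvwiki",
                   "lawiki", "mrwiki", "thwiki", "fiwiki", "aswiki")

# Wikis flagged in both quarters, in the order of the first argument
flagged_both <- intersect(previous_wikis, current_wikis)
print(flagged_both)

# Wikis flagged only in the current quarter
newly_flagged <- setdiff(current_wikis, previous_wikis)
print(newly_flagged)
```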


# 6 Month Period Comparison 

This was done in the analysis conducted as part of [T286636](https://phabricator.wikimedia.org/T286636#7345479) to assess varying review timeframes. The team decided to proceed with quarterly updates, but this prior analysis is kept here for reference.

In [269]:
# Current 6 Months
# Jan - June 2021
query <-
"
-- find both cx and non-cx created articles 
WITH created_articles AS (

SELECT
    wiki_db AS wiki,
    SUM(CAST(ARRAY_CONTAINS(revision_tags, 'contenttranslation') AS INT)) AS created_cx,
    COUNT(*) AS created_total
FROM wmf.mediawiki_history
WHERE
    snapshot = '2021-08'
    AND event_timestamp BETWEEN '2021-01-01' and '2021-06-30' 
-- interested in main page namespaces
    AND page_namespace = 0
-- only look at new page creations
    AND revision_parent_id = 0
    AND event_entity = 'revision'
    AND event_type = 'create'
-- remove bots
    AND SIZE(event_user_is_bot_by_historical) = 0 
GROUP BY  
  wiki_db
),

--find all deleted articles that were created with cx 

deleted_articles AS (

SELECT
    wiki_db AS wiki,
    SUM(CAST(ARRAY_CONTAINS(revision_tags, 'contenttranslation') AS INT)) AS deleted_cx,
    COUNT(*) AS deleted_total
FROM wmf.mediawiki_history
WHERE
    snapshot = '2021-08'
    AND event_timestamp BETWEEN '2021-01-01' and '2021-06-30' 
-- interested in main page namespaces
    AND page_namespace = 0
-- only look at new page creations
    AND revision_parent_id = 0
    AND event_entity = 'revision'
    AND event_type = 'create'
-- find revisions moved to the archive table
    AND revision_is_deleted_by_page_deletion = TRUE
-- remove bots
    AND SIZE(event_user_is_bot_by_historical) = 0 
GROUP BY  
  wiki_db
)

-- main query to aggregate and join sources above
SELECT
    created_articles.wiki,
    created_cx,
    (created_total - created_cx)  AS created_non_cx,
    deleted_cx,
    (deleted_total - deleted_cx) AS deleted_non_cx
FROM created_articles
JOIN deleted_articles ON 
    created_articles.wiki = deleted_articles.wiki
"

In [270]:
cx_deletion_ratio_current_6mo <- wmfdata::query_hive(query)




## Overall Deletion Ratio - Current 6 mo

In [271]:
cx_deletion_ratio_6cur_overall <- cx_deletion_ratio_current_6mo %>%
    summarise(deleted_cx_pct = paste0(round(sum(deleted_cx)/sum(created_cx) * 100, 2), "%"),
           deleted_non_cx_pct = paste0(round(sum(deleted_non_cx)/sum(created_non_cx) * 100, 2), "%"),
           deletion_pct_diff = paste0(round((sum(deleted_non_cx)/sum(created_non_cx)*100)-((sum(deleted_cx)/sum(created_cx))*100), 2),"%")
           )

cx_deletion_ratio_6cur_overall

deleted_cx_pct,deleted_non_cx_pct,deletion_pct_diff
<chr>,<chr>,<chr>
3.6%,8.47%,4.87%


## By Wiki

In [272]:
cx_deletion_ratio_current_bywiki <- cx_deletion_ratio_current_6mo %>%
    #filter(wiki == 'idwiki') %>%
    filter(created_cx > 15) %>% # only review wikis with more than 15 cx articles
    mutate(deleted_cx_ratio = deleted_cx/created_cx,
           deleted_non_cx_ratio = deleted_non_cx/created_non_cx,
           deletion_ratio_diff = ((deleted_non_cx/created_non_cx)-(deleted_cx/created_cx)
           ))


## How many wikis have translations deleted more often than regular articles?

In [274]:
cx_deletion_higher_current_6mo <- cx_deletion_ratio_current_bywiki %>%
    filter(deletion_ratio_diff < 0) %>%
    summarise(total_wikis = n())

cx_deletion_higher_current_6mo 

total_wikis
<int>
20


In [284]:
print(paste0("Across all wikis where more than 15 articles have been created with content translation from Jan 2021 - June 2021, there were ", 
             cx_deletion_higher_current_6mo[1], 
             " wikis where articles created with content translation were deleted more than articles created without cx"))

[1] "Across all wikis where more than 15 articles have been created with content translation from Jan 2021 - June 2021, there were 20 wikis where articles created with content translation were deleted more than articles created without cx"


## Which are these wikis?

In [276]:
cx_deletion_higher_list_current <- cx_deletion_ratio_current_bywiki %>%
    filter(deletion_ratio_diff < 0)%>% #only wikis with higher cx deletion ratios
    arrange(deletion_ratio_diff)
    

In [279]:
# reformat into table

cx_deletion_higher_list_6mo_tbl <- cx_deletion_higher_list_current %>%
    gt() %>%
    tab_header(
            title = "Wikis with higher deletion ratios for articles created with Content Translation",
            subtitle = "Reviewed Time Period: January 2021 through June 2021") %>%
    fmt_percent(
        columns = 6:8
    ) %>%

    cols_label(wiki = "Wiki project",
               created_cx = "Created CX Articles", 
               created_non_cx = "Created non-CX Articles",
               deleted_cx = "Deleted CX Articles",
               deleted_non_cx = "Deleted non-CX Articles",
               deleted_cx_ratio = "CX Articles Deletion Ratio",
               deleted_non_cx_ratio = "Non-CX Articles Deletion Ratio",
               deletion_ratio_diff = "Deletion Ratio Difference") %>%
     tab_spanner("Created Articles", 2:3) %>%
     tab_spanner("Deleted Articles", 4:5) %>%
    tab_spanner("Deletion Ratios", 6:8) %>%
    tab_footnote(
    footnote = "Excludes wikis with 15 or fewer articles created with Content Translation
            during the reviewed time period",
    locations = cells_column_labels(
      columns = 'wiki'
    )) %>%
      gtsave(
    "cx_deletion_higher_wikis_6mo.html", inline_css = TRUE) 


IRdisplay::display_html(data = cx_deletion_higher_list_6mo_tbl, file = "cx_deletion_higher_wikis_6mo.html")

**Wikis with higher deletion ratios for articles created with Content Translation**

Reviewed Time Period: January 2021 through June 2021

| Wiki project¹ | Created CX Articles | Created non-CX Articles | Deleted CX Articles | Deleted non-CX Articles | CX Articles Deletion Ratio | Non-CX Articles Deletion Ratio | Deletion Ratio Difference |
|---|---|---|---|---|---|---|---|
| hawwiki | 68 | 128 | 25 | 25 | 36.76% | 19.53% | −17.23% |
| iswiki | 30 | 2157 | 7 | 140 | 23.33% | 6.49% | −16.84% |
| kuwiki | 221 | 5486 | 34 | 127 | 15.38% | 2.31% | −13.07% |
| arywiki | 57 | 1161 | 9 | 46 | 15.79% | 3.96% | −11.83% |
| fiu_vrowiki | 31 | 235 | 4 | 6 | 12.90% | 2.55% | −10.35% |
| thwiki | 24 | 9975 | 3 | 411 | 12.50% | 4.12% | −8.38% |
| arzwiki | 119 | 121643 | 9 | 606 | 7.56% | 0.50% | −7.06% |
| azbwiki | 18 | 1674 | 2 | 83 | 11.11% | 4.96% | −6.15% |
| siwiki | 37 | 1573 | 4 | 87 | 10.81% | 5.53% | −5.28% |
| kawiki | 170 | 10010 | 33 | 1415 | 19.41% | 14.14% | −5.28% |

¹ Excludes wikis with 15 or fewer articles created with Content Translation during the reviewed time period


## How high is the highest deletion ratio a wiki has for translations?


In [282]:
cx_deletion_ration_highest_current <- cx_deletion_ratio_current_bywiki %>%
    arrange(desc(deleted_cx_ratio))  %>%   
    mutate(deleted_cx_ratio = paste0(round(deleted_cx_ratio *100,2),"%") ,
          deleted_non_cx_ratio = paste0(round(deleted_non_cx_ratio *100,2),"%") ,
          deletion_ratio_diff = paste0(round(deletion_ratio_diff * 100,2),"%") )

head(cx_deletion_ration_highest_current, 5)

wiki,created_cx,created_non_cx,deleted_cx,deleted_non_cx,deleted_cx_ratio,deleted_non_cx_ratio,deletion_ratio_diff
<chr>,<int>,<int>,<int>,<int>,<chr>,<chr>,<chr>
ltwiki,45,4254,17,2041,37.78%,47.98%,10.2%
hawwiki,68,128,25,25,36.76%,19.53%,-17.23%
mnwiki,30,1265,10,542,33.33%,42.85%,9.51%
iswiki,30,2157,7,140,23.33%,6.49%,-16.84%
kawiki,170,10010,33,1415,19.41%,14.14%,-5.28%


Lithuanian Wikipedia had the highest deletion ratio for articles created with Content Translation: 37.8% of them were deleted. However, this was still lower than the deletion ratio for articles created without Content Translation (48.0%).

The wiki with the largest difference in deletion ratios was Hawaiian Wikipedia: 36.8% of articles created with CX were deleted during the reviewed time period, compared to 19.5% of articles created without Content Translation.

## Has the number of those wikis reduced compared to the previous period?

In [285]:
# Previous 6 Months
# July 2020 - December 2020

query <-
"
-- find both cx and non-cx created articles 
WITH created_articles AS (

SELECT
    wiki_db AS wiki,
    SUM(CAST(ARRAY_CONTAINS(revision_tags, 'contenttranslation') AS INT)) AS created_cx,
    COUNT(*) AS created_total
FROM wmf.mediawiki_history
WHERE
    snapshot = '2021-08'
    AND event_timestamp BETWEEN '2020-07-01' and '2020-12-31' 
-- interested in main page namespaces
    AND page_namespace = 0
-- only look at new page creations
    AND revision_parent_id = 0
    AND event_entity = 'revision'
    AND event_type = 'create'
-- remove bots
    AND SIZE(event_user_is_bot_by_historical) = 0 
GROUP BY  
  wiki_db
),

--find all deleted articles that were created with cx 

deleted_articles AS (

SELECT
    wiki_db AS wiki,
    SUM(CAST(ARRAY_CONTAINS(revision_tags, 'contenttranslation') AS INT)) AS deleted_cx,
    COUNT(*) AS deleted_total
FROM wmf.mediawiki_history
WHERE
    snapshot = '2021-08'
    AND event_timestamp BETWEEN '2020-07-01' and '2020-12-31'  
-- interested in main page namespaces
    AND page_namespace = 0
-- only look at new page creations
    AND revision_parent_id = 0
    AND event_entity = 'revision'
    AND event_type = 'create'
-- find revisions moved to the archive table
    AND revision_is_deleted_by_page_deletion = TRUE
-- remove bots
    AND SIZE(event_user_is_bot_by_historical) = 0 
GROUP BY  
  wiki_db
)

-- main query to aggregate and join sources above
SELECT
    created_articles.wiki,
    created_cx,
    (created_total - created_cx)  AS created_non_cx,
    deleted_cx,
    (deleted_total - deleted_cx) AS deleted_non_cx
FROM created_articles
JOIN deleted_articles ON 
    created_articles.wiki = deleted_articles.wiki
"

In [286]:
cx_deletion_ratio_previous_6mo <- wmfdata::query_hive(query)




In [287]:
cx_deletion_ratio_bywiki_previous <- cx_deletion_ratio_previous_6mo %>%
    #filter(wiki == 'idwiki') %>%
    filter(created_cx > 15)  %>%  # only wikis with more than 15 articles created using cx
    mutate(deleted_cx_ratio = deleted_cx/created_cx,
           deleted_non_cx_ratio = deleted_non_cx/created_non_cx,
           deletion_ratio_diff = ((deleted_non_cx/created_non_cx)-(deleted_cx/created_cx)
           ))


In [288]:
cx_deletion_higher_previous <- cx_deletion_ratio_bywiki_previous %>%
    filter(deletion_ratio_diff < 0) %>%
    summarise(total_wikis = n())

cx_deletion_higher_previous

total_wikis
<int>
21


In [290]:
print(paste0("Across all wikis where more than 15 articles have been created with content translation between July 2020 and December 2020, there were ", 
             cx_deletion_higher_previous[1], 
             " wikis where articles created with content translation were deleted more than articles created without cx"))

[1] "Across all wikis where more than 15 articles have been created with content translation between July 2020 and December 2020, there were 21 wikis where articles created with content translation were deleted more than articles created without cx"


The number of wikis with higher Content Translation deletion ratios decreased by 1, from 21 in July-December 2020 to 20 in January-June 2021.

We next compared the two lists of wikis to check whether the wikis with higher deletion ratios were largely the same in both 6-month periods.

## How many wikis had higher deletion ratios for cx translated articles in both periods?

In [296]:
cx_deletion_higher_list_previous <- cx_deletion_ratio_bywiki_previous   %>%
    filter(deletion_ratio_diff < 0) %>%
    arrange(deletion_ratio_diff)

cx_deletion_higher_list_previous

wiki,created_cx,created_non_cx,deleted_cx,deleted_non_cx,deleted_cx_ratio,deleted_non_cx_ratio,deletion_ratio_diff
<chr>,<int>,<int>,<int>,<int>,<dbl>,<dbl>,<dbl>
fywiki,17,1755,14,65,0.82352941,0.037037037,-0.786492375
hawwiki,42,132,31,24,0.73809524,0.181818182,-0.556277056
ltwiki,59,3337,28,644,0.47457627,0.192987714,-0.281588558
iswiki,26,2000,7,155,0.26923077,0.0775,-0.191730769
lawiki,48,2979,9,158,0.1875,0.053037932,-0.134462068
hywiki,159,33338,22,1080,0.13836478,0.032395465,-0.105969315
azwiki,206,29671,29,1885,0.1407767,0.063530046,-0.077246653
arywiki,63,2443,5,50,0.07936508,0.020466639,-0.05889844
mywiki,313,6698,37,439,0.11821086,0.065541953,-0.05266891
cywiki,122,1451,13,85,0.10655738,0.058580289,-0.047977088


In [294]:
intersect(cx_deletion_higher_list_current[1], cx_deletion_higher_list_previous[1])

wiki
<chr>
hawwiki
iswiki
kuwiki
arywiki
arzwiki
fiwiki
lawiki
eowiki


Eight wikis had higher deletion ratios for content translated articles in both 6-month periods.