# Spanish Translation A/B Test
__- Author: Fan Yuan__  
__- Date: 03/25/2019__

## Context:
A worldwide e-commerce site with localized versions wants to test whether the conversion rate is higher when each Spanish-speaking country gets a translation written by a local, instead of all Spanish-speaking countries sharing the same translation.
After the experiment, they found that the non-localized translation was doing better.

## Project goal:
* Confirm that the test is actually negative, that is, that the old version of the site, with just one translation across Spain and LatAm, really performs better
* Explain why that might be happening. Are the localized translations really worse?
* Design an algorithm that would return FALSE if the same problem is happening in the future and TRUE if everything is good and the results can be trusted
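
One possible shape for the TRUE/FALSE check in the last goal is sketched below. It is not the author's algorithm: it assumes the "problem" is a test/control split that depends on some user attribute, and it swaps in a chi-squared test of independence with a Bonferroni correction. The helper name `randomization_ok`, the `alpha` threshold, and the toy data are all hypothetical.

```r
# Hypothetical helper (an assumption, not the author's implementation):
# returns TRUE when no listed covariate is associated with the assignment
# column, FALSE otherwise.
randomization_ok <- function(df, covariates, assignment = "test", alpha = 0.001) {
  p_values <- vapply(covariates, function(v) {
    chisq.test(table(df[[v]], df[[assignment]]))$p.value
  }, numeric(1))
  # Bonferroni correction, because several covariates are tested at once
  all(p_values > alpha / length(covariates))
}

# Toy data, made up for illustration only
set.seed(42)
n <- 6000
toy <- data.frame(
  country = sample(c("A", "B", "C"), n, replace = TRUE),
  device  = sample(c("mobile", "web"), n, replace = TRUE)
)

# Balanced split: assignment ignores every covariate
toy$test <- rbinom(n, 1, 0.5)
balanced_ok <- randomization_ok(toy, c("country", "device"))

# Skewed split: country "A" is sent to the test group 80% of the time
toy$test <- rbinom(n, 1, ifelse(toy$country == "A", 0.8, 0.5))
skewed_ok <- randomization_ok(toy, c("country", "device"))
```

The strict `alpha` keeps false alarms rare on large samples; a real check would also need to decide how to handle numeric covariates such as age, which this sketch does not cover.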

## Data:
### test_table -- Columns:
* __user_id__: the id of the user. Unique by user. Can be joined to user id in the other table. For each user, we just check whether conversion happens the first time they land on the site since the test started
* __date__: when they came to the site for the first time since the test started
* __source__: marketing channel source
    + Ads: came to the site by clicking on an advertisement
    + SEO: came to the site by clicking on search results
    + Direct: came to the site by directly typing the URL on the browser
* __device__: device used by the user, it can be mobile or web
* __browser_language__: in browser or app settings, the language chosen by the user. It can be EN, ES, Other
* __ads_channel__: if marketing channel is ads, this is the site where the ad was displayed. It can be: Google, Facebook, Bing, Yahoo, Other. If the user didn't come via an ad, this field is NA
* __browser__: user browser. It can be: IE, Chrome, Android_App, FireFox, Iphone_app, Safari, Opera
* __conversion__: whether the user converted (1) or not (0). This is the label. A test is considered successful if it increases the proportion of users who convert
* __test__: users are randomly split into test (1) and control (0). Test users see the new translation and control the old one. For Spain-based users, this is obviously always 0 since there is no change there
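
As a minimal sketch of the success criterion described for `conversion` and `test`, the conversion rate per group is simply the mean of `conversion` within each value of `test`. The toy tibble below is made up for illustration and only mimics the two relevant columns of test_table.

```r
library(dplyr)

# Toy stand-in for test_table (values are invented, not real results)
toy <- tibble(
  user_id    = 1:8,
  conversion = c(1, 0, 0, 1, 0, 0, 1, 0),
  test       = c(0, 0, 0, 0, 1, 1, 1, 1)
)

# Conversion rate per group: control (test = 0) vs. test (test = 1)
rates <- toy %>%
  group_by(test) %>%
  summarise(
    users     = n(),
    converted = sum(conversion),
    rate      = mean(conversion)
  )
```

On the real data, a difference between the two rates would still need a significance test before it is read as a win or a loss.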

### user_table -- Columns:
* __user_id__: the id of the user. It can be joined to user id in the other table
* __sex__: user sex: Male or Female
* __age__: user age (self-reported)
* __country__: user country based on ip address


In [2]:
# Import the libraries
library(tidyverse)
library(rpart)

── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
✔ ggplot2 3.1.0       ✔ purrr   0.3.2  
✔ tibble  2.1.1       ✔ dplyr   0.8.0.1
✔ tidyr   0.8.3       ✔ stringr 1.3.0  
✔ readr   1.3.1       ✔ forcats 0.4.0  
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()


In [3]:
# Read data
user <- read_csv('Translation_Test/user_table.csv')
test <- read_csv('Translation_Test/test_table.csv')


Parsed with column specification:
cols(
  user_id = col_double(),
  sex = col_character(),
  age = col_double(),
  country = col_character()
)
Parsed with column specification:
cols(
  user_id = col_double(),
  date = col_date(format = ""),
  source = col_character(),
  device = col_character(),
  browser_language = col_character(),
  ads_channel = col_character(),
  browser = col_character(),
  conversion = col_double(),
  test = col_double()
)


In [6]:
# Check for duplicate user ids in the test dataset (TRUE means no duplicates)
length(unique(test$user_id)) == length(test$user_id)

In [7]:
# Check for duplicate user ids in the user dataset (TRUE means no duplicates)
length(unique(user$user_id)) == length(user$user_id)

In [9]:
# Difference in row counts between the user and test datasets
# (a negative value means the user table has fewer rows)
length(user$user_id) - length(test$user_id)

* The user table has fewer user ids than the test table, which means some users in the test table have no matching row in the user table. When joining the tables, be careful not to lose those ids from the test table
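
A minimal sketch of a join that keeps every id from the test table, assuming dplyr's `left_join()` with the test table on the left. The toy tibbles `test_toy` and `user_toy` are hypothetical stand-ins for the real tables.

```r
library(dplyr)

# Toy stand-ins: user 3 is in the test table but missing from the user table
test_toy <- tibble(
  user_id    = c(1, 2, 3, 4),
  test       = c(0, 0, 1, 1),
  conversion = c(0, 0, 1, 0)
)
user_toy <- tibble(
  user_id = c(1, 2, 4),
  country = c("Mexico", "Spain", "Chile")
)

# left_join keeps every row of the left (test) table; users without a
# match get NA in the columns coming from the user table
joined <- left_join(test_toy, user_toy, by = "user_id")

# anti_join lists the test-table users that have no row in the user table
missing_users <- anti_join(test_toy, user_toy, by = "user_id")
```

An `inner_join()` would silently drop user 3 here, which is the loss the note above warns about.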

In [None]:
# Put user and test tables together to make analysis easier
