<a href="https://colab.research.google.com/github/MaxMaffio/InterviewQuery/blob/main/Supercell_Data_Scientist.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![logo.png](https://github.com/interviewquery/takehomes/blob/supercell_1/supercell_1/logo.png?raw=1)

# Supercell Data Scientist Pre-Test
---

**Please solve the following tasks.**

## Task 1: How does the revenue trickle?

The database contains three tables: `account`, `account_device` and
`transactions`. `account` contains user profiles, `account_device` their devices and `transactions` contains in-app purchases.

-   How much revenue was produced on 2013-02-01?
-   Are there any users who use both iPads and iPhones?
-   Which country produces the most revenue?
-   What is the iPad/iPhone split in Canada?
-   What proportion of lifetime revenue is generated on the player's
    first week in game?

If you believe the data we've provided is not sufficient for a task,
please outline your concerns in your report with suggestions.

Feel free to use tables and plots.

## Task 2: Visualize this!

Please visualize a single aspect of the data you find important.

-   Why did you choose this particular visualization?
-   What would you improve in your visualization?
-   What would be your conclusions and recommendations to the game team based on this visualization?

## Task 3: Patterns?

Please apply a suitable machine learning technique to the data.

-   Why did you choose this particular technique?
-   What would you improve in your model?
-   What would be your conclusions and recommendations to the game team
    based on this model?



---



In [1]:
!git clone --branch supercell_1 https://github.com/interviewquery/takehomes.git
%cd takehomes/supercell_1
!if [[ $(ls *.zip) ]]; then unzip *.zip; fi
!ls

Cloning into 'takehomes'...
remote: Enumerating objects: 1963, done.
remote: Counting objects: 100% (1963/1963), done.
remote: Compressing objects: 100% (1220/1220), done.
remote: Total 1963 (delta 752), reused 1928 (delta 726), pack-reused 0 (from 0)
Receiving objects: 100% (1963/1963), 297.43 MiB | 12.67 MiB/s, done.
Resolving deltas: 100% (752/752), done.
/content/takehomes/supercell_1
ls: cannot access '*.zip': No such file or directory
account.csv  account_device.csv  logo.png  takehomefile.ipynb  transactions.csv


In [2]:
import os
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
os.listdir()

['account_device.csv',
 'transactions.csv',
 'logo.png',
 'account.csv',
 'takehomefile.ipynb']

# READ DATA

In [3]:
# read datasets
df_account = pd.read_csv('account.csv')
df_account_device = pd.read_csv('account_device.csv')
df_transactions = pd.read_csv('transactions.csv')

In [7]:
# account
print(df_account.shape)
print(df_account["account_id"].nunique())
df_account.head()

(20420, 11)
20420


Unnamed: 0,account_id,created_date,last_login_date,last_login_country,last_login_city,create_country,language,gender,birth_year,age_seconds,session_count
0,0-2786671,2013-01-04 09:28:07,2013-01-04 10:47:40,IT,,IT,EN,0,,1821,2
1,0-2787228,2013-01-04 10:09:03,2013-02-28 17:47:07,QA,,QA,EN,0,,2843,11
2,0-2787785,2013-01-04 10:48:38,2013-01-05 12:48:43,IL,Ramla,IL,EN,0,,553,2
3,0-2788342,2013-01-04 11:26:38,2013-04-03 11:12:09,ES,Terrassa,ES,EN,0,,7895,14
4,0-2788899,2013-01-04 12:02:53,2013-03-09 06:01:59,MN,,MN,EN,0,,25666,41


In [8]:
# account_device
print(df_account_device.shape)
print(df_account_device["account_id"].nunique())
df_account_device.head()

(20420, 2)
20420


Unnamed: 0,account_id,device
0,0-2786671,"iPod4,1"
1,0-2787228,"iPhone4,1"
2,0-2787785,"iPhone4,1"
3,0-2788342,"iPad2,1"
4,0-2788899,"iPad1,1"


In [21]:
# transactions
print(df_transactions.shape)
print(df_transactions["id"].nunique())
print(df_transactions["account_id"].nunique())
df_transactions.head()

(3788, 7)
3785
1100


Unnamed: 0,id,account_id,in_game_currency_amoung,created_time,currency_code,cash_amount,created_date
0,1603767,0-2845869,1200,2013-01-07 09:31:32,USD,3.99,2013-01-07
1,1604990,0-2836366,2500,2013-01-07 13:02:16,USD,6.99,2013-01-07
2,1633303,0-2869440,1200,2013-01-10 00:12:07,USD,3.99,2013-01-10
3,1653538,0-2869440,500,2013-01-11 00:50:06,USD,1.99,2013-01-11
4,1660951,0-2914532,1200,2013-01-11 07:24:42,USD,3.99,2013-01-11


The `id` column has 3,785 unique values across 3,788 rows, so a few transaction ids are duplicated. We check whether the repeated rows are redundant.

In [41]:
df_transactions[df_transactions["id"].duplicated()]

Unnamed: 0,id,account_id,in_game_currency_amoung,created_time,currency_code,cash_amount,created_date
2974,1913991,4-1265456,500,2013-04-04 21:02:24,USD,1.99,2013-04-04
3326,573481,5-2136324,500,2013-03-24 22:50:11,USD,1.99,2013-03-24
3345,597327,5-2166010,500,2013-03-26 13:43:39,USD,1.99,2013-03-26


In [44]:
# inspect both rows that share one of the duplicated ids
df_transactions[df_transactions["id"]==597327]

Unnamed: 0,id,account_id,in_game_currency_amoung,created_time,currency_code,cash_amount,created_date
1118,597327,3-2421242,500,2013-01-30 05:49:39,USD,1.99,2013-01-30
3345,597327,5-2166010,500,2013-03-26 13:43:39,USD,1.99,2013-03-26


There are 3 pairs of transactions that share the same id but belong to different accounts, so they are not redundant copies. For this reason we keep all records and add a new column, `id_transaction`, to have a reliable, unique id.

In [45]:
# create a new ID
df_transactions["id_transaction"] = df_transactions.index
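
The reasoning above (a repeated id is only redundant if it points to the same account) can be checked for every duplicated id at once. The sketch below is a minimal illustration on a small hypothetical frame, `toy_tx`, which is not the real data and only borrows its column names.

```python
import pandas as pd

# Hypothetical toy data: id 10 is shared by two accounts, id 11 is unique.
toy_tx = pd.DataFrame({
    "id": [10, 10, 11],
    "account_id": ["a-1", "b-2", "a-1"],
    "cash_amount": [1.99, 1.99, 3.99],
})

# keep=False flags every row of a repeated id, not only the later occurrences
dup_rows = toy_tx[toy_tx["id"].duplicated(keep=False)]

# distinct accounts per repeated id; a value above 1 means the id is reused
# across accounts, so the rows are not plain copies of each other
accounts_per_id = dup_rows.groupby("id")["account_id"].nunique()
print(accounts_per_id)

# surrogate key built from the row index, which is unique by construction
toy_tx["id_transaction"] = toy_tx.index
```

With `keep=False` both members of each pair are shown side by side, which avoids looking up each duplicated id by hand.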

# TASK 1

## How much revenue was produced on 2013-02-01?

In [46]:
# derive the calendar date from the created_time timestamp
df_transactions['created_date'] = pd.to_datetime(df_transactions['created_time']).dt.date
# sum up the cash_amount realized on 2013-02-01
cond_1 = df_transactions["created_date"] == pd.to_datetime("2013-02-01", format="%Y-%m-%d").date()
n_outcome = df_transactions.loc[cond_1, "cash_amount"].sum()
# print the outcome
print(f"The revenue on 2013-02-01 was: {n_outcome}")

The revenue on 2013-02-01 was: 159.64000000000004
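
Two details are worth guarding when summing a day's revenue: the amounts should share one currency, and float sums can carry a binary rounding tail like the one in the output above. The sketch below shows one possible way to handle both on a hypothetical frame, `toy_tx`, with made-up rows. It does not claim anything about the currencies present in the real `transactions` table.

```python
import pandas as pd

# Hypothetical toy data, not taken from transactions.csv
toy_tx = pd.DataFrame({
    "created_time": ["2013-02-01 08:00:00", "2013-02-01 21:30:00", "2013-02-02 00:10:00"],
    "currency_code": ["USD", "USD", "USD"],
    "cash_amount": [1.99, 3.99, 6.99],
})
toy_tx["created_time"] = pd.to_datetime(toy_tx["created_time"])

# A plain sum is only meaningful if every amount is in the same currency
assert toy_tx["currency_code"].nunique() == 1

# normalize() truncates each timestamp to midnight, so the comparison
# against a single Timestamp selects the whole day
on_day = toy_tx["created_time"].dt.normalize() == pd.Timestamp("2013-02-01")

# round to cents to drop the floating-point tail before reporting
revenue = round(toy_tx.loc[on_day, "cash_amount"].sum(), 2)
print(f"Revenue on 2013-02-01: {revenue:.2f}")
```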


## Are there any users who use both iPads and iPhones?

In [47]:
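# if every account_id appears exactly once, no account is linked to more
# than one device, so no user can be observed on both an iPad and an iPhone
print(df_account_device.shape)
print(df_account_device["account_id"].nunique())
print(df_account_device[df_account_device["account_id"].duplicated()].shape)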

(20420, 2)
20420
(0, 2)
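
Since no `account_id` is duplicated, each account has a single device row, and this data cannot show a user on both an iPad and an iPhone. If the table did hold several rows per account, the question could be answered roughly as in the sketch below. `toy_dev` is a hypothetical frame invented for illustration.

```python
import pandas as pd

# Hypothetical toy data: account a-1 has two devices, the others have one
toy_dev = pd.DataFrame({
    "account_id": ["a-1", "a-1", "b-2", "c-3"],
    "device": ["iPad2,1", "iPhone4,1", "iPhone4,1", "iPod4,1"],
})

# extract the device family from the model string, e.g. "iPad2,1" -> "iPad"
toy_dev["family"] = toy_dev["device"].str.extract(r"^(iPad|iPhone|iPod)", expand=False)

# collect the set of families seen for each account
families = toy_dev.groupby("account_id")["family"].apply(set)

# keep the accounts whose set contains both families
both = families[families.apply(lambda s: {"iPad", "iPhone"} <= s)]
print(both.index.tolist())
```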


## Which country produces the most revenue?

In [60]:
# sum up the cash amount for each account
df_account_cash_amount = df_transactions.groupby(["account_id"], as_index=False).agg({"cash_amount": "sum"})
print(df_account_cash_amount.shape)

# merge with account
df_account_cash_amount = df_account_cash_amount.merge(df_account, on="account_id", how="left")
print(df_account_cash_amount.shape)

# group by country
df_account_cash_amount.groupby(["create_country"], as_index=False).agg({"cash_amount": "sum"}).sort_values(by="cash_amount", ascending=False).head()


(1100, 2)
(1100, 12)


Unnamed: 0,create_country,cash_amount
59,US,5604.97
21,GB,1288.9
20,FR,1115.55
3,AU,892.71
8,CA,397.96
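
The ranking above gives absolute revenue. A share of the total can make the gap between countries easier to read. The sketch below is one way to compute it, using hypothetical frames `toy_account` and `toy_tx` with made-up values. The `validate="many_to_one"` argument is an optional safeguard that raises if `account_id` is not unique on the account side of the merge.

```python
import pandas as pd

# Hypothetical toy data, not taken from the real tables
toy_account = pd.DataFrame({
    "account_id": ["a-1", "b-2", "c-3"],
    "create_country": ["US", "US", "GB"],
})
toy_tx = pd.DataFrame({
    "account_id": ["a-1", "a-1", "c-3"],
    "cash_amount": [2.0, 4.0, 4.0],
})

# attach the country to every transaction; validate guards against
# duplicated account rows silently inflating revenue
merged = toy_tx.merge(toy_account, on="account_id", how="left", validate="many_to_one")

# revenue per country, largest first
by_country = merged.groupby("create_country")["cash_amount"].sum().sort_values(ascending=False)

# each country's share of total revenue
share = by_country / by_country.sum()
print(share)
```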


## What is the iPad/iPhone split in Canada?

In [90]:
# merge account with account_device
df_ac_dev = df_account.merge(df_account_device, on="account_id", how="left")
# fix a mis-encoded device name
df_ac_dev["device"] = df_ac_dev["device"].replace("iPhåne2,1", "iPhone")
# create a column that isolates the device family (iPad/iPhone)
df_ac_dev['device_name_clean'] = ""
df_ac_dev.loc[df_ac_dev['device'].str.contains('iPad', na=False), 'device_name_clean'] = 'iPad'
df_ac_dev.loc[df_ac_dev['device'].str.contains('iPhone', na=False), 'device_name_clean'] = 'iPhone'
# note: iPod devices are counted in the iPhone group
df_ac_dev.loc[df_ac_dev['device'].str.contains('iPod', na=False), 'device_name_clean'] = 'iPhone'

# calculate the output
cond_1 = df_ac_dev["create_country"] == "CA"
df_ac_dev_canada = df_ac_dev[cond_1]
df_ac_dev_canada.groupby(["device_name_clean"], as_index=False).agg({"account_id": "count"})

Unnamed: 0,device_name_clean,account_id
0,iPad,205
1,iPhone,562
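
The counts above can be turned into a percentage split. The short sketch below reuses the two counts from the output. The `iPhone` figure includes iPod devices, because the cell above assigns them to that group.

```python
import pandas as pd

# counts copied from the output above (iPhone includes iPod devices)
counts = pd.Series({"iPad": 205, "iPhone": 562})

# percentage of Canadian accounts per device group, rounded to one decimal
split = (counts / counts.sum() * 100).round(1)
print(split)
```

Under that grouping, roughly 26.7% of Canadian accounts are on iPad and 73.3% on iPhone.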


## What proportion of lifetime revenue is generated on the player's first week in game?