### Dmitriy Semushin
## Bookmate test problem for Junior Analyst job opening

Import pandas to read the data and populate the database, and sqlite3 to create the database and run queries:

In [1]:
import pandas as pd
from sqlite3 import dbapi2 as sq3

Load the data into a dataframe:

In [2]:
fileurl = 'https://s3.amazonaws.com/bookmate/analyst_test.csv'
df = pd.read_csv(fileurl, parse_dates=['started_at'])
df.head()

Unnamed: 0,user_id,started_at
0,2066,2015-05-01 05:42:46
1,7931,2015-05-01 06:20:15
2,3736,2015-05-01 08:11:58
3,1604,2015-05-01 11:00:08
4,886,2015-05-02 03:55:39


Create a database:

In [3]:
db = sq3.connect('payments.db')

Define the schema for a `paylogs` table and create it:

In [4]:
plog_schema = """
DROP TABLE IF EXISTS "paylogs";
CREATE TABLE "paylogs" (
    id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
    user_id INTEGER NOT NULL,
    started_at DATETIME
);
"""
db.cursor().executescript(plog_schema)
db.commit()

Populate the table from the dataframe:

In [5]:
df.to_sql("paylogs", db, if_exists='append', index = False)
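
A quick sanity check after loading can confirm that every row arrived. The sketch below is self-contained and rests on assumptions: it uses a throwaway in-memory database rather than `payments.db`, three made-up rows, and `executemany` as a stand-in for `df.to_sql`.

```python
import sqlite3

# Hypothetical sample rows standing in for the dataframe contents.
rows = [
    (2066, "2015-05-01 05:42:46"),
    (7931, "2015-05-01 06:20:15"),
    (2066, "2015-06-03 09:00:00"),
]

conn = sqlite3.connect(":memory:")  # throwaway database, not payments.db
conn.execute(
    "CREATE TABLE paylogs ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,"
    " user_id INTEGER NOT NULL,"
    " started_at DATETIME)"
)
# executemany stands in for df.to_sql(..., if_exists='append', index=False).
conn.executemany(
    "INSERT INTO paylogs (user_id, started_at) VALUES (?, ?)", rows
)
conn.commit()

# Row count plus earliest and latest timestamps in one pass.
n_rows, first, last = conn.execute(
    "SELECT COUNT(*), MIN(started_at), MAX(started_at) FROM paylogs"
).fetchone()
print(n_rows, first, last)
```

Comparing `n_rows` with `len(df)` in the real notebook would show whether the load was complete.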

Define a helper function that executes a query and returns all rows:

In [8]:
def query(sel):
    """Execute a single SQL statement on `db` and return all result rows."""
    return db.cursor().execute(sel).fetchall()
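
If values need to be passed into a query, binding them as parameters is usually safer than formatting them into the SQL string. The following is a minimal sketch, not part of the original notebook: `query_params` is a hypothetical variant of the helper, and the table contents are made up.

```python
import sqlite3

def query_params(conn, sql, params=()):
    """Run one statement with bound parameters and return all rows."""
    return conn.execute(sql, params).fetchall()

conn = sqlite3.connect(":memory:")  # throwaway database for the example
conn.execute("CREATE TABLE paylogs (user_id INTEGER, started_at TEXT)")
conn.executemany(
    "INSERT INTO paylogs VALUES (?, ?)",
    [(1, "2015-05-01 05:42:46"), (2, "2015-06-01 10:00:00")],
)

# The "?" placeholder is filled in by sqlite3, not by string formatting.
result = query_params(
    conn,
    "SELECT user_id FROM paylogs WHERE started_at >= ?",
    ("2015-06-01",),
)
print(result)
```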

Let's first find out how many users there are:

In [9]:
sel = """
SELECT COUNT(DISTINCT user_id)
  FROM paylogs;
"""
query(sel)

[(115,)]

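Store in `seldifmon` a subquery that will be reused in the queries below. For each payment it computes `difmon`, the number of calendar months between the current date (`'now'`) and `started_at`: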

In [None]:
seldifmon = """
SELECT user_id,
       12*(strftime('%Y', 'now') - strftime('%Y', started_at)) +
       (strftime('%m', 'now') - strftime('%m', started_at)) AS difmon
  FROM paylogs
"""
#query(seldifmon)
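
The month arithmetic in this subquery compares calendar months only and ignores the day and time. The sketch below illustrates that with one pair of dates, computed both in plain Python and with the same SQLite expression. The helper name `months_between` and the two dates are assumptions made for the example.

```python
import sqlite3
from datetime import datetime

def months_between(ref, ts):
    """Calendar-month difference between ref and ts, ignoring day and time."""
    return 12 * (ref.year - ts.year) + (ref.month - ts.month)

# Two dates that are only two days apart but fall in different months.
ref = datetime(2016, 1, 2)
ts = datetime(2015, 12, 31)
py_diff = months_between(ref, ts)

# The same expression evaluated by SQLite, with the dates bound as text.
conn = sqlite3.connect(":memory:")
sql_diff = conn.execute(
    "SELECT 12*(strftime('%Y', :ref) - strftime('%Y', :ts)) +"
    "       (strftime('%m', :ref) - strftime('%m', :ts))",
    {"ref": "2016-01-02", "ts": "2015-12-31"},
).fetchone()[0]

print(py_diff, sql_diff)
```

Both values come out as 1, so a payment on the last day of a month already counts as "one month ago" on the first day of the next month.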

Count new users, that is, users whose earliest payment falls in the current month (`maxdif = 0`):

In [18]:
selnew = """
SELECT COUNT(DISTINCT user_id)
  FROM (SELECT user_id,
               MAX(difmon) AS maxdif
          FROM ({0}
                )
         GROUP BY user_id
         )
 WHERE maxdif = 0;
""".format(seldifmon)
query(selnew)

[(0,)]
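
The nested subqueries can also be written as common table expressions, which some readers find easier to follow. This is an alternative sketch rather than the notebook's own query: it uses made-up rows, an in-memory database, and a bound reference date `:ref` in place of `'now'` so that the result is reproducible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for the example
conn.execute("CREATE TABLE paylogs (user_id INTEGER, started_at TEXT)")
conn.executemany("INSERT INTO paylogs VALUES (?, ?)", [
    (1, "2015-06-02 10:00:00"),   # pays only in the reference month
    (2, "2015-05-01 05:42:46"),   # also paid one month earlier
    (2, "2015-06-05 08:00:00"),
    (3, "2015-04-11 12:00:00"),   # no payment in the reference month
])

sel_new_cte = """
WITH difmons AS (
    SELECT user_id,
           12*(strftime('%Y', :ref) - strftime('%Y', started_at)) +
           (strftime('%m', :ref) - strftime('%m', started_at)) AS difmon
      FROM paylogs
),
per_user AS (
    SELECT user_id,
           MAX(difmon) AS maxdif
      FROM difmons
     GROUP BY user_id
)
SELECT COUNT(*)
  FROM per_user
 WHERE maxdif = 0;
"""
new_users = conn.execute(sel_new_cte, {"ref": "2015-06-30"}).fetchone()[0]
print(new_users)
```

Only user 1 has all payments in the reference month, so the count is 1.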

Count recurrent users, that is, users with at least two payments across the current and previous months (`difmon <= 1`, `c >= 2`):

In [12]:
selrecur = """
SELECT COUNT(DISTINCT user_id)
  FROM (SELECT user_id,
               COUNT(difmon) AS c
          FROM ({0}
                 WHERE difmon <= 1
                )
         GROUP BY user_id
         )
 WHERE c >= 2;
""".format(seldifmon)
query(selrecur)

[(0,)]
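
One property of this rule is worth noting: `COUNT(difmon)` counts payment rows, so two payments inside the same month also satisfy `c >= 2`. The sketch below contrasts the row-based rule with a stricter variant that requires payments in both months. The stricter variant, the sample rows, and the bound reference date `:ref` are assumptions for illustration, not the notebook's definition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for the example
conn.execute("CREATE TABLE paylogs (user_id INTEGER, started_at TEXT)")
conn.executemany("INSERT INTO paylogs VALUES (?, ?)", [
    (1, "2015-05-10 09:00:00"), (1, "2015-06-10 09:00:00"),  # previous + current month
    (2, "2015-06-01 09:00:00"), (2, "2015-06-20 09:00:00"),  # twice in the current month
    (3, "2015-06-15 09:00:00"),                              # a single payment
])

base = """
WITH difmons AS (
    SELECT user_id,
           12*(strftime('%Y', :ref) - strftime('%Y', started_at)) +
           (strftime('%m', :ref) - strftime('%m', started_at)) AS difmon
      FROM paylogs
)
SELECT COUNT(*)
  FROM (SELECT user_id
          FROM difmons
         WHERE difmon <= 1
         GROUP BY user_id
        HAVING {having})
"""
ref = {"ref": "2015-06-30"}

# Rule as in the notebook: at least two payment rows in the two-month window.
by_rows = conn.execute(base.format(having="COUNT(difmon) >= 2"), ref).fetchone()[0]
# Hypothetical stricter rule: a payment in each of the two months.
by_months = conn.execute(base.format(having="COUNT(DISTINCT difmon) = 2"), ref).fetchone()[0]
print(by_rows, by_months)
```

Here the row-based rule counts users 1 and 2, while the stricter variant counts only user 1. Which behavior is wanted depends on how "recurrent" is defined for the task.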

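Count reactivated users, that is, users with at least two payments outside the previous month (`difmon <> 1`, `c >= 2`), at least one of them in the current month (`mindif = 0`):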

In [13]:
selreac = """
SELECT COUNT(DISTINCT user_id)
  FROM (SELECT user_id,
               COUNT(difmon) AS c,
               MIN(difmon) AS mindif
          FROM ({0}
                 WHERE difmon <> 1
                )
         GROUP BY user_id
         )
 WHERE c >= 2 AND mindif = 0;
""".format(seldifmon)
query(selreac)

[(0,)]

Count churned users, that is, users with no payment in the current month (`mindif >= 1`):

In [14]:
selchurn = """
SELECT COUNT(DISTINCT user_id)
  FROM (SELECT user_id,
               MIN(difmon) AS mindif
          FROM ({0}
                )
         GROUP BY user_id
         )
 WHERE mindif >= 1;
""".format(seldifmon)
query(selchurn)

[(115,)]

Because the latest `started_at` value is in 2015, every user counts as churned when the reference date is `'now'`, and the other three counts are zero. Let's move the reference date from `'now'` to the most recent `started_at` in the `paylogs` table.

Define `date0` as the subquery that returns the new reference date, and `seldifmon0` as a copy of `seldifmon` with `'now'` replaced by `date0`:

In [20]:
date0 = """
SELECT MAX(started_at)
  FROM paylogs"""

seldifmon0 = """
SELECT user_id,
       12*(strftime('%Y', ({0})) - strftime('%Y', started_at)) +
       (strftime('%m', ({0})) - strftime('%m', started_at)) AS difmon
  FROM paylogs
""".format(date0)

Count new users relative to `date0`:

In [21]:
selnew0 = """
SELECT COUNT(DISTINCT user_id)
  FROM (SELECT user_id,
               MAX(difmon) AS maxdif
          FROM ({0}
                )
         GROUP BY user_id
         )
 WHERE maxdif = 0;
""".format(seldifmon0)
query(selnew0)

[(0,)]
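
The reference-date substitution used in `seldifmon0` can be checked on a small example. The sketch below is an assumption-laden variant: it fetches `MAX(started_at)` once and binds it as a parameter instead of embedding the subquery, and it runs on made-up rows in an in-memory database. `MAX` works on the text timestamps because ISO 8601 strings sort in chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for the example
conn.execute("CREATE TABLE paylogs (user_id INTEGER, started_at TEXT)")
conn.executemany("INSERT INTO paylogs VALUES (?, ?)", [
    (1, "2015-03-15 08:00:00"),
    (2, "2015-05-01 05:42:46"),
    (1, "2015-06-20 21:30:00"),
])

# Fetch the reference date once, then bind it as a named parameter.
(ref_date,) = conn.execute("SELECT MAX(started_at) FROM paylogs").fetchone()

rows = conn.execute("""
SELECT user_id,
       12*(strftime('%Y', :ref) - strftime('%Y', started_at)) +
       (strftime('%m', :ref) - strftime('%m', started_at)) AS difmon
  FROM paylogs
 ORDER BY started_at
""", {"ref": ref_date}).fetchall()
print(ref_date, rows)
```

The most recent payment gets `difmon = 0`, so at least one user always has activity in the reference month under this choice of date.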

Count recurrent users relative to `date0`:

In [22]:
selrecur0 = """
SELECT COUNT(DISTINCT user_id)
  FROM (SELECT user_id,
               COUNT(difmon) AS c
          FROM ({0}
                 WHERE difmon <= 1
                )
         GROUP BY user_id
         )
 WHERE c >= 2;
""".format(seldifmon0)
query(selrecur0)

[(0,)]

Count reactivated users relative to `date0`:

In [24]:
selreac0 = """
SELECT COUNT(DISTINCT user_id)
  FROM (SELECT user_id,
               COUNT(difmon) AS c,
               MIN(difmon) AS mindif
          FROM ({0}
                 WHERE difmon <> 1
                )
         GROUP BY user_id
         )
 WHERE c >= 2 AND mindif = 0;
""".format(seldifmon0)
query(selreac0)

[(1,)]

Count churned users relative to `date0`:

In [25]:
selchurn0 = """
SELECT COUNT(DISTINCT user_id)
  FROM (SELECT user_id,
               MIN(difmon) AS mindif
          FROM ({0}
                )
         GROUP BY user_id
         )
 WHERE mindif >= 1;
""".format(seldifmon0)
query(selchurn0)

[(114,)]

As the queries above show, relative to `date0` there are 114 churned users, 1 reactivated user, and no new or recurrent users, which together account for all 115 users.
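
For a cross-check outside SQL, the four rules can be mirrored in plain Python. The sketch below is illustrative only: the `(user_id, difmon)` pairs are made up, and `classify` is a hypothetical helper that applies the same conditions as the queries above to one user's list of `difmon` values.

```python
from collections import defaultdict

# Hypothetical (user_id, difmon) pairs; difmon = months before the reference month.
payments = [(1, 0), (2, 0), (2, 1), (3, 0), (3, 2), (4, 3)]

difmons = defaultdict(list)
for user_id, difmon in payments:
    difmons[user_id].append(difmon)

def classify(ds):
    """Return the set of labels whose SQL condition holds for one user."""
    labels = set()
    if max(ds) == 0:                           # new: maxdif = 0
        labels.add("new")
    if sum(d <= 1 for d in ds) >= 2:           # recurrent: c >= 2 where difmon <= 1
        labels.add("recurrent")
    kept = [d for d in ds if d != 1]           # reactivated: drop difmon = 1 rows
    if len(kept) >= 2 and min(kept) == 0:
        labels.add("reactivated")
    if min(ds) >= 1:                           # churned: mindif >= 1
        labels.add("churned")
    return labels

labels = {user: classify(ds) for user, ds in difmons.items()}
print(labels)
```

Returning a set rather than a single label makes it visible when a user satisfies more than one rule, since the conditions are not mutually exclusive by construction.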