# Capstone

We will start again, preparing the data for the **capstone project**. We will focus on **DMND students** who enrolled at any point after **2017-04-01** and before **2017-08-10**. Our total universe is the set of registered users who visited the DMND NDOP at least once in this period.

## Getting the universe

In [155]:
import pandas as pd
import numpy as np
import udb
import datetime

start_date = datetime.date(year=2017, month=4, day=1)
end_date = datetime.date(year=2017, month=8, day=10)

br = udb.get_ebdb_engine()
us = udb.get_analytics_engine()

all_accounts_who_visited = pd.read_sql_query("""
SELECT DISTINCT fbi.email
FROM frontend_brazil.pages fbp
  LEFT JOIN frontend_brazil.identifies fbi ON fbp.anonymous_id = fbi.anonymous_id
WHERE fbp.received_at >= '{}' AND fbp.received_at <= '{}'
""".format(start_date.strftime('%Y-%m-%d'), end_date.strftime('%Y-%m-%d')), con=us)

all_accounts_who_visited.shape[0]

32543
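
The query above splices the dates into the SQL text with `str.format`. As a minimal sketch of the alternative, the dates can be passed as bound parameters instead. The in-memory SQLite table and its rows below are hypothetical stand-ins for `frontend_brazil.pages`; `pd.read_sql_query` accepts a `params` argument for the same purpose, with a placeholder syntax that depends on the database driver.

```python
import datetime
import sqlite3

# Hypothetical stand-in for the warehouse table; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (email TEXT, received_at TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [
        ("a@example.com", "2017-03-15"),  # before the period
        ("b@example.com", "2017-04-02"),
        ("b@example.com", "2017-05-20"),  # same user, second visit
        ("c@example.com", "2017-08-10"),
    ],
)

start_date = datetime.date(2017, 4, 1)
end_date = datetime.date(2017, 8, 10)

# The driver binds the values, so no manual quoting or formatting is needed.
rows = conn.execute(
    "SELECT DISTINCT email FROM pages "
    "WHERE received_at >= ? AND received_at <= ? "
    "ORDER BY email",
    (start_date.isoformat(), end_date.isoformat()),
).fetchall()

visitors = [email for (email,) in rows]
print(visitors)
```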

Now, let's find out who became a **paying student** in this period. This is the target feature we will train our model to predict.

## Paying students

In [156]:
paying_students = pd.read_sql_query("""
SELECT
  au.email
FROM payment_app_subscription ps
  INNER JOIN payment_app_product pp ON ps.product_id = pp.id
  INNER JOIN auth_user au ON au.id = ps.user_id
WHERE (status = 'active' OR status = 'payment_credit_retry')
      AND pp.code like 'nd018%%'
      AND register_date >= '{}'
      AND register_date <= '{}'
""".format(start_date.strftime('%Y-%m-%d'), end_date.strftime('%Y-%m-%d')), con=br)

paying_students.shape[0]

3601

The next step is to merge both into one dataframe with **email** and the target feature **is_paying_student**, which is either 0 or 1.

In [157]:
paying_students['is_paying_student'] = 1
x = pd.merge(all_accounts_who_visited, paying_students, how='left', on='email')

In [158]:
x['is_paying_student'] = x['is_paying_student'].fillna(0).astype(int)
print(x['is_paying_student'].sum(), x.shape[0])

2952 32543


As we can see above, although the Brazil database has **3,601** paying students registered, we have NDOP visit data for only **2,952** of them. We will move forward with these **2,952**.
From now on, we will add features that we think can explain their enrollment behaviour.
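
One way to inspect a gap like this is an outer merge with `indicator=True`, which labels each row by the side it came from. The sketch below uses made-up emails, not the real data, and only illustrates the technique.

```python
import pandas as pd

# Toy stand-ins for all_accounts_who_visited and paying_students.
visited = pd.DataFrame(
    {"email": ["a@example.com", "b@example.com", "c@example.com"]}
)
paying = pd.DataFrame(
    {"email": ["b@example.com", "d@example.com"], "is_paying_student": 1}
)

# indicator=True adds a _merge column: left_only, right_only or both.
audit = pd.merge(visited, paying, how="outer", on="email", indicator=True)
counts = audit["_merge"].value_counts()

paying_with_visits = int(counts["both"])
paying_without_visits = int(counts["right_only"])
print(paying_with_visits, paying_without_visits)
```

Rows marked `right_only` correspond to paying students with no recorded visit, which is the group dropped by the left merge above.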

## How old are these accounts?
Our first hypothesis is that the likelihood of a student enrolling is influenced by how recently they registered their Udacity account.

In [159]:
x['email'] = x['email'].astype(str)
# Build a quoted, comma-separated list for the SQL IN clause: ('a', 'b', ...)
email_query = '(' + ', '.join("'" + email + "'" for email in x['email']) + ')'
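
A caveat when building this list: the universe query uses a LEFT JOIN, so it may (this is an assumption about the data) return a row with a missing email, which `astype(str)` would turn into a placeholder string inside the IN clause. A minimal sketch, on made-up emails, of dropping missing and duplicate values first:

```python
import pandas as pd

# Toy emails: one missing value and one duplicate.
emails = pd.Series(["a@example.com", None, "b@example.com", "a@example.com"])

# Remove missing values and duplicates before quoting each email.
clean = emails.dropna().drop_duplicates()
email_query = "(" + ", ".join("'{}'".format(e) for e in clean) + ")"
print(email_query)
```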

In [160]:
emails_dates_joined = pd.read_sql_query("SELECT email, date_joined FROM auth_user WHERE email in {}".format(email_query), con=br)
emails_dates_joined.shape[0]

31990

As you can see above, we could not find join dates for all **32,543** accounts in our universe; we found only **31,990**. Let's see if we have better luck with the US database:

In [161]:
emails_dates_joined_us = pd.read_sql_query("SELECT email, created_at FROM analytics_tables.accounts WHERE email in {}".format(email_query), con=us)
emails_dates_joined_us.shape[0]

31844

Strangely enough, we found even fewer rows in the US database. Let's move forward with the Brazil data and disregard rows without a join date:

In [162]:
result = pd.merge(x, emails_dates_joined, how='inner', on='email')
result.shape[0]

31990

A raw join date is not quite what we want: let's compute the account age in days, assuming today is the last day of the period:

In [163]:
def calculate_age(row):
    d2 = row['date_joined'].to_pydatetime().date()
    d1 = end_date
    return abs((d2 - d1).days)
    
result['age_in_days'] = result.apply(calculate_age, axis=1)
result.shape[0]

31990
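
The same age calculation can be sketched without a row-wise `apply`, by subtracting whole columns. The dataframe below is a made-up stand-in for `result`, and the sketch assumes `date_joined` is already a datetime column.

```python
import datetime

import pandas as pd

end_date = datetime.date(2017, 8, 10)

# Toy stand-in for the merged dataframe.
accounts = pd.DataFrame(
    {
        "email": ["a@example.com", "b@example.com"],
        "date_joined": pd.to_datetime(
            ["2017-08-01 13:45:00", "2016-08-10 00:00:00"]
        ),
    }
)

# normalize() drops the time of day, so the result matches the
# date-based subtraction used in calculate_age.
joined_day = accounts["date_joined"].dt.normalize()
accounts["age_in_days"] = (pd.Timestamp(end_date) - joined_day).dt.days.abs()
print(accounts["age_in_days"].tolist())
```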

In [164]:
x = pd.merge(x, result[['email', 'age_in_days']], how='inner', on='email')
x.shape[0]

31990

In [165]:
x.head()

Unnamed: 0,email,is_paying_student,age_in_days
0,melquic@gmail.com,0,395
1,93diegopereira@gmail.com,0,387
2,marcelmilcent@gmail.com,0,396
3,filipe.uece@gmail.com,0,394
4,ribaldorafael@gmail.com,0,397


## How many times has each user enrolled in a webinar?
Running webinars is one of our key strategies to engage leads. Let's check whether it really leads to conversion.

In [166]:
webinar_enrollments = pd.read_sql_query("""
SELECT email, COUNT(id)
FROM brazil_events.event_sign_up
WHERE enrollment_date >= '{}' and enrollment_date <= '{}'
GROUP BY email
""".format(start_date.strftime('%Y-%m-%d'), end_date.strftime('%Y-%m-%d')), con=us)

webinar_enrollments.shape[0]

23858

In [167]:
x = pd.merge(x, webinar_enrollments, how='left', on='email')
x.shape[0]

31990

In [168]:
x['webinar_enrollments'] = x['count'].fillna(0).astype(int)
x = x[['email', 'is_paying_student', 'age_in_days', 'webinar_enrollments']]
x.head()

Unnamed: 0,email,is_paying_student,age_in_days,webinar_enrollments
0,melquic@gmail.com,0,395,0
1,93diegopereira@gmail.com,0,387,1
2,marcelmilcent@gmail.com,0,396,2
3,filipe.uece@gmail.com,0,394,0
4,ribaldorafael@gmail.com,0,397,1
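
Before any modelling, a quick first look at the webinar question is to compare conversion rates between users with and without webinar sign-ups. The numbers below are invented for illustration and say nothing about the real data; the sketch only shows the groupby.

```python
import pandas as pd

# Made-up rows with the same two columns as x.
toy = pd.DataFrame(
    {
        "is_paying_student": [0, 1, 0, 1, 1, 0],
        "webinar_enrollments": [0, 0, 0, 1, 2, 1],
    }
)

# 1 if the user signed up for at least one webinar, else 0.
toy["signed_up"] = (toy["webinar_enrollments"] > 0).astype(int)

# The mean of a 0/1 target is the conversion rate within each group.
rates = toy.groupby("signed_up")["is_paying_student"].mean()
print(rates)
```

A difference in rates is only a correlation; it does not by itself show that webinars cause conversion.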


## How many times has each user enrolled in a free course?
Some people believe free courses lead to paying students. Let's see if that's true.

In [169]:
free_course_enrollments = pd.read_sql_query("""
SELECT
  ac.email,
  count(ce.course_key) as course_enrollments
FROM analytics_tables.course_enrollments ce 
INNER JOIN analytics_tables.accounts ac on ce.user_id = ac.user_id
  WHERE ce.join_time <= '{}' AND ac.email IN {}
GROUP BY ac.email
""".format(end_date.strftime('%Y-%m-%d'), email_query), con=us)

free_course_enrollments.shape[0]

14831

In [170]:
x = pd.merge(x, free_course_enrollments, how='left', on='email')
x.shape[0]

31990

In [171]:
x['course_enrollments'] = x['course_enrollments'].fillna(0).astype(int)
x.head()

Unnamed: 0,email,is_paying_student,age_in_days,webinar_enrollments,course_enrollments
0,melquic@gmail.com,0,395,0,0
1,93diegopereira@gmail.com,0,387,1,1
2,marcelmilcent@gmail.com,0,396,2,10
3,filipe.uece@gmail.com,0,394,0,6
4,ribaldorafael@gmail.com,0,397,1,2


## How many times has each user navigated to each of our key pages?
The navigation pattern should be considered in predicting whether a user will become a paying student or not.

In [180]:
all_visits = pd.read_sql_query("""
SELECT
  DISTINCT fbp.id,
  fbp.path,
  fbp.referrer,
  fbp.context_user_agent,
  fbi.email
FROM frontend_brazil.pages fbp
  LEFT JOIN frontend_brazil.identifies fbi ON fbi.anonymous_id = fbp.anonymous_id
WHERE fbp.received_at <= '{}' AND fbi.email IN {}
""".format(end_date.strftime('%Y-%m-%d'), email_query), con=us)

all_visits.shape[0]

1591459

In [181]:
all_visits.head()

Unnamed: 0,id,path,referrer,context_user_agent,email
0,ajs-66a8df7da27dc5b95525feb5c46906a6,/courses/all/,https://br.udacity.com/,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,amandajanot@gmail.com
1,ajs-9187d6edacc3b2b78e57e7a76e61ad9e,/,https://br.udacity.com/course/front-end-web-de...,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,naaatalia.azevedo@live.com
2,ajs-110c06e2b8add9b1d50a73b6be288f55,/courses/all/,https://classroom.udacity.com/courses/ud837,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,thiagodecnop@gmail.com
3,ajs-abcac325ab3a59146228312171f2b374,/course/data-analyst-nanodegree--nd002/,https://outlook.live.com/,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,rafaelmohedano@gmail.com
4,ajs-bb297a3349f14592fffb28ba469af1a3,/,https://www.google.com.br/,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5)...,bfrascino80@gmail.com


In [182]:
visits = all_visits.copy()

In [185]:
visits['is_home'] = (visits['path'] == '/').astype(int)
visits['is_ndop'] = (visits['path'].str.contains('--nd')).astype(int)
visits['is_catalog_all'] = (visits['path'] == '/courses/all/').astype(int)
visits['is_catalog_nanodegrees'] = (visits['path'] == '/courses/nanodegrees/').astype(int)
visits['is_nanodegree_home'] = (visits['path'] == '/nanodegree/').astype(int)
visits['is_fcop_ud'] = (visits['path'].str.contains('--ud')).astype(int)
visits['is_fcop_cs'] = (visits['path'].str.contains('--cs')).astype(int)
visits['is_fcop_st'] = (visits['path'].str.contains('--st')).astype(int)
visits['is_signin'] = (visits['path'].str.contains('/signin/')).astype(int)
visits['is_event'] = (visits['path'].str.contains('/events/')).astype(int)
visits['is_50back'] = (visits['path'] == '/nanodegree/50-back/').astype(int)
visits['is_tech_requirements'] = (visits['path'] == '/tech-requirements//').astype(int)
visits['is_contact'] = (visits['path'] == '/contact/').astype(int)
visits['is_us'] = (visits['path'] == '/us/').astype(int)
visits['is_jobs'] = (visits['path'] == '/jobs/').astype(int)
visits['is_legal'] = (visits['path'] == '/legal/').astype(int)
visits['is_hire_talent'] = (visits['path'] == '/hire-talent/').astype(int)
visits['is_business'] = (visits['path'] == '/business/').astype(int)
visits['is_success'] = (visits['path'] == '/success/').astype(int)
visits['is_payment'] = (visits['path'] == '/payment/').astype(int)
visits['is_android'] = (visits['path'].str.contains('/android/')).astype(int)
visits['is_ai'] = (visits['path'].str.contains('/ai/')).astype(int)
visits['is_drive'] = (visits['path'].str.contains('/drive/')).astype(int)
visits['is_robotics'] = (visits['path'].str.contains('/robotics/')).astype(int)
visits['is_checkout'] = (visits['path'].str.contains('/checkout')).astype(int)
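
The flags above follow two patterns, exact path match and substring match, so they could also be generated from two lookup tables. A minimal sketch on made-up paths, with only a few of the flags; `na=False` treats a missing path as "no match", which is an assumption about how missing paths should be handled.

```python
import pandas as pd

# Toy paths, including a missing one.
visits = pd.DataFrame(
    {"path": ["/", "/courses/all/", "/course/x--nd002/", "/checkout/nd002", None]}
)

exact = {"is_home": "/", "is_catalog_all": "/courses/all/"}
substring = {"is_ndop": "--nd", "is_checkout": "/checkout"}

for column, path in exact.items():
    visits[column] = (visits["path"] == path).astype(int)

for column, fragment in substring.items():
    # regex=False matches the fragment literally.
    matches = visits["path"].str.contains(fragment, regex=False, na=False)
    visits[column] = matches.astype(int)

print(visits)
```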

In [186]:
visits.head().transpose()

Unnamed: 0,0,1,2,3,4
id,ajs-66a8df7da27dc5b95525feb5c46906a6,ajs-9187d6edacc3b2b78e57e7a76e61ad9e,ajs-110c06e2b8add9b1d50a73b6be288f55,ajs-abcac325ab3a59146228312171f2b374,ajs-bb297a3349f14592fffb28ba469af1a3
path,/courses/all/,/,/courses/all/,/course/data-analyst-nanodegree--nd002/,/
referrer,https://br.udacity.com/,https://br.udacity.com/course/front-end-web-de...,https://classroom.udacity.com/courses/ud837,https://outlook.live.com/,https://www.google.com.br/
context_user_agent,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl...,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5)...
email,amandajanot@gmail.com,naaatalia.azevedo@live.com,thiagodecnop@gmail.com,rafaelmohedano@gmail.com,bfrascino80@gmail.com
is_home,0,1,0,0,1
is_ndop,0,0,0,1,0
is_catalog_all,1,0,1,0,0
is_catalog_nanodegrees,0,0,0,0,0
is_nanodegree_home,0,0,0,0,0


In [187]:
# Checkout pages should not count as NDOP visits, even if their path matches '--nd'.
visits.loc[visits['is_checkout'] == 1, 'is_ndop'] = 0

In [188]:
import re

def is_mobile(row):
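    # Python patterns take no JavaScript-style / delimiters; with them, only '/Android' and 'Opera Mini/' would match literally.
    if re.search('Android|webOS|iPhone|iPad|iPod|BlackBerry|IEMobile|Opera Mini', str(row['context_user_agent'])):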
        return 1
    else:
        return 0

visits['is_mobile'] = visits.apply(is_mobile, axis=1)
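
A note on the pattern syntax: in JavaScript a regex literal is wrapped in `/.../`, but in Python the slashes are ordinary characters that must appear in the text. The sketch below compares both spellings on a made-up Android user agent.

```python
import re

# A made-up Android user agent string.
ua = "Mozilla/5.0 (Linux; Android 7.0; SM-G930F) AppleWebKit/537.36"

names = "Android|webOS|iPhone|iPad|iPod|BlackBerry|IEMobile|Opera Mini"
with_slashes = re.compile("/" + names + "/")
without_slashes = re.compile(names)

# With the slashes, the first alternative is the literal text '/Android',
# which does not occur in this user agent.
matched_with = bool(with_slashes.search(ua))
matched_without = bool(without_slashes.search(ua))
print(matched_with, matched_without)
```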

In [191]:
visits['referrer'] = visits['referrer'].fillna('')
# regex=False matches the text literally; by default '.' would match any character.
visits['is_referrer_google'] = (visits['referrer'].str.contains('.google.', regex=False)).astype(int)
visits['is_referrer_facebook'] = (visits['referrer'].str.contains('.facebook.', regex=False)).astype(int)
visits['is_referrer_live'] = (visits['referrer'].str.contains('.live.', regex=False)).astype(int)
visits['is_referrer_infomoney'] = (visits['referrer'].str.contains('.infomoney.', regex=False)).astype(int)
visits['is_referrer_catracalivre'] = (visits['referrer'].str.contains('.catracalivre.', regex=False)).astype(int)
visits['is_referrer_android'] = (visits['referrer'].str.contains('.android.', regex=False)).astype(int)
visits['is_referrer_anhanguera'] = (visits['referrer'].str.contains('anhanguera.', regex=False)).astype(int)
visits['is_referrer_linkedin'] = (visits['referrer'].str.contains('.linkedin.', regex=False)).astype(int)
visits['is_referrer_instagram'] = (visits['referrer'].str.contains('.instagram.', regex=False)).astype(int)
visits['is_referrer_cbsi'] = (visits['referrer'].str.contains('.cbsi.', regex=False)).astype(int)
visits['is_referrer_tecmundo'] = (visits['referrer'].str.contains('.tecmundo.', regex=False)).astype(int)
visits['is_referrer_bing'] = (visits['referrer'].str.contains('.bing.', regex=False)).astype(int)
visits['is_referrer_computerworld'] = (visits['referrer'].str.contains('.computerworld.', regex=False)).astype(int)
visits['is_referrer_github'] = (visits['referrer'].str.contains('.github.', regex=False)).astype(int)
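
`Series.str.contains` treats its pattern as a regular expression by default, so each `.` matches any character. A small sketch with invented referrers shows how that differs from a literal match:

```python
import pandas as pd

# Made-up referrers; the second one is not a live.com address.
referrers = pd.Series(
    ["https://outlook.live.com/", "https://olive-garden.example/", ""]
)

# Default (regex): '.live.' also matches 'olive-'.
as_regex = referrers.str.contains(".live.").tolist()

# Literal match: the dots must really be dots.
as_literal = referrers.str.contains(".live.", regex=False).tolist()

print(as_regex, as_literal)
```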

In [193]:
visits.drop('path', axis=1, inplace=True)
visits.drop('referrer', axis=1, inplace=True)
visits.drop('context_user_agent', axis=1, inplace=True)
visits.head().transpose()

Unnamed: 0,0,1,2,3,4
id,ajs-66a8df7da27dc5b95525feb5c46906a6,ajs-9187d6edacc3b2b78e57e7a76e61ad9e,ajs-110c06e2b8add9b1d50a73b6be288f55,ajs-abcac325ab3a59146228312171f2b374,ajs-bb297a3349f14592fffb28ba469af1a3
email,amandajanot@gmail.com,naaatalia.azevedo@live.com,thiagodecnop@gmail.com,rafaelmohedano@gmail.com,bfrascino80@gmail.com
is_home,0,1,0,0,1
is_ndop,0,0,0,1,0
is_catalog_all,1,0,1,0,0
is_catalog_nanodegrees,0,0,0,0,0
is_nanodegree_home,0,0,0,0,0
is_fcop_ud,0,0,0,0,0
is_fcop_cs,0,0,0,0,0
is_fcop_st,0,0,0,0,0


Now that we have a dataframe with all visits from our universe of users transformed into features, let's aggregate by email:

In [195]:
f = {
    'id': ['count'],
    'is_home': ['sum'],
    'is_ndop': ['sum'],
    'is_catalog_all': ['sum'],
    'is_catalog_nanodegrees': ['sum'],
    'is_nanodegree_home': ['sum'],
    'is_fcop_ud': ['sum'],
    'is_fcop_cs': ['sum'],
    'is_fcop_st': ['sum'],
    'is_signin': ['sum'],
    'is_event': ['sum'],
    'is_50back': ['sum'],
    'is_tech_requirements': ['sum'],
    'is_contact': ['sum'],
    'is_us': ['sum'],
    'is_jobs': ['sum'],
    'is_legal': ['sum'],
    'is_hire_talent': ['sum'],
    'is_business': ['sum'],
    'is_success': ['sum'],
    'is_payment': ['sum'],
    'is_android': ['sum'],
    'is_ai': ['sum'],
    'is_drive': ['sum'],
    'is_robotics': ['sum'],
    'is_checkout': ['sum'],
    'is_mobile': ['sum'],
    'is_referrer_google': ['sum'],
    'is_referrer_facebook': ['sum'],
    'is_referrer_live': ['sum'],
    'is_referrer_infomoney': ['sum'],
    'is_referrer_catracalivre': ['sum'],
    'is_referrer_android': ['sum'],
    'is_referrer_anhanguera': ['sum'],
    'is_referrer_linkedin': ['sum'],
    'is_referrer_instagram': ['sum'],
    'is_referrer_cbsi': ['sum'],
    'is_referrer_tecmundo': ['sum'],
    'is_referrer_bing': ['sum'],
    'is_referrer_computerworld': ['sum'],
    'is_referrer_github': ['sum']
}
grouped = visits.groupby('email', as_index=False).agg(f)

In [197]:
# Each column has a single aggregation, so the second header level can be dropped.
grouped.columns = grouped.columns.droplevel(-1)
grouped.head()

Unnamed: 0,email,is_business,is_ndop,is_drive,is_us,is_nanodegree_home,is_fcop_ud,is_tech_requirements,is_signin,is_success,...,is_ai,is_catalog_nanodegrees,is_referrer_cbsi,is_referrer_github,is_50back,is_referrer_anhanguera,is_payment,is_referrer_tecmundo,is_legal,is_referrer_live
0,+557588971838@gmail.com,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,00hf11@gmail.com,0,2,0,0,7,5,0,5,0,...,0,5,0,0,1,0,0,0,0,0
2,01bertoferreira@gmail.com,0,31,0,0,2,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,02001gp@gmail.com,0,1,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,07019312.das@gmail.com,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,2,0,0,0,0
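
As an alternative sketch, named aggregation (available in pandas 0.25 and later) yields flat column names directly, so neither `droplevel` nor a later rename is needed. The toy visits below are made up, and the `is_` prefix is used to pick the flag columns.

```python
import pandas as pd

# Toy visits with two of the flag columns.
visits = pd.DataFrame(
    {
        "id": ["v1", "v2", "v3"],
        "email": ["a@example.com", "a@example.com", "b@example.com"],
        "is_home": [1, 0, 1],
        "is_ndop": [0, 1, 0],
    }
)

# Sum every is_* flag and count the visit ids.
flag_columns = [c for c in visits.columns if c.startswith("is_")]
aggregations = {c: (c, "sum") for c in flag_columns}
aggregations["count_visits"] = ("id", "count")

grouped = visits.groupby("email", as_index=False).agg(**aggregations)
print(grouped)
```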


In [199]:
grouped.rename(columns={'id': 'count_visits'}, inplace=True)
grouped.head().transpose()

Unnamed: 0,0,1,2,3,4
email,+557588971838@gmail.com,00hf11@gmail.com,01bertoferreira@gmail.com,02001gp@gmail.com,07019312.das@gmail.com
is_business,0,0,0,0,0
is_ndop,0,2,31,1,0
is_drive,0,0,0,0,0
is_us,0,0,0,0,0
is_nanodegree_home,1,7,2,1,1
is_fcop_ud,0,5,0,0,0
is_tech_requirements,0,0,0,0,0
is_signin,0,5,1,0,0
is_success,0,0,0,0,0


Now, let's put everything together:

In [203]:
result = pd.merge(x, grouped, how='inner', on='email')
result.shape

(31990, 46)
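
Checking the row count after a merge, as above, catches accidental duplication. A complementary sketch is the `validate` argument of `pd.merge`, which raises an error when the keys are not unique. The frames below are made up, with a deliberately duplicated email.

```python
import pandas as pd

x = pd.DataFrame({"email": ["a", "b"], "is_paying_student": [0, 1]})

# Hypothetical bad input: email 'b' appears twice.
grouped = pd.DataFrame({"email": ["a", "b", "b"], "count_visits": [3, 1, 2]})

try:
    pd.merge(x, grouped, how="inner", on="email", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    # Raised because the right-hand keys are not unique.
    duplicate_detected = True

print(duplicate_detected)
```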

In [204]:
result.head().transpose()

Unnamed: 0,0,1,2,3,4
email,melquic@gmail.com,93diegopereira@gmail.com,marcelmilcent@gmail.com,filipe.uece@gmail.com,ribaldorafael@gmail.com
is_paying_student,0,0,0,0,0
age_in_days,395,387,396,394,397
webinar_enrollments,0,1,2,0,1
course_enrollments,0,1,10,6,2
is_business,0,0,0,0,0
is_ndop,8,19,15,2,4
is_drive,0,0,2,0,0
is_us,0,0,0,0,0
is_nanodegree_home,6,9,37,2,1


In [205]:
result.to_csv('new_features.csv')
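
By default `to_csv` also writes the row index, which comes back as an `Unnamed: 0` column when the file is read again. A minimal sketch with a one-row toy frame and in-memory buffers instead of a file; whether to keep the index is a choice for whoever consumes the file.

```python
import io

import pandas as pd

result = pd.DataFrame({"email": ["a@example.com"], "is_paying_student": [0]})

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
result.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False writes only the data columns.
without_index = io.StringIO()
result.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```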