In [1]:
import pandas as pd
import plotly.graph_objs as go
from plotly.offline import iplot, plot, init_notebook_mode
from config import credentials
from sklearn.model_selection import train_test_split
import xgboost as xgb

init_notebook_mode(connected=True)

### Part 1 ‑ Exploratory data analysis
The attached logins.json file contains (simulated) timestamps of user logins in a particular
geographic location. Aggregate these login counts based on 15-minute time intervals, and
visualize and describe the resulting time series of login counts in ways that best characterize the
underlying patterns of the demand. Please report/illustrate important features of the demand,
such as daily cycles. If there are data quality issues, please report them.

In [2]:
logins = pd.read_json('logins.json')
logins.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93142 entries, 0 to 93141
Data columns (total 1 columns):
login_time    93142 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 727.8 KB


A quick `.info()` call shows that all 93,142 `login_time` entries are non-null, so the data contains no missing values.

In [3]:
# add a count column and set equal to 1
logins['count'] = 1
# set login_time as index and sort
logins = logins.set_index('login_time').sort_index()

In [4]:
# resample to a 15 min interval, summing the
# counts in each bin
logins_15m = logins.resample('15min').sum()
# add columns for the year, month, and day
logins_15m['year'] = logins_15m.index.year
logins_15m['month'] = logins_15m.index.month
logins_15m['day'] = logins_15m.index.day

In [5]:
logins_15m.head()

login_time,count,year,month,day
1970-01-01 20:00:00,2.0,1970,1,1
1970-01-01 20:15:00,6.0,1970,1,1
1970-01-01 20:30:00,9.0,1970,1,1
1970-01-01 20:45:00,7.0,1970,1,1
1970-01-01 21:00:00,1.0,1970,1,1


In [6]:
logins_15m.describe()

statistic,count,year,month,day
count,9381.0,9788.0,9788.0,9788.0
mean,9.928792,1970.0,2.259093,14.569268
std,8.263146,0.0,1.017219,8.683342
min,1.0,1970.0,1.0,1.0
25%,4.0,1970.0,1.0,7.0
50%,8.0,1970.0,2.0,14.0
75%,14.0,1970.0,3.0,22.0
max,73.0,1970.0,4.0,31.0


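The `.describe()` output shows that all of the data was collected in a single year, 1970. It also shows that `count` has 9,381 non-null values while the index-derived columns have 9,788, so 407 of the 15-minute intervals contain no logins and show up as missing rather than as 0. Let's plot the raw data to see what it looks like.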
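Before plotting, one detail in the `.describe()` output is worth a quick check: intervals with no logins can be reported as missing instead of zero, depending on how the resampled sum treats empty bins. The sketch below uses a small hypothetical set of timestamps (not the real `logins.json` data) and the `min_count` argument of the resampled `sum` to make that behavior explicit, then fills the gaps with 0.

```python
import pandas as pd

# Hypothetical toy timestamps (not the real logins.json data):
# two logins in the 20:00 bin, none in the 20:15 bin, one in the 20:30 bin.
toy = pd.DataFrame(
    {"count": 1},
    index=pd.to_datetime(
        ["1970-01-01 20:01", "1970-01-01 20:07", "1970-01-01 20:31"]
    ),
)

# min_count=1 means a bin needs at least one row to produce a sum;
# bins with no logins come back as NaN instead of 0.
with_gaps = toy["count"].resample("15min").sum(min_count=1)

# Number of 15-minute intervals that had no logins at all.
n_empty = int(with_gaps.isna().sum())

# Treat "no logins" as an explicit zero for plotting and modeling.
filled = with_gaps.fillna(0).astype(int)
```

Whether empty intervals should be zeros or gaps is a modeling choice; for demand counts, zero is usually the more natural reading.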

In [7]:
trace = go.Scatter(x=logins_15m.index, y=logins_15m['count'])

layout = go.Layout(title='Logins',
                  yaxis=dict(title='Login Counts'))

fig = go.Figure([trace], layout)

iplot(fig, filename='15m-all.html')

Here we can see we have a little less than 4 full months of data, ranging from Jan. 1 to Apr. 13th. There are some patterns that stand out already:

1. The logins seem to build up over regular intervals before dropping again and repeating this cycle. 
2. The 15-minute interval with the highest overall number of logins occurred on Mar. 1 at 4:30 am. In fact, a quick inspection reveals that many of the spikes occur around this time of day.
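The peak interval can also be located programmatically rather than read off the plot. This is a minimal sketch on a hypothetical toy series standing in for `logins_15m['count']`; the values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy series standing in for logins_15m['count'].
counts = pd.Series(
    [3, 12, 5, 9],
    index=pd.date_range("1970-03-01 04:00", periods=4, freq="15min"),
    name="count",
)

# Timestamp and size of the single busiest 15-minute bin.
peak_time = counts.idxmax()
peak_value = counts.max()

# Hour of day of the three largest bins, to see whether spikes cluster.
top_hours = list(counts.nlargest(3).index.hour)
```

Applied to the real series, tabulating the hours of the largest bins would show whether the spikes do cluster around the same time of day.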

Since there are several of these "build up" cycles, it might be useful to visualize the data for each month and day. Let's start by looking at daily login counts for each month. We can already see from the plot above that the largest spikes usually occur roughly one week apart.

In [8]:
daily = logins.resample('1D').sum()

In [9]:
trace0 = go.Scatter(x=daily.index, y=daily['count'],
                                  mode='lines+markers')

layout = go.Layout(title='Logins by Day',
                  yaxis=dict(title='Daily Login Counts'))

fig = go.Figure([trace0], layout)

iplot(fig, filename='daily.html')

The plot above shows the number of logins per day. Now it's clearer that there is a weekly recurring spike in logins. The average daily number of logins also looks to be increasing over time.

The largest number of logins in a single day occurred on April 4th, 1970, with 1889 logins. A quick search indicates that day was a Saturday, so the spikes seem to happen on weekends. Each spike is followed by a valley, and the lowest daily login counts usually fall on Mondays.
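The day of the week can also be confirmed directly in Python instead of searching for it. The sketch below uses only the standard library; on the `daily` frame, `daily.index.day_name()` should give the same information for every date at once.

```python
from datetime import date

peak_day = date(1970, 4, 4)

# weekday() uses the same convention as pandas: Monday=0 ... Sunday=6.
weekday_index = peak_day.weekday()

# Human-readable name (depends on the active locale).
weekday_name = peak_day.strftime("%A")
```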

Next, we can look at the intraday login trends for the different days of the week. This will help shed light on the time of day when we can expect the most traffic, which may or may not differ between low-login and high-login days.

In [10]:
dayofweek = logins.resample('15min').sum()
dayofweek['dayofweek'] = dayofweek.index.dayofweek
dayofweek['timeofday'] = dayofweek.index.time
dayofweek_group = dayofweek.groupby(['dayofweek', 'timeofday']).sum()
dayofweek_group = dayofweek_group.reset_index()
dayofweek_group.head()

,dayofweek,timeofday,count
0,0,00:00:00,126.0
1,0,00:15:00,140.0
2,0,00:30:00,144.0
3,0,00:45:00,121.0
4,0,01:00:00,109.0


In [11]:
traces=[]
for day in dayofweek_group['dayofweek'].unique():
    day_subset = dayofweek_group[dayofweek_group['dayofweek']==day]
    
    traces.append(go.Scatter(name='Day of week: %s' % day, x=day_subset.timeofday,
                            y=day_subset['count']))
                  
layout=go.Layout(title='Login Counts by Day of Week',
                yaxis=dict(title='Login Counts'))

fig=go.Figure(traces, layout)

iplot(fig, filename='dayofweek.html')

Now we can clearly see the difference in the login time series between weekdays and weekends! The days of the week run from 0 (Monday) to 6 (Sunday). Saturdays and Sundays clearly have the largest spikes in logins, and both peak between 4:30 and 4:45 am. On the other hand, weekdays show a very consistent large spike in logins between 11:30 and 11:45 am, and generally a lot of login activity between 9 pm and 3 am.

This suggests that intraday demand can be modeled reasonably well from just the day of the week and the time of day.
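One simple way to turn this observation into a baseline model is to average the counts for each (day of week, time of day) pair and use that average as the expected demand. The sketch below is an illustration on hypothetical toy data, not the notebook's actual model; `expected_logins` is a made-up helper name. It uses the mean rather than the sum so that weekdays appearing more often in the data are not inflated.

```python
import pandas as pd

# Hypothetical toy counts: two Thursdays and two Fridays at 09:00.
idx = pd.to_datetime([
    "1970-01-01 09:00", "1970-01-08 09:00",  # Thursdays
    "1970-01-02 09:00", "1970-01-09 09:00",  # Fridays
])
counts = pd.DataFrame({"count": [4, 6, 10, 14]}, index=idx)

# Same key columns as in the cell above.
counts["dayofweek"] = counts.index.dayofweek
counts["timeofday"] = counts.index.time

# Average logins for each (day of week, time of day) slot.
profile = counts.groupby(["dayofweek", "timeofday"])["count"].mean()


def expected_logins(ts):
    """Baseline forecast: the historical mean for this weekday and time slot."""
    ts = pd.Timestamp(ts)
    return profile.loc[(ts.dayofweek, ts.time())]
```

For example, `expected_logins("1970-01-15 09:00")` looks up the Thursday 09:00 slot. A baseline like this ignores the upward trend in daily logins noted earlier, so it would tend to under-predict later dates.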

### Part 2 ‑ Experiment and metrics design
The neighboring cities of Gotham and Metropolis have complementary circadian rhythms: on
weekdays, Ultimate Gotham is most active at night, and Ultimate Metropolis is most active
during the day. On weekends, there is reasonable activity in both cities.
However, a toll bridge, with a two-way toll, between the two cities causes driver partners to tend
to be exclusive to each city. The Ultimate managers of city operations for the two cities have
proposed an experiment to encourage driver partners to be available in both cities, by
reimbursing all toll costs.
1. What would you choose as the key measure of success of this experiment in
   encouraging driver partners to serve both cities, and why would you choose this metric?
2. Describe a practical experiment you would design to compare the effectiveness of the
   proposed change in relation to the key measure of success. Please provide details on:
   a. how you will implement the experiment
   b. what statistical test(s) you will conduct to verify the significance of the
      observation
   c. how you would interpret the results and provide recommendations to the city
      operations team along with any caveats.

1. The metric I would use is the proportion of each city's driver logins that occur in the counterpart city. For drivers from Gotham, this would be the proportion of their logins that took place in Metropolis, and vice versa. This metric is a simple population statistic that can be compared between two populations to see whether there is a difference after the proposed experiment is implemented.

2. a. Implementation: First, I would separate the drivers into two populations: those registered in, or primarily logging in from, Gotham, and those primarily logging in from Metropolis. For each population I would calculate, per driver, the proportion of logins that occurred in the sister city. After the toll reimbursement is implemented, I would collect data for a similar period of time and calculate the new proportions for each population.

   b. Statistical test: For each population I would perform a two-sample t-test for the difference of means (of the per-driver proportions) to test whether there is a statistically significant difference between the before and after proportions.

   c. Interpretation: If there is a significant increase, I would call the program a success and recommend that the city operations team implement it. One caveat is that there are several possible outcomes: both populations show a statistically significant difference, neither does, or only one does. Each case would have to be treated differently. For any population that did not show a statistically significant difference, I would suggest devising a different incentive and repeating the experiment.
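The metric and the test described above can be sketched with the standard library alone. Everything here is an assumption made for illustration: the `(driver_id, home_city, login_city)` record layout, the helper names `cross_city_share` and `welch_t`, and the toy numbers. The test shown is Welch's unequal-variance form of the two-sample t-test, with a normal approximation for the p-value; in practice a library routine such as `scipy.stats.ttest_ind(..., equal_var=False)` would normally be used instead.

```python
from collections import defaultdict
from math import sqrt
from statistics import NormalDist, mean, variance


def cross_city_share(records):
    """Per-driver share of logins made outside the driver's home city.

    `records` holds (driver_id, home_city, login_city) tuples, a
    hypothetical schema chosen for this sketch.
    """
    total = defaultdict(int)
    away = defaultdict(int)
    for driver_id, home_city, login_city in records:
        total[driver_id] += 1
        away[driver_id] += login_city != home_city
    return {d: away[d] / total[d] for d in total}


def welch_t(sample_a, sample_b):
    """Welch's t statistic (b minus a) and its approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a, var_b = variance(sample_a), variance(sample_b)  # n-1 denominators
    se2 = var_a / n_a + var_b / n_b
    t = (mean(sample_b) - mean(sample_a)) / sqrt(se2)
    dof = se2 ** 2 / (
        (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
    )
    return t, dof


# Toy login records for two Gotham drivers (made-up data).
example = [
    (1, "Gotham", "Gotham"), (1, "Gotham", "Gotham"),
    (1, "Gotham", "Gotham"), (1, "Gotham", "Metropolis"),
    (2, "Gotham", "Gotham"), (2, "Gotham", "Gotham"),
]
shares = cross_city_share(example)

# Made-up per-driver shares before and after the reimbursement.
before = [0.00, 0.05, 0.10, 0.00, 0.05, 0.10]
after = [0.20, 0.25, 0.30, 0.20, 0.25, 0.30]
t_stat, dof = welch_t(before, after)

# Normal approximation to the two-sided p-value; with this few drivers
# the t distribution with `dof` degrees of freedom should be used instead.
p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))
```

If the same drivers are observed in both periods, a paired test on each driver's change may be a better fit than treating the two periods as independent samples.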

### Part 3 ‑ Predictive modeling
Ultimate is interested in predicting rider retention. To help explore this question, we have
provided a sample dataset of a cohort of users who signed up for an Ultimate account in
January 2014. The data was pulled several months later; we consider a user retained if they
were “active” (i.e. took a trip) in the preceding 30 days.
We would like you to use this data set to help understand what factors are the best predictors
for retention, and offer suggestions to operationalize those insights to help Ultimate.
The data is in the attached file ultimate_data_challenge.json. See below for a detailed
description of the dataset. Please include any code you wrote for the analysis and delete the
dataset when you have finished with the challenge.
1. Perform any cleaning, exploratory analysis, and/or visualizations to use the provided
data for this analysis (a few sentences/plots describing your approach will suffice). What
fraction of the observed users were retained?
2. Build a predictive model to help Ultimate determine whether or not a user will be active
in their 6th month on the system. Discuss why you chose your approach, what
alternatives you considered, and any concerns you have. How valid is your model?
Include any key indicators of model performance.
3. Briefly discuss how Ultimate might leverage the insights gained from the model to
improve its long-term rider retention (again, a few sentences will suffice).

In [12]:
import json

with open('ultimate_data_challenge.json') as f:
    data = json.load(f)
    retention = pd.DataFrame(data)
    
retention.head()

,avg_dist,avg_rating_by_driver,avg_rating_of_driver,avg_surge,city,last_trip_date,phone,signup_date,surge_pct,trips_in_first_30_days,ultimate_black_user,weekday_pct
0,3.67,5.0,4.7,1.1,King's Landing,2014-06-17,iPhone,2014-01-25,15.4,4,True,46.2
1,8.26,5.0,5.0,1.0,Astapor,2014-05-05,Android,2014-01-29,0.0,0,False,50.0
2,0.77,5.0,4.3,1.0,Astapor,2014-01-07,iPhone,2014-01-06,0.0,3,False,100.0
3,2.36,4.9,4.6,1.14,King's Landing,2014-06-29,iPhone,2014-01-10,20.0,9,True,80.0
4,3.13,4.9,4.4,1.19,Winterfell,2014-03-15,Android,2014-01-27,11.8,14,False,82.4
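The retention label defined in the problem statement (active in the 30 days before the data was pulled) can be derived from `last_trip_date`. The sketch below uses only the five rows displayed above, so the resulting fraction describes this small sample and not the full dataset. The pull date is not given, so approximating it by the latest `last_trip_date` is an assumption.

```python
import pandas as pd

# The five last_trip_date values from the rows displayed above.
sample = pd.DataFrame({
    "last_trip_date": [
        "2014-06-17", "2014-05-05", "2014-01-07", "2014-06-29", "2014-03-15",
    ],
})
sample["last_trip_date"] = pd.to_datetime(sample["last_trip_date"])

# ASSUMPTION: the data pull date is approximated by the latest trip date seen.
pull_date = sample["last_trip_date"].max()
cutoff = pull_date - pd.Timedelta(days=30)

# A user counts as retained if their last trip falls in the final 30 days.
sample["retained"] = sample["last_trip_date"] >= cutoff

# Fraction retained within this five-row sample only.
retained_fraction = sample["retained"].mean()
```

The same steps applied to the full `retention` frame would answer the "what fraction were retained" question, and the `retained` column could serve as the target for the predictive model.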


In [17]:
trace = go.Histogram(x=retention.signup_date)

layout = go.Layout(title='Signup Date',
                  yaxis=dict(title='Number of Users'))

fig = go.Figure([trace], layout)

iplot(fig, filename='signups-by-date.html')

The distribution of signups has large repeating spikes on Saturdays, so more users tend to sign up on weekends.
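The weekday pattern can be checked directly by counting signups per day of the week. This minimal sketch uses only the five `signup_date` values from the rows shown earlier, so the counts are illustrative and say nothing about the full distribution.

```python
import pandas as pd

# The five signup_date values from the rows displayed earlier.
signup = pd.to_datetime(pd.Series([
    "2014-01-25", "2014-01-29", "2014-01-06", "2014-01-10", "2014-01-27",
]))

# Number of signups falling on each day of the week.
by_weekday = signup.dt.day_name().value_counts()
```

Running the same two lines on `retention.signup_date` (after converting it with `pd.to_datetime`) would show how much larger the Saturday count is than the other days.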

In [18]:
trace = go.Histogram(x=retention.last_trip_date)

layout = go.Layout(title='Last Trip Date',
                  yaxis=dict(title='Number of Users'))

fig = go.Figure([trace], layout)

iplot(fig, filename='last-trip-dates.html')

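The distribution of last trip dates from January through July is shown above. Many users took their last trip in January, shortly after signing up. The counts then stay fairly flat over the following months until June, when they pick up sharply. Because this is the date of each user's last trip, the June spike most likely reflects users who were still active when the data was pulled, though more people taking rides in summer could also contribute.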

In [19]:
trace = go.Histogram(x=retention.city)

layout = go.Layout(title='City',
                  yaxis=dict(title='Number of Users'))

fig = go.Figure([trace], layout)

iplot(fig, filename='users-by-city.html')

Winterfell has the largest number of users. Presumably most of this traffic was from Northerners trying to make their way to the Wall because they heard that the White Walkers were coming and they wanted to catch a glimpse.

In [16]:
trace = go.Histogram(x=retention.phone)

layout = go.Layout(title='Phone',
                  yaxis=dict(title='Number of Users'))

fig = go.Figure([trace], layout)

iplot(fig, filename='Rides-by-Phone.html')