In [1]:
# importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Part 1 - Exploratory data analysis

The attached logins.json file contains (simulated) timestamps of user logins in a particular geographic location. Aggregate these login counts based on 15-minute time intervals, and visualize and describe the resulting time series of login counts in ways that best characterize the underlying patterns of the demand. Please report/illustrate important features of the demand, such as daily cycles. If there are data quality issues, please report them.


In [2]:
# reading json file
login_time_df = pd.read_json('logins.json')

In [3]:
# Let's take a look at the data
login_time_df.head()

Unnamed: 0,login_time
0,1970-01-01 20:13:18
1,1970-01-01 20:16:10
2,1970-01-01 20:16:37
3,1970-01-01 20:16:36
4,1970-01-01 20:26:21


In [4]:
login_time_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93142 entries, 0 to 93141
Data columns (total 1 columns):
login_time    93142 non-null datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 727.8 KB


In [5]:
login_time_df.shape

(93142, 1)

In [6]:
login_time_df.describe()

Unnamed: 0,login_time
count,93142
unique,92265
top,1970-02-12 11:16:53
freq,3
first,1970-01-01 20:12:16
last,1970-04-13 18:57:38
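
The summary above reports 92265 unique values among 93142 rows, so some timestamps repeat, and the earlier `head()` output lists 20:16:37 before 20:16:36, so the rows are not strictly in time order. Neither is necessarily an error (two users can log in during the same second), but both are worth reporting. A minimal sketch of how the two checks could be run, using a small hand-made frame whose values are illustrative stand-ins rather than the real file:

```python
import pandas as pd

# Toy stand-in for logins.json; these timestamps are illustrative only.
toy = pd.DataFrame({
    'login_time': pd.to_datetime([
        '1970-01-01 20:13:18',
        '1970-01-01 20:16:37',
        '1970-01-01 20:16:36',   # earlier than the row above it
        '1970-01-01 20:16:36',   # exact repeat of the previous row
    ])
})

# Number of rows whose timestamp already appeared earlier in the frame
n_duplicates = int(toy['login_time'].duplicated().sum())

# True only if every timestamp is >= the one before it
is_sorted = bool(toy['login_time'].is_monotonic_increasing)

print(n_duplicates, is_sorted)
```

The same two expressions could be applied to `login_time_df['login_time']` before the index is set.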


In [7]:
# First, we need to set index to datetimeindex to be able to resample
login_time_df = login_time_df.set_index('login_time')
login_time_df.head(5)

1970-01-01 20:13:18
1970-01-01 20:16:10
1970-01-01 20:16:37
1970-01-01 20:16:36
1970-01-01 20:26:21


In [8]:
# We will aggregate these login counts based on 15 minute time intervals
login_time_df['Counter'] = 0
login_time_df = login_time_df.resample('15min').count()

In [9]:
login_time_df = login_time_df.reset_index()

In [10]:
login_time_df.head()

Unnamed: 0,login_time,Counter
0,1970-01-01 20:00:00,2
1,1970-01-01 20:15:00,6
2,1970-01-01 20:30:00,9
3,1970-01-01 20:45:00,7
4,1970-01-01 21:00:00,1
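
One property of this aggregation is that intervals with no logins still appear, with a count of 0, which keeps the time series evenly spaced. The sketch below illustrates this on a few made-up timestamps. It uses `resample(..., on=...)` with `.size()` as an alternative to adding a placeholder `Counter` column; that is a choice made for this illustration, not the approach used in the cell above.

```python
import pandas as pd

# Made-up timestamps; the 20:30 interval is deliberately left empty.
toy = pd.DataFrame({
    'login_time': pd.to_datetime([
        '1970-01-01 20:13:18',
        '1970-01-01 20:16:10',
        '1970-01-01 20:16:37',
        '1970-01-01 20:50:00',
    ])
})

# Count rows per 15-minute bin, then restore login_time as a regular column
counts = (
    toy.resample('15min', on='login_time')
       .size()
       .rename('Counter')
       .reset_index()
)

print(counts)
```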


As shown above, the logins are now aggregated into 15-minute intervals, with `Counter` holding the number of logins in each interval.
<p>Let's plot the series to look for patterns in login times.</p>

In [None]:
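sns.set()  # apply seaborn styling before drawing so it affects this figure
plt.figure(figsize=(20, 10))
# x and y are passed directly, so the data= keyword is not needed
plt.bar(login_time_df.login_time, login_time_df.Counter, color='grey')
plt.xlabel('Login Time')
plt.ylabel('Frequency')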




The plot suggests that 1970-03-01 is the busiest day, but at this resolution it reveals little else.
<p>We can check which day of the week has the most login traffic.</p>

In [None]:
# we use datetime day name function

login_time_df['Day'] = login_time_df.login_time.dt.day_name()
login_time_df

In [None]:
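# sum only the Counter column; sort=False keeps the days in order of first appearance
day_df = login_time_df.groupby('Day', as_index=False, sort=False)['Counter'].sum()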

In [None]:
day_df.sort_values('Counter', ascending=False)

Saturday is the busiest day and Monday is the least busy. Let's create a plot to visualize this.

In [None]:
plt.figure(figsize=(20, 10))
plt.bar(day_df.Day, day_df.Counter, color='grey', width=1)
plt.xlabel('Days')
plt.ylabel('Frequency')
plt.title('WeekDay Plot')
sns.set()
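
Because the grouping keeps the days in order of first appearance, the bars do not run Monday to Sunday. One possible way to put them in calendar order is an ordered categorical, sketched here on invented totals (`day_df_toy` and its numbers are hypothetical, used only to show the mechanics):

```python
import pandas as pd

# Invented weekday totals, listed in a non-calendar order.
day_df_toy = pd.DataFrame({
    'Day': ['Thursday', 'Friday', 'Saturday', 'Sunday',
            'Monday', 'Tuesday', 'Wednesday'],
    'Counter': [40, 55, 70, 65, 30, 32, 35],
})

weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                 'Friday', 'Saturday', 'Sunday']

# An ordered categorical makes sort_values follow calendar order
# instead of alphabetical order.
day_df_toy['Day'] = pd.Categorical(day_df_toy['Day'],
                                   categories=weekday_order, ordered=True)
ordered = day_df_toy.sort_values('Day').reset_index(drop=True)

print(ordered)
```

Note that weekday totals also depend on how many times each weekday occurs in the sampled date range, so dividing by that count may give a fairer comparison.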

We can also check which hour gets the most logins. We will use the `.dt.hour` accessor and group the data by hour to see the hourly totals.

In [None]:
login_time_df['Hour'] = login_time_df.login_time.dt.hour
# sum only the Counter column so the datetime and Day columns are not aggregated
hour_df = login_time_df.groupby('Hour', as_index=False)['Counter'].sum()

In [None]:
hour_df.sort_values('Counter', ascending=False)

Users log in most at 10 PM and least at 7 AM. We can plot this to see the pattern.

In [None]:
plt.figure(figsize=(20, 10))
plt.bar(hour_df.Hour, hour_df.Counter, color='grey', width=1)
plt.xlabel('Hours')
plt.ylabel('Frequency')
plt.title('Hours Plot')
sns.set()
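
The hourly totals pool weekdays and weekends together, which can hide differences in the daily cycle. A possible follow-up, sketched on synthetic data (one week of constant counts, invented for illustration), is to compare the mean logins per 15-minute interval by hour for weekdays and weekends separately. The `IsWeekend` column is a name introduced for this sketch.

```python
import pandas as pd

# Synthetic 15-minute counts covering one full week; 1970-01-05 is a Monday.
idx = pd.date_range('1970-01-05', periods=7 * 96, freq='15min')
toy = pd.DataFrame({'login_time': idx, 'Counter': 1})

toy['Hour'] = toy['login_time'].dt.hour
# dayofweek: Monday=0 ... Sunday=6, so 5 and 6 are the weekend
toy['IsWeekend'] = toy['login_time'].dt.dayofweek >= 5

# Using the mean rather than the sum keeps the two groups comparable,
# since there are five weekdays but only two weekend days.
profile = toy.pivot_table(index='Hour', columns='IsWeekend',
                          values='Counter', aggfunc='mean')

print(profile.head())
```

On the real data, `profile.plot()` would draw one line per group across the 24 hours.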

# Part 2 - Experiment and metrics design


<p>The neighboring cities of Gotham and Metropolis have complementary circadian rhythms: on weekdays, Ultimate Gotham is most active at night, and Ultimate Metropolis is most active during the day. On weekends, there is reasonable activity in both cities.</p>
<p>However, a toll bridge, with a two-way toll, between the two cities causes driver partners to tend to be exclusive to each city. The Ultimate managers of city operations for the two cities have proposed an experiment to encourage driver partners to be available in both cities, by reimbursing all toll costs.</p>

- 1. What would you choose as the key measure of success of this experiment in encouraging driver partners to serve both cities, and why would you choose this metric?
- 2. Describe a practical experiment you would design to compare the effectiveness of the proposed change in relation to the key measure of success. Please provide details on:
    a. how you will implement the experiment
    b. what statistical test(s) you will conduct to verify the significance of the observation
    c. how you would interpret the results and provide recommendations to the city operations team along with any caveats.


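__<p>Answer:</p>__
I would randomly split the driver partners into a control group and a treatment group whose tolls are reimbursed (an A/B test), then check whether total revenue is higher in the treatment group.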
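
As a rough illustration of how such a comparison could be tested for significance, the sketch below runs a one-sided permutation test on simulated per-driver revenue. Everything in it is an assumption made for the example: the group sizes, the revenue distributions, and the choice of a permutation test (a two-sample t-test would be another common option).

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated weekly revenue per driver; all numbers are invented.
control = rng.normal(loc=1000, scale=200, size=500)    # tolls not reimbursed
treatment = rng.normal(loc=1050, scale=200, size=500)  # tolls reimbursed

observed_diff = treatment.mean() - control.mean()

# Permutation test: shuffle the group labels many times and record the
# difference in means that arises when the labels carry no information.
pooled = np.concatenate([control, treatment])
n_control = len(control)
n_perm = 5000
perm_diffs = np.empty(n_perm)
for i in range(n_perm):
    shuffled = rng.permutation(pooled)
    perm_diffs[i] = shuffled[n_control:].mean() - shuffled[:n_control].mean()

# One-sided p-value for "treatment revenue is higher than control";
# the +1 terms keep the estimate strictly above zero.
p_value = (np.sum(perm_diffs >= observed_diff) + 1) / (n_perm + 1)

print(f'observed difference: {observed_diff:.1f}, p-value: {p_value:.4f}')
```

A small p-value would suggest the revenue difference is unlikely to be due to chance alone, though it would not by itself show that drivers actually began serving both cities.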

# Part 3 - Predictive modeling


<p>Ultimate is interested in predicting rider retention. To help explore this question, we have provided a sample dataset of a cohort of users who signed up for an Ultimate account in January 2014. The data was pulled several months later; we consider a user retained if they were “active” (i.e. took a trip) in the preceding 30 days.</p>
<p>We would like you to use this data set to help understand what factors are the best predictors for retention, and offer suggestions to operationalize those insights to help Ultimate.</p>

<p>The data is in the attached file ultimate_data_challenge.json. See below for a detailed description of the dataset. Please include any code you wrote for the analysis and delete the dataset when you have finished with the challenge.</p>

<p>1. Perform any cleaning, exploratory analysis, and/or visualizations to use the provided data for this analysis (a few sentences/plots describing your approach will suffice). What fraction of the observed users were retained?</p>
<p>2. Build a predictive model to help Ultimate determine whether or not a user will be active in their 6th month on the system. Discuss why you chose your approach, what alternatives you considered, and any concerns you have. How valid is your model? Include any key indicators of model performance.</p>

<p>3. Briefly discuss how Ultimate might leverage the insights gained from the model to improve its long-term rider retention (again, a few sentences will suffice).</p>

<p>Data description</p>

- city: city this user signed up in
- phone: primary device for this user
- signup_date: date of account registration; in the form ‘YYYYMMDD’
- last_trip_date: the last time this user completed a trip; in the form ‘YYYYMMDD’
- avg_dist: the average distance in miles per trip taken in the first 30 days after signup
- avg_rating_by_driver: the rider’s average rating over all of their trips
- avg_rating_of_driver: the rider’s average rating of their drivers over all of their trips
- surge_pct: the percent of trips taken with surge multiplier > 1
- avg_surge: the average surge multiplier over all of this user’s trips
- trips_in_first_30_days: the number of trips this user took in the first 30 days after signing up
- ultimate_black_user: TRUE if the user took an Ultimate Black in their first 30 days; FALSE otherwise
- weekday_pct: the percent of the user’s trips occurring during a weekday

In [None]:
# read the json file
df = pd.read_json('ultimate_data_challenge.json')
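
The retention definition above (a trip in the 30 days preceding the data pull) can be turned into a label column. The sketch below shows one way to do that on a few invented rows. The pull date is not stated in the brief, so taking it to be the latest `last_trip_date` in the data is an assumption, and the `retained` column name is introduced here for illustration.

```python
import pandas as pd

# Invented rows using the 'YYYYMMDD' form from the data description.
toy = pd.DataFrame({
    'signup_date': ['20140110', '20140125', '20140103'],
    'last_trip_date': ['20140628', '20140301', '20140701'],
})
toy['last_trip_date'] = pd.to_datetime(toy['last_trip_date'], format='%Y%m%d')

# Assumption: the data was pulled on the latest last_trip_date observed.
pull_date = toy['last_trip_date'].max()

# Retained = last trip falls within the 30 days before the pull date.
toy['retained'] = toy['last_trip_date'] >= pull_date - pd.Timedelta(days=30)

retained_fraction = toy['retained'].mean()
print(toy[['last_trip_date', 'retained']])
print(retained_fraction)
```

The mean of the boolean column gives the fraction of retained users asked for in question 1.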