![logo.png](https://github.com/interviewquery/takehomes/blob/origin/uber_3/uber_3/logo.png?raw=1)


## Part 1 - SQL Syntax

Given the below subset of Uber's schema, write executable SQL queries to answer the questions below. Please answer in a single query for each question and assume read-only access to the database (i.e. do not use CREATE TABLE).

1. For each of the cities 'Qarth' and 'Meereen', calculate the 90th percentile of the difference between Actual and Predicted ETA for all completed trips within the last 30 days.

2. A signup is defined as an event labeled `sign_up_success` within the `events` table. For each city ('Qarth' and 'Meereen') and each day of the week, determine the percentage of signups in the first week of 2016 that resulted in a completed trip within 168 hours of the sign up date.

**Assume a PostgreSQL database, server timezone is UTC.**


Table Name: **`trips`**

|Column Name:|Datatype:|
| :-: | :-: |
|`id`|`integer`|
|`client_id`|`integer` (Foreign keyed to `events.rider_id`)|
|`driver_id`|`integer`|
|`city_id`|`integer` (Foreign keyed to `cities.city_id`)|
|`client_rating`|`integer`|
|`driver_rating`|`integer`|
|`request_at`|`Timestamp with timezone`|
|`predicted_eta`|`integer`|
|`actual_eta`|`integer`|
|`status`|`Enum`(‘`completed`’, ‘`cancelled_by_driver`’, ‘`cancelled_by_client`’)|


Table Name: **`cities`**

|Column Name:|Datatype:|
| :-: | :-: |
|`city_id`|`integer`|
|`city_name`|`string`|



Table Name: **`events`**

|Column Name:|Datatype:|
| :-: | :-: |
|`device_id`|`integer`|
|`rider_id`|`integer`|
|`city_id`|`integer`|
|`event_name`|`Enum`(‘`sign_up_success`’, ‘`attempted_sign_up`’, ‘`sign_up_failure`’)|



## Part 2 - Experiment and metrics design


The Driver Experience team has just finished [redesigning the Uber Partner app](https://newsroom.uber.com/new-partner-app/). The new version expands the purpose of the app beyond just driving. It includes additional information on earnings, ratings, and provides a unified platform for Uber to communicate with its partners.

1. Propose and define the primary success metric of the redesigned app. What are 2-3 additional tracking metrics that will be important to monitor in addition to the success metric defined above?

2. Outline a testing plan to evaluate whether the redesigned app performs better (according to the metrics you outlined). How would you balance the need to deliver quick results with statistical rigor while still monitoring for risks?

3. Explain how you would translate the results from the testing plan into a decision on whether to launch the new design or roll it back.

## Part 3 - Data analysis

Uber's Driver team is interested in predicting which driver signups are most likely to start driving. To help explore this question, we have provided a sample dataset of a cohort of driver signups in January 2015. The data was pulled a few months after they signed up to include the result of whether they actually completed their first trip. It also includes several pieces of background information gathered about the driver and their car.

We would like you to use this data set to help understand what factors are best at predicting whether a signup will start to drive, and offer suggestions to operationalize those insights to help Uber.

See below for a detailed description of the dataset. Please include any code you wrote for the analysis and delete the dataset when you have finished with the challenge. Please also call out any data-related assumptions or issues that you encounter.

1. Perform any cleaning, exploratory analysis, and/or visualizations to use the provided data for this analysis (a few sentences/plots describing your approach will suffice). What fraction of the driver signups took a first trip?

2. Build a predictive model to help Uber determine whether or not a driver signup will start driving. Discuss why you chose your approach, what alternatives you considered, and any concerns you have. How valid is your model? Include any key indicators of model performance.

3. Briefly discuss how Uber might leverage the insights gained from the model to generate more first trips (again, a few ideas/sentences will suffice).



### Data description

**id**: driver_id

**city_id**: city_id this user signed up in

**signup_os**: signup device of the user ("android", "ios", "website", "other")

**signup_channel**: what channel did the driver sign up from ("offline", "paid", "organic", "referral")

**signup_timestamp**: timestamp of account creation; local time in the form 'YYYY/MM/DD'

**bgc_date**: date of background check consent; in the form 'YYYY/MM/DD'

**vehicle_added_date**: date when driver's vehicle information was uploaded; in the form 'YYYY/MM/DD'

**first_trip_date**: date of the first trip as a driver; in the form 'YYYY/MM/DD'

**vehicle_make**: make of vehicle uploaded (e.g. Honda, Ford, Kia)

**vehicle_model**: model of vehicle uploaded (e.g. Accord, Prius, 350z)

**vehicle_year**: year that the car was made; in the form 'YYYY'




Please note that this data is fake and does not represent actual driver signup behavior.



In [None]:
!git clone --branch origin/uber_3 https://github.com/interviewquery/takehomes.git
%cd takehomes/uber_3
!if [[ $(ls *.zip) ]]; then unzip *.zip; fi
!ls

Cloning into 'takehomes'...
remote: Enumerating objects: 1963, done.
remote: Counting objects: 100% (1963/1963), done.
remote: Compressing objects: 100% (1220/1220), done.
remote: Total 1963 (delta 752), reused 1927 (delta 726), pack-reused 0 (from 0)
Receiving objects: 100% (1963/1963), 297.43 MiB | 7.45 MiB/s, done.
Resolving deltas: 100% (752/752), done.
/content/takehomes/uber_3
ls: cannot access '*.zip': No such file or directory
ds_challenge_v2_1_data.csv  logo.png  takehomefile.ipynb




---



# PART 1


1. For each of the cities 'Qarth' and 'Meereen', calculate the 90th percentile of the difference between Actual and Predicted ETA for all completed trips within the last 30 days.



```
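-- Completed trips in the two cities over the last 30 days
WITH recent_trips AS (
    SELECT
        ci.city_name,
        tr.actual_eta - tr.predicted_eta AS eta_difference
    FROM trips AS tr
    INNER JOIN cities AS ci ON tr.city_id = ci.city_id
    WHERE ci.city_name IN ('Qarth', 'Meereen')
      AND tr.status = 'completed'
      AND tr.actual_eta IS NOT NULL
      AND tr.predicted_eta IS NOT NULL
      AND tr.request_at >= NOW() - INTERVAL '30 days'
)
SELECT
    city_name,
    -- 90th percentile of (actual - predicted) ETA per city;
    -- PostgreSQL aliases must be unquoted or double-quoted, never single-quoted
    PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY eta_difference) AS percentile_90th_eta_difference
FROM recent_trips
GROUP BY city_name;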
```
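As a way to sanity-check the query on a small extract, the sketch below reproduces the linear interpolation that `PERCENTILE_CONT` performs. The function name `percentile_cont` and the sample differences are hypothetical and only serve as an illustration.

```python
from math import ceil, floor


def percentile_cont(values, fraction):
    """Continuous percentile: interpolate linearly between neighbouring sorted values."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)  # zero-based fractional row position
    lower, upper = floor(position), ceil(position)
    weight = position - lower
    return ordered[lower] + weight * (ordered[upper] - ordered[lower])


# Hypothetical (actual_eta - predicted_eta) differences for one city
eta_differences = [-2, 0, 1, 1, 3, 4, 5, 7, 9, 12, 20]
print(percentile_cont(eta_differences, 0.9))
```

Running the same function per city on exported rows should agree with the SQL result up to floating-point rounding.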



2. A signup is defined as an event labeled sign_up_success within the events table. For each city ('Qarth' and 'Meereen') and each day of the week, determine the percentage of signups in the first week of 2016 that resulted in a completed trip within 168 hours of the sign up date.



```
-- Assumption: events has an event_timestamp column (it is not listed in the schema)
WITH signups AS (
    SELECT
        e.rider_id,
        c.city_name,
        -- PostgreSQL has no DATEPART; DOW returns 0 (Sunday) to 6 (Saturday)
        EXTRACT(DOW FROM e.event_timestamp) AS day_of_week,
        e.event_timestamp
    FROM events AS e
    INNER JOIN cities AS c ON e.city_id = c.city_id
    WHERE c.city_name IN ('Qarth', 'Meereen')
      AND e.event_name = 'sign_up_success'
      AND e.event_timestamp >= '2016-01-01'
      AND e.event_timestamp < '2016-01-08'
),
completed_trips AS (
    SELECT
        t.client_id,
        t.request_at
    FROM trips AS t
    WHERE t.status = 'completed'
)
SELECT
    s.city_name,
    s.day_of_week,
    -- DISTINCT avoids double counting riders with several qualifying trips;
    -- multiplying by 100.0 first avoids integer division
    100.0 * COUNT(DISTINCT ct.client_id) / COUNT(DISTINCT s.rider_id) AS percentage
FROM signups AS s
LEFT JOIN completed_trips AS ct
    ON s.rider_id = ct.client_id
   AND ct.request_at >= s.event_timestamp
   AND ct.request_at <= s.event_timestamp + INTERVAL '168 hours'
GROUP BY s.city_name, s.day_of_week
ORDER BY s.city_name ASC, s.day_of_week ASC;
```
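The conversion logic the query is aiming for can be checked on toy data with a minimal Python sketch. All rows below are hypothetical, and the function name `conversion_by_city_and_weekday` is made up for illustration. A rider counts as converted once, however many completed trips fall inside the 168-hour window.

```python
from collections import defaultdict
from datetime import datetime, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Hypothetical signups: (rider_id, city_name, signup timestamp)
signups = [
    (1, "Qarth", datetime(2016, 1, 1, 9, 0)),
    (2, "Qarth", datetime(2016, 1, 1, 18, 30)),
    (3, "Meereen", datetime(2016, 1, 4, 12, 0)),
]
# Hypothetical completed trips: (client_id, request_at)
completed_trips = [
    (1, datetime(2016, 1, 3, 8, 0)),   # within 168 hours of signup
    (1, datetime(2016, 1, 5, 8, 0)),   # second trip, must not be double counted
    (2, datetime(2016, 1, 20, 8, 0)),  # outside the 168-hour window
]


def conversion_by_city_and_weekday(signups, trips, window_hours=168):
    """Percentage of signups per (city, weekday) with a completed trip inside the window."""
    trips_by_rider = defaultdict(list)
    for client_id, request_at in trips:
        trips_by_rider[client_id].append(request_at)

    totals = defaultdict(int)
    converted = defaultdict(int)
    for rider_id, city, signed_up_at in signups:
        key = (city, DAY_NAMES[signed_up_at.weekday()])
        totals[key] += 1
        deadline = signed_up_at + timedelta(hours=window_hours)
        if any(signed_up_at <= t <= deadline for t in trips_by_rider[rider_id]):
            converted[key] += 1
    return {key: 100.0 * converted[key] / totals[key] for key in totals}


rates = conversion_by_city_and_weekday(signups, completed_trips)
for (city, day), pct in sorted(rates.items()):
    print(city, day, pct)
```

Note that the sketch labels weekdays by name, whereas a numeric day-of-week in SQL depends on the database's numbering convention.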




# PART 2

The Driver Experience team has just finished redesigning the Uber Partner app. The new version expands the purpose of the app beyond just driving. It includes additional information on earnings, ratings, and provides a unified platform for Uber to communicate with its partners.

Propose and define the primary success metric of the redesigned app. What are 2-3 additional tracking metrics that will be important to monitor in addition to the success metric defined above?

Outline a testing plan to evaluate whether the redesigned app performs better (according to the metrics you outlined). How would you balance the need to deliver quick results with statistical rigor while still monitoring for risks?

Explain how you would translate the results from the testing plan into a decision on whether to launch the new design or roll it back.

First, I would conduct a **Data Analysis** of user engagement to identify whether engagement changed after the release of the new app. The factors I would examine are:
* time spent in the new app
* percentage of users viewing the new features in the app (earnings, ratings)
* frequency of chats with partners
* ride acceptance rates
* user reviews (if any)

Second, I would design and conduct an **A/B Test** to establish whether the new app is successful and whether it is worth extending to the entire user population. These are the steps I would follow:
1. Define the primary success metric (OEC, Overall Evaluation Criterion): the average daily working time spent in the new app. The goal is to measure whether, with the new version, users are more engaged and spend more time in the earnings, ratings, and chat sections.
2. Define the null hypothesis: the average time spent is the same in the Control and Treatment groups.
3. Set the significance level (0.05) and the power (80%), and derive the required size of the two groups from them and from the minimum effect we want to detect.
4. Build the Control and Treatment groups so that they are comparable and representative of the population.
5. Collect the data.
6. Run a Z-test and calculate the p-value.
7. Interpret the results: do we have enough evidence to reject the null hypothesis?
8. Compare the confidence interval with the practical decision boundaries to evaluate whether it is worth launching the new design to the entire user population.
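
Steps 3, 6, and 8 above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the standard deviation of 20 minutes, the minimum detectable effect of 2 minutes, and the group means are hypothetical numbers, and the helper names are made up for this example.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(sigma, mde, alpha=0.05, power=0.80):
    """Users needed per group to detect a mean difference `mde` with a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sigma / mde) ** 2)


def two_sample_z_test(mean_c, mean_t, sd_c, sd_t, n_c, n_t, alpha=0.05):
    """Return z, the two-sided p-value, and the confidence interval for mean_t - mean_c."""
    diff = mean_t - mean_c
    se = sqrt(sd_c ** 2 / n_c + sd_t ** 2 / n_t)
    z = diff / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se
    return z, p_value, (diff - margin, diff + margin)


# Hypothetical inputs: daily minutes in the app, sd = 20, smallest effect of interest = 2
n = sample_size_per_group(sigma=20, mde=2)
z, p_value, ci = two_sample_z_test(mean_c=60, mean_t=62, sd_c=20, sd_t=20, n_c=n, n_t=n)
print(n, round(z, 2), round(p_value, 4), ci)
```

For the launch decision, the lower bound of the confidence interval would be compared with the practical boundary: launching is easier to justify when the whole interval lies above the smallest improvement that matters to the business.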

In addition to the success metric, I would monitor the following tracking metrics:
* average time spent in the new sections of the app (earnings, ratings, chats with partners)
* how often users view their earnings and ratings
* number of chats opened with partners
* number of users, to check that it stays stable or increases; a decrease would suggest that the new app complicates the user experience and that users are leaving for a better app
* ride acceptance rate








# PART 3

In [None]:
import itertools
import os
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import seaborn as sns
import statsmodels.api as sma
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, root_mean_squared_error
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler
from statsmodels.tools.sm_exceptions import ConvergenceWarning
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller  # Augmented Dickey-Fuller test
from xgboost import XGBRegressor, cv

# Show every row and column when displaying DataFrames
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Suppress convergence warnings for cleaner output
warnings.simplefilter('ignore', ConvergenceWarning)

os.listdir()

['logo.png', 'takehomefile.ipynb', 'ds_challenge_v2_1_data.csv']

## Read data

In [None]:
# read data
df = pd.read_csv("ds_challenge_v2_1_data.csv")

# print
print(df.shape)
df.head()

(54681, 11)




| |id|city_name|signup_os|signup_channel|signup_date|bgc_date|vehicle_added_date|vehicle_make|vehicle_model|vehicle_year|first_completed_date|
| :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: | :-: |
|0|1|Strark|ios web|Paid|1/2/16| | | | | | |
|1|2|Strark|windows|Paid|1/21/16| | | | | | |
|2|3|Wrouver|windows|Organic|1/11/16|1/11/16| | | | | |
|3|4|Berton|android web|Referral|1/29/16|2/3/16|2/3/16|Toyota|Corolla|2016.0|2/3/16|
|4|5|Strark|android web|Referral|1/10/16|1/25/16|1/26/16|Hyundai|Sonata|2016.0| |
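
As a first step towards the question "What fraction of the driver signups took a first trip?", the sketch below derives the target from `first_completed_date` using only the five rows displayed above. It assumes that an empty `first_completed_date` means the driver never completed a trip and that dates are in month/day/two-digit-year form, as the displayed rows suggest. The helper `parse_date` is made up for this example.

```python
import csv
import io
from datetime import datetime

# The five rows shown by df.head(), without the index column
SAMPLE_CSV = """id,city_name,signup_os,signup_channel,signup_date,bgc_date,vehicle_added_date,vehicle_make,vehicle_model,vehicle_year,first_completed_date
1,Strark,ios web,Paid,1/2/16,,,,,,
2,Strark,windows,Paid,1/21/16,,,,,,
3,Wrouver,windows,Organic,1/11/16,1/11/16,,,,,
4,Berton,android web,Referral,1/29/16,2/3/16,2/3/16,Toyota,Corolla,2016.0,2/3/16
5,Strark,android web,Referral,1/10/16,1/25/16,1/26/16,Hyundai,Sonata,2016.0,
"""


def parse_date(text):
    """Parse 'M/D/YY' strings; empty cells become None."""
    return datetime.strptime(text, "%m/%d/%y").date() if text else None


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for row in rows:
    row["signup_date"] = parse_date(row["signup_date"])
    row["first_completed_date"] = parse_date(row["first_completed_date"])
    # Target: did this signup ever complete a first trip?
    row["took_first_trip"] = row["first_completed_date"] is not None

fraction = sum(row["took_first_trip"] for row in rows) / len(rows)
print(f"{fraction:.0%} of the sample rows took a first trip")
```

On the full DataFrame loaded above, the equivalent pandas expression would be `df["first_completed_date"].notna().mean()`.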
