In [1]:
# Import necessary libraries
import pandas as pd
import json
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

with open('ultimate_data_challenge.json') as f:
    data = json.load(f)
df = pd.DataFrame(data)

# Display the first few rows of the dataset to understand its structure
df.head()


Unnamed: 0,city,trips_in_first_30_days,signup_date,avg_rating_of_driver,avg_surge,last_trip_date,phone,surge_pct,ultimate_black_user,weekday_pct,avg_dist,avg_rating_by_driver
0,King's Landing,4,2014-01-25,4.7,1.1,2014-06-17,iPhone,15.4,True,46.2,3.67,5.0
1,Astapor,0,2014-01-29,5.0,1.0,2014-05-05,Android,0.0,False,50.0,8.26,5.0
2,Astapor,3,2014-01-06,4.3,1.0,2014-01-07,iPhone,0.0,False,100.0,0.77,5.0
3,King's Landing,9,2014-01-10,4.6,1.14,2014-06-29,iPhone,20.0,True,80.0,2.36,4.9
4,Winterfell,14,2014-01-27,4.4,1.19,2014-03-15,Android,11.8,False,82.4,3.13,4.9


In [2]:
# Data Cleaning
# Fill missing values
df['avg_rating_of_driver'] = df['avg_rating_of_driver'].fillna(df['avg_rating_of_driver'].median())
df['avg_rating_by_driver'] = df['avg_rating_by_driver'].fillna(df['avg_rating_by_driver'].median())
df['phone'] = df['phone'].fillna(df['phone'].mode()[0])

In [3]:
# Verify that there are no more missing values
missing_values_after_cleaning = df.isnull().sum()

# Display the cleaned data
missing_values_after_cleaning, df.head()


(city                      0
 trips_in_first_30_days    0
 signup_date               0
 avg_rating_of_driver      0
 avg_surge                 0
 last_trip_date            0
 phone                     0
 surge_pct                 0
 ultimate_black_user       0
 weekday_pct               0
 avg_dist                  0
 avg_rating_by_driver      0
 dtype: int64,
              city  trips_in_first_30_days signup_date  avg_rating_of_driver  \
 0  King's Landing                       4  2014-01-25                   4.7   
 1         Astapor                       0  2014-01-29                   5.0   
 2         Astapor                       3  2014-01-06                   4.3   
 3  King's Landing                       9  2014-01-10                   4.6   
 4      Winterfell                      14  2014-01-27                   4.4   
 
    avg_surge last_trip_date    phone  surge_pct  ultimate_black_user  \
 0       1.10     2014-06-17   iPhone       15.4                 True   
 1       

In [4]:
# Convert date columns to datetime
df['signup_date'] = pd.to_datetime(df['signup_date'])
df['last_trip_date'] = pd.to_datetime(df['last_trip_date'])


In [5]:
# Define retention (active in the last 30 days from the latest date in the dataset)
last_trip_cutoff_date = df['last_trip_date'].max() - pd.Timedelta(days=30)
df['retained'] = df['last_trip_date'] >= last_trip_cutoff_date
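
To make the cutoff rule concrete, here is a minimal sketch on a handful of made-up dates. The `toy` frame and its values are assumptions for illustration only, not rows from the dataset.

```python
import pandas as pd

# Synthetic last-trip dates, for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    "last_trip_date": pd.to_datetime(
        ["2014-07-01", "2014-06-01", "2014-05-31", "2014-01-07"]
    )
})

# Same rule as the cell above: the cutoff is 30 days before the latest trip date
cutoff = toy["last_trip_date"].max() - pd.Timedelta(days=30)
toy["retained"] = toy["last_trip_date"] >= cutoff

print(cutoff.date())
print(toy["retained"].tolist())
```

Because the comparison uses `>=`, a user whose last trip falls exactly on the cutoff date counts as retained.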


In [6]:
# Convert categorical variables into dummy/indicator variables
df = pd.get_dummies(df, columns=['city', 'phone', 'ultimate_black_user'], drop_first=True)


In [7]:
# Define features (X) and target variable (y)
X = df.drop(['signup_date', 'last_trip_date', 'retained'], axis=1)
y = df['retained']

In [8]:
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)


In [9]:
# Initialize and train the logistic regression model
logistic_model = LogisticRegression(max_iter=1000, random_state=42)
logistic_model.fit(X_train, y_train)


In [10]:
# Make predictions
y_pred = logistic_model.predict(X_test)


In [11]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)


In [12]:
# Display results
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:\n", classification_rep)
print("\nConfusion Matrix:\n", conf_matrix)

Accuracy: 0.7141

Classification Report:
               precision    recall  f1-score   support

       False       0.73      0.85      0.79      9359
        True       0.66      0.49      0.56      5641

    accuracy                           0.71     15000
   macro avg       0.70      0.67      0.68     15000
weighted avg       0.71      0.71      0.70     15000


Confusion Matrix:
 [[7942 1417]
 [2872 2769]]


In [None]:
1. Chosen Model: Logistic Regression
Why Logistic Regression?

Simplicity and Interpretability: Logistic regression is straightforward and provides interpretable coefficients that help identify the impact of each feature on retention.
Binary Classification: Since we are predicting whether a user is retained (Yes/No), logistic regression is well-suited for binary classification problems.
Baseline Performance: It's a good starting point, allowing us to establish baseline performance before moving to more complex models.
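
As a rough illustration of what "interpretable coefficients" means in practice, the sketch below fits a logistic regression on synthetic data and converts the coefficients to odds ratios. The data, the names `feature_a` and `feature_b`, and the effect sizes are all assumptions made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data for illustration only: feature_a raises the odds of the
# positive class, feature_b lowers them. Both names are hypothetical.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({
    "feature_a": rng.normal(size=2000),
    "feature_b": rng.normal(size=2000),
})
log_odds = 1.5 * X_demo["feature_a"] - 1.5 * X_demo["feature_b"]
y_demo = (log_odds + rng.logistic(size=2000)) > 0

demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# exp(coefficient) is the multiplicative change in the odds of the positive
# class for a one-unit increase in that feature, holding the others fixed.
odds_ratios = pd.Series(np.exp(demo_model.coef_[0]), index=X_demo.columns)
print(odds_ratios.sort_values(ascending=False))
```

The same pattern should apply to the fitted model in this notebook using `logistic_model.coef_[0]` and `X.columns`. Since the features here are on different scales, the coefficients are only directly comparable after standardizing the inputs.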
2. Alternatives Considered
Decision Trees / Random Forests: These models could capture non-linear relationships and interactions between features. They’re more flexible but can be harder to interpret and may require tuning to prevent overfitting.
Gradient Boosting (e.g., XGBoost, LightGBM): These models often outperform others on structured data, but they can be computationally expensive and require hyperparameter tuning.
Neural Networks: While powerful, they are not ideal for this relatively small, structured dataset and are harder to interpret.
Given the initial exploration, logistic regression offers a good balance between performance and interpretability, making it appropriate for our use case.
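
One way to check this trade-off is to compare the baseline against an alternative under cross-validation. The sketch below does so on synthetic data from `make_classification`. The data, the choice of 50 trees, and the 3-fold setup are assumptions for illustration, and no such comparison was run in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the feature matrix and target (illustration only)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=4, random_state=42
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Mean F1 for the positive class across 3 folds, one entry per candidate
cv_f1 = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=3, scoring="f1")
    cv_f1[name] = scores.mean()
    print(f"{name}: mean F1 = {cv_f1[name]:.3f}")
```

On the real data, `X_demo` and `y_demo` would be replaced by `X_train` and `y_train` so that the test set stays untouched during model selection.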

3. Concerns and Limitations
Assumption of Linearity: Logistic regression assumes a linear relationship between the features and the log-odds of retention, so it may miss non-linear effects and feature interactions unless they are added as explicit terms.
Imbalanced Classes: The classes are moderately imbalanced: 5,641 of the 15,000 test users (about 38%) are retained, and the stratified split keeps the same ratio in the training set. This likely contributes to the low recall on retained users. It can be addressed with techniques like resampling or adjusting class weights.
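
The class-weight adjustment can be sketched as follows, using scikit-learn's `class_weight="balanced"` option. The data is synthetic and its roughly 62/38 class split is an assumption chosen for illustration, so the numbers it prints say nothing about the results on the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with a roughly 62/38 class split (illustration only)
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=8, weights=[0.62, 0.38], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# Unweighted model versus one that reweights samples inversely to class frequency
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_weighted = recall_score(y_te, weighted.predict(X_te))
print(f"recall (unweighted): {recall_plain:.3f}")
print(f"recall (balanced):   {recall_weighted:.3f}")
```

Reweighting typically trades some precision for recall on the minority class, so both metrics should be checked before adopting it.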
4. Model Performance and Validity
The logistic regression model achieved an accuracy of 71.41% on the test set. Here are the detailed performance metrics:

Precision for Retained Users (Positive Class): 66%
Recall for Retained Users: 49%
F1-Score for Retained Users: 56%
The confusion matrix was as follows:

True Negatives (Correctly identified as not retained): 7,942
False Positives (Incorrectly identified as retained): 1,417
False Negatives (Incorrectly identified as not retained): 2,872
True Positives (Correctly identified as retained): 2,769
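
The reported metrics can be re-derived from the confusion matrix printed by the evaluation cell. The short calculation below uses those printed counts directly. The majority-class baseline is an extra reference point, added here as an assumption about what a useful comparison looks like.

```python
# Counts from the printed confusion matrix
# (rows = actual class, columns = predicted class)
tn, fp = 7942, 1417
fn, tp = 2872, 2769

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of users predicted retained, share truly retained
recall = tp / (tp + fn)      # of truly retained users, share the model found
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of always predicting the majority class ("not retained")
majority_baseline = (tn + fp) / total

print(f"accuracy={accuracy:.4f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f} baseline={majority_baseline:.4f}")
```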
Model Validity: The model beats the majority-class baseline (always predicting "not retained" would score 62.4% accuracy on this test set), but a recall of 49% means it misses about half of the retained users. Further tuning or more flexible models (e.g., Random Forest or Gradient Boosting) could improve recall, which matters if the goal is to capture as many retained users as possible.
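
One low-cost form of tuning is to lower the decision threshold applied to the predicted probabilities. The sketch below shows the idea on synthetic data. The data and the 0.35 threshold are arbitrary assumptions for illustration, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with a roughly 62/38 class split (illustration only)
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=8, weights=[0.62, 0.38], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Probability of the positive class for each test row
proba = model.predict_proba(X_te)[:, 1]

# Compare the default 0.5 threshold with a lower, illustrative one
results = {}
for threshold in (0.5, 0.35):
    pred = (proba >= threshold).astype(int)
    results[threshold] = (precision_score(y_te, pred), recall_score(y_te, pred))
    print(f"threshold={threshold}: precision={results[threshold][0]:.3f} "
          f"recall={results[threshold][1]:.3f}")
```

Lowering the threshold can only add positive predictions, so recall never decreases, while precision usually drops. In practice the threshold would be chosen on a validation split rather than on the test set.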