# Megaline Subscriber Behavior Model

This notebook develops a classification model that predicts which plan (Smart or Ultra) fits a subscriber, based on their usage data. It explores the data, splits it into training, validation, and test sets, trains several models with hyperparameter tuning, and selects the best one by validation accuracy. The selected model is then evaluated on the test set and compared against a majority-class baseline as a sanity check.

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
# Load data from a local CSV file
data_path = "/workspaces/Subscriber-Behavior/users_behavior.csv"
df = pd.read_csv(data_path)

In [3]:
# Quick look at the data (optional)
print("First few rows:")
print(df.head())
print("\nData Info:")
df.info()  # info() prints its own summary and returns None, so it needs no print() wrapper

First few rows:
   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0

Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [4]:
# Our features are: calls, minutes, messages, mb_used
# The target is is_ultra (Ultra = 1, Smart = 0)

# Split the data into train, validation, and test sets.
# First, split off the test set (20% of data) using stratification.
train_val, test = train_test_split(df, test_size=0.20, random_state=54321, stratify=df['is_ultra'])

In [5]:
# Then, split the remaining 80% into training (60% total) and validation (20% total).
train, val = train_test_split(train_val, test_size=0.25, random_state=54321, stratify=train_val['is_ultra'])
# Now: train ~60%, val ~20%, test ~20%

# Separate features and target for each split
features_train = train.drop('is_ultra', axis=1)
target_train = train['is_ultra']

features_val = val.drop('is_ultra', axis=1)
target_val = val['is_ultra']

features_test = test.drop('is_ultra', axis=1)
target_test = test['is_ultra']

In [6]:
# Initialize variables to track the best model
best_model = None
best_val_acc = 0
best_model_name = None
best_params = {}

In [7]:
## 1. Logistic Regression
lr_model = LogisticRegression(max_iter=1000, random_state=54321)
lr_model.fit(features_train, target_train)
lr_val_pred = lr_model.predict(features_val)
lr_acc = accuracy_score(target_val, lr_val_pred)
if lr_acc > best_val_acc:
    best_model = lr_model
    best_val_acc = lr_acc
    best_model_name = "LogisticRegression"
    best_params = lr_model.get_params()

In [8]:
## 2. Decision Tree (tuning max_depth from 1 to 10)
for depth in range(1, 11):
    dt_model = DecisionTreeClassifier(max_depth=depth, random_state=54321)
    dt_model.fit(features_train, target_train)
    dt_val_pred = dt_model.predict(features_val)
    dt_acc = accuracy_score(target_val, dt_val_pred)
    if dt_acc > best_val_acc:
        best_model = dt_model
        best_val_acc = dt_acc
        best_model_name = "DecisionTreeClassifier"
        best_params = {"max_depth": depth}

In [9]:
## 3. Random Forest (tuning number of trees and max_depth)
for n_estimators in [10, 50, 100]:
    for depth in range(2, 11):
        rf_model = RandomForestClassifier(n_estimators=n_estimators, max_depth=depth, random_state=54321)
        rf_model.fit(features_train, target_train)
        rf_val_pred = rf_model.predict(features_val)
        rf_acc = accuracy_score(target_val, rf_val_pred)
        if rf_acc > best_val_acc:
            best_model = rf_model
            best_val_acc = rf_acc
            best_model_name = "RandomForestClassifier"
            best_params = {"n_estimators": n_estimators, "max_depth": depth}

In [10]:
print("Best model on the validation set:")
print("Model:", best_model_name, "with parameters:", best_params)
print("Validation accuracy:", best_val_acc)

Best model on the validation set:
Model: RandomForestClassifier with parameters: {'n_estimators': 100, 'max_depth': 8}
Validation accuracy: 0.8227060653188181


In [11]:
# Evaluate the chosen best model on the test set
test_pred = best_model.predict(features_test)
test_acc = accuracy_score(target_test, test_pred)
print("\nTest accuracy:", test_acc)


Test accuracy: 0.8133748055987559


In [12]:
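# Sanity check: compare with a baseline that always predicts the majority class.
# The majority class comes from the training labels, so the baseline never looks at test labels.
majority_class = target_train.mode()[0]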
baseline_pred = np.full(shape=len(target_test), fill_value=majority_class)
baseline_acc = accuracy_score(target_test, baseline_pred)
print("Baseline accuracy (majority class):", baseline_acc)

Baseline accuracy (majority class): 0.6936236391912908


### Explanation and Findings

#### Data Splitting:
We used a stratified split to ensure that each subset has a similar proportion of the two plans. The splits were approximately 60% for training, 20% for validation, and 20% for testing.
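
The effect of stratification can be checked directly. The following is a minimal sketch, not part of the original notebook: it uses a synthetic stand-in DataFrame (`demo`, with an assumed Ultra share of roughly 30%) instead of `users_behavior.csv`, applies the same two-step split, and reports each subset's share of rows and Ultra rate. The names `demo`, `parts`, and `split_report` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption: roughly 30% Ultra subscribers)
rng = np.random.default_rng(0)
n_rows = 1000
demo = pd.DataFrame({
    "calls": rng.integers(0, 200, n_rows),
    "is_ultra": (rng.random(n_rows) < 0.3).astype(int),
})

# Same two-step stratified split as the notebook: 20% test, then 25% of the remainder
train_val, test = train_test_split(
    demo, test_size=0.20, random_state=54321, stratify=demo["is_ultra"]
)
train, val = train_test_split(
    train_val, test_size=0.25, random_state=54321, stratify=train_val["is_ultra"]
)

# Share of rows and share of Ultra subscribers in each subset
parts = {"train": train, "val": val, "test": test}
split_report = pd.DataFrame({
    "share_of_rows": {name: len(part) / len(demo) for name, part in parts.items()},
    "ultra_rate": {name: part["is_ultra"].mean() for name, part in parts.items()},
})
print(split_report)
```

If the split is stratified correctly, the `ultra_rate` column should be nearly identical across the three rows, and `share_of_rows` should read 0.6, 0.2, and 0.2.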

#### Model Selection:

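1. We started with a logistic regression model as a first, simple candidate.
2. We then trained decision trees, varying the maximum depth from 1 to 10.
3. Finally, we trained random forests with different numbers of trees (10, 50, 100) and maximum depths (2 to 10).

Throughout the search we kept only the model with the highest validation accuracy. The winner was a random forest with 100 trees and a maximum depth of 8 (validation accuracy of about 0.823). Deeper trees can overfit the training data, and none of the deeper settings scored higher on the validation set, so the best depth sits below the maximum tried. Adding trees to a random forest mainly reduces variance rather than causing overfitting, which is consistent with the largest forest tried performing best.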
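
To see where overfitting sets in, it helps to compare training and validation accuracy side by side as depth grows. The sketch below is illustrative only: it uses synthetic data from `make_classification` (with label noise added through `flip_y`) rather than the Megaline data, and the depth grid and variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary classification problem (assumption: stands in for the real data)
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    flip_y=0.2, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Record (depth, training accuracy, validation accuracy); None means unlimited depth
depth_scores = []
for depth in [1, 2, 4, 8, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    depth_scores.append(
        (depth, tree.score(X_train, y_train), tree.score(X_val, y_val))
    )

for depth, train_acc, val_acc in depth_scores:
    print(f"max_depth={depth}: train={train_acc:.3f}, val={val_acc:.3f}")
```

A widening gap between the two columns as depth increases is the usual sign that the extra depth is fitting noise rather than signal.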

#### Evaluation:
Once the best model was identified on the validation set, we evaluated it on the test set, where it reached an accuracy of about 0.813. As a sanity check, we compared this with a baseline that always predicts the majority class, which scores about 0.694 on the same test set. The model therefore improves on the trivial baseline by roughly 12 percentage points.
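
A majority-class baseline can also be built with scikit-learn's `DummyClassifier`, which learns the most frequent class from the training labels only. This is a minimal sketch with hypothetical labels (about 70% Smart, 30% Ultra), not the notebook's actual data; the placeholder feature arrays are needed only because the estimator API expects them.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 0 = Smart, 1 = Ultra
y_train = np.array([0] * 70 + [1] * 30)
y_test = np.array([0] * 14 + [1] * 6)

# The dummy model ignores feature values, so placeholder columns are enough
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Learn the majority class from the training labels, then score on the test labels
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))
print("Baseline accuracy:", baseline_acc)
```

Because the baseline is fitted on training labels, it is held to the same rule as the real models: nothing about the test set is used before evaluation.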

#### Threshold:
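The goal was to achieve an accuracy of at least 0.75 on the test set. With a test accuracy of about 0.813, the selected model meets this threshold. Had it fallen short, the next steps would have been to widen the hyperparameter search, try additional models, or engineer new features.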