# Utilizing Machine Learning to Optimize Subscriber Plan
## Overview and Project Setup
Mobile carrier Megaline has discovered that many of its subscribers are still on legacy plans, even though newer plans offer enhanced features. The challenge lies in understanding subscriber behavior to effectively recommend one of the new plans, thereby increasing customer satisfaction and operational efficiency.

### Data Exploration:
(1) Load the dataset and perform an initial exploration to understand the distributions, central tendencies, and any anomalies in the data.

(2) Verify data quality and check for missing values or outliers.
### Data Splitting:
(1) Split the data into training (60%), validation (20%), and test (20%) sets using two successive random splits with a fixed random seed.
### Model Development and Hyperparameter Tuning:
(1) Experiment with different models: a Decision Tree, a Random Forest ensemble, and a Linear Regression baseline whose continuous output is thresholded at 0.5.

(2) Tune key hyperparameters (max_depth for the tree, n_estimators for the forest) by looping over candidate values and scoring each one on the validation set.

(3) Compare validation accuracy across models and check it against the 75% accuracy target.
### Final Model Evaluation:
(1) Once an optimal model is identified, conduct a final evaluation on the test set to confirm its generalization performance.

(2) Perform additional sanity checks to understand the model's behavior and robustness.

## Imports and Load Data

In [2]:
# Imports
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

In [3]:
# Load Data
data = pd.read_csv('/datasets/users_behavior.csv')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


### Describe Data:
Each record represents one subscriber's activity for a given month: calls, call minutes, messages, and internet traffic in MB (`mb_used`), plus the target `is_ultra`, which marks whether the subscriber is on the Ultra plan rather than Smart. The dataset has 3214 rows and no missing values. All features are numerical: the four usage columns are floating-point and the target is an integer. Because the data covers voice, messaging, and data usage, it supports modeling which of the two plans, Smart or Ultra, fits a subscriber's behavior.
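
The exploration step can be sketched roughly as follows. This is a minimal illustration, not the notebook's own code: the `demo` frame is random synthetic data that only borrows the column names from `users_behavior.csv`, and `explore` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (assumption): same column names as users_behavior.csv,
# but the values are random and say nothing about real subscribers.
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "calls": rng.poisson(60, n).astype(float),
    "minutes": rng.gamma(4.0, 100.0, n),
    "messages": rng.poisson(40, n).astype(float),
    "mb_used": rng.gamma(5.0, 3500.0, n),
    "is_ultra": rng.binomial(1, 0.3, n),
})


def explore(df, target="is_ultra"):
    """Return summary statistics, missing-value counts, and class shares."""
    summary = df.describe().T                      # count, mean, std, quartiles
    missing = df.isna().sum()                      # missing values per column
    class_share = df[target].value_counts(normalize=True).sort_index()
    return summary, missing, class_share


summary, missing, class_share = explore(demo)
print(summary[["mean", "std", "min", "max"]])
print(missing)
print(class_share)
```

On the real data, the same helper could be called as `explore(data)`. The class shares are worth a look before modeling, since they determine what accuracy a trivial majority-class predictor would reach.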

## Split Data

In [4]:
# Split
features = data.drop(['is_ultra'], axis=1)
target = data['is_ultra']

# 80% training / val and 20% test
features_train_val, features_test, target_train_val, target_test = train_test_split(
    features, 
    target,
    test_size=0.20,
    random_state=12345
)

# 60% Training and 20% val
features_train, features_valid, target_train, target_valid = train_test_split(
    features_train_val,
    target_train_val, 
    test_size=0.25, 
    random_state=12345
)

print("Training Size:", features_train.shape[0], 3214 * 0.6)
print("Validation Size:", features_valid.shape[0], 3214 * 0.2)
print("Test Size:", features_test.shape[0], 3214 * 0.2)

Training Size: 1928 1928.3999999999999
Validation Size: 643 642.8000000000001
Test Size: 643 642.8000000000001


### Summary:
The dataset, consisting of 3214 records, was split into three subsets to ensure that the model can be properly trained, validated, and tested. First, 80% of the data was separated into a combined training and validation set while reserving the remaining 20% as a test set. Then, the training and validation set was further divided into 75% for training and 25% for validation. This results in an overall split of 60% training data (approximately 1928 records), 20% validation data (around 643 records), and 20% test data (about 643 records). This approach ensures that the model is trained on a large portion of the data while having sufficient separate subsets to tune the model parameters and evaluate its generalization performance.
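
The split above does not pass `stratify`, so the share of Ultra subscribers can differ slightly between the three subsets. A stratified variant might look like the sketch below. It runs on synthetic data, and `stratified_three_way_split` is a hypothetical helper name, not something from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1000 rows, roughly 30% positives.
rng = np.random.default_rng(0)
n = 1000
demo_features = pd.DataFrame({
    "calls": rng.poisson(60, n).astype(float),
    "mb_used": rng.gamma(5.0, 3500.0, n),
})
demo_target = pd.Series(rng.binomial(1, 0.3, n), name="is_ultra")


def stratified_three_way_split(X, y, seed=12345):
    """Split into 60% train, 20% validation, 20% test, preserving class shares."""
    # First cut: hold out 20% as the test set.
    X_tv, X_test, y_tv, y_test = train_test_split(
        X, y, test_size=0.20, random_state=seed, stratify=y
    )
    # Second cut: 25% of the remaining 80% becomes the validation set.
    X_train, X_valid, y_train, y_valid = train_test_split(
        X_tv, y_tv, test_size=0.25, random_state=seed, stratify=y_tv
    )
    return (X_train, y_train), (X_valid, y_valid), (X_test, y_test)


(train_X, train_y), (valid_X, valid_y), (test_X, test_y) = stratified_three_way_split(
    demo_features, demo_target
)
for name, part in [("train", train_y), ("valid", valid_y), ("test", test_y)]:
    print(name, len(part), round(part.mean(), 3))
```

With stratification, the positive-class share printed for each subset should be nearly identical, which makes validation and test accuracy easier to compare.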

## Decision Tree Model

In [5]:
# Decision Tree
best_accuracy = 0
best_depth = 0
for depth in range(1,6):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid)
    accuracy = accuracy_score(target_valid, predictions_valid)
    print('max_depth =', depth, ': ', accuracy)

    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_depth = depth

print()
print('Best Parameters')
print('max_depth =', best_depth, ':', best_accuracy)

max_depth = 1 :  0.7387247278382582
max_depth = 2 :  0.7573872472783826
max_depth = 3 :  0.7651632970451011
max_depth = 4 :  0.7636080870917574
max_depth = 5 :  0.7589424572317263

Best Parameters
max_depth = 3 : 0.7651632970451011


### Summary:
Among the depths tested (1 through 5), the Decision Tree performed best at a max_depth of 3, with a validation accuracy of 76.52%. Deeper trees (depths 4 and 5) scored slightly lower, so the extra complexity did not improve generalization on this validation split.
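
One way to see the trade-off between complexity and generalization is to record training accuracy next to validation accuracy for each depth. The sketch below does this on synthetic data from `make_classification` (an assumption, not the Megaline data); `depth_sweep` is a hypothetical helper.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): four features, binary target.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)


def depth_sweep(depths):
    """Return (depth, train accuracy, validation accuracy) for each depth."""
    rows = []
    for depth in depths:
        tree = DecisionTreeClassifier(random_state=12345, max_depth=depth)
        tree.fit(X_train, y_train)
        train_acc = accuracy_score(y_train, tree.predict(X_train))
        valid_acc = accuracy_score(y_valid, tree.predict(X_valid))
        rows.append((depth, train_acc, valid_acc))
    return rows


results = depth_sweep(range(1, 11))
for depth, train_acc, valid_acc in results:
    print(f"max_depth={depth:2d}  train={train_acc:.3f}  valid={valid_acc:.3f}")

# Pick the depth with the highest validation accuracy.
best_depth, _, best_valid = max(results, key=lambda row: row[2])
print("best depth:", best_depth, "validation accuracy:", round(best_valid, 3))
```

A widening gap between training and validation accuracy at larger depths is the usual sign of overfitting.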

## Random Forest Model

In [6]:
# Random Forest
best_accuracy = 0
best_est = 0
depth = 3

for est in range(10, 51, 10):
    model = RandomForestClassifier(random_state=12345, n_estimators=est, max_depth=depth)
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid)
    accuracy = accuracy_score(target_valid, predictions_valid)
    
    print('Estimators : ', est, ': ', accuracy)
    
    if accuracy > best_accuracy:
        best_est = est
        best_accuracy = accuracy
        
print()
print('Best Parameters')
print('Estimators:', best_est, ':', best_accuracy)

Estimators :  10 :  0.7713841368584758
Estimators :  20 :  0.7713841368584758
Estimators :  30 :  0.7698289269051322
Estimators :  40 :  0.7713841368584758
Estimators :  50 :  0.7667185069984448

Best Parameters
Estimators: 10 : 0.7713841368584758


In [7]:
# Random Forest 2
best_accuracy = 0
best_est = 0
depth = 3

for est in range(1, 21):
    model = RandomForestClassifier(random_state=12345, n_estimators=est, max_depth=depth)
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid)
    accuracy = accuracy_score(target_valid, predictions_valid)
    
    print('Estimators : ', est, ': ', accuracy)
    
    if accuracy > best_accuracy:
        best_est = est
        best_accuracy = accuracy
        
print()
print('Best Parameters')
print('Estimators:', best_est, ':', best_accuracy)

Estimators :  1 :  0.7573872472783826
Estimators :  2 :  0.7573872472783826
Estimators :  3 :  0.76049766718507
Estimators :  4 :  0.76049766718507
Estimators :  5 :  0.76049766718507
Estimators :  6 :  0.7651632970451011
Estimators :  7 :  0.7682737169517885
Estimators :  8 :  0.7698289269051322
Estimators :  9 :  0.7682737169517885
Estimators :  10 :  0.7713841368584758
Estimators :  11 :  0.7744945567651633
Estimators :  12 :  0.7682737169517885
Estimators :  13 :  0.7744945567651633
Estimators :  14 :  0.7698289269051322
Estimators :  15 :  0.7776049766718507
Estimators :  16 :  0.7682737169517885
Estimators :  17 :  0.7729393468118196
Estimators :  18 :  0.7698289269051322
Estimators :  19 :  0.7729393468118196
Estimators :  20 :  0.7713841368584758

Best Parameters
Estimators: 15 : 0.7776049766718507


### Summary:
In two Random Forest experiments with max_depth fixed at 3, we first evaluated n_estimators in coarse steps of 10 (from 10 to 50) and then refined the search over every value from 1 to 20. In the coarse run, 10, 20, and 40 estimators tied for the best validation accuracy, approximately 77.14%. In the fine run, accuracy rose unevenly from 75.74% with a single tree to a peak of about 77.76% at 15 estimators, then fluctuated between roughly 76.8% and 77.3% for 16 to 20 estimators. The differences between neighboring settings are small (one validation record out of 643 is about 0.16 percentage points), so 15 estimators is the best setting found on this validation split, not a clearly superior choice.
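
Because single-split differences this small are noisy, a cross-validated search over both hyperparameters is a common alternative to the manual loops. The following is a minimal sketch using scikit-learn's `GridSearchCV` on synthetic data. The grid values are illustrative assumptions, not settings taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (assumption): four features, binary target.
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Illustrative grid (assumption): 4 x 4 = 16 candidate settings.
param_grid = {
    "n_estimators": [5, 10, 15, 20],
    "max_depth": [2, 3, 4, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=12345),
    param_grid,
    scoring="accuracy",
    cv=5,  # 5-fold cross-validation on the training set only
)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("mean CV accuracy:", round(search.best_score_, 3))

# The refit best estimator is then scored once on the held-out validation set.
valid_accuracy = search.score(X_valid, y_valid)
print("validation accuracy:", round(valid_accuracy, 3))
```

Averaging over folds reduces the chance of picking a setting that only happens to suit one particular validation split.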

## Linear Regression Model

In [8]:
# Linear Regression
model = LinearRegression()
model.fit(features_train, target_train)

predictions_valid = model.predict(features_valid)
predictions_binary = (predictions_valid > 0.5).astype(int)

accuracy = accuracy_score(target_valid, predictions_binary)

print('Accuracy:', accuracy)

Accuracy: 0.7293934681181959


### Summary:
A Linear Regression model was trained on the training data and produced continuous predictions for the validation set. Thresholding these predictions at 0.5 converted them into binary classes, giving a validation accuracy of approximately 72.94%. This is a useful baseline, but it falls short of the 75% accuracy target. Linear regression is not designed for classification, so a dedicated classification method is likely a better fit.
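
For comparison, a linear model built for classification is Logistic Regression, which outputs class probabilities directly. The sketch below shows the idea on synthetic data. It is an assumption about what such a comparison could look like, not a result from this notebook, and the feature scaling step is an added choice to help the solver converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): four features, binary target.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scale features, then fit the classifier.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

valid_predictions = clf.predict(X_valid)               # hard 0/1 labels
valid_probabilities = clf.predict_proba(X_valid)[:, 1]  # P(class 1)
logistic_accuracy = accuracy_score(y_valid, valid_predictions)

print("validation accuracy:", round(logistic_accuracy, 3))
```

Unlike thresholded linear regression, the predicted probabilities here are always between 0 and 1, so the default 0.5 cut-off has a direct interpretation.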

## Test Model
Best model: Random Forest with 15 estimators and a max_depth of 3.

In [9]:
# Test Model
best_model = RandomForestClassifier(random_state=12345, n_estimators=15, max_depth=3)
best_model.fit(features_train, target_train)

predictions_test = best_model.predict(features_test)

accuracy_test = accuracy_score(target_test, predictions_test)

print('Test Accuracy:', accuracy_test)

Test Accuracy: 0.7838258164852255


### Summary:
The best-performing Random Forest model, configured with 15 estimators and a maximum depth of 3, achieved a test accuracy of approximately 78.38%. This exceeds the 75% target and is close to the model's validation accuracy of 77.76%, which suggests the model generalizes to unseen data drawn from the same distribution. Accuracy on a single 643-record test set is still an estimate with some sampling uncertainty, but the result makes this configuration a reasonable candidate for recommending a subscriber plan.

### Sanity Test:
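The model passes a basic sanity check: its test accuracy (78.38%) is well above the 50% expected from guessing between two equally likely classes. Because the two plans may not be equally common in the data, a stricter check is to compare the model against always predicting the more frequent class.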
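
A 50% bar assumes both classes are equally common. A baseline comparison against the majority class can be sketched with scikit-learn's `DummyClassifier`. The data below is synthetic with an assumed 70/30 class imbalance, so the printed numbers say nothing about the Megaline dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): roughly 70% class 0, 30% class 1.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0, stratify=y
)

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))

# Model under test, with the hyperparameters selected in this notebook.
forest = RandomForestClassifier(random_state=12345, n_estimators=15, max_depth=3)
forest.fit(X_train, y_train)
forest_accuracy = accuracy_score(y_test, forest.predict(X_test))

print("majority-class baseline:", round(baseline_accuracy, 3))
print("random forest:          ", round(forest_accuracy, 3))
```

On the real data, the model is only informative if its test accuracy clearly exceeds the majority-class baseline computed the same way.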

## Conclusion
This analysis used a complete dataset of 3214 records covering calls, minutes, messages, and data usage to build and compare models that recommend one of two plans, Smart or Ultra, to Megaline subscribers. The data was split into 60% training, 20% validation, and 20% test sets.

Among the models tested, the Decision Tree reached its best validation accuracy, about 76.52%, at a max_depth of 3. Linear Regression with thresholded predictions reached only about 72.94%, short of the 75% target. The Random Forest with a max_depth of 3 and 15 estimators achieved a validation accuracy of approximately 77.76% and a test accuracy of roughly 78.38%.

The Random Forest is therefore the strongest of the models tested and meets the accuracy target on unseen data, which makes it a reasonable basis for recommending the appropriate new plan to subscribers.