# Predicting Mobile Plan Subscriptions

**Overview**

In this project, we aim to develop a predictive model that analyzes subscribers’ behavior for a mobile carrier company. The goal is to recommend one of the company’s newer plans (Smart or Ultra) to existing subscribers based on their historical behavior data.

**Data Preprocessing:**
We’ve already completed data cleaning and feature engineering.

**Model Development:** 
We’ll train and evaluate different models (Decision Tree, Random Forest, Logistic Regression) to recommend the right plan.

**Model Evaluation:**
Our success metric is accuracy, with a threshold of 0.75.

In [1]:
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


In [2]:
mobile_df = pd.read_csv('/datasets/users_behavior.csv')

In [3]:
mobile_df.head()

,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


After loading the dataset and reviewing its general information, we find no missing values, so the data can go straight into modeling. The target column `is_ultra` is 1 for subscribers on the Ultra plan and 0 for subscribers on the Smart plan.
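
A check along the following lines would confirm that no values are missing. It is a minimal sketch: the small `sample_df` reuses the five rows displayed above as a stand-in for `mobile_df`, so the block runs without the CSV file.

```python
import pandas as pd

# Stand-in for mobile_df: the five rows displayed by head() above
sample_df = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0, 106.0, 66.0],
    "minutes": [311.9, 516.75, 467.66, 745.53, 418.74],
    "messages": [83.0, 56.0, 86.0, 81.0, 1.0],
    "mb_used": [19915.42, 22696.96, 21060.45, 8437.39, 14502.75],
    "is_ultra": [0, 0, 0, 1, 0],
})

# Count missing values per column; all zeros means nothing to impute
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# Share of each class in the target, useful context for judging accuracy later
class_share = sample_df["is_ultra"].value_counts(normalize=True)
print(class_share)
```

On the real data, `mobile_df.isna().sum()` and `mobile_df.info()` report the same information for the full table.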

In [5]:
df_train, df_temp = train_test_split(mobile_df, test_size=0.25, random_state=12345, stratify=mobile_df['is_ultra'])
df_valid, df_test = train_test_split(df_temp, test_size=0.5, random_state=12345, stratify=df_temp['is_ultra'])

# Declare variables for features and target feature
features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']
features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']
features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']

# Print the shapes of the resulting datasets
print("Training features shape:", features_train.shape)
print("Training target shape:", target_train.shape)
print("Validation features shape:", features_valid.shape)
print("Validation target shape:", target_valid.shape)
print("Test features shape:", features_test.shape)
print("Test target shape:", target_test.shape)

Training features shape: (2410, 4)
Training target shape: (2410,)
Validation features shape: (402, 4)
Validation target shape: (402,)
Test features shape: (402, 4)
Test target shape: (402,)


**After splitting the data into training, validation, and test sets (roughly 75% / 12.5% / 12.5%, stratified by `is_ultra`):**

Training Set: 2410 observations with 4 features.

Validation Set: 402 observations with the same 4 features.

Test Set: 402 observations with the same 4 features.
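
The set sizes follow from the two `test_size` values. The arithmetic can be sketched as below, assuming the total row count is the sum of the three printed shapes and that scikit-learn rounds the held-out count up, which is how `train_test_split` treats a fractional `test_size`.

```python
import math

n_total = 2410 + 402 + 402  # rows across the three printed shapes

# First split: test_size=0.25, held-out count rounded up
n_temp = math.ceil(0.25 * n_total)
n_train = n_total - n_temp

# Second split: test_size=0.5 of the held-out part
n_test = math.ceil(0.5 * n_temp)
n_valid = n_temp - n_test

print(n_train, n_valid, n_test)
for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(f"{name}: {n / n_total:.1%} of all rows")
```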



In [6]:
# Initialize the Decision Tree Classifier
tree_clf = DecisionTreeClassifier(random_state=12345)

# Train the model
tree_clf.fit(features_train, target_train)

# Make predictions on the validation set
predictions_valid = tree_clf.predict(features_valid)

# Evaluate accuracy
accuracy_valid = accuracy_score(target_valid, predictions_valid)
print(f"Validation accuracy (Decision Tree): {accuracy_valid:.4f}")

Validation accuracy (Decision Tree): 0.7164


Our goal for this project is 75% accuracy, and the decision tree reaches about 72% on the validation set. That is close, but still below the threshold.

In [7]:
# Initialize the Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100, random_state=12345)

# Train the model
rf_clf.fit(features_train, target_train)

# Make predictions on the validation set
predictions_valid_rf = rf_clf.predict(features_valid)

# Evaluate accuracy
accuracy_valid_rf = accuracy_score(target_valid, predictions_valid_rf)
print(f"Validation accuracy (Random Forest): {accuracy_valid_rf:.4f}")

Validation accuracy (Random Forest): 0.7836


It seems we’ve found our model: 78% validation accuracy exceeds our 75% goal. A Random Forest also tends to overfit less than a single decision tree, because it averages many trees trained on different bootstrap samples of the data.

In [8]:
# Initialize the Logistic Regression model
logreg_clf = LogisticRegression(random_state=12345, max_iter=1000)

# Train the model
logreg_clf.fit(features_train, target_train)

# Make predictions on the validation set
predictions_valid_logreg = logreg_clf.predict(features_valid)

# Evaluate accuracy
accuracy_valid_logreg = accuracy_score(target_valid, predictions_valid_logreg)
print(f"Validation accuracy (Logistic Regression): {accuracy_valid_logreg:.4f}")

Validation accuracy (Logistic Regression): 0.7438


Logistic regression reaches about 74% validation accuracy, just under the 0.75 threshold. It beats the decision tree (72%) but trails the Random Forest (78%).
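
Putting the three validation accuracies printed above side by side makes the comparison against the threshold explicit. This is only a small bookkeeping sketch; the dictionary values are copied from the cell outputs.

```python
THRESHOLD = 0.75  # project success criterion

# Validation accuracies copied from the printed outputs above
validation_accuracy = {
    "Decision Tree": 0.7164,
    "Random Forest": 0.7836,
    "Logistic Regression": 0.7438,
}

# Rank the models from best to worst and flag those that meet the threshold
ranked = sorted(validation_accuracy.items(), key=lambda kv: kv[1], reverse=True)
for name, acc in ranked:
    status = "meets threshold" if acc >= THRESHOLD else "below threshold"
    print(f"{name}: {acc:.4f} ({status})")

passing = [name for name, acc in ranked if acc >= THRESHOLD]
```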

In [9]:
predictions_test_rf = rf_clf.predict(features_test)

# Evaluate accuracy on the test set
accuracy_test_rf = accuracy_score(target_test, predictions_test_rf)
print(f"Test accuracy (Random Forest): {accuracy_test_rf:.4f}")

Test accuracy (Random Forest): 0.8134


On the held-out test set, Random Forest reaches an accuracy of 81%, comfortably above the 0.75 threshold.

In [10]:
features = mobile_df.drop(['is_ultra'], axis=1) 
target = mobile_df['is_ultra']

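# Re-split into training and validation sets (75/25, no stratification).
# This overwrites the features_train/features_valid variables created earlier.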
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=12345)

# Initialize variables for best model
best_model = None
best_result = float('inf')
best_est = 0
best_depth = 0

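# Hyperparameter tuning: try different n_estimators and max_depth values.
# A regressor is fitted to the 0/1 target, so predictions are continuous
# scores and the selection metric is RMSE rather than accuracy.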
for est in range(10, 51, 10):
    for depth in range(1, 11):
        model = RandomForestRegressor(random_state=12345, n_estimators=est, max_depth=depth)
        model.fit(features_train, target_train)
        predictions_valid = model.predict(features_valid)
        result = mean_squared_error(target_valid, predictions_valid) ** 0.5
        if result < best_result:
            best_model = model
            best_result = result
            best_est = est
            best_depth = depth

print(f"RMSE of the best model on the validation set (n_estimators = {best_est}, max_depth = {best_depth}): {best_result:.4f}")

RMSE of the best model on the validation set (n_estimators = 30, max_depth = 9): 0.3866
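
Because the project metric is accuracy, the same grid search can also be written with a classifier. The sketch below is an illustration under assumptions, not the notebook's method: it swaps `RandomForestRegressor` for `RandomForestClassifier`, selects on validation accuracy, and uses synthetic data from `make_classification` as a stand-in for `users_behavior.csv`. The smaller grid is chosen only to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 4 features, roughly 70/30 class balance
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, weights=[0.7, 0.3], random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y)

best_acc, best_est, best_depth = 0.0, None, None
for est in range(10, 31, 10):
    for depth in range(1, 8, 2):
        model = RandomForestClassifier(random_state=12345,
                                       n_estimators=est, max_depth=depth)
        model.fit(X_train, y_train)
        acc = accuracy_score(y_valid, model.predict(X_valid))
        # Keep the combination with the highest validation accuracy
        if acc > best_acc:
            best_acc, best_est, best_depth = acc, est, depth

print(f"Best validation accuracy: {best_acc:.4f} "
      f"(n_estimators = {best_est}, max_depth = {best_depth})")
```

Selecting on accuracy keeps the tuning step aligned with the 0.75 threshold, so the chosen hyperparameters can be compared directly with the earlier models.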


In [11]:
features = mobile_df.drop(['is_ultra'], axis=1)  
target = mobile_df['is_ultra']

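# Fit a Random Forest on the full dataset.
# Note: n_estimators=40 with unlimited depth, not the tuned values (30, 9) found above.
final_model = RandomForestRegressor(random_state=12345, n_estimators=40)
final_model.fit(features, target)

# Calculate the RMSE on the same data the model was trained on
# (RMSE = sqrt(MSE), computed with ** 0.5 as in the tuning loop above)
rmse_train_final = mean_squared_error(target, final_model.predict(features)) ** 0.5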
print(f"RMSE of the final model on the training set: {rmse_train_final:.4f}")

RMSE of the final model on the training set: 0.1494


In [12]:
features = mobile_df.drop(['is_ultra'], axis=1)  
target = mobile_df['is_ultra']

# Split data into training and validation sets
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=12345)

# Initialize the final Random Forest model
final_model = RandomForestRegressor(random_state=12345, n_estimators=40)
final_model.fit(features_train, target_train)

# Calculate the RMSE on the training set (RMSE = sqrt(MSE))
rmse_train_final = mean_squared_error(target_train, final_model.predict(features_train)) ** 0.5
print(f"RMSE of the final model on the training set: {rmse_train_final:.4f}")

# Create a baseline that guesses 0 or 1 uniformly at random
# (no seed is set, so this value changes from run to run)
n_samples = len(target_valid)
baseline_predictions = np.random.choice([0, 1], size=n_samples)

# Calculate the RMSE for the baseline model on the validation set
rmse_baseline = mean_squared_error(target_valid, baseline_predictions) ** 0.5
print(f"RMSE of the baseline model on the validation set: {rmse_baseline:.4f}")

# Compare the model's training RMSE with the baseline's validation RMSE
if rmse_train_final < rmse_baseline:
    print("Our model performs better than the baseline (sanity check passed!)")
else:
    print("Our model does not perform significantly better than the baseline (sanity check failed!)")

RMSE of the final model on the training set: 0.1458
RMSE of the baseline model on the validation set: 0.7253
Our model performs better than the baseline (sanity check passed!)


In [13]:
dummy_clf = DummyClassifier(strategy="most_frequent", random_state=0)

# Fit the dummy classifier on the training data
dummy_clf.fit(features_train, target_train)

# Calculate the RMSE for the baseline model (DummyClassifier) on the validation set
rmse_baseline = mean_squared_error(target_valid, dummy_clf.predict(features_valid)) ** 0.5
print(f"RMSE of the baseline model (DummyClassifier) on the validation set: {rmse_baseline:.4f}")

# Compare the baseline's validation RMSE with the final model's training RMSE
if rmse_train_final < rmse_baseline:
    print("Our model performs better than the baseline (sanity check passed!)")
else:
    print("Our model does not perform significantly better than the baseline (sanity check failed!)")

RMSE of the baseline model (DummyClassifier) on the validation set: 0.5475
Our model performs better than the baseline (sanity check passed!)
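
A like-for-like version of this check would score the model and the baseline on the same validation set with the project metric. The following is a sketch under assumptions: synthetic data from `make_classification` replaces the CSV, and a `RandomForestClassifier` scored by accuracy replaces the regressor and RMSE used above.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 4 features, roughly 70/30 class balance
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, weights=[0.7, 0.3], random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y)

# Baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = RandomForestClassifier(n_estimators=40, random_state=12345).fit(X_train, y_train)

# Both scores come from the same validation set and the same metric
baseline_acc = accuracy_score(y_valid, baseline.predict(X_valid))
model_acc = accuracy_score(y_valid, model.predict(X_valid))

print(f"Baseline validation accuracy: {baseline_acc:.4f}")
print(f"Model validation accuracy:    {model_acc:.4f}")
print("sanity check passed" if model_acc > baseline_acc else "sanity check failed")
```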


The final model's RMSE on its own training data is 0.1458, well below the 0.5475 of the most-frequent-class baseline and the 0.7253 of random guessing, both measured on the validation set.
Because a training-set score is optimistic, the fairer comparison is the tuned forest's validation RMSE of 0.3866, which is also below both baselines.
The sanity check therefore passes: the Random Forest predicts better than either simple baseline.
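
For a baseline that outputs hard 0/1 labels, every squared error is either 0 or 1, so the MSE equals the error rate and accuracy is 1 - RMSE². The sketch below applies that identity to the two baseline RMSE values printed above. The helper name `accuracy_from_rmse` is made up for illustration, and the identity does not apply to the `RandomForestRegressor`, whose predictions are continuous.

```python
def accuracy_from_rmse(rmse: float) -> float:
    """Accuracy implied by an RMSE computed on hard 0/1 predictions."""
    return 1.0 - rmse ** 2


# Baseline RMSE values copied from the printed outputs (rounded to 4 decimals)
baseline_rmse = {"most_frequent": 0.5475, "random_guess": 0.7253}

baseline_accuracy = {name: accuracy_from_rmse(r) for name, r in baseline_rmse.items()}
for name, acc in baseline_accuracy.items():
    print(f"{name}: implied accuracy of about {acc:.3f}")
```

The most-frequent-class baseline therefore corresponds to roughly 70% accuracy, which puts the 0.75 project threshold in context.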

# Conclusions:

Of the three models tested, Random Forest performs best: 0.7836 accuracy on the validation set and 0.8134 on the test set, both above the 0.75 threshold. Decision Tree (0.7164) and Logistic Regression (0.7438) fall short of it on the validation set. The forest also beats the simple baselines in the sanity check, so it is the model we recommend for suggesting the Smart or Ultra plan to existing subscribers.