Megaline, a mobile carrier, wants to move subscribers on legacy plans to one of its newer plans, Smart or Ultra. To do so, it plans to train a classification model on subscriber behavior data that recommends the more suitable plan (Smart or Ultra) for each subscriber based on usage patterns. The data used for model development comes from subscribers who have already switched to the new plans.

In [10]:
import pandas as pd
from joblib import dump
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

In [11]:
# Load dataset
data = pd.read_csv(r"C:\Users\maryk\BK_Tri_10\Project_7\users_behavior.csv")

# Display the first few rows of the dataset
print(data.head())

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0


In [12]:
# Split the data into training, validation, and test sets
train_data, temp_data = train_test_split(data, test_size=0.4, random_state=42)
validation_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42)

# Display the sizes of the splits
print(f"Training set size: {len(train_data)}")
print(f"Validation set size: {len(validation_data)}")
print(f"Test set size: {len(test_data)}")

Training set size: 1928
Validation set size: 643
Test set size: 643
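
The split above is purely random. A possible variation is to stratify on `is_ultra` so that each subset keeps the same share of Ultra subscribers. The sketch below is not part of the original workflow: it uses a synthetic table with the same column names (the values and the 30% Ultra share are made up) so that it runs without `users_behavior.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for users_behavior.csv (assumption: same column names,
# made-up values, roughly 30% Ultra subscribers).
rng = np.random.default_rng(42)
n_rows = 1000
demo = pd.DataFrame({
    "calls": rng.integers(0, 200, n_rows),
    "minutes": rng.uniform(0, 1500, n_rows),
    "messages": rng.integers(0, 200, n_rows),
    "mb_used": rng.uniform(0, 45000, n_rows),
    "is_ultra": (rng.random(n_rows) < 0.3).astype(int),
})

# Share of each class in the full table
print(demo["is_ultra"].value_counts(normalize=True))

# 60/20/20 split that preserves the class ratio in every subset
train_part, temp_part = train_test_split(
    demo, test_size=0.4, random_state=42, stratify=demo["is_ultra"]
)
valid_part, test_part = train_test_split(
    temp_part, test_size=0.5, random_state=42, stratify=temp_part["is_ultra"]
)

for name, part in [("train", train_part), ("validation", valid_part), ("test", test_part)]:
    print(f"{name}: {len(part)} rows, Ultra share {part['is_ultra'].mean():.3f}")
```

With stratification, the Ultra share printed for each subset should stay close to the share in the full table, which makes accuracy figures easier to compare across subsets.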


In [13]:
# Define the features and target
features = ['calls', 'minutes', 'messages', 'mb_used']
target = 'is_ultra'

X_train = train_data[features]
y_train = train_data[target]

X_validation = validation_data[features]
y_validation = validation_data[target]

# Define hyperparameter grids for RandomForestClassifier and DecisionTreeClassifier
rf_param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}

dt_param_grid = {
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}

# Evaluate RandomForestClassifier
rf_grid_search = GridSearchCV(RandomForestClassifier(random_state=42), rf_param_grid, cv=3, scoring='accuracy')
rf_grid_search.fit(X_train, y_train)
best_rf_model = rf_grid_search.best_estimator_
rf_validation_accuracy = accuracy_score(y_validation, best_rf_model.predict(X_validation))
print(f"Best RandomForestClassifier Validation Accuracy: {rf_validation_accuracy:.2f}")
print(f"Best RandomForestClassifier Parameters: {rf_grid_search.best_params_}")

# Evaluate DecisionTreeClassifier
dt_grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), dt_param_grid, cv=3, scoring='accuracy')
dt_grid_search.fit(X_train, y_train)
best_dt_model = dt_grid_search.best_estimator_
dt_validation_accuracy = accuracy_score(y_validation, best_dt_model.predict(X_validation))
print(f"Best DecisionTreeClassifier Validation Accuracy: {dt_validation_accuracy:.2f}")
print(f"Best DecisionTreeClassifier Parameters: {dt_grid_search.best_params_}")

Best RandomForestClassifier Validation Accuracy: 0.81
Best RandomForestClassifier Parameters: {'max_depth': 10, 'min_samples_split': 2, 'n_estimators': 150}
Best DecisionTreeClassifier Validation Accuracy: 0.79
Best DecisionTreeClassifier Parameters: {'max_depth': 10, 'min_samples_split': 2}
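
Accuracy figures such as these are easier to judge against a trivial baseline that always predicts the most frequent class. The sketch below shows one way to compute that baseline with scikit-learn's `DummyClassifier`. It uses synthetic labels (the 30% share of class 1 is an assumption, not a figure from the dataset); in the notebook, `X_train`, `y_train`, `X_validation` and `y_validation` would take the place of the demo arrays.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic features and labels (assumption: made-up data, about 30% class 1)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 4))
y_demo = (rng.random(500) < 0.3).astype(int)

# A baseline that ignores the features and always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_accuracy = accuracy_score(y_demo, baseline.predict(X_demo))

# The baseline accuracy equals the share of the majority class
majority_share = max(y_demo.mean(), 1 - y_demo.mean())
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
print(f"Majority class share: {majority_share:.2f}")
```

A tuned model is only useful to the extent that it beats this number.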


In [14]:
# Define features and target for the test set
X_test = test_data[features]
y_test = test_data[target]

# Evaluate the best RandomForestClassifier on the test set
rf_test_accuracy = accuracy_score(y_test, best_rf_model.predict(X_test))
print(f"RandomForestClassifier Test Accuracy: {rf_test_accuracy:.2f}")

# Evaluate the best DecisionTreeClassifier on the test set
dt_test_accuracy = accuracy_score(y_test, best_dt_model.predict(X_test))
print(f"DecisionTreeClassifier Test Accuracy: {dt_test_accuracy:.2f}")

RandomForestClassifier Test Accuracy: 0.82
DecisionTreeClassifier Test Accuracy: 0.79


In [15]:
# Evaluate the best RandomForestClassifier on the training set
rf_train_accuracy = accuracy_score(y_train, best_rf_model.predict(X_train))
print(f"RandomForestClassifier Training Accuracy: {rf_train_accuracy:.2f}")

# Evaluate the best DecisionTreeClassifier on the training set
dt_train_accuracy = accuracy_score(y_train, best_dt_model.predict(X_train))
print(f"DecisionTreeClassifier Training Accuracy: {dt_train_accuracy:.2f}")

RandomForestClassifier Training Accuracy: 0.88
DecisionTreeClassifier Training Accuracy: 0.88


In [16]:
# Train a Logistic Regression model
log_reg = LogisticRegression(random_state=42, max_iter=1000)
log_reg.fit(X_train, y_train)

# Evaluate Logistic Regression on validation and test sets
log_reg_validation_accuracy = accuracy_score(y_validation, log_reg.predict(X_validation))
log_reg_test_accuracy = accuracy_score(y_test, log_reg.predict(X_test))

print(f"Logistic Regression Validation Accuracy: {log_reg_validation_accuracy:.2f}")
print(f"Logistic Regression Test Accuracy: {log_reg_test_accuracy:.2f}")

# Compare with the best RandomForestClassifier and DecisionTreeClassifier
print(f"RandomForestClassifier Test Accuracy: {rf_test_accuracy:.2f}")
print(f"DecisionTreeClassifier Test Accuracy: {dt_test_accuracy:.2f}")

# Save the best model (RandomForestClassifier in this case)
dump(best_rf_model, 'best_rf_model.joblib')
print("Best RandomForestClassifier model saved as 'best_rf_model.joblib'")

Logistic Regression Validation Accuracy: 0.74
Logistic Regression Test Accuracy: 0.77
RandomForestClassifier Test Accuracy: 0.82
DecisionTreeClassifier Test Accuracy: 0.79
Best RandomForestClassifier model saved as 'best_rf_model.joblib'
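
A saved model is normally read back with `joblib.load`. The sketch below checks that a reloaded model gives the same predictions as the original. It trains a small forest on synthetic data and writes to a temporary directory; the file name `demo_rf_model.joblib` and the toy labelling rule are illustrative assumptions, and in the notebook the file would be `best_rf_model.joblib`.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with a toy rule (assumption: for illustration only)
rng = np.random.default_rng(1)
X_demo = rng.uniform(0, 1, size=(200, 4))
y_demo = (X_demo[:, 3] > 0.5).astype(int)

model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

# Save to a temporary location, then load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_rf_model.joblib")
    dump(model, model_path)
    restored = load(model_path)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(f"Predictions identical after reload: {same_predictions}")
```

Files written by `joblib` are pickle-based, so they should be loaded only from trusted sources and ideally with the same scikit-learn version that produced them.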


In [17]:
# Create a new data point for prediction
new_data = pd.DataFrame({'calls': [50], 'minutes': [300], 'messages': [20], 'mb_used': [15000]})

# Use the trained RandomForestClassifier to make a prediction
prediction = best_rf_model.predict(new_data)

# Print the prediction
print(prediction)

[0]


In [18]:
# Turn the prediction from the previous cell into a plan recommendation
# (1 = Ultra, 0 = Smart)
if prediction[0] == 1:
    print("Recommend the Ultra plan.")
else:
    print("Recommend the Smart plan.")

Recommend the Smart plan.
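
The recommendation step could also report how confident the model is by using `predict_proba` instead of `predict`. The sketch below wraps this in a helper. The function name `recommend_plan`, its `threshold` parameter and the synthetic training data are assumptions introduced for illustration; in the notebook, `best_rf_model` would be passed in place of `demo_model`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["calls", "minutes", "messages", "mb_used"]

# Synthetic training data (assumption: made-up values and a toy labelling rule,
# used only so the example runs without users_behavior.csv).
rng = np.random.default_rng(7)
n_rows = 400
X_demo = pd.DataFrame({
    "calls": rng.integers(0, 200, n_rows),
    "minutes": rng.uniform(0, 1500, n_rows),
    "messages": rng.integers(0, 200, n_rows),
    "mb_used": rng.uniform(0, 45000, n_rows),
})
y_demo = (X_demo["mb_used"] > 25000).astype(int)
demo_model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)


def recommend_plan(model, usage, threshold=0.5):
    """Return (plan name, estimated probability of Ultra) for one subscriber."""
    row = pd.DataFrame([usage], columns=FEATURES)
    # Look up the probability column that corresponds to class 1 (Ultra)
    ultra_column = list(model.classes_).index(1)
    prob_ultra = float(model.predict_proba(row)[0, ultra_column])
    plan = "Ultra" if prob_ultra >= threshold else "Smart"
    return plan, prob_ultra


plan, prob_ultra = recommend_plan(
    demo_model, {"calls": 50, "minutes": 300, "messages": 20, "mb_used": 15000}
)
print(f"Recommend the {plan} plan (estimated P(Ultra) = {prob_ultra:.2f}).")
```

Exposing the threshold lets the marketing team trade off how many subscribers receive an Ultra recommendation against how confident the model must be.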


Conclusion
The study aimed to classify mobile subscribers into two categories, "Ultra" (`is_ultra = 1`) or "Smart" (`is_ultra = 0`), based on their usage patterns (`calls`, `minutes`, `messages`, `mb_used`).
The dataset contains 3,214 rows and 5 columns, including the target variable `is_ultra`. The data was split into training (60%), validation (20%), and test (20%) sets.
Three machine learning models were trained and evaluated:
    - RandomForestClassifier: Achieved the best performance.
    - DecisionTreeClassifier: Performed slightly worse than RandomForest.
    - LogisticRegression: Had the lowest accuracy among the three models.

Model Performance:
    - RandomForestClassifier:
      - Training Accuracy: 88.38%
      - Validation Accuracy: 80.56%
      - Test Accuracy: 82.27%
    - DecisionTreeClassifier:
      - Training Accuracy: 87.86%
      - Validation Accuracy: 79.47%
      - Test Accuracy: 79.47%
    - LogisticRegression:
      - Validation Accuracy: 74.03%
      - Test Accuracy: 76.83%
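
The figures above can be gathered into one table to compare the models and to see how far training accuracy sits above validation accuracy. The sketch below uses only the percentages reported in this section; the column name `train_minus_validation` is an illustrative choice, and the training accuracy for LogisticRegression is left empty because it was not reported.

```python
import pandas as pd

# Accuracy figures (in percent) as reported in this section
results = pd.DataFrame(
    {
        "train": [88.38, 87.86, float("nan")],
        "validation": [80.56, 79.47, 74.03],
        "test": [82.27, 79.47, 76.83],
    },
    index=["RandomForestClassifier", "DecisionTreeClassifier", "LogisticRegression"],
)

# Gap between training and validation accuracy, a rough indicator of overfitting
results["train_minus_validation"] = (results["train"] - results["validation"]).round(2)

# Model with the highest validation accuracy
best_by_validation = results["validation"].idxmax()

print(results)
print(f"Best model by validation accuracy: {best_by_validation}")
```

Both tree-based models score roughly eight percentage points higher on the training set than on the validation set, which suggests some overfitting in each.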
Best Model:
    - The RandomForestClassifier was selected as the best model because it achieved the highest accuracy on both the validation set (80.56%) and the test set (82.27%). Its `feature_importances_` attribute can also show which usage features (`calls`, `minutes`, `messages`, `mb_used`) are most predictive of the "Ultra" plan.
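
Feature importances can be read from a fitted forest through its `feature_importances_` attribute. The sketch below shows the mechanics on synthetic data in which the label depends only on `mb_used` (a toy rule chosen for illustration, not a finding about the real dataset). In the notebook, `best_rf_model.feature_importances_` would be used with the `features` list.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["calls", "minutes", "messages", "mb_used"]

# Synthetic data (assumption: made-up values; the label depends only on mb_used)
rng = np.random.default_rng(3)
n_rows = 500
X_demo = pd.DataFrame({
    "calls": rng.integers(0, 200, n_rows),
    "minutes": rng.uniform(0, 1500, n_rows),
    "messages": rng.integers(0, 200, n_rows),
    "mb_used": rng.uniform(0, 45000, n_rows),
})
y_demo = (X_demo["mb_used"] > 25000).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)

# Impurity-based importances sum to 1; larger values mean more influence
importances = pd.Series(model.feature_importances_, index=FEATURES)
importances = importances.sort_values(ascending=False)
print(importances.round(3))
```

Impurity-based importances can be biased toward features with many distinct values, so `sklearn.inspection.permutation_importance` on held-out data is a common cross-check.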
Insights for Marketing:
    - Subscribers with high `minutes` and `mb_used` usage are the likeliest candidates for the "Ultra" plan, although this notebook does not measure per-feature effects directly.
    - Subscribers with lower usage may be better suited to the "Smart" plan.
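
One simple way to check statements like these is to compare mean usage per plan with a `groupby`. The sketch below applies it to the five rows printed by `data.head()` earlier in the notebook. Five rows are far too few to support any conclusion, so this only illustrates the technique; in the notebook the same call would be run on the full `data` frame.

```python
import pandas as pd

# The five rows shown by data.head() earlier in the notebook
sample = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0, 106.0, 66.0],
    "minutes": [311.90, 516.75, 467.66, 745.53, 418.74],
    "messages": [83.0, 56.0, 86.0, 81.0, 1.0],
    "mb_used": [19915.42, 22696.96, 21060.45, 8437.39, 14502.75],
    "is_ultra": [0, 0, 0, 1, 0],
})

# Mean usage per plan (0 = Smart, 1 = Ultra)
usage_by_plan = sample.groupby("is_ultra")[["calls", "minutes", "messages", "mb_used"]].mean()
print(usage_by_plan.round(2))
```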
With a test accuracy of about 82%, the RandomForestClassifier gives Megaline a workable basis for recommending plans to subscribers. Using it to align plans with subscriber needs can help focus the marketing strategy and improve customer satisfaction.

