# Introduction

This project develops a classification model to help Megaline, a mobile carrier, recommend one of its newer plans (Smart or Ultra) to subscribers still on legacy plans. The model is trained on subscriber behavior data and must reach an accuracy of at least 75% so that Megaline can target customers with the most appropriate plan. Data preprocessing has already been completed, so the work here focuses on building and evaluating the model.

In [5]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error

In [8]:
#Import data file
behaviors_df = pd.read_csv('users_behavior.csv')
behaviors_df.head(10)

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0
5,58.0,344.56,21.0,15823.37,0
6,57.0,431.64,20.0,3738.9,1
7,15.0,132.4,6.0,21911.6,0
8,7.0,43.39,3.0,2538.67,1
9,90.0,665.41,38.0,17358.61,0


## Splitting Data into Training, Validation, and Test Sets

In [10]:
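# Split the data with train_test_split
# Note: both calls split the full DataFrame, so the validation rows are not
# held out from the training and test rows; validation scores may be optimistic
behaviors_df_train, behaviors_df_valid = train_test_split(behaviors_df, test_size=0.25, random_state=12345)
behaviors_df_train, behaviors_df_test = train_test_split(behaviors_df, test_size=0.20, random_state=12345)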

# Training set
features_train = behaviors_df_train.drop(['is_ultra'], axis=1)
target_train = behaviors_df_train['is_ultra']

# Test set
test_features = behaviors_df_test.drop(['is_ultra'], axis=1)
test_target = behaviors_df_test['is_ultra']

# Validation set
features_valid = behaviors_df_valid.drop(['is_ultra'], axis=1)
target_valid = behaviors_df_valid['is_ultra']

print(features_train.shape)
print(target_train.shape)
print(test_features.shape)
print(test_target.shape)
print(features_valid.shape)
print(target_valid.shape)

(2571, 4)
(2571,)
(643, 4)
(643,)
(804, 4)
(804,)
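
In the cell above, both `train_test_split` calls operate on the full DataFrame, so the validation rows are not kept separate from the training and test rows. The following is a minimal sketch of one common alternative, assuming a 60/20/20 split is acceptable: hold out the test set first, then split the remainder. The synthetic `demo_df` and the names `df_rest`, `df_train`, `df_valid`, and `df_test` are hypothetical stand-ins, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for users_behavior.csv (same column names, random values)
rng = np.random.default_rng(12345)
demo_df = pd.DataFrame({
    'calls': rng.integers(0, 200, size=1000),
    'minutes': rng.uniform(0, 1500, size=1000),
    'messages': rng.integers(0, 200, size=1000),
    'mb_used': rng.uniform(0, 40000, size=1000),
    'is_ultra': rng.integers(0, 2, size=1000),
})

# Step 1: hold out 20% of all rows as the test set
df_rest, df_test = train_test_split(demo_df, test_size=0.20, random_state=12345)

# Step 2: split the remaining 80% into training (60% overall)
# and validation (20% overall); 0.25 * 0.80 = 0.20
df_train, df_valid = train_test_split(df_rest, test_size=0.25, random_state=12345)

# Confirm that no row appears in more than one subset
no_overlap = (
    set(df_train.index).isdisjoint(df_valid.index)
    and set(df_train.index).isdisjoint(df_test.index)
    and set(df_valid.index).isdisjoint(df_test.index)
)
print(len(df_train), len(df_valid), len(df_test), no_overlap)
```

With this ordering, every row lands in exactly one subset, so validation and test scores are computed only on rows the model has not seen during training.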


## Testing quality of models

In [11]:
# DecisionTree model
for depth in range(1,11):
    decision_tree_model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    decision_tree_model.fit(features_train, target_train)
    prediction_valid = decision_tree_model.predict(features_valid)
    print('max_depth =', depth, ': ', end=' ')
    print(accuracy_score(target_valid, prediction_valid)) 

max_depth = 1 :  0.75
max_depth = 2 :  0.7835820895522388
max_depth = 3 :  0.7885572139303483
max_depth = 4 :  0.7848258706467661
max_depth = 5 :  0.7898009950248757
max_depth = 6 :  0.7898009950248757
max_depth = 7 :  0.7885572139303483
max_depth = 8 :  0.7810945273631841
max_depth = 9 :  0.7910447761194029
max_depth = 10 :  0.7860696517412935


In [12]:
# Repeat the depth sweep using the precision_score function
for depth in range(1,11):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)
    prediction_valid = model.predict(features_valid)
    print('max_depth =', depth, ': ', end=' ')
    print(precision_score(target_valid, prediction_valid))

max_depth = 1 :  0.7222222222222222
max_depth = 2 :  0.7637795275590551
max_depth = 3 :  0.7591240875912408
max_depth = 4 :  0.753731343283582
max_depth = 5 :  0.7608695652173914
max_depth = 6 :  0.7432432432432432
max_depth = 7 :  0.7983193277310925
max_depth = 8 :  0.7210884353741497
max_depth = 9 :  0.7703703703703704
max_depth = 10 :  0.7346938775510204


Every depth from 2 to 10 clears the 0.75 accuracy threshold on the validation set (depth 1 only matches it), and max_depth = 9 gives the highest accuracy at about 0.791. The differences between depths are small, however, and a deeper tree is more prone to overfitting. With that in mind, I also ran the same sweep with the precision score function to get a different view of the model's performance. By precision, max_depth = 7 scores highest (about 0.798), and the drop at max_depth = 8 may signal the start of overfitting.
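
One way to make the overfitting question more concrete is to print training accuracy next to validation accuracy for each depth: a training score that keeps climbing while the validation score stalls is a typical warning sign. The sketch below assumes synthetic data from `make_classification` in place of the Megaline data; `results`, `best_depth`, and `best_valid_acc` are hypothetical names introduced for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 4 features, binary target
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

results = []
for depth in range(1, 11):
    tree = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)  # accuracy on rows the tree was fit on
    valid_acc = tree.score(X_valid, y_valid)  # accuracy on held-out rows
    results.append((depth, train_acc, valid_acc))
    print(f'max_depth = {depth}: train = {train_acc:.3f}, valid = {valid_acc:.3f}')

# Pick the depth with the highest validation accuracy
best_depth, _, best_valid_acc = max(results, key=lambda r: r[2])
print(f'Best depth by validation accuracy: {best_depth} ({best_valid_acc:.3f})')
```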

In [13]:
# RandomForestClassifier
best_score = 0
best_est = 0
for est in range(1, 11):
    random_forest_model = RandomForestClassifier(random_state=54321, n_estimators=est) # max_depth=10, min_samples_split=5, min_samples_leaf=5)
    random_forest_model.fit(features_train, target_train)
    score = random_forest_model.score(features_valid, target_valid)
    if score > best_score:
        best_score = score
        best_est = est
display("Best model accuracy on validation set (n_estimators = {}): {}".format(best_est, best_score))

'Best model accuracy on validation set (n_estimators = 10): 0.8208955223880597'

In [14]:
# RandomForestClassifier - Final Model
final_model = RandomForestClassifier(random_state=54321, n_estimators=10)  # best n_estimators from the previous cell
final_model.fit(features_train, target_train)
# Note: this displays only the label; no training accuracy is computed here
display("Best model accuracy on training set (n_estimators = {}):".format(best_est))

'Best model accuracy on training set (n_estimators = 10):'

I looped over n_estimators from 1 to 10 for the RandomForestClassifier. For each value, I trained the model on the training set and measured its accuracy on the validation set. The best model used 10 estimators and reached a validation accuracy of approximately 0.821 (82.1%), which is above the 0.75 target threshold. I also tested a wider range of n_estimators and added additional hyperparameters; each time the accuracy came out below the original 82.1%.
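
The wider search described above can be sketched as a nested loop over two hyperparameters. This is an illustration under stated assumptions, not the search that was actually run: the data is synthetic, and the candidate values for `n_estimators` and `max_depth` as well as the `best_params` dictionary are hypothetical choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 4 features, binary target
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=54321)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=54321)

best_score = 0.0
best_params = None
for n_estimators in (10, 50, 100):
    for max_depth in (None, 5, 10):  # None lets each tree grow without a depth limit
        forest = RandomForestClassifier(random_state=54321,
                                        n_estimators=n_estimators,
                                        max_depth=max_depth)
        forest.fit(X_train, y_train)
        score = forest.score(X_valid, y_valid)  # validation accuracy
        if score > best_score:
            best_score = score
            best_params = {'n_estimators': n_estimators, 'max_depth': max_depth}

print(best_params, best_score)
```

Keeping the best parameters in a dictionary makes it straightforward to rebuild the chosen model later with `RandomForestClassifier(random_state=54321, **best_params)`.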

In [15]:
# Logistic regression
logistics_model = LogisticRegression(random_state=54321, solver='liblinear')
logistics_model.fit(features_train, target_train)
score_train = logistics_model.score(
    features_train,
    target_train
)  
score_valid = logistics_model.score(
    features_valid,
    target_valid
)  

display(
    "Accuracy of the logistic regression model on the training set:",
    score_train,
)
display(
    "Accuracy of the logistic regression model on the validation set:",
    score_valid,
)

'Accuracy of the logistic regression model on the training set:'

0.7016725009723843

'Accuracy of the logistic regression model on the validation set:'

0.7052238805970149

Of the three models, logistic regression is the least accurate: its validation accuracy of about 0.705 falls below the 0.75 threshold.
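
One possible contributor, not verified on this dataset, is that the features sit on very different scales (`mb_used` is in the tens of thousands while `messages` is in the tens), which can affect a regularized linear model. A minimal sketch of standardizing the features before fitting, assuming synthetic data with one artificially stretched column; `scaled_logistic` and `valid_acc` are hypothetical names:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 4 features, binary target
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=54321)
# Stretch one column to mimic a feature on a much larger scale
X[:, 3] = X[:, 3] * 10000

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=54321)

# The scaler is fit on the training rows only, then reused for validation rows
scaled_logistic = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=54321, solver='liblinear'),
)
scaled_logistic.fit(X_train, y_train)
valid_acc = scaled_logistic.score(X_valid, y_valid)
print(f'Validation accuracy with scaling: {valid_acc:.3f}')
```

Whether scaling helps here would need to be checked against the actual data; the pipeline only ensures the comparison is made without leaking validation statistics into training.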

In [16]:
# Check quality using test set
best_est = 10  # best n_estimators from the validation sweep
random_forest_model = RandomForestClassifier(random_state=54321, n_estimators=best_est)
random_forest_model.fit(features_train, target_train)
predictions_test = random_forest_model.predict(test_features)
accuracy_test = accuracy_score(test_target, predictions_test)
print("Accuracy on the test set: {}".format(accuracy_test))

Accuracy on the test set: 0.7807153965785381


The test accuracy (about 0.781) is somewhat lower than the validation accuracy (about 0.821), but it is still above the 0.75 accuracy threshold. A large gap between validation and test accuracy can indicate overfitting. Here, part of the gap may also come from the split itself: the validation set was drawn from the full DataFrame, so it shares rows with the training set, which tends to inflate the validation score.

## Sanity Check 

In [17]:
# import dummy classifier
from sklearn.dummy import DummyClassifier

# Run dummy classifier model
dummy_class_model = DummyClassifier(strategy='most_frequent', random_state=0)
dummy_class_model.fit(features_train, target_train)
dummy_class_predictions = dummy_class_model.predict(test_features)

#Baseline model
dummy_accuracy = accuracy_score(test_target, dummy_class_predictions)
display(f"Baseline accuracy: {dummy_accuracy}")

# Compare with your model's accuracy
model_accuracy = accuracy_score(test_target, predictions_test)
print(f"Your model's accuracy: {model_accuracy}")

# Sanity Check Pass
pass_sanity_check = model_accuracy > dummy_accuracy
display(f"Sanity check passed: {pass_sanity_check}")

'Baseline accuracy: 0.6951788491446346'

Your model's accuracy: 0.7807153965785381


'Sanity check passed: True'

The random forest's test accuracy (about 0.781) exceeds the most-frequent-class baseline (about 0.695), so the model passes this basic sanity check.
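
For intuition about the baseline: a `most_frequent` dummy classifier always predicts the majority class seen during fitting, so its accuracy equals the share of that class in the data it is scored on. The toy example below uses a hypothetical 70/30 target to show the two numbers agreeing; `toy_target`, `toy_features`, and `majority_share` are made-up names for illustration.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical toy data: 7 rows of class 0 and 3 rows of class 1
toy_target = pd.Series([0] * 7 + [1] * 3)
toy_features = pd.DataFrame({'x': range(10)})  # feature values are ignored by the dummy

# Share of the most common class
majority_share = toy_target.value_counts(normalize=True).max()

# The dummy model predicts the majority class for every row
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(toy_features, toy_target)
dummy_acc = accuracy_score(toy_target, dummy.predict(toy_features))

print(majority_share, dummy_acc)
```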

## Conclusion

Based on this testing, the RandomForestClassifier is the best of the models tried, as it has the highest accuracy score. On the test set, this model correctly identifies whether a particular subscriber fits better with the Smart plan or the Ultra plan about 78% of the time, according to the data given.