# Using Machine Learning to Suggest Current Plans for Megaline Legacy Plan Users
## Project Overview & Objective
A large number of Megaline users are not taking advantage of current plan offerings that may better meet their needs.  Many of these users have been with the company for a long time and are likely unaware of the newer plans, Smart and Ultra.  To increase customer satisfaction, this project develops a model that analyzes the monthly usage of current plan subscribers and suggests the most appropriate current plan (Smart or Ultra) for legacy users.  The model will be considered successful when it achieves an accuracy of at least 0.75 on a test dataset and passes a sanity check confirming that it is reliable.
## Data Description

In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.dummy import DummyClassifier

In [4]:
user_behavior = pd.read_csv('/datasets/users_behavior.csv')
user_behavior

|  | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.90 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| ... | ... | ... | ... | ... | ... |
| 3209 | 122.0 | 910.98 | 20.0 | 35124.90 | 1 |
| 3210 | 25.0 | 190.36 | 0.0 | 3275.61 | 0 |
| 3211 | 97.0 | 634.44 | 70.0 | 13974.06 | 0 |
| 3212 | 64.0 | 462.32 | 90.0 | 31239.78 | 0 |


In [5]:
user_behavior.describe()

|  | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


In [6]:
user_behavior.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


Thanks to previous work with this dataset, the data is clean and we do not need to do any preprocessing.  However, I will make a few notes about the dataset before proceeding to split it into a training set, validation set, and test set.

The dataset contains the monthly number of calls, minutes used, messages sent, MB of data used, and plan type for 3,214 users.  All data is numerical.  Just under 1/3 of the subscribers in the dataset (about 31%) use the Ultra plan.  The average subscriber to one of the current plans places 63 calls per month totalling 438 minutes, sends 38 text messages, and uses about 17,200 MB of data.
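
The class balance quoted above can be recovered from the `describe()` output alone: because `is_ultra` is coded 0/1, its mean is the share of Ultra rows. Below is a minimal sketch of that arithmetic using only the figures printed above. The variable names are illustrative, and it assumes that `is_ultra = 0` marks a Smart subscriber.

```python
# Figures copied from the describe() output above.
n_rows = 3214
ultra_share = 0.306472  # mean of the 0/1 column is_ultra

# The mean of a 0/1 column times the row count gives the number of ones.
n_ultra = round(ultra_share * n_rows)
n_smart = n_rows - n_ultra  # assumption: is_ultra == 0 means Smart

# Share of the larger class, i.e. the accuracy of always guessing that class.
majority_share = max(n_ultra, n_smart) / n_rows

print(f"Ultra rows: {n_ultra}")
print(f"Smart rows: {n_smart}")
print(f"Majority-class share: {majority_share:.4f}")
```

On the full dataset, a classifier therefore needs to beat roughly 69% accuracy to be more useful than always suggesting the majority plan.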

## Splitting the Data
For further analysis, I will split 60% of the data into a training set to train the model, 20% into a validation set to assist in hyperparameter tuning, and the remaining 20% will serve as a test set to evaluate the model's performance.

In [9]:
train_data, remaining_data = train_test_split(user_behavior, test_size=0.4, random_state=1)
valid_data, test_data = train_test_split(remaining_data, test_size=0.5, random_state=1)
print(f'Training Set Size: {train_data.shape[0]}')
print(f'Validation Set Size: {valid_data.shape[0]}')
print(f'Test Set Size: {test_data.shape[0]}')

Training Set Size: 1928
Validation Set Size: 643
Test Set Size: 643
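
The printed sizes follow from the two `test_size` fractions. The sketch below reproduces them with plain arithmetic. It assumes that `train_test_split` rounds the held-out portion up to a whole row and gives the remainder to the other part, which is consistent with the sizes printed above.

```python
import math

n_total = 3214

# First split: test_size=0.4 holds out 40% of the rows.
n_remaining = math.ceil(0.4 * n_total)
n_train = n_total - n_remaining

# Second split: test_size=0.5 halves the held-out rows.
n_test = math.ceil(0.5 * n_remaining)
n_valid = n_remaining - n_test

print(f"Training Set Size: {n_train}")
print(f"Validation Set Size: {n_valid}")
print(f"Test Set Size: {n_test}")
```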


## Identifying the Best Model
To determine the best model for this study, I will train three commonly used models on the training set, evaluate them on the validation set, and tune hyperparameters to find the best fit.  The model needs to suggest one of two plan options; therefore, it must perform binary classification with the highest accuracy the dataset allows.

Before creating models, I will define the variables these models will use.

In [10]:
# defining features and target
# The loaded data has no 'plan' column; the plan label is stored
# in 'is_ultra' (1 = Ultra, 0 = Smart), so that column is the target.
train_features = train_data.drop(['is_ultra'], axis=1)
train_target = train_data['is_ultra']
valid_features = valid_data.drop(['is_ultra'], axis=1)
valid_target = valid_data['is_ultra']
test_features = test_data.drop(['is_ultra'], axis=1)
test_target = test_data['is_ultra']

print(train_features.shape)
print(train_target.shape)
print(valid_features.shape)
print(valid_target.shape)
print(test_features.shape)
print(test_target.shape)

(1928, 4)
(1928,)
(643, 4)
(643,)
(643, 4)
(643,)


### Decision Tree Model
A single decision tree trains and predicts quickly, but it is often less accurate than an ensemble of trees.  I will begin with a decision tree model to get a baseline level of accuracy for the dataset.

In [12]:
# creating decision tree models with max_depth from 1 to 5
for depth in range(1, 6):
    dt_model = DecisionTreeClassifier(max_depth=depth, random_state=1)
    dt_model.fit(train_features, train_target)
    # score each depth on the validation set
    valid_predictions = dt_model.predict(valid_features)
    print('max_depth =', depth, ': ', end='')
    print(accuracy_score(valid_target, valid_predictions))

max_depth = 1 : 0.71850699844479
max_depth = 2 : 0.7558320373250389
max_depth = 3 : 0.7713841368584758
max_depth = 4 : 0.7682737169517885
max_depth = 5 : 0.7698289269051322


The decision tree model reaches its highest validation accuracy at a tree depth of 3.  However, that accuracy is only 77%, and the model may be underfitted.  Another model will most likely be able to improve upon that.
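
Whether a shallow tree is underfitted is easier to judge when training and validation accuracy are printed side by side: low scores on both suggest underfitting, while a training score far above the validation score suggests overfitting. A minimal sketch of that comparison follows. It runs on synthetic data from `make_classification` as a stand-in, so the numbers it prints say nothing about the Megaline data, and `depth_report` is a hypothetical helper name.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 4 numeric features, binary target.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=1)


def depth_report(max_depths):
    """Return (depth, train accuracy, validation accuracy) for each depth."""
    rows = []
    for depth in max_depths:
        model = DecisionTreeClassifier(max_depth=depth, random_state=1)
        model.fit(X_train, y_train)
        rows.append((depth,
                     model.score(X_train, y_train),
                     model.score(X_valid, y_valid)))
    return rows


report = depth_report(range(1, 6))
for depth, train_acc, valid_acc in report:
    print(f"max_depth = {depth}: train = {train_acc:.4f}, valid = {valid_acc:.4f}")
```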

### Random Forest Model

In [14]:
# creating random forest model
best_score = 0
best_est = 0
for est in range(1, 11):
    rf_model = RandomForestClassifier(random_state=1, n_estimators=est)
    rf_model.fit(train_features, train_target)
    score = rf_model.score(valid_features, valid_target)
    if score > best_score:
        best_score = score
        best_est = est
print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))

Accuracy of the best model on the validation set (n_estimators = 9): 0.7838258164852255


The Random Forest model improves on the decision tree, but only slightly: with 9 estimators, the best model reaches about 78% validation accuracy, up from 77%.  It is worth noting that increasing the number of estimators beyond 5 had very little impact on accuracy.  Because the dataset is small enough to process quickly with 9 estimators, I will keep that setting, but the number of estimators could reasonably be dropped to 5 if the model is later applied to a larger dataset and processing slows.  I will test one more model.
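
The loop above keeps only the best score, so the point at which accuracy levels off cannot be read from its output. One way to check it is to store every score and then pick the smallest forest within a chosen tolerance of the best. The sketch below does this on synthetic stand-in data; `smallest_within_tolerance` and the 0.01 tolerance are illustrative assumptions, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def smallest_within_tolerance(scores, tolerance):
    """Return the smallest key whose score is within `tolerance` of the best."""
    best = max(scores.values())
    return min(k for k, s in scores.items() if best - s <= tolerance)


# Synthetic stand-in data (assumption): 4 numeric features, binary target.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=1)

# Keep the validation accuracy for every forest size, not just the best one.
scores = {}
for est in range(1, 11):
    model = RandomForestClassifier(n_estimators=est, random_state=1)
    model.fit(X_train, y_train)
    scores[est] = model.score(X_valid, y_valid)

for est, score in scores.items():
    print(f"n_estimators = {est}: {score:.4f}")
print("Smallest forest within 0.01 of the best:",
      smallest_within_tolerance(scores, tolerance=0.01))
```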

### Logistic Regression Model

In [16]:
# creating logistic regression model
lr_model = LogisticRegression(random_state=1, solver='liblinear')
lr_model.fit(train_features, train_target)
train_score = lr_model.score(train_features, train_target)
valid_score = lr_model.score(valid_features, valid_target)
print("Accuracy of the logistic regression model on the training set:", train_score)
print("Accuracy of the logistic regression model on the validation set:", valid_score)

Accuracy of the logistic regression model on the training set: 0.7240663900414938
Accuracy of the logistic regression model on the validation set: 0.6889580093312597


Logistic regression did not produce a higher level of accuracy; at about 69% on the validation set, it scored lowest of the three models.  It would seem plan choice is simply not all that aligned with actual user behavior.  Nonetheless, the accuracy threshold for this project is 75%, so I will proceed with the Random Forest model with 9 estimators.

## Applying the Random Forest Model to the Test Dataset
It is time to run the model with the test dataset and see if it still achieves comparable accuracy.

In [18]:
# running the rf model on the test dataset
rf_model = RandomForestClassifier(random_state=1, n_estimators=9)
rf_model.fit(train_features, train_target)
test_predictions = rf_model.predict(test_features)
accuracy = accuracy_score(test_target, test_predictions)
print('Accuracy of the Random Forest Model when applied to the test dataset:', accuracy)

Accuracy of the Random Forest Model when applied to the test dataset: 0.7978227060653188


When applied to the test dataset, the Random Forest model achieved nearly 80% accuracy, about 30 percentage points above the 50% expected from guessing between the two plans at random.  This model seems to be a good starting place for this project and will provide a solid framework for Megaline to use to suggest current plans to legacy plan users.

Before concluding, I will perform a quick sanity check on the test dataset to see how the model compares with a trivial baseline.

## Sanity Check

I will compare my Random Forest model's test accuracy against a dummy classifier that ignores the features and always predicts the most frequent class in the training data.  A useful model should score clearly higher than this baseline.

In [19]:
# Creating a dummy model
dummy_model = DummyClassifier(strategy='most_frequent', random_state=1)

# Training the dummy model
dummy_model.fit(train_features, train_target)

# Making predictions using the dummy model
test_predictions = dummy_model.predict(test_features)

# Evaluating the dummy model
print("Dummy Model Accuracy:", accuracy_score(test_target, test_predictions))

Dummy Model Accuracy: 0.6936236391912908


At 69% accuracy, the dummy classifier sets the baseline: always predicting the most frequent plan is correct for about 69% of the test rows.  The Random Forest model's test accuracy of nearly 80% is roughly 10 percentage points higher, so the model passes the sanity check.
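
To make the comparison concrete, the two test-set accuracies reported above can be converted into counts of correct predictions on the 643 test rows. This is plain arithmetic on the printed figures; the variable names are illustrative.

```python
n_test = 643
rf_accuracy = 0.7978227060653188     # Random Forest, test set
dummy_accuracy = 0.6936236391912908  # most_frequent dummy, test set

# Accuracy times the number of rows gives the number of correct predictions.
rf_correct = round(rf_accuracy * n_test)
dummy_correct = round(dummy_accuracy * n_test)

# Absolute gain in percentage points, and the share of the dummy's errors
# that the Random Forest removes.
gain_points = (rf_accuracy - dummy_accuracy) * 100
error_reduction = (rf_accuracy - dummy_accuracy) / (1 - dummy_accuracy)

print(f"Random Forest correct: {rf_correct} of {n_test}")
print(f"Dummy correct: {dummy_correct} of {n_test}")
print(f"Gain: {gain_points:.1f} percentage points")
print(f"Share of dummy errors removed: {error_reduction:.1%}")
```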

## Conclusion
Machine learning has the potential to drive customer-centric service improvements within the telecommunications industry.  This project developed a machine learning model using a Random Forest classifier that recommends a current plan offering, Smart or Ultra, for Megaline's legacy plan users based on the monthly usage patterns of existing current plan subscribers.  The model achieved about 80% accuracy on the test dataset, establishing a strong foundation for Megaline to enhance customer satisfaction by aligning plan offerings with user behavior.  In the future, the model can be refined with additional datasets that might improve its predictive accuracy and generate more conversions to current plans by Megaline legacy plan users.