# Employing Machine Learning to Suggest Optimal Data Plans for Clients

Megaline, a mobile carrier, has noticed that a significant number of its subscribers are still using legacy plans. The company wants to recommend one of its newer plans, Smart or Ultra, to each legacy-plan user, choosing the plan that fits that customer best. To do this, Megaline has asked for a model that predicts which of the two plans suits each customer based on their usage habits.

As the recommendation can either be the Smart or Ultra plan, this is a binary classification problem. In order to solve this problem, the following machine learning models will be trained and evaluated:

- Decision Tree
- Random Forest

The data will be divided into training, validation, and test sets. Both models will be trained on the training set, and their hyperparameters will be tuned for maximum accuracy on the validation set. The model with the higher validation accuracy will then be selected as the final model and evaluated on the test set.

### Load libraries

In [1]:
# Import libraries

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

### Load data

Below, the csv file `users_behavior.csv` will be read and stored in the DataFrame `df`.

In [2]:
# Load the data and store it to df

df = pd.read_csv('users_behavior.csv')

### Explore the data

Let's take a look at the data stored in the `df` DataFrame. The first 15 rows will be printed, followed by the general info.

In [3]:
# Look at the first 15 rows
df.head(15)

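|   | calls | minutes | messages | mb_used | is_ultra |
|---|-------|---------|----------|---------|----------|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| 5 | 58.0 | 344.56 | 21.0 | 15823.37 | 0 |
| 6 | 57.0 | 431.64 | 20.0 | 3738.9 | 1 |
| 7 | 15.0 | 132.4 | 6.0 | 21911.6 | 0 |
| 8 | 7.0 | 43.39 | 3.0 | 2538.67 | 1 |
| 9 | 90.0 | 665.41 | 38.0 | 17358.61 | 0 |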


In [4]:
# Look at summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


The DataFrame has no missing values, and the data types are suitable as they are. The `calls` and `messages` columns appear to hold whole numbers stored as floats, but keeping them as floats will not affect the modeling process. Therefore, no further data preparation is necessary.
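
Both observations can be checked directly rather than read off the summary. The sketch below is a minimal illustration, assuming a small hand-built DataFrame (`sample`, a hypothetical name) that copies the first five rows printed above instead of reading `users_behavior.csv`; on the full data the same checks would be run on `df`.

```python
import pandas as pd

# Hypothetical sample: the first five rows shown in the df.head() output above
sample = pd.DataFrame({
    'calls': [40.0, 85.0, 77.0, 106.0, 66.0],
    'minutes': [311.9, 516.75, 467.66, 745.53, 418.74],
    'messages': [83.0, 56.0, 86.0, 81.0, 1.0],
    'mb_used': [19915.42, 22696.96, 21060.45, 8437.39, 14502.75],
    'is_ultra': [0, 0, 0, 1, 0],
})

# Count missing values per column (all zeros means nothing to impute)
missing_per_column = sample.isna().sum()

# True when every value in the column has no fractional part
calls_are_whole = (sample['calls'] % 1 == 0).all()
messages_are_whole = (sample['messages'] % 1 == 0).all()

print(missing_per_column)
print('calls whole:', calls_are_whole, '| messages whole:', messages_are_whole)
```

If either whole-number check returned `False` on the full data, converting the column to an integer type would silently truncate values, which is one more reason to leave the floats alone.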


## Modeling

The following two machine learning algorithms will be employed:

- Decision Tree
- Random Forest

For each algorithm, the configuration with the highest validation accuracy will be kept. The better of the two will then be evaluated on the test data to measure its final performance.

### Splitting data into datasets

The column `is_ultra` indicates the plan used by a customer: it contains a 0 if the Smart plan is used and a 1 if the Ultra plan is used. Since we aim to recommend a plan to customers based on their usage habits, the column `is_ultra` will be our target column. The other columns (`calls`, `minutes`, `messages`, `mb_used`) describe each customer's usage behavior and are expected to be related to the choice between the Smart and Ultra plans, so these columns will be our feature columns.

In [5]:
# The features of the DataFrame include all columns except for 'is_ultra'
features = df.drop('is_ultra', axis=1)

# The target of the DataFrame is the 'is_ultra' column
target = df['is_ultra']

The `features` DataFrame and `target` Series data will be divided into training, validation, and test datasets with a ratio of 3:1:1. Specifically, the training, validation, and test datasets will consist of 60%, 20%, and 20% of the data from `features` and `target`, respectively.

In [6]:
# Split the features and target into training, validation, and test datasets.
# First, separate the training data from the rest: test_size=0.4 sends 40% of
# the rows to the "other" datasets and leaves 60% for training.
features_train, features_other, target_train, target_other = train_test_split(
    features, target, test_size=0.4, random_state=12345)

# Then split the "other" datasets in half (test_size=0.5), so that the
# validation and test datasets each contain 20% of the original data.
features_valid, features_test, target_valid, target_test = train_test_split(
    features_other, target_other, test_size=0.5, random_state=12345)
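
The two-step split can be checked for the intended 60/20/20 proportions. The following is a minimal sketch, assuming a plain list of row numbers (`row_ids`, a hypothetical stand-in) in place of the real features; only the row count of 3214 is taken from the `df.info()` output.

```python
from sklearn.model_selection import train_test_split

# Stand-in for the real rows (assumption): only the count of 3214 matters here
row_ids = list(range(3214))

# Step 1: hold out 40% of the rows
train_ids, other_ids = train_test_split(row_ids, test_size=0.4, random_state=12345)

# Step 2: split the held-out 40% in half
valid_ids, test_ids = train_test_split(other_ids, test_size=0.5, random_state=12345)

# Report the size and share of each part
for name, part in [('train', train_ids), ('valid', valid_ids), ('test', test_ids)]:
    print(f'{name}: {len(part)} rows ({len(part) / len(row_ids):.1%})')
```

The same check on the real data amounts to comparing `len(features_train)`, `len(features_valid)`, and `len(features_test)` against `len(features)`.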


### Decision Tree model

As the target column has binary values, 0 or 1, the task of predicting the target values is a binary classification task. A decision tree is a suitable algorithm for this type of task. The maximum depth of the tree is a crucial hyperparameter of a decision tree model. Therefore, the following code block will create models with maximum depths from 1 to 40. The accuracy of each model will be evaluated on the validation set, and the model with the highest accuracy will be presented.

In [7]:
# Decision Tree Model/Learning Algorithm

# Initialize the trackers for the best model found so far
best_DT_model = None
best_DT_accuracy = 0
best_DT_depth = 0

# Try each maximum depth from 1 to 40 (range(1, 41) excludes 41)
for depth in range(1, 41):

    # Create a model, using the provided depth and the same random_state
    DT_model = DecisionTreeClassifier(max_depth=depth, random_state=12345)

    # Train the model using the training dataset
    DT_model.fit(features_train, target_train)

    # Predict the target values of the validation features using the model
    DT_predictions_valid = DT_model.predict(features_valid)

    # Calculate the accuracy on the validation set
    accuracy = accuracy_score(target_valid, DT_predictions_valid)

    # Keep this model if it beats the best accuracy so far
    if accuracy > best_DT_accuracy:
        best_DT_model = DT_model
        best_DT_depth = depth
        best_DT_accuracy = accuracy

print('Best Model:', best_DT_model)
print(f'Best Accuracy: {round(best_DT_accuracy*100,2)}%')
print('Best Depth:', best_DT_depth)


Best Model: DecisionTreeClassifier(max_depth=3, random_state=12345)
Best Accuracy: 78.54%
Best Depth: 3


The decision tree model with the highest accuracy is the one with a maximum depth of 3, which achieved an accuracy of around 78.54%. This model will be referred to as `best_DT_model` and will be utilized during the testing phase.
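
Keeping the validation accuracy for every depth, not only the best one, makes it easier to see how quickly deeper trees stop helping. The following is a minimal sketch, assuming synthetic data from `make_classification` in place of the Megaline features and a shorter depth range; the names `X`, `y`, and `accuracy_by_depth` are illustrative and do not appear in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 4 features and a binary target
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Record the validation accuracy of one tree per depth
accuracy_by_depth = {}
for depth in range(1, 11):
    model = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    model.fit(X_train, y_train)
    accuracy_by_depth[depth] = accuracy_score(y_valid, model.predict(X_valid))

# max() returns the first key with the top score, so ties go to the shallower tree
best_depth = max(accuracy_by_depth, key=accuracy_by_depth.get)
print(best_depth, round(accuracy_by_depth[best_depth], 4))
```

Preferring the shallower tree on ties matches the strict `>` comparison used in the search loop above.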

### Random Forest model

Now, let's use a random forest model to predict the target values. The maximum depth and the number of estimators are important hyperparameters for a random forest model. The number of estimators is equivalent to the number of decision trees in the model. To identify the optimal combination of hyperparameters, models will be trained and evaluated with different values of both the maximum depth and number of estimators. This will be done by using nested for loops to iterate through a range of values for each hyperparameter. All the resulting models will be evaluated for accuracy, and the random forest model with the highest accuracy will be presented.

In [8]:
# Random Forest model

# Initialize the trackers for the best model found so far
best_RF_model = None
best_RF_accuracy = 0
best_RF_depth = 0
best_est = 0

# Create various models with different depth and estimator values

# for loop for the number of estimators (1 to 20)
for est in range(1, 21):

    # for loop for the depth value (1 to 40)
    for depth in range(1, 41):

        # Create a model, using the provided depth, number of estimators, and the same random_state
        RF_model = RandomForestClassifier(max_depth=depth, random_state=12345, n_estimators=est)

        # Train the model using the training dataset
        RF_model.fit(features_train, target_train)

        # Predict the target values of the validation features using the model
        RF_predictions_valid = RF_model.predict(features_valid)

        # Calculate the accuracy on the validation set
        accuracy = accuracy_score(target_valid, RF_predictions_valid)

        # Keep this model if it beats the best accuracy so far
        if accuracy > best_RF_accuracy:
            best_RF_model = RF_model
            best_RF_accuracy = accuracy
            best_RF_depth = depth
            best_est = est

print('Best Model:', best_RF_model)
print(f'Best Accuracy: {round(best_RF_accuracy*100,2)}%')
print('Best Depth:', best_RF_depth)
print('Best n_estimators:', best_est)


Best Model: RandomForestClassifier(max_depth=12, n_estimators=17, random_state=12345)
Best Accuracy: 80.56%
Best Depth: 12
Best n_estimators: 17


The random forest model with the highest accuracy has a maximum depth of 12 and 17 estimators, and it reached a validation accuracy of around 80.56%. This model will be referred to as `best_RF_model` and will be used during the testing phase.
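
The nested loops over the two hyperparameters can also be written as a single loop over every pair of values. The following is a minimal sketch, assuming synthetic data from `make_classification` and a much smaller grid than the one searched above; `X`, `y`, `best_params`, and `best_accuracy` are illustrative names, not part of the notebook.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 4 features and a binary target
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

best_params, best_accuracy = None, 0.0

# Small illustrative grid (assumption): 1-5 trees and depths 1-5
for n_trees, depth in product(range(1, 6), range(1, 6)):
    model = RandomForestClassifier(n_estimators=n_trees, max_depth=depth,
                                   random_state=12345)
    model.fit(X_train, y_train)
    accuracy = model.score(X_valid, y_valid)  # mean accuracy for classifiers
    if accuracy > best_accuracy:
        best_params, best_accuracy = (n_trees, depth), accuracy

print(best_params, round(best_accuracy, 4))
```

`product` visits the pairs in the same order as the nested loops (estimators outer, depth inner), so the strict `>` comparison breaks ties the same way.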

## Final Model

The optimal hyperparameters for both the decision tree and random forest models have been identified. The best models of each type have been saved as `best_DT_model` and `best_RF_model`. Now, it is time to compare these models by recomputing the accuracy of each on the validation dataset. The better of the two is then evaluated on the test dataset.

### Best decision tree model

In [9]:
# Evaluate the best decision tree model on the validation dataset

# Predict the target values
DT_validation_predictions = best_DT_model.predict(features_valid)

# Calculate the accuracy
DT_validation_accuracy = accuracy_score(target_valid, DT_validation_predictions)

# Print the results
print(f'Accuracy: {round(DT_validation_accuracy*100,2)}%')

Accuracy: 78.54%


The best decision tree model achieved an accuracy of 78.54% when making predictions on the validation dataset, which is higher than the threshold of 75% for model accuracy.

### Best random forest model

In [10]:
# Evaluate the best random forest model on the validation dataset

# Predict the target values
RF_validation_predictions = best_RF_model.predict(features_valid)

# Calculate the accuracy
RF_validation_accuracy = accuracy_score(target_valid, RF_validation_predictions)

# Print the results
print(f'Accuracy: {round(RF_validation_accuracy*100,2)}%')

Accuracy: 80.56%


The best random forest model was found to be more accurate than the best decision tree model, with an accuracy of 80.56% when predicting on the validation dataset. This accuracy is above the threshold of 75% for model performance and surpasses the 78.54% accuracy of the decision tree model.

### Final model selection and test

In [11]:
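# Select the best model based on the validation accuracy.
# Use best_DT_model / best_RF_model here: DT_model and RF_model only hold
# the last models trained in the search loops, not the best ones.
if DT_validation_accuracy > RF_validation_accuracy:
    best_model = best_DT_model
else:
    best_model = best_RF_model

# Test the final selected model using the test dataset
test_predictions = best_model.predict(features_test)
test_accuracy = accuracy_score(target_test, test_predictions)

# Print the results
print(f'Accuracy: {round(test_accuracy*100,2)}%')

Accuracy: 78.23%


Because the random forest has the higher validation accuracy (80.56% versus 78.54%), it is the model selected for testing, not the decision tree. The 78.23% shown above was recorded when the cell assigned the loop variable `RF_model`, the last forest trained in the search (20 estimators, maximum depth 40), so it may differ slightly from the test accuracy of `best_RF_model`. The recorded figure is above the 75% threshold. It is lower than the validation accuracy, which is expected because the hyperparameters were tuned on the validation set.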

## Sanity Check

Each user can be recommended either the Smart or the Ultra plan, which corresponds to a 0 or a 1 in the target datasets and predicted values. Since determining which plan to recommend is a binary classification task, there is a baseline accuracy that can be achieved by always predicting the majority class. This baseline is equal to the proportion of the majority class in the dataset.

With Smart as the majority class, the baseline accuracy is around 70%. The best decision tree model and the best random forest model reached validation accuracies of 78.54% and 80.56%, respectively, and the recorded test accuracy was 78.23%. These figures are roughly 8 to 11 percentage points above the accuracy that would be attained by always recommending the majority class. Thus, it makes sense to use a trained model rather than a fixed recommendation to determine which plan to offer each customer.
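
The majority-class baseline is simply the largest class share in the target. The following is a minimal sketch, assuming a small made-up target (`example_target`, a hypothetical name) with 7 Smart and 3 Ultra users chosen to mirror the roughly 70% figure; on the real data the same two lines would be applied to `target`.

```python
import pandas as pd

# Hypothetical target (assumption): 7 Smart (0) and 3 Ultra (1) users
example_target = pd.Series([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

# Share of each class, then the largest share and the class that holds it
class_shares = example_target.value_counts(normalize=True)
majority_class = class_shares.idxmax()
baseline_accuracy = class_shares.max()

print(f'Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.0%}')
```

A model is only worth delivering if its test accuracy clearly exceeds this value.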

## Conclusion

The mobile carrier Megaline requested a trained model that would recommend one of their newer plans to customers continuing to use legacy plans. It was determined that this was a binary classification task since only the Ultra or Smart plan could be recommended. Therefore, the two models that were trained were a decision tree model and a random forest model.

The features and target data were divided into 3 datasets as follows:

- Training dataset (60%)
- Validation dataset (20%)
- Testing dataset (20%)

Multiple models were created with various combinations of hyperparameters. They were trained on the same training data, and their accuracy was then compared on the validation data. The best decision tree model reached a validation accuracy of 78.54%, and the best random forest model reached 80.56%. Both models surpass the 75% accuracy threshold, and the accuracy recorded on the test data, 78.23%, is also above it. The best random forest model is slightly more accurate than the best decision tree model, so it is the model that should be delivered to Megaline.