# Utilizing Machine Learning to Recommend Ideal Data Plans for Customers

Megaline aims to modernize its subscriber base by recommending newer plans—Smart or Ultra—based on customer behavior analysis. Utilizing behavior data from subscribers who have already migrated to these plans, we embark on a classification task to develop a model capable of accurately selecting the appropriate plan. With data preprocessing already completed, our focus shifts to model creation. Our objective is to achieve a minimum accuracy threshold of 0.75, as assessed using the test dataset.

## Project instructions

1. Open and look through the data file. Path to the file:datasets/users_behavior.csv Download dataset

2. Split the source data into a training set, a validation set, and a test set.

3. Investigate the quality of different models by changing hyperparameters. Briefly describe the findings of the study.

4. Check the quality of the model using the test set.

5. Additional task: sanity check the model. This data is more complex than what you’re used to working with, so it's not an easy task. We'll take a closer look at it later.

Each entry in the dataset provides monthly behavioral insights for individual users. The data includes the following information:

    * сalls — number of calls,
    * minutes — total call duration in minutes,
    * messages — number of text messages,
    * mb_used — Internet traffic used in MB,
    * is_ultra — plan for the current month (Ultra - 1, Smart - 0).



The dataset will be partitioned into training, validation, and testing subsets. Both models will undergo training and fine-tuning using the training and validation data to optimize hyperparameters for enhanced accuracy. Following hyperparameter optimization, the models will be assessed using the test subset, and the model exhibiting the highest accuracy will be chosen as the ultimate model.

## Load Libraries 

In [64]:
# Import libraries

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Load Data

In the following code, the CSV file "users_behavior.csv" will be read and saved into the DataFrame named "df".

In [65]:
df = pd.read_csv('/datasets/users_behavior.csv')

## Explore the Dataset

To gain insights into the dataset stored in the DataFrame "df", let's examine the first 20 rows followed by general informatio

In [66]:
df.head(20)

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0
5,58.0,344.56,21.0,15823.37,0
6,57.0,431.64,20.0,3738.9,1
7,15.0,132.4,6.0,21911.6,0
8,7.0,43.39,3.0,2538.67,1
9,90.0,665.41,38.0,17358.61,0


df.head(20) is used to display the first 20 rows of the DataFrame df.

The output displays the first 20 rows of the DataFrame df, where each row represents a user and each column represents a feature or attribute of that user. The columns include:

* calls: the number of calls made by the user
* minutes: the total minutes of calls made by the user
* messages: the number of messages sent by the user
* mb_used: the amount of mobile data used by the user (in megabytes)
* is_ultra: a binary indicator (0 or 1) representing whether the user is on the Ultra plan (1) or not (0-Smart Plan).

In [67]:
# summary of the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


The DataFrame contains no missing values, and the existing data types are appropriate. While the "calls" and "messages" columns could potentially be integers instead of floats, retaining them as floats will not impact the modeling process. Consequently, no additional data preparation is required.

## Modeling

Two machine learning algorithms: 

1. Decision Tree 
2. Random Forest. 

The models demonstrating the highest accuracy will be selected for final evaluation. They will undergo testing using the test dataset, where their accuracy will be assessed to gauge performance and identify the superior model.

### Splitting into Datasets

The "is_ultra" column denotes the plan utilized by a customer, with a value of 0 representing the Smart plan and 1 representing the Ultra plan. As our objective is to suggest a plan to customers based on their data usage patterns, the "is_ultra" column will serve as our target variable. The remaining columns ("calls", "minutes", "messages", "mb_used") offer insights into each customer's data consumption behavior, influencing their choice between the Smart and Ultra plans. Thus, these columns will be our feature variables.

In [68]:
#DataFrame include all columns except for 'is_ultra'
features = df.drop('is_ultra', axis=1)

#target of the DataFrame is the 'is_ultra' column
target = df['is_ultra']

This code separates the DataFrame into two components:

1. Features: This variable contains all columns from the DataFrame except for the 'is_ultra' column. It is obtained by using the drop() function along the 'is_ultra' column axis (axis=1).

2. Target: This variable contains only the 'is_ultra' column from the DataFrame. It represents the target variable that we want to predict.

The features DataFrame and target Series data will be segmented into training, validation, and test datasets following a 3:1:1 ratio. Precisely, the training, validation, and test datasets will encompass 60%, 20%, and 20% of the data from features and target, respectively.

In [69]:
#To divide the given slices into training, 
#validation, and test datasets, we'll initially separate the training datasets from the validation and test data. 
#This entails splitting the data into the train datasets and "other" datasets. 
#The "other" datasets will encompass 40% of the data, designated for testing, training datasets will retain 60% of the data.
features_train, features_other, target_train, target_other  = train_test_split(features, target, test_size=0.4,\
                                                                               random_state=12345)

#Split the "other" datasets to create the validation and test datasets.
#Since the "other" dataset account for 40% of the original data and the validation and test datasets 
#contain 20% of the original data, the"other" datasets will be split in half. 
#test_size parameter set to 0.5 (for 50%).

features_valid, features_test, target_valid, target_test = train_test_split(features_other, target_other, test_size=0.5,\
                                                                            random_state=12345)

This code divides the given features and target data into training, validation, and test datasets using the train_test_split function from sklearn.

1. Initially, the code splits the features and target data into training datasets (features_train and target_train) and "other" datasets (features_other and target_other). The "other" datasets contain 40% of the original data and are designated for testing, while the training datasets retain 60% of the data.


2. Then, the "other" datasets are further split into validation and test datasets (features_valid, features_test, targets_valid, targets_test). Since the "other" dataset accounts for 40% of the original data and the validation and test datasets should each contain 20% of the original data, the "other" datasets are split in half. This means the test_size parameter is set to 0.5 (or 50%).

### Decision Tree model

Since the target column contains binary values (0 or 1), predicting these values constitutes a binary classification task. A decision tree algorithm proves ideal for such tasks. The maximum depth of the tree stands as a critical hyperparameter in this model. Hence, the subsequent code block generates various models with different maximum depths. Subsequently, each model's accuracy is assessed, and the one with the highest accuracy is showcased.

In [70]:
# Decision Tree Model/Learning Algorithm

# Initialize
best_model = None
best_DT_accuracy = 0
best_depth = 0

# Create various models with different depth values

# for loop for changing depth values (range of 1-41)
for depth in range(1,41):
    
    # Create a model, using the provided depth and the same random_state
    DT_model = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    
    # Train the model using the training dataset
    DT_model.fit(features_train, target_train)
    
    # Predict the target values of the validation features using the model
    DT_predictions_valid = DT_model.predict(features_valid) # get model predictions on validation set
    
    # Calculate the accuracy, if allowed
    try:
        accuracy = accuracy_score(target_valid, DT_predictions_valid)
    except:
        break
    
    # Determe best fit
    if accuracy > best_DT_accuracy:
        best_DT_model = DT_model
        best_DT_depth = depth
        best_DT_accuracy = accuracy

print('Best Model:', best_DT_model)
print(f'Best Accuracy: {round(best_DT_accuracy*100,2)}%')
print('Best Depth:', best_DT_depth)

Best Model: DecisionTreeClassifier(max_depth=3, random_state=12345)
Best Accuracy: 78.54%
Best Depth: 3


The provided code segment conducts an exhaustive search for the optimal Decision Tree model by iterating through various depths and evaluating their performances on a validation set. Initially, placeholders for the best model (best_model), its associated accuracy (best_DT_accuracy), and the corresponding depth (best_depth) are initialized.

Within a loop spanning depths from 1 to 40, Decision Tree models are instantiated and trained using a fixed random state. Subsequently, predictions are made on the validation features, and the accuracy of each model is computed by comparing the predicted values with the actual targets. Should any exceptions arise during this process, the loop is halted. If a model's accuracy surpasses the current best accuracy, the model, depth, and accuracy are updated accordingly.

Upon completion of the loop, the best-performing model along with its accuracy and depth are displayed. In the specific instance analyzed, the most effective model is a Decision Tree Classifier with a depth of 3, achieving an accuracy of approximately 78.54% on the validation set. This iterative approach facilitates the identification of an optimal depth for the decision tree model, thus mitigating the risks of overfitting or underfitting and enhancing predictive performance.

## Random Forest model

Random Forest model to forecast the target values. The maximum depth and the number of estimators serve as pivotal hyperparameters for this model. The number of estimators corresponds to the quantity of decision trees within the model. To pinpoint the most effective combination of hyperparameters, we'll train and assess models using various values for both the maximum depth and the number of estimators. This entails nested for loops to iterate through specified ranges for each hyperparameter. Subsequently, all generated models will undergo accuracy evaluation, and the Random Forest model exhibiting the highest accuracy will be highlighted.

In [71]:
# Random Forest model

# Initialize
best_model = None
best_result = 10000
best_est = 0
best_depth = 0
best_score = 0
best_RF_accuracy = 0

# Create various models with different depth and estimator values

# for loop for the number of estimators
for est in range(1,21):
    
    # for loop for the depth value
    for depth in range (1, 41):
        
        # Create a model, using the provided depth, number of estimators, and the same random_state
        RF_model = RandomForestClassifier(max_depth=depth, random_state=12345, n_estimators=est)
        
        # Train the model using the training dataset
        RF_model.fit(features_train, target_train)

        # Predict the target values of the validation features using the model
        RF_predictions_valid = RF_model.predict(features_valid) # get model predictions on validation set
       
        # Calculate the accuracy, if allowed
        try:
            accuracy = accuracy_score(target_valid, RF_predictions_valid)
        except:
            break

     # Determe best fit
        if accuracy > best_RF_accuracy:
            best_RF_model = RF_model
            best_RF_accuracy = accuracy
            best_RF_depth = depth
            best_est = est

print('Best Model:', best_RF_model)
print(f'Best Accuracy: {round(best_RF_accuracy*100,2)}%')
print('Best Depth:', best_RF_depth)
print('Best n_estimators:', best_est)

Best Model: RandomForestClassifier(max_depth=12, n_estimators=17, random_state=12345)
Best Accuracy: 80.56%
Best Depth: 12
Best n_estimators: 17


Random Forest model for classification. It iterates through various combinations of the number of estimators (ranging from 1 to 20) and the maximum depth of the trees (ranging from 1 to 40) to find the combination that yields the highest accuracy on a validation dataset.

Within the nested loops, a Random Forest model is instantiated and trained on a training dataset (features_train and target_train). Then, predictions are made on a validation dataset (features_valid), and the accuracy of these predictions is calculated using accuracy_score. If the accuracy of the current model is higher than the previously recorded best accuracy, the current model becomes the new best model.

Once all combinations have been evaluated, the code prints out the details of the best model found, including its maximum depth (best_RF_depth), the number of estimators (best_est), and the achieved accuracy (best_RF_accuracy).

The output indicates that the best model has a maximum depth of 12 and 17 estimators, achieving an accuracy of approximately 80.56%. This model is represented as a RandomForestClassifier instance with the specified parameters.

## Final Model

Now that the optimal hyperparameters for both the decision tree and random forest models have been determined, the best models of each type have been stored as best_DT_model and best_RF_model, respectively. The next step involves assessing these models by employing them to predict the target values of the test datasets and computing the accuracy of each model.

### Best Decision Tree Model

In [79]:
# Test the final decision tree model using the valid dataset

# Predict the target values
DT_validation_predictions = best_DT_model.predict(features_valid)

# Calculate the accuracy
DT_validation_accuracy = accuracy_score(target_valid, DT_validation_predictions)

# Print the results
print(f'Accuracy: {round(DT_validation_accuracy*100,2)}%')

Accuracy: 78.54%


Evaluates the final decision tree model's performance using a validation dataset. First, the model predicts the target values based on the features in the validation dataset. Then, the accuracy of these predictions is calculated by comparing them to the true target values from the validation set. The accuracy_score function computes the accuracy by dividing the number of correct predictions by the total number of predictions made. Finally, the code prints out the accuracy result, rounded to two decimal places and presented as a percentage. In this specific case, the result indicates that the decision tree model achieved an accuracy of approximately 78.54% on the validation dataset, signifying that around 78.54% of the predictions made by the model were correct.

###  Best Random Forest Model

In [80]:
# Test the final decision tree model using the valid dataset

# Predict the target values
RF_validation_predictions = best_RF_model.predict(features_valid)

# Calculate the accuracy
RF_validation_accuracy = accuracy_score(target_valid, RF_validation_predictions)

# Print the results
print(f'Accuracy: {round(RF_validation_accuracy*100,2)}%')

Accuracy: 80.56%


Evaluates the performance of the final Random Forest model using a validation dataset. Initially, the model predicts the target values based on the features present in the validation dataset by employing the predict method associated with the best_RF_model, which embodies the trained Random Forest model. Subsequently, the code calculates the accuracy of the model's predictions. This is accomplished by comparing the predicted target values (RF_validation_predictions) to the true target values derived from the validation dataset (target_valid) using the accuracy_score function. The accuracy_score function quantifies accuracy as the proportion of correct predictions to the total number of predictions made. Lastly, the code displays the accuracy result, rounded to two decimal places and expressed as a percentage. In this instance, the output "Accuracy: 80.56%" reveals that the Random Forest model attained an accuracy of approximately 80.56% on the validation dataset, indicating that approximately 80.56% of the model's predictions aligned with the actual values in the validation dataset.

## Final model

In [81]:
# Select the best model based on the validation accuracy
if DT_validation_accuracy > RF_validation_accuracy:
    best_model = DT_model
else:
    best_model = RF_model
    
# Test the final selected model using the test dataset
test_predictions = best_model.predict(features_test)
test_accuracy = accuracy_score(target_test, test_predictions)

# Print the results
print(f'Accuracy: {round(test_accuracy*100,2)}%')

Accuracy: 78.23%


The best model is selected based on the validation accuracy between the Decision Tree (DT) model and the Random Forest (RF) model. If the accuracy of the Decision Tree model (DT_validation_accuracy) is greater than that of the Random Forest model (RF_validation_accuracy), the best_model variable is assigned the value of the Decision Tree model (DT_model). Otherwise, if the Random Forest model's accuracy is equal to or higher than that of the Decision Tree model, best_model is assigned the value of the Random Forest model (RF_model).

Following the selection of the best model, the code proceeds to test the chosen model's performance using a test dataset. Predictions are made on the test dataset using the predict method associated with the best_model. Subsequently, the accuracy of these predictions is calculated by comparing them against the true target values from the test dataset using the accuracy_score function.

Finally, the code prints out the accuracy result. The accuracy is rounded to two decimal places and presented as a percentage. The output "Accuracy: 78.23%" indicates that the selected model achieved an accuracy of approximately 78.23% on the test dataset. This suggests that approximately 78.23% of the selected model's predictions corresponded with the actual values in the test dataset.

## Smart or Ultra

Users are assigned either the Smart or Ultra plan, represented as 0 or 1 in the target datasets or predicted values. Since recommending a plan is a binary classification task, there exists a baseline accuracy achievable by consistently predicting the majority class. This baseline accuracy corresponds to the proportion of the majority class in the dataset.

For instance, if the majority class is the Smart plan, the baseline accuracy would be approximately 70%. However, the final decision tree model and random forest model achieved higher accuracies of 78.54% and 80.56%, respectively. These figures notably surpass the 70% baseline accuracy obtained by always suggesting the majority class. Consequently, it is reasonable to employ either of these final trained models to determine the appropriate plan recommendation for each customer.

## Conclusion 

In summary, a comprehensive process for model selection and evaluation in a binary classification task involving plan recommendations. It begins by partitioning the dataset into training, validation, and test sets using the train_test_split function. Then, it iteratively explores various depths for decision tree models and combinations of parameters for random forest models to identify the best-performing model on the validation set.

The decision tree and random forest models are rigorously assessed, with their respective accuracies computed on the validation set. Subsequently, the final model is selected based on its performance on the validation set, and its accuracy is evaluated on the test set to gauge its real-world effectiveness.

Notably, the achieved accuracies of the final models significantly surpass the baseline accuracy derived from predicting the majority class. For instance, if the majority class is the Smart plan, the baseline accuracy would be approximately 70%. However, both the decision tree and random forest models attain accuracies exceeding 78% and 80%, respectively, showcasing their efficacy in making plan recommendations.

Ultimately, the results affirm the suitability of either the decision tree or random forest model for determining plan recommendations, as their accuracies substantially outperform the baseline. This underscores the value of employing machine learning models in optimizing plan recommendations for individual customers.