<div style="background-color: #ADD8E6; padding: 20px; border-radius: 10px; box-shadow: 2px 2px 10px grey;">

#### Project Introduction: Subscriber Behavior Analysis for Mobile Plan Recommendation

**Context:**  
Mobile carrier Megaline has found that a significant share of its customers still use legacy service plans. It wants to encourage them to switch to one of its newer plans, Smart or Ultra. To support this, Megaline wants to use subscriber behavior data to recommend the most suitable plan to each customer.

**Objective:**  
The primary goal of this project is to develop a predictive model capable of analyzing subscriber behavior and accurately recommending one of Megaline's newer plans. The model will classify users into either the Smart or Ultra plan based on their usage patterns.

**Data Overview:**  
The dataset at our disposal (`users_behavior.csv`) contains detailed monthly behavior information of subscribers who have already shifted to the new plans. Key data points include:
- `calls`: Number of calls made.
- `minutes`: Total duration of calls.
- `messages`: Number of text messages sent.
- `mb_used`: Internet data consumption in MB.
- `is_ultra`: Current plan of the user (Ultra - 1, Smart - 0).

**Methodology:**  
- **Data Preparation**: The data will be split into training, validation, and test sets to evaluate the model's performance effectively.
- **Model Development**: We will explore various machine learning models, tweaking hyperparameters to optimize performance. The target is to achieve an accuracy score above 0.75.
- **Model Evaluation**: Post-development, the model's accuracy will be validated using the test dataset.
- **Additional Analysis**: A sanity check will be conducted to confirm that the model's predictions are meaningful and not an artifact of the data.

**Project Significance:**  
This project demonstrates the application of machine learning in customer behavior analysis and decision support systems. By achieving the set accuracy threshold, the model will enable Megaline to enhance their service offerings, aligning them more closely with user preferences and behaviors.

</div>


In [1]:
# Import packages
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [2]:
# Load and view data
df = pd.read_csv("/datasets/users_behavior.csv")
df.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


In [3]:
total_count = df['is_ultra'].value_counts()
total_count

0    2229
1     985
Name: is_ultra, dtype: int64

In [4]:
# Step 2: Split the source data into a training set, a validation set, and a test set.

df_train, df_valid_and_test = train_test_split(df, test_size=0.4, random_state=42)
df_valid, df_test = train_test_split(df_valid_and_test, test_size=0.5, random_state=42)

features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']

features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']

features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']

print(features_train.shape)
print(target_train.shape)
print(features_valid.shape)
print(target_valid.shape)
print(features_test.shape)
print(target_test.shape)

(1928, 4)
(1928,)
(643, 4)
(643,)
(643, 4)
(643,)


<div class="alert alert-success">
<b>Reviewer's comment</b>

The data was split into training, validation, and test sets. The proportions are reasonable.

</div>

<div style="background-color: #ADD8E6; padding: 20px; border-radius: 10px; box-shadow: 2px 2px 10px grey;">
<h2> Student's comment</h2>
    
The reason for splitting the dataset into training, validation, and test sets is to evaluate and compare the performance of machine learning models accurately. The training set is used to train the models, the validation set is used for hyperparameter tuning and model selection, and the test set is used for the final evaluation of the chosen model's performance. By using separate sets, we can assess how well the model generalizes to unseen data.
    
Printing the shape of each set confirms that the split worked as intended. The training set has 1,928 rows, and the validation and test sets have 643 rows each, which is roughly a 60/20/20 split. Every feature set has four columns (`calls`, `minutes`, `messages`, `mb_used`), and each target is a single column of matching length.
    
</div>
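
Because the classes are imbalanced (2,229 Smart vs. 985 Ultra), one optional refinement is a stratified split, which keeps the class proportions the same in every subset. The sketch below is not part of the original analysis: it uses a synthetic stand-in for `users_behavior.csv` with made-up values, and the names `demo`, `demo_train`, `demo_valid`, and `demo_test` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for users_behavior.csv (all values are made up for illustration)
rng = np.random.default_rng(42)
n_rows = 1000
demo = pd.DataFrame({
    "calls": rng.integers(0, 200, n_rows),
    "minutes": rng.uniform(0, 1500, n_rows),
    "messages": rng.integers(0, 200, n_rows),
    "mb_used": rng.uniform(0, 40000, n_rows),
    "is_ultra": (rng.random(n_rows) < 0.3).astype(int),
})

# 60% training, then the remaining 40% is halved into validation and test.
# stratify= keeps the share of Ultra users the same in every subset.
demo_train, demo_rest = train_test_split(
    demo, test_size=0.4, random_state=42, stratify=demo["is_ultra"]
)
demo_valid, demo_test = train_test_split(
    demo_rest, test_size=0.5, random_state=42, stratify=demo_rest["is_ultra"]
)

# Print the size and the Ultra share of each subset
for name, part in [("train", demo_train), ("valid", demo_valid), ("test", demo_test)]:
    print(name, part.shape, round(part["is_ultra"].mean(), 3))
```

With stratification, the Ultra share printed for each subset should be nearly identical, which makes accuracy figures on the validation and test sets easier to compare.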

<div class="alert alert-success">
<b>Reviewer's comment</b>

Good!

</div>

In [5]:
# Step 3: Investigate the quality of different models by changing hyperparameters. 
# Briefly describe the findings of the study.
# Decision Tree

decision_tree_cols = ['depth', 'acc_train', 'acc_valid']
decision_tree_list = []

for depth in range(1, 11):
    model_dt = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model_dt.fit(features_train, target_train)
    decision_tree_list.append([depth,
                              model_dt.score(features_train, target_train),
                              model_dt.score(features_valid, target_valid)
                              ])
    
decision_tree = pd.DataFrame(decision_tree_list, columns=decision_tree_cols)
decision_tree

Unnamed: 0,depth,acc_train,acc_valid
0,1,0.747925,0.730949
1,2,0.781639,0.782271
2,3,0.797199,0.791602
3,4,0.80861,0.780715
4,5,0.815353,0.772939
5,6,0.822614,0.77605
6,7,0.838693,0.780715
7,8,0.852178,0.796267
8,9,0.863589,0.780715
9,10,0.878631,0.794712


<div style="background-color: #ADD8E6; padding: 20px; border-radius: 10px; box-shadow: 2px 2px 10px grey;">
<h2> Student's comment</h2>

This cell examines how the maximum depth affects Decision Tree performance by training models with depths from 1 to 10. Training accuracy rises steadily with depth, from 0.748 at depth 1 to 0.879 at depth 10, as the tree fits the training data more closely. Validation accuracy climbs to 0.792 at depth 3 and then fluctuates between 0.773 and 0.796, with the highest value (0.796) at depth 8. Because validation accuracy barely changes while training accuracy keeps growing, the gap between the two widens from under 0.01 at depth 3 to more than 0.08 at depth 10, which points to overfitting in the deeper trees. Increasing the depth therefore improves the fit to the training set without a matching gain in generalization. A depth of 3 is a reasonable choice: its validation accuracy is within 0.005 of the best result, and its gap between training and validation accuracy is far smaller than at depth 8.
    
</div>
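
The depth comparison can also be made explicit in code. The sketch below copies the accuracies from the results table above into a new DataFrame, finds the depth with the highest validation accuracy, and then picks the shallowest depth whose validation accuracy is within a small tolerance of that best value. The `tolerance` of 0.005 and the names `dt_results`, `best_row`, and `simplest` are assumptions made for this illustration.

```python
import pandas as pd

# Accuracies copied from the decision tree results table above
dt_results = pd.DataFrame({
    "depth": list(range(1, 11)),
    "acc_train": [0.747925, 0.781639, 0.797199, 0.808610, 0.815353,
                  0.822614, 0.838693, 0.852178, 0.863589, 0.878631],
    "acc_valid": [0.730949, 0.782271, 0.791602, 0.780715, 0.772939,
                  0.776050, 0.780715, 0.796267, 0.780715, 0.794712],
})

# Gap between training and validation accuracy: a rough overfitting indicator
dt_results["gap"] = dt_results["acc_train"] - dt_results["acc_valid"]

# Depth with the highest validation accuracy
best_row = dt_results.loc[dt_results["acc_valid"].idxmax()]

# Hypothetical tolerance: accept any depth within 0.005 of the best validation score
tolerance = 0.005
candidates = dt_results[dt_results["acc_valid"] >= best_row["acc_valid"] - tolerance]

# Among the acceptable depths, prefer the shallowest (simplest) tree
simplest = candidates.sort_values("depth").iloc[0]

print("Best validation accuracy at depth:", int(best_row["depth"]))
print("Shallowest depth within tolerance:", int(simplest["depth"]))
print("Train/validation gap at that depth:", round(simplest["gap"], 4))
```

With these numbers, the highest validation accuracy is at depth 8, while depth 3 is the shallowest tree within the tolerance and has a train/validation gap of about 0.006.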

In [6]:
# Random Forest

random_forest_cols = ['estimator', 'acc_train', 'acc_valid']
random_forest_list = []

for estimator in range(10, 101, 10):
    model_rf = RandomForestClassifier(n_estimators=estimator, random_state=42)
    model_rf.fit(features_train, target_train)
    random_forest_list.append([estimator,
                              model_rf.score(features_train, target_train),
                              model_rf.score(features_valid, target_valid)
                              ])
    
random_forest = pd.DataFrame(random_forest_list, columns=random_forest_cols)
random_forest

Unnamed: 0,estimator,acc_train,acc_valid
0,10,0.98029,0.786936
1,20,0.991701,0.791602
2,30,0.994813,0.793157
3,40,0.997925,0.796267
4,50,0.997925,0.796267
5,60,0.999481,0.796267
6,70,0.998963,0.797823
7,80,1.0,0.800933
8,90,1.0,0.802488
9,100,1.0,0.802488


<div style="background-color: #ADD8E6; padding: 20px; border-radius: 10px; box-shadow: 2px 2px 10px grey;">
    <h2> Student's comment</h2>

This cell examines how the number of estimators (trees) affects Random Forest performance. Validation accuracy rises slowly from 0.787 with 10 trees to 0.802 with 90 trees, and adding 10 more trees brings no further gain.
Training accuracy is already 0.98 with 10 trees and reaches 1.0 from 80 trees onward, so every configuration fits the training set almost perfectly and the gap to validation accuracy stays near 0.2.
Based on these results, 90 estimators is a sensible choice: it is the smallest value that attains the best validation accuracy (0.8025). The returns diminish quickly, since the total improvement across the whole range is under 0.02, so larger forests mainly add training cost. The large gap between training and validation accuracy suggests that limiting tree complexity, for example through `max_depth`, may be worth exploring.
    
</div>
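
Only `n_estimators` was varied here, so the individual trees were grown without a depth limit. A possible follow-up, sketched below, is to vary `max_depth` together with `n_estimators` and compare validation accuracy. This is an illustration under assumptions, not part of the original study: the data comes from `make_classification` as a synthetic stand-in for the Megaline features, and the grid values and the names `rf_grid` and `forest` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 4-feature, imbalanced data standing in for the real features
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=42,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

rows = []
for n_estimators in (10, 50, 90):
    for max_depth in (4, 8, None):  # None means trees grow until leaves are pure
        forest = RandomForestClassifier(
            n_estimators=n_estimators, max_depth=max_depth, random_state=42
        )
        forest.fit(X_train, y_train)
        rows.append({
            "n_estimators": n_estimators,
            "max_depth": str(max_depth),
            "acc_train": forest.score(X_train, y_train),
            "acc_valid": forest.score(X_valid, y_valid),
        })

# One row per (n_estimators, max_depth) combination, best validation score first
rf_grid = pd.DataFrame(rows).sort_values("acc_valid", ascending=False)
print(rf_grid)
```

On the real data, the same loop would use `features_train`, `target_train`, `features_valid`, and `target_valid`; whether a depth limit actually helps there would need to be checked on the validation set.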

In [7]:
# Logistic Regression

solver_list = ['lbfgs', 'liblinear', 'newton-cg','sag', 'saga']
logistic_regression_cols = ['solver', 'acc_train', 'acc_valid']
logistic_regression_list = []

for solver_item in solver_list:
    model_lr = LogisticRegression(random_state=42, solver=solver_item)
    model_lr.fit(features_train, target_train)
    logistic_regression_list.append([solver_item,
                              model_lr.score(features_train, target_train),
                              model_lr.score(features_valid, target_valid)
                              ])

logistic_regression = pd.DataFrame(logistic_regression_list, columns=logistic_regression_cols)
logistic_regression

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Unnamed: 0,solver,acc_train,acc_valid
0,lbfgs,0.743257,0.74028
1,liblinear,0.713693,0.720062
2,newton-cg,0.743257,0.74028
3,sag,0.693983,0.698289
4,saga,0.693465,0.695179


<div style="background-color: #ADD8E6; padding: 20px; border-radius: 10px; box-shadow: 2px 2px 10px grey;">
<h2> Student's comment</h2>

Logistic regression gives the lowest accuracy of the three model types, although it is fast to train. Among the solvers, lbfgs and newton-cg reach the highest accuracy on both the training set (0.743) and the validation set (0.740). None of the solvers exceeds the 0.75 target. The convergence warning in the output shows that at least one solver hit the iteration limit, so these scores might change with scaled features or a larger `max_iter`.
    
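The random forest model achieves the highest validation accuracy (0.802) but shows clear signs of overfitting, with training accuracy at or near 1.0. Averaging many trees usually reduces variance compared with a single tree, so the overfitting here most likely comes from the individual trees being grown to full depth, since `max_depth` was left at its default.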
    
The decision tree model comes in second place. If the tree is too shallow (depth 1), the model underfits and does not capture the complexity of the data, reaching a validation accuracy of only 0.731. If the depth exceeds 3, training accuracy keeps rising while validation accuracy stays roughly flat, so the model becomes increasingly specific to the training data and may not generalize well to new data.
    
Logistic regression has the lowest prediction quality and falls short of the target, but its training and validation accuracies are nearly identical, so it shows no sign of overfitting. It appears to be too simple to capture the structure of the data as well as the tree-based models do.
    
</div>    
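
The convergence warning in the output above recommends either raising `max_iter` or scaling the data. The sketch below illustrates the scaling option by placing `StandardScaler` in front of `LogisticRegression` in a pipeline. It is not part of the original analysis: the features and target are synthetic, chosen only to mimic columns on very different scales, and the name `scaled_lr` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n_rows = 1000

# Synthetic features on very different scales, loosely mimicking
# calls, minutes, messages, and mb_used (values are made up)
X = np.column_stack([
    rng.uniform(0, 200, n_rows),
    rng.uniform(0, 1500, n_rows),
    rng.uniform(0, 200, n_rows),
    rng.uniform(0, 40000, n_rows),
])

# Made-up binary target that depends on two of the features plus noise
signal = X[:, 1] / 1500 + X[:, 3] / 40000 + rng.normal(0, 0.3, n_rows)
y = (signal > 1.2).astype(int)

# Standardize each feature to zero mean and unit variance before fitting
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver="lbfgs", random_state=42),
)
scaled_lr.fit(X, y)

# n_iter_ reports how many iterations the solver actually used
n_iter = scaled_lr.named_steps["logisticregression"].n_iter_[0]
print("Iterations used:", n_iter)
print("Training accuracy:", scaled_lr.score(X, y))
```

Using a pipeline means the scaler is fitted only on the data passed to `fit`, so the same object can later be scored on validation or test features without leaking their statistics into training.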

<div class="alert alert-success">
<b>Reviewer's comment</b>

Great, you tried a couple of different models and tuned their hyperparameters using the validation sets

</div>

In [9]:
# Step 4: Check the quality of the model using the test set.
# Step 5: sanity check the model. 
# This data is more complex than what you’re used to working with, 
# so it's not an easy task. We'll take a closer look at it later.

# Define and train the random forest classifier
model = RandomForestClassifier(n_estimators=90, random_state=42)
model.fit(features_train, target_train)

# Calculate and print the accuracy of the model on different datasets
train_accuracy = model.score(features_train, target_train)
valid_accuracy = model.score(features_valid, target_valid)
test_accuracy = model.score(features_test, target_test)

# Display the accuracy results
print('Accuracy of the model on the training set:', train_accuracy)
print('Accuracy of the model on the validation set:', valid_accuracy)
print('Accuracy of the model on the test set:', test_accuracy)

Accuracy of the model on the training set: 1.0
Accuracy of the model on the validation set: 0.80248833592535
Accuracy of the model on the test set: 0.8133748055987559


<div style="background-color: #ADD8E6; padding: 20px; border-radius: 10px; box-shadow: 2px 2px 10px grey;">
<h2> Student's comment</h2>
    
The Random Forest model with 90 estimators reaches an accuracy of about 81.3% on the test set, which is above the 75% target.
    
</div>   
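
One common way to sanity check a classifier is to compare it with a trivial baseline that always predicts the most frequent class. The sketch below builds such a baseline with `DummyClassifier`, using only the class counts from the `value_counts()` output earlier in the notebook (2,229 Smart, 985 Ultra). It is an illustration, not the original author's check: the baseline is computed on the full dataset and not on the test split, and the names `baseline` and `baseline_accuracy` are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class counts taken from the value_counts() output earlier in the notebook
n_smart, n_ultra = 2229, 985
target = np.array([0] * n_smart + [1] * n_ultra)

# DummyClassifier ignores feature values, so a column of zeros is enough here
features = np.zeros((target.size, 1))

# Always predict the most frequent class (Smart, label 0)
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(features, target)
baseline_accuracy = baseline.score(features, target)

print("Majority-class baseline accuracy:", round(baseline_accuracy, 4))
```

The baseline accuracy equals the share of Smart users, 2229 / 3214, or about 0.69. The test accuracy of about 0.81 reported above is clearly higher, which suggests the model has learned something beyond the class imbalance.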

<div class="alert alert-success">
<b>Reviewer's comment</b>

The final model was evaluated on the test set for an unbiased estimate of its generalization performance

</div>