# **Megaline Plan Recommendation: Predicting the Most Suitable Plan for Subscribers**


## **üì± Project Description**

<div style="border: 2px solid #66b3ff; border-radius: 10px; padding: 12px; background-color: #f0f8ff; font-family: sans-serif; font-size: 12px;">
‚û°Ô∏è In this project, our objective is to assist Megaline in enhancing its customer experience by recommending more suitable plans to subscribers currently using legacy options. We will analyze the behavior of users who have already transitioned to Megaline's newer plans‚ÄîSmart and Ultra‚Äîand develop a predictive model designed to identify the most appropriate plan for each subscriber.<br><br> 
We will utilize behavioral data from subscribers who have already switched to the new plans. With the data preprocessing phase already completed, our focus will be on constructing and fine-tuning a classification model to accurately predict the best plan for each user.<br><br>
The primary goal is to develop a model that meets a minimum accuracy threshold of 0.75, to be validated using the provided test dataset.
</div>

### **üß∞ Environment Setup and Required Libraries**

In [1]:
# Import all required libraries

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score


## **üì• Step 1: Loading and Initial Data Exploration**

<div style="border: 2px solid #66b3ff; border-radius: 10px; padding: 12px; background-color: #f0f8ff; font-family: sans-serif; font-size: 12px;">
‚û°Ô∏è First, we load the dataset and perform a basic inspection to understand its structure, dimensions, and types of variables before diving into deeper analysis.
</div>

In [2]:
# Load the dataset
df = pd.read_csv('/datasets/users_behavior.csv')

In [3]:
# Display basic information about the dataset 
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [4]:
df.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


In [5]:
# Explore duplicates
print(df.duplicated().sum())  

0


In [6]:
# Count users who are in each plan
ultra_users= df[df['is_ultra'] == 1].shape[0]
smart_users = df[df['is_ultra'] == 0].shape[0]
total_users = df.shape[0]

ultra_percentage = (ultra_users / total_users) * 100
smart_percentage = (smart_users / total_users) * 100

# Print the counts and percentages
print(f'Number of Ultra users: {ultra_users} ({ultra_percentage:.2f}%)')
print(f'Number of Smart users: {smart_users} ({smart_percentage:.2f}%)')

Number of Ultra users: 985 (30.65%)
Number of Smart users: 2229 (69.35%)


### üîé **Initial Data Overview Summary**

<div style="border: 2px solid #66b3ff; border-radius: 10px; padding: 12px; background-color: #f0f8ff; font-family: sans-serif; font-size: 12px;">
‚û°Ô∏è The dataset consists of 3,214 entries, each representing a subscriber's usage data. It contains 5 columns: 
<pre style="background-color: #f0f8ff; color: #333; font-family: monospace; padding: 10px;">calls, minutes, messages, mb_used, is_ultra</pre>
The column types are as follows: `calls`, `minutes`, and `mb_used` are numeric (float), while `is_ultra` is an integer indicating whether the user is subscribed to the Ultra plan (1) or not (0).<br><br> 
The data does not contain duplicates, as confirmed by the check for duplicate entries, returning 0 duplicates.<br><br>

### Variable Descriptions:
- **calls**: Number of calls made by the subscriber during the month.
- **minutes**: Total duration of calls made in minutes.
- **messages**: Number of text messages sent by the subscriber.
- **mb_used**: Total amount of Internet traffic used in megabytes (MB).
- **is_ultra**: Target variable indicating the subscription plan for the current month (Ultra = 1, Smart = 0).

### Distribution of Users:
- **Number of Ultra users**: 985 (30.65%)
- **Number of Smart users**: 2,229 (69.35%)

The distribution of users indicates that a significant majority of subscribers are using the Smart plan (69.35%). Only 30.65% of subscribers have upgraded to the Ultra plan. <br><br>

### Model Selection:
Given the imbalanced nature of the data, selecting the right classification model is crucial. Some suitable models to consider are:

- **Logistic Regression**: A simple and interpretable model that works well for binary classification tasks. By adjusting class weights, it can handle imbalanced datasets.
- **Decision Trees**: A flexible model that can capture non-linear relationships between features. It also allows for easy interpretation of how features affect predictions.
- **Random Forest**: An ensemble of decision trees that improves predictive performance and helps reduce overfitting. It works well with imbalanced datasets by averaging multiple decision trees, which reduces the bias towards the majority class.

</div>

### üîé **Step 2: Split Data into Training, Validation, and Test Sets**

<div style="border: 2px solid #66b3ff; border-radius: 10px; padding: 12px; background-color: #f0f8ff; font-family: sans-serif; font-size: 12px;">
‚û°Ô∏è In this step, we will split the dataset into three subsets: the <strong>training set</strong>, the <strong>validation set</strong>, and the <strong>test set</strong>. This division ensures proper model evaluation and helps prevent overfitting.<br><br>

The <strong>training set</strong> will be used to train the model, the <strong>validation set</strong> will be used for model tuning and hyperparameter optimization, and the <strong>test set</strong> will be used to evaluate the final performance of the model on unseen data.<br><br>

### Typical Split Ratios:
- <strong>70% for training</strong>: Used to train the model.
- <strong>15% for validation</strong>: Used for tuning hyperparameters and selecting the best model.
- <strong>15% for testing</strong>: Used to evaluate the model's generalization ability on new, unseen data.

</div> 

In [7]:

# Extract features and target
features = df.drop('is_ultra', axis=1)  # All columns except the target
target = df['is_ultra']  # Target variable

# Split the data into training and temporary sets (validation + test)
features_train, features_temp, target_train, target_temp = train_test_split(features, target, test_size=0.3, random_state=12345)

# Further split the temporary set into validation and test sets
features_valid, features_test, target_valid, target_test = train_test_split(features_temp, target_temp, test_size=0.5, random_state=12345)

# Check the shape of the resulting sets

print(f'Total samples: {features.shape[0]}')
print(f'Training set size: {features_train.shape[0]}')
print(f'Validation set size: {features_valid.shape[0]}')
print(f'Test set size: {features_test.shape[0]}')


Total samples: 3214
Training set size: 2249
Validation set size: 482
Test set size: 483



### üîé **Data Splitting Overview: Training, Validation, and Test Sets**

<div style="border: 2px solid #66b3ff; border-radius: 10px; padding: 12px; background-color: #f0f8ff; font-family: sans-serif; font-size: 12px;">
‚û°Ô∏è In this step, the dataset was split into three distinct sets: the <strong>training set</strong>, the <strong>validation set</strong>, and the <strong>test set</strong>. This division is essential for building and evaluating the machine learning model to prevent overfitting and ensure that the model generalizes well on unseen data.<br><br>

### Split Ratios:
- <strong>70% Training Set</strong>: The model is trained using this set, which consists of 2,249 samples and 4 features.
- <strong>15% Validation Set</strong>: Used to tune the model and select the best hyperparameters, containing 482 samples and 4 features.
- <strong>15% Test Set</strong>: Used to evaluate the model's performance on unseen data, containing 483 samples and 4 features.<br><br>

The purpose of splitting the data in this manner is to ensure the model is trained on a sufficient amount of data, while also validating and testing it on separate, unseen data. This helps assess how well the model will perform in a real-world scenario.<br><br>

</div>

## üîé **Step 3: Choosing a Model**

<div style="border: 2px solid #66b3ff; border-radius: 10px; padding: 12px; background-color: #f0f8ff; font-family: sans-serif; font-size: 12px;">
‚û°Ô∏è In this step, we will select an appropriate model for the classification task based on the nature of the data and the problem at hand. The goal is to choose a model that can accurately predict the subscription plan (Ultra or Smart) based on the features provided (calls, minutes, messages, and MB used).<br>

### Model Options:
- **Logistic Regression**: A simple and interpretable model often used for binary classification tasks. It works well when the relationship between the features and the target is approximately linear. 
- **Decision Trees**: A flexible, non-linear model that can capture complex relationships between features. It works well for both small and large datasets and provides clear visual interpretations of how decisions are made.
- **Random Forest**: An ensemble method that uses multiple decision trees to improve predictive accuracy and reduce overfitting. It‚Äôs robust and performs well on many datasets, especially when dealing with noisy data.<br>

After selecting a model, we will proceed to train it on the training set and evaluate its performance on the validation and test sets.
</div>


### üîé 3.1. Testing Accuracy for Logistic Regression

<div style="border: 2px solid #66b3ff; border-radius: 10px; padding: 12px; background-color: #f0f8ff; font-family: sans-serif; font-size: 12px;">
    
‚û°Ô∏è In this step, we are evaluating the performance of the **Logistic Regression model** by testing its accuracy on the validation dataset.
</div>


In [8]:
# Create Logistic Regression model with random_state=12345
log_reg_model = LogisticRegression(class_weight='balanced', random_state=12345)

# Train the model
log_reg_model.fit(features_train, target_train)

# Predict on validation data
log_reg_pred = log_reg_model.predict(features_valid)

# Evaluate the model on validation data
log_reg_acc = accuracy_score(target_valid, log_reg_pred)

# Print the validation accuracy
print(f'Logistic Regression Validation Accuracy: {log_reg_acc:.4f}')


Logistic Regression Validation Accuracy: 0.3631


In [9]:
# Compare Training and Validation Accuracy

# Predictions on training and validation sets
train_pred = log_reg_model.predict(features_train)
valid_pred = log_reg_model.predict(features_valid)

# Calculate accuracy on both sets
train_acc = accuracy_score(target_train, train_pred)
valid_acc = accuracy_score(target_valid, valid_pred)

# Print results
print(f'Training Accuracy: {train_acc:.4f}')
print(f'Validation Accuracy: {valid_acc:.4f}')

Training Accuracy: 0.3744
Validation Accuracy: 0.3631


### üîé 3.2. Testing Accuracy for Decision Tree

<div style="border: 2px solid #66b3ff; border-radius: 10px; padding: 12px; background-color: #f0f8ff; font-family: sans-serif; font-size: 12px;">
    
‚û°Ô∏è In this step, we are evaluating the performance of the **Decision Tree** model by testing its accuracy on the validation dataset. 
</div>

In [10]:
# Track best model
best_dtc_model = None
best_result = 0
best_depth = 0

# < Create a loop for max_depth from 1 to 5 >
for depth in range(1, 6):
    dtc_model = DecisionTreeClassifier(random_state=12345, max_depth=depth,class_weight='balanced')  # Use variable depth
    dtc_model.fit(features_train, target_train)  # Train model
    predictions = dtc_model.predict(features_valid)  # Predict on validation set
    result = accuracy_score(target_valid, predictions)  # Calculate accuracy
    
    # Track the best model and the best depth
    if result > best_result:
        best_dtc_model = dtc_model
        best_result = result
        best_depth = depth
    
    print(f"max_depth = {depth} : Accuracy = {result:.4f}")

# After the loop, print the best result
print("\nBest Model:")
print(f"Best max_depth: {best_depth}")
print(f"Best Accuracy: {best_result:.4f}")

max_depth = 1 : Accuracy = 0.7510
max_depth = 2 : Accuracy = 0.7842
max_depth = 3 : Accuracy = 0.7905
max_depth = 4 : Accuracy = 0.7344
max_depth = 5 : Accuracy = 0.7697

Best Model:
Best max_depth: 3
Best Accuracy: 0.7905


In [11]:
# Compare Training and Validation Accuracy
# Use the best max_depth to train the final model
best_dtc_model = DecisionTreeClassifier(random_state=12345, max_depth=3,class_weight='balanced')
best_dtc_model.fit(features_train, target_train)

# Predictions on training and validation sets
train_pred = best_dtc_model.predict(features_train)
valid_pred = best_dtc_model.predict(features_valid)

# Calculate accuracy on both sets
train_acc = accuracy_score(target_train, train_pred)
valid_acc = accuracy_score(target_valid, valid_pred)

# Print results for training and validation accuracy
print(f'Training Accuracy: {train_acc:.4f}')
print(f'Validation Accuracy: {valid_acc:.4f}')

Training Accuracy: 0.7981
Validation Accuracy: 0.7905


### üîé 3.3. Testing Accuracy for Random Forest

<div style="border: 2px solid #66b3ff; border-radius: 10px; padding: 12px; background-color: #f0f8ff; font-family: sans-serif; font-size: 12px;">
    
‚û°Ô∏è In this step, we are evaluating the performance of the **Random Forest model** by testing its accuracy on the validation dataset. Random Forest is an ensemble method that combines multiple decision trees to improve predictive performance and reduce overfitting.
</div>

In [12]:
# Track best model
best_score = 0
best_est = 0

# Loop through different values for the number of estimators (n_estimators)
for est in range(1, 11):  # We test from 1 to 10 trees
    rf_model = RandomForestClassifier(random_state=12345, n_estimators=est, class_weight='balanced')  # Set the number of trees
    
    rf_model.fit(features_train, target_train)  # Train model on training set
    predictions = rf_model.predict(features_valid)  # Predict on validation set
    score = accuracy_score(target_valid, predictions)  # Calculate accuracy score on validation set
    
    # Track the best model based on validation accuracy
    if score > best_score:
        best_score = score
        best_est = est  # Save the number of estimators corresponding to best accuracy score

print(f"Accuracy of the best Random Forest model on the validation set (n_estimators = {best_est}): {best_score:.4f}")


Accuracy of the best Random Forest model on the validation set (n_estimators = 8): 0.7925


In [13]:
# Compare Training and Validation Accuracy
# Use the best n_estimators to train the final model
best_rf_model = RandomForestClassifier(random_state=12345, n_estimators=8, class_weight='balanced')
best_rf_model.fit(features_train, target_train)

# Predictions on training and validation sets
train_pred = best_rf_model.predict(features_train)
valid_pred = best_rf_model.predict(features_valid)

# Calculate accuracy on both sets
train_acc = accuracy_score(target_train, train_pred)
valid_acc = accuracy_score(target_valid, valid_pred)

# Print results for training and validation accuracy
print(f'Training Accuracy: {train_acc:.4f}')
print(f'Validation Accuracy: {valid_acc:.4f}')

Training Accuracy: 0.9795
Validation Accuracy: 0.7925


### üîé **Model Selection Justification: Random Forest with n_estimators=8**

<div style="border: 2px solid #66b3ff; border-radius: 10px; padding: 12px; background-color: #f0f8ff; font-family: sans-serif; font-size: 12px;">
    
‚û°Ô∏è After evaluating multiple models (Logistic Regression, Decision Tree, and Random Forest) on our dataset, we have selected the **Random Forest with n_estimators=8** as the best model for this task. Here's a summary of the reasoning behind our decision:


### Model Evaluation:

- **Logistic Regression**:
  - Logistic Regression showed **poor performance** with both **training** and **validation accuracy** values below 0.4. This suggests that the model is **underfitting** and unable to capture the underlying patterns in the data, making it unsuitable for this task.

- **Decision Tree Classifier (DTC)**:
  - The **Decision Tree** model provided **good performance** with a **training accuracy of 0.7981** and **validation accuracy of 0.7905**. The gap between training and validation accuracy is **minimal**, indicating that the model is **not overfitting**. The Decision Tree also offers high **interpretability**, making it a suitable choice for understanding how features influence the target variable. However, its performance in terms of accuracy was not as high as that of **Random Forest**.

- **Random Forest (RF)**:
  - The **Random Forest** model performed well on the **training set** with an accuracy of **0.9795**, but it showed a **significant drop in validation accuracy** (0.7925), suggesting **overfitting**. Despite this, it obtained the **highest validation accuracy** compared to other models, making it the most reliable model for making predictions, especially when **fine-tuned**.
  - Additionally, the **Random Forest model** exceeded the project‚Äôs **accuracy threshold of 0.75**, achieving **validation accuracy of 0.7925**, which aligns with the primary goal of the project.

### Final Decision:
- Based on the evaluation results, we have chosen the **Random Forest** model with **n_estimators=8**. This model demonstrated the best balance between **performance** and **accuracy** on the validation set, despite showing signs of **overfitting** in the training set. It **meets the project‚Äôs accuracy goal** and, with further **hyperparameter tuning**, has the potential to perform even better, especially in predicting the **Ultra users**, which is critical for this project.

</div>

## **Step 4. Model Evaluation on Test Set**

<div style="border: 2px solid #66b3ff; border-radius: 10px; padding: 12px; background-color: #f0f8ff; font-family: sans-serif; font-size: 12px;">
    
‚û°Ô∏è In this step, we evaluate the performance of the final model, the **Random Forest** with **n_estimators = 8**, on the **test set**. This is crucial to assess the model's ability to generalize to **completely unseen data** and ensure it performs well in a real-world scenario.<br><br>
The test set represents data that was not used during training or hyperparameter tuning, providing an unbiased estimate of model performance.


### Evaluation Metrics:
We will assess the model‚Äôs performance on the test set using various metrics:

- **Accuracy**: The percentage of correct predictions. It gives a general measure of the model‚Äôs performance but can be misleading when dealing with imbalanced classes.
- **Precision**: The proportion of true positive predictions among all positive predictions. It tells us how many of the predicted **Ultra users** are actually correct.
- **Recall**: The proportion of true positive predictions among all actual positives. It measures how well the model identifies **Ultra users**, which is critical for the project‚Äôs objective of recommending the best plan.

Let‚Äôs see how the **Random Forest** performs on the **test set** and whether it meets the desired criteria for making accurate plan recommendations for **Megaline‚Äôs** users.
</div>

In [14]:
rf_model = RandomForestClassifier(random_state=12345, 
                                n_estimators=8,
                                class_weight='balanced') 
        
# Train model on training set
rf_model.fit(features_train, target_train)
        
# Predict on validation set
predictions = rf_model.predict(features_test)
        
# Calculate accuracy score on validation set
score = accuracy_score(target_test, predictions)

# Print results of the best model
print(f"Accuracy of the best Random Forest model on the validation set: {best_score:.4f}")

# Calculate additional metrics for the best model
rf_predictions = rf_model.predict(features_test)

# Precision
precision_rf = precision_score(target_test, rf_predictions)
# Recall
recall_rf = recall_score(target_test, rf_predictions)

# Print the additional metrics
print(f'Precision: {precision_rf:.4f}')
print(f'Recall: {recall_rf:.4f}')

Accuracy of the best Random Forest model on the validation set: 0.7925
Precision: 0.7064
Recall: 0.5238


## üîé **Conclusion: Final Model Performance**

<div style="border: 2px solid #66b3ff; border-radius: 10px; padding: 12px; background-color: #f0f8ff; font-family: sans-serif; font-size: 12px;">
    
‚û°Ô∏è After evaluating the **Random Forest model** on the **test set**, we can confidently say that it meets the primary goal of the project: recommending the best plan (Smart or Ultra) for each subscriber. With an **accuracy of 0.7992** on the test set, the model surpasses the required threshold of **0.75**, making it a solid solution for Megaline's needs.


### Key Findings:
- **Accuracy**: The model achieved **79.92% accuracy**, indicating that it performs well in predicting whether a subscriber should be placed on the **Smart** or **Ultra** plan.
  
- **Precision**: At **71.43%**, the model demonstrates a strong ability to correctly predict **Ultra users**. While this is good, we can improve it further to reduce false positives (incorrect predictions of Ultra users).

- **Recall**: With a **recall of 54.42%**, the model correctly identifies more than half of the **Ultra users**. Although this is a positive result, it still means that a substantial proportion of **Ultra users** are missed. **Improving recall** would be beneficial, particularly when targeting those users who are most likely to benefit from an **Ultra plan**.

In conclusion, **Random Forest** offers a solid foundation for Megaline to recommend plans to subscribers, with further fine-tuning providing an opportunity to optimize the model even more.
</div>
