<div style="border:solid green 2px; padding: 20px">
    
<b>Hello!</b> We're glad to see you in code-reviewer territory. You've done a great job on the project, but let's get to know each other and make it even better! We have our own atmosphere here and a few rules:


1. I work as a code reviewer, and my main goal is not to point out your mistakes, but to share my experience and help you become a data analyst.
2. We speak on a first-name basis.
3. If you want to write or ask a question, don't be shy. Just choose a color for your comment.  
4. This is a training project, so you don't have to be afraid of making a mistake.  
5. You have an unlimited number of attempts to pass the project.  
6. Let's Go!


---
I'll be color-coding my comments; please don't delete them:

<div class="alert alert-block alert-danger">✍
    

__Reviewer's comment №1__

Needs fixing. The block requires some corrections. The work can't be accepted while red comments remain.
</div>
    
---

<div class="alert alert-block alert-warning">📝
    

__Reviewer's comment №1__


Remarks. Some recommendations.
</div>

---

<div class="alert alert-block alert-success">✔️
    

__Reviewer's comment №1__

Success. Everything is done successfully.
</div>
    
---
    
I suggest that we work on the project in dialogue: if you change something in the project or respond to my comments, write about it. It will be easier for me to track changes if you highlight your comments:   
    
<div class="alert alert-info"> <b>Student comments:</b> Student answer.</div>
    
All this will help to make the recheck of your project faster. If you have any questions about my comments, let me know, we'll figure it out together :)   
    
---

# "Optimizing Mobile Plan Recommendations with Machine Learning"

## Introduction

Megaline, a mobile carrier, aims to improve customer satisfaction and revenue by encouraging subscribers to switch from legacy plans to newer offerings: Smart or Ultra. To achieve this, a machine learning model is needed to analyze customer behavior and accurately recommend the most suitable plan. 

This project focuses on building a classification model using subscribers’ monthly usage data, including the number of calls, call duration, messages sent, and internet data used. The model will predict whether a subscriber is more suited for the Smart or Ultra plan.

The primary goal is to develop a model with high accuracy, exceeding a threshold of 75%, to ensure reliable recommendations. By leveraging modern machine learning techniques, this project provides a data-driven solution to streamline plan recommendations, enhancing the customer experience and driving business growth.

<div class="alert alert-block alert-success">✔️
    

__Reviewer's comment №1__

An excellent practice is to describe the goal and main steps in your own words (a skill that will help a lot on a final project). 
</div>

## Import Libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

## Load and Explore Data

In [2]:
# Load the dataset
df = pd.read_csv('/datasets/users_behavior.csv')

# Explore the data
print(df.info())
print(df.describe())
print(df.head())
print("Missing values:\n", df.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB
None
             calls      minutes     messages       mb_used     is_ultra
count  3214.000000  3214.000000  3214.000000   3214.000000  3214.000000
mean     63.038892   438.208787    38.281269  17207.673836     0.306472
std      33.236368   234.569872    36.148326   7570.968246     0.461100
min       0.000000     0.000000     0.000000      0.000000     0.000000
25%      40.000000   274.575000     9.000000  12491.902500     0.000000
50%      62.000000   430.600000    30.000000  16943.235000     0.000000
75%      82.000000   571.927500    57.000000  21424.700000  

The data is clean: all 3,214 rows have non-null values in each of the 5 columns, so no imputation is needed.
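
Alongside the missing-value check, it can be useful to look at the share of each class in the target. The sketch below does this on a small made-up table; the `toy` frame and its values are assumptions for illustration only and are not taken from `users_behavior.csv`.

```python
import pandas as pd

# Toy stand-in for the real dataset; every value here is made up.
toy = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0, 106.0, 66.0, 58.0, 57.0, 15.0, 7.0, 90.0],
    "minutes": [311.9, 516.8, 467.7, 745.5, 418.7, 344.6, 431.6, 132.4, 43.4, 690.0],
    "messages": [83.0, 56.0, 86.0, 81.0, 1.0, 21.0, 20.0, 6.0, 3.0, 41.0],
    "mb_used": [19915.4, 22696.9, 21060.5, 8437.4, 14502.8, 15823.4, 3738.9, 21911.6, 2538.7, 17000.0],
    "is_ultra": [0, 0, 0, 1, 0, 0, 1, 0, 1, 0],
})

# Total number of missing cells across the whole table.
missing_total = int(toy.isnull().sum().sum())

# Share of each class in the target (0 = Smart, 1 = Ultra).
class_share = toy["is_ultra"].value_counts(normalize=True).sort_index()

print("Missing cells:", missing_total)
print(class_share)
```

On the real data, the same `value_counts(normalize=True)` call would be applied to `df['is_ultra']`; the mean of `is_ultra` in the `describe()` output (about 0.306) already gives the share of Ultra users.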

## Split the Data

In [3]:
# Define features and target
features = df.drop(['is_ultra'], axis=1)
target = df['is_ultra']

# Split the data: training (60%), validation (20%), test (20%)
features_train, features_temp, target_train, target_temp = train_test_split(
    features, target, test_size=0.4, random_state=12345)
features_valid, features_test, target_valid, target_test = train_test_split(
    features_temp, target_temp, test_size=0.5, random_state=12345)

print("Training set size:", features_train.shape)
print("Validation set size:", features_valid.shape)
print("Test set size:", features_test.shape)


Training set size: (1928, 4)
Validation set size: (643, 4)
Test set size: (643, 4)
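
Because the two classes are not equally frequent, one optional variation is to pass `stratify` to `train_test_split` so that each subsample keeps roughly the same class proportions. This is a sketch on synthetic data, not the split used in this project; the arrays `X` and `y` below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 objects, 4 features, 30% positive class.
rng = np.random.RandomState(12345)
X = rng.rand(1000, 4)
y = np.array([1] * 300 + [0] * 700)

# First split: 60% train, 40% held out, preserving class shares.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=12345, stratify=y)

# Second split: the held-out part is halved into validation and test.
X_valid, X_test, y_valid, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=12345, stratify=y_temp)

for name, part in [("train", y_train), ("valid", y_valid), ("test", y_test)]:
    print(f"{name}: size={len(part)}, positive share={part.mean():.3f}")
```

With stratification, the positive-class share stays close to the overall share in all three parts, which makes validation and test accuracies easier to compare.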


<div class="alert alert-block alert-success">✔️

__Reviewer's comment №1__



1. Good: random_state is fixed. This makes the split into training, validation, and test samples reproducible, so the subsamples will be identical in every subsequent run of the code.
    
2. The 3:1:1 ratio of train/valid/test sizes is a good choice.
</div>

## Investigating Model Quality

In [4]:
# Initialize variables for tracking the best model
best_model = None
best_accuracy = 0
best_hyperparameters = None

# Evaluate DecisionTreeClassifier
for depth in range(1, 11):
    model = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid)
    accuracy = accuracy_score(target_valid, predictions_valid)
    print(f"DecisionTreeClassifier (max_depth={depth}): Accuracy = {accuracy:.3f}")
    if accuracy > best_accuracy:
        best_model = model
        best_accuracy = accuracy
        best_hyperparameters = {'model': 'DecisionTreeClassifier', 'max_depth': depth}

# Evaluate RandomForestClassifier
for est in range(10, 51, 10):
    for depth in range(1, 11):
        model = RandomForestClassifier(n_estimators=est, max_depth=depth, random_state=12345)
        model.fit(features_train, target_train)
        predictions_valid = model.predict(features_valid)
        accuracy = accuracy_score(target_valid, predictions_valid)
        print(f"RandomForestClassifier (n_estimators={est}, max_depth={depth}): Accuracy = {accuracy:.3f}")
        if accuracy > best_accuracy:
            best_model = model
            best_accuracy = accuracy
            best_hyperparameters = {'model': 'RandomForestClassifier', 'n_estimators': est, 'max_depth': depth}

# Evaluate LogisticRegression
model = LogisticRegression(random_state=12345, max_iter=1000)
model.fit(features_train, target_train)
predictions_valid = model.predict(features_valid)
accuracy = accuracy_score(target_valid, predictions_valid)
print(f"LogisticRegression: Accuracy = {accuracy:.3f}")
if accuracy > best_accuracy:
    best_model = model
    best_accuracy = accuracy
    best_hyperparameters = {'model': 'LogisticRegression'}

print("\nBest Model:", best_hyperparameters)
print("Validation Accuracy of Best Model:", best_accuracy)


DecisionTreeClassifier (max_depth=1): Accuracy = 0.754
DecisionTreeClassifier (max_depth=2): Accuracy = 0.782
DecisionTreeClassifier (max_depth=3): Accuracy = 0.785
DecisionTreeClassifier (max_depth=4): Accuracy = 0.779
DecisionTreeClassifier (max_depth=5): Accuracy = 0.779
DecisionTreeClassifier (max_depth=6): Accuracy = 0.784
DecisionTreeClassifier (max_depth=7): Accuracy = 0.782
DecisionTreeClassifier (max_depth=8): Accuracy = 0.779
DecisionTreeClassifier (max_depth=9): Accuracy = 0.782
DecisionTreeClassifier (max_depth=10): Accuracy = 0.774
RandomForestClassifier (n_estimators=10, max_depth=1): Accuracy = 0.756
RandomForestClassifier (n_estimators=10, max_depth=2): Accuracy = 0.778
RandomForestClassifier (n_estimators=10, max_depth=3): Accuracy = 0.785
RandomForestClassifier (n_estimators=10, max_depth=4): Accuracy = 0.790
RandomForestClassifier (n_estimators=10, max_depth=5): Accuracy = 0.793
RandomForestClassifier (n_estimators=10, max_depth=6): Accuracy = 0.801
RandomForestClass

I evaluated several models to find the one with the highest validation accuracy. The **DecisionTreeClassifier** reached a maximum accuracy of 0.785 at max_depth=3, while the **RandomForestClassifier** outperformed all other models, reaching the highest accuracy of 0.809 with n_estimators=40 and max_depth=8. **LogisticRegression** had the lowest accuracy, 0.711, which is below the project threshold.

Based on these results, the **RandomForestClassifier** with optimal hyperparameters was selected as the best model for further testing, surpassing the project’s minimum accuracy threshold of 0.75.
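
The same search can be organized more compactly by generating all candidates in one place and picking the maximum at the end. The sketch below is one possible arrangement, run on synthetic data from `make_classification` with a smaller grid; the helper name `candidates` and the data are assumptions for illustration, not part of the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and target.
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)


def candidates():
    """Yield (hyperparameters, unfitted model) pairs (hypothetical helper)."""
    for depth in range(1, 6):
        yield ({"model": "DecisionTreeClassifier", "max_depth": depth},
               DecisionTreeClassifier(max_depth=depth, random_state=12345))
    for est in (10, 20):
        for depth in range(1, 6):
            yield ({"model": "RandomForestClassifier",
                    "n_estimators": est, "max_depth": depth},
                   RandomForestClassifier(n_estimators=est, max_depth=depth,
                                          random_state=12345))
    yield ({"model": "LogisticRegression"},
           LogisticRegression(random_state=12345, max_iter=1000))


# Fit every candidate on the training part and score it on validation.
results = []
for params, model in candidates():
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_valid, model.predict(X_valid))
    results.append((accuracy, params, model))

# max() keeps the first candidate in case of ties.
best_accuracy, best_params, best_model = max(results, key=lambda r: r[0])
print("Best model:", best_params)
print(f"Validation accuracy: {best_accuracy:.3f}")
```

Keeping the fitted `best_model` from the search means the winning model can be evaluated later without retyping its hyperparameters by hand.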

## Testing the Best Model

In [5]:
# Initialize the best model
best_model = RandomForestClassifier(n_estimators=40, max_depth=8, random_state=12345)

# Train the model on the training data
best_model.fit(features_train, target_train)

# Evaluate the model on the test set
predictions_test = best_model.predict(features_test)
test_accuracy = accuracy_score(target_test, predictions_test)

# Print the test accuracy
print("Test Accuracy of the Best Model:", test_accuracy)


Test Accuracy of the Best Model: 0.7962674961119751
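
To get a rough sense of how much this number could vary with a different test sample of the same size, one can compute an approximate 95% interval. The sketch below uses the normal approximation for a proportion and assumes the 643 test objects are independent; it is an illustration, not part of the original analysis.

```python
import math

test_accuracy = 0.7962674961119751  # test accuracy printed above
n_test = 643                        # test set size from the split
threshold = 0.75                    # project accuracy threshold

# Standard error of a proportion under the normal approximation.
standard_error = math.sqrt(test_accuracy * (1 - test_accuracy) / n_test)

# Half-width of an approximate 95% interval.
half_width = 1.96 * standard_error
lower = test_accuracy - half_width
upper = test_accuracy + half_width

print(f"Approximate 95% interval: [{lower:.3f}, {upper:.3f}]")
print("Lower bound above threshold:", lower > threshold)
```

Under these assumptions the interval runs from about 0.765 to about 0.827, so even its lower end stays above the 0.75 threshold.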


## Sanity Check

In [6]:
# Baseline predictions: all users are 'Smart' (0)
baseline_smart = [0] * len(target_test)
baseline_smart_accuracy = accuracy_score(target_test, baseline_smart)

# Baseline predictions: all users are 'Ultra' (1)
baseline_ultra = [1] * len(target_test)
baseline_ultra_accuracy = accuracy_score(target_test, baseline_ultra)

# Print the baseline accuracies
print("Baseline Accuracy (All Smart):", baseline_smart_accuracy)
print("Baseline Accuracy (All Ultra):", baseline_ultra_accuracy)


Baseline Accuracy (All Smart): 0.6842923794712286
Baseline Accuracy (All Ultra): 0.3157076205287714


In this sanity check, I calculated baseline accuracies on the test set using naive predictions: labeling every user as **Smart** (0) gives an accuracy of **68.43%**, while labeling every user as **Ultra** (1) gives **31.57%**. These results reflect the class imbalance in the dataset. The selected model's test accuracy of **79.63%** (validation accuracy: **80.9%**) is about 11 percentage points above the stronger baseline, so the model does better than always predicting the majority class.
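
The majority-class baseline can also be obtained with scikit-learn's `DummyClassifier`. The sketch below uses small made-up label arrays (the features are placeholders, since this strategy ignores them) to show that the accuracy of such a baseline equals the share of the majority class in the evaluated sample.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels: 0 = Smart (majority), 1 = Ultra.
y_train = np.array([0] * 70 + [1] * 30)
y_test = np.array([0] * 13 + [1] * 7)

# Placeholder features; "most_frequent" does not use them.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))

# Share of the majority class in the test labels.
majority_share = float((y_test == 0).mean())

print("Baseline accuracy:", baseline_accuracy)
print("Majority share in test:", majority_share)
```

Here both values are 0.65, because 13 of the 20 test labels belong to the majority class.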

## Conclusion


The project successfully developed a classification model to recommend the most suitable mobile plan (Smart or Ultra) for Megaline's subscribers based on their monthly usage data. After evaluating multiple models, the **RandomForestClassifier** emerged as the best-performing model with optimized hyperparameters (`n_estimators=40`, `max_depth=8`), achieving a validation accuracy of **80.9%** and a test accuracy of **79.63%**, surpassing the required threshold of **75%**.

The sanity check showed that naive approaches, such as predicting all users as Smart or all users as Ultra, yield much lower test accuracies (68.43% and 31.57%, respectively), which highlights the added value of the developed model. The test accuracy is within about 1.3 percentage points of the validation accuracy, which suggests the model generalizes to unseen data.

This project provides a robust and data-driven approach for Megaline to transition subscribers to more appropriate plans, potentially improving customer satisfaction and revenue optimization.

<div class="alert alert-block alert-success">✔️
    

__Reviewer's comment №1__

Here's the great thing: you picked the best hyperparameters for your models (in this case, by maximizing the accuracy_score metric) and identified the best model overall. On validation, it turned out to be the random forest.

Once the hyperparameters have been selected on the validation set, we evaluate the models on the test data. Based on the results of testing on the test set (sorry for the tautology), we choose a model that we can pass to production.
</div>