**Review**

Hello Kristina!

I'm happy to review your project today.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did a pretty good job overall, but there are a few problems that need to be fixed before the project is accepted. Let me know if you have questions!

# Plan Recommendation Model for Megaline
 In this project, we aim to develop a classification model that helps the mobile network provider, Megaline, recommend the most suitable new plan for its subscribers. 
We have access to data from subscribers who have already transitioned to the new plans, which will serve as the foundation for the model. 
 The task is to achieve a classification accuracy of at least 0.75 on a test dataset. The model will be evaluated using standard classification metrics such as accuracy, and the best-performing model will be selected based on its ability to generalize to unseen data. 

<div class="alert alert-warning">
<b>Reviewer's comment V1</b>
    
It's not a good idea to use subheaders for regular text. Headers and subheaders should be used for titles and subtitles only.
    
</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>
    
Thank you!
    
</div>

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [2]:
users_behavior = pd.read_csv('users_behavior.csv')
users_behavior

,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.90,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0
...,...,...,...,...,...
3209,122.0,910.98,20.0,35124.90,1
3210,25.0,190.36,0.0,3275.61,0
3211,97.0,634.44,70.0,13974.06,0
3212,64.0,462.32,90.0,31239.78,0


In [3]:
users_behavior.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [4]:
users_behavior.columns

Index(['calls', 'minutes', 'messages', 'mb_used', 'is_ultra'], dtype='object')

In [5]:
users_behavior.duplicated().sum()

0

In [6]:
users_behavior.isnull().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

In [7]:
users_behavior.describe()

,calls,minutes,messages,mb_used,is_ultra
count,3214.0,3214.0,3214.0,3214.0,3214.0
mean,63.038892,438.208787,38.281269,17207.673836,0.306472
std,33.236368,234.569872,36.148326,7570.968246,0.4611
min,0.0,0.0,0.0,0.0,0.0
25%,40.0,274.575,9.0,12491.9025,0.0
50%,62.0,430.6,30.0,16943.235,0.0
75%,82.0,571.9275,57.0,21424.7,1.0
max,244.0,1632.06,224.0,49745.73,1.0


In [8]:
users_behavior.groupby('is_ultra').mean()

is_ultra,calls,minutes,messages,mb_used
0,58.463437,405.942952,33.384029,16208.466949
1,73.392893,511.224569,49.363452,19468.823228


In [9]:
# Separation of features and target feature
features = users_behavior.drop('is_ultra', axis=1)  
target = users_behavior['is_ultra']


In [10]:
# Splitting data into training, validation and test sets
features_train, features_temp, target_train, target_temp = train_test_split(
    features, target, test_size=0.4, random_state=42)

features_valid, features_test, target_valid, target_test = train_test_split(
    features_temp, target_temp, test_size=0.5, random_state=42)


<div class="alert alert-success">
<b>Reviewer's comment V1</b>

Everything is correct. Good job! 
    
</div>
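Since only about 31% of subscribers are on the Ultra plan (the mean of `is_ultra` in the `describe()` output), it can be worth checking that each subset keeps a similar class ratio. The sketch below is an illustration on a small hypothetical DataFrame, not the project data, and it adds the `stratify` argument, which the split above does not use.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for users_behavior: 100 rows, 30% in class 1.
df = pd.DataFrame({
    "calls": range(100),
    "is_ultra": [1] * 30 + [0] * 70,
})
features = df.drop("is_ultra", axis=1)
target = df["is_ultra"]

# First cut: 60% train, 40% held out; stratify keeps the class ratio.
features_train, features_temp, target_train, target_temp = train_test_split(
    features, target, test_size=0.4, random_state=42, stratify=target)

# Second cut: split the held-out 40% evenly into validation and test.
features_valid, features_test, target_valid, target_test = train_test_split(
    features_temp, target_temp, test_size=0.5, random_state=42,
    stratify=target_temp)

# Row counts and share of class 1 in each subset.
split_sizes = {
    "train": len(features_train),
    "valid": len(features_valid),
    "test": len(features_test),
}
class_share = {
    "train": target_train.mean(),
    "valid": target_valid.mean(),
    "test": target_test.mean(),
}
print(split_sizes)
print(class_share)
```

Printing the sizes and class shares this way makes it easy to confirm the intended 60/20/20 proportions before any model is trained.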

In [11]:
# Dictionary for storing results
results = {}

# 1. Logistic Regression
logistic_model = LogisticRegression(random_state=42, max_iter=1000)
logistic_model.fit(features_train, target_train)
logistic_predictions = logistic_model.predict(features_valid)
logistic_accuracy = accuracy_score(target_valid, logistic_predictions)
results['Logistic Regression'] = logistic_accuracy

# 2. Decision Tree
decision_tree = DecisionTreeClassifier(max_depth=10, random_state=54321)
decision_tree.fit(features_train, target_train)
tree_predictions = decision_tree.predict(features_valid)
tree_accuracy = accuracy_score(target_valid, tree_predictions)
results['Decision Tree'] = tree_accuracy

# 3. Random Forest
random_forest = RandomForestClassifier(n_estimators=40, max_depth=10, random_state=54321)
random_forest.fit(features_train, target_train)
forest_predictions = random_forest.predict(features_valid)
forest_accuracy = accuracy_score(target_valid, forest_predictions)
results['Random Forest'] = forest_accuracy

# Display results
for model, accuracy in results.items():
    print(f"{model} Accuracy: {accuracy}")

# Selecting the best model
best_model_name = max(results, key=results.get)
print(f"\nBest model: {best_model_name} with Accuracy: {results[best_model_name]}")

Logistic Regression Accuracy: 0.7402799377916018
Decision Tree Accuracy: 0.7900466562986003
Random Forest Accuracy: 0.8055987558320373

Best model: Random Forest with Accuracy: 0.8055987558320373


<div class="alert alert-danger">
<b>Reviewer's comment V1</b>

Correct. But could you please tune the hyperparameters for at least one model?
    
</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Correct. Well done!
    
</div>

Logistic regression reached a validation accuracy of 0.7403, just below the required threshold of 0.75.
The decision tree reached 0.7900, which clears the threshold but trails the random forest.
The random forest performed best with 0.8056, about 0.056 above the threshold and the highest of the three models.
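For context, these accuracies can be compared against a majority-class baseline: because roughly 31% of subscribers are on the Ultra plan, a model that always predicts the other plan would already be right about 69% of the time. The sketch below shows one way to compute such a baseline with scikit-learn's `DummyClassifier`; the toy arrays are hypothetical and only mimic the class balance reported by `describe()`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical toy data with about the same class balance as is_ultra.
toy_features = np.zeros((100, 1))          # features are ignored by the dummy model
toy_target = np.array([1] * 31 + [0] * 69)

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_features, toy_target)

baseline_accuracy = accuracy_score(toy_target, baseline.predict(toy_features))
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

In the notebook itself, the same check would presumably fit the baseline on `features_train` and score it on `features_valid`, so that it is measured the same way as the three models above.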

In [12]:
results_depth = {}

# Loop to iterate over max_depth values
for depth in range(1, 11):
    decision_tree = DecisionTreeClassifier(max_depth=depth, random_state=54321)
    decision_tree.fit(features_train, target_train)
    tree_predictions = decision_tree.predict(features_valid)
    accuracy = accuracy_score(target_valid, tree_predictions)
    
    # Storing results
    results_depth[f"max_depth = {depth}"] = accuracy

# Displaying results
for depth, accuracy in results_depth.items():
    print(f"{depth}: {accuracy:.4f}")

# Selecting the best max_depth
best_depth = max(results_depth, key=results_depth.get)
print(f"\nBest max_depth: {best_depth} with accuracy: {results_depth[best_depth]:.4f}")

max_depth = 1: 0.7309
max_depth = 2: 0.7823
max_depth = 3: 0.7916
max_depth = 4: 0.7823
max_depth = 5: 0.7745
max_depth = 6: 0.7776
max_depth = 7: 0.7823
max_depth = 8: 0.7978
max_depth = 9: 0.7854
max_depth = 10: 0.7900

Best max_depth: max_depth = 8 with accuracy: 0.7978


Iterating over max_depth values from 1 to 10 for the DecisionTreeClassifier model, the highest validation accuracy, 0.7978, was reached at max_depth = 8.

At low tree depths (e.g. max_depth = 1), the accuracy is lower at 0.7309, because the tree is too simple to capture the patterns in the data.
As depth increased, accuracy rose unevenly and peaked at max_depth = 8.
Beyond that depth there was no further improvement: accuracy dropped at depths 9 and 10, possibly due to overfitting.
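The same validation-set search can be extended to the random forest, which is the model finally used on the test set. Note that the loop above leaves `decision_tree` holding the last fitted tree (depth 10), not the best one, so it helps to store the best estimator as the search runs. The sketch below is a minimal illustration on synthetic data from `make_classification`; the parameter grid and the names `best_forest`, `best_params` and `best_accuracy` are assumptions chosen for the example, not values from this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data standing in for users_behavior.
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=42)

best_params, best_accuracy, best_forest = None, 0.0, None

# Try every combination and keep the model with the best validation accuracy.
for n_estimators in (10, 20, 40):
    for depth in range(2, 11, 2):
        forest = RandomForestClassifier(
            n_estimators=n_estimators, max_depth=depth, random_state=54321)
        forest.fit(X_train, y_train)
        accuracy = accuracy_score(y_valid, forest.predict(X_valid))
        if accuracy > best_accuracy:
            best_params = {"n_estimators": n_estimators, "max_depth": depth}
            best_accuracy = accuracy
            best_forest = forest

print(best_params, round(best_accuracy, 4))
```

Keeping `best_forest` means the final test-set evaluation can use the tuned model directly instead of refitting it or relying on whichever model was trained last.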

In [13]:
# Predictions on the test set
if best_model_name == 'Logistic Regression':
    best_model = logistic_model
elif best_model_name == 'Decision Tree':
    best_model = decision_tree
else:
    best_model = random_forest

test_predictions = best_model.predict(features_test)
test_accuracy = accuracy_score(target_test, test_predictions)
print(f"Accuracy on the test set: {test_accuracy}")

Accuracy on the test set: 0.8180404354587869


<div class="alert alert-success">
<b>Reviewer's comment V1</b>

Well done!
    
</div>

With the best model selected (random forest), the accuracy on the test set was 0.8180, which is above the required 0.75 and close to the validation accuracy of 0.8056. This suggests that the model generalizes well and can classify subscribers effectively based on their behavior.