**Review**

Hi, my name is Dmitry and I will be reviewing your project.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did an excellent job! The project is accepted. Keep up the good work on the next sprint!

# Project Summary

As a team member at Megaline, my goal is to improve the subscriber experience using data-driven insights. A significant portion of our user base is still on legacy plans, so we are developing a model that analyzes subscriber behavior and recommends one of our newer plans: Smart or Ultra.



For this project, I was uncertain whether to retain the underperforming models or not, but ultimately, I decided to include them.

<div class="alert alert-success">
<b>Reviewer's comment</b>

Sure, why not? If you keep them, your research is reproducible!

</div>

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
user_behavior = pd.read_csv('/datasets/users_behavior.csv')
display(user_behavior)

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.90,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0
...,...,...,...,...,...
3209,122.0,910.98,20.0,35124.90,1
3210,25.0,190.36,0.0,3275.61,0
3211,97.0,634.44,70.0,13974.06,0
3212,64.0,462.32,90.0,31239.78,0


<div class="alert alert-warning">
<b>Reviewer's comment</b>

When loading the data, it's a good idea to at least take a look at it using `pd.DataFrame.info()` or `pd.DataFrame.head()` to make sure everything loaded correctly :)

</div>
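A minimal sketch of the quick inspection suggested above. It rebuilds a tiny frame from the first five rows shown in the display output rather than reading `/datasets/users_behavior.csv`, so the names `sample_csv`, `sample` and `missing_per_column` are illustrative only; on the real data the same calls would be made on `user_behavior`.

```python
import io

import pandas as pd

# First five rows copied from the display output above; the real file has more rows.
sample_csv = io.StringIO(
    "calls,minutes,messages,mb_used,is_ultra\n"
    "40.0,311.90,83.0,19915.42,0\n"
    "85.0,516.75,56.0,22696.96,0\n"
    "77.0,467.66,86.0,21060.45,0\n"
    "106.0,745.53,81.0,8437.39,1\n"
    "66.0,418.74,1.0,14502.75,0\n"
)
sample = pd.read_csv(sample_csv)

print(sample.head())  # first rows: do the columns and values look right?
sample.info()         # dtypes and non-null counts per column

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

If `info()` reports fewer non-null values than rows for some column, that column has gaps worth handling before training.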

In [3]:
# initialize temp train df and test df
# I hold out 20% each time: the data is split twice, so larger holdouts
# would leave too little for training (two 20% splits keep about 64%)
ub_train_temp, ub_test = train_test_split(
    user_behavior,
    test_size=0.2,  # 20% of the full dataset becomes the test set
    random_state=12345)

# initialize valid df and actual train df
ub_train, ub_valid = train_test_split(
    ub_train_temp,
    test_size=0.2,  # 20% of the remaining 80% (16% of the full dataset) becomes the validation set
    random_state=54321)

<div class="alert alert-success">
<b>Reviewer's comment</b>

The data split is reasonable!

</div>
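The split cell above chains two `train_test_split` calls, and the second `test_size` applies only to the rows left after the first split. The arithmetic below is a small sketch of the resulting shares; the variable names are illustrative, and the last value is a hypothetical alternative setting, not something the notebook used.

```python
# Chained hold-out splits: the second fraction applies to what is left
# after the first split, not to the full dataset.
test_share = 0.2                        # first split: 20% of all rows
remaining = 1 - test_share              # 80% of rows go on to the second split
valid_share = remaining * 0.2           # 20% of the remaining 80%, i.e. 16% overall
train_share = remaining - valid_share   # the rest, i.e. 64% overall

print(f"train={train_share:.2f}, valid={valid_share:.2f}, test={test_share:.2f}")

# Hypothetical alternative: the second-split test_size that would give
# an even 60:20:20 split instead.
second_test_size_for_60_20_20 = test_share / remaining
print(f"second test_size for 60:20:20 = {second_test_size_for_60_20_20:.2f}")
```

So the cell produces roughly a 64:16:20 split; an even 60:20:20 split would need `test_size=0.25` in the second call.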

In [4]:
# initialize variables
features = user_behavior.drop(['is_ultra'], axis=1)
target = user_behavior['is_ultra']

features_train = ub_train.drop(['is_ultra'], axis=1)
target_train = ub_train['is_ultra']

features_test = ub_test.drop(['is_ultra'], axis=1)
target_test = ub_test['is_ultra']

features_valid = ub_valid.drop(['is_ultra'], axis=1)
target_valid = ub_valid['is_ultra']

<div class="alert alert-success">
<b>Reviewer's comment</b>

Good!

</div>

In [5]:
from sklearn.linear_model import LogisticRegression 

# initialize model
logReg = LogisticRegression(random_state=12345, solver='liblinear')
logReg.fit(features_train, target_train)

LogisticRegression(random_state=12345, solver='liblinear')

In [6]:
from sklearn.metrics import accuracy_score

# Make predictions on valid df using Logistic Regression
logReg_predictions_valid = logReg.predict(features_valid)

# Get logReg accuracy
logReg_accuracy = accuracy_score(target_valid, logReg_predictions_valid)

print("Accuracy:", logReg_accuracy)

Accuracy: 0.7359223300970874


In [7]:
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=12345, max_depth=3)
tree.fit(features_train, target_train)

DecisionTreeClassifier(max_depth=3, random_state=12345)

In [8]:
# Make predictions on valid df using Decision Tree Classifier
tree_predictions_valid = tree.predict(features_valid)

# Get tree accuracy
tree_accuracy = accuracy_score(target_valid, tree_predictions_valid)

print("Accuracy:", tree_accuracy)

Accuracy: 0.7786407766990291


# Validation Accuracy by Depth for Decision Tree

max_depth=6: 0.7650485436893204

max_depth=5: 0.7728155339805826

max_depth=4: 0.7747572815533981

**max_depth=3: 0.7786407766990291**

max_depth=2: 0.7533980582524272
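The notebook lists these accuracies but not the loop that produced them. One way such a table could be generated is sketched below. It runs on synthetic stand-in data from `make_classification` (an assumption made so the block is self-contained), so its numbers will not match the table; replacing `X_train`, `y_train`, `X_valid`, `y_valid` with `features_train`, `target_train`, `features_valid`, `target_valid` should reproduce the comparison on the real data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): four numeric features, binary target.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=12345)

depth_scores = {}
for depth in range(2, 7):  # depths 2 through 6, as in the table above
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(X_train, y_train)
    depth_scores[depth] = model.score(X_valid, y_valid)  # validation accuracy
    print(f"max_depth={depth}: {depth_scores[depth]}")

# Depth with the highest validation accuracy (the first one wins on ties)
best_depth = max(depth_scores, key=depth_scores.get)
print("Best depth on the validation set:", best_depth)
```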

In [9]:
from sklearn.ensemble import RandomForestClassifier


best_score = 0
best_est = 0
for est in range(1, 11):  # search n_estimators from 1 to 10
    forest = RandomForestClassifier(random_state=54321, n_estimators=est)  # set number of trees
    forest.fit(features_train, target_train)  # train model on training set
    score = forest.score(features_valid, target_valid)  # calculate accuracy score on validation set
    if score > best_score:
        best_score = score
        best_est = est

# Refit with the best setting so that `forest` holds the best model, not just the last one tried
forest = RandomForestClassifier(random_state=54321, n_estimators=best_est)
forest.fit(features_train, target_train)

print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))

Accuracy of the best model on the validation set (n_estimators = 10): 0.7980582524271844


<div class="alert alert-success">
<b>Reviewer's comment</b>

Great, you tried a couple of different models and did some hyperparameter tuning using the validation set

</div>

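Random Forest Classifier has the highest validation accuracy of the three models I tested (about 0.798, versus 0.779 for the decision tree and 0.736 for logistic regression), but it is also the slowest to train.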
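The speed comparison is not measured anywhere in the notebook. A rough way to check it is to time each `fit` call, as in the sketch below. It uses synthetic data from `make_classification` as a stand-in (an assumption), and the `candidates` and `fit_seconds` names are illustrative; timings depend on the machine and the data, so treat the output as indicative only.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); use features_train / target_train on the real data.
X, y = make_classification(n_samples=2000, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)

# Same model settings as in the notebook cells above
candidates = {
    "LogisticRegression": LogisticRegression(random_state=12345, solver="liblinear"),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=12345, max_depth=3),
    "RandomForestClassifier": RandomForestClassifier(random_state=54321, n_estimators=10),
}

fit_seconds = {}
for name, model in candidates.items():
    start = time.perf_counter()
    model.fit(X, y)  # time only the training step
    fit_seconds[name] = time.perf_counter() - start
    print(f"{name}: {fit_seconds[name]:.4f} s")
```

Repeating each measurement several times and averaging would give a steadier picture than a single run.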

In [10]:
# Evaluate the final forest model on the test set
forest_predictions_test = forest.predict(features_test)

fpt_accuracy = accuracy_score(target_test, forest_predictions_test)

print(fpt_accuracy)

0.7947122861586314


<div class="alert alert-success">
<b>Reviewer's comment</b>

The final model was evaluated on the test set

</div>

Very similar accuracy to the training set 👍

<div class="alert alert-success">
<b>Reviewer's comment</b>

You mean the validation set, right? :)

</div>

In [11]:
# Sanity Check
import random

# Choose a few random indices from the test set
random_indices = random.sample(range(len(features_test)), 7)

# Print actual and predicted values for these indices
for idx in random_indices:
    actual_value = target_test.iloc[idx]
    predicted_value = forest.predict(features_test.iloc[[idx]])[0]  # Assuming a classification model
    
    print("Index:", idx)
    print("Actual Value:", actual_value)
    print("Predicted Value:", predicted_value)
    print()

Index: 175
Actual Value: 0
Predicted Value: 0

Index: 609
Actual Value: 0
Predicted Value: 0

Index: 218
Actual Value: 0
Predicted Value: 0

Index: 538
Actual Value: 1
Predicted Value: 0

Index: 147
Actual Value: 0
Predicted Value: 0

Index: 320
Actual Value: 1
Predicted Value: 0

Index: 475
Actual Value: 0
Predicted Value: 0



<div class="alert alert-warning">
<b>Reviewer's comment</b>

A better way to sanity check your model is to compare it to some kind of a baseline. For example, here we can take a constant model always predicting the majority class. Its accuracy is equal to the share of the majority class (about 70% in this case). Our model is better than that, so it probably learned something useful :)

</div>
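The baseline idea from the comment above can be sketched in plain Python. The labels below are the seven actual and predicted values printed by the sanity check, so this only illustrates the mechanics on a tiny sample. In practice the majority class would be taken from the training target and the comparison run on the whole test set, for example with scikit-learn's `DummyClassifier(strategy="most_frequent")`. The helper `accuracy` is written here for illustration.

```python
from collections import Counter

# Actual labels of the seven test rows printed in the sanity check above
actual = [0, 0, 0, 1, 0, 1, 0]
# Forest predictions for the same rows, also from that output
predicted = [0, 0, 0, 0, 0, 0, 0]


def accuracy(y_true, y_pred):
    """Share of positions where the prediction equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Constant model: always predict the most frequent class
majority_class, majority_count = Counter(actual).most_common(1)[0]
baseline_pred = [majority_class] * len(actual)

baseline_accuracy = accuracy(actual, baseline_pred)
model_accuracy = accuracy(actual, predicted)
print("baseline:", baseline_accuracy, "model:", model_accuracy)
```

On these seven rows the forest and the constant baseline score the same, which is why a handful of spot checks says little and a comparison over the full test set is more informative.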

# Conclusion

- Reviewed the original dataset 'users_behavior.csv' to identify features, target, and determine whether the problem was classification or regression.
- Split the dataset into training, validation, and test sets at a ratio of roughly 64:16:20 (two successive 20% holdouts), leaving an adequate share of the data for training.
- Evaluated the performance of different models, including LogisticRegression, DecisionTreeClassifier, and RandomForestClassifier.
- Observed that RandomForestClassifier achieved the highest accuracy among the tested models.
- Identified LogisticRegression as the least accurate model out of the three tested.

<div class="alert alert-success">
<b>Reviewer's comment</b>

Conclusions make sense

</div>