# Project description
Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.

You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, you need to develop a model that will pick the right plan. Since you’ve already performed the data preprocessing step, you can move straight to creating the model.

Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.

I will complete the project using the following steps:

- Open and look through the data file. Path to the file: /datasets/users_behavior.csv.

- Split the source data into a training set, a validation set, and a test set.

- Investigate the quality of different models by changing hyperparameters, and briefly describe the findings of the study.

- Check the quality of the model using the test set.

- Additional task: sanity check the model. This data is more complex than what I am used to working with, so it's not an easy task. We'll take a closer look at it later.

**Reviewer’s Introduction**

Hello, my name is **Karol Sandoval**, and I will be reviewing your project. I am very happy to accompany you in this stage of your learning journey.

Before we begin, I kindly ask you to follow these instructions when reviewing and responding to my comments:

Please do not move, modify, or delete any of my comments, as they are essential for the review process and for keeping track of the adjustments that need to be made.

Reply directly below each comment, briefly explaining the changes you implemented.

Write your responses in a different color (blue is a great option) so your updates are easy to identify.

Regarding your work, I want to congratulate you: your project reflects commitment, clear effort, and steady progress. You are building a solid foundation, and every improvement you make strengthens your understanding and skills. Keep going—small adjustments lead to big accomplishments!

Throughout the notebook, you will find comments highlighted with the following color-coded blocks:

<div class="alert alert-danger"> <h2> Reviewer's comment</h2> This indicates an important issue that needs to be corrected for the project to be accepted. </div> <div class="alert alert-warning"> <h2> Reviewer's comment</h2> This represents a recommendation or suggestion. It is not mandatory, but it is valuable for improving the quality of your work. </div> <div class="alert alert-success"> <h2> Reviewer's comment</h2> This highlights a compliment or positive observation about your work. </div>

You could respond using this:

<div style="background-color:#dbe9ff; border-left:6px solid #1a56db; padding:10px;">
  <b>Your answer.</b> <a class="tocSkip"></a>
</div>

If anything is unclear or you would like more guidance, please feel free to reach out to your instructor. I am also here to support you and help you strengthen your project. You're doing great—let’s keep building on this progress together!

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Review-Iterations-1" data-toc-modified-id="Review-Iterations-1-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Review Iterations 1</a></span></li><li><span><a href="#Importing-files" data-toc-modified-id="Importing-files-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Importing files</a></span></li><li><span><a href="#Spliting-the-source-data-into-a-training-set,-a-validation-set,-and-a-test-set." data-toc-modified-id="Spliting-the-source-data-into-a-training-set,-a-validation-set,-and-a-test-set.-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Spliting the source data into a training set, a validation set, and a test set.</a></span></li><li><span><a href="#Investigate-the-quality-of-different-models-by-changing-hyperparameters.-While-Briefly-describing-the-findings-of-the-study." data-toc-modified-id="Investigate-the-quality-of-different-models-by-changing-hyperparameters.-While-Briefly-describing-the-findings-of-the-study.-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Investigate the quality of different models by changing hyperparameters. While Briefly describing the findings of the study.</a></span></li><li><span><a href="#Checking-the-quality-of-the-model-using-the-test-set." data-toc-modified-id="Checking-the-quality-of-the-model-using-the-test-set.-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Checking the quality of the model using the test set.</a></span></li><li><span><a href="#Additional-task:-sanity-check-the-model." data-toc-modified-id="Additional-task:-sanity-check-the-model.-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Additional task: sanity check the model.</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></div>

<div class="alert alert-success">
<h2> Reviewer's comment</h2>

I’d like to start by highlighting that adding an interactive table of contents is an excellent addition to your notebook. It greatly improves navigation and shows good organization of your workflow.

To help you get even more out of it, here are a couple of suggestions that could make it even clearer and more functional:

1. **Brevity and clarity in section titles:** Keeping titles short and descriptive helps the reader quickly understand the purpose of each section. You might consider adjusting a few titles so they more precisely reflect their content.

2. **Link functionality:** Ideally, when a user clicks on a section, the link should take them directly to the corresponding header. In this case, some links aren’t pointing correctly. It may be useful to check how the links were generated or whether the headers have the appropriate identifiers.

3. **Relevance of the titles:** It’s important to assign titles that accurately reflect the content you will present. Since Review-Iterations-1 is not a section that appears in the notebook, it wouldn’t be a good idea to include it in the table of contents.
</div>

## Importing files
Importing files and checking data for any incorrect data types, missing values or duplicate values.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error

In [3]:
data = pd.read_csv("https://code.s3.yandex.net/datasets/users_behavior.csv")

In [4]:
display(data.head())

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


<div class="alert alert-success">
<h2> Reviewer's comment</h2>

Great job loading the data and checking its initial structure using <code>head()</code>. This is an excellent practice to make sure the data was read correctly.

</div>

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB

In [6]:
print(data["is_ultra"].unique())

[0 1]


All data types seem correct. The only column that might need attention is is_ultra, which acts as a boolean value where 0 is False and 1 is True. This should not cause problems for the model that will be developed, but if any issues do come up, this column is one place to look.

<div class="alert alert-warning">
<h2> Reviewer's comment</h2>


Thank you for your observation :)

It caught my attention that you mentioned the <code>is_ultra</code> column might cause issues.
<em>Could you please clarify a bit more about what kind of problem you think might occur?</em>

In many cases, working with binary variables encoded as <code>0</code> and <code>1</code> can be practical for Machine Learning models, since this format makes processing easier. When these variables are kept as text (for example, “ultra” / “not ultra”), some algorithms can’t use them directly, and additional transformation steps would be required.


</div>
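
To illustrate the point about text labels, here is a minimal sketch of the extra transformation step that a text-valued target would need. The `plan` column and its values are hypothetical; they do not exist in users_behavior.csv, which already stores the target as 0/1.

```python
import pandas as pd

# Hypothetical text-valued plan column (not present in users_behavior.csv)
plans = pd.DataFrame({"plan": ["smart", "ultra", "smart", "ultra", "smart"]})

# Map each text label to the 0/1 encoding that is_ultra already uses
plans["is_ultra"] = plans["plan"].map({"smart": 0, "ultra": 1})

print(plans["is_ultra"].tolist())
```

Since is_ultra is already numeric, no such step is needed in this project.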

In [7]:
display(data.describe())

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


<div class="alert alert-warning">
<h2> Reviewer's comment</h2>

Nice use of the <code>describe()</code> function. This is a valuable practice for understanding your data, as it provides key statistical information such as counts, means, standard deviations, minimums, maximums, and percentiles. These indicators help you identify potential outliers, understand the distribution of the variables, and confirm that the data was loaded correctly.

I’d also like to mention that although the <code>is_ultra</code> variable appears as <code>0</code> and <code>1</code>, it is actually a qualitative variable. This means that these numbers act as labels that distinguish one type of user from another, rather than quantities that can be interpreted as continuous values. Keeping this in mind helps avoid incorrect interpretations and allows the model to handle the variable appropriately.

It might also be helpful to complement these results with comments or brief reflections on what you observe. This enriches the analysis and strengthens your understanding of the data’s behavior.

</div>

In [8]:
print(data.isnull().sum())

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64


In [9]:
print(data.duplicated().sum())

0


There are no missing values or duplicate rows in the dataset, so it is safe to continue.

<div class="alert alert-warning">
<h2> Reviewer's comment</h2>

Nice work identifying the missing values and duplicate rows; this is an important step when preparing data for a machine learning model.

It could be helpful to complement your analysis with some visual resources. Visualizations often make it easier to identify patterns, trends, and relationships between variables, which can deepen your understanding of the dataset before moving on to modeling.

It may also be valuable to include some tables that summarize relevant information or highlight relationships between features, especially when working with categorical variables.

These elements are not mandatory, but they can enrich your understanding of the data before training the model.


</div>
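
One possible way to produce the summary tables suggested above is sketched below. To keep the example self-contained, it rebuilds only the five rows shown by `head()` earlier; on the real notebook the same two calls could be applied to `data` directly. The names `sample`, `class_share`, and `per_plan_means` are illustrative.

```python
import pandas as pd

# The five rows printed by data.head() above, rebuilt by hand for illustration
sample = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0, 106.0, 66.0],
    "minutes": [311.9, 516.75, 467.66, 745.53, 418.74],
    "messages": [83.0, 56.0, 86.0, 81.0, 1.0],
    "mb_used": [19915.42, 22696.96, 21060.45, 8437.39, 14502.75],
    "is_ultra": [0, 0, 0, 1, 0],
})

# Share of each plan: shows how balanced the two classes are
class_share = sample["is_ultra"].value_counts(normalize=True)

# Mean of every feature per plan: a quick look at how usage differs
per_plan_means = sample.groupby("is_ultra").mean()

print(class_share)
print(per_plan_means)
```

Five rows are far too few to draw conclusions from; the sketch only shows the shape of the tables.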

##  Spliting the source data into a training set, a validation set, and a test set.
Splitting the source data into a training set, a validation set, and a test set for the model that will be selected.

I will split the data as follows: 50% in the training set, 25% in the validation set, and 25% in the test set. It is standard practice to give the largest portion of the data to training, so the model can be as accurate as possible.

In [10]:
data_train, data_valid = train_test_split(data, test_size=0.5, random_state=12345)
data_valid, data_test = train_test_split(data_valid, test_size=0.5, random_state=12345)

<div class="alert alert-warning">
<h2>Reviewer's comment</h2>

You have used <code>test_size=0.5</code> in the first split, which assigns 50% of the data to the test set. I would like to understand a bit more about your reasoning for choosing this percentage; if you could share a brief comment explaining your choice, it would be very helpful.

I recommend reviewing previous lessons, where you might find information provided by the instructor that could help you reflect on how to divide the data into different sets.

As an additional resource, here is the official documentation for <code>train_test_split</code> in scikit-learn, which can help you perform the data split:
<a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html">Scikit-learn documentation</a>.

</div>
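
For comparison with the 50/25/25 split used here, the sketch below shows a 60/20/20 split, a commonly used alternative, done with the same two-step approach. It is not the split this notebook uses. The data is synthetic (an assumption, generated only so the example runs on its own), and `stratify` is added to keep the class shares similar in all three sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1000 rows, about 30% in class 1
rng = np.random.default_rng(12345)
n = 1000
demo = pd.DataFrame({
    "calls": rng.integers(0, 200, n),
    "mb_used": rng.uniform(0, 40000, n),
    "is_ultra": (rng.random(n) < 0.3).astype(int),
})
features = demo.drop(columns="is_ultra")
target = demo["is_ultra"]

# First split: keep 60% for training, hold out 40%
X_train, X_rest, y_train, y_rest = train_test_split(
    features, target, test_size=0.4, random_state=12345, stratify=target
)

# Second split: halve the held-out 40% into validation and test (20% each)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=12345, stratify=y_rest
)

print(len(X_train), len(X_valid), len(X_test))
```

Note that in `train_test_split`, `test_size` is the share that goes to the second returned set, so `test_size=0.5` in the first split of this notebook is what leaves 50% for training.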

In [11]:
features_train = data_train.drop(["is_ultra"], axis=1)
target_train = data_train['is_ultra']
features_valid = data_valid.drop(["is_ultra"], axis=1)
target_valid = data_valid['is_ultra']
features_test = data_test.drop(["is_ultra"], axis=1)
target_test = data_test["is_ultra"]

print(features_train.shape)
print(target_train.shape)
print(features_valid.shape)
print(target_valid.shape)
print(features_test.shape)
print(target_test.shape)

(1607, 4)
(1607,)
(803, 4)
(803,)
(804, 4)
(804,)


The first split leaves 1607 rows for validation and test. Since 1607 is an odd number, it cannot be halved evenly, which explains the one-row difference between the validation set (803) and the test set (804).

<div class="alert alert-success">
<h2> Reviewer's comment</h2>

Well done verifying the dimensions of each dataset. Confirming the size of the features and target sets is an important step in preparing your data.

</div>

## Investigate the quality of different models by changing hyperparameters. While Briefly describing the findings of the study.

I will compare several models by their accuracy on the validation set while changing their hyperparameters. The target is an accuracy of at least 0.75.

In [12]:
for depth in range(1, 6):
    model1 = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model1.fit(features_train, target_train)
    predictions_valid1 = model1.predict(features_valid)
    print("max_depth =", depth, ": ", end="")
    print(accuracy_score(target_valid, predictions_valid1))

max_depth = 1 : 0.7571606475716065
max_depth = 2 : 0.7808219178082192
max_depth = 3 : 0.7870485678704857
max_depth = 4 : 0.7820672478206725
max_depth = 5 : 0.7820672478206725


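A max_depth of 3 gives the best validation accuracy, 0.787, which is above the 0.75 threshold. It is also above what always predicting the more common plan would give: the mean of is_ultra is about 0.31, so roughly 69% of subscribers are on Smart. Depths 4 and 5 score slightly lower (0.782), and deeper trees are more prone to overfitting, so there is no reason to go beyond 3.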

In [13]:
best_est=0
best_score=0

for est in range(1, 11): 
    model2 = RandomForestClassifier(random_state=12345, n_estimators=est) 
    model2.fit(features_train, target_train)
    score2 = model2.score(features_valid, target_valid)
    if score2 > best_score:
        best_score = score2
        best_est = est
print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))

Accuracy of the best model on the validation set (n_estimators = 8): 0.7858032378580324


<div class="alert alert-success">
<h2>Reviewer's comment</h2>

You did a great job applying the functions and setting up the loops for <code>DecisionTreeClassifier</code> and <code>RandomForestClassifier</code>. Calculating the validation accuracy for each configuration clearly demonstrates your understanding of how the models are trained and evaluated.
</div>

An n_estimators of 8 gives the best score, but the random forest does not do better than the decision tree: it scores 0.785 on the validation set, while the decision tree scores 0.787.

In [14]:
model3 = LogisticRegression(random_state=12345, solver="liblinear")
model3.fit(features_train, target_train)
score_train = model3.score(features_train, target_train)
score_valid = model3.score(features_valid, target_valid)
print("Accuracy of the logistic regression model on the training set:", score_train)
print("Accuracy of the logistic regression model on the validation set:", score_valid)

Accuracy of the logistic regression model on the training set: 0.7423771001866832
Accuracy of the logistic regression model on the validation set: 0.7484433374844334


<div class="alert alert-success">
<h2> Reviewer's comment</h2>

Great effort on your **logistic regression** implementation. Calculating the accuracy for both the training and validation sets demonstrates that you are checking the model’s performance properly.

</div>

The logistic regression model did the worst, with a validation accuracy of 0.748, just below the 0.75 threshold.

Although not all models beat the 0.75 threshold, the best model was the decision tree, with a score of 0.787. This model will be used for the project. All three models can handle a binary target such as is_ultra (only 0 or 1), so the choice rests on validation accuracy.

<div class="alert alert-danger">
<h2> Reviewer's comment</h2>
I noticed that, for hyperparameter tuning, you have worked with only one parameter in the models. Functions like <code>DecisionTreeClassifier</code> and <code>RandomForestClassifier</code> offer several additional parameters that can significantly affect the model’s performance.

Here is the official documentation for each function so you can experiment with the available options:

- **DecisionTreeClassifier:** <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html">DecisionTreeClassifier documentation</a>

- **RandomForestClassifier:** <a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html">RandomForestClassifier documentation</a>

Trying different combinations of hyperparameters can help you better understand how they influence the model’s behavior, and in general, it is a recommended practice to optimize performance. Adjusting only one hyperparameter may limit the model’s potential and affect its behavior. **You could consider experimenting with at least three different hyperparameters for each model and observe how the results change.**

</div>
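
A minimal sketch of tuning three hyperparameters at once for a decision tree is shown below, using the same validation-set approach as the loops above. The data is synthetic (an assumption, so the example runs on its own), and the choice of `max_depth`, `min_samples_leaf`, and `criterion` and their value ranges is illustrative, not a recommendation.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 4 features, like the real dataset
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1,
    random_state=12345,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

best_score = 0.0
best_params = None

# Try every combination of the three hyperparameters
for depth, min_leaf, criterion in product(
    range(1, 8), (1, 5, 20), ("gini", "entropy")
):
    model = DecisionTreeClassifier(
        random_state=12345,
        max_depth=depth,
        min_samples_leaf=min_leaf,
        criterion=criterion,
    )
    model.fit(X_train, y_train)
    score = model.score(X_valid, y_valid)  # accuracy on the validation set
    if score > best_score:
        best_score = score
        best_params = {
            "max_depth": depth,
            "min_samples_leaf": min_leaf,
            "criterion": criterion,
        }

print(best_params, best_score)
```

The same pattern could be reused for `RandomForestClassifier`, for example by looping over `n_estimators`, `max_depth`, and `min_samples_leaf`.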

## Checking the quality of the model using the test set.

I will check the quality of the final model with the use of the test set.

In [15]:
final_model = DecisionTreeClassifier(random_state=12345, max_depth=3)
final_model.fit(features_train, target_train)

In [16]:
predictions = final_model.predict(features_test)

In [17]:
def error_count(answers, predictions):
    count = 0
    for i in range(len(answers)):
        if answers[i] != predictions[i]:
            count += 1
    return count
target_test = target_test.reset_index(drop=True)
print('Errors:', error_count(target_test, predictions))

Errors: 167


The function raised an error on target_test until its index was reset. This was most likely because the index labels left over from the split did not line up with the positions in the predictions array, so I reset the index of target_test to align it with predictions.

<div class="alert alert-success">
<h2>Reviewer's comment</h2>

You handled the <code>reset_index(drop=True)</code> on <code>target_test</code> very effectively. This step ensures that the indices of the actual data align with your predictions, so the comparisons in your <code>error_count</code> function are correct. Resetting the index is a simple but helpful step that makes sure your model evaluation works as intended.
</div>
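
As an alternative to resetting the index, the comparison can be done by position. The sketch below uses small made-up values (an assumption) with a shuffled index, similar to what `train_test_split` leaves behind; with such an index, `answers[i]` looks up the label `i` rather than the i-th element, which is the likely cause of the original error.

```python
import numpy as np
import pandas as pd

# Made-up labels with a shuffled index, as after train_test_split
answers = pd.Series([0, 1, 0, 1, 1], index=[42, 7, 19, 3, 88])
predictions = np.array([0, 1, 1, 1, 0])

# Convert to a NumPy array so the comparison is by position, not by label
errors = int((answers.to_numpy() != predictions).sum())
accuracy = 1 - errors / len(answers)

print("Errors:", errors)
print("Accuracy:", accuracy)
```

In this notebook, `accuracy_score(target_test, predictions)` from sklearn.metrics, which is already imported, would give the accuracy without a custom function.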

In [18]:
def accuracy(answers, predictions):
    new = len(answers) - error_count(answers, predictions)
    new = new/len(answers)
    return new

print('Accuracy:', accuracy(target_test, predictions))

Accuracy: 0.7922885572139303


With an accuracy of about 0.79 (roughly 8 out of 10), this model isn't perfect, but it clearly does better than random guessing (5 out of 10).

<div class="alert alert-success">
<h2> Reviewer's comment</h2>

Good work! You have correctly calculated the number of errors and, from that, the accuracy using the test data. This shows that the <code>error_count</code> function is working correctly and that your predictions are being compared properly with the actual values.

</div>

## Additional task: sanity check the model.

Checking how far off we are by using the mean squared error function in sklearn and taking the square root of the result.

In [19]:
result = mean_squared_error(target_test, predictions)
print(result)

0.20771144278606965


In [20]:
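rmse = result ** 0.5  # RMSE is the square root of the MSE, not its square
print(round(rmse, 4))

0.4558


The RMSE is about 0.456. Because the labels are only 0 and 1, the MSE of 0.2077 is simply the share of wrong predictions (167 of 804), so it restates the accuracy rather than adding new information.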

<div class="alert alert-danger">
<h2> Reviewer's comment</h2>

Thank you for your effort and the work you put in! I see that you calculated the <code>mean_squared_error</code> and <code>RMSE</code>, which is well done for regression problems. However, in this case we are working on a **classification** problem, so these metrics are not the most appropriate.

The goal of the optional “sanity check” task was to verify that your model could **outperform a very simple or random model**, such as one that always predicts the most frequent class or assigns classes at random. This helps ensure that the model is actually learning something useful and that the results are not due to chance.

For example, a simple sanity check for this classification task could be:  
1. Identify the most frequent class in your training set.  
2. Create predictions for the test set where every instance is assigned this most frequent class.  
3. Calculate the accuracy of this simple reference model.  

If your actual model’s accuracy is higher than this simple reference model, it indicates that it is learning meaningful patterns from the data. You could also try a completely random prediction and see that your model performs better than random guessing.

</div>
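
The three steps listed above can be sketched with scikit-learn's `DummyClassifier`, which with `strategy="most_frequent"` always predicts the most common class seen in training. The data here is synthetic (an assumption, roughly 70/30 between the classes so the example runs on its own); in the notebook the same calls could be applied to `features_train`, `target_train`, `features_test`, and `target_test`.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 4 features, about 70% in class 0
X, y = make_classification(
    n_samples=800, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=12345,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y
)

# Baseline: always predict the most frequent class of the training set
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

# Model under test: a shallow decision tree, as in this project
tree = DecisionTreeClassifier(random_state=12345, max_depth=3)
tree.fit(X_train, y_train)
tree_acc = accuracy_score(y_test, tree.predict(X_test))

print("Baseline accuracy:", baseline_acc)
print("Decision tree accuracy:", tree_acc)
```

For the real data, the mean of is_ultra reported by `describe()` is about 0.31, so a constant "Smart" prediction would be right for roughly 69% of all rows; the exact share in the test set may differ slightly. The decision tree's test accuracy of 0.792 is above that figure.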

## Conclusion

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.

I have successfully developed a model that assigns the appropriate plan based on existing user behavior. The conclusions are as follows:

- There are no missing values or duplicate rows in the dataset, and all data types are correct.

- I split the data as follows: 50% in the training set, 25% in the validation set, and 25% in the test set. It is standard practice to give the largest portion of the data to training, so the model can be as accurate as possible.

- Although not all models beat the 0.75 threshold, the best model was the decision tree with a max_depth of 3, with a validation score of 0.787. This model was used for the project. All three models can handle a binary target such as is_ultra (only 0 or 1), so the choice rests on validation accuracy.

- The model's accuracy on the test set is about 0.79 (roughly 8 out of 10). The model isn't perfect, but it clearly does better than random guessing (5 out of 10).

- The RMSE on the test set, that is, the square root of the MSE of 0.2077, is about 0.456.

<div class="alert alert-success">
<h2> Reviewer's comment</h2>

You did a solid job summarizing your project in the conclusions. You highlighted the key points clearly, including data quality, train-validation-test splitting, model performance, and interpretation of the results. Your explanations show a good understanding of the steps you took and the outcomes of your analysis. Well done completing the project thoroughly!

</div>