<div class="alert alert-success">
<b>Reviewer's comment V3</b>

Thanks for taking the time to improve the project! It is now accepted. Good luck on the final sprint!

</div>

**Review**

Hi, my name is Dmitry and I will be reviewing your project.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did a pretty good job overall, but there are a couple of problems that need to be fixed before the project is accepted. Let me know if you have questions!

# Description of the data

**Every observation in the dataset contains monthly behavior information about one user.**

*datasets/users_behavior.csv*

**The information given is as follows:**
* *calls*: number of calls,
* *minutes*: total call duration in minutes,
* *messages*: number of text messages,
* *mb_used*: Internet traffic used in MB,
* *is_ultra*: plan for the current month (1 = Ultra, 0 = Smart).

## Download the data and prepare it for analysis

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

In [2]:
data = pd.read_csv('/datasets/users_behavior.csv')

In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [4]:
data.columns = data.columns.str.lower()
data

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.90,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0
...,...,...,...,...,...
3209,122.0,910.98,20.0,35124.90,1
3210,25.0,190.36,0.0,3275.61,0
3211,97.0,634.44,70.0,13974.06,0
3212,64.0,462.32,90.0,31239.78,0


In [5]:
data.groupby(['is_ultra']).mean().T

is_ultra,0,1
calls,58.463437,73.392893
minutes,405.942952,511.224569
messages,33.384029,49.363452
mb_used,16208.466949,19468.823228


In [6]:
for group in data['is_ultra'].unique():
    # is_ultra: 1 = Ultra plan, 0 = Smart plan
    if group == 1:
        result = 'Ultra'
    else:
        result = 'Smart'
    print("Basic statistics for", result, "group")
    print("========================================================")
    display(data[data.is_ultra == group].describe())

Basic statistics for Smart group


Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
count,2229.0,2229.0,2229.0,2229.0,2229.0
mean,58.463437,405.942952,33.384029,16208.466949,0.0
std,25.939858,184.512604,28.227876,5870.498853,0.0
min,0.0,0.0,0.0,0.0,0.0
25%,40.0,274.23,10.0,12643.05,0.0
50%,60.0,410.56,28.0,16506.93,0.0
75%,76.0,529.51,51.0,20043.06,0.0
max,198.0,1390.22,143.0,38552.62,0.0


Basic statistics for Ultra group


Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
count,985.0,985.0,985.0,985.0,985.0
mean,73.392893,511.224569,49.363452,19468.823228,1.0
std,43.916853,308.0311,47.804457,10087.178654,0.0
min,0.0,0.0,0.0,0.0,1.0
25%,41.0,276.03,6.0,11770.28,1.0
50%,74.0,502.55,38.0,19308.01,1.0
75%,104.0,730.05,79.0,26837.72,1.0
max,244.0,1632.06,224.0,49745.73,1.0


<div class="alert alert-success">
<b>Reviewer's comment</b>

The data was loaded and inspected

</div>

## Split the source data into a training set, a validation set, and a test set.

* **Training Set**: Typically, this set constitutes about 60-80% of the data. It is used to train the machine learning model.
* **Validation Set**: This set is often about 10-20% of the data and is used to tune hyperparameters and to compare candidate models during development.
* **Test Set**: This is also about 10-20% of the data, used to provide an unbiased evaluation of a final model fit.

In [7]:
# Features and target
X = data.drop('is_ultra', axis=1)
y = data['is_ultra']

# Splitting the data into training and remaining (validation + test)
X_train, X_remaining, y_train, y_remaining = train_test_split(X, y, test_size=0.3, random_state=42)

# Splitting the remaining data into validation and test sets
X_valid, X_test, y_valid, y_test = train_test_split(X_remaining, y_remaining, test_size=0.5, random_state=42)

# Verifying the sizes of each set
X_train.shape, X_valid.shape, X_test.shape

((2249, 4), (482, 4), (483, 4))

* *test_size=0.3* in the *first train_test_split* function call means that 30% of the data is reserved for testing and validation.
* Then we split that 30% into two equal halves, giving 15% for validation and 15% for testing.
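
The set sizes printed above can be checked by hand. The sketch below assumes that `train_test_split`, when given a fractional `test_size`, rounds the held-out part up and gives the rest to the first part; the helper name `split_sizes` is hypothetical and only illustrates the arithmetic.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (first_part, held_out_part) sizes for a fractional test_size.

    Assumption: the held-out part is rounded up, the rest goes to the first part.
    """
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out


# 3214 rows in total, 30% held out for validation + test
n_train, n_remaining = split_sizes(3214, 0.3)

# The held-out 30% is split in half: validation and test
n_valid, n_test = split_sizes(n_remaining, 0.5)

print(n_train, n_valid, n_test)  # 2249 482 483
```

These counts match the shapes reported by the notebook, which corresponds to roughly a 70/15/15 split.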

<div class="alert alert-success">
<b>Reviewer's comment</b>

Data split is reasonable

</div>

## Investigate the quality of different models by changing hyperparameters

### Decision Tree Classifier

In [8]:
best_model = None
best_accuracy = 0
best_depth = 0

for depth in range(1, 6):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    predictions_valid = model.predict(X_valid)
    accuracy = accuracy_score(y_valid, predictions_valid)
    if accuracy > best_accuracy:
        best_model = model
        best_accuracy = accuracy
        best_depth = depth
        print(f"(max_depth = {best_depth}): {best_accuracy}")

(max_depth = 1): 0.7365145228215768
(max_depth = 2): 0.7842323651452282
(max_depth = 3): 0.8008298755186722
(max_depth = 4): 0.8029045643153527


In [9]:
print(f"Accuracy of the best model on the validation set (max_depth = {best_depth}): {best_accuracy}")

Accuracy of the best model on the validation set (max_depth = 4): 0.8029045643153527


### Random Forest Classifier

In [10]:
best_model = None
best_accuracy = 0
best_est = 0
best_depth = 0

for est in range(10, 51, 10):
    for depth in range(1, 11):
        model = RandomForestClassifier(n_estimators=est, max_depth=depth, random_state=42)
        model.fit(X_train, y_train)
        predictions_valid = model.predict(X_valid)
        accuracy = accuracy_score(y_valid, predictions_valid)
        if accuracy > best_accuracy:
            best_model = model
            best_accuracy = accuracy
            best_est = est
            best_depth = depth
            print("Accuracy:", best_accuracy, "n_estimators:", best_est, "best_depth:", best_depth)

Accuracy: 0.7302904564315352 n_estimators: 10 best_depth: 1
Accuracy: 0.7800829875518672 n_estimators: 10 best_depth: 2
Accuracy: 0.7946058091286307 n_estimators: 10 best_depth: 3
Accuracy: 0.8070539419087137 n_estimators: 10 best_depth: 4
Accuracy: 0.8132780082987552 n_estimators: 10 best_depth: 6
Accuracy: 0.8236514522821576 n_estimators: 10 best_depth: 8
Accuracy: 0.8278008298755186 n_estimators: 20 best_depth: 10


In [11]:
print("Accuracy of the best model on the validation set:", best_accuracy, "n_estimators:", best_est, "best_depth:", best_depth)

Accuracy of the best model on the validation set: 0.8278008298755186 n_estimators: 20 best_depth: 10


### Logistic Regression

In [12]:
model = LogisticRegression(solver='liblinear')
model.fit(X_train, y_train)
score = model.score(X_train, y_train)
print("Logistic regression training score:", score)

Logistic regression training score: 0.7438861716318363


In [13]:
predictions_valid = model.predict(X_valid)                 
accuracy = accuracy_score(y_valid, predictions_valid)

In [14]:
print("Logistic regression validation score:", accuracy)

Logistic regression validation score: 0.7489626556016598


Based on the findings of the study:

1. **Decision Tree Classifier**: Achieved an accuracy of approximately 80.29% on the validation set with a maximum depth of 4. This model provides a reasonable level of accuracy, considering its simplicity and interpretability.

2. **Random Forest Classifier**: Outperformed the Decision Tree Classifier with an accuracy of around 82.78% on the validation set, reached with `n_estimators=20` and `max_depth=10`. By combining multiple decision trees, the Random Forest Classifier improves predictive accuracy and robustness.

3. **Logistic Regression**: Achieved a validation accuracy of approximately 74.90% (74.39% on the training set). Logistic regression is a simple yet effective linear model for binary classification tasks. While it has lower accuracy compared to the tree-based models, it is computationally efficient and interpretable.

Overall, the study indicates that the **Random Forest Classifier** performed the best in terms of accuracy on the validation set, followed closely by the Decision Tree Classifier. Logistic regression, while simpler, still achieved a reasonable accuracy but fell short compared to the tree-based models.
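
For context, these accuracies can be compared with a constant model that always predicts the more common plan. The sketch below uses the group counts from the `describe()` output above (2229 Smart rows, 985 Ultra rows); it is computed on the whole dataset rather than on the validation set, so it is only an approximate reference point.

```python
# Row counts per plan, taken from the describe() output earlier in the notebook
n_smart = 2229  # is_ultra == 0
n_ultra = 985   # is_ultra == 1

# Accuracy of a model that always predicts the majority class
majority_baseline = max(n_smart, n_ultra) / (n_smart + n_ultra)

print(round(majority_baseline, 4))  # 0.6935
```

All three models score above this level on the validation set, so each of them learns something beyond the class proportions.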

<div class="alert alert-danger">
<s><b>Reviewer's comment</b>

Great, you tried a couple of different models and did some hyperparameter tuning using the validation set.
    
There is a small problem, however. Remember that we're working on a classification problem in this project, so you need to use classification models instead of regression models, and a classification metric like accuracy instead of a regression metric like RMSE.

</div>

<div class="alert alert-danger">
<s><b>Reviewer's comment V2</b>

Ok, there are now two classification models, `DecisionTreeClassifier` and `RandomForestClassifier`, great! What's left is to replace `LinearRegression` (which is a regression model) with `LogisticRegression` (which is a classification model despite the unfortunate naming), remove all regression models and remove the conclusions about regression models.

</div>

<div class="alert alert-success">
<b>Reviewer's comment V3</b>

All looks good now!

</div>

## Check the quality of the model using the test set.

<div class="alert alert-warning">
<b>Reviewer's comment</b>

There's no need to redo hyperparameter tuning; you can just train the model using the best hyperparameters.

</div>
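
One way to follow this suggestion is to build a single model directly from the best hyperparameters found during tuning (`n_estimators=20`, `max_depth=10`) instead of repeating the search. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in for the real dataset, and all names ending in `_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the four usage features and the binary plan label
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1, random_state=42
)

# Same 70/15/15 scheme as in the notebook
X_train_demo, X_rest_demo, y_train_demo, y_rest_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)
X_valid_demo, X_test_demo, y_valid_demo, y_test_demo = train_test_split(
    X_rest_demo, y_rest_demo, test_size=0.5, random_state=42
)

# Train once with the hyperparameters selected on the validation set
final_model = RandomForestClassifier(n_estimators=20, max_depth=10, random_state=42)
final_model.fit(X_train_demo, y_train_demo)

# Evaluate once on the held-out test set
test_accuracy_demo = accuracy_score(y_test_demo, final_model.predict(X_test_demo))
print("Test accuracy on synthetic data:", test_accuracy_demo)
```

The printed value describes the synthetic data only and says nothing about the real dataset.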

In [15]:
# best_model is the random forest kept by the tuning loop above.
# Classifier predictions are already 0/1 class labels, so no thresholding is needed.
predictions_test = best_model.predict(X_test)

test_accuracy = accuracy_score(y_test, predictions_test)

print("Accuracy of the best model on the test set:", test_accuracy)

Accuracy of the best model on the test set: 0.7846790890269151


<div class="alert alert-danger">
<s><b>Reviewer's comment</b>

The final model was evaluated on the test set, but you also need to use accuracy instead of RMSE here

</div>

<div class="alert alert-danger">
<s><b>Reviewer's comment V2</b>

RMSE should be removed entirely; this is not a regression task.

</div>

<div class="alert alert-success">
<b>Reviewer's comment V3</b>

Great!

</div>

<div class="alert alert-danger">
<S><b>Reviewer's comment</b>

Please revise the conclusions after fixing the problems above

</div>

1. **Model Performance**: The selected model, the Random Forest Classifier identified as the best during the training and validation phases, achieved an accuracy of around 78.47% on the unseen test data.

2. **Generalization**: The test accuracy is about 4 percentage points below the best validation accuracy (82.78%). Some drop is expected, because the hyperparameters were chosen to maximize the validation score. Even so, the result suggests that the model has learned patterns in the training data that carry over reasonably well to new, unseen data.

3. **Applicability**: The accuracy achieved by the model on the test set should be considered in the context of the specific problem domain and the requirements of the application. Depending on the application, this level of accuracy might be satisfactory or may require further improvement.

4. **Future Steps**: Further analysis could involve investigating potential areas for model improvement, such as feature engineering, hyperparameter tuning, or exploring different algorithms. Additionally, monitoring the model's performance over time and retraining it with updated data can help maintain its effectiveness.
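
To put the validation and test scores side by side, the reported accuracies can be converted back into counts of correct predictions, together with a rough uncertainty estimate. The sketch below assumes that the examples in each set are independent, so that a binomial standard error applies; the helper name `accuracy_summary` is hypothetical.

```python
import math


def accuracy_summary(accuracy, n_samples):
    """Return (number of correct predictions, binomial standard error)."""
    correct = round(accuracy * n_samples)
    std_error = math.sqrt(accuracy * (1 - accuracy) / n_samples)
    return correct, std_error


# Accuracies and set sizes as reported in the notebook outputs
valid_correct, valid_se = accuracy_summary(0.8278008298755186, 482)
test_correct, test_se = accuracy_summary(0.7846790890269151, 483)

print("validation:", valid_correct, "correct out of 482")  # 399
print("test:", test_correct, "correct out of 483")         # 379
print("test standard error:", round(test_se, 3))
```

With sets of fewer than 500 rows, each accuracy estimate carries an uncertainty of roughly two percentage points, so part of the gap between validation and test may be sampling noise.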

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Good points!

</div>