<div style="border:solid green 2px; padding: 20px">
    
<b>Hello, Brandon!</b> We're glad to see you in code-reviewer territory. You've done a great job on the project, but let's get to know each other and make it even better! We have our own atmosphere here and a few rules:


1. My name is Alexander Matveevsky. I work as a code reviewer, and my main goal is not to point out your mistakes, but to share my experience and help you become a data analyst.
2. We're on a first-name basis here.
3. If you want to write or ask a question, don't be shy. Just choose your own color for your comments.  
4. This is a training project, so you don't have to be afraid of making a mistake.  
5. You have an unlimited number of attempts to pass the project.  
6. Let's Go!


---
I'll be color-coding my comments; please don't delete them:

<div class="alert alert-block alert-danger">✍
    

__Reviewer's comment №1__

Needs fixing. The block requires some corrections. The work can't be accepted while red comments remain.
</div>
    
---

<div class="alert alert-block alert-warning">📝
    

__Reviewer's comment №1__


Remarks. Some recommendations.
</div>

---

<div class="alert alert-block alert-success">✔️
    

__Reviewer's comment №1__

Success. Everything is done successfully.
</div>
    
---
    
I suggest that we work on the project in dialogue: if you change something in the project or respond to my comments, write about it. It will be easier for me to track changes if you highlight your comments:   
    
<div class="alert alert-info"> <b>Student comments:</b> Student answer.</div>
    
All this will help to make the recheck of your project faster. If you have any questions about my comments, let me know, we'll figure it out together :)   
    
---

<div class="alert alert-block alert-danger">✍
    

__Reviewer's comment №1__

An excellent practice is to describe the goal and main steps in your own words (a skill that will help a lot on a final project). It would be good to add the progress and purpose of the study.

<div class="alert alert-info"> <b>Student comments:</b> The goal of this project is to develop a classification model that recommends one of Megaline's newer plans (Smart or Ultra) to subscribers based on their behavior data. This model aims to improve customer satisfaction and retention by suggesting the most suitable plan for each user.</div>

<div class="alert alert-block alert-success">✔️
    

__Reviewer's comment №2__

Great!

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Load Data

In [2]:
df = pd.read_csv('/datasets/users_behavior.csv')
df.head()

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


# Split Data

In [4]:
x = df.drop('is_ultra', axis=1)  # Features: every column except the target 'is_ultra'
y = df['is_ultra']

In [5]:
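# First split: 60% for training, 40% held out
x_train, x_temp, y_train, y_temp = train_test_split(x, y, test_size=0.4, random_state=42)
# Second split: halve the held-out part into validation and test (20% each)
x_val, x_test, y_val, y_test = train_test_split(x_temp, y_temp, test_size=0.5, random_state=42)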

<div class="alert alert-block alert-danger">✍
    

__Reviewer's comment №1__

1. This is good: random_state is fixed, so splitting the data into training, validation, and test sets is reproducible, and the subsets will be identical on every subsequent run of the code.
    
2. A train/validation/test ratio of 3:1:1 is good.


</div>
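
The 3:1:1 ratio can also be checked by hand. Below is a minimal sketch, assuming that scikit-learn rounds a fractional `test_size` up to a whole number of samples and gives the remainder to the training part. The helper `split_sizes` is hypothetical: it only mirrors that arithmetic and does not call scikit-learn.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out) for a fractional test_size.

    Assumption: the held-out part is rounded up, the rest is kept.
    """
    n_held_out = math.ceil(test_size * n_samples)
    return n_samples - n_held_out, n_held_out


# 3214 rows, as reported by df.info()
n_train, n_temp = split_sizes(3214, 0.4)   # first split: train vs. held out
n_val, n_test = split_sizes(n_temp, 0.5)   # second split: validation vs. test

print(n_train, n_val, n_test)  # 1928 643 643
```

With 1928 / 643 / 643 rows the proportions are very close to 3:1:1; in the notebook itself, `len(x_train)`, `len(x_val)`, and `len(x_test)` would confirm the actual sizes.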

<div class="alert alert-info"> <b>Student comments:</b> updated </div>

<div class="alert alert-block alert-success">✔️
    

__Reviewer's comment №2__

Now it's correct.

# Hyperparameter Tuning

In [6]:
# Random forest: pick n_estimators by validation accuracy
best_score = -1
for est in range(1, 10):
    model = RandomForestClassifier(random_state=42, n_estimators=est)
    model.fit(x_train, y_train)
    score = model.score(x_val, y_val)
    if score > best_score:
        best_score = score
        best_est = est

print("Best Random Forest model: n_estimators = {}, Accuracy = {}".format(best_est, best_score))

Best Random Forest model: n_estimators = 4, Accuracy = 0.7916018662519441


In [7]:
# Logistic regression: pick the solver by validation accuracy
best_score = -1
solvers = ['liblinear', 'lbfgs']

for solver in solvers:
    model = LogisticRegression(max_iter=1000, random_state=42, solver=solver)
    model.fit(x_train, y_train)
    score = model.score(x_val, y_val)
    
    if score > best_score:
        best_score = score
        best_solver = solver
print("Best Logistic Regression model: Solver = {}, Accuracy = {}".format(best_solver, best_score))

Best Logistic Regression model: Solver = lbfgs, Accuracy = 0.7402799377916018


In [8]:
# SVC: pick the regularization parameter C by validation accuracy
best_score = -1
for c in range(1, 10):
    model = SVC(random_state=42, C=c)
    model.fit(x_train, y_train)
    score = model.score(x_val, y_val)
    if score > best_score:
        best_score = score
        best_c = c

print("Best SVC model: C = {}, Accuracy = {}".format(best_c, best_score))

Best SVC model: C = 1, Accuracy = 0.7418351477449455
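
The three tuning loops above share one pattern: score each candidate on the validation set and keep the best. A minimal sketch of that pattern as a reusable helper follows. The names `best_on_validation` and `toy_score` are hypothetical and not part of the project; `toy_score` merely stands in for "fit the model and score it on the validation set".

```python
def best_on_validation(candidates, score_fn):
    """Return (best_candidate, best_score); ties keep the earliest candidate."""
    best_candidate, best_score = None, float("-inf")
    for candidate in candidates:
        score = score_fn(candidate)
        if score > best_score:  # strict '>' mirrors the loops above
            best_candidate, best_score = candidate, score
    return best_candidate, best_score


# Hypothetical scoring function that peaks at 4, used only for illustration.
def toy_score(n_estimators):
    return 1.0 - abs(n_estimators - 4) / 10


print(best_on_validation(range(1, 10), toy_score))  # (4, 1.0)
```

In the notebook, `score_fn` could be something like `lambda est: RandomForestClassifier(random_state=42, n_estimators=est).fit(x_train, y_train).score(x_val, y_val)`, which relies on `fit` returning the fitted estimator.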


# Train and Test Models

In [9]:
classifiers = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42, solver='lbfgs'),
    'Random Forest': RandomForestClassifier(random_state=42, n_estimators=4),
    'Support Vector Machine': SVC(random_state=42, C=1)
}

In [10]:
for name, clf in classifiers.items():
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_val)
    accuracy = accuracy_score(y_val, y_pred)
    precision = precision_score(y_val, y_pred)
    recall = recall_score(y_val, y_pred)
    print(f'{name} accuracy: {accuracy:.3f}, precision: {precision:.3f}, recall: {recall:.3f}')

Logistic Regression accuracy: 0.740, precision: 0.778, recall: 0.213
Random Forest accuracy: 0.792, precision: 0.740, recall: 0.492
Support Vector Machine accuracy: 0.742, precision: 0.804, recall: 0.208


The Random Forest model has the highest accuracy of the three (0.792) and by far the highest recall (0.492), meaning it captures a much larger share of the actual positive samples. 
The SVC model has the highest precision (0.804) but a recall of only 0.208, and Logistic Regression behaves similarly (precision 0.778, recall 0.213). 
Both of these models trade recall for precision: their positive predictions are usually right, but they miss most of the actual Ultra users.
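
To make the precision/recall trade-off concrete, here is a minimal sketch that computes the three metrics from raw labels in plain Python. The labels are hypothetical (not taken from the dataset), and reading 0 as "Smart" is an assumption based on the project goal; the helper `binary_metrics` is illustrative only.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical labels: 1 = Ultra, 0 = Smart (assumed meaning of is_ultra)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # a "cautious" model

accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(f"accuracy: {accuracy:.3f}, precision: {precision:.3f}, recall: {recall:.3f}")
# accuracy: 0.700, precision: 1.000, recall: 0.250
```

The cautious model is never wrong when it predicts Ultra (precision 1.0) but finds only one of the four Ultra users (recall 0.25), which is the same shape of result as the Logistic Regression and SVC rows above.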

<div class="alert alert-block alert-danger">✍
    

__Reviewer's comment №1__

On the `test` set we evaluate only one model: the best one according to the validation results. This mirrors how models run in production, where several models do not work simultaneously; only the one selected during the intermediate evaluation is deployed. It is the same here: the test sample simulates a real data stream, and only one model should work with this stream.
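
A minimal sketch of that selection step, using the validation accuracies printed in the output above (the dictionary `val_accuracy` is just a hypothetical container for those numbers):

```python
# Validation accuracies copied from the notebook output above.
val_accuracy = {
    "Logistic Regression": 0.740,
    "Random Forest": 0.792,
    "Support Vector Machine": 0.742,
}

# Choose exactly one model by its validation score.
best_name = max(val_accuracy, key=val_accuracy.get)
print(best_name)  # Random Forest
```

Only the model chosen this way is then scored on `x_test`, and only once.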

# Best Model

In [11]:
model = RandomForestClassifier(random_state=42, n_estimators=4)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)

In [12]:
print(f'{accuracy:.3f}')

0.801


In [13]:
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
print(f'{precision:.3f}, {recall:.3f}')

0.756, 0.508


The model performs consistently on both the validation and test sets: accuracy (0.792 vs. 0.801), precision (0.740 vs. 0.756), and recall (0.492 vs. 0.508) are all similar. It is able to maintain its performance when evaluated on unseen data.

<div class="alert alert-info"> <b>Student comments:</b> updated </div>

# Sanity Check

In [14]:
edge_case_values = {
    'calls': 3214,
    'minutes': 1500,
    'messages': 200,
    'mb_used': 10
}

edge_case_df = pd.DataFrame([edge_case_values])

print('Edge Case Values:')
print(edge_case_df)

Edge Case Values:
   calls  minutes  messages  mb_used
0   3214     1500       200       10


In [15]:
for name, clf in classifiers.items():
    prediction_edge_case = clf.predict(edge_case_df)
    print(f'Prediction for {name}:', prediction_edge_case)

Prediction for Logistic Regression: [0]
Prediction for Random Forest: [1]
Prediction for Support Vector Machine: [0]


The edge-case values I am using combine high calls and minutes, moderate messages, and low mb_used. Both Logistic Regression and SVC predict 0 (False) for is_ultra, meaning that for this case the models think the user is not on the Ultra plan, while Random Forest predicts 1 (True), meaning that it thinks the user is on the Ultra plan.

Such discrepancies in predictions can occur due to various reasons, including differences in the model's learning algorithm, feature importance, and how well the models generalize to new data.

<div class="alert alert-block alert-success">✔️
    

__Reviewer's comment №2__


Otherwise it's great😊. Your project is begging for GitHub =)   
    
Congratulations on the successful completion of the project 😊👍
And I wish you success in your future projects 😊