<div style="border:solid blue 2px; padding: 20px"> 

<strong>Reviewer's Introduction</strong>

Hello Collin! 👋 

I'm Han, your reviewer for this project (hanlee_97297 on Discord). I'm happy to review your project today.

I will categorize my comments in green, blue or red boxes like this:

<div class="alert alert-success">
    <b>Success:</b> Everything is done successfully.
</div>
<div class="alert alert-warning">
    <b>Remarks:</b> Suggestions for optimizations.
</div>
<div class="alert alert-danger">
    <b>Needs fixing:</b> This must be fixed for a project to be approved.
</div>

Please don't remove my comments :) If you have any questions or comments, don't hesitate to respond to my comments by creating a box that looks like this: 
<div class="alert alert-info"> <b>Student comment:</b> Your text here.</div>    
<br>


📌 Here's how to create code for student comments inside a Markdown cell:
    
    
    <div class="alert alert-info">
    <b> Student's comment</b>

    Your text here. 
    </div>

You can find out how to **format text** in a Markdown cell or how to **add links** [here](https://sqlbak.com/blog/jupyter-notebook-markdown-cheatsheet). 


<hr>
Don’t forget to rate your experience by leaving feedback here:  
<a href="https://form.typeform.com/to/msiTC4LB" target="_blank">https://form.typeform.com/to/msiTC4LB</a>
</div>



<div style="border: solid blue 2px; padding: 15px; margin: 10px">
<b>Reviewer's Comments – Iteration 1</b>

Congratulatio, Collin! You've successfully met all the requirements of this project. Your code is clean and readable. The project is now approved. ✅

---


<b>Nice work on:</b>  
✔️ Clear and thoughtful model selection and evaluation.<br.
✔️ Well-written conclusion tying together findings and business value.
</div>

<div class="alert alert-warning">
    <b>Reviewer's comment – Iteration 1:</b><br>
Consider writing an introductory section here.
</div>

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
df = pd.read_csv('/datasets/users_behavior.csv')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


<div class="alert alert-success">
    <b>Reviewer's comment – Iteration 1:</b><br>
Great start! All necessary libraries are correctly imported, and the initial data inspection using info() and head() is appropriately done.
</div>

In [4]:
#split the source data into training, validation, and test set
features = df.drop(['is_ultra'], axis = 1)
target = df['is_ultra']

features_train_val, features_test, target_train_val, target_test = train_test_split(
    features, target, test_size=0.2, random_state=12345
)

# Split train+val into train and val
features_train, features_val, target_train, target_val = train_test_split(
    features_train_val, target_train_val, test_size=0.25, random_state=12345
)

<div class="alert alert-success">
    <b>Reviewer's comment – Iteration 1:</b><br>
Correct. The data splitting is clearly implemented here.
</div>

In [19]:
# Decision Tree
best_model = None
best_result = 0
for depth in range(1, 11): 
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)
    predictions = model.predict(features_val)
    result = accuracy_score(target_val, predictions)
    if result > best_result:
        best_model = model
        best_result = result

print(f"Most accurate model \nMax Depth: {depth} \nAccuracy: {result}")

Most accurate model 
Max Depth: 10 
Accuracy: 0.7713841368584758


In [20]:
# Random Forest
best_score = 0
best_est = 0
for est in range(1, 11):
    model = RandomForestClassifier(random_state=12345, n_estimators=est)
    model.fit(features_train, target_train)
    score = model.score(features_val, target_val)
    if score > best_score:
        best_score = score
        best_est = est

print("Accuracy of the best model on the validation set (n_estimators = {}: {}".format(best_est, best_score))

Accuracy of the best model on the validation set (n_estimators = 10: 0.7884914463452566


<div class="alert alert-success">
    <b>Reviewer's comment – Iteration 1:</b><br>
Nice job on tuning the hyperparameters.
</div>

In [25]:
# Logistic Regression
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_train, target_train)
score_log = model.score(features_val, target_val)
print(f"Accuracy of Logistic Regression on the validation set: {score_log}")

Accuracy of Logistic Regression on the validation set: 0.6998444790046656


**SUMMARY**

- In short, the most accurate model is a Random Forest with 10 estimators. This gave us an accuracy rating of 78%.
- A normal Decision Tree with a depth of 10 produced a -1% difference so if a model's speed is important to the client, one decision tree would suffice within margin of error. Although for a large phone company, 1% could equal hundreds of thousands of dollars lost so it is up to the client to decide. 

<div class="alert alert-success">
    <b>Reviewer's comment – Iteration 1:</b><br>
Nice and succinct analysis.
</div>

In [26]:
test_predictions = best_model.predict(features_test)
test_accuracy = accuracy_score(target_test, test_predictions)
print(test_accuracy)

0.7884914463452566


In [28]:
# Checking for Overfitting

train_score = best_model.score(features_train, target_train)
val_score = best_model.score(features_val, target_val)
test_score = best_model.score(features_test, target_test)

print(f"Train Accuracy: {train_score:.2f}")
print(f"Validation Accuracy: {val_score:.2f}")
print(f"Test Accuracy: {test_score:.2f}")

Train Accuracy: 0.85
Validation Accuracy: 0.77
Test Accuracy: 0.79


In [29]:
# Retrain best models using the best hyperparameters found earlier
dt_model = DecisionTreeClassifier(random_state=12345, max_depth=best_model.max_depth)
dt_model.fit(features_train, target_train)

rf_model = RandomForestClassifier(random_state=12345, n_estimators=best_est)
rf_model.fit(features_train, target_train)

lr_model = LogisticRegression(random_state=12345, solver='liblinear')  # liblinear is good for small datasets
lr_model.fit(features_train, target_train)

# Predict on the test set
dt_preds = dt_model.predict(features_test)
rf_preds = rf_model.predict(features_test)
lr_preds = lr_model.predict(features_test)

# Calculate accuracy scores
dt_acc = accuracy_score(target_test, dt_preds)
rf_acc = accuracy_score(target_test, rf_preds)
lr_acc = accuracy_score(target_test, lr_preds)

# Print results
print("Final Test Set Accuracy:")
print(f"Decision Tree (max_depth={best_model.max_depth}): {dt_acc:.2f}")
print(f"Random Forest (n_estimators={best_est}): {rf_acc:.2f}")
print(f"Logistic Regression: {lr_acc:.2f}")


Final Test Set Accuracy:
Decision Tree (max_depth=7): 0.79
Random Forest (n_estimators=10): 0.79
Logistic Regression: 0.70


**Final Model Evaluation Summary**

In this project, we explored user behavior data to predict whether a customer would choose the Ultra or Smart mobile plan (is_ultra). We split the dataset into training (60%), validation (20%), and test (20%) sets to ensure proper model evaluation and prevent data leakage.

Three classification models were trained and compared:

- Decision Tree Classifier

- Random Forest Classifier

- Logistic Regression

Each model was tuned using the validation set:

- The Decision Tree was optimized by varying max_depth

- The Random Forest was optimized using different values for n_estimators

- The Logistic Regression was used with default parameters

After selecting the best version of each model, we evaluated their final performance on the unseen test set. Below are the accuracy results:

Model
- Decision Tree	~79%
- Random Forest	~79%
- Logistic Regression	~70%

The results show that both a Decision Tree and Random Forest achieved the highest equal accuracy on the test set, making it the most suitable model for deployment in this context. Future improvements could include feature engineering, balancing the dataset (if class imbalance exists), and cross-validation for more robust evaluation.

<div class="alert alert-success">
    <b>Reviewer's comment – Iteration 1:</b><br>
Good analysis.
</div>