**Review**

Hello Joshua!

I'm happy to review your project today.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>

# Sprint 7

# Project Description
Develop a model that picks the right plan with the highest possible accuracy. In this project, the accuracy threshold is 0.75.

In [1]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
# Load the data file
df = pd.read_csv('/datasets/users_behavior.csv')

In [3]:
# Look at the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


<div class="alert alert-danger">
<b>Reviewer's comment V1</b>

Before working with the data, you need to at least look at it. The methods head, info, and describe can help you here.
    
</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Correct
    
</div>

### Split the source data into a training set, a validation set, and a test set.

In [4]:
# Split features and target
X = df.drop('is_ultra', axis=1)
y = df['is_ultra']

# First split: 80% temp, 20% test
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Second split: 60% train, 20% valid (from 80% temp → 75% train, 25% valid)
X_train, X_valid, y_train, y_valid = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp)

# Check the sizes of each set
X_train.shape, X_valid.shape, X_test.shape

((1928, 4), (643, 4), (643, 4))
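The two-step split above can be checked with a short sketch like the one below. It is not part of the original notebook: the helper name `split_60_20_20` is hypothetical, and the data is a synthetic stand-in with the same row count as `users_behavior.csv` (the 30% positive rate is made up). It confirms the subset sizes and shows that `stratify` keeps the class share nearly identical across subsets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_60_20_20(X, y, seed=42):
    """Split into train/valid/test at roughly 60/20/20, stratified by the target."""
    X_temp, X_test, y_temp, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    X_train, X_valid, y_train, y_valid = train_test_split(
        X_temp, y_temp, test_size=0.25, random_state=seed, stratify=y_temp
    )
    return X_train, X_valid, X_test, y_train, y_valid, y_test


# Synthetic stand-in for the real data: 3214 rows, 4 features, binary target.
# The 30% positive rate is an assumption made for illustration only.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(
    rng.normal(size=(3214, 4)),
    columns=["calls", "minutes", "messages", "mb_used"],
)
y_demo = pd.Series((rng.random(3214) < 0.3).astype(int), name="is_ultra")

X_tr, X_va, X_te, y_tr, y_va, y_te = split_60_20_20(X_demo, y_demo)

split_sizes = {"train": len(X_tr), "valid": len(X_va), "test": len(X_te)}
# Share of the positive class in each subset; stratification keeps these close.
positive_share = {
    "full": y_demo.mean(),
    "train": y_tr.mean(),
    "valid": y_va.mean(),
    "test": y_te.mean(),
}

print(split_sizes)
print({name: round(share, 3) for name, share in positive_share.items()})
```

On the real data, the same check would use `y.mean()`, `y_train.mean()`, and so on in place of the synthetic series.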

<div class="alert alert-success">
<b>Reviewer's comment V1</b>

Good job!
    
</div>

### Developing a Model with the Highest Accuracy

In [5]:
# Train a Random Forest Classifier
model = RandomForestClassifier(random_state=42, n_estimators=100)
model.fit(X_train, y_train)

# Validate on the validation set
y_valid_pred = model.predict(X_valid)
validation_accuracy = accuracy_score(y_valid, y_valid_pred)

# Evaluate on the test set
y_test_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)

validation_accuracy, test_accuracy

(0.7822706065318819, 0.8102643856920684)

In [6]:
# 1. Train and validate a Logistic Regression model
logreg_model = LogisticRegression(random_state=42, max_iter=1000)
logreg_model.fit(X_train, y_train)
y_valid_pred_logreg = logreg_model.predict(X_valid)
logreg_val_acc = accuracy_score(y_valid, y_valid_pred_logreg)

# 2. Hyperparameter tuning for Random Forest
param_grid = {
    'n_estimators': [100, 150],
    'max_depth': [None, 10, 20]
}

X_combined = pd.concat([X_train, X_valid])
y_combined = pd.concat([y_train, y_valid])

grid_search_rf = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid_search_rf.fit(X_combined, y_combined)
best_rf_model = grid_search_rf.best_estimator_

# 3. Evaluate the best Random Forest model on the test set
y_test_pred_best_rf = best_rf_model.predict(X_test)
best_rf_test_accuracy = accuracy_score(y_test, y_test_pred_best_rf)

logreg_val_acc, grid_search_rf.best_params_, best_rf_test_accuracy

(0.744945567651633, {'max_depth': 10, 'n_estimators': 150}, 0.8149300155520995)
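One way to make the model choice explicit is to score every candidate on the validation set, pick a single winner, and only then evaluate that winner on the test set. The sketch below illustrates that workflow under stated assumptions: it uses synthetic data from `make_classification` rather than the project dataset, and the `candidates` dictionary and its keys are hypothetical names, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data, used only so the sketch runs on its own.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1, random_state=42
)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000, random_state=42),
}

# Score every candidate on the validation set only.
valid_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    valid_scores[name] = accuracy_score(y_valid, model.predict(X_valid))

# Pick the single best candidate, then evaluate it on the test set once.
best_name = max(valid_scores, key=valid_scores.get)
test_accuracy = accuracy_score(y_test, candidates[best_name].predict(X_test))

print(valid_scores)
print(best_name, test_accuracy)
```

Keeping the test set out of the comparison loop means the final test accuracy is not used to choose between models.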

<div class="alert alert-danger">
<b>Reviewer's comment V1</b>

Correct. But:
    
1. You need to try at least one more model.
2. You need to tune hyperparameters for at least one model.
3. You need to select a single best model and evaluate it on the test data.
    
</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Well done!
    
</div>

### Model Findings

What Was Accomplished
Model 1: Random Forest Classifier

Initial model (100 trees) reached 0.782 validation accuracy and 0.81 test accuracy.

Strong generalization and good performance out-of-the-box.

Model 2: Logistic Regression

Trained on the same training set and scored on the same validation set.

Validation accuracy: 0.745, just below the 0.75 threshold and below the Random Forest's 0.782.

A linear model like this typically performs slightly worse on complex, non-linear datasets.

Hyperparameter Tuning

A grid search for RandomForestClassifier over n_estimators (100, 150) and max_depth (None, 10, 20) was run with 5-fold cross-validation on the combined training and validation sets. The best parameters were max_depth=10 and n_estimators=150.
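A fitted `GridSearchCV` object exposes more than the winning parameters, and looking at the full score table helps judge how much the tuning actually mattered. The following is a minimal sketch on synthetic data, not the project's actual run: the grid and `cv=3` are deliberately smaller than in the notebook so that it finishes quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only so the sketch runs on its own.
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=42
)

# Deliberately small grid and cv=3 to keep the run short (an assumption of this sketch).
param_grid = {"n_estimators": [10, 20], "max_depth": [None, 5]}
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

# One row per parameter combination, best rank first.
results = (
    pd.DataFrame(search.cv_results_)[
        ["param_n_estimators", "param_max_depth", "mean_test_score", "rank_test_score"]
    ]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)

print(search.best_params_)
print(round(search.best_score_, 3))
print(results)
```

If the mean cross-validated scores in `results` differ only in the third decimal place, the default parameters are likely about as good as the tuned ones.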

Final Model Selection
Best model: Random Forest Classifier with tuned parameters.

Why: It achieved 0.815 accuracy on the test set, exceeding the target of 0.75.

Logistic Regression, while simpler and easier to interpret, reached only 0.745 validation accuracy and did not outperform Random Forest.

