<div style="border:solid blue 2px; padding: 20px">

**Overall Summary of the Project**

Hi Michael,

**Overall**, you’ve built a solid end-to-end classification pipeline that:
- Splits data into 60/20/20 train/valid/test  
- Tunes a Decision Tree (depth, leaf, split)  
- Tunes two Random Forest variants  
- Compares test accuracies and picks the best trade-off model  

Your top Random Forest achieved ~0.84 accuracy, comfortably above the 0.75 target.  

---

**👍 Strengths**
    
- Clear 3:1:1 splitting and reproducible `random_state`  
- Thorough hyperparameter search for Decision Tree and Random Forest  
- Uses the held-out test set to report final model accuracy  
- Thoughtful discussion of speed vs. accuracy trade-off  

---

**⚠️ Areas for improvement**

1. **Leakage in Decision Tree tuning**  
   In your Decision Tree loop you scored on the **test** set rather than the validation set. That means your “best_result” was chosen using test data—data that should remain unseen until the final evaluation.  
   - **Fix**: Use `features_valid`/`target_valid` to pick hyperparameters, then evaluate the chosen model once on `features_test`.

2. **Baseline sanity check**  
   We need to confirm that your best model beats a trivial baseline (e.g. always predict the majority class).  
   - **Add** after your final test evaluation:  
     <code>
     from sklearn.dummy import DummyClassifier  
     dummy = DummyClassifier(strategy="most_frequent", random_state=2356)  
     dummy.fit(features_train, target_train)  
     print("Baseline accuracy:", dummy.score(features_test, target_test))  
     </code>  

3. **Explicit final test evaluation**  
   While you reported test accuracies in loops, please retrain your chosen model on **train+validation** and then report its single accuracy on `features_test`. This ensures no hyperparameter information leaked from test.  

4. **Clear model‐selection code**  
   Your loops mix “best_model” and “best_est” variables. For clarity, track each model’s name and accuracy in a dict, then pick the best.  

---

**✅ What’s required for approval**

- **Retune Decision Tree on validation only**, not test  
- **Add baseline DummyClassifier** accuracy on the test set  
- **Retrain final chosen model on train+valid** and report its single test accuracy  

Once you make those changes, your project will fully meet the criteria. Let me know if you have any questions!  

</div>

<div class="alert alert-block alert-info">
Hello, I retuned my Decision Tree Classifier on the validation dataset rather than the test set, added my baseline Dummy model, and trained my chosen model as a final model on both the training and validation sets. Is there anything I missed before my project can be approved?
</div>

<div style="border:solid blue 2px; padding: 20px">

**Overall Summary of the Project Iter 2**

Hi Michael, thanks for the changes! Your project looks perfect, congrats on your approval :)

</div>


# Project 7: Introduction to Machine Learning

## Importing and analyzing the dataset

In this project I'll be analyzing Megaline customer behavior to build a model that can recommend one of Megaline's new plans: Smart or Ultra. The data I'm working with contains behavior (calls, call duration in minutes, messages sent, and internet traffic in MB used) for a number of Megaline customers, as well as a Boolean declaring if they are on the Ultra plan or not. I'll be looking to develop a model with an accuracy value of at least 0.75 for this analysis. Because I'm looking to understand which category (plan) a user falls into, I'll be using classification models for this analysis.

In [13]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.dummy import DummyClassifier
from time import time
import numpy as np

In [14]:
df = pd.read_csv('/datasets/users_behavior.csv')
df.head()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


In [15]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


Calls and messages are both stored as floats in the dataset, so converting them to int may be something to come back to later in the analysis. It looks like there aren't any missing values either, but I'll still check for and remove any duplicates.

In [16]:
df.duplicated().sum()

0

No duplicates found, let's get started on the model!

## Building the Models

In [17]:
features = df.drop(['is_ultra'], axis = 1)
target = df['is_ultra']
print(features.shape, target.shape)
features.head()

(3214, 4) (3214,)


| | calls | minutes | messages | mb_used |
|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 |


In [18]:
features_train, features_temporary, target_train, target_temporary = train_test_split(features, target, test_size = 0.4, random_state = 2356)
features_valid, features_test, target_valid, target_test = train_test_split(features_temporary, target_temporary, test_size = 0.5, random_state = 2356)

In [19]:
target_test = target_test.values
print(features_train.shape, target_train.shape)
print()
print(features_valid.shape, target_valid.shape, features_test.shape, target_test.shape)

(1928, 4) (1928,)

(643, 4) (643,) (643, 4) (643,)


The splits follow a 3:1:1 ratio: 60% of the rows go to training, and 20% each go to the validation and test sets.
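As a rough check on those shapes, the split sizes can be reproduced by hand. The sketch below assumes `train_test_split` rounds a fractional `test_size` up to a whole number of rows and leaves the remainder in the other part; `split_sizes` is a hypothetical helper written for this note, not a scikit-learn function.

```python
from math import ceil


def split_sizes(n_rows, test_size):
    """Return (remaining_rows, held_out_rows) for a fractional test_size.

    Assumption: the held-out part is rounded up and the rest stays behind.
    """
    held_out = ceil(n_rows * test_size)
    return n_rows - held_out, held_out


# First split: 60% training, 40% temporary.
n_train, n_temporary = split_sizes(3214, 0.4)
# Second split: the temporary part is halved into validation and test.
n_valid, n_test = split_sizes(n_temporary, 0.5)

print(n_train, n_valid, n_test)
print(round(n_train / 3214, 3), round(n_valid / 3214, 3), round(n_test / 3214, 3))
```

Under that rounding assumption the three sizes agree with the shapes printed above.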

### Model 1: Decision Tree Classifier

In [20]:
best_model = None
best_result = 0
start_time = time()
# Grid search over depth, leaf size, and split size
for depth in range(1, 11):
    for leaf in range(1, 7, 2):
        for split in range(2, 21, 2):
            model = DecisionTreeClassifier(
                random_state = 2356,
                max_depth = depth,
                min_samples_leaf = leaf,
                min_samples_split = split
            )
            model.fit(features_train, target_train)
            # Score on the validation set only; the test set stays unseen here
            test_predictions = model.predict(features_valid)
            result = accuracy_score(target_valid, test_predictions)
            # Keep the first model that reaches the highest validation accuracy
            if result > best_result:
                best_model = model
                best_result = result
training_time = time() - start_time
print('Accuracy of the best model:', best_result, 'Training time (s):', training_time)

Accuracy of the best model: 0.8320373250388803 Training time (s): 1.486004114151001


The search above took just under 1.5 seconds, and its best tree reached a validation accuracy of just over 0.83, about 8 percentage points above our 0.75 accuracy threshold!

<div class="alert alert-block alert-info">
Changed `test_predictions` to use the `features_valid` set, and changed the `result` variable to compute `accuracy_score` against `target_valid` instead of `target_test`. Accuracy improved by 2 points and speed was nearly identical.
</div>
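The pattern behind that change can be sketched on toy data: choose the setting on the validation rows, then score the chosen setting once on the test rows. Everything below is hypothetical (the rows, the `accuracy` helper, and the single-threshold rule standing in for a tree); none of it comes from `users_behavior.csv`.

```python
def accuracy(threshold, rows):
    """Share of (minutes, is_ultra) rows where 'minutes > threshold' matches the label."""
    hits = sum((minutes > threshold) == bool(label) for minutes, label in rows)
    return hits / len(rows)


# Hypothetical toy rows as (minutes, is_ultra) pairs.
valid_rows = [(300, 0), (450, 0), (520, 1), (700, 1), (610, 0)]
test_rows = [(280, 0), (480, 1), (650, 1), (720, 1), (400, 0)]

# Step 1: pick the "hyperparameter" using validation rows only.
candidate_thresholds = [400, 500, 600]
best_threshold = max(candidate_thresholds, key=lambda t: accuracy(t, valid_rows))

# Step 2: touch the test rows exactly once, with the already chosen setting.
final_test_score = accuracy(best_threshold, test_rows)

print(best_threshold, final_test_score)
```

Because the test rows play no part in step 1, the score from step 2 is not inflated by the selection itself.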

### Model 2: Random Forest Classifier

In [9]:
best_score = 0
best_est = 0
start_time = time()
for est in range(1,21):
    for depth in range(1,11):
        for leaf in range(1, 11, 2):
            model = RandomForestClassifier(
                random_state = 2356,
                n_estimators = est,
                max_depth = depth,
                min_samples_leaf = leaf,
                )
            model.fit(features_train, target_train)
            score = model.score(features_valid, target_valid)
            if score > best_score:
                best_score = score
                best_est = est
training_time = time() - start_time
print('Accuracy of the best model:', best_score,'Training time (s):', training_time)

Accuracy of the best model: 0.8367029548989113 Training time (s): 20.421031951904297


The search above took about 20 seconds and reached an accuracy of roughly 0.837. That is only marginally higher than the Decision Tree, yet the search took over 10x longer to find the result.

### Model 3: Random Forest Classifier with split samples

In [10]:
best_score = 0
best_est = 0
start_time = time()
for est in range(1,21):
    for depth in range(1,11):
        for leaf in range(1, 11, 2):
            for split in range(2, 21, 2):
                model = RandomForestClassifier(
                    random_state = 2356,
                    n_estimators = est,
                    max_depth = depth,
                    min_samples_leaf = leaf,
                    min_samples_split = split
                )
                model.fit(features_train, target_train)
                score = model.score(features_valid, target_valid)
                if score > best_score:
                    best_score = score
                    best_est = est
training_time = time() - start_time
print('Accuracy of the best model:', best_score, 'Training time (s):', training_time)

Accuracy of the best model: 0.8413685847589425 Training time (s): 204.13188672065735


This search took nearly three and a half minutes and reached an accuracy of about 0.841, only marginally higher than the previous Random Forest.
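To put the three timings in context, it may help to count how many models each search fits. The sketch below copies the `range(...)` bounds from the loops and the wall-clock times from the printed outputs; the per-fit figure is a plain average and ignores that larger forests take longer than smaller ones.

```python
from math import prod

# Hyperparameter ranges copied from the three search loops above.
grids = {
    "Decision Tree": [range(1, 11), range(1, 7, 2), range(2, 21, 2)],
    "Random Forest": [range(1, 21), range(1, 11), range(1, 11, 2)],
    "Random Forest + min_samples_split": [
        range(1, 21), range(1, 11), range(1, 11, 2), range(2, 21, 2),
    ],
}

# Search times in seconds, rounded from the printed outputs.
search_seconds = {
    "Decision Tree": 1.486,
    "Random Forest": 20.421,
    "Random Forest + min_samples_split": 204.132,
}

# Number of fits is the product of the lengths of the nested ranges.
fits = {name: prod(len(r) for r in ranges) for name, ranges in grids.items()}

for name, n_fits in fits.items():
    ms_per_fit = 1000 * search_seconds[name] / n_fits
    print(f"{name}: {n_fits} fits, about {ms_per_fit:.1f} ms per fit")
```

Read this way, the third search is slower mainly because it runs ten times as many fits as the second, while the average time per forest stays about the same.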

### Sanity Check

In [11]:
dummy = DummyClassifier(strategy = 'most_frequent', random_state = 2356)  
dummy.fit(features_train, target_train)  
print('Baseline accuracy:', dummy.score(features_test, target_test))

Baseline accuracy: 0.687402799377916


The baseline accuracy was just under 0.69. That is well below every model I tuned, and it is also below the 0.75 threshold required for this analysis.
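For intuition, `strategy = 'most_frequent'` always predicts the most common training label, so its test accuracy is simply the share of test rows carrying that label. The sketch below reproduces the number without scikit-learn. The counts (442 and 201) are inferred from the reported 0.6874 on 643 test rows, and treating label 0 as the majority class is an assumption; neither is read from the dataset.

```python
from collections import Counter

# Assumed test labels: 442 rows with label 0 and 201 with label 1,
# inferred from 0.6874 * 643 rather than taken from the data.
target_test_assumed = [0] * 442 + [1] * 201

# Hypothetical training labels that share the same majority class.
target_train_assumed = [0] * 7 + [1] * 3

# "Fit": remember the most frequent training label.
majority_label, _ = Counter(target_train_assumed).most_common(1)[0]

# "Score": fraction of test rows that match the constant prediction.
matches = sum(label == majority_label for label in target_test_assumed)
baseline_accuracy = matches / len(target_test_assumed)

print(majority_label, round(baseline_accuracy, 4))
```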

### Final Model Selection

In [22]:
# Pull the tuned hyperparameters from the best Decision Tree found in the search
tuned_params = best_model.get_params()
best_params = {
    'max_depth': tuned_params['max_depth'],
    'min_samples_leaf': tuned_params['min_samples_leaf'],
    'min_samples_split': tuned_params['min_samples_split']
}

print('Best parameters found:')
print(f"max_depth: {best_params['max_depth']}")
print(f"min_samples_leaf: {best_params['min_samples_leaf']}")
print(f"min_samples_split: {best_params['min_samples_split']}")

final_model = DecisionTreeClassifier(
    random_state=2356,
    max_depth=best_params['max_depth'],
    min_samples_leaf=best_params['min_samples_leaf'],
    min_samples_split=best_params['min_samples_split']
)

# Refit on train + validation; pd.concat keeps the column names that
# features_test also carries, which np.concatenate would drop
x_train_full = pd.concat([features_train, features_valid])
y_train_full = pd.concat([target_train, target_valid])

final_model.fit(x_train_full, y_train_full)

# Single evaluation on the held-out test set
final_test_score = final_model.score(features_test, target_test)
print(f'Final test accuracy: {final_test_score}')

Best parameters found:
max_depth: 7
min_samples_leaf: 5
min_samples_split: 2
Final test accuracy: 0.7884914463452566


On the test set we see a modest, but expected, drop in accuracy. Falling from just over 0.83 on validation to just under 0.79 on test isn't too bad at all, and the result is still above both the 0.75 threshold and the 0.69 baseline!
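To make these accuracy gaps more concrete, the reported values can be converted back into counts of correctly classified rows. This is a rough sketch that assumes each accuracy is an exact fraction of the 643 rows in the validation or test split; the dictionary labels are mine.

```python
N_ROWS = 643  # size of both the validation split and the test split

# Accuracies copied from the printed outputs above.
reported = {
    "Decision Tree (validation)": 0.8320373250388803,
    "Random Forest (validation)": 0.8367029548989113,
    "Random Forest + min_samples_split (validation)": 0.8413685847589425,
    "Final Decision Tree (test)": 0.7884914463452566,
}

# accuracy = correct / N_ROWS, so correct = accuracy * N_ROWS (rounded to an int).
correct_rows = {name: round(acc * N_ROWS) for name, acc in reported.items()}

for name, n_correct in correct_rows.items():
    print(f"{name}: {n_correct} of {N_ROWS} rows correct")
```

Seen as counts, each Random Forest variant gains only three validation rows over the model before it, which is worth keeping in mind when weighing accuracy against search time.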

## Conclusion

As mentioned in the introduction to this project, to determine which category (plan) our users may fall into, I developed 3 different classification models to test. The first, a Decision Tree Classifier, was by far the fastest, but it also had the lowest validation accuracy of the three, 0.832. The second, a Random Forest Classifier, was middle of the road in terms of speed and accuracy, coming in at ~20 seconds and an accuracy score of 0.837. The third and final model I developed, a Random Forest Classifier that also tuned the min_samples_split hyperparameter, produced the highest accuracy score of the bunch, 0.841. However, it took far longer to produce that result (about 204 seconds), and its accuracy was only about 0.005 higher than the first Random Forest model.

In the context of all the models tested so far, I believe the first model, the Decision Tree Classifier, shows the best combination of efficiency and accuracy, so I would recommend using it to suggest either the Smart or Ultra plan to our Megaline customers. A quick sanity check with a dummy model shows a baseline accuracy under 70%, well under the results of all of my models and under the threshold for this analysis.

The final model selection shows the best parameters chosen for the Decision Tree Classifier. After refitting on the training and validation sets and testing on "new" data, the model still reached an accuracy of ~79%. Some drop is to be expected on unseen data, and this figure gives us a better estimate of how the model would perform in a real-world scenario.

I do believe the speed of these models should be taken with a grain of salt, however, as my computer is certainly not the biggest powerhouse available. Maybe better hardware narrows the gap in speed to the point where the third model becomes more viable, or maybe the speed of the result doesn't factor into which model is chosen. These are all things to consider before making a final selection, but based on the available resources, my choice would be model #1.

<div class="alert alert-block alert-info">
I tried to develop a dictionary to address the following feedback:
Clear model-selection code
Your loops mix "best_model" and "best_est" variables. For clarity, track each model's name and accuracy in a dict, then pick the best.

But I couldn't quite understand what was being asked of me. Is there a different way of explaining this so I can try to learn that method as well?
</div>
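One possible reading of that feedback, as a minimal sketch: after each search finishes, store its result under the model's name in one dictionary, then let `max` or `min` pick from that dictionary instead of comparing separate `best_*` variables. The numbers are rounded from the outputs in this notebook. The dictionary layout, the variable names, and the 0.01 tolerance rule are assumptions made for illustration, not something the feedback prescribes.

```python
# One entry per model: validation accuracy and search time in seconds.
# In the notebook, each entry would be filled in right after its search loop.
results = {
    "Decision Tree": {"accuracy": 0.8320, "seconds": 1.49},
    "Random Forest": {"accuracy": 0.8367, "seconds": 20.42},
    "Random Forest + min_samples_split": {"accuracy": 0.8414, "seconds": 204.13},
}

# Pick the model name with the highest validation accuracy.
best_by_accuracy = max(results, key=lambda name: results[name]["accuracy"])

# Optional trade-off rule (an assumption, not part of the feedback):
# among models within 0.01 of the top accuracy, prefer the fastest search.
tolerance = 0.01
top_accuracy = results[best_by_accuracy]["accuracy"]
close_enough = [
    name for name in results
    if top_accuracy - results[name]["accuracy"] <= tolerance
]
best_trade_off = min(close_enough, key=lambda name: results[name]["seconds"])

print("Highest accuracy:", best_by_accuracy)
print("Best trade-off:", best_trade_off)
```

Keeping everything in one dictionary means the comparison in the conclusion can be read straight from the code, and adding a fourth model only takes one more entry.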