<div style="border:solid blue 2px; padding: 20px">

 **Approved**

Great job, Daniel!

</div>

<div style="border:solid blue 2px; padding: 20px">

 **Overall Summary of the Project**

Dear Daniel,

You’ve taken important steps in exploring the dataset, training multiple classification models, and tuning hyperparameters for the Random Forest model. Your approach demonstrates a good grasp of model development. However, to fully approve your project, there are critical criteria that you must meet, particularly in how you split and evaluate your data. The guidelines require training, validation, and test sets to be established before model selection and final evaluation.

Below is a detailed review focusing on what must be accomplished to secure approval.

---

<div style="border-left: 7px solid green; padding: 10px;">
<b>✅ Already Achieved:</b>
<ul>
  <li><b>Data Loading and Initial Exploration:</b> You successfully loaded the dataset and performed basic checks for duplicates and missing values. Using methods like <code>.head()</code> and <code>.info()</code> provided an initial understanding of the data structure.</li>
  <li><b>Model Variety:</b> Considered multiple classification algorithms—Random Forest, Decision Tree, and Logistic Regression—providing a diverse perspective on model performance.</li>
  <li><b>Hyperparameter Iteration for Random Forest:</b> Implemented loops to vary <code>n_estimators</code> and <code>max_depth</code> for the Random Forest model. This satisfies the requirement to iterate over different hyperparameter values.</li>
  <li><b>Accuracy Threshold Met:</b> Achieved a validation accuracy above 0.80 for the Random Forest, surpassing the required 0.75 threshold. This indicates strong predictive performance.</li>
</ul>
</div>

<div style="border-left: 7px solid gold; padding: 10px;">
<b>⚠️ Areas for Improvement (Not Required for Approval, but Recommended):</b>
<ul>
  <li><b>Additional Metrics:</b> Consider evaluating your model with additional metrics such as Precision, Recall, F1-Score, and ROC-AUC to gain deeper insight into model performance beyond accuracy.</li>
  <li><b>Visualization Enhancements:</b> Adding confusion matrices, ROC curves, or feature importance plots would provide clearer insights into where the model excels and where it struggles.</li>
  <li><b>Comprehensive Conclusion:</b> Including a more thorough final discussion of your results—such as why the Random Forest outperformed other models, any observed limitations, and suggestions for future improvements—would give a stronger end to your analysis.</li>
</ul>
</div>
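
As a rough illustration of the additional metrics suggested above, the sketch below computes them with `sklearn.metrics`. The labels and probabilities are toy values invented for this example, not your project data, and the 0.5 cutoff is an assumption of the sketch.

```python
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Toy true labels and predicted probabilities, invented for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_prob = [0.1, 0.4, 0.6, 0.2, 0.8, 0.3, 0.9, 0.7]

# Hard predictions from a 0.5 cutoff (an assumption for this sketch).
y_pred = [int(p >= 0.5) for p in y_prob]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
roc_auc = roc_auc_score(y_true, y_prob)      # uses probabilities, not hard labels

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

print("Precision:", precision, "Recall:", recall, "F1:", f1, "ROC-AUC:", roc_auc)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

In your notebook, `y_true` would be the test targets and `y_prob` could come from `model.predict_proba(features_test)[:, 1]`.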

<div style="border-left: 7px solid red; padding: 10px;">
<b>⛔️ Critical Changes Required for Approval:</b>
<ul>
  <li><b>Proper Data Splitting into Three Sets:</b>
    <ul>
      <li><b>Issue:</b> You only created a training and validation split. The project criteria specify splitting the data into three sets: training, validation, and test. The test set must remain untouched until you’ve finalized your model.</li>
      <li><b>How to Fix:</b> Perform two rounds of <code>train_test_split()</code>:
        <pre><code># Example: 60% train, 20% validation, 20% test
features_train, features_temp, target_train, target_temp = train_test_split(features, target, test_size=0.4, random_state=12345)
features_valid, features_test, target_valid, target_test = train_test_split(features_temp, target_temp, test_size=0.5, random_state=12345)
</code></pre>
        This ensures you have distinct sets for training, validation, and testing.
      </li>
    </ul>
  </li>
  <li><b>Final Evaluation on the Test Set:</b>
    <ul>
      <li><b>Issue:</b> You evaluated the best model on the full dataset rather than on a separate test set, which does not provide an unbiased measure of performance.</li>
      <li><b>How to Fix:</b> After identifying the best model and hyperparameters using the training and validation sets, retrain the model on the combined training and validation data if desired, and then evaluate it on the test set. Report the test set accuracy to confirm it meets or exceeds the 0.75 threshold.
      </li>
    </ul>
  </li>
</ul>
</div>
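
The final-evaluation workflow described above could look roughly like the sketch below. It runs on synthetic data from `make_classification` so that it is self-contained; `best_params` is a placeholder for whatever your validation search selects, and `features_fit` / `target_fit` are names introduced only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in the project these would be the real features and target.
features, target = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=12345
)

# 60% train, 20% validation, 20% test
features_train, features_temp, target_train, target_temp = train_test_split(
    features, target, test_size=0.4, random_state=12345
)
features_valid, features_test, target_valid, target_test = train_test_split(
    features_temp, target_temp, test_size=0.5, random_state=12345
)

# Placeholder: hyperparameters assumed to have been chosen on the validation set.
best_params = {"n_estimators": 15, "max_depth": 6}

# Optional step: refit on training + validation data before the final check.
features_fit = np.concatenate([features_train, features_valid])
target_fit = np.concatenate([target_train, target_valid])

final_model = RandomForestClassifier(random_state=12345, **best_params)
final_model.fit(features_fit, target_fit)

# The test set is used exactly once, to report the final score.
test_accuracy = final_model.score(features_test, target_test)
print("Test accuracy:", test_accuracy)
```

The key point is that `features_test` never influences which hyperparameters are chosen.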

---

**Conclusion**

To meet the requirements for project approval, you must:

1. **Split the data into training, validation, and test sets** as instructed.
2. **Evaluate the best-performing model on the test set** to provide an unbiased measure of its performance.

After addressing these critical changes, your project will fulfill all necessary criteria for approval. The additional improvements suggested will further enhance the quality and depth of your analysis, but ensuring the correct data splitting and test set evaluation is the top priority.

**Next Steps for Approval**

- Implement proper data splitting into three sets.
- Re-train and evaluate the best model on the test set.
- Confirm that the test accuracy meets or exceeds the 0.75 threshold.

Once these changes are made, your project will be well-positioned for approval.

</div>

# Machine Learning for Megaline

This project aims to build a classification model that predicts the correct mobile plan for Megaline users. The required accuracy score for this project is 0.75.

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [2]:
df = pd.read_csv('users_behavior.csv')

# Viewing the dataset

In [3]:
display(df.head())
df.info()  # info() prints its own summary and returns None, so display() is not needed

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB

In [4]:
#checking for duplicates and missing values
print(df.duplicated().sum())
print()
print(df.isna().sum())

0

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64


In [5]:
display(df.describe())

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


There are no duplicates or missing values in the dataset.

# Splitting the dataset

* I will split the data into a training set that holds 50% of the data, and a validation set and a test set that each hold 25%.

In [6]:
#Splitting dataset into training set, validation set, and test set
features = df.drop(['is_ultra'], axis = 1)
target = df['is_ultra']


features_train, features_temp, target_train, target_temp = train_test_split(features, target, test_size = 0.5, random_state = 54321)
features_test, features_valid, target_test, target_valid = train_test_split(features_temp, target_temp, test_size = 0.5, random_state = 54321)
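
As a quick check of the proportions, the sketch below repeats the same two calls on a plain array of row indices and reports the size of each subset. The `rows_*` names are introduced only for this illustration; the row count of 3214 comes from the `df.info()` output above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Row indices standing in for the 3214 rows reported by df.info().
rows = np.arange(3214)

# Same test_size and random_state values as the split above.
rows_train, rows_temp = train_test_split(rows, test_size=0.5, random_state=54321)
rows_test, rows_valid = train_test_split(rows_temp, test_size=0.5, random_state=54321)

for name, part in [("train", rows_train), ("test", rows_test), ("valid", rows_valid)]:
    print(name, len(part), round(len(part) / len(rows), 3))

# No row should appear in more than one subset.
all_rows = np.concatenate([rows_train, rows_test, rows_valid])
assert len(np.unique(all_rows)) == len(rows)
```

Because 1607 rows cannot be halved evenly, the test and validation subsets differ by one row; the test subset has the 803 rows that appear in the evaluation output later on.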

# Model Selection

I decided to build a model that would predict the plan a user was on based on the features 'calls', 'minutes', 'messages' and 'mb_used'. The target is categorical so this is a classification model.

* I'll start with a random forest model and test which hyperparameters are the best for the data.
* I'll also test a logistic regression model.

# Random Forest

In [7]:
# Grid search over forest size and tree depth, scored on the validation set
for est in range(1, 51):
    for depth in range(1, 11):
        model = RandomForestClassifier(random_state = 12345, n_estimators = est, max_depth = depth)
        model.fit(features_train, target_train)

        # model.score() returns accuracy directly, so separate predictions are not needed
        score = model.score(features_valid, target_valid)
        # Report only the strongest combinations
        if score > 0.817:
            print("Max Depth:", depth, "Est:", est, "Score:", score)

Max Depth: 6 Est: 15 Score: 0.818407960199005
Max Depth: 9 Est: 15 Score: 0.8171641791044776
Max Depth: 9 Est: 16 Score: 0.8171641791044776
Max Depth: 9 Est: 20 Score: 0.8171641791044776
Max Depth: 7 Est: 30 Score: 0.8171641791044776
Max Depth: 6 Est: 32 Score: 0.8171641791044776
Max Depth: 6 Est: 33 Score: 0.8171641791044776
Max Depth: 6 Est: 34 Score: 0.8171641791044776
Max Depth: 6 Est: 35 Score: 0.8171641791044776
Max Depth: 7 Est: 38 Score: 0.818407960199005
Max Depth: 7 Est: 39 Score: 0.818407960199005
Max Depth: 7 Est: 40 Score: 0.8196517412935324
Max Depth: 7 Est: 41 Score: 0.818407960199005
Max Depth: 7 Est: 42 Score: 0.8171641791044776
Max Depth: 7 Est: 46 Score: 0.8171641791044776
Max Depth: 7 Est: 48 Score: 0.8171641791044776
Max Depth: 8 Est: 48 Score: 0.8171641791044776
Max Depth: 7 Est: 49 Score: 0.818407960199005
Max Depth: 7 Est: 50 Score: 0.8171641791044776
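
One possible alternative to printing every combination above a hand-picked cutoff is to keep track of the single best combination during the search. The sketch below assumes synthetic data from `make_classification` and a much smaller grid so that it runs quickly; the variable names mirror the notebook's, but the data is not the Megaline dataset and `results`, `best_score`, and `best_params` are names introduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the notebook would use its own train/validation sets.
features, target = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1, random_state=12345
)
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.25, random_state=12345
)

best_score = 0.0
best_params = None
results = []

# A deliberately small grid: 3 forest sizes x 3 depths.
for est in range(5, 26, 10):
    for depth in range(2, 9, 3):
        model = RandomForestClassifier(random_state=12345, n_estimators=est, max_depth=depth)
        model.fit(features_train, target_train)
        score = model.score(features_valid, target_valid)
        results.append((est, depth, score))
        # Strict ">" keeps the first (smallest) configuration when scores tie.
        if score > best_score:
            best_score = score
            best_params = {"n_estimators": est, "max_depth": depth}

print("Best validation accuracy:", best_score)
print("Best hyperparameters:", best_params)
```

Storing the winner in `best_params` also lets the next cell refit with `RandomForestClassifier(random_state=12345, **best_params)` instead of retyping the values by hand.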


In [8]:
# Refit the chosen configuration and evaluate it once on the held-out test set
model = RandomForestClassifier(random_state = 12345, n_estimators = 15, max_depth = 6)
model.fit(features_train, target_train)
test_predictions = model.predict(features_test)


def error_count(answers, predictions):
    """Count the positions where a prediction differs from the true answer."""
    return sum(answer != prediction for answer, prediction in zip(answers, predictions))

p_score_test = precision_score(y_true = target_test, y_pred = test_predictions)
count = error_count(target_test.values, test_predictions)

print("Number of errors:", count, 'out of', len(target_test))
print('Accuracy Score:', model.score(features_test, target_test))
print("Test Set Precision Score:", p_score_test)

Number of errors: 149 out of 803
Accuracy Score: 0.8144458281444583
Test Set Precision Score: 0.7891156462585034


For the random forest, I used max_depth = 6 and n_estimators = 15, one of the highest-scoring combinations on the validation set. On the test set this gives an accuracy of about 0.81, which is above the required threshold of 0.75. The model misclassified 149 of 803 users and has a precision of about 0.79.

# Logistic Regression

In [9]:
model = LogisticRegression(random_state = 12345, solver = 'liblinear')
model.fit(features_train, target_train)

test_predictions = model.predict(features_test)

# error_count() is already defined in the previous cell and is reused here

count = error_count(target_test.values, test_predictions)

test_score = model.score(features_test, target_test)

p_score_test = precision_score(y_true = target_test, y_pred = test_predictions)


print('Number of errors:', count, 'out of', len(target_test))
print('Test set accuracy score:', test_score)
print("Test set precision score:", p_score_test)

Number of errors: 194 out of 803
Test set accuracy score: 0.7584059775840598
Test set precision score: 0.8703703703703703


The test-set accuracy of about 0.76 meets the required threshold of 0.75, and the precision is about 0.87. This is a valid model.

# Conclusion 

* A random forest classifier with n_estimators = 15 and max_depth = 6 is the most accurate model for this data. It reached a test accuracy of about 0.81, above the required 0.75, with a precision of about 0.79. It classified 654 of 803 users correctly. This is a valid predictive model.

* The logistic regression model had a lower accuracy, about 0.76, but a higher precision, about 0.87. Its accuracy is lower than the random forest's, but its positive predictions are more often correct. This is also a valid model.

* Though the logistic regression model has a higher precision score, the final random forest model made 149 errors compared to the logistic regression model's 194 errors. It seems that the random forest model is the best suited for the data.
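
As a sanity check on the figures quoted in this conclusion, the error counts and accuracy scores can be tied together with a little arithmetic. The sketch below uses only the numbers printed in the test-set outputs above; the dictionary names are introduced for this illustration.

```python
# Figures copied from the test-set outputs above.
test_size = 803
errors = {"random forest": 149, "logistic regression": 194}

accuracy = {}
for name, n_errors in errors.items():
    correct = test_size - n_errors
    # Accuracy is the share of test rows classified correctly.
    accuracy[name] = correct / test_size
    print(f"{name}: {correct} correct, accuracy {accuracy[name]:.4f}")
```

Both values agree with the accuracy scores reported by `model.score()` for the respective models.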