**Review**

Hello Mazin!

I'm happy to review your project today.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did a pretty good job overall, but there are a few problems that need to be fixed before the project is accepted. Let me know if you have questions!

# Employing Machine Learning to Suggest Optimal Data Plans for Clients

Megaline, a mobile carrier, has noticed that a significant number of its subscribers are still using outdated data plans. The company intends to propose a newer data plan to each legacy plan user, with the goal of ensuring that the recommended plan is the most appropriate one for each customer. To achieve this, Megaline has asked us to train a model that will determine which of its new data plans (Smart or Ultra) would be the best fit for each customer based on their usage habits.

As the recommendation can either be the Smart or Ultra plan, this is a binary classification problem. In order to solve this problem, the following machine learning models will be trained and evaluated:

- Decision Tree
- Random Forest

The overall data will be divided into training, validation, and testing sets. Both models will be trained and fine-tuned using the training and validation sets, with the aim of optimizing the hyperparameters for maximum accuracy. Once the optimal hyperparameters are established, the models will be evaluated using the test set and the one with the highest accuracy will be selected as the final model.

###### Loading Required Libraries

In [1]:
import pandas as pd
import numpy as np
from tqdm import tqdm
from IPython.display import clear_output
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error

######  Loading Data

In [2]:
df = pd.read_csv('/datasets/users_behavior.csv')

###### Exploring our data

In [3]:
display(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


None

In [4]:
df.head()

|   | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


We do not appear to have any missing values: all 5 of our columns have 3214 non-null values.

To be able to predict whether Megaline customers are better off with Ultimate or Surf plans, we need a reminder of the entitlements and cost of each: 

Surf @ $20/month

- 500 monthly minutes, 50 texts, and 15 GB of data
- After exceeding the package limits:
    - 1 minute: 3 cents
    - 1 text message: 3 cents
    - 1 GB of data: $10

Ultimate @ $70/month
 
- 3000 monthly minutes, 1000 text messages, and 30 GB of data
- After exceeding the package limits:
    - 1 minute: 1 cent
    - 1 text message: 1 cent
    - 1 GB of data: $7
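
To make the comparison concrete, the pricing above can be turned into a small cost function. This is a minimal sketch, not Megaline's billing logic: the `Plan` class and the `monthly_cost` function are hypothetical names, data is assumed to be billed in whole gigabytes rounded up (with 1 GB = 1024 MB), and minutes and messages are used exactly as they appear in the dataset.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Hypothetical container for the plan terms listed above."""
    name: str
    monthly_fee: float   # dollars per month
    minutes: int         # included minutes
    messages: int        # included text messages
    gb: int              # included data, in GB
    per_minute: float    # dollars per extra minute
    per_message: float   # dollars per extra message
    per_gb: float        # dollars per extra GB


SURF = Plan("Surf", 20, 500, 50, 15, 0.03, 0.03, 10)
ULTIMATE = Plan("Ultimate", 70, 3000, 1000, 30, 0.01, 0.01, 7)


def monthly_cost(plan, minutes, messages, mb_used):
    """Monthly bill in dollars; data is rounded up to whole GB (assumption)."""
    gb_used = math.ceil(mb_used / 1024)
    extra_minutes = max(0, minutes - plan.minutes)
    extra_messages = max(0, messages - plan.messages)
    extra_gb = max(0, gb_used - plan.gb)
    return (plan.monthly_fee
            + extra_minutes * plan.per_minute
            + extra_messages * plan.per_message
            + extra_gb * plan.per_gb)


# First row of the dataset: 311.9 minutes, 83 messages, 19915.42 MB
usage = dict(minutes=311.9, messages=83, mb_used=19915.42)
for plan in (SURF, ULTIMATE):
    print(f"{plan.name}: ${monthly_cost(plan, **usage):.2f}")
```

Under these assumptions, data overage dominates the bill: a customer within the minute and message limits can still pay far more than the base fee once usage passes the included GB.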

<div class="alert alert-success">
<b>Reviewer's comment V1</b>

Good job!
    
</div>

Neither plan limits or charges based on the number of calls, so the Calls column can be removed. While the data types will not cause us any problems, converting 'calls' and 'messages' to integers will leave us with cleaner data, as these columns already hold whole numbers. Converting the MB used column to GB will also make life easier, as usage is rounded up to the nearest GB and Megaline charges by GB.

In [5]:
#removing Calls as ML charges by minutes not calls
#df = df.drop(columns=['calls'])

<div class="alert alert-danger">
<b>Reviewer's comment V1</b>

Why did you remove this column? You can't be sure that this column is useless for model training. I'm sure that if you bring this column back, you can improve quality at least a bit.
    
When you want to remove a column, you need to conduct an experiment and check how removing it affects the final model quality.
    
</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Fixed
    
</div>
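
One way to run the kind of experiment the reviewer describes is to train the same model with and without each column and compare validation accuracy. The sketch below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` rather than the Megaline dataset, the column names are labels attached for readability, and `validation_accuracy` is a hypothetical helper.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data: 4 numeric features, binary target.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
feature_names = ["calls", "minutes", "messages", "mb_used"]  # labels only

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0)


def validation_accuracy(columns):
    """Train on the given column indices and score on the validation split."""
    model = DecisionTreeClassifier(max_depth=5, random_state=0)
    model.fit(X_train[:, columns], y_train)
    return accuracy_score(y_valid, model.predict(X_valid[:, columns]))


all_columns = list(range(X.shape[1]))
results = {"all features": validation_accuracy(all_columns)}
for i, name in enumerate(feature_names):
    kept = [c for c in all_columns if c != i]
    results[f"without {name}"] = validation_accuracy(kept)

for label, acc in results.items():
    print(f"{label}: {acc:.3f}")
```

If dropping a column leaves validation accuracy unchanged or higher, removing it is defensible; if accuracy falls, the column carries signal and should stay.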

In [6]:
#converting the Call & Messages columns to integer data type
#df[['minutes', 'messages', 'calls', 'mb_used']] = df[['minutes', 'messages', 'calls', 'mb_used']].astype(int)

In [7]:
#creating a new GB Used column
#df['gb_used'] = df['mb_used']/1024

#rounding up GB Used column to better reflect ML pricing as this is the most costly metric
#df['gb_used'] = np.ceil(df['gb_used']).astype(int)

#dropping MB column
#df = df.drop(columns=['mb_used'])

In [8]:
df.head(5)

|   | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


<div class="alert alert-danger">
<b>Reviewer's comment V1</b>

1. By replacing mb_used with gb_used you lose some information because of rounding. Right now a case with 1 MB of usage and a case with 1024 MB of usage are the same, because in both cases gb_used=1. But are they really the same? Of course not. That's why it's better to bring back the mb_used column. You can even keep both columns, gb_used and mb_used, if you want.
2. Actually, when you work with ML models, it doesn't make sense to convert all the float columns to integers. Why? Because scikit-learn models, like most ML models, work with floats internally. It means that the model converts all integers to floats before it starts working.
    
</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Fixed
    
</div>
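
The information loss from rounding can be checked directly. This is a small sketch in which `mb_to_billed_gb` is a hypothetical helper that mirrors the commented-out conversion (divide by 1024, then round up).

```python
import math


def mb_to_billed_gb(mb_used):
    """Round usage up to whole gigabytes (1 GB = 1024 MB)."""
    return math.ceil(mb_used / 1024)


# Very different usage levels can collapse into the same GB value.
for mb in [0, 1, 1023, 1024, 1025]:
    print(mb, "MB ->", mb_to_billed_gb(mb), "GB")
```

Everything from 1 MB to 1024 MB maps to the same value, so a model that sees only the rounded column cannot tell those customers apart.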

###### Modeling

We will be employing <b>Random Forest</b> and <b>Decision Tree</b> modeling to determine which is the more accurate of the two. This is a classification task and the target feature, or goal, is predicting the 'is_ultra' column. The models will predict whether a customer is better served by a Smart or Ultra plan based on the features, or remaining data: calls, minutes, messages, and MB used.
We will split the data to prepare for modeling.

###### Splitting the data


In [9]:
#specifying features by dropping the 'is_ultra' column
features = df.drop('is_ultra', axis=1)

#specifying the target
target = df['is_ultra']

<div class="alert alert-success">
<b>Reviewer's comment V1</b>

Correct
    
</div>

We need to split our dataset.

<div class="alert alert-danger">
<b>Reviewer's comment V1</b>

1. You can't train the model on the whole dataset.
2. You can't create test data in this way, because the test data is included in the training data in your case.
    
</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Fixed
    
</div>

In [10]:
df.head(20)

|   | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| 5 | 58.0 | 344.56 | 21.0 | 15823.37 | 0 |
| 6 | 57.0 | 431.64 | 20.0 | 3738.9 | 1 |
| 7 | 15.0 | 132.4 | 6.0 | 21911.6 | 0 |
| 8 | 7.0 | 43.39 | 3.0 | 2538.67 | 1 |
| 9 | 90.0 | 665.41 | 38.0 | 17358.61 | 0 |


Our features and target data need to be divided into training, validation, and test datasets with a ratio of 3:1:1. This means our training, validation, and test datasets will consist of 60%, 20%, and 20%, respectively, of our 3214-row dataset.

In [11]:
#splitting the training dataset from the 'rest' i.e. validation and testing. Training will be 60% and 'rest' is 40%
features_train, features_rest, target_train, target_rest = train_test_split(features, target, test_size=0.4, 
                                                                            random_state= 7788)

#splitting the 'rest' evenly into validation and testing datasets, 50% each
features_valid, features_test, target_valid, target_test = train_test_split(features_rest, target_rest, test_size=0.5, 
                                                                            random_state= 7788)

<div class="alert alert-success">
<b>Reviewer's comment V1</b>

That's correct. Good job!
    
</div>
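
The resulting row counts can be worked out by hand. The sketch below assumes that `train_test_split` rounds the held-out share up to a whole row and gives the remainder to the other part; the training size it produces matches the `Length: 1928` reported for `target_train`.

```python
import math

n_rows = 3214

# First split: test_size=0.4 holds out the "rest" (validation + test).
n_rest = math.ceil(0.4 * n_rows)
n_train = n_rows - n_rest

# Second split: test_size=0.5 divides the rest into validation and test.
n_test = math.ceil(0.5 * n_rest)
n_valid = n_rest - n_test

for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / n_rows:.1%})")
```

Because 3214 is not divisible by 5, the shares are only approximately 60%, 20%, and 20%.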

In [12]:
display(features_train)
display(target_train)


|   | calls | minutes | messages | mb_used |
|---|---|---|---|---|
| 2763 | 51.0 | 410.00 | 74.0 | 19594.45 |
| 2518 | 69.0 | 535.17 | 70.0 | 19913.49 |
| 2762 | 77.0 | 519.28 | 11.0 | 11189.95 |
| 546 | 65.0 | 458.46 | 0.0 | 15214.25 |
| 3156 | 86.0 | 639.94 | 34.0 | 16810.40 |
| ... | ... | ... | ... | ... |
| 928 | 22.0 | 175.36 | 17.0 | 13973.08 |
| 558 | 161.0 | 1218.67 | 11.0 | 5749.01 |
| 2149 | 41.0 | 239.53 | 53.0 | 25004.15 |
| 2305 | 29.0 | 169.97 | 33.0 | 16555.43 |


2763    1
2518    0
2762    0
546     1
3156    0
       ..
928     0
558     1
2149    1
2305    0
303     0
Name: is_ultra, Length: 1928, dtype: int64

In [13]:
#specifying a random state for the Decision Tree
model = DecisionTreeClassifier(random_state = 7890)

#training the model
model.fit(features_train, target_train)

#creating the test dataset
#test_df = pd.read_csv('/datasets/users_behavior.csv')

DecisionTreeClassifier(random_state=7890)

<div class="alert alert-danger">
<b>Reviewer's comment V2</b>

1. You can't train the model on the whole dataset.
2. You can't create test data in this way, because the test data is included in the training data in your case.
    
</div>

<div class="alert alert-success">
<b>Reviewer's comment V3</b>

Fixed
    
</div>

## Decision Tree Modeling

In [14]:
%%time

#initializing the best model, accuracy, and depth found so far
DT_best_model = None
best_tree_accuracy = 0
best_depth = 0

#creating a loop that will test various depths
for depth in tqdm(range (3,50)):
    
    #creating the Decision Tree model
    DTree_model = DecisionTreeClassifier(max_depth = depth, random_state = 7890)
    
    #training the model with training set
    DTree_model.fit(features_train, target_train)
    
    #obtain model predictions using validation set
    DTree_predictions_valid = DTree_model.predict(features_valid)
    
    accuracy = accuracy_score(target_valid, DTree_predictions_valid)
    
    if accuracy > best_tree_accuracy:
        DT_best_model = DTree_model
        best_tree_accuracy = accuracy
        best_depth = depth
        clear_output()
        print(f"The best_model is {DT_best_model}")
        print(f"The best accuracy is {best_tree_accuracy*100}%")
        print(f"The best depth is {best_depth}")    
    

The best_model is DecisionTreeClassifier(max_depth=8, random_state=7890)
The best accuracy is 79.62674961119751%
The best depth is 8


100%|██████████| 47/47 [00:00<00:00, 105.33it/s]

CPU times: user 402 ms, sys: 3.36 ms, total: 405 ms
Wall time: 449 ms





With an accuracy score of 79.6%, we see that 8 is the ideal depth for a decision tree predicting phone plan subscriptions in our validation set.

<div class="alert alert-danger">
<b>Reviewer's comment V1</b>

You have a mistake here. Check the variable you use to make predictions. That's a model trained on the whole data including validation data. That's why you got such quality. 
    
</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Good job!
    
</div>

## Random Forest Modeling

We will follow the same steps to model our data using Random Forest this time. Just as we looped through the Decision Tree to find the optimal depth, we will do the same here for the depth and number of estimators of the Random Forest model. Depth is the maximum number of levels each tree in the forest may have, whereas the number of estimators is the number of decision trees in the forest, whose votes are combined into the final prediction.
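
In scikit-learn terms, `n_estimators` sets how many trees the forest contains and `max_depth` caps how deep each of those trees may grow. The sketch below illustrates this on synthetic data from `make_classification` (an assumption made so the example runs on its own, not the Megaline dataset) by inspecting a fitted forest.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 4 numeric features, binary target.
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)

forest = RandomForestClassifier(n_estimators=10, max_depth=4, random_state=0)
forest.fit(X, y)

# estimators_ holds the fitted decision trees that make up the forest.
n_trees = len(forest.estimators_)
tree_depths = [tree.get_depth() for tree in forest.estimators_]

print("number of trees:", n_trees)
print("depth of each tree:", tree_depths)
```

Each tree may stop short of `max_depth` if its leaves become pure earlier, so the reported depths are upper-bounded by the setting rather than forced to equal it.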

In [15]:
# Initializing the best model, accuracy, depth, and estimator count found so far
best_RF_model = None
best_RF_accuracy = 0
best_RF_depth = 0
best_est = 0


# Creating a loop to test models with different depth and estimator values
for est in tqdm(range(1,40)):
    
    # for loop to test various depths 
    for depth in range (1, 40):
        
        # Creating the Random Forest model
        RF_model = RandomForestClassifier(max_depth=depth, random_state=7890, n_estimators=est)
        
        #Training the model with the training dataset
        RF_model.fit(features_train, target_train)

        #Obtain model predictions using validation set
        RF_predictions_valid = RF_model.predict(features_valid)
       
        
        accuracy = accuracy_score(target_valid, RF_predictions_valid)
        
        # Determine best fit
        if accuracy > best_RF_accuracy:
            best_RF_model = RF_model
            best_RF_accuracy = accuracy
            best_RF_depth = depth
            best_est = est
            
print('The best model is', best_RF_model)
print(f'The best accuracy is {best_RF_accuracy*100}%')
print('The best depth is', best_RF_depth)
print('The best number of estimators is', best_est)

100%|██████████| 39/39 [01:24<00:00,  2.17s/it]

The best model is RandomForestClassifier(max_depth=10, n_estimators=10, random_state=7890)
The best accuracy is 81.95956454121306%
The best depth is 10
The best number of estimators is 10





As we can see above, the best Random Forest model has a depth of 10, 10 estimators, and an accuracy of around 82% on the validation set. This is about 2.3 percentage points more accurate than the most accurate Decision Tree model we uncovered.

<div class="alert alert-success">
<b>Reviewer's comment V1</b>

Correct. Good job!
    
</div>

## Testing Our Best Model on the Test Set

We will be testing our best_RF_model from our Random Forest modeling on the test set.

### Random Forest 

In [18]:
#predicting the target values
RF_validation_predictions = best_RF_model.predict(features_test)

#calculating the accuracy score
RF_validation_accuracy = accuracy_score(target_test, RF_validation_predictions)

#printing accuracy score 
print(f"Accuracy is {RF_validation_accuracy*100}%")

Accuracy is 79.00466562986003%


Our accuracy score is 79%, slightly above our 75% threshold for accuracy. The Random Forest model can correctly predict phone plan subscriptions for around 4 out of every 5 Megaline customers! This is promising, as it suggests that Megaline can recommend new plans to legacy customers with a high degree of confidence that recommendations accurately reflect user behavior.

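This accuracy score is about 3 percentage points lower than the 82% the same model reached on the validation set. A drop like this is expected, since the hyperparameters were chosen to maximize validation accuracy, which makes the test score the more reliable estimate of performance on new customers. It is also close to the Decision Tree's best validation accuracy (79.6%), although the Decision Tree was not evaluated on the test set.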

<div class="alert alert-danger">
<b>Reviewer's comment V1</b>

You need to select only the one best model based on its quality on the validation data, and then check the quality of this model on the test data.
    
</div>

<div class="alert alert-danger">
<b>Reviewer's comment V2</b>

As a final quality check, you need to test your best model on the test data, not on the validation data. This is why we split the data into 3 parts and not 2.
    
</div>

<div class="alert alert-success">
<b>Reviewer's comment V3</b>

Well done!
    
</div>

## Sanity Testing

In [17]:
# counting how many customers are on the Ultra plan
ultra_count = df[df['is_ultra'] == 1].shape[0]
display(ultra_count)

985

985 Megaline customers have a '1' in the 'is_ultra' column. In this binary set, customers are either Ultra or Smart plan subscribers. Therefore, 985/3214, or around 31% of customers are subscribed to the Ultra plan. This leaves the remaining 69% who are on the Smart plan. If our model were to simply assume that all customers are Smart subscribers or 0 for 'is_ultra', this would still only leave us with a model that is 69% accurate.

During our Decision Tree & Random Forest modeling, we uncovered models that were roughly 10 to 13 percentage points more accurate on the validation set than this baseline of selecting the mode, i.e. assuming all customers are subscribed to Smart plans. Megaline can use either model knowing it performs better than the baseline, with the Random Forest proving slightly more accurate during validation.
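
The baseline arithmetic can be reproduced in a few lines. This is a small sketch that uses only the counts and the test accuracy reported in this notebook; the variable names are illustrative.

```python
n_total = 3214
n_ultra = 985
n_smart = n_total - n_ultra

# Always predicting the most common class gets the majority share right.
baseline_accuracy = max(n_ultra, n_smart) / n_total

# Random Forest accuracy on the test set, copied from the output above.
test_accuracy = 0.7900466562986003

print(f"Ultra share: {n_ultra / n_total:.1%}")
print(f"Baseline accuracy: {baseline_accuracy:.1%}")
print(f"Improvement over baseline: "
      f"{(test_accuracy - baseline_accuracy) * 100:.1f} percentage points")
```

Comparing against the majority-class baseline, rather than against 50%, is what makes the check meaningful for an imbalanced target like this one.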

## Conclusion

Megaline tasked us with testing two different classification models to better predict customer behavior and determine whether a customer would be better served by moving to Smart or Ultra. Megaline intends to use this information to recommend a different plan to legacy customers based on other customers' behavior.

To achieve this objective, we split our dataset into training, validation, and test sets, then trained Decision Trees and Random Forests. We checked the quality of our models against the validation set using accuracy scores, and learned that our most accurate models were a Decision Tree with a depth of 8 (79.6% accuracy) and a Random Forest with a depth of 10 and 10 estimators (82.0% accuracy). We selected the Random Forest as our final model, and it reached 79% accuracy on the test set.

Megaline will be well served by either the Decision Tree or Random Forest model: both exceeded our 75% accuracy threshold on the validation set, and the Random Forest, our final choice, also exceeded it on the test set.