**Review**

Hi, my name is Dmitry and I will be reviewing your project.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did a great job! The project is accepted. Keep up the good work on the next sprint!

# Using Machine Learning to Recommend New Data Plans to Customers

The mobile carrier Megaline has noticed that many of its subscribers continue to use legacy data plans. The company would like to recommend one of its newer data plans to each legacy plan user, but it also wants to ensure the recommended plan is the best fit for each customer. The request is to train a model that determines which new Megaline data plan (Smart or Ultra) should be recommended to each customer based on their usage behavior.

Since either the Smart or the Ultra plan can be recommended, this is a binary classification task. Therefore, the following models will be trained and tested:

- Decision Tree
- Random Forest

Training, validation, and testing datasets will be sliced from the overall data provided. The models will be trained and validated using the training and validation datasets, and the hyperparameters will be tuned for both models to obtain the highest accuracy. Once the best hyperparameters are determined, the models will be tested using the test dataset, and the model with the highest accuracy will be deemed the best model.

## Initialization

In this section, the necessary libraries will be imported, the data will be read into a DataFrame, and a summary of the data will be briefly reviewed.

### Load libraries

All the important libraries that may be utilized throughout this report are imported in the cell block below.

In [2]:
# Import the necessary libraries

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

### Load data

Below, the csv file **_users_behavior.csv_** will be read and stored in the DataFrame `df`.

In [3]:
# Load the data and store it to df

df = pd.read_csv('/datasets/users_behavior.csv')

### Explore the data

Let's take a look at the data stored in the `df` DataFrame. The first 15 rows will be printed, followed by the DataFrame's general info.

In [3]:
# Look at the first 15 rows of the DataFrame
df.head(15)

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0
5,58.0,344.56,21.0,15823.37,0
6,57.0,431.64,20.0,3738.9,1
7,15.0,132.4,6.0,21911.6,0
8,7.0,43.39,3.0,2538.67,1
9,90.0,665.41,38.0,17358.61,0


In [4]:
# Look at summary of the df DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


There are no Null values in the DataFrame, and the data types are acceptable as-is. The `'calls'` and `'messages'` columns could contain integer values instead of float values, but leaving them as float values will not hinder the modeling. No work is required to prepare the data.
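
If integer columns were preferred, the conversion could be sketched roughly as follows. This is only an illustration on a small hypothetical DataFrame (`sample` is not the project data), and it assumes the conversion should happen only when every value in a column is a whole number.

```python
import pandas as pd

# Hypothetical sample that mimics the layout of df (not the real data)
sample = pd.DataFrame({
    'calls': [40.0, 85.0, 77.0],
    'minutes': [311.90, 516.75, 467.66],
    'messages': [83.0, 56.0, 86.0],
})

# Convert only if every value is a whole number, so no information is lost
for column in ['calls', 'messages']:
    if (sample[column] % 1 == 0).all():
        sample[column] = sample[column].astype('int64')

print(sample.dtypes)
```

The `'minutes'` column is left untouched because it holds fractional values.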

<div class="alert alert-success">
<b>Reviewer's comment</b>

The data was loaded and inspected. 
    
> The 'calls' and 'messages' columns could contain integer values instead of float values, but leaving them as float values will not hinder the modeling.
    
Yeah, it's true!

</div>

## Modeling

The two learning algorithms/models that will be utilized are as follows:

- Decision Tree
- Random Forest

The trained models with the best quality/accuracy will be selected for final testing. The models will then be passed the test data, and their accuracy will be calculated to determine how they perform and which model is better.

### Splitting data into datasets

The column `'is_ultra'` contains either a 0 or 1 (binary classification) depending on which plan is used. The value is a 0 if the Smart plan is used, and the value is a 1 if the Ultra plan is used. Since we are trying to determine which plan to recommend to customers based on their data usage habits/behavior, the `'is_ultra'` column will be our target column.

The other columns, `'calls'`, `'minutes'`, `'messages'`, and `'mb_used'`, all describe each customer's usage habits/behavior. Each of these columns is expected to have some bearing on whether the customer enrolled in the Smart or the Ultra plan, so these columns will be our feature columns.

In [5]:
# The features of the df DataFrame include all columns except for 'is_ultra'
features = df.drop('is_ultra', axis=1)

# The target of the df DataFrame is the 'is_ultra' column
target = df['is_ultra']

<div class="alert alert-success">
<b>Reviewer's comment</b>

Good!

</div>

The data in the `features` DataFrame and `target` Series will each be split into a training dataset, a validation dataset, and a test dataset using a 3:1:1 ratio. In other words, the training, validation, and test datasets will contain 60%, 20%, and 20% of the data from `features` and `target`, respectively.

In [6]:
# Split the features and target into training, validation, and test datasets.

# First, split off the training datasets from the remaining ("other") data.
# With test_size=0.4, the "other" datasets receive 40% of the data, leaving
# the training datasets with 60%.
features_train, features_other, target_train, target_other = train_test_split(
    features, target, test_size=0.4, random_state=12345
)

# Next, split the "other" datasets in half (test_size=0.5) so that the
# validation and test datasets each contain 20% of the original data.
features_valid, features_test, target_valid, target_test = train_test_split(
    features_other, target_other, test_size=0.5, random_state=12345
)
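
A quick way to confirm that two successive splits give the intended 60/20/20 proportions is to check the resulting lengths. The sketch below uses hypothetical stand-in data (`demo_features` and `demo_target`, 100 rows each) rather than the project data, so the sizes are easy to read off.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for features and target with 100 rows
demo_features = pd.DataFrame({'mb_used': range(100)})
demo_target = pd.Series([0, 1] * 50)

# 60% for training, 40% held out
f_train, f_other, t_train, t_other = train_test_split(
    demo_features, demo_target, test_size=0.4, random_state=12345
)

# Split the held-out 40% in half: 20% validation, 20% test
f_valid, f_test, t_valid, t_test = train_test_split(
    f_other, t_other, test_size=0.5, random_state=12345
)

split_sizes = {'train': len(f_train), 'valid': len(f_valid), 'test': len(f_test)}
print(split_sizes)  # {'train': 60, 'valid': 20, 'test': 20}
```

The same length check could be applied to the real `features_train`, `features_valid`, and `features_test` objects.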


<div class="alert alert-success">
<b>Reviewer's comment</b>

The data split is reasonable!

</div>

### Decision Tree model

Since the values in the target column are either 0 or 1, predicting target values is a binary classification task, and a decision tree is a simple model that suits this kind of task. One of the most influential hyperparameters of a decision tree is its maximum depth. Therefore, the cell block below will create models with varying maximum depths. Each model will be checked for accuracy on the validation dataset, and the model with the best accuracy will be displayed.

In [7]:
# Decision Tree Model/Learning Algorithm

# Initialize the variables that track the best model found so far
best_DT_model = None
best_DT_accuracy = 0
best_DT_depth = 0

# Create models with different maximum depths (1 through 40)
for depth in range(1, 41):

    # Create a model using the given depth and a fixed random_state
    DT_model = DecisionTreeClassifier(max_depth=depth, random_state=12345)

    # Train the model using the training dataset
    DT_model.fit(features_train, target_train)

    # Predict the target values of the validation features
    DT_predictions_valid = DT_model.predict(features_valid)

    # Calculate the accuracy on the validation dataset
    accuracy = accuracy_score(target_valid, DT_predictions_valid)

    # Keep the model if it is the most accurate so far
    if accuracy > best_DT_accuracy:
        best_DT_model = DT_model
        best_DT_depth = depth
        best_DT_accuracy = accuracy

print('Best Model:', best_DT_model)
print(f'Best Accuracy: {round(best_DT_accuracy*100,2)}%')
print('Best Depth:', best_DT_depth)


Best Model: DecisionTreeClassifier(max_depth=3, random_state=12345)
Best Accuracy: 78.54%
Best Depth: 3


The decision tree model with the best validation accuracy has a maximum depth of **3** and an accuracy of approximately **78.54%**. This model is stored as `best_DT_model` and will be used later during testing.
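
One plausible reason a shallow tree wins is that deeper trees fit the training data more closely without generalizing better. A rough way to inspect this is to record training and validation accuracy side by side for each depth. The sketch below runs on synthetic data from `make_classification` (an assumption for illustration, not the Megaline dataset), so its numbers say nothing about the results above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Megaline dataset)
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1,
    random_state=12345
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# Record (training accuracy, validation accuracy) for each depth
scores = {}
for depth in range(1, 11):
    model = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    model.fit(X_train, y_train)
    scores[depth] = (
        accuracy_score(y_train, model.predict(X_train)),
        accuracy_score(y_valid, model.predict(X_valid)),
    )

# Pick the depth with the highest validation accuracy
best_depth = max(scores, key=lambda d: scores[d][1])

for depth, (train_acc, valid_acc) in scores.items():
    print(f'depth={depth:2d}  train={train_acc:.3f}  valid={valid_acc:.3f}')
print('Best depth on validation data:', best_depth)
```

A growing gap between the two columns as depth increases would suggest overfitting.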

### Random Forest model

Now let's use a random forest model to predict target values. Two important hyperparameters of a random forest model are its maximum depth and its number of estimators (`n_estimators`). The number of estimators is the number of "trees in the forest", that is, the number of decision trees in the model. To determine the hyperparameters that result in the best random forest model, models will be trained and validated with various combinations of values for both the depth and the number of estimators. This will be implemented with two nested for loops that iterate through a range of values for each hyperparameter. The resulting models will all be checked for accuracy on the validation dataset, and the random forest model with the highest accuracy will be displayed.

In [8]:
# Random Forest model

# Initialize the variables that track the best model found so far
best_RF_model = None
best_RF_accuracy = 0
best_RF_depth = 0
best_est = 0

# Create models with different depth and estimator values

# Loop over the number of estimators (1 through 20)
for est in range(1, 21):

    # Loop over the maximum depth (1 through 40)
    for depth in range(1, 41):

        # Create a model using the given depth, number of estimators, and a fixed random_state
        RF_model = RandomForestClassifier(max_depth=depth, random_state=12345, n_estimators=est)

        # Train the model using the training dataset
        RF_model.fit(features_train, target_train)

        # Predict the target values of the validation features
        RF_predictions_valid = RF_model.predict(features_valid)

        # Calculate the accuracy on the validation dataset
        accuracy = accuracy_score(target_valid, RF_predictions_valid)

        # Keep the model if it is the most accurate so far
        if accuracy > best_RF_accuracy:
            best_RF_model = RF_model
            best_RF_accuracy = accuracy
            best_RF_depth = depth
            best_est = est

print('Best Model:', best_RF_model)
print(f'Best Accuracy: {round(best_RF_accuracy*100,2)}%')
print('Best Depth:', best_RF_depth)
print('Best n_estimators:', best_est)


Best Model: RandomForestClassifier(max_depth=12, n_estimators=17, random_state=12345)
Best Accuracy: 80.56%
Best Depth: 12
Best n_estimators: 17


The random forest model with the best validation accuracy has a maximum depth of **12**, an `n_estimators` value of **17**, and an accuracy of approximately **80.56%**. This model is stored as `best_RF_model` and will be used later during testing.
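
A search over two hyperparameters can also be written as a single loop over all combinations, for example with scikit-learn's `ParameterGrid`. The sketch below is an alternative formulation, not the code used above: it runs on synthetic data (an assumption) with a deliberately small grid, and because `ParameterGrid` visits combinations in its own order, ties in accuracy could resolve differently than in nested loops.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic stand-in data (an assumption; not the Megaline dataset)
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1,
    random_state=12345
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# A deliberately small grid to keep the sketch fast
grid = ParameterGrid({'n_estimators': [5, 10], 'max_depth': [2, 4, 8]})

best_params, best_accuracy = None, 0.0
for params in grid:
    # Each params is a dict such as {'max_depth': 2, 'n_estimators': 5}
    model = RandomForestClassifier(random_state=12345, **params)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_valid, model.predict(X_valid))
    if accuracy > best_accuracy:
        best_params, best_accuracy = params, accuracy

print('Combinations tried:', len(grid))
print('Best parameters:', best_params)
print(f'Best validation accuracy: {best_accuracy:.3f}')
```

Adding a third hyperparameter then only requires another key in the grid dictionary rather than another level of nesting.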

<div class="alert alert-success">
<b>Reviewer's comment</b>

Great, you tried a couple of different models and tuned their hyperparameters using the validation set!

</div>

## Final Model

The best hyperparameters for both the decision tree and random forest models have been determined, and the best model of each kind has been stored in `best_DT_model` and `best_RF_model`. Now it's time to test these models by predicting the target values of the test dataset and then calculating each model's accuracy.

### Best decision tree model

In [9]:
# Test the final decision tree model using the test dataset

# Predict the target values
DT_test_predictions = best_DT_model.predict(features_test)

# Calculate the accuracy
DT_test_accuracy = accuracy_score(target_test, DT_test_predictions)

# Print the results
print(f'Accuracy: {round(DT_test_accuracy*100,2)}%')

Accuracy: 77.92%


The best decision tree model has an accuracy of **77.92%** when predicting target values for the test dataset. This is greater than the **75%** threshold for model accuracy.

### Best random forest model

In [10]:
# Test the final random forest model using the test dataset

# Predict the target values
RF_test_predictions = best_RF_model.predict(features_test)

# Calculate the accuracy
RF_test_accuracy = accuracy_score(target_test, RF_test_predictions)

# Print the results
print(f'Accuracy: {round(RF_test_accuracy*100,2)}%')

Accuracy: 79.94%


The best random forest model has an accuracy of **79.94%** when predicting target values for the test dataset. This is greater than the **75%** threshold for model accuracy. It is also greater than the **77.92%** accuracy of the decision tree model.

<div class="alert alert-success">
<b>Reviewer's comment</b>

The final models were evaluated on the test set.

</div>

## Sanity (Common Sense) Check

Each user can be recommended either the Smart or the Ultra plan, which corresponds to either a **0** or a **1** in the target datasets or predicted values. Since determining which plan to recommend is a binary classification task, recommending one of the two plans uniformly at random (without consulting any user data) would be correct about **50%** of the time. In other words, random recommendations would have an expected accuracy of **50%**.

The final decision tree model and the final random forest model had test accuracies of **77.92%** and **79.94%**, respectively. These numbers are well above the **50%** accuracy expected from recommending plans at random. Thus, it makes sense to use either of the final trained models to determine which plan to recommend to each customer.

<div class="alert alert-warning">
<b>Reviewer's comment</b>

While it's a decent baseline, we can think of a better one. A constant model that always predicts the majority class (in this case, 0) will have an accuracy equal to the share of the majority class (about 70% in this case).

</div>
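
The majority-class baseline suggested in the comment above could be sketched with scikit-learn's `DummyClassifier`. The labels below are hypothetical (7 Smart users and 3 Ultra users, chosen only to mirror a roughly 70% majority share); they are not taken from the project data.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 7 of 10 users on Smart (0), 3 on Ultra (1)
demo_features = [[0]] * 10  # the constant model ignores feature values
demo_target = [0] * 7 + [1] * 3

# A constant model that always predicts the most frequent class
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(demo_features, demo_target)

baseline_accuracy = accuracy_score(demo_target, baseline.predict(demo_features))
print(f'Baseline accuracy: {baseline_accuracy:.2f}')  # Baseline accuracy: 0.70
```

On the real data, the baseline would presumably be fitted on `features_train` and `target_train` and scored on `features_test` and `target_test`, so that it is compared with the trained models on the same test dataset.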

## Conclusion

The mobile carrier Megaline requested a trained model that would recommend one of their newer plans to customers continuing to use legacy plans. It was determined that this was a binary classification task since only the Ultra or Smart plan could be recommended. Therefore, the two models that were trained were a decision tree model and a random forest model. 

The features and target data were divided into 3 datasets as follows:

- Training dataset (60%)
- Validation dataset (20%)
- Testing dataset (20%)

Multiple models were created with various combinations of hyperparameters. They were trained on the same training data, and their accuracy was then compared on the validation data. The decision tree model and random forest model with the highest validation accuracy were then tested using the testing data. The accuracy of the best decision tree model was calculated to be **77.92%**, and the accuracy of the best random forest model was calculated to be **79.94%**. Both models surpass the **75%** accuracy threshold. The best random forest model is slightly more accurate than the best decision tree model, so it should be the model delivered to Megaline.

<div class="alert alert-success">
<b>Reviewer's comment</b>

Nice summary!

</div>