**Review**

Hello Enrique!

I'm happy to review your project today.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did a pretty good job overall, but there are a few problems that need to be fixed before the project is accepted. Let me know if you have questions!

# Model Creation of Megaline Subscription Data

# Introduction

We will dive into Megaline subscriber data and build a model that helps move customers to a newer plan. Many Megaline customers still have legacy plans, so we will analyze subscriber behavior to build a model that recommends one of the new plans: Smart or Ultra. The model must reach an accuracy of at least 0.75.

In [1]:
#Import our libraries

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error

In [2]:
#Unpack our data

users = pd.read_csv('/datasets/users_behavior.csv')
print(users.info())
users.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB
None


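|   | calls | minutes | messages | mb_used | is_ultra |
|---|-------|---------|----------|---------|----------|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |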


Let's keep in mind that for the 'is_ultra' column, 0 means the subscriber is on the Smart plan, and 1 means the subscriber is on the Ultra plan.

<div class="alert alert-success">
<b>Reviewer's comment</b>

Correct

</div>

In [3]:
#Create features and target

features = users.drop(['is_ultra'], axis =1)
target = users['is_ultra']

In [4]:
#Split our data into a test set, validation set, and training set

#Taking 60% of the data for training set #v2
users_train, users_val_test = train_test_split(users, test_size = 0.4, random_state=12345)

#Taking 20% of the data for validation set and 20% for test set #v2
users_val, users_test = train_test_split(users_val_test, test_size=0.5, random_state=12345)

print(f"Training samples: {len(users_train)} ({len(users_train)/len(users):.2%})")
print(f"Validation samples: {len(users_val)} ({len(users_val)/len(users):.2%})")
print(f"Test samples: {len(users_test)} ({len(users_test)/len(users):.2%})")

Training samples: 1928 (59.99%)
Validation samples: 643 (20.01%)
Test samples: 643 (20.01%)


<div class="alert alert-danger">
<b>Reviewer's comment</b>

The split is not correct:
1. Check the shapes. Your train and test data have the same length, but the train/valid/test ratio should be 60/20/20 or 70/15/15.
2. Your train and test data are exactly the same.

To split correctly:
1. Split the initial data into train and val_test using test_size=0.4.
2. Split val_test into val and test using test_size=0.5.

</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Correct. Good job!

</div>
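
The two-step split described in the reviewer's comment can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's actual cell: the `toy_users` frame and all of its values are made up, and the `stratify` argument is an optional addition (it is not used in the project) that keeps the share of Ultra subscribers roughly equal across the three parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for users_behavior.csv (all values are made up)
rng = np.random.default_rng(12345)
n_rows = 1000
toy_users = pd.DataFrame({
    "calls": rng.integers(0, 200, n_rows).astype(float),
    "minutes": rng.uniform(0, 1500, n_rows),
    "messages": rng.integers(0, 200, n_rows).astype(float),
    "mb_used": rng.uniform(0, 45000, n_rows),
    "is_ultra": (rng.random(n_rows) < 0.3).astype(int),
})

# Step 1: keep 60% for training, hold out the remaining 40%
toy_train, toy_val_test = train_test_split(
    toy_users, test_size=0.4, random_state=12345,
    stratify=toy_users["is_ultra"],  # assumption: stratified split
)

# Step 2: cut the held-out 40% in half: 20% validation, 20% test
toy_val, toy_test = train_test_split(
    toy_val_test, test_size=0.5, random_state=12345,
    stratify=toy_val_test["is_ultra"],
)

for name, part in [("train", toy_train), ("valid", toy_val), ("test", toy_test)]:
    share = part["is_ultra"].mean()
    print(f"{name}: {len(part)} rows, share of Ultra = {share:.3f}")
```

Checking that the three index sets do not overlap is a quick way to catch the kind of mistake flagged in the first review round, where the train and test sets were identical.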

## Interim Testing to Find the Best Models

This is a classification task: the model only needs to choose between two classes, the Smart plan and the Ultra plan. We are not trying to predict a numerical value.

In [5]:
#Find the best Decision Tree model

features_train = users_train.drop(['is_ultra'], axis=1)
target_train = users_train['is_ultra']
features_valid = users_val.drop(['is_ultra'], axis = 1)
target_valid = users_val['is_ultra']

best_model = None
best_result = 0
for depth in range(1, 7):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)  # create a model with the given depth
    model.fit(features_train, target_train)  # train the model with training set
    predictions = model.predict(features_valid)  # get the model's predictions using validation set
    result = accuracy_score(target_valid, predictions)  # calculate the accuracy
    if result > best_result:
        best_model = model
        best_result = result
        
print("Accuracy of the best model:", best_result, "Model:", best_model) #v2

Accuracy of the best model: 0.7853810264385692 Model: DecisionTreeClassifier(max_depth=3, random_state=12345)


<div class="alert alert-success">
<b>Reviewer's comment</b>

Correct

</div>
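
When tuning `max_depth`, it can help to look at training and validation accuracy side by side for each depth, since a growing gap between the two is a common sign of overfitting. The sketch below does this on synthetic data from `make_classification`, so the numbers it prints say nothing about the Megaline data. The variable names (`depth_scores`, `best_depth`) and the depth range of 1 to 10 are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with four features (not the Megaline data)
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0,
    random_state=12345,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

depth_scores = []  # one (depth, train accuracy, validation accuracy) tuple per depth
for depth in range(1, 11):
    tree = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    tree.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, tree.predict(X_train))
    valid_acc = accuracy_score(y_valid, tree.predict(X_valid))
    depth_scores.append((depth, train_acc, valid_acc))
    print(f"max_depth={depth:2d}  train={train_acc:.3f}  valid={valid_acc:.3f}")

# Pick the depth by validation accuracy only; the training score is shown for comparison
best_depth, _, best_valid_acc = max(depth_scores, key=lambda row: row[2])
print(f"Best depth on validation: {best_depth} (accuracy {best_valid_acc:.3f})")
```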

We are already on the right path: the best Decision Tree Classifier (max_depth=3) reaches about 79% accuracy on the validation set. Let's try other model types to see whether any of them does better.

In [6]:
#Find the best Random Forest model

best_score = 0
best_est = 0
for est in range(1, 25): # choose hyperparameter range
    model = RandomForestClassifier(random_state=54321, n_estimators=est) # set number of trees
    model.fit(features_train, target_train) # train model on training set
    score = model.score(features_valid, target_valid) # calculate accuracy score on validation set
    if score > best_score:
        best_score = score  # save best accuracy score on validation set
        best_est = est  # save number of estimators corresponding to best accuracy score

print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))

Accuracy of the best model on the validation set (n_estimators = 24): 0.7822706065318819


<div class="alert alert-success">
<b>Reviewer's comment</b>

Correct

</div>

Another good model, with about 78% accuracy on the validation set (n_estimators = 24), above our 75% minimum threshold. This is very close to, but slightly below, the Decision Tree result. Let's evaluate our final model type: logistic regression.

In [7]:
#Initializing and testing our logistic regression model

model = LogisticRegression(random_state=54321, solver='liblinear')
model.fit(features_train, target_train) #training the model
score_train = model.score(features_train, target_train) #calculate accuracy score on training set
score_valid = model.score(features_valid, target_valid) #calculate accuracy score on validation set

print(
    "Accuracy of the logistic regression model on the training set:",
    score_train,
)
print(
    "Accuracy of the logistic regression model on the validation set:",
    score_valid,
)

Accuracy of the logistic regression model on the training set: 0.7505186721991701
Accuracy of the logistic regression model on the validation set: 0.7589424572317263


<div class="alert alert-success">
<b>Reviewer's comment</b>

Correct

</div>

Though this is the simplest and fastest model to run, it only barely meets our 75% threshold (about 76% on the validation set) and is beaten by both of our previous models. (v2)

~~Our Decision Tree model scores the highest accuracy, let's go back to this and tune our parameters to see which values gave us the best result.~~

~~for depth in range(1, 7):
        model = DecisionTreeClassifier(random_state=12345, max_depth=depth)#create a model
        model.fit(features_train, target_train) #train the model
        predictions_valid = model.predict(features_valid) #find the predictions using validation set
        print("max_depth =", depth, ": ", end='')
        print(accuracy_score(target_valid, predictions_valid))#create a loop for max_depth from 1 to 6~~
        
(v2 fix: this was not necessary; the code above was adjusted to display the ideal depth.)

<div class="alert alert-danger">
<b>Reviewer's comment</b>

This looks like duplicate code. You already tuned the depth for DecisionTreeClassifier above, didn't you?

</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Thank you:)

</div>

## Using our Best Model on the Test Set

In [8]:
#Let's use the best model on our test set

features_test = users_test.drop(['is_ultra'], axis =1)
target_test = users_test['is_ultra']

model = DecisionTreeClassifier(random_state=12345, max_depth=3) #v2
model.fit(features_train, target_train) #train our best model on the training set #v2
predictions_test = model.predict(features_test) #get predictions from the test set
best_accuracy = accuracy_score(target_test, predictions_test)
print(f"Accuracy of the best model on the test set: {best_accuracy}") #accuracy of predictions from the test set

Accuracy of the best model on the test set: 0.7791601866251944


On the test set, our best model scores nearly 78% accuracy, similar to its result on the validation set. (v2)

<div class="alert alert-danger">
<b>Reviewer's comment</b>

You can't train an ML model on the test data. Any ML model should be trained only on the training data. The test data is used only once, to check the final quality of your best model.
    
</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Now everything is correct. Well done!

</div>

## Sanity Check on our Best Model

Let's compare our model against a dummy baseline that ignores the features and always predicts the most frequent class.

In [9]:
#Import libraries for dummy testing

from sklearn.dummy import DummyClassifier
from sklearn.metrics import classification_report
from sklearn.datasets import make_classification

In [10]:
#Train a dummy model
dummy_clf = DummyClassifier(strategy="most_frequent") #v2
dummy_clf.fit(features_train, target_train) #training dummy model with training set
dummy_predictions = dummy_clf.predict(features_test) #making dummy predictions with test set
dummy_accuracy = accuracy_score(target_test, dummy_predictions) #accuracy of dummy predictions

In [11]:
#Results of our model versus a dummy model

print("Our Model Accuracy:", best_accuracy)
print("Dummy Classifier Accuracy:", dummy_accuracy)

Our Model Accuracy: 0.7791601866251944
Dummy Classifier Accuracy: 0.6842923794712286


In [12]:
if best_accuracy > dummy_accuracy:
    print("\nOur model performs better than the constant baseline.")
    improvement = (best_accuracy - dummy_accuracy) / dummy_accuracy * 100
    print(f"Relative improvement over the baseline: {improvement:.2f}%")


Our model performs better than the constant baseline.
Relative improvement over the baseline: 13.86%


<div class="alert alert-success">
<b>Reviewer's comment</b>

Well done! But for a sanity check we usually use the best constant model. For a classification task, that is a model with `strategy="most_frequent"`. You will study it in the next sprint :)
    
</div>
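
The comparison with the baseline can be read in two ways, and the sketch below works through both using the two test accuracies reported above: the absolute gain in percentage points and the relative gain in percent. The second half uses a tiny set of made-up labels (the `toy_*` names are hypothetical) to illustrate that a `most_frequent` dummy scores exactly the share of the test labels that match the majority class of the training labels.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Test accuracies copied from the notebook output above
best_accuracy = 0.7791601866251944
dummy_accuracy = 0.6842923794712286

absolute_gain_pp = (best_accuracy - dummy_accuracy) * 100  # percentage points
relative_gain_pct = (best_accuracy - dummy_accuracy) / dummy_accuracy * 100
print(f"Absolute gain: {absolute_gain_pp:.2f} percentage points")
print(f"Relative gain: {relative_gain_pct:.2f}%")

# Made-up labels: 0 is the majority class in the toy training set
toy_target_train = np.array([0] * 7 + [1] * 3)
toy_target_test = np.array([0] * 6 + [1] * 4)
toy_features_train = np.zeros((10, 1))  # features are ignored by the dummy
toy_features_test = np.zeros((10, 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_features_train, toy_target_train)
toy_dummy_accuracy = accuracy_score(
    toy_target_test, baseline.predict(toy_features_test)
)

# The dummy always predicts 0, so its accuracy is the share of zeros in the test labels
majority_share = np.mean(toy_target_test == 0)
print(f"Toy dummy accuracy: {toy_dummy_accuracy:.2f}, share of class 0: {majority_share:.2f}")
```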

# Conclusion

We have successfully concluded our experimentation with the Megaline subscriber data. We aimed to create a model that can accurately recommend the Smart or Ultra plan to legacy subscribers. We accomplished this with a Decision Tree model (max_depth=3) that scored nearly 78% accuracy on the test set and over 78% on the validation set, meeting our 0.75 accuracy minimum. As a sanity check, we also trained a dummy model that always predicts the most frequent class. It scored about 68% on the test set, so our model is roughly 9.5 percentage points better, a relative improvement of over 13% (v2).