**Review**

Hello Chris!

I'm happy to review your project today.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did a pretty good job overall, but there are a few problems that need to be fixed before the project is accepted. Let me know if you have questions!

# Sprint 7 Final Project
---
## Introduction

In this project I'll be working for the mobile carrier *Megaline*. An analysis of customer behavior showed that many of the subscribers are on the Legacy plan. I've been tasked with developing a model that analyzes subscriber behavior and recommends one of Megaline's new plans: *Smart* or *Ultra*.

I will be using the dataset that I've preprocessed, which covers subscribers who have previously switched plans. The CSV file can be found <a href='https://practicum-content.s3.us-west-1.amazonaws.com/datasets/users_behavior.csv'>here</a>.

The data is composed of the following:

- `calls` : number of calls
- `minutes` : total call duration in minutes
- `messages` : number of text messages
- `mb_used` : internet traffic used in MB
- `is_ultra` : plan for the current month (1 Ultra, 0 Smart)

This project will consist of the following: 
1. Data Overview
2. Model Training/Testing
3. Model Selection
4. Conclusion

Let's get to it!

## Data Overview
---

In [1]:
# Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
# Loading dataset
df = pd.read_csv('/datasets/users_behavior.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [3]:
df.describe()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


In [4]:
# Double checking for any missing values
df.isna().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

In [5]:
# Checking for duplicates
df.duplicated().sum()

0

In [6]:
df['is_ultra'].value_counts()

0    2229
1     985
Name: is_ultra, dtype: int64

We've got the right dataset. It looks clean, with no missing values or duplicates, and most subscribers are on the Smart plan (`is_ultra` = 0): 2,229 of 3,214, or about 69%.
Let's analyze user behavior for each plan's subscriber base.
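
Because the two classes are imbalanced, a useful reference point is the accuracy of a trivial model that always predicts the majority class. A minimal sketch, using only the counts printed by `value_counts()` above (the variable names are illustrative and not part of the notebook):

```python
# Class counts copied from the value_counts() output above
counts = {0: 2229, 1: 985}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a model that always predicts the majority class
baseline_accuracy = counts[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
```

Any model worth keeping should beat this baseline by a clear margin on the validation set.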

<div class="alert alert-success">
<b>Reviewer's comment V1</b>

Correct
    
</div>

## Data Modeling/Testing
---

After looking at a sample of the dataset, we'll drop the `is_ultra` column as it will be our target. 

We will split the dataframe into training, validation, and test sets in a 3:1:1 ratio (60% / 20% / 20%). I will build three classification models (Decision Tree, Random Forest, and Logistic Regression) and compare their accuracy on the validation set.

The model with the highest accuracy score will be chosen as our final model.
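
The proportions produced by a two-step split can be checked with simple arithmetic. A minimal sketch, using the `test_size` values from this notebook (0.2 for the first split, then 0.25 of what remains); the variable names are illustrative:

```python
first_test_size = 0.2    # share held out by the first split
second_test_size = 0.25  # share of the remainder held out by the second split

remainder = 1 - first_test_size
first_holdout_share = first_test_size
second_holdout_share = remainder * second_test_size
train_share = remainder * (1 - second_test_size)

print(f"train: {train_share:.2f}, "
      f"first hold-out: {first_holdout_share:.2f}, "
      f"second hold-out: {second_holdout_share:.2f}")
```

Both hold-out sets end up with the same share of the full dataset, even though the two `test_size` values differ.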

In [7]:
# Creating variables for model
features = df.drop('is_ultra',axis=1)
target = df['is_ultra']

# Splitting dataset into train, validation, and test sets

features_train, features_valid, target_train, target_valid = train_test_split(
    features,target,test_size=0.2,random_state=12345) # First split

features_train, features_test, target_train, target_test = train_test_split(
    features_train, target_train, test_size=0.25, random_state=12345) # Second Split

<div class="alert alert-success">
<b>Reviewer's comment V1</b>

Good job!
    
</div>

In [8]:
# Creating Decision Tree Model while testing for best hyperparameter settings
best_model = None
best_result = 0
for depth in range(1,10):
    model = DecisionTreeClassifier(max_depth = depth,random_state=12345)
    model.fit(features_train,target_train)
    predictions = model.predict(features_valid)
    result = accuracy_score(target_valid,predictions)
    if result > best_result:
        best_model = model
        best_result = result

print('Accuracy of best model on the validation set is:', best_model, 'Accuracy score: ',best_result)

Accuracy of best model on the validation set is: DecisionTreeClassifier(max_depth=5, random_state=12345) Accuracy score:  0.7884914463452566


After scoring each tree on the validation set, our best Decision Tree model reached an accuracy of about 0.79 with a max depth of 5.

This is above our threshold of 0.75, so it passes, but I want to see how high we can get with other models.

In [9]:
# Creating Random Forest model and testing for best hyperparameter settings
best_score = 0
best_est = 0
best_depth = 0
for est in range(1,30,5):
    for depth in range(1,10):
        model = RandomForestClassifier(n_estimators=est, max_depth=depth,random_state=12345)
        model.fit(features_train,target_train)
        score = model.score(features_valid,target_valid)
        if score > best_score:
            best_score = score
            best_est = est
            best_depth = depth
            
print('Accuracy of the best model on the validation set is (n_estimators={} depth={}:) {}'.format(
    best_est, best_depth, best_score))

Accuracy of the best model on the validation set is (n_estimators=16 depth=6:) 0.7978227060653188


This result is a touch better than the Decision Tree, as expected. The best model has `n_estimators` at 16 and a max depth of 6. At ~0.7978, the accuracy score is above our threshold of 0.75 by a little under 5 percentage points.
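
The margin over the threshold can be stated in percentage points or in basis points. A short sketch of the conversion, using the validation accuracy printed above (variable names are illustrative):

```python
threshold = 0.75
validation_accuracy = 0.7978227060653188  # random forest, from the output above

margin = validation_accuracy - threshold
percentage_points = margin * 100   # 1 percentage point = 0.01
basis_points = margin * 10_000     # 1 basis point = 0.0001

print(f"Margin: {percentage_points:.2f} percentage points "
      f"({basis_points:.0f} basis points)")
```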

Let's create and test the final model, Logistic Regression.

In [10]:
# Creating and testing Logistic Regression Model
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_train,target_train)
score = model.score(features_valid, target_valid)
print('The Accuracy score for the logistic regression model is',score)

The Accuracy score for the logistic regression model is 0.7511664074650077


Although the accuracy score is just over our threshold of 0.75, it does not perform as well as our other two models. Perhaps there are some hyperparameters to tweak that I'm missing?
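
One option worth trying for logistic regression is standardizing the features, since columns such as `mb_used` and `messages` sit on very different scales, and then varying the regularization strength `C`. The sketch below is an illustration, not a tested improvement for this dataset: it runs on synthetic data from `make_classification`, and the candidate `C` values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target (assumption)
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

best_c, best_lr_score = None, 0.0
for c in [0.01, 0.1, 1, 10, 100]:  # arbitrary candidate values
    # Scale the features, then fit logistic regression with strength C
    pipeline = make_pipeline(
        StandardScaler(),
        LogisticRegression(C=c, solver='liblinear', random_state=12345))
    pipeline.fit(X_train, y_train)
    score = pipeline.score(X_valid, y_valid)
    if score > best_lr_score:
        best_c, best_lr_score = c, score

print(f"Best C: {best_c}, validation accuracy: {best_lr_score:.4f}")
```

Putting the scaler inside the pipeline means it is fitted on the training rows only, so no information from the validation set leaks into the preprocessing.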

Nonetheless, our final model will be the random forest model with the best hyperparameters found on the validation set. Let's run the prediction on our test set and finalize our score!

In [11]:
# Creating final model and running our test set through
final_model = RandomForestClassifier(n_estimators=16, max_depth=6, random_state=12345)
final_model.fit(features_train,target_train)
final_score = final_model.score(features_test,target_test)

print('Final accuracy score: ',final_score)

Final accuracy score:  0.7822706065318819


Not bad! I would have liked a higher final score, but it clears the 0.75 threshold and gets the job done.
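
A common variation, once the hyperparameters are fixed, is to refit the chosen model on the training and validation rows combined before scoring it on the test set, so that the final model sees more data. A minimal sketch on synthetic data; the data and the names `X_full`, `y_full`, and `refit_model` are assumptions made for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target (assumption)
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=12345)

# Hyperparameters are assumed to have been chosen on the validation set already
X_full = np.concatenate([X_train, X_valid])
y_full = np.concatenate([y_train, y_valid])

refit_model = RandomForestClassifier(n_estimators=16, max_depth=6,
                                     random_state=12345)
refit_model.fit(X_full, y_full)
refit_test_score = refit_model.score(X_test, y_test)

print('Test accuracy after refitting on train + validation:', refit_test_score)
```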

<div class="alert alert-success">
<b>Reviewer's comment V1</b>

Everything is correct. Well done!
    
</div>

## Conclusion
---

After testing all three classification models, the random forest lived up to its popularity, with the best validation accuracy (~0.798) ahead of the decision tree (~0.788) and logistic regression (~0.751). I'm surprised the decision tree performed better than the logistic regression model. I suspect this was due to how little hyperparameter tuning the logistic regression received.

I'm curious to learn practical ways to get these accuracy scores much higher. It was a blast creating and testing these models. I hope you enjoyed reading this as much as I enjoyed coding it!
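
One practical route to more reliable tuning is cross-validated grid search, which scores each hyperparameter combination on several folds instead of a single validation set. A minimal sketch with scikit-learn's `GridSearchCV`, run on synthetic data; the grid values are assumptions, not recommendations for this dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real features and target (assumption)
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345)

# Small illustrative grid; every combination is scored with 3-fold CV
param_grid = {'n_estimators': [10, 30], 'max_depth': [3, 6, 9]}
search = GridSearchCV(
    RandomForestClassifier(random_state=12345),
    param_grid, scoring='accuracy', cv=3)
search.fit(X_train, y_train)

print('Best parameters:', search.best_params_)
print('Mean cross-validated accuracy:', round(search.best_score_, 4))
print('Test accuracy:', round(search.score(X_test, y_test), 4))
```

With the default `refit=True`, the search retrains the best combination on all of the training rows, which is why `search.score` can be called directly on the test set.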