Hello Andrew!

I’m happy to review your project today.
I will mark your mistakes and give you some hints on how to fix them. We are preparing you for a real job, where your team leader or senior colleague will do exactly the same. Don't worry, and enjoy studying!

Below you will find my comments - **please do not move, modify or delete them**.

You can find my comments in green, yellow or red boxes like this:

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Success. Everything is done successfully.
</div>

<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Remarks. Some recommendations.
</div>

<div class="alert alert-block alert-danger">

<b>Reviewer's comment</b> <a class="tocSkip"></a>

Needs fixing. The block requires some corrections. The work can't be accepted while red comments remain.
</div>

You can answer me by using this:

<div class="alert alert-block alert-info">
<b>Student answer.</b> <a class="tocSkip"></a>

Text here.
</div>

# Intro to ML Project: Megaline's Newest Phone Plans

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: **Smart** or **Ultra**.

You have access to behavior data about subscribers who have already switched to the new plans. For this classification task, you need to develop a model that will pick the right plan.

Develop a model with the highest possible <u>**accuracy**</u>.
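As a reminder of what is being maximized, accuracy is the share of subscribers for whom the predicted plan matches the actual plan. The snippet below is a minimal hand-written sketch of that calculation; the `accuracy` helper and the label lists are hypothetical, and the mapping 1 = Ultra, 0 = Smart is an assumption based on the `is_ultra` column name.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty sample")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels (assumption: 1 = Ultra, 0 = Smart)
true_plans = [0, 0, 1, 1, 0]
predicted_plans = [0, 1, 1, 1, 0]

print(accuracy(true_plans, predicted_plans))  # 4 of 5 correct -> 0.8
```

In the notebook itself, `accuracy_score` from `sklearn.metrics` and the `.score()` method of the classifiers report this same metric.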

## Import Required Packages / Libraries / Modules

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

## Load Data & View

In [2]:
# Loading data into a df, then previewing it

df = pd.read_csv('/datasets/users_behavior.csv')

df

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.90 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| ... | ... | ... | ... | ... | ... |
| 3209 | 122.0 | 910.98 | 20.0 | 35124.90 | 1 |
| 3210 | 25.0 | 190.36 | 0.0 | 3275.61 | 0 |
| 3211 | 97.0 | 634.44 | 70.0 | 13974.06 | 0 |
| 3212 | 64.0 | 462.32 | 90.0 | 31239.78 | 0 |


In [3]:
# Calling info() on the dataframe

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


## Splitting Source Data

In [4]:
# Assigning variables to features and targets

features = df.drop(['is_ultra'], axis=1)
target = df['is_ultra']

# Splitting data into training and temp datasets first
# Split 70% for training dataset, 30% for temp
features_train, features_temp, target_train, target_temp = train_test_split(
    features, 
    target, 
    test_size=0.30, 
    random_state=100
)

# Now splitting data from temp dataset to create Validation & Test sets
# Splitting 30% of features_temp and target_temp in half (test_size=0.50)
# Creating validation and test sets for 'features' and 'target', 15% each
features_valid, features_test, target_valid, target_test = train_test_split(
    features_temp,
    target_temp,
    test_size=0.50,
    random_state=100
)

# Final Data Split
# features_train & target_train = 70%
# features_valid & target_valid = 15%
# features_test & target_test = 15%
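The two-step split can be sanity-checked with a little arithmetic. The sketch below estimates how many of the 3214 rows land in each set; the `split_sizes` helper is hypothetical, and it assumes that `train_test_split` rounds the held-out share up to a whole row, which is my understanding of its behavior rather than something shown in this notebook.

```python
import math


def split_sizes(n_rows, temp_share=0.30, test_share_of_temp=0.50):
    """Estimate train/valid/test row counts for a two-step split."""
    # Assumption: the held-out share is rounded up to a whole row
    n_temp = math.ceil(temp_share * n_rows)
    n_train = n_rows - n_temp
    n_test = math.ceil(test_share_of_temp * n_temp)
    n_valid = n_temp - n_test
    return n_train, n_valid, n_test


n_rows = 3214  # row count reported by df.info()
n_train, n_valid, n_test = split_sizes(n_rows)

for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / n_rows:.1%})")
```

Comparing these counts with `features_train.shape`, `features_valid.shape`, and `features_test.shape` is a quick way to confirm the 70/15/15 split.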

<div class="alert alert-success">
<b>Reviewer's comment V1</b>

Good job!
    
</div>

## Testing Models

### Decision Tree Model

In [5]:
# Trying different tree depths and keeping the model with the best validation accuracy

# Initializing variables
best_model = None
best_result = 0

# Outlining 'for' loop
for depth in range(1, 11):
    model_dt = DecisionTreeClassifier(random_state=100, max_depth=depth) # create model looping through depths
    model_dt.fit(features_train, target_train) # training the model
    
    # Evaluating results on validation set
    predictions = model_dt.predict(features_valid) # obtaining model predictions
    result = accuracy_score(target_valid, predictions) # calculating accuracy
    
    if result > best_result:
        best_model = model_dt
        best_result = result
        
print(f"Best Model: {best_model}")
print(f"Accuracy of Best Model: {best_result:.2f}")

Best Model: DecisionTreeClassifier(max_depth=3, random_state=100)
Accuracy of Best Model: 0.80


In our DecisionTree model, after testing max depths from 1 to 10, a max depth of 3 produces the highest accuracy on the validation set (0.80).
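One way to see why a shallow tree can win is to record training accuracy next to validation accuracy for every depth. The sketch below does this on synthetic stand-in data from `make_classification` (an assumption, since the Megaline file is not available here), so its numbers will not match the notebook's; only the pattern is of interest.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 4 numeric features, binary target
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1, random_state=100
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.30, random_state=100
)

# Collect (depth, training accuracy, validation accuracy) for each depth
scores = []
for depth in range(1, 11):
    tree = DecisionTreeClassifier(random_state=100, max_depth=depth)
    tree.fit(X_train, y_train)
    scores.append((depth, tree.score(X_train, y_train), tree.score(X_valid, y_valid)))

for depth, train_acc, valid_acc in scores:
    print(f"max_depth={depth:2d}  train={train_acc:.2f}  valid={valid_acc:.2f}")

best_depth = max(scores, key=lambda row: row[2])[0]
print(f"Best depth by validation accuracy: {best_depth}")
```

Training accuracy typically keeps rising with depth while validation accuracy levels off or drops, which is the usual sign of overfitting.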

### Random Forest Model

In [6]:
# Trying different numbers of trees and keeping the best validation accuracy

# Initializing variables
best_score = 0
best_est = 0

# Outlining 'for' loop
for est in range(1, 11):
    model_rf = RandomForestClassifier(random_state=100, n_estimators=est) # set number of trees
    model_rf.fit(features_train, target_train) # training the model
    
    # Evaluating results on validation set
    score = model_rf.score(features_valid, target_valid) # obtaining model score
    
    if score > best_score:
        best_score = score # best validation accuracy so far
        best_est = est # number of trees that produced that accuracy
        
print("Accuracy of Best Model on Validation Set (n_estimators = {}): {:.2f}".format(best_est, best_score))

Accuracy of Best Model on Validation Set (n_estimators = 10): 0.79


For our RandomForest model, testing 1 to 10 trees, the best number of estimators is 10, with a validation accuracy of 0.79. That is slightly lower than our DecisionTree model's 0.80. Because the best value sits at the upper edge of the range tested, larger forests may score higher.

### Logistic Regression Model

In [7]:
# Testing Logistic Regression Model on classification problem

# Initializing Logistic Regression constructor
model_lr = LogisticRegression(random_state=100, solver='liblinear')

# Training the model
model_lr.fit(features_train, target_train)

# Calculating Accuracy Score on Training Set
score_train_lr = model_lr.score(features_train, target_train)

# Calculating Accuracy Score on Validation Set
score_valid_lr = model_lr.score(features_valid, target_valid)

        
print("Accuracy on Training Set: {:.2f}, Accuracy on Validation Set: {:.2f}".format(score_train_lr, score_valid_lr))

Accuracy on Training Set: 0.71, Accuracy on Validation Set: 0.70


After testing our LogisticRegression model, we see much lower accuracy than with our DecisionTree and RandomForest models: 0.71 on the training set and 0.70 on the validation set. The close training and validation scores suggest the model is not overfitting, but it is not the most accurate model we could use for this classification problem.

<div class="alert alert-success">
<b>Reviewer's comment V1</b>

Everything is correct. Well done!
    
</div>

## Model Quality Investigation

In [12]:
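# DecisionTree model: test-set accuracy vs. best validation accuracy
# Note: after the loop in In [5], model_dt refers to the last tree fitted (max_depth=10),
# not to best_model (max_depth=3)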

result_t1 = model_dt.score(features_test, target_test) # calculating accuracy

print(f"Accuracy Score for Test Dataset: {result_t1:.2f}")
print(f"Accuracy of Best Model in DecisionTree: {best_result:.2f}")

Accuracy Score for Test Dataset: 0.77
Accuracy of Best Model in DecisionTree: 0.80
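After a search loop, the loop variable still points at the last model fitted, not at the one that scored best. The sketch below illustrates this on synthetic stand-in data (an assumption; its numbers are unrelated to the notebook's) and evaluates the selected tree on the test set instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption), split 70/15/15 like the notebook
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1, random_state=100
)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.30, random_state=100
)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=100
)

best_model, best_result = None, 0.0
for depth in range(1, 11):
    model_dt = DecisionTreeClassifier(random_state=100, max_depth=depth)
    model_dt.fit(X_train, y_train)
    result = model_dt.score(X_valid, y_valid)
    if result > best_result:
        best_model, best_result = model_dt, result

# model_dt is whatever the final iteration fitted; best_model is the selected tree
print(f"Last fitted tree: max_depth={model_dt.max_depth}")
print(f"Selected tree:    max_depth={best_model.max_depth}")
print(f"Test accuracy of selected tree: {best_model.score(X_test, y_test):.2f}")
```

In this notebook, the equivalent change would be to score `best_model` on `features_test` and `target_test`; the resulting figure may differ from the 0.77 shown above.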


In [13]:
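# RandomForest model: test-set accuracy vs. best validation accuracy
# model_rf is the last forest fitted in the loop (n_estimators=10), which matches best_est here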

result_t2 = model_rf.score(features_test, target_test) # calculating accuracy

print(f"Accuracy Score for Test Dataset: {result_t2:.2f}")
print(f"Accuracy of Best Model in RandomForest: {best_score:.2f}")

Accuracy Score for Test Dataset: 0.80
Accuracy of Best Model in RandomForest: 0.79


In [16]:
# LogisticRegression Model Comparison to Test Dataset

result_t3 = model_lr.score(features_test, target_test) # calculating accuracy

print(f"Accuracy Score for Test Dataset: {result_t3:.2f}")
print(f"Accuracy of Best Model in LogisticRegression: {score_valid_lr:.2f}")

Accuracy Score for Test Dataset: 0.67
Accuracy of Best Model in LogisticRegression: 0.70


After checking the models against the test dataset, our RandomForest model scores highest on the test data (0.80) and shows the smallest gap between validation and test accuracy (0.79 vs. 0.80). This suggests it generalizes best to new data.

This would mean that, of the three models tested, a RandomForest model is the most likely to choose the right plan for Megaline's customers.
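The comparison can be laid out side by side. The sketch below copies the accuracy figures from the cell outputs above into two dictionaries (the variable names are hypothetical) and computes the validation-to-test gap for each model.

```python
# Accuracy figures copied from the cell outputs above
validation_acc = {"DecisionTree": 0.80, "RandomForest": 0.79, "LogisticRegression": 0.70}
test_acc = {"DecisionTree": 0.77, "RandomForest": 0.80, "LogisticRegression": 0.67}

# Gap = test accuracy minus validation accuracy, rounded to two decimals
gaps = {name: round(test_acc[name] - validation_acc[name], 2) for name in validation_acc}

for name in validation_acc:
    print(
        f"{name:<20} valid={validation_acc[name]:.2f} "
        f"test={test_acc[name]:.2f} gap={gaps[name]:+.2f}"
    )

best_on_test = max(test_acc, key=test_acc.get)
smallest_gap = min(gaps, key=lambda name: abs(gaps[name]))
print(f"Highest test accuracy: {best_on_test}")
print(f"Smallest validation-to-test gap: {smallest_gap}")
```

Both criteria point to the RandomForest model. Note that the DecisionTree test figure was measured with `model_dt`, the last tree fitted in the loop.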

<div class="alert alert-success">
<b>Reviewer's comment V1</b>

Great work!
    
</div>