# Megaline Mobile Machine Learning Project

In this project we work with a dataset from a mobile network company named Megaline Mobile. Their main concern is the number of customers still using their old plan. They want a model that can analyze customer behavior and recommend one of the newer packages, Smart or Ultra.

The dataset provided to us describes the behavior of customers who have already switched to the new packages, and it comes from a previous statistical data analysis project. We will develop a model that chooses the appropriate package for each customer. Our primary goal is the highest accuracy we can reach, with a minimum of 75%.



<div class="alert alert-success">
<b>Reviewer's comment v1:</b>
    
It is always helpful for the reader to have additional information about project tasks. It gives an overview of what you are going to achieve in this project.
</div>


## Data Preparation

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
megaline_df = pd.read_csv('/datasets/users_behavior.csv')

In [3]:
megaline_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [4]:
megaline_df.head()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


In [5]:
megaline_df.describe()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


In [6]:
megaline_df['is_ultra'].value_counts(normalize=True)

0    0.693528
1    0.306472
Name: is_ultra, dtype: float64

In [7]:
megaline_df.isna().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

After an initial look at the dataset containing information on customer behavior, the data looks clean: all 3214 rows are fully populated, with no missing values in any column.
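The cells above check for missing values only. A duplicate check could be sketched as follows, assuming pandas is available. The frame `sample_df` is a hypothetical miniature stand-in for `megaline_df`: its first two rows are copied from the `head()` output, and the third row deliberately repeats the first so that the check has something to find.

```python
import pandas as pd

# Hypothetical stand-in for megaline_df. Rows 0 and 1 come from the head()
# output; row 2 is an intentional copy of row 0 for illustration.
sample_df = pd.DataFrame(
    {
        "calls": [40.0, 85.0, 40.0],
        "minutes": [311.9, 516.75, 311.9],
        "messages": [83.0, 56.0, 83.0],
        "mb_used": [19915.42, 22696.96, 19915.42],
        "is_ultra": [0, 0, 0],
    }
)

# duplicated() flags every row that exactly repeats an earlier row.
duplicate_count = int(sample_df.duplicated().sum())
print(f"Duplicated rows: {duplicate_count}")

# drop_duplicates() keeps the first occurrence of each distinct row.
deduplicated_df = sample_df.drop_duplicates().reset_index(drop=True)
print(deduplicated_df.shape)
```

On the real data the same two calls would be applied to `megaline_df` directly.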

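The target column `is_ultra`, however, is imbalanced: about 69% of customers are not on the Ultra package (`is_ultra = 0`) and about 31% are (`is_ultra = 1`).

This matters for how we read accuracy: a model that always predicts the majority class would already score about 69%, so the 75% target leaves little headroom above that baseline.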
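One way to put the class imbalance in perspective is to compute the accuracy of a constant model that always predicts the most common class. The sketch below uses only the class shares printed by `value_counts(normalize=True)`; reading `is_ultra = 0` as the Smart package is an assumption based on the two packages named in the introduction.

```python
# Class shares copied from the value_counts(normalize=True) output.
# Assumption: 0 means the Smart package, 1 means the Ultra package.
class_share = {0: 0.693528, 1: 0.306472}

# A constant model that always predicts the majority class is correct
# exactly as often as that class occurs in the data.
majority_class = max(class_share, key=class_share.get)
baseline_accuracy = class_share[majority_class]

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

Any trained model should be judged against this baseline as well as against the 75% target.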

## Model Training

In [8]:
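# Split the data into train, validation and test sets in a 3:1:1 ratio

# First hold out 20% of the rows as the test set
train_and_valid, test = train_test_split(megaline_df, test_size=0.2, random_state=54321)
# Then take 25% of the remaining 80% (20% overall) for validation.
# No random_state is set here, so this split changes from run to run.
train, valid = train_test_split(train_and_valid, test_size=0.25)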

In [9]:
# Data train set
feature_train = train.drop(['is_ultra'], axis=1)
target_train = train['is_ultra']

# Data validation
feature_valid = valid.drop(['is_ultra'], axis=1)
target_valid = valid['is_ultra']

# Data test

feature_test = test.drop(['is_ultra'], axis=1)
target_test = test['is_ultra']

In [10]:
print(feature_train.shape)
print(feature_valid.shape)
print(feature_test.shape)

(1928, 4)
(643, 4)
(643, 4)
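Since the classes are imbalanced, one option (not used in this notebook) is to stratify both splits on `is_ultra` and to fix `random_state` on each call, so that every subset keeps roughly the same class ratio and the split is reproducible. The sketch below runs on `demo_df`, a synthetic stand-in with made-up values, not on the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for megaline_df: 1000 rows, 30% of them labelled Ultra.
rng = np.random.default_rng(0)
demo_df = pd.DataFrame(
    {
        "minutes": rng.uniform(0, 1600, size=1000),
        "mb_used": rng.uniform(0, 50000, size=1000),
        "is_ultra": np.repeat([0, 1], [700, 300]),
    }
)

# First cut: hold out 20% as the test set, preserving the class ratio.
train_and_valid, test = train_test_split(
    demo_df, test_size=0.2, random_state=54321, stratify=demo_df["is_ultra"]
)
# Second cut: 25% of the remaining 80% gives a 60/20/20 split overall.
train, valid = train_test_split(
    train_and_valid,
    test_size=0.25,
    random_state=54321,
    stratify=train_and_valid["is_ultra"],
)

for name, part in [("train", train), ("valid", valid), ("test", test)]:
    print(name, len(part), round(part["is_ultra"].mean(), 3))
```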


### Model before Tuning

In [11]:
# Create model Decision Tree
model_tree = DecisionTreeClassifier()
model_tree.fit(feature_train, target_train)

train_predict = model_tree.predict(feature_train)
valid_predict = model_tree.predict(feature_valid)

# Accuracy check
print(f'Accuracy training = {accuracy_score(target_train, train_predict)}')
print(f'Accuracy validation = {accuracy_score(target_valid, valid_predict)}')

Accuracy training = 1.0
Accuracy validation = 0.7558320373250389


The training accuracy reaches 100%.
The validation accuracy is about 75.6%, which is only just above our 75% minimum.

The gap between the two is an indication of overfitting.

The Decision Tree model without tuning may be too complex for this dataset. With no depth limit, the tree can fit noise and idiosyncratic patterns in the training data, which leads to poor performance on unseen data.
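The effect of an unlimited depth can be illustrated on synthetic data. In the sketch below the labels depend on one feature plus random noise, so a perfect training score can only come from memorizing that noise. The data and the variable names are illustrative assumptions; `get_depth()` and `get_n_leaves()` are scikit-learn methods for inspecting how large the fitted tree has grown.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label follows feature 0, blurred by random noise.
rng = np.random.default_rng(0)
features = rng.normal(size=(600, 4))
noise = rng.normal(scale=1.0, size=600)
target = (features[:, 0] + noise > 0).astype(int)

train_x, valid_x = features[:400], features[400:]
train_y, valid_y = target[:400], target[400:]

# One tree with no depth limit, one restricted to three levels.
deep_tree = DecisionTreeClassifier(random_state=0).fit(train_x, train_y)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(train_x, train_y)

for name, model in [("unlimited", deep_tree), ("max_depth=3", shallow_tree)]:
    train_acc = accuracy_score(train_y, model.predict(train_x))
    valid_acc = accuracy_score(valid_y, model.predict(valid_x))
    print(name, model.get_depth(), model.get_n_leaves(),
          round(train_acc, 3), round(valid_acc, 3))
```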

In [12]:
# Create model Random Forest
model_forest = RandomForestClassifier()
model_forest.fit(feature_train, target_train)

train_predict = model_forest.predict(feature_train)
valid_predict = model_forest.predict(feature_valid)

print(f'Accuracy training = {accuracy_score(target_train, train_predict)}')
print(f'Accuracy validation = {accuracy_score(target_valid, valid_predict)}')

Accuracy training = 1.0
Accuracy validation = 0.7900466562986003


The Random Forest model reaches 100% accuracy on the training set.
On the validation set it reaches 79.0%.

The model performs very strongly on the training set but noticeably worse on the validation set. This shows that there is potential overfitting in this model as well.

In [13]:
# Create model Logistic Regression
model_regression = LogisticRegression()
model_regression.fit(feature_train, target_train)

train_predict = model_regression.predict(feature_train)
valid_predict = model_regression.predict(feature_valid)

print(f'Accuracy training = {accuracy_score(target_train, train_predict)}')
print(f'Accuracy validation = {accuracy_score(target_valid, valid_predict)}')

Accuracy training = 0.7344398340248963
Accuracy validation = 0.6842923794712286


Both accuracies from the Logistic Regression model are below our 75% threshold: training 73.4%, validation 68.4%. The model scores about five percentage points lower on the validation set, so it generalizes somewhat worse to unseen data, and even its training accuracy falls short of the target.

So far only the tree-based models clear the 75% threshold on the validation set, and both overfit the training data heavily. We will tune our models.

## Model Tuning

### Decision Tree Tuning

In [15]:
best_train = .75
best_valid = .75
best_depth = 0

for i in range(1,11):
    tree_tuning = DecisionTreeClassifier(max_depth = i, random_state=12345)
    tree_tuning.fit(feature_train, target_train)
    
    train_predict = tree_tuning.predict(feature_train)
    valid_predict = tree_tuning.predict(feature_valid)
    
    train_accuracy = accuracy_score(target_train, train_predict)
    valid_accuracy = accuracy_score(target_valid, valid_predict)
    
    print('Depth: ', i)
    print(f'Accuracy training  : , {train_accuracy.round(3)}')
    print(f'Accuracy validation: , {valid_accuracy.round(3)}')

Depth:  1
Accuracy training  : , 0.767
Accuracy validation: , 0.712
Depth:  2
Accuracy training  : , 0.797
Accuracy validation: , 0.759
Depth:  3
Accuracy training  : , 0.812
Accuracy validation: , 0.776
Depth:  4
Accuracy training  : , 0.821
Accuracy validation: , 0.782
Depth:  5
Accuracy training  : , 0.83
Accuracy validation: , 0.781
Depth:  6
Accuracy training  : , 0.843
Accuracy validation: , 0.781
Depth:  7
Accuracy training  : , 0.858
Accuracy validation: , 0.781
Depth:  8
Accuracy training  : , 0.867
Accuracy validation: , 0.781
Depth:  9
Accuracy training  : , 0.88
Accuracy validation: , 0.792
Depth:  10
Accuracy training  : , 0.894
Accuracy validation: , 0.796
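The printed table can also be summarized programmatically. The sketch below copies the rounded validation accuracies from the output above into a dictionary (the name `valid_accuracy_by_depth` is introduced here for illustration) and picks out the depth with the highest validation accuracy, along with every depth that clears the 75% threshold.

```python
# Rounded validation accuracies copied from the loop output above.
valid_accuracy_by_depth = {
    1: 0.712, 2: 0.759, 3: 0.776, 4: 0.782, 5: 0.781,
    6: 0.781, 7: 0.781, 8: 0.781, 9: 0.792, 10: 0.796,
}

# Depth with the highest validation accuracy.
best_depth = max(valid_accuracy_by_depth, key=valid_accuracy_by_depth.get)
best_valid = valid_accuracy_by_depth[best_depth]

# Every depth whose validation accuracy meets the 75% minimum.
passing_depths = [d for d, acc in valid_accuracy_by_depth.items() if acc >= 0.75]

print(f"Best depth: {best_depth} (validation accuracy {best_valid})")
print(f"Depths meeting the 75% threshold: {passing_depths}")
```

On these numbers, depths 4 through 8 are nearly tied at about 0.78, and the highest validation accuracy in the tested range occurs at depth 10.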


In [16]:
# Refit the tree with the chosen depth and score it on the held-out test set
decision_tree_tuning = DecisionTreeClassifier(max_depth=8, random_state=12345)
decision_tree_tuning.fit(feature_train, target_train)

test_predict = decision_tree_tuning.predict(feature_test)
test_accuracy = accuracy_score(target_test, test_predict)

print(f'Accuracy dataset test = {test_accuracy}')

Accuracy dataset test = 0.7838258164852255


Accuracy on the test set reaches 78.4%, which meets our criterion after tuning the `max_depth` hyperparameter.

### Random Forest Tuning

In [17]:
best_train = .75
best_valid = .75
best_depth = 0
best_est = 0
acc_diff = .5

for est in range(50, 501, 50):
    for depth in range(1,11):
        forest_tuning = RandomForestClassifier(n_estimators=est, max_depth=depth, random_state=12345)
        forest_tuning.fit(feature_train, target_train)
        
        train_predict = forest_tuning.predict(feature_train)
        valid_predict = forest_tuning.predict(feature_valid)
        
        train_accuracy = accuracy_score(target_train, train_predict)
        valid_accuracy = accuracy_score(target_valid, valid_predict)
        
        diff = abs(train_accuracy - valid_accuracy)
        
        if diff <= acc_diff:
            if (train_accuracy >= best_train and valid_accuracy >= best_valid):
                best_train = train_accuracy
                best_valid = valid_accuracy
                best_est = est
                best_depth = depth
                
print(best_train)
print(best_valid)
print(best_est)
print(best_depth)

0.9066390041493776
0.7978227060653188
350
10
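An alternative to the manual double loop, shown here only as a hedged sketch and not as the method used in this notebook, is scikit-learn's `GridSearchCV`, which scores each parameter combination with cross-validation instead of a single validation set. The data below is synthetic, and the small grid is chosen so the example runs quickly; it is not the grid searched above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: 300 rows, 4 features, label from two of them.
rng = np.random.default_rng(0)
features = rng.normal(size=(300, 4))
target = (features[:, 0] + features[:, 1] > 0).astype(int)

# Deliberately tiny grid so the sketch finishes in a few seconds.
param_grid = {"n_estimators": [10, 20], "max_depth": [2, 4]}

search = GridSearchCV(
    RandomForestClassifier(random_state=12345),
    param_grid,
    scoring="accuracy",
    cv=3,  # 3-fold cross-validation on the data passed to fit()
)
search.fit(features, target)

print(search.best_params_)
print(round(search.best_score_, 3))
```

With this approach the training and validation sets would typically be combined and passed to `fit`, while the test set stays untouched until the final check.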


In [18]:
forest_tuning = RandomForestClassifier(max_depth = best_depth, n_estimators = best_est, random_state=12345)
forest_tuning.fit(feature_train, target_train)

test_predict = forest_tuning.predict(feature_test)
test_acc = accuracy_score(target_test, test_predict)

print(f'Accuracy dataset test= {test_acc.round(3)}')

Accuracy dataset test= 0.787


The tuned Random Forest model is slightly more accurate on the test set than the tuned Decision Tree model, at 78.7% versus 78.4%. This meets our criterion.

### Logistic Regression Tuning

In [19]:
regression_tuning = LogisticRegression(solver='liblinear', random_state=12345)
regression_tuning.fit(feature_train, target_train)

train_predict = regression_tuning.predict(feature_train)
valid_predict = regression_tuning.predict(feature_valid)

print(f'Accuracy training = {accuracy_score(target_train, train_predict)}')
print(f'Accuracy validation = {accuracy_score(target_valid, valid_predict)}')

Accuracy training = 0.7193983402489627
Accuracy validation = 0.6796267496111975


Unfortunately, the Logistic Regression model did not reach our criterion on either the training or the validation set after tuning was applied.

Training: 71.9%
Validation: 68.0%
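The tuning step above changes only the solver. One further thing that could be tried, offered here as an assumption and not as something tested in this project, is standardizing the features first, since columns such as `mb_used` and `messages` sit on very different scales. The sketch below shows the mechanics on synthetic data using a scikit-learn pipeline; it makes no claim about the result on the Megaline data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales, loosely mimicking
# minutes (hundreds) and mb_used (tens of thousands).
rng = np.random.default_rng(0)
n_rows = 500
minutes = rng.uniform(0, 1600, size=n_rows)
mb_used = rng.uniform(0, 50000, size=n_rows)
features = np.column_stack([minutes, mb_used])
target = (minutes / 1600 + mb_used / 50000 > 1.0).astype(int)

# The scaler is fitted on the training rows only, then reused for prediction.
scaled_regression = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=12345),
)
scaled_regression.fit(features[:400], target[:400])

valid_predict = scaled_regression.predict(features[400:])
valid_accuracy = accuracy_score(target[400:], valid_predict)
print(f"Validation accuracy on synthetic data: {valid_accuracy:.3f}")
```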

## Conclusion

Our sanity check showed that about 69% of the Megaline Mobile customers in the dataset are not on the Ultra package, while about 31% are.

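Before tuning, the Decision Tree (75.6%) and the Random Forest (79.0%) cleared the 75% criterion on the validation set, but both reached 100% accuracy on the training set, which indicates overfitting. The Logistic Regression model (68.4%) did not reach the criterion.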

The Decision Tree model after tuning passed the criteria with 78.4% on the test set.

The Random Forest model after tuning passed the criteria with 78.7%.

The Logistic Regression model did not reach the criteria, with 68.0% on the validation set.

From all the model experiments, we found that two models met the criteria after tuning:

Decision Tree: `max_depth = 8`
Random Forest: `max_depth = 10`, `n_estimators = 350`

It can be concluded that the most accurate model is the Random Forest model with `max_depth = 10` and `n_estimators = 350`.