# Developing a model to recommend a new plan


Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.


The dataset contains behavior data about subscribers who have already switched to the new plans. For this classification task, we will develop a model that picks the right plan. The data has already been preprocessed, so we can move straight to model selection.

In [1]:
#importing libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.gaussian_process import GaussianProcessClassifier

from sklearn.metrics import accuracy_score, confusion_matrix

## Preliminary Analysis

First, we check the data for any obvious problems.

In [2]:
#loading dataset
df = pd.read_csv('/datasets/users_behavior.csv')
df.head()

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0


In [3]:
#analysing features of dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


No columns have missing values: every column has 3214 non-null entries.

In [4]:
#checking details for each column
for col in df.columns:
    print(col)
    print(df[col].describe())
    print("""
    
    
    """)

calls
count    3214.000000
mean       63.038892
std        33.236368
min         0.000000
25%        40.000000
50%        62.000000
75%        82.000000
max       244.000000
Name: calls, dtype: float64

    
    
    
minutes
count    3214.000000
mean      438.208787
std       234.569872
min         0.000000
25%       274.575000
50%       430.600000
75%       571.927500
max      1632.060000
Name: minutes, dtype: float64

    
    
    
messages
count    3214.000000
mean       38.281269
std        36.148326
min         0.000000
25%         9.000000
50%        30.000000
75%        57.000000
max       224.000000
Name: messages, dtype: float64

    
    
    
mb_used
count     3214.000000
mean     17207.673836
std       7570.968246
min          0.000000
25%      12491.902500
50%      16943.235000
75%      21424.700000
max      49745.730000
Name: mb_used, dtype: float64

    
    
    
is_ultra
count    3214.000000
mean        0.306472
std         0.461100
min         0.000000
25%         0

No columns appear to have any issues.

We can split our data without any problems.

## Splitting the data

The data will be split into 60% train, 20% validation and 20% test. 

In [5]:
#splitting data into train and validation

features = df.drop('is_ultra', axis=1)
target = df['is_ultra']

features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.2, random_state=54321) 

In [6]:
#splitting off 25% of the remaining 80% (20% of the full data) for the test set
features_train, features_test, target_train, target_test = train_test_split(
    features_train, target_train, test_size=0.25, random_state=54321) 

In [7]:
#checking correct split
for i in [features_train, features_valid, features_test, target_train, target_valid, target_test]:
    print(i.shape)

(1928, 4)
(643, 4)
(643, 4)
(1928,)
(643,)
(643,)


The data is now split correctly, so we can move on to testing our models.
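The split sizes can also be checked by hand. The sketch below assumes the held-out share is rounded up to a whole number of rows, which matches the shapes printed above; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math


def split_sizes(n_rows, holdout_share):
    """Return (rows kept, rows held out), rounding the held-out part up."""
    n_holdout = math.ceil(n_rows * holdout_share)
    return n_rows - n_holdout, n_holdout


n_total = 3214  # rows in the full dataset, from df.info()

# first split: hold out 20% of all rows for validation
n_rest, n_valid = split_sizes(n_total, 0.2)

# second split: hold out 25% of the remaining rows (20% of the total) for testing
n_train, n_test = split_sizes(n_rest, 0.25)

print(n_train, n_valid, n_test)
```

Taking 25% of the remaining 80% is what yields the intended 60/20/20 proportions.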

## Testing models

### Decision Tree Classifier

Starting off with a simple Decision Tree Classifier.

We can do a broad search initially by looping through different values for max_depth, before narrowing down our search.

In [8]:
#testing a broad range of max depths
for depth in [1] + list(range(10,100, 10)):
    model = DecisionTreeClassifier(max_depth = depth, random_state = 54321)
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)
    acc = accuracy_score(target_valid, predictions)
    
    print(f'depth: {depth}, accuracy_score: {acc}')

depth: 1, accuracy_score: 0.7231726283048211
depth: 10, accuracy_score: 0.7573872472783826
depth: 20, accuracy_score: 0.7216174183514774
depth: 30, accuracy_score: 0.7153965785381027
depth: 40, accuracy_score: 0.7153965785381027
depth: 50, accuracy_score: 0.7153965785381027
depth: 60, accuracy_score: 0.7153965785381027
depth: 70, accuracy_score: 0.7153965785381027
depth: 80, accuracy_score: 0.7153965785381027
depth: 90, accuracy_score: 0.7153965785381027


Accuracy does not change beyond a depth of 30, and the best depth appears to lie somewhere between 2 and 19. Let's perform an exhaustive search between those limits.

In [9]:
#finding best depth
best_acc = 0
best_depth = 0
for depth in list(range(2,20)):
    model = DecisionTreeClassifier(max_depth = depth, random_state = 54321)
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)
    acc = accuracy_score(target_valid, predictions)
    if acc> best_acc:
        best_depth = depth
        best_acc = acc


print(f'Best DecisionTreeClassifier accuracy= {best_acc}, best depth = {best_depth}')

Best DecisionTreeClassifier accuracy= 0.776049766718507, best depth = 7


We now have a benchmark for accuracy: 0.776.
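The search loop above follows a pattern that recurs in the later sections: score every candidate value and keep the first one with the highest score. A minimal sketch of that pattern is shown here. The names `best_candidate` and `toy_score` are hypothetical, and the toy scoring function merely stands in for fitting a model and computing its validation accuracy.

```python
def best_candidate(candidates, score_fn):
    """Return (best_value, best_score), keeping the first value on ties."""
    best_value, best_score = None, float('-inf')
    for value in candidates:
        score = score_fn(value)
        # strict '>' means an earlier candidate wins a tie, as in the loop above
        if score > best_score:
            best_value, best_score = value, score
    return best_value, best_score


def toy_score(depth):
    # hypothetical stand-in for "fit the model, score it on the validation set"
    return 1.0 - abs(depth - 7) / 100


best_depth, best_score = best_candidate(range(2, 20), toy_score)
print(best_depth, best_score)
```

Because ties go to the earliest candidate, the search favours the smaller hyperparameter value when several values score equally.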

Let's test some alternative models.

### RandomForestClassifier

Again we'll start with a broad search of the number of estimators then narrow down our search.

In [19]:
#testing broad number of estimators
for est in list(range(5,201, 5)):
    model = RandomForestClassifier(n_estimators = est, random_state = 54321)
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)
    acc = accuracy_score(target_valid, predictions)
    
    print(f'n_estimators: {est}, accuracy_score: {acc}')

n_estimators: 5, accuracy_score: 0.7511664074650077
n_estimators: 10, accuracy_score: 0.7838258164852255
n_estimators: 15, accuracy_score: 0.7853810264385692
n_estimators: 20, accuracy_score: 0.7807153965785381
n_estimators: 25, accuracy_score: 0.7822706065318819
n_estimators: 30, accuracy_score: 0.7884914463452566
n_estimators: 35, accuracy_score: 0.7947122861586314
n_estimators: 40, accuracy_score: 0.7947122861586314
n_estimators: 45, accuracy_score: 0.7869362363919129
n_estimators: 50, accuracy_score: 0.7869362363919129
n_estimators: 55, accuracy_score: 0.7916018662519441
n_estimators: 60, accuracy_score: 0.7900466562986003
n_estimators: 65, accuracy_score: 0.7884914463452566
n_estimators: 70, accuracy_score: 0.7947122861586314
n_estimators: 75, accuracy_score: 0.7931570762052877
n_estimators: 80, accuracy_score: 0.7947122861586314
n_estimators: 85, accuracy_score: 0.7962674961119751
n_estimators: 90, accuracy_score: 0.7947122861586314
n_estimators: 95, accuracy_score: 0.79004665629

Several different values for the number of estimators between 20 and 150 give the same top accuracy score, so let's perform an exhaustive search between those limits.

In [20]:
#finding optimal number of estimators
best_est = 0
best_acc = 0
for est in list(range(20,150)):
    model = RandomForestClassifier(n_estimators = est, random_state = 54321)
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)
    acc = accuracy_score(target_valid, predictions)
    
    
    if acc> best_acc:
        best_est = est
        best_acc = acc

print(f'Best RandomForestClassifier accuracy= {best_acc}, best n_estimators = {best_est}')

Best RandomForestClassifier accuracy= 0.8009331259720062, best n_estimators = 36


We have a new benchmark for accuracy: 0.801.

### GaussianProcessClassifier

The GaussianProcessClassifier doesn't have the same simple hyperparameters to loop over as the previous models (its main setting is the kernel), so we will just run it once with the default kernel.

In [12]:
#finding accuracy of model
model = GaussianProcessClassifier(max_iter_predict =1000)
model.fit(features_train, target_train)
predictions = model.predict(features_valid)
acc = accuracy_score(target_valid, predictions)
print(f'GaussianProcessClassifier accuracy: {acc}')

GaussianProcessClassifier accuracy: 0.6500777604976672


### GaussianNB

We can test a Naive Bayes classifier.

In [13]:
#finding accuracy of model
model = GaussianNB()
model.fit(features_train, target_train)
predictions = model.predict(features_valid)
acc = accuracy_score(target_valid, predictions)
print(f'GaussianNB accuracy: {acc}')

GaussianNB accuracy: 0.7682737169517885


### KNeighborsClassifier

Running a K neighbors classifier with different values for the number of neighbors.

In [14]:
#find best accuracy 
best_n = 0
best_acc = 0
for n in list(range(1,50)):
    model = KNeighborsClassifier(n_neighbors = n)
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)
    acc = accuracy_score(target_valid, predictions)
    
    
    if acc> best_acc:
        best_n = n
        best_acc = acc

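#report the best accuracy together with the matching number of neighbors
print(f'Best KNeighborsClassifier accuracy= {best_acc}, best n_neighbors = {best_n}')

Best KNeighborsClassifier accuracy= 0.7589424572317263

Note that the recorded run printed `best_est` (36, left over from the random forest search) instead of `best_n`, so the best number of neighbors does not appear in the output.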
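KNeighborsClassifier compares subscribers by the distance between their feature vectors (Euclidean by default), so a feature with a large numeric range can dominate the result. The sketch below uses the first two rows shown by `df.head()` to illustrate this. Whether scaling the features would change the accuracy above was not tested in this notebook and remains an assumption.

```python
# first two rows from df.head(): calls, minutes, messages, mb_used
row_0 = {'calls': 40.0, 'minutes': 311.9, 'messages': 83.0, 'mb_used': 19915.42}
row_1 = {'calls': 85.0, 'minutes': 516.75, 'messages': 56.0, 'mb_used': 22696.96}

# squared difference contributed by each feature
squared_diffs = {name: (row_0[name] - row_1[name]) ** 2 for name in row_0}
squared_distance = sum(squared_diffs.values())

# fraction of the squared Euclidean distance that comes from each feature
shares = {name: diff / squared_distance for name, diff in squared_diffs.items()}

for name, share in shares.items():
    print(f'{name}: {share:.4f}')
```

For this pair of rows, almost all of the distance comes from `mb_used`, simply because it is measured in much larger numbers than the other features.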


### Intermediate Conclusion

Having tested a number of classifier models, we have found our Random Forest Classifier to have the best performance. The performance could potentially be improved via boosting, but our current model exceeds the performance threshold we're after, so we can now check its performance on the 'unseen' test set and perform a sanity check on our results.
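To make the comparison concrete, the sketch below collects the best validation accuracy reported for each model and converts it back into a count of correct predictions on the 643 validation rows. The dictionary and variable names are illustrative; the numbers are copied from the outputs above.

```python
n_valid = 643  # rows in the validation set, from the shapes printed after the split

# best validation accuracy reported for each model above
validation_accuracy = {
    'DecisionTreeClassifier': 0.776049766718507,
    'RandomForestClassifier': 0.8009331259720062,
    'GaussianProcessClassifier': 0.6500777604976672,
    'GaussianNB': 0.7682737169517885,
    'KNeighborsClassifier': 0.7589424572317263,
}

# rank models from most to least accurate
ranking = sorted(validation_accuracy.items(), key=lambda item: item[1], reverse=True)

# convert each accuracy back into a count of correct validation predictions
correct_counts = {name: round(acc * n_valid) for name, acc in ranking}

for name, acc in ranking:
    print(f'{name}: {acc:.3f} ({correct_counts[name]} of {n_valid} correct)')
```

On this validation set, the gap between the random forest and the decision tree corresponds to 16 additional correct predictions.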



## Checking performance on test set

In [15]:
model = RandomForestClassifier(n_estimators = 36, random_state = 54321)
model.fit(features_train, target_train)
predictions = model.predict(features_test)
acc = accuracy_score(target_test, predictions)

print('Accuracy on test set: ', acc)

Accuracy on test set:  0.8211508553654744
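The test accuracy is higher than the best validation accuracy (0.801), which can happen by chance with sets of this size. As a rough illustration that is not part of the original analysis, the sketch below computes a normal-approximation 95% interval for an accuracy measured on 643 rows. The 1.96 multiplier and the independence of the test rows are assumptions of that approximation.

```python
import math

n_test = 643                       # rows in the test set
test_accuracy = 0.8211508553654744

# standard error of a proportion under the normal approximation
standard_error = math.sqrt(test_accuracy * (1 - test_accuracy) / n_test)

# approximate 95% interval (1.96 standard errors either side)
margin = 1.96 * standard_error
lower, upper = test_accuracy - margin, test_accuracy + margin

print(f'{lower:.3f} to {upper:.3f}')
```

The validation accuracy of 0.801 falls inside this interval, so the two scores are consistent with each other.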


## Sanity check

We need to check how the target classes are distributed so we can see what accuracy a trivial model could achieve.


In [16]:
target_test.value_counts()

0    460
1    183
Name: is_ultra, dtype: int64

In [17]:
target_test.mean()

0.2846034214618974

About 71.5% of the test targets have a value of 0, so a model that always predicts 0 would achieve an accuracy score of 0.715.
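The baseline figure follows directly from the class counts. A minimal sketch of the arithmetic, using the counts from `value_counts()` above (the variable names are illustrative):

```python
# class counts in the test target, taken from value_counts() above
class_counts = {0: 460, 1: 183}

n_test = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# accuracy of a model that always predicts the most common class
baseline_accuracy = class_counts[majority_class] / n_test

# share of 1s, which should agree with target_test.mean()
share_of_ones = class_counts[1] / n_test

print(majority_class, round(baseline_accuracy, 3), round(share_of_ones, 3))
```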

In [18]:
confusion_matrix(target_test, predictions)

array([[421,  39],
       [ 76, 107]])

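The model identifies 0s far more reliably than 1s: it correctly classifies 421 of the 460 zeros (about 92%) but only 107 of the 183 ones (about 58%). Its overall accuracy of 0.821 is still well above the 0.715 that always predicting 0 would achieve.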
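The per-class behavior can be read straight off the confusion matrix. The sketch below assumes scikit-learn's convention that rows are true classes and columns are predicted classes, which agrees with the row totals of 460 and 183 seen in `value_counts()`. The metric names are standard, but the variable names are illustrative.

```python
# confusion matrix from above: rows are true classes, columns are predicted classes
matrix = [[421, 39],
          [76, 107]]

(tn, fp), (fn, tp) = matrix
total = tn + fp + fn + tp

accuracy = (tn + tp) / total   # overall share of correct predictions
recall_0 = tn / (tn + fp)      # share of true 0s predicted as 0
recall_1 = tp / (tp + fn)      # share of true 1s predicted as 1
precision_1 = tp / (tp + fp)   # share of predicted 1s that are truly 1

print(round(accuracy, 3), round(recall_0, 3), round(recall_1, 3), round(precision_1, 3))
```

The gap between the two recall values quantifies how much more often the model misses Ultra subscribers than it misclassifies the others.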

## Conclusion

We set out to find a suitable model for recommending a new plan for users.

We trialled a number of models and tweaked hyperparameters where appropriate, namely:
- DecisionTreeClassifier
- RandomForestClassifier
- GaussianProcessClassifier
- GaussianNB
- KNeighborsClassifier

Our best performing model was RandomForestClassifier, with an accuracy score of 0.801 on the validation set and 0.821 on the test set. This model also passed our sanity check, beating the 0.715 accuracy of always predicting the majority class.


