This report designs a model that recommends a new mobile plan for Megaline customers who are currently on a legacy plan. The number of calls, minutes used, messages sent, and data usage are used as features to build the model that best predicts which plan a customer would choose.

In [2]:
# Load the libraries and the dataset.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv('/datasets/users_behavior.csv')


In [3]:
df.head()

,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [5]:
df.describe()

,calls,minutes,messages,mb_used,is_ultra
count,3214.0,3214.0,3214.0,3214.0,3214.0
mean,63.038892,438.208787,38.281269,17207.673836,0.306472
std,33.236368,234.569872,36.148326,7570.968246,0.4611
min,0.0,0.0,0.0,0.0,0.0
25%,40.0,274.575,9.0,12491.9025,0.0
50%,62.0,430.6,30.0,16943.235,0.0
75%,82.0,571.9275,57.0,21424.7,1.0
max,244.0,1632.06,224.0,49745.73,1.0


In [9]:
# Split the data into 3 portions: 60% for training, 20% for validation, 20% for testing.

df_train, df_rem = train_test_split(df, train_size=0.6, random_state=12345)
df_valid, df_test = train_test_split(df_rem, train_size=0.5, random_state=12345)
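The two-step split can be checked on a stand-in frame. The sketch below is illustrative only: `demo` is a hypothetical synthetic DataFrame with the same number of rows as the dataset, and the `stratify` argument is an optional addition (not used in the cell above) that keeps the share of `is_ultra` similar in each portion.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical synthetic stand-in for the real data (same row count, made-up values).
rng = np.random.default_rng(0)
n_rows = 3214
demo = pd.DataFrame({
    "calls": rng.integers(0, 245, n_rows).astype(float),
    "is_ultra": (rng.random(n_rows) < 0.3).astype(int),
})

# First split: 60% train, 40% remainder. Second split: halve the remainder.
demo_train, demo_rem = train_test_split(
    demo, train_size=0.6, random_state=12345, stratify=demo["is_ultra"])
demo_valid, demo_test = train_test_split(
    demo_rem, train_size=0.5, random_state=12345, stratify=demo_rem["is_ultra"])

# Report the size and class share of each portion.
for name, part in [("train", demo_train), ("valid", demo_valid), ("test", demo_test)]:
    print(name, len(part), round(part["is_ultra"].mean(), 3))
```

With 3214 rows this gives 1928 training rows and 643 rows each for validation and testing.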

In [10]:
# Define features and target.

features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']
features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']

In [11]:
# RandomForestClassifier: sweep n_estimators and keep the value with the best validation accuracy.

best_score = 0
best_est = 0
best_train_score = 0

for est in range(1, 20):
    model = RandomForestClassifier(random_state=54321, n_estimators=est)
    model.fit(features_train, target_train)
    score = model.score(features_valid, target_valid)
    if score > best_score:
        best_score = score
        best_est = est
        best_train_score = model.score(features_train, target_train)

print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))
print("Accuracy of the best model on the training set (n_estimators = {}): {}".format(best_est, best_train_score))


Accuracy of the best model on the validation set (n_estimators = 10): 0.7791601866251944
Accuracy of the best model on the training set (n_estimators = 10): 0.9823651452282157
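The sweep in this cell can also be wrapped in a small helper that returns the best setting together with both accuracies. This is a minimal sketch, not the notebook's own code: `sweep_n_estimators` is a hypothetical name, and the data comes from `make_classification` rather than the Megaline dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def sweep_n_estimators(X_train, y_train, X_valid, y_valid, values, seed=54321):
    """Return (best_n, validation_accuracy, training_accuracy) over the given values."""
    best_n, best_valid, best_train = None, -1.0, None
    for n in values:
        model = RandomForestClassifier(random_state=seed, n_estimators=n)
        model.fit(X_train, y_train)
        valid_acc = model.score(X_valid, y_valid)
        # Strict comparison keeps the smallest n among ties.
        if valid_acc > best_valid:
            best_n, best_valid = n, valid_acc
            best_train = model.score(X_train, y_train)
    return best_n, best_valid, best_train


# Synthetic data for illustration only (assumption: 4 numeric features, binary target).
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.6, random_state=12345)

best_n, valid_acc, train_acc = sweep_n_estimators(
    X_train, y_train, X_valid, y_valid, range(1, 11))
print(best_n, valid_acc, train_acc)
```

Returning the winning value from the function avoids reporting the last loop value by mistake.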


In [12]:
# Compare validation and training accuracy for each n_estimators value.
# Training accuracy is far above validation accuracy, which shows that the model is overfitted.

for est in range(1, 20):
    model = RandomForestClassifier(random_state=54321, n_estimators=est)
    model.fit(features_train, target_train)

    print("n_estimators valid = ", est, ": ", end='')
    print(model.score(features_valid, target_valid))
    print("n_estimators train = ", est, ": ", end='')
    print(model.score(features_train, target_train))

n_estimators valid =  1 : 0.6936236391912908
n_estimators train =  1 : 0.8962655601659751
n_estimators valid =  2 : 0.749611197511664
n_estimators train =  2 : 0.9045643153526971
n_estimators valid =  3 : 0.7527216174183515
n_estimators train =  3 : 0.950207468879668
n_estimators valid =  4 : 0.7542768273716952
n_estimators train =  4 : 0.9486514522821576
n_estimators valid =  5 : 0.744945567651633
n_estimators train =  5 : 0.9719917012448133
n_estimators valid =  6 : 0.7573872472783826
n_estimators train =  6 : 0.966804979253112
n_estimators valid =  7 : 0.7636080870917574
n_estimators train =  7 : 0.975103734439834
n_estimators valid =  8 : 0.7744945567651633
n_estimators train =  8 : 0.9745850622406639
n_estimators valid =  9 : 0.7620528771384136
n_estimators train =  9 : 0.9844398340248963
n_estimators valid =  10 : 0.7791601866251944
n_estimators train =  10 : 0.9823651452282157
n_estimators valid =  11 : 0.7682737169517885
n_estimators train =  11 : 0.9865145228215768
n_estimator
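The size of the overfitting gap can be read directly from the printed values. The sketch below copies the accuracies shown above for n_estimators 1 to 11 (the remaining rows are cut off in the output) and computes training minus validation accuracy. The variable names are illustrative.

```python
# (n_estimators, validation accuracy, training accuracy), copied from the output above.
results = [
    (1, 0.6936236391912908, 0.8962655601659751),
    (2, 0.749611197511664, 0.9045643153526971),
    (3, 0.7527216174183515, 0.950207468879668),
    (4, 0.7542768273716952, 0.9486514522821576),
    (5, 0.744945567651633, 0.9719917012448133),
    (6, 0.7573872472783826, 0.966804979253112),
    (7, 0.7636080870917574, 0.975103734439834),
    (8, 0.7744945567651633, 0.9745850622406639),
    (9, 0.7620528771384136, 0.9844398340248963),
    (10, 0.7791601866251944, 0.9823651452282157),
    (11, 0.7682737169517885, 0.9865145228215768),
]

# Gap between training and validation accuracy for each setting.
gaps = {n: train_acc - valid_acc for n, valid_acc, train_acc in results}
for n, gap in gaps.items():
    print(f"n_estimators={n:2d}  gap={gap:.3f}")

smallest_gap = min(gaps.values())
print(f"smallest gap: {smallest_gap:.3f}")
```

Every setting shown has a gap of more than 15 percentage points, and the gap at n_estimators=10 is about 20 points.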

In [14]:
# LogisticRegression method

model = LogisticRegression(random_state=54321, solver='liblinear')
model.fit(features_train, target_train)
score_train = model.score(features_train, target_train)
score_valid = model.score(features_valid, target_valid)

print("Accuracy of the logistic regression model on the training set:", score_train)
print("Accuracy of the logistic regression model on the validation set:", score_valid)

Accuracy of the logistic regression model on the training set: 0.7157676348547718
Accuracy of the logistic regression model on the validation set: 0.7091757387247278
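The features sit on very different scales (mb_used is in the tens of thousands, messages in the tens), and logistic regression with regularization can be sensitive to that. One thing that may be worth trying, and which was not done in this report, is to standardize the features first. The sketch below uses synthetic data with one column inflated to mimic mb_used, so its score says nothing about the Megaline dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only (assumption: 4 numeric features, binary target).
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X[:, 3] *= 1000.0  # mimic a column on a much larger scale, like mb_used

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.6, random_state=12345)

# The scaler is fitted on the training data only, then applied before the classifier.
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=54321, solver="liblinear"),
)
scaled_lr.fit(X_train, y_train)
scaled_valid_acc = scaled_lr.score(X_valid, y_valid)
print(scaled_valid_acc)
```

Putting the scaler inside the pipeline keeps validation rows out of the scaling statistics.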


In [9]:
# DecisionTreeClassifier method

for depth in range(1,20):
    model = DecisionTreeClassifier(random_state=54321, max_depth=depth)
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid)
    predictions_train = model.predict(features_train)
    
    print("max_depth valid = ", depth, ": ", end='')
    print(accuracy_score(target_valid, predictions_valid))
    print("max_depth train = ", depth, ": ", end='')
    print(accuracy_score(target_train, predictions_train))

max_depth valid =  1 : 0.7542768273716952
max_depth train =  1 : 0.7577800829875518
max_depth valid =  2 : 0.7822706065318819
max_depth train =  2 : 0.7878630705394191
max_depth valid =  3 : 0.7853810264385692
max_depth train =  3 : 0.8075726141078838
max_depth valid =  4 : 0.7791601866251944
max_depth train =  4 : 0.8106846473029046
max_depth valid =  5 : 0.7791601866251944
max_depth train =  5 : 0.8200207468879668
max_depth valid =  6 : 0.7838258164852255
max_depth train =  6 : 0.8376556016597511
max_depth valid =  7 : 0.7822706065318819
max_depth train =  7 : 0.8552904564315352
max_depth valid =  8 : 0.7822706065318819
max_depth train =  8 : 0.8620331950207469
max_depth valid =  9 : 0.7838258164852255
max_depth train =  9 : 0.8807053941908713
max_depth valid =  10 : 0.776049766718507
max_depth train =  10 : 0.8895228215767634
max_depth valid =  11 : 0.7573872472783826
max_depth train =  11 : 0.9071576763485477
max_depth valid =  12 : 0.76049766718507
max_depth train =  12 : 0.925829
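The best depth can be picked from the printed validation accuracies instead of by eye. The sketch below copies the values shown above for max_depth 1 to 12 (deeper trees are cut off in the output). The dictionary name is illustrative.

```python
# Validation accuracy by max_depth, copied from the output above.
valid_by_depth = {
    1: 0.7542768273716952,
    2: 0.7822706065318819,
    3: 0.7853810264385692,
    4: 0.7791601866251944,
    5: 0.7791601866251944,
    6: 0.7838258164852255,
    7: 0.7822706065318819,
    8: 0.7822706065318819,
    9: 0.7838258164852255,
    10: 0.776049766718507,
    11: 0.7573872472783826,
    12: 0.76049766718507,
}

# Depth with the highest validation accuracy among the values shown.
best_depth = max(valid_by_depth, key=valid_by_depth.get)
print(best_depth, round(valid_by_depth[best_depth] * 100, 2))
```

Among the depths shown, max_depth=3 gives the highest validation accuracy, 78.54%.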

DecisionTreeClassifier, LogisticRegression, and RandomForestClassifier were each trained to predict a customer's mobile plan. On the validation set, LogisticRegression reached an accuracy of 70.92% and DecisionTreeClassifier reached 78.54% with max_depth set to 3. RandomForestClassifier reached 77.92% with n_estimators set to 10 in the sweep from 1 to 19 shown above; the 80.25% reported for n_estimators set to 99 comes from a run whose output is not included in this report. That 80.25% is the highest accuracy obtained and exceeds the 75% target, so RandomForestClassifier is selected as the final model.

In [38]:
# Fit the final model on the training set and measure its accuracy on the held-out test set.

features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']

final_model = RandomForestClassifier(random_state=54321, n_estimators=19)
final_model.fit(features_train, target_train)

predictions_test = final_model.predict(features_test)
print(accuracy_score(target_test, predictions_test))

# Fitting on features_test instead would score the model on rows it has already seen
# and report an inflated accuracy (0.9937791601866252 in an earlier run of this cell).
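To illustrate why the test set must stay out of fitting, the sketch below trains one forest on the training and validation rows combined and scores it on held-out test rows, then contrasts it with a forest fitted and scored on the same test rows. The data is synthetic and the names `held_out_acc` and `leaky_acc` are hypothetical, so the numbers do not describe the Megaline dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only (assumption: 4 numeric features, binary target).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=0)
X_train, X_rem, y_train, y_rem = train_test_split(
    X, y, train_size=0.6, random_state=12345)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rem, y_rem, train_size=0.5, random_state=12345)

# Honest estimate: fit on train + validation rows, score on unseen test rows.
model = RandomForestClassifier(random_state=54321, n_estimators=19)
model.fit(np.vstack([X_train, X_valid]), np.concatenate([y_train, y_valid]))
held_out_acc = accuracy_score(y_test, model.predict(X_test))

# Leaky estimate: fit and score on the same test rows.
leaky = RandomForestClassifier(random_state=54321, n_estimators=19)
leaky.fit(X_test, y_test)
leaky_acc = accuracy_score(y_test, leaky.predict(X_test))

print("held-out:", held_out_acc)
print("leaky:   ", leaky_acc)
```

The leaky score mostly reflects memorization of the test rows, so only the held-out score estimates performance on new customers.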


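Only two outcomes are possible here (is_ultra = 0 or 1), so guessing each class with equal probability would be right about 50% of the time. A stricter baseline is to always predict the majority class: the mean of is_ultra is 0.306, so always predicting 0 would be right for about 69.4% of the customers in the full dataset. The random forest's validation accuracy of 77.92% (n_estimators set to 10) is above both baselines, so the model passes the sanity check. An accuracy near 99% obtained by scoring a model on the same rows it was fitted on does not count toward this check.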
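The majority-class baseline follows from the summary statistics alone. The sketch below uses the row count and the mean of is_ultra from `df.describe()`. It describes the full dataset rather than the test split, and the variable names are illustrative.

```python
n_rows = 3214
share_ultra = 0.306472  # mean of is_ultra from df.describe()

# Recover the class counts from the mean of the 0/1 target.
n_ultra = round(n_rows * share_ultra)
n_other = n_rows - n_ultra

# Accuracy of a constant model that always predicts the more common class.
majority_baseline = max(n_ultra, n_other) / n_rows
# Accuracy expected from guessing each class with probability 0.5.
coin_flip_baseline = 0.5

print(n_ultra, n_other)
print(round(majority_baseline, 4))
```

A model is only useful here if it beats roughly 69%, not 50%.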

This report compared RandomForestClassifier, LogisticRegression, and DecisionTreeClassifier. The goal was a model that predicts the mobile plan with at least 75% accuracy. The selected model is a RandomForestClassifier tuned through the n_estimators hyperparameter (19 in the final cell), and the best random forest accuracy reported is 80.2%. The model is overfitted: with n_estimators set to 10, training accuracy is about 98% while validation accuracy is about 78%.