## Mobile carrier

##### Our goal is to develop a model that will analyze the behavior of mobile customers and recommend one of the company's call plans.

In [1]:
import pandas as pd
import numpy as np
from joblib import dump
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [2]:
df=pd.read_csv('/datasets/users_behavior.csv')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null float64
minutes     3214 non-null float64
messages    3214 non-null float64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


The dataset has 3214 entries and 5 columns. We already performed the data preprocessing step in a previous project, so we can move straight to building the model.

Since we are dealing with a categorical target (Ultra = 1, Smart = 0), this is a classification task.

We will split our data into three parts: training (60%), validation (20%), and test (20%).

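For this purpose we can use the numpy.split() function. We first shuffle the rows with df.sample(frac=1) and then cut the shuffled data at the 60% and 80% marks. Note that df.sample() is called without a random_state here, so the exact split differs from run to run.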

Another way to divide the dataset is to call sklearn.model_selection.train_test_split twice. The first call splits the data into train (train_size=0.8) and test (test_size=0.2). The second call splits that 80% again into validation (test_size=0.25, which gives 0.25 * 0.8 = 0.2 of the full dataset) and train (train_size=0.75, which gives 0.75 * 0.8 = 0.6).
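The two-step split described above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's actual split: the `demo` frame and its random values are made up, and only the column names are borrowed from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (same column names, made-up values).
rng = np.random.default_rng(12345)
demo = pd.DataFrame({
    "calls": rng.integers(0, 200, size=1000).astype(float),
    "minutes": rng.uniform(0, 1500, size=1000),
    "messages": rng.integers(0, 150, size=1000).astype(float),
    "mb_used": rng.uniform(0, 40000, size=1000),
    "is_ultra": rng.choice([0, 1], size=1000, p=[0.7, 0.3]),
})

# Step 1: hold out 20% of all rows as the test set.
demo_rest, demo_test = train_test_split(demo, test_size=0.2, random_state=12345)

# Step 2: split the remaining 80% into 75% train / 25% validation,
# i.e. 0.75 * 0.8 = 0.6 and 0.25 * 0.8 = 0.2 of the full dataset.
demo_train, demo_valid = train_test_split(demo_rest, test_size=0.25, random_state=12345)

print(len(demo_train), len(demo_valid), len(demo_test))
```

With 1000 rows this yields 600, 200, and 200 rows. If the class proportions should be preserved in each part, `train_test_split` also accepts a `stratify` argument.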

In [3]:
df_train, df_valid, df_test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])

In [4]:
features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']
features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']
features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']


Let's start with the Decision Tree model. We'll iterate over max_depth values from 1 to 10 and check the accuracy.

To keep the results reproducible, we'll set random_state to 12345.

In [5]:
for depth in range(1, 11):
    model_tree = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    model_tree.fit(features_train, target_train)

    train_predictions = model_tree.predict(features_train)
    valid_predictions = model_tree.predict(features_valid)
    train_accuracy = accuracy_score(target_train, train_predictions)
    valid_accuracy = accuracy_score(target_valid, valid_predictions)

    print("max_depth =", depth)
    print("Training set:", train_accuracy)
    print("Validation set:", valid_accuracy)

max_depth = 1
Training set: 0.7598547717842323
Validation set: 0.7433903576982893
max_depth = 2
Training set: 0.7956431535269709
Validation set: 0.7682737169517885
max_depth = 3
Training set: 0.8106846473029046
Validation set: 0.7776049766718507
max_depth = 4
Training set: 0.8215767634854771
Validation set: 0.7869362363919129
max_depth = 5
Training set: 0.828838174273859
Validation set: 0.7791601866251944
max_depth = 6
Training set: 0.8360995850622407
Validation set: 0.7667185069984448
max_depth = 7
Training set: 0.8589211618257261
Validation set: 0.7884914463452566
max_depth = 8
Training set: 0.8692946058091287
Validation set: 0.7993779160186625
max_depth = 9
Training set: 0.8781120331950207
Validation set: 0.8055987558320373
max_depth = 10
Training set: 0.8947095435684648
Validation set: 0.7947122861586314


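We can see that max_depth = 3 gives a training accuracy of about 81% and a validation accuracy of about 78%. The two scores are close, so the tree is neither clearly overfitting nor underfitting. Deeper trees reach slightly higher validation accuracy (up to about 81% at max_depth = 9), but from max_depth = 7 onward the training accuracy exceeds the validation accuracy by about 7 points or more, which is a sign of overfitting.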
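One way to make this kind of train/validation comparison easier to read is to collect the scores in a table together with their gap. The sketch below does this on synthetic data from `make_classification`, which is an assumption standing in for `features_train` and `features_valid`; the names `depth_results` and `gap` are illustrative only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the notebook itself uses features_train / features_valid.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, weights=[0.7, 0.3], random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

rows = []
for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    tree.fit(X_train, y_train)
    train_acc = tree.score(X_train, y_train)
    valid_acc = tree.score(X_valid, y_valid)
    # A large positive gap suggests the tree is memorizing the training set.
    rows.append({"max_depth": depth, "train": train_acc,
                 "valid": valid_acc, "gap": train_acc - valid_acc})

depth_results = pd.DataFrame(rows).set_index("max_depth")
print(depth_results.round(3))
```

Sorting `depth_results` by `valid` or by `gap` then shows at a glance which depths trade a little validation accuracy for a much smaller gap.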

Let's check another model, Random Forest. This time we'll iterate over n_estimators values from 10 to 50 in steps of 10 and check the accuracy again, limiting the maximum depth to 10.

In [6]:
for estim in range(10, 51, 10):
    model_forest = RandomForestClassifier(n_estimators=estim, max_depth=10, random_state=12345)
    model_forest.fit(features_train, target_train)

    train_accuracy = model_forest.score(features_train, target_train)
    valid_accuracy = model_forest.score(features_valid, target_valid)

    print("n_estimators =", estim)
    print("Training set:", train_accuracy)
    print("Validation set:", valid_accuracy)

n_estimators = 10
Training set: 0.8952282157676349
Validation set: 0.7978227060653188
n_estimators = 20
Training set: 0.8952282157676349
Validation set: 0.7993779160186625
n_estimators = 30
Training set: 0.8936721991701245
Validation set: 0.7978227060653188
n_estimators = 40
Training set: 0.8931535269709544
Validation set: 0.7993779160186625
n_estimators = 50
Training set: 0.8926348547717843
Validation set: 0.8009331259720062


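We choose n_estimators = 20, which gives a training accuracy of about 90% and a validation accuracy of about 80%. The validation scores for all five settings lie within about 0.3 percentage points of each other, so the exact choice has little effect.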
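To put the differences between these validation scores in perspective, the accuracies can be converted back into counts of correctly classified rows. The sketch below copies the printed validation accuracies and takes the validation set size (449 + 194 = 643 rows) from the `value_counts()` output later in the notebook; the variable names are illustrative.

```python
# Validation accuracies printed for the random forest, keyed by n_estimators.
valid_accuracy_by_estimators = {
    10: 0.7978227060653188,
    20: 0.7993779160186625,
    30: 0.7978227060653188,
    40: 0.7993779160186625,
    50: 0.8009331259720062,
}
n_valid = 449 + 194  # validation rows, from the value_counts() output

# accuracy = correct / n_valid, so correct = accuracy * n_valid (rounded to an integer).
correct_by_estimators = {
    n: round(acc * n_valid) for n, acc in valid_accuracy_by_estimators.items()
}
print(correct_by_estimators)
# {10: 513, 20: 514, 30: 513, 40: 514, 50: 515}
```

The best and worst settings differ by only two validation rows out of 643.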

The third model that we will train is Logistic Regression.

In [7]:
model_regression = LogisticRegression(random_state=12345)
model_regression.fit(features_train, target_train)

train_accuracy = model_regression.score(features_train, target_train)
valid_accuracy = model_regression.score(features_valid, target_valid)

print("Training set:", train_accuracy)
print("Validation set:", valid_accuracy)


Training set: 0.7116182572614108
Validation set: 0.7045101088646968




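We're mainly interested in each classifier's accuracy on the validation set. The decision tree (about 78%) and the random forest (about 80%) give similar results, while logistic regression lags behind at about 70%. The validation scores alone make it hard to choose between the two tree-based models.

So let's check our models on the test set, which has been untouched until now.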

In [8]:
model_forest = RandomForestClassifier(n_estimators=20, max_depth=10, random_state=12345)
model_forest.fit(features_train, target_train)

test_accuracy = model_forest.score(features_test, target_test)
        
print("RandomForestClassifier :", test_accuracy)

dump(model_forest, 'model_forest.joblib')

RandomForestClassifier : 0.8009331259720062


['model_forest.joblib']
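A file written with `dump` can later be restored with `joblib.load`. The sketch below shows the round trip on a small synthetic model; the data, the temporary directory, and the names `forest` and `restored` are assumptions for illustration, not part of the original notebook.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic training set, only to have a fitted model to save.
rng = np.random.default_rng(12345)
X_demo = rng.uniform(0, 1, size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 3] > 1).astype(int)

forest = RandomForestClassifier(n_estimators=20, max_depth=10, random_state=12345)
forest.fit(X_demo, y_demo)

# Save to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model_forest.joblib")
    dump(forest, model_path)
    restored = load(model_path)

# The restored model should reproduce the original predictions exactly.
same_predictions = np.array_equal(forest.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

Files saved this way should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.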

In [9]:
model = LogisticRegression(random_state=12345)
model.fit(features_train, target_train)

test_accuracy = model.score(features_test, target_test)
        
print("LogisticRegression :", test_accuracy)

LogisticRegression : 0.6905132192846034




In [10]:
model = DecisionTreeClassifier(max_depth=3, random_state=12345)
model.fit(features_train, target_train)

test_predictions = model.predict(features_test)
test_accuracy = accuracy_score(target_test, test_predictions)
        
print("DecisionTreeClassifier :", test_accuracy)

DecisionTreeClassifier : 0.7791601866251944


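#### As we can see, the Random Forest model with n_estimators equal to 20 provides the best accuracy on the test set: about 80%, versus about 78% for the decision tree and about 69% for logistic regression.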



In [11]:
df['is_ultra'].value_counts()

0    2229
1     985
Name: is_ultra, dtype: int64

In [12]:
df_valid['is_ultra'].value_counts(), df_train['is_ultra'].value_counts(), df_test['is_ultra'].value_counts()

(0    449
 1    194
 Name: is_ultra, dtype: int64,
 0    1342
 1     586
 Name: is_ultra, dtype: int64,
 0    438
 1    205
 Name: is_ultra, dtype: int64)

Finally, a sanity check of the model. If our data were balanced, any accuracy above 50% would mean that the model is better than guessing.

But our dataset is imbalanced: class '0' has 2229 rows and class '1' has 985, a ratio of roughly 7:3. Using the value_counts() method, we can see that this ratio also holds in the training, validation, and test sets.

For an imbalanced dataset, the sanity check should compare the model's accuracy with the share of the largest class. Our random forest reached about 80% accuracy on the test set, which is above the roughly 70% share of class '0', so the model does better than always predicting the majority class.
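The majority-class baseline used in this sanity check can also be computed with scikit-learn's `DummyClassifier`. The sketch below rebuilds the train and test targets from the class counts in the `value_counts()` output; the placeholder feature arrays and the `_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class counts copied from the value_counts() output for the train and test sets.
y_train_demo = np.array([0] * 1342 + [1] * 586)
y_test_demo = np.array([0] * 438 + [1] * 205)

# DummyClassifier ignores the features, so placeholder zeros are enough here.
X_train_demo = np.zeros((len(y_train_demo), 1))
X_test_demo = np.zeros((len(y_test_demo), 1))

# Always predict the most frequent class seen during fit (class 0).
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_demo, y_train_demo)

baseline_accuracy = baseline.score(X_test_demo, y_test_demo)
print(round(baseline_accuracy, 4))
# 0.6812
```

On this test split the baseline scores 438 / 643, about 68%, so the random forest's test accuracy of about 80% clears it by roughly 12 points.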