# Project description

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.

We have access to behavior data about subscribers who have already switched to the new plans. For this classification task, we need to develop a model that will pick the right plan. 

Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.

## Import the data

In [1]:
#importing libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np 
import seaborn as sns
from scipy import stats as st

In [2]:
from sklearn.tree import DecisionTreeClassifier
from joblib import dump
from sklearn.metrics import accuracy_score 
from sklearn.ensemble import RandomForestClassifier 
from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.dummy import DummyClassifier

In [3]:
df=pd.read_csv('/datasets/users_behavior.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null float64
minutes     3214 non-null float64
messages    3214 non-null float64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [4]:
df['calls'] = df['calls'].astype('int16')
df['messages'] = df['messages'].astype('int16')

In [5]:
df.describe()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
count,3214.0,3214.0,3214.0,3214.0,3214.0
mean,63.038892,438.208787,38.281269,17207.673836,0.306472
std,33.236368,234.569872,36.148326,7570.968246,0.4611
min,0.0,0.0,0.0,0.0,0.0
25%,40.0,274.575,9.0,12491.9025,0.0
50%,62.0,430.6,30.0,16943.235,0.0
75%,82.0,571.9275,57.0,21424.7,1.0
max,244.0,1632.06,224.0,49745.73,1.0


In [6]:
df.head(15)

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40,311.9,83,19915.42,0
1,85,516.75,56,22696.96,0
2,77,467.66,86,21060.45,0
3,106,745.53,81,8437.39,1
4,66,418.74,1,14502.75,0
5,58,344.56,21,15823.37,0
6,57,431.64,20,3738.9,1
7,15,132.4,6,21911.6,0
8,7,43.39,3,2538.67,1
9,90,665.41,38,17358.61,0


## Split the source data into sets

Training data is the set of the data on which the actual training takes place. Validation split helps to improve the model performance by fine-tuning the model after each epoch. The test set informs us about the final accuracy of the model after completing the training phase.

The training set should not be too small; else, the model will not have enough data to learn. On the other hand, if the validation set is too small, then the evaluation metrics like accuracy and precision will have large variance and will not lead to the proper tuning of the model.

In our case test set doesn't exist. Therefore the source data has to be split into three parts: training, validation, and test. The sizes of validation set and test set are usually equal. It gives us source data split in a 3:1:1 ratio.

In [7]:
# Let's say we want to split the data in 60:20:20 for train:valid:test dataset

x = df.drop(columns = ['is_ultra']).copy()
y = df['is_ultra']

# In the first step we will split the data in training and remaining dataset
x_train2, x_test, y_train2, y_test = train_test_split(x,y, train_size=0.8,random_state=12345)

# Now since we want the valid and test size to be equal (20% each of overall data). 
# we have to define valid_size=0.5 (that is 50% of remaining data)

x_train, x_valid, y_train, y_valid = train_test_split(x_train2,y_train2, test_size=0.25,random_state=12345)

print(x_train2.shape), print(y_train2.shape)
print(x_train.shape), print(y_train.shape)
print(x_valid.shape), print(y_valid.shape)
print(x_test.shape), print(y_test.shape)

(2571, 4)
(2571,)
(1928, 4)
(1928,)
(643, 4)
(643,)
(643, 4)
(643,)


(None, None)

## Models comparison

### Decision Tree

In [8]:
goal=0
for i in range(1,11):
    for min_samples_split in range(2,11):
        for min_samples_leaf in range(1,11):
            for criterion in ['gini', 'entropy']:
                model_dt = DecisionTreeClassifier(random_state=12345, max_depth=i,criterion = criterion,
                                               min_samples_split=min_samples_split,
                                              min_samples_leaf=min_samples_leaf)
                model_dt.fit(x_train, y_train)
                predictions_dt = model_dt.predict(x_valid)
                accuracy_dt = accuracy_score(y_valid, predictions_dt)
                #print(i,criterion, ": ", end='')
                #print(accuracy)
    if accuracy_dt > goal:
        goal= accuracy_dt
        goal_max_depth=i
        goal_criterion = criterion
        goal_min_samples_leaf=min_samples_leaf
        goal_min_samples_split=min_samples_split
print('The best combination of the hyperparameters:')
print("max_depth is", goal_max_depth)
print("criterion is", goal_criterion)
print("min_samples_leaf is", goal_min_samples_leaf)
print("min_samples_split is", goal_min_samples_split)
print("Accuracy is",goal)

The best combination of the hyperparameters:
max_depth is 10
criterion is entropy
min_samples_leaf is 10
min_samples_split is 10
Accuracy is 0.7931570762052877


### Random Forest

Another way to choose which hyperparameters to adjust is by conducting an exhaustive grid search.

An exhaustive grid search takes in as many hyperparameters as we would like, and tries every single possible combination of the hyperparameters as well as as many cross-validations as we would like it to perform. An exhaustive grid search is a good way to determine the best hyperparameter values to use, but it is time consuming.


In [9]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 10, stop = 80, num = 10)]
#n_estimators =[]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [2,10]
# Minimum number of samples required to split a node
min_samples_split = [2, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 10]
# Method of selecting samples for training each tree
bootstrap = [True, False]

In [10]:
param_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
print(param_grid)

{'n_estimators': [10, 17, 25, 33, 41, 48, 56, 64, 72, 80], 'max_features': ['auto', 'sqrt'], 'max_depth': [2, 10], 'min_samples_split': [2, 10], 'min_samples_leaf': [1, 10], 'bootstrap': [True, False]}


In [11]:
rf_Model = RandomForestClassifier()
rf_Grid = GridSearchCV(estimator = rf_Model, param_grid = param_grid, cv = 10, verbose=2, n_jobs = 4)
rf_Grid.fit(x_train, y_train)
rf_Grid.best_params_

Fitting 10 folds for each of 320 candidates, totalling 3200 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  33 tasks      | elapsed:    8.6s
[Parallel(n_jobs=4)]: Done 154 tasks      | elapsed:   24.4s
[Parallel(n_jobs=4)]: Done 357 tasks      | elapsed:   53.0s
[Parallel(n_jobs=4)]: Done 640 tasks      | elapsed:  1.5min
[Parallel(n_jobs=4)]: Done 1005 tasks      | elapsed:  2.7min
[Parallel(n_jobs=4)]: Done 1450 tasks      | elapsed:  4.3min
[Parallel(n_jobs=4)]: Done 1977 tasks      | elapsed:  5.7min
[Parallel(n_jobs=4)]: Done 2584 tasks      | elapsed:  7.4min
[Parallel(n_jobs=4)]: Done 3200 out of 3200 | elapsed: 10.0min finished


{'bootstrap': True,
 'max_depth': 10,
 'max_features': 'auto',
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'n_estimators': 25}

In [12]:
print (f'Train Accuracy - : {rf_Grid.score(x_train,y_train):.3f}')

Train Accuracy - : 0.896


In [13]:
predictions_rf = rf_Grid.predict(x_valid)
accuracy_rf = accuracy_score(y_valid, predictions_rf)
accuracy_rf

0.7900466562986003

The accuracy is lower than expected: we were expecting to get accuracy higher than the one of the decision tree model.
We will try to alter parameters ourselves.

In [14]:
best_params = []
best_accuracy = 0

for n_estimators in (1,50):
    for max_depth in range(2,10):
        for min_samples_split in range(2,10):
            for min_samples_leaf in range(1,8):
                for criterion in ['gini', 'entropy']:
                    
                    model_rf_clf = RandomForestClassifier(random_state=123,
                                                          n_estimators=n_estimators,
                                                          max_depth=max_depth,
                                                          min_samples_split=min_samples_split,
                                                          min_samples_leaf=min_samples_leaf,
                                                          criterion=criterion)
                    
                    model_rf_clf.fit(x_train, y_train)
                    
                    predictions = model_rf_clf.predict(x_valid)
                    
                    accuracy = accuracy_score(y_valid, predictions)
                    
                    if accuracy > best_accuracy:
                        best_params = [n_estimators, max_depth, min_samples_split, min_samples_leaf, criterion]
                        best_accuracy = accuracy


In [15]:
print('''Accuracy: {}
The best combination of the hyperparameters for random forest classifier:
   n_estimators = {}
   max_depth = {}
   min_samples_split = {}
   min_samples_leaf = {}
   criterion = {} '''.format(round(best_accuracy, 5),
                             best_params[0],
                             best_params[1],
                             best_params[2],
                             best_params[3],
                             best_params[4]
                            ))

Accuracy: 0.79782
The best combination of the hyperparameters for random forest classifier:
   n_estimators = 50
   max_depth = 9
   min_samples_split = 2
   min_samples_leaf = 4
   criterion = gini 


### Logistic Regression

Logistic regression does not really have any critical hyperparameters to tune.

In [16]:
model_lr = LogisticRegression(random_state=12345, solver='liblinear') 
model_lr.fit(x_train, y_train) 
print(model_lr.score(x_train, y_train))

0.703838174273859


In [17]:
predictions_lr = model_lr.predict(x_valid)
accuracy_lr= accuracy_score(y_valid, predictions_lr)
accuracy_lr

0.6967340590979783

This is the lowest accuracy.

### Conclusion

We've tested 3 common models: Decision Tree, Random Forest and Logistic Regression.

Logistic Regression has the lowest accuracy.

We will choose random forest model since it has the highest accuracy.

## Check the quality of the model using the test set

In [18]:
model = RandomForestClassifier(random_state=123, n_estimators=50,
                               max_depth=9, min_samples_split=2,min_samples_leaf=4,
                              max_features='auto',criterion = 'gini') 
model.fit(x_train, y_train) 
predictions = model.predict(x_test)
accuracy= accuracy_score(y_test, predictions)
accuracy

0.7978227060653188

We can train the chosen model using 80% of data and check it using the test data set.

In [19]:
model = RandomForestClassifier(random_state=123, n_estimators=50,
                               max_depth=9, min_samples_split=2,min_samples_leaf=4,
                              max_features='auto',criterion = 'gini') 
model.fit(x_train2, y_train2) 
predictions = model.predict(x_test)
accuracy= accuracy_score(y_test, predictions)
accuracy

0.8040435458786936

## Sanity check of the model

**We managed to obtain the accuracy of the model to be 0.804.**

For sanity check we will compare our model to some simple baseline.

For example, we can use a constant model that is built-in the sklearn library: sklearn.dummy.DummyClassifier. 

In [20]:
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(x_train2, y_train2)
predictions_dummy=dummy_clf.predict(x_test)
accuracy_dummy= accuracy_score(y_test, predictions_dummy)
accuracy_dummy

0.6951788491446346

In [21]:
dummy_clf = DummyClassifier(strategy="prior")
dummy_clf.fit(x_train2, y_train2)
predictions_dummy=dummy_clf.predict(x_test)
accuracy_dummy= accuracy_score(y_test, predictions_dummy)
accuracy_dummy

0.6951788491446346

In [22]:
dummy_clf = DummyClassifier(strategy="stratified")
dummy_clf.fit(x_train2, y_train2)
predictions_dummy=dummy_clf.predict(x_test)
accuracy_dummy= accuracy_score(y_test, predictions_dummy)
accuracy_dummy

0.5878693623639192

The chosen model passed this sanity check. Our model has an accuracy significantly higher.

The accuracy is 0.804.