# The calling plan prediction

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. We need to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.  

We have access to behavior data about subscribers who have already switched to the new plans. For this classification task, we need to develop a model that will pick the right plan.  

The model must reach an accuracy of at least 0.75. We check the accuracy using the test dataset.
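As a reminder of what the metric measures, accuracy is the share of predictions that match the true labels. The small sketch below uses made-up labels (not rows from the Megaline data) and compares a manual calculation with `accuracy_score`.

```python
from sklearn.metrics import accuracy_score

# Made-up labels for illustration only: 1 = Ultra, 0 = Smart
true_plans = [0, 1, 0, 0, 1, 0, 1, 0]
predicted_plans = [0, 1, 0, 1, 1, 0, 0, 0]

# Accuracy = number of correct predictions / number of predictions
matches = sum(t == p for t, p in zip(true_plans, predicted_plans))
manual_accuracy = matches / len(true_plans)

# The library function should give the same value
library_accuracy = accuracy_score(true_plans, predicted_plans)

print(manual_accuracy, library_accuracy)
```

Six of the eight made-up predictions are correct, so this toy example sits exactly at the 0.75 threshold.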

### Step 1. Open the data file and study the general information

In [1]:
# import libs
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

*Read the data from "users_behavior.csv" and save it to the variable "df_users_behavior"*

In [2]:
# read the data
df_users_behavior = pd.read_csv('users_behavior.csv')

*Print 5 random rows*

In [3]:
df_users_behavior.sample(n=5, random_state=12) # use sample() method

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
1793,49.0,379.73,24.0,17168.5,0
512,118.0,796.32,3.0,17905.67,1
1564,42.0,280.21,0.0,19316.23,0
3143,69.0,439.39,82.0,19315.86,1
779,59.0,392.04,0.0,43824.93,1


*Data description*

Every observation in the dataset contains monthly behavior information about one user. The information given is as follows:
- calls — number of calls,
- minutes — total call duration in minutes,
- messages — number of text messages,
- mb_used — internet traffic used in MB,
- is_ultra — plan for the current month (Ultra - 1, Smart - 0).

*Look at the general information of our dataset*

In [4]:
df_users_behavior.info() # use info() method

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


*Use describe() method for more information*

In [5]:
df_users_behavior.describe()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
count,3214.0,3214.0,3214.0,3214.0,3214.0
mean,63.038892,438.208787,38.281269,17207.673836,0.306472
std,33.236368,234.569872,36.148326,7570.968246,0.4611
min,0.0,0.0,0.0,0.0,0.0
25%,40.0,274.575,9.0,12491.9025,0.0
50%,62.0,430.6,30.0,16943.235,0.0
75%,82.0,571.9275,57.0,21424.7,1.0
max,244.0,1632.06,224.0,49745.73,1.0


*The "calls" and "messages" columns hold counts, so we convert them from float64 to int*

In [6]:
df_users_behavior['calls'] = df_users_behavior['calls'].astype(int) # use astype() method
df_users_behavior['messages'] = df_users_behavior['messages'].astype(int) # use astype() method
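Because `astype(int)` truncates any fractional part, it can be worth confirming that a column holds only whole numbers before casting. The sketch below does this on a hypothetical three-row frame named `sample` (a stand-in, not the real dataset).

```python
import pandas as pd

# Hypothetical mini-frame standing in for df_users_behavior
sample = pd.DataFrame({
    'calls': [49.0, 118.0, 42.0],
    'messages': [24.0, 3.0, 0.0],
})

for column in ('calls', 'messages'):
    # Cast only if every value is a whole number, so nothing is lost
    is_whole = (sample[column] % 1 == 0).all()
    if is_whole:
        sample[column] = sample[column].astype(int)

print(sample.dtypes)
```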

### Conclusion

The dataset has 3214 rows and no missing values. Each row contains one user's monthly behavior information. Let's start developing our model.

### Step 2. Choosing the best model

*Split "df_users_behavior" into a training set, a validation set, and a test set (60% / 20% / 20%).*

In [7]:
# use train_test_split() method
# df_users_behavior_train, df_users_behavior_valid, = train_test_split(df_users_behavior, test_size=0.2, train_size=0.8, random_state=12345)
# df_users_behavior_train, df_users_behavior_test = train_test_split(df_users_behavior_train, test_size = 0.25, train_size =0.75, random_state=12345)

# use numpy split() method
df_users_behavior_train, df_users_behavior_valid, df_users_behavior_test = (
    np.split(df_users_behavior.sample(frac=1, random_state=12345),
             [int(.6*len(df_users_behavior)),
              int(.8*len(df_users_behavior))])
)
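An alternative to the shuffle-and-slice approach is to call `train_test_split` twice with the `stratify` argument, which keeps the share of Ultra users roughly equal in all three sets. This is a sketch on synthetic data; the frame `demo` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df_users_behavior (values are made up)
rng = np.random.default_rng(12345)
demo = pd.DataFrame({
    'minutes': rng.uniform(0, 1600, size=1000),
    'is_ultra': rng.binomial(1, 0.3, size=1000),
})

# First cut: hold out 40% of the rows for validation + test
demo_train, demo_rest = train_test_split(
    demo, test_size=0.4, random_state=12345, stratify=demo['is_ultra'])

# Second cut: halve the held-out part into validation and test
demo_valid, demo_test = train_test_split(
    demo_rest, test_size=0.5, random_state=12345,
    stratify=demo_rest['is_ultra'])

print(len(demo_train), len(demo_valid), len(demo_test))
print(demo_train['is_ultra'].mean(), demo_test['is_ultra'].mean())
```

Whether stratification changes the results on the real data is not something this sketch shows; it only illustrates the mechanics of a 60/20/20 split.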

*Check the share of rows in each set (in percent)*

In [8]:
for i in (df_users_behavior_train.shape[0],
          df_users_behavior_valid.shape[0],
          df_users_behavior_test.shape[0]):
    print(i / df_users_behavior.shape[0] * 100)

59.98755444928439
20.00622277535781
20.00622277535781


*Create features and target for the training set, the validation set, and the test set.*

In [9]:
features_train = df_users_behavior_train.drop(['is_ultra'], axis=1)
target_train = df_users_behavior_train['is_ultra']

features_valid = df_users_behavior_valid.drop(['is_ultra'], axis=1)
target_valid = df_users_behavior_valid['is_ultra']

features_test = df_users_behavior_test.drop(['is_ultra'], axis=1)
target_test = df_users_behavior_test['is_ultra']

*Train model with "DecisionTreeClassifier"*

In [10]:
# print('Decision Tree model:')
# print()
# for depth in range(1, 5):
#     model = DecisionTreeClassifier(max_depth=depth, random_state=12345)
#     model.fit(features_train, target_train)
#     predicted_train = model.predict(features_train)
#     predicted_valid = model.predict(features_valid)
#     accuracy_train = accuracy_score(target_train, predicted_train)
#     accuracy_valid = accuracy_score(target_valid, predicted_valid)
#     print('max_depth =', depth)
#     print('Training set accuracy =', accuracy_train)
#     print('Validation set accuracy =', accuracy_valid)
#     print()

decissiontree_rows = [] # collect one single-row dataframe per depth
for depth in range(1, 10):
    model = DecisionTreeClassifier(max_depth=depth, random_state=12345) # create the model
    model.fit(features_train, target_train) # fit the model
    score_train = model.score(features_train, target_train) # compute the train accuracy
    score_valid = model.score(features_valid, target_valid) # compute the validation accuracy
    dif_accurancy = score_train - score_valid # compute the difference
    df_decissiontree_temp = pd.DataFrame({'max_depth': [depth], # create temp dataframe with info
                                         'train_accuracy': [score_train], # about depth,
                                         'valid_accuracy': [score_valid], # accuracy and difference
                                         'difference_accurancy': [dif_accurancy]}) 
    decissiontree_rows.append(df_decissiontree_temp) # list.append (DataFrame.append was removed in pandas 2.0)

df_decissiontree = pd.concat(decissiontree_rows) # combine the rows into "df_decissiontree"
df_decissiontree

Unnamed: 0,max_depth,train_accuracy,valid_accuracy,difference_accurancy
0,1,0.761929,0.721617,0.040312
0,2,0.794606,0.751166,0.043439
0,3,0.804461,0.766719,0.037742
0,4,0.812241,0.774495,0.037746
0,5,0.825207,0.772939,0.052268
0,6,0.83195,0.769829,0.062121
0,7,0.840768,0.760498,0.08027
0,8,0.849585,0.765163,0.084422
0,9,0.862552,0.77605,0.086502


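The best trade-off is at "max_depth=4": "valid_accuracy" is 0.774 and "difference_accurancy" is low (0.038). Only "max_depth=9" scores marginally higher on the validation set (0.776), but its "difference_accurancy" is more than twice as large (0.087), which points to overfitting.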
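One way to make this choice reproducible is to filter the results by a maximum allowed gap and then take the highest validation accuracy. The sketch below uses the values from the table above; the tolerance `MAX_GAP = 0.05` is an assumption chosen for illustration, not a value from the original analysis.

```python
import pandas as pd

# Validation results copied from the decision tree table
results = pd.DataFrame({
    'max_depth': list(range(1, 10)),
    'valid_accuracy': [0.721617, 0.751166, 0.766719, 0.774495, 0.772939,
                       0.769829, 0.760498, 0.765163, 0.776050],
    'difference_accurancy': [0.040312, 0.043439, 0.037742, 0.037746,
                             0.052268, 0.062121, 0.080270, 0.084422,
                             0.086502],
})

MAX_GAP = 0.05  # assumed tolerance for the train/validation gap

# Keep only depths whose gap is acceptable, then pick the most accurate one
candidates = results[results['difference_accurancy'] <= MAX_GAP]
best_row = candidates.loc[candidates['valid_accuracy'].idxmax()]
best_depth = int(best_row['max_depth'])

print(best_depth)
```

With a different tolerance the selected depth can change, so the threshold should be stated alongside the result.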

*Train model with "RandomForestClassifier"*

In [11]:
randomforest_rows = [] # collect one single-row dataframe per n_estimators value
for est in range(1, 10):
    model = RandomForestClassifier(n_estimators=est, random_state=12345) # create the model
    model.fit(features_train, target_train) # fit the model
    score_train = model.score(features_train, target_train) # compute the train accuracy
    score_valid = model.score(features_valid, target_valid) # compute the validation accuracy
    dif_accurancy = score_train - score_valid # compute the difference
    df_randomforest_temp = pd.DataFrame({'n_estimators': [est], # create temp dataframe with info
                                         'train_accuracy': [score_train], # about n_estimators,
                                         'valid_accuracy': [score_valid], # accuracy and difference
                                         'difference_accurancy': [dif_accurancy]}) 
    randomforest_rows.append(df_randomforest_temp) # list.append (DataFrame.append was removed in pandas 2.0)

df_randomforest = pd.concat(randomforest_rows) # combine the rows into "df_randomforest"
df_randomforest

Unnamed: 0,n_estimators,train_accuracy,valid_accuracy,difference_accurancy
0,1,0.899378,0.729393,0.169984
0,2,0.90249,0.735614,0.166875
0,3,0.94917,0.738725,0.210445
0,4,0.942946,0.755832,0.187114
0,5,0.971473,0.766719,0.204755
0,6,0.9611,0.780715,0.180384
0,7,0.976141,0.772939,0.203202
0,8,0.971992,0.780715,0.191276
0,9,0.987552,0.777605,0.209947


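We are dealing with overfitting here: "train_accuracy" ranges from 0.90 to 0.99, while "difference_accurancy" stays between 0.17 and 0.21. The best "valid_accuracy" (0.781 at "n_estimators=6" and "n_estimators=8") is only slightly above that of the decision tree.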
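The forests in the loop grow their trees without a depth limit, which is a common source of a large train/validation gap. A possible follow-up, not part of the original analysis, is to compare an unconstrained forest with one that sets `max_depth`. The sketch below runs on synthetic data from `make_classification`, so its numbers say nothing about the Megaline dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic four-feature data; a stand-in, not the Megaline dataset
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

gaps = {}
for depth in (None, 4):  # None means fully grown trees
    forest = RandomForestClassifier(n_estimators=9, max_depth=depth,
                                    random_state=12345)
    forest.fit(X_train, y_train)
    train_acc = forest.score(X_train, y_train)
    valid_acc = forest.score(X_valid, y_valid)
    gaps[depth] = train_acc - valid_acc  # train/validation gap per setting
    print(depth, round(train_acc, 3), round(valid_acc, 3))
```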

*Train model with "LogisticRegression"*

In [12]:
print('Logistic Regression model:')
print()
model = LogisticRegression(random_state=12345, solver='lbfgs') # create the model
model.fit(features_train, target_train) # fit the model
score_train = model.score(features_train, target_train) # count the train score
score_valid = model.score(features_valid, target_valid) # count the valid score
dif_accurancy = score_train - score_valid # count the difference
print('train_accuracy =', score_train)
print('valid_accuracy =', score_valid)
print('dif_accurancy =', dif_accurancy)


Logistic Regression model:

train_accuracy = 0.7510373443983402
valid_accuracy = 0.713841368584759
dif_accurancy = 0.03719597581358125


The "dif_accurancy" is low (0.037), but "valid_accuracy" (0.714) is below the required 0.75.
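The features here live on very different scales (tens of calls versus tens of thousands of megabytes), and the lbfgs solver tends to converge more slowly on unscaled inputs. One option to try, as an assumption rather than something the original notebook did, is to standardize the features in a pipeline. The sketch uses synthetic data with deliberately mismatched scales; whether scaling helps on the real data would need to be checked.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one small-scale and one large-scale feature (made up)
rng = np.random.default_rng(12345)
small = rng.normal(60, 30, size=800)         # roughly "calls"-sized values
large = rng.normal(17000, 7500, size=800)    # roughly "mb_used"-sized values
X = np.column_stack([small, large])
y = (large + 100 * small + rng.normal(0, 5000, size=800) > 23000).astype(int)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Standardize each feature, then fit logistic regression on the scaled data
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=12345, solver='lbfgs'))
scaled_model.fit(X_train, y_train)
scaled_valid_accuracy = scaled_model.score(X_valid, y_valid)

print(scaled_valid_accuracy)
```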

### Conclusion

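Thus "DecisionTreeClassifier" with "max_depth=4" is the best choice for us: it passes the 0.75 threshold on the validation set while keeping the gap between training and validation accuracy small. The random forest reaches a slightly higher validation accuracy but overfits heavily, and logistic regression stays below the threshold.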

### Step 3. Check the quality of the model

*Check the quality of the model using the test set.*

In [13]:
model = DecisionTreeClassifier(max_depth=4, random_state=12345) # create model
model.fit(features_train, target_train) # fit the model
score_test = model.score(features_test, target_test) # count the test score
score_test

0.7900466562986003

### Conclusion

The test accuracy is 0.79, which is above the required threshold of 0.75.
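As a sanity check, the test accuracy can be compared with a trivial model that always predicts the more common plan. The sketch below takes the share of Ultra users from the mean of "is_ultra" in the `describe()` output; note that this share is for the whole dataset, so using it for the test set is an approximation.

```python
# Mean of is_ultra from the describe() output (whole dataset)
ultra_share = 0.306472

# A constant model predicts the majority class for every user
baseline_accuracy = max(ultra_share, 1 - ultra_share)

# Test accuracy of the decision tree reported above
test_accuracy = 0.7900466562986003

print(round(baseline_accuracy, 4))
print(round(test_accuracy - baseline_accuracy, 4))
```

Under this approximation the decision tree beats the constant baseline by roughly ten percentage points.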

### Step 4. Overall conclusion

*The main task:*

For this classification task, we needed to develop a model that picks the right plan for a subscriber.

*Conclusion:*

We have identified that "DecisionTreeClassifier" with "max_depth=4" is the best model for our conditions. It reaches an accuracy of 0.79 on the test set, above the required 0.75.