Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.     

Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset:

1. Open and look through the data file.(data already preprocessed)
2. Split the source data into a training set, a validation set, and a test set.
3. Investigate the quality of different models by changing hyperparameters.
4. Check the quality of the model using the test set.
5. sanity check the model.

In [57]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# 1. Open and look through the data file.

In [4]:
df = pd.read_csv('users_behavior.csv')
df.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


In [6]:
df.shape

(3214, 5)

In [7]:
df.describe()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
count,3214.0,3214.0,3214.0,3214.0,3214.0
mean,63.038892,438.208787,38.281269,17207.673836,0.306472
std,33.236368,234.569872,36.148326,7570.968246,0.4611
min,0.0,0.0,0.0,0.0,0.0
25%,40.0,274.575,9.0,12491.9025,0.0
50%,62.0,430.6,30.0,16943.235,0.0
75%,82.0,571.9275,57.0,21424.7,1.0
max,244.0,1632.06,224.0,49745.73,1.0


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


# 2. Split the source data into a training set, a validation set, and a test set.

In [13]:
train_set, valid = train_test_split(df, test_size=0.4, random_state=12345)
valid_set, test_set = train_test_split(valid, test_size=0.5, random_state=12345)
# train_set, valid_set, test_set

In [16]:
display(train_set.shape, valid_set.shape, test_set.shape)

(1928, 5)

(643, 5)

(643, 5)

# 3. Investigate the quality of different models by changing hyperparameters.

In [67]:
# creating features and targets for date sets
features_train = train_set.drop('is_ultra', axis=1)
target_train = train_set['is_ultra']
features_valid = valid_set.drop('is_ultra', axis=1)
target_valid = valid_set['is_ultra']
features_test = test_set.drop('is_ultra', axis=1)
target_test = test_set['is_ultra']

## Decision Tree Classifier

In [50]:
# Runing decision trees with depth 1-10 to find the best accuracy.
final_depth = 0
final_score = 0
for depth in range(1, 10):
    dtc_model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    dtc_model.fit(features_train,target_train)
    #valid_pred = dtc_model.predict(features_valid)  --- not necessary
    accuracy = dtc_model.score(features_valid, target_valid)
    print(depth, accuracy)
    if accuracy > final_score:
        final_depth = depth
        final_score = accuracy

print("Final depth=", final_depth,"with accuracy:",final_score)

1 0.7542768273716952
2 0.7822706065318819
3 0.7853810264385692
4 0.7791601866251944
5 0.7791601866251944
6 0.7838258164852255
7 0.7822706065318819
8 0.7791601866251944
9 0.7822706065318819
Final depth= 3 with accuracy: 0.7853810264385692


In [51]:
# Assigning model hyperparameters
dtc_model = DecisionTreeClassifier(random_state=12345, max_depth=3)

In [36]:
#training on the training set
dtc_model = dtc_model.fit(features_train,target_train)

In [37]:
# Accuracy score for the training set
stc_score_train = dtc_model.score(features_train, target_train)
stc_score_train  

0.8075726141078838

In [40]:
# Accuracy score for the validation set
stc_score_valid = dtc_model.score(features_valid, target_valid)
stc_score_valid  

0.7853810264385692

The best Decision Tree Classifier model (depth=3) showed on the training set an accuracy of 80.7%, and on the validation set 78.5%.

## Random Forest Classifier

In [59]:
# Runing decision trees with depth 1-10 to find the best accuracy.
final_est = 0
final_score = 0
for est in range(1, 50):
    rfc_model = RandomForestClassifier(random_state=12345, n_estimators=est)
    rfc_model.fit(features_train,target_train)
    accuracy = rfc_model.score(features_valid, target_valid)
    #print(est, accuracy)
    if accuracy > final_score:
        final_est = est
        final_score = accuracy

print("Final estimators=", final_depth,"with accuracy:",final_score)

Final estimators= 3 with accuracy: 0.7947122861586314


In [60]:
# Assigning model hyperparameters
rfc_model = RandomForestClassifier(random_state=12345, n_estimators=3)

In [61]:
#training on the training set
rfc_model = rfc_model.fit(features_train,target_train)

In [62]:
# Accuracy score for the training set
rfc_score_train = rfc_model.score(features_train, target_train)
rfc_score_train 

0.9507261410788381

In [63]:
# Accuracy score for the validation set
rfc_score_valid = rfc_model.score(features_valid, target_valid)
rfc_score_valid  

0.7387247278382582

The best Random Forest Classifier (estimators=3) showed on the training set an accuracy of 95.1%, and on the validation set 73.8%.  
I am not sure why I got that result on the validation set, when runing on the loop I got a different accuracy rate.  
The big difference in accuracy between the traingng and the validation set might indicade an overfitted model.

## Logistic Regression

In [73]:
# Assigning model hyperparameters and training on the training set
lr_model = LogisticRegression(random_state=12345, solver='liblinear')
lr_model = lr_model.fit(features_train,target_train)

In [75]:
# Accuracy score for the validation set
score = lr_model.score(features_valid, target_valid)
print("Logistic regression training score:", score)

Logistic regression training score: 0.7589424572317263


The Logistic Regression showed an accuracy of 75.9%.

The best accuracy rate was on the Decision Tree Classifier.

# 4. Check the quality of the model using the test set.

In [74]:
# Assigning model hyperparameters and testing on the test set
final_model = DecisionTreeClassifier(random_state=12345, max_depth=3)
final_model = final_model.fit(features_test, target_test)

In [76]:
# Accuracy score for the test set
score = final_model.score(features_test, target_test)
print("Final modal testing set score:", score)

Final modal testing set score: 0.7993779160186625
