Import the required libraries

In [59]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

Open and check the data.

In [60]:
df = pd.read_csv('datasets/users_behavior.csv')

In [61]:
df.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


In [62]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [63]:
df.describe()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
count,3214.0,3214.0,3214.0,3214.0,3214.0
mean,63.038892,438.208787,38.281269,17207.673836,0.306472
std,33.236368,234.569872,36.148326,7570.968246,0.4611
min,0.0,0.0,0.0,0.0,0.0
25%,40.0,274.575,9.0,12491.9025,0.0
50%,62.0,430.6,30.0,16943.235,0.0
75%,82.0,571.9275,57.0,21424.7,1.0
max,244.0,1632.06,224.0,49745.73,1.0


Split the data into training and validation sets, using the train_test_split function.

-  I have chosen the validation set to be 25% of the dataset (804 of 3214 rows). No separate test set is held out.

In [64]:
features = df.drop('is_ultra', axis=1)
target = df['is_ultra']
df_train, df_valid = train_test_split(df, test_size=0.25, random_state=12345)
features_train = df_train.drop('is_ultra', axis=1)
target_train = df_train['is_ultra']
features_valid = df_valid.drop('is_ultra', axis=1)
target_valid = df_valid['is_ultra']

In [65]:
print(features.shape)
print(target.shape)
print(features_train.shape)
print(target_train.shape)
print(features_valid.shape)
print(target_valid.shape)

(3214, 4)
(3214,)
(2410, 4)
(2410,)
(804, 4)
(804,)
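
If a separate test set is also wanted, one option is to call train_test_split twice. This is a minimal sketch, assuming a 60/20/20 split and a synthetic stand-in DataFrame; the `demo` names, the proportions and the random values are illustrative and are not part of the notebook above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same column names as users_behavior.csv (made-up values).
rng = np.random.default_rng(12345)
demo = pd.DataFrame({
    'calls': rng.integers(0, 200, size=1000).astype(float),
    'minutes': rng.uniform(0, 1500, size=1000),
    'messages': rng.integers(0, 200, size=1000).astype(float),
    'mb_used': rng.uniform(0, 45000, size=1000),
    'is_ultra': rng.integers(0, 2, size=1000),
})

# Step 1: keep 60% for training and set aside the remaining 40%.
demo_train, demo_rest = train_test_split(demo, test_size=0.4, random_state=12345)
# Step 2: halve the remainder into validation and test sets (20% each).
demo_valid, demo_test = train_test_split(demo_rest, test_size=0.5, random_state=12345)

print(demo_train.shape, demo_valid.shape, demo_test.shape)
```

With this layout the validation set is used to choose hyperparameters, and the test set is touched only once, for the final score.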


Selection of models.
   - I will use the following classification models:

1. Decision Tree Classifier

In [66]:
#a loop for max_depth from 1 to 5
for i in range(1,6):
    model_dt = DecisionTreeClassifier(random_state=12345, max_depth=i)
    model_dt.fit(features_train, target_train)
    dt_valid_predictions = model_dt.predict(features_valid)
    accuracy_dt = accuracy_score(target_valid, dt_valid_predictions)
    print('max_depth = ', i , ':', accuracy_dt)

max_depth =  1 : 0.75
max_depth =  2 : 0.7835820895522388
max_depth =  3 : 0.7885572139303483
max_depth =  4 : 0.7810945273631841
max_depth =  5 : 0.7810945273631841


#### Different values of max_depth give different accuracy scores.
#### The highest validation accuracy is achieved at max_depth=3.
#### The Decision Tree Classifier reaches 78.9% accuracy with a maximum depth of 3.
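
Rather than reading the best depth off the printed list, the loop can keep track of it. This is a minimal sketch on synthetic data from `make_classification` (an assumption, used only so the example runs on its own); the variable names `best_depth` and `best_accuracy` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-feature binary problem standing in for the real data.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

best_depth, best_accuracy = None, 0.0
for depth in range(1, 6):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_valid, model.predict(X_valid))
    # Keep the first depth that gives the highest validation accuracy.
    if accuracy > best_accuracy:
        best_depth, best_accuracy = depth, accuracy

print('best max_depth =', best_depth, ':', best_accuracy)
```

Because the comparison is strict, ties go to the shallower tree.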

2. Random Forest Classifier

In [67]:
#a loop for number of estimators from 1 to 10:
for i in range(1, 11):
    model_rf = RandomForestClassifier(random_state=12345, n_estimators=i)
    model_rf.fit(features_train, target_train)
    rf_valid_predictions = model_rf.predict(features_valid)
    accuracy_rf = accuracy_score(target_valid, rf_valid_predictions)
    print(i, accuracy_rf)

1 0.736318407960199
2 0.7736318407960199
3 0.7649253731343284
4 0.7860696517412935
5 0.7786069651741293
6 0.7860696517412935
7 0.7786069651741293
8 0.7835820895522388
9 0.7810945273631841
10 0.7898009950248757


#### The highest accuracy score is at hyperparameter n_estimators=10.
#### The Random Forest Classifier reaches 79.0% validation accuracy with 10 trees in the forest.
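
The scores for different forest sizes above differ by only a few validation rows, so a single split may not separate them reliably. One way to get a steadier comparison is k-fold cross-validation, which is a different technique from the single split used in this notebook. This is a sketch on synthetic data, assuming 5 folds and three illustrative forest sizes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 4-feature binary problem standing in for the real data.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)

mean_scores = {}
for n in (1, 5, 10):
    model = RandomForestClassifier(random_state=12345, n_estimators=n)
    # Five accuracy values, one per fold; each fold serves once as validation data.
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    mean_scores[n] = scores.mean()
    print(n, round(scores.mean(), 3), '+/-', round(scores.std(), 3))
```

The standard deviation across folds gives a rough sense of how much of the difference between two settings could be noise.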

3. Logistic Regression

In [68]:
model_lr = LogisticRegression(random_state=12345)
model_lr.fit(features_train, target_train)
lr_valid_predictions = model_lr.predict(features_valid)

In [69]:
accuracy_lr = accuracy_score(target_valid, lr_valid_predictions)
print(accuracy_lr)

0.7039800995024875


#### Logistic Regression has a lower accuracy score of 70.4%, using its default hyperparameters.
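
Logistic regression does have hyperparameters (for example `C` and `max_iter`), and it can be sensitive to feature scale; in this dataset mb_used is in the tens of thousands while messages is in the tens. This is a hedged sketch of standardizing the features first, on synthetic data with one column deliberately inflated. Whether scaling helps on the real data is not something this notebook has measured.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-feature binary problem standing in for the real data.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)
X[:, 3] *= 10000  # mimic a large-range column such as mb_used (assumption)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# The scaler is fitted on the training data only, then applied to the validation data.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=12345, C=1.0, max_iter=1000),
)
model.fit(X_train, y_train)
scaled_accuracy = accuracy_score(y_valid, model.predict(X_valid))
print(scaled_accuracy)
```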

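It is my observation that both the Decision Tree and Random Forest classifiers reach a validation accuracy of about 79% (0.789 and 0.790). The gap between them is a single row out of 804: 634 versus 635 correct predictions.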

*I will, therefore, select the **Random Forest Classifier** as the best-fitting model, because:*
   - it has an accuracy score above 0.75 with 10 estimators,
   - averaging several trees tends to improve results
   - and makes overfitting less likely than with a single deep tree.

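I will test the selected model on the whole dataset and look at its accuracy score. Because 2410 of the 3214 rows were used to train the model, this score will be optimistic compared with the validation score.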

In [79]:
model_df = RandomForestClassifier(random_state=12345, n_estimators=10)
model_df.fit(features_train, target_train)
df_predictions = model_df.predict(features)

In [80]:
df_accuracy = accuracy_score(target, df_predictions)
print(df_accuracy)

0.9290603609209708
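
Two reference points help when reading this number. First, the describe() output shows a mean is_ultra of 0.306, so always predicting 0 would already score about 0.694 on the whole dataset. Second, a score on data that includes the training rows is usually higher than a score on held-out rows. This is a sketch of both comparisons on synthetic data; the 70/30 class weights and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced problem (roughly 70% class 0) standing in for the real data.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, weights=[0.7, 0.3], random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
forest = RandomForestClassifier(random_state=12345, n_estimators=10).fit(X_train, y_train)

baseline_accuracy = accuracy_score(y_valid, baseline.predict(X_valid))
forest_valid_accuracy = accuracy_score(y_valid, forest.predict(X_valid))
# Scored on all rows, 75% of which the forest has already seen during fit.
forest_all_accuracy = accuracy_score(y, forest.predict(X))

print('baseline (held-out):', baseline_accuracy)
print('forest (held-out):  ', forest_valid_accuracy)
print('forest (all rows):  ', forest_all_accuracy)
```

The held-out score is the one to compare against the baseline; the all-rows score mostly reflects how well the forest memorized its training data.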


### Conclusion:
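**The Random Forest Classifier has the highest validation accuracy of the three models tested (0.790 with 10 trees).**

**The 0.929 score on the whole dataset is inflated because it includes the training rows. On new subscribers, the accuracy of recommending the right plan (Smart or Ultra) for Megaline is likely to be closer to the validation score.**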