# Megaline plan recommender

## Contents

1. [Introduction](#introduction)
2. [Data Loading and Inspection](#data-loading-and-inspection)
3. [Model Training](#model-training)
    1. [Splitting the data into sets](#splitting-the-data-into-sets)
    2. [Decision Tree model](#decision-tree-model)
    3. [Random Forest model](#random-forest-model)
    4. [Logistic Regression model](#logistic-regression-model)
    5. [Quality check using the test set](#quality-check-using-the-test-set)
    6. [Sanity check](#sanity-check)
4. [Conclusion](#conclusion)

## Introduction

This is the project for the "Intro into Machine Learning" sprint of Tripleten's DA course.

We will be analyzing user data for the mobile carrier Megaline in order to train a model that can recommend one of Megaline's new plans, Smart or Ultra, to each customer.

The requested minimum accuracy for this model is **0.75**.

For this project we'll be using the following:
- Python 3.9.5
- Pandas 1.2.4
- Sklearn 0.24.1

These versions were chosen to match the versions available on the Tripleten servers as closely as possible.

In [2]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.dummy import DummyClassifier

[Back to Contents](#contents)

## Data Loading and Inspection

Our data is contained in a single table. According to our instructions, the data is already preprocessed. Let's load it and do a quick check to make sure it's ready for use.

In [3]:
try:
    df = pd.read_csv("dataset/users_behavior.csv")      # Local path
except FileNotFoundError:
    df = pd.read_csv("/datasets/users_behavior.csv")    # Tripleten server path

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [5]:
df.describe()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


In [6]:
df['is_ultra'].value_counts()

0    2229
1     985
Name: is_ultra, dtype: int64

There are no missing values, no negatives, and no absurdly large values. `is_ultra` only contains `0` and `1`.
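The same checks can also be written as code, so they can be rerun if the file changes. This is only a minimal sketch: `check_preprocessed` is a hypothetical helper, the sample rows are made up for illustration, and the "absurdly large values" check is left out because it depends on a threshold the project does not define.

```python
import pandas as pd

# Made-up rows that reuse the column names of users_behavior.csv
sample = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0],
    "minutes": [311.9, 516.75, 467.66],
    "messages": [83.0, 56.0, 86.0],
    "mb_used": [19915.42, 22696.96, 21060.45],
    "is_ultra": [0, 0, 1],
})


def check_preprocessed(frame: pd.DataFrame) -> bool:
    """Return True if the frame has no NaNs, no negatives and a 0/1 target."""
    feature_cols = ["calls", "minutes", "messages", "mb_used"]
    no_missing = not frame.isna().any().any()
    no_negatives = bool((frame[feature_cols] >= 0).all().all())
    binary_target = set(frame["is_ultra"].unique()) <= {0, 1}
    return no_missing and no_negatives and binary_target


print(check_preprocessed(sample))
```

On the real data this would be called as `check_preprocessed(df)`.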

We can begin working with the models.

[Back to Contents](#contents)

## Model Training

### Splitting the data into sets

We need to divide our dataset into three sets:

- Training set: this will be used to train the model
- Validation set: we'll use this set to check the quality of different models, and try to improve them as we adjust hyperparameters.
- Test set: this will be the final test for the model, using data that it has never seen before.

We'll distribute the data as follows: 
- 60% for the training set
- 20% for the validation set
- 20% for the test set

In [7]:
# First take 20% of the data and save it as the test. df_temp has the other 80%
df_temp, df_test = train_test_split(df, test_size=0.2, random_state=12345)
# To make the validation set the same size as the test set, we'll take 25% from the temp,
# since it only has 80% of the original data. 0.8 * 0.25 = 0.2
df_train, df_valid = train_test_split(df_temp, test_size=0.25, random_state=12345)



Let's double check our math.

In [8]:
print(f'Training set size: {len(df_train)}')
print(f'Validating set size: {len(df_valid)}')
print(f'Test set size: {len(df_test)}')

Training set size: 1928
Validating set size: 643
Test set size: 643
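The two-step split can be wrapped in a small helper that derives the second `test_size` from the requested fractions instead of hard-coding 0.25. The sketch below is hypothetical: `three_way_split` is not part of the project, the data is synthetic, and it adds `stratify`, which the split above does not use, to keep the class ratio similar across the three sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(frame, target_col, valid_frac=0.2, test_frac=0.2, seed=12345):
    """Split a frame into train/valid/test sets, stratified on target_col."""
    temp, test = train_test_split(
        frame, test_size=test_frac, random_state=seed,
        stratify=frame[target_col])
    # The validation share is relative to what is left: 0.2 / 0.8 = 0.25
    relative_valid = valid_frac / (1 - test_frac)
    train, valid = train_test_split(
        temp, test_size=relative_valid, random_state=seed,
        stratify=temp[target_col])
    return train, valid, test


# Synthetic stand-in data with roughly 30% positives
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "x": rng.normal(size=1000),
    "is_ultra": (rng.random(1000) < 0.3).astype(int),
})

train, valid, test = three_way_split(toy, "is_ultra")
print(len(train), len(valid), len(test))
```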


In [9]:
features_train = df_train.drop(columns='is_ultra')
target_train = df_train['is_ultra']

features_valid = df_valid.drop(columns='is_ultra')
target_valid = df_valid['is_ultra']

features_test = df_test.drop(columns='is_ultra')
target_test = df_test['is_ultra']

The sets are ready, we can begin training models.

[Back to Contents](#contents)

### Decision Tree model

A decision tree trains and predicts quickly, but a single tree is often less accurate than an ensemble. Let's see how it fares in our case.

In [10]:
# Decision tree training
best_tree = None
best_accuracy = 0
best_depth = 0
best_max_features = 0
best_leaves = 0

max_depth_to_test = 20
max_features_to_test = 4
max_leaf_samples_to_test = 10

for depth in range(1, max_depth_to_test + 1):
    for features in range(1, max_features_to_test + 1):
        for leaves in range(1, max_leaf_samples_to_test + 1):
            model = DecisionTreeClassifier(
                max_depth=depth,
                min_samples_leaf=leaves,
                max_features=features,
                random_state=12345)

            model.fit(features_train, target_train)
            accuracy = model.score(features_valid, target_valid)

            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_tree = model
                best_depth = depth
                best_max_features = features
                best_leaves = leaves

In [11]:
# Training results
print(f'The best Tree has an accuracy of {best_accuracy:0.3f}\n')
print('The hyperparameters are:')
print(f'depth = {best_depth}')
print(f'max_features = {best_max_features}')
print(f'min_samples_leaf = {best_leaves}')

The best Tree has an accuracy of 0.801

The hyperparameters are:
depth = 10
max_features = 1
min_samples_leaf = 9


We managed to get a Decision Tree with an accuracy of 0.801 on the validation set, using `max_depth=10`, `max_features=1` and `min_samples_leaf=9`.

It's quite promising. But we still have more models to try.
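As an aside, the three nested loops can be flattened with scikit-learn's `ParameterGrid`, which yields every combination as a dict. This is a sketch on synthetic data from `make_classification` with a smaller grid than the one above, so its numbers say nothing about the Megaline data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 4 features, like the Megaline table
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Every combination of these values is visited exactly once
grid = ParameterGrid({
    "max_depth": [1, 2, 3, 4, 5],
    "max_features": [1, 2, 3, 4],
    "min_samples_leaf": [1, 2, 3],
})

best_score, best_params = 0.0, None
for params in grid:
    model = DecisionTreeClassifier(random_state=12345, **params)
    model.fit(X_train, y_train)
    score = model.score(X_valid, y_valid)
    if score > best_score:
        best_score, best_params = score, params

print(best_params, round(best_score, 3))
```

Keeping the hyperparameters in one dict also removes the need for a separate `best_...` variable per parameter.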

[Back to Contents](#contents)

### Random Forest model

Instead of a single tree, we can use several and have them vote. This should improve accuracy, at the expense of speed.
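In scikit-learn the "vote" is soft: `RandomForestClassifier` averages the class probabilities predicted by its trees rather than counting hard votes. The sketch below checks this on synthetic data; the dataset and the small forest are assumptions made only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 4 features
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)

forest = RandomForestClassifier(n_estimators=15, max_depth=4, random_state=12345)
forest.fit(X, y)

# Average the class probabilities predicted by each individual tree
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

# The forest's own probabilities should match that average
print(np.allclose(averaged, forest.predict_proba(X)))
```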

In [11]:
# Random Forest training
best_forest = None
best_accuracy = 0
best_estimators = 0
best_depth = 0
best_leaves = 0
best_features = 0

max_estimators_to_test = 100
max_depth_to_test = 15
max_leaf_samples_to_test = 10
max_features_to_test = 4



for est in range(9, max_estimators_to_test + 1, 10):
    for depth in range(1, max_depth_to_test + 1):
        for leaves in range(1, max_leaf_samples_to_test + 1):
            for feat in range(1, max_features_to_test + 1):
                model = RandomForestClassifier(
                    n_estimators=est,
                    max_depth=depth,
                    min_samples_leaf=leaves,
                    max_features=feat,
                    random_state=12345
                )

                model.fit(features_train, target_train)
                accuracy = model.score(features_valid, target_valid)

                if accuracy > best_accuracy:
                    best_accuracy = accuracy
                    best_forest = model
                    best_estimators = est
                    best_depth = depth
                    best_leaves = leaves
                    best_features = feat

In [17]:
# Random Forest training
best_forest = None
best_accuracy = 0
best_estimators = 0
best_depth = 0
best_leaves = 0
best_features = 0

max_estimators_to_test = 250
max_depth_to_test = 20
max_leaf_samples_to_test = 10
max_features_to_test = 4

In [18]:
# To improve training speed, we'll first get a good range of depth and estimators
# First find a good range for max_depth
for depth in range(1, max_depth_to_test + 1):
    model = RandomForestClassifier(
        max_depth=depth,
        random_state=12345
    )
    model.fit(features_train, target_train)
    accuracy = model.score(features_valid, target_valid)

    print(f'depth = {depth}, acc = {accuracy}')


depth = 1, acc = 0.7620528771384136
depth = 2, acc = 0.7620528771384136
depth = 3, acc = 0.7682737169517885
depth = 4, acc = 0.7729393468118196
depth = 5, acc = 0.7791601866251944
depth = 6, acc = 0.7791601866251944
depth = 7, acc = 0.7853810264385692
depth = 8, acc = 0.7822706065318819
depth = 9, acc = 0.7900466562986003
depth = 10, acc = 0.7962674961119751
depth = 11, acc = 0.7978227060653188
depth = 12, acc = 0.7978227060653188
depth = 13, acc = 0.7993779160186625
depth = 14, acc = 0.7962674961119751
depth = 15, acc = 0.7993779160186625
depth = 16, acc = 0.8009331259720062
depth = 17, acc = 0.7931570762052877
depth = 18, acc = 0.7993779160186625
depth = 19, acc = 0.7884914463452566
depth = 20, acc = 0.7947122861586314


The best depth is 16, but we'll explore a range around it, from `13` to `18`.

In [19]:
# Finding a good range for n_estimators
for est in range(9, max_estimators_to_test + 1, 10):
    model = RandomForestClassifier(
        max_depth=16,
        n_estimators=est,
        random_state=12345
    )
    model.fit(features_train, target_train)
    accuracy = model.score(features_valid, target_valid)

    print(f'n_estimators = {est} acc = {accuracy}')

n_estimators = 9 acc = 0.7869362363919129
n_estimators = 19 acc = 0.7869362363919129
n_estimators = 29 acc = 0.7916018662519441
n_estimators = 39 acc = 0.7931570762052877
n_estimators = 49 acc = 0.7931570762052877
n_estimators = 59 acc = 0.7947122861586314
n_estimators = 69 acc = 0.7916018662519441
n_estimators = 79 acc = 0.7947122861586314
n_estimators = 89 acc = 0.7978227060653188
n_estimators = 99 acc = 0.8009331259720062
n_estimators = 109 acc = 0.7978227060653188
n_estimators = 119 acc = 0.7978227060653188
n_estimators = 129 acc = 0.7993779160186625
n_estimators = 139 acc = 0.7993779160186625
n_estimators = 149 acc = 0.7993779160186625
n_estimators = 159 acc = 0.7993779160186625
n_estimators = 169 acc = 0.8009331259720062
n_estimators = 179 acc = 0.7993779160186625
n_estimators = 189 acc = 0.8009331259720062
n_estimators = 199 acc = 0.8009331259720062
n_estimators = 209 acc = 0.8009331259720062
n_estimators = 219 acc = 0.7993779160186625
n_estimators = 229 acc = 0.7993779160186625

We get the best result with 99 estimators, and then with 169, 189, 199 and 209. However, the difference is pretty small. We'll use `99`.

In [20]:
# Explore the rest of the hyperparameters
for depth in range(13, 19):
    for leaves in range(1, max_leaf_samples_to_test + 1):
        for feat in range(1, max_features_to_test + 1):
            model = RandomForestClassifier(
                n_estimators=99,
                max_depth=depth,
                min_samples_leaf=leaves,
                max_features=feat,
                random_state=12345
            )

            model.fit(features_train, target_train)
            accuracy = model.score(features_valid, target_valid)

            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_forest = model
                # Read the value from the model itself: `est` is a leftover loop
                # variable from the previous cell, while this cell always uses 99 trees
                best_estimators = model.n_estimators
                best_depth = depth
                best_leaves = leaves
                best_features = feat

In [22]:
# Training results
print(f'The best Forest has an accuracy of {best_accuracy:0.3f}\n')
print('The hyperparameters are:')
print(f'estimators = {best_estimators}')
print(f'depth = {best_depth}')
print(f'min_samples_leaf = {best_leaves}')
print(f'max_features = {best_features}')

The best Forest has an accuracy of 0.802

The hyperparameters are:
estimators = 99
depth = 15
min_samples_leaf = 1
max_features = 1


The random forest took much longer to train and is barely better than the single decision tree in this case: 0.802 versus 0.801 on the validation set.

Let's try a Logistic Regression model next.

[Back to Contents](#contents)

### Logistic Regression model

Now we are moving away from trees. This model should have a decent accuracy, and a short training time. Let's see how it performs in our case.

In [23]:
# Logistic Regression model training

# Many of these combinations will fail, and they will produce a lot of warnings. 
# We don't care about them right now, so we'll silence them for this cell
import warnings
from sklearn.exceptions import ConvergenceWarning


with warnings.catch_warnings():
    warnings.filterwarnings('ignore', message='Line Search failed')
    warnings.filterwarnings('ignore', message='The line search algorithm did not converge')
    warnings.filterwarnings('ignore', message='Rounding errors prevent the line search from converging')
    warnings.filterwarnings('ignore', category=ConvergenceWarning)


    best_log_regression = None
    best_accuracy = 0
    best_solver = None
    best_penalty = None
    best_fit_intercept = None
    best_class_weight = None
    best_max_iter = None
    best_c_value = 0

    solver_to_test = ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky', 'sag', 'saga']
    penalty_to_test = ['l1', 'l2', 'elasticnet', None]
    fit_intercept_to_test = [True, False]
    class_weight_to_test = ['balanced', None]
    max_iter_to_test = 300
    c_values_to_test = [100, 10, 1.0, 0.1, 0.01]

    for solver in solver_to_test:
        for penalty in penalty_to_test:
            for fit_intercept in fit_intercept_to_test:
                for class_weight in class_weight_to_test:
                    # `max_iter` is used as the loop name to avoid shadowing the built-in iter()
                    for max_iter in range(100, max_iter_to_test + 1, 20):
                        for c in c_values_to_test:
                            # Many of these value combinations aren't supported.
                            # Those that would throw errors will be skipped.
                            try:
                                model = LogisticRegression(
                                    solver=solver,
                                    penalty=penalty,
                                    fit_intercept=fit_intercept,
                                    class_weight=class_weight,
                                    max_iter=max_iter,
                                    C=c,
                                    random_state=12345
                                )

                                model.fit(features_train, target_train)
                                accuracy = model.score(features_valid, target_valid)

                                if accuracy > best_accuracy:
                                    best_accuracy = accuracy
                                    best_log_regression = model
                                    best_solver = solver
                                    best_penalty = penalty
                                    best_fit_intercept = fit_intercept
                                    best_class_weight = class_weight
                                    best_max_iter = max_iter
                                    best_c_value = c
                            except ValueError:
                                pass # These errors are expected, we'll just ignore them.

In [24]:
# Training results
print(f'The best Logistic regression model has an accuracy of {best_accuracy:0.3f}\n')
print('The hyperparameters are:')
print(f'solver = {best_solver}')
print(f'penalty = {best_penalty}')
print(f'fit_intercept = {best_fit_intercept}')
print(f'class_weight = {best_class_weight}')
print(f'max_iter = {best_max_iter}')
print(f'c value {best_c_value}')

The best Logistic regression model has an accuracy of 0.729

The hyperparameters are:
solver = liblinear
penalty = l1
fit_intercept = True
class_weight = None
max_iter = 100
c value 0.1


The best accuracy we could get from this model is 0.73 on the validation set, which is below the required 0.75.
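One thing the search above does not try is feature scaling. The scikit-learn documentation notes that the `sag` and `saga` solvers only converge quickly when features have similar scales, and here `mb_used` is orders of magnitude larger than `calls`. The sketch below shows how a `StandardScaler` could be chained in front of the model. It runs on synthetic data, so it makes no claim about whether scaling would lift the accuracy on the Megaline data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with 4 features
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
# Exaggerate one feature's scale, loosely mimicking mb_used vs. calls
X[:, 0] *= 10000

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# The scaler is fitted on the training data only, then reused for validation
scaled_logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=12345),
)
scaled_logreg.fit(X_train, y_train)

valid_accuracy = scaled_logreg.score(X_valid, y_valid)
print(f'Validation accuracy: {valid_accuracy:0.3f}')
```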

[Back to Contents](#contents)

### Quality check using the test set

The best model we got, by a tiny margin, is the Random Forest.

Let's see how it performs with the test set.

In [25]:
test_accuracy = best_forest.score(features_test, target_test)
print(f'Accuracy with the test set: {test_accuracy:0.3f}')

Accuracy with the test set: 0.798


We are above our required accuracy! 

Let's check the precision and recall too, to get a better idea of what our model is doing.

In [26]:
test_predictions = best_forest.predict(features_test)
print(f'Precision: {precision_score(target_test, test_predictions):0.3f}')
print(f'Recall: {recall_score(target_test, test_predictions):0.3f}')

Precision: 0.732
Recall: 0.531


When the model predicts an Ultra user, it is correct about 73% of the time. But it only identifies 53% of the actual Ultra users.
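To make the two metrics concrete, here is a small worked example that derives them from the confusion matrix. The labels are made up for illustration (1 = Ultra, 0 = Smart) and are not taken from the test set.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 1 = Ultra, 0 = Smart
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Precision: of everyone predicted Ultra, how many really are Ultra
precision = tp / (tp + fp)
# Recall: of everyone who really is Ultra, how many were found
recall = tp / (tp + fn)

print(f'Precision: {precision:0.3f}')  # 2 / (2 + 1) -> 0.667
print(f'Recall: {recall:0.3f}')        # 2 / (2 + 2) -> 0.500
```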

[Back to Contents](#contents)

### Sanity check

How would we fare if we just assigned one of the plans to everyone, or if we did it randomly?

In [27]:
# Ratio of Ultra users
print(df_test['is_ultra'].sum() / df_test['is_ultra'].count())


0.3048211508553655


About 30% of the clients are using the Ultra plan. 

If we assign the Smart plan to everyone, we would have an accuracy of 0.7, as that is the proportion of Smart users.

If we assign a plan randomly, we would have an accuracy of 0.5, as each case has a 50/50 chance of being correct.
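The arithmetic behind these baselines can be written out directly. The sketch below also adds a third case that the text does not mention, as an assumption: a random guess that follows the class proportions. Its expected accuracy is p² + (1 − p)², where p is the share of Ultra users.

```python
# Share of Ultra users in the test set, copied from the output above
p_ultra = 0.3048211508553655

# Always predict Smart: correct for every Smart user
always_smart = 1 - p_ultra

# Coin flip: right half of the time, whatever the true class is
coin_flip = 0.5 * p_ultra + 0.5 * (1 - p_ultra)

# Random guess that predicts Ultra with probability p_ultra
proportional = p_ultra ** 2 + (1 - p_ultra) ** 2

print(f'Always Smart: {always_smart:0.3f}')   # 0.695
print(f'Coin flip:    {coin_flip:0.3f}')      # 0.500
print(f'Proportional: {proportional:0.3f}')   # 0.576
```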

Let's try another approach now. Let's use a DummyClassifier. Our Random Forest should have a greater accuracy than the Dummy.

In [36]:


strategies_to_test = ['most_frequent', 'prior', 'stratified', 'uniform', 'constant']
for strat in strategies_to_test:
    dummy = DummyClassifier(
        strategy=strat,
        constant=0,
        random_state=12345
    )
    dummy.fit(features_test, target_test)
    accuracy = dummy.score(features_test, target_test)
    print(f'Strat: {strat}\naccuracy: {accuracy:0.3f}\n')

Strat: most_frequent
accuracy: 0.695

Strat: prior
accuracy: 0.695

Strat: stratified
accuracy: 0.575

Strat: uniform
accuracy: 0.490

Strat: constant
accuracy: 0.695



Since our model's accuracy is 0.798, it is better than chance, and better than every strategy the dummy can use.
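The dummy above is fitted and scored on the same test set. A more conventional setup fits the baseline on the training data and scores it on the test data, exactly as a real model would be. The sketch below does that on synthetic data, which is an assumption made for illustration; on the Megaline data the calls would use `features_train`, `target_train`, `features_test` and `target_test`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 30% positives
rng = np.random.default_rng(12345)
X = rng.normal(size=(500, 4))
y = (rng.random(500) < 0.3).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345)

# The baseline learns the majority class from the training labels only
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)

baseline_accuracy = baseline.score(X_test, y_test)
print(f'Baseline accuracy on the test set: {baseline_accuracy:0.3f}')
```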

[Back to Contents](#contents)

## Conclusion

Our task was to develop a model with the highest accuracy possible. The minimum required was 0.75.

The Random Forest reached an accuracy of 0.798 on the test set, which is above that threshold.

The Decision Tree was very close to the Random Forest when tested against the validation set, so if run time is an issue, the tree could be used instead of the Forest, with very similar results but faster execution.