# Megaline plan recommender

## Contents

1. [Introduction](#introduction)
2. [Data Loading and Inspection](#data-loading-and-inspection)
3. [Model training](#model-training)
    1. [Splitting the data into sets](#splitting-the-data-into-sets)
    2. [Decision Tree model](#decision-tree-model)
    3. [Random Forest model](#random-forest-model)
    4. [Logistic Regression model]
    5. [Quality check using the test set]
    6. [Sanity check]
4. [Conclusion]

## Introduction

This is the project for the "Intro into Machine Learning" sprint of Tripleten's DA course.

We will be analyzing user data for the mobile carrier Megaline in order to train a model that can recommend one of Megaline's new plans, Smart or Ultra, to each customer.

The requested minimum accuracy for this model is **0.75**.

For this project we'll be using the following:
- Python 3.9.5
- Pandas 1.2.4
- Sklearn 0.24.1

The versions were chosen to match those available on the Tripleten servers as closely as possible.

In [3]:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

[Back to Contents](#contents)

## Data Loading and Inspection

Our data is contained in a single table. According to our instructions, the data is already preprocessed. Let's load it and do a quick check to make sure it's ready for use.

In [4]:
try:
    df = pd.read_csv("dataset/users_behavior.csv")      # Local path
except FileNotFoundError:
    df = pd.read_csv("/datasets/users_behavior.csv")    # Tripleten server path

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [6]:
df.describe()

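|       | calls     | minutes    | messages  | mb_used      | is_ultra |
|-------|-----------|------------|-----------|--------------|----------|
| count | 3214.0    | 3214.0     | 3214.0    | 3214.0       | 3214.0   |
| mean  | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std   | 33.236368 | 234.569872 | 36.148326 | 7570.968246  | 0.4611   |
| min   | 0.0       | 0.0        | 0.0       | 0.0          | 0.0      |
| 25%   | 40.0      | 274.575    | 9.0       | 12491.9025   | 0.0      |
| 50%   | 62.0      | 430.6      | 30.0      | 16943.235    | 0.0      |
| 75%   | 82.0      | 571.9275   | 57.0      | 21424.7      | 1.0      |
| max   | 244.0     | 1632.06    | 224.0     | 49745.73     | 1.0      |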


In [7]:
df['is_ultra'].value_counts()

0    2229
1     985
Name: is_ultra, dtype: int64

There are no missing values, no negatives, and no absurdly large values. `is_ultra` only contains `0` and `1`.
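
As a reference point for the 0.75 accuracy target, the class counts above already tell us what a model that always predicts the same plan would score. The snippet below is a minimal standalone sketch that uses only the counts printed by `value_counts()`; the variable names are illustrative and are not part of the project code, and reading `0` as "Smart" is an assumption based on the column name.

```python
# Class counts copied from the value_counts() output above.
# Assumption: 0 means "not Ultra", i.e. the Smart plan.
counts = {0: 2229, 1: 985}

total = sum(counts.values())

# A constant model always predicts the most frequent class.
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total
ultra_share = counts[1] / total

print(f"Ultra share: {ultra_share:.4f}")
print(f"Always predicting {majority_class} gives accuracy {baseline_accuracy:.4f}")
```

The constant baseline lands at roughly 0.69, so the 0.75 target is only a few points above what we would get without any model at all.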

We can begin working with the models.

[Back to Contents](#contents)

## Model Training

### Splitting the data into sets

We need to divide our dataset into three sets:

- Training set: this will be used to train the model.
- Validation set: we'll use this set to compare the quality of different models and to tune their hyperparameters.
- Test set: this will be the final check for the chosen model, on data that it has never seen before.

We'll distribute the data as follows: 
- 60% for the training set
- 20% for the validation set
- 20% for the test set

In [12]:
# First take 20% of the data and save it as the test set. df_temp holds the other 80%
df_temp, df_test = train_test_split(df, test_size=0.2, random_state=12345)
# To make the validation set the same size as the test set, we take 25% of df_temp,
# since it only has 80% of the original data: 0.8 * 0.25 = 0.2
df_train, df_valid = train_test_split(df_temp, test_size=0.25, random_state=12345)

In [13]:
features_train = df_train.drop(columns='is_ultra')
target_train = df_train['is_ultra']

features_valid = df_valid.drop(columns='is_ultra')
target_valid = df_valid['is_ultra']

features_test = df_test.drop(columns='is_ultra')
target_test = df_test['is_ultra']

The sets are ready, so we can begin training models.
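
To double-check the 60/20/20 arithmetic, here is a small standalone sketch of the expected row counts. It assumes that a fractional `test_size` is rounded up to a whole number of rows and that the remaining rows form the other part; that is our understanding of `train_test_split`, but it is worth confirming against `len(df_train)`, `len(df_valid)` and `len(df_test)`.

```python
import math

n_total = 3214  # rows reported by df.info()

# Assumption: the fractional test size is rounded up to whole rows,
# and everything else goes to the other part of the split.
n_test = math.ceil(0.2 * n_total)     # first split: 20% for the test set
n_temp = n_total - n_test             # remaining ~80%
n_valid = math.ceil(0.25 * n_temp)    # second split: 25% of the remainder
n_train = n_temp - n_valid

for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / n_total:.1%})")
```

Under that assumption the validation and test sets have the same number of rows, and the training set holds the remaining 60% or so.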

[Back to Contents](#contents)

### Decision Tree model

A decision tree trains quickly, but a single tree is usually less accurate than an ensemble such as a random forest. Let's see how it fares in our case.

In [54]:
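# Track the best model found so far and the hyperparameters that produced it
best_tree = None
best_accuracy = 0
best_depth = 0
best_max_features = 0
best_leaves = 0

# Upper bounds (inclusive) of the hyperparameter grid
max_depth_to_test = 20
max_features_to_test = 4          # the dataset has 4 feature columns
max_leaf_samples_to_test = 10     # largest min_samples_leaf value to try

for depth in range(1, max_depth_to_test + 1):
    for features in range(1, max_features_to_test + 1):
        for leaves in range(1, max_leaf_samples_to_test + 1):
            model = DecisionTreeClassifier(
                max_depth=depth,
                min_samples_leaf=leaves,
                max_features=features,
                random_state=12345)

            # Train on the training set, score on the validation set
            model.fit(features_train, target_train)
            accuracy = model.score(features_valid, target_valid)

            # Strict '>' keeps the first combination that reaches the best score
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_tree = model
                best_depth = depth
                best_max_features = features
                best_leaves = leaves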


print(f'\nThe best results are found at depth of {best_depth}, max features {best_max_features}, min leaves {best_leaves}, and the accuracy is {best_accuracy}')


The best results are found at depth of 10, max features 1, min leaves 9, and the accuracy is 0.8009331259720062
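
The triple loop above visits 20 × 4 × 10 = 800 hyperparameter combinations. The same search pattern can be sketched more compactly with `itertools.product`. In the sketch below, `grid_search` and `toy_score` are hypothetical names: `toy_score` is only a stand-in for "fit a tree and score it on the validation set", and its peak is arbitrary, so the snippet shows the structure of the search and not the project's results.

```python
from itertools import product


def grid_search(score_fn, depths, feature_counts, leaf_sizes):
    """Return (best_score, best_params) over every combination.

    Ties keep the first combination seen, like the strict '>' in the loop.
    """
    best_score, best_params = float("-inf"), None
    for params in product(depths, feature_counts, leaf_sizes):
        score = score_fn(*params)
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params


# Hypothetical stand-in for training and validating a model.
# It peaks at an arbitrary point (3, 2, 5) purely for illustration.
def toy_score(depth, features, leaves):
    return -abs(depth - 3) - abs(features - 2) - abs(leaves - 5)


grid = (range(1, 21), range(1, 5), range(1, 11))
n_combinations = len(list(product(*grid)))
best_score, best_params = grid_search(toy_score, *grid)

print(f"{n_combinations} combinations tried, best params: {best_params}")
```

Flattening the loops this way keeps the bookkeeping in one place, which makes it easier to reuse the same search for other models.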


We obtained a decision tree with a validation accuracy of about 0.80, which is above the required 0.75. The hyperparameters used are:
- max_depth = 10
- max_features = 1
- min_samples_leaf = 9

This is promising, but we still need to see how the model performs on the test set.
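
One reason to stay cautious is that the validation score is measured on a fairly small sample, and it is the best of many combinations, so it is likely to be somewhat optimistic. As a rough illustration, the sketch below estimates the sampling uncertainty with a simple binomial approximation. It assumes the validation set holds 643 rows (20% of 3214, rounded up); that number is an assumption to verify with `len(df_valid)`.

```python
import math

n_valid = 643                   # assumed validation set size (see note above)
accuracy = 0.8009331259720062   # validation accuracy printed by the search

# Number of correctly classified validation rows implied by the accuracy.
n_correct = round(accuracy * n_valid)

# Binomial standard error of a proportion: sqrt(p * (1 - p) / n).
std_error = math.sqrt(accuracy * (1 - accuracy) / n_valid)

print(f"correct predictions: {n_correct} of {n_valid}")
print(f"accuracy: {accuracy:.3f} +/- {std_error:.3f} (one standard error)")
```

The standard error comes out at roughly 0.016, so a difference of a point or two between models on the validation set should not be over-interpreted.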

[Back to Contents](#contents)

### Random Forest model