**Predictive Model for Recommending Smart and Ultra Plans for Megaline**

**CONTENT** <a id='back'></a>

* [1. Introduction](#intro)
* [2. Initialisation](#ini)
* [3. Upload Data](#uploaddata)
  * [3.1 Preparing Data](#uploaddata1)
* [4. Machine Learning Process](#ML)
  * [4.1 Training, validation, and test set](#ML1)
  * [4.2 Analysis of the model with higher quality (accuracy)](#ML2)
* [5. Conclusion](#end)

# Introduction <a id='intro'></a>

This project develops a model that analyses the behaviour of customers on legacy plans in order to recommend one of Megaline's new plans: Smart or Ultra.

The objective is to find the model with the highest accuracy, i.e. the largest share of correct plan predictions, using an accuracy of more than 0.75 as the reference threshold.
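As a reminder of what the metric measures, accuracy is the number of correct predictions divided by the total number of predictions. The snippet below is a minimal sketch with made-up labels (1 = Ultra, 0 = Smart); the helper name `accuracy` is hypothetical and is meant to mirror what `sklearn.metrics.accuracy_score` computes with its default arguments.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels for illustration only (1 = Ultra, 0 = Smart)
y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 0, 1, 0, 0, 1, 1, 0]

toy_acc = accuracy(y_true, y_pred)  # 6 of the 8 predictions are correct
print(toy_acc)
```

Accuracy weighs both kinds of error equally (an Ultra user predicted as Smart and vice versa), so it says nothing about which class the mistakes fall on.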

# Initialisation <a id='ini'></a>

In [1]:
# Importing all the libraries
from scipy import stats as st
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

import pandas as pd

# Upload Data <a id='uploaddata'></a>

In [4]:
try:
    user_base = pd.read_csv('users_behavior.csv')
except FileNotFoundError:
    # Fall back to an alternative path if the file is not in the working directory
    print("The file was not found. Trying another path...")
    user_base = pd.read_csv('/otra_ruta/users_behavior.csv')

## Preparing Data <a id='uploaddata1'></a>

In [5]:
user_base

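|  | calls | minutes | messages | mb_used | is_ultra |
| --- | --- | --- | --- | --- | --- |
| 0 | 40.0 | 311.90 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| ... | ... | ... | ... | ... | ... |
| 3209 | 122.0 | 910.98 | 20.0 | 35124.90 | 1 |
| 3210 | 25.0 | 190.36 | 0.0 | 3275.61 | 0 |
| 3211 | 97.0 | 634.44 | 70.0 | 13974.06 | 0 |
| 3212 | 64.0 | 462.32 | 90.0 | 31239.78 | 0 |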


In [6]:
user_base.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [7]:
user_base.describe()

|  | calls | minutes | messages | mb_used | is_ultra |
| --- | --- | --- | --- | --- | --- |
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


In [8]:
#Checking duplicate data
duplicados = user_base.duplicated()
cantidad_duplicados = duplicados.sum()
cantidad_duplicados  # Number of duplicate rows in 'user_base'

0

From the information above, we can observe that:

- There are a total of 3214 rows.
- There are no variables with missing values.
- The data are monthly, and the minimum and maximum of each variable show no anomalous values.
- The data types are appropriate for each variable (float for the usage features, int for the target).
- There is no duplication of data.

Therefore, no modifications are required.
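The checks listed above can be bundled into one small report. This is a minimal sketch: `check_quality` is a hypothetical helper, not part of the notebook, and the toy frame holds only the first five rows displayed earlier rather than the full dataset.

```python
import pandas as pd

# Toy frame built from the first five rows shown above (not the full dataset)
sample = pd.DataFrame({
    'calls': [40.0, 85.0, 77.0, 106.0, 66.0],
    'minutes': [311.90, 516.75, 467.66, 745.53, 418.74],
    'messages': [83.0, 56.0, 86.0, 81.0, 1.0],
    'mb_used': [19915.42, 22696.96, 21060.45, 8437.39, 14502.75],
    'is_ultra': [0, 0, 0, 1, 0],
})


def check_quality(df, target='is_ultra'):
    """Return basic data-quality indicators (hypothetical helper)."""
    features = df.drop(columns=target)
    return {
        'missing_values': int(df.isna().sum().sum()),
        'duplicate_rows': int(df.duplicated().sum()),
        'negative_values': int((features < 0).sum().sum()),
        'target_is_binary': bool(df[target].isin([0, 1]).all()),
    }


report = check_quality(sample)
print(report)
```

Calling `check_quality(user_base)` on the real data would be the natural use; the negative-value and binary-target checks are additions that the notebook itself does not run.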

# Machine Learning Process <a id='ML'></a>

## Training, validation, and test set <a id='ML1'></a>

In [9]:
# First, we divide and isolate the data into features (X) and the target variable (y)
x = user_base.drop('is_ultra', axis=1)
y = user_base['is_ultra']

In [10]:
# We split the dataset into a temporary training set (80%) and a test set (20%)
x_train_temp, x_test, y_train_temp, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Then, we split the temporary training set into a training set (80%) and a validation set (20%) 
x_train, x_val, y_train, y_val = train_test_split(x_train_temp, y_train_temp, test_size=0.2, random_state=42)


We now have three sets: x_train, y_train (training); x_val, y_val (validation); and x_test, y_test (test). Because the second split is applied to the 80% that remains after the first one, the final proportions are roughly 64% for the training set, 16% for the validation set and 20% for the test set, which is a common choice in machine learning.
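The effective size of each set after two chained 80/20 splits can be checked directly. The sketch below uses a plain array of 3214 indices as a stand-in for the dataset (an assumption made so the example runs without the CSV file); only the row counts matter here.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 3214 rows of the dataset; only the row count matters
rows = np.arange(3214)

# Same two-step split as in the notebook: 80/20, then 80/20 of the remainder
train_temp, test = train_test_split(rows, test_size=0.2, random_state=42)
train, val = train_test_split(train_temp, test_size=0.2, random_state=42)

sizes = {'train': len(train), 'val': len(val), 'test': len(test)}
shares = {name: round(n / len(rows), 2) for name, n in sizes.items()}

print(sizes)
print(shares)
```

If preserving the Smart/Ultra ratio in every set were a concern, `train_test_split` also accepts a `stratify` argument; that is an option, not something the notebook does.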

## Analysis of the model with higher quality (accuracy)  <a id='ML2'></a>

We will then compare three classification models (decision tree, random forest and logistic regression) on the binary target is_ultra to find the most accurate one.

Note: for the models that support it, we will iterate the hyperparameter ‘max_depth’ from 1 to 4 to deepen the analysis.

In [13]:
# Variables that track the best candidate found during the search
best_model = None   # fitted estimator with the highest validation accuracy
best_name = None    # display name of that estimator
best_depth = None   # its max_depth (None for logistic regression)
best_acc = 0

In [15]:
# Logistic Regression Model (fitted once, since max_depth does not apply)
model_lr = LogisticRegression(random_state=42, max_iter=1000)
model_lr.fit(x_train, y_train)
pred_lr = model_lr.predict(x_val)
acc_lr = accuracy_score(y_val, pred_lr)

# Iterate for DecisionTree and RandomForest with max_depth from 1 to 4
for max_depth in range(1, 5):
    
    # Decision Tree
    model_dt = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    model_dt.fit(x_train, y_train)
    pred_dt = model_dt.predict(x_val)
    acc_dt = accuracy_score(y_val, pred_dt)
    
    # Random Forest
    model_rf = RandomForestClassifier(n_estimators=100, max_depth=max_depth, random_state=42)
    model_rf.fit(x_train, y_train)
    pred_rf = model_rf.predict(x_val)
    acc_rf = accuracy_score(y_val, pred_rf)
    
    # Candidates in this iteration: (accuracy, name, fitted model, depth)
    models = [
        (acc_dt, 'Decision Tree', model_dt, max_depth),
        (acc_rf, 'Random Forest', model_rf, max_depth),
        (acc_lr, 'Logistic Regression', model_lr, None)
    ]
    
    best_model_iter = max(models, key=lambda m: m[0])
    
    # Keep the fitted model together with its name and depth, so that later
    # cells do not depend on the last value of the loop variables
    if best_model_iter[0] > best_acc:
        best_acc, best_name, best_model, best_depth = best_model_iter

print("Best validation accuracy:", best_acc)
print("Best model:", best_name)

Best validation accuracy: 0.8116504854368932
Best model: Random Forest


In [16]:
print(f'\nThe best selected model is: {best_name} with a max_depth of {best_depth} and an accuracy of {best_acc} in the validation set.')


The best selected model is: Random Forest with a max_depth of 4 and an accuracy of 0.8116504854368932 in the validation set.


In [17]:
# Evaluate the selected model (already fitted) on the test set
test_pred = best_model.predict(x_test)

test_acc = accuracy_score(y_test, test_pred)
print(f'Accuracy in the test set: {test_acc}')


Accuracy in the test set: 0.8087091757387247
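The search above reports only the single best candidate. When comparing depths it can help to keep the validation accuracy of every candidate; the following is a minimal sketch of that idea. It runs on synthetic data from `make_classification` (an assumption, so the example works without the CSV file), and its numbers say nothing about the Megaline dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 4 features and a binary target
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42)

records = []
for depth in range(1, 5):
    candidates = {
        'Decision Tree': DecisionTreeClassifier(max_depth=depth, random_state=42),
        'Random Forest': RandomForestClassifier(
            n_estimators=50, max_depth=depth, random_state=42),
    }
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        acc = accuracy_score(y_val, model.predict(X_val))
        records.append({'model': name, 'max_depth': depth, 'val_accuracy': acc})

# One row per (model, depth) pair, best candidates first
results = pd.DataFrame(records).sort_values('val_accuracy', ascending=False)
print(results.to_string(index=False))
```

A table like this makes it visible whether the winner is clearly ahead or whether several depths perform almost identically, in which case the shallower model may be the simpler choice.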


# Conclusion  <a id='end'></a>


An analysis was carried out with three classification models for the binary target (Smart or Ultra): Decision Tree, Random Forest and Logistic Regression. In addition, the hyperparameter max_depth was varied from 1 to 4 to find the best model and deepen the analysis.

- Note: this hyperparameter does not apply to logistic regression.

As a final result, the best model was the random forest with a depth of 4 (max_depth) and an accuracy of 0.8116 on the validation set.
On the test set the accuracy drops only slightly, to 0.8087. A gap this small suggests at most mild overfitting to the validation set, and the objective of finding a model with an accuracy above the reference value of 0.75 is met.
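To put these figures in context, the arithmetic below compares them with a majority-class baseline and converts the accuracies into counts of correct predictions. It is a sketch built only from numbers printed earlier in the notebook. The set sizes of 515 and 643 rows are an assumption derived from applying the two 80/20 splits to 3214 rows.

```python
# Figures copied from the notebook outputs
ultra_share = 0.306472         # mean of is_ultra in describe()
val_acc = 0.8116504854368932   # best validation accuracy
test_acc = 0.8087091757387247  # test accuracy

# Assumed set sizes: two chained 80/20 splits of 3214 rows
n_val, n_test = 515, 643

# Accuracy of always predicting the majority class (Smart, is_ultra = 0),
# measured on the full dataset
baseline_acc = 1 - ultra_share

# Number of correct predictions implied by each accuracy
val_correct = round(val_acc * n_val)
test_correct = round(test_acc * n_test)

print(f'Majority-class baseline: {baseline_acc:.4f}')
print(f'Validation: {val_correct}/{n_val} correct')
print(f'Test: {test_correct}/{n_test} correct')
print(f'Validation minus test accuracy: {val_acc - test_acc:.4f}')
```

The baseline comes out at about 0.69, so the 0.75 reference sits above what predicting Smart for every user would achieve, and the selected model improves on both.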