## INTRODUCTION

The mobile company Megaline is dissatisfied to see that many of its customers are using legacy plans. They want to develop a model that can analyze customer behavior and recommend one of Megaline's new plans: Smart or Ultra.

We have access to data on the behavior of subscribers who have already switched to the new plans. For this classification task, we need to create a model that selects the correct plan.

This Dataset come from the ___TrippleTen - Data Scientis Course___.

___Request___
- The accuracy threshold for each model must be greater than 0.75.
- Test the 3 basic classifier models

## DATA DESCRIPTION

users_behavior.csv
- calls: The number of calls made by the customer.
- minutes: The total duration of the calls in minutes.
- messages: The number of text messages sent.
- mb_used: The amount of internet traffic used in megabytes (MB).
- is_ultra: The plan for the current month (Ultra = 1, Smart = 0).

In [28]:
import pandas as pd
from sklearn.model_selection import train_test_split 
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier 
from sklearn.linear_model import LogisticRegression 

In [7]:
megaline_df = pd.read_csv('users_behavior.csv')

In [8]:
megaline_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [12]:
print(megaline_df.duplicated().count())
megaline_df.sample(10)

3214


Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
1402,87.0,637.26,81.0,12511.37,0
559,27.0,183.8,118.0,27705.14,1
1808,53.0,370.03,0.0,32581.16,1
1320,81.0,518.73,71.0,32418.28,1
278,38.0,304.05,0.0,14252.15,1
2670,69.0,478.16,66.0,13055.63,0
2434,45.0,313.72,38.0,14234.41,0
376,51.0,326.44,16.0,14514.98,0
1586,50.0,357.58,0.0,13497.97,0
38,23.0,146.18,33.0,9711.45,0


Remarks

- It is not possible to determine whether there are duplicates in the dataframe simply based on the given data, so it will be assumed that there are no duplicates.
- There are no missing values.
- By convention, the columns calls and messages will be converted to int64.
- The target in this case will be is_ultra, which is categorical, so models for continuous values will be discarded.
- The features in this case will be calls, minutes, messages, and mb_used.

## PRE-PROCESSING

In [16]:
# Modifying Datatypes
megaline_df['calls'] = megaline_df['calls'].astype('int64')
megaline_df['messages'] = megaline_df['messages'].astype('int64')

## DATA SEGMENTATION

Data Segmentation - Training (60%) - Validation (20%) - Test (20%)

In [18]:
total = megaline_df['calls'].count()

# Getting 20%  (test)
df_remaining, df_test = train_test_split(megaline_df,test_size=0.20, random_state=1996)

# Getting 60% (Training) and 20% (Validation) 
df_train, df_valid = train_test_split(df_remaining,test_size=0.25, random_state=1996)

#Comprobamos que se dividió correctamente la población
train_pct = round((df_train['calls'].count() / total) *100,2)
valid_pct = round((df_valid['calls'].count() / total) *100,2)
test_pct = round((df_test['calls'].count() / total) *100,2)
print(f"From 100% of the data, the following was selected:\n{train_pct}% for training.\n{valid_pct}% for validation.\n{test_pct}% for testing.")

From 100% of the data, the following was selected:
59.99% for training.
20.01% for validation.
20.01% for testing.


In [20]:
# Getting features and trainings

#Train
feature_train = df_train.drop('is_ultra',axis = 1)
target_train = df_train['is_ultra']

#Validation
feature_valid = df_valid.drop('is_ultra',axis = 1)
target_valid = df_valid['is_ultra']

#Test
feature_test = df_test.drop('is_ultra',axis = 1)
target_test = df_test['is_ultra']

## TRAINING MODELS

### Decision Tree

In [36]:
b_depth = 0
b_score = 0

for depth in range(1,50):  
    model_dtc = DecisionTreeClassifier(random_state=1996, max_depth=depth) 
    model_dtc.fit(feature_train,target_train)

    score_dtc = model_dtc.score(feature_valid,target_valid)
    if score_dtc > b_score:
        b_score = score_dtc
        b_depth = depth

print(f'Best score:{b_score},\nBest depth:{b_depth}')

Best score:0.7962674961119751,
Best depth:5


In [41]:
# Configuring the model with the optimal hyperparameters identified
model_dtc = DecisionTreeClassifier(random_state=1996, max_depth=5) 
model_dtc.fit(feature_train,target_train)
valid_score_dtc = model_dtc.score(feature_valid,target_valid)

print(f"Accuracy: {valid_score_dtc}")

Accuracy: 0.7962674961119751


In [42]:
# Testing the model
test_score_dtc = model_dtc.score(feature_test,target_test)
print(f"Accuracy : {test_score_dtc}")


Accuracy : 0.7807153965785381


Remarks

- During an iteration of the model, the best score obtained was 0.796 for the validation set, using a depth: 6.
- When using the test model, the accuracy decreased by only about 0.1%, meaning that the selected parameters managed to exceed the 75% test threshold.

### Random Forest

In [26]:
# Settings
b_est = 0.0001
b_depth = 0
b_model = None
b_score = 0

for est in range (1,100,10):
    for depth in range(1,20):
        model_rfc = RandomForestClassifier(random_state=1996, n_estimators=est, max_depth = depth)
        model_rfc.fit(feature_train,target_train)
        score_rfc = model_rfc.score(feature_valid,target_valid)
        if score_rfc > b_score:
            b_score = score_rfc
            b_est = est
            b_depth = depth
            

print(f'Best score:{b_score},\nEstimator:{b_est}\nBest depth:{b_depth}')

Best score:0.8133748055987559,
Estimator:41
Best depth:6


In [None]:
# Configuring the model with the optimal hyperparameters identified
model_rfc = RandomForestClassifier(random_state=1996, n_estimators=41, max_depth = 6)
model_rfc.fit(feature_train,target_train)
valid_score_rfc = model_rfc.score(feature_valid,target_valid)
print(f"Accuracy: {valid_score_rfc}")

Accuracy: 0.8133748055987559


In [38]:
# Testing model
test_score_rfc = model_rfc.score(feature_test,target_test)
print(f"Accuracy : {test_score_rfc}")

Accuracy : 0.80248833592535


Remarks

- During an iteration of the model, the best score obtained was 0.813 for the validation set, using n_estimators: 41 and max_depth: 6.
- When using the test model, the accuracy decreased by only about 0.1%, meaning that the selected parameters managed to exceed the 75% test threshold.

### Logistic Regression

In [24]:
# # Configuring the model with the optimal hyperparameters identified
model_lr = LogisticRegression(random_state=1996, solver='liblinear') 
model_lr.fit(feature_train,target_train)
valid_score_lr = model_lr.score(feature_valid,target_valid)
print(f"Exactitud de: {valid_score_lr}")

Exactitud de: 0.702954898911353


In [43]:
# Testing model
test_score_lr = model_lr.score(feature_test,target_test)
print(f"Exactitud de: {test_score_lr}")

Exactitud de: 0.7169517884914464


Remarks
- In this case, we obtained an accuracy lower than requested, which indicates that this model may not be very useful. However, it is important to consider that when testing the model, the accuracy increased by approximately 0.1% as well.

## FINAL CONCLUTION

In [48]:
index = ['DecisionTree','RandomForest','LogisticRegression']
values = [[valid_score_dtc,test_score_dtc],[valid_score_rfc,test_score_rfc],[valid_score_lr,test_score_lr]]
metrics = ['Acc_Valid','Acc_Test']
summary = pd.DataFrame(values, columns = metrics, index = index )

summary.sort_values(by='Acc_Test',ascending=False)

Unnamed: 0,Acc_Valid,Acc_Test
RandomForest,0.813375,0.802488
DecitionTree,0.796267,0.780715
LogisticRegression,0.702955,0.716952


In conclusion, the model that best fits this task is the ___random forest___, considering that the parameters for the best result are ___trees: 41___, with the default value being 100, and ___depth: 6___. We can determine that this model has a good accuracy with a set of parameters that are not too high, which could affect both the processing speed of the system and the execution time to obtain the calculations.