# MEGALINE. Recommendation of plans with models based on customer behavior.

## Contents <a id='back'></a>

* [Introduction](#intro) 
* [1. Review of preprocessed data](#data_review)
* [2. Preprocessing of data](#data_preprocessing)
* [3. Prediction model evaluation](#model_evaluation)
* [4. Sanity check](#test)
* [Conclusions](#end)

## Introduction <a id='intro'></a>

Megaline is unhappy that many of its customers still use legacy plans. The company wants a model that analyzes customer behavior and recommends one of Megaline's new plans: Smart or Ultra.


## Objective

Create a classification model that recommends the right plan for each customer, with an accuracy of at least 0.75.

[Back to Contents](#back)


## 1. Review of preprocessed data <a id='data_review'></a>

In [None]:
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

In [22]:
df= pd.read_csv('/datasets/users_behavior.csv')
df

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.90,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0
...,...,...,...,...,...
3209,122.0,910.98,20.0,35124.90,1
3210,25.0,190.36,0.0,3275.61,0
3211,97.0,634.44,70.0,13974.06,0
3212,64.0,462.32,90.0,31239.78,0


In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [24]:
df.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


[Back to Contents](#back)

## 2. Preprocessing of data <a id='data_preprocessing'></a>

### Segment the source data into a training set, a validation set and a test set.

In [25]:
features_data=df.drop(['is_ultra'], axis=1) 
target_data= df['is_ultra']

In [26]:
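# Note: train_test_split returns (train, test) in that order, so with this
# unpacking `features_test`/`target_test` receive 75% of the rows and
# `features`/`target` receive the remaining 25%.
features_test, features, target_test, target = train_test_split(features_data, target_data, test_size=0.25, random_state=12345)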

In [27]:
features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size=0.25, random_state=12345)
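
For comparison, here is a minimal sketch of a conventional three-way split in which the largest share goes to training. It runs on synthetic data, not on the Megaline file; the names `X`, `y`, `X_rest` and the 60/20/20 proportions are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, not the Megaline dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = rng.integers(0, 2, size=1000)

# train_test_split returns (train, test) in that order.
# First hold out 20% of the rows as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345
)

# Then take 25% of the remaining 80% as validation (20% of the full data).
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=12345
)

print(len(X_train), len(X_valid), len(X_test))
```

With this ordering the model is fitted on the largest subset, and the test rows stay untouched until the final check.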

[Back to Contents](#back)

## 3. Prediction model evaluation <a id='model_evaluation'></a>

### Investigate the quality of different models by changing the hyperparameters.

In [28]:
# Decision tree

for depth in range (1,6):
    model_tree = DecisionTreeClassifier(random_state= 12345, max_depth=depth)
    model_tree.fit(features_train, target_train)
    
    predictions_valid= model_tree.predict(features_valid)
    
    print("max_depth =", depth, ": ", end='')
    print(accuracy_score(target_valid, predictions_valid))


max_depth = 1 : 0.7661691542288557
max_depth = 2 : 0.7761194029850746
max_depth = 3 : 0.7711442786069652
max_depth = 4 : 0.7711442786069652
max_depth = 5 : 0.7860696517412935


In [29]:
best_model = None
best_result = 0

for depth in range(1, 6):
    model_tree = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model_tree.fit(features_train, target_train)

    predictions_valid = model_tree.predict(features_valid)

    # Keep the tree with the highest validation accuracy.
    result = accuracy_score(target_valid, predictions_valid)
    if result > best_result:
        best_model = model_tree
        best_result = result

print("Accuracy of the best model on the validation set:", best_result)

Accuracy of the best model on the validation set: 0.7860696517412935


In [31]:
# Random forest

best_score = 0
best_est = 0
best_model = None  # note: this discards the best decision tree stored above

for est in range(1, 10):
    model_forest = RandomForestClassifier(random_state=54321, n_estimators=est)
    model_forest.fit(features_train, target_train)
    score = model_forest.score(features_valid, target_valid)
    if score > best_score:
        best_score = score
        best_est = est
        best_model = model_forest

# Report once, after the search has finished.
print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))

Accuracy of the best model on the validation set (n_estimators = 2): 0.746268656716418


In [32]:
# Logistic regression

model_reg_logic = LogisticRegression(random_state=54321, solver='liblinear')
model_reg_logic.fit(features_train, target_train)
score_train = model_reg_logic.score(features_train, target_train)
score_valid = model_reg_logic.score(features_valid, target_valid)

print("Accuracy of the logistic regression model on the training set:", score_train)
print("Accuracy of the logistic regression model on the validation set:", score_valid)

Accuracy of the logistic regression model on the training set: 0.7081260364842454
Accuracy of the logistic regression model on the validation set: 0.7064676616915423


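Against the accuracy threshold of 0.75, the validation results are: decision tree 0.786 (max_depth = 5), random forest 0.746 (n_estimators = 2) and logistic regression 0.706. Only the decision tree clears the threshold; the random forest falls just short of it and is also slower to train than a single tree. Note that `best_model` now refers to the random forest, because the forest search reassigns that variable, so the random forest is the model checked on the test set.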
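
One way to avoid overwriting `best_model` between cells is to score every candidate on the validation set in a single place and pick the winner explicitly. The following is a minimal sketch on synthetic data from `make_classification`; the candidate names, the hyperparameter values and the variable `best_overall` are assumptions for illustration, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (hypothetical stand-in).
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# One entry per model family; the settings are illustrative only.
candidates = {
    "tree_depth_5": DecisionTreeClassifier(max_depth=5, random_state=0),
    "forest_10": RandomForestClassifier(n_estimators=10, random_state=0),
    "logistic": LogisticRegression(solver="liblinear", random_state=0),
}

# Fit each candidate and record its validation accuracy.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_valid, y_valid)

# Select the candidate with the highest validation accuracy.
best_name = max(scores, key=scores.get)
best_overall = candidates[best_name]
print(best_name, scores[best_name])
```

Keeping all scores in one dictionary makes it easy to see which family won and by how much before moving on to the test set.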

### Check the quality of the model using the test set.

In [33]:
test_predictions = best_model.predict(features_test)


print('Predictions:', test_predictions)
print('Correct answers:', target_test)

Predictions: [0 0 0 ... 1 0 1]
Correct answers: 101     0
1915    0
88      0
1348    0
2264    1
       ..
2817    1
546     1
382     1
2177    1
482     1
Name: is_ultra, Length: 2410, dtype: int64


In [34]:
def accuracy(answers, predictions):
    # Share of positions where the prediction matches the true label.
    correct = 0

    for answer, prediction in zip(answers, predictions):
        if answer == prediction:
            correct += 1
    # Return only after every pair has been compared.
    return correct / len(answers)

# `features` is the 25% split that contains the training and validation rows.
train_predictions = best_model.predict(features)

print('Accuracy')
print('Training set:', accuracy_score(target, train_predictions))
print('Test set:', accuracy_score(target_test, test_predictions))

Accuracy
Training set: 0.8669154228855721
Test set: 0.7452282157676349


The results differ: the model reaches 0.867 on the split that includes its training data but only 0.745 on the test set, which points to some overfitting. The test accuracy falls just short of the 0.75 threshold.

[Back to Contents](#back)

## 4. Sanity check <a id='test'></a>

In [35]:

average_target = target.mean()
average_target

0.2997512437810945

In [36]:
predictions= pd.Series(target.mean(), index=target.index)
mse = mean_squared_error(target, predictions)
mse

0.2099004356327814

In [37]:
rmse = mse**0.5
rmse

0.4581489229855085

In [39]:

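# Baseline that always predicts the most frequent class in the training set.
dummy_clf = DummyClassifier(strategy="most_frequent")

dummy_clf.fit(features_train, target_train)

y_pred = dummy_clf.predict(features_test)

# Separate name so the `accuracy` function defined earlier is not overwritten.
dummy_accuracy = accuracy_score(target_test, y_pred)
print("Dummy classifier accuracy: {:.2f}".format(dummy_accuracy))

Dummy classifier accuracy: 0.69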
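
The accuracy of a `most_frequent` baseline is simply the share of the evaluated rows that belong to the class that was most frequent during fitting. A minimal sketch with made-up labels (the arrays and the 70/30 class balance are assumptions for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels: class 0 is the majority in both splits (hypothetical).
y_fit = np.array([0] * 70 + [1] * 30)
y_eval = np.array([0] * 14 + [1] * 6)

# DummyClassifier ignores the feature values, so placeholders are enough.
X_fit = np.zeros((len(y_fit), 1))
X_eval = np.zeros((len(y_eval), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_fit, y_fit)

baseline_accuracy = accuracy_score(y_eval, baseline.predict(X_eval))

# Share of the evaluated rows that belong to the majority class (class 0).
majority_share = np.mean(y_eval == 0)

print(baseline_accuracy, majority_share)
```

A trained model passes this kind of sanity check only if its accuracy on the same rows is higher than `baseline_accuracy`.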


[Back to Contents](#back)

## Conclusions <a id='end'></a>

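1. On the validation set, the decision tree (max_depth = 5) was the most accurate model at 0.786, followed by the random forest (n_estimators = 2) at 0.746 and logistic regression at 0.706. Only the decision tree cleared the 0.75 threshold.

2. The model checked on the test set was the random forest stored in `best_model`. It scored 0.867 on the split that includes its training data and 0.745 on the test set, so it is more accurate on data it has already seen and falls just short of the threshold on new data.

3. A constant prediction equal to the mean of the target has an RMSE of about 0.46, and the Dummy classifier that always predicts the most frequent class reaches an accuracy of 0.69 on the test set. The model's test accuracy of 0.745 is above that baseline and close to the 0.75 threshold, so the sanity check indicates that the model learns more than the class balance alone.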

[Back to Contents](#back)