# Content <a id='back'></a>

* [Introduction](#intro)
* [Step 1. Data review](#data_review)
    * [First impressions](#data_review_conclusions)
* [Step 2. Data preprocessing](#data_preprocessing)
    * [2.1 Duplicate values](#duplicate_values)
* [Step 3. Data Analysis](#data_analysis)
    * [3.1 Segmentation of the source data into a training set, a validation set and a test set.](#segmentation)
    * [3.2 Training different models with different hyperparameters](#training_models)
    * [3.3 Finding the best model with the best hyperparameters](#best_model)
    * [3.4 Evaluating the best model with the test set](#evaluation)
* [Step 4. Sanity check](#sanity_check)
* [Conclusion](#end)

# Introduction <a id='intro'></a>

Mobile company Megaline is not happy that many of its customers are using legacy plans. They want to develop a model that can analyse customer behaviour and recommend one of Megaline's new plans: Smart or Ultra.

They have behavioural data on subscribers who have already switched to the new plans (from the Statistical Data Analysis sprint project). For this classification task, a model will be created that will choose the correct plan.

The goal is to develop a model with the highest possible accuracy. For this project, the accuracy threshold to reach is about 0.75.

## Step 1. Data review. <a id='data_review'></a>

In [1]:
# All libraries are loaded

import pandas as pd
from matplotlib import pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.dummy import DummyClassifier

### First impressions <a id='data_review_conclusions'></a>

In [2]:
# Import data

df = pd.read_csv('users_behavior.csv')

In [3]:
# The data frame information and a sample of the data are printed

display(df.head())
df.info()
df.describe()

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


           calls     minutes   messages       mb_used  is_ultra
count     3214.0      3214.0     3214.0        3214.0    3214.0
mean   63.038892  438.208787  38.281269  17207.673836  0.306472
std    33.236368  234.569872  36.148326   7570.968246    0.4611
min          0.0         0.0        0.0           0.0       0.0
25%         40.0     274.575        9.0    12491.9025       0.0
50%         62.0       430.6       30.0     16943.235       0.0
75%         82.0    571.9275       57.0       21424.7       1.0
max        244.0     1632.06      224.0      49745.73       1.0


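**Observations**
1. There are no missing values in the data set: every column has 3214 non-null entries.
2. The data types are appropriate for the data set.
3. Everything looks correct, so it is possible to continue with the next steps.
4. The values returned by the `describe()` method are consistent. The mean of `is_ultra` (0.306) shows that about 31% of the subscribers are on the Ultra plan.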
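
The checks behind these observations can be sketched in a few lines. This is a minimal illustration that rebuilds only the five rows shown by `df.head()` above, not the full `users_behavior.csv` file; the names `sample`, `class_share` and `missing` are introduced here for illustration.

```python
import pandas as pd

# The five rows printed by df.head() above (not the full data set)
sample = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0, 106.0, 66.0],
    "minutes": [311.90, 516.75, 467.66, 745.53, 418.74],
    "messages": [83.0, 56.0, 86.0, 81.0, 1.0],
    "mb_used": [19915.42, 22696.96, 21060.45, 8437.39, 14502.75],
    "is_ultra": [0, 0, 0, 1, 0],
})

# Share of each class in the target column
class_share = sample["is_ultra"].value_counts(normalize=True)
print(class_share)

# Number of missing values per column
missing = sample.isna().sum()
print(missing)
```

Run on the full `df`, the same two calls would show the class balance and confirm the absence of missing values reported by `df.info()`.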

## Step 2. Data preprocessing <a id='data_preprocessing'></a>

### Duplicate Values <a id='duplicate_values'></a>

In [4]:
# Check for duplicated rows

print('Duplicated values in df:')
print(df[df.duplicated()])

Duplicated values in df:
Empty DataFrame
Columns: [calls, minutes, messages, mb_used, is_ultra]
Index: []


**Observations**

1. There are no duplicate rows in `df`, so the data is consistent and we can continue with the next steps.
2. Although it was not requested in this project, standardizing the features could help, because they are on very different scales (for example, `mb_used` is in the tens of thousands while `messages` is in the tens). This mainly matters for logistic regression; tree-based models are insensitive to feature scale.
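
A minimal sketch of how standardization could be wired in, assuming a scikit-learn `Pipeline` so that the scaler is fitted on the training split only. The data here is synthetic (`make_classification`), not the Megaline data, and the names `X`, `y` and `scaled_logistic` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with four features (assumption: not the Megaline data)
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# The scaler learns mean and standard deviation from the training split only,
# then the same transformation is applied to the validation split
scaled_logistic = make_pipeline(StandardScaler(),
                                LogisticRegression(random_state=12345))
scaled_logistic.fit(X_train, y_train)

valid_accuracy = scaled_logistic.score(X_valid, y_valid)
print(f"Validation accuracy with scaling: {valid_accuracy:.4f}")
```

Whether scaling actually improves accuracy on this project's data would have to be checked on the real validation set.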

## Step 3. Data Analysis <a id='data_analysis'></a>

### Segmentation of the source data into a training set, a validation set and a test set. <a id='segmentation'></a>

In [5]:
# Split the data into training, validation and test sets

# First, the data set is split into 60% for training and 40% for validation and test
df_train, df_prev = train_test_split(df, test_size=0.4, random_state=12345)
# Now, df_prev is split into 50% for validation and 50% for test
df_valid, df_test = train_test_split(df_prev, test_size=0.5, random_state=12345)

# Separate the features and the target in each set
features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']
features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']
features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']

<div class="alert alert-block alert-success">
<b>Reviewer's comment (1st iteration)</b> <a class="tocSkip"></a>

Very good, you split the data correctly. Always remember this step: to validate that a model is training correctly, you need a held-out test set.
</div>
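
The two-stage split yields a 60/20/20 partition. As a quick check of the resulting sizes, the sketch below applies the same two `train_test_split` calls to a stand-in index array with the same number of rows as `df` (3214); the names `rows`, `train_rows`, `valid_rows` and `test_rows` are introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for df: one entry per row of the 3214-row data set
rows = np.arange(3214)

# Same two-stage split as in the notebook: 60% / 40%, then 50% / 50%
train_rows, rest_rows = train_test_split(rows, test_size=0.4, random_state=12345)
valid_rows, test_rows = train_test_split(rest_rows, test_size=0.5, random_state=12345)

print(len(train_rows), len(valid_rows), len(test_rows))  # 1928 643 643
```

With only 643 rows in the validation and test sets, one additional correct prediction changes accuracy by about 0.16 percentage points, which is worth keeping in mind when comparing models that differ in the third decimal place.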

### Training different models with different hyperparameters <a id='training_models'></a>

In [6]:
# Decision tree classifier model

for depth in range(1, 11):
    tree_model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    tree_model.fit(features_train, target_train)
    predictions_tree_valid = tree_model.predict(features_valid)
    print("max_depth =", depth, ": ", end='')
    print(accuracy_score(target_valid, predictions_tree_valid))

max_depth = 1 : 0.7542768273716952
max_depth = 2 : 0.7822706065318819
max_depth = 3 : 0.7853810264385692
max_depth = 4 : 0.7791601866251944
max_depth = 5 : 0.7791601866251944
max_depth = 6 : 0.7838258164852255
max_depth = 7 : 0.7822706065318819
max_depth = 8 : 0.7791601866251944
max_depth = 9 : 0.7822706065318819
max_depth = 10 : 0.7744945567651633


In [7]:
# Random Forest classifier model

best_score = 0
best_est = 0
for est in range(1, 100): # hyperparameter range to try
    forest_model = RandomForestClassifier(random_state=12345, n_estimators=est) # set the number of trees
    forest_model.fit(features_train,target_train) # train the model on the training set
    score = forest_model.score(features_valid, target_valid) # accuracy on the validation set
    if score > best_score:
        best_score = score # keep the best validation accuracy
        best_est = est # keep the number of estimators that gave the best accuracy

print("The accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))

The accuracy of the best model on the validation set (n_estimators = 23): 0.7947122861586314


In [8]:
# Logistic regression model
max_iter = [50,60,70,100]
for iter_num in max_iter:
    logistic_model = LogisticRegression(random_state=12345, max_iter=iter_num)
    logistic_model.fit(features_train, target_train)
    logistic_predictions = logistic_model.predict(features_valid)
    logistic_accuracy = accuracy_score(target_valid, logistic_predictions)
    print("max_iter =", iter_num, ": ", end='')
    print(logistic_accuracy)

max_iter = 50 : 0.7107309486780715
max_iter = 60 : 0.7107309486780715
max_iter = 70 : 0.7107309486780715
max_iter = 100 : 0.7107309486780715


**Observations**

1. Of the three models trained (decision tree, random forest and logistic regression), the random forest with 23 estimators gives the best validation accuracy, 79.47%, which exceeds the objective.
2. The second-best model is the decision tree, with 78.54% validation accuracy at `max_depth = 3`, which is still above the target.
3. Logistic regression comes last with 71.07%, below the target. Its accuracy is the same for every `max_iter` value tried.
4. As an extra analysis, we will build a grid of other hyperparameters for the random forest and see whether it brings any improvement.
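
The decision tree loop above prints every score but leaves picking the best depth to the reader. A minimal sketch of keeping the scores in a dictionary and selecting the best depth programmatically, run on synthetic data (`make_classification`) rather than the Megaline data; `depth_scores` and `best_depth` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with four features (assumption: not the Megaline data)
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Validation accuracy for each tree depth
depth_scores = {}
for depth in range(1, 11):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(X_train, y_train)
    depth_scores[depth] = model.score(X_valid, y_valid)

# Depth with the highest validation accuracy (the smallest depth wins ties)
best_depth = max(depth_scores, key=depth_scores.get)
print(f"best max_depth = {best_depth}, "
      f"validation accuracy = {depth_scores[best_depth]:.4f}")
```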

### Finding the best model with the best hyperparameters <a id='best_model'></a>

In [9]:
# Finding the best model with different hyperparameters

param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10]
}

# GridSearchCV clones forest_model and overrides the hyperparameters listed in param_grid
grid_search = GridSearchCV(forest_model, param_grid, cv=5, n_jobs=-1, scoring='accuracy')
grid_search.fit(features_train, target_train)

best_model = grid_search.best_estimator_
best_predictions = best_model.predict(features_valid)
best_accuracy = accuracy_score(target_valid, best_predictions)
print(f'Accuracy of the Best Model (Random Forest): {best_accuracy:.4f}')

Accuracy of the Best Model (Random Forest): 0.7978


**Observation**

Validation accuracy improved only slightly, from 0.7947 to 0.7978, with the hyperparameters selected by the grid search. The tuned model is the one used for the final evaluation on the test set.
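
The notebook does not print which hyperparameters the grid search selected. A minimal sketch of how they could be inspected through the `best_params_` and `best_score_` attributes of a fitted `GridSearchCV`. It uses synthetic data and a deliberately small grid so that it runs quickly; `small_grid` and `search` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in with four features (assumption: not the Megaline data)
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)

# A reduced grid, for illustration only
small_grid = {"n_estimators": [10, 20], "max_depth": [None, 5]}

search = GridSearchCV(RandomForestClassifier(random_state=12345),
                      small_grid, cv=3, scoring="accuracy")
search.fit(X, y)

# Hyperparameter combination with the best mean cross-validated accuracy
print(search.best_params_)
print(f"mean cross-validated accuracy: {search.best_score_:.4f}")
```

Note that `best_score_` is a cross-validated score on the training data, so it is not directly comparable to the accuracy measured on the separate validation set.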

### Evaluating the best model with test set <a id='evaluation'></a>

In [10]:
# Evaluate the best model on the test set

test_predictions = best_model.predict(features_test)
test_accuracy = accuracy_score(target_test, test_predictions)
print(f'Accuracy of the Best Model on the Test Set: {test_accuracy:.4f}')

Accuracy of the Best Model on the Test Set: 0.8040


**Observation**

1. The model reaches 80.40% accuracy on the test set, which is above the 0.75 threshold and close to its validation accuracy (79.78%). This can always be improved, but it is a better value than expected.
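
Because the two classes are not balanced, accuracy alone does not show how the errors are distributed between the plans. A minimal sketch of a confusion matrix on ten hypothetical predictions; the labels below are made up for illustration and assume that `1` stands for Ultra and `0` for Smart.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (assumption: 1 = Ultra, 0 = Smart)
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = matrix.ravel()

accuracy = accuracy_score(y_true, y_pred)
ultra_recall = tp / (tp + fn)  # share of true Ultra users that were found

print(matrix)
print(f"accuracy = {accuracy:.2f}, Ultra recall = {ultra_recall:.2f}")
```

In this toy case accuracy is 0.70 while only one of the three Ultra subscribers is identified, which is the kind of gap the matrix makes visible.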

## Step 4. Sanity check <a id='sanity_check'></a>

In [11]:
# Sanity check: Compare with a "dummy" classifier that always predicts the majority class

dummy_model = DummyClassifier(strategy="most_frequent")
dummy_model.fit(features_train, target_train)
dummy_predictions = dummy_model.predict(features_test)
dummy_accuracy = accuracy_score(target_test, dummy_predictions)
print(f'Accuracy of the Dummy Model: {dummy_accuracy:.4f}')

Accuracy of the Dummy Model: 0.6843
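
The tuned random forest (0.8040) beats this baseline (0.6843) by roughly 12 percentage points on the test set. The baseline figure itself is simply the share of the majority class in the test set, since a `most_frequent` dummy predicts the same class for every row. A minimal sketch with hypothetical toy targets:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical targets: class 0 is the majority in both splits
toy_target_train = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
toy_target_test = np.array([0, 0, 0, 1, 1])

# The dummy model ignores the features, so placeholders are enough
toy_features_train = np.zeros((10, 1))
toy_features_test = np.zeros((5, 1))

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(toy_features_train, toy_target_train)

dummy_accuracy = accuracy_score(toy_target_test, dummy.predict(toy_features_test))
majority_share = np.mean(toy_target_test == 0)

print(dummy_accuracy, majority_share)  # both are 0.6
```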


## Conclusion <a id='end'></a>

1. The trained model exceeded the project objective: 80.40% accuracy on the test set against a threshold of 0.75.
2. The best model was a random forest. Tuning its hyperparameters with GridSearchCV led to a small improvement in validation accuracy (from 0.7947 to 0.7978).
3. The second-best model was the decision tree, and the third was logistic regression.
4. There is no sign of marked overfitting: test accuracy (0.8040) is close to validation accuracy (0.7978), and random forests reduce variance by averaging many trees.
5. As a sanity check, the trained model was compared to a dummy model that always predicts the majority class. The dummy model reaches 0.6843 on the test set, so the trained model performs clearly better.
6. The model achieved 80.4% accuracy in testing. It could be improved by standardizing the data, trying more sophisticated models, searching the hyperparameters more finely, collecting more data, etc.

<div class="alert alert-block alert-info">
<b>General comment (1st iteration)</b> <a class="tocSkip"></a>

You did a very good job, Hans! You trained the models correctly. For your next Machine Learning projects, I recommend first carrying out an EDA of your data and writing up the conclusions with your own interpretation of the results, both from the EDA and from the models.
<br>
<br>

Otherwise you did great: you tried several hyperparameters in your models, which helps a lot when evaluating different training runs and improving model performance.


Regards!
</div>