# Deep Learning Analytics for Zuber

# INDEX

1. [Project Description](#Project-description)
2. [Import libraries and data](#Import-libraries-and-data)
3. [Segment data for test suite](#Segment-data-for-test-suite)
4. [Quality of the models by changing the hyperparameters](#Quality-of-the-models-changing-the-hyperparameters)
5. [Check models with the test suite](#Check-models-against-test-suite)
6. [Sanity Test](#Sanity-Test)
7. [Conclusions](#Conclusions)

<a id="Project-description"></a>

## Project description


Mobile carrier Megaline has found that many of its customers still use legacy plans. The company wants a model that analyzes customer behavior and recommends one of its new plans: Smart or Ultra.
You have access to behavioral data for subscribers who have already switched to the new plans (from the Statistical Data Analysis course project). For this classification task, you must build a model that chooses the correct plan. Since the data has already been preprocessed, you can move straight to building the model.
Develop a model with the highest possible accuracy. In this project, the accuracy threshold is 0.75. Check the accuracy on the test dataset.

<a id="Import-libraries-and-data"></a>

## Import libraries and data

In [1]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
import numpy as np


In [2]:
# Load the CSV file
data = pd.read_csv('/kaggle/input/user-behaviorddd/users_behavior.csv')
# Show the first rows of the dataset
data.head()

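|   | calls | minutes | messages | mb_used | is_ultra |
|---|-------|---------|----------|---------|----------|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |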


The data set contains the following columns:

- calls: number of calls.
- minutes: total duration of calls in minutes.
- messages: number of text messages.
- mb_used: Internet traffic used in MB.
- is_ultra: plan for the current month (Ultra - 1, Smart - 0).

In [3]:
# Compute basic DataFrame statistics (used for the summary below)
data_description = data.describe()

# Print column types and non-null counts; info() prints directly and returns None
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB

**Basic Statistics:**


There are a total of 3,214 observations.

On average, users make 63 calls per month with a total duration of 438 minutes.
They send an average of 38 messages and consume about 17,207 MB of data.

30.6% of users have the Ultra plan (is_ultra = 1).

<a id="Segment-data-for-test-suite"></a>
## Segment data for test suite

I will use the ratio of 60% for training, 20% for validation and 20% for testing.

In [4]:
# Split the data set into training (60%) and temporary (40%)
data_train, data_temp = train_test_split(data, test_size=0.4, random_state=42, stratify=data['is_ultra'])

# Split the temporary set into validation (50%) and test (50%) to get 20% of each
data_valid, data_test = train_test_split(data_temp, test_size=0.5, random_state=42, stratify=data_temp['is_ultra'])

data_train.shape, data_valid.shape, data_test.shape

((1928, 5), (643, 5), (643, 5))

I have segmented the data set "data" as follows:

- **Training**: 1928 observations
- **Validation**: 643 observations
- **Test**: 643 observations
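
The two-step split above can be checked on stand-in data. The sketch below is only an illustration: `demo` is a synthetic frame with the same number of rows and roughly the same Ultra share as the real data, and every name in it is hypothetical. It shows that `test_size=0.4` followed by `test_size=0.5` yields a 60/20/20 split, and that `stratify` keeps the class share nearly identical across the three subsets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 3214 rows, roughly 30.6% positive class
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "feature": rng.normal(size=3214),
    "is_ultra": (rng.random(3214) < 0.306).astype(int),
})

# Same two-step split as above: 60% train, then the remaining 40% split in half
demo_train, demo_temp = train_test_split(
    demo, test_size=0.4, random_state=42, stratify=demo["is_ultra"]
)
demo_valid, demo_test = train_test_split(
    demo_temp, test_size=0.5, random_state=42, stratify=demo_temp["is_ultra"]
)

# Report each subset's share of the data and its positive-class rate
for name, part in [("train", demo_train), ("valid", demo_valid), ("test", demo_test)]:
    share = len(part) / len(demo)
    rate = part["is_ultra"].mean()
    print(f"{name}: {len(part)} rows ({share:.1%}), Ultra share {rate:.3f}")
```

Stratifying matters here because the classes are imbalanced: without it, a small validation or test set could end up with a noticeably different Ultra share than the training set.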


<a id="Quality-of-the-models-changing-the-hyperparameters"></a>
## Quality of the models by changing the hyperparameters

**Next, I will train three machine learning models to find one that can analyze customer behavior and recommend one of the new Megaline plans: Smart or Ultra. The chosen model must reach an accuracy of at least 0.75.**

I will try the different models:
1. Logistic Regression: This is a suitable model for binary classification problems like ours.

2. Decision Tree: Decision trees are very versatile and can be useful in these types of problems.

3. Random Forest: It is an extension of the decision tree that builds multiple trees and combines them to obtain a more accurate and robust prediction.

For each model, I will tune some hyperparameters and evaluate its performance on the validation set. At the end, I will provide a brief description of the findings.

In [5]:
# Separate features and target
X_train = data_train.drop('is_ultra', axis=1)
y_train = data_train['is_ultra']

X_valid = data_valid.drop('is_ultra', axis=1)
y_valid = data_valid['is_ultra']

# Initialize models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42)
}

# Train and validate models
accuracy_scores = {}

for model_name, model in models.items():
    # Train model
    model.fit(X_train, y_train)
    y_pred = model.predict(X_valid)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_valid, y_pred)
    accuracy_scores[model_name] = accuracy

accuracy_scores

{'Logistic Regression': 0.7045101088646968,
 'Decision Tree': 0.7465007776049767,
 'Random Forest': 0.8009331259720062}

1. **Logistic Regression:**

Accuracy: \(0.7045\)
This model did not reach the desired threshold of \(0.75\). Although it is suitable for binary classification, it is sensitive to feature scale and may need scaled inputs or hyperparameter tuning to improve.

2. **Decision Tree:**

Accuracy: \(0.7465\)
With default hyperparameters, it fell just short of the desired threshold. Decision trees are versatile, but an unconstrained tree tends to overfit, so limiting its depth could help.

3. **Random Forest:**

Accuracy: \(0.8009\)
It exceeded the desired threshold and outperformed the other two models.

**Findings:**

1. Only the Random Forest reached the 0.75 threshold on the validation set with default hyperparameters.
2. The Logistic Regression and Decision Tree could be further improved by tuning their hyperparameters.
3. The Random Forest showed the best performance among the evaluated models, even without hyperparameter tuning. Other models or preprocessing techniques (such as data normalization) might further improve accuracy.


Normalizing data can help improve the performance of many models, especially those that are sensitive to the scale of features, such as logistic regression.

I will proceed to normalize the features using scikit-learn's StandardScaler, which standardizes the features by removing the mean and scaling them to have unit variance.
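
As a small illustration of what standardization does, the sketch below compares `StandardScaler` with the manual formula (value minus column mean, divided by column standard deviation). The matrix `X_demo` holds made-up toy values, not rows from the project data, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (hypothetical values): two columns on very different scales
X_demo = np.array([
    [40.0, 19915.0],
    [85.0, 22697.0],
    [77.0, 21060.0],
    [106.0, 8437.0],
])

scaler_demo = StandardScaler()
X_scaled = scaler_demo.fit_transform(X_demo)

# Manual standardization: subtract the column mean, divide by the population std (ddof=0)
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Both approaches should give the same matrix, with column means near 0 and stds near 1
print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

In the project itself the scaler is fitted on the training set only and then applied to the validation and test sets, so that no information from those sets leaks into the scaling parameters.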

In [6]:
# Normalize the features of the training and validation set
scaler = StandardScaler()
X_train_normalized = scaler.fit_transform(X_train)
X_valid_normalized = scaler.transform(X_valid)

# Re-initialize the same three models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42)
}

# Train and validate models with normalized data
accuracy_scores_normalized = {}
for model_name, model in models.items():
    model.fit(X_train_normalized, y_train)
    y_pred = model.predict(X_valid_normalized)
    accuracy = accuracy_score(y_valid, y_pred)
    accuracy_scores_normalized[model_name] = accuracy

print(accuracy_scores_normalized)


{'Logistic Regression': 0.749611197511664, 'Decision Tree': 0.7465007776049767, 'Random Forest': 0.8009331259720062}


**Observations**:
- **Logistic Regression**: Accuracy of 0.7496
- **Decision Tree**: Accuracy of 0.7465
- **Random Forest**: Accuracy of 0.8009

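Normalizing the data improved the accuracy of the Logistic Regression from 0.7045 to 0.7496, bringing it just below the desired threshold of 0.75. The Decision Tree and Random Forest scores are unchanged, which is expected because tree-based models split on feature thresholds and are insensitive to feature scale. The Random Forest remains the best model, exceeding the threshold with an accuracy of 0.8009.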
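
The Decision Tree and Random Forest accuracies above are identical with and without scaling. A likely explanation is that trees split on per-feature thresholds, so a monotonic rescaling of a feature leaves the resulting partitions unchanged. The sketch below illustrates this on synthetic data from `make_classification` (not the Megaline data); all names are hypothetical, and the two trees are expected to agree on all or nearly all samples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Megaline data set)
X_demo, y_demo = make_classification(n_samples=600, n_features=4, random_state=42)
X_demo = X_demo * np.array([1.0, 10.0, 100.0, 1000.0])  # put columns on different scales
X_fit, X_new = X_demo[:400], X_demo[400:]
y_fit = y_demo[:400]

scaler_demo = StandardScaler().fit(X_fit)

# Fit one tree on raw features and one on standardized features
tree_raw = DecisionTreeClassifier(random_state=42).fit(X_fit, y_fit)
tree_scaled = DecisionTreeClassifier(random_state=42).fit(
    scaler_demo.transform(X_fit), y_fit
)

# Share of held-out samples on which the two trees make the same prediction
agreement = np.mean(
    tree_raw.predict(X_new) == tree_scaled.predict(scaler_demo.transform(X_new))
)
print(f"Prediction agreement: {agreement:.3f}")
```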


**Hyperparameter Tuning:**

I will therefore focus on the Random Forest model and try to improve its accuracy further by tuning its hyperparameters.

For hyperparameter tuning, I will use RandomizedSearchCV, which samples a fixed number of combinations from the specified hyperparameter values and evaluates each one with cross-validation.

**Random Forest:**

- n_estimators: Number of trees in the forest.
- max_depth: Maximum depth of the tree.
- min_samples_split: Minimum number of samples required to split an internal node.
- min_samples_leaf: Minimum number of samples required to be in a leaf node.

In [7]:
# Define the hyperparameter grid for the Random Forest
param_distributions_forest = {
    'n_estimators': [10, 50, 100, 150],
    'max_depth': [None, 5, 10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Use RandomizedSearchCV to find the best hyperparameters
random_search_forest = RandomizedSearchCV(RandomForestClassifier(random_state=42), 
                                          param_distributions=param_distributions_forest, 
                                          n_iter=10, 
                                          cv=5, 
                                          verbose=1, 
                                          n_jobs=-1,
                                          random_state=42)

# Fit the search on the training set
random_search_forest.fit(X_train_normalized, y_train)

# Best hyperparameters found
best_params_random_forest = random_search_forest.best_params_
best_params_random_forest

Fitting 5 folds for each of 10 candidates, totalling 50 fits


{'n_estimators': 100,
 'min_samples_split': 5,
 'min_samples_leaf': 1,
 'max_depth': 10}
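
To put the size of the search in perspective, the sketch below counts how many combinations the same grid contains, using scikit-learn's `ParameterGrid`. The variable names are hypothetical; the grid values, `cv=5` and `n_iter=10` are copied from the cell above.

```python
from sklearn.model_selection import ParameterGrid

# Same hyperparameter values as in the search above
param_grid_demo = {
    "n_estimators": [10, 50, 100, 150],
    "max_depth": [None, 5, 10, 15, 20],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Number of distinct combinations an exhaustive search would have to evaluate
n_combinations = len(ParameterGrid(param_grid_demo))
cv_folds, n_iter = 5, 10

print("Exhaustive grid:", n_combinations, "candidates,", n_combinations * cv_folds, "fits")
print("Randomized search:", n_iter, "candidates,", n_iter * cv_folds, "fits")
```

The grid holds 4 × 5 × 3 × 3 = 180 combinations, so the randomized search with `n_iter=10` evaluates only a small sample of them. The reported best parameters are the best among the sampled candidates, not necessarily the best in the whole grid.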

**Analysis of the Hyperparameter Optimization Process:**

n_estimators (100 trees): The model consists of 100 individual decision trees. Increasing this number generally improves the model, but can also increase computing time.

min_samples_split (5 samples): This value ensures that an internal node is only split if it contains at least 5 samples, avoiding splits that could result in overfitting.

min_samples_leaf (1 sample): This is the default and least restrictive value, since it allows a leaf to hold a single sample. The search did not favor larger leaves, so any regularization here comes from min_samples_split and max_depth instead.

max_depth (10 levels): Limiting the depth of the trees to 10 levels can prevent the model from capturing noise and overfitting to the training set.

Since the goal is an accuracy of at least 0.75, these hyperparameters matter, and others such as max_features and bootstrap could also be explored to improve performance on the validation set.

Next, I will take the following steps:

1. Train the Random Forest with 100 trees, the value selected by the search.

2. Evaluate the quality of this model on the validation set.

3. If the performance is satisfactory, proceed to the evaluation on the test set.


In [8]:
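# Train the Random Forest with 100 trees, a commonly accepted setting.
# Note: only n_estimators is set here; the other values found by the search
# (max_depth, min_samples_split, min_samples_leaf) are left at their defaults.
optimized_forest = RandomForestClassifier(n_estimators=100, random_state=42)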
optimized_forest.fit(X_train_normalized, y_train)

# Predicting on the validation set
y_pred_forest = optimized_forest.predict(X_valid_normalized)

# Calculating accuracy
accuracy_forest = accuracy_score(y_valid, y_pred_forest)

accuracy_forest

0.8009331259720062

Analysis:

Hyperparameter Optimization: I used RandomizedSearchCV, which evaluates a random sample of hyperparameter combinations. This is cheaper than an exhaustive search, at the cost of possibly missing the best combination.

Optimization Results: The best hyperparameters found for the Random Forest are typical values, such as 100 trees. The model trained above sets only n_estimators=100, which is also the scikit-learn default, so it is equivalent to the untuned Random Forest.

Model Accuracy: This model reaches an accuracy of 0.8009 on the validation set, the same as the default Random Forest and above the 0.75 threshold.
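
For reference, here is a minimal sketch of two common ways to carry all of the selected hyperparameters from a randomized search into the final model. It runs on synthetic data from `make_classification`, not on the Megaline data, and every name in it (`search_demo`, `tuned_model`, `rebuilt_model`) is hypothetical; it is not the approach used in the cell above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (not the Megaline data set)
X_demo, y_demo = make_classification(n_samples=400, n_features=4, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

search_demo = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions={
        "n_estimators": [10, 50],
        "max_depth": [None, 5, 10],
        "min_samples_leaf": [1, 2, 4],
    },
    n_iter=5,
    cv=3,
    random_state=42,
)
search_demo.fit(X_tr, y_tr)

# Option 1: reuse the estimator that the search refits on the full training set
tuned_model = search_demo.best_estimator_

# Option 2: rebuild the model explicitly from the selected hyperparameters
rebuilt_model = RandomForestClassifier(**search_demo.best_params_, random_state=42)
rebuilt_model.fit(X_tr, y_tr)

print(search_demo.best_params_)
print(tuned_model.score(X_va, y_va), rebuilt_model.score(X_va, y_va))
```

Either option keeps the final model consistent with what the search selected. Whether the tuned model would beat the default one on the validation set is something only a rerun on the real data can show.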

<a id="Check-models-against-test-suite"></a>
## Check models with the test suite

Now I will move on to using the model on the test set to see how it behaves with data that it has not used.

In [9]:
# Define the test set
X_test = data_test.drop('is_ultra', axis=1)
y_test = data_test['is_ultra']
# Transform test set features
X_test_normalized = scaler.transform(X_test)

# Make predictions on the test set
y_pred_test = optimized_forest.predict(X_test_normalized)

# Calculate the accuracy on the test set
accuracy_test = accuracy_score(y_test, y_pred_test)
print(accuracy_test)

0.7962674961119751


The result (0.7963) means that the model classifies 79.63% of the test set correctly. Since this value is higher than the 0.75 threshold established at the beginning of the project, the model is satisfactory according to the project criteria.

Interpretation:

The Random Forest model, when applied to the test set, correctly predicts the plan (Smart or Ultra) for approximately 79.63% of the users. Performance on the test set (0.7963) is close to performance on the validation set (0.8009), which indicates that the model generalizes well to previously unseen data.

This is a good indication that the model is reliable and can be used to recommend plans to users based on their behavior.

<a id="Sanity-Test"></a>
## Sanity check

In [10]:
# Calculating the proportion of users for each plan in the entire data set
plan_proportions = data['is_ultra'].value_counts(normalize=True)

# Proportion based prediction for sanity test
sanity_predictions = [1 if x < plan_proportions[1] else 0 for x in np.random.rand(len(y_test))]

# Calculate the accuracy of the sanity test
sanity_accuracy = accuracy_score(y_test, sanity_predictions)

display(plan_proportions)
display(sanity_accuracy)


is_ultra
0    0.693528
1    0.306472
Name: proportion, dtype: float64

0.5505443234836703

These results tell us the following:

1. **Proportion of users in the data set**:
     - Users with the Smart plan (labeled as 0): 69.35%
     - Users with the Ultra plan (labeled as 1): 30.65%

2. **Sanity Test Accuracy**: 55.05%

Now, I will proceed to analyze these results:

**Analysis**:
The proportion of users in the data set shows that approximately 69% of users choose the Smart plan, while 31% choose the Ultra plan. This means that if we simply assume that all users choose the Smart plan, our prediction would be correct 69% of the time.

However, our sanity test, which randomly assigns users to the Smart or Ultra plan according to these proportions, has an accuracy of 55.05% (the exact figure varies between runs because the random draw is not seeded). This accuracy is less than 69%, so simply assuming that all users choose the Smart plan would be a more accurate strategy.

**Conclusion**:
1. The data set is imbalanced towards the Smart plan, with almost 70% of users choosing it.
2. Although our sanity test uses random assignment based on plan proportions, simply assuming that all users choose the Smart plan would give better accuracy.
3. However, the accuracy of the trained model (about 80%) clearly exceeds both the strategy of assuming that everyone chooses the Smart plan (69%) and the accuracy of the sanity test (55.05%). This indicates that the model has learned meaningful patterns in the data set and makes more informed predictions than strategies based on assumptions or random assignment.
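
The two baselines discussed above can also be worked out directly from the class shares. A guess drawn with the class proportions is correct with probability p_smart² + p_ultra², while always predicting Smart is correct with probability p_smart. The sketch below computes both and cross-checks them with scikit-learn's `DummyClassifier` on synthetic labels that have the same class balance. The labels are not the real data, and all variable names are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class shares taken from the value_counts output above
p_smart, p_ultra = 0.693528, 0.306472

# Always predicting the majority class is right whenever the user is on Smart
majority_accuracy = p_smart

# Guessing with the class proportions is right with probability p_smart^2 + p_ultra^2
expected_random_accuracy = p_smart**2 + p_ultra**2

print(round(majority_accuracy, 4), round(expected_random_accuracy, 4))

# Cross-check on synthetic labels with the same class balance (not the real data)
rng = np.random.default_rng(0)
y_demo = (rng.random(100_000) < p_ultra).astype(int)
X_dummy = np.zeros((len(y_demo), 1))  # DummyClassifier ignores the features

for strategy in ("most_frequent", "stratified"):
    dummy = DummyClassifier(strategy=strategy, random_state=0).fit(X_dummy, y_demo)
    print(strategy, round(dummy.score(X_dummy, y_demo), 3))
```

The expected accuracy of proportional guessing is about 0.575, so a single run on the 643 test rows landing a few points above or below that value is ordinary sampling noise.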

<a id="Conclusions"></a>

## Conclusions

**General Conclusions**:

1. **Data Distribution**: The data set provided by Megaline is imbalanced towards the Smart plan, with approximately 70% of users choosing it, while 30% opt for the Ultra plan.

2. **Model Performance**: The Random Forest model proved to be the most effective, with an accuracy of about 0.80 on the validation set and 0.796 on the test set. This exceeds the threshold set by Megaline of 0.75, indicating that the model is suitable for the classification task at hand.

3. **Sanity Test**: A sanity test based on the proportion of users on each plan reached an accuracy of approximately 55.05%. This figure is well below the accuracy of the trained model, which supports the effectiveness of the developed model.

4. **Comparison with Simple Strategies**: Although the model performs well, it is essential to keep in mind that a simple strategy of assuming that all users opt for the Smart plan could achieve an accuracy of around 69%. However, the trained model still outperforms this simple strategy, demonstrating its ability to capture more complex patterns in the data.

5. **Relevance to Megaline**: The model's ability to accurately predict the plan a user might choose based on their behavior is valuable to Megaline. It allows the company to direct marketing strategies and personalized offers to users, potentially increasing adoption of its most premium plans (Ultra) and improving customer satisfaction.


Overall, this project has demonstrated the ability of machine learning models to address real-world classification problems and provide valuable solutions to businesses.