# Megaline Project

## Introduction

Megaline, a leading mobile company, faces the challenge of a high rate of customers remaining on legacy plans, hampering its growth and revenue potential. To address this problem, Megaline seeks to develop a machine learning model that can analyze subscribers' behavior and recommend the ideal plan among their current options: Smart or Ultra.

## Objective
The main objective of this project is to develop a classification model that predicts the optimal mobile plan (Smart or Ultra) for each customer from their historical behavior. The model must reach an accuracy of at least 75% to be considered successful.

## Import libraries and data

In [24]:
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

In [25]:
#Load the data file
df = pd.read_csv('users_behavior.csv')

In [26]:
#Print the general/summarized information about the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [27]:
#generate descriptive statistics for the numerical columns in the DataFrame 
df.describe()


| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


In [28]:
#print the first 5 rows of the Dataframe
df.head(5)

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


## Data segmentation

Segment the source data into a training set, a validation set, and a test set.

In [29]:
#Define the features and the target
features = df.drop(['is_ultra'], axis=1)
target = df['is_ultra']

In [30]:
# Split the data into training and validation/test sets
features_train, features_valid_test, target_train, target_valid_test = train_test_split(
    features, 
    target, 
    test_size=0.25, 
    random_state=12345
)

In [31]:
# Split the held-out data into validation and test sets
features_valid, features_test, target_valid, target_test = train_test_split(
    features_valid_test,
    target_valid_test,
    test_size=0.25,
    random_state=12345
    )
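
The two-stage split above applies `test_size=0.25` twice, so the validation and test sets are not the same size. The sketch below works out the resulting row counts with plain arithmetic. It assumes the held-out size is rounded up, which reproduces the denominators behind the accuracies printed later in this notebook; `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (kept, held_out) row counts, assuming the held-out part is rounded up."""
    n_held_out = math.ceil(n_rows * test_size)
    return n_rows - n_held_out, n_held_out


total_rows = 3214  # from df.info()

# First split: training set vs. combined validation/test holdout
n_train, n_holdout = split_sizes(total_rows, 0.25)

# Second split: the holdout is divided into validation and test sets
n_valid, n_test = split_sizes(n_holdout, 0.25)

for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(f"{name}: {n} rows ({n / total_rows:.2%} of the data)")
```

Under that assumption the split is roughly 75% / 19% / 6%, so the test set is small and its accuracy estimate is correspondingly noisy.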

## Investigate the quality of different models by changing the hyperparameters

In [32]:
# Initialize variables to track best model performance
best_accuracy = 0
best_depth = 0
train_acc = 0

# Iterate through a range of tree depths (hyperparameter tuning)
for depth in range(1, 50):
    # Create a decision tree model with current depth
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)

    # Train the model on the training data
    model.fit(features_train, target_train)

    # Make predictions on the training data
    train_prediction = model.predict(features_train)

    # Make predictions on the testing data
    test_prediction = model.predict(features_test)

    # Calculate accuracy on the training data
    train_accuracy = accuracy_score(target_train, train_prediction)

    # Calculate accuracy on the testing data
    test_accuracy = accuracy_score(target_test, test_prediction)

    # Update best model parameters if current test accuracy is higher
    if test_accuracy > best_accuracy:
        best_accuracy = test_accuracy
        best_depth = depth
        train_acc = train_accuracy

# Print the results of the hyperparameter tuning
print("Best depth:", best_depth)
print("Best accuracy on test set:", best_accuracy)
print('Training set accuracy:', train_acc)

Best depth: 8
Best accuracy on test set: 0.8009950248756219
Training set accuracy: 0.8506224066390041


Across tree depths from 1 to 49, the decision tree reached its best test-set accuracy, 0.801, at a depth of 8 (training accuracy 0.851). Because the depth was selected on the test set rather than the validation set, this figure is likely an optimistic estimate of performance on unseen data.
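
A common alternative is to pick the depth on the validation set and touch the test set only once, at the end. The following is a minimal sketch of that pattern, not the notebook's own procedure. It runs on synthetic stand-in data from `make_classification` (an assumption, since `users_behavior.csv` is not bundled here), and the names `X`, `y`, and `final_tree` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 4 numeric features, binary target
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    random_state=12345,
)

# Same two-stage split as in the notebook
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.25, random_state=12345
)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_hold, y_hold, test_size=0.25, random_state=12345
)

best_depth, best_valid_acc = None, 0.0
for depth in range(1, 11):
    tree = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    tree.fit(X_train, y_train)
    valid_acc = tree.score(X_valid, y_valid)  # select on validation only
    if valid_acc > best_valid_acc:
        best_depth, best_valid_acc = depth, valid_acc

# Refit with the chosen depth and score the test set exactly once
final_tree = DecisionTreeClassifier(random_state=12345, max_depth=best_depth)
final_tree.fit(X_train, y_train)
test_acc = final_tree.score(X_test, y_test)

print("Best depth (validation):", best_depth)
print("Validation accuracy:", best_valid_acc)
print("Test accuracy:", test_acc)
```

Keeping the test set out of the selection loop means the final test score is not biased by the search over depths.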

In [33]:
# Initialize variables to track best score and best estimator
best_score = 0
best_est = 0

# Loop through a range of estimators (number of trees in Random Forest)
for est in range(1, 50):
    # Create a Random Forest Classifier model with the current number of estimators
    model = RandomForestClassifier(random_state=12345, n_estimators=est)

    # Train the model on the training features and target variables
    model.fit(features_train, target_train)

    # Evaluate the model's accuracy on the validation set
    score = model.score(features_valid, target_valid)

    # Check if the current score is better than the previous best score
    if score > best_score:
        # Update best score and best estimator if a better model is found
        best_score = score
        best_est = est

# Print the results of the hyperparameter tuning
print('The best model\'s accuracy on the validation set is', best_score, 'with', best_est, 'estimators.')

The best model's accuracy on the validation set is 0.802653399668325 with 4 estimators.


Across 1 to 49 estimators, the random forest reached its best validation-set accuracy, 0.803, with 4 estimators.

The random forest's best score (0.8027 on the validation set) is about 0.0017 higher than the decision tree's best score (0.8010 on the test set). The two scores were measured on different subsets, so this small gap does not by itself show that one model is better than the other.
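
The tuning loop above records the best score and estimator count, but after the loop the variable `model` still points to the last forest that was fitted. One way to avoid scoring the wrong model later is to keep a reference to the best fitted forest as well. A minimal sketch on synthetic stand-in data (an assumption; `best_model` and `forest` are illustrative names):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 4 numeric features, binary target
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1,
    random_state=12345,
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

best_model, best_est, best_score = None, 0, 0.0
for est in range(1, 11):
    forest = RandomForestClassifier(random_state=12345, n_estimators=est)
    forest.fit(X_train, y_train)
    score = forest.score(X_valid, y_valid)
    if score > best_score:
        # Keep the fitted object itself, not just its score
        best_model, best_est, best_score = forest, est, score

print("Best n_estimators:", best_est)
print("Validation accuracy:", best_score)
```

Any later evaluation can then call `best_model.score(...)` and be sure it refers to the forest that won the search.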

In [34]:
# Find the mode (most frequent value) of the 'is_ultra' column
df_mode = df['is_ultra'].mode()

# Print the most frequent value in the 'is_ultra' column
print("'is_ultra' column mode:", df_mode)

'is_ultra' column mode: 0    0
Name: is_ultra, dtype: int64


## Model sanity check

In [35]:
# Create a DummyClassifier that always predicts the most frequent class
dummy_clf = DummyClassifier(strategy='most_frequent')
# Fit the DummyClassifier on the training data
dummy_clf.fit(features_train, target_train)

In [36]:
# Evaluate the DummyClassifier on the test data
dummy_accuracy = dummy_clf.score(features_test, target_test)
print("DummyClassifier accuracy:", dummy_accuracy)

DummyClassifier accuracy: 0.7014925373134329


In [37]:
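# Score the model on the test data for comparison with the DummyClassifier.
# Note: `model` is the last forest fitted in the tuning loop (n_estimators=49),
# not the 4-estimator forest that scored best on the validation set.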
model_accuracy = model.score(features_test, target_test)
print("Accuracy of my model:", model_accuracy)

Accuracy of my model: 0.7761194029850746


The model's test-set accuracy (0.776) is higher than the DummyClassifier baseline (0.701). The model therefore does better than always predicting the most frequent class, so it passes the sanity check.
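
The baseline figure can be checked by hand: a most-frequent-class predictor scores exactly the share of the majority class in the evaluation set. The sketch below uses label counts back-calculated from the printed accuracies (141 Smart and 60 Ultra rows in a 201-row test set); these counts are an inference, not values read from the data file.

```python
from collections import Counter

# Inferred label counts (assumption): 141 Smart (0) and 60 Ultra (1) test rows
test_labels = [0] * 141 + [1] * 60

# A most-frequent-class predictor is right whenever the true label is the majority class
majority_class, majority_count = Counter(test_labels).most_common(1)[0]
baseline_accuracy = majority_count / len(test_labels)

model_accuracy = 0.7761194029850746  # value printed by the notebook
beats_baseline = model_accuracy > baseline_accuracy

print("Majority class:", majority_class)
print("Baseline accuracy:", baseline_accuracy)
print("Model beats baseline:", beats_baseline)
```

A model passes this check only when its accuracy exceeds the baseline; matching or falling below it would mean the features add nothing over guessing the majority class.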

## Conclusion

A machine learning model was developed to recommend the Smart or Ultra plan to subscribers.
Two classifiers were evaluated, a decision tree and a random forest, because the target is binary. The random forest reached its best validation-set accuracy, 80.26%, with 4 estimators. The forest scored on the test set (49 estimators) reached 77.61%, which is above both the 75% threshold and the 70.15% most-frequent-class baseline.
The model gives Megaline a tool for recommending plans, which could help improve customer satisfaction and revenue.