# 1. Packages

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 2. Dataset validation

In [3]:
df = pd.read_csv (r'C:\Users\juand\OneDrive\Escritorio\TripleTen\Project 9\Megaline-EDA-Predictive-Modeling\Dataset\users_behavior.csv')
#df = pd.read_csv (r'C:\Users\valen\OneDrive\Escritorio\Juano_VS\Megaline-EDA-Predictive-Modeling\Dataset\users_behavior.csv')
print (df.info())
print ()
print (df.head())
print ()
print ('Number of duplicated records:', df.duplicated().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB
None

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0

Number of duplicated records: 0


# 3. Data segmentation

In [4]:
X = df.drop ('is_ultra', axis=1)
y = df['is_ultra']
# Note: test_size=0.75 holds out 75% of the rows (2,411) and trains on only 25% (803)
x_train, x_test, y_train, y_test = train_test_split (X, y, test_size = 0.75, random_state=12345)
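
The split above produces a single hold-out set, which the later cells use both to pick hyperparameters and to report the final score. A common alternative is a three-way split into training, validation, and test sets. The following is a minimal sketch on synthetic data, not the notebook's actual procedure: the `demo` DataFrame, its random values, the variable names, and the 60/20/20 ratio are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for users_behavior.csv (hypothetical values, same column names)
rng = np.random.default_rng(12345)
n_rows = 1000
demo = pd.DataFrame({
    'calls': rng.integers(0, 200, n_rows).astype(float),
    'minutes': rng.uniform(0, 1500, n_rows),
    'messages': rng.integers(0, 150, n_rows).astype(float),
    'mb_used': rng.uniform(0, 40000, n_rows),
    'is_ultra': rng.integers(0, 2, n_rows),
})

features = demo.drop('is_ultra', axis=1)
target = demo['is_ultra']

# Step 1: hold out 20% of the rows as the final test set
features_rest, features_test, target_rest, target_test = train_test_split(
    features, target, test_size=0.20, random_state=12345, stratify=target)

# Step 2: split the remaining 80% into train and validation (0.25 * 0.80 = 0.20)
features_train, features_valid, target_train, target_valid = train_test_split(
    features_rest, target_rest, test_size=0.25, random_state=12345,
    stratify=target_rest)

print(len(features_train), len(features_valid), len(features_test))
# prints: 600 200 200
```

With this layout, hyperparameters would be compared on the validation rows and the test rows would be scored only once, at the very end.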

# 4. Model selection

In [5]:
# Decision Tree
best_score = 0
best_depth = 0
for i in range (1, 21): 
    tree = DecisionTreeClassifier (random_state=12345, max_depth=i)
    tree.fit (x_train, y_train)
    score = tree.score (x_test, y_test)
    if score > best_score: 
        best_score = score
        best_depth = i
print ('Best model in the validation set has a depth of:', best_depth, 'with a score of:', best_score)

Best model in the validation set has a depth of: 8 with a score of: 0.7942762339278308


In [6]:
# Random Forest
best_score = 0
best_estimator = 0
for est in range (1, 101): 
    forest = RandomForestClassifier (random_state=12345, n_estimators=est)
    forest.fit (x_train, y_train)
    score = forest.score (x_test, y_test)
    if score > best_score: 
        best_score = score
        best_estimator = est
print ('The accuracy for the best model in the validation set (n_estimators={}): {}'.format (best_estimator, best_score))

The accuracy for the best model in the validation set (n_estimators=54): 0.7946909995852344


In [7]:
# Logistic Regression 
logi = LogisticRegression (random_state = 12345, solver='liblinear')
logi.fit (x_train, y_train)
score_train = logi.score (x_train, y_train)
score_test = logi.score (x_test, y_test)

print ('The accuracy for the logistic regression model in the training set was: ', score_train)
print ('The accuracy for the logistic regression model in the testing set was: ', score_test)

The accuracy for the logistic regression model in the training set was:  0.7409713574097135
The accuracy for the logistic regression model in the testing set was:  0.7465781833264206


## 4.1. Findings from the Model Quality Investigation

During the model evaluation phase, three classification algorithms were tested: Decision Tree, Logistic Regression, and Random Forest. For the two tree-based models, one hyperparameter was tuned to identify the configuration with the highest accuracy; Logistic Regression was fitted with a single configuration. All scores below were computed on the hold-out split (x_test, y_test).

Decision Tree: Tested with max_depth values from 1 to 20. The model reached a maximum accuracy of ~0.7943 at max_depth = 8, which met the 0.75 threshold and outperformed Logistic Regression, but fell marginally short of the Random Forest.

Logistic Regression: Tested using the liblinear solver. However, the model did not meet the 0.75 threshold, achieving an accuracy of ~0.7466 on the test set.

Random Forest: Evaluated with n_estimators ranging from 1 to 100. The best configuration was achieved with 54 estimators, yielding an accuracy of ~0.7947 and surpassing the project threshold by 0.0447 points.

Conclusion

Among all tested models, the RandomForestClassifier achieved the highest accuracy (~0.7947) for predicting Megaline customer plan categories, although its margin over the best Decision Tree (~0.7943) is only about 0.0004. Both tree-based models clear the 0.75 threshold, while Logistic Regression falls just short of it. The search reported only the best configuration (n_estimators = 54), so it does not show how accuracy changed with the number of trees.
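
The selection procedure described in this section can also be run against a dedicated validation set, so that the test set is scored only once. The sketch below is one possible way to do that on synthetic data; the data from `make_classification`, the coarse candidate grid, and every variable name are assumptions for illustration and do not reproduce the notebook's numbers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data (hypothetical stand-in for the Megaline features)
X_demo, y_demo = make_classification(n_samples=1000, n_features=4, n_informative=3,
                                     n_redundant=0, random_state=12345)

# 60% train / 20% validation / 20% test
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=12345)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=12345)

# Candidate models: a few tree depths and forest sizes (coarse, illustrative grid)
candidates = {}
for depth in range(1, 11):
    candidates[f'tree(max_depth={depth})'] = DecisionTreeClassifier(
        random_state=12345, max_depth=depth)
for n_trees in (10, 25, 50):
    candidates[f'forest(n_estimators={n_trees})'] = RandomForestClassifier(
        random_state=12345, n_estimators=n_trees)

# Fit on the training rows and compare candidates on the validation rows only
valid_scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    valid_scores[name] = model.score(X_va, y_va)
best_name = max(valid_scores, key=valid_scores.get)

# Score the selected model on the test rows exactly once
test_score = candidates[best_name].score(X_te, y_te)
print(best_name, round(valid_scores[best_name], 4), round(test_score, 4))
```

Because the test rows play no part in choosing `best_name`, `test_score` is not inflated by the selection step.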

## 4.2. Final Model Evaluation on the Test Set
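After selecting the RandomForestClassifier with 54 estimators as the best-performing model during the hyperparameter tuning phase, the final step is to refit it and score it on the hold-out set (x_test, y_test). Because this notebook used the same hold-out set to choose n_estimators, the score below reproduces the accuracy from the tuning phase and is an optimistic estimate rather than an unbiased one. An unbiased estimate of the model's ability to generalize to new customers would require a test set that played no part in model selection.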

In [8]:
best_model = RandomForestClassifier (random_state=12345, n_estimators=54)
best_model.fit (x_train, y_train)
predictions = best_model.predict (x_test)
score = accuracy_score (y_test, predictions)

print(f"The final accuracy score of the best model (Random Forest) on the test set is: {score:.4f}")

The final accuracy score of the best model (Random Forest) on the test set is: 0.7947


Final Result

The selected RandomForestClassifier achieved a final accuracy of ≈0.7947 on the hold-out set, which meets the project threshold of 0.75. Since the same hold-out set was also used to tune n_estimators, this figure is likely to be somewhat optimistic as an estimate of performance on new customers. Within that limitation, the result supports the model as a workable solution for predicting Megaline customer plan categories.

# 5. Sanity check
A sanity check was performed to verify that the RandomForestClassifier is actually learning meaningful patterns rather than predicting at random.

First, the class distribution of the target variable was inspected:

## 5.1. Verify classes balance

In [9]:
y_train.value_counts (normalize=True)

is_ultra
0    0.665006
1    0.334994
Name: proportion, dtype: float64

## 5.2. Conclusion
The class distribution shows that the most frequent class (is_ultra = 0) represents approximately 66.5% of the training data.

Therefore:

A random classifier would be expected to achieve around 50% accuracy.

A trivial classifier that always predicts the majority class would achieve approximately 66.5% accuracy, assuming the evaluation data has the same class balance.
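
The two baseline figures follow from short arithmetic on the class proportions. The sketch below works through it with the majority share from the `value_counts` output; the variable names are illustrative, and the third "stratified" baseline is an extra comparison point that the original analysis does not use.

```python
# Majority-class share taken from the value_counts output above
p_majority = 0.665006
p_minority = 1 - p_majority

# Uniform random guessing: each label is guessed with probability 0.5, so the
# guess matches the true label half of the time regardless of class balance
acc_uniform = 0.5 * p_majority + 0.5 * p_minority

# Always predicting the majority class is correct whenever the true label
# is the majority class
acc_majority = p_majority

# Guessing in proportion to the class frequencies (optional extra baseline)
acc_stratified = p_majority ** 2 + p_minority ** 2

print(f'uniform: {acc_uniform:.3f}, majority: {acc_majority:.3f}, '
      f'stratified: {acc_stratified:.3f}')
# prints: uniform: 0.500, majority: 0.665, stratified: 0.554
```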

For comparison:

| Model | Accuracy |
| --- | --- |
| Random guessing (expected) | ~0.50 |
| Majority-class predictor (expected) | ~0.665 |
| Final Random Forest model (observed) | ~0.7947 |

These results confirm that the model outperforms both a random baseline and a trivial majority-class classifier, exceeding the majority-class baseline by about 0.13. Therefore, it has learned meaningful patterns from the data and successfully passes the sanity check.
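
The baselines can also be measured rather than estimated, using scikit-learn's `DummyClassifier`. The following is a minimal sketch under stated assumptions: the label array is a hypothetical one built to have roughly the 66.5% / 33.5% balance reported above, and the all-zero feature matrix is a placeholder because `DummyClassifier` ignores its features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels with roughly the 66.5% / 33.5% class balance reported above
y_demo = np.array([0] * 665 + [1] * 335)
X_demo = np.zeros((len(y_demo), 1))  # placeholder; features are ignored

# Baseline 1: always predict the most frequent class
majority = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)

# Baseline 2: predict each class uniformly at random
uniform = DummyClassifier(strategy='uniform', random_state=12345).fit(X_demo, y_demo)

majority_acc = majority.score(X_demo, y_demo)
uniform_acc = uniform.score(X_demo, y_demo)
print(f'majority-class baseline: {majority_acc:.3f}')
print(f'uniform-random baseline: {uniform_acc:.3f}')
```

In the notebook itself, the same two estimators could be fitted on `x_train, y_train` and scored on `x_test, y_test` to obtain baselines that are directly comparable to the Random Forest score.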