# Mobile Plan Recommendation: Predicting Smart vs. Ultra Plans

# Description

This project focuses on building a classification model to predict the best plan for each customer based on their usage data, including:

- Number of calls
- Total call duration (minutes)
- Number of text messages
- Internet data usage (MB)

# Project Goals
- Develop a machine learning model with an accuracy of at least 0.75 on test data.
- Evaluate multiple classification models and optimize hyperparameters.
- Conduct a sanity check to ensure the model’s performance is meaningful compared to simple baselines.

In [1]:
import pandas as pd

## Download data and prepare it for analysis

In [2]:
df = pd.read_csv('/datasets/users_behavior.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


No missing values are found: every column has 3214 non-null values, which matches the 3214 entries in the index.
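The same check can be made explicit by counting missing values per column. The snippet below is a minimal sketch on a made-up toy frame (the `toy` name and its values are hypothetical, only the column names mirror the dataset); it shows what `isna().sum()` reports when a gap is present.

```python
import pandas as pd

# Hypothetical toy frame: column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "calls": [40.0, 85.0, None],       # one deliberately missing value
    "minutes": [311.9, 516.75, 467.66],
})

# Count missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

On the real data, `df.isna().sum()` should report zero for every column if the reading of `df.info()` above is correct.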

In [3]:
df.duplicated().sum()

0

No duplicates found

In [4]:
df.head()

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0


### Conclusion:
The dataset has been successfully loaded, and a preliminary analysis shows that all columns have valid data with no missing values or duplicates.

## Split the source data into a training set, a validation set, and a test set

In [5]:
from sklearn.model_selection import train_test_split

# split into two sets 60% and 40%
df_train, df_test_valid = train_test_split(
    df, 
    test_size=0.4,
    random_state=42
)

In [6]:
# split 40% set in half (20% and 20%)
df_valid, df_test = train_test_split(
    df_test_valid, 
    test_size=0.5,
    random_state=42
)

Thus we get:
- training set (df_train): 60%
- validation set (df_valid): 20%
- test set (df_test): 20%
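The two-step split can be checked on a placeholder frame with the same number of rows as the dataset. This is a sketch only: the `demo` frame and its single column `x` are hypothetical stand-ins, and the point is the resulting set sizes, not the data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame with 3214 rows, the same row count as the dataset; values are synthetic.
demo = pd.DataFrame({"x": np.arange(3214)})

# Step 1: hold out 40% of the rows.
demo_train, demo_rest = train_test_split(demo, test_size=0.4, random_state=42)
# Step 2: split the held-out 40% in half.
demo_valid, demo_test = train_test_split(demo_rest, test_size=0.5, random_state=42)

split_sizes = {
    "train": len(demo_train),
    "valid": len(demo_valid),
    "test": len(demo_test),
}
# Share of the full frame that ends up in each set.
split_shares = {name: round(size / len(demo), 3) for name, size in split_sizes.items()}
print(split_sizes)
print(split_shares)
```

The sizes should come out as 1928, 643, and 643 rows, i.e. roughly 60%, 20%, and 20% of the data.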

In [7]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1928 entries, 2369 to 3174
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     1928 non-null   float64
 1   minutes   1928 non-null   float64
 2   messages  1928 non-null   float64
 3   mb_used   1928 non-null   float64
 4   is_ultra  1928 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 90.4 KB


In [8]:
df_valid.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 643 entries, 1198 to 1510
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     643 non-null    float64
 1   minutes   643 non-null    float64
 2   messages  643 non-null    float64
 3   mb_used   643 non-null    float64
 4   is_ultra  643 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 30.1 KB


In [9]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 643 entries, 1545 to 283
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     643 non-null    float64
 1   minutes   643 non-null    float64
 2   messages  643 non-null    float64
 3   mb_used   643 non-null    float64
 4   is_ultra  643 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 30.1 KB


### Conclusion:
The dataset has been split into training (60%), validation (20%), and test (20%) sets, so that hyperparameters can be tuned on the validation set and the final model can be evaluated on data it has not seen.

## Investigate the quality of different models by changing hyperparameters

### Separate features and target in each set

In [10]:
# training set
features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']

# validation set
features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']

# test set
features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']


### Decision Tree (change max depth)

In [11]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# try depths from 1 to 5 and print the validation accuracy for each
for depth in range(1, 6):
    dtc_model = DecisionTreeClassifier(random_state=42, max_depth=depth)
    dtc_model.fit(features_train, target_train)
    predictions = dtc_model.predict(features_valid)
    print('max_depth =', depth, ':', accuracy_score(target_valid, predictions))

max_depth = 1 : 0.7309486780715396
max_depth = 2 : 0.7822706065318819
max_depth = 3 : 0.7916018662519441
max_depth = 4 : 0.7807153965785381
max_depth = 5 : 0.7729393468118196


#### Conclusion: max_depth = 3 shows the best validation accuracy (0.7916)
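Instead of reading the best depth off the printed list, the scores can be collected in a dictionary and the maximum picked programmatically. The sketch below runs on synthetic data from `make_classification` (an assumption made only to keep the example self-contained); the names `X`, `y`, `scores`, `best_depth`, and `best_score` are illustrative and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and target (assumption for illustration).
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=42
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Validation accuracy for each candidate depth.
scores = {}
for depth in range(1, 6):
    model = DecisionTreeClassifier(random_state=42, max_depth=depth)
    model.fit(X_train, y_train)
    scores[depth] = accuracy_score(y_valid, model.predict(X_valid))

# On ties, max() keeps the first (shallowest) depth it encounters.
best_depth = max(scores, key=scores.get)
best_score = scores[best_depth]
print("best max_depth =", best_depth, "with validation accuracy", best_score)
```

Preferring the shallowest tree among ties is a design choice here, not something the notebook prescribes.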

### Random Forest (change number of estimators)

In [12]:
from sklearn.ensemble import RandomForestClassifier

best_score = 0
best_est = 0

# choose hyperparameter range (number of trees)
for rfc_est in range(1, 11): 
    rfc_model = RandomForestClassifier(random_state=42, n_estimators=rfc_est) 
    rfc_model.fit(features_train, target_train) 
    rfc_score = rfc_model.score(features_valid, target_valid)
    if rfc_score > best_score:
        # save the best accuracy score on validation set
        best_score = rfc_score 
        # save number of estimators corresponding to the best accuracy score
        best_est = rfc_est 

print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))

# calculate the final random forest model
final_rfc_model = RandomForestClassifier(random_state=42, n_estimators=best_est) 
final_rfc_model.fit(features_train, target_train)

Accuracy of the best model on the validation set (n_estimators = 4): 0.7916018662519441


RandomForestClassifier(n_estimators=4, random_state=42)

#### Conclusion: n_estimators = 4 shows the best validation accuracy (0.7916), which is identical to that of the Decision Tree with max_depth = 3

### Logistic Regression

In [13]:
from sklearn.linear_model import LogisticRegression

# initialize logistic regression constructor with parameters random_state=42 and solver='liblinear'
lr_model = LogisticRegression(random_state=42, solver='liblinear')
lr_model.fit(features_train, target_train)  
# calculate accuracy score on training set
lr_score_train = lr_model.score(features_train, target_train) 
# calculate accuracy score on validation set
lr_score_valid = lr_model.score(features_valid, target_valid) 


print("Accuracy of the logistic regression model on the training set:", lr_score_train)
print("Accuracy of the logistic regression model on the validation set:", lr_score_valid)

Accuracy of the logistic regression model on the training set: 0.7136929460580913
Accuracy of the logistic regression model on the validation set: 0.7200622083981337


#### Conclusion: Logistic Regression reaches only 0.72 accuracy on the validation set, below the 0.75 target and below both tree-based models
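One variation that could be tried, though the notebook does not test it, is standardizing the features before fitting, since the columns live on very different scales (tens of calls versus tens of thousands of MB). The sketch below uses synthetic data and a `StandardScaler` pipeline; whether this would change the accuracy on the real dataset is an open question, and the names `scaled_lr` and `scaled_lr_score` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target (assumption for illustration).
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=42
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Scale features to zero mean and unit variance, then fit logistic regression.
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=42, solver="liblinear"),
)
scaled_lr.fit(X_train, y_train)

scaled_lr_score = scaled_lr.score(X_valid, y_valid)
print("Validation accuracy with scaling:", scaled_lr_score)
```

Wrapping the scaler in a pipeline means it is fitted on the training set only, so no information from the validation set leaks into the preprocessing.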

### Conclusion: 
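Three model types have been trained: Decision Tree and Random Forest with varying hyperparameters, and Logistic Regression with a single configuration. On the validation set, Random Forest (n_estimators = 4) and Decision Tree (max_depth = 3) tie at 0.7916, while Logistic Regression reaches 0.7201. __Random Forest__ is carried forward as the final model.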
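The comparison can be laid out by putting the validation accuracies printed above side by side. This is only a small sketch: the numbers are copied from the cell outputs, while the dictionary name and labels are illustrative.

```python
# Validation accuracies copied from the outputs of the cells above.
validation_scores = {
    "Decision Tree (max_depth=3)": 0.7916018662519441,
    "Random Forest (n_estimators=4)": 0.7916018662519441,
    "Logistic Regression": 0.7200622083981337,
}

top_score = max(validation_scores.values())
# Every model that reaches the top score, to make ties visible.
top_models = [name for name, score in validation_scores.items() if score == top_score]

print("Best validation accuracy:", top_score)
print("Models reaching it:", top_models)
```

Listing all models at the top score makes the tie between the two tree-based models explicit rather than hiding it behind a single "best" label.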

## Check the quality of the model using the test set

In [14]:
# get predictions using final random forest model on the test set
predictions_test = final_rfc_model.predict(features_test)
# calculate accuracy
accuracy_test = accuracy_score(target_test, predictions_test)

In [15]:
print("RFC testing score:", accuracy_test)

RFC testing score: 0.8009331259720062


### Conclusion:
The model's accuracy on the test set (0.8009) is slightly higher than on the validation set (0.7916) and exceeds the 0.75 target.

## Sanity check the model

To sanity check our model we can compare the resulting testing score to a baseline model

### Majority class classifier

In [16]:
is_ultra_qty = df.pivot_table(index='is_ultra', aggfunc='size')
is_ultra_qty

is_ultra
0    2229
1     985
dtype: int64

Since the classes are imbalanced (2229 '0' against 985 '1'), we can use a majority class classifier as the baseline, i.e. predict the most common class, which is '0' in our case, for every observation.

In [17]:
baseline_predictions = target_test.replace(1,0)

In [18]:
baseline_accuracy = accuracy_score(target_test, baseline_predictions)
print("Baseline accuracy:", baseline_accuracy)

Baseline accuracy: 0.6967340590979783


#### Conclusion: the testing score (0.8009) is about 10 percentage points higher than the baseline accuracy of the majority class classifier (0.6967)
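An alternative way to build the same kind of baseline is scikit-learn's `DummyClassifier` with `strategy="most_frequent"`. The sketch below is a swapped-in technique, not the notebook's method: it rebuilds a target vector from the class counts shown above (2229 and 985) and uses a placeholder feature matrix, so it reports the majority share over the whole dataset rather than the test-set figure.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Target rebuilt from the class counts above; the feature matrix is a placeholder
# because the majority-class strategy ignores the features entirely.
y = np.array([0] * 2229 + [1] * 985)
X = np.zeros((len(y), 1))

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X, y)

dummy_predictions = dummy.predict(X)
dummy_accuracy = accuracy_score(y, dummy_predictions)
print("Majority-class accuracy on the full dataset:", dummy_accuracy)
```

In the notebook's setting, the classifier would be fitted on `features_train` and `target_train` and scored on the test set, which keeps the baseline from seeing test labels.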

## Overall Conclusion

In this project, we developed a classification model to recommend the best mobile plan (Smart or Ultra) based on user behavior. Here are the key milestones:

- Data Exploration & Preprocessing

    - The dataset contained no missing values or duplicates.
    - Features such as calls, minutes, messages, and internet usage were used for prediction.
    - The data was properly split into training, validation, and test sets.


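- Model Training & Evaluation

    - Decision Tree, Random Forest, and Logistic Regression were tested.
    - Random Forest (n_estimators = 4) achieved the best validation accuracy (0.7916), tied with the Decision Tree at max_depth = 3.
    - Hyperparameter tuning improved model performance: for the Decision Tree, validation accuracy rose from 0.7309 at max_depth = 1 to 0.7916 at max_depth = 3.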


- Final Testing & Sanity Check

    - The final model's test accuracy (0.8009) exceeded 0.75, meeting the project's goal.
    - A sanity check was performed using a majority-class baseline, which achieved only 69.67% accuracy.

Since the final model significantly outperformed the baseline, we can confirm that it has learned meaningful patterns.


- General conclusion

The project successfully built a reliable mobile plan recommendation system. Megaline can use this model to optimize customer plans.