# User Behavior

We are building a machine learning model to identify which non-ultra users are likely to switch to the ultra plan, and why. The goal is to find users who are close to the switching threshold so that the ultra plan can be advertised to them.

In [72]:
#import all the necessary libraries
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [73]:
#Store the path to the dataset in a variable for easy access
user = '/datasets/users_behavior.csv'

In [74]:
#Read the csv file
user = pd.read_csv(user)

In [75]:
#Display the head of the dataset to make sure it was read
user.head()

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0


In [76]:
#Split the data into training (60%) and the remaining (40%) for the other sets
train_user, remain_user = train_test_split(user, test_size=0.4, random_state=42)

In [77]:
#Split the remaining 40% evenly into validation (20%) and test (20%) sets
val_user, test_user = train_test_split(remain_user, test_size=0.5, random_state=42)
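
To check that the two consecutive splits really produce a 60/20/20 partition, the proportions can be verified on a small stand-in table. This is a minimal sketch: the `demo` DataFrame and its 1,000 rows are hypothetical and only mimic the shape of the real data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real dataset: 1,000 rows, one feature, one target.
demo = pd.DataFrame({
    "feature": range(1000),
    "is_ultra": [int(i % 3 == 0) for i in range(1000)],
})

# First split: 60% for training, 40% held out.
train_demo, remain_demo = train_test_split(demo, test_size=0.4, random_state=42)
# Second split: halve the held-out 40% into validation and test.
val_demo, test_demo = train_test_split(remain_demo, test_size=0.5, random_state=42)

# Share of the full dataset that ends up in each part.
for name, part in [("train", train_demo), ("validation", val_demo), ("test", test_demo)]:
    print(f"{name}: {len(part)} rows ({len(part) / len(demo):.0%})")
```

`train_test_split` also accepts a `stratify` argument that keeps the class ratio the same in every part. The notebook does not use it, so the share of ultra users may differ slightly between the three sets.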

In [78]:
#Separate features and target variable
X_train = train_user.drop('is_ultra', axis=1)
y_train = train_user['is_ultra']
X_val = val_user.drop('is_ultra', axis=1)
y_val = val_user['is_ultra']
X_test = test_user.drop('is_ultra', axis=1)
y_test = test_user['is_ultra']

In [79]:
#Define different sets of hyperparameters to test
hyperparameters = [
    {'n_estimators': 100, 'max_depth': None, 'min_samples_split': 2},
    {'n_estimators': 200, 'max_depth': 10, 'min_samples_split': 5},
    {'n_estimators': 300, 'max_depth': 20, 'min_samples_split': 10}
]

best_model = None
best_accuracy = 0

In [80]:
#Train and evaluate models with different hyperparameters
for params in hyperparameters:
    model = RandomForestClassifier(
        n_estimators=params['n_estimators'],
        max_depth=params['max_depth'],
        min_samples_split=params['min_samples_split'],
        random_state=42
    )
    model.fit(X_train, y_train)
    y_val_pred = model.predict(X_val)
    accuracy = accuracy_score(y_val, y_val_pred)
    print(f"Hyperparameters: {params}, Validation Accuracy: {accuracy}")
    
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_model = model

Hyperparameters: {'n_estimators': 100, 'max_depth': None, 'min_samples_split': 2}, Validation Accuracy: 0.80248833592535
Hyperparameters: {'n_estimators': 200, 'max_depth': 10, 'min_samples_split': 5}, Validation Accuracy: 0.807153965785381
Hyperparameters: {'n_estimators': 300, 'max_depth': 20, 'min_samples_split': 10}, Validation Accuracy: 0.8102643856920684


On the validation set, Model 3 ({'n_estimators': 300, 'max_depth': 20, 'min_samples_split': 10}) has the highest accuracy (0.810), making it the best-performing of the three models. The margin over the other two configurations (0.802 and 0.807) is small, so the larger number of trees combined with a bounded depth and a moderate minimum sample split appears to help only slightly. Next, we evaluate this best model on the test set to confirm its performance.
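
To put the validation accuracies in perspective, they can be converted back into counts of correctly classified users. This is a rough sketch that assumes the validation set holds 643 rows, the same size as the test set shown in the classification report further down. That size is not printed anywhere in the notebook, so treat `n_val` as an assumption.

```python
# Validation accuracies copied from the output above.
val_accuracy = {
    "model_1": 0.80248833592535,
    "model_2": 0.807153965785381,
    "model_3": 0.8102643856920684,
}

# Assumption: the validation set has 643 rows, like the test set,
# because the 40% hold-out is split in half.
n_val = 643

# Number of correctly classified validation users per model.
correct = {name: round(acc * n_val) for name, acc in val_accuracy.items()}

best = max(val_accuracy, key=val_accuracy.get)
gap = correct[best] - min(correct.values())

print(correct)
print(f"Best model: {best}, ahead of the weakest model by {gap} users")
```

Under that assumption, the best and the weakest configuration differ by only a handful of users, which is why the ranking should be read as a mild preference rather than a clear win.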

In [81]:
#Evaluate the best model on the test set
y_test_pred = best_model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
print(f"Test Accuracy: {test_accuracy}")

Test Accuracy: 0.8149300155520995


### Check the quality of the model


In [82]:
#Re-create the model that scored best on validation accuracy
best_model_params = {'n_estimators': 300, 'max_depth': 20, 'min_samples_split': 10}
best_model = RandomForestClassifier(
    n_estimators=best_model_params['n_estimators'],
    max_depth=best_model_params['max_depth'],
    min_samples_split=best_model_params['min_samples_split'],
    random_state=42
)

In [83]:
#Train the best model on the training data
best_model.fit(X_train, y_train)

In [84]:
#Predict on the test set
y_test_pred = best_model.predict(X_test)

In [85]:
#Calculate performance metrics
test_accuracy = accuracy_score(y_test, y_test_pred)
classification_rep = classification_report(y_test, y_test_pred)
conf_matrix = confusion_matrix(y_test, y_test_pred)

In [86]:
print(f"Test Accuracy: {test_accuracy}")
print("Classification Report:\n", classification_rep)
print("Confusion Matrix:\n", conf_matrix)

Test Accuracy: 0.8149300155520995
Classification Report:
               precision    recall  f1-score   support

           0       0.83      0.92      0.87       448
           1       0.76      0.57      0.65       195

    accuracy                           0.81       643
   macro avg       0.79      0.75      0.76       643
weighted avg       0.81      0.81      0.81       643

Confusion Matrix:
 [[412  36]
 [ 83 112]]


From this we can see that the test accuracy is 0.815, so the model correctly classifies approximately 81.5% of the instances in the test set.

For the non-ultra users (class 0):
The precision is 0.83: of all users predicted as non-ultra, 83% were correctly classified.
The recall is 0.92: of all actual non-ultra users, 92% were correctly identified by the model.

For the ultra users (class 1):
The precision is 0.76: of all users predicted as ultra, 76% were correctly classified.
The recall is 0.57: of all actual ultra users, 57% were correctly identified by the model.

Overall conclusion
An accuracy of approximately 81.5% indicates that the model performs reasonably well on the test set.
The model performs better on class 0 (non-ultra users) than on class 1 (ultra users).
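
One way to judge whether 81.5% is "reasonably well" is to compare it with a trivial baseline that always predicts the most frequent class. The sketch below uses only the class supports from the classification report above. The baseline itself is an illustrative addition and is not part of the original analysis.

```python
# Class supports copied from the classification report above.
support = {0: 448, 1: 195}
test_accuracy = 0.8149300155520995

# A classifier that always predicts the most frequent class is right
# exactly as often as that class occurs in the test set.
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / sum(support.values())

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Model accuracy:    {test_accuracy:.3f}")
print(f"Improvement:       {test_accuracy - baseline_accuracy:.3f}")
```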

Looking at the confusion matrix, the correct and incorrect classifications break down as follows:
412 non-ultra users were correctly classified as non-ultra users.
36 non-ultra users were incorrectly classified as ultra users.
112 ultra users were correctly classified as ultra users.
83 ultra users were incorrectly classified as non-ultra users.
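
The precision and recall values in the classification report follow directly from these four counts. The sketch below recomputes them by hand, assuming scikit-learn's layout for `confusion_matrix`: rows are actual classes and columns are predicted classes. The variable names are illustrative.

```python
# Confusion matrix copied from the output above.
# Rows are actual classes, columns are predicted classes (0 = non-ultra, 1 = ultra).
(tn, fp), (fn, tp) = [[412, 36], [83, 112]]

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

# Class 1 (ultra users).
precision_ultra = tp / (tp + fp)   # of those predicted ultra, how many are ultra
recall_ultra = tp / (tp + fn)      # of actual ultra users, how many were found

# Class 0 (non-ultra users).
precision_non_ultra = tn / (tn + fn)
recall_non_ultra = tn / (tn + fp)

print(f"accuracy: {accuracy:.4f}")
print(f"ultra:     precision={precision_ultra:.2f}, recall={recall_ultra:.2f}")
print(f"non-ultra: precision={precision_non_ultra:.2f}, recall={recall_non_ultra:.2f}")
```

The low recall for ultra users comes from the 83 false negatives. For an advertising use case, these are ultra-like users the model would overlook.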