<h1 style='text-align: center; font-size: 50px;'>Which Plan Is the Best Fit?</h1>

# Introduction:

In this project, we work with data from the mobile carrier Megaline, which offers its clients different prepaid plans. Our mission is to develop a model that analyzes subscribers' behavior and recommends one of Megaline's newer plans: Smart or Ultra. The accuracy threshold for this project is 0.75. Meeting it will allow us to spot potential big winners and plan advertising campaigns. The dataset is stored in a single file (/datasets/users_behavior.csv). During model development, we will:

- Load and display the dataset in a standardized format.
- Split the source data into a training set, a validation set, and a test set.
- Investigate the quality of different models by changing hyperparameters.
- Check the quality of the model using the test set.
- Sanity check the model.

By following this process, we aim to produce a detailed report that provides actionable insights for business strategy.
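
The three-way split listed in the steps above is done with two successive calls to `train_test_split`. The sketch below shows why the second call needs a rescaled fraction: once 15% of the rows are set aside for testing, a validation set of the same size is 0.15 / 0.85 ≈ 0.1765 of what remains. The helper name `three_way_split` and the demo DataFrame are hypothetical and only illustrate the arithmetic.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(df, val_share=0.15, test_share=0.15, seed=42):
    """Split df into train/validation/test parts with two successive splits."""
    # First split: set the test rows aside from the full data.
    rest, test = train_test_split(df, test_size=test_share, random_state=seed)
    # Second split: rescale the validation share, because it is now taken
    # from the remaining (1 - test_share) of the rows.
    rescaled_val = val_share / (1 - test_share)
    train, val = train_test_split(rest, test_size=rescaled_val, random_state=seed)
    return train, val, test


# Synthetic stand-in for the real file (hypothetical columns and values).
demo = pd.DataFrame({"calls": range(1000), "is_ultra": [i % 2 for i in range(1000)]})
train, val, test = three_way_split(demo)
print(len(train), len(val), len(test))
```

Because scikit-learn rounds the split sizes, the validation and test sets can differ by a row or so from an exact 15% share.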

In [None]:
# Loading all the libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [None]:
# Load the dataset
data = pd.read_csv('/datasets/users_behavior.csv')
data.head()

In [None]:
# First split to separate out the test set
train_data, test_data = train_test_split(data, test_size=0.15, random_state=42)

# Second split to separate the training data into training and validation sets
train_data, validation_data = train_test_split(train_data, test_size=0.1765, random_state=42)

# Checking the sizes to ensure correct splitting
print(f"Training Set: {train_data.shape}")
print(f"Validation Set: {validation_data.shape}")
print(f"Test Set: {test_data.shape}")

In [None]:
# Splitting features and target
X_train = train_data.drop('is_ultra', axis=1)
y_train = train_data['is_ultra']

X_val = validation_data.drop('is_ultra', axis=1)
y_val = validation_data['is_ultra']

# Investigating a Decision Tree Classifier
for depth in range(1, 6):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    predictions = model.predict(X_val)
    accuracy = accuracy_score(y_val, predictions)
    print(f'Max Depth: {depth}, Validation Accuracy: {accuracy:.4f}')

The best decision tree depth in the tested range is 3, which gives the highest validation accuracy (0.7578). Accuracy declines at greater depths, which suggests the deeper trees are starting to overfit.
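
One way to support an overfitting claim like the one above is to print training accuracy next to validation accuracy for each depth: a training score that keeps rising while the validation score stalls or drops is the usual signature. The following is a minimal sketch on synthetic data from `make_classification` (an assumption, not the Megaline dataset); the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Megaline dataset).
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=42)

depth_scores = {}
for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, tree.predict(X_tr))
    val_acc = accuracy_score(y_va, tree.predict(X_va))
    depth_scores[depth] = (train_acc, val_acc)
    print(f"depth={depth:2d}  train={train_acc:.4f}  val={val_acc:.4f}")

# Pick the depth with the highest validation accuracy instead of reading it off by eye.
best_depth = max(depth_scores, key=lambda d: depth_scores[d][1])
print(f"Best depth by validation accuracy: {best_depth}")
```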

In [None]:
# Investigating a Random Forest Classifier
forest_results = []
for n_estimators in range(10, 60, 10):
    model = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    model.fit(X_train, y_train)
    y_prediction = model.predict(X_val)
    accuracy = accuracy_score(y_val, y_prediction)
    forest_results.append((n_estimators, accuracy))
for n_estimators, accuracy in forest_results:
    print(f'n_estimators: {n_estimators}, Accuracy: {accuracy:.4f}')

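The best number of trees in the tested range is 40, which gives the highest validation accuracy (0.7950). Accuracy dips slightly beyond this. Since adding trees to a random forest does not usually cause overfitting, the dip is more likely random variation on a small validation set than a real effect.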
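
A simple way to judge whether small differences between tree counts are meaningful is to repeat each fit with several random seeds and compare the spread. The sketch below assumes synthetic data and an arbitrary choice of three seeds; it is an illustration of the idea, not a rerun of the experiment above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the Megaline dataset).
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

seeds = [0, 1, 2]  # arbitrary seeds chosen for illustration
summary = {}
for n_estimators in range(10, 60, 10):
    scores = []
    for seed in seeds:
        forest = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
        forest.fit(X_tr, y_tr)
        scores.append(accuracy_score(y_va, forest.predict(X_va)))
    # Mean and standard deviation of validation accuracy across seeds.
    summary[n_estimators] = (float(np.mean(scores)), float(np.std(scores)))
    mean, std = summary[n_estimators]
    print(f"n_estimators={n_estimators}: mean={mean:.4f}, std={std:.4f}")
```

If the differences between mean scores are smaller than the seed-to-seed spread, the choice among these tree counts is probably not critical.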

In [None]:
# Splitting features and target for the test set:
X_test = test_data.drop('is_ultra', axis=1)
y_test = test_data['is_ultra']

# Using the best model (max_depth=3):
best_tree_model = DecisionTreeClassifier(max_depth=3, random_state=42)
best_tree_model.fit(X_train, y_train)

# Making predictions on the test set:
test_prediction = best_tree_model.predict(X_test)

# Evaluating model performance on the test set:
test_accuracy = accuracy_score(y_test, test_prediction)

print(f'Test Accuracy: {test_accuracy:.4f}')

The decision tree with max_depth=3 performs well, achieving a test accuracy of 0.7909, which is higher than its validation accuracy (0.7578). This suggests good generalization with no severe overfitting.
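
When a test score comes out above the validation score, it helps to check whether the gap is larger than ordinary sampling noise. A rough guide is the binomial standard error of an accuracy estimate, sqrt(p(1 - p) / n). The sketch below uses the two accuracies reported above; the evaluation set size `n_assumed = 500` is a hypothetical value, since the set sizes are not stated here.

```python
import math


def accuracy_std_error(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples rows."""
    return math.sqrt(accuracy * (1 - accuracy) / n_samples)


n_assumed = 500  # hypothetical evaluation set size, not taken from the data
val_acc, test_acc = 0.7578, 0.7909  # accuracies reported above

se = accuracy_std_error(val_acc, n_assumed)
gap = test_acc - val_acc
print(f"standard error ~ {se:.4f}, gap = {gap:.4f}, gap / se = {gap / se:.2f}")
```

A gap of only a couple of standard errors is consistent with the two sets simply containing different rows, rather than with a real difference in model quality.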

In [None]:
# Using the best model (n_estimators=40):

best_forest_model = RandomForestClassifier(n_estimators=40, random_state=42)
best_forest_model.fit(X_train, y_train)

# Making predictions on the test set:
y_test_pred_forest = best_forest_model.predict(X_test)

# Evaluating model performance on the test set:
forest_test_accuracy = accuracy_score(y_test, y_test_pred_forest)

# Display test set results
print(f'Test Accuracy: {forest_test_accuracy:.4f}')

The random forest with n_estimators=40 performs well, achieving a test accuracy of 0.8054, which is higher than its validation accuracy (0.7950). This suggests good generalization with no severe overfitting.
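
A common sanity check for a classifier like this one is to compare it with a trivial baseline that always predicts the most frequent class. If the model cannot beat that baseline, its accuracy says little. The sketch below uses scikit-learn's `DummyClassifier` on synthetic, imbalanced data; the data and the 70/30 class balance are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (an assumption; not the Megaline dataset).
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.7, 0.3], class_sep=2.0, random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15, random_state=42)

# Baseline: always predict the majority class seen during training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_tr, y_tr)
baseline_acc = accuracy_score(y_te, baseline.predict(X_te))

# Model under test, with the same settings as the forest above.
forest = RandomForestClassifier(n_estimators=40, random_state=42)
forest.fit(X_tr, y_tr)
forest_acc = accuracy_score(y_te, forest.predict(X_te))

print(f"Baseline accuracy: {baseline_acc:.4f}")
print(f"Forest accuracy:   {forest_acc:.4f}")
```

On the real data, the same comparison would use `X_train`, `y_train`, `X_test`, and `y_test` in place of the synthetic arrays.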

In [None]:
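# Sanity check the model: compare train and test accuracy for the random forest
train_prediction = best_forest_model.predict(X_train)
train_accuracy = accuracy_score(y_train, train_prediction)

print(f'Train Accuracy: {train_accuracy:.4f}')
# Report the forest's own test score (test_accuracy belongs to the decision tree)
print(f'Test Accuracy: {forest_test_accuracy:.4f}')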

# Conclusion:

Our machine learning model **(Random Forest)** predicts which plan a user should switch to with a test accuracy of 0.8054, exceeding the required threshold of 0.75. The model distinguishes between the Smart and Ultra plans based on customer usage patterns, making it a valuable asset for both customer service teams and marketing strategies.