# Machine Learning Project

## Introduction 

In this project we work with a dataset from the mobile carrier Megaline. We split the dataset into training, validation, and test sets, compare several models on them, and tune hyperparameters to find the model with the best accuracy score.
The dataset is grouped by subscriber and contains the following columns:
* calls (number of calls by user)
* minutes (total call duration in minutes)
* messages (number of text messages)
* mb_used (internet traffic used in MB)
* is_ultra (plan for the current month: Ultra = 1, Smart = 0)

In this project we train a model that picks the right plan (Smart or Ultra).
The required accuracy score is 75%.

We will train these models:
* DecisionTreeClassifier 
* RandomForestClassifier 
* LogisticRegression 

After all the tests, we compare the results and determine which model is best for this purpose.

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
import warnings

In [2]:
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=RuntimeWarning)

## Data preprocessing

In [3]:
data = pd.read_csv('/datasets/users_behavior.csv')

In [4]:
data

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.90 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| ... | ... | ... | ... | ... | ... |
| 3209 | 122.0 | 910.98 | 20.0 | 35124.90 | 1 |
| 3210 | 25.0 | 190.36 | 0.0 | 3275.61 | 0 |
| 3211 | 97.0 | 634.44 | 70.0 | 13974.06 | 0 |
| 3212 | 64.0 | 462.32 | 90.0 | 31239.78 | 0 |


The dataset is clean and well structured. Let's check whether there are missing values or duplicates.

In [5]:
data.isna().value_counts()

calls  minutes  messages  mb_used  is_ultra
False  False    False     False    False       3214
dtype: int64

In [6]:
data.duplicated().value_counts()

False    3214
dtype: int64

The dataset has no duplicates and no missing values.
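
The same two checks can be written more compactly by counting missing values per column and duplicated rows directly. The sketch below uses a tiny illustrative frame (the `toy` data are an assumption made for this example, not the contents of `users_behavior.csv`) only to show the calls.

```python
import pandas as pd

# Toy frame with the same column names as the Megaline data; values are illustrative.
toy = pd.DataFrame({
    "calls": [40.0, 85.0, 85.0, None],
    "minutes": [311.9, 516.75, 516.75, 190.36],
    "messages": [83.0, 56.0, 56.0, 0.0],
    "mb_used": [19915.42, 22696.96, 22696.96, 3275.61],
    "is_ultra": [0, 0, 0, 1],
})

missing_per_column = toy.isna().sum()          # number of NaN values in each column
duplicate_rows = int(toy.duplicated().sum())   # rows identical to an earlier row

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
```

On the real dataset, both counts would be expected to be zero, in line with the outputs above.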

## Preparing Data for Machine Learning

In [7]:
data.head()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


In [8]:
# Prepare the data
features = data.drop(columns=['is_ultra'])
target = data['is_ultra']

In [9]:
# Split the data into training (~60%), validation (~24%), and test (~16%) sets.
# Both calls use test_size=0.4, so the test set is 40% of the held-out 40%.
features_train, features_temp, target_train, target_temp = train_test_split(
    features, target, test_size=0.4, random_state=42)
features_valid, features_test, target_valid, target_test = train_test_split(
    features_temp, target_temp, test_size=0.4, random_state=42)

In [10]:
features_train.shape

(1928, 4)

In [11]:
features_valid.shape

(771, 4)

In [12]:
features_test.shape

(515, 4)
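
The three shapes follow from applying `test_size=0.4` twice to 3214 rows. The arithmetic can be sketched as below, assuming the held-out part is rounded up and the remainder goes to the other part; `split_sizes` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_first, n_second) for a float test_size, rounding the second part up."""
    n_second = math.ceil(n_samples * test_size)
    return n_samples - n_second, n_second

n_train, n_temp = split_sizes(3214, 0.4)     # first split: train vs. temporary set
n_valid, n_test = split_sizes(n_temp, 0.4)   # second split: validation vs. test

print(n_train, n_valid, n_test)
for name, n in [("train", n_train), ("valid", n_valid), ("test", n_test)]:
    print(f"{name}: {n / 3214:.1%}")
```

The resulting sizes match the shapes reported above, which corresponds to roughly a 60/24/16 split rather than the more common 60/20/20.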

## Decision Tree Model

In [13]:
# Initialize and train the DecisionTreeClassifier
dtc_model = DecisionTreeClassifier(random_state=42)
dtc_model.fit(features_train, target_train)

DecisionTreeClassifier(random_state=42)

In [14]:
# Evaluate the model on the validation set
dtc_valid_pred = dtc_model.predict(features_valid)
dtc_valid_accuracy = accuracy_score(target_valid, dtc_valid_pred)
dtc_valid_accuracy

0.7367055771725033

In [15]:
dtc_model.tree_.max_depth 

27

A validation accuracy of about 74% is a decent result, but it falls short of our 75% goal. The unrestricted tree grew to a depth of 27, so it may be overfitting the training set. Let's try to improve the performance by tuning the hyperparameters.
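
One way to see why limiting depth might help is to compare training and validation accuracy for an unrestricted tree. The sketch below uses synthetic stand-in data from `make_classification` (an assumption for this example, not the Megaline dataset), so the numbers it prints say nothing about the real data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT the Megaline dataset): 4 numeric features, binary target.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=42)

tree = DecisionTreeClassifier(random_state=42)  # no max_depth: grows until leaves are pure
tree.fit(X_train, y_train)

train_acc = accuracy_score(y_train, tree.predict(X_train))
valid_acc = accuracy_score(y_valid, tree.predict(X_valid))
print("Depth:", tree.get_depth())
print("Train accuracy:", train_acc)
print("Validation accuracy:", valid_acc)
```

A large gap between the two accuracies is the usual sign that the tree has memorised the training set.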

In [16]:
final_depth = 0
final_score = 0
for depth in range(1, 12):
    dtc_model = DecisionTreeClassifier(random_state=42, max_depth=depth)
    dtc_model.fit(features_train, target_train)
    dtc_valid_pred = dtc_model.predict(features_valid)
    accuracy = accuracy_score(target_valid, dtc_valid_pred)
    if accuracy > final_score:
        final_depth = depth
        final_score = accuracy

print("Final depth:", final_depth, "with validation accuracy:", final_score)

Final depth: 10 with validation accuracy: 0.8041504539559015
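
The search pattern in the loop above can be factored into a small helper that also keeps every score, which makes it easier to see how flat the curve is around the winner. This is a minimal sketch: `best_by_validation` and `fake_validation_score` are hypothetical names, and the scoring function returns made-up numbers standing in for "fit a tree of this depth and score it on the validation set".

```python
def best_by_validation(candidates, score_fn):
    """Return (best_candidate, best_score, all_scores) using a strict '>' comparison."""
    scores = {}
    best, best_score = None, float("-inf")
    for candidate in candidates:
        scores[candidate] = score_fn(candidate)
        if scores[candidate] > best_score:   # ties keep the earlier candidate
            best, best_score = candidate, scores[candidate]
    return best, best_score, scores

# Hypothetical scoring function; the values are invented for illustration and
# plateau from depth 5 onwards to show how ties are resolved.
def fake_validation_score(depth):
    return 0.7 + 0.02 * min(depth, 5)

best_depth, best_score, scores = best_by_validation(range(1, 12), fake_validation_score)
print("Best depth:", best_depth, "score:", round(best_score, 3))
```

Because the comparison is strict, the smallest depth on a plateau wins, which favours the simpler model when several depths score the same.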


In [17]:
# Evaluate the model on the test set
dtc_model = DecisionTreeClassifier(random_state=42, max_depth=10)
dtc_model.fit(features_train, target_train)
dtc_test_pred = dtc_model.predict(features_test)
test_accuracy = accuracy_score(target_test, dtc_test_pred)

print("Test Accuracy:", test_accuracy)

Test Accuracy: 0.7805825242718447


The test accuracy is 78%. Not bad at all! We achieved this by limiting the maximum depth of the decision tree to 10. Next, we explore other models and compare their results to determine which one performs best.

## Random Forest

In [18]:
rf_model = RandomForestClassifier(random_state=42)

In [19]:
rf_model = rf_model.fit(features_train, target_train)

In [20]:
rf_valid_pred = rf_model.predict(features_valid)
rf_valid_accuracy = accuracy_score(target_valid, rf_valid_pred)
rf_valid_accuracy

0.8132295719844358

Impressive! We reached our goal without adjusting any hyperparameters. Let's see whether tuning can push the score even higher.

In [21]:
final_est = 0
final_score = 0
final_depth = 0
for est in range(1, 30):
    for depth in range(1, 12):
        rf_model = RandomForestClassifier(random_state=42, n_estimators=est, max_depth=depth)
        rf_model.fit(features_train, target_train)
        rf_valid_pred = rf_model.predict(features_valid)
        accuracy = accuracy_score(target_valid, rf_valid_pred)
        if accuracy > final_score:
            final_est = est
            final_depth = depth
            final_score = accuracy

print("Final estimators number:", final_est, "Final depth:", final_depth, "with accuracy:", final_score)

Final estimators number: 24 Final depth: 9 with accuracy: 0.8249027237354085


This is only slightly better than the model with default hyperparameters, but it is notably higher than the required accuracy. The best combination found for the Random Forest model is a max_depth of 9 with 24 estimators.
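
The nested loops over estimators and depth can also be written as a single loop over all combinations, keeping every score in a dictionary. The sketch below assumes synthetic stand-in data and a deliberately small grid so that it runs quickly; it is not a rerun of the search on the Megaline data.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the Megaline dataset).
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=42)

# A small illustrative grid of (n_estimators, max_depth) pairs.
grid = list(product([5, 10, 20], [3, 6, 9]))

results = {}
for n_estimators, max_depth in grid:
    model = RandomForestClassifier(random_state=42, n_estimators=n_estimators,
                                   max_depth=max_depth)
    model.fit(X_train, y_train)
    results[(n_estimators, max_depth)] = accuracy_score(y_valid, model.predict(X_valid))

best_params = max(results, key=results.get)  # first pair with the highest accuracy
print("Best (n_estimators, max_depth):", best_params, "accuracy:", results[best_params])
```

Keeping the full `results` dictionary makes it possible to check how many combinations land within a fraction of a percent of the winner.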

In [22]:
# Evaluate the model on the test set
rf_model = RandomForestClassifier(random_state=42, max_depth=9, n_estimators=24)
rf_model.fit(features_train, target_train)
rf_test_pred = rf_model.predict(features_test)
rf_test_accuracy = accuracy_score(target_test, rf_test_pred)

print("Test Accuracy:", rf_test_accuracy)

Test Accuracy: 0.8077669902912621


Good! Let's try the last model.

## Logistic Regression

In [23]:
lr_model = LogisticRegression(random_state=42)

In [24]:
lr_model = lr_model.fit(features_train, target_train)

In [25]:
lr_valid_pred = lr_model.predict(features_valid)
lr_valid_accuracy = accuracy_score(target_valid, lr_valid_pred)
lr_valid_accuracy

0.7496757457846952

The result is 0.7497, which falls just short of the 0.75 target, so we need to improve the model.

In [26]:
final_solver = ''
final_score = 0
for solver in ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga']:
    lr_model = LogisticRegression(random_state=42, solver=solver)
    lr_model.fit(features_train, target_train)
    lr_valid_pred = lr_model.predict(features_valid)
    accuracy = accuracy_score(target_valid, lr_valid_pred)
    if accuracy > final_score:
        final_solver = solver
        final_score = accuracy

print("Final solver:", final_solver, "with accuracy:", final_score)

Final solver: newton-cg with accuracy: 0.7496757457846952


The result has not changed.
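
One thing that might be worth trying is feature scaling, since `mb_used` is in the tens of thousands while `messages` is in the tens, and gradient-based solvers such as `sag` and `saga` tend to converge poorly on unscaled features. scikit-learn's `ConvergenceWarning` is a subclass of `UserWarning`, so the filter set at the top of the notebook would also hide any convergence warnings. The sketch below is an assumption-laden illustration on synthetic data (not the Megaline dataset); `max_iter=1000` and the rescaled column are choices made for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the Megaline dataset); one column is rescaled
# to mimic a feature with a much larger range, such as mb_used.
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=42)
X[:, 3] *= 10_000
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=42)

scores = {}
for solver in ["liblinear", "newton-cg", "lbfgs", "sag", "saga"]:
    # The scaler is fitted on the training data only, inside the pipeline.
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(random_state=42, solver=solver, max_iter=1000),
    )
    model.fit(X_train, y_train)
    scores[solver] = accuracy_score(y_valid, model.predict(X_valid))

for solver, score in scores.items():
    print(f"{solver}: {score:.4f}")
```

Whether scaling changes the validation accuracy on the real data would have to be checked by running the same pipeline on `features_train` and `features_valid`.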

In [27]:
# Evaluate the model on the test set
lr_model = LogisticRegression(random_state=42, solver='newton-cg')
lr_model.fit(features_train, target_train)
lr_test_pred = lr_model.predict(features_test)
lr_test_accuracy = accuracy_score(target_test, lr_test_pred)

print("Test Accuracy:", lr_test_accuracy)

Test Accuracy: 0.7611650485436893


On the test set, however, the newton-cg model reaches 76% accuracy, which is above the 75% target.

## Conclusion

In this project, we aimed to develop a model to predict the most suitable mobile plan (Smart or Ultra) for Megaline subscribers based on their usage patterns. We split the dataset into training, validation, and test sets to evaluate the performance of various machine learning models.

We trained three different classifiers: DecisionTreeClassifier, RandomForestClassifier, and LogisticRegression. After testing each model with default hyperparameters, we fine-tuned their settings to improve accuracy.

The DecisionTreeClassifier initially achieved about 74% validation accuracy, which increased to 80% on the validation set and 78% on the test set after setting the max depth hyperparameter to 10.

The RandomForestClassifier outperformed the other models, achieving 81% validation accuracy with default hyperparameters. After setting the number of estimators to 24 and the max depth to 9, the accuracy was 82% on the validation set and 81% on the test set.

LogisticRegression initially achieved 74.97% validation accuracy, marginally below the target. We experimented with different solvers and found that none of them improved the validation accuracy; with the 'newton-cg' solver, the model reached 76% accuracy on the test set.

Comparing the performance of all models, we found that the RandomForestClassifier was the most effective for our dataset. It provided excellent accuracy even with default hyperparameters and achieved the highest accuracy on the test set after fine-tuning the hyperparameters.
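
The comparison can be summarised in a few lines of code. The sketch below copies the validation and test accuracies of the tuned models from the cell outputs in this notebook; `TARGET` and `results` are names introduced here for illustration.

```python
TARGET = 0.75  # required accuracy stated in the introduction

# (validation accuracy, test accuracy) of the tuned models, copied from the outputs above.
results = {
    "DecisionTreeClassifier": (0.8041504539559015, 0.7805825242718447),
    "RandomForestClassifier": (0.8249027237354085, 0.8077669902912621),
    "LogisticRegression": (0.7496757457846952, 0.7611650485436893),
}

for name, (valid_acc, test_acc) in results.items():
    print(f"{name}: valid={valid_acc:.3f} test={test_acc:.3f} "
          f"meets target on test: {test_acc >= TARGET}")

best_model = max(results, key=lambda name: results[name][1])  # rank by test accuracy
print("Best on test set:", best_model)
```

All three models clear the target on the test set, while LogisticRegression is the only one that stays below it on the validation set.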

In conclusion, the RandomForestClassifier is the optimal choice for predicting mobile plan preferences for Megaline subscribers. Its robust performance and ability to achieve high accuracy with minimal hyperparameter adjustments make it the preferred model for our dataset and task.

In our exploration of the Megaline dataset, the LogisticRegression model was the least effective of the three. Changing the solver did not improve its validation accuracy. However, it still cleared the 75% target on the test set, which shows that it remains a viable option for the task at hand.

Through this project, we gained valuable insights into the process of model selection, hyperparameter tuning, and evaluation, providing a solid foundation for future endeavors in predictive analytics and machine learning.