# Contents <a id='back'></a>

* [Introduction](#intro)
* [Stage 1. Data overview](#data_review)
* [Stage 2. Model evaluation](#model_evaluation)
* [Conclusion](#conclusion)

## Introduction <a id='intro'></a>
In this project, we will develop and evaluate a machine learning model to recommend either Megaline's Smart or Ultra plans based on subscriber behavior data. We will explore different models, fine-tune their parameters, and validate their performance using a combination of training, validation, and test datasets.

### Goal: 
Our goal is to develop a highly accurate model that recommends either the Smart or Ultra plan based on individual subscriber behavior. The minimum accuracy threshold for this project is set at 75%, evaluated using a dedicated test dataset.

### Stages
Data on user behavior is stored in the file `/datasets/users_behavior.csv`

Our project will consist of the following stages:
1. Data Overview.
2. Model evaluation.
3. Conclusions.

## Stage 1. Data overview <a id='data_review'></a>

In [1]:
#importing necessary libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [2]:
#reading the file and storing it in users
users = pd.read_csv('/datasets/users_behavior.csv')

In [3]:
#displaying the table (pandas shows only the first and last rows)
users

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.90 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| ... | ... | ... | ... | ... | ... |
| 3209 | 122.0 | 910.98 | 20.0 | 35124.90 | 1 |
| 3210 | 25.0 | 190.36 | 0.0 | 3275.61 | 0 |
| 3211 | 97.0 | 634.44 | 70.0 | 13974.06 | 0 |
| 3212 | 64.0 | 462.32 | 90.0 | 31239.78 | 0 |


In [4]:
#obtaining general information about the data
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


#### Conclusion

In the data overview stage, we opened the data file and examined the table. It has 3,214 rows and 5 columns, with no missing values and numeric types throughout. The data is already preprocessed, so we can skip that step and proceed directly to modeling.
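
The checks behind that statement can be made explicit. The following is a minimal sketch, assuming a small stand-in table built from the first five rows shown above; the helper name `overview_checks` is hypothetical and not part of the original notebook.

```python
import pandas as pd

# Stand-in rows copied from the table preview above; in the notebook
# you would pass the full `users` frame instead.
sample = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0, 106.0, 66.0],
    "minutes": [311.90, 516.75, 467.66, 745.53, 418.74],
    "messages": [83.0, 56.0, 86.0, 81.0, 1.0],
    "mb_used": [19915.42, 22696.96, 21060.45, 8437.39, 14502.75],
    "is_ultra": [0, 0, 0, 1, 0],
})


def overview_checks(df, target="is_ultra"):
    """Return a few quick data-quality figures (hypothetical helper)."""
    return {
        "missing_values": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        # Share of each class; the largest share is the accuracy of
        # always predicting the majority plan.
        "class_share": df[target].value_counts(normalize=True).to_dict(),
    }


checks = overview_checks(sample)
print(checks)
```

The class share is worth keeping in mind later: any model has to beat the accuracy of always predicting the more common plan before its score means much.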

## Stage 2. Model Evaluation<a id='model_evaluation'></a>

### Split the source data into a training set, a validation set, and a test set.


We will split the data into three sets in the following proportions: 60% for the training set, 20% for the validation set, and 20% for the test set.

In [5]:
#splitting data into training, validation and test sets: 60% goes to training, then the remaining 40% is split in half
users_train, users_holdout = train_test_split(users, test_size=0.4, random_state=12345)
users_valid_set, users_test_set = train_test_split(users_holdout, test_size=0.5, random_state=12345)

In [6]:
#displaying the shapes of the overall dataset,the training, validation, and test sets
print(users.shape)
print(users_train.shape)
print(users_valid_set.shape)
print(users_test_set.shape)

(3214, 5)
(1928, 5)
(643, 5)
(643, 5)
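
A possible variation on the split above is to stratify on the target so that each part keeps the same share of Ultra subscribers. The sketch below assumes synthetic stand-in data, and the helper name `split_60_20_20` is hypothetical; the original notebook does not stratify.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_60_20_20(df, target, seed=12345):
    """Split df into 60/20/20 parts, stratified on `target` (hypothetical helper)."""
    train, holdout = train_test_split(
        df, test_size=0.4, random_state=seed, stratify=df[target]
    )
    valid, test = train_test_split(
        holdout, test_size=0.5, random_state=seed, stratify=holdout[target]
    )
    return train, valid, test


# Synthetic stand-in data (not the Megaline file): 1000 rows, 30% positive class
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "minutes": rng.uniform(0, 1000, size=1000),
    "is_ultra": np.repeat([0, 1], [700, 300]),
})

train, valid, test = split_60_20_20(demo, "is_ultra")
print(len(train), len(valid), len(test))
print(train["is_ultra"].mean(), valid["is_ultra"].mean(), test["is_ultra"].mean())
```

Stratifying mainly matters when one class is much rarer than the other; with a roughly balanced target, a plain random split usually gives similar proportions.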


**Splitting data into Features/Target and Train/Validate/Test**

In [7]:
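#splitting the training set into features and target variable
#note: these must come from users_train, not the full users table, otherwise validation and test rows leak into training
#(the scores recorded in the cells below were produced with the full table and will change on re-run)
features_train = users_train.drop('is_ultra',axis=1)
target_train = users_train['is_ultra']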
#splitting the validation set into features and target variable
features_valid_set = users_valid_set.drop('is_ultra',axis=1)
target_valid_set = users_valid_set['is_ultra']
#splitting the test set into features and target variable
features_test_set = users_test_set.drop('is_ultra',axis=1)
target_test_set = users_test_set['is_ultra']
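
Before training, it can help to confirm that the three feature tables share no rows, since any overlap inflates validation and test scores. The following is a minimal sketch on synthetic stand-in data; the helper `assert_disjoint` is hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def assert_disjoint(**parts):
    """Raise if any two of the named DataFrames share a row index label."""
    names = list(parts)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared = parts[first].index.intersection(parts[second].index)
            if len(shared) > 0:
                raise ValueError(f"{first} and {second} share {len(shared)} rows")


# Synthetic stand-in for the subscriber table
demo = pd.DataFrame({"calls": range(100), "is_ultra": [0, 1] * 50})
train, holdout = train_test_split(demo, test_size=0.4, random_state=12345)
valid, test = train_test_split(holdout, test_size=0.5, random_state=12345)

features_train = train.drop("is_ultra", axis=1)
features_valid = valid.drop("is_ultra", axis=1)
features_test = test.drop("is_ultra", axis=1)

# Passes silently: the three feature tables come from separate splits
assert_disjoint(train=features_train, valid=features_valid, test=features_test)

# Building the training features from the full table is caught
leaky_features = demo.drop("is_ultra", axis=1)
try:
    assert_disjoint(train=leaky_features, valid=features_valid)
    leak_detected = False
except ValueError as err:
    leak_detected = True
    print(err)
```

The check relies on the row index surviving the split, which holds for `train_test_split` on a DataFrame because it keeps the original index labels.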

### Investigate the quality of different models by changing hyperparameters. Briefly describe the findings of the study.

**Decision tree**

In [8]:
#initializing the Decision Tree Classifier with a specific random state for reproducibility
dtc_model = DecisionTreeClassifier(random_state = 12345)
#training the Decision Tree model on the training data
dtc_model = dtc_model.fit(features_train,target_train)

In [9]:
#accessing the maximum depth of the trained Decision Tree model
dtc_model.tree_.max_depth

28

In [10]:
#trying max_depth from 1 to 6, scoring each tree on the validation set, and keeping track of the best depth
final_depth = 0
final_score = 0
for depth in range(1, 7):
    dtc_model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    dtc_model.fit(features_train, target_train)
    accuracy = dtc_model.score(features_valid_set, target_valid_set)
    #remembering the depth with the highest validation accuracy so far
    if accuracy > final_score:
        final_score = accuracy
        final_depth = depth
    print("Final depth=",depth,"with accuracy:",accuracy)

Final depth= 1 with accuracy: 0.7558320373250389
Final depth= 2 with accuracy: 0.7869362363919129
Final depth= 3 with accuracy: 0.7884914463452566
Final depth= 4 with accuracy: 0.7978227060653188
Final depth= 5 with accuracy: 0.8055987558320373
Final depth= 6 with accuracy: 0.8040435458786936


A maximum depth of 5 gives the highest validation accuracy, 0.8056. At depth 6 the accuracy drops slightly to 0.8040.
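
To see where a tree starts to overfit, it can be useful to record training accuracy next to validation accuracy for each depth. The sketch below assumes synthetic data from `make_classification` rather than the Megaline table, so its numbers say nothing about the results above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with four features, like the subscriber table
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

results = []
for depth in range(1, 11):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results.append({
        "depth": depth,
        "train": tree.score(X_train, y_train),
        "valid": tree.score(X_valid, y_valid),
    })

# Pick the depth by validation accuracy only; training accuracy is shown
# just to make the gap between the two visible.
best = max(results, key=lambda row: row["valid"])
for row in results:
    print(row)
print("best depth by validation accuracy:", best["depth"])
```

A widening gap between the `train` and `valid` columns as depth grows is the usual sign that deeper trees are memorizing the training rows.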

In [11]:
#refitting the decision tree with the best depth found above (max_depth=5)
dtc_final_model = DecisionTreeClassifier(random_state=12345, max_depth=5)
dtc_final_model.fit(features_train, target_train)

DecisionTreeClassifier(max_depth=5, random_state=12345)

**Random forest classifier**

In [12]:
#trying 1 to 50 estimators, scoring each forest on the validation set, and tracking the best score; each printed line shows the number of estimators and the best score so far
best_score = 0
best_est = 0
for est in range(1, 51): 
    rfc_model = RandomForestClassifier(random_state=54321, n_estimators=est) 
    rfc_model.fit(features_train,target_train) 
    score = rfc_model.score(features_valid_set,target_valid_set) 
    if score > best_score:
        best_score = score
        best_est = est
    print(est,best_score)

1 0.9097978227060654
2 0.9175738724727839
3 0.9502332814930016
4 0.9533437013996889
5 0.973561430793157
6 0.973561430793157
7 0.9751166407465007
8 0.9751166407465007
9 0.9844479004665629
10 0.9844479004665629
11 0.9844479004665629
12 0.9844479004665629
13 0.9844479004665629
14 0.9844479004665629
15 0.9875583203732504
16 0.9875583203732504
17 0.9891135303265941
18 0.9891135303265941
19 0.9906687402799378
20 0.9906687402799378
21 0.9906687402799378
22 0.9906687402799378
23 0.9906687402799378
24 0.9906687402799378
25 0.9906687402799378
26 0.9906687402799378
27 0.995334370139969
28 0.995334370139969
29 0.995334370139969
30 0.995334370139969
31 0.9968895800933126
32 0.9968895800933126
33 0.9968895800933126
34 0.9968895800933126
35 0.9968895800933126
36 0.9968895800933126
37 0.9968895800933126
38 0.9968895800933126
39 0.9984447900466563
40 0.9984447900466563
41 0.9984447900466563
42 0.9984447900466563
43 0.9984447900466563
44 0.9984447900466563
45 1.0
46 1.0
47 1.0
48 1.0
49 1.0
50 1.0


We select 19 as the number of estimators, the point at which the best validation accuracy first reaches 0.9907. Larger forests raise the score by less than 0.01 (up to 1.0 at 45 estimators) while costing more memory and training time, so 19 is a reasonable trade-off between accuracy and computational resources.
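
The trade-off described above can be turned into an explicit rule: take the smallest forest whose validation accuracy is within a chosen tolerance of the best one. This is a minimal sketch; the function name and the scores in the example are hypothetical and are not taken from the notebook's runs.

```python
def smallest_within_tolerance(scores, tolerance=0.01):
    """Return the smallest n_estimators whose score is within `tolerance` of the best.

    `scores` maps n_estimators -> validation accuracy.
    """
    best = max(scores.values())
    return min(n for n, score in scores.items() if score >= best - tolerance)


# Hypothetical validation accuracies, for illustration only
scores = {1: 0.71, 5: 0.76, 10: 0.785, 20: 0.792, 40: 0.795, 50: 0.794}

chosen = smallest_within_tolerance(scores, tolerance=0.005)
print(chosen)
```

To use this with the loop above, the per-forest `score` (rather than the running `best_score`) would need to be stored for each value of `est`.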

In [13]:
#refitting the random forest with the chosen number of estimators (n_estimators=19)
rfc_final_model = RandomForestClassifier(random_state=54321, n_estimators=19)
rfc_final_model.fit(features_train, target_train)

RandomForestClassifier(n_estimators=19, random_state=54321)

**Logistic regression**

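We will compare two solvers, "liblinear" and "newton-cg". This dataset is small (3,214 rows and four features), which is the setting where scikit-learn's documentation suggests "liblinear" as a good choice; "newton-cg" serves as a second option for comparison.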

In [14]:
#initializing logistic regression with random_state=54321 and solver='liblinear'
lr_model = LogisticRegression(random_state=54321, solver='liblinear')
lr_model.fit(features_train, target_train)  # training model on training set
score_train = lr_model.score(features_train, target_train)  # calculating accuracy score on training set
score_valid = lr_model.score(features_valid_set, target_valid_set)  # calculating accuracy score on validation set

print("Accuracy of the logistic regression model on the training set:",score_train)
print("Accuracy of the logistic regression model on the validation set:",score_valid)

Accuracy of the logistic regression model on the training set: 0.7426882389545737
Accuracy of the logistic regression model on the validation set: 0.7573872472783826


In [15]:
#initializing logistic regression with random_state=54321 and solver='newton-cg'
lr_model = LogisticRegression(random_state=54321, solver='newton-cg')
lr_model.fit(features_train, target_train)  # training model on training set
score_train = lr_model.score(features_train, target_train)  # calculating accuracy score on training set
score_valid = lr_model.score(features_valid_set, target_valid_set)  # calculating accuracy score on validation set

print("Accuracy of the logistic regression model on the training set:",score_train)
print("Accuracy of the logistic regression model on the validation set:",score_valid)

Accuracy of the logistic regression model on the training set: 0.7479775980087119
Accuracy of the logistic regression model on the validation set: 0.7573872472783826


'newton-cg' reaches slightly higher accuracy on the training set (0.748 vs. 0.743 for 'liblinear'), while the validation accuracy is identical for both solvers (0.757). In both cases the validation accuracy is slightly higher than the training accuracy, so there is no sign of overfitting, which is consistent with a simple linear model being less prone to it than the tree-based models.
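
The two solver runs can also be written as one loop that stores the scores side by side. The sketch below assumes synthetic data and adds a `StandardScaler` step, which the original notebook does not use; scaling is included here only as an assumption, because features on very different scales (such as `mb_used` and `messages`) can slow solver convergence.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the Megaline table
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

solver_scores = {}
for solver in ("liblinear", "newton-cg"):
    # Scaling is fitted on the training part only, inside the pipeline
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(solver=solver, random_state=54321),
    )
    model.fit(X_train, y_train)
    solver_scores[solver] = {
        "train": model.score(X_train, y_train),
        "valid": model.score(X_valid, y_valid),
    }

for solver, row in solver_scores.items():
    print(solver, row)
```

Keeping the scaler inside the pipeline means the validation rows never influence the scaling parameters.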

### Check the quality of the model using the test set.

**Decision tree**

In [16]:
#making predictions on the test set with the final Decision Tree model and evaluating its accuracy
dtc_predictions_test = dtc_final_model.predict(features_test_set)
dtc_accuracy_test = accuracy_score(target_test_set,dtc_predictions_test)
print("Decision tree testing score:",dtc_accuracy_test)

Decision tree testing score: 0.8211508553654744


**Random forest classifier**

In [17]:
#making predictions on the test set with the final Random Forest model and evaluating its accuracy
rfc_predictions_test = rfc_final_model.predict(features_test_set)
rfc_accuracy_test = accuracy_score(target_test_set,rfc_predictions_test)
print("Random forest testing score:",rfc_accuracy_test)

Random forest testing score: 0.9937791601866252


**Logistic regression**

In [18]:
#making predictions on the test set with the Logistic Regression model (the last one fitted, solver='newton-cg') and evaluating its accuracy
lr_predictions_test = lr_model.predict(features_test_set)
lr_accuracy_test = accuracy_score(target_test_set,lr_predictions_test)
print("Logistic regression testing score:",lr_accuracy_test)

Logistic regression testing score: 0.7402799377916018
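
The three test scores are easier to compare against the project threshold when collected in one table. This is a small sketch that copies the accuracies printed in cells 16 to 18; the names `THRESHOLD` and `summary` are illustrative and not part of the original notebook.

```python
import pandas as pd

THRESHOLD = 0.75  # minimum accuracy stated in the project goal

# Test accuracies copied from the outputs of cells 16-18
test_scores = {
    "decision tree": 0.8211508553654744,
    "random forest": 0.9937791601866252,
    "logistic regression": 0.7402799377916018,
}

summary = (
    pd.Series(test_scores, name="test_accuracy")
    .sort_values(ascending=False)
    .to_frame()
)
summary["meets_threshold"] = summary["test_accuracy"] >= THRESHOLD
print(summary)
```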


## Conclusion<a id='conclusion'></a>

In this project, both the Decision Tree Classifier (DTC) and the Random Forest Classifier (RFC) score slightly higher on the test set than on the validation set (0.821 vs. 0.806 for DTC and 0.994 vs. 0.991 for RFC). Taken at face value, this suggests that the models are not overfitting and that they generalize to the test data.

The logistic regression model scores lower on the test set than on the training and validation sets (0.740 vs. 0.748 on training and 0.757 on validation). The gap is small, but it may indicate that the model fits the training and validation data slightly better than new data. Its test accuracy is also below the 0.75 threshold set for this project, so it does not meet the required performance criteria.

One caveat applies to all of these figures: the recorded runs fit each model on features taken from the full `users` table, which contains the validation and test rows. The scores, the random forest's in particular, are therefore likely optimistic and should be re-measured with training restricted to `users_train`.

After evaluating three models (Decision Tree Classifier, Random Forest Classifier, and Logistic Regression), we conclude that the Random Forest Classifier is the best choice for this project, with the highest test accuracy of 0.994. Of the three, it is the most suitable model for recommending which Megaline plan (Smart or Ultra) a subscriber should switch to based on their behavior data, and with 19 estimators it stays inexpensive to train.