# Tariff recommendation

**Objective of the project:**

Using data on the behavior of customers of the mobile operator "Megaline", build a model that recommends a suitable current tariff plan ("Smart" or "Ultra").

The model is trained on the behavior of users who have already switched to these tariffs.

**Description of the modeling process:**

The model is built in 5 stages:

- Overview of the dataset (the data has already been preprocessed);
- Splitting the dataset into training, validation and test samples;
- Study of several tariff recommendation models (a classification problem);
- Checking the models on the test sample;
- Checking the selected model for adequacy.

## Dataset overview

To build a classification model, we import three classifiers from the sklearn library: decision tree, random forest and logistic regression.

To split the dataset into samples, we import the train_test_split function.

To assess model quality, we import the accuracy_score function.

In [None]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [None]:
df = pd.read_csv('/datasets/users_behavior.csv')

In [None]:
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


In [None]:
df.head()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


In [None]:
df.tail()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 3209 | 122.0 | 910.98 | 20.0 | 35124.9 | 1 |
| 3210 | 25.0 | 190.36 | 0.0 | 3275.61 | 0 |
| 3211 | 97.0 | 634.44 | 70.0 | 13974.06 | 0 |
| 3212 | 64.0 | 462.32 | 90.0 | 31239.78 | 0 |
| 3213 | 80.0 | 566.09 | 6.0 | 29480.52 | 1 |


**Conclusion from the dataset review:**

The dataset contains the number of calls, minutes, messages, Internet traffic (in MB) and a flag for the “Ultra” tariff. Since there are only two tariffs, rows with the value 0 in the “is_ultra” column correspond to users on the “Smart” tariff.
There are no missing values, and the data types match the values in the columns. The mean of the “is_ultra” column shows that just over 30% of users are on the “Ultra” tariff, while the majority use the “Smart” tariff.
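The link between the column mean and the class share can be checked directly. The sketch below uses a small hypothetical series (not the Megaline data) to show that, for a 0/1 column, the mean equals the share of ones.

```python
import pandas as pd

# Hypothetical miniature stand-in for the `is_ultra` column: 7 Smart (0), 3 Ultra (1).
is_ultra = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0], name="is_ultra")

# Share of each class in the column.
class_share = is_ultra.value_counts(normalize=True)
print(class_share)

# For a 0/1 column the mean is the share of rows equal to 1.
print("Mean of the column:", is_ultra.mean())
```

Applied to the real dataframe, the same call would be `df['is_ultra'].value_counts(normalize=True)`.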

In [None]:
df.corr()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| calls | 1.0 | 0.982083 | 0.177385 | 0.286442 | 0.207122 |
| minutes | 0.982083 | 1.0 | 0.17311 | 0.280967 | 0.206955 |
| messages | 0.177385 | 0.17311 | 1.0 | 0.195721 | 0.20383 |
| mb_used | 0.286442 | 0.280967 | 0.195721 | 1.0 | 0.198568 |
| is_ultra | 0.207122 | 0.206955 | 0.20383 | 0.198568 | 1.0 |


Two columns, 'calls' and 'minutes', are very strongly correlated: the correlation coefficient is 0.982, close to 1. The number of calls and the minutes spent therefore carry almost the same information, and one of these features is redundant for training. When forming the feature dataframe we exclude the 'calls' column; given the correlation found, this should not noticeably affect model quality.
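Finding such pairs can also be automated instead of reading the matrix by eye. The following is a minimal sketch on synthetic data; the `toy` dataframe and the 0.9 threshold are assumptions chosen for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (an assumption): `minutes` is built from `calls`
# plus a little noise, while `messages` is independent of both.
rng = np.random.default_rng(0)
n = 500
calls = rng.poisson(60, size=n).astype(float)
toy = pd.DataFrame({
    "calls": calls,
    "minutes": calls * 7 + rng.normal(0, 10, size=n),
    "messages": rng.poisson(40, size=n).astype(float),
})

# Absolute correlations; keep only the upper triangle so each pair appears once.
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # hypothetical cut-off for "redundant" features
redundant_pairs = [
    (row, col)
    for col in upper.columns
    for row in upper.index
    if upper.loc[row, col] > threshold
]
print(redundant_pairs)
```

One feature from each reported pair is then a candidate for removal.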

## Dividing the dataset into samples

We create a dataframe with the features and a series with the target.

In [None]:
features = df.drop(['is_ultra', 'calls'], axis=1)
target = df['is_ultra']

Since there is no separate test sample, we create the training, validation and test samples from the dataset itself. The dataset is split in a 3:1:1 ratio (60% / 20% / 20%), where 3 is the share of the training sample.

We form the samples sequentially in 2 stages:

- First, we set aside the training sample: 60% of the dataset;
- Then we split the remaining 40% in half to obtain the validation and test samples.

In [None]:
features_train, features_temp, target_train, target_temp = \
          train_test_split(features, target, test_size=0.4, random_state=12345)

features_test, features_valid, target_test, target_valid = \
         train_test_split(features_temp, target_temp, test_size=0.5, random_state=12345)

In [None]:
print(features_train.shape, features_test.shape, features_valid.shape)
print(target_train.shape, target_test.shape, target_valid.shape)

(1928, 3) (643, 3) (643, 3)
(1928,) (643,) (643,)


We checked the shapes of the resulting samples: 1928, 643 and 643 rows, which matches the required 3:1:1 ratio.
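Because the classes are imbalanced (about 30% "Ultra"), one option is to also keep the class share equal across the samples. The sketch below is an alternative that this notebook does not use: it repeats the two-stage split on synthetic data with the `stratify` argument of `train_test_split`. The arrays `X` and `y` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 1000 rows, 30% positive class.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 300 + [0] * 700)

# Stage 1: 60% training, 40% temporary; stratify keeps the class share.
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.4, random_state=12345, stratify=y)

# Stage 2: split the temporary part in half into validation and test.
X_valid, X_test, y_valid, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=12345, stratify=y_temp)

print(len(y_train), len(y_valid), len(y_test))
print(y_train.mean(), y_valid.mean(), y_test.mean())
```

Without `stratify`, the share of the positive class can differ slightly between samples, which matters more for small datasets.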

## Study of different tariff recommendation models

We will study 3 models in sequence: Decision Tree, Random Forest and Logistic Regression.

Model **Decision tree**

In [None]:
best_model = None
best_result = 0
best_depth = 0

for i in range(1, 11):
    model = DecisionTreeClassifier(random_state=12345, max_depth=i)
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid)
    result = accuracy_score(target_valid, predictions_valid)

    if result > best_result:
        best_model = model
        best_result = result
        best_depth = i

print("Accuracy of the best model on the validation sample:", best_result.round(4), "Tree depth:", best_depth)

Accuracy of the best model on the validation sample: 0.7963 Tree depth: 7


The loop covers tree depths from 1 to 10, and the best validation accuracy is reached at depth 7. The optimum lies inside the searched range rather than at its upper boundary, so extending the search to greater depths is impractical.
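To see where the optimum lies, it can help to keep the score for every depth rather than only the best one. This is a minimal sketch on synthetic data from `make_classification`; the data and the `scores` dictionary are assumptions, and the numbers it prints are unrelated to the Megaline results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the Megaline dataset).
X, y = make_classification(n_samples=1000, n_features=3, n_informative=3,
                           n_redundant=0, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Validation accuracy for every depth in the searched range.
scores = {}
for depth in range(1, 11):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(X_train, y_train)
    scores[depth] = accuracy_score(y_valid, model.predict(X_valid))

# max() returns the first depth with the top score, like the strict `>` in the loop above.
best_depth = max(scores, key=scores.get)
for depth, score in scores.items():
    print(depth, round(score, 4))
print("Best depth:", best_depth)
```

If the best depth were the last value in the range, that would be a hint to widen the search.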

Model **Random Forest**

In [None]:
best_model = None
best_result = 0
best_est = 0
best_depth = 0

for est in range(10, 41, 5):
    for i in range (1, 11):
        model = RandomForestClassifier(random_state=12345, n_estimators=est, max_depth=i)
        model.fit(features_train, target_train)
        predictions_valid = model.predict(features_valid)
        result = accuracy_score(target_valid, predictions_valid)

        if result > best_result:
            best_model = model
            best_result = result
            best_est = est
            best_depth = i

print("Accuracy of the best model on the validation sample:", best_result.round(4), \
      "Number of trees:", best_est, "Maximum depth:", best_depth)

Accuracy of the best model on the validation sample: 0.8087 Number of trees: 10 Maximum depth: 8


The loop covers tree depths from 1 to 10 and forest sizes from 10 to 40 trees. The best validation accuracy is reached at depth 8 with 10 trees: neither deeper trees nor larger forests improved the result, so extending the search over the maximum depth and the number of trees is impractical.
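The nested loops can also be written as a single loop over all combinations. The sketch below uses `ParameterGrid` from sklearn on synthetic data; the data and variable names are assumptions for illustration, and this is not the code that produced the result above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic stand-in data (an assumption; not the Megaline dataset).
X, y = make_classification(n_samples=600, n_features=3, n_informative=3,
                           n_redundant=0, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Same ranges as the nested loops: 7 forest sizes x 10 depths = 70 combinations.
grid = ParameterGrid({
    "n_estimators": list(range(10, 41, 5)),
    "max_depth": list(range(1, 11)),
})

best_params, best_result = None, 0.0
for params in grid:
    model = RandomForestClassifier(random_state=12345, **params)
    model.fit(X_train, y_train)
    result = accuracy_score(y_valid, model.predict(X_valid))
    if result > best_result:
        best_params, best_result = params, result

print(len(grid), best_params, round(best_result, 4))
```

`ParameterGrid` iterates in its own order, so when several combinations tie on accuracy it may report a different one than the nested loops would.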

Model **Logistic Regression**

In [None]:
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_train, target_train)
predictions_valid = model.predict(features_valid)
result = accuracy_score(target_valid, predictions_valid)

print("Accuracy of the model on the validation sample:", result.round(4))

Accuracy of the model on the validation sample: 0.6983


**Conclusion by section**

Three models were considered: Decision Tree, Random Forest and Logistic Regression.

For the Decision Tree and Random Forest models, we searched for the hyperparameters (tree depth, number of trees) that give the highest accuracy on the validation sample.

The **Random Forest** model with a tree depth of 8 and 10 trees was the best of the three, with a validation accuracy of **80.9%**.

## Checking models on a test sample

We will test each model on the test sample, using the best hyperparameters found.

We check all the models, starting with the one that was best on the validation sample, because results on the validation sample do not always match results on the test sample.

In [None]:
model = RandomForestClassifier(random_state=12345, n_estimators=10, max_depth=8)
model.fit(features_train, target_train)
predictions_test = model.predict(features_test)
result = accuracy_score(target_test, predictions_test)
print("Accuracy of the Random Forest model on the test sample:", result.round(4))

Accuracy of the Random Forest model on the test sample: 0.7838


In [None]:
model = DecisionTreeClassifier(random_state=12345, max_depth=7)
model.fit(features_train, target_train)
predictions_test = model.predict(features_test)
result = accuracy_score(target_test, predictions_test)
print("Accuracy of the Decision Tree model on the test sample:", result.round(4))

Accuracy of the Decision Tree model on the test sample: 0.7714


In [None]:
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_train, target_train)
predictions_test = model.predict(features_test)
result = accuracy_score(target_test, predictions_test)
print("Accuracy of the Logistic Regression model on the test sample:", result.round(4))

Accuracy of the Logistic Regression model on the test sample: 0.7076


**Conclusion by section:**

The model that showed the best accuracy on the validation sample also showed the best accuracy on the test sample.

As a result, we select the **Random Forest** model with a tree depth of 8 and 10 trees.

Its accuracy on the test sample is **78.4%**, which is above the specified minimum of 75%, so the model satisfies the customer’s requirement.
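The accuracies printed in the cells above can be gathered into one table and compared against the 75% requirement. The sketch below copies the numbers from the notebook outputs; the `results` dataframe and its column names are introduced here only for illustration.

```python
import pandas as pd

# Accuracies copied from the notebook outputs above.
results = pd.DataFrame(
    {
        "validation": [0.8087, 0.7963, 0.6983],
        "test": [0.7838, 0.7714, 0.7076],
    },
    index=["Random Forest", "Decision Tree", "Logistic Regression"],
)

# Compare the test accuracy with the customer's minimum of 75%.
min_accuracy = 0.75
results["meets_threshold"] = results["test"] >= min_accuracy

best_model_name = results["test"].idxmax()
print(results)
print("Best on the test sample:", best_model_name)
```

In this table both tree-based models clear the threshold on the test sample, while Logistic Regression does not.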

## Checking the model for adequacy

We will check the adequacy of the model by comparing its predictions for the entire dataset with the actual values in the "is_ultra" column.

In [None]:
df['is_ultra'].value_counts()

0    2229
1     985
Name: is_ultra, dtype: int64

The dataset contains 3,214 users, of whom 985 (30.6%) have the “Ultra” tariff and 2,229 (69.4%) have the “Smart” tariff. A constant model that always predicts the more frequent “Smart” tariff would therefore reach an accuracy of 69.4%.
The accuracy of the selected model is higher on both the validation (80.9%) and test (78.4%) samples.
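A common reference point for an adequacy check is a constant model that always predicts the most frequent class. The sketch below builds such a baseline with sklearn's `DummyClassifier`, using only the class counts reported by `value_counts()` above; the placeholder feature matrix is an assumption, since this baseline ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Target rebuilt from the class counts above: 2229 Smart (0) and 985 Ultra (1).
y = np.array([0] * 2229 + [1] * 985)
# Placeholder features (an assumption): DummyClassifier does not look at them.
X = np.zeros((len(y), 1))

# Constant model that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
baseline_accuracy = accuracy_score(y, baseline.predict(X))

print("Baseline accuracy:", round(baseline_accuracy, 4))  # 2229 / 3214, about 0.6935
```

A trained model is worth using only if its accuracy is clearly above this value.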

In [None]:
model = RandomForestClassifier(random_state=12345, n_estimators=10, max_depth=8)
model.fit(features_train, target_train)
predictions = model.predict(features)
df['is_ultra_predict']=predictions

In [None]:
df['is_ultra_predict'].value_counts()

0    2580
1     634
Name: is_ultra_predict, dtype: int64

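The model predicts the “Ultra” tariff for 634 users, while 985 users actually have it, so the predicted count is 64.4% of the actual count. This ratio compares totals only: it does not show how many of the 634 predictions are correct, so it should not be read as the accuracy for the “Ultra” class. It does show that the model predicts “Ultra” less often than the tariff actually occurs. Note also that these predictions cover the entire dataset, including the training sample.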
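To measure how well one class is recognized, the predictions have to be matched row by row with the actual values. The sketch below uses made-up labels (not the Megaline data) to show how the ratio of predicted to actual counts can differ from recall and precision.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration: 5 actual Ultra (1) and 10 actual Smart (0).
y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
# The hypothetical model predicts Ultra 3 times, and one of those is wrong.
y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0])

# Rows of the matrix are actual classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

count_ratio = y_pred.sum() / y_true.sum()    # predicted Ultra / actual Ultra
recall = recall_score(y_true, y_pred)        # share of actual Ultra that was found
precision = precision_score(y_true, y_pred)  # share of predicted Ultra that is correct

print(tn, fp, fn, tp)
print(count_ratio, recall, round(precision, 4))
```

Here the count ratio is 0.6, but only 2 of the 5 actual "Ultra" users are found, so recall is 0.4.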

In [None]:
df.head(10)

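| | calls | minutes | messages | mb_used | is_ultra | is_ultra_predict |
|---|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 | 0 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 | 0 |
| 5 | 58.0 | 344.56 | 21.0 | 15823.37 | 0 | 0 |
| 6 | 57.0 | 431.64 | 20.0 | 3738.9 | 1 | 0 |
| 7 | 15.0 | 132.4 | 6.0 | 21911.6 | 0 | 0 |
| 8 | 7.0 | 43.39 | 3.0 | 2538.67 | 1 | 0 |
| 9 | 90.0 | 665.41 | 38.0 | 17358.61 | 0 | 0 |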


In the rows shown above, the model predicts “Smart” (0) for every user. All 7 users with the “Smart” tariff are predicted correctly, but the 3 users with the “Ultra” tariff (rows 3, 6 and 8) are also predicted as “Smart”. In this part of the dataset the error is therefore one-sided, which can be explained, in my opinion, both by the volume of data in the dataset and by the final accuracy of the selected model.
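The direction of the errors can be counted instead of read off the table. The sketch below rebuilds the two label columns from the 10 rows shown above and cross-tabulates them; the `rows` dataframe and the helper variables are introduced here for illustration.

```python
import pandas as pd

# Actual and predicted labels copied from the 10 rows displayed above.
rows = pd.DataFrame({
    "is_ultra":         [0, 0, 0, 1, 0, 0, 1, 0, 1, 0],
    "is_ultra_predict": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
})

# Actual class in the rows, predicted class in the columns.
errors = pd.crosstab(rows["is_ultra"], rows["is_ultra_predict"])
print(errors)

# Ultra users predicted as Smart, and Smart users predicted as Ultra.
missed_ultra = int(((rows["is_ultra"] == 1) & (rows["is_ultra_predict"] == 0)).sum())
false_ultra = int(((rows["is_ultra"] == 0) & (rows["is_ultra_predict"] == 1)).sum())
print(missed_ultra, false_ultra)
```

Running the same cross-tabulation on the full `df` would show whether the pattern holds beyond these rows.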

**Conclusion by section:**

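The selected model passes the adequacy check: its accuracy on the validation and test samples exceeds the 69.4% accuracy of a constant model that always predicts “Smart”.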

**GENERAL CONCLUSION:**
Using a dataset of 3,214 rows describing the behavior of Megaline users who have already chosen one of the current tariffs, we built a tariff recommendation model based on **Random Forest**. The model reached an accuracy of 80.9% on the validation sample and **78.4%** on the test sample, and it passed the adequacy check.
Since the customer set the minimum accuracy at 75%, Megaline can use this model to recommend current tariffs to users.