## Tariff Recommendation Project

**Purpose of the analysis:** to build a model that analyzes the behavior of customers of the mobile operator "Megaline" and recommends a tariff. Megaline has found that many customers still use archived tariffs, and wants a system that analyzes customer behavior and offers each user one of the new tariffs: "Smart" or "Ultra".

**Data:** a dataset describing the behavior of customers who have already switched to the "Smart" or "Ultra" tariffs.

**Loading the required libraries and opening the data**

In [25]:
# libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split 
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier 
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import accuracy_score
from sklearn.dummy import DummyClassifier
import warnings

# mute notifications
warnings.filterwarnings('ignore')

# import data
df = pd.read_csv('users_behavior.csv')
df.head(5)

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


**Checking basic information about dataset**

In [6]:
print(df.info())
print('')
print(df.describe())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB
None

             calls      minutes     messages       mb_used     is_ultra
count  3214.000000  3214.000000  3214.000000   3214.000000  3214.000000
mean     63.038892   438.208787    38.281269  17207.673836     0.306472
std      33.236368   234.569872    36.148326   7570.968246     0.461100
min       0.000000     0.000000     0.000000      0.000000     0.000000
25%      40.000000   274.575000     9.000000  12491.902500     0.000000
50%      62.000000   430.600000    30.000000  16943.235000     0.000000
75%      82.000000   571.927500    57.000000  21424.700000 

As can be seen, the dataset contains 3214 observations, each describing one user's behavior over one month. It has the following variables:

- calls - number of calls;
- minutes - total call duration in minutes;
- messages - number of sms messages;
- mb_used - consumed internet traffic in MB;
- is_ultra - what tariff was used during the month ("Ultra" - 1, "Smart" - 0).

As for the variable types, four of the five variables are floating-point numbers ('float64'), and only 'is_ultra' is an integer ('int64').

Let's check the data for missing values and duplicates:

In [7]:
df.isna().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

In [8]:
df.duplicated().sum()

0

The data has neither missing values nor explicit duplicates, so it can be used as is.

## Preparing data for modeling

In [9]:
# First, separate the training sample from the available data
# (60% of all data)
df_train, df2 = train_test_split(df, test_size=0.4, random_state=12345) 
print(df_train.shape)
print(df2.shape)

# The remaining 40% of the sample will be divided into test 
# and validation samples (50% each, that is, 20% of the original data)
df_valid, df_test = train_test_split(df2, test_size=0.5, random_state=12345)
print(df_valid.shape)
print(df_test.shape)

(1928, 5)
(1286, 5)
(643, 5)
(643, 5)
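The two-step split above can be checked on a small toy frame. The sketch below is only an illustration: the `toy` frame and its values are hypothetical, and the `stratify` argument is an optional variation that the cell above does not use. It keeps the share of each tariff the same in all three samples, which may be helpful because the classes are imbalanced (about 31% "Ultra" according to `df.describe()`).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the real dataset (hypothetical values).
toy = pd.DataFrame({
    'feature': np.arange(100),
    'is_ultra': [0] * 70 + [1] * 30,
})

# Step 1: hold out 40% of the rows; stratify keeps the class ratio (assumption:
# the original notebook splits without stratification).
toy_train, toy_rest = train_test_split(
    toy, test_size=0.4, random_state=12345, stratify=toy['is_ultra'])

# Step 2: split the held-out 40% in half to get validation and test samples.
toy_valid, toy_test = train_test_split(
    toy_rest, test_size=0.5, random_state=12345, stratify=toy_rest['is_ultra'])

sizes = (len(toy_train), len(toy_valid), len(toy_test))
print(sizes)
print(toy_train['is_ultra'].mean(),
      toy_valid['is_ultra'].mean(),
      toy_test['is_ultra'].mean())
```

The sizes come out as 60/20/20 of the toy rows, mirroring the 60%/20%/20% proportions used for the real data.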


Creating variables with the features and the target (tariff type) for each sample:

In [10]:
# Variables for train sample:
features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']

# Variables for valid sample:
features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']

# Variables for test sample:
features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']

# Checking the shapes of the new variables:
print(features_train.shape)
print(target_train.shape)
print(features_valid.shape)
print(target_valid.shape)
print(features_test.shape)
print(target_test.shape)

(1928, 4)
(1928,)
(643, 4)
(643,)
(643, 4)
(643,)


## Models testing

**1. Decision Tree Classification**

First, let's find the tree depth that gives the best accuracy on the validation sample, trying depths from 1 to 10:

In [22]:
# track the best model, its validation accuracy and its depth
best_model1 = None
best_result1 = 0
best_depth1 = None
for depth in range(1, 11):
    model1 = DecisionTreeClassifier(random_state = 12345, max_depth = depth)
    model1.fit(features_train, target_train)
    result1 = model1.score(features_valid, target_valid)
    if result1 > best_result1:
        best_model1 = model1
        best_result1 = result1
        best_depth1 = depth

print("Accuracy of the best model on a validation sample:", 
      best_result1, "at depth:", best_depth1)


Accuracy of the best model on a validation sample: 0.7853810264385692 at depth: 3


As can be seen, the highest validation accuracy (0.7853810264385692) is achieved at a tree depth of 3. Refitting the tree at that depth confirms the score:

In [10]:
model1 = DecisionTreeClassifier(random_state = 12345, max_depth = best_depth1)
model1.fit(features_train, target_train)
result1 = model1.score(features_valid, target_valid)
print(result1)

0.7853810264385692


Thus, the best validation accuracy achieved by the Decision Tree model is approximately 0.785.
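A depth search like the one above can also record training accuracy next to validation accuracy, which makes overfitting at larger depths easier to spot. The following is a minimal sketch on synthetic data generated with `make_classification`; the data, the split size and the variable names are assumptions made for illustration and are not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real features and target (assumption).
X, y = make_classification(n_samples=500, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# For each depth, store (depth, training accuracy, validation accuracy).
rows = []
for depth in range(1, 11):
    tree = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    tree.fit(X_train, y_train)
    rows.append((depth,
                 tree.score(X_train, y_train),
                 tree.score(X_valid, y_valid)))

# max() returns the first maximum, so ties go to the shallowest tree.
best_depth, best_train, best_valid = max(rows, key=lambda r: r[2])
print("best depth:", best_depth, "validation accuracy:", best_valid)
```

If training accuracy keeps growing while validation accuracy stalls or drops, the deeper trees are likely memorizing the training sample.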

**2. Random Forest**

Just as in the previous case, let's first see which number of trees (the n_estimators parameter) and which depth give the highest accuracy. The grid covers 10, 15 and 20 trees and depths from 1 to 9:

In [19]:
best_model = None
best_result = 0
result = []
for depth in range(1,10):
    for est in range(10, 25, 5):
        model2 = RandomForestClassifier(random_state=12345, n_estimators=est,
                                        max_depth=depth)
        model2.fit(features_train, target_train)
        predictions_valid = model2.predict(features_valid)
        score = accuracy_score(target_valid, predictions_valid)
        if score > best_result:
            best_model = model2
            best_result = score
        result.append({'n_estimators': est, 
                        'max_depth': depth, 
                        'score': score})
        
pd.DataFrame(result).style.highlight_max('score', color = 'lightgreen', axis = 0)

Unnamed: 0,n_estimators,max_depth,score
0,10,1,0.755832
1,15,1,0.752722
2,20,1,0.766719
3,10,2,0.777605
4,15,2,0.783826
5,20,2,0.783826
6,10,3,0.785381
7,15,3,0.786936
8,20,3,0.786936
9,10,4,0.790047


So, the best accuracy score (about 0.8) is achieved with the following hyperparameters:
- 10 trees and depth = 6
- 15 trees and depth = 6
- 20 trees and depth = 7

Since these configurations reach the same accuracy, it is reasonable to choose the simplest one: 10 trees with a depth of 6.
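The choice among equally accurate configurations can also be made programmatically. The sketch below assumes a results table shaped like the one built in the grid search; the scores in `toy_results` are hypothetical numbers used only to show the tie-breaking rule, not the notebook's actual results.

```python
import pandas as pd

# Hypothetical scores, for illustration only (not the notebook's results).
toy_results = pd.DataFrame([
    {'n_estimators': 20, 'max_depth': 7, 'score': 0.80},
    {'n_estimators': 15, 'max_depth': 6, 'score': 0.80},
    {'n_estimators': 10, 'max_depth': 6, 'score': 0.80},
    {'n_estimators': 10, 'max_depth': 3, 'score': 0.78},
])

# Highest score first; among equal scores, fewer trees and
# a shallower depth come first.
ranked = toy_results.sort_values(
    by=['score', 'n_estimators', 'max_depth'],
    ascending=[False, True, True],
).reset_index(drop=True)

best_n = int(ranked.loc[0, 'n_estimators'])
best_depth = int(ranked.loc[0, 'max_depth'])
print(best_n, best_depth)
```

Preferring the smaller forest among ties keeps training and prediction cheaper without giving up validation accuracy.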

In [20]:
model2 = RandomForestClassifier(random_state=12345, n_estimators= 10, max_depth = 6)
model2.fit(features_train, target_train)
result2 = model2.score(features_valid, target_valid)
print(result2)

0.8009331259720062


**3. Logistic Regression**

In [21]:
model3 = LogisticRegression(random_state = 12345)
model3.fit(features_train, target_train)
result3 = model3.score(features_valid, target_valid)
print(result3)

0.7107309486780715


As can be seen, logistic regression gives the lowest validation accuracy of the three models: about 0.71.

In [23]:
result_table = pd.DataFrame({'Model': 
                            ['Decision Tree', 'Random Forest', 'Logistic Regression'],
                           'Accuracy': ['0.785', '0.8', '0.71']})
result_table

Unnamed: 0,Model,Accuracy
0,Decision Tree,0.785
1,Random Forest,0.8
2,Logistic Regression,0.71


Based on these validation results, I propose using the Random Forest model for further work. To confirm the choice, let's check its quality on the test sample.

## Checking the quality of the model on a test sample

In [24]:
result_final = model2.score(features_test, target_test)
print(result_final)

0.7916018662519441


The accuracy on the test set is about 0.79, close to the 0.80 obtained on the validation set.

**Checking the Model for Sanity**

To test the sanity of the model, let's build simple baseline models and compare their accuracy on the test sample with that of our model.

In [28]:
# frequent classifier
frequent_clf = DummyClassifier(strategy = 'most_frequent').fit(features_train, target_train)
y_freq_pred = frequent_clf.predict(features_test)
print('Accuracy score:', frequent_clf.score(features_test, target_test))

Accuracy score: 0.6842923794712286


In [30]:
# uniform classifier
uniform_clf = DummyClassifier(strategy = 'uniform').fit(features_train, target_train)
y_uniform_pred = uniform_clf.predict(features_test)
print('Accuracy score:', uniform_clf.score(features_test, target_test))

Accuracy score: 0.5085536547433903


As can be seen, both dummy models have lower accuracy (about 0.68 and 0.51) than our Random Forest model (about 0.79 on the same test sample). Therefore, we can consider our model 'sane'.
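The `most_frequent` baseline always predicts the majority class, so its accuracy equals the share of that class in the sample it is scored on. This is consistent with the result above: about 31% of all users are on "Ultra" (the mean of `is_ultra`), so always predicting "Smart" is right roughly 68-69% of the time. A minimal sketch on a made-up target vector (the arrays below are assumptions for illustration):

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up target: 7 users on "Smart" (0) and 3 on "Ultra" (1).
target = np.array([0] * 7 + [1] * 3)
# DummyClassifier ignores the features, so a placeholder column is enough.
features = np.zeros((len(target), 1))

clf = DummyClassifier(strategy='most_frequent').fit(features, target)
baseline_accuracy = clf.score(features, target)

# Share of the most common class in the target.
majority_share = np.bincount(target).max() / len(target)

print(baseline_accuracy, majority_share)
```

Any useful model should beat this majority-class share, which is why it serves as the main reference point for the sanity check.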

## Conclusion

To sum up, the following steps were completed:
- the data was checked for missing values and duplicates;
- the dataset was divided into three samples: training (60%), validation (20%) and test (20%);
- three predictive models were compared: a decision tree, a random forest, and logistic regression. The random forest showed the highest accuracy on the validation sample (about 0.80);
- the quality of the chosen model was checked on the test sample, where its accuracy was about 0.79;
- the model was also tested for sanity: the two dummy baselines scored noticeably lower on the test sample (about 0.68 and 0.51).

All in all, the chosen Random Forest model can be used by the "Megaline" telecom company to recommend one of two tariffs to its users: "Smart" or "Ultra".