<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Examining-data-from-a-file" data-toc-modified-id="Examining-data-from-a-file-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Examining data from a file</a></span></li><li><span><a href="#Splitting-the-original-data-into-training,-validation-and-test-samples" data-toc-modified-id="Splitting-the-original-data-into-training,-validation-and-test-samples-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Splitting the original data into training, validation and test samples</a></span></li><li><span><a href="#Exploring-the-quality-of-different-models-by-changing-hyperparameters" data-toc-modified-id="Exploring-the-quality-of-different-models-by-changing-hyperparameters-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Exploring the quality of different models by changing hyperparameters</a></span><ul class="toc-item"><li><span><a href="#Decision-Tree-Classifier" data-toc-modified-id="Decision-Tree-Classifier-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Decision Tree Classifier</a></span></li><li><span><a href="#Random-Forest-Classifier" data-toc-modified-id="Random-Forest-Classifier-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Random Forest Classifier</a></span></li><li><span><a href="#Logistic-Regression" data-toc-modified-id="Logistic-Regression-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Logistic Regression</a></span></li><li><span><a href="#Conclusion" data-toc-modified-id="Conclusion-3.4"><span class="toc-item-num">3.4&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></li><li><span><a href="#Model-testing" data-toc-modified-id="Model-testing-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Model testing</a></span></li><li><span><a href="#Model-sanity-check" data-toc-modified-id="Model-sanity-check-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Model sanity check</a></span></li></ul></div>

# Project description

Mobile operator Megaline has found out that many customers use archived tariffs. We, the company's data scientists, want to build a system that can analyse customer behaviour and offer users a new tariff: Smart or Ultra.

We have data on the behaviour of customers who have already switched to these tariffs (from the project "Statistical Data Analysis"). We need to build a model for the classification problem, which selects the appropriate tariff. No preprocessing of the data is needed - we have already done it.

Let's build a model with the highest possible value of the `accuracy` metric. For the project to be successful, accuracy (the share of correct answers) must reach at least 0.75. We will check `accuracy` on the test sample ourselves.

**Data description**.

Each object in the dataset describes the behaviour of one user over one month. The following fields are known:

* calls - number of calls,
* minutes - total duration of calls in minutes,
* messages - number of SMS messages,
* mb_used - Internet traffic used, in MB,
* is_ultra - which tariff was used during the month ("Ultra" - 1, "Smart" - 0).

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns

## Examining data from a file

In [2]:
try:
    df = pd.read_csv('users_behavior.csv')  # local copy of the dataset
except FileNotFoundError:
    df = pd.read_csv('/datasets/users_behavior.csv')  # fallback path

In [3]:
df.info()
display(df.sample(10))
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


index,calls,minutes,messages,mb_used,is_ultra
519,22.0,131.53,5.0,4877.94,0
2455,71.0,499.54,56.0,9791.01,0
3120,93.0,727.11,11.0,17723.74,0
610,84.0,517.05,81.0,10945.58,1
1971,71.0,500.65,44.0,13803.46,0
1265,66.0,437.75,22.0,25108.55,0
2393,51.0,360.56,53.0,19683.68,0
2101,73.0,478.97,0.0,14927.91,1
1021,108.0,789.17,12.0,32670.36,1
2513,39.0,242.71,0.0,20480.11,0


statistic,calls,minutes,messages,mb_used,is_ultra
count,3214.0,3214.0,3214.0,3214.0,3214.0
mean,63.038892,438.208787,38.281269,17207.673836,0.306472
std,33.236368,234.569872,36.148326,7570.968246,0.4611
min,0.0,0.0,0.0,0.0,0.0
25%,40.0,274.575,9.0,12491.9025,0.0
50%,62.0,430.6,30.0,16943.235,0.0
75%,82.0,571.9275,57.0,21424.7,1.0
max,244.0,1632.06,224.0,49745.73,1.0


We carried out a high-level examination of the dataset. It contains 3214 rows and 5 columns, has no missing values, and all columns are numeric. Since the data were pre-processed at an earlier stage of the project, no additional cleaning is required.

We can proceed to model building.
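The claim that no further preprocessing is needed can be spot-checked in a few lines. The sketch below is only an illustration: the helper name `quality_report` is hypothetical, and the small frame reuses three rows from the sample shown above instead of loading the real file.

```python
import pandas as pd

def quality_report(frame: pd.DataFrame) -> dict:
    """Collect basic checks: missing values, duplicate rows, negative values."""
    numeric = frame.select_dtypes("number")
    return {
        "missing_values": int(frame.isna().sum().sum()),
        "duplicate_rows": int(frame.duplicated().sum()),
        "negative_values": int((numeric < 0).sum().sum()),
    }

# Three rows copied from the sample output above (not the full dataset)
sample = pd.DataFrame({
    "calls": [22.0, 71.0, 93.0],
    "minutes": [131.53, 499.54, 727.11],
    "messages": [5.0, 56.0, 11.0],
    "mb_used": [4877.94, 9791.01, 17723.74],
    "is_ultra": [0, 0, 0],
})
report = quality_report(sample)
print(report)
```

Calling `quality_report(df)` on the full dataset would give the same kind of summary; all three counters being zero would support skipping preprocessing.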

## Splitting the original data into training, validation and test samples

Our goal is to develop a model that predicts the best tariff plan for customers. Since we do not have a test dataset at our disposal, we will divide the available original dataset into three samples: **training, validation and test samples in the proportion 3:1:1**, i.e. 60% : 20% : 20% respectively.

The target is the `is_ultra` column, which encodes the tariff to predict. All other columns of the table serve as features.

In [4]:
# Split the dataset in two steps with train_test_split:
# first 60% / 40%, then the 40% part in half (validation / test)

df_train, df_valid = train_test_split(df, random_state=12345, test_size=0.4)
df_valid, df_test = train_test_split(df_valid, random_state=12345, test_size=0.5)
print('Training set size:', df_train.shape[0], 'objects')
print('Validation set size:', df_valid.shape[0], 'objects')
print('Test set size:', df_test.shape[0], 'objects')

Training set size: 1928 objects
Validation set size: 643 objects
Test set size: 643 objects


Having split the initial dataset in the required proportion, we can proceed to building and investigating different models, with the aim of choosing the one that maximises the `accuracy` metric.
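The two-step split can be wrapped in a small helper to confirm that it really yields a 60% : 20% : 20% proportion. This is a minimal sketch: the function name `split_three_way` and the placeholder frame are assumptions made for the illustration, and only the row count (3214) is taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def split_three_way(frame, seed=12345):
    """Split a frame 60/20/20 with two consecutive train_test_split calls."""
    train, rest = train_test_split(frame, test_size=0.4, random_state=seed)
    valid, test = train_test_split(rest, test_size=0.5, random_state=seed)
    return train, valid, test

# Placeholder frame with as many rows as the dataset; the values are arbitrary
demo = pd.DataFrame({"row_id": np.arange(3214)})
train, valid, test = split_three_way(demo)

sizes = (len(train), len(valid), len(test))
shares = tuple(round(size / len(demo), 3) for size in sizes)
print(sizes, shares)
```

Because the classes are imbalanced (about 69% Smart), one option not used in this project is to pass `stratify=frame['is_ultra']` to `train_test_split`, which keeps the class shares equal across the samples.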

## Exploring the quality of different models by changing hyperparameters

Many machine learning libraries, scikit-learn included, expect the features and the target to be passed as separate variables.

So for each of our samples we declare two variables, `features` and `target`, which hold the features and the target respectively. The `is_ultra` column is the target. All other columns are features.

In [5]:
# Assigning features and target feature to respective variables - Training set
features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']

# Assigning features and target feature to respective variables - Validation set
features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']

# Assigning features and target feature to respective variables - Testing set
features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']

### Decision Tree Classifier

We first investigate the Decision Tree Classifier model by varying the maximum tree depth hyperparameter, `max_depth`, from 1 to 10.

We try ten values of this hyperparameter and obtain a quality score for each model on the validation set. To check whether a model is overfitted, we compute the score on the training set as well and analyze the difference between the two scores.

In [7]:
best_result = 0
best_max_depth = 0

for depth in range(1,11):
    dtc_model = DecisionTreeClassifier(random_state=12345, max_depth=depth) # initialize model
    dtc_model.fit(features_train, target_train) # fitting the model on train set
    predictions = dtc_model.predict(features_valid) # obtaining the model predictions on the validation set
    result = accuracy_score(target_valid, predictions) # examine the quality of the model on the validation set by calculating the accuracy metric
    predictions_train = dtc_model.predict(features_train) # obtaining the model predictions on the train set
    result_train = accuracy_score(target_train, predictions_train) # examine the quality of the model on the train set by calculating the accuracy metric
    print(f'Max_depth = {depth} \tAccuracy_valid = {result}, Accuracy_train = {result_train}')
    if result > best_result:
        best_result = result
        best_max_depth = depth
        
print('\nAccuracy of the best Decision Tree model on the validation set:', best_result, '\nTree depth:', best_max_depth)

Max_depth = 1 	Accuracy_valid = 0.7542768273716952, Accuracy_train = 0.7577800829875518
Max_depth = 2 	Accuracy_valid = 0.7822706065318819, Accuracy_train = 0.7878630705394191
Max_depth = 3 	Accuracy_valid = 0.7853810264385692, Accuracy_train = 0.8075726141078838
Max_depth = 4 	Accuracy_valid = 0.7791601866251944, Accuracy_train = 0.8106846473029046
Max_depth = 5 	Accuracy_valid = 0.7791601866251944, Accuracy_train = 0.8200207468879668
Max_depth = 6 	Accuracy_valid = 0.7838258164852255, Accuracy_train = 0.8376556016597511
Max_depth = 7 	Accuracy_valid = 0.7822706065318819, Accuracy_train = 0.8558091286307054
Max_depth = 8 	Accuracy_valid = 0.7791601866251944, Accuracy_train = 0.8625518672199171
Max_depth = 9 	Accuracy_valid = 0.7822706065318819, Accuracy_train = 0.8812240663900415
Max_depth = 10 	Accuracy_valid = 0.7744945567651633, Accuracy_train = 0.8890041493775933

Accuracy of the best Decision Tree model on the validation set: 0.7853810264385692 
Tree depth: 3


On the validation set, the `accuracy` metric ranges from 75.43% to 78.54% as the tree depth changes. The best value, 78.54%, is achieved at a tree depth of 3.

Comparing the scores on the validation and training sets, we see no sharp drop at small depths: at depth 3 the training score exceeds the validation score by about 2 percentage points, so the selected model is not noticeably overfitted. The gap widens steadily as the tree grows deeper and reaches about 11 percentage points at depth 10, while validation accuracy stays roughly flat. This is the expected behaviour of a decision tree, and it suggests that depths beyond those tested here would risk overfitting.
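To put numbers on this comparison, the difference between training and validation accuracy can be computed for each depth. The sketch below uses the accuracy values from the output above, rounded to four decimals; the variable names are chosen for the illustration only.

```python
# Accuracy values copied from the output above, rounded to four decimals
valid_acc = [0.7543, 0.7823, 0.7854, 0.7792, 0.7792,
             0.7838, 0.7823, 0.7792, 0.7823, 0.7745]
train_acc = [0.7578, 0.7879, 0.8076, 0.8107, 0.8200,
             0.8377, 0.8558, 0.8626, 0.8812, 0.8890]

# Gap between training and validation accuracy for each max_depth (1..10)
gaps = {
    depth: round(train - valid, 4)
    for depth, (train, valid) in enumerate(zip(train_acc, valid_acc), start=1)
}

# Depth with the highest validation accuracy
best_depth = max(range(1, 11), key=lambda depth: valid_acc[depth - 1])

for depth, gap in gaps.items():
    print(f"max_depth = {depth:2d}  train - valid = {gap:.4f}")
print("Best depth by validation accuracy:", best_depth)
```

The gap is about 0.02 at depth 3 and about 0.11 at depth 10, while the validation score itself barely moves.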

Next we investigate another classification algorithm: Random Forest Classifier.

### Random Forest Classifier

We will investigate the Random Forest Classifier model by varying the number of trees in the forest, `n_estimators`, from 5 to 55 in increments of 5.

We try eleven values of this hyperparameter and obtain a quality score for each model on the validation set. To check whether a model is overfitted, we compute the score on the training set as well and analyze the difference between the two scores.

In [9]:
best_result = 0
best_n_estimators = 0

for est in range(5,56,5):
    rfc_model = RandomForestClassifier(random_state=12345, n_estimators=est) # initializing the model
    rfc_model.fit(features_train, target_train) # fitting the model on training set
    predictions = rfc_model.predict(features_valid) # obtaining model predictions on validation set
    result = accuracy_score(target_valid, predictions) # examining the accuracy metric on validation set
    predictions_train = rfc_model.predict(features_train) # obtaining model predictions on training set
    result_train = accuracy_score(target_train, predictions_train) # examining the accuracy metric on training set
    print(f'N_estimators = {est} \tAccuracy_valid = {result}, Accuracy_train = {result_train}')
    if result > best_result:
        best_result = result
        best_n_estimators = est
        
print('\nAccuracy of the best Random Forest model on the validation set:', best_result, '\nNumber of trees in the forest:', best_n_estimators)

N_estimators = 5 	Accuracy_valid = 0.749611197511664, Accuracy_train = 0.9678423236514523
N_estimators = 10 	Accuracy_valid = 0.7853810264385692, Accuracy_train = 0.9823651452282157
N_estimators = 15 	Accuracy_valid = 0.7838258164852255, Accuracy_train = 0.9896265560165975
N_estimators = 20 	Accuracy_valid = 0.7869362363919129, Accuracy_train = 0.9891078838174274
N_estimators = 25 	Accuracy_valid = 0.7838258164852255, Accuracy_train = 0.995850622406639
N_estimators = 30 	Accuracy_valid = 0.7838258164852255, Accuracy_train = 0.995850622406639
N_estimators = 35 	Accuracy_valid = 0.7776049766718507, Accuracy_train = 0.9984439834024896
N_estimators = 40 	Accuracy_valid = 0.7838258164852255, Accuracy_train = 0.9974066390041494
N_estimators = 45 	Accuracy_valid = 0.7884914463452566, Accuracy_train = 0.9984439834024896
N_estimators = 50 	Accuracy_valid = 0.7916018662519441, Accuracy_train = 0.9979253112033195
N_estimators = 55 	Accuracy_valid = 0.7853810264385692, Accuracy_train = 0.998962655

On the validation set, the `accuracy` metric ranges from 74.96% to 79.16% as the number of trees changes. The best value, 79.16%, is achieved with 50 trees in the forest.

Comparing the scores on the validation and training sets, we observe a large difference for every model: training accuracy lies between 96.8% and 99.9%, while validation accuracy stays between 75% and 79%. We can conclude that the forest is overfitted to the training data.
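One common way to narrow such a gap, which this project does not try, is to cap the depth of the trees in the forest. The sketch below illustrates the idea on synthetic data generated with `make_classification`; it does not use the Megaline dataset, and the depth of 5 and the helper name `fit_and_score` are arbitrary choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the Megaline dataset): 4 features, noisy labels
X, y = make_classification(n_samples=2000, n_features=4, n_informative=3,
                           n_redundant=1, flip_y=0.2, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

def fit_and_score(max_depth):
    """Fit a 50-tree forest and return (train accuracy, validation accuracy)."""
    forest = RandomForestClassifier(n_estimators=50, max_depth=max_depth,
                                    random_state=12345)
    forest.fit(X_train, y_train)
    return forest.score(X_train, y_train), forest.score(X_valid, y_valid)

full_train, full_valid = fit_and_score(None)     # unrestricted trees
shallow_train, shallow_valid = fit_and_score(5)  # depth capped at 5

print(f"Unrestricted: train = {full_train:.3f}, valid = {full_valid:.3f}")
print(f"max_depth=5:  train = {shallow_train:.3f}, valid = {shallow_valid:.3f}")
```

Whether a depth limit would also improve validation accuracy on the real data is an open question; it would have to be tuned on the validation set together with `n_estimators`.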

Next, we investigate another classification algorithm: Logistic Regression.

### Logistic Regression

Let's investigate the Logistic Regression model.

In [10]:
lr_model = LogisticRegression(random_state=12345) # initializing model
lr_model.fit(features_train, target_train) # fitting model on training set
predictions = lr_model.predict(features_valid) # getting model predictions
result = accuracy_score(target_valid, predictions) # calculating accuracy on validation set

print('Accuracy of the Logistic Regression model:', result)

Accuracy of the Logistic Regression model: 0.7107309486780715


The value of the `accuracy` metric for the Logistic Regression model on the validation set is 71.07%, the lowest among all the models built.
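The features in this dataset live on very different scales (`mb_used` reaches tens of thousands, `messages` only hundreds), which can make the default solver converge slowly. Whether this explains the lower score here was not checked in the project; it is only an assumption. A minimal sketch of standardizing features before Logistic Regression, on synthetic data with made-up `minutes` and `mb_used` columns, could look like this:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(12345)
n_rows = 1000

# Synthetic features on very different scales (NOT the Megaline dataset)
minutes = rng.uniform(0, 1600, n_rows)
mb_used = rng.uniform(0, 50000, n_rows)
X = np.column_stack([minutes, mb_used])

# Synthetic target: heavier users are more likely to be labelled 1
score = minutes / 1600 + mb_used / 50000 + rng.normal(0, 0.2, n_rows)
y = (score > 1.0).astype(int)

# Standardize each feature, then fit Logistic Regression
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(random_state=12345))
scaled_lr.fit(X, y)
train_accuracy = scaled_lr.score(X, y)
print(f"Training accuracy with scaling: {train_accuracy:.3f}")
```

On the real data the same pipeline would be fitted on `features_train` and scored on `features_valid`, so that the scaler statistics come from the training set only.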

### Conclusion

We investigated the quality of models built with three algorithms: "Decision Tree", "Random Forest" and "Logistic Regression".

On the validation set we obtained:
- "Decision Tree": maximum `accuracy` = 78.54%, at a tree depth of 3
- "Random Forest": maximum `accuracy` = 79.16%, with 50 trees in the forest
- "Logistic Regression": `accuracy` = 71.07%

Although Logistic Regression is typically the fastest of the three to train, in our case it is clearly the least accurate (`accuracy` = 71.07%) and falls short of the required 0.75.

The most accurate model, with `accuracy` = 79.16%, is the Random Forest with 50 trees. However, a forest of 50 trees is slower and more resource-consuming to train and run than a single tree. If processing time matters, a Decision Tree with a depth of 3 is a reasonable alternative: its `accuracy` of 78.54% is only about 0.6 percentage points lower.

The conditions of the task envisage maximizing the `accuracy` metric. Accordingly, we further test the model with the maximum value of the `accuracy` metric - the model "Random Forest", with the number of trees equal to 50.  

## Model testing

The model that maximizes the value of the `accuracy` metric is built using the Random Forest algorithm with the number of trees in the forest equal to 50. Let's test the behaviour of this model on a test sample.

In [11]:
model = RandomForestClassifier(random_state=12345, n_estimators=50) # Initializing the model with the maximum value of the accuracy metric
model.fit(features_train, target_train) # fitting the model on training set
predictions = model.predict(features_test) # getting the model predictions on test set
result = accuracy_score(target_test, predictions) # calculating accuracy on test set

print('Accuracy of the best model on the test set:', result)

Accuracy of the best model on the test set: 0.7931570762052877


On the test sample the model reaches an accuracy of 79.32%, slightly above its validation score (79.16%) and above the required threshold of 0.75.
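With only 643 test objects, an accuracy estimate carries some sampling noise. A rough way to gauge it is a 95% interval under the normal approximation to the binomial distribution. This is a back-of-the-envelope sketch, not part of the original analysis; the count of 510 correct answers is derived from the reported score (510 / 643 = 0.79315...).

```python
import math

n_test = 643     # test set size from the split above
n_correct = 510  # 510 / 643 matches the reported test accuracy

accuracy = n_correct / n_test
std_error = math.sqrt(accuracy * (1 - accuracy) / n_test)
margin = 1.96 * std_error  # 95% level under the normal approximation
lower, upper = accuracy - margin, accuracy + margin

print(f"accuracy = {accuracy:.4f}, 95% interval = [{lower:.4f}, {upper:.4f}]")
```

The interval runs from roughly 0.76 to 0.82, so the small difference between the validation and test scores is well within noise, and the lower bound still sits above the 0.75 threshold.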

## Model sanity check

To check the sanity of the model, it is recommended to compare it with a trivial baseline, such as a constant model that always predicts the same class.

In our case, the model should predict the choice with respect to one of the two tariff plans: 0-Smart, 1-Ultra. Let us look at the distribution of the target feature values in the original dataset.

In [12]:
df.is_ultra.value_counts(normalize=True)

0    0.693528
1    0.306472
Name: is_ultra, dtype: float64

In the original dataset, the Smart tariff accounts for 69% of all objects. As a baseline, let us take a constant model that always predicts the Smart tariff. The `accuracy` of such a model would be approximately 69%. The best model we built reaches 79.32% on the test set, about 10 percentage points above the constant model. Thus, we can conclude that our model is adequate.
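The constant baseline can also be built explicitly with scikit-learn's `DummyClassifier`. The sketch below reconstructs a target with the class counts implied by the output above (3214 rows, of which 985 are Ultra); the feature matrix is a placeholder, since the baseline ignores features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Target with the class counts implied by the output above:
# 3214 rows, 985 of them Ultra (985 / 3214 = 0.306472)
target = np.array([0] * 2229 + [1] * 985)
features = np.zeros((len(target), 1))  # placeholder; the baseline ignores features

# Constant model: always predicts the most frequent class (0 = Smart)
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(features, target)
baseline_accuracy = accuracy_score(target, baseline.predict(features))

print(f"Constant-model accuracy: {baseline_accuracy:.4f}")
```

For a like-for-like comparison with the 79.32% figure, the baseline would be fitted on `features_train` and scored on `features_test`; its accuracy there may differ slightly from the whole-dataset value because the class shares in the test sample are not exactly the same.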