# Tariff recommendation

You have data on the behavior of customers who have already switched to these tariffs (from the project for the "Statistical Data Analysis" course). You need to build a classification model that selects the appropriate tariff. Data preprocessing is not required: you have already done it.

Build the model with the highest possible ***accuracy*** value. To pass the project, the share of correct answers must be at least 0.75. Check ***accuracy*** on the test set yourself.
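For reference, accuracy is the share of predictions that match the true labels. A minimal sketch in plain Python (the function name `accuracy` and the toy labels are illustrative and not part of the project data):

```python
def accuracy(true_labels, predicted_labels):
    """Return the share of positions where prediction equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("Both sequences must have the same length")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Toy example: 3 of 4 predictions are correct.
toy_true = [1, 0, 1, 1]
toy_pred = [1, 0, 0, 1]
toy_accuracy = accuracy(toy_true, toy_pred)
print(toy_accuracy)  # 0.75
```

This is the same quantity that `accuracy_score` and the `score` method of scikit-learn classifiers report.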

## Open and examine the file

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.dummy import DummyClassifier
import warnings
warnings.filterwarnings('ignore')

Read the file and save it to the `df` dataset.

In [2]:
df = pd.read_csv('/datasets/users_behavior.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [4]:
df.sample(10)

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 2935 | 88.0 | 642.31 | 170.0 | 44296.86 | 1 |
| 2647 | 61.0 | 429.99 | 43.0 | 11967.96 | 0 |
| 568 | 13.0 | 118.51 | 7.0 | 10174.47 | 0 |
| 1049 | 40.0 | 246.79 | 88.0 | 22284.78 | 0 |
| 3127 | 89.0 | 548.43 | 6.0 | 25307.07 | 1 |
| 2275 | 58.0 | 395.81 | 77.0 | 25859.09 | 0 |
| 2441 | 55.0 | 369.67 | 47.0 | 11584.49 | 1 |
| 2385 | 71.0 | 440.13 | 51.0 | 16070.92 | 0 |
| 54 | 0.0 | 0.0 | 33.0 | 14010.33 | 1 |
| 558 | 161.0 | 1218.67 | 11.0 | 5749.01 | 1 |


The dataset has 3214 rows and 5 columns:
- `calls` - number of calls;
- `minutes` - total duration of calls in minutes;
- `messages` - number of SMS messages;
- `mb_used` - Internet traffic used, in MB;
- `is_ultra` - which tariff was used during the month ("Ultra" - 1, "Smart" - 0).

No data preprocessing is required: there are no missing values. Let's move on to splitting the dataset into samples.
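Both the absence of gaps and the class balance can be checked directly. A minimal sketch, assuming a tiny hand-built frame (`sample_df`, filled with four rows copied from the sample output above) in place of the full `df`:

```python
import pandas as pd

# Four rows copied from the df.sample(10) output above; the real check
# would run on the full `df` instead of this illustrative frame.
sample_df = pd.DataFrame({
    "calls": [88.0, 61.0, 13.0, 40.0],
    "minutes": [642.31, 429.99, 118.51, 246.79],
    "messages": [170.0, 43.0, 7.0, 88.0],
    "mb_used": [44296.86, 11967.96, 10174.47, 22284.78],
    "is_ultra": [1, 0, 0, 0],
})

# Number of missing values per column (all zeros means no gaps).
missing_per_column = sample_df.isna().sum()
print(missing_per_column)

# Share of each tariff; an imbalanced target affects how accuracy is read.
class_share = sample_df["is_ultra"].value_counts(normalize=True)
print(class_share)
```

Running the same two calls on `df` would show how many rows belong to each tariff, which matters later when the model is compared with a constant baseline.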

## Split the data into samples

First, let's split our dataset into features (`calls`, `minutes`, `messages`, `mb_used`) and a target feature (`is_ultra`).

In [5]:
features = df.drop(['is_ultra'], axis=1)
target = df['is_ultra']

Now let's divide our dataset into training, test and validation samples in the ratio of 60:20:20.

First, we select the training and test samples in the ratio of 80:20.

In [6]:
train_features, test_features, train_target, test_target = train_test_split(
    features, target, test_size=0.2, random_state=12345)

Then we set aside 25% of the resulting training sample as the validation sample. We use 25% because the validation sample is taken from the remaining 80% of the data rather than from the original dataset, and 25% of 80% is 20% of the original.

In [7]:
train_features, valid_features, train_target, valid_target = train_test_split(
    train_features, train_target, test_size=0.25, random_state=12345)

In [8]:
print('Training sample size:', len(train_target))
print('Test sample size:', len(test_target))
print('Validation sample size:', len(valid_target))

Training sample size: 1928
Test sample size: 643
Validation sample size: 643


Dividing the dataset into 3 samples went well. Let's start training the models.
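The arithmetic behind the two-step split can be checked without any data. The sketch below assumes that `train_test_split` rounds a fractional test size up to the next whole row, which agrees with the sizes printed above; the helper name `two_step_split_sizes` is illustrative.

```python
import math


def two_step_split_sizes(n_rows, test_share=0.2, valid_share_of_rest=0.25):
    """Return (train, valid, test) sizes for a two-step hold-out split."""
    # Step 1: hold out the test set from the full dataset.
    n_test = math.ceil(n_rows * test_share)
    n_rest = n_rows - n_test
    # Step 2: hold out the validation set from the remaining rows.
    n_valid = math.ceil(n_rest * valid_share_of_rest)
    n_train = n_rest - n_valid
    return n_train, n_valid, n_test


train_size, valid_size, test_size = two_step_split_sizes(3214)
print(train_size, valid_size, test_size)  # 1928 643 643

# 25% of the remaining 80% is 20% of the original dataset.
print(0.25 * 0.8)  # 0.2
```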

## Explore Models

Since we are faced with the task of classification, we will explore three machine learning models:
- Decision tree;
- Random forest;
- Logistic regression.

### Decision tree

In [9]:
# Train the model with default parameters
model = DecisionTreeClassifier(random_state=12345)
model.fit(train_features, train_target)
predictions = model.predict(valid_features)
accuracy_score(valid_target, predictions)

0.7122861586314152

The share of correct answers is only 0.71, which is below the required 0.75. Let's see how the quality changes when we vary two hyperparameters: the tree depth `max_depth` and the minimum number of objects per leaf `min_samples_leaf`. For the CART algorithm (Classification And Regression Trees), good values of `min_samples_leaf` tend to lie in the range from 1 to 20, and this parameter tends to have a strong influence on the performance of the resulting trees. The loop below tries depths from 1 to 9 and leaf sizes from 1 to 19.

In [10]:
best_tree = None
best_result = 0
for depth in range(1, 10):
    for leaf in range(1, 20):
        tree_model = DecisionTreeClassifier(random_state=12345, max_depth=depth, min_samples_leaf=leaf)
        tree_model.fit(train_features, train_target)
        tree_predictions = tree_model.predict(valid_features)
        result = accuracy_score(valid_target, tree_predictions)
        if result > best_result:
            best_tree = tree_model
            best_result = result
            best_depth = depth
            best_leaf = leaf
print('Best result:', best_result, 'with tree depth', best_depth, 'and with the number of objects in leaf', best_leaf)

Best result: 0.7916018662519441 with tree depth 7 and with the number of objects in leaf 19


We reached an accuracy of 0.79 on the validation set with a tree depth of 7, which clears the 0.75 quality bar. The leaf count of 19 in the recorded output should be treated with caution: that run printed the loop variable `leaf`, which always ends at 19, rather than `best_leaf`.
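The same search can be written so that the best hyperparameters are stored together with the score, which avoids reporting a loop variable by mistake. This is a sketch on synthetic data from `make_classification` (an assumption, standing in for the project's training and validation samples); the names `search_tree` and `best` are illustrative.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the project data (4 features, binary target).
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)


def search_tree(depths, leaves):
    """Try every (max_depth, min_samples_leaf) pair; return the best one."""
    best = {"score": 0.0, "max_depth": None, "min_samples_leaf": None}
    for depth, leaf in product(depths, leaves):
        model = DecisionTreeClassifier(
            random_state=12345, max_depth=depth, min_samples_leaf=leaf)
        model.fit(X_train, y_train)
        score = accuracy_score(y_valid, model.predict(X_valid))
        if score > best["score"]:
            # Store the parameters at the moment the best score is updated.
            best = {"score": score, "max_depth": depth,
                    "min_samples_leaf": leaf}
    return best


best = search_tree(depths=range(1, 10), leaves=range(1, 20))
print(best)
```

Keeping the score and the parameters in one dictionary means they are always updated together.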

### Random forest

Now consider another model, a random forest. Let's loop through the number of trees from 1 to 29 to select the best value of `n_estimators`.

In [11]:
best_forest = None
best_result = 0
for est in range(1, 30):
    forest_model = RandomForestClassifier(random_state=12345, n_estimators=est)
    forest_model.fit(train_features, train_target)
    result = forest_model.score(valid_features, valid_target)
    if result > best_result:
        best_forest = forest_model
        best_result = result
        best_est = est
print('Best result:', best_result, 'with the number of trees', best_est)

Best result: 0.7931570762052877 with the number of trees 22


As you can see, the random forest reaches practically the same validation accuracy of 0.79 as the decision tree. A random forest is also slower to train and to predict than a single tree, so for the time being we prefer the decision tree.
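The speed difference can be measured rather than assumed. A rough sketch with `time.perf_counter` on synthetic data (an assumption; the timings depend on the machine and say nothing definitive about the project dataset):

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the project data (4 features, binary target).
X, y = make_classification(n_samples=2000, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)


def fit_predict_seconds(model):
    """Return wall-clock seconds spent on fitting and predicting once."""
    start = time.perf_counter()
    model.fit(X, y)
    model.predict(X)
    return time.perf_counter() - start


tree_seconds = fit_predict_seconds(
    DecisionTreeClassifier(random_state=12345, max_depth=7))
forest_seconds = fit_predict_seconds(
    RandomForestClassifier(random_state=12345, n_estimators=22))

print(f"tree: {tree_seconds:.4f} s, forest: {forest_seconds:.4f} s")
```

A single run like this is noisy; repeating the measurement several times would give a steadier comparison.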

### Logistic regression

Finally, let's check the accuracy of our last model, logistic regression. It is the simplest of the three models, has few parameters, and is less prone to overfitting.

In [12]:
reg_model = LogisticRegression(random_state=12345)
reg_model.fit(train_features, train_target)
result = reg_model.score(valid_features, valid_target)
print('Logistic Regression Accuracy:', result)

Logistic Regression Accuracy: 0.7262830482115086


The accuracy of logistic regression on the validation set is only 0.73, which is below the minimum threshold of 0.75.
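By default, scikit-learn's `LogisticRegression` applies regularization and uses an iterative solver, so it can be sensitive to feature scale, and here `mb_used` is in the tens of thousands while `calls` is in the tens. One thing that could be tried, which this notebook does not do, is standardizing the features first. A sketch on synthetic data with deliberately mismatched scales (the data and the name `scaled_model` are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)
# Blow up one column to mimic a feature measured in much larger units.
X = X * np.array([1.0, 1.0, 1.0, 10000.0])

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# StandardScaler is fitted on the training part only, inside the pipeline.
scaled_model = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=12345, max_iter=1000))
scaled_model.fit(X_train, y_train)

scaled_accuracy = scaled_model.score(X_valid, y_valid)
print("Validation accuracy with scaling:", scaled_accuracy)
```

Whether scaling helps on the tariff data is an open question that only a run on the real samples can answer.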

### Section Conclusion

We evaluated the quality of the three trained models on the validation set and found that:
1. The accuracy of the decision tree is 0.79.
2. The accuracy of the random forest is also 0.79.
3. The accuracy of the logistic regression is 0.73, which is below the minimum accuracy threshold of 0.75.
4. The decision tree and the random forest have practically the same accuracy, but the decision tree works faster, so for now we prefer this model.

## Check the model on the test set

Let's check the quality of all three models on the test sample; the results may differ from those on the validation sample.

In [13]:
#decision tree
print('The quality of the decision tree on the test set:', best_tree.score(test_features, test_target))

#random forest
print('The quality of the random forest on the test set:', best_forest.score(test_features, test_target))

#logistic regression
print('The quality of the logistic regression on the test sample:', reg_model.score(test_features, test_target))

The quality of the decision tree on the test set: 0.7931570762052877
The quality of the random forest on the test set: 0.7838258164852255
The quality of the logistic regression on the test sample: 0.7589424572317263


<font size="5"><b>Conclusion:</b> </font>

So, the test set confirms the earlier conclusion: the best trained model is a <b>decision tree with an accuracy of 0.79.</b> The random forest reaches 0.78 and the logistic regression 0.76 on the test set.
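Accuracy alone does not show how well each tariff is predicted. The notebook imports `classification_report` but never calls it; a sketch of how it could be used, with toy labels standing in for `test_target` and the model's predictions (both lists are made up for illustration):

```python
from sklearn.metrics import accuracy_score, classification_report

# Toy labels: 0 = "Smart", 1 = "Ultra". These are not project results.
toy_target = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
toy_predictions = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

toy_accuracy = accuracy_score(toy_target, toy_predictions)
report = classification_report(
    toy_target, toy_predictions, target_names=["Smart", "Ultra"])

print("Accuracy:", toy_accuracy)
print(report)
```

The report lists precision, recall and F1 for each class, which helps to see whether the rarer tariff is predicted noticeably worse than the more common one.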

## Checking the model for adequacy

To check the model for adequacy, we use the `DummyClassifier` class.

`DummyClassifier` is a classifier that makes predictions using simple rules and ignores the input features.

Available prediction strategies:
- "stratified": generates random predictions that follow the class distribution of the training sample.
- "most_frequent": always predicts the most frequent label in the training set.
- "prior": always predicts the class that maximizes the class prior (like "most_frequent"), and `predict_proba` returns the class prior.
- "uniform": generates predictions uniformly at random.
- "constant": always predicts a constant label provided by the user. This is useful for metrics that evaluate a non-majority class.
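For intuition, the `most_frequent` strategy amounts to remembering the majority label of the training target and predicting it for every row. A plain-Python sketch of that idea (the helper `majority_baseline_accuracy` and the toy labels are illustrative, not the scikit-learn implementation):

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(label == majority_label for label in test_labels)
    return hits / len(test_labels)


# Toy labels: class 0 dominates the training part.
toy_train = [0, 0, 0, 1, 0, 1, 0, 0]
toy_test = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]

baseline = majority_baseline_accuracy(toy_train, toy_test)
print(baseline)  # 0.7
```

In other words, the accuracy of this baseline equals the share of the majority class in the test set, so a useful model has to beat that share.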

In [14]:
strategy = ['most_frequent', 'prior', 'stratified', 'uniform']
for index in strategy:
    dummy_clf = DummyClassifier(strategy=index, random_state=12345)
    dummy_clf.fit(train_features, train_target)
    score = dummy_clf.score(test_features, test_target)
    print('Accuracy for strategy', index, ':', score)

Accuracy for strategy most_frequent : 0.6951788491446346
Accuracy for strategy prior : 0.6951788491446346
Accuracy for strategy stratified : 0.5754276827371695
Accuracy for strategy uniform : 0.48989113530326595


As you can see, none of the baseline strategies reaches the accuracy of our decision tree: the best baseline, `most_frequent`, scores about 0.70 on the test set against 0.79 for the tree. This means that our model works adequately.