# Tariff recommendation

The available data describes the behavior of customers who have already switched tariffs. The task is to build a classification model that selects the appropriate tariff. Data preprocessing is not required - see project 4, tariff_telecom.

The model metric is *accuracy*. The *accuracy* value on the test sample should be at least 0.75.

**Project content:** <br/>
<a href='#first'>1) Examine the data</a> <br/>
<a href='#second'>2) Break the data into samples</a> <br/>
<a href='#third'>3) Explore Models</a> <br/>
<a href='#fourth'>4) Check the model on the test sample</a> <br/>
<a href='#fifth'>5) Adequacy test</a> <br/>
<a href='#sixth'>6) Summary</a> <br/>

<a id='first'></a>
## 1. Examine the data

In [1]:
import pandas as pd 
from sklearn.model_selection import train_test_split 
from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier 
from sklearn.linear_model import LogisticRegression 
from sklearn.metrics import accuracy_score 
from sklearn.dummy import DummyClassifier 


df = pd.read_csv('/datasets/users_behavior.csv') 
df.info() 
df.head() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


In [2]:
df.corr()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| calls | 1.0 | 0.982083 | 0.177385 | 0.286442 | 0.207122 |
| minutes | 0.982083 | 1.0 | 0.17311 | 0.280967 | 0.206955 |
| messages | 0.177385 | 0.17311 | 1.0 | 0.195721 | 0.20383 |
| mb_used | 0.286442 | 0.280967 | 0.195721 | 1.0 | 0.198568 |
| is_ultra | 0.207122 | 0.206955 | 0.20383 | 0.198568 | 1.0 |


In [3]:
df.duplicated().sum()

0

### Conclusion
The first step was to read and examine the data file.<br/>
The data contains no gaps or duplicates, all columns have the correct types, and the column names are clear. No further data preprocessing is required.<br/>

The dataset contains 5 columns:<br/>
`calls` — number of calls,<br/>
`minutes` — total duration of calls in minutes,<br/>
`messages` — number of sms messages,<br/>
`mb_used` — Internet traffic used in Mb,<br/>
-> these four columns are the **features for training**

`is_ultra` - which tariff was used during the month ("Ultra" - 1, "Smart" - 0)<br/>
-> **target feature**
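The checks behind this conclusion can be sketched on a small frame. This is an illustration only: `toy` is a hypothetical DataFrame built from the first four rows shown in `df.head()`, not the full dataset, and the variable names are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy frame: the first four rows from df.head(), same column names.
toy = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0, 106.0],
    "minutes": [311.9, 516.75, 467.66, 745.53],
    "messages": [83.0, 56.0, 86.0, 81.0],
    "mb_used": [19915.42, 22696.96, 21060.45, 8437.39],
    "is_ultra": [0, 0, 0, 1],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Number of rows that fully repeat an earlier row.
duplicate_rows = int(toy.duplicated().sum())

# Share of each tariff in the target column (0 = "Smart", 1 = "Ultra").
class_share = toy["is_ultra"].value_counts(normalize=True)

print(missing_per_column)
print("Duplicates:", duplicate_rows)
print(class_share)
```

Looking at the class shares early is useful because accuracy is only meaningful relative to the share of the largest class.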

<a id='second'></a>
## 2. Break the data into samples

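The data is split into training, validation and test samples in a 3:1:1 ratio. First, 20% of the data is set aside as the test sample; then 25% of the remaining 80% (that is, 20% of the whole dataset) becomes the validation sample.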

In [4]:
features = df.drop(columns='is_ultra')  # all columns except the target
target = df['is_ultra'] 

features_train_1, features_test, target_train_1, target_test = train_test_split(features, target, test_size=0.2, random_state=12345)  
features_train, features_valid, target_train, target_valid = train_test_split(features_train_1, target_train_1, test_size=0.25, random_state=12345) 

display(features_train.shape) #check
display(features_test.shape) #check
features_valid.shape #check

(1928, 4)

(643, 4)

(643, 4)

### Conclusion
At the second step, the features and the target were separated, and the data was split into training, validation and test sets in a 3:1:1 ratio.
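The arithmetic of the two-stage split can be checked without the real data. The sketch below uses placeholder arrays with the same number of rows as the dataset (3214); the names `X`, `y`, `X_rest` and `y_rest` are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: only the number of rows matters here.
X = np.arange(3214).reshape(-1, 1)
y = np.zeros(3214, dtype=int)

# Stage 1: hold out 20% of all rows for the test sample.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345
)

# Stage 2: take 25% of the remaining 80% (20% of the total) for validation.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=12345
)

for name, part in [("train", X_train), ("valid", X_valid), ("test", X_test)]:
    print(name, len(part), round(len(part) / len(X), 3))
```

The resulting sizes should agree with the shapes printed above: 1928 rows for training and 643 each for validation and test.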

<a id='third'></a>
## 3. Explore Models

For the study, the following classification models are proposed:

1) **Decision Tree** - `DecisionTreeClassifier`; <br/>
2) **Random Forest** - `RandomForestClassifier`; <br/>
3) **Logistic Regression** - `LogisticRegression`.

The quality of the models will be assessed by the `accuracy` metric - the ratio of the number of correct answers to the sample size. Quality control will be carried out on the validation sample.
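As a small illustration of the metric, the sketch below computes accuracy by hand and with `accuracy_score` on made-up labels. The arrays `y_true` and `y_pred` are invented for the example and are unrelated to the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: predictions match the truth at positions 0, 1 and 3.
y_true = np.array([0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 1, 0, 0])

# Accuracy by definition: correct answers divided by the sample size.
manual_accuracy = (y_true == y_pred).mean()

# The same value from scikit-learn.
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, library_accuracy)
```

Both values are 3 correct answers out of 5, i.e. 0.6.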

For the tree-based models, the following hyperparameters are enumerated (logistic regression is trained with its default settings): <br/>
a) `max_depth` (depth of the "tree") from 1 to 10;<br/>
b) `n_estimators` (number of "trees") from 10 to 40, (with a step of 5).
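The same enumeration can also be written as a single loop over a parameter grid. The sketch below is one possible variant, not the approach used in this project: it relies on scikit-learn's `ParameterGrid` and on synthetic data from `make_classification`, so the numbers it prints say nothing about the tariff dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic stand-in data with four features (an assumption for the example).
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Every combination of depth 1..10 and 10..40 trees with a step of 5.
grid = ParameterGrid({
    "max_depth": list(range(1, 11)),
    "n_estimators": list(range(10, 41, 5)),
})

best_params, best_accuracy = None, 0.0
for params in grid:
    model = RandomForestClassifier(random_state=0, **params)
    model.fit(X_train, y_train)
    accuracy = accuracy_score(y_valid, model.predict(X_valid))
    # Keep the first combination that reaches the highest validation accuracy.
    if accuracy > best_accuracy:
        best_params, best_accuracy = params, accuracy

print("Combinations tried:", len(grid))
print("Best parameters:", best_params, "accuracy:", best_accuracy)
```

A flat grid avoids nested loops and makes it easy to add another hyperparameter later, at the cost of fitting one model per combination.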

In [5]:
#DecisionTree
best_model_tree = None
best_accuracy = 0
best_depth = 0
for depth in range(1,11): 
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth) 
    model.fit(features_train, target_train) 
    predictions_valid = model.predict(features_valid) 
    accuracy = accuracy_score(target_valid, predictions_valid) 
    if accuracy > best_accuracy: 
        best_model_tree = model
        best_accuracy = accuracy
        best_depth = depth

print("Accuracy best model:", best_accuracy)
print("Max_depth:", best_depth) 

Accuracy best model: 0.7744945567651633
Max_depth: 7


In [6]:
#RandomForest
best_model_forest = None 
best_accuracy = 0
best_depth = 0
best_est = 0
for est in range(10, 41, 5): 
    for depth in range(1,11): 
        model = RandomForestClassifier(random_state=12345, max_depth=depth, n_estimators=est) 
        model.fit(features_train, target_train) 
        predictions_valid = model.predict(features_valid)
        accuracy = accuracy_score(target_valid, predictions_valid) 
        if accuracy > best_accuracy:  
            best_model_forest = model
            best_accuracy = accuracy
            best_depth = depth
            best_est = est

print("Accuracy best model:", best_accuracy)
print("Max_depth:", best_depth,"Estimators:", best_est) 


Accuracy best model: 0.7978227060653188
Max_depth: 10 Estimators: 35


In [7]:
#LogisticRegression
model_regression = LogisticRegression(random_state=12345, solver='lbfgs') # solver to avoid future warning
model_regression.fit(features_train, target_train) 
print("Accuracy best model:", model_regression.score(features_valid, target_valid)) 

Accuracy best model: 0.7262830482115086


### Conclusion
In the third step, three models were explored: hyperparameters were enumerated for the decision tree and the random forest, while logistic regression was trained with default settings. The **Random forest** model with **tree depth 10 and 35 trees** achieved the highest accuracy on the validation sample (0.798). <br/>
This model will be checked on the test sample.

<a id='fourth'></a>
## 4. Check the model on the test sample

In [8]:
predictions_test = best_model_forest.predict(features_test) 
test_accuracy = accuracy_score(target_test, predictions_test) 
print("Accuracy test", test_accuracy) 

Accuracy test 0.7993779160186625


### Conclusion
The task was to build a model with the highest possible accuracy, while the accuracy should not be lower than 0.75.

Step 3 produced such a model on the validation sample, and the check at step 4 confirmed the result: accuracy on the test sample was 0.799.

Thus, the **Random forest model** is proposed with the following parameters: **the depth 10 and the number of "trees" 35**.

<a id='fifth'></a>
## 5. Adequacy test

To check the model for adequacy (a sanity check), its accuracy on the test sample is compared with the accuracy of a Dummy Classifier (`DummyClassifier`). The classifier's `strategy` parameter is set to **most_frequent**, so it predicts the largest class of the training sample for every element. The model passes the adequacy test if its accuracy is higher than that of the Dummy Classifier.
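To make the baseline concrete, the sketch below shows that a `most_frequent` dummy classifier scores exactly the share of the training majority class in the evaluated sample. The labels are made up and the feature arrays are placeholders, since this strategy ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: class 0 is the majority in the training sample.
y_train = np.array([0] * 7 + [1] * 3)
y_test = np.array([0] * 6 + [1] * 4)

# Placeholder features; the most_frequent strategy does not use them.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

# The classifier always predicts class 0, the majority class of y_train.
baseline_accuracy = baseline.score(X_test, y_test)

# Share of class 0 in the test labels.
majority_share_test = (y_test == 0).mean()

print(baseline_accuracy, majority_share_test)
```

Any useful model has to beat this value, which is why the baseline is a natural lower bound for the comparison.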

In [9]:
model_sanity = DummyClassifier(strategy='most_frequent') 
model_sanity.fit(features_train, target_train) 
sanity_accuracy = model_sanity.score(features_test, target_test) 
print("Dummy accuracy", sanity_accuracy) 
print("Accuracy random forest", test_accuracy)  
if sanity_accuracy >= test_accuracy:  
    print('The model did not pass the adequacy test')
else: 
    print('The model did pass the adequacy test') 

Dummy accuracy 0.6951788491446346
Accuracy random forest 0.7993779160186625
The model did pass the adequacy test


In addition to the model chosen at step 3, it is also proposed to test the DecisionTree and LogisticRegression models for adequacy.

In [10]:
test_accuracy_tree = accuracy_score(target_test, best_model_tree.predict(features_test)) #accuracy test for tree
print("Accuracy test:", test_accuracy_tree)
if sanity_accuracy >= test_accuracy_tree: 
    print('DecisionTree did not pass the adequacy test')
else: 
    print('DecisionTree did pass the adequacy test') 
print()

test_accuracy_regression = accuracy_score(target_test, model_regression.predict(features_test)) #accuracy test regression
print("Accuracy test:", test_accuracy_regression) 
if sanity_accuracy >= test_accuracy_regression: 
    print('LogisticRegression did not pass the adequacy test')
else: 
    print('LogisticRegression did pass the adequacy test') 

Accuracy test: 0.7884914463452566
DecisionTree did pass the adequacy test

Accuracy test: 0.7589424572317263
LogisticRegression did pass the adequacy test


### Conclusion
The model used in step 4 has passed the adequacy test.

In addition, the DecisionTree and LogisticRegression models were evaluated on the test sample, and both also passed the test.

<a id='sixth'></a>
## 6. Summary

Thus, the model with the highest *accuracy* value was built and tested: **Random forest** with parameters: **depth of the "tree" 10 and the number of "trees" 35**. The proportion of correct answers was 0.799 on the test sample (0.798 on the validation sample). <br/>
The model also passed the adequacy (sanity) check: its test accuracy exceeds the 0.695 achieved by the most-frequent-class baseline.