# Mobile tariff recommendation

You have data on the behavior of customers who have already switched to the new mobile tariffs. The task is to build a classification model that selects the right tariff. No preprocessing is needed because it has already been done.

Build a model with the highest possible accuracy. To pass the project, the share of correct answers must be at least 0.75. Check the accuracy on the test sample yourself.

## Introduction

**Project Description**

Mobile operator Megaline has found that many customers still use archived tariffs. It wants to build a system that analyzes customer behavior and recommends one of its new tariffs: "Smart" or "Ultra".

You have data on the behavior of customers who have already switched to these tariffs.

**Goals**

Build a classification model that selects the appropriate tariff. The data does not need preprocessing because it has already been done.

**Objectives**

Build a model with the highest possible accuracy. To pass the project, the share of correct answers must be at least 0.75. Check the accuracy on the test sample yourself.

**Data Description**

Each object in the dataset is information about the behavior of one user per month:

`calls` - number of calls

`minutes` - total duration of calls in minutes

`messages` - number of SMS messages

`mb_used` - used Internet traffic in MB

`is_ultra` - tariff plan used during the month ("Ultra" - 1, "Smart" - 0)

## Data preprocessing

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score


In [None]:
df = pd.read_csv('/datasets/users_behavior.csv')
df.head()

In [None]:
df.info()

**Split the data into samples**

In [None]:
# Hold out 40% of the rows, then split the hold-out in half:
# 60% training, 20% test, 20% validation.
df_train, df_holdout = train_test_split(df, test_size=0.4, random_state=12345)
df_test, df_valid = train_test_split(df_holdout, test_size=0.5, random_state=12345)
print(df_train.shape)
print(df_test.shape)
print(df_valid.shape)
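
The two-step call above yields a 60/20/20 split. A minimal sketch of the same pattern on a synthetic frame follows. The `demo` data and the `stratify` argument are assumptions added for illustration (the notebook itself does not stratify); stratifying keeps the share of `is_ultra` similar across the three parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real dataset: 1000 rows, roughly 30% "Ultra".
rng = np.random.default_rng(12345)
demo = pd.DataFrame({
    "minutes": rng.normal(400, 100, size=1000),
    "is_ultra": (rng.random(1000) < 0.3).astype(int),
})

# Step 1: hold out 40% of the rows. Step 2: halve the hold-out.
demo_train, demo_rest = train_test_split(
    demo, test_size=0.4, random_state=12345, stratify=demo["is_ultra"])
demo_valid, demo_test = train_test_split(
    demo_rest, test_size=0.5, random_state=12345, stratify=demo_rest["is_ultra"])

split_sizes = {"train": len(demo_train),
               "valid": len(demo_valid),
               "test": len(demo_test)}
print(split_sizes)

# Share of the positive class in each part (close to each other when stratified)
for name, part in [("train", demo_train), ("valid", demo_valid), ("test", demo_test)]:
    print(name, round(part["is_ultra"].mean(), 3))
```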

## Model training

**Separate the features from the target**

In [None]:
features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']
features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']
features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']

### Decision Tree

**Adjust the max_depth hyperparameter**

In [None]:
best_model = None
best_result = 0
valid_acc, test_acc = [], []
for depth in range(1, 11):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid)
    result = accuracy_score(target_valid, predictions_valid)
    # Keep the model with the best validation accuracy
    if result > best_result:
        best_model = model
        best_result = result

    # Track accuracy on the validation and test samples for the plot
    valid_acc.append(result)
    test_acc.append(accuracy_score(target_test, model.predict(features_test)))

# Evaluate the best model (not the last one fitted) on the test sample
predictions_test = best_model.predict(features_test)
accuracy_test = accuracy_score(target_test, predictions_test)
print("Best validation accuracy:", best_result)
print("Test accuracy of the best model:", accuracy_test)

In [None]:
# The x-axis is the loop position (0 corresponds to max_depth=1)
df_tree1 = pd.DataFrame(list(zip(valid_acc, test_acc)), columns=['valid', 'test'])
df_tree1.plot(grid=True)

plt.show()

**Conclusion:**

1. Increasing `max_depth` increases classification accuracy on the training sample.
2. Beyond a certain depth, further increases reduce accuracy on the test sample because the tree starts to overfit.
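
The overfitting pattern described here can be checked by recording training accuracy alongside validation accuracy. The sketch below does this on synthetic data from `make_classification`; the dataset, the depth range and the names `depth_scores` and `best_depth` are assumptions for illustration, not part of the project data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy binary problem (flip_y randomly reassigns part of the labels)
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, flip_y=0.2, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

depth_scores = []  # rows of (depth, training accuracy, validation accuracy)
for depth in range(1, 16):
    tree = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    tree.fit(X_train, y_train)
    depth_scores.append((
        depth,
        accuracy_score(y_train, tree.predict(X_train)),
        accuracy_score(y_valid, tree.predict(X_valid)),
    ))

# Choose the depth by validation accuracy, never by training accuracy
best_depth, _, best_valid_acc = max(depth_scores, key=lambda row: row[2])
for depth, train_acc, valid_acc in depth_scores:
    print(depth, round(train_acc, 3), round(valid_acc, 3))
print("Best depth by validation accuracy:", best_depth)
```

A widening gap between the two columns as the depth grows is the usual sign of overfitting.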

**Adjust the min_samples_leaf hyperparameter**

In [None]:


best_model = None
best_result = 0
valid_acc, test_acc = [], []
for min_samples_leaf in [1, 2, 10, 15, 16, 17]:
    # Create a tree with the given minimum number of samples per leaf
    model = DecisionTreeClassifier(random_state=12345, min_samples_leaf=min_samples_leaf)
    model.fit(features_train, target_train)  # train the model
    predictions_valid = model.predict(features_valid)  # predict on the validation sample
    result = accuracy_score(target_valid, predictions_valid)  # validation accuracy

    # Keep the model with the best validation accuracy
    if result > best_result:
        best_model = model
        best_result = result

    valid_acc.append(result)
    test_acc.append(accuracy_score(target_test, model.predict(features_test)))

# Evaluate the best model (not the last one fitted) on the test sample
predictions_test = best_model.predict(features_test)
accuracy_test = accuracy_score(target_test, predictions_test)
print("Best validation accuracy:", best_result)
print("Test accuracy of the best model:", accuracy_test)

In [None]:
# The x-axis is the position in the list of tried values, not the value itself
df_tree2 = pd.DataFrame(list(zip(valid_acc, test_acc)), columns=['valid', 'test'])
df_tree2.plot(grid=True)
plt.show()

**Conclusion:**

As `min_samples_leaf` increases, accuracy on the validation sample first rises, then falls, then rises again.

Increasing `min_samples_leaf` is therefore one way to combat overfitting in decision trees.

**Adjust min_samples_split hyperparameter**

In [None]:
best_model = None
best_result = 0
valid_acc, test_acc = [], []
for min_samples_split in [2, 5, 10, 20, 25, 30]:
    model = DecisionTreeClassifier(random_state=12345, min_samples_split=min_samples_split)
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid)
    result = accuracy_score(target_valid, predictions_valid)
    # Keep the model with the best validation accuracy
    if result > best_result:
        best_model = model
        best_result = result

    valid_acc.append(result)
    test_acc.append(accuracy_score(target_test, model.predict(features_test)))

# Evaluate the best model (not the last one fitted) on the test sample
predictions_test = best_model.predict(features_test)
accuracy_test = accuracy_score(target_test, predictions_test)
print("Best validation accuracy:", best_result)
print("Test accuracy of the best model:", accuracy_test)

In [None]:
# The x-axis is the position in the list of tried values, not the value itself
df_tree3 = pd.DataFrame(list(zip(valid_acc, test_acc)), columns=['valid', 'test'])
df_tree3.plot(grid=True)

plt.show()

**Conclusion:**

Similarly, as `min_samples_split` increases, accuracy on the validation sample first rises, then falls, then rises again. Increasing `min_samples_split` is therefore another way to combat overfitting in decision trees.
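
The three tree hyperparameters were tuned one at a time, so their interactions were not explored. One alternative (not used in this notebook) is a joint search with scikit-learn's `GridSearchCV`, which scores every combination by cross-validation. The sketch below assumes synthetic data and an illustrative `param_grid`.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training features and target
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, flip_y=0.2, random_state=12345)

# Illustrative grid: 3 x 3 x 3 = 27 combinations
param_grid = {
    "max_depth": [3, 5, 8],
    "min_samples_leaf": [1, 10, 20],
    "min_samples_split": [2, 10, 30],
}

# 5-fold cross-validation replaces the single validation sample
search = GridSearchCV(DecisionTreeClassifier(random_state=12345),
                      param_grid, scoring="accuracy", cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Mean cross-validated accuracy:", round(search.best_score_, 3))
```

With cross-validation, the separate validation sample could be merged back into the training data, leaving the test sample for the final check only.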

### Random Forest

**Adjust the n_estimators hyperparameter**

In [None]:
best_model = None
best_result = 0
valid_acc, test_acc = [], []
for est in range(1, 11):
    model = RandomForestClassifier(random_state=12345, n_estimators=est)
    model.fit(features_train, target_train)
    result = model.score(features_valid, target_valid)  # validation accuracy
    # Keep the model with the best validation accuracy
    if result > best_result:
        best_model = model
        best_result = result

    valid_acc.append(result)
    test_acc.append(accuracy_score(target_test, model.predict(features_test)))

# Evaluate the best model (not the last one fitted) on the test sample
predictions_test = best_model.predict(features_test)
accuracy_test = accuracy_score(target_test, predictions_test)
print("Best validation accuracy:", best_result)
print("Test accuracy of the best model:", accuracy_test)

In [None]:
# The x-axis is the loop position (0 corresponds to n_estimators=1)
df_forest1 = pd.DataFrame(list(zip(valid_acc, test_acc)), columns=['valid', 'test'])
df_forest1.plot(grid=True)
plt.show()

**Conclusion:**

The `n_estimators` parameter sets the number of trees in the random forest. In this run, accuracy is noticeably higher when the number of trees is odd.

**Adjust the max_depth hyperparameter**

In [None]:
best_model = None
best_result = 0
valid_acc, test_acc = [], []
for depth in range(1, 11):
    model = RandomForestClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)
    result = model.score(features_valid, target_valid)  # validation accuracy
    # Keep the model with the best validation accuracy
    if result > best_result:
        best_model = model
        best_result = result

    valid_acc.append(result)
    test_acc.append(accuracy_score(target_test, model.predict(features_test)))

# Evaluate the best model (not the last one fitted) on the test sample
predictions_test = best_model.predict(features_test)
accuracy_test = accuracy_score(target_test, predictions_test)
print("Best validation accuracy:", best_result)
print("Test accuracy of the best model:", accuracy_test)

In [None]:
# The x-axis is the loop position (0 corresponds to max_depth=1)
df_forest2 = pd.DataFrame(list(zip(valid_acc, test_acc)), columns=['valid', 'test'])
df_forest2.plot(grid=True)
plt.show()

**Conclusion:**

1. Increasing `max_depth` increases classification accuracy on the training sample.
2. Beyond a certain depth, further increases reduce accuracy on the test sample because the model starts to overfit.

**Adjust the min_samples_leaf hyperparameter**

In [None]:
best_model = None
best_result = 0
valid_acc, test_acc = [], []
for min_samples_leaf in range(1, 11):
    model = RandomForestClassifier(random_state=12345, min_samples_leaf=min_samples_leaf)
    model.fit(features_train, target_train)
    result = model.score(features_valid, target_valid)  # validation accuracy
    # Keep the model with the best validation accuracy
    if result > best_result:
        best_model = model
        best_result = result

    valid_acc.append(result)
    test_acc.append(accuracy_score(target_test, model.predict(features_test)))

# Evaluate the best model (not the last one fitted) on the test sample
predictions_test = best_model.predict(features_test)
accuracy_test = accuracy_score(target_test, predictions_test)
print("Best validation accuracy:", best_result)
print("Test accuracy of the best model:", accuracy_test)

In [None]:
# The x-axis is the loop position (0 corresponds to min_samples_leaf=1)
df_forest3 = pd.DataFrame(list(zip(valid_acc, test_acc)), columns=['valid', 'test'])
df_forest3.plot(grid=True)
plt.show()

**Conclusion:**

As `min_samples_leaf` increases, accuracy on the validation sample first rises, then falls, then rises again. Increasing `min_samples_leaf` is therefore one way to improve the accuracy of the model and to combat overfitting in tree-based models.

### Logistic regression

**Let's study the influence of different solvers on the accuracy of the model**

In [None]:
model = LogisticRegression(random_state=12345, solver='lbfgs', max_iter=1000)
model.fit(features_train, target_train)
result = model.score(features_train, target_train)  # accuracy on the training sample
print("Train sample:", result)

predictions_test = model.predict(features_test)
accuracy_test = accuracy_score(target_test, predictions_test)
print("Test sample:", accuracy_test)

In [None]:
model = LogisticRegression(random_state=12345, solver='liblinear', max_iter=1000)
model.fit(features_train, target_train)
result = model.score(features_train, target_train)  # accuracy on the training sample
print("Train sample:", result)

predictions_test = model.predict(features_test)
accuracy_test = accuracy_score(target_test, predictions_test)
print("Test sample:", accuracy_test)

In [None]:
model = LogisticRegression(random_state=12345, solver='newton-cg', max_iter=1000)
model.fit(features_train, target_train)
result = model.score(features_train, target_train)  # accuracy on the training sample
print("Train sample:", result)

predictions_test = model.predict(features_test)
accuracy_test = accuracy_score(target_test, predictions_test)
print("Test sample:", accuracy_test)

**Conclusion:**

- Accuracy depends on the chosen algorithm, as does the shape of the accuracy curve across hyperparameter values.
- The random forest model shows the best accuracy.
- Tuning the hyperparameters changes the accuracy of a model and helps to counter overfitting.
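
To compare the three model families on equal terms, each can be fitted on the same training sample and scored on the same validation sample. The sketch below assumes synthetic data and illustrative hyperparameter values; the `candidates` dictionary and the chosen settings are not the notebook's tuned models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the project data
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, flip_y=0.1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Illustrative settings, one candidate per model family
candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=12345, max_depth=5),
    "random_forest": RandomForestClassifier(random_state=12345, n_estimators=50),
    "logistic_regression": LogisticRegression(random_state=12345, max_iter=1000),
}

valid_scores = {}
for name, clf in candidates.items():
    clf.fit(X_train, y_train)
    valid_scores[name] = accuracy_score(y_valid, clf.predict(X_valid))

best_name = max(valid_scores, key=valid_scores.get)
for name, score in valid_scores.items():
    print(name, round(score, 3))
print("Best on validation:", best_name)
```

Only the model selected this way would then be scored once on the test sample, so that the test accuracy stays an unbiased estimate.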

**Check the model for adequacy**

In [None]:
df_train['is_ultra'].count()

In [None]:
df_train[df_train['is_ultra'] == 0]['is_ultra'].count()

In [None]:
df_train[df_train['is_ultra'] == 1]['is_ultra'].count()

In [None]:
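# Expected accuracy of a model that guesses each class with probability 0.5,
# using hard-coded class counts (593 + 1335 = 1928 training rows)
(0.5*593 + 0.5*1335)/1928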

**Conclusion:**
- A random model that picks either class with probability 0.5 has an expected accuracy of 0.5.
- The trained models exceed this baseline, so they are adequate.
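
A uniform random guess scores 0.5 regardless of class balance, so it is a weak baseline. With the class counts used above, always predicting the larger class would already score 1335/1928 ≈ 0.69, which is the stricter bar to beat. Both baselines are available through scikit-learn's `DummyClassifier`. The sketch below uses synthetic imbalanced data as an assumption; the forest settings are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with roughly a 70/30 class balance
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, weights=[0.7, 0.3], flip_y=0.05,
                           random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y)

# "uniform" guesses at random; "most_frequent" always predicts the larger class
baseline_scores = {}
for strategy in ("uniform", "most_frequent"):
    dummy = DummyClassifier(strategy=strategy, random_state=12345)
    dummy.fit(X_train, y_train)
    baseline_scores[strategy] = dummy.score(X_test, y_test)

forest = RandomForestClassifier(n_estimators=50, random_state=12345)
forest.fit(X_train, y_train)
forest_score = forest.score(X_test, y_test)

print("Baselines:", baseline_scores)
print("Random forest:", round(forest_score, 3))
```

A model is worth keeping only if it clearly beats the `most_frequent` baseline, not just the 0.5 of a coin flip.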

## Final conclusion

The random forest model gives the most accurate results.