In [1]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import mean_squared_error
import sklearn.metrics

## <a name="0.0"></a>Contents (clickable):
* [0. Project description, data description, library import](#0.)
* [1. (Step 1) Familiarization with the data](#1.)
    
* [2. (Step 2) Preparing data for building models](#Шаг_2)
     
* [3. (Step 3) Determining the best model and hyperparameters](#Шаг_3)
 
     - [General conclusion](#Общий_вывод:)

## Project Description<a name="0."></a>
<font size="2">([to the content](#0.0))</font>

The mobile operator Megaline found out that many customers use archive tariffs. They want to build a system capable of analyzing customer behavior and offering users a new tariff: "Smart" or "Ultra".<br/><br/>
You have at your disposal data on the behavior of customers who have already switched to these tariffs (from the course project "Statistical Data Analysis"). You need to build a model for the classification problem that will choose the appropriate tariff. You won't need data preprocessing — you've already done it.<br/><br/>
Build a model with the maximum accuracy value. To pass the project successfully, you need to bring the proportion of correct answers to at least 0.75. Check accuracy on the test sample yourself.

Each object in the dataset describes the behavior of one user over one month. The following is known:
- calls: number of calls,
- minutes: total duration of calls in minutes,
- messages: number of SMS messages,
- mb_used: internet traffic used, in MB,
- is_ultra: which tariff was used during the month ("Ultra" is 1, "Smart" is 0).


<b> Instructions for the implementation of the project</b>
- Open the data file and examine it. File path: /datasets/users_behavior.csv.
- Divide the source data into training, validation and test samples.
- Explore the quality of different models by changing hyperparameters. Briefly write the conclusions of the study.
- Check the quality of the model on a test sample.
- Additional task: check the models for sanity. It's okay if it doesn't work out: this data is more complex than the ones you have worked with before. We will tell you more about this in the next course.

<br/><a name="1."></a>
## Step 1. Familiarization with the data
<font size="2">([to the content](#0.0))</font>

In [2]:
try:
    # local copy of the dataset
    df_1 = pd.read_csv('users_behavior.csv')
except FileNotFoundError:
    # fall back to the path used on the course platform
    df_1 = pd.read_csv('/datasets/users_behavior.csv')

In [3]:
df_1.head()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


In [4]:
df_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [5]:
df_1.describe()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


## Step 2 <a name="Шаг_2"></a>
<font size="2">([to the content](#0.0))</font>

<b>Preparing data for building models:</b>
* defining the features and the target
* using the train_test_split function, I will split the data into training, validation and test samples in a 6/2/2 ratio

In [6]:
features = df_1.drop('is_ultra', axis = 1)
target = df_1['is_ultra']

In [7]:
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.4, random_state=12345)

In [8]:
features_valid, features_test, target_valid, target_test = train_test_split(
    features_valid, target_valid, test_size=0.5, random_state=12345)
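
The two-stage split above can be wrapped in a small helper and checked for the expected 6/2/2 proportions. This is a minimal sketch, not the notebook's own code: the helper name `split_60_20_20`, the synthetic stand-in data, and the `stratify` argument (which keeps the share of `is_ultra` similar in every sample) are assumptions added for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_60_20_20(features, target, seed=12345):
    """Split into train/valid/test at roughly 60/20/20, stratified by target."""
    # First cut: 60% train, 40% held out.
    f_train, f_rest, t_train, t_rest = train_test_split(
        features, target, test_size=0.4, random_state=seed, stratify=target)
    # Second cut: the held-out 40% is halved into validation and test.
    f_valid, f_test, t_valid, t_test = train_test_split(
        f_rest, t_rest, test_size=0.5, random_state=seed, stratify=t_rest)
    return (f_train, t_train), (f_valid, t_valid), (f_test, t_test)


# Synthetic stand-in for users_behavior.csv (hypothetical values).
rng = np.random.default_rng(0)
n_rows = 1000
demo_features = pd.DataFrame({
    'calls': rng.integers(0, 200, n_rows),
    'mb_used': rng.uniform(0, 40000, n_rows),
})
demo_target = pd.Series((rng.random(n_rows) < 0.3).astype(int), name='is_ultra')

train, valid, test = split_60_20_20(demo_features, demo_target)
split_sizes = [len(part[0]) for part in (train, valid, test)]
print(split_sizes)
```

Checking the sizes right after splitting is a cheap way to catch a swapped `test_size` before any model is trained.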

## Step 3 <a name="Шаг_3"></a>

<font size="2">([to the content](#0.0))</font>

<b>Determining the best model and hyperparameters:</b>
- for the DecisionTreeRegressor and RandomForestRegressor models, I will calculate accuracy with a maximum tree depth from 1 to 9
- for RandomForestRegressor, I will calculate accuracy for a number of trees in the forest from 10 to 90 in increments of 10
- for LinearRegression, I will calculate accuracy once, since no hyperparameters are varied
- since all three models are regressors, their predictions are converted to class labels by rounding before accuracy is calculated
- I will determine the maximum accuracy value across all models and hyperparameters
- I will print the data on the model with the maximum accuracy on the validation sample
- I will print the results on the test sample for the best model
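
Because the task is a classification problem, the same search can also be run with the classifier counterparts of these models, which output class labels directly and need no rounding. The following is a minimal sketch of that alternative, not the approach used in this notebook: the synthetic data, the reduced grid, and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data (hypothetical, not the Megaline dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=600) > 0.8).astype(int)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Build the candidate grid: one tree and three forest sizes per depth.
candidates = []
for depth in range(1, 6):
    candidates.append(
        ('tree', depth, None,
         DecisionTreeClassifier(max_depth=depth, random_state=12345)))
    for n_trees in (10, 30, 50):
        candidates.append(
            ('forest', depth, n_trees,
             RandomForestClassifier(max_depth=depth, n_estimators=n_trees,
                                    random_state=12345)))

# Fit every candidate and record its validation accuracy.
search_results = []
for name, depth, n_trees, model in candidates:
    model.fit(X_train, y_train)
    acc = accuracy_score(y_valid, model.predict(X_valid))
    search_results.append({'model': name, 'max_depth': depth,
                           'trees': n_trees, 'accuracy_valid': acc})

best = max(search_results, key=lambda row: row['accuracy_valid'])
print(best)
```

Only the validation sample is used to pick `best`; the test sample would be scored once, afterwards, for the selected model.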

In [9]:
class BestModel:
    
    models = [RandomForestRegressor, DecisionTreeRegressor,  LinearRegression]
    mod_rus = ['Random forest', 'Decision tree', 'Linear regression']
    results_tab = []
    best_model = None
    best_result = 0
    result_test = 0
    best_depth = 0
    trees = 0
    v = 0
    
    def predictions_model(self, model):
        # trains the model, predicts on the validation and test samples,
        # and returns both accuracy values in a dictionary
        model.fit(features_train, target_train)
        predictions_valid = model.predict(features_valid) 
        predictions_test = model.predict(features_test)
        # regression outputs are rounded and cast to bool to obtain class labels
        predictions_valid = (np.around(predictions_valid)**2).astype(bool)
        predictions_test = (np.around(predictions_test)**2).astype(bool)
        predictions_valid = pd.Series(predictions_valid, index=target_valid.index)
        predictions_test = pd.Series(predictions_test, index=target_test.index)
        accuracy_test = accuracy_score(target_test, predictions_test)
        accuracy_valid = accuracy_score(target_valid, predictions_valid)
        return {'accuracy_test' : accuracy_test,
                'accuracy_valid' : accuracy_valid}
    
    
    def results(self, v, mod, i, j, a_v, a_t):
        # appends one row of results to the results_tab table
        self.results_tab.append({'model' : mod,
                        'max_depth' : i,
                        'trees' : j,
                        'accuracy_valid': a_v,
                        'accuracy_test' : a_t})
    
        
    def for_print(self, v):
        # prints the results of the best model
        if v == '2_2':
            return print(f'Best model: {self.best_model}; \
            \n best result on the validation sample: accuracy = {self.best_result:.2};\
            \n maximum depth of the decision tree in the best model: {self.best_depth};\
            \n accuracy on the test sample = {self.result_test:.2}')
        elif v == '1_1':
            return print(f'Best model: {self.best_model}; \
            \n best result on the validation sample: accuracy = {self.best_result:.2};\
            \n maximum depth of the decision tree in the best model: {self.best_depth};\
            \n accuracy on the test sample = {self.result_test:.2};\
            \n number of trees in the random forest: {self.trees}')
        elif v == '3_3':
            # fixed: best_result must be read from the instance (self.best_result)
            return print(f'Best model: {self.best_model};\
            \n best result on the validation sample: accuracy = {self.best_result:.2};\
            \n accuracy on the test sample = {self.result_test:.2}')
        


    def rand_frst(self, i, j):
        # random forest with max_depth = i and n_estimators = j
        model = RandomForestRegressor(random_state = 12345, max_depth = i, n_estimators = j)
        answers = self.predictions_model(model)
        v = 1
        self.results(v, self.mod_rus[v-1], i, j, answers['accuracy_valid'], answers['accuracy_test'])
        # keep the model only if it beats the best validation accuracy so far
        if answers['accuracy_valid'] > self.best_result:
            self.best_model = model
            self.best_result = answers['accuracy_valid']
            self.best_depth = i
            self.result_test = answers['accuracy_test']
            self.trees = j
            self.v = '1_1'
            return self.v, self.result_test, self.best_depth, self.best_model, self.best_result, self.trees 
    
    def des_tr(self, i):
        # decision tree with max_depth = i
        model = DecisionTreeRegressor(random_state = 12345, max_depth = i)
        answers = self.predictions_model(model)
        v = 2 
        self.results(v, self.mod_rus[v-1], i, 1, answers['accuracy_valid'], answers['accuracy_test'])
        if answers['accuracy_valid'] > self.best_result:
            self.best_model = model
            self.best_result = answers['accuracy_valid']
            self.best_depth = i
            self.result_test = answers['accuracy_test']
            self.v = '2_2'
            return self.v, self.result_test, self.best_depth, self.best_model, self.best_result, self.trees
            
    
    def lin_reg(self):
        # linear regression (no hyperparameters are varied)
        model = LinearRegression()
        answers = self.predictions_model(model)
        v = 3
        self.results(v, self.mod_rus[v-1], 0, 0, answers['accuracy_valid'], answers['accuracy_test'])
        if answers['accuracy_valid'] > self.best_result:
            self.best_model = model
            self.best_result = answers['accuracy_valid']
            self.result_test = answers['accuracy_test']
            self.v = '3_3'
            return self.v, self.result_test, self.best_model, self.best_result
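
The step in `predictions_model` that turns regression outputs into class labels (round, square, cast to bool) can be examined in isolation. The sketch below compares it with a plain 0.5 threshold; the sample values and variable names are hypothetical and chosen to show where the two rules disagree.

```python
import numpy as np

# Hypothetical raw regression outputs, including values outside [0, 1],
# which a linear regression can produce.
raw_predictions = np.array([-0.7, 0.2, 0.5, 0.6, 1.6])

# Conversion used in predictions_model: round, square, cast to bool.
# np.around rounds halves to the nearest even integer, so 0.5 becomes 0,
# and -0.7 becomes -1, which squares to 1 and is treated as class 1.
labels_rounded = (np.around(raw_predictions) ** 2).astype(bool)

# Alternative for comparison: an explicit 0.5 threshold.
labels_threshold = raw_predictions >= 0.5

print(labels_rounded.tolist())    # [True, False, False, True, True]
print(labels_threshold.tolist())  # [False, False, True, True, True]
```

For tree-based regressors trained on a 0/1 target the predictions stay within [0, 1], so the two rules can differ only at exactly 0.5; for linear regression, strongly negative outputs would be labeled as class 1 by the rounding rule.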
            
            
    


In [10]:
t = BestModel()
for mod in t.models:
    if mod == RandomForestRegressor:
        for i in range(1,10):            
            for j in range(10,100,10):
                t.rand_frst(i, j)              
    elif mod == DecisionTreeRegressor:
        for i in range(1,10):
            t.des_tr(i)
    else:
        t.lin_reg()

In [11]:
t.for_print(t.v)

Best model: RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=6, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=70, n_jobs=None, oob_score=False,
                      random_state=12345, verbose=0, warm_start=False);             
 best result on the validation sample: accuracy = 0.8;            
 maximum depth of the decision tree in the best model: 6;            
 accuracy on the test sample = 0.79;            
 number of trees in the random forest: 70


### General conclusion <a name="Общий_вывод:"></a>
<font size="2">([to the content](#0.0))</font>

<b>The best model for this task is RandomForestRegressor with the following hyperparameters:</b>
- maximum tree depth: 6
- number of trees in the forest: 70
<br>

<b>With these settings, the model's accuracy is 0.80 on the validation sample and 0.79 on the test sample, which exceeds the required threshold of 0.75.</b>
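
The sanity check mentioned in the project instructions can be sketched with a constant baseline. The `describe()` output gives a mean of 0.306472 for `is_ultra` over 3214 rows, which corresponds to 985 "Ultra" rows, so a model that always predicts "Smart" would already be right about 69% of the time on the full dataset. The code below is a minimal illustration under that assumption: the target is reconstructed from those two numbers, the features are placeholders, and the score is for the whole dataset rather than for the test sample used above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

n_rows = 3214
n_ultra = 985  # round(0.306472 * 3214), reconstructed from describe()

# Reconstructed target: only the class counts matter for this baseline.
target_all = np.array([1] * n_ultra + [0] * (n_rows - n_ultra))
# DummyClassifier ignores feature values, so a placeholder column is enough.
placeholder_features = np.zeros((n_rows, 1))

# Baseline that always predicts the most frequent class ("Smart", 0).
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(placeholder_features, target_all)
baseline_accuracy = baseline.score(placeholder_features, target_all)
print(round(baseline_accuracy, 3))  # 0.694
```

A test accuracy of 0.79 is above this constant baseline, which suggests the selected model has learned more than the class proportions.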

In [12]:
df = pd.DataFrame(t.results_tab)

In [13]:
df

| | model | max_depth | trees | accuracy_valid | accuracy_test |
|---|---|---|---|---|---|
| 0 | Random forest | 1 | 10 | 0.754277 | 0.737170 |
| 1 | Random forest | 1 | 20 | 0.754277 | 0.735614 |
| 2 | Random forest | 1 | 30 | 0.754277 | 0.735614 |
| 3 | Random forest | 1 | 40 | 0.754277 | 0.735614 |
| 4 | Random forest | 1 | 50 | 0.754277 | 0.735614 |
| ... | ... | ... | ... | ... | ... |
| 86 | Decision tree | 6 | 1 | 0.785381 | 0.774495 |
| 87 | Decision tree | 7 | 1 | 0.783826 | 0.793157 |
| 88 | Decision tree | 8 | 1 | 0.779160 | 0.793157 |
| 89 | Decision tree | 9 | 1 | 0.782271 | 0.780715 |
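
A results table like this one can be summarized by taking the row with the highest validation accuracy for each model type. The sketch below uses only the rows visible in the truncated output above, with model labels written in English, so it does not include the overall best configuration (depth 6, 70 trees), which lies in the elided rows. The names `visible_rows` and `best_per_model` are illustrative.

```python
import pandas as pd

# Only the rows shown in the truncated output; the elided rows are not reproduced.
visible_rows = pd.DataFrame(
    [
        (0, 'Random forest', 1, 10, 0.754277, 0.737170),
        (1, 'Random forest', 1, 20, 0.754277, 0.735614),
        (2, 'Random forest', 1, 30, 0.754277, 0.735614),
        (3, 'Random forest', 1, 40, 0.754277, 0.735614),
        (4, 'Random forest', 1, 50, 0.754277, 0.735614),
        (86, 'Decision tree', 6, 1, 0.785381, 0.774495),
        (87, 'Decision tree', 7, 1, 0.783826, 0.793157),
        (88, 'Decision tree', 8, 1, 0.779160, 0.793157),
        (89, 'Decision tree', 9, 1, 0.782271, 0.780715),
    ],
    columns=['row', 'model', 'max_depth', 'trees',
             'accuracy_valid', 'accuracy_test'],
).set_index('row')

# Index label of the best validation accuracy within each model type
# (ties resolve to the first occurrence).
best_idx = visible_rows.groupby('model')['accuracy_valid'].idxmax()
best_per_model = visible_rows.loc[best_idx]
print(best_per_model)
```

Selecting on `accuracy_valid` rather than `accuracy_test` keeps the test sample out of the model-selection step.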
