<a name="Оглавление"></a>

# Table of Contents
1. [General information](#Шаг_1)
2. [Working with the data](#Шаг_2)
3. [Model exploration](#Шаг_3)
    1. [DecisionTreeClassifier](#Шаг_3.1)
    2. [RandomForestClassifier](#Шаг_3.2)
    3. [LogisticRegression](#Шаг_3.3)
4. [Checking the models on the test set](#Шаг_4)
5. [Checking the model for adequacy](#Шаг_5)

<a name="Шаг_1"></a>

## 1. General information

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

In [4]:
df = pd.read_csv('users_behavior.csv')

In [5]:
df.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


The column names are fine as they are.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


There are no missing values: every column has 3214 non-null entries.

In [7]:
df.describe()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
count,3214.0,3214.0,3214.0,3214.0,3214.0
mean,63.038892,438.208787,38.281269,17207.673836,0.306472
std,33.236368,234.569872,36.148326,7570.968246,0.4611
min,0.0,0.0,0.0,0.0,0.0
25%,40.0,274.575,9.0,12491.9025,0.0
50%,62.0,430.6,30.0,16943.235,0.0
75%,82.0,571.9275,57.0,21424.7,1.0
max,244.0,1632.06,224.0,49745.73,1.0


In [8]:
df[df.calls == 0].head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
54,0.0,0.0,33.0,14010.33,1
247,0.0,0.0,35.0,16444.99,1
264,0.0,0.0,21.0,19559.55,0
351,0.0,0.0,8.0,35525.61,1
390,0.0,0.0,25.0,19088.67,1


In [9]:
df[df.calls == 0].count()

calls       40
minutes     40
messages    40
mb_used     40
is_ultra    40
dtype: int64

In [10]:
df[df.minutes == 0].head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
54,0.0,0.0,33.0,14010.33,1
247,0.0,0.0,35.0,16444.99,1
264,0.0,0.0,21.0,19559.55,0
351,0.0,0.0,8.0,35525.61,1
390,0.0,0.0,25.0,19088.67,1


In [11]:
df[df.minutes == 0].count()

calls       40
minutes     40
messages    40
mb_used     40
is_ultra    40
dtype: int64

A zero call count corresponds to a zero minute count: both filters return 40 rows.

No data preprocessing is required.
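
The two filters above return the same number of rows, but equal counts alone do not prove that the rows coincide. A minimal sketch of a direct row-by-row check, using a small frame built from a few rows shown in the outputs above (`sample` and `same_rows` are illustrative names, not part of the notebook):

```python
import pandas as pd

# A few rows copied from the outputs above; `sample` is an illustrative name.
sample = pd.DataFrame(
    {
        "calls": [40.0, 0.0, 0.0, 106.0],
        "minutes": [311.9, 0.0, 0.0, 745.53],
    },
    index=[0, 54, 264, 3],
)

# True only if every row has zero calls exactly when it has zero minutes.
same_rows = ((sample["calls"] == 0) == (sample["minutes"] == 0)).all()
print(same_rows)
```

On the full data the same expression would be applied to `df` instead of `sample`.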

[Home](#Оглавление)

<a name="Шаг_2"></a>

## 2. Working with the data

In [12]:
df_train, df_valid = train_test_split(df, test_size=0.25, random_state=12345)

In [13]:
df_valid, df_test = train_test_split(df_valid, test_size=0.50, random_state=12345)

In [14]:
df_train.count()

calls       2410
minutes     2410
messages    2410
mb_used     2410
is_ultra    2410
dtype: int64

In [15]:
df_valid.count()

calls       402
minutes     402
messages    402
mb_used     402
is_ultra    402
dtype: int64

In [16]:
df_test.count()

calls       402
minutes     402
messages    402
mb_used     402
is_ultra    402
dtype: int64

In [17]:
df_train.count() + df_valid.count() + df_test.count()

calls       3214
minutes     3214
messages    3214
mb_used     3214
is_ultra    3214
dtype: int64

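The data is split into training, validation, and test sets (2410, 402, and 402 rows), and the sizes add up to the original 3214.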
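
The sizes above are consistent with how, to my understanding, `train_test_split` handles a fractional `test_size`: the test part gets `test_size * n_samples` rounded up, and the training part gets the remainder. A short sketch of that arithmetic (the helper name `split_sizes` is hypothetical):

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test part up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_train, n_holdout = split_sizes(3214, 0.25)    # first split: train vs. the rest
n_valid, n_test = split_sizes(n_holdout, 0.50)  # second split: validation vs. test
print(n_train, n_valid, n_test)  # 2410 402 402
```

This gives roughly a 75% / 12.5% / 12.5% split of the 3214 rows.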

[Home](#Оглавление)

<a name="Шаг_3"></a>

## 3. Model exploration

### This is a classification task.

<a name="Шаг_3.1"></a>

#### DecisionTreeClassifier

In [18]:
df.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


In [19]:
features_train = df_train.drop(['is_ultra'], axis=1)

In [20]:
target_train = df_train['is_ultra']

In [21]:
features_valid = df_valid.drop(['is_ultra'], axis=1)

In [22]:
target_valid = df_valid['is_ultra']

In [23]:
for depth in range(1, 50):
    model_DecisionTreeClassifier = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model_DecisionTreeClassifier.fit(features_train, target_train)
    predictions = model_DecisionTreeClassifier.predict(features_valid)
    accuracy = accuracy_score(target_valid, predictions)
    print("depth =", depth, accuracy)

depth = 1 0.763681592039801
depth = 2 0.7935323383084577
depth = 3 0.7985074626865671
depth = 4 0.7985074626865671
depth = 5 0.7985074626865671
depth = 6 0.7786069651741293
depth = 7 0.7885572139303483
depth = 8 0.7835820895522388
depth = 9 0.7786069651741293
depth = 10 0.7810945273631841
depth = 11 0.7810945273631841
depth = 12 0.7611940298507462
depth = 13 0.763681592039801
depth = 14 0.7562189054726368
depth = 15 0.736318407960199
depth = 16 0.746268656716418
depth = 17 0.7437810945273632
depth = 18 0.736318407960199
depth = 19 0.7313432835820896
depth = 20 0.7263681592039801
depth = 21 0.7114427860696517
depth = 22 0.7189054726368159
depth = 23 0.7114427860696517
depth = 24 0.7213930348258707
depth = 25 0.7213930348258707
depth = 26 0.7213930348258707
depth = 27 0.7213930348258707
depth = 28 0.7213930348258707
depth = 29 0.7213930348258707
depth = 30 0.7213930348258707
depth = 31 0.7213930348258707
depth = 32 0.7213930348258707
depth = 33 0.7213930348258707
depth = 34 0.72139303482

I choose max_depth = 5, one of the depths with the best validation accuracy (0.7985).
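
Instead of reading the best depth off the printed list, the scores can be collected and the maximum picked in code. A minimal sketch using the first six accuracies from the output above (the names `scores` and `best_depths` are illustrative):

```python
# Validation accuracies copied from the output above (depths 1 to 6 only).
scores = {
    1: 0.763681592039801,
    2: 0.7935323383084577,
    3: 0.7985074626865671,
    4: 0.7985074626865671,
    5: 0.7985074626865671,
    6: 0.7786069651741293,
}

best_score = max(scores.values())
# All depths that reach the best score; ties are common on a small validation set.
best_depths = [depth for depth, score in scores.items() if score == best_score]
print(best_depths, best_score)  # [3, 4, 5] 0.7985074626865671
```

Depths 3, 4, and 5 tie, so any of them is defensible on this metric; the shallowest one is the simplest model.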

In [24]:
model_DecisionTreeClassifier = DecisionTreeClassifier(random_state=12345, max_depth=5)

In [25]:
model_DecisionTreeClassifier.fit(features_train, target_train)

DecisionTreeClassifier(max_depth=5, random_state=12345)

In [26]:
predictions = model_DecisionTreeClassifier.predict(features_valid)

In [27]:
accuracy_score(target_valid, predictions)

0.7985074626865671

<a name="Шаг_3.2"></a>

#### RandomForestClassifier

In [28]:
for estim in range(1, 51):
    model_RandomForestClassifier = RandomForestClassifier(random_state=12345, n_estimators=estim)
    model_RandomForestClassifier.fit(features_train, target_train)
    predictions = model_RandomForestClassifier.predict(features_valid)
    accuracy = accuracy_score(target_valid, predictions)
    print("estim =", estim, accuracy)

estim = 1 0.7189054726368159
estim = 2 0.7786069651741293
estim = 3 0.7736318407960199
estim = 4 0.8009950248756219
estim = 5 0.7885572139303483
estim = 6 0.7935323383084577
estim = 7 0.7885572139303483
estim = 8 0.7910447761194029
estim = 9 0.7885572139303483
estim = 10 0.7960199004975125
estim = 11 0.7935323383084577
estim = 12 0.7935323383084577
estim = 13 0.7860696517412935
estim = 14 0.7935323383084577
estim = 15 0.7885572139303483
estim = 16 0.7985074626865671
estim = 17 0.7935323383084577
estim = 18 0.7960199004975125
estim = 19 0.7835820895522388
estim = 20 0.7910447761194029
estim = 21 0.7935323383084577
estim = 22 0.7960199004975125
estim = 23 0.7910447761194029
estim = 24 0.7935323383084577
estim = 25 0.7885572139303483
estim = 26 0.7860696517412935
estim = 27 0.7910447761194029
estim = 28 0.7960199004975125
estim = 29 0.7885572139303483
estim = 30 0.7935323383084577
estim = 31 0.7910447761194029
estim = 32 0.7910447761194029
estim = 33 0.7910447761194029
estim = 34 0.793532

I choose n_estimators = 4 (validation accuracy 0.8010).
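
The loop above varies only `n_estimators` and leaves the tree depth unlimited. A hedged sketch of searching both parameters together follows; it runs on synthetic data from `make_classification` rather than on `users_behavior.csv`, and the grid values are arbitrary assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target (an assumption, not the notebook's data).
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

best_params, best_accuracy = None, 0.0
for n_estimators in (5, 10, 20):
    for max_depth in (3, 5, 8):
        model = RandomForestClassifier(n_estimators=n_estimators,
                                       max_depth=max_depth, random_state=12345)
        model.fit(X_train, y_train)
        accuracy = accuracy_score(y_valid, model.predict(X_valid))
        # Keep the first combination that strictly improves on the best so far.
        if accuracy > best_accuracy:
            best_params, best_accuracy = (n_estimators, max_depth), accuracy

print(best_params, best_accuracy)
```

In the notebook the same double loop would use `features_train`, `target_train`, `features_valid`, and `target_valid`.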

In [29]:
model_RandomForestClassifier = RandomForestClassifier(random_state=12345, n_estimators=4)

In [30]:
model_RandomForestClassifier.fit(features_train, target_train)

RandomForestClassifier(n_estimators=4, random_state=12345)

In [31]:
predictions = model_RandomForestClassifier.predict(features_valid)

In [32]:
accuracy = accuracy_score(target_valid, predictions)

In [33]:
accuracy

0.8009950248756219

<a name="Шаг_3.3"></a>

#### LogisticRegression

In [34]:
model_LogisticRegression = LogisticRegression(random_state=12345)

In [35]:
model_LogisticRegression.fit(features_train, target_train)

LogisticRegression(random_state=12345)

In [36]:
predictions = model_LogisticRegression.predict(features_valid)

In [37]:
accuracy = accuracy_score(target_valid, predictions)

In [38]:
accuracy

0.7039800995024875

On the validation set, RandomForestClassifier achieves the best accuracy (0.801), narrowly ahead of DecisionTreeClassifier (0.799) and well ahead of LogisticRegression (0.704).
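
To see how large these differences are in absolute terms, the accuracies can be converted back into counts of correctly classified validation rows. A small sketch of that arithmetic (`N_VALID` and `correct` are illustrative names):

```python
N_VALID = 402  # size of the validation set, from df_valid.count() above

validation_accuracy = {
    "DecisionTreeClassifier": 0.7985074626865671,
    "RandomForestClassifier": 0.8009950248756219,
    "LogisticRegression": 0.7039800995024875,
}

# accuracy = correct / N_VALID, so correct = accuracy * N_VALID (rounded to undo float error).
correct = {name: round(acc * N_VALID) for name, acc in validation_accuracy.items()}
print(correct)
# {'DecisionTreeClassifier': 321, 'RandomForestClassifier': 322, 'LogisticRegression': 283}
```

The forest's lead over the tree is a single validation row, so the ranking between these two should be read with caution.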

[Home](#Оглавление)

<a name="Шаг_4"></a>

## 4. Checking the models on the test set

In [39]:
df_test.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
1059,38.0,212.46,52.0,18610.53,0
2019,84.0,550.75,9.0,26712.89,0
1810,36.0,206.85,4.0,16691.2,0
648,58.0,463.23,60.0,18512.24,1
2158,80.0,609.52,23.0,24968.76,0


In [40]:
features_test = df_test.drop(['is_ultra'], axis=1)

In [41]:
target_test = df_test['is_ultra']

In [42]:
predict_DecisionTreeClassifier = model_DecisionTreeClassifier.predict(features_test)

In [43]:
accuracy_DecisionTreeClassifier = accuracy_score(target_test, predict_DecisionTreeClassifier)

In [44]:
accuracy_DecisionTreeClassifier

0.763681592039801

In [45]:
predict_RandomForestClassifier = model_RandomForestClassifier.predict(features_test)

In [46]:
accuracy_RandomForestClassifier = accuracy_score(target_test, predict_RandomForestClassifier)

In [47]:
accuracy_RandomForestClassifier

0.7711442786069652

In [48]:
predict_LogisticRegression = model_LogisticRegression.predict(features_test)

In [49]:
accuracy_LogisticRegression = accuracy_score(target_test, predict_LogisticRegression)

In [50]:
accuracy_LogisticRegression

0.7039800995024875

Accuracy on the test set confirms that RandomForestClassifier performs best in this case (0.771, versus 0.764 for DecisionTreeClassifier and 0.704 for LogisticRegression).

[Home](#Оглавление)

<a name="Шаг_5"></a>

## 5. Checking the model for adequacy

In [51]:
df[df.is_ultra == 0].count() / df.count()

calls       0.693528
minutes     0.693528
messages    0.693528
mb_used     0.693528
is_ultra    0.693528
dtype: float64

Rounded: about 70% of the rows are class "0" and about 30% are class "1".

In [52]:
features_df = df.drop(['is_ultra'], axis=1)

In [53]:
target_df = df['is_ultra']

In [54]:
predictions_df = model_RandomForestClassifier.predict(features_df)

In [55]:
accuracy_df = accuracy_score(target_df, predictions_df)

In [56]:
accuracy_df

0.9016801493466086

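The model predicts better than the 70/30 baseline: always answering "0" would be right about 69% of the time. The 0.90 figure is measured on the full dataset, most of which the model was trained on, so it overstates the quality; the fair comparison is the test accuracy of 0.771, which is still above the baseline.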
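
The baseline used in this check is the accuracy of a constant prediction of the most frequent class. A minimal sketch of computing it from a list of labels (the function name `majority_baseline_accuracy` and the toy labels are hypothetical):

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Toy labels with the same rounded 70/30 class balance as the dataset (illustrative only).
toy_labels = [0] * 7 + [1] * 3
baseline = majority_baseline_accuracy(toy_labels)
print(baseline)  # 0.7
```

Applied to `target_test`, the same function would give the baseline for the test set, which is the number a model's test accuracy should exceed.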

[Home](#Оглавление)