<div class="alert alert-success">
<b>Reviewer's comment V2</b>

I added some comments with answers to your questions, please check them out! The project is accepted now. Keep up the good work on the next sprint!
    

</div>

**Review**

Hi, my name is Dmitry and I will be reviewing your project.
  
You can find my comments in colored markdown cells:
  
<div class="alert alert-success">
  If everything is done successfully.
</div>
  
<div class="alert alert-warning">
  If I have some (optional) suggestions, or questions to think about, or general comments.
</div>
  
<div class="alert alert-danger">
  If a section requires some corrections. Work can't be accepted with red comments.
</div>
  
Please don't remove my comments, as it will make further review iterations much harder for me.
  
Feel free to reply to my comments or ask questions using the following template:
  
<div class="alert alert-info">
  For your comments and questions.
</div>
  
First of all, thank you for turning in the project! You did a great job overall, there's just one small problem that needs to be fixed before the project can be accepted. Should be pretty straightforward though. Let me know if you have any questions!

<center style='font-size:28px;'><u><b>Intro 2 ML</b></u></center>

1. [Project Description](#start)
    * [Data Description](#dd)
2. [Import](#imp)
3. [Preprocessing data](#pp)
    
4. [Splitting Data](#spldt)
    * [Prepare Data for ML](#pdml)        
5. [Quality of Different Models (Training)](#qtra)
    * [DecisionTreeClassifier](#dtc)
    * [RandomForestClassifier](#rfc)
    * [LogisticRegression](#logr)
        * [LibLinear](#pwrt)
        * [verbose is insignificant](#logrverbins)
        * [tol can be disregarded](#logrtolins)
        * [C has 2 possibly preferred values](#logrcs)
        * [intercept_scaling has an optimal value](#logriss)
        * [class_weight can be disregarded](#logrcwins)
        * [Tuning Summation](#logrtunsum)
    * [DecisionTreeRegressor](#dtr)
        * [ccp_alpha does not need to be specified](#ccopj)
        * [min_impurity_decrease does not need to be specified](#impurtyb)
        * [sketch for easy and extra slow check](#sktch)
        * [DecisionTreeRegressor tuning summation](#dtrtunsum)
    * [RandomForestRegressor](#rfr)
    * [LinearRegression](#linr)
    * [Tuning conclusions](#tratunconc)
6. [Quality of Different Models (Test)](#qtst)
    * [DecisionTreeClassifier](#dtct)
    * [RandomForestClassifier](#rfct)
    * [LogisticRegression](#logrt)
    * [DecisionTreeRegressor](#dtrt)
    * [RandomForestRegressor](#rfrt)
    * [LinearRegression](#linrt)
    * [Testing conclusions](#conctest)
7. [Sanity Check](#schk)

# Project description <a id="start"></a>

<ul>Mobile carrier Megaline has found out that many of its subscribers use legacy plans. It wants to develop a model that analyzes subscribers' behavior and recommends one of Megaline's newer plans: Smart or Ultra.</ul>
<ul>The analysis here is performed on behavior data for subscribers who have already switched to the new plans.</ul>
<ul>The aim is to develop a model with the highest possible accuracy (at least 75% on the test dataset).</ul>

<div><b>In particular:</b></div>

- Split the source data into a training set, a validation set, and a test set
- Investigate the quality of different models by changing hyperparameters
- Investigate the quality of different models by their test sets
- Sanity check the models

## Description of the data <a id="dd"></a>

<b>Every observation in the dataset contains monthly behavior information about one user</b>
- **users_behavior**
<ul>calls — number of calls</ul>
<ul>minutes — total call duration in minutes</ul>
<ul>messages — number of text messages</ul>
<ul>mb_used — Internet traffic used in MB</ul>
<ul>is_ultra — plan for the current month (Ultra - 1, Smart - 0)</ul>

# Imports <a id="imp"></a>

In [1]:
pip install -U scikit-learn

Defaulting to user installation because normal site-packages is not writeable
Requirement already up-to-date: scikit-learn in /home/jovyan/.local/lib/python3.7/site-packages (1.0.2)
Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression 
from sklearn.tree import DecisionTreeRegressor 
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Pre-Processing <a id="pp"></a>

In [3]:
try:
    ub = pd.read_csv('users_behavior.csv')
except FileNotFoundError:
    ub = pd.read_csv('/datasets/users_behavior.csv')

In [4]:
ub.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null float64
minutes     3214 non-null float64
messages    3214 non-null float64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


No missing values, and all columns already have suitable dtypes, so no casting is needed.

In [5]:
ub.duplicated().sum()

0

No duplicates either.

In [6]:
ub.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


The data is ready for analysis.

<div class="alert alert-success">
<b>Reviewer's comment</b>

Alright, the data was loaded and inspected!

</div>

# Splitting Data <a id="spldt"></a>

In [7]:
df_train, df_valid = train_test_split(ub, test_size=0.4, random_state=12345) 
df_valid, df_test = train_test_split(df_valid, test_size=0.5, random_state=12345) 
print('train data:', end='\n')
display(df_train)
print('validation data:', end='\n')
display(df_valid)
print('test data:', end='\n')
display(df_test)

train data:


Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
3027,60.0,431.56,26.0,14751.26,0
434,33.0,265.17,59.0,17398.02,0
1226,52.0,341.83,68.0,15462.38,0
1054,42.0,226.18,21.0,13243.48,0
1842,30.0,198.42,0.0,8189.53,0
...,...,...,...,...,...
2817,12.0,86.62,22.0,36628.85,1
546,65.0,458.46,0.0,15214.25,1
382,144.0,906.18,0.0,25002.44,1
2177,38.0,301.27,37.0,28914.24,1


validation data:


Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
1386,92.0,536.96,18.0,20193.90,0
3124,40.0,286.57,17.0,17918.75,0
1956,81.0,531.22,56.0,17755.06,0
2286,67.0,460.76,27.0,16626.26,0
3077,22.0,120.09,16.0,9039.57,0
...,...,...,...,...,...
1999,56.0,398.45,4.0,23682.94,0
1023,76.0,601.10,0.0,17104.36,0
748,81.0,525.97,15.0,18878.91,0
1667,10.0,63.03,0.0,2568.00,1


test data:


Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
160,61.0,495.11,8.0,10891.23,0
2498,80.0,555.04,28.0,28083.58,0
1748,87.0,697.23,0.0,8335.70,0
1816,41.0,275.80,9.0,10032.39,0
1077,60.0,428.49,20.0,29389.52,1
...,...,...,...,...,...
2401,55.0,446.06,79.0,26526.28,0
2928,102.0,742.65,58.0,16089.24,1
1985,52.0,349.94,42.0,12150.72,0
357,39.0,221.18,59.0,17865.23,0


A 3:1:1 split: 60% train, 20% validation, 20% test.
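
The resulting set sizes can be checked without the real file. The sketch below is only an illustration: it assumes a synthetic DataFrame (`demo`, a hypothetical name) with the same number of rows as `users_behavior.csv` and repeats the two-step split used above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as users_behavior.csv (3214).
rng = np.random.default_rng(12345)
demo = pd.DataFrame({
    "calls": rng.integers(0, 200, size=3214),
    "is_ultra": rng.integers(0, 2, size=3214),
})

# Same two-step split as above: hold out 40%, then halve the held-out part.
demo_train, demo_rest = train_test_split(demo, test_size=0.4, random_state=12345)
demo_valid, demo_test = train_test_split(demo_rest, test_size=0.5, random_state=12345)

sizes = {"train": len(demo_train), "valid": len(demo_valid), "test": len(demo_test)}
shares = {name: round(n / len(demo), 3) for name, n in sizes.items()}
print(sizes)
print(shares)
```

The three parts add up to the full dataset, and the rounded shares come out as 0.6, 0.2 and 0.2.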

<div class="alert alert-success">
<b>Reviewer's comment</b>

The data was split into train, validation and test sets. The proportions are reasonable

</div>

## Prepare data for ML <a id="pdml"></a>

In [8]:
features_train = df_train.drop('is_ultra',axis=1)
target_train = df_train['is_ultra']
features_valid = df_valid.drop('is_ultra',axis=1)
target_valid = df_valid['is_ultra']
features_test = df_test.drop('is_ultra',axis=1)
target_test = df_test['is_ultra']

# Quality of Different Models (Training - varied hyperparameters) <a id="qtra"></a>

The sprint introduced 6 models: 2 trees, 2 forests and 2 algebraic approximations. To check quality, selected hyperparameters of all 6 are varied.

## DecisionTreeClassifier <a id="dtc"></a>

In [9]:
dct = pd.DataFrame()
for depth in range(1, 11):
    for split in range(2, 12):
        for leaf in range(1, 11):
            for splitter in ['best', 'random']:
                for criterion in ['gini', 'entropy']:
                    model = DecisionTreeClassifier(random_state=12345, max_depth=depth, min_samples_split=split,\
                                                   min_samples_leaf=leaf,splitter=splitter,criterion=criterion)
                    model.fit(features_train, target_train)
                    dct = dct.append({'depth': depth, 'split': split,'leaf': leaf, 'splitter': splitter, 'criterion': \
                                      criterion, 'score': model.score(features_valid,target_valid)}, ignore_index=True)

dct.sort_values('score',ascending=False, inplace=True)
dct.head(10)

Unnamed: 0,criterion,depth,leaf,score,split,splitter
3530,gini,9.0,3.0,0.804044,10.0,random
3570,gini,9.0,3.0,0.804044,11.0,random
3490,gini,9.0,3.0,0.802488,9.0,random
3491,entropy,9.0,3.0,0.799378,9.0,random
3450,gini,9.0,3.0,0.799378,8.0,random
3451,entropy,9.0,3.0,0.799378,8.0,random
3411,entropy,9.0,3.0,0.796267,7.0,random
3531,entropy,9.0,3.0,0.796267,10.0,random
3571,entropy,9.0,3.0,0.796267,11.0,random
3214,gini,9.0,4.0,0.794712,2.0,random


<div class="alert alert-warning">
<b>Reviewer's comment</b>

In the future you might want to try out [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) and [RandomizedSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html): they provide an easier way to explore grids of hyperparameters for your models. I'm not urging you to rewrite the project using these two, just pointing out some useful tools :)

</div>

<div class="alert alert-info">
  Thanks:)
</div>

2 combinations top the chart, differing only in `min_samples_split`:

- `criterion` = 'gini'
- `max_depth` = 9
- `min_samples_leaf` = 3
- `min_samples_split` = 10/11
- `splitter` = 'random'
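
The nested loops above can be written more compactly with `ParameterGrid`, collecting the rows in a list and building the DataFrame once (`DataFrame.append` was removed in pandas 2.0, so the row-by-row pattern only runs on older versions). This is a minimal sketch rather than a rerun of the search: the data comes from `make_classification`, the grid is deliberately small, and the names `X_train`, `X_valid`, `rows` and `results` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the notebook's features and target.
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# A deliberately small grid; the search above covers many more values.
grid = ParameterGrid({
    "max_depth": [3, 6, 9],
    "min_samples_leaf": [1, 3],
    "splitter": ["best", "random"],
    "criterion": ["gini", "entropy"],
})

rows = []
for params in grid:
    model = DecisionTreeClassifier(random_state=12345, **params)
    model.fit(X_train, y_train)
    # One dict per combination: the parameters plus the validation accuracy.
    rows.append({**params, "score": model.score(X_valid, y_valid)})

results = pd.DataFrame(rows).sort_values("score", ascending=False)
print(results.head())
```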

## RandomForestClassifier <a id="rfc"></a>

In [10]:
dct = pd.DataFrame()
for depth in range(6, 13):
    for split in range(8, 12):
        for leaf in range(2, 5):
            for estimators in range(4, 10):
                for criterion in ['gini', 'entropy']:
                    model = RandomForestClassifier(random_state=12345, max_depth=depth, min_samples_split=split,\
                                                   min_samples_leaf=leaf, criterion=criterion, n_estimators=estimators)
                    model.fit(features_train, target_train)
                    dct = dct.append({'depth': depth, 'split': split,'leaf': leaf, 'estimators': estimators, 'criterion': \
                                      criterion, 'score': model.score(features_valid,target_valid)}, ignore_index=True)

dct.sort_values('score',ascending=False, inplace=True)
dct.head(10)

Unnamed: 0,criterion,depth,estimators,leaf,score,split
68,gini,6.0,8.0,4.0,0.810264,9.0
104,gini,6.0,8.0,4.0,0.808709,10.0
48,gini,6.0,4.0,3.0,0.805599,9.0
70,gini,6.0,9.0,4.0,0.805599,9.0
587,entropy,10.0,9.0,2.0,0.804044,8.0
302,gini,8.0,5.0,3.0,0.804044,8.0
36,gini,6.0,4.0,2.0,0.804044,9.0
298,gini,8.0,9.0,2.0,0.804044,8.0
58,gini,6.0,9.0,3.0,0.802488,9.0
138,gini,6.0,7.0,4.0,0.802488,11.0


Best result so far. `min_samples_split` and `min_samples_leaf` changed slightly relative to the preceding DecisionTreeClassifier: `min_samples_split` dropped from 10/11 to 9, while `min_samples_leaf` rose from 3 to 4. So the forest allows smaller nodes to be split, but requires every resulting leaf to hold more samples (4 instead of 3 in the single-tree case). Or something like that...

- `criterion` = 'gini'
- `max_depth` = 6
- `n_estimators` = 8
- `min_samples_leaf` = 4
- `min_samples_split` = 9
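
The same kind of search can be handed to `GridSearchCV` while still scoring on a fixed validation set instead of cross-validation folds. The sketch below is one possible way to do that, assuming synthetic data from `make_classification` and a small illustrative grid; `PredefinedSplit` marks which rows are training-only (`-1`) and which form the single validation fold (`0`).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, PredefinedSplit, train_test_split

# Synthetic stand-in for the notebook's train and validation sets.
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# -1: always used for training; 0: member of the single validation fold.
test_fold = np.concatenate([np.full(len(X_train), -1),
                            np.zeros(len(X_valid), dtype=int)])
split = PredefinedSplit(test_fold)

search = GridSearchCV(
    RandomForestClassifier(random_state=12345),
    param_grid={"max_depth": [6, 8], "n_estimators": [4, 8], "min_samples_leaf": [2, 4]},
    scoring="accuracy",
    cv=split,
    refit=False,  # only the scores are needed here, no final refit
)
search.fit(np.vstack([X_train, X_valid]), np.concatenate([y_train, y_valid]))
print(search.best_params_, round(search.best_score_, 3))
```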

## LogisticRegression <a id="logr"></a>

LogisticRegression trains quickly and has moderate accuracy, so its parameters can be varied thoroughly without risking overloading the kernel.

In [11]:
dct = pd.DataFrame()
for max_iter in range(100,250,50):
    for fit_intercept in [True, False]:
        for penalty in ['l1', 'l2']:   # compatibility with solver='LibLinear'
            for verbose in range(0,10,2):
                model = LogisticRegression(random_state=12345, penalty=penalty, fit_intercept=fit_intercept,\
                                               solver='liblinear', max_iter=max_iter,verbose=verbose)
                model.fit(features_train, target_train)
                dct = dct.append({'max_iter': max_iter, 'penalty': penalty, 'fit_intercept': \
                                  fit_intercept,'verbose': verbose, 'score': model.score(features_valid,target_valid)}, ignore_index=True)

dct.sort_values('score',ascending=False, inplace=True)

display(dct[dct['score'] >= 0.757386])

[LibLinear][LibLinear][LibLinear]...

Unnamed: 0,fit_intercept,max_iter,penalty,score,verbose
49,1.0,200.0,l2,0.758942,8.0
45,1.0,200.0,l2,0.758942,0.0
25,1.0,150.0,l2,0.758942,0.0
26,1.0,150.0,l2,0.758942,2.0
27,1.0,150.0,l2,0.758942,4.0
28,1.0,150.0,l2,0.758942,6.0
29,1.0,150.0,l2,0.758942,8.0
48,1.0,200.0,l2,0.758942,6.0
47,1.0,200.0,l2,0.758942,4.0
46,1.0,200.0,l2,0.758942,2.0


Moderate result, yet above 75% accuracy. The optimal `fit_intercept` value is True, and `max_iter` can be fixed at 200 instead of being iterated over.

<div style="font-size:20px; color:red;"><b>[LibLinear] getting printed - is there an easy way to suppress it?</b></div>

### `verbose` is insignificant <a id="logrverbins"></a>

In [12]:
fgg = dct.head(1)
top_score = fgg.score
top_score
dct = dct.reset_index()
dct1 = dct[dct['score'] >= 0.757386]
dct.info()
dct1.verbose.value_counts()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 6 columns):
index            60 non-null int64
fit_intercept    60 non-null float64
max_iter         60 non-null float64
penalty          60 non-null object
score            60 non-null float64
verbose          60 non-null float64
dtypes: float64(4), int64(1), object(1)
memory usage: 2.9+ KB


6.0    6
4.0    6
2.0    6
0.0    6
8.0    6
Name: verbose, dtype: int64

`verbose` has no apparent effect, so it can be disregarded

<div class="alert alert-warning">
<b>Reviewer's comment</b>

`verbose` just regulates how much info is printed out during training, it's not a hyperparameter of the model. Incidentally, it is the parameter you'd want to set to 0 to suppress printing `[LibLinear]`

</div>

<div class="alert alert-info">
Cool
</div>
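
That `verbose` only affects printing can be checked directly. A minimal sketch, assuming synthetic data from `make_classification` and the hypothetical names `quiet` and `chatty`: two otherwise identical models are fitted with `verbose=0` and `verbose=1` and their coefficients and predictions are compared.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the notebook's training data.
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)

# verbose=0 stays silent; verbose=1 makes liblinear print progress output.
quiet = LogisticRegression(solver="liblinear", random_state=12345, verbose=0).fit(X, y)
chatty = LogisticRegression(solver="liblinear", random_state=12345, verbose=1).fit(X, y)

# The fitted models are compared, not the printed text.
same_coef = np.allclose(quiet.coef_, chatty.coef_)
same_pred = np.array_equal(quiet.predict(X), chatty.predict(X))
print(same_coef, same_pred)
```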

### `tol` can be disregarded <a id="logrtolins"></a>

In [13]:
dct = pd.DataFrame()

for penalty in ['l1', 'l2']:
    for tol in [0.0001,0.001,0.01,0.1,1,10,100]:
        model = LogisticRegression(random_state=12345, penalty=penalty, fit_intercept=True,\
                                       solver='liblinear', max_iter=200, tol=tol)
        model.fit(features_train, target_train)
        dct = dct.append({'penalty': penalty, 'tol': tol, \
                          'score': model.score(features_valid,target_valid)}, ignore_index=True)

dct.sort_values('score',ascending=False, inplace=True)
dct.head()

Unnamed: 0,penalty,score,tol
7,l2,0.758942,0.0001
0,l1,0.757387,0.0001
1,l1,0.757387,0.001
2,l1,0.757387,0.01
3,l1,0.755832,0.1


In [14]:
fgg = dct.head(1)
top_score = fgg.score
dct1 = dct[dct['score'] >= 0.757386]
dct1.tol.value_counts()

0.0001    2
0.0100    1
0.0010    1
Name: tol, dtype: int64

`tol` shows a slight preference for 0.0001, its default value. So `tol` doesn't need to be specified.

### `C` has 2 possibly preferred values - 0.1 and 1 <a id="logrcs"></a>

In [15]:
dct = pd.DataFrame()
for penalty in ['l1', 'l2']:
    for c in [0.001,0.01,0.1,1,10,100,1000]:
        model = LogisticRegression(random_state=12345, penalty=penalty, fit_intercept=True,\
                                       solver='liblinear', max_iter=200, C=c)
        model.fit(features_train, target_train)
        dct = dct.append({'penalty': penalty, 'c': c, \
                          'score': model.score(features_valid,target_valid)}, ignore_index=True)

dct.sort_values('score',ascending=False, inplace=True)
dct.head(10)

Unnamed: 0,c,penalty,score
10,1.0,l2,0.758942
2,0.1,l1,0.757387
3,1.0,l1,0.757387
4,10.0,l1,0.755832
5,100.0,l1,0.755832
6,1000.0,l1,0.755832
12,100.0,l2,0.755832
9,0.1,l2,0.748056
8,0.01,l2,0.724728
1,0.01,l1,0.710731


In [16]:
fgg = dct.head(1)
top_score = fgg.score
dct1 = dct[dct['score'] >= 0.757386]
dct1.c.value_counts()

1.0    2
0.1    1
Name: c, dtype: int64

There are clearly only 2 `C` values on top. From now on **`C` will be limited to this pair surrounded by a safety margin, i.e. the range from 0.01 to 10**

### `intercept_scaling` has an optimal value <a id="logriss"></a>

In [17]:
dct = pd.DataFrame()

for penalty in ['l1', 'l2']:
    for c in [0.001,0.01,0.1,1,10]:
        for intercept_scaling in [0.001,0.01,0.1,1,10,100,1000]:
            model = LogisticRegression(random_state=12345, penalty=penalty, fit_intercept=True,\
                                           solver='liblinear', max_iter=200, C=c, intercept_scaling=intercept_scaling)
            model.fit(features_train, target_train)
            dct = dct.append({'penalty': penalty, 'c': c, 'intercept_scaling': intercept_scaling,\
                              'score': model.score(features_valid,target_valid)}, ignore_index=True)

dct.sort_values('score',ascending=False, inplace=True)
dct.head(15)

Unnamed: 0,c,intercept_scaling,penalty,score
59,1.0,1.0,l2,0.758942
12,0.01,100.0,l1,0.758942
23,1.0,0.1,l1,0.757387
11,0.01,10.0,l1,0.757387
30,10.0,0.1,l1,0.757387
29,10.0,0.01,l1,0.757387
24,1.0,1.0,l1,0.757387
18,0.1,10.0,l1,0.757387
17,0.1,1.0,l1,0.757387
13,0.01,1000.0,l1,0.757387


In [18]:
fgg = dct.head(1)
top_score = fgg.score
dct1 = dct[dct['score'] >= 0.757386]
dct1.intercept_scaling.value_counts()

10.00      3
1.00       3
0.10       2
0.01       1
1000.00    1
100.00     1
Name: intercept_scaling, dtype: int64

An `intercept_scaling` of 100 ties for the top score, lifting the best `l1` result slightly (from 0.757 to 0.759) to match the `l2` result already obtained with the default value of 1. However, when the runner-up scores are included, no other instance of 100 appears. In addition, that top-scoring `l1` row features a `C` value of 0.01, which is not one of the 2 "optimal" values found above (with the safety margin removed). **Values in the range 0.1 to 100 will be regarded as possible `intercept_scaling` values** in future iterations. 1000 and 0.01 are also among the least frequent at the top of the chart while being the most extreme numerically, so they can be dropped from future iterations.

### `class_weight` can be disregarded <a id="logrcwins"></a>

In [19]:
w = [{0:1000,1:100},{0:1000,1:10}, {0:1000,1:1.0}, 
     {0:500,1:1.0}, {0:400,1:1.0}, {0:300,1:1.0}, {0:200,1:1.0}, 
     {0:150,1:1.0}, {0:100,1:1.0}, {0:99,1:1.0}, {0:10,1:1.0}, 
     {0:0.01,1:1.0}, {0:0.01,1:10}, {0:0.01,1:100}, 
     {0:0.001,1:1.0}, {0:0.005,1:1.0}, {0:1.0,1:1.0}, 
     {0:1.0,1:0.1}, {0:10,1:0.1}, {0:100,1:0.1}, 
     {0:10,1:0.01}, {0:1.0,1:0.01}, {0:1.0,1:0.001}, {0:1.0,1:0.005}, 
     {0:1.0,1:10}, {0:1.0,1:99}, {0:1.0,1:100}, {0:1.0,1:150}, 
     {0:1.0,1:200}, {0:1.0,1:300},{0:1.0,1:400},{0:1.0,1:500}, 
     {0:1.0,1:1000}, {0:10,1:1000},{0:100,1:1000},'balanced']

In [20]:
dct = pd.DataFrame()
for penalty in ['l1', 'l2']:
    for c in [0.01,0.1,1,10]:
        for intercept_scaling in [0.1,1,10,100]:
            for class_weight in w:
                model = LogisticRegression(random_state=12345, penalty=penalty, fit_intercept=True,\
                                               solver='liblinear', max_iter=200, C=c, intercept_scaling=\
                                           intercept_scaling, class_weight=class_weight)
                model.fit(features_train, target_train)
                dct = dct.append({'penalty': penalty, 'intercept_scaling': \
                                  intercept_scaling, 'c': c, 'class_weight': class_weight,\
                                  'score': model.score(features_valid,target_valid)}, ignore_index=True)

dct.sort_values('score',ascending=False, inplace=True)
dct.head(12)

Unnamed: 0,c,class_weight,intercept_scaling,penalty,score
124,0.01,"{0: 1.0, 1: 1.0}",100.0,l1,0.758942
916,1.0,"{0: 1.0, 1: 1.0}",1.0,l2,0.758942
448,10.0,"{0: 1.0, 1: 1.0}",0.1,l1,0.757387
88,0.01,"{0: 1.0, 1: 1.0}",10.0,l1,0.757387
304,1.0,"{0: 1.0, 1: 1.0}",0.1,l1,0.757387
196,0.1,"{0: 1.0, 1: 1.0}",1.0,l1,0.757387
664,0.01,"{0: 1.0, 1: 1.0}",10.0,l2,0.757387
340,1.0,"{0: 1.0, 1: 1.0}",1.0,l1,0.757387
232,0.1,"{0: 1.0, 1: 1.0}",10.0,l1,0.757387
844,0.1,"{0: 1.0, 1: 1.0}",100.0,l2,0.755832


In [21]:
fgg = dct.head(1)
top_score = fgg.score
dct1 = dct[dct['score'] >= 0.756]
dct1.class_weight.value_counts()

{0: 1.0, 1: 1.0}    9
Name: class_weight, dtype: int64

Equal weights of 1 for both class labels dominate the top of the chart, and the top score equals the previous chart's top score. The default already gives the class labels equal weights, so `class_weight` can be disregarded.

### LogisticRegression tuning summation <a id="logrtunsum"></a>

Specifying `intercept_scaling` raised the top `l1` score from 0.757 to almost 0.759, the same score `l2` reaches with the default `intercept_scaling`. The final chosen parameters for this regression are:
- `C` (limited to 0.01,0.1,1 and 10)
- `intercept_scaling` = 0.1 to 100 in jumps of 10x
- `penalty` = l1 or l2
<p></p>
- `fit_intercept` = True
- `max_iter` = 200
- `solver` = 'liblinear'
<p></p>
- `class_weight`, `tol` and `verbose` can be disregarded
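
The summary above amounts to a small search space. As an illustration only (the variable names are hypothetical), the remaining candidates can be enumerated with `itertools.product` to see how many fits are left:

```python
from itertools import product

# Values kept after the one-at-a-time experiments above.
c_values = [0.01, 0.1, 1, 10]
intercept_scalings = [0.1, 1, 10, 100]
penalties = ["l1", "l2"]

# Parameters held fixed for every candidate.
fixed = {"fit_intercept": True, "max_iter": 200, "solver": "liblinear"}

candidates = [
    {**fixed, "C": c, "intercept_scaling": s, "penalty": p}
    for c, s, p in product(c_values, intercept_scalings, penalties)
]
print(len(candidates))  # 4 * 4 * 2 = 32
print(candidates[0])
```

Each dictionary can be unpacked straight into the estimator, for example `LogisticRegression(random_state=12345, **candidates[0])`.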

## DecisionTreeRegressor <a id="dtr"></a>

In [22]:
dct = pd.DataFrame()
for depth in range(1, 19,3):
    for split in range(2, 12,2):
        for leaf in range(2, 9,2):
            for splitter in ['best', 'random']:
                for criterion in ['squared_error', 'friedman_mse', 'absolute_error', 'poisson']:
                    model = DecisionTreeRegressor(random_state=12345, max_depth=depth, min_samples_split=split,\
                                                   min_samples_leaf=leaf, criterion=criterion, splitter=splitter)
                    model.fit(features_train, target_train)
                    dct = dct.append({'depth': depth, 'split': split,'leaf': leaf, 'splitter': splitter, 'criterion': \
                                      criterion, 'score': model.score(features_valid,target_valid)}, ignore_index=True)

dct.sort_values('score',ascending=False, inplace=True)
display(dct.head(10))
dct[dct['score'] >= 0.263195]

Unnamed: 0,criterion,depth,leaf,score,split,splitter
663,poisson,13.0,6.0,0.263196,2.0,random
791,poisson,13.0,6.0,0.263196,10.0,random
759,poisson,13.0,6.0,0.263196,8.0,random
727,poisson,13.0,6.0,0.263196,6.0,random
695,poisson,13.0,6.0,0.263196,4.0,random
495,poisson,10.0,4.0,0.236074,2.0,random
559,poisson,10.0,4.0,0.236074,6.0,random
527,poisson,10.0,4.0,0.236074,4.0,random
591,poisson,10.0,4.0,0.236074,8.0,random
621,friedman_mse,10.0,4.0,0.236014,10.0,random


Unnamed: 0,criterion,depth,leaf,score,split,splitter
663,poisson,13.0,6.0,0.263196,2.0,random
791,poisson,13.0,6.0,0.263196,10.0,random
759,poisson,13.0,6.0,0.263196,8.0,random
727,poisson,13.0,6.0,0.263196,6.0,random
695,poisson,13.0,6.0,0.263196,4.0,random


In [23]:
dct.head(5).split.value_counts()

4.0     1
6.0     1
8.0     1
10.0    1
2.0     1
Name: split, dtype: int64

The score is very bad. Maybe additional parameters could help. `min_samples_split` seems irrelevant and can probably be disregarded. The 'poisson' `criterion`, the 'random' `splitter`, a `max_depth` of 13 and a `min_samples_leaf` of 6 are the best-performing values.

<div class="alert alert-warning">
<b>Reviewer's comment</b>

It doesn't make any sense to use a regression model for a classification problem, especially if there is an equivalent classification model.

</div>

###  `ccp_alpha` does not need to be specified <a id='ccopj'></a>

In [24]:
dct = pd.DataFrame()
for ccp_alpha in [0, 0.001,0.01,0.1,1,10]:
    model = DecisionTreeRegressor(random_state=12345, max_depth=13\
                  ,min_samples_leaf=6, criterion='poisson', splitter='random',ccp_alpha=ccp_alpha)
    model.fit(features_train, target_train)
    dct = dct.append({'ccp_alpha': ccp_alpha, \
                      'score': model.score(features_valid,target_valid)}, ignore_index=True)

dct.sort_values('score',ascending=False, inplace=True)
dct.head(10)

Unnamed: 0,ccp_alpha,score
0,0.0,0.263196
1,0.001,0.219755
2,0.01,-0.000896
3,0.1,-0.000896
4,1.0,-0.000896
5,10.0,-0.000896


No change in the top score. A `ccp_alpha` of 0 seems like the best choice, and as it's the default value it can be disregarded.
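
The identical scores for the larger `ccp_alpha` values are what one would expect if pruning had collapsed the tree to a single node that predicts a constant. The sketch below illustrates the mechanism under stated assumptions: synthetic 0/1 targets from `make_classification`, the default `criterion` rather than 'poisson', and a hypothetical `leaves` dictionary that records the leaf count per `ccp_alpha`.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in with a 0/1 target, as in this project.
X, y = make_classification(n_samples=400, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)

leaves = {}
for ccp_alpha in [0, 0.001, 0.01, 0.1, 1, 10]:
    tree = DecisionTreeRegressor(random_state=12345, max_depth=13,
                                 min_samples_leaf=6, ccp_alpha=ccp_alpha)
    tree.fit(X, y)
    # Larger ccp_alpha prunes more aggressively, leaving fewer leaves.
    leaves[ccp_alpha] = tree.get_n_leaves()

print(leaves)
```

With a 0/1 target the variance at the root is at most 0.25, so any `ccp_alpha` of 1 or more prunes everything down to the root.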

###  `min_impurity_decrease` does not need to be specified <a id='impurtyb'></a>

In [25]:
dct = pd.DataFrame()

for split in range(2, 12):
    for min_impurity_decrease in [0, 0.001,0.01,0.1,1,10]:
        model = DecisionTreeRegressor(random_state=12345, max_depth=13, min_samples_split=split\
                      ,min_samples_leaf=6, criterion='poisson', splitter='random',min_impurity_decrease=min_impurity_decrease)
        model.fit(features_train, target_train)
        dct = dct.append({'split': split, 'min_impurity_decrease': min_impurity_decrease, \
                          'score': model.score(features_valid,target_valid)}, ignore_index=True)

dct.sort_values('score',ascending=False, inplace=True)
dct.head(10)

Unnamed: 0,min_impurity_decrease,score,split
0,0.0,0.263196,2.0
12,0.0,0.263196,4.0
54,0.0,0.263196,11.0
48,0.0,0.263196,10.0
42,0.0,0.263196,9.0
36,0.0,0.263196,8.0
24,0.0,0.263196,6.0
18,0.0,0.263196,5.0
30,0.0,0.263196,7.0
6,0.0,0.263196,3.0


Same bad score. Again, 0 is the default value for `min_impurity_decrease` so it can be disregarded. `min_samples_split` can be disregarded from now on as well.

### Sketch for slowly checking everything at once (for elusive combinations) <a id='sktch'></a>

In [26]:
# dct = pd.DataFrame()
# for depth in range(1, 19):
#     for split in range(2, 12):
#         for leaf in range(2, 9):
#             for splitter in ['best', 'random']:
#                 for criterion in ['squared_error', 'friedman_mse', 'absolute_error', 'poisson']:
#                     for ccp_alpha in [0,0.001,0.01,0.1,1,10,100,1000]:
#                         for min_impurity_decrease in [0,0.001,0.01,0.1,1,10,100,1000]:
#                             for max_leaf_nodes in [5,20,None]:
#                                 for min_weight_fraction_leaf in [0,0.25,0.5]:
#                                     for max_features in [None,'auto','sqrt','log2']:
                                    
                                    
#                                         model = DecisionTreeRegressor(random_state=12345, max_depth=depth, min_samples_split=split,\
#                                                                        min_samples_leaf=leaf, criterion=criterion, splitter=splitter,\
#                                                                      ccp_alpha=ccp_alpha, min_impurity_decrease=min_impurity_decrease,\
#                                                                      max_leaf_nodes=max_leaf_nodes,\
#                                                                       min_weight_fraction_leaf=min_weight_fraction_leaf,max_features=max_features)
#                                         model.fit(features_train, target_train)
#                                         dct = dct.append({'depth': depth, 'split': split,'leaf': leaf, 'splitter': splitter, \
#                                                           'criterion': criterion, 'ccp_alpha': ccp_alpha, 'min_impurity_decrease': \
#                                                           min_impurity_decrease, \
#                                                           'max_leaf_nodes': max_leaf_nodes, 'min_weight_fraction_leaf': \
#                                                           min_weight_fraction_leaf, 'max_features': max_features,\
#                                                           'score': model.score(features_valid,target_valid)}, ignore_index=True)

# dct.sort_values('score',ascending=False, inplace=True)
# dct.head(10)
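
To see why this sketch is left commented out, the number of fits it would trigger can be counted from the loop ranges. The per-fit time below is an assumption for illustration, not a measurement.

```python
# Number of values per hyperparameter in the commented-out sketch above.
grid_sizes = {
    "max_depth": len(range(1, 19)),
    "min_samples_split": len(range(2, 12)),
    "min_samples_leaf": len(range(2, 9)),
    "splitter": 2,
    "criterion": 4,
    "ccp_alpha": 8,
    "min_impurity_decrease": 8,
    "max_leaf_nodes": 3,
    "min_weight_fraction_leaf": 3,
    "max_features": 4,
}

# Multiply the sizes together to get the total number of model fits.
total_fits = 1
for n_values in grid_sizes.values():
    total_fits *= n_values

seconds_per_fit = 0.01  # assumed cost of one fit, for illustration only
hours = total_fits * seconds_per_fit / 3600
print(total_fits)       # 23224320
print(round(hours, 1))  # 64.5
```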

### DecisionTreeRegressor tuning summation <a id="dtrtunsum"></a>

The score is very bad: an R² of about 0.26 (for regressors, `score` returns the coefficient of determination R², not accuracy). Best parameter values found:

- `criterion` = 'poisson'
- `max_depth` = 13
- `min_samples_leaf` = 6
- `splitter` = 'random'
<p></p>
- `ccp_alpha`, `min_impurity_decrease` and `min_samples_split` can be disregarded
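
For regressors, `score` returns R², so the numbers in this section are not on the same scale as the classifiers' accuracies. A minimal sketch of the difference, assuming synthetic data and an arbitrary 0.5 cutoff (an assumption introduced here, not something used in this project) to turn the regressor's continuous output into class labels:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in with a 0/1 target.
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

reg = DecisionTreeRegressor(random_state=12345, max_depth=5, min_samples_leaf=6)
reg.fit(X_train, y_train)

raw = reg.predict(X_valid)         # leaf means, i.e. values between 0 and 1
r2 = reg.score(X_valid, y_valid)   # coefficient of determination, not accuracy
labels = (raw >= 0.5).astype(int)  # assumed cutoff to obtain class labels
acc = accuracy_score(y_valid, labels)
print(round(r2, 3), round(acc, 3))
```

The two printed numbers measure different things, which is why a modest R² can coexist with a much higher accuracy.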

## RandomForestRegressor <a id="rfr"></a>

In [27]:
dct = pd.DataFrame()
for depth in range(6, 13,2):
    for split in range(8, 12):
        for leaf in range(2, 5):
            for estimators in range(4, 10,2):
                for criterion in ['squared_error', 'absolute_error', 'poisson']:
                    model = RandomForestRegressor(random_state=12345, max_depth=depth, min_samples_split=split,\
                                                   min_samples_leaf=leaf, criterion=criterion, n_estimators=estimators)
                    model.fit(features_train, target_train)
                    dct = dct.append({'depth': depth, 'split': split,'leaf': leaf, 'estimators': estimators, 'criterion': \
                                      criterion, 'score': model.score(features_valid,target_valid)}, ignore_index=True)

dct.sort_values('score',ascending=False, inplace=True)
dct.head()

Unnamed: 0,criterion,depth,estimators,leaf,score,split
204,squared_error,8.0,8.0,3.0,0.25311,11.0
195,squared_error,8.0,8.0,2.0,0.252189,11.0
150,squared_error,8.0,8.0,3.0,0.251362,9.0
177,squared_error,8.0,8.0,3.0,0.250076,10.0
114,squared_error,8.0,8.0,2.0,0.249912,8.0


Very bad score: an R² of about 0.25 (for regressors, `score` returns R², not accuracy). Best combination attained:

- `criterion` = 'squared_error'
- `max_depth` = 8
- `n_estimators` = 8
- `min_samples_leaf` = 3
- `min_samples_split` = 11

## LinearRegression <a id="linr"></a>

In [28]:
dct = pd.DataFrame()
for fit_intercept in ['True', 'False']:                
    model = LinearRegression(fit_intercept=fit_intercept)
    model.fit(features_train, target_train)
    dct = dct.append({'fit_intercept': fit_intercept, 'score': model.score(features_valid,target_valid)}, ignore_index=True)

dct.sort_values('score',ascending=False, inplace=True)
dct.head(10)

Unnamed: 0,fit_intercept,score
0,True,0.069702
1,False,0.069702


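Very bad score: an R² of about 0.07. Both rows are identical, most likely because the loop passes the strings `'True'` and `'False'`, which are both truthy, so the intercept is fitted in both runs.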
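
To actually compare the two settings, `fit_intercept` needs real booleans rather than the strings `'True'` and `'False'` that the loop above iterates over. A minimal sketch on synthetic data (`x`, `y_demo` and `scores` are hypothetical names, and the data is generated with a nonzero intercept so that the difference is visible):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Any non-empty string is truthy, including 'False'.
print(bool('True'), bool('False'))  # True True

# Synthetic data with a true intercept of 3 and a slope of 2.
rng = np.random.default_rng(12345)
x = rng.normal(size=(200, 1))
y_demo = 3.0 + 2.0 * x[:, 0] + rng.normal(scale=0.1, size=200)

scores = {}
for fit_intercept in [True, False]:  # booleans, not strings
    model = LinearRegression(fit_intercept=fit_intercept).fit(x, y_demo)
    scores[fit_intercept] = model.score(x, y_demo)
print(scores)
```

Recent scikit-learn versions validate this parameter, so passing a string there raises an error instead of silently behaving like `True`.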

## Tuning conclusions <a id='tratunconc'></a>

The best score was achieved by RandomForestClassifier (81% accuracy), closely followed by DecisionTreeClassifier and LogisticRegression (80% and 76% respectively). DecisionTreeRegressor and RandomForestRegressor score much worse (R² of about 0.26 and 0.25), and LinearRegression scores worst of all, with an R² of about 0.07. For regressors `score` returns R² rather than accuracy, so these numbers are not directly comparable with the classifiers' accuracies. Determining the customer's plan is a classification problem, so it's not surprising to see regression algorithms fail to solve it reasonably.

<div class="alert alert-success">
<b>Reviewer's comment</b>

Very good, you tried a couple of different models and extensively tuned their hyperparameters using the validation set.

</div>

<div class="alert alert-warning">
<b>Reviewer's comment</b>

Yeah, I guess you can try using regression models for classification problems once in your life to see that it makes no sense :) Every tool has its application area, like a hammer is pretty good at hitting nails, but completely useless for making holes in the wall (unless the wall is made of paper), on the other hand hitting nails with a drill doesn't make sense, but it's good at making holes. I'm not sure what the idea behind this was, maybe you've got confused by the name of logistic regression (which is not a regression model despite the name)?

</div>

<div class="alert alert-info">
I actually thought it's regression, thanks for clarifying this! I did this project like a robot: I thought the task was to go through all algorithms introduced in the sprint, for any hyperparameter that might be relevant, and after a while realised it doesn't make any sense to wait 20 minutes for every restart=>kernel:) On the other hand, if there's a specific parameter combination, you might miss it when you optimise a single hyperparameter every time. Anyway, is logistic regression really good for classification? Or is it something specific about this dataset (the forest had 1 tree so it supports that..)?
</div>

<div class="alert alert-warning">
<b>Reviewer's comment V2</b>

Yeah, the naming of logistic regression is unfortunate, but it's pretty much set now, so... (although it really has some relation to linear regression, as the idea is pretty much to just apply a sigmoid function to a linear combination of features (a sum with some coefficients, in other words))
    
Going through all algorithms and all possible values of hyperparameters is not really tractable. For many hyperparameters there are infinitely many possible values, and even if we consider a finite subset of hyperparameters, the number of models we have to train to try them all grows multiplicatively. That is, if we have 5 hyperparameters, and we want to try 10 possible values of each, we're looking at training 10^5 = 100000 models already. Even if training a model takes 1 second, we'll need ~28 hours for this. Also, for some hyperparameters, like tree depth, the memory requirements of the model grow exponentially. More specifically, a tree with max depth of k can have up to about 2^k nodes. Let's assume for a moment that one node needs just 4 bytes of memory (this is the size of an integer on 64-bit systems), this is an underestimation, but fine for our purposes. So, using max depth of 10, we get a tree with 2^10 nodes = 1024 nodes, which takes 1024 * 4 bytes = 4 kilobytes, that's reasonable! Using a max depth of 20, we get 2^20 nodes = 1024 * 1024 nodes, which takes 1024 * 1024 * 4 bytes = 4 megabytes, ok, still tractable. With max depth 30, we've got 2^30 nodes, or 4 gigabytes, and with max depth 40, 2^40 nodes, or 4 terabytes. That's already crazy territory!
    
So, all this means that we're pretty limited both in terms of how 'big' the models we try can be and how many models we can try. So, it is hopeless to try out all possible combinations of hyperparameters. One nice idea is to randomly sample hyperparameter values and see if something works (this is what [`RandomizedSearchCV`](https://scikit-learn.org/stable/modules/grid_search.html#randomized-parameter-search) does). Going further, we can try reducing the probability of sampling from regions of hyperparameters where we already trained a model and it wasn't good. This is the idea behind Bayesian optimization (here's a [nice article](https://distill.pub/2020/bayesian-optimization) if you want to learn more). Hyperparameter tuning is very much empirical, so you're just trying out different stuff until you find something that works well enough. Apart from that, you're probably going to get more out of the model by cleaning the data, collecting more data, feature engineering and so on.
    
What model is good depends entirely on the data you're working with. Sometimes a simple model like logistic regression can work great, sometimes you need something more advanced. There's the famous [no free lunch theorem](https://en.wikipedia.org/wiki/No_free_lunch_theorem) in machine learning, which essentially means that there's no best algorithm (for all datasets).

</div>
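The back-of-the-envelope numbers in the comment above can be reproduced with a few lines of arithmetic. This is only a sketch: the 1 second per model and 4 bytes per node are the same illustrative assumptions used in the comment, and `full_tree_bytes` is a hypothetical helper, not part of any library.

```python
# Cost of an exhaustive grid search: the number of models is the product
# of the number of candidate values for each hyperparameter.
n_hyperparameters = 5
values_per_hyperparameter = 10
seconds_per_model = 1  # assumed training time per model

n_models = values_per_hyperparameter ** n_hyperparameters
hours = n_models * seconds_per_model / 3600
print(f"{n_models} models, about {hours:.1f} hours")

BYTES_PER_NODE = 4  # deliberately optimistic assumption


def full_tree_bytes(max_depth):
    """Approximate memory of a full binary tree with about 2**max_depth nodes."""
    return 2 ** max_depth * BYTES_PER_NODE


for depth in (10, 20, 30, 40):
    print(f"max_depth={depth}: {full_tree_bytes(depth)} bytes")
```

Real trees stop growing once they run out of samples to split, so these figures are upper bounds rather than what a fitted model would actually occupy.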

# Quality of Different Models (Test Set) <a id="qtst"></a>

## DecisionTreeClassifier <a id="dtct"></a>

In [29]:
model = DecisionTreeClassifier(random_state=12345, max_depth=9, min_samples_split=11,
                               min_samples_leaf=3, splitter='random', criterion='gini')
model.fit(features_train, target_train)
test_predictions = model.predict(features_test)
# accuracy_score expects (y_true, y_pred); accuracy is symmetric, so the value is unchanged
print(accuracy_score(target_test, test_predictions))

0.8009331259720062


## RandomForestClassifier <a id="rfct"></a>

In [30]:
model = RandomForestClassifier(random_state=12345, max_depth=6, min_samples_split=9,
                               min_samples_leaf=4, criterion='gini', n_estimators=8)
model.fit(features_train, target_train)
test_predictions = model.predict(features_test)
# accuracy_score expects (y_true, y_pred); accuracy is symmetric, so the value is unchanged
print(accuracy_score(target_test, test_predictions))

0.8009331259720062


## LogisticRegression <a id="logrt"></a>

In [31]:
# dct = pd.DataFrame()
# for C in [0.01,0.1,1,10]:
#     for penalty in ['l1','l2']:
#         for intercept_scaling in [0.1,1,10,100]:
#             model = LogisticRegression(random_state=12345, penalty=penalty, fit_intercept=True,\
#                                            solver='liblinear', max_iter=200, C=C, intercept_scaling=intercept_scaling)
#             model.fit(features_train, target_train)
#             dct = dct.append({'penalty': penalty, 'C': C,'intercept_scaling': intercept_scaling, \
#                               'score': model.score(features_test,target_test)}, ignore_index=True)

# dct.sort_values('score',ascending=False, inplace=True)
# dct.head()

*Score of lesser quality due to slight over-fitting: 74% accuracy in test vs 76% accuracy in training.*

<div class="alert alert-danger">
<s><b>Reviewer's comment</b>

Here you're essentially doing hyperparameter tuning again, this time using the test set. This is not a good idea: we use the test set to get an unbiased estimate of the model's generalization performance, but we need completely new data which was never used for either training or hyperparameter tuning, so it only works if a model is evaluated on the test set only once. Otherwise, you just have a second validation set, and the score it gives is biased.

</div>

In [32]:
model = LogisticRegression(random_state=12345, penalty='l2', fit_intercept=True,
                           solver='liblinear', max_iter=200, C=0.01, intercept_scaling=10.0)
model.fit(features_train, target_train)
test_predictions = model.predict(features_test)
# accuracy_score expects (y_true, y_pred); accuracy is symmetric, so the value is unchanged
print(accuracy_score(target_test, test_predictions))

0.7433903576982893


<div class="alert alert-info">
I did it because I wasn't sure which values to choose, as the training stage offered several options. But of course, the test can't be biased...
</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Ok, just keep in mind for the future that the test set should be used to evaluate the final model once after all hyperparameters were already fixed to make sure we're getting an unbiased estimate of its generalization performance :)

</div>
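A minimal sketch of the workflow described above: candidate values are compared on the validation set only, and the test set is touched exactly once at the end. The synthetic data from `make_classification`, the 60/20/20 split and the list of `C` candidates are assumptions made for illustration, not the project's actual data or settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption); the project uses its own features and target.
features, target = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0,
    weights=[0.7, 0.3], random_state=12345,
)

# 60/20/20 split into train, validation and test sets (the ratio is an assumption).
features_train, features_rest, target_train, target_rest = train_test_split(
    features, target, test_size=0.4, random_state=12345, stratify=target)
features_valid, features_test, target_valid, target_test = train_test_split(
    features_rest, target_rest, test_size=0.5, random_state=12345, stratify=target_rest)

# Step 1: choose the hyperparameter using the validation set only.
candidate_C = [0.01, 0.1, 1, 10]
best_C, best_valid_accuracy = None, -1.0
for C in candidate_C:
    model = LogisticRegression(random_state=12345, solver='liblinear', C=C)
    model.fit(features_train, target_train)
    valid_accuracy = accuracy_score(target_valid, model.predict(features_valid))
    if valid_accuracy > best_valid_accuracy:
        best_C, best_valid_accuracy = C, valid_accuracy

# Step 2: with the hyperparameter fixed, evaluate on the test set once.
final_model = LogisticRegression(random_state=12345, solver='liblinear', C=best_C)
final_model.fit(features_train, target_train)
test_accuracy = accuracy_score(target_test, final_model.predict(features_test))
print(best_C, best_valid_accuracy, test_accuracy)
```

When several candidates tie on validation accuracy, this loop keeps the first one, which is one simple way to resolve the "several options" situation without looking at the test set.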

## DecisionTreeRegressor <a id="dtrt"></a>

In [33]:
model = DecisionTreeRegressor(random_state=12345, max_depth=13, \
                              min_samples_leaf=6, criterion='poisson', splitter='random')
model.fit(features_train, target_train)
model.score(features_test, target_test)

0.3402118983058666

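The score is much better than the training score, yet both scores are poor. Note that `score` on a regressor returns R² (the coefficient of determination), not accuracy, so it can't be compared directly with the classifiers' results: here the model explains only about a third of the variance in the target, and the fluctuation between the two scores might well be random.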
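To illustrate why a regressor's `score` is on a different scale from accuracy, here is a small hand-rolled sketch. The toy target and the regression outputs are made up for illustration, and `r2_manual` and `accuracy_manual` are hypothetical helpers that follow the usual definitions of R² and accuracy.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def accuracy_manual(y_true, y_pred):
    """Share of predictions that match the true labels exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy binary target with a 70:30 class ratio (made-up data).
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

# A constant model predicting the mean of the target gets R² of zero.
mean_prediction = [sum(y_true) / len(y_true)] * len(y_true)
r2_of_mean = r2_manual(y_true, mean_prediction)

# Made-up continuous outputs of a regressor trained on a 0/1 target.
regression_output = [0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 0.6, 0.7, 0.4, 0.8]
r2_of_regressor = r2_manual(y_true, regression_output)

# The same outputs turned into class labels with a 0.5 threshold.
thresholded = [1 if value >= 0.5 else 0 for value in regression_output]
accuracy_of_thresholded = accuracy_manual(y_true, thresholded)

print(r2_of_mean, r2_of_regressor, accuracy_of_thresholded)
```

The same predictions give a modest R² and a much higher accuracy after thresholding, so a low R² on a 0/1 target does not by itself say how often the predicted class would be right.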

## RandomForestRegressor <a id="rfrt"></a>

In [34]:
model = RandomForestRegressor(random_state=12345, max_depth=8, min_samples_split=11,\
                                                   min_samples_leaf=3, criterion='squared_error', n_estimators=8)
model.fit(features_train, target_train)
model.score(features_test, target_test)

0.31798793749942433

Again, the score is better than the one seen in training, but both are pretty bad, so the difference doesn't mean anything of importance.

## LinearRegression <a id="linrt"></a>

In [35]:
model = LinearRegression()
model.fit(features_train, target_train)
model.score(features_test, target_test)

0.05666279057072421

Around the same score as witnessed in training. Much worse than the tree and forest regression scores.

<div class="alert alert-success">
<b>Reviewer's comment</b>

Ok, final models (except logistic regression) were correctly evaluated on the test set for an unbiased estimate of their generalization performance.

</div>

## Test conclusion <a id='conctest'></a>

Forecasting a user plan is a classification problem, and indeed the classification algorithms, logistic regression included, had much better scores than their regression counterparts. **The best score was 80%, achieved by both DecisionTreeClassifier and RandomForestClassifier, which passed the 75% test accuracy condition**. LogisticRegression was close, at 74.3% test accuracy.
<p>The fact that both DecisionTreeClassifier and RandomForestClassifier have exactly the same score might suggest that the algorithm traversed the same tree in both cases, meaning the forest was effectively a single tree. This might be due to the relatively small size of the users dataset, together with a "simple" connection between the features and the target: a connection that, when translated into a tree, results in questions built on top of one another rather than many unrelated paths to an answer. With `RandomForestClassifier` considered an accurate algorithm, both 80% results gain some added certainty.</p>
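One way to probe the "same tree" hypothesis is to compare the two models' predictions directly, since identical accuracy does not by itself mean identical predictions. The sketch below uses made-up labels; `accuracy`, `agreement` and `flip` are hypothetical helpers, and the two prediction lists are constructed by hand rather than produced by real models.

```python
def accuracy(y_true, y_pred):
    """Share of predictions equal to the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def agreement(pred_a, pred_b):
    """Share of samples on which two models predict the same label."""
    return sum(a == b for a, b in zip(pred_a, pred_b)) / len(pred_a)


def flip(labels, positions):
    """Copy of binary labels with the labels at the given positions inverted."""
    return [1 - label if i in positions else label for i, label in enumerate(labels)]


y_true = [0, 0, 0, 0, 1, 1, 1, 1, 0, 1]

# Two hand-made prediction sets, each wrong on two different samples.
tree_like_predictions = flip(y_true, {0, 5})
forest_like_predictions = flip(y_true, {3, 8})

print(accuracy(y_true, tree_like_predictions))
print(accuracy(y_true, forest_like_predictions))
print(agreement(tree_like_predictions, forest_like_predictions))
```

Both toy models reach the same accuracy while disagreeing on several samples, so applying an agreement check like this to the real test predictions would show whether the tree and the forest actually behave alike.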

<div class="alert alert-success">
<b>Reviewer's comment</b>

Conclusions make sense!
    
</div>

# Sanity Check <a id="schk"></a>

The only models to pass the sanity check are the ones with accuracy > 50%, preferably by some margin. These are the models obtained using the `DecisionTreeClassifier`, `RandomForestClassifier` and `LogisticRegression` algorithms.

<div class="alert alert-warning">
<b>Reviewer's comment</b>

We can actually come up with a better baseline for accuracy. Considering that the target distribution is imbalanced (about 70:30), a constant model always predicting the majority class will get an accuracy equal to the share of the majority class (in this case ~70%) for free. This is one of the reasons why accuracy is not really a good metric for imbalanced data. Other than that, accuracy treats both false positives and false negatives the same, but it's possible that there is a different cost associated with these two types of mistakes (for example, which is worse: if an algorithm falsely predicts that someone has cancer and they go through additional checks, or if an algorithm falsely predicts that someone doesn't have cancer and they just go about their life not knowing they are slowly dying?). The next sprint explores these themes further!

</div>
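The majority-class baseline mentioned above takes only a few lines to compute. This sketch uses a toy target with the 70:30 ratio from the comment; `majority_baseline_accuracy` is a hypothetical helper, and the 0.80 figure is the test accuracy reported earlier for the tree and forest classifiers.

```python
from collections import Counter


def majority_baseline_accuracy(target):
    """Return the most frequent class and the accuracy of always predicting it."""
    majority_class, majority_count = Counter(target).most_common(1)[0]
    return majority_class, majority_count / len(target)


# Toy target with a 70:30 class ratio (assumption mirroring the ratio above).
toy_target = [0] * 70 + [1] * 30
majority_class, baseline_accuracy = majority_baseline_accuracy(toy_target)
print(majority_class, baseline_accuracy)

# Compare a model's accuracy with the baseline instead of with 50%.
model_accuracy = 0.80
print(f"improvement over baseline: {model_accuracy - baseline_accuracy:.2f}")
```

In scikit-learn the same baseline is available as `DummyClassifier(strategy="most_frequent")`, which would be fitted on the training set and scored on the test set like any other model.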

<div class="alert alert-info">
  Great. Glad it's only a recommendation and not a warning :) I can't think for now how to approach this problem, only that the further the ratio is from 50-50, the harder the problem is... Now I'm curious about the next sprint!
</div>

<div class="alert alert-success">
<b>Reviewer's comment V2</b>

Yeah, no problem, you'll find out more soon! :)

</div>