# Goal
Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.

# Project instructions
**Phase I**
1. [ ] Open and look through the data file. Path to the file: datasets/users_behavior.csv
<br> **Phase II** <br>
2. [ ] Split the source data into a training set, a validation set, and a test set.
<br> **Phase III** <br>
3. [ ] Investigate the quality of different models by changing hyperparameters. Briefly describe the findings of the study.
4. [ ] Check the quality of the model using the test set.
<br> **Phase IV** <br>
5. [ ] Additional task: sanity check the model. This data is more complex than what you’re used to working with, so it's not an easy task. We'll take a closer look at it later.

**PHASE V (additional, not in task)** Comparison of the results of the different models <br>
**PostScriptum** Additional attempt to improve the model

# Project evaluation
We’ve put together the evaluation criteria for the project. Read this carefully before moving on to the task. <br>
Here’s what the reviewers will look at when reviewing your project:
- [ ] How did you look into data after downloading?
- [ ] Have you correctly split the data into train, validation, and test sets?
- [ ] How have you chosen the sets' sizes?
- [ ] Did you evaluate the quality of the models correctly?
- [ ] What models and hyperparameters did you use?
- [ ] What are your findings?
- [ ] Did you test the models correctly?
- [ ] What is your accuracy score?
- [ ] Have you stuck to the project structure and kept the code neat?

In [63]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score

**PHASE I. Open and look through the data file. Path to the file: datasets/users_behavior.csv**

In [3]:
# read data to dataframe
df = pd.read_csv('/datasets/users_behavior.csv')

In [4]:
# check if the dataframe has been read correctly
df.head()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


In [5]:
# According to the problem description the data has no issues, but it is good practice to check.
# Check for N/A values and the data types.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null float64
minutes     3214 non-null float64
messages    3214 non-null float64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


**So...** <br>
There are no N/A values and all columns are numerical. *calls* and *messages* are stored as floats rather than integers, but that is not a problem here.

In [6]:
# more information about data
df.describe()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| count | 3214.0 | 3214.0 | 3214.0 | 3214.0 | 3214.0 |
| mean | 63.038892 | 438.208787 | 38.281269 | 17207.673836 | 0.306472 |
| std | 33.236368 | 234.569872 | 36.148326 | 7570.968246 | 0.4611 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 40.0 | 274.575 | 9.0 | 12491.9025 | 0.0 |
| 50% | 62.0 | 430.6 | 30.0 | 16943.235 | 0.0 |
| 75% | 82.0 | 571.9275 | 57.0 | 21424.7 | 1.0 |
| max | 244.0 | 1632.06 | 224.0 | 49745.73 | 1.0 |


**So...** <br>
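All values are non-negative, as they should be. <br>
Max *calls* (244) > mean (63) + 3 * std (33) ≈ 163 <br>
Max *minutes* (1632) > mean (438) + 3 * std (235) ≈ 1142 <br>
Max *messages* (224) > mean (38) + 3 * std (36) ≈ 147 <br>
Max *mb_used* (49745) > mean (17208) + 3 * std (7571) ≈ 39921 <br>
Every maximum lies beyond the three-sigma bound, which means that there are some outliers. The filter in the next cell uses the cutoffs 162, 1146, 155 and 39917; the cutoffs for *minutes* and *messages* are slightly looser than the three-sigma bounds.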
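The three-sigma bounds can be recomputed directly from the summary statistics. The sketch below is only an illustration: it copies the mean, std and max values from the `df.describe()` output above into a plain dictionary, and the helper name `three_sigma_upper` is hypothetical.

```python
# Summary statistics copied from the df.describe() output above.
stats = {
    "calls":    {"mean": 63.038892,    "std": 33.236368,   "max": 244.0},
    "minutes":  {"mean": 438.208787,   "std": 234.569872,  "max": 1632.06},
    "messages": {"mean": 38.281269,    "std": 36.148326,   "max": 224.0},
    "mb_used":  {"mean": 17207.673836, "std": 7570.968246, "max": 49745.73},
}


def three_sigma_upper(mean, std):
    """Upper bound of the three-sigma rule: mean + 3 * std (hypothetical helper)."""
    return mean + 3 * std


upper_bounds = {
    name: three_sigma_upper(s["mean"], s["std"]) for name, s in stats.items()
}

for name, bound in upper_bounds.items():
    column_max = stats[name]["max"]
    # A maximum above the bound means the column has at least one outlier.
    print(f"{name}: bound = {bound:.1f}, max = {column_max}, outliers: {column_max > bound}")
```

With the real dataframe the same bounds could be obtained as `df.mean() + 3 * df.std()`.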

In [7]:
# I will continue with the original dataset, but later I will compare the results with this trimmed dataset
df_cut = df.query('calls < 162 and minutes < 1146 and messages < 155 and mb_used < 39917')
print('Outliers are only {:.2} % of the whole dataset'.format((len(df)-len(df_cut))/len(df)*100))

Outliers are only 2.9 % of the whole dataset


**PHASE II. Split the source data into a training set, a validation set, and a test set.**

A separate test set is not provided, so the source data has to be split into three parts: training, validation, and test. The validation and test sets are usually the same size. Taking 60% for training and 20% for each of the other two splits the source data in a 3:1:1 ratio.
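A possible variant of the 3:1:1 split is sketched below. It differs from the cells in this notebook in one respect, which is an assumption and not part of the original analysis: it passes `stratify` so that the share of each class is kept the same in all three parts. The helper `split_three_way` and the synthetic `demo` frame are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_three_way(data, target_col, random_state=0):
    """Split a DataFrame 60/20/20 into train, validation and test parts,
    keeping the class proportions of target_col in every part."""
    train, rest = train_test_split(
        data, test_size=0.4, random_state=random_state, stratify=data[target_col]
    )
    valid, test = train_test_split(
        rest, test_size=0.5, random_state=random_state, stratify=rest[target_col]
    )
    return train, valid, test


# Synthetic stand-in for the real data: 1000 rows, 30% positive class.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "minutes": rng.uniform(0, 1000, size=1000),
    "is_ultra": np.array([1] * 300 + [0] * 700),
})

demo_train, demo_valid, demo_test = split_three_way(demo, "is_ultra")
for name, part in [("train", demo_train), ("valid", demo_valid), ("test", demo_test)]:
    print(name, len(part), round(part["is_ultra"].mean(), 3))
```

Stratification matters more when one class is rare; with roughly 31% positives a plain random split is usually close enough, but the class shares of the parts can still differ by a few percentage points.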

In [8]:
# train_test_split divides a dataset into two sets, so to get three sets I use it twice
df_train, df_rest = train_test_split(df, test_size=0.4, random_state=22)
df_valid, df_test = train_test_split(df_rest, test_size=0.5, random_state=23)

In [9]:
print('Size of train set:', len(df_train))
print('Size of validation set:', len(df_valid))
print('Size of test set:', len(df_test))
print(len(df_train), ':', len(df_valid), ':', len(df_test), 'is a 3 : 1 : 1 ratio, as it should be')

Size of train set: 1928
Size of validation set: 643
Size of test set: 643
1928 : 643 : 643 is a 3 : 1 : 1 ratio, as it should be


In [42]:
sum(df_test.is_ultra)

200

In [10]:
# df_cut analysis (in parallel)
df_cut_train, df_cut_rest = train_test_split(df_cut, test_size=0.4, random_state=22)
df_cut_valid, df_cut_test = train_test_split(df_cut_rest, test_size=0.5, random_state=23)

**PHASE III.  Investigate the quality of different models by changing hyperparameters. Briefly describe the findings of the study.**

**Firstly**, it should be noted that this is a **classification** problem. <br>
That means that we can use (1) **DecisionTreeClassifier**, (2) **RandomForestClassifier** and (3) **LogisticRegression** for it. <br>
**Secondly**, it is necessary to define the features and the target. The target is *is_ultra*; the features are all columns except *is_ultra*. <br>
**Thirdly**, the target in the train set is moderately imbalanced: 603 (1-value) vs 1325 (0-value), about 31% vs 69%. Accuracy is the metric required by the task, so we use it, but it has to be compared with the share of the majority class (see the sanity check in Phase IV).

(1) **DecisionTreeClassifier**:

In [11]:
# features and target for:
# train set
features = df_train.drop('is_ultra', axis=1)
target = df_train['is_ultra']

# validation set
valid_features = df_valid.drop('is_ultra', axis=1)
valid_target = df_valid['is_ultra']

# test
test_features = df_test.drop('is_ultra', axis=1)
test_target = df_test['is_ultra']

In [28]:
# fit DecisionTreeClassifier for several values of max_depth and report the accuracy
for depth in range(1, 7):
    model = DecisionTreeClassifier(max_depth=depth, random_state=22)
    model.fit(features,target)
    prediction = model.predict(features)
    print('max_depth:',depth)
    print('Train set accuracy: {:.2%}'.format(accuracy_score(target, prediction)))
    valid_prediction = model.predict(valid_features)
    print('Validation set accuracy: {:.2%}'.format(accuracy_score(valid_target, valid_prediction)))

max_depth: 1
Train set accuracy: 74.53%
Validation set accuracy: 77.29%
max_depth: 2
Train set accuracy: 78.06%
Validation set accuracy: 79.47%
max_depth: 3
Train set accuracy: 79.46%
Validation set accuracy: 81.18%
max_depth: 4
Train set accuracy: 80.55%
Validation set accuracy: 80.72%
max_depth: 5
Train set accuracy: 81.07%
Validation set accuracy: 80.72%
max_depth: 6
Train set accuracy: 82.42%
Validation set accuracy: 79.63%


**So...** <br>
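Depths 1, 2 and 3 look underfitted: train set accuracy is still lower than validation set accuracy. <br>
Depths 5 and 6 look overfitted: train set accuracy keeps growing while validation set accuracy stalls (80.72%) and then drops (79.63%). <br>
This is not obvious for **max_depth = 5**, because the gap is small (81.07 - 80.72 = 0.35, less than 0.5 percentage points). However, a **decision tree classifier tends to overfit at high max_depth**. <br>
Validation accuracy is highest at max_depth = 3 (81.18%), but the differences between depths 3, 4 and 5 are within about half a percentage point. So I keep only the model with **max_depth = 4**, where train and validation accuracy are almost equal, **and test it further**.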
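The depth can also be picked programmatically. The sketch below shows one common rule, which is not necessarily the rule applied above: take the depth with the highest validation accuracy and prefer the shallower tree on ties. It runs on synthetic data from `make_classification` (an assumption, used only so that the example is self-contained), so its numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 4 features, noisy binary target.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0,
    flip_y=0.1, random_state=0,
)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_valid, X_test, y_valid, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

depth_scores = []
for depth in range(1, 7):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, tree.predict(X_train))
    valid_acc = accuracy_score(y_valid, tree.predict(X_valid))
    depth_scores.append((depth, train_acc, valid_acc))

# Highest validation accuracy wins; max() returns the first maximum,
# so ties go to the shallower (simpler) tree.
best_depth, best_train_acc, best_valid_acc = max(depth_scores, key=lambda row: row[2])
print(f"best max_depth: {best_depth}, train: {best_train_acc:.2%}, valid: {best_valid_acc:.2%}")
```

The test part (`X_test`, `y_test`) is deliberately left untouched during the search; it should be used once, for the final model only.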

In [13]:
model = DecisionTreeClassifier(max_depth=4, random_state=22)
model.fit(features,target)
test_prediction = model.predict(test_features)
print('Test set accuracy: {:.2%}'.format(accuracy_score(test_target, test_prediction)))

Test set accuracy: 79.16%


(1) Results on **DecisionTreeClassifier**: <br>
model: max_depth=4 <br>
Train set accuracy: 80.55% <br>
Validation set accuracy: 80.72% <br>
Test set accuracy: 79.16% <br>
Accuracy is higher than the threshold. And that is good, because it does not differ much from the validation and train accuracy.

(2) **RandomForestClassifier**:

In [57]:
for now_estim in range(1,50,5):
    model = RandomForestClassifier(n_estimators=now_estim, random_state=22)
    model.fit(features,target)
    prediction = model.predict(features)
    print('n_estimators:',now_estim)
    print('Train set accuracy: {:.2%}'.format(accuracy_score(target, prediction)))
    valid_prediction = model.predict(valid_features)
    print('Validation set accuracy: {:.2%}'.format(accuracy_score(valid_target, valid_prediction)))

n_estimators: 1
Train set accuracy: 89.52%
Validation set accuracy: 73.41%
n_estimators: 6
Train set accuracy: 96.01%
Validation set accuracy: 78.85%
n_estimators: 11
Train set accuracy: 97.93%
Validation set accuracy: 79.16%
n_estimators: 16
Train set accuracy: 98.44%
Validation set accuracy: 79.47%
n_estimators: 21
Train set accuracy: 99.38%
Validation set accuracy: 80.25%
n_estimators: 26
Train set accuracy: 99.43%
Validation set accuracy: 80.09%
n_estimators: 31
Train set accuracy: 99.84%
Validation set accuracy: 80.40%
n_estimators: 36
Train set accuracy: 99.84%
Validation set accuracy: 79.94%
n_estimators: 41
Train set accuracy: 99.90%
Validation set accuracy: 79.94%
n_estimators: 46
Train set accuracy: 99.79%
Validation set accuracy: 79.63%


**So...(a)** <br>
The accuracy of the random forest model on the train set rises quickly as n_estimators increases. <br>
After n_estimators=21 the validation accuracy doesn't change significantly (it stays between 79.6% and 80.4%). <br>
Adding trees does not by itself make a random forest overfit, but more trees cost more computation, so there is no reason to use very high values. <br>
I will keep n_estimators at about 20 (20 in the max_depth sweep below, 21 in the final model).

In [58]:
for now_depth in range(1,50,5):
    model = RandomForestClassifier(n_estimators=20, max_depth=now_depth, random_state=22)
    model.fit(features,target)
    prediction = model.predict(features)
    print('max_depth:',now_depth)
    print('Train set accuracy: {:.2%}'.format(accuracy_score(target, prediction)))
    valid_prediction = model.predict(valid_features)
    print('Validation set accuracy: {:.2%}'.format(accuracy_score(valid_target, valid_prediction)))

max_depth: 1
Train set accuracy: 74.59%
Validation set accuracy: 76.52%
max_depth: 6
Train set accuracy: 82.52%
Validation set accuracy: 80.09%
max_depth: 11
Train set accuracy: 88.90%
Validation set accuracy: 80.72%
max_depth: 16
Train set accuracy: 95.38%
Validation set accuracy: 79.47%
max_depth: 21
Train set accuracy: 98.96%
Validation set accuracy: 79.47%
max_depth: 26
Train set accuracy: 99.12%
Validation set accuracy: 80.09%
max_depth: 31
Train set accuracy: 99.12%
Validation set accuracy: 80.09%
max_depth: 36
Train set accuracy: 99.12%
Validation set accuracy: 80.09%
max_depth: 41
Train set accuracy: 99.12%
Validation set accuracy: 80.09%
max_depth: 46
Train set accuracy: 99.12%
Validation set accuracy: 80.09%


**So...(b)** <br>
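The default *max_depth* (unlimited) works well: the best validation accuracy with a depth limit (80.72% at max_depth=11) is within one percentage point of the unlimited case (80.09%), so it isn't necessary to tune this hyperparameter. <br>
Train set accuracy is much higher than validation set accuracy. With the default settings each tree is grown until its leaves are pure, so the forest almost memorizes the train set; this gap is a sign of overfitting, just as it would be for a single decision tree. <br>
Validation accuracy stays around 80% anyway, so the model is still usable.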
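The gap between train and validation accuracy can be reproduced on synthetic data. The sketch below is an illustration under assumptions (data from `make_classification`, a hypothetical helper `train_valid_accuracy`): it compares a forest of fully grown trees with a forest of depth-limited trees.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption) with label noise, so that
# perfect accuracy on unseen data is impossible.
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0,
    flip_y=0.2, random_state=0,
)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.25, random_state=0)


def train_valid_accuracy(max_depth):
    """Fit a 20-tree forest and return (train accuracy, validation accuracy)."""
    forest = RandomForestClassifier(n_estimators=20, max_depth=max_depth, random_state=0)
    forest.fit(X_train, y_train)
    return forest.score(X_train, y_train), forest.score(X_valid, y_valid)


full_train, full_valid = train_valid_accuracy(None)    # fully grown trees (the default)
shallow_train, shallow_valid = train_valid_accuracy(3)  # depth-limited trees

print(f"max_depth=None: train {full_train:.2%}, valid {full_valid:.2%}")
print(f"max_depth=3:    train {shallow_train:.2%}, valid {shallow_valid:.2%}")
```

Fully grown trees fit the noisy training labels almost perfectly, so a high train accuracy on its own says little; the validation accuracy is the number to compare between settings.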

In [61]:
model = RandomForestClassifier(n_estimators=21, random_state=22)
model.fit(features,target)
print('Test set accuracy: {:.2%}'.format(model.score(test_features,test_target)))

Test set accuracy: 79.63%


(2) Results on **RandomForestClassifier**: <br>
model: n_estimators=21 <br>
Train set accuracy: 99.38% <br>
Validation set accuracy: 80.25% <br>
Test set accuracy: 79.63% <br>
Accuracy is higher than the threshold. Accuracy on the validation and test sets is close (80.25% vs 79.63%), which is good.

(3) **LogisticRegression**:

In [74]:
model = LogisticRegression(solver='liblinear', random_state=25) # random_state is here to shuffle the data
model.fit(features,target)
print('Train set accuracy: {:.2%}'.format(model.score(features,target)))
print('Validation set accuracy: {:.2%}'.format(model.score(valid_features,valid_target)))
print('Test set accuracy: {:.2%}'.format(model.score(test_features,test_target)))

Train set accuracy: 70.80%
Validation set accuracy: 73.25%
Test set accuracy: 70.30%


In [76]:
model = LogisticRegression(solver='lbfgs')
model.fit(features,target)
print('Train set accuracy: {:.2%}'.format(model.score(features,target)))
print('Validation set accuracy: {:.2%}'.format(model.score(valid_features,valid_target)))
print('Test set accuracy: {:.2%}'.format(model.score(test_features,test_target)))

Train set accuracy: 70.80%
Validation set accuracy: 73.41%
Test set accuracy: 70.30%


(3) Results on **LogisticRegression**: <br>
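Logistic regression is a linear model, and its results here barely depend on the solver (validation accuracy is 73.25% with liblinear vs 73.41% with lbfgs). The values below are for liblinear. <br>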
Train set accuracy: 70.80% <br>
Validation set accuracy: 73.25% <br>
Test set accuracy: 70.30% <br>
Accuracy is lower than threshold.

**PHASE IV.  Additional task: sanity check the model. This data is more complex than what you’re used to working with, so it's not an easy task. We'll take a closer look at it later.**

**Sanity check** <br>

In [14]:
print('Ratio of 1-value is: {:.1%}'.format(sum(df['is_ultra']) / len(df['is_ultra'])))
print('Ratio of 0-value is: {:.1%}'.format( 1 - sum(df['is_ultra']) / len(df['is_ultra'])))

Ratio of 1-value is: 30.6%
Ratio of 0-value is: 69.4%


**So...** <br>
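The "sanity" accuracy is 69.4%: this is what we get if we assign every user to the 0-value. <br>
The tree-based models beat this constant model by about 10 percentage points on the test set; logistic regression (70.30%) is only marginally better.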
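The same baseline can be obtained with scikit-learn's `DummyClassifier`. The sketch below is a minimal illustration on a synthetic target that only reproduces the class shares printed above (30.6% ones, 69.4% zeros); the arrays `X_demo` and `y_demo` are assumptions, not the real data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic target with the class shares reported above: 30.6% ones, 69.4% zeros.
y_demo = np.array([1] * 306 + [0] * 694)
X_demo = np.zeros((len(y_demo), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

baseline_accuracy = baseline.score(X_demo, y_demo)
print(f"Baseline accuracy: {baseline_accuracy:.1%}")  # prints "Baseline accuracy: 69.4%"
```

In the notebook the baseline would be fitted on the train set and scored on the test set, in the same way as the other models, so that all numbers are comparable.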

**PHASE V (additional). Comparison of the results of different models:**

**Recap of our results** <br>

| Model | Hyperparameters | Train set accuracy | Validation set accuracy | Test set accuracy |
|---|---|---|---|---|
| (1) DecisionTreeClassifier | max_depth=4 | 80.55% | 80.72% | 79.16% |
| (2) RandomForestClassifier | n_estimators=21 | 99.38% | 80.25% | 79.63% |
| (3) LogisticRegression | solver='liblinear' | 70.80% | 73.25% | 70.30% |

(1) and (2) are above the threshold on the test set; (3) is below it. <br>
For (1) the test accuracy does not differ much from the validation and train accuracy. For (2) the validation and test accuracy are close, but the train accuracy is far higher.

The accuracy of **RandomForestClassifier** on the test set is only slightly better than that of **DecisionTreeClassifier** (79.63% vs 79.16%), so either model can be used to predict the customer's choice. Mobile carrier Megaline can recommend one of its plans, Smart or Ultra, with an accuracy of about 79-80%, which is better than the 69.4% obtained in the sanity check. Of course the model should be improved. How? <br>
- filter the outliers
- create new features (for example, the sum of normalized minutes and normalized messages)
- add data that the company would naturally have, such as the client's gender, age and so on.
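The second suggestion in the list above could look like the sketch below. It is only an illustration: the column name `activity` and the helper `add_activity_feature` are hypothetical, and the five rows are copied from the `df.head()` output above.

```python
import pandas as pd


def add_activity_feature(data):
    """Return a copy of data with a hypothetical 'activity' column:
    the sum of z-scored minutes and z-scored messages."""
    result = data.copy()
    minutes_z = (result["minutes"] - result["minutes"].mean()) / result["minutes"].std()
    messages_z = (result["messages"] - result["messages"].mean()) / result["messages"].std()
    result["activity"] = minutes_z + messages_z
    return result


# Five rows copied from the df.head() output above.
sample = pd.DataFrame({
    "minutes": [311.90, 516.75, 467.66, 745.53, 418.74],
    "messages": [83.0, 56.0, 86.0, 81.0, 1.0],
})

sample_with_feature = add_activity_feature(sample)
print(sample_with_feature)
```

For a real model the means and standard deviations should be computed on the train set only and then reused for the validation and test sets; otherwise information from those sets leaks into training.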

# Conclusion
We created a model that can predict which plan suits a client's behavior. <br>
We compared 3 models: RandomForestClassifier, LogisticRegression and DecisionTreeClassifier. Two of them (the tree-based ones) give results above the threshold (75%). <br>
We recommend DecisionTreeClassifier: its test accuracy is nearly the same as that of RandomForestClassifier, whose near-perfect train accuracy (99.38%) against about 80% on validation points to overfitting. <br>
We also gave recommendations on how to improve the model.

# PostScriptum
It is interesting to apply the model to the dataframe without outliers. Maybe it will give better results. <br>
It is not time-consuming, because all the necessary data is already prepared.

In [82]:
# features and target for:
# train set
features = df_cut_train.drop('is_ultra', axis=1)
target = df_cut_train['is_ultra']

# validation set
valid_features = df_cut_valid.drop('is_ultra', axis=1)
valid_target = df_cut_valid['is_ultra']

# test
test_features = df_cut_test.drop('is_ultra', axis=1)
test_target = df_cut_test['is_ultra']

model = DecisionTreeClassifier(max_depth=4, random_state=22)
model.fit(features,target)
print('Train set accuracy: {:.2%}'.format(model.score(features,target)))
print('Validation set accuracy: {:.2%}'.format(model.score(valid_features,valid_target)))
print('Test set accuracy: {:.2%}'.format(model.score(test_features,test_target)))

Train set accuracy: 80.83%
Validation set accuracy: 77.88%
Test set accuracy: 78.08%


BUT the previous results on the validation and test sets were better! <br>
(1) Results on **DecisionTreeClassifier**: <br>
model: max_depth=4 <br>
Train set accuracy: 80.55% <br>
Validation set accuracy: 80.72% <br>
Test set accuracy: 79.16% <br>

Maybe this is because these outliers are easy to predict? If someone uses ~50 GB of data, they should without doubt be recommended the Ultra plan. Maybe our model (our black box) takes this into account.
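This hypothesis could be checked by comparing the share of Ultra users among the heavy-traffic rows with the share among the rest. The sketch below uses a tiny made-up frame (`toy`) and a hypothetical helper `ultra_share`; the numbers it prints say nothing about the real dataset. The limit 39917 is the *mb_used* cutoff used for `df_cut`.

```python
import pandas as pd


def ultra_share(data, mb_limit):
    """Return the share of Ultra users among rows at or above mb_limit
    and among rows below it (hypothetical helper)."""
    heavy = data[data["mb_used"] >= mb_limit]
    regular = data[data["mb_used"] < mb_limit]
    return heavy["is_ultra"].mean(), regular["is_ultra"].mean()


# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "mb_used": [8000, 15000, 21000, 42000, 48000, 45000],
    "is_ultra": [1, 0, 0, 1, 1, 0],
})

heavy_share, regular_share = ultra_share(toy, mb_limit=39917)
print(f"Ultra share among heavy users: {heavy_share:.2f}")
print(f"Ultra share among the rest:    {regular_share:.2f}")
```

If the share among the outliers in the real data turned out to be much higher than the overall 30.6%, that would support the idea that these rows are easy to classify.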

I am not satisfied with the quality of the prediction, but it is OK for now.