#  Project Megaline by Maria Shemyakina


Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.
You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, you need to develop a model that will pick the right plan. Since you’ve already performed the data preprocessing step, you can move straight to creating the model.

Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.


Additional task: sanity check the model. This data is more complex than what you’re used to working with, so it's not an easy task. We'll take a closer look at it later.



## 1. Open and look through the data file

Import libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)


Open file

In [2]:
data = pd.read_csv('users_behavior.csv')

Let's look at our dataset

In [3]:
data.sample(10)

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
1784,67.0,356.05,48.0,19909.97,0
1194,105.0,830.37,21.0,21165.03,1
1816,41.0,275.8,9.0,10032.39,0
2854,34.0,246.06,31.0,8448.76,0
235,76.0,513.55,50.0,14584.73,1
2557,118.0,877.58,15.0,10242.8,0
1636,82.0,535.96,52.0,14259.08,0
1290,88.0,573.46,54.0,8714.69,0
446,93.0,680.59,70.0,16376.46,1
625,77.0,502.87,5.0,6928.0,0


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null float64
minutes     3214 non-null float64
messages    3214 non-null float64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [5]:
data.describe()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
count,3214.0,3214.0,3214.0,3214.0,3214.0
mean,63.038892,438.208787,38.281269,17207.673836,0.306472
std,33.236368,234.569872,36.148326,7570.968246,0.4611
min,0.0,0.0,0.0,0.0,0.0
25%,40.0,274.575,9.0,12491.9025,0.0
50%,62.0,430.6,30.0,16943.235,0.0
75%,82.0,571.9275,57.0,21424.7,1.0
max,244.0,1632.06,224.0,49745.73,1.0


There are no missing values (every column has 3214 non-null entries), and the data looks clean.

## 2. Split the source data into a training set, a validation set, and a test set.

The source data should be split 3:1:1 (60% training, 20% validation, 20% test).

In [6]:
data_target, data_test = train_test_split(data, test_size=0.2, random_state=12345)

In [7]:
data_train, data_valid = train_test_split(data_target, test_size=0.25, random_state=12345)

Let's look at what we've got

In [8]:
data_test.shape

(643, 5)

In [9]:
data_train.shape

(1928, 5)

In [10]:
data_valid.shape

(643, 5)
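
The two-step split can be checked with simple arithmetic. The sketch below assumes that `train_test_split` rounds a fractional `test_size` up to a whole number of rows and gives the remainder to the other part; `split_sizes` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_remaining, n_held_out) for a fractional test_size.

    Assumption: the held-out part is rounded up, the rest keeps the remainder.
    """
    n_held_out = math.ceil(n_rows * test_size)
    return n_rows - n_held_out, n_held_out

# Step 1: hold out 20% of all 3214 rows as the test set.
n_rest, n_test = split_sizes(3214, 0.2)
# Step 2: hold out 25% of the remaining rows as the validation set
# (0.25 * 0.8 = 0.2 of the full data, which gives the 3:1:1 ratio).
n_train, n_valid = split_sizes(n_rest, 0.25)

print(n_train, n_valid, n_test)  # 1928 643 643
```

These sizes match the shapes printed above.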

## 3. Investigate the quality of different models by changing hyperparameters. Briefly describe the findings of the study.

We will use three models: decision tree, random forest, and logistic regression.

We save the features of the training and validation samples in the variables features_train and features_valid, and the corresponding targets in target_train and target_valid.

In [11]:
features_train = data_train.drop(['is_ultra'], axis=1)
target_train = data_train['is_ultra']
features_valid = data_valid.drop(['is_ultra'], axis=1)
target_valid = data_valid['is_ultra']

### 3.1 Decision tree

We explore the decision tree model at tree depths from 1 to 7. For all models, we set random_state to 12345.

In [12]:
for i in range(1, 8):
    model_decTree = DecisionTreeClassifier(max_depth=i, random_state=12345)
    model_decTree.fit(features_train, target_train)
    predictions_decTree = model_decTree.predict(features_valid)
    
    print("max_depth =", i, ": ", end='')
    print(accuracy_score(target_valid, predictions_decTree))

max_depth = 1 : 0.7387247278382582
max_depth = 2 : 0.7573872472783826
max_depth = 3 : 0.7651632970451011
max_depth = 4 : 0.7636080870917574
max_depth = 5 : 0.7589424572317263
max_depth = 6 : 0.7573872472783826
max_depth = 7 : 0.7744945567651633


The best result, 77.45%, was obtained with a tree depth of 7.
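
Rather than reading the best depth off the printed list, the scores can be collected and the maximum picked programmatically. A minimal sketch, using the validation accuracies printed above; the dictionary name `depth_accuracy` is introduced here only for illustration.

```python
# Validation accuracy for each max_depth, copied from the output above.
depth_accuracy = {
    1: 0.7387247278382582,
    2: 0.7573872472783826,
    3: 0.7651632970451011,
    4: 0.7636080870917574,
    5: 0.7589424572317263,
    6: 0.7573872472783826,
    7: 0.7744945567651633,
}

# Pick the depth whose validation accuracy is highest.
best_depth = max(depth_accuracy, key=depth_accuracy.get)
print(best_depth, round(depth_accuracy[best_depth], 4))  # 7 0.7745
```

In the training loop itself, the same idea would mean storing each score in a dictionary instead of only printing it.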

### 3.2 RandomForest

We explore the random forest model with the hyperparameter n_estimators from 10 to 80 in steps of 10, with max_depth fixed at 10 and random_state=12345.

In [13]:
for estim in range(10, 81, 10):
    model_randomForest = RandomForestClassifier(n_estimators=estim, max_depth=10, random_state=12345)
    model_randomForest.fit(features_train, target_train)
    predictions_randomForest = model_randomForest.predict(features_valid)
    
    print("n_estimators =", estim, ":", accuracy_score(target_valid, predictions_randomForest))

n_estimators = 10 : 0.7900466562986003
n_estimators = 20 : 0.7962674961119751
n_estimators = 30 : 0.7916018662519441
n_estimators = 40 : 0.7962674961119751
n_estimators = 50 : 0.7978227060653188
n_estimators = 60 : 0.7916018662519441
n_estimators = 70 : 0.7962674961119751
n_estimators = 80 : 0.7947122861586314


The best result, 79.78%, was obtained with n_estimators=50. Several other settings, including n_estimators=70, reached 79.63%.
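
The differences between these scores are very small. One way to see this is to convert each accuracy back into a count of correctly classified validation rows. A minimal sketch, using the 643-row validation set and the accuracies printed above; the variable names are introduced only for illustration.

```python
n_valid_rows = 643  # size of the validation set (see data_valid.shape)

# Validation accuracy for each n_estimators, copied from the output above.
forest_accuracy = {
    10: 0.7900466562986003,
    20: 0.7962674961119751,
    30: 0.7916018662519441,
    40: 0.7962674961119751,
    50: 0.7978227060653188,
    60: 0.7916018662519441,
    70: 0.7962674961119751,
    80: 0.7947122861586314,
}

# accuracy = correct / n_valid_rows, so correct = accuracy * n_valid_rows.
correct = {n: round(acc * n_valid_rows) for n, acc in forest_accuracy.items()}
for n, hits in correct.items():
    print("n_estimators =", n, ":", hits, "correct")
```

The counts range from 508 to 513, so the gap between n_estimators=50 and n_estimators=70 is a single validation row.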

### 3.3 Logistic Regression

Let's explore the logistic regression model while changing the hyperparameters `solver` and `penalty`. random_state stays the same (12345).

In [14]:
model_regression = LogisticRegression(random_state=12345, solver='newton-cg', penalty='none')
model_regression.fit(features_train, target_train)
predictions_regression = model_regression.predict(features_valid)
accuracy_score(target_valid, predictions_regression)



0.7262830482115086

In [15]:
model_regression = LogisticRegression(random_state=12345, solver='liblinear', penalty='l1')
model_regression.fit(features_train, target_train)
predictions_regression = model_regression.predict(features_valid)
accuracy_score(target_valid, predictions_regression)

0.7278382581648523

In [16]:
model_regression = LogisticRegression(random_state=12345, solver='saga', penalty='elasticnet',l1_ratio=1)
model_regression.fit(features_train, target_train)
predictions_regression = model_regression.predict(features_valid)
accuracy_score(target_valid, predictions_regression)



0.6920684292379471

In [17]:
model_regression = LogisticRegression(random_state=12345, solver='lbfgs', penalty='l2')
model_regression.fit(features_train, target_train)
predictions_regression = model_regression.predict(features_valid)
accuracy_score(target_valid, predictions_regression)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


0.7262830482115086

In [18]:
model_regression = LogisticRegression(random_state=12345, solver='sag', penalty='l2')
model_regression.fit(features_train, target_train)
predictions_regression = model_regression.predict(features_valid)
accuracy_score(target_valid, predictions_regression)



0.6920684292379471

The best result is 72.78% with solver='liblinear', penalty='l1'
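
The convergence warning above suggests scaling the data. A minimal sketch of how that could look, assuming a `StandardScaler` placed in front of the logistic regression in a scikit-learn pipeline. The data here is synthetic and only mimics the scale of the four feature columns; the labelling rule is made up, so the printed accuracy says nothing about the Megaline dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the subscriber data (NOT the real dataset).
rng = np.random.default_rng(12345)
n_rows = 500
features = np.column_stack([
    rng.normal(60, 30, n_rows),       # roughly the scale of calls
    rng.normal(440, 230, n_rows),     # roughly the scale of minutes
    rng.normal(38, 36, n_rows),       # roughly the scale of messages
    rng.normal(17000, 7500, n_rows),  # roughly the scale of mb_used
])
# Hypothetical labelling rule, invented purely for this illustration.
score = features[:, 3] + 20 * features[:, 1] + rng.normal(0, 5000, n_rows)
target = (score > 26000).astype(int)

# Standardise each feature to zero mean and unit variance, then fit the model.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='lbfgs', random_state=12345),
)
model.fit(features, target)
accuracy = model.score(features, target)
print(round(accuracy, 3))
```

With the real data, the same pipeline could be fitted on features_train and scored on features_valid; whether scaling improves the validation accuracy there would have to be checked.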

### Conclusion

The random forest gave the best validation accuracy (about 79.6% to 79.8%, versus 77.45% for the decision tree and 72.78% for logistic regression), so we check the random forest on the test set.

## 4. Check the quality of the model using the test set.

In [19]:
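features_test = data_test.drop(['is_ultra'], axis=1)
target_test = data_test['is_ultra']
# For the final fit, the *_valid names are reused for the combined
# training + validation data (data_target), not the validation set alone
features_valid = data_target.drop(['is_ultra'], axis=1)
target_valid = data_target['is_ultra']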

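To test the random forest on the test set, we use n_estimators=70, max_depth=10, random_state=12345 and retrain it on the training and validation data combined. On validation, n_estimators=70 scored 79.63%, marginally below the 79.78% of n_estimators=50.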

In [20]:
model = RandomForestClassifier(n_estimators=70, max_depth=10, random_state=12345)
model.fit(features_valid, target_valid)
predictions = model.predict(features_test)
accuracy_score(target_test, predictions)

0.7931570762052877

We got an accuracy of 79.3% on the test set. It is close to our validation result of 79.6% and above the required threshold of 0.75. Mission completed.

## 5. Additional task: sanity check the model. This data is more complex than what you’re used to working with, so it's not an easy task. We'll take a closer look at it later.

We consider the model sane if it predicts better than a simple baseline that uses only the class proportions of the original sample, such as always predicting the more common plan.

We calculate the share of subscribers on the Smart plan in the total number of objects in the sample.

In [21]:
(data['is_ultra']==0).sum() / data.shape[0]

0.693528313627878

Our model predicts more accurately than a constant baseline that always predicts Smart: 79.3% vs 69.35%. This means the model passes the sanity check.
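
Two simple baselines can be compared side by side: always predicting the majority plan, and guessing at random in the same proportions as the original sample. A minimal sketch; the count of 985 Ultra subscribers is derived here from the reported mean of `is_ultra` (0.306472 × 3214) and is not printed anywhere in the notebook.

```python
n_total = 3214
n_ultra = 985                 # derived from mean(is_ultra) * n_total, see lead-in
n_smart = n_total - n_ultra   # 2229

p_smart = n_smart / n_total
p_ultra = n_ultra / n_total

# Baseline 1: always predict the more common plan.
constant_baseline = max(p_smart, p_ultra)

# Baseline 2: guess each plan at random with the sample proportions.
# A guess is correct when guess and truth agree: p_smart^2 + p_ultra^2.
random_baseline = p_smart ** 2 + p_ultra ** 2

print(round(constant_baseline, 4))  # 0.6935
print(round(random_baseline, 4))    # 0.5749
```

The test accuracy of 79.3% is above both baselines, and the constant baseline is the stricter of the two.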

## Conclusion

The best model for our task is the random forest. It reached an accuracy of 79.3% on the test set and passed the sanity check.