# **Problem Statement**
Mobile carrier Megaline has found out that many of their subscribers use legacy plans.
They want to develop a model that would analyze subscribers' behavior and recommend
one of Megaline's newer plans: Smart or Ultra.
You have access to behavior data about subscribers who have already switched to the
new plans (from the project for the Statistical Data Analysis course). For this
classification task, you need to develop a model that will pick the right plan. Since you’ve
already performed the data preprocessing step, you can move straight to creating the
model.
Develop a model with the highest possible accuracy. In this project, the threshold for
accuracy is 0.75. Check the accuracy using the test dataset.
1. Open and look through the data file.
2. Split the source data into a training set, a validation set, and a test set.
3. Investigate the quality of different models by changing hyperparameters. Briefly
describe the findings of the study.
4. Check the quality of the model using the test set.
5. Additional task: sanity check the model. This data is more complex than what
you’re used to working with, so it's not an easy task. We'll take a closer look at it
later.


# **Open and look through the data file**

In [3]:
# Import libraries
import pandas as pd

In [4]:
telco = pd.read_csv('https://bit.ly/UsersBehaviourTelco')

In [None]:
telco.head()

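|   | calls | minutes | messages | mb_used | is_ultra |
|---|-------|---------|----------|---------|----------|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |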


In [None]:
telco.shape

(3214, 5)

In [None]:
telco['is_ultra'].unique()

array([0, 1])

In [None]:
telco.duplicated().sum()

0

In [None]:
telco.isnull().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

# **Split the source data into a training set, a validation set, and a test set**

In [5]:
# Separate the features from the target column
features = telco.drop(['is_ultra'], axis=1)
target = telco['is_ultra']

In [6]:
from sklearn.model_selection import train_test_split
# Split in a 3:1:1 ratio (60% training, 20% validation, 20% test)

# Step 1: keep 60% for training and hold out the remaining 40%
features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size=0.4, random_state=12345)

# Step 2: split the held-out 40% in half to get the validation and test sets
features_valid, features_test, target_valid, target_test = train_test_split(features_valid, target_valid, test_size=0.5, random_state=12345)

print(features_train.shape)
print(features_valid.shape)
print(features_test.shape)

(1928, 4)
(643, 4)
(643, 4)
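
The two-step split above can also be made to preserve the class balance in every subset. The following is a minimal sketch, not the split used in this notebook: it runs on synthetic data (the `demo` frame and its random values are assumptions that only borrow the column names), and the `stratify` argument is an addition that the original cells do not use.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the telco data (assumption: same column names, random values)
rng = np.random.default_rng(12345)
n = 1000
demo = pd.DataFrame({
    'calls': rng.integers(0, 200, n),
    'minutes': rng.uniform(0, 1500, n),
    'messages': rng.integers(0, 200, n),
    'mb_used': rng.uniform(0, 40000, n),
    'is_ultra': (rng.random(n) < 0.3).astype(int),
})
demo_features = demo.drop(columns=['is_ultra'])
demo_target = demo['is_ultra']

# Step 1: hold out 40% of the rows, keeping class proportions equal
f_train, f_rest, t_train, t_rest = train_test_split(
    demo_features, demo_target,
    test_size=0.4, random_state=12345, stratify=demo_target)

# Step 2: split the held-out rows in half into validation and test sets
f_valid, f_test, t_valid, t_test = train_test_split(
    f_rest, t_rest,
    test_size=0.5, random_state=12345, stratify=t_rest)

# Row counts per subset and the share of class 1 in each
print(len(f_train), len(f_valid), len(f_test))
print(t_train.mean(), t_valid.mean(), t_test.mean())
```

With stratification the share of class 1 is nearly identical across the three subsets, which makes accuracy figures on the validation and test sets easier to compare.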


# **Model Creation and Hyperparameter Tuning**

In [None]:
#Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

rf_model = RandomForestClassifier(random_state=12345, n_estimators=50, max_depth=100)
# Other settings tried:
# rf_model = RandomForestClassifier(random_state=12345, n_estimators=42, max_depth=120)
# rf_model = RandomForestClassifier(random_state=12345, n_estimators=50)
rf_model.fit(features_train, target_train)

print("Random Forest Accuracy: ",rf_model.score(features_valid, target_valid))

Random Forest Accuracy:  0.7916018662519441


In [9]:
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=2, random_state=12345)
model.fit(features_train, target_train)
print("Decision Tree Accuracy: ",model.score(features_valid, target_valid))

Decision Tree Accuracy:  0.7744945567651633


In [None]:
from sklearn.linear_model import LogisticRegression

# Other settings tried:
# log_model = LogisticRegression(random_state=12345, solver='liblinear')
# log_model = LogisticRegression(random_state=12345, solver='lbfgs', penalty='none', max_iter=130)
log_model = LogisticRegression(random_state=12345, solver='liblinear', penalty='l1')

log_model.fit(features_train, target_train)
print("Logistic Regression Accuracy: ",log_model.score(features_valid, target_valid))

Logistic Regression Accuracy:  0.7573872472783826
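
Each cell above evaluates one hyperparameter setting at a time. A systematic sweep might look like the sketch below, which tries several values of `max_depth` for a decision tree and keeps the one with the best validation accuracy. It runs on synthetic data from `make_classification` (an assumption, not the Megaline data), and the names `best_depth` and `best_acc` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 4 features, roughly 70/30 class balance
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

best_depth, best_acc = None, 0.0
for depth in range(1, 11):
    # Train a tree with the current depth limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    tree.fit(X_train, y_train)
    # Score on the validation set only; the test set stays untouched
    acc = tree.score(X_valid, y_valid)
    if acc > best_acc:
        best_depth, best_acc = depth, acc

print("Best max_depth:", best_depth, "validation accuracy:", best_acc)
```

The same loop shape works for `n_estimators` in a random forest or for the solver and penalty in logistic regression.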


# **Check the quality of the model using the test set**

In [None]:
print("Logistic Regression Accuracy: ",log_model.score(features_test, target_test))
print("Random Forest Accuracy: ",rf_model.score(features_test, target_test))
print("Decision Tree Accuracy: ",model.score(features_test, target_test))

Logistic Regression Accuracy:  0.7402799377916018
Random Forest Accuracy:  0.7931570762052877
Decision Tree Accuracy:  0.7744945567651633
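
A common alternative to scoring every model on the test set is to pick one model by validation accuracy and evaluate only that model on the test set, so the test score stays an unbiased estimate. The sketch below illustrates this under stated assumptions: the data is synthetic, and the `candidates` dictionary and its keys are hypothetical names, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 4 features, roughly 70/30 class balance
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=12345)

# Same 60/20/20 two-step split as in the notebook
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=12345)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=12345)

candidates = {
    'random_forest': RandomForestClassifier(n_estimators=50, random_state=12345),
    'decision_tree': DecisionTreeClassifier(max_depth=2, random_state=12345),
    'logistic_regression': LogisticRegression(solver='liblinear', random_state=12345),
}

# Fit every candidate and score it on the validation set
valid_scores = {
    name: clf.fit(X_train, y_train).score(X_valid, y_valid)
    for name, clf in candidates.items()
}

# Choose the winner by validation accuracy, then touch the test set once
best_name = max(valid_scores, key=valid_scores.get)
test_acc = candidates[best_name].score(X_test, y_test)

print("Selected model:", best_name)
print("Test accuracy:", test_acc)
```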


# **Sanity Check the Model**

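- To sanity check the models, I compared the class proportions of `is_ultra` in the full dataset and in the training, validation, and test sets (printed in that order below).
- About 69% of subscribers are on Smart (`is_ultra` = 0), so a constant model that always predicts 0 would score about 0.68 on the test set. All three models score higher than that on the test set (0.74 to 0.79).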


In [16]:
class_frequency = telco['is_ultra'].value_counts(normalize=True)
print(class_frequency)

train_freq = target_train.value_counts(normalize=True)
print(train_freq)

valid_freq = target_valid.value_counts(normalize=True)
print(valid_freq)

test_freq = target_test.value_counts(normalize=True)
print(test_freq)

0    0.693528
1    0.306472
Name: is_ultra, dtype: float64
0    0.692427
1    0.307573
Name: is_ultra, dtype: float64
0    0.706065
1    0.293935
Name: is_ultra, dtype: float64
0    0.684292
1    0.315708
Name: is_ultra, dtype: float64
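
The class proportions above give the accuracy of a constant model by hand. The same baseline can be computed with scikit-learn's `DummyClassifier`, as in this minimal sketch. It uses synthetic data (an assumption, not the Megaline data), and the names `baseline` and `forest` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 4 features, roughly 70/30 class balance
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=1,
    weights=[0.7, 0.3], random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345, stratify=y)

# Baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
baseline_acc = baseline.score(X_test, y_test)

# A real model to compare against the baseline
forest = RandomForestClassifier(n_estimators=50, random_state=12345)
forest.fit(X_train, y_train)
forest_acc = forest.score(X_test, y_test)

print("Baseline accuracy:", baseline_acc)
print("Random forest accuracy:", forest_acc)
```

A model passes the sanity check only if its accuracy is clearly above the baseline; a score close to the majority-class share suggests the model has learned little beyond the class imbalance.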
