## Defining the Question

#### Context

Mobile carrier Megaline has found out that many of their subscribers use legacy plans.
They want to develop a model that would analyze subscribers' behavior and recommend
one of Megaline's newer plans: Smart or Ultra.
We have access to behavior data for subscribers who have already switched to the
new plans. For this classification task, we will develop a model that picks the
right plan. The minimum acceptable accuracy is 0.75.

#### Metric of success

A model that recommends the right plan with an accuracy of at least 0.75 on the test set

#### Solution steps

1. Import libraries
2. Data exploration: load data, preview and explore data, check for and handle missing values and duplicates, fix inconsistent column names if any
3. Data preparation: prepare data for use in model training
4. Data modeling: create, train and evaluate the model
5. Summarise findings and provide recommendations

## Import Libraries

In [11]:
import pandas as pd

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, accuracy_score
import warnings

In [30]:
# suppress warnings
warnings.filterwarnings('ignore')

## Data Exploration

In [2]:
# load dataset
df = pd.read_csv('https://bit.ly/UsersBehaviourTelco')

In [3]:
# view a sample
df.sample(10)

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 2279 | 80.0 | 521.88 | 0.0 | 12851.66 | 0 |
| 2846 | 78.0 | 485.98 | 28.0 | 22366.48 | 0 |
| 731 | 47.0 | 273.6 | 60.0 | 24264.53 | 0 |
| 2419 | 9.0 | 80.82 | 2.0 | 12122.99 | 0 |
| 868 | 109.0 | 763.22 | 42.0 | 17236.5 | 0 |
| 524 | 33.0 | 281.77 | 75.0 | 13933.01 | 1 |
| 2119 | 26.0 | 172.87 | 16.0 | 6431.26 | 0 |
| 2656 | 30.0 | 185.07 | 34.0 | 17166.53 | 0 |
| 368 | 31.0 | 185.63 | 101.0 | 14344.72 | 0 |
| 2416 | 45.0 | 286.43 | 47.0 | 15975.66 | 0 |


In [4]:
# check column and row sizes
df.shape

(3214, 5)

In [5]:
# check for missing values
df.isnull().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

In [6]:
# check feature datatypes
df.dtypes

calls       float64
minutes     float64
messages    float64
mb_used     float64
is_ultra      int64
dtype: object

In [9]:
# check unique values in is_ultra column
df['is_ultra'].value_counts()

0    2229
1     985
Name: is_ultra, dtype: int64
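The classes are imbalanced, so it helps to know what accuracy a trivial model would reach before judging the 0.75 threshold. The sketch below uses only the counts printed above and assumes that `is_ultra = 0` corresponds to the Smart plan; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
# Assumption: 0 = Smart, 1 = Ultra.
class_counts = {0: 2229, 1: 985}

total = sum(class_counts.values())
majority_class = max(class_counts, key=class_counts.get)

# Accuracy of a model that always predicts the majority class.
baseline_accuracy = class_counts[majority_class] / total

print('Majority class: {}'.format(majority_class))
print('Baseline accuracy: {:.4f}'.format(baseline_accuracy))
```

Any model worth keeping should beat this baseline by a clear margin, not just the 0.75 threshold.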

In [10]:
# check for duplicates
df.duplicated().sum()

0

In [12]:
# correlation matrix
df.corr()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| calls | 1.0 | 0.982083 | 0.177385 | 0.286442 | 0.207122 |
| minutes | 0.982083 | 1.0 | 0.17311 | 0.280967 | 0.206955 |
| messages | 0.177385 | 0.17311 | 1.0 | 0.195721 | 0.20383 |
| mb_used | 0.286442 | 0.280967 | 0.195721 | 1.0 | 0.198568 |
| is_ultra | 0.207122 | 0.206955 | 0.20383 | 0.198568 | 1.0 |
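The correlation matrix shows that `calls` and `minutes` are almost perfectly correlated (0.98), so the two features carry largely the same information. As a minimal sketch, the snippet below rebuilds the matrix from the values printed above and lists the pairs above a cut-off; the 0.9 threshold is an assumption chosen for illustration, not something the analysis prescribes.

```python
import pandas as pd

# Correlation values copied from the df.corr() output above.
columns = ['calls', 'minutes', 'messages', 'mb_used', 'is_ultra']
corr = pd.DataFrame(
    [
        [1.0, 0.982083, 0.177385, 0.286442, 0.207122],
        [0.982083, 1.0, 0.17311, 0.280967, 0.206955],
        [0.177385, 0.17311, 1.0, 0.195721, 0.20383],
        [0.286442, 0.280967, 0.195721, 1.0, 0.198568],
        [0.207122, 0.206955, 0.20383, 0.198568, 1.0],
    ],
    index=columns,
    columns=columns,
)

threshold = 0.9  # assumed cut-off for "highly correlated"

# Visit each unordered pair once (upper triangle, diagonal excluded).
high_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(columns)
    for b in columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(high_pairs)
```

Tree-based models are largely insensitive to redundant features, but for logistic regression one of the two columns could be dropped to make the coefficients easier to interpret.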


## Data Preparation

Set target and features

In [13]:
features = df.drop(['is_ultra'], axis=1)
target = df['is_ultra']

Standardize features

In [15]:
scaler = StandardScaler()
features  = scaler.fit_transform(features)

Split data into training and testing sets

In [16]:
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size = 0.2, random_state = 0)
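Note that the scaler above is fitted on the full dataset before the split, so statistics from the test rows influence the training features. A common alternative is to split first and fit the scaler on the training part only. The sketch below illustrates this on synthetic stand-in data (the arrays `X` and `y` are made up for the example, and `stratify=y` is an additional assumption that keeps the class ratio equal in both parts).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four usage features and the is_ultra target.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 4))
y = np.array([0] * 140 + [1] * 60)

# Split first; stratify keeps the 70/30 class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit the scaler on the training data only, then reuse its statistics.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
print(y_train.mean(), y_test.mean())
```

For the tree-based models the difference is negligible, since they do not depend on feature scale; it matters mainly for logistic regression.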

## Data Modeling

#### Train decision tree model

In [20]:
# find out which max depth is optimal
for i in range(1, 10):
  decision_classifier = DecisionTreeClassifier(max_depth = i, random_state = 12345)
  decision_classifier.fit(features_train, target_train)
  # accuracy on the test set for this depth
  score = decision_classifier.score(features_test, target_test)
  print('Score: {}, Depth: {}'.format(score, i))

Score: 0.7480559875583204, Depth: 1
Score: 0.7791601866251944, Depth: 2
Score: 0.7853810264385692, Depth: 3
Score: 0.7822706065318819, Depth: 4
Score: 0.7822706065318819, Depth: 5
Score: 0.7900466562986003, Depth: 6
Score: 0.7978227060653188, Depth: 7
Score: 0.7900466562986003, Depth: 8
Score: 0.7884914463452566, Depth: 9
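The loop above picks `max_depth` by scoring on the test set, which makes the final test score slightly optimistic. One way to avoid this is to choose the depth by cross-validation on the training data and touch the test set only once at the end. The following is a minimal sketch on synthetic data generated with `make_classification` (a stand-in, not the Megaline dataset); the 5-fold setting is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with four features, as in the original dataset.
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Mean 5-fold cross-validated accuracy for each candidate depth,
# computed on the training data only.
cv_scores = {}
for depth in range(1, 10):
    model = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
    cv_scores[depth] = scores.mean()

best_depth = max(cv_scores, key=cv_scores.get)

# Refit with the chosen depth and evaluate once on the held-out test set.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=12345)
final_model.fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)

print('Best depth: {}, test accuracy: {}'.format(best_depth, test_score))
```

The same pattern applies to the `n_estimators` search for the random forest.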


In [28]:
# use the optimal max depth
decision_classifier = DecisionTreeClassifier(max_depth = 7, random_state = 12345)
decision_classifier.fit(features_train, target_train)
score = decision_classifier.score(features_test, target_test)
print('Score: {}'.format(score))

Score: 0.7978227060653188


#### Train random forest model

In [25]:
# find out which number of estimators is optimal
for i in range(1,15):
  random_classifier = RandomForestClassifier(n_estimators = i, max_depth = 6, random_state = 12345)
  random_classifier.fit(features_train, target_train)
  score = random_classifier.score(features_test, target_test)
  print('Score: {}, Estimator: {}'.format(score, i))

Score: 0.7822706065318819, Estimator: 1
Score: 0.776049766718507, Estimator: 2
Score: 0.7962674961119751, Estimator: 3
Score: 0.7931570762052877, Estimator: 4
Score: 0.7916018662519441, Estimator: 5
Score: 0.7916018662519441, Estimator: 6
Score: 0.7978227060653188, Estimator: 7
Score: 0.8087091757387247, Estimator: 8
Score: 0.8009331259720062, Estimator: 9
Score: 0.80248833592535, Estimator: 10
Score: 0.7993779160186625, Estimator: 11
Score: 0.8009331259720062, Estimator: 12
Score: 0.80248833592535, Estimator: 13
Score: 0.80248833592535, Estimator: 14


In [27]:
# use the optimal parameters
random_classifier = RandomForestClassifier(n_estimators = 8, max_depth = 6, random_state = 12345)
random_classifier.fit(features_train, target_train)
score = random_classifier.score(features_test, target_test)
print('Score: {}'.format(score))

Score: 0.8087091757387247


#### Train logistic regression model

In [31]:
# find out which parameters are optimal
params_grid = {'C':[0.001,0.01,0.1,1,10], 'penalty':['l1', 'l2']}
logistic_regression = LogisticRegression(random_state=0)

gd_sr_cl = GridSearchCV(estimator = logistic_regression, param_grid = params_grid, scoring = 'accuracy', cv = 5, n_jobs =-1)
gd_sr_cl.fit(features_train, target_train)

print(gd_sr_cl.best_params_)

{'C': 0.1, 'penalty': 'l2'}
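One caveat about the grid above: in scikit-learn 0.22 and later the default `LogisticRegression` solver is `lbfgs`, which does not support the `l1` penalty, so depending on the installed version the `l1` candidates may fail and be silently skipped once warnings are suppressed. A possible fix, sketched here on synthetic stand-in data, is to set a solver that supports both penalties; choosing `liblinear` is an assumption, and `saga` would be another option.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with four features.
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)

params_grid = {'C': [0.001, 0.01, 0.1, 1, 10], 'penalty': ['l1', 'l2']}

# liblinear supports both 'l1' and 'l2', so every candidate can be fitted.
logistic_regression = LogisticRegression(solver='liblinear', random_state=0)

grid = GridSearchCV(
    estimator=logistic_regression,
    param_grid=params_grid,
    scoring='accuracy',
    cv=5,
)
grid.fit(X, y)

# Every parameter combination should have a finite mean score.
all_finite = bool(np.isfinite(grid.cv_results_['mean_test_score']).all())
print(grid.best_params_, all_finite)
```

Checking `cv_results_` for non-finite scores is a quick way to confirm that no candidate was dropped.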


In [33]:
# use the optimal parameters
logistic_regression = LogisticRegression(C = 0.1, penalty = 'l2', random_state=0)
logistic_regression.fit(features_train,target_train)
score = logistic_regression.score(features_test, target_test)
print('Score: {}'.format(score))

Score: 0.7651632970451011


Random Forest Classifier had the **best** accuracy score, about 81%

Decision Tree Classifier had the second best accuracy score, about 80%

Logistic Regression had the lowest accuracy score, about 77%

#### Evaluate the best model: random forest

In [34]:
predictions = random_classifier.predict(features_test)

In [36]:
print(classification_report(target_test, predictions))

              precision    recall  f1-score   support

           0       0.81      0.94      0.87       446
           1       0.78      0.52      0.62       197

    accuracy                           0.81       643
   macro avg       0.80      0.73      0.75       643
weighted avg       0.81      0.81      0.80       643
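The report shows a much lower recall for class 1 (0.52) than for class 0 (0.94), which accuracy alone hides. To make clear how these figures are derived, the sketch below computes precision, recall and accuracy from a confusion matrix. The labels are a small hand-made toy example, not the actual test predictions.

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only (not the Megaline test set).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in this order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
accuracy = (tp + tn) / (tp + tn + fp + fn)

print('TN={}, FP={}, FN={}, TP={}'.format(tn, fp, fn, tp))
print('precision={:.2f}, recall={:.2f}, accuracy={:.2f}'.format(
    precision, recall, accuracy))
```

Applied to the real predictions, `confusion_matrix(target_test, predictions)` would show how many Ultra subscribers the model labels as Smart.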



## Summary of Findings and Recommendations

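Random Forest Classifier had the best accuracy score, about 81%, which is above the 0.75 threshold

Megaline can use the Random Forest model to recommend the Ultra or Smart plan to subscribers. Note, however, that its recall for the Ultra class is only 0.52, so it misses roughly half of the subscribers who are on Ultra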