# Megaline 2

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra.
You have access to behavior data about subscribers who have already switched to the new plans (from the project for the Statistical Data Analysis course). For this classification task, you need to develop a model that will pick the right plan. Since you’ve already performed the data preprocessing step, you can move straight to creating the model.
Develop a model with the highest possible accuracy. In this project, the threshold for accuracy is 0.75. Check the accuracy using the test dataset.

[We've provided you with some commentary to guide your thinking as you complete this project. However, make sure to remove all the bracketed comments before submitting your project.]

[Before you dive into analyzing your data, explain for yourself the purpose of the project and actions you plan to take.]

[Please bear in mind that studying, amending, and analyzing data is an iterative process. It is normal to return to previous steps and correct/expand them to allow for further steps.]

## Initialization

In [1]:
# Load the libraries
import pandas as pd
from sklearn.linear_model import LogisticRegression  # classifier
from sklearn.tree import DecisionTreeClassifier  # classifier
from sklearn.ensemble import RandomForestClassifier  # classifier; all three are compared below
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier  # baseline for the sanity check

In [2]:
# Load the dataset
megaline = pd.read_csv('/datasets/users_behavior.csv')
megaline

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.90,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0
...,...,...,...,...,...
3209,122.0,910.98,20.0,35124.90,1
3210,25.0,190.36,0.0,3275.61,0
3211,97.0,634.44,70.0,13974.06,0
3212,64.0,462.32,90.0,31239.78,0


## Prepare the data

[Explore the table to get an initial understanding of the data. Do necessary corrections to the table if necessary.]

In [3]:
# print the general/summary information about the DataFrame
megaline.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [4]:
# Print a sample of data
megaline.head(10)

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0
5,58.0,344.56,21.0,15823.37,0
6,57.0,431.64,20.0,3738.9,1
7,15.0,132.4,6.0,21911.6,0
8,7.0,43.39,3.0,2538.67,1
9,90.0,665.41,38.0,17358.61,0


### FIX DATA

[Describe what you see and notice in the general information and the printed data sample for the above data. Are there any issues (inappropriate data types, missing data etc) that may need further investigation and changes? How that can be fixed?]

In [5]:
print(megaline['calls'].isna().value_counts())
print(megaline['minutes'].isna().value_counts())
print(megaline['messages'].isna().value_counts())
print(megaline['mb_used'].isna().value_counts())
print(megaline['is_ultra'].isna().value_counts())

False    3214
Name: calls, dtype: int64
False    3214
Name: minutes, dtype: int64
False    3214
Name: messages, dtype: int64
False    3214
Name: mb_used, dtype: int64
False    3214
Name: is_ultra, dtype: int64


There are no missing values in any column and every column is numeric, so nothing needs to be fixed.
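
The per-column checks above can also be written as a single call. A minimal sketch, assuming a small hypothetical frame `toy` with the same column names as the dataset (the values and the two missing cells are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with the same column names as the dataset
toy = pd.DataFrame({
    'calls': [40.0, 85.0, np.nan],
    'minutes': [311.90, 516.75, 467.66],
    'messages': [83.0, 56.0, 86.0],
    'mb_used': [19915.42, np.nan, 21060.45],
    'is_ultra': [0, 0, 1],
})

# One missing-value count per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

A column with a count of zero has no missing values.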

### ENRICH DATA

[Add additional factors to the data if you believe they might be useful.]

Nothing needed

In [6]:
print(megaline[megaline.duplicated()])  # show fully duplicated rows, if any

Empty DataFrame
Columns: [calls, minutes, messages, mb_used, is_ultra]
Index: []


## Model Testing

[Split the source data into a training set, a validation set, and a test set.]

Since we're classifying subscribers into plans, the target is categorical, not numerical. Therefore this is a classification task, not a regression task.

In [7]:
features = megaline.drop(['mb_used','is_ultra'],axis=1)
target = megaline['is_ultra'] 

# Split the data into training, validation, and test sets.
# train_test_split only splits in two, so first split off a remainder, then split the remainder again.
# https://towardsdatascience.com/how-to-split-data-into-three-sets-train-validation-and-test-and-why-e50d22d3e54c

# 75% of the rows go to the training set, 25% to the remainder
megaline_train, megaline_remainder = train_test_split(megaline, test_size=0.25, random_state=12345)

# 75% of the remainder becomes the validation set, 25% the test set
# (about 18.75% and 6.25% of the full dataset)
megaline_valid, megaline_test = train_test_split(megaline_remainder, test_size=0.25, random_state=12345)


features_train = megaline_train.drop(['mb_used', 'is_ultra'], axis=1)
target_train = megaline_train['is_ultra']

features_valid = megaline_valid.drop(['mb_used', 'is_ultra'], axis=1)
target_valid = megaline_valid['is_ultra']

test_df = megaline_test
features_test = test_df.drop(['mb_used', 'is_ultra'], axis=1)
target_test = test_df['is_ultra']

#print(features_train)
#print(target_train)

def error_count(answers, predictions):
    """Count how many predictions differ from the correct answers."""
    errors = 0
    # zip(answers, predictions) walks through both sequences in parallel
    for answer, prediction in zip(answers, predictions):
        if answer != prediction:
            errors += 1
    return errors

#print('RandomForestClassifier Errors:', error_count(target_test, test_predictions))
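
The two-step split above leaves roughly 75% of the rows for training, 19% for validation, and 6% for testing. A minimal sketch, assuming a 60/20/20 split with class proportions preserved is wanted instead; the frame `toy` and its random values are hypothetical, and `stratify` is an optional `train_test_split` argument that the notebook does not use:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data with a binary target column
rng = np.random.default_rng(12345)
toy = pd.DataFrame({
    'calls': rng.integers(0, 200, size=1000),
    'minutes': rng.uniform(0, 1500, size=1000),
    'messages': rng.integers(0, 150, size=1000),
    'is_ultra': rng.integers(0, 2, size=1000),
})

# Step 1: 60% training, 40% remainder
toy_train, toy_rest = train_test_split(
    toy, test_size=0.4, random_state=12345, stratify=toy['is_ultra'])

# Step 2: split the remainder in half, giving 20% validation and 20% test
toy_valid, toy_test = train_test_split(
    toy_rest, test_size=0.5, random_state=12345, stratify=toy_rest['is_ultra'])

print(len(toy_train), len(toy_valid), len(toy_test))
```

A larger test set makes the final accuracy estimate less sensitive to a handful of rows.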

### LogisticRegression

[Investigate the quality of different models by changing hyperparameters. Briefly describe the findings of the study.]

Hypothesis: RandomForestClassifier will have the best accuracy, even though it is the slowest.

[Test the quality of LogisticRegression(). Briefly describe the findings of the study.]

In [8]:
#First we test the LogisticRegression()

lr_model = LogisticRegression(random_state=12345, solver='lbfgs')
lr_model.fit(features_train, target_train)
lr_model_score = lr_model.score(features_valid, target_valid)
lr_test_predictions = lr_model.predict(features_test)  # predictions on the test set created during the split
print("Accuracy of the Logistic Regression model on the validation set: {}".format(lr_model_score))
print('LogisticRegression Errors:', error_count(target_test, lr_test_predictions))

Accuracy of the Logistic Regression model on the validation set: 0.746268656716418
LogisticRegression Errors: 53
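
The cell above reports accuracy for one set and an error count for another. The two measures are directly related: the number of errors equals the number of rows times (1 - accuracy). A minimal sketch with hypothetical labels (the arrays `answers` and `predictions` are made up for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions for eight rows
answers = np.array([0, 1, 0, 0, 1, 1, 0, 0])
predictions = np.array([0, 1, 1, 0, 0, 1, 0, 0])

# Count mismatches without an explicit loop
errors = int((answers != predictions).sum())
accuracy = accuracy_score(answers, predictions)

# errors = n * (1 - accuracy)
print(errors, accuracy, len(answers) * (1 - accuracy))
```

Comparing models on the same set with the same measure keeps the numbers comparable.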


### DecisionTreeClassifier

[Test the quality of DecisionTreeClassifier(). Briefly describe the findings of the study.]

In [9]:
# Then we test the DecisionTreeClassifier()

best_dtc_score = 0
best_dtc_depth = 0

for depth in range(1, 6):  # max_depth from 1 to 5
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)  # train the model on the training set
    score = model.score(features_valid, target_valid)  # accuracy on the validation set
    # test-set predictions of the tree fitted in this iteration;
    # after the loop this holds the predictions of the max_depth = 5 tree
    dtc_test_predictions = model.predict(features_test)
    if score > best_dtc_score:
        best_dtc_score = score  # save the best validation accuracy
        best_dtc_depth = depth  # save the depth that produced it

print("Accuracy of the best Decision Tree Classifier model on the validation set (max_depth = {}): {}".format(best_dtc_depth, best_dtc_score))
print('DecisionTreeClassifier Errors:', error_count(target_test, dtc_test_predictions))

Accuracy of the best Decision Tree Classifier model on the validation set (max_depth = 4): 0.7711442786069652
DecisionTreeClassifier Errors: 44
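
When the test-set errors should come from the best tree rather than from the last tree fitted in the loop, the fitted model can be stored together with its score. A minimal sketch on synthetic data from `make_classification`; the names `X`, `y`, `best_model`, and the 60/20/20 split are assumptions for illustration, not part of the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with three features
X, y = make_classification(n_samples=600, n_features=3, n_informative=3,
                           n_redundant=0, random_state=12345)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=12345)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=12345)

best_model, best_depth, best_score = None, 0, 0.0
for depth in range(1, 6):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(X_train, y_train)
    score = model.score(X_valid, y_valid)
    if score > best_score:
        # keep the fitted model itself, not only its score
        best_model, best_depth, best_score = model, depth, score

# predictions from the tree that scored best on the validation set
test_predictions = best_model.predict(X_test)
print(best_depth, best_score)
```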


### RandomForestClassifier

[Test the quality of RandomForestClassifier(). Briefly describe the findings of the study.]

In [10]:
#Then we test the RandomForestClassifier()

best_rfc_score = 0
best_rfc_est = 0
best_rfc_depth = 0

for depth in range(1, 6):  # max_depth from 1 to 5
    for est in range(1, 10):  # n_estimators from 1 to 9 (range excludes the end value)
        model = RandomForestClassifier(random_state=12345, n_estimators=est, max_depth=depth)
        model.fit(features_train, target_train)  # train the model on the training set
        score = model.score(features_valid, target_valid)  # accuracy on the validation set
        rfc_test_predictions = model.predict(features_test)  # test-set predictions of the current forest
    # this check sits outside the inner loop, so it runs once per depth
    # and compares only the last forest trained (n_estimators = 9)
    if score > best_rfc_score:
        best_rfc_score = score  # save the best validation accuracy
        best_rfc_est = est  # save the number of estimators that produced it
        best_rfc_depth = depth  # save the depth that produced it

#print()
print("Accuracy of the best Random Forest Classifier model on the validation set (n_estimators = {}, max_depth = {}): {}".format(best_rfc_est, best_rfc_depth ,best_rfc_score))
print('RandomForestClassifier Errors:', error_count(target_test, rfc_test_predictions))

Accuracy of the best Random Forest Classifier model on the validation set (n_estimators = 9, max_depth = 3): 0.7810945273631841
RandomForestClassifier Errors: 43
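
A minimal sketch, on synthetic data from `make_classification`, of the same depth and tree-count search written with scikit-learn's `ParameterGrid`, so that the validation score is compared for every combination. The names `X`, `y`, `grid`, and `best_params` are assumptions for illustration and do not come from the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic binary classification data with three features
X, y = make_classification(n_samples=600, n_features=3, n_informative=3,
                           n_redundant=0, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Every combination of max_depth 1..5 and n_estimators 1..9
grid = ParameterGrid({
    'max_depth': list(range(1, 6)),
    'n_estimators': list(range(1, 10)),
})

best_params, best_score = None, 0.0
for params in grid:
    model = RandomForestClassifier(random_state=12345, **params)
    model.fit(X_train, y_train)
    score = model.score(X_valid, y_valid)
    if score > best_score:  # checked for each combination
        best_params, best_score = params, score

print(best_params, best_score)
```

With a single flat loop, the comparison cannot end up at the wrong nesting level.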


## Test the Best Model

[Check the quality of the model using the test set.]

In [11]:
# Pick the model with the highest validation accuracy
if best_rfc_score >= best_dtc_score and best_rfc_score >= lr_model_score:
    best_score = best_rfc_score
    best_model = RandomForestClassifier(random_state=12345, n_estimators=best_rfc_est, max_depth=best_rfc_depth)
    print('Best model is RandomForestClassifier with a score of', best_score)
    print('RandomForestClassifier Errors:', error_count(target_test, rfc_test_predictions))
elif best_dtc_score >= lr_model_score:
    best_score = best_dtc_score
    best_model = DecisionTreeClassifier(random_state=12345, max_depth=best_dtc_depth)
    print('Best model is DecisionTreeClassifier with a score of', best_score)
    # fit the selected tree on the training set to count its test-set errors
    dtc_best_predictions = best_model.fit(features_train, target_train).predict(features_test)
    print('DecisionTreeClassifier Errors:', error_count(target_test, dtc_best_predictions))
else:
    best_score = lr_model_score
    best_model = LogisticRegression(random_state=12345, solver='lbfgs')
    print('Best model is LogisticRegression with a score of', best_score)
    print('LogisticRegression Errors:', error_count(target_test, lr_model.predict(features_test)))

Best model is RandomForestClassifier with a score of 0.7810945273631841
RandomForestClassifier Errors: 43
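
The selection above boils down to taking the entry with the highest validation accuracy. A minimal sketch, using the three validation scores reported above rounded to three decimals; the dictionary name `validation_scores` is hypothetical:

```python
# Validation accuracies reported above, rounded to three decimals
validation_scores = {
    'LogisticRegression': 0.746,
    'DecisionTreeClassifier': 0.771,
    'RandomForestClassifier': 0.781,
}

# max() with key= returns the name whose score is highest
best_name = max(validation_scores, key=validation_scores.get)
print(best_name, validation_scores[best_name])
```

Adding another candidate model then only requires one more dictionary entry.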


In [14]:
print(features.shape)
print(target.shape)

# Refit the selected model on the full dataset (features, target),
# which also contains the validation and test rows
best_model.fit(features, target)
train_predictions = best_model.predict(features_train)
test_predictions = best_model.predict(features_test)

# compare each set of predictions with the answers for the same set
print('Training set:', accuracy_score(target_train, train_predictions))
print('Test set:', accuracy_score(target_test, test_predictions))

(3214, 3)
(3214,)
Training set: 0.7763485477178423
Test set: 0.7860696517412935


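The model is slightly more accurate on the test set than on the training set: 0.786 versus 0.776, a difference of about 1 percentage point. Because the final fit used the full dataset, which includes the test rows, the test score is likely optimistic.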
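
The final fit in the cell above uses `features` and `target`, that is, every row including the test rows. A minimal sketch, on synthetic data from `make_classification`, of an evaluation in which the test rows stay unseen: the model is fitted on the training and validation rows only. The hyperparameters are the ones reported by the search above; all other names and the 60/20/20 split are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data with three features
X, y = make_classification(n_samples=800, n_features=3, n_informative=3,
                           n_redundant=0, random_state=12345)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=12345)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=12345)

# Hyperparameters taken from the search reported above
final_model = RandomForestClassifier(random_state=12345, n_estimators=9, max_depth=3)

# Fit on training + validation rows; the test rows are never seen during fitting
X_fit = np.concatenate([X_train, X_valid])
y_fit = np.concatenate([y_train, y_valid])
final_model.fit(X_fit, y_fit)

test_accuracy = accuracy_score(y_test, final_model.predict(X_test))
print(test_accuracy)
```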

## Sanity Check

[Additional task: sanity check the model. This data is more complex than what you’re used to working with, so it's not an easy task. We'll take a closer look at it later.]

In [15]:
# Sanity check: compare against a dummy classifier that always predicts the most frequent class
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(features_train, target_train)
print('Sanity checking the best model DummyClassifier for comparison, score:', dummy_clf.score(features_valid, target_valid))
# test_predictions holds the best model's predictions from the previous cell
print('Best model errors on the test set:', error_count(target_test, test_predictions))

Sanity checking the best model DummyClassifier for comparison, score: 0.6998341625207297
Best model errors on the test set: 43


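Ooh, that's pretty awful. The dummy baseline scores about 0.70 on the validation set, even worse than LogisticRegression (0.746), so all three trained models beat simply predicting the most frequent plan.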
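
With `strategy="most_frequent"`, the dummy classifier's accuracy is simply the share of the majority class in the set being scored. A minimal sketch with hypothetical labels (70% class 0, 30% class 1); the arrays and their sizes are made up for illustration:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 70% class 0, 30% class 1
y_train = np.array([0] * 70 + [1] * 30)
y_valid = np.array([0] * 14 + [1] * 6)

# The dummy classifier ignores the features, so placeholders are enough
X_train = np.zeros((len(y_train), 1))
X_valid = np.zeros((len(y_valid), 1))

baseline = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
baseline_accuracy = baseline.score(X_valid, y_valid)

# Share of the majority class (class 0) in the validation labels
majority_share = np.mean(y_valid == 0)
print(baseline_accuracy, majority_share)
```

Any useful model has to score above this share.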

## Conclusion

Compared with DecisionTreeClassifier's 0.771 and LogisticRegression's 0.746, the best model is RandomForestClassifier with a validation score of 0.781, which is above the accuracy threshold of 0.75. On the test set it scores 0.786, about 1 percentage point higher than on the training set (0.776).