# Machine Learning for Recommending a Phone Plan

## Open and Look Through Data File

In [1]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier 
from sklearn.linear_model import LogisticRegression 
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split 
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")

In [2]:
df = pd.read_csv('/datasets/users_behavior.csv')

df.head()

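|   | calls | minutes | messages | mb_used | is_ultra |
|---|-------|---------|----------|---------|----------|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |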


In [3]:
df.info()
df.duplicated().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
calls       3214 non-null float64
minutes     3214 non-null float64
messages    3214 non-null float64
mb_used     3214 non-null float64
is_ultra    3214 non-null int64
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


0

The dataset users_behavior has 3214 entries and 5 columns: calls, minutes, messages, mb_used, and is_ultra. The columns calls, minutes, messages, and mb_used are floats, and the column is_ultra is an integer. There are no missing or duplicate values in the dataset.

## Split the Data into a Training Set, a Validation Set, and a Test Set

In [16]:
#Split data
df_train,df_valid = train_test_split(df, test_size = 0.20, random_state = 123)
df_train,df_test = train_test_split(df_train, test_size = 0.25, random_state = 123)
#Check that the data is split correctly
display(len(df_train))
display(len(df_valid))
display(len(df_test))

1928

643

643

We split the data into a training set, a validation set, and a test set. The training set holds 60% of the data (1928 rows), and the validation and test sets each hold 20% (643 rows each).
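The 60/20/20 proportions follow from chaining the two `test_size` values: 25% of the remaining 80% is 20% of the original data. Here is a minimal sketch of that arithmetic, assuming the held-out part of each split is rounded up to a whole row (an assumption that reproduces the counts printed above). `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_rows, first_holdout=0.20, second_holdout=0.25):
    """Return (train, validation, test) row counts for a two-step split."""
    # Step 1: hold out the validation set from the full data.
    n_valid = math.ceil(n_rows * first_holdout)
    remaining = n_rows - n_valid
    # Step 2: hold out the test set from what is left.
    n_test = math.ceil(remaining * second_holdout)
    n_train = remaining - n_test
    return n_train, n_valid, n_test

print(split_sizes(3214))  # (1928, 643, 643)
```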

## Investigate the Quality of Different Models by Changing Hyperparameters

In [5]:
#Create variables for features and target features
features_train = df_train.drop('is_ultra', axis=1)
target_train = df_train['is_ultra']
features_test = df_test.drop('is_ultra', axis=1)
target_test = df_test['is_ultra']
features_valid = df_valid.drop('is_ultra', axis=1)
target_valid = df_valid['is_ultra']

### Decision Tree Model

In [6]:
#Find highest accuracy by comparing max depth for decision tree model
best_accuracy = 0
best_depth = 0
for depth in range(1,100):
    decision_tree_model = DecisionTreeClassifier(random_state=123, max_depth = depth)
    decision_tree_model.fit(features_train,target_train)
    decision_tree_predictions_valid = decision_tree_model.predict(features_valid)
    accuracy = accuracy_score(target_valid, decision_tree_predictions_valid)
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_depth = depth
print('Max Depth =', best_depth, 'Accuracy:', best_accuracy)

Max Depth = 3 Accuracy: 0.8055987558320373


### Random Forest Model

In [11]:
#Find highest accuracy by comparing n_estimators for random forest model.
best_accuracy = 0
best_n_estimator = 0
for n in range(1,50):
    random_forest_model = RandomForestClassifier(random_state=123, n_estimators = n)
    random_forest_model.fit(features_train,target_train)
    random_forest_predictions_valid = random_forest_model.predict(features_valid)
    accuracy = accuracy_score(target_valid, random_forest_predictions_valid)
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_n_estimator = n
print('n_estimator =', best_n_estimator, 'Accuracy:', best_accuracy)

n_estimator = 32 Accuracy: 0.807153965785381


### Logistic Regression Model

In [8]:
#Find accuracy of logistic regression model
logistic_regression_model = LogisticRegression(random_state=123, solver='liblinear') 
logistic_regression_model.fit(features_train,target_train)
logistic_regression_predictions_valid = logistic_regression_model.predict(features_valid)
display(accuracy_score(target_valid,logistic_regression_predictions_valid))

0.702954898911353

The best model on the validation set is the random forest with n_estimators = 32 (accuracy 0.807). The decision tree is very close behind at a max_depth of 3 (accuracy 0.806). The logistic regression model has the lowest accuracy at about 70%.
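The comparison can be made explicit by ranking the validation accuracies printed above. This is only an illustrative sketch: the dictionary keys are hypothetical labels, and the scores are copied from the cell outputs rather than recomputed.

```python
# Validation accuracies copied from the outputs above.
validation_accuracy = {
    "decision_tree (max_depth=3)": 0.8055987558320373,
    "random_forest (n_estimators=32)": 0.807153965785381,
    "logistic_regression": 0.702954898911353,
}

# Pick the model with the highest validation accuracy.
best_model = max(validation_accuracy, key=validation_accuracy.get)

# Print the models from best to worst.
ranked = sorted(validation_accuracy.items(), key=lambda kv: kv[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.4f}")
print("Best on validation:", best_model)
```

The gap between the top two models is about 0.0016, which on 643 validation rows corresponds to a single prediction, so the ranking between them should be read with caution.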

## Check the Quality of the Model Using the Test Set

In [15]:
random_forest_model = RandomForestClassifier(random_state=123, n_estimators = 32)
random_forest_model.fit(features_train,target_train)
test_prediction = random_forest_model.predict(features_test)
accuracy = accuracy_score(target_test,test_prediction)
print('Accuracy:', accuracy)

Accuracy: 0.7962674961119751


The accuracy of the random forest model on the test set is 79.63%, which passes the accuracy threshold of 75%.

## Sanity Check the Model

In [13]:
#Create a dummy model to sanity check the model
model_dummy = DummyClassifier(strategy = 'most_frequent', random_state=123)
model_dummy.fit(features_train, target_train)
display(model_dummy.score(features_test, target_test))

0.6842923794712286

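To sanity check our model we used a dummy estimator that always predicts the most frequent class. The dummy model reached 68.43% accuracy on the test set, so the random forest (79.63%) beats this baseline by about 11 percentage points.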
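With `strategy='most_frequent'`, the dummy estimator always predicts the most common class from the training labels, so its accuracy equals the share of that class in the evaluated set. The sketch below reproduces that idea in plain Python. The function name and the toy labels are hypothetical and are not taken from the dataset. Only the two accuracies at the end come from the outputs above.

```python
from collections import Counter

def most_frequent_baseline_accuracy(train_labels, eval_labels):
    """Accuracy of always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in eval_labels if label == majority_label)
    return hits / len(eval_labels)

# Hypothetical toy labels (0 = not Ultra, 1 = Ultra), not the real dataset.
toy_train = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
toy_eval = [0, 1, 0, 0, 1]
print(most_frequent_baseline_accuracy(toy_train, toy_eval))  # 0.6

# Margin of the reported test accuracy over the reported dummy score.
margin = 0.7962674961119751 - 0.6842923794712286
print(f"Margin over baseline: {margin:.3f}")  # Margin over baseline: 0.112
```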

## Conclusion

We obtained a test accuracy of 79.63%. To get there we compared multiple models with different hyperparameters on the validation set. The best model was the random forest with 32 estimators (validation accuracy 0.807), narrowly ahead of the decision tree with a max depth of 3 (0.806) and well ahead of logistic regression (0.703). We sanity checked the model against a dummy estimator, which scored 68.43% on the test set. The random forest clears the 75% accuracy threshold and beats the baseline by about 11 percentage points, so it is a reasonable model for recommending a plan.