# Table of Contents

[Introduction](#1)

[Section 1: Open and look through the data file](#2)

[Section 2: Split data into training set, validation set, and test set](#3)

[Section 3: Investigate different models by changing hyperparameters](#4)
- [Decision Tree Models](#4.1)
- [Random Forest Models](#4.2)
- [Logistic Regression Models](#4.3)

[Section 4: Check model quality using the test set](#5)

[Section 5: Sanity check the models](#6)

[Conclusion](#7)

# Introduction <a id=1></a>

I am an analyst for **Megaline**, a mobile carrier that is concerned about the fact that many of their subscribers continue to use legacy plans. 

**Megaline** wants me to develop a model that can analyze subscribers' behavior and recommend one of **Megaline**'s newer plans: ***Smart*** or ***Ultra***.

To help me develop the model, I have access to a dataset of the behavior of the subscribers who have already switched to one of the new plans. 

The model must pick the correct plan with at least 75% accuracy. Otherwise, too many legacy subscribers could receive a poorly fitting recommendation and stop trusting **Megaline** to suggest a plan that suits them, which puts **Megaline** at risk of losing those subscribers.

I develop the model by following the steps below:

1) Opening the data file and looking through it. 

2) Splitting the source data into a training set, a validation set, and a test set.

3) Investigating the quality of different models by changing the hyperparameters.

4) Checking the quality of the two best models I found using the test set.

5) Doing a sanity check of the two best models I found.

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Section 1: Open and look through the data file <a id=2></a>

The dataset of the behavior of the subscribers who have already switched to a newer plan has the following column names:

**calls** — Number of calls made in a month

**minutes** — Total call duration, in minutes, in a month

**messages** — Number of text messages sent in a month

**mb_used** — Total internet traffic used, in MB, in a month

**is_ultra** — The subscriber's plan for the month ("1" means ***Ultra***, "0" means ***Smart***)

In [2]:
users_behavior = pd.read_csv('/datasets/users_behavior.csv')

In my opinion, it is a good idea to display a random sample of rows of the dataframe being used to get a concrete idea of what the dataframe "looks like".

In [3]:
display(users_behavior.sample(10))

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
231,64.0,431.95,24.0,28294.62,0
3165,79.0,505.5,105.0,12406.0,0
651,18.0,147.93,18.0,1904.01,1
31,10.0,77.09,31.0,643.15,0
3113,66.0,460.33,0.0,15402.88,0
2892,40.0,300.1,35.0,0.0,1
1355,72.0,495.11,34.0,15887.37,0
1297,69.0,469.46,55.0,15362.48,0
2328,42.0,328.73,61.0,23573.96,1
584,117.0,779.5,0.0,11308.65,1


The info() method displays general information about a dataframe, in particular the number of non-null values and the data type of each column.

In [4]:
users_behavior.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


It is great to see that all of the column names are in snake_case, and that there are no null values. 

The Dtypes are all acceptable. However, I think that the **calls** and **messages** column values should be changed to Dtype int64 because the number of calls and the number of messages are always whole numbers, not decimals.

Furthermore, this is a (binary) classification task because the target, recommending the ***Smart*** plan or the ***Ultra*** plan, is categorical, not numerical. 

Hence, I will change the **is_ultra** Dtype from int64 to string, the 1s to "Ultra", the 0s to "Smart", and the column name to **smart_or_ultra**. 

In [5]:
users_behavior['calls'] = users_behavior['calls'].astype(int)
users_behavior['messages'] = users_behavior['messages'].astype(int)

users_behavior['is_ultra'] = users_behavior['is_ultra'].astype(str)
users_behavior['is_ultra'] = users_behavior['is_ultra'].replace('1', 'Ultra')
users_behavior['is_ultra'] = users_behavior['is_ultra'].replace('0', 'Smart')
users_behavior = users_behavior.rename(columns = {'is_ultra': 'smart_or_ultra'})

users_behavior.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   calls           3214 non-null   int64  
 1   minutes         3214 non-null   float64
 2   messages        3214 non-null   int64  
 3   mb_used         3214 non-null   float64
 4   smart_or_ultra  3214 non-null   object 
dtypes: float64(2), int64(2), object(1)
memory usage: 125.7+ KB
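Casting a float column to an integer type silently truncates any fractional part, so it can be worth confirming first that every value really is a whole number. The following is a minimal sketch on a small made-up dataframe; the helper name `is_whole_number_column` and the sample values are assumptions for illustration and are not part of the project data.

```python
import pandas as pd

# Made-up sample values, used only to illustrate the check.
sample = pd.DataFrame({
    'calls': [40.0, 85.0, 7.0],
    'minutes': [311.90, 516.75, 43.39],
})


def is_whole_number_column(series):
    """Return True if every non-null value has no fractional part."""
    return bool((series.dropna() % 1 == 0).all())


calls_is_whole = is_whole_number_column(sample['calls'])      # True
minutes_is_whole = is_whole_number_column(sample['minutes'])  # False

# Only cast the column that passed the check.
if calls_is_whole:
    sample['calls'] = sample['calls'].astype(int)

print(calls_is_whole, minutes_is_whole)
print(sample.dtypes)
```

If the check returned False for a column such as **calls**, that would suggest a data problem worth investigating before casting.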


Before working with a dataframe, it is always a good idea to check to see if there are any rows that are exact duplicates. If so, then the duplicates should be dropped from the dataframe because they are most likely there by mistake. (After all, it is <u>highly</u> unlikely that any two distinct Megaline subscribers would have the exact same **calls**, **minutes**, **messages**, **mb_used**, and **smart_or_ultra** values.)

In [6]:
display(users_behavior[users_behavior.duplicated()])

Unnamed: 0,calls,minutes,messages,mb_used,smart_or_ultra


Thankfully, the code above shows that there are no exact duplicates, hence I won't drop any rows.
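For reference, here is a minimal sketch of what the duplicate check and the drop would look like if duplicates had been present. The toy dataframe is made up for illustration; only the column names mirror the project data.

```python
import pandas as pd

# Toy data in which the second row is an exact copy of the first.
toy = pd.DataFrame({
    'calls': [40, 40, 85],
    'minutes': [311.90, 311.90, 516.75],
    'smart_or_ultra': ['Smart', 'Smart', 'Ultra'],
})

# duplicated() flags every repeat of an earlier row (the first copy is kept).
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() removes the flagged rows; reset_index tidies the index.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print('Exact duplicates found:', n_duplicates)
print(deduplicated)
```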

# Section 2: Split data into training set, validation set, and test set <a id=3></a>

The <u>target</u> of the model is the **smart_or_ultra** value. The **calls**, **minutes**, **messages**, and **mb_used** values of the source data will all help with making predictions. Since a test set doesn't exist, I need to make one using the source data, as well as a training set and a validation set. A commonly used ratio when splitting source data into three parts is 60% training, 20% validation, and 20% test.

Of the three parts, I will first construct the test set, but I am curious to check something first...

In [7]:
display(users_behavior[users_behavior['smart_or_ultra'] == 'Smart'])

Unnamed: 0,calls,minutes,messages,mb_used,smart_or_ultra
0,40,311.90,83,19915.42,Smart
1,85,516.75,56,22696.96,Smart
2,77,467.66,86,21060.45,Smart
4,66,418.74,1,14502.75,Smart
5,58,344.56,21,15823.37,Smart
...,...,...,...,...,...
3206,76,586.51,54,14345.74,Smart
3207,17,92.39,2,4299.25,Smart
3210,25,190.36,0,3275.61,Smart
3211,97,634.44,70,13974.06,Smart


Of the 3214 subscribers in the source data, 2229 of them have the ***Smart*** plan, i.e. approximately 69%.

In [8]:
display(users_behavior[users_behavior['smart_or_ultra'] == 'Ultra'])

Unnamed: 0,calls,minutes,messages,mb_used,smart_or_ultra
3,106,745.53,81,8437.39,Ultra
6,57,431.64,20,3738.90,Ultra
8,7,43.39,3,2538.67,Ultra
10,82,560.51,20,9619.53,Ultra
14,108,587.90,0,14406.50,Ultra
...,...,...,...,...,...
3201,56,419.42,59,5177.62,Ultra
3203,53,390.39,85,30550.30,Ultra
3208,164,1016.98,71,17787.52,Ultra
3209,122,910.98,20,35124.90,Ultra


Consequently, the remaining 985 subscribers, i.e. approximately 31%, in the source data have the ***Ultra*** plan.

It would be nice if all three parts of the source data had approximately the same proportion of **smart_or_ultra** values, so I will use the **stratify** parameter of the **train_test_split** function. This keeps the class balance consistent across the three sets, so the accuracy measured on the validation and test sets reflects the same mix of plans that the model was trained on.
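To illustrate what stratification does, the sketch below builds a synthetic label column with the same class counts quoted above (2229 ***Smart***, 985 ***Ultra***) and checks that a stratified split keeps roughly the same ***Smart*** share in both parts. The `feature` column and the `random_state` value are placeholders chosen for this example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: same class counts as the source data, placeholder feature.
toy = pd.DataFrame({
    'feature': range(3214),
    'smart_or_ultra': ['Smart'] * 2229 + ['Ultra'] * 985,
})

overall_share = (toy['smart_or_ultra'] == 'Smart').mean()

# Stratify on the label so both parts keep the overall class balance.
toy_rest, toy_test = train_test_split(
    toy, test_size=0.2, stratify=toy['smart_or_ultra'], random_state=0
)

test_share = (toy_test['smart_or_ultra'] == 'Smart').mean()
rest_share = (toy_rest['smart_or_ultra'] == 'Smart').mean()

print('Smart share overall:', round(overall_share, 3))
print('Smart share in test part:', round(test_share, 3))
print('Smart share in remaining part:', round(rest_share, 3))
```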

The next line of code constructs the test dataset.

In [9]:
_, users_behavior_test = train_test_split(users_behavior, test_size = 0.2, 
                                          stratify = users_behavior['smart_or_ultra'], random_state = 20)

Now I need to make sure that none of the rows in the test set end up appearing in the training and/or validation sets.

The new dataframe in the next line of code is the complement of the **users_behavior_test** dataframe.

In [10]:
users_behavior_not_test = users_behavior.drop(users_behavior_test.index)

Now I will split the **users_behavior_not_test** dataframe into the training and validation sets.

In [11]:
users_behavior_train, users_behavior_valid = train_test_split(users_behavior_not_test, test_size = 0.25, 
                                                              stratify = users_behavior_not_test['smart_or_ultra'], 
                                                              random_state = 6020) 

Notice that I set **test_size** equal to 0.25. The reason why this is correct is as follows:

- The source data has 3214 rows, and the test set has 643, which is about 20% of 3214. 
- I want the validation set to also have 643 rows so that I get my desired ratio. 
- The validation set is coming from the **users_behavior_not_test** dataframe which has 2571 rows.
- 643 is approximately 25% of 2571.
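The arithmetic in the list above can be reproduced with a few lines of plain Python. This sketch assumes that a fractional test size is rounded up to a whole number of rows, which matches the 643-row figure quoted above; the helper name `split_sizes` is made up for this example.

```python
import math


def split_sizes(n_rows, test_fraction, valid_fraction_of_rest):
    """Return (train, valid, test) row counts for a two-step split.

    Assumption: a fractional split size is rounded up to whole rows.
    """
    n_test = math.ceil(n_rows * test_fraction)
    n_rest = n_rows - n_test
    n_valid = math.ceil(n_rest * valid_fraction_of_rest)
    n_train = n_rest - n_valid
    return n_train, n_valid, n_test


n_train, n_valid, n_test = split_sizes(3214, 0.2, 0.25)
print(n_train, n_valid, n_test)  # 1928 643 643

# Share of the source data that ends up in each set.
shares = [round(n / 3214, 3) for n in (n_train, n_valid, n_test)]
print(shares)  # [0.6, 0.2, 0.2]
```

Taking 25% of the remaining 80% gives 0.25 × 0.8 = 0.2 of the source data, which is why the second split uses 0.25 rather than 0.2.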

# Section 3: Investigate different models by changing hyperparameters <a id=4></a>

Before I can investigate different models by changing hyperparameters, I must declare the variables for the features and the target.

In [12]:
features_test = users_behavior_test.drop(['smart_or_ultra'], axis = 1)
target_test = users_behavior_test['smart_or_ultra']

features_train = users_behavior_train.drop(['smart_or_ultra'], axis = 1)
target_train = users_behavior_train['smart_or_ultra']

features_valid = users_behavior_valid.drop(['smart_or_ultra'], axis = 1)
target_valid = users_behavior_valid['smart_or_ultra']

**DECISION TREE MODELS** <a id=4.1></a>

The first model I will investigate is a decision tree. Before I change any hyperparameters, I am curious to see how accurate the default decision tree is.

In [13]:
decision_tree_model = DecisionTreeClassifier(random_state = 0)
    
decision_tree_model.fit(features_train, target_train)

decision_tree_predictions_valid = decision_tree_model.predict(features_valid)

print('Accuracy Percentage:', accuracy_score(target_valid, decision_tree_predictions_valid)*100)

Accuracy Percentage: 71.07309486780716


Not good enough! I am looking for at least 75% accuracy. 

Maybe the decision tree will be more accurate if I specify the value of the **max_depth** hyperparameter? This is the most important hyperparameter of a decision tree because it helps control for overfitting and underfitting. If the **max_depth** value is too large, then overfitting will occur, whereas if it is too small, underfitting will occur. The code below loops this hyperparameter's value from 1 to 15, and prints the values whose accuracies are at least 75%.

In [14]:
for depth in range(1, 16):
    decision_tree_model = DecisionTreeClassifier(random_state = 1, max_depth = depth)
    decision_tree_model.fit(features_train, target_train)
    decision_tree_predictions_valid = decision_tree_model.predict(features_valid)
    # Compute the validation accuracy once and reuse it for the check and the printout
    accuracy_percentage = accuracy_score(target_valid, decision_tree_predictions_valid)*100
    if accuracy_percentage >= 75:
        print('max_depth =', depth, '; ', end = '')
        print('Accuracy Percentage =', accuracy_percentage)

max_depth = 2 ; Accuracy Percentage = 76.98289269051321
max_depth = 3 ; Accuracy Percentage = 78.53810264385692
max_depth = 4 ; Accuracy Percentage = 78.53810264385692
max_depth = 5 ; Accuracy Percentage = 78.69362363919129
max_depth = 6 ; Accuracy Percentage = 78.84914463452566
max_depth = 7 ; Accuracy Percentage = 78.53810264385692
max_depth = 8 ; Accuracy Percentage = 78.0715396578538
max_depth = 9 ; Accuracy Percentage = 77.60497667185071
max_depth = 10 ; Accuracy Percentage = 78.22706065318819
max_depth = 11 ; Accuracy Percentage = 78.38258164852256
max_depth = 12 ; Accuracy Percentage = 75.42768273716952
max_depth = 13 ; Accuracy Percentage = 76.36080870917574


Interesting! The most accurate decision tree is the one with a **max_depth** value of 6, which has an accuracy of about 78.8%. 
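The overfitting effect described above can be made visible by comparing training accuracy with validation accuracy at each depth. The sketch below does this on synthetic data generated with `make_classification`, not on the Megaline data, so the numbers it prints say nothing about this project; the dataset parameters and depth values are assumptions picked for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with some label noise (flip_y).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, flip_y=0.1, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

results = {}
for depth in (1, 3, 6, None):  # None places no limit on the depth
    model = DecisionTreeClassifier(random_state=0, max_depth=depth)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    valid_acc = accuracy_score(y_valid, model.predict(X_valid))
    results[depth] = (train_acc, valid_acc)
    print('max_depth =', depth, '; train =', round(train_acc, 3),
          '; valid =', round(valid_acc, 3))

# Pick the depth with the highest validation accuracy.
best_depth = max(results, key=lambda d: results[d][1])
print('Best depth on the validation set:', best_depth)
```

A large gap between training and validation accuracy at high depths is the usual sign of overfitting, while low accuracy on both sets at small depths points to underfitting.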

While the decision tree with a **max_depth** value of 6 is accurate enough for this project, I wonder if there exists an even more accurate model?

**RANDOM FOREST MODELS** <a id=4.2></a>

Decision trees sacrifice accuracy for speed, whereas random forests sacrifice speed for accuracy. Since I desire a model that is even more accurate than 78.8%, let's see how accurate the default random forest will be, followed by seeing how accuracy changes as I change the **n_estimators** hyperparameter.

In [15]:
random_forest_model = RandomForestClassifier(random_state = 2)
    
random_forest_model.fit(features_train, target_train)

print('Accuracy Percentage:', random_forest_model.score(features_valid, target_valid)*100)

Accuracy Percentage: 80.40435458786936


From 78.8% to 80.4% is a small, but appreciated improvement. 

Now I will loop the **n_estimators** hyperparameter's value from 1 to 100. This is ***a lot*** of values, and printing all 100 results would clutter the output, so I write the loop in such a way that it only prints the **n_estimators** values whose accuracy is greater than 81% (which is 80.4% rounded up to the nearest integer). 

In [16]:
for n_est in range(1, 101):
    random_forest_model = RandomForestClassifier(random_state = 3, n_estimators = n_est)
    random_forest_model.fit(features_train, target_train)
    random_forest_model_score = random_forest_model.score(features_valid, target_valid)
    if random_forest_model_score*100 > 81:
        print('n_estimators =', n_est, '; ', end = '')
        print('Accuracy Percentage =', random_forest_model_score*100)

n_estimators = 22 ; Accuracy Percentage = 81.02643856920683
n_estimators = 38 ; Accuracy Percentage = 81.33748055987559
n_estimators = 84 ; Accuracy Percentage = 81.02643856920683
n_estimators = 86 ; Accuracy Percentage = 81.02643856920683
n_estimators = 88 ; Accuracy Percentage = 81.18195956454122
n_estimators = 89 ; Accuracy Percentage = 81.18195956454122
n_estimators = 90 ; Accuracy Percentage = 81.02643856920683
n_estimators = 94 ; Accuracy Percentage = 81.02643856920683
n_estimators = 95 ; Accuracy Percentage = 81.18195956454122


Oddly enough, the second smallest **n_estimators** value in the above list, 38, is the one that is most accurate, about 81.3%. 

I find it interesting how the **n_estimators** values of 22, 84, 86, 90, and 94 all have the same accuracy, as do the **n_estimators** values of 88, 89, and 95. 

22 is a ***much*** smaller number than 84, 86, 90, and 94, so I find it especially odd that 22 yielded the same accuracy as the other four numbers.
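One likely explanation for the repeated accuracy values is the size of the validation set. With 643 rows, accuracy can only move in steps of 1/643 (about 0.16 percentage points), so two forests that classify the same number of rows correctly report exactly the same accuracy, even if they get different rows right. The sketch below converts three of the printed accuracies back into counts of correct predictions, assuming the 643-row validation set described in Section 2.

```python
n_valid = 643  # validation set size from Section 2

# Accuracy percentages copied from the output above, keyed by n_estimators.
reported = {
    22: 81.02643856920683,
    38: 81.33748055987559,
    88: 81.18195956454122,
}

# Convert each percentage back into a number of correctly classified rows.
correct_counts = {n_est: round(acc / 100 * n_valid)
                  for n_est, acc in reported.items()}
print(correct_counts)  # {22: 521, 38: 523, 88: 522}

# Smallest possible change in accuracy, in percentage points.
step = 100 / n_valid
print(round(step, 4))  # 0.1555
```

Seen this way, the best forest (38 trees) gets only two more rows right than the forest with 22 trees, so the differences between these settings are very small.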

**LOGISTIC REGRESSION MODELS** <a id=4.3></a>

Now I wonder how accurate a logistic regression model is for this project's source data. Though logistic regression models tend to be less accurate than random forest models, they tend to be much faster than random forests, and they also tend to be more accurate than decision tree models. 

The first value for the **solver** hyperparameter I learned about is "liblinear", so I will test that one first.

In [17]:
logistic_model = LogisticRegression(random_state = 4, solver = "liblinear")

logistic_model.fit(features_train, target_train)

logistic_model_score_train = logistic_model.score(features_train, target_train)
logistic_model_score_valid = logistic_model.score(features_valid, target_valid)

print("Accuracy of the logistic regression model on the training set (%):", logistic_model_score_train*100)
print("Accuracy of the logistic regression model on the validation set (%):", logistic_model_score_valid*100)

Accuracy of the logistic regression model on the training set (%): 70.33195020746888
Accuracy of the logistic regression model on the validation set (%): 70.45101088646967


This result is actually (slightly) worse than that of the initial decision tree model. 

Generally speaking, setting **solver** equal to "liblinear" is a good choice if the dataset is small, but I am not sure if this project's dataset should be classified as small or not, so the next cell loops through five **solver** values to see if changing this hyperparameter makes any significant difference to the accuracy of this logistic regression model.

In [18]:
for method in ["liblinear", "lbfgs", "newton-cg", "sag", "saga"]:
    logistic_model = LogisticRegression(random_state = 5, solver = method)
    logistic_model.fit(features_train, target_train)
    logistic_model_score_train = logistic_model.score(features_train, target_train)
    logistic_model_score_valid = logistic_model.score(features_valid, target_valid)
    print('solver =', method)
    print('Training Set Accuracy (%) =', logistic_model_score_train*100)
    print('Validation Set Accuracy (%) =', logistic_model_score_valid*100)
    print()

solver = liblinear
Training Set Accuracy (%) = 70.33195020746888
Validation Set Accuracy (%) = 70.45101088646967

solver = lbfgs
Training Set Accuracy (%) = 74.84439834024896
Validation Set Accuracy (%) = 74.65007776049767

solver = newton-cg
Training Set Accuracy (%) = 74.84439834024896
Validation Set Accuracy (%) = 74.65007776049767

solver = sag
Training Set Accuracy (%) = 69.34647302904564
Validation Set Accuracy (%) = 69.36236391912908

solver = saga
Training Set Accuracy (%) = 69.34647302904564
Validation Set Accuracy (%) = 69.36236391912908





Curiously, none of the five **solver** values gave me a model that is at least 75% accurate! The values "lbfgs" and "newton-cg", which for one reason or another have the same accuracy, both come ***very*** close (about 74.7% on the validation set), but very close is not good enough for this project. Furthermore, these two values are the only ones whose accuracy is better than that of the default decision tree, which I find counterintuitive.

I could keep fiddling with the multiple different parameters of the logistic regression model, but I see no good reason to do so since I found a decision tree model and a random forest model that are more than 75% accurate.
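If I did want to keep tuning, one thing worth trying would be feature scaling. The "sag" and "saga" solvers in particular are known to converge quickly only when the features are on roughly the same scale, and here **mb_used** is in the tens of thousands while **messages** is in the tens. The sketch below shows how a scaler could be combined with each solver. It runs on synthetic data with deliberately mismatched feature scales, not on the Megaline data, and the `max_iter` value is an assumption, so it makes no claim about what accuracy this project would reach.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; the columns are rescaled to very different magnitudes
# to imitate features such as messages versus mb_used.
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, random_state=0)
X = X * np.array([1, 10, 100, 10000])
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

solvers = ["liblinear", "lbfgs", "newton-cg", "sag", "saga"]
scaled_scores = {}
for method in solvers:
    # The scaler is fitted on the training data only, inside the pipeline.
    pipeline = make_pipeline(
        StandardScaler(),
        LogisticRegression(random_state=5, solver=method, max_iter=1000),
    )
    pipeline.fit(X_train, y_train)
    scaled_scores[method] = pipeline.score(X_valid, y_valid)
    print('solver =', method, '; validation accuracy =',
          round(scaled_scores[method] * 100, 2))
```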

# Section 4: Check model quality using the test set <a id=5></a>

I must now choose between the decision tree model and the random forest model. I think the random forest is the better choice because it is more accurate without being overly computationally expensive. Let's now evaluate it using the test set!

In [19]:
best_random_forest_model = RandomForestClassifier(random_state = 7, n_estimators = 38)

best_random_forest_model.fit(features_train, target_train)

best_random_forest_model_score = best_random_forest_model.score(features_test, target_test)

print('Accuracy Percentage =', best_random_forest_model_score*100)

Accuracy Percentage = 82.11508553654744


Excellent! As expected, this model is more than 75% accurate (just like what happened with the validation set)!

# Section 5: Sanity check the models <a id=6></a>

I observed earlier that about 69% of the subscribers in the source data have the ***Smart*** plan. Hence, an appropriate dummy model to use to sanity check the two best models (the decision tree and the random forest) is one that always predicts the ***Smart*** plan. Such a model would be correct about 69% of the time, which is not only below the 75% minimum acceptable accuracy, it is also clearly below the roughly 79% validation accuracy of the best decision tree and the roughly 81% validation accuracy (82% test accuracy) of the best random forest. I can safely declare this sanity check successful!
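The dummy model described above can also be expressed in code using scikit-learn's `DummyClassifier` with the `most_frequent` strategy. The sketch below uses a made-up set of 100 labels with a 69/31 split to mirror the class balance in the source data; the feature values are placeholders because this kind of baseline ignores them.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels with the same 69% / 31% balance as the source data.
labels = np.array(['Smart'] * 69 + ['Ultra'] * 31)
# Placeholder features; a most-frequent baseline never looks at them.
features = np.zeros((100, 1))

dummy_model = DummyClassifier(strategy='most_frequent')
dummy_model.fit(features, labels)

dummy_predictions = dummy_model.predict(features)
dummy_accuracy = dummy_model.score(features, labels)

print('Predicted classes:', set(dummy_predictions))  # {'Smart'}
print('Accuracy Percentage =', dummy_accuracy * 100)
```

In the project itself, the same baseline could be fitted on the training set and scored on the test set, and its accuracy compared with that of the chosen model.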

# Conclusion <a id=7></a>

Thanks to splitting the source data into three sets (training, validation, and test) and using these sets to investigate three different machine learning models (decision tree, random forest, and logistic regression), I successfully constructed two models that can predict with more than 75% accuracy which new(er) **Megaline** plan, ***Smart*** or ***Ultra***, is a better recommendation for subscribers who are currently using a legacy plan. 

For each of the three different models, I tried out multiple different values for a major hyperparameter in the hopes of finding an accurate enough model of each type. Unfortunately, I was unable to construct a logistic regression model that reaches 75% accuracy, but that honestly is not an issue because my testing throughout this project demonstrates that the decision tree and random forest models are suitable to help **Megaline** achieve its goal. The optimal **max_depth** value for the decision tree model is 6, and the optimal **n_estimators** value for the random forest is 38. 

I recommend that **Megaline** stakeholders choose the random forest model because, though it takes more time to run than the decision tree model, it appears to be about 1 to 3 percentage points more accurate. Though this might not sound like much, in the long run the random forest model would likely lead more subscribers to switch from a legacy plan to a newer plan (thanks to receiving a fitting recommendation) than the decision tree model would, which in turn could yield noticeably more revenue.