## Recommending a phone plan for Megaline users
This project develops a model that analyzes subscribers' behavior and recommends one of Megaline's latest phone plans: Smart or Ultra. We want a model that is at least 75% accurate. Because this is a classification task, we will test the Decision Tree, Random Forest, and Logistic Regression classifiers.

### Data Description


- `calls:` number of calls
- `minutes:` total call duration in minutes
- `messages:` number of text messages
- `mb_used:` internet traffic used in MB
- `is_ultra:` plan for the current month (Ultra - 1, Smart - 0)



### Table of Contents

1. General Information
2. Splitting into training, validation, and test sets
3. Testing Models
4. Quality Check on the Test Set
5. Sanity Check: Model vs Chance 
6. Conclusion

## General Information

Importing the necessary libraries

In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn import set_config
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score 


Loading the dataset

In [2]:
data_users = pd.read_csv("/datasets/users_behavior.csv")

data_users.head()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |


In [3]:
data_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


After checking the dataset, we see that there are no missing values in any column. However, we will convert the "calls" and "messages" columns to integers, since both are counts.

In [4]:
#Converting the values in "calls" column from float to integer
data_users["calls"] = data_users["calls"].astype("int")

#Converting the values in "messages" column from float to integer
data_users["messages"] = data_users["messages"].astype("int")

In [5]:
data_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   int64  
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   int64  
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 125.7 KB
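Casting with `astype("int")` silently truncates any fractional part, so it can be worth confirming that a column holds only whole numbers before converting it. A minimal sketch, assuming a small hypothetical DataFrame `demo` in place of `data_users`; the helper name `to_int_if_whole` is illustrative and not part of the original notebook:

```python
import pandas as pd

def to_int_if_whole(frame, columns):
    """Cast the given columns to int only if every value is a whole number."""
    result = frame.copy()
    for col in columns:
        # A value is whole when it equals its own rounding (NaN fails this test).
        if not (result[col] == result[col].round()).all():
            raise ValueError(f"column {col!r} contains fractional or missing values")
        result[col] = result[col].astype("int64")
    return result

# Hypothetical stand-in for data_users (a few values taken from the head() output).
demo = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0],
    "minutes": [311.9, 516.75, 467.66],
    "messages": [83.0, 56.0, 86.0],
})
converted = to_int_if_whole(demo, ["calls", "messages"])
print(converted.dtypes)
```

Passing a column such as "minutes" to this helper raises a `ValueError` instead of dropping the decimals.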


## Splitting into training, validation, and test sets

For this project, we need three datasets: one for training, one for validation, and one for testing the accuracy of our model. We will use 60% of the original dataset for training, 20% for validation, and 20% for testing.

In [6]:
#splitting data_users into data_users_train (60%) and a holdout set, data (40%)

data_users_train, data = train_test_split(data_users, test_size = 0.4, random_state = 12345)

In [7]:
#Getting the shape of the training and holdout datasets

print(data_users_train.shape)
print(data.shape)

(1928, 5)
(1286, 5)


In [8]:
#splitting data into data_users_valid (50%) and data_users_test (50%)

data_users_valid, data_users_test = train_test_split(data, test_size = 0.5, random_state = 12345)

#Getting the shape of the validation and testing datasets
print(data_users_valid.shape)
print(data_users_test.shape)

(643, 5)
(643, 5)
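The two-step split above can be wrapped in one helper. A minimal sketch, assuming synthetic data in place of the Megaline file; the function name `split_60_20_20` is hypothetical, and the `stratify` argument is an addition the notebook does not use (it keeps the class proportions similar across the three parts):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def split_60_20_20(frame, target_col, seed=12345):
    """Return (train, valid, test) frames in roughly 60/20/20 proportions."""
    # First cut: 60% training, 40% held out.
    train, holdout = train_test_split(
        frame, test_size=0.4, random_state=seed, stratify=frame[target_col]
    )
    # Second cut: halve the holdout into validation and test.
    valid, test = train_test_split(
        holdout, test_size=0.5, random_state=seed, stratify=holdout[target_col]
    )
    return train, valid, test

# Synthetic stand-in data with roughly 30% positives (not the Megaline data).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "x": rng.normal(size=1000),
    "is_ultra": (rng.random(1000) < 0.3).astype(int),
})
train, valid, test = split_60_20_20(demo, "is_ultra")
for name, part in [("train", train), ("valid", valid), ("test", test)]:
    print(name, part.shape, round(part["is_ultra"].mean(), 3))
```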


We need to define the features and target for each set. The features are all the columns in the dataframe except the target column. The target is the **is_ultra** column.



In [9]:
#defining the training features
train_features = data_users_train.drop('is_ultra', axis=1)

#defining the training target feature
train_target = data_users_train['is_ultra']

#defining the validation features
valid_features = data_users_valid.drop('is_ultra', axis=1)

#defining the validation target feature
valid_target = data_users_valid['is_ultra']

#defining the test features
test_features = data_users_test.drop('is_ultra', axis=1)

#defining the test target feature
test_target = data_users_test['is_ultra']

The features and target for each dataset have been defined. Now, we want to train, validate, and test the models.

## Testing Models

To determine the best model for our dataset, we will try Decision Tree, Random Forest, and Logistic Regression models. We first train each model on the training set and then evaluate it on the validation set by comparing its predictions for the validation features with the actual validation target. For each model, we will tune hyperparameters to get a higher accuracy score, which is the metric we use to choose the model to move forward with.

### Decision Tree

For the Decision Tree model, we will use the DecisionTreeClassifier class. The hyperparameters we set are random_state and max_depth. random_state is fixed at 12345, while max_depth is varied from 1 to 12; we then print the best accuracy and the depth that produced it.


In [10]:
best_model = None
best_result = 0
best_depth = 0

for x in range(1, 13):
    #training a tree with the current depth
    model = DecisionTreeClassifier(random_state = 12345, max_depth = x)
    model.fit(train_features, train_target)

    #scoring the tree on the validation set
    predictions_valid = model.predict(valid_features)
    result = accuracy_score(valid_target, predictions_valid)

    #keeping the tree with the highest validation accuracy so far
    if result > best_result:
        best_model = model
        best_result = result
        best_depth = x

print("The best accuracy is:", best_result, "with a depth of:", best_depth)

The best accuracy is: 0.7853810264385692 with a depth of: 3


The Decision Tree model reaches its best accuracy of 78.54% when max_depth is 3.
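Recording the score for every depth, rather than only the best one, makes it easier to see where deeper trees stop helping on the validation set. A minimal sketch, assuming synthetic data from `make_classification` in place of the Megaline features; the column names in `scores` are illustrative:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Megaline data).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

rows = []
for depth in range(1, 13):
    tree = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    tree.fit(X_train, y_train)
    # Store both scores so the gap between them is visible per depth.
    rows.append({
        "max_depth": depth,
        "train_accuracy": accuracy_score(y_train, tree.predict(X_train)),
        "valid_accuracy": accuracy_score(y_valid, tree.predict(X_valid)),
    })

scores = pd.DataFrame(rows)
best_row = scores.loc[scores["valid_accuracy"].idxmax()]
print(scores)
print("Best depth on validation:", int(best_row["max_depth"]))
```

A widening gap between the training and validation columns at larger depths is the usual sign of overfitting.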

### Random Forest

For the Random Forest model, we will use the RandomForestClassifier class. The hyperparameters we set are random_state, max_depth, and n_estimators. We loop through values of n_estimators and, within that loop, through values of max_depth. Each combination of the two values gives a model, and we keep the one with the highest accuracy score on the validation set.

In [11]:
best_rf_model = None
best_result_rf = 0
best_est_rf = 0
best_depth_rf = 0

for est in range(10, 71, 10): #loops through n_estimators values from 10 to 70 with a step of 10
    for depth in range(1, 13): #loops through various depths of trees from 1 to 12
        model_rf = RandomForestClassifier(random_state = 12345, max_depth = depth, n_estimators = est)
        model_rf.fit(train_features, train_target)

        predictions_valid_rf = model_rf.predict(valid_features)
        result = accuracy_score(valid_target, predictions_valid_rf)

        if result > best_result_rf:
            best_rf_model = model_rf #stores the forest, not the earlier decision tree
            best_est_rf = est
            best_depth_rf = depth
            best_result_rf = result

print("The accuracy of the best model on the validation set:", best_result_rf, "n_estimators:", best_est_rf,
      "best_depth:", best_depth_rf)

The accuracy of the best model on the validation set: 0.8087091757387247 n_estimators: 40 best_depth: 8


Our result tells us that the best random forest classifier is the one with max_depth=8 and n_estimators=40, with an accuracy score of 80.87%.
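The nested loops above can also be expressed as one loop over a parameter grid. A minimal sketch, assuming synthetic data from `make_classification` in place of the Megaline features and a smaller grid than the notebook uses; `ParameterGrid` is scikit-learn's helper for enumerating combinations, and the variable names are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic stand-in data (not the Megaline data).
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# A smaller grid than the notebook's, to keep the sketch fast.
grid = ParameterGrid({"n_estimators": [10, 20, 30], "max_depth": [2, 4, 8]})

best_score, best_params, best_forest = 0.0, None, None
for params in grid:
    forest = RandomForestClassifier(random_state=12345, **params)
    forest.fit(X_train, y_train)
    score = accuracy_score(y_valid, forest.predict(X_valid))
    if score > best_score:
        # Keep the fitted forest itself, not just its settings.
        best_score, best_params, best_forest = score, params, forest

print(best_params, round(best_score, 4))
```

Because the strict `>` comparison keeps the first combination that reaches a given score, ties are resolved in favor of whichever setting the grid yields first.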


### Logistic Regression

We will use the LogisticRegression class. The random_state parameter stays the same, and the solver is set to 'liblinear'.



In [12]:
model_lr = LogisticRegression(random_state=12345, solver='liblinear')

model_lr.fit(train_features, train_target)

valid_pred_lr = model_lr.predict(valid_features)

print("Logistic Regression Accuracy =", accuracy_score(valid_target, valid_pred_lr))


Logistic Regression Accuracy = 0.7589424572317263


The accuracy of our Logistic Regression model is 75.9%.
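The features here sit on very different scales (`mb_used` is in the tens of thousands while `calls` is in the tens), and regularized logistic regression can be sensitive to that. One variation, which the notebook does not try, is to standardize the features inside a pipeline. A minimal sketch, assuming synthetic data with one artificially inflated column; whether scaling helps on the Megaline data is not something this notebook measures:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Megaline data).
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=0, random_state=12345)
# Inflate one column to mimic a feature measured in much larger units.
X[:, 3] *= 10_000
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# The scaler is fitted on the training data only, inside the pipeline.
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=12345, solver="liblinear"),
)
scaled_lr.fit(X_train, y_train)
valid_pred = scaled_lr.predict(X_valid)
print("Validation accuracy:", accuracy_score(y_valid, valid_pred))
```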

The best model we have come up with is the Random Forest model with max_depth=8 and n_estimators=40 (accuracy score 80.9%). In second place is the Decision Tree model with max_depth=3 (accuracy score 78.5%). Last is Logistic Regression with an accuracy score of 75.9%.

## Quality Check on the Test Set

The best model will now be evaluated on our test set. First, however, we retrain it on the training and validation sets combined. To combine those sets, we use the pd.concat function, which takes a list of the sets involved as its argument; setting axis=0 stacks them vertically.

In [13]:
merged_tables = pd.concat([data_users_train, data_users_valid], axis = 0)#vertically stacks the training and validation sets
merged_tables.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 2571 entries, 3027 to 3197
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     2571 non-null   int64  
 1   minutes   2571 non-null   float64
 2   messages  2571 non-null   int64  
 3   mb_used   2571 non-null   float64
 4   is_ultra  2571 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 120.5 KB
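The index range in the output above (3027 to 3197) reflects that `pd.concat` keeps each row's original index label by default. A minimal sketch with two tiny hypothetical frames, showing the default behavior next to `ignore_index=True`:

```python
import pandas as pd

# Two tiny hypothetical frames standing in for the training and validation sets.
part_a = pd.DataFrame({"calls": [60, 33], "is_ultra": [0, 0]}, index=[3027, 434])
part_b = pd.DataFrame({"calls": [52, 42], "is_ultra": [0, 1]}, index=[1226, 1054])

# Default behavior: rows are stacked and the original index labels are kept.
stacked = pd.concat([part_a, part_b], axis=0)

# ignore_index=True replaces the labels with a fresh 0..n-1 range.
renumbered = pd.concat([part_a, part_b], axis=0, ignore_index=True)

print(stacked.index.tolist())     # [3027, 434, 1226, 1054]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Keeping the original labels is harmless for model fitting, since scikit-learn estimators use row positions rather than index labels.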


In [14]:
merged_tables.head()

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 3027 | 60 | 431.56 | 26 | 14751.26 | 0 |
| 434 | 33 | 265.17 | 59 | 17398.02 | 0 |
| 1226 | 52 | 341.83 | 68 | 15462.38 | 0 |
| 1054 | 42 | 226.18 | 21 | 13243.48 | 0 |
| 1842 | 30 | 198.42 | 0 | 8189.53 | 0 |


Defining the features and the targets

In [15]:
merged_features = merged_tables.drop('is_ultra', axis=1)
merged_target = merged_tables['is_ultra']


We train our model using the new features and target, and make predictions using the features from the test set, and get an accuracy score. 


In [16]:
model_best_rf = RandomForestClassifier(random_state=12345, max_depth=8, n_estimators=40)

model_best_rf.fit(merged_features, merged_target)

predicted_rf = model_best_rf.predict(test_features)

print(accuracy_score(test_target, predicted_rf))


0.7993779160186625


We got an accuracy score of 79.94%, which is over the 75% threshold for our project.
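With only 643 test rows, the reported accuracy carries some sampling uncertainty, so it is worth checking how comfortably it clears the 75% target. A rough sketch using a normal-approximation interval for a proportion; this interval is an illustrative choice, not something the notebook computes:

```python
import math

n_test = 643                    # rows in the test set
accuracy = 0.7993779160186625   # test accuracy reported above

# Accuracy is the share of correct predictions, so this recovers the count.
n_correct = round(accuracy * n_test)

# Normal-approximation 95% interval for a proportion (an illustrative choice).
std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)
lower = accuracy - 1.96 * std_err
upper = accuracy + 1.96 * std_err

print(n_correct, "correct out of", n_test)
print(f"approximate 95% interval: {lower:.3f} to {upper:.3f}")
```

Under this approximation the lower end of the interval stays above 0.75, which supports the claim that the model meets the threshold.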

## Sanity Check

As a sanity check, we first look at the class distribution of the target "is_ultra" in our test dataset.

In [17]:
san_check_data = data_users_test["is_ultra"].value_counts(normalize = True)

In [18]:
print(san_check_data)

0    0.684292
1    0.315708
Name: is_ultra, dtype: float64


The percentage of **Smart** clients is 68.4%, and the percentage of **Ultra** clients is 31.6%. 

Next, we analyze the class frequencies of the Random Forest predictions on the test set.

In [19]:
model_rf = RandomForestClassifier(random_state=12345, max_depth=8, n_estimators=40)

model_rf.fit(merged_features, merged_target)

predicted_test = pd.Series(model_rf.predict(test_features))

rf_class_frequency = predicted_test.value_counts(normalize = True)

print(rf_class_frequency)

0    0.785381
1    0.214619
dtype: float64


Next, we create a baseline: a constant model that predicts 0 for every observation.

In [20]:
#Creating a baseline model

target_pred_constant = pd.Series(0, index = data_users_test.index)

print(accuracy_score(test_target, target_pred_constant))

0.6842923794712286


The baseline model has an accuracy of 68.4%, which is lower than the 79.94% (~80%) of our random forest classifier. So the random forest classifier passes the sanity check.
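The same constant baseline can be built with scikit-learn's `DummyClassifier`, which learns the majority class from the training target instead of having 0 hard-coded. A minimal sketch, assuming synthetic labels with about 68% zeros in place of the Megaline target; the placeholder feature arrays exist only because `fit` and `predict` expect them:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in labels with about 68% zeros (not the Megaline data).
rng = np.random.default_rng(12345)
y_train = (rng.random(500) < 0.32).astype(int)
y_test = (rng.random(200) < 0.32).astype(int)
# DummyClassifier ignores the features, so placeholder columns are enough.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

# "most_frequent" always predicts the majority class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)

print("Baseline accuracy:", accuracy_score(y_test, baseline_pred))
```

The baseline accuracy equals the share of the majority class in the test labels, which is the number a trained model has to beat.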

## Conclusion

This project was done to develop a model that analyzes subscribers' behavior and recommends one of Megaline's latest phone plans: Smart or Ultra. The data was split into training, validation, and test datasets. After testing the models, the Random Forest classifier performed best, with an accuracy score of approximately 81% on the validation dataset and approximately 80% on the test dataset.

Also, the baseline model has an accuracy score of 68.4%, which is lower than the 79.94% (~80%) of our random forest classifier. So the random forest classifier passes the sanity check.