# Machine learning model for user behavior analysis

## Project information

We're back at Megaline. They have launched two new plans, but many of the users are still sticking with the old ones. Megaline thinks that users might want to subscribe to the new plans if they were given a personalized recommendation.

Our goal is to create a machine-learning model that can analyze user behavior and recommend a package tailored to their needs.

**Objectives:**
1. Create an ML model for this task.
<br>
To ensure that Megaline gets a model that works as intended and performs as well as possible, we will take the following steps:
1. Tune the model hyperparameters, compare the results, and pick the best settings.
1. Test the model.
1. Run a sanity check on the model.

## Dataset description

Luckily, we can use the data that we processed in our previous project. This eliminates the need to preprocess the data, except for the ML-related standardization.

`users_behavior.csv` contains the following columns:
- `calls`: number of calls made
- `minutes`: total duration of all calls, in minutes
- `messages`: number of SMS messages sent
- `mb_used`: Internet usage, in megabytes
- `is_ultra`: plan subscribed by the user (`1`: Ultra, `0`: Smart)

## Loading libraries

For this project, we'll compare the performance of different classification models (as opposed to regression models) because we need to classify users into 2 plans.

In [1]:
# for dataframe manipulation
import pandas as pd

# ML model libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# sklearn tools
from sklearn.preprocessing import StandardScaler # to standardize data, increases models' learning performance
from sklearn.model_selection import train_test_split # to split datasets into training and testing sets
from sklearn.metrics import accuracy_score # to calculate the model's accuracy score

## Loading dataset

In [2]:
df = pd.read_csv(r'datasets\users_behavior.csv')

# Checking the dataset
print(df.info())
df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB
None


| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.90 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| ... | ... | ... | ... | ... | ... |
| 3209 | 122.0 | 910.98 | 20.0 | 35124.90 | 1 |
| 3210 | 25.0 | 190.36 | 0.0 | 3275.61 | 0 |
| 3211 | 97.0 | 634.44 | 70.0 | 13974.06 | 0 |
| 3212 | 64.0 | 462.32 | 90.0 | 31239.78 | 0 |


Although the binary values in `is_ultra` represent boolean True/False, we will keep the data type as `int64` because the classifiers we use work directly with integer class labels and return their predictions in the same form.

In [3]:
# Checking target proportions
df['is_ultra'].value_counts()

0    2229
1     985
Name: is_ultra, dtype: int64

The dataset is imbalanced: Smart plan users (`is_ultra = 0`) account for ~69% of the data. A model that always predicts Smart would therefore already be right ~69% of the time, so our baseline has to sit above that figure to show that a model has learned something beyond the majority class. We'll set the baseline accuracy to **75%**.
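The majority-class figure can be reproduced with a few lines of arithmetic. This is only a sketch that hard-codes the class counts printed by `value_counts()` above; the variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above
counts = {0: 2229, 1: 985}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a "model" that always predicts the majority class
majority_share = counts[majority_class] / total

print(f"Majority class: {majority_class}")
print(f"Accuracy of always predicting it: {majority_share:.1%}")
```

Any model worth keeping has to beat this figure, which is why the 75% baseline is set above it.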

### Splitting

Creating an ML model involves three steps: training, validation, and testing. We only have one dataset for all three steps, so we need to split it for use in each stage.

Training requires more data than the latter two stages. Dividing the dataset with a 3:1:1 ratio (60% for training, 20% each for validation and testing) should be able to provide sufficient data for every part. Because `train_test_split` can only split a dataset into two, we'll need to execute this function twice.

We will set the `random_state` hyperparameter to an arbitrary value of `12345` throughout the project. This will ensure that we get consistent results every time.
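To see that two consecutive calls really produce a 60/20/20 split, the sketch below applies the same two-step procedure to a stand-in array of 3214 row indices (an assumption made so the example runs without the CSV file). The names `rows`, `rest_rows`, and `sizes` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 3214 rows of the real dataset (row indices only)
rows = np.arange(3214)

# First split: 60% for training, the remaining 40% held out
train_rows, rest_rows = train_test_split(rows, train_size=0.6, random_state=12345)

# Second split: halve the held-out part into validation and test
val_rows, test_rows = train_test_split(rest_rows, test_size=0.5, random_state=12345)

sizes = {"train": len(train_rows), "validation": len(val_rows), "test": len(test_rows)}
for name, n in sizes.items():
    print(f"{name}: {n} rows ({n / len(rows):.1%})")
```

Splitting the held-out 40% in half is what yields 20% of the original data for each of the validation and test sets.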

In [5]:
# Splitting the dataset into train_df (60%) and the rest (20% + 20%)
train_df, df2 = train_test_split(df, train_size=0.6, random_state=12345)

# Splitting df2 into validation and test sets
val_df, test_df = train_test_split(df2, test_size=0.5, random_state=12345)

In [6]:
# Checking each set
print('train_df:')
print(train_df.info())
print()

print('val_df:')
print(val_df.info())
print()

print('test_df:')
print(test_df.info())

train_df:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1928 entries, 3027 to 482
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     1928 non-null   float64
 1   minutes   1928 non-null   float64
 2   messages  1928 non-null   float64
 3   mb_used   1928 non-null   float64
 4   is_ultra  1928 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 90.4 KB
None

val_df:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 643 entries, 1386 to 3197
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     643 non-null    float64
 1   minutes   643 non-null    float64
 2   messages  643 non-null    float64
 3   mb_used   643 non-null    float64
 4   is_ultra  643 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 30.1 KB
None

test_df:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 643 entries, 160 to 2313
Data columns (total 5 

### Defining targets and features

In accordance with project goals, `is_ultra` will be the target for our model and other columns will be the features. Target and features need to be separated from each of the three sets.

In [7]:
# train_df
train_features = train_df.drop('is_ultra', axis=1) # Excluding is_ultra from the set
train_target = train_df['is_ultra']

# val_df
val_features = val_df.drop('is_ultra', axis=1)
val_target = val_df['is_ultra']

# test_df
test_features = test_df.drop('is_ultra', axis=1)
test_target = test_df['is_ultra']

### Scaling: standardization

The performance of regression models, including logistic regression, is affected by differences in feature scale, especially when features are measured in different units. Simply put, such models may treat features with larger numbers as more significant than those with smaller values. Scaling helps these models train efficiently by converting all features to a uniform scale. To anticipate outliers in the current and future data, we will use standardization, a commonly used scaling method that is generally less sensitive to outliers than min-max scaling.

It should be noted that tree-based models split on feature thresholds rather than weighted sums, so scaling has little to no effect on them.
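As a rough illustration of what standardization does, the sketch below computes z = (x - mean) / std by hand on a hypothetical toy array and checks it against `StandardScaler`. The data and variable names are made up for the example; the point is that the mean and standard deviation come from the training data only and are then reused for new data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
train = np.array([[40.0, 19915.0], [85.0, 22697.0], [77.0, 21060.0], [106.0, 8437.0]])
new = np.array([[66.0, 14503.0]])

# Manual standardization using statistics from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), matching StandardScaler
manual_train = (train - mean) / std
manual_new = (new - mean) / std

# StandardScaler should give the same numbers
scaler = StandardScaler().fit(train)
print(np.allclose(manual_train, scaler.transform(train)))
print(np.allclose(manual_new, scaler.transform(new)))
```

After the transformation, each training feature has a mean of 0 and a standard deviation of 1, regardless of its original unit.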

In [8]:
# Creating an instance of StandardScaler
standard_scaler = StandardScaler()

# Fitting & transforming training data
train_features = standard_scaler.fit_transform(X=train_features.values) # .values passes a plain array without column names, which prevents feature-name warnings

# Transforming validation & test sets
val_features = standard_scaler.transform(X=val_features.values)
test_features = standard_scaler.transform(X=test_features.values)

## Model training & validation

Next, we will train each of the three classification models and evaluate how well they predict the validation set. The models will be compared by their validation scores rather than their training scores, because training scores tend to keep rising as a model grows more complex, even when it is overfitting.

We need the model to predict both classes with as few errors as possible, so the metric will be accuracy: the share of users for whom the model recommends the correct plan.

We will only take models with a **minimum of 75% accuracy**, in accordance with the baseline score mentioned above.
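As a small illustration of the accuracy metric described above, the following sketch counts correct predictions by hand and compares the result with `accuracy_score`. The labels are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels for eight users (1: Ultra, 0: Smart)
y_true = [0, 0, 1, 0, 1, 1, 0, 0]
y_pred = [0, 0, 1, 1, 0, 1, 0, 0]

# Accuracy = number of correct predictions / number of predictions
correct = sum(t == p for t, p in zip(y_true, y_pred))
manual_accuracy = correct / len(y_true)

print(manual_accuracy, accuracy_score(y_true, y_pred))
```

Here six of the eight predictions match, so this toy example lands exactly on the 75% threshold.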

### Decision tree

The performance of this model varies by tree depth. This means that we have to keep the tree deep enough to produce the best results, but not excessively deep to prevent overfitting and wasting resources. To achieve this, we'll train and validate the model 10 times with increasing depth and pick the one with the best scores.

In [9]:
# Defining variables to store scores and models in
tree_train_best_score = 0
tree_best_score = 0
tree_best_depth = 0

for depth in range(1, 11):
    # Creating & training models with different depths
    tree_model = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    tree_model.fit(train_features, train_target)
    
    # Getting training scores
    train_pred = tree_model.predict(train_features)
    tree_train_score = accuracy_score(train_target, train_pred)
    
    # Validation and obtaining validation metric scores
    valid_pred = tree_model.predict(val_features)
    val_acc_score = accuracy_score(val_target, valid_pred)
    
    # Storing the best depth and scores
    if val_acc_score > tree_best_score:
        tree_train_best_score = tree_train_score
        tree_best_score = val_acc_score
        tree_best_depth = depth
    
print('Best depth:', tree_best_depth, 'training accuracy:', tree_train_best_score, 'validation accuracy:', tree_best_score)

Best depth: 3 training accuracy: 0.8075726141078838 validation accuracy: 0.7838258164852255


The model hit peak validation accuracy at `depth = 3`, with a score (**~78.4%**) only ~2.4 percentage points lower than the training accuracy. This score satisfies our baseline requirement. We will use this depth as the hyperparameter for our decision tree model.

### Random forest

Next, we'll use the power of more trees (technically, estimators) at once. Because the model is composed of several decision trees, its accuracy varies with both its `max_depth` and the number of its trees (`n_estimators`). `max_depth` will range from 1 to 10 and `n_estimators` from 10 to 100 in steps of 10.
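To get a feel for how many models this search trains, the sketch below enumerates the same combinations with scikit-learn's `ParameterGrid`. This is an optional alternative to nested loops, not the approach used in this notebook, and the name `grid` is illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same ranges as described above: depths 1..10, estimators 10..100 in steps of 10
grid = ParameterGrid({
    "max_depth": list(range(1, 11)),
    "n_estimators": list(range(10, 101, 10)),
})

print("Number of combinations:", len(grid))
for params in list(grid)[:3]:
    print(params)  # each entry is a dict of keyword arguments for the classifier
```

Each dictionary could be unpacked into `RandomForestClassifier(**params, random_state=12345)`, which keeps the search space definition in one place.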

In [10]:
# Defining variables to store scores and models in
forest_best_training_score = 0
forest_best_score = 0
forest_best_model = None

for depth in range(1, 11):
    for estimator in range(10, 101, 10): # Setting the range of estimators with an increase of 10 estimators per iteration
        
        # Creating & training the model with different max_depth and n_estimators
        forest_model = RandomForestClassifier(random_state=12345, max_depth=depth, n_estimators=estimator)
        forest_model.fit(train_features, train_target)
        
        # Getting training scores
        train_pred = forest_model.predict(train_features)
        forest_train_score = accuracy_score(train_target, train_pred)
        
        # Validating the model
        val_pred = forest_model.predict(val_features)
        val_acc_score = accuracy_score(val_target, val_pred)
        
        # Storing the best score and model
        if val_acc_score > forest_best_score:
            forest_best_training_score = forest_train_score
            forest_best_score = val_acc_score
            forest_best_model = forest_model
                 
print('Best training accuracy:', forest_best_training_score)
print('Best validation accuracy:', forest_best_score)
forest_best_model

Best training accuracy: 0.875
Best validation accuracy: 0.807153965785381


Our forest model with 40 estimators at `max_depth=8` achieved the best validation accuracy of **~80.7%**, which clears our baseline score.

### Logistic regression

Another way to classify data is to use the logistic regression model. This model is different from the previous two in that it doesn't have a `max_depth` parameter and is affected by scaling done previously. 

We will compare five of the solvers provided by scikit-learn. The `sag` and `saga` solvers can need many iterations to converge, so we will raise the `max_iter` hyperparameter to `3500` for these two and keep the default value of `100` for the rest.
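One way to check whether an iteration cap is generous enough is to inspect the fitted model's `n_iter_` attribute. The sketch below does this on synthetic data from `make_classification` (an assumption, so the example runs without the project's CSV file); the numbers it prints say nothing about the Megaline data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with four features, standardized like the real features
X, y = make_classification(n_samples=500, n_features=4, random_state=12345)
X = StandardScaler().fit_transform(X)

max_iter = 3500
model = LogisticRegression(solver="saga", max_iter=max_iter, random_state=12345)
model.fit(X, y)

# n_iter_ holds the number of iterations the solver actually ran
used = int(model.n_iter_[0])
print(f"iterations used: {used} of {max_iter}")
print("converged" if used < max_iter else "hit the iteration cap")
```

If the solver stops well short of the cap, a lower `max_iter` would give the same fit; if it hits the cap, scikit-learn also emits a `ConvergenceWarning`.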

In [11]:
solver_list = ['liblinear', 'newton-cg', 'lbfgs', 'sag', 'saga']

for solver in solver_list:
    # Creating & training logistic regression models, changing max_iter as needed
    if solver == 'sag' or solver == 'saga':
        logreg_model = LogisticRegression(random_state=12345, solver=solver, max_iter=3500)
    else:
        logreg_model = LogisticRegression(random_state=12345, solver=solver)
    logreg_model.fit(train_features, train_target)
    
    # Getting training accuracy scores
    train_pred = logreg_model.predict(train_features)
    train_acc_score = accuracy_score(train_target, train_pred)
    print(solver, 'training accuracy:', train_acc_score)

    # Validating model & getting accuracy
    val_pred = logreg_model.predict(val_features)
    val_acc_score = accuracy_score(val_target, val_pred)
    print(solver, 'validation accuracy:', val_acc_score)
    print()

liblinear training accuracy: 0.7531120331950207
liblinear validation accuracy: 0.7558320373250389

newton-cg training accuracy: 0.7531120331950207
newton-cg validation accuracy: 0.7558320373250389

lbfgs training accuracy: 0.7531120331950207
lbfgs validation accuracy: 0.7558320373250389

sag training accuracy: 0.7531120331950207
sag validation accuracy: 0.7558320373250389

saga training accuracy: 0.7531120331950207
saga validation accuracy: 0.7558320373250389



All five solvers returned the same scores, with a validation accuracy of **~75.6%**. That barely clears our 75% baseline, and the training accuracy (~75.3%) is just as low, which suggests that the model is underfitting. This might change if we had a bigger dataset to train and validate with, but for now, logistic regression doesn't seem to be the best choice for this job.

### Model training and validation results

We have trained and validated three models with varying accuracy. Because it is important that the model makes correct recommendations, we will use the best-performing of the three: the **random forest classifier** with `max_depth = 8, n_estimators = 40, random_state=12345`, which reached a validation accuracy of **~80.7%**, about 5.7 percentage points above our baseline score.

## Quality check with test set & sanity check

We need to verify the quality of the model using our last set, the test set. If the model keeps its accuracy above our 75% baseline score, we can proceed.

In [12]:
# Reusing the best random forest model found during validation (already fitted)
forest_final = forest_best_model

# Making predictions on the test set
test_pred = forest_final.predict(test_features)

# Getting the test score
test_acc = accuracy_score(test_target, test_pred)

print('Test accuracy:', test_acc)

Test accuracy: 0.7962674961119751


Our random forest held up on the test set: its accuracy dropped by only ~1.1 percentage points from the validation accuracy.

As mentioned above, the dataset is imbalanced. We set our baseline score to 75%, above the ~69% that a model would reach by always predicting `is_ultra = 0`, to ensure that the model didn't simply learn to favor the majority class or reach a high accuracy by chance. With a score of ~79.6% on the test set, our random forest model passes the sanity check.
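A common way to make this sanity check explicit is to score a constant model with scikit-learn's `DummyClassifier`. The sketch below is not part of the original notebook: it uses stand-in targets with the class counts of the full dataset, whereas in the notebook one would fit on `train_features`/`train_target` and score on `test_features`/`test_target`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Stand-in targets with the same class counts as the full dataset
target = np.array([0] * 2229 + [1] * 985)
features = np.zeros((len(target), 1))  # placeholder; the dummy model ignores features

# Always predicts the most frequent class seen during fit
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(features, target)

dummy_acc = accuracy_score(target, dummy.predict(features))
print(f"Constant-model accuracy: {dummy_acc:.3f}")
```

A trained model passes the sanity check when its test accuracy is clearly above this constant-model score.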

## Conclusion

We were given a dataset containing 3214 rows of users' plan usage data and their plans of choice. We noted several things about the dataset:
- `is_ultra = 0` (~69%) dominated the target variable, making the dataset imbalanced.
- the data are measured in different units and numerical ranges.

The data were split into three parts: training (60%), validation, and test sets (20% each). Because the features span very different numerical ranges, we standardized them with sklearn's `StandardScaler` to make them more suitable for the logistic regression models. Tree-based models are generally not affected by this change.

The goal was to create a machine-learning model that can classify users into the better of the two plan options, so we created three classification models with varying, manually tuned hyperparameters and compared their training and validation performance. The performance was measured with the accuracy metric because we needed to know how often the model provides the correct prediction. In line with the dataset's imbalance, we set our **baseline accuracy score to 75%** to ensure that the model wasn't simply guessing the majority class.

The three models, their best hyperparameters, and their respective scores were:
1. Decision tree classifier (`max_depth = 3, random_state = 12345`)<br>
Validation accuracy: **~78.4%**
1. Random forest classifier (`max_depth = 8, n_estimators = 40, random_state = 12345`)<br>
Validation accuracy: **~80.7%**
1. Logistic regression classifier (any solver, `random_state = 12345`)<br>
Validation accuracy: **~75.6%**

Our random forest classifier yielded the best results and passed the testing stage with an accuracy of **~79.6%**, which is also enough to pass the sanity check.

To conclude, the best machine-learning model for this task would be a **random forest classifier** with the following hyperparameters: `max_depth = 8, n_estimators = 40, random_state = 12345`, fitted with our training dataset.