# Goals and Overview

The goal of this project is to develop a model that analyzes subscribers' behavior and recommends one of Megaline's newer plans: Smart or Ultra. I will begin by reviewing the data and checking for missing and duplicate values. Next, I will consider enriching the data with any features that could improve accuracy. A variety of models and hyperparameters will then be tested to find the best model for the task. The models will be tuned on a training set, compared on a validation set, and checked on a held-out test set, and the best model will be selected based on these scores.

# Project

## Initialization

In [2]:
#Loading necessary libraries.
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

## Reading Data

In [3]:
#Reading Data.
df = pd.read_csv('./datasets/users_behavior.csv')

In [4]:
#Looking at 'df'.
df

| | calls | minutes | messages | mb_used | is_ultra |
|---|---|---|---|---|---|
| 0 | 40.0 | 311.90 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |
| ... | ... | ... | ... | ... | ... |
| 3209 | 122.0 | 910.98 | 20.0 | 35124.90 | 1 |
| 3210 | 25.0 | 190.36 | 0.0 | 3275.61 | 0 |
| 3211 | 97.0 | 634.44 | 70.0 | 13974.06 | 0 |
| 3212 | 64.0 | 462.32 | 90.0 | 31239.78 | 0 |


In [5]:
#Looking at 'df' info().
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


'calls' and 'messages' hold whole-number counts and could be cast to int64, but they are fine as they are. 'minutes' contains fractional values, so float64 is appropriate there. All other data types are correct, and there appear to be no missing values.
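If an integer cast were wanted, one cautious way to do it is to convert only those float columns whose values are all whole numbers. The sketch below is illustrative rather than part of the project: `sample` is a hypothetical frame holding three rows copied from the preview above.

```python
import pandas as pd

# Three rows copied from the preview of 'df'; 'sample' is an illustrative name.
sample = pd.DataFrame({
    'calls': [40.0, 85.0, 77.0],
    'minutes': [311.90, 516.75, 467.66],
    'messages': [83.0, 56.0, 86.0],
    'mb_used': [19915.42, 22696.96, 21060.45],
    'is_ultra': [0, 0, 0],
})

# Cast a float column to int64 only if every value in it is a whole number.
for column in sample.select_dtypes(include='float64').columns:
    if (sample[column] % 1 == 0).all():
        sample[column] = sample[column].astype('int64')

print(sample.dtypes)
```

Columns with fractional values are left untouched, so no information is lost by the cast.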

__Missing Values__

In [None]:
#Checking for missing values.
df.isna().sum()

No missing values confirmed.

__Duplicated Values__

In [None]:
#Checking for duplicate values.
df[df.duplicated()]

There are no duplicated rows.
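Both checks can also be reduced to a single number each, which is easier to read than a per-column listing or an empty frame. This is a minimal sketch on a hypothetical `sample` frame that deliberately contains one missing value and one repeated row; it is not the project data.

```python
import pandas as pd

# Illustrative frame with one missing value and one repeated row.
sample = pd.DataFrame({
    'calls': [40.0, 85.0, 40.0, None],
    'minutes': [311.90, 516.75, 311.90, 120.0],
})

# Total number of missing cells across all columns.
missing_total = int(sample.isna().sum().sum())

# Number of rows that repeat an earlier row exactly.
duplicate_total = int(sample.duplicated().sum())

print("Missing values:", missing_total)
print("Duplicated rows:", duplicate_total)
```

On 'df' itself, both totals would be expected to come out as zero, matching the findings above.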

## Data Preparation

In [None]:
df

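Adding an 'average_call_time' feature ('minutes' divided by 'calls') was considered. However, it would be derived entirely from 'calls' and 'minutes', which are already part of the data set, so I don't expect it to be as impactful as those two columns and have left the data unchanged.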
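For reference, the feature that was considered could be computed roughly as follows. This is only a sketch: `sample` is a hypothetical frame, and assigning 0.0 to users with no calls is an assumption made here to avoid dividing by zero.

```python
import pandas as pd

# Illustrative rows; the second user made no calls.
sample = pd.DataFrame({
    'calls': [40.0, 0.0, 106.0],
    'minutes': [311.90, 0.0, 745.53],
})

# Replace zero call counts with NaN so the division cannot produce inf or 0/0.
safe_calls = sample['calls'].where(sample['calls'] > 0)

# Assumption: users without calls get an average call time of 0.0.
sample['average_call_time'] = (sample['minutes'] / safe_calls).fillna(0.0)

print(sample)
```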

## Model Exploration

In [None]:
# Setting Random State
rs = 12345

### Data Splitting

In [None]:
#Splitting 'df' into 'df_train' and 'df_valid_test'. 'df_valid_test' will be split again below.
df_train, df_valid_test = train_test_split(df, test_size=0.4, random_state=rs)

#Splitting 'df_valid_test' into 'df_valid' and 'df_test'.
df_valid, df_test = train_test_split(df_valid_test, test_size=0.5, random_state=rs)

In [None]:
#Assigning to 'features_train' all columns except 'is_ultra'.
features_train = df_train.drop(['is_ultra'], axis=1)

#Assigning to 'target_train' the 'is_ultra' column.
target_train = df_train['is_ultra']

In [None]:
#Assigning to 'features_valid' all columns except 'is_ultra'.
features_valid = df_valid.drop(['is_ultra'], axis=1)

#Assigning to 'target_valid' the 'is_ultra' column.
target_valid = df_valid['is_ultra']

In [None]:
#Assigning to 'features_test' all columns except 'is_ultra'.
features_test = df_test.drop(['is_ultra'], axis=1)

#Assigning to 'target_test' the 'is_ultra' column.
target_test = df_test['is_ultra']

'df' has been split into 'df_train', 'df_valid', and 'df_test', and those three have been further split into feature and target variations.

In [None]:
print(features_train.shape)
print(target_train.shape)
print(features_valid.shape)
print(target_valid.shape)
print(features_test.shape)
print(target_test.shape)
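The two-step split yields roughly 60% training, 20% validation, and 20% test data. The sketch below reproduces the split on a synthetic stand-in with the same row count and column names as 'df' (the values are random, not the project data) and uses a hypothetical helper, `split_features_target`, to avoid repeating the drop/select step for each subset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for 'df': same row count and column names, random values.
rng = np.random.default_rng(12345)
n_rows = 3214
demo = pd.DataFrame({
    'calls': rng.integers(0, 200, n_rows),
    'minutes': rng.uniform(0, 1500, n_rows),
    'messages': rng.integers(0, 200, n_rows),
    'mb_used': rng.uniform(0, 40000, n_rows),
    'is_ultra': rng.integers(0, 2, n_rows),
})


def split_features_target(frame, target_column='is_ultra'):
    """Return (features, target) for one subset. Hypothetical helper."""
    return frame.drop(columns=[target_column]), frame[target_column]


# Same two-step split as above: 60% train, then the remainder halved.
demo_train, demo_valid_test = train_test_split(demo, test_size=0.4, random_state=12345)
demo_valid, demo_test = train_test_split(demo_valid_test, test_size=0.5, random_state=12345)

for name, subset in [('train', demo_train), ('valid', demo_valid), ('test', demo_test)]:
    features, target = split_features_target(subset)
    print(name, features.shape, target.shape, round(len(subset) / n_rows, 2))
```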

### Testing Various Models and Hyperparameters

In [None]:
#Defining the hyperparameters grid.
param_grid = {
    'max_depth': [1, 2, 3, 4, 5],
    'criterion': ['gini', 'entropy'],
    'class_weight': [None, 'balanced']
}

#Initializing DecisionTreeClassifier model.
dtc_model = DecisionTreeClassifier(random_state=12345)

#Searching for the best combination of hyperparameters.
grid_search = GridSearchCV(dtc_model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(features_train, target_train)

#Assigning to 'best_params' the best hyperparameters.
best_params = grid_search.best_params_

#Training the model using 'features_train' and 'target_train'.
best_dtc_model = DecisionTreeClassifier(**best_params, random_state=12345)
best_dtc_model.fit(features_train, target_train)

#Assigning to 'train_accuracy' the score on the training set.
train_accuracy = best_dtc_model.score(features_train, target_train)

#Assigning to 'valid_accuracy' the score on the validation set.
valid_accuracy = best_dtc_model.score(features_valid, target_valid)

#Printing findings.
print("Best hyperparameters:", best_params)
print("Accuracy on the training set:", train_accuracy)
print("Accuracy on the validation set:", valid_accuracy)

Based on the hyperparameters obtained through grid search, the best combination includes using the entropy criterion for splitting nodes and limiting the maximum depth of the tree to 3. The model achieved an accuracy of approximately 80.91% on the training set and 79.00% on the validation set.
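As a side note, GridSearchCV refits the winning configuration on the full training data by default (`refit=True`) and exposes it as `best_estimator_`, so the manual refit is optional. The sketch below checks this on synthetic data from `make_classification`, which stands in for 'features_train' and 'target_train'; the reduced grid is an assumption made to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for 'features_train' / 'target_train'.
X, y = make_classification(n_samples=300, n_features=4, random_state=12345)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=12345),
    {'max_depth': [1, 2, 3], 'criterion': ['gini', 'entropy']},
    cv=5,
    scoring='accuracy',
)
search.fit(X, y)

# With the default refit=True, the best model is already refit on all of X, y.
refit_model = search.best_estimator_

# Manual refit with the same hyperparameters and seed, for comparison.
manual_model = DecisionTreeClassifier(**search.best_params_, random_state=12345).fit(X, y)

same_predictions = bool((refit_model.predict(X) == manual_model.predict(X)).all())
print("Identical predictions:", same_predictions)
```

Either route gives the same fitted tree; using `best_estimator_` simply saves one training run.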

In [None]:
#Defining the hyperparameters grid.
param_grid = {
    'n_estimators': list(range(1, 21)),
    'max_depth': [None, 10, 20, 30, 40, 50]
}

#Initializing RandomForestClassifier model.
rfc_model = RandomForestClassifier(random_state=12345)

#Searching for the best combination of hyperparameters.
grid_search = GridSearchCV(rfc_model, param_grid, cv=7, scoring='accuracy')
grid_search.fit(features_train, target_train)

#Assigning to 'best_params' the best hyperparameters.
best_params = grid_search.best_params_

#Training the model using 'features_train' and 'target_train'.
best_rfc_model = RandomForestClassifier(**best_params, random_state=12345)
best_rfc_model.fit(features_train, target_train)

#Assigning to 'train_accuracy' the score on the training set.
train_accuracy = best_rfc_model.score(features_train, target_train)

#Assigning to 'valid_accuracy' the score on the validation set.
valid_accuracy = best_rfc_model.score(features_valid, target_valid)

#Printing findings.
print("Best hyperparameters:", best_params)
print("Accuracy on the training set:", train_accuracy)
print("Accuracy on the validation set:", valid_accuracy)

The best hyperparameters obtained are a max_depth of 10 and n_estimators of 20. With these settings, the random forest model achieved an accuracy of approximately 88.85% on the training set and 79.63% on the validation set.
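Because n_estimators=20 sits at the upper edge of the searched range, it can be useful to see how close the runner-up combinations were before deciding whether to widen the grid. One way to do that is to load `cv_results_` into a DataFrame. The sketch below uses synthetic data and a reduced grid (both assumptions) purely to show the mechanics.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for 'features_train' / 'target_train'.
X, y = make_classification(n_samples=300, n_features=4, random_state=12345)

search = GridSearchCV(
    RandomForestClassifier(random_state=12345),
    {'n_estimators': [5, 10, 20], 'max_depth': [3, 10, None]},
    cv=3,
    scoring='accuracy',
)
search.fit(X, y)

# One row per hyperparameter combination, with cross-validated scores.
results = pd.DataFrame(search.cv_results_)
columns = ['param_n_estimators', 'param_max_depth',
           'mean_test_score', 'std_test_score', 'rank_test_score']
top = results.sort_values('rank_test_score')[columns].head()
print(top)
```

If several combinations score within one standard deviation of the best, the smaller or shallower forest may be a reasonable choice.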

In [None]:
#Defining the hyperparameters grid.
param_grid = {
    'penalty': ['l1', 'l2'],
    'C': [0.1, 1.0, 10.0],
    'solver': ['liblinear', 'saga'],
    'max_iter': [100, 200, 300],
}

#Initializing LogisticRegression model.
lr_model = LogisticRegression(random_state=12345)

#Searching for the best combination of hyperparameters.
grid_search = GridSearchCV(lr_model, param_grid, cv=5, scoring='accuracy')
grid_search.fit(features_train, target_train)

#Assigning to 'best_params' the best hyperparameters.
best_params = grid_search.best_params_

#Training the model using 'features_train' and 'target_train'.
best_lr_model = LogisticRegression(**best_params, random_state=12345)
best_lr_model.fit(features_train, target_train)

#Assigning to 'train_accuracy' the score on the training set.
train_accuracy = best_lr_model.score(features_train, target_train)

#Assigning to 'valid_accuracy' the score on the validation set.
valid_accuracy = best_lr_model.score(features_valid, target_valid)

#Printing findings.
print("Best hyperparameters:", best_params)
print("Accuracy on the training set:", train_accuracy)
print("Accuracy on the validation set:", valid_accuracy)

Based on the grid search, the best hyperparameters for the LogisticRegression model are 'C': 10.0, 'max_iter': 100, 'penalty': 'l1', and 'solver': 'liblinear'. With these settings, the model achieved an accuracy of approximately 75.31% on the training set and 75.58% on the validation set.
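The features differ greatly in scale ('mb_used' is in the tens of thousands, 'calls' in the tens), which can slow or hinder convergence for some logistic regression solvers. One option that was not tried in this project is to standardize the features inside a pipeline before the grid search. The sketch below shows the idea on synthetic data; the inflated column, the reduced grid, and the default solver are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for 'features_train' / 'target_train'.
X, y = make_classification(n_samples=300, n_features=4, random_state=12345)

# Inflate one column to mimic the scale gap between 'mb_used' and 'calls'.
X[:, 3] = X[:, 3] * 10000

# The scaler is fit inside each CV fold, so validation folds do not leak into it.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=12345, max_iter=300),
)

# make_pipeline names each step after its lowercased class name.
search = GridSearchCV(
    pipeline,
    {'logisticregression__C': [0.1, 1.0, 10.0]},
    cv=5,
    scoring='accuracy',
)
search.fit(X, y)

print("Best C:", search.best_params_['logisticregression__C'])
print("Mean CV accuracy:", search.best_score_)
```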

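In summary, the RandomForestClassifier achieved the highest validation accuracy (approximately 79.63%), narrowly ahead of the DecisionTreeClassifier (79.00%), while LogisticRegression scored lowest (75.58%). The random forest also shows the largest gap between training and validation accuracy, which suggests some overfitting.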
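The reported scores can be collected into one small table to make the comparison explicit. The accuracies below are the rounded figures quoted in the paragraphs above; the `train_valid_gap` column is an added convenience for spotting overfitting.

```python
import pandas as pd

# Accuracies as reported in the write-up above, rounded to four decimals.
scores = pd.DataFrame({
    'model': ['DecisionTreeClassifier', 'RandomForestClassifier', 'LogisticRegression'],
    'train_accuracy': [0.8091, 0.8885, 0.7531],
    'valid_accuracy': [0.7900, 0.7963, 0.7558],
})

# A large positive gap hints that the model fits the training data too closely.
scores['train_valid_gap'] = (scores['train_accuracy'] - scores['valid_accuracy']).round(4)

ranked = scores.sort_values('valid_accuracy', ascending=False).reset_index(drop=True)
print(ranked)
```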

### Checking for Accuracy

In [None]:
dtc_test_predictions = best_dtc_model.predict(features_test)

In [None]:
rfc_test_predictions = best_rfc_model.predict(features_test)

In [None]:
lr_test_predictions = best_lr_model.predict(features_test)

In [None]:
print("Accuracy on test set for DecisionTreeClassifier:", accuracy_score(target_test, dtc_test_predictions))

In [None]:
print("Accuracy on test set for RandomForestClassifier:", accuracy_score(target_test, rfc_test_predictions))

In [None]:
print("Accuracy on test set for LogisticRegression:", accuracy_score(target_test, lr_test_predictions))

## Final Model

In [None]:
final_model = RandomForestClassifier(random_state=12345, max_depth=10, n_estimators=20)
final_model.fit(features_train, target_train)

In [None]:
test_predictions = final_model.predict(features_test)

In [None]:
print("Accuracy on test set for Final Model:", accuracy_score(target_test, test_predictions))
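Since the final model uses the same hyperparameters, seed, and training data as the tuned random forest, its test accuracy should match the RandomForestClassifier figure printed earlier. The sketch below illustrates that reproducibility on synthetic data (an assumption; the project data is not used here).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the project's train and test sets.
X, y = make_classification(n_samples=400, n_features=4, random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12345)

# Two forests built with identical settings and the same seed.
params = {'max_depth': 10, 'n_estimators': 20, 'random_state': 12345}
first_model = RandomForestClassifier(**params).fit(X_train, y_train)
second_model = RandomForestClassifier(**params).fit(X_train, y_train)

first_accuracy = accuracy_score(y_test, first_model.predict(X_test))
second_accuracy = accuracy_score(y_test, second_model.predict(X_test))

print("Accuracies match:", first_accuracy == second_accuracy)
```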

## Conclusion

In conclusion, after testing a variety of models and hyperparameters, the RandomForestClassifier with max_depth=10, n_estimators=20, and random_state=12345 achieved the highest validation accuracy (approximately 79.63%) and stood out as the best model for analyzing subscribers' behavior and recommending an appropriate Megaline plan.