# Cell plans recommendation

We have data on the behavior of customers who have already switched to the Smart and Ultra tariffs (see our previous [study](https://github.com/Shurgalivan/Portfolio/blob/main/Cell%20Plan%20Selection/Cell_plan_selection_1.ipynb)). We need to build a model for the classification task that will select the appropriate tariff.

The goal of this study is to build a model with the highest accuracy we can achieve, and no lower than 0.75.

## Data preprocessing
### Let's open the dataset

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


We load the dataset into a dataframe `df`.

In [2]:
df = pd.read_csv('users_behavior.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [3]:
df.head()

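|   | calls | minutes | messages | mb_used | is_ultra |
|---|-------|---------|----------|---------|----------|
| 0 | 40.0 | 311.9 | 83.0 | 19915.42 | 0 |
| 1 | 85.0 | 516.75 | 56.0 | 22696.96 | 0 |
| 2 | 77.0 | 467.66 | 86.0 | 21060.45 | 0 |
| 3 | 106.0 | 745.53 | 81.0 | 8437.39 | 1 |
| 4 | 66.0 | 418.74 | 1.0 | 14502.75 | 0 |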


### Splitting the Data into Samples

To train and evaluate a model, we split the available data into training (60%), validation (20%), and testing (20%) samples. The training sample is used to fit the model, the validation sample is used to compare hyperparameter settings, and the testing sample is used to assess the final performance of the chosen model.

In [4]:
#hold out 40% of the data; the remaining 60% is the training sample
df_train, df_temp = train_test_split(df, test_size=0.40, random_state=12345)

#split the held-out 40% equally into validation and testing samples (20% each)
df_valid, df_test = train_test_split(df_temp, test_size=0.50, random_state=12345)
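
As a quick check of the proportions described above, the sketch below repeats the same two-step split on a placeholder dataframe. The frame `demo` and its column `x` are hypothetical stand-ins; only the row count (3214, taken from `df.info()`) matters here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame with the same number of rows as users_behavior.csv (3214).
# The values are arbitrary; we only care about how many rows land in each sample.
demo = pd.DataFrame({'x': np.arange(3214)})

# Same two-step split as in the notebook: 40% held out, then halved.
demo_train, demo_temp = train_test_split(demo, test_size=0.40, random_state=12345)
demo_valid, demo_test = train_test_split(demo_temp, test_size=0.50, random_state=12345)

split_sizes = {'train': len(demo_train), 'valid': len(demo_valid), 'test': len(demo_test)}
split_shares = {name: round(size / len(demo), 2) for name, size in split_sizes.items()}
print(split_sizes)
print(split_shares)
```

The testing sample should contain 643 rows, which matches the `size = 643` used later in the adequacy check.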

## Model research

### Select our target variable `is_ultra`

In [5]:
#features without the target variable
features_train = df_train.drop(['is_ultra'], axis=1)
#target variable
target_train = df_train['is_ultra']
features_valid = df_valid.drop(['is_ultra'], axis=1)  
target_valid = df_valid['is_ultra']

### Decision tree model

Let's start with the decision tree model. We will run a loop that, in each iteration, trains a model and evaluates its accuracy on the validation sample at a different tree depth, from 2 to 18 in steps of 2.

In [6]:
for depth in range(2,20,2):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)
    accuracy = accuracy_score(target_valid, predictions)
    print('max_depth =',depth,':', round(accuracy, 5))

max_depth = 2 : 0.78227
max_depth = 4 : 0.77916
max_depth = 6 : 0.78383
max_depth = 8 : 0.77916
max_depth = 10 : 0.77449
max_depth = 12 : 0.76205
max_depth = 14 : 0.75894
max_depth = 16 : 0.73406
max_depth = 18 : 0.73095


The best validation accuracy, 0.78383, was achieved at a depth of 6. Beyond that depth, accuracy declines steadily, which suggests that deeper trees overfit the training sample.
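
Rather than reading the best depth off the printed list, the loop can store each score and pick the maximum. The sketch below is a minimal illustration on synthetic data: `make_classification` stands in for `users_behavior.csv`, and the names `depth_scores` and `best_depth` are illustrative, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the tariff data (assumption: 4 numeric features, binary target).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Store the validation accuracy for every depth instead of only printing it.
depth_scores = {}
for depth in range(2, 20, 2):
    tree = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    tree.fit(X_train, y_train)
    depth_scores[depth] = accuracy_score(y_valid, tree.predict(X_valid))

# The depth with the highest validation accuracy (first one wins on ties).
best_depth = max(depth_scores, key=depth_scores.get)
print('best max_depth =', best_depth, ':', round(depth_scores[best_depth], 5))
```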

### Random forest classifier

Let's take a look at the models built using the Random Forest algorithm. We will once again run a loop, this time varying the number of trees from 2 to 15 while keeping `max_depth` at 6:

In [7]:
for estimators in range(2,16):
    model = RandomForestClassifier(random_state=12345, n_estimators=estimators, max_depth=6)
    model.fit(features_train, target_train)
    predictions = model.predict(features_valid)
    accuracy = accuracy_score(target_valid, predictions)
    print('n_estimators =',estimators,':', round(accuracy, 5))

n_estimators = 2 : 0.78538
n_estimators = 3 : 0.78383
n_estimators = 4 : 0.78849
n_estimators = 5 : 0.79471
n_estimators = 6 : 0.79316
n_estimators = 7 : 0.79471
n_estimators = 8 : 0.79938
n_estimators = 9 : 0.79938
n_estimators = 10 : 0.80093
n_estimators = 11 : 0.80093
n_estimators = 12 : 0.80404
n_estimators = 13 : 0.80249
n_estimators = 14 : 0.80249
n_estimators = 15 : 0.80093


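The best number of trees in this range is 12, with a validation accuracy of 0.804. Note that we only varied `n_estimators` while `max_depth` stayed fixed at 6, so these are the best hyperparameters among the combinations we tried rather than a proven optimum.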
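
Because the depth of 6 was chosen for a single tree, it is not necessarily the best depth for a forest. One way to check would be to vary both hyperparameters together. The sketch below does this on synthetic data; the grids `depth_grid` and `estimators_grid` are hypothetical choices, and `make_classification` stands in for the real dataset, so the result says nothing about the tariff data itself.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the tariff data (assumption: 4 numeric features, binary target).
X, y = make_classification(n_samples=600, n_features=4, n_informative=3,
                           n_redundant=1, random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# Hypothetical grids; the notebook itself only varied n_estimators at max_depth=6.
depth_grid = [2, 4, 6, 8]
estimators_grid = [4, 8, 12, 16]

# Validation accuracy for every (max_depth, n_estimators) combination.
grid_scores = {}
for depth, estimators in product(depth_grid, estimators_grid):
    forest = RandomForestClassifier(random_state=12345,
                                    n_estimators=estimators, max_depth=depth)
    forest.fit(X_train, y_train)
    grid_scores[(depth, estimators)] = accuracy_score(y_valid, forest.predict(X_valid))

best_params = max(grid_scores, key=grid_scores.get)
print('best (max_depth, n_estimators) =', best_params,
      ':', round(grid_scores[best_params], 5))
```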

However, it is worth checking the model based on the logistic regression algorithm.

### Logistic regression

Let's conduct an experiment with logistic regression. 

In [8]:
model = LogisticRegression(random_state=12345)
model.fit(features_train, target_train) 
predictions = model.predict(features_valid)
#accuracy
accuracy = accuracy_score(target_valid, predictions)
print(accuracy)

0.7107309486780715


The accuracy of the logistic regression model on the validation sample is 0.71, which is lower than the accuracy of both previous models and below our 0.75 target, so we will not carry it forward to the test dataset.
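
One possible reason for the lower score, not verified in this study, is feature scale: `mb_used` is in the tens of thousands while `calls` and `messages` are in the tens, and logistic regression can be sensitive to that. A common remedy is to standardize the features before fitting. The sketch below shows the pattern on synthetic data; the feature distributions and the rule that generates the target are assumptions made up for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(12345)
n = 800

# Hypothetical features on very different scales, loosely mimicking
# calls, minutes, messages and mb_used. These are NOT the real data.
X = np.column_stack([
    rng.normal(60, 30, n),
    rng.normal(430, 200, n),
    rng.normal(40, 35, n),
    rng.normal(17000, 7500, n),
])
# Hypothetical target: depends on two of the features plus noise.
signal = (X[:, 1] - 430) / 200 + (X[:, 3] - 17000) / 7500
y = (signal + rng.normal(0, 1, n) > 1).astype(int)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# StandardScaler is fitted on the training sample only, then applied to validation.
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(random_state=12345))
scaled_logreg.fit(X_train, y_train)
scaled_accuracy = accuracy_score(y_valid, scaled_logreg.predict(X_valid))
print('scaled logistic regression accuracy =', round(scaled_accuracy, 5))
```

Wrapping the scaler and the model in one pipeline keeps the validation sample out of the scaling statistics. Whether this would change the 0.71 result on the real data was not tested here.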

## Testing the model


Let's evaluate the best model (Random Forest) on the test dataset.

In [9]:
#select our features and target variable
features_test = df_test.drop(['is_ultra'], axis=1)  
target_test = df_test['is_ultra']

In [10]:
#note: random_state=8897 differs from the 12345 used during the hyperparameter search
model = RandomForestClassifier(random_state=8897, n_estimators=12, max_depth=6)
model.fit(features_train, target_train)

predictions = model.predict(features_test)
accuracy = accuracy_score(target_test, predictions) 
print('accuracy =', round(accuracy, 5))

accuracy = 0.79938


On the test dataset, the Random Forest model reaches an accuracy of 0.79938, which is above our 0.75 target and close to the validation accuracy of 0.804 obtained during the search. The error rate is slightly above 20%.

## Adequacy of the model

To check the adequacy of the model, we can compare its performance with a baseline model or a random guessing approach. Here we use random guessing: each prediction is 0 or 1 with equal probability.

In [11]:
#one random 0/1 guess per test row; no seed is set, so the exact value changes between runs
random_predictions = np.random.randint(low=0, high=2, size=len(target_test))
accuracy = accuracy_score(target_test, random_predictions)
print('accuracy =', round(accuracy, 5))

accuracy = 0.49767


Random guessing is correct only about half the time (0.49767), well below the 0.79938 achieved by our model on the same test sample, so the model passes this sanity check.
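
A coin flip is a weak baseline when the classes are imbalanced, because always predicting the more frequent class can score well above 0.5. A stricter check, not part of the original study, is scikit-learn's `DummyClassifier` with `strategy='most_frequent'`. The sketch below uses synthetic data with an assumed 70/30 class split, so the printed number is illustrative only and is not the baseline for the tariff data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with an assumed 70/30 class imbalance (NOT the real class shares).
X, y = make_classification(n_samples=1000, n_features=4, n_informative=3,
                           n_redundant=1, weights=[0.7, 0.3], random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345)

# The baseline ignores the features and always predicts the majority training class.
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))

# Its accuracy equals the share of the majority class in the test sample.
majority_class = np.bincount(y_train).argmax()
print('majority class =', majority_class,
      ': baseline accuracy =', round(baseline_accuracy, 5))
```

On the real data, the same pattern would use `features_train`, `target_train`, `features_test`, and `target_test`; the model is adequate in this stricter sense only if its test accuracy exceeds the constant-prediction accuracy.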

### Key Takeaways:

We trained and compared models based on three algorithms: Decision Tree, Random Forest, and Logistic Regression.
- The best-performing model was the Random Forest with `n_estimators` (number of trees) set to 12 and `max_depth` set to 6.
- The best model achieved an accuracy of 0.804 on the validation dataset and 0.79938 on the test dataset, above the 0.75 target.
- The model passed the adequacy check: it clearly outperformed random guessing (0.49767), which indicates that it has learned meaningful patterns in the data.