# Project description

Mobile carrier Megaline has found out that many of their subscribers use legacy plans. They want to develop a model that would analyze subscribers' behavior and recommend one of Megaline's newer plans: Smart or Ultra. 

## Data description


Every observation in the dataset contains monthly behavior information about one user. The information given is as follows: 

- calls — number of calls,
- minutes — total call duration in minutes,
- messages — number of text messages,
- mb_used — Internet traffic used in MB,
- is_ultra — plan for the current month (Ultra - 1, Smart - 0).

In [1]:
import math

import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

In [2]:

try:
    df = pd.read_csv('users_behavior.csv', sep=',')

except FileNotFoundError:
    df = pd.read_csv('https://practicum-content.s3.us-west-1.amazonaws.com/datasets/users_behavior.csv', sep=',')

In [3]:
df.shape

(3214, 5)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [5]:
# Check the data for missing values
df.isna().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

In [6]:


# Split the data into features and target variables
X = df.drop('is_ultra', axis=1)
y = df['is_ultra']

# Split the data into training (60%), validation (20%), and test (20%) sets.
# The second split takes 25% of the remaining 80%, which is 20% of the full data.
X_1, X_test, y_1, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=12345)
X_train, X_val, y_train, y_val = train_test_split(X_1, y_1, test_size=0.25, stratify=y_1, random_state=12345)

# Train the Decision Tree model
model = DecisionTreeClassifier(random_state=12345)
model.fit(X_train, y_train)
model.score(X_val, y_val)

0.7480559875583204
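The two-step split gives 60/20/20 because the second call holds out 25% of the 80% that remains after the first. The arithmetic can be sketched as follows. `split_sizes` is a hypothetical helper, not part of the notebook, and it assumes a fractional hold-out size is rounded up to whole rows.

```python
import math


def split_sizes(n_rows, test_size=0.2, val_size=0.25):
    """Return (train, validation, test) row counts for a two-step split.

    Assumption: a fractional hold-out size is rounded up to whole rows.
    """
    n_test = math.ceil(test_size * n_rows)
    n_rest = n_rows - n_test
    n_val = math.ceil(val_size * n_rest)
    n_train = n_rest - n_val
    return n_train, n_val, n_test


# 3214 is the row count reported by df.shape
train_rows, val_rows, test_rows = split_sizes(3214)
print(train_rows, val_rows, test_rows)
print([round(n / 3214, 3) for n in (train_rows, val_rows, test_rows)])
```

Under that rounding assumption the validation and test sets each hold 643 rows and the training set holds the remaining 1928.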

In [7]:
for i in range(1, 7):
  model = DecisionTreeClassifier(max_depth=i, random_state=12345)
  model.fit(X_train, y_train)
  print("Max depth" + str(i) + ":   " + str(model.score(X_val, y_val)))

Max depth1:   0.7589424572317263
Max depth2:   0.7838258164852255
Max depth3:   0.8040435458786936
Max depth4:   0.8040435458786936
Max depth5:   0.8164852255054432
Max depth6:   0.80248833592535


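The decision tree reaches its highest validation accuracy, 81.65%, at a maximum depth of 5. Accuracy drops again at depth 6, and the unrestricted tree from the previous cell scores only 74.81%, which suggests that deeper trees overfit the training set.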
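The best depth can also be picked programmatically instead of by reading the printout. The sketch below reuses the validation scores printed above. `best_depth` is a hypothetical helper, and preferring the shallowest tree on a tie is an assumption made here for simplicity.

```python
# Validation accuracy by max_depth, copied from the output above
val_scores = {
    1: 0.7589424572317263,
    2: 0.7838258164852255,
    3: 0.8040435458786936,
    4: 0.8040435458786936,
    5: 0.8164852255054432,
    6: 0.80248833592535,
}


def best_depth(scores):
    """Return the shallowest depth that reaches the highest score."""
    top = max(scores.values())
    return min(depth for depth, score in scores.items() if score == top)


depth = best_depth(val_scores)
print(f"best max_depth={depth}, accuracy={val_scores[depth]:.2%}")
```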

In [8]:
# Train the Random Forest model
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
rf_model.score(X_val, y_val)


0.8227060653188181
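The forest above is created without `random_state`, so its score can differ slightly between runs. The sketch below shows how fixing the seed makes the result repeatable. It uses synthetic stand-in data from `make_classification` (an assumption, not the Megaline dataset), and `forest_score` is a hypothetical helper.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with four features (not the project dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)


def forest_score(seed):
    """Fit a small forest with a fixed seed and return validation accuracy."""
    forest = RandomForestClassifier(n_estimators=20, random_state=seed)
    forest.fit(X_tr, y_tr)
    return forest.score(X_va, y_va)


score_a = forest_score(12345)
score_b = forest_score(12345)
print(score_a == score_b)  # same seed, same data, same score
```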

In [9]:
# Evaluate additional metrics on the test set: F1, ROC AUC, precision, recall
from sklearn.metrics import f1_score, roc_auc_score, precision_score, recall_score

# Predict once and reuse the labels for every metric
rf_test_pred = rf_model.predict(X_test)

print("F1 Score: "+str(f1_score(y_test, rf_test_pred)))
# ROC AUC is computed here from hard 0/1 labels, not from predicted probabilities
print("Roc Auc Score: "+str(roc_auc_score(y_test, rf_test_pred)))
print("Precision: "+str(precision_score(y_test, rf_test_pred)))
print("Recall: "+str(recall_score(y_test, rf_test_pred)))

F1 Score: 0.6303030303030304
Roc Auc Score: 0.7314481801006123
Precision: 0.7819548872180451
Recall: 0.5279187817258884
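These four numbers are tied together by the confusion matrix. F1 is the harmonic mean of precision and recall, and when `roc_auc_score` receives hard 0/1 predictions the ROC curve has a single operating point, so the area reduces to the mean of recall and specificity (balanced accuracy). The sketch below checks this. The counts are not printed anywhere in the notebook: they are inferred from the precision and recall above, assuming 643 test rows of which 197 are Ultra.

```python
# Confusion-matrix counts inferred from the printed precision and recall
# (assumption: 643 test rows, 197 of them Ultra)
tp, fp, fn, tn = 104, 29, 93, 417

precision = tp / (tp + fp)
recall = tp / (tp + fn)        # true positive rate
specificity = tn / (tn + fp)   # true negative rate
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2

print("Precision:", precision)
print("Recall:", recall)
print("F1:", f1)
print("Balanced accuracy:", balanced_accuracy)
```

Passing `rf_model.predict_proba(X_test)[:, 1]` to `roc_auc_score` instead of the hard labels would score how well the predicted probabilities rank the two classes across all thresholds.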


In [10]:
from sklearn.model_selection import GridSearchCV

parameters = {
    "max_depth": [5, 7, 10],
    "n_estimators": [5, 10],
    'min_samples_split': [2, 5, 7]
}

In [11]:
clf = GridSearchCV(rf_model, parameters, verbose=2)
clf.fit(X_train, y_train)

Fitting 5 folds for each of 18 candidates, totalling 90 fits
[CV] END ...max_depth=5, min_samples_split=2, n_estimators=5; total time=   0.0s
[CV] END ...max_depth=5, min_samples_split=2, n_estimators=5; total time=   0.0s
[CV] END ...max_depth=5, min_samples_split=2, n_estimators=5; total time=   0.0s
[CV] END ...max_depth=5, min_samples_split=2, n_estimators=5; total time=   0.0s
[CV] END ...max_depth=5, min_samples_split=2, n_estimators=5; total time=   0.0s
[CV] END ..max_depth=5, min_samples_split=2, n_estimators=10; total time=   0.0s
[CV] END ..max_depth=5, min_samples_split=2, n_estimators=10; total time=   0.0s
[CV] END ..max_depth=5, min_samples_split=2, n_estimators=10; total time=   0.0s
[CV] END ..max_depth=5, min_samples_split=2, n_estimators=10; total time=   0.0s
[CV] END ..max_depth=5, min_samples_split=2, n_estimators=10; total time=   0.0s
[CV] END ...max_depth=5, min_samples_split=5, n_estimators=5; total time=   0.0s
[CV] END ...max_depth=5, min_samples_split=5, n_

GridSearchCV(estimator=RandomForestClassifier(),
             param_grid={'max_depth': [5, 7, 10],
                         'min_samples_split': [2, 5, 7],
                         'n_estimators': [5, 10]},
             verbose=2)
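The log line "18 candidates, totalling 90 fits" follows directly from the grid: 3 depths × 2 forest sizes × 3 split thresholds, each evaluated on 5 folds. A minimal sketch that enumerates the same combinations with the standard library (the grid is copied from the cell above; `candidates` and `n_folds` are illustrative names):

```python
from itertools import product

parameters = {
    "max_depth": [5, 7, 10],
    "n_estimators": [5, 10],
    "min_samples_split": [2, 5, 7],
}

# Every combination of one value per hyperparameter
candidates = [
    dict(zip(parameters, combo)) for combo in product(*parameters.values())
]
n_folds = 5  # number of folds reported in the log above

print(len(candidates), "candidates,", len(candidates) * n_folds, "fits")
```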

Judged on accuracy alone, the random forest classifier performs best of the models tried so far, at 82.58%.

In [12]:
clf.best_params_

{'max_depth': 10, 'min_samples_split': 2, 'n_estimators': 10}

In [13]:
model_1 = clf.best_estimator_

In [14]:
model_1.score(X_test, y_test)

0.8009331259720062

In [15]:
# Create predictions for the decision tree model.
# `model` is the last tree fitted in the depth loop above (max_depth=6).
y_pred = model.predict(X_test)

mean_squared_error(y_pred, y_test)

0.20217729393468117

In [16]:
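# Check the model error (RMSE).
# The value under the square root is hard-coded and differs from the MSE
# printed in the previous cell (0.2022).
mean_squared_error(y_pred, y_test)
math.sqrt(0.28149300155520995)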

0.5305591404878536

In [17]:
# Create predictions for the random forest model
# (the default forest from cell 8, not the tuned `model_1`)
rf_y_pred = rf_model.predict(X_test)

mean_squared_error(rf_y_pred, y_test)

0.18973561430793157

In [18]:
# Hard-coded value; it differs from the MSE printed in the previous cell (0.1897)
math.sqrt(0.18662519440124417)

0.43200138240663555
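For 0/1 class labels every squared error is either 0 or 1, so the mean squared error equals the share of misclassified rows, that is 1 minus accuracy. The sketch below demonstrates this on toy labels (not project data) and takes the square root of the computed MSE rather than a typed-in constant. Both helper functions are hypothetical.

```python
import math


def classification_mse(y_true, y_pred):
    """Mean squared error for 0/1 labels, which equals the error rate."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels for illustration only
y_true_demo = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred_demo = [0, 1, 1, 0, 0, 1, 0, 0]

mse = classification_mse(y_true_demo, y_pred_demo)
rmse = math.sqrt(mse)  # derived from the computed MSE, not a constant
print(mse, 1 - accuracy(y_true_demo, y_pred_demo), rmse)
```

In the notebook the same pattern would be `math.sqrt(mean_squared_error(y_test, y_pred))`, which keeps the RMSE in sync with whatever predictions are current.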

In [19]:
from sklearn.dummy import DummyClassifier
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train)
dummy_clf.score(X_val, y_val)


0.6936236391912908

The dummy classifier, which always predicts the most frequent class, has the lowest accuracy of all the models: 69.36%. Every trained model beats this baseline.
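The `most_frequent` strategy amounts to predicting the majority class for every row, so its accuracy is simply that class's share of the evaluation set. A minimal sketch follows. `majority_baseline_accuracy` is a hypothetical helper, and the class counts are assumptions inferred from the 0.6936 score and a 643-row validation set, not values printed in the notebook.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_eval):
    """Accuracy of always predicting the most common training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    return sum(label == majority_label for label in y_eval) / len(y_eval)


# Assumed class counts: 446 Smart (0) and 197 Ultra (1) validation rows
y_val_demo = [0] * 446 + [1] * 197
# Hypothetical training labels with roughly the same class balance
y_train_demo = [0] * 1337 + [1] * 591

print(majority_baseline_accuracy(y_train_demo, y_val_demo))
```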

# Conclusion
After testing several models for predicting subscriber behavior and recommending a plan, the Random Forest Classifier is the best choice: it has the highest accuracy and the lowest error, and it clearly outperforms the dummy baseline (69.36% accuracy).