# Mobile Carrier Megaline Project

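Introduction: This analysis builds a classification model of subscribers' behavior that predicts the `is_ultra` label. Three model types (decision tree, random forest, and logistic regression) are tuned and compared on a validation set to find one that passes the 75% accuracy threshold on the test set.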

In [1]:
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score

The cell above imports the libraries used in this analysis.

In [2]:
try:
    df_users = pd.read_csv("/datasets/users_behavior.csv")
except FileNotFoundError:
    df_users = pd.read_csv('https://practicum-content.s3.us-west-1.amazonaws.com/datasets/users_behavior.csv')

df_users.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


The cell above loads the dataset and previews the first five rows.

In [3]:
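# Hold out 25% of the rows, then split that holdout evenly:
# 75% training, 12.5% validation, 12.5% test
df_train, df_valid_and_test = train_test_split(df_users, test_size=0.25, random_state=12345)

df_valid, df_test = train_test_split(df_valid_and_test, test_size=0.5, random_state=12345)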

features_train = df_train.drop(['is_ultra'], axis=1)
target_train = df_train['is_ultra']
features_valid = df_valid.drop(['is_ultra'], axis=1)
target_valid = df_valid['is_ultra']
features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']

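The cell above splits the source data into training (75%), validation (12.5%), and test (12.5%) sets, then separates the features from the `is_ultra` target in each.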
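
As a rough check of the split proportions, the sketch below repeats the same two-step split on a synthetic 1,000-row frame. The frame `df_demo` and its values are made up for illustration and are not the Megaline data; only the `test_size` and `random_state` arguments come from the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (assumption: 1,000 rows, illustration only)
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "calls": rng.integers(0, 200, size=1000),
    "is_ultra": rng.integers(0, 2, size=1000),
})

# Step 1: hold out 25% of the rows for validation and test combined
demo_train, demo_holdout = train_test_split(df_demo, test_size=0.25, random_state=12345)

# Step 2: split the held-out rows evenly into validation and test
demo_valid, demo_test = train_test_split(demo_holdout, test_size=0.5, random_state=12345)

split_sizes = (len(demo_train), len(demo_valid), len(demo_test))
print(split_sizes)
```

The three sizes should come out in a 6:1:1 ratio, which is the 75% / 12.5% / 12.5% split used in this notebook.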

In [4]:
for depth in range(1, 6):
    model = DecisionTreeClassifier(random_state=35, max_depth=depth)
    model.fit(features_train, target_train)
    predictions_valid = model.predict(features_valid)
    print('max_depth =', depth, ': ', end='')
    print(accuracy_score(target_valid, predictions_valid))

max_depth = 1 : 0.763681592039801
max_depth = 2 : 0.7935323383084577
max_depth = 3 : 0.7985074626865671
max_depth = 4 : 0.7985074626865671
max_depth = 5 : 0.7960199004975125


The first model is a decision tree classifier tuned over `max_depth`. Validation accuracy peaks at about 80% (0.7985) for `max_depth` values of 3 and 4.
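
The decision tree loop prints each score but does not record the best depth. A minimal sketch of tracking it, in the same style as the random forest loop, might look like the following. It runs on synthetic data from `make_classification` (an assumption for illustration, not the Megaline data), and the names `best_depth` and `best_depth_score` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption: stands in for the real features and target)
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

best_depth = None
best_depth_score = 0.0
for depth in range(1, 6):
    tree = DecisionTreeClassifier(random_state=35, max_depth=depth)
    tree.fit(X_train, y_train)
    score = tree.score(X_valid, y_valid)  # accuracy on the validation split
    # Strict ">" keeps the shallowest depth when two depths tie
    if score > best_depth_score:
        best_depth = depth
        best_depth_score = score

print("best max_depth:", best_depth, "validation accuracy:", best_depth_score)
```

Applied to the scores printed above, where depths 3 and 4 tie at 0.7985, the strict comparison would keep `max_depth=3`, the simpler of the two trees.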

In [5]:
best_score = 0
best_est = 0
for est in range(10, 101, 10): 
    model = RandomForestClassifier(random_state=12345, n_estimators=est) 
    model.fit(features_train, target_train)
    score = model.score(features_valid, target_valid) 
    print("Number of estimators:", est, "Score:", score)
    if score > best_score:
        best_score = score
        best_est = est

print("Accuracy of the best model on the validation set (n_estimators = {}): {}".format(best_est, best_score))

final_model = RandomForestClassifier(random_state=12345, n_estimators=best_est) 
final_model.fit(features_train, target_train)

Number of estimators: 10 Score: 0.7960199004975125
Number of estimators: 20 Score: 0.7910447761194029
Number of estimators: 30 Score: 0.7935323383084577
Number of estimators: 40 Score: 0.7960199004975125
Number of estimators: 50 Score: 0.8009950248756219
Number of estimators: 60 Score: 0.8084577114427861
Number of estimators: 70 Score: 0.8009950248756219
Number of estimators: 80 Score: 0.7985074626865671
Number of estimators: 90 Score: 0.7985074626865671
Number of estimators: 100 Score: 0.8009950248756219
Accuracy of the best model on the validation set (n_estimators = 60): 0.8084577114427861


RandomForestClassifier(n_estimators=60, random_state=12345)

The second model is a random forest classifier tuned over `n_estimators`. The best validation accuracy is about 81% (0.8085) at `n_estimators=60`.

In [6]:
model = LogisticRegression(
    random_state=54321, solver="liblinear"
)
model.fit(features_train, target_train)
score_train = model.score(features_train, target_train)
score_valid = model.score(features_valid, target_valid)

print(
    "Accuracy of the logistic regression model on the training set:",
    score_train,
)
print(
    "Accuracy of the logistic regression model on the validation set:",
    score_valid,
)

Accuracy of the logistic regression model on the training set: 0.74149377593361
Accuracy of the logistic regression model on the validation set: 0.7661691542288557


The third model is a logistic regression, with a validation accuracy of about 77% (0.7662).

Below, a random forest with `n_estimators=80` is scored on the training, validation, and test sets. Note that this differs from the `n_estimators=60` selected by the tuning loop above.

In [7]:
model = RandomForestClassifier(n_estimators=80, random_state=12345)
model.fit(features_train, target_train)

print('Accuracy of the best model on the train set:', model.score(features_train, target_train))
print('Accuracy of the best model on the validation set:', model.score(features_valid, target_valid))
print('Accuracy of the best model on the test set:', model.score(features_test, target_test))

Accuracy of the best model on the train set: 1.0
Accuracy of the best model on the validation set: 0.7985074626865671
Accuracy of the best model on the test set: 0.7885572139303483
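
The cell above sets `n_estimators=80` by hand, while the tuning loop selected 60. One way to avoid the mismatch is to carry the selected model forward and score it on the test set only once. The sketch below shows that pattern on synthetic data from `make_classification` (an assumption for illustration, not the Megaline data); the shortened candidate grid and the names `fitted`, `chosen_est`, `chosen_model`, and `report` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data (assumption: illustration only)
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.25, random_state=12345
)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=12345
)

# Pick n_estimators using the validation split only
candidates = [10, 20, 30]  # shortened grid to keep the sketch fast
fitted = {}
for est in candidates:
    forest = RandomForestClassifier(random_state=12345, n_estimators=est)
    forest.fit(X_train, y_train)
    fitted[est] = (forest, forest.score(X_valid, y_valid))

# max() returns the first candidate with the top validation score
chosen_est = max(candidates, key=lambda est: fitted[est][1])
chosen_model = fitted[chosen_est][0]

# Score the chosen model on all three splits; the test split is used only here
report = {
    "train": chosen_model.score(X_train, y_train),
    "valid": chosen_model.score(X_valid, y_valid),
    "test": chosen_model.score(X_test, y_test),
}
print(chosen_est, report)
```

A large gap between the `train` and `valid` entries, like the 1.0 versus roughly 0.80 seen in the output above, is a common sign that the forest is overfitting the training data.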


CONCLUSION:

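The random forest has the highest validation accuracy, about 81% with `n_estimators=60`, which passes the recommended 75% threshold. The forest evaluated on the test set (`n_estimators=80`) scores about 79%, also above the threshold, although its training accuracy of 1.0 suggests it overfits the training data.

The runner-up is the decision tree at about 80% validation accuracy, which also passes the 75% threshold.

Third is the logistic regression at about 77% validation accuracy, which is only slightly above the threshold.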
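
The comparison against the threshold can be spelled out as a short sketch. The accuracies are copied from the validation outputs printed earlier in this notebook; the names `validation_accuracy`, `THRESHOLD`, and `ranking` are hypothetical.

```python
# Validation accuracies copied from the outputs above (best setting per model)
validation_accuracy = {
    "Random Forest (n_estimators=60)": 0.8084577114427861,
    "Decision Tree (max_depth=3)": 0.7985074626865671,
    "Logistic Regression": 0.7661691542288557,
}
THRESHOLD = 0.75  # recommended minimum accuracy stated in the conclusion

# Rank the models from highest to lowest validation accuracy
ranking = sorted(validation_accuracy.items(), key=lambda item: item[1], reverse=True)
for name, acc in ranking:
    status = "passes" if acc >= THRESHOLD else "fails"
    print(f"{name}: {acc:.1%} ({status}, margin {acc - THRESHOLD:+.1%})")
```

These are validation figures; only the random forest with `n_estimators=80` was scored on the test set in this notebook.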
