# Introduction

Megaline is a mobile provider that is looking to analyze subscribers' behavior and recommend the appropriate plan: Smart or Ultra. In this project I will:

1. Import and prepare the data for users' behavior
2. Split the data into training, validation, and test sets
3. Test various machine learning models and hyperparameters to find the most accurate prediction method
4. Assess the quality of our chosen model
5. Give a general conclusion of our findings

## Import and prepare dataset

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('/datasets/users_behavior.csv')

In [3]:
df.head(5)

   calls  minutes  messages   mb_used  is_ultra
0   40.0   311.90      83.0  19915.42         0
1   85.0   516.75      56.0  22696.96         0
2   77.0   467.66      86.0  21060.45         0
3  106.0   745.53      81.0   8437.39         1
4   66.0   418.74       1.0  14502.75         0


In [4]:
df.duplicated().sum()

0

In [5]:
df.isnull().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


The dataset has no duplicate rows and no missing values, and the data types need no changes.

## Splitting dataset for Training, Validation, and Test

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
#First the dataset is split into 2 datasets, one for test data and the other to split again for training and validation
df_temp, df_test = train_test_split(df, test_size=0.20, random_state=12345)

In [9]:
print(df.shape)
print(df_temp.shape)
print(df_test.shape)

(3214, 5)
(2571, 5)
(643, 5)


In [10]:
#Next, we split the temp data set into 2 sets for training and validation
df_train, df_valid = train_test_split(df_temp, test_size=0.25, random_state=12345)

In [11]:
#Verify the Datasets
print('Training dataset: ', df_train.shape)
print('Testing dataset: ', df_test.shape)
print('Validation dataset: ', df_valid.shape)

Training dataset:  (1928, 5)
Testing dataset:  (643, 5)
Validation dataset:  (643, 5)


In [12]:
#Create the feature and target datasets

#Training datasets:
features_train = df_train.drop('is_ultra', axis=1)
target_train = df_train['is_ultra']

#Validation datasets:
features_valid= df_valid.drop(['is_ultra'], axis=1)
target_valid= df_valid['is_ultra']

#Testing datasets:
features_test = df_test.drop(['is_ultra'], axis=1)
target_test = df_test['is_ultra']

The dataset has now been split into training, validation, and test sets, each with separate features and target.
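
The shapes printed above work out to roughly 60% training, 20% validation, and 20% test, because 25% of the remaining 80% is another 20% of the total. A minimal sketch of that arithmetic follows. `split_sizes` is a hypothetical helper written for illustration, not part of scikit-learn, and it assumes the held-out portion is rounded up to a whole row, which reproduces the shapes shown in this notebook.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (kept, held_out) row counts for a fractional test_size.

    Assumption: the held-out count is rounded up, which reproduces
    the shapes printed in this notebook.
    """
    held_out = math.ceil(n_rows * test_size)
    return n_rows - held_out, held_out


n_total = 3214
n_temp, n_test = split_sizes(n_total, 0.20)    # first split: hold out the test set
n_train, n_valid = split_sizes(n_temp, 0.25)   # second split: hold out the validation set

print(n_train, n_valid, n_test)
# Share of the full dataset in each part
print([round(n / n_total, 2) for n in (n_train, n_valid, n_test)])
```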

## Investigation into different models for prediction

### Decision Tree

Decision trees generally have relatively low accuracy but high processing speed. We will fit decision tree models at several depths and check their accuracy on the validation set.

In [13]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

In [14]:
# Try tree depths 1 through 5 and report validation accuracy for each
for depth in range(1, 6):
    dt_model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    dt_model.fit(features_train, target_train)
    valid_predictions = dt_model.predict(features_valid)
    print('max_depth =', depth, ': ', end='')
    print(accuracy_score(target_valid, valid_predictions))

max_depth = 1 : 0.7387247278382582
max_depth = 2 : 0.7573872472783826
max_depth = 3 : 0.7651632970451011
max_depth = 4 : 0.7636080870917574
max_depth = 5 : 0.7589424572317263


The decision tree is most accurate on the validation set at a depth of 3, with an accuracy of about 77%. The model may be underfitting, so we will test other models to try to improve on this accuracy.
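
The best depth can also be picked programmatically rather than read off the printout. The sketch below does this with the validation accuracies copied from the output above; the names `valid_scores`, `best_depth`, and `best_score` are illustrative and do not appear in the notebook.

```python
# Validation accuracies copied from the output above, keyed by max_depth
valid_scores = {
    1: 0.7387247278382582,
    2: 0.7573872472783826,
    3: 0.7651632970451011,
    4: 0.7636080870917574,
    5: 0.7589424572317263,
}

# Pick the depth whose validation accuracy is highest
best_depth = max(valid_scores, key=valid_scores.get)
best_score = valid_scores[best_depth]

print(best_depth, round(best_score, 4))
```

Note that after the loop finishes, `dt_model` still refers to the tree from the final iteration (`max_depth=5`), so a tree with the selected depth would have to be stored during the loop or refit afterwards.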

### Random Forest

In [15]:
from sklearn.ensemble import RandomForestClassifier

In [16]:
best_score = 0
best_est = 0
for est in range(1,21):
    rf_model = RandomForestClassifier(random_state=12345, n_estimators=est)
    rf_model.fit(features_train, target_train)
    score = rf_model.score(features_valid, target_valid)
    if score > best_score:
        best_score = score
        best_est= est

print('Accuracy of best model on validation set (n_estimators = {}): {}'.format(best_est, best_score))

Accuracy of best model on validation set (n_estimators = 20): 0.7900466562986003


With the Random Forest model, validation accuracy improves to about 79%, with the best number of estimators being 20.

### Logistic Regression

In [17]:
from sklearn.linear_model import LogisticRegression

In [18]:
lr_model = LogisticRegression(random_state=12345, solver='liblinear')
lr_model.fit(features_train, target_train)
score_train=lr_model.score(features_train, target_train)
score_valid=lr_model.score(features_valid, target_valid)

print(
    "Accuracy of the logistic regression model on the validation set:",
    score_valid,
)

Accuracy of the logistic regression model on the validation set: 0.7293934681181959


Validation accuracy for the Logistic Regression model is about 73%, making it the least accurate of our three ML models.

## Check Quality of Model with Test Set

We will now use the test data to see whether the models perform as expected. Based on the validation results, we expect the Random Forest to be the most accurate.

### Decision Tree

In [19]:
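# Note: dt_model is the last tree fitted in the loop above (max_depth=5),
# not the max_depth=3 tree that scored best on the validation set
dt_predictions = dt_model.predict(features_test)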

In [20]:
accuracy = accuracy_score(target_test, dt_predictions)
print("Decision Tree Model Accuracy:", accuracy)

Decision Tree Model Accuracy: 0.7884914463452566


### Random Forest

In [21]:
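# rf_model is the last forest fitted in the loop (n_estimators=20),
# which is also the one that scored best on the validation set
rf_predictions = rf_model.predict(features_test)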
rf_accuracy = accuracy_score(target_test, rf_predictions)
print("Random Forest Model Accuracy:", rf_accuracy)

Random Forest Model Accuracy: 0.7791601866251944


### Logistic Regression

In [22]:
lr_predictions = lr_model.predict(features_test)
lr_accuracy = accuracy_score(target_test, lr_predictions)
print("Logistic Regression Model Accuracy:", lr_accuracy)

Logistic Regression Model Accuracy: 0.7511664074650077


Our expectations were incorrect. On the test set, the Decision Tree model yielded the most accurate predictions instead of the Random Forest model.

## Sanity Test

In [23]:
from sklearn.metrics import classification_report

### Decision Tree

In [24]:
report = classification_report(target_test, dt_predictions, output_dict=True)

report_df = pd.DataFrame(report).transpose()

print(report_df)

              precision    recall  f1-score     support
0              0.787431  0.953020  0.862348  447.000000
1              0.794118  0.413265  0.543624  196.000000
accuracy       0.788491  0.788491  0.788491    0.788491
macro avg      0.790774  0.683143  0.702986  643.000000
weighted avg   0.789469  0.788491  0.765194  643.000000
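
The report shows that the tree finds only about 41% of Ultra users (recall for class 1). As a rough illustration of how these figures relate to each other, the sketch below recovers the underlying confusion-matrix counts from the precision, recall, and support values printed above. It assumes the printed values are rounded versions of exact ratios, so each count is rounded to the nearest integer; the variable names are illustrative.

```python
# Figures taken from the decision tree report above
support_0, support_1 = 447, 196
precision_1 = 0.794118
recall_1 = 0.413265

# recall_1 = TP / (TP + FN), and TP + FN is the class 1 support
tp = round(recall_1 * support_1)
fn = support_1 - tp

# precision_1 = TP / (TP + FP), so TP + FP = TP / precision_1
fp = round(tp / precision_1) - tp
tn = support_0 - fp

# Accuracy is the share of all test rows that were classified correctly
accuracy = (tp + tn) / (support_0 + support_1)
print(tp, fn, fp, tn, round(accuracy, 6))
```

The recovered accuracy agrees with the value in the `accuracy` row of the report, which is a useful consistency check on the counts.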


### Random Forest

In [25]:
report = classification_report(target_test, rf_predictions, output_dict=True)

report_df = pd.DataFrame(report).transpose()

print(report_df)

              precision    recall  f1-score    support
0              0.804391  0.901566  0.850211  447.00000
1              0.690141  0.500000  0.579882  196.00000
accuracy       0.779160  0.779160  0.779160    0.77916
macro avg      0.747266  0.700783  0.715046  643.00000
weighted avg   0.769565  0.779160  0.767809  643.00000


### Logistic Regression

In [26]:
report = classification_report(target_test, lr_predictions, output_dict=True)

report_df = pd.DataFrame(report).transpose()

print(report_df)

              precision    recall  f1-score     support
0              0.738769  0.993289  0.847328  447.000000
1              0.928571  0.198980  0.327731  196.000000
accuracy       0.751166  0.751166  0.751166    0.751166
macro avg      0.833670  0.596134  0.587530  643.000000
weighted avg   0.796625  0.751166  0.688944  643.000000
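
A common sanity check is to compare each model against a constant model that always predicts the most frequent class. The sketch below works this out from the `support` column and the test accuracies reported above; it is an illustration using those printed numbers, not a cell from the original notebook. In scikit-learn, `DummyClassifier(strategy="most_frequent")` would give the same kind of baseline.

```python
# Class counts in the test set, from the "support" column above
support = {0: 447, 1: 196}

# A constant model that always predicts the most common class
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / sum(support.values())

# Test accuracies reported earlier in this notebook
test_accuracy = {
    "decision_tree": 0.7884914463452566,
    "random_forest": 0.7791601866251944,
    "logistic_regression": 0.7511664074650077,
}

print("baseline:", majority_class, round(baseline_accuracy, 4))
for name, acc in test_accuracy.items():
    # True means the model beats the constant baseline
    print(name, acc > baseline_accuracy)
```

All three models score above this baseline, although the low class 1 recall of Logistic Regression shows that it does so while labeling almost every user as Smart.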


## Conclusion

In this report we split the dataset into training, validation, and test sets at a 3:1:1 ratio and used them to build three ML models for predicting subscriber behavior.

We created and trained three models (Decision Tree, Random Forest, and Logistic Regression) to determine whether subscribers are likely to choose the Ultra plan from Megaline.

Although the Random Forest model had the highest accuracy on the validation set, on the test set the Decision Tree model had the highest accuracy at 79%, followed by the Random Forest at 78% and Logistic Regression at 75%.

For this reason, the Decision Tree model appears to be the most useful model for Megaline to use as a tool to measure the likelihood of subscribers choosing the Ultra plan.