# Introduction

### Objective
Megaline wants a model that analyzes phone plan subscribers' behavior and recommends one of its two newer plans: Smart or Ultra.

### Goal
Develop a classification model that picks the appropriate newer plan (Smart or Ultra) for a subscriber with the highest possible accuracy.

### Initial Question
Which classification model will exceed 75% accuracy?

In [1]:
# Load libraries here
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

In [2]:
# Load dataset here
df = pd.read_csv('/datasets/users_behavior.csv')

Although the exploratory data analysis of Megaline's customers' behavior was already done in depth, it is still worth checking and cleaning the data so that the models work with clean, sensible data.

# Data Wrangling

In [3]:
df.head(5)

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [5]:
df.isna().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

I also converted the **messages** and **calls** columns from floats to integers, since both hold counts.

In [6]:
df['calls'] = df['calls'].astype('int')
df['messages'] = df['messages'].astype('int')
df.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40,311.9,83,19915.42,0
1,85,516.75,56,22696.96,0
2,77,467.66,86,21060.45,0
3,106,745.53,81,8437.39,1
4,66,418.74,1,14502.75,0


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   int64  
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   int64  
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(2), int64(3)
memory usage: 125.7 KB
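Before casting floats to integers, it can be worth confirming that no value has a fractional part, because `astype('int')` truncates silently. The following is a minimal sketch on a small made-up sample (the `sample` frame and `count_columns` list are illustrative names, not part of the original notebook):

```python
import pandas as pd

# Hypothetical sample that mirrors the first rows shown above.
sample = pd.DataFrame({
    "calls": [40.0, 85.0, 77.0],
    "messages": [83.0, 56.0, 86.0],
})

count_columns = ["calls", "messages"]

# For each column: True only if every value is a whole number.
is_whole = (sample[count_columns] % 1 == 0).all()

# Cast only when the check passes, so nothing is truncated silently.
if is_whole.all():
    sample = sample.astype({col: "int64" for col in count_columns})

print(sample.dtypes)
```

If the check fails, rounding or keeping the float type would be a deliberate choice to make rather than a side effect of the cast.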


To get a quick numerical view of how many subscribers are on each plan, I counted the values in the *is_ultra* column.

In [8]:
#new plan types are sorted

#Ultra - 1, Smart - 0
df['is_ultra'].value_counts() 

0    2229
1     985
Name: is_ultra, dtype: int64

# Machine Learning

Finally, we are on to the meat and potatoes: machine learning. At this stage we will split the data, train and validate the candidate models, and lastly test the top-ranking model, the one that made the most accurate predictions on the validation set.

## Splitting the Dataset

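Since we are working with a single data source, we will split it into three parts:
1. (A) Training set = 60% | (Temporary set) Evaluation set = 40% (later divided into the validation and testing sets)
2. (B) Validation set = 40% of the evaluation set (about 16% of the data)
3. (C) Testing set = 60% of the evaluation set (about 24% of the data)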

For all three sets, the target is the *is_ultra* column and the features are the remaining four columns. Because *is_ultra* already holds the answers we want predicted, keeping it out of the features lets us check fairly whether the model is learning to predict sensible results.

In [9]:
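# Training set = 60% | Evaluation set = 40% (validation and testing sets combined)
df_train, df_temp = train_test_split(df, test_size=0.40, random_state=451)
# Evaluation set split again: first part (60% of it) -> testing set, second part (40% of it) -> validation set
df_test, df_valid = train_test_split(df_temp, test_size=0.40, random_state=451)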

#training set
train_features = df_train.drop(['is_ultra'], axis =1)
train_target = df_train['is_ultra']

#validation set
valid_features = df_valid.drop(['is_ultra'], axis =1)
valid_target = df_valid['is_ultra']

#testing set
test_features = df_test.drop(['is_ultra'], axis =1)
test_target = df_test['is_ultra']

Calling *shape* on each variable confirms that the data was split and stored in the intended variables.

In [10]:
train_features.shape, train_target.shape, valid_features.shape, valid_target.shape, test_features.shape, test_target.shape

((1928, 4), (1928,), (515, 4), (515,), (771, 4), (771,))
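The shapes above work out to roughly 60% training, 16% validation and 24% testing, because the second call takes 40% of the 40% evaluation set. If an even 60/20/20 split is preferred, one option is to halve the evaluation set instead. The sketch below uses a synthetic stand-in frame (`demo`, with made-up values) rather than the real data:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (values are made up for illustration).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "calls": rng.integers(0, 200, size=1000),
    "minutes": rng.uniform(0, 1500, size=1000),
    "messages": rng.integers(0, 150, size=1000),
    "mb_used": rng.uniform(0, 40000, size=1000),
    "is_ultra": rng.integers(0, 2, size=1000),
})

# Step 1: 60% training, 40% held out.
demo_train, demo_temp = train_test_split(demo, test_size=0.40, random_state=451)

# Step 2: split the held-out 40% in half -> 20% validation, 20% testing.
demo_valid, demo_test = train_test_split(demo_temp, test_size=0.50, random_state=451)

print(len(demo_train), len(demo_valid), len(demo_test))  # 600 200 200
```

Passing `stratify=` with the target column to `train_test_split` is another option worth considering here, since the two plans are not equally represented.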

<div class="alert alert-block alert-warning">📝
    

__Reviewer's comment №1__

1. It is good here, random_state is fixed. This ensures reproducibility of the split into training / validation / test samples, so the subsamples will be identical in all subsequent runs of the code.
    
2. Fraction of train/valid/test sizes 3:1:1 is good.


</div>

## Model Training & Validating

Now that the dataset has been split, we can train each model on the training set and then check it against the validation set to confirm it behaves the way we anticipate. We will do this with three different models to compare how they perform.

With default parameters apart from a fixed **random_state**, the best model was the **RandomForestClassifier**. After 15-20 parameter adjustments per model, *RandomForestClassifier* still consistently had the highest accuracy.
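The manual tinkering described above can also be written as a small loop that tries several parameter combinations and keeps the one with the best validation accuracy. This is only a sketch on synthetic data from `make_classification`; the parameter grids and the names `best_score` and `best_params` are assumptions for illustration, not the values actually tried in this notebook:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic four-feature binary problem standing in for the real data.
X, y = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

best_score, best_params = 0.0, None

# Try every combination and remember the best validation accuracy.
for n_estimators in (4, 10, 20):
    for max_depth in (2, 4, 6):
        candidate = RandomForestClassifier(
            random_state=724, n_estimators=n_estimators, max_depth=max_depth
        )
        candidate.fit(X_train, y_train)
        score = accuracy_score(y_valid, candidate.predict(X_valid))
        if score > best_score:
            best_score = score
            best_params = {"n_estimators": n_estimators, "max_depth": max_depth}

print(best_params, best_score)
```

Only the validation set is used to pick the parameters, which keeps the test set untouched for the final check.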

### DecisionTreeClassifier Model

In [11]:
#DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=724, max_depth=2, max_features=4)
model.fit(train_features, train_target)

valid_predictions = model.predict(valid_features)
score = accuracy_score(valid_target, valid_predictions)
print(score)

0.7805825242718447


### RandomForestClassifier Model

In [12]:
#RandomForestClassifier == Winner

model = RandomForestClassifier(random_state=724, n_estimators=4, max_depth=2, max_features=4)
model.fit(train_features, train_target)

valid_predictions = model.predict(valid_features)
score = accuracy_score(valid_target, valid_predictions)
print(score)

0.7961165048543689


### LogisticRegression Model

In [13]:
#logisticRegression

model = LogisticRegression(random_state=724, solver='liblinear')
model.fit(train_features, train_target)

valid_predictions = model.predict(valid_features)
score = accuracy_score(valid_target, valid_predictions)
print(score)

0.7339805825242719


## Model Testing

Comparing the models' validation results, the best performer that exceeded the 75% threshold was the **RandomForestClassifier** with **79.61%** accuracy, almost an 80% accurate prediction rate.

In [14]:
model = RandomForestClassifier(random_state=724, n_estimators=4, max_depth=2, max_features=4)
model.fit(train_features, train_target)

test_predictions = model.predict(test_features)
score = accuracy_score(test_target, test_predictions)
print(score)

0.7808041504539559


# Sanity Check

Model tinkering is a constant iterative process, but midway through validation I realized I hadn't clearly checked the source data to see which plan had the most subscribers. Smart has 2229/3214 customers and Ultra has 985/3214 customers. The Smart plan outshines Ultra, with close to 70% of users choosing it. This means a model that always predicts Smart would already be right about 69% of the time, so that is the baseline our model has to beat, and it does on both the validation and test sets.

In [15]:
#Ultra - 1, Smart - 0
df['is_ultra'].value_counts() 

0    2229
1     985
Name: is_ultra, dtype: int64

In [16]:
smart = 2229/3214*100
ultra = 985/3214*100

print(f"Smart has about {smart:.2f}% of the phone plan subscribers combined.")
print(f"Ultra has about {ultra:.2f}% of the phone plan subscribers combined.")

Smart has about 69.35% of the phone plan subscribers combined.
Ultra has about 30.65% of the phone plan subscribers combined.
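The class proportions above can be turned into an explicit baseline by scoring a model that always predicts the most frequent plan. The sketch below rebuilds the target from the counts shown above and uses scikit-learn's `DummyClassifier`; it is scored on the whole dataset rather than on the test split, which is an assumption made to keep the example self-contained:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Target rebuilt from the counts above: 2229 Smart (0) and 985 Ultra (1).
target = np.array([0] * 2229 + [1] * 985)

# The dummy model ignores the features, so a placeholder column is enough.
features = np.zeros((len(target), 1))

# Always predict the most frequent class seen during fit (Smart).
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(features, target)

baseline_accuracy = accuracy_score(target, baseline.predict(features))
print(f"{baseline_accuracy:.4f}")  # 0.6935
```

A useful model should score clearly above this figure; the 0.7808 test accuracy reported earlier does.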


# Conclusion

Megaline wants a classification model that analyzes its customers' behavior and recommends a new phone plan (Smart or Ultra) with at least 75% accuracy. We compared three model types: DecisionTreeClassifier, RandomForestClassifier and LogisticRegression. RandomForestClassifier consistently came out ahead and exceeded the threshold, with 79.61% accuracy on the validation set and 78.08% on the test set.