## Project Description

The mobile company Megaline is unhappy to see that many of their customers are using legacy plans. They want to develop a model that can analyze customer behavior and recommend one of Megaline's new plans: Smart or Ultra.

You have access to the behavioral data of subscribers who have already switched to the new plans (from the Statistical Data Analysis sprint project). For this classification task you must create a model that chooses the correct plan. Since you have already done the data processing step, you can jump right into creating the model.

Develop a model that is as accurate as possible. In this project, the accuracy threshold is 0.75. Use the dataset to check the accuracy.

## Introduction

In this project I will train several kinds of machine learning models with the scikit-learn library and conclude which one is the best fit for Megaline's scenario.

In [2]:
# First we start importing our libraries
import pandas as pd

In [3]:
# Load the dataset into a dataframe
df = pd.read_csv('./datasets/users_behavior.csv')
df

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.90,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0
...,...,...,...,...,...
3209,122.0,910.98,20.0,35124.90,1
3210,25.0,190.36,0.0,3275.61,0
3211,97.0,634.44,70.0,13974.06,0
3212,64.0,462.32,90.0,31239.78,0


In [None]:
# Inspect the dataframe's column types and non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


In [4]:
# Establish the features and target for our models, using train_test_split to divide the data
from sklearn.model_selection import train_test_split

# Select all the columns except the one we want our model to conclude
features = df.drop(['is_ultra'], axis=1)
# Select only the column we want our model to conclude
target = df['is_ultra']
# Assign 25% of the data to validation and 75% to training
features_train, features_valid, target_train, target_valid = train_test_split(features, target, test_size=0.25, random_state=12345)
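
To make the 75/25 split concrete, the short sketch below works out how many rows land in each part. The helper `split_sizes` is hypothetical and only mirrors my understanding of the rounding for a fractional `test_size` (held-out part rounded up, training part gets the rest). It is an illustration, not scikit-learn's own code.

```python
import math


def split_sizes(n_samples: int, test_size: float) -> tuple[int, int]:
    """Return (n_train, n_valid) for a fractional test_size.

    Assumption: the held-out part is rounded up and the training part
    receives the remaining rows.
    """
    n_valid = math.ceil(test_size * n_samples)
    n_train = n_samples - n_valid
    return n_train, n_valid


# The dataframe above has 3214 rows
n_train, n_valid = split_sizes(3214, 0.25)
print(n_train, n_valid)  # 2410 804

# Accuracy is the share of correct validation predictions, so an
# accuracy of 0.75 on 804 rows corresponds to this many correct answers:
correct_at_075 = round(0.75 * n_valid)
print(correct_at_075)  # 603
```

Under this assumption each validation row is worth about 0.12 percentage points of accuracy, so small differences between models amount to only a handful of predictions.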

## Test Models

In [None]:
# All of the following models are scored with the accuracy metric
from sklearn.metrics import accuracy_score

### DecisionTreeClassifier
The resulting table shows that our DecisionTreeClassifier model reaches the 75% accuracy threshold at a depth of 1, and from a depth of 2 onward the accuracy fluctuates between roughly 76.6% and 79.0%.

In [11]:
# Import our model
from sklearn.tree import DecisionTreeClassifier

# Set values for our future dataframe
depths_dt = []
accuracy_dt = []

# Create 10 depth level results
for depth_dt in range(1,11):
    # Set hyperparameters to our model
    model = DecisionTreeClassifier(max_depth = depth_dt, random_state=12345)
    # Train our model with the correct data
    model.fit(features_train,target_train)
    # Predict on the validation data
    prediction = model.predict(features_valid)
    # Adding the values to our lists
    depths_dt.append(depth_dt)
    accuracy_dt.append(accuracy_score(target_valid, prediction))
# Creates the dataframe with all our stats
df_results_dt = pd.DataFrame({"Accuracy":accuracy_dt, "Depth":depths_dt})
df_results_dt


Unnamed: 0,Accuracy,Depth
0,0.75,1
1,0.783582,2
2,0.788557,3
3,0.781095,4
4,0.781095,5
5,0.766169,6
6,0.789801,7
7,0.788557,8
8,0.788557,9
9,0.787313,10
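
As a small sketch of how the best depth could be read off programmatically, the snippet below copies the accuracies from the table above into a plain list (the names `results_dt` and `THRESHOLD` are introduced here for illustration) and selects the depth with the highest validation accuracy.

```python
# (depth, validation accuracy) pairs copied from the decision tree table above
results_dt = [
    (1, 0.75), (2, 0.783582), (3, 0.788557), (4, 0.781095), (5, 0.781095),
    (6, 0.766169), (7, 0.789801), (8, 0.788557), (9, 0.788557), (10, 0.787313),
]

# max() returns the first maximum it meets, so a tie would keep the shallower tree
best_depth, best_accuracy = max(results_dt, key=lambda pair: pair[1])
print(best_depth, best_accuracy)  # 7 0.789801

# Depths whose accuracy meets the project threshold of 0.75
THRESHOLD = 0.75
passing_depths = [depth for depth, acc in results_dt if acc >= THRESHOLD]
print(passing_depths)
```

With the notebook's dataframe still in memory, `df_results_dt.loc[df_results_dt["Accuracy"].idxmax()]` should return the same row.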


### RandomForestClassifier
Increasing the number of trees in the forest (`n_estimators`) up to 20, we observe that the highest accuracy is 79.10%, reached with 11 trees. This is a very good result for our investigation because it exceeds our target accuracy of 75% by about 4 percentage points.

In [12]:
# Import our model
from sklearn.ensemble import RandomForestClassifier

# Set values for our future dataframe
depths_rf = []
accuracy_rf = []

# Evaluate forests with 1 to 20 trees
for est_rf in range(1,21):
    # Set hyperparameters to our model
    model = RandomForestClassifier(n_estimators=est_rf, random_state=12345)
    # Train our model with the correct data
    model.fit(features_train,target_train)
    # Predict on the validation data
    prediction = model.predict(features_valid)
    # Adding the values to our lists
    depths_rf.append(est_rf)
    accuracy_rf.append(accuracy_score(target_valid, prediction))
# Creates the dataframe with all our stats
df_results_rf = pd.DataFrame({"Accuracy":accuracy_rf, "Estimators":depths_rf})
df_results_rf

Unnamed: 0,Accuracy,Estimators
0,0.736318,1
1,0.773632,2
2,0.764925,3
3,0.78607,4
4,0.778607,5
5,0.78607,6
6,0.778607,7
7,0.783582,8
8,0.781095,9
9,0.789801,10


### LogisticRegression
The results below show that logistic regression reaches 75.37% accuracy on the validation data, just above our 0.75 threshold, so this model is also a candidate. It is the fastest of the models tried here, but it is less accurate than the tree-based models above.

In [13]:
# Import our model
from sklearn.linear_model import LogisticRegression

# Set values for our future dataframe
scores_lr = []

# Set hyperparameters to our model
model = LogisticRegression(solver='liblinear', random_state=12345)
# Train our model with the correct data
model.fit(features_train,target_train)
# Score both the training and the validation data
scores_lr.append(model.score(features_train, target_train))
scores_lr.append(model.score(features_valid, target_valid))
# Creates the dataframe with all our stats
df_results_lr = pd.DataFrame({"Type of Data":["Train","Valid"],"Score":scores_lr})
df_results_lr

Unnamed: 0,Type of Data,Score
0,Train,0.741494
1,Valid,0.753731
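
The features in this dataset live on very different scales (`mb_used` is in the tens of thousands while `messages` is below a hundred), and linear models can be sensitive to that. The sketch below shows one possible variation: putting a `StandardScaler` in front of the same `LogisticRegression`. It runs on synthetic data from `make_classification`, which is an assumption standing in for the real file, so the score it prints says nothing about the Megaline data and whether scaling helps there is untested.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four behavior features (NOT the real dataset)
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=3, n_redundant=0,
    random_state=12345,
)
# Blow up one column to loosely mimic the large scale of mb_used
X[:, 3] *= 10_000

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345
)

# Standardize each feature, then fit the same kind of model as above
scaled_lr = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='liblinear', random_state=12345),
)
scaled_lr.fit(X_train, y_train)

valid_accuracy = scaled_lr.score(X_valid, y_valid)
print(f"Validation accuracy with scaling: {valid_accuracy:.4f}")
```

To try this on the notebook's data, the pipeline could be fitted on `features_train` and `target_train` and scored on `features_valid` and `target_valid` in the same way.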


## Conclusion

We prepared our dataframe and tested several types of models to discover which one is the best fit for Megaline's interests. At this stage, "Random Forest Classifier" is the best one, since it reaches a higher accuracy than the other models. This comes from its nature: it builds many trees and combines their predictions, which brings it closer to the correct answer at the cost of the extra processing needed to create each tree.
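
The intuition that many trees voting together can beat a single tree can be illustrated with a small calculation. The sketch below assumes an idealized case: every voter is correct with the same probability `p` and the voters are fully independent. Real trees in a forest are correlated, so this is only an optimistic illustration and not a prediction for this dataset. The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(p: float, n_voters: int) -> float:
    """Probability that a strict majority of n_voters independent voters,
    each correct with probability p, gives the correct answer."""
    if n_voters % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    needed = n_voters // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_voters, k) * p**k * (1 - p) ** (n_voters - k)
        for k in range(needed, n_voters + 1)
    )


# Illustrative per-voter accuracy of 0.75 with 1, 3 and 11 voters
for n in (1, 3, 11):
    print(n, round(majority_vote_accuracy(0.75, n), 4))
```

In the results above, the gain from adding trees is much smaller than this idealized model suggests, which is consistent with the trees making many of the same mistakes.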