# Project Title: Model of User Behavior For Megaline 

# Introduction: 
Mobile carrier Megaline has found that many subscribers remain on a legacy plan. Megaline therefore wants a model that analyzes subscribers' behavior and recommends one of its newer plans: Smart or Ultra. The data examined in this project comes from subscribers who have already switched to one of the new plans. The aim of this classification task is to develop a model that chooses the right plan for a subscriber.

The model must reach an accuracy of at least 0.75. Three models will be tested against this threshold: decision tree classification, random forest classification, and logistic regression. To ensure that the appropriate model is chosen, the data will be split once into a training set, a validation set, and a test set, so that the same sets are used throughout the project. The hyperparameters will then be tuned to select the most effective model. Finally, a sanity check will be performed to ensure that the chosen model performs better than random chance.



# Examining General Information of Data file:

In [1]:
#Importing all libraries 
import pandas as pd 

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier 
from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import train_test_split 
from sklearn.metrics import accuracy_score

from joblib import dump 

In [2]:
# Load Dataset 
df = pd.read_csv('/datasets/users_behavior.csv')

In [3]:
#Basic exploration of the dataset 
df.info()#looking at general contents of table 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


Description of the data based on documentation:

- calls: number of calls
- minutes: total call duration in minutes
- messages: number of text messages
- mb_used: internet traffic used in MB
- is_ultra: plan for the current month (Ultra = 1, Smart = 0)

In [4]:
df.isnull().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

In [5]:
df.duplicated().sum()

0

In [6]:
df.describe()# Closer look at the statistical information shown by the data 

       calls        minutes      messages     mb_used        is_ultra
count  3214.0       3214.0       3214.0       3214.0         3214.0
mean   63.038892    438.208787   38.281269    17207.673836   0.306472
std    33.236368    234.569872   36.148326    7570.968246    0.4611
min    0.0          0.0          0.0          0.0            0.0
25%    40.0         274.575      9.0          12491.9025     0.0
50%    62.0         430.6        30.0         16943.235      0.0
75%    82.0         571.9275     57.0         21424.7        1.0
max    244.0        1632.06      224.0        49745.73       1.0


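There are no missing values or duplicates in the dataset, and all columns have the correct data types.
Zero values need to remain: in is_ultra, 0 indicates the Smart plan, and zeros in the usage columns are plausibly genuine (for example, a subscriber who made no calls), so they are not treated as missing data.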

In [7]:
# Determining number of users for each plan 
ultra_plan_users = df.is_ultra.sum()
print("Number of Ultra plan users:", ultra_plan_users)
print("Percentage of users with Ultra plan:", ultra_plan_users/len(df))


Number of Ultra plan users: 985
Percentage of users with Ultra plan: 0.30647168637212197


Out of 3214 total users, only about 31 percent are on the Ultra plan, while nearly 69 percent are on the Smart plan. That is more than twice as many Smart users as Ultra users (2229 vs. 985, a ratio of about 2.3).
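The class shares quoted above can also be read directly from the target column with `value_counts(normalize=True)`. The following is a minimal sketch on a toy Series; the values and the names `demo_target`, `class_shares`, and `smart_to_ultra` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy target column (assumption): 7 Smart users (0) and 3 Ultra users (1).
demo_target = pd.Series([0] * 7 + [1] * 3, name="is_ultra")

# Share of each class, indexed by the class label.
class_shares = demo_target.value_counts(normalize=True)

# How many Smart users there are per Ultra user.
smart_to_ultra = class_shares.loc[0] / class_shares.loc[1]

print(class_shares)
print("Smart users per Ultra user:", round(smart_to_ultra, 2))
```

Applied to the real `df['is_ultra']` column, the same two calls would report the class balance without computing the percentages by hand.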

# Model Testing 

In [8]:
# Separate the feature columns from the target column
features = df.drop(['is_ultra'], axis=1)
target = df['is_ultra']

In [9]:
# Split source data into training, validation and test sets:
# first hold out 20% as the test set, then take 25% of the rest for validation
x_train, features_test, y_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=123)
features_train, features_valid, target_train, target_valid = train_test_split(
    x_train, y_train, test_size=0.25, random_state=123)

Description: 60% of the data is used for training and 20% each for the validation and test sets. First the features and target are split 80/20, which holds out 20% of the data as the test set. The remaining 80% is then divided into the final training set and the validation set. Taking 25% of that remainder for validation gives 0.25 × 0.80 = 0.20 of the full dataset, which produces the 60/20/20 ratio.
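The arithmetic behind the two-step split can be checked on a small synthetic frame. This is only a sketch: the 1,000-row `demo` DataFrame and the variable names below are assumptions for illustration, not the project data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (assumption: 1,000 rows).
rng = np.random.default_rng(123)
demo = pd.DataFrame({
    "calls": rng.integers(0, 200, size=1000),
    "is_ultra": rng.integers(0, 2, size=1000),
})
demo_features = demo.drop(columns="is_ultra")
demo_target = demo["is_ultra"]

# Step 1: hold out 20% of all rows as the test set.
rest_x, test_x, rest_y, test_y = train_test_split(
    demo_features, demo_target, test_size=0.2, random_state=123)

# Step 2: 25% of the remaining 80% equals 20% of the full data.
train_x, valid_x, train_y, valid_y = train_test_split(
    rest_x, rest_y, test_size=0.25, random_state=123)

for name, part in [("train", train_x), ("valid", valid_x), ("test", test_x)]:
    print(name, len(part), len(part) / len(demo))
```

Because the classes are imbalanced, passing `stratify=` to `train_test_split` would be one way to keep the class balance similar in every subset; the notebook itself does not do this.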

# Decision tree 

Description: Decision trees are versatile models that need fewer data preparation steps than many other models. However, they are prone to overfitting. The hyperparameter tuned here is the maximum tree depth, so a short loop tries several depths and keeps the model with the highest validation accuracy.

In [10]:
# Track the best result found so far
top_depth = 0
top_score = 0
model_tree = None

# Try maximum depths from 1 to 10
for depth in range(1, 11):
    model = DecisionTreeClassifier(max_depth=depth, random_state=123)
    model.fit(features_train, target_train)  # fit on the training set
    score = model.score(features_valid, target_valid)  # accuracy on the validation set

    # Keep the model with the highest validation accuracy
    if score > top_score:
        top_score = score
        top_depth = depth
        model_tree = model

# Display results
print(f'Best accuracy:{top_score} reached using max depth {top_depth}.')


Best accuracy:0.7947122861586314 reached using max depth 9.


Conclusion: The decision tree achieves an accuracy of 79% when using the validation set. This is above our threshold of 75%. 

# Random forest 

Description: Random forest decreases the likelihood of overfitting by building multiple decision trees on various subsets of the data and features. However, it can be difficult to interpret how a specific prediction was reached. The hyperparameters tuned here are the number of estimators and the maximum tree depth, so a nested loop searches over both to find the best model.

In [11]:
# Track the best result found so far
top_score = 0
best_est = 0
best_depth = 0
model_forest = None

# Try 10 to 80 trees (in steps of 10) and maximum depths from 1 to 10
for est in range(10, 81, 10):
    for depth in range(1, 11):
        model = RandomForestClassifier(
            max_features=1.0,
            n_estimators=est, max_depth=depth, random_state=123)
        model.fit(features_train, target_train)
        score = model.score(features_valid, target_valid)

        # Keep the model with the highest validation accuracy
        if score > top_score:
            top_score = score
            best_est = est
            best_depth = depth
            model_forest = model

# Display results
print(f"Best accuracy:{top_score} achieved using {best_est} trees with max_depth {best_depth}.")

Best accuracy:0.7978227060653188 achieved using 40 trees with max_depth 6.


Conclusion: Accuracy for the Random forest model is only slightly better than the decision tree. It is also above the threshold of 75%. 

# Logistic regression 

Description: Logistic regression offers easy-to-interpret results and is quick to train. However, it can overfit when there are many features relative to the number of observations. No hyperparameters are tuned for this model here; the scikit-learn defaults are used, although the regularization strength `C` could in principle be tuned on the validation set. The data could be resplit from the original dataset into a 75/25 ratio; however, I already have an 80/20 training/testing split, so there is no reason to split it differently.
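For reference, scikit-learn's `LogisticRegression` exposes an inverse regularization strength `C` that could be compared on a validation set in the same way as tree depth. The sketch below is not part of the original analysis: it runs on synthetic data from `make_classification`, and the candidate values of `C` are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (assumption: 500 rows, 4 features).
X, y = make_classification(
    n_samples=500, n_features=4, n_informative=3, n_redundant=1,
    random_state=123)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=123)

best_c, best_score = None, 0.0
for c in [0.01, 0.1, 1.0, 10.0]:  # smaller C means stronger regularization
    candidate = LogisticRegression(C=c, solver="liblinear", random_state=123)
    candidate.fit(X_train, y_train)
    score = candidate.score(X_valid, y_valid)  # validation accuracy
    if score > best_score:
        best_c, best_score = c, score

print(f"Best validation accuracy: {best_score} with C={best_c}")
```

Whether tuning `C` would lift this model above the 0.75 threshold on the Megaline data is not something this sketch can show.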

In [12]:
model=LogisticRegression(solver='liblinear',random_state=123)
model.fit(features_train, target_train)

score_train = model.score(features_train,target_train)
print(f'Training accuracy:{score_train}')

score_valid = model.score(features_valid, target_valid)
print(f'Validation accuracy: {score_valid}')

Training accuracy:0.7142116182572614
Validation accuracy: 0.702954898911353


Conclusion: The accuracy of the logistic regression model is below the threshold of 75%. Training and validation accuracy are both low and close to each other (about 71% and 70%), which suggests underfitting rather than overfitting. This model has the lowest accuracy of the three models.

# Checking Model Quality Using Test Set 

In [13]:
model=RandomForestClassifier(random_state=123, max_depth=6)
model.fit(features_train,target_train)
predictions_test = model.predict(features_test)
result=accuracy_score(target_test,predictions_test)
print('The accuracy of the test is:',result)

The accuracy of the test is: 0.8118195956454122


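Conclusion: The test accuracy is approximately 81%, which exceeds the validation accuracy. Note that the model tested here uses max_depth=6 with scikit-learn's default number of trees and default max_features, rather than the 40 trees and max_features=1.0 selected in the search above. The random forest model is slower to train than the other models. On a larger dataset that could become an issue, but for this small dataset it worked fine.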

A quick sanity check verifies that the model performs better than a naive baseline. The two plans are split approximately 70/30 (Smart/Ultra), so always guessing Smart would by itself yield roughly 70% accuracy. The model's test accuracy of about 81% is clearly above that baseline, so its performance is not explained by the class imbalance alone.
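The baseline described in this sanity check can be made explicit with scikit-learn's `DummyClassifier`, which always predicts the most frequent class when `strategy="most_frequent"`. This is a minimal sketch on made-up labels with a 70/30 balance; the arrays and names below are assumptions, not the project's test set.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels with a 70/30 class balance (assumption).
demo_target = np.array([0] * 70 + [1] * 30)
demo_features = np.zeros((100, 1))  # this baseline ignores the features

# Always predict the majority class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(demo_features, demo_target)

baseline_accuracy = accuracy_score(
    demo_target, baseline.predict(demo_features))
print("Baseline accuracy:", baseline_accuracy)
```

In the notebook, fitting the same baseline on `features_train`/`target_train` and scoring it on `features_test`/`target_test` would give the number that the random forest has to beat.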

# Conclusion

The analysis showed that the random forest classifier is the best model for choosing the right plan for a subscriber. A decision tree classifier and a logistic regression classifier were also examined. The data was split 60/20/20 (training/validation/testing). The random forest classifier performed best in terms of accuracy and met the acceptable threshold of 75%. On the test set it achieved approximately 81%. Although random forest classifiers are the slowest to train, this is a small dataset, so the model is still a practical choice for selecting the right plan for subscribers.