# **Customer Behavior Analysis in Megaline Mobile Provider**

# Project Goals

- Megaline Mobile Provider is dissatisfied because many of their customers are still using old packages. The company wants to develop a model that can analyze customer behavior and recommend one of the two new Megaline packages: Smart or Ultra.

- We have access to data on the behavior of customers who have switched to the latest package (from the Statistical Data Analysis course project). In this classification task, we need to develop a model that able to choose the relevant package. Now, we have completed the data pre-processing steps, we can proceed directly to the modeling stage.

- Develop a model that has the highest possible accuracy. In this project, the threshold for the accuracy level is 0.75. Check the model accuracy metrics using a test dataset.


- perform a train test split as usual
  
  - train datasets
  - valid datasets
  - test datasets
  
- perform a cross validation

  - train_parent
  
    - train_child
    - valid_child
    
  - for test datasets do the following
  - max_depth = 0.01, 0.1, 0.5, 1,5, 10, 50
  - probability in binary classification (0 or 1)
  
    - 0 until 1
    - p = 0.6 (cutoff 0.5) -> 1
    - p=0.1 -> 0 for accuracy
    
  - 0 - 1
  - 0% - 100%
  - 70% is it good or not ? (?)
  - 1000
    
    - 700 target score 1
    - 300 target score 0
    
  - prediction 1 score for all
    
    - baseline model -> Dummyclassifier
    - 700/1000 -> 70%
    
  - at least has minimum 80%
  
    - 1000
    - 500 target for 1 score
    - 500 target for 0 score, prediction 1 score at 50%
    

# Initialization


In [6]:
import pandas as pd
import numpy as np

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score

#for importing files to google collab
from google.colab import files

In [7]:
#upload datasets
#uploaded = files.upload()

# **1. Open and check the datasets**

In [8]:
df_user = pd.read_csv('/content/users_behavior.csv')

In [9]:
df_user.head()

Unnamed: 0,calls,minutes,messages,mb_used,is_ultra
0,40.0,311.9,83.0,19915.42,0
1,85.0,516.75,56.0,22696.96,0
2,77.0,467.66,86.0,21060.45,0
3,106.0,745.53,81.0,8437.39,1
4,66.0,418.74,1.0,14502.75,0


# Sanity Check variables/features target

In [10]:
df_user['is_ultra'].value_counts()

0    2229
1     985
Name: is_ultra, dtype: int64

In [11]:
#imbalance datasets -> is not a good model -> has bad accuracy
#upsample dan downsample
df_user['is_ultra'].value_counts() / df_user.shape[0] * 100

0    69.352831
1    30.647169
Name: is_ultra, dtype: float64

**Conclusion**

Data has been dominated by feature data as much as 70%

In [12]:
df_user.shape

(3214, 5)

In [13]:
df_user.isnull().sum()

calls       0
minutes     0
messages    0
mb_used     0
is_ultra    0
dtype: int64

In [14]:
df_user.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


# **2. Split data into training set, validation set, dan test set.**

- training + validation -> used to create models + tuning hyperparameters
     - will result in final model
- test set -> used to test the final model

In [15]:
train_valid, test = train_test_split(df_user, test_size=0.2)
train, valid = train_test_split(train_valid, test_size=0.25)

#train
features_train = train.drop(['is_ultra'], axis=1)
target_train = train['is_ultra']

#validation
features_valid = valid.drop(['is_ultra'], axis=1)
target_valid = valid['is_ultra']

#test
features_test = test.drop(['is_ultra'], axis=1)
target_test = test['is_ultra']

print(features_train.shape)
print(features_valid.shape)
print(features_test.shape)

(1928, 4)
(643, 4)
(643, 4)


In [16]:
#data total in original df 3214
#check the total features_train + features_valid + features_test, the value must be matched with the original df
df_user.shape

(3214, 5)

In [17]:
features_train.head()

Unnamed: 0,calls,minutes,messages,mb_used
605,24.0,135.33,33.0,15479.48
1969,51.0,419.89,93.0,11355.36
2177,38.0,301.27,37.0,28914.24
1518,13.0,96.3,33.0,6750.08
1119,72.0,415.28,25.0,30960.74


In [18]:
target_train[:5]

605     0
1969    0
2177    1
1518    1
1119    0
Name: is_ultra, dtype: int64

# 3. Check the quality in different models by changing the hyperparameters. Describe the findings that we've got from the research.

# Modelling without Tuning Hyperparameters

In [19]:
#Hyperparameters default
log_reg = LogisticRegression()
log_reg.fit(features_train, target_train)

y_prediksi_train = log_reg.predict(features_train)
y_prediksi_valid = log_reg.predict(features_valid)

print(accuracy_score(target_train, y_prediksi_train) * 100)
print(accuracy_score(target_valid, y_prediksi_valid) * 100)

74.63692946058092
74.96111975116641


**Conclusion**

The result values in training and validation datasets have not much differences, so they didn't result in overfitting data

In [20]:
#Hyperparameters default
dtree = DecisionTreeClassifier()
dtree.fit(features_train, target_train)

y_prediksi_valid_2 = dtree.predict(features_valid)
y_prediksi_test_2 = dtree.predict(features_test)

print(accuracy_score(target_valid, y_prediksi_valid_2) * 100)
print(accuracy_score(target_test, y_prediksi_test_2) * 100)

74.02799377916018
70.91757387247279


**Conclusion**

The result values in validation and test datasets have not much differences, so they didn't result in overfitting data

In [21]:
#Hyperparameters default
rf = RandomForestClassifier()
rf.fit(features_train, target_train)

y_prediksi_valid_3 = rf.predict(features_valid)
y_prediksi_test_3 = rf.predict(features_test)

print(accuracy_score(target_valid, y_prediksi_valid_3) * 100)
print(accuracy_score(target_test, y_prediksi_test_3) * 100)

79.93779160186625
82.42612752721618


**Conclusion**

The result values in validation and test datasets have not much differences, so they didn't result in overfitting data

# Modelling with Tuning Hyperparams


- grid search
- random search
- bayesian
- manual

In [22]:
#Hyperparameters tuning
max_depth_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]

for md in max_depth_list:
    dtree = DecisionTreeClassifier(max_depth=md)
    dtree.fit(features_train, target_train)

    y_prediksi_valid = dtree.predict(features_valid)
    y_prediksi_train = dtree.predict(features_train)
    acc_valid = accuracy_score(target_valid, y_prediksi_valid) * 100
    acc_train = accuracy_score(target_train, y_prediksi_train) * 100

    print(f'Untuk max depth {md} nilai acc validation {acc_valid}')
    print(f'Untuk max depth {md} nilai acc training {acc_train}')
    print('------------------------------------------------------')

Untuk max depth 1 nilai acc validation 74.02799377916018
Untuk max depth 1 nilai acc training 75.20746887966806
------------------------------------------------------
Untuk max depth 2 nilai acc validation 76.82737169517885
Untuk max depth 2 nilai acc training 78.52697095435684
------------------------------------------------------
Untuk max depth 3 nilai acc validation 78.84914463452566
Untuk max depth 3 nilai acc training 79.71991701244814
------------------------------------------------------
Untuk max depth 4 nilai acc validation 78.69362363919129
Untuk max depth 4 nilai acc training 80.18672199170125
------------------------------------------------------
Untuk max depth 5 nilai acc validation 80.248833592535
Untuk max depth 5 nilai acc training 81.89834024896265
------------------------------------------------------
Untuk max depth 6 nilai acc validation 79.78227060653188
Untuk max depth 6 nilai acc training 82.98755186721992
------------------------------------------------------


**Conclusion**

The best value is at max depth 1, with the smallest difference at 1.28

In [23]:
#DecisionTreeClassifier
#Tuning Hyperparameters
best_max_depth = 4
dtree = DecisionTreeClassifier(max_depth = best_max_depth)
dtree.fit(features_train, target_train)

y_prediksi_valid_2 = dtree.predict(features_valid)
y_prediksi_test_2 = dtree.predict(features_test)

print(accuracy_score(target_valid, y_prediksi_valid_2) * 100)
print(accuracy_score(target_test, y_prediksi_test_2) * 100)

78.69362363919129
80.7153965785381


**Conclusion**

Accuracy has been increased to 77% for validation dataset, and 78% for the test dataset

In [24]:
#Random Forest
max_depth_list = [1, 2, 3, 4, 5, 6, 7, 8, 9]
n_estimators_list = [100, 200, 300, 400, 500]

for md in max_depth_list:
    for est in n_estimators_list:
        rf = RandomForestClassifier(max_depth=md, n_estimators=est)
        rf.fit(features_train, target_train)

        y_prediksi_valid = dtree.predict(features_valid)
        acc_valid = accuracy_score(target_valid, y_prediksi_valid) * 100

        print(f'Untuk max depth {md} dan estimators {est} nilai acc validation {acc_valid}')
        print('------------------------------------------------------')

Untuk max depth 1 dan estimators 100 nilai acc validation 78.69362363919129
------------------------------------------------------
Untuk max depth 1 dan estimators 200 nilai acc validation 78.69362363919129
------------------------------------------------------
Untuk max depth 1 dan estimators 300 nilai acc validation 78.69362363919129
------------------------------------------------------
Untuk max depth 1 dan estimators 400 nilai acc validation 78.69362363919129
------------------------------------------------------
Untuk max depth 1 dan estimators 500 nilai acc validation 78.69362363919129
------------------------------------------------------
Untuk max depth 2 dan estimators 100 nilai acc validation 78.69362363919129
------------------------------------------------------
Untuk max depth 2 dan estimators 200 nilai acc validation 78.69362363919129
------------------------------------------------------
Untuk max depth 2 dan estimators 300 nilai acc validation 78.69362363919129
-------

# 4. Review model quality with test dataset

In [25]:
max_depth_best = 5
n_estimators = 400

rf = RandomForestClassifier(max_depth=max_depth_best, n_estimators=n_estimators)
rf.fit(features_train, target_train)

y_prediksi_train_2 = rf.predict(features_train)
y_prediksi_valid_2 = rf.predict(features_valid)
y_prediksi_test_2 = rf.predict(features_test)

print(accuracy_score(target_train, y_prediksi_train_2) * 100)
print(accuracy_score(target_valid, y_prediksi_valid_2) * 100)
print(accuracy_score(target_test, y_prediksi_test_2) * 100)

81.53526970954357
79.93779160186625
81.18195956454122


**Conclusion**

The value on the target test is **81%**, this value is higher than the **Decision Tree Classifier** which is **78%**. The difference in training, validation, and test values is not too much, so it has no overfitting data, and the accuracy value is more than 0.75. It can be concluded that overall testing has been completed.

# 5. Additional task: perform a sanity check on the model. This data is more complex than data we've worked with before

In [26]:
features_train.value_counts()

calls  minutes  messages  mb_used 
0.0    0.00     6.0       22428.00    1
77.0   528.82   91.0      15940.73    1
76.0   430.70   34.0      25138.49    1
75.0   551.79   31.0      20046.45    1
       550.72   55.0      11258.72    1
                                     ..
48.0   305.57   35.0      9183.85     1
       301.58   0.0       5200.13     1
       293.06   0.0       8139.80     1
       289.40   84.0      15696.86    1
244.0  1632.06  39.0      9756.91     1
Length: 1928, dtype: int64

In [27]:
features_valid.value_counts()

calls  minutes  messages  mb_used 
0.0    0.00     8.0       35525.61    1
81.0   506.46   21.0      7510.09     1
73.0   550.58   25.0      18322.63    1
       560.06   42.0      24364.87    1
       620.25   0.0       21098.49    1
                                     ..
48.0   322.61   73.0      18642.34    1
       334.49   19.0      17538.01    1
       341.65   0.0       19000.02    1
       395.71   118.0     12491.75    1
196.0  1279.75  133.0     26927.87    1
Length: 643, dtype: int64

In [28]:
features_test.value_counts()

calls  minutes  messages  mb_used 
0.0    0.00     0.0       530.78      1
81.0   560.68   21.0      16369.73    1
75.0   503.26   12.0      11859.17    1
       520.70   0.0       24596.12    1
       529.66   20.0      21749.32    1
                                     ..
48.0   379.99   92.0      12995.38    1
       392.04   0.0       19122.13    1
49.0   327.45   0.0       12049.91    1
       344.92   17.0      23383.40    1
185.0  1217.83  17.0      1444.81     1
Length: 643, dtype: int64

In [29]:
features_train.shape

(1928, 4)

In [30]:
features_valid.shape

(643, 4)

In [31]:
features_test.shape

(643, 4)

In [32]:
features_train.value_counts() / features_train.shape[0] * 100

calls  minutes  messages  mb_used 
0.0    0.00     6.0       22428.00    0.051867
77.0   528.82   91.0      15940.73    0.051867
76.0   430.70   34.0      25138.49    0.051867
75.0   551.79   31.0      20046.45    0.051867
       550.72   55.0      11258.72    0.051867
                                        ...   
48.0   305.57   35.0      9183.85     0.051867
       301.58   0.0       5200.13     0.051867
       293.06   0.0       8139.80     0.051867
       289.40   84.0      15696.86    0.051867
244.0  1632.06  39.0      9756.91     0.051867
Length: 1928, dtype: float64

In [33]:
features_valid.value_counts() / features_valid.shape[0] * 100

calls  minutes  messages  mb_used 
0.0    0.00     8.0       35525.61    0.155521
81.0   506.46   21.0      7510.09     0.155521
73.0   550.58   25.0      18322.63    0.155521
       560.06   42.0      24364.87    0.155521
       620.25   0.0       21098.49    0.155521
                                        ...   
48.0   322.61   73.0      18642.34    0.155521
       334.49   19.0      17538.01    0.155521
       341.65   0.0       19000.02    0.155521
       395.71   118.0     12491.75    0.155521
196.0  1279.75  133.0     26927.87    0.155521
Length: 643, dtype: float64

In [34]:
features_test.value_counts() / features_test.shape[0] * 100

calls  minutes  messages  mb_used 
0.0    0.00     0.0       530.78      0.155521
81.0   560.68   21.0      16369.73    0.155521
75.0   503.26   12.0      11859.17    0.155521
       520.70   0.0       24596.12    0.155521
       529.66   20.0      21749.32    0.155521
                                        ...   
48.0   379.99   92.0      12995.38    0.155521
       392.04   0.0       19122.13    0.155521
49.0   327.45   0.0       12049.91    0.155521
       344.92   17.0      23383.40    0.155521
185.0  1217.83  17.0      1444.81     0.155521
Length: 643, dtype: float64

In [35]:
features_train.isnull().sum()

calls       0
minutes     0
messages    0
mb_used     0
dtype: int64

In [36]:
features_valid.isnull().sum()

calls       0
minutes     0
messages    0
mb_used     0
dtype: int64

In [37]:
features_test.isnull().sum()

calls       0
minutes     0
messages    0
mb_used     0
dtype: int64

In [38]:
features_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1928 entries, 605 to 1152
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     1928 non-null   float64
 1   minutes   1928 non-null   float64
 2   messages  1928 non-null   float64
 3   mb_used   1928 non-null   float64
dtypes: float64(4)
memory usage: 75.3 KB


In [39]:
features_valid.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 643 entries, 1639 to 1216
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     643 non-null    float64
 1   minutes   643 non-null    float64
 2   messages  643 non-null    float64
 3   mb_used   643 non-null    float64
dtypes: float64(4)
memory usage: 25.1 KB


In [40]:
features_test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 643 entries, 792 to 2187
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     643 non-null    float64
 1   minutes   643 non-null    float64
 2   messages  643 non-null    float64
 3   mb_used   643 non-null    float64
dtypes: float64(4)
memory usage: 25.1 KB


# Final Conclusion

Based on the data, we have a dataset `users_behavior.csv`. This dataset has 5 columns and 3214 rows data. In this case, we performed an analysis using several machine learning models to help Megaline mobile operators grow their business. After we do a sanity check on the dataset that has been given, then
we separate the data source into training set, validation set, and test set.

After we share some of the needed data sources. The next step is to do a test with several model approaches, in this case the approach we used were; **Logistic Regression**, **Decision Tree Classifier**, and **Random Forest Classifier**. And the results shows that data validation and test values has not much differences, so they didn't have overfitting data. Then, we perform a tuning hyperparameters to get all parameters from the training set. After we do tuning hyperparameters, we review the quality of the model using a test set.

In this case we find that the value on the target test is 81%, this value is higher than the Decision Tree Classifier which is 78%. The difference in training, validation, and test values is not too much, so it has no overfitting data, and the accuracy value is more than 0.75. It can be concluded that overall testing has already been completed.