# Introduction

This project is carried out from the position of a data analyst at Megaline, which offers two new plans, Smart and Ultra. Because users are switching to the new plans quickly, a model is needed that recommends the right plan to each user with an accuracy of at least 75%.

In [1]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv('users_behavior.csv')
df.head()

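|   | calls | minutes | messages | mb_used  | is_ultra |
|---|-------|---------|----------|----------|----------|
| 0 | 40.0  | 311.90  | 83.0     | 19915.42 | 0        |
| 1 | 85.0  | 516.75  | 56.0     | 22696.96 | 0        |
| 2 | 77.0  | 467.66  | 86.0     | 21060.45 | 0        |
| 3 | 106.0 | 745.53  | 81.0     | 8437.39  | 1        |
| 4 | 66.0  | 418.74  | 1.0      | 14502.75 | 0        |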


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3214 entries, 0 to 3213
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   calls     3214 non-null   float64
 1   minutes   3214 non-null   float64
 2   messages  3214 non-null   float64
 3   mb_used   3214 non-null   float64
 4   is_ultra  3214 non-null   int64  
dtypes: float64(4), int64(1)
memory usage: 125.7 KB


## Splitting data into sets

In [5]:
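# train_test_split was already imported in the first cell.
# Hold out 20% of the rows as the test set.
# No random_state is passed, so the split differs between runs.
train_valid, test = train_test_split(df, test_size=0.2)
# 25% of the remaining 80% gives a 20% validation set (60/20/20 overall).
train, valid = train_test_split(train_valid, test_size=0.25)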

features_train = train.drop(['is_ultra'], axis=1)
target_train = train['is_ultra']
features_valid = valid.drop(['is_ultra'], axis=1)
target_valid = valid['is_ultra']
features_test = test.drop(['is_ultra'], axis=1)
target_test = test['is_ultra']

print(features_train.shape)
print(features_valid.shape)
print(features_test.shape)

(1928, 4)
(643, 4)
(643, 4)
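The shapes above follow from simple arithmetic on the 3214 rows. The snippet below is a minimal sketch of that arithmetic, not a call into scikit-learn: the helper `split_sizes` is hypothetical, and it assumes the held-out part is rounded up, which reproduces the sizes printed above.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_rest, n_held_out) for a fractional test_size.

    Assumption: the held-out count is rounded up and the rest keeps
    the remaining rows, which matches the shapes printed above.
    """
    n_held_out = math.ceil(n_rows * test_size)
    return n_rows - n_held_out, n_held_out


n_total = 3214  # row count reported by df.info()

# First split: 20% of all rows go to the test set.
n_train_valid, n_test = split_sizes(n_total, 0.2)
# Second split: 25% of the remaining rows go to the validation set.
n_train, n_valid = split_sizes(n_train_valid, 0.25)

print(n_train, n_valid, n_test)  # 1928 643 643
```

Taking 25% of the remaining 80% is what yields a roughly 60/20/20 split overall.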


## Tuning Models

In [6]:
print("Decision Tree")
for depth in range(1, 11):
    model = DecisionTreeClassifier(max_depth=depth, random_state=12345)
    model.fit(features_train, target_train)
    print("max_depth =", depth)
    print("Train:", model.score(features_train, target_train))
    print("Valid:", model.score(features_valid, target_valid))

Decision Tree
max_depth = 1
Train: 0.7546680497925311
Valid: 0.7433903576982893
max_depth = 2
Train: 0.7894190871369294
Valid: 0.7698289269051322
max_depth = 3
Train: 0.8029045643153527
Valid: 0.7869362363919129
max_depth = 4
Train: 0.8049792531120332
Valid: 0.7729393468118196
max_depth = 5
Train: 0.8241701244813278
Valid: 0.7884914463452566
max_depth = 6
Train: 0.8355809128630706
Valid: 0.7776049766718507
max_depth = 7
Train: 0.8495850622406639
Valid: 0.7900466562986003
max_depth = 8
Train: 0.8651452282157677
Valid: 0.7900466562986003
max_depth = 9
Train: 0.8781120331950207
Valid: 0.7931570762052877
max_depth = 10
Train: 0.8853734439834025
Valid: 0.7822706065318819


In [7]:
print("Random Forest")
for estim in range(10, 101, 10):
    model = RandomForestClassifier(n_estimators=estim, random_state=12345)
    model.fit(features_train, target_train)
    print("n_estimators =", estim)
    print("Train:", model.score(features_train, target_train))
    print("Valid:", model.score(features_valid, target_valid))

Random Forest
n_estimators = 10
Train: 0.9808091286307054
Valid: 0.7853810264385692
n_estimators = 20
Train: 0.9963692946058091
Valid: 0.7962674961119751
n_estimators = 30
Train: 0.9968879668049793
Valid: 0.7947122861586314
n_estimators = 40
Train: 0.9984439834024896
Valid: 0.8009331259720062
n_estimators = 50
Train: 0.9989626556016598
Valid: 0.7947122861586314
n_estimators = 60
Train: 0.9994813278008299
Valid: 0.7962674961119751
n_estimators = 70
Train: 1.0
Valid: 0.7916018662519441
n_estimators = 80
Train: 1.0
Valid: 0.7947122861586314
n_estimators = 90
Train: 1.0
Valid: 0.7931570762052877
n_estimators = 100
Train: 1.0
Valid: 0.7900466562986003


In [8]:
print("Logistic Regression")
model = LogisticRegression(random_state=12345)
model.fit(features_train, target_train)
print("Train:", model.score(features_train, target_train))
print("Valid:", model.score(features_valid, target_valid))

Logistic Regression
Train: 0.6981327800829875
Valid: 0.702954898911353


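### Preliminary Conclusions

- Logistic regression has the lowest accuracy (about 0.70 on both the training and validation sets), but it does not overfit.
- The decision tree overfits as depth grows (training accuracy 0.885 vs. validation accuracy 0.782 at `max_depth=10`), yet its validation accuracy is higher, peaking at 0.793 for `max_depth=9`.
- The random forest overfits even more strongly (training accuracy close to 1.0), but its validation accuracy is slightly higher than the decision tree's, peaking at 0.801 for `n_estimators=40`.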
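The comparison can be made explicit by picking, for each model, the setting with the highest validation accuracy and looking at the gap between training and validation accuracy. The sketch below only reuses numbers copied from the tuning output above; the helper `best_by_validation` is a hypothetical name introduced for illustration.

```python
# Validation accuracies copied from the tuning output above.
tree_valid = {
    1: 0.7433903576982893, 2: 0.7698289269051322, 3: 0.7869362363919129,
    4: 0.7729393468118196, 5: 0.7884914463452566, 6: 0.7776049766718507,
    7: 0.7900466562986003, 8: 0.7900466562986003, 9: 0.7931570762052877,
    10: 0.7822706065318819,
}
forest_valid = {
    10: 0.7853810264385692, 20: 0.7962674961119751, 30: 0.7947122861586314,
    40: 0.8009331259720062, 50: 0.7947122861586314, 60: 0.7962674961119751,
    70: 0.7916018662519441, 80: 0.7947122861586314, 90: 0.7931570762052877,
    100: 0.7900466562986003,
}
# Training accuracies at the selected settings, also from the output above.
tree_train_at_9 = 0.8781120331950207
forest_train_at_40 = 0.9984439834024896


def best_by_validation(scores):
    """Return (parameter, score) with the highest validation accuracy."""
    best_param = max(scores, key=scores.get)
    return best_param, scores[best_param]


best_depth, tree_score = best_by_validation(tree_valid)
best_estim, forest_score = best_by_validation(forest_valid)

# A large positive train/validation gap is a sign of overfitting.
tree_gap = tree_train_at_9 - tree_score
forest_gap = forest_train_at_40 - forest_score

print("tree:", best_depth, round(tree_score, 3), "gap", round(tree_gap, 3))
print("forest:", best_estim, round(forest_score, 3), "gap", round(forest_gap, 3))
```

Because the split is not seeded, these exact values belong to this particular run; the selection logic is what carries over.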

## Testing model

In [10]:
# Merge the training and validation sets for the final fit.
features_full_train = train_valid.drop(['is_ultra'], axis=1)
target_full_train = train_valid['is_ultra']

In [11]:
model = RandomForestClassifier(n_estimators=80, random_state=12345)
model.fit(features_full_train, target_full_train)
model.score(features_test, target_test)

0.7993779160186625

### Additional task: sanity check on is_ultra

In [13]:
df['is_ultra'].value_counts() / df.shape[0]

0    0.693528
1    0.306472
Name: is_ultra, dtype: float64

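A constant model that always predicts the majority class (`is_ultra = 0`) would already score about 69% accuracy. Logistic regression reaches only about 70% on the validation set, so it has learned very little beyond this baseline. The random forest's test accuracy of about 80% clearly beats both the baseline and the 75% target.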
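The sanity check above can be expressed as the accuracy of a majority-class predictor. The sketch below is illustrative only: `majority_baseline_accuracy` is a hypothetical helper, and the label counts (2229 zeros, 985 ones) are an assumption reconstructed from the class shares printed above and the 3214 rows in the data set.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most common label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Assumption: counts reconstructed from the printed shares
# (0.693528 and 0.306472 of 3214 rows).
labels = [0] * 2229 + [1] * 985

baseline = majority_baseline_accuracy(labels)
print(round(baseline, 6))  # 0.693528

# Accuracies copied from the outputs above.
logistic_valid = 0.702954898911353
forest_test = 0.7993779160186625

print("logistic margin:", round(logistic_valid - baseline, 3))
print("forest margin:", round(forest_test - baseline, 3))
```

A model is only useful for this task if its margin over the baseline is clearly positive, which here holds for the random forest but barely for logistic regression.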