# Model Prediksi Perilaku Nasabah Bank Beta

# Konten <a id='contents'></a>

* [1 Gambaran Besar](#big_picture)
    * [1.1 Pendahuluan](#intro)
    * [1.2 Deskripsi Data](#data_description)
    * [1.3 Tujuan dan Langkah-Langkah Pengerjaan Projek](#goals_and_step)

* [2 Prapemrosesan Data](#data_preprocessing)
    * [2.1 Memuat Data](#load_data)
    * [2.2 Eksplorasi Data Awal](#initial_data_exploration)
    * [2.3 Kesimpulan Awal](#initial_summary)
    
* [3 Memisahkan Data menjadi 3 Set](#split_data)
    
* [4 Melatih Model](#model_train)

* [5 Memeriksa Kualitas Data Set](#check_model_accuracy)

* [6 Sanity Check](#sanity_check)

* [7 Kesimpulan](#summary)

 ## Gambaran Besar <a id='big_picture'></a>

### Pendahuluan <a id='intro'></a>

Nasabah Bank Beta pergi meninggalkan perusahaan: sedikit demi sedikit, jumlah mereka berkurang setiap bulannya. Para pegawai bank menyadari bahwa lebih murah untuk mempertahankan nasabah lama mereka yang setia daripada menarik nasabah baru.

Sebagai seorang Data Scientist di Bank Beta, saya diminta untuk membuat model yang mampu untuk memprediksi apakah seorang nasabah akan segera meninggalkan bank atau tidak. Data yang dimiliki adalah data terkait perilaku para klien di masa lalu dan riwayat pemutusan kontrak mereka dengan bank.

Model yang buat harus memiliki nilai F1 lebih besar dari 0.59 dengan pertimbangan metrik AUC-ROC untuk melakukan penilaian terhadap model

### Deskripsi Data <a id='data_description'></a>

**Fitur:**
- RowNumber — indeks string data
- CustomerId — ID pelanggan
- Surname — nama belakang
- CreditScore — skor kredit
- Geography — negara domisili
- Gender — gender
- Age — umur
- Tenure — jangka waktu jatuh tempo untuk deposito tetap nasabah (tahun)
- Balance — saldo rekening
- NumOfProducts — jumlah produk bank yang digunakan oleh nasabah
- HasCrCard — apakah nasabah memiliki kartu kredit
- IsActiveMember — tingkat keaktifan nasabah
- EstimatedSalary — estimasi gaji

**Target:**
- Exited — apakah nasabah telah berhenti

### Tujuan dan Langkah-Langkah Pengerjaan Projek <a id='goals_and_step'></a>

**Tujuan dari proyek ini adalah membuat model untuk memprediksi apakah nasabah Bank Beta akan memutus kontrak dengan perusahaan atau tidak**

**Langkah yang akan saya lakukan**
1. Mempelajari data pada tabel
2. Memisahkan data sumber menjadi training set, validation set, dan test set.
3. Melakukan pelatihan model tanpa memeperbaiki ketidakseimbangan.
4. Meningkatkan kualitas model dengan memperbaiki ketidakseimbangan.
5. Mekaukan pelatihan dengan menggunakan model lainnya
5. Melakukan sanity check terhadap model.
7. Melakukan pemeriksanaan nilai AUC-ROC dan F1

## Pra Pemrosesan Data <a id='data_preprocessing'></a>

In [None]:
# Muat semua library
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from scipy import stats as st 

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, f1_score, roc_auc_score
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.metrics import roc_auc_score

from sklearn.utils import shuffle

import warnings
warnings.filterwarnings("ignore")

### Memuat Data <a id='load_data'></a>

In [None]:
# Muat file data menjadi DataFrame
df = pd.read_csv('/datasets/Churn.csv')

### Eksplorasi Data Awal <a id='initial_data_exploration'></a>

In [None]:
# Menampilkan sample data untuk melihat data secara sekilas
df.sample(10)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
1818,1819,15800517,Huang,633,Spain,Male,32,5.0,163340.12,2,1,1,74415.2,0
186,187,15771977,T'ao,730,France,Female,39,1.0,99010.67,1,1,0,194945.8,0
3131,3132,15614187,Pottinger,648,Germany,Female,39,3.0,126935.98,2,0,1,57995.74,0
4174,4175,15810593,Forbes,568,France,Male,51,4.0,0.0,3,1,1,66586.56,0
3711,3712,15729489,Hyde,762,Germany,Female,34,8.0,98592.88,1,0,1,191790.29,1
8089,8090,15623357,Onio,692,Germany,Male,24,2.0,120596.93,1,0,1,180490.53,0
6649,6650,15635277,Coates,605,Spain,Male,47,7.0,142643.54,1,1,0,189310.27,0
3673,3674,15606915,Genovese,764,France,Male,24,7.0,98148.61,1,1,0,26843.76,0
7801,7802,15798844,Chijindum,678,France,Male,54,,128914.97,1,0,0,191746.23,1
8608,8609,15649060,Chien,727,Germany,Female,31,,82729.47,2,1,0,60212.51,0


In [None]:
# Menampilkan informasi/rangkuman umum tentang DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


In [None]:
df.IsActiveMember.unique()

array([1, 0])

In [None]:
# Memampilkan nilai statistik dari kolom numerik
df.describe()

Unnamed: 0,RowNumber,CustomerId,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
count,10000.0,10000.0,10000.0,10000.0,9091.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,5000.5,15690940.0,650.5288,38.9218,4.99769,76485.889288,1.5302,0.7055,0.5151,100090.239881,0.2037
std,2886.89568,71936.19,96.653299,10.487806,2.894723,62397.405202,0.581654,0.45584,0.499797,57510.492818,0.402769
min,1.0,15565700.0,350.0,18.0,0.0,0.0,1.0,0.0,0.0,11.58,0.0
25%,2500.75,15628530.0,584.0,32.0,2.0,0.0,1.0,0.0,0.0,51002.11,0.0
50%,5000.5,15690740.0,652.0,37.0,5.0,97198.54,1.0,1.0,1.0,100193.915,0.0
75%,7500.25,15753230.0,718.0,44.0,7.0,127644.24,2.0,1.0,1.0,149388.2475,0.0
max,10000.0,15815690.0,850.0,92.0,10.0,250898.09,4.0,1.0,1.0,199992.48,1.0


In [None]:
# Melihat distribusi nasabah yang masih aktif/tidak
df['Exited'].value_counts()/ len(df)

0    0.7963
1    0.2037
Name: Exited, dtype: float64

In [None]:
df.fillna(value=0, inplace=True)

### Kesimpulan Awal <a id='initial_summary'></a>

**Insights:**
1. Terdapat ketidakseimbangan data, jika dianggap nilai 1 adalah nasabah yang sudah berhenti maka data nasabah yang sudah berhenti hanya berkisar 20% dari total data nasabah Bank Beta
2. Terdapat nilai yang hilang pada kolom Tenure, tetapi disini saya akan menganggap bahwa Nasabah tidak memiliki deposito sama sekali sehingga datanya saya isi dengan nilai nol saja
3. Rata-rata umur nasabah Bank Beta adalah 39
4. Mayoritas nasabah Bank Beta memiliki kartu kredit

## Memisahkan Data menjadi 3 set <a id='split_data'></a>

In [None]:
"""# Rumus untuk lowercase nama kolom pada dataframe
def lowercase_columns(df):
    df.columns = [col.lower() for col in df.columns]
    return df"""

'# Rumus untuk lowercase nama kolom pada dataframe\ndef lowercase_columns(df):\n    df.columns = [col.lower() for col in df.columns]\n    return df'

In [None]:
# Rumus untuk lowercase nama kolom pada dataframe
df.columns = df.columns.str.lower()

In [None]:
# Mnggunakan fungsi lowercase
df.columns

Index(['rownumber', 'customerid', 'surname', 'creditscore', 'geography',
       'gender', 'age', 'tenure', 'balance', 'numofproducts', 'hascrcard',
       'isactivemember', 'estimatedsalary', 'exited'],
      dtype='object')

In [None]:
# Melakukan Pengkodean label untuk Form 
encoder = OrdinalEncoder()
df_ordinal = pd.DataFrame(encoder.fit_transform(df), columns=df.columns)
df_ordinal

Unnamed: 0,rownumber,customerid,surname,creditscore,geography,gender,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,exited
0,0.0,2736.0,1115.0,228.0,0.0,0.0,24.0,2.0,0.0,0.0,1.0,1.0,5068.0,1.0
1,1.0,3258.0,1177.0,217.0,2.0,0.0,23.0,1.0,743.0,0.0,0.0,1.0,5639.0,0.0
2,2.0,2104.0,2040.0,111.0,0.0,0.0,24.0,8.0,5793.0,2.0,1.0,0.0,5707.0,1.0
3,3.0,5435.0,289.0,308.0,0.0,0.0,21.0,1.0,0.0,1.0,0.0,0.0,4704.0,0.0
4,4.0,6899.0,1822.0,459.0,2.0,0.0,25.0,2.0,3696.0,0.0,1.0,1.0,3925.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9995.0,1599.0,1999.0,380.0,0.0,1.0,21.0,5.0,0.0,1.0,1.0,0.0,4827.0,0.0
9996,9996.0,161.0,1336.0,125.0,0.0,1.0,17.0,10.0,124.0,0.0,1.0,1.0,5087.0,0.0
9997,9997.0,717.0,1570.0,318.0,0.0,0.0,18.0,7.0,0.0,0.0,0.0,1.0,2062.0,1.0
9998,9998.0,4656.0,2345.0,381.0,1.0,1.0,24.0,3.0,427.0,1.0,1.0,0.0,4639.0,1.0


In [None]:
# Mendefininikan data features dan data target
features = df_ordinal.drop(['exited'], axis=1)
target = df_ordinal['exited']

In [None]:
# Membagi dataframe menjadi 3 set
features_train, features_check, target_train, target_check = train_test_split(
    features, target, test_size = 0.4, random_state = 12345)

features_valid, features_test, target_valid, target_test = train_test_split(
    features_check, target_check, test_size = 0.5, random_state = 12345)

In [None]:
print(features_train.shape)
print(features_valid.shape)
print(features_test.shape)
print(target_train.shape)
print(target_valid.shape)
print(target_test.shape)

(6000, 13)
(2000, 13)
(2000, 13)
(6000,)
(2000,)
(2000,)


## Melatih Model <a id='model_train'></a>

Dari ketiga jenis model yang dipelajari pada sprint sebelumnya (Decision Tree, Random Forest, dan Logistics Regresion) didapatkan bahwa **Random Forest** memiliki tingkat akurasi yang tinggi, sehingga untuk model pertama saya akan menggunakan  model Random Forest, lalu melakukan perbandingan dengan Logistics Regression.

### Melatih Model RandomForestClassifier <a id='rfc_model_train'></a>

In [None]:
# Membuat model
rfc = RandomForestClassifier(random_state=12345)
rfc.fit(features_train, target_train)
predicted_valid = rfc.predict(features_valid)
f1_score(target_valid, predicted_valid)

0.5341812400635929

In [None]:
# Menghitung nilai AUC-ROC
probabilities_valid = rfc.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

roc_auc_score(target_valid, probabilities_one_valid)

0.8316383779238927

In [None]:
# Hyperparameter_tunning
max_features_range = np.arange(1,11,1)
n_estimators_range = np.arange(1,11,1)
max_depth_range = np.arange(1,11,1)

param_grid = dict(max_features = max_features_range, 
                  n_estimators = n_estimators_range, 
                  max_depth = max_depth_range)

rfc = RandomForestClassifier(random_state = 12345)

grid = GridSearchCV(estimator = rfc, 
                    param_grid = param_grid, 
                    cv = 5, scoring='f1_micro')

grid.fit(features_train, target_train)

print("The best paramenets are %s with a score of %0.2f"% 
      (grid.best_params_,grid.best_score_))

The best paramenets are {'max_depth': 8, 'max_features': 4, 'n_estimators': 10} with a score of 0.86


In [None]:
# f1_score setelah hyperparameter tunning
rfc = RandomForestClassifier(random_state = 12345, 
                             n_estimators = 10, 
                             max_depth = 8, 
                             max_features = 4)
rfc.fit(features_train, target_train)
predicted_valid = rfc.predict(features_valid)
f1_score(target_valid, predicted_valid)

0.5492063492063491

In [None]:
# Menghitung nilai AUC-ROC setelah hyperparameter tunning
probabilities_valid = rfc.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

roc_auc_score(target_valid, probabilities_one_valid)

0.8428303764237626

In [None]:
# Melihat distribusi nasabah yang masih aktif/tidak
df['exited'].value_counts()/ len(df)

0    0.7963
1    0.2037
Name: exited, dtype: float64

Setelah melakukan tunning pada hyperparameter didapatkan peningkata pada nilai f1 dan auc_roc, tetapi nilainya masih dibawah 0.59. Disini diketahui juga bahwa ada ketidakseimbang data antara nasabah yang masih aktif dan tidak sehingga untuk menghindari kesalahan, saya akan melakukan penyesuaian bobot

#### Mengunakan Metode class_weight <a id='using_class_weight'></a>

In [None]:
# Melatih model baru dengan mempertimbangkan class_weight
rfc1 = RandomForestClassifier(random_state = 12345, 
                             n_estimators = 10, 
                             max_depth = 8, 
                             max_features = 4,
                             class_weight = 'balanced')
rfc1.fit(features_train, target_train)
predicted_valid = rfc1.predict(features_valid)
f1_score(target_valid, predicted_valid)

0.6038961038961039

In [None]:
# Menghitung nilai AUC-ROC dengan mempertimbangkan class_weight
probabilities_valid = rfc1.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

roc_auc_score(target_valid, probabilities_one_valid)

0.8390361966864063

#### Mengunakan Metode upsampling <a id='using_upsampling'></a>

In [None]:
# Memisahkan fitur dan target berdasarkan nilai 0 dan 1
features_zeros = features_train[target_train == 0]
features_ones = features_train[target_train == 1]
target_zeros = target_train[target_train == 0]
target_ones = target_train[target_train == 1]

In [None]:
# Melihat shape fitur = 0
features_zeros.shape

(4804, 13)

In [None]:
# Melihat shape fitur = 1
features_ones.shape

(1196, 13)

In [None]:
# Menentukan ratio pengali
len(features_zeros)/len(features_ones)

4.016722408026756

In [None]:
# Membuat fungsi Upsampling
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345
    )

    return features_upsampled, target_upsampled

In [None]:
# Memanggil fungsi upsample
features_upsampled, target_upsampled = upsample(features_train, target_train, 4)

In [None]:
# Melatih model baru dengan menggunakan data yang sudah diupsampled
rfc2 = RandomForestClassifier(random_state = 12345, 
                             n_estimators = 10, 
                             max_depth = 8, 
                             max_features = 4)
rfc2.fit(features_upsampled, target_upsampled)
predicted_valid = rfc2.predict(features_valid)
f1_score(target_valid, predicted_valid)

0.5857740585774058

In [None]:
# Menghitung nilai AUC-ROC dengan menggunakan data yang sudah diupsampled
probabilities_valid = rfc2.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

roc_auc_score(target_valid, probabilities_one_valid)

0.8388615343668907

#### Mengunakan Metode downsampling <a id='using_downsampling'></a>

In [None]:
# Menentukan ratio pengali
len(features_ones)/len(features_zeros)

0.24895920066611157

In [None]:
# Membuat fungsi Downsampling
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345)]
        + [features_ones]
    )
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345)]
        + [target_ones]
    )

    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=12345
    )

    return features_downsampled, target_downsampled

In [None]:
# Memanggil fungsi downsample
features_downsampled, target_downsampled = downsample(features_train, 
                                                      target_train, 0.25)

In [None]:
# Melatih model baru dengan menggunakan data yang sudah didownsampled
rfc3 = RandomForestClassifier(random_state = 12345, 
                             n_estimators = 10, 
                             max_depth = 8, 
                             max_features = 4)
rfc3.fit(features_downsampled, target_downsampled)
predicted_valid = rfc3.predict(features_valid)
f1_score(target_valid, predicted_valid)

0.6001955034213099

In [None]:
# Menghitung nilai AUC-ROC dengan menggunakan data yang sudah didownsampled
probabilities_valid = rfc3.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

roc_auc_score(target_valid, probabilities_one_valid)

0.8436371499948584

#### Menguji dengan Test Dataset <a id='test_dataset_testing'></a>

In [None]:
# f1_score RFC1
predicted_test = rfc1.predict(features_test)
f1_score(target_test, predicted_test)

0.5852090032154341

In [None]:
# auc_roc_score RFC1
probabilities_test = rfc1.predict_proba(features_test)
probabilities_one_test = probabilities_test[:, 1]

roc_auc_score(target_test, probabilities_one_test)

0.8342860055376414

In [None]:
# f1_score RFC2
predicted_test = rfc2.predict(features_test)
f1_score(target_test, predicted_test)

0.5923694779116466

In [None]:
# auc_roc_score RFC2
probabilities_test = rfc2.predict_proba(features_test)
probabilities_one_test = probabilities_test[:, 1]

roc_auc_score(target_test, probabilities_one_test)

0.846633266923611

In [None]:
# f1_score RFC3
predicted_test = rfc3.predict(features_test)
f1_score(target_test, predicted_test)

0.5684410646387833

In [None]:
# auc_roc_score RFC3
probabilities_test = rfc3.predict_proba(features_test)
probabilities_one_test = probabilities_test[:, 1]

roc_auc_score(target_test, probabilities_one_test)

0.8308201076047378

#### Hasil Model RandomForestClassifier <a id='rfc_model_result'></a>

Berikut rangkuman hasil f1_score dan auc_roc_score menggunakan metode penyeimbangan kelas yang berbeda-beda:

In [None]:
f1_data = {'Method': ['class_weight', 'upsampling', 'downsampling'],
         'valid_data': [0.603, 0.585, 0.600],
         'test_data': [0.585, 0.592, 0.568]}

df_f1 = pd.DataFrame(f1_data)
df_f1

Unnamed: 0,Method,valid_data,test_data
0,class_weight,0.603,0.585
1,upsampling,0.585,0.592
2,downsampling,0.6,0.568


In [None]:
auc_roc_data = {'Method': ['class_weight', 'upsampling', 'downsampling'],
                'valid_data': [0.839, 0.838, 0.843],
                'test_data': [0.834, 0.846, 0.830]}

df_auc_roc = pd.DataFrame(auc_roc_data)
df_auc_roc

Unnamed: 0,Method,valid_data,test_data
0,class_weight,0.839,0.834
1,upsampling,0.838,0.846
2,downsampling,0.843,0.83


**Insights:**
1. Berdasarkan pengujian menggunakan valid_data didapatkan bahwa metode class_weight menghasilkan nilai f1 tertinggi, tetapi mengalami penurunan pada pengujian menggunakan test_data meskipun tidak setinggi metode downsampling
2. Hal yang serupa pada poin (1) ditemukan kembali pada nilai auc_roc
3. Berdasarkan ketiga pengujian, didapatkan bahwa metode class_weight dan upsampling lebih baik dibandingkan dengan metode downsampling. Metode class_weight menghasilkan nilai f1 > 0.59 pada valid_data, sedangkan metode upsampling menghasilkan nilai f1 > 0.59 pada test data.
4. Disini saya akan menetapkan bahwa metode upsampling lebih baik karena mampu meprediksi dengan lebih tepat jika dihadapkan dengan data diluar train_dataset dibandingkan class_weight

### Melatih Model LogisticRegression <a id='lr_model_train'></a>

In [None]:
# Drop kolom surname sehingga tidak dimasukan dalam OHE
df.drop(columns=['surname', 'customerid', 'rownumber'], inplace = True)

In [None]:
# OHE
df_ohe = pd.get_dummies(df, drop_first=True)

In [None]:
# Mendefininikan data features dan data target
features_ohe = df_ohe.drop(['exited'], axis=1)
target_ohe = df_ohe['exited']

In [None]:
# Membagi dataframe menjadi 3 set
features_train, features_check, target_train, target_check = train_test_split(
    features_ohe, target_ohe, test_size = 0.4, random_state = 12345)

features_valid, features_test, target_valid, target_test = train_test_split(
    features_check, target_check, test_size = 0.5, random_state = 12345)

In [None]:
df.columns

Index(['creditscore', 'geography', 'gender', 'age', 'tenure', 'balance',
       'numofproducts', 'hascrcard', 'isactivemember', 'estimatedsalary',
       'exited'],
      dtype='object')

In [None]:
numeric = ['balance', 'age', 'estimatedsalary']

scaler = StandardScaler()
scaler.fit(features_train[numeric])
features_train[numeric] = scaler.transform(features_train[numeric])
features_valid[numeric] = scaler.transform(features_valid[numeric])
features_test[numeric] = scaler.transform(features_test[numeric])

#### Mengunakan Metode class_weight <a id='using_class_weight'></a>

In [None]:
# Membuat model
lr = LogisticRegression(random_state=12345, 
                        solver = 'liblinear', 
                        class_weight = 'balanced')
lr.fit(features_train, target_train)
predicted_valid = lr.predict(features_valid)
f1_score(target_valid, predicted_valid)

0.4888888888888888

In [None]:
# Menghitung nilai AUC-ROC
probabilities_valid = lr.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

roc_auc_score(target_valid, probabilities_one_valid)

0.7635873674532268

In [None]:
# Hyperparameter_tunning
params = {'C': [0.1, 1, 10, 100],
          'solver': ['lbfgs', 'sag', 'saga', 
                     'newton-cg', 'liblinear', 'newton-cholesky'],
          'penalty': ['l1', 'l2']}

lr = LogisticRegression(random_state = 12345, class_weight = 'balanced')
grid = GridSearchCV(estimator = lr, 
                    param_grid = params, 
                    cv=5, scoring='f1_micro')

grid.fit(features_train, target_train)

print("The best paramenets are %s with a score of %0.2f"% 
      (grid.best_params_,grid.best_score_))

The best paramenets are {'C': 0.1, 'penalty': 'l1', 'solver': 'saga'} with a score of 0.73


In [None]:
# f1_score setelah hyperparameter tunning
lr1 = LogisticRegression(random_state = 12345,
                        class_weight = 'balanced',
                        C = 10, penalty = 'l1', 
                        solver = 'liblinear')
lr1.fit(features_train, target_train)
predicted_valid = lr1.predict(features_valid)
f1_score(target_valid, predicted_valid)

0.4888888888888888

In [None]:
# Menghitung nilai AUC-ROC dengan menggunakan data yang sudah diupsampled
probabilities_valid = lr1.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

roc_auc_score(target_valid, probabilities_one_valid)

0.7634346324378928

#### Mengunakan Metode upsampling <a id='using_upsampling'></a>

In [None]:
features_upsampled, target_upsampled = upsample(features_train, target_train, 4)

In [None]:
# Melatih model baru dengan menggunakan data yang sudah diupsampled
lr2 = LogisticRegression(random_state = 12345,
                         C = 10, penalty = 'l1', 
                         solver = 'liblinear')
lr2.fit(features_upsampled, target_upsampled)
predicted_valid = lr2.predict(features_valid)
f1_score(target_valid, predicted_valid)

0.4888888888888888

In [None]:
# Menghitung nilai AUC-ROC dengan menggunakan data yang sudah diupsampled
probabilities_valid = lr2.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

roc_auc_score(target_valid, probabilities_one_valid)

0.7633953145131533

#### Mengunakan Metode downsampling <a id='using_downsampling'></a>

In [None]:
# Memanggil fungsi downsample
features_downsampled, target_downsampled = downsample(features_train, 
                                                      target_train, 0.25)

In [None]:
# Melatih model baru dengan menggunakan data yang sudah diupsampled
lr3 = LogisticRegression(random_state = 12345,
                         C = 10, penalty = 'l1', 
                         solver = 'liblinear')
lr3.fit(features_downsampled, target_downsampled)
predicted_valid = lr3.predict(features_valid)
f1_score(target_valid, predicted_valid)

0.4888507718696398

In [None]:
# Menghitung nilai AUC-ROC dengan menggunakan data yang sudah diupsampled
probabilities_valid = lr3.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

roc_auc_score(target_valid, probabilities_one_valid)

0.7623745606978025

#### Hasil Model Logistic Regression <a id='rfc_model_result'></a>

**Insights:**
Menggunakan Model LogisticRegression, saya tidak mendapatkan nilai f1 lebih dari 0.59 bahkan rata-rata modelnya menghasilkan nilai kurang dari 0.5, sehingga saya akan mencoba satu model lagi yang diajarkan, yaitu DecisionTree

### Melatih Model DecisionTreeClassifier <a id='dtc_model_train'></a>

In [None]:
# Membuat model
dtc = RandomForestClassifier(random_state=12345)
dtc.fit(features_train, target_train)
predicted_valid = dtc.predict(features_valid)
f1_score(target_valid, predicted_valid)

0.5807407407407408

In [None]:
# Menghitung nilai AUC-ROC
probabilities_valid = dtc.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

roc_auc_score(target_valid, probabilities_one_valid)

0.842530955304593

In [None]:
# Hyperparameter_tunning

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, None],
    'max_features': ['auto', 'sqrt', 'log2', None]
}

dtc = DecisionTreeClassifier(random_state = 12345)

grid = GridSearchCV(estimator = dtc, 
                    param_grid = param_grid, 
                    cv = 5, scoring='f1')

grid.fit(features_train, target_train)

print("The best paramenets are %s with a score of %0.2f"% 
      (grid.best_params_,grid.best_score_))

The best paramenets are {'criterion': 'gini', 'max_depth': 7, 'max_features': None} with a score of 0.56


In [None]:
# f1_score setelah hyperparameter tunning
dtc = DecisionTreeClassifier(random_state = 12345, 
                             criterion = 'entropy', 
                             max_depth = 7, 
                             max_features = None)
dtc.fit(features_train, target_train)
predicted_valid = dtc.predict(features_valid)
f1_score(target_valid, predicted_valid)

0.5468750000000001

In [None]:
# Menghitung nilai AUC-ROC setelah hyperparameter tunning
probabilities_valid = dtc.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

roc_auc_score(target_valid, probabilities_one_valid)

0.8166635413957258

#### Mengunakan Metode class_weight <a id='using_class_weight'></a>

In [None]:
# f1_score setelah hyperparameter tunning
dtc1 = DecisionTreeClassifier(random_state = 12345,
                             class_weight = 'balanced',
                             criterion = 'entropy', 
                             max_depth = 7, 
                             max_features = None)
dtc1.fit(features_train, target_train)
predicted_valid = dtc1.predict(features_valid)
f1_score(target_valid, predicted_valid)

0.5655577299412915

In [None]:
# Menghitung nilai AUC-ROC dengan mempertimbangkan class_weight
probabilities_valid = dtc1.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

roc_auc_score(target_valid, probabilities_one_valid)

0.8145486907131062

#### Mengunakan Metode upsampling <a id='using_upsampling'></a>

In [None]:
# Melatih model baru dengan menggunakan data yang sudah diupsampled
dtc2 = DecisionTreeClassifier(random_state = 12345,
                             criterion = 'entropy', 
                             max_depth = 7, 
                             max_features = None)
dtc2.fit(features_upsampled, target_upsampled)
predicted_valid = dtc2.predict(features_valid)
f1_score(target_valid, predicted_valid)

0.5641527913809989

In [None]:
# Menghitung nilai AUC-ROC dengan menggunakan data yang sudah diupsampled
probabilities_valid = dtc2.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

roc_auc_score(target_valid, probabilities_one_valid)

0.8127136021872864

#### Mengunakan Metode downsampling <a id='using_downsampling'></a>

In [None]:
# Melatih model baru dengan menggunakan data yang sudah diupsampled
dtc3 = DecisionTreeClassifier(random_state = 12345,
                             criterion = 'entropy', 
                             max_depth = 7, 
                             max_features = None)
dtc3.fit(features_downsampled, target_downsampled)
predicted_valid = dtc2.predict(features_valid)
f1_score(target_valid, predicted_valid)

0.5641527913809989

In [None]:
# Menghitung nilai AUC-ROC dengan menggunakan data yang sudah didownsampled
probabilities_valid = dtc3.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

roc_auc_score(target_valid, probabilities_one_valid)

0.7973213907657317

#### Hasil Model DecisionTreeClassifier <a id='dtc_model_result'></a>

**Insights:**
- Menggunakan Model DecisionTreeClassifier, saya tidak mendapatkan nilai f1 lebih dari 0.59 rata-rata model ini dapat menghasilkan nilai f1 +- 0.56

## Kesimpulan <a id='summary'></a>

Pada projek ini saya membuat sebuah Machine Learning sederhana dengan mempertimbangkan keseimbangan antara data positif dan negatif menggunkaan 3 metode, yaitu class_weight, upsampling, dan downsampling. Saya membuatnya menggunakan 3 model yang berbeda, tetapi nilai f1 tertinggi didapatkan oleh model RandomForest, lalu DecisionTree dan f1 terendah oleh Model LogisticRegression.