## Welcome to Day 16 Hands On!

In [1]:
import pandas as pd
import numpy as np

### Part 1: Regularized Regression

In this session we download data from http://archive.ics.uci.edu/ml/datasets/Communities+and+Crime. The task is to predict the amount of `Violent Crime` in each location (each row represents one location). There are more than 120 feature columns and only 1 target column.

In other words, the data is wide: it has many columns, and some of them may be redundant or simply unimportant. If we use every column, the risk of overfitting is high. So let's see whether regularization helps.

The data also has many missing values. Out of almost 2000 rows, only about 300 are complete (no missing values). These missing values are also hard to impute because there are so many columns.
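To make the idea of "complete rows" concrete, here is a minimal sketch on a hypothetical toy frame (the column names `a`, `b`, `c` and the values are assumptions for illustration, not the crime data). It counts the missing cells per column and the rows that have no missing cell at all.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the crime data; NaN marks a missing value.
toy = pd.DataFrame({
    "a": [0.1, np.nan, 0.3, 0.4],
    "b": [1.0, 2.0, np.nan, 4.0],
    "c": [5.0, 6.0, 7.0, 8.0],
})

# Number of missing cells in each column
missing_per_column = toy.isna().sum()

# Number of rows in which every cell is present
complete_rows = int(toy.notna().all(axis=1).sum())

print(missing_per_column.to_dict())
print(f"{complete_rows} of {len(toy)} rows are complete")
```

The same two expressions can be applied to a real frame before calling `dropna()` to see how much data would be lost.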

This is an example of 'poor' quality data: few rows, a very large number of columns for which we lack domain knowledge, and many missing values.

In [2]:
from sklearn.linear_model import LinearRegression

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_squared_error as mse

In [3]:
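url = 'https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/communities.data'
crime = pd.read_csv(url, header=None, na_values=['?'])

# Keep columns 5 onward (features plus the target in column 127).
# .copy() avoids a SettingWithCopyWarning when a column is added below.
df = crime.iloc[:, 5:].copy()
df['location'] = crime[3]  # column 3 holds the community name

# Drop every row that has at least one missing value
df = df.dropna().reset_index(drop=True)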

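|  | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | ... | 119 | 120 | 121 | 122 | 123 | 124 | 125 | 126 | 127 | location |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.19 | 0.33 | 0.02 | 0.9 | 0.12 | 0.17 | 0.34 | 0.47 | 0.29 | 0.32 | ... | 0.26 | 0.2 | 0.06 | 0.04 | 0.9 | 0.5 | 0.32 | 0.14 | 0.2 | Lakewoodcity |
| 1 | 0.15 | 0.31 | 0.4 | 0.63 | 0.14 | 0.06 | 0.58 | 0.72 | 0.65 | 0.47 | ... | 0.39 | 0.84 | 0.06 | 0.06 | 0.91 | 0.5 | 0.88 | 0.26 | 0.49 | Albanycity |
| 2 | 0.25 | 0.54 | 0.05 | 0.71 | 0.48 | 0.3 | 0.42 | 0.48 | 0.28 | 0.32 | ... | 0.46 | 0.05 | 0.09 | 0.05 | 0.88 | 0.5 | 0.76 | 0.13 | 0.34 | Modestocity |
| 3 | 1.0 | 0.42 | 0.47 | 0.59 | 0.12 | 0.05 | 0.41 | 0.53 | 0.34 | 0.33 | ... | 0.07 | 0.15 | 1.0 | 0.35 | 0.73 | 0.0 | 0.31 | 0.21 | 0.69 | Jacksonvillecity |
| 4 | 0.11 | 0.43 | 0.04 | 0.89 | 0.09 | 0.06 | 0.45 | 0.48 | 0.31 | 0.46 | ... | 0.12 | 0.07 | 0.04 | 0.01 | 0.81 | 1.0 | 0.56 | 0.09 | 0.63 | SiouxCitycity |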


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319 entries, 0 to 318
Columns: 124 entries, 5 to location
dtypes: float64(123), object(1)
memory usage: 309.2+ KB


In [9]:
X = df.drop(127, axis=1)
X = X.drop('location', axis = 1)
y = df[127]

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size = 0.2)

In [11]:
def k_fold_eval(model):
    """Run 5-fold CV on the training set, then refit and score on the test set."""
    kf = KFold(n_splits=5)
    rmse_list = []

    for train, val in kf.split(X_train):
        train_features = X_train.iloc[train]
        train_target = y_train.iloc[train]

        val_features = X_train.iloc[val]
        val_target = y_train.iloc[val]

        # Fit on four folds, validate on the remaining fold
        ml_model = model.fit(train_features, train_target)
        prediction = ml_model.predict(val_features)

        rmse_score = np.sqrt(mse(val_target, prediction))
        rmse_list.append(rmse_score)

    print('RMSE Scores:')
    print(rmse_list)
    print('')
    print(f'Average RMSE Score: {np.mean(rmse_list)}')
    print('')

    # Refit on the full training set and evaluate once on the held-out test set
    ml_model_final = model.fit(X_train, y_train)
    test_prediction = ml_model_final.predict(X_test)
    rmse_final = np.sqrt(mse(y_test, test_prediction))

    print(f'RMSE Evaluate on Test Set: {rmse_final}')
    return ml_model_final
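The manual fold loop in `k_fold_eval` can also be written with `cross_val_score`, which is imported at the top of this part. The sketch below is an illustration on synthetic stand-in data (`X_demo`, `y_demo` and the coefficients used to generate them are assumptions), not a replacement for the function above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (assumption): 100 rows, 5 features in [0, 1].
rng = np.random.default_rng(0)
X_demo = rng.random((100, 5))
y_demo = X_demo @ np.array([0.5, -0.2, 0.0, 0.3, 0.1]) + rng.normal(0, 0.05, 100)

# scikit-learn scorers follow "higher is better", so RMSE comes back negated.
neg_rmse = cross_val_score(
    LinearRegression(),
    X_demo,
    y_demo,
    cv=KFold(n_splits=5),
    scoring="neg_root_mean_squared_error",
)
rmse_per_fold = -neg_rmse

print("RMSE per fold:", rmse_per_fold)
print("Average RMSE:", rmse_per_fold.mean())
```

Unlike `k_fold_eval`, this sketch only reports the cross-validation scores; refitting on the full training set and scoring on the test set would still be a separate step.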

In [19]:
linear_reg = k_fold_eval(LinearRegression())
print('Coefficient:', linear_reg.coef_[1])

RMSE Scores:
[0.20371217588591636, 0.25960386793372553, 0.24344012193554235, 0.2575351704674003, 0.28869148341737866]

Average RMSE Score: 0.2505965639279926

RMSE Evaluate on Test Set: 0.23422645103031833
Coefficient: 0.7435816979912195


Okay, with Linear Regression we get an average RMSE of 0.25 in cross-validation and an RMSE of 0.23 on the test set.

In [22]:
from sklearn.linear_model import Ridge
ridge_reg = k_fold_eval(Ridge())
print('Coefficient:', ridge_reg.coef_[1])

RMSE Scores:
[0.1538000424041132, 0.21349040597832783, 0.1848617101765152, 0.16674280558306048, 0.19832156274921262]

Average RMSE Score: 0.18344330537824588

RMSE Evaluate on Test Set: 0.16683820768815266
Coefficient: 0.026587211841090044


In [23]:
from sklearn.linear_model import Lasso
lasso_reg = k_fold_eval(Lasso())
print('Coefficient:', lasso_reg.coef_[1])

RMSE Scores:
[0.26494020130075124, 0.3072293744166097, 0.2959851861025427, 0.2707031627558652, 0.2784957810065473]

Average RMSE Score: 0.28347074111646325

RMSE Evaluate on Test Set: 0.24938730808883183
Coefficient: 0.0


In [30]:
from sklearn.linear_model import ElasticNet
elastic_net = k_fold_eval(ElasticNet())
print('Coefficient:', elastic_net.coef_[1])

RMSE Scores:
[0.26494020130075124, 0.3072293744166097, 0.2959851861025427, 0.2707031627558652, 0.2784957810065473]

Average RMSE Score: 0.28347074111646325

RMSE Evaluate on Test Set: 0.24938730808883183
Coefficient: 0.0


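On this data, the most effective regularization technique turns out to be Ridge Regression: its average cross-validation RMSE is 0.18, against 0.25 for plain Linear Regression. Lasso and ElasticNet, both run with their default `alpha`, return identical scores and a coefficient of 0.0, which suggests the penalty was strong enough to shrink every coefficient to zero.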
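The Lasso and ElasticNet runs above print a coefficient of 0.0. A minimal sketch of why that can happen, on synthetic stand-in data with features in [0, 1] (the data and the two `alpha` values are assumptions chosen for illustration): with the default `alpha=1.0` the L1 penalty can outweigh every feature, while a much smaller `alpha` keeps some coefficients.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic stand-in data (assumption): only the first two features matter.
rng = np.random.default_rng(0)
X_demo = rng.random((200, 10))
true_coef = np.array([0.5, -0.3] + [0.0] * 8)
y_demo = X_demo @ true_coef + rng.normal(0, 0.05, 200)

strong = Lasso(alpha=1.0).fit(X_demo, y_demo)   # scikit-learn's default alpha
weak = Lasso(alpha=0.001).fit(X_demo, y_demo)   # much weaker penalty

print("non-zero coefficients, alpha=1.0  :", int(np.count_nonzero(strong.coef_)))
print("non-zero coefficients, alpha=0.001:", int(np.count_nonzero(weak.coef_)))
```

Whether a smaller `alpha` would let Lasso or ElasticNet beat Ridge on the crime data is not shown here; that would need to be checked with the same cross-validation procedure.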

How does Ridge Regression differ from Linear Regression? Recall from the theory session that Ridge Regression adds a penalty on the size of the regression coefficients, so it pushes them to be as small as possible. Let's compare the regression coefficients of Ridge and Linear Regression!

In [12]:
np.abs(ridge_reg.coef_).sum()

6.339058904227911

In [13]:
np.abs(linear_reg.coef_).sum()

148.75334563710976
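The gap between the two sums above reflects how the ridge penalty shrinks coefficients. The sketch below reproduces the effect on synthetic data using the closed-form ridge solution, with `alpha=0` corresponding to ordinary least squares. The helper `ridge_coefficients` and the data are assumptions written for this illustration.

```python
import numpy as np

# Synthetic stand-in data (assumption): 50 rows, 20 features in [0, 1].
rng = np.random.default_rng(0)
X_demo = rng.random((50, 20))
y_demo = rng.random(50)

def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution on centered data (the intercept is not penalized)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_features = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)

l2_norms = {}
for alpha in [0.0, 1.0, 10.0]:
    coef = ridge_coefficients(X_demo, y_demo, alpha)
    l2_norms[alpha] = np.linalg.norm(coef)
    print(f"alpha={alpha:>4}: sum of |coef| = {np.abs(coef).sum():.3f}, "
          f"L2 norm = {l2_norms[alpha]:.3f}")
```

The L2 norm of the ridge solution decreases as `alpha` grows, and the sum of absolute coefficients usually follows the same trend.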

### Part 2: K-Nearest Neighbors

In [14]:
heart_data = pd.read_csv('heart.csv')

In [15]:
heart_data

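|  | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 298 | 57 | 0 | 0 | 140 | 241 | 0 | 1 | 123 | 1 | 0.2 | 1 | 0 | 3 | 0 |
| 299 | 45 | 1 | 3 | 110 | 264 | 0 | 1 | 132 | 0 | 1.2 | 1 | 0 | 3 | 0 |
| 300 | 68 | 1 | 0 | 144 | 193 | 1 | 1 | 141 | 0 | 3.4 | 1 | 2 | 3 | 0 |
| 301 | 57 | 1 | 0 | 130 | 131 | 0 | 1 | 115 | 1 | 1.2 | 1 | 1 | 3 | 0 |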


In [16]:
heart_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 14 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       303 non-null    int64  
 1   sex       303 non-null    int64  
 2   cp        303 non-null    int64  
 3   trestbps  303 non-null    int64  
 4   chol      303 non-null    int64  
 5   fbs       303 non-null    int64  
 6   restecg   303 non-null    int64  
 7   thalach   303 non-null    int64  
 8   exang     303 non-null    int64  
 9   oldpeak   303 non-null    float64
 10  slope     303 non-null    int64  
 11  ca        303 non-null    int64  
 12  thal      303 non-null    int64  
 13  target    303 non-null    int64  
dtypes: float64(1), int64(13)
memory usage: 33.3 KB


This dataset contains patient records from a hospital. A `target` value of '0' means the patient is healthy, while a `target` value of '1' means the patient was diagnosed with heart disease.

In [17]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

In [18]:
X = heart_data.drop('target', axis = 'columns')
y = heart_data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size = 0.2)

In [19]:
kf = KFold(n_splits = 5)
report_dictionary = {
    'k':[],
    'cross_val_avg_acc':[],
    'test_data_acc':[]
}

for k in range(1, 21):  # try k = 1 .. 20
    acc_list = []
    for train, val in kf.split(X_train):
        train_features = X_train.iloc[train]
        train_target = y_train.iloc[train]

        val_features = X_train.iloc[val]
        val_target = y_train.iloc[val]

        # Fit on four folds, validate on the remaining fold
        ml_model = KNeighborsClassifier(n_neighbors=k).fit(train_features, train_target)
        prediction = ml_model.predict(val_features)

        acc_list.append(accuracy_score(val_target, prediction))

    average_accuracy = np.mean(acc_list)
    report_dictionary['k'].append(k)
    report_dictionary['cross_val_avg_acc'].append(average_accuracy)

    # Refit on the full training set and score on the test set
    ml_model_final = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    prediction_test = ml_model_final.predict(X_test)
    final_acc = accuracy_score(y_test, prediction_test)

    report_dictionary['test_data_acc'].append(final_acc)
    

In [20]:
pd.DataFrame(report_dictionary).sort_values(['test_data_acc'], ascending = False).head(5)

|  | k | cross_val_avg_acc | test_data_acc |
|---|---|---|---|
| 17 | 18 | 0.632483 | 0.639344 |
| 19 | 20 | 0.636565 | 0.622951 |
| 15 | 16 | 0.61165 | 0.622951 |
| 14 | 15 | 0.628231 | 0.622951 |
| 18 | 19 | 0.644813 | 0.606557 |


In [21]:
pd.DataFrame(report_dictionary).sort_values(['cross_val_avg_acc'], ascending = False).head(5)

|  | k | cross_val_avg_acc | test_data_acc |
|---|---|---|---|
| 2 | 3 | 0.677891 | 0.57377 |
| 3 | 4 | 0.64898 | 0.540984 |
| 4 | 5 | 0.648895 | 0.57377 |
| 6 | 7 | 0.644813 | 0.57377 |
| 18 | 19 | 0.644813 | 0.606557 |


Conclusions:
- The highest average cross-validation accuracy goes to k = 3 and k = 4, but both KNN models score poorly on the test data. This is a sign of overfitting: the model performs well on the data used to select it, but poorly on the test set.

- k = 18 looks optimal because its accuracy is comparatively high and nearly equal on cross-validation (0.63) and on the test set (0.64).

Therefore, k = 18 is taken as the optimal value of 'k'.
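The sweep over k can also be expressed with `GridSearchCV`. The sketch below is an alternative illustration on synthetic stand-in data (generated with `make_classification`; the sizes are assumptions), not the procedure used above: it ranks k by cross-validation accuracy only and keeps the test set for a single final check.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption), roughly the size of the heart dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

# Try k = 1 .. 20 with 5-fold cross-validation on the training split only.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 21))},
    cv=KFold(n_splits=5),
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

best_k = search.best_params_["n_neighbors"]
test_accuracy = search.score(X_te, y_te)  # uses the refitted best model

print("k with the best cross-validation accuracy:", best_k)
print("cross-validation accuracy:", search.best_score_)
print("test accuracy of that model:", test_accuracy)
```

Per-k results are available in `search.cv_results_`, which can be loaded into a DataFrame much like `report_dictionary`.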

### Part 3: Random Forest Tuning

For Random Forest we use the Apartment dataset that you analyzed in the EDA session. The difference is that this time we start directly from a version of the dataset that has already been cleaned of 'anomalies'.

In [2]:
apt = pd.read_csv('cleaned_apt_data.csv')

In [3]:
apt.head()

|  | No_Rooms | Bathroom | Longitude | Latitude | Furnished | Area | Total_Facilities | AnnualPrice |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 106.819159 | -6.226598 | 1 | 43.0 | 23 | 96000000 |
| 1 | 2 | 1 | 106.756061 | -6.192081 | 0 | 35.0 | 19 | 30000000 |
| 2 | 2 | 1 | 106.757651 | -6.186415 | 1 | 53.0 | 22 | 70000000 |
| 3 | 2 | 2 | 106.7846 | -6.272637 | 1 | 85.0 | 24 | 576000000 |
| 4 | 2 | 1 | 106.796056 | -6.153652 | 0 | 48.0 | 15 | 32000000 |


In [4]:
X = apt.drop('AnnualPrice', axis = 'columns')
y = apt['AnnualPrice']

In [7]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error as mse

base_model = RandomForestRegressor()

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size = 0.2)

In [27]:
base_model.fit(X_train, y_train)
base_pred = base_model.predict(X_test)
base_rmse = np.sqrt(mse(y_test, base_pred))
print('Base Model has RMSE:', base_rmse)
print('Base Model has R2-Score:', r2_score(y_test, base_pred))

Base Model has RMSE: 28305040.367174413
Base Model has R2-Score: 0.9045058622190447
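For reference, both metrics printed above can be computed by hand. The sketch below uses hypothetical prices and predictions (the numbers are made up for illustration): RMSE is the square root of the mean squared error, and R2 is one minus the ratio of residual to total sum of squares.

```python
import numpy as np

# Hypothetical prices and predictions, for illustration only.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 310.0, 390.0])

# RMSE: typical size of the prediction error, in the same unit as the target
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R2: share of the target's variance explained by the predictions
ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot

print("RMSE:", rmse)
print("R2:", r2)
```

Because RMSE carries the unit of `AnnualPrice`, a value in the tens of millions is best read relative to typical prices, whereas R2 is unit-free.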


In [8]:
from sklearn.model_selection import RandomizedSearchCV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
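# Number of features to consider at every split
# Note: 'auto' was removed in scikit-learn 1.3; for a regressor it meant
# "use all features", which newer versions express as max_features=1.0
max_features = ['auto', 'sqrt']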
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

print(random_grid)

{'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None], 'min_samples_split': [2, 5, 10], 'min_samples_leaf': [1, 2, 4]}
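To get a feel for the size of this grid, the sketch below counts the number of possible combinations and the share that a random search with 100 iterations would visit. The dictionary is rebuilt here so the snippet runs on its own; `n_iter = 100` is the value used for `RandomizedSearchCV` in this notebook.

```python
import math

# Same candidate lists as the grid printed above.
random_grid = {
    "n_estimators": list(range(200, 2001, 200)),
    "max_features": ["auto", "sqrt"],
    "max_depth": list(range(10, 111, 10)) + [None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# The full grid is the product of the number of candidates per parameter.
total_combinations = math.prod(len(values) for values in random_grid.values())
n_iter = 100  # number of random combinations to try

print("Total combinations:", total_combinations)
print(f"Share sampled with n_iter={n_iter}: {n_iter / total_combinations:.1%}")
```

An exhaustive grid search with 5-fold cross-validation would need five fits per combination, which is why only a random subset is tried.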


In [29]:
# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestRegressor()
# Random search of parameters, using 5 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                              n_iter = 100, cv = 5, verbose=3, random_state=42, n_jobs=-1,
                              return_train_score=True)

# Fit the random search model
rf_random.fit(X_train, y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


RandomizedSearchCV(cv=5, estimator=RandomForestRegressor(), n_iter=100,
                   n_jobs=-1,
                   param_distributions={'max_depth': [10, 20, 30, 40, 50, 60,
                                                      70, 80, 90, 100, 110,
                                                      None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [200, 400, 600, 800,
                                                         1000, 1200, 1400, 1600,
                                                         1800, 2000]},
                   random_state=42, return_train_score=True, verbose=3)

In [30]:
rf_random.best_params_

{'n_estimators': 1600,
 'min_samples_split': 2,
 'min_samples_leaf': 1,
 'max_features': 'auto',
 'max_depth': 30}

In [31]:
new_pred = rf_random.best_estimator_.predict(X_test)
new_rmse = np.sqrt(mse(y_test, new_pred))
print('New Model has RMSE:', new_rmse)
print('New Model has R2-Score:', r2_score(y_test, new_pred))

New Model has RMSE: 28122212.542384055
New Model has R2-Score: 0.9057355089359399


In [33]:
print('Improvement of:', ((base_rmse - new_rmse)/base_rmse)*100, '%')

Improvement of: 0.64591967514869 %
