## Session 7: Supervised Machine Learning for Regression

In [1]:
import numpy as np
import pandas as pd

#### IMPORT DATASET

In [2]:
data = "https://raw.githubusercontent.com/stedy/Machine-Learning-with-R-datasets/master/insurance.csv"
df = pd.read_csv(data)

In [3]:
df

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


In [4]:
# Check for missing values
print("Missing values in the dataset:")
print(df.isnull().sum())

Missing values in the dataset:
age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64


In [5]:
# Convert categorical variables to numerical
df = pd.get_dummies(df, columns=['sex','smoker', 'region'])
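
As a small illustration of what `pd.get_dummies` does, the sketch below encodes a toy frame. The rows are invented for the example and are not taken from the dataset. `drop_first=True` is shown only as an optional variation: it drops one level per category, which avoids perfectly collinear columns in models such as OLS. The notebook itself keeps all levels.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [19, 18, 28],
    "smoker": ["yes", "no", "no"],
})

# One indicator column per category level, as in the cell above
full = pd.get_dummies(toy, columns=["smoker"], dtype=int)
print(full.columns.tolist())

# Optional variation: drop the first level of each category
reduced = pd.get_dummies(toy, columns=["smoker"], drop_first=True, dtype=int)
print(reduced.columns.tolist())
```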

In [6]:
# Separate the features and the target
X = df.drop(columns=['charges'])
y = df['charges']

#### Train Test Split

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [8]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#### Evaluation Function

In [9]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def train_and_evaluate_regression(model, X_train, X_test, y_train, y_test):
    # Fit the model on the training data
    model.fit(X_train, y_train)

    # Predict on the test data
    y_pred = model.predict(X_test)

    # Mean Absolute Error (MAE)
    mae = mean_absolute_error(y_test, y_pred)

    # Mean Squared Error (MSE)
    mse = mean_squared_error(y_test, y_pred)

    # Root Mean Squared Error (RMSE), computed as the square root of the MSE
    rmse = mse ** 0.5

    # R-squared (coefficient of determination)
    r2 = r2_score(y_test, y_pred)

    # Collect the evaluation results
    result = {'Mean Absolute Error': mae, 'Mean Squared Error': mse, 'Root Mean Squared Error': rmse, 'R-squared': r2}

    # Print the evaluation results
    print("Mean Absolute Error:", mae)
    print("Mean Squared Error:", mse)
    print("Root Mean Squared Error:", rmse)
    print("R-squared:", r2)

    return y_pred, result
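
To make the four metrics concrete, here is a minimal sketch that computes them by hand with NumPy. The values in `y_true` and `y_hat` are invented for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical true and predicted values, invented for illustration
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_hat

# MAE: average absolute error
mae = np.mean(np.abs(errors))

# MSE: average squared error
mse = np.mean(errors ** 2)

# RMSE: square root of the MSE, in the same units as the target
rmse = np.sqrt(mse)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mae, mse, rmse, r2)
```

An R-squared below zero means the residual sum of squares exceeds the total sum of squares, so the model does worse than always predicting the mean of the targets.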


#### Linear Regression

<img src = "https://media.geeksforgeeks.org/wp-content/uploads/20231129130431/11111111.png" width = 70%></img>

Linear regression is one of the simplest and most widely used statistical models. It assumes a linear relationship between the independent and dependent variables: a change in the dependent variable is proportional to the change in the independent variable.

A regression model is the equation that describes how the variable Y is related to the variable X and an error term. The model used in simple linear regression is:
<div align="center">
$$Y = b_0 + b_1 X + \varepsilon$$
</div>

where $b_0$ and $b_1$ are the model parameters (the intercept and the slope), $X$ is the independent variable, $Y$ is the dependent variable, and $\varepsilon$ is the error term.
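
For a single feature, the least-squares estimates of the two parameters have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below applies it to made-up data that lies exactly on the line y = 1 + 2x. The arrays are illustrative assumptions, not part of the dataset.

```python
import numpy as np

# Made-up data lying exactly on y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

x_mean, y_mean = x.mean(), y.mean()

# Slope: sum of cross-deviations over sum of squared deviations of x
b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)

# Intercept: the fitted line passes through the point of means
b0 = y_mean - b1 * x_mean

print(b0, b1)
```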

In [10]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
y_pred, result = train_and_evaluate_regression(model, X_train, X_test, y_train, y_test)

Mean Absolute Error: 4181.194473753641
Mean Squared Error: 33596915.85136145
Root Mean Squared Error: 5796.284659276273
R-squared: 0.7835929767120724


In [21]:
import statsmodels.api as sm

X_train_reg = sm.add_constant(X_train)

reg_model = sm.OLS(y_train, X_train_reg)
result = reg_model.fit()

In [22]:
print(result.summary())

                            OLS Regression Results                            
Dep. Variable:                charges   R-squared:                       0.742
Model:                            OLS   Adj. R-squared:                  0.740
Method:                 Least Squares   F-statistic:                     380.9
Date:                Fri, 10 May 2024   Prob (F-statistic):          1.32e-305
Time:                        09:46:18   Log-Likelihood:                -10845.
No. Observations:                1070   AIC:                         2.171e+04
Df Residuals:                    1061   BIC:                         2.175e+04
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
const             -255.3492    486.430  

#### Decision Tree

<img src = "https://miro.medium.com/v2/resize:fit:828/format:webp/1*ekWgr-yVc-ba6DHC_9FeRA.png" width = 50%></img>

Decision tree regression is a regression algorithm that builds a decision tree to predict the target value. A decision tree is a tree-like structure made of nodes and branches. Each internal node represents a decision (a test on a feature), and each branch represents an outcome of that decision. The goal is to build a tree that accurately predicts the target value for new data points.
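
The core step of growing a regression tree is choosing a split. A minimal sketch, assuming the squared-error criterion (the default in scikit-learn's `DecisionTreeRegressor`) and a single feature: try every threshold between consecutive sorted values and keep the one that minimizes the total squared error around the mean of each side. The helper `best_split` and the toy arrays are hypothetical names introduced for this illustration.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, sse) for the single split on x with the lowest total squared error."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold fits between equal values
        left, right = y_sorted[:i], y_sorted[i:]
        # Each side predicts its own mean; sum the squared errors of both sides
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
            best_sse = sse
    return best_threshold, best_sse

# Toy data with an obvious jump between x = 3 and x = 10
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 6.0, 5.0, 20.0, 21.0, 19.0])

threshold, sse = best_split(x, y)
print(threshold, sse)
```

A full tree repeats this search over all features and recurses into each side until a stopping rule is met.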

In [11]:
from sklearn.tree import DecisionTreeRegressor
model = DecisionTreeRegressor(random_state=12)
y_pred, result = train_and_evaluate_regression(model, X_train, X_test, y_train, y_test)

Mean Absolute Error: 3345.766503018657
Mean Squared Error: 50547861.88265708
Root Mean Squared Error: 7109.70195455879
R-squared: 0.6744072470225911


#### Random Forest

<img src ="https://cnvrg.io/wp-content/uploads/2021/02/Random-Forest-Algorithm-1024x576.jpg" width = 70%></img>

Random forest regression is an ensemble method that combines many decision trees to predict the target value. An ensemble method is a machine learning technique that combines several models to improve overall performance. Random forest regression builds a large number of decision trees, each trained on a different random subset of the training data. The final prediction is the average of the predictions of all trees.
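
The two ideas in this description, training each tree on a different resampled subset and averaging the predictions, can be sketched by hand. This is a simplified illustration on synthetic data, not how `RandomForestRegressor` is implemented internally. It omits the per-split feature subsampling that random forests can also use, and the names `X_demo`, `y_demo`, and `n_trees` are assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data, invented for illustration: y = x^2 plus noise
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = X_demo[:, 0] ** 2 + rng.normal(0, 0.5, size=200)

n_trees = 25
trees = []
for seed in range(n_trees):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    tree = DecisionTreeRegressor(random_state=seed)
    tree.fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# Final prediction: average of the individual tree predictions
X_new = np.array([[0.0], [2.0]])
per_tree = np.stack([t.predict(X_new) for t in trees])  # shape (n_trees, 2)
forest_pred = per_tree.mean(axis=0)

print(forest_pred)
```

Averaging smooths out the noise that any single deep tree picks up, which is why the ensemble usually generalizes better than one tree.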

In [12]:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=10)
y_pred, result = train_and_evaluate_regression(model, X_train, X_test, y_train, y_test)

Mean Absolute Error: 2530.1953826473527
Mean Squared Error: 21393212.71300119
Root Mean Squared Error: 4625.279744296683
R-squared: 0.8622004024932416


#### SVR

<img src = "https://www.researchgate.net/publication/359343222/figure/fig1/AS:1182277103042573@1658888235641/SVM-and-SVR-modeling-In-SVM-left-a-hyperplane-with-maximal-margin-is-constructed-to_W640.jpg" width = 50%></img>

Support Vector Regression (SVR) is a regression algorithm based on the support vector machine (SVM). SVMs are best known for classification tasks but can also be used for regression. Rather than minimizing the sum of squared residuals, SVR fits a function that is as flat as possible while ignoring errors smaller than a margin epsilon; only predictions that fall outside this epsilon-wide tube around the actual values are penalized.
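
The loss that SVR penalizes is the epsilon-insensitive loss: zero when the absolute error is at most epsilon, and the excess over epsilon otherwise. A minimal sketch follows. The helper name `epsilon_insensitive_loss` and the sample values are assumptions made for illustration, and `epsilon=0.1` matches the default of scikit-learn's `SVR`.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss: errors within +/- epsilon cost nothing."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

# Invented values: one prediction inside the tube, two outside
y_true = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.05, 2.5, 1.0])

losses = epsilon_insensitive_loss(y_true, y_hat, epsilon=0.1)
print(losses)
```

Unlike squared error, this loss grows only linearly with the size of the error, and small errors do not influence the fit at all.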

In [13]:
from sklearn.svm import SVR
model = SVR()
y_pred, result = train_and_evaluate_regression(model, X_train, X_test, y_train, y_test)

Mean Absolute Error: 8598.964701526631
Mean Squared Error: 166502152.13488975
Root Mean Squared Error: 12903.571293827525
R-squared: -0.07248639351177277
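
The negative R-squared above means the default `SVR` does worse than always predicting the mean. One plausible cause, stated here as an assumption and not verified on the insurance data, is that SVR is sensitive to the scale of both the features and the target. The sketch below uses synthetic data (all names and numbers are invented for illustration) to compare a default `SVR` with the same model wrapped in `StandardScaler` for the features and `TransformedTargetRegressor` for the target.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic data: two features on very different scales, target in the thousands
n = 300
X_syn = np.column_stack([rng.uniform(0, 1, n), rng.uniform(0, 1000, n)])
y_syn = 10000 * X_syn[:, 0] + 10 * X_syn[:, 1] + rng.normal(0, 500, n)

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=0
)

# Default SVR on raw features and raw target
raw_svr = SVR().fit(Xs_train, ys_train)
r2_raw = r2_score(ys_test, raw_svr.predict(Xs_test))

# Same SVR, with standardized features and a standardized target
scaled_svr = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR()),
    transformer=StandardScaler(),
)
scaled_svr.fit(Xs_train, ys_train)
r2_scaled = r2_score(ys_test, scaled_svr.predict(Xs_test))

print(f"raw: {r2_raw:.3f}  scaled: {r2_scaled:.3f}")
```

Whether the same wrapper helps on the insurance data would need to be checked by passing `scaled_svr` to `train_and_evaluate_regression`.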


#### KNN

<img src = "https://miro.medium.com/v2/resize:fit:828/format:webp/1*JBbZ9ert8sML5M3UjgVfoA.png" width = 50%></img>

K-Nearest Neighbors (KNN) is a non-parametric machine learning algorithm that can be used for both classification and regression. In the regression setting it is often called "KNN regression." It is a simple, intuitive algorithm that makes a prediction by finding the K training points closest to the given input and averaging their target values.
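
The procedure described above fits in a few lines of NumPy. This is a minimal sketch assuming Euclidean distance and an unweighted average. The helper `knn_predict` and the toy points are hypothetical and only illustrate the idea.

```python
import numpy as np

def knn_predict(X_known, y_known, x_query, k=3):
    """Average the targets of the k training points closest to x_query."""
    distances = np.linalg.norm(X_known - x_query, axis=1)
    nearest = np.argsort(distances)[:k]
    return y_known[nearest].mean()

# Toy one-dimensional data, invented for illustration
X_pts = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
y_pts = np.array([1.0, 2.0, 3.0, 20.0, 22.0])

# The three nearest neighbours of 1.2 are x = 1, 2 and 0
pred = knn_predict(X_pts, y_pts, np.array([1.2]), k=3)
print(pred)
```

Because the prediction depends directly on distances, features with large numeric ranges dominate the neighbour search unless the data is scaled first.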


In [14]:
from sklearn.neighbors import KNeighborsRegressor
model = KNeighborsRegressor(n_neighbors=3)
y_pred, result = train_and_evaluate_regression(model, X_train, X_test, y_train, y_test)

Mean Absolute Error: 6285.787042279851
Mean Squared Error: 109941146.97943063
Root Mean Squared Error: 10485.282398649577
R-squared: 0.2918386776947268


#### All Model

In [15]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

def auto_model(X_train, y_train, X_test, y_test):
    # Define the models to evaluate
    models = [
        ('Linear Regression', LinearRegression()),
        ('Support Vector Machine (SVM) Regression', SVR()),
        ('Decision Tree Regression', DecisionTreeRegressor(random_state=12)),
        ('Random Forest Regression', RandomForestRegressor(n_estimators=100, random_state=10)),
        ('K-Nearest Neighbors (KNN) Regression', KNeighborsRegressor(n_neighbors=3))
    ]

    # Initialize a table to store the evaluation results
    table = {
        'Model': [],
        'Mean Absolute Error': [],
        'Mean Squared Error': [],
        'Root Mean Squared Error': [],
        'R-squared': []
    }

    # Train and evaluate each model
    for name, model in models:
        y_pred, result = train_and_evaluate_regression(model, X_train, X_test, y_train, y_test)
        table['Model'].append(name)
        table['Mean Absolute Error'].append(result['Mean Absolute Error'])
        table['Mean Squared Error'].append(result['Mean Squared Error'])
        table['Root Mean Squared Error'].append(result['Root Mean Squared Error'])
        table['R-squared'].append(result['R-squared'])

    # Convert to a DataFrame
    hasil = pd.DataFrame(table)

    return hasil




In [16]:
# Call auto_model; note its argument order: X_train, y_train, X_test, y_test
hasil_evaluasi = auto_model(X_train, y_train, X_test, y_test)

Mean Absolute Error: 4181.194473753641
Mean Squared Error: 33596915.85136145
Root Mean Squared Error: 5796.284659276273
R-squared: 0.7835929767120724
Mean Absolute Error: 8598.964701526631
Mean Squared Error: 166502152.13488975
Root Mean Squared Error: 12903.571293827525
R-squared: -0.07248639351177277
Mean Absolute Error: 3345.766503018657
Mean Squared Error: 50547861.88265708
Root Mean Squared Error: 7109.70195455879
R-squared: 0.6744072470225911
Mean Absolute Error: 2530.1953826473527
Mean Squared Error: 21393212.71300119
Root Mean Squared Error: 4625.279744296683
R-squared: 0.8622004024932416
Mean Absolute Error: 6285.787042279851
Mean Squared Error: 109941146.97943063
Root Mean Squared Error: 10485.282398649577
R-squared: 0.2918386776947268


In [17]:
hasil_evaluasi

,Model,Mean Absolute Error,Mean Squared Error,Root Mean Squared Error,R-squared
0,Linear Regression,4181.194474,33596920.0,5796.284659,0.783593
1,Support Vector Machine (SVM) Regression,8598.964702,166502200.0,12903.571294,-0.072486
2,Decision Tree Regression,3345.766503,50547860.0,7109.701955,0.674407
3,Random Forest Regression,2530.195383,21393210.0,4625.279744,0.8622
4,K-Nearest Neighbors (KNN) Regression,6285.787042,109941100.0,10485.282399,0.291839


In [23]:
from catboost import CatBoostRegressor

In [24]:
model = CatBoostRegressor()

y_pred, result = train_and_evaluate_regression(model, X_train, X_test, y_train, y_test)



Learning rate set to 0.041383
0:	learn: 11664.1506160	total: 190ms	remaining: 3m 10s
1:	learn: 11302.4132197	total: 192ms	remaining: 1m 35s
2:	learn: 10972.4379054	total: 194ms	remaining: 1m 4s
3:	learn: 10636.5056116	total: 195ms	remaining: 48.6s
4:	learn: 10345.1670634	total: 197ms	remaining: 39.3s
5:	learn: 10056.4342446	total: 209ms	remaining: 34.7s
6:	learn: 9764.1444995	total: 215ms	remaining: 30.5s
7:	learn: 9502.3406312	total: 233ms	remaining: 28.9s
8:	learn: 9240.4787871	total: 237ms	remaining: 26.1s
9:	learn: 9004.4544077	total: 251ms	remaining: 24.8s
10:	learn: 8761.3282039	total: 253ms	remaining: 22.7s
11:	learn: 8558.6278643	total: 254ms	remaining: 20.9s
12:	learn: 8350.2691705	total: 263ms	remaining: 19.9s
13:	learn: 8176.8508745	total: 266ms	remaining: 18.7s
14:	learn: 7988.4325112	total: 269ms	remaining: 17.6s
15:	learn: 7803.2627158	total: 271ms	remaining: 16.7s
16:	learn: 7627.5679386	total: 274ms	remaining: 15.8s
17:	learn: 7464.8479684	total: 280ms	remaining: 15.3s


In [25]:
from lazypredict.Supervised import LazyRegressor

reg = LazyRegressor()
models,predictions = reg.fit(X_train, X_test, y_train, y_test)

Status is 4: Numerical difficulties encountered.
Result message of linprog:
The solution does not satisfy the constraints within the required tolerance of 3.16E-04, yet no errors were raised and there is no certificate of infeasibility or unboundedness. Check whether the slack and constraint residuals are acceptable; if not, consider enabling presolve, adjusting the tolerance option(s), and/or using a different method. Please consider submitting a bug report.
100%|██████████| 42/42 [02:22<00:00,  3.40s/it]


In [26]:
models

Model,Adjusted R-Squared,R-Squared,RMSE,Time Taken
GradientBoostingRegressor,0.87,0.88,4328.15,0.16
RandomForestRegressor,0.86,0.86,4582.97,0.88
LGBMRegressor,0.86,0.86,4601.62,37.39
HistGradientBoostingRegressor,0.86,0.86,4637.54,1.13
BaggingRegressor,0.84,0.85,4806.26,0.22
ExtraTreesRegressor,0.84,0.84,4939.41,0.99
XGBRegressor,0.84,0.84,4952.89,2.9
AdaBoostRegressor,0.81,0.82,5267.06,0.22
KNeighborsRegressor,0.79,0.8,5583.38,0.03
PoissonRegressor,0.79,0.8,5613.25,0.03
