We start by loading and exploring the data.

In [10]:
import pandas as pd

# Read winequality-red.csv
file_path = '/content/winequality-red.csv'
data = pd.read_csv(file_path)

# Show the first 5 rows for an initial look
data.head()

Unnamed: 0,"fixed acidity;""volatile acidity"";""citric acid"";""residual sugar"";""chlorides"";""free sulfur dioxide"";""total sulfur dioxide"";""density"";""pH"";""sulphates"";""alcohol"";""quality"""
0,7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
1,7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
2,7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;...
3,11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58...
4,7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5


The columns were not split correctly. The file appears to use a semicolon (;) rather than a comma (,) as its delimiter, so each row was read into a single column. We re-read the file with the correct separator.
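Instead of spotting the delimiter by eye, it can also be detected programmatically. The following is a minimal sketch using the standard library's `csv.Sniffer`. The `sample` string is a hypothetical, shortened stand-in for the first lines of the file, not the real data.

```python
import csv
import io

# Hypothetical sample mimicking the quoted header and two shortened data rows.
sample = (
    '"fixed acidity";"volatile acidity";"citric acid";"quality"\n'
    '7.4;0.7;0;5\n'
    '7.8;0.88;0;5\n'
)

# Restrict the candidates so the sniffer only chooses between comma and semicolon.
dialect = csv.Sniffer().sniff(sample, delimiters=",;")
print(repr(dialect.delimiter))

# Parse the sample with the detected dialect to confirm the columns separate.
rows = list(csv.reader(io.StringIO(sample), dialect))
print(rows[0])
```

The detected delimiter could then be passed to `pd.read_csv` through its `sep` argument.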

In [11]:
# Re-read the data with a semicolon (;) separator
data = pd.read_csv(file_path, sep=';')

# Show the first 5 rows to confirm the file is parsed correctly
data.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


The data is now parsed correctly. The next steps are:

1. Review the data: check for missing values, unsuitable data types, and outliers.
2. Separate features and target: the "quality" column is the target, and all other columns are features.
3. Normalize the data: all numeric features are scaled so the model is easier to train.

We begin with the data review and preprocessing.
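The review step mentions outliers. One common way to flag them is Tukey's 1.5 × IQR rule, sketched below in plain Python. This rule is an assumption chosen for illustration, not a method the notebook itself applies, and the `residual_sugar` values are hypothetical.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical "residual sugar" readings; 15.5 is deliberately extreme.
residual_sugar = [1.9, 2.6, 2.3, 1.9, 1.8, 2.0, 2.2, 2.1, 15.5]
print(iqr_outliers(residual_sugar))
```

The same function could be applied to each feature column in turn, for example over `data[column].tolist()`.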

In [12]:
# Inspect column types and non-null counts
data.info()

# Count missing values per column
missing_values = data.isnull().sum()

# Show only the columns that have missing values, if any
missing_values[missing_values > 0]


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


Unnamed: 0,0


The data has no missing values, and every column has a suitable numeric type. Next:

1. Separate features and target: "quality" becomes the target, and the remaining columns become the features.
2. Normalize the data: the features are scaled to the range [0, 1].
3. Split the data: divide the data into training and test sets.

We continue with these steps.
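Scaling to [0, 1] maps each value x to (x - min) / (max - min). The sketch below shows that arithmetic in plain Python. It learns the minimum and maximum from the training values only and reuses them for the test values, a common way to keep test-set statistics out of preprocessing. The helper names and the alcohol values are hypothetical.

```python
def fit_min_max(column):
    """Learn the minimum and maximum of one feature column."""
    return min(column), max(column)

def apply_min_max(column, lo, hi):
    """Scale values with x' = (x - lo) / (hi - lo)."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in column]  # constant column: map everything to 0
    return [(x - lo) / span for x in column]

# Hypothetical alcohol values, not taken from the dataset.
alcohol_train = [9.4, 9.8, 10.5, 12.0, 11.0]
alcohol_test = [9.0, 10.7, 12.5]

lo, hi = fit_min_max(alcohol_train)
train_scaled = apply_min_max(alcohol_train, lo, hi)
test_scaled = apply_min_max(alcohol_test, lo, hi)
print(train_scaled)
print(test_scaled)
```

With this approach, test values outside the training range fall slightly outside [0, 1], which is expected.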

In [29]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Separate features and target
X = data.drop(columns=['quality'])  # features
y = data['quality']                 # target

# Shift the labels so they start at 0, since CrossEntropyLoss expects class indices
y = y - y.min()

# The labels now span [0, num_classes - 1], provided no quality level in between is missing
num_classes = len(y.unique())

# Scale the features to [0, 1]
# Note: the scaler is fitted on the full dataset here, so test-set minima and maxima influence the scaling
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Split into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Check the shapes after the split
X_train.shape, X_test.shape, y_train.shape, y_test.shape


((1279, 11), (320, 11), (1279,), (320,))

The data has been normalized and split:

- Training set: 1279 samples with 11 features.
- Test set: 320 samples with 11 features.
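The label shift `y - y.min()` in the preprocessing cell yields valid class indices only when every quality level between the minimum and the maximum actually occurs. As a hedged alternative, the sketch below maps the sorted distinct labels to consecutive indices, which also works when a level is missing. The `encode_labels` helper and the `quality` list are hypothetical.

```python
def encode_labels(labels):
    """Map each distinct label to an index in 0..num_classes-1."""
    classes = sorted(set(labels))
    to_index = {label: i for i, label in enumerate(classes)}
    return [to_index[label] for label in labels], classes

# Hypothetical quality scores with a gap: no wine is rated 4.
quality = [5, 5, 6, 3, 7, 8, 6]

encoded, classes = encode_labels(quality)
num_classes = len(classes)

# Subtracting the minimum leaves the gap in place, so the largest
# shifted label can exceed num_classes - 1.
shifted = [q - min(quality) for q in quality]
print(encoded, num_classes)
print(shifted, max(shifted))
```

Keeping `classes` around makes it possible to translate a predicted index back to the original quality score.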

The next step is to build a Bidirectional RNN model with PyTorch. We start by defining the basic architecture and explaining its implementation.

In [14]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Convert the data to PyTorch tensors
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.long)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.long)

# Create DataLoaders for batch processing
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

# Bidirectional RNN model definition
class BidirectionalRNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.hidden_size = hidden_size
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_size * 2, num_classes)  # *2 because the RNN is bidirectional

    def forward(self, x):
        # Tabular rows arrive as (batch, features); treat each row as a one-step sequence
        if x.dim() == 2:
            x = x.unsqueeze(1)

        # Initial hidden state on the same device as the input; 2 = one per direction
        h0 = torch.zeros(2, x.size(0), self.hidden_size, device=x.device)

        # Forward pass through the RNN
        out, _ = self.rnn(x, h0)

        # Take the output at the last time step
        out = out[:, -1, :]

        # Fully connected layer
        out = self.fc(out)
        return out

# Model parameters
input_size = X_train.shape[1]
hidden_size = 64  # hidden size to experiment with
num_classes = len(y.unique())  # number of output classes (from the "quality" target)

# Instantiate the model
model = BidirectionalRNN(input_size, hidden_size, num_classes)

# Show the model architecture
print(model)


BidirectionalRNN(
  (rnn): RNN(11, 64, batch_first=True, bidirectional=True)
  (fc): Linear(in_features=128, out_features=6, bias=True)
)


The Bidirectional RNN model is defined with the following parameters:

- Input size: 11 (the number of features).
- Hidden size: 64 (can be changed for experiments).
- Number of classes: 6, the number of distinct values in the "quality" target.
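To get a feel for the model's size, the parameter count can be worked out by hand from these three numbers. The sketch below assumes the layout of a single-layer `nn.RNN` with biases: per direction, an input-to-hidden weight matrix, a hidden-to-hidden weight matrix, and two bias vectors. The helper function names are hypothetical.

```python
def rnn_param_count(input_size, hidden_size, bidirectional=True):
    """Parameters of a single-layer Elman RNN with biases.

    Each direction holds W_ih (hidden x input), W_hh (hidden x hidden),
    and two bias vectors of length hidden.
    """
    per_direction = (
        hidden_size * input_size
        + hidden_size * hidden_size
        + 2 * hidden_size
    )
    return per_direction * (2 if bidirectional else 1)

def linear_param_count(in_features, out_features):
    """Weight matrix plus bias vector of a fully connected layer."""
    return in_features * out_features + out_features

input_size, hidden_size, num_classes = 11, 64, 6
rnn_params = rnn_param_count(input_size, hidden_size)
# The linear layer sees both directions concatenated: 2 * 64 = 128 inputs.
fc_params = linear_param_count(2 * hidden_size, num_classes)
print(rnn_params, fc_params, rnn_params + fc_params)
```

In PyTorch, the total could be cross-checked with `sum(p.numel() for p in model.parameters())`.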

**Building the Bidirectional RNN model with PyTorch**

In [41]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# RNN model with a configurable pooling step
class RNNModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes, pooling_type='max'):
        super().__init__()
        self.hidden_size = hidden_size
        self.pooling_type = pooling_type
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_size * 2, num_classes)  # *2 because the RNN is bidirectional

    def forward(self, x):
        # The RNN expects input of shape (batch_size, seq_len, input_size)
        if x.dim() == 2:
            x = x.unsqueeze(1)  # add a sequence dimension of length 1

        # Initial hidden state
        batch_size = x.size(0)
        h0 = torch.zeros(2, batch_size, self.hidden_size).to(x.device)  # 2 for bidirectional

        # Forward pass through the RNN
        out, _ = self.rnn(x, h0)

        # Pool over the time dimension
        # (with seq_len == 1, max and mean pooling return the same values)
        if self.pooling_type == 'max':
            out, _ = torch.max(out, dim=1)
        elif self.pooling_type == 'avg':
            out = torch.mean(out, dim=1)
        else:
            raise ValueError("pooling_type must be 'max' or 'avg'")

        # Fully connected layer
        out = self.fc(out)
        return out

# Training function (uses the module-level `device` defined below)
def train_model(model, optimizer, criterion, train_loader, test_loader, epochs, scheduler=None):
    for epoch in range(epochs):
        model.train()
        train_loss = 0
        for X_batch, y_batch in train_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            optimizer.zero_grad()
            y_pred = model(X_batch)

            # Sanity check: the model must return an output
            assert y_pred is not None, "Model returned None as output."

            loss = criterion(y_pred, y_batch)
            loss.backward()
            optimizer.step()
            train_loss += loss.item()

        if scheduler:
            scheduler.step()

        # Evaluation on the test set
        model.eval()
        test_loss = 0
        correct = 0
        with torch.no_grad():
            for X_batch, y_batch in test_loader:
                X_batch, y_batch = X_batch.to(device), y_batch.to(device)
                y_pred = model(X_batch)

                # Sanity check: the model must return an output
                assert y_pred is not None, "Model returned None during evaluation."

                loss = criterion(y_pred, y_batch)
                test_loss += loss.item()
                correct += (y_pred.argmax(1) == y_batch).sum().item()

        print(f'Epoch {epoch+1}/{epochs}, Loss: {train_loss/len(train_loader):.4f}, '
              f'Test Loss: {test_loss/len(test_loader):.4f}, Accuracy: {correct/len(test_loader.dataset):.4f}')

# Hyperparameters and data preparation
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Datasets and DataLoaders
X_train_tensor = torch.tensor(X_train, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train.values, dtype=torch.long)
X_test_tensor = torch.tensor(X_test, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test.values, dtype=torch.long)

train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

# Experiments
input_size = X_train.shape[1]
num_classes = len(y.unique())
hidden_sizes = [32, 64, 128]
optimizers = {'SGD': optim.SGD, 'RMSProp': optim.RMSprop, 'Adam': optim.Adam}
pooling_types = ['max', 'avg']

for hidden_size in hidden_sizes:
    for pooling_type in pooling_types:
        for opt_name, opt_func in optimizers.items():
            print(f"\nExperiment: Hidden Size={hidden_size}, Pooling={pooling_type}, Optimizer={opt_name}")
            model = RNNModel(input_size, hidden_size, num_classes, pooling_type).to(device)
            optimizer = opt_func(model.parameters(), lr=0.01)
            criterion = nn.CrossEntropyLoss()
            # With step_size=50 and 50 epochs, the learning rate stays at 0.01 for the whole run
            scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)

            train_model(model, optimizer, criterion, train_loader, test_loader, epochs=50, scheduler=scheduler)



Experiment: Hidden Size=32, Pooling=max, Optimizer=SGD
Epoch 1/50, Loss: 1.6616, Test Loss: 1.5746, Accuracy: 0.4062
Epoch 2/50, Loss: 1.4948, Test Loss: 1.4414, Accuracy: 0.4062
Epoch 3/50, Loss: 1.3835, Test Loss: 1.3541, Accuracy: 0.4062
Epoch 4/50, Loss: 1.3122, Test Loss: 1.2991, Accuracy: 0.4094
Epoch 5/50, Loss: 1.2676, Test Loss: 1.2650, Accuracy: 0.4094
Epoch 6/50, Loss: 1.2395, Test Loss: 1.2428, Accuracy: 0.4125
Epoch 7/50, Loss: 1.2213, Test Loss: 1.2280, Accuracy: 0.4281
Epoch 8/50, Loss: 1.2088, Test Loss: 1.2174, Accuracy: 0.4531
Epoch 9/50, Loss: 1.1999, Test Loss: 1.2097, Accuracy: 0.4625
Epoch 10/50, Loss: 1.1928, Test Loss: 1.2035, Accuracy: 0.4781
Epoch 11/50, Loss: 1.1876, Test Loss: 1.1990, Accuracy: 0.4437
Epoch 12/50, Loss: 1.1831, Test Loss: 1.1952, Accuracy: 0.4375
Epoch 13/50, Loss: 1.1794, Test Loss: 1.1909, Accuracy: 0.4875
Epoch 14/50, Loss: 1.1761, Test Loss: 1.1881, Accuracy: 0.4531
Epoch 15/50, Loss: 1.1731, Test Loss: 1.1847, Accuracy: 0.4938
Epoch 16

The results show that RMSProp outperforms SGD: its loss falls faster (to 0.98) and its accuracy is higher (54.8% vs 52%). However, accuracy plateaus toward the end of training, so further hyperparameter tuning or regularization is needed for better results.
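One hyperparameter worth checking is the learning-rate schedule. The sketch below reproduces the step-decay rule documented for `StepLR`, lr = initial_lr × gamma^floor(epoch / step_size), in plain Python, assuming the scheduler is stepped once per epoch as in `train_model`. The `step_lr` helper and the alternative `step_size=20` are hypothetical choices for illustration.

```python
def step_lr(initial_lr, epoch, step_size, gamma):
    """Learning rate in effect during `epoch` (0-based) under step decay."""
    return initial_lr * gamma ** (epoch // step_size)

epochs = 50

# Settings used in the experiments: the first decay would only occur at epoch 50.
as_configured = [step_lr(0.01, e, step_size=50, gamma=0.1) for e in range(epochs)]

# Hypothetical alternative: decay every 20 epochs instead.
alternative = [step_lr(0.01, e, step_size=20, gamma=0.1) for e in range(epochs)]

print(set(as_configured))
print(alternative[0], alternative[20], alternative[40])
```

Under the configured settings every one of the 50 epochs runs at the initial rate, so the scheduler has no effect on these experiments. A smaller `step_size` is one possible adjustment to try.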






