![Natural Language Processing GIF](https://raw.githubusercontent.com/FelixMatrixar/Basic-NLP-in-PyTorch/main/Natural%20Language%20Processing.gif)

*“Turning text into insight, one token at a time.”*

# **Introduction to Text Mining Assignment (Recognition)**
---

- **Name:** Felix
- **Student ID (NIM):** M0721028
- **Lecturer:** Mr. Fajar Muslim S.T., M.T.


---
## **Import Library**

This notebook imports the `math` library to implement TF-IDF and BoW by hand from their mathematical definitions, PyTorch (`torch`) to build and train the **neural-network** classifiers, and pandas to tabulate the final results.

In [2]:
import math
import pandas as pd  # used at the end to tabulate the accuracy results
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset, random_split

---
## **Creating the Data**

This step creates the example data and separates it into the independent variable (the sentences) and the dependent variable (the labels). The labels are defined as follows:
- **P** marks a sentence with **Positive** sentiment
- **Ne** marks a sentence with **Neutral** sentiment
- **N** marks a sentence with **Negative** sentiment

In [3]:
texts = [
    "aku suka statistika",
    "kelasnya bersih",
    "AC-nya kurang dingin",
    "papan tulisnya putih bersih",
    "kursinya kurang empuk"
]

labels = ["P", "P", "N", "Ne", "N"]

---
## **Tokenization Process**

Tokenization consists of five steps:

1. Lowercase Conversion
Uppercase characters are converted to lowercase using their ASCII codes.
$$c' = c + 32, \quad \text{if } 65 \leq c \leq 90$$

2. Split on Space
The text is split into words at each space: the word $w$ accumulated so far is stored whenever the current character $c$ is a space.
$$\text{Words} \gets \text{Words} \cup \{w\}, \quad \text{if } c = \text{" "}$$

3. Append to List
Appends element $e$ to list $L$.
$$L' = L + [e]$$

4. Add to Set
Adds element $e$ to set $S$ if it is not already present.
$$S' = S \cup \{e\}, \quad \text{if } e \notin S$$

5. Bubble Sort
Sorts the list by swapping two adjacent elements whenever they are out of order.
$$\text{If } L[j] > L[j+1], \text{ then } L[j], L[j+1] \gets L[j+1], L[j]$$




In [5]:
# Custom function to convert string to lowercase
def custom_lower(text):
    result = ""
    for char in text:
        # If character is uppercase, convert to lowercase
        if 'A' <= char <= 'Z':
            result += chr(ord(char) + 32)
        else:
            result += char
    return result

# Custom function to split a string into words
def custom_split(text):
    words = []
    word = ""
    for char in text:
        if char == " ":  # Split on space
            if word:     # Add word if non-empty
                words.append(word)
                word = ""
        else:
            word += char
    if word:  # Add the last word
        words.append(word)
    return words

# Custom function to append an element to a list
def custom_append(lst, element):
    lst += [element]  # Concatenate a single-element list
    return lst

# Custom function to add an element to a set
def custom_add(s, element):
    if element not in s:
        s |= {element}  # In-place union with a single-element set
    return s

# Custom function to sort a list
def custom_sort(lst):
    for i in range(len(lst)):
        for j in range(len(lst) - i - 1):
            if lst[j] > lst[j + 1]:
                lst[j], lst[j + 1] = lst[j + 1], lst[j]  # Swap
    return lst


# Initialize vocab set and tokenized texts
vocab_set = set()
tokenized_texts = []

# Tokenize each text
for t in texts:
    tokens = custom_split(custom_lower(t))  # Use custom lower and split
    tokenized_texts = custom_append(tokenized_texts, tokens)  # Use custom append
    for token in tokens:
        vocab_set = custom_add(vocab_set, token)  # Use custom add

# Convert vocab set to list and sort
vocab_list = list(vocab_set)
vocab_list = custom_sort(vocab_list)  # Use custom sort

# Display vocab list
print("=== Vocab List ===")
print(vocab_list)

=== Vocab List ===
['ac-nya', 'aku', 'bersih', 'dingin', 'empuk', 'kelasnya', 'kurang', 'kursinya', 'papan', 'putih', 'statistika', 'suka', 'tulisnya']
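
As a cross-check on the vocabulary above, the same result can be reproduced with Python's built-in string and set operations. This is only a minimal sketch: it assumes that `str.lower`, `str.split`, and `sorted` behave like the custom helpers on this data (they do for single-space-separated ASCII text, but `str.split()` also splits on tabs and newlines, which `custom_split` does not).

```python
texts = [
    "aku suka statistika",
    "kelasnya bersih",
    "AC-nya kurang dingin",
    "papan tulisnya putih bersih",
    "kursinya kurang empuk",
]

# The lowercase step relies on the ASCII offset between 'A' (65) and 'a' (97)
offset = ord("a") - ord("A")
print(offset)  # 32

# Built-in equivalents of custom_lower, custom_split, custom_add, and custom_sort
vocab = sorted({token for t in texts for token in t.lower().split()})
print(vocab)
```

The printed list matches the 13-word vocabulary produced by the hand-written helpers.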


---
## **TF-IDF and BoW (Bag of Words) Process**

### **TF-IDF**

The TF-IDF computation consists of:

1. Term Frequency (TF): measures how often term $t$ occurs in document $d$, relative to the total number of words in that document:
$$\text{TF}(t, d) = \frac{\text{count}(t, d)}{\text{total\_words}(d)}$$

2. Inverse Document Frequency (IDF): measures how rare term $t$ is across the document collection $D$. With $N$ the number of documents in $D$, $\text{df}(t)$ the number of documents that contain $t$, and $\log$ the natural logarithm:
$$\text{IDF}(t, D) = \log \left( \frac{N + 1}{\text{df}(t) + 1} \right) + 1$$

3. TF-IDF: the product of TF and IDF:
$$\text{TF-IDF}(t, d, D) = \text{TF}(t, d) \cdot \text{IDF}(t, D)$$


In [6]:
def TF_IDF(tokenized_docs, vocab):
    """
    Compute the TF-IDF matrix of a collection of tokenized documents.

    Parameters:
    - tokenized_docs (list of list of str): The tokenized documents.
    - vocab (list of str): The unique vocabulary across all documents.

    Returns:
    - list of list of float: The TF-IDF matrix, with one row per document and one column per vocabulary word.
    """
    N = len(tokenized_docs)  # Number of documents

    # Step 1: Compute the document frequency (DF) of every vocabulary word
    df = {}
    for v in vocab:
        df[v] = 0
    for tokens in tokenized_docs:
        unique_tokens = set(tokens)  # Unique tokens in this document
        for t in unique_tokens:
            if t in df:
                df[t] += 1

    # Step 2: Build the TF-IDF matrix
    tf_idf_matrix = []
    for tokens in tokenized_docs:
        row = []
        total_kata = len(tokens)  # Total number of words in this document
        for v in vocab:
            # Compute TF (0 for an empty document)
            tf = tokens.count(v) / float(total_kata) if total_kata > 0 else 0
            # Compute the smoothed IDF (natural logarithm)
            idf = math.log((N + 1) / (df[v] + 1)) + 1
            # Compute TF-IDF
            tf_idf = tf * idf
            row.append(tf_idf)
        tf_idf_matrix.append(row)
    return tf_idf_matrix

tfidf_matrix = TF_IDF(tokenized_texts, vocab_list)

# Display the TF-IDF result
print("\n=== TF-IDF Matrix ===")
for i, row in enumerate(tfidf_matrix):
    print(f"Dokumen {i} ({texts[i]}): {row}")


=== TF-IDF Matrix ===
Dokumen 0 (aku suka statistika): [0.0, 0.6995374295560366, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6995374295560366, 0.6995374295560366, 0.0]
Dokumen 1 (kelasnya bersih): [0.0, 0.0, 0.8465735902799727, 0.0, 0.0, 1.049306144334055, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Dokumen 2 (AC-nya kurang dingin): [0.6995374295560366, 0.0, 0.0, 0.6995374295560366, 0.0, 0.0, 0.5643823935199818, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
Dokumen 3 (papan tulisnya putih bersih): [0.0, 0.0, 0.42328679513998635, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5246530721670275, 0.5246530721670275, 0.0, 0.0, 0.5246530721670275]
Dokumen 4 (kursinya kurang empuk): [0.0, 0.0, 0.0, 0.0, 0.6995374295560366, 0.0, 0.5643823935199818, 0.6995374295560366, 0.0, 0.0, 0.0, 0.0, 0.0]
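
One entry of the matrix above can be reproduced by hand as a sanity check. The sketch below recomputes the value for the word "bersih" in document 1 ("kelasnya bersih") directly from the three formulas; the variable names are illustrative and not part of the notebook.

```python
import math

N = 5              # number of documents in the example corpus
df_bersih = 2      # "bersih" occurs in documents 1 and 3
tf_bersih = 1 / 2  # one occurrence among the two tokens of "kelasnya bersih"

# Smoothed IDF with the natural logarithm, as in the formula above
idf_bersih = math.log((N + 1) / (df_bersih + 1)) + 1
tfidf_bersih = tf_bersih * idf_bersih

print(round(idf_bersih, 4))    # 1.6931
print(round(tfidf_bersih, 4))  # 0.8466
```

The result agrees with the third column of the row for document 1. In document 3 the same word scores lower (about 0.4233) because its TF is 1/4 instead of 1/2.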


### **Bag of Words**

The **Bag of Words (BoW)** process consists of three steps:

1. Build the vocabulary by collecting every unique word that appears across all documents and arranging the words in an alphabetically sorted list.
2. Count how often each vocabulary word occurs in each document. A word that does not occur in a document gets a frequency of 0.
3. Assemble the BoW matrix from these counts. Each row represents a document, each column represents a vocabulary word, and each entry is the number of times that word occurs in that document.

In [7]:
def BoW(tokenized_docs, vocab):
    bow_matrix = []
    for tokens in tokenized_docs:
        row = []
        for v in vocab:
            count = tokens.count(v)
            row.append(count)
        bow_matrix.append(row)
    return bow_matrix

bow_matrix = BoW(tokenized_texts, vocab_list)

# Display the BoW result

print("\n=== BoW Matrix ===")
for i, row in enumerate(bow_matrix):
    print(f"Dokumen {i} ({texts[i]}): {row}")


=== BoW Matrix ===
Dokumen 0 (aku suka statistika): [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
Dokumen 1 (kelasnya bersih): [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
Dokumen 2 (AC-nya kurang dingin): [1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
Dokumen 3 (papan tulisnya putih bersih): [0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1]
Dokumen 4 (kursinya kurang empuk): [0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0]
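
The counting step can also be written with `collections.Counter` from the standard library, which counts each document once instead of calling `list.count` for every vocabulary word. This is a minimal sketch of that alternative, not the notebook's own method; the helper name `bow_row` is hypothetical.

```python
from collections import Counter

vocab = ['ac-nya', 'aku', 'bersih', 'dingin', 'empuk', 'kelasnya', 'kurang',
         'kursinya', 'papan', 'putih', 'statistika', 'suka', 'tulisnya']

def bow_row(tokens, vocab):
    """Return the count of every vocabulary word in one tokenized document."""
    counts = Counter(tokens)            # a missing word yields 0, not a KeyError
    return [counts[v] for v in vocab]

row = bow_row("papan tulisnya putih bersih".split(), vocab)
print(row)
```

For this sentence the row has a 1 in the columns for "bersih", "papan", "putih", and "tulisnya", matching document 3 in the matrix above.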


---
## **Modeling**

Convert the labels from *string* to *integer* form.

In [9]:
label_map = {"P":0, "N":1, "Ne":2}
y = [label_map[l] for l in labels]

Split the data into *train* and *test* sets with a 60:40 ratio in preparation for *modeling*, giving 3 sentences as *training data* and 2 sentences as *test data*.

In [10]:
# Convert data to PyTorch tensors
X_bow_tensor = torch.tensor(bow_matrix, dtype=torch.float32)
X_tfidf_tensor = torch.tensor(tfidf_matrix, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.long)

# Dataset class to wrap features and labels
class TextDataset(Dataset):
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

# Create datasets for BoW and TF-IDF
dataset_bow = TextDataset(X_bow_tensor, y_tensor)
dataset_tfidf = TextDataset(X_tfidf_tensor, y_tensor)

# Split datasets into train and test sets (60:40 split)
train_size = int(0.6 * len(dataset_bow))
test_size = len(dataset_bow) - train_size
train_dataset_bow, test_dataset_bow = random_split(dataset_bow, [train_size, test_size])
train_dataset_tfidf, test_dataset_tfidf = random_split(dataset_tfidf, [train_size, test_size])

# Create DataLoader for training and testing
batch_size = 4
train_loader_bow = DataLoader(train_dataset_bow, batch_size=batch_size, shuffle=True)
test_loader_bow = DataLoader(test_dataset_bow, batch_size=batch_size, shuffle=False)

train_loader_tfidf = DataLoader(train_dataset_tfidf, batch_size=batch_size, shuffle=True)
test_loader_tfidf = DataLoader(test_dataset_tfidf, batch_size=batch_size, shuffle=False)
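
Because `random_split` is called once per dataset in the cell above, the BoW and TF-IDF subsets come from separate random permutations and may not contain the same sentences. The following is a minimal sketch, in plain Python and with an arbitrarily chosen seed, of drawing a single permutation that both representations could share. The names `seed`, `train_idx`, and `test_idx` are illustrative and not part of the notebook.

```python
import random

n_samples = 5                       # the five example sentences
train_size = int(0.6 * n_samples)   # 3 training sentences, as in the cell above

seed = 42                           # assumed value; any fixed seed gives a reproducible split
indices = list(range(n_samples))
random.Random(seed).shuffle(indices)  # one permutation, reused for both feature sets

train_idx = indices[:train_size]
test_idx = indices[train_size:]
print(train_idx, test_idx)
```

Index lists like these could then be passed to `torch.utils.data.Subset`, for example `Subset(dataset_bow, train_idx)` and `Subset(dataset_tfidf, train_idx)`, so that both models are trained and tested on the same sentences.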

In [12]:
class SimpleNN(nn.Module):
    def __init__(self, input_dim, num_classes):
        super(SimpleNN, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(input_dim, 16),  # First layer with 16 hidden units
            nn.ReLU(),
            nn.Linear(16, num_classes)  # Output layer
        )

    def forward(self, x):
        return self.fc(x)

In [13]:
def train_model(model, optimizer, criterion, train_loader, num_epochs=10):
    model.train()  # Set the model to training mode
    for epoch in range(num_epochs):
        total_loss = 0
        for X_batch, y_batch in train_loader:
            # Forward pass
            outputs = model(X_batch)
            loss = criterion(outputs, y_batch)

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
        
        avg_loss = total_loss / len(train_loader)
        print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {avg_loss:.4f}")


In [14]:
def evaluate_model(model, test_loader):
    model.eval()  # Set the model to evaluation mode
    correct = 0
    total = 0
    with torch.no_grad():  # Disable gradient calculation
        for X_batch, y_batch in test_loader:
            outputs = model(X_batch)
            _, predicted = torch.max(outputs, 1)  # Get predicted class
            total += y_batch.size(0)
            correct += (predicted == y_batch).sum().item()
    return correct / total
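
Since `evaluate_model` returns `correct / total` and the test set holds only two sentences, the reported accuracy can take very few distinct values. A short sketch of that arithmetic, assuming the 3/2 split described earlier:

```python
n_test = 2  # test sentences under the 60:40 split of five sentences

# Every possible number of correct predictions, turned into an accuracy
possible_accuracies = [correct / n_test for correct in range(n_test + 1)]
print(possible_accuracies)  # [0.0, 0.5, 1.0]
```

Each test sentence therefore moves the accuracy by 0.5, so differences between models of exactly 0.5 correspond to a single prediction.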


In [21]:
# Hyperparameters
input_dim_bow = X_bow_tensor.shape[1]  # Number of features in BoW
input_dim_tfidf = X_tfidf_tensor.shape[1]  # Number of features in TF-IDF
num_classes = len(label_map)  # Number of output classes
learning_rate = 0.001
num_epochs = 200

# Model, loss, and optimizer for BoW
model_bow = SimpleNN(input_dim_bow, num_classes)
criterion = nn.CrossEntropyLoss()
optimizer_bow = optim.Adam(model_bow.parameters(), lr=learning_rate)

print("\nTraining BoW Model:")
train_model(model_bow, optimizer_bow, criterion, train_loader_bow, num_epochs)

# Evaluate BoW Model
accuracy_bow = evaluate_model(model_bow, test_loader_bow)
print(f"Accuracy (BoW): {accuracy_bow:.4f}")

# Model, loss, and optimizer for TF-IDF
model_tfidf = SimpleNN(input_dim_tfidf, num_classes)
optimizer_tfidf = optim.Adam(model_tfidf.parameters(), lr=learning_rate)

print("\nTraining TF-IDF Model:")
train_model(model_tfidf, optimizer_tfidf, criterion, train_loader_tfidf, num_epochs)

# Evaluate TF-IDF Model
accuracy_tfidf = evaluate_model(model_tfidf, test_loader_tfidf)
print(f"Accuracy (TF-IDF): {accuracy_tfidf:.4f}")



Training BoW Model:
Epoch [1/200], Loss: 1.1661
Epoch [2/200], Loss: 1.1604
Epoch [3/200], Loss: 1.1548
Epoch [4/200], Loss: 1.1492
Epoch [5/200], Loss: 1.1437
Epoch [6/200], Loss: 1.1381
Epoch [7/200], Loss: 1.1326
Epoch [8/200], Loss: 1.1271
Epoch [9/200], Loss: 1.1216
Epoch [10/200], Loss: 1.1162
Epoch [11/200], Loss: 1.1109
Epoch [12/200], Loss: 1.1057
Epoch [13/200], Loss: 1.1004
Epoch [14/200], Loss: 1.0952
Epoch [15/200], Loss: 1.0900
Epoch [16/200], Loss: 1.0847
Epoch [17/200], Loss: 1.0795
Epoch [18/200], Loss: 1.0743
Epoch [19/200], Loss: 1.0690
Epoch [20/200], Loss: 1.0638
Epoch [21/200], Loss: 1.0586
Epoch [22/200], Loss: 1.0534
Epoch [23/200], Loss: 1.0482
Epoch [24/200], Loss: 1.0429
Epoch [25/200], Loss: 1.0377
Epoch [26/200], Loss: 1.0325
Epoch [27/200], Loss: 1.0272
Epoch [28/200], Loss: 1.0220
Epoch [29/200], Loss: 1.0167
Epoch [30/200], Loss: 1.0114
Epoch [31/200], Loss: 1.0061
Epoch [32/200], Loss: 1.0008
Epoch [33/200], Loss: 0.9954
Epoch [34/200], Loss: 0.9901
Ep
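
The first logged losses (about 1.17) sit close to $\ln 3 \approx 1.0986$, which is the cross-entropy of a three-class model that assigns equal probability to every class. The sketch below computes that value in plain Python; the helper `cross_entropy` is a hypothetical stand-in written for illustration, not PyTorch's API.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target class under a softmax over the logits."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

# Equal logits mean the model has no preference among the three classes
uniform_loss = cross_entropy([0.0, 0.0, 0.0], target=1)
print(round(uniform_loss, 4))  # 1.0986
```

A loss that starts near this value and falls steadily indicates that the freshly initialized network begins close to uniform guessing and then fits the training sentences.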

In [22]:
class ImprovedNN(nn.Module):
    def __init__(self, input_dim, num_classes, architecture):
        super(ImprovedNN, self).__init__()
        layers = []
        for in_dim, out_dim in architecture:
            layers.append(nn.Linear(in_dim, out_dim))
            # Every tuple describes a hidden layer, so each one gets an activation and dropout;
            # without them the last hidden layer and the output layer collapse into one linear map
            layers.append(nn.ReLU())
            layers.append(nn.Dropout(0.3))
        layers.append(nn.Linear(architecture[-1][1], num_classes))  # Output layer
        self.fc = nn.Sequential(*layers)

    def forward(self, x):
        return self.fc(x)


In [23]:
# Architectures: List of tuples (input_dim, output_dim)
architectures = {
    "Small": [(input_dim_bow, 16), (16, 8)],
    "Medium": [(input_dim_bow, 32), (32, 16), (16, 8)],
    "Large": [(input_dim_bow, 64), (64, 32), (32, 16)],
}
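
To get a feel for how the three architectures differ in capacity, the number of trainable parameters in their linear layers can be counted by hand. This is a minimal sketch that assumes a vocabulary of 13 words and 3 classes, as in this notebook, and includes the final output layer that `ImprovedNN` appends; the helper `count_linear_params` is hypothetical.

```python
vocab_size = 13   # length of vocab_list
num_classes = 3   # P, N, Ne

architectures = {
    "Small": [(vocab_size, 16), (16, 8)],
    "Medium": [(vocab_size, 32), (32, 16), (16, 8)],
    "Large": [(vocab_size, 64), (64, 32), (32, 16)],
}

def count_linear_params(architecture, num_classes):
    """Weights plus biases of every linear layer, including the output layer."""
    dims = architecture + [(architecture[-1][1], num_classes)]
    return sum(in_dim * out_dim + out_dim for in_dim, out_dim in dims)

param_counts = {name: count_linear_params(arch, num_classes)
                for name, arch in architectures.items()}
print(param_counts)  # {'Small': 387, 'Medium': 1139, 'Large': 3555}
```

Even the Small network has several hundred parameters to fit against three training sentences, which is worth keeping in mind when comparing the accuracies.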


In [24]:
def train_and_evaluate(model, optimizer, criterion, train_loader, test_loader, num_epochs=20):
    model.train()
    for epoch in range(num_epochs):
        total_loss = 0
        for X_batch, y_batch in train_loader:
            outputs = model(X_batch)
            loss = criterion(outputs, y_batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {total_loss / len(train_loader):.4f}")

    # Evaluate the model
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for X_batch, y_batch in test_loader:
            outputs = model(X_batch)
            _, predicted = torch.max(outputs, 1)
            total += y_batch.size(0)
            correct += (predicted == y_batch).sum().item()
    accuracy = correct / total
    return accuracy


In [25]:
results = []

for arch_name, arch in architectures.items():
    # Train and evaluate on BoW
    model_bow = ImprovedNN(input_dim_bow, num_classes, arch)
    optimizer_bow = optim.Adam(model_bow.parameters(), lr=0.01)
    criterion = nn.CrossEntropyLoss()
    print(f"\nTraining BoW Model - Architecture: {arch_name}")
    acc_bow = train_and_evaluate(model_bow, optimizer_bow, criterion, train_loader_bow, test_loader_bow)

    # Train and evaluate on TF-IDF
    model_tfidf = ImprovedNN(input_dim_tfidf, num_classes, arch)
    optimizer_tfidf = optim.Adam(model_tfidf.parameters(), lr=0.01)
    print(f"\nTraining TF-IDF Model - Architecture: {arch_name}")
    acc_tfidf = train_and_evaluate(model_tfidf, optimizer_tfidf, criterion, train_loader_tfidf, test_loader_tfidf)

    # Save results
    results.append({"Architecture": arch_name, "BoW Accuracy": acc_bow, "TF-IDF Accuracy": acc_tfidf})

# Convert results to DataFrame
results_df = pd.DataFrame(results)
print(results_df)



Training BoW Model - Architecture: Small
Epoch [1/20], Loss: 1.1007
Epoch [2/20], Loss: 1.0741
Epoch [3/20], Loss: 1.0336
Epoch [4/20], Loss: 1.0406
Epoch [5/20], Loss: 0.9878
Epoch [6/20], Loss: 0.9267
Epoch [7/20], Loss: 0.9077
Epoch [8/20], Loss: 0.8677
Epoch [9/20], Loss: 0.8121
Epoch [10/20], Loss: 0.7794
Epoch [11/20], Loss: 0.7548
Epoch [12/20], Loss: 0.7109
Epoch [13/20], Loss: 0.6724
Epoch [14/20], Loss: 0.5916
Epoch [15/20], Loss: 0.5890
Epoch [16/20], Loss: 0.6362
Epoch [17/20], Loss: 0.6157
Epoch [18/20], Loss: 0.4026
Epoch [19/20], Loss: 0.4531
Epoch [20/20], Loss: 0.4973

Training TF-IDF Model - Architecture: Small
Epoch [1/20], Loss: 1.1278
Epoch [2/20], Loss: 1.1250
Epoch [3/20], Loss: 1.1136
Epoch [4/20], Loss: 1.1080
Epoch [5/20], Loss: 1.0710
Epoch [6/20], Loss: 1.0650
Epoch [7/20], Loss: 1.0689
Epoch [8/20], Loss: 1.0501
Epoch [9/20], Loss: 1.0755
Epoch [10/20], Loss: 1.0114
Epoch [11/20], Loss: 0.9822
Epoch [12/20], Loss: 0.9921
Epoch [13/20], Loss: 0.9965
Epoch [

---
## **Conclusion**

<div align='justify'>
&emsp;&emsp;&emsp;&emsp;
The evaluation across the three neural-network architectures (Small, Medium, and Large) shows that the TF-IDF representation yields an accuracy of 0.5 for every architecture, whereas the BoW representation reaches 0.5 only with the Large architecture. This suggests that a more complex model, such as the Large architecture with more layers and hidden units, can exploit the BoW representation more effectively than the simpler architectures (Small and Medium). For the TF-IDF representation, by contrast, accuracy stays the same across all architectures, indicating that added model capacity does not help the model extract further information from it. In summary, the Large architecture gives the best overall performance, particularly on the BoW representation, while the TF-IDF representation is more consistent and stable regardless of model complexity. Further improvement could come from exploring the data, trying other representations, or tuning the hyperparameters.
</div>