# 1. Problem

#### *problem disini

# 2. Corpus Resource

**Corpus** didapatkan dari dataset yang diambil dari kaggle pada link 
*https://www.kaggle.com/datasets/muhammadanasmahmood/google-scholar-article-listingdata-science*. 
Data yang diambil untuk melakukan autocomplete pada google scholar memakai corpus dari kolom 'Title' dan kolom 'Description'.

# 3. Methods

### Langkah-langkah untuk Metode yang Digunakan yaitu:

#### 1. **Pre-processing**:
   - **Text-cleaning**: Membersihkan teks dengan menghapus tag-tag yang tidak diperlukan (seperti `[HTML]`), mengubah teks menjadi huruf kecil, dan menghapus spasi yang berlebihan.
   - **Menghapus stopword dan tanda baca**: Menyaring kata-kata umum (stopwords) dan tanda baca yang tidak memberikan nilai tambah untuk analisis.
   - **POS Tagging**: Melakukan penandaan Part-of-Speech (POS) pada teks yang telah dibersihkan menggunakan library averaged_perceptron_tagger dari nltk.

#### 2. **Membangun Model**:
   - **Definisi Vocabulary**: Membuat kosakata dengan menghitung frekuensi setiap kata yang valid dari semua judul artikel.
   - **Membuat Model Trigram**: Membangun model trigram (urutan 3 kata) yang menyimpan frekuensi setiap trigram. Model ini berguna untuk memprediksi kata yang kemungkinan muncul setelah urutan kata tertentu.
   - **Mengecek POS Tag**: Melacak POS tag dari kata berikutnya dalam setiap trigram. Ini memungkinkan sistem untuk mempertimbangkan bukan hanya kata itu sendiri, tetapi juga peran gramatikalnya dalam prediksi.

#### 3. **Metode Lanjutan (POS Tagging)**:
   - **Menghitung POS Tag**: Untuk setiap trigram, menghitung seberapa sering tag POS tertentu mengikuti urutan kata tertentu. Ini membantu menghubungkan urutan kata dengan struktur gramatikalnya.
   - **POS Tagging pada text yang dites**: Ketika pengguna memberikan query, lakukan POS tagging pada kata terakhir dari query untuk memahami peran gramatikalnya.
   - **Memilih Persentase tertinggi**: Ketika memprediksi kata berikutnya, pilih kata yang memiliki tag POS yang sesuai dengan kata terakhir dalam query dan memiliki frekuensi tertinggi dalam model trigram.



# 4.Code

In [None]:
import pandas as pd
import re
import nltk
from nltk.util import ngrams
from nltk import pos_tag, word_tokenize
from collections import defaultdict, Counter
import string
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

# Load data
df = pd.read_csv("preprocessed_titles.csv")

stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)

def clean_title(title):
    title = re.sub(r'\[.*?\]', '', title)
    title = title.lower().strip()
    return title

def is_valid_token(token):
    return token not in stop_words and token not in punctuation

# Clean and tokenize
df['clean_title'] = df.iloc[:, 0].apply(clean_title)
df['clean_meta'] = df.iloc[:, 2].astype(str).apply(clean_title)

# Step 2: Create two separate DataFrames with the same column name
title_df = pd.DataFrame({'clean_title': df['clean_title']})
meta_df = pd.DataFrame({'clean_title': df['clean_meta']})

df['clean_title'] = pd.concat([title_df, meta_df], ignore_index=True)

df['tokens_pos'] = df['clean_title'].apply(lambda t: pos_tag(word_tokenize(t)))

# Split into training and test sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Christian\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Christian\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Christian\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [23]:
ngram_freq = defaultdict(Counter)
word_vocab = Counter()
ngram_pos_freq = defaultdict(lambda: defaultdict(Counter))

for tokens_pos in train_df['tokens_pos']:
    tokens = [word for word, pos in tokens_pos if is_valid_token(word)]
    for word in tokens:
        word_vocab[word] += 1
    for ngram in ngrams(tokens, 3):
        prefix = " ".join(ngram[:2])
        next_word = ngram[2]
        next_word_pos = pos_tag([next_word])[0][1]
        ngram_freq[prefix][next_word] += 1
        ngram_pos_freq[prefix][next_word][next_word_pos] += 1


In [32]:
def word_completion(partial_word, max_suggestions=5):
    matches = [word for word in word_vocab if word.startswith(partial_word)]
    matches = sorted(matches, key=lambda w: word_vocab[w], reverse=True)
    return matches[:max_suggestions]

def get_query_pos(query):
    tokens = word_tokenize(query)
    if tokens:
        return pos_tag([tokens[-1]])[0][1]
    return None

def suggest_autocomplete(query, max_suggestions=5):
    query = query.lower().strip()
    words = word_tokenize(query)
    
    if not words:
        return []

    last_word = words[-1]
    base = " ".join(words[:-1])
    last_word_pos = get_query_pos(query)

    if not is_valid_token(last_word) or len(last_word) < 2:
        return []

    word_matches = word_completion(last_word, max_suggestions=1)
    if not word_matches:
        return []

    completed_word = word_matches[0]
    full_query = (base + ' ' + completed_word).strip()
    full_words = word_tokenize(full_query)

    if len(full_words) >= 2:
        prefix = " ".join(full_words[-2:])
    else:
        prefix = full_words[0]

    next_word_pos_freq = ngram_pos_freq.get(prefix, defaultdict(Counter))

    suggestions = []

    if next_word_pos_freq:
        for next_word, pos_freq in next_word_pos_freq.items():
            if last_word_pos and last_word_pos in pos_freq:
                suggestions.append(full_query + ' ' + next_word)

    suggestions.insert(0, full_query)
    return suggestions[:max_suggestions]
    

In [41]:
def count_n_grams(data, n, start_token='<s>', end_token='<e>'):
    n_grams = {}
    for sentence in data:   
        sentence = [start_token] * (n - 1) + sentence + [end_token]
        for i in range(len(sentence) - n + 1):
            n_gram = tuple(sentence[i:i + n])
            if n_gram in n_grams:
                n_grams[n_gram] += 1
            else:
                n_grams[n_gram] = 1
    
    return n_grams

# Example usage with bigrams (n=2)
bigram_counts = count_n_grams(train_df['tokens_pos'], 2)

# Example usage with trigrams (n=3)
trigram_counts = count_n_grams(train_df['tokens_pos'], 3)

# Print or use bigram_counts and trigram_counts as needed
print("Bigram counts:", bigram_counts)
print("Trigram counts:", trigram_counts)




In [42]:
# Your provided functions:
def estimate_probability(word, n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=1.0):
    n_gram = tuple(n_gram)
    n_plus1_gram = n_gram + (word,)
    count_n_gram = n_gram_counts.get(n_gram, 0)
    count_n_plus1_gram = n_plus1_gram_counts.get(n_plus1_gram, 0)
    return (count_n_plus1_gram + k) / (count_n_gram + k * vocabulary_size)

def calculate_perplexity(sentence, n_gram_counts, n_plus1_gram_counts, vocabulary_size, start_token='<s>', end_token='<e>', k=1.0):
    n = len(list(n_gram_counts.keys())[0])
    sentence = [start_token] * n + sentence + [end_token]
    sentence = tuple(sentence)
    N = len(sentence)
    product_pi = 1.0
    for t in range(n, N):
        n_gram = sentence[t - n:t]
        word = sentence[t]
        probability = estimate_probability(word, n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary_size, k)
        product_pi *= 1 / probability
    return product_pi ** (1 / N)

def evaluate_perplexity(test_df, n_gram_counts, n_plus1_gram_counts, vocab_size):
    perplexities = []
    for tokens_pos in test_df['tokens_pos']:
        tokens = [word for word, _ in tokens_pos if is_valid_token(word)]
        if len(tokens) >= 3:
            perplexity = calculate_perplexity(tokens, n_gram_counts, n_plus1_gram_counts, vocab_size)
            perplexities.append(perplexity)
    avg_perplexity = sum(perplexities) / len(perplexities) if perplexities else float('inf')
    print(f"\n📊 Average Perplexity: {avg_perplexity:.2f}")


# Call the evaluation
print("test set: ")
evaluate_perplexity(test_df, bigram_counts, trigram_counts, vocab_size)
print("train set: ")
evaluate_perplexity(train_df, bigram_counts, trigram_counts, vocab_size)

test set: 

📊 Average Perplexity: 471.16
train set: 

📊 Average Perplexity: 479.38


In [44]:
# Function to calculate perplexity for a sentence
def calculate_perplexity(sentence, n_gram_counts, n_plus1_gram_counts, vocabulary_size, start_token='<s>', end_token='<e>', k=1.0):
    n = len(list(n_gram_counts.keys())[0])  # Determine n from n-gram keys
    
    # Add start and end tokens
    sentence = [start_token] * n + sentence + [end_token]
    
    # Convert the sentence to a tuple for easy indexing
    sentence = tuple(sentence)
    N = len(sentence)
    
    # Initialize the cumulative product
    product_pi = 1.0
    
    for t in range(n, N):
        n_gram = sentence[t - n:t]
        word = sentence[t]
        
        # Estimate probability using the n-gram counts and smoothing
        probability = estimate_probability(word, n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary_size, k)
        
        # Update the product
        product_pi *= 1 / probability
    
    # Calculate the perplexity as the Nth root of the product
    perplexity = product_pi ** (1 / N)
    
    return perplexity


# Function to evaluate the perplexity of a dataset
def evaluate_perplexity(data, n_gram_counts, n_plus1_gram_counts, vocab_size, n=2, k=1.0):
    perplexities = []
    
    for sentence in data:
        if len(sentence) >= n:  # Only consider sentences with enough tokens for n-grams
            perplexity = calculate_perplexity(sentence, n_gram_counts, n_plus1_gram_counts, vocab_size, k=k)
            perplexities.append(perplexity)
    
    # Calculate average perplexity
    avg_perplexity = sum(perplexities) / len(perplexities) if perplexities else float('inf')
    return avg_perplexity


# Example usage with train and test sets

# Define the sentences (train and test data)
train_sentences = [['i', 'like', 'a', 'cat'], ['this', 'dog', 'is', 'like', 'a', 'cat']]
test_sentences = [['i', 'like', 'a', 'dog'], ['the', 'cat', 'is', 'on', 'the', 'mat']]

# Get the unique words and vocabulary size
unique_words = list(set([word for sentence in train_sentences + test_sentences for word in sentence]))
vocab_size = len(unique_words)

# Count the n-grams (unigram, bigram, trigram)
unigram_counts = count_n_grams(train_sentences, 1)
bigram_counts = count_n_grams(train_sentences, 2)
trigram_counts = count_n_grams(train_sentences, 3)

train_perplexity_trigram = evaluate_perplexity(train_sentences, bigram_counts, trigram_counts, vocab_size, n=3, k=1.0)

print(f"Average Perplexity on Training Set (Trigram): {train_perplexity_trigram:.2f}")

test_perplexity_trigram = evaluate_perplexity(test_sentences, bigram_counts, trigram_counts, vocab_size, n=3, k=1.0)
print(f"Average Perplexity on Test Set (Trigram): {test_perplexity_trigram:.2f}")


Average Perplexity on Training Set (Trigram): 3.26
Average Perplexity on Test Set (Trigram): 5.03


In [34]:
user_query = "data analysis"
suggestions = suggest_autocomplete(user_query)
print("Suggestions:", suggestions)

Suggestions: ['data analysis', 'data analysis community', 'data analysis spss', 'data analysis measurement', 'data analysis dea']


# 5. Performance Evaluation

### *pake apa

# 6. Conclusion and Future works

#### *problem disini

# Reference (kalo ada)