# Abstract Similarity Analysis of Journal Articles

## Project Description
This program analyzes how similar the abstracts of several journal articles are, using *text mining* techniques. The process covers retrieving abstracts from web pages, translating the text into English, cleaning the data, and computing pairwise similarity between texts with *TF-IDF* and *cosine similarity*.

## Goals
- Retrieve journal article abstracts automatically from the provided links.
- Translate the abstracts into English so that all texts share one language.
- Clean the text by removing irrelevant elements such as punctuation and common words (*stopwords*).
- Measure how similar the articles are, based on the text of their abstracts.
- Save the similarity results in tabular form for further evaluation.

## Process Flow
1. **Data Retrieval**:
   - The program requests each journal page and extracts the abstract using the `requests` and `BeautifulSoup` libraries.
2. **Translation**:
   - Abstracts are translated into English using the `translate` library.
3. **Text Cleaning**:
   - The abstract text is lowercased, and punctuation and stop words are removed using `string` and `nltk`.
4. **Similarity Calculation**:
   - Abstracts are compared using *TF-IDF vectorization* and *cosine similarity* from `scikit-learn`.
5. **Final Output**:
   - The pairwise similarity scores are displayed as a matrix table and saved to a CSV file for further analysis.
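
To make the similarity step concrete, here is a minimal pure-Python sketch of what TF-IDF weighting followed by cosine similarity computes. It assumes whitespace tokenization and the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` with L2 normalization; the function names `tfidf_vectors` and `cosine` and the three sample texts are illustrative only, and the notebook itself relies on `scikit-learn`, whose tokenizer may produce different numbers.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one L2-normalized TF-IDF dict per document (smoothed IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vec = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({t: w / norm for t, w in vec.items()} if norm else vec)
    return vectors

def cosine(u, v):
    """Cosine similarity of two unit-length sparse vectors (a dot product)."""
    return sum(w * v.get(term, 0.0) for term, w in u.items())

# Hypothetical sample texts, not taken from the fetched abstracts
docs = [
    "anomaly detection hospital claims",
    "anomaly detection internet things",
    "query performance table archiving",
]
vecs = tfidf_vectors(docs)
for i in range(len(docs)):
    print([round(cosine(vecs[i], vecs[j]), 4) for j in range(len(docs))])
```

Texts that share terms get a score between 0 and 1, texts with no terms in common score 0, and every text scores 1 against itself, which is why the diagonal of the similarity matrix is always 1.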

## Use Cases
This code is useful for researchers, academics, and journal managers who want to understand how published articles relate to or resemble one another, for example to support bibliometric analysis or recommendations of similar articles.


In [None]:
# Install dependencies
%pip install requests beautifulsoup4 translate nltk scikit-learn pandas tabulate

In [None]:
# Import the required libraries
import requests
from bs4 import BeautifulSoup
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import nltk
import string
import pandas as pd
import numpy as np
import time
import random
from tabulate import tabulate
from translate import Translator  # used by translate_text()

# Download the NLTK stop-word list and load the English stop words
nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words('english'))

In [31]:
article_links = [
    "https://jurnal.ugm.ac.id/ijccs/article/view/91857",
    "https://jurnal.ugm.ac.id/ijccs/article/view/82107",
    "https://jurnal.ugm.ac.id/ijccs/article/view/85834",
    "https://jurnal.ugm.ac.id/ijccs/article/view/87393",
    "https://jurnal.ugm.ac.id/ijccs/article/view/88081",
    "https://jurnal.ugm.ac.id/ijccs/article/view/90030",
    "https://jurnal.ugm.ac.id/ijccs/article/view/90062",
    "https://jurnal.ugm.ac.id/ijccs/article/view/90165",
    "https://jurnal.ugm.ac.id/ijccs/article/view/90437",
    "https://jurnal.ugm.ac.id/ijccs/article/view/92636",
]

def translate_text(text, target_lang='en', max_retries=3):
    """
    Translate text into target_lang, retrying on failure.

    Returns the original text if every attempt fails.
    """
    for attempt in range(max_retries):
        try:
            # Create a translator for the target language
            translator = Translator(to_lang=target_lang)

            # Translate the text
            translated_text = translator.translate(text)

            # Accept the result only if it is non-empty
            if translated_text and translated_text.strip():
                return translated_text

            # Empty translation: wait a little longer each time, then retry
            time.sleep(1 * (attempt + 1))

        except Exception as e:
            print(f"Translation attempt {attempt + 1} failed: {e}")
            time.sleep(1)

    # Fall back to the original text if all attempts failed
    print("Translation failed; returning the original text")
    return text
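
Some translation backends limit how much text a single request may carry, and abstracts are often longer than a short paragraph. The helper below is a minimal sketch of splitting an abstract at sentence boundaries before translating it. The name `chunk_text` and the 500-character default are assumptions for illustration, not values taken from this notebook or from the `translate` library's documentation.

```python
import re

def chunk_text(text, max_chars=500):
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single sentence longer than the limit is split at the limit
        while len(sentence) > max_chars:
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        current = sentence
    if current:
        chunks.append(current)
    return chunks

print(chunk_text("One two. Three four. Five six.", max_chars=12))
```

With such a helper, a long abstract could be translated piece by piece, for example `" ".join(translate_text(chunk) for chunk in chunk_text(abstract))`.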

In [32]:
def fetch_titles_and_abstracts(links):
    titles_abstracts = {}
    for link in links:
        try:
            # Add delay to avoid overwhelming the server
            time.sleep(random.uniform(0.5, 2))

            # Send the request with a browser-like User-Agent and a timeout
            # so that a stalled connection cannot hang the loop
            response = requests.get(
                link,
                headers={'User-Agent': 'Mozilla/5.0 ...'},
                timeout=10,
            )
            response.raise_for_status()
            soup = BeautifulSoup(response.text, 'html.parser')

            # Extract the title
            title_tag = soup.find('meta', attrs={'name': 'DC.Title'})
            title = title_tag['content'] if title_tag else "Title not found"

            # Extract the abstract
            abstract_div = soup.find('div', id='articleAbstract')
            abstract = abstract_div.get_text(strip=True) if abstract_div else "Abstract not found"

            titles_abstracts[link] = {
                "title": title,
                "abstract": abstract
            }
        except Exception as e:
            titles_abstracts[link] = {
                "title": f"Error fetching title: {e}",
                "abstract": f"Error fetching abstract: {e}"
            }
    return titles_abstracts
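
Because the function above stores placeholder strings such as "Abstract not found" or "Error fetching abstract: ..." in place of missing abstracts, those placeholders would otherwise be compared as if they were real text. One possible way to separate them before the similarity step is sketched below. The helper name `split_fetch_results` and the example links are hypothetical.

```python
def split_fetch_results(titles_abstracts):
    """Separate usable records from the placeholder records of failed fetches."""
    ok, failed = {}, {}
    for link, data in titles_abstracts.items():
        abstract = data["abstract"]
        is_placeholder = (
            abstract == "Abstract not found"
            or abstract.startswith("Error fetching abstract")
        )
        (failed if is_placeholder else ok)[link] = data
    return ok, failed

# Hypothetical records in the same shape fetch_titles_and_abstracts returns
sample = {
    "https://example.org/a": {"title": "Paper A", "abstract": "We study anomaly detection."},
    "https://example.org/b": {"title": "Title not found", "abstract": "Abstract not found"},
    "https://example.org/c": {
        "title": "Error fetching title: timeout",
        "abstract": "Error fetching abstract: timeout",
    },
}
ok, failed = split_fetch_results(sample)
print(f"{len(ok)} usable, {len(failed)} failed")
```

Passing only the usable records to the comparison keeps fetch failures from showing up as spuriously similar "articles".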

In [33]:
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = ''.join([char for char in text if char not in string.punctuation])
    # Remove stop words
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text
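
The effect of these three cleaning steps can be checked on a single title. The sketch below is a standalone variant that does not need the NLTK download: `DEMO_STOP_WORDS` is a tiny hand-picked set used only for illustration (the notebook uses NLTK's full English list), and `preprocess_demo` is a hypothetical name.

```python
import string

# Tiny illustrative stop-word set (an assumption); the notebook uses NLTK's English list
DEMO_STOP_WORDS = {"the", "of", "on", "a", "for", "is", "and"}

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess_demo(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, strip punctuation, and drop stop words."""
    text = text.lower().translate(PUNCT_TABLE)
    return " ".join(word for word in text.split() if word not in stop_words)

print(preprocess_demo("Ensemble Method for Anomaly Detection On the Internet of Things!"))
# ensemble method anomaly detection internet things
```

Note that deleting punctuation joins hyphenated terms: "Rule-Based" becomes the single token "rulebased", which only matches other texts that use the same hyphenated form.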

In [34]:
def compare_abstracts(titles_abstracts):
    # Preprocess the abstracts
    processed_abstracts = [preprocess_text(data["abstract"]) for data in titles_abstracts.values()]

    # Vectorize the abstracts using TF-IDF
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(processed_abstracts)

    # Calculate cosine similarity matrix
    cosine_sim_matrix = cosine_similarity(tfidf_matrix)

    # Create a DataFrame to store the similarity results
    similarity_df = pd.DataFrame(np.round(cosine_sim_matrix, 4), 
        index=[f"Article {i+1}" for i in range(len(titles_abstracts))], 
        columns=[f"Article {i+1}" for i in range(len(titles_abstracts))])
    
    return similarity_df
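
A full matrix is hard to scan once there are more than a few articles, so it can help to list only the most similar pairs. The following sketch works on plain lists; `most_similar_pairs` is a hypothetical helper and the matrix values are made up for the example, not results from this notebook.

```python
def most_similar_pairs(labels, matrix, top_n=3):
    """Return the top_n (score, label_a, label_b) tuples from the upper triangle."""
    pairs = [
        (matrix[i][j], labels[i], labels[j])
        for i in range(len(labels))
        for j in range(i + 1, len(labels))  # skip the diagonal and mirrored entries
    ]
    return sorted(pairs, reverse=True)[:top_n]

# Made-up symmetric similarity matrix for three articles
labels = ["Article 1", "Article 2", "Article 3"]
matrix = [
    [1.0, 0.12, 0.45],
    [0.12, 1.0, 0.08],
    [0.45, 0.08, 1.0],
]
for score, a, b in most_similar_pairs(labels, matrix, top_n=2):
    print(f"{a} vs {b}: {score}")
```

With the DataFrame returned above, the inputs could be obtained as `similarity_df.index.tolist()` and `similarity_df.values.tolist()`.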

In [35]:
def main():
    try:
        # Fetch the titles and abstracts
        titles_abstracts = fetch_titles_and_abstracts(article_links)

        # Compare abstracts and get the similarity table
        similarity_table = compare_abstracts(titles_abstracts)

        # Print the titles for reference
        print("Article Titles:")
        for i, data in enumerate(titles_abstracts.values(), start=1):
            print(f"Article {i}: {data['title']}")

        # Display the similarity table in a nice format
        print("\nSimilarity Table:")
        print(tabulate(similarity_table, headers="keys", tablefmt="fancy_grid"))

        # Save to CSV (optional)
        similarity_table.to_csv('abstract_similarity.csv')
        print("\nSimilarity table saved to abstract_similarity.csv")

    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == "__main__":
    main()


Article Titles:
Article 1: Anomaly Detection of Hospital Claim Using Support Vector Regression
Article 2: The Adoption of Blockchain Technology the Business Using Structural Equation Modelling
Article 3: Ensemble Method for Anomaly Detection On the Internet of Things
Article 4: Webcam-Based Bus Passenger Detection System Using Single Shot Detector Method
Article 5: Rule-Based Natural Language Processing in Volcanic Ash Data Searching System
Article 6: Modeling OTP Delivery Notification Status through a Causality Bayesian Network
Article 7: Maintaining Query Performance through  Table Rebuilding & Archiving
Article 8: Multivariat Predict Sales Data Using the Recurrent Neural Network (RNN) Method
Article 9: Effect of Hyperparameter Tuning Using Random Search on Tree-Based Classification Algorithm for Software Defect Prediction
Article 10: DEVELOPMENTS AND TRENDS IN CYBERSECURITY AGAINST HUMAN FACTORS AND TIME PRESSURE USING BIBLIOMETRIC ANALYSIS

Similarity Table:
╒════════════╤═════════