# Exercise - News Article Analysis Pipeline
The goal is to build an automated tool that produces both a high-level summary (the topic) and detailed information (the entities) for any news text. This is a very common task in industry, used to analyze media coverage, financial reports, and more.

We will use the **AG News** dataset, which contains over 100,000 news articles classified into 4 categories: **World, Sports, Business, and Sci/Tech.**

## Part 1: Article Topic Classification

### 1. Loading Data & Libraries

In [None]:
import pandas as pd
import numpy as np
import spacy
from collections import defaultdict

# Libraries for classification
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Load the dataset (training split)
# This dataset has the columns: Class Index, Title, Description
# We will combine Title and Description
df = pd.read_csv('https://raw.githubusercontent.com/mhjabreel/CharCnn_Keras/master/data/ag_news_csv/train.csv',
                 names=['label', 'title', 'description'])

# Map labels from integers to text
label_map = {1: 'World', 2: 'Sports', 3: 'Business', 4: 'Sci/Tech'}
df['topic'] = df['label'].map(label_map)

# Combine title and description into a single text column
df['text'] = df['title'] + " " + df['description']

# Keep only the relevant columns and take a sample so processing is faster
df = df[['text', 'topic']].sample(10000, random_state=42) # Take 10,000 random samples

print("Dataset loaded and processed successfully.")
df.head()

Dataset loaded and processed successfully.


| | text | topic |
|---|---|---|
| 71787 | BBC set for major shake-up, claims newspaper L... | Business |
| 67218 | Marsh averts cash crunch Embattled insurance b... | Business |
| 54066 | Jeter, Yankees Look to Take Control (AP) AP - ... | Sports |
| 7168 | Flying the Sun to Safety When the Genesis caps... | Sci/Tech |
| 29618 | Stocks Seen Flat as Nortel and Oil Weigh NEW ... | Business |


### 2. Vectorization and Model Building

In [None]:
# Define the features (X) and the target (y)
X = df['text']
y = df['topic']

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Initialize and fit the TF-IDF vectorizer (fit on the training set only)
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_vec = tfidf.fit_transform(X_train)
X_test_vec = tfidf.transform(X_test)

# Train the Logistic Regression model
classifier = LogisticRegression(max_iter=1000)
print("Training the classification model...")
classifier.fit(X_train_vec, y_train)
print("Model training complete!")

# Evaluate the model
y_pred = classifier.predict(X_test_vec)
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))

Training the classification model...
Model training complete!

Classification Report:

              precision    recall  f1-score   support

    Business       0.90      0.86      0.88       494
    Sci/Tech       0.89      0.89      0.89       503
      Sports       0.94      0.97      0.96       510
       World       0.89      0.90      0.89       493

    accuracy                           0.91      2000
   macro avg       0.91      0.91      0.91      2000
weighted avg       0.91      0.91      0.91      2000
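
To help read the report above: each class's F1-score is the harmonic mean of its precision and recall, the macro average is the unweighted mean over the four classes, and the weighted average weights each class by its support. The sketch below recomputes these from the rounded values printed in the report; because the inputs are already rounded to two decimals, the results can differ from the printed ones by about 0.01. The names `report_rows` and `f1_score_from` are illustrative and are not part of scikit-learn.

```python
# Rounded (precision, recall, f1, support) values copied from the report above.
report_rows = {
    "Business": (0.90, 0.86, 0.88, 494),
    "Sci/Tech": (0.89, 0.89, 0.89, 503),
    "Sports":   (0.94, 0.97, 0.96, 510),
    "World":    (0.89, 0.90, 0.89, 493),
}

def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Per-class F1 recomputed from the rounded precision and recall.
recomputed = {label: f1_score_from(p, r) for label, (p, r, _, _) in report_rows.items()}

# Macro average: every class counts equally.
macro_f1 = sum(f1 for _, _, f1, _ in report_rows.values()) / len(report_rows)

# Weighted average: each class is weighted by its support (number of test rows).
total_support = sum(s for _, _, _, s in report_rows.values())
weighted_f1 = sum(f1 * s for _, _, f1, s in report_rows.values()) / total_support

for label, f1 in recomputed.items():
    print(f"{label:<10} recomputed F1 = {f1:.3f}")
print(f"macro F1 = {macro_f1:.3f}, weighted F1 = {weighted_f1:.3f}, support = {total_support}")
```

Because the test split is stratified and the classes are nearly balanced, the macro and weighted averages end up almost identical here.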



## Part 2: Entity Extraction with spaCy

In [None]:
# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

def extract_entities(text):
    """
    Take a text and return a dictionary that maps each entity label
    to the list of unique entities found for that label.
    """
    doc = nlp(text)
    entities = defaultdict(list)
    for ent in doc.ents:
        entities[ent.label_].append(ent.text)

    # Remove duplicates
    for key in entities:
        entities[key] = list(set(entities[key]))

    return dict(entities)

# Example usage
sample_text = "Elon Musk, CEO of SpaceX, announced a new mission to Mars from their headquarters in California."
print("Entity Extraction Example:")
print(extract_entities(sample_text))

Entity Extraction Example:
{'PERSON': ['Elon Musk'], 'GPE': ['SpaceX', 'California'], 'LOC': ['Mars']}
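
`list(set(...))` removes duplicates but does not preserve the order in which the entities appear, so the printed order can change between runs. If a stable order matters, one option is `dict.fromkeys`, which keeps the first occurrence of each item. A minimal sketch follows; it uses hard-coded `(label, text)` pairs as a stand-in for `(ent.label_, ent.text)` so that it runs without spaCy, and `group_entities_ordered` is a hypothetical helper name.

```python
from collections import defaultdict

def group_entities_ordered(pairs):
    """Group entity texts by label, dropping duplicates but keeping first-seen order."""
    grouped = defaultdict(list)
    for label, text in pairs:
        grouped[label].append(text)
    # dict.fromkeys keeps insertion order (guaranteed since Python 3.7).
    return {label: list(dict.fromkeys(texts)) for label, texts in grouped.items()}

# Stand-in for [(ent.label_, ent.text) for ent in doc.ents]; the values are made up.
pairs = [
    ("ORG", "Microsoft Corp"),
    ("ORG", "Apple Inc"),
    ("GPE", "New York"),
    ("ORG", "Microsoft Corp"),  # duplicate mention
]
result = group_entities_ordered(pairs)
print(result)
```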


## Part 3: Putting It All Together

In [None]:
def analyze_article(article_text):
    """
    Full pipeline:
    1. Classify the topic of the article.
    2. Extract the entities from the article.
    3. Print the results in a readable format.
    """
    print("--- Analyzing Article ---")

    # 1. Predict the topic
    text_vec = tfidf.transform([article_text])
    predicted_topic = classifier.predict(text_vec)[0]

    # 2. Extract the entities
    entities_found = extract_entities(article_text)

    # 3. Display the results
    print(f"\nPredicted Topic: **{predicted_topic}**\n")
    print("--- Entities Found ---")
    if not entities_found:
        print("No entities found.")
    else:
        for label, items in entities_found.items():
            print(f"- **{label}**: {', '.join(items)}")

    print("\n--- Analysis Complete ---\n")

### Testing the Full Pipeline

In [None]:
# Example 1: Business/Technology article
article_1 = """
Microsoft Corp on Tuesday announced its next-generation Surface laptops,
including a new model with a custom artificial intelligence chip, as it amps up its rivalry
with Apple Inc ahead of the holiday shopping season in the United States.
Satya Nadella presented the new features in a conference in New York.
"""
analyze_article(article_1)

# Example 2: Sports article
article_2 = """
Real Madrid secured a dramatic late victory against Manchester City in the Champions League final
held in Istanbul. A stunning goal from Vinicius Junior in the 88th minute sealed the win for
the Spanish giants, leaving manager Pep Guardiola disappointed.
"""
analyze_article(article_2)

--- Analyzing Article ---

Predicted Topic: **Sci/Tech**

--- Entities Found ---
- **ORG**: Apple Inc, Surface, Microsoft Corp
- **DATE**: Tuesday
- **GPE**: the United States, New York
- **PERSON**: Satya Nadella

--- Analysis Complete ---

--- Analyzing Article ---

Predicted Topic: **Sports**

--- Entities Found ---
- **ORG**: Real Madrid, the Champions League, Vinicius Junior
- **GPE**: Istanbul, Manchester City
- **TIME**: the 88th minute
- **NORP**: Spanish
- **PERSON**: Pep Guardiola

--- Analysis Complete ---
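
`analyze_article` prints its results, which is convenient in a notebook but hard to reuse or test. A possible variation returns a dictionary instead and receives the two steps as arguments. In the sketch below, `analyze_article_as_dict`, `classify`, and `extract` are hypothetical names, and the two toy functions only stand in for the trained `classifier` and for `extract_entities`; they are not real models.

```python
import json

def analyze_article_as_dict(article_text, classify, extract):
    """Run both steps and return the results instead of printing them."""
    text = article_text.strip()
    return {"topic": classify(text), "entities": extract(text)}

# Toy stand-ins so the sketch runs without scikit-learn or spaCy
# (assumptions for illustration only, not real models).
def toy_classify(text):
    return "Sports" if "Champions League" in text else "Unknown"

def toy_extract(text):
    known = {"Istanbul": "GPE", "Pep Guardiola": "PERSON"}
    found = {}
    for name, label in known.items():
        if name in text:
            found.setdefault(label, []).append(name)
    return found

result = analyze_article_as_dict(
    "Real Madrid won the Champions League final held in Istanbul.",
    classify=toy_classify,
    extract=toy_extract,
)
print(json.dumps(result, indent=2))
```

With the notebook's own objects, `classify` could be `lambda t: classifier.predict(tfidf.transform([t]))[0]` and `extract` could be `extract_entities`.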

