## UAS
##  Crawling
Link app :https://uasppw-appmr7xxfrksngtbcrdhmd6.streamlit.app/

Nama : Farid Ghozali

NIM  : 210411100119

**Crawling adalah proses otomatis mengumpulkan informasi dari World Wide Web, biasanya dilakukan oleh program yang disebut "web crawler" atau "spider." Web crawler adalah bot yang dirancang untuk menjelajah secara sistematis melalui halaman web, mengikuti tautan, dan mengumpulkan data untuk diindeks atau diolah lebih lanjut. Proses ini merupakan bagian penting dari aktivitas web scraping dan indeksasi mesin pencari.**





## $Manfaat$ $Crawling$
* Penelitian Akademis dan Ilmiah:

  Dalam konteks penelitian, crawling dapat digunakan untuk mengumpulkan data untuk analisis akademis, penelitian pasar, atau untuk memahami tren di berbagai bidang studi.
* Pengembangan Aplikasi dan Layanan:

  Web crawling sering digunakan oleh pengembang untuk mengumpulkan data yang diperlukan untuk pengembangan aplikasi dan layanan, seperti agregator berita, portal informasi, atau layanan pembanding harga.

* Pemantauan Sosial Media:

  Web crawling dapat digunakan untuk memantau dan menganalisis data dari platform sosial media untuk memahami sentimen pengguna, tren, atau untuk tujuan penelitian.

## $LDA$
adalah model generatif probabilistik yang digunakan untuk mengidentifikasi topik utama atau kelompok topik yang muncul dalam koleksi dokumen. Metode ini didasarkan pada asumsi bahwa setiap dokumen adalah gabungan dari beberapa topik, dan setiap kata dalam dokumen berasal dari salah satu topik tersebut.

Inti dari LDA adalah mengasumsikan bahwa terdapat distribusi probabilitas tertentu dari topik dalam seluruh korpus dokumen, dan masing-masing dokumen merupakan campuran dari topik-topik ini. Selanjutnya, setiap kata dalam dokumen dihasilkan oleh salah satu topik tersebut.



## $Algoritma$ $(LDA)$

* Inisialisasi Parameter:

  Tentukan jumlah topik,
  $K$.
  Tetapkan distribusi prior Dirichlet untuk distribusi topik dalam dokumen
  $(α)$ dan distribusi kata dalam topik (β).
* Inisialisasi Alokasi Topik:

  Setiap kata dalam setiap dokumen diberi topik awal secara acak.
* Iterasi Gibbs Sampling (atau Variational Inference):

* Ulangi langkah-langkah berikut untuk sejumlah iterasi tertentu:
Perhitungan Distribusi Posterior:

  Hitung distribusi probabilitas posterior untuk setiap kata dalam setiap dokumen berdasarkan alokasi topik saat ini.

  Resampling Topik:
Secara acak alokasikan ulang setiap kata ke topik baru berdasarkan distribusi posterior dan distribusi kata dalam topik (β).

  Perhitungan Distribusi Posterior:
Hitung kembali distribusi probabilitas posterior setelah pengalokasian ulang topik.

* Output Hasil:

  Setelah sejumlah iterasi, hasilnya adalah distribusi topik dalam setiap dokumen dan distribusi kata dalam setiap topik.
Langkah-langkah ini mungkin terlihat kompleks, tetapi algoritma LDA secara umum menggunakan Gibbs Sampling atau Variational Inference untuk mencapai distribusi posterior yang sesuai. Dengan melakukan iterasi dan mengulangi langkah-langkah tersebut, model LDA berusaha untuk menemukan distribusi topik yang paling mungkin di dalam korpus dokumen.

## $Implementasi$
Crawling data judul berita dan isi berita dari https://www.jpnn.com/

## $Politik$

In [1]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
from datetime import date, timedelta

# get_article_content(article_url) digunakan untuk mengambil konten artikel dari URL yang diberikan.
def get_article_content(article_url):
    response = requests.get(article_url)

    # untuk memeriksa apakah status kode respons adalah 200, yang berarti permintaan berhasil kemudian mengambil konten artikel dari elemen div
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        article_content = soup.find('div', itemprop="articleBody").find_all('p')
        content = '\n'.join([p.get_text() for p in article_content])
        return content
    return ""

# scrape_news_data(start_date, end_date) digunakan untuk mengambil data berita dari situs web jpnn.com dalam rentang tanggal tertentu.
def scrape_news_data(start_date, end_date):
    base_url = "https://www.jpnn.com/indeks?id=248&d={day}&m={month}&y={year}&tab=all"
    data = []

    current_date = start_date
    while current_date <= end_date:
        url = base_url.format(day=current_date.day, month=current_date.month, year=current_date.year)
        response = requests.get(url)

        # untuk memeriksa apakah status kode respons adalah 200, yang berarti permintaan berhasil kemudian mengambil judul berita dari elemen div
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            news_elements = soup.find('div', class_="content").find_all('h1')

            for element in news_elements:
                title = element.a.text
                article_url = element.a['href']
                content = get_article_content(article_url)

                # Data berita disimpan dalam sebuah list dictionary dengan kunci 'Date', 'Title', dan 'Content'
                data.append({'Date': current_date.strftime('%d-%m-%Y'), 'Title': title, 'Content': content})

            current_date += timedelta(days=1)
        else:
            print(f"Failed to fetch data for {current_date.strftime('%d-%m-%Y')}")

    return data

# Untuk menentukan rentang waktu untuk pengambilan data berita.
from datetime import date, timedelta
start_date = date(2023, 10, 1)
end_date = date(2023, 11, 1)

news_data = scrape_news_data(start_date, end_date)

In [2]:
# Untuk membuat DataFrame dari data berita yang dikumpulkan oleh fungsi scrape_news_data(start_date, end_date)
df = pd.DataFrame(news_data)

In [3]:
# untuk menghapus karakter baris baru ('\n') dari kolom 'Title' dalam DataFrame df
df['Title'] = df['Title'].str.replace('\n', '')

In [4]:
# menambahkan kolom baru bernama 'Label' ke dalam DataFrame df dan mengisinya dengan nilai 'politik' untuk setiap baris
df['Label'] = 'politik'

In [5]:
# menampilkan data
df

Unnamed: 0,Date,Title,Content,Label
0,01-10-2023,Megawati Ungkap Orang Luar Tak Bisa Langsung J...,"jpnn.com, JAKARTA - Ketua Umum PDI Perjuangan ...",politik
1,01-10-2023,"Peringati Hari Kesaktian Pancasila, KawanJuang...","jpnn.com, PURWAKARTA - Para sukarelawan penduk...",politik
2,01-10-2023,"Ganjar dan Anies Hadiri Acara IdeaFest, di Man...","jpnn.com, JAKARTA - Ketiga bacapres Ganjar Pra...",politik
3,01-10-2023,"Silaturahmi ke Rembang, Anies Diberi Tongkat K...","jpnn.com, JAKARTA - Anies Baswedan mengunjungi...",politik
4,01-10-2023,"Survei Erick Thohir Teratas di Jatim, Pengamat...","jpnn.com, JAKARTA - Nama Erick Thohir punya ke...",politik
...,...,...,...,...
635,01-11-2023,Ribuan Warga Aceh Berzikir & Selawat Bersama A...,"jpnn.com, ACEH UTARA - Puluhan ribu rakyat Ace...",politik
636,01-11-2023,Mendagri Tito Karnavian Dorong Polri Aktif Awa...,"jpnn.com, JAKARTA - Menteri Dalam Negeri (Men...",politik
637,01-11-2023,"PKPU Nomor 19 Sudah Direvisi, tetapi Gibran Be...",jpnn.com - JAKARTA - Komisi II DPR RI dan peme...,politik
638,01-11-2023,Survei: Masyarakat Jateng Mulai Masif Mendukun...,jpnn.com - JAKARTA - Hasil survei terbaru Poll...,politik


In [6]:
# Simpan DataFrame ke file CSV
df.to_csv('jnn_politik_news.csv', index=False)

## $Ekonomi$

In [7]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
from datetime import date, timedelta

# get_article_content(article_url) digunakan untuk mengambil konten artikel dari URL yang diberikan.
def get_article_content(article_url):
    response = requests.get(article_url)

    # untuk memeriksa apakah status kode respons adalah 200, yang berarti permintaan berhasil kemudian mengambil konten artikel dari elemen div
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        article_content = soup.find('div', itemprop="articleBody").find_all('p')
        content = '\n'.join([p.get_text() for p in article_content])
        return content
    return ""

# scrape_news_data(start_date, end_date) digunakan untuk mengambil data berita dari situs web jpnn.com dalam rentang tanggal tertentu.
def scrape_news_data(start_date, end_date):
    base_url = "https://www.jpnn.com/indeks?id=216&d=01&m=10&y=2023&tab=all"
    data = []

    current_date = start_date
    while current_date <= end_date:
        url = base_url.format(day=current_date.day, month=current_date.month, year=current_date.year)
        response = requests.get(url)

        # untuk memeriksa apakah status kode respons adalah 200, yang berarti permintaan berhasil kemudian mengambil judul berita dari elemen div
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            news_elements = soup.find('div', class_="content").find_all('h1')

            for element in news_elements:
                title = element.a.text
                article_url = element.a['href']
                content = get_article_content(article_url)

                # Data berita disimpan dalam sebuah list dictionary dengan kunci 'Date', 'Title', dan 'Content'
                data.append({'Date': current_date.strftime('%d-%m-%Y'), 'Title': title, 'Content': content})

            current_date += timedelta(days=1)
        else:
            print(f"Failed to fetch data for {current_date.strftime('%d-%m-%Y')}")

    return data

# Untuk menentukan rentang waktu untuk pengambilan data berita.
from datetime import date, timedelta
start_date = date(2023, 10, 1)
end_date = date(2023, 11, 1)

news_data = scrape_news_data(start_date, end_date)

In [8]:
# Untuk membuat DataFrame dari data berita yang dikumpulkan oleh fungsi scrape_news_data(start_date, end_date)
df2 = pd.DataFrame(news_data)

In [9]:
# menghapus karakter tab ('\t') dari kolom 'Title' dan karakter baris baru ('\n') dari kolom 'Content'
df2['Title'] = df2['Title'].str.replace('\t', '')
df2['Content'] = df2['Content'].str.replace('\n', '')

In [10]:
# menambahkan kolom baru bernama 'Label' ke dalam DataFrame df dan mengisinya dengan nilai 'Ekonomi' untuk setiap baris
df2['Label'] = 'Ekonomi'

In [11]:
# menampilkan data
df2

Unnamed: 0,Date,Title,Content,Label
0,01-10-2023,Erick Thohir Mengaku Jatuh Cinta pada Program ...,"jpnn.com, JAKARTA - Menteri Badan Usaha Milik ...",Ekonomi
1,01-10-2023,Menteri Bahlil Pastikan Investasi Rempang Berd...,"jpnn.com, JAKARTA - Menteri Investasi/Kepala B...",Ekonomi
2,01-10-2023,"Gandeng LKPP, Pertamina Implementasikan Aplika...","jpnn.com, JAKARTA - PT Pertamina (Persero) men...",Ekonomi
3,01-10-2023,"Kini Fokus Jadi Entrepreneur, Zahra Amalina Me...","jpnn.com, JAKARTA - Model sekaligus pemain sin...",Ekonomi
4,01-10-2023,"Terapkan ESG, OCS Group Gandeng World Cleanup ...","jpnn.com, JAKARTA - OCS Group Indonesia mengua...",Ekonomi
...,...,...,...,...
507,01-11-2023,Ikhitiar Pinjam Yuk Mendorong UMKM Kembangkan ...,"jpnn.com, JAKARTA - Platform peer to peer lend...",Ekonomi
508,01-11-2023,Gegara Ini Industri Kreatif di Berbagai Daerah...,"jpnn.com, JAKARTA - Pelaku industri kreatif di...",Ekonomi
509,01-11-2023,"Kinerja Moncer, KAI Logistik Raih Penghargaan ...","jpnn.com, JAKARTA - KAI Logistik meraih pengha...",Ekonomi
510,01-11-2023,"Bicara di Diskusi Pameran Pangan, Ketua Aprind...","jpnn.com, JAKARTA PUSAT - Ketua Asosiasi Pengu...",Ekonomi


In [12]:
# Simpan DataFrame ke file CSV
df2.to_csv('jpnn_ekonomi_news.csv', index=False)

## $Olahraga$

In [13]:
import requests
import pandas as pd
from bs4 import BeautifulSoup
from datetime import date, timedelta

# get_article_content(article_url) digunakan untuk mengambil konten artikel dari URL yang diberikan.
def get_article_content(article_url):
    response = requests.get(article_url)

    # untuk memeriksa apakah status kode respons adalah 200, yang berarti permintaan berhasil kemudian mengambil konten artikel dari elemen div
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        article_content = soup.find('div', itemprop="articleBody").find_all('p')
        content = '\n'.join([p.get_text() for p in article_content])
        return content
    return ""

# scrape_news_data(start_date, end_date) digunakan untuk mengambil data berita dari situs web jpnn.com dalam rentang tanggal tertentu.
def scrape_news_data(start_date, end_date):
    base_url = "https://www.jpnn.com/indeks?id=213&d=01&m=10&y=2023&tab=all"
    data = []

    current_date = start_date
    while current_date <= end_date:
        url = base_url.format(day=current_date.day, month=current_date.month, year=current_date.year)
        response = requests.get(url)

        # untuk memeriksa apakah status kode respons adalah 200, yang berarti permintaan berhasil kemudian mengambil judul berita dari elemen div
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            news_elements = soup.find('div', class_="content").find_all('h1')

            for element in news_elements:
                title = element.a.text
                article_url = element.a['href']
                content = get_article_content(article_url)

                # Data berita disimpan dalam sebuah list dictionary dengan kunci 'Date', 'Title', dan 'Content'
                data.append({'Date': current_date.strftime('%d-%m-%Y'), 'Title': title, 'Content': content})

            current_date += timedelta(days=1)
        else:
            print(f"Failed to fetch data for {current_date.strftime('%d-%m-%Y')}")

    return data

# Untuk menentukan rentang waktu untuk pengambilan data berita.
from datetime import date, timedelta
start_date = date(2023, 10, 1)
end_date = date(2023, 11, 1)

news_data_olahraga = scrape_news_data(start_date, end_date)

In [14]:
# Untuk membuat DataFrame dari data berita yang dikumpulkan oleh fungsi scrape_news_data(start_date, end_date)
df3 = pd.DataFrame(news_data_olahraga)

In [15]:
# menghapus karakter tab ('\t') dari kolom 'Title' dan karakter baris baru ('\n') dari kolom 'Content'
df3['Title'] = df3['Title'].str.replace('\t', '')
df3['Content'] = df3['Content'].str.replace('\n', '')

In [16]:
# menambahkan kolom baru bernama 'Label' ke dalam DataFrame df dan mengisinya dengan nilai 'Olahraga' untuk setiap baris
df3['Label'] = 'Olahraga'

In [17]:
# menampilkan data
df3

Unnamed: 0,Date,Title,Content,Label
0,01-10-2023,"Klasemen Liga 1: Berpesta Gol, Persib Bandung ...",jpnn.com - BANDUNG - Persib Bandung membukukan...,Olahraga
1,01-10-2023,Menpora Dito Ariotedjo Ungkap Peran Anak Muda ...,jpnn.com - Menteri Pemuda dan Olahraga (Menpor...,Olahraga
2,01-10-2023,"China Belum Bisa Menang di Road to Paris, Liha...",jpnn.com - XI’AN – China menelan kekalahan ked...,Olahraga
3,01-10-2023,"Road to Paris: Mesir Bikin Jepang Menderita, A...",jpnn.com - TOKYO – Kejutan besar terjadi pada ...,Olahraga
4,01-10-2023,Lagi! Veda Ega Bikin Pembalap Tuan Rumah Tak B...,"jpnn.com, JAKARTA - Pembalap muda Indonesia Ve...",Olahraga
...,...,...,...,...
571,01-11-2023,"Lalu Muhammad Zohri Finis Keenam, Indonesia Pa...","jpnn.com, JAKARTA - Kontingen atlet Indonesia ...",Olahraga
572,01-11-2023,Garuda Muda Siap Beri Obat Pelipur Lara Bagi F...,jpnn.com - Tim bulu tangkis beregu campuran In...,Olahraga
573,01-11-2023,Bulu Tangkis Asian Games 2022: Tim Putra China...,jpnn.com - Tim bulu tangkis beregu putra China...,Olahraga
574,01-11-2023,"Asian Games 2022: Kehabisan Bensin, Timnas Bas...",jpnn.com - Timnas basket putra Indonesia menga...,Olahraga


In [18]:
# Simpan DataFrame ke file CSV
df3.to_csv('jpnn_olahraga_news.csv', index=False)

## $Gabungkan Data$

In [19]:
# menggabungkan tiga DataFrame df, df2, dan df3 menjadi satu DataFrame baru bernama combined_df.
combined_df = pd.concat([df, df2, df3], ignore_index=True)
combined_df

Unnamed: 0,Date,Title,Content,Label
0,01-10-2023,Megawati Ungkap Orang Luar Tak Bisa Langsung J...,"jpnn.com, JAKARTA - Ketua Umum PDI Perjuangan ...",politik
1,01-10-2023,"Peringati Hari Kesaktian Pancasila, KawanJuang...","jpnn.com, PURWAKARTA - Para sukarelawan penduk...",politik
2,01-10-2023,"Ganjar dan Anies Hadiri Acara IdeaFest, di Man...","jpnn.com, JAKARTA - Ketiga bacapres Ganjar Pra...",politik
3,01-10-2023,"Silaturahmi ke Rembang, Anies Diberi Tongkat K...","jpnn.com, JAKARTA - Anies Baswedan mengunjungi...",politik
4,01-10-2023,"Survei Erick Thohir Teratas di Jatim, Pengamat...","jpnn.com, JAKARTA - Nama Erick Thohir punya ke...",politik
...,...,...,...,...
1723,01-11-2023,"Lalu Muhammad Zohri Finis Keenam, Indonesia Pa...","jpnn.com, JAKARTA - Kontingen atlet Indonesia ...",Olahraga
1724,01-11-2023,Garuda Muda Siap Beri Obat Pelipur Lara Bagi F...,jpnn.com - Tim bulu tangkis beregu campuran In...,Olahraga
1725,01-11-2023,Bulu Tangkis Asian Games 2022: Tim Putra China...,jpnn.com - Tim bulu tangkis beregu putra China...,Olahraga
1726,01-11-2023,"Asian Games 2022: Kehabisan Bensin, Timnas Bas...",jpnn.com - Timnas basket putra Indonesia menga...,Olahraga


In [20]:
#menghitung jumlah kemunculan nilai unik pada kolom 'Label' dalam DataFrame combined_df
combined_df['Label'].value_counts()

politik     640
Olahraga    576
Ekonomi     512
Name: Label, dtype: int64

In [21]:
# Simpan DataFrame ke file CSV
combined_df.to_csv('data_berita.csv', index=False)

In [22]:
# menampilkan keseluruhan data
combined_df

Unnamed: 0,Date,Title,Content,Label
0,01-10-2023,Megawati Ungkap Orang Luar Tak Bisa Langsung J...,"jpnn.com, JAKARTA - Ketua Umum PDI Perjuangan ...",politik
1,01-10-2023,"Peringati Hari Kesaktian Pancasila, KawanJuang...","jpnn.com, PURWAKARTA - Para sukarelawan penduk...",politik
2,01-10-2023,"Ganjar dan Anies Hadiri Acara IdeaFest, di Man...","jpnn.com, JAKARTA - Ketiga bacapres Ganjar Pra...",politik
3,01-10-2023,"Silaturahmi ke Rembang, Anies Diberi Tongkat K...","jpnn.com, JAKARTA - Anies Baswedan mengunjungi...",politik
4,01-10-2023,"Survei Erick Thohir Teratas di Jatim, Pengamat...","jpnn.com, JAKARTA - Nama Erick Thohir punya ke...",politik
...,...,...,...,...
1723,01-11-2023,"Lalu Muhammad Zohri Finis Keenam, Indonesia Pa...","jpnn.com, JAKARTA - Kontingen atlet Indonesia ...",Olahraga
1724,01-11-2023,Garuda Muda Siap Beri Obat Pelipur Lara Bagi F...,jpnn.com - Tim bulu tangkis beregu campuran In...,Olahraga
1725,01-11-2023,Bulu Tangkis Asian Games 2022: Tim Putra China...,jpnn.com - Tim bulu tangkis beregu putra China...,Olahraga
1726,01-11-2023,"Asian Games 2022: Kehabisan Bensin, Timnas Bas...",jpnn.com - Timnas basket putra Indonesia menga...,Olahraga


In [1]:
import pandas as pd


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# membaca file CSV yang berada pada path dan menyimpannya dalam sebuah DataFrame bernama data.
data = pd.read_csv("/content/drive/MyDrive/ppw/tugas/Data/data_berita.csv")
data

Unnamed: 0,Date,Title,Content,Label
0,01-10-2023,Megawati Ungkap Orang Luar Tak Bisa Langsung J...,"jpnn.com, JAKARTA - Ketua Umum PDI Perjuangan ...",politik
1,01-10-2023,"Peringati Hari Kesaktian Pancasila, KawanJuang...","jpnn.com, PURWAKARTA - Para sukarelawan penduk...",politik
2,01-10-2023,"Ganjar dan Anies Hadiri Acara IdeaFest, di Man...","jpnn.com, JAKARTA - Ketiga bacapres Ganjar Pra...",politik
3,01-10-2023,"Silaturahmi ke Rembang, Anies Diberi Tongkat K...","jpnn.com, JAKARTA - Anies Baswedan mengunjungi...",politik
4,01-10-2023,"Survei Erick Thohir Teratas di Jatim, Pengamat...","jpnn.com, JAKARTA - Nama Erick Thohir punya ke...",politik
...,...,...,...,...
1723,01-11-2023,"Lalu Muhammad Zohri Finis Keenam, Indonesia Pa...","jpnn.com, JAKARTA - Kontingen atlet Indonesia ...",Olahraga
1724,01-11-2023,Garuda Muda Siap Beri Obat Pelipur Lara Bagi F...,jpnn.com - Tim bulu tangkis beregu campuran In...,Olahraga
1725,01-11-2023,Bulu Tangkis Asian Games 2022: Tim Putra China...,jpnn.com - Tim bulu tangkis beregu putra China...,Olahraga
1726,01-11-2023,"Asian Games 2022: Kehabisan Bensin, Timnas Bas...",jpnn.com - Timnas basket putra Indonesia menga...,Olahraga


# menghilangkan tanda baca

In [4]:
#import library
import pandas as pd
import numpy as np

In [5]:
!pip install Sastrawi


Collecting Sastrawi
  Downloading Sastrawi-1.0.1-py2.py3-none-any.whl (209 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m209.7/209.7 kB[0m [31m3.9 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: Sastrawi
Successfully installed Sastrawi-1.0.1


In [6]:
from nltk.corpus import stopwords
import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [7]:
#Remove Puncutuation
clean_tag = re.compile('@\S+')
clean_url = re.compile('https?:\/\/.*[\r\n]*')
clean_hastag = re.compile('#\S+')
clean_symbol = re.compile('[^a-zA-Z]')
def clean_punct(text):
    text = clean_tag.sub('', str(text))
    text = clean_url.sub('', text)
    text = clean_hastag.sub(' ', text)
    text = clean_symbol.sub(' ', text)
    return text
# Buat kolom tambahan untuk data description yang telah diremovepunctuation
preprocessing = data['Content'].apply(clean_punct)
clean=pd.DataFrame(preprocessing)
clean

Unnamed: 0,Content
0,jpnn com JAKARTA Ketua Umum PDI Perjuangan ...
1,jpnn com PURWAKARTA Para sukarelawan penduk...
2,jpnn com JAKARTA Ketiga bacapres Ganjar Pra...
3,jpnn com JAKARTA Anies Baswedan mengunjungi...
4,jpnn com JAKARTA Nama Erick Thohir punya ke...
...,...
1723,jpnn com JAKARTA Kontingen atlet Indonesia ...
1724,jpnn com Tim bulu tangkis beregu campuran In...
1725,jpnn com Tim bulu tangkis beregu putra China...
1726,jpnn com Timnas basket putra Indonesia menga...


## Tokenisasi

In [9]:
data_clean=[]
for i in range(len(preprocessing)):
  data_clean.append(preprocessing[i])

In [10]:
tokenize=[]
for i in range(len(data_clean)):
  token=word_tokenize(data_clean[i])
  tokendata = []
  for x in token :
    tokendata.append(x)
  tokenize.append(tokendata)
  print(tokendata)

Output hidden; open in https://colab.research.google.com to view.

## Stopwords Removal

In [11]:
stopword=[]
for i in range(len(tokenize)):
  listStopword =  set(stopwords.words('indonesian'))
  removed=[]
  for x in (tokenize[i]):
    if x not in listStopword:
       removed.append(x)
  stopword.append(removed)
  print(removed)

['jpnn', 'com', 'JAKARTA', 'Ketua', 'Umum', 'PDI', 'Perjuangan', 'Megawati', 'Soekarnoputri', 'partai', 'pemenenang', 'pemilu', 'pendatang', 'langsung', 'ketua', 'Menurutnya', 'aturan', 'dipatuhi', 'PDIP', 'Megawati', 'mengaku', 'langsung', 'ketua', 'Ia', 'memulainya', 'kader', 'Saya', 'kader', 'orang', 'ketua', 'Karena', 'memilih', 'orang', 'dipilih', 'Dan', 'melanggar', 'AD', 'ART', 'Lah', 'bayangkan', 'kesempatan', 'menerangkan', 'kontradiktif', 'Megawati', 'penutupan', 'Rakernas', 'IV', 'PDIP', 'JIExpo', 'Kemayoran', 'Jakarta', 'Minggu', 'Megawati', 'mengaku', 'petugas', 'partai', 'PDIP', 'Ia', 'ditugasi', 'Kongres', 'ketua', 'partai', 'Karena', 'Megawati', 'heran', 'dianggap', 'sombong', 'Presiden', 'Joko', 'Widodo', 'petugas', 'partai', 'Saya', 'bingung', 'lha', 'bilang', 'Pak', 'Jokowi', 'petugas', 'partai', 'kader', 'lho', 'diomongkan', 'namanya', 'sombong', 'Itu', 'AD', 'ART', 'partai', 'Saya', 'petugas', 'partai', 'lho', 'presiden', 'RI', 'Ditugasi', 'Kongres', 'partai', 'dip

In [12]:
join=[]
for i in range(len(stopword)):
  joinkata = ' '.join(stopword[i])
  join.append(joinkata)

hasilpreproses = pd.DataFrame(join, columns=['Content'])
hasilpreproses

Unnamed: 0,Content
0,jpnn com JAKARTA Ketua Umum PDI Perjuangan Meg...
1,jpnn com PURWAKARTA Para sukarelawan pendukung...
2,jpnn com JAKARTA Ketiga bacapres Ganjar Pranow...
3,jpnn com JAKARTA Anies Baswedan mengunjungi Po...
4,jpnn com JAKARTA Nama Erick Thohir keterpiliha...
...,...
1723,jpnn com JAKARTA Kontingen atlet Indonesia Asi...
1724,jpnn com Tim bulu tangkis beregu campuran Indo...
1725,jpnn com Tim bulu tangkis beregu putra China s...
1726,jpnn com Timnas basket putra Indonesia petuala...


## Term Frequency

In [13]:
from sklearn.feature_extraction.text import CountVectorizer

In [14]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(np.array(hasilpreproses['Content']))

In [15]:
tf = vectorizer.get_feature_names_out()
tf

array(['aamiin', 'ab', 'abad', ..., 'zulhas', 'zulkifli', 'zulkilfi'],
      dtype=object)

In [16]:
tf_array = X.toarray()
print(tf_array)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]


In [19]:
df_tf= pd.DataFrame(tf_array, columns = tf)
df_tf

Unnamed: 0,aamiin,ab,abad,abadi,abah,abas,abbas,abd,abdel,abdi,...,zoni,zubair,zubir,zuhad,zuhro,zuhur,zulhaq,zulhas,zulkifli,zulkilfi
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1723,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1724,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1725,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1726,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [28]:
# Menghapus baris dari indeks 1500 ke atas
df_tf = df_tf.iloc[:1000, :]

# Menampilkan DataFrame yang telah diubah
df_tf= pd.DataFrame(df_tf)
df_tf


Unnamed: 0,aamiin,ab,abad,abadi,abah,abas,abbas,abd,abdel,abdi,...,zoni,zubair,zubir,zuhad,zuhro,zuhur,zulhaq,zulhas,zulkifli,zulkilfi
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
df_tf.to_csv('/content/drive/MyDrive/ppw/tugas/Data/uas/df_tf_baru.csv', index=False)

In [30]:
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

lda_results = []

for n in range(1, 51):
    lda = LatentDirichletAllocation(n_components=n, doc_topic_prior=0.2, topic_word_prior=0.1, random_state=42, max_iter=1)
    lda_top = lda.fit_transform(df_tf)
    lda_results.append(lda_top)

In [31]:
n_components = 50
column_names = [f'Topik {i+1}' for i in range(n_components)]
topik50 = pd.DataFrame(lda_top, columns=column_names)

In [32]:
topik50

Unnamed: 0,Topik 1,Topik 2,Topik 3,Topik 4,Topik 5,Topik 6,Topik 7,Topik 8,Topik 9,Topik 10,...,Topik 41,Topik 42,Topik 43,Topik 44,Topik 45,Topik 46,Topik 47,Topik 48,Topik 49,Topik 50
0,0.001413,0.001441,0.001427,0.001470,0.001409,0.001419,0.001419,0.001423,0.001414,0.001415,...,0.001413,0.001433,0.203412,0.001418,0.001416,0.001440,0.001418,0.001433,0.001463,0.001424
1,0.001423,0.009920,0.001436,0.001436,0.001419,0.001429,0.001431,0.001456,0.001430,0.001443,...,0.001427,0.001437,0.001457,0.001429,0.001437,0.001462,0.001454,0.001442,0.001454,0.001444
2,0.001940,0.001974,0.001948,0.001969,0.001924,0.001936,0.001950,0.001947,0.001931,0.001958,...,0.001932,0.001961,0.001960,0.001927,0.001957,0.001984,0.001934,0.001960,0.001969,0.001948
3,0.001646,0.001665,0.001657,0.001655,0.001641,0.001649,0.001651,0.001654,0.001650,0.001665,...,0.001662,0.001674,0.001661,0.001648,0.001644,0.001667,0.001650,0.001655,0.001660,0.001657
4,0.001520,0.766272,0.001542,0.151609,0.001516,0.001550,0.001521,0.001528,0.001520,0.001544,...,0.001526,0.001536,0.001543,0.001521,0.001586,0.001556,0.001524,0.001542,0.001536,0.001534
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.001787,0.001793,0.001791,0.001839,0.001786,0.001789,0.001791,0.001791,0.001788,0.001792,...,0.001788,0.001792,0.001794,0.912144,0.001791,0.001795,0.001792,0.001793,0.001794,0.001792
996,0.001668,0.001675,0.001669,0.001673,0.001667,0.001669,0.001670,0.001676,0.001668,0.001670,...,0.001668,0.001672,0.001671,0.001669,0.001673,0.001672,0.918116,0.001673,0.001670,0.001674
997,0.001308,0.001310,0.001309,0.001312,0.001307,0.001309,0.001309,0.001309,0.001308,0.001317,...,0.001308,0.001313,0.001311,0.001308,0.935752,0.001310,0.001312,0.001313,0.001310,0.001315
998,0.001343,0.001346,0.001344,0.001346,0.001343,0.001345,0.001344,0.001345,0.001343,0.001346,...,0.001343,0.001356,0.001346,0.001344,0.001355,0.001347,0.001348,0.001350,0.001345,0.001368


In [33]:
data = pd.concat([topik50, data['Label']], axis=1)

print(data)

       Topik 1   Topik 2   Topik 3   Topik 4   Topik 5   Topik 6   Topik 7  \
0     0.001413  0.001441  0.001427  0.001470  0.001409  0.001419  0.001419   
1     0.001423  0.009920  0.001436  0.001436  0.001419  0.001429  0.001431   
2     0.001940  0.001974  0.001948  0.001969  0.001924  0.001936  0.001950   
3     0.001646  0.001665  0.001657  0.001655  0.001641  0.001649  0.001651   
4     0.001520  0.766272  0.001542  0.151609  0.001516  0.001550  0.001521   
...        ...       ...       ...       ...       ...       ...       ...   
1723       NaN       NaN       NaN       NaN       NaN       NaN       NaN   
1724       NaN       NaN       NaN       NaN       NaN       NaN       NaN   
1725       NaN       NaN       NaN       NaN       NaN       NaN       NaN   
1726       NaN       NaN       NaN       NaN       NaN       NaN       NaN   
1727       NaN       NaN       NaN       NaN       NaN       NaN       NaN   

       Topik 8   Topik 9  Topik 10  ...  Topik 42  Topik 43  To

In [34]:
data= data.dropna()

In [35]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Pisahkan fitur dan label dari DataFrame yang telah digabungkan
X = data.drop(columns=['Label']).values
y = data['Label'].values

# Pembagian data menjadi data pelatihan dan data pengujian
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Inisialisasi model Naive Bayes
model = MultinomialNB()

# Pelatihan model Naive Bayes dengan data pelatihan
model.fit(X_train, y_train)

# Prediksi label kelas pada data pengujian
y_pred = model.predict(X_test)

# Mengukur akurasi model
accuracy = accuracy_score(y_test, y_pred)
print("Akurasi: {:.2f}%".format(accuracy * 100))

Akurasi: 96.50%


In [36]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Pisahkan fitur dan label dari DataFrame yang telah digabungkan
X = data.drop(columns=['Label']).values
y = data['Label'].values

# Pembagian data menjadi data pelatihan dan data pengujian
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Inisialisasi model KNN dengan n_neighbors=3
model = KNeighborsClassifier(n_neighbors=3)

# Pelatihan model KNN dengan data pelatihan
model.fit(X_train, y_train)

# Prediksi label kelas pada data pengujian
y_pred = model.predict(X_test)

# Mengukur akurasi model KNN
accuracy = accuracy_score(y_test, y_pred)
print("Akurasi KNN: {:.2f}%".format(accuracy * 100))


Akurasi KNN: 98.50%


In [53]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Pisahkan fitur dan label dari DataFrame yang telah digabungkan
X = data.drop(columns=['Label']).values
y = data['Label'].values

# Pembagian data menjadi data pelatihan dan data pengujian
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Inisialisasi model Decision Tree
model = DecisionTreeClassifier()

# Pelatihan model Decision Tree dengan data pelatihan
model.fit(X_train, y_train)

# Prediksi label kelas pada data pengujian
y_pred = model.predict(X_test)

# Mengukur akurasi model Decision Tree
accuracy = accuracy_score(y_test, y_pred)
print("Akurasi Decision Tree: {:.2f}%".format(accuracy * 100))


Akurasi Decision Tree: 96.53%


In [37]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Pisahkan fitur dan label dari DataFrame yang telah digabungkan
X = data.drop(columns=['Label']).values
y = data['Label'].values

# Pembagian data menjadi data pelatihan dan data pengujian
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Inisialisasi model Random Forest
model = RandomForestClassifier()

# Pelatihan model Random Forest dengan data pelatihan
model.fit(X_train, y_train)

# Prediksi label kelas pada data pengujian
y_pred = model.predict(X_test)

# Mengukur akurasi model Random Forest
accuracy = accuracy_score(y_test, y_pred)
print("Akurasi Random Forest: {:.2f}%".format(accuracy * 100))


Akurasi Random Forest: 100.00%


In [38]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Pisahkan fitur dan label dari DataFrame yang telah digabungkan
X = data.drop(columns=['Label']).values
y = data['Label'].values

# Pembagian data menjadi data pelatihan dan data pengujian
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Inisialisasi model SVM
model = SVC()

# Pelatihan model SVM dengan data pelatihan
model.fit(X_train, y_train)

# Prediksi label kelas pada data pengujian
y_pred = model.predict(X_test)

# Mengukur akurasi model SVM
accuracy = accuracy_score(y_test, y_pred)
print("Akurasi SVM: {:.2f}%".format(accuracy * 100))


Akurasi SVM: 97.00%
