##  $ Crawling $
Crawling is the automated process of collecting information from the World Wide Web, usually carried out by a program called a "web crawler" or "spider." A web crawler is a bot designed to traverse web pages systematically, follow links, and collect data to be indexed or processed further. It is a core part of web scraping and search engine indexing.
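The "follow links systematically" idea can be sketched as a breadth-first traversal. The block below is a minimal illustration, not the crawler used later in this notebook: `PAGES` is a hypothetical in-memory site so the example runs without network access, and `LinkExtractor` and `crawl` are names invented for the sketch.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical in-memory site: path -> HTML. A real crawler would fetch pages over HTTP.
PAGES = {
    "/": '<a href="/sports">Sports</a> <a href="/jakarta">Jakarta</a>',
    "/sports": '<a href="/">Home</a> <a href="/sports/1">Article</a>',
    "/jakarta": '<a href="/">Home</a>',
    "/sports/1": "<p>No links here.</p>",
}


def crawl(start):
    """Breadth-first traversal that visits each reachable page once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        parser = LinkExtractor()
        parser.feed(PAGES.get(url, ""))
        for link in parser.links:
            if link not in seen:  # skip pages already queued or visited
                seen.add(link)
                queue.append(link)
    return order


print(crawl("/"))
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other.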





## $Benefits$ $of$ $Crawling$
* Academic and Scientific Research:

  In research, crawling can be used to collect data for academic analysis, market research, or for understanding trends across fields of study.
* Application and Service Development:

  Developers often use web crawling to collect the data needed to build applications and services such as news aggregators, information portals, or price comparison services.

* Social Media Monitoring:

  Web crawling can be used to monitor and analyze data from social media platforms to understand user sentiment, trends, or for research purposes.

## $Implementation$
Crawl news titles and article bodies from https://www.bisnis.com/

## $Sports$

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import date, timedelta

def get_article_content(article_url):
    response = requests.get(article_url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        article = soup.find('article', class_='detailsContent force-17 mt40')
        if article is None:
            # Page layout does not match the selector; treat as no content
            return ""
        content = '\n'.join(p.get_text() for p in article.find_all('p'))
        return content
    return ""

def scrape_news_data(start_date, end_date):
    base_url = "https://www.bisnis.com/index?c=392&d={}"
    data = []

    current_date = start_date
    while current_date <= end_date:
        url = base_url.format(current_date.strftime('%Y-%m-%d'))
        response = requests.get(url)

        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            news_elements = soup.find_all('div', class_='col-sm-8')

            for element in news_elements:
                title = element.find('h2').a.text.strip()
                article_url = element.find('h2').a['href']
                content = get_article_content(article_url)

                data.append({'Date': current_date.strftime('%d-%m-%Y'), 'Title': title, 'Content': content})

        else:
            print(f"Failed to fetch data for {current_date.strftime('%d-%m-%Y')}")

        # Advance on both paths; incrementing only on success would retry a failing date forever
        current_date += timedelta(days=1)

    return data

start_date = date(2023, 10, 1)
end_date = date(2023, 11, 1)

news_data = scrape_news_data(start_date, end_date)
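The date loop in `scrape_news_data` requests one index page per day. A minimal sketch of just that URL-building step is shown here, with no network calls. The helper name `daily_index_urls` and its `channel_id` parameter are hypothetical; the URL pattern and the channel number 392 are taken from the cell above.

```python
from datetime import date, timedelta


def daily_index_urls(start_date, end_date, channel_id):
    """Yield (date, url) pairs for every day in the inclusive range."""
    base_url = "https://www.bisnis.com/index?c={}&d={}"
    current = start_date
    while current <= end_date:
        yield current, base_url.format(channel_id, current.strftime("%Y-%m-%d"))
        current += timedelta(days=1)  # always advance, whatever happens to the request


urls = list(daily_index_urls(date(2023, 10, 1), date(2023, 10, 3), 392))
for day, url in urls:
    print(day.strftime("%d-%m-%Y"), url)
```

Separating URL generation from fetching makes it easy to check the date range before starting a long scrape.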

In [None]:
# Create a pandas DataFrame from the scraped data
df = pd.DataFrame(news_data)

In [None]:
df['Title'] = df['Title'].str.replace('\n', '')

In [None]:
df['Label'] = 'Sport'


In [None]:
df

Unnamed: 0,Date,Title,Content,Label
0,01-10-2023,Prediksi Skor Atalanta vs Juventus: Head to He...,"Bisnis.com, SOLO - Duel Atalanta vs Juventus a...",Sport
1,01-10-2023,Hasil Drawing dan Jadwal Bulu Tangkis Asian Ga...,"Bisnis.com, SOLO - Hasil drawing dan jadwal bu...",Sport
2,01-10-2023,"Naik Podium MotoGP Setelah Setahun Absen, Marc...","Bisnis.com, JAKARTA - Pembalap Repsol Honda, M...",Sport
3,01-10-2023,"Prediksi Skor Persib vs Persita: Head to Head,...","Bisnis.com, SOLO - Persib vs Persita akan menj...",Sport
4,01-10-2023,"Prediksi Persib vs Persita, Pelatih Minta Maun...","Bisnis.com, JAKARTA - Jelang Persib vs Persita...",Sport
...,...,...,...,...
389,01-11-2023,Demam Megawati di Korea Bikin Ofisial Red Spar...,"Bisnis.com, SOLO - Dunia voli Korea Selatan di...",Sport
390,01-11-2023,Jadwal Hylo Open Hari Ini: PraMel Lawan Christ...,"Bisnis.com, SOLO - Jadwal Hylo Open 2023 akan ...",Sport
391,01-11-2023,Jadwal Liga 1 Pekan 18: Madura United vs Persi...,"Bisnis.com, SOLO - Jadwal Liga 1 2023-2024 pek...",Sport
392,01-11-2023,Jadwal Piala Liga Inggris: Manchester United v...,"Bisnis.com, SOLO - Jadwal Piala Liga Inggris a...",Sport


In [None]:
# Save the DataFrame to a CSV file
df.to_csv('sports.csv', index=False)

## $Jakarta$

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import date, timedelta

def get_article_content(article_url):
    response = requests.get(article_url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        article = soup.find('article', class_='detailsContent force-17 mt40')
        if article is None:
            # Page layout does not match the selector; treat as no content
            return ""
        content = '\n'.join(p.get_text() for p in article.find_all('p'))
        return content
    return ""

def scrape_news_data(start_date, end_date):
    base_url = "https://www.bisnis.com/index?c=382&d={}"
    data = []

    current_date = start_date
    while current_date <= end_date:
        url = base_url.format(current_date.strftime('%Y-%m-%d'))
        response = requests.get(url)

        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            news_elements = soup.find_all('div', class_='col-sm-8')

            for element in news_elements:
                title = element.find('h2').a.text.strip()
                article_url = element.find('h2').a['href']
                content = get_article_content(article_url)

                data.append({'Date': current_date.strftime('%d-%m-%Y'), 'Title': title, 'Content': content})

        else:
            print(f"Failed to fetch data for {current_date.strftime('%d-%m-%Y')}")

        # Advance on both paths; incrementing only on success would retry a failing date forever
        current_date += timedelta(days=1)

    return data

start_date = date(2023, 8, 1)
end_date = date(2023, 12, 12)

news_data_1= scrape_news_data(start_date, end_date)

In [None]:
# Create a pandas DataFrame from the scraped data
df2 = pd.DataFrame(news_data_1)

# Save the DataFrame to a CSV file
df2.to_csv('jakarta.csv', index=False)

In [None]:
df2['Title'] = df2['Title'].str.replace('\t', '')
df2['Content'] = df2['Content'].str.replace('\n', '')
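Replacing `'\n'` with an empty string joins the last word of one paragraph to the first word of the next whenever no other separator sits between them. The sketch below uses a made-up string to compare that approach with collapsing whitespace runs into a single space. It is an alternative to consider, not what this notebook does.

```python
import re

# Hypothetical text with a newline between paragraphs and stray tabs
raw = "First paragraph\nSecond paragraph\t\tdone"

# Deleting the whitespace characters glues neighbouring words together
deleted = raw.replace("\n", "").replace("\t", "")

# Replacing every whitespace run with one space keeps word boundaries
collapsed = re.sub(r"\s+", " ", raw).strip()

print(deleted)
print(collapsed)
```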

In [None]:
df2['Label'] = 'Jakarta'

In [None]:
df2['Title'] = df2['Title'].str.replace('\n', '')
df2

Unnamed: 0,Date,Title,Content,Label
0,01-08-2023,"Gelar Forum Gubernur dan Wali Kota se-Asean, P...","Bisnis.com, JAKARTA - Pemerintah Provinsi (Pem...",Jakarta
1,01-08-2023,Heru Budi: MGMAC AMF 2023 se-Asean Diinisiasi ...,"Bisnis.com, JAKARTA — Penjabat (Pj) Gubernur D...",Jakarta
2,01-08-2023,Heru Budi Tunjuk Agus Himawan Jadi Dirut Pasar...,"Bisnis.com, JAKARTA — Penjabat (Pj) Gubernur D...",Jakarta
3,01-08-2023,Heru Budi Angkat Mantan Dirut Sarana Jaya Jadi...,"Bisnis.com, JAKARTA — Penjabat (Pj) Gubernur D...",Jakarta
4,01-08-2023,"Tak Hanya Cabut KJP, DPRD DKI Minta Pemprov Bi...","Bisnis.com, JAKARTA —DPRD DKI Jakarta meminta ...",Jakarta
...,...,...,...,...
465,05-12-2023,Draf RUU DKJ: Gubernur Jakarta Dipilih Preside...,"Bisnis.com, JAKARTA - Pasal 10 ayat (2) draf R...",Jakarta
466,05-12-2023,Lalu Lintas Menuju Slipi Dialihkan Imbas Demo ...,"Bisnis.com, JAKARTA - Polisi menyarankan masya...",Jakarta
467,09-12-2023,"Hadirkan Lebih Banyak Area Hijau, AEON Mall De...","Bisnis.com, JAKARTA - AEON Mall Deltamas menye...",Jakarta
468,09-12-2023,"Cuaca Hari Ini, 9 Desember, Jakarta Diguyur Hu...","Bisnis.com, JAKARTA – Badan Meteorologi Klimat...",Jakarta


In [None]:
df2.to_csv('bisnis_jakarta_news.csv', index=False)

## $Surabaya$

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from datetime import date, timedelta

def get_article_content(article_url):
    response = requests.get(article_url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        article = soup.find('article', class_='detailsContent force-17 mt40')
        if article is None:
            # Page layout does not match the selector; treat as no content
            return ""
        content = '\n'.join(p.get_text() for p in article.find_all('p'))
        return content
    return ""

def scrape_news_data(start_date, end_date):
    base_url = "https://www.bisnis.com/index?c=526&d={}"
    data = []

    current_date = start_date
    while current_date <= end_date:
        url = base_url.format(current_date.strftime('%Y-%m-%d'))
        response = requests.get(url)

        if response.status_code == 200:
            soup = BeautifulSoup(response.text, 'html.parser')
            news_elements = soup.find_all('div', class_='col-sm-8')

            for element in news_elements:
                title = element.find('h2').a.text.strip()
                article_url = element.find('h2').a['href']
                content = get_article_content(article_url)

                data.append({'Date': current_date.strftime('%d-%m-%Y'), 'Title': title, 'Content': content})

        else:
            print(f"Failed to fetch data for {current_date.strftime('%d-%m-%Y')}")

        # Advance on both paths; incrementing only on success would retry a failing date forever
        current_date += timedelta(days=1)

    return data

start_date = date(2023, 8, 1)
end_date = date(2023, 11, 1)

news_data_2= scrape_news_data(start_date, end_date)

In [None]:
# Create a pandas DataFrame from the scraped data
df3 = pd.DataFrame(news_data_2)

In [None]:
df3['Title'] = df3['Title'].str.replace('\t', '')
df3['Content'] = df3['Content'].str.replace('\n', '')

In [None]:
df3['Label'] = 'Surabaya'

In [None]:
df3['Title'] = df3['Title'].str.replace('\n', '')
df3

Unnamed: 0,Date,Title,Content,Label
0,01-08-2023,Pelindo Terminal Petikemas Perluas Area Kerja ...,"Bisnis.com, SURABAYA - PT Pelindo Terminal Pet...",Surabaya
1,01-08-2023,"Pemutihan Pajak di Jatim, 1,18 Juta Kendaraan ...","Bisnis.com, SURABAYA - Pemerintah Provinsi Jaw...",Surabaya
2,01-08-2023,Biaya Pendidikan Dorong Laju Inflasi di Jatim ...,"Bisnis.com, SURABAYA - Provinsi Jawa Timur pad...",Surabaya
3,01-08-2023,Latihan Gabungan TNI di Situbondo Mengesankan ...,"Bisnis.com, SITUBONDO - Menteri Koordinator Po...",Surabaya
4,01-08-2023,Petani Garam Berharap Ada Stabilisasi Harga Sa...,"Bisnis.com, PAMEKASAN - Para petani garam di P...",Surabaya
...,...,...,...,...
401,01-11-2023,"Kemiskinan di Kota Malang Turun Jadi 4,26% Tah...","Bisnis.com, MALANG — Kemiskinan di Kota Malang...",Surabaya
402,01-11-2023,Tingkat Penghunian Kamar Hotel di Malang Tembu...,"Bisnis.com, MALANG — Tingkat penghunian kamar ...",Surabaya
403,01-11-2023,Beras Masih Jadi Penyumbang Utama Inflasi di K...,"Bisnis.com, MALANG — Beras masih menjadi penyu...",Surabaya
404,01-11-2023,Pertamina Catat Ada 32 Kasus Pidana Penyalahgu...,"Bisnis.com, SURABAYA — Pertamina Patra Niaga J...",Surabaya


In [None]:
# Save the DataFrame to a CSV file
df3.to_csv('surabaya.csv', index=False)

## $Gabungkan Data$

In [None]:
combined_df = pd.concat([df, df2, df3], ignore_index=True)
combined_df

Unnamed: 0,Date,Title,Content,Label
0,01-10-2023,Prediksi Skor Atalanta vs Juventus: Head to He...,"Bisnis.com, SOLO - Duel Atalanta vs Juventus a...",Sport
1,01-10-2023,Hasil Drawing dan Jadwal Bulu Tangkis Asian Ga...,"Bisnis.com, SOLO - Hasil drawing dan jadwal bu...",Sport
2,01-10-2023,"Naik Podium MotoGP Setelah Setahun Absen, Marc...","Bisnis.com, JAKARTA - Pembalap Repsol Honda, M...",Sport
3,01-10-2023,"Prediksi Skor Persib vs Persita: Head to Head,...","Bisnis.com, SOLO - Persib vs Persita akan menj...",Sport
4,01-10-2023,"Prediksi Persib vs Persita, Pelatih Minta Maun...","Bisnis.com, JAKARTA - Jelang Persib vs Persita...",Sport
...,...,...,...,...
1265,01-11-2023,"Kemiskinan di Kota Malang Turun Jadi 4,26% Tah...","Bisnis.com, MALANG — Kemiskinan di Kota Malang...",Surabaya
1266,01-11-2023,Tingkat Penghunian Kamar Hotel di Malang Tembu...,"Bisnis.com, MALANG — Tingkat penghunian kamar ...",Surabaya
1267,01-11-2023,Beras Masih Jadi Penyumbang Utama Inflasi di K...,"Bisnis.com, MALANG — Beras masih menjadi penyu...",Surabaya
1268,01-11-2023,Pertamina Catat Ada 32 Kasus Pidana Penyalahgu...,"Bisnis.com, SURABAYA — Pertamina Patra Niaga J...",Surabaya


In [None]:
combined_df['Label'].value_counts()

Jakarta     470
Surabaya    406
Sport       394
Name: Label, dtype: int64

In [None]:
# Save the DataFrame to a CSV file
combined_df.to_csv('berita.csv', index=False)

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import pandas as pd


In [3]:
berita = pd.read_csv("/content/drive/MyDrive/ppw/tugas/tugas-ppw/data_uas/berita.csv")
berita

Unnamed: 0,Date,Title,Content,Label
0,01-10-2023,Prediksi Skor Atalanta vs Juventus: Head to He...,"Bisnis.com, SOLO - Duel Atalanta vs Juventus a...",Sport
1,01-10-2023,Hasil Drawing dan Jadwal Bulu Tangkis Asian Ga...,"Bisnis.com, SOLO - Hasil drawing dan jadwal bu...",Sport
2,01-10-2023,"Naik Podium MotoGP Setelah Setahun Absen, Marc...","Bisnis.com, JAKARTA - Pembalap Repsol Honda, M...",Sport
3,01-10-2023,"Prediksi Skor Persib vs Persita: Head to Head,...","Bisnis.com, SOLO - Persib vs Persita akan menj...",Sport
4,01-10-2023,"Prediksi Persib vs Persita, Pelatih Minta Maun...","Bisnis.com, JAKARTA - Jelang Persib vs Persita...",Sport
...,...,...,...,...
1265,01-11-2023,"Kemiskinan di Kota Malang Turun Jadi 4,26% Tah...","Bisnis.com, MALANG — Kemiskinan di Kota Malang...",Surabaya
1266,01-11-2023,Tingkat Penghunian Kamar Hotel di Malang Tembu...,"Bisnis.com, MALANG — Tingkat penghunian kamar ...",Surabaya
1267,01-11-2023,Beras Masih Jadi Penyumbang Utama Inflasi di K...,"Bisnis.com, MALANG — Beras masih menjadi penyu...",Surabaya
1268,01-11-2023,Pertamina Catat Ada 32 Kasus Pidana Penyalahgu...,"Bisnis.com, SURABAYA — Pertamina Patra Niaga J...",Surabaya


Removing punctuation

In [4]:
#import library
import pandas as pd
import numpy as np

In [5]:
!pip install Sastrawi


Collecting Sastrawi
  Downloading Sastrawi-1.0.1-py2.py3-none-any.whl (209 kB)
Installing collected packages: Sastrawi
Successfully installed Sastrawi-1.0.1


In [6]:
from nltk.corpus import stopwords
import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [7]:
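# Remove punctuation
# Raw strings avoid invalid-escape warnings for \S and \/
clean_tag = re.compile(r'@\S+')
# Note: '.*' runs to the end of the line, so everything after a URL on that line is dropped too
clean_url = re.compile(r'https?:\/\/.*[\r\n]*')
clean_hastag = re.compile(r'#\S+')
clean_symbol = re.compile(r'[^a-zA-Z]')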
def clean_punct(text):
    text = clean_tag.sub('', str(text))
    text = clean_url.sub('', text)
    text = clean_hastag.sub(' ', text)
    text = clean_symbol.sub(' ', text)
    return text
# Apply the punctuation removal to the Content column and keep the result separately
preprocessing = berita['Content'].apply(clean_punct)
clean=pd.DataFrame(preprocessing)
clean

Unnamed: 0,Content
0,Bisnis com SOLO Duel Atalanta vs Juventus a...
1,Bisnis com SOLO Hasil drawing dan jadwal bu...
2,Bisnis com JAKARTA Pembalap Repsol Honda M...
3,Bisnis com SOLO Persib vs Persita akan menj...
4,Bisnis com JAKARTA Jelang Persib vs Persita...
...,...
1265,Bisnis com MALANG Kemiskinan di Kota Malang...
1266,Bisnis com MALANG Tingkat penghunian kamar ...
1267,Bisnis com MALANG Beras masih menjadi penyu...
1268,Bisnis com SURABAYA Pertamina Patra Niaga J...
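To make the effect of the four patterns concrete, the sketch below re-declares them and runs `clean_punct` on two made-up strings (the sample sentences are assumptions, not rows from `berita`). Digits and punctuation turn into spaces, and a URL removes the rest of its line.

```python
import re

clean_tag = re.compile(r'@\S+')
clean_url = re.compile(r'https?:\/\/.*[\r\n]*')
clean_hastag = re.compile(r'#\S+')
clean_symbol = re.compile(r'[^a-zA-Z]')


def clean_punct(text):
    text = clean_tag.sub('', str(text))      # drop @mentions
    text = clean_url.sub('', text)           # drop URLs and the rest of that line
    text = clean_hastag.sub(' ', text)       # drop #hashtags
    text = clean_symbol.sub(' ', text)       # every non-letter becomes a space
    return text


sample = "Bisnis.com, SOLO - Skor 2-1 untuk @timA #liga"
cleaned = clean_punct(sample)
print(cleaned.split())

with_url = clean_punct("lihat https://example.com/a dan lainnya")
print(with_url.split())
```

Because every removed character becomes a space, the cleaned text contains runs of spaces. That is harmless here, since the next step splits the text into tokens.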


## Tokenization

In [8]:
data_clean=[]
for i in range(len(preprocessing)):
  data_clean.append(preprocessing[i])

In [9]:
tokenize=[]
for i in range(len(data_clean)):
  token=word_tokenize(data_clean[i])
  tokendata = []
  for x in token :
    tokendata.append(x)
  tokenize.append(tokendata)
  print(tokendata)

['Bisnis', 'com', 'SOLO', 'Duel', 'Atalanta', 'vs', 'Juventus', 'akan', 'tersaji', 'dalam', 'lanjutan', 'Liga', 'Italia', 'Prediksi', 'skor', 'mengunggulkan', 'Juventus', 'meraih', 'kemenangan', 'Juventus', 'akan', 'bertandang', 'ke', 'markas', 'Atalanta', 'dalam', 'giornata', 'ketujuh', 'Liga', 'Italia', 'Minggu', 'malam', 'La', 'Vecchia', 'Signora', 'kini', 'tertinggal', 'lima', 'angka', 'dari', 'pemuncak', 'klasemen', 'setelah', 'duo', 'Milan', 'meraih', 'kemenangan', 'pekan', 'ini', 'Maka', 'dari', 'itu', 'mau', 'tak', 'mau', 'Juventus', 'wajib', 'menang', 'untuk', 'memangkas', 'defisit', 'poin', 'dari', 'tim', 'di', 'atasnya', 'Laga', 'Atalanta', 'lawan', 'Juventus', 'merupakan', 'duel', 'penting', 'karena', 'kedua', 'tim', 'berada', 'di', 'posisi', 'yang', 'berdekatan', 'dalam', 'tabel', 'klasemen', 'Liga', 'Italia', 'Juventus', 'menduduki', 'peringkat', 'keempat', 'dengan', 'poin', 'sedangkan', 'Atalanta', 'tepat', 'di', 'bawahnya', 'dengan', 'angka', 'Baca', 'Juga', 'Pertanding

## Stopwords Removal

In [10]:
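stopword=[]
# Build the stopword set once instead of on every iteration
listStopword =  set(stopwords.words('indonesian'))
for i in range(len(tokenize)):
  removed=[]
  for x in (tokenize[i]):
    # Case-sensitive comparison: capitalized tokens (e.g. 'Maka' in the output below)
    # are not matched against the lowercase stopword list
    if x not in listStopword:
       removed.append(x)
  stopword.append(removed)
  print(removed)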

['Bisnis', 'com', 'SOLO', 'Duel', 'Atalanta', 'vs', 'Juventus', 'tersaji', 'lanjutan', 'Liga', 'Italia', 'Prediksi', 'skor', 'mengunggulkan', 'Juventus', 'meraih', 'kemenangan', 'Juventus', 'bertandang', 'markas', 'Atalanta', 'giornata', 'ketujuh', 'Liga', 'Italia', 'Minggu', 'malam', 'La', 'Vecchia', 'Signora', 'tertinggal', 'angka', 'pemuncak', 'klasemen', 'duo', 'Milan', 'meraih', 'kemenangan', 'pekan', 'Maka', 'Juventus', 'wajib', 'menang', 'memangkas', 'defisit', 'poin', 'tim', 'atasnya', 'Laga', 'Atalanta', 'lawan', 'Juventus', 'duel', 'tim', 'posisi', 'berdekatan', 'tabel', 'klasemen', 'Liga', 'Italia', 'Juventus', 'menduduki', 'peringkat', 'keempat', 'poin', 'Atalanta', 'bawahnya', 'angka', 'Baca', 'Juga', 'Pertandingan', 'melawan', 'La', 'Dea', 'menjaga', 'kans', 'Juventus', 'bersaing', 'papan', 'menembus', 'posisi', 'Namun', 'Atalanta', 'lawan', 'mudah', 'ditaklukkan', 'Juventus', 'Pasukan', 'Gian', 'Piero', 'Gasperini', 'perlawanan', 'Juventus', 'pulang', 'poin', 'Hingga', '
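Tokens such as 'Maka', 'Juga', and 'Namun' still appear in the output above, which is consistent with the comparison being case-sensitive. The sketch below illustrates the difference with a tiny, hypothetical stopword set standing in for NLTK's Indonesian list. Lowercasing before the lookup is one possible adjustment, not something the notebook does.

```python
# Hypothetical, tiny stopword set; the notebook uses NLTK's Indonesian list instead
stopword_set = {"akan", "dalam", "maka", "dari", "itu"}

tokens = ["Maka", "dari", "itu", "Juventus", "akan", "menang"]

# Compare tokens exactly as written: the capitalized 'Maka' survives
case_sensitive = [t for t in tokens if t not in stopword_set]

# Lowercase only for the lookup: 'Maka' is removed, original casing of kept tokens is preserved
case_folded = [t for t in tokens if t.lower() not in stopword_set]

print(case_sensitive)
print(case_folded)
```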

In [11]:
join=[]
for i in range(len(stopword)):
  joinkata = ' '.join(stopword[i])
  join.append(joinkata)

hasilpreproses = pd.DataFrame(join, columns=['Content'])
hasilpreproses

Unnamed: 0,Content
0,Bisnis com SOLO Duel Atalanta vs Juventus ters...
1,Bisnis com SOLO Hasil drawing jadwal bulu tang...
2,Bisnis com JAKARTA Pembalap Repsol Honda Marc ...
3,Bisnis com SOLO Persib vs Persita duel penutup...
4,Bisnis com JAKARTA Jelang Persib vs Persita pe...
...,...
1265,Bisnis com MALANG Kemiskinan Kota Malang menga...
1266,Bisnis com MALANG Tingkat penghunian kamar TPK...
1267,Bisnis com MALANG Beras penyumbang utama infla...
1268,Bisnis com SURABAYA Pertamina Patra Niaga Jati...


## Term Frequency

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

In [13]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(np.array(hasilpreproses['Content']))

In [14]:
tf = vectorizer.get_feature_names_out()
tf

array(['aaci', 'aakarshi', 'aan', ..., 'zumba', 'zumrotul', 'zvezda'],
      dtype=object)

In [15]:
tf_array = X.toarray()
print(tf_array)

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
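The matrix above has one row per article and one column per vocabulary term. The sketch below builds the same kind of matrix by hand for three made-up documents, using only the standard library. It is a simplified stand-in for `CountVectorizer`, which by default also lowercases text and ignores single-character tokens.

```python
from collections import Counter

# Hypothetical mini-corpus
docs = [
    "juventus menang juventus",
    "inflasi jatim naik",
    "juventus kalah",
]

# Sorted vocabulary, one column per distinct term
vocab = sorted({term for doc in docs for term in doc.split()})

# One row per document, one count per vocabulary term
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[term] for term in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Most entries are zero because each document uses only a small part of the vocabulary, which is why the printed array for the real corpus shows almost nothing but zeros.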


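In [17]:
# berita_tf is not defined in any cell shown here; it is assumed to be the
# term-frequency matrix as a DataFrame with one column per term
berita_tf = pd.DataFrame(tf_array, columns=tf)

# Columns to drop: everything from position 8057 onward
kolom_yang_dihapus = berita_tf.columns[8057:]

# Build a new DataFrame that keeps only the first 8057 term columns
berita_pangkas1= berita_tf.drop(columns=kolom_yang_dihapus)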


In [18]:
berita_pangkas1

Unnamed: 0,aaci,aakarshi,aan,aaron,aat,aau,aba,abad,abadinya,abai,...,kedudukan,kedungdoro,kedungkandang,kedungwaru,kedutaan,keefektifan,keekonomisannya,keempat,keenam,keesokan
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,1,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1265,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1266,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1267,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1268,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [31]:
berita_pangkas1.to_csv("/content/drive/MyDrive/ppw/tugas/tugas-ppw/berita_pangkas1.csv")

In [19]:
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np

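lda_results = []

# Fit one LDA model per topic count from 1 to 50 and keep every document-topic matrix.
# max_iter=1 limits each fit to a single pass over the data, so the topics are only roughly estimated.
for n in range(1, 51):
    lda = LatentDirichletAllocation(n_components=n, doc_topic_prior=0.2, topic_word_prior=0.1, random_state=42, max_iter=1)
    lda_top = lda.fit_transform(berita_pangkas1)
    lda_results.append(lda_top)

# After the loop, lda_top holds the result of the last model (n_components=50)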

In [20]:
n_components = 50
column_names = [f'Topik {i+1}' for i in range(n_components)]
topik50 = pd.DataFrame(lda_top, columns=column_names)

In [21]:
topik50

Unnamed: 0,Topik 1,Topik 2,Topik 3,Topik 4,Topik 5,Topik 6,Topik 7,Topik 8,Topik 9,Topik 10,...,Topik 41,Topik 42,Topik 43,Topik 44,Topik 45,Topik 46,Topik 47,Topik 48,Topik 49,Topik 50
0,0.001294,0.001299,0.001299,0.001304,0.001301,0.001303,0.001305,0.001294,0.001298,0.001299,...,0.001307,0.001297,0.001308,0.001293,0.001296,0.001310,0.001305,0.001309,0.936058,0.001306
1,0.002281,0.002294,0.002286,0.002300,0.002298,0.002304,0.002301,0.002288,0.002287,0.002286,...,0.002307,0.002285,0.002295,0.002284,0.139949,0.002306,0.002333,0.002308,0.002285,0.002296
2,0.001823,0.001830,0.001838,0.001828,0.001829,0.001827,0.001822,0.001826,0.001828,0.001827,...,0.001835,0.001823,0.001832,0.001820,0.001822,0.001832,0.001835,0.001834,0.001826,0.001832
3,0.002278,0.002312,0.002291,0.002285,0.887520,0.002281,0.002282,0.002290,0.002286,0.002285,...,0.002292,0.002280,0.002296,0.002279,0.002284,0.002311,0.002318,0.002314,0.002294,0.002293
4,0.001948,0.001996,0.001966,0.001955,0.317801,0.001950,0.001955,0.001953,0.001969,0.001950,...,0.001968,0.001955,0.001983,0.001949,0.001947,0.001995,0.001988,0.001983,0.001961,0.001968
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1265,0.003011,0.003012,0.003051,0.003026,0.002998,0.082669,0.002990,0.003004,0.003011,0.003006,...,0.003028,0.003089,0.066326,0.002995,0.002996,0.003010,0.003012,0.003065,0.002999,0.003016
1266,0.003011,0.003007,0.003021,0.003017,0.003001,0.255319,0.002990,0.003027,0.003001,0.003008,...,0.003018,0.003033,0.003069,0.002995,0.002997,0.003019,0.003022,0.003039,0.002998,0.003018
1267,0.001704,0.001712,0.001725,0.001710,0.001703,0.001752,0.001710,0.001710,0.001703,0.001720,...,0.001736,0.001743,0.001739,0.001709,0.001703,0.001705,0.001717,0.001745,0.001702,0.001717
1268,0.001799,0.001792,0.001794,0.001793,0.001792,0.001789,0.001780,0.001779,0.001784,0.001788,...,0.001790,0.001788,0.001807,0.001780,0.001781,0.001794,0.001786,0.001799,0.001781,0.001797
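Each row of `topik50` is a distribution over topics, so one quick way to read it is to look for the topic with the largest weight in each row. The sketch below does this for a small, hypothetical 3-topic matrix. The values are invented for illustration and `dominant_topics` is a helper name introduced here.

```python
# Hypothetical document-topic rows (3 topics) standing in for rows of `topik50`
doc_topic = [
    [0.90, 0.05, 0.05],
    [0.10, 0.20, 0.70],
    [0.34, 0.33, 0.33],
]


def dominant_topics(rows):
    """Return the 1-based index of the largest weight in each row."""
    return [max(range(len(row)), key=row.__getitem__) + 1 for row in rows]


for i, (row, topic) in enumerate(zip(doc_topic, dominant_topics(doc_topic))):
    print(f"doc {i}: Topik {topic}, weight {row[topic - 1]:.2f}, row sum {sum(row):.2f}")
```

A row like the third one, where no topic clearly dominates, gives a classifier little to work with. That can happen when the topic model is fitted with very few iterations.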


In [22]:
berita = pd.concat([topik50, berita['Label']], axis=1)

print(berita)

       Topik 1   Topik 2   Topik 3   Topik 4   Topik 5   Topik 6   Topik 7  \
0     0.001294  0.001299  0.001299  0.001304  0.001301  0.001303  0.001305   
1     0.002281  0.002294  0.002286  0.002300  0.002298  0.002304  0.002301   
2     0.001823  0.001830  0.001838  0.001828  0.001829  0.001827  0.001822   
3     0.002278  0.002312  0.002291  0.002285  0.887520  0.002281  0.002282   
4     0.001948  0.001996  0.001966  0.001955  0.317801  0.001950  0.001955   
...        ...       ...       ...       ...       ...       ...       ...   
1265  0.003011  0.003012  0.003051  0.003026  0.002998  0.082669  0.002990   
1266  0.003011  0.003007  0.003021  0.003017  0.003001  0.255319  0.002990   
1267  0.001704  0.001712  0.001725  0.001710  0.001703  0.001752  0.001710   
1268  0.001799  0.001792  0.001794  0.001793  0.001792  0.001789  0.001780   
1269  0.002706  0.002688  0.002709  0.002703  0.002688  0.067872  0.002674   

       Topik 8   Topik 9  Topik 10  ...  Topik 42  Topik 43  To

In [23]:
berita= berita.dropna()

In [29]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Separate features and labels from the combined DataFrame
X = berita.drop(columns=['Label']).values
y = berita['Label'].values

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Naive Bayes model
model = MultinomialNB()

# Train the Naive Bayes model on the training data
model.fit(X_train, y_train)

# Predict class labels for the test data
y_pred = model.predict(X_test)

# Measure model accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))

Accuracy: 74.80%
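The accuracy reported here is the share of test articles whose predicted label equals the true label. A minimal hand-written version is sketched below on made-up labels. The `accuracy` helper is hypothetical and only mirrors what `accuracy_score` computes in this simple case.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels: three of the four predictions are right
y_true = ["Sport", "Jakarta", "Surabaya", "Jakarta"]
y_pred = ["Sport", "Jakarta", "Jakarta", "Jakarta"]

print("Accuracy: {:.2f}%".format(accuracy(y_true, y_pred) * 100))
```

Accuracy alone does not show which classes are confused with each other. In this toy case the only error is a Surabaya article labelled as Jakarta.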


In [30]:
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Separate features and labels from the combined DataFrame
X = berita.drop(columns=['Label']).values
y = berita['Label'].values

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the KNN model with n_neighbors=3
model = KNeighborsClassifier(n_neighbors=3)

# Train the KNN model on the training data
model.fit(X_train, y_train)

# Predict class labels for the test data
y_pred = model.predict(X_test)

# Measure the accuracy of the KNN model
accuracy = accuracy_score(y_test, y_pred)
print("KNN accuracy: {:.2f}%".format(accuracy * 100))


KNN accuracy: 72.05%


In [31]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Separate features and labels from the combined DataFrame
X = berita.drop(columns=['Label']).values
y = berita['Label'].values

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Decision Tree model
model = DecisionTreeClassifier()

# Train the Decision Tree model on the training data
model.fit(X_train, y_train)

# Predict class labels for the test data
y_pred = model.predict(X_test)

# Measure the accuracy of the Decision Tree model
accuracy = accuracy_score(y_test, y_pred)
print("Decision Tree accuracy: {:.2f}%".format(accuracy * 100))


Decision Tree accuracy: 75.59%


In [32]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Separate features and labels from the combined DataFrame
X = berita.drop(columns=['Label']).values
y = berita['Label'].values

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the Random Forest model
model = RandomForestClassifier()

# Train the Random Forest model on the training data
model.fit(X_train, y_train)

# Predict class labels for the test data
y_pred = model.predict(X_test)

# Measure the accuracy of the Random Forest model
accuracy = accuracy_score(y_test, y_pred)
print("Random Forest accuracy: {:.2f}%".format(accuracy * 100))


Random Forest accuracy: 83.86%


In [33]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Separate features and labels from the combined DataFrame
X = berita.drop(columns=['Label']).values
y = berita['Label'].values

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the SVM model
model = SVC()

# Train the SVM model on the training data
model.fit(X_train, y_train)

# Predict class labels for the test data
y_pred = model.predict(X_test)

# Measure the accuracy of the SVM model
accuracy = accuracy_score(y_test, y_pred)
print("SVM accuracy: {:.2f}%".format(accuracy * 100))


SVM accuracy: 73.23%
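The five cells above repeat the same split and differ only in the estimator, so the comparison can also be written as one loop over a dictionary of models. The sketch below shows that pattern on synthetic data: `X` and `y` are generated stand-ins (Dirichlet-distributed rows with invented labels), not the notebook's topic features, so its accuracies say nothing about the results reported above. The decision tree and random forest get a fixed `random_state` here, which the original cells do not set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for topic features: non-negative rows that sum to 1
rng = np.random.default_rng(42)
X = rng.dirichlet(alpha=[0.2] * 10, size=300)

# Hypothetical labels: whichever group of topics carries the most weight
group_weights = [X[:, :3].sum(axis=1), X[:, 3:6].sum(axis=1), X[:, 6:].sum(axis=1)]
y = np.array(["Sport", "Jakarta", "Surabaya"])[np.argmax(group_weights, axis=0)]

# One shared split, so every model is scored on the same test rows
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "Naive Bayes": MultinomialNB(),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVM": SVC(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))
    print("{}: {:.2f}%".format(name, results[name] * 100))
```

Fixing `random_state` for the tree-based models makes their scores reproducible between runs.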
