# Latent Semantic Analysis (LSA)

# Crawling News Data

Before crawling, make sure the Scrapy library for Python is installed. If it is not, install it by running "pip install Scrapy" in a command prompt.

## First crawl

The first crawl collects the article links from the news index pages. To run it:
1. Create a Python file (.py), for example "crawling1.py".
2. Copy and paste the code below into it. (You can modify the code to match the news site you want to crawl.)
3. Run the file with "scrapy runspider crawling1.py -O link.csv". Here "link.csv" is the output file; because its extension is .csv, the results are written in CSV format.

In [1]:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        # Request the first five pages of the Kompas news index.
        for page in range(1, 6):
            url = 'https://indeks.kompas.com/?site=news&page=' + str(page)
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Check the first 15 positions of the article list and yield each link found.
        for i in range(1, 16):
            link = response.css('body > div.wrap > div.container.clearfix > div:nth-child(3) > div.col-bs10-7 > div.latest--indeks.mt2.clearfix > div:nth-child(' + str(i) + ') > div.article__list__title > h3 > a::attr(href)').get()
            if link:  # skip positions that hold no article link
                yield {'link': link}

## Second crawl

The second crawl starts from the article links produced by the first crawl and exported to CSV. The file is read with pandas, the links are collected into a list, and each link is then crawled in turn.
This time the spider visits each news article directly to get its title (judul), label, and body text (isi).
Run this file the same way as the first one, adjusting the file names, for example "scrapy runspider crawling2.py -O isi_berita.csv".

In [2]:
import scrapy
import pandas as pd


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        # Read the links exported by the first crawl, skipping empty rows.
        dataCSV = pd.read_csv('link.csv')
        arrayData = dataCSV.iloc[:, 0].dropna().tolist()

        for url in arrayData:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Extract the title (judul), category label, and body paragraphs (isi).
        yield {
            'judul': response.css('body > div.wrap > div.container.clearfix > div:nth-child(3) > div > h1::text').extract(),
            'label': response.css('body > div.wrap > div.container.clearfix > div:nth-child(3) > div > h3 > ul > li:nth-child(3) > a > span::text').extract(),
            'isi': response.css('body > div.wrap > div.container.clearfix > div.row.col-offset-fluid.clearfix.js-giant-wp-sticky-parent > div.col-bs10-7.js-read-article > div.read__article.mt2.clearfix.js-tower-sticky-parent > div.col-bs9-7 > div.read__content > div > p::text').extract(),
        }
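
The link-loading step of the second spider can also be sketched with the standard library alone. The helper name `load_links` and the example URLs are assumptions made for this illustration; the spider above uses pandas instead. The sketch skips empty rows and repeated links so that each article is requested once.

```python
import csv
import io

def load_links(csv_file):
    """Return the non-empty values of the 'link' column, without duplicates, in file order."""
    links = []
    seen = set()
    for row in csv.DictReader(csv_file):
        link = (row.get("link") or "").strip()
        if link and link not in seen:
            seen.add(link)
            links.append(link)
    return links

# Hypothetical file content standing in for link.csv.
sample = io.StringIO(
    "link\nhttps://example.com/a\n\nhttps://example.com/b\nhttps://example.com/a\n"
)
print(load_links(sample))
```

With a real file, the same function could be called as `load_links(open('link.csv', newline=''))`.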

# Latent Semantic Analysis (LSA)

Before moving on to LSA, a few things need to be prepared.
The libraries required are nltk, pandas, numpy, and scikit-learn.
If you are using Google Colab, you can install them with the commands below.

!pip install nltk <br>
!pip install pandas <br>
!pip install numpy <br>
!pip install scikit-learn <br>


## Data preprocessing

### Import libraries

Import the libraries needed for preprocessing.

In [3]:
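# import libraries
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
import numpy as np

# download the NLTK data used below (needed once; newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('stopwords')
nltk.download('punkt')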

Load the file "isi_berita.csv" into a pandas data frame.

In [4]:
# load the crawled articles into a data frame
dataCSV = pd.read_csv('isi_berita.csv')
dataCSV.head()

| | judul | label | isi |
|---|---|---|---|
| 0 | UPDATE 22 April: Bertambah 12, Total Kasus Cov... | Megapolitan | - Dinas Kesehatan Kota Tangerang melaporkan 1... |
| 1 | 2 Remaja Bawa Celurit Ditangkap Saat Hendak Ta... | Megapolitan | - Polsek Kebayoran Baru menangkap dua remaja ... |
| 2 | "Perempuan yang Nangis Histeris ke Jokowi itu ... | Megapolitan | - Direktur Utama (Dirut) Perumda Pasar Pakuan ... |
| 3 | Ada Posko Pengaduan THR di Kantor Sudin Naker ... | Megapolitan | Layanan Posko Pengaduan , (THR) 2022 dibuka o... |
| 4 | Diduga Korsleting Listrik, Ruang Logistik Masj... | Megapolitan | Ruang logistik ,Nurul Iman di Jalan Nurul Ima... |


### Cleansing and stopword removal
Cleansing removes symbols, digits, and extra whitespace from the text. <br>
Stopword removal then discards very common words that carry little meaning on their own, such as:
1. "dan" (and)
2. "yang" (which)
3. "atau" (or)
4. "adalah" (is)
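
As a small illustration of these two steps, the sketch below cleans one sentence using only the standard library. The function name `clean_text` and the four-word stopword set are assumptions made for this example; the notebook itself uses NLTK's tokenizer and its full Indonesian stopword list, so results can differ on tokens that mix letters and digits.

```python
import re

# Tiny illustrative stopword set (assumption); NLTK's Indonesian list is far larger.
STOPWORDS = {"dan", "yang", "atau", "adalah"}

def clean_text(text):
    """Lowercase the text, keep only alphabetic words, and drop stopwords."""
    # Replace everything except letters and whitespace with a space
    # (this removes symbols and digits).
    letters_only = re.sub(r"[^a-z\s]", " ", text.lower())
    # split() also collapses repeated whitespace.
    return [w for w in letters_only.split() if w not in STOPWORDS]

print(clean_text("Kota Tangerang dan Jakarta: 12 kasus yang baru!"))
```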

In [5]:
# cleansing & stopword removal
stop_words = set(stopwords.words('indonesian'))  # build the set once for fast lookups
array_stopwords = []
for isi in dataCSV['isi']:
    # keep lowercase alphabetic tokens that are not stopwords
    clean_words = []
    for w in word_tokenize(isi.lower()):
        if w.isalpha() and w not in stop_words:
            clean_words.append(w)
    array_stopwords.append(clean_words)

# join the words of each document back into one string
NewArray_stopwords = []
for words in array_stopwords:
    temp = ""
    for w in words:
        temp = temp + " " + w

    NewArray_stopwords.append(temp)
print(NewArray_stopwords[0])

 dinas kesehatan kota tangerang melaporkan jumat total kota tangerang pasien dirawat menjalani isolasi mandiri aktif berkurang orang berdasarkan data dinas kesehatan pasien sembuh bertambah orang pasien meninggal kecamatan cipondoh wilayah memiliki total tertinggi kecamatan karawaci data terkait kota tangerang diakses situs web


Above is an example of one article's text after cleansing and stopword removal.

The cell below puts the preprocessed text into the data frame named "dataCSV".

In [6]:
# drop the raw columns; only the cleaned text is needed from here on
dataCSV = dataCSV.drop(['isi', 'judul', 'label'], axis=1)
dataCSV['isi_berita_final'] = np.array(NewArray_stopwords)
dataCSV.head()

| | isi_berita_final |
|---|---|
| 0 | dinas kesehatan kota tangerang melaporkan jum... |
| 1 | polsek kebayoran menangkap remaja kedapatan m... |
| 2 | direktur utama dirut perumda pasar pakuan jay... |
| 3 | layanan posko pengaduan thr dibuka suku dinas... |
| 4 | ruang logistik nurul iman jalan nurul iman ke... |


## Term Frequency - Inverse Document Frequency (TF-IDF)

After preprocessing, the next step is TF-IDF. <br>
TF-IDF is a method for weighting each word in each document of a corpus. It is widely used because it is simple and inexpensive to compute. <br>
Term Frequency (TF) measures how often a word occurs in a document: the number of occurrences of the word divided by the total number of words in that document. <br>
Inverse Document Frequency (IDF) is log(number of documents / number of documents that contain the word). <br>
The TF-IDF weight is TF multiplied by IDF, as in the formula below:

$$
W_{i, j}=\frac{n_{i, j}}{\sum_{k=1}^{p} n_{i, k}} \log _{2} \frac{D}{d_{j}}
$$

where:

$
{W_{i, j}}\quad\quad\>: \text { TF-IDF weight of term j in document i } \\
{n_{i, j}}\quad\quad\>\>: \text { number of occurrences of term j in document i }\\
{p} \quad\quad\quad\>\>: \text { number of distinct terms }\\
{\sum_{k=1}^{p} n_{i, k}}: \text { total number of term occurrences in document i }\\
{D} \quad\quad\quad: \text { total number of documents }\\
{d_{j}} \quad\quad\quad: \text { number of documents that contain term j }\\
$
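
The formula can be checked by hand on a very small corpus. The sketch below is a direct, unoptimised translation of it into plain Python; the three toy documents and the function name `tf_idf` are assumptions made for this example.

```python
import math

# Toy corpus (assumption): each document is already a list of cleaned terms.
docs = [
    ["kota", "pasien", "pasien"],
    ["kota", "pasar"],
    ["pasar", "pasar", "thr", "kota"],
]

def tf_idf(docs):
    """Return one {term: weight} dict per document, with W = tf * log2(D / d_j)."""
    D = len(docs)
    # d_j: number of documents that contain each term
    doc_freq = {}
    for doc in docs:
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    weights = []
    for doc in docs:
        total = len(doc)  # total number of term occurrences in this document
        w = {}
        for term in set(doc):
            tf = doc.count(term) / total
            w[term] = tf * math.log2(D / doc_freq[term])
        weights.append(w)
    return weights

weights = tf_idf(docs)
print(weights[0])
```

A term that appears in every document, such as `kota` here, gets weight 0, because log2(D / d_j) = log2(1) = 0.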



### Import the TF-IDF libraries

Import the libraries needed for TF-IDF and take the data produced by the preprocessing above.

In [7]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
df = dataCSV

### Term Frequency

Convert the data to a list, then compute the term frequencies with CountVectorizer from scikit-learn.

In [8]:
# collect the preprocessed articles into a list
list_isi_berita = df.iloc[:, -1].tolist()

# compute the term frequencies
count_vectorizer = CountVectorizer(min_df=1)
tf = count_vectorizer.fit_transform(list_isi_berita)

# get the feature (term) names
fitur = count_vectorizer.get_feature_names_out()

# show the TF matrix: terms as rows, documents (numbered from 1) as columns
show_tf = tf.toarray()
df_tf = pd.DataFrame(data=show_tf, index=range(1, show_tf.shape[0] + 1), columns=fitur)
df_tf = df_tf.T

df_tf.head(8)

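| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 66 | 67 | 68 | 69 | 70 | 71 | 72 | 73 | 74 | 75 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| abang | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| abdul | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| abk | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ac | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| acara | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 4 | 3 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 |
| acd | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| aceh | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| achmad | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |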


## TF-IDF

After computing TF, compute TF-IDF and store the result in a data frame.

In [9]:
# TF-IDF with TfidfTransformer, applied to the term counts in tf
tfidf_transform = TfidfTransformer(use_idf=True, norm='l2', smooth_idf=True)
tfidf = tfidf_transform.fit_transform(tf).toarray()
df_tfidf = pd.DataFrame(data=tfidf, index=range(1, tfidf.shape[0] + 1), columns=fitur)
df_tfidf.head(8)

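| | abang | abdul | abk | ac | acara | acd | aceh | achmad | acs | ad | ... | yogyakarta | youtube | yudisial | yuhronur | za | zaman | zee | zona | zudan | zulpan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.113748 | ... | 0.0 | 0.0 | 0.113748 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 7 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.049189 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |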
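
Note that `TfidfTransformer`, as configured above, does not compute exactly the formula given earlier. With `smooth_idf=True` it uses idf = ln((1 + D) / (1 + d_j)) + 1 on the raw counts, and with `norm='l2'` it then scales each document row to unit Euclidean length, which is why every weight lies between 0 and 1. The sketch below reproduces that calculation in plain Python; the function name `sklearn_style_tfidf` and the toy count matrix are assumptions made for this example.

```python
import math

def sklearn_style_tfidf(counts):
    """counts: list of rows (documents), each a list of raw term counts."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # document frequency of each term
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        # l2 normalisation; an all-zero row is left unchanged
        result.append([v / norm for v in weighted] if norm > 0 else weighted)
    return result

# Toy counts (assumption): 2 documents, 3 terms.
counts = [[1, 2, 0], [1, 0, 1]]
rows = sklearn_style_tfidf(counts)
for r in rows:
    print([round(v, 4) for v in r])
```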


## Latent Semantic Analysis (LSA)

Latent Semantic Analysis (LSA) is an algorithm for analysing the relationship between a phrase or sentence and a collection of documents.
LSA relies on Singular Value Decomposition (SVD), a matrix factorisation used here as a dimensionality-reduction technique to lower the complexity of working with the term-document matrix. The SVD formula is:

$$
A_{m n}=U_{m m} \, S_{m n} \, V_{n n}^{T}
$$

where:

$
{A_{m n}}: \text { the original matrix } \\
{U_{m m}}: \text { orthogonal matrix U }\\
{S_{m n}}\>: \text { diagonal matrix S }\\
{V_{n n}^{T}}\>\>: \text { transpose of orthogonal matrix V }\\
$
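
LSA does not keep the full decomposition. As a sketch, assuming the singular values on the diagonal of S are sorted in decreasing order and writing k for the chosen number of topics (k is a symbol introduced here, not part of the formula above), only the k largest singular values and their vectors are retained:

$$
A_{m n} \approx U_{m k} \, S_{k k} \, V_{n k}^{T}, \quad k \ll \min (m, n)
$$

In the code below, `n_components=10` plays the role of k. Because the TF-IDF matrix has one row per document, `fit_transform` returns the document coordinates $U_{m k} S_{k k}$, and `components_` holds $V_{n k}^{T}$, with one row per topic and one column per term.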

In [10]:
from sklearn.decomposition import TruncatedSVD

### LSA with TruncatedSVD from scikit-learn

In [11]:
lsa = TruncatedSVD(n_components=10, random_state=36)
lsa_matrix = lsa.fit_transform(tfidf)

## Topic proportions in each document

In [12]:
# show the weight of each topic in each document (documents numbered from 1)
df_topicDocument = pd.DataFrame(data=lsa_matrix, index=range(1, lsa_matrix.shape[0] + 1))
df_topicDocument.head(6)

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.087707 | 0.102211 | 0.149338 | -0.062967 | 0.452697 | -0.34532 | 0.460555 | 0.058317 | 0.040384 | -0.033407 |
| 2 | 0.017242 | 0.036715 | 0.039887 | 0.019609 | 0.039418 | -0.007147 | -0.034355 | 0.01973 | 0.059311 | 0.065048 |
| 3 | 0.031948 | 0.099333 | 0.421999 | 0.587809 | -0.153856 | 0.096218 | 0.127955 | 0.006938 | -0.047221 | 0.00101 |
| 4 | 0.018733 | 0.078729 | 0.079897 | -0.02123 | 0.036714 | -0.062772 | -0.037238 | 0.058692 | 0.02842 | 0.23112 |
| 5 | 0.037463 | 0.028119 | 0.069912 | -0.021885 | 0.076855 | 0.034009 | -0.016099 | -0.009766 | 0.069195 | 0.098574 |
| 6 | 0.013169 | 0.054063 | 0.068924 | 0.008792 | 0.073871 | -0.081508 | -0.039932 | 0.059776 | 0.035486 | -0.012879 |
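
A simple heuristic for reading this table is to take, for each document, the topic with the largest absolute weight. The sketch below does that in plain Python for documents 1 and 3, using the values shown above; the function name `dominant_topics` is an assumption made for this example, and LSA weights can be negative, so this is only a rough guide.

```python
def dominant_topics(doc_topic_rows):
    """For each document row, return the index of the topic with the largest absolute weight."""
    return [
        max(range(len(row)), key=lambda k: abs(row[k]))
        for row in doc_topic_rows
    ]

# Rows for documents 1 and 3, copied from the table above.
rows = [
    [0.087707, 0.102211, 0.149338, -0.062967, 0.452697,
     -0.34532, 0.460555, 0.058317, 0.040384, -0.033407],
    [0.031948, 0.099333, 0.421999, 0.587809, -0.153856,
     0.096218, 0.127955, 0.006938, -0.047221, 0.00101],
]
print(dominant_topics(rows))  # [6, 3]
```

With the notebook's variables, the same function could be applied to every document as `dominant_topics(lsa_matrix)`.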


## Term weights for each topic

In [13]:
# show the weight of each term in each topic (topics numbered from 1)
df_termTopic = pd.DataFrame(data=lsa.components_, index=range(1, lsa.components_.shape[0] + 1), columns=fitur)
df_termTopic.head(100)

| | abang | abdul | abk | ac | acara | acd | aceh | achmad | acs | ad | ... | yogyakarta | youtube | yudisial | yuhronur | za | zaman | zee | zona | zudan | zulpan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.002096 | 0.001974 | 0.000287 | 0.000222 | 0.014052 | 8.9e-05 | 0.000844 | 0.000598 | 8.9e-05 | 0.000264 | ... | 0.002774 | 0.000476 | 0.000264 | 0.000116 | 0.003163 | 0.000138 | 0.000511 | 0.001227 | 0.000767 | 0.001354 |
| 2 | 0.011068 | 0.007553 | 0.001667 | 0.001921 | 0.184206 | 0.000783 | 0.001797 | 0.004189 | 0.000783 | 0.001714 | ... | 0.012041 | 0.001457 | 0.001714 | 0.000969 | 0.074783 | 0.000522 | 0.001178 | 0.004469 | 0.008112 | 0.008512 |
| 3 | 0.029964 | 0.008754 | 0.003026 | 0.004717 | -0.06166 | 0.002419 | 0.003188 | 0.007372 | 0.002419 | 0.002785 | ... | 0.028159 | 0.002522 | 0.002785 | 0.001363 | -0.036084 | 0.000526 | 0.000866 | 0.00907 | 0.004107 | 0.010106 |
| 4 | -0.03363 | -0.002757 | 0.002464 | 0.001007 | 0.00388 | -0.00288 | -0.000617 | 0.006211 | -0.00288 | -0.000125 | ... | -0.030508 | -0.001661 | -0.000125 | -0.00053 | 0.007928 | -0.000542 | -0.000313 | -0.008011 | -0.000953 | -0.002091 |
| 5 | -0.022686 | 0.004609 | 0.003164 | 0.015089 | -0.017309 | -0.002009 | 0.006876 | 0.010812 | -0.002009 | 0.003965 | ... | -0.014942 | 0.000825 | 0.003965 | -0.000142 | -0.005985 | 0.000985 | 0.002626 | 0.005589 | 0.003869 | -0.001965 |
| 6 | -0.000383 | -0.003709 | -0.001705 | 0.012435 | 0.024226 | -0.00128 | 0.00014 | -0.019491 | -0.00128 | -0.005014 | ... | -0.006529 | 0.000426 | -0.005014 | -0.001988 | 0.005679 | -0.000258 | -0.00373 | 0.003722 | -0.009428 | -0.00097 |
| 7 | 0.003739 | -0.017315 | -0.001267 | -0.005656 | 0.0006 | -0.000142 | 0.000157 | -0.027056 | -0.000142 | -0.002832 | ... | -0.00582 | -0.002401 | -0.002832 | -0.002624 | 0.005363 | -0.000699 | -0.005326 | -0.005896 | -0.017913 | -0.006559 |
| 8 | 0.001146 | 0.023973 | 0.000231 | -0.004016 | 0.007235 | -0.000851 | 0.001363 | -0.032589 | -0.000851 | 0.003467 | ... | 0.016544 | -0.000612 | 0.003467 | 0.004888 | -0.000108 | 3.2e-05 | 0.009665 | 0.009503 | 0.02145 | 0.009894 |
| 9 | -0.017573 | -0.006449 | 0.001113 | 0.00359 | 0.113075 | -0.002986 | -0.000125 | -0.003433 | -0.002986 | 0.003734 | ... | -0.010674 | 0.003259 | 0.003734 | -0.000607 | -0.008938 | 0.002604 | 0.001675 | -0.003909 | -0.006123 | 0.00754 |
| 10 | -0.004793 | 0.00128 | 0.002732 | 0.004587 | -0.041176 | -0.000271 | 0.005368 | -0.006124 | -0.000271 | -0.000617 | ... | 0.033078 | 0.002331 | -0.000617 | -0.003878 | -0.001915 | 0.001074 | 0.0227 | 0.025567 | -0.029575 | 0.016553 |
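
To get a feel for what each topic is about, the term-topic matrix above can be summarised by listing the terms with the largest absolute weight in each topic. The sketch below shows one way to do that in plain Python; the function name `top_terms` and the toy weights are assumptions made for this example, not values from the notebook.

```python
def top_terms(components, terms, n=3):
    """For each topic row, return the n terms with the largest absolute weight."""
    result = []
    for row in components:
        ranked = sorted(zip(terms, row), key=lambda tw: abs(tw[1]), reverse=True)
        result.append([term for term, _ in ranked[:n]])
    return result

# Toy values (assumption), shaped like lsa.components_: one row per topic.
terms = ["acara", "kota", "pasar", "pasien"]
components = [
    [0.01, 0.60, 0.05, 0.70],
    [0.55, -0.10, 0.65, -0.02],
]
print(top_terms(components, terms, n=2))  # [['pasien', 'kota'], ['pasar', 'acara']]
```

With the notebook's variables, the call would be `top_terms(lsa.components_, fitur)`.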
