# Files used

The data comes from Kaggle: https://www.kaggle.com/datasets/shakhauat/reviews-data-for-nlp-task

The file can be downloaded from https://drive.google.com/drive/folders/1eNIWnPYEnFNuwJKOGHFYplKTjF2WXpZt?usp=sharing



The following code imports three basic Python libraries commonly used for data analysis and text manipulation:


In [None]:
# Import basic libraries
import numpy as np
import pandas as pd
import re

This code imports **NLTK** (Natural Language Toolkit), which is used for natural language processing: `stopwords` provides stopword lists, `word_tokenize` splits text into tokens, and `PorterStemmer` reduces words to their stems.

In [None]:
# Import NLTK for text processing
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

This code **downloads the NLTK resources** needed for text processing.

`punkt_tab` is used by the tokenizer (`word_tokenize`), and `stopwords` provides the stopword lists used to clean the text.

In [None]:
# Download NLTK resources
nltk.download('punkt_tab')
nltk.download('stopwords')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

The following code imports **TfidfVectorizer** from **scikit-learn**, which converts text into features weighted by **TF-IDF** (Term Frequency-Inverse Document Frequency).

**Purpose**: produce more informative text features for data analysis or machine learning.

In [None]:
# Import the TF-IDF library
from sklearn.feature_extraction.text import TfidfVectorizer

# Preprocessing


This code reads the CSV file and renames its columns:

1. `pd.read_csv('Starbucks_Reviews_Filtered(2).csv')` reads the file `Starbucks_Reviews_Filtered(2).csv` and loads its contents into a DataFrame.

2. `df.columns = ['docID', 'Text']` renames the DataFrame columns to `docID` (the document identifier) and `Text` (the review text). The assignment only works if the file has exactly two columns.

3. `df` displays the DataFrame so the loaded data can be inspected.

If the CSV file is in the working directory, the code loads the data and displays a table with the columns `docID` and `Text`. If the file is not found, `pd.read_csv` raises a `FileNotFoundError`.

In [None]:
df = pd.read_csv('Starbucks_Reviews_Filtered(2).csv')
df.columns = ['docID', 'Text']
df

Unnamed: 0,docID,Text
0,0,Amber and LaDonna at the Starbucks on Southwes...
1,1,** at the Starbucks by the fire station on 436...
2,2,I just wanted to go out of my way to recognize...
3,3,Me and my friend were at Starbucks and my card...
4,4,I’m on this kick of drinking 5 cups of warm wa...
...,...,...
295,295,"I was at store #16710 and was waiting, waiting..."
296,296,Lady who works there would NOT let me use the ...
297,297,I'm tired of spending $5 on coffee and not bei...
298,298,Good amenities but overpriced. Promotions are ...


## Assigning document IDs

This code turns each value in the `docID` column into a string made of the letter `D` followed by the row's original ID, for example `D0`, `D1`.

In [None]:
df.docID = pd.Series(["D"+ str(ind) for ind in df.docID])
df

Unnamed: 0,docID,Text
0,D0,Amber and LaDonna at the Starbucks on Southwes...
1,D1,** at the Starbucks by the fire station on 436...
2,D2,I just wanted to go out of my way to recognize...
3,D3,Me and my friend were at Starbucks and my card...
4,D4,I’m on this kick of drinking 5 cups of warm wa...
...,...,...
295,D295,"I was at store #16710 and was waiting, waiting..."
296,D296,Lady who works there would NOT let me use the ...
297,D297,I'm tired of spending $5 on coffee and not bei...
298,D298,Good amenities but overpriced. Promotions are ...


## Removing punctuation and special characters, and lowercasing

Replace punctuation and every other non-word character with a space, strip leading and trailing whitespace, and convert all text to lowercase.

In [None]:
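# Replace everything except letters, digits, and whitespace with a space
df.Text = df.Text.str.replace(r"[^a-zA-Z0-9\s]", " ", regex=True)
# Commas are already removed by the line above; kept as a literal replacement
df.Text = df.Text.str.replace(",", " ")
# regex=True is needed here: recent pandas versions treat the pattern as literal text by default
df.Text = df.Text.str.replace(r"\W", " ", regex=True)
# Trim leading/trailing whitespace and lowercase
df.Text = df.Text.str.strip().str.lower()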

df

Unnamed: 0,docID,Text
0,D0,amber and ladonna at the starbucks on southwes...
1,D1,at the starbucks by the fire station on 436 in...
2,D2,i just wanted to go out of my way to recognize...
3,D3,me and my friend were at starbucks and my card...
4,D4,i m on this kick of drinking 5 cups of warm wa...
...,...,...
295,D295,i was at store 16710 and was waiting waiting...
296,D296,lady who works there would not let me use the ...
297,D297,i m tired of spending 5 on coffee and not bei...
298,D298,good amenities but overpriced promotions are ...
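
To see the effect of this cleaning on a single string, here is a minimal sketch that uses only Python's `re` module. The helper name `clean_text` and the sample sentence are illustrative assumptions, not part of the notebook.

```python
import re

def clean_text(text):
    """Replace every character that is not a letter, digit, or whitespace
    with a space, then trim the ends and lowercase (hypothetical helper)."""
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)
    return text.strip().lower()

sample = "I'm tired of spending $5 on coffee!"
cleaned = clean_text(sample)
print(cleaned)
print(cleaned.split())  # splitting on whitespace discards the leftover double spaces
```

Each removed character becomes a space, so runs of spaces can remain in the cleaned string; they disappear once the text is split into tokens.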


## Tokenization

This code uses `word_tokenize` to split the text in the `Text` column into a list of tokens (words).

In [None]:
df.Text = df.Text.apply(word_tokenize)
df

Unnamed: 0,docID,Text
0,D0,"[amber, and, ladonna, at, the, starbucks, on, ..."
1,D1,"[at, the, starbucks, by, the, fire, station, o..."
2,D2,"[i, just, wanted, to, go, out, of, my, way, to..."
3,D3,"[me, and, my, friend, were, at, starbucks, and..."
4,D4,"[i, m, on, this, kick, of, drinking, 5, cups, ..."
...,...,...
295,D295,"[i, was, at, store, 16710, and, was, waiting, ..."
296,D296,"[lady, who, works, there, would, not, let, me,..."
297,D297,"[i, m, tired, of, spending, 5, on, coffee, and..."
298,D298,"[good, amenities, but, overpriced, promotions,..."


## Stopword removal

This code removes **stopwords** (common words that carry little meaning) from the **Text** column.

In [None]:
stop_words = set(stopwords.words('english'))
df['Text'] = df['Text'].apply(lambda x: [word for word in x if word not in stop_words])
df

Unnamed: 0,docID,Text
0,D0,"[amber, ladonna, starbucks, southwest, parkway..."
1,D1,"[starbucks, fire, station, 436, altamonte, spr..."
2,D2,"[wanted, go, way, recognize, starbucks, employ..."
3,D3,"[friend, starbucks, card, work, thankful, work..."
4,D4,"[kick, drinking, 5, cups, warm, water, work, i..."
...,...,...
295,D295,"[store, 16710, waiting, waiting, waiting, wife..."
296,D296,"[lady, works, would, let, use, restroom, custo..."
297,D297,"[tired, spending, 5, coffee, able, drink, ask,..."
298,D298,"[good, amenities, overpriced, promotions, cost..."
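
The filtering step can be illustrated without NLTK. This is a minimal sketch that assumes a tiny hand-picked stopword set (`demo_stop_words`); the notebook itself uses NLTK's full English list.

```python
# A tiny stopword set for illustration only (assumption, not NLTK's list).
demo_stop_words = {"i", "m", "of", "on", "and", "not", "to"}

tokens = ["i", "m", "tired", "of", "spending", "5", "on", "coffee"]

# Keep only the tokens that are not stopwords, preserving their order.
filtered = [word for word in tokens if word not in demo_stop_words]
print(filtered)  # ['tired', 'spending', '5', 'coffee']
```

Storing the stopwords in a `set` keeps each membership test cheap, which matters when the same check runs for every token of every document.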


## Stemming

This code applies **stemming** to the words in the **Text** column using the **PorterStemmer** algorithm, which reduces each word to its stem (for example, `waiting` becomes `wait`).

In [None]:
stemmer = PorterStemmer()
df['Text'] = df['Text'].apply(lambda x: [stemmer.stem(word) for word in x])
df

Unnamed: 0,docID,Text
0,D0,"[amber, ladonna, starbuck, southwest, parkway,..."
1,D1,"[starbuck, fire, station, 436, altamont, sprin..."
2,D2,"[want, go, way, recogn, starbuck, employe, bil..."
3,D3,"[friend, starbuck, card, work, thank, worker, ..."
4,D4,"[kick, drink, 5, cup, warm, water, work, insta..."
...,...,...
295,D295,"[store, 16710, wait, wait, wait, wife, drink, ..."
296,D296,"[ladi, work, would, let, use, restroom, custom..."
297,D297,"[tire, spend, 5, coffe, abl, drink, ask, extra..."
298,D298,"[good, amen, overpr, promot, costli, 125, doll..."


# Processing

## Inverted index

This code builds an **inverted index**, which records which documents contain each word.

### Result:
- The **inverted index** uses words as keys; each value is a set of the **`docID`** values of the documents that contain that word.
- The output prints each word followed by the documents that contain it, in the format `word: docID1, docID2, ...`. Because the values are sets, the docIDs appear in no particular order.

In [None]:
inverted_index = {}

for index, row in df.iterrows():
    doc_id = row['docID']
    words = row['Text']

    for word in words:
        if word not in inverted_index:
            inverted_index[word] = set()
        inverted_index[word].add(doc_id)

for word, doc_ids in inverted_index.items():
    print(f"{word}: {', '.join(doc_ids)}")

amber: D0
ladonna: D0
starbuck: D13, D41, D138, D31, D55, D278, D251, D83, D115, D2, D154, D72, D89, D99, D274, D62, D273, D112, D294, D130, D11, D170, D212, D15, D219, D4, D123, D131, D225, D289, D283, D275, D40, D23, D151, D113, D96, D51, D214, D0, D45, D71, D220, D233, D258, D163, D207, D296, D249, D19, D82, D286, D126, D185, D159, D34, D227, D240, D291, D201, D243, D165, D75, D134, D164, D1, D80, D276, D10, D74, D226, D246, D160, D281, D25, D195, D127, D117, D277, D52, D264, D92, D135, D105, D144, D141, D172, D44, D86, D244, D50, D121, D169, D235, D167, D262, D48, D173, D90, D18, D161, D168, D46, D100, D179, D215, D149, D254, D255, D118, D147, D253, D20, D94, D191, D129, D270, D38, D177, D194, D238, D108, D216, D142, D209, D232, D132, D114, D64, D69, D222, D263, D250, D228, D106, D95, D290, D33, D119, D192, D196, D24, D85, D29, D9, D157, D70, D202, D153, D186, D239, D152, D136, D181, D166, D280, D102, D176, D143, D6, D61, D77, D42, D146, D208, D268, D205, D30, D223, D210, D49, D272
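
Once the index exists, a boolean AND query reduces to intersecting the sets of docIDs. The sketch below assumes a hypothetical three-document corpus (`toy_docs`) and a hypothetical helper `docs_with_all`; neither appears in the notebook.

```python
from collections import defaultdict

# Toy pre-processed corpus (hypothetical documents, not taken from the dataset).
toy_docs = {
    "D0": ["starbuck", "coffe", "good"],
    "D1": ["starbuck", "servic", "bad"],
    "D2": ["coffe", "servic", "slow"],
}

# Build the index: term -> set of docIDs that contain the term.
toy_index = defaultdict(set)
for doc_id, words in toy_docs.items():
    for word in words:
        toy_index[word].add(doc_id)

def docs_with_all(terms, index):
    """Return the sorted docIDs that contain every term (boolean AND)."""
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(docs_with_all(["starbuck", "servic"], toy_index))  # ['D1']
print(docs_with_all(["coffe"], toy_index))               # ['D0', 'D2']
```

Sorting the result gives a stable order, which the raw sets do not guarantee.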

## Vector matrix

Join each document's tokens back into a single string, then build a word-frequency (term count) matrix with one row per document and one column per term.

In [None]:
documents = df['Text'].apply(lambda x: ' '.join(x)).tolist()

vectorizer = TfidfVectorizer(use_idf=False, norm=None, binary=False)
word_count_matrix = vectorizer.fit_transform(documents).toarray()

word_freq_df = pd.DataFrame(word_count_matrix, columns=vectorizer.get_feature_names_out())
word_freq_df['docID'] = df['docID']
word_freq_df = word_freq_df.set_index('docID')
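
With `use_idf=False` and `norm=None`, the vectorizer simply counts how often each term occurs in each document. The sketch below reproduces that idea with `collections.Counter` on a hypothetical two-document corpus; unlike scikit-learn's default tokenizer, it does not drop single-character tokens.

```python
from collections import Counter

# Hypothetical stemmed documents, already joined into strings.
toy_documents = ["wait wait wait coffe", "coffe good"]

# Vocabulary in sorted order, one column per term.
vocabulary = sorted({word for doc in toy_documents for word in doc.split()})

# One row per document, holding the raw count of each vocabulary term.
count_matrix = []
for doc in toy_documents:
    counts = Counter(doc.split())
    count_matrix.append([counts[term] for term in vocabulary])

print(vocabulary)    # ['coffe', 'good', 'wait']
print(count_matrix)  # [[1, 0, 3], [1, 1, 0]]
```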

## TF-IDF

Weight the terms with TF-IDF, then join the result to the word-frequency table. The TF-IDF columns carry the `_tfidf` suffix.

In [None]:
tfidf_vectorizer = TfidfVectorizer(use_idf=True, norm='l2')
tfidf_matrix = tfidf_vectorizer.fit_transform(documents).toarray()

tfidf_df = pd.DataFrame(tfidf_matrix, columns=tfidf_vectorizer.get_feature_names_out())
tfidf_df['docID'] = df['docID']

tfidf_df = tfidf_df.set_index('docID')
final_df = word_freq_df.join(tfidf_df, rsuffix='_tfidf')

final_df

docID,00,000php,01,02,03,04,05,06,10,100,...,yep_tfidf,yesterday_tfidf,yet_tfidf,yogurt_tfidf,yong_tfidf,young_tfidf,youth_tfidf,yr_tfidf,zeeb_tfidf,zero_tfidf
D0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
D1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
D2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
D3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
D4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
D295,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
D296,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
D297,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
D298,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
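
The weighting can be worked through by hand on a small example. This sketch assumes scikit-learn's default smoothed formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row; the corpus and the helper `tfidf_row` are hypothetical.

```python
import math

# Hypothetical stemmed documents.
toy_documents = ["wait wait coffe", "coffe good"]
tokenized = [doc.split() for doc in toy_documents]
vocabulary = sorted({word for doc in tokenized for word in doc})
n_docs = len(tokenized)

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = {}
for term in vocabulary:
    doc_freq = sum(term in doc for doc in tokenized)
    idf[term] = math.log((1 + n_docs) / (1 + doc_freq)) + 1

def tfidf_row(doc):
    """Weight raw counts by idf, then scale the row to unit L2 length."""
    weights = [doc.count(term) * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]

tfidf_rows = [tfidf_row(doc) for doc in tokenized]
for row in tfidf_rows:
    print([round(w, 3) for w in row])
```

A term that occurs in every document (`coffe` here) gets the minimum idf of 1, while rarer terms receive a larger weight.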


## Preprocessing the query

The function **`query_processing(query)`** prepares a search query with the same cleaning steps applied to the documents: it replaces non-word characters with spaces, strips surrounding whitespace and lowercases the text, removes English stopwords, and stems the remaining words with `PorterStemmer`.

### Result:
The function returns the query as a single cleaned, stemmed string that can be matched against the indexed documents.

In [None]:
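def query_processing(query):
  stemmer = PorterStemmer()
  # Build the stopword set once instead of reloading the list for every word
  stop_words = set(stopwords.words('english'))
  # Raw string avoids an invalid-escape warning for \W
  query = re.sub(r'\W', ' ', query)
  query = query.strip().lower()
  query = " ".join([stemmer.stem(word) for word in query.split() if word not in stop_words])
  return query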

In [None]:
query = "recomended starbuck's Places near#300"
query = query_processing(query)
query

'recomend starbuck place near 300'

## Cosine similarity

The function **`cosine_similarity(vector1, vector2)`** computes the **cosine similarity** between two numeric vectors, a common measure of how similar two documents are: the dot product of the vectors divided by the product of their Euclidean norms.

### Result:
- The function returns the **cosine similarity** of the two vectors, which measures how closely their directions agree. If either vector has zero length, it returns 0.
- A value of 1 means the vectors point in the same direction, 0 means they are orthogonal (no terms in common), and values in between indicate partial similarity. TF-IDF vectors are non-negative, so the score here never drops below 0.

In [None]:
def cosine_similarity(vector1, vector2):

  dot_product = np.dot(vector1, vector2)
  magnitude_vector1 = np.linalg.norm(vector1)
  magnitude_vector2 = np.linalg.norm(vector2)

  if magnitude_vector1 == 0 or magnitude_vector2 == 0:
    return 0
  return dot_product / (magnitude_vector1 * magnitude_vector2)
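
A few hand-checkable cases make the behaviour concrete. This is a minimal pure-Python sketch of the same formula; the function name `cosine` and the sample vectors are assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length sequences; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

print(cosine([1, 2], [2, 4]))  # same direction: approximately 1
print(cosine([1, 0], [0, 1]))  # orthogonal: 0.0
print(cosine([1, 0], [1, 1]))  # 45 degrees apart: about 0.707
print(cosine([0, 0], [1, 1]))  # zero vector: 0.0 by convention
```

The first case shows that the score depends on direction only: doubling a vector's length does not change it.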

## Query similarity scoring

The function **`query_similarity_scoring(query, tfidf_df, vectorizer)`** scores a query against every document in the **`tfidf_df`** DataFrame using **cosine similarity**. The `vectorizer` argument must be the fitted TF-IDF vectorizer that produced `tfidf_df`, so that the query vector uses the same vocabulary as the document vectors.

### Result:
The function returns a **DataFrame** with the 10 documents that have the **highest similarity score** for the given query, sorted from most to least similar. This is how documents are ranked in an information retrieval system.

In [None]:
def query_similarity_scoring(query, tfidf_df, vectorizer):

  query_vector = vectorizer.transform([query]).toarray()
  similarity_scores = []

  for doc_id in tfidf_df.index:
    doc_vector = tfidf_df.loc[doc_id].values
    similarity = cosine_similarity(query_vector[0], doc_vector)
    similarity_scores.append((doc_id, similarity))
  similarity_df = pd.DataFrame(similarity_scores, columns=['docID', 'similarity_score'])
  similarity_df = similarity_df.sort_values('similarity_score', ascending=False).head(10)
  return similarity_df
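
The ranking logic can be tried in isolation on a handful of vectors. The sketch below assumes hypothetical document vectors over a three-term vocabulary and a hypothetical helper `rank_documents`; it uses `heapq.nlargest` in place of sorting a DataFrame.

```python
import heapq
import math

# Hypothetical document vectors over a 3-term vocabulary.
toy_doc_vectors = {
    "D0": [1.0, 0.0, 0.0],
    "D1": [0.6, 0.8, 0.0],
    "D2": [0.0, 0.0, 1.0],
}
toy_query_vector = [1.0, 1.0, 0.0]

def rank_documents(query_vec, doc_vectors, top_k=2):
    """Return the top_k (docID, score) pairs by cosine similarity."""
    q_norm = math.sqrt(sum(q * q for q in query_vec))
    scores = []
    for doc_id, vec in doc_vectors.items():
        d_norm = math.sqrt(sum(d * d for d in vec))
        dot = sum(q * d for q, d in zip(query_vec, vec))
        score = dot / (q_norm * d_norm) if q_norm and d_norm else 0.0
        scores.append((doc_id, score))
    # nlargest returns the pairs in descending order of score
    return heapq.nlargest(top_k, scores, key=lambda pair: pair[1])

ranking = rank_documents(toy_query_vector, toy_doc_vectors)
print(ranking)
```

`D1` ranks first because it shares both query terms, `D0` shares one, and `D2` shares none and is cut off by `top_k`.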

# Running a query

This code preprocesses the query and then scores it against the documents in **`tfidf_df`** using **`query_similarity_scoring`**.

### Result:
**`similarity_results`** shows the documents most similar to the query "Starbuck como Lake Bad service", ranked by **cosine similarity**. The DataFrame has two columns:
- **`docID`**: the document ID.
- **`similarity_score`**: the similarity score between the query and the document.

In [None]:
query = 'Starbuck como Lake Bad service'
query = query_processing(query)

similarity_results = query_similarity_scoring(query, tfidf_df, tfidf_vectorizer)
similarity_results

Unnamed: 0,docID,similarity_score
19,D19,0.415931
187,D187,0.212693
252,D252,0.183047
201,D201,0.171754
204,D204,0.124669
286,D286,0.09418
31,D31,0.081679
84,D84,0.080155
1,D1,0.077816
245,D245,0.072202
