# Task Description

RNNs can suffer from vanishing gradients when the input sequence is long, a problem commonly referred to as long-term dependencies. An LSTM, with its cell state, can mitigate this. This task compares the performance of an RNN and an LSTM at predicting the sentiment of movie reviews in the IMDB dataset, and investigates the gradient updates of both models during training. Experiments on long-term dependencies are also needed to evaluate how robust each model's predictions are. One way to run them is to analyze each model's prediction errors with a confusion matrix.
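The vanishing gradient effect described above can be illustrated with a minimal sketch. It assumes a one-unit vanilla RNN, `h_t = tanh(w * h_{t-1} + x_t)`, with a hypothetical recurrent weight `w` and a constant input; these values are chosen only for illustration and are not taken from the models trained in this task. The gradient of the last hidden state with respect to the first is a product of per-step factors `w * (1 - h_t^2)`, so it shrinks as the sequence gets longer.

```python
import math


def gradient_through_time(w, inputs, h0=0.0):
    """Return |d h_T / d h_0| for a scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h * h)
    return abs(grad)


# Hypothetical weight and input, chosen only for illustration.
w = 0.9
for length in (10, 50, 200):
    g = gradient_through_time(w, [0.5] * length)
    print(f"sequence length {length:>3}: |dh_T/dh_0| = {g:.3e}")
```

With `|w| < 1`, every factor is below 1 in magnitude, so the product decays with sequence length. The LSTM cell state is updated additively, which is the usual explanation for why it tends to preserve gradients over longer spans.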

- Dataset
  
  The IMDB Movie Review Dataset contains 50k reviews. It includes long and complex reviews (with non-standard words and symbols). The task is to build a binary classification model that predicts whether a review has negative or positive sentiment.
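Because the dataset contains long reviews, one possible way to run the long-term dependency experiment from the task description is to build a separate confusion matrix for short and long reviews. The sketch below is only an illustration: the function name `confusion_by_length`, the token-count cutoff `threshold`, and the toy labels are all assumptions, not part of the original notebook.

```python
from collections import Counter


def confusion_by_length(y_true, y_pred, lengths, threshold):
    """Count TP/FP/TN/FN separately for short and long reviews.

    `threshold` is a hypothetical token-count cutoff separating the buckets.
    """
    counts = {"short": Counter(), "long": Counter()}
    for true, pred, n_tokens in zip(y_true, y_pred, lengths):
        bucket = "long" if n_tokens > threshold else "short"
        if pred == 1:
            cell = "TP" if true == 1 else "FP"
        else:
            cell = "TN" if true == 0 else "FN"
        counts[bucket][cell] += 1
    return counts


# Toy labels (1 = positive, 0 = negative); not real model output.
y_true = [1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
lengths = [40, 55, 600, 720, 80, 650]

result = confusion_by_length(y_true, y_pred, lengths, threshold=500)
print("short:", dict(result["short"]))
print("long: ", dict(result["long"]))
```

Comparing the error cells of the two buckets for the RNN and the LSTM would show whether one model degrades more on long inputs.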

# Import the required modules

In [None]:
!pip install contractions



In [None]:
import pandas as pd
import requests
import nltk
import re
import contractions
import spacy

from tqdm.auto import tqdm
from nltk.tokenize import word_tokenize
from nltk.tokenize.treebank import TreebankWordDetokenizer

# Load dataset

In [None]:
path = 'https://drive.google.com/uc?export=download&id=1FI78s5Bsr3lf53w_vnJiSE7HQfSvhBS3'
df = pd.read_csv(path)
df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


In [None]:
df['sentiment'].value_counts()

sentiment,count
positive,25000
negative,25000


# Text preprocessing

## Checking for null values

In [None]:
# Check for missing values in the dataset
df.isnull().sum()

Unnamed: 0,0
review,0
sentiment,0


## Checking for duplicates

In [None]:
# Check for duplicated reviews
df.duplicated(subset=['review']).sum()

np.int64(418)

Only 418 reviews are duplicated, which is well under 10% of the data (about 0.8% of the 50,000 rows), so these duplicates are not removed.
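The duplicate share quoted above can be reproduced with a small sketch. The helper `duplicate_stats` is a hypothetical name, written in plain Python so it runs without the dataset. It counts every repeat after the first occurrence, which mirrors the default `keep='first'` behavior of pandas `duplicated()`.

```python
def duplicate_stats(reviews):
    """Count repeats after the first occurrence and their share of all rows."""
    seen = set()
    n_dup = 0
    for text in reviews:
        if text in seen:
            n_dup += 1
        else:
            seen.add(text)
    return n_dup, n_dup / len(reviews)


# Toy data for illustration only.
toy = ["great film", "bad film", "great film", "okay", "great film"]
n_dup, frac = duplicate_stats(toy)
print(n_dup, f"{frac:.1%}")

# The figure reported above: 418 duplicates out of 50,000 rows.
print(f"{418 / 50000:.2%}")
```

If the data is later split into training and test sets, it may be worth checking that the same review does not end up in both splits.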

## Text Cleaning

In [None]:
df_clean = df.copy()

### Handling noise

In [None]:
def remove_URL(text):  # Remove URLs
    url = re.compile(r'https?://\S+|www\.\S+')
    return url.sub(r'', text)

def remove_html(text):  # Remove HTML tags
    html = re.compile(r'<.*?>')
    return html.sub(r'', text)

def case_folding(text):  # Convert any uppercase letters to lowercase
    if isinstance(text, str):
        lowercase_text = text.lower()
        return lowercase_text
    else:
        return text

def remove_emoji(text):  # Remove emoji
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F700-\U0001F77F"  # alchemical symbols
        u"\U0001F780-\U0001F7FF"  # Geometric Shapes Extended
        u"\U0001F800-\U0001F8FF"  # Supplemental Arrows-C
        u"\U0001F900-\U0001F9FF"  # Supplemental Symbols and Pictographs
        u"\U0001FA00-\U0001FA6F"  # Chess Symbols
        u"\U0001FA70-\U0001FAFF"  # Symbols and Pictographs Extended-A
        u"\U00002702-\U000027B0"  # Dingbats
        u"\U000024C2-\U0001F251"  # Enclosed characters
                            "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', text)

def remove_numbers(text):  # Remove digits
    text = re.sub(r'\d+', '', text)
    return text

def remove_symbols(text):
    # Remove every character except letters, digits, whitespace, and the symbols ? ! , .
    text = re.sub(r'[^a-zA-Z0-9\s\?\!\,\.]', '', text)
    return text

def remove_duplspaces(text):  # Collapse repeated whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    return text
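To see what the regex-based steps above do to a single review, here is a minimal sketch that chains them into one function. It is an illustration under stated assumptions, not the notebook's exact pipeline: the name `clean_text` and the sample string are hypothetical, the `contractions` and emoji steps are left out (the symbol filter already drops non-ASCII characters), HTML tags are replaced with a space rather than an empty string so that words on either side of a `<br />` are not glued together, and a final `strip()` is added.

```python
import re

# Patterns copied from the cleaning functions above, compiled once.
HTML_RE = re.compile(r'<.*?>')
URL_RE = re.compile(r'https?://\S+|www\.\S+')
DIGIT_RE = re.compile(r'\d+')
SYMBOL_RE = re.compile(r'[^a-zA-Z0-9\s\?\!\,\.]')
SPACE_RE = re.compile(r'\s+')


def clean_text(text):
    """Apply the regex cleaning steps in the same order as the notebook."""
    text = HTML_RE.sub(' ', text)    # assumption: space instead of '' to keep words apart
    text = URL_RE.sub('', text)
    text = text.lower()
    text = DIGIT_RE.sub('', text)
    text = SYMBOL_RE.sub('', text)
    return SPACE_RE.sub(' ', text).strip()


sample = "A wonderful film!<br /><br />Visit http://example.com for the 2005 cut :)"
cleaned = clean_text(sample)
print(cleaned)  # a wonderful film! visit for the cut
```

Replacing tags with an empty string would turn `film!<br /><br />Visit` into `film!Visit`, which is why the sketch substitutes a space and relies on the whitespace collapse at the end.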

In [None]:
df_clean['clean1'] = df_clean['review'].apply(remove_html)
df_clean['clean1'] = df_clean['clean1'].apply(remove_URL)
df_clean['construction'] = df_clean['clean1'].apply(contractions.fix)
df_clean['clean2'] = df_clean['construction'].apply(case_folding)
df_clean['clean2'] = df_clean['clean2'].apply(remove_emoji)
df_clean['clean2'] = df_clean['clean2'].apply(remove_numbers)
df_clean['clean2'] = df_clean['clean2'].apply(remove_symbols)
df_clean['clean2'] = df_clean['clean2'].apply(remove_duplspaces)
df_clean

Unnamed: 0,review,sentiment,clean1,construction,clean2
0,One of the other reviewers has mentioned that ...,positive,One of the other reviewers has mentioned that ...,One of the other reviewers has mentioned that ...,one of the other reviewers has mentioned that ...
1,A wonderful little production. <br /><br />The...,positive,A wonderful little production. The filming tec...,A wonderful little production. The filming tec...,a wonderful little production. the filming tec...
2,I thought this was a wonderful way to spend ti...,positive,I thought this was a wonderful way to spend ti...,I thought this was a wonderful way to spend ti...,i thought this was a wonderful way to spend ti...
3,Basically there's a family where a little boy ...,negative,Basically there's a family where a little boy ...,Basically there is a family where a little boy...,basically there is a family where a little boy...
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"Petter Mattei's ""Love in the Time of Money"" is...","Petter Mattei's ""Love in the Time of Money"" is...",petter matteis love in the time of money is a ...
...,...,...,...,...,...
49995,I thought this movie did a down right good job...,positive,I thought this movie did a down right good job...,I thought this movie did a down right good job...,i thought this movie did a down right good job...
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative,"Bad plot, bad dialogue, bad acting, idiotic di...","Bad plot, bad dialogue, bad acting, idiotic di...","bad plot, bad dialogue, bad acting, idiotic di..."
49997,I am a Catholic taught in parochial elementary...,negative,I am a Catholic taught in parochial elementary...,I am a Catholic taught in parochial elementary...,i am a catholic taught in parochial elementary...
49998,I'm going to have to disagree with the previou...,negative,I'm going to have to disagree with the previou...,I am going to have to disagree with the previo...,i am going to have to disagree with the previo...


# Combining and saving the cleaned dataset

In [None]:
df_cleaned = df_clean[['clean2', 'sentiment']]

In [None]:
df_cleaned = df_cleaned.rename(columns={'clean2': 'cleaned_review'})


In [None]:
df_cleaned

Unnamed: 0,cleaned_review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production. the filming tec...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically there is a family where a little boy...,negative
4,petter matteis love in the time of money is a ...,positive
...,...,...
49995,i thought this movie did a down right good job...,positive
49996,"bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,i am a catholic taught in parochial elementary...,negative
49998,i am going to have to disagree with the previo...,negative


In [None]:
df_cleaned.to_csv('IMDB_cleaned.csv', index=False)

### Handling slang (not needed)

The slang dictionary is taken from Kaggle: https://www.kaggle.com/code/nmaguette/up-to-date-list-of-slangs-for-text-preprocessing

In [None]:
# Download the slang dictionary from Google Drive
url = 'https://drive.google.com/uc?export=download&id=193JeGNX9VkrLgItun5LnEQO02KLSO0Uy'

response = requests.get(url)
slang_text = response.text

In [None]:
# Parse the text into a slang dictionary
slang_dict = {}
for line in slang_text.splitlines():
    if ':' in line:
        key, value = line.split(':', 1)
        slang_dict[key.strip().lower()] = value.strip().lower()

In [None]:
def replace_slang(text, slang_dict):
    tokens = word_tokenize(text)
    new_tokens = []
    for token in tokens:
        # No need to lowercase here because case folding was already applied
        new_token = slang_dict.get(token, token)
        new_tokens.append(new_token)
    return TreebankWordDetokenizer().detokenize(new_tokens)
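The parsing and lookup logic above can be tried on a toy dictionary. This is a simplified sketch: the entries in `raw` are hypothetical examples rather than lines from the Kaggle list, the helper names are made up for illustration, and plain whitespace splitting stands in for `word_tokenize` so the example runs without NLTK data.

```python
def parse_slang(raw_text):
    """Parse 'key: value' lines into a lowercase lookup dictionary."""
    slang = {}
    for line in raw_text.splitlines():
        if ':' in line:
            key, value = line.split(':', 1)
            slang[key.strip().lower()] = value.strip().lower()
    return slang


def replace_slang_simple(text, slang):
    """Replace each whitespace-separated token that has a dictionary entry."""
    return ' '.join(slang.get(token, token) for token in text.split())


# Hypothetical entries; lines without a ':' separator are skipped.
raw = "IMO: in my opinion\nBTW: by the way\nthis line has no separator"
toy_slang = parse_slang(raw)

replaced = replace_slang_simple("imo the plot was weak btw", toy_slang)
print(replaced)  # in my opinion the plot was weak by the way
```

Whitespace splitting would miss a token such as `btw,` because the comma stays attached, which is one reason a proper tokenizer like `word_tokenize` is used in the notebook.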

In [None]:
tqdm.pandas(desc='Handling slang')
df_clean['slang'] = df_clean['clean2'].progress_apply(lambda x: replace_slang(x, slang_dict))
df_clean

Handling slang: 100%|██████████| 49582/49582 [01:58<00:00, 419.01it/s]


Unnamed: 0,review,sentiment,clean1,construction,clean2,slang
0,One of the other reviewers has mentioned that ...,positive,One of the other reviewers has mentioned that ...,One of the other reviewers has mentioned that ...,one of the other reviewers has mentioned that ...,one of the other reviewers has mentioned that ...
1,A wonderful little production. <br /><br />The...,positive,A wonderful little production. The filming tec...,A wonderful little production. The filming tec...,a wonderful little production. the filming tec...,a wonderful little production . the filming te...
2,I thought this was a wonderful way to spend ti...,positive,I thought this was a wonderful way to spend ti...,I thought this was a wonderful way to spend ti...,i thought this was a wonderful way to spend ti...,i thought this was a wonderful way to spend ti...
3,Basically there's a family where a little boy ...,negative,Basically there's a family where a little boy ...,Basically there is a family where a little boy...,basically there is a family where a little boy...,basically there is a family where a little boy...
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"Petter Mattei's ""Love in the Time of Money"" is...","Petter Mattei's ""Love in the Time of Money"" is...",petter matteis love in the time of money is a ...,petter matteis love in the time of money is a ...
...,...,...,...,...,...,...
49995,I thought this movie did a down right good job...,positive,I thought this movie did a down right good job...,I thought this movie did a down right good job...,i thought this movie did a down right good job...,i thought this movie did a down right good job...
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative,"Bad plot, bad dialogue, bad acting, idiotic di...","Bad plot, bad dialogue, bad acting, idiotic di...","bad plot, bad dialogue, bad acting, idiotic di...","bad plot, bad dialogue, bad acting, idiotic di..."
49997,I am a Catholic taught in parochial elementary...,negative,I am a Catholic taught in parochial elementary...,I am a Catholic taught in parochial elementary...,i am a catholic taught in parochial elementary...,i am a catholic taught in parochial elementary...
49998,I'm going to have to disagree with the previou...,negative,I'm going to have to disagree with the previou...,I am going to have to disagree with the previo...,i am going to have to disagree with the previo...,i am going to have to disagree with the previo...


### Handling stopwords and lemmatization (not needed)

In [None]:
nlp = spacy.load("en_core_web_sm")
tqdm.pandas(desc='Stopword Removal')

In [None]:
def stopword_removal(text):
    """
    Tokenize the text with spaCy, drop stopwords, and return the lemmas
    of the remaining tokens as a list.
    """
    # Process the text with spaCy
    doc = nlp(text)

    # Collect the lemma of every token that is not a stopword
    cleaned_tokens = [
        token.lemma_  # Take the lemma
        for token in doc
        if not token.is_stop  # Keep only non-stopword tokens
    ]

    return cleaned_tokens
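One possible reason to treat this step as optional for sentiment classification is that stopword lists often contain negations. The sketch below illustrates that effect with a hypothetical mini stopword set and whitespace tokenization; it does not use spaCy's actual list or its lemmatizer, so whether the same happens with `en_core_web_sm` is an assumption that would need checking, for example with `nlp.vocab['not'].is_stop`.

```python
# Hypothetical mini stopword set for illustration; not spaCy's actual list.
TOY_STOPWORDS = {"the", "was", "is", "a", "not", "this"}


def remove_stopwords(text, stopwords=TOY_STOPWORDS):
    """Drop tokens found in `stopwords` (whitespace tokenization, no lemmatization)."""
    return [tok for tok in text.split() if tok not in stopwords]


positive = remove_stopwords("the movie was good")
negative = remove_stopwords("the movie was not good")
print(positive)
print(negative)
```

With this toy list both sentences reduce to the same tokens, so the cue that separates the two sentiments is lost before the model sees the text.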

In [None]:
df_clean['cleaned_text'] = df_clean['slang'].progress_apply(stopword_removal)
df_clean

Stopword Removal: 100%|██████████| 49582/49582 [37:26<00:00, 22.07it/s]  


Unnamed: 0,review,sentiment,clean1,construction,clean2,slang,cleaned_text
0,One of the other reviewers has mentioned that ...,positive,One of the other reviewers has mentioned that ...,One of the other reviewers has mentioned that ...,one of the other reviewers has mentioned that ...,one of the other reviewers has mentioned that ...,"[reviewer, mention, watch, oz, episode, hook, ..."
1,A wonderful little production. <br /><br />The...,positive,A wonderful little production. The filming tec...,A wonderful little production. The filming tec...,a wonderful little production the filming tech...,a wonderful little production the filming tech...,"[wonderful, little, production, filming, techn..."
2,I thought this was a wonderful way to spend ti...,positive,I thought this was a wonderful way to spend ti...,I thought this was a wonderful way to spend ti...,i thought this was a wonderful way to spend ti...,i thought this was a wonderful way to spend ti...,"[think, wonderful, way, spend, time, hot, summ..."
3,Basically there's a family where a little boy ...,negative,Basically there's a family where a little boy ...,Basically there is a family where a little boy...,basically there is a family where a little boy...,basically there is a family where a little boy...,"[basically, family, little, boy, jake, think, ..."
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"Petter Mattei's ""Love in the Time of Money"" is...","Petter Mattei's ""Love in the Time of Money"" is...",petter matteis love in the time of money is a ...,petter matteis love in the time of money is a ...,"[petter, matteis, love, time, money, visually,..."
...,...,...,...,...,...,...,...
49995,I thought this movie did a down right good job...,positive,I thought this movie did a down right good job...,I thought this movie did a down right good job...,i thought this movie did a down right good job...,i thought this movie did a down right good job...,"[think, movie, right, good, job, creative, ori..."
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative,"Bad plot, bad dialogue, bad acting, idiotic di...","Bad plot, bad dialogue, bad acting, idiotic di...",bad plot bad dialogue bad acting idiotic direc...,bad plot bad dialogue bad acting idiotic direc...,"[bad, plot, bad, dialogue, bad, acting, idioti..."
49997,I am a Catholic taught in parochial elementary...,negative,I am a Catholic taught in parochial elementary...,I am a Catholic taught in parochial elementary...,i am a catholic taught in parochial elementary...,i am a catholic taught in parochial elementary...,"[catholic, teach, parochial, elementary, schoo..."
49998,I'm going to have to disagree with the previou...,negative,I'm going to have to disagree with the previou...,I am going to have to disagree with the previo...,i am going to have to disagree with the previo...,i am going to have to disagree with the previo...,"[go, disagree, previous, comment, maltin, seco..."
