# Task 2

# Aspect Extraction

Nixon Andhika / 13517059

Ferdy Santoso / 13517116

Jan Meyer Saragih / 13517131

# Data Source

https://github.com/mulhod/steam_reviews


# Task Description

Aspect extraction is an NLP task that decides whether a given word is an aspect or not. It matters most in sentiment analysis, where it shows in more detail which aspect makes a sentiment positive, negative, or neutral.

# Background

Games on **Steam** have many reviews, and each review already carries a sentiment label: recommended or not recommended. The reviews do not, however, say what specifically makes a game recommended or not. Aspect extraction fills that gap by finding the aspects that lead a review to recommend a game or not.


# Module Flow

1. The raw data is unlabeled, so aspect extraction starts by generating labeled (supervised) data from it.
2. The Jupyter notebook dataGeneration.ipynb finds the aspects in the raw reviews.
3. The notebook dataParser.ipynb then parses the data into word_before, word_now, word_after, and pos_tag columns, together with the class label.
4. The data is then ready for training.
5. For training, word_before, word_now, word_after and, if needed, pos_tag are concatenated, and TF-IDF features are computed from the result.
6. The data is split into training and test sets.
7. The models are trained with existing machine learning estimators from sklearn. This task uses LogisticRegression and SVM.
8. After training, the models are evaluated on the test set and the resulting score is reported.
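
Step 3 can be illustrated with a minimal sketch. It is not the code from dataParser.ipynb: the function name `parse_review`, the `Row` type, and the rule that a token is an aspect when its lowercase form is in the keyword set are all assumptions made for illustration. The POS tags are passed in by hand here, where a real run would take them from a POS tagger.

```python
from typing import List, NamedTuple, Set, Tuple

START, END = "[START]", "[END]"  # sentinel tokens, as they appear in steam-aspect.csv


class Row(NamedTuple):
    word_before: str
    word_now: str
    word_after: str
    pos_tag: str
    is_aspect: bool  # corresponds to the `class` column in the CSV


def parse_review(tagged: List[Tuple[str, str]], aspect_keywords: Set[str]) -> List[Row]:
    """Turn one POS-tagged review into one context-window row per token."""
    words = [word for word, _ in tagged]
    rows = []
    for i, (word, tag) in enumerate(tagged):
        before = words[i - 1] if i > 0 else START
        after = words[i + 1] if i < len(words) - 1 else END
        # Assumption: a token counts as an aspect when its lowercase form is a keyword.
        rows.append(Row(before, word, after, tag, word.lower() in aspect_keywords))
    return rows


# Hypothetical, shortened input; tags would normally come from a POS tagger.
tagged_review = [("My", "PRP$"), ("first", "JJ"), ("game", "NN"), ("on", "IN"), ("A3", "NNP")]
rows = parse_review(tagged_review, aspect_keywords={"game"})
for row in rows:
    print(row)
```

Each token becomes one training example, so the classifier sees a word together with its immediate neighbours rather than the whole review.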

# Module: Aspect Extraction

## Techniques Used

1. Preprocessing: convert unsupervised data to supervised, data parsing, POS Tagging
2. Feature Extraction: TF-IDF
3. Classification: Logistic Regression, SVM

## Data

Converting the data from unsupervised to supervised uses only the 'review' column. Parsing uses the 'review' and 'aspect_keywords' columns. Once the data is parsed, feature extraction and classification use the 'word_before', 'word_now', 'word_after', 'pos_tag', and 'class' columns. The final dataset has 5 columns and 48,000 rows.

## Experiments

### Results

We found that POS tagging affects the classification results: the models that use POS tags score higher than those that do not. Logistic Regression and SVM also give different results. More detailed results and analysis are given in the table in the Analysis section.

### Analysis

Our conclusions are as follows:
1. POS tags improve the classification score for both algorithms.
2. Logistic Regression scores higher than SVM, both with and without POS tags.
3. Logistic Regression also trains far faster: it finishes in a few seconds, while SVM takes about 30 minutes.

Experiment results:

**Model** | **LogReg without POS Tag** | **LogReg with POS Tag** | **SVM without POS Tag** | **SVM with POS Tag**
--- | --- | --- | --- | ---
Accuracy | 82.26% | 82.56% | 82.23% | 82.40%
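
The long SVM training time is plausibly tied to `SVC(kernel='linear')`, which is built on libsvm and whose fit time grows at least quadratically with the number of samples. The sketch below shows one possible alternative, not what this project used: sklearn's `LinearSVC`, trained directly on the sparse TF-IDF matrix. The toy texts and labels are invented for illustration. `LinearSVC` uses a different solver and a squared-hinge loss by default, so its scores should not be expected to match the SVM column above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy stand-ins for the concatenated context windows and their labels.
texts = [
    "first game on NN",
    "My first game JJ",
    "game on A3 IN",
    "the graphics are NNS",
    "graphics are great VBP",
    "love the story NN",
]
labels = [True, False, False, True, False, True]

vectorizer = TfidfVectorizer(binary=True, use_idf=True, max_features=256)
X = vectorizer.fit_transform(texts)  # stays a sparse matrix; no .toarray() needed

# Swapped-in estimator: liblinear-based, squared-hinge loss by default.
linear_svc = LinearSVC(C=1)
linear_svc.fit(X, labels)
predictions = linear_svc.predict(X)
print(predictions)
```

Whether this closes the speed gap on the full 48,000-row dataset would have to be measured.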

In [1]:
import numpy as np
import pandas as pd
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

import os

## Read Data

In [2]:
data = pd.read_csv('./data/steam-aspect.csv')
data.head()

Unnamed: 0,word_before,word_now,word_after,pos_tag,class
0,[START],My,first,PRP$,False
1,My,first,game,JJ,False
2,first,game,on,NN,True
3,game,on,A3,IN,False
4,on,A3,brought,NNP,False


## Combine columns

In [3]:
aspect_data = pd.DataFrame()
aspect_data_no_pos_tag = pd.DataFrame()
arr_words = []
arr_words_no_pos_tag = []

for i in range(len(data['word_now'])):
    word = ""
    if (data['word_before'][i] != '[START]'):
        word += str(data['word_before'][i])
        
    word += " " + str(data['word_now'][i])
    
    if (data['word_after'][i] != '[END]'):
        word += " " + str(data['word_after'][i])
    
    arr_words_no_pos_tag.append(word)
    
    word_pos_tag = word + " " + str(data['pos_tag'][i])
    
    arr_words.append(word_pos_tag)

aspect_data['review'] = arr_words
aspect_data['class'] = data['class'].copy()
aspect_data_no_pos_tag['review'] = arr_words_no_pos_tag
aspect_data_no_pos_tag['class'] = data['class'].copy()
aspect_data_no_pos_tag.head()

Unnamed: 0,review,class
0,My first,False
1,My first game,False
2,first game on,True
3,game on A3,False
4,on A3 brought,False
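
The loop above can also be written as a small helper. This is a sketch under assumptions, not the notebook's code: the name `combine` is hypothetical, and unlike the loop it does not emit a leading space when `word_before` is `[START]`. The vectorizer's default tokenizer ignores whitespace, so the resulting features should be unaffected.

```python
START, END = "[START]", "[END]"  # sentinel tokens used in steam-aspect.csv


def combine(word_before, word_now, word_after, pos_tag=None):
    """Join the context window into one string, skipping sentinel tokens."""
    parts = [word_before, word_now, word_after]
    tokens = [str(part) for part in parts if part not in (START, END)]
    if pos_tag is not None:
        tokens.append(str(pos_tag))  # the POS tag is appended as an extra token
    return " ".join(tokens)


no_pos = combine("[START]", "My", "first")
with_pos = combine("first", "game", "on", pos_tag="NN")
print(no_pos)    # My first
print(with_pos)  # first game on NN
```

With pandas, the same helper could be applied row by row, for example through `DataFrame.apply` with `axis=1`.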


## Train Test Split Data

In [4]:
X_train, X_test, y_train, y_test = train_test_split(aspect_data['review'], aspect_data['class'], test_size=0.33)
X_train_no_pos_tag, X_test_no_pos_tag, y_train_no_pos_tag, y_test_no_pos_tag = train_test_split(aspect_data_no_pos_tag['review'], aspect_data_no_pos_tag['class'], test_size=0.33)
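
The two calls above draw two independent random splits, so the POS-tag and no-POS-tag models are scored on different test rows, and the rows change on every run. One way to make the comparison like-for-like is to split the row indices once and reuse them. The sketch below uses toy data, and the seed `42` is an arbitrary assumption.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for aspect_data['review'], aspect_data_no_pos_tag['review'] and the labels.
with_pos = ["My first PRP$", "My first game JJ", "first game on NN",
            "game on A3 IN", "on A3 brought NNP", "A3 brought me VBD"]
without_pos = ["My first", "My first game", "first game on",
               "game on A3", "on A3 brought", "A3 brought me"]
labels = [False, False, True, False, False, False]

# Split the row indices once, with a fixed seed, then reuse them for both variants.
indices = list(range(len(labels)))
train_idx, test_idx = train_test_split(indices, test_size=0.33, random_state=42)

X_train = [with_pos[i] for i in train_idx]
X_test = [with_pos[i] for i in test_idx]
X_train_no_pos_tag = [without_pos[i] for i in train_idx]
X_test_no_pos_tag = [without_pos[i] for i in test_idx]
y_train = [labels[i] for i in train_idx]
y_test = [labels[i] for i in test_idx]

print(train_idx, test_idx)
```

With a shared split, any difference between the two scores comes from the features rather than from which rows landed in the test set.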

## Feature Extraction

### 1. With Pos Tag

In [5]:
tfidf = TfidfVectorizer(binary=True, use_idf = True, max_features=256)
tfidf = tfidf.fit(X_train)

X_train_tfidf = pd.DataFrame(tfidf.transform(X_train).toarray(), columns=[tfidf.get_feature_names()])
X_test_tfidf = pd.DataFrame(tfidf.transform(X_test).toarray(), columns=[tfidf.get_feature_names()])

X_train_tfidf

Unnamed: 0,10,able,about,actually,after,again,ai,all,almost,alpha,...,why,will,with,work,worth,would,wp,wrb,you,your
0,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000
1,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000
2,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000
3,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000
4,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000
5,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000
6,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000
7,0.0,0.0,0.650489,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000
8,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000
9,0.0,0.0,0.000000,0.000000,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.893425,0.0,0.0,0.000000,0.000000
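
To make the numbers in this matrix less opaque, here is a hand-rolled sketch of what `TfidfVectorizer(binary=True, use_idf=True)` computes with its default settings: a 0/1 term presence, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The three documents are toy examples, and the `max_features=256` vocabulary cut is left out.

```python
import math
import re
from collections import Counter

docs = ["first game on NN", "game on A3 IN", "My first game JJ"]


def tokenize(text):
    # Mirrors the vectorizer's defaults: lowercase, tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())


tokenized = [set(tokenize(doc)) for doc in docs]  # sets: binary=True ignores repeats
n_docs = len(docs)
doc_freq = Counter(term for terms in tokenized for term in terms)


def tfidf_row(terms):
    # Smoothed idf per term, followed by L2 normalisation of the row.
    weights = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in terms}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}


rows = [tfidf_row(terms) for terms in tokenized]
for doc, row in zip(docs, rows):
    print(doc, "->", {t: round(w, 3) for t, w in sorted(row.items())})
```

A term that occurs in every document, such as `game` here, gets the lowest weight in its row, while rarer tokens such as a POS tag get more.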


### 2. No Pos Tag

In [6]:
tfidf_no_pos_tag = TfidfVectorizer(binary=True, use_idf = True, max_features=256)
tfidf_no_pos_tag = tfidf_no_pos_tag.fit(X_train_no_pos_tag)

X_train_tfidf_no_pos_tag = pd.DataFrame(tfidf_no_pos_tag.transform(X_train_no_pos_tag).toarray(), columns=[tfidf_no_pos_tag.get_feature_names()])
X_test_tfidf_no_pos_tag = pd.DataFrame(tfidf_no_pos_tag.transform(X_test_no_pos_tag).toarray(), columns=[tfidf_no_pos_tag.get_feature_names()])

X_train_tfidf_no_pos_tag

Unnamed: 0,10,20,able,about,actually,add,after,again,ai,all,...,with,work,world,worth,would,years,yes,you,your,yourself
0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0
1,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.706919,0.000000,...,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0
2,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0
3,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0
4,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0
5,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0
6,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0
7,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0
8,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0
9,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.000000,...,0.0,0.000000,0.0,0.0,0.000000,0.0,0.0,0.000000,0.000000,0.0


## Classification

### 1.a. Logistic Regression With Pos Tag

In [7]:
lg = LogisticRegression(C=1000, solver='liblinear')

In [8]:
lg.fit(X_train_tfidf, y_train)

LogisticRegression(C=1000, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [9]:
lg.score(X_test_tfidf, y_test)

0.7896892416821483

### 1.b. Logistic Regression Without Pos Tag

In [10]:
lg_no_pos_tag = LogisticRegression(C=1000, solver='liblinear')

In [11]:
lg_no_pos_tag.fit(X_train_tfidf_no_pos_tag, y_train_no_pos_tag)

LogisticRegression(C=1000, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [12]:
lg_no_pos_tag.score(X_test_tfidf_no_pos_tag, y_test_no_pos_tag)

0.7720401959128526

### 2.a. SVM With Pos Tag

In [13]:
svc = SVC(C=1, kernel='linear')

In [14]:
svc.fit(X_train_tfidf, y_train)

SVC(C=1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [15]:
svc.score(X_test_tfidf, y_test)

0.777402465799696

### 2.b. SVM Without Pos Tag

In [16]:
svc_no_pos_tag = SVC(C=1, kernel='linear')

In [17]:
svc_no_pos_tag.fit(X_train_tfidf_no_pos_tag, y_train_no_pos_tag)

SVC(C=1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='linear',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [18]:
svc_no_pos_tag.score(X_test_tfidf_no_pos_tag, y_test_no_pos_tag)

0.7719135281202499
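
For sklearn classifiers, `score` returns the mean accuracy on the given test data. The plain-Python sketch below shows that computation and why it helps to read the number next to the class balance. The labels are invented, and the balance of the real dataset is not stated here.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [False, False, True, False, True]
always_false = [False] * len(y_true)  # a model that never predicts an aspect

baseline = accuracy(y_true, always_false)
print(baseline)  # 0.6
```

If most tokens are not aspects, a model that always answers `False` already scores well, so comparing against that baseline shows how much the classifier actually learned.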

## Save Model

In [19]:
with open("./model/aspect_lg.pkl", "wb") as f:  # closes the file handle after writing
    pickle.dump(lg, f)

In [20]:
with open("./model/aspect_lg_no_pos_tag.pkl", "wb") as f:  # closes the file handle after writing
    pickle.dump(lg_no_pos_tag, f)

In [21]:
with open("./model/aspect_svc.pkl", "wb") as f:  # closes the file handle after writing
    pickle.dump(svc, f)

In [22]:
with open("./model/aspect_svc_no_pos_tag.pkl", "wb") as f:  # closes the file handle after writing
    pickle.dump(svc_no_pos_tag, f)
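
The pickles above contain only the classifiers. To predict on new text, the fitted `TfidfVectorizer` is needed as well. One option, sketched here with toy data, is to bundle both steps in an sklearn pipeline and pickle that single object. The file name `aspect_pipeline.pkl` and the temporary directory are hypothetical and not part of this project.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the concatenated context windows and their labels.
texts = ["first game on NN", "My first game JJ", "game on A3 IN", "love the story NN"]
labels = [True, False, False, True]

# Vectorizer and classifier travel together, using the notebook's settings.
model = make_pipeline(
    TfidfVectorizer(binary=True, use_idf=True, max_features=256),
    LogisticRegression(C=1000, solver="liblinear"),
)
model.fit(texts, labels)

path = os.path.join(tempfile.mkdtemp(), "aspect_pipeline.pkl")  # hypothetical location
with open(path, "wb") as f:
    pickle.dump(model, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored pipeline accepts raw strings, not a TF-IDF matrix.
print(restored.predict(["first game on NN"]))
```

A pickle should be loaded with the same sklearn version that wrote it, and only from a trusted source.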