# Task 2

# Aspect Extraction

Nixon Andhika / 13517059

Ferdy Santoso / 13517116

Jan Meyer Saragih / 13517131

# Data Source

https://github.com/mulhod/steam_reviews


# Task Description

Aspect extraction is an NLP task that decides, for each word, whether it is an aspect term or not. It matters most in sentiment analysis, where it reveals which aspects make the sentiment positive, negative, or neutral.

# Background

**Steam** games have many reviews, and each review already carries a sentiment label in the form of "recommended" or "not recommended". The reviews do not, however, say what specifically led to that recommendation. Aspect extraction is used here to find the aspects that make a game review recommend or not recommend the game.


# Module Flow

1. The data starts out unlabeled, so aspect extraction begins by generating labeled (supervised) data.
2. The Jupyter notebook dataGeneration.ipynb extracts the aspects from the raw reviews.
3. The notebook dataParser.ipynb then parses the data into word_before, word_now, word_after, and pos_tag columns, together with the class.
4. At this point the data is ready for training.
5. For training, word_before, word_now, word_after, and (optionally) pos_tag are concatenated into one string, and its TF-IDF vector is computed.
6. The data is split into training and test sets.
7. Models available in sklearn are trained on the training set. This task uses LogisticRegression and SVM.
8. The trained models are evaluated on the test set, which yields the accuracy scores.
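
Steps 5 to 8 can be sketched end to end on a handful of rows. The sketch below is an illustration under stated assumptions, not the notebook's own code: the toy rows and their labels are invented, and `make_pipeline` is used only to keep the example short. The vectorizer and classifier settings mirror the ones used later in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy rows: "word_before word_now word_after pos_tag" strings with a
# hand-made label for word_now (1 = aspect, 0 = not an aspect).
# These rows are illustrative and do not come from the Steam dataset.
rows = [
    ("great graphics and NOUN", 1),
    ("the graphics are NOUN", 1),
    ("bad controls on NOUN", 1),
    ("the controls feel NOUN", 1),
    ("good story but NOUN", 1),
    ("the story is NOUN", 1),
    ("graphics are great VERB", 0),
    ("controls feel bad VERB", 0),
    ("story is good VERB", 0),
    ("are great and ADJ", 0),
    ("feel bad on ADJ", 0),
    ("is good but ADJ", 0),
]
texts = [text for text, _ in rows]
labels = [label for _, label in rows]

# Step 6: split into training and test sets (seeded so the run is repeatable).
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.33, random_state=0, stratify=labels
)

# Steps 5 and 7: TF-IDF features followed by a linear classifier.
model = make_pipeline(
    TfidfVectorizer(binary=True, use_idf=True, max_features=256),
    LogisticRegression(C=1000, solver="liblinear"),
)
model.fit(X_train, y_train)

# Step 8: mean accuracy on the held-out rows.
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Because the POS tag is appended as an ordinary word, the vectorizer lowercases it and treats it as one more vocabulary term, which is why tags such as `adj` and `adv` show up as feature columns.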

# Module: Aspect Extraction

## Techniques Used

1. Preprocessing: converting unlabeled data to labeled data, data parsing, POS tagging
2. Feature Extraction: TF-IDF
3. Classification: Logistic Regression, SVM

## Data

The conversion from unlabeled to labeled data uses only the 'review' column. Data parsing uses the 'review' and 'aspect_keywords' columns. After parsing, feature extraction and classification use the 'word_before', 'word_now', 'word_after', 'pos_tag', and 'class' columns. The final dataset has 5 columns and 48000 rows.

## Experiments

### Results

POS tagging affects the classification results: in the runs reported in the table in the analysis section, models trained with the POS tag score slightly higher than models trained without it. Logistic Regression and SVM also score differently. The analysis section gives the details.

### Analysis

Our conclusions are:
1. The POS tag raises the classification score for both algorithms, although the gain is small (0.30 percentage points for Logistic Regression and 0.17 for SVM).
2. Logistic Regression scores slightly higher than SVM and trains far faster: a few seconds, compared with about 30 minutes for SVM.
3. The train/test splits are drawn without a fixed seed, so exact scores vary slightly from run to run.

Experiment results:

**Model** | **LogReg without POS tag** | **LogReg with POS tag** | **SVM without POS tag** | **SVM with POS tag**
--- | --- | --- | --- | ---
Accuracy | 82.26% | 82.56% | 82.23% | 82.40%

# Imports

In [1]:
import numpy as np
import pandas as pd
import pickle
import spacy
from tqdm import tqdm

import nltk
from posTagger import posTagger
nltk.download('punkt')
nltk.download('maxent_treebank_pos_tagger')

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

import re
import os

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Meyjan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package maxent_treebank_pos_tagger to
[nltk_data]     C:\Users\Meyjan\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_treebank_pos_tagger is already up-to-
[nltk_data]       date!


# Data Generation

Data generation is the first stage of aspect extraction. It is needed because most available data is unlabeled. In this stage the aspects of each review are extracted so that supervised training becomes possible later. The next stage is the data parser.

## Read datasets

In [2]:
toy_rev = pd.read_csv('./data/data.csv')

## SpaCy Dependency Parser

In [3]:
# The default en_core_web_sm pipeline already includes the tagger, parser, and NER
nlp = spacy.load('en_core_web_sm')

## SpaCy displacy to show dependency parser

In [4]:
txt = toy_rev['review'][0]
doc = nlp(txt)
spacy.displacy.render(doc,style='dep',jupyter=True)

## Aspect extraction algorithm
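
The heuristic in this notebook treats a noun as an aspect candidate when an adjective modifies it (dependency label `amod`) and joins `compound` modifiers onto the noun phrase. The sketch below illustrates only that core rule on hand-annotated tokens instead of real spaCy output. The `Tok` class, the `aspect_candidates` function, and the example annotations are assumptions made for illustration, and the adverb, negation, and competitor handling of the full cell is left out.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    text: str
    pos: str   # coarse POS tag, e.g. "NOUN"
    dep: str   # dependency label, e.g. "amod"
    head: int  # index of the syntactic head within the sentence


def aspect_candidates(tokens):
    """Return (opinion phrase, aspect word) pairs for adjective-modified nouns."""
    pairs = []
    for i, tok in enumerate(tokens):
        if tok.pos != "NOUN":
            continue
        # Children of this noun that appear to its left, in sentence order.
        lefts = [t for j, t in enumerate(tokens) if t.head == i and j < i]
        compounds = [t.text for t in lefts if t.dep == "compound"]
        noun_phrase = " ".join(compounds + [tok.text])
        for child in lefts:
            if child.dep == "amod" and child.pos == "ADJ":
                # The aspect keyword is the head noun, i.e. the last word.
                pairs.append((f"{child.text} {noun_phrase}", tok.text))
    return pairs


# "The game has amazing voice acting", annotated by hand.
sentence = [
    Tok("The", "DET", "det", 1),
    Tok("game", "NOUN", "nsubj", 2),
    Tok("has", "VERB", "ROOT", 2),
    Tok("amazing", "ADJ", "amod", 5),
    Tok("voice", "NOUN", "compound", 5),
    Tok("acting", "NOUN", "dobj", 5),
]
print(aspect_candidates(sentence))
# [('amazing voice acting', 'acting')]
```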

## Competitors List

In [5]:
competitors = ['epic games', 'origin', 'gog', 'humble bundle']

In [6]:
aspect_terms = []
comp_terms = []
easpect_terms = []
ecomp_terms = []
enemy = []
for x in tqdm(range(len(toy_rev['review']))):
    amod_pairs = []
    advmod_pairs = []
    compound_pairs = []
    xcomp_pairs = []
    neg_pairs = []
    eamod_pairs = []
    eadvmod_pairs = []
    ecompound_pairs = []
    eneg_pairs = []
    excomp_pairs = []
    enemlist = []
    if len(str(toy_rev['review'][x])) != 0:
        lines = str(toy_rev['review'][x]).replace('*',' ').replace('-',' ').replace('so ',' ').replace('be ',' ').replace('are ',' ').replace('just ',' ').replace('get ','').replace('were ',' ').replace('When ','').replace('when ','').replace('again ',' ').replace('where ','').replace('how ',' ').replace('has ',' ').replace('Here ',' ').replace('here ',' ').replace('now ',' ').replace('see ',' ').replace('why ',' ').split('.')       
        for line in lines:
            enem_list = []
            for eny in competitors:
                enem = re.search(eny,line)
                if enem is not None:
                    enem_list.append(enem.group())
            if len(enem_list)==0:
                doc = nlp(line)
                str1=''
                str2=''
                for token in doc:
                    # Nouns: collect compound nouns and adjective-modified nouns
                    if token.pos_ == 'NOUN':
                        for j in token.lefts:
                            if j.dep_ == 'compound':
                                compound_pairs.append((j.text+' '+token.text,token.text))
                            if j.dep_ == 'amod' and j.pos_ == 'ADJ': #primary condition
                                str1 = j.text+' '+token.text
                                amod_pairs.append(j.text+' '+token.text)
                                for k in j.lefts:
                                    if k.dep_ == 'advmod': #secondary condition to get adjective of adjectives
                                        str2 = k.text+' '+j.text+' '+token.text
                                        amod_pairs.append(k.text+' '+j.text+' '+token.text)
                                mtch = re.search(re.escape(str1),re.escape(str2))
                                if mtch is not None:
                                    amod_pairs.remove(str1)
                    # Verbs: collect adverb and negation modifiers
                    if token.pos_ == 'VERB':
                        for j in token.lefts:
                            if j.dep_ == 'advmod' and j.pos_ == 'ADV':
                                advmod_pairs.append(j.text+' '+token.text)
                            if j.dep_ == 'neg' and j.pos_ == 'ADV':
                                neg_pairs.append(j.text+' '+token.text)
                        for j in token.rights:
                            if j.dep_ == 'advmod' and j.pos_ == 'ADV':
                                advmod_pairs.append(token.text+' '+j.text)
                    # Adjectives: collect open clausal complements, negated or not
                    if token.pos_ == 'ADJ':
                        for j,h in zip(token.rights,token.lefts):
                            if j.dep_ == 'xcomp' and h.dep_ != 'neg':
                                for k in j.lefts:
                                    if k.dep_ == 'aux':
                                        xcomp_pairs.append(token.text+' '+k.text+' '+j.text)
                            elif j.dep_ == 'xcomp' and h.dep_ == 'neg':
                                for k in j.lefts: # iterate here so that k is defined in this branch
                                    if k.dep_ == 'aux':
                                        neg_pairs.append(h.text +' '+token.text+' '+k.text+' '+j.text)
            
            else:
                enemlist.append(enem_list)
                doc = nlp(line)
                str1=''
                str2=''
                for token in doc:
                    # Same rules as above, collected separately for sentences that mention a competitor
                    if token.pos_ == 'NOUN':
                        for j in token.lefts:
                            if j.dep_ == 'compound':
                                ecompound_pairs.append((j.text+' '+token.text,token.text))
                            if j.dep_ == 'amod' and j.pos_ == 'ADJ': #primary condition
                                str1 = j.text+' '+token.text
                                eamod_pairs.append(j.text+' '+token.text)
                                for k in j.lefts:
                                    if k.dep_ == 'advmod': #secondary condition to get adjective of adjectives
                                        str2 = k.text+' '+j.text+' '+token.text
                                        eamod_pairs.append(k.text+' '+j.text+' '+token.text)
                                mtch = re.search(re.escape(str1),re.escape(str2))
                                if mtch is not None:
                                    eamod_pairs.remove(str1)
                    if token.pos_ == 'VERB':
                        for j in token.lefts:
                            if j.dep_ == 'advmod' and j.pos_ == 'ADV':
                                eadvmod_pairs.append(j.text+' '+token.text)
                            if j.dep_ == 'neg' and j.pos_ == 'ADV':
                                eneg_pairs.append(j.text+' '+token.text)
                        for j in token.rights:
                            if j.dep_ == 'advmod' and j.pos_ == 'ADV':
                                eadvmod_pairs.append(token.text+' '+j.text)
                    if token.pos_ == 'ADJ':
                        for j in token.rights:
                            if j.dep_ == 'xcomp':
                                for k in j.lefts:
                                    if k.dep_ == 'aux':
                                        excomp_pairs.append(token.text+' '+k.text+' '+j.text)
        pairs = list(set(amod_pairs+advmod_pairs+neg_pairs+xcomp_pairs))
        epairs = list(set(eamod_pairs+eadvmod_pairs+eneg_pairs+excomp_pairs))
        for i in range(len(pairs)):
            if len(compound_pairs)!=0:
                for comp in compound_pairs:
                    mtch = re.search(re.escape(comp[1]),re.escape(pairs[i]))
                    if mtch is not None:
                        pairs[i] = pairs[i].replace(mtch.group(),comp[0])
        for i in range(len(epairs)):
            if len(ecompound_pairs)!=0:
                for comp in ecompound_pairs:
                    mtch = re.search(re.escape(comp[1]),re.escape(epairs[i]))
                    if mtch is not None:
                        epairs[i] = epairs[i].replace(mtch.group(),comp[0])
    aspect_pairs = []
    for i in range(len(pairs)):
        words = pairs[i].split()
        aspect_pairs.append(words[-1])
    
    aspect_terms.append(aspect_pairs)
    comp_terms.append(compound_pairs)
    easpect_terms.append(epairs)
    ecomp_terms.append(ecompound_pairs)
    enemy.append(enemlist)
toy_rev['compound_nouns'] = comp_terms
toy_rev['aspect_keywords'] = aspect_terms
toy_rev['competition'] = enemy
toy_rev['competition_comp_nouns'] = ecomp_terms
toy_rev['competition_aspects'] = easpect_terms
toy_rev.head()

100%|████████████████████████████████████████████████████████████████████████████| 79437/79437 [24:59<00:00, 52.97it/s]


Unnamed: 0,total_game_hours_last_two_weeks,num_groups,orig_url,num_badges,review_url,num_found_funny,review,date_updated,num_workshop_items,date_posted,...,num_friends,num_screenshots,num_comments,num_reviews,num_games_owned,compound_nouns,aspect_keywords,competition,competition_comp_nouns,competition_aspects
0,58.7,,http://steamcommunity.com/app/107410/homeconte...,,http://steamcommunity.com/id/thisisthefallout/...,1,My first game on A3 brought me the most horrif...,"May 3, 2015, 2:28AM",0,"Oct 31, 2014, 7:00AM",...,,0,70,0,0,"[(dump east, east), (enemy forces, forces), (L...","[skills, had, dump, times, here, skills, side,...",[],[],[]
1,2.8,5.0,http://steamcommunity.com/app/107410/homeconte...,9.0,http://steamcommunity.com/id/PeaceFaker/recomm...,1,This is not a game for people who want fast ac...,"Sep 22, 2014, 9:17AM",0,"May 17, 2014, 11:01AM",...,55.0,33,90,7,272,"[(action style, style), (movie style, style), ...","[run, portion, pace, back, measure, detonated,...",[],[],[]
2,38.2,14.0,http://steamcommunity.com/app/107410/homeconte...,36.0,http://steamcommunity.com/id/TheDanius/recomme...,1,Oh man. Where to even begin with this one. It ...,"Sep 30, 2014, 12:43PM",0,"Sep 30, 2014, 8:55AM",...,191.0,371,51,12,62,"[(gaming platform, platform), (job packaging, ...","[call, distance, ways, feedback, acts, pace, h...",[],[],[]
3,25.2,3.0,http://steamcommunity.com/app/107410/homeconte...,10.0,http://steamcommunity.com/id/ArtificialApple/r...,1,This is quite possibly the most emotional shoo...,,0,"Jan 11, 2015, 7:21PM",...,54.0,35,27,4,47,"[(story time, time), (anti air, air), (air bat...","[destroyed, shooter, die, played, squad, cover...",[],[],[]
4,0.1,67.0,http://steamcommunity.com/app/107410/homeconte...,17.0,http://steamcommunity.com/profiles/76561198058...,1,"If you have friends, this is a great game to p...",,4,"Oct 18, 2014, 11:58AM",...,139.0,328,77,6,180,[],"[alone, game]",[],[],[]


In [7]:
toy_rev.to_csv('./data/data-aspect.csv')

# Data Parser for Aspect Extraction

The second stage is data parsing: each review string is split into per-word rows with the columns word_before, word_now, word_after, pos_tag, and class. Parsing must be done before training or testing. It uses the posTagger program written by our group member Nixon Andhika.
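
The row layout can be sketched as a small function over one POS-tagged review. This is a minimal sketch, assuming the tagger returns a list of `(word, tag)` pairs per review; the `context_rows` name and the example tokens are hypothetical.

```python
def context_rows(tagged_tokens, aspect_words):
    """Turn one POS-tagged review into per-token rows with a context window."""
    rows = []
    last = len(tagged_tokens) - 1
    for j, (word, tag) in enumerate(tagged_tokens):
        rows.append({
            # Sentinel markers stand in for the missing neighbours at the edges.
            "word_before": "[START]" if j == 0 else tagged_tokens[j - 1][0],
            "word_now": word,
            "word_after": "[END]" if j == last else tagged_tokens[j + 1][0],
            "pos_tag": tag,
            "class": word in aspect_words,
        })
    return rows


tagged = [("My", "DET"), ("first", "ADJ"), ("game", "NOUN")]
for row in context_rows(tagged, aspect_words={"game"}):
    print(row)
```

Each token becomes one training example, so a review of n words contributes n rows.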

## Read data

In [8]:
data = pd.read_csv('./data/data-aspect.csv')

## Create Pos Tagger Class

In [9]:
pos_tagger = posTagger()

## Parse Data

In [10]:
def split_string_to_list(string_list):
    splitted_strings = []
    for string_comp in string_list:
        splitted_strings.append(string_comp.split())
    return splitted_strings

In [11]:
aspect_data = pd.DataFrame()
arr_word_before = []
arr_word_now = []
arr_word_after = []
arr_pos_tag = []
arr_class = []
string_list = split_string_to_list(data['review'])
pos_tagged_data = pos_tagger.pos_tag(string_list)

for i in range(500):
    aspect = data['aspect_keywords'][i]
    
    for j in range(len(pos_tagged_data[i])):
        if (j == 0):
            arr_word_before.append('[START]')
        else:
            arr_word_before.append(pos_tagged_data[i][j-1][0])
        
        arr_word_now.append(pos_tagged_data[i][j][0])
        
        arr_pos_tag.append(pos_tagged_data[i][j][1])
        
        if (j == (len(pos_tagged_data[i]) - 1)):
            arr_word_after.append('[END]')
        else:
            arr_word_after.append(pos_tagged_data[i][j+1][0])
        
        if (pos_tagged_data[i][j][0] in aspect):
            arr_class.append('true')
        else:
            arr_class.append('false')

aspect_data['word_before'] = arr_word_before
aspect_data['word_now'] = arr_word_now
aspect_data['word_after'] = arr_word_after
aspect_data['pos_tag'] = arr_pos_tag
aspect_data['class'] = arr_class
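
One detail of the labeling step is worth illustrating. A list column that has been written to CSV and read back arrives as a single string, so `word in aspect` becomes a substring test rather than a list membership test. The sketch below shows the difference on an invented value and one possible way to recover exact matching with `ast.literal_eval`; it is a suggestion, not what the cell above does.

```python
import ast

# How a list column looks after a CSV round trip: one string.
raw = "['skills', 'game', 'pace']"

# Substring test on the raw string: fragments of other words also match.
print("ace" in raw)        # True, although "ace" is not an aspect keyword

# Parse the string back into a list, then test exact membership.
keywords = set(ast.literal_eval(raw))
print("ace" in keywords)   # False
print("game" in keywords)  # True
```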

In [12]:
aspect_data.head()

Unnamed: 0,word_before,word_now,word_after,pos_tag,class
0,[START],My,first,DET,False
1,My,first,game,ADJ,False
2,first,game,on,NOUN,True
3,game,on,A3,ADP,False
4,on,A3,brought,NOUN,False


In [13]:
aspect_data.to_csv('./data/steam-aspect.csv', index=False)

## Read Data

In [14]:
data = pd.read_csv('./data/steam-aspect.csv')
data.head()

Unnamed: 0,word_before,word_now,word_after,pos_tag,class
0,[START],My,first,DET,False
1,My,first,game,ADJ,False
2,first,game,on,NOUN,True
3,game,on,A3,ADP,False
4,on,A3,brought,NOUN,False


## Combine columns

In [15]:
aspect_data = pd.DataFrame()
aspect_data_no_pos_tag = pd.DataFrame()
arr_words = []
arr_words_no_pos_tag = []

for i in range(len(data['word_now'])):
    word = ""
    if (data['word_before'][i] != '[START]'):
        word += str(data['word_before'][i])
        
    word += " " + str(data['word_now'][i])
    
    if (data['word_after'][i] != '[END]'):
        word += " " + str(data['word_after'][i])
    
    arr_words_no_pos_tag.append(word)
    
    word_pos_tag = word + " " + str(data['pos_tag'][i])
    
    arr_words.append(word_pos_tag)

aspect_data['review'] = arr_words
aspect_data['class'] = data['class'].copy()
aspect_data_no_pos_tag['review'] = arr_words_no_pos_tag
aspect_data_no_pos_tag['class'] = data['class'].copy()
aspect_data_no_pos_tag.head()

Unnamed: 0,review,class
0,My first,False
1,My first game,False
2,first game on,True
3,game on A3,False
4,on A3 brought,False


## Train Test Split Data

In [16]:
X_train, X_test, y_train, y_test = train_test_split(aspect_data['review'], aspect_data['class'], test_size=0.33)
X_train_no_pos_tag, X_test_no_pos_tag, y_train_no_pos_tag, y_test_no_pos_tag = train_test_split(aspect_data_no_pos_tag['review'], aspect_data_no_pos_tag['class'], test_size=0.33)
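
The two calls above draw two independent random splits, so the with-POS and without-POS models are evaluated on different test rows. A paired comparison needs the same rows in both test sets. The sketch below shows the idea with a single seeded shuffle of row indices; the `paired_split` helper and the seed value are assumptions for illustration.

```python
import random


def paired_split(n_rows, test_size=0.33, seed=42):
    """Return (train_idx, test_idx) from one seeded shuffle of the row indices."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


with_pos = ["My first DET", "My first game ADJ", "first game on NOUN", "game on A3 ADP"]
without_pos = ["My first", "My first game", "first game on", "game on A3"]

# The same indices select the test rows of both feature variants.
train_idx, test_idx = paired_split(len(with_pos))
X_test = [with_pos[i] for i in test_idx]
X_test_no_pos_tag = [without_pos[i] for i in test_idx]
print(list(zip(X_test, X_test_no_pos_tag)))
```

With scikit-learn the same effect can be had by passing both feature columns and the labels to a single `train_test_split` call with a fixed `random_state`.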

## Feature Extraction

### 1. With Pos Tag

In [17]:
tfidf = TfidfVectorizer(binary=True, use_idf = True, max_features=256)
tfidf = tfidf.fit(X_train)

# One column per vocabulary term (get_feature_names_out() requires scikit-learn >= 1.0)
X_train_tfidf = pd.DataFrame(tfidf.transform(X_train).toarray(), columns=tfidf.get_feature_names_out())
X_test_tfidf = pd.DataFrame(tfidf.transform(X_test).toarray(), columns=tfidf.get_feature_names_out())

X_train_tfidf

Unnamed: 0,10,20,about,actually,adj,adp,adv,after,again,ai,...,will,with,work,worth,would,years,yes,you,your,yourself
0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.558131,0.0,0.0,0.0,0.0,0.0,0.771616,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41987,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
41988,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
41989,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
41990,0.0,0.0,0.0,0.0,0.482403,0.0,0.0,0.0,0.0,0.0,...,0.000000,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0
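
The non-zero values in this table are TF-IDF weights. With `binary=True` every term that occurs in a row counts once, the weight is the term's smoothed inverse document frequency, and each row is scaled to unit length. The sketch below reproduces that arithmetic in plain Python. It is a simplified illustration: it splits on whitespace, whereas `TfidfVectorizer` uses its own token pattern and a vocabulary limit, so the numbers are not meant to match the table.

```python
import math
from collections import Counter


def binary_tfidf(docs):
    """TF-IDF with binary term counts, smoothed IDF, and L2 row normalization."""
    tokenized = [set(doc.lower().split()) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for terms in tokenized for term in terms)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for terms in tokenized:
        weights = {t: idf[t] for t in terms}  # binary term frequency is 1
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


docs = ["first game on NOUN", "game on A3 ADP", "on A3 brought NOUN"]
vectors = binary_tfidf(docs)
for vec in vectors:
    print({t: round(w, 3) for t, w in sorted(vec.items())})
```

Terms that occur in few rows, such as `first`, receive a larger weight than terms that occur in every row, such as `on`.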


### 2. No Pos Tag

In [18]:
tfidf_no_pos_tag = TfidfVectorizer(binary=True, use_idf = True, max_features=256)
tfidf_no_pos_tag = tfidf_no_pos_tag.fit(X_train_no_pos_tag)

# One column per vocabulary term (get_feature_names_out() requires scikit-learn >= 1.0)
X_train_tfidf_no_pos_tag = pd.DataFrame(tfidf_no_pos_tag.transform(X_train_no_pos_tag).toarray(), columns=tfidf_no_pos_tag.get_feature_names_out())
X_test_tfidf_no_pos_tag = pd.DataFrame(tfidf_no_pos_tag.transform(X_test_no_pos_tag).toarray(), columns=tfidf_no_pos_tag.get_feature_names_out())

X_train_tfidf_no_pos_tag

Unnamed: 0,10,20,able,about,actually,after,again,ai,all,almost,...,with,without,work,world,worth,would,years,yes,you,your
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.575575,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
41987,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
41988,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
41989,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
41990,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Classification

### 1.a. Logistic Regression With Pos Tag

In [19]:
lg = LogisticRegression(C=1000, solver='liblinear')

In [20]:
lg.fit(X_train_tfidf, y_train)

LogisticRegression(C=1000, solver='liblinear')

In [21]:
lg.score(X_test_tfidf, y_test)

0.8229947299714742
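
For a classifier, `score` returns the mean accuracy: the fraction of test rows whose predicted class equals the true class. The sketch below spells that out on invented labels and, as a suggested point of comparison that the notebook itself does not compute, adds the accuracy of always predicting the most frequent class.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


y_true = [False, False, True, False, True, False]
y_pred = [False, False, True, False, False, False]
print(accuracy(y_true, y_pred))    # 5 of 6 predictions are correct
print(majority_baseline(y_true))   # 4 of 6 labels are False
```

When most tokens are not aspects, a score is easiest to interpret next to this baseline.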

### 1.b. Logistic Regression Without Pos Tag

In [22]:
lg_no_pos_tag = LogisticRegression(C=1000, solver='liblinear')

In [23]:
lg_no_pos_tag.fit(X_train_tfidf_no_pos_tag, y_train_no_pos_tag)

LogisticRegression(C=1000, solver='liblinear')

In [24]:
lg_no_pos_tag.score(X_test_tfidf_no_pos_tag, y_test_no_pos_tag)

0.8233815210559396

### 2.a. SVM With Pos Tag

In [25]:
svc = SVC(C=1, kernel='linear')

In [26]:
svc.fit(X_train_tfidf, y_train)

SVC(C=1, kernel='linear')

In [27]:
svc.score(X_test_tfidf, y_test)

0.8208190301213557

### 2.b. SVM Without Pos Tag

In [28]:
svc_no_pos_tag = SVC(C=1, kernel='linear')

In [29]:
svc_no_pos_tag.fit(X_train_tfidf_no_pos_tag, y_train_no_pos_tag)

SVC(C=1, kernel='linear')

In [30]:
svc_no_pos_tag.score(X_test_tfidf_no_pos_tag, y_test_no_pos_tag)

0.8238650099115216

## Save Model

In [31]:
# The with-block closes the file once the model has been written
with open("./model/aspect_lg.pkl", "wb") as f:
    pickle.dump(lg, f)

In [32]:
with open("./model/aspect_lg_no_pos_tag.pkl", "wb") as f:
    pickle.dump(lg_no_pos_tag, f)

In [33]:
with open("./model/aspect_svc.pkl", "wb") as f:
    pickle.dump(svc, f)

In [34]:
with open("./model/aspect_svc_no_pos_tag.pkl", "wb") as f:
    pickle.dump(svc_no_pos_tag, f)
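
A saved model is restored with `pickle.load`. The sketch below shows the round trip with a plain dictionary standing in for a fitted model and a temporary directory standing in for `./model`; both are assumptions that keep the example self-contained.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; any picklable object behaves the same way.
model = {"name": "aspect_lg", "C": 1000}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "aspect_lg.pkl")
    with open(path, "wb") as f:   # the file is closed when the block exits
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model)
```

A pickled scikit-learn model should be loaded with the same library version that saved it, and pickle files from untrusted sources should not be loaded at all.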