## Suggestion Mining for In-Domain and Cross-Domain Datasets

### Subtask A: Test Data of Same Domain

#### We use the fastText library to classify the input text

In [0]:
import fasttext as ft
import pandas as pd

In [3]:
from google.colab import files
uploaded = files.upload()

Saving subtaskA_evaluation_data.csv to subtaskA_evaluation_data.csv
Saving subtaskA_trial_test.csv to subtaskA_trial_test.csv
Saving subtaskA_trial_test_labelled.csv to subtaskA_trial_test_labelled.csv
Saving v1.4_training.csv to v1.4_training (2).csv


In [4]:
import io
train_data = pd.read_csv(io.StringIO(uploaded['v1.4_training.csv'].decode('utf-8')))
train_data.head()

Unnamed: 0,id,sentence,label
0,663_3,"""Please enable removing language code from the...",1
1,663_4,"""Note: in your .csproj file, there is a Suppor...",0
2,664_1,"""Wich means the new version not fully replaced...",0
3,664_2,"""Some of my users will still receive the old x...",0
4,664_3,"""The store randomly gives the old xap or the n...",0


fastText reads its training data from a plain-text file with one example per line.
Each line must begin with the label, written as `__label__0` or `__label__1`, followed by the sentence.
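
As a minimal sketch of that line format, the helper below builds one training line. The name `to_fasttext_line` is hypothetical (it is not part of fastText or of this notebook), and the sketch assumes the `__label__` prefix used in the cells that follow. It also collapses line breaks inside a sentence, on the assumption that each example must stay on a single line.

```python
def to_fasttext_line(label, sentence, prefix="__label__"):
    """Return one training line of the form '<prefix><label> <sentence>'."""
    # Collapse runs of whitespace (including newlines) so the example
    # occupies exactly one line in the training file.
    text = " ".join(str(sentence).split())
    return f"{prefix}{label} {text}"


example = to_fasttext_line(1, "Please enable\nremoving language code")
print(example)  # __label__1 Please enable removing language code
```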

In [5]:
# Prefix each sentence with its fastText label, e.g. "__label__1 <sentence>".
train_data['sentence'] = '__label__' + train_data['label'].astype(str) + ' ' + train_data['sentence']
train_data.head()

Unnamed: 0,id,sentence,label
0,663_3,"__label__1 ""Please enable removing language co...",1
1,663_4,"__label__0 ""Note: in your .csproj file, there ...",0
2,664_1,"__label__0 ""Wich means the new version not ful...",0
3,664_2,"__label__0 ""Some of my users will still receiv...",0
4,664_3,"__label__0 ""The store randomly gives the old x...",0


In [0]:
# Write one labelled sentence per line; the file is closed automatically.
with open('train_data_indomain.txt', 'w') as fw:
    for sentence in train_data['sentence']:
        fw.write(sentence + '\n')

train_data_clf = ft.supervised('train_data_indomain.txt', 'model')

In [10]:
subtaskA_trial_test = pd.read_csv(io.StringIO(uploaded['subtaskA_trial_test.csv'].decode('utf-8')))
subtaskA_trial_test.head()

Unnamed: 0,id,sentence,label
0,13101,"""I'm not asking Microsoft to Gives permission ...",X
1,13121,"""somewhere between Android and iPhone.""",X
2,13131,"""And in the Windows Store you can flag the App...",X
3,13132,"""Many thanks Sameh Hi, As we know, there is a ...",X
4,13133,"""The idea is that we can develop a regular app...",X


In [11]:
# Predict a label for every trial sentence and store it in the 'label' column.
subtaskA_trial_test_texts = subtaskA_trial_test['sentence'].tolist()
pred = train_data_clf.predict(subtaskA_trial_test_texts)
subtaskA_trial_test_pred = [int(labels[0]) for labels in pred]

subtaskA_trial_test['label'] = subtaskA_trial_test_pred
subtaskA_trial_test.head()

Unnamed: 0,id,sentence,label
0,13101,"""I'm not asking Microsoft to Gives permission ...",0
1,13121,"""somewhere between Android and iPhone.""",0
2,13131,"""And in the Windows Store you can flag the App...",0
3,13132,"""Many thanks Sameh Hi, As we know, there is a ...",0
4,13133,"""The idea is that we can develop a regular app...",0


The table above shows the trial test set with the labels predicted by our classifier.
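
The prediction cell casts `pred[i][0]` directly to `int`, which relies on the classifier returning bare labels such as `'0'` or `'1'`. Other fastText bindings or versions may return labels with the `__label__` prefix still attached; that is an assumption to check against the installed version. The hypothetical helper `labels_to_ints` below is a sketch that handles both cases.

```python
def labels_to_ints(predicted, prefix="__label__"):
    """Convert per-sentence label lists such as [['1'], ['__label__0']] to [1, 0]."""
    result = []
    for labels in predicted:
        top = str(labels[0])        # first (top-ranked) label for this sentence
        if top.startswith(prefix):  # strip the prefix only when it is present
            top = top[len(prefix):]
        result.append(int(top))
    return result


print(labels_to_ints([["1"], ["__label__0"]]))  # [1, 0]
```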

In [12]:
subtaskA_trial_test_labelled = pd.read_csv(io.StringIO(uploaded['subtaskA_trial_test_labelled.csv'].decode('utf-8')))
subtaskA_trial_test_labelled.head()

Unnamed: 0,id,sentence,label
0,1310_1,I'm not asking Microsoft to Gives permission l...,1
1,1312_1,somewhere between Android and iPhone.,0
2,1313_1,And in the Windows Store you can flag the App ...,0
3,1313_2,"Many thanks Sameh Hi, As we know, there is a l...",0
4,1313_3,The idea is that we can develop a regular app ...,1


In [13]:
correct_preds = [index for index in range(subtaskA_trial_test.shape[0]) if subtaskA_trial_test.at[index, 'label'] == subtaskA_trial_test_labelled.at[index, 'label']]
len(correct_preds)

435

In [15]:
accuracy = len(correct_preds) / subtaskA_trial_test.shape[0]
accuracy

0.7347972972972973
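
The two cells above count matching labels and divide by the number of rows. A small helper can do the same and also report how the errors split between the two classes. This is a sketch rather than part of the original notebook: `evaluate` is a hypothetical name, and it assumes both inputs are equal-length, non-empty sequences of 0/1 integers.

```python
from collections import Counter


def evaluate(predicted, gold):
    """Return (accuracy, counts) where counts maps (gold, predicted) pairs to totals."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    counts = Counter(zip(gold, predicted))
    correct = counts[(0, 0)] + counts[(1, 1)]  # pairs where both labels agree
    return correct / len(gold), counts


acc, counts = evaluate([0, 0, 1, 0], [1, 0, 1, 0])
print(acc)             # 0.75
print(counts[(1, 0)])  # 1 suggestion that was predicted as a non-suggestion
```

With the DataFrames from this notebook, the inputs would be `subtaskA_trial_test['label'].tolist()` and `subtaskA_trial_test_labelled['label'].tolist()`.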

In [18]:
subtaskA_evaluation_data = pd.read_csv(io.StringIO(uploaded['subtaskA_evaluation_data.csv'].decode('utf-8')))
subtaskA_evaluation_data.head()

Unnamed: 0,id,sentence,label
0,9566,This would enable live traffic aware apps.,X
1,9569,Please try other formatting like bold italics ...,X
2,9576,Since computers were invented to save time I s...,X
3,9577,Allow rearranging if the user wants to change ...,X
4,9579,Add SIMD instructions for better use of ARM NE...,X


In [19]:
# Predict a label for every evaluation sentence and store it in the 'label' column.
subtaskA_evaluation_data_texts = subtaskA_evaluation_data['sentence'].tolist()
pred = train_data_clf.predict(subtaskA_evaluation_data_texts)
evaluation_pred = [int(labels[0]) for labels in pred]

subtaskA_evaluation_data['label'] = evaluation_pred
subtaskA_evaluation_data.head()

Unnamed: 0,id,sentence,label
0,9566,This would enable live traffic aware apps.,0
1,9569,Please try other formatting like bold italics ...,1
2,9576,Since computers were invented to save time I s...,0
3,9577,Allow rearranging if the user wants to change ...,1
4,9579,Add SIMD instructions for better use of ARM NE...,0


In [0]:
subtaskA_evaluation_data.to_csv('subtaskA_evaluation_data_labelled.csv', sep=',')
files.download('subtaskA_evaluation_data_labelled.csv')

### Subtask B: Test Data of a Different Domain

In [21]:
uploaded_B = files.upload()

Saving subtaskB_evaluation_data.csv to subtaskB_evaluation_data.csv
Saving subtaskB_trial_test.csv to subtaskB_trial_test.csv
Saving subtaskB_trial_test_labelled.csv to subtaskB_trial_test_labelled.csv


In [22]:
subtaskB_trial_test = pd.read_csv(io.StringIO(uploaded_B['subtaskB_trial_test.csv'].decode('utf-8')))
subtaskB_trial_test.head()

Unnamed: 0,id,sentence,label
0,1,"For a lovely breakfast, turn left out of the f...",X
1,2,If you catch them right your in a 4 star hotel...,X
2,3,and travelers should avoid it if they are look...,X
3,4,On thing I liked was just a block away was an ...,X
4,5,"Might be a good place for tourists, but try an...",X


In [23]:
# Reuse the classifier trained on Subtask A data to label the Subtask B trial sentences.
subtaskB_trial_test_texts = subtaskB_trial_test['sentence'].tolist()
pred = train_data_clf.predict(subtaskB_trial_test_texts)
subtaskB_trial_test_pred = [int(labels[0]) for labels in pred]

subtaskB_trial_test['label'] = subtaskB_trial_test_pred
subtaskB_trial_test.head()

Unnamed: 0,id,sentence,label
0,1,"For a lovely breakfast, turn left out of the f...",0
1,2,If you catch them right your in a 4 star hotel...,0
2,3,and travelers should avoid it if they are look...,1
3,4,On thing I liked was just a block away was an ...,0
4,5,"Might be a good place for tourists, but try an...",0


In [24]:
subtaskB_trial_test_labelled = pd.read_csv(io.StringIO(uploaded_B['subtaskB_trial_test_labelled.csv'].decode('utf-8')))
subtaskB_trial_test_labelled.head()

Unnamed: 0,id,sentence,label
0,1,"For a lovely breakfast, turn left out of the f...",1
1,2,If you catch them right your in a 4 star hotel...,1
2,3,and travelers should avoid it if they are look...,1
3,4,On thing I liked was just a block away was an ...,1
4,5,"Might be a good place for tourists, but try an...",1


In [25]:
correct_preds = [index for index in range(subtaskB_trial_test.shape[0]) if subtaskB_trial_test.at[index, 'label'] == subtaskB_trial_test_labelled.at[index, 'label']]
len(correct_preds)

434

In [26]:
accuracy = len(correct_preds) / subtaskB_trial_test.shape[0]
accuracy

0.5371287128712872

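Accuracy drops from about 73.5% on the in-domain trial set to about 53.7% on this cross-domain set, because the classifier was trained only on the Subtask A training data. If labelled training data for several domains is available, combining it to train a single classifier would likely give better cross-domain results.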
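
One way to prepare such combined training data is sketched below. The function name `write_combined_training_file`, the output file name, and the example rows are all hypothetical; this notebook does not load any labelled training data for a second domain, so the rows are made up purely for illustration.

```python
import os
import tempfile


def write_combined_training_file(path, *domains):
    """Write (sentence, label) pairs from several domains to one fastText file."""
    count = 0
    with open(path, "w", encoding="utf-8") as fw:
        for domain in domains:
            for sentence, label in domain:
                # Keep each example on a single line, prefixed with its label.
                text = " ".join(str(sentence).split())
                fw.write(f"__label__{label} {text}\n")
                count += 1
    return count


# Made-up rows for illustration only; real rows would come from labelled CSVs.
software_rows = [("Please add a dark mode.", 1), ("The app crashed twice.", 0)]
hotel_rows = [("Try the cafe around the corner.", 1)]

out_path = os.path.join(tempfile.gettempdir(), "train_data_combined.txt")
n_written = write_combined_training_file(out_path, software_rows, hotel_rows)
print(n_written)  # 3
```

The resulting file could then be passed to the same training call used for `train_data_indomain.txt`.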

In [27]:
subtaskB_evaluation_data = pd.read_csv(io.StringIO(uploaded_B['subtaskB_evaluation_data.csv'].decode('utf-8')))
subtaskB_evaluation_data.head()

Unnamed: 0,id,sentence,label
0,0,This hotel was very modern and sleek.,X
1,1,"Beautiful, well-laid out, albeiit small rooms.",X
2,2,Fantastic breakfast with an incredible selecti...,X
3,3,the staff were uber-helpful.,X
4,4,Great location in front of a u-bahn stop.,X


In [28]:
# Predict a label for every Subtask B evaluation sentence and store it in the 'label' column.
subtaskB_evaluation_data_texts = subtaskB_evaluation_data['sentence'].tolist()
pred = train_data_clf.predict(subtaskB_evaluation_data_texts)
evaluation_pred = [int(labels[0]) for labels in pred]

subtaskB_evaluation_data['label'] = evaluation_pred
subtaskB_evaluation_data.head()

Unnamed: 0,id,sentence,label
0,0,This hotel was very modern and sleek.,0
1,1,"Beautiful, well-laid out, albeiit small rooms.",0
2,2,Fantastic breakfast with an incredible selecti...,0
3,3,the staff were uber-helpful.,0
4,4,Great location in front of a u-bahn stop.,0


In [0]:
subtaskB_evaluation_data.to_csv('subtaskB_evaluation_data_labelled.csv', sep=',')
files.download('subtaskB_evaluation_data_labelled.csv')